BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a Hospital Use Case


DESCRIPTION

High-level use case description of one department of a hospital, and a comparison of two solutions: 1) a big data solution using Cloudera Impala; and 2) a traditional RDBMS solution using Oracle DB.

TRANSCRIPT

1 / 18

EndoMine System
Jewish General Hospital

by David Lauzon and Anton Zakharov
Big Data Montreal #9
February 5th, 2013

2 / 18

Presentation

• Our Objectives
• Requirements and context
• Project scope
• Hadoop Solution
  – Big Data Solution Overview
  – Hive Table Schema
  – Compression Performance
  – Data Architecture in Hadoop
  – Hadoop/Impala Prototype Demo
• Oracle Solution
• Hadoop vs Oracle comparison
• What are expensive queries?

3 / 18

Our Objectives

• Lead an end-of-study project in an industrial context
  – Requirements elicitation
  – Implement a "proof-of-concept" prototype
• Experiment with big data technologies
  – Compare with RDBMS

4 / 18

Requirements and context

• Department of Medical Diagnostic (medical test results DB, e.g. blood, urine, ...)
  – Dr. Shaun Eintracht
    • "ad hoc" queries
    • ETL queries
  – Dr. Elizabeth Mac Namara
    • "business intelligence" requirements
    • Realtime dashboard
• Department of Endocrinology
  – Dr. Mark Trifiro
    • Data mining

5 / 18

Project scope

• First iteration = improve ad-hoc queries
  – Slow analytical queries and ETL (MS Access)
  – Risk of "crashing" the production DB
  – Some queries impossible to process

6 / 18

Production DB (Oracle)

7 / 18

Solutions

• Solution 1: Hadoop + Impala

• Solution 2: Tune the existing Oracle RDBMS

8 / 18

Big Data Solution Overview

9 / 18

Hive Table Schema
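The schema itself appears only as an image in the original slides. As a rough, hypothetical sketch of what one of the pre-joined Hive tables named on the Data Architecture slide might look like (column names and types are assumptions, not taken from the deck):

    -- Hypothetical HiveQL sketch; column names and types are illustrative
    CREATE TABLE stay_order_results_yearmonth (
      stay_id       BIGINT,
      order_id      BIGINT,
      test_code     STRING,
      result_value  STRING,
      result_date   TIMESTAMP
    )
    PARTITIONED BY (yearmonth STRING)   -- e.g. '2013-01'
    STORED AS SEQUENCEFILE;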

10 / 18

Compression Performance

[Chart: compression performance compared across storage formats (Oracle FS, text file, SequenceFile, SequenceFile + Gzip, SequenceFile + Snappy) for Impala, Hive, and Oracle; vertical axis 0-250.]
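As a hedged sketch of how the SequenceFile + Snappy variant could have been produced in Hive (the session settings are standard Hive/Hadoop options of that era; the source table name results_text is illustrative):

    -- Write a Snappy-compressed, block-level SequenceFile copy of a table in Hive
    SET hive.exec.compress.output = true;
    SET mapred.output.compression.type = BLOCK;
    SET mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;

    CREATE TABLE results_seq_snappy
    STORED AS SEQUENCEFILE
    AS SELECT * FROM results_text;

Swapping in org.apache.hadoop.io.compress.GzipCodec would produce the SequenceFile + Gzip variant instead.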

11 / 18

Data Architecture in Hadoop

• All big tables are pre-joined
  – With specimen (1)
  – Without specimen (2)
• Partitioned using two schemes
  – Year-month (3)
  – Year and test (4)
• 4 different versions of the same data (see the sketch below):
  – stay_order_results_yearmonth
  – stay_order_results_year_and_test
  – stay_order_results_specimen_yearmonth
  – stay_order_results_specimen_year_and_test
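A minimal sketch of the "year and test" partitioning scheme and the kind of pre-joining ETL it implies, assuming hypothetical source tables (results, orders, stays) and illustrative column names:

    -- Variant (4): partitioned by year and test (columns are illustrative)
    CREATE TABLE stay_order_results_year_and_test (
      stay_id       BIGINT,
      order_id      BIGINT,
      result_value  STRING,
      result_date   TIMESTAMP
    )
    PARTITIONED BY (`year` INT, `test` STRING)
    STORED AS SEQUENCEFILE;

    -- Pre-join the source tables once, letting Hive create partitions dynamically
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    INSERT OVERWRITE TABLE stay_order_results_year_and_test
    PARTITION (`year`, `test`)
    SELECT s.stay_id,
           o.order_id,
           r.result_value,
           r.result_date,
           year(r.result_date) AS `year`,
           r.test_code         AS `test`
    FROM   results r
    JOIN   orders  o ON r.order_id = o.order_id
    JOIN   stays   s ON o.stay_id  = s.stay_id;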

12 / 18

Hadoop Prototype Demo

13 / 18

Oracle Solution

• Same tables as the source DB
  – A big pre-joined table is not a good solution
• Techniques explored (see the sketch below):
  – Partitioning
    • Partitions automatically created
  – Compression
    • Inefficient for joins
  – Clustering
  – Joining multiple partitioned tables
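"Partitions automatically created" most likely refers to Oracle interval partitioning; a minimal sketch combining it with basic table compression, using table and column names that are illustrative rather than taken from the original system:

    -- Oracle 11g: range partitioning with monthly intervals created automatically,
    -- plus basic table compression (effective mainly for bulk/direct-path loads)
    CREATE TABLE results (
      result_id     NUMBER,
      stay_id       NUMBER,
      test_code     VARCHAR2(20),
      result_value  VARCHAR2(200),
      result_date   DATE
    )
    COMPRESS
    PARTITION BY RANGE (result_date)
    INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
    ( PARTITION p_initial VALUES LESS THAN (DATE '2012-01-01') );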

14 / 18

Oracle Solution (continued)

• Avoid too many indexes on the big tables:
  – Takes a lot of memory
  – Slow to create
  – May not be used if the query selects more than 5% of the rows (see the sketch below)
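A hedged way to check that last point in practice, with illustrative object names:

    -- Create a candidate index, then inspect the plan Oracle actually chooses
    CREATE INDEX ix_results_test_code ON results (test_code);

    EXPLAIN PLAN FOR
      SELECT * FROM results WHERE test_code = 'GLUCOSE';

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
    -- When the predicate matches a large share of the rows (roughly above the
    -- 5% mentioned here), the optimizer typically prefers a full table scan
    -- and the index goes unused.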

15 / 18

Comparison: Hadoop Solution

• Pros
  – Crunches massive amounts of data
  – Scalability
  – Free software
• Cons
  – Needs a better UI and tune-ups
  – Maintenance cost
  – Requires ETL time to merge data into one table
  – Big joins should be avoided

16 / 18

Comparison: Oracle Solution

• Pros
  – Just need to create a slave DB (just?)
  – Faster random lookups
  – Easier to find expertise
• Cons
  – Scalability only up to a certain point
  – Synchronisation with the master DB:
    • Rebuilding indexes would take hours

17 / 18

What are expensive queries?

• If possible, avoid these constructs on large result sets (see the sketch below):
  – SELECT DISTINCT
  – ORDER BY
  – GROUP BY
  – JOIN of a big table with another big table
    • JOIN of a big table with multiple small tables should be OK
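A hedged illustration of the two join patterns, reusing the hypothetical table and column names from the earlier Hive sketches (test_catalog is an assumed small lookup table):

    -- Risky: big table joined to another big table, then de-duplicated and sorted
    SELECT DISTINCT r.stay_id, r.test_code
    FROM   stay_order_results_yearmonth          r
    JOIN   stay_order_results_specimen_yearmonth s
           ON r.stay_id = s.stay_id AND r.order_id = s.order_id
    ORDER BY r.stay_id;

    -- Usually OK: big table joined to a small lookup table, with a selective
    -- partition filter keeping the result set small
    SELECT r.stay_id, t.test_name, r.result_value, r.result_date
    FROM   stay_order_results_yearmonth r
    JOIN   test_catalog t ON r.test_code = t.test_code
    WHERE  r.yearmonth = '2013-01';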

18 / 18

Conclusion

• Recommendation: use a "classic" RDBMS
  – The database fits on a single node
  – Existing in-house expertise
  – Acceptable performance with appropriate tune-ups
  – Stop using MS Access
• Disadvantage: limited scalability
