Veracity Think Bigdata #2, 6.7.2015
TRANSCRIPT
DWH OVER HADOOP
THE BASICS
COLUMNAR FORMATS (ORC/PARQUET)

- Projection push down
- Predicate push down
- Excellent compression ratios
- Column indices
- Max/Avg/Min values
- Rows must be batched to benefit from these optimizations
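As an illustration, a Hive table stored as ORC opts into these batch-level optimizations at creation time. This is a sketch only — the table name, columns, and property values below are made up:

```sql
-- Hypothetical example: an ORC-backed Hive table.
-- Rows are grouped into stripes; each stripe carries min/max
-- statistics and column indices that enable predicate push down.
CREATE TABLE page_views (
  view_time  TIMESTAMP,
  user_id    BIGINT,
  url        STRING
)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"    = "SNAPPY",    -- compression codec
  "orc.stripe.size" = "67108864"   -- 64 MB stripes (row batches)
);
```

A query filtering on `user_id` can then skip entire stripes whose min/max range excludes the predicate, and read only the projected columns.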
PARQUET
- Strongly endorsed by Cloudera
- One of the few formats Impala supports (and the most optimal for it)
- Also supported by Hive, Spark, Tajo, Drill & Presto
- Speaking from my own personal experience, a bit more expensive to generate
ORC
- Endorsed by Hortonworks
- Most optimal for Presto
- Spark support was recently introduced
QUERYING ENGINES
HIVE
- Hive provides a SQL-like interface for accessing the data (files), called HiveQL
- The HQL is translated into M/R code and executed immediately
- Batch oriented
- Fault tolerant and thus reliable
- Not a DB! Does not support updates & deletes and has no transactions (or does it?)
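A minimal HiveQL example (the table and columns are hypothetical) — Hive compiles such a statement into a Map/Reduce job rather than executing it against an index:

```sql
-- Hypothetical HiveQL query; Hive translates it into M/R:
-- the map phase emits (country, 1), the reduce phase sums.
SELECT country, COUNT(*) AS views
FROM page_views
GROUP BY country;
```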
LOW LATENCY SQL
Map-Reduce can be compared to a tractor: it's very strong and can plow a field better than any other vehicle, but it's also very slow.

As prices of memory dropped, a demand emerged to better utilize it for faster response times.
CLOUDERA IMPALA

- Written in C++
- Utilizes Hive's metadata
- Very fast
- Not fault tolerant
- Doesn't support custom data formats
- Doesn't support complex data types (maps/arrays/structs)
- A bit complicated to set up on non-CDH distributions
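Because Impala reads the Hive metastore rather than maintaining its own catalog, tables created or loaded through Hive have to be made visible to it. A small sketch (the table name is hypothetical):

```sql
-- After creating/loading a table through Hive, tell Impala to
-- reload its cached view of the Hive metastore:
INVALIDATE METADATA page_views;

-- The same SQL then runs through Impala's C++ engine,
-- bypassing M/R entirely:
SELECT country, COUNT(*) FROM page_views GROUP BY country;
```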
FACEBOOK PRESTO

- Can connect to: Cassandra, Hive, JMX sources, Postgres & MySQL
- Allows cross-engine joins
- Used at Facebook to serve online dashboards
- Easy to set up
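A cross-engine join in Presto addresses two connectors in one statement. In this sketch, `hive` and `mysql` are catalog names defined in Presto's `etc/catalog/*.properties` files, and the schemas, tables, and columns are made up:

```sql
-- Hypothetical Presto query joining a Hive table with a MySQL
-- table; each catalog maps to a configured connector.
SELECT c.name, COUNT(*) AS orders
FROM hive.default.orders o
JOIN mysql.shop.customers c
  ON o.customer_id = c.id
GROUP BY c.name;
```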
SPARK SQL

- Not affiliated with any Hadoop vendor
- Supports all of the optimized file formats (ORC/Parquet/Avro)
- Can auto-discover schema
- Aims to provide second/sub-second latency
- Still not very mature
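Schema auto-discovery can be seen with Spark SQL's data-source syntax of this era: pointing it at a Parquet directory is enough, since the schema is read from the Parquet file footers. Path and names below are hypothetical:

```sql
-- Hypothetical: expose a Parquet directory to Spark SQL; the
-- schema is discovered automatically, no column list needed.
CREATE TEMPORARY TABLE events
USING org.apache.spark.sql.parquet
OPTIONS (path "/data/events");

SELECT event_type, COUNT(*) FROM events GROUP BY event_type;
```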
THE USUAL DATA FLOW
Collect -> Store -> Convert -> Select
- The data latency conflict: lots of fragmented small files, or big optimized files with big latency
- Processing efforts involved in the conversion process should be minimized
- Example..
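The "Convert" step in this flow is typically a periodic Hive job that rewrites the many small collected files into an optimized columnar table. A sketch with hypothetical table names:

```sql
-- Hypothetical conversion job: rewrite raw collected text files
-- (many small fragments) into a pre-existing optimized ORC
-- table, trading latency for scan performance.
INSERT OVERWRITE TABLE events_orc
SELECT * FROM events_raw_text;
```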
A BETTER DATA FLOW
Collec-tor-vert -> Select
- Convert the data as it is being collected, where possible
- Or convert the data as it is being stored (streaming), but without losing optimizations
- How can this be achieved?
SQOOP

- Imports data from RDBMS into Hadoop
- Creates Java classes and Hive tables on import
- Exports data back to RDBMS
- Runs a "Map Only" job to perform the task
- Supports incremental imports
- Now supports importing right away as Parquet
HIVE & ACID
Recently a conceptual change has been introduced into Hive: CRUD with ACID transactions. It is not meant to replace your OLTP, but rather to supply a better data modification mechanism for a subset of the data.

- Explanation on how it works
- Demo simple insert
- Still requires M/R :(
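A sketch of what the "simple insert" demo might look like (Hive 0.14-era syntax; table, columns, and values are made up). ACID tables must be stored as ORC, bucketed, and flagged transactional:

```sql
-- Hypothetical ACID table: ORC + bucketed + transactional.
CREATE TABLE users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional" = "true");

INSERT INTO TABLE users VALUES (1, 'alice');
UPDATE users SET name = 'bob' WHERE id = 1;
DELETE FROM users WHERE id = 1;  -- each statement still runs as M/R
```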
HIVE & STREAMING INGEST
- With the new ACID capabilities it is now possible to continuously insert data into Hive
- Data appears almost immediately
- Data is optimized in a columnar format
- Data is compacted by different triggers
- Code snippet
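Streaming clients write small ORC delta files into a transactional table; compaction then merges them so queries stay fast. Besides the background compactor's triggers, compaction can be requested manually — a sketch, assuming a transactional table named `users`:

```sql
-- Deltas written by streaming clients accumulate as small ORC
-- delta files; compaction merges them back together.
ALTER TABLE users COMPACT 'minor';  -- merge delta files
ALTER TABLE users COMPACT 'major';  -- rewrite base + deltas
SHOW COMPACTIONS;                   -- inspect the compaction queue
```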
FLUME

- Distributed
- Durable
- Scalable
- Fault tolerant
- Serves for ingestion and basic pre-processing of the data
- Composed of source -> channel -> sink (draw architecture)
- Utilizes Hive's ACID capabilities to instantly stream data into Hive - demo