the holy grail of data analytics
![Page 1: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/1.jpg)
THE HOLY GRAIL OF DATA ANALYTICS
Dan Lynn, CEO
![Page 2: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/2.jpg)
• Data Services • Data Strategy • Data Integration / BI / Analytics • Modernize Data Infrastructures • Custom Applications & APIs
• Distributed over 6 states! • Fully-virtualized staff
www.agildata.com
Dan Lynn, CEO
• Co-Founder @ FullContact • 15 years building data systems • Techstars [email protected]
![Page 3: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/3.jpg)
www.agildata.com
All product names, logos, and brands are property of their respective owners. All company, product and service names used are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.
Free MySQL Performance Analyzer
www.agildata.com/gibbs
AgilData Scalable Cluster
![Page 4: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/4.jpg)
![Page 5: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/5.jpg)
TRADE-OFFS
![Page 6: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/6.jpg)
OLTP vs OLAP
![Page 7: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/7.jpg)
OLTP OVERVIEW
• “Online Transaction Processing”
• Database is optimized for low latency access to current data
• Short transactions (INSERT, UPDATE, DELETE)
• High concurrency
• Examples:
• Add item to shopping cart
• Reset password
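A minimal sketch of the OLTP pattern above, using Python's built-in sqlite3 module. The cart_item table, its columns, and the add_to_cart helper are hypothetical names chosen for illustration; they do not come from the talk.

```python
import sqlite3

# In-memory database; the schema is an assumption made for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cart_item ("
    " cart_id INTEGER, sku TEXT, qty INTEGER,"
    " PRIMARY KEY (cart_id, sku))"
)

def add_to_cart(conn, cart_id, sku, qty=1):
    """One short transaction that touches a single row of current data."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE cart_item SET qty = qty + ? WHERE cart_id = ? AND sku = ?",
            (qty, cart_id, sku),
        )
        if cur.rowcount == 0:  # item not in the cart yet
            conn.execute(
                "INSERT INTO cart_item (cart_id, sku, qty) VALUES (?, ?, ?)",
                (cart_id, sku, qty),
            )

add_to_cart(conn, 1, "widget")      # INSERT path
add_to_cart(conn, 1, "widget", 2)   # UPDATE path
row = conn.execute(
    "SELECT qty FROM cart_item WHERE cart_id = 1 AND sku = 'widget'"
).fetchone()
print(row[0])
```

Each call reads and writes one row by primary key, which is the access pattern a row-oriented OLTP database is built to serve with low latency and high concurrency.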
![Page 8: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/8.jpg)
OLAP OVERVIEW
• “Online Analytical Processing”
• Database is optimized for aggregation of historical data
• Aggregations can span millions or billions of records
• Low(er) concurrency
• Examples:
• What is our average shopping cart size, grouped by week and by affiliate?
• What are the top 5 paths that users take when navigating our website?
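The first example question can be sketched as a single aggregation query. This uses Python's built-in sqlite3 module purely for illustration; the cart table, its columns, and the sample rows are assumptions, and a real OLAP system would run the same shape of query over far more data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per shopping cart.
conn.execute(
    "CREATE TABLE cart ("
    " id INTEGER PRIMARY KEY, affiliate TEXT, created TEXT, item_count INTEGER)"
)
conn.executemany(
    "INSERT INTO cart (affiliate, created, item_count) VALUES (?, ?, ?)",
    [
        ("a", "2016-06-06", 2),
        ("a", "2016-06-07", 4),
        ("b", "2016-06-07", 1),
        ("a", "2016-06-14", 5),
    ],
)

# Average cart size, grouped by week and by affiliate.
# strftime('%Y-%W', ...) buckets dates into year-week strings.
rows = conn.execute(
    "SELECT strftime('%Y-%W', created) AS week, affiliate, AVG(item_count)"
    " FROM cart"
    " GROUP BY week, affiliate"
    " ORDER BY week, affiliate"
).fetchall()

for week, affiliate, avg_size in rows:
    print(week, affiliate, avg_size)
```

Unlike the OLTP examples, this query reads every row in the table but only three of its columns.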
![Page 9: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/9.jpg)
HOW DATABASES OPTIMIZE FOR OLTP
• Optimized for reading or updating an entire row (e.g. the full customer record)
• Data is written to and read from disk on a row-by-row basis.
• Indexes are used to construct a full business object from multiple tables via JOINs (e.g. SELECT * FROM order o JOIN customer c ON c.id = o.customer_id)
• Hadoop and NoSQL systems generally behave the same.
• Scan performance is limited
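A toy model of a row-oriented layout, to make the trade-off above concrete. It is only a sketch: the customer records, the index dictionary, and the function names are hypothetical, and a real database stores rows in disk pages, not Python lists.

```python
# Row-oriented layout: every field of a record is stored together.
rows = [
    {"id": 1, "name": "Ada", "city": "Denver", "balance": 120.0},
    {"id": 2, "name": "Grace", "city": "Boulder", "balance": 80.5},
    {"id": 3, "name": "Edsger", "city": "Austin", "balance": 0.0},
]

# An index maps a key to the position of its row.
index = {r["id"]: pos for pos, r in enumerate(rows)}

def get_customer(customer_id):
    """Fetch the full record with one index lookup and one row read."""
    return rows[index[customer_id]]

def total_balance():
    """A scan walks every full record even though it needs one field."""
    return sum(r["balance"] for r in rows)

print(get_customer(2))
print(total_balance())
```

Reading or updating one whole record is cheap here, while the scan pays for loading name and city on every row just to reach balance.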
![Page 10: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/10.jpg)
HOW DATABASES OPTIMIZE FOR OLAP
• Optimized for aggregating columns (e.g. SELECT AVG(unit_price * qty) FROM order_line GROUP BY c.id)
• Data is laid out on disk on a per-column basis.
• Great for scans, not so good for random row-level access
• Random UPDATEs are typically unsupported or very expensive
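A toy model of a column-oriented layout for comparison. The order_line columns and sample values are made up for illustration, and real columnar engines add compression and encoding that this sketch ignores.

```python
# Column-oriented layout: each column is stored as its own array.
columns = {
    "order_id": [1, 1, 2, 3],
    "unit_price": [2.5, 10.0, 4.0, 1.0],
    "qty": [4, 1, 3, 10],
}

def avg_line_total(cols):
    """The aggregate reads only the two columns it needs."""
    prices, qtys = cols["unit_price"], cols["qty"]
    return sum(p * q for p, q in zip(prices, qtys)) / len(prices)

def read_row(cols, i):
    """Rebuilding one row means one read per column."""
    return {name: values[i] for name, values in cols.items()}

def update_row(cols, i, new_values):
    """Changing one row means one write per affected column array."""
    for name, value in new_values.items():
        cols[name][i] = value

print(avg_line_total(columns))
print(read_row(columns, 2))
```

The scan never touches order_id, which is why this layout is fast for aggregation, while read_row and update_row have to visit every column array separately.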
![Page 11: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/11.jpg)
HOW HADOOP OPTIMIZES FOR OLAP
• Data is partitioned in HDFS into append-only blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x).
• These blocks are spread out across the cluster.
• Processing (i.e. queries) is sent to the data, instead of bringing the data to the application for processing.
• Columnar data formats like Parquet can be stored on HDFS for very fast scan performance.
• Updates are very expensive.
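The idea of sending the processing to the data can be sketched in plain Python. This is not Hadoop code: the node names, the tiny BLOCK_SIZE, and all function names are assumptions made for the example, and the round-robin placement stands in for real HDFS block placement.

```python
from collections import defaultdict

BLOCK_SIZE = 3  # records per block; a stand-in for a multi-megabyte HDFS block

def split_into_blocks(records, block_size=BLOCK_SIZE):
    """Partition the data set into fixed-size blocks."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]

def place_blocks(blocks, nodes):
    """Spread blocks across the cluster (round-robin for simplicity)."""
    placement = defaultdict(list)
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

def local_partial(blocks):
    """Runs on the node that holds the blocks; returns a small summary."""
    total = sum(v for block in blocks for v in block)
    count = sum(len(block) for block in blocks)
    return total, count

def cluster_average(placement):
    """Only the (sum, count) pairs travel over the network, not the records."""
    partials = [local_partial(blocks) for blocks in placement.values()]
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

records = list(range(1, 11))
placement = place_blocks(split_into_blocks(records), ["node-a", "node-b"])
avg = cluster_average(placement)
print(avg)
```

Because blocks are append-only, changing a single record in this model would mean rewriting the block that contains it, which is the reason updates are so costly.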
![Page 12: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/12.jpg)
DATABASE: Scan Performance VS Updatability
![Page 13: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/13.jpg)
THE LAMBDA ARCHITECTURE
![Page 14: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/14.jpg)
THE LAMBDA ARCHITECTURE
• Data Stream (Kafka, etc…)
• Write to HDFS → Batch Computation (MapReduce, Spark) → Batch Views
• Speed Layer (Storm, Spark Streaming, Flink, etc…) → Real-time views
• Serving Layer (HBase, MySQL, PostgreSQL, etc…)
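A minimal sketch of how the layers in the Lambda Architecture fit together, using page-view counts as the metric. Everything here is illustrative: the event shape, the SpeedLayer class, and the serve function are hypothetical, and in-memory Counters stand in for the real batch, streaming, and serving systems.

```python
from collections import Counter

def batch_view(events):
    """Batch layer: periodically recompute the view from all stored events."""
    return Counter(e["page"] for e in events)

class SpeedLayer:
    """Speed layer: incrementally covers events the last batch run missed."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, event):
        self.view[event["page"]] += 1

    def reset(self):
        """Called once a new batch view has absorbed these events."""
        self.view.clear()

def serve(page, batch, realtime):
    """Serving layer: answer queries by merging both views."""
    return batch[page] + realtime[page]

# Events already written to long-term storage and seen by the batch job.
historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
batch = batch_view(historical)

# An event that arrived after the batch job ran.
speed = SpeedLayer()
speed.on_event({"page": "/home"})

print(serve("/home", batch, speed.view))
print(serve("/checkout", batch, speed.view))
```

The cost of this design is that the same computation lives in two code paths (batch_view and SpeedLayer.on_event) that must be kept in agreement.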
![Page 15: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/15.jpg)
APACHE KUDU
• Apache Project (incubating)
• Started at Cloudera, growing industry adoption.
• Currently v0.9.1
• 1.0 release likely coming out in September 2016
![Page 16: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/16.jpg)
Source: http://www.slideshare.net/cloudera/kudu-new-hadoop-storage-for-fast-analytics-on-fast-data
![Page 17: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/17.jpg)
APACHE KUDU USE CASES
• Online Reporting
• Examples: Operational Data Store, Customer-facing analytics, real-time dashboards
• Workload: Inserts, updates, scans, random lookups
• Time Series
• Examples: Market analytics, fraud detection, risk monitoring, message queueing
• Workload: Inserts, updates, scans, random lookups
• Machine Data Analysis
• Examples: Network threat detection, devops monitoring and alerting
• Workload: Inserts, scans, random lookups
![Page 18: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/18.jpg)
THE ROAD AHEAD
![Page 19: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/19.jpg)
THE ROAD AHEAD
• Reactive processing
• Dynamic / intelligent indexing
• High performance mutable message queueing
![Page 20: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/20.jpg)
LINKS
• Kudu project website: http://kudu.apache.org/
• Details about OLTP vs OLAP workloads: http://datawarehouse4u.info/OLTP-vs-OLAP.html
• Analyst perspective on Kudu: http://www.dbms2.com/2015/09/28/introduction-to-cloudera-kudu/
![Page 22: The Holy Grail of Data Analytics](https://reader036.vdocuments.mx/reader036/viewer/2022070513/58880ed01a28ab083c8b49a9/html5/thumbnails/22.jpg)
CREDITS
• Grail image: https://upload.wikimedia.org/wikipedia/commons/1/10/London-Victoria_and_Albert_Museum-Grail-02.jpg
• Balanced scales: https://commons.wikimedia.org/wiki/File:Balanced_scale_of_Justice.svg