Qubole @ AWS Meetup Bangalore - July 2015
TRANSCRIPT
![Page 1: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/1.jpg)
Big Data as a Service
Joydeep Sen Sarma, Hariharan Iyer, Shubham Tagra
![Page 2: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/2.jpg)
Introduction
• Introduction to Qubole
• How Qubole integrates with AWS:
– EC2
– S3
– RedShift
– Kinesis
• Hive vs. Presto vs. RedShift vs. Spark
• Qubole vs. EMR
2
Agenda
![Page 3: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/3.jpg)
Introduction
• Founded ~ 10/2011
• Team:
– Founding crew: initial authors of Apache Hive, ran Data @Facebook
– + Notable alumni from Greenplum/Vertica/EngineYard/Oracle/AWS etc.
– + 50 engineers + 20 sales/marketing across Bangalore/Palo Alto
• Financing:
– Completed Series B 10/2014
– Lightspeed, Charles River, Norwest, Anand/Venky
• Product: Qubole Data Service
– Big Data as a Service
– AWS/GCE/Azure
3
About Qubole
![Page 4: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/4.jpg)
Introduction
4
Customers
![Page 5: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/5.jpg)
Introduction
Qubole Data Service
![Page 6: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/6.jpg)
6
Introduction
Qubole Data Service
![Page 7: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/7.jpg)
Introduction
• Self-Serve Big Data Analytics:
– Lack of Hadoop-trained IT/engineers
– Team of Analysts
• Lowest TCO
– Cloud Optimized - takes full advantage of AWS
• Unified Platform for all Tools:
– Hive/Pig/Spark/SQL/Map-Reduce/Cascading/…
– Pick and Choose. Combine and Use
• Awesome Support and Solutions
7
Why Qubole
![Page 8: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/8.jpg)
Self Service Analytics: Direct Access to Big Data
8
![Page 9: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/9.jpg)
Self Service Analytics: Manage Clusters Easily
9
![Page 10: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/10.jpg)
Self Service Analytics: Schedule Jobs
10
![Page 11: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/11.jpg)
Self Service Analytics: Self Serve Dashboards using Notebooks
11
![Page 12: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/12.jpg)
Qubole and EC2
• Introduction to Qubole
• How Qubole integrates with AWS:
– EC2
– S3
– RedShift
– Kinesis
• Hive vs. Presto vs. RedShift vs. Spark
• Qubole vs. EMR
12
Agenda
![Page 13: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/13.jpg)
Qubole and EC2
• Custom AMIs for much faster boot-up
• Auto-termination
• Auto-scaling
• Spot Instances
• EBS
13
EC2 Magic Sauce
![Page 14: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/14.jpg)
• Auto-start and termination
– Cluster starts automatically when you need to run a command
• Intelligent - no cluster required for metadata commands
– Terminated after a couple of hours of idle time
• Auto-scaling
– Min Size <= Cluster <= Max Size
14
Cluster LifeCycle Management
Qubole and EC2
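The lifecycle rules on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not Qubole's implementation: `ClusterPolicy`, `target_size`, `should_terminate`, `needs_cluster` and the 2-hour idle timeout are all hypothetical names and values chosen to mirror the bullets above.

```python
from dataclasses import dataclass


@dataclass
class ClusterPolicy:
    """Hypothetical policy object; field names are assumptions."""
    min_size: int                    # lower bound on node count
    max_size: int                    # upper bound on node count
    idle_timeout_hours: float = 2.0  # assumed value for "a couple of hours"


def target_size(policy: ClusterPolicy, desired_nodes: int) -> int:
    """Clamp the desired node count so that min_size <= cluster <= max_size."""
    return max(policy.min_size, min(desired_nodes, policy.max_size))


def should_terminate(policy: ClusterPolicy, idle_hours: float) -> bool:
    """Terminate once the cluster has been idle for the configured timeout."""
    return idle_hours >= policy.idle_timeout_hours


def needs_cluster(command_kind: str) -> bool:
    """Metadata-only commands are served without starting a cluster."""
    return command_kind != "metadata"
```

With `ClusterPolicy(min_size=400, max_size=800)`, a request for 1000 nodes is capped at 800 and a request for 100 nodes is raised to 400.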
![Page 15: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/15.jpg)
15
[Diagram: five slave nodes, each with map slots and reduce slots, running the query SELECT * FROM FOO JOIN BAR ON BAZ = ...]
Auto-Scaling
![Page 16: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/16.jpg)
• Upscaling
– Engine-specific algorithms
– Cannot just look at expected time (parallelism matters)
• Downscaling
– Decommissioning takes time
– Need to consider hour boundary
– Stuck on mapper output
• Output offloading
• AWS Integration
– Hour boundary
– Eventual consistency
16
Why is it hard?
Qubole and EC2
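The "hour boundary" point can be illustrated with a small sketch. It assumes hourly billing measured from each node's launch time, so an idle node is only worth releasing near the end of an hour it has already been billed for. The function names and the 10-minute release window are hypothetical; in practice the window would have to be wide enough to cover decommissioning time.

```python
def minutes_into_billing_hour(launch_epoch_s: float, now_epoch_s: float) -> float:
    """Minutes elapsed in the node's current billed hour (hourly billing assumed)."""
    elapsed = now_epoch_s - launch_epoch_s
    return (elapsed % 3600) / 60.0


def should_release(launch_epoch_s: float, now_epoch_s: float,
                   is_idle: bool, window_min: float = 10.0) -> bool:
    """Release an idle node only in the last `window_min` minutes of its hour.

    `window_min` is an assumed tuning knob, not a documented setting.
    """
    if not is_idle:
        return False
    return minutes_into_billing_hour(launch_epoch_s, now_epoch_s) >= 60.0 - window_min
```

A node that has been idle since minute 20 of its hour is kept until minute 50, since releasing it earlier saves nothing under hourly billing and removes capacity that is already paid for.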
![Page 17: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/17.jpg)
Min Cluster Size: 400
Max Cluster Size: 800
Time for which cluster size < max size: 49%
17
But it pays off!
Qubole and EC2
![Page 18: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/18.jpg)
18
But it pays off!
| Expected Compute Hours | Compute Hours Saved | Savings (%) |
|---|---|---|
| 2902246.2 | 2107311.01 | 72.6 |
| 4655815.5 | 2105486.11 | 45.2 |
| 1698052.65 | 1658738.375 | 97.6 |
| 1776944.4 | 1476547.835 | 83.1 |
| 2063127.85 | 838628.7 | 40.6 |
| 919721.25 | 613630.955 | 66.7 |
Qubole and EC2
![Page 19: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/19.jpg)
• Various configurations
– All nodes on-demand
– Minimum nodes on-demand, rest combination of on-demand & spot
– All nodes spot
• Minimum set has higher bid price => less likely to lose
– Up to 90% savings compared to on-demand price
19
Spot instances
Qubole and EC2
![Page 20: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/20.jpg)
• Not always available
– Fall back to on-demand
• Increases overall cost of cluster
– Periodically replace extra on-demand instances when spot available
• Can go away at any time
– Hadoop has built-in resiliency
– Place replicas on stable instances
• Auto-scaling
– Must maintain requested ratio
20
Why is it hard?
Qubole and EC2
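The mixed on-demand/spot configuration, the fallback to on-demand, and the later rebalancing can be sketched as follows. This is a simplified model under stated assumptions, not Qubole's algorithm: `plan_nodes`, `fulfil`, `rebalance` and the `spot_pct` parameter are hypothetical, and the sketch models only the "minimum nodes on-demand, rest mixed" configuration.

```python
import math


def plan_nodes(total: int, min_nodes: int, spot_pct: float) -> dict:
    """Split a cluster: the minimum set is on-demand, and `spot_pct` percent
    of the remaining nodes are requested as spot."""
    if total < min_nodes:
        raise ValueError("total must be at least min_nodes")
    extra = total - min_nodes
    spot = math.floor(extra * spot_pct / 100)
    return {"on_demand": min_nodes + extra - spot, "spot": spot}


def fulfil(plan: dict, spot_available: int) -> dict:
    """Spot is not always available: fall back to on-demand for the shortfall."""
    got_spot = min(plan["spot"], spot_available)
    return {
        "on_demand": plan["on_demand"],
        "spot": got_spot,
        "fallback_on_demand": plan["spot"] - got_spot,
    }


def rebalance(state: dict, spot_available_now: int) -> dict:
    """Swap fallback on-demand nodes back to spot when capacity returns,
    restoring the requested ratio."""
    swap = min(state["fallback_on_demand"], spot_available_now)
    return {
        "on_demand": state["on_demand"],
        "spot": state["spot"] + swap,
        "fallback_on_demand": state["fallback_on_demand"] - swap,
    }
```

Tracking the fallback nodes separately is what makes the periodic replacement possible: they are the only on-demand nodes that the requested ratio says should be spot.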
![Page 21: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/21.jpg)
• Useful for newer instance types - c3/m3
– Low ephemeral storage
• Better performance/$
– Compared to older instance types with more storage
• Writes are changed
– Minimize writes to EBS volumes
– Use them only when ephemeral is near full
21
Elastic Block Store
Qubole and EC2
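The write-placement rule on this slide (prefer ephemeral storage, spill to EBS only when ephemeral is near full) can be sketched as a single function. The name `pick_volume` and the 90% "near full" threshold are assumptions for illustration; the slide does not give a threshold.

```python
def pick_volume(ephemeral_used: int, ephemeral_capacity: int,
                near_full: float = 0.9) -> str:
    """Choose where the next write goes.

    Ephemeral disk is preferred; EBS is used only once ephemeral usage
    reaches the (assumed) `near_full` fraction of capacity.
    """
    if ephemeral_capacity <= 0:
        return "ebs"  # no ephemeral storage at all
    if ephemeral_used / ephemeral_capacity >= near_full:
        return "ebs"
    return "ephemeral"
```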
![Page 22: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/22.jpg)
• Spot Fleets
• EBS-only instances
22
What’s coming next
Qubole and EC2
![Page 23: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/23.jpg)
Resources
• Auto-scaling Hadoop Clusters in Qubole
• Spot Instances in Qubole Clusters
• Rebalancing Hadoop Clusters for Better Spot Utilization
• Improved Performance with Low-Ephemeral-Storage Instances
23
Qubole and EC2
![Page 24: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/24.jpg)
Agenda
• Introduction to Qubole
• How Qubole integrates with AWS:
– EC2
– S3
– RedShift
– Kinesis
• Hive vs. Presto vs. RedShift vs. Spark
• Qubole vs. EMR
24
Agenda
![Page 25: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/25.jpg)
Qubole and S3
• No real Rename (aka Move) operation
– Renames are copies and expensive!
• S3 connection establishment is expensive
– i.e., small calls like getObjectDetails(key) and reads are expensive!
• S3 has bulk prefix listing
– listObjectsChunked(startKey, maxListing)
• Puts are atomic
– Objects created when object uploaded
– Unlike HDFS where files are created on first write
• MultiPart!
25
S3 != HDFS
![Page 26: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/26.jpg)
Qubole and S3
• Naive
    for path in ['/x/y/a', '/x/y/b', '/x/z/c', … ]:
        result << listObject(path)
• Smart
    pathList = listPrefix('/x')
    while (entry = pathList.next()):
        if entry in ['/x/y/a', '/x/y/b', '/x/z/c', … ]:
            result << entry
• Up to 1000x improvement
26
Prefix Listing
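The difference between the two approaches comes down to the number of round trips. The sketch below uses an in-memory stand-in for an object store (the hypothetical `FakeStore`, which is not an S3 client) that counts calls: one per object lookup, and one per page of a prefix listing. The page size of 1000 is an assumption.

```python
class FakeStore:
    """In-memory stand-in for an object store that counts round trips."""

    def __init__(self, keys, page_size=1000):
        self.keys = sorted(keys)
        self._key_set = set(keys)
        self.page_size = page_size
        self.calls = 0

    def head(self, key):
        """One round trip per object lookup."""
        self.calls += 1
        return key in self._key_set

    def list_prefix(self, prefix):
        """One round trip per page of results (at least one, even if empty)."""
        matching = [k for k in self.keys if k.startswith(prefix)]
        for i in range(0, max(len(matching), 1), self.page_size):
            self.calls += 1
            yield from matching[i:i + self.page_size]


def naive_lookup(store, paths):
    """Check every path with its own call."""
    return [p for p in paths if store.head(p)]


def smart_lookup(store, prefix, paths):
    """List the common prefix once and filter locally."""
    wanted = set(paths)
    return [k for k in store.list_prefix(prefix) if k in wanted]
```

For 1500 paths under a prefix holding 2000 keys, the naive version makes 1500 calls while the prefix listing makes 2. The gap grows with the number of paths per page, which is where the large improvement factor comes from.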
![Page 27: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/27.jpg)
Qubole and S3
• Split Computation
– Divide input files into tasks for Map-Reduce/Spark/Presto
• Recovering Partitions
• List Paths matching regex pattern (‘/x/y/z/*/*’)
• and many more ..
27
Prefix Listing - Use Cases
![Page 28: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/28.jpg)
Qubole and S3
• Normally:
– Write data to temporary location - atomically rename to final location
• With S3:
– Write data to final location
– Atomic puts deal with speculation/retries
– Optional: Remove on Failure
• By default in Hive, DirectFileOutputCommitter in MR/Spark
• Tricky: retries/speculation must use same path
28
Direct Writes
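A small model shows why direct writes matter on S3. In the sketch below, `FakeS3` is a hypothetical in-memory store in which a rename is a copy followed by a delete, as described on the "S3 != HDFS" slide. The function names and key layout are illustrative and are not the actual output committer code.

```python
class FakeS3:
    """In-memory object store where rename is copy + delete."""

    def __init__(self):
        self.objects = {}
        self.bytes_copied = 0

    def put(self, key, data: bytes):
        # Atomic put: the object appears only when the upload completes.
        self.objects[key] = data

    def rename(self, src, dst):
        # No real rename: the data is copied, then the source is deleted.
        data = self.objects[src]
        self.objects[dst] = data
        self.bytes_copied += len(data)
        del self.objects[src]


def commit_via_rename(store, final_key, data, attempt_id):
    """HDFS-style commit: write to a temporary key, then rename."""
    tmp_key = f"_temporary/{attempt_id}/{final_key}"
    store.put(tmp_key, data)
    store.rename(tmp_key, final_key)


def commit_direct(store, final_key, data):
    """Direct write: put straight to the final key. Retries and speculative
    attempts must target the same key so that they overwrite each other."""
    store.put(final_key, data)
```

With the rename-based commit every output byte is written twice, while the direct write copies nothing. Running `commit_direct` again for a retried task leaves exactly one object at the final key.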
![Page 29: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/29.jpg)
Qubole and S3
• Naive: [diagram: Client and S3]
• Smart: [diagram: Client and S3]
29
Pre-Fetching
![Page 30: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/30.jpg)
S3 Local Disk Cache (Presto)
30
![Page 31: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/31.jpg)
S3 Local Disk Cache (Presto)
31
![Page 32: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/32.jpg)
Qubole and S3
• Populating Cache while performing Query may cause Slowness
• Large Files are split
– Cache Files or Splits?
• Should Caching Combine Small Files?
• Should Caching transform data into Columnar?
Watch out for Table Copies!
32
Why S3 Caching is hard
![Page 33: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/33.jpg)
Qubole and S3
• Handle S3 Timeouts and Exceptions (truncated streams)
• Optimize away seek() operations
• Data Sharing across Organizations using Cross-Account Roles
33
Miscellaneous
![Page 34: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/34.jpg)
Resources
● S3 Optimizations in Hive
○ http://www.qubole.com/optimizing-hadoop-for-s3-part-1/
● Caching in Presto
○ http://www.qubole.com/blog/product/caching-presto/
● Qubole vs. EMR Performance Comparison
○ http://www.qubole.com/a-performance-comparison-of-qubole-and-amazons-elastic-mapreduce-emr/
● Data Sharing via AWS Roles:
○ https://qubole-eng.quora.com/Securely-sharing-data-across-Organizations-with-Qubole-2
Qubole and S3
![Page 35: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/35.jpg)
Agenda
• Introduction to Qubole
• How Qubole integrates with AWS:
– EC2
– S3
– RedShift
– Kinesis
• Hive vs. Presto vs. RedShift vs. Spark
• Qubole vs. EMR
35
Agenda
![Page 36: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/36.jpg)
• Introduction to Presto
• Brief Introduction to Kinesis
• Qubole’s Value add to Kinesis
• Brief introduction to Redshift
• Qubole’s Value add to Redshift
• When to use which system
36
Presto, Kinesis and RedShift
Presto, Kinesis, RedShift
![Page 37: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/37.jpg)
• Interactive SQL Query Engine for Big Data
• Open-sourced by Facebook in late 2013
• Follows ANSI-SQL
• Own execution engine
37
Presto, What is it?
Presto, Kinesis, RedShift
![Page 38: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/38.jpg)
● Extensibility
○ Pluggable Datasources
● Performance
○ In-memory Execution
○ Aggressive Pipelining
○ Highly efficient Java code
○ Dynamic Query Compilation
○ Vectorization
38
Why Presto?
Presto, Kinesis, RedShift
![Page 39: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/39.jpg)
● Smooth learning curve as it adheres to ANSI-SQL
● Active open source community
● Proven worth at scale in companies like Facebook, Netflix, Airbnb
39
Why not something else?
Presto, Kinesis, RedShift
![Page 40: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/40.jpg)
● Self managed Presto clusters
● Auto-configured
● Autoscaling
● Data Caching
40
Benefits of Presto @Qubole
Presto, Kinesis, RedShift
![Page 41: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/41.jpg)
● Kinesis Connector
● S3 Optimizations
● Insert Support
● UDF support
41
Qubole’s Contribution
Presto, Kinesis, RedShift
![Page 42: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/42.jpg)
● Average times across the performance tests by MediaMath on a 22TB text format table, partitioned on date, queried on partition with ~1.2b rows
http://www.qubole.com/blog/big-data/performance-testing-presto/
42
Comparison with Hive
Presto, Kinesis, RedShift
![Page 43: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/43.jpg)
● High Capacity Pipe for Real-Time Processing
● Key Concepts
○ Record
○ Streams
○ Shards
○ Checkpoints
43
Kinesis
Presto, Kinesis, RedShift
![Page 44: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/44.jpg)
● Streaming use case
○ Spark
● SQL use case
○ Via Hive
○ Via Presto
44
Qubole and Kinesis
Presto, Kinesis, RedShift
![Page 45: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/45.jpg)
● Kinesis Connector
45
Presto-Kinesis Integration
Presto, Kinesis, RedShift
![Page 46: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/46.jpg)
● Example
○ Step 1: Define Schema
46
Presto-Kinesis Integration
Presto, Kinesis, RedShift
![Page 47: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/47.jpg)
○ Step 2: Run Query
47
Presto-Kinesis Integration
![Page 48: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/48.jpg)
● Data warehouse service
● OLAP
● Storage + Compute
48
Redshift
Presto, Kinesis, RedShift
![Page 49: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/49.jpg)
● ETL Use Case
○ DBImport
○ DBExport
● Ad Hoc Query Use Case
○ Direct Query
○ Hive
○ Presto
49
Qubole and Redshift
Presto, Kinesis, RedShift
![Page 50: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/50.jpg)
Two ways to access Redshift via Presto
● Via Hive JDBC Storage Handler
50
Presto-Redshift integration
Presto, Kinesis, RedShift
![Page 51: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/51.jpg)
Two ways to access Redshift via Presto
● Via JDBC connector
51
Presto-Redshift integration
Presto, Kinesis, RedShift
![Page 52: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/52.jpg)
● Single Platform
● Consistent User Interface
● Cross-source Joins without data consolidation
52
New Opportunities
Presto, Kinesis, RedShift
![Page 53: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/53.jpg)
● Hive: ETL, huge Joins, Group By on high cardinality columns
● Redshift: Interactive Queries when data loading is acceptable
● Presto:
○ Interactive Queries
○ Direct Queries without loading
○ Joining data across Redshift, Kinesis, S3, MySQL, Cassandra, etc.
53
Hive? Presto? RedShift?
Presto, Kinesis, RedShift
![Page 54: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/54.jpg)
● Iterative Machine Learning
● In-Memory Computing
● Spark Streaming
54
Got Spark?
Presto, Kinesis, RedShift
![Page 55: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/55.jpg)
Resources
● Presto vs Hive
○ http://www.qubole.com/blog/big-data/performance-testing-presto/
● Presto Kinesis Integration
○ http://blogs.aws.amazon.com/bigdata/post/Tx2DDFNHXSAAH2G/Presto-Amazon-Kinesis-Connector-for-Interactively-Querying-Streaming-Data
● Presto Kinesis Connector
○ https://github.com/qubole/presto-kinesis
● Hive Redshift/JDBC Connector
○ http://www.qubole.com/blog/product/hive-jdbc-storage-handler/
● Hive Redshift/JDBC Connector
○ https://github.com/qubole/Hive-JDBC-Storage-Handler
![Page 56: Qubole @ AWS Meetup Bangalore - July 2015](https://reader037.vdocuments.mx/reader037/viewer/2022103021/55d0555bbb61ebb37b8b467c/html5/thumbnails/56.jpg)
Agenda
• Introduction to Qubole
• How Qubole integrates with AWS:
– EC2
– S3
– RedShift
– Kinesis
• Hive vs. Presto vs. RedShift vs. Spark
• Qubole vs. EMR
56
Agenda