Solving Low Latency Query over Big Data with Spark SQL (Julien Pierre, Microsoft)
TRANSCRIPT
Client Data Fluency
Office
Skype
Bing
Modern Data Capability
Instrumentation & Ingestion
Processing & Storage
Reporting & Analytics
Information Management
Mobile-First Analytics Experience
Experimentation
Data Size
Query Latency
Get results inline in Zeppelin
Need to open the results in Excel
[Chart: time comparison for Cosmos, SparkSQL, and SparkSQL with Cache. Axis range 0 to 200 (units not stated). Legend: Write and Compile Query; Submit and Wait in Job Queue; Job Run Time]
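The chart legend splits the time to get an answer into three phases. A minimal sketch of that breakdown, assuming the phases simply add up to the end-to-end latency; the names QueryLatency and speedup are hypothetical and no figures from the talk are reproduced here:

```python
from dataclasses import dataclass


@dataclass
class QueryLatency:
    """Time spent in each phase named in the chart legend (same unit for all)."""
    write_and_compile: float  # "Write and Compile Query"
    queue_wait: float         # "Submit and Wait in Job Queue"
    run_time: float           # "Job Run Time"

    def total(self) -> float:
        # Assumption: the phases are sequential, so they sum to the total.
        return self.write_and_compile + self.queue_wait + self.run_time


def speedup(baseline: QueryLatency, candidate: QueryLatency) -> float:
    """Return how many times faster `candidate` is end to end than `baseline`."""
    return baseline.total() / candidate.total()
```

Keeping the phases separate rather than storing one number makes it visible whether a system wins on queueing, on run time, or on both.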
Mesos Cluster/HDFS
Job Manager Zookeeper
Job Frontend Web API
Spark Driver Host Pool
Spark Hive Thrift Server Zeppelin Server
Avocado (Hive Query + Schedule Task)
Rover (Drag & Drop BI tool with Hive Code Gen)
Zeppelin Web UI
MetastoreDB Hive Loader
Cosmos Storage: Partition 1, Partition 2, ..., Partition n
Export Cosmos Partition: Partition 1, Partition 2, ..., Partition n
HDFS.copyFromLocalFile: Task 2, ..., Task n
Partition 1, Partition 2, ..., Partition n
saveAsParquetFile: Task 2..., Task n
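The loader flow above appears to run one task per exported partition, each calling HDFS.copyFromLocalFile. A rough sketch of that fan-out, not the Hive Loader's actual code: it uses only the Python standard library, shutil.copyfile into a local directory stands in for HDFS.copyFromLocalFile, and the names copy_partition and load_partitions are hypothetical.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List, Sequence


def copy_partition(src: Path, dest_dir: Path) -> Path:
    """Copy one exported partition file into dest_dir and return the new path.

    Assumption: shutil.copyfile is a local stand-in for HDFS.copyFromLocalFile.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copyfile(src, dest)
    return dest


def load_partitions(sources: Sequence[Path], dest_dir: Path, workers: int = 4) -> List[Path]:
    """Run one copy task per partition in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda src: copy_partition(src, dest_dir), sources))
```

Because each partition is copied by its own task, a failed copy can be retried for that partition alone without re-exporting the others.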
<Database1> <Database2>
<Table1> <Table2>
<Partition1> <Partition2>
MetastoreDB
Hive Thrift Server
Hive Loader
Zeppelin Server
User Query Query
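The database, table, and partition hierarchy in the diagram above matches how Hive organizes data registered in a metastore. A minimal sketch, assuming Hive's default warehouse layout and a standard HiveQL ALTER TABLE ... ADD PARTITION statement; the helper names, and the partition column used in the usage example, are hypothetical, and the actual Hive Loader may register partitions differently.

```python
from typing import Dict


def partition_path(warehouse: str, database: str, table: str, spec: Dict[str, str]) -> str:
    """Build a Hive-style partition directory: <warehouse>/<db>.db/<table>/<key>=<value>/..."""
    parts = "/".join(f"{key}={value}" for key, value in spec.items())
    return f"{warehouse.rstrip('/')}/{database}.db/{table}/{parts}"


def add_partition_sql(table: str, spec: Dict[str, str], location: str) -> str:
    """Render a HiveQL statement that registers one partition with the metastore.

    Assumption: the session has already selected the database (USE <database>).
    """
    cols = ", ".join(f"{key}='{value}'" for key, value in spec.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({cols}) LOCATION '{location}'"
    )


# Usage with placeholder names taken from the diagram labels:
location = partition_path("/user/hive/warehouse", "Database1", "Table1", {"dt": "2015-01-01"})
statement = add_partition_sql("Table1", {"dt": "2015-01-01"}, location)
```

Once a partition is registered this way, queries sent through the Hive Thrift Server can filter on the partition column and read only the matching directory.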
Data Ingest
Services
Clients
Transform Compute
Transform Compute
Data Streams
Data Sets
Store
Event Processing
HDFS Data Transportation
Spark Streaming Receiver
Analyst
Zeppelin Notebooks
Avocado
Simple query
Query language
“Analyze”
“Debug”
“Mine”
“Glance”
Data
Unified platform
Intelligence
Interactive analytics
Data
Products
Better Digital Experiences
Dual users
“Bing”
“Office”