Solving Low Latency Query over Big Data with Spark SQL (Julien Pierre, Microsoft)

Client Data Fluency

Office

Skype

Bing

Modern Data Capability

Instrumentation & Ingestion

Processing & Storage

Reporting & Analytics

Information Management

Mobile-First Analytics Experience

Experimentation

Data Size

Query Latency

Get results inline in Zeppelin

Need to open the results in Excel

[Chart: Cosmos vs. SparkSQL vs. SparkSQL with Cache, each split into Write and Compile Query, Submit and Wait in Job Queue, and Job Run Time; axis from 0 to 200]

Mesos Cluster/HDFS

Job Manager Zookeeper

Job Frontend Web API

Spark Driver Host Pool

Spark Hive Thrift Server Zeppelin Server

Avocado (Hive Query + Schedule Task)

Rover (Drag & Drop BI tool with Hive Code Gen)

Zeppelin Web UI

MetastoreDB Hive Loader

Cosmos Storage

Partition 1

Partition 2

...

Partition n

Export Cosmos Partition

Partition 1

Partition 2

...

Partition n

Task 2

HDFS.copyFromLocalFile

...

Task n

Partition 1

Partition 2

...

Partition n

saveAsParquetFile

Task 2...

Task n
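The load path above (one task per partition: export from Cosmos, then HDFS.copyFromLocalFile) can be sketched roughly as follows. This is only an illustration under stated assumptions: the functions `export_partition`, `copy_to_store`, `load_partition`, and `load_all` are hypothetical names, a thread pool stands in for Spark tasks, and a local directory stands in for HDFS. The saveAsParquetFile step is left out because it needs a Spark runtime.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import shutil
import tempfile


def export_partition(name: str, rows: list, scratch: Path) -> Path:
    """Hypothetical stand-in for 'Export Cosmos Partition': dump one partition to a local file."""
    local = scratch / f"{name}.txt"
    local.write_text("\n".join(rows))
    return local


def copy_to_store(local: Path, store: Path) -> Path:
    """Hypothetical stand-in for HDFS.copyFromLocalFile: copy the local file into the shared store."""
    store.mkdir(parents=True, exist_ok=True)
    target = store / local.name
    shutil.copy(local, target)
    return target


def load_partition(name: str, rows: list, scratch: Path, store: Path) -> Path:
    """One task: export a single partition, then copy it into the store."""
    return copy_to_store(export_partition(name, rows, scratch), store)


def load_all(partitions: dict, scratch: Path, store: Path) -> list:
    """Run one task per partition in parallel and return the stored paths, sorted."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [
            pool.submit(load_partition, name, rows, scratch, store)
            for name, rows in partitions.items()
        ]
        return sorted(f.result() for f in futures)


if __name__ == "__main__":
    scratch_dir = Path(tempfile.mkdtemp())
    store_dir = Path(tempfile.mkdtemp()) / "table"
    for path in load_all({"partition_1": ["a", "b"], "partition_2": ["c"]}, scratch_dir, store_dir):
        print(path.name)
```

Because each partition is handled by its own task, the tasks do not depend on each other and the load can scale with the number of partitions.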

<Database2>

<Table1>

<Database1>

<Partition1>

<Table2>

<Partition2>

MetastoreDB

Hive Thrift Server

Hive Loader

Zeppelin Server

User Query
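One way a Hive Loader could register Parquet partitions in the MetastoreDB is by issuing HiveQL `ALTER TABLE ... ADD PARTITION` statements. The sketch below only builds those statements as strings. It is an assumption, not the loader from the talk: the function names, the `ds` partition column, the `/data` root, and the database/table/partition directory layout are all hypothetical.

```python
def partition_location(root: str, database: str, table: str, spec: dict) -> str:
    """Build a directory path of the assumed form <root>/<database>/<table>/<key=value>."""
    parts = "/".join(f"{key}={value}" for key, value in spec.items())
    return f"{root}/{database}/{table}/{parts}"


def add_partition_ddl(database: str, table: str, spec: dict, location: str) -> str:
    """Build the HiveQL statement that registers one partition directory in the metastore."""
    cols = ", ".join(f"{key}='{value}'" for key, value in spec.items())
    return (
        f"ALTER TABLE {database}.{table} "
        f"ADD IF NOT EXISTS PARTITION ({cols}) "
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    spec = {"ds": "2015-06-01"}  # hypothetical partition column and value
    location = partition_location("/data", "database1", "table1", spec)
    print(add_partition_ddl("database1", "table1", spec, location))
```

Once a partition is registered this way, clients that query through the Hive Thrift Server or Zeppelin can reach it by table name without knowing the underlying path.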

Data Ingest

Services

Clients

Transform Compute

Transform Compute

Data Streams

Data Sets

Store

Event Processing

HDFS Data Transportation

Spark Streaming Receiver

Analyst

Zeppelin Notebooks

Avocado

Simple query

Query language

“Analyze”

“Debug”

“Mine”

“Glance”

Data

Unified platform

Intelligence

Interactive analytics

Data

Products

Better Digital Experiences

Dual users

“Bing”

“Office”

Data & Analytics