Drilling on JSON

Drilling on JSON. Keshav Murthy, Senior Director, Product Management, MapR. August 19th, 2014. [email protected] Twitter: @rkeshavmurthy. @apachedrill #NoSQLNow © 2014 MapR Technologies

Upload: keshav-murthy

Post on 14-May-2015


DESCRIPTION

Variety is the spice of life, but it’s also the reality of big data. For this reason, JSON has become the lingua franca of data on the internet: for APIs, data exchange, data storage and data processing. In the business intelligence world, SQL is the language used to analyze data in other forms, hence the myriad “SQL-on-Hadoop” projects. However, traditional SQL isn’t friendly to JSON, Parquet and similar formats, and ETL into flattened tables is costly and not real time. Apache Drill unifies SQL with a variety of data forms on Hadoop. That enables interactive analytics using your favorite BI tool and visualization tool on your data simultaneously. In this talk, we’ll introduce Apache Drill and describe use cases. - See more at: http://nosql2014.dataversity.net/sessionPop.cfm?confid=81&proposalid=6850#sthash.NhuLz6Dq.dpuf

TRANSCRIPT

Page 1: Drilling on JSON

© 2014 MapR Technologies #NoSQLNow @apachedrill

Drilling on JSON

Keshav Murthy

August 19th, 2014

[email protected] Twitter: @rkeshavmurthy

Senior Director, Product Management, MapR

Page 2: Drilling on JSON


NoSQL

We don't need no transaction
We don't need no ACID control

No schema in the tables
No limit to the scale out

DBA, leave them JSON alone
Hey DBA, leave them JSON alone

All in all it's just another data in the BASE
All in all it’s just another shard into cloud.

…With apologies to Roger Waters

Page 3: Drilling on JSON


NoSQL Landscape

Martin Fowler says: “aggregate-oriented”
What you're most likely to access as a unit.

Key Value Store: Couchbase, Riak, Citrusleaf, Redis, BerkeleyDB, Membrain, ...

Document: MongoDB, CouchDB, RavenDB, Couchbase, ...

Graph: OrientDB, DEX, Neo4j, GraphBase, ...

Wide Column: HBase, Hypertable, Cassandra, MapR-DB, ...

Page 4: Drilling on JSON


Data landscape is changing

New types of applications
• Social, mobile, Web, “Internet of Things”, Cloud…
• Iterative/Agile in nature
• More users, more data

New data models & data types
• Flexible Schema/Schema less
• Rapidly changing
• Semi-structured/Nested data

{
  "data": [
    {
      "id": "X999_Y999",
      "from": {
        "name": "Tom Brady", "id": "X12"
      },
      "message": "Looking forward to 2014!",
      "actions": [
        {
          "name": "Comment",
          "link": "http://www.facebook.com/X99/posts Y999"
        },
        {
          "name": "Like",
          "link": "http://www.facebook.com/X99/posts Y999"
        }
      ],
      "type": "status",
      "created_time": "2013-08-02T21:27:44+0000",
      "updated_time": "2013-08-02T21:27:44+0000"
    }
  ]
}

JSON
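To make the "nested data" point concrete, here is a minimal Python sketch that walks a record shaped like the one on this slide and lists every leaf value by its dotted path. The record is trimmed, and the `flatten` helper is a hypothetical illustration, not part of Apache Drill.

```python
import json

# Trimmed copy of the slide's status-update record (values taken from the slide).
doc = json.loads("""
{
  "data": [
    {
      "id": "X999_Y999",
      "from": {"name": "Tom Brady", "id": "X12"},
      "message": "Looking forward to 2014!",
      "actions": [{"name": "Comment"}, {"name": "Like"}],
      "type": "status"
    }
  ]
}
""")


def flatten(value, prefix=""):
    """Yield (path, scalar) pairs for every leaf in a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


flat = dict(flatten(doc))
for path, leaf in flat.items():
    print(path, "=", leaf)
```

Paths such as `data[0].from.name` and `data[0].actions[1].name` show why a single flat row of columns does not map cleanly onto this kind of record.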

Page 5: Drilling on JSON


APACHE DRILL

• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications

40+ contributors
150+ years of experience building databases and distributed systems

Page 6: Drilling on JSON


Zero to Results in 2 Minutes (3 Commands)

Install

$ tar xzf apache-drill.tar.gz

Launch shell (embedded mode)

$ apache-drill/bin/sqlline -u jdbc:drill:zk=local

Query

0: jdbc:drill:zk=local> SELECT DISTINCT users.name as name, users.emails.work as email
FROM dfs.logs.`/data/logs` logs, dfs.users.`/profiles.json` users
WHERE logs.uid = users.id AND logs.errorLevel > 5;
+------------+------------+
| name       | email      |
+------------+------------+
| john       | [email protected] |
| jack       | [email protected] |
| Ronn       | [email protected] |
| Pat        | [email protected] |
...
35 rows selected (0.847 seconds)
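To spell out what the query above asks for, here is a rough Python sketch of the same logic over in-memory records: join logs to users on the user id, keep log entries with `errorLevel > 5`, and return distinct (name, work email) pairs. The sample records are hypothetical, since the slide does not show the files' contents, and this is not how Drill executes the query.

```python
# Hypothetical stand-ins for /data/logs and /profiles.json.
logs = [
    {"uid": 1, "errorLevel": 7},
    {"uid": 1, "errorLevel": 9},
    {"uid": 2, "errorLevel": 3},
    {"uid": 3, "errorLevel": 6},
]
users = [
    {"id": 1, "name": "john", "emails": {"work": "john@example.com"}},
    {"id": 2, "name": "jack", "emails": {"work": "jack@example.com"}},
    {"id": 3, "name": "pat", "emails": {}},
]


def distinct_error_users(logs, users):
    """Return distinct (name, work email) pairs for users with errorLevel > 5 logs."""
    # Assumes user ids are unique, so a dict lookup can stand in for the join.
    users_by_id = {user["id"]: user for user in users}
    seen = set()
    rows = []
    for log in logs:
        if log["errorLevel"] <= 5:
            continue  # WHERE logs.errorLevel > 5
        user = users_by_id.get(log["uid"])
        if user is None:
            continue  # inner join: drop logs with no matching user
        # Nested field access, like users.emails.work; a missing field is modeled as None.
        row = (user["name"], user.get("emails", {}).get("work"))
        if row not in seen:  # SELECT DISTINCT
            seen.add(row)
            rows.append(row)
    return rows


result = distinct_error_users(logs, users)
print(result)
```

The nested access to `emails.work` is the step that usually forces an ETL flattening pass before a traditional SQL engine can run this kind of query.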

Page 7: Drilling on JSON


Drill Supports Schema Discovery On-The-Fly

Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)

Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
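As an illustration of discovering a schema while reading rather than declaring it first, the Python sketch below scans records and collects the value types seen at each field path. The `discover_schema` helper and the sample records are hypothetical, and Drill's actual mechanism may differ.

```python
from collections import defaultdict


def discover_schema(records):
    """Map each field path to the sorted names of the value types seen there."""
    schema = defaultdict(set)

    def visit(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                visit(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            for child in value:
                visit(child, path + "[]")  # one shared path for all list elements
        else:
            schema[path].add(type(value).__name__)

    for record in records:
        visit(record, "")
    return {path: sorted(types) for path, types in schema.items()}


# Hypothetical records whose shape changes from one record to the next.
records = [
    {"name": {"first": "Michael"}, "age": 6},
    {"name": {"first": "Jennifer"}, "age": "3", "hobbies": ["sing"]},
]
schema = discover_schema(records)
print(schema)
```

Here `age` is reported with two types and `hobbies[]` appears only because the second record carries it, which is the evolving-schema case from the slide.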

Page 8: Drilling on JSON


Self-Describing Data is Ubiquitous

Flat files in DFS
• Complex data (Thrift, Avro, protobuf)
• Columnar data (Parquet, ORC)
• Loosely defined (JSON)
• Traditional files (CSV, TSV)

Data stored in NoSQL stores
• Relational-like (rows, columns)
• Sparse data (NoSQL maps)
• Embedded blobs (JSON)
• Document stores (nested objects)

{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos}
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC}
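The two records above carry their own field names, and those fields differ from record to record. The Python sketch below reads strictly quoted versions of them (the slide omits the quotes) and projects a few columns, returning None where a record lacks a field. The `get_path` helper and the chosen columns are illustrative assumptions.

```python
import json

# The slide's two records, rewritten as valid JSON, one per line.
lines = [
    '{"name": {"first": "Michael", "last": "Smith"}, "hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"}, "hobbies": ["sing"], "preschool": "CCLC"}',
]


def get_path(record, dotted_path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


records = [json.loads(line) for line in lines]
columns = ["name.first", "hobbies", "district", "preschool"]
rows = [tuple(get_path(record, column) for column in columns) for record in records]
for row in rows:
    print(row)
```

No schema is declared anywhere in this sketch: the column list is chosen at query time and each record answers for itself.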

Page 9: Drilling on JSON


Drill’s Data Model is Flexible

[Figure: data formats plotted on two axes, Fixed schema to Schema-less and Flat to Complex, with flexibility increasing along both. Formats shown: HBase; JSON, BSON; CSV, TSV; Parquet, Avro.]

RDBMS/SQL-on-Hadoop table

Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table

{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos}
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC}

Page 10: Drilling on JSON


Core Modules within a Drillbit

[Diagram: modules inside a Drillbit]

SQL Parser, Logical Plan, Optimizer, Physical Plan, Execution

RPC Endpoint

Distributed Cache

Storage Plugins: DFS, HBase, Hive, MongoDB, CouchBase, Cassandra, RDBMS

Page 11: Drilling on JSON


[Diagram: recovered labels only]

Processing in Files: MapReduce, generic file formats

Rows/Columns in files (tables): Hive, Pig, etc.

Query: Impala, Tez, Hive

NoSQL: MongoDB, HBase, Cassandra, Riak, Redis

HADOOP: Disk & Storage

RDBMS: Highly Structured Data; ANSI-SQL; OLTP, EDW

Other labels: SQL++, R, etc.; bits, bytes, blocks; Semi Structured & Self describing; No Structure

Cost labels: $100K – $200K / TB, $1K/TB, $10K/TB

Apache Drill

Page 12: Drilling on JSON


NoSQL NoETL

Drill, Baby, Drill: Self-Service Data Exploration using Apache Drill
Thursday, August 21st, 9:30 AM

Apache Drill