jaql: querying json data on hadoop - ieeeewh.ieee.org/r6/scv/computer/nfic/2008/ibm jaql by kevin...

1

IBM Almaden Research Center

Jaql: Querying JSON data on Hadoop © 2008 IBM Corporation

Jaql: Querying JSON data on Hadoop

Kevin Beyer

Research Staff MemberIBM Almaden Research Center

In collaboration with Vuk Ercegovac, Ning Li, Jun Rao, Eugene Shekita

2



Outline

Overview of HadoopJSONJaql query language

3



The Hadoop Stack

Components:

Horizontal features:– Used at large scale (e.g., 10,000 cores at Yahoo)– Elastic (w/out data re-org)– Fault tolerant (getting there…)– Easy to administer

Non-features:– No data model or types in HDFS or HBase– No indexing– No query language

HDFS

HBase

Map-Reduce

Distributed file system

Simple distributed database

Parallel batch processing

4



HDFS Overview

Single file-system stored on direct-attached disks of commodity serversReplicate file blocks for failuresSimplified file system interface– not Posix

– Designed for large, sequential reads

HDFSSwitch

Rack 1 Rack N

Server: 2-4 disks

File:

Typically 64 MB blocks

Switch Switch

5



HBase Overview

p127532 itemType: “car” make: “VW” doors: 2

p187842 itemType: “apartment” rooms: 3 rent: 1200

…

columnnamekey

columnvalue

location: “45E, 32N”

HDFSRedundancy through HDFS replication

Column values– Are versioned– Stored vertically in HDFS: <key, column, timestamp, value>

……

Sort

ed b

y ke

y

Logical view of table Physical view of table

No schema, no types

6



Map-Reduce Overview

Input Output

M1

M4

M2

M3

R1

R2

Vi [ Km, Vm ] Km, [ Vm ] [ Vr ] Programmer focus:– Map: Vi → [ Km,Vm ]

– Reduce: Km, [ Vm ] → Vr

System provides:– Parallelism

– Fault tolerance

– Key partitioning (shuffle)

– Synchronization

– Map task reads local block

shuf

fle

7



Example: Counting Words

Mi

R2

R1

(Document Line) [(Word, Count),…] (Word, [Count, Count, …]) Word, Count

“The following example is simple and is the ‘Hello, World’for Map-Reduce”

the, 2following, 1example, 1…

the, [2,1,13,7,7]following, [2,1]

example, [1]

the, 30following, 3

example, 1

Aggregate locally when possible (combine step)

(Vi) (Km,Vm) (Km, [Vm,…]) (Vr)

8



Outline


9



What is JSON?

JSON == Java Script Object NotationBNF (from www.json.org):

value ::= record | array | atom

record ::= { (string : value)* }

array ::= [ (value)* ]

atom ::= string | number | boolean | null

http://www.json.org/

10



JSON Example

[ { publisher: 'Scholastic', author: 'J. K. Rowling', title: 'Deathly Hallows', year: 2007 },

{ publisher: 'Scholastic', author: 'J. K. Rowling', title: 'Chamber of Secrets', year: 1999, reviews: [

{ rating: 10, user: 'joe', review: ‘The best ...’ }, { rating: 6, user: 'mary', review: ‘Average ...’ }]},

{ publisher: 'Scholastic', author: 'J. K. Rowling', title: 'Sorcerers Stone', year: 1998},

{ publisher: 'Scholastic', author: 'R. L. Stine', title: 'Monster Blood IV', year: 1997, reviews: [

{ rating: 8, user: 'rob', review: 'High on my list...‘ }, { rating: 2, user: 'mike', review: 'Not worth the paper ...' }]},

{ publisher: 'Grosset', author: 'Carolyn Keene', title: 'The Secret of Kane', year: 1930 } ]

[] == array, {} == record or object, xxx: == field name

11



Why JSON?

Need nested, self-describing data– Data is typed, without requiring a schema

– Support data that vary or evolve over time

Standard– Wide-spread Web 2.0 adoption

– Bindings available for many programming languages

Not XML– XML data is untyped without schema validation

– XML was designed for document mark-up, not dataEasy integration in most programming languages

– JSON is a subset of Javascript, Python, Ruby, Groovy, …

12



Outline


13



Jaql: A JSON Query LanguageDesigned for JSON data– With additional atomic types: e.g., dateTime, binary

Designed for many environments– Massive-scale cloud computing

– Rewrite queries to use Map-Reduce– Micro-scale embedded in browser

Designed for extensibility– Read / write data from any source into JSON view of data– Add new functions

Functional query language– Few side-effects: e.g., writing to a file– Functions are data

Draw on other languages– SQL, XQuery, PigLatin, JavaScript, Lisp, Python …

14



Jaql using Map

// Query: equivalent map-reduce job in Jaql mapReduce({ input : {type: ‘hdfs’, location: ‘books’ }, map : fn($b) {

if exists($b.reviews) then[[ null, {$b.author, $b.title } ]]},

output : {type: 'hdfs', location: ‘reviewedBooks’}})

// Query: Find the authors and titles of books that have received a review.$reviewed = for $b in hdfsRead('books')

where exists( $b.reviews )return { $b.author, $b.title };

hdfsWrite( ‘reviewedBooks’, $reviewed );

Rewrite Engine M

Map (book $b) -> {$b.author, $b.title}

M

M

books reviewedBooks

Jaql’s use of function as data ->

evaluate “fn” in Map task

Turn query into plan

Plan is another query!

15



JSON

I/O Extensibility

I/O layer abstracts details of data location + formatExamples of data stores:

– HDFS, HBase, Amazon’s S3, local FS, HTTP request, JDBC callExamples of data formats:

– JSON text, CSV, XML– Default format is JSON binary

Simple to extend Jaql with new data stores and formats

JSONI/O

Layer

I/O

LayerJaql Interpreter

books reviewedBooks

Map Task

16



// Write the result to a local CSV filehdfsWrite(‘bookPrices’, { converter: ‘CSVWriter’ }, $result}

// Query: Group Books and Purchases to return book titles w/associated purchase prices$result = group $b in hbaseRead(‘books’) by $bid = $b.key into $books,

$p in hdfsRead(‘purchases’) by $bid = $p.bid into $purchasesreturn { bid: $bid, title: $books[0].title, prices: $purchases[*].price };

I/O Extensibility Example

Example: return purchase prices per book– Books stored in HBase

– Purchases stored in HDFS

– Output to a CSV file for graphingUse multiple InputFormats

User defined format

{bid: 123, books: [{…, title: ‘Deathly Hallows’,…}], purchases: [{bid: 123, price: 6.50,…}, {bid: 123, price: 3.43,…}, …]}

{bid: 789, books: [{…, title: ‘Chamber of Secrets’,…}], purchases: [{bid: 789, price: 10.99,…}, {bid: 789, price: 6.75,…}, …]}

{ bid: 123, title: ‘Deathly Hallows’, prices: [6.50, 3.43] }

{ bid: 123, title: ‘Chamber of Secrets’, prices: [10.99, 6.75] }

Co-group: “outer equi-join”

bid, title, purchases

123, ‘Deathly Hallows’, 6.50:3.43:…789, ‘Chamber of Secrets’, 10.99:6.75:…

17



Jaql I/O Extensibility using MapReduce// Query: Group Books and Purchases to return book titles w/associated purchase prices$result = group $b in hbaseRead(‘books’) by $bid = $b.key into $books,

$p in hdfsRead(‘purchases’) by $bid = $p.bid into $purchasesreturn { id: $bid, title: $books[0].title, prices: $purchases[*].price };

// Write the result to a local CSV filehdfsWrite( ‘bookPrices’, { converter: ‘CSVWriter’ }, $result } );

// Query: equivalent map-reduce job in Jaql mapReduce({

input : [{type: ‘hbase’, location: ‘books’ }, {type: ‘hdfs’, location: ‘purchases’}],

map : [ fn($b) { [[ $b.key, $b ]] },fn($p) { [[ $p.bid, $p ]] } ],

reduce : fn($bid, $books, $purchases) { [ { id: $bid, title: $books[0].title,

prices: $purchases[*].price } ] },

output : { type: 'hdfs', location: ‘bookPrices’, options: { converter: ‘CSVWriter’ }}})

Rewrite Engine

RM

M R

Map (book $b) -> [$b.key, $b] Map (purchases $p) -> [$p.bid, $p]

- Partition & sort by $bid Reduce ($bid, $books, $purchases)

Extract id, title, prices

18



Expression Extensibility ExampleExample: segment books by their reviews’ sentiment

– Extract sentiment [0 = awful, 9 = best seller!] from each book

– Return list of books per sentiment score

// Query: analyze book reviews$scoredBooks = for $b in hbaseRead(‘books’)

return { $b.title, score: extractSentiment( $b.reviews ) };

// Query: aggregate according to sentiment score$sentiments = group $s in $scoredBooks by $score = $s.score into $books

return { score: $score, books: $books };

// Write the resulthdfsWrite( ‘sentimentReport’, $sentiment);

Extend Jaql with user defined expression

Why user defined extension?– 3rd party libraries

– Better expressed using a programming languageCurrently support Java, working on additional languages

19



Aggregation ExampleExample: compute the stddev of sentiment per region

– Join books and purchases for geographic region information

– Group books by geographic region

– Calculate standard deviation of book sentiments per region

// Query: analyze book reviews$scoredBooks = for $b in hbaseRead(‘books’)

return { $b.id, score: extractSentiment($b.reviews) };

// Query: join scoredBooks with purchases$bookPurchases = join $s in $scoredBooks on $s.id,

$p in hadoopRead(‘purchases’) on $p.bidreturn { $s.id, $s.score, $p.region };

// Query: aggregate by region$regionStddev = group $bp in $bookPurchases by $r = $bp.region into $books

return { region: $r, stddev: stddev($books[*].score) };

// Write the resulthdfsWrite(‘sentimentReport’, $regionStddev);

20



Aggregation Example using Map-Reduce (1)// Query: aggregate by region$regionStddev = group $bp in $bookPurchases by $r = $bp.region into $books

return { region: $r, stddev: stddev( $books[*].score ) };

// Write the result to a local CSV filehdfsWrite( ‘sentimentReport’, $regionStddev );

// Query: equivalent map-reduce job in Jaql mapReduce({

input : [ … ],

map : fn($bp) { [[ $bp.region, $bp ]] },

reduce : fn($bid, $books) {[{ region: $r, stddev: stddev($books[*].score) }] },

output : { type: 'hdfs', location: ‘sentimentReport’ } })

Rewrite Engine

Standard deviation computed over large regions!

21



Distributive Aggregates

Standard deviation is distributive– Final result can be computed from partial aggregates

Map-Reduce can compute partial aggregates at Mapper– Map->Combine->Reduce

Jaql’s interface for distributive aggregates (for stddev):– Init($score):

– { n: 1, s: $score, s2: $score*$score }– Combine($a, $b):

– { n: $a.n + $b.n, s: $a.s + $b.s, s2: $a.s2 + $b.s2 }– Final($p):

– sqrt( $p.s2/$p.n – ($p.s/$p.n) * ($p.s/$p.n) )

22



Aggregation Example using MapReduce (2)

// Query: equivalent map-reduce job in Jaql mrAggregate({

input: { type: ‘hdfs’, location: ‘books’ },

init: fn ($bp) { [ $bp.region, { n: 1, s: $bp.score,

s2: $bp.score*$bp.score } ]},

combine: fn ($a, $b) {{ n: $a.n + $b.n, s: $a.s + $b.s, s2: $a.s2 + $b.s2 }},

final: fn ($r, $p) {[{ region: $r,

stddev: sqrt($p.s2/$p.n – ($p.s/$p.n)*($p.s/$p.n)) }]},

output: { type: 'hdfs', location: ‘sentimentReport’ }})

Rewrite Engine

RM

M R

// Query: aggregate by region$regionStddev = group $bp in $bookPurchases by $r = $bp.region into $books

return {region: $r, stddev: stddev($books[*].score)};

// Write the result to a local CSV filehdfsWrite(‘sentimentReport’, $regionStddev);

23



Related Work

SQL, XQuerySawzall (Google)

– Wrap Map in a scripting language + library of Reducers– Proprietary and not a query language

Pig (Yahoo)– Own data model vs. Jaql designed for JSON– Designed for Yahoo’s data– no types, not fully composable

Hive (Facebook)– Data warehouse catalog + SQL-like language

DryadLinq (Microsoft)– Dryad: DAG of compute vertices and communication edges– Linq: embed data access in the programming language stack

Groovy for Hadoop

24



Research Topics

Usability– Additional Jaql features– Integration with programming languages

Data model:– How much do we pay for dynamic typing?– How to take advantage of schema information?

Optimization:– Indexing– Join strategies

– Incorporate basic costs– More rewrites– Incremental compilation– Exploit HBase

– Filters can be pushed into HBase– Projections have implied predicate (r.x => x exists for record r)

– Code generation

25



Summary

Scale-out infrastructure for analyticsHadoop: popular, open source scale-out infrastructureJSON provides a data model for Hadoop

– Semi-structured and designed for dataJaql provides a query language for Hadoop

– Rich analytics run in parallel

– Extensible language and I/O layers

26



Questions?

jaql: querying json data on hadoop - ieeeewh.ieee.org/r6/scv/computer/nfic/2008/ibm jaql by kevin...

Documents