aggregation framework

75
Engineer, 10gen Bryan Reinero MapReduce, The Aggregation Framework & The Hadoop Connector

Upload: mongodb

Post on 30-Jun-2015

2.412 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Aggregation Framework

Engineer, 10gen

Bryan Reinero

MapReduce, The Aggregation Framework & The Hadoop

Connector

Page 2: Aggregation Framework

Data Warehousing

Page 3: Aggregation Framework

Data Warehousing

• We're storing our data in MongoDB

Page 4: Aggregation Framework

Data Warehousing

• We're storing our data in MongoDB

• We need reporting, grouping, common aggregations, etc.

Page 5: Aggregation Framework

Data Warehousing

• We're storing our data in MongoDB

• We need reporting, grouping, common aggregations, etc.

• What are we using for this?

Page 6: Aggregation Framework

Data Warehousing

• SQL for reporting and analytics• Infrastructure complications• Additional maintenance• Data duplication• ETL processes• Real time?

Page 7: Aggregation Framework

MapReduce

Page 8: Aggregation Framework

MapReduceWorker thread calls mapper

Data Set

Page 9: Aggregation Framework

MapReduceWorkers call Reduce()

Data Set

Output

Worker thread calls mapper

Page 10: Aggregation Framework

{ _id: 375, title: "The Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}

Our Example Data

Page 11: Aggregation Framework

MapReduce

db.books.mapReduce( map, reduce, {finalize: finalize, out: { inline : 1} } )db.books.mapReduce( map, reduce, {finalize: finalize, out: { inline : 1} } )

function map() { var key = this.language; emit ( key, { totalPages : this.pages,

numBooks : 1 } ) }

Page 12: Aggregation Framework

MapReduce

db.books.mapReduce( map, reduce, {finalize: finalize, out: { inline : 1} } )

function map() { var key = this.language; emit ( key, { totalPages : this.pages,

numBooks : 1 } ) }

db.books.mapReduce( map, reduce, {finalize: finalize, out: { inline : 1} } )

function reduce(key, values) { var result = { numBooks : 0, totalPages : 0}; values.forEach(function (value) { result.numBooks += value.numBooks; result.totalPages += value.totalPages; }); return result;}

Page 13: Aggregation Framework

MapReduce

db.books.mapReduce( map, reduce, {finalize: finalize, out: { inline : 1} } )

function reduce(key, values) { var result = { numBooks : 0, totalPages : 0}; values.forEach(function (value) { result.numBooks += value.numBooks; result.totalPages += value.totalPages; }); return result;}

db.books.mapReduce( map, reduce, {finalize: finalize, out: { inline : 1} } )

function finalize( key, value ) {

if ( value.numBooks != 0 ) return value.totalPages / value.numBooks;}

Page 14: Aggregation Framework

MapReduce

db.books.mapReduce( map, reduce, {finalize: finalize, out: { inline : 1} } )

function finalize( key, value ) {

if ( value.numBooks != 0 ) return value.totalPages / value.numBooks;}

db.books.mapReduce( map, reduce, {finalize: finalize, out: { inline : 1} } )

"results" : [{

"_id" : "English","value" : 653

},{

"_id" : "Russian","value" : 1440

}]

Page 15: Aggregation Framework

MapReduce

Primary Secondary Secondary

Three Node Replica Set

MapReduce Jobs

Primary

Page 16: Aggregation Framework

MapReduce

Primary Secondary Secondary

Three Node Replica Set

MapReduce Jobs

Primary

tags :{ workload :”prod”}

tags :{ workload :”prod”}

tags :{ workload :”analysis}priority : 0,

MapReduce JobsCRUD operations

Page 17: Aggregation Framework

MapReduce in MongoDB

• Implemented with JavaScript– Single-threaded– Difficult to debug

• Concurrency– Appearance of parallelism– Write locks

Page 18: Aggregation Framework

MapReduce

• Versatile, powerful

Page 19: Aggregation Framework

MapReduce

• Versatile, powerful• Intended for complex data

analysis

Page 20: Aggregation Framework

MapReduce

• Versatile, powerful• Intended for complex data

analysis• Overkill for simple

aggregations

Page 21: Aggregation Framework

Aggregation Framework

Page 22: Aggregation Framework

Aggregation Framework

• Declared in JSON, executes in C++

Page 23: Aggregation Framework

Aggregation Framework

• Declared in JSON, executes in C++• Flexible, functional, and simple

Page 24: Aggregation Framework

Aggregation Framework

• Declared in JSON, executes in C++• Flexible, functional, and simple• Plays nice with sharding

Page 25: Aggregation Framework

Pipeline

Page 26: Aggregation Framework

Pipeline

ps ax |grep mongod |head 1

Piping command line operations

Page 27: Aggregation Framework

Pipeline

$match $group | $sort|

Piping aggregation operations

Stream of documents Result document

Page 28: Aggregation Framework

Pipeline Operators

• $match

• $project

• $group

• $unwind

• $sort

• $limit

• $skip

Page 29: Aggregation Framework

$match

• Filter documents

• Uses existing query syntax

• No geospatial operations or $where

Page 30: Aggregation Framework

{ $match : { language : "Russian" } }{ title: "The Great Gatsby", pages: 218, language: "English"}

{ title: “War and Peace", pages: 1440, language: ”Russian"}

{ title: “Atlas Shrugged", pages: 1088, language: ”English"}

Page 31: Aggregation Framework

{ title: "The Great Gatsby", pages: 218, language: "English"}

{ title: “War and Peace", pages: 1440, language: ”Russian"}

{ title: “Atlas Shrugged", pages: 1088, language: ”English"}

{ $match : { pages : { $gt : 1000 } }{ $match : { language : "Russian" } }

Page 32: Aggregation Framework

$project

• Reshape documents

• Include, exclude or rename fields

• Inject computed fields

• Create sub-document fields

Page 33: Aggregation Framework

{ title: "Great Gatsby", language: "English"}

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}

Selecting and Excluding Fields

$project: { _id: 0, title: 1, language: 1 }

Page 34: Aggregation Framework

Renaming and Computing Fields

{ $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language"}}

Page 35: Aggregation Framework

Renaming and Computing Fields

{ $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language"}}

New Field

Page 36: Aggregation Framework

Renaming and Computing Fields

{ $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language"}}

New Field

Operation

Divisor

Dividend

Page 37: Aggregation Framework

{ _id: 375, avgChapterLength: 24.2222, lang: "English"}

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}

Renaming and Computing Fields

Page 38: Aggregation Framework

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}

Creating Sub-Document Fields

$project: { title: 1, stats: { pages: "$pages”, language:

"$language”, } }

Page 39: Aggregation Framework

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}

{ _id: 375, title: "Great Gatsby", stats: { pages: 218, language: "English" }}

Creating Sub-Document Fields

$project: { title: 1, stats: { pages: "$pages”, language:

"$language”, } }

Page 40: Aggregation Framework

$group

• Group documents by an ID– Field reference, object, constant

• Other output fields are computed– $max, $min, $avg, $sum– $addToSet, $push– $first, $last

• Processes all data in memory

Page 41: Aggregation Framework

{ _id: "English", avgPages: 653}

{ _id: "Russian", avgPages: 1440}

{ title: "The Great Gatsby", pages: 218, language: "English"}

{ title: "War and Peace", pages: 1440, language: "Russian"}

{ title: "Atlas Shrugged", pages: 1088, language: "English"}

Calculating an Average

$group: {_id: "$language”, avgPages: { $avg:"$pages" } }

Page 42: Aggregation Framework

{ _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ]}

{ _id: "Russian", titles: [ "War and Peace" ]}

{ title: "The Great Gatsby", pages: 218, language: "English"}

{ title: "War and Peace", pages: 1440, language: "Russian"}

{ title: "Atlas Shrugged", pages: 1088, language: "English"}

Collecting Distinct Values$group: { _id: "$language", titles: { $addToSet: "$title" }}

Page 43: Aggregation Framework

$unwind

• Operate on an array field

• Yield new documents for each array element– Array replaced by element value– Missing/empty fields → no output– Non-array fields → error

• Pipe to $group to aggregate array values

Page 44: Aggregation Framework

$unwind

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island"}

{ $unwind: "$subjects" }

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York"}

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s"}

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ]}

Page 45: Aggregation Framework

$sort, $limit, $skip

• Sort documents by one or more fields– Same order syntax as cursors– Waits for earlier pipeline operator to return– In-memory unless early and indexed

• Limit and skip follow cursor behavior

Page 46: Aggregation Framework

$sort, $limit, $skip

{ $sort: { title: 1 }}

{ title : “Invisible Man” }

{ title: ”Father and Sons" }

{ title: ”Lord of the Flies" }

{ title: ”Animal Farm" }

{ title: ”Grapes of Wrath” }

{ title: ”Brave New World" }

{ title: ”The Great Gatsby" }

{ title : “The Great Gatsby” }

{ title: “Lord of the Flies" }

{ title: ”Invisible Man" }

{ title: ” Grapes of Wrath" }

{ title: ”Fathers and Sons” }

{ title: ”Brave New World" }

{ title: ”Animal Farm" }

Page 47: Aggregation Framework

Sort All the Documents in the Pipeline

{ $skip: 2 }{ $sort: { title: 1 }}

{ title : “The Great Gatsby” }

{ title: “Lord of the Flies" }

{ title: ”Invisible Man" }

{ title: ” Grapes of Wrath" }

{ title: ”Fathers and Sons” }

{ title: ”Brave New World" }

{ title: ”Animal Farm" }

Page 48: Aggregation Framework

Sort All the Documents in the Pipeline

{ $limit: 4 }{ $skip : 2}

{ title : “The Great Gatsby” }

{ title: “Lord of the Flies" }

{ title: ”Invisible Man" }

{ title: ” Grapes of Wrath" }

{ title: ”Fathers and Sons” }

{ title: ”Brave New World" }

{ title: ”Animal Farm" }

Page 49: Aggregation Framework

Usage and Limitations

Page 50: Aggregation Framework

Usage

• collection.aggregate() method– Mongo shell– Most drivers

• aggregate database command

Page 51: Aggregation Framework

db.runCommand({ aggregate: "books", pipeline: [ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ]})

{ result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1}

Database CommandCollection

db.books.aggregate([ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}}])

Page 52: Aggregation Framework

Limitations

• Result limited by BSON document size– Final command result– Intermediate shard results

• Pipeline operator memory limits

• Some BSON types unsupported– Binary, Code, deprecated types

Page 53: Aggregation Framework

Sharding

Page 54: Aggregation Framework

Sharding

mongos

client

Shard CShard BShard A$match$project$group

$match$project$group

$match: { /* filter by shard key */ }

Page 55: Aggregation Framework

Shardingclient

Shard BShard A

$match$project$group

$match$project$group

mongos$group

Shard C

Page 56: Aggregation Framework

Nice, but…

• Limited parallelism

• No access to analytics libraries

• Separation of concerns

• Need to integrate with existing tool chains

Page 57: Aggregation Framework

Hadoop Connector

Page 58: Aggregation Framework

Scaling MongoDB

Client Applicatio

n

Single InstanceOr

Replica Set

MongoDB

Sharded cluster

Page 59: Aggregation Framework

The Mechanism of Sharding

Complete Data Set

Fathers & Sons Lord of the FliesAnimal Farm Invisible ManBrave New World

Define shard key on title

Page 60: Aggregation Framework

The Mechanism of Sharding

Chunk Chunk

Fathers & Sons Lord of the FliesAnimal Farm Invisible ManBrave New World

Define shard key on title

Page 61: Aggregation Framework

The Mechanism of ShardingChunk Chunk ChunkChunk

Fathers & Sons Lord of the FliesAnimal Farm Invisible ManBrave New World

Define shard key on title

Page 62: Aggregation Framework

Chunk Chunk ChunkChunk

Shard 1 Shard 2 Shard 3 Shard 4

Fathers & Sons Lord of the FliesAnimal Farm Invisible ManBrave New World

Define shard key on title

Page 63: Aggregation Framework

Shard 1 Shard 2 Shard 3 Shard 4

Data Growth

Page 64: Aggregation Framework

Shard 1 Shard 2 Shard 3 Shard 4

Load Balancing

Page 65: Aggregation Framework

Processing Big Data

Map Reduce

Map Reduce

Map Reduce

Map Reduce

Page 66: Aggregation Framework

Processing Big Data

Need to break data into smaller pieces

Process data across multiple nodes

HadoopHadoo

pHadoo

pHadoo

p

Hadoop

Hadoop

Hadoop

Hadoop

Hadoop

Hadoop

Page 67: Aggregation Framework

Input splits on Unshared Systems

Single Map

Reduce

Single InstanceOr

Replica Set

Hadoop

Hadoop

Hadoop

Hadoop

Hadoop

Hadoop

Hadoop

Hadoop

Hadoop

Hadoop

Total Dataset

Page 68: Aggregation Framework

Hadoop Connector in Javafinal Configuration conf = new Configuration(); MongoConfigUtil.setInputURI( conf, "mongodb://localhost/test.in" );MongoConfigUtil.setOutputURI( conf, "mongodb://localhost/test.out" );

final Job job = new Job( conf, "word count" );

job.setMapperClass( TokenizerMapper.class );job.setReducerClass( IntSumReducer.class );…

Page 69: Aggregation Framework

MongoDB-Hadoop Quickstart https://github.com/mongodb/mongo-hadoop

Page 70: Aggregation Framework

MongoDB-Hadoop Quickstart https://github.com/mongodb/mongo-hadoop

$ ./SBT package

Page 71: Aggregation Framework

MongoDB-Hadoop Quickstart https://github.com/mongodb/mongo-hadoop

$ ./SBT package

$ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar \ ../hadoop/1.0.1/libexec/lib/

Page 72: Aggregation Framework

MongoDB-Hadoop Quickstart https://github.com/mongodb/mongo-hadoop

$ ./SBT package

$ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar \ ../hadoop/1.0.1/libexec/lib/

$ cp wordcount.jar ../hadoop/1.0.1/libexec/lib/

Page 73: Aggregation Framework

https://github.com/mongodb/mongo-hadoop

$ ./SBT package

$ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar \ ../hadoop/1.0.1/libexec/lib/

$ cp wordcount.jar ../hadoop/1.0.1/libexec/lib/

MongoDB-Hadoop Quickstart

Page 74: Aggregation Framework

https://github.com/mongodb/mongo-hadoop

$ ./SBT package

$ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar \ ../hadoop/1.0.1/libexec/lib/

$ cp wordcount.jar ../hadoop/1.0.1/libexec/lib/

ROCK AND ROLL!$ bin/hadoop com.xgen.WordCount

MongoDB-Hadoop Quickstart

Page 75: Aggregation Framework

Engineer, 10gen

Bryan Reinero

Thank You