TRANSCRIPT

MapReduce, the Aggregation Framework & the Hadoop Connector
Bryan Reinero
Engineer, 10gen
Data Warehousing

• We're storing our data in MongoDB
• We need reporting, grouping, common aggregations, etc.
• What are we using for this?
• SQL for reporting and analytics
• Infrastructure complications
• Additional maintenance
• Data duplication
• ETL processes
• Real time?
MapReduce

[Diagram: worker threads call the mapper over the data set; workers then call Reduce() to produce the output]
Our Example Data

{
  _id: 375,
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  chapters: 9,
  subjects: [ "Long Island", "New York", "1920s" ],
  language: "English"
}
MapReduce

db.books.mapReduce( map, reduce, { finalize: finalize, out: { inline: 1 } } )

function map() {
  var key = this.language;
  emit( key, { totalPages: this.pages, numBooks: 1 } );
}

function reduce(key, values) {
  var result = { numBooks: 0, totalPages: 0 };
  values.forEach(function (value) {
    result.numBooks += value.numBooks;
    result.totalPages += value.totalPages;
  });
  return result;
}

function finalize(key, value) {
  if ( value.numBooks != 0 )
    return value.totalPages / value.numBooks;
}
"results" : [{
"_id" : "English","value" : 653
},{
"_id" : "Russian","value" : 1440
}]
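To see what the three functions compute together, here is a plain-JavaScript simulation of the inline mapReduce over the three example books. This is an illustration of the semantics only, not how mongod actually executes the job (the real job runs inside the server's JavaScript engine):

```javascript
// Sample documents matching the example data in the deck.
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

function mapReduceInline(docs) {
  // Map phase: emit (language, { totalPages, numBooks: 1 }) per document.
  const groups = new Map();
  for (const doc of docs) {
    const key = doc.language;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push({ totalPages: doc.pages, numBooks: 1 });
  }
  const results = [];
  for (const [key, values] of groups) {
    // Reduce phase: fold each key's emitted values into one value.
    const result = { numBooks: 0, totalPages: 0 };
    values.forEach(function (value) {
      result.numBooks += value.numBooks;
      result.totalPages += value.totalPages;
    });
    // Finalize phase: turn the totals into an average page count.
    const value = result.numBooks !== 0
      ? result.totalPages / result.numBooks
      : undefined;
    results.push({ _id: key, value });
  }
  return results;
}

// English: (218 + 1088) / 2 = 653; Russian: 1440 / 1 = 1440
console.log(mapReduceInline(books));
```

These averages match the "results" array shown above.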
MapReduce

Three-Node Replica Set

[Diagram: a primary and two secondaries. The primary and one secondary are tagged tags: { workload: "prod" } and serve the CRUD operations; the remaining secondary is configured with priority: 0, tags: { workload: "analysis" } and handles the MapReduce jobs]
MapReduce in MongoDB

• Implemented with JavaScript
– Single-threaded
– Difficult to debug
• Concurrency
– Appearance of parallelism
– Write locks
MapReduce

• Versatile, powerful
• Intended for complex data analysis
• Overkill for simple aggregations
Aggregation Framework

• Declared in JSON, executes in C++
• Flexible, functional, and simple
• Plays nice with sharding
Pipeline

Piping command line operations:

ps ax | grep mongod | head -1

Piping aggregation operations:

$match | $group | $sort

Stream of documents → Result document
Pipeline Operators
• $match
• $project
• $group
• $unwind
• $sort
• $limit
• $skip
$match
• Filter documents
• Uses existing query syntax
• No geospatial operations or $where
{ title: "The Great Gatsby", pages: 218, language: "English" }
{ title: "War and Peace", pages: 1440, language: "Russian" }
{ title: "Atlas Shrugged", pages: 1088, language: "English" }

{ $match: { language: "Russian" } }
{ $match: { pages: { $gt: 1000 } } }
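Conceptually, $match acts as a predicate filter over the stream of documents. A rough plain-JavaScript analogue of the two filters above (an illustration only, not the server's query engine):

```javascript
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// { $match: { language: "Russian" } } keeps only matching documents.
const russian = books.filter(b => b.language === "Russian");

// { $match: { pages: { $gt: 1000 } } } keeps documents over 1000 pages.
const long = books.filter(b => b.pages > 1000);

console.log(russian.map(b => b.title)); // [ 'War and Peace' ]
console.log(long.map(b => b.title));    // [ 'War and Peace', 'Atlas Shrugged' ]
```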
$project
• Reshape documents
• Include, exclude or rename fields
• Inject computed fields
• Create sub-document fields
{ title: "Great Gatsby", language: "English"}
{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}
Selecting and Excluding Fields
$project: { _id: 0, title: 1, language: 1 }
Renaming and Computing Fields

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }

{ $project: { avgChapterLength: { $divide: [ "$pages", "$chapters" ] }, lang: "$language" } }

(avgChapterLength is a new computed field; $divide takes [ dividend, divisor ])

{ _id: 375, avgChapterLength: 24.2222, lang: "English" }
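The same reshaping can be sketched in plain JavaScript (illustration only): compute a new field from two existing ones, and rename a field via a field reference:

```javascript
const doc = {
  _id: 375, title: "Great Gatsby", pages: 218, chapters: 9, language: "English",
};

// Rough analogue of the $project stage above.
const projected = {
  _id: doc._id,
  avgChapterLength: doc.pages / doc.chapters, // $divide: [ "$pages", "$chapters" ]
  lang: doc.language,                         // rename via field reference "$language"
};

console.log(projected.avgChapterLength.toFixed(4)); // 24.2222
```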
Creating Sub-Document Fields

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }

$project: { title: 1, stats: { pages: "$pages", language: "$language" } }

{ _id: 375, title: "Great Gatsby", stats: { pages: 218, language: "English" } }
$group
• Group documents by an ID
– Field reference, object, constant
• Other output fields are computed
– $max, $min, $avg, $sum
– $addToSet, $push
– $first, $last
• Processes all data in memory
{ _id: "English", avgPages: 653}
{ _id: "Russian", avgPages: 1440}
{ title: "The Great Gatsby", pages: 218, language: "English"}
{ title: "War and Peace", pages: 1440, language: "Russian"}
{ title: "Atlas Shrugged", pages: 1088, language: "English"}
Calculating an Average
$group: {_id: "$language”, avgPages: { $avg:"$pages" } }
{ _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ]}
{ _id: "Russian", titles: [ "War and Peace" ]}
{ title: "The Great Gatsby", pages: 218, language: "English"}
{ title: "War and Peace", pages: 1440, language: "Russian"}
{ title: "Atlas Shrugged", pages: 1088, language: "English"}
Collecting Distinct Values$group: { _id: "$language", titles: { $addToSet: "$title" }}
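Both $group examples boil down to folding the document stream into per-key accumulators. A plain-JavaScript sketch of the $avg and $addToSet accumulators (illustration only; the real stage runs in C++ inside the server):

```javascript
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// Rough analogue of $group with an $avg accumulator.
function groupAvgPages(docs) {
  const acc = new Map(); // _id -> { sum, count }
  for (const d of docs) {
    const a = acc.get(d.language) || { sum: 0, count: 0 };
    a.sum += d.pages;
    a.count += 1;
    acc.set(d.language, a);
  }
  return [...acc].map(([id, a]) => ({ _id: id, avgPages: a.sum / a.count }));
}

// Rough analogue of $addToSet: collect distinct titles per language.
function groupTitles(docs) {
  const acc = new Map(); // _id -> Set of titles
  for (const d of docs) {
    if (!acc.has(d.language)) acc.set(d.language, new Set());
    acc.get(d.language).add(d.title);
  }
  return [...acc].map(([id, s]) => ({ _id: id, titles: [...s] }));
}

console.log(groupAvgPages(books));
console.log(groupTitles(books));
```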
$unwind
• Operate on an array field
• Yield new documents for each array element
– Array replaced by element value
– Missing/empty fields → no output
– Non-array fields → error
• Pipe to $group to aggregate array values
$unwind

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] }

{ $unwind: "$subjects" }

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" }
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" }
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s" }
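The unwinding step above can be sketched as a flatMap in plain JavaScript (illustration only; the non-array error case is omitted, and a missing field simply yields no output, as the bullets state):

```javascript
// Rough analogue of $unwind: one output document per array element,
// with the array field replaced by that element's value.
function unwind(docs, field) {
  return docs.flatMap(doc =>
    (doc[field] || []).map(el => ({ ...doc, [field]: el }))
  );
}

const out = unwind(
  [{ title: "The Great Gatsby", subjects: ["Long Island", "New York", "1920s"] }],
  "subjects"
);
console.log(out.length); // 3
console.log(out[1]);     // { title: 'The Great Gatsby', subjects: 'New York' }
```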
$sort, $limit, $skip
• Sort documents by one or more fields
– Same order syntax as cursors
– Waits for earlier pipeline operator to return
– In-memory unless early and indexed
• Limit and skip follow cursor behavior
Sort All the Documents in the Pipeline

{ title: "The Great Gatsby" }
{ title: "Lord of the Flies" }
{ title: "Invisible Man" }
{ title: "Grapes of Wrath" }
{ title: "Fathers and Sons" }
{ title: "Brave New World" }
{ title: "Animal Farm" }

{ $sort: { title: 1 } }

{ title: "Animal Farm" }
{ title: "Brave New World" }
{ title: "Fathers and Sons" }
{ title: "Grapes of Wrath" }
{ title: "Invisible Man" }
{ title: "Lord of the Flies" }
{ title: "The Great Gatsby" }

{ $sort: { title: 1 } } | { $skip: 2 }

{ $sort: { title: 1 } } | { $skip: 2 } | { $limit: 4 }
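The three stages compose like ordinary array operations. A plain-JavaScript sketch of $sort → $skip → $limit over the titles above (illustration only):

```javascript
const titles = [
  "The Great Gatsby", "Lord of the Flies", "Invisible Man",
  "Grapes of Wrath", "Fathers and Sons", "Brave New World", "Animal Farm",
];

const result = titles
  .slice()      // copy, so the input stream is left untouched
  .sort()       // { $sort: { title: 1 } }
  .slice(2)     // { $skip: 2 }
  .slice(0, 4); // { $limit: 4 }

console.log(result);
// [ 'Fathers and Sons', 'Grapes of Wrath', 'Invisible Man', 'Lord of the Flies' ]
```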
Usage and Limitations
Usage

• collection.aggregate() method
– Mongo shell
– Most drivers
• aggregate database command

Collection Method

db.books.aggregate([
  { $project: { language: 1 } },
  { $group: { _id: "$language", numTitles: { $sum: 1 } } }
])

Database Command

db.runCommand({
  aggregate: "books",
  pipeline: [
    { $project: { language: 1 } },
    { $group: { _id: "$language", numTitles: { $sum: 1 } } }
  ]
})

{ result: [
    { _id: "Russian", numTitles: 1 },
    { _id: "English", numTitles: 2 }
  ], ok: 1 }
Limitations
• Result limited by BSON document size
– Final command result
– Intermediate shard results
• Pipeline operator memory limits
• Some BSON types unsupported
– Binary, Code, deprecated types
Sharding

[Diagram: the client sends the pipeline to mongos, which forwards $match, $project, and $group to each of Shard A, Shard B, and Shard C]

$match: { /* filter by shard key */ }

[Diagram: with a $match on the shard key, mongos targets only the shards holding matching chunks (Shard A and Shard B); each runs $match, $project, and $group locally, and mongos runs a final $group to merge the shard results]
Nice, but…
• Limited parallelism
• No access to analytics libraries
• Separation of concerns
• Need to integrate with existing tool chains
Hadoop Connector

Scaling MongoDB

[Diagram: a client application talks to MongoDB, deployed as a single instance, a replica set, or a sharded cluster]
The Mechanism of Sharding

Complete Data Set: Fathers & Sons, Lord of the Flies, Animal Farm, Invisible Man, Brave New World

Define shard key on title

[Diagram: the collection is partitioned into chunks by shard-key range, and the chunks are distributed across Shard 1, Shard 2, Shard 3, and Shard 4]

Data Growth / Load Balancing

[Diagram: as data grows, chunks split and migrate between the four shards to keep the cluster balanced]
Processing Big Data

• Need to break data into smaller pieces
• Process data across multiple nodes

[Diagram: many Map Reduce tasks running in parallel across Hadoop nodes]

Input Splits on Unsharded Systems

[Diagram: against a single instance or replica set, the total dataset is divided into input splits, each feeding a single map; the maps then feed the reduce]
Hadoop Connector in Java

final Configuration conf = new Configuration();
MongoConfigUtil.setInputURI( conf, "mongodb://localhost/test.in" );
MongoConfigUtil.setOutputURI( conf, "mongodb://localhost/test.out" );

final Job job = new Job( conf, "word count" );
job.setMapperClass( TokenizerMapper.class );
job.setReducerClass( IntSumReducer.class );
…
MongoDB-Hadoop Quickstart

https://github.com/mongodb/mongo-hadoop

$ ./sbt package
$ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar \
    ../hadoop/1.0.1/libexec/lib/
$ cp wordcount.jar ../hadoop/1.0.1/libexec/lib/

ROCK AND ROLL!

$ bin/hadoop com.xgen.WordCount
Thank You

Bryan Reinero
Engineer, 10gen