TRANSCRIPT

MapReduce, the Aggregation Framework & the Hadoop Connector
Bryan Reinero
Engineer, 10gen
Data Warehousing

• We're storing our data in MongoDB
• We need reporting, grouping, common aggregations, etc.
• What are we using for this?
• SQL for reporting and analytics
• Infrastructure complications
• Additional maintenance
• Data duplication
• ETL processes
• Real time?
MapReduce

[Diagram: worker threads call the mapper over the data set; workers then call Reduce() to produce the output]
Our Example Data

{
  _id: 375,
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  chapters: 9,
  subjects: [ "Long Island", "New York", "1920s" ],
  language: "English"
}
MapReduce

db.books.mapReduce( map, reduce, { finalize: finalize, out: { inline: 1 } } )

function map() {
  var key = this.language;
  emit( key, { totalPages: this.pages, numBooks: 1 } );
}

function reduce(key, values) {
  var result = { numBooks: 0, totalPages: 0 };
  values.forEach(function (value) {
    result.numBooks += value.numBooks;
    result.totalPages += value.totalPages;
  });
  return result;
}

function finalize(key, value) {
  if ( value.numBooks != 0 )
    return value.totalPages / value.numBooks;
}
"results" : [{
"_id" : "English","value" : 653
},{
"_id" : "Russian","value" : 1440
}]
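To see what the three functions compute together, here is a plain-JavaScript simulation of the inline mapReduce over the three example books. This is an illustration of the semantics only, not how mongod actually executes the job (the real job runs inside the server's JavaScript engine):

```javascript
// Sample documents matching the example data in the deck.
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

function mapReduceInline(docs) {
  // Map phase: emit (language, { totalPages, numBooks: 1 }) per document.
  const groups = new Map();
  for (const doc of docs) {
    const key = doc.language;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push({ totalPages: doc.pages, numBooks: 1 });
  }
  const results = [];
  for (const [key, values] of groups) {
    // Reduce phase: fold each key's emitted values into one value.
    const result = { numBooks: 0, totalPages: 0 };
    values.forEach(function (value) {
      result.numBooks += value.numBooks;
      result.totalPages += value.totalPages;
    });
    // Finalize phase: turn the totals into an average page count.
    const value = result.numBooks !== 0
      ? result.totalPages / result.numBooks
      : undefined;
    results.push({ _id: key, value });
  }
  return results;
}

// English: (218 + 1088) / 2 = 653; Russian: 1440 / 1 = 1440
console.log(mapReduceInline(books));
```

These averages match the "results" array shown above.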
MapReduce

Three-Node Replica Set

[Diagram: a primary and two secondaries. The primary and one secondary are tagged tags: { workload: "prod" } and serve the CRUD operations; the remaining secondary is configured with priority: 0, tags: { workload: "analysis" } and handles the MapReduce jobs]
MapReduce in MongoDB

• Implemented with JavaScript
– Single-threaded
– Difficult to debug
• Concurrency
– Appearance of parallelism
– Write locks
MapReduce

• Versatile, powerful
• Intended for complex data analysis
• Overkill for simple aggregations
Aggregation Framework

• Declared in JSON, executes in C++
• Flexible, functional, and simple
• Plays nice with sharding
Pipeline

Piping command line operations:

ps ax | grep mongod | head -1

Piping aggregation operations:

$match | $group | $sort

Stream of documents → Result document
Pipeline Operators
• $match
• $project
• $group
• $unwind
• $sort
• $limit
• $skip
$match
• Filter documents
• Uses existing query syntax
• No geospatial operations or $where
{ title: "The Great Gatsby", pages: 218, language: "English" }
{ title: "War and Peace", pages: 1440, language: "Russian" }
{ title: "Atlas Shrugged", pages: 1088, language: "English" }

{ $match: { language: "Russian" } }
{ $match: { pages: { $gt: 1000 } } }
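Conceptually, $match acts as a predicate filter over the stream of documents. A rough plain-JavaScript analogue of the two filters above (an illustration only, not the server's query engine):

```javascript
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// { $match: { language: "Russian" } } keeps only matching documents.
const russian = books.filter(b => b.language === "Russian");

// { $match: { pages: { $gt: 1000 } } } keeps documents over 1000 pages.
const long = books.filter(b => b.pages > 1000);

console.log(russian.map(b => b.title)); // [ 'War and Peace' ]
console.log(long.map(b => b.title));    // [ 'War and Peace', 'Atlas Shrugged' ]
```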
$project
• Reshape documents
• Include, exclude or rename fields
• Inject computed fields
• Create sub-document fields
{ title: "Great Gatsby", language: "English"}
{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}
Selecting and Excluding Fields
$project: { _id: 0, title: 1, language: 1 }
Renaming and Computing Fields

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }

{ $project: { avgChapterLength: { $divide: [ "$pages", "$chapters" ] }, lang: "$language" } }

(avgChapterLength is a new computed field; $divide takes [ dividend, divisor ])

{ _id: 375, avgChapterLength: 24.2222, lang: "English" }
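The same reshaping can be sketched in plain JavaScript (illustration only): compute a new field from two existing ones, and rename a field via a field reference:

```javascript
const doc = {
  _id: 375, title: "Great Gatsby", pages: 218, chapters: 9, language: "English",
};

// Rough analogue of the $project stage above.
const projected = {
  _id: doc._id,
  avgChapterLength: doc.pages / doc.chapters, // $divide: [ "$pages", "$chapters" ]
  lang: doc.language,                         // rename via field reference "$language"
};

console.log(projected.avgChapterLength.toFixed(4)); // 24.2222
```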
Creating Sub-Document Fields

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }

$project: { title: 1, stats: { pages: "$pages", language: "$language" } }

{ _id: 375, title: "Great Gatsby", stats: { pages: 218, language: "English" } }
$group
• Group documents by an ID
– Field reference, object, constant
• Other output fields are computed
– $max, $min, $avg, $sum
– $addToSet, $push
– $first, $last
• Processes all data in memory
{ _id: "English", avgPages: 653}
{ _id: "Russian", avgPages: 1440}
{ title: "The Great Gatsby", pages: 218, language: "English"}
{ title: "War and Peace", pages: 1440, language: "Russian"}
{ title: "Atlas Shrugged", pages: 1088, language: "English"}
Calculating an Average
$group: {_id: "$language”, avgPages: { $avg:"$pages" } }
{ _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ]}
{ _id: "Russian", titles: [ "War and Peace" ]}
{ title: "The Great Gatsby", pages: 218, language: "English"}
{ title: "War and Peace", pages: 1440, language: "Russian"}
{ title: "Atlas Shrugged", pages: 1088, language: "English"}
Collecting Distinct Values$group: { _id: "$language", titles: { $addToSet: "$title" }}
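Both $group examples boil down to folding the document stream into per-key accumulators. A plain-JavaScript sketch of the $avg and $addToSet accumulators (illustration only; the real stage runs in C++ inside the server):

```javascript
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// Rough analogue of $group with an $avg accumulator.
function groupAvgPages(docs) {
  const acc = new Map(); // _id -> { sum, count }
  for (const d of docs) {
    const a = acc.get(d.language) || { sum: 0, count: 0 };
    a.sum += d.pages;
    a.count += 1;
    acc.set(d.language, a);
  }
  return [...acc].map(([id, a]) => ({ _id: id, avgPages: a.sum / a.count }));
}

// Rough analogue of $addToSet: collect distinct titles per language.
function groupTitles(docs) {
  const acc = new Map(); // _id -> Set of titles
  for (const d of docs) {
    if (!acc.has(d.language)) acc.set(d.language, new Set());
    acc.get(d.language).add(d.title);
  }
  return [...acc].map(([id, s]) => ({ _id: id, titles: [...s] }));
}

console.log(groupAvgPages(books));
console.log(groupTitles(books));
```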
$unwind
• Operate on an array field
• Yield new documents for each array element
– Array replaced by element value
– Missing/empty fields → no output
– Non-array fields → error
• Pipe to $group to aggregate array values
$unwind

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] }

{ $unwind: "$subjects" }

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" }
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" }
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s" }
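The unwinding step above can be sketched as a flatMap in plain JavaScript (illustration only; the non-array error case is omitted, and a missing field simply yields no output, as the bullets state):

```javascript
// Rough analogue of $unwind: one output document per array element,
// with the array field replaced by that element's value.
function unwind(docs, field) {
  return docs.flatMap(doc =>
    (doc[field] || []).map(el => ({ ...doc, [field]: el }))
  );
}

const out = unwind(
  [{ title: "The Great Gatsby", subjects: ["Long Island", "New York", "1920s"] }],
  "subjects"
);
console.log(out.length); // 3
console.log(out[1]);     // { title: 'The Great Gatsby', subjects: 'New York' }
```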
$sort, $limit, $skip
• Sort documents by one or more fields
– Same order syntax as cursors
– Waits for earlier pipeline operator to return
– In-memory unless early and indexed
• Limit and skip follow cursor behavior
Sort All the Documents in the Pipeline

{ title: "The Great Gatsby" }
{ title: "Lord of the Flies" }
{ title: "Invisible Man" }
{ title: "Grapes of Wrath" }
{ title: "Fathers and Sons" }
{ title: "Brave New World" }
{ title: "Animal Farm" }

{ $sort: { title: 1 } }

{ title: "Animal Farm" }
{ title: "Brave New World" }
{ title: "Fathers and Sons" }
{ title: "Grapes of Wrath" }
{ title: "Invisible Man" }
{ title: "Lord of the Flies" }
{ title: "The Great Gatsby" }

{ $sort: { title: 1 } } | { $skip: 2 }

{ $sort: { title: 1 } } | { $skip: 2 } | { $limit: 4 }
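The three stages compose like ordinary array operations. A plain-JavaScript sketch of $sort → $skip → $limit over the titles above (illustration only):

```javascript
const titles = [
  "The Great Gatsby", "Lord of the Flies", "Invisible Man",
  "Grapes of Wrath", "Fathers and Sons", "Brave New World", "Animal Farm",
];

const result = titles
  .slice()      // copy, so the input stream is left untouched
  .sort()       // { $sort: { title: 1 } }
  .slice(2)     // { $skip: 2 }
  .slice(0, 4); // { $limit: 4 }

console.log(result);
// [ 'Fathers and Sons', 'Grapes of Wrath', 'Invisible Man', 'Lord of the Flies' ]
```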
Usage and Limitations
Usage

• collection.aggregate() method
– Mongo shell
– Most drivers
• aggregate database command

Collection Method

db.books.aggregate([
  { $project: { language: 1 } },
  { $group: { _id: "$language", numTitles: { $sum: 1 } } }
])

Database Command

db.runCommand({
  aggregate: "books",
  pipeline: [
    { $project: { language: 1 } },
    { $group: { _id: "$language", numTitles: { $sum: 1 } } }
  ]
})

{ result: [
    { _id: "Russian", numTitles: 1 },
    { _id: "English", numTitles: 2 }
  ], ok: 1 }
Limitations
• Result limited by BSON document size
– Final command result
– Intermediate shard results
• Pipeline operator memory limits
• Some BSON types unsupported
– Binary, Code, deprecated types
Sharding

[Diagram: the client sends the pipeline to mongos, which forwards $match, $project, and $group to each of Shard A, Shard B, and Shard C]

$match: { /* filter by shard key */ }

[Diagram: with a $match on the shard key, mongos targets only the shards holding matching chunks (Shard A and Shard B); each runs $match, $project, and $group locally, and mongos runs a final $group to merge the shard results]
Nice, but…
• Limited parallelism
• No access to analytics libraries
• Separation of concerns
• Need to integrate with existing tool chains
Hadoop Connector

Scaling MongoDB

[Diagram: a client application talks to MongoDB, deployed as a single instance, a replica set, or a sharded cluster]
The Mechanism of Sharding

Complete Data Set: Fathers & Sons, Lord of the Flies, Animal Farm, Invisible Man, Brave New World

Define shard key on title

[Diagram: the collection is partitioned into chunks by shard-key range, and the chunks are distributed across Shard 1, Shard 2, Shard 3, and Shard 4]

Data Growth / Load Balancing

[Diagram: as data grows, chunks split and migrate between the four shards to keep the cluster balanced]
Processing Big Data

• Need to break data into smaller pieces
• Process data across multiple nodes

[Diagram: many Map Reduce tasks running in parallel across Hadoop nodes]

Input Splits on Unsharded Systems

[Diagram: against a single instance or replica set, the total dataset is divided into input splits, each feeding a single map; the maps then feed the reduce]
Hadoop Connector in Java

final Configuration conf = new Configuration();
MongoConfigUtil.setInputURI( conf, "mongodb://localhost/test.in" );
MongoConfigUtil.setOutputURI( conf, "mongodb://localhost/test.out" );

final Job job = new Job( conf, "word count" );
job.setMapperClass( TokenizerMapper.class );
job.setReducerClass( IntSumReducer.class );
…
MongoDB-Hadoop Quickstart

https://github.com/mongodb/mongo-hadoop

$ ./sbt package
$ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar \
    ../hadoop/1.0.1/libexec/lib/
$ cp wordcount.jar ../hadoop/1.0.1/libexec/lib/

ROCK AND ROLL!

$ bin/hadoop com.xgen.WordCount
Thank You

Bryan Reinero
Engineer, 10gen