Aggregation Framework


MapReduce, The Aggregation Framework & The Hadoop Connector

Bryan Reinero
Engineer, 10gen

Data Warehousing

• We're storing our data in MongoDB

• We need reporting, grouping, common aggregations, etc.

• What are we using for this?

• SQL for reporting and analytics
  – Infrastructure complications
  – Additional maintenance
  – Data duplication
  – ETL processes
  – Real time?

MapReduce

[Diagram: worker threads call map() over the data set; workers then call reduce() to combine the mapped values into the output.]

Our Example Data

{ _id: 375,
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  chapters: 9,
  subjects: [ "Long Island", "New York", "1920s" ],
  language: "English" }

MapReduce

db.books.mapReduce( map, reduce, { finalize: finalize, out: { inline: 1 } } )

function map() {
    var key = this.language;
    emit( key, { totalPages: this.pages, numBooks: 1 } );
}

function reduce(key, values) {
    var result = { numBooks: 0, totalPages: 0 };
    values.forEach(function (value) {
        result.numBooks += value.numBooks;
        result.totalPages += value.totalPages;
    });
    return result;
}

function finalize(key, value) {
    if ( value.numBooks != 0 )
        return value.totalPages / value.numBooks;
}

"results" : [
    { "_id" : "English", "value" : 653 },
    { "_id" : "Russian", "value" : 1440 }
]

MapReduce

[Diagram: a three-node replica set (one primary, two secondaries). CRUD operations go to the primary; MapReduce jobs are routed to a dedicated secondary tagged for analysis.]

tags: { workload: "prod" }
tags: { workload: "prod" }
tags: { workload: "analysis" }, priority: 0
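To set this up in the mongo shell, tag the members and then read from the tagged secondary. A minimal sketch, assuming a three-member replica set where members[2] is the analytics node (the member index and tag names here are illustrative):

// Tag two members for production traffic and one for analytics;
// priority: 0 keeps the analytics node from ever being elected primary.
cfg = rs.conf()
cfg.members[0].tags = { workload: "prod" }
cfg.members[1].tags = { workload: "prod" }
cfg.members[2].tags = { workload: "analysis" }
cfg.members[2].priority = 0
rs.reconfig(cfg)

// Direct reads (e.g., a mapReduce with inline output) at the analytics node:
db.getMongo().setReadPref("secondary", [ { workload: "analysis" } ])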

MapReduce in MongoDB

• Implemented with JavaScript
  – Single-threaded
  – Difficult to debug

• Concurrency
  – Appearance of parallelism
  – Write locks

MapReduce

• Versatile, powerful

• Intended for complex data analysis

• Overkill for simple aggregations

Aggregation Framework

• Declared in JSON, executes in C++

• Flexible, functional, and simple

• Plays nice with sharding

Pipeline

Piping command line operations:

    ps ax | grep mongod | head -1

Piping aggregation operations:

    $match | $group | $sort

A stream of documents flows in; a result document comes out.
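The analogy maps directly onto the shell helper: each stage is a document in an array, applied in order. A minimal sketch against the books collection (the stage choice here is illustrative):

db.books.aggregate([
    { $match: { language: "English" } },               // filter the stream
    { $group: { _id: "$language",
                numBooks: { $sum: 1 } } },             // aggregate what's left
    { $sort:  { _id: 1 } }                             // order the results
])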

Pipeline Operators

• $match

• $project

• $group

• $unwind

• $sort

• $limit

• $skip

$match

• Filter documents

• Uses existing query syntax

• No geospatial operations or $where

{ $match: { language: "Russian" } }

Input:

{ title: "The Great Gatsby", pages: 218, language: "English" }
{ title: "War and Peace", pages: 1440, language: "Russian" }
{ title: "Atlas Shrugged", pages: 1088, language: "English" }

Output (only the Russian title passes the filter):

{ title: "War and Peace", pages: 1440, language: "Russian" }

Match stages can be chained:

{ $match: { pages: { $gt: 1000 } } }
{ $match: { language: "Russian" } }
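In the shell, each $match is just a pipeline stage. A sketch against the same books collection:

db.books.aggregate([
    { $match: { pages: { $gt: 1000 } } },    // long books only...
    { $match: { language: "Russian" } }      // ...that are in Russian
])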

$project

• Reshape documents

• Include, exclude or rename fields

• Inject computed fields

• Create sub-document fields

{ title: "Great Gatsby", language: "English"}

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}

Selecting and Excluding Fields

$project: { _id: 0, title: 1, language: 1 }

Renaming and Computing Fields

{ $project: {
    avgChapterLength: { $divide: [ "$pages", "$chapters" ] },
    lang: "$language"
}}

avgChapterLength is a new computed field: $divide is the operation, "$pages" the dividend, "$chapters" the divisor. lang renames language.

Input:

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }

Output:

{ _id: 375, avgChapterLength: 24.2222, lang: "English" }

Creating Sub-Document Fields

$project: { title: 1, stats: { pages: "$pages", language: "$language" } }

Input:

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }

Output:

{ _id: 375, title: "Great Gatsby", stats: { pages: 218, language: "English" } }

$group

• Group documents by an ID
  – Field reference, object, constant

• Other output fields are computed
  – $max, $min, $avg, $sum
  – $addToSet, $push
  – $first, $last

• Processes all data in memory

{ _id: "English", avgPages: 653}

{ _id: "Russian", avgPages: 1440}

{ title: "The Great Gatsby", pages: 218, language: "English"}

{ title: "War and Peace", pages: 1440, language: "Russian"}

{ title: "Atlas Shrugged", pages: 1088, language: "English"}

Calculating an Average

$group: {_id: "$language”, avgPages: { $avg:"$pages" } }

{ _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ]}

{ _id: "Russian", titles: [ "War and Peace" ]}

{ title: "The Great Gatsby", pages: 218, language: "English"}

{ title: "War and Peace", pages: 1440, language: "Russian"}

{ title: "Atlas Shrugged", pages: 1088, language: "English"}

Collecting Distinct Values$group: { _id: "$language", titles: { $addToSet: "$title" }}
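Both groupings run as ordinary pipeline stages. A sketch:

db.books.aggregate([
    { $group: { _id: "$language", avgPages: { $avg: "$pages" } } }
])

db.books.aggregate([
    { $group: { _id: "$language", titles: { $addToSet: "$title" } } }
])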

$unwind

• Operate on an array field

• Yield new documents for each array element
  – Array replaced by element value
  – Missing/empty fields → no output
  – Non-array fields → error

• Pipe to $group to aggregate array values (see the sketch after the example below)

$unwind

{ $unwind: "$subjects" }

Input:

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] }

Output:

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" }
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" }
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s" }

$sort, $limit, $skip

• Sort documents by one or more fields
  – Same order syntax as cursors
  – Waits for earlier pipeline operator to return
  – In-memory unless early and indexed

• Limit and skip follow cursor behavior

Sort All the Documents in the Pipeline

{ $sort: { title: 1 } }

Input:

{ title: "Invisible Man" }
{ title: "Fathers and Sons" }
{ title: "Lord of the Flies" }
{ title: "Animal Farm" }
{ title: "Grapes of Wrath" }
{ title: "Brave New World" }
{ title: "The Great Gatsby" }

Output (ascending by title):

{ title: "Animal Farm" }
{ title: "Brave New World" }
{ title: "Fathers and Sons" }
{ title: "Grapes of Wrath" }
{ title: "Invisible Man" }
{ title: "Lord of the Flies" }
{ title: "The Great Gatsby" }

Skip Documents in the Pipeline

{ $sort: { title: 1 } } | { $skip: 2 }

Output:

{ title: "Fathers and Sons" }
{ title: "Grapes of Wrath" }
{ title: "Invisible Man" }
{ title: "Lord of the Flies" }
{ title: "The Great Gatsby" }

Limit the Documents in the Pipeline

{ $sort: { title: 1 } } | { $skip: 2 } | { $limit: 4 }

Output:

{ title: "Fathers and Sons" }
{ title: "Grapes of Wrath" }
{ title: "Invisible Man" }
{ title: "Lord of the Flies" }
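As one pipeline in the shell (a sketch; the stages execute in order):

db.books.aggregate([
    { $sort:  { title: 1 } },   // alphabetical by title
    { $skip:  2 },              // drop the first two
    { $limit: 4 }               // keep the next four
])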

Usage and Limitations

Usage

• collection.aggregate() method
  – Mongo shell
  – Most drivers

• aggregate database command

Database Command:

db.runCommand({
    aggregate: "books",
    pipeline: [
        { $project: { language: 1 } },
        { $group: { _id: "$language", numTitles: { $sum: 1 } } }
    ]
})

Collection Method:

db.books.aggregate([
    { $project: { language: 1 } },
    { $group: { _id: "$language", numTitles: { $sum: 1 } } }
])

Result:

{ result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1 }

Limitations

• Result limited by BSON document size
  – Final command result
  – Intermediate shard results

• Pipeline operator memory limits

• Some BSON types unsupported
  – Binary, Code, deprecated types

Sharding

[Diagram: the client sends the pipeline to mongos, which forwards the $match, $project, and $group stages to Shard A, Shard B, and Shard C; mongos merges the shard results with a final $group.]

$match: { /* filter by shard key */ }

[Diagram: when the $match filters on the shard key, mongos targets only the shards holding matching chunks (Shard A and Shard B); Shard C is skipped entirely.]
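A sketch of a shard-key-targeted pipeline, assuming (hypothetically, for illustration) that books is sharded on language:

db.books.aggregate([
    { $match: { language: "Russian" } },     // shard-key filter: mongos targets only the owning shard(s)
    { $project: { language: 1, pages: 1 } },
    { $group: { _id: "$language", avgPages: { $avg: "$pages" } } }
])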

Nice, but…

• Limited parallelism

• No access to analytics libraries

• Separation of concerns

• Need to integrate with existing tool chains

Hadoop Connector

Scaling MongoDB

[Diagram: a client application talks to MongoDB, deployed as a single instance, a replica set, or a sharded cluster.]

The Mechanism of Sharding

[Diagram: a shard key is defined on title. The complete data set (Animal Farm, Brave New World, Fathers & Sons, Invisible Man, Lord of the Flies) is partitioned into chunks, and the chunks are distributed across Shard 1, Shard 2, Shard 3, and Shard 4. As data grows, chunks split and migrate between shards to keep the load balanced.]

Processing Big Data

[Diagram: the data set is processed by many MapReduce workers in parallel.]

• Need to break data into smaller pieces

• Process data across multiple nodes

Input Splits on Unsharded Systems

[Diagram: without splits, a single instance or replica set would feed a single map and reduce. The connector computes input splits over the total dataset so that many Hadoop map tasks can run in parallel.]

Hadoop Connector in Java

// Point the job's input and output at MongoDB collections.
final Configuration conf = new Configuration();
MongoConfigUtil.setInputURI( conf, "mongodb://localhost/test.in" );
MongoConfigUtil.setOutputURI( conf, "mongodb://localhost/test.out" );

final Job job = new Job( conf, "word count" );
job.setMapperClass( TokenizerMapper.class );
job.setReducerClass( IntSumReducer.class );
…

MongoDB-Hadoop Quickstart
https://github.com/mongodb/mongo-hadoop

$ ./sbt package

$ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar \
    ../hadoop/1.0.1/libexec/lib/

$ cp wordcount.jar ../hadoop/1.0.1/libexec/lib/

$ bin/hadoop com.xgen.WordCount

ROCK AND ROLL!

Thank You

Bryan Reinero
Engineer, 10gen
