Aggregation Framework


MapReduce, The Aggregation Framework & The Hadoop Connector

Bryan Reinero
Engineer, 10gen

Data Warehousing

• We're storing our data in MongoDB

• We need reporting, grouping, common aggregations, etc.

• What are we using for this?

• SQL for reporting and analytics
  – Infrastructure complications
  – Additional maintenance
  – Data duplication
  – ETL processes
  – Real time?

MapReduce

[Diagram: worker threads call map() over the data set; workers then call reduce() to combine the mapped values into the output.]

Our Example Data

{ _id: 375,
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  chapters: 9,
  subjects: [ "Long Island", "New York", "1920s" ],
  language: "English" }

MapReduce

db.books.mapReduce( map, reduce, { finalize: finalize, out: { inline: 1 } } )

function map() {
    var key = this.language;
    emit( key, { totalPages: this.pages, numBooks: 1 } );
}

function reduce(key, values) {
    var result = { numBooks: 0, totalPages: 0 };
    values.forEach(function (value) {
        result.numBooks += value.numBooks;
        result.totalPages += value.totalPages;
    });
    return result;
}

function finalize(key, value) {
    if ( value.numBooks != 0 )
        return value.totalPages / value.numBooks;
}

"results" : [
    { "_id" : "English", "value" : 653 },
    { "_id" : "Russian", "value" : 1440 }
]

MapReduce

[Diagram: a three-node replica set (one primary, two secondaries). CRUD operations go to the primary; MapReduce jobs are routed to a dedicated secondary tagged for analysis.]

tags: { workload: "prod" }
tags: { workload: "prod" }
tags: { workload: "analysis" }, priority: 0
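To set this up in the mongo shell, tag the members and then read from the tagged secondary. A minimal sketch, assuming a three-member replica set where members[2] is the analytics node (the member index and tag names here are illustrative):

// Tag two members for production traffic and one for analytics;
// priority: 0 keeps the analytics node from ever being elected primary.
cfg = rs.conf()
cfg.members[0].tags = { workload: "prod" }
cfg.members[1].tags = { workload: "prod" }
cfg.members[2].tags = { workload: "analysis" }
cfg.members[2].priority = 0
rs.reconfig(cfg)

// Direct reads (e.g., a mapReduce with inline output) at the analytics node:
db.getMongo().setReadPref("secondary", [ { workload: "analysis" } ])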

MapReduce in MongoDB

• Implemented with JavaScript
  – Single-threaded
  – Difficult to debug

• Concurrency
  – Appearance of parallelism
  – Write locks

MapReduce

• Versatile, powerful

• Intended for complex data analysis

• Overkill for simple aggregations

Aggregation Framework

• Declared in JSON, executes in C++

• Flexible, functional, and simple

• Plays nice with sharding

Pipeline

Piping command line operations:

    ps ax | grep mongod | head -1

Piping aggregation operations:

    $match | $group | $sort

A stream of documents flows in; a result document comes out.
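The analogy maps directly onto the shell helper: each stage is a document in an array, applied in order. A minimal sketch against the books collection (the stage choice here is illustrative):

db.books.aggregate([
    { $match: { language: "English" } },               // filter the stream
    { $group: { _id: "$language",
                numBooks: { $sum: 1 } } },             // aggregate what's left
    { $sort:  { _id: 1 } }                             // order the results
])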

Pipeline Operators

• $match

• $project

• $group

• $unwind

• $sort

• $limit

• $skip

$match

• Filter documents

• Uses existing query syntax

• No geospatial operations or $where

{ $match: { language: "Russian" } }

Input:

{ title: "The Great Gatsby", pages: 218, language: "English" }
{ title: "War and Peace", pages: 1440, language: "Russian" }
{ title: "Atlas Shrugged", pages: 1088, language: "English" }

Output (only the Russian title passes the filter):

{ title: "War and Peace", pages: 1440, language: "Russian" }

Match stages can be chained:

{ $match: { pages: { $gt: 1000 } } }
{ $match: { language: "Russian" } }
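In the shell, each $match is just a pipeline stage. A sketch against the same books collection:

db.books.aggregate([
    { $match: { pages: { $gt: 1000 } } },    // long books only...
    { $match: { language: "Russian" } }      // ...that are in Russian
])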

$project

• Reshape documents

• Include, exclude or rename fields

• Inject computed fields

• Create sub-document fields

{ title: "Great Gatsby", language: "English"}

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}

Selecting and Excluding Fields

$project: { _id: 0, title: 1, language: 1 }

Renaming and Computing Fields

{ $project: {
    avgChapterLength: { $divide: [ "$pages", "$chapters" ] },
    lang: "$language"
}}

avgChapterLength is a new computed field: $divide is the operation, "$pages" the dividend, "$chapters" the divisor. lang renames language.

Input:

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }

Output:

{ _id: 375, avgChapterLength: 24.2222, lang: "English" }

Creating Sub-Document Fields

$project: { title: 1, stats: { pages: "$pages", language: "$language" } }

Input:

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }

Output:

{ _id: 375, title: "Great Gatsby", stats: { pages: 218, language: "English" } }

$group

• Group documents by an ID
  – Field reference, object, constant

• Other output fields are computed
  – $max, $min, $avg, $sum
  – $addToSet, $push
  – $first, $last

• Processes all data in memory

{ _id: "English", avgPages: 653}

{ _id: "Russian", avgPages: 1440}

{ title: "The Great Gatsby", pages: 218, language: "English"}

{ title: "War and Peace", pages: 1440, language: "Russian"}

{ title: "Atlas Shrugged", pages: 1088, language: "English"}

Calculating an Average

$group: {_id: "$language”, avgPages: { $avg:"$pages" } }

{ _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ]}

{ _id: "Russian", titles: [ "War and Peace" ]}

{ title: "The Great Gatsby", pages: 218, language: "English"}

{ title: "War and Peace", pages: 1440, language: "Russian"}

{ title: "Atlas Shrugged", pages: 1088, language: "English"}

Collecting Distinct Values$group: { _id: "$language", titles: { $addToSet: "$title" }}
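Both groupings run as ordinary pipeline stages. A sketch:

db.books.aggregate([
    { $group: { _id: "$language", avgPages: { $avg: "$pages" } } }
])

db.books.aggregate([
    { $group: { _id: "$language", titles: { $addToSet: "$title" } } }
])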

$unwind

• Operate on an array field

• Yield new documents for each array element
  – Array replaced by element value
  – Missing/empty fields → no output
  – Non-array fields → error

• Pipe to $group to aggregate array values (see the sketch after the example below)

$unwind

{ $unwind: "$subjects" }

Input:

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] }

Output:

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" }
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" }
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s" }

$sort, $limit, $skip

• Sort documents by one or more fields
  – Same order syntax as cursors
  – Waits for earlier pipeline operator to return
  – In-memory unless early and indexed

• Limit and skip follow cursor behavior

Sort All the Documents in the Pipeline

{ $sort: { title: 1 } }

Input:

{ title: "Invisible Man" }
{ title: "Fathers and Sons" }
{ title: "Lord of the Flies" }
{ title: "Animal Farm" }
{ title: "Grapes of Wrath" }
{ title: "Brave New World" }
{ title: "The Great Gatsby" }

Output (ascending by title):

{ title: "Animal Farm" }
{ title: "Brave New World" }
{ title: "Fathers and Sons" }
{ title: "Grapes of Wrath" }
{ title: "Invisible Man" }
{ title: "Lord of the Flies" }
{ title: "The Great Gatsby" }

Skip Documents in the Pipeline

{ $sort: { title: 1 } } | { $skip: 2 }

Output:

{ title: "Fathers and Sons" }
{ title: "Grapes of Wrath" }
{ title: "Invisible Man" }
{ title: "Lord of the Flies" }
{ title: "The Great Gatsby" }

Limit the Documents in the Pipeline

{ $sort: { title: 1 } } | { $skip: 2 } | { $limit: 4 }

Output:

{ title: "Fathers and Sons" }
{ title: "Grapes of Wrath" }
{ title: "Invisible Man" }
{ title: "Lord of the Flies" }
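As one pipeline in the shell (a sketch; the stages execute in order):

db.books.aggregate([
    { $sort:  { title: 1 } },   // alphabetical by title
    { $skip:  2 },              // drop the first two
    { $limit: 4 }               // keep the next four
])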

Usage and Limitations

Usage

• collection.aggregate() method
  – Mongo shell
  – Most drivers

• aggregate database command

Database Command:

db.runCommand({
    aggregate: "books",
    pipeline: [
        { $project: { language: 1 } },
        { $group: { _id: "$language", numTitles: { $sum: 1 } } }
    ]
})

Collection Method:

db.books.aggregate([
    { $project: { language: 1 } },
    { $group: { _id: "$language", numTitles: { $sum: 1 } } }
])

Result:

{ result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1 }

Limitations

• Result limited by BSON document size
  – Final command result
  – Intermediate shard results

• Pipeline operator memory limits

• Some BSON types unsupported
  – Binary, Code, deprecated types

Sharding

[Diagram: the client sends the pipeline to mongos, which forwards the $match, $project, and $group stages to Shard A, Shard B, and Shard C; mongos merges the shard results with a final $group.]

$match: { /* filter by shard key */ }

[Diagram: when the $match filters on the shard key, mongos targets only the shards holding matching chunks (Shard A and Shard B); Shard C is skipped entirely.]
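A sketch of a shard-key-targeted pipeline, assuming (hypothetically, for illustration) that books is sharded on language:

db.books.aggregate([
    { $match: { language: "Russian" } },     // shard-key filter: mongos targets only the owning shard(s)
    { $project: { language: 1, pages: 1 } },
    { $group: { _id: "$language", avgPages: { $avg: "$pages" } } }
])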

Nice, but…

• Limited parallelism

• No access to analytics libraries

• Separation of concerns

• Need to integrate with existing tool chains

Hadoop Connector

Scaling MongoDB

[Diagram: a client application talks to MongoDB, deployed as a single instance, a replica set, or a sharded cluster.]

The Mechanism of Sharding

[Diagram: a shard key is defined on title. The complete data set (Animal Farm, Brave New World, Fathers & Sons, Invisible Man, Lord of the Flies) is partitioned into chunks, and the chunks are distributed across Shard 1, Shard 2, Shard 3, and Shard 4. As data grows, chunks split and migrate between shards to keep the load balanced.]

Processing Big Data

[Diagram: the data set is processed by many MapReduce workers in parallel.]

• Need to break data into smaller pieces

• Process data across multiple nodes

Input Splits on Unsharded Systems

[Diagram: without splits, a single instance or replica set would feed a single map and reduce. The connector computes input splits over the total dataset so that many Hadoop map tasks can run in parallel.]

Hadoop Connector in Java

// Point the job's input and output at MongoDB collections.
final Configuration conf = new Configuration();
MongoConfigUtil.setInputURI( conf, "mongodb://localhost/test.in" );
MongoConfigUtil.setOutputURI( conf, "mongodb://localhost/test.out" );

final Job job = new Job( conf, "word count" );
job.setMapperClass( TokenizerMapper.class );
job.setReducerClass( IntSumReducer.class );
…

MongoDB-Hadoop Quickstart
https://github.com/mongodb/mongo-hadoop

$ ./sbt package

$ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar \
    ../hadoop/1.0.1/libexec/lib/

$ cp wordcount.jar ../hadoop/1.0.1/libexec/lib/

$ bin/hadoop com.xgen.WordCount

ROCK AND ROLL!

Thank You

Bryan Reinero
Engineer, 10gen
