Raiding the MongoDB Toolbox with Jeremy Mikola


Post on 10-Aug-2015


RAIDING THE MONGODB TOOLBOX

Jeremy Mikola
jmikola

Agenda

Full-text Indexing
Geospatial Queries
Data Aggregation
Creating a Job Queue
Tailable Cursors

Full-text Indexing

You have an awesome PHP blog

{
    "_id": ObjectId("544fd63860dab3b12521379b"),
    "title": "Ten Secrets About PSR-7 You Won't Believe!",
    "content": "Phil Sturgeon caused quite a stir on the PHP-FIG mailing list this morning when he unanimously passed Matthew Weier O'Phinney's controversial PSR-7 specification. PHP-FIG members were outraged as the self-proclaimed Gordon Ramsay of PHP…",
    "published": true,
    "created_at": ISODate("2014-10-28T17:46:36.065Z")
}

We’d like to search the content

Store arrays of keyword strings
Query with regular expressions
Sync data to Solr, Elasticsearch, etc.
Create a full-text index

Creating a full-text index

$collection->createIndex(
    ['content' => 'text']
);

Compound indexing with other fields

$collection->createIndex(
    ['content' => 'text', 'created_at' => 1]
);

Indexing multiple string fields

$collection->createIndex(
    ['content' => 'text', 'title' => 'text']
);

Step 1: Tokenization

[ Phil, Sturgeon, caused, quite, a, stir, on, the, PHP-FIG, mailing, list, this, morning, when, he, unanimously, passed, …]

Step 2: Trim stop-words

[ Phil, Sturgeon, caused, quite, stir, PHP-FIG, mailing, list, morning, unanimously, passed, …]

Step 3: Stemming

[ Phil, Sturgeon, cause, quite, stir, PHP-FIG, mail, list, morning, unanimous, pass, …]

Step 4: Profit?
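The three steps above can be sketched in plain PHP. This is only a toy illustration: the function names, the stop-word list and the two suffix rules are assumptions made up for this example, and MongoDB's text indexes rely on language-specific stop-word lists and stemmers that are far more thorough.

```php
<?php
// Toy sketch of tokenize -> trim stop-words -> stem.
// NOT MongoDB's implementation; word list and suffix rules are assumptions.

function tokenize($text)
{
    // Split on anything that is not a letter, a digit or a hyphen.
    return preg_split('/[^\p{L}\p{N}-]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

function removeStopWords(array $tokens, array $stopWords)
{
    return array_values(array_filter($tokens, function ($token) use ($stopWords) {
        return !in_array(strtolower($token), $stopWords, true);
    }));
}

function naiveStem($token)
{
    if (preg_match('/ing$/', $token)) {
        return substr($token, 0, -3); // mailing -> mail
    }
    if (preg_match('/ed$/', $token)) {
        return substr($token, 0, -1); // caused -> cause
    }
    return $token;
}

$stopWords = ['a', 'on', 'the', 'this', 'when', 'he'];
$tokens = tokenize('Phil Sturgeon caused quite a stir on the PHP-FIG mailing list');
$stems = array_map('naiveStem', removeStopWords($tokens, $stopWords));

echo implode(', ', $stems), "\n";
```

A search string would go through the same steps before being compared with the indexed stems, which is why a query for one inflection of a word can match another.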

Querying a text index

$cursor = $collection->find(
    ['$text' => ['$search' => 'Phil Sturgeon']]
);

foreach ($cursor as $document) {
    echo $document['content'] . "\n\n";
}

↓
Phil Sturgeon caused quite a stir on the PHP-FIG…

Phil Jerkson, better known as @phpjerk on Twitter…

Phrases and negations

$cursor = $collection->find(
    ['$text' => ['$search' => 'PHP -"Phil Sturgeon"']]
);

foreach ($cursor as $document) {
    echo $document['content'] . "\n\n";
}

↓
Be prepared for the latest and greatest version of PHP with…

Sorting by the match score

$cursor = $collection->find(
    ['$text' => ['$search' => 'Phil Sturgeon']],
    ['score' => ['$meta' => 'textScore']]
);

$cursor->sort(['score' => ['$meta' => 'textScore']]);

foreach ($cursor as $document) {
    printf("%.6f: %s\n\n", $document['score'], $document['content']);
}

↓
1.035714: Phil Sturgeon caused quite a stir on the PHP-FIG…

0.555556: Phil Jerkson, better known as @phpjerk on Twitter…

Supporting multiple languages

$collection->createIndex(
    ['content' => 'text'],
    ['default_language' => 'en']
);

$collection->insert([
    'content' => 'We are planning a hot dog conference',
]);

$collection->insert([
    'content' => 'Die Konferenz wird WurstCon benannt werden',
    'language' => 'de',
]);

$collection->find(
    ['$text' => ['$search' => 'saucisse', '$language' => 'fr']]
);

Geospatial Queries

Because some of us ❤ maps

GeoJSON in a nutshell

{ "type": "Point", "coordinates": [100.0, 0.0] }

{ "type": "LineString", "coordinates": [[100.0, 0.0], [101.0, 1.0]] }

{
    "type": "Polygon",
    "coordinates": [
        [[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]],
        [[100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8], [100.2, 0.2]]
    ]
}

{
    "type": "MultiPolygon",
    "coordinates": [
        [[[102, 2], [103, 2], [103, 3], [102, 3], [102, 2]]],
        [[[100, 0], [101, 0], [101, 1], [100, 1], [100, 0.0]]]
    ]
}

{
    "type": "GeometryCollection",
    "geometries": [
        { … },
        { … }
    ]
}

ARRAYS

ARRAYS EVERYWHERE

Indexing some places of interest

$collection->insert([
    'name' => 'Hyatt Regency Santa Clara',
    'type' => 'hotel',
    'loc' => [
        'type' => 'Point',
        'coordinates' => [-121.976557, 37.404977],
    ],
]);

$collection->insert([
    'name' => 'In-N-Out Burger',
    'type' => 'restaurant',
    'loc' => [
        'type' => 'Point',
        'coordinates' => [-121.982102, 37.387993],
    ],
]);

$collection->createIndex(['loc' => '2dsphere']);

Inclusion queries

// Define a GeoJSON polygon
$polygon = [
    'type' => 'Polygon',
    'coordinates' => [
        [
            [-121.976557, 37.404977], // Hyatt Regency
            [-121.982102, 37.387993], // In-N-Out Burger
            [-121.992311, 37.404385], // Rabbit's Foot Meadery
            [-121.976557, 37.404977], // Close the ring at the first point
        ],
    ],
];

// Find documents within the polygon's bounds
$collection->find([
    'loc' => ['$geoWithin' => ['$geometry' => $polygon]],
]);

// Find documents within circular bounds
$collection->find([
    'loc' => ['$geoWithin' => ['$centerSphere' => [
        [-121.976557, 37.404977], // Center coordinate
        5 / 3959,                 // Convert miles to radians
    ]]],
]);
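The `5 / 3959` above divides a distance by the Earth's radius in the same unit, which yields the angle in radians that `$centerSphere` expects. A minimal sketch of that conversion follows. The helper names are hypothetical, 3959 miles is the radius used on the slide, and 6371 km is assumed here as its approximate metric equivalent.

```php
<?php
// Approximate mean Earth radius. 3959 comes from the slide; 6371 is an assumption.
const EARTH_RADIUS_MILES = 3959;
const EARTH_RADIUS_KM = 6371;

// Hypothetical helpers: distance along the surface divided by the radius gives radians.
function milesToRadians($miles)
{
    return $miles / EARTH_RADIUS_MILES;
}

function kilometersToRadians($kilometers)
{
    return $kilometers / EARTH_RADIUS_KM;
}

// Build the $centerSphere clause: [[longitude, latitude], radius in radians].
function centerSphere($longitude, $latitude, $radians)
{
    return ['$centerSphere' => [[$longitude, $latitude], $radians]];
}

$query = [
    'loc' => ['$geoWithin' => centerSphere(-121.976557, 37.404977, milesToRadians(5))],
];

var_dump($query);
```

The resulting `$query` array has the same shape as the circular-bounds query on the slide and could be passed to `find()` in its place.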

Sorted proximity queries

$point = [
    'type' => 'Point',
    'coordinates' => [-121.976557, 37.404977],
];

// Find locations nearest a point
$collection->find([
    'loc' => ['$near' => ['$geometry' => $point]],
]);

// Find the nearest 50 restaurants within 5km
$collection->find([
    'loc' => ['$near' => ['$geometry' => $point, '$maxDistance' => 5000]],
    'type' => 'restaurant',
])->limit(50);

Data Aggregation

Count

$collection->insert(['code' => 'A123', 'num' => 500]);
$collection->insert(['code' => 'A123', 'num' => 250]);
$collection->insert(['code' => 'B212', 'num' => 200]);
$collection->insert(['code' => 'A123', 'num' => 300]);

$collection->count(); // Returns 4

$collection->count(['num' => ['$gte' => 250]]); // Returns 3

Distinct

$collection->insert(['code' => 'A123', 'num' => 500]);
$collection->insert(['code' => 'A123', 'num' => 250]);
$collection->insert(['code' => 'B212', 'num' => 200]);
$collection->insert(['code' => 'A123', 'num' => 300]);

$collection->distinct('code'); // Returns ["A123", "B212"]

Group

$collection->insert(['code' => 'A123', 'num' => 500]);
$collection->insert(['code' => 'A123', 'num' => 250]);
$collection->insert(['code' => 'B212', 'num' => 200]);
$collection->insert(['code' => 'A123', 'num' => 300]);

$result = $collection->group(
    ['code' => 1], // field(s) on which to group
    ['sum' => 0],  // initial aggregate value
    new MongoCode('function(cur, agg) { agg.sum += cur.num }')
);

foreach ($result['retval'] as $grouped) {
    printf("%s: %d\n", $grouped['code'], $grouped['sum']);
}

↓
A123: 1050
B212: 200
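To make the roles of the three `group()` arguments concrete, here is a plain-PHP emulation over an in-memory array. It is a sketch only: `groupBy()` is a hypothetical helper, it handles a single grouping field, and it does not talk to MongoDB at all.

```php
<?php
// In-memory stand-in for the four documents inserted above.
$documents = [
    ['code' => 'A123', 'num' => 500],
    ['code' => 'A123', 'num' => 250],
    ['code' => 'B212', 'num' => 200],
    ['code' => 'A123', 'num' => 300],
];

// Hypothetical helper mirroring group(): key field, initial value, reduce function.
function groupBy(array $documents, $key, array $initial, callable $reduce)
{
    $groups = [];
    foreach ($documents as $doc) {
        $value = $doc[$key];
        if (!isset($groups[$value])) {
            // Every group starts from the key plus a copy of the initial aggregate.
            $groups[$value] = [$key => $value] + $initial;
        }
        $groups[$value] = $reduce($doc, $groups[$value]);
    }
    return array_values($groups);
}

$result = groupBy($documents, 'code', ['sum' => 0], function ($cur, $agg) {
    $agg['sum'] += $cur['num'];
    return $agg;
});

foreach ($result as $grouped) {
    printf("%s: %d\n", $grouped['code'], $grouped['sum']);
}
```

The closure plays the part of the JavaScript reduce function: it receives the current document and the running aggregate for that document's group.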

MapReduce

Extremely versatile, powerful
Intended for complex data analysis
Overkill for simple aggregation tasks (e.g. averages, summation, grouping)
Incremental data processing

Aggregating query profiler output

{
    "op" : "query",
    "ns" : "db.collection",
    "query" : { "code" : "A123", "num" : { "$gt" : 225 } },
    "ntoreturn" : 0,
    "ntoskip" : 0,
    "nscanned" : 11426,
    "lockStats" : { … },
    "nreturned" : 0,
    "responseLength" : 20,
    "millis" : 12,
    "ts" : ISODate("2013-05-23T21:24:39.327Z")
}

Constructing a query skeleton

{ "code" : "A123", "num" : { "$gt" : 225 }}

↓{ "code" : <string>, "num" : { "$gt" : <number> }}

Aggregate stats for similar queries (e.g. execution time, index performance)
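The talk uses MapReduce for this job; the skeleton idea itself can be sketched in plain PHP. In the sketch below, `querySkeleton()` is a hypothetical helper, and the second profiler entry (including its `millis` value) is made up so that two queries share one skeleton.

```php
<?php
// Replace every scalar in a query with a type placeholder, so that
// structurally identical queries produce the same skeleton.
function querySkeleton($value)
{
    if (is_array($value)) {
        $skeleton = [];
        foreach ($value as $key => $child) {
            $skeleton[$key] = querySkeleton($child);
        }
        return $skeleton;
    }
    if (is_int($value) || is_float($value)) {
        return '<number>';
    }
    if (is_string($value)) {
        return '<string>';
    }
    return '<' . gettype($value) . '>';
}

// Trimmed profiler entries; only the first one mirrors the slide.
$entries = [
    ['query' => ['code' => 'A123', 'num' => ['$gt' => 225]], 'millis' => 12],
    ['query' => ['code' => 'B212', 'num' => ['$gt' => 100]], 'millis' => 8],
];

// Accumulate a count and total execution time per skeleton.
$stats = [];
foreach ($entries as $entry) {
    $key = json_encode(querySkeleton($entry['query']));
    if (!isset($stats[$key])) {
        $stats[$key] = ['count' => 0, 'millis' => 0];
    }
    $stats[$key]['count']++;
    $stats[$key]['millis'] += $entry['millis'];
}

foreach ($stats as $skeleton => $stat) {
    printf("%s: %d queries, %.1f ms on average\n", $skeleton, $stat['count'], $stat['millis'] / $stat['count']);
}
```

Using the JSON-encoded skeleton as the grouping key is one simple choice; any stable serialization of the skeleton would do.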

Aggregation framework

Process a stream of documents
Original input is a collection
Outputs one or more result documents

Series of operators
Filter or transform data
Input/output chain

ps ax | grep mongod | head -n 1

Executing an aggregation pipeline

$collection->aggregateCursor([
    ['$match' => ['status' => 'A']],
    ['$group' => ['_id' => '$cust_id', 'total' => ['$sum' => '$amount']]],
]);
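The Unix-pipe analogy can be made concrete with a plain-PHP sketch in which each stage is a function from documents to documents. Everything here is illustrative: the stage builders are hypothetical, they cover only equality matching and summation, and the sample orders are invented.

```php
<?php
// Stage builder: keep documents whose fields equal the given criteria (like $match).
function matchStage(array $criteria)
{
    return function (array $docs) use ($criteria) {
        return array_values(array_filter($docs, function ($doc) use ($criteria) {
            foreach ($criteria as $field => $expected) {
                if (!array_key_exists($field, $doc) || $doc[$field] !== $expected) {
                    return false;
                }
            }
            return true;
        }));
    };
}

// Stage builder: sum one field per distinct value of another (like $group with $sum).
function groupSumStage($idField, $sumField)
{
    return function (array $docs) use ($idField, $sumField) {
        $totals = [];
        foreach ($docs as $doc) {
            $id = $doc[$idField];
            $totals[$id] = (isset($totals[$id]) ? $totals[$id] : 0) + $doc[$sumField];
        }
        $out = [];
        foreach ($totals as $id => $total) {
            $out[] = ['_id' => $id, 'total' => $total];
        }
        return $out;
    };
}

// Feed the output of each stage into the next one.
function runPipeline(array $docs, array $stages)
{
    foreach ($stages as $stage) {
        $docs = $stage($docs);
    }
    return $docs;
}

// Invented sample data.
$orders = [
    ['cust_id' => 'abc', 'status' => 'A', 'amount' => 50],
    ['cust_id' => 'abc', 'status' => 'A', 'amount' => 25],
    ['cust_id' => 'xyz', 'status' => 'A', 'amount' => 10],
    ['cust_id' => 'xyz', 'status' => 'D', 'amount' => 99],
];

$result = runPipeline($orders, [
    matchStage(['status' => 'A']),
    groupSumStage('cust_id', 'amount'),
]);

print_r($result);
```

In the real framework the stages run inside the server, so only the final result documents cross the wire.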

Pipeline operators

$match
$geoNear
$project
$group
$unwind
$sort
$limit
$skip
$redact
$out

Solving symbolic equations and calculus

Creating a Job Queue

Things not to do in your controllers

Send email messages
Upload files to S3
Blocking API calls
Heavy data processing
Mining cryptocurrency

Creating a job

$collection->insert([
    'data' => [ … ],
    'processed' => false,
    'createdAt' => new MongoDate,
]);

$collection->createIndex(
    ['processed' => 1, 'createdAt' => 1]
);

Selecting a job

$job = $collection->findAndModify(
    ['processed' => false],
    ['$set' => ['processed' => true, 'receivedAt' => new MongoDate]],
    null, // field projection (if any)
    [
        'sort' => ['createdAt' => 1],
        'new' => true,
    ]
);

↓
{
    "_id" : ObjectId("54515e16ba5a4da1b15a1766"),
    "data" : { … },
    "processed" : true,
    "createdAt" : ISODate("2014-10-29T21:37:26.405Z"),
    "receivedAt" : ISODate("2014-10-29T21:37:33.118Z")
}

Schedule jobs in the future

$collection->insert([
    'data' => [ … ],
    'processed' => false,
    'createdAt' => new MongoDate,
    'scheduledAt' => new MongoDate(strtotime('1 hour')),
]);

↓
$now = new MongoDate;

$job = $collection->findAndModify(
    ['processed' => false, 'scheduledAt' => ['$lt' => $now]],
    ['$set' => ['processed' => true, 'receivedAt' => $now]],
    null,
    [
        'sort' => ['createdAt' => 1],
        'new' => true,
    ]
);

Prioritize job selection

$collection->insert([
    'data' => [ … ],
    'processed' => false,
    'createdAt' => new MongoDate,
    'priority' => 0,
]);

// Index: { "processed": 1, "priority": -1, "createdAt": 1 }

↓
$now = new MongoDate;

$job = $collection->findAndModify(
    ['processed' => false],
    ['$set' => ['processed' => true, 'receivedAt' => $now]],
    null,
    [
        'sort' => ['priority' => -1, 'createdAt' => 1],
        'new' => true,
    ]
);

Gracefully handle failed jobs

$collection->insert([
    'data' => [ … ],
    'processed' => false,
    'createdAt' => new MongoDate,
    'attempts' => 0,
]);

↓
$now = new MongoDate;

$job = $collection->findAndModify(
    ['processed' => false],
    [
        '$set' => ['processed' => true, 'receivedAt' => $now],
        '$inc' => ['attempts' => 1],
    ],
    null,
    [
        'sort' => ['createdAt' => 1],
        'new' => true,
    ]
);

Tailable Cursors

Capped collections

$database->createCollection(
    'tailme',
    [
        'capped' => true,
        'size' => 16777216, // 16 MiB
        'max' => 1000,
    ]
);

Producer

for ($i = 0; ++$i; ) {
    $collection->insert(['x' => $i]);
    printf("Inserted: %d\n", $i);
    sleep(1);
}

↓
Inserted: 1
Inserted: 2
Inserted: 3
Inserted: 4
Inserted: 5
…

Consumer

$cursor = $collection->find();
$cursor->tailable(true);
$cursor->awaitData(true);

while (true) {
    if ($cursor->dead()) {
        break;
    }

    if (!$cursor->hasNext()) {
        continue;
    }

    printf("Consumed: %d\n", $cursor->getNext()['x']);
}

↓
Consumed: 1
Consumed: 2
…

Replica set oplog

$collection->insert([
    'x' => 1,
]);

↓
{
    "ts" : Timestamp(1414624929, 1),
    "h" : NumberLong("2631382894387434484"),
    "v" : 2,
    "op" : "i",
    "ns" : "test.foo",
    "o" : { "_id" : ObjectId("545176a14ab5c0c999da70f0"), "x" : 1 }
}

Replica set oplog

$collection->update(
    ['x' => 1],
    ['$inc' => ['x' => 1]]
);

↓
{
    "ts" : Timestamp(1414624962, 1),
    "h" : NumberLong("5079425106850550701"),
    "v" : 2,
    "op" : "u",
    "ns" : "test.foo",
    "o2" : { "_id" : ObjectId("545176a14ab5c0c999da70f0") },
    "o" : { "$set" : { "x" : 2 } }
}

THANKS!

Questions?

Image Credits

Books designed by Catherine Please from the Noun Project
Aggregator designed by stuart mcmorris from the Noun Project
Register designed by Wilson Joseph from the Noun Project
Ouroboros designed by Silas Reeves from the Noun Project

http://mariompittore.com/wp-content/uploads/2013/08/Social-Gnomes1.png