Raiding the MongoDB Toolbox with Jeremy Mikola
TRANSCRIPT
You have an awesome PHP blog
{ "_id": ObjectId("544fd63860dab3b12521379b"), "title": "Ten Secrets About PSR-7 You Won't Believe!", "content": "Phil Sturgeon caused quite a stir on the PHP-FIG mailing list this morning when he unanimously passed Matthew Weier O'Phinney's controversial PSR-7 specification. PHP-FIG members were outraged as the self-proclaimed Gordon Ramsay of PHP…", "published": true, "created_at": ISODate("2014-10-28T17:46:36.065Z")}
We’d like to search the content
Store arrays of keyword strings
Query with regular expressions
Sync data to Solr, Elasticsearch, etc.
Create a full-text index
Creating a full-text index
$collection->createIndex( ['content' => 'text']);
Compound indexing with other fields
$collection->createIndex( ['content' => 'text', 'created_at' => 1]);
Indexing multiple string fields
$collection->createIndex( ['content' => 'text', 'title' => 'text']);
Step 1: Tokenization
[ Phil, Sturgeon, caused, quite, a, stir, on, the, PHP-FIG, mailing, list, this, morning, when, he, unanimously, passed, …]
Step 2: Trim stop-words
[ Phil, Sturgeon, caused, quite, stir, PHP-FIG, mailing, list, morning, unanimously, passed, …]
Step 3: Stemming
[ Phil, Sturgeon, cause, quite, stir, PHP-FIG, mail, list, morning, unanimous, pass, …]
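The three steps above can be sketched in plain PHP. This is a toy illustration only, not MongoDB's implementation: the function names, the stop-word list, and the stem lookup table are all hypothetical, and a real stemmer applies language rules rather than a table.

```php
<?php
// Toy illustration of tokenization, stop-word trimming and stemming.
// NOT MongoDB's implementation; word lists below are hypothetical.

function tokenize($text)
{
    // Split on whitespace, then strip surrounding punctuation.
    $tokens = preg_split('/\s+/', trim($text));
    return array_map(function ($token) {
        return trim($token, ".,!?;:\"'");
    }, $tokens);
}

function trimStopWords(array $tokens, array $stopWords)
{
    return array_values(array_filter($tokens, function ($token) use ($stopWords) {
        return !in_array(strtolower($token), $stopWords, true);
    }));
}

function stem(array $tokens, array $stemTable)
{
    // A lookup table stands in for a real stemming algorithm.
    return array_map(function ($token) use ($stemTable) {
        return isset($stemTable[$token]) ? $stemTable[$token] : $token;
    }, $tokens);
}

$stopWords = ['a', 'on', 'the', 'this', 'when', 'he'];
$stemTable = ['caused' => 'cause', 'mailing' => 'mail'];

$terms = stem(
    trimStopWords(
        tokenize('Phil Sturgeon caused quite a stir on the PHP-FIG mailing list this morning'),
        $stopWords
    ),
    $stemTable
);

echo implode(', ', $terms), "\n";
// Phil, Sturgeon, cause, quite, stir, PHP-FIG, mail, list, morning
```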
Querying a text index
$cursor = $collection->find( ['$text' => ['$search' => 'Phil Sturgeon']]);
foreach ($cursor as $document) { echo $document['content'] . "\n\n";}
↓Phil Sturgeon caused quite a stir on the PHP-FIG…
Phil Jerkson, better known as @phpjerk on Twitter…
Phrases and negations
$cursor = $collection->find( ['$text' => ['$search' => 'PHP -"Phil Sturgeon"']]);
foreach ($cursor as $document) { echo $document['content'] . "\n\n";}
↓Be prepared for the latest and greatest version of PHP with…
Sorting by the match score
$cursor = $collection->find( ['$text' => ['$search' => 'Phil Sturgeon']], ['score' => ['$meta' => 'textScore']]);
$cursor->sort(['score' => ['$meta' => 'textScore']]);
foreach ($cursor as $document) { printf("%.6f: %s\n\n", $document['score'], $document['content']);}
↓1.035714: Phil Sturgeon caused quite a stir on the PHP-FIG…
0.555556: Phil Jerkson, better known as @phpjerk on Twitter…
Supporting multiple languages
$collection->createIndex( ['content' => 'text'], ['default_language' => 'en']);
$collection->insert([ 'content' => 'We are planning a hot dog conference',]);
$collection->insert([ 'content' => 'Die Konferenz wird WurstCon benannt werden', 'language' => 'de',]);
$collection->find( ['$text' => ['$search' => 'saucisse', '$language' => 'fr']]);
Geospatial indexes
2dsphere for earth-like geometry
Supports GeoJSON objects
2d for flat geometry
Legacy point format: [x,y]
Inclusion, proximity, and intersection
GeoJSON in a nutshell
{ "type": "Point", "coordinates": [100.0, 0.0] }
{ "type": "LineString", "coordinates": [[100.0, 0.0], [101.0, 1.0]] }
{ "type": "Polygon", "coordinates": [ [[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]], [[100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8], [100.2, 0.2]] ]}
{ "type": "MultiPolygon", "coordinates": [ [[[102, 2], [103, 2], [103, 3], [102, 3], [102, 2]]], [[[100, 0], [101, 0], [101, 1], [100, 1], [100, 0.0]]] ]}
{ "type": "GeometryCollection", "geometries": [ { … }, { … } ]}
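GeoJSON requires each polygon ring to contain at least four positions and to end on the position where it starts, which is why the Polygon example repeats its first coordinate. A minimal check of that rule might look like this; isClosedRing() is a hypothetical helper name, not a driver function.

```php
<?php
// Checks the GeoJSON rule that a polygon ring has at least four positions
// and that the last position equals the first.
// isClosedRing() is a hypothetical helper, not part of any driver.

function isClosedRing(array $ring)
{
    if (count($ring) < 4) {
        return false;
    }
    return $ring[0] === $ring[count($ring) - 1];
}

$outer = [[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]];
$open  = [[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0]];

var_dump(isClosedRing($outer)); // bool(true)
var_dump(isClosedRing($open));  // bool(false)
```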
Indexing some places of interest
$collection->insert([ 'name' => 'Hyatt Regency Santa Clara', 'type' => 'hotel', 'loc' => [ 'type' => 'Point', 'coordinates' => [-121.976557, 37.404977], ],]);
$collection->insert([ 'name' => 'In-N-Out Burger', 'type' => 'restaurant', 'loc' => [ 'type' => 'Point', 'coordinates' => [-121.982102, 37.387993], ]]);
$collection->ensureIndex(['loc' => '2dsphere']);
Inclusion queries
// Define a GeoJSON polygon
$polygon = [
    'type' => 'Polygon',
    'coordinates' => [
        [
            [-121.976557, 37.404977], // Hyatt Regency
            [-121.982102, 37.387993], // In-N-Out Burger
            [-121.992311, 37.404385], // Rabbit's Foot Meadery
            [-121.976557, 37.404977],
        ],
    ],
];
// Find documents within the polygon's bounds
$collection->find(['loc' => ['$geoWithin' => $polygon]]);
// Find documents within circular bounds
$collection->find(['loc' => ['$geoWithin' => ['$centerSphere' => [
    [-121.976557, 37.404977], // Center coordinate
    5 / 3959, // Convert miles to radians
]]]]);
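The radius passed to $centerSphere is expressed in radians, so a distance has to be divided by the Earth's radius in the same unit. A small sketch of that conversion follows; the helper names are hypothetical, 3959 is the figure used in the query above, and 6371 is a commonly used mean radius in kilometres.

```php
<?php
// Approximate Earth radii. 3959 miles matches the query above;
// 6371 km is a commonly used mean radius.
const EARTH_RADIUS_MILES = 3959;
const EARTH_RADIUS_KM = 6371;

// Hypothetical helpers: distance along the surface divided by the radius
// gives the angle in radians.
function milesToRadians($miles)
{
    return $miles / EARTH_RADIUS_MILES;
}

function kilometersToRadians($km)
{
    return $km / EARTH_RADIUS_KM;
}

printf("%.6f\n", milesToRadians(5));
```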
Sorted proximity queries
$point = [ 'type' => 'Point', 'coordinates' => [-121.976557, 37.404977]];
// Find locations nearest a point
$collection->find(['loc' => ['$near' => $point]]);
// Find the nearest 50 restaurants within 5km
$collection->find([
    'loc' => ['$near' => $point, '$maxDistance' => 5000],
    'type' => 'restaurant',
])->limit(50);
We have some commands for this
aggregate
count
distinct
group
mapReduce
Count
$collection->insert(['code' => 'A123', 'num' => 500 ]);
$collection->insert(['code' => 'A123', 'num' => 250 ]);
$collection->insert(['code' => 'B212', 'num' => 200 ]);
$collection->insert(['code' => 'A123', 'num' => 300 ]);
$collection->count(); // Returns 4
$collection->count(['num' => ['$gte' => 250]]); // Returns 3
Distinct
$collection->insert(['code' => 'A123', 'num' => 500 ]);
$collection->insert(['code' => 'A123', 'num' => 250 ]);
$collection->insert(['code' => 'B212', 'num' => 200 ]);
$collection->insert(['code' => 'A123', 'num' => 300 ]);
$collection->distinct('code'); // Returns ["A123", "B212"]
Group
$collection->insert(['code' => 'A123', 'num' => 500 ]);
$collection->insert(['code' => 'A123', 'num' => 250 ]);
$collection->insert(['code' => 'B212', 'num' => 200 ]);
$collection->insert(['code' => 'A123', 'num' => 300 ]);
$result = $collection->group(
    ['code' => 1], // field(s) on which to group
    ['sum' => 0], // initial aggregate value
    new MongoCode('function(cur, agg) { agg.sum += cur.num }')
);
foreach ($result['retval'] as $grouped) { printf("%s: %d\n", $grouped['code'], $grouped['sum']);}
↓A123: 1050
B212: 200
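To see what the reduce function is doing, the same grouping can be emulated over a plain PHP array. This is only an in-memory sketch with no driver calls; the server runs the JavaScript function, not this loop.

```php
<?php
// Plain-PHP emulation of the group operation above (no driver involved).
$documents = [
    ['code' => 'A123', 'num' => 500],
    ['code' => 'A123', 'num' => 250],
    ['code' => 'B212', 'num' => 200],
    ['code' => 'A123', 'num' => 300],
];

$groups = [];
foreach ($documents as $cur) {
    $key = $cur['code'];
    if (!isset($groups[$key])) {
        // Start from the initial aggregate value, plus the grouping key.
        $groups[$key] = ['code' => $key, 'sum' => 0];
    }
    // Equivalent of: function(cur, agg) { agg.sum += cur.num }
    $groups[$key]['sum'] += $cur['num'];
}

foreach ($groups as $grouped) {
    printf("%s: %d\n", $grouped['code'], $grouped['sum']);
}
// A123: 1050
// B212: 200
```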
MapReduce
Extremely versatile, powerful
Intended for complex data analysis
Overkill for simple aggregation tasks (e.g. averages, summation, grouping)
Incremental data processing
Aggregating query profiler output
{ "op" : "query", "ns" : "db.collection", "query" : { "code" : "A123", "num" : { "$gt" : 225 } }, "ntoreturn" : 0, "ntoskip" : 0, "nscanned" : 11426, "lockStats" : { … }, "nreturned" : 0, "responseLength" : 20, "millis" : 12, "ts" : ISODate("2013-05-23T21:24:39.327Z"),}
Constructing a query skeleton
{ "code" : "A123", "num" : { "$gt" : 225 }}
↓{ "code" : <string>, "num" : { "$gt" : <number> }}
Aggregate stats for similar queries (e.g. execution time, index performance)
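One way to build such a skeleton is to walk the query and swap every scalar for a placeholder naming its type. The sketch below does this in plain PHP; querySkeleton() and the placeholder strings other than <string> and <number> are hypothetical, and the talk does not say how its skeletons were produced.

```php
<?php
// Replaces every scalar value in a query with a placeholder for its type,
// so that queries differing only in their values share one skeleton.
// querySkeleton() is a hypothetical helper, not part of any driver.

function querySkeleton(array $query)
{
    $skeleton = [];
    foreach ($query as $key => $value) {
        if (is_array($value)) {
            $skeleton[$key] = querySkeleton($value); // recurse into operators
        } elseif (is_int($value) || is_float($value)) {
            $skeleton[$key] = '<number>';
        } elseif (is_string($value)) {
            $skeleton[$key] = '<string>';
        } elseif (is_bool($value)) {
            $skeleton[$key] = '<boolean>';
        } else {
            $skeleton[$key] = '<other>';
        }
    }
    return $skeleton;
}

$skeleton = querySkeleton(['code' => 'A123', 'num' => ['$gt' => 225]]);
echo json_encode($skeleton), "\n";
```

The encoded skeleton could then serve as a grouping key when summing fields such as millis or nscanned.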
Aggregation framework
Process a stream of documents
Original input is a collection
Outputs one or more result documents
Series of operators
Filter or transform data
Input/output chain
ps ax | grep mongod | head -n 1
Executing an aggregation pipeline
$collection->aggregateCursor([ ['$match' => ['status' => 'A']], ['$group' => ['_id' => '$cust_id', 'total' => ['$sum' => '$amount']]]]);
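The shell-pipe analogy can be made concrete with an in-memory imitation of the two stages above. This sketch only mimics $match and $group over a PHP array; the sample orders and the closure names are hypothetical, and the real pipeline runs on the server.

```php
<?php
// Each stage takes an array of documents and returns a new array, much like
// processes joined by a shell pipe. Sample documents are hypothetical.

// Imitates ['$match' => ['status' => 'A']]
$match = function (array $docs) {
    return array_values(array_filter($docs, function ($doc) {
        return $doc['status'] === 'A';
    }));
};

// Imitates ['$group' => ['_id' => '$cust_id', 'total' => ['$sum' => '$amount']]]
$group = function (array $docs) {
    $totals = [];
    foreach ($docs as $doc) {
        $id = $doc['cust_id'];
        if (!isset($totals[$id])) {
            $totals[$id] = ['_id' => $id, 'total' => 0];
        }
        $totals[$id]['total'] += $doc['amount'];
    }
    return array_values($totals);
};

$orders = [
    ['cust_id' => 'abc', 'status' => 'A', 'amount' => 50],
    ['cust_id' => 'abc', 'status' => 'A', 'amount' => 25],
    ['cust_id' => 'xyz', 'status' => 'D', 'amount' => 10],
    ['cust_id' => 'xyz', 'status' => 'A', 'amount' => 5],
];

// Feed the output of each stage into the next one.
$result = array_reduce([$match, $group], function ($docs, $stage) {
    return $stage($docs);
}, $orders);

print_r($result);
```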
Things not to do in your controllers
Send email messages
Upload files to S3
Blocking API calls
Heavy data processing
Mining cryptocurrency
Creating a job
$collection->insert([ 'data' => [ … ], 'processed' => false, 'createdAt' => new MongoDate,]);
$collection->createIndex( ['processed' => 1, 'createdAt' => 1]);
Selecting a job
$job = $collection->findAndModify(
    ['processed' => false],
    ['$set' => ['processed' => true, 'receivedAt' => new MongoDate]],
    null, // field projection (if any)
    [
        'sort' => ['createdAt' => 1],
        'new' => true,
    ]
);
↓{ "_id" : ObjectId("54515e16ba5a4da1b15a1766"), "data" : { … }, "processed" : true, "createdAt" : ISODate("2014-10-29T21:37:26.405Z"), "receivedAt" : ISODate("2014-10-29T21:37:33.118Z")}
Schedule jobs in the future
$collection->insert([ 'data' => [ … ], 'processed' => false, 'createdAt' => new MongoDate, 'scheduledAt' => new MongoDate(strtotime('1 hour')),]);
↓$now = new MongoDate;
$job = $collection->findAndModify( ['processed' => false, 'scheduledAt' => ['$lt' => $now]], ['$set' => ['processed' => true, 'receivedAt' => $now]], null, [ 'sort' => ['createdAt' => 1], 'new' => true, ]);
Prioritize job selection
$collection->insert([ 'data' => [ … ], 'processed' => false, 'createdAt' => new MongoDate, 'priority' => 0,]);
// Index: { "processed": 1, "priority": -1, "createdAt": 1 }
↓$now = new MongoDate;
$job = $collection->findAndModify( ['processed' => false], ['$set' => ['processed' => true, 'receivedAt' => $now]], null, [ 'sort' => ['priority' => -1, 'createdAt' => 1], 'new' => true, ]);
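The effect of that compound sort can be checked in memory: higher priority wins, and among equal priorities the oldest job goes first. The jobs, their names, and the integer timestamps below are hypothetical stand-ins for real documents.

```php
<?php
// In-memory illustration of the sort used above: priority descending,
// then createdAt ascending. Job values are hypothetical.

function compareJobs(array $a, array $b)
{
    if ($a['priority'] !== $b['priority']) {
        return $b['priority'] - $a['priority']; // higher priority first
    }
    return $a['createdAt'] - $b['createdAt'];   // older job first
}

$jobs = [
    ['name' => 'old-low',  'priority' => 0, 'createdAt' => 100],
    ['name' => 'new-high', 'priority' => 5, 'createdAt' => 300],
    ['name' => 'old-high', 'priority' => 5, 'createdAt' => 200],
];

usort($jobs, 'compareJobs');

// The first element is the job a worker would claim first.
echo $jobs[0]['name'], "\n"; // old-high
```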
Gracefully handle failed jobs
$collection->insert([ 'data' => [ … ], 'processed' => false, 'createdAt' => new MongoDate, 'attempts' => 0,]);
↓$now = new MongoDate;
$job = $collection->findAndModify( ['processed' => false], [ '$set' => ['processed' => true, 'receivedAt' => $now], '$inc' => ['attempts' => 1], ], null, [ 'sort' => ['createdAt' => 1], 'new' => true, ]);
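The attempts counter makes a retry limit possible. One possible policy, sketched below, releases a failed job until it has been tried a fixed number of times and flags it afterwards. MAX_ATTEMPTS, failureUpdate(), and the failed field are assumptions for illustration and do not appear in the slides.

```php
<?php
// Hypothetical helper: decides which update to apply after a job throws.
// MAX_ATTEMPTS and the 'failed' field are assumptions, not from the slides.
const MAX_ATTEMPTS = 3;

function failureUpdate(array $job)
{
    if ($job['attempts'] < MAX_ATTEMPTS) {
        // Release the job so a later findAndModify call can claim it again.
        return ['$set' => ['processed' => false]];
    }
    // Give up: leave it marked as processed and flag it for inspection.
    return ['$set' => ['failed' => true]];
}

$retry  = failureUpdate(['attempts' => 1]);
$giveUp = failureUpdate(['attempts' => 3]);

// A worker could pass the result to
// $collection->update(['_id' => $job['_id']], $update).
echo json_encode($retry), "\n";
echo json_encode($giveUp), "\n";
```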
Capped collections
$database->createCollection( 'tailme', [ 'capped' => true, 'size' => 16777216, // 16 MiB 'max' => 1000, ]);
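As a rough analogy for the max option, a capped collection behaves like a fixed-length list that discards its oldest entry once full. The class below is a hypothetical in-memory stand-in and ignores the byte-size limit that a real capped collection also enforces.

```php
<?php
// In-memory analogy for the 'max' option: once the limit is reached, the
// oldest document is discarded. CappedList is a hypothetical class.

class CappedList
{
    private $max;
    private $documents = [];

    public function __construct($max)
    {
        $this->max = $max;
    }

    public function insert(array $document)
    {
        $this->documents[] = $document;
        if (count($this->documents) > $this->max) {
            array_shift($this->documents); // drop the oldest entry
        }
    }

    public function all()
    {
        return $this->documents;
    }
}

$capped = new CappedList(3);
foreach ([1, 2, 3, 4, 5] as $x) {
    $capped->insert(['x' => $x]);
}

echo json_encode($capped->all()), "\n"; // [{"x":3},{"x":4},{"x":5}]
```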
Producer
for ($i = 0; ++$i; ) { $collection->insert(['x' => $i]); printf("Inserted: %d\n", $i); sleep(1);}
↓Inserted: 1
Inserted: 2
Inserted: 3
Inserted: 4
Inserted: 5
…
Consumer
$cursor = $collection->find();$cursor->tailable(true);$cursor->awaitData(true);
while (true) { if ($cursor->dead()) { break; }
if ( ! $cursor->hasNext()) { continue; }
printf("Consumed: %d\n", $cursor->getNext()['x']);}
↓Consumed: 1
Consumed: 2
…
Replica set oplog
$collection->insert([ 'x' => 1,]);
↓{ "ts" : Timestamp(1414624929, 1), "h" : NumberLong("2631382894387434484"), "v" : 2, "op" : "i", "ns" : "test.foo", "o" : { "_id" : ObjectId("545176a14ab5c0c999da70f0"), "x" : 1 }}
Replica set oplog
$collection->update( ['x' => 1], ['$inc' => ['x' => 1]]);
↓{ "ts" : Timestamp(1414624962, 1), "h" : NumberLong("5079425106850550701"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : ObjectId("545176a14ab5c0c999da70f0") }, "o" : { "$set" : { "x" : 2 } }}
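Code that tails the oplog usually starts by dispatching on the op and ns fields shown in the entries above. A minimal sketch, assuming the entry has already been decoded into a PHP array; describeOplogEntry() is a hypothetical helper.

```php
<?php
// Summarises an oplog entry that has already been decoded into a PHP array.
// describeOplogEntry() is a hypothetical helper, not a driver function.

function describeOplogEntry(array $entry)
{
    // Single-letter operation codes used in oplog entries.
    $operations = [
        'i' => 'insert',
        'u' => 'update',
        'd' => 'delete',
        'c' => 'command',
        'n' => 'no-op',
    ];
    $op = isset($operations[$entry['op']]) ? $operations[$entry['op']] : 'unknown';

    return sprintf('%s on %s', $op, $entry['ns']);
}

echo describeOplogEntry(['op' => 'i', 'ns' => 'test.foo', 'o' => ['x' => 1]]), "\n";
// insert on test.foo
echo describeOplogEntry(['op' => 'u', 'ns' => 'test.foo', 'o' => ['$set' => ['x' => 2]]]), "\n";
// update on test.foo
```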
Fun with MongoDB’s oplog
Syncing MongoDB to Solr with PHP
MongoDB river plugin for Elasticsearch
Building real-time systems @ Stripe
Scalable oplog tailing @ Meteor
Image Credits
Books designed by Catherine Please from the Noun Project
Aggregator designed by stuart mcmorris from the Noun Project
Register designed by Wilson Joseph from the Noun Project
Ouroboros designed by Silas Reeves from the Noun Project
http://mariompittore.com/wp-content/uploads/2013/08/Social-Gnomes1.png