MongoDB + Pig on Hadoop (MongoSV 2012)
DESCRIPTION
Slides from Mortar co-founder Jeremy Karn's presentation at MongoSV 2012. Learn to process MongoDB data with Hadoop, specifically with Apache Pig. Jeremy's presentation covered the steps needed to read JSON from MongoDB into Pig, process it in parallel on Hadoop with sophisticated functions, and write the results back to MongoDB. The talk demonstrates these concepts with Mortar, which has contributed to the mongo-hadoop connector, extending it to work with Pig.
TRANSCRIPT
MongoDB + Pig
Jeremy Karn, co-founder, Mortar
Overview of This Session
• Intro to Hadoop
• Intro to Pig
• Why MongoDB + Pig?
• Demo: loading data into Pig
• Demo: processing data with Pig
• Demo: storing data from Pig to MongoDB
Hadoop: Rapid Overview
MapReduce: a programming model from Google (Jeff Dean and Sanjay Ghemawat)
• Hadoop implements MapReduce in Java (Doug Cutting)
• Incubated at Yahoo
• Used for indexing, spam detection, and more
Hadoop: Strengths
• Scalable
• Open source
• Lots of momentum
• Very broadly applicable: social graphs, prediction, detection, genetics
Hadoop: Problems
• Difficult
• Batch only (...or it was)
Hadoop: The Future
• YARN: MapReduce becomes optional; generic management for distributed apps
• Impala
Alternatives to Hadoop: MongoDB Native MapReduce
Write MapReduce in JavaScript, but:
• JavaScript is not fast
• It has limited data types
• It is hard to use complex analytic libraries
• It adds load to the data store
By contrast, Hadoop has libraries for:
• Machine learning
• ETL
• Any JVM analytic library
And many organizations already use Hadoop.
Alternatives to Hadoop: MongoDB Aggregation Framework
Great when:
• Doing SQL-style aggregation
• You do not require external data libraries
• Users are willing to learn the framework
But you may want Hadoop when:
• Doing sophisticated aggregation
• You require external data libraries
• Users are unwilling to learn the framework
• You need to move the workload off the datastore
Pig on Hadoop
• Less code
• Expressive code
Pig: Brief and Expressive, Like Procedural SQL
(thanks: Twitter's Hadoop World presentation)
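To give a sense of Pig's brevity, here is the classic word-count example in Pig Latin (a hypothetical sketch; the input path and alias names are placeholders, not the script from the slide):

```pig
-- Load raw lines of text, split each line into words,
-- then count occurrences of each word.
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'output';
```

The equivalent hand-written Java MapReduce job typically runs to well over a hundred lines, which is the comparison the next slide makes.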
The Same Script, in MapReduce (for serious)
Pig on Hadoop
• Less code
• Expressive code
• Compiles to MapReduce
• Insulates you from the MapReduce API
• Popular (LinkedIn, Twitter, Salesforce, Yahoo, Stanford)
MongoDB + Pig: Motivations
Data storage and data processing are often separate concerns
Hadoop is built for scalable processing of large datasets
MongoDB and Pig: A Similar Stance
Poly-structured data:
• MongoDB stores data, regardless of structure
• Pig reads data, regardless of structure
(Pig got its name because pigs are omnivorous)
JSON-to-Pig Data Type Mapping

JSON      Pig
string    chararray
integer   int
boolean   boolean
double    double
array     bag
object    map/tuple
null      null
MongoDB-to-Pig Data Type Mapping

MongoDB      Pig
date         datetime
object id    chararray
binary data  bytearray
regexp       chararray
code         chararray
Mortar: Fast Intro
An open-source, code-based development framework for data, built on Hadoop and Pig
Inspired by Rails
Self-contained, organized, executable projects
> gem install mortar
> mortar new my_project
Our service hosts and executes Mortar projects:
> mortar jobs:run your_pigscript --clustersize 5
Browser-only interface, great for demonstrating Hadoop
Loading Data: MongoDB + Pig
One requirement:
• You must specify the top-level fields to load from the MongoDB collection.
Optional:
• Specify a subset of embedded fields
• Specify the data type for any or all fields
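A minimal loading sketch using the connector's MongoLoader, which takes a schema string naming the top-level fields and their Pig types. It assumes the connector jars are already registered (see the "Without Mortar" section); the connection string and field names are hypothetical, and constructor arguments may differ across connector versions:

```pig
-- Load three top-level fields from a MongoDB collection,
-- declaring a Pig type for each; unlisted fields are skipped.
emails = LOAD 'mongodb://localhost:27017/enron.messages'
         USING com.mongodb.hadoop.pig.MongoLoader(
             'body:chararray, mailbox:chararray, headers:map[]');
```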
Loading Data: The Enron Dataset (MongoDB + Pig)
{
  "body": "the ... person...",
  "subFolder": "notes_inbox",
  "mailbox": "bass-e",
  "filename": "450.",
  "headers": {
    "From": "[email protected]",
    "To": "[email protected]",
    "Subject": "Subject",
    "Date": "Mon, 14 May 2001 16:39:00 -0700 (PDT)"
  }
}
Script Demo: MongoDB + Pig
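The live demo script itself is not captured in the slides. A minimal sketch of the kind of pipeline being demonstrated, using field names from the Enron sample above (all aliases and the connection string are hypothetical):

```pig
-- Count messages per mailbox in the Enron email collection.
emails  = LOAD 'mongodb://localhost:27017/enron.messages'
          USING com.mongodb.hadoop.pig.MongoLoader(
              'mailbox:chararray, body:chararray');
by_box  = GROUP emails BY mailbox;
counts  = FOREACH by_box GENERATE group AS mailbox,
                                  COUNT(emails) AS num_messages;
ordered = ORDER counts BY num_messages DESC;
```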
Store Statement: MongoDB + Pig
The MongoStorage function takes an optional list of arguments of two kinds:
• A single set of keys to base updates on, with three options: none, update, or multi.
• Any number of indexes to ensure, in the same format as db.col.ensureIndex().
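A sketch of both forms of the store statement. The exact argument syntax shown here follows the mongo-hadoop connector documentation of this era and may differ by connector version; the collection names and key fields are hypothetical:

```pig
-- Plain insert into a collection (no optional arguments)
STORE counts INTO 'mongodb://localhost:27017/enron.mailbox_counts'
      USING com.mongodb.hadoop.pig.MongoStorage();

-- Update existing documents matching the key set, and ensure an index
-- on that key (argument format is illustrative; check your version)
STORE counts INTO 'mongodb://localhost:27017/enron.mailbox_counts'
      USING com.mongodb.hadoop.pig.MongoStorage(
          'update [mailbox]',
          '{mailbox : 1}, {unique : true}');
```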
Pig: Illustrate
• Auto-select a dataset
• Exercise every execution path
• Step-by-step execution
Pig: Why Illustrate?
• Write correct code quickly
• Understand others' code
• Test every execution path, every step
Pig: User-Defined Functions (UDFs)
• Pig is like procedural SQL
• UDFs enable rich data manipulation
• UDFs are traditionally written in Java
• We made Pig work with CPython (NumPy, etc.)
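A sketch of calling a Python UDF from Pig. The streaming_python interface is Mortar's CPython integration (Pig's built-in alternative is jython, which cannot use C extensions like NumPy); the file name, function name, and field are hypothetical:

```pig
-- my_udfs.py is assumed to define a function such as:
--   @outputSchema('len:int')
--   def body_length(body): return len(body or '')
REGISTER 'my_udfs.py' USING streaming_python AS my_udfs;

emails  = LOAD 'mongodb://localhost:27017/enron.messages'
          USING com.mongodb.hadoop.pig.MongoLoader('body:chararray');
lengths = FOREACH emails GENERATE my_udfs.body_length(body) AS len;
```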
MongoDB + Pig, Without Mortar
Get the mongo-hadoop connector:http://github.com/mongodb/mongo-hadoop
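Outside Mortar, you register the connector jars yourself before loading. A sketch (jar names and paths are placeholders for whatever you build from the repository):

```pig
REGISTER /path/to/mongo-java-driver.jar;
REGISTER /path/to/mongo-hadoop-core.jar;
REGISTER /path/to/mongo-hadoop-pig.jar;

-- Schema-less load: each document arrives as a single Pig map
raw = LOAD 'mongodb://localhost:27017/mydb.mycollection'
      USING com.mongodb.hadoop.pig.MongoLoader();
```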
MongoDB + Pig: Summary
• Hadoop and friends are maturing
• MongoDB and Pig are philosophically aligned
• Reading from and writing to Pig is straightforward
• Once data is in Pig (Hadoop):
  • massive batch calculations / analytics become possible
  • work is offloaded from the datastore
  • external libraries are available