MongoDB and Hadoop
TRANSCRIPT
![Page 3: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/3.jpg)
Agenda
Evolving Data Landscape
MongoDB & Hadoop Use Cases
MongoDB Connector Features
Demo
![Page 4: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/4.jpg)
Evolving Data Landscape
![Page 5: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/5.jpg)
• Terabyte and Petabyte datasets
• Data warehousing
• Advanced analytics
Hadoop
“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.”
http://hadoop.apache.org
![Page 6: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/6.jpg)
Enterprise IT Stack
![Page 7: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/7.jpg)
Operational vs. Analytical: Enrichment
Warehouse, Analytics
Applications, Interactions
![Page 8: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/8.jpg)
Operational: MongoDB
First-Level Analytics
Internet of Things
Social
Mobile Apps
Product/Asset Catalog
Security & Fraud
Single View
Customer Data Management
Churn Analysis
Risk Modeling
Sentiment Analysis
Trade Surveillance
Recommender
Warehouse & ETL
Ad Targeting
Predictive Analytics
![Page 9: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/9.jpg)
Analytical: Hadoop
First-Level Analytics
Internet of Things
Social
Mobile Apps
Product/Asset Catalog
Security & Fraud
Single View
Customer Data Management
Churn Analysis
Risk Modeling
Sentiment Analysis
Trade Surveillance
Recommender
Warehouse & ETL
Ad Targeting
Predictive Analytics
![Page 10: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/10.jpg)
Operational & Analytical: Lifecycle
First-Level Analytics
Internet of Things
Social
Mobile Apps
Product/Asset Catalog
Security & Fraud
Single View
Customer Data Management
Churn Analysis
Risk Modeling
Sentiment Analysis
Trade Surveillance
Recommender
Warehouse & ETL
Ad Targeting
Predictive Analytics
![Page 11: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/11.jpg)
MongoDB & Hadoop Use Cases
![Page 12: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/12.jpg)
Commerce
Applications powered by
Analysis powered by
Products & Inventory
Recommended products
Customer profile
Session management
Elastic pricing
Recommendation models
Predictive analytics
Clickstream history
MongoDB Connector for Hadoop
![Page 13: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/13.jpg)
Insurance
Applications powered by
Analysis powered by
Customer profiles
Insurance policies
Session data
Call center data
Customer action analysis
Churn analysis
Churn prediction
Policy rates
MongoDB Connector for Hadoop
![Page 14: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/14.jpg)
Fraud Detection
MongoDB Connector for Hadoop
Payments Nightly Analysis
3rd Party Data Sources
Results Cache
Fraud Detection
Query Only
Query Only
![Page 15: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/15.jpg)
MongoDB Connector for Hadoop
![Page 16: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/16.jpg)
Connector Overview
DATA
• Read/Write MongoDB
• Read/Write BSON
TOOLS
• MapReduce
• Pig
• Hive
• Spark
PLATFORMS
• Apache Hadoop
• Cloudera CDH
• Hortonworks HDP
• MapR
• Amazon EMR
![Page 17: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/17.jpg)
Connector Features and Functionality
• Computes splits to read data
- Single Node, Replica Sets, Sharded Clusters
• Mappings for Pig and Hive
- MongoDB as a standard data source/destination
• Support for
- Filtering data with MongoDB queries
- Authentication
- Reading from Replica Set tags
- Appending to existing collections
![Page 18: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/18.jpg)
MapReduce Configuration
• MongoDB input/output
mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
mongo.input.uri = mongodb://mydb:27017/db1.collection1
mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
mongo.output.uri = mongodb://mydb:27017/db1.collection2
• BSON input/output
mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
mapred.input.dir = hdfs:///tmp/database.bson
mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
mapred.output.dir = hdfs:///tmp/output.bson
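One possible way to hand the MongoDB input/output properties above to a job is as Hadoop `-D key=value` generic options. The sketch below assembles them in Python. The helper names are hypothetical and not part of the connector, the URIs are the slide's placeholders, and it assumes the job driver accepts generic options (for example through `ToolRunner`).

```python
# Hypothetical helpers (not part of the connector): collect the job
# properties shown on the slide and render them as Hadoop -D options.

MONGO_INPUT_FORMAT = "com.mongodb.hadoop.MongoInputFormat"
MONGO_OUTPUT_FORMAT = "com.mongodb.hadoop.MongoOutputFormat"


def mongo_job_properties(input_uri, output_uri):
    """Properties for reading from and writing to live MongoDB collections."""
    return {
        "mongo.job.input.format": MONGO_INPUT_FORMAT,
        "mongo.input.uri": input_uri,
        "mongo.job.output.format": MONGO_OUTPUT_FORMAT,
        "mongo.output.uri": output_uri,
    }


def as_generic_options(properties):
    """Render properties as a flat ["-D", "key=value", ...] argument list."""
    args = []
    for key in sorted(properties):
        args += ["-D", f"{key}={properties[key]}"]
    return args


if __name__ == "__main__":
    props = mongo_job_properties(
        "mongodb://mydb:27017/db1.collection1",
        "mongodb://mydb:27017/db1.collection2",
    )
    print(" ".join(as_generic_options(props)))
```

Keeping the properties in a dictionary makes it easy to swap the URIs per environment before the job is launched.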
![Page 19: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/19.jpg)
Pig Mappings
• Input: BSONLoader and MongoLoader
data = LOAD 'mongodb://mydb:27017/db.collection' using com.mongodb.hadoop.pig.MongoLoader;
• Output: BSONStorage and MongoInsertStorage
STORE records INTO 'hdfs:///output.bson' using com.mongodb.hadoop.pig.BSONStorage;
![Page 20: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/20.jpg)
Hive Support
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONStorageHandler
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")
![Page 21: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/21.jpg)
Spark
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and data URI
• Load/save data using SparkContext Hadoop file API
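The three steps above could look roughly like this in PySpark. This is a sketch under assumptions: `sc` is an existing SparkContext, the connector and MongoDB Java driver jars are on the Spark classpath, and the key and value class names follow a commonly used pattern rather than anything stated on the slide. The function names are hypothetical.

```python
# Sketch only: `sc` is an existing SparkContext, and the mongo-hadoop and
# MongoDB Java driver jars are assumed to be on the Spark classpath.


def mongo_input_conf(uri):
    """Hadoop configuration entries telling the connector what to read."""
    return {"mongo.input.uri": uri}


def load_collection(sc, uri):
    """Return an RDD of (key, document) pairs read through MongoInputFormat."""
    return sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        # Assumed key/value classes; adjust to what the connector emits.
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf=mongo_input_conf(uri),
    )
```

A call such as `load_collection(sc, "mongodb://host:27017/test.users")` would then yield an RDD that ordinary Spark transformations can consume.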
![Page 22: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/22.jpg)
Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS
Dynamic queries with most recent data
Puts load on operational database
Snapshots move load to Hadoop
Snapshots add predictable load to MongoDB
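For the snapshot route, a minimal sketch of the two commands involved: `mongodump` to export a collection as BSON, then `hdfs dfs -put` to copy the file into HDFS. The helper name, host, and paths are hypothetical, and the sketch only builds the command lines rather than running them.

```python
import shlex


def snapshot_commands(host, db, collection, dump_dir, hdfs_dir):
    """Build the commands to dump one collection as BSON and copy it to HDFS."""
    dump = [
        "mongodump",
        "--host", host,
        "--db", db,
        "--collection", collection,
        "--out", dump_dir,
    ]
    # mongodump writes each collection to <out>/<db>/<collection>.bson
    bson_file = f"{dump_dir}/{db}/{collection}.bson"
    put = ["hdfs", "dfs", "-put", bson_file, hdfs_dir]
    return [dump, put]


if __name__ == "__main__":
    # Placeholder host and paths, for illustration only.
    for cmd in snapshot_commands(
        "localhost:27017", "movielens", "ratings", "/tmp/dump", "/data/movielens"
    ):
        print(shlex.join(cmd))
```

Running the dump on a schedule is what makes the load on MongoDB predictable: the operational database is read once per snapshot instead of once per analytical query.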
![Page 23: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/23.jpg)
Demo: Recommendation Platform
![Page 24: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/24.jpg)
Movie Web
![Page 25: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/25.jpg)
MovieWeb Web Application
• Browse
- Top movies by ratings count
- Top genres by movie count
• Log in to
- See My Ratings
- Rate movies
• Recommendations
- Movies You May Like
- Recommendations
![Page 26: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/26.jpg)
MovieWeb Components
• MovieLens dataset
– 10M ratings, 10K movies, 70K users
– http://grouplens.org/datasets/movielens/
• Python web app to browse movies, recommendations
– Flask, PyMongo
• Spark app computes recommendations
– MLlib collaborative filter
• Predicted ratings are exposed in web app
– New predictions collection
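As an illustration of how the web app might read from the predictions collection, here is a minimal sketch using PyMongo's `find`, `sort`, and `limit` calls. The field names `userid`, `movieid`, and `rating` are assumptions, not taken from the demo, and the function name is hypothetical.

```python
def top_predictions(predictions, user_id, limit=10):
    """Return (movie id, predicted rating) pairs, highest rating first.

    `predictions` is a PyMongo collection; the field names used here
    (userid, movieid, rating) are assumed for illustration.
    """
    cursor = (
        predictions.find({"userid": user_id})
        .sort("rating", -1)  # -1 sorts in descending order
        .limit(limit)
    )
    return [(doc["movieid"], doc["rating"]) for doc in cursor]
```

In a Flask view this could be called with something like `MongoClient()["movielens"]["predictions"]` as the first argument, with the result passed to the template that renders "Movies You May Like".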
![Page 27: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/27.jpg)
Spark Recommender
• Apache Hadoop (2.3)
- HDFS & YARN
- Top genres by movie count
• Spark (1.0)
- Execute within YARN
- Assign executor resources
• Data
- From HDFS, MongoDB
- To MongoDB
![Page 28: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/28.jpg)
MovieWeb Workflow
Snapshot db as BSON
Store BSON in HDFS
Read BSON into Spark App
Train Model from existing ratings
Create user movie pairing
Predict ratings for all pairings
Write Prediction to MongoDB collection
Web Application exposes recommendations
Repeat Process
![Page 29: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/29.jpg)
Execution
$ spark-submit --master local \
    --driver-memory 2G --executor-memory 2G \
    --jars mongo-hadoop-core.jar,mongo-java-driver.jar \
    --class com.mongodb.workshop.SparkExercise \
    ./target/spark-1.0-SNAPSHOT.jar \
    hdfs://localhost:9000 \
    mongodb://127.0.0.1:27017/movielens \
    predictions
![Page 30: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/30.jpg)
Should I use MongoDB or Hadoop?
![Page 31: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/31.jpg)
Business First!
First-Level Analytics
Internet of Things
Social
Mobile Apps
Product/Asset Catalog
Security & Fraud
Single View
Customer Data Management
Churn Analysis
Risk Modeling
Sentiment Analysis
Trade Surveillance
Recommender
Warehouse & ETL
Ad Targeting
Predictive Analytics
What/Why How
![Page 32: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/32.jpg)
The right tool for the task
• Dataset size
• Data processing complexity
• Continuous improvement
V1.0
![Page 33: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/33.jpg)
The right tool for the task
• Dataset size
• Data processing complexity
• Continuous improvement
V2.0
![Page 34: MongoDB et Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022081513/5591b0db1a28ab1b518b4705/html5/thumbnails/34.jpg)
Resources / Questions
• MongoDB Connector for Hadoop
- http://github.com/mongodb/mongo-hadoop
• Getting Started with MongoDB and Hadoop
- http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
• MongoDB-Spark Demo
- https://github.com/crcsmnky/mongodb-hadoop-workshop