gl conference2014 deployment_rajat
DESCRIPTION
Using GraphLab to build big data analytics pipelines and manage them in production. A presentation from the 3rd annual GraphLab Conference.
TRANSCRIPT
![Page 1: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/1.jpg)
GraphLab in Production: Data Pipelines
Rajat Arya, Software Engineer
July 21, 2014
![Page 2: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/2.jpg)
Reusable components
Runs on Hadoop CDH5 now; Pivotal, Spark coming…
Runs on Cloud EC2 now; Azure, Google coming…
Data pipelines & Predictive services
Clean → Learn → Deploy
GraphLab Data Pipeline
Beyond batch & stream processing
Predictive applications require real-time service
Deployed directly from data pipeline
GraphLab Predictive Service
Monitor from GraphLab Canvas
![Page 3: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/3.jpg)
Sample Data Pipeline
A Simple Recommender System
Train Model Recommend Persist
• Source: Raw data from CSV
• Tasks: Train Model, Produce Recommendations, Persist
• Destination: Write to Database
![Page 4: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/4.jpg)
Sample Prototype
![Page 5: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/5.jpg)
Sample Prototype
MESSY NOT MODULAR
FILE PATHS NOT PORTABLE
![Page 6: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/6.jpg)
Typical Challenges to Production
• Refactor code to remove magic numbers, file paths, support dynamic config
• Rewrite entire prototype in ‘production’ language
• Build / integrate workflow support tools
• Build / integrate monitoring & management tools
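The first challenge, moving magic numbers and file paths into dynamic configuration, can be sketched in plain Python. This is a minimal illustration, not code from the talk: `DEFAULTS`, `load_config`, and the `PIPELINE_` environment-variable prefix are all hypothetical names chosen for the example.

```python
import os

# Hypothetical defaults; every name here is illustrative, not from the talk.
DEFAULTS = {"csv": "ratings.csv", "num_recs": 10}


def load_config(overrides=None, environ=None):
    """Merge defaults, environment variables, and explicit overrides."""
    environ = os.environ if environ is None else environ
    config = dict(DEFAULTS)
    # Environment variables such as PIPELINE_CSV override the defaults.
    for key in DEFAULTS:
        env_key = "PIPELINE_" + key.upper()
        if env_key in environ:
            # Cast to the type of the default so "25" becomes 25.
            config[key] = type(DEFAULTS[key])(environ[env_key])
    # Explicit overrides win over everything else.
    config.update(overrides or {})
    return config


config = load_config(
    overrides={"csv": "/data/prod/ratings.csv"},
    environ={"PIPELINE_NUM_RECS": "25"},
)
```

With this shape the prototype code reads `config["csv"]` instead of embedding a path, so the same script can point at a laptop file or a production location.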
![Page 7: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/7.jpg)
Typical Challenges to Production
• Refactor code to remove magic numbers, file paths, support dynamic config
• Rewrite entire prototype in ‘production’ language
• Build / integrate workflow support tools
• Build / integrate monitoring & management tools
GraphLab Create provides a better way …
![Page 8: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/8.jpg)
Sample Data Pipeline
[Diagram: TRAIN → RECOMMEND → PERSIST, with named slots csv, model, users]

    import graphlab as gl

    def train_model(task):
        csv = task.params['csv']
        data = gl.SFrame.read_csv(csv)
        model = gl.recommender.create(data)
        task.outputs['model'] = model
        task.outputs['users'] = data

§ Code can be Python functions or file(s)
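To make the `task.params` / `task.outputs` pattern concrete without a GraphLab installation, here is a minimal pure-Python sketch. The `Task` class below is a hypothetical stand-in written for this example, not GraphLab's own class, and the "model" is just an item-popularity count rather than a real recommender.

```python
import csv
import io


class Task:
    """Hypothetical stand-in for a pipeline task with named params and outputs."""

    def __init__(self, name, code, params=None):
        self.name = name
        self.code = code                  # a plain Python function taking the task
        self.params = dict(params or {})  # configuration supplied from outside
        self.inputs = {}                  # named data from upstream tasks
        self.outputs = {}                 # named data for downstream tasks

    def run(self):
        self.code(self)
        return self.outputs


def train_model(task):
    # Read rows from whatever CSV source the 'csv' parameter names.
    rows = list(csv.DictReader(task.params["csv"]))
    # Placeholder "model": item popularity counts (assumption for illustration).
    counts = {}
    for row in rows:
        counts[row["item"]] = counts.get(row["item"], 0) + 1
    task.outputs["model"] = counts
    task.outputs["users"] = sorted({row["user"] for row in rows})


# An in-memory CSV keeps the sketch runnable without any files on disk.
source = io.StringIO("user,item\nann,a\nbob,a\nbob,b\n")
train = Task("train", train_model, params={"csv": source})
outputs = train.run()
```

Because the function only touches `task.params` and `task.outputs`, nothing in it refers to a fixed path, which is the portability problem the prototype slide called out.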
![Page 9: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/9.jpg)
Sample Data Pipeline
[Diagram: TRAIN → RECOMMEND → PERSIST, with named slots csv, model, users, recs]

    def gen_recs(task):
        model = task.inputs['model']
        users = task.inputs['users']
        recs = model.recommend(users)
        task.outputs['recs'] = recs

§ Code can be Python functions or file(s)
§ Dependencies managed logically by name
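One way to picture "dependencies managed logically by name" is a runner that keeps a dictionary of named values and hands each task the entries it asks for. The sketch below is an assumption about how such wiring could work, not GraphLab's implementation; `run_pipeline` and the stub `train` / `recommend` functions are hypothetical.

```python
def run_pipeline(tasks, params):
    """Run (name, function, input_names) tuples in order, wiring data by name."""
    store = dict(params)  # every named value available so far
    for name, func, needs in tasks:
        missing = [n for n in needs if n not in store]
        if missing:
            raise KeyError(f"task {name!r} is missing inputs: {missing}")
        # Each task returns a dict of named outputs that later tasks can request.
        store.update(func(**{n: store[n] for n in needs}))
    return store


def train(csv):
    # Stub: a real task would read the CSV and fit a model.
    return {"model": {"top_item": "a"}, "users": ["ann", "bob"]}


def recommend(model, users):
    # Stub: recommend the same top item to every user.
    return {"recs": {user: model["top_item"] for user in users}}


result = run_pipeline(
    [("train", train, ["csv"]), ("recommend", recommend, ["model", "users"])],
    {"csv": "ratings.csv"},
)
```

Since `recommend` only names `model` and `users`, any upstream task that produces those two names can be substituted without editing it.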
![Page 10: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/10.jpg)
Sample Data Pipeline
[Diagram: TRAIN → RECOMMEND → PERSIST, with named slots csv, model, users, recs]

    def persist_db(task):
        recs = task.inputs['recs']
        conn = task.params['conn']
        import mysqlconnector
        save_to_db(conn, recs.save(format…))

§ Code can be Python functions or file(s)
§ Dependencies managed logically by name
§ Set required Python packages so Task is portable
§ Automatic installation and configuration prior to execution
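The slide does not show how required packages are declared or installed. As a rough illustration of the detection half only, the sketch below checks which declared packages are importable in the current environment using the standard library. `missing_packages` and the package list are hypothetical; the installation step itself is not shown.

```python
import importlib.util


def missing_packages(required):
    """Return the top-level package names in `required` that cannot be imported here."""
    return [name for name in required if importlib.util.find_spec(name) is None]


# 'json' ships with Python; the second name is deliberately nonexistent.
required = ["json", "no_such_package_for_demo"]
missing = missing_packages(required)
```

An execution environment could run a check like this before starting a task and install whatever is reported missing.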
![Page 11: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/11.jpg)
Sample Data Pipeline
[Diagram: TRAIN → RECOMMEND → PERSIST, with named slots csv, model, users, recs, plus an INTERN TRAIN task]
§ Tasks are modular and reusable, enabling incremental development and rapid iterations
![Page 12: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/12.jpg)
Sample Data Pipeline
[Diagram: TRAIN → RECOMMEND → PERSIST, with named slots csv, model, users, recs, plus an INTERN TRAIN task]
§ Tasks are modular and reusable, enabling incremental development and rapid iterations
![Page 13: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/13.jpg)
Executing Data Pipelines

    job = gl.deploy.job.create(
        [train, recommend, persist],
        environment='cdh5-prod')

• One way to create Jobs (with task bindings)
![Page 14: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/14.jpg)
Executing Data Pipelines

    job = gl.deploy.job.create(
        [train, recommend, persist],
        environment='cdh5-prod')

• One way to create Jobs (with task bindings)
• One way to monitor Jobs
![Page 15: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/15.jpg)
Executing Data Pipelines

    job = gl.deploy.job.create(
        [train, recommend, persist],
        environment='ec2-prod')

• One way to create Jobs (with task bindings)
• One way to monitor Jobs
• Run on Hadoop, EC2, or locally without changing code
![Page 16: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/16.jpg)
Executing Data Pipelines

    job = gl.deploy.job.create(
        [train, recommend, persist],
        environment='cdh5-prod')

• One way to create Jobs (with task bindings)
• One way to monitor Jobs
• Run on Hadoop, EC2, or locally without changing code
• Recall previous Jobs and Tasks, maintain workbench
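The idea that one task list can run in different environments by changing only a name can be sketched as below. This is not the `gl.deploy` API: `Job`, `EXECUTORS`, and `create_job` are hypothetical, and the "cluster" executors simply run in-process so the example stays runnable.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical job record: a task list tagged with a target environment."""
    tasks: list
    environment: str
    status: str = "Pending"
    log: list = field(default_factory=list)


# Hypothetical executors keyed by environment name. A real Hadoop or EC2
# executor would ship the task to a cluster; these run it in-process.
EXECUTORS = {
    "local": lambda task: task(),
    "cdh5-prod": lambda task: task(),
    "ec2-prod": lambda task: task(),
}


def create_job(tasks, environment):
    """Run the tasks in order in the named environment and record the outcome."""
    if environment not in EXECUTORS:
        raise ValueError(f"unknown environment: {environment!r}")
    job = Job(tasks=list(tasks), environment=environment)
    execute = EXECUTORS[environment]
    job.status = "Running"
    try:
        for task in job.tasks:
            job.log.append((task.__name__, execute(task)))
        job.status = "Completed"
    except Exception as exc:  # record the failure so the job can be inspected
        job.status = "Failed"
        job.log.append(("error", repr(exc)))
    return job


def train():
    return "model trained"


def recommend():
    return "recs generated"


def persist():
    return "recs saved"


jobs = [
    create_job([train, recommend, persist], env)
    for env in ("local", "cdh5-prod", "ec2-prod")
]
```

The task functions are identical across the three jobs; only the environment string differs, and the `status` and `log` fields give a single place to look when monitoring.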
![Page 17: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/17.jpg)
GraphLab Data Pipeline Demo
![Page 18: Gl conference2014 deployment_rajat](https://reader033.vdocuments.mx/reader033/viewer/2022051412/5482098db07959600c8b46b3/html5/thumbnails/18.jpg)
GraphLab Data Pipeline Recap
Define it once
Run & monitor it anywhere
All in GraphLab Create