Designing Data Pipelines Using Hadoop
DESCRIPTION
This presentation covers the design principles and techniques used to build data pipelines, taking into consideration the following aspects: architecture evolution, capacity, data quality, performance, flexibility, and alignment with business objectives. The discussion is set in the context of managing a pipeline with multi-petabyte data sets and a code base composed of Java map/reduce jobs with HBase integration, Hive scripts, and Kafka/Storm inputs. We'll talk about how to make sure that data pipelines have the following features: 1) assurance that the input data is ready at each step; 2) workflows that are easy to maintain; 3) data quality and validation built into the architecture. Part of the presentation is dedicated to showing how to organize the warehouse using layers of data sets. A suggested starting point for these layers is: 1) raw input (logs, messages, etc.); 2) logical input (scrubbed data); 3) foundational warehouse data (the most relevant joins); 4) departmental/project data sets; and 5) report data sets (used by traditional report engines). The final part discusses the design of a rule-based system to perform validation and trending reporting.
TRANSCRIPT
## Slide 1
Rocket Fuel
Big Data and Artificial Intelligence for Digital Advertising
Abhijit Pol, Marilson Campos
Designing Data Pipelines
July 2013
## Slide 2
What Do We Do?
[Diagram: the Rocket Fuel platform. A page request from a web browser leads to an ad request from publishers and a bid request via an ad exchange; the real-time bidder makes automated decisions using a response-prediction model, and the winning ad is served to the user. User engagement is recorded in the campaign & user data warehouse, which refreshes learning, qualifies the audience, and optimizes ads & budget. Data partners and exchange partners feed the platform.]
## Slide 3
How Big Is This Problem Each Day?
Trades on NASDAQ
Facebook Page Views
Searches on Google
Bid Requests Considered by Rocket Fuel
## Slide 4
How Big Is This Problem Each Day?
Trades on NASDAQ: 10 million
Facebook Page Views: 30 billion
Searches on Google: ~5 billion
Bid Requests Considered by Rocket Fuel: ~20 billion
## Slide 5
BIG DATA + AI
## Slide 6
Advertising That Learns
## Slide 7
Outline
• Architecture Evolution
• Hurdles and Challenges Faced
• Data Pipelines Best Practices
## Slide 8
Architecture for Growth
• 20 GB/month to 2 PB/month in 3 years
• New and complex requirements
• More consumers
• Rapid growth
## Slide 9
How We Started
## Slide 10
Architecture 2.0
## Slide 11
Current Architecture
## Slide 12
Outline
• Architecture Evolution
• Hurdles and Challenges Faced
• Data Pipelines Best Practices
## Slide 13
Hurdles and Challenges Faced
• Exponential data growth and user queries
• Network issues
• Bots
• Bad user queries
## Slide 14
Outline
• Architecture Evolution
• Hurdles and Challenges Faced
• Data Pipelines Best Practices
## Slide 15
Data Pipeline Design Best Practices
• Job Design / Consistency
• Job Features / Avoid Re-Work
• Golden Input / Shadow Cluster
• Data Collection
• Dashboard
## Slide 16
Job Design / Consistency
• Idempotent
• Execution by different users
• Account for execution time
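The consistency goals above can be sketched in code. A minimal example, assuming a date-partitioned warehouse layout (all paths and names here are hypothetical, not Rocket Fuel's actual code): the step is keyed by the logical run date rather than wall-clock time, writes to a temporary directory, and publishes with a single rename, so a re-run by any user is a safe no-op and a half-finished run never looks complete.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class IdempotentStep {
    // Produce one dated partition of a data set. Keyed by the logical run
    // date (not wall-clock time), so every re-run targets the same output
    // path; written to a temp dir and renamed into place, so readers never
    // see partial output. Names and layout are illustrative assumptions.
    static Path runStep(Path warehouse, String dataset, String logicalDate) throws IOException {
        Path finalDir = warehouse.resolve(dataset).resolve("dt=" + logicalDate);
        if (Files.exists(finalDir)) {
            return finalDir;  // already produced: re-run is a no-op
        }
        Path tmpDir = warehouse.resolve("_tmp_" + dataset + "_" + logicalDate);
        Files.createDirectories(tmpDir);
        Files.writeString(tmpDir.resolve("part-00000"), "...job output...\n");
        Files.createDirectories(finalDir.getParent());
        Files.move(tmpDir, finalDir);  // publish with one rename: no partial state
        return finalDir;
    }

    public static void main(String[] args) throws IOException {
        Path wh = Files.createTempDirectory("warehouse");
        Path first = runStep(wh, "logical_input", "2013-07-01");
        Path second = runStep(wh, "logical_input", "2013-07-01");  // safe re-run
        System.out.println(first.equals(second));  // true
    }
}
```

The same pattern maps directly to HDFS: write under a staging path, then `rename()` into the final partition directory.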
## Slide 17
Job Execution Timeline
## Slide 18
Job Features / Re-Work
• Smaller jobs
• Record completion of steps
## Slide 19
Recording Completion Times
[Flowchart: for each step of a workflow, job, or script:
Start → is the mark already there?
  Yes → end (skip the step).
  No → execute the work for the step → create the mark → collect other data (optional) → end.]
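The flow above can be sketched with a marker file per step (the `_DONE` file name is an assumption, not from the deck): check for the mark, skip the work if it exists, otherwise do the work and then create the mark.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class StepMarks {
    // Sketch of the completion-mark pattern: returns true if the step's
    // work was executed, false if the mark was already there and the
    // step was skipped. "Collect other data" would go where the mark
    // is created (e.g. record the completion timestamp).
    static boolean runStep(Path stepDir, Runnable work) throws IOException {
        Path mark = stepDir.resolve("_DONE");
        if (Files.exists(mark)) {
            return false;              // mark already there: skip the step
        }
        Files.createDirectories(stepDir);
        work.run();                    // execute the work for the step
        Files.createFile(mark);        // create the mark
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("wf").resolve("extract_fields");
        System.out.println(runStep(dir, () -> System.out.println("working")));  // true: ran
        System.out.println(runStep(dir, () -> System.out.println("working")));  // false: skipped
    }
}
```

Because each step is guarded by its own mark, a failed workflow can simply be restarted from the top: completed steps fall through instantly and only the unfinished ones run.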
## Slide 20
Golden Input / Shadow Cluster
• Integration tests on realistic data sets
• Safe environment to innovate
## Slide 21
Data Collection: Delivery Time View
[Diagram: a data product produced by workflows that fan out into jobs, Hive/Pig scripts, and SSH scripts; delivery time is collected at every node of the tree.]
## Slide 22
Data Collection: Data Profiles View
[Diagram: a data product as a graph of data sets (nodes) and transformations (edges). Profile metrics collected: record size & type, job counts, join success ratios, data set consistency.]
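One of the profile metrics above, the join success ratio, can be derived from job counters. A minimal sketch, assuming counters are exported as a name-to-count map (the counter names are hypothetical):

```java
import java.util.Map;

public class JoinSuccessRatio {
    // Fraction of join-input records that found a match, taken from
    // (hypothetical) job counters. A sudden drop in this ratio between
    // runs is a typical data-quality alarm signal.
    static double ratio(Map<String, Long> counters) {
        long input = counters.getOrDefault("JOIN_INPUT_RECORDS", 0L);
        long matched = counters.getOrDefault("JOIN_MATCHED_RECORDS", 0L);
        return input == 0 ? 0.0 : (double) matched / input;
    }

    public static void main(String[] args) {
        double r = ratio(Map.of("JOIN_INPUT_RECORDS", 1000L,
                                "JOIN_MATCHED_RECORDS", 970L));
        System.out.println(r);  // 0.97
    }
}
```

In a Hadoop job the two counts would come from custom map/reduce counters incremented as records are joined.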
## Slide 23
Data Collection Hierarchy
[Diagram: example hierarchy of workflow/job/script steps and data products. Workflows wk_external_events and wk_build_profile contain steps such as extract_fields, consolidate_metrics, load_into_data_centers, extract_features, and compact_user_profile, producing the user_profile data product.]
## Slide 24
Golden Input / Shadow Cluster
• Integration tests on realistic data sets
• Safe environment to innovate
## Slide 25
Dashboard
• Delivery time
• Data profile ratios
• Counters
• Alarms
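The alarm idea can be sketched as a simple rule in the spirit of the rule-based validation system from the abstract: flag a dashboard metric when it drifts outside an expected band (the metric name and thresholds below are illustrative assumptions):

```java
public class RatioAlarm {
    // One alarm rule: a named metric with an expected [min, max] band.
    final String metric;
    final double min;
    final double max;

    RatioAlarm(String metric, double min, double max) {
        this.metric = metric;
        this.min = min;
        this.max = max;
    }

    // Fires when the observed value falls outside the expected band.
    boolean fires(double observed) {
        return observed < min || observed > max;
    }

    public static void main(String[] args) {
        RatioAlarm a = new RatioAlarm("join_success_ratio", 0.95, 1.0);
        System.out.println(a.fires(0.97));  // false: within band
        System.out.println(a.fires(0.80));  // true: drifted low
    }
}
```

A dashboard would evaluate a table of such rules against each run's collected profile metrics and surface the ones that fire.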
## Slide 26
Thank you
www.rocketfuel.com