Designing Data Pipelines Using Hadoop
DESCRIPTION
This presentation covers the design principles and techniques used to build data pipelines, taking into consideration the following aspects: architecture evolution, capacity, data quality, performance, flexibility, and alignment with business objectives. The discussion is set in the context of managing a pipeline with multi-petabyte data sets and a code base composed of Java map/reduce jobs with HBase integration, Hive scripts, and Kafka/Storm inputs. We'll talk about how to make sure that data pipelines have the following features: 1) assurance that the input data is ready at each step; 2) workflows that are easy to maintain; 3) data quality and validation built into the architecture. Part of the presentation is dedicated to showing how to organize the warehouse using layers of data sets. A suggested starting point for these layers is: 1) raw input (logs, messages, etc.); 2) logical input (scrubbed data); 3) foundational warehouse data (the most relevant joins); 4) departmental/project data sets; and 5) report data sets (used by traditional report engines). The final part discusses the design of a rule-based system to perform validation and trending reporting.
TRANSCRIPT
## Slide 1
Rocket Fuel
Big Data and Artificial Intelligence for Digital Advertising
Abhijit Pol, Marilson Campos
Designing Data Pipelines
July 2013
## Slide 2
What Do We Do?
[Diagram: the Rocket Fuel platform. A page request from a web browser leads to an ad request from publishers and a bid request via an ad exchange; the real-time bidder makes automated decisions using a response-prediction model, and the winning ad is served to the user. User engagement is recorded in the campaign & user data warehouse, which refreshes learning, qualifies the audience, and optimizes ads & budget. Data partners and exchange partners feed the platform.]
## Slide 3
How Big Is This Problem Each Day?
Trades on NASDAQ
Facebook Page Views
Searches on Google
Bid Requests Considered by Rocket Fuel
## Slide 4
How Big Is This Problem Each Day?
Trades on NASDAQ: 10 million
Facebook Page Views: 30 billion
Searches on Google: ~5 billion
Bid Requests Considered by Rocket Fuel: ~20 billion
## Slide 5
BIG DATA + AI
## Slide 6
Advertising That Learns
## Slide 7
Outline
• Architecture Evolution
• Hurdles and Challenges Faced
• Data Pipelines Best Practices
## Slide 8
Architecture for Growth
• 20 GB/month to 2 PB/month in 3 years
• New and complex requirements
• More consumers
• Rapid growth
## Slide 9
How We Started
## Slide 10
Architecture 2.0
## Slide 11
Current Architecture
## Slide 12
Outline
• Architecture Evolution
• Hurdles and Challenges Faced
• Data Pipelines Best Practices
## Slide 13
Hurdles and Challenges Faced
• Exponential data growth and user queries
• Network issues
• Bots
• Bad user queries
## Slide 14
Outline
• Architecture Evolution
• Hurdles and Challenges Faced
• Data Pipelines Best Practices
## Slide 15
Data Pipeline Design Best Practices
• Job Design / Consistency
• Job Features / Avoid Re-Work
• Golden Input / Shadow Cluster
• Data Collection
• Dashboard
## Slide 16
Job Design / Consistency
• Idempotent
• Execution by different users
• Account for execution time
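The consistency goals above can be sketched in code. A minimal example, assuming a date-partitioned warehouse layout (all paths and names here are hypothetical, not Rocket Fuel's actual code): the step is keyed by the logical run date rather than wall-clock time, writes to a temporary directory, and publishes with a single rename, so a re-run by any user is a safe no-op and a half-finished run never looks complete.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class IdempotentStep {
    // Produce one dated partition of a data set. Keyed by the logical run
    // date (not wall-clock time), so every re-run targets the same output
    // path; written to a temp dir and renamed into place, so readers never
    // see partial output. Names and layout are illustrative assumptions.
    static Path runStep(Path warehouse, String dataset, String logicalDate) throws IOException {
        Path finalDir = warehouse.resolve(dataset).resolve("dt=" + logicalDate);
        if (Files.exists(finalDir)) {
            return finalDir;  // already produced: re-run is a no-op
        }
        Path tmpDir = warehouse.resolve("_tmp_" + dataset + "_" + logicalDate);
        Files.createDirectories(tmpDir);
        Files.writeString(tmpDir.resolve("part-00000"), "...job output...\n");
        Files.createDirectories(finalDir.getParent());
        Files.move(tmpDir, finalDir);  // publish with one rename: no partial state
        return finalDir;
    }

    public static void main(String[] args) throws IOException {
        Path wh = Files.createTempDirectory("warehouse");
        Path first = runStep(wh, "logical_input", "2013-07-01");
        Path second = runStep(wh, "logical_input", "2013-07-01");  // safe re-run
        System.out.println(first.equals(second));  // true
    }
}
```

The same pattern maps directly to HDFS: write under a staging path, then `rename()` into the final partition directory.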
## Slide 17
Job Execution Timeline
## Slide 18
Job Features / Re-Work
• Smaller jobs
• Record completion of steps
## Slide 19
Recording Completion Times
[Flowchart: for each step of a workflow, job, or script:
Start → is the mark already there?
  Yes → end (skip the step).
  No → execute the work for the step → create the mark → collect other data (optional) → end.]
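The flow above can be sketched with a marker file per step (the `_DONE` file name is an assumption, not from the deck): check for the mark, skip the work if it exists, otherwise do the work and then create the mark.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class StepMarks {
    // Sketch of the completion-mark pattern: returns true if the step's
    // work was executed, false if the mark was already there and the
    // step was skipped. "Collect other data" would go where the mark
    // is created (e.g. record the completion timestamp).
    static boolean runStep(Path stepDir, Runnable work) throws IOException {
        Path mark = stepDir.resolve("_DONE");
        if (Files.exists(mark)) {
            return false;              // mark already there: skip the step
        }
        Files.createDirectories(stepDir);
        work.run();                    // execute the work for the step
        Files.createFile(mark);        // create the mark
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("wf").resolve("extract_fields");
        System.out.println(runStep(dir, () -> System.out.println("working")));  // true: ran
        System.out.println(runStep(dir, () -> System.out.println("working")));  // false: skipped
    }
}
```

Because each step is guarded by its own mark, a failed workflow can simply be restarted from the top: completed steps fall through instantly and only the unfinished ones run.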
## Slide 20
Golden Input / Shadow Cluster
• Integration tests on realistic data sets
• Safe environment to innovate
## Slide 21
Data Collection: Delivery Time View
[Diagram: a data product produced by workflows that fan out into jobs, Hive/Pig scripts, and SSH scripts; delivery time is collected at every node of the tree.]
## Slide 22
Data Collection: Data Profiles View
[Diagram: a data product as a graph of data sets (nodes) and transformations (edges). Profile metrics collected: record size & type, job counts, join success ratios, data set consistency.]
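One of the profile metrics above, the join success ratio, can be derived from job counters. A minimal sketch, assuming counters are exported as a name-to-count map (the counter names are hypothetical):

```java
import java.util.Map;

public class JoinSuccessRatio {
    // Fraction of join-input records that found a match, taken from
    // (hypothetical) job counters. A sudden drop in this ratio between
    // runs is a typical data-quality alarm signal.
    static double ratio(Map<String, Long> counters) {
        long input = counters.getOrDefault("JOIN_INPUT_RECORDS", 0L);
        long matched = counters.getOrDefault("JOIN_MATCHED_RECORDS", 0L);
        return input == 0 ? 0.0 : (double) matched / input;
    }

    public static void main(String[] args) {
        double r = ratio(Map.of("JOIN_INPUT_RECORDS", 1000L,
                                "JOIN_MATCHED_RECORDS", 970L));
        System.out.println(r);  // 0.97
    }
}
```

In a Hadoop job the two counts would come from custom map/reduce counters incremented as records are joined.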
## Slide 23
Data Collection Hierarchy
[Diagram: example hierarchy of workflow/job/script steps and data products. Workflows wk_external_events and wk_build_profile contain steps such as extract_fields, consolidate_metrics, load_into_data_centers, extract_features, and compact_user_profile, producing the user_profile data product.]
## Slide 24
Golden Input / Shadow Cluster
• Integration tests on realistic data sets
• Safe environment to innovate
## Slide 25
Dashboard
• Delivery time
• Data profile ratios
• Counters
• Alarms
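The alarm idea can be sketched as a simple rule in the spirit of the rule-based validation system from the abstract: flag a dashboard metric when it drifts outside an expected band (the metric name and thresholds below are illustrative assumptions):

```java
public class RatioAlarm {
    // One alarm rule: a named metric with an expected [min, max] band.
    final String metric;
    final double min;
    final double max;

    RatioAlarm(String metric, double min, double max) {
        this.metric = metric;
        this.min = min;
        this.max = max;
    }

    // Fires when the observed value falls outside the expected band.
    boolean fires(double observed) {
        return observed < min || observed > max;
    }

    public static void main(String[] args) {
        RatioAlarm a = new RatioAlarm("join_success_ratio", 0.95, 1.0);
        System.out.println(a.fires(0.97));  // false: within band
        System.out.println(a.fires(0.80));  // true: drifted low
    }
}
```

A dashboard would evaluate a table of such rules against each run's collected profile metrics and surface the ones that fire.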
## Slide 26
Thank you
www.rocketfuel.com