apache hadoop india summit 2011 talk "online content optimization using hadoop" by shail...

Yahoo!

Online Content Optimization using Hadoop

Shail [email protected]

Hadoop Summit 2011

What do we do?

“Deliver the right CONTENT to the right USER at the right TIME”

o Effectively and “pro-actively” learn from user interactions with the content that is displayed, to maximize our objectives

A new scientific discipline at the interface of
o Large scale Machine Learning and Statistics
o Multi-objective optimization in the presence of uncertainty
o User understanding
o Content understanding

Content Relevance at Yahoo!

Important

Editors

Popular

Personal / Social

Editorial: 10s of Items

Science: Millions of Items

Content Ranking Problems

Most Popular
Most engaging overall based on objective metrics

Related Items and Context-Sensitive Models
Behavioral Affinity: People who did X, did Y
Most engaging in this page/section/property/device/referral context?

Deep Personalization
Most relevant to me based on my deep interests (entities, sources, categories, keywords)

Real-time Dashboard

Voice and Business Rules

Revenue Optimization

Light Personalization
More relevant to me based on my age, gender, location, and property usage

Most Popular + Per User History
Rotate stories I’ve already seen

Layout Optimization
Which modules/ad units should be shown to this user in this context?

Yahoo Frontpage

Today Module (Light personalization)

Personal Assistant (Light personalization)

Trending Now (Most popular)

National News (Most Popular + User History bucket)

Deals (Most popular)

Recommendation: A Match-making Problem

Opportunity: users, queries, pages, …

Item Inventory: articles, web pages, ads, …

Use an automated algorithm to select item(s) to show

Get feedback (click, time spent, …)

Refine the models

Repeat (large number of times)

Measure metric(s) of interest (total clicks, total revenue, …)

• Recommendation problems
• Search: Web, Vertical
• Online advertising
• …

Problem Characteristics: Today module

Traffic obtained from a controlled randomized experiment

Things to note:
a) Short lifetimes
b) Temporal effects
c) Often a breaking news story

Scale: Why use Hadoop?

• Million events per second (user view/click, content update)

• Hundreds of GB data collected and modeled per run

• Millions of items in pool

• Millions of user profiles

• Tens of thousands of Features (Content and/or User)

Data Flow

Optimization Engine

Content feed with biz rules

Explore ~1%

Exploit ~99%

Near Real-time Feedback

Content Metadata

Dashboard

Optimized Module

Real-time Insights

Rules Engine
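
The explore/exploit split in this data flow can be illustrated with a small sketch. The slide only gives the traffic proportions (~1% explore, ~99% exploit); it does not say which selection scheme is used. The code below swaps in a plain epsilon-greedy policy as a stand-in, and every name in it (`choose_item`, `record_feedback`, `estimate_ctr`, `explore_rate`) is hypothetical.

```python
import random

def choose_item(ctr_estimates, explore_rate=0.01, rng=random):
    """Pick one item id and report which bucket served it.

    Assumption: epsilon-greedy selection. With probability explore_rate a
    random item is shown (explore); otherwise the item with the highest
    estimated click-through rate is shown (exploit).
    """
    if not ctr_estimates:
        raise ValueError("no items available to rank")
    if rng.random() < explore_rate:
        return rng.choice(sorted(ctr_estimates)), "explore"
    best = max(ctr_estimates, key=ctr_estimates.get)
    return best, "exploit"

def record_feedback(counts, item_id, clicked):
    """Fold one view (and an optional click) into per-item (views, clicks)."""
    views, clicks = counts.get(item_id, (0, 0))
    counts[item_id] = (views + 1, clicks + (1 if clicked else 0))

def estimate_ctr(counts):
    """Turn (views, clicks) counters into click-through-rate estimates."""
    return {item: (clicks / views if views else 0.0)
            for item, (views, clicks) in counts.items()}
```

A serving loop would call `choose_item` per request, log the outcome with `record_feedback`, and periodically refresh the estimates with `estimate_ctr`, which mirrors the near real-time feedback arrow in the diagram.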

How it happens?

At time ‘t’, User ‘u’ (user attr: age, gen, loc) interacted with Content ‘id’ at
Position ‘o’
Property/Site ‘p’
Section ‘s’
Module ‘m’
International ‘i’

User Events

Item Metadata

Modeling

ITEM Model

USER Model

Content ‘id’ has associated metadata ‘meta’
meta = {entity, keyword, geo, topic, category}

Feature Generation

Additional Content & User Feature Generation

Item  BASE  M     F     ATTR  CAT_Sports
id1    0.8  +1.2  -1.5  -0.9  1.0
id2   -0.9  -0.9  +2.6  +0.3  1.0

Item BASE M F ATTR CAT_Sports

u1 0.8 1 1 0.2

u2 -0.9 1 -1.2

STORE: PNUTS

5 min latency

Ranking, B-Rules

Request

5 – 30 min latency

SLA 50 ms – 200 ms

STORE: HBASE
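
To make the ITEM model table concrete, here is a minimal sketch of how per-item weights might be combined with a user's features at ranking time. The weights are the ones shown in the item table on this slide. The combination rule is an assumption: the slide does not state it, so the sketch uses a logistic function over a weighted sum, and the feature encoding (`BASE` always 1, `M`/`F` as gender indicators) is likewise hypothetical.

```python
import math

# Per-item weights copied from the item table on the slide.
ITEM_MODEL = {
    "id1": {"BASE": 0.8, "M": 1.2, "F": -1.5, "ATTR": -0.9, "CAT_Sports": 1.0},
    "id2": {"BASE": -0.9, "M": -0.9, "F": 2.6, "ATTR": 0.3, "CAT_Sports": 1.0},
}

def score(item_weights, user_features):
    """Assumed scoring rule: sigmoid of the weighted sum of user features.

    Features missing from user_features are treated as 0.
    """
    z = sum(weight * user_features.get(name, 0.0)
            for name, weight in item_weights.items())
    return 1.0 / (1.0 + math.exp(-z))

def rank(item_model, user_features):
    """Return item ids ordered from highest to lowest score."""
    return sorted(item_model,
                  key=lambda item_id: score(item_model[item_id], user_features),
                  reverse=True)

# Hypothetical user feature vectors for illustration only.
male_user = {"BASE": 1.0, "M": 1.0}
female_user = {"BASE": 1.0, "F": 1.0}
```

Under these assumptions `rank(ITEM_MODEL, male_user)` puts `id1` first and `rank(ITEM_MODEL, female_user)` puts `id2` first, because the `M` and `F` weights pull the two items in opposite directions.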

Technology Stack

Analytics and Debugging

Ingest

Modeling Framework

Global state provided by HBase

Hadoop processing via a collection of PIG UDFs

Different flows for modeling or stages assembled in PIG

o OLR, Clustering, Affinity, Regression Models, Decompositions (Cholesky…)

o Timeseries models (generally trends – extract of user activity on content)

Configuration-based behavior for various stages of modeling

o Type of Features to be generated

o Type of joins to perform – User / Item / Feature

Input : DFS and/or HBase

Output: DFS and/or HBase

HBase

ITEM Model
• Stores item related features
• Stores ITEM x USER FEATURES model
• Stores parameters about the item, like view count, click count, unique user count
• 10s of millions of items
• Updated every 5 minutes

USER Model
• Stores USER x CONTENT FEATURES model for each individual user by either a Unique ID
• Stores summarized user history – essential for modeling in terms of item decay
• Millions of profiles
• Updated every 5 to 30 minutes

TERM Model
• Inverts the Item Table and stores statistics for the terms
• Used to find the trending features and provide baselines for user features
• Millions of terms and hundreds of parameters tracked
• Updated every 5 minutes
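
The idea of inverting the Item Table into per-term statistics can be sketched as follows. This is an in-memory illustration only, not the HBase schema or the PIG flow used in the talk; the row layout (`terms`, `views`, `clicks`) and the function names are assumptions made for the example.

```python
from collections import defaultdict

def invert_item_table(item_table):
    """Aggregate per-item counters into per-term statistics.

    item_table maps item id -> {"terms": [...], "views": int, "clicks": int}.
    Each distinct term on an item receives that item's views and clicks.
    """
    term_stats = defaultdict(lambda: {"items": 0, "views": 0, "clicks": 0})
    for row in item_table.values():
        for term in set(row["terms"]):
            stats = term_stats[term]
            stats["items"] += 1
            stats["views"] += row["views"]
            stats["clicks"] += row["clicks"]
    return dict(term_stats)

def term_ctr(stats):
    """Click-through rate for one term; 0.0 when the term has no views."""
    return stats["clicks"] / stats["views"] if stats["views"] else 0.0

# Hypothetical sample rows for illustration.
SAMPLE_ITEMS = {
    "id1": {"terms": ["sports", "cricket"], "views": 100, "clicks": 10},
    "id2": {"terms": ["sports"], "views": 50, "clicks": 20},
}
SAMPLE_TERMS = invert_item_table(SAMPLE_ITEMS)
```

Comparing a term's click-through rate across successive 5-minute runs is one simple way such a table could surface trending features, though the talk does not spell out the statistic it uses.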

Grid Edge Services

Keeps MR jobs lean and mean

Allows non-gridifiable solutions to be deployed easily

Have different scaling characteristics (E.g. Memory, CPU)

Provide gateway for accessing external data sources in M/R

Map and/or Reduce steps interact with Edge Services using a standard client

Examples

Categorization

Geo Tagging

Feature Transformation

Analytics and Debugging

Provides the ability to debug modeling issues in near real time

Run complex queries for analysis

Easy to use interface

PMs, engineers, and researchers use this cluster to get near-real-time insights

10s of modeling monitoring and reporting queries every 5 minutes

We use HIVE

Learnings

PIG & HBase have been the best combination so far

Made it simple to build different kinds of science models

Point lookup using HBase has proven to be very useful

Modeling = Matrices

HBase provides a natural way to represent and access them

Edge Services

Have simplified the whole stack

Management (upgrades, outages) has been easy

HIVE has provided us a great way for analyzing the results

PIG was also considered

Thank you

Shail [email protected]
