apache hadoop india summit 2011 talk "online content optimization using hadoop" by shail...
TRANSCRIPT
Yahoo!
What do we do?
“Deliver the right CONTENT to the right USER at the right TIME”
o Effectively and “pro-actively” learn from user interactions with the content that is displayed, to maximize our objectives
A new scientific discipline at the interface of
o Large scale Machine Learning and Statistics
o Multi-objective optimization in the presence of uncertainty
o User understanding
o Content understanding
Content Relevance at Yahoo!
Important
Editors
Popular
Personal / Social
Editorial: 10s of Items
Science: Millions of Items
Content Ranking Problems
Most Popular: Most engaging overall based on objective metrics
Related Items and Context-Sensitive Models: Behavioral Affinity: People who did X, did Y. Most engaging in this page/section/property/device/referral context?
Deep Personalization: Most relevant to me based on my deep interests (entities, sources, categories, keywords)
Real-time Dashboard
Voice and Business Rules
Revenue Optimization
Light Personalization: More relevant to me based on my age, gender, location, and property usage
Most Popular + Per User History: Rotate stories I’ve already seen
Layout Optimization: Which modules/ad units should be shown to this user in this context?
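The "Most Popular + Per User History" case can be illustrated with a minimal sketch. It assumes popularity is a per-item click-through rate and that a user's history is a set of item ids already seen; the function name, the item ids, and the CTR values are hypothetical and are not taken from the talk.

```python
from typing import Dict, List, Set


def rank_most_popular(ctr_by_item: Dict[str, float],
                      seen_by_user: Set[str],
                      k: int = 3) -> List[str]:
    """Return up to k items by descending CTR, with unseen items first."""
    # Ties on CTR are broken by item id so the order is deterministic.
    by_popularity = sorted(ctr_by_item, key=lambda item: (-ctr_by_item[item], item))
    unseen = [item for item in by_popularity if item not in seen_by_user]
    seen = [item for item in by_popularity if item in seen_by_user]
    # Rotate: stories already seen drop behind fresh ones instead of vanishing,
    # so the module can still be filled when the pool is small.
    return (unseen + seen)[:k]


# Made-up CTR values for illustration only.
ctr = {"story_a": 0.12, "story_b": 0.09, "story_c": 0.07, "story_d": 0.03}
ranking = rank_most_popular(ctr, seen_by_user={"story_a"}, k=3)
print(ranking)
```

Pushing seen stories to the back rather than filtering them out is one possible design choice; a production system could equally drop them or decay their score.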
Yahoo Frontpage
Today Module (Light Personalization)
Personal Assistant (Light Personalization)
Trending Now (Most Popular)
National News (Most Popular + User History bucket)
Deals (Most Popular)
Recommendation: A Match-making Problem
Opportunity: Users, queries, pages, …
Item Inventory: Articles, web pages, ads, …
Use an automated algorithm to select item(s) to show
Get feedback (click, time spent, ...)
Refine the models
Repeat (large number of times)
Measure metric(s) of interest (Total clicks, Total revenue, …)
• Recommendation problems
• Search: Web, Vertical
• Online advertising
• …
Problem Characteristics: Today Module
Traffic obtained from a controlled randomized experiment
Things to note:
a) Short lifetimes
b) Temporal effects
c) Often breaking news stories
Scale: Why use Hadoop?
• Million events per second (user view/click, content update)
• Hundreds of GB data collected and modeled per run
• Millions of items in pool
• Millions of user profiles
• Tens of thousands of Features (Content and/or User)
Data Flow
Optimization Engine
Content feed with biz rules
Explore ~1%
Exploit ~99%
Near Real-time Feedback
Content Metadata
Dashboard
Optimized Module
Real-time Insights
Rules Engine
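The explore/exploit split in this data flow can be sketched as a simple epsilon-greedy loop. This is an assumption made for illustration: the slide gives only the traffic shares (about 1% explore, 99% exploit), not the scheme used to allocate them. All names, the random seed, and the simulated click rates below are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Stats = Dict[str, Tuple[int, int]]  # item id -> (views, clicks)


def estimated_ctr(stats: Stats, item: str) -> float:
    """Observed clicks / views, or 0.0 for an item with no views yet."""
    views, clicks = stats.get(item, (0, 0))
    return clicks / views if views else 0.0


def choose_item(stats: Stats, pool: List[str], rng: random.Random,
                explore_rate: float = 0.01) -> str:
    """Serve a random item on explore traffic, else the best current estimate."""
    if rng.random() < explore_rate:
        return rng.choice(pool)
    return max(pool, key=lambda item: estimated_ctr(stats, item))


def record_feedback(stats: Stats, item: str, clicked: bool) -> None:
    """Fold one view (and possibly one click) back into the counters."""
    views, clicks = stats.get(item, (0, 0))
    stats[item] = (views + 1, clicks + int(clicked))


rng = random.Random(7)
pool = ["id1", "id2", "id3"]
stats: Stats = {}
true_ctr = {"id1": 0.02, "id2": 0.10, "id3": 0.05}  # made-up values for the simulation
for _ in range(10_000):
    item = choose_item(stats, pool, rng)
    record_feedback(stats, item, clicked=rng.random() < true_ctr[item])
print(stats)
```

The small explore bucket is what lets new or short-lived items collect unbiased feedback; the exploit bucket serves whatever the current estimates favor.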
How it happens?
At time ‘t’, User ‘u’ (user attr: age, gen, loc) interacted with Content ‘id’ at:
Position ‘o’
Property/Site ‘p’
Section ‘s’
Module ‘m’
International ‘i’
User Events
Item Metadata
Modeling
ITEM Model
USER Model
Content ‘id’ has associated metadata ‘meta’
meta = {entity, keyword, geo, topic, category}
Feature Generation
Additional Content & User Feature Generation
Item  BASE  M     F     ATTR  CAT_Sports
id1    0.8  +1.2  -1.5  -0.9  1.0
id2   -0.9  -0.9  +2.6  +0.3  1.0
User BASE M F ATTR CAT_Sports
u1 0.8 1 1 0.2
u2 -0.9 1 -1.2
STORE: PNUTS
5 min latency
Ranking
B-Rules
Request
5 – 30 min latency
SLA 50 ms – 200 ms
STORE: HBASE
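One way to read the item table on this slide is as a row of per-feature weights for each item, which the ranking step combines with a user's feature values at request time. The sketch below assumes a plain dot product as the scoring rule, which the slide does not state. The item weights are copied from the slide; the user feature vector and the function names are hypothetical.

```python
from typing import Dict, List

# Item rows as shown on the slide: one weight per feature column.
ITEM_WEIGHTS: Dict[str, Dict[str, float]] = {
    "id1": {"BASE": 0.8, "M": 1.2, "F": -1.5, "ATTR": -0.9, "CAT_Sports": 1.0},
    "id2": {"BASE": -0.9, "M": -0.9, "F": 2.6, "ATTR": 0.3, "CAT_Sports": 1.0},
}


def score(weights: Dict[str, float], features: Dict[str, float]) -> float:
    """Dot product over the features the user actually has."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rank(item_weights: Dict[str, Dict[str, float]],
         features: Dict[str, float]) -> List[str]:
    """Order item ids by descending score for this user."""
    return sorted(item_weights,
                  key=lambda item_id: score(item_weights[item_id], features),
                  reverse=True)


# Hypothetical user: bias term on, male, interested in sports.
user = {"BASE": 1.0, "M": 1.0, "CAT_Sports": 1.0}
ranked = rank(ITEM_WEIGHTS, user)
print(ranked)
```

Because scoring is a handful of multiplications over rows fetched by key, it fits the point-lookup access pattern and the 50 ms – 200 ms serving SLA mentioned on the slide.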
Technology Stack
Analytics and Debugging
Ingest
Modeling Framework
Global state provided by HBase
Hadoop processing via a collection of PIG UDFs
Different flows for modeling or stages assembled in PIG
o OLR, Clustering, Affinity, Regression Models, Decompositions (Cholesky…)
o Timeseries models (generally trends – extract of user activity on content)
Configuration-based behavior for various stages of modeling
o Type of Features to be generated
o Type of joins to perform – User / Item / Feature
Input: DFS and/or HBase
Output: DFS and/or HBase
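As an illustration of one model family listed above, here is a minimal sketch of an online logistic regression update, assuming that "OLR" stands for online logistic regression (the slide does not expand the acronym). It uses plain stochastic gradient descent in pure Python; the talk's actual implementation runs as PIG UDFs on Hadoop and is not shown in the slides. Feature names, the learning rate, and the sample events are hypothetical.

```python
import math
from typing import Dict


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: Dict[str, float], features: Dict[str, float]) -> float:
    """Predicted click probability for one (user, item) feature vector."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sigmoid(z)


def update(weights: Dict[str, float], features: Dict[str, float],
           clicked: bool, learning_rate: float = 0.1) -> None:
    """One gradient step on the log loss for a single observed event."""
    # The error is computed once, before any weight is changed.
    error = predict(weights, features) - (1.0 if clicked else 0.0)
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) - learning_rate * error * value


weights: Dict[str, float] = {}
events = [
    ({"BASE": 1.0, "M": 1.0, "CAT_Sports": 1.0}, True),
    ({"BASE": 1.0, "F": 1.0, "CAT_Sports": 1.0}, False),
]
for features, clicked in events:
    update(weights, features, clicked)
print(weights)
```

Each event touches only the weights of the features it contains, which is why such models can be refreshed incrementally from a stream of user events rather than retrained from scratch.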
HBase
ITEM Model
• Stores item-related features
• Stores ITEM x USER FEATURES model
• Stores parameters about each item, like view count, click count, unique user count
• 10s of Millions of Items
• Updated every 5 minutes
USER Model
• Stores USER x CONTENT FEATURES model for each individual user by either a Unique ID
• Stores summarized user history – Essential for Modeling in terms of item decay
• Millions of profiles
• Updated every 5 to 30 minutes
TERM Model
• Inverts the Item Table and stores statistics for the terms
• Used to find the trending features and provide baselines for user features
• Millions of terms and hundreds of parameters tracked
• Updated every 5 minutes
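The inversion behind the TERM model can be sketched in a few lines. This assumes each item row carries a list of terms plus view and click counts; the row layout, field names, and sample data are hypothetical, and a real table would live in HBase rather than in a Python dict.

```python
from collections import defaultdict
from typing import Dict


def invert_item_table(items: Dict[str, dict]) -> Dict[str, Dict[str, int]]:
    """Turn item -> terms into term -> aggregated statistics.

    Each item row is assumed to look like
    {"terms": [...], "views": int, "clicks": int}.
    """
    term_stats = defaultdict(lambda: {"items": 0, "views": 0, "clicks": 0})
    for row in items.values():
        # set() so that a term repeated within one item is counted once.
        for term in set(row["terms"]):
            stats = term_stats[term]
            stats["items"] += 1
            stats["views"] += row["views"]
            stats["clicks"] += row["clicks"]
    return dict(term_stats)


items = {
    "id1": {"terms": ["cricket", "india"], "views": 1000, "clicks": 80},
    "id2": {"terms": ["cricket", "finals"], "views": 500, "clicks": 20},
}
terms = invert_item_table(items)
print(terms["cricket"])
```

Aggregated per-term click and view counts of this kind are what make it possible to spot trending features and to set a baseline for a user feature that has little history of its own.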
Grid Edge Services
Keeps MR jobs lean and mean
Allows non-gridifiable solutions to be deployed and controlled easily
Have different scaling characteristics (e.g. memory, CPU)
Provides a gateway for accessing external data sources in M/R
Map and/or Reduce steps interact with Edge Services using a standard client
Examples
Categorization
Geo Tagging
Feature Transformation
Analytics and Debugging
Provides the ability to debug modeling issues in near real time
Run complex queries for analysis
Easy to use interface
PMs, Engineers, and Research use this cluster to get near-real-time insights
10s of modeling, monitoring, and reporting queries every 5 minutes
We use HIVE
Learnings
PIG & HBase have been the best combination so far
Made it simple to build different kinds of science models
Point lookup using HBase has proven to be very useful
Modeling = Matrices
HBase provides a natural way to represent and access them
Edge Services
Have provided simplicity to the whole stack
Management (upgrades, outages) has been easy
HIVE has provided us a great way to analyze the results
PIG was also considered