Making Advertising Personal, 4th NL Recommenders Meetup
Copyright © 2014 Criteo
Making Advertising Personal
Large Scale Real-Time Product Recommendation at Criteo
Olivier Koch & Romain Lerallut
May 19th, 2015
Performance Advertising
We buy
• Inventory! (ad spaces)
• Billions of times a day
• All over the Internet
• For 95% of the population
Where is the need for tech?
We sell
• Clicks!
• (that convert)
• (that convert a lot)
We take the risk
You pay only for what you get
Copyright © 2013. Confidential
International infrastructure
Key figures
Traffic
550 k HTTP requests / sec (peak activity)
23000 impressions /sec (peak activity)
180 k requests / sec on RTB (average)
Less than 10 ms to process an RTB request
Figures as of May 2013
Physical infrastructure
6 data centers on 3 continents, designed and operated in-house
~ 12000 servers, largest Hadoop cluster in Europe
Availability / Uptime >99.95%
Big Data: more than 20 PB of storage
Arbitrage
• Should we bid?
• At which price?
Recommendation
• Which products should we display?
Graphical optimization
• Big image vs small image
• Background color, ...
Prediction
• Generic prediction engine
• Specific models trained on TBs of data
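The two arbitrage questions can be illustrated with an expected-value sketch. Since advertisers pay only per click, one impression is worth roughly the predicted click probability times the price of a click. The function names, the `margin` parameter, and the numbers are hypothetical and only illustrate the reasoning; they do not describe Criteo's actual bidding logic.

```python
def max_bid_per_impression(p_click: float, cost_per_click: float,
                           margin: float = 0.2) -> float:
    """Highest price worth paying for one impression (hypothetical rule).

    expected revenue = P(click) * price the advertiser pays per click;
    a fraction `margin` of that revenue is kept back to cover the risk.
    """
    expected_revenue = p_click * cost_per_click
    return expected_revenue * (1.0 - margin)


def should_bid(p_click: float, cost_per_click: float,
               floor_price: float, margin: float = 0.2) -> bool:
    """Bid only if the impression is worth at least the floor price."""
    return max_bid_per_impression(p_click, cost_per_click, margin) >= floor_price
```

With a 1% predicted click rate and 0.50 paid per click, the impression is worth about 0.005 before margin, so the quality of the click prediction directly drives both decisions.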
Global Engine Architecture
Building ads with personalized content in realtime
Recommendation for Advertising
2.5 billion products
~ 8 ms response time
Data Sources
• Catalog data
  • Feed provided by the merchants
• User behavior data
  • Large scale intent data
  • All visits to merchant websites
  • Page views, basket, sales events
• Ad display data
  • Displayed and clicked ads
Recommendation execution flow
Candidates Generation
• Get candidates from all sources using user historical products and bestofs
Candidates Aggregation
• Remove duplicates
• Aggregate features
Preselection
• Call degraded prediction to decrease number of candidates
Scoring
• Call full prediction model to score each candidate
Winner Selector
• Select N products on score, randomization
Glup Logging
• Log recommended products, prediction variables
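The six stages above can be sketched as a single function. This is a minimal illustration, not the production code: the candidate shape (a dict with `product_id` and `features`), the callables passed in, and the shuffle used for "randomization" are all assumptions made for the example.

```python
import random
from typing import Callable, Dict, List

Candidate = Dict[str, object]  # hypothetical shape: {"product_id": ..., "features": {...}}


def recommend(user_history: List[str],
              sources: List[Callable[[List[str]], List[Candidate]]],
              fast_score: Callable[[Candidate], float],
              full_score: Callable[[Candidate], float],
              n_slots: int,
              preselect_k: int,
              rng: random.Random,
              log: Callable[[dict], None]) -> List[Candidate]:
    # 1. Candidates generation: query every source with the user's history.
    raw: List[Candidate] = []
    for source in sources:
        raw.extend(source(user_history))

    # 2. Candidates aggregation: one entry per product, features merged.
    merged: Dict[object, Candidate] = {}
    for cand in raw:
        entry = merged.setdefault(
            cand["product_id"],
            {"product_id": cand["product_id"], "features": {}})
        entry["features"].update(cand["features"])
    candidates = list(merged.values())

    # 3. Preselection: a cheap score keeps only the best preselect_k.
    candidates.sort(key=fast_score, reverse=True)
    shortlist = candidates[:preselect_k]

    # 4. Scoring: the full model runs on the shortlist only.
    scored = [(full_score(c), c) for c in shortlist]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    # 5. Winner selection: top N by score, then shuffled display order
    #    (one simple form of randomization, assumed for this sketch).
    winners = [c for _, c in scored[:n_slots]]
    rng.shuffle(winners)

    # 6. Logging: record what was shown and the scores behind it.
    log({"winners": [c["product_id"] for c in winners],
         "scores": {c["product_id"]: s for s, c in scored}})
    return winners
```

Passing the scorers and the logger in as callables keeps each stage replaceable, which mirrors the way the slide separates the stages.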
Issue #1: Retrieving user-specific products
• The need: storing [user → interesting products] vectors
  • Difficult to store and retrieve at scale (900+ million users)
  • Hard to keep up to date
• Using seen products as a proxy
  • Store the [user → viewed products] vectors
    • Easier to maintain
  • Store [viewed product → interesting products] vectors
    • Based on aggregated user behavior data
    • Computable offline
• Final ranking by an ML model
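A small sketch of the proxy idea: an offline job builds the [viewed product → interesting products] table from aggregated behavior, and the online path only needs the user's viewed products plus table lookups. Co-view counting is used here as an assumed similarity measure; the slides do not say which measure is actually computed, and both function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations
from typing import Dict, Iterable, List


def build_similar_products(sessions: Iterable[List[str]],
                           top_k: int = 3) -> Dict[str, List[str]]:
    """Offline step: for each product, the top_k products most often
    viewed by the same users (co-view counts, an assumed measure)."""
    co_views: Dict[str, Counter] = defaultdict(Counter)
    for viewed in sessions:
        for a, b in permutations(set(viewed), 2):
            co_views[a][b] += 1
    return {p: [q for q, _ in counts.most_common(top_k)]
            for p, counts in co_views.items()}


def candidates_for_user(viewed_products: List[str],
                        similar_products: Dict[str, List[str]]) -> List[str]:
    """Online step: expand the user's viewed products through the table,
    dropping duplicates while keeping first-seen order."""
    seen = set()
    out: List[str] = []
    for p in viewed_products:
        for q in similar_products.get(p, []):
            if q not in seen:
                seen.add(q)
                out.append(q)
    return out
```

The table is keyed by product rather than by user, so its size follows the catalog and it can be recomputed in batch without touching any per-user state.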
Issue #2: Scoring products
• Need to fuse data from several sources
  • Product-specific
  • User-specific
  • User-product interactions
  • Display-specific
• 1st solution: regression model
  • Predict P(product click then sale)
  • Easy to evaluate
• 2nd solution: ranking model
  • More appropriate for our needs
  • Still maximizing post-click sales
  • Can be evaluated only on multi-product banners
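The difference between the two solutions shows up in the loss each one minimizes. The sketch below contrasts a pointwise logistic loss (regression on a single product) with a pairwise logistic loss (ranking two products against each other). The pairwise form is a common choice assumed here for illustration; the slides do not specify which ranking loss is used.

```python
import math


def pointwise_log_loss(score: float, label: int) -> float:
    """Regression view: one product, one label (1 = click then sale)."""
    p = 1.0 / (1.0 + math.exp(-score))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def pairwise_logistic_loss(score_pos: float, score_neg: float) -> float:
    """Ranking view: penalize the model when the product that led to a
    sale is not scored above one that did not."""
    return math.log1p(math.exp(-(score_pos - score_neg)))
```

The pairwise loss needs two products with different outcomes shown in the same context, which is consistent with the slide's remark that the ranking model can be evaluated only on multi-product banners.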
Issue #3: Picking the right products
• Several questions:
  • User-product fatigue
  • Independent product choice assumption
  • Explore / exploit
• Solution 1: Randomization in the banners
  • Keep independent products assumption
  • Separate optimization process to shuffle displayed items
• Solution 2: A better scoring model
  • Score a full banner, not independent products
  • Store all product display counts
Issue #4: 8 ms response time! Tweaking for performance
• CTR / Sales prediction takes ~40 µs per candidate using in-house library
• 2-step prediction:
  • A fast pass to remove most of the candidates
  • A slow pass to accurately score the remaining candidates
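The numbers explain why two passes are needed. If the whole 8 ms budget were spent on serial prediction at about 40 µs per candidate, at most 8000 / 40 = 200 candidates could be scored, and in practice part of the budget goes to I/O and the other stages. The sketch below shows that arithmetic and a generic two-pass selection; the function names and the `keep` parameter are hypothetical.

```python
import heapq
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def max_full_scorings(budget_ms: float, cost_us_per_candidate: float) -> int:
    """Upper bound on serial full-model calls that fit in the budget."""
    return int(budget_ms * 1000 // cost_us_per_candidate)


def two_pass_top_n(candidates: Iterable[T],
                   fast_score: Callable[[T], float],
                   slow_score: Callable[[T], float],
                   keep: int, n: int) -> List[T]:
    """Fast pass keeps `keep` candidates; slow pass ranks only those."""
    shortlist = heapq.nlargest(keep, candidates, key=fast_score)
    return heapq.nlargest(n, shortlist, key=slow_score)
```

The slow model is called `keep` times regardless of how many candidates come in, so the latency of the accurate pass stays bounded.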
And the technical fine print:
• All real-time code in C#
• Async I/O for better efficiency
• HAProxy to scale the front-end
• Memcached to store all required data in memory
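The benefit of async I/O is that the lookups a request needs can be in flight at the same time instead of one after another. The sketch below shows the pattern in Python with `asyncio` for illustration only, since the real-time code is in C#. The in-memory dicts stand in for a key-value store, and `fetch`, `load_request_data`, and the simulated delay are hypothetical.

```python
import asyncio
from typing import Dict


async def fetch(store: Dict[str, object], key: str,
                delay_s: float = 0.01) -> object:
    """Simulated key-value lookup; the sleep stands in for a network round trip."""
    await asyncio.sleep(delay_s)
    return store.get(key)


async def load_request_data(user_id: str,
                            stores: Dict[str, Dict[str, object]]) -> Dict[str, object]:
    """Issue all lookups concurrently and wait for them together."""
    names = list(stores)
    values = await asyncio.gather(*(fetch(stores[n], user_id) for n in names))
    return dict(zip(names, values))
```

Run with `asyncio.run(load_request_data("u1", stores))`; the total wait is close to the slowest single lookup rather than the sum of all of them.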
Upcoming challenges
• Long(er)-term user profiles
• More and better product information (images, semantic, NLP)
• Vertical-optimized engine
  • Classifieds (catalog-free recommendation?)
  • Travel
• Instant update of similarities
  • (because batch computation is soooo last year)
Fancy a try?
On your own:
• Our 1st public dataset is online: http://bit.ly/1vgw2XC
  • 4GB display and click data, Kaggle challenge in 2014
• NEW: 1TB dataset released a few weeks ago
  • Hosted on Microsoft Azure, just waiting for you
With us!
http://www.criteo.com/careers/
Questions?
Thank you
[email protected]@criteo.com