Recommending Similar Items at Large Scale
Jay Katukuri
Merchandising Team - eBay
07/25/2012
Similar Items Clustering Platform
• Introduction
• Merchandising Challenges
• Similar Item Clustering (SIC) Architecture
• Clustering Approach
  – Features
  – Method
• Cluster Assignment Service
• Applications
  – Replacement/Equivalent items on CVIP non-winner
  – Related/Complementary items on Checkout
Introduction
• Grouping items that are similar to each other is essential for recommendation algorithms.
• Two distinct items can be considered similar if important features are similar:
  – Titles
  – Attributes
  – Images
• The Similar Item Clustering (SIC) platform creates clusters of items.
• These clusters are now used by various recommendation systems on the site.
Similar Recommendations: Before
Similar Recommendations: Before & After
Merchandising Challenges - Motivation for SIC
• Non-productized inventory, long tail
  – Product coverage exists only for a few categories
  – The majority of items are ad hoc listings not covered by the catalog taxonomy
  – Maintaining catalogs is a daunting task for the long tail
  – One-of-a-kind inventory; items are short-lived
• Unstructured data
  – Attribute coverage is minimal
• Sparsity in the transactional data
  – Very few purchases for certain kinds of items
Merchandising Challenges - Motivation for SIC
– Item-item pairs are supported by even fewer users
  • We may not see users buying both a product and its accessories on eBay.
• Large data
  – A much bigger data set, in both users and inventory, than other e-commerce sites
• Scale
  – Several hundred million listings
  – Several million new items every day
Similar Item Recommendations
Item Signatures: A Possibility?
• Cluster: “apple ipod touch 4g clear film protector screen”
• Cluster: “clarks women shoe pumps classics”
Similar Items: Clustering Architecture
[Architecture diagram]
• Off-line (slow, periodic): cluster generation runs on Hadoop and produces the cluster dictionary.
• Run-time (fast): an item goes through the cluster assignment service to the item-cluster index, which serves applications (merchandising, navigation, etc.).
Cluster Generation
Query-Item Set
• Use 1 month of user behavior data to collect the initial query set.
• Filter queries by length and by category-specific demand/supply ratios.
[Pipeline: Click-stream log (via the query backend) → Query normalization → Filter queries by demand/supply → Query-to-items data]
Query Selection
• Input data:
  – Click-stream logs
• Method for choosing the queries:
  – Minimum frequency
  – Average supply threshold
  – Min and max token constraints
  – Morphological constraints: queries that contain only numbers are not allowed, e.g. “10 5”
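The query-selection filters above can be sketched as follows. The threshold values here are hypothetical placeholders; the slides only name the constraints, and the real values would be tuned per category.

```python
import re

# Hypothetical thresholds -- illustrative only, not the production values.
MIN_FREQ = 50            # minimum query frequency in the click-stream logs
MIN_AVG_SUPPLY = 20      # average-supply threshold (matching items per query)
MIN_TOKENS, MAX_TOKENS = 2, 6

def select_query(query, freq, avg_supply):
    """Return True if a normalized query passes the selection filters."""
    tokens = query.split()
    if freq < MIN_FREQ or avg_supply < MIN_AVG_SUPPLY:
        return False
    if not (MIN_TOKENS <= len(tokens) <= MAX_TOKENS):
        return False
    # Morphological constraint: reject queries made only of numbers, e.g. "10 5".
    if all(re.fullmatch(r"\d+", t) for t in tokens):
        return False
    return True
```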
K-Means Clustering
[Pipeline: Query-to-items data → Base cluster generation → Generate item features (scoring models) → K-means clustering of base clusters → Split clusters]
• Use item title, category, and attributes as features for clustering.
• Applying k-means to each base cluster separately produces better-quality clusters and makes the process faster.
• Use cosine distance for item clustering.
• Cluster size is chosen as a tuning parameter.
Base Cluster Generation
• Base cluster ≡ query
• Find merge candidates based on query term overlap
  – E.g.: “nike airmax tennis shoes” -> “nike airmax”, “nike airmax tennis shoes” -> “nike shoes”
• Score candidates using cosine similarity
  – Term weight: TF-IDF in the query space (document = query)
    • TF: query demand
    • IDF: number of queries containing the term
• The most similar merge candidate wins
  – E.g.: “nike airmax tennis shoes” -> “nike airmax”
• Merge the corresponding recall sets
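A minimal sketch of the candidate-scoring step above: TF-IDF weights in the query space (TF = query demand, IDF from the number of queries containing the term) and cosine similarity over overlapping candidates. The helper names and weighting details are illustrative, not the production implementation.

```python
import math

def tfidf_vector(query_tokens, demand, df, n_queries):
    """TF = query demand; IDF = log(N / #queries containing the term)."""
    return {t: demand * math.log(n_queries / df[t]) for t in set(query_tokens)}

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def best_merge_candidate(query, candidates, vectors):
    """Among queries that overlap in terms, pick the most cosine-similar one."""
    overlapping = [c for c in candidates
                   if set(c.split()) & set(query.split()) and c != query]
    if not overlapping:
        return None
    return max(overlapping, key=lambda c: cosine(vectors[query], vectors[c]))
```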
Base Cluster Merge
• Reduces the number of base clusters by half.
• Example — all of the following base clusters merge into “phrase(hand,made) quilt”:
  phrase(hand,made) phrase(king,s) queen quilt
  phrase(hand,made) phrase(pink,s) quilt
  phrase(hand,made) phrase(pre,owned) queen quilt
  phrase(hand,made) queen quilt
  phrase(hand,made) phrase(pre,owned) quilt
  phrase(hand,made) quilt size twin
  phrase(hand,made) quilt silk
  phrase(hand,made) quilt twin
  phrase(hand,made) phrase(patch,work) quilt
  phrase(hand,made) quilt white
  phrase(hand,made) phrase(king,size) quilt
  phrase(hand,made) phrase(yo,yo,s) quilt
  phrase(hand,made) quilt sale
  phrase(hand,made) quilt red
  → phrase(hand,made) quilt
Item Features Generation
[Pipeline: Item title → Normalization → Concept extractor → Expansions → Normalized item features]
• Item title: “3x clear screen protector film skin for apple ipod touch 4 4g”
• After normalization: “3-x clear screen protector film skin for apple ipod touch 4 4-g”
• After concept extraction: “3-x color=clear type=‘screen protector’ film skin compatible brand=‘for apple’ compatible product=‘ipod touch’ 4 model=4-g”
• After expansions: “PHRASE(3,x) color=clear type=‘screen protector’ OR(film,films) OR(skin,skins) compatible brand=‘for apple’ compatible product=‘ipod touch’ 4 model=4-g”
Item Features: Concept Extraction
• Problem: extract concepts from the item title.
• Purpose:
  – Attribute coverage is sparse in many categories.
  – Extracted concepts can be used as features.
• Approach:
  – A fast online service to extract entities from any eBay text (item title, product title, etc.)
  – Batch capability, so it can be used on Hadoop
  – Restricted to known and important (above a certain threshold) names/values
  – Unsupervised model: a statistical approach based on a large amount of data
Examples
Unstructured Item Title → Extracted Structured Data
• “Women’s black dress size 16 worn once” → Size: 16; Gender: Women; Color: Black; Style: Dress
• “Gucci medium ivory leather handbag” → Brand: Gucci; Size: Medium; Color: Ivory; Material: Leather; Style: Handbag
• “Black Leather Case Cover for Reader Amazon Kindle 3 3G” → Brand: Amazon Kindle 3; Model: 3G; Type: Leather Case; Color: Black
(Item IDs: 380361729748 — Meta: Computers & Networking; 300477503372 and 300494995198 — Meta: CSA)
Dictionary Generation Method
[Pipeline: Data warehouse → Data cleansing → Dictionary generation → Concept dictionary]
• Co-occurrence matrix of name/values
• TF-IDF scores of name/values in a category
• Other dictionaries used:
  – Units dictionary
  – Synonym names
  – Famous-persons list
Item Features: Concept Extraction
• Co-occurrence of concepts is used to approximate the joint probability.
  – E.g. brand=apple, model=iphone 4
• Using dictionaries at multiple levels reduces the ambiguity of the same value having multiple names:
  – “apple” is “compatible brand” in an accessories category
  – “apple” is “brand” in a devices category
• “hp pavilion” and “hp” are both valid values for brand; the ambiguity is resolved using TF-IDF scores of name/value pairs in the particular category.
• Regexes were added to extract size patterns in CSA (Clothing, Shoes & Accessories).
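A toy sketch of dictionary-based concept extraction with per-category disambiguation, as described above. The dictionary entries and TF-IDF scores are made up for illustration; the production dictionaries are generated from the data warehouse.

```python
# Hypothetical concept dictionary: (category, value) -> (name, tf-idf score).
# Note how the same value ("apple") maps to different names per category,
# and "hp pavilion" vs "hp" are disambiguated by preferring longer matches.
CONCEPT_DICT = {
    ("accessories", "apple"): ("compatible brand", 3.2),
    ("devices", "apple"): ("brand", 4.1),
    ("devices", "hp pavilion"): ("brand", 5.0),
    ("devices", "hp"): ("brand", 2.1),
}

def extract_concepts(title, category, max_ngram=2):
    """Greedy longest-match extraction of known name/value pairs from a title."""
    tokens = title.lower().split()
    concepts, i = [], 0
    while i < len(tokens):
        for n in range(max_ngram, 0, -1):  # prefer 'hp pavilion' over 'hp'
            value = " ".join(tokens[i:i + n])
            if (category, value) in CONCEPT_DICT:
                name, _score = CONCEPT_DICT[(category, value)]
                concepts.append((name, value))
                i += n
                break
        else:
            i += 1  # token matches no known concept; skip it
    return concepts
```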
Item Features: Term Scores
• Problem: given an item title in a leaf category, compute the significance of the terms in the title.
  – While assigning items to clusters, identify which terms in the item title are more important than others.
• Issues with existing scoring models:
  – Built as a service; inefficient to use in batch mode on Hadoop
  – Unigram models
Mutual Information
• The score of a term ‘t’ for a given item ‘i’ is computed using the mutual information of term ‘t’ and category ‘c’.
  – ‘c’ is the L2 category of item ‘i’.
• Item titles from EDW are used as input data.
• Scores are computed for the normalized tokens.
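The slides do not give the exact estimator, but a common way to score a term against a category is the pointwise contribution p(t,c)·log(p(t,c)/(p(t)p(c))), estimated from term occurrences in titles. This sketch uses that formulation as an assumption:

```python
import math
from collections import Counter

def mutual_information(titles):
    """titles: list of (category, title) pairs.
    Returns {(term, category): score}, where the score is the pointwise
    mutual-information contribution of the (term, category) co-occurrence."""
    term_cat, term, cat = Counter(), Counter(), Counter()
    n = 0
    for c, title in titles:
        for t in set(title.lower().split()):  # count each term once per title
            term_cat[(t, c)] += 1
            term[t] += 1
            cat[c] += 1
            n += 1
    scores = {}
    for (t, c), joint in term_cat.items():
        p_tc = joint / n
        p_t, p_c = term[t] / n, cat[c] / n
        scores[(t, c)] = p_tc * math.log(p_tc / (p_t * p_c))
    return scores
```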
K-Means 1/3
• K-means is a well-known clustering algorithm.
• Choose k initial cluster centroids: m_1^(1), …, m_k^(1)
• Assignment step: assign each item x to the cluster S_j^(t) with the nearest centroid:
  S_j^(t) = { x : ‖x − m_j^(t)‖ ≤ ‖x − m_l^(t)‖ for all l }
• Update step: recompute each centroid as the mean of its assigned items:
  m_j^(t+1) = (1 / |S_j^(t)|) Σ_{x ∈ S_j^(t)} x
• Optimize: maximize intra-cluster similarity, i.e. minimize the cluster distortion Σ_j Σ_{x ∈ S_j} ‖x − m_j‖²
K-Means 2/3
1. Choose random cluster centroids.
2. Update centroids based on the neighborhood.
3. Final clusters.
We use a version of k-means called “bisecting k-means”, which tends to produce better-quality results than standard k-means.
K-Means 3/3
• Pros
  – Simple to understand and implement
  – Easily parallelizable
  – Generally produces good-quality clusters when K is small
• Cons
  – Slow to converge when K is large
  – Cluster quality degrades with large K
  – Need to decide K beforehand; finding a suitable K requires domain knowledge and tuning
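A compact sketch of bisecting k-means with cosine distance (spherical k-means), combining the assignment/update steps above with the bisecting strategy the deck mentions. Initialization and stopping rules are simplified; this is not the Hadoop implementation.

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """k-means with cosine distance: rows are L2-normalized, and centroids
    are renormalized to the unit sphere after each update step."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assignment step: each item goes to the most cosine-similar centroid.
        labels = np.argmax(X @ centroids.T, axis=1)
        # Update step: mean of each cluster, renormalized.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.mean(axis=0)
                centroids[j] = c / np.linalg.norm(c)
    return labels, centroids

def bisecting_kmeans(X, k):
    """Repeatedly split the largest cluster with 2-means until k clusters exist."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        clusters.sort(key=len)
        idx = clusters.pop()                     # take the largest cluster
        labels, _ = spherical_kmeans(X[idx], 2)  # bisect it
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters
```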
K-Means Clustering: Cluster Description
• Clusters are described using their centroids.
• Cluster 1: “L1=293 L2=56169 L3=168096 compatible brand = apple compatible product = ipod touch Phrase(4,g) clear film protector screen”
• Cluster 2: “L1=11450 L2=3034 L3=55793 brand = indigo by clarks shoe style = pumps classics”
• There are about x million clusters for the US.
• These x million clusters cover more than 92% of the US inventory.
Shingling for Cluster Merging
• Problem: given a set of clusters, find a grouping of similar clusters.
• Approach:
  – Represent each cluster as a “document”
  – Compute the 5 minimum-hash 3-shingles of each document
  – Clusters whose shingle sketches match at least 80% belong to the same group
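The shingling step might look like this sketch: each cluster “document” is reduced to its 5 smallest 3-shingle hash values, and two clusters group together when the sketches overlap at least 80%. The hash function and the exact overlap measure are assumptions, since the slides only name the parameters.

```python
import hashlib

def shingles(doc, n=3):
    """Word 3-shingles of a cluster 'document'."""
    tokens = doc.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def sketch(doc, num_hashes=5):
    """Keep the 5 smallest shingle hashes as a fixed-size sketch."""
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                    for s in shingles(doc))
    return set(hashes[:num_hashes])

def same_group(doc_a, doc_b, threshold=0.8):
    """Two clusters belong to the same group if their sketches overlap >= 80%."""
    a, b = sketch(doc_a), sketch(doc_b)
    if not a or not b:
        return False
    return len(a & b) / max(len(a), len(b)) >= threshold
```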
Shingling Basics (1/3 – 3/3) [illustration slides]
Cluster Assignment
[Architecture: the cluster dictionary and pre-processing meta-data files feed an inverted index, implemented using Lucene. At run time the assignment service takes an item’s title, attributes, leaf category, and site; ranks clusters; makes a Voyager call for the top N clusters; and ranks the top N similar items for the Closed View Item page’s recommended similar items.]
Cluster Assignment: Pre-processing
• Item title: new 2x for canon lp-e8 battery + charger + lens hood eos 550d 600d digital rebel t3i
• RTL normalization: new,2-x,for,canon,lp-e-8,battery,charger,lens,hood,eos,550-d,600-d,digital,rebel,t-3-i
• Concept extraction: new,2-x,for canon,lp-e-8,battery,charger,lens,hood,eos,550-d,600-d,digital rebel,t-3-i
• Stop-word filtering: 2-x,for canon,lp-e-8,battery,charger,lens,hood,eos,550-d,600-d,digital rebel,t-3-i
• STL expansion: 2-x,for canon,PHRASE(lp,e,8),OR(batteries,battery,batterys),OR(charger,chargers),lens,OR(hood,PHRASE(hood,s),hoods),eos,PHRASE(550,d),PHRASE(600,d),digital rebel,t-3-i
• Query reduction/unification: 2-x,for canon,phrase(lp,e,8),batteries,charger,lens,phrase(hood,s),eos,phrase(550,d),phrase(600,d),digital rebel,t-3-i
Cluster Assignment: Scoring
• Indexing fields: title terms and categories.
• Reward matching terms and penalize non-matching terms.
• Reward for matching terms:
  – Number of terms matching from the input
  – Importance of each term in the input (query-time boost)
• Penalty for non-matching terms from cluster ‘c’:
  – Index-time boost: field-length normalization
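A toy version of the reward/penalty scoring described above, with a square-root length normalization standing in for Lucene’s field-length norm. The exact production formula is not given in the slides, so the weighting here is an assumption.

```python
import math

def score_cluster(item_terms, item_weights, cluster_terms):
    """item_terms: set of terms in the input item.
    item_weights: term -> importance of the term in the input (query-time boost).
    cluster_terms: set of terms in the cluster description.
    Reward matched terms, penalize cluster terms the item does not cover,
    normalized by cluster length (a rough analogue of a field-length norm)."""
    matched = item_terms & cluster_terms
    reward = sum(item_weights.get(t, 1.0) for t in matched)
    penalty = len(cluster_terms - item_terms)
    return (reward - penalty) / math.sqrt(len(cluster_terms))
```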
Cluster Assignment Cross-Validation
• Compute the precision of recommending items from the “correct” cluster(s)
  – Clusters that generate purchases (BIDs and/or BINs)
• Labeled data
  – View-buy data generated from user-session analysis
  – CVIP -> Bid/BIN in the same user session
  – Same category
Cluster Assignment Cross-Validation: Method
• For each labeled item, take the top k (k = 5) assigned clusters and compute the precision in the top k, ignoring position.
• Caveats:
  – True precision is dependent on ranking.
  – Assumes every item belonging to a cluster is equally likely to be recommended.
• Normalized precision
  – Normalized by the size of the smallest cluster in the top-k list.
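Position-ignoring precision in the top k can be computed as in this short sketch:

```python
def precision_at_k(recommended_clusters, correct_clusters, k=5):
    """Fraction of the top-k assigned clusters that generated a purchase
    (BID/BIN) in the labeled view-buy sessions; position is ignored."""
    top_k = recommended_clusters[:k]
    hits = sum(1 for c in top_k if c in correct_clusters)
    return hits / len(top_k) if top_k else 0.0
```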
Merchandising Applications
• Two kinds of recommendation systems use SIC:
  – Recommending similar items on the CVIP non-winner page
  – Collaborative Filtering (CF) algorithms:
    • “Buy-Buy” – on the Checkout page
    • “View-Buy” – on the AVIP
Similar Item Recommendations
• The user bid on an item but lost
  – Show similar items as replacement items.
• The user was watching an item that has ended
  – Show similar items as replacement items.
• The user viewed an item but did not make a purchase
  – Show similar items to showcase more choices.
  – Inject diversity into the recommendations.
Similar Item Recommendations - Example
Similar Item Recommendations (contd.)
Collaborative Filtering on SIC – “Buy-Buy”
• Once a user has purchased an item, what else can we recommend to go with the purchase?
• Drive incremental purchases
  – On check-out, recommend other items that “go together” with the purchased item.
  – E.g. for a cell phone we may recommend a charger, case, or screen protector.
  – For a dress shirt, we may recommend a tie, dress shoes, or a jacket.
Collaborative Filtering on SIC – “Buy-Buy”
• Non-productized item inventory with short lifetime makes any CF based approach difficult.
• Map the items to a higher level abstraction (clusters) to handle data sparsity.
• Re-use the item clusters generated for Similar Item Recommendation.
Related Recommendations: Before & After
Recommendations for Xbox 360 4GB on the Checkout page
Conclusion
• The SIC platform has proven its utility and is a critical component of merchandising algorithms.
• Future work
  – Quality needs to be improved for long-tail categories like Art, Collectibles, etc.
  – Better distinguish between CVIP loser/browser
  – End-to-end cross-validation framework
Cluster Assignment: Aspect Demand
• Historical (6–7 months) user behavior data
• Rank-ordered lists of aspects used in:
  – Search queries
  – Left-navigation filters
• Combined using rank aggregation
  – Captures the importance of an aspect in a category
  – Used as a query-time boost during cluster index lookup
• Example:
  – Input: AIR JORDAN RETRO 4 IV MILITARY BLUE 2006 SIZE 9.5 USED
  – Boosts: k:air jordan^2.0 k:retro^1.25 k:military^1.25 k:blue^1.2
• Also used in Concept Landing Pages (CLPs) and Popular Watches w/ Aspects
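One simple way to combine rank-ordered aspect lists from different sources (search queries, left-nav filters) is a Borda count, sketched below. The slides do not specify which rank-aggregation method is used, so this is an illustrative stand-in.

```python
from collections import defaultdict

def borda_aggregate(ranked_lists):
    """Combine several rank-ordered aspect lists with a Borda count:
    an aspect at position p in a list of length n earns n - p points.
    Returns aspects sorted by total score, most demanded first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        n = len(ranking)
        for pos, aspect in enumerate(ranking):
            scores[aspect] += n - pos
    return sorted(scores, key=scores.get, reverse=True)
```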
Ranking
• Aspect demand data based on the input item is used in ranking.
  – E.g. material=‘leather’ may not appear in the cluster description.
  – Example: Clarks Women Shoes
• Format bias based on the seed item’s format
Format Affinity
• X% of seed items are auctions for the CVIP non-winner case.
• Users show high affinity towards the seed item’s format.