recommendation engines
DESCRIPTION
Modern web applications embrace personalization in order to provide a unique customer experience. Recommendation engines in general, and Collaborative Filtering in particular, are essential techniques for delivering state-of-the-art personalization on a web site. These slides are based on a presentation that I gave to New England's Java User Group (NEJUG) in 2009; in that respect, they are quite old. Nevertheless, the content covers the fundamental concepts of these techniques, and the fundamentals never go out of fashion. The code references are from the Yooreeka project, which started with the code of the book "Algorithms of the Intelligent Web" (Manning 2009). You can find the Yooreeka 2.0 API (Javadoc) at http://www.marmanis.com/static/javadoc/index.html
TRANSCRIPT
Recommendation Engines: A key personalization feature of modern web applications
Haralambos (Babis) Marmanis
NEJUG, June 11, 2009
Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Presentation Outline
1 Introduction
    Recommendations in Action
    “It’s the Economy ...”
    Java source code
2 Basic Concepts
    The Online Music Store Example
    Similarity
    Distance (formulas)
    Similarity (formulas)
    The “best” Similarity formula
3 Collaborative Filtering
    User based
    Rating Counting Matrix
    Item based
4 Content based
    Text Parsing & Analysis
    Document representation
5 Netflix Prize
    Netflix Prize Description
    Lessons learned
6 Summary
Recommendations in Action
Online store recommendations
Amazon.com: Provide recommendations for purchasing more items
Netflix.com: Provide recommendations for viewing more movies
Content recommendations
Any news portal or other content aggregator: Recommendations for articles, books, news stories
“It’s the Economy ...”
The Long Tail
Goodbye Pareto Principle, Hello Long Tail
Erik Brynjolfsson, Yu (Jeffrey) Hu, and Michael D. Smith used a log-linear curve to describe the relationship between Amazon.com sales and sales ranking.
They found that a large proportion of Amazon.com’s book sales come from obscure books that were not available in brick-and-mortar stores.
They also found that the consumer benefit from access to increased product variety in online book stores is ten times larger than the benefit from access to lower prices online!
Java source code
Yooreeka!
Open Source, Machine Learning library
Search, recommendations, clustering, classification, and combination of classifiers!
URL: http://code.google.com/p/yooreeka/
The Online Music Store Example
Frank’s music ratings
Constantine’s music ratings
Catherine’s music ratings
Similarity
The notion of Similarity
Often based on the notion of distance
The smaller the distance, the greater the similarity
Similarity values are typically constrained to [0,∞) or [0,1]
It is not necessary to define similarity formulas, e.g. if d < ε then similar, otherwise not
Similarity could also be empirical or probabilistic
Distance (formulas)
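Let X and Y be two vectors in $\mathbb{R}^N$, with components $X_i$ and $Y_i$
Minkowski or p-norm distance
$$d = \left( \sum_{i=1}^{N} |X_i - Y_i|^p \right)^{1/p} \qquad (1)$$
Manhattan distance (the case p = 1)
$$d = \sum_{i=1}^{N} |X_i - Y_i| \qquad (2)$$
Chebychev or L∞ distance
$$d = \lim_{p \to \infty} \left( \sum_{i=1}^{N} |X_i - Y_i|^p \right)^{1/p} = \max_i |X_i - Y_i| \qquad (3)$$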
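The distance formulas can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name `Distances` and its method names are hypothetical and are not taken from Yooreeka. Here the Manhattan distance is treated as the Minkowski distance with p = 1, and the Chebychev distance as the limit p → ∞, which equals the largest absolute coordinate difference.

```java
// Hypothetical helper for illustration; not part of the Yooreeka API.
final class Distances {

    // Minkowski (p-norm) distance between two equal-length vectors, for p >= 1.
    static double minkowski(double[] x, double[] y, double p) {
        checkSameLength(x, y);
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.pow(Math.abs(x[i] - y[i]), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    // Manhattan distance: the Minkowski distance with p = 1.
    static double manhattan(double[] x, double[] y) {
        return minkowski(x, y, 1.0);
    }

    // Euclidean distance: the Minkowski distance with p = 2.
    static double euclidean(double[] x, double[] y) {
        return minkowski(x, y, 2.0);
    }

    // Chebychev (L-infinity) distance: the largest absolute coordinate difference.
    static double chebyshev(double[] x, double[] y) {
        checkSameLength(x, y);
        double max = 0.0;
        for (int i = 0; i < x.length; i++) {
            max = Math.max(max, Math.abs(x[i] - y[i]));
        }
        return max;
    }

    private static void checkSameLength(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
    }
}
```

For X = (1, 2, 3) and Y = (4, 6, 3) the coordinate differences are 3, 4 and 0, so the Manhattan distance is 7, the Euclidean distance is 5, and the Chebychev distance is 4. Calling `minkowski` with a large p (say 50) gives a value very close to 4, which illustrates the limit.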
Similarity (formulas)
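Naïve Similarity
$$\mathrm{sim}_{\mathrm{Naive}} = \frac{\beta}{\beta + d} \qquad (4)$$
where d is the Euclidean distance.
Similarity I
$$\mathrm{sim}_{I} = 1 - \tanh(\sigma) \qquad (5)$$
where σ is the biased estimator of sample variance
Similarity II
$$\mathrm{sim}_{II} = \mathrm{sim}_{I} \times \frac{\text{common}}{\text{maximum}} \qquad (6)$$
There is more . . . Jaccard, Tanimoto, and so on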
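As a small illustration of formula (4), the naive similarity can be sketched as below. The class name `NaiveSimilarity` is hypothetical and not part of Yooreeka, and β is treated as a positive tuning constant chosen by the caller, which is an assumption because the slides do not fix its value.

```java
// Hypothetical sketch of the naive similarity beta / (beta + d); not the Yooreeka implementation.
final class NaiveSimilarity {

    // Returns beta / (beta + d), where d is the Euclidean distance between x and y.
    // The result is 1 for identical vectors and approaches 0 as the distance grows.
    static double similarity(double[] x, double[] y, double beta) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        if (beta <= 0.0) {
            throw new IllegalArgumentException("beta must be positive (assumption of this sketch)");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sumOfSquares += diff * diff;
        }
        double d = Math.sqrt(sumOfSquares);
        return beta / (beta + d);
    }
}
```

With X = (1, 2, 3) and Y = (4, 6, 3) the Euclidean distance is 5, so β = 1 gives a similarity of 1/6 and β = 5 gives 1/2. A larger β makes the measure more forgiving of distance.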
The “best” Similarity formula
Which is the best similarity formula?
There is no such thing! It depends on the problem, the data, the definition of ... “best”
Spertus, Sahami, and Buyukkokten (2005). Evaluating similarity measures: a large-scale study in the orkut social network. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
The simple L2 based (cosine) similarity showed the best empirical results among seven similarity metrics.
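For reference, a generic cosine similarity over real-valued vectors can be sketched as follows. This is a textbook formulation and not the exact measure evaluated in the cited study; the class name `CosineSimilarity` is hypothetical, and returning 0 for an all-zero vector is a convention assumed here.

```java
// Hypothetical sketch of cosine similarity; not code from the cited study or from Yooreeka.
final class CosineSimilarity {

    // Cosine of the angle between x and y: dot(x, y) / (|x| * |y|).
    static double cosine(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normXSquared = 0.0;
        double normYSquared = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normXSquared += x[i] * x[i];
            normYSquared += y[i] * y[i];
        }
        if (normXSquared == 0.0 || normYSquared == 0.0) {
            return 0.0; // assumed convention: a zero vector is similar to nothing
        }
        return dot / (Math.sqrt(normXSquared) * Math.sqrt(normYSquared));
    }
}
```

Orthogonal vectors such as (1, 0) and (0, 1) score 0, parallel vectors such as (1, 2, 3) and (2, 4, 6) score 1, and (3, 4) against (4, 3) scores 24/25 = 0.96.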
Tapestry
Experimental mail system by Goldberg et al. (circa 1992) at Xerox PARC
User based
User Similarity Matrix
       U1   U2   U3   U4   U5   ..
U1 [  S11  S12  S13  S14  S15  ... ]
U2 [  S21  S22  S23  S24  S25  ... ]
U3 [  S31  S32  S33  S34  S35  ... ]
U4 [  S41  S42  S43  S44  S45  ... ]
U5 [  S51  S52  S53  S54  S55  ... ]
.. [  ...  ...  ...  ...  ...  ... ]
User Similarity Matrix (cont.)
      U1    U2     U3     U4     U5     ..
U1 [ 1.0,  0.333, 0.385, 0.333, 0.364, ... ]
U2 [ 0.0,  1.000, 0.545, 0.385, 0.615, ... ]
U3 [ 0.0,  0.000, 1.000, 0.364, 0.636, ... ]
U4 [ 0.0,  0.000, 0.000, 1.000, 0.231, ... ]
U5 [ 0.0,  0.000, 0.000, 0.000, 1.000, ... ]
.. [ 0.0,  0.000, 0.000, 0.000, 0.000, ... ]
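The zeros below the diagonal suggest that only the upper triangle of the matrix is stored, which works because similarity is symmetric. A minimal sketch of that idea follows. The class and method names are hypothetical (this is not Yooreeka's `BaseSimilarityMatrix`), the similarity function is passed in by the caller, and every user is assumed to have a rating vector of the same length.

```java
import java.util.function.ToDoubleBiFunction;

// Hypothetical sketch of an upper-triangular user similarity matrix; not the Yooreeka implementation.
final class SimilarityMatrixSketch {

    // Fills only the diagonal and the entries above it; entries below the diagonal stay 0.
    static double[][] upperTriangular(double[][] ratings,
                                      ToDoubleBiFunction<double[], double[]> similarity) {
        int n = ratings.length;
        double[][] s = new double[n][n];
        for (int u = 0; u < n; u++) {
            s[u][u] = 1.0; // every user is fully similar to themselves
            for (int v = u + 1; v < n; v++) {
                s[u][v] = similarity.applyAsDouble(ratings[u], ratings[v]);
            }
        }
        return s;
    }

    // Symmetric lookup: reads the stored upper-triangle entry regardless of argument order.
    static double lookup(double[][] s, int u, int v) {
        return u <= v ? s[u][v] : s[v][u];
    }
}
```

Storing half of the matrix roughly halves the number of similarity computations, which matters because the work grows quadratically with the number of users.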
Rating Counting Matrix
       R1   R2   R3   R4   R5
R1 [  X11  X12  X13  X14  X15 ]
R2 [  X21  X22  X23  X24  X25 ]
R3 [  X31  X32  X33  X34  X35 ]
R4 [  X41  X42  X43  X44  X45 ]
R5 [  X51  X52  X53  X54  X55 ]
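The slides do not spell out how the entries are filled, so the following sketch rests on an assumption: for a pair of users A and B, entry (i, j) counts the items that A rated i and B rated j, on a 1 to 5 scale. Under that reading the diagonal counts exact agreements, and the share of counts on the diagonal gives a simple similarity. All names are hypothetical and this is not Yooreeka's `RatingCountMatrix`; a rating of 0 is used here to mean "not rated".

```java
// Hypothetical sketch of a rating count matrix for two users; the construction is an assumption.
final class RatingCountSketch {

    // counts[i][j] = number of items that user A rated (i + 1) and user B rated (j + 1).
    // Both arrays are aligned by item; 0 means "not rated" (assumed encoding).
    static int[][] count(int[] ratingsA, int[] ratingsB, int scale) {
        if (ratingsA.length != ratingsB.length) {
            throw new IllegalArgumentException("Rating arrays must cover the same items");
        }
        int[][] counts = new int[scale][scale];
        for (int item = 0; item < ratingsA.length; item++) {
            int a = ratingsA[item];
            int b = ratingsB[item];
            if (a < 0 || a > scale || b < 0 || b > scale) {
                throw new IllegalArgumentException("Rating out of range at item " + item);
            }
            if (a == 0 || b == 0) {
                continue; // only items rated by both users contribute
            }
            counts[a - 1][b - 1]++;
        }
        return counts;
    }

    // Fraction of commonly rated items on which the two users gave the same rating.
    static double agreementRatio(int[][] counts) {
        int total = 0;
        int agreements = 0;
        for (int i = 0; i < counts.length; i++) {
            for (int j = 0; j < counts[i].length; j++) {
                total += counts[i][j];
                if (i == j) {
                    agreements += counts[i][j];
                }
            }
        }
        return total == 0 ? 0.0 : (double) agreements / total;
    }
}
```

For A = (5, 3, 0, 4, 1) and B = (5, 2, 4, 4, 0) there are three commonly rated items, two of them with identical ratings, so the agreement ratio is 2/3.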
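BeanShell script (Users)
// Create the sample music dataset
BaseDataset ds = MusicData.createDataset();
// Build a user-based recommender on top of it
Delphi delphi = new Delphi(ds, RecommendationType.USER_BASED);
MusicUser mu1 = ds.pickUser("Bob");
// Find the users most similar to Bob, then recommend items for him
delphi.findSimilarUsers(mu1);
delphi.recommend(mu1);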
Item based
Item Similarity Matrix
      I1    I2     I3     I4     I5     ...
I1 [ 1.0,  0.333, 0.385, 0.333, 0.364, ... ]
I2 [ 0.0,  1.000, 0.545, 0.385, 0.615, ... ]
I3 [ 0.0,  0.000, 1.000, 0.364, 0.636, ... ]
I4 [ 0.0,  0.000, 0.000, 1.000, 0.231, ... ]
I5 [ 0.0,  0.000, 0.000, 0.000, 1.000, ... ]
.. [ 0.0,  0.000, 0.000, 0.000, 0.000, ... ]
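BeanShell script (Items)
// Build an item-based recommender on the same dataset ds
Delphi delphi = new Delphi(ds, RecommendationType.ITEM_BASED);
MusicUser mu1 = ds.pickUser("Bob");
delphi.recommend(mu1);
// Find the items most similar to a given song
MusicItem mi = ds.pickItem("La Bamba");
delphi.findSimilarItems(mi);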
Peruse the code
Delphi
UserBasedSimilarity
ItemBasedSimilarity
BaseSimilarityMatrix
RatingCountMatrix
Text Parsing & Analysis
No more ratings, what do we do?
Now we deal with documents
So, we need to define similarity based on the content of the documents
Use Lucene’s StandardAnalyzer
Build your own! (see CustomAnalyzer)
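To make the idea of content-based similarity concrete, here is a minimal sketch in plain Java that deliberately avoids the Lucene API. It lower-cases the text, splits it on non-alphanumeric characters, drops a tiny stop-word list, and counts term frequencies. Two documents are then compared with the Jaccard coefficient of their term sets. The class name, the stop-word list, and the choice of Jaccard are all assumptions made for illustration; a real analyzer such as Lucene's StandardAnalyzer does considerably more.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a tiny text analyzer; it does not use or mimic the Lucene API.
final class TermFrequencySketch {

    // A deliberately small stop-word list, assumed for this example only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "is", "of", "the", "to"));

    // Maps each remaining term to the number of times it occurs in the text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> frequencies = new HashMap<>();
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            frequencies.merge(token, 1, Integer::sum);
        }
        return frequencies;
    }

    // Jaccard coefficient: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // assumed convention for two empty documents
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }
}
```

For example, "The Long Tail of the long tail!" yields the terms long and tail with a count of 2 each. Compared with "Long tail economics", the term sets share 2 of 3 distinct terms, so the Jaccard coefficient is 2/3.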
Document representation
No more ratings!
Netflix Prize Description
Netflix prize
More than 100 million ratings
480 thousand randomly-chosen, anonymous customers
18 thousand movie titles
Lessons learned
Important considerations
Data normalization
Neighbor selection
    How many neighbors?
    Who are the “best” neighbors?
Neighbor weights
“Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique.”
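The three considerations (normalization, neighbor selection, neighbor weights) can be seen together in a standard user-based k-nearest-neighbor predictor. The sketch below is a common textbook formulation, not the method of any Netflix Prize team: ratings are normalized by subtracting each user's mean, the k most similar users who rated the item are selected, and their deviations are weighted by similarity. The class name, the NaN encoding for missing ratings, and the precomputed similarity array are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical textbook-style sketch; not code from Yooreeka or from a Netflix Prize entry.
final class NeighborhoodPredictor {

    // Mean of the ratings a user actually gave; NaN marks "not rated" (assumed encoding).
    static double mean(double[] ratings) {
        double sum = 0.0;
        int count = 0;
        for (double r : ratings) {
            if (!Double.isNaN(r)) {
                sum += r;
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }

    // Predicts the target user's rating for an item from the k most similar users who rated it.
    // simToTarget[v] is the precomputed similarity between user v and the target user.
    static double predict(double[][] ratings, double[] simToTarget, int target, int item, int k) {
        // Neighbor selection: candidates are the other users who rated the item.
        List<Integer> candidates = new ArrayList<>();
        for (int v = 0; v < ratings.length; v++) {
            if (v != target && !Double.isNaN(ratings[v][item])) {
                candidates.add(v);
            }
        }
        // Most similar users first.
        candidates.sort(Comparator.comparingDouble((Integer v) -> -simToTarget[v]));

        double weightedDeviation = 0.0;
        double weightSum = 0.0;
        for (int n = 0; n < Math.min(k, candidates.size()); n++) {
            int v = candidates.get(n);
            double weight = simToTarget[v]; // neighbor weight
            // Data normalization: use the deviation from the neighbor's own mean.
            weightedDeviation += weight * (ratings[v][item] - mean(ratings[v]));
            weightSum += Math.abs(weight);
        }
        double baseline = mean(ratings[target]);
        return weightSum == 0.0 ? baseline : baseline + weightedDeviation / weightSum;
    }
}
```

Changing k changes the answer, which is exactly the "how many neighbors?" question: with k = 1 only the single most similar user counts, while a larger k averages over more, but less similar, users.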
Summary
Important considerations
Business value validation - “Long Tail”, “niches to riches”, etc.
Similarity metrics - Many to choose from, do not be afraid to explore!
Collaborative Filtering: “Show me your friend ...”
    User based
    Item based
Content based recommendations - NLP challenges
Large scale implementations - Speed, data size, quality