TRANSCRIPT
John S. Breese, David Heckerman, Carl Kadie
Presentation Group: Petros Adamopoulos, Robert Camilleri, Ioannis Sarantidis, Charalampos Vrisagotis
Collaborative Filtering
Introduction – What is Collaborative Filtering?
Memory-Based Algorithms:
  Correlation
  Vector Similarity
Model-Based Algorithms:
  Bayesian Networks
  Clustering
Evaluation – Methods and Results
Conclusion
Introduction
Prediction of very complex attributes – opinion.
Collaborative Filtering Approach:
Given a user, try to predict his preference on item Y by finding a set of similar users, based on previous preferences, and using their preference on item Y.
"The underlying assumption of the Collaborative Filtering approach is that those who agreed in the past tend to agree again in the future" – Wikipedia
Introduction
4 users:
A) Metallica, Iron Maiden, Sepultura, Megadeth
B) 50 Cent, Dr. Dre, P Diddy, Snoop Doggy Dogg
C) 50 Cent, Shakira, Britney Spears, Christina Aguilera
D) Britney Spears
What will be the vote on 50 Cent for user D?
Intuitively, we see that we can say something about whether D will like 50 Cent or not by asking C.
Introduction
Memory-Based Algorithms – use the whole sample of user votes to predict a new vote.
Model-Based Algorithms – use the sample of user votes to learn a model, then use that model to predict a new vote.
Explicit Voting – a user expresses a preference on an item.
Implicit Voting – a user 'consuming' an item indicates a preference.
Introduction
Very large datasets.
Very sparse information: millions of items in existence, but any user will have expressed some preference over only a very small subset of these.
New User problem.
New Item problem.
Introduction
[Screenshot: recommendations at www.amazon.co.uk]
Memory-Based Algorithms (I)
Collaborative Filtering task: predict the votes of a particular user (the active user) from a database of user votes.
Mean vote for user i:

\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j}

I_i is the set of items on which user i has voted.
Memory-Based Algorithms (II)
Approach: predict the votes of the active user based on partial info from the user and a set of weights from the user database:

p_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} w(a,i)\,(v_{i,j} - \bar{v}_i)

Weights w(a,i) can reflect: distance, correlation, or similarity.
Algorithms differ in the weight calculation.
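The prediction rule p_{a,j} = \bar{v}_a + \kappa \sum_i w(a,i)(v_{i,j} - \bar{v}_i) can be sketched in Python. The dictionary-of-dictionaries database layout and the toy overlap weight below are illustrative assumptions, not part of the paper; any of the weight functions on the following slides could be plugged in instead.

```python
def mean_vote(votes):
    """Mean vote of a user over the items they have voted on."""
    return sum(votes.values()) / len(votes)

def predict(active_votes, database, weight, item):
    """Predict the active user's vote on `item` from other users' votes:
    p = vbar_a + kappa * sum_i w(a, i) * (v_{i,j} - vbar_i)."""
    vbar_a = mean_vote(active_votes)
    terms, norm = 0.0, 0.0
    for votes in database.values():
        if item not in votes:
            continue
        w = weight(active_votes, votes)
        terms += w * (votes[item] - mean_vote(votes))
        norm += abs(w)
    if norm == 0.0:
        return vbar_a                 # no informative neighbors: fall back
    return vbar_a + terms / norm      # kappa = 1 / sum_i |w(a, i)|

# Toy weight: number of co-voted items (a stand-in for correlation/similarity)
overlap = lambda a, b: len(set(a) & set(b))
db = {"B": {"X": 5, "Y": 1}, "C": {"X": 4, "Z": 2}}
print(predict({"Y": 2, "Z": 3}, db, overlap, "X"))
```

Here \kappa is chosen so the absolute weights sum to 1, one common normalization choice.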
Correlation (I)
One of the first approaches to collaborative filtering.
Correlation shows the strength and direction of the linear relation between 2 random variables.
Correlation (II)
Pearson correlation coefficient.
Computes the correlation between 2 users a and i:

w(a,i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \sum_j (v_{i,j} - \bar{v}_i)^2}}

The summations are over the items for which both users a and i have recorded votes.
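A minimal sketch of this weight, assuming votes are held in per-user dictionaries keyed by item. Following the formula, the sums run over the co-voted items; the means \bar{v}_a and \bar{v}_i are taken here as each user's overall mean vote.

```python
from math import sqrt

def pearson_weight(va, vi):
    """Pearson correlation between users a and i over their co-voted items."""
    common = set(va) & set(vi)
    if not common:
        return 0.0
    mean_a = sum(va.values()) / len(va)
    mean_i = sum(vi.values()) / len(vi)
    num = sum((va[j] - mean_a) * (vi[j] - mean_i) for j in common)
    den = sqrt(sum((va[j] - mean_a) ** 2 for j in common)
               * sum((vi[j] - mean_i) ** 2 for j in common))
    return num / den if den else 0.0   # constant votes give no signal
```

Two users with identical votes over the common items get weight 1; perfectly opposed votes get weight -1.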
Default Voting
Extension to the correlation algorithm.
What if 2 users have voted on few matching items? Correlation only uses I_a ∩ I_i.
Default voting assumes a default vote d for items for which we have no vote, and now uses I_a ∪ I_i.
Can also be extended to items neither has voted on.
d will most of the time reflect a neutral or negative preference.
In applications with implicit voting, default voting gives missing data their actual value.
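The union-with-default idea can be sketched as a small preprocessing step before computing the correlation weight; the dictionary layout and the neutral default d = 0.0 are illustrative assumptions.

```python
def with_default_votes(va, vi, d=0.0):
    """Extend both vote dictionaries to the union of their voted items,
    filling each user's unvoted items with a default vote d (typically a
    neutral or negative preference)."""
    union = set(va) | set(vi)
    return ({j: va.get(j, d) for j in union},
            {j: vi.get(j, d) for j in union})
```

The extended dictionaries are then fed to the usual Pearson weight, so two users with few matching items still compare over a shared item set.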
Vector Similarity (I)
Information Retrieval: documents are represented as vectors of the form

U_D = (v_{D,0}, v_{D,1}, \ldots, v_{D,j}, \ldots, v_{D,N})

N: number of different words; v_{D,j}: frequency of word j in D.

Collaborative Filtering: users are represented as vectors of the form

U_\alpha = (v_{\alpha,0}, v_{\alpha,1}, \ldots, v_{\alpha,j}, \ldots, v_{\alpha,N})

N: number of different items; v_{\alpha,j}: vote on item j by user \alpha.
The analogy:

Information Retrieval | Collaborative Filtering
Documents             | Users
Words                 | Items
Word Frequencies      | Votes
Vector Similarity (II)
Basic formula:

w(a,i) = U_a \cdot U_i = \sum_j v_{a,j}\, v_{i,j}

j: common items between a and i.
However, this metric is biased. Why?
Cosine Similarity

w(a,i) = \cos(U_a, U_i) = \frac{U_a \cdot U_i}{\|U_a\|\,\|U_i\|} = \sum_j \frac{v_{a,j}}{\sqrt{\sum_{k \in I_a} v_{a,k}^2}} \cdot \frac{v_{i,j}}{\sqrt{\sum_{k \in I_i} v_{i,k}^2}}

Other normalization schemes:
Absolute sum
Number of votes
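A sketch of the cosine weight, under the usual convention that missing votes count as zero: the numerator then runs over the common items only, while each norm runs over all of that user's own votes.

```python
from math import sqrt

def cosine_weight(va, vi):
    """Cosine similarity between two users' vote vectors (dicts keyed by
    item); unvoted items are implicitly zero."""
    num = sum(va[j] * vi[j] for j in set(va) & set(vi))
    den = sqrt(sum(v * v for v in va.values())) * \
          sqrt(sum(v * v for v in vi.values()))
    return num / den if den else 0.0
```

Unlike the raw dot product, this is insensitive to how *many* votes each user has cast, which is exactly the bias the normalization removes.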
Vector Similarity – Example
[Figure: bipartite graph with users α and i on one side and Items 1–4 on the other; the edge from user α to Item 1 is weighted v_{α,1}/|v_α|, and similarly for the other edges.]
Meaning of v_{α,1}/|v_α|: probability of a user following the corresponding edge.
Conclusion: Cosine Similarity = probability of the 2 users meeting at an item node.
Inverse User Frequency (I)
Key concept:
In Information Retrieval: not all words have the same importance in determining whether two documents are similar. Words that are common between the documents carry less information about the similarity between the documents.
Similarly for Collaborative Filtering: not all items have the same importance in determining whether two users are similar. Items that are popular between the users carry less information about the similarity between the users.
Inverse User Frequency formula:

f_j = \log\frac{n}{n_j}

n_j: the number of users who have voted for item j
n: the total number of users
Inverse User Frequency (II)
Extending the vector similarity algorithm using IUF:
Votes for items are transformed using the following formula:

v^{new}_{\alpha,j} = f_j \cdot v^{old}_{\alpha,j}

Then the vector similarity metric is evaluated using the transformed votes.
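The IUF computation and vote transform can be sketched as follows, again assuming per-user vote dictionaries:

```python
from math import log

def inverse_user_frequency(database):
    """f_j = log(n / n_j): n users in total, n_j users who voted on item j."""
    n = len(database)
    counts = {}
    for votes in database.values():
        for j in votes:
            counts[j] = counts.get(j, 0) + 1
    return {j: log(n / nj) for j, nj in counts.items()}

def transform(votes, f):
    """Apply v_new = f_j * v_old before evaluating vector similarity."""
    return {j: f.get(j, 0.0) * v for j, v in votes.items()}
```

An item voted on by every user gets f_j = log(1) = 0 and so contributes nothing to the similarity, matching the intuition above.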
Can be applied to the Correlation algorithm too:

w(a,i) = \frac{\sum_j f_j \sum_j f_j v_{a,j} v_{i,j} - \left(\sum_j f_j v_{a,j}\right)\left(\sum_j f_j v_{i,j}\right)}{\sqrt{UV}}

where

U = \sum_j f_j \left( \sum_j f_j v_{a,j}^2 - \Big(\sum_j f_j v_{a,j}\Big)^2 \right)

V = \sum_j f_j \left( \sum_j f_j v_{i,j}^2 - \Big(\sum_j f_j v_{i,j}\Big)^2 \right)
Case Amplification
It is a weight transformation:
Low weights are punished.
High weights are favored.
The transformation formula is:

w^{new}_{a,i} = \begin{cases} \left(w^{old}_{a,i}\right)^p & \text{if } w^{old}_{a,i} \ge 0 \\ -\left(-w^{old}_{a,i}\right)^p & \text{if } w^{old}_{a,i} < 0 \end{cases}

Can be used to reduce noise and demonstrates higher accuracy.
Usually the amplification power is p = 2.5.
Model-Based Methods
Effort to approach the problem from a probabilistic point of view.
The main goal of these methods is to estimate the expected vote value of a user given his previous votes on other items:

p_{a,j} = E[v_{a,j}] = \sum_{i=0}^{m} \Pr(v_{a,j} = i \mid v_{a,k},\, k \in I_a) \cdot i

where:
α is the active user
j is the item
[0, m] is the discrete range of vote values
I_a is the set of preferred items of user α
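Once a model yields a posterior distribution over the vote values 0..m, the expected vote is a simple weighted sum; the distribution below is made-up example data standing in for a cluster or Bayesian-network model's output.

```python
def expected_vote(distribution):
    """E[v_{a,j}] = sum_i Pr(v_{a,j} = i | observed votes) * i,
    where distribution[i] is the model's posterior for vote value i."""
    return sum(i * p for i, p in enumerate(distribution))

# Example posterior over vote values 0, 1, 2 (illustrative numbers)
print(expected_vote([0.1, 0.2, 0.7]))
```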
Cluster Models
Underlying complex pattern that can be captured by latent variables.
Users can be classified into groups (classes) of individuals with similar interests.
Classes are latent variables.
Assumption: given membership in a group, an individual's vote on various items is independent.
Cluster Models

\Pr(C = c, v_1, \ldots, v_n) = \Pr(C = c) \prod_{i=1}^{n} \Pr(v_i \mid C = c)

\Pr(C = c) and \Pr(v_i \mid C = c) are estimated from the user database, via the Expectation-Maximisation algorithm (EM).
Various models are constructed with varying numbers of latent variables.
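Given estimated parameters, the joint probability and the class posterior follow directly from the formula above. The two-class example parameters are invented for illustration; the EM fitting that would produce them is not shown.

```python
def joint(c, votes, prior, cond):
    """Pr(C=c, v_1..v_n) = Pr(C=c) * prod_i Pr(v_i | C=c)."""
    p = prior[c]
    for item, v in votes.items():
        p *= cond[c][item][v]
    return p

def class_posterior(votes, prior, cond):
    """Posterior over the latent class given observed votes (Bayes' rule)."""
    joints = {c: joint(c, votes, prior, cond) for c in prior}
    z = sum(joints.values())
    return {c: p / z for c, p in joints.items()}

# Illustrative two-class model: class c0 likes item "x", class c1 dislikes it
prior = {"c0": 0.5, "c1": 0.5}
cond = {"c0": {"x": {1: 0.9, -1: 0.1}},
        "c1": {"x": {1: 0.1, -1: 0.9}}}
print(class_posterior({"x": 1}, prior, cond))
```

The posterior over classes, combined with Pr(v_j | C = c) for an unseen item j, gives the vote distribution needed for the expected-vote prediction.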
Cluster Models
Ratings (−1, 1): dislike, like.
User A – The Godfather (1), Grease (−1), Goodfellas (1), Casino (1), Raging Bull (1), Taxi Driver (1)
User B – The Godfather (1), Dirty Dancing (−1), Goodfellas (1), Taxi Driver (?)
User preferences on individual items are initially not independent. Introducing latent variables explains away the similarity, and hence the votes become conditionally independent.
NB: we do not know what similarities the latent variables are capturing.
Bayesian Network Models (I)
What is a Bayesian Network Model?
A graphical model (Directed Acyclic Graph) that represents the dependencies between random variables.
Each discrete random variable which takes k values is represented by a node with k states in the network.
Arcs between nodes show the dependencies between variables.
Bayesian Network Models (II)
What is the structure of the graphical model?
Assume that the random variable Y depends on X according to the following Conditional Probability Table (CPT):

P(Y=0|X=0) = 0.9    P(Y=0|X=1) = 0.3
P(Y=1|X=0) = 0.1    P(Y=1|X=1) = 0.7

The dependency is illustrated as follows:

X → Y

Generally, more complex graphical models can arise, with various numbers of nodes, states, and CPTs.
[Figure: a larger example network over nodes X, Y, Z, V, W, T.]
Bayesian Network Models in collaborative filtering (I)
How can Bayesian Network Models be applied to collaborative filtering?
Represent items with nodes.
Each possible vote for an item is a state of its corresponding node.
What happens with the missing values? – We can add an extra state for "no-vote".
What about the CPTs?
They can be obtained from the dataset.
It is more convenient to represent the CPTs with decision trees.
Bayesian Network Models in collaborative filtering (II)
Example of a CPT for the Melrose Place item using a decision tree: the probability that an individual watched Melrose Place given whether they watched the parent programs (all possible votes unified to "watched").
[Decision tree: the root splits on Beverly Hills, 90210 (Watched / Not Watched); one branch splits further on Friends (Watched / Not Watched); the three leaves hold distributions over Melrose Place Watched / Not Watched.]
Why Bayesian Network Models?
They can capture the dependencies between different items.
Easy learning – obtaining the CPTs from the dataset is a straightforward process (though very expensive).
Generally provide accurate results, even in cases with little sensitivity between nodes.
Learning with Bayesian Network Models (I)
Basically, the goal of the learning algorithm is to obtain the CPTs from the dataset and check for dependencies.
Roughly, the algorithm can be specified as follows:

for each item Ii:
    search for a set of items S on which Ii depends
    set the items of S as parent nodes of Ii in the network

However, having many parent nodes for an item leads to several problems during training:
Exponential blow-up of the CPTs.
Over-fitting, by capturing dependencies that happen to occur only in the dataset and are not true in the real world.
Learning with Bayesian Network Models (II)
A solution is to penalize dense networks with many parent nodes for an item.
Keeping the number of parents under 10 makes the network more efficient and accurate.
[Figure: a sparse network over items I1 … I7, each item node having only a few parents.]
Evaluation Criteria (I)
Classes of Collaborative Filtering applications:
1st Class: items presented one-at-a-time with a rating.
2nd Class: items recommended as an ordered list.
Evaluation Criteria (II)
Evaluation sequence:
Dataset divided into a training set and a test set.
Training set used as the collaborative filtering database, or to fit the probabilistic model.
Cycle through the users in the test set, each viewed as the active user.
Divide each test user's votes into an observed set I_α and a set P_α we will attempt to predict.
Scoring Metrics – Class 1
Individual scoring.
Average absolute deviation of the predicted vote from the actual vote:

S_a = \frac{1}{m_a} \sum_{j \in P_a} |p_{a,j} - v_{a,j}|

m_a = number of predicted items (items in P_a) for user a.
The scores are then averaged over all the users.
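The per-user score is a mean absolute deviation; a sketch, assuming predictions and actual votes come as dictionaries keyed by item:

```python
def absolute_deviation_score(predicted, actual):
    """S_a = (1/m_a) * sum_{j in P_a} |p_{a,j} - v_{a,j}|,
    averaged over the items in the prediction set (lower is better)."""
    return sum(abs(predicted[j] - actual[j]) for j in actual) / len(actual)
```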
Scoring Metrics – Class 2 (I)
Ranked scoring.
Evaluation of ranked lists from Information Retrieval:
Recall: percentage of relevant items returned.
Precision: percentage of returned items that were relevant.
Binary votes allow a similar approach.
More general approach: estimate the expected utility of a ranked list to the user.
The expected utility: the probability of viewing a recommended item times its utility.
Here: item utility = difference between the vote and the neutral vote.
Scoring Metrics – Class 2 (II)
The expected utility for a ranked list of items is:

R_a = \sum_j \frac{\max(v_{a,j} - d,\, 0)}{2^{(j-1)/(\alpha-1)}}

d = neutral vote, α = viewing half-life.
Half-life: the position in the list at which the user has a 50% chance of reviewing the item. Here a half-life of 5 items is used.
Final score over all active users:

R = 100\, \frac{\sum_a R_a}{\sum_a R_a^{max}}

R_a^{max} = maximum achievable utility.
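Both scores can be sketched directly from the formulas; the input here is assumed to be each user's actual votes listed in the order the system recommended the items (position j = 1 is the top of the list).

```python
def ranked_utility(ranked_votes, d=0.0, halflife=5):
    """R_a = sum_j max(v_{a,j} - d, 0) / 2**((j - 1) / (halflife - 1)),
    with the votes given in recommendation order."""
    return sum(max(v - d, 0) / 2 ** ((j - 1) / (halflife - 1))
               for j, v in enumerate(ranked_votes, start=1))

def final_score(utilities, max_utilities):
    """R = 100 * sum_a R_a / sum_a R_a^max over all active users."""
    return 100 * sum(utilities) / sum(max_utilities)
```

R_a^max is simply ranked_utility applied to the same votes sorted best-first, i.e. the utility of a perfect ranking.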
Datasets
MS Web: individual visits to the Microsoft website. Implicit voting (page visited or not).
Nielsen network: television viewing data for a 2-week period. Implicit voting (show watched or not).
EachMovie: explicit voting (voting range: 0–5).

                        MSWEB   Nielsen   EachMovie
Total users             3453    1463      4119
Total items             294     203       1623
Mean votes per user     3.95    9.55      46.4
Median votes per user   3       8         26
Protocols
4 protocols used, 2 classes:
All but 1: all votes known except for 1. Evaluates performance with as much data as possible from each test user.
Given X: only X votes observed. Given 2, Given 5, Given 10. Evaluates performance with less data available for each user.
The number of trials for each protocol varies.
Experiments – Algorithms
Correlation with Inverse User Frequency, Default Voting and Case Amplification (CR+)
Vector Similarity with Inverse User Frequency (VSIM)
Bayesian Networks (BN)
Clustering Model (BC)
Using the most popular items (POP)
Experiments – Results (I)
MS Web, Rank Scoring (higher is better):

Algorithm   Given2   Given5   Given10   AllBut1
BN          59.95    59.84    53.92     66.69
CR+         60.64    57.89    51.47     63.59
VSIM        59.22    56.13    49.33     61.70
BC          57.03    54.83    47.83     59.42
POP         49.14    46.91    41.14     49.77
RD          0.91     1.82     4.49      0.93

What is RD?
Experiments – Results (II)
Nielsen, Rank Scoring (higher is better):

Algorithm   Given2   Given5   Given10   AllBut1
BN          34.90    42.24    47.39     44.92
CR+         39.44    43.23    43.47     39.49
VSIM        39.20    40.89    39.12     36.23
BC          19.55    18.85    22.51     16.48
POP         20.17    19.53    19.04     13.91
RD          1.53     1.78     2.42      2.40

Bayesian Networks seem to need more data to have better results.
Experiments – Results (II), continued
Nielsen, Rank Scoring (same table as above):
Vector Similarity and Clusters seem to handle partial data better.
Experiments – Results (III)
EachMovie, Rank Scoring (higher is better):

Algorithm   Given2   Given5   Given10   AllBut1
CR+         41.60    42.33    41.46     23.16
VSIM        42.45    42.12    40.15     22.07
BC          38.06    36.68    34.98     21.38
BN          28.64    30.50    33.16     23.49
POP         30.80    28.90    28.01     13.94
RD          0.75     0.75     0.78      0.78

Why do Bayesian Networks and Clusters perform so poorly?
Experiments – Results (IV)
EachMovie, Absolute Deviation (lower is better):

Algorithm   Given2   Given5   Given10   AllBut1
CR          1.257    1.139    1.069     0.994
BC          1.127    1.144    1.135     1.103
BN          1.154    1.154    1.139     1.066
VSIM        2.113    2.177    2.235     2.136
RD          0.028    0.023    0.025     0.043

A different Correlation algorithm (CR) is used here. Why?
Experiments – Important Notes (I)
Effects of Inverse User Frequency:
Ranked Scoring – average improvement of Correlation: 1.5%; of Vector Similarity: 2.2%.
Absolute Deviation Scoring – average improvement of Correlation: 6.5%; of Vector Similarity: 15.5%.
Effects of Case Amplification:
Average improvement of Correlation / Ranked Scoring: 4.8%.
Average improvement of Correlation / Absolute Deviation Scoring: not significant.
The effects of the two extensions seem to be additive.
Experiments – Important Notes (II)
Bayesian Networks:
Effects of priors – how complex should a tree be?
Priors in general enhance performance. However:
Very small trees (priors that strongly penalize splits)?
Larger trees (with more ancestors and distributions)?
Clustering models:
Information from clustering models can be used to create user profiles, which can be used to enhance:
Advertising
Marketing
Enhanced user services
Conclusion (I)
What is the best method for collaborative filtering?
Many different methods were tested over various datasets.
It is difficult to make a straightforward comparison between the methods used (it depends on matters like the nature of the dataset, the application, and efficiency).
Generally, Bayesian Network Models and correlation provided more accurate results than Bayesian clustering and vector similarity.
This is reasonable, as B.N.M. and correlation capture the dependencies in the dataset, whereas clustering and vector similarity don't.
But: they are more susceptible to fewer votes.
Conclusion (II)
What about efficiency?
In terms of memory, Bayesian Network Models need fewer resources than the other methods.
However, the networks used in this approach are very time-consuming.
Extensions
Distributed collaborative filtering: more flexible recommender systems that can give recommendations to the active user according to the preferences of any user group (even one not similar to the active user).
Hybrid approaches: combinations of memory-based and model-based methods, taking into account the preferences of both the specific user and the group of users similar to him.
Matters of privacy in recommender systems: systems that use homomorphic encryption and verification schemes so as not to expose the users' preferences.
Questions?