TRANSCRIPT
John S. Breese, David Heckerman, Carl Kadie
Presentation Group: Petros Adamopoulos, Robert Camilleri, Ioannis Sarantidis, Charalampos Vrisagotis
Collaborative Filtering
Introduction – What is Collaborative Filtering?
Memory-Based Algorithms:
  Correlation
  Vector Similarity
Model-Based Algorithms:
  Bayesian Networks
  Clustering
Evaluation – Methods and Results
Conclusion
Introduction
Prediction of very complex attributes – opinion.
Collaborative Filtering Approach:
Given a user, try to predict his preference on item Y by finding a set of similar users, based on previous preferences, and using their preference on item Y.
"The underlying assumption of the Collaborative Filtering approach is that those who agreed in the past tend to agree again in the future" – Wikipedia
Introduction
4 users:
A) Metallica, Iron Maiden, Sepultura, Megadeth
B) 50 Cent, Dr. Dre, P Diddy, Snoop Doggy Dogg
C) 50 Cent, Shakira, Britney Spears, Christina Aguilera
D) Britney Spears
What will be the vote on 50 Cent for user D?
Intuitively, we see that we can say something about whether D will like 50 Cent or not by asking C.
Introduction
Memory-Based Algorithms – use the whole sample of user votes to predict a new vote.
Model-Based Algorithms – use the sample of user votes to learn a model, then use that model to predict a new vote.
Explicit Voting – a user expresses a preference on an item.
Implicit Voting – a user 'consuming' an item indicates a preference.
Introduction
Very large datasets.
Very sparse information: millions of items in existence, but any user will have expressed some preference over only a very small subset of these.
New User problem.
New Item problem.
Introduction
[Screenshot: recommendations at www.amazon.co.uk]
Memory-Based Algorithms (I)
Collaborative Filtering task: predict the votes of a particular user (the active user) from a database of user votes.
Mean vote for user i:

\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j}

I_i is the set of items on which user i has voted.
Memory-Based Algorithms (II)
Approach: predict the votes of the active user based on partial info from the user and a set of weights from the user database:

p_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} w(a,i)\,(v_{i,j} - \bar{v}_i)

Weights w(a,i) can reflect: distance, correlation, or similarity.
Algorithms differ in the weight calculation.
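The prediction rule p_{a,j} = \bar{v}_a + \kappa \sum_i w(a,i)(v_{i,j} - \bar{v}_i) can be sketched in Python. The dictionary-of-dictionaries database layout and the toy overlap weight below are illustrative assumptions, not part of the paper; any of the weight functions on the following slides could be plugged in instead.

```python
def mean_vote(votes):
    """Mean vote of a user over the items they have voted on."""
    return sum(votes.values()) / len(votes)

def predict(active_votes, database, weight, item):
    """Predict the active user's vote on `item` from other users' votes:
    p = vbar_a + kappa * sum_i w(a, i) * (v_{i,j} - vbar_i)."""
    vbar_a = mean_vote(active_votes)
    terms, norm = 0.0, 0.0
    for votes in database.values():
        if item not in votes:
            continue
        w = weight(active_votes, votes)
        terms += w * (votes[item] - mean_vote(votes))
        norm += abs(w)
    if norm == 0.0:
        return vbar_a                 # no informative neighbors: fall back
    return vbar_a + terms / norm      # kappa = 1 / sum_i |w(a, i)|

# Toy weight: number of co-voted items (a stand-in for correlation/similarity)
overlap = lambda a, b: len(set(a) & set(b))
db = {"B": {"X": 5, "Y": 1}, "C": {"X": 4, "Z": 2}}
print(predict({"Y": 2, "Z": 3}, db, overlap, "X"))
```

Here \kappa is chosen so the absolute weights sum to 1, one common normalization choice.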
Correlation (I)
One of the first approaches to collaborative filtering.
Correlation shows the strength and direction of the linear relation between 2 random variables.
Correlation (II)
Pearson correlation coefficient.
Computes the correlation between 2 users a and i:

w(a,i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \sum_j (v_{i,j} - \bar{v}_i)^2}}

The summations are over the items for which both users a and i have recorded votes.
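A minimal sketch of this weight, assuming votes are held in per-user dictionaries keyed by item. Following the formula, the sums run over the co-voted items; the means \bar{v}_a and \bar{v}_i are taken here as each user's overall mean vote.

```python
from math import sqrt

def pearson_weight(va, vi):
    """Pearson correlation between users a and i over their co-voted items."""
    common = set(va) & set(vi)
    if not common:
        return 0.0
    mean_a = sum(va.values()) / len(va)
    mean_i = sum(vi.values()) / len(vi)
    num = sum((va[j] - mean_a) * (vi[j] - mean_i) for j in common)
    den = sqrt(sum((va[j] - mean_a) ** 2 for j in common)
               * sum((vi[j] - mean_i) ** 2 for j in common))
    return num / den if den else 0.0   # constant votes give no signal
```

Two users with identical votes over the common items get weight 1; perfectly opposed votes get weight -1.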
Default Voting
Extension to the correlation algorithm.
What if 2 users have voted on few matching items? Correlation only uses I_a ∩ I_i.
Default voting assumes a default vote d for items for which we have no vote, and now uses I_a ∪ I_i.
Can also be extended to items neither has voted on.
d will most of the time reflect a neutral or negative preference.
In applications with implicit voting, default voting gives missing data their actual value.
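The union-with-default idea can be sketched as a small preprocessing step before computing the correlation weight; the dictionary layout and the neutral default d = 0.0 are illustrative assumptions.

```python
def with_default_votes(va, vi, d=0.0):
    """Extend both vote dictionaries to the union of their voted items,
    filling each user's unvoted items with a default vote d (typically a
    neutral or negative preference)."""
    union = set(va) | set(vi)
    return ({j: va.get(j, d) for j in union},
            {j: vi.get(j, d) for j in union})
```

The extended dictionaries are then fed to the usual Pearson weight, so two users with few matching items still compare over a shared item set.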
Vector Similarity (I)
Information Retrieval: documents are represented as vectors of the form

U_D = (v_{D,0}, v_{D,1}, \ldots, v_{D,j}, \ldots, v_{D,N})

N: number of different words; v_{D,j}: frequency of word j in D.

Collaborative Filtering: users are represented as vectors of the form

U_\alpha = (v_{\alpha,0}, v_{\alpha,1}, \ldots, v_{\alpha,j}, \ldots, v_{\alpha,N})

N: number of different items; v_{\alpha,j}: vote on item j by user \alpha.
The analogy:

Information Retrieval | Collaborative Filtering
Documents             | Users
Words                 | Items
Word Frequencies      | Votes
Vector Similarity (II)
Basic formula:

w(a,i) = U_a \cdot U_i = \sum_j v_{a,j}\, v_{i,j}

j: common items between a and i.
However, this metric is biased. Why?
Cosine Similarity

w(a,i) = \cos(U_a, U_i) = \frac{U_a \cdot U_i}{\|U_a\|\,\|U_i\|} = \sum_j \frac{v_{a,j}}{\sqrt{\sum_{k \in I_a} v_{a,k}^2}} \cdot \frac{v_{i,j}}{\sqrt{\sum_{k \in I_i} v_{i,k}^2}}

Other normalization schemes:
Absolute sum
Number of votes
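A sketch of the cosine weight, under the usual convention that missing votes count as zero: the numerator then runs over the common items only, while each norm runs over all of that user's own votes.

```python
from math import sqrt

def cosine_weight(va, vi):
    """Cosine similarity between two users' vote vectors (dicts keyed by
    item); unvoted items are implicitly zero."""
    num = sum(va[j] * vi[j] for j in set(va) & set(vi))
    den = sqrt(sum(v * v for v in va.values())) * \
          sqrt(sum(v * v for v in vi.values()))
    return num / den if den else 0.0
```

Unlike the raw dot product, this is insensitive to how *many* votes each user has cast, which is exactly the bias the normalization removes.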
Vector Similarity – Example
[Figure: bipartite graph with users α and i on one side and Items 1–4 on the other; the edge from user α to Item 1 is weighted v_{α,1}/|v_α|, and similarly for the other edges.]
Meaning of v_{α,1}/|v_α|: probability of a user following the corresponding edge.
Conclusion: Cosine Similarity = probability of the 2 users meeting at an item node.
Inverse User Frequency (I)
Key concept:
In Information Retrieval: not all words have the same importance in determining whether two documents are similar. Words that are common between the documents carry less information about the similarity between the documents.
Similarly for Collaborative Filtering: not all items have the same importance in determining whether two users are similar. Items that are popular between the users carry less information about the similarity between the users.
Inverse User Frequency formula:

f_j = \log\frac{n}{n_j}

n_j: the number of users who have voted for item j
n: the total number of users
Inverse User Frequency (II)
Extending the vector similarity algorithm using IUF:
Votes for items are transformed using the following formula:

v^{new}_{\alpha,j} = f_j \cdot v^{old}_{\alpha,j}

Then the vector similarity metric is evaluated using the transformed votes.
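The IUF computation and vote transform can be sketched as follows, again assuming per-user vote dictionaries:

```python
from math import log

def inverse_user_frequency(database):
    """f_j = log(n / n_j): n users in total, n_j users who voted on item j."""
    n = len(database)
    counts = {}
    for votes in database.values():
        for j in votes:
            counts[j] = counts.get(j, 0) + 1
    return {j: log(n / nj) for j, nj in counts.items()}

def transform(votes, f):
    """Apply v_new = f_j * v_old before evaluating vector similarity."""
    return {j: f.get(j, 0.0) * v for j, v in votes.items()}
```

An item voted on by every user gets f_j = log(1) = 0 and so contributes nothing to the similarity, matching the intuition above.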
Can be applied to the Correlation algorithm too:

w(a,i) = \frac{\sum_j f_j \sum_j f_j v_{a,j} v_{i,j} - \left(\sum_j f_j v_{a,j}\right)\left(\sum_j f_j v_{i,j}\right)}{\sqrt{UV}}

where

U = \sum_j f_j \left( \sum_j f_j v_{a,j}^2 - \Big(\sum_j f_j v_{a,j}\Big)^2 \right)

V = \sum_j f_j \left( \sum_j f_j v_{i,j}^2 - \Big(\sum_j f_j v_{i,j}\Big)^2 \right)
Case Amplification
It is a weight transformation:
Low weights are punished.
High weights are favored.
The transformation formula is:

w^{new}_{a,i} = \begin{cases} \left(w^{old}_{a,i}\right)^p & \text{if } w^{old}_{a,i} \ge 0 \\ -\left(-w^{old}_{a,i}\right)^p & \text{if } w^{old}_{a,i} < 0 \end{cases}

Can be used to reduce noise and demonstrates higher accuracy.
Usually the amplification power is p = 2.5.
Model-Based Methods
Effort to approach the problem from a probabilistic point of view.
The main goal of these methods is to estimate the expected vote value of a user given his previous votes on other items:

p_{a,j} = E[v_{a,j}] = \sum_{i=0}^{m} \Pr(v_{a,j} = i \mid v_{a,k},\, k \in I_a) \cdot i

where:
α is the active user
j is the item
[0, m] is the discrete range of vote values
I_a is the set of preferred items of user α
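Once a model yields a posterior distribution over the vote values 0..m, the expected vote is a simple weighted sum; the distribution below is made-up example data standing in for a cluster or Bayesian-network model's output.

```python
def expected_vote(distribution):
    """E[v_{a,j}] = sum_i Pr(v_{a,j} = i | observed votes) * i,
    where distribution[i] is the model's posterior for vote value i."""
    return sum(i * p for i, p in enumerate(distribution))

# Example posterior over vote values 0, 1, 2 (illustrative numbers)
print(expected_vote([0.1, 0.2, 0.7]))
```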
Cluster Models
Underlying complex pattern that can be captured by latent variables.
Users can be classified into groups (classes) of individuals with similar interests.
Classes are latent variables.
Assumption: given membership in a group, an individual's vote on various items is independent.
Cluster Models

\Pr(C = c, v_1, \ldots, v_n) = \Pr(C = c) \prod_{i=1}^{n} \Pr(v_i \mid C = c)

\Pr(C = c) and \Pr(v_i \mid C = c) are estimated from the user database, via the Expectation-Maximisation algorithm (EM).
Various models are constructed with varying numbers of latent variables.
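Given estimated parameters, the joint probability and the class posterior follow directly from the formula above. The two-class example parameters are invented for illustration; the EM fitting that would produce them is not shown.

```python
def joint(c, votes, prior, cond):
    """Pr(C=c, v_1..v_n) = Pr(C=c) * prod_i Pr(v_i | C=c)."""
    p = prior[c]
    for item, v in votes.items():
        p *= cond[c][item][v]
    return p

def class_posterior(votes, prior, cond):
    """Posterior over the latent class given observed votes (Bayes' rule)."""
    joints = {c: joint(c, votes, prior, cond) for c in prior}
    z = sum(joints.values())
    return {c: p / z for c, p in joints.items()}

# Illustrative two-class model: class c0 likes item "x", class c1 dislikes it
prior = {"c0": 0.5, "c1": 0.5}
cond = {"c0": {"x": {1: 0.9, -1: 0.1}},
        "c1": {"x": {1: 0.1, -1: 0.9}}}
print(class_posterior({"x": 1}, prior, cond))
```

The posterior over classes, combined with Pr(v_j | C = c) for an unseen item j, gives the vote distribution needed for the expected-vote prediction.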
Cluster Models
Ratings (−1, 1): dislike, like.
User A – The Godfather (1), Grease (−1), Goodfellas (1), Casino (1), Raging Bull (1), Taxi Driver (1)
User B – The Godfather (1), Dirty Dancing (−1), Goodfellas (1), Taxi Driver (?)
User preferences on individual items are initially not independent. Introducing latent variables explains away the similarity, and hence the votes become conditionally independent.
NB: we do not know what similarities the latent variables are capturing.
Bayesian Network Models (I)
What is a Bayesian Network Model?
A graphical model (Directed Acyclic Graph) that represents the dependencies between random variables.
Each discrete random variable which takes k values is represented by a node with k states in the network.
Arcs between nodes show the dependencies between variables.
Bayesian Network Models (II)
What is the structure of the graphical model?
Assume that the random variable Y depends on X according to the following Conditional Probability Table (CPT):

P(Y=0|X=0) = 0.9    P(Y=0|X=1) = 0.3
P(Y=1|X=0) = 0.1    P(Y=1|X=1) = 0.7

The dependency is illustrated as follows:

X → Y

Generally, more complex graphical models can arise, with various numbers of nodes, states, and CPTs.
[Figure: a larger example network over nodes X, Y, Z, V, W, T.]
Bayesian Network Models in collaborative filtering (I)
How can Bayesian Network Models be applied to collaborative filtering?
Represent items with nodes.
Each possible vote for an item is a state of its corresponding node.
What happens with the missing values? – We can add an extra state for "no-vote".
What about the CPTs?
They can be obtained from the dataset.
It is more convenient to represent the CPTs with decision trees.
Bayesian Network Models in collaborative filtering (II)
Example of a CPT for the Melrose Place item using a decision tree: the probability that an individual watched Melrose Place given whether they watched the parent programs (all possible votes unified to "watched").
[Decision tree: the root splits on Beverly Hills, 90210 (Watched / Not Watched); one branch splits further on Friends (Watched / Not Watched); the three leaves hold distributions over Melrose Place Watched / Not Watched.]
Why Bayesian Network Models?
They can capture the dependencies between different items.
Easy learning – obtaining the CPTs from the dataset is a straightforward process (though very expensive).
Generally provide accurate results, even in cases with little sensitivity between nodes.
Learning with Bayesian Network Models (I)
Basically, the goal of the learning algorithm is to obtain the CPTs from the dataset and check for dependencies.
Roughly, the algorithm can be specified as follows:

for each item Ii:
    search for a set of items S on which Ii depends
    set the items of S as parent nodes of Ii in the network

However, having many parent nodes for an item leads to several problems during training:
Exponential blow-up of the CPTs.
Over-fitting, by capturing dependencies that happen to occur only in the dataset and are not true in the real world.
Learning with Bayesian Network Models (II)
A solution is to penalize dense networks with many parent nodes for an item.
Keeping the number of parents under 10 makes the network more efficient and accurate.
[Figure: a sparse network over items I1 … I7, each item node having only a few parents.]
Evaluation Criteria (I)
Classes of Collaborative Filtering applications:
1st Class: items presented one-at-a-time with a rating.
2nd Class: items recommended as an ordered list.
Evaluation Criteria (II)
Evaluation sequence:
Dataset divided into a training set and a test set.
Training set used as the collaborative filtering database, or to fit the probabilistic model.
Cycle through the users in the test set, each viewed as the active user.
Divide each test user's votes into an observed set I_α and a set P_α we will attempt to predict.
Scoring Metrics – Class 1
Individual scoring.
Average absolute deviation of the predicted vote from the actual vote:

S_a = \frac{1}{m_a} \sum_{j \in P_a} |p_{a,j} - v_{a,j}|

m_a = number of predicted items (items in P_a) for user a.
The scores are then averaged over all the users.
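The per-user score is a mean absolute deviation; a sketch, assuming predictions and actual votes come as dictionaries keyed by item:

```python
def absolute_deviation_score(predicted, actual):
    """S_a = (1/m_a) * sum_{j in P_a} |p_{a,j} - v_{a,j}|,
    averaged over the items in the prediction set (lower is better)."""
    return sum(abs(predicted[j] - actual[j]) for j in actual) / len(actual)
```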
Scoring Metrics – Class 2 (I)
Ranked scoring.
Evaluation of ranked lists from Information Retrieval:
Recall: percentage of relevant items returned.
Precision: percentage of returned items that were relevant.
Binary votes allow a similar approach.
More general approach: estimate the expected utility of a ranked list to the user.
The expected utility: the probability of viewing a recommended item times its utility.
Here: item utility = difference between the vote and the neutral vote.
Scoring Metrics – Class 2 (II)
The expected utility for a ranked list of items is:

R_a = \sum_j \frac{\max(v_{a,j} - d,\, 0)}{2^{(j-1)/(\alpha-1)}}

d = neutral vote, α = viewing half-life.
Half-life: the position in the list at which the user has a 50% chance of reviewing the item. Here a half-life of 5 items is used.
Final score over all active users:

R = 100\, \frac{\sum_a R_a}{\sum_a R_a^{max}}

R_a^{max} = maximum achievable utility.
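Both scores can be sketched directly from the formulas; the input here is assumed to be each user's actual votes listed in the order the system recommended the items (position j = 1 is the top of the list).

```python
def ranked_utility(ranked_votes, d=0.0, halflife=5):
    """R_a = sum_j max(v_{a,j} - d, 0) / 2**((j - 1) / (halflife - 1)),
    with the votes given in recommendation order."""
    return sum(max(v - d, 0) / 2 ** ((j - 1) / (halflife - 1))
               for j, v in enumerate(ranked_votes, start=1))

def final_score(utilities, max_utilities):
    """R = 100 * sum_a R_a / sum_a R_a^max over all active users."""
    return 100 * sum(utilities) / sum(max_utilities)
```

R_a^max is simply ranked_utility applied to the same votes sorted best-first, i.e. the utility of a perfect ranking.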
Datasets
MS Web: individual visits to the Microsoft website. Implicit voting (page visited or not).
Nielsen network: television viewing data for a 2-week period. Implicit voting (show watched or not).
EachMovie: explicit voting (voting range: 0–5).

                        MSWEB   Nielsen   EachMovie
Total users             3453    1463      4119
Total items             294     203       1623
Mean votes per user     3.95    9.55      46.4
Median votes per user   3       8         26
Protocols
4 protocols used, 2 classes:
All but 1: all votes known except for 1. Evaluates performance with as much data as possible from each test user.
Given X: only X votes observed. Given 2, Given 5, Given 10. Evaluates performance with less data available for each user.
The number of trials for each protocol varies.
Experiments – Algorithms
Correlation with Inverse User Frequency, Default Voting and Case Amplification (CR+)
Vector Similarity with Inverse User Frequency (VSIM)
Bayesian Networks (BN)
Clustering Model (BC)
Using the most popular items (POP)
Experiments – Results (I)
MS Web, Rank Scoring (higher is better):

Algorithm   Given2   Given5   Given10   AllBut1
BN          59.95    59.84    53.92     66.69
CR+         60.64    57.89    51.47     63.59
VSIM        59.22    56.13    49.33     61.70
BC          57.03    54.83    47.83     59.42
POP         49.14    46.91    41.14     49.77
RD          0.91     1.82     4.49      0.93

What is RD?
Experiments – Results (II)
Nielsen, Rank Scoring (higher is better):

Algorithm   Given2   Given5   Given10   AllBut1
BN          34.90    42.24    47.39     44.92
CR+         39.44    43.23    43.47     39.49
VSIM        39.20    40.89    39.12     36.23
BC          19.55    18.85    22.51     16.48
POP         20.17    19.53    19.04     13.91
RD          1.53     1.78     2.42      2.40

Bayesian Networks seem to need more data to have better results.
Experiments – Results (II), continued
Nielsen, Rank Scoring (same table as above):
Vector Similarity and Clusters seem to handle partial data better.
Experiments – Results (III)
EachMovie, Rank Scoring (higher is better):

Algorithm   Given2   Given5   Given10   AllBut1
CR+         41.60    42.33    41.46     23.16
VSIM        42.45    42.12    40.15     22.07
BC          38.06    36.68    34.98     21.38
BN          28.64    30.50    33.16     23.49
POP         30.80    28.90    28.01     13.94
RD          0.75     0.75     0.78      0.78

Why do Bayesian Networks and Clusters perform so poorly?
Experiments – Results (IV)
EachMovie, Absolute Deviation (lower is better):

Algorithm   Given2   Given5   Given10   AllBut1
CR          1.257    1.139    1.069     0.994
BC          1.127    1.144    1.135     1.103
BN          1.154    1.154    1.139     1.066
VSIM        2.113    2.177    2.235     2.136
RD          0.028    0.023    0.025     0.043

A different Correlation algorithm (CR) is used here. Why?
Experiments – Important Notes (I)
Effects of Inverse User Frequency:
Ranked Scoring – average improvement of Correlation: 1.5%; of Vector Similarity: 2.2%.
Absolute Deviation Scoring – average improvement of Correlation: 6.5%; of Vector Similarity: 15.5%.
Effects of Case Amplification:
Average improvement of Correlation / Ranked Scoring: 4.8%.
Average improvement of Correlation / Absolute Deviation Scoring: not significant.
The effects of the two extensions seem to be additive.
Experiments – Important Notes (II)
Bayesian Networks:
Effects of priors – how complex should a tree be?
Priors in general enhance performance. However:
Very small trees (priors that strongly penalize splits)?
Larger trees (with more ancestors and distributions)?
Clustering models:
Information from clustering models can be used to create user profiles, which can be used to enhance:
Advertising
Marketing
Enhanced user services
Conclusion (I)
What is the best method for collaborative filtering?
Many different methods were tested over various datasets.
It is difficult to make a straightforward comparison between the methods used (it depends on matters like the nature of the dataset, the application, and efficiency).
Generally, Bayesian Network Models and correlation provided more accurate results than Bayesian clustering and vector similarity.
This is reasonable, as B.N.M. and correlation capture the dependencies in the dataset, whereas clustering and vector similarity don't.
But: they are more susceptible to fewer votes.
Conclusion (II)
What about efficiency?
In terms of memory, Bayesian Network Models need fewer resources than the other methods.
However, the networks used in this approach are very time-consuming.
Extensions
Distributed collaborative filtering: more flexible recommender systems that can give recommendations to the active user according to the preferences of any user group (even one not similar to the active user).
Hybrid approaches: combinations of memory-based and model-based methods, taking into account the preferences of both the specific user and the group of users similar to him.
Matters of privacy in recommender systems: systems that use homomorphic encryption and verification schemes so as not to expose the users' preferences.
Questions?