Algorithms for Efficient Collaborative Filtering

Vreixo Formoso

Fidel Cacheda

Víctor Carneiro

University of A Coruña (Spain)

Glasgow - 30th March 2008

EIIR 2008

Outline

Introduction
Background in Collaborative Filtering
Proposed algorithms
Experiments
Conclusions


Introduction

More and more information every day
Personalized retrieval systems are quite interesting
– Recommender systems: recommend items that would be more appropriate for the user’s needs or preferences
– Useful in e-commerce, but we think they could also be useful in Web IR
Recommender systems store some information about the user’s preferences: the user profile
– Explicit or implicit


Introduction

Types of recommender systems:
– Content-based filtering: recommend items based on their content
  Depends on automatic analysis of the items
  Unable to determine the item quality
  Serendipitous find
– Collaborative filtering: based on other users’ evaluations
  It will recommend items well considered by other users with similar interests
  Problems with computational performance and efficiency



Background

User profile: the evaluations made by the user
Evaluation: numerical value (e.g. 1 – 5)
Evaluation matrix: contains the evaluations of the users
Types of collaborative filtering algorithms:
– Memory-based: use similarity measures to find related neighbours (users or items) and predict from them
  The entire matrix is used in each prediction
– Model-based: build a model that represents the user’s behaviour and use it to predict their evaluations
  The parameters of the model are estimated using the evaluation matrix (off-line)
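To illustrate why a memory-based prediction has to touch the whole evaluation matrix, here is a minimal sketch of a user-based predictor. The cosine similarity over co-rated items, the weighted average, the toy data and the function names are all assumptions chosen for illustration; the slides do not say which similarity measure was used.

```python
from math import sqrt

# Toy evaluation matrix stored sparsely: ratings[user][item] = value.
# Users, items and values are made up for illustration.
ratings = {
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i1": 4, "i3": 2},
    "u3": {"i2": 1, "i3": 2},
}

def cosine_similarity(a, b):
    """Cosine similarity restricted to the items both users evaluated (assumed measure)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict_user_based(ratings, u, i):
    """Similarity-weighted average of the evaluations other users gave to item i."""
    numerator = denominator = 0.0
    for other, row in ratings.items():  # every other user is visited on each prediction
        if other == u or i not in row:
            continue
        s = cosine_similarity(ratings[u], row)
        numerator += s * row[i]
        denominator += abs(s)
    return numerator / denominator if denominator else None

print(predict_user_based(ratings, "u1", "i3"))
```

Each call loops over all users and compares profiles item by item, which is where the per-prediction cost comes from.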


Background

Memory-based
– Simple, and give reasonably precise results
– Low scalability
– More sensitive to common recommender systems problems: sparsity, cold-start and spam
Model-based
– Find underlying characteristics in the data
– Faster in prediction time
– Complexity of the models:
  Sensitive to changes in the data
  High construction times
  Model updating when new data are available


Background: Notation

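Evaluation matrix (V): one row per user (U), one column per item (I)

$$
\begin{array}{c|cccc}
       & i_1    & i_2    & \cdots & i_n    \\ \hline
u_1    & v_{11} & v_{12} & \cdots & v_{1n} \\
u_2    & v_{21} & v_{22} & \cdots & v_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
u_m    & v_{m1} & v_{m2} & \cdots & p_{mn}
\end{array}
$$

Items (I): $i_1, \ldots, i_n$
Users (U): $u_1, \ldots, u_m$
User profile ($I_1$ for user $u_1$; in general $I_u$)
Users that have evaluated $i_1$ ($U_1$; in general $U_i$)
Prediction of the evaluation of user m for item n ($p_{mn}$)
$v_{u \cdot}$: evaluations of user u
$v_{\cdot i}$: evaluations for item i
Mean values: $\bar{v}_{u \cdot}$ and $\bar{v}_{\cdot i}$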
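As a rough illustration of this notation, the sketch below stores a toy evaluation matrix sparsely and derives the user profile, the set of users that evaluated an item, and the two kinds of mean. The dictionary layout, the function names and the data are assumptions made for the example, not part of the original slides.

```python
# Toy evaluation matrix V stored sparsely: ratings[u][i] = v_ui.
# Users, items and values are made up for illustration.
ratings = {
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i1": 4, "i3": 2},
    "u3": {"i2": 1, "i3": 2},
}

def items_of(ratings, u):
    """I_u: the items evaluated by user u (the user profile)."""
    return set(ratings[u])

def users_of(ratings, i):
    """U_i: the users that have evaluated item i."""
    return {u for u, row in ratings.items() if i in row}

def user_mean(ratings, u):
    """Mean of v_u. (all evaluations made by user u)."""
    row = ratings[u]
    return sum(row.values()) / len(row)

def item_mean(ratings, i):
    """Mean of v_.i (all evaluations received by item i)."""
    column = [row[i] for row in ratings.values() if i in row]
    return sum(column) / len(column)

print(sorted(items_of(ratings, "u1")))
print(sorted(users_of(ratings, "i1")))
print(user_mean(ratings, "u1"))  # (5 + 3) / 2
print(item_mean(ratings, "i1"))  # (5 + 4) / 2
```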



Proposed algorithms

Objectives:
– Good behaviour in low density
– Computational efficiency
– Constant updating
Item mean algorithm
– Our base
– Use the mean of an item as its prediction

$$p_{ui} = \bar{v}_{\cdot i}$$

Simple mean based algorithm
– The item mean is corrected with the mean of the user

$$p_{ui} = \bar{v}_{\cdot i} + \frac{\sum_{j \in I_u} (v_{uj} - \bar{v}_{\cdot j})}{|I_u|}$$
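The two mean-based predictors on this slide can be sketched in a few lines. This is a minimal illustration under assumptions: the sparse dictionary layout, the function names, the toy data and the fallback for a user with no evaluations are all hypothetical choices, not taken from the slides.

```python
# Toy evaluation matrix: ratings[u][i] = v_ui (values made up for illustration).
ratings = {
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i1": 4, "i3": 2},
    "u3": {"i2": 1, "i3": 2},
}

def item_means(ratings):
    """Training step: one pass over all evaluations to get each item's mean."""
    sums, counts = {}, {}
    for row in ratings.values():
        for i, v in row.items():
            sums[i] = sums.get(i, 0.0) + v
            counts[i] = counts.get(i, 0) + 1
    return {i: sums[i] / counts[i] for i in sums}

def predict_item_mean(means, i):
    """Item mean algorithm: the prediction is the item's mean."""
    return means[i]

def predict_simple_mean(ratings, means, u, i):
    """Simple mean based: item mean plus the user's average deviation
    from the means of the items in their profile."""
    row = ratings.get(u, {})
    if not row:
        # Assumption: with an empty profile, fall back to the item mean.
        return means[i]
    correction = sum(v - means[j] for j, v in row.items()) / len(row)
    return means[i] + correction

means = item_means(ratings)
print(predict_item_mean(means, "i3"))
print(predict_simple_mean(ratings, means, "u1", "i3"))
```

If the user's correction term is cached at training time, both predictors reduce to a lookup and an addition at prediction time.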


Proposed algorithms

Tendencies based algorithm
– Main idea: users tend to evaluate items positively or negatively
  Include tendencies in the formula
– Tendency ≠ mean
– Tendency of a user ($ub_u$) and tendency of an item ($ib_i$):

$$ub_u = \frac{\sum_{i \in I_u} (v_{ui} - \bar{v}_{\cdot i})}{|I_u|}$$

$$ib_i = \frac{\sum_{u \in U_i} (v_{ui} - \bar{v}_{u \cdot})}{|U_i|}$$

– In this algorithm we use the mean of the item and the user as well as their respective tendencies.
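A minimal sketch of how the user and item tendencies might be computed from the evaluation matrix. The data layout, the function name `means_and_tendencies` and the toy values are assumptions for illustration only.

```python
# Toy evaluation matrix: ratings[u][i] = v_ui (values made up for illustration).
ratings = {
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i1": 4, "i3": 2},
    "u3": {"i2": 1, "i3": 2},
}

def means_and_tendencies(ratings):
    """Return (user_mean, item_mean, ub, ib) as dictionaries.

    ub[u]: average of (v_ui - item mean) over the items user u evaluated.
    ib[i]: average of (v_ui - user mean) over the users that evaluated item i.
    Assumes every user in `ratings` has at least one evaluation.
    """
    user_mean = {u: sum(row.values()) / len(row) for u, row in ratings.items()}

    # Regroup the evaluations by item so item-side statistics are easy to compute.
    by_item = {}
    for u, row in ratings.items():
        for i, v in row.items():
            by_item.setdefault(i, []).append((u, v))
    item_mean = {i: sum(v for _, v in uv) / len(uv) for i, uv in by_item.items()}

    ub = {
        u: sum(v - item_mean[i] for i, v in row.items()) / len(row)
        for u, row in ratings.items()
    }
    ib = {
        i: sum(v - user_mean[u] for u, v in uv) / len(uv)
        for i, uv in by_item.items()
    }
    return user_mean, item_mean, ub, ib

user_mean, item_mean, ub, ib = means_and_tendencies(ratings)
print(ub)
print(ib)
```

In this toy data u1 rates above the item means (positive tendency) while i2 is rated below its users' means (negative tendency), even though neither fact is visible from the means alone.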


Proposed algorithms

Tendencies based algorithm

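$$p_{ui} = \max(\bar{v}_{u \cdot} + ib_i,\ \bar{v}_{\cdot i} + ub_u)$$

$$p_{ui} = \min(\bar{v}_{u \cdot} + ib_i,\ \bar{v}_{\cdot i} + ub_u)$$

$$p_{ui} = \min[\max(\bar{v}_{u \cdot},\ (\bar{v}_{\cdot i} + ub_u)\beta + (\bar{v}_{u \cdot} + ib_i)(1 - \beta)),\ \bar{v}_{\cdot i}]$$

$$p_{ui} = \bar{v}_{\cdot i}\,\beta + \bar{v}_{u \cdot}(1 - \beta)$$

β: parameter weighting the item-side term against the user-side term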
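The four prediction formulas on this slide can be combined into one function, sketched below. The slide does not state when each formula applies, so the case conditions here are assumptions: the max formula when both tendencies are positive, the min formula when both are negative, the clamped blend when the tendencies have opposite signs and the means agree with them, and the plain blend of means otherwise. The mirrored branch, the treatment of zero tendencies, the weight name `beta` and its default of 0.5 are also hypothetical.

```python
def predict_tendencies(v_u, v_i, ub_u, ib_i, beta=0.5):
    """Sketch of a tendencies based prediction.

    v_u, v_i   : mean of the user and mean of the item
    ub_u, ib_i : tendency of the user and tendency of the item
    beta       : weighting parameter (default value is an assumption)
    """
    # Assumed case 1: both tendencies positive.
    if ub_u > 0 and ib_i > 0:
        return max(v_u + ib_i, v_i + ub_u)
    # Assumed case 2: both tendencies negative.
    if ub_u < 0 and ib_i < 0:
        return min(v_u + ib_i, v_i + ub_u)

    blend = (v_i + ub_u) * beta + (v_u + ib_i) * (1 - beta)
    # Assumed case 3: negative user, positive item, and the means agree
    # (user mean below item mean). The blend is kept between the two means.
    if ub_u < 0 < ib_i and v_u < v_i:
        return min(max(v_u, blend), v_i)
    # Assumption: the mirrored situation is handled symmetrically.
    if ib_i < 0 < ub_u and v_i < v_u:
        return min(max(v_i, blend), v_u)

    # Assumed case 4: the means contradict the tendencies (or a tendency is zero).
    return v_i * beta + v_u * (1 - beta)

print(predict_tendencies(4.0, 3.0, 0.5, 1.0))    # both positive
print(predict_tendencies(2.0, 3.0, -0.5, -1.0))  # both negative
print(predict_tendencies(2.0, 4.0, -0.5, 1.0))   # opposite signs, means agree
print(predict_tendencies(4.0, 2.0, -0.5, 1.0))   # opposite signs, means disagree
```

Because the means and tendencies can be precomputed, the function itself does a constant amount of work per prediction.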



Experiments

Algorithms evaluated
– Memory-based: user-based, item-based and similarity fusion
– Model-based: regression based, slope one, latent semantic indexing and cluster based smoothing
– Hybrid: personality diagnosis
Dataset: MovieLens
– Real ratings of films: 1 (very bad) – 5 (excellent)
– 100,000 evaluations from 943 users for 1,682 movies (1.78 items per user). Density 6%
– Training set: 10%, 50% and 90%
For each algorithm we evaluated (5 times):
– Training and prediction times
– Quality of the predictions
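The dataset figures quoted on this slide can be checked with a few lines of arithmetic. The variable names are illustrative; only the three counts come from the slide.

```python
# Counts quoted for the MovieLens dataset on this slide.
n_evaluations = 100_000
n_users = 943
n_items = 1_682

# Density: fraction of the user x item matrix that holds an evaluation.
density = n_evaluations / (n_users * n_items)

# Ratio of items to users, and average number of evaluations per user.
items_per_user = n_items / n_users
evaluations_per_user = n_evaluations / n_users

print(f"density: {density:.1%}")
print(f"items per user: {items_per_user:.2f}")
print(f"evaluations per user: {evaluations_per_user:.0f}")
```

The density works out to roughly 6%, and the 1.78 figure matches the ratio of items to users; the average user contributes about 106 evaluations.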


Proposed algorithms

Tendencies based algorithm

Only 5% of the predictions with the 10% training set
2% of the predictions with the 90% training set
This case represents some unusual elements
Tendencies seem a good prediction mechanism


Experiments: Computational complexity

| Algorithm | Training complexity | Prediction complexity |
| --- | --- | --- |
| User Based | - | O(mn) |
| Item Based | O(mn²) | O(n) |
| Similarity Fusion | O(n²m + m²n) | O(mn) |
| Personality Diagnosis | O(m²n) | O(m) |
| Regression Based | O(mn²) | O(n) |
| Slope One | O(mn²) | O(n) |
| Latent Semantic Indexing | O((m+n)³) | O(1) |
| Cluster Based Smoothing | O(mnα + m²n) | O(mn) |
| Item Mean | O(mn) | O(1) |
| Simple Mean Based | O(mn) | O(1) |
| Tendencies Based | O(mn) | O(1) |


Experiments: Training time

| Algorithm | 10% | 50% | 90% |
| --- | --- | --- | --- |
| User Based | 0 | 0 | 0 |
| Item Based | 415 | 1,060 | 1,986 |
| Similarity Fusion | 987 | 3,840 | 5,474 |
| Personality Diagnosis | 257 | 994 | 2,213 |
| Regression Based | 3,302 | 4,575 | 7,780 |
| Slope One | 1,246 | 2,175 | 2,541 |
| Latent Semantic Indexing | 117,758 | 115,218 | 102,855 |
| Cluster Based Smoothing | 60,247 | 71,529 | 44,635 |
| Item Mean | 2 | 3 | 3 |
| Simple Mean Based | 7 | 10 | 5 |
| Tendencies Based | 11 | 15 | 9 |


Experiments: Prediction time


| Algorithm | 10% | 50% | 90% |
| --- | --- | --- | --- |
| User Based | 6,250 | 15,597 | 8,915 |
| Item Based | 221 | 1,864 | 909 |
| Similarity Fusion | 227,736 | 756,834 | 264,951 |
| Personality Diagnosis | 1,369 | 3,845 | 1,400 |
| Regression Based | 205 | 570 | 265 |
| Slope One | 319 | 501 | 116 |
| Latent Semantic Indexing | 162 | 158 | 20 |
| Cluster Based Smoothing | 70,515 | 251,595 | 118,552 |
| Item Mean | 24 | 12 | 2 |
| Simple Mean Based | 25 | 11 | 4 |
| Tendencies Based | 24 | 16 | 4 |


Experiments: Prediction quality


| Algorithm | 10% | 50% | 90% |
| --- | --- | --- | --- |
| User Based | 0.99 | 0.71 | 0.68 |
| Item Based | 0.92 | 0.75 | 0.71 |
| Similarity Fusion | 0.84 | 0.73 | 0.71 |
| Personality Diagnosis | 0.82 | 0.78 | 0.78 |
| Regression Based | 1.03 | 0.76 | 0.74 |
| Slope One | 0.90 | 0.72 | 0.70 |
| Latent Semantic Indexing | 0.85 | 0.77 | 0.73 |
| Cluster Based Smoothing | 0.97 | 0.87 | 0.80 |
| Item Mean | 0.82 | 0.79 | 0.79 |
| Simple Mean Based | 0.79 | 0.72 | 0.72 |
| Tendencies Based | 0.79 | 0.72 | 0.71 |



Conclusions

We have presented a couple of algorithms for collaborative filtering:
– Very simple
– Good response times
– Tendencies based algorithm:
  Quality of the predictions equivalent to the best algorithms
  Even better in low density training sets
Next steps: use these algorithms in Web IR
– Problems: dataset?


Thank you!

Questions?
