LCBM: Statistics-based Parallel Collaborative Filtering Fabio Petroni 1 , Leonardo Querzoni 1 , Roberto Beraldi 1 , Mario Paolucci 2 Cyber Intelligence and information Security CIS Sapienza 1 Department of Computer Control and Management Engineering Antonio Ruberti, Sapienza University of Rome - petroni|querzoni|[email protected] 2 Institute of Cognitive Sciences and Technologies, CNR - [email protected] Larnaca, Cyprus 22-23 May, 2014


Page 1: LCBM: Statistics-based Parallel Collaborative …...LCBM: Statistics-based Parallel Collaborative Filtering Fabio Petroni 1, Leonardo Querzoni ,Roberto Beraldi Mario Paolucci 2 Cyber

LCBM: Statistics-based Parallel Collaborative Filtering

Fabio Petroni 1, Leonardo Querzoni 1, Roberto Beraldi 1, Mario Paolucci 2

Cyber Intelligence and Information Security, CIS Sapienza

1 Department of Computer, Control and Management Engineering Antonio Ruberti, Sapienza University of Rome - petroni|querzoni|[email protected]

2 Institute of Cognitive Sciences and Technologies, CNR - [email protected]

Larnaca, Cyprus, 22-23 May, 2014


Personalization

- Most of today's internet businesses root their success in the ability to provide users with strongly personalized experiences.

- Recommender systems are a subclass of information filtering systems that seek to predict the rating or preference that a user would give to an item.


Collaborative Filtering

- Collaborative Filtering (CF) is today a widely adopted strategy for building recommendation engines.

- CF analyzes the known preferences of a group of users to predict the unknown preferences of other users.

Filtering: filter the user's data stream to satisfy the user's personal interests.

Discovery: recommend new content to the user that matches the user's interests.


Challenges

- Key factors for recommendation effectiveness:
  - amount of data available: larger data sets improve quality more than better algorithms;
  - timeliness: timely delivery of output to users constitutes a business advantage.

✗ Most CF algorithms need to go through a very lengthy training stage, incurring prohibitive computational costs and large time-to-prediction intervals when applied to large data sets.

✗ This lack of efficiency will quickly limit the applicability of these solutions at the current rates of data production growth.


Taxonomy Of CF Algorithms

Collaborative Filtering
- MEMORY-BASED
  - neighborhood models: user-oriented, item-oriented
  - similarity measures: Pearson Correlation Similarity, Cosine Similarity
- MODEL-BASED
  - cluster models
  - Bayesian networks
  - linear regression
  - LATENT FACTORS MODELS
    - pLSA
    - neural networks
    - Latent Dirichlet Allocation
    - MATRIX FACTORIZATION: ALS, SGD, SVD


Memory-Based CF Algorithms

- Operate on the entire set of ratings to generate predictions:
  - compute a similarity score, which reflects the distance between two users or two items:
    - Pearson Correlation Similarity
    - Cosine Similarity
  - predictions are computed by looking at the ratings of the k most similar users or items (k-nearest neighbors).

✓ Simple design and implementation;

✓ used in many real-world systems.

✗ Impose several scalability limitations;

✗ their use is impractical when dealing with large amounts of data.
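The memory-based scheme above can be illustrated with a minimal Python sketch. It is not taken from the slides: the toy `ratings` dictionary, the function names, and the choice of a similarity-weighted average over the k nearest users are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating vectors (dicts: item -> rating)."""
    common = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict_knn(ratings, user, item, k=2):
    """Similarity-weighted average of the ratings given to `item`
    by the k users most similar to `user` (an assumed aggregation rule)."""
    neighbors = [
        (cosine_similarity(ratings[user], ratings[other]), ratings[other][item])
        for other in ratings
        if other != user and item in ratings[other]
    ]
    top = sorted(neighbors, reverse=True)[:k]
    weight = sum(sim for sim, _ in top)
    if weight == 0:
        return None  # no usable neighbor
    return sum(sim * r for sim, r in top) / weight

# Hypothetical toy data: user -> {item: rating}
ratings = {
    "alice": {"a": 5, "b": 3},
    "bob": {"a": 5, "b": 3, "c": 4},
    "carol": {"a": 1, "b": 5, "c": 2},
}
prediction = predict_knn(ratings, "alice", "c", k=2)
```

Every prediction scans the ratings of all other users, which is the scalability limitation listed above.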


Model-Based CF Algorithms

- Use the set of available ratings to estimate or learn a model and then apply this model to make rating predictions.

- The most successful are based on matrix factorization models.


Matrix Factorization 1/2

$$\mathrm{LOSS}(P, Q) = \sum_{i,j} (R_{ij} - P_i Q_j)^2$$

where the sum runs over the known ratings R_ij.

- The most popular techniques to minimize LOSS are Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD).

ALS: alternates between keeping P fixed and keeping Q fixed, each time solving a least-squares problem for the other matrix.

SGD: works by taking steps proportional to the negative of the gradient of the LOSS function.

- Both algorithms need several passes through the set of ratings.
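As an illustration of the SGD variant, here is a minimal pure-Python sketch that minimizes the squared-error LOSS over the known ratings. It is a sketch under stated assumptions, not the Mahout implementation: the function names, the learning rate, the number of epochs, the absence of regularization, and the encoding OK = 1, KO = -1 are all chosen for illustration.

```python
import random

def loss(ratings, P, Q):
    """Sum of squared errors over the known ratings (i, j, r)."""
    return sum((r - sum(p * q for p, q in zip(P[i], Q[j]))) ** 2
               for i, j, r in ratings)

def sgd_factorize(ratings, n_users, n_items, k=2, epochs=200, lr=0.05, seed=0):
    """Learn user factors P and item factors Q with plain SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    order = list(ratings)  # copy, so the caller's list is not shuffled
    for _ in range(epochs):  # each epoch is one pass over the ratings
        rng.shuffle(order)
        for i, j, r in order:
            err = r - sum(p * q for p, q in zip(P[i], Q[j]))
            for f in range(k):
                p, q = P[i][f], Q[j][f]
                # Step along the negative gradient; the factor 2 is folded into lr.
                P[i][f] += lr * err * q
                Q[j][f] += lr * err * p
    return P, Q

# Hypothetical toy data: (user, item, rating) with OK = 1 and KO = -1
ratings = [(0, 0, 1.0), (0, 1, -1.0), (1, 0, 1.0),
           (1, 2, -1.0), (2, 1, 1.0), (2, 2, 1.0)]
P0, Q0 = sgd_factorize(ratings, 3, 3, epochs=0)  # untrained factors
P, Q = sgd_factorize(ratings, 3, 3)
initial_loss, final_loss = loss(ratings, P0, Q0), loss(ratings, P, Q)
```

The nested loop makes the cost of training visible: every epoch touches every rating and updates K factors for each.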


Matrix Factorization 2/2

✓ Provide high-quality results;
✓ SGD was chosen by the top three solutions of KDD-Cup 2011;
✓ parallel and distributed implementations exist.

✗ High computational costs;
✗ large time-to-prediction intervals;
✗ especially when applied to large data sets.

                  SGD             ALS                        LCBM
training time     O(X · K · E)    Ω(K^3 (N + M) + K^2 X)     O(X)
prediction time   O(K)            O(K)                       O(1)
memory usage      O(K (N + M))    O(M^2 + X)                 O(N + M)

In the table, X is the number of collected ratings, K the number of hidden features, E the number of iterations, and N and M the number of users and items, respectively.


Contributions

In this paper we present:

LCBM: Linear Classifier of Beta distributions Means

a novel collaborative filtering algorithm for binary ratings that:

✓ is inherently parallelizable;

✓ provides results whose quality is on par with state-of-the-art solutions;

✓ in shorter time;

✓ using fewer computational resources.


Overview

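[Figure: LCBM block diagram. Training phase: ratings per item feed the item rating PDF inference block (outputs: mean, standard error); ratings per user feed the user profiler (output: threshold). Working phase: mean > threshold gives the predictions.]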

- We consider:
  - items as elements whose tendency to be rated positively/negatively can be statistically characterized using a probability density function;
  - users as entities with different tastes that rate the same items using different criteria and that must thus be profiled.

- The algorithm needs two passes over the set of available ratings to train the model (build user and item profiles).


Item Rating PDF Inference

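- Beta distribution:
  - X = the relative frequency of positive votes that the item will receive.
  - α = OK + 1, β = KO + 1

Item profile

$$\mathrm{MEAN} = \frac{\alpha}{\alpha + \beta} = \frac{\mathrm{OK} + 1}{\mathrm{OK} + \mathrm{KO} + 2}$$

$$\mathrm{SE} = \frac{1}{\mathrm{OK} + \mathrm{KO} + 2} \sqrt{\frac{(\mathrm{OK} + 1)(\mathrm{KO} + 1)}{(\mathrm{OK} + \mathrm{KO})(\mathrm{OK} + \mathrm{KO} + 3)}}$$

- Chebyshev's inequality:

$$\Pr(\mathrm{MEAN} - \delta\,\mathrm{SE} \le X \le \mathrm{MEAN} + \delta\,\mathrm{SE}) \ge 1 - \frac{1}{\delta^2}$$

[Figure: probability density function (PDF, y-axis from 0 to 3) of X (x-axis from 0 to 1).]

- Example:
  - OK = 8, KO = 3
  - MEAN ≈ 0.7
  - SE ≈ 0.04
    - Pr(0.62 ≤ X ≤ 0.78) ≥ 0.75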
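The item profile formulas can be checked numerically with a short Python sketch. The function names are hypothetical, and δ = 2 is inferred from the slide's example, where the bound is 0.75 = 1 - 1/2².

```python
import math

def item_profile(ok, ko):
    """MEAN and SE of the Beta(ok + 1, ko + 1) item profile.
    Assumes at least one rating (ok + ko >= 1)."""
    n = ok + ko
    mean = (ok + 1) / (n + 2)
    se = math.sqrt((ok + 1) * (ko + 1) / (n * (n + 3))) / (n + 2)
    return mean, se

def chebyshev_interval(mean, se, delta):
    """Interval [MEAN - delta*SE, MEAN + delta*SE] and its Chebyshev lower bound."""
    return mean - delta * se, mean + delta * se, 1 - 1 / delta ** 2

mean, se = item_profile(ok=8, ko=3)
low, high, bound = chebyshev_interval(mean, se, delta=2)
```

For OK = 8 and KO = 3 this gives MEAN = 9/13 ≈ 0.69 and SE ≈ 0.037, matching the rounded values on the slide. Without intermediate rounding the interval is about [0.62, 0.77]; the slide's 0.78 comes from using the rounded values 0.7 and 0.04.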


User Profiler

[Figure: a user's ratings placed by key on a line from 0.1 to 0.85, with the discriminant function separating them; misclassified points are marked as OK errors and KO errors.]

- Every rating is represented by a point with two attributes:
  - a boolean value containing the rating (OK or KO);
  - a key ∈ [0, 1] that gives the point's rank in the order:
    - if value = OK → key = MEAN + δ·SE
    - if value = KO → key = MEAN − δ·SE

User profile

Quality threshold (QT) = the discriminant function of a 1D linearclassifier that separates the points with a minimal number of errors.
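One way to find such a threshold is a single sweep over the user's points sorted by key, sketched below in Python. This is an illustrative assumption, not the authors' procedure: the slides do not say how candidate thresholds are chosen or how ties are broken. Here the candidates are the keys themselves, and the lowest threshold with the minimal error count wins.

```python
def quality_threshold(points):
    """points: list of (key, is_ok). Returns (threshold, errors), where a point
    is predicted OK iff key > threshold and errors counts the misclassified points."""
    pts = sorted(points)
    total_ko = sum(1 for _, ok in pts if not ok)
    # A threshold below every key predicts OK everywhere: each KO is an error.
    best_t, best_err = float("-inf"), total_ko
    ok_below = 0  # OK points with key <= t (wrongly predicted KO)
    ko_below = 0  # KO points with key <= t (correctly predicted KO)
    i, n = 0, len(pts)
    while i < n:
        t = pts[i][0]
        while i < n and pts[i][0] == t:  # absorb all points sharing this key
            if pts[i][1]:
                ok_below += 1
            else:
                ko_below += 1
            i += 1
        err = ok_below + (total_ko - ko_below)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# Hypothetical user: (key, rated OK?)
points = [(0.2, False), (0.3, False), (0.4, True), (0.35, False),
          (0.6, True), (0.7, True), (0.5, False)]
qt, errors = quality_threshold(points)
```

After sorting, the sweep is linear in the number of ratings the user has given.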


Working Phase

[Figure: LCBM block diagram (training phase and working phase), as on the Overview slide.]

- In order to produce a prediction, the algorithm simply compares QT values from the user profiler with MEAN values from the item rating PDF inference block:
  - if MEAN > QT → prediction = OK
  - if MEAN ≤ QT → prediction = KO
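The comparison rule can be sketched in a few lines of Python. The dictionaries `item_mean` and `user_qt`, their contents, and the function name are hypothetical stand-ins for the trained item and user profiles.

```python
def predict(item_mean, user_qt, user, item):
    """Predict OK iff the item's MEAN exceeds the user's quality threshold."""
    return "OK" if item_mean[item] > user_qt[user] else "KO"

# Hypothetical trained profiles
item_mean = {"movie_a": 0.69, "movie_b": 0.35}
user_qt = {"alice": 0.5, "bob": 0.8}

prediction = predict(item_mean, user_qt, "alice", "movie_a")
```

Each prediction is two lookups and one comparison, in line with the O(1) prediction time reported for LCBM.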


Experiments: Setting and The Data Sets

Dataset    MovieLens 100k    MovieLens 10M    Netflix
users      943               69878            480189
items      1682              10677            17770
ratings    100000            10000054         100480507

- 5-fold cross-validation approach:
  - this technique divides the dataset into 5 folds and then uses, in turn, one of the folds as the test set and the remaining ones as the training set.

- We compared LCBM against CF implementations in Apache Mahout:
  - ParallelSGDFactorizer: lock-free and parallel implementation of SGD
  - ALSWRFactorizer: ALS with Weighted-λ-Regularization
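The 5-fold procedure can be sketched as a small Python generator. The slides do not say how ratings are assigned to folds, so the seeded shuffle and the round-robin split below are assumptions, as is the function name.

```python
import random

def k_fold_splits(ratings, k=5, seed=0):
    """Yield (train, test) pairs, using each of the k folds as the test set in turn."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)  # assumed: random assignment to folds
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

# Hypothetical stand-in for a list of ratings
ratings = list(range(10))
splits = list(k_fold_splits(ratings, k=5))
```

Each rating appears in exactly one test set, so every rating is predicted once by a model that was not trained on it.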


Experiments: Performance Metrics

- Confusion matrix:
    true positives (TP)     false negatives (FN)
    false positives (FP)    true negatives (TN)

- Matthews correlation coefficient (MCC) ∈ [−1, 1]

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}}$$

  - 1 → perfect prediction;
  - 0 → no better than random prediction;
  - −1 → total disagreement between prediction and observation.

- the time needed to run the test;

- the peak memory load during the test.
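The MCC formula translates directly into Python. Returning 0 when the denominator is zero is a common convention and an assumption here; the slides do not cover that case.

```python
import math

def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

perfect = mcc(tp=5, fn=0, fp=0, tn=5)
chance = mcc(tp=5, fn=5, fp=5, tn=5)
inverted = mcc(tp=0, fn=5, fp=5, tn=0)
```

The three calls reproduce the reference points listed above: 1 for a perfect prediction, 0 for a random-like one, and −1 for total disagreement.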


Results: Accuracy

[Figure: MCC (y-axis, 0 to 0.6) versus number of ratings for MovieLens (10^5), MovieLens (10^7) and Netflix (10^8). Series: LCBM, SGD K=256, SGD K=32, SGD K=8, ALS.]


Results: Computational Resources

[Figure: memory in GB (y-axis, 0 to 20) versus number of ratings for MovieLens (10^5), MovieLens (10^7) and Netflix (10^8), with a zoomed inset (0 to 0.2). Series: LCBM, SGD K=256, SGD K=32, SGD K=8, ALS.]


Results: Timeliness

[Figure: time in seconds (y-axis, 0 to 4500) versus number of ratings for MovieLens (10^5), MovieLens (10^7) and Netflix (10^8), with a zoomed inset (0 to 10). Series: LCBM, SGD K=256, SGD K=32, SGD K=8, ALS.]


Conclusions

- LCBM is a novel CF algorithm for binary ratings.

- It works by analyzing collected ratings to:
  - infer a probability density function of the relative frequency of positive votes that the item will receive;
  - compute a personalized quality threshold for each user.

- These two pieces of information are used to predict missing ratings.

- LCBM is inherently parallelizable.

Future work:

- Handle multidimensional ratings.

- Provide recommendations: a list of the items the user will like the most.


Thank you!

Questions?

Fabio Petroni

Rome, Italy

Current position:

PhD Student in Computer Engineering, Sapienza University of Rome

Research Areas:

Recommendation Systems, Collaborative Filtering, Distributed Systems

[email protected]
