Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
DESCRIPTION
Presentation on scalable collaborative filtering algorithms on Apache Spark, given at the Tapad Taptalk on 6/6/2014
TRANSCRIPT
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Evan Casey
Taptech - 6/6/2014
Overview
● Apache Spark
○ Dataflow model
○ Spark vs Hadoop MapReduce
● Recommender Systems
○ Similarity-based collaborative filtering
○ Distributed implementation on Apache Spark
○ Lessons learned
Apache Spark
● Distributed data-processing framework built on top of HDFS
● Use cases:
○ Interactive analytics
○ Graph algorithms
○ Stream processing
○ Scalable ML
○ Recommendation engines!
Spark vs Hadoop MapReduce
● In-memory data flow model optimized for multi-stage jobs
● Novel approach to fault tolerance
● Similar programming style to Scalding/Cascading
Programming Model
● Resilient Distributed Dataset (RDD)
○ textFile, parallelize
● Parallel Operations
○ map, groupBy, filter, join, etc.
● Optimizations
○ Caching, shared variables
● Demo
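The parallel operations listed above can be mimicked locally in plain Python. This is only a sketch of the dataflow style (map, groupBy, filter over a collection), not Spark's actual API; the word-count data is made up for illustration:

```python
from itertools import groupby

# Local Python stand-ins for Spark's parallel operations -- a sketch of the
# dataflow style only; real Spark distributes these steps across a cluster.
data = ["spark", "hadoop", "spark", "hdfs"]

# map: transform each element into a (key, value) pair
pairs = [(word, 1) for word in data]

# groupBy + sum: count occurrences per key
counts = {k: sum(v for _, v in g)
          for k, g in groupby(sorted(pairs), key=lambda p: p[0])}

# filter: keep only words seen more than once
frequent = {k: v for k, v in counts.items() if v > 1}
print(frequent)  # {'spark': 2}
```

In Spark the same pipeline would run as chained RDD transformations, with each stage partitioned across the cluster.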
What are recommendation algorithms?
● Problem:
○ "Information overload"
○ Diverse user interests
● User-Item Recommendation
○ Recommend content for each user based on a larger training set of user interaction histories
Motivation
● Large-scale recommender systems
○ Millions of users and items (100m+ ratings)
● Problems:
○ Memory-based approach
○ Scalability/Efficiency
○ User interaction sparsity
Collaborative Filtering

[Figure: example user-item rating matrix for users Shawn, Billy, and Mary; known ratings (e.g. 4, 3, 8, 9) fill some cells, and missing ratings are shown as "?"]

● Similarity-based approach
● Two main variants:
○ User-based
○ Item-based
User-based Collaborative Filtering
● Step 1: Obtain the user-item matrix, denoted M_{i,j}
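Step 1 amounts to turning raw (user, item, rating) triples into a sparse matrix. A minimal Python sketch, with entirely illustrative user names, item IDs, and ratings:

```python
# Sketch: build a sparse user-item rating matrix M[user][item] from raw
# (user, item, rating) triples. All names and values are illustrative.
ratings = [
    ("shawn", "item1", 4), ("shawn", "item2", 3),
    ("billy", "item1", 2), ("billy", "item3", 8),
    ("mary",  "item2", 4), ("mary",  "item3", 9),
]

M = {}
for user, item, rating in ratings:
    M.setdefault(user, {})[item] = rating

print(M["shawn"])  # {'item1': 4, 'item2': 3}
```

Storing only observed ratings keeps the matrix sparse, which matters given the user-interaction sparsity noted earlier.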
User-based Collaborative Filtering
● Step 2: Calculate the similarity between pairwise users and compute the top-n nearest neighbors

sim(u, v) = (r_u · r_v) / (||r_u|| ||r_v||)

where r_u, r_v are the rating vectors of pairwise users u and v
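Step 2's similarity computation can be sketched in plain Python. This assumes cosine similarity over co-rated items (one common convention; the norms here use each user's full rating vector), with a toy matrix for illustration:

```python
from math import sqrt

# Sketch: cosine similarity between two users' sparse rating vectors,
# with the dot product taken over co-rated items, then top-n neighbors.
def cosine_sim(ru, rv):
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    dot = sum(ru[i] * rv[i] for i in common)
    norm_u = sqrt(sum(r * r for r in ru.values()))
    norm_v = sqrt(sum(r * r for r in rv.values()))
    return dot / (norm_u * norm_v)

def top_n_neighbors(user, M, n=2):
    sims = [(v, cosine_sim(M[user], M[v])) for v in M if v != user]
    return sorted(sims, key=lambda s: -s[1])[:n]

# Toy user-item matrix (illustrative values only)
M = {"a": {"x": 4, "y": 3}, "b": {"x": 4, "y": 3}, "c": {"y": 1}}
print(top_n_neighbors("a", M, n=2))  # [('b', 1.0), ('c', 0.6)]
```

On Spark this pairwise step is the expensive one: it is typically expressed as a join of the matrix with itself on item keys, so only user pairs sharing at least one item are ever compared.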
User-based Collaborative Filtering
● Step 3: Compute the weighted average of the neighbors' ratings and find the top-n items by score

score(u, i) = r̄_u + Σ_v sim(u, v) (r_{v,i} − r̄_v) / Σ_v |sim(u, v)|

where score(u, i) is the recommendation score of item i, sim(u, v) are the pairwise user similarities, r̄ is the mean rating, and r_{v,i} is the co-rated user rating
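Step 3's weighted average can be sketched as follows. This assumes the standard mean-centered form (neighbor deviations from their own mean rating, weighted by similarity); the matrix and similarity values are illustrative:

```python
# Sketch of the Step 3 prediction: the user's mean rating plus a
# similarity-weighted average of neighbors' mean-centered ratings.
# `neighbors` maps neighbor -> precomputed similarity (from Step 2).
def predict(user, item, M, neighbors):
    mean_u = sum(M[user].values()) / len(M[user])
    num, den = 0.0, 0.0
    for v, sim in neighbors.items():
        if item in M[v]:  # only neighbors who co-rated the item contribute
            mean_v = sum(M[v].values()) / len(M[v])
            num += sim * (M[v][item] - mean_v)
            den += abs(sim)
    return mean_u if den == 0 else mean_u + num / den

# Toy data: user "a" has not rated item "z"; neighbor "b" has.
M = {"a": {"x": 4, "y": 2}, "b": {"x": 4, "y": 2, "z": 5}}
print(predict("a", "z", M, {"b": 1.0}))
```

Ranking each user's unrated items by this score and keeping the top-n yields the final recommendations.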
Results

[Charts: performance results on a standalone cluster and on an Amazon EC2 cluster]
Evaluation
Lessons Learned
● Must manually specify the number of tasks
○ Want 2-4 slices for each CPU in your cluster
● Use broadcast variables for shared data, and cache data that will be reused
● Must account for the "power users"
○ Sample heavy-tailed user-interaction histories
● Need to account for the rating scale of each user!
○ Adjusted cosine similarity and Pearson correlation outperform plain cosine similarity
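The last lesson is easy to demonstrate: mean-centering each user's ratings (as Pearson correlation does) removes their personal rating scale. In this toy example, plain cosine calls two users with opposite preferences "similar" simply because both vectors are positive:

```python
from math import sqrt

# Sketch: Pearson correlation (mean-centered cosine) vs plain cosine
# on dense rating vectors over the same items. Values are illustrative.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    # centering subtracts each user's mean, removing their rating scale
    return cosine([a - mu for a in u], [b - mv for b in v])

u, v = [1, 2, 3], [5, 4, 3]     # opposite preferences, different scales
print(round(cosine(u, v), 2))   # 0.83 -- plain cosine calls them similar
print(round(pearson(u, v), 2))  # -1.0 -- correlation reveals opposite tastes
```

Adjusted cosine similarity applies the same idea in the item-based variant, centering each rating on its user's mean before comparing item vectors.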