Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
DESCRIPTION
Presentation on scalable collaborative filtering algorithms on Apache Spark, given at the Tapad Taptalk on 6/6/2014
TRANSCRIPT
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Evan Casey
Taptech - 6/6/2014
Overview
● Apache Spark
○ Dataflow model
○ Spark vs Hadoop MapReduce
● Recommender Systems
○ Similarity-based collaborative filtering
○ Distributed implementation on Apache Spark
○ Lessons learned
Apache Spark
● Distributed data-processing framework built on top of HDFS
● Use cases:
○ Interactive analytics
○ Graph algorithms
○ Stream processing
○ Scalable ML
○ Recommendation engines!
Spark vs Hadoop MapReduce
● In-memory data flow model optimized for multi-stage jobs
● Novel approach to fault tolerance
● Similar programming style to Scalding/Cascading
Programming Model
● Resilient Distributed Dataset (RDD)
○ textFile, parallelize
● Parallel Operations
○ map, groupBy, filter, join, etc.
● Optimizations
○ Caching, shared variables
● Demo
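The parallel operations listed above can be mimicked locally in plain Python. This is only a sketch of the dataflow style (map, groupBy, filter over a collection), not Spark's actual API; the word-count data is made up for illustration:

```python
from itertools import groupby

# Local Python stand-ins for Spark's parallel operations -- a sketch of the
# dataflow style only; real Spark distributes these steps across a cluster.
data = ["spark", "hadoop", "spark", "hdfs"]

# map: transform each element into a (key, value) pair
pairs = [(word, 1) for word in data]

# groupBy + sum: count occurrences per key
counts = {k: sum(v for _, v in g)
          for k, g in groupby(sorted(pairs), key=lambda p: p[0])}

# filter: keep only words seen more than once
frequent = {k: v for k, v in counts.items() if v > 1}
print(frequent)  # {'spark': 2}
```

In Spark the same pipeline would run as chained RDD transformations, with each stage partitioned across the cluster.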
What are recommendation algorithms?
● Problem:
○ "Information overload"
○ Diverse user interests
● User-Item Recommendation
○ Recommend content for each user based on a larger training set of user interaction histories
Motivation
● Large-scale recommender systems
○ Millions of users and items (100m+ ratings)
● Problems:
○ Memory-based approach
○ Scalability/Efficiency
○ User interaction sparsity
Collaborative Filtering

[Figure: example user-item rating matrix for users Shawn, Billy, and Mary; known ratings (e.g. 4, 3, 8, 9) fill some cells, and missing ratings are shown as "?"]

● Similarity-based approach
● Two main variants:
○ User-based
○ Item-based
User-based Collaborative Filtering
● Step 1: Obtain the user-item matrix, denoted M_{i,j}
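Step 1 amounts to turning raw (user, item, rating) triples into a sparse matrix. A minimal Python sketch, with entirely illustrative user names, item IDs, and ratings:

```python
# Sketch: build a sparse user-item rating matrix M[user][item] from raw
# (user, item, rating) triples. All names and values are illustrative.
ratings = [
    ("shawn", "item1", 4), ("shawn", "item2", 3),
    ("billy", "item1", 2), ("billy", "item3", 8),
    ("mary",  "item2", 4), ("mary",  "item3", 9),
]

M = {}
for user, item, rating in ratings:
    M.setdefault(user, {})[item] = rating

print(M["shawn"])  # {'item1': 4, 'item2': 3}
```

Storing only observed ratings keeps the matrix sparse, which matters given the user-interaction sparsity noted earlier.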
User-based Collaborative Filtering
● Step 2: Calculate the similarity between pairwise users and compute the top-n nearest neighbors

sim(u, v) = (r_u · r_v) / (||r_u|| ||r_v||)

where r_u, r_v are the rating vectors of pairwise users u and v
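Step 2's similarity computation can be sketched in plain Python. This assumes cosine similarity over co-rated items (one common convention; the norms here use each user's full rating vector), with a toy matrix for illustration:

```python
from math import sqrt

# Sketch: cosine similarity between two users' sparse rating vectors,
# with the dot product taken over co-rated items, then top-n neighbors.
def cosine_sim(ru, rv):
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    dot = sum(ru[i] * rv[i] for i in common)
    norm_u = sqrt(sum(r * r for r in ru.values()))
    norm_v = sqrt(sum(r * r for r in rv.values()))
    return dot / (norm_u * norm_v)

def top_n_neighbors(user, M, n=2):
    sims = [(v, cosine_sim(M[user], M[v])) for v in M if v != user]
    return sorted(sims, key=lambda s: -s[1])[:n]

# Toy user-item matrix (illustrative values only)
M = {"a": {"x": 4, "y": 3}, "b": {"x": 4, "y": 3}, "c": {"y": 1}}
print(top_n_neighbors("a", M, n=2))  # [('b', 1.0), ('c', 0.6)]
```

On Spark this pairwise step is the expensive one: it is typically expressed as a join of the matrix with itself on item keys, so only user pairs sharing at least one item are ever compared.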
User-based Collaborative Filtering
● Step 3: Compute the weighted average of the neighbors' ratings and find the top-n items by score

score(u, i) = r̄_u + Σ_v sim(u, v) (r_{v,i} − r̄_v) / Σ_v |sim(u, v)|

where score(u, i) is the recommendation score of item i, sim(u, v) are the pairwise user similarities, r̄ is the mean rating, and r_{v,i} is the co-rated user rating
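Step 3's weighted average can be sketched as follows. This assumes the standard mean-centered form (neighbor deviations from their own mean rating, weighted by similarity); the matrix and similarity values are illustrative:

```python
# Sketch of the Step 3 prediction: the user's mean rating plus a
# similarity-weighted average of neighbors' mean-centered ratings.
# `neighbors` maps neighbor -> precomputed similarity (from Step 2).
def predict(user, item, M, neighbors):
    mean_u = sum(M[user].values()) / len(M[user])
    num, den = 0.0, 0.0
    for v, sim in neighbors.items():
        if item in M[v]:  # only neighbors who co-rated the item contribute
            mean_v = sum(M[v].values()) / len(M[v])
            num += sim * (M[v][item] - mean_v)
            den += abs(sim)
    return mean_u if den == 0 else mean_u + num / den

# Toy data: user "a" has not rated item "z"; neighbor "b" has.
M = {"a": {"x": 4, "y": 2}, "b": {"x": 4, "y": 2, "z": 5}}
print(predict("a", "z", M, {"b": 1.0}))
```

Ranking each user's unrated items by this score and keeping the top-n yields the final recommendations.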
Results

[Charts: performance results on a standalone cluster and on an Amazon EC2 cluster]
Evaluation
Lessons Learned
● Must manually specify the number of tasks
○ Want 2-4 slices for each CPU in your cluster
● Use broadcast variables for shared data, and cache data that will be reused
● Must account for the "power users"
○ Sample heavy-tailed user-interaction histories
● Need to account for the rating scale of each user!
○ Adjusted cosine similarity and Pearson correlation outperform plain cosine similarity
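The last lesson is easy to demonstrate: mean-centering each user's ratings (as Pearson correlation does) removes their personal rating scale. In this toy example, plain cosine calls two users with opposite preferences "similar" simply because both vectors are positive:

```python
from math import sqrt

# Sketch: Pearson correlation (mean-centered cosine) vs plain cosine
# on dense rating vectors over the same items. Values are illustrative.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    # centering subtracts each user's mean, removing their rating scale
    return cosine([a - mu for a in u], [b - mv for b in v])

u, v = [1, 2, 3], [5, 4, 3]     # opposite preferences, different scales
print(round(cosine(u, v), 2))   # 0.83 -- plain cosine calls them similar
print(round(pearson(u, v), 2))  # -1.0 -- correlation reveals opposite tastes
```

Adjusted cosine similarity applies the same idea in the item-based variant, centering each rating on its user's mean before comparing item vectors.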