sparking science up with research recommendations
![Page 1: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/1.jpg)
Sparking Science up with Research Recommendations
Maya Hristakeva @mayahhf
![Page 2: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/2.jpg)
Overview
• What is Mendeley Suggest?
• Computation Layer
• Conclusions
![Page 3: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/3.jpg)
Mendeley builds tools to help researchers …
• Read & Organize
• Search & Discover
• Collaborate & Network
• Experiment & Synthesize
![Page 4: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/4.jpg)
Being the best researcher you can be!
• Good researchers are on top of their game
• Large amount of research produced
• Takes time to get what you need
• Help researchers by recommending relevant research
![Page 5: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/5.jpg)
Mendeley Suggest: Personalized Article Recommender
![Page 6: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/6.jpg)
Recommender System Components
Data (Feature Engineering) → Algorithms → Business Logic and Analytics → User Experience
(information flow; components often built in parallel)
![Page 7: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/7.jpg)
Mendeley Suggest Components (Past)
Data (Feature Engineering) → Algorithms → Business Logic and Analytics → User Experience
(information flow; components often built in parallel)
![Page 8: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/8.jpg)
Mendeley Suggest Components (Present)
Data (Feature Engineering) → Algorithms → Business Logic and Analytics → User Experience
(information flow; components often built in parallel)
![Page 9: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/9.jpg)
Mendeley Suggest Components (Goal)
Data (Feature Engineering) → Algorithms → Business Logic and Analytics → User Experience
(information flow; components often built in parallel)
![Page 10: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/10.jpg)
Overview
• What is Mendeley Suggest?
• Computation Layer
  – Algorithms
  – Evaluation
  – Implementations & Performance
• Conclusions
![Page 11: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/11.jpg)
Personalized Article Recommendations
Input: User libraries
Output: Suggested articles to read
Algorithms:
• Collaborative Filtering
  – Item-based
  – User-based
  – Matrix Factorization
• Content-based
![Page 12: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/12.jpg)
Item-based Collaborative Filtering
Recommend articles that are similar to the ones you read
– Similarity is based on article co-occurrences in users’ libraries
– “Users who read x also read y”
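A minimal sketch of the co-occurrence counting behind "users who read x also read y". The function name and the toy libraries are illustrative assumptions, not code from the talk, and a production system would normalise these raw counts into a similarity score.

```python
from collections import defaultdict
from itertools import combinations


def item_cooccurrences(libraries):
    """Count, for every pair of articles, how many user libraries contain both."""
    counts = defaultdict(int)
    for articles in libraries.values():
        # sorted() gives each unordered pair a single canonical key
        for a, b in combinations(sorted(set(articles)), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical toy data: user id -> articles in that user's library
libraries = {
    "u1": ["x", "y", "z"],
    "u2": ["x", "y"],
    "u3": ["y", "z"],
}
cooc = item_cooccurrences(libraries)
```

Pairs with a high count are the candidates for "similar articles"; how the count is turned into a similarity (log-likelihood, cosine, or something else) is a separate design choice.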
![Page 13: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/13.jpg)
User-based Collaborative Filtering
Find users who have a similar appreciation for articles as you
– Similarity is based on the overlap between users’ libraries
Recommend new articles based on what the users similar to you read
– “Users similar to you (based on a, b, c) also read x”
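The idea can be sketched as follows. Jaccard overlap is assumed here as the measure of library overlap (the slide does not name one), and the function names and toy data are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two libraries: |a ∩ b| / |a ∪ b|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def user_based_recs(target, libraries, top_n=10):
    """Score unseen articles by the summed similarity of the users who have them."""
    own = set(libraries[target])
    scores = {}
    for user, articles in libraries.items():
        if user == target:
            continue
        sim = jaccard(own, articles)
        if sim == 0.0:
            continue  # users with no overlap contribute nothing
        for article in set(articles) - own:
            scores[article] = scores.get(article, 0.0) + sim
    # Highest score first; ties broken alphabetically for determinism
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:top_n]]


# Hypothetical toy data
libraries = {
    "me": ["a", "b", "c"],
    "u1": ["a", "b", "c", "x"],
    "u2": ["a", "y"],
    "u3": ["q"],
}
recs = user_based_recs("me", libraries)
```

Here "x" ranks above "y" because u1 shares three articles with the target user while u2 shares only one; u3 shares none and is ignored.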
![Page 14: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/14.jpg)
Matrix Factorization CF
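(Figure: example ratings matrix $X$ with missing entries, marked "?")
$X$ ($n \times m$) is factorized into $U$ ($n \times k$) and $V$ ($k \times m$)
$f_{ij} = \langle U_{i*}, V_{*j} \rangle$
$E(U, V) = L(X_{ij}, f_{ij}) + R(U, V)$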
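To make the notation concrete, here is a small sketch that evaluates the prediction f_ij and the objective E(U, V) on a toy example. The squared-error loss, the L2 regulariser, the sum over observed entries and the weight `lam` are assumptions chosen for illustration; the slide leaves L and R unspecified.

```python
def predict(U, V, i, j):
    """f_ij = <U_i*, V_*j>: dot product of row i of U and column j of V."""
    return sum(U[i][f] * V[f][j] for f in range(len(V)))


def objective(X, U, V, lam=0.1):
    """E(U, V): squared error over observed entries plus an L2 penalty (assumed forms)."""
    loss = sum((x - predict(U, V, i, j)) ** 2 for (i, j), x in X.items())
    reg = lam * (sum(u * u for row in U for u in row)
                 + sum(v * v for row in V for v in row))
    return loss + reg


# Observed entries only; the missing entry (1, 1) plays the role of "?"
X = {(0, 0): 2.0, (0, 1): 4.0, (1, 0): 1.0}
U = [[1.0], [0.5]]   # n = 2 users, k = 1 factor
V = [[2.0, 4.0]]     # k = 1 factor, m = 2 articles

missing_prediction = predict(U, V, 1, 1)
```

With these factors every observed entry is reproduced exactly, so the objective reduces to the regularisation term, and the missing entry is filled in from the learned factors.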
![Page 15: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/15.jpg)
Overview
• What is Mendeley Suggest?
• Computation Layer
  – Algorithms
  – Evaluation
  – Implementations
• Conclusions
![Page 16: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/16.jpg)
Performance (quadrant chart; quadrants: Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad)
![Page 19: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/19.jpg)
How to measure quality?
• Offline Evaluation
  – Parameter sweep is quick
  – Don’t offend real users
• Methodology
  – n-fold cross-validation
  – time-based validation
• Metrics
  – precision, recall and f-measure
  – AUC (area under ROC curve), NDCG (normalized discounted cumulative gain)
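The ranking metrics listed above can be sketched for a single user as follows. Binary relevance (an article is either in the held-out set or not) is an assumption made for the sketch, and the toy lists are hypothetical.

```python
import math


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


# Hypothetical toy data
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
p = precision_at_k(recommended, relevant, 4)
r = recall_at_k(recommended, relevant, 4)
f = f_measure(p, r)
ndcg = ndcg_at_k(recommended, relevant, 4)
```

Precision and recall ignore the order within the top-k, whereas NDCG rewards placing relevant articles nearer the top.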
![Page 20: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/20.jpg)
Overview
• What is Mendeley Suggest?
• Computation Layer
  – Algorithms
  – Evaluation
  – Implementations
• Conclusions
![Page 21: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/21.jpg)
Implementations (comparison table)
Columns: Mahout (Hadoop) | Mendeley (Hadoop) | Mahout (Spark) | Mendeley (Spark) | MLlib (Spark)
Rows: Item-based CF | User-based CF | Matrix Factorization
![Page 22: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/22.jpg)
Setup
• EMR Cluster
  – Master: 1 x r3.xlarge instance (4 core, 32GB)
  – Core: 10 x r3.2xlarge instances (8 core, 64GB)
• Data: user libraries
  – 15mil documents >>> 1mil users
  – 150mil interactions
• Offline Evaluation
  – Methodology: time-based evaluation
  – Metric: precision@10
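A time-based evaluation with precision@k can be sketched like this. The tuple layout `(user, article, timestamp)`, the cutoff and the stand-in recommender are assumptions for illustration; the talk does not describe its exact split.

```python
def time_based_split(interactions, cutoff):
    """Train on interactions before `cutoff`, hold out those at or after it."""
    train, held_out = {}, {}
    for user, article, timestamp in interactions:
        bucket = train if timestamp < cutoff else held_out
        bucket.setdefault(user, set()).add(article)
    return train, held_out


def mean_precision_at_k(recommend, train, held_out, k=10):
    """Average precision@k over users that have both training and held-out articles."""
    scores = []
    for user, future_articles in held_out.items():
        if user not in train:
            continue  # nothing to base recommendations on
        recs = recommend(user, train)[:k]
        scores.append(sum(1 for a in recs if a in future_articles) / k)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: (user, article, timestamp)
interactions = [
    ("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 5),
    ("u2", "a", 1), ("u2", "d", 6),
    ("u3", "e", 7),
]
train, held_out = time_based_split(interactions, cutoff=4)

# Stand-in recommender that always suggests the same two articles
score = mean_precision_at_k(lambda user, libs: ["c", "z"], train, held_out, k=2)
```

Splitting by time rather than at random keeps the evaluation closer to production, where a model only ever sees the past and is judged on what users add next.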
![Page 23: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/23.jpg)
Implementations (comparison table)
Columns: Mahout (Hadoop) | Mendeley (Hadoop) | Mahout (Spark) | Mendeley (Spark) | MLlib (Spark)
Rows: Item-based CF | User-based CF | Matrix Factorization
![Page 24: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/24.jpg)
Apache Mahout
• Mahout (out-of-the-box)
  – Item-based CF
    • org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
  – ALS Matrix Factorization
    • org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob
    • org.apache.mahout.cf.taste.hadoop.als.RecommenderJob
• Implemented User-based CF on top of Mahout at Mendeley
![Page 25: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/25.jpg)
Performance (quadrant chart; quadrants: Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; axis marker: ~$125)
Points: Orig. item-based Mahout, Tuned item-based Mahout
Annotation: -0.5K (-60%)
![Page 26: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/26.jpg)
Performance (quadrant chart; quadrants: Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; axis marker: ~$125)
Points: Orig. item-based Mahout, Tuned item-based Mahout, Orig. user-based Mahout, Tuned user-based Mahout
Annotations: -0.5K (-60%), -0.1K (-40%)
![Page 27: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/27.jpg)
Performance (quadrant chart; quadrants: Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; axis marker: ~$125)
Points: Orig. item-based Mahout, Tuned item-based Mahout, Orig. user-based Mahout, Tuned user-based Mahout
Annotations: +150%, -0.2K (-55%), -0.7K (-82%)
![Page 28: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/28.jpg)
Mahout Performance
• Mahout’s recommender is already efficient
  – But your data may have unusual properties
• We got improvements by
  – Tuning Hadoop’s mapper and reducer allocation over the RecommenderJob steps
  – Using an appropriate partitioner
• Improve quality
  – Mahout provides Item-based CF
  – We have many more items than users
  – Typically, user-based is more appropriate
![Page 29: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/29.jpg)
Implementations (comparison table)
Columns: Mahout (Hadoop) | Mendeley (Hadoop) | Mahout (Spark) | Mendeley (Spark) | MLlib (Spark)
Rows: Item-based CF | User-based CF | Matrix Factorization
![Page 30: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/30.jpg)
Mahout Spark
• Co-occurrence Recommenders with Spark
  – Item-Item similarity
    • mahout spark-itemsimilarity
    • SimilarityAnalysis.cooccurrencesIDSs(ratings, …)
  – User-User similarity
    • mahout spark-rowsimilarity
    • SimilarityAnalysis.rowSimilarityIDSs(ratings, …)
• Only supports Boolean data and log-likelihood similarity
• Does not generate actual recommendations
![Page 31: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/31.jpg)
Mahout Spark
• Could not get it to run successfully on our data
• Got further by tuning parameters but still failed with OOM
  – spark.driver.maxResultSize
  – spark.kryoserializer.buffer.max
  – spark.default.parallelism
  – spark.storage.memoryFraction
• Gave best runtime performance on MovieLens datasets
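As an illustration of how the parameters above can be passed to a job, the sketch below renders them as `--conf key=value` arguments for `spark-submit`. The property names come from the slide; every value is a placeholder assumption, not a setting used in the talk, and the helper name is hypothetical.

```python
# Placeholder values for illustration only; suitable settings depend on the cluster and data.
SPARK_SETTINGS = {
    "spark.driver.maxResultSize": "2g",
    "spark.kryoserializer.buffer.max": "512m",
    "spark.default.parallelism": "160",
    "spark.storage.memoryFraction": "0.4",
}


def to_submit_args(settings):
    """Render settings as `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(SPARK_SETTINGS)
```

The resulting list can be spliced into a `spark-submit` command line ahead of the application jar.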
![Page 32: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/32.jpg)
Implementations (comparison table)
Columns: Mahout (Hadoop) | Mendeley (Hadoop) | Mahout (Spark) | Mendeley (Spark) | MLlib (Spark)
Rows: Item-based CF | User-based CF | Matrix Factorization
![Page 33: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/33.jpg)
Mendeley Spark
• Started as hack-day project
  – Implement Item-based and User-based CF in Spark
• Can be implemented in two steps
  1. Compute Item-Item or User-User Similarities
     • given user preferences
  2. Compute Recommendations
     • given similarities and user preferences
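The two steps can be sketched on a single machine as below, for the item-based case. This is not the Spark implementation from the talk: cosine similarity on boolean preferences, the summed-similarity scoring rule, and all names and toy data are assumptions made for the sketch.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(libraries):
    """Step 1: cosine similarity between articles, given boolean user preferences."""
    popularity = defaultdict(int)
    cooc = defaultdict(int)
    for articles in libraries.values():
        unique = sorted(set(articles))
        for a in unique:
            popularity[a] += 1
        for a, b in combinations(unique, 2):
            cooc[(a, b)] += 1
    sims = defaultdict(dict)
    for (a, b), n in cooc.items():
        s = n / sqrt(popularity[a] * popularity[b])
        sims[a][b] = s
        sims[b][a] = s
    return sims


def recommend(user, libraries, sims, top_n=10):
    """Step 2: score unseen articles by summed similarity to the user's own articles."""
    own = set(libraries[user])
    scores = defaultdict(float)
    for article in own:
        for other, s in sims.get(article, {}).items():
            if other not in own:
                scores[other] += s
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:top_n]]


# Hypothetical toy data
libraries = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "c", "d"],
    "u4": ["a"],
}
sims = item_similarities(libraries)
recs_u2 = recommend("u2", libraries, sims)
recs_u4 = recommend("u4", libraries, sims)
```

Keeping the two steps separate means the similarity computation can be swapped for another implementation without touching the recommendation step; the user-based variant follows the same shape with the roles of users and articles exchanged.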
![Page 34: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/34.jpg)
Spark: Item-Item Similarity
![Page 35: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/35.jpg)
Spark: Item-Item Similarity
![Page 36: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/36.jpg)
Spark: Item-Item Similarity
![Page 37: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/37.jpg)
Spark: Item-Item Similarity
![Page 38: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/38.jpg)
Spark: Item-Based Recs
![Page 39: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/39.jpg)
Spark: Item-Based Recs
![Page 40: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/40.jpg)
Performance (quadrant chart; quadrants: Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; axis marker: ~$50)
Points: Tuned IB Mahout, Tuned UB Mahout, Orig. UB Spark
![Page 41: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/41.jpg)
Performance (quadrant chart; quadrants: Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; axis marker: ~$50)
Points: Tuned IB Mahout, Tuned UB Mahout, Orig. UB Spark, Tuned UB Spark, Tuned IB Spark
Annotation: -0.1K (-40%)
![Page 42: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/42.jpg)
Mendeley Spark Performance
• Spark implementation of User-based CF performs well
• Managed to run a variation of Item-based CF
  – Uses fewer items per user to recommend similar items to
  – Quality not impacted much
• We got improvements by tuning
  – Resource allocation
  – Parallelism
  – http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
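One way to read "uses fewer items per user" is to cap the number of seed articles taken from each library before looking up similar items. The sketch below assumes the most recently added articles are kept; the talk does not say how the subset is chosen, and the function name and data layout are hypothetical.

```python
def seed_items(library, max_items):
    """Keep at most `max_items` seed articles, preferring the most recently added.

    `library` is a list of (article, added_at) pairs.
    """
    newest_first = sorted(library, key=lambda entry: entry[1], reverse=True)
    return [article for article, _ in newest_first[:max_items]]


# Hypothetical toy data
seeds = seed_items([("a", 1), ("b", 5), ("c", 3)], max_items=2)
```

Capping the seeds bounds the per-user work in the recommendation step, which matters most for users with very large libraries.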
![Page 43: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/43.jpg)
Implementations (comparison table)
Columns: Mahout (Hadoop) | Mendeley (Hadoop) | Mahout (Spark) | Mendeley (Spark) | MLlib (Spark)
Rows: Item-based CF | User-based CF | Matrix Factorization
![Page 44: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/44.jpg)
Spark MLlib DimSum
• DimSum: efficient algorithm for computing all-pairs similarity
  – “Dimension Independent Matrix Square using MapReduce”
  – Contributed by Twitter
• Replace similarity computation with DimSum
  – Only supports cosine similarity
• Does not generate actual recommendations
  – Compute recommendations as before
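For reference, the quantity DimSum estimates is the cosine similarity between every pair of columns of a sparse matrix. The sketch below computes it exactly on one machine; it is not the DimSum sampling scheme itself, which in MLlib is exposed through `RowMatrix.columnSimilarities`. The function name, the dict-based row format and the toy matrix are assumptions.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def column_cosine_similarities(rows, threshold=0.0):
    """Exact cosine similarity between all column pairs of a sparse matrix.

    `rows` is a list of {column: value} dicts, one per matrix row.
    Pairs scoring below `threshold` are dropped.
    """
    norms = defaultdict(float)
    dots = defaultdict(float)
    for row in rows:
        # Ignore explicit zeros so that they never produce a zero norm
        entries = sorted((c, v) for c, v in row.items() if v)
        for col, value in entries:
            norms[col] += value * value
        for (c1, v1), (c2, v2) in combinations(entries, 2):
            dots[(c1, c2)] += v1 * v2
    sims = {}
    for (c1, c2), dot in dots.items():
        s = dot / (sqrt(norms[c1]) * sqrt(norms[c2]))
        if s >= threshold:
            sims[(c1, c2)] = s
    return sims


# Hypothetical toy matrix: rows are users, columns are articles
rows = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "c": 1.0},
    {"a": 1.0, "b": 1.0, "c": 1.0},
]
all_sims = column_cosine_similarities(rows)
strong_sims = column_cosine_similarities(rows, threshold=0.6)
```

With users as rows this yields item-item similarities; transposing the matrix so that articles are rows yields user-user similarities instead.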
![Page 45: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/45.jpg)
MLlib DimSum Item-Item Similarity
![Page 46: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/46.jpg)
MLlib DimSum User-User Similarity
![Page 47: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/47.jpg)
Performance (quadrant chart; quadrants: Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; axis marker: ~$50)
Points: Tuned IB Mahout, Tuned UB Mahout, Tuned UB Spark, Tuned IB Spark, UB DimSum Spark MLlib
![Page 48: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/48.jpg)
Spark MLlib Matrix Factorization
Implements alternating least squares (ALS)
1. Compute Model
2. Compute Recommendations
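The two steps can be illustrated with a deliberately simplified, single-machine ALS restricted to one latent factor, where each update has a closed form. This is a sketch of the alternating idea, not MLlib's implementation; the rank-1 restriction, the L2 weight `lam`, the function names and the toy data are all assumptions.

```python
def als_rank1(X, n_users, n_items, lam=0.1, iterations=20):
    """Step 1: fit a rank-1 model to observed entries X[(user, item)] = value."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Fix v and solve each u[i] in closed form (regularised least squares)
        for i in range(n_users):
            num = sum(x * v[j] for (a, j), x in X.items() if a == i)
            den = lam + sum(v[j] ** 2 for (a, j) in X if a == i)
            u[i] = num / den
        # Fix u and solve each v[j] the same way
        for j in range(n_items):
            num = sum(x * u[i] for (i, b), x in X.items() if b == j)
            den = lam + sum(u[i] ** 2 for (i, b) in X if b == j)
            v[j] = num / den
    return u, v


def recommend(X, u, v, user, top_n=10):
    """Step 2: rank the items the user has not interacted with by predicted score."""
    unseen = [j for j in range(len(v)) if (user, j) not in X]
    return sorted(unseen, key=lambda j: -u[user] * v[j])[:top_n]


# Hypothetical toy data drawn from the rank-1 matrix [[1, 2, 3], [2, 4, 6]],
# with the entry for user 1, item 2 (true value 6) left unobserved
X = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0, (1, 0): 2.0, (1, 1): 4.0}
u, v = als_rank1(X, n_users=2, n_items=3, lam=0.01, iterations=30)
predicted = u[1] * v[2]
recs = recommend(X, u, v, user=1)
```

Step 2 scores every unseen item for every user, which is the part that becomes expensive at scale even when the model itself is cheap to fit.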
![Page 49: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/49.jpg)
Performance (quadrant chart; quadrants: Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; axis marker: ~$50)
Points: Tuned IB Mahout, Tuned UB Mahout, Tuned UB Spark, Tuned IB Spark, UB DimSum Spark MLlib, ALS Matrix Fact. Spark MLlib
Annotation: -50%
![Page 50: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/50.jpg)
MLlib Performance
• Provides good alternative for computing user-user similarities
  – Due to data sparsity, not getting big gains in runtime
  – Only supports cosine similarity
• Failed to compute item-item similarities
  – Exceeds maximum allowed value of 2G for spark.kryoserializer.buffer.max
• User-based CF outperforms ALS CF
• Need scalable solution for generating recommendations based on ALS CF model
![Page 51: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/51.jpg)
Implementations (comparison table)
Columns: Mahout (Hadoop) | Mendeley (Hadoop) | Mahout (Spark) | Mendeley (Spark) | MLlib (Spark)
Rows: Item-based CF | User-based CF | Matrix Factorization
![Page 52: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/52.jpg)
Overview
• What is Mendeley Suggest?
• Computation Layer
• Conclusions
![Page 53: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/53.jpg)
Performance (quadrant chart; quadrants: Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; axis marker: ~$50)
Points: Tuned IB Mahout, Tuned UB Mahout, Tuned UB Spark, Tuned IB Spark, UB DimSum Spark MLlib, ALS Matrix Fact. Spark MLlib
Annotations: +100%, +150%
![Page 54: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/54.jpg)
Mendeley Suggest Components (Future)
Data (Feature Engineering) → Algorithms → Business Logic and Analytics → User Experience
(information flow; components often built in parallel)
![Page 55: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/55.jpg)
Conclusions
• Mendeley Suggest is a personalized article recommender
• Spark is a good alternative to Mahout as the computation layer
  – Needs some love and tuning
  – Far fewer lines of code, so easier to maintain and extend
• User-based can outperform item-based and matrix factorization
• Save resources and money by understanding your data
• Test offline before deploying, but online tests are also needed to measure real performance
![Page 56: Sparking Science up with Research Recommendations](https://reader031.vdocuments.mx/reader031/viewer/2022021813/589b864d1a28abc0098b4695/html5/thumbnails/56.jpg)
Thank you! mendeley.com/suggest