Sparking Science up with Research Recommendations Maya Hristakeva @mayahhf

Upload: maya-hristakeva

Post on 09-Feb-2017

TRANSCRIPT

Page 1: Sparking Science up with Research Recommendations

Sparking Science up with Research Recommendations

Maya Hristakeva @mayahhf

Page 2: Sparking Science up with Research Recommendations

Overview
•  What is Mendeley Suggest?
•  Computation Layer
•  Conclusions

Page 3: Sparking Science up with Research Recommendations

Mendeley builds tools to help researchers …
•  Read & Organize
•  Search & Discover
•  Collaborate & Network
•  Experiment & Synthesize

Page 4: Sparking Science up with Research Recommendations

Being the best researcher you can be!
•  Good researchers are on top of their game
•  Large amount of research produced
•  Takes time to get what you need
•  Help researchers by recommending relevant research

Page 5: Sparking Science up with Research Recommendations

Mendeley Suggest
Personalized Article Recommender

Page 6: Sparking Science up with Research Recommendations

Recommender System Components
[Diagram: Data (Feature Engineering), Algorithms, Business Logic and Analytics, User Experience; information flow (components often built in parallel)]

Page 7: Sparking Science up with Research Recommendations

Mendeley Suggest Components (Past)
[Diagram: Data (Feature Engineering), Algorithms, Business Logic and Analytics, User Experience; information flow (components often built in parallel)]

Page 8: Sparking Science up with Research Recommendations

Mendeley Suggest Components (Present)
[Diagram: Data (Feature Engineering), Algorithms, Business Logic and Analytics, User Experience; information flow (components often built in parallel)]

Page 9: Sparking Science up with Research Recommendations

Mendeley Suggest Components (Goal)
[Diagram: Data (Feature Engineering), Algorithms, Business Logic and Analytics, User Experience; information flow (components often built in parallel)]

Page 10: Sparking Science up with Research Recommendations

Overview
•  What is Mendeley Suggest?
•  Computation Layer
    –  Algorithms
    –  Evaluation
    –  Implementations & Performance
•  Conclusions

Page 11: Sparking Science up with Research Recommendations

Personalized Article Recommendations
Input: User libraries
Output: Suggested articles to read
Algorithms:
•  Collaborative Filtering
    –  Item-based
    –  User-based
    –  Matrix Factorization
•  Content-based

Page 12: Sparking Science up with Research Recommendations

Item-based Collaborative Filtering
Recommend articles that are similar to the ones you read
–  Similarity is based on article co-occurrences in users’ libraries
–  “Users who read x also read y”
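The co-occurrence idea above can be sketched in a few lines of plain Python. This is only an illustration, not the Mahout or Mendeley implementation: the `item_cooccurrence` function and the toy `libraries` data are assumptions, and raw pair counts stand in for whatever similarity measure a production system would use.

```python
from collections import defaultdict
from itertools import combinations


def item_cooccurrence(libraries):
    """Count how often each pair of articles appears in the same user library."""
    counts = defaultdict(int)
    for articles in libraries.values():
        # Each unordered pair of distinct articles in one library co-occurs once.
        for a, b in combinations(sorted(set(articles)), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical toy data: user id -> articles in that user's library.
libraries = {
    "u1": ["x", "y", "z"],
    "u2": ["x", "y"],
    "u3": ["y", "z"],
}
cooc = item_cooccurrence(libraries)
```

Here the pair ("x", "y") is counted twice because two users hold both articles, which is the "users who read x also read y" signal.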

Page 13: Sparking Science up with Research Recommendations

User-based Collaborative Filtering

Find users whose taste in articles is similar to yours
–  Similarity is based on the overlap between users’ libraries
Recommend new articles based on what the users similar to you read
–  “Users similar to you (based on a, b, c) also read x”
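A minimal sketch of user-based collaborative filtering, under stated assumptions: library overlap is measured with Jaccard similarity (the slides do not say which overlap measure is used), and the function names, the neighbourhood size `k`, and the toy data are all hypothetical.

```python
def jaccard(a, b):
    """Library overlap: shared articles divided by all articles in either library."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_based(target, libraries, k=2, n=10):
    """Score unseen articles by the summed similarity of the k nearest users holding them."""
    own = set(libraries[target])
    neighbours = sorted(
        ((jaccard(own, lib), user) for user, lib in libraries.items() if user != target),
        reverse=True,
    )[:k]
    scores = {}
    for sim, user in neighbours:
        for article in set(libraries[user]) - own:
            scores[article] = scores.get(article, 0.0) + sim
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda art: (-scores[art], art))[:n]


# Hypothetical toy data: user id -> articles in that user's library.
libraries = {
    "me": ["a", "b", "c"],
    "u1": ["a", "b", "c", "x"],
    "u2": ["a", "x", "y"],
    "u3": ["q"],
}
recs = recommend_user_based("me", libraries)
```

With this data the two nearest users are "u1" and "u2", so "x" (held by both) ranks ahead of "y".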

Page 14: Sparking Science up with Research Recommendations

Matrix Factorization CF

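[Figure: a sparse n x m ratings matrix X, with a missing entry marked “?”, shown alongside factor matrices U (n x k) and V (k x m)]

$f_{ij} = \langle U_{i*}, V_{*j} \rangle$

$E(U, V) = L(X_{ij}, f_{ij}) + R(U, V)$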
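To make the objective concrete, here is a small sketch that fits U and V by stochastic gradient descent. The slide leaves the loss L and the regularizer R unspecified; this sketch assumes squared error for L and an L2 penalty for R, and the function names, hyperparameters, and toy ratings are hypothetical.

```python
import random


def factorize(ratings, n, m, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Fit U (n x k) and V (k x m) on the observed entries of X by SGD.

    Assumed choices: squared error for L and an L2 penalty for R.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n)]
    V = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(k)]
    for _ in range(steps):
        for (i, j), x in ratings.items():
            f = sum(U[i][c] * V[c][j] for c in range(k))  # f_ij = <U_i*, V_*j>
            err = x - f
            for c in range(k):
                u, v = U[i][c], V[c][j]
                U[i][c] += lr * (err * v - reg * u)
                V[c][j] += lr * (err * u - reg * v)
    return U, V


def predict(U, V, i, j):
    """Predicted value f_ij for row i and column j."""
    return sum(U[i][c] * V[c][j] for c in range(len(V)))


def squared_error(ratings, U, V):
    """Sum of squared errors over the observed entries only."""
    return sum((x - predict(U, V, i, j)) ** 2 for (i, j), x in ratings.items())


# Hypothetical observed entries of a 3 x 3 matrix; (0, 2) is one of the unknown cells.
ratings = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1, (2, 1): 1, (2, 2): 5}
before = squared_error(ratings, *factorize(ratings, 3, 3, steps=0))
after = squared_error(ratings, *factorize(ratings, 3, 3, steps=2000))
U, V = factorize(ratings, 3, 3)
guess = predict(U, V, 0, 2)  # estimate for an unobserved cell
```

Only observed entries contribute to the loss, which is what lets the fitted factors fill in the unknown cells.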

Page 15: Sparking Science up with Research Recommendations

Overview
•  What is Mendeley Suggest?
•  Computation Layer
    –  Algorithms
    –  Evaluation
    –  Implementations
•  Conclusions

Page 16: Sparking Science up with Research Recommendations

Performance
[Chart: quadrants labelled Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad]

Page 19: Sparking Science up with Research Recommendations

How to measure quality?
•  Offline Evaluation
    –  Parameter sweep is quick
    –  Don’t offend real users
•  Methodology
    –  n-fold cross-validation
    –  time-based validation
•  Metrics
    –  precision, recall and f-measure
    –  AUC (area under ROC curve), NDCG (normalized discounted cumulative gain)
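As an illustration of the ranking metrics listed above, the following sketch computes precision, recall, and NDCG at a cutoff k. It assumes binary relevance (an article is either relevant or not) and divides precision by k even when fewer than k items are returned; the function names and sample data are hypothetical.

```python
import math


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k slots filled with relevant items."""
    return sum(1 for item in recommended[:k] if item in relevant) / k


def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in recommended[:k] if item in relevant) / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Discounted gain of the ranking divided by the gain of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


# Hypothetical ranking and ground truth.
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
p = precision_at_k(recommended, relevant, 4)
r = recall_at_k(recommended, relevant, 4)
g = ndcg_at_k(recommended, relevant, 4)
```

Precision and recall ignore order within the top k, while NDCG rewards placing relevant articles nearer the top.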

Page 20: Sparking Science up with Research Recommendations

Overview
•  What is Mendeley Suggest?
•  Computation Layer
    –  Algorithms
    –  Evaluation
    –  Implementations
•  Conclusions

Page 21: Sparking Science up with Research Recommendations

Implementations
[Table: columns Mahout (Hadoop), Mendeley (Hadoop), Mahout (Spark), Mendeley (Spark), MLlib (Spark); rows Item-based CF, User-based CF, Matrix Factorization]

Page 22: Sparking Science up with Research Recommendations

Setup
•  EMR Cluster
    –  Master: 1 x r3.xlarge instance (4 cores, 32GB)
    –  Core: 10 x r3.2xlarge instances (8 cores, 64GB)
•  Data: user libraries
    –  15 million documents >>> 1 million users
    –  150 million interactions
•  Offline Evaluation
    –  Methodology: time-based evaluation
    –  Metric: precision@10
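The evaluation setup above (time-based split, precision@10) might look roughly like the sketch below. It assumes interactions carry a timestamp and that everything after a cutoff is held out; the popularity recommender is only a stand-in for the algorithm under test, and all names and data are hypothetical.

```python
def time_split(interactions, cutoff):
    """Split (user, article, timestamp) tuples into training and held-out libraries."""
    train, held_out = {}, {}
    for user, article, timestamp in interactions:
        side = train if timestamp < cutoff else held_out
        side.setdefault(user, set()).add(article)
    # Only articles that are new to the user after the cutoff count as hits.
    for user in held_out:
        held_out[user] -= train.get(user, set())
    return train, held_out


def popularity_recommender(user, train):
    """Stand-in recommender: most common training articles the user lacks."""
    counts = {}
    for library in train.values():
        for article in library:
            counts[article] = counts.get(article, 0) + 1
    unseen = [a for a in counts if a not in train[user]]
    return sorted(unseen, key=lambda a: (-counts[a], a))


def mean_precision_at_k(recommend, train, held_out, k=10):
    """Average precision@k over users with both training and held-out data."""
    scores = []
    for user, future in held_out.items():
        if user not in train or not future:
            continue
        top = recommend(user, train)[:k]
        scores.append(sum(1 for a in top if a in future) / k)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical interaction log: (user, article, timestamp).
interactions = [
    ("u1", "a", 1), ("u1", "b", 2), ("u2", "a", 1), ("u2", "c", 3),
    ("u1", "c", 5), ("u2", "b", 6), ("u3", "d", 7),
]
train, held_out = time_split(interactions, cutoff=4)
score = mean_precision_at_k(popularity_recommender, train, held_out, k=10)
```

Splitting by time rather than at random keeps the recommender from seeing interactions that happened after the ones it is asked to predict.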

Page 23: Sparking Science up with Research Recommendations

Implementations
[Table: columns Mahout (Hadoop), Mendeley (Hadoop), Mahout (Spark), Mendeley (Spark), MLlib (Spark); rows Item-based CF, User-based CF, Matrix Factorization]

Page 24: Sparking Science up with Research Recommendations

Apache Mahout
•  Mahout (out-of-the-box)
    –  Item-based CF
        •  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
    –  ALS Matrix Factorization
        •  org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob
        •  org.apache.mahout.cf.taste.hadoop.als.RecommenderJob
•  Implemented User-based CF on top of Mahout at Mendeley

Page 25: Sparking Science up with Research Recommendations

Performance
[Chart: quadrants labelled Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad. Points: Orig. item-based Mahout, Tuned item-based Mahout. Annotations: -0.5K (-60%), ~$125]

Page 26: Sparking Science up with Research Recommendations

Performance
[Chart: quadrants labelled Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad. Points: Orig. item-based Mahout, Tuned item-based Mahout, Orig. user-based Mahout, Tuned user-based Mahout. Annotations: -0.5K (-60%), -0.1K (-40%), ~$125]

Page 27: Sparking Science up with Research Recommendations

Performance
[Chart: quadrants labelled Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad. Points: Orig. item-based Mahout, Tuned item-based Mahout, Orig. user-based Mahout, Tuned user-based Mahout. Annotations: +150%, -0.2K (-55%), -0.7K (-82%), ~$125]

Page 28: Sparking Science up with Research Recommendations

Mahout Performance
•  Mahout’s recommender is already efficient
    –  But your data may have unusual properties
•  We got improvements by
    –  Tuning Hadoop’s mapper and reducer allocation over the Recommender Job steps
    –  Using an appropriate partitioner
•  Improve quality
    –  Mahout provides Item-based CF
    –  We have many more items than users
    –  In that situation, user-based CF is typically more appropriate

Page 29: Sparking Science up with Research Recommendations

Implementations
[Table: columns Mahout (Hadoop), Mendeley (Hadoop), Mahout (Spark), Mendeley (Spark), MLlib (Spark); rows Item-based CF, User-based CF, Matrix Factorization]

Page 30: Sparking Science up with Research Recommendations

Mahout Spark
•  Co-occurrence Recommenders with Spark
    –  Item-Item similarity
        •  mahout spark-itemsimilarity
           SimilarityAnalysis.cooccurrencesIDSs(ratings, …)
    –  User-User similarity
        •  mahout spark-rowsimilarity
           SimilarityAnalysis.rowSimilarityIDSs(ratings, …)
•  Only supports Boolean data and log-likelihood similarity
•  Does not generate actual recommendations

Page 31: Sparking Science up with Research Recommendations

Mahout Spark
•  Could not get it to run successfully on our data
•  Got further by tuning parameters but still failed with out-of-memory (OOM) errors
    –  spark.driver.maxResultSize
    –  spark.kryoserializer.buffer.max
    –  spark.default.parallelism
    –  spark.storage.memoryFraction
•  Gave best runtime performance on MovieLens datasets

Page 32: Sparking Science up with Research Recommendations

Implementations
[Table: columns Mahout (Hadoop), Mendeley (Hadoop), Mahout (Spark), Mendeley (Spark), MLlib (Spark); rows Item-based CF, User-based CF, Matrix Factorization]

Page 33: Sparking Science up with Research Recommendations

Mendeley Spark
•  Started as a hack-day project
    –  Implement Item-based and User-based CF in Spark
•  Can be implemented in two steps
    1.  Compute Item-Item or User-User Similarities
        •  given user preferences
    2.  Compute Recommendations
        •  given similarities and user preferences
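Step 2 can be sketched in plain Python, assuming the item-item similarities from step 1 are already available as a nested dictionary. This is not the Spark code from the talk; `recommend_item_based`, the similarity values, and the scoring rule (sum of similarities to the user's articles) are illustrative assumptions.

```python
def recommend_item_based(user_items, similarities, n=10):
    """Score each unseen article by summing its similarity to the user's articles."""
    owned = set(user_items)
    scores = {}
    for item in owned:
        for other, sim in similarities.get(item, {}).items():
            if other not in owned:
                scores[other] = scores.get(other, 0.0) + sim
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda a: (-scores[a], a))[:n]


# Hypothetical output of step 1: article -> {similar article: similarity}.
similarities = {
    "x": {"y": 0.8, "z": 0.1},
    "y": {"x": 0.8, "z": 0.5, "w": 0.3},
}
recs = recommend_item_based(["x", "y"], similarities)
```

The user-based variant has the same shape: swap the item-item table for user-user similarities and score articles held by similar users.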

Page 34: Sparking Science up with Research Recommendations

Spark: Item-Item Similarity

Page 35: Sparking Science up with Research Recommendations

Spark: Item-Item Similarity

Page 36: Sparking Science up with Research Recommendations

Spark: Item-Item Similarity

Page 37: Sparking Science up with Research Recommendations

Spark: Item-Item Similarity

Page 38: Sparking Science up with Research Recommendations

Spark: Item-Based Recs

Page 39: Sparking Science up with Research Recommendations

Spark: Item-Based Recs

Page 40: Sparking Science up with Research Recommendations

Performance
[Chart: quadrants labelled Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad. Points: Tuned IB Mahout, Tuned UB Mahout, Orig. UB Spark. Annotations: ~$50]

Page 41: Sparking Science up with Research Recommendations

Performance
[Chart: quadrants labelled Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad. Points: Tuned IB Mahout, Tuned UB Mahout, Orig. UB Spark, Tuned UB Spark, Tuned IB Spark. Annotations: -0.1K (-40%), ~$50]

Page 42: Sparking Science up with Research Recommendations

Mendeley Spark Performance
•  Spark implementation of User-based CF performs well
•  Managed to run a variation of Item-based CF
    –  Uses fewer of each user’s items as the basis for recommending similar items
    –  Quality not impacted much
•  We got improvements by tuning
    –  Resource allocation
    –  Parallelism
    –  http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

Page 43: Sparking Science up with Research Recommendations

Implementations
[Table: columns Mahout (Hadoop), Mendeley (Hadoop), Mahout (Spark), Mendeley (Spark), MLlib (Spark); rows Item-based CF, User-based CF, Matrix Factorization]

Page 44: Sparking Science up with Research Recommendations

Spark MLlib DimSum
•  DimSum: efficient algorithm for computing all-pairs similarity
    –  “Dimension Independent Matrix Square using MapReduce”
    –  Contributed by Twitter
•  Replace similarity computation with DimSum
    –  Only supports cosine similarity
•  Does not generate actual recommendations
    –  Compute recommendations as before
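For reference, the quantity DimSum targets is the cosine similarity between every pair of columns of a sparse matrix. The sketch below computes it exactly in plain Python; it does not implement DimSum's sampling, which is what makes the MLlib version (exposed as `RowMatrix.columnSimilarities`) scale. The function name and the toy rows are hypothetical.

```python
import math
from collections import defaultdict


def column_cosine_similarities(rows):
    """Exact all-pairs cosine similarity between columns.

    rows: list of sparse rows, each a dict {column: value}.
    Returns {(c1, c2): cosine} for column pairs that share at least one row.
    """
    norms = defaultdict(float)
    dots = defaultdict(float)
    for row in rows:
        cols = sorted(row)
        for c in cols:
            norms[c] += row[c] ** 2
        # Every pair of columns present in this row contributes to their dot product.
        for idx, c1 in enumerate(cols):
            for c2 in cols[idx + 1:]:
                dots[(c1, c2)] += row[c1] * row[c2]
    return {
        pair: dot / math.sqrt(norms[pair[0]] * norms[pair[1]])
        for pair, dot in dots.items()
    }


# Hypothetical Boolean user-by-article rows: each row is one user's library.
rows = [
    {"x": 1, "y": 1},
    {"x": 1, "y": 1, "z": 1},
    {"z": 1},
]
sims = column_cosine_similarities(rows)
```

With users as rows this yields item-item similarities; transposing the matrix so that articles are rows yields user-user similarities instead.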

Page 45: Sparking Science up with Research Recommendations

MLlib DimSum Item-Item Similarity

Page 46: Sparking Science up with Research Recommendations

MLlib DimSum User-User Similarity

Page 47: Sparking Science up with Research Recommendations

Performance
[Chart: quadrants labelled Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad. Points: Tuned IB Mahout, Tuned UB Mahout, Tuned UB Spark, Tuned IB Spark, UB DimSum Spark MLlib. Annotations: ~$50]

Page 48: Sparking Science up with Research Recommendations

Spark MLlib Matrix Factorization
Implements alternating least squares (ALS)
1.  Compute Model
2.  Compute Recommendations
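The alternating idea behind ALS can be shown with a deliberately tiny sketch. It is restricted to rank 1, so each least-squares update has a closed form and needs no linear algebra library; MLlib's ALS handles general rank and distributed data. The function name, regularization value, and toy ratings are assumptions.

```python
def als_rank1(ratings, n, m, reg=0.01, iterations=50):
    """Rank-1 ALS: alternately solve for u (length n) and v (length m).

    With v fixed, each u[i] minimises sum_j (x_ij - u[i] * v[j])^2 + reg * u[i]^2,
    which has the closed-form solution used below (and symmetrically for v[j]).
    """
    u = [1.0] * n
    v = [1.0] * m
    by_row, by_col = {}, {}
    for (i, j), x in ratings.items():
        by_row.setdefault(i, []).append((j, x))
        by_col.setdefault(j, []).append((i, x))
    for _ in range(iterations):
        for i, obs in by_row.items():  # step 1: fix v, update u
            u[i] = sum(x * v[j] for j, x in obs) / (reg + sum(v[j] ** 2 for j, _ in obs))
        for j, obs in by_col.items():  # step 2: fix u, update v
            v[j] = sum(x * u[i] for i, x in obs) / (reg + sum(u[i] ** 2 for i, _ in obs))
    return u, v


# Hypothetical data: entries of the rank-1 matrix [[1, 2, 3], [2, 4, 6]]
# with the cell (1, 2) held back as unknown.
ratings = {(0, 0): 1, (0, 1): 2, (0, 2): 3, (1, 0): 2, (1, 1): 4}
u, v = als_rank1(ratings, n=2, m=3)
prediction = u[1] * v[2]  # estimate for the unobserved cell
```

Because the toy data is exactly rank 1, the estimate for the held-back cell lands close to the true value of 6, shrunk slightly by the regularization term.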

Page 49: Sparking Science up with Research Recommendations

Performance
[Chart: quadrants labelled Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad. Points: Tuned IB Mahout, Tuned UB Mahout, Tuned UB Spark, Tuned IB Spark, UB DimSum Spark MLlib, ALS Matrix Fact. Spark MLlib. Annotations: -50%, ~$50]

Page 50: Sparking Science up with Research Recommendations

MLlib Performance
•  Provides a good alternative for computing user-user similarities
    –  Due to data sparsity, not getting big gains in runtime
    –  Only supports cosine similarity
•  Failed to compute item-item similarities
    –  Exceeds maximum allowed value of 2G for spark.kryoserializer.buffer.max
•  User-based CF outperforms ALS CF
•  Need a scalable solution for generating recommendations based on the ALS CF model

Page 51: Sparking Science up with Research Recommendations

Implementations
[Table: columns Mahout (Hadoop), Mendeley (Hadoop), Mahout (Spark), Mendeley (Spark), MLlib (Spark); rows Item-based CF, User-based CF, Matrix Factorization]

Page 52: Sparking Science up with Research Recommendations

Overview
•  What is Mendeley Suggest?
•  Computation Layer
•  Conclusions

Page 53: Sparking Science up with Research Recommendations

Performance
[Chart: quadrants labelled Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad. Points: Tuned IB Mahout, Tuned UB Mahout, Tuned UB Spark, Tuned IB Spark, UB DimSum Spark MLlib, ALS Matrix Fact. Spark MLlib. Annotations: +100%, +150%, ~$50]

Page 54: Sparking Science up with Research Recommendations

Mendeley Suggest Components (Future)
[Diagram: Data (Feature Engineering), Algorithms, Business Logic and Analytics, User Experience; information flow (components often built in parallel)]

Page 55: Sparking Science up with Research Recommendations

Conclusions
•  Mendeley Suggest is a personalized article recommender
•  Spark is a good alternative to Mahout as the computation layer
    –  Needs some love and tuning
    –  Far fewer lines of code, so easier to maintain and extend
•  User-based CF can outperform item-based CF and matrix factorization
•  Save resources and money by understanding your data
•  Test offline before deploying
    –  but online tests are also needed to measure real performance

Page 56: Sparking Science up with Research Recommendations

Thank you!
mendeley.com/suggest