Recommendations with hadoop streaming and python


Post on 25-May-2015




TRANSCRIPT

<ul><li> 1. Recommendations with Python and Hadoop Streaming. Andrew Look, Senior Engineer, Shopzilla</li></ul> <p> 2. Getting started. Slides: http://bit.ly/J7vmx7. Python/NumPy installed: http://bit.ly/JWNWbq. Sample code: http://aws-hadoop.s3.amazonaws.com/similarity.zip
3. Outline. Problem; recommendation basics; MapReduce review and conventions; Python + Hadoop Streaming basics; MapReduce jobs (data, code, data-flow); recommendation algorithm.
4. Problem - Music Recommendations. We want to recommend similar artists. We have data from Last.fm: which Last.fm users liked which artists? How can we decide which artists are similar? Examples: Toby Keith, Tupac, De La Soul, Garth Brooks.
5. Solution - Find Artist Similarities. We'll follow along with a tutorial from AWS by Data Wrangling blogger and AWS developer Peter Skomoroch. It uses publicly available data from Last.fm. A user's rating of an artist is the number of plays.
6. Solution - Find Artist Similarities. We can look at co-ratings. One user played artist A's songs X times. The same user played artist B's songs Y times. co-rating = ((A,X),(B,Y))
7. Recommendation Basics. User based: given a user, recommend the artists that are favored by users with similar artist preferences. Item based: given an item (artist), recommend the artists that were most commonly favored by users that also liked the input artist.
8. Recommendation Basics. Types of data: explicit - user rates a movie on Netflix; implicit - user watches a YouTube video. Types of ratings: multivalued, bounded - ex. star rating (1-5); multivalued, unbounded - ex. number of plays (&gt;0); binary - did a user play a movie or not?
9. Last.fm Recommendations. Data was implicitly collected (as users play songs). Transform binary data (did user listen to artist?) ... into multivalued data (how many times?). We'll use item-based recommendations.
10. Mapper Input
11. Map Output - Reduce Input
12. Chaining MapReduce Jobs
13. Distributed Cache
14. Python Shell and Hadoop Streaming. Streaming API requires shell commands: mapper, reducer.
15.
Python Shell and Hadoop Streaming. Streaming API requires shell commands: mapper, reducer. For mapper / reducer commands, the Streaming API will partition the input and distribute it across mappers and reducers.
16. Python Shell and Hadoop Streaming
17. Full Recommendation Job Overview
18. Example - Working Data Set. Inspect your working data set ... Each row is one "rating". Each "number of plays" is the "rating value". Code: cat input/sample_user_artist_data.txt | head
19. Example - Working Data Set
User ID    Artist ID    Number of Plays
1000020    1001820      20
1000020    1003557      1
1000021    700          1
1000029    1001819      1
1000036    1001820      34
1000036    1011819      2
1000036    700          2
1000040    1001820      1
1000057    1011819      37
1000060    700          17
20. Mapper 1 - Count Ratings per Artist. Prepend "LongValueSum:" (more on this later). Use a value of "1". Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper1
21. Mapper 1 - Count Ratings per Artist
Artist ID               Number of Ratings
LongValueSum:1001820    1
LongValueSum:1003557    1
LongValueSum:700        1
LongValueSum:1001819    1
LongValueSum:1001820    1
LongValueSum:1011819    1
LongValueSum:700        1
LongValueSum:1001820    1
LongValueSum:1011819    1
LongValueSum:700        1
22. Mapper 1 - Count Ratings per Artist. We use the sort command locally. We sort by artist ID. This emulates shuffle/sort in Hadoop. Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper1 | sort
23. Mapper 1 - Count Ratings per Artist
Artist ID               Number of Ratings
LongValueSum:1001820    1
LongValueSum:1001820    1
LongValueSum:1001820    1
LongValueSum:1003557    1
LongValueSum:1011819    1
LongValueSum:1011819    1
LongValueSum:1011819    1
LongValueSum:700        1
LongValueSum:700        1
LongValueSum:700        1
24. Reducer 1 - Count Ratings by Artist. "LongValueSum:" tells the aggregate reducer to group by artist ID and sum up the 1s. Emit artist ID as key, count(ratings) as value. Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper1 | sort | ./similarity.py reducer1 &gt; input/artist_playcounts.txt
25.
Reducer 1 - Count Ratings by Artist
Artist ID    Number of Ratings
1000143      1905
1000418      184
1001820      12950
700          7243
1003557      2976
1011819      7601
1012511      1881
26. Mapper 2 - User Artist Preferences. Mapper2 outputs user ID, artist ID as the key. Mapper2 outputs the rating (# plays) as the value. Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper2 int
27. Mapper 2 - User Artist Preferences
User ID, Artist ID    Number of Plays
1000020,1001820       20
1000020,1003557       1
1000021,700           1
1000029,1011819       1
1000036,1001820       34
1000036,1011819       2
1000036,700           2
1000040,1001820       1
1000057,1011819       37
1000060,700           17
28. Mapper 2 - User Artist Preferences. Can large counts skew our results? Apply a log function to outliers. Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort
29. Mapper 2 - Logarithmic Smoothing
User ID, Artist ID    Smoothing    Smoothed Count
1000020,1001820       log(20)      3
1000020,1003557       log(1)       1
1000021,700           log(1)       1
1000029,1011819       log(1)       1
1000036,1001820       log(34)      4
1000036,1011819       log(2)       1
1000036,700           log(2)       1
1000040,1001820       log(1)       1
1000057,1011819       log(37)      4
1000060,700           log(17)      3
30. Reducer 2 - Aggregate User Prefs. Reduce for each user. Key: user ID. Value is complex: count(ratings), sum(rating values), and a space-delimited list of artist ID,rating value pairs. Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort | ./similarity.py reducer2
31. Reducer 2 - Aggregated User Prefs
User ID    Count(ratings) | Sum(rating values) | Artist ID,rating value list
1000020    2 | 4 | 1001820,3 1003557,1
1000021    1 | 1 | 700,1
1000029    1 | 1 | 1011819,1
1000036    3 | 6 | 1001820,4 1011819,1 700,1
1000040    1 | 1 | 1001820,1
1000057    1 | 4 | 1011819,4
1000060    1 | 3 | 700,3
32. Mapper 3 - User Co-Ratings. Mapper3 culls users via a cutoff. Drop the user ID, emit pairwise co-ratings. Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort | ./similarity.py reducer2 | ./similarity.py mapper3 100 input/artist_playcounts.txt | sort
33.
Mapper 3 - User Co-Ratings
Artist ID X    Artist ID Y    Rating X    Rating Y
1000143        1003577        2           3
1000143        1011819        2           3
1001820        700            1           2
1001820        700            1           3
1011819        700            3           2
1011819        700            3           3
1011819        700            4           2
1011819        700            4           2
1011819        700            5           5
1012511        700            1           1
34. Reducer 3 - Artist Similarities. Given the number of artists, computes similarities. Each pair of artists is emitted with its similarity. Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort | ./similarity.py reducer2 | ./similarity.py mapper3 100 input/artist_playcounts.txt | sort | ./similarity.py reducer3 147160 &gt; artist_similarities.txt
35. Reducer 3 - Artist Similarities
Artist ID    Similarity         Artist ID    Co-Ratings
1003557      0.121659425105     1012511      360
1012511      0.121659425105     1003557      360
1003557      0.0197107349416    700          212
700          0.0197107349416    1003557      212
1011819      0.0128808637553    1012511      259
1012511      0.0128808637553    1011819      259
1011819      0.297222927702     700          3050
700          0.297222927702     1011819      3050
1012511      0.0426446192482    700          270
700          0.0426446192482    1012511      270
36. Mapper 4 - Sort by Artist Correlation. Emit artist ID and similarity concatenated. Sort by similarity = recommendation. Code: cat artist_similarities.txt | ./similarity.py mapper4 20 | sort
37. Mapper 4 - Sort by Artist Correlation
Artist X-ID,Similarity    Artist Y-ID    Num Co-Ratings
1012511,0.924219271937    1000143        237
1012511,0.945653412649    1001820        468
1012511,0.957355380752    700            270
1012511,0.961454917198    1000418        50
1012511,0.987119136245    1011819        259
700,0.702777072298        1011819        3050
700,0.898811337303        1001820        2250
700,0.95212801312         1000143        114
700,0.957355380752        1012511        270
700,0.980289265058        1003557        212
38. Reducer 4 - Cosmetic Results. The reducer attaches artist names. Code: cat artist_similarities.txt | ./similarity.py mapper4 20 | sort | ./similarity.py reducer4 3 lastfm/artist_data.txt &gt; related_artists.tsv
39.
Reducer 4 - Cosmetic Results
Artist ID    Related Artist ID    Similarity    Number of Co-Ratings    Artist Name
1000143      1000143              1             0                       Toby Keith
1000143      1003557              0.2434        809                     Garth Brooks
1000143      1000418              0.1068        120                     Mark Chestnutt
1000143      1012511              0.0758        237                     Kenny Rogers
1000418      1000418              1             0                       Mark Chestnutt
1000418      1000143              0.1068        120                     Toby Keith
1000418      1003557              0.0561        14                      Garth Brooks
1000418      1012511              0.0385        50                      Kenny Rogers
40. Pearson Similarity - Visualization. covariance(A, B) = 2.44; covariance(C, D) = -2.36
41. Pearson Similarity - Equation. pearson(x, y) = covariance(x, y) / (stddev(x) * stddev(y)). pearson(A, B) = 0.772; pearson(C, D) = -0.746
42. Pearson Similarity - Summary. Pearson similarity normalizes covariance. It measures linear dependence between two variables. Normalized ... -1 &lt;= pearson(x, y) &lt;= 1 (for any x, y with nonzero standard deviation)
43. Questions?
44. Appendix. Hadoop Streaming: http://hadoop.apache.org/common/docs/r0.20.1/streaming.html. Explanation of LongValueSum: http://stackoverflow.com/questions/1946953/availiable-reducers-in-elastic-mapreduce. Pearson Correlation: http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient. Finding Similar Items with Amazon Elastic MapReduce, Python, and Hadoop Streaming: http://aws.amazon.com/articles/2294
45. Appendix. Anscombe's Quartet: http://en.wikipedia.org/wiki/Anscombes_quartet. Tau Coefficient: http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient. Jaccard Index: http://en.wikipedia.org/wiki/Jaccard_index. Quality of Recommendations: http://en.wikipedia.org/wiki/Mean_squared_error</p>
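
The Pearson equation from slide 41 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from similarity.py: the function name `pearson`, the population (divide-by-n) form of covariance and standard deviation, and the choice to return 0.0 when one input is constant are all assumptions made here. The sample input is the five co-rating rows for the artist pair (1011819, 700) from the Mapper 3 output on slide 33.

```python
import math


def pearson(xs, ys):
    """Return covariance(x, y) / (stddev(x) * stddev(y)) for two rating lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population covariance and standard deviations (divide by n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if std_x == 0 or std_y == 0:
        # Assumption: a constant rating list carries no linear signal.
        return 0.0
    return cov / (std_x * std_y)


# (rating X, rating Y) co-ratings for artists 1011819 and 700, from slide 33.
co_ratings = [(3, 2), (3, 3), (4, 2), (4, 2), (5, 5)]
xs = [x for x, _ in co_ratings]
ys = [y for _, y in co_ratings]
similarity = pearson(xs, ys)
print(similarity)
```

The result for five sample rows will not match the similarity values on slide 35, which were computed over the full data set.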
