big data and machine learning @ spotify
● D-student starting 2009
● Graduated last year from CSALL
(Student in this class 2013)
● Master thesis at Spotify
● Data Engineer at Spotify in Gothenburg
Me
● What is data at Spotify?
● Big data and processing it
● Using data at Spotify
● Machine Learning
Outline
In the Machine Learning class:
Supervised learning: data (X), labels (Y)
Unsupervised learning: data (X)
What is data at Spotify?
Track metadata: songs, cover art, albums, genres, mood, etc.
User generated: listens, clicks, page views
Users: country, email, etc.
Playlists: tracks of playlist, adds/removes
30 Million songs
60 Million Monthly Active Users
58 Markets
15 Million subscribers
1.5 Billion Playlists
Big Data and processing it
● 20 TB compressed data / day
○ 200 TB generated and stored / day (replication)
● Our business is highly dependent on these logs
○ We pay artists depending on plays, plays = logs
Too much to store on a single computer. We need a cluster to process it! This is typically what is called “Big Data”.
Big Data and processing it
● Distributed computing and storage
○ Hadoop
■ MapReduce
○ Cassandra
● Hadoop cluster
○ 1100 nodes
○ ~8000 jobs/day
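To make the MapReduce idea concrete, here is a minimal single-machine sketch of counting plays per track from play logs. The record layout `(user_id, track_id)` and all function names are assumptions for illustration, not the actual log schema or Hadoop API; a real cluster runs the map and reduce phases on many nodes in parallel.

```python
from collections import defaultdict

# Hypothetical play-log records: (user_id, track_id)
LOGS = [("u1", "t1"), ("u2", "t1"), ("u1", "t2"), ("u3", "t1")]


def map_phase(record):
    """Emit one (track_id, 1) pair for every play record."""
    _user_id, track_id = record
    yield track_id, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the emitted ones to get the play count for one track."""
    return key, sum(values)


def count_plays(logs):
    mapped = (pair for record in logs for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Calling `count_plays(LOGS)` gives the number of plays per track, which is the kind of aggregate that payouts based on plays would need.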
Using data at Spotify
Every part of the company is interested in our data
● Product
○ Are people using X? Should we focus on features such as Y?
● Insights
○ What music is trending? Which artists are popular where?
● Performance
○ How is latency in country Y? Did this reduce stutter in country X?
Using data at Spotify
● Data-driven decision making
○ Like... every decision.
○ Analysts / Data scientists
● A/B test everything!
● A/B testing:
○ Statistical hypothesis testing
○ Simple randomized experiment with >= 2 variants (A, B)
Using data at Spotify: A/B testing
Objective: Decrease time from loading playlist to first play
Hypothesis: The bigger the button, the faster users find it
Test setup:
● A - variant 1
○ 2% of US and SE MAU
● B - variant 2
○ 2% of US and SE MAU
● Control - normal
○ Rest of users in US and SE
“The shuffle button”
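A test setup like the one above needs a stable, random-looking split of users into variants. The sketch below assumes hash-based bucketing, which is one common way to do this; the function name, the experiment key and the use of SHA-256 are illustrative assumptions, not a description of the actual assignment system.

```python
import hashlib


def assign_variant(user_id, country, experiment="shuffle-button"):
    """Return "A", "B" or "control" for users in the test markets, else None.

    Hashing the experiment name together with the user id gives every user
    a stable bucket in [0, 100), so the same user always sees the same variant.
    """
    if country not in {"US", "SE"}:
        return None  # outside the test markets
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < 2:
        return "A"        # roughly 2% of users
    if bucket < 4:
        return "B"        # roughly another 2%
    return "control"      # everyone else in US and SE
```

Including the experiment name in the hash means a user who lands in variant A of one test is not automatically in variant A of every other test.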
Using data at Spotify: A/B testing
CONTROL A B
Analytics: A/B testing
Metric: Share of users whose time to first play is > 500 ms
(500 ms is made up)
Let's roll out A to all users and throw away B!
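Before rolling out a variant, the difference in the metric should pass a statistical hypothesis test. As a sketch, a two-proportion z-test is one standard choice for comparing a share between two groups; the slides do not say which test was actually used, and the counts in the usage note are made up.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    hits_x: users in group x whose first play took longer than the threshold
    n_x:    total users in group x
    Returns (z, p_value).
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(400, 1000, 500, 1000)` compares a 40% share against a 50% share; a small p-value suggests the difference is unlikely to be noise.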
● Machine Learning
○ User analysis
○ Artist disambiguation
○ Recommender systems
Outline
“A music session somehow represents a moment for the user. Can we find these moments and describe them?”
● Take a subset of user listening data with new genre data
○ Combine listens into sessions
■ Consecutive plays with no pause of 15 min or more
○ Session = [genres]
● Clustering algorithms to find similar sessions
○ K-means / Hierarchical clustering
● Describe the clusters using logistic regression
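The session-building step can be sketched as follows. The input layout `(timestamp_seconds, genre)` is an assumption, and the gap is measured between play start times, which simplifies "pause" since track length is ignored.

```python
SESSION_GAP_SECONDS = 15 * 60  # a pause of 15 minutes or more starts a new session


def split_into_sessions(plays, gap=SESSION_GAP_SECONDS):
    """Split ONE user's plays into sessions, each a list of genres.

    plays: list of (timestamp_seconds, genre) tuples (hypothetical layout).
    """
    sessions = []
    last_timestamp = None
    for timestamp, genre in sorted(plays):
        if last_timestamp is None or timestamp - last_timestamp >= gap:
            sessions.append([])  # first play, or the pause was long enough
        sessions[-1].append(genre)
        last_timestamp = timestamp
    return sessions
```

Each returned session is the `[genres]` list that the clustering step takes as input.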
Machine Learning: Cluster user music sessions
K-Means Per cluster classification
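A minimal K-means sketch on sessions, assuming each session is turned into a vector of genre shares. The feature encoding, the random initialisation and the fixed iteration count are illustrative choices, not details from the talk.

```python
import random


def genre_vector(session, genres):
    """Share of each genre in a session, in the order given by `genres`."""
    return [session.count(g) / len(session) for g in genres]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20, seed=0):
    """Plain K-means: returns (cluster index per point, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

The cluster index per session is then the label used for the per-cluster classification.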
Machine Learning: Cluster user music sessions
Per cluster logistic regression
w: weight vector
Each w_i can be interpreted as the effect of the x_i variable
x_i = genres
Machine Learning: Cluster user music sessions
Clusters described by logistic regression: name each cluster after the x_i with the largest w_i
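One way to sketch the per-cluster description: train a one-vs-rest logistic regression for a cluster (sessions in the cluster are labelled 1, all others 0) and name the cluster after the genre with the largest weight. Plain batch gradient descent and the hyperparameters below are assumptions for illustration.

```python
import math


def train_logistic_regression(X, y, learning_rate=0.5, epochs=500):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, target in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
            error = p - target
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        w = [wi - learning_rate * g / len(X) for wi, g in zip(w, grad_w)]
        b -= learning_rate * grad_b / len(X)
    return w, b


def describe_cluster(X, assignment, cluster, genres):
    """Name a cluster after the genre x_i with the largest weight w_i."""
    y = [1 if a == cluster else 0 for a in assignment]
    w, _bias = train_logistic_regression(X, y)
    return genres[max(range(len(w)), key=lambda i: w[i])]
```

Here `X` holds one genre-share vector per session and `assignment` holds the cluster index per session.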
Machine Learning
Artist disambiguation
Cleaning up the artists pages
Machine Learning: Artist disambiguation
Let's listen to those tracks!
Is it really the same Fredrik?
Machine Learning: Artist disambiguation
● Rank artists by probability of being ambiguous
● Apply clustering to each “ambiguous” artist's albums/tracks
○ Using features such as country, release year, label/licensor etc.
○ Distinct clusters could be different artists
● Nicely present this for manual curation
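The clustering step for one ambiguous artist could look like the sketch below. It assumes a simple mismatch distance over country, label and release year, and single-linkage grouping with a threshold; the `Release` type, the distance and the threshold are all hypothetical, since the slides name the features but not the algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    title: str
    country: str
    year: int
    label: str


def distance(a, b, max_year_gap=10):
    """Fraction of mismatching features: 0.0 = alike, 1.0 = nothing in common."""
    mismatches = [
        a.country != b.country,
        a.label != b.label,
        abs(a.year - b.year) > max_year_gap,
    ]
    return sum(mismatches) / len(mismatches)


def cluster_releases(releases, threshold=0.5):
    """Single-linkage clustering: link any two releases closer than threshold."""
    parent = list(range(len(releases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(releases)):
        for j in range(i + 1, len(releases)):
            if distance(releases[i], releases[j]) < threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, release in enumerate(releases):
        clusters.setdefault(find(i), []).append(release.title)
    return sorted(clusters.values())
```

More than one cluster for the same artist name is a hint for the manual curators, not a verdict.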
Machine Learning: Recommender system
The discover page
Machine Learning: Recommender system
Collaborative filtering
Machine Learning: Recommender system
Collaborative filtering
● Build a matrix of user plays
● Compute similarity between items
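As a toy sketch of these two steps, the play matrix can be stored sparsely and items compared by cosine similarity over the users who played them. The data and names below are made up, and cosine is an assumed similarity measure.

```python
import math

# Hypothetical play counts: PLAYS[user][track]
PLAYS = {
    "u1": {"t1": 3, "t2": 1},
    "u2": {"t1": 2, "t2": 2},
    "u3": {"t3": 5},
}


def item_vectors(plays):
    """Transpose the user-by-track matrix into track -> {user: count}."""
    items = {}
    for user, tracks in plays.items():
        for track, count in tracks.items():
            items.setdefault(track, {})[user] = count
    return items


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Tracks played by the same users score close to 1, and tracks with no listeners in common score 0.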
Machine Learning: Recommender system
4 Million tracks x 60 Million users
→ Pairwise similarity infeasible
Approximate the matrix with NMF
Machine Learning: Recommender system
Matrix factorization (latent factor models)
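A small sketch of non-negative matrix factorization, V ≈ W·H, using the classic multiplicative update rules. The update rule, rank and iteration count are assumptions for illustration; the slides do not say how the factorization is actually fitted, and at real scale this runs distributed rather than in pure Python.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (n x m) into w (n x rank) and h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # Update H: H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # Update W: W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(matmul(w, h), ht)
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


def reconstruction_error(v, w, h):
    """Squared Frobenius distance between v and its approximation w * h."""
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))
```

Each row of `w` is a small latent vector for one user and each column of `h` is one for a track, which is what makes the later similarity computations cheap.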
Machine Learning: Recommender system
Small vectors
Cosine similarity and dot product are efficient
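With small latent vectors, scoring a track for a user is one cosine computation. The sketch below ranks tracks by an exact linear scan over made-up vectors; the function names are hypothetical, and at full catalogue size an approximate nearest-neighbour index would replace the scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(user_vector, track_vectors, already_played, k=2):
    """Return the k unplayed tracks whose vectors are closest to the user's."""
    scored = [
        (cosine(user_vector, vector), track)
        for track, vector in track_vectors.items()
        if track not in already_played
    ]
    return [track for _score, track in sorted(scored, reverse=True)[:k]]
```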
Machine Learning: Recommender system
Finding recommendations: Approximate nearest neighbour (ANN)
code: https://github.com/spotify/annoy
Related artists & Radio: Similar to user recommendations, more models and not all CF-based
Multiple models: Score candidates from all models, combine and rank!
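"Combine and rank" can be sketched as a weighted sum of per-model scores. The weighted sum, the model names and the weights are assumptions; the slides do not say how the scores are actually combined.

```python
def combine_and_rank(model_scores, weights, k=3):
    """Merge candidate scores from several models and return the top k.

    model_scores: {model_name: {candidate: score}}
    weights:      {model_name: weight}  (hypothetical, e.g. tuned offline)
    """
    combined = {}
    for model, scores in model_scores.items():
        for candidate, score in scores.items():
            combined[candidate] = combined.get(candidate, 0.0) + weights[model] * score
    # Highest combined score first; ties are broken by candidate name.
    return sorted(combined, key=lambda c: (-combined[c], c))[:k]
```

A candidate that several models agree on collects score from each of them, so it can outrank a candidate that only one model likes.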
Machine Learning: Recommender system
I just went through this quickly; read more details of the Spotify Rec Sys here:
Doing this on MapReduce
Comparing with Netflix
Music Rec @ MLConf 2014
● More content-based ML
○ Fingerprinting: Echo Nest
○ Content-based music recommendation using convolutional neural networks
● Personalize everything
○ Emails
○ Ads
○ User profiling
● ML on other parts of the product than Rec Sys
.. final last words on the Future of ML at Spotify
Summary
● Multiple data sources -> multiple angles
● Data drives decisions with A/B testing
● User analysis
○ Cluster and describe with classifier
● Artist disambiguation
○ Cluster and give to manual curators
● Recommender systems
○ Collaborative filtering
● We supervise thesis workers
○ Artist disambiguation/deduplication
○ Cluster user music sessions
○ Context-based recommender systems
○ Personalized ads / Personalized emails
● We have internships!
www.spotify.com/jobs
.. and potentially you could help us?
Oscar [email protected]
Thank you for listening!