big data and machine learning @ spotify

Oscar Carlsson
Data [email protected]

Big Data and Machine Learning @ Spotify

Friday 6/3 2015

Me

● D-student starting 2009
● Graduated last year from CSALL
  (Student in this class 2013)
● Master thesis at Spotify
● Data Engineer at Spotify in Gothenburg

Outline

● What is data at Spotify?
● Big data and processing it
● Using data at Spotify
● Machine Learning

What is data at Spotify?

In the Machine Learning class:
● Supervised learning: data (X), labels (Y)
● Unsupervised learning: data (X)

● Songs: track metadata, cover art, album, genres, mood etc.
● User generated: listens, clicks, page views
● Users: country, email etc.
● Playlists: tracks of playlist, adds/removes

30 Million songs

60 Million Monthly Active Users

58 Markets

15 Million subscribers

1.5 Billion Playlists






Big Data and processing it

● 20 TB compressed data / day
  ○ 200 TB generated and stored / day (replication)
● Our business is highly dependent on these logs
  ○ We pay artists depending on plays, plays = logs

Too much to store on a single computer. We need a cluster to process it! This is typically what is called “Big Data”.

Big Data and processing it

● Distributed computing and storage
  ○ Hadoop
    ■ MapReduce
  ○ Cassandra
● Hadoop cluster
  ○ 1100 nodes
  ○ ~8000 jobs/day
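To make the MapReduce idea concrete, here is a minimal single-machine sketch of counting plays per artist from log lines. The record layout, the 30-second threshold and all names are assumptions made for illustration; they are not Spotify's actual log format or job code, and a real Hadoop job would spread the map and reduce steps over many nodes.

```python
from collections import defaultdict

# Hypothetical play-log records: (user_id, artist, ms_played).
LOGS = [
    ("u1", "artist_a", 210_000),
    ("u2", "artist_a", 180_000),
    ("u1", "artist_b", 5_000),
    ("u3", "artist_b", 240_000),
    ("u3", "artist_a", 200_000),
]

MIN_MS = 30_000  # assumed threshold for a listen to count as a play


def map_play(record):
    """Map step: emit (artist, 1) for each log line that counts as a play."""
    _user, artist, ms_played = record
    if ms_played >= MIN_MS:
        yield artist, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_plays(artist, counts):
    """Reduce step: sum the per-record counts for one artist."""
    return artist, sum(counts)


def count_plays(logs):
    """Run map, shuffle and reduce in sequence on one machine."""
    emitted = (pair for record in logs for pair in map_play(record))
    return dict(reduce_plays(k, v) for k, v in shuffle(emitted).items())


PLAYS = count_plays(LOGS)
```

Because the map step looks at one record at a time and the reduce step at one key at a time, both can run in parallel on different nodes.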






Using data at Spotify

Every part of the company is interested in our data

● Product
  ○ Are people using X? Should we focus on features such as Y?
● Insights
  ○ What music is trending? Which artists are popular where?
● Performance
  ○ How is latency in country Y? Did this reduce stutter in country X?

Using data at Spotify

● Data-driven decision making
  ○ Like.. every decision.
  ○ Analysts / Data scientists
● A/B test everything!
● A/B testing:
  ○ Statistical hypothesis testing
  ○ Simple randomized experiment with >= 2 variants (A, B)
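One common way to do the statistical hypothesis testing for two variants is a two-proportion z-test. The sketch below is an illustration under that assumption, not necessarily the test used at Spotify, and the counts are made up.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants share one rate."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that A and B do not differ.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: users (out of 10,000 per group) hitting some event.
z, p = two_proportion_z_test(1200, 10_000, 1100, 10_000)
```

A small p-value suggests the difference between the groups is unlikely to be pure chance; the cut-off (for example 0.05) is a choice made before running the experiment.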

Using data at Spotify: A/B testing

Objective: Decrease time from loading playlist to first play

Hypothesis: The bigger the button, the faster users find it

Test setup:
● A - variant 1
  ○ 2% of US and SE MAU
● B - variant 2
  ○ 2% of US and SE MAU
● Control - normal
  ○ Rest of users in US and SE

“The shuffle button”

Using data at Spotify: A/B testing

CONTROL A B

Analytics: A/B testing

Metric: Share of users whose first play takes > 500 ms

(500 ms is made up)

Let's roll out A to all users and throw away B!





Outline

● Machine Learning
  ○ User analysis
  ○ Artist disambiguation
  ○ Recommender systems

Machine Learning: Cluster user music sessions

“A music session somehow represents a moment for the user. Can we find these moments and describe them?”

● Take a subset of user listening data with new genre data
  ○ Combine listens into sessions
    ■ Consecutive plays with no pause of 15 min or more
  ○ Session = [genres]
● Clustering algorithms to find similar sessions
  ○ K-means / Hierarchical clustering
● Describe the clusters using logistic regression

http://publications.lib.chalmers.se/records/fulltext/202958/202958.pdf
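The session-building step can be sketched as follows. This is a simplified illustration: it measures the gap between play start times (not from the end of one track to the start of the next), and the data layout and names are assumptions.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=15)


def build_sessions(plays):
    """Group one user's (timestamp, genre) plays into sessions.

    A new session starts whenever the gap to the previous play is
    15 minutes or more. Each session is returned as a list of genres.
    """
    sessions = []
    previous = None
    for timestamp, genre in sorted(plays):
        if previous is None or timestamp - previous >= MAX_GAP:
            sessions.append([])
        sessions[-1].append(genre)
        previous = timestamp
    return sessions


# Hypothetical listening data for a single user.
T0 = datetime(2015, 3, 6, 9, 0)
SESSIONS = build_sessions([
    (T0, "rock"),
    (T0 + timedelta(minutes=4), "rock"),
    (T0 + timedelta(minutes=9), "indie"),
    (T0 + timedelta(minutes=60), "jazz"),
    (T0 + timedelta(minutes=65), "jazz"),
])
```

Each resulting genre list can then be turned into a fixed-length vector (for example genre shares) before clustering.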


[Figure: K-Means clustering and per cluster classification]
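A minimal sketch of the K-means step, assuming each session has already been turned into a vector of genre shares. The two-genre vectors are made up, and this is plain Lloyd's algorithm with random initial centroids rather than any particular production implementation.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, assignment per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = mean(members)
    return centroids, assignment


# Hypothetical sessions as [rock share, jazz share].
SESSION_VECTORS = [
    [0.9, 0.1], [0.8, 0.2], [1.0, 0.0],
    [0.1, 0.9], [0.2, 0.8], [0.0, 1.0],
]
CENTROIDS, ASSIGNMENT = k_means(SESSION_VECTORS, k=2)
```

The cluster numbers themselves carry no meaning, which is why a separate step is needed to describe what each cluster contains.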



Per cluster logistic regression

● w: weight vector
● Each w_i can be interpreted as the effect of the x_i variable
● x_i = genres

Clusters are described by logistic regression: each cluster is named after the x_i with the largest w_i
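The idea of naming a cluster after the genre with the largest weight can be sketched like this: fit a one-vs-rest logistic regression per cluster and pick the feature with the largest w_i. The genre list, the data and the hyperparameters are made up for illustration, and the fit is plain batch gradient descent.

```python
import math

GENRES = ["rock", "jazz", "classical"]  # hypothetical feature names


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the log loss; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def describe_cluster(X, assignment, cluster):
    """Name a cluster by the genre with the largest one-vs-rest weight."""
    y = [1 if a == cluster else 0 for a in assignment]
    w, _ = fit_logistic(X, y)
    return GENRES[max(range(len(w)), key=lambda j: w[j])]


# Hypothetical sessions as genre shares, with cluster labels from clustering.
X = [
    [0.9, 0.1, 0.0], [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1], [0.0, 0.9, 0.1],
    [0.1, 0.1, 0.8], [0.0, 0.2, 0.8],
]
CLUSTER_OF = [0, 0, 1, 1, 2, 2]
NAMES = [describe_cluster(X, CLUSTER_OF, c) for c in range(3)]
```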


Machine Learning

Artist disambiguation

Cleaning up the artist pages

Machine Learning: Artist disambiguation


Let's listen to those tracks!

Is it really the same Fredrik?

https://open.spotify.com/artist/1jSOp6z42xe8bnOj0SnU3i



● Rank artists by probability of being ambiguous
● Apply clustering to each “ambiguous” artist's albums/tracks
  ○ Using features such as country, release year, label/licensor etc.
  ○ Distinct clusters could be different artists
● Nicely present this for manual curation
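As a rough illustration of the clustering step only (not the ranking, and not the method actually used), the sketch below groups one artist's releases by counting mismatching features. The distance function, the 10-year window, the threshold and the data are all assumptions.

```python
def distance(a, b):
    """Count mismatching features between two releases (0 to 3)."""
    return (
        (a["country"] != b["country"])
        + (a["label"] != b["label"])
        + (abs(a["year"] - b["year"]) > 10)  # assumed 10-year window
    )


def cluster_releases(releases, threshold=1):
    """Single-linkage grouping: a release joins every cluster that holds
    at least one release within the threshold, merging those clusters."""
    clusters = []
    for release in releases:
        near = [
            c for c in clusters
            if any(distance(release, r) <= threshold for r in c)
        ]
        near_ids = {id(c) for c in near}
        merged = [release] + [r for c in near for r in c]
        clusters = [c for c in clusters if id(c) not in near_ids] + [merged]
    return clusters


# Hypothetical releases listed under one artist name.
RELEASES = [
    {"title": "A", "country": "SE", "label": "x", "year": 2012},
    {"title": "B", "country": "SE", "label": "x", "year": 2014},
    {"title": "C", "country": "US", "label": "y", "year": 1975},
    {"title": "D", "country": "US", "label": "z", "year": 1978},
]
CLUSTERS = cluster_releases(RELEASES)
```

Two clearly separated groups, as in this toy data, would be a hint for a curator that two different artists share the same page.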

Machine Learning: Recommender system

The discover page


Collaborative filtering


Collaborative filtering
● Build a matrix of user plays
● Compute similarity between items
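A tiny sketch of item-based collaborative filtering on a play matrix, using cosine similarity between track columns. The matrix and track names are made up, and cosine is one possible similarity measure among several.

```python
import math

# Hypothetical play counts: rows are users, columns are tracks.
TRACKS = ["track_a", "track_b", "track_c"]
PLAYS = [
    [5, 3, 0],  # user 1
    [4, 0, 0],  # user 2
    [0, 0, 7],  # user 3
    [2, 1, 0],  # user 4
]


def column(matrix, j):
    return [row[j] for row in matrix]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(track_index):
    """Return the other track whose play vector is closest in cosine terms."""
    target = column(PLAYS, track_index)
    others = [j for j in range(len(TRACKS)) if j != track_index]
    best = max(others, key=lambda j: cosine(target, column(PLAYS, j)))
    return TRACKS[best]


SIMILAR_TO_A = most_similar(0)
```

Tracks played by the same users end up with similar columns, so a listener of one is a candidate for the other.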


4 Million tracks x 60 Million users → Pairwise similarity infeasible
Approximate the matrix with NMF


Matrix factorization (latent factor models)


Small vectors: cosine similarity and dot product are efficient
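To illustrate how a play matrix can be approximated by two small non-negative factors, here is a pure-Python sketch of NMF with multiplicative updates (the classic Lee and Seung rule for squared error). It is a toy version for a 4x4 matrix, not how factorization is run at scale; the matrix, k and iteration count are assumptions.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between v and w @ h."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (users x tracks) as w @ h."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # Update h: h *= (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # Update w: w *= (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


# Hypothetical play counts: 4 users x 4 tracks with two taste groups.
V = [
    [5, 5, 0, 0],
    [4, 4, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
]
W, H = nmf(V, k=2)
```

Each row of W is a small user vector and each column of H a small track vector, so similarity can be computed on k numbers instead of on millions of play counts.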


Finding recommendations: Approximate nearest neighbour (ANN)
code: https://github.com/spotify/annoy

Related artists & Radio: Similar to user recommendations, more models and not all CF-based

Multiple models: Score candidates from all models, combine and rank!
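How the scores are combined is not specified here; a weighted sum is one simple possibility. The sketch below assumes that, with made-up model names, weights and scores.

```python
def combine_and_rank(model_scores, weights):
    """Sum weighted scores per candidate and return candidates, best first.

    A candidate missing from a model simply contributes 0 for that model.
    """
    totals = {}
    for model, scores in model_scores.items():
        for candidate, score in scores.items():
            totals[candidate] = totals.get(candidate, 0.0) + weights[model] * score
    return sorted(totals, key=totals.get, reverse=True)


# Hypothetical scores from two models for three candidate tracks.
MODEL_SCORES = {
    "cf": {"track_a": 0.9, "track_b": 0.2},
    "content": {"track_b": 0.9, "track_c": 0.5},
}
WEIGHTS = {"cf": 0.7, "content": 0.3}
RANKED = combine_and_rank(MODEL_SCORES, WEIGHTS)
```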



I just went through this quickly; read more details on the Spotify Rec Sys here:

● Doing this on MapReduce: http://www.a1k0n.net/spotify/ml-madison/
● Comparing with Netflix: http://www.slideshare.net/MrChrisJohnson/algorithmic-music-recommendations-at-spotify
● Music Rec @ MLConf 2014: http://www.slideshare.net/erikbern/music-recommendations-mlconf-2014?related=1

.. final words on the Future of ML at Spotify

● More content-based ML
  ○ Fingerprinting: Echo Nest (http://the.echonest.com/)
  ○ Content-based music recommendation using convolutional neural networks (http://benanne.github.io/2014/08/05/spotify-cnns.html)
● Personalize everything
  ○ Emails
  ○ Ads
  ○ User profiling
● ML on other parts of the product than Rec Sys




Summary

● Multiple data sources -> multiple angles
● Data drives decisions with A/B testing
● User analysis
  ○ Cluster and describe with classifier
● Artist disambiguation
  ○ Cluster and give to manual curators
● Recommender systems
  ○ Collaborative filtering

.. and potentially you could help us?

● We supervise thesis workers
  ○ Artist disambiguation/deduplication
  ○ Cluster user music sessions
  ○ Context-based recommender systems
  ○ Personalized ads / Personalized emails
● We have internships!

www.spotify.com/jobs

Oscar [email protected]

Thank you for listening!


http://se.linkedin.com/in/oscarlsson1

