big data and machine learning @ spotify

Oscar Carlsson Data Engineer [email protected] Big Data and Machine Learning @ Spotify Friday 6/3 2015


Post on 14-Jul-2015


TRANSCRIPT

Page 1: Big data and machine learning @ Spotify

Oscar Carlsson
Data Engineer
[email protected]

Big Data and Machine Learning @ Spotify

Friday 6/3 2015

Page 2: Big data and machine learning @ Spotify

Me

● D-student starting 2009

● Graduated last year from CSALL (student in this class 2013)

● Master's thesis at Spotify

● Data Engineer at Spotify in Gothenburg

Page 3: Big data and machine learning @ Spotify

Outline

● What is data at Spotify?

● Big data and processing it

● Using data at Spotify

● Machine Learning

Page 4: Big data and machine learning @ Spotify

In the Machine Learning class:

Supervised learning: data (X), labels (Y)

Unsupervised learning: data (X)

Page 5: Big data and machine learning @ Spotify

What is data at Spotify?

● Songs

● Track metadata: cover art, album, genres, mood etc.

● User generated: listens, clicks, page views

● Users: country, email etc.

● Playlists: tracks of playlist, adds/removes

30 Million songs

60 Million Monthly Active Users

58 Markets

15 Million subscribers

1.5 Billion Playlists

Page 6: Big data and machine learning @ Spotify

Outline

● What is data at Spotify?

● Big data and processing it

● Using data at Spotify

● Machine Learning

Page 7: Big data and machine learning @ Spotify

Big Data and processing it

● 20 TB compressed data / day
  ○ 200 TB generated and stored / day (replication)

● Our business is highly dependent on these logs
  ○ We pay artists depending on plays, plays = logs

Too much to store on a single computer. We need a cluster to process it! This is typically what is called “Big Data”.

Page 8: Big data and machine learning @ Spotify

Big Data and processing it

● Distributed computing and storage
  ○ Hadoop
    ■ MapReduce
  ○ Cassandra

● Hadoop cluster
  ○ 1100 nodes
  ○ ~8000 jobs/day
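To make the MapReduce idea concrete, here is a minimal single-machine sketch of counting plays per track from logs. The tab-separated log format (`user_id<TAB>track_id`) and all function names are assumptions for illustration, not Spotify's actual log schema or job code. On a real cluster the framework runs the map and reduce steps in parallel across nodes and does the grouping in between.

```python
from collections import defaultdict


def map_play(log_line):
    """Map step: emit (track_id, 1) for one play log line (hypothetical format)."""
    user_id, track_id = log_line.split("\t")
    yield track_id, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_plays(track_id, counts):
    """Reduce step: sum all the 1s emitted for a track."""
    return track_id, sum(counts)


def count_plays(log_lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for line in log_lines for pair in map_play(line))
    return dict(reduce_plays(key, values) for key, values in shuffle(pairs).items())
```

Calling `count_plays(["u1\tt1", "u2\tt1", "u1\tt2"])` gives two plays for `t1` and one for `t2`.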

Page 9: Big data and machine learning @ Spotify

Outline

● What is data at Spotify?

● Big data and processing it

● Using data at Spotify

● Machine Learning

Page 10: Big data and machine learning @ Spotify

Using data at Spotify

Every part of the company is interested in our data

● Product
  ○ Are people using X? Should we focus on features such as Y?

● Insights
  ○ What music is trending? Which artists are popular where?

● Performance
  ○ How is latency in country Y? Did this reduce stutter in country X?

Page 11: Big data and machine learning @ Spotify

Using data at Spotify

● Data-driven decision making
  ○ Like.. every decision.
  ○ Analysts / Data scientists

● A/B test everything!

● A/B testing:
  ○ Statistical hypothesis testing
  ○ Simple randomized experiment with >= 2 variants (A, B)

Page 12: Big data and machine learning @ Spotify

Using data at Spotify: A/B testing

“The shuffle button”

Objective: Decrease time from loading playlist to first play

Hypothesis: The bigger the button, the faster users find it

Test setup:

● A - variant 1
  ○ 2% of US and SE MAU

● B - variant 2
  ○ 2% of US and SE MAU

● Control - normal
  ○ Rest of users in US and SE
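One common way to get a stable 2% / 2% / rest split is to hash the user id into 100 buckets. The sketch below assumes that approach; the slides do not say how users were actually assigned, and the experiment name and function name are made up for illustration.

```python
import hashlib


def assign_variant(user_id, experiment="shuffle-button"):
    """Deterministically place a user in A (2%), B (2%) or control (rest).

    Hash-based bucketing is an assumption, not the method from the talk.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100  # 0..99
    if bucket < 2:
        return "A"
    if bucket < 4:
        return "B"
    return "control"
```

Because the bucket depends only on the experiment name and the user id, a user sees the same variant on every visit, and a different experiment name reshuffles the groups.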

Page 13: Big data and machine learning @ Spotify

Using data at Spotify: A/B testing

[Screenshots of the three variants: Control, A, B]

Page 14: Big data and machine learning @ Spotify

Analytics: A/B testing

Metric: Share of users with time to first play > 500 ms

(500 ms is made up)

Let's roll out A to all users and throw away B!
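The slides mention statistical hypothesis testing but do not say which test was used. For a metric that is a share of users, a two-proportion z-test is one standard choice; the sketch below assumes it, and the function name is hypothetical.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for the difference between two shares.

    hits_*: users above the threshold, n_*: users in the group.
    Returns (z, p_value).
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With made-up counts such as `two_proportion_z_test(300, 1000, 400, 1000)`, a small p-value means the difference between the groups is unlikely to be chance alone.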

Page 15: Big data and machine learning @ Spotify

Outline

● What is data at Spotify?

● Big data and processing it

● Using data at Spotify

● Machine Learning

Page 16: Big data and machine learning @ Spotify

Outline

● Machine Learning
  ○ User analysis
  ○ Artist disambiguation
  ○ Recommender systems

Page 17: Big data and machine learning @ Spotify

“A music session somehow represents a moment for the user. Can we find these moments and describe them?”

Page 18: Big data and machine learning @ Spotify

Machine Learning: Cluster user music sessions

● Take a subset of user listening data with new genre data
  ○ Combine listens in sessions
    ■ Consecutive plays, no 15 min pause
  ○ Session = [genres]

● Clustering algorithms to find similar sessions
  ○ K-means / Hierarchical clustering

● Describe the clusters using logistic regression
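The session rule above (consecutive plays with no 15 min pause) can be sketched as follows. The input format, a list of `(timestamp_seconds, genre)` pairs for one user, is an assumption, and the gap is measured between play start times for simplicity.

```python
SESSION_GAP_SECONDS = 15 * 60


def split_into_sessions(plays):
    """Split one user's plays into sessions, each a list of genres.

    plays: list of (timestamp_seconds, genre) pairs (hypothetical format).
    A gap of 15 minutes or more between plays starts a new session.
    """
    sessions = []
    last_time = None
    for timestamp, genre in sorted(plays):
        if last_time is None or timestamp - last_time >= SESSION_GAP_SECONDS:
            sessions.append([])
        sessions[-1].append(genre)
        last_time = timestamp
    return sessions
```

Each resulting session is the `[genres]` list that the clustering step takes as input.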

Page 19: Big data and machine learning @ Spotify

Machine Learning: Cluster user music sessions

[Figure panels: K-Means; per cluster classification]
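A minimal K-means sketch in plain Python, assuming each session has already been turned into a fixed-length vector of genre counts. That encoding, the seeded random initialisation and the fixed iteration count are illustration choices, not details from the talk.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(vectors, k, iterations=20, seed=0):
    """Return (assignments, centroids) after a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # start from k distinct data points
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = mean(members)
    return assignments, centroids
```

With vectors of the form `[rock plays, jazz plays]`, rock-heavy and jazz-heavy sessions end up in different clusters.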

Page 20: Big data and machine learning @ Spotify

Machine Learning: Cluster user music sessions

Per cluster logistic regression

w: weight vector

Each w_i can be interpreted as the effect of the x_i variable

x_i = genres
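The formula from the slide is not preserved in the transcript. For reference, the standard logistic regression model in the notation above, written without a bias term for simplicity, is:

```latex
P(y = 1 \mid x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}},
\qquad w^\top x = \sum_i w_i x_i
```

Here y = 1 means the session belongs to the cluster in question. A large positive w_i means that genre x_i pushes the probability towards the cluster, and a negative w_i pushes it away.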

Page 21: Big data and machine learning @ Spotify

Machine Learning: Cluster user music sessions

Clusters described by logistic regression: name of x_i at largest w_i
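Picking the name is a one-liner once the weights are known. In this sketch the weights are given as a dict from genre name to w_i; the function name and the option to return several genres are assumptions for illustration.

```python
def name_cluster(weights_by_genre, top_n=1):
    """Return the top_n genre names with the largest weights, largest first."""
    ranked = sorted(weights_by_genre, key=weights_by_genre.get, reverse=True)
    return ranked[:top_n]
```

For made-up weights `{"rock": 0.2, "jazz": 1.7, "pop": -0.4}` the cluster would be named after `jazz`.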

Page 22: Big data and machine learning @ Spotify

Machine Learning: Cluster user music sessions

Page 23: Big data and machine learning @ Spotify

Machine Learning: Cluster user music sessions

Page 24: Big data and machine learning @ Spotify

Machine Learning

Artist disambiguation

Cleaning up the artist pages

Page 25: Big data and machine learning @ Spotify

Machine Learning: Artist disambiguation

Page 26: Big data and machine learning @ Spotify

Machine Learning: Artist disambiguation

Let's listen to those tracks!

Is it really the same Fredrik?

Page 27: Big data and machine learning @ Spotify

Machine Learning: Artist disambiguation

Page 28: Big data and machine learning @ Spotify

Machine Learning: Artist disambiguation

● Rank artists by probability of being ambiguous

● Apply clustering on each “ambiguous” artist's albums/tracks
  ○ Using features such as country, release year, label/licensor etc.
  ○ Distinct clusters could be different artists

● Nicely present this for manual curation
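The slides do not name the clustering algorithm used here. As a rough sketch, the code below groups one artist's releases with a simple single-linkage rule over three of the listed features. The distance (number of mismatching features), the thresholds and the record fields are all assumptions for illustration.

```python
def mismatch_count(a, b, max_year_gap=10):
    """Number of features on which two releases disagree (0 to 3)."""
    return (
        (a["country"] != b["country"])
        + (a["label"] != b["label"])
        + (abs(a["year"] - b["year"]) > max_year_gap)
    )


def cluster_releases(releases, max_mismatches=1):
    """Single-linkage grouping: a release joins every cluster that
    contains at least one sufficiently similar release."""
    clusters = []
    for release in releases:
        matching, rest = [], []
        for cluster in clusters:
            close = any(
                mismatch_count(release, other) <= max_mismatches
                for other in cluster
            )
            (matching if close else rest).append(cluster)
        # Merge the new release with all clusters it is close to.
        merged = [release] + [r for cluster in matching for r in cluster]
        clusters = rest + [merged]
    return clusters
```

More than one resulting cluster is a hint, not a verdict: the groups would still go to manual curation as described above.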

Page 29: Big data and machine learning @ Spotify

Machine Learning: Recommender system

The discover page

Page 30: Big data and machine learning @ Spotify

Machine Learning: Recommender system

Collaborative filtering

Page 31: Big data and machine learning @ Spotify

Machine Learning: Recommender system

Collaborative filtering

● Build a matrix of user plays

● Compute similarity between items
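A small sketch of those two steps, using sparse dicts instead of a dense matrix. Cosine similarity is one common choice of item similarity and is assumed here; the `(user_id, track_id)` input format is also hypothetical.

```python
import math
from collections import defaultdict


def build_play_matrix(plays):
    """plays: iterable of (user_id, track_id) pairs.

    Returns {track_id: {user_id: play_count}}, one sparse row per track.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for user_id, track_id in plays:
        matrix[track_id][user_id] += 1
    return matrix


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two tracks played by the same users get a similarity close to 1, and tracks with no listeners in common get 0.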

Page 32: Big data and machine learning @ Spotify

Machine Learning: Recommender system

4 Million tracks x 60 Million users
→ Pairwise similarity infeasible
Approximate the matrix with NMF

Page 33: Big data and machine learning @ Spotify

Machine Learning: Recommender system

Matrix factorization (latent factor models)
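As a toy illustration of the factorization idea, the sketch below approximates a small users x tracks play matrix V by W x H with non-negative factors, using the classic multiplicative update rules for squared error. It is plain Python on a tiny dense matrix; it says nothing about how the factorization is run at Spotify's scale, and all names are hypothetical.

```python
import random


def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(column) for column in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between V and W x H."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    )


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor V (users x tracks) into W (users x rank) and H (rank x tracks)."""
    rng = random.Random(seed)
    n_users, n_tracks = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_users)]
    h = [[rng.random() + 0.1 for _ in range(n_tracks)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [
            [h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_tracks)]
            for k in range(rank)
        ]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [
            [w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
            for i in range(n_users)
        ]
    return w, h
```

Each row of W is a small latent vector for a user and each column of H is one for a track, which is what makes the later similarity computations cheap.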

Page 34: Big data and machine learning @ Spotify

Machine Learning: Recommender system

Small vectors

Cosine similarity and dot product are efficient
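With small latent vectors, scoring a track for a user is a single dot product. The sketch below ranks tracks by that score with an exact brute-force scan; the function names and the filtering of already played tracks are assumptions for illustration. A full scan like this is only practical for small catalogues.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def recommend(user_vector, track_vectors, already_played, k=2):
    """Exact top-k unplayed tracks by dot product with the user's vector.

    track_vectors: {track_id: latent vector}.
    """
    candidates = (
        (dot(user_vector, vector), track_id)
        for track_id, vector in track_vectors.items()
        if track_id not in already_played
    )
    return [track_id for _, track_id in heapq.nlargest(k, candidates)]
```

`heapq.nlargest` keeps only the k best candidates in memory while scanning.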

Page 35: Big data and machine learning @ Spotify

Machine Learning: Recommender system

Finding recommendations:
Approximate nearest neighbour (ANN)
code: https://github.com/spotify/annoy

Related artists & Radio:
Similar to user recommendations, more models and not all CF-based

Multiple models:
Score candidates from all models, combine and rank!
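The slides do not say how the scores from different models are combined. One simple possibility is a weighted sum, sketched below; the model names, the weights and the tie-break by candidate id are all made up for illustration.

```python
from collections import defaultdict


def combine_and_rank(scores_by_model, model_weights):
    """Combine per-model candidate scores with a weighted sum and rank them.

    scores_by_model: {model_name: {candidate: score}}.
    A candidate missing from a model simply contributes 0 for that model.
    """
    combined = defaultdict(float)
    for model, scores in scores_by_model.items():
        for candidate, score in scores.items():
            combined[candidate] += model_weights[model] * score
    # Highest combined score first; ties broken by candidate id.
    return sorted(combined, key=lambda c: (-combined[c], c))
```

A candidate that several models agree on can outrank one that a single model scores highly.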

Page 37: Big data and machine learning @ Spotify

.. final words on the future of ML at Spotify

● More content-based ML
  ○ Fingerprinting: Echo Nest
  ○ Content-based music recommendation using convolutional neural networks

● Personalize everything
  ○ Emails
  ○ Ads
  ○ User profiling

● ML on other parts of the product than Rec Sys

Page 38: Big data and machine learning @ Spotify

Summary

● Multiple data sources -> multiple angles

● Data drives decisions with A/B testing

● User analysis
  ○ Cluster and describe with classifier

● Artist disambiguation
  ○ Cluster and give to manual curators

● Recommender systems
  ○ Collaborative filtering

Page 39: Big data and machine learning @ Spotify

.. and potentially you could help us?

● We supervise thesis workers
  ○ Artist disambiguation/deduplication
  ○ Cluster user music sessions
  ○ Context-based recommender systems
  ○ Personalized ads / Personalized emails

● We have internships!

www.spotify.com/jobs