Data at Spotify
Data infrastructure at Spotify
![Page 1: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/1.jpg)
June 12, 2014
Danielle Jabin, Data Engineer, A/B Testing
Data at Spotify
![Page 2: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/2.jpg)
I’m Danielle Jabin
• Data Engineer in the Stockholm office
• A/B testing infrastructure
• California born & raised
• If I can survive a Swedish winter, so can you!
• Studied Computer Science, Statistics, and Real Estate through the M&T program at the University of Pennsylvania
![Page 3: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/3.jpg)
Over 40 million active users
As of June 9, 2014
![Page 4: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/4.jpg)
Access to more than 20 million songs
As of June 9, 2014
![Page 5: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/5.jpg)
Big Data
• 40 million Monthly Active Users
• 20+ million tracks
• 1.5 TB of compressed data from users per day
• 64 TB of data generated in Hadoop each day (including replication factor of 3)
As of June 9, 2014
![Page 6: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/6.jpg)
So how much data is that?
![Page 7: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/7.jpg)
Let’s compare: 64 TB
• 293,203,072 books (200 pages or 240,000 characters each)
• 16,777,216 MP3 files (with 4 MB average file size)
• 22,369,600 images (with 3 MB average file size)
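The comparison figures can be reproduced with a few lines of arithmetic. This is a sketch under assumptions not stated on the slide: binary units (1 TB = 2^40 bytes, 1 MB = 2^20 bytes), one byte per character, and whole items counted per terabyte before scaling to 64 TB. The names `items_in_daily_volume` and `DAILY_TB` are made up for illustration.

```python
TB = 2 ** 40  # assumption: binary terabyte
MB = 2 ** 20  # assumption: binary megabyte
DAILY_TB = 64  # daily Hadoop volume from the slide


def items_in_daily_volume(item_size_bytes):
    """Whole items that fit in one terabyte, scaled to the daily volume."""
    return (TB // item_size_bytes) * DAILY_TB


books = items_in_daily_volume(240_000)  # assumption: 1 byte per character
mp3s = items_in_daily_volume(4 * MB)
images = items_in_daily_volume(3 * MB)

print(books, mp3s, images)
```

Under these assumptions the three results match the numbers on the slide.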
![Page 8: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/8.jpg)
That’s a lot of selfies
![Page 9: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/9.jpg)
How do we use this data?
![Page 10: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/10.jpg)
Use Cases
• Reporting
• Business Analytics
• Operational Analytics
• Product Features
![Page 11: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/11.jpg)
Reporting
• Reporting to labels, licensors, partners, and advertisers
• We support our partners
![Page 12: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/12.jpg)
Business Analytics
• Analyzing growth, user behavior, sign-up funnels, etc.
• Company KPIs
• NPS analysis
![Page 13: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/13.jpg)
Operational Metrics
• Root cause analysis
• Latency analysis
• Better capacity planning (servers, people, bandwidth)
![Page 14: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/14.jpg)
Product Features
• Discover and Radio
• Top lists
• Personalized recommendations
• A/B Testing
![Page 15: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/15.jpg)
How do we collect this data?
![Page 16: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/16.jpg)
The three pillars of our Data Infrastructure:
• Kafka: Collection
• Hadoop: Processing
• Databases: Analytics/Visualization
![Page 17: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/17.jpg)
This is Dave. Data Engineer at Spotify by day…
![Page 18: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/18.jpg)
…chiptune DJ Demoscene Time Machine by night.
![Page 19: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/19.jpg)
Let’s listen to Dave’s song
![Page 20: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/20.jpg)
Kafka
• High volume pub-sub system
• “Producers publish messages to Kafka topics, and consumers subscribe to these topics and consume the messages.”
![Page 21: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/21.jpg)
Kafka
• Robust and scalable solution for collection of logs
• Fast data transfer
• Low CPU overhead
• Built-in partitioning, replication, and fault-tolerance
• Consumers can pull data at different rates
• Able to handle extremely high volumes
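To make the pub-sub idea concrete, here is a toy in-memory sketch of topics, producers, and consumers that pull at their own pace. It is not Kafka's API: the class `InMemoryBroker`, its methods, and the topic and consumer names are all hypothetical, and it leaves out partitioning, replication, and persistence.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a pub-sub broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> list of messages
        self._offsets = defaultdict(int)  # (consumer, topic) -> next index to read

    def publish(self, topic, message):
        """Producer side: append a message to the topic's log."""
        self._logs[topic].append(message)

    def poll(self, consumer, topic, max_messages):
        """Consumer side: pull up to max_messages from this consumer's own offset."""
        start = self._offsets[(consumer, topic)]
        batch = self._logs[topic][start:start + max_messages]
        self._offsets[(consumer, topic)] = start + len(batch)
        return batch


broker = InMemoryBroker()
for track in ["track-a", "track-b", "track-c", "track-d"]:
    broker.publish("plays", track)

# Two consumers read the same topic independently, at different rates.
fast = broker.poll("fast-consumer", "plays", max_messages=10)
slow = broker.poll("slow-consumer", "plays", max_messages=1)
```

Because each consumer keeps its own offset into the log, a slow consumer does not hold back a fast one, which is the property the slide refers to.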
![Page 22: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/22.jpg)
Other people listened too!
![Page 23: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/23.jpg)
Hadoop
• Process and store massive amounts of unstructured data across a distributed cluster
• One cluster, grown from 37 nodes to 690 nodes today
• 28 PB of storage
• The largest Hadoop cluster in Europe
![Page 24: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/24.jpg)
Hadoop
• Entering the land of optimizations
• Data retention policy
• Move to JVM-based languages
• MapReduce languages
• Moving to Crunch, JVM-based, for speed and scalability
• Python with Hadoop Streaming, Java, Hive, Pig, Scala
• Sprunch: Crunch wrapper for Scala, open sourced by Spotify
• Spotify open-sourced scheduler, Luigi, written in Python
• Simple and easy way to chain jobs
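As an illustration of the "Python with Hadoop Streaming" style mentioned above, the sketch below counts plays per track with a mapper and a reducer. The log format (tab-separated `user_id` and `track_id`) and all function names are assumptions made for this example, not Spotify's actual jobs. In a real Hadoop Streaming job the mapper and reducer would be separate scripts reading standard input; here `run_locally` simulates the map, sort, and reduce phases in one process.

```python
from itertools import groupby


def mapper(lines):
    """Emit a tab-separated (track_id, 1) pair for each well-formed log line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2:  # hypothetical format: user_id, track_id
            yield f"{fields[1]}\t1"


def reducer(sorted_lines):
    """Sum counts per key; input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for track_id, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{track_id}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))


log = ["u1\tt42", "u2\tt42", "u1\tt7"]  # made-up sample data
result = run_locally(log)
```

The reducer relies on its input being sorted by key, which the framework's shuffle phase guarantees and which `sorted` stands in for here.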
![Page 25: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/25.jpg)
What if we want to know more?
![Page 26: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/26.jpg)
Databases
• Aggregates from Hadoop put into PostgreSQL or Cassandra
• Sqoop
• Core data can be used and manipulated for various needs
• Ad hoc queries
• Dashboards
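A minimal sketch of the pattern described above: load precomputed aggregates into a relational table, then run an ad hoc query against it. To keep the example self-contained, Python's built-in `sqlite3` stands in for PostgreSQL, and the table name, columns, and rows are invented for illustration. The actual transfer tool named on the slide (Sqoop) is not shown.

```python
import sqlite3

# Made-up daily aggregates, shaped like the output a Hadoop job might produce.
aggregates = [
    ("2014-06-08", "SE", 120),
    ("2014-06-08", "US", 300),
    ("2014-06-09", "SE", 150),
    ("2014-06-09", "US", 310),
]

conn = sqlite3.connect(":memory:")  # assumption: sqlite3 as a PostgreSQL stand-in
conn.execute("CREATE TABLE daily_plays (day TEXT, country TEXT, plays INTEGER)")
conn.executemany("INSERT INTO daily_plays VALUES (?, ?, ?)", aggregates)

# Ad hoc query: total plays per country, largest first.
rows = conn.execute(
    "SELECT country, SUM(plays) AS total FROM daily_plays "
    "GROUP BY country ORDER BY total DESC"
).fetchall()
```

Querying small aggregate tables like this is much cheaper than rerunning a cluster job, which is why dashboards and ad hoc analysis typically sit on the database layer.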
![Page 27: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/27.jpg)
![Page 28: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/28.jpg)
![Page 29: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/29.jpg)
Questions?
![Page 30: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/30.jpg)
A/B testing questions? Find me!
Control
vs
![Page 31: Data at Spotify](https://reader034.vdocuments.mx/reader034/viewer/2022051210/54b7690a4a7959a23c8b4869/html5/thumbnails/31.jpg)
Thank you!