Data Democratization at SoundCloud - Bruno Sá (SoundCloud)

Post on 12-May-2015

DESCRIPTION

SoundCloud is the world’s leading social sound platform where anyone can create sounds and share them everywhere. 200 million people listen to sounds on SoundCloud every month; that is eight percent of the Internet. 12 hours of audio are uploaded to SoundCloud every minute. SoundCloud therefore not only deals with a lot of data (approximately three-digit terabytes) but also embraces the concept of “data democratization,” which means that all data must be available to anyone in the company who needs to access and work with it.

TRANSCRIPT

DATA DEMOCRATIZATION @ SOUNDCLOUD

October 29th, 2013

HI, I’M BRUNO

SOUNDCLOUD IS THE WORLD’S LEADING AUDIO PLATFORM

Every minute, creators upload 12 hrs of audio, reaching over 200m people every month.

8% of the internet

FOO FIGHTERS
SNOOP LION
MADONNA
MACKLEMORE
PRESIDENT OBAMA
JOHN OLIVER (DAILY SHOW/BUGLE)
SKRILLEX

what gets listened to where?

how many new users did we get from that campaign?

how much revenue do we make in Brazil?

how do users use our iOS and Android apps?

what makes a sound successful?

did the product change affect feature x?

do comments on tracks correlate with listens?

what makes an artist successful?

• Avoid Silos

• Remove unnecessary restrictions

• Provide simple tools

• Teach people how to use data

DATA DEMOCRATIZATION

In one sentence:

Deliver the right information to the right person at the right time.

DATA ANALYSIS AND REPORTING

PRODUCTION DB

ANALYTICS DB

2010-2012

Listens, Sounds, Users, Comments, Favorites, Shares, Reposts

Impressions, Clicks, Conversions, Suggestions, Downloads, Taggings, Uploads

Listens

timestamp, duration, sound owner, listener, API-key, (location) country

Additional metadata:
• location within sound
• context (location on site)
• segmentation
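As a rough illustration, the listen fields and the additional metadata listed above could be modelled as one record. This is only a sketch: the class name, field names, types, and the tab-separated serialization are assumptions made for the example, not SoundCloud's actual schema.

```python
from dataclasses import dataclass, astuple
from typing import Optional


@dataclass(frozen=True)
class ListenEvent:
    # Core fields named on the slide (names and types are assumed).
    timestamp: int                      # Unix epoch seconds (assumed encoding)
    duration_ms: int                    # how long the listen lasted
    sound_owner_id: int
    listener_id: Optional[int]          # None for an anonymous listener (assumption)
    api_key: str                        # identifies the client application
    country: str                        # derived from the listener's location
    # Additional metadata from the slide.
    position_ms: Optional[int] = None   # location within the sound
    context: Optional[str] = None       # location on the site
    segment: Optional[str] = None       # segmentation label


def to_tsv(event: ListenEvent) -> str:
    """Serialize one event as a tab-separated line; missing values become empty."""
    return "\t".join("" if v is None else str(v) for v in astuple(event))


# Hypothetical example values.
event = ListenEvent(1383004800, 183000, 42, None, "ios-app", "BR", context="stream")
line = to_tsv(event)
```

A flat, delimited line like this is the kind of shape that bulk loaders generally handle well, which is one reason to keep the record free of nested structures.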

Listening creates >6000 events/s
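To put that rate in perspective, here is a back-of-the-envelope calculation. It assumes the 6000 events/s figure is sustained around the clock, which the slide does not state, so the totals are an order-of-magnitude estimate only.

```python
EVENTS_PER_SECOND = 6_000          # lower bound quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400

# Assumption: the rate holds constant for the whole day.
events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY
events_per_30_days = events_per_day * 30

print(f"{events_per_day:,} events/day")          # 518,400,000 events/day
print(f"{events_per_30_days:,} events/30 days")  # 15,552,000,000 events/30 days
```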

BIG DATA

HADOOP TO THE RESCUE

2 datacenters in AMS

200+ nodes

• listen data
• listen metadata
• search data
• recommender data
• product testing data
• backend production data
• backend logs

HADOOP AND DATA DEMOCRATIZATION

• Data is siloed on Hadoop
• Data governance does not exist
• Technical hurdles for access
• Not real-time
• Slow access

AMAZON REDSHIFT

Fast fully managed DW service

Optimized for datasets of a petabyte or more

Fast query and I/O performance

Columnar storage technology
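A toy illustration of why columnar storage suits analytical queries: an aggregate over one attribute only has to read that attribute's values rather than every full record. The data and function name below are made up for the example, and plain Python lists only stand in for on-disk layout; this says nothing about how Redshift is implemented internally.

```python
# Row-oriented layout: all attributes of a record are stored together.
rows = [
    {"listener": 1, "country": "DE", "duration_s": 120},
    {"listener": 2, "country": "BR", "duration_s": 200},
    {"listener": 3, "country": "DE", "duration_s": 60},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# The aggregate touches a single column instead of every full record.
total_duration_s = sum(columns["duration_s"])
```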

Staging Area

Pig/Ruby Scripts

Amazon EMR

COPY

Pig/Ruby Scripts

Job execution powered by:

2013

BI INFRASTRUCTURE

[Diagram labels: Source Systems (Hadoop, MySQL (production db), External Systems, ...); ETL Scripts; DataWarehouse; ETL Scripts; Data Exploration]

First: Gather data from the several source systems into S3

Source systems: Hadoop, MySQL (production db), External Systems

Full/Daily Imports

MapReduce for:
- Listens
- Plays
- Impressions
- Affiliations
- ...

IMPORT DATA FROM SOURCE SYSTEMS

Second: Rebuild staging area tables for full imports

Staging Area

tracks, users, client applications

...

• Based on configuration files
• CREATE statements generated
• Re-create DISTKEYS and SORTKEYS
• Full control over changes to the data model

YAML config files
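A minimal sketch of how a staging table could be rebuilt from configuration. The config layout, the column names, and the function name are all hypothetical; a plain dict stands in for a parsed YAML file so the example needs no third-party parser. Only the `DISTKEY`/`SORTKEY` table attributes are actual Redshift syntax.

```python
# Stand-in for a parsed YAML config file (structure is an assumption).
table_config = {
    "name": "tracks",
    "columns": [
        ("id", "BIGINT"),
        ("user_id", "BIGINT"),
        ("title", "VARCHAR(255)"),
        ("created_at", "TIMESTAMP"),
    ],
    "distkey": "user_id",
    "sortkeys": ["created_at"],
}


def rebuild_statements(config):
    """Return the DROP and CREATE statements that rebuild one staging table."""
    cols = ",\n".join(f"  {name} {sql_type}" for name, sql_type in config["columns"])
    create = f"CREATE TABLE {config['name']} (\n{cols}\n)"
    if config.get("distkey"):
        create += f"\nDISTKEY({config['distkey']})"
    if config.get("sortkeys"):
        create += f"\nSORTKEY({', '.join(config['sortkeys'])})"
    return [f"DROP TABLE IF EXISTS {config['name']};", create + ";"]


for statement in rebuild_statements(table_config):
    print(statement)
```

Keeping the keys in configuration means a change to the distribution or sort key is a one-line edit followed by a full re-import, rather than a hand-written migration.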

Third: Import the data from S3 to Redshift

Staging Area

tracks, users, client applications

...

Full import: TRUNCATE & COPY

Daily import: COPY
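The two import modes can be sketched as a small statement generator. The function name, bucket path, and IAM role ARN are placeholders, and the file options (tab-delimited, gzip-compressed) are assumptions about the exported files rather than details from the talk.

```python
def import_statements(table, s3_path, iam_role, full_import):
    """Build the SQL for loading one staging table from S3.

    A full import empties the table first; a daily import only appends.
    """
    copy = (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' DELIMITER '\\t' GZIP;"
    )
    if full_import:
        return [f"TRUNCATE {table};", copy]
    return [copy]


# Hypothetical bucket and role, for illustration only.
ROLE = "arn:aws:iam::123456789012:role/example-redshift-load"
full = import_statements("users", "s3://example-bucket/users/", ROLE, full_import=True)
daily = import_statements(
    "listens", "s3://example-bucket/listens/2013-10-29/", ROLE, full_import=False
)
```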

ETL scripts divided into layers:

- Layer 1: Staging -> DW (dimensions)

- Layer 2: Staging -> DW (fact tables - raw data)

- Layer 3: DW -> DW (aggregated fact tables)

- Layer 4: DW -> Reporting Data Cubes (reporting data)
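One way to read the layering is as an execution order: each layer consumes what an earlier layer produced. The sketch below encodes the four layers as data and sorts a set of scripts by layer. The script names are invented, and running strictly layer by layer is an assumption; the talk only says that the scripts are divided into layers.

```python
# Layer number -> (source, target), as listed on the slide.
LAYERS = {
    1: ("staging", "dw dimensions"),
    2: ("staging", "dw fact tables (raw data)"),
    3: ("dw", "dw aggregated fact tables"),
    4: ("dw", "reporting data cubes"),
}


def run_order(scripts):
    """Order (script_name, layer) pairs by layer, keeping input order within a layer."""
    return [name for name, layer in sorted(scripts, key=lambda s: s[1])]


# Hypothetical script names.
scripts = [
    ("agg_listens_daily.sql", 3),
    ("dim_users.rb", 1),
    ("cube_listens_by_country.sql", 4),
    ("fact_listens.rb", 2),
]
order = run_order(scripts)
```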

ETL AND DW DATAMODEL

DataWarehouse

Staging Area

ETL Layer 1 & 2: Data Cleaning, Data Transformation (Ruby/SQL Scripts)

ETL Layer 3: Data Aggregation (Ruby/SQL Scripts)

ETL Layer 4: Data Presentation (SQL)

Data Exploration
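To make the aggregation step (ETL Layer 3) concrete, here is a small sketch that rolls a raw fact table up into a daily aggregate. Table and column names are invented, and Python's built-in `sqlite3` is used purely as a stand-in so the example runs anywhere; the actual scripts in the talk ran against the data warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Raw fact table, as a Layer 2 script might produce it (hypothetical schema).
CREATE TABLE fact_listens (listened_on TEXT, country TEXT, duration_s INTEGER);
INSERT INTO fact_listens VALUES
  ('2013-10-28', 'DE', 120),
  ('2013-10-28', 'DE', 60),
  ('2013-10-28', 'BR', 200),
  ('2013-10-29', 'BR', 30);

-- Layer 3: aggregate the raw facts per day and country.
CREATE TABLE agg_listens_daily AS
SELECT listened_on,
       country,
       COUNT(*)        AS listens,
       SUM(duration_s) AS total_duration_s
FROM fact_listens
GROUP BY listened_on, country;
""")

daily = conn.execute(
    "SELECT * FROM agg_listens_daily ORDER BY listened_on, country"
).fetchall()
```

Reports that only need daily totals can then read the small aggregate table instead of scanning every raw listen.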

JOB SCHEDULE AND EXECUTION

Job-scheduling tool developed internally

Set dependencies between jobs

Execution in multiple machines

Supports all the ETL layers
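The internal scheduling tool is not described further, so the following is not its implementation. It is a minimal sketch of the general idea of dependency-aware scheduling, using Python's standard `graphlib` module: jobs are grouped into batches in which every job's dependencies are already done, so the jobs within one batch could be handed to different machines. Job names are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_batches(dependencies):
    """Group jobs into batches; jobs in the same batch do not depend on each other.

    dependencies maps each job to the set of jobs it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical jobs spanning staging import and ETL layers 1 to 3.
jobs = {
    "dim_users": {"stage_users"},
    "fact_listens": {"stage_listens", "dim_users"},
    "agg_listens_daily": {"fact_listens"},
}
batches = execution_batches(jobs)
```

Here the two staging imports land in the first batch and could run in parallel, while the later jobs each wait for the batch before them.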

DATA EXPLORATION

Simple and fast access to data

More time for “deep dives” into data

Individualized Reporting

Allows interactivity between users

Integrated with Redshift

TIMELINE

Weeks: 2, 4, 6, 8, 10, 12, 14, 16

Analysis Stage (Requirement Analysis)

• Gap Analysis
• Business Exploration (requirements interviews)
• Information Mapping Design
• Solution Design (Draft)

Milestone: End of Analysis Stage

Design & Build

• Define Infrastructure
• Design Data Model

Milestone: Infrastructure Ready!

• Build ETL
• Build Data Cubes
• Design Reports/Dashboards (Presentation Layer)

Milestone: BI 1.0 is built!

Test & Deploy

• System/Integration Tests
• User Acceptance

Milestone: BI 1.0 is tested!

• User Workshops
• BI 1.0 Evaluation

Milestone: BI 1.0 is ready to use!

• Reports designed by end users

• Central repository for data analysis

• User interaction

• Data from one source only

• Scalable solution

• Data to the people!

DATA DEMOCRATIZATION

what gets listened to where?

how many new users did we get from that campaign?

what makes a sound successful?

did the product change affect feature x?

how much revenue do we make in Brazil?

how do users use our iOS and Android apps?

do comments on tracks correlate with listens?

what makes an artist successful?

QUESTIONS?

THANK YOU!

P.S. WE’RE HIRING. SOUNDCLOUD.COM/JOBS
