TRANSCRIPT

Page 1: Data Engineering at Udemy

Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015

Keeyong Han, Principal Data Architect @Udemy

Page 2: Data Engineering at Udemy

About Me

• 20+ years of experience across 9 different companies
• Currently manage the Data team at Udemy
• Prior to joining Udemy
  – Manager of the data/search team at Polyvore
  – Director of Engineering at Yahoo Search
  – Started my career at Samsung Electronics in Korea

Page 3: Data Engineering at Udemy

Agenda

• Typical Evolution of Data Processing
• Data Engineering at Udemy
• Lessons Learned

Page 4: Data Engineering at Udemy

TYPICAL EVOLUTION OF DATA PROCESSING

From a small start-up

Page 5: Data Engineering at Udemy

In the beginning

• You don't have any data
• So no data infrastructure or data science
  – The most important thing is to survive and to keep iterating

Page 6: Data Engineering at Udemy

After a struggle you have some data

• You survived, and now you have some data to work with
  – Data analysts are hired
  – They want to analyze the data

Page 7: Data Engineering at Udemy

Then …

• You don't know where the data is exactly
• You find your data, but
  – It is not clean and is missing key information
  – It is likely not in the format you want
• You store it in non-optimal storage
  – MySQL is likely used to store all kinds of data
    • But MySQL doesn't scale
  – You ask analysts to query MySQL
    • They will kill the web site a few times

Page 8: Data Engineering at Udemy

Page 9: Data Engineering at Udemy

Now what to do? (I)

• You have to find a scalable, separate storage for data analysis
  – This is called a Data Warehouse (or Data Analytics)
  – This will be the central storage for your important data
  – Udemy uses AWS Redshift
• Migrate some data out of MySQL
  – Key/Value data to a NoSQL solution (Cassandra, HBase, MongoDB, …)
  – Log-type data (for example, the Nginx access log)
  – MySQL should only hold the key data needed by the web service

Page 10: Data Engineering at Udemy

Now what to do? (II)

• The goal is to put all data into a single storage
  – This is the most important and very first step toward becoming a "true" data organization
  – This storage should be separated from runtime storage (MySQL, for example)
  – This storage should be scalable
  – Being consistent is more important than being correct in the beginning

Page 11: Data Engineering at Udemy

Now You Add More Data

• Different ways of collecting data
  – This is called ETL (Extract, Transform and Load)
  – Different aspects to consider:
    • Size: 1KB to 20GB
    • Frequency: hourly, daily, weekly, monthly
    • How to collect: FTP, API, webhook, S3, HTTP, the mysql command line
• You will have multiple data collection workflows
  – Use cron jobs (or some scheduler) to manage them
  – Udemy uses Pinball (open source from Pinterest); a minimal job sketch follows below
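
Each such workflow can start as a single script that cron (or a Pinball job) runs on schedule. A minimal sketch of one collection job, assuming a daily CSV feed over HTTP; the endpoint, fields, and staging path are hypothetical, not Udemy's actual pipeline.

```python
# Minimal sketch of one cron-scheduled collection job, assuming a daily
# CSV feed over HTTP. Endpoint, fields, and staging path are hypothetical.
import csv
import io
import urllib.request

FEED_URL = "https://partner.example.com/export/daily.csv"  # hypothetical

def extract():
    """Download the raw feed as text."""
    with urllib.request.urlopen(FEED_URL) as resp:
        return resp.read().decode("utf-8")

def transform(raw):
    """Parse CSV rows and keep only the columns we load."""
    reader = csv.DictReader(io.StringIO(raw))
    return [(row["user_id"], row["event"], row["ts"]) for row in reader]

def load(records):
    """Stage tab-separated records for a bulk load into the warehouse."""
    with open("/tmp/daily_feed.tsv", "w") as out:
        for rec in records:
            out.write("\t".join(rec) + "\n")

if __name__ == "__main__":
    load(transform(extract()))
```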

Page 12: Data Engineering at Udemy

How It Will Look

[Architecture diagram: your cool web service produces Log Files, MySQL data, and Key/Value data; ETL pipelines load these, plus External Data Sources, into the Data Warehouse]

Page 13: Data Engineering at Udemy

Simple Data Import

• Just use a scripting language
  – Many data sources are small and simple enough for a scripting language
• Udemy uses Python for this purpose
  – Implemented a set of Python classes to handle different types of data import (a sketch of such a class hierarchy follows below)
  – Plan to open source this in the first half of 2016
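
A sketch of the kind of importer class hierarchy described above; the class and method names are illustrative, not Udemy's actual (to-be-open-sourced) code.

```python
# Illustrative importer hierarchy: one importer per source type;
# subclasses only override fetch().
import urllib.request

class BaseImporter:
    """Base class for all data imports."""

    def fetch(self):
        """Yield one record (a list of field values) at a time."""
        raise NotImplementedError

    def rows(self):
        """Render records as tab-separated lines ready for bulk load."""
        for record in self.fetch():
            yield "\t".join(str(v) for v in record)

class HttpCsvImporter(BaseImporter):
    """Importer for small CSV feeds fetched over HTTP."""

    def __init__(self, url):
        self.url = url

    def fetch(self):
        with urllib.request.urlopen(self.url) as resp:
            for line in resp.read().decode("utf-8").splitlines():
                yield line.split(",")
```

Each additional source then needs only a small subclass (FTP, S3, a partner API, and so on), which keeps per-source code to a few lines.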

Page 14: Data Engineering at Udemy

Large Data Batch Import

• Large data import and processing will require a more scalable solution
• Hadoop can be used for this purpose
  – SQL on Hadoop: Hive, Tajo, Presto and so on (see the Hive sketch below)
  – Pig, Java MapReduce
• Spark is getting a lot of attention and we plan to evaluate it
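
When a source outgrows a plain script, the same step can be pushed down to Hadoop. A hedged sketch of a batch aggregation submitted through the hive command-line client; the table, columns, and output path are hypothetical.

```python
# Hedged sketch: running a batch aggregation with Hive from Python via
# the hive CLI. Table, columns, and output path are hypothetical.
import subprocess

QUERY = """
INSERT OVERWRITE DIRECTORY '/warehouse/daily_course_views'
SELECT course_id, COUNT(*) AS views
FROM raw_access_log
WHERE dt = '2015-08-04'
GROUP BY course_id;
"""

# The scheduler (cron/Pinball) simply shells out to Hive.
subprocess.check_call(["hive", "-e", QUERY])
```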

Page 15: Data Engineering at Udemy

Realtime Data Import

• Some data is better imported as it happens
• This requires a different type of technology
  – Realtime message queue: Kafka, Kinesis (a producer sketch follows below)
  – Realtime consumer: Storm, Samza, Spark Streaming
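
To make the queue side concrete, here is a hedged sketch of a producer pushing web-access events into Kinesis with boto3; the stream name and event shape are hypothetical.

```python
# Sketch of pushing web-access events onto a realtime queue with AWS
# Kinesis via boto3; stream name and event shape are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def publish(event):
    kinesis.put_record(
        StreamName="web-access-log",             # hypothetical stream
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),      # keeps one user's events in order
    )

publish({"user_id": 42, "path": "/course/python-101", "ts": 1438732800})
```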

Page 16: Data Engineering at Udemy

What's Next? (I)

• Build summary tables
  – Raw data tables are good, but they can be too detailed and carry too much information
  – Build these tables in your Data Warehouse (see the SQL sketch below)
• Track the performance of key metrics
  – These should come from the summary tables above
  – You need a dashboard tool (build one or use a 3rd-party solution: Birst, Chartio, Tableau and so on)
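
A summary table is typically just a scheduled aggregation query over the raw tables. A minimal sketch against Redshift using psycopg2 (Redshift speaks the Postgres wire protocol); the table and column names are hypothetical.

```python
# Minimal sketch of building a daily summary table in Redshift.
# Table and column names are hypothetical.
import psycopg2

SUMMARY_SQL = """
INSERT INTO daily_enrollment_summary
SELECT TRUNC(enrolled_at) AS day, course_id, COUNT(*) AS enrollments
FROM enrollments
WHERE TRUNC(enrolled_at) = CURRENT_DATE - 1
GROUP BY 1, 2;
"""

conn = psycopg2.connect(host="warehouse.example.com", port=5439,
                        dbname="warehouse", user="etl", password="...")
with conn, conn.cursor() as cur:   # commits on success, rolls back on error
    cur.execute(SUMMARY_SQL)
```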

Page 17: Data Engineering at Udemy

What's Next? (II)

• Provide this data to the Data Science team
  – Draw insights and create a feedback loop
  – Build machine-learned models for recommendation, search ranking and so on
  – The topic for the next session (Thanks Larry!)
• Supporting Data Science from the infrastructure side
  – This will require scalable infrastructure
  – Example: scoring every pair of user/course at Udemy
    • 7M users × 30K courses = 210B pairs to compute
  – You need a scalable serving layer (Cassandra, HBase, …); see the sketch below
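
Once the scoring pipeline has run, the results land in the serving layer. A hedged sketch of writing user/course scores to Cassandra with the DataStax Python driver; keyspace, table, and hosts are hypothetical.

```python
# Hedged sketch: persisting user/course scores to a Cassandra serving
# table. Keyspace, table, and hosts are hypothetical.
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("recommendations")
insert = session.prepare(
    "INSERT INTO course_scores (user_id, course_id, score) VALUES (?, ?, ?)"
)

def store_scores(user_id, scored_courses):
    """scored_courses: iterable of (course_id, score) pairs, e.g. the
    top-N courses per user emitted by the scoring pipeline."""
    for course_id, score in scored_courses:
        session.execute(insert, (user_id, course_id, score))

store_scores(42, [(101, 0.93), (205, 0.87)])
```

In practice you would keep only the top N courses per user rather than materializing all 210B pairs in the serving store.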

Page 18: Data Engineering at Udemy

DATA ENGINEERING AT UDEMY

Page 19: Data Engineering at Udemy

Data Warehouse/Analytics

• We use AWS Redshift as our data warehouse (or data analytics backend)
• What is AWS Redshift?
  – A scalable PostgreSQL-based engine, up to 1.6PB of data
  – Roughly 600 USD per TB per month
  – Mainly for offline batch processing
  – Supports bulk updates (through AWS S3); see the COPY sketch below
  – Two types of node options: compute-optimized vs. storage-optimized
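
The bulk-update path mentioned above is Redshift's COPY command reading from S3. A hedged sketch; the bucket, table, and credentials are placeholders.

```python
# Sketch of the S3 bulk-load path: Redshift's COPY command pulling
# gzipped files from S3. Bucket, table, and credentials are placeholders.
import psycopg2

COPY_SQL = """
COPY web_access_log
FROM 's3://my-etl-bucket/access-log/2015-08-04/'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
GZIP DELIMITER '\\t';
"""

conn = psycopg2.connect(host="warehouse.example.com", port=5439,
                        dbname="warehouse", user="etl", password="...")
with conn, conn.cursor() as cur:
    cur.execute(COPY_SQL)
```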

Page 20: Data Engineering at Udemy

Kinds of Data Stored in Redshift

• 800+ tables with 2.4TB of data
• Key tables from MySQL
• Email marketing data
• Ads campaign performance data
• SEO data from Google
• Data from the web access log
• Support ticket data
• A/B test data (mobile, web)
• Human-curated data from Google Spreadsheets

Page 21: Data Engineering at Udemy

Details on ETL Pipelines

• All data pipelines are scheduled through Pinball
  – Every 5 minutes, hourly, daily, weekly and monthly
• Most pipelines are purely in Python
• Some use Hadoop/Hive and Hadoop/Pig for batch processing
• Starting to use Kinesis for realtime processing

Page 22: Data Engineering at Udemy

Pinball Screenshot

Page 23: Data Engineering at Udemy

Batch Processing Infrastructure

• We use Hadoop 2.6 with Hive and Pig
  – CDH 5.4 (community version)
• We use our own Hadoop cluster and AWS EMR (Elastic MapReduce) at the same time
  – This is used to do ETL on massive data
  – This is also used to run massive model/scoring pipelines from the Data Science team
• Plan to evaluate Spark, potentially as an alternative

Page 24: Data Engineering at Udemy

Realtime Processing

• Applications
  – The first application is to process the web access log
  – Eventually we plan to use this to generate personalized recommendations on-the-fly
• Plan to use AWS Kinesis
  – Evaluated Apache Kafka and AWS Kinesis
    • They are very similar, but Kafka is open source while Kinesis is a managed service from AWS
  – Decided to use AWS Kinesis
• Plan to evaluate realtime consumers
  – Samza, Storm, Spark Streaming

Page 25: Data Engineering at Udemy

What is Kinesis (Kafka)?

• Realtime data processing service in AWS
  – Publisher-subscriber message broker
  – Very similar to Kafka
• It has two components
  – One is the message queue where the stream of data is stored
    • 24-hour retention period
    • Pay hourly by the read/write rate of the queue
  – The other is the KCL (Kinesis Client Library)
    • Use it to build a Data Producer or Data Consumer application (a low-level consumer sketch follows below)
    • It can be combined with Storm, Spark Streaming, …
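
For comparison with the KCL, here is a minimal polling consumer written against the low-level boto3 API; it reads a single shard, whereas the KCL also handles shard discovery and checkpointing. The stream name is hypothetical.

```python
# Minimal polling consumer for one Kinesis shard via boto3.
# The KCL would handle sharding/checkpointing; this is the raw API.
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
shard_it = kinesis.get_shard_iterator(
    StreamName="web-access-log",          # hypothetical stream
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]

while True:
    out = kinesis.get_records(ShardIterator=shard_it, Limit=100)
    for record in out["Records"]:
        print(record["Data"])             # process the event here
    shard_it = out["NextShardIterator"]
    time.sleep(1)                         # stay under per-shard read limits
```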

Page 26: Data Engineering at Udemy

Data Serving Layer

• Redshift isn't a good fit for reading data out in a realtime fashion, so you need something else
• We are using (or plan to use) the following (a Redis read sketch follows below):
  – Cassandra
  – Redis
  – ElasticSearch
  – MySQL
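
As one concrete serving-layer pattern, precomputed results can be read back from Redis at request time. A sketch assuming scores live in one sorted set per user; the host and key layout are hypothetical.

```python
# Sketch of a serving-layer read path: precomputed recommendations
# fetched from Redis at request time; key layout is hypothetical.
import redis

r = redis.StrictRedis(host="cache.example.com", port=6379)

def recommended_courses(user_id, n=10):
    # The batch pipeline writes each user's scores into a sorted set,
    # so a read is a single cheap range query.
    return r.zrevrange("rec:%d" % user_id, 0, n - 1)
```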

Page 27: Data Engineering at Udemy

How It Looks

[Architecture diagram: Udemy produces Log Files (Nginx), MySQL data, and Key/Value data (Cassandra); ETL pipelines load these, plus External Data Sources, into the Data Warehouse (Redshift), which feeds the Data Science Pipeline]

Page 28: Data Engineering at Udemy

LESSONS LEARNED

Page 29: Data Engineering at Udemy

• As a small start-up, survive first and then work on data
• The starting point is to store all data in a single location (the data warehouse)
• Start with batch processing, then realtime
• Consider the type of data you store
  – Log vs. Key/Value vs. Transactional Record
• Store data in the form of a log (change history)
  – So that you can always go back and debug/replay (see the sketch after this list)
• Cloud is good unless you have really massive data
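
A minimal sketch of the change-history idea: append an immutable event per change instead of updating state in place; the schema is illustrative.

```python
# Sketch of "store the change history": append one immutable event per
# change instead of overwriting state. Schema is illustrative.
import json
import time

def record_change(log_path, entity_id, field, old, new):
    event = {"ts": time.time(), "id": entity_id,
             "field": field, "old": old, "new": new}
    with open(log_path, "a") as log:    # append-only: never rewritten
        log.write(json.dumps(event) + "\n")

record_change("/var/log/app/price_changes.log", 101, "price", 49.0, 39.0)
```

Replaying the file from the beginning reconstructs the state at any point in time, which is exactly what makes debugging and backfills possible.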

Page 30: Data Engineering at Udemy

Q & A

Udemy is Hiring