TRANSCRIPT

Page 1: Data Engineering at Udemy

Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015

Keeyong Han, Principal Data Architect @Udemy

Page 2: Data Engineering at Udemy

About Me

• 20+ years of experience across 9 different companies
• Currently manage the Data team at Udemy
• Prior to joining Udemy
  – Manager of the data/search team at Polyvore
  – Director of Engineering at Yahoo Search
  – Started my career at Samsung Electronics in Korea

Page 3: Data Engineering at Udemy

Agenda

• Typical Evolution of Data Processing
• Data Engineering at Udemy
• Lessons Learned

Page 4: Data Engineering at Udemy

TYPICAL EVOLUTION OF DATA PROCESSING

From a small start-up

Page 5: Data Engineering at Udemy

In the beginning

• You don't have any data
• So no data infrastructure or data science
  – The most important thing is to survive and to keep iterating

Page 6: Data Engineering at Udemy

After a struggle you have some data

• You survived, and now you have some data to work with
  – Data analysts are hired
  – They want to analyze the data

Page 7: Data Engineering at Udemy

Then …

• You don't know where the data is exactly
• You find your data, but
  – It is not clean and is missing key information
  – It is likely not in the format you want
• You store it in non-optimal storage
  – MySQL is likely used to store all kinds of data
    • But MySQL doesn't scale
  – You ask analysts to query MySQL
    • They will kill the web site a few times

Page 8: Data Engineering at Udemy

Page 9: Data Engineering at Udemy

Now what to do? (I)

• You have to find a scalable, separate storage for data analysis
  – This is called a Data Warehouse (or Data Analytics)
  – This will be the central storage for your important data
  – Udemy uses AWS Redshift
• Migrate some data out of MySQL
  – Key/Value data to a NoSQL solution (Cassandra, HBase, MongoDB, …)
  – Log-type data (for example, the Nginx access log)
  – MySQL should only hold the key data needed by the web service

Page 10: Data Engineering at Udemy

Now what to do? (II)

• The goal is to put all data into a single storage
  – This is the most important and very first step toward becoming a "true" data organization
  – This storage should be separated from runtime storage (MySQL, for example)
  – This storage should be scalable
  – Being consistent is more important than being correct in the beginning

Page 11: Data Engineering at Udemy

Now You Add More Data

• Different ways of collecting data
  – This is called ETL (Extract, Transform and Load)
  – Different aspects to consider:
    • Size: 1KB to 20GB
    • Frequency: hourly, daily, weekly, monthly
    • How to collect: FTP, API, webhook, S3, HTTP, the mysql command line
• You will have multiple data collection workflows
  – Use cron jobs (or some scheduler) to manage them
  – Udemy uses Pinball (open source from Pinterest); a minimal job sketch follows below
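
Each such workflow can start as a single script that cron (or a Pinball job) runs on schedule. A minimal sketch of one collection job, assuming a daily CSV feed over HTTP; the endpoint, fields, and staging path are hypothetical, not Udemy's actual pipeline.

```python
# Minimal sketch of one cron-scheduled collection job, assuming a daily
# CSV feed over HTTP. Endpoint, fields, and staging path are hypothetical.
import csv
import io
import urllib.request

FEED_URL = "https://partner.example.com/export/daily.csv"  # hypothetical

def extract():
    """Download the raw feed as text."""
    with urllib.request.urlopen(FEED_URL) as resp:
        return resp.read().decode("utf-8")

def transform(raw):
    """Parse CSV rows and keep only the columns we load."""
    reader = csv.DictReader(io.StringIO(raw))
    return [(row["user_id"], row["event"], row["ts"]) for row in reader]

def load(records):
    """Stage tab-separated records for a bulk load into the warehouse."""
    with open("/tmp/daily_feed.tsv", "w") as out:
        for rec in records:
            out.write("\t".join(rec) + "\n")

if __name__ == "__main__":
    load(transform(extract()))
```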

Page 12: Data Engineering at Udemy

How It Will Look

[Architecture diagram: your cool web service produces Log Files, MySQL data, and Key/Value data; ETL pipelines load these, plus External Data Sources, into the Data Warehouse]

Page 13: Data Engineering at Udemy

Simple Data Import

• Just use a scripting language
  – Many data sources are small and simple enough for a scripting language
• Udemy uses Python for this purpose
  – Implemented a set of Python classes to handle different types of data import (a sketch of such a class hierarchy follows below)
  – Plan to open source this in the first half of 2016
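
A sketch of the kind of importer class hierarchy described above; the class and method names are illustrative, not Udemy's actual (to-be-open-sourced) code.

```python
# Illustrative importer hierarchy: one importer per source type;
# subclasses only override fetch().
import urllib.request

class BaseImporter:
    """Base class for all data imports."""

    def fetch(self):
        """Yield one record (a list of field values) at a time."""
        raise NotImplementedError

    def rows(self):
        """Render records as tab-separated lines ready for bulk load."""
        for record in self.fetch():
            yield "\t".join(str(v) for v in record)

class HttpCsvImporter(BaseImporter):
    """Importer for small CSV feeds fetched over HTTP."""

    def __init__(self, url):
        self.url = url

    def fetch(self):
        with urllib.request.urlopen(self.url) as resp:
            for line in resp.read().decode("utf-8").splitlines():
                yield line.split(",")
```

Each additional source then needs only a small subclass (FTP, S3, a partner API, and so on), which keeps per-source code to a few lines.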

Page 14: Data Engineering at Udemy

Large Data Batch Import

• Large data import and processing will require a more scalable solution
• Hadoop can be used for this purpose
  – SQL on Hadoop: Hive, Tajo, Presto and so on (see the Hive sketch below)
  – Pig, Java MapReduce
• Spark is getting a lot of attention and we plan to evaluate it
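
When a source outgrows a plain script, the same step can be pushed down to Hadoop. A hedged sketch of a batch aggregation submitted through the hive command-line client; the table, columns, and output path are hypothetical.

```python
# Hedged sketch: running a batch aggregation with Hive from Python via
# the hive CLI. Table, columns, and output path are hypothetical.
import subprocess

QUERY = """
INSERT OVERWRITE DIRECTORY '/warehouse/daily_course_views'
SELECT course_id, COUNT(*) AS views
FROM raw_access_log
WHERE dt = '2015-08-04'
GROUP BY course_id;
"""

# The scheduler (cron/Pinball) simply shells out to Hive.
subprocess.check_call(["hive", "-e", QUERY])
```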

Page 15: Data Engineering at Udemy

Realtime Data Import

• Some data is better imported as it happens
• This requires a different type of technology
  – Realtime message queue: Kafka, Kinesis (a producer sketch follows below)
  – Realtime consumer: Storm, Samza, Spark Streaming
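
To make the queue side concrete, here is a hedged sketch of a producer pushing web-access events into Kinesis with boto3; the stream name and event shape are hypothetical.

```python
# Sketch of pushing web-access events onto a realtime queue with AWS
# Kinesis via boto3; stream name and event shape are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def publish(event):
    kinesis.put_record(
        StreamName="web-access-log",             # hypothetical stream
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),      # keeps one user's events in order
    )

publish({"user_id": 42, "path": "/course/python-101", "ts": 1438732800})
```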

Page 16: Data Engineering at Udemy

What's Next? (I)

• Build summary tables
  – Raw data tables are good, but they can be too detailed and carry too much information
  – Build these tables in your Data Warehouse (see the SQL sketch below)
• Track the performance of key metrics
  – These should come from the summary tables above
  – You need a dashboard tool (build one or use a 3rd-party solution: Birst, Chartio, Tableau and so on)
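
A summary table is typically just a scheduled aggregation query over the raw tables. A minimal sketch against Redshift using psycopg2 (Redshift speaks the Postgres wire protocol); the table and column names are hypothetical.

```python
# Minimal sketch of building a daily summary table in Redshift.
# Table and column names are hypothetical.
import psycopg2

SUMMARY_SQL = """
INSERT INTO daily_enrollment_summary
SELECT TRUNC(enrolled_at) AS day, course_id, COUNT(*) AS enrollments
FROM enrollments
WHERE TRUNC(enrolled_at) = CURRENT_DATE - 1
GROUP BY 1, 2;
"""

conn = psycopg2.connect(host="warehouse.example.com", port=5439,
                        dbname="warehouse", user="etl", password="...")
with conn, conn.cursor() as cur:   # commits on success, rolls back on error
    cur.execute(SUMMARY_SQL)
```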

Page 17: Data Engineering at Udemy

What's Next? (II)

• Provide this data to the Data Science team
  – Draw insights and create a feedback loop
  – Build machine-learned models for recommendation, search ranking and so on
  – The topic for the next session (Thanks Larry!)
• Supporting Data Science from the infrastructure side
  – This will require scalable infrastructure
  – Example: scoring every pair of user/course at Udemy
    • 7M users × 30K courses = 210B pairs to compute
  – You need a scalable serving layer (Cassandra, HBase, …); see the sketch below
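
Once the scoring pipeline has run, the results land in the serving layer. A hedged sketch of writing user/course scores to Cassandra with the DataStax Python driver; keyspace, table, and hosts are hypothetical.

```python
# Hedged sketch: persisting user/course scores to a Cassandra serving
# table. Keyspace, table, and hosts are hypothetical.
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("recommendations")
insert = session.prepare(
    "INSERT INTO course_scores (user_id, course_id, score) VALUES (?, ?, ?)"
)

def store_scores(user_id, scored_courses):
    """scored_courses: iterable of (course_id, score) pairs, e.g. the
    top-N courses per user emitted by the scoring pipeline."""
    for course_id, score in scored_courses:
        session.execute(insert, (user_id, course_id, score))

store_scores(42, [(101, 0.93), (205, 0.87)])
```

In practice you would keep only the top N courses per user rather than materializing all 210B pairs in the serving store.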

Page 18: Data Engineering at Udemy

DATA ENGINEERING AT UDEMY

Page 19: Data Engineering at Udemy

Data Warehouse/Analytics

• We use AWS Redshift as our data warehouse (or data analytics backend)
• What is AWS Redshift?
  – A scalable PostgreSQL-based engine, up to 1.6PB of data
  – Roughly 600 USD per TB per month
  – Mainly for offline batch processing
  – Supports bulk updates (through AWS S3); see the COPY sketch below
  – Two types of node options: compute-optimized vs. storage-optimized
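
The bulk-update path mentioned above is Redshift's COPY command reading from S3. A hedged sketch; the bucket, table, and credentials are placeholders.

```python
# Sketch of the S3 bulk-load path: Redshift's COPY command pulling
# gzipped files from S3. Bucket, table, and credentials are placeholders.
import psycopg2

COPY_SQL = """
COPY web_access_log
FROM 's3://my-etl-bucket/access-log/2015-08-04/'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
GZIP DELIMITER '\\t';
"""

conn = psycopg2.connect(host="warehouse.example.com", port=5439,
                        dbname="warehouse", user="etl", password="...")
with conn, conn.cursor() as cur:
    cur.execute(COPY_SQL)
```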

Page 20: Data Engineering at Udemy

Kinds of Data Stored in Redshift

• 800+ tables with 2.4TB of data
• Key tables from MySQL
• Email marketing data
• Ads campaign performance data
• SEO data from Google
• Data from the web access log
• Support ticket data
• A/B test data (mobile, web)
• Human-curated data from Google Spreadsheets

Page 21: Data Engineering at Udemy

Details on ETL Pipelines

• All data pipelines are scheduled through Pinball
  – Every 5 minutes, hourly, daily, weekly and monthly
• Most pipelines are purely in Python
• Some use Hadoop/Hive and Hadoop/Pig for batch processing
• Starting to use Kinesis for realtime processing

Page 22: Data Engineering at Udemy

Pinball Screenshot

Page 23: Data Engineering at Udemy

Batch Processing Infrastructure

• We use Hadoop 2.6 with Hive and Pig
  – CDH 5.4 (community version)
• We use our own Hadoop cluster and AWS EMR (Elastic MapReduce) at the same time
  – This is used to do ETL on massive data
  – This is also used to run massive model/scoring pipelines from the Data Science team
• Plan to evaluate Spark, potentially as an alternative

Page 24: Data Engineering at Udemy

Realtime Processing

• Applications
  – The first application is to process the web access log
  – Eventually we plan to use this to generate personalized recommendations on-the-fly
• Plan to use AWS Kinesis
  – Evaluated Apache Kafka and AWS Kinesis
    • They are very similar, but Kafka is open source while Kinesis is a managed service from AWS
  – Decided to use AWS Kinesis
• Plan to evaluate realtime consumers
  – Samza, Storm, Spark Streaming

Page 25: Data Engineering at Udemy

What is Kinesis (Kafka)?

• Realtime data processing service in AWS
  – Publisher-subscriber message broker
  – Very similar to Kafka
• It has two components
  – One is the message queue where the stream of data is stored
    • 24-hour retention period
    • Pay hourly by the read/write rate of the queue
  – The other is the KCL (Kinesis Client Library)
    • Use it to build a Data Producer or Data Consumer application (a low-level consumer sketch follows below)
    • It can be combined with Storm, Spark Streaming, …
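
For comparison with the KCL, here is a minimal polling consumer written against the low-level boto3 API; it reads a single shard, whereas the KCL also handles shard discovery and checkpointing. The stream name is hypothetical.

```python
# Minimal polling consumer for one Kinesis shard via boto3.
# The KCL would handle sharding/checkpointing; this is the raw API.
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
shard_it = kinesis.get_shard_iterator(
    StreamName="web-access-log",          # hypothetical stream
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]

while True:
    out = kinesis.get_records(ShardIterator=shard_it, Limit=100)
    for record in out["Records"]:
        print(record["Data"])             # process the event here
    shard_it = out["NextShardIterator"]
    time.sleep(1)                         # stay under per-shard read limits
```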

Page 26: Data Engineering at Udemy

Data Serving Layer

• Redshift isn't a good fit for reading data out in a realtime fashion, so you need something else
• We are using (or plan to use) the following (a Redis read sketch follows below):
  – Cassandra
  – Redis
  – ElasticSearch
  – MySQL
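
As one concrete serving-layer pattern, precomputed results can be read back from Redis at request time. A sketch assuming scores live in one sorted set per user; the host and key layout are hypothetical.

```python
# Sketch of a serving-layer read path: precomputed recommendations
# fetched from Redis at request time; key layout is hypothetical.
import redis

r = redis.StrictRedis(host="cache.example.com", port=6379)

def recommended_courses(user_id, n=10):
    # The batch pipeline writes each user's scores into a sorted set,
    # so a read is a single cheap range query.
    return r.zrevrange("rec:%d" % user_id, 0, n - 1)
```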

Page 27: Data Engineering at Udemy

How It Looks

[Architecture diagram: Udemy produces Log Files (Nginx), MySQL data, and Key/Value data (Cassandra); ETL pipelines load these, plus External Data Sources, into the Data Warehouse (Redshift), which feeds the Data Science Pipeline]

Page 28: Data Engineering at Udemy

LESSONS LEARNED

Page 29: Data Engineering at Udemy

• As a small start-up, survive first and then work on data
• The starting point is to store all data in a single location (the data warehouse)
• Start with batch processing, then realtime
• Consider the type of data you store
  – Log vs. Key/Value vs. Transactional Record
• Store data in the form of a log (change history)
  – So that you can always go back and debug/replay (see the sketch after this list)
• Cloud is good unless you have really massive data
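
A minimal sketch of the change-history idea: append an immutable event per change instead of updating state in place; the schema is illustrative.

```python
# Sketch of "store the change history": append one immutable event per
# change instead of overwriting state. Schema is illustrative.
import json
import time

def record_change(log_path, entity_id, field, old, new):
    event = {"ts": time.time(), "id": entity_id,
             "field": field, "old": old, "new": new}
    with open(log_path, "a") as log:    # append-only: never rewritten
        log.write(json.dumps(event) + "\n")

record_change("/var/log/app/price_changes.log", 101, "price", 49.0, 39.0)
```

Replaying the file from the beginning reconstructs the state at any point in time, which is exactly what makes debugging and backfills possible.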

Page 30: Data Engineering at Udemy

Q & A

Udemy is Hiring