(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

November 13, 2014 | Las Vegas, NV
Roy Ben-Alta (Big Data Analytics Practice Lead, AWS)
Thomas Barthelemy (Software Engineer, Coursera)


Page 1: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

November 13, 2014 | Las Vegas | NV

Roy Ben-Alta (Big Data Analytics Practice Lead, AWS)
Thomas Barthelemy (Software Engineer, Coursera)

Page 2: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

[Diagram: the AWS big data stack across Collect, Store, and Process & Analyze stages — AWS Direct Connect, Amazon Kinesis, AWS Import/Export, Amazon DynamoDB, Amazon S3, Amazon Glacier, Amazon EMR, Amazon Redshift, Amazon EC2.]

Page 3: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

[Diagram: classic ETL — source systems (Billing DB, CRM DB, CSV files, other databases) are Extracted, Transformed, and Loaded into a data warehouse (DWH).]

Page 4: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014
Page 5: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

[Diagram: reference architecture — data feeds arrive in Amazon S3 as the landing zone; AWS Data Pipeline orchestrates Amazon EMR and Amazon EC2; Amazon S3 serves as the data lake; Amazon Redshift, Amazon RDS, and Amazon DynamoDB sit downstream.]

Pages 6–8: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014
Page 9: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-program-pipeline.html

Page 10: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

[Diagram: logs flow into Amazon S3 as the data lake/landing zone, Amazon EMR serves as the ETL grid, Amazon Redshift is the production DWH, and visualization tools sit on top.]

Page 11: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014
Page 12: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Define:
aws datapipeline create-pipeline --name myETL --unique-id token
(output: df-09222142C63VXJU0HC0A)

Import:
aws datapipeline put-pipeline-definition --pipeline-id df-09222142C63VXJU0HC0A --pipeline-definition /home/repo/etl_reinvent.json

Activate:
aws datapipeline activate-pipeline --pipeline-id df-09222142C63VXJU0HC0A
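The same define/import/activate flow could also be driven from a script. A minimal sketch using boto3 (an assumption — the talk shows only the CLI); the object ids, schedule, command, and role names below are placeholders:

import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Define: create an empty pipeline shell (uniqueId guards against duplicate creation)
pipeline_id = dp.create_pipeline(name="myETL", uniqueId="myETL-token")["pipelineId"]

# Import: register a definition as {id, name, fields} objects instead of a JSON file
objects = [
    {"id": "Default", "name": "Default",
     "fields": [{"key": "scheduleType", "stringValue": "cron"},
                {"key": "schedule", "refValue": "DailySchedule"},
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"}]},
    {"id": "DailySchedule", "name": "DailySchedule",
     "fields": [{"key": "type", "stringValue": "Schedule"},
                {"key": "startDateTime", "stringValue": "2014-11-13T00:00:00"},
                {"key": "period", "stringValue": "1 day"}]},
    {"id": "EchoActivity", "name": "EchoActivity",
     "fields": [{"key": "type", "stringValue": "ShellCommandActivity"},
                {"key": "command", "stringValue": "echo hello"},
                {"key": "workerGroup", "stringValue": "my-worker-group"}]},
]
dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)

# Activate: start running on the schedule once the definition validates
dp.activate_pipeline(pipelineId=pipeline_id)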

Pages 13–18: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014
Page 19: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

copy weblogs
from 's3://staging/'
credentials 'aws_access_key_id=<my access key>;aws_secret_access_key=<my secret key>'
delimiter ',';
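A hedged sketch of how a load step might issue this COPY from a script (for example one launched by a ShellCommandActivity); psycopg2, the host name, and the credential handling are assumptions, not details given in the talk:

import psycopg2

# connection details are placeholders
conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="etl_user", password="...")

copy_sql = """
copy weblogs
from 's3://staging/'
credentials 'aws_access_key_id=<my access key>;aws_secret_access_key=<my secret key>'
delimiter ',';
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # COPY runs in a transaction; committed when the block exits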

Pages 20–27: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014
Page 28: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Coursera Big Data Analytics Powered by AWS

Thomas Barthelemy, SWE, Data Infrastructure

[email protected]

Page 29: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Overview

● About Coursera
● Phase 1: Consolidate data
● Phase 2: Get users hooked
● Phase 3: Increase reliability
● Looking forward

Page 30: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Coursera at a Glance

Page 31: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

About Coursera

● Platform for Massive Open Online Courses
● Universities create the content
● Content free to the public

Page 32: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Coursera Stats

● ~10 million learners
● 110+ university partners
● >200 courses open now
● ~170 employees

Page 33: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

The value of data at Coursera

● Making strategic and tactical decisions

● Studying pedagogy

Page 34: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Becoming More Data-Driven

● Since the early days, Coursera has understood the value of data
o Founders came from machine learning
o Many of the early employees researched with the founders

● Cost of data access was high
o each analysis requires extraction, pre-processing
o data only available to data scientists + engineers

Page 35: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Phase 1: Consolidate data

Page 36: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Sources

● MySQL

o Site data

o Course data sharded across multiple databases

● Cassandra increasingly used for course data

● Logged event data

● External APIs

[Diagram: the source systems listed above — SQL, sharded SQL (per-class databases), NoSQL, logs, and external APIs.]

Page 37: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

(Obligatory meme)

Page 38: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

What platforms to use?

● Amazon Redshift had glowing recommendations

● AWS Data Pipeline has native support for various Amazon services

Page 39: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

ETL development was slow :(

● Slow to do the following in the console:

o create one pipeline

o create similar pipelines

o update existing pipelines

Page 40: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Solution: Programmatically create pipelines

● Break ETL into reusable steps

o Extract from variety of sources

o Transform data with Amazon EMR, Amazon EC2, or within Amazon Redshift

o Load data principally into Amazon Redshift

● Use Amazon S3 as intermediate state for Amazon EMR- or Amazon EC2-based transformations

● Map steps to set of AWS Data Pipeline objects

Page 41: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Example step: extract-rds

● Parameters: hostname, database name, SQL

● Creates pipeline objects:

The resulting S3 node can be used by many other step types
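For illustration only, an extract-rds step might expand into AWS Data Pipeline objects roughly like the following; the ids, field values, and the Ec2Instance resource (assumed to be defined elsewhere in the pipeline) are placeholders, not Coursera's actual mapping:

def extract_rds_objects(hostname, database, sql, s3_path):
    # returns the AWS Data Pipeline objects for one extract-rds step
    return [
        {"id": "SourceDatabase", "name": "SourceDatabase",
         "fields": [{"key": "type", "stringValue": "RdsDatabase"},
                    {"key": "rdsInstanceId", "stringValue": hostname},
                    {"key": "databaseName", "stringValue": database},
                    {"key": "username", "stringValue": "etl"},
                    {"key": "*password", "stringValue": "..."}]},
        {"id": "SourceQuery", "name": "SourceQuery",
         "fields": [{"key": "type", "stringValue": "SqlDataNode"},
                    {"key": "database", "refValue": "SourceDatabase"},
                    {"key": "table", "stringValue": "placeholder_table"},
                    {"key": "selectQuery", "stringValue": sql}]},
        {"id": "StagingS3Node", "name": "StagingS3Node",
         "fields": [{"key": "type", "stringValue": "S3DataNode"},
                    {"key": "directoryPath", "stringValue": s3_path}]},
        {"id": "ExtractActivity", "name": "ExtractActivity",
         "fields": [{"key": "type", "stringValue": "CopyActivity"},
                    {"key": "input", "refValue": "SourceQuery"},
                    {"key": "output", "refValue": "StagingS3Node"},
                    {"key": "runsOn", "refValue": "Ec2Instance"}]},  # Ec2Resource defined elsewhere
    ]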

Page 42: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Example load definition

steps:
- type: extract-from-rds
  sql: |
    SELECT instructor_id, course_id, rank
    FROM courses_instructorincourse;
  hostname: maestro-read-replica
  database: maestro
- type: load-into-staging-table
  table: staging.maestro_instructors_sessions
- type: reload-prod-table
  source: staging.maestro_instructors_sessions
  destination: prod.instructors_sessions

(The three steps: extract data from Amazon RDS, load an intermediate table in Amazon Redshift, then reload the target table with the new data.)
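One possible way to expand such a definition (not necessarily Coursera's dataduct implementation) is to map each step type to a builder that emits AWS Data Pipeline objects; PyYAML and the builder names are assumptions:

import yaml

def build_extract_from_rds(step):
    # would emit SqlDataNode/S3DataNode/CopyActivity objects (see the earlier sketch)
    return []

def build_load_into_staging_table(step):
    # would emit a RedshiftDataNode plus a load activity
    return []

def build_reload_prod_table(step):
    # would emit a SqlActivity that swaps the staging table into prod
    return []

BUILDERS = {
    "extract-from-rds": build_extract_from_rds,
    "load-into-staging-table": build_load_into_staging_table,
    "reload-prod-table": build_reload_prod_table,
}

def expand(definition_path):
    with open(definition_path) as f:
        definition = yaml.safe_load(f)
    pipeline_objects = []
    for step in definition["steps"]:
        pipeline_objects.extend(BUILDERS[step["type"]](step))
    return pipeline_objects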

Page 43: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

ETL – Amazon RDS

[Diagram: SQL source → extract to Amazon S3 → load into Amazon Redshift.]

Page 44: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

ETL – Sharded RDS

[Diagram: each SQL shard is extracted to Amazon S3, Amazon EMR transforms and combines the shards via Amazon S3, and the result is loaded into Amazon Redshift.]

Page 45: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

ETL – Logs

[Diagram: logs are extracted by Amazon EC2 into Amazon S3, transformed with Amazon EMR via Amazon S3, and loaded into Amazon Redshift.]

Page 46: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Reporting Model, Dec 2013

Page 47: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Reporting Model, Sep 2014

Page 48: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

AWS Data Pipeline

● Easily handles starting/stopping of resources

● Handles permissions, roles

● Integrates with other AWS services

● Handles “flow” of data, data dependencies

Page 49: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Dealing with large pipelines

● Monolithic pipelines hard to maintain

● Moved to making pipelines smaller

o Hooray modularity!

● If pipeline B depended on pipeline A, just schedule it later

o Add a time buffer just to be safe

Page 50: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Setting cross-pipeline dependencies

● Dependencies are enforced by a script that waits until the upstream pipelines finish

o ShellCommandActivity to the rescue
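As an illustration of such a wait script (the marker scheme, bucket, and timeout are assumptions, not Coursera's actual mechanism), a ShellCommandActivity could poll Amazon S3 for a success marker written by the upstream pipeline:

import sys
import time
import boto3
from botocore.exceptions import ClientError

def wait_for_marker(bucket, key, timeout_s=2 * 3600, poll_s=60):
    s3 = boto3.client("s3")
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            s3.head_object(Bucket=bucket, Key=key)  # marker exists: upstream finished
            return True
        except ClientError:
            time.sleep(poll_s)                      # not there yet; keep polling
    return False

if __name__ == "__main__":
    ok = wait_for_marker("my-etl-state", "pipeline_a/2014-11-13/_SUCCESS")
    sys.exit(0 if ok else 1)  # non-zero exit fails the activity, triggering retries/alerts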

Page 51: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

The beauty of ShellCommandActivity

● You can use it anywhere

o Accomplish tasks that have no corresponding activity type

o Override native AWS Data Pipeline support if it does not meet your needs

Page 52: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

ETL library

● Install on machine as first step of each pipeline

o With ShellCommandActivity

● Allows for even more modularity
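A hypothetical ShellCommandActivity object for that bootstrap step might look like this; the package name, command, and ids are placeholders:

bootstrap_activity = {
    "id": "InstallEtlLibrary", "name": "InstallEtlLibrary",
    "fields": [
        {"key": "type", "stringValue": "ShellCommandActivity"},
        {"key": "command", "stringValue": "pip install --upgrade my-etl-library"},
        {"key": "runsOn", "refValue": "Ec2Instance"},
    ],
}
# Other activities can list {"key": "dependsOn", "refValue": "InstallEtlLibrary"}
# so the library is always in place before they run.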

Page 53: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Phase 2: Get users hooked

Page 54: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

We have data. Now what?

● Simply collecting data will not make a company data-driven

● First step: make data easier for the analysts to use

Page 55: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Certifying a version of the truth

● Data Infrastructure team creates a 3NF model of the data
● Model is designed to be as interpretable as possible

Page 56: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Example: Forums

● Full-word names for database objects

● Join keys made obvious through universally-unique column names

o e.g. session_id joins to session_id

● Auto-generated ER-diagram, so documentation is up-to-date

Page 57: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Some power users need data faster!

● Added new schemas for developers to load data

● Help us remain data-driven during the early phases of…

o Product release

o Analyzing new data

Page 58: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Lightweight SQL interface

● Access via browser

o Users don’t need database credentials or special software

● Data exportable as CSV

● Auto-complete for DB object names

● Took ~2 days to make

Page 59: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Lightweight SQL interface

Page 60: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

But what about people who don’t know SQL?

● Reporting tools to the rescue!

● In-house dashboard framework to support:

o Support

o Growth

o Data Infrastructure

o Many other teams...

● Tableau for interactive analysis

Page 61: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

But what about people who don’t know SQL?

Page 62: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Phase 3: Increase reliability

Page 63: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

If you build it, they will come

● Built ecosystem of data consumers

o Analysts

o Business users

o Product team

Page 64: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Maintaining the ecosystem

● As we accumulate consumers, need to invest effort in...

o Model stability

o Data quality

o Data availability

Page 65: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Persist Amazon Redshift logs to keep track of usage

● Leverage database credentials to understand usage patterns

● Know whom to contact when

o Need to change the schema

o Need to add a new set of tables similar to an existing set (e.g. assessments)
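A sketch of what mining those logs might look like; stl_query and pg_user are Amazon Redshift system tables (STL tables retain only a few days of history, hence the need to persist them), but the connection details and the exact query are illustrative:

import psycopg2

USAGE_SQL = """
SELECT u.usename,
       COUNT(*)         AS queries,
       MAX(q.starttime) AS last_query
FROM stl_query q
JOIN pg_user  u ON u.usesysid = q.userid
GROUP BY u.usename
ORDER BY queries DESC;
"""

conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="admin", password="...")
with conn, conn.cursor() as cur:
    cur.execute(USAGE_SQL)
    for usename, queries, last_query in cur.fetchall():
        print(usename, queries, last_query)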

Page 66: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Data Quality

● Quality of source systems (GIGO)

o Encourage source system to fix issues

● Quality of transformations

o Responsibility of the data infrastructure team

Page 67: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Quality Transformations

● Automated QA checks as part of ETL

o Again, use ShellCommandActivity

● Key factors

o Counts

o Value comparisons

o Modeling rules (primary keys)
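The QA checks could take roughly this shape (table names, the tolerance, and the assertion style are placeholders, not the team's actual implementation):

def check_counts(cur, staging_table, prod_table, tolerance=0.99):
    cur.execute("SELECT COUNT(*) FROM {}".format(staging_table))
    staging_count = cur.fetchone()[0]
    cur.execute("SELECT COUNT(*) FROM {}".format(prod_table))
    prod_count = cur.fetchone()[0]
    # a reload should not shrink the table beyond the tolerance
    assert staging_count >= tolerance * prod_count, "row count dropped unexpectedly"

def check_primary_key(cur, table, pk_column):
    cur.execute("SELECT COUNT(*) - COUNT(DISTINCT {}) FROM {}".format(pk_column, table))
    duplicates = cur.fetchone()[0]
    assert duplicates == 0, "duplicate primary keys in {}".format(table)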

Page 68: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Ensuring High Reliability

● Step retries catch most minor hiccups

● On-call system to recover failed pipelines

● Persist DB logs to keep track of load times

o Delayed loads can also alert on-call devs

o Bonus: users know how fresh the data is

o Bonus: can keep track of success rate

● AWS very helpful in debugging issues

Page 69: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Looking Forward

Page 70: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Current bottlenecks

● Complex ETL

o How do we easily ETL complex, nested structures?

● Developer bottleneck

o How do we allow users to access raw data more quickly?

Page 71: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Amazon S3 as principal data store

● Amazon S3 to store raw data as soon as it’s produced (data lake)

● Amazon EMR to process data

● Amazon Redshift to remain as analytic platform

Page 72: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Thank you!

Page 73: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

Want to learn more?

● github.com/coursera/dataduct

● http://tech.coursera.org

[email protected]

Page 74: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014
Page 75: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

http://blogs.aws.amazon.com/bigdata/

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-schedules.html

Page 76: (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

http://bit.ly/awsevals