AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Greg Brandt, Liyin Tang (Airbnb) December 2, 2016 Streaming ETL For Amazon RDS and Amazon DynamoDB DAT315

Upload: amazon-web-services

Post on 06-Jan-2017


TRANSCRIPT

Page 1: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Greg Brandt, Liyin Tang (Airbnb)

December 2, 2016

Streaming ETL for Amazon RDS and Amazon DynamoDB

DAT315

Page 2: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

What to Expect from the Session

• Database Change Data Capture (CDC)

• Improving ETL to Data Warehouse

Page 3: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

SpinalTap (CDC)

Page 4: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Architectural Evolution

From monolithic Rails app

To many specialized services/data stores

Page 5: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

New Challenges

• Co-processing logic breaks down out of process/transaction context

• Primary tables/indices on many machines, not single RDBMS

• Specialized systems needed for certain use cases (analytics, search, etc.)

Page 6: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Architectural Tenets

• Build for production

• Plan for the future, build for today

• Prefer existing solutions and patterns that we have experience with in production

• Services should own their data and not share their storage

• Mutations to data should be propagated via standardized events

Page 7: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Change Data Capture (CDC)

Goal: Provide streams of data mutations

• In near real time

• With timeline consistency

To keep all these systems in sync

Page 8: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Option 1: Application-Driven Dual Writes

• Consistency hard

  • (2PC/consensus needed)

• Data model easy

  • (Schema controlled by application)

• Development easy

  • Use a queue (e.g. Kafka, RabbitMQ) in addition to the RDBMS

Page 9: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Option 2: Database Log Mining

• Consistency easy

  • (Leverage commit log semantics)

• Parsing/Data model hard

  • (Database’s internal commit log)

Page 10: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

We Chose Database Log Mining

• Parsing is easier than consensus

• Many libraries/APIs exist to make parsing easy

• Consuming stream of commits gives timeline consistency by default

Page 11: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Data Ecosystem

Page 12: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Requirements

• Timeline consistency with at-least-once message delivery

• Easily add new sources to consume (new machines if necessary)

• Support low latency and high throughput use cases

• High availability with automatic failover

• Heterogeneous data sources (MySQL, Amazon DynamoDB)

Page 13: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

MySQL Commit Log

• Java library for binary log parsing

  • https://github.com/shyiko/mysql-binlog-connector-java/

• Emit mutation events

  • (Write_rows, Update_rows, Delete_rows)

• Logical clock determined from binlog file/offset

  • (Single-master, Multi-AZ setup)

• Leverage XidEvent for transaction boundary metadata/checkpointing

  • (InnoDB implementation detail)

Page 14: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

DynamoDB Streams

• Using DynamoDB Streams Kinesis Adapter

• Guarantees

  • Each stream record appears exactly once in the stream

  • Stream records appear in the same sequence as the actual modifications to the item

• Monotonically increasing logical clock is hard

  • Need to incorporate shard id, parent/child splitting semantics

  • SequenceNumber is not global

Page 15: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Abstract Mutation

• Provide monotonically increasing* id from logical clock

• Source-specific metadata (e.g. MySQL binlog filename/offset)

• The beforeImage of the row in DB (possibly null)

• The afterImage of the row in DB (possibly null)

• Encode this using source-agnostic format (e.g. Thrift)

• Write this object to message bus (e.g. Kafka)

{
  id: Long,
  opCode: [INSERT, UPDATE, DELETE],
  metadata: Map<String, String>,
  beforeImage: Record,
  afterImage: Record
}
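The record above can be sketched as a small Python type. This is only an illustration of the slide's fields, not SpinalTap's actual Thrift definition: the `Record` alias, the snake_case field names, and the sample metadata keys are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional

# Hypothetical stand-in for a decoded row: column name -> value.
Record = Dict[str, object]


class OpCode(Enum):
    INSERT = "INSERT"
    UPDATE = "UPDATE"
    DELETE = "DELETE"


@dataclass(frozen=True)
class Mutation:
    id: int                                   # monotonically increasing id from the logical clock
    op_code: OpCode
    metadata: Dict[str, str] = field(default_factory=dict)  # source-specific details
    before_image: Optional[Record] = None     # row before the change (possibly null)
    after_image: Optional[Record] = None      # row after the change (possibly null)


# Example: an UPDATE that changes one column of a row.
update = Mutation(
    id=42,
    op_code=OpCode.UPDATE,
    metadata={"binlog_file": "mysql-bin.00000", "binlog_pos": "101"},  # assumed key names
    before_image={"id": 101, "city": "New York"},
    after_image={"id": 101, "city": "San Francisco"},
)
```

Keeping the source-specific position in `metadata` rather than in typed fields is what lets the same envelope carry both MySQL and DynamoDB changes.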

Page 16: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Clustering/Configuration

• LEADER/STANDBY state model

• Each machine is LEADER for a subset of sources

• Workload distributed evenly

• Use ZooKeeper-based Apache Helix framework for cluster management

• http://helix.apache.org/

• Dynamic source configuration changes

• Helix Instance group tags to separate MySQL/DynamoDB nodes

Page 17: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Fault Tolerance

• Controller handles node failure/elects new LEADER for sources

• Maintain leader_epoch counter in Helix ZooKeeper property store

• Prefix generated ids with leader_epoch for monotonicity

• E.g. (leader_epoch, binlog_file, binlog_pos)
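A minimal sketch of why the leader_epoch prefix helps, assuming ids are compared as tuples and that a newly elected LEADER may resume from an earlier binlog position than the old one had reached. The `MutationId` type and the helper functions are hypothetical; how the tuple is packed into the single Long id shown earlier is not covered here.

```python
from typing import NamedTuple


class MutationId(NamedTuple):
    leader_epoch: int  # bumped each time a new LEADER is elected for the source
    binlog_seq: int    # numeric suffix of the binlog file name
    binlog_pos: int    # offset within that file


def parse_binlog_seq(binlog_file: str) -> int:
    """Extract the numeric suffix, e.g. 'mysql-bin.00007' -> 7."""
    return int(binlog_file.rsplit(".", 1)[1])


def make_id(leader_epoch: int, binlog_file: str, binlog_pos: int) -> MutationId:
    return MutationId(leader_epoch, parse_binlog_seq(binlog_file), binlog_pos)


# The old leader (epoch 1) had reached offset 5000 in file 00007.
before_failover = make_id(1, "mysql-bin.00007", 5000)
# The new leader (epoch 2) resumes from an earlier checkpoint in the same file.
after_failover = make_id(2, "mysql-bin.00007", 1200)

# Without the epoch the position moved backwards; with it the id still grows.
assert after_failover[1:] < before_failover[1:]
assert after_failover > before_failover
```

Parsing the file suffix to an integer avoids relying on string comparison of file names, which only orders correctly when the suffixes are zero-padded to the same width.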

Page 18: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Pub/Sub

• Produce mutations to Kafka with durable configuration*

• Async coprocessors consume messages, produce new streams

• Model streaming library allows encapsulation of DB table schema

  • Service controls both API endpoint and streaming view of data

• Keep 24 hours of MySQL binlog

  • Alert / rewind on failures in this tier

Page 19: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Online Validation

• Download binlog after it is flushed/immutable

• Check for holes/ordering violations by consuming stream from Kafka

• Allows us to maintain low latency with confidence in consistency of stream

• Auto-healing

  • Reset binlog position to earlier if too many failures
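The hole and ordering checks above might look roughly like the sketch below. It assumes positions are `(binlog_file_seq, offset)` tuples, that the flushed binlog yields the expected positions in order, and that duplicates in the stream are acceptable because delivery is at-least-once. The function name and return shape are made up for the example.

```python
from typing import Iterable, List, Tuple

Position = Tuple[int, int]  # (binlog file sequence number, offset); assumed encoding


def validate_stream(
    expected: Iterable[Position], streamed: Iterable[Position]
) -> Tuple[List[Position], List[Position]]:
    """Return (holes, ordering_violations) for one flushed binlog file.

    holes: positions present in the binlog but never seen in the stream.
    ordering_violations: positions first seen after a larger position.
    """
    seen = set()
    max_seen = None
    violations = []
    for pos in streamed:
        if pos in seen:
            continue  # redelivery of something already seen; tolerated
        if max_seen is not None and pos < max_seen:
            violations.append(pos)
        seen.add(pos)
        max_seen = pos if max_seen is None else max(max_seen, pos)
    holes = [pos for pos in expected if pos not in seen]
    return holes, violations


binlog = [(0, 100), (0, 101), (0, 102), (0, 103)]
# One duplicate, and 102 arrives after 103.
holes, violations = validate_stream(binlog, [(0, 100), (0, 101), (0, 101), (0, 103), (0, 102)])
```

A caller could count non-empty results per source and trigger the auto-healing reset once the count crosses a threshold; that policy is left out of the sketch.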

Page 20: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Production Lessons

• Need schema history store for regions of commit log to support rewind

  • E.g. write DDL to commit log, apply to local MySQL while processing stream to obtain range/schema mapping

• Be careful about table encodings! (latin1, utf8...)

• request.required.acks = all can potentially hit every broker…

  • (Group produce requests by broker to avoid hitting too many)

• Per-source produce buffer size

  • (Tune for throughput/latency)
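To illustrate the range/schema mapping from the first lesson, here is a small in-memory sketch. It swaps the local MySQL instance mentioned on the slide for a sorted list searched with `bisect`, so it shows only the lookup idea, not the speakers' implementation. The `SchemaHistory` class and the tuple-of-column-names schema representation are assumptions.

```python
import bisect
from typing import List, Tuple

Position = Tuple[int, int]  # (binlog file sequence number, offset); assumed encoding
Schema = Tuple[str, ...]    # column names only, for illustration


class SchemaHistory:
    """Maps regions of the commit log to the table schema in effect there."""

    def __init__(self) -> None:
        self._positions: List[Position] = []  # where each DDL took effect, ascending
        self._schemas: List[Schema] = []

    def record_ddl(self, position: Position, schema: Schema) -> None:
        # DDL is observed in binlog order, so positions must keep increasing.
        if self._positions and position <= self._positions[-1]:
            raise ValueError("DDL positions must be strictly increasing")
        self._positions.append(position)
        self._schemas.append(schema)

    def schema_at(self, position: Position) -> Schema:
        # Find the last DDL at or before the requested position.
        i = bisect.bisect_right(self._positions, position) - 1
        if i < 0:
            raise LookupError("no schema recorded at or before this position")
        return self._schemas[i]


history = SchemaHistory()
history.record_ddl((0, 0), ("id",))
history.record_ddl((0, 500), ("id", "city"))
```

After a rewind to an older position, `history.schema_at(position)` returns the columns that were valid for that region of the log, so old row events are decoded with the schema they were written under.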

Page 21: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Data Ecosystem

Page 22: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Streaming DB Exports

Page 23: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Batch Infrastructure

[Diagram labels: Airflow Scheduling; Events Log; DB Mutation; RDS; EC2; Batch Ingestion; Gold; Silver; Query Engines: Hive/Presto/Spark]

Page 24: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Growing Pain

[Diagram labels: Airflow Scheduling; Events Log; DB Mutation; RDS; EC2; Batch Ingestion; Gold; Silver; Query Engines: Hive/Presto/Spark]

Page 25: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Point-in-Time Restore based DB Export

• Pros:

• Simple

• Especially for schema change

• Consistent

• Cons:

• No SLA for RDS PITR restoration time

• No near real time ad hoc query

• No hourly snapshot

• High storage cost

Page 26: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Overviews

Page 27: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Real-Time Ingestion on HBase

[Diagram labels: RDS; SpinalTap; Spark Streaming; HBase; HDFS; snapshot; Real time query; Batch query; Query Engines: Hive/Presto/Spark]

Page 28: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Access Data in HBase

[Diagram labels: HBase; HDFS; snapshot; Unified view on real time data; Streaming: Spark; Interactive Query: Presto; Batch Job: Hive/Spark]

Page 29: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Snapshot & Reseed

[Diagram labels: HBase; HDFS; Snapshot (HFile Links); Bulk upload (Reseed)]

Page 30: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Onboard New Tables

[Diagram labels: RDS; HBase; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest]

Page 31: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Disaster Recovery - Checkpoint

[Diagram labels: RDS; HBase; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest]

Page 32: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Disaster Recovery - Rewind

[Diagram labels: RDS; HBase; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest]

Page 33: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Disaster Recovery - Reseed

[Diagram labels: RDS; HBase; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest]

Page 34: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

HBase Schema

Page 35: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Key Space Design

• Multiplex all DB tables on Single HBase Table

• Fast point look up based on primary keys

• Efficient sequential scans for one table

• Load balance

Page 36: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

HBase Row Keys – Primary Keys

• Hash Key = md5(DB_TABLE, PK1=v1, PK2=v2)

• Row Key = Hash Key + DB_TABLE + PK1=v1 + PK2=v2

• Fast point lookup based on primary keys

• Efficient sequential scan for all the keys in same DB/Table

• Balanced based on hash key

Row key layout: Hash | DB_TABLE | PK1=v1 | PK2=v2
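The primary-key row key layout above can be sketched as follows. The slide does not say how the components are delimited or whether the md5 digest is truncated, so the `|` separator, the full hex digest, and the function name are assumptions for illustration.

```python
import hashlib
from typing import Iterable, Tuple


def primary_row_key(db_table: str, primary_key: Iterable[Tuple[str, object]]) -> str:
    """Build Hash Key + DB_TABLE + PK1=v1 + PK2=v2 as one string.

    primary_key is an ordered sequence of (column, value) pairs.
    """
    logical = db_table + "".join(f"|{col}={val}" for col, val in primary_key)
    # Assumed: hex md5 of the logical key, used as a load-balancing prefix.
    hash_key = hashlib.md5(logical.encode("utf-8")).hexdigest()
    return f"{hash_key}|{logical}"


key = primary_row_key("airbed.listings", [("id", 101), ("host_id", 7)])
```

Because the hash is computed from the table name and primary key values alone, a reader who knows those values can rebuild the exact row key and do a point lookup without scanning.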

Page 37: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

HBase Row Keys – Secondary Keys

• Hash Key = md5(DB_TABLE, Index_1=v1)

• Row Key = Hash Key + DB_TABLE + Index_1=v1 + PK1=vpk1

• Prefix scan for a given secondary index

Row key layout: Hash | DB_TABLE | Index=v1 | PK1=vpk1
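A sketch of the secondary-index key and the prefix scan it enables, using a sorted Python list in place of HBase's lexicographically ordered row keys. The `|` separator, the full hex md5 digest, and all function names are assumptions; values containing the separator would need escaping, which is omitted.

```python
import bisect
import hashlib
from typing import List


def index_prefix(db_table: str, index_col: str, index_val: object) -> str:
    """Hash Key + DB_TABLE + Index=v, shared by every row with this index value."""
    logical = f"{db_table}|{index_col}={index_val}"
    hash_key = hashlib.md5(logical.encode("utf-8")).hexdigest()
    # The trailing separator stops 'city=SF' from also matching 'city=SFO'.
    return f"{hash_key}|{logical}|"


def index_row_key(db_table: str, index_col: str, index_val: object,
                  pk_col: str, pk_val: object) -> str:
    return index_prefix(db_table, index_col, index_val) + f"{pk_col}={pk_val}"


def prefix_scan(sorted_keys: List[str], prefix: str) -> List[str]:
    """Emulate a prefix scan: seek to the prefix, read until it stops matching."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for row_key in sorted_keys[start:]:
        if not row_key.startswith(prefix):
            break
        matches.append(row_key)
    return matches


table = sorted([
    index_row_key("db.listings", "city", "SF", "id", 1),
    index_row_key("db.listings", "city", "SF", "id", 2),
    index_row_key("db.listings", "city", "NY", "id", 3),
    index_row_key("db.listings", "city", "SFO", "id", 4),
])
sf_rows = prefix_scan(table, index_prefix("db.listings", "city", "SF"))
```

Since the hash covers only the table and index value, all rows sharing that value sort next to each other, and the primary key suffix keeps each row key unique.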

Page 38: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

HBase Versioning

Rows | CF: Columns | Version | Value
<ShardKey><DB_TABLE_#1><PK_a=A> | id | Fri May 19 00:33:19 2016 | 101
<ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 19 00:33:19 2016 | San Francisco
<ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 10 00:34:19 2016 | New York
<ShardKey><DB_TABLE_#2><PK_a=A’> | id | Fri May 19 00:33:19 2016 | 1

Page 39: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Version by Timestamp

Binlog Order:

TXN 1 (COMMIT_TS: 101)

TXN 2 (COMMIT_TS: 102)

TXN 3 (COMMIT_TS: 103)

…

TXN N (COMMIT_TS: N’)

Page 40: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Version by Timestamp

Binlog Order:

TXN 1 (COMMIT_TS: T1)

TXN 2 (COMMIT_TS: T3)

TXN 3 (COMMIT_TS: T2)

…

TXN N (COMMIT_TS: N’)

Binlog positions: mysql-bin.00000:100, mysql-bin.00000:101, mysql-bin.00000:102, mysql-bin.00000:N

[Diagram label: NTP]

Page 41: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

HBase Versioning

Rows | CF: Columns | Version | Commit TS
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:100 | T0
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:101 | T1
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:102 | T3
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:103 | T2
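The table above shows commit timestamps that do not follow binlog order (T3 is recorded before T2). A small sketch of the consequence, using made-up cell values and integer stand-ins for T0..T3: picking the newest cell by commit timestamp returns a stale value, while picking by binlog position returns the last write. How the binlog position is encoded into an HBase version number is not covered here.

```python
from typing import List, Tuple

# (binlog position as (file_seq, offset), commit timestamp, cell value)
Cell = Tuple[Tuple[int, int], int, str]

cells: List[Cell] = [
    ((0, 100), 0, "v0"),  # T0
    ((0, 101), 1, "v1"),  # T1
    ((0, 102), 3, "v2"),  # T3: later clock reading, earlier in the binlog
    ((0, 103), 2, "v3"),  # T2: earlier clock reading, last in the binlog
]


def latest_by_commit_ts(versions: List[Cell]) -> str:
    """Version cells by commit timestamp (breaks when clocks disagree with binlog order)."""
    return max(versions, key=lambda cell: cell[1])[2]


def latest_by_binlog_position(versions: List[Cell]) -> str:
    """Version cells by binlog position, which follows commit order in the log."""
    return max(versions, key=lambda cell: cell[0])[2]


stale = latest_by_commit_ts(cells)        # "v2", not the last write
fresh = latest_by_binlog_position(cells)  # "v3", the last write in binlog order
```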

Page 42: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

PITR Semantics

Binlog Order:

TXN 1 (COMMIT_TS: 101)

TXN 2 (COMMIT_TS: 103)

TXN 3 (COMMIT_TS: 102)

…

TXN N (COMMIT_TS: N’)

[Diagram label: NTP]

Page 43: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

PITR Semantics: Binlog Commit Time Index

Rows | Version (Logical Offset) | Value
<ShardKey><DB_TABLE_#1><2016-05-23 23><100> | 100 | mysql-bin.00000:100
<ShardKey><DB_TABLE_#1><2016-05-23 23><101> | 101 | mysql-bin.00000:101
<ShardKey><DB_TABLE_#1><2016-05-23 23><103> | 103 | mysql-bin.00000:103
<ShardKey><DB_TABLE_#1><2016-05-24 00><102> | 102 | mysql-bin.00000:102

[Annotations: First mutation across PITR; The last mutation before PITR]

Page 44: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Streaming DB Export

• Pros:

• Consistent

• High SLA for the daily snapshot

• Consistent as PITR semantics

• Near real time ad hoc query

• Hive/Spark compatible

• Hourly snapshot view

• Low storage cost

• Cons:

• Schema change

Page 45: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Thank you!

Page 46: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Remember to complete your evaluations!