AWS re:Invent 2016: Streaming ETL for Amazon RDS and Amazon DynamoDB (DAT315)
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Greg Brandt, Liyin Tang (Airbnb)
December 2, 2016
Streaming ETL for Amazon RDS and Amazon DynamoDB
DAT315
What to Expect from the Session
• Database Change Data Capture (CDC)
• Improving ETL to Data Warehouse
SpinalTap (CDC)
Architectural Evolution
From a monolithic Rails app to many specialized services/data stores
New Challenges
• Co-processing logic breaks down out of process/transaction context
• Primary tables/indices on many machines, not single RDBMS
• Specialized systems needed for certain use cases (analytics, search,
etc.)
Architectural Tenets
• Build for production
• Plan for the future, build for today
• Prefer existing solutions and patterns that we have
experience with in production
• Services should own their data and not share their
storage
• Mutations to data should be propagated via
standardized events
Change Data Capture (CDC)
Goal: Provide streams of data mutations
• In near real time
• With timeline consistency
To keep all these systems in sync
Option 1: Application-Driven Dual Writes
• Consistency: hard (2PC/consensus needed)
• Data model: easy (schema controlled by application)
• Development: easy
• Use a queue (e.g. Kafka, RabbitMQ) in addition to the RDBMS
Option 2: Database Log Mining
• Consistency: easy (leverage commit log semantics)
• Parsing/data model: hard (database's internal commit log format)
We Chose Database Log Mining
• Parsing is easier than consensus
• Many libraries/APIs exist to make parsing easy
• Consuming stream of commits gives timeline
consistency by default
Data Ecosystem
Requirements
• Timeline consistency with at-least-once message
delivery
• Easily add new sources to consume (new machines if
necessary)
• Support low latency and high throughput use cases
• High availability with automatic failover
• Heterogeneous data sources (MySQL, Amazon
DynamoDB)
MySQL Commit Log
• Java library for binary log parsing: https://github.com/shyiko/mysql-binlog-connector-java/
• Emit mutation events (Write_rows, Update_rows, Delete_rows)
• Logical clock determined from binlog file/offset (single-master, Multi-AZ setup)
• Leverage XidEvent for transaction boundary metadata/checkpointing (InnoDB implementation detail)
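A minimal sketch of tailing the binlog with mysql-binlog-connector-java; the host, credentials, and class name are illustrative, and SpinalTap's actual consumer is more involved:

import com.github.shyiko.mysql.binlog.BinaryLogClient;
import com.github.shyiko.mysql.binlog.event.EventType;
import com.github.shyiko.mysql.binlog.event.WriteRowsEventData;
import com.github.shyiko.mysql.binlog.event.XidEventData;

public class BinlogTailer {
    public static void main(String[] args) throws Exception {
        // Connect to the MySQL master as a replication client (illustrative endpoint/credentials).
        BinaryLogClient client = new BinaryLogClient("mysql-master.example.com", 3306, "repl", "secret");

        client.registerEventListener(event -> {
            EventType type = event.getHeader().getEventType();
            if (type == EventType.WRITE_ROWS || type == EventType.EXT_WRITE_ROWS) {
                // Inserted rows: only an afterImage exists.
                WriteRowsEventData data = event.getData();
                System.out.println("INSERT rows: " + data.getRows().size());
            } else if (type == EventType.XID) {
                // Transaction boundary: a safe point to checkpoint (binlog file + position).
                XidEventData xid = event.getData();
                System.out.println("Commit xid=" + xid.getXid()
                        + " at " + client.getBinlogFilename() + ":" + client.getBinlogPosition());
            }
        });

        client.connect(); // blocks, streaming events as they are written to the binlog
    }
}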
DynamoDB Streams
• Using the DynamoDB Streams Kinesis Adapter
• Guarantees:
  • Each stream record appears exactly once in the stream
  • Stream records appear in the same sequence as the actual modifications to the item
• Monotonically increasing logical clock is hard:
  • Need to incorporate shard id and parent/child splitting semantics
  • SequenceNumber is not global
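A rough sketch of the consuming side, assuming the standard KCL v1 IRecordProcessor used with the DynamoDB Streams Kinesis Adapter; worker wiring is omitted and the class name and mutation translation are illustrative:

import java.util.List;

import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason;
import com.amazonaws.services.kinesis.model.Record;

public class StreamRecordProcessor implements IRecordProcessor {
    private String shardId;

    @Override
    public void initialize(String shardId) {
        // The shard id must be folded into the logical clock: SequenceNumber is only per-shard.
        this.shardId = shardId;
    }

    @Override
    public void processRecords(List<Record> records, IRecordProcessorCheckpointer checkpointer) {
        for (Record record : records) {
            // The adapter presents each DynamoDB stream record as a Kinesis Record;
            // (shardId, sequenceNumber) gives a per-shard ordering, not a global clock.
            String sequenceNumber = record.getSequenceNumber();
            // ... translate to an abstract Mutation and publish downstream ...
        }
        try {
            checkpointer.checkpoint(); // persist progress; replays on failure give at-least-once delivery
        } catch (Exception e) {
            // lease lost or worker shutting down; these records will be reprocessed
        }
    }

    @Override
    public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
        // On TERMINATE (shard split/merge), checkpoint so child shards are consumed in order.
    }
}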
Abstract Mutation
• Provide monotonically increasing* id
from logical clock
• Source-specific metadata (e.g. MySQL
binlog filename/offset)
• The beforeImage of the row in DB
(possibly null)
• The afterImage of the row in DB
(possibly null)
• Encode this using source-agnostic
format (e.g. Thrift)
• Write this object to message bus (e.g.
Kafka)
{
id: Long,
opCode: [
INSERT,
UPDATE,
DELETE
],
metadata: Map<String, String>,
beforeImage: Record,
afterImage: Record
}
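A hypothetical Java rendering of the same structure; field and class names mirror the sketch above, not SpinalTap's actual types:

import java.util.Map;

/** Source-agnostic change event, as outlined above (illustrative only). */
public class Mutation<R> {
    public enum OpCode { INSERT, UPDATE, DELETE }

    private final long id;                       // monotonically increasing, from the logical clock
    private final OpCode opCode;
    private final Map<String, String> metadata;  // e.g. binlog filename/offset for MySQL
    private final R beforeImage;                 // null for INSERT
    private final R afterImage;                  // null for DELETE

    public Mutation(long id, OpCode opCode, Map<String, String> metadata, R beforeImage, R afterImage) {
        this.id = id;
        this.opCode = opCode;
        this.metadata = metadata;
        this.beforeImage = beforeImage;
        this.afterImage = afterImage;
    }

    public long getId() { return id; }
    public OpCode getOpCode() { return opCode; }
    public Map<String, String> getMetadata() { return metadata; }
    public R getBeforeImage() { return beforeImage; }
    public R getAfterImage() { return afterImage; }
}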
Clustering/Configuration
• LEADER/STANDBY state model
• Each machine is LEADER for a subset of
sources
• Workload distributed evenly
• Use ZooKeeper-based Apache Helix
framework for cluster management
• http://helix.apache.org/
• Dynamic source configuration changes
• Helix Instance group tags to separate
MySQL/DynamoDB nodes
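A minimal sketch of a Helix LEADER/STANDBY state model for one source; registration with a HelixManager and the actual streaming logic are omitted, and the names are illustrative:

import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelInfo;
import org.apache.helix.participant.statemachine.Transition;

// Each Helix partition corresponds to one source (e.g. a MySQL binlog); Helix moves it
// between STANDBY and LEADER so exactly one node streams it at a time.
@StateModelInfo(initialState = "OFFLINE", states = {"LEADER", "STANDBY", "OFFLINE"})
public class SourceStateModel extends StateModel {
    private final String sourceName;

    public SourceStateModel(String sourceName) {
        this.sourceName = sourceName;
    }

    @Transition(from = "STANDBY", to = "LEADER")
    public void onBecomeLeaderFromStandby(Message message, NotificationContext context) {
        // Bump leader_epoch in the property store and start streaming this source.
    }

    @Transition(from = "LEADER", to = "STANDBY")
    public void onBecomeStandbyFromLeader(Message message, NotificationContext context) {
        // Stop streaming; another node will take over.
    }

    @Transition(from = "OFFLINE", to = "STANDBY")
    public void onBecomeStandbyFromOffline(Message message, NotificationContext context) {
        // Load local state (e.g. last checkpoint) without streaming yet.
    }
}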
Fault Tolerance
• Controller handles node failure/elects
new LEADER for sources
• Maintain leader_epoch counter in Helix
ZooKeeper property store
• Prefix generated ids with leader_epoch
for monotonicity
• E.g. (leader_epoch, binlog_file,
binlog_pos)
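The epoch-prefixed id can be sketched as a composite comparable value (a plain-Java illustration, not the actual SpinalTap type):

/** Composite logical-clock id: compare by leader epoch first, then binlog file, then offset. */
public class EpochPrefixedId implements Comparable<EpochPrefixedId> {
    private final long leaderEpoch;   // bumped each time a new LEADER is elected for the source
    private final String binlogFile;  // e.g. "mysql-bin.000042"; fixed-width names sort lexicographically
    private final long binlogPos;

    public EpochPrefixedId(long leaderEpoch, String binlogFile, long binlogPos) {
        this.leaderEpoch = leaderEpoch;
        this.binlogFile = binlogFile;
        this.binlogPos = binlogPos;
    }

    @Override
    public int compareTo(EpochPrefixedId other) {
        // A later epoch always wins, even if the new leader replays earlier binlog positions,
        // so generated ids stay monotonic across failover.
        int byEpoch = Long.compare(leaderEpoch, other.leaderEpoch);
        if (byEpoch != 0) return byEpoch;
        int byFile = binlogFile.compareTo(other.binlogFile);
        if (byFile != 0) return byFile;
        return Long.compare(binlogPos, other.binlogPos);
    }
}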
Pub/Sub
• Produce mutations to Kafka with durable configuration*
• Async coprocessors consume messages, produce new streams
• Model streaming library allows encapsulation of DB table schema
  • Service controls both API endpoint and streaming view of data
• Keep 24 hours of MySQL binlog
  • Alert / rewind on failures in this tier
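A sketch of the producer side using the Java Kafka producer; the old request.required.acks = all setting mentioned later corresponds to acks=all here, and the broker and topic names are illustrative:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MutationPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092");    // illustrative brokers
        props.put("acks", "all");               // wait for all in-sync replicas before acking
        props.put("retries", Integer.MAX_VALUE);
        props.put("max.in.flight.requests.per.connection", 1);        // preserve per-partition ordering on retry
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (Producer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] key = "db_table:pk".getBytes();   // partition by table/PK so a row's mutations stay ordered
            byte[] value = new byte[0];              // Thrift-serialized Mutation would go here
            producer.send(new ProducerRecord<>("spinaltap.mutations", key, value));
        }
    }
}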
Online Validation
• Download binlog after it is flushed/immutable
• Check for holes/ordering violations by consuming stream from Kafka
• Allows us to maintain low latency with confidence in consistency of stream
• Auto-healing
  • Reset binlog position to earlier if too many failures
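A toy version of the consistency check, assuming the ids parsed from the flushed binlog are compared against the ids observed on Kafka; duplicate deliveries are tolerated since the stream is at-least-once:

import java.util.Iterator;
import java.util.List;

/** Toy validator: ids from the flushed binlog must appear on Kafka in order, with nothing missing. */
public class StreamValidator {

    public static boolean isConsistent(List<Long> binlogIds, List<Long> kafkaIds) {
        Iterator<Long> expected = binlogIds.iterator();
        long lastSeen = Long.MIN_VALUE;

        for (long id : kafkaIds) {
            if (id == lastSeen) {
                continue;         // benign duplicate (at-least-once delivery)
            }
            if (id < lastSeen) {
                return false;     // ordering violation in the published stream
            }
            lastSeen = id;
            if (!expected.hasNext() || expected.next() != id) {
                return false;     // hole: a committed mutation never made it to Kafka, or ids diverged
            }
        }
        // Any leftover expected ids are holes at the tail of the stream.
        return !expected.hasNext();
    }
}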
Production Lessons
• Need schema history store for regions of commit log to support rewind
  • E.g. write DDL to commit log, apply to local MySQL while processing stream to obtain range/schema mapping
• Be careful about table encodings! (latin1, utf8...)
• request.required.acks = all can potentially hit every broker…
  • (Group produce requests by broker to avoid hitting too many)
• Per-source produce buffer size
  • (Tune for throughput/latency)
Data Ecosystem
Streaming DB Exports
Batch Infrastructure
[Architecture diagram: Airflow-scheduled batch ingestion of event logs and DB mutations (RDS, EC2) into Gold/Silver warehouse clusters, queried with Hive/Presto/Spark]
Growing Pain
Point-in-Time Restore based DB Export
• Pros:
• Simple
• Especially for schema change
• Consistent
• Cons:
• No SLA for RDS PITR restoration time
• No near real time ad hoc query
• No hourly snapshot
• High storage cost
Overview
Real-Time Ingestion on HBase
[Diagram: SpinalTap streams RDS mutations through Spark Streaming into HBase for real-time query; snapshots are exported to HDFS for batch query with Hive/Presto/Spark]
Access Data in HBase
[Diagram: unified view of real-time data in HBase — Spark for streaming, Presto for interactive query, Hive/Spark for batch jobs, with snapshots on HDFS]
Snapshot & Reseed
[Diagram: HBase snapshots (HFile links) exported to HDFS, then bulk-uploaded back into HBase to reseed]
Onboard New Tables
[Diagram: new RDS tables are reseeded into HBase and onto HDFS, then kept current by ingesting the stream of mutations from SpinalTap]
Disaster Recovery - Checkpoint
Disaster Recovery - Rewind
Disaster Recovery - Reseed
HBase Schema
Key Space Design
• Multiplex all DB tables on Single HBase Table
• Fast point look up based on primary keys
• Efficient sequential scans for one table
• Load balance
HBase Row Keys – Primary Keys
• Hash Key = md5(DB_TABLE, PK1=v1, PK2=v2)
• Row Key = Hash Key + DB_TABLE + PK1=v1 + PK2=v2
• Fast point lookup based on primary keys
• Efficient sequential scan for all the keys in the same DB/table
• Balanced based on hash key
Row key layout: [Hash | DB_TABLE | PK1=v1 | PK2=v2]
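A sketch of building that row key; the separator and helper name are illustrative:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowKeys {

    /** Row key = md5(DB_TABLE, PK1=v1, ...) prefix + DB_TABLE + PK1=v1 + ..., as in the layout above. */
    public static byte[] primaryKeyRowKey(String dbTable, String... keyPairs) {
        StringBuilder readable = new StringBuilder(dbTable);
        for (String pair : keyPairs) {                      // e.g. "PK1=v1", "PK2=v2"
            readable.append('\u0000').append(pair);         // NUL separator is an illustrative choice
        }
        byte[] readableBytes = readable.toString().getBytes(StandardCharsets.UTF_8);

        byte[] hash;
        try {
            // md5 over the fully qualified key spreads rows evenly across regions (load balancing);
            // the readable suffix keeps the key self-describing for point lookups.
            hash = MessageDigest.getInstance("MD5").digest(readableBytes);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }

        byte[] rowKey = new byte[hash.length + readableBytes.length];
        System.arraycopy(hash, 0, rowKey, 0, hash.length);
        System.arraycopy(readableBytes, 0, rowKey, hash.length, readableBytes.length);
        return rowKey;
    }
}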
HBase Row Keys – Secondary Keys
• Hash Key = md5(DB_TABLE, Index_1=v1)
• Row Key = Hash Key + DB_TABLE + Index_1=v1 + PK1=vpk1
• Prefix scan for a given secondary index
Row key layout: [Hash | DB_TABLE | Index=v1 | PK1=vpk1]
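A sketch of the prefix scan using the HBase 1.x client; it reuses the hypothetical RowKeys helper above to build the index prefix, and the table and index names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class SecondaryIndexScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("db_export"))) {   // illustrative table

            // Prefix = md5(DB_TABLE, Index_1=v1) + DB_TABLE + Index_1=v1; all rows with that
            // secondary-index value (one per primary key) sort together under this prefix.
            byte[] prefix = RowKeys.primaryKeyRowKey("mydb.users", "email=a@example.com");

            Scan scan = new Scan();
            scan.setRowPrefixFilter(prefix);   // derives start/stop rows from the prefix

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    // each Result is one primary-key row matching the secondary index value
                }
            }
        }
    }
}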
HBase Versioning
Rows | CF: Columns | Version | Value
<ShardKey><DB_TABLE_#1><PK_a=A> | id | Fri May 19 00:33:19 2016 | 101
<ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 19 00:33:19 2016 | San Francisco
<ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 10 00:34:19 2016 | New York
<ShardKey><DB_TABLE_#2><PK_a=A’> | id | Fri May 19 00:33:19 2016 | 1
Version by Timestamp
Binlog order: TXN 1 (COMMIT_TS: 101) → TXN 2 (COMMIT_TS: 102) → TXN 3 (COMMIT_TS: 103) → … → TXN N (COMMIT_TS: N’)
Version by Timestamp
Binlog order: TXN 1 at mysql-bin.00000:100 (COMMIT_TS: T1) → TXN 2 at mysql-bin.00000:101 (COMMIT_TS: T3) → TXN 3 at mysql-bin.00000:102 (COMMIT_TS: T2) → … → TXN N at mysql-bin.00000:N
Because commit timestamps come from NTP-synced wall clocks, they are not guaranteed to be monotonic in binlog order.
HBase Versioning
Rows | CF: Columns | Version | Commit TS
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:100 | T0
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:101 | T1
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:102 | T3
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:103 | T2
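A sketch of writing a cell versioned by binlog order rather than wall-clock time; since HBase versions are longs, a position like mysql-bin.00000:101 has to be packed into one, and the encoding and names here are purely illustrative:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedWriter {

    // Illustrative packing: high bits = binlog file index, low bits = offset within the file.
    static long logicalVersion(long binlogFileIndex, long binlogOffset) {
        return (binlogFileIndex << 40) | binlogOffset;
    }

    static void writeCell(Table table, byte[] rowKey, String column, byte[] value,
                          long binlogFileIndex, long binlogOffset) throws IOException {
        Put put = new Put(rowKey);
        // Version by binlog order rather than the (possibly NTP-skewed) commit timestamp,
        // so "latest version" always means "latest in the commit log".
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes(column),
                logicalVersion(binlogFileIndex, binlogOffset), value);
        table.put(put);
    }
}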
PITR Semantics
Binlog order: TXN 1 (COMMIT_TS: 101) → TXN 2 (COMMIT_TS: 103) → TXN 3 (COMMIT_TS: 102) → … → TXN N (COMMIT_TS: N’); with NTP, commit timestamps can be out of order relative to the binlog.
PITR Semantics: Binlog Commit Time Index
Rows | Version (Logical Offset) | Value
<ShardKey><DB_TABLE_#1><2016-05-23 23><100> | 100 | mysql-bin.00000:100
<ShardKey><DB_TABLE_#1><2016-05-23 23><101> | 101 | mysql-bin.00000:101
<ShardKey><DB_TABLE_#1><2016-05-23 23><103> | 103 | mysql-bin.00000:103   ← the last mutation before PITR
<ShardKey><DB_TABLE_#1><2016-05-24 00><102> | 102 | mysql-bin.00000:102   ← first mutation across PITR
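One way such an index can be used: scan up to the restore time and take the last entry's binlog position as the PITR boundary. The row-key encoding and names here are illustrative:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PitrLookup {

    /** Returns the binlog position of the last mutation committed before restoreHour (e.g. "2016-05-24 00"). */
    static String lastPositionBefore(Table index, byte[] shardAndTablePrefix, String restoreHour)
            throws IOException {
        Scan scan = new Scan();
        // Index rows sort by <ShardKey><DB_TABLE><commit hour><logical offset>, so scanning up to
        // the restore hour (exclusive stop row) visits exactly the mutations committed before the boundary.
        scan.setStartRow(shardAndTablePrefix);
        scan.setStopRow(Bytes.add(shardAndTablePrefix, Bytes.toBytes(restoreHour)));

        String lastPosition = null;
        try (ResultScanner scanner = index.getScanner(scan)) {
            for (Result row : scanner) {
                // Cell value is the binlog position, e.g. "mysql-bin.00000:103".
                lastPosition = Bytes.toString(row.value());
            }
        }
        return lastPosition;
    }
}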
Streaming DB Export
• Pros:
• Consistent
• High SLA for the daily snapshot
• Consistent with PITR semantics
• Near real time ad hoc query
• Hive/Spark compatible
• Hourly snapshot view
• Low storage cost
• Cons:
• Schema change
Thank you!
Remember to complete
your evaluations!