AWS re:Invent 2016: Streaming ETL for Amazon RDS and Amazon DynamoDB (DAT315)
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Greg Brandt, Liyin Tang (Airbnb)
December 2, 2016
Streaming ETL for Amazon RDS and Amazon DynamoDB
DAT315
What to Expect from the Session
• Database Change Data Capture (CDC)
• Improving ETL to Data Warehouse
SpinalTap (CDC)
Architectural Evolution
From a monolithic Rails app to many specialized services/data stores
New Challenges
• Co-processing logic breaks down out of process/transaction context
• Primary tables/indices on many machines, not single RDBMS
• Specialized systems needed for certain use cases (analytics, search,
etc.)
Architectural Tenets
• Build for production
• Plan for the future, build for today
• Prefer existing solutions and patterns that we have
experience with in production
• Services should own their data and not share their
storage
• Mutations to data should be propagated via
standardized events
Change Data Capture (CDC)
Goal: Provide streams of data mutations
• In near real time
• With timeline consistency
To keep all these systems in sync
Option 1: Application-Driven Dual Writes
• Consistency: hard (2PC/consensus needed)
• Data model: easy (schema controlled by application)
• Development: easy
• Use a queue (e.g. Kafka, RabbitMQ) in addition to the RDBMS
Option 2: Database Log Mining
• Consistency: easy (leverage commit log semantics)
• Parsing/data model: hard (database's internal commit log format)
We Chose Database Log Mining
• Parsing is easier than consensus
• Many libraries/APIs exist to make parsing easy
• Consuming stream of commits gives timeline
consistency by default
Data Ecosystem
Requirements
• Timeline consistency with at-least-once message
delivery
• Easily add new sources to consume (new machines if
necessary)
• Support low latency and high throughput use cases
• High availability with automatic failover
• Heterogeneous data sources (MySQL, Amazon
DynamoDB)
MySQL Commit Log
• Java library for binary log parsing: https://github.com/shyiko/mysql-binlog-connector-java/
• Emit mutation events (Write_rows, Update_rows, Delete_rows)
• Logical clock determined from binlog file/offset (single-master, Multi-AZ setup)
• Leverage XidEvent for transaction boundary metadata/checkpointing (InnoDB implementation detail)
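A minimal sketch of tailing the binlog with mysql-binlog-connector-java; the host, credentials, and class name are illustrative, and SpinalTap's actual consumer is more involved:

import com.github.shyiko.mysql.binlog.BinaryLogClient;
import com.github.shyiko.mysql.binlog.event.EventType;
import com.github.shyiko.mysql.binlog.event.WriteRowsEventData;
import com.github.shyiko.mysql.binlog.event.XidEventData;

public class BinlogTailer {
    public static void main(String[] args) throws Exception {
        // Connect to the MySQL master as a replication client (illustrative endpoint/credentials).
        BinaryLogClient client = new BinaryLogClient("mysql-master.example.com", 3306, "repl", "secret");

        client.registerEventListener(event -> {
            EventType type = event.getHeader().getEventType();
            if (type == EventType.WRITE_ROWS || type == EventType.EXT_WRITE_ROWS) {
                // Inserted rows: only an afterImage exists.
                WriteRowsEventData data = event.getData();
                System.out.println("INSERT rows: " + data.getRows().size());
            } else if (type == EventType.XID) {
                // Transaction boundary: a safe point to checkpoint (binlog file + position).
                XidEventData xid = event.getData();
                System.out.println("Commit xid=" + xid.getXid()
                        + " at " + client.getBinlogFilename() + ":" + client.getBinlogPosition());
            }
        });

        client.connect(); // blocks, streaming events as they are written to the binlog
    }
}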
DynamoDB Streams
• Using the DynamoDB Streams Kinesis Adapter
• Guarantees:
  • Each stream record appears exactly once in the stream
  • Stream records appear in the same sequence as the actual modifications to the item
• Monotonically increasing logical clock is hard:
  • Need to incorporate shard id and parent/child splitting semantics
  • SequenceNumber is not global
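A rough sketch of the consuming side, assuming the standard KCL v1 IRecordProcessor used with the DynamoDB Streams Kinesis Adapter; worker wiring is omitted and the class name and mutation translation are illustrative:

import java.util.List;

import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason;
import com.amazonaws.services.kinesis.model.Record;

public class StreamRecordProcessor implements IRecordProcessor {
    private String shardId;

    @Override
    public void initialize(String shardId) {
        // The shard id must be folded into the logical clock: SequenceNumber is only per-shard.
        this.shardId = shardId;
    }

    @Override
    public void processRecords(List<Record> records, IRecordProcessorCheckpointer checkpointer) {
        for (Record record : records) {
            // The adapter presents each DynamoDB stream record as a Kinesis Record;
            // (shardId, sequenceNumber) gives a per-shard ordering, not a global clock.
            String sequenceNumber = record.getSequenceNumber();
            // ... translate to an abstract Mutation and publish downstream ...
        }
        try {
            checkpointer.checkpoint(); // persist progress; replays on failure give at-least-once delivery
        } catch (Exception e) {
            // lease lost or worker shutting down; these records will be reprocessed
        }
    }

    @Override
    public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
        // On TERMINATE (shard split/merge), checkpoint so child shards are consumed in order.
    }
}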
Abstract Mutation
• Provide monotonically increasing* id
from logical clock
• Source-specific metadata (e.g. MySQL
binlog filename/offset)
• The beforeImage of the row in DB
(possibly null)
• The afterImage of the row in DB
(possibly null)
• Encode this using source-agnostic
format (e.g. Thrift)
• Write this object to message bus (e.g.
Kafka)
{
id: Long,
opCode: [
INSERT,
UPDATE,
DELETE
],
metadata: Map<String, String>,
beforeImage: Record,
afterImage: Record
}
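A hypothetical Java rendering of the same structure; field and class names mirror the sketch above, not SpinalTap's actual types:

import java.util.Map;

/** Source-agnostic change event, as outlined above (illustrative only). */
public class Mutation<R> {
    public enum OpCode { INSERT, UPDATE, DELETE }

    private final long id;                       // monotonically increasing, from the logical clock
    private final OpCode opCode;
    private final Map<String, String> metadata;  // e.g. binlog filename/offset for MySQL
    private final R beforeImage;                 // null for INSERT
    private final R afterImage;                  // null for DELETE

    public Mutation(long id, OpCode opCode, Map<String, String> metadata, R beforeImage, R afterImage) {
        this.id = id;
        this.opCode = opCode;
        this.metadata = metadata;
        this.beforeImage = beforeImage;
        this.afterImage = afterImage;
    }

    public long getId() { return id; }
    public OpCode getOpCode() { return opCode; }
    public Map<String, String> getMetadata() { return metadata; }
    public R getBeforeImage() { return beforeImage; }
    public R getAfterImage() { return afterImage; }
}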
Clustering/Configuration
• LEADER/STANDBY state model
• Each machine is LEADER for a subset of
sources
• Workload distributed evenly
• Use ZooKeeper-based Apache Helix
framework for cluster management
• http://helix.apache.org/
• Dynamic source configuration changes
• Helix Instance group tags to separate
MySQL/DynamoDB nodes
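A minimal sketch of a Helix LEADER/STANDBY state model for one source; registration with a HelixManager and the actual streaming logic are omitted, and the names are illustrative:

import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelInfo;
import org.apache.helix.participant.statemachine.Transition;

// Each Helix partition corresponds to one source (e.g. a MySQL binlog); Helix moves it
// between STANDBY and LEADER so exactly one node streams it at a time.
@StateModelInfo(initialState = "OFFLINE", states = {"LEADER", "STANDBY", "OFFLINE"})
public class SourceStateModel extends StateModel {
    private final String sourceName;

    public SourceStateModel(String sourceName) {
        this.sourceName = sourceName;
    }

    @Transition(from = "STANDBY", to = "LEADER")
    public void onBecomeLeaderFromStandby(Message message, NotificationContext context) {
        // Bump leader_epoch in the property store and start streaming this source.
    }

    @Transition(from = "LEADER", to = "STANDBY")
    public void onBecomeStandbyFromLeader(Message message, NotificationContext context) {
        // Stop streaming; another node will take over.
    }

    @Transition(from = "OFFLINE", to = "STANDBY")
    public void onBecomeStandbyFromOffline(Message message, NotificationContext context) {
        // Load local state (e.g. last checkpoint) without streaming yet.
    }
}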
Fault Tolerance
• Controller handles node failure/elects
new LEADER for sources
• Maintain leader_epoch counter in Helix
ZooKeeper property store
• Prefix generated ids with leader_epoch
for monotonicity
• E.g. (leader_epoch, binlog_file,
binlog_pos)
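The epoch-prefixed id can be sketched as a composite comparable value (a plain-Java illustration, not the actual SpinalTap type):

/** Composite logical-clock id: compare by leader epoch first, then binlog file, then offset. */
public class EpochPrefixedId implements Comparable<EpochPrefixedId> {
    private final long leaderEpoch;   // bumped each time a new LEADER is elected for the source
    private final String binlogFile;  // e.g. "mysql-bin.000042"; fixed-width names sort lexicographically
    private final long binlogPos;

    public EpochPrefixedId(long leaderEpoch, String binlogFile, long binlogPos) {
        this.leaderEpoch = leaderEpoch;
        this.binlogFile = binlogFile;
        this.binlogPos = binlogPos;
    }

    @Override
    public int compareTo(EpochPrefixedId other) {
        // A later epoch always wins, even if the new leader replays earlier binlog positions,
        // so generated ids stay monotonic across failover.
        int byEpoch = Long.compare(leaderEpoch, other.leaderEpoch);
        if (byEpoch != 0) return byEpoch;
        int byFile = binlogFile.compareTo(other.binlogFile);
        if (byFile != 0) return byFile;
        return Long.compare(binlogPos, other.binlogPos);
    }
}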
Pub/Sub
• Produce mutations to Kafka with durable configuration*
• Async coprocessors consume messages, produce new streams
• Model streaming library allows encapsulation of DB table schema
  • Service controls both API endpoint and streaming view of data
• Keep 24 hours of MySQL binlog
  • Alert / rewind on failures in this tier
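A sketch of the producer side using the Java Kafka producer; the old request.required.acks = all setting mentioned later corresponds to acks=all here, and the broker and topic names are illustrative:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MutationPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092");    // illustrative brokers
        props.put("acks", "all");               // wait for all in-sync replicas before acking
        props.put("retries", Integer.MAX_VALUE);
        props.put("max.in.flight.requests.per.connection", 1);        // preserve per-partition ordering on retry
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (Producer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] key = "db_table:pk".getBytes();   // partition by table/PK so a row's mutations stay ordered
            byte[] value = new byte[0];              // Thrift-serialized Mutation would go here
            producer.send(new ProducerRecord<>("spinaltap.mutations", key, value));
        }
    }
}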
Online Validation
• Download binlog after it is flushed/immutable
• Check for holes/ordering violations by consuming stream from Kafka
• Allows us to maintain low latency with confidence in consistency of stream
• Auto-healing
  • Reset binlog position to earlier if too many failures
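A toy version of the consistency check, assuming the ids parsed from the flushed binlog are compared against the ids observed on Kafka; duplicate deliveries are tolerated since the stream is at-least-once:

import java.util.Iterator;
import java.util.List;

/** Toy validator: ids from the flushed binlog must appear on Kafka in order, with nothing missing. */
public class StreamValidator {

    public static boolean isConsistent(List<Long> binlogIds, List<Long> kafkaIds) {
        Iterator<Long> expected = binlogIds.iterator();
        long lastSeen = Long.MIN_VALUE;

        for (long id : kafkaIds) {
            if (id == lastSeen) {
                continue;         // benign duplicate (at-least-once delivery)
            }
            if (id < lastSeen) {
                return false;     // ordering violation in the published stream
            }
            lastSeen = id;
            if (!expected.hasNext() || expected.next() != id) {
                return false;     // hole: a committed mutation never made it to Kafka, or ids diverged
            }
        }
        // Any leftover expected ids are holes at the tail of the stream.
        return !expected.hasNext();
    }
}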
Production Lessons
• Need schema history store for regions of commit log to support rewind
  • E.g. write DDL to commit log, apply to local MySQL while processing stream to obtain range/schema mapping
• Be careful about table encodings! (latin1, utf8...)
• request.required.acks = all can potentially hit every broker…
  • (Group produce requests by broker to avoid hitting too many)
• Per-source produce buffer size
  • (Tune for throughput/latency)
Data Ecosystem
Streaming DB Exports
Batch Infrastructure
[Architecture diagram: Airflow-scheduled batch ingestion of event logs and DB mutations (RDS, EC2) into Gold/Silver warehouse clusters, queried with Hive/Presto/Spark]
Growing Pain
Point-in-Time Restore based DB Export
• Pros:
• Simple
• Especially for schema change
• Consistent
• Cons:
• No SLA for RDS PITR restoration time
• No near real time ad hoc query
• No hourly snapshot
• High storage cost
Overview
Real-Time Ingestion on HBase
[Diagram: SpinalTap streams RDS mutations through Spark Streaming into HBase for real-time query; snapshots are exported to HDFS for batch query with Hive/Presto/Spark]
Access Data in HBase
[Diagram: unified view of real-time data in HBase — Spark for streaming, Presto for interactive query, Hive/Spark for batch jobs, with snapshots on HDFS]
Snapshot & Reseed
[Diagram: HBase snapshots (HFile links) exported to HDFS, then bulk-uploaded back into HBase to reseed]
Onboard New Tables
[Diagram: new RDS tables are reseeded into HBase and onto HDFS, then kept current by ingesting the stream of mutations from SpinalTap]
Disaster Recovery - Checkpoint
Disaster Recovery - Rewind
Disaster Recovery - Reseed
HBase Schema
Key Space Design
• Multiplex all DB tables on Single HBase Table
• Fast point look up based on primary keys
• Efficient sequential scans for one table
• Load balance
HBase Row Keys – Primary Keys
• Hash Key = md5(DB_TABLE, PK1=v1, PK2=v2)
• Row Key = Hash Key + DB_TABLE + PK1=v1 + PK2=v2
• Fast point lookup based on primary keys
• Efficient sequential scan for all the keys in the same DB/table
• Balanced based on hash key
Row key layout: [Hash | DB_TABLE | PK1=v1 | PK2=v2]
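A sketch of building that row key; the separator and helper name are illustrative:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowKeys {

    /** Row key = md5(DB_TABLE, PK1=v1, ...) prefix + DB_TABLE + PK1=v1 + ..., as in the layout above. */
    public static byte[] primaryKeyRowKey(String dbTable, String... keyPairs) {
        StringBuilder readable = new StringBuilder(dbTable);
        for (String pair : keyPairs) {                      // e.g. "PK1=v1", "PK2=v2"
            readable.append('\u0000').append(pair);         // NUL separator is an illustrative choice
        }
        byte[] readableBytes = readable.toString().getBytes(StandardCharsets.UTF_8);

        byte[] hash;
        try {
            // md5 over the fully qualified key spreads rows evenly across regions (load balancing);
            // the readable suffix keeps the key self-describing for point lookups.
            hash = MessageDigest.getInstance("MD5").digest(readableBytes);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }

        byte[] rowKey = new byte[hash.length + readableBytes.length];
        System.arraycopy(hash, 0, rowKey, 0, hash.length);
        System.arraycopy(readableBytes, 0, rowKey, hash.length, readableBytes.length);
        return rowKey;
    }
}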
HBase Row Keys – Secondary Keys
• Hash Key = md5(DB_TABLE, Index_1=v1)
• Row Key = Hash Key + DB_TABLE + Index_1=v1 + PK1=vpk1
• Prefix scan for a given secondary index
Row key layout: [Hash | DB_TABLE | Index=v1 | PK1=vpk1]
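A sketch of the prefix scan using the HBase 1.x client; it reuses the hypothetical RowKeys helper above to build the index prefix, and the table and index names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class SecondaryIndexScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("db_export"))) {   // illustrative table

            // Prefix = md5(DB_TABLE, Index_1=v1) + DB_TABLE + Index_1=v1; all rows with that
            // secondary-index value (one per primary key) sort together under this prefix.
            byte[] prefix = RowKeys.primaryKeyRowKey("mydb.users", "email=a@example.com");

            Scan scan = new Scan();
            scan.setRowPrefixFilter(prefix);   // derives start/stop rows from the prefix

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    // each Result is one primary-key row matching the secondary index value
                }
            }
        }
    }
}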
HBase Versioning
Rows | CF: Columns | Version | Value
<ShardKey><DB_TABLE_#1><PK_a=A> | id | Fri May 19 00:33:19 2016 | 101
<ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 19 00:33:19 2016 | San Francisco
<ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 10 00:34:19 2016 | New York
<ShardKey><DB_TABLE_#2><PK_a=A’> | id | Fri May 19 00:33:19 2016 | 1
Version by Timestamp
Binlog order: TXN 1 (COMMIT_TS: 101) → TXN 2 (COMMIT_TS: 102) → TXN 3 (COMMIT_TS: 103) → … → TXN N (COMMIT_TS: N’)
Version by Timestamp
Binlog order: TXN 1 at mysql-bin.00000:100 (COMMIT_TS: T1) → TXN 2 at mysql-bin.00000:101 (COMMIT_TS: T3) → TXN 3 at mysql-bin.00000:102 (COMMIT_TS: T2) → … → TXN N at mysql-bin.00000:N
Because commit timestamps come from NTP-synced wall clocks, they are not guaranteed to be monotonic in binlog order.
HBase Versioning
Rows | CF: Columns | Version | Commit TS
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:100 | T0
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:101 | T1
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:102 | T3
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:103 | T2
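A sketch of writing a cell versioned by binlog order rather than wall-clock time; since HBase versions are longs, a position like mysql-bin.00000:101 has to be packed into one, and the encoding and names here are purely illustrative:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedWriter {

    // Illustrative packing: high bits = binlog file index, low bits = offset within the file.
    static long logicalVersion(long binlogFileIndex, long binlogOffset) {
        return (binlogFileIndex << 40) | binlogOffset;
    }

    static void writeCell(Table table, byte[] rowKey, String column, byte[] value,
                          long binlogFileIndex, long binlogOffset) throws IOException {
        Put put = new Put(rowKey);
        // Version by binlog order rather than the (possibly NTP-skewed) commit timestamp,
        // so "latest version" always means "latest in the commit log".
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes(column),
                logicalVersion(binlogFileIndex, binlogOffset), value);
        table.put(put);
    }
}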
PITR Semantics
Binlog order: TXN 1 (COMMIT_TS: 101) → TXN 2 (COMMIT_TS: 103) → TXN 3 (COMMIT_TS: 102) → … → TXN N (COMMIT_TS: N’); with NTP, commit timestamps can be out of order relative to the binlog.
PITR Semantics: Binlog Commit Time Index
Rows | Version (Logical Offset) | Value
<ShardKey><DB_TABLE_#1><2016-05-23 23><100> | 100 | mysql-bin.00000:100
<ShardKey><DB_TABLE_#1><2016-05-23 23><101> | 101 | mysql-bin.00000:101
<ShardKey><DB_TABLE_#1><2016-05-23 23><103> | 103 | mysql-bin.00000:103   ← the last mutation before PITR
<ShardKey><DB_TABLE_#1><2016-05-24 00><102> | 102 | mysql-bin.00000:102   ← first mutation across PITR
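One way such an index can be used: scan up to the restore time and take the last entry's binlog position as the PITR boundary. The row-key encoding and names here are illustrative:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PitrLookup {

    /** Returns the binlog position of the last mutation committed before restoreHour (e.g. "2016-05-24 00"). */
    static String lastPositionBefore(Table index, byte[] shardAndTablePrefix, String restoreHour)
            throws IOException {
        Scan scan = new Scan();
        // Index rows sort by <ShardKey><DB_TABLE><commit hour><logical offset>, so scanning up to
        // the restore hour (exclusive stop row) visits exactly the mutations committed before the boundary.
        scan.setStartRow(shardAndTablePrefix);
        scan.setStopRow(Bytes.add(shardAndTablePrefix, Bytes.toBytes(restoreHour)));

        String lastPosition = null;
        try (ResultScanner scanner = index.getScanner(scan)) {
            for (Result row : scanner) {
                // Cell value is the binlog position, e.g. "mysql-bin.00000:103".
                lastPosition = Bytes.toString(row.value());
            }
        }
        return lastPosition;
    }
}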
Streaming DB Export
• Pros:
• Consistent
• High SLA for the daily snapshot
• Consistent with PITR semantics
• Near real time ad hoc query
• Hive/Spark compatible
• Hourly snapshot view
• Low storage cost
• Cons:
• Schema change
Thank you!
Remember to complete
your evaluations!