Amazon DynamoDB Tips Every Developer Should Know :: 김일호 :: AWS Summit Seoul 2016
TRANSCRIPT
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
김일호, Solutions Architect
05-17-2016
Amazon DynamoDB Tips Every Developer Should Know
Agenda
Tip 1. DynamoDB Index (LSI, GSI)
Tip 2. DynamoDB Scaling
Tip 3. DynamoDB Data Modeling
Scenario based Best Practice
DynamoDB Streams
Tip 1. DynamoDB Index (LSI, GSI)
Local secondary index (LSI)
Alternate sort (= range) key attribute
Index is local to a partition (= hash) key
Table
  A1 (partition), A2 (sort), A3, A4, A5

LSIs
  KEYS_ONLY:  A1 (partition), A3 (sort), A2 (table key)
  INCLUDE A3: A1 (partition), A4 (sort), A2 (table key), A3 (projected)
  ALL:        A1 (partition), A5 (sort), A2 (table key), A3, A4 (projected)
10 GB max per hash key, i.e. LSIs limit the # of range keys!
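The LSI layout above can be sketched as boto3 CreateTable parameters (a minimal sketch; the table and index names are illustrative, and only the KEYS_ONLY index is shown):

```python
# CreateTable parameters for a table keyed on A1/A2 with one LSI that uses
# A3 as an alternate sort key and projects only the keys. You would pass
# this dict to boto3's DynamoDB client as client.create_table(**params).
params = {
    "TableName": "ExampleTable",  # illustrative name
    "AttributeDefinitions": [
        {"AttributeName": "A1", "AttributeType": "S"},
        {"AttributeName": "A2", "AttributeType": "S"},
        {"AttributeName": "A3", "AttributeType": "N"},
    ],
    "KeySchema": [
        {"AttributeName": "A1", "KeyType": "HASH"},   # partition key
        {"AttributeName": "A2", "KeyType": "RANGE"},  # sort key
    ],
    "LocalSecondaryIndexes": [{
        "IndexName": "A3-LSI",
        "KeySchema": [
            {"AttributeName": "A1", "KeyType": "HASH"},   # same partition key as the table
            {"AttributeName": "A3", "KeyType": "RANGE"},  # alternate sort key
        ],
        "Projection": {"ProjectionType": "KEYS_ONLY"},
    }],
    "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
}
```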
Global secondary index (GSI)
Alternate partition key
Index spans all of the table's partition keys
Table
  A1 (partition), A2, A3, A4, A5

GSIs
  INCLUDE A3: A5 (partition), A4 (sort), A1 (table key), A3 (projected)
  ALL:        A4 (partition), A5 (sort), A1 (table key), A2, A3 (projected)
  KEYS_ONLY:  A2 (partition), A1 (table key)
RCUs/WCUs provisioned separately for GSIs
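Because GSI capacity is separate from the table's, each GlobalSecondaryIndexes entry carries its own ProvisionedThroughput. A sketch of the INCLUDE-A3 index above (index name and capacity numbers are illustrative):

```python
# One GlobalSecondaryIndexes entry for create_table/update_table: A5 as the
# alternate partition key, A4 as sort key, A3 projected in. The RCUs/WCUs
# here are provisioned for the index itself, separate from the table's.
gsi = {
    "IndexName": "A5-A4-GSI",
    "KeySchema": [
        {"AttributeName": "A5", "KeyType": "HASH"},   # alternate partition key
        {"AttributeName": "A4", "KeyType": "RANGE"},
    ],
    "Projection": {"ProjectionType": "INCLUDE", "NonKeyAttributes": ["A3"]},
    "ProvisionedThroughput": {"ReadCapacityUnits": 100, "WriteCapacityUnits": 50},
}
```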
Online indexing
How do GSI updates work?
(Diagram: a client writes to the primary table; the global secondary index is updated afterward)
1. Synchronous write to the primary table
2. Asynchronous update to the global secondary index (in progress)
If GSIs don’t have enough write capacity, table writes will be throttled!
LSI or GSI?
An LSI can be modeled as a GSI
If data size in an item collection > 10 GB, use a GSI
If eventual consistency is okay for your scenario, use GSI!
Tip 2. DynamoDB Scaling
Scaling
Throughput
• Provision any amount of throughput to a table

Size
• Add any number of items to a table
• Max item size is 400 KB
• LSIs limit the number of range keys due to the 10 GB limit
Scaling is achieved through partitioning
Throughput
Provisioned at the table level
• Write capacity units (WCUs) are measured in 1 KB per second
• Read capacity units (RCUs) are measured in 4 KB per second
• RCUs measure strictly consistent reads
• Eventually consistent reads cost 1/2 of consistent reads
Read and write throughput limits are independent
Partitioning math
Number of Partitions
By Capacity: (Total RCU / 3000) + (Total WCU / 1000)
By Size: Total Size / 10 GB
Total Partitions: CEILING(MAX(Capacity, Size))
Partitioning example
Table size = 8 GB, RCUs = 5000, WCUs = 500
RCUs per partition = 5000 / 3 = 1666.67
WCUs per partition = 500 / 3 = 166.67
Data per partition = 8 / 3 = 2.67 GB
RCUs and WCUs are uniformly spread across partitions
Number of Partitions
By Capacity: (5000 / 3000) + (500 / 1000) = 2.17
By Size: 8 / 10 = 0.8
Total Partitions: CEILING(MAX(2.17, 0.8)) = 3
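The partitioning math above is easy to script (a sketch of the 2016-era guidance; actual partition counts are internal to DynamoDB):

```python
import math

def partition_count(rcu, wcu, size_gb):
    """Estimate partition count: CEILING(MAX(capacity, size))."""
    by_capacity = rcu / 3000 + wcu / 1000
    by_size = size_gb / 10
    return math.ceil(max(by_capacity, by_size))

p = partition_count(rcu=5000, wcu=500, size_gb=8)
print(p)         # 3
print(5000 / p)  # RCUs per partition, ~1666.67
print(500 / p)   # WCUs per partition, ~166.67
```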
Allocation of partitions
A partition split occurs when:
• Provisioned throughput settings are increased
• Storage requirements increase
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
Example: hot keys
(Heat map: request heat by partition over time, showing a hot key)
Example: periodic spike
Getting the most out of DynamoDB throughput
“To get the most out of DynamoDB throughput, create tables where the partition key element has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.”
—DynamoDB Developer Guide
Space: access is evenly spread over the key-space
Time: requests arrive evenly spaced in time
What causes throttling?
If sustained throughput goes beyond provisioned throughput per partition
Non-uniform workloads
• Hot keys/hot partitions
• Very large bursts
Mixing hot data with cold data
• Use a table per time period
From the example before:
• Table created with 5000 RCUs, 500 WCUs
• RCUs per partition = 1666.67
• WCUs per partition = 166.67
• If sustained throughput > (1666 RCUs or 166 WCUs) per key or partition, DynamoDB may throttle requests
• Solution: increase provisioned throughput
Tip 3. DynamoDB Data Modeling
1:1 relationships or key-values
Use a table or GSI with a partition key
Use GetItem or BatchGetItem API
Example: Given an SSN or license number, get attributes
Users Table
  SSN = 123-45-6789 : Email = [email protected], License = TDL25478134
  SSN = 987-65-4321 : Email = [email protected], License = TDL78309234

Users-Email-GSI
  License = TDL25478134 : Email = [email protected], SSN = 123-45-6789
  License = TDL78309234 : Email = [email protected], SSN = 987-65-4321
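The two lookups above map to request parameters like these (a sketch in the low-level attribute-value format; note that reads through a GSI use the Query API rather than GetItem):

```python
# GetItem by table key (SSN), per the Users table above.
get_by_ssn = {
    "TableName": "Users",
    "Key": {"SSN": {"S": "123-45-6789"}},
}

# Lookup by license number goes through the GSI, which supports Query only.
query_by_license = {
    "TableName": "Users",
    "IndexName": "Users-Email-GSI",
    "KeyConditionExpression": "License = :lic",
    "ExpressionAttributeValues": {":lic": {"S": "TDL25478134"}},
}
```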
1:N relationships or parent-children
Use a table or GSI with partition and sort key
Use Query API
Example: Given a device, find all readings between epoch X and Y

Device-measurements
  DeviceId = 1, epoch = 5513A97C : Temperature = 30, Pressure = 90
  DeviceId = 1, epoch = 5513A9DB : Temperature = 30, Pressure = 90
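The device-readings query above can be sketched as Query parameters (values illustrative; "#ts" is an expression attribute name alias for epoch):

```python
# Query: all readings for DeviceId = 1 between two epoch values X and Y.
query_params = {
    "TableName": "Device-measurements",
    "KeyConditionExpression": "DeviceId = :d AND #ts BETWEEN :x AND :y",
    "ExpressionAttributeNames": {"#ts": "epoch"},
    "ExpressionAttributeValues": {
        ":d": {"N": "1"},
        ":x": {"S": "5513A97C"},
        ":y": {"S": "5513A9DB"},
    },
}
```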
N:M relationships
Use a table and GSI with partition and sort key elements switched
Use Query API
Example: Given a user, find all games. Or given a game, find all users.

User-Games-Table (hash key: UserId, range key: GameId)
  UserId = bob : GameId = Game1
  UserId = fred : GameId = Game2
  UserId = bob : GameId = Game3

Game-Users-GSI (hash key: GameId, range key: UserId)
  GameId = Game1 : UserId = bob
  GameId = Game2 : UserId = fred
  GameId = Game3 : UserId = bob
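Both directions of the N:M lookup use the same Query API, one against the table and one against the GSI (a sketch; values illustrative):

```python
# User -> games: query the table by UserId.
games_for_user = {
    "TableName": "User-Games-Table",
    "KeyConditionExpression": "UserId = :u",
    "ExpressionAttributeValues": {":u": {"S": "bob"}},
}

# Game -> users: query the GSI with the key elements switched.
users_for_game = {
    "TableName": "User-Games-Table",
    "IndexName": "Game-Users-GSI",
    "KeyConditionExpression": "GameId = :g",
    "ExpressionAttributeValues": {":g": {"S": "Game1"}},
}
```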
Documents (JSON)
New data types (M, L, BOOL, NULL) introduced to support JSON
Document SDKs
• Simple programming model
• Conversion to/from JSON
• Java, JavaScript, Ruby, .NET
Cannot index (S, N) elements of a JSON object stored in M
• Only top-level table attributes can be used in LSIs and GSIs without Streams/Lambda
JavaScript   DynamoDB
string       S
number       N
boolean      BOOL
null         NULL
array        L
object       M
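The type mapping shows up directly in the low-level item format: a JSON document becomes nested M/L/BOOL/NULL attribute values (item shape and table name are illustrative):

```python
# A JSON-style document expressed as low-level DynamoDB attribute values.
item = {
    "Id": {"N": "101"},
    "Title": {"S": "Book 101 Title"},              # string  -> S
    "InStock": {"BOOL": True},                     # boolean -> BOOL
    "Discontinued": {"NULL": True},                # null    -> NULL
    "Tags": {"L": [{"S": "tech"}, {"S": "aws"}]},  # array   -> L
    "ProductReviews": {"M": {                      # object  -> M
        "FiveStar": {"L": [{"S": "Do yourself a favor and buy this."}]},
    }},
}
put_params = {"TableName": "ProductCatalog", "Item": item}
```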
Rich expressions
Projection expression to get just some of the attributes• Query/Get/Scan: ProductReviews.FiveStar[0]
ProductReviews: {
  FiveStar: [
    "Excellent! Can't recommend it highly enough! Buy it!",
    "Do yourself a favor and buy this."
  ],
  OneStar: [
    "Terrible product! Do not buy this."
  ]
}
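Fetching just that first five-star review is a GetItem with a projection expression (a sketch; the key value is illustrative):

```python
get_params = {
    "TableName": "ProductCatalog",
    "Key": {"Id": {"N": "101"}},
    # Document path: only the first element of ProductReviews.FiveStar
    # is returned, not the whole item.
    "ProjectionExpression": "ProductReviews.FiveStar[0]",
}
```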
Rich expressions
Filter expression
• Query/Scan: #VIEWS > :num
Update expression
• UpdateItem: set Replies = Replies + :num
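As request parameters, the two expressions look like this (table and key values are illustrative; "#VIEWS" is an alias because VIEWS is a DynamoDB reserved word):

```python
# Filter expression on a Scan: only items with more than :num views.
scan_params = {
    "TableName": "Thread",
    "FilterExpression": "#VIEWS > :num",
    "ExpressionAttributeNames": {"#VIEWS": "Views"},
    "ExpressionAttributeValues": {":num": {"N": "3"}},
}

# Update expression: increment the Replies counter atomically.
update_params = {
    "TableName": "Thread",
    "Key": {"Id": {"N": "101"}},
    "UpdateExpression": "set Replies = Replies + :num",
    "ExpressionAttributeValues": {":num": {"N": "1"}},
}
```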
Rich expressions
Conditional expression
• Put/Update/DeleteItem
• attribute_not_exists(#pr.FiveStar)
• attribute_exists(Pictures.RearView)

For a conditional put guarded by attribute_not_exists on the key:
1. DynamoDB first looks for an item whose primary key matches that of the item to be written.
2. If the search returns nothing, no partition key is present in the result, so attribute_not_exists succeeds and the write proceeds.
3. Otherwise, attribute_not_exists fails and the write is prevented.
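The put-if-absent pattern described above, as request parameters (a sketch; table and key names are illustrative):

```python
# Conditional put: succeeds only if no item with this primary key exists.
put_params = {
    "TableName": "ProductCatalog",
    "Item": {"Id": {"N": "101"}, "Title": {"S": "Book 101 Title"}},
    "ConditionExpression": "attribute_not_exists(Id)",
}
# If an item with Id = 101 already exists, DynamoDB raises
# ConditionalCheckFailedException instead of silently overwriting it.
```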
Scenario based Best Practice
Game logging
Storing time series data
Time series tables
Events_table_2015_April (current table)
  Event_id (partition key), Timestamp (sort key), Attribute1 ... Attribute N
  RCUs = 10000, WCUs = 10000

Events_table_2015_March (older table)
  Event_id (partition key), Timestamp (sort key), Attribute1 ... Attribute N
  RCUs = 1000, WCUs = 100

Events_table_2015_February (older table)
  Event_id (partition key), Timestamp (sort key), Attribute1 ... Attribute N
  RCUs = 100, WCUs = 1

Events_table_2015_January (older table)
  Event_id (partition key), Timestamp (sort key), Attribute1 ... Attribute N
  RCUs = 10, WCUs = 1

Hot data stays in the current table; cold data accumulates in the older tables.
Don’t mix hot and cold data; archive cold data to Amazon S3
Use a table per time period
• Pre-create daily, weekly, or monthly tables
• Provision the required throughput for the current table
• Writes go to the current table
• Turn off (or reduce) throughput for older tables
• Pre-create tables for heavy users and light users
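Routing writes to the current period's table is a one-line naming convention (a sketch; the Events_table_%Y_%B format mirrors the table names above):

```python
from datetime import date

def events_table_name(day: date) -> str:
    """Name of the monthly table that should receive writes for `day`."""
    return day.strftime("Events_table_%Y_%B")

print(events_table_name(date(2015, 4, 17)))  # Events_table_2015_April
```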
Item shop catalog
Popular items (read)
Partition 1: 2000 RCUs
Partition K: 2000 RCUs
Partition M: 2000 RCUs
Partition 50: 2000 RCUs
Scaling bottlenecks
(Diagram: gamers querying popular products A and B in the ItemShopCatalog table)
SELECT Id, Description, ... FROM ItemShopCatalog
(Chart: requests per second by item primary key; request distribution per partition key for DynamoDB requests)
(Diagram: users read the ItemShopCatalog table, partitions 1 and 2, through a cache of popular items in front of DynamoDB)
SELECT Id, Description, ... FROM ProductCatalog
(Chart: requests per second by item primary key; request distribution per partition key, split between DynamoDB requests and cache hits)
Multiplayer online gaming
Query filters vs. composite key indexes
Games Table
GameId  Date        Host   Opponent  Status
d9bl3   2014-10-02  David  Alice     DONE
72f49   2014-09-30  Alice  Bob       PENDING
o2pnb   2014-10-08  Bob    Carol     IN_PROGRESS
b932s   2014-10-03  Carol  Bob       PENDING
ef9ca   2014-10-03  David  Bob       IN_PROGRESS
Multiplayer online game data (GameId is the table's partition key)
Query for incoming game requests
DynamoDB indexes provide partition and sort keys
What about queries for two equalities and a range?

SELECT * FROM Game
WHERE Opponent = 'Bob'   -- (partition)
AND Status = 'PENDING'   -- (?)
ORDER BY Date DESC       -- (sort)
Secondary Index
Opponent  Date        GameId  Status       Host
Alice     2014-10-02  d9bl3   DONE         David
Carol     2014-10-08  o2pnb   IN_PROGRESS  Bob
Bob       2014-09-30  72f49   PENDING      Alice
Bob       2014-10-03  b932s   PENDING      Carol
Bob       2014-10-03  ef9ca   IN_PROGRESS  David
Approach 1: Query filter
Partition key: Opponent = 'Bob'; sort key: Date

SELECT * FROM Game
WHERE Opponent = 'Bob'
ORDER BY Date DESC
FILTER ON Status = 'PENDING'

(Bob's DONE and IN_PROGRESS rows are read, then filtered out)
Needle in a haystack
Use query filter
• Send back less data "on the wire"
• Simplify application code
• Simple SQL-like expressions: AND, OR, NOT, ()
Use when your index isn’t entirely selective
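Approach 1 as Query parameters (a sketch; the index name is illustrative, and "#st" aliases Status since STATUS is a reserved word). The filtered-out rows are still read and still consume read capacity:

```python
query_params = {
    "TableName": "Games",
    "IndexName": "Opponent-Date-index",      # illustrative GSI name
    "KeyConditionExpression": "Opponent = :opp",
    "FilterExpression": "#st = :status",     # applied after the read
    "ExpressionAttributeNames": {"#st": "Status"},
    "ExpressionAttributeValues": {
        ":opp": {"S": "Bob"},
        ":status": {"S": "PENDING"},
    },
    "ScanIndexForward": False,               # ORDER BY Date DESC
}
```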
Approach 2: composite key
Status       +  Date        =  StatusDate
DONE            2014-10-02     DONE_2014-10-02
IN_PROGRESS     2014-10-08     IN_PROGRESS_2014-10-08
IN_PROGRESS     2014-10-03     IN_PROGRESS_2014-10-03
PENDING         2014-09-30     PENDING_2014-09-30
PENDING         2014-10-03     PENDING_2014-10-03
Secondary Index
Approach 2: composite key
Partition key: Opponent; sort key: StatusDate

Opponent  StatusDate              GameId  Host
Alice     DONE_2014-10-02         d9bl3   David
Carol     IN_PROGRESS_2014-10-08  o2pnb   Bob
Bob       IN_PROGRESS_2014-10-03  ef9ca   David
Bob       PENDING_2014-09-30      72f49   Alice
Bob       PENDING_2014-10-03      b932s   Carol

SELECT * FROM Game
WHERE Opponent = 'Bob'
AND StatusDate BEGINS_WITH 'PENDING'
Needle in a sorted haystack
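Approach 2 in code: build the composite sort-key value at write time, then query with begins_with so only matching items are read (a sketch; the index name is illustrative):

```python
def composite_status_date(status: str, iso_date: str) -> str:
    """Build the StatusDate sort-key value, e.g. PENDING_2014-10-03."""
    return f"{status}_{iso_date}"

query_params = {
    "TableName": "Games",
    "IndexName": "Opponent-StatusDate-index",  # illustrative GSI name
    "KeyConditionExpression": "Opponent = :opp AND begins_with(StatusDate, :s)",
    "ExpressionAttributeValues": {
        ":opp": {"S": "Bob"},
        ":s": {"S": "PENDING"},
    },
}
```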
Sparse indexes
Game-scores-table
Id (Hash)  User   Game  Score  Date        Award
1          Bob    G1    1300   2012-12-23
2          Bob    G1    1450   2012-12-23
3          Jay    G1    1600   2012-12-24
4          Mary   G1    2000   2012-10-24  Champ
5          Ryan   G2    123    2012-03-10
6          Jones  G2    345    2012-03-20

Award-GSI
Award (Hash)  Id  User  Score
Champ         4   Mary  2000
Scan sparse hash GSIs
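Because only items carrying an Award attribute appear in the sparse GSI, a full scan of the index touches just the award winners (a sketch):

```python
# Scan the sparse GSI: returns only items that have the Award attribute.
scan_params = {
    "TableName": "Game-scores-table",
    "IndexName": "Award-GSI",
}
```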
Replace filters with indexes
Concatenate attributes to form useful secondary index keys (e.g., Status + Date)
Take advantage of sparse indexes
Use when you want to optimize a query as much as possible
Big data analyticswith DynamoDB
Transactional Data Processing
DynamoDB is well-suited for transactional processing:
• High concurrency
• Strong consistency
• Atomic updates of single items
• Conditional updates for de-dupe and optimistic concurrency
• Supports both key/value and JSON document schemas
• Capable of handling large table sizes with low-latency data access
Case 1: Store and Index Metadata for Objects Stored in Amazon S3
Case 1: Use Case
We have a large number of digital audio files stored in Amazon S3 and we want to make them searchable
→ Use DynamoDB as the primary data store for the metadata.
→ Index and query the metadata using Elasticsearch.
Case 1: Steps to Implement
1. Create a Lambda function that reads the metadata from the ID3 tag and inserts it into a DynamoDB table.
2. Enable S3 notifications on the S3 bucket storing the audio files.
3. Enable streams on the DynamoDB table.
4. Create a second Lambda function that takes the metadata in DynamoDB and indexes it using Elasticsearch.
5. Enable the stream as the event source for the Lambda function.
Case 1: Key Takeaways
DynamoDB + Elasticsearch = durable, scalable, highly available database with rich query capabilities.
Use Lambda functions to respond to events in both DynamoDB Streams and Amazon S3 without having to manage any underlying compute infrastructure.
Case 2 – Execute Queries Against Multiple Data Sources Using DynamoDB and Hive
Case 2: Use Case
We want to enrich our audio file metadata stored in DynamoDB with additional data from the Million Song dataset:
→ The Million Song dataset is stored in text files.
→ ID3 tag metadata is stored in DynamoDB.
→ Use Amazon EMR with Hive to join the two datasets together in a query.
Case 2: Steps to Implement
1. Spin up an Amazon EMR cluster with Hive.
2. Create an external Hive table using the DynamoDBStorageHandler.
3. Create an external Hive table using the Amazon S3 location of the text files containing the Million Song project metadata.
4. Create and run a Hive query that joins the two external tables together and writes the joined results out to Amazon S3.
5. Load the results from Amazon S3 into DynamoDB.
Case 2: Key Takeaways
Use Amazon EMR to quickly provision a Hadoop cluster with Hive and to tear it down when done.
Using Hive with DynamoDB allows items in DynamoDB tables to be queried and joined with data from a variety of sources.
Case 3 – Store and Analyze Sensor Data with DynamoDB and Amazon Redshift
Dashboard
Case 3: Use Case
A large number of sensors take readings at regular intervals. You need to aggregate the data from each reading into a data warehouse for analysis:
• Use Amazon Kinesis to ingest the raw sensor data.
• Store the sensor readings in DynamoDB for fast access and real-time dashboards.
• Store raw sensor readings in Amazon S3 for durability and backup.
• Load the data from Amazon S3 into Amazon Redshift using AWS Lambda.
Case 3: Steps to Implement
1. Create two Lambda functions to read data from the Amazon Kinesis stream.
2. Enable the Amazon Kinesis stream as an event source for each Lambda function.
3. Write data into DynamoDB in one of the Lambda functions.
4. Write data into Amazon S3 in the other Lambda function.
5. Use the aws-lambda-redshift-loader to load the data in Amazon S3 into Amazon Redshift in batches.
Case 3: Key Takeaways
Amazon Kinesis + Lambda + DynamoDB = scalable, durable, highly available solution for sensor data ingestion with very low operational overhead.
DynamoDB is well-suited for near-real-time queries of recent sensor data readings.
Amazon Redshift is well-suited for deeper analysis of sensor readings spanning longer time horizons and very large numbers of records.
Using Lambda to load data into Amazon Redshift provides a way to perform ETL at frequent intervals.
DynamoDB Streams
Stream of updates to a table
Asynchronous
Exactly once
Strictly ordered
• Per item
Highly durable
• Scales with the table
24-hour lifetime
Sub-second latency
DynamoDB Streams
View types, for UpdateItem (Name = John, Destination = Pluto) where the old Destination was Mars:
• Old image (before update): Name = John, Destination = Mars
• New image (after update): Name = John, Destination = Pluto
• Old and new images: Name = John, Destination = Mars; Name = John, Destination = Pluto
• Keys only: Name = John
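A stream consumer (for example a Lambda function) sees records shaped like this; with old and new images enabled it can diff the two (a sketch; the handler and record values are illustrative):

```python
def destinations_changed(record):
    """Return (old, new) Destination values from a MODIFY stream record."""
    old = record["dynamodb"].get("OldImage", {})
    new = record["dynamodb"].get("NewImage", {})
    return (old.get("Destination", {}).get("S"),
            new.get("Destination", {}).get("S"))

record = {
    "eventName": "MODIFY",
    "dynamodb": {
        "Keys": {"Name": {"S": "John"}},
        "OldImage": {"Name": {"S": "John"}, "Destination": {"S": "Mars"}},
        "NewImage": {"Name": {"S": "John"}, "Destination": {"S": "Pluto"}},
    },
}
print(destinations_changed(record))  # ('Mars', 'Pluto')
```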
DynamoDB Streams and Amazon Kinesis Client Library
(Diagram: a DynamoDB client application sends updates to a table with partitions 1-5; the table's stream is split into shards 1-4, each consumed by a KCL worker in an Amazon Kinesis Client Library application)
Cross-region replication
Open-source cross-region replication library
(Diagram: DynamoDB Streams from the master table in US East (N. Virginia) drive replicas in Asia Pacific (Sydney) and EU (Ireland))
DynamoDB Streams and AWS Lambda
Triggers
(Diagram: DynamoDB Streams notify a Lambda function of each change; the function updates derivative tables, Amazon CloudSearch, Amazon Elasticsearch, and Amazon ElastiCache)
Analytics with DynamoDB Streams
Collect and de-dupe data in DynamoDB
Aggregate data in-memory and flush periodically
Performing real-time aggregation and analytics
(Diagram: DynamoDB Streams feeding EMR, Amazon Redshift, DynamoDB derivative tables, and cross-region replication)
We look forward to your feedback!
https://www.awssummit.co.kr
Visit the mobile page and rate this session now to receive a souvenir after the event.
Share your impressions of the event on social media with the #AWSSummit hashtag.
The slides and session recordings will be shared soon on AWS Korea's official social channels.