Amazon DynamoDB for Developers (김일호) - AWS DB Day
TRANSCRIPT
DynamoDB for developers
김일호, Solutions Architect
Time : 15:10 – 16:30
Agenda
- Tip 1. DynamoDB Index (LSI, GSI)
- Tip 2. DynamoDB Scaling
- Tip 3. DynamoDB Data Modeling
- Scenario based Best Practice
- DynamoDB Streams
- Nexon use-case
Local secondary index (LSI)
- Alternate range key attribute
- Index is local to a hash key (or partition)

[Diagram: a table with hash key A1, range key A2, and attributes A3-A5, shown next to three LSIs on alternate range keys A3, A4, and A5, illustrating the projection types KEYS_ONLY, INCLUDE (A3), and ALL.]

10 GB max per hash key, i.e., LSIs limit the number of range keys!
Global secondary index (GSI)
- Alternate hash (+range) key
- Index is across all table hash keys (partitions)

[Diagram: a table with hash key A1 and attributes A2-A5, shown next to three GSIs: hash A5 / range A4 projecting INCLUDE (A3), hash A4 / range A5 projecting ALL, and hash A2 projecting KEYS_ONLY.]

RCUs/WCUs provisioned separately for GSIs
Online indexing
How do GSI updates work?

[Diagram: a client writes to the primary table; the global secondary index is updated asynchronously.]

If GSIs don't have enough write capacity, table writes will be throttled!
LSI or GSI?
- LSI can be modeled as a GSI
- If data size in an item collection > 10 GB, use GSI
- If eventual consistency is okay for your scenario, use GSI!
Tip 2. DynamoDB Scaling
Scaling
- Throughput
  - Provision any amount of throughput to a table
- Size
  - Add any number of items to a table
  - Max item size is 400 KB
  - LSIs limit the number of range keys due to the 10 GB limit
- Scaling is achieved through partitioning
Throughput
- Provisioned at the table level
  - Write capacity units (WCUs) are measured in 1 KB per second
  - Read capacity units (RCUs) are measured in 4 KB per second
  - RCUs measure strictly consistent reads
  - Eventually consistent reads cost 1/2 of consistent reads
- Read and write throughput limits are independent
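These unit rules translate directly into a small capacity calculator. A minimal sketch (function names are mine; the rounding follows the 1 KB write / 4 KB read rules just stated):

```python
import math

def wcus_for(item_size_kb: float, writes_per_sec: int) -> int:
    # Each write consumes ceil(item size / 1 KB) WCUs.
    return math.ceil(item_size_kb / 1.0) * writes_per_sec

def rcus_for(item_size_kb: float, reads_per_sec: int,
             eventually_consistent: bool = False) -> int:
    # Each strongly consistent read consumes ceil(item size / 4 KB) RCUs;
    # eventually consistent reads cost half as much.
    units = math.ceil(item_size_kb / 4.0) * reads_per_sec
    return math.ceil(units / 2) if eventually_consistent else units

# 3 KB items read 100 times/sec: strongly consistent needs 100 RCUs,
# eventually consistent only 50.
print(rcus_for(3, 100))        # 100
print(rcus_for(3, 100, True))  # 50
print(wcus_for(3, 10))         # 30
```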
Partitioning math

Number of partitions:
- By capacity: (Total RCU / 3000) + (Total WCU / 1000)
- By size: Total Size / 10 GB
- Total partitions: CEILING(MAX(capacity, size))
Partitioning example
Table size = 8 GB, RCUs = 5000, WCUs = 500

Number of partitions:
- By capacity: (5000 / 3000) + (500 / 1000) = 2.17
- By size: 8 / 10 = 0.8
- Total partitions: CEILING(MAX(2.17, 0.8)) = 3

RCUs per partition = 5000 / 3 = 1666.67
WCUs per partition = 500 / 3 = 166.67
Data per partition = 10 / 3 = 3.33 GB
RCUs and WCUs are uniformly spread across partitions
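The formulas above can be scripted to sanity-check a table's partition count; this sketch reproduces the worked example:

```python
import math

def partitions(total_rcu: int, total_wcu: int, size_gb: float) -> int:
    # Partition count per the slide's formulas: take the larger of the
    # capacity-based and size-based counts, rounded up.
    by_capacity = total_rcu / 3000 + total_wcu / 1000
    by_size = size_gb / 10
    return math.ceil(max(by_capacity, by_size))

# The worked example: an 8 GB table with 5000 RCUs and 500 WCUs.
n = partitions(5000, 500, 8)
print(n)                    # 3
print(round(5000 / n, 2))   # 1666.67 RCUs per partition
print(round(500 / n, 2))    # 166.67 WCUs per partition
```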
Allocation of partitions
- A partition split occurs when:
  - Increased provisioned throughput settings
  - Increased storage requirements
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
Example: hot keys
[Heat map: partitions vs. time, showing one continuously hot key.]

Example: periodic spike
[Heat map: partitions vs. time, showing periodic traffic spikes.]
Getting the most out of DynamoDB throughput
“To get the most out of DynamoDB throughput, create tables where the hash key element has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.”
—DynamoDB Developer Guide
- Space: access is evenly spread over the key-space
- Time: requests arrive evenly spaced in time
What causes throttling?
- Sustained throughput going beyond provisioned throughput per partition
  - Non-uniform workloads
  - Hot keys/hot partitions
  - Very large bursts
  - Mixing hot data with cold data (use a table per time period)
- From the example before:
  - Table created with 5000 RCUs, 500 WCUs
  - RCUs per partition = 1666.67
  - WCUs per partition = 166.67
  - If sustained throughput > (1666 RCUs or 166 WCUs) per key or partition, DynamoDB may throttle requests
  - Solution: increase provisioned throughput
Tip 3. DynamoDB Data Modeling
1:1 relationships or key-values
- Use a table or GSI with a hash key
- Use GetItem or BatchGetItem API
- Example: Given an SSN or license number, get attributes

Users Table:
  SSN = 123-45-6789 (hash key) → [email protected], License = TDL25478134
  SSN = 987-65-4321 (hash key) → [email protected], License = TDL78309234

Users-Email-GSI:
  License = TDL78309234 (hash key) → [email protected], SSN = 987-65-4321
  License = TDL25478134 (hash key) → [email protected], SSN = 123-45-6789
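As an illustration, the two lookups could be expressed as request parameters in the shape of the GetItem and Query APIs (plain dicts, attribute values left untyped for readability; note the GSI lookup has to be a Query, since GetItem addresses only the base table's primary key):

```python
def get_by_ssn(ssn: str) -> dict:
    # Key-value lookup on the base table by its hash key.
    return {"TableName": "Users", "Key": {"SSN": ssn}}

def get_by_license(license_no: str) -> dict:
    # The same lookup via the GSI keyed on License must be a Query.
    # "#lic" aliases the attribute name defensively via
    # ExpressionAttributeNames.
    return {
        "TableName": "Users",
        "IndexName": "Users-Email-GSI",
        "KeyConditionExpression": "#lic = :lic",
        "ExpressionAttributeNames": {"#lic": "License"},
        "ExpressionAttributeValues": {":lic": license_no},
    }

print(get_by_ssn("123-45-6789")["Key"])  # {'SSN': '123-45-6789'}
```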
1:N relationships or parent-children
- Use a table or GSI with hash and range key
- Use Query API
- Example: Given a device, find all readings between epoch X, Y

Device-measurements:
  DeviceId = 1 (hash), epoch = 5513A97C (range) → Temperature = 30, pressure = 90
  DeviceId = 1 (hash), epoch = 5513A9DB (range) → Temperature = 30, pressure = 90
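A sketch of the Query request for the epoch-range example (parameter names follow the DynamoDB Query API; values left untyped for readability):

```python
def readings_between(device_id: int, start_epoch: str, end_epoch: str) -> dict:
    # Hash key equality plus a BETWEEN condition on the range key.
    return {
        "TableName": "Device-measurements",
        "KeyConditionExpression":
            "DeviceId = :d AND epoch BETWEEN :x AND :y",
        "ExpressionAttributeValues": {
            ":d": device_id, ":x": start_epoch, ":y": end_epoch,
        },
    }

params = readings_between(1, "5513A97C", "5513A9DB")
print(params["KeyConditionExpression"])
```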
N:M relationships
- Use a table and GSI with hash and range key elements switched
- Use Query API
- Example: Given a user, find all games. Or given a game, find all users.

User-Games-Table:
  UserId = bob (hash), GameId = Game1 (range)
  UserId = fred (hash), GameId = Game2 (range)
  UserId = bob (hash), GameId = Game3 (range)

Game-Users-GSI:
  GameId = Game1 (hash), UserId = bob (range)
  GameId = Game2 (hash), UserId = fred (range)
  GameId = Game3 (hash), UserId = bob (range)
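Both directions of the lookup can be sketched as Query parameters; the only difference is the IndexName and which attribute anchors the key condition:

```python
def games_for_user(user_id: str) -> dict:
    # Forward direction: query the base table by UserId.
    return {
        "TableName": "User-Games-Table",
        "KeyConditionExpression": "UserId = :u",
        "ExpressionAttributeValues": {":u": user_id},
    }

def users_for_game(game_id: str) -> dict:
    # Reverse direction: query the GSI with the keys switched.
    return {
        "TableName": "User-Games-Table",
        "IndexName": "Game-Users-GSI",
        "KeyConditionExpression": "GameId = :g",
        "ExpressionAttributeValues": {":g": game_id},
    }

print(users_for_game("Game1")["IndexName"])  # Game-Users-GSI
```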
Documents (JSON)
- New data types (M, L, BOOL, NULL) introduced to support JSON
- Document SDKs
  - Simple programming model
  - Conversion to/from JSON
  - Java, JavaScript, Ruby, .NET
- Cannot index (S, N) elements of a JSON object stored in M
  - Only top-level table attributes can be used in LSIs and GSIs without Streams/Lambda

Type mapping (JavaScript → DynamoDB):
  string  → S
  number  → N
  boolean → BOOL
  null    → NULL
  array   → L
  object  → M
Rich expressions
- Projection expression to get just some of the attributes
  - Query/Get/Scan: ProductReviews.FiveStar[0]

ProductReviews: {
  FiveStar: [
    "Excellent! Can't recommend it highly enough! Buy it!",
    "Do yourself a favor and buy this."
  ],
  OneStar: [
    "Terrible product! Do not buy this."
  ]
}
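A sketch of a GetItem request using that projection expression (the ProductCatalog table name and Id key are assumptions for illustration):

```python
def first_five_star_review(product_id: str) -> dict:
    # GetItem parameters with a ProjectionExpression reaching into the
    # ProductReviews map: only the first FiveStar review comes back.
    return {
        "TableName": "ProductCatalog",   # hypothetical table name
        "Key": {"Id": product_id},
        "ProjectionExpression": "ProductReviews.FiveStar[0]",
    }

print(first_five_star_review("p1")["ProjectionExpression"])
```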
Rich expressions
- Filter expression
  - Query/Scan: #VIEWS > :num
- Update expression
  - UpdateItem: set Replies = Replies + :num
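Sketches of the two request shapes (the Thread table name is a hypothetical; #VIEWS is aliased via ExpressionAttributeNames, as the slide's own notation suggests the raw name collides with a reserved word):

```python
def popular_threads(min_views: int) -> dict:
    # Scan parameters with a server-side filter expression.
    return {
        "TableName": "Thread",              # hypothetical table name
        "FilterExpression": "#VIEWS > :num",
        "ExpressionAttributeNames": {"#VIEWS": "Views"},
        "ExpressionAttributeValues": {":num": min_views},
    }

def add_replies(thread_key: dict, count: int) -> dict:
    # UpdateItem parameters: arithmetic in an update expression.
    return {
        "TableName": "Thread",
        "Key": thread_key,
        "UpdateExpression": "set Replies = Replies + :num",
        "ExpressionAttributeValues": {":num": count},
    }

print(add_replies({"Id": "t1"}, 1)["UpdateExpression"])
```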
Rich expressions
- Conditional expression
  - Put/Update/DeleteItem
  - attribute_not_exists(#pr.FiveStar)
  - attribute_exists(Pictures.RearView)

How attribute_not_exists guards a write:
1. DynamoDB first looks for an item whose primary key matches that of the item to be written.
2. If the search returns nothing, no item with that key exists, and the write proceeds.
3. Otherwise, the attribute_not_exists function fails and the write is prevented.
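A conditional PutItem sketch implementing those three steps (table and key names assumed for illustration):

```python
def put_if_absent(item: dict) -> dict:
    # PutItem parameters that succeed only when no item with this primary
    # key exists yet: attribute_not_exists on the key attribute fails the
    # write if a matching item is found.
    return {
        "TableName": "ProductCatalog",   # hypothetical table name
        "Item": item,
        "ConditionExpression": "attribute_not_exists(Id)",
    }

print(put_if_absent({"Id": "1"})["ConditionExpression"])
```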
Scenario based Best Practice
Game logging
Storing time series data
Time series tables
[Diagram: monthly tables Events_table_2015_April (current) back through Events_table_2015_March, Events_table_2015_February, and Events_table_2015_January, each keyed on Event_id (hash key) and Timestamp (range key) with Attribute1 … AttributeN. The current table holding hot data is provisioned high (up to RCUs = 10000, WCUs = 10000); older tables holding cold data are dialed down (as low as RCUs = 10, WCUs = 1).]

Don't mix hot and cold data; archive cold data to Amazon S3.
Use a table per time period
- Pre-create daily, weekly, monthly tables
- Provision required throughput for current table
- Writes go to the current table
- Turn off (or reduce) throughput for older tables

Important when: dealing with time series data
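The routing logic for per-period tables is a one-liner. A sketch matching the Events_table_YYYY_Month naming above:

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def events_table_for(day: date) -> str:
    # Route each write to the monthly table for its date.
    return f"Events_table_{day.year}_{MONTHS[day.month - 1]}"

print(events_table_for(date(2015, 4, 15)))  # Events_table_2015_April
```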
Item shop catalog
Popular items (read)

Scaling bottlenecks
[Diagram: gamers all issue SELECT Id, Description, ... FROM ItemShopCatalog for Product A and Product B; each of the table's partitions provides 2000 RCUs, but the requests-per-second distribution per item primary key shows nearly all DynamoDB requests hitting a few hash keys.]
Cache popular items
[Diagram: the same workload with a cache between users and DynamoDB; the request distribution per hash key flattens as cache hits absorb the reads of popular items across the table's partitions.]
Multiplayer online gaming
Query filters vs. composite key indexes

Games Table (multiplayer online game data; hash key: GameId):
  GameId  Date        Host   Opponent  Status
  d9bl3   2014-10-02  David  Alice     DONE
  72f49   2014-09-30  Alice  Bob       PENDING
  o2pnb   2014-10-08  Bob    Carol     IN_PROGRESS
  b932s   2014-10-03  Carol  Bob       PENDING
  ef9ca   2014-10-03  David  Bob       IN_PROGRESS
Query for incoming game requests
- DynamoDB indexes provide hash and range
- What about queries for two equalities and a range?

SELECT * FROM Game
WHERE Opponent = 'Bob'
AND Status = 'PENDING'
ORDER BY Date DESC

Secondary index (hash key: Opponent, range key: Date):
  Opponent  Date        GameId  Status       Host
  Alice     2014-10-02  d9bl3   DONE         David
  Carol     2014-10-08  o2pnb   IN_PROGRESS  Bob
  Bob       2014-09-30  72f49   PENDING      Alice
  Bob       2014-10-03  b932s   PENDING      Carol
  Bob       2014-10-03  ef9ca   IN_PROGRESS  David
Approach 1: Query filter

SELECT * FROM Game
WHERE Opponent = 'Bob'
ORDER BY Date DESC
FILTER ON Status = 'PENDING'

Secondary index (hash key: Opponent = Bob, range key: Date):
  Opponent  Date        GameId  Status       Host
  Bob       2014-09-30  72f49   PENDING      Alice
  Bob       2014-10-03  b932s   PENDING      Carol
  Bob       2014-10-03  ef9ca   IN_PROGRESS  David   (filtered out)

Needle in a haystack
Use query filter
- Send back less data "on the wire"
- Simplify application code
- Simple SQL-like expressions: AND, OR, NOT, ()

Important when: your index isn't entirely selective
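Approach 1 as Query parameters (the index name is a hypothetical; #s aliases Status defensively; note the filter runs after the read, so filtered-out items still consume read capacity):

```python
def pending_games_for(opponent: str) -> dict:
    # Key condition on the index keys; FilterExpression applied
    # server-side after the items are read.
    return {
        "TableName": "Games",
        "IndexName": "Opponent-Date-index",  # hypothetical index name
        "KeyConditionExpression": "Opponent = :o",
        "FilterExpression": "#s = :s",
        "ExpressionAttributeNames": {"#s": "Status"},
        "ExpressionAttributeValues": {":o": opponent, ":s": "PENDING"},
        "ScanIndexForward": False,           # ORDER BY Date DESC
    }

print(pending_games_for("Bob")["FilterExpression"])  # #s = :s
```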
Approach 2: composite key

Status       +  Date        =  StatusDate
DONE            2014-10-02     DONE_2014-10-02
IN_PROGRESS     2014-10-08     IN_PROGRESS_2014-10-08
IN_PROGRESS     2014-10-03     IN_PROGRESS_2014-10-03
PENDING         2014-09-30     PENDING_2014-09-30
PENDING         2014-10-03     PENDING_2014-10-03
Approach 2: composite key

Secondary index (hash key: Opponent, range key: StatusDate):
  Opponent  StatusDate              GameId  Host
  Alice     DONE_2014-10-02         d9bl3   David
  Carol     IN_PROGRESS_2014-10-08  o2pnb   Bob
  Bob       IN_PROGRESS_2014-10-03  ef9ca   David
  Bob       PENDING_2014-09-30      72f49   Alice
  Bob       PENDING_2014-10-03      b932s   Carol

SELECT * FROM Game
WHERE Opponent = 'Bob'
AND StatusDate BEGINS_WITH 'PENDING'

Needle in a sorted haystack
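Approach 2 in code: build the composite attribute at write time, then query with begins_with so only matching items are read (the index name is a hypothetical):

```python
def status_date(status: str, date: str) -> str:
    # Build the composite range key used by the index above.
    return f"{status}_{date}"

def pending_games_for(opponent: str) -> dict:
    # begins_with on the composite key: no post-read filtering needed.
    return {
        "TableName": "Games",
        "IndexName": "Opponent-StatusDate-index",  # hypothetical index name
        "KeyConditionExpression":
            "Opponent = :o AND begins_with(StatusDate, :s)",
        "ExpressionAttributeValues": {":o": opponent, ":s": "PENDING"},
    }

print(status_date("PENDING", "2014-10-03"))  # PENDING_2014-10-03
```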
Sparse indexes

Game-scores-table:
  Id (hash)  User   Game  Score  Date        Award
  1          Bob    G1    1300   2012-12-23
  2          Bob    G1    1450   2012-12-23
  3          Jay    G1    1600   2012-12-24
  4          Mary   G1    2000   2012-10-24  Champ
  5          Ryan   G2    123    2012-03-10
  6          Jones  G2    345    2012-03-20

Award-GSI:
  Award (hash)  Id  User  Score
  Champ         4   Mary  2000

Scan sparse hash GSIs

Replace filter with indexes
- Concatenate attributes to form useful secondary index keys (e.g., Status + Date)
- Take advantage of sparse indexes

Important when: you want to optimize a query as much as possible
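A plain-Python model of why the Award-GSI is cheap to scan: items lacking the index's hash key never appear in it, so the index holds only the "needles":

```python
# A few rows from the Game-scores-table above (Date column omitted).
scores = [
    {"Id": 1, "User": "Bob",  "Score": 1300},
    {"Id": 4, "User": "Mary", "Score": 2000, "Award": "Champ"},
    {"Id": 5, "User": "Ryan", "Score": 123},
]

# Items without the GSI's hash key attribute are simply absent
# from the index.
award_gsi = [item for item in scores if "Award" in item]
print(award_gsi)  # [{'Id': 4, 'User': 'Mary', 'Score': 2000, 'Award': 'Champ'}]
```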
Big data analytics with DynamoDB

Transactional Data Processing
DynamoDB is well-suited for transactional processing:
- High concurrency
- Strong consistency
- Atomic updates of single items
- Conditional updates for de-dupe and optimistic concurrency
- Supports both key/value and JSON document schema
- Capable of handling large table sizes with low latency data access
Case 1: Store and Index Metadata for Objects Stored in Amazon S3
Case 1: Use Case
We have a large number of digital audio files stored in Amazon S3 and we want to make them searchable:
- Use DynamoDB as the primary data store for the metadata.
- Index and query the metadata using Elasticsearch.
Case 1: Steps to Implement
1. Create a Lambda function that reads the metadata from the ID3 tag and inserts it into a DynamoDB table.
2. Enable S3 notifications on the S3 bucket storing the audio files.
3. Enable streams on the DynamoDB table.
4. Create a second Lambda function that takes the metadata in DynamoDB and indexes it using Elasticsearch.
5. Enable the stream as the event source for the Lambda function.
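The glue in step 1 can be sketched as a helper that unpacks the S3 notification event Lambda receives (bucket and key names here are illustrative; the ID3 parsing and DynamoDB write are omitted):

```python
def s3_objects_from_event(event: dict) -> list:
    # Pull (bucket, key) pairs out of an S3 notification event, the input
    # the first Lambda function receives.
    return [
        (rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"])
        for rec in event["Records"]
    ]

event = {"Records": [{"s3": {"bucket": {"name": "audio-files"},
                             "object": {"key": "song.mp3"}}}]}
print(s3_objects_from_event(event))  # [('audio-files', 'song.mp3')]
```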
Case 1: Key Takeaways
- DynamoDB + Elasticsearch = Durable, scalable, highly-available database with rich query capabilities.
- Use Lambda functions to respond to events in both DynamoDB Streams and Amazon S3 without having to manage any underlying compute infrastructure.
Case 2: Execute Queries Against Multiple Data Sources Using DynamoDB and Hive

Case 2: Use Case
We want to enrich our audio file metadata stored in DynamoDB with additional data from the Million Song dataset:
- Million Song dataset is stored in text files.
- ID3 tag metadata is stored in DynamoDB.
- Use Amazon EMR with Hive to join the two datasets together in a query.
Case 2: Steps to Implement
1. Spin up an Amazon EMR cluster with Hive.
2. Create an external Hive table using the DynamoDBStorageHandler.
3. Create an external Hive table using the Amazon S3 location of the text files containing the Million Song project metadata.
4. Create and run a Hive query that joins the two external tables together and writes the joined results out to Amazon S3.
5. Load the results from Amazon S3 into DynamoDB.
Case 2: Key Takeaways
- Use Amazon EMR to quickly provision a Hadoop cluster with Hive and to tear it down when done.
- Use of Hive with DynamoDB allows items in DynamoDB tables to be queried/joined with data from a variety of sources.
Case 3: Store and Analyze Sensor Data with DynamoDB and Amazon Redshift

Case 3: Use Case
A large number of sensors are taking readings at regular intervals. You need to aggregate the data from each reading into a data warehouse for analysis:
- Use Amazon Kinesis to ingest the raw sensor data.
- Store the sensor readings in DynamoDB for fast access and real-time dashboards.
- Store raw sensor readings in Amazon S3 for durability and backup.
- Load the data from Amazon S3 into Amazon Redshift using AWS Lambda.
Case 3: Steps to Implement
1. Create two Lambda functions to read data from the Amazon Kinesis stream.
2. Enable the Amazon Kinesis stream as an event source for each Lambda function.
3. Write data into DynamoDB in one of the Lambda functions.
4. Write data into Amazon S3 in the other Lambda function.
5. Use the aws-lambda-redshift-loader to load the data in Amazon S3 into Amazon Redshift in batches.
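Both Lambda functions in steps 1-4 start the same way: decoding the base64-encoded payloads that Kinesis hands to Lambda. A sketch (the sensor reading shown is made up):

```python
import base64
import json

def payloads_from_kinesis_event(event: dict) -> list:
    # Kinesis delivers record data to Lambda base64-encoded; decode each
    # record back into its original JSON payload.
    return [
        json.loads(base64.b64decode(rec["kinesis"]["data"]))
        for rec in event["Records"]
    ]

reading = {"sensor": "s-17", "temp": 21.5}
event = {"Records": [{"kinesis": {
    "data": base64.b64encode(json.dumps(reading).encode()).decode()}}]}
print(payloads_from_kinesis_event(event))  # [{'sensor': 's-17', 'temp': 21.5}]
```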
Case 3: Key Takeaways
- Amazon Kinesis + Lambda + DynamoDB = Scalable, durable, highly available solution for sensor data ingestion with very low operational overhead.
- DynamoDB is well-suited for near-realtime queries of recent sensor data readings.
- Amazon Redshift is well-suited for deeper analysis of sensor data readings spanning longer time horizons and very large numbers of records.
- Using Lambda to load data into Amazon Redshift provides a way to perform ETL in frequent intervals.
DynamoDB Streams
DynamoDB Streams
- Stream of updates to a table
- Asynchronous
- Exactly once
- Strictly ordered (per item)
- Highly durable (scales with the table)
- 24-hour lifetime
- Sub-second latency
View types

For UpdateItem (Name = John, Destination = Pluto), where the previous item was (Name = John, Destination = Mars), the stream record contains:
- Old image (before update): Name = John, Destination = Mars
- New image (after update): Name = John, Destination = Pluto
- Old and new images: both of the above
- Keys only: Name = John
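The view types map onto stream record contents roughly as in this sketch (the StreamViewType values and Keys/NewImage/OldImage field names follow the DynamoDB Streams API; the helper itself is illustrative):

```python
def stream_record(view_type: str, keys: dict, old: dict, new: dict) -> dict:
    # Model of the 'dynamodb' section of a stream record for each view type.
    body = {"Keys": keys, "StreamViewType": view_type}
    if view_type in ("NEW_IMAGE", "NEW_AND_OLD_IMAGES"):
        body["NewImage"] = new
    if view_type in ("OLD_IMAGE", "NEW_AND_OLD_IMAGES"):
        body["OldImage"] = old
    return body

rec = stream_record("NEW_AND_OLD_IMAGES",
                    {"Name": "John"},
                    {"Name": "John", "Destination": "Mars"},
                    {"Name": "John", "Destination": "Pluto"})
print(sorted(rec))  # ['Keys', 'NewImage', 'OldImage', 'StreamViewType']
```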
DynamoDB Streams and Amazon Kinesis Client Library
[Diagram: a table's partitions (1-5) map onto stream shards (1-4); KCL workers in an Amazon Kinesis Client Library application consume the updates on behalf of a DynamoDB client application.]
Cross-region replication
DynamoDB Streams + the open source Cross-Region Replication Library
[Diagram: a US East (N. Virginia) table replicating to Asia Pacific (Sydney) and EU (Ireland) replicas.]
DynamoDB Streams and AWS Lambda
Triggers
[Diagram: a Lambda function is notified of each change and fans out to derivative tables, Amazon CloudSearch, Amazon Elasticsearch, and Amazon ElastiCache.]
Analytics with DynamoDB Streams
- Collect and de-dupe data in DynamoDB
- Aggregate data in-memory and flush periodically
- Perform real-time aggregation and analytics
[Diagram: DynamoDB feeding EMR and Amazon Redshift, plus cross-region replication.]
Nexon use-case
HIT chose DynamoDB
- HIT Main Game DB
  - 33 Tables
  - CloudWatch Alarms
  - Flexible Dashboards

Architecture
- The key is that all gaming users directly access DynamoDB
- 170,000 concurrent users
- Main game DB (33 tables, NRT)
- Game, PvP, Raid, etc.
- Unreal Engine + AWS .NET SDK
Main DynamoDB tables
- Mail Table
- Jewel Table
- Friend Table
- Account Table
- Character Table
- Skills Table
- Mission Table
- Item Table
- Your idea ☺
Your application with DynamoDB
[Diagram: many tables backing your application.]
Thank you!