Amazon DynamoDB Tips Every Developer Should Know :: 김일호 :: AWS Summit Seoul 2016
TRANSCRIPT
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
김일호, Solutions Architect
05-17-2016
Amazon DynamoDB Tips Every Developer Should Know
Agenda
Tip 1. DynamoDB Index (LSI, GSI)
Tip 2. DynamoDB Scaling
Tip 3. DynamoDB Data Modeling
Scenario based Best Practice
DynamoDB Streams
Tip 1. DynamoDB Index (LSI, GSI)
Local secondary index (LSI)
Alternate sort (= range) key attribute
Index is local to a partition (= hash) key
Table
  A1 (partition), A2 (sort), A3, A4, A5

LSIs
  KEYS_ONLY:  A1 (partition), A3 (sort), A2 (table key)
  INCLUDE A3: A1 (partition), A4 (sort), A2 (table key), A3 (projected)
  ALL:        A1 (partition), A5 (sort), A2 (table key), A3, A4 (projected)
10 GB max per hash key, i.e. LSIs limit the # of range keys!
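The LSI layout above can be sketched as boto3 CreateTable parameters (a minimal sketch; the table and index names are illustrative, and only the KEYS_ONLY index is shown):

```python
# CreateTable parameters for a table keyed on A1/A2 with one LSI that uses
# A3 as an alternate sort key and projects only the keys. You would pass
# this dict to boto3's DynamoDB client as client.create_table(**params).
params = {
    "TableName": "ExampleTable",  # illustrative name
    "AttributeDefinitions": [
        {"AttributeName": "A1", "AttributeType": "S"},
        {"AttributeName": "A2", "AttributeType": "S"},
        {"AttributeName": "A3", "AttributeType": "N"},
    ],
    "KeySchema": [
        {"AttributeName": "A1", "KeyType": "HASH"},   # partition key
        {"AttributeName": "A2", "KeyType": "RANGE"},  # sort key
    ],
    "LocalSecondaryIndexes": [{
        "IndexName": "A3-LSI",
        "KeySchema": [
            {"AttributeName": "A1", "KeyType": "HASH"},   # same partition key as the table
            {"AttributeName": "A3", "KeyType": "RANGE"},  # alternate sort key
        ],
        "Projection": {"ProjectionType": "KEYS_ONLY"},
    }],
    "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
}
```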
Global secondary index (GSI)
Alternate partition key
Index spans all of the table's partition keys
Table
  A1 (partition), A2, A3, A4, A5

GSIs
  INCLUDE A3: A5 (partition), A4 (sort), A1 (table key), A3 (projected)
  ALL:        A4 (partition), A5 (sort), A1 (table key), A2, A3 (projected)
  KEYS_ONLY:  A2 (partition), A1 (table key)
RCUs/WCUs provisioned separately for GSIs
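Because GSI capacity is separate from the table's, each GlobalSecondaryIndexes entry carries its own ProvisionedThroughput. A sketch of the INCLUDE-A3 index above (index name and capacity numbers are illustrative):

```python
# One GlobalSecondaryIndexes entry for create_table/update_table: A5 as the
# alternate partition key, A4 as sort key, A3 projected in. The RCUs/WCUs
# here are provisioned for the index itself, separate from the table's.
gsi = {
    "IndexName": "A5-A4-GSI",
    "KeySchema": [
        {"AttributeName": "A5", "KeyType": "HASH"},   # alternate partition key
        {"AttributeName": "A4", "KeyType": "RANGE"},
    ],
    "Projection": {"ProjectionType": "INCLUDE", "NonKeyAttributes": ["A3"]},
    "ProvisionedThroughput": {"ReadCapacityUnits": 100, "WriteCapacityUnits": 50},
}
```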
Online indexing
How do GSI updates work?
(Diagram: a client writes to the primary table; the global secondary index is updated afterward)
1. Synchronous write to the primary table
2. Asynchronous update to the global secondary index (in progress)
If GSIs don’t have enough write capacity, table writes will be throttled!
LSI or GSI?
An LSI can be modeled as a GSI
If data size in an item collection > 10 GB, use a GSI
If eventual consistency is okay for your scenario, use GSI!
Tip 2. DynamoDB Scaling
Scaling
Throughput
• Provision any amount of throughput to a table

Size
• Add any number of items to a table
• Max item size is 400 KB
• LSIs limit the number of range keys due to the 10 GB limit
Scaling is achieved through partitioning
Throughput
Provisioned at the table level
• Write capacity units (WCUs) are measured in 1 KB per second
• Read capacity units (RCUs) are measured in 4 KB per second
• RCUs measure strictly consistent reads
• Eventually consistent reads cost 1/2 of consistent reads
Read and write throughput limits are independent
Partitioning math
Number of Partitions
By Capacity: (Total RCU / 3000) + (Total WCU / 1000)
By Size: Total Size / 10 GB
Total Partitions: CEILING(MAX(Capacity, Size))
Partitioning example
Table size = 8 GB, RCUs = 5000, WCUs = 500
RCUs per partition = 5000 / 3 = 1666.67
WCUs per partition = 500 / 3 = 166.67
Data per partition = 8 / 3 = 2.67 GB
RCUs and WCUs are uniformly spread across partitions
Number of Partitions
By Capacity: (5000 / 3000) + (500 / 1000) = 2.17
By Size: 8 / 10 = 0.8
Total Partitions: CEILING(MAX(2.17, 0.8)) = 3
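The partitioning math above is easy to script (a sketch of the 2016-era guidance; actual partition counts are internal to DynamoDB):

```python
import math

def partition_count(rcu, wcu, size_gb):
    """Estimate partition count: CEILING(MAX(capacity, size))."""
    by_capacity = rcu / 3000 + wcu / 1000
    by_size = size_gb / 10
    return math.ceil(max(by_capacity, by_size))

p = partition_count(rcu=5000, wcu=500, size_gb=8)
print(p)         # 3
print(5000 / p)  # RCUs per partition, ~1666.67
print(500 / p)   # WCUs per partition, ~166.67
```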
Allocation of partitions
A partition split occurs when:
• Provisioned throughput settings are increased
• Storage requirements increase
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
Example: hot keys
(Heat map: request heat by partition over time, showing a hot key)
Example: periodic spike
Getting the most out of DynamoDB throughput
“To get the most out of DynamoDB throughput, create tables where the partition key element has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.”
—DynamoDB Developer Guide
Space: access is evenly spread over the key-space
Time: requests arrive evenly spaced in time
What causes throttling?
If sustained throughput goes beyond provisioned throughput per partition
Non-uniform workloads
• Hot keys/hot partitions
• Very large bursts
Mixing hot data with cold data
• Use a table per time period
From the example before:
• Table created with 5000 RCUs, 500 WCUs
• RCUs per partition = 1666.67
• WCUs per partition = 166.67
• If sustained throughput > (1666 RCUs or 166 WCUs) per key or partition, DynamoDB may throttle requests
• Solution: increase provisioned throughput
Tip 3. DynamoDB Data Modeling
1:1 relationships or key-values
Use a table or GSI with a partition key
Use GetItem or BatchGetItem API
Example: Given an SSN or license number, get attributes
Users Table
  SSN = 123-45-6789 : Email = [email protected], License = TDL25478134
  SSN = 987-65-4321 : Email = [email protected], License = TDL78309234

Users-Email-GSI
  License = TDL25478134 : Email = [email protected], SSN = 123-45-6789
  License = TDL78309234 : Email = [email protected], SSN = 987-65-4321
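The two lookups above map to request parameters like these (a sketch in the low-level attribute-value format; note that reads through a GSI use the Query API rather than GetItem):

```python
# GetItem by table key (SSN), per the Users table above.
get_by_ssn = {
    "TableName": "Users",
    "Key": {"SSN": {"S": "123-45-6789"}},
}

# Lookup by license number goes through the GSI, which supports Query only.
query_by_license = {
    "TableName": "Users",
    "IndexName": "Users-Email-GSI",
    "KeyConditionExpression": "License = :lic",
    "ExpressionAttributeValues": {":lic": {"S": "TDL25478134"}},
}
```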
1:N relationships or parent-children
Use a table or GSI with partition and sort key
Use Query API
Example: Given a device, find all readings between epoch X and Y

Device-measurements
  DeviceId = 1, epoch = 5513A97C : Temperature = 30, Pressure = 90
  DeviceId = 1, epoch = 5513A9DB : Temperature = 30, Pressure = 90
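The device-readings query above can be sketched as Query parameters (values illustrative; "#ts" is an expression attribute name alias for epoch):

```python
# Query: all readings for DeviceId = 1 between two epoch values X and Y.
query_params = {
    "TableName": "Device-measurements",
    "KeyConditionExpression": "DeviceId = :d AND #ts BETWEEN :x AND :y",
    "ExpressionAttributeNames": {"#ts": "epoch"},
    "ExpressionAttributeValues": {
        ":d": {"N": "1"},
        ":x": {"S": "5513A97C"},
        ":y": {"S": "5513A9DB"},
    },
}
```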
N:M relationships
Use a table and GSI with partition and sort key elements switched
Use Query API
Example: Given a user, find all games. Or given a game, find all users.

User-Games-Table (hash key: UserId, range key: GameId)
  UserId = bob : GameId = Game1
  UserId = fred : GameId = Game2
  UserId = bob : GameId = Game3

Game-Users-GSI (hash key: GameId, range key: UserId)
  GameId = Game1 : UserId = bob
  GameId = Game2 : UserId = fred
  GameId = Game3 : UserId = bob
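Both directions of the N:M lookup use the same Query API, one against the table and one against the GSI (a sketch; values illustrative):

```python
# User -> games: query the table by UserId.
games_for_user = {
    "TableName": "User-Games-Table",
    "KeyConditionExpression": "UserId = :u",
    "ExpressionAttributeValues": {":u": {"S": "bob"}},
}

# Game -> users: query the GSI with the key elements switched.
users_for_game = {
    "TableName": "User-Games-Table",
    "IndexName": "Game-Users-GSI",
    "KeyConditionExpression": "GameId = :g",
    "ExpressionAttributeValues": {":g": {"S": "Game1"}},
}
```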
Documents (JSON)
New data types (M, L, BOOL, NULL) introduced to support JSON
Document SDKs
• Simple programming model
• Conversion to/from JSON
• Java, JavaScript, Ruby, .NET
Cannot index (S, N) elements of a JSON object stored in M
• Only top-level table attributes can be used in LSIs and GSIs without Streams/Lambda
JavaScript   DynamoDB
string       S
number       N
boolean      BOOL
null         NULL
array        L
object       M
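The type mapping shows up directly in the low-level item format: a JSON document becomes nested M/L/BOOL/NULL attribute values (item shape and table name are illustrative):

```python
# A JSON-style document expressed as low-level DynamoDB attribute values.
item = {
    "Id": {"N": "101"},
    "Title": {"S": "Book 101 Title"},              # string  -> S
    "InStock": {"BOOL": True},                     # boolean -> BOOL
    "Discontinued": {"NULL": True},                # null    -> NULL
    "Tags": {"L": [{"S": "tech"}, {"S": "aws"}]},  # array   -> L
    "ProductReviews": {"M": {                      # object  -> M
        "FiveStar": {"L": [{"S": "Do yourself a favor and buy this."}]},
    }},
}
put_params = {"TableName": "ProductCatalog", "Item": item}
```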
Rich expressions
Projection expression to get just some of the attributes• Query/Get/Scan: ProductReviews.FiveStar[0]
ProductReviews: {
  FiveStar: [
    "Excellent! Can't recommend it highly enough! Buy it!",
    "Do yourself a favor and buy this."
  ],
  OneStar: [
    "Terrible product! Do not buy this."
  ]
}
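Fetching just that first five-star review is a GetItem with a projection expression (a sketch; the key value is illustrative):

```python
get_params = {
    "TableName": "ProductCatalog",
    "Key": {"Id": {"N": "101"}},
    # Document path: only the first element of ProductReviews.FiveStar
    # is returned, not the whole item.
    "ProjectionExpression": "ProductReviews.FiveStar[0]",
}
```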
Rich expressions
Filter expression
• Query/Scan: #VIEWS > :num
Update expression
• UpdateItem: set Replies = Replies + :num
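As request parameters, the two expressions look like this (table and key values are illustrative; "#VIEWS" is an alias because VIEWS is a DynamoDB reserved word):

```python
# Filter expression on a Scan: only items with more than :num views.
scan_params = {
    "TableName": "Thread",
    "FilterExpression": "#VIEWS > :num",
    "ExpressionAttributeNames": {"#VIEWS": "Views"},
    "ExpressionAttributeValues": {":num": {"N": "3"}},
}

# Update expression: increment the Replies counter atomically.
update_params = {
    "TableName": "Thread",
    "Key": {"Id": {"N": "101"}},
    "UpdateExpression": "set Replies = Replies + :num",
    "ExpressionAttributeValues": {":num": {"N": "1"}},
}
```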
Rich expressions
Conditional expression
• Put/Update/DeleteItem
• attribute_not_exists(#pr.FiveStar)
• attribute_exists(Pictures.RearView)

For a conditional put guarded by attribute_not_exists on the key:
1. DynamoDB first looks for an item whose primary key matches that of the item to be written.
2. If the search returns nothing, no partition key is present in the result, so attribute_not_exists succeeds and the write proceeds.
3. Otherwise, attribute_not_exists fails and the write is prevented.
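The put-if-absent pattern described above, as request parameters (a sketch; table and key names are illustrative):

```python
# Conditional put: succeeds only if no item with this primary key exists.
put_params = {
    "TableName": "ProductCatalog",
    "Item": {"Id": {"N": "101"}, "Title": {"S": "Book 101 Title"}},
    "ConditionExpression": "attribute_not_exists(Id)",
}
# If an item with Id = 101 already exists, DynamoDB raises
# ConditionalCheckFailedException instead of silently overwriting it.
```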
Scenario based Best Practice
Game logging
Storing time series data
Time series tables
Events_table_2015_April (current table)
  Event_id (partition key), Timestamp (sort key), Attribute1 ... Attribute N
  RCUs = 10000, WCUs = 10000

Events_table_2015_March (older table)
  Event_id (partition key), Timestamp (sort key), Attribute1 ... Attribute N
  RCUs = 1000, WCUs = 100

Events_table_2015_February (older table)
  Event_id (partition key), Timestamp (sort key), Attribute1 ... Attribute N
  RCUs = 100, WCUs = 1

Events_table_2015_January (older table)
  Event_id (partition key), Timestamp (sort key), Attribute1 ... Attribute N
  RCUs = 10, WCUs = 1

Hot data stays in the current table; cold data accumulates in the older tables.
Don’t mix hot and cold data; archive cold data to Amazon S3
Use a table per time period
• Pre-create daily, weekly, or monthly tables
• Provision the required throughput for the current table
• Writes go to the current table
• Turn off (or reduce) throughput for older tables
• Pre-create tables for heavy users and light users
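Routing writes to the current period's table is a one-line naming convention (a sketch; the Events_table_%Y_%B format mirrors the table names above):

```python
from datetime import date

def events_table_name(day: date) -> str:
    """Name of the monthly table that should receive writes for `day`."""
    return day.strftime("Events_table_%Y_%B")

print(events_table_name(date(2015, 4, 17)))  # Events_table_2015_April
```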
Item shop catalog
Popular items (read)
Partition 1: 2000 RCUs
Partition K: 2000 RCUs
Partition M: 2000 RCUs
Partition 50: 2000 RCUs
Scaling bottlenecks
(Diagram: gamers querying popular products A and B in the ItemShopCatalog table)
SELECT Id, Description, ... FROM ItemShopCatalog
(Chart: requests per second by item primary key; request distribution per partition key for DynamoDB requests)
(Diagram: users read the ItemShopCatalog table, partitions 1 and 2, through a cache of popular items in front of DynamoDB)
SELECT Id, Description, ... FROM ProductCatalog
(Chart: requests per second by item primary key; request distribution per partition key, split between DynamoDB requests and cache hits)
Multiplayer online gaming
Query filters vs. composite key indexes
Games Table
GameId  Date        Host   Opponent  Status
d9bl3   2014-10-02  David  Alice     DONE
72f49   2014-09-30  Alice  Bob       PENDING
o2pnb   2014-10-08  Bob    Carol     IN_PROGRESS
b932s   2014-10-03  Carol  Bob       PENDING
ef9ca   2014-10-03  David  Bob       IN_PROGRESS
Multiplayer online game data (GameId is the table's partition key)
Query for incoming game requests
DynamoDB indexes provide partition and sort keys
What about queries for two equalities and a range?

SELECT * FROM Game
WHERE Opponent = 'Bob'   -- (partition)
AND Status = 'PENDING'   -- (?)
ORDER BY Date DESC       -- (sort)
Secondary Index
Opponent  Date        GameId  Status       Host
Alice     2014-10-02  d9bl3   DONE         David
Carol     2014-10-08  o2pnb   IN_PROGRESS  Bob
Bob       2014-09-30  72f49   PENDING      Alice
Bob       2014-10-03  b932s   PENDING      Carol
Bob       2014-10-03  ef9ca   IN_PROGRESS  David
Approach 1: Query filter
Partition key: Opponent = 'Bob'; sort key: Date

SELECT * FROM Game
WHERE Opponent = 'Bob'
ORDER BY Date DESC
FILTER ON Status = 'PENDING'

(Bob's DONE and IN_PROGRESS rows are read, then filtered out)
Needle in a haystack
Use query filter
• Send back less data "on the wire"
• Simplify application code
• Simple SQL-like expressions: AND, OR, NOT, ()
Use when your index isn’t entirely selective
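Approach 1 as Query parameters (a sketch; the index name is illustrative, and "#st" aliases Status since STATUS is a reserved word). The filtered-out rows are still read and still consume read capacity:

```python
query_params = {
    "TableName": "Games",
    "IndexName": "Opponent-Date-index",      # illustrative GSI name
    "KeyConditionExpression": "Opponent = :opp",
    "FilterExpression": "#st = :status",     # applied after the read
    "ExpressionAttributeNames": {"#st": "Status"},
    "ExpressionAttributeValues": {
        ":opp": {"S": "Bob"},
        ":status": {"S": "PENDING"},
    },
    "ScanIndexForward": False,               # ORDER BY Date DESC
}
```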
Approach 2: composite key
Status       +  Date        =  StatusDate
DONE            2014-10-02     DONE_2014-10-02
IN_PROGRESS     2014-10-08     IN_PROGRESS_2014-10-08
IN_PROGRESS     2014-10-03     IN_PROGRESS_2014-10-03
PENDING         2014-09-30     PENDING_2014-09-30
PENDING         2014-10-03     PENDING_2014-10-03
Secondary Index
Approach 2: composite key
Partition key: Opponent; sort key: StatusDate

Opponent  StatusDate              GameId  Host
Alice     DONE_2014-10-02         d9bl3   David
Carol     IN_PROGRESS_2014-10-08  o2pnb   Bob
Bob       IN_PROGRESS_2014-10-03  ef9ca   David
Bob       PENDING_2014-09-30      72f49   Alice
Bob       PENDING_2014-10-03      b932s   Carol

SELECT * FROM Game
WHERE Opponent = 'Bob'
AND StatusDate BEGINS_WITH 'PENDING'
Needle in a sorted haystack
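Approach 2 in code: build the composite sort-key value at write time, then query with begins_with so only matching items are read (a sketch; the index name is illustrative):

```python
def composite_status_date(status: str, iso_date: str) -> str:
    """Build the StatusDate sort-key value, e.g. PENDING_2014-10-03."""
    return f"{status}_{iso_date}"

query_params = {
    "TableName": "Games",
    "IndexName": "Opponent-StatusDate-index",  # illustrative GSI name
    "KeyConditionExpression": "Opponent = :opp AND begins_with(StatusDate, :s)",
    "ExpressionAttributeValues": {
        ":opp": {"S": "Bob"},
        ":s": {"S": "PENDING"},
    },
}
```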
Sparse indexes
Game-scores-table
Id (Hash)  User   Game  Score  Date        Award
1          Bob    G1    1300   2012-12-23
2          Bob    G1    1450   2012-12-23
3          Jay    G1    1600   2012-12-24
4          Mary   G1    2000   2012-10-24  Champ
5          Ryan   G2    123    2012-03-10
6          Jones  G2    345    2012-03-20

Award-GSI
Award (Hash)  Id  User  Score
Champ         4   Mary  2000
Scan sparse hash GSIs
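Because only items carrying an Award attribute appear in the sparse GSI, a full scan of the index touches just the award winners (a sketch):

```python
# Scan the sparse GSI: returns only items that have the Award attribute.
scan_params = {
    "TableName": "Game-scores-table",
    "IndexName": "Award-GSI",
}
```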
Replace filters with indexes
Concatenate attributes to form useful secondary index keys (e.g., Status + Date)
Take advantage of sparse indexes
Use when you want to optimize a query as much as possible
Big data analyticswith DynamoDB
Transactional Data Processing
DynamoDB is well-suited for transactional processing:
• High concurrency
• Strong consistency
• Atomic updates of single items
• Conditional updates for de-dupe and optimistic concurrency
• Supports both key/value and JSON document schemas
• Capable of handling large table sizes with low-latency data access
Case 1: Store and Index Metadata for Objects Stored in Amazon S3
Case 1: Use Case
We have a large number of digital audio files stored in Amazon S3 and we want to make them searchable
→ Use DynamoDB as the primary data store for the metadata.
→ Index and query the metadata using Elasticsearch.
Case 1: Steps to Implement
1. Create a Lambda function that reads the metadata from the ID3 tag and inserts it into a DynamoDB table.
2. Enable S3 notifications on the S3 bucket storing the audio files.
3. Enable streams on the DynamoDB table.
4. Create a second Lambda function that takes the metadata in DynamoDB and indexes it using Elasticsearch.
5. Enable the stream as the event source for the Lambda function.
Case 1: Key Takeaways
DynamoDB + Elasticsearch = durable, scalable, highly available database with rich query capabilities.
Use Lambda functions to respond to events in both DynamoDB Streams and Amazon S3 without having to manage any underlying compute infrastructure.
Case 2 – Execute Queries Against Multiple Data Sources Using DynamoDB and Hive
Case 2: Use Case
We want to enrich our audio file metadata stored in DynamoDB with additional data from the Million Song dataset:
→ The Million Song dataset is stored in text files.
→ ID3 tag metadata is stored in DynamoDB.
→ Use Amazon EMR with Hive to join the two datasets together in a query.
Case 2: Steps to Implement
1. Spin up an Amazon EMR cluster with Hive.
2. Create an external Hive table using the DynamoDBStorageHandler.
3. Create an external Hive table using the Amazon S3 location of the text files containing the Million Song project metadata.
4. Create and run a Hive query that joins the two external tables together and writes the joined results out to Amazon S3.
5. Load the results from Amazon S3 into DynamoDB.
Case 2: Key Takeaways
Use Amazon EMR to quickly provision a Hadoop cluster with Hive and to tear it down when done.
Using Hive with DynamoDB allows items in DynamoDB tables to be queried and joined with data from a variety of sources.
Case 3 – Store and Analyze Sensor Data with DynamoDB and Amazon Redshift
Dashboard
Case 3: Use Case
A large number of sensors take readings at regular intervals. You need to aggregate the data from each reading into a data warehouse for analysis:
• Use Amazon Kinesis to ingest the raw sensor data.
• Store the sensor readings in DynamoDB for fast access and real-time dashboards.
• Store raw sensor readings in Amazon S3 for durability and backup.
• Load the data from Amazon S3 into Amazon Redshift using AWS Lambda.
Case 3: Steps to Implement
1. Create two Lambda functions to read data from the Amazon Kinesis stream.
2. Enable the Amazon Kinesis stream as an event source for each Lambda function.
3. Write data into DynamoDB in one of the Lambda functions.
4. Write data into Amazon S3 in the other Lambda function.
5. Use the aws-lambda-redshift-loader to load the data in Amazon S3 into Amazon Redshift in batches.
Case 3: Key Takeaways
Amazon Kinesis + Lambda + DynamoDB = scalable, durable, highly available solution for sensor data ingestion with very low operational overhead.
DynamoDB is well-suited for near-real-time queries of recent sensor data readings.
Amazon Redshift is well-suited for deeper analysis of sensor readings spanning longer time horizons and very large numbers of records.
Using Lambda to load data into Amazon Redshift provides a way to perform ETL at frequent intervals.
DynamoDB Streams
Stream of updates to a table
Asynchronous
Exactly once
Strictly ordered
• Per item
Highly durable
• Scales with the table
24-hour lifetime
Sub-second latency
DynamoDB Streams
View types, for UpdateItem (Name = John, Destination = Pluto) where the old Destination was Mars:
• Old image (before update): Name = John, Destination = Mars
• New image (after update): Name = John, Destination = Pluto
• Old and new images: Name = John, Destination = Mars; Name = John, Destination = Pluto
• Keys only: Name = John
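A stream consumer (for example a Lambda function) sees records shaped like this; with old and new images enabled it can diff the two (a sketch; the handler and record values are illustrative):

```python
def destinations_changed(record):
    """Return (old, new) Destination values from a MODIFY stream record."""
    old = record["dynamodb"].get("OldImage", {})
    new = record["dynamodb"].get("NewImage", {})
    return (old.get("Destination", {}).get("S"),
            new.get("Destination", {}).get("S"))

record = {
    "eventName": "MODIFY",
    "dynamodb": {
        "Keys": {"Name": {"S": "John"}},
        "OldImage": {"Name": {"S": "John"}, "Destination": {"S": "Mars"}},
        "NewImage": {"Name": {"S": "John"}, "Destination": {"S": "Pluto"}},
    },
}
print(destinations_changed(record))  # ('Mars', 'Pluto')
```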
DynamoDB Streams and Amazon Kinesis Client Library
(Diagram: a DynamoDB client application sends updates to a table with partitions 1-5; the table's stream is split into shards 1-4, each consumed by a KCL worker in an Amazon Kinesis Client Library application)
Cross-region replication
Open-source cross-region replication library
(Diagram: DynamoDB Streams from the master table in US East (N. Virginia) drive replicas in Asia Pacific (Sydney) and EU (Ireland))
DynamoDB Streams and AWS Lambda
Triggers
(Diagram: DynamoDB Streams notify a Lambda function of each change; the function updates derivative tables, Amazon CloudSearch, Amazon Elasticsearch, and Amazon ElastiCache)
Analytics with DynamoDB Streams
Collect and de-dupe data in DynamoDB
Aggregate data in-memory and flush periodically
Performing real-time aggregation and analytics
(Diagram: DynamoDB Streams feeding EMR, Amazon Redshift, DynamoDB derivative tables, and cross-region replication)
We look forward to your feedback!
https://www.awssummit.co.kr
Visit the mobile page and rate this session now to receive a souvenir after the event.
Share your impressions of the event on social media with the #AWSSummit hashtag.
The slides and session recordings will be shared soon on AWS Korea's official social channels.