AWS re:Invent 2016: Deep Dive on Amazon Glacier (STG302)
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Mas Kubo, Senior Product Manager, Amazon Glacier
Andy Shenkler, EVP and Chief Solutions & Technology Officer, Sony DADC New Media
Solutions (NMS)
November 30, 2016
Audio archives – SoundCloud
• World’s leading social sound
platform
• Audio files transcoded and
stored in multiple formats
• Stores petabytes of data
• Transcoded files served from
Amazon S3
• Originals moved to Amazon
Glacier for long-term retention
Patient data – Philips Healthcare
• HealthSuite digital platform
powered by AWS
• 15 petabytes of patient data
• Securely stored for decades
(beyond the lifetime of patients)
• Uses HIPAA-eligible AWS
services
Tape replacement – King County
• Most populous county in
Washington State
• Replaced tape solution for
backups from 17 agencies
• Meets compliance
requirements
• Saved $1MM in first year, no
more tape refresh or
management churn
Batches and Streams
• Direct Connect
• Snowball, Snowball Edge, Snowmobile
• 3rd Party Connectors
• Transfer Acceleration
• Storage Gateway
• Kinesis Firehose
File
• Amazon EFS
Block
• Amazon EBS (persistent)
• Amazon EC2 Instance Store (ephemeral)
Object
• Amazon S3
• Amazon Glacier
Data Storage Demand
• Media assets, 4k, 8k
• Healthcare/life sciences
• Financial services
• Regulated industries
• Oil and gas/geospatial
• Digital preservation
• Long-term backups
• Logs
Archive:
• Secure and durable
• Low cost
• Flexible data access
• Compliant
Amazon Glacier
• Extremely low-cost archive storage service, starting at $0.004
per GB per month
• New! Three retrieval options ranging from minutes to hours
(more later)
• 99.999999999% durability (5-6 orders of magnitude higher
than 2 copies of tape)
• All data is encrypted at rest
• Features: compliance, data management, cost management,
audit logging
Amazon Glacier
• Metered usage: pay as you go
• No capital investment
• No commitment
• No risky capacity planning
• Avoid risks of physical media handling
• Control your geographic locality for performance and compliance
Key Terms and Concepts
• Vaults – container for archives, up to 1,000 vaults per account
• Archives – basic unit, write-once, 40 TB max, unlimited archives
• Inventory – cold index of archives refreshed every 24 hours
• Access – three ways to access Amazon Glacier
• Uploads – multipart, lifecycle, cost optimizations, AWS Snowball
• Data management – Vault Lock, tagging, audit logs
• Retrievals – retrieval policies, range retrievals, new retrieval
features
Accessing Amazon Glacier
1. Direct Amazon Glacier API/SDK
2. Amazon S3 lifecycle integration
3. Third-party tools and gateways (for example, FastGlacier)
Uploading data: Internet or sneaker-net
• AWS Direct Connect: dedicated bandwidth between your site and AWS
• Internet: transfer data in a secure SSL tunnel over the public Internet
• AWS Import/Export, AWS Snowball: physical transfer of media into and out of AWS
Uploading data: archive descriptions
• Use archive description field for
metadata
• If local index is corrupted or
destroyed, use archive description
to reconstruct critical mappings
• For example, create index entry,
add primary key to archive
description on upload
Local index entry:
  Primary key: 12345
  Description: 2014Audit
  Dept: FinanceDept
  ArchiveID: 9FG23…..
  …..
UploadArchive(data, ArchiveDescription=“12345,2014Audit,FinanceDept”) -> Archive ID = 9FG23…..
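The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the service API: the helper names (`encode_description`, `rebuild_index`), the comma-separated layout, and the placeholder archive ID are assumptions. It also assumes that a vault inventory lists each archive with an `ArchiveId` and an `ArchiveDescription` field, and that descriptions are limited to 1,024 characters.

```python
from dataclasses import dataclass

MAX_DESCRIPTION_LENGTH = 1024  # assumed service limit on archive descriptions


@dataclass
class IndexEntry:
    primary_key: str
    description: str
    dept: str
    archive_id: str


def encode_description(primary_key, description, dept):
    """Pack the index fields into one comma-separated archive description."""
    fields = [primary_key, description, dept]
    if any("," in field for field in fields):
        raise ValueError("fields must not contain the ',' separator")
    encoded = ",".join(fields)
    if len(encoded) > MAX_DESCRIPTION_LENGTH:
        raise ValueError("archive description too long")
    return encoded


def rebuild_index(inventory_archives):
    """Rebuild the local index from inventory entries (hypothetical dict shape)."""
    index = {}
    for item in inventory_archives:
        primary_key, description, dept = item["ArchiveDescription"].split(",")
        index[primary_key] = IndexEntry(primary_key, description, dept, item["ArchiveId"])
    return index


# Illustrative data only; the archive ID is a placeholder.
inventory = [{
    "ArchiveId": "EXAMPLE-ARCHIVE-ID",
    "ArchiveDescription": encode_description("12345", "2014Audit", "FinanceDept"),
}]
print(rebuild_index(inventory)["12345"])
```

Because the primary key travels with the archive, a lost local index can be reconstructed from the next vault inventory alone.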
Uploading data: optimizing costs
• Every archive has 32 KB of associated
overhead and some operations are charged per
request
• For a 3.2 MB archive, overhead is ~1% of the cost
• For a 1 KB archive, 97% of the cost would go to
overhead
• Solution is aggregation: the recommended minimum
archive size is on the order of MBs
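The percentages above follow from simple arithmetic, sketched here in Python. The function name is illustrative, 1 MB is taken as 1,000 KB, and only the 32 KB storage overhead is modeled; per-request charges are ignored.

```python
OVERHEAD_KB = 32  # per-archive overhead stated on the slide


def overhead_fraction(archive_size_kb):
    """Fraction of billed storage that is overhead for one archive."""
    return OVERHEAD_KB / (archive_size_kb + OVERHEAD_KB)


for label, size_kb in [("1 KB", 1), ("3.2 MB", 3200)]:
    print(f"{label:>6} archive: {overhead_fraction(size_kb):.1%} overhead")
```

Larger archives drive the fraction toward zero, which is why aggregating small files before upload pays off.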
Uploading data: aggregating archives
[Diagram: one archive contains an index/directory followed by File 1, File 2, ..., each stored with its checksum and metadata (Checksum 1, Checksum 2, Checksum 3, ...). A local index records File 1 offset, File 2 offset, File 3 offset within the archive.]
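One way to aggregate small files is sketched below in Python. It is a simplified illustration under stated assumptions: the function names are hypothetical, SHA-256 stands in for the per-file checksum, and the index is kept only locally rather than also being embedded in the archive as in the slide's layout.

```python
import hashlib


def build_archive(files):
    """Concatenate files into one blob and record offset, size, and checksum.

    files: dict mapping file name -> bytes.
    Returns (archive_bytes, index), where index maps name -> entry dict.
    """
    blob = bytearray()
    index = {}
    for name, data in files.items():
        index[name] = {
            "offset": len(blob),
            "size": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        }
        blob += data
    return bytes(blob), index


def extract_file(archive, entry):
    """Slice one file out of the archive and verify its checksum."""
    data = archive[entry["offset"]:entry["offset"] + entry["size"]]
    if hashlib.sha256(data).hexdigest() != entry["sha256"]:
        raise ValueError("checksum mismatch")
    return data


archive, index = build_archive({"a.txt": b"first file", "b.txt": b"second file"})
print(index["b.txt"]["offset"], extract_file(archive, index["b.txt"]))
```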
Best practices: multipart uploads
Improve throughput and reliability, and gain idempotency
1. InitiateMultipartUpload(partSize) → uploadId
2. UploadPart(uploadId, data)
3. CompleteMultipartUpload(uploadId) → archiveId
[Diagram: an archive is split into parts, and the parts are uploaded in parallel]
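Before calling the three multipart operations above, a client has to pick a part size and cut the archive into byte ranges. The Python sketch below shows one way to do that. The limits it encodes (part size of 1 MiB times a power of two, at most 4 GiB per part, at most 10,000 parts) are my understanding of the service limits and should be checked against the current documentation; the function names are illustrative.

```python
MIB = 1024 * 1024
MAX_PARTS = 10_000             # assumed limit on parts per upload
MAX_PART_SIZE = 4096 * MIB     # assumed 4 GiB limit per part


def choose_part_size(archive_size):
    """Smallest power-of-two multiple of 1 MiB that fits the archive in MAX_PARTS parts."""
    part_size = MIB
    while part_size * MAX_PARTS < archive_size:
        part_size *= 2
    if part_size > MAX_PART_SIZE:
        raise ValueError("archive too large for one multipart upload")
    return part_size


def part_ranges(archive_size, part_size):
    """Yield inclusive (start, end) byte ranges, one per part."""
    for start in range(0, archive_size, part_size):
        yield start, min(start + part_size, archive_size) - 1


size = 5 * MIB + 123
for start, end in part_ranges(size, choose_part_size(size)):
    print(f"bytes {start}-{end}")
```

Each range can be uploaded independently, so parts can go out in parallel and a failed part can be retried without resending the rest.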
Amazon Glacier: Amazon S3 lifecycle policies
• Seamlessly move data from Amazon S3 to Amazon Glacier
• Automated lifecycle rules
• Transition based on object age
Amazon Glacier: Amazon S3 lifecycle policies
• Object-level tagging for S3
objects
• Apply lifecycle rules based on
object tags
• Example: transition objects to
Amazon Glacier when they are 1 year
old and carry the object tags
‘Project=Delta’ and ‘Data
type=HPI’.
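The tag-based example above could be expressed as a lifecycle configuration along the following lines. This Python sketch only builds the configuration document; the rule ID and helper name are hypothetical, and the dictionary shape follows my understanding of the S3 lifecycle configuration format (for instance as accepted by boto3's `put_bucket_lifecycle_configuration`), so verify it before use.

```python
import json


def glacier_transition_rule(rule_id, days, tags):
    """Build one lifecycle rule that moves tagged objects to Glacier after `days`."""
    tag_list = [{"Key": key, "Value": value} for key, value in tags.items()]
    # A single tag uses "Tag"; several tags are combined under "And".
    rule_filter = {"Tag": tag_list[0]} if len(tag_list) == 1 else {"And": {"Tags": tag_list}}
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": rule_filter,
        "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
    }


lifecycle_configuration = {
    "Rules": [
        glacier_transition_rule(
            "delta-hpi-to-glacier",  # hypothetical rule name
            365,                     # "1 year old"
            {"Project": "Delta", "Data type": "HPI"},
        )
    ]
}
print(json.dumps(lifecycle_configuration, indent=2))
```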
Management features: audit logging via AWS CloudTrail
• Enable AWS CloudTrail in console
• Control plane events: vault activities
• Data plane events: archive activities
Management features: vault access policies
• Manage access to a vault in a single location – single AWS Identity and
Access Management (IAM) policy
– Grant/revoke access to internal business units/teams
– “Marketing_Vault” has an access policy that is distinct from
“DevOps_Vault”
• Easily manage cross-account access for your business partner
– Simply add a section for your business partner in the same policy
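A vault access policy with a separate section for a business partner might look like the Python sketch below. The account IDs, region, statement IDs, and chosen actions are placeholders for illustration; only the vault name comes from the slide.

```python
import json


def allow_statement(sid, account_ids, actions, vault_arn):
    """Build one Allow statement granting the listed accounts access to a vault."""
    return {
        "Sid": sid,
        "Effect": "Allow",
        "Principal": {"AWS": [f"arn:aws:iam::{account}:root" for account in account_ids]},
        "Action": list(actions),
        "Resource": [vault_arn],
    }


# Placeholder ARN: region and account ID are hypothetical.
vault_arn = "arn:aws:glacier:us-west-2:111122223333:vaults/Marketing_Vault"

vault_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        # Internal team: may upload archives.
        allow_statement("marketing-upload", ["111122223333"],
                        ["glacier:UploadArchive"], vault_arn),
        # Business partner in another account: may retrieve archives.
        allow_statement("partner-retrieve", ["444455556666"],
                        ["glacier:InitiateJob", "glacier:GetJobOutput"], vault_arn),
    ],
}
print(json.dumps(vault_access_policy, indent=2))
```

Revoking the partner's access then means deleting one statement from this single policy.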
Management features: Vault Lock
• Non-overwrite, non-erasable records
• Time-based retention with “ArchiveAgeInDays” control
• Policy lockdown (strong governance)
• Legal hold with vault-level tags
• Configure optional designated third-party access and grant
temporary access
Vault Lock: two-step locking
• InitiateVaultLock
– Effectuates a retention policy for testing (in-progress state)
– Returns a unique lock ID (expires after 24 hours)
• AbortVaultLock
– Deletes an in-progress policy
– Ability to modify a policy before locking it down
• CompleteVaultLock
– Locks down the vault with the appropriate lock ID
– A Vault Lock policy cannot be aborted once locked
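The two-step flow above can be sketched as a small Python function. It assumes a client object that exposes `initiate_vault_lock`, `complete_vault_lock`, and `abort_vault_lock` with the keyword arguments used by the boto3 Glacier client; the function name and the `policy_is_acceptable` callback (your own policy tests) are hypothetical.

```python
import json


def lock_vault(client, vault_name, policy, policy_is_acceptable):
    """Initiate a Vault Lock, test it, then complete it or abort it.

    Returns True if the vault was locked, False if the lock was aborted.
    """
    response = client.initiate_vault_lock(
        accountId="-",                        # "-" means the caller's own account
        vaultName=vault_name,
        policy={"Policy": json.dumps(policy)},
    )
    lock_id = response["lockId"]              # expires after 24 hours

    # The policy is now in effect in the in-progress state: test it here.
    if policy_is_acceptable():
        client.complete_vault_lock(accountId="-", vaultName=vault_name, lockId=lock_id)
        return True

    # Tests failed: drop the in-progress policy so it can be revised and retried.
    client.abort_vault_lock(accountId="-", vaultName=vault_name)
    return False
```

Passing the client in as a parameter keeps the sketch testable with a stand-in object before it is pointed at a real vault.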
Management features: Vault Lock
Legal hold with vault-level tags
• Set up a legal hold tag
– Configure a vault-level tag “LegalHold”
– Set initial value to “False”
• Add compliance control for legal hold in a vault lock policy
– Deny delete archive operation
– From anybody (root, administrators, users, business partners)
– When LegalHold tag = “True”
• Place or lift legal hold by updating the tag value
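A Vault Lock policy combining time-based retention with the legal hold tag might be built as in the Python sketch below. The statement IDs, function name, and placeholder ARN are assumptions; the condition keys `glacier:ArchiveAgeInDays` and `glacier:ResourceTag/LegalHold` follow my reading of the Glacier policy documentation and should be verified before locking anything.

```python
import json


def vault_lock_policy(vault_arn, retention_days):
    """Deny archive deletion before the retention period ends or while on legal hold."""
    def deny_delete(sid, condition):
        return {
            "Sid": sid,
            "Principal": "*",                 # applies to everybody, including root
            "Effect": "Deny",
            "Action": "glacier:DeleteArchive",
            "Resource": [vault_arn],
            "Condition": condition,
        }

    return {
        "Version": "2012-10-17",
        "Statement": [
            deny_delete(
                "deny-delete-before-retention",
                {"NumericLessThan": {"glacier:ArchiveAgeInDays": str(retention_days)}},
            ),
            deny_delete(
                "deny-delete-under-legal-hold",
                # Must match the tag value exactly as it is written on the vault.
                {"StringLike": {"glacier:ResourceTag/LegalHold": ["True"]}},
            ),
        ],
    }


# Placeholder ARN: region, account ID, and vault name are hypothetical.
policy = vault_lock_policy("arn:aws:glacier:us-west-2:111122223333:vaults/examplevault", 365)
print(json.dumps(policy, indent=2))
```

With this shape, placing or lifting a hold only requires changing the vault tag; the locked policy itself never changes.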
Management features: Vault Lock
Vault Lock best practices
• Map one vault to a single retention range
– Group regulatory data by retention: 1-year vault, 6-year vault, etc.
• Create a new vault and lock it before storing production data
– Enforce the full ArchiveAgeInDays on all new archives
– Leave no “gap” on existing archives
• Thoroughly test a vault lock policy before locking it down (Abort/Initiate)
• Implement only the most restrictive controls with Vault Lock
– Leave the flexible controls to vault access policy
Management features: Vault Lock
Third-party assessment
Amazon Glacier received a third-party assessment
from Cohasset Associates on how Amazon Glacier
with Vault Lock can be used to meet the
requirements of SEC 17a-4(f) and CFTC 1.31(b)-(c)
Data retrievals: basic concepts
1. Initiate job (ArchiveId: AE99F…, Vault: Films) -> Job ID
2. 3-5 hours for job completion
3. Job completion notification
4. Download output
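The four steps above can be sketched in Python as follows. The sketch assumes a client exposing `initiate_job`, `describe_job`, and `get_job_output` with the keyword arguments of the boto3 Glacier client. Note one deliberate substitution: it polls the job status instead of waiting for the completion notification shown on the slide, which keeps the example self-contained. The function name and polling interval are illustrative.

```python
import time


def retrieve_archive(client, vault_name, archive_id, tier="Standard",
                     poll_seconds=900, sleep=time.sleep):
    """Start a retrieval job, wait for it to complete, and return the archive bytes."""
    # Step 1: initiate the job and keep the job ID.
    job = client.initiate_job(
        accountId="-",
        vaultName=vault_name,
        jobParameters={"Type": "archive-retrieval", "ArchiveId": archive_id, "Tier": tier},
    )
    job_id = job["jobId"]

    # Steps 2-3: wait for completion (polling here, in place of a notification).
    while not client.describe_job(accountId="-", vaultName=vault_name, jobId=job_id)["Completed"]:
        sleep(poll_seconds)

    # Step 4: download the output.
    output = client.get_job_output(accountId="-", vaultName=vault_name, jobId=job_id)
    return output["body"].read()
```

In production, subscribing to the job completion notification avoids the polling loop entirely.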
Data retrievals: data retrieval policies
• Provides transparency and cost control for data retrievals
• Governs all retrieval activities for an account in a region
• Synchronously accepts or rejects each retrieval request
• Accounts for inflight retrieval operations
Data retrievals: range retrievals
[Diagram: one archive contains an index/directory followed by File 1, File 2, ..., each stored with its checksum and metadata. A local index of file offsets (File 1 offset, File 2 offset, File 3 offset) identifies the byte range to retrieve for a single file.]
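Given a local index of offsets, retrieving one file from an aggregated archive comes down to computing a byte range. The Python sketch below assumes that retrieval ranges must be megabyte-aligned (start divisible by 1 MiB, end plus one divisible by 1 MiB or equal to the archive size) and that the range is passed as a `RetrievalByteRange` string of the form `start-end`; both are my understanding of the API and worth confirming. Function names are illustrative.

```python
MIB = 1024 * 1024


def aligned_range(offset, size, archive_size):
    """Smallest megabyte-aligned inclusive range covering bytes [offset, offset + size)."""
    start = (offset // MIB) * MIB
    end_exclusive = -(-(offset + size) // MIB) * MIB   # round up to the next MiB
    return start, min(end_exclusive, archive_size) - 1


def range_job_parameters(archive_id, offset, size, archive_size):
    """Job parameters for retrieving only the range that contains one file."""
    start, end = aligned_range(offset, size, archive_size)
    return {
        "Type": "archive-retrieval",
        "ArchiveId": archive_id,
        "RetrievalByteRange": f"{start}-{end}",
    }


def slice_file(range_bytes, range_start, offset, size):
    """Cut the wanted file out of the downloaded (aligned) range."""
    begin = offset - range_start
    return range_bytes[begin:begin + size]


# A 100-byte file at offset 1,500,000 inside a 10 MiB archive (illustrative values).
print(range_job_parameters("EXAMPLE-ARCHIVE-ID", 1_500_000, 100, 10 * MIB))
```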
Data retrievals: expedited and bulk retrievals
Retrieval option    Expedited          Standard                  Bulk
Data Access Time    1 - 5 minutes      3 - 5 hours               5 - 12 hours
Data Retrievals     $0.03 per GB       $0.01 per GB              $0.0025 per GB
Retrieval Requests  $0.01 per request  $0.05 per 1,000 requests  $0.025 per 1,000 requests
• Expedited: designed for occasional urgent access to a small
number of archives
• Standard: low-cost option for retrieving data in just a few hours
• Bulk: lowest cost option optimized for large retrievals, up to
petabytes of data in 12 hours
• Three flexible and powerful retrieval options to access any of your
Amazon Glacier data
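To compare the three options for a given workload, the prices quoted on this slide can be plugged into a small Python estimator. The prices are the 2016 figures shown above and may have changed since; the function and table names are illustrative.

```python
# Prices as quoted on the slide (2016): per GB retrieved and per request.
TIERS = {
    "Expedited": {"per_gb": 0.03, "per_request": 0.01},
    "Standard": {"per_gb": 0.01, "per_request": 0.05 / 1000},
    "Bulk": {"per_gb": 0.0025, "per_request": 0.025 / 1000},
}


def retrieval_cost(tier, gigabytes, requests):
    """Estimated retrieval cost in USD for one tier."""
    price = TIERS[tier]
    return gigabytes * price["per_gb"] + requests * price["per_request"]


# Example workload: 1,000 GB spread over 1,000 archives.
for tier in TIERS:
    print(f"{tier:>9}: ${retrieval_cost(tier, 1000, 1000):.3f}")
```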
“If physical deliveries can happen within one hour based on
unpredictable requests, surely we are able to exceed such expectations digitally”
@SonyDADCNMS
Our migration
The Challenge
• Seamlessly migrate a platform that enables content
delivery across all devices and more than 1,200
distribution points worldwide
• Store 20 petabytes of motion picture and television
content
• Equating to 1,000,000+ hours of content
• At a growth curve of ~1 petabyte every quarter
Desired Goals:
• One-hour delivery turnaround time
• Agile, scalable, predictable cost model and
infrastructure
• Investing in innovation vs. hardware
Storage “Office Hours”: Meet the People who Build AWS Storage
Who: Lead Software Development Engineers, Architects, and Technical PMs
Where: Storage Booth Walk-up Bar
When: Exhibit hours (Tues 5-7pm, Wed & Thurs 10:30a-6:00p)
What: Architecture best practices, code reviews, feature requests