AWS re:Invent 2016: Deep Dive on Amazon Glacier (STG302)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Mas Kubo, Senior Product Manager, Amazon Glacier

Andy Shenkler, EVP and Chief Solutions & Technology Officer, Sony DADC New Media Solutions (NMS)

November 30, 2016

Deep Dive on Amazon Glacier (STG302)

Upload: amazon-web-services

Post on 11-Jan-2017



Audio archives – SoundCloud

• World’s leading social sound platform

• Audio files transcoded and stored in multiple formats

• Stores petabytes of data

• Transcoded files served from Amazon S3

• Originals moved to Amazon Glacier for long-term retention

Patient data – Philips Healthcare

• HealthSuite digital platform powered by AWS

• 15 petabytes of patient data

• Securely stored for decades (beyond the lifetime of patients)

• Uses HIPAA-eligible AWS services

Tape replacement – King County

• Most populous county in Washington State

• Replaced tape solution for backups from 17 agencies

• Meets compliance requirements

• Saved $1MM in first year, no more tape refresh or management churn

Batches and Streams: Direct Connect; Snowball, Snowball Edge, Snowmobile; 3rd Party Connectors; Transfer Acceleration; Storage Gateway; Kinesis Firehose

File: Amazon EFS

Block: Amazon EBS (persistent), Amazon EC2 Instance Store (ephemeral)

Object: Amazon S3, Amazon Glacier

Data Storage Demand

• Media assets, 4k, 8k

• Healthcare/life sciences

• Financial services

• Regulated industries

• Oil and gas/geospatial

• Digital preservation

• Long-term backups

• Logs

Archive:

• Secure and durable

• Low cost

• Flexible data access

• Compliant

Amazon Glacier

• Extremely low-cost archive storage service, starting at $0.004 per GB per month

• New! Three retrieval options ranging from minutes to hours (more later)

• 99.999999999% durability (5-6 orders of magnitude higher than 2 copies of tape)

• All data is encrypted at rest

• Features: compliance, data management, cost management, audit logging

Amazon Glacier

• Metered usage: pay as you go

• No capital investment

• No commitment

• No risky capacity planning

• Avoid risks of physical media handling

• Control your geographic locality for performance and compliance

Key Terms and Concepts

• Vaults – container for archives, up to 1,000 vaults per account

• Archives – basic unit, write-once, 40 TB max, unlimited archives

• Inventory – cold index of archives refreshed every 24 hours

• Access – three ways to access Amazon Glacier

• Uploads – multipart, lifecycle, cost optimizations, AWS Snowball

• Data management – Vault Lock, tagging, audit logs

• Retrievals – retrieval policies, range retrievals, new retrieval features

Accessing Amazon Glacier

1. Direct Amazon Glacier API/SDK

2. Amazon S3 lifecycle integration

3. Third-party tools and gateways

FastGlacier

Uploading data: Internet or sneaker-net

• Internet: Transfer data in a secure SSL tunnel over the public Internet

• AWS Direct Connect: Dedicated bandwidth between your site and AWS

• AWS Import/Export, AWS Snowball: Physical transfer of media into and out of AWS

Uploading data: archive descriptions

• Use archive description field for metadata

• If local index is corrupted or destroyed, use archive description to reconstruct critical mappings

• For example, create index entry, add primary key to archive description on upload

Local Index Entry

Primary key: 12345

Description: 2014Audit

Dept: FinanceDept

ArchiveID: 9FG23…..

UploadArchive(data, ArchiveDescription=“12345,2014Audit,FinanceDept”) -> Archive ID = 9FG23…..
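The round trip on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the comma-separated layout follows the slide's example, while the function names, the `EXAMPLE-ARCHIVE-ID` value, and the shape of the inventory entries (`ArchiveId` and `ArchiveDescription` keys, as they appear in a vault inventory's archive list) are assumptions made for the example.

```python
from typing import Dict, List


def make_description(primary_key: str, label: str, dept: str) -> str:
    """Pack the local index fields into one comma-separated description."""
    fields = [primary_key.strip(), label.strip(), dept.strip()]
    if any("," in field for field in fields):
        raise ValueError("fields must not contain commas")
    return ",".join(fields)


def rebuild_index(archive_list: List[dict]) -> Dict[str, dict]:
    """Recreate the local index from inventory entries after a local loss.

    Each entry is assumed to carry 'ArchiveId' and 'ArchiveDescription'.
    """
    index = {}
    for entry in archive_list:
        primary_key, label, dept = entry["ArchiveDescription"].split(",")
        index[primary_key] = {
            "Description": label,
            "Dept": dept,
            "ArchiveID": entry["ArchiveId"],
        }
    return index


description = make_description("12345", "2014Audit", "FinanceDept")
# Hypothetical inventory entry; a real archive ID is returned by the upload.
inventory = [{"ArchiveId": "EXAMPLE-ARCHIVE-ID", "ArchiveDescription": description}]
index = rebuild_index(inventory)
```

The description would be passed as the archive description at upload time; the inventory (refreshed roughly daily, per the key terms above) is what makes the rebuild possible.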

Uploading data: optimizing costs

• Every archive has 32 KB of associated overhead and some operations are charged per request

• For a 3.2 MB archive, overhead is ~1% of the cost

• For a 1 KB archive, 97% of the cost would go to overhead

• Solution is aggregation – recommended minimum archive size is on the order of MBs
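The two percentages above follow from simple arithmetic. The sketch below assumes storage cost is proportional to stored bytes and that the 32 KB overhead is billed at the same rate as the data; per-request charges are ignored. The function name is illustrative.

```python
OVERHEAD_BYTES = 32 * 1024  # 32 KB of per-archive overhead, from the slide


def overhead_fraction(archive_bytes: float) -> float:
    """Share of the stored (and billed) bytes that is overhead."""
    return OVERHEAD_BYTES / (archive_bytes + OVERHEAD_BYTES)


small = overhead_fraction(1024)                # 1 KB archive
medium = overhead_fraction(3.2 * 1024 * 1024)  # 3.2 MB archive

print(f"1 KB archive:   {small:.1%} overhead")
print(f"3.2 MB archive: {medium:.1%} overhead")
```

For the 1 KB case the fraction is 32 / 33, which is where the 97% figure comes from.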

Uploading data: aggregating archives

[Diagram labels. Local index: File 1 offset, File 2 offset, File 3 offset, each with checksum & metadata. Archive: index/directory, File 1, Checksum 1, File 2, Checksum 2, . . ., Checksum 3.]
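A minimal sketch of the aggregation idea: concatenate small files into one archive payload and keep a local index of offsets, lengths, and checksums. The index format and function names are assumptions for illustration; this version keeps the index only locally and does not embed an index/directory inside the archive.

```python
import hashlib
from typing import Dict, Tuple


def aggregate(files: Dict[str, bytes]) -> Tuple[bytes, Dict[str, dict]]:
    """Concatenate files into one blob and record where each one lives."""
    blob = bytearray()
    index: Dict[str, dict] = {}
    for name, data in files.items():
        index[name] = {
            "offset": len(blob),
            "length": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        }
        blob += data
    return bytes(blob), index


def extract(blob: bytes, entry: dict) -> bytes:
    """Slice one file back out of the blob and verify its checksum."""
    data = blob[entry["offset"]:entry["offset"] + entry["length"]]
    if hashlib.sha256(data).hexdigest() != entry["sha256"]:
        raise ValueError("checksum mismatch")
    return data


blob, index = aggregate({"a.txt": b"hello", "b.txt": b"world!"})
restored = extract(blob, index["b.txt"])
```

The recorded offsets are what later allow a single file to be fetched with a range retrieval instead of pulling the whole archive.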

Best practices: multipart uploads

Improve throughput, reliability, and get idempotency

1. InitiateMultipartUpload(partSize) → uploadId

2. UploadPart(uploadId, data)

3. CompleteMultipartUpload(uploadId) → archiveId

[Diagram: an archive split into parts, with the parts uploaded in parallel.]
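The three steps can be sketched against a boto3-style Glacier client. The slide's step names map onto the boto3 methods `initiate_multipart_upload`, `upload_multipart_part`, and `complete_multipart_upload`; the `client` argument, the function names, and the sequential loop are assumptions of this sketch (parts are independent, so a real uploader could send them in parallel). Glacier checksums are SHA-256 tree hashes over 1 MiB chunks, implemented here in plain Python.

```python
import hashlib

MIB = 1024 * 1024


def tree_hash(data: bytes) -> str:
    """SHA-256 tree hash: hash 1 MiB chunks, then combine pairs upward."""
    level = [hashlib.sha256(data[i:i + MIB]).digest()
             for i in range(0, len(data), MIB)]
    if not level:
        level = [hashlib.sha256(b"").digest()]
    while len(level) > 1:
        paired = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                paired.append(hashlib.sha256(level[i] + level[i + 1]).digest())
            else:
                paired.append(level[i])  # odd node is promoted unchanged
        level = paired
    return level[0].hex()


def multipart_upload(client, vault: str, data: bytes, part_size: int = MIB) -> str:
    """Upload data in parts and return the archive ID.

    part_size is assumed to be a power-of-two multiple of 1 MiB.
    """
    upload_id = client.initiate_multipart_upload(
        vaultName=vault, partSize=str(part_size))["uploadId"]
    for start in range(0, len(data), part_size):
        part = data[start:start + part_size]
        end = start + len(part) - 1
        client.upload_multipart_part(
            vaultName=vault,
            uploadId=upload_id,
            range=f"bytes {start}-{end}/*",
            checksum=tree_hash(part),
            body=part,
        )
    result = client.complete_multipart_upload(
        vaultName=vault,
        uploadId=upload_id,
        archiveSize=str(len(data)),
        checksum=tree_hash(data),
    )
    return result["archiveId"]
```

Because each part is addressed by its byte range, re-sending a part after a failure overwrites the same range, which is the idempotency the slide refers to.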

Amazon Glacier: Amazon S3 lifecycle policies

• Seamlessly move data from Amazon S3 to Amazon Glacier

• Automated lifecycle rules

• Transition based on object age

Amazon Glacier: Amazon S3 lifecycle policies

• Object-level tagging for S3 objects

• Apply lifecycle rules based on object tags

• Example: transition objects to Amazon Glacier when they are 1 year old and have the object tags ‘Project=Delta’ and ‘Data type=HPI’.
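The tag-based example might look like the following lifecycle configuration, built here as a plain Python dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts. The rule ID and helper name are hypothetical, "1 year" is approximated as 365 days, and the tag keys are copied from the slide as written.

```python
from typing import Dict


def glacier_transition_rule(rule_id: str, days: int, tags: Dict[str, str]) -> dict:
    """Build one lifecycle rule that moves tagged objects to Glacier."""
    tag_list = [{"Key": key, "Value": value} for key, value in tags.items()]
    if len(tag_list) == 1:
        rule_filter = {"Tag": tag_list[0]}          # single condition
    else:
        rule_filter = {"And": {"Tags": tag_list}}   # all tags must match
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": rule_filter,
        "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
    }


lifecycle = {
    "Rules": [
        glacier_transition_rule(
            "archive-delta-hpi",  # hypothetical rule name
            365,
            {"Project": "Delta", "Data type": "HPI"},
        )
    ]
}
# With a configured boto3 S3 client this would be applied as:
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=lifecycle)
```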

Management features: vault tagging

Management features: audit logging via

AWS CloudTrail

• Enable AWS CloudTrail in console

• Control plane events: vault activities

• Data plane events: archive activities

Management features: vault access policies

• Manage access to a vault in a single location – single AWS Identity and Access Management (IAM) policy

– Grant/revoke access to internal business units/teams

– “Marketing_Vault” has an access policy that is distinct from

“DevOps_Vault”

• Easily manage cross-account access for your business partner

– Simply add a section for your business partner in the same policy
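As an illustration of "a section for your business partner in the same policy", the sketch below builds a vault access policy with two statements: one for an internal team and one for a partner account. All ARNs and account IDs are placeholders, and the choice of actions is an assumption; the document would be attached with boto3's `set_vault_access_policy`.

```python
import json


def vault_access_policy(vault_arn: str, team_arn: str, partner_account_id: str) -> dict:
    """One policy document: internal team access plus cross-account read."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "team-upload-and-retrieve",
                "Effect": "Allow",
                "Principal": {"AWS": team_arn},
                "Action": [
                    "glacier:UploadArchive",
                    "glacier:InitiateJob",
                    "glacier:GetJobOutput",
                ],
                "Resource": vault_arn,
            },
            {
                "Sid": "partner-retrieve-only",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{partner_account_id}:root"},
                "Action": ["glacier:InitiateJob", "glacier:GetJobOutput"],
                "Resource": vault_arn,
            },
        ],
    }


policy = vault_access_policy(
    "arn:aws:glacier:us-west-2:111122223333:vaults/Marketing_Vault",  # placeholder
    "arn:aws:iam::111122223333:role/marketing",                       # placeholder
    "444455556666",                                                   # placeholder
)
policy_json = json.dumps(policy)
# With a configured boto3 Glacier client:
#   glacier.set_vault_access_policy(
#       vaultName="Marketing_Vault", policy={"Policy": policy_json})
```

Revoking the partner's access is then a matter of removing the second statement and setting the policy again.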

Management features: Vault Lock

• Non-overwrite, non-erasable records

• Time-based retention with “ArchiveAgeInDays” control

• Policy lockdown (strong governance)

• Legal hold with vault-level tags

• Configure optional designated third-party access and grant temporary access

Vault Lock: two-step locking

• InitiateVaultLock

– Puts a retention policy into effect for testing (in-progress state)

– Returns a unique lock ID (expires after 24 hours)

• AbortVaultLock

– Deletes an in-progress policy

– Ability to modify a policy before locking it down

• CompleteVaultLock

– Locks down the vault with the appropriate lock ID

– A Vault Lock policy cannot be aborted once locked
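The two-step flow can be sketched with a boto3-style Glacier client, whose `initiate_vault_lock`, `abort_vault_lock`, and `complete_vault_lock` methods correspond to the three operations above. The `approve` callback stands in for whatever testing is done while the lock is in progress and is purely hypothetical, as are the function names and the sample 365-day retention policy.

```python
import json
from typing import Callable


def retention_policy(vault_arn: str, min_age_days: int) -> dict:
    """Deny archive deletion until an archive reaches the given age."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "deny-early-delete",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "glacier:DeleteArchive",
            "Resource": vault_arn,
            "Condition": {
                "NumericLessThan": {"glacier:ArchiveAgeInDays": str(min_age_days)}
            },
        }],
    }


def lock_vault(client, vault: str, policy: dict,
               approve: Callable[[dict], bool]) -> bool:
    """Initiate a lock, test it, then complete or abort. Returns True if locked."""
    lock_id = client.initiate_vault_lock(
        vaultName=vault, policy={"Policy": json.dumps(policy)})["lockId"]
    if approve(policy):  # testing window; the lock ID expires after 24 hours
        client.complete_vault_lock(vaultName=vault, lockId=lock_id)
        return True      # from here on the policy can no longer be aborted
    client.abort_vault_lock(vaultName=vault)
    return False
```

Aborting and initiating again is how a policy gets revised before it is locked down.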

Management features: Vault Lock

• Set up a legal hold tag

– Configure a vault-level tag “LegalHold”

– Set initial value to “False”

• Add compliance control for legal hold in a vault lock policy

– Deny delete archive operation

– From anybody (root, administrators, users, business partners)

– When LegalHold tag = “True”

• Place or lift legal hold by updating the tag value

Legal hold with vault-level tags

Management features: Vault Lock

Example control: legal hold
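A legal hold control along these lines might look like the sketch below. It is not the policy shown on the original slide: the statement uses the `glacier:ResourceTag/LegalHold` condition key with the slide's tag value "True", and the helper names and the exact condition operator are assumptions. The tag itself is flipped with boto3's `add_tags_to_vault`.

```python
def legal_hold_statement(vault_arn: str) -> dict:
    """Deny archive deletion for everyone while LegalHold is 'True'."""
    return {
        "Sid": "no-delete-under-legal-hold",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "glacier:DeleteArchive",
        "Resource": vault_arn,
        "Condition": {
            "StringEquals": {"glacier:ResourceTag/LegalHold": "True"}
        },
    }


def set_legal_hold(client, vault: str, on: bool) -> None:
    """Place or lift the hold by updating the vault-level tag."""
    client.add_tags_to_vault(
        vaultName=vault,
        Tags={"LegalHold": "True" if on else "False"},
    )
```

Because the deny statement lives in the locked policy while the tag stays editable, the hold can be placed or lifted without touching the policy itself.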

Management features: Vault Lock

• Map one vault to a single retention range

– Group regulatory data by retention: 1-year vault, 6-year vault, etc.

• Create a new vault and lock it before storing production data

– Enforce the full ArchiveAgeInDays on all new archives

– Leave no “gap” on existing archives

• Thoroughly test a vault lock policy before locking it down (Abort/Initiate)

• Implement only the most restrictive controls with Vault Lock

– Leave the flexible controls to vault access policy

Vault Lock best practices

Management features: Vault Lock

Amazon Glacier received a third-party assessment from Cohasset Associates on how Amazon Glacier with Vault Lock can be used to meet the requirements of SEC 17a-4(f) and CFTC 1.31(b)-(c)

Third-party assessment

Management features: Vault Lock

Data retrievals: basic concepts

1. Initiate job (ArchiveId: AE99F…, Vault: Films) -> Job ID

2. 3-5 hours for job completion

3. Job completion notification

4. Download output
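The four steps can be sketched against a boto3-style Glacier client using `initiate_job`, `describe_job`, and `get_job_output`. This version polls for completion instead of waiting for a notification, which is a simplification; the function name, the `client` argument, and the polling interval are assumptions of the sketch.

```python
import time


def retrieve_archive(client, vault: str, archive_id: str,
                     tier: str = "Standard", poll_seconds: float = 900) -> bytes:
    """Initiate a retrieval job, wait for it, and download the output."""
    # Step 1: initiate the job and receive a job ID.
    job_id = client.initiate_job(
        vaultName=vault,
        jobParameters={
            "Type": "archive-retrieval",
            "ArchiveId": archive_id,
            "Tier": tier,
        },
    )["jobId"]
    # Steps 2-3: wait until the job is reported as completed.
    while not client.describe_job(vaultName=vault, jobId=job_id)["Completed"]:
        time.sleep(poll_seconds)
    # Step 4: download the job output.
    return client.get_job_output(vaultName=vault, jobId=job_id)["body"].read()
```

For the notification in step 3, the job parameters also accept an `SNSTopic` entry so that completion is pushed rather than polled.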

Data retrievals: restoring via lifecycle

[Figures: steps 1 to 4]

Data retrievals: data retrieval policies

• Provides transparency and cost control for data retrievals

• Governs all retrieval activities for an account in a region

• Synchronously accepts or rejects each retrieval request

• Accounts for inflight retrieval operations
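A rate-limiting retrieval policy can be expressed as the dictionary that boto3's `set_data_retrieval_policy` takes. The acceptance check below is only a simplified model of "accept or reject, accounting for in-flight retrievals"; it is not the service's actual algorithm, and the function names are illustrative.

```python
def max_rate_policy(bytes_per_hour: int) -> dict:
    """Policy document capping retrievals at a fixed rate."""
    return {"Rules": [{"Strategy": "BytesPerHour", "BytesPerHour": bytes_per_hour}]}


def would_accept(request_bytes: int, inflight_bytes: int, bytes_per_hour: int) -> bool:
    """Simplified model: accept only if in-flight plus new work fits the cap."""
    return inflight_bytes + request_bytes <= bytes_per_hour


GIB = 1024 ** 3
policy = max_rate_policy(10 * GIB)
# With a configured boto3 Glacier client:
#   glacier.set_data_retrieval_policy(Policy=policy)
first = would_accept(6 * GIB, 0, 10 * GIB)         # fits under the cap
second = would_accept(6 * GIB, 6 * GIB, 10 * GIB)  # would exceed it
```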

Data retrievals: range retrievals

[Diagram labels. Local index: File 1 offset, File 2 offset, File 3 offset, each with checksum & metadata. Archive: index/directory, File 1, Checksum 1, File 2, Checksum 2, . . ., Checksum 3.]
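Given a file's offset and length from the local index, a range retrieval needs a byte range that covers the file. Glacier requires retrieval ranges to be megabyte-aligned (the start divisible by 1 MiB, and the end plus one divisible by 1 MiB or equal to the archive size), so the sketch below widens the range accordingly. Function names are illustrative.

```python
from typing import Tuple

MIB = 1024 * 1024


def aligned_range(offset: int, length: int, archive_size: int) -> Tuple[int, int]:
    """Smallest megabyte-aligned (start, end) range covering the file."""
    if offset < 0 or length <= 0 or offset + length > archive_size:
        raise ValueError("file does not fit inside the archive")
    start = (offset // MIB) * MIB
    end_exclusive = -(-(offset + length) // MIB) * MIB  # round up to 1 MiB
    end = min(end_exclusive, archive_size) - 1
    return start, end


def range_parameter(offset: int, length: int, archive_size: int) -> str:
    """Format the range as 'start-end' for a retrieval job request."""
    start, end = aligned_range(offset, length, archive_size)
    return f"{start}-{end}"


start, end = aligned_range(1_500_000, 100, 10 * MIB)
# The file sits at offset - start within the downloaded range.
skip = 1_500_000 - start
```

After download, the caller slices `length` bytes starting at `skip` and checks them against the checksum kept in the local index.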

Data retrievals: expedited and bulk retrievals

                     Expedited           Standard                   Bulk
Data Access Time     1 - 5 minutes       3 - 5 hours                5 - 12 hours
Data Retrievals      $0.03 per GB        $0.01 per GB               $0.0025 per GB
Retrieval Requests   $0.01 per request   $0.05 per 1,000 requests   $0.025 per 1,000 requests

• Expedited: designed for occasional urgent access to a small number of archives

• Standard: low-cost option for retrieving data in just a few hours

• Bulk: lowest cost option optimized for large retrievals, up to petabytes of data in 12 hours

• Three flexible and powerful retrieval options to access any of your Amazon Glacier data
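To compare the tiers for a given job, the table's prices can be put into a small calculator. The figures are the 2016 prices quoted on the slide; the function and variable names are illustrative, and any other charges are ignored.

```python
# (price per GB retrieved, price per retrieval request), from the table above
TIER_PRICES = {
    "Expedited": (0.03, 0.01),
    "Standard": (0.01, 0.05 / 1000),
    "Bulk": (0.0025, 0.025 / 1000),
}


def retrieval_cost(tier: str, gigabytes: float, requests: int) -> float:
    """Estimated retrieval charge in USD for one tier."""
    per_gb, per_request = TIER_PRICES[tier]
    return gigabytes * per_gb + requests * per_request


# Example: 1,000 GB spread over 1,000 archives.
for tier in TIER_PRICES:
    print(f"{tier:<10} ${retrieval_cost(tier, 1000, 1000):.3f}")
```

For that example the per-GB charge dominates in every tier, which is why Bulk is the natural choice when the data is not needed within hours.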

Accelerated Media Lifecycle

@SonyDADCNMS

“If physical deliveries can happen within one hour based on unpredictable requests, surely we are able to exceed such expectations digitally”


Our migration

The Challenge

• Seamlessly migrate a platform that enables content delivery across all devices and more than 1,200 distribution points worldwide

• Store 20 petabytes of motion picture and television content

• Equating to more than 1,000,000 hours of content

• At a growth curve of ~1 petabyte every quarter

Desired Goals:

• One-hour delivery turnaround time

• Agile, scalable, predictable cost model and infrastructure

• Investing in innovation vs. hardware


On-premises asset storage workflow


AWS Cloud-based asset storage workflow


Amazon Glacier vs. on-premises cost comparison


Thank you!


Storage “Office Hours”: Meet the People who Build AWS Storage

Who: Lead Software Development Engineers, Architects, and Technical PMs

Where: Storage Booth Walk-up Bar

When: Exhibit hours (Tues 5-7pm, Wed & Thurs 10:30a-6:00p)

What: Architecture best practices, code reviews, feature requests

Remember to complete your evaluations!