
Page 1: Building a Data Lake on AWS

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Steve Abraham, Solutions Architect

October 26, 2016

Building a Data Lake on AWS

Page 2: Building a Data Lake on AWS

Evolution of “Data Lakes”

Page 3: Building a Data Lake on AWS

Databases

Transactions

Data warehouse

Evolution of big data architecture

Extract, transform and load (ETL)

Page 4: Building a Data Lake on AWS

Databases

Files

Transactions

Logs

Data warehouse

Evolution of big data architecture

ETL

ETL

Page 5: Building a Data Lake on AWS

Databases

Files

Streams

Transactions

Logs

Events

Data warehouse

Evolution of big data architecture

? Hadoop ?

ETL

ETL

Page 6: Building a Data Lake on AWS

Amazon Glacier

Amazon S3

Amazon DynamoDB

Amazon RDS

Amazon EMR

Amazon Redshift

AWS Data Pipeline

Amazon Kinesis

Amazon CloudSearch

Amazon Kinesis-enabled app

AWS Lambda

Amazon Machine Learning

Amazon SQS

Amazon ElastiCache

Amazon DynamoDB Streams

A growing ecosystem…

Page 7: Building a Data Lake on AWS

Databases

Files

Streams

Transactions

Logs

Events

Data warehouse

Data Lake

The Genesis of “Data Lakes”

Page 8: Building a Data Lake on AWS

What really is a “Data Lake”?

Page 9: Building a Data Lake on AWS

Components of a Data Lake

Collect & Store

Catalogue & Search

Entitlements

API & UI

An API and user interface that expose these features to internal and external users

A robust set of security controls – governance through technology, not policy

A search index and workflow which enables data discovery

A foundation of highly durable data storage and streaming of any type of data

Page 10: Building a Data Lake on AWS

Storage

High durability

Stores raw data from input sources

Support for any type of data

Low cost

Page 11: Building a Data Lake on AWS

Data Lake – Hadoop (HDFS) as the Storage

Search

Access

Query

Process

Archive

Page 12: Building a Data Lake on AWS

Transactions

Data Lake – Amazon S3 as the storage

Search

Access

Query

Process

Archive

Amazon RDS

Amazon DynamoDB

Amazon Elasticsearch Service

Amazon Glacier

Amazon S3

Amazon Redshift

Amazon Elastic MapReduce

Amazon Machine Learning

Amazon ElastiCache

Page 13: Building a Data Lake on AWS

Metadata lake

Used for summary statistics and data classification management

Simplified model for data discovery & governance

Catalogue & search
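As an illustration of the catalogue idea, here is a minimal sketch that registers per-object metadata in DynamoDB. The table name "data-lake-metadata", its key schema, and the attributes are assumptions, not part of the original deck:

```python
import boto3

# Hypothetical DynamoDB table ("data-lake-metadata", partition key "object_key")
# acting as the metadata lake / catalogue.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("data-lake-metadata")

def register_object(bucket, key, size_bytes, classification):
    """Record summary statistics and a classification for an object in the lake."""
    table.put_item(
        Item={
            "object_key": f"s3://{bucket}/{key}",
            "size_bytes": size_bytes,
            "classification": classification,  # e.g. "public", "internal", "pii"
        }
    )

register_object("example-data-lake", "logs/2016/10/26/app.log.gz", 1048576, "internal")
```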

Page 14: Building a Data Lake on AWS

Catalogue & Search Architecture

Page 15: Building a Data Lake on AWS

Encryption for data protection

Authentication & authorization

Access control & restrictions

Entitlements

Page 16: Building a Data Lake on AWS

Data Protection via Encryption

AWS CloudHSM: dedicated-tenancy SafeNet Luna SA HSM device; Common Criteria EAL4+, NIST FIPS 140-2

AWS Key Management Service: automated key rotation & auditing; integration with other AWS services

AWS server-side encryption: AWS managed key infrastructure
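For example, server-side encryption with a KMS key can be requested on each upload. A minimal boto3 sketch; the bucket name, object key, and key alias are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to encrypt the object at rest under a KMS-managed key.
# "example-data-lake" and "alias/data-lake-key" are placeholders.
with open("events.json.gz", "rb") as f:
    s3.put_object(
        Bucket="example-data-lake",
        Key="raw/events/2016/10/26/events.json.gz",
        Body=f,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/data-lake-key",  # or the full key ARN
    )
```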

Page 17: Building a Data Lake on AWS

Entitlements – Access to Encryption Keys

Customer Master Key

Customer Data Keys

Ciphertext Key

Plaintext Key

IAM Temporary Credential

Security Token Service

MyData

MyData

S3

S3 Object: Name: MyData, Key: Ciphertext Key
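A rough sketch of the data-key flow shown above, using boto3 against KMS; the key alias is a placeholder, and the local encryption of "MyData" with the plaintext key is left out:

```python
import boto3

kms = boto3.client("kms")

# Request a data key under the customer master key (CMK).
# The response contains the same key twice: plaintext (use, then discard)
# and ciphertext (store next to the encrypted S3 object).
resp = kms.generate_data_key(KeyId="alias/data-lake-key", KeySpec="AES_256")
plaintext_key = resp["Plaintext"]        # encrypt "MyData" locally with this
ciphertext_key = resp["CiphertextBlob"]  # persist alongside the object

# Later, a caller holding IAM/STS credentials with access to the CMK
# recovers the plaintext key to decrypt the object.
plaintext_key_again = kms.decrypt(CiphertextBlob=ciphertext_key)["Plaintext"]
```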

Page 18: Building a Data Lake on AWS

Exposes the data lake to customers

Programmatically query the catalogue

Expose a search API

Ensures that entitlements are respected

API & UI

Page 19: Building a Data Lake on AWS

API & UI Architecture

API Gateway

UI - Elastic Beanstalk

AWS Lambda

Metadata Index

Users

IAM

TVM - Elastic Beanstalk
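A minimal sketch of what the Lambda function behind API Gateway might look like: it answers a metadata lookup from a hypothetical DynamoDB index ("data-lake-metadata"). The proxy-integration event shape and field names are assumptions:

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("data-lake-metadata")  # hypothetical metadata index

def handler(event, context):
    """Return catalogue metadata for one object, keyed by ?object_key=..."""
    params = event.get("queryStringParameters") or {}
    object_key = params.get("object_key", "")
    item = table.get_item(Key={"object_key": object_key}).get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    return {"statusCode": 200, "body": json.dumps(item, default=str)}
```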

Page 20: Building a Data Lake on AWS

Putting It All Together

Page 21: Building a Data Lake on AWS

Amazon Kinesis

Amazon S3 Amazon Glacier

IAM

Encrypted Data

Security Token Service

AWS Lambda

Search Index

Metadata Index

API Gateway

Users

UI - Elastic Beanstalk

KMS

Collect & Store

Catalogue & Search

Entitlements & Access Controls

APIs & UI

Page 22: Building a Data Lake on AWS

Amazon S3 - Foundation for your Data Lake

Page 23: Building a Data Lake on AWS

Durable: designed for 11 9s of durability

Available: designed for 99.99% availability

High performance: multipart upload, Range GET

Scalable: store as much as you need; scale storage and compute independently; no minimum usage commitments

Integrated: AWS Elastic MapReduce, Amazon Redshift, Amazon DynamoDB

Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notifications, lifecycle policies

Why Amazon S3 for Data Lake?

Page 24: Building a Data Lake on AWS

Why Amazon S3 for Data Lake?

Natively supported by frameworks such as Spark, Hive, and Presto

Can run transient Hadoop clusters

Multiple clusters can use the same data

Highly durable, available, and scalable

Low Cost: S3 Standard starts at $0.0275 per GB per month
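As a sketch of the "multiple clusters, same data" point: a transient EMR cluster can query the lake in place with Spark. The bucket, prefix, and the `event_type` column are placeholders:

```python
from pyspark.sql import SparkSession

# On EMR, Spark reads s3:// paths directly through EMRFS, so the cluster
# can be transient and several clusters can share the same S3 data.
spark = SparkSession.builder.appName("data-lake-query").getOrCreate()

events = spark.read.json("s3://example-data-lake/raw/events/2016/10/")
events.groupBy("event_type").count().show()
```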

Page 25: Building a Data Lake on AWS

AWS Direct Connect

AWS Snowball

ISV Connectors

Amazon Kinesis Firehose

S3 Transfer Acceleration

AWS Storage Gateway

Data Ingestion into Amazon S3
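For streaming sources, one option is Amazon Kinesis Firehose delivering batches into the bucket. A minimal sketch; the delivery stream name and record shape are assumptions:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Push one event into a hypothetical delivery stream ("data-lake-ingest")
# that is configured to buffer and deliver objects into the S3 bucket.
event = {"device_id": "tv-1234", "event_type": "play", "ts": "2016-10-26T12:00:00Z"}
firehose.put_record(
    DeliveryStreamName="data-lake-ingest",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```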

Page 26: Building a Data Lake on AWS

Choice of storage classes on S3

Standard: active data

Standard - Infrequent Access: infrequently accessed data

Amazon Glacier: archive data
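Objects known to be cold can also be written straight into a cheaper class at upload time. A small sketch with placeholder names:

```python
import boto3

s3 = boto3.client("s3")

# Data that will only be read occasionally goes straight to Standard - IA;
# archival tiering into Glacier is handled by lifecycle policies instead.
s3.put_object(
    Bucket="example-data-lake",
    Key="archive/2015/summary.csv.gz",
    Body=b"...",
    StorageClass="STANDARD_IA",
)
```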

Page 27: Building a Data Lake on AWS

Security, Encryption, Compliance

Security: Identity and Access Management (IAM) policies, bucket policies, Access Control Lists (ACLs), query string authentication, SSL endpoints

Encryption: server-side encryption (SSE-S3), server-side encryption with provided keys (SSE-C, SSE-KMS), client-side encryption

Compliance: bucket access logs, lifecycle management policies, Access Control Lists (ACLs), versioning & MFA deletes, certifications (HIPAA, PCI, SOC 1/2/3, etc.)

Implement the right controls
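One way to make such controls "governance through technology" is a bucket policy that rejects unencrypted uploads. A hedged sketch; the bucket name is a placeholder and the policy is illustrative, not a complete security baseline:

```python
import json
import boto3

s3 = boto3.client("s3")

# Deny any PutObject request that does not ask for KMS server-side encryption.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyUnencryptedUploads",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::example-data-lake/*",
        "Condition": {
            "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
        },
    }],
}
s3.put_bucket_policy(Bucket="example-data-lake", Policy=json.dumps(policy))
```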

Page 28: Building a Data Lake on AWS

Use Case

“We use S3 as the ‘source of truth’ for our cloud-based data warehouse. Any dataset that is worth retaining is stored on S3. This includes data from billions of streaming events from (Netflix-enabled) televisions, laptops, and mobile devices every hour captured by our log data pipeline (called Ursula), plus dimension data from Cassandra supplied by our Aegisthus pipeline.”

Eva Tse, Director, Big Data Platform, Netflix

Source: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html

Page 29: Building a Data Lake on AWS

Tip #1: Use versioning

Protects from accidental overwrites and deletes

New version with every upload

Easy retrieval of deleted objects and roll back to previous versions

Versioning
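Enabling versioning is a one-time bucket setting. A minimal boto3 sketch with a placeholder bucket name:

```python
import boto3

s3 = boto3.client("s3")

# Keep a recoverable prior version on every overwrite or delete.
s3.put_bucket_versioning(
    Bucket="example-data-lake",
    VersioningConfiguration={"Status": "Enabled"},
)
```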

Page 30: Building a Data Lake on AWS

Tip #2: Use lifecycle policies

Automatic tiering and cost controls

Includes two possible actions:

Transition: archives to Standard - IA or Amazon Glacier based on the object age you specify

Expiration: deletes objects after a specified time

Actions can be combined

Set policies at the bucket or prefix level

Set policies for current or noncurrent versions

Lifecycle policies
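A sketch of such a policy via boto3: transition, then expire, objects under a placeholder "logs/" prefix; the day counts are illustrative:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-and-expire-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            # Transition: tier down as objects age.
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Expiration: delete after a year.
            "Expiration": {"Days": 365},
        }]
    },
)
```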

Page 31: Building a Data Lake on AWS

Versioning

Lifecycle policies

Recycle bin

Automatic cleaning

Versioning + lifecycle policies

Page 32: Building a Data Lake on AWS

Expired object delete marker policy

Deleting a versioned object makes a delete marker the current version of the object

Removing expired object delete markers can improve list performance

Lifecycle policy automatically removes the current version delete marker when previous versions of the object no longer exist

Expired object delete marker
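As a sketch, the corresponding rule (added to the same lifecycle configuration call shown earlier) might look like this; the 30-day window is illustrative:

```python
# Removes noncurrent versions after 30 days and cleans up delete markers
# that no longer shield any versions, which helps LIST performance.
rule = {
    "ID": "clean-up-delete-markers",
    "Filter": {"Prefix": ""},
    "Status": "Enabled",
    "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
    "Expiration": {"ExpiredObjectDeleteMarker": True},
}
```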

Page 33: Building a Data Lake on AWS


Enable policy with the console

Page 34: Building a Data Lake on AWS

Incomplete multipart upload expiration policy

Partial uploads do incur storage charges

Set a lifecycle policy to automatically expire incomplete multipart uploads after a predefined number of days

Incomplete multipart upload expiration

Best Practice
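A sketch of that rule (illustrative values again, same lifecycle configuration call as earlier):

```python
# Abort multipart uploads that have been incomplete for more than 7 days,
# so the already-uploaded parts stop accruing storage charges.
rule = {
    "ID": "abort-stale-multipart-uploads",
    "Filter": {"Prefix": ""},
    "Status": "Enabled",
    "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
}
```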

Page 35: Building a Data Lake on AWS

Enable policy with the Management Console

Page 36: Building a Data Lake on AWS

Considerations for organizing your Data Lake

Amazon S3 storage uses a flat keyspace

Separate data by business unit, application, type, and time

Natural data partitioning is very useful

Paths should be self-documenting and intuitive

Changing the prefix structure later is hard and costly
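As a sketch of a self-documenting, partitioned layout (the business unit / dataset / date scheme here is just one possible convention):

```python
from datetime import date

def object_key(business_unit, dataset, day, filename):
    """Build a self-documenting, time-partitioned S3 key; layout is illustrative."""
    return (f"{business_unit}/{dataset}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}/{filename}")

print(object_key("marketing", "clickstream", date(2016, 10, 26), "part-0000.parquet"))
# marketing/clickstream/year=2016/month=10/day=26/part-0000.parquet
```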

Page 37: Building a Data Lake on AWS

Best Practices for your Data Lake

As a first rule of thumb, always store a copy of the raw input

Use automation with S3 events to enable trigger-based workflows (see the sketch after this list)

Use a format that supports your data, rather than force your data into a format

Apply compression everywhere to reduce the network load
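The sketch below illustrates the trigger-based pattern from the second bullet: a Lambda function subscribed to S3 object-created events pulls out the bucket and key, which is the entry point for cataloguing or ETL. The handler body is an assumption, not the deck's implementation:

```python
import urllib.parse

def handler(event, context):
    """Invoked by S3 "object created" event notifications."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        print(f"New object: s3://{bucket}/{key}")
        # ...kick off cataloguing, validation, or downstream ETL here...
```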

Page 38: Building a Data Lake on AWS

Thank you!