Building a Data Lake on AWS
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Steve Abraham, Solutions Architect
October 26, 2016
Building a Data Lake on AWS
Evolution of “Data Lakes”
Databases
Transactions
Data warehouse
Evolution of big data architecture
Extract, transform and load (ETL)
Databases
Files
Transactions
Logs
Data warehouse
Evolution of big data architecture
ETL
ETL
Databases
Files
Streams
Transactions
Logs
Events
Data warehouse
Evolution of big data architecture
? Hadoop ?
ETL
ETL
Amazon Glacier
Amazon S3 Amazon DynamoDB
Amazon RDS
Amazon EMR
Amazon Redshift
AWS Data Pipeline
Amazon Kinesis Amazon CloudSearch
Amazon Kinesis-enabled app
AWS Lambda Amazon Machine Learning
Amazon SQS
Amazon ElastiCache
Amazon DynamoDB Streams
A growing ecosystem…
Databases
Files
Streams
Transactions
Logs
Events
Data warehouse
Data Lake
The Genesis of “Data Lakes”
What really is a “Data Lake”?
Components of a Data Lake
Collect & Store
Catalogue & Search
Entitlements
API & UI: an API and user interface that expose these features to internal and external users
A robust set of security controls – governance through technology, not policy
A search index and workflow which enables data discovery
A foundation of highly durable data storage and streaming of any type of data
Storage
High durability
Stores raw data from input sources
Support for any type of data
Low cost
Data Lake – Hadoop (HDFS) as the Storage
Search
Access
Query
Process
Archive
Transactions
Data Lake – Amazon S3 as the Storage
Search
Access
Query
Process
Archive
Amazon RDS
Amazon DynamoDB
Amazon Elasticsearch Service
Amazon Glacier
Amazon S3
Amazon Redshift
Amazon Elastic MapReduce
Amazon Machine Learning
Amazon ElastiCache
Metadata lake
Used for summary statistics and data classification management
Simplified model for data discovery & governance
Catalogue & search
Catalogue & Search Architecture
Encryption for Data Protection
Authentication & Authorization
Access Control & Restrictions
Entitlements
Data Protection via Encryption
AWS CloudHSM: dedicated tenancy, SafeNet Luna SA HSM device, Common Criteria EAL4+, NIST FIPS 140-2
AWS Key Management Service: automated key rotation & auditing, integration with other AWS services
AWS server-side encryption: AWS managed key infrastructure
Entitlements – Access to Encryption Keys
Customer master key
Customer data keys: ciphertext key, plaintext key
IAM temporary credential from the Security Token Service
MyData
S3 object…
Name: MyData, Key: ciphertext key, …
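The flow above is standard KMS envelope encryption: the customer master key never leaves KMS, a per-object data key encrypts the payload, and the ciphertext form of that data key travels with the object. As a minimal sketch of how a client might implement this with boto3 (the key alias, bucket, and metadata field name are hypothetical, not from the deck):

import base64
import os
import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kms = boto3.client("kms")
s3 = boto3.client("s3")

# Ask KMS for a data key under the customer master key (CMK).
# 'Plaintext' is used locally and then discarded; 'CiphertextBlob'
# travels with the object so it can later be decrypted via kms.decrypt().
resp = kms.generate_data_key(KeyId="alias/my-data-lake-cmk", KeySpec="AES_256")
plaintext_key, ciphertext_key = resp["Plaintext"], resp["CiphertextBlob"]

# Encrypt the payload client-side with the plaintext data key.
nonce = os.urandom(12)
encrypted = nonce + AESGCM(plaintext_key).encrypt(nonce, b"MyData contents", None)

# Store the object plus its encrypted (ciphertext) data key as object metadata.
s3.put_object(
    Bucket="my-data-lake-bucket",
    Key="MyData",
    Body=encrypted,
    Metadata={"x-amz-key": base64.b64encode(ciphertext_key).decode()},
)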
Exposes the data lake to customers
Programmatically query the catalogue
Expose a search API
Ensures that entitlements are respected
API & UI
API & UI Architecture
API Gateway
UI - Elastic Beanstalk
AWS Lambda
Metadata Index
Users
IAM
TVM - Elastic Beanstalk
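As a rough sketch of that API layer, the Lambda function behind API Gateway could look like the following, assuming the metadata index is a DynamoDB table named data-lake-metadata with a dataset hash key; the table name, key schema, and query parameter are illustrative assumptions rather than part of the deck.

import json
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical DynamoDB table acting as the metadata index.
table = boto3.resource("dynamodb").Table("data-lake-metadata")

def handler(event, context):
    """API Gateway proxy handler: look up catalogue entries for a dataset."""
    params = event.get("queryStringParameters") or {}
    dataset = params.get("dataset", "")
    result = table.query(KeyConditionExpression=Key("dataset").eq(dataset))

    # Entitlements are enforced upstream (IAM / STS); this handler only
    # returns the metadata items the caller is allowed to see.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result["Items"], default=str),
    }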
Putting It All Together
Amazon Kinesis
Amazon S3 Amazon Glacier
IAM
Encrypted Data
Security Token Service
AWS Lambda
Search index
Metadata Index
API Gateway
Users
UI - Elastic Beanstalk
KMS
Collect & Store
Catalogue & Search
Entitlements & Access Controls
APIs & UI
Amazon S3 - Foundation for your Data Lake
Durable: designed for 99.999999999% (11 9s) of durability
Available: designed for 99.99% availability
High performance: multipart upload, range GET
Scalable: store as much as you need, scale storage and compute independently, no minimum usage commitments
Integrated: Amazon Elastic MapReduce (EMR), Amazon Redshift, Amazon DynamoDB
Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notifications, lifecycle policies
Why Amazon S3 for Data Lake?
Why Amazon S3 for Data Lake?
Natively supported by frameworks such as Spark, Hive, and Presto
Can run transient Hadoop clusters
Multiple clusters can use the same data
Highly durable, available, and scalable
Low Cost: S3 Standard starts at $0.0275 per GB per month
AWS Direct Connect AWS Snowball ISV Connectors
Amazon Kinesis Firehose
S3 Transfer Acceleration
AWS Storage Gateway
Data Ingestion into Amazon S3
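Of these options, Amazon Kinesis Firehose is the simplest way to stream records straight into S3. A minimal producer sketch, assuming a delivery stream named data-lake-ingest that is already configured to deliver to your bucket:

import json
import boto3

firehose = boto3.client("firehose")

def send_event(event: dict) -> None:
    """Push one JSON record; Firehose batches and delivers it to S3."""
    firehose.put_record(
        DeliveryStreamName="data-lake-ingest",  # hypothetical stream name
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

send_event({"source": "web", "action": "click", "ts": "2016-10-26T12:00:00Z"})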
Choice of storage classes on S3
Standard: active data
Standard - Infrequent Access: infrequently accessed data
Amazon Glacier: archive data
Encryption
Security
Compliance
Identity and Access Management (IAM) policies
Bucket policies
Access Control Lists (ACLs)
Query string authentication
SSL endpoints
Server-side encryption (SSE-S3)
Server-side encryption with provided keys (SSE-C, SSE-KMS)
Client-side encryption
Bucket access logs
Lifecycle management policies
Access Control Lists (ACLs)
Versioning & MFA deletes
Certifications: HIPAA, PCI, SOC 1/2/3, etc.
Implement the right controls
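One way to implement such controls programmatically is a bucket policy that rejects requests made without SSL. The sketch below uses a hypothetical bucket name; the same pattern can be extended to require server-side encryption headers on uploads.

import json
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake-bucket"  # hypothetical

# Deny any request to the bucket that is not made over SSL/TLS.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))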
Use Case
We use S3 as the “source of truth” for our cloud-based data warehouse. Any dataset that is worth retaining is stored on S3. This includes data from billions of streaming events from (Netflix-enabled) televisions, laptops, and mobile devices every hour captured by our log data pipeline (called Ursula), plus dimension data from Cassandra supplied by our Aegisthus pipeline.
Source: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
Eva Tse, Director, Big Data Platform
Tip #1: Use versioning
Protects from accidental overwrites and deletes
New version with every upload
Easy retrieval of deleted objects and roll back to previous versions
Versioning
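Enabling versioning is a single bucket-level setting. A minimal boto3 sketch, with a hypothetical bucket name:

import boto3

s3 = boto3.client("s3")

# Every PUT now creates a new version; deletes add a delete marker
# instead of removing data, so objects can be restored or rolled back.
s3.put_bucket_versioning(
    Bucket="my-data-lake-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)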
Tip #2: Use lifecycle policies
Automatic tiering and cost controls
Includes two possible actions:
Transition: archives to Standard - IA or Amazon Glacier based on the object age you specify
Expiration: deletes objects after a specified time
Actions can be combined
Set policies at the bucket or prefix level
Set policies for current or noncurrent versions
Lifecycle policies
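A rule that combines both actions might look like the following sketch; the bucket name, prefix, and day counts are illustrative only.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-then-expire-raw-logs",
            "Filter": {"Prefix": "raw/logs/"},          # prefix-level policy
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequently accessed
                {"Days": 365, "StorageClass": "GLACIER"},     # archive
            ],
            "Expiration": {"Days": 1825},                     # delete after ~5 years
            "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
        }]
    },
)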
Versioning
Lifecycle policies
Recycle bin
Automatic cleaning
Versioning + lifecycle policies
Expired object delete marker policy
Deleting a versioned object makes a delete marker the current version of the object
Removing expired object delete markers can improve list performance
Lifecycle policy automatically removes the current version delete marker when previous versions of the object no longer exist
Expired object delete marker
Enable policy with the console
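The same policy can also be applied with the SDK rather than the console. The sketch below (hypothetical bucket name and day count) expires noncurrent versions and then removes the delete markers left behind once no versions remain.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "clean-up-delete-markers",
            "Filter": {"Prefix": ""},                      # whole bucket
            "Status": "Enabled",
            # Permanently remove old versions...
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            # ...and drop delete markers once no versions remain behind them.
            "Expiration": {"ExpiredObjectDeleteMarker": True},
        }]
    },
)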
Incomplete multipart upload expiration policy
Partial uploads do incur storage charges
Set a lifecycle policy to automatically expire incomplete multipart uploads after a predefined number of days
Incomplete multipart upload expiration
Best Practice
Enable policy with the Management Console
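The corresponding rule via the SDK, again with an illustrative bucket name and day count:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "abort-stale-multipart-uploads",
            "Filter": {"Prefix": ""},
            "Status": "Enabled",
            # Abandoned multipart parts are billed until aborted; clean them up.
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }]
    },
)

Note that put_bucket_lifecycle_configuration replaces the bucket's entire lifecycle configuration, so in practice this rule would be merged into a single Rules list with the transition and expiration rules shown earlier.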
Considerations for organizing your Data Lake
Amazon S3 storage uses a flat keyspace
Separate data by business unit, application, type, and time
Natural data partitioning is very useful
Paths should be self-documenting and intuitive
Changing the prefix structure later is hard and costly
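Because the keyspace is flat, partitioning lives entirely in the key names. A hypothetical helper that builds self-documenting, time-partitioned keys by business unit, application, and data type:

from datetime import datetime, timezone

def object_key(business_unit: str, application: str, data_type: str,
               filename: str, ts: datetime) -> str:
    """Build a self-documenting, time-partitioned S3 key."""
    return (f"{business_unit}/{application}/{data_type}/"
            f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/{filename}")

# e.g. 'marketing/clickstream/raw/year=2016/month=10/day=26/events-0001.gz'
print(object_key("marketing", "clickstream", "raw",
                 "events-0001.gz", datetime(2016, 10, 26, tzinfo=timezone.utc)))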
Best Practices for your Data Lake
As a first rule of thumb, always store a copy of the raw input
Use automation with S3 events to enable trigger-based workflows
Use a format that supports your data, rather than force your data into a format
Apply compression everywhere to reduce the network load
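A trigger-based workflow can be as simple as an S3 ObjectCreated event notification invoking a Lambda function whenever a new raw object lands. The sketch below only registers the arrival in a hypothetical metadata table; a real pipeline would kick off validation, format conversion, or cataloguing here.

import urllib.parse
import boto3

# Hypothetical DynamoDB table used as the metadata/catalogue index.
table = boto3.resource("dynamodb").Table("data-lake-metadata")

def handler(event, context):
    """Invoked by an S3 ObjectCreated event notification."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)

        # Register the new raw object so it becomes discoverable in the catalogue.
        table.put_item(Item={"dataset": bucket, "key": key, "bytes": size})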
Thank you!