Building a Data Lake on AWS
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Steve Abraham, Solutions Architect
October 26, 2016
Building a Data Lake on AWS
Evolution of “Data Lakes”
Databases
Transactions
Data warehouse
Evolution of big data architecture
Extract, transform and load (ETL)
Databases
Files
Transactions
Logs
Data warehouse
Evolution of big data architecture
ETL
ETL
Databases
Files
Streams
Transactions
Logs
Events
Data warehouse
Evolution of big data architecture
? Hadoop ?
ETL
ETL
Amazon Glacier
Amazon S3 Amazon DynamoDB
Amazon RDS
Amazon EMR
Amazon Redshift
AWS Data Pipeline
Amazon Kinesis Amazon CloudSearch
Amazon Kinesis-enabled app
AWS Lambda Amazon Machine Learning
Amazon SQS
Amazon ElastiCache
Amazon DynamoDB Streams
A growing ecosystem…
Databases
Files
Streams
Transactions
Logs
Events
Data warehouse
Data Lake
The Genesis of “Data Lakes”
What really is a “Data Lake”?
Components of a Data Lake
Collect & Store
Catalogue & Search
Entitlements
API & UI: an API and user interface that expose these features to internal and external users
A robust set of security controls – governance through technology, not policy
A search index and workflow which enables data discovery
A foundation of highly durable data storage and streaming of any type of data
Storage
High durability
Stores raw data from input sources
Support for any type of data
Low cost
Data Lake – Hadoop (HDFS) as the Storage
Search
Access
Query
Process
Archive
Transactions
Data Lake – Amazon S3 as the Storage
Search
Access
Query
Process
Archive
Amazon RDS
Amazon DynamoDB
Amazon Elasticsearch Service
Amazon Glacier
Amazon S3
Amazon Redshift
Amazon Elastic MapReduce
Amazon Machine Learning
Amazon ElastiCache
Metadata lake
Used for summary statistics and data classification management
Simplified model for data discovery & governance
Catalogue & search
Catalogue & Search Architecture
Encryption for Data Protection
Authentication & Authorization
Access Control & Restrictions
Entitlements
Data Protection via Encryption
AWS CloudHSM: dedicated tenancy, SafeNet Luna SA HSM device, Common Criteria EAL4+, NIST FIPS 140-2
AWS Key Management Service: automated key rotation & auditing, integration with other AWS services
AWS server-side encryption: AWS managed key infrastructure
Entitlements – Access to Encryption Keys
Customer master key
Customer data keys: ciphertext key, plaintext key
IAM temporary credential from the Security Token Service
MyData
S3 object…
Name: MyData, Key: ciphertext key, …
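The flow above is standard KMS envelope encryption: the customer master key never leaves KMS, a per-object data key encrypts the payload, and the ciphertext form of that data key travels with the object. As a minimal sketch of how a client might implement this with boto3 (the key alias, bucket, and metadata field name are hypothetical, not from the deck):

import base64
import os
import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kms = boto3.client("kms")
s3 = boto3.client("s3")

# Ask KMS for a data key under the customer master key (CMK).
# 'Plaintext' is used locally and then discarded; 'CiphertextBlob'
# travels with the object so it can later be decrypted via kms.decrypt().
resp = kms.generate_data_key(KeyId="alias/my-data-lake-cmk", KeySpec="AES_256")
plaintext_key, ciphertext_key = resp["Plaintext"], resp["CiphertextBlob"]

# Encrypt the payload client-side with the plaintext data key.
nonce = os.urandom(12)
encrypted = nonce + AESGCM(plaintext_key).encrypt(nonce, b"MyData contents", None)

# Store the object plus its encrypted (ciphertext) data key as object metadata.
s3.put_object(
    Bucket="my-data-lake-bucket",
    Key="MyData",
    Body=encrypted,
    Metadata={"x-amz-key": base64.b64encode(ciphertext_key).decode()},
)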
Exposes the data lake to customers
Programmatically query the catalogue
Expose a search API
Ensures that entitlements are respected
API & UI
API & UI Architecture
API Gateway
UI - Elastic Beanstalk
AWS Lambda
Metadata Index
Users
IAM
TVM - Elastic Beanstalk
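As a rough sketch of that API layer, the Lambda function behind API Gateway could look like the following, assuming the metadata index is a DynamoDB table named data-lake-metadata with a dataset hash key; the table name, key schema, and query parameter are illustrative assumptions rather than part of the deck.

import json
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical DynamoDB table acting as the metadata index.
table = boto3.resource("dynamodb").Table("data-lake-metadata")

def handler(event, context):
    """API Gateway proxy handler: look up catalogue entries for a dataset."""
    params = event.get("queryStringParameters") or {}
    dataset = params.get("dataset", "")
    result = table.query(KeyConditionExpression=Key("dataset").eq(dataset))

    # Entitlements are enforced upstream (IAM / STS); this handler only
    # returns the metadata items the caller is allowed to see.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result["Items"], default=str),
    }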
Putting It All Together
Amazon Kinesis
Amazon S3 Amazon Glacier
IAM
Encrypted Data
Security Token Service
AWS Lambda
Search index
Metadata Index
API Gateway
Users
UI - Elastic Beanstalk
KMS
Collect & Store
Catalogue & Search
Entitlements & Access Controls
APIs & UI
Amazon S3 - Foundation for your Data Lake
Durable: designed for 99.999999999% (11 9s) of durability
Available: designed for 99.99% availability
High performance: multipart upload, range GET
Scalable: store as much as you need, scale storage and compute independently, no minimum usage commitments
Integrated: Amazon Elastic MapReduce (EMR), Amazon Redshift, Amazon DynamoDB
Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notifications, lifecycle policies
Why Amazon S3 for Data Lake?
Why Amazon S3 for Data Lake?
Natively supported by frameworks such as Spark, Hive, and Presto
Can run transient Hadoop clusters
Multiple clusters can use the same data
Highly durable, available, and scalable
Low Cost: S3 Standard starts at $0.0275 per GB per month
AWS Direct Connect AWS Snowball ISV Connectors
Amazon Kinesis Firehose
S3 Transfer Acceleration
AWS Storage Gateway
Data Ingestion into Amazon S3
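Of these options, Amazon Kinesis Firehose is the simplest way to stream records straight into S3. A minimal producer sketch, assuming a delivery stream named data-lake-ingest that is already configured to deliver to your bucket:

import json
import boto3

firehose = boto3.client("firehose")

def send_event(event: dict) -> None:
    """Push one JSON record; Firehose batches and delivers it to S3."""
    firehose.put_record(
        DeliveryStreamName="data-lake-ingest",  # hypothetical stream name
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

send_event({"source": "web", "action": "click", "ts": "2016-10-26T12:00:00Z"})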
Choice of storage classes on S3
Standard: active data
Standard - Infrequent Access: infrequently accessed data
Amazon Glacier: archive data
Encryption
Security
Compliance
Identity and Access Management (IAM) policies
Bucket policies
Access Control Lists (ACLs)
Query string authentication
SSL endpoints
Server-side encryption (SSE-S3)
Server-side encryption with provided keys (SSE-C, SSE-KMS)
Client-side encryption
Bucket access logs
Lifecycle management policies
Access Control Lists (ACLs)
Versioning & MFA deletes
Certifications: HIPAA, PCI, SOC 1/2/3, etc.
Implement the right controls
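One way to implement such controls programmatically is a bucket policy that rejects requests made without SSL. The sketch below uses a hypothetical bucket name; the same pattern can be extended to require server-side encryption headers on uploads.

import json
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake-bucket"  # hypothetical

# Deny any request to the bucket that is not made over SSL/TLS.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))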
Use Case
We use S3 as the “source of truth” for our cloud-based data warehouse. Any dataset that is worth retaining is stored on S3. This includes data from billions of streaming events from (Netflix-enabled) televisions, laptops, and mobile devices every hour captured by our log data pipeline (called Ursula), plus dimension data from Cassandra supplied by our Aegisthus pipeline.
Source: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
Eva Tse, Director, Big Data Platform
Tip #1: Use versioning
Protects from accidental overwrites and deletes
New version with every upload
Easy retrieval of deleted objects and roll back to previous versions
Versioning
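Enabling versioning is a single bucket-level setting. A minimal boto3 sketch, with a hypothetical bucket name:

import boto3

s3 = boto3.client("s3")

# Every PUT now creates a new version; deletes add a delete marker
# instead of removing data, so objects can be restored or rolled back.
s3.put_bucket_versioning(
    Bucket="my-data-lake-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)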
Tip #2: Use lifecycle policies
Automatic tiering and cost controls
Includes two possible actions:
Transition: archives to Standard - IA or Amazon Glacier based on the object age you specify
Expiration: deletes objects after a specified time
Actions can be combined
Set policies at the bucket or prefix level
Set policies for current or noncurrent versions
Lifecycle policies
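A rule that combines both actions might look like the following sketch; the bucket name, prefix, and day counts are illustrative only.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-then-expire-raw-logs",
            "Filter": {"Prefix": "raw/logs/"},          # prefix-level policy
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequently accessed
                {"Days": 365, "StorageClass": "GLACIER"},     # archive
            ],
            "Expiration": {"Days": 1825},                     # delete after ~5 years
            "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
        }]
    },
)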
Versioning
Lifecycle policies
Recycle bin
Automatic cleaning
Versioning + lifecycle policies
Expired object delete marker policy
Deleting a versioned object makes a delete marker the current version of the object
Removing expired object delete markers can improve list performance
Lifecycle policy automatically removes the current version delete marker when previous versions of the object no longer exist
Expired object delete marker
Enable policy with the console
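The same policy can also be applied with the SDK rather than the console. The sketch below (hypothetical bucket name and day count) expires noncurrent versions and then removes the delete markers left behind once no versions remain.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "clean-up-delete-markers",
            "Filter": {"Prefix": ""},                      # whole bucket
            "Status": "Enabled",
            # Permanently remove old versions...
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            # ...and drop delete markers once no versions remain behind them.
            "Expiration": {"ExpiredObjectDeleteMarker": True},
        }]
    },
)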
Incomplete multipart upload expiration policy
Partial uploads do incur storage charges
Set a lifecycle policy to automatically expire incomplete multipart uploads after a predefined number of days
Incomplete multipart upload expiration
Best Practice
Enable policy with the Management Console
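The corresponding rule via the SDK, again with an illustrative bucket name and day count:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "abort-stale-multipart-uploads",
            "Filter": {"Prefix": ""},
            "Status": "Enabled",
            # Abandoned multipart parts are billed until aborted; clean them up.
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }]
    },
)

Note that put_bucket_lifecycle_configuration replaces the bucket's entire lifecycle configuration, so in practice this rule would be merged into a single Rules list with the transition and expiration rules shown earlier.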
Considerations for organizing your Data Lake
Amazon S3 storage uses a flat keyspace
Separate data by business unit, application, type, and time
Natural data partitioning is very useful
Paths should be self-documenting and intuitive
Changing the prefix structure later is hard and costly
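Because the keyspace is flat, partitioning lives entirely in the key names. A hypothetical helper that builds self-documenting, time-partitioned keys by business unit, application, and data type:

from datetime import datetime, timezone

def object_key(business_unit: str, application: str, data_type: str,
               filename: str, ts: datetime) -> str:
    """Build a self-documenting, time-partitioned S3 key."""
    return (f"{business_unit}/{application}/{data_type}/"
            f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/{filename}")

# e.g. 'marketing/clickstream/raw/year=2016/month=10/day=26/events-0001.gz'
print(object_key("marketing", "clickstream", "raw",
                 "events-0001.gz", datetime(2016, 10, 26, tzinfo=timezone.utc)))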
Best Practices for your Data Lake
As a first rule of thumb, always store a copy of the raw input
Use automation with S3 events to enable trigger-based workflows
Use a format that supports your data, rather than force your data into a format
Apply compression everywhere to reduce the network load
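A trigger-based workflow can be as simple as an S3 ObjectCreated event notification invoking a Lambda function whenever a new raw object lands. The sketch below only registers the arrival in a hypothetical metadata table; a real pipeline would kick off validation, format conversion, or cataloguing here.

import urllib.parse
import boto3

# Hypothetical DynamoDB table used as the metadata/catalogue index.
table = boto3.resource("dynamodb").Table("data-lake-metadata")

def handler(event, context):
    """Invoked by an S3 ObjectCreated event notification."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)

        # Register the new raw object so it becomes discoverable in the catalogue.
        table.put_item(Item={"dataset": bucket, "key": key, "bytes": size})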
Thank you!