AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Platform (BDA206)
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Matt Yanchyshyn
Sr. Manager, Solutions Architecture, AWS
November 30, 2016
Building Big Data Applications
with the AWS Big Data Platform
BDA206
Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & insights
START HERE WITH A BUSINESS CASE
[Diagram: AWS big data services across the Collect, Store, and Analyze stages: AWS Data Pipeline, AWS Database Migration Service, Amazon EMR, Amazon Glacier, Amazon S3, Amazon Kinesis, AWS Direct Connect, Amazon Machine Learning, Amazon Redshift, Amazon DynamoDB, AWS IoT, AWS Snowball, Amazon QuickSight, Amazon Athena, Amazon EC2, Amazon Elasticsearch Service, AWS Lambda]
Building a Big Data Application: Build a data warehouse with Amazon Redshift
[Diagram components: web clients, mobile clients, DBMS, Amazon Redshift. Regions: corporate data center, AWS Cloud.]
Structured Data Processing
• Petabyte-scale relational, MPP, data warehousing
• Fully managed with SSD and HDD platforms
• Built-in end-to-end security, including customer-managed keys
• Fault-tolerant. Automatically recovers from disk and node failures
• Data automatically backed up to Amazon S3 with cross-region
backup capability for global disaster recovery
• Over 140 new features added since launch
• $1,000/TB/Year; start at $0.25/hour. Provision in minutes; scale
from 160 GB to 2 PB of compressed data with just a few clicks
Amazon Redshift
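One common way to populate a Redshift table is the SQL COPY command reading from Amazon S3. The sketch below only builds such a statement as a string. The helper function is illustrative and not part of any AWS SDK, and the table name, bucket path, and IAM role ARN are hypothetical placeholders.

```python
def build_copy_statement(table, s3_path, iam_role_arn, fmt="CSV", gzip=False):
    """Return a Redshift COPY statement that loads `table` from Amazon S3.

    Nothing here connects to AWS; the caller runs the statement through
    whatever SQL client it already uses.
    """
    if not s3_path.startswith("s3://"):
        raise ValueError("s3_path must start with s3://")
    parts = [
        f"COPY {table}",
        f"FROM '{s3_path}'",
        f"IAM_ROLE '{iam_role_arn}'",
        f"FORMAT AS {fmt}",
    ]
    if gzip:
        # Tells Redshift that the input files are gzip-compressed.
        parts.append("GZIP")
    return "\n".join(parts) + ";"


# Hypothetical names, for illustration only.
statement = build_copy_statement(
    table="sales",
    s3_path="s3://example-bucket/staging/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
    gzip=True,
)
print(statement)
```

The values are interpolated directly into the SQL text, so this sketch is only suitable for trusted, operator-supplied names.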
How do you get your (big) data into AWS?
Building a Big Data Application: Migrate your data to AWS
[Diagram components: web clients, mobile clients, DBMS, Amazon Redshift. Regions: corporate data center, AWS Cloud. Migration services: AWS Database Migration Service, AWS Direct Connect, AWS Import/Export & Snowball.]
• Start your first migration in 10 minutes or less
• Keep your apps running during the migration
• Migrate to databases running on Amazon EC2, Amazon RDS, or Amazon Redshift
AWS Database Migration Service
AWS Snowball: PB-scale Data Transport
• E-ink shipping label
• Ruggedized case, “8.5G Impact”
• All data encrypted end-to-end
• 50TB & 80TB
• 10G network
• Rain & dust resistant
• Tamper-resistant case & electronics
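A back-of-the-envelope calculation shows why a shipped appliance can beat a network link for large data sets. The sketch below estimates how long it takes to move a given volume at a given line rate. It assumes decimal terabytes and a link running at the stated utilization; the 1 Gbit/s comparison link is a hypothetical example, not a figure from the slides.

```python
def transfer_hours(terabytes, gigabits_per_second, utilization=1.0):
    """Hours needed to move `terabytes` (decimal TB) over a link.

    `utilization` is the fraction of the line rate actually achieved.
    """
    if gigabits_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("rate must be positive and utilization in (0, 1]")
    bits = terabytes * 1e12 * 8
    seconds = bits / (gigabits_per_second * 1e9 * utilization)
    return seconds / 3600


# Filling an 80 TB appliance over its 10G interface at full line rate.
print(round(transfer_hours(80, 10), 1), "hours")

# The same 80 TB over a hypothetical 1 Gbit/s link.
print(round(transfer_hours(80, 1) / 24, 1), "days")
```

Real transfers rarely sustain full line rate, so the `utilization` argument is the knob to adjust for a more conservative estimate.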
Your CEO doesn’t want to look at
raw SQL query output
Business Intelligence
• Fast and cloud-powered
• Easy to use, no infrastructure to manage
• Scales to 100s of thousands of users
• Quick calculations with SPICE
• 1/10th the cost of legacy BI software
Amazon
QuickSight
Building a Big Data Application: Visualize your data with Amazon QuickSight
[Diagram components: web clients, mobile clients, DBMS, Amazon Redshift, Amazon QuickSight. Regions: corporate data center, AWS Cloud. Migration services: AWS Database Migration Service, AWS Direct Connect, AWS Import/Export & Snowball.]
What if your data isn’t structured?
What if you don’t need all the raw data?
What if you need to combine multiple data sets?
Serverless Event Processing
• Serverless compute service that runs your code in
response to events
• Extend AWS services with user-defined custom logic
• Write custom code in Node.js, Python, and Java
• Pay only for the requests served and compute time
required - billing in increments of 100 milliseconds
AWS Lambda
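As an illustration of event-driven processing, the sketch below shows a Python Lambda handler that reacts to an S3 event notification and a pure function that turns newline-delimited JSON into CSV. The field names, the JSON-lines input format, and the decision to omit the actual S3 reads and writes are assumptions made to keep the example self-contained.

```python
import csv
import io
import json
from urllib.parse import unquote_plus


def json_lines_to_csv(text, fields):
    """Convert newline-delimited JSON into CSV with the given column order."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            writer.writerow(json.loads(line))
    return out.getvalue()


def objects_from_s3_event(event):
    """Yield (bucket, key) pairs from an S3 event notification."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def handler(event, context):
    """Lambda entry point: report which raw objects would be transformed.

    A real function would read each object, call json_lines_to_csv, and
    write the result to a structured-data location; that I/O is omitted.
    """
    return [{"bucket": b, "key": k} for b, k in objects_from_s3_event(event)]
```

Keeping the transformation separate from the handler makes it easy to test locally before wiring it to a bucket notification.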
Building a Big Data Application: Event-driven data transformations with AWS Lambda
[Diagram components: web clients, mobile clients, DBMS, raw data in Amazon S3, AWS Lambda, structured data in Amazon S3, Amazon Redshift, Amazon QuickSight. Regions: corporate data center, AWS Cloud.]
How will this work at scale?
What if the data processing exceeds the timeout?
Semi-structured/Unstructured Data Processing
• Hadoop, Hive, Presto, Spark, Tez, Impala, etc.
• Release 5.2: Hadoop 2.7.3, Hive 2.1, Spark 2.0.2, Zeppelin, Presto, HBase 1.2.3
and HBase on S3, Phoenix, Tez, Flink.
• New applications added within 30 days of their open source release
• Fully managed, Auto Scaling clusters with support for on-demand and
spot pricing
• Support for HDFS and S3 file systems enabling separated compute and
storage; multiple clusters can run against the same data in S3
• HIPAA-eligible. Support for end-to-end encryption, IAM/VPC, S3 client-
side encryption with customer managed keys and AWS KMS
Amazon EMR
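Work is typically handed to a running EMR cluster as a step. The sketch below builds a step definition that runs `spark-submit` through `command-runner.jar`, in the shape accepted by the boto3 EMR client's `add_job_flow_steps` call. The script location, input and output paths, and step name are hypothetical, and the Spark script itself is not shown.

```python
def spark_step(name, script_s3_path, input_s3_path, output_s3_path):
    """Return an EMR step definition that runs a Spark script stored in S3."""
    return {
        "Name": name,
        # Keep the cluster and any later steps running if this step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                script_s3_path, input_s3_path, output_s3_path,
            ],
        },
    }


# Hypothetical paths, for illustration only.
step = spark_step(
    name="raw-to-structured",
    script_s3_path="s3://example-bucket/jobs/transform.py",
    input_s3_path="s3://example-bucket/raw/",
    output_s3_path="s3://example-bucket/structured/",
)
print(step["HadoopJarStep"]["Args"])

# With boto3 installed and credentials configured, the step could be sent with:
#   boto3.client("emr").add_job_flow_steps(JobFlowId="<cluster id>", Steps=[step])
```

Because input and output both live in S3, the same step definition can be submitted to any cluster that can read those paths.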
Building a Big Data Application: Transform and explore your data at scale with Amazon EMR
[Diagram components: web clients, mobile clients, DBMS, raw data in Amazon S3, Amazon EMR, structured data in Amazon S3, Amazon Redshift, Amazon QuickSight. Regions: corporate data center, AWS Cloud.]
What about ad-hoc queries when you are
exploring new data?
Serverless Query Processing
• Serverless query service for querying data in S3 using standard SQL with
no infrastructure to manage
• No data loading required; query directly from Amazon S3
• Use standard ANSI SQL queries with support for joins, JSON, and window
functions
• Support for multiple data formats, including text, CSV, TSV, JSON, Avro,
ORC, Parquet
• Pay per query only when you’re running queries based on data scanned.
If you compress your data, you pay less and your queries run faster
Amazon
Athena
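Because billing is based on data scanned, the effect of compression on cost is simple arithmetic. The sketch below makes that explicit. The per-terabyte price is passed in by the caller; the 5.0 USD figure and the 4:1 compression ratio in the usage lines are assumptions for illustration and do not come from the slides, so check current pricing before relying on them.

```python
def query_cost(bytes_scanned, usd_per_tb):
    """Estimated cost of one query, given bytes scanned and a per-TB price.

    Uses decimal terabytes (1 TB = 1e12 bytes); that unit is an assumption.
    """
    if bytes_scanned < 0 or usd_per_tb < 0:
        raise ValueError("inputs must be non-negative")
    return bytes_scanned / 1e12 * usd_per_tb


PRICE = 5.0  # assumed USD per TB scanned, illustrative only

uncompressed = query_cost(1e12, PRICE)   # scan 1 TB of raw text
compressed = query_cost(2.5e11, PRICE)   # same data at an assumed 4:1 ratio
print(uncompressed, compressed)
```

The same function applies to columnar formats: a query that reads only a few columns scans fewer bytes, so the estimate drops accordingly.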
Building a Big Data Application: Extend your data warehouse to S3 with Amazon Athena
[Diagram components: web clients, mobile clients, DBMS, raw data in Amazon S3, Amazon EMR, staging data in Amazon S3, Amazon Redshift, Amazon Athena, Amazon QuickSight. Regions: corporate data center, AWS Cloud.]
Building a Big Data Application: Extend your data warehouse to S3 with Amazon Athena
[Diagram components: web clients, mobile clients, DBMS, raw data in Amazon S3, Amazon EMR (two clusters), staging data in Amazon S3, ORC/Parquet in Amazon S3 (columnar data format), Amazon Redshift, Amazon Athena, Amazon QuickSight. Regions: corporate data center, AWS Cloud.]
What if I want to run custom code or
multiple frameworks?
Building a Big Data Application: Extend your data warehouse to S3 with Presto, Spark SQL, etc. on Amazon EMR
[Diagram components: web clients, mobile clients, DBMS, raw data in Amazon S3, staging data in Amazon S3, ORC/Parquet in Amazon S3 (columnar data format), Amazon EMR (three clusters), Amazon Redshift, Amazon QuickSight. Regions: corporate data center, AWS Cloud.]
What about real-time data?
Stream Processing
• Real-time stream processing
• High throughput; elastic
• Highly available; data replicated across multiple Availability
Zones with configurable retention
• S3, Amazon Redshift, DynamoDB integrations
• Amazon Kinesis Streams for custom streaming applications;
Amazon Kinesis Firehose for easy integration with Amazon
S3 and Amazon Redshift; Amazon Kinesis Analytics for
streaming SQL
Amazon
Kinesis
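Kinesis Streams assigns each record to a shard by taking the MD5 hash of its partition key and finding the shard whose hash key range contains that 128-bit value. The sketch below reproduces the hashing step and then approximates the shard lookup by assuming the hash space is split evenly across shards; real streams expose their actual ranges, so the even split and the device-style keys are assumptions.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 digests are 128 bits wide


def partition_hash(partition_key):
    """Map a partition key to its 128-bit integer hash key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16)


def shard_index(partition_key, shard_count):
    """Approximate shard for a key, assuming evenly split hash key ranges."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    return partition_hash(partition_key) * shard_count // HASH_SPACE


# Hypothetical keys: see how a handful of device IDs spread over 4 shards.
for key in ("device-1", "device-2", "device-3", "device-4"):
    print(key, shard_index(key, 4))
```

Records that share a partition key always land on the same shard, which preserves per-key ordering but can create hot shards if a few keys dominate the traffic.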
Building a Big Data Application: Add a real-time layer with Amazon Kinesis + Spark on Amazon EMR
[Diagram components: web clients, mobile clients, DBMS, Amazon Kinesis Streams, raw data in Amazon S3, staging data in Amazon S3, ORC/Parquet (columnar data format), Amazon EMR (three clusters), Amazon Redshift, Amazon Athena, Amazon QuickSight. Regions: corporate data center, AWS Cloud.]
Building a Big Data Application: React to real-time data with Amazon Kinesis Analytics and AWS Lambda
[Diagram components: web clients, mobile clients, DBMS, Amazon Kinesis Firehose, Amazon Kinesis Analytics, Amazon Kinesis Streams, AWS Lambda, Amazon SNS, reference data in Amazon S3, Amazon Redshift, Amazon Athena, Amazon QuickSight. Regions: corporate data center, AWS Cloud.]
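One way to react to streaming data is a Lambda function that reads a batch of Kinesis records and raises a notification when a value crosses a threshold. The sketch below decodes the base64 record payloads that Lambda receives from Kinesis and filters them. The `error_rate` field, the 0.05 threshold, and the injectable `publish` callable (standing in for an SNS publish call) are hypothetical choices made so the example runs without AWS access.

```python
import base64
import json


def decode_records(event):
    """Yield the JSON payload of each Kinesis record in a Lambda event."""
    for record in event.get("Records", []):
        # Kinesis record data is delivered to Lambda base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        yield json.loads(payload.decode("utf-8"))


def find_alerts(event, field="error_rate", threshold=0.05):
    """Return the records whose `field` exceeds `threshold`."""
    return [r for r in decode_records(event) if r.get(field, 0) > threshold]


def handler(event, context, publish=print):
    """Lambda entry point: publish one message per alerting record.

    `publish` defaults to print; a deployed function could pass a wrapper
    around an SNS publish call instead.
    """
    alerts = find_alerts(event)
    for alert in alerts:
        publish(json.dumps(alert))
    return {"alerts": len(alerts)}
```

Passing the publisher in as an argument keeps the filtering logic testable with plain dictionaries.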
Building a Big Data Application: React intelligently in real-time with Amazon Machine Learning
[Diagram components: web clients, mobile clients, DBMS, Amazon Kinesis Firehose, Amazon Kinesis Analytics, Amazon Kinesis Streams, AWS Lambda, Amazon Machine Learning, Amazon SNS, reference data in Amazon S3, Amazon Redshift, Amazon Athena, Amazon QuickSight. Regions: corporate data center, AWS Cloud.]
What if you need encryption and network
isolation to meet industry regulations?
Building a Big Data Application: Add encryption at rest with AWS KMS
[Diagram components: web clients, mobile clients, DBMS, Amazon Kinesis Streams, AWS KMS, raw data in S3, staging data in S3, ORC/Parquet in Amazon S3 (columnar data), Amazon EMR (two clusters), Amazon Redshift, Amazon QuickSight. Regions: corporate data center, AWS Cloud.]
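For data stored in S3, encryption at rest with AWS KMS can be requested per object at upload time. The sketch below builds the keyword arguments for the boto3 S3 client's `put_object` call with KMS server-side encryption enabled. The bucket, object key, and KMS key alias are hypothetical, and the helper is illustrative rather than part of boto3.

```python
def encrypted_put_params(bucket, key, body, kms_key_id=None):
    """Return put_object arguments that request SSE-KMS for the object."""
    params = {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",
    }
    if kms_key_id:
        # Without an explicit key ID, S3 falls back to the account's
        # default AWS-managed KMS key for S3.
        params["SSEKMSKeyId"] = kms_key_id
    return params


# Hypothetical names, for illustration only.
params = encrypted_put_params(
    bucket="example-bucket",
    key="staging/part-0000.csv",
    body=b"id,name\n1,x\n",
    kms_key_id="alias/example-data-key",
)
print(sorted(params))

# With boto3 installed and credentials configured:
#   boto3.client("s3").put_object(**params)
```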
Building a Big Data Application: Protect data in transit & add network isolation
[Diagram components: web clients, mobile clients, DBMS, Amazon Kinesis Streams, AWS KMS, VPC subnet, SSL/TLS connections (two), raw data in S3, staging data in S3, ORC/Parquet in Amazon S3 (columnar data), Amazon Redshift, Amazon QuickSight. Regions: corporate data center, AWS Cloud.]
Which customers are doing this?
Redfin
Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & insights
[Diagram labels: Users, Properties, Agents, Hot Homes, User Profile, Recommendation, Similar Homes, Agent Follow-up, Agent Scorecard, Marketing, A/B Testing, Real Time Data, …, BI / Reporting. Services: Amazon Kinesis, Amazon S3 Data Lake, Amazon EMR, Amazon Redshift, Amazon DynamoDB.]
NORDSTROM
Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Outcomes & insights
• Personalized recommendations within seconds (from 15-20 min)
• Scale the expertise of stylists to all shoppers
• Reduce costs by 2 orders of magnitude
• …
[Diagram labels: Mobile Users, Desktop Users, Online Stylist, Analytics Tools. Services: Amazon Kinesis, AWS Lambda (two functions), Amazon DynamoDB, Amazon S3 Data Storage, Amazon Redshift.]
>5 PB, up to 75 billion events per day
[Diagram components: Users; Data Providers; Analytics App (Auto Scaling EC2, two instances); Data Ingestion Services (Auto Scaling EC2); Data Marts (Amazon Redshift); Query Cluster (EMR, two clusters); Ad Hoc Query Cluster (EMR); Batch Analytic Clusters (EMR); Normalization ETL Clusters (EMR); Optimization ETL Clusters (EMR); Source of Truth (S3); Query Optimized (S3); Shared Data Services; Shared Metastore (RDS); Reference Data (RDS); Data Catalog & Lineage Services (Auto Scaling EC2); Cluster Mgt & Workflow Services (Auto Scaling EC2).]
<YOUR COMPANY NAME HERE>
[Diagram components: web clients, mobile clients, DBMS, Amazon Kinesis Firehose, Amazon Kinesis Analytics, Amazon Kinesis Streams, AWS Lambda, Amazon Machine Learning, Amazon SNS, reference data in Amazon S3, Amazon Redshift, Amazon Athena, Amazon QuickSight. Regions: corporate data center, AWS Cloud.]
Thank you!
Remember to complete
your evaluations!