© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Erik Swensson, Shree Kenghe & Erick Dame
January 26, 2016
Getting Started with Big Data Analytic Options on AWS & Common Use Cases
Table of Contents
• Big Data Introduction for AWS
• Big Data Analytics Options on AWS
  • Usage Patterns & Anti-Patterns
  • Performance & Cost
  • Durability & Scalability
  • Interfaces
• Building Big Data Analytic Solutions – The AWS Approach
• Example Scenarios
Big Data on AWS
Immediate Availability. Deploy instantly. No hardware to procure, no infrastructure to maintain & scale
Trusted & Secure. Designed to meet the strictest requirements. Continuously audited, including certifications such as ISO 27001, FedRAMP, DoD CSM, and PCI DSS.
Broad & Deep Capabilities. Over 50 services and 100s of features to support virtually any big data application & workload
Hundreds of Partners & Solutions. Get help from a consulting partner or choose from hundreds of tools and applications across the entire data management stack.
[Figure: AWS big data services arranged by pipeline stage (Collect, Store, Process & Analyze, Visualize)]
• Real-time: Amazon Kinesis Firehose
• Object Storage: Amazon S3
• RDBMS: Amazon RDS
• NoSQL: Amazon DynamoDB
• Hadoop Ecosystem: Amazon EMR
• Real-time: AWS Lambda, Amazon Kinesis Analytics
• Data Warehousing: Amazon Redshift
• Machine Learning: Amazon Machine Learning
• Business Intelligence & Data Visualization: Amazon QuickSight
• Real-time: Amazon Kinesis Streams
• Elasticsearch Analytics: Amazon Elasticsearch Service
• Data Import: AWS Import/Export Snowball
• IoT: AWS IoT
Broad, Tightly Integrated Capabilities
Petabyte scale
Massively parallel
Relational data warehouse
Fully managed, zero admin
As low as $1,000/TB/Year
A lot faster, a lot cheaper, a whole lot simpler
Amazon Redshift
Amazon Redshift
• Ideal Usage Patterns: Analyze
  • Sales data
  • Historical data
  • Gaming data
  • Social trends
  • Ad data
• Performance
  • Massively Parallel Processing
  • Columnar Storage
  • Data Compression
  • Zone Maps
  • Direct-attached Storage
• Cost model
  • No upfront costs or long-term commitments
  • Free backup storage equivalent to 100% of provisioned storage
With columnar storage, you only read the data you need
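The columnar storage point can be illustrated with a toy model. The sketch below is not how Redshift is implemented; it only assumes a hypothetical four-column sales table in which every field takes 8 bytes, and compares how many bytes a row layout and a column layout must scan to sum a single column.

```python
# Illustrative only: a toy model of row vs. columnar layouts, not Redshift internals.
import struct

# Hypothetical sales table: (order_id, region_code, quantity, amount_cents)
ROWS = [(i, i % 4, i % 10, 100 * i) for i in range(1000)]
FIELD = struct.Struct("<q")  # assumption: every field is stored as an 8-byte integer


def row_store_bytes_scanned(rows):
    """A row layout reads every field of every row, even to sum one column."""
    return len(rows) * len(rows[0]) * FIELD.size


def column_store_bytes_scanned(rows):
    """A column layout reads only the values of the requested column."""
    return len(rows) * FIELD.size


def sum_amount_columnar(rows):
    """Pivot rows into per-column tuples, then touch only the amount column."""
    columns = list(zip(*rows))
    return sum(columns[3])


row_bytes = row_store_bytes_scanned(ROWS)
col_bytes = column_store_bytes_scanned(ROWS)
total = sum_amount_columnar(ROWS)
print(row_bytes, col_bytes, total)
```

In this toy table the single-column query scans a quarter of the bytes; the saving grows with the number of columns the query does not need.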
Amazon Redshift
• Scalability & Elasticity
  • Resize or scale: the number or type of nodes can be changed with a few clicks
• Durability and Availability
  • Replication
  • Backup
  • Automated recovery from failed drives & nodes
• Interfaces
  • JDBC/ODBC interface with BI/ETL tools
  • Load data from Amazon S3 or DynamoDB
• Anti-patterns
  • Small datasets
  • OLTP
  • Unstructured data
  • BLOB data
[Figure: Amazon Redshift architecture. SQL clients/BI tools connect over JDBC/ODBC to a leader node, which fronts three compute nodes on a 10 GigE (HPC) network. Each of the four nodes is labeled 128GB RAM, 16TB disk, 16 cores. Ingestion, backup, and restore go through Amazon S3.]
Ingest streaming data
Process data in real-time
Store terabytes of data per hour
Amazon Kinesis
Amazon Kinesis Streams
• Ideal Usage Patterns: Streaming data ingestion and processing
  • Real-time data analytics
  • Data feed intake and processing, e.g. logs
  • Real-time metrics and reporting
• Performance
  • Throughput capacity in terms of shards
• Cost model
  • No upfront costs or long-term commitments
  • Pay-as-you-go pricing
  • Hourly charge per shard
  • Charge per 1 million PUT transactions
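Because throughput and cost are both expressed in shards, sizing a stream is simple arithmetic. The sketch below assumes the per-shard limits published for Kinesis Streams (1 MB/s or 1,000 records/s for writes, 2 MB/s for reads); check current service documentation before relying on them. The prices passed to `monthly_cost` are hypothetical placeholders, and the model treats one record as one PUT transaction, which is a simplification.

```python
import math

# Assumed per-shard limits; verify against current Kinesis documentation.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(records_per_sec, avg_record_kb, consumers=1):
    """Smallest shard count that satisfies write and read throughput."""
    write_mb = records_per_sec * avg_record_kb / 1024.0
    read_mb = write_mb * consumers  # each consumer reads the full stream
    return max(
        1,
        math.ceil(write_mb / WRITE_MB_PER_SHARD),
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb / READ_MB_PER_SHARD),
    )


def monthly_cost(shards, records_per_sec, shard_hour_price, put_million_price, hours=730):
    """Shard-hours plus PUT charges; both prices are caller-supplied assumptions."""
    puts_millions = records_per_sec * 3600 * hours / 1_000_000
    return shards * hours * shard_hour_price + puts_millions * put_million_price


# 5,000 records/s of 2 KB each is about 9.8 MB/s of writes, so write bandwidth dominates.
print(shards_needed(5000, 2))
```

Scaling the stream then means re-running the calculation with the new traffic estimate and changing the shard count to match.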
Amazon Kinesis Streams
• Scalability & Elasticity
  • Scale: increase the number of shards
• Durability and Availability
  • Replication
  • Cursor preservation
• Interfaces
  • Input: data coming in
  • Output: data going out
  • Kinesis Firehose
• Anti-patterns
  • Small-scale consistent throughput
  • Long-term data storage and analytics
Launch a cluster in minutes
Pay by the hour and save with spot
MapReduce, Apache Spark, Presto
Amazon EMR
Amazon EMR
• Ideal Usage Patterns
  • Log processing and analytics
  • Large ETL and data movement
  • Risk modeling and threat analytics
  • Ad targeting and click stream analytics
  • Genomics
  • Predictive analytics
  • Ad-hoc data mining and analytics
• Performance, driven by
  • Type of instance
  • Number of instances
• Cost model
  • Only pay for hours the cluster is up
  • EC2 instance and EMR price
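To make the log-processing pattern concrete, here is a minimal single-process sketch of the MapReduce programming model that EMR runs at cluster scale. It does not use any EMR or Hadoop API; the log format and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical access-log lines: client IP, method, path, HTTP status.
LOG_LINES = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /missing 404",
    "10.0.0.1 POST /login 200",
    "10.0.0.3 GET /index.html 500",
    "10.0.0.2 GET /index.html 200",
]


def map_phase(line):
    """Emit one (status_code, 1) pair per log line."""
    status = line.split()[-1]
    yield status, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts for one status code."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


status_counts = run_job(LOG_LINES)
print(status_counts)
```

On a cluster, the map and reduce functions run in parallel across instances, which is why instance type and instance count drive performance.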
Amazon EMR
• Scalability & Elasticity
  • Resize a running cluster
  • Add more core or task nodes
• Durability and Availability
  • Fault tolerant for slave node (HDFS)
  • Backup to S3 for resilience against master node failures
• Interfaces
  • Hive, Pig, Spark, HBase, Impala, Hunk, Presto, other popular tools
• Anti-patterns
  • Small data sets
  • Workloads requiring ACID (Atomicity, Consistency, Isolation and Durability) transactions
Amazon EMR Cluster
Fully managed NoSQL database
Single-Digit Millisecond latency at scale
Supports document and key-value data models
Amazon DynamoDB
Amazon DynamoDB
• Ideal Usage Patterns
  • Mobile apps, gaming, digital ad serving, live voting, sensor networks, log ingestion
  • Access control for web-based content, e-commerce shopping carts
  • Web session management
• Performance
  • SSD
  • Provision throughput by table
• Scalability & Elasticity
  • No limit to the amount of data stored
  • Dial up or dial down the read and write capacity of a table
• Cost Model
  • Pay as you go
  • Provisioned throughput capacity (per hour)
  • Indexed data storage (per GB per month)
  • Data transfer in or out (per GB per month)
Provisioned read/write performance per table. Predictable high performance scaled via console or API
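Provisioned throughput is expressed in capacity units, so estimating what to provision for a table is a short calculation. The sketch below assumes the standard DynamoDB definitions: one read capacity unit covers one strongly consistent read per second of an item up to 4 KB (eventually consistent reads cost half), and one write capacity unit covers one write per second of an item up to 1 KB. The function names are illustrative, not part of any SDK.

```python
import math


def read_capacity_units(reads_per_sec, item_kb, strongly_consistent=True):
    """Read units for a steady read rate; item size rounds up to 4 KB blocks."""
    units_per_read = math.ceil(item_kb / 4.0)
    total = reads_per_sec * units_per_read
    if not strongly_consistent:
        total = total / 2.0  # eventually consistent reads cost half
    return math.ceil(total)


def write_capacity_units(writes_per_sec, item_kb):
    """Write units for a steady write rate; item size rounds up to 1 KB blocks."""
    return writes_per_sec * math.ceil(item_kb)


# 100 strongly consistent reads/s of 6 KB items: each read needs 2 units.
print(read_capacity_units(100, 6))
# 50 writes/s of 2.5 KB items: each write needs 3 units.
print(write_capacity_units(50, 2.5))
```

The resulting numbers are the values you would set on the table through the console or API and adjust as traffic changes.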
Amazon DynamoDB
• Durability and Availability
  • Three Availability Zones (AZ)
• Interfaces
  • AWS Management Console
  • APIs
  • SDKs
• Anti-patterns
  • Application tied to traditional relational database
  • Joins and/or complex transactions
  • BLOB data
  • Large data with low I/O rate
[Figure: three Availability Zones labeled AZ-A, AZ-B, AZ-C]
Managed service designed to make it easy for developers of all levels to use machine learning
Based on the same ML technology used for years by Amazon’s internal data scientists
Amazon Machine Learning uses scalable and robust implementations of industry-standard ML algorithms
Amazon Machine Learning
Amazon Machine Learning
• Ideal Usage Patterns
  • Enable applications that flag suspicious transactions
  • Personalize application content
  • Predict user activity
  • Listen to social media
• Cost Model
  • Pay for what you use
  • No need to manage instances, only pay for the service
• Performance
  • Real-time predictions designed to return within 100ms
  • 200 transactions can be handled per second by default (can be raised)
Amazon Machine Learning
• Durability and Availability
  • No maintenance windows or scheduled downtimes
  • Designed across multiple availability zones
• Scalability & Elasticity
  • Model training up to 100GB
  • Multiple ML jobs can run simultaneously
• Interfaces
  • Create a data source from S3, RDS and Redshift
  • Interact with ML via console, SDKs, and the ML API
• Anti-patterns
  • Massive data sets for modeling (> 100GB)
  • Sequence prediction or unsupervised clustering tasks
Event driven, fully managed compute
No Infrastructure to Manage
Automatic Scaling
AWS Lambda
AWS Lambda
• Ideal Usage Patterns
  • Real-time file processing
  • Extract, Transform, Load
• Performance
  • Process events within milliseconds
• Cost Model
  • Pay for what you use
  • No need to manage instances, only pay for the service
  • Lambda free tier includes 1M free requests
1. Serverless  2. Event-Driven Scale  3. Subsecond Billing
AWS Lambda
• Durability and Availability
  • No maintenance windows or scheduled downtime
  • Async functions are retried 3 times if there is a failure
• Scalability & Elasticity
  • Any number of concurrent functions can be run
  • AWS Lambda will dynamically allocate capacity to match the rate of incoming events
• Interfaces
  • Lambda supports Java, Node.js, and Python
  • Trigger via event or schedule
• Anti-patterns
  • Long running applications
  • Stateful applications in Lambda
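As a sketch of the real-time file processing pattern, the handler below shows the shape of a stateless, event-triggered Python function. It assumes the function is triggered by Amazon S3 object-created notifications and reads only the fields of that event format it needs; the bucket name, key, and `SAMPLE_EVENT` are made up for local testing, and the code is written for Python 3.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Entry point Lambda would invoke; here it only lists the uploaded objects."""
    processed = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(s3["object"]["key"])
        size = s3["object"].get("size", 0)
        processed.append({"bucket": bucket, "key": key, "size": size})
    # No state is kept between invocations; everything needed is in the event.
    return {"count": len(processed), "objects": processed}


# Hypothetical event, trimmed to the fields the handler reads.
SAMPLE_EVENT = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-ingest-bucket"},
                "object": {"key": "logs/2016/app+log.txt", "size": 1024},
            }
        }
    ]
}

result = handler(SAMPLE_EVENT, None)
print(result)
```

A real ETL function would fetch and transform each object at the point where this sketch only records its name; keeping the handler short and stateless is what keeps it clear of the anti-patterns listed above.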
Set up an Elasticsearch cluster in minutes
Integrated with Logstash and Kibana
Scale Elasticsearch cluster seamlessly
Amazon Elasticsearch Service
Amazon Elasticsearch
• Ideal Usage Patterns
  • Analyze logs
  • Analyze data stream updates from other AWS services
  • Provide customers a rich search and navigation experience
  • Usage monitoring for mobile applications
• Performance
  • Depends on multiple factors including instance type, workload, index, number of shards used, read replicas
  • Storage configurations: instance storage or EBS storage
• Cost Model
  • Pay as you go
  • Only pay for compute and storage
Amazon Elasticsearch
• Durability and Availability
  • Zone Awareness
  • Automatic and manual snapshots
• Scalability & Elasticity
  • Add or remove instances
  • Modify EBS volumes for data growth
• Interfaces
  • AWS Management Console
  • APIs
  • SDKs
  • Kibana and Logstash (ELK Stack)
• Anti-patterns
  • OLTP
  • Workloads needing more than 5TB of storage
Elasticsearch + Logstash + Kibana = real-time analytics & visualization
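For the log analysis pattern, documents are usually sent to Elasticsearch in batches. The sketch below only builds the newline-delimited JSON body that the Elasticsearch bulk API expects (an action line followed by a source line per document); it sends nothing over the network. The index name and log fields are hypothetical, and older Elasticsearch versions also require a document type, which can be supplied in the request URL.

```python
import json


def to_bulk_body(index, docs):
    """Build an NDJSON bulk body: one action line and one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical parsed log entries.
LOG_DOCS = [
    {"status": 200, "path": "/index.html"},
    {"status": 404, "path": "/missing"},
]

body = to_bulk_body("weblogs-2016.01.26", LOG_DOCS)
print(body)
```

In the ELK setup named above, Logstash performs this batching and Kibana queries the resulting index.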
Build visualizations
Perform ad-hoc analysis
Share and collaborate via storyboards
Native access on major mobile platforms
Amazon QuickSight
Introducing Amazon QuickSight
Cloud-Powered Business Intelligence Service For 1/10th the Cost of Traditional BI Software
No IT effort. No dimensional modeling
Auto-discovery of all AWS data sources
Super-fast, Parallel, In-memory Calculation Engine (SPICE)
Fully Managed
Available in Preview: aws.amazon.com/quicksight
Scale up or down as needed
Pay for what you use
Multiple options
Do-it-yourself big data applications
Amazon EC2
The AWS Approach
• Flexible: Use the best tool for the job
  • Data structure, latency, throughput, access patterns
• Scalable: Immutable (append-only)
  • Batch/speed/serving layer
• Minimum Admin Overhead: Leverage AWS managed services
  • No or very low admin
• Low Cost: Big data ≠ big cost
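The immutable, append-only data and batch/speed/serving layers can be sketched in a few lines. This is a single-process illustration under stated assumptions, not an AWS implementation: `EventLog` stands in for the master dataset, the batch view is computed up to a checkpoint, the speed view covers events after it, and the serving function merges the two. All names and the sensor data are hypothetical.

```python
from collections import Counter


class EventLog:
    """Append-only master dataset: events are added, never updated or deleted."""

    def __init__(self):
        self._events = []

    def append(self, sensor_id, value):
        self._events.append((sensor_id, value))

    def read(self, start=0, end=None):
        return tuple(self._events[start:end])

    def __len__(self):
        return len(self._events)


def build_batch_view(log, upto):
    """Batch layer: total per sensor over all events before the checkpoint."""
    view = Counter()
    for sensor_id, value in log.read(0, upto):
        view[sensor_id] += value
    return view


def build_speed_view(log, since):
    """Speed layer: total per sensor over events the batch view has not seen."""
    view = Counter()
    for sensor_id, value in log.read(since):
        view[sensor_id] += value
    return view


def serve(batch_view, speed_view, sensor_id):
    """Serving layer: merge both views at query time."""
    return batch_view[sensor_id] + speed_view[sensor_id]


log = EventLog()
log.append("a", 5)
log.append("b", 2)
log.append("a", 1)
checkpoint = len(log)
batch = build_batch_view(log, checkpoint)

log.append("a", 4)  # arrives after the batch run
log.append("b", 3)
speed = build_speed_view(log, checkpoint)

total_a = serve(batch, speed, "a")
total_b = serve(batch, speed, "b")
print(total_a, total_b)
```

Because the log is never mutated, the batch view can always be rebuilt from scratch; a production speed layer would update incrementally rather than rescan the tail as this sketch does.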
Scenario 1: Enterprise Data Warehouse
Scenario 2: Capturing and Analyzing Sensor Data
Scenario 3: Sentiment Analysis of Social Media
Big Data Scenarios
Scenario 1: Enterprise Data Warehouse
Data Warehouse Architecture
[Figure: components in diagram order: Data Sources, Amazon S3, Amazon EMR, Amazon S3, Amazon Redshift, Amazon QuickSight]
Scenario 2: Capturing and Analyzing Sensor Data
[Figure: sensor data architecture with numbered steps 1 to 9. Components: Data Sources, Amazon Kinesis, two Amazon Kinesis-enabled apps, Amazon S3, Amazon Redshift, Amazon QuickSight, Amazon DynamoDB, Reporting Dashboard, Customer Access]
Scenario 3: Sentiment Analysis of Social Media
[Figure: sentiment analysis architecture with numbered steps 1 to 7. Components: Social Media Data, Amazon EC2, AWS Lambda, Amazon Machine Learning, Amazon Kinesis, Amazon S3, Amazon SNS]
Next Steps
• Subscribe to the AWS Big Data Blog: blogs.aws.amazon.com/bigdata
• Learn more, check the tutorials, guides, and self-paced labs: aws.amazon.com/big-data
• Register for the next Big Data Webinar, Building Smart Applications with Amazon Machine Learning (Thu, Jan 28 2016 | 9AM PST): aws.amazon.com/about-aws/events/monthlywebinarseries