Cloud Analytics and Business Intelligence on AWS
aws-de-media.s3.amazonaws.com/images/aws summit...
Infrastructure: Regions, Availability Zones, Points of Presence
Enterprise Applications: Virtual Desktops, Sharing & Collaboration
Core Services
  Storage (Object, Block and Archival)
  Compute (VMs, Auto-scaling and Load Balancing)
  Databases (Relational, NoSQL, Caching)
  Networking (VPC, DX, DNS)
  CDN
Administration & Security
  Access Control
  Usage & Resource Tracking
  Monitoring and Logs
  Key Storage & Management
  Identity Management
  Service Catalog
Platform Services
  Deployment & Management
    One-click web app deployment
    Dev/ops resource management
    Resource Templates
    Code Deploy
    Code Pipeline
    Code Commit
  Mobile Services
    Push Notifications
    Identity
    Sync
    Mobile Analytics
  App Services
    Queuing & Notifications
    Workflow
    App streaming
    Transcoding
    Search
  Analytics
    Hadoop
    Data Pipeline
    Data Warehouse
    Real-time Streaming Data
    Machine Learning
Availability: 99.99%
Durability: 99.999999999%
A Distributed Object Store
Not a file system
No Single Points of Failure
Eventually consistent
Paradigm: Object store
Performance: Very Fast
Redundancy: Across Availability Zones
Security: Public Key / Private Key
Pricing: $0.03/GB/month
Typical use case: Write once, read many
Simple Storage Service
Highly scalable object storage for the internet
1 byte to 5TB in size
99.999999999% durability
S3 Performance & Scalability
Amazon S3 provides near linear scalability
S3 Streaming Performance
100 VMs; 9.6GB/s; $26/hr
350 VMs; 28.7GB/s; $90/hr
34 secs per terabyte
[Chart: GB/Second versus Reader Connections]
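The streaming figures on this slide can be sanity-checked with a little arithmetic. The sketch below only uses the numbers quoted above; the 1 TB = 1000 GB conversion and the function names are assumptions made for illustration. At 28.7 GB/s it gives a value close to the quoted 34 seconds per terabyte, and it shows per-VM throughput dropping only slightly as the fleet grows, which is what "near linear" suggests.

```python
# Figures quoted on the slide: (VM count, aggregate GB/s, USD per hour).
RUNS = [(100, 9.6, 26.0), (350, 28.7, 90.0)]

GB_PER_TB = 1000.0  # assumption: decimal terabytes; the slide does not say


def seconds_per_terabyte(gb_per_second):
    """Time to stream one terabyte at the given aggregate rate."""
    return GB_PER_TB / gb_per_second


def cost_per_terabyte(gb_per_second, usd_per_hour):
    """VM cost for the time it takes to stream one terabyte."""
    return usd_per_hour / 3600.0 * seconds_per_terabyte(gb_per_second)


def throughput_per_vm(vm_count, gb_per_second):
    """Average GB/s contributed by each VM."""
    return gb_per_second / vm_count


if __name__ == "__main__":
    for vm_count, gbps, usd in RUNS:
        print(
            f"{vm_count} VMs: {seconds_per_terabyte(gbps):.1f} s/TB, "
            f"${cost_per_terabyte(gbps, usd):.2f}/TB, "
            f"{throughput_per_vm(vm_count, gbps):.3f} GB/s per VM"
        )
```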
Application Services
Amazon Kinesis
Managed Service for Real Time Big Data Processing
Create Streams to Produce & Consume Data
Elastically Add and Remove Shards for Performance
Use Kinesis Worker Library to Process Data
Integration with S3, Redshift and DynamoDB
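To make the shard idea concrete: Kinesis hashes each record's partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch below imitates that routing locally. The even split of the hash space and the helper names are assumptions for illustration; in a real stream the ranges are assigned by the service and change when shards are split or merged.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def hash_key(partition_key):
    """Map a partition key to an integer in [0, 2**128)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_ranges(shard_count):
    """Split the hash space into contiguous ranges (assumed even split)."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last shard absorbs any remainder so the ranges cover everything.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's hash."""
    key = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= key <= end:
            return index
    raise ValueError("hash key outside all shard ranges")


if __name__ == "__main__":
    ranges = shard_ranges(4)
    for pk in ["sensor-1", "sensor-2", "sensor-3"]:  # hypothetical keys
        print(pk, "-> shard", shard_for(pk, ranges))
```

Adding shards narrows each range, so the same key can land on a different shard after a reshard; consumers should not assume a fixed key-to-shard mapping.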
Compute Storage
AWS Global Infrastructure
Database
App Services
Deployment & Administration
Networking
Analytics
Amazon Kinesis
[Diagram: Data Sources send records to an AWS Endpoint; the stream is divided into Shard 1, Shard 2 ... Shard N across three Availability Zones; App.1 [Aggregate & De-Duplicate], App.2 [Metric Extraction], App.3 [Sliding Window Analysis] and App.4 [Machine Learning] consume the stream, with S3, DynamoDB and Redshift as destinations]
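One of the consumer applications named in the diagram is sliding window analysis. A minimal sketch of that idea, independent of any Kinesis API, keeps only the events from the last N seconds and reports their running average. The class name, the window length and the assumption that events arrive in timestamp order are all illustrative.

```python
from collections import deque


class SlidingWindowAverage:
    """Running average over events from the last `window_seconds` seconds.

    Assumes events are added in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0.0

    def add(self, timestamp, value):
        """Record an event and return the average over the current window."""
        self._events.append((timestamp, value))
        self._total += value
        # Evict events that are a full window (or more) older than this one.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value
        return self._total / len(self._events)


if __name__ == "__main__":
    window = SlidingWindowAverage(window_seconds=10)
    for ts, value in [(0, 10.0), (5, 20.0), (12, 30.0)]:  # hypothetical readings
        print(ts, window.add(ts, value))
```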
AWS Security Services
Cloud HSM
Dedicated Tenancy SafeNet Luna SA HSM Device
Common Criteria EAL4+, NIST FIPS 140-2
AWS Key Management Service
Implemented on HSM
Automated Key Rotation & Auditing
Integration with other AWS Services
AWS Server Side Encryption
AWS Managed Key Infrastructure
Structured Data Management
Database
Relational Database Service: Managed Oracle, MySQL & SQL Server
DynamoDB: Managed NoSQL Database
ElastiCache: Managed In Memory Caching
Amazon Redshift: Massively Parallel Petabyte Scale Data Warehouse
RDS, DynamoDB, Redshift, ElastiCache
Database
Relational Database Service
Database-as-a-Service
No need to install or manage database instances
Scalable and fault tolerant configurations
Integration with Data Pipeline
Database
DynamoDB
Provisioned throughput NoSQL database
Fast, predictable, configurable performance
Fully distributed, fault tolerant HA architecture
Integration with EMR & Hive
Dynamo Consistency
• Writes
  – Writes are acknowledged (committed) once they exist in at least two physical data centers
  – Writes are persisted to SSD
• Reads
  – Tunable for Application Requirements
• No reduction in durability or consistency in order to achieve throughput
Eventually Consistent Read      Strongly Consistent Read
Stale value reads possible      No stale value reads
Highest throughput              Lower potential throughput
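The throughput difference between the two read modes can be put into numbers. As a rough sketch, assuming the documented DynamoDB rule that one read capacity unit covers one strongly consistent read per second, or two eventually consistent reads per second, for an item of up to 4 KB, the provisioned read capacity needed for a workload looks like this. The function name and the example workload are hypothetical; check the current DynamoDB documentation before sizing a real table.

```python
import math

READ_UNIT_BYTES = 4 * 1024  # assumed: one read unit covers an item up to 4 KB


def read_capacity_units(item_bytes, reads_per_second, strongly_consistent):
    """Estimate provisioned read capacity for a steady read workload."""
    # Larger items consume one unit per 4 KB block, rounded up.
    units_per_read = math.ceil(item_bytes / READ_UNIT_BYTES)
    strong_units = units_per_read * reads_per_second
    if strongly_consistent:
        return strong_units
    # Eventually consistent reads cost half as much, rounded up.
    return math.ceil(strong_units / 2)


if __name__ == "__main__":
    # Hypothetical workload: 3 KB items read 100 times per second.
    print("strong:  ", read_capacity_units(3000, 100, True))
    print("eventual:", read_capacity_units(3000, 100, False))
```

For the same provisioned capacity, an application that tolerates stale reads can therefore serve roughly twice the read rate.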
Database
Redshift
Managed Massively Parallel Petabyte Scale Data Warehouse
Streaming Backup/Restore to S3
Load data from S3, DynamoDB and EMR
Extensive Security Features
Scale from 160 GB -> 1.6 PB Online
Redshift Parallelizes Everything
Query, Load, Backup, Restore, Resize
[Diagram: Common BI Tools connect over JDBC/ODBC to the Leader Node, which is linked to the Compute Nodes by a 10GigE mesh]
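The leader node and compute node layout can be illustrated with a small scatter-gather sketch: rows are spread across workers, each worker computes a partial aggregate over its own slice, and a coordinator combines the partials. This is a toy model in plain Python, not Redshift code; the round-robin distribution, the thread pool and all function names are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, node_count):
    """Spread rows across compute nodes round-robin (assumed distribution style)."""
    slices = [[] for _ in range(node_count)]
    for i, row in enumerate(rows):
        slices[i % node_count].append(row)
    return slices


def compute_node_partial(rows):
    """Each compute node aggregates only its own slice."""
    return (sum(rows), len(rows))


def leader_average(rows, node_count=3):
    """Leader fans the work out, then merges the partial sums and counts."""
    slices = distribute(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(compute_node_partial, slices))
    total = sum(partial[0] for partial in partials)
    count = sum(partial[1] for partial in partials)
    return total / count if count else None


if __name__ == "__main__":
    print(leader_average([1, 2, 3, 4, 5, 6]))
```

The point of the design is that only small partial results cross the network to the leader, so adding compute nodes shortens the scan over the raw rows.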
Exploratory Analytics…
Data Cleansing…
Advanced Data Science
Managed Big Data
Elastic MapReduce
Managed, elastic Hadoop (1.x & 2.x) cluster
Integrates with S3, DynamoDB and Redshift
Install End User Tools Automatically (Spark, Impala)
Support for EC2 Spot Instances
Transient or Always on Clusters
Vibrant Ecosystem
[Diagram: EMR ecosystem, showing Pig and HDFS]
Weather Insurance for Farms
Challenge: Volatile weather is deadly to crops like grapes
60 years of crop data
200 TB of S3 Data
1M government Doppler radar points
Solution: Built a predictive model based on freely available data:
150B Soil Observations
850K Precision Rainfall Grids Tracked
3M Daily Weather Measurements
50 EMR clusters process new data as it comes into S3 each day, continuously updating the model
Choose your instance types
Try different configurations to find the optimal cost/performance balance
CPU: c3 family, cc2.8xlarge, d2 family
Memory: m2 family, r3 family
Disk/IO: d2 family, i2 family
General: m3 family
Workloads: ETL, Machine Learning, Spark, HDFS
New EC2 Instances – C4
Custom Intel Xeon processors for AWS
C4 = highest performing EC2 instances
The Financial Industry Regulatory Authority
30 Billion Market Events / Day
Objective to react to changing Market Dynamics
Amazon Elastic MapReduce & Amazon S3
$10-20M Savings by moving Platform to AWS
Event Processing
AWS Lambda
Fully Managed Event Processor
Node.js, Integrated AWS SDK & ImageMagick
Natively Compile & Install Node.js modules
Specify Runtime RAM & Timeout
Automatically Scaled to support Event Volume
Events from S3, DynamoDB, Kinesis & Lambda
Integrated CloudWatch Logging
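To show what event processing looks like in practice, here is a minimal sketch of a handler that reacts to S3 object-created events. The slide lists Node.js as the runtime; Python is used here purely for illustration. The event layout follows the S3 notification format (a `Records` list with `s3.bucket.name` and a URL-encoded `s3.object.key`), while the handler name, the bucket and the key in the sample event are hypothetical.

```python
from urllib.parse import unquote_plus


def handler(event, context=None):
    """Return (bucket, key) pairs for every object-created record in the event."""
    created = []
    for record in event.get("Records", []):
        if record.get("eventSource") != "aws:s3":
            continue  # ignore records from other event sources
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue  # ignore deletes and other S3 event types
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        created.append((bucket, key))
    return created


# Hypothetical sample event, trimmed to the fields the handler reads.
SAMPLE_EVENT = {
    "Records": [
        {
            "eventSource": "aws:s3",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "logs/day+1.csv", "size": 1024},
            },
        }
    ]
}

if __name__ == "__main__":
    print(handler(SAMPLE_EVENT))
```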
Introducing Amazon Machine Learning
Easily create machine learning models
Visualize and optimize models
Put models into production in seconds
Battle-hardened technology
Machine Learning expertise
SDE expertise
Easy to Use, High Performance
Train and optimize models on GBs of data
Batch process predictions
Real-time prediction API in one-click
No servers to provision or manage
Developing with Amazon Machine Learning
1. Build model
2. Validate & optimize
3. Make predictions
Building a Predictive Model with Amazon Machine Learning
Use existing data in S3, Redshift and RDS
Automatic data visualization & exploration
Descriptive and summary statistics
Your data doesn’t have to be perfect
Missing data, malformed data records, type validation
Model Validation and Optimization Tools
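The point about imperfect data (missing values, malformed records, type validation) can be illustrated with a small pre-processing sketch. This is not the Amazon Machine Learning API; it only shows the kind of check a service or a client might run before training. The schema, the field names and the function are hypothetical.

```python
# Hypothetical schema: field name -> type the value should convert to.
SCHEMA = {"customer_id": int, "monthly_spend": float, "plan": str}


def validate_record(raw, schema=SCHEMA):
    """Coerce a raw record to the schema, collecting problems instead of failing."""
    clean, problems = {}, []
    for field, caster in schema.items():
        value = raw.get(field)
        if value is None or value == "":
            clean[field] = None  # keep the record, mark the value as missing
            problems.append(f"{field}: missing")
            continue
        try:
            clean[field] = caster(value)
        except (TypeError, ValueError):
            clean[field] = None  # malformed value is treated as missing
            problems.append(f"{field}: expected {caster.__name__}")
    return clean, problems


if __name__ == "__main__":
    print(validate_record({"customer_id": "17", "monthly_spend": "42.5", "plan": "basic"}))
    print(validate_record({"customer_id": "x", "plan": ""}))
```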
Making Predictions with Amazon Machine Learning
Batch predictions
Asynchronous predictions with trained model
Real time predictions
Synchronous, low latency, high throughput
Mount API end-point with a single click
Traditional Business Intelligence…
OLAP…
Data Sources for ML
Managed Data Warehouse
Redshift
Managed Massively Parallel Petabyte Scale Data Warehouse
Streaming Backup/Restore to S3
Load data from S3, DynamoDB and EMR
Extensive Security Features
Scale from 160 GB -> 1.6 PB Online
Redshift lets you start small and grow big
Extra Large Node (dw1.xl & dw2.xl)
3 spindles, 15GiB RAM, 2 virtual cores, 10GigE
Single Node (160GB SSD or 2TB Magnetic)
Cluster 2-32 Nodes (320GB SSD – 64TB Magnetic)
8 Extra Large Node (dw1.8xl & dw2.8xl)
24 spindles, 120GiB RAM, 1.2TB SSD or 16TB Magnetic, 16 virtual cores, 10GigE
Cluster 2-100 Nodes (2.4TB SSD – 1.6PB Magnetic)
[Diagram: grids of XL and 8XL nodes illustrating cluster sizes]
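The cluster ranges quoted for the two node sizes follow directly from per-node storage multiplied by node count. The sketch below checks that arithmetic. It assumes that the dw2 types are the SSD nodes and the dw1 types are the magnetic nodes, and that 1 TB is counted as 1000 GB; the table layout and function name are illustrative.

```python
# Per-node storage and node-count limits as read from the slide.
# Assumption: dw2.* are the SSD nodes, dw1.* are the magnetic nodes.
NODE_TYPES = {
    "dw2.xl": {"storage_gb": 160, "min_nodes": 1, "max_nodes": 32},
    "dw1.xl": {"storage_gb": 2000, "min_nodes": 1, "max_nodes": 32},
    "dw2.8xl": {"storage_gb": 1200, "min_nodes": 2, "max_nodes": 100},
    "dw1.8xl": {"storage_gb": 16000, "min_nodes": 2, "max_nodes": 100},
}


def cluster_storage_gb(node_type, node_count):
    """Total raw storage of a cluster, in GB."""
    spec = NODE_TYPES[node_type]
    if not spec["min_nodes"] <= node_count <= spec["max_nodes"]:
        raise ValueError(f"{node_type} supports {spec['min_nodes']}-{spec['max_nodes']} nodes")
    return spec["storage_gb"] * node_count


if __name__ == "__main__":
    print(cluster_storage_gb("dw2.xl", 2), "GB")      # smallest XL cluster
    print(cluster_storage_gb("dw1.xl", 32), "GB")     # largest XL cluster
    print(cluster_storage_gb("dw2.8xl", 2), "GB")     # smallest 8XL cluster
    print(cluster_storage_gb("dw1.8xl", 100), "GB")   # largest 8XL cluster
```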
End User Reporting
[Diagram: Redshift, S3, EMR, DynamoDB]
Ignite Your Ambition
Leading Index Provider With 41,000+ Indexes Across Asset Classes And Geographies
Over 10,000 Corporate Clients in 60 countries
Our technology powers over 70 marketplaces, regulators, CSDs and clearinghouses in over 50 countries
100+ data product offerings supporting 2.5+ million investment professionals and users in 98 countries
26 Markets
3 Clearing Houses
5 Central Securities Depositories
Lists more than 3,500 companies in 35 countries, representing more than $8.8 trillion in total market value
NDW 1.0 Requirements
Original scope was to replace on-premises warehouse with Redshift, keeping equivalent schemas and data
4-8 Billion Rows/Day
Legacy limited to 1 Year Retention
Must be lower cost than legacy system
Legacy $1.16M/Year
Must satisfy multiple security and regulatory requirements
Must perform similarly to legacy warehouse under concurrent query load
Migration Completed On Schedule
Migrated off legacy warehouse to Redshift (start to finish) in 7 man-months
Redshift costs were 43% of legacy budget for the same data set (~1100 tables)
Tuned queries now running faster than on legacy system
Data Ingest
5.5B rows/day average for 2014
High water mark: 14B rows in 1 day
Best write rates ~2.76M rows/second
450 GB/day (after compression) into Redshift
1,895 GB/day average uncompressed
Currently resize clusters once a quarter (if necessary)
NDW_Prod is currently growing +3 dw1.8xl nodes per quarter
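The migration and ingest figures above can be related to each other with simple arithmetic. The sketch below derives the implied compression ratio, the average compressed bytes per row, the time the peak day would take at the best write rate, and an approximate yearly Redshift cost. It assumes that "43% of legacy budget" applies to the $1.16M/year figure and that 1 GB is 10^9 bytes; both are assumptions, and the constant and function names are illustrative.

```python
# Figures quoted on the slides.
LEGACY_COST_USD_PER_YEAR = 1_160_000
REDSHIFT_COST_FRACTION = 0.43        # assumed to apply to the figure above
UNCOMPRESSED_GB_PER_DAY = 1895
COMPRESSED_GB_PER_DAY = 450
AVG_ROWS_PER_DAY = 5.5e9
PEAK_ROWS_PER_DAY = 14e9
BEST_WRITE_ROWS_PER_SECOND = 2.76e6


def redshift_cost_per_year():
    """Approximate yearly Redshift cost implied by the 43% figure."""
    return LEGACY_COST_USD_PER_YEAR * REDSHIFT_COST_FRACTION


def compression_ratio():
    """Uncompressed daily volume divided by compressed daily volume."""
    return UNCOMPRESSED_GB_PER_DAY / COMPRESSED_GB_PER_DAY


def compressed_bytes_per_row():
    """Average stored bytes per row, assuming 1 GB = 10**9 bytes."""
    return COMPRESSED_GB_PER_DAY * 1e9 / AVG_ROWS_PER_DAY


def hours_to_load(rows, rows_per_second=BEST_WRITE_ROWS_PER_SECOND):
    """Load time if the best observed write rate were sustained."""
    return rows / rows_per_second / 3600.0


if __name__ == "__main__":
    print(f"Redshift cost: ~${redshift_cost_per_year():,.0f}/year")
    print(f"Compression ratio: {compression_ratio():.2f}x")
    print(f"Compressed bytes per row: {compressed_bytes_per_row():.1f}")
    print(f"Peak day at best write rate: {hours_to_load(PEAK_ROWS_PER_DAY):.2f} hours")
```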
Integrated Analytics