© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Full Stack Analytics on AWS
Ian Meyers
Sr Mgr, AWS Specialist Solution Architecture
June 28th, 2017
Forces and Trends
Cost Optimization
• Licenses
• Hardware
• Data center and operations
Dark Data
• Prematurely discarding data
Agility
• Experimentation (data & tools)
• Democratised Access to Data
• Time-to-first-results
• Terminate failed experiments early
From BI to Data Science
• In-house data science
• From back office to product
Storage is the Gravity for Cloud Applications
Separation of Storage and Compute
Store all your data, forever, at every stage of its lifecycle
Apply it using the appropriate technology
Storage is Job #1
Foundations: Storage, Discovery and Lifecycle
Secure, governed, scalable, cheap
Storage & Catalog: Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog
AWS Storage Platforms
• File: Amazon EFS
• Block: Amazon EBS, Amazon EC2 Instance Store
• Object: Amazon S3, Amazon Glacier
• Data Transfer: AWS Direct Connect, AWS Snowball, ISV Connectors, Amazon Kinesis Firehose, S3 Transfer Acceleration, Storage Gateway
Object storage is foundational
• Object: Amazon S3, Amazon Glacier
• Services: EC2, Lambda, EMR, Data Pipeline, Kinesis, CloudFront, RDS, DynamoDB, Redshift, Elastic Transcoder
• Categories: Compute, Analytics, Database, Content Delivery
S3 Data Lifecycle and Events
• Standard: Active data
• Standard - Infrequent Access: Infrequently accessed data
• Amazon Glacier: Archive data
• Events: Create, Delete
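The tiering above can be written down as an S3 lifecycle configuration. This is a minimal sketch, assuming a hypothetical `logs/` prefix and hypothetical thresholds of 30, 90 and 365 days, none of which come from the slides. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts; the sketch only builds it and does not call AWS.

```python
import json


def build_lifecycle_rules(prefix, ia_after_days, glacier_after_days, expire_after_days):
    """Build a lifecycle configuration: Standard -> Standard-IA -> Glacier -> delete."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("thresholds must increase: IA < Glacier < expiry")
    return {
        "Rules": [
            {
                "ID": "tier-and-expire",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


# Assumed prefix and thresholds, for illustration only.
lifecycle = build_lifecycle_rules("logs/", 30, 90, 365)
print(json.dumps(lifecycle, indent=2))
```

Keeping the rule as plain data makes it easy to review and version before it is applied to a bucket.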
Augmenting storage with a data catalog
AWS Glue: Components
Data Catalog
• Crawl, store, search metadata in different data stores
• Populate in a Hive metastore compliant catalog
Job Execution
• Fully managed orchestration & execution of ETL jobs
• Serverless execution model – no need to pre-provision resources
Job Authoring
• Author, edit, share ETL jobs using your favorite tools
• Store, share, re-use ETL code/script with Git integration
Glue Data Catalog
Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools such as Hive, Presto, Spark, etc.
We added a few extensions:
• Search metadata for data discovery
• Connection info – JDBC URLs, credentials
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve and other metadata are updated
Populate using Hive DDL, bulk import, or automatically through crawlers.
Crawlers: Auto-Populate Data Catalog
Automatic schema inference:
• Built-in classifiers detect file type and extract schema: record structure and data types.
• Add your own or share with others in the Glue community - It's all Grok and Python.
Auto-detects Hive-style partitions, grouping similar files into one table.
Run crawlers on schedule to discover new data and schema changes.
Serverless – only pay when crawls run.
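Hive-style partitions are path segments of the form `name=value`. The sketch below shows one way such keys could be split and grouped into tables. It is an illustration only, not Glue's crawler logic, and the object keys and function names are hypothetical.

```python
from collections import defaultdict


def split_hive_partitions(key):
    """Split an object key into (table_path, partitions, file_name).

    Path segments of the form name=value are treated as Hive-style partitions.
    """
    *dirs, file_name = key.split("/")
    table_parts, partitions = [], {}
    for segment in dirs:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
        else:
            table_parts.append(segment)
    return "/".join(table_parts), partitions, file_name


def group_into_tables(keys):
    """Group keys that share a table path, collecting their partition values."""
    tables = defaultdict(list)
    for key in keys:
        table, partitions, _ = split_hive_partitions(key)
        tables[table].append(partitions)
    return dict(tables)


# Hypothetical object keys.
keys = [
    "sales/year=2017/month=05/part-0000.csv",
    "sales/year=2017/month=06/part-0000.csv",
    "clicks/dt=2017-06-28/events.json",
]
tables = group_into_tables(keys)
print(tables)
```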
Crawlers: Automatic Schema Inference
[Figure: enumerate S3 objects (file 1, file 2, … file N) → identify file type and parse files with custom classifiers (app log parser, metrics parser, …) and system classifiers (JSON parser, CSV parser, Apache log parser, …) → semi-structured per-file schema → semi-structured unified schema]
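The step from per-file schemas to a unified schema can be sketched as a merge over field names. This is a simplified illustration, not how Glue resolves schemas: the flat field-to-type dictionaries and the rule that conflicting types fall back to "string" are assumptions made for the example.

```python
def unify_schemas(per_file_schemas):
    """Merge per-file schemas (field -> type name) into one unified schema.

    Assumed rule, for illustration only: a field seen with different
    types falls back to "string".
    """
    unified = {}
    for schema in per_file_schemas:
        for field, type_name in schema.items():
            if field not in unified:
                unified[field] = type_name
            elif unified[field] != type_name:
                unified[field] = "string"
    return unified


# Hypothetical per-file schemas for three files.
per_file = [
    {"id": "int", "name": "char"},
    {"id": "int", "active": "bool"},
    {"id": "char", "tags": "array"},
]
unified = unify_schemas(per_file)
print(unified)
```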
Security is Job #0
Data Access & Authorisation: Give your users easy and secure access
Protect and Secure: Use entitlements to ensure data is secure and users’ identities are verified
AWS implements security at the data level, not tool-by-tool
• IAM: Service API Access
• Services: Amazon S3, Amazon ElastiCache, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, Amazon Athena
• Third party ecosystem security tools
+ storage level support for access logging and audit
• Access Logging: Amazon S3
• API Logging: AWS CloudTrail
• Access Log Analytics: Amazon Athena, Amazon EMR
• IAM
• http://amzn.to/2tSimHj
• http://amzn.to/2si6RqS
Additional S3 Security Practices
Use S3 Bucket policies:
• Restrict access by IP address
• Restrict deletes
• Enforce encryption use
and
Restrict deletes to require MFA Authentication
Use Versioning!!!
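The bucket policy practices above can be sketched as a single policy document. This is a minimal example under stated assumptions: the bucket name and CIDR range are hypothetical, and the three Deny statements follow commonly documented condition-key patterns (`aws:SourceIp`, `s3:x-amz-server-side-encryption`, `aws:MultiFactorAuthAge`). The code only builds the JSON and does not apply it.

```python
import json


def build_bucket_policy(bucket, allowed_cidr):
    """Policy that restricts by IP, enforces encryption and requires MFA for deletes."""
    objects = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Restrict access by IP address
                "Sid": "DenyOutsideAllowedRange",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [f"arn:aws:s3:::{bucket}", objects],
                "Condition": {"NotIpAddress": {"aws:SourceIp": allowed_cidr}},
            },
            {   # Enforce encryption use: reject PUTs without the SSE header
                "Sid": "DenyUnencryptedPuts",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": objects,
                "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
            },
            {   # Restrict deletes to requests authenticated with MFA
                "Sid": "DenyDeleteWithoutMFA",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": objects,
                "Condition": {"Null": {"aws:MultiFactorAuthAge": "true"}},
            },
        ],
    }


# Hypothetical bucket name and address range (documentation CIDR block).
policy = build_bucket_policy("example-data-lake", "203.0.113.0/24")
print(json.dumps(policy, indent=2))
```

A Deny on `s3:*` by source IP also blocks the administrator from outside that range, so a policy like this should be tested on a non-production bucket first.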
Encryption Options
• AWS Server-side encryption: AWS managed key infrastructure
• AWS Key Management Service: Automated key rotation & auditing, Integration with other AWS services
• AWS CloudHSM: Dedicated Tenancy SafeNet Luna SA HSM Device, Common Criteria EAL4+, NIST FIPS 140-2
Extensible & hybrid crypto integration for AWS services
class myCrypt implements EncryptionMaterialsProvider
Amazon Redshift
On Premises HSM
Kinesis Firehose
Data Ingestion: Get your data into S3 quickly and securely
Data Ingestion into S3
S3 Transfer Acceleration
Uploader → AWS Edge Location → S3 Bucket
Optimized Throughput!
Typically 50%-400% faster
Change your endpoint, not your code
No firewall exceptions or client software required
59 global edge locations
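"Change your endpoint, not your code" refers to the accelerate endpoint, which has the form `<bucket>.s3-accelerate.amazonaws.com`. A small sketch of switching between the two endpoints follows; the helper name and the bucket name are hypothetical, and the name check reflects the requirement that accelerated buckets be DNS-compliant without periods.

```python
import re

# DNS-compliant bucket name with no periods, 3 to 63 characters.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def s3_endpoint(bucket, accelerate=False):
    """Return the virtual-hosted endpoint URL for a bucket."""
    if accelerate:
        if not _BUCKET_RE.match(bucket):
            raise ValueError("accelerated buckets need a DNS-compliant name without periods")
        return f"https://{bucket}.s3-accelerate.amazonaws.com"
    return f"https://{bucket}.s3.amazonaws.com"


# Hypothetical bucket name.
print(s3_endpoint("my-data-lake"))
print(s3_endpoint("my-data-lake", accelerate=True))
```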
How Fast is S3 Transfer Acceleration?
[Chart: Time [hrs.] for a 500 GB upload from these edge locations to a bucket in Singapore, Public Internet vs. S3 Transfer Acceleration. Locations: Rio De Janeiro, Warsaw, New York, Atlanta, Madrid, Virginia, Melbourne, Paris, Los Angeles, Seattle, Tokyo, Singapore]
Tip: Parallelizing PUTs with Multipart Uploads
• Increase aggregate throughput by parallelizing PUTs on high-bandwidth networks
• Move the bottleneck to the network, where it belongs
• Increase resiliency to network errors; fewer large restarts on error-prone networks
• Performed automatically by the aws-cli and ’TransferManager’ modules
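The idea behind parallel multipart uploads can be sketched without calling S3: split the payload into parts, then send the parts concurrently. In the example below `fake_upload_part` is a hypothetical stand-in for a real UploadPart request, and the constants reflect S3's documented limits of a 5 MiB minimum part size (except the last part) and 10,000 parts per upload.

```python
from concurrent.futures import ThreadPoolExecutor

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum part size (all but the last part)
MAX_PARTS = 10_000               # S3 maximum number of parts per upload


def plan_parts(total_size, part_size):
    """Return (part_number, offset, length) tuples covering total_size bytes."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")
    parts, offset = [], 0
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((len(parts) + 1, offset, length))
        offset += length
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; increase part_size")
    return parts


def upload_parts(data, part_size, upload_part, workers=4):
    """Send parts in parallel; upload_part is a caller-supplied callable."""
    parts = plan_parts(len(data), part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in part order even if parts finish out of order.
        return list(pool.map(lambda p: upload_part(p[0], data[p[1]:p[1] + p[2]]), parts))


def fake_upload_part(part_number, chunk):
    """Hypothetical stand-in for UploadPart: reports the part number and size."""
    return part_number, len(chunk)


data = bytes(12 * 1024 * 1024)  # 12 MiB of zeros as a sample payload
results = upload_parts(data, MIN_PART_SIZE, fake_upload_part)
print(results)
```

If one part fails, only that part needs to be resent, which is the resiliency benefit the slide describes.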
Write Database Changes to S3 with DMS
Full Load:
<schema_name>/<table_name>/LOAD001.csv
<schema_name>/<table_name>/LOAD002.csv
Change Data Capture:
<schema_name>/<table_name>/<time-stamp>.csv
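A downstream job usually needs to tell full-load files from change data capture files. The sketch below classifies keys using the naming pattern shown on the slide; the schema, table and time-stamp values in the sample keys are hypothetical, and real DMS output may use a different number of digits or time-stamp format.

```python
import re

FULL_LOAD_RE = re.compile(r"^(?P<schema>[^/]+)/(?P<table>[^/]+)/LOAD(?P<seq>\d+)\.csv$")
ANY_CSV_RE = re.compile(r"^(?P<schema>[^/]+)/(?P<table>[^/]+)/(?P<name>[^/]+)\.csv$")


def classify_dms_key(key):
    """Classify an object key as a full-load file or a CDC file, else None."""
    match = FULL_LOAD_RE.match(key)
    if match:
        return {
            "schema": match["schema"],
            "table": match["table"],
            "kind": "full_load",
            "sequence": int(match["seq"]),
        }
    match = ANY_CSV_RE.match(key)
    if match:
        return {
            "schema": match["schema"],
            "table": match["table"],
            "kind": "cdc",
            "timestamp": match["name"],
        }
    return None


# Hypothetical keys following the slide's naming pattern.
print(classify_dms_key("hr/employees/LOAD001.csv"))
print(classify_dms_key("hr/employees/20170628-101500000.csv"))
```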
Athena Query Service
Glue
Processing & Analytics: Use of predictive and prescriptive analytics to gain better understanding
Machine Learning: Predictive analytics
Amazon AI
Glue: Managed ETL
• Serverless job execution
• Visual Workflow or
• Directly edit PySpark transformations
• Monitoring, metrics and notifications
Glue: Managed ETL
• Combine with AWS Lambda and AWS Step Functions for complex data orchestrations
• Automatically maintain data catalog entries
• Track lineage of data over time
Analysing streaming data…
Amazon Kinesis Analytics
• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms
Amazon Kinesis Analytics – Simple SQL Interface
SELECT STREAM author, count(author) OVER ONE_MINUTE
FROM Tweets
WHERE text LIKE '%#AwsLondonSummit%'
WINDOW ONE_MINUTE AS (PARTITION BY author RANGE INTERVAL '1' MINUTE PRECEDING);
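To make the windowed query concrete, here is a plain-Python sketch of the same idea: for each matching tweet, count the tweets by the same author in the preceding minute. It illustrates the sliding-window semantics only and is not how Kinesis Analytics executes the query; the sample events are made up.

```python
from collections import defaultdict, deque


def sliding_counts(events, hashtag, window_seconds=60):
    """Yield (timestamp, author, count) for each tweet containing hashtag.

    count is the number of matching tweets by the same author within the
    window_seconds up to and including the current tweet. Events must be
    ordered by timestamp.
    """
    recent = defaultdict(deque)  # author -> timestamps still inside the window
    for ts, author, text in events:
        if hashtag not in text:
            continue
        window = recent[author]
        window.append(ts)
        while window[0] < ts - window_seconds:
            window.popleft()
        yield ts, author, len(window)


# Made-up (timestamp_seconds, author, text) events.
events = [
    (0, "ann", "Hello #AwsLondonSummit"),
    (30, "ann", "Great talk #AwsLondonSummit"),
    (45, "bob", "On my way #AwsLondonSummit"),
    (61, "ann", "Lunch #AwsLondonSummit"),
    (70, "ann", "No hashtag here"),
    (121, "ann", "Wrapping up #AwsLondonSummit"),
]
counts = list(sliding_counts(events, "#AwsLondonSummit"))
print(counts)
```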
Analysing streaming data… and when at rest
Amazon Athena
• No Infrastructure or administration
• Zero Spin up time
• Transparent upgrades
• Query data in its raw format
• AVRO, Text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No loading of data, no ETL required
• Stream data directly from Amazon S3, take advantage of Amazon S3 durability and availability
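Because Athena reads data in place, a table is only a definition that points at an S3 prefix. The sketch below generates a `CREATE EXTERNAL TABLE` statement for Parquet data. The helper name, database, table, columns and bucket are all hypothetical, and the statement is only built as a string, not submitted to Athena.

```python
def athena_parquet_ddl(database, table, columns, location, partitions=()):
    """Build a CREATE EXTERNAL TABLE statement for Parquet files under location."""
    if not (location.startswith("s3://") and location.endswith("/")):
        raise ValueError("location must be an s3:// prefix ending in '/'")
    cols = ",\n  ".join(f"`{name}` {type_}" for name, type_ in columns)
    lines = [
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (",
        f"  {cols}",
        ")",
    ]
    if partitions:
        parts = ", ".join(f"`{name}` {type_}" for name, type_ in partitions)
        lines.append(f"PARTITIONED BY ({parts})")
    lines += ["STORED AS PARQUET", f"LOCATION '{location}'"]
    return "\n".join(lines)


# Hypothetical database, table, columns and bucket.
ddl = athena_parquet_ddl(
    "weblogs",
    "requests",
    [("request_ip", "string"), ("status", "int")],
    "s3://example-bucket/weblogs/parquet/",
    partitions=[("dt", "string")],
)
print(ddl)
```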
Simple Query editor with syntax highlighting and autocomplete
Data Catalog
Query History, Saved Queries, and Catalog Management
Using Amazon Athena with Amazon QuickSight
QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources including Amazon Athena
• Amazon RDS
• Amazon S3
• Amazon Redshift
• Amazon Athena
Building smarter applications
Add Machine Learning Capabilities
Amazon Machine Learning Service
• Batch and online predictions
• Train using data in S3, RDS and Redshift
Amazon EMR
• Comprehensive machine learning libraries (e.g. Spark MLlib, Anaconda)
• Provision analytics clusters in minutes, autoscale with data volume or query demand
Amazon AI Services
Amazon Polly – Lifelike Text-to-Speech
• 47 voices, 24 languages
• Low-latency, real time
Amazon Rekognition – Image Analysis
• Object and scene detection
• Facial analysis
Amazon Lex – Conversational Engine
• Speech and text recognition
• Enterprise connectors
Let’s hear from Polly..
Facial Analysis with Rekognition
• Demographic Data
• Facial Landmarks
• Sentiment Expressed
• Image Quality (Brightness: 25.84, Sharpness: 160)
• General Attributes
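The attribute groups above map onto fields of a face record in a Rekognition `DetectFaces` response. The sketch below summarises one such record. The sample record is hand-written for illustration: only the Brightness and Sharpness values come from the slide, every other value is invented, and no call to Rekognition is made.

```python
def summarise_face(face):
    """Reduce one FaceDetails entry to a few headline attributes."""
    top_emotion = max(face.get("Emotions", []), key=lambda e: e["Confidence"], default=None)
    return {
        "age_range": (face["AgeRange"]["Low"], face["AgeRange"]["High"]),  # demographic data
        "gender": face["Gender"]["Value"],
        "emotion": top_emotion["Type"] if top_emotion else None,           # sentiment expressed
        "landmarks": len(face.get("Landmarks", [])),                       # facial landmarks
        "brightness": face["Quality"]["Brightness"],                       # image quality
        "sharpness": face["Quality"]["Sharpness"],
    }


# Hand-written sample record; only Brightness and Sharpness are from the slide.
sample_face = {
    "AgeRange": {"Low": 26, "High": 43},
    "Gender": {"Value": "Female", "Confidence": 99.9},
    "Emotions": [
        {"Type": "HAPPY", "Confidence": 92.5},
        {"Type": "CALM", "Confidence": 5.1},
    ],
    "Landmarks": [
        {"Type": "eyeLeft", "X": 0.41, "Y": 0.36},
        {"Type": "eyeRight", "X": 0.58, "Y": 0.35},
    ],
    "Quality": {"Brightness": 25.84, "Sharpness": 160},
}
summary = summarise_face(sample_face)
print(summary)
```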
AWS Deep Learning AMI
• Up to ~40k CUDA cores
• Pre-configured CUDA drivers
• Jupyter notebook with Python2, Python3, Anaconda
• CloudFormation Template
• AWS Marketplace – one-click deploy
Scaling Distributed Experiments
• Inception v3 model
• Increasing machines from 1 to 47
• 2x faster than TensorFlow if using more than 10 machines
Example MXNet User | TuSimple | Autonomous Driving
Kinesis Firehose
Athena Query Service
Glue
Machine Learning: Predictive analytics
Amazon AI
Data Access & Authorisation: Give your users easy and secure access
Data Ingestion: Get your data into S3 quickly and securely
Processing & Analytics: Use of predictive and prescriptive analytics to gain better understanding
Storage & Catalog: Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog
Protect and Secure: Use entitlements to ensure data is secure and users’ identities are verified
Full Stack Analytics on AWS