full stack analytics on aws - london-summit-slides-2017.s3 ... summit... · elastic transcoder...

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Sr Mgr, AWS Specialist Solution Architecture

June 28th, 2017

Full Stack Analytics on AWS

Ian Meyers

Forces and Trends

Cost Optimization
• Licenses
• Hardware
• Data center and operations

Dark Data
• Prematurely discarding data

Agility
• Experimentation (data & tools)
• Democratised Access to Data
• Time-to-first-results
• Terminate failed experiments early

From BI to Data Science
• In-house data science
• From back office to product

Storage is the Gravity for Cloud Applications

Separation of Storage and Compute

Store all your data, forever, at every stage of its lifecycle.
Apply it using the appropriate technology.

Storage is Job #1

Foundations: Storage, Discovery and Lifecycle

Secure, governed, scalable, cheap

Storage & Catalog
Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog

File: Amazon EFS
Block: Amazon EBS, Amazon EC2 Instance Store
Object: Amazon S3, Amazon Glacier
Data Transfer: AWS Direct Connect, AWS Snowball, ISV Connectors, Amazon Kinesis Firehose, S3 Transfer Acceleration, Storage Gateway

AWS Storage Platforms

Object storage is foundational
Object: Amazon S3, Amazon Glacier
Services shown around object storage: EC2, Lambda, EMR, Data Pipeline, Kinesis, CloudFront, RDS, DynamoDB, Redshift, Elastic Transcoder
Service categories shown: Compute, Analytics, Database, Content Delivery

S3 Data Lifecycle and Events
• Standard: active data
• Standard - Infrequent Access: infrequently accessed data
• Amazon Glacier: archive data
• Events: Create, Delete
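The tiering on this slide can be expressed as an S3 lifecycle rule. The sketch below is one possible shape, not the presenter's configuration: the prefix, the day thresholds and the helper names are assumptions chosen for illustration. It builds the rule as plain data, and `apply_lifecycle` shows where boto3's `put_bucket_lifecycle_configuration` would receive it.

```python
def build_lifecycle_rule(prefix, ia_after_days=30, glacier_after_days=90,
                         delete_after_days=None):
    """Build one lifecycle rule: Standard -> Standard-IA -> Glacier -> (optional) delete.

    All thresholds are illustrative assumptions, not values from the slides.
    """
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the Standard-IA transition")
    rule = {
        "ID": f"tiering-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }
    if delete_after_days is not None:
        # Expiration removes objects once they reach this age.
        rule["Expiration"] = {"Days": delete_after_days}
    return rule


def apply_lifecycle(bucket, rules):
    """Send the rules to S3. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the rule builder works without boto3 installed
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration={"Rules": rules}
    )


# Hypothetical example: log files move to cheaper tiers and are deleted after a year.
rule = build_lifecycle_rule("logs/", delete_after_days=365)
```

Keeping the rule as data makes it easy to review or version before it is applied to a bucket.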

Augmenting storage with a data catalog

AWS Glue: Components

Data Catalog

Crawl, store, search metadata in different data stores

Populate in a Hive metastore compliant catalog

Job Execution

Fully managed orchestration & execution of ETL jobs

Server-less execution model – no need to pre-provision resources

Job Authoring

Author, edit, share ETL jobs using your favorite tools

Store, share, re-use ETL code/script with Git integration

Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools such as Hive, Presto, Spark, etc.

We added a few extensions:
• Search metadata for data discovery
• Connection info – JDBC URLs, credentials
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve and other metadata are updated

Populate using Hive DDL, bulk import, or automatically through crawlers.

Glue Data Catalog

Crawlers: Auto-Populate Data Catalog

Automatic schema inference:
• Built-in classifiers detect file type and extract schema: record structure and data types.
• Add your own or share with others in the Glue community - It's all Grok and Python.

Auto-detects Hive-style partitions, grouping similar files into one table.

Run crawlers on schedule to discover new data and schema changes.

Serverless – only pay when crawls run.

[Diagram: crawler schema inference pipeline]
1. Enumerate S3 objects (file 1, file 2, … file N)
2. Identify file type and parse files
   • custom classifiers: app log parser, metrics parser, …
   • system classifiers: JSON parser, CSV parser, Apache log parser, …
3. Semi-structured per-file schema (nested int, char, bool, array and struct fields)
4. Semi-structured unified schema

Crawlers: Automatic Schema Inference
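To make the per-file and unified schema steps concrete, here is a minimal sketch in Python. It is not Glue's actual algorithm. The type names follow the labels in the diagram (int, char, bool, array, struct), while the `choice` marker for conflicting types and both function names are assumptions made for this illustration.

```python
import json


def infer_schema(record):
    """Map each field of one parsed record to a simple type name."""
    schema = {}
    for name, value in record.items():
        if isinstance(value, bool):      # test bool first: bool is a subclass of int
            schema[name] = "bool"
        elif isinstance(value, int):
            schema[name] = "int"
        elif isinstance(value, str):
            schema[name] = "char"
        elif isinstance(value, list):
            schema[name] = "array"
        elif isinstance(value, dict):
            schema[name] = {"struct": infer_schema(value)}
        else:
            schema[name] = "unknown"
    return schema


def unify(schemas):
    """Merge per-file schemas into one: union of fields, structs merged recursively."""
    unified = {}
    for schema in schemas:
        for name, ftype in schema.items():
            if name not in unified:
                unified[name] = ftype
            elif unified[name] != ftype:
                if isinstance(unified[name], dict) and isinstance(ftype, dict):
                    merged = unify([unified[name]["struct"], ftype["struct"]])
                    unified[name] = {"struct": merged}
                else:
                    unified[name] = "choice"  # hypothetical marker for a type conflict
    return unified


# Two hypothetical JSON files with overlapping but different fields.
files = [
    json.loads('{"id": 1, "tags": ["a"], "user": {"name": "x"}}'),
    json.loads('{"id": 2, "active": true, "user": {"age": 3}}'),
]
per_file = [infer_schema(f) for f in files]
unified = unify(per_file)
```

Here `unified` contains every field seen in either file, with the two `user` structs merged into one.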

Security is Job #0

Data Access & Authorisation
Give your users easy and secure access



Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified

AWS implements security at the data level, not tool-by-tool

[Diagram: security services]
IAM
Service API Access: Amazon S3, Amazon ElastiCache, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, Amazon Athena
Third party ecosystem security tools
Access Logging: Amazon S3
API Logging: AWS CloudTrail
Access Log Analytics: Amazon Athena, http://amzn.to/2tSimHj
IAM, Amazon EMR: http://amzn.to/2si6RqS
+ storage level support for access logging and audit

Additional S3 Security Practices
Use S3 Bucket policies:
• Restrict access by IP address
• Restrict deletes
• Enforce encryption use
and
Restrict deletes to require MFA Authentication
Use Versioning!!!
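The bucket policy practices above could look roughly like the following. This is a sketch under stated assumptions: the bucket name, the CIDR range and the statement IDs are hypothetical, and expressing the MFA requirement as a policy condition is only one way to read "restrict deletes to require MFA". The condition keys themselves (`aws:SourceIp`, `s3:x-amz-server-side-encryption`, `aws:MultiFactorAuthAge`) are standard.

```python
import json


def build_bucket_policy(bucket, allowed_cidr):
    """Build a bucket policy document as a plain dict (nothing is sent to AWS)."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    objects_arn = f"{bucket_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Restrict access by IP address
                "Sid": "DenyOutsideAllowedRange",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, objects_arn],
                "Condition": {"NotIpAddress": {"aws:SourceIp": allowed_cidr}},
            },
            {   # Enforce encryption use: reject uploads without an SSE header
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": objects_arn,
                "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
            },
            {   # Restrict deletes: reject requests made without MFA
                "Sid": "DenyDeleteWithoutMFA",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": objects_arn,
                "Condition": {"Null": {"aws:MultiFactorAuthAge": "true"}},
            },
        ],
    }


# Hypothetical bucket and documentation-only IP range.
policy = build_bucket_policy("example-data-lake", "203.0.113.0/24")
policy_json = json.dumps(policy, indent=2)
```

The JSON string is what would be attached to the bucket. Versioning is enabled separately on the bucket and is not part of the policy document.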

AWS Server-side encryption
• AWS managed key infrastructure

AWS Key Management Service
• Automated key rotation & auditing
• Integration with other AWS services

AWS CloudHSM
• Dedicated Tenancy SafeNet Luna SA HSM Device
• Common Criteria EAL4+, NIST FIPS 140-2

Encryption Options

Extensible & hybrid crypto integration for AWS services

class myCrypt implements EncryptionMaterialsProvider

Amazon Redshift

On Premises HSM

Kinesis Firehose


Data Ingestion
Get your data into S3 quickly and securely




Data Ingestion into S3

S3 Transfer Acceleration

Uploader → AWS Edge Location → S3 Bucket

Optimized Throughput!

Typically 50%-400% faster

Change your endpoint, not your code

No firewall exceptions or client software required

59 global edge locations
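"Change your endpoint, not your code" means the client only needs to address the bucket through the accelerate hostname. The sketch below shows the two hostname forms and, in a function that is defined but not called, the boto3 client option that switches endpoints. The bucket name, region and helper names are assumptions for illustration.

```python
def standard_endpoint(bucket, region):
    """Regional virtual-hosted-style endpoint for a bucket."""
    return f"https://{bucket}.s3.{region}.amazonaws.com"


def accelerate_endpoint(bucket):
    """Transfer Acceleration endpoint; requests are routed via the nearest edge location."""
    if "." in bucket:
        # Acceleration requires a DNS-compliant bucket name without periods.
        raise ValueError("bucket names containing '.' cannot use Transfer Acceleration")
    return f"https://{bucket}.s3-accelerate.amazonaws.com"


def accelerated_client():
    """Return a boto3 S3 client that uses the accelerate endpoint (not called here)."""
    import boto3
    from botocore.config import Config
    return boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))


# Hypothetical bucket in Singapore, matching the region used in the benchmark slide.
before = standard_endpoint("example-uploads", "ap-southeast-1")
after = accelerate_endpoint("example-uploads")
```

Acceleration also has to be enabled on the bucket itself before the accelerate hostname will accept requests.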

How Fast is S3 Transfer Acceleration?
[Chart: time (hrs.) for a 500 GB upload from these edge locations to a bucket in Singapore, comparing Public Internet with S3 Transfer Acceleration. Edge locations: Rio De Janeiro, Warsaw, New York, Atlanta, Madrid, Virginia, Melbourne, Paris, Los Angeles, Seattle, Tokyo, Singapore]

Tip: Parallelizing PUTs with Multipart Uploads

• Increase aggregate throughput by parallelizing PUTs on high-bandwidth networks
• Move the bottleneck to the network, where it belongs
• Increase resiliency to network errors; fewer large restarts on error-prone networks
• Performed automatically by the aws-cli and 'TransferManager' modules
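The idea behind parallel multipart uploads can be sketched without touching S3: split the payload into numbered parts and hand them to a pool of workers. In the sketch below, `fake_upload_part` is a stand-in for a real part upload, and the part size, worker count and payload are arbitrary assumptions. In practice the aws-cli and TransferManager do this for you.

```python
from concurrent.futures import ThreadPoolExecutor


def part_ranges(total_size, part_size):
    """Return (part_number, start, end) tuples; part numbers start at 1, end is exclusive."""
    ranges = []
    start = 0
    number = 1
    while start < total_size:
        end = min(start + part_size, total_size)
        ranges.append((number, start, end))
        start = end
        number += 1
    return ranges


def upload_in_parallel(data, part_size, upload_part, max_workers=4):
    """Upload all parts concurrently; results come back in part order."""
    ranges = part_ranges(len(data), part_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: upload_part(r[0], data[r[1]:r[2]]), ranges))


# Stand-in for a real part upload: stores the chunk in memory.
store = {}


def fake_upload_part(part_number, chunk):
    store[part_number] = chunk
    return part_number, len(chunk)


payload = bytes(range(256)) * 40          # 10,240 bytes of sample data
receipts = upload_in_parallel(payload, 4096, fake_upload_part)
reassembled = b"".join(store[n] for n in sorted(store))
```

Because each part is independent, a failed part can be retried on its own instead of restarting the whole transfer, which is the resiliency point made above.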

Write Database Changes to S3 with DMS

Full Load:
<schema_name>/<table_name>/LOAD001.csv
<schema_name>/<table_name>/LOAD002.csv

Change Data Capture:
<schema_name>/<table_name>/<time-stamp>.csv
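A downstream job often needs to tell full-load files from change files by their key. The sketch below does that with two regular expressions based only on the layout shown above. The exact time-stamp format is not given in the slides, so any other `.csv` name under a table prefix is treated as a change file. The function name and sample keys are hypothetical.

```python
import re

# <schema_name>/<table_name>/LOAD<digits>.csv
FULL_LOAD = re.compile(r"^(?P<schema>[^/]+)/(?P<table>[^/]+)/LOAD(?P<seq>\d+)\.csv$")
# <schema_name>/<table_name>/<anything>.csv (assumed to be a time-stamped change file)
ANY_CSV = re.compile(r"^(?P<schema>[^/]+)/(?P<table>[^/]+)/(?P<name>[^/]+)\.csv$")


def classify_key(key):
    """Return schema, table and phase for an S3 key, or None if it does not match."""
    match = FULL_LOAD.match(key)
    if match:
        return {"schema": match["schema"], "table": match["table"], "phase": "full_load"}
    match = ANY_CSV.match(key)
    if match:
        return {"schema": match["schema"], "table": match["table"], "phase": "cdc"}
    return None


# Hypothetical keys; the time-stamp shape is an assumption.
full = classify_key("sales/orders/LOAD001.csv")
change = classify_key("sales/orders/20170628-101500.csv")
other = classify_key("readme.txt")
```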

Kinesis Firehose

Athena Query Service

Glue



Processing & Analytics
Use of predictive and prescriptive analytics to gain better understanding

Machine Learning: Predictive analytics

Amazon AI

Glue: Managed ETL

• Serverless job execution

• Visual Workflow
or
• Directly edit PySpark transformations

• Monitoring, metrics and notifications

Glue: Managed ETL

• Combine with AWS Lambda and AWS Step Functions for complex data orchestrations

• Automatically maintain data catalog entries

• Track lineage of data over time

Analysing streaming data…

Amazon Kinesis Analytics

• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms

SELECT STREAM author, count(author) OVER ONE_MINUTE
FROM Tweets
WHERE text LIKE '%#AwsLondonSummit%'
WINDOW ONE_MINUTE AS (PARTITION BY author RANGE INTERVAL '1' MINUTE PRECEDING);

Amazon Kinesis Analytics – Simple SQL Interface
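For readers unfamiliar with sliding windows, the query above keeps, for each author, a count of matching tweets seen in the preceding minute and emits it with every new row. The Python sketch below mimics that behaviour in memory. It is an illustration of the windowing idea only, not how Kinesis Analytics is implemented, and it assumes events arrive in timestamp order with timestamps in seconds. The class name and sample events are hypothetical.

```python
from collections import defaultdict, deque


class OneMinuteCounter:
    """Per-author count of matching tweets over a trailing time window."""

    def __init__(self, window_seconds=60, hashtag="#AwsLondonSummit"):
        self.window = window_seconds
        self.hashtag = hashtag
        self.events = defaultdict(deque)   # author -> timestamps inside the window

    def add(self, author, text, ts):
        """Process one tweet; return (author, count) or None if it is filtered out."""
        if self.hashtag not in text:       # plays the role of the WHERE ... LIKE filter
            return None
        timestamps = self.events[author]   # plays the role of PARTITION BY author
        timestamps.append(ts)
        # Drop rows older than the window (RANGE INTERVAL '1' MINUTE PRECEDING).
        while timestamps[0] < ts - self.window:
            timestamps.popleft()
        return author, len(timestamps)


counter = OneMinuteCounter()
results = [
    counter.add("ann", "hello #AwsLondonSummit", 0),
    counter.add("bob", "no hashtag here", 5),
    counter.add("ann", "#AwsLondonSummit again", 30),
    counter.add("ann", "#AwsLondonSummit later", 70),   # the tweet at t=0 has expired
]
```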

Analysing streaming data… and when at rest

Amazon Athena

• No Infrastructure or administration
• Zero Spin up time
• Transparent upgrades
• Query data in its raw format
  • AVRO, Text, CSV, JSON, weblogs, AWS service logs
  • Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No loading of data, no ETL required
  • Stream data directly from Amazon S3, take advantage of Amazon S3 durability and availability
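Submitting a query to Athena programmatically needs only the SQL text, a database name and an S3 location for results. The sketch below builds those parameters for boto3's `start_query_execution`. The database, table and bucket names are hypothetical, and `run_query` is defined but not called because it needs AWS credentials.

```python
def athena_query_request(sql, database, output_location):
    """Build the keyword arguments for Athena's start_query_execution call."""
    if not output_location.startswith("s3://"):
        raise ValueError("query results must be written to an s3:// location")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query and return its execution id (not called in this sketch)."""
    import boto3  # imported lazily so the request builder works without boto3
    athena = boto3.client("athena")
    return athena.start_query_execution(**request)["QueryExecutionId"]


# Hypothetical table over raw web logs that already sit in S3.
request = athena_query_request(
    "SELECT status, count(*) AS hits FROM weblogs GROUP BY status",
    database="analytics",
    output_location="s3://example-athena-results/queries/",
)
```

The call returns immediately with an execution id; results are read from the output location once the query finishes.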

Simple Query editor with syntax highlighting and autocomplete

Data Catalog

Query History, Saved Queries, and Catalog Management

QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources including Amazon Athena

Amazon RDS

Amazon S3

Amazon Redshift

Amazon Athena

Using Amazon Athena with Amazon QuickSight

Building smarter applications

Add Machine Learning Capabilities

Amazon Machine Learning Service
• Batch and online predictions
• Train using data in S3, RDS and Redshift

Amazon EMR
• Comprehensive machine learning libraries (e.g. Spark MLlib, Anaconda)
• Provision analytics clusters in minutes, autoscale with data volume or query demand

Amazon AI Services

Amazon Polly – Lifelike Text-to-Speech
• 47 voices, 24 languages
• Low-latency, real time

Amazon Rekognition – Image Analysis
• Object and scene detection
• Facial analysis

Amazon Lex – Conversational Engine
• Speech and text recognition
• Enterprise connectors

Let’s hear from Polly..

Demographic Data

Facial Landmarks

Sentiment Expressed

Image Quality

Facial Analysis with Rekognition

Brightness: 25.84
Sharpness: 160

General Attributes

• Up to ~40k CUDA cores
• Pre-configured CUDA drivers
• Jupyter notebook with Python2, Python3, Anaconda
• CloudFormation Template
• AWS Marketplace – one-click deploy

AWS Deep Learning AMI

Scaling Distributed Experiments

• Inception v3 model
• Increasing machines from 1 to 47
• 2x faster than TensorFlow if using more than 10 machines

Example MXNet User | TuSimple|Autonomous Driving


Full Stack Analytics on AWS
