Analytics on AWS - Amazon Web Services, Inc.

ANALYTICS ON AWS Paul Armstrong, Solutions Architect, Amazon Web Services

Post on 16-Oct-2020

What to expect from this session

• AWS toolkit for analytics

• Analytics stakeholders

• Amazon Redshift and Amazon QuickSight

• Anomaly Detection

• Amazon Machine Learning – Churn Prediction Example

• Q & A

Big data portfolio – focus on choice

[Diagram: AWS services arranged under three stages: Collect, Store, Analyze]

Services shown: Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Aurora, AWS Data Pipeline, Amazon CloudSearch, Amazon EMR, Amazon EC2, Amazon Redshift, Amazon Machine Learning, Amazon Elasticsearch Service, AWS Database Migration Service, Amazon Kinesis Analytics, Amazon Kinesis Firehose, Amazon Kinesis Streams, AWS Direct Connect, Amazon QuickSight, AWS Import/Export

Match toolset to right persona

• Business intelligence (BI) analyst
  • Primary tool is SQL
  • Historical data resides in a data warehouse such as Amazon Redshift

• Data scientist
  • Uses programming languages such as R or Python

• Application developer
  • Requires an API to integrate with AWS services

BI ANALYST

BI analyst with existing BI tools

[Diagram labels: BI Analyst, BI tools, Amazon EC2, Amazon Redshift, Amazon QuickSight, Amazon QuickSight API]

• Primary tool is SQL

• Data is largely structured with well known data sources

• Primary concern is fast, consistent performance

• Need to extend SQL with custom functions

Amazon Redshift system architecture

Leader node
• SQL endpoint
• Stores metadata
• Coordinates query execution

Compute nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via Amazon S3; load from Amazon DynamoDB, Amazon EMR, or SSH

Two hardware platforms
• Optimized for data processing
• DS2: HDD; scale from 2 TB to 2 PB

• DC1: SSD; scale from 160 GB to 326 TB

[Diagram labels: JDBC/ODBC, 10 GigE (HPC)]

New SQL functions

We add SQL functions regularly to expand Amazon Redshift’s query capabilities

Added 25+ window and aggregate functions since launch, including:

LISTAGG

[APPROXIMATE] COUNT

DROP IF EXISTS, CREATE IF NOT EXISTS

REGEXP_SUBSTR, REGEXP_COUNT, REGEXP_INSTR, REGEXP_REPLACE

PERCENTILE_CONT, PERCENTILE_DISC, MEDIAN

PERCENT_RANK, RATIO_TO_REPORT

We’ll continue iterating but also want to enable you to write your own

Window function examples: http://docs.aws.amazon.com/redshift/latest/dg/r_Window_function_examples.html
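
To make two of the listed functions concrete, here is a minimal Python sketch of what PERCENT_RANK and RATIO_TO_REPORT compute over a single partition. It follows the usual SQL definitions, (rank - 1) / (rows - 1) and value / partition total. The helper names and sample values are illustrative and are not part of Amazon Redshift.

```python
def percent_rank(values):
    """PERCENT_RANK over one partition ordered ascending by value.

    Ties share the lowest rank in their group, as with SQL RANK().
    """
    n = len(values)
    if n <= 1:
        return [0.0] * n
    ordered = sorted(values)
    # ordered.index(v) counts the rows strictly smaller than v,
    # which is the same as rank - 1.
    return [ordered.index(v) / (n - 1) for v in values]


def ratio_to_report(values):
    """RATIO_TO_REPORT: each value divided by the partition total."""
    total = sum(values)
    return [v / total for v in values]


sales = [10, 20, 20, 50]  # illustrative partition
print(percent_rank(sales))
print(ratio_to_report(sales))
```

In SQL the same results would come from a window clause per row; the sketch only shows the arithmetic behind each function.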

Scalar user defined functions

You can write UDFs using Python 2.7

• Syntax is largely identical to PostgreSQL UDF

• Python execution is performed in parallel

• System and network calls within UDFs are prohibited

Comes integrated with Pandas, NumPy, SciPy, DateUtil, and Pytz analytic libraries

• Import your own libraries for even more flexibility

• Take advantage of thousands of functions available through Python libraries to perform operations not easily expressed in SQL
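
As a rough illustration of a scalar UDF, the sketch below keeps the function body as an ordinary Python function so it can be tried locally, and shows the DDL that would register the same logic in Amazon Redshift. The function name f_greater, its argument types, and the STABLE volatility setting are illustrative choices, not taken from the slides.

```python
# Body of a hypothetical scalar UDF, written as a plain Python function
# so the logic can be exercised before it is registered in Redshift.
def f_greater(a, b):
    """Return the larger of two numbers."""
    if a > b:
        return a
    return b


# DDL that would wrap the same logic as a Redshift scalar UDF.
# The name and argument types here are illustrative assumptions.
CREATE_UDF_SQL = """
CREATE OR REPLACE FUNCTION f_greater (a float, b float)
RETURNS float
STABLE
AS $$
    if a > b:
        return a
    return b
$$ LANGUAGE plpythonu;
"""

print(f_greater(3.0, 7.5))
print(CREATE_UDF_SQL.strip())
```

Once created, such a function would be called like any built-in, for example SELECT f_greater(col_a, col_b) FROM some_table; where the column and table names are placeholders.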

What is Amazon QuickSight?

A very fast, cloud-powered business intelligence service for 1/10 the cost of traditional BI software

[Diagram: Amazon QuickSight and its surroundings]

Users and access: Business User, Amazon QuickSight UI, Amazon QuickSight API, Mobile Devices, Web Browsers, Partner BI Products

Components: SPICE, Connectors, Data Prep, Metadata, Suggestions

Data sources: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon EMR, Amazon Redshift, Amazon RDS, Files, Third-party

DATA SCIENTIST

Data scientist with existing toolsets

[Diagram labels: Data scientist; Toolkits like SAS or R Studio installed with Amazon EC2; Unstructured data, Amazon S3; Structured data, Amazon Redshift]

• Work with unstructured datasets

• Use existing toolsets to connect to Amazon Redshift

Querying Amazon Redshift with R packages

• RJDBC: Supports SQL queries

• dplyr: Uses R code for data analysis

• RPostgreSQL: R compliant driver or Database Interface (DBI)

[Diagram labels: R User; R Studio; Amazon EC2; Unstructured data, Amazon S3; User profile, Amazon RDS; Amazon Redshift]

Connecting R with Amazon Redshift blog post: https://aws.amazon.com/blogs/big-data/connecting-r-with-amazon-redshift/

Querying Amazon Redshift with R packages example

APPLICATION DEVELOPER

Application developers can build smart applications using Amazon Machine Learning

[Diagram labels: Application, Amazon Machine Learning, Amazon Redshift, Amazon QuickSight, Structured data/predictions, Generate/query predictions, Visualize]

• All skill levels

• Amazon Machine Learning technology is accessed through APIs and SDKs

• Embed visualizations in applications
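
A minimal sketch of calling a prediction API from application code follows. It assumes the boto3 "machinelearning" client and its predict operation, and it assumes a response of the form {"Prediction": {"predictedLabel": ..., "predictedScores": {...}}}. The model ID, endpoint URL, and the FakeMLClient stand-in are hypothetical, included only so the sketch runs without AWS credentials.

```python
def predict_churn(ml_client, model_id, endpoint_url, customer):
    """Request one real-time prediction and return (label, score)."""
    # Attribute values are sent as strings in the Record map.
    record = {name: str(value) for name, value in customer.items()}
    response = ml_client.predict(
        MLModelId=model_id,
        Record=record,
        PredictEndpoint=endpoint_url,
    )
    prediction = response["Prediction"]
    label = prediction["predictedLabel"]
    score = prediction["predictedScores"].get(label)
    return label, score


class FakeMLClient:
    """Hypothetical stand-in that returns a canned response."""

    def predict(self, MLModelId, Record, PredictEndpoint):
        return {"Prediction": {"predictedLabel": "1",
                               "predictedScores": {"1": 0.42}}}


# With real credentials one would pass boto3.client("machinelearning").
label, score = predict_churn(
    FakeMLClient(),
    model_id="ml-EXAMPLE",                      # hypothetical model ID
    endpoint_url="https://example.invalid/ml",  # hypothetical endpoint
    customer={"Intl_Plan": 1, "CustServ_Calls": 4, "Day_Mins": 299.4},
)
print(label, score)
```

Passing the client in as an argument keeps the application logic testable without network access.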

Resources

Amazon Redshift Getting Started Guide: http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html

Scalar UDF Documentation: http://docs.aws.amazon.com/redshift/latest/dg/user-defined-functions.html

Introduction to Python UDFs in Amazon Redshift: https://blogs.aws.amazon.com/bigdata/post/Tx1IHV1G67CY53T/Introduction-to-Python-UDFs-in-Amazon-Redshift

Connecting R with Amazon Redshift: https://blogs.aws.amazon.com/bigdata/post/Tx1G8828SPGX3PK/Connecting-R-with-Amazon-Redshift

Databricks Apache Spark–Amazon Redshift Tutorial: https://github.com/databricks/spark-redshift/tree/master/tutorial

Amazon ML Getting Started Guide: https://aws.amazon.com/machine-learning/getting-started/

Amazon QuickSight: https://aws.amazon.com/quicksight/

Real-Time Anomaly Detection

• Ingest data from website through API Gateway and Amazon Kinesis Streams

• Use Amazon Kinesis Analytics to produce an anomaly score for each data point and identify trends in data

• Send users and machines notifications through Amazon SNS

[Diagram stages: Ingest clickstream data; Detect anomalies & take action; Notify users]

[Diagram labels: Amazon API Gateway, Amazon Kinesis Streams, Amazon Kinesis Analytics, Amazon Kinesis Streams, Lambda function, Amazon SNS topic, email notification, SMS notification, users]
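
The score-then-notify flow can be sketched locally as below. This is a stand-in only: it uses a trailing-window z-score, which is not the algorithm Amazon Kinesis Analytics applies, and the window size, threshold, sample data, and send callback are all illustrative assumptions. In the pipeline on this slide the notification step would go through a Lambda function and Amazon SNS.

```python
import statistics


def anomaly_scores(points, window=5):
    """Score each point by its distance from the trailing-window mean,
    in standard deviations. Points without a full window score 0.0."""
    scores = []
    for i, value in enumerate(points):
        history = points[max(0, i - window):i]
        if len(history) < window:
            scores.append(0.0)  # not enough history yet
            continue
        mean = statistics.mean(history)
        # Guard against a perfectly flat window (zero spread).
        spread = statistics.pstdev(history) or 1.0
        scores.append(abs(value - mean) / spread)
    return scores


def notify(points, threshold, send):
    """Call send(message) for every point scoring above threshold."""
    flagged = []
    for i, score in enumerate(anomaly_scores(points)):
        if score > threshold:
            send("anomaly at index %d: value=%s" % (i, points[i]))
            flagged.append(i)
    return flagged


clicks_per_minute = [100, 102, 98, 101, 99, 100, 350, 101, 97]
sent = []
flagged = notify(clicks_per_minute, threshold=3.0, send=sent.append)
print(flagged, sent)
```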

Predicting Customer Churn with Amazon ML

Supervised Learning

[Diagram: known historical data (Input/Outcome pairs) → Supervised Learning (Amazon ML); Unseen Input → Same Outcome]

Amazon Machine Learning Service

Telco Churn Dataset

• US telco customers, their cell phone plans and usage

• 21 attributes, 3333 rows:
  • Customer: State, Area_Code, Phone
  • Plan: Intl_Plan, VMail_Plan
  • Behavior: VMail_Messages, Day_Mins, Day_Calls, Day_Charge, Eve_Mins, Eve_Calls, Eve_Charge, Night_Mins, Night_Calls, Night_Charge, Intl_Mins, Intl_Calls, Intl_Charge
  • Other: Account_Length, CustServ_Calls, Churn

Telco Churn Dataset

KS, 128, 415, 382-4657, 0, 1, 25, 265.100000, 110, 45.070000, 197.400000, 99, 16.780000, 244.700000, 91, 11.010000, 10.000000, 3, 2.700000, 1, 0

OH, 107, 415, 371-7191, 0, 1, 26, 161.600000, 123, 27.470000, 195.500000, 103, 16.620000, 254.400000, 103, 11.450000, 13.700000, 3, 3.700000, 1, 0

NJ, 137, 415, 358-1921, 0, 0, 0, 243.400000, 114, 41.380000, 121.200000, 110, 10.300000, 162.600000, 104, 7.320000, 12.200000, 5, 3.290000, 0, 0

OH, 84, 408, 375-9999, 1, 0, 0, 299.400000, 71, 50.900000, 61.900000, 88, 5.260000, 196.900000, 89, 8.860000, 6.600000, 7, 1.780000, 2, 0

OK, 75, 415, 330-6626, 1, 0, 0, 166.700000, 113, 28.340000, 148.300000, 122, 12.610000, 186.900000, 121, 8.410000, 10.100000, 3, 2.730000, 3, 0

AL, 118, 510, 391-8027, 1, 0, 0, 223.400000, 98, 37.980000, 220.600000, 101, 18.750000, 203.900000, 118, 9.180000, 6.300000, 6, 1.700000, 0, 0
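
The sample rows can be read into named fields with a short Python sketch like the one below. The column order is an assumption inferred from the values (for example, the second field matches the Account_Length values shown on the normalize() slide); the slides do not state it.

```python
import csv
import io

# Assumed column order, inferred from the sample values.
COLUMNS = [
    "State", "Account_Length", "Area_Code", "Phone", "Intl_Plan",
    "VMail_Plan", "VMail_Messages", "Day_Mins", "Day_Calls", "Day_Charge",
    "Eve_Mins", "Eve_Calls", "Eve_Charge", "Night_Mins", "Night_Calls",
    "Night_Charge", "Intl_Mins", "Intl_Calls", "Intl_Charge",
    "CustServ_Calls", "Churn",
]

# Two of the rows shown on the slide.
SAMPLE = """\
KS, 128, 415, 382-4657, 0, 1, 25, 265.100000, 110, 45.070000, 197.400000, 99, 16.780000, 244.700000, 91, 11.010000, 10.000000, 3, 2.700000, 1, 0
OH, 84, 408, 375-9999, 1, 0, 0, 299.400000, 71, 50.900000, 61.900000, 88, 5.260000, 196.900000, 89, 8.860000, 6.600000, 7, 1.780000, 2, 0
"""

# skipinitialspace drops the blank that follows each comma.
reader = csv.reader(io.StringIO(SAMPLE), skipinitialspace=True)
rows = [dict(zip(COLUMNS, fields)) for fields in reader]

for row in rows:
    print(row["State"], row["Account_Length"], row["Day_Mins"], row["Churn"])
```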

Console: Creating Datasource for Amazon ML

Console: Building the Amazon ML Model

Recipe

{
  "groups": {
    "NUMERIC_VARS_NORM": "group('Intl_Charge','Night_Calls','Day_Calls','Eve_Calls','Eve_Mins','Intl_Mins','VMail_Message','Intl_Calls','Day_Mins','Night_Mins','Day_Charge','Night_Charge','Eve_Charge','Account_Length')"
  },
  "assignments": {},
  "outputs": [
    "ALL_BINARY",
    "State",
    "Area_Code",
    "normalize(NUMERIC_VARS_NORM)",
    "CustServ_Calls"
  ]
}

Recipe: normalize() function

Account_Length    Normalized Value
128                0.808771865
107               -0.047574816
137                1.175777586
84                -0.985478323
75                -1.352484044
118                0.400987732
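
The six values in this table are consistent with a z-score, (x - mean) / standard deviation, computed over just these six rows with the sample (n - 1) standard deviation. The sketch below reproduces them under that assumption; how Amazon ML computes normalize() over a full dataset is not stated on the slide.

```python
import statistics

account_length = [128, 107, 137, 84, 75, 118]

mean = statistics.mean(account_length)
std = statistics.stdev(account_length)  # sample standard deviation (n - 1)

# Centre each value on the mean and scale by the standard deviation.
normalized = [(x - mean) / std for x in account_length]

for raw, z in zip(account_length, normalized):
    print("%5d  %.9f" % (raw, z))
```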

Building the Amazon ML Model

Cost of Errors

• Cost of Customer Churn and Acquisition (false negative):
  • foregone cashflow
  • advertising costs
  • POS and sign-up admin costs

• Customer Retention Cost (false + true positive)
  • Discounts
  • Phone upgrades
  • etc

Financial Outcome of Applying a Model

Prior Churn    Churn Cost    Cost without ML
14.49%         $500.00       $72.46

False Negative    True + False Pos    Retention Cost    Cost with ML
4.80%             12.10% + 14.30%     $100.00           $50.40

• Threshold 0.3 → 0.17

• $22.06 of savings per customer

• With 100,000 customers over $2MM in savings with ML
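
The per-customer figures can be checked with a few lines of arithmetic. The sketch below assumes the costs combine as churn rate times churn cost without ML, and as false negatives times churn cost plus all flagged customers times retention cost with ML. That combination reproduces the $50.40 on the slide; the cost without ML comes out one cent lower ($72.45), presumably because the slide used an unrounded churn rate.

```python
churn_cost = 500.00      # cost of losing and replacing a customer
retention_cost = 100.00  # cost of a retention offer

prior_churn = 0.1449     # share of customers who churn
false_negative = 0.048   # churners the model misses
true_positive = 0.121    # churners the model flags
false_positive = 0.143   # non-churners the model flags

# Without a model, every churner incurs the full churn cost.
cost_without_ml = prior_churn * churn_cost

# With a model, missed churners cost the full amount and every
# flagged customer receives a retention offer.
cost_with_ml = (false_negative * churn_cost
                + (true_positive + false_positive) * retention_cost)

savings_per_customer = cost_without_ml - cost_with_ml
total_savings = savings_per_customer * 100000

print("without ML: $%.2f" % cost_without_ml)
print("with ML:    $%.2f" % cost_with_ml)
print("savings:    $%.2f per customer" % savings_per_customer)
print("total:      $%.0f for 100,000 customers" % total_savings)
```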

What Next?

• https://aws.amazon.com/getting-started/projects/build-machine-learning-model/

• https://aws.amazon.com/machine-learning/developer-resources/

• Cost Threshold Calculation https://github.com/dbatalov/cost_based_ml

• Apache Spark on EMR https://aws.amazon.com/emr/details/spark/

• Artificial Intelligence on AWS https://aws.amazon.com/amazon-ai/

• Amazon AMIs for Deep Learning https://aws.amazon.com/amazon-ai/amis/

THANK YOU!