Analytics on AWS - Amazon Web Services, Inc.

Post on 16-Oct-2020


ANALYTICS ON AWS

Paul Armstrong, Solutions Architect, Amazon Web Services

What to expect from this session

• AWS toolkit for analytics

• Analytics stakeholders

• Amazon Redshift and Amazon QuickSight

• Anomaly Detection

• Amazon Machine Learning – Churn Prediction Example

• Q & A

Big data portfolio – focus on choice

[Diagram: AWS services grouped under Collect, Store, and Analyze. Services shown: Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Aurora, AWS Data Pipeline, Amazon CloudSearch, Amazon EMR, Amazon EC2, Amazon Redshift, Amazon Machine Learning, Amazon Elasticsearch Service, AWS Database Migration Service, Amazon Kinesis Analytics, Amazon Kinesis Firehose, Amazon Kinesis Streams, AWS Direct Connect, Amazon QuickSight, AWS Import/Export]


Match toolset to right persona

• Business intelligence (BI) analyst
  • Primary tool is SQL
  • Historical data resides in a data warehouse such as Amazon Redshift

• Data scientist
  • Uses programming languages such as R or Python

• Application developer
  • Requires an API to integrate with AWS services

BI ANALYST

BI analyst with existing BI tools

[Diagram components: BI analyst, BI tools, Amazon EC2, Amazon Redshift, Amazon QuickSight API]

• Primary tool is SQL

• Data is largely structured, with well-known data sources

• Primary concern is fast, consistent performance

• Need to extend SQL with custom functions

[Diagram components: BI tools, Amazon EC2, Amazon QuickSight]

Amazon Redshift system architecture

Leader node
• SQL endpoint
• Stores metadata
• Coordinates query execution

Compute nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via Amazon S3; load from Amazon DynamoDB, Amazon EMR, or SSH

Two hardware platforms
• Optimized for data processing
• DS2: HDD; scale from 2 TB to 2 PB
• DC1: SSD; scale from 160 GB to 356 TB

[Diagram labels: JDBC/ODBC, 10 GigE (HPC)]

New SQL functions

We add SQL functions regularly to expand Amazon Redshift’s query capabilities

Added 25+ window and aggregate functions since launch, including:

LISTAGG

[APPROXIMATE] COUNT

DROP IF EXISTS, CREATE IF NOT EXISTS

REGEXP_SUBSTR, _COUNT, _INSTR, _REPLACE

PERCENTILE_CONT, _DISC, MEDIAN

PERCENT_RANK, RATIO_TO_REPORT

We’ll continue iterating but also want to enable you to write your own

Window function examples: http://docs.aws.amazon.com/redshift/latest/dg/r_Window_function_examples.html
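To make a few of the functions listed above concrete, here is a minimal Python sketch of what RATIO_TO_REPORT, PERCENT_RANK, and MEDIAN compute over a single partition. It illustrates the arithmetic only and is not Amazon Redshift code; the `sales` values and the helper names are made up for the example.

```python
from statistics import median


def ratio_to_report(values):
    """Each value divided by the partition total (what RATIO_TO_REPORT returns)."""
    total = sum(values)
    return [v / total for v in values]


def percent_rank(values):
    """(rank - 1) / (rows - 1), ranking in ascending order; ties share the lowest rank."""
    n = len(values)
    if n == 1:
        return [0.0]
    ordered = sorted(values)
    # The first position of v in sorted order equals rank - 1.
    return [ordered.index(v) / (n - 1) for v in values]


sales = [10, 20, 20, 50]  # hypothetical values in one partition

print(ratio_to_report(sales))
print(percent_rank(sales))
print(median(sales))
```

In SQL the same results come from a single window expression per column, which is why these functions are convenient for analysts whose primary tool is SQL.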

Scalar user defined functions

You can write UDFs using Python 2.7

• Syntax is largely identical to PostgreSQL UDF

• Python execution is performed in parallel

• System and network calls within UDFs are prohibited

Comes integrated with Pandas, NumPy, SciPy, DateUtil, and Pytz analytic libraries

• Import your own libraries for even more flexibility
• Take advantage of thousands of functions available through Python libraries to perform operations not easily expressed in SQL
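As a rough sketch of what a scalar Python UDF looks like, the example below keeps the function body as ordinary Python so it can be tried locally, and pairs it with the DDL that would register the same logic in Amazon Redshift. The function name `f_greater` and its arguments are hypothetical; the body is deliberately simple and makes no system or network calls.

```python
# Body of a hypothetical scalar UDF, written as a plain function so it can be
# exercised locally before it is registered in the cluster.
def f_greater(a, b):
    """Return the larger of two numbers."""
    if a > b:
        return a
    return b


# DDL that would wrap the same logic as a Redshift scalar UDF.
# The function name and argument names are assumptions for illustration.
CREATE_UDF_SQL = """
CREATE FUNCTION f_greater (a float, b float)
  RETURNS float
STABLE
AS $$
  if a > b:
      return a
  return b
$$ LANGUAGE plpythonu;
"""

print(f_greater(1.5, 2.0))
print(CREATE_UDF_SQL)
```

Once created, such a function would be called from SQL like any built-in scalar function, for example in a SELECT list.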

What is Amazon QuickSight?

A very fast, cloud-powered business intelligence service for 1/10 the cost of traditional BI software

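[Diagram: business users reach Amazon QuickSight through the Amazon QuickSight UI and the Amazon QuickSight API. Access channels shown: mobile devices, web browsers, partner BI products. Internal components shown: SPICE, Connectors, Data Prep, Metadata, Suggestions. Data sources shown: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon EMR, Amazon Redshift, Amazon RDS, files, third-party]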

DATA SCIENTIST

Data scientist with existing toolsets

[Diagram components: Data scientist; toolkits like SAS or R Studio installed with Amazon EC2; Unstructured data, Amazon S3; Structured data, Amazon Redshift]

• Work with unstructured datasets

• Use existing toolsets to connect to Amazon Redshift

Querying Amazon Redshift with R packages

• RJDBC: supports SQL queries
• dplyr: uses R code for data analysis
• RPostgreSQL: R-compliant driver or Database Interface (DBI)

[Diagram components: R User; R Studio, Amazon EC2; Unstructured data, Amazon S3; User profile, Amazon RDS; Amazon Redshift]

Connecting R with Amazon Redshift blog post: https://aws.amazon.com/blogs/big-data/connecting-r-with-amazon-redshift/

Querying Amazon Redshift with R packages example

APPLICATION DEVELOPER

Application developers can build smart applications using Amazon Machine Learning

[Diagram components: Application; Amazon Machine Learning (generate/query predictions); Amazon Redshift (structured data/predictions); Amazon QuickSight (visualize)]

• All skill levels

• Amazon Machine Learning technology is accessed through APIs and SDKs

• Embed visualizations in applications

Resources

Amazon Redshift Getting Started Guide: http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html

Scalar UDF Documentation: http://docs.aws.amazon.com/redshift/latest/dg/user-defined-functions.html

Introduction to Python UDFs in Amazon Redshift: https://blogs.aws.amazon.com/bigdata/post/Tx1IHV1G67CY53T/Introduction-to-Python-UDFs-in-Amazon-Redshift

Connecting R with Amazon Redshift: https://blogs.aws.amazon.com/bigdata/post/Tx1G8828SPGX3PK/Connecting-R-with-Amazon-Redshift

Databricks Apache Spark–Amazon Redshift Tutorial: https://github.com/databricks/spark-redshift/tree/master/tutorial

Amazon ML Getting Started Guide: https://aws.amazon.com/machine-learning/getting-started/

Amazon QuickSight: https://aws.amazon.com/quicksight/

Real-Time Anomaly Detection

• Ingest data from website through API Gateway and Amazon Kinesis Streams
• Use Amazon Kinesis Analytics to produce an anomaly score for each data point and identify trends in data
• Send users and machines notifications through Amazon SNS

[Diagram stages: Ingest clickstream data; Detect anomalies & take action; Notify users. Components shown: Amazon API Gateway, Amazon Kinesis Streams (twice), Amazon Kinesis Analytics, Lambda function, Amazon SNS topic, email notification, SMS notification, users]
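The "score each data point, then act on high scores" idea can be sketched in a few lines of Python. This is a simple stand-in that uses a rolling z-score; it is not the algorithm Amazon Kinesis Analytics applies, and the window size, threshold, and `clicks_per_second` values are assumptions chosen for illustration. In the pipeline above, the points returned by `alerts` are what a Lambda function would forward to an Amazon SNS topic.

```python
from collections import deque
from statistics import mean, pstdev


def anomaly_scores(values, window=5):
    """Yield (value, score); score is |value - mean| / std of the previous `window` points."""
    history = deque(maxlen=window)
    for v in values:
        if len(history) < 2:
            score = 0.0  # not enough history to judge yet
        else:
            sigma = pstdev(history)
            score = abs(v - mean(history)) / sigma if sigma > 0 else 0.0
        yield v, score
        history.append(v)  # the point joins the history only after it is scored


def alerts(values, threshold=3.0, window=5):
    """Return the data points whose anomaly score exceeds the threshold."""
    return [v for v, score in anomaly_scores(values, window) if score > threshold]


clicks_per_second = [10, 11, 10, 11, 10, 11, 95, 10, 11]  # hypothetical stream
print(alerts(clicks_per_second))  # [95]
```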

Predicting Customer Churn with Amazon ML

Supervised Learning

[Diagram: known historical data, shown as Input/Outcome pairs, is used for supervised learning with Amazon ML; the final build adds the labels "Unseen Input" and "Same Outcome"]

Amazon Machine Learning Service


Telco Churn Dataset

• US telco customers, their cell phone plans and usage
• 21 attributes, 3333 rows:
  • Customer: State, Area_Code, Phone
  • Plan: Intl_Plan, VMail_Plan
  • Behavior: VMail_Messages, Day_Mins, Day_Calls, Day_Charge, Eve_Mins, Eve_Calls, Eve_Charge, Night_Mins, Night_Calls, Night_Charge, Intl_Mins, Intl_Calls, Intl_Charge
  • Other: Account_Length, CustServ_Calls, Churn


Telco Churn Dataset

KS, 128, 415, 382-4657, 0, 1, 25, 265.100000, 110, 45.070000, 197.400000, 99, 16.780000, 244.700000, 91, 11.010000, 10.000000, 3, 2.700000, 1, 0

OH, 107, 415, 371-7191, 0, 1, 26, 161.600000, 123, 27.470000, 195.500000, 103, 16.620000, 254.400000, 103, 11.450000, 13.700000, 3, 3.700000, 1, 0

NJ, 137, 415, 358-1921, 0, 0, 0, 243.400000, 114, 41.380000, 121.200000, 110, 10.300000, 162.600000, 104, 7.320000, 12.200000, 5, 3.290000, 0, 0

OH, 84, 408, 375-9999, 1, 0, 0, 299.400000, 71, 50.900000, 61.900000, 88, 5.260000, 196.900000, 89, 8.860000, 6.600000, 7, 1.780000, 2, 0

OK, 75, 415, 330-6626, 1, 0, 0, 166.700000, 113, 28.340000, 148.300000, 122, 12.610000, 186.900000, 121, 8.410000, 10.100000, 3, 2.730000, 3, 0

AL, 118, 510, 391-8027, 1, 0, 0, 223.400000, 98, 37.980000, 220.600000, 101, 18.750000, 203.900000, 118, 9.180000, 6.300000, 6, 1.700000, 0, 0
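The sample rows above carry no header, so a small Python sketch of reading them may help. The column order below is an assumption inferred from the values (for example, the second field matches the Account_Length figures quoted for the normalize() function); it is not stated in the slides. Only two of the rows are embedded to keep the example short.

```python
import csv
import io

# Assumed column order, inferred from the sample values; not given in the source.
COLUMNS = [
    "State", "Account_Length", "Area_Code", "Phone", "Intl_Plan", "VMail_Plan",
    "VMail_Messages", "Day_Mins", "Day_Calls", "Day_Charge", "Eve_Mins",
    "Eve_Calls", "Eve_Charge", "Night_Mins", "Night_Calls", "Night_Charge",
    "Intl_Mins", "Intl_Calls", "Intl_Charge", "CustServ_Calls", "Churn",
]

SAMPLE = """\
KS, 128, 415, 382-4657, 0, 1, 25, 265.100000, 110, 45.070000, 197.400000, 99, 16.780000, 244.700000, 91, 11.010000, 10.000000, 3, 2.700000, 1, 0
OH, 84, 408, 375-9999, 1, 0, 0, 299.400000, 71, 50.900000, 61.900000, 88, 5.260000, 196.900000, 89, 8.860000, 6.600000, 7, 1.780000, 2, 0
"""


def parse_rows(text):
    """Parse headerless CSV text into a list of dicts keyed by COLUMNS."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(COLUMNS):
            raise ValueError("expected %d fields, got %d" % (len(COLUMNS), len(fields)))
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


rows = parse_rows(SAMPLE)
churn_rate = sum(int(r["Churn"]) for r in rows) / len(rows)
print(rows[0]["State"], rows[0]["Day_Mins"], churn_rate)
```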

Console: Creating Datasource for Amazon ML


Console: Building the Amazon ML Model

Recipe

{
  "groups": {
    "NUMERIC_VARS_NORM": "group('Intl_Charge','Night_Calls','Day_Calls','Eve_Calls','Eve_Mins','Intl_Mins','VMail_Message','Intl_Calls','Day_Mins','Night_Mins','Day_Charge','Night_Charge','Eve_Charge','Account_Length')"
  },
  "assignments": {},
  "outputs": [
    "ALL_BINARY",
    "State",
    "Area_Code",
    "normalize(NUMERIC_VARS_NORM)",
    "CustServ_Calls"
  ]
}

Recipe: normalize() function

Account_Length    Normalized Value
128                0.808771865
107               -0.047574816
137                1.175777586
84                -0.985478323
75                -1.352484044
118                0.400987732
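The normalized values in the table can be reproduced with a standard z-score: subtract the mean and divide by the standard deviation. The sketch below is a plain-Python illustration, not the service's implementation. One observation worth flagging: the table's numbers match when the mean and the sample standard deviation (n - 1 in the denominator) are computed over just the six Account_Length values shown.

```python
from statistics import mean, stdev

account_length = [128, 107, 137, 84, 75, 118]  # values from the table above


def normalize(values):
    """Rescale values to zero mean and unit (sample) standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, n - 1 in the denominator
    return [(v - mu) / sigma for v in values]


for raw, z in zip(account_length, normalize(account_length)):
    print(raw, round(z, 6))
```

Normalizing puts attributes with very different ranges, such as minutes and call counts, on a comparable scale before the model is trained.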

Building the Amazon ML Model

Cost of Errors

• Cost of customer churn and acquisition (false negative):
  • Foregone cash flow
  • Advertising costs
  • POS and sign-up admin costs
• Customer retention cost (false + true positive):
  • Discounts
  • Phone upgrades
  • etc.
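Because the two kinds of error have different costs, the score threshold that minimizes cost need not be the default one. The Python sketch below shows one way to sweep candidate thresholds. It assumes, as the cost breakdown above does, that a retention offer is paid for every flagged customer and that missed churners incur the full churn cost. The scores, labels, dollar amounts, and candidate thresholds are hypothetical.

```python
def cost_per_customer(scores, labels, threshold, churn_cost=500.0, retention_cost=100.0):
    """Average cost when every customer scoring >= threshold gets a retention offer."""
    total = 0.0
    for score, churned in zip(scores, labels):
        if score >= threshold:
            total += retention_cost  # true or false positive: pay for the offer
        elif churned:
            total += churn_cost      # false negative: the customer is lost
    return total / len(scores)


def best_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the lowest average cost."""
    return min(candidates, key=lambda t: cost_per_customer(scores, labels, t))


# Hypothetical model scores and true churn labels (1 = churned).
scores = [0.05, 0.10, 0.20, 0.25, 0.40, 0.60, 0.80, 0.90]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
candidates = [0.1, 0.2, 0.3, 0.5, 0.7]

for t in candidates:
    print(t, cost_per_customer(scores, labels, t))
print("best:", best_threshold(scores, labels, candidates))
```

With these made-up numbers a lower threshold wins, because a missed churner costs five times as much as an unnecessary retention offer.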

Financial Outcome of Applying a Model

Prior Churn    Churn Cost    Cost without ML
14.49%         $500.00       $72.46

False Negative    True + False Pos    Retention Cost    Cost with ML
4.80%             12.10% + 14.30%     $100.00           $50.40

• Threshold 0.3 0.17
• $22.06 of savings per customer
• With 100,000 customers, over $2MM in savings with ML
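The arithmetic behind these figures can be checked with a short Python sketch. The formulas are reconstructed from the numbers on the slide rather than stated there: cost without ML is the churn rate times the churn cost, and cost with ML is the false-negative rate times the churn cost plus the flagged share (true plus false positives) times the retention cost. Using the rounded 14.49% gives $72.45 rather than the slide's $72.46, which presumably reflects an unrounded churn rate, so the savings here come out one cent lower.

```python
PRIOR_CHURN = 0.1449      # share of customers who churn
CHURN_COST = 500.00       # cost of losing and replacing one customer
RETENTION_COST = 100.00   # cost of one retention offer
FALSE_NEG = 0.048         # churners the model misses
TRUE_POS = 0.121          # churners the model flags
FALSE_POS = 0.143         # non-churners the model flags


def cost_without_ml(prior_churn, churn_cost):
    """Expected cost per customer when nothing is done about churn."""
    return prior_churn * churn_cost


def cost_with_ml(false_neg, true_pos, false_pos, churn_cost, retention_cost):
    """Expected cost per customer when every flagged customer gets an offer."""
    return false_neg * churn_cost + (true_pos + false_pos) * retention_cost


baseline = cost_without_ml(PRIOR_CHURN, CHURN_COST)
with_ml = cost_with_ml(FALSE_NEG, TRUE_POS, FALSE_POS, CHURN_COST, RETENTION_COST)
savings = baseline - with_ml

print("cost without ML: %.2f" % baseline)
print("cost with ML:    %.2f" % with_ml)
print("savings per customer: %.2f" % savings)
print("savings for 100,000 customers: %.0f" % (savings * 100000))
```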

What Next?

• https://aws.amazon.com/getting-started/projects/build-machine-learning-model/
• https://aws.amazon.com/machine-learning/developer-resources/
• Cost Threshold Calculation https://github.com/dbatalov/cost_based_ml
• Apache Spark on EMR https://aws.amazon.com/emr/details/spark/
• Artificial Intelligence on AWS https://aws.amazon.com/amazon-ai/
• Amazon AMIs for Deep Learning https://aws.amazon.com/amazon-ai/amis/

THANK YOU!
