AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics


Page 1: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Greg Khairallah, Business Development Manager, AWS

Adam Savitzky, Software Development Engineer, Yahoo!

Scott Hoover, Data Scientist, Looker

July 23, 2015

Best Practices: Amazon Redshift Reporting and Advanced Analytics

Page 2: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Amazon Redshift – Resources

Getting Started – June Webinar Series: https://www.youtube.com/watch?v=biqBjWqJi-Q

Best Practices – July Webinar Series:

Optimizing Performance – July 21, 2015

Migration and Data Loading – July 22, 2015

Reporting and Advanced Analytics – July 23, 2015

Page 3: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Agenda

• Connecting to Amazon Redshift
• Case Study – Redshift Analytics at Yahoo
• Case Study – Redshift Optimizations at Looker
• Questions and Answers

Page 4: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Petabyte scale; massively parallel

Relational data warehouse

Fully managed; zero admin

SSD & HDD platforms

As low as $1,000/TB/Year

Amazon Redshift

Page 5: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Common Customer Use Cases

Traditional Enterprise DW
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business

Companies with Big Data
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools

SaaS Companies
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude

Page 6: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Custom ODBC and JDBC Drivers

Up to 35% higher performance than open source drivers

Supported by most Business Intelligence tools

Will continue to support PostgreSQL open source drivers

Download drivers from console

Page 7: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Amazon Redshift Partners

Page 8: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Redshift for Analytics at Yahoo

Adam Savitzky

Tech Yahoo, Software Development Engineer

Page 9: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Introduction

Who am I?
• Yahoo growth team
• Supporting analytics for 6 products in Yahoo’s mobile portfolio

In the past:

Page 10: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Introduction

What do we do?
▪ Real-time ad-hoc analytics
▪ Mobile properties
▪ What do we care about?
  › Engagement and activity
  › User demographics
  › Experimentation
  › Funnel analysis
  › Modeling revenue and user Lifetime Value
  › Cohort analysis and retention

Page 11: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

High Level Architecture

[Diagram: Mobile App, Hadoop, S3, Redshift, ETL]

Page 12: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Scale

▪ On an average day
  › 1 billion events
  › 25 million devices
  › 2 billion parameter key/value pairs

▪ Planned capacity
  › 21 dc1.8xlarge nodes
  › 80 billion events
  › 100 million devices
  › 50 TB (compressed!)

Page 13: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Data Model

Page 14: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Performance Optimizations

▪ Heavy use of summarization where appropriate
▪ Sort keys and partitioning
▪ Data encoding

Page 15: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Event Schema

[Diagram: tables grouped as Raw Tables, Summary Tables, and Derived Tables]

▪ event_raw, one table per product (mail, flickr, homerun, stark, arrow), combined by an event raw union view
▪ param, one table per product (mail, flickr, homerun, stark, arrow), combined by a param union view
▪ Other tables shown: event hourly, event daily, install, install attribution, user retention, funnel, first_event date, is_active, param keys, telemetry daily, revenue daily

Page 16: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Case Study: User Retention Analysis

Page 17: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Definitions

▪ Cohort - A group of product users that share one or more attributes
  › Example: All users who installed on Monday with Android devices

▪ Retention - How many members of a cohort continue to use the product over time
  › Example: 100 users installed on Monday with Android devices. 7 days later, 50 of those users returned to the product. We would say the 7-day retention for this cohort is 50%.

Page 18: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Why Study User Retention?

▪ Quantifies how “sticky” your product is
▪ Allows us to measure Customer Lifetime Value (CLV or LTV)

Page 19: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Why Study User Retention?

[Chart: % Retained, comparing Asymptotic Retention with No Retention]

Page 20: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Why Study User Retention?

[Chart: Total Users over Time, comparing Asymptotic Retention with No Retention]
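One way to see why the two retention shapes matter for total users is a small simulation. The sketch below is illustrative only: the constant install rate, the two retention curves, and the function names are all assumptions, not figures or code from the talk.

```python
def total_active_users(days, installs_per_day, retention):
    """Active users on each day: sum over earlier cohorts of installs * retention(age)."""
    totals = []
    for today in range(days):
        totals.append(
            sum(installs_per_day * retention(today - d) for d in range(today + 1))
        )
    return totals

# Two assumed retention curves: fraction of a cohort still active at a given age in days.
def no_retention(age):
    return 0.5 ** age  # decays toward zero

def asymptotic_retention(age):
    return 0.2 + 0.8 * 0.5 ** age  # decays toward an assumed 20% floor

flat = total_active_users(60, 100, no_retention)
growing = total_active_users(60, 100, asymptotic_retention)
print(round(flat[-1]), round(growing[-1]))
```

Under these assumptions the first series levels off while the second keeps growing, because every cohort leaves a permanent residue of active users.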

Page 21: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Calculating User Retention

Definition: For each possible combination of cohort dimensions and every possible event date, count how many devices belong to that cohort and how many devices from that cohort were active on that day.

Example with one dimension, os_name:

event_date  product  install_date  os_name  active_users  cohort_size
monday      mail     monday        android  100           100
tuesday     mail     monday        android  83            100
monday      mail     monday        ios      75            75
tuesday     mail     monday        ios      62            75

Page 22: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Calculating User Retention

Example with one dimension, os_name: What’s my 1-day retention for users who installed on Monday?

event_date  product  install_date  os_name  active_users  cohort_size
monday      mail     monday        android  100           100
tuesday     mail     monday        android  83            100
monday      mail     monday        ios      75            75
tuesday     mail     monday        ios      62            75

Page 24: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Calculating User Retention

Example with one dimension, os_name: What’s my 1-day retention for users who installed on Monday?

event_date  product  install_date  os_name  active_users  cohort_size
tuesday     mail     monday        android  83            100
tuesday     mail     monday        ios      62            75
total                                       145           175

Aggregate retention across both ios and android is (83 + 62) / (100 + 75) = 145 / 175 ≈ 83%
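The arithmetic above can be sketched in a few lines. The row layout and the function name `aggregate_retention` are illustrative assumptions, not code from the talk. The point is that aggregate retention is the ratio of summed active users to summed cohort sizes, rather than an average of the per-cohort percentages.

```python
# Rows copied from the example table: one row per (event_date, os_name) cohort slice.
rows = [
    {"event_date": "tuesday", "product": "mail", "install_date": "monday",
     "os_name": "android", "active_users": 83, "cohort_size": 100},
    {"event_date": "tuesday", "product": "mail", "install_date": "monday",
     "os_name": "ios", "active_users": 62, "cohort_size": 75},
]

def aggregate_retention(rows):
    """Sum numerators and denominators separately, then divide."""
    active = sum(r["active_users"] for r in rows)
    cohort = sum(r["cohort_size"] for r in rows)
    return active / cohort

print(f"{aggregate_retention(rows):.1%}")  # 145 / 175, prints 82.9%
```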

Page 25: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Calculating User Retention

Steps:
1. For each day, determine whether each device was active or not

device_id  date        is_active
1          2015-01-01  1
1          2015-01-02  0
2          2015-01-01  1
2          2015-01-02  1

Page 26: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Calculating User Retention

Steps:
1. For each day, determine whether each device was active or not
2. Join device attributes to results of Step 1

device_id  date        is_active  os   install_date
1          2015-01-01  1          ios  2015-01-01
1          2015-01-02  0          ios  2015-01-01
2          2015-01-01  1          ios  2015-01-01
2          2015-01-02  1          ios  2015-01-01

Page 27: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Calculating User Retention

Steps:
1. For each day, determine whether each device was active or not
2. Join device attributes to results of Step 1
3. SUM is_active column, grouping by date, os, and install_date (and any other cohort dimensions)

date        active_user_count  os   install_date
2015-01-01  2                  ios  2015-01-01
2015-01-02  1                  ios  2015-01-01

Page 28: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Calculating User Retention

Steps:
1. For each day, determine whether each device was active or not
2. Join device attributes to results of Step 1
3. SUM is_active column, grouping by date, os, and install_date (and any other cohort dimensions)
4. Join the size of each cohort to the result of Step 3

date        active_user_count  os   install_date  cohort_size
2015-01-01  2                  ios  2015-01-01    2
2015-01-02  1                  ios  2015-01-01    2
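The four steps above can be sketched in plain Python. This is an illustration only, not the SQL Yahoo runs on Redshift: the data structures and the name `retention_table` are assumptions, and the input assumes device 2's second row is for 2015-01-02, which is what the Step 3 counts imply.

```python
from collections import defaultdict

# Step 1 output (hypothetical): one row per device per day with an is_active flag.
activity = [
    (1, "2015-01-01", 1),
    (1, "2015-01-02", 0),
    (2, "2015-01-01", 1),
    (2, "2015-01-02", 1),
]
# Device attributes, i.e. the cohort dimensions.
devices = {
    1: {"os": "ios", "install_date": "2015-01-01"},
    2: {"os": "ios", "install_date": "2015-01-01"},
}

def retention_table(activity, devices):
    # Steps 2 and 3: join attributes, then SUM is_active per (date, os, install_date).
    active = defaultdict(int)
    for device_id, date, is_active in activity:
        attrs = devices[device_id]
        active[(date, attrs["os"], attrs["install_date"])] += is_active
    # Cohort size: number of devices per (os, install_date).
    cohort_size = defaultdict(int)
    for attrs in devices.values():
        cohort_size[(attrs["os"], attrs["install_date"])] += 1
    # Step 4: join the cohort size onto the grouped counts.
    return [
        {"date": date, "active_user_count": count, "os": os_name,
         "install_date": install_date,
         "cohort_size": cohort_size[(os_name, install_date)]}
        for (date, os_name, install_date), count in sorted(active.items())
    ]

result = retention_table(activity, devices)
for row in result:
    print(row)
```

Dividing `active_user_count` by `cohort_size` in each row then gives the retention for that cohort on that date.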

Page 29: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Demo using Looker

Page 30: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Lessons Learned

▪ Summarize data for optimal query performance (hourly or daily rollups)

▪ Think carefully about data model ahead of time. Choose the right sort keys.

▪ Invest in a good tool for ETL (we use Airflow)

▪ Invest in a good tool for query building and sharing (we use Looker)

▪ Reserve plenty of spare capacity (at least 40% free)

▪ Reserved nodes are much cheaper

▪ DC nodes are faster, but much smaller capacity
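To make the first lesson concrete, here is a minimal sketch of an hourly and daily rollup. The event tuples, the event names, and the `rollup` helper are hypothetical; in practice the summarization would more likely be done in SQL inside the warehouse as part of the ETL.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events: (device_id, event_name, ISO timestamp).
raw_events = [
    (1, "open_app", "2015-01-01T08:15:00"),
    (1, "open_app", "2015-01-01T08:45:00"),
    (2, "open_app", "2015-01-01T09:05:00"),
    (1, "send_mail", "2015-01-02T10:00:00"),
]

def rollup(events, fmt):
    """Count events per (time bucket, event name); fmt controls the bucket size."""
    counts = Counter()
    for _device_id, name, ts in events:
        bucket = datetime.fromisoformat(ts).strftime(fmt)
        counts[(bucket, name)] += 1
    return dict(counts)

hourly = rollup(raw_events, "%Y-%m-%d %H:00")
daily = rollup(raw_events, "%Y-%m-%d")
print(daily)
```

Queries that only need daily counts can then read the small summary instead of scanning every raw event.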

Page 31: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Scott Hoover, Data Scientist

Redshift and Looker

Page 32: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Introduction

• We use Redshift to power our own implementation of Looker, which serves every department with business intelligence and data for analytics.

• I have worked at Looker for just over two years, doing everything from Sales Engineering to Professional Services to Data Engineering. I currently head up our internal analytics efforts.

Page 33: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Agenda

• How Looker uses Redshift to supply business intelligence and drive analytics internally.

• How a few Looker customers use Redshift for reporting and analytics.

Page 34: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Looker and Redshift

At Looker, two major use cases drove our decision to go with Redshift:

• fast analysis of usage data (300+ million events);

• centralizing multiple data sources into a single warehouse.

Page 35: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

What We Care About Most

• Customer Health:
  - MoM/WoW percent change in usage
  - Users added/removed
  - User engagement (developer, explorer, consumer, occasional consumer)
  - LookML contributions and contributors

• Product Usage:
  - Features used/not used
  - Release pain points
  - GitHub issue/feature tracking

• Reporting for Sales and Marketing:
  - Usage in trial
  - Performance to quota (sales, meetings, leads, etc.)
  - Lead/prospect fit
  - Campaign attribution
  - SaaS metrics: MRR, cMRR, Churn

Page 36: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Redshift Data Pipeline

[Diagram: Pinger, License, Real-Time RDS]

Page 37: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Data Model: Event Data & Everything Else

Page 38: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Event Schema

[
  {
    "event_id": "1",
    "event_type": "view_connection",
    "created_at": "2015-07-08 20:04:08 +0000",
    "attrs": {
      "country": "US",
      "state": "CA",
      "browser": "Safari/537.36",
      "uri": "%2Fadmin%2Fconnections"
    }
  },
  {
    "event_id": "2",
    "event_type": "save_look",
    "created_at": "2015-07-08 20:04:12 +0000",
    "attrs": {
      "country": "US",
      "state": "CA",
      "browser": "Safari/537.36",
      "look_id": "32"
    }
  }
]

Page 39: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Event Schema

id  type             created_at                 country  state  uri                     browser        error  …  k
1   view_connection  2015-07-08 20:04:08 +0000  US       CA     %2Fadmin%2Fconnections  Safari/537.36  ø      …  k1
2   save_look        2015-07-08 20:04:12 +0000  US       CA     ø                       Safari/537.36  ø      …  k2
…   …                …                          …        …      …                       …              …      …  …
N   run_query        2015-07-08 22:01:16 +0000  UK       ø      %2Ffields=              Chrome         ø      …  kN
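The move from nested JSON events to one wide table can be sketched as follows. This is an assumption about the general approach, not Looker's actual loader: the `flatten` helper is hypothetical, and it treats the ø cells in the table as missing values (None here, NULL in the warehouse).

```python
import json

# The two sample events from the Event Schema slide, written as a JSON array.
payload = """
[
  {"event_id": "1", "event_type": "view_connection",
   "created_at": "2015-07-08 20:04:08 +0000",
   "attrs": {"country": "US", "state": "CA",
             "browser": "Safari/537.36", "uri": "%2Fadmin%2Fconnections"}},
  {"event_id": "2", "event_type": "save_look",
   "created_at": "2015-07-08 20:04:12 +0000",
   "attrs": {"country": "US", "state": "CA",
             "browser": "Safari/537.36", "look_id": "32"}}
]
"""

def flatten(events):
    """Promote every attrs key to a column; keys an event lacks become None."""
    columns = sorted({key for e in events for key in e["attrs"]})
    rows = []
    for e in events:
        row = {"id": e["event_id"], "type": e["event_type"],
               "created_at": e["created_at"]}
        for col in columns:
            row[col] = e["attrs"].get(col)
        rows.append(row)
    return rows

rows = flatten(json.loads(payload))
for row in rows:
    print(row)
```

Every event ends up with the same set of columns, which is what makes the table wide and sparse.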

Page 40: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Event Schema

- explore: events
  extends: license_base
  label: 'Pinger'
  always_filter:
    events.created_date: '30 days'
  joins:
    - join: license
      sql_on: ${events.license_slug} = ${license.new_slug}
      relationship: many_to_one
    - join: license_users
      sql_on: ${events.user_id} = ${license_users.id}
      relationship: many_to_many
    - join: client
      sql_on: ${client.id} = ${events.client_id}
      relationship: many_to_one
    - join: account
      sql_on: ${client.salesforce_account_id} = ${account.id}
      relationship: many_to_one
    - join: opportunity
      sql_on: ${account.id} = ${opportunity.account_id}
      relationship: many_to_one
    [...]
    - join: sessions
      sql_on: ${sessions.event_id} = ${events.id}
      relationship: many_to_one

Page 41: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Everything Else

company_id account_id opportunity_id trial_id license_id lead_id campaign_id campaign_member_at … k

1 E000000zD0IFIA0E000000Oi9mxIA B0000014uTRG MA21423 00QE000000NqLsvMAF 701E00000006MC7IAM 2013-09-23 23:03:05 +0000 … k1

1 E000000zD0IFIA0E000000Oi9mxIA B0000014uTRG MA21423 00QE000000e0ZsYMAU 701E00000006OAaIAM 2014-02-20 22:39:25 +0000 … k2

1 E000000zD0IFIA0E000000Oi9mxIA B0000014uTRG MA21423 00QE000000e0ZsYMAU 701E00000008XEbIAM 2015-02-18 00:06:09 +0000 … k3

2 E000000zrbTgIAIE000000VuLHhIA Na06E000000a NOcVIAW1601 00QE000000XJVJiMAP 701E00000006OB9IAM 2015-04-01 22:04:05 +0000 … k4

…

N … kN

Page 42: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Everything Else

- explore: company
  joins:
    - join: account
      sql_on: ${company.account_id} = ${account.id}
      relationship: many_to_one
    - join: opportunity
      sql_on: ${company.opportunity_id} = ${opportunity.id}
      relationship: many_to_one
    - join: lead
      sql_on: ${company.lead_id} = ${lead.id}
      relationship: many_to_one
    - join: contact
      sql_on: ${contact.id} = ${company.contact_id}
      relationship: many_to_one
      fields: [export_set*]
    - join: campaign
      sql_on: ${company.campaign_id} = ${campaign.id}
      relationship: many_to_one
    - join: trial
      sql_on: ${company.trial_id} = ${trial.id}
      relationship: many_to_one
    - join: account_representative
      from: user
      sql_on: ${opportunity.owner_id} = ${account_representative.id}
      fields: [name, count]
      relationship: many_to_one
    - join: license
      sql_on: ${company.account_id} = ${license.salesforce_account_id}
      relationship: one_to_one

Page 43: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Explore and Visualize

Page 44: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Analyze - Lead Scoring

• Construct historical data set or “Look.”

• GET “Look” using Looker API.

• Train/test model in R.

• Output PMML file.

• EC2 hosts Openscoring REST service + PMML.

• Hit Salesforce API for new leads; score leads; update each lead record.

• View prioritized lists in Looker.

[Diagram labels: API 3.0, API, GET look, GET lead, UPDATE lead]

Page 45: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Why Our Customers Use Redshift

• Scale/Performance
  - Transactional databases are not ideal for analytics (slow).
  - Redshift scales quickly and is incredibly fast.

• Accessibility
  - SQL is in many analysts’ wheelhouse and is easy to adopt.
  - Obvious choice for those in the AWS ecosystem or who prefer managed offerings.

• Centralization of data
  - When it comes time to tie top-of-funnel actions to bottom-of-funnel behavior.

Page 46: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

How Our Customers Use Redshift

• Backstage/Sonicbids: They built an artist search tool that uses social data from Facebook, Twitter, YouTube, and Soundcloud to inform booking agents on what sort of draw they could expect from a certain artist. They used Snowplow, Redshift, the Looker API, and Elasticsearch to build this system.

Page 47: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

How Our Customers Use Redshift

• Smartling: Sources website translation snippets from translators the world over. They maintain a database of translated snippets, like “the car is red” in Turkish, in order to validate incoming translations. So, when a request for “the car is blue” in Turkish comes in, they can assess the syntactic validity of the translation.

Page 48: AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Learn more at www.looker.com