AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
TRANSCRIPT
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Greg Khairallah, Business Development Manager, AWS
Adam Savitzky, Software Development Engineer, Yahoo!
Scott Hoover, Data Scientist, Looker
July 23, 2015
Best Practices: Amazon Redshift Reporting and Advanced Analytics
Amazon Redshift – Resources
Getting Started – June Webinar Series: https://www.youtube.com/watch?v=biqBjWqJi-Q
Best Practices – July Webinar Series:
Optimizing Performance – July 21, 2015
Migration and Data Loading – July 22, 2015
Reporting and Advanced Analytics – July 23, 2015
Agenda
• Connecting to Amazon Redshift
• Case Study – Redshift analytics at Yahoo
• Case Study – Redshift Optimizations at Looker
• Questions and Answers
Amazon Redshift
• Petabyte scale; massively parallel
• Relational data warehouse
• Fully managed; zero admin
• SSD & HDD platforms
• As low as $1,000/TB/Year
Common Customer Use Cases
Traditional Enterprise DW
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business
Companies with Big Data
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools
SaaS Companies
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude
Custom ODBC and JDBC Drivers
Up to 35% higher performance than open source drivers
Supported by most Business Intelligence tools
Will continue to support PostgreSQL open source drivers
Download drivers from console
Amazon Redshift Partners
Redshift for Analytics at Yahoo
Adam Savitzky
Tech Yahoo, Software Development Engineer
Introduction
Who am I?
• Yahoo growth team
• Supporting analytics for 6 products in Yahoo’s mobile portfolio
In the past:
Introduction
What do we do?
▪ Real-time ad-hoc analytics
▪ Mobile properties
▪ What do we care about?
  › Engagement and Activity
  › User demographics
  › Experimentation
  › Funnel analysis
  › Modeling revenue and user Lifetime Value
  › Cohort analysis and retention
High Level Architecture
[Diagram components: Mobile App, Hadoop, ETL, S3, Redshift]
Scale
▪ On an average day
  › 1 billion events
  › 25 million devices
  › 2 billion parameter key/value pairs
▪ Planned Capacity
  › 21 dc1.8xlarge nodes
  › 80 billion events
  › 100 million devices
  › 50 TB (compressed!)
Data Model
Performance Optimizations
▪ Heavy use of summarization where appropriate
▪ Sort keys and partitioning
▪ Data encoding
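As a rough illustration of the summarization idea, the sketch below rolls raw events up into a daily summary keyed by (date, product, event). The field names and sample rows are hypothetical, not Yahoo's actual schema; in Redshift the same rollup would typically be a GROUP BY query feeding a summary table.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events; the real raw-table columns are not shown in the talk.
raw_events = [
    {"ts": "2015-01-01 09:15:00", "product": "mail", "event": "open"},
    {"ts": "2015-01-01 17:40:00", "product": "mail", "event": "open"},
    {"ts": "2015-01-01 18:05:00", "product": "flickr", "event": "upload"},
    {"ts": "2015-01-02 08:30:00", "product": "mail", "event": "open"},
]

def daily_rollup(events):
    """Count events per (date, product, event) so reports scan far fewer rows."""
    counts = Counter()
    for e in events:
        day = datetime.strptime(e["ts"], "%Y-%m-%d %H:%M:%S").date().isoformat()
        counts[(day, e["product"], e["event"])] += 1
    return dict(counts)

event_daily = daily_rollup(raw_events)
for key in sorted(event_daily):
    print(key, event_daily[key])
```

Queries that only need daily totals can then read the small summary instead of the raw events.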
Event Schema
[Diagram: tables grouped as Raw Tables, Summary Tables, and Derived Tables.
 Per-product event_raw tables (mail, flickr, homerun, stark, arrow) feed an event_raw union view.
 Per-product param tables (mail, flickr, homerun, stark, arrow) feed a param union view.
 Other tables shown: event hourly, event daily, install, install attribution, user retention, funnel, first_event date, is_active, param keys, telemetry daily, revenue daily.]
Case Study
User Retention Analysis
Definitions
▪ Cohort - A group of product users that share one or more attributes
  › Example: All users who installed on Monday with Android devices
▪ Retention - How many members of a cohort continue to use the product over time
  › Example: 100 users installed on Monday with Android devices. 7 days later, 50 of those users returned to the product. We would say the 7-day retention for this cohort is 50%.
Why Study User Retention?
▪ Quantifies how “sticky” your product is
▪ Allows us to measure Customer Lifetime Value (CLV or LTV)
[Figure: % Retained curves for Asymptotic Retention and No Retention]
[Figure: Total Users over Time for Asymptotic Retention and No Retention]
Calculating User Retention
Definition: For each possible combination of cohort dimensions, for every possible event date, how many devices belong to that cohort, and how many devices from that cohort were active on that day
Example with one dimension, os_name:
event_date  product  install_date  os_name  active_users  cohort_size
monday      mail     monday        android  100           100
tuesday     mail     monday        android  83            100
monday      mail     monday        ios      75            75
tuesday     mail     monday        ios      62            75
Calculating User Retention
Example with one dimension, os_name: What’s my 1 day retention for users who installed on Monday?
event_date  product  install_date  os_name  active_users  cohort_size
tuesday     mail     monday        android  83            100
tuesday     mail     monday        ios      62            75
Total                                       145           175
Aggregate retention across both ios and android is (83 + 62) / (100 + 75) = 145 / 175 ≈ 83%
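The aggregate calculation can be checked with a short sketch. The two rows are the Tuesday rows from the slide; the function name and dictionary layout are illustrative only. The point it shows is that aggregate retention sums active users and cohort sizes before dividing, instead of averaging the per-OS percentages.

```python
# Tuesday rows for the Monday install cohort, as shown on the slide.
rows = [
    {"os_name": "android", "active_users": 83, "cohort_size": 100},
    {"os_name": "ios", "active_users": 62, "cohort_size": 75},
]

def retention(rows):
    """Sum actives and cohort sizes first, then divide."""
    active = sum(r["active_users"] for r in rows)
    cohort = sum(r["cohort_size"] for r in rows)
    return active / cohort

per_os = {r["os_name"]: retention([r]) for r in rows}
overall = retention(rows)  # 145 / 175

print(per_os)
print(f"{overall:.1%}")
```

Storing counts instead of precomputed percentages is what lets the same table answer the question at any level of aggregation.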
Calculating User Retention
Steps:
1. For each day, determine whether each device was active or not
device_id  date        is_active
1          2015-01-01  1
1          2015-01-02  0
2          2015-01-01  1
2          2015-01-02  1
Calculating User Retention
Steps:
1. For each day, determine whether each device was active or not
2. Join device attributes to results of Step 1
device_id  date        is_active  os   install_date
1          2015-01-01  1          ios  2015-01-01
1          2015-01-02  0          ios  2015-01-01
2          2015-01-01  1          ios  2015-01-01
2          2015-01-02  1          ios  2015-01-01
Calculating User Retention
Steps:
1. For each day, determine whether each device was active or not
2. Join device attributes to results of Step 1
3. SUM is_active column, grouping by date, os, and install_date (and any other cohort dimensions)
date        active_user_count  os   install_date
2015-01-01  2                  ios  2015-01-01
2015-01-02  1                  ios  2015-01-01
Calculating User Retention
Steps:
1. For each day, determine whether each device was active or not
2. Join device attributes to results of Step 1
3. SUM is_active column, grouping by date, os, and install_date (and any other cohort dimensions)
4. Join the size of each cohort to the result of Step 3
date        active_user_count  os   install_date  cohort_size
2015-01-01  2                  ios  2015-01-01    2
2015-01-02  1                  ios  2015-01-01    2
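The four steps can be sketched in plain Python as below. This is an illustration of the logic only, not the SQL used at Yahoo. The sample data follows the slides, assuming device 2's second activity row is for 2015-01-02 (which is what the Step 3 totals imply). The function and variable names are made up for the example.

```python
from collections import defaultdict

# Step 1: one row per device per day, flagged active (1) or not (0).
activity = [
    (1, "2015-01-01", 1),
    (1, "2015-01-02", 0),
    (2, "2015-01-01", 1),
    (2, "2015-01-02", 1),  # assumed date; see the note above
]

# Device attributes used as cohort dimensions.
devices = {
    1: {"os": "ios", "install_date": "2015-01-01"},
    2: {"os": "ios", "install_date": "2015-01-01"},
}

def retention_table(activity, devices):
    # Steps 2 and 3: join attributes onto each activity row, then sum
    # is_active grouped by (date, os, install_date).
    active = defaultdict(int)
    for device_id, date, is_active in activity:
        d = devices[device_id]
        active[(date, d["os"], d["install_date"])] += is_active

    # Cohort size: number of devices per (os, install_date).
    cohort_size = defaultdict(int)
    for d in devices.values():
        cohort_size[(d["os"], d["install_date"])] += 1

    # Step 4: join the cohort size onto the aggregated rows.
    return [
        {"date": date, "active_user_count": n, "os": os_name,
         "install_date": install,
         "cohort_size": cohort_size[(os_name, install)]}
        for (date, os_name, install), n in sorted(active.items())
    ]

for row in retention_table(activity, devices):
    print(row)
```

Dividing active_user_count by cohort_size at query time then gives retention for any combination of cohort dimensions.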
Demo using Looker
Lessons Learned
▪ Summarize data for optimal query performance (hourly or daily rollups)
▪ Think carefully about data model ahead of time. Choose the right sort keys.
▪ Invest in a good tool for ETL (we use Airflow)
▪ Invest in a good tool for query building and sharing (we use Looker)
▪ Reserve plenty of spare capacity (at least 40% free)
▪ Reserved nodes are much cheaper
▪ DC nodes are faster, but much smaller capacity
Scott Hoover, Data Scientist
Redshift and Looker
Introduction
• We use Redshift to power our own implementation of Looker, which serves every department with business intelligence and data for analytics.
• I have worked at Looker for just over two years, doing everything from Sales Engineering to Professional Services to Data Engineering. I currently head up our internal analytics efforts.
Agenda
• How Looker uses Redshift to supply business intelligence and drive analytics internally.
• How a few Looker customers use Redshift for reporting and analytics.
Looker and Redshift
At Looker, two major use cases drove our decision to go with Redshift:
• fast analysis of usage data (300+ million events);
• centralization of multiple data sources into a single warehouse.
What We Care About Most
• Customer Health:
  - MoM/WoW percent change in usage
  - Users added/removed
  - User engagement (developer, explorer, consumer, occasional consumer)
  - LookML contributions and contributors
• Product Usage:
  - Features used/not used
  - Release pain points
  - Github issue/feature tracking
• Reporting for Sales and Marketing:
  - Usage in trial
  - Performance to quota (sales, meetings, leads, etc.)
  - Lead/prospect fit
  - Campaign attribution
  - SaaS metrics: MRR, cMRR, Churn
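To make the MoM/WoW percent-change metric listed under Customer Health concrete, here is a minimal sketch. The weekly usage numbers are invented for illustration, and how Looker actually computes the metric is not described in the talk.

```python
def percent_change(previous, current):
    """Percent change from previous to current; None when previous is 0."""
    if previous == 0:
        return None  # undefined; the caller decides how to display it
    return (current - previous) / previous * 100.0

# Hypothetical weekly event counts for one customer, oldest first.
weekly_usage = [1200, 1500, 1350]

# Pair each week with the one after it to get week-over-week change.
wow = [percent_change(a, b) for a, b in zip(weekly_usage, weekly_usage[1:])]
print(wow)
```

The same function gives month-over-month change when fed monthly totals.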
Redshift Data Pipeline
[Diagram labels: Pinger, License, Real-Time RDS]
Data Model
Event Data & Everything Else
Event Schema
{
  "event_id": "1",
  "event_type": "view_connection",
  "created_at": "2015-07-08 20:04:08 +0000",
  "attrs": {
    "country": "US",
    "state": "CA",
    "browser": "Safari/537.36",
    "uri": "%2Fadmin%2Fconnections"
  }
},
{
  "event_id": "2",
  "event_type": "save_look",
  "created_at": "2015-07-08 20:04:12 +0000",
  "attrs": {
    "country": "US",
    "state": "CA",
    "browser": "Safari/537.36",
    "look_id": "32"
  }
}
Event Schema
id  type             created_at                 country  state  uri                     browser        error  …  k
1   view_connection  2015-07-08 20:04:08 +0000  US       CA     %2Fadmin%2Fconnections  Safari/537.36  ø      …  k1
2   save_look        2015-07-08 20:04:12 +0000  US       CA     ø                       Safari/537.36  ø      …  k2
⋮
N   run_query        2015-07-08 22:01:16 +0000  UK       ø      %2Ffields=              Chrome         ø      …  kN
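One way to picture the move from nested event JSON to the wide table is the sketch below. It is an assumption about the general technique, not Looker's actual pipeline: each key under "attrs" becomes its own column, and keys an event lacks are left empty. The two sample events are the ones shown earlier, wrapped in a JSON array so they parse as one document.

```python
import json

events_json = """
[
  {"event_id": "1", "event_type": "view_connection",
   "created_at": "2015-07-08 20:04:08 +0000",
   "attrs": {"country": "US", "state": "CA",
             "browser": "Safari/537.36", "uri": "%2Fadmin%2Fconnections"}},
  {"event_id": "2", "event_type": "save_look",
   "created_at": "2015-07-08 20:04:12 +0000",
   "attrs": {"country": "US", "state": "CA",
             "browser": "Safari/537.36", "look_id": "32"}}
]
"""

def flatten(events):
    """Hoist each key under "attrs" to a top-level column; missing keys become None."""
    attr_keys = sorted({k for e in events for k in e["attrs"]})
    rows = []
    for e in events:
        row = {"id": e["event_id"], "type": e["event_type"],
               "created_at": e["created_at"]}
        for k in attr_keys:
            row[k] = e["attrs"].get(k)  # None plays the role of the empty (ø) cells
        rows.append(row)
    return rows

rows = flatten(json.loads(events_json))
for r in rows:
    print(r)
```

The result is sparse: every event type adds columns that most other event types leave empty, which matches the ø cells in the table.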
Event Schema
- explore: events
  extends: license_base
  label: 'Pinger'
  always_filter:
    events.created_date: '30 days'
  joins:
    - join: license
      sql_on: ${events.license_slug} = ${license.new_slug}
      relationship: many_to_one
    - join: license_users
      sql_on: ${events.user_id} = ${license_users.id}
      relationship: many_to_many
    - join: client
      sql_on: ${client.id} = ${events.client_id}
      relationship: many_to_one
    - join: account
      sql_on: ${client.salesforce_account_id} = ${account.id}
      relationship: many_to_one
    - join: opportunity
      sql_on: ${account.id} = ${opportunity.account_id}
      relationship: many_to_one
    [...]
    - join: sessions
      sql_on: ${sessions.event_id} = ${events.id}
      relationship: many_to_one
Everything Else
company_id  account_id       opportunity_id   trial_id            license_id  lead_id             campaign_id         campaign_member_at         …  k
1           E000000zD0IFIA0  E000000Oi9mxIAB  0000014uTRGMA2      1423        00QE000000NqLsvMAF  701E00000006MC7IAM  2013-09-23 23:03:05 +0000  …  k1
1           E000000zD0IFIA0  E000000Oi9mxIAB  0000014uTRGMA2      1423        00QE000000e0ZsYMAU  701E00000006OAaIAM  2014-02-20 22:39:25 +0000  …  k2
1           E000000zD0IFIA0  E000000Oi9mxIAB  0000014uTRGMA2      1423        00QE000000e0ZsYMAU  701E00000008XEbIAM  2015-02-18 00:06:09 +0000  …  k3
2           E000000zrbTgIAI  E000000VuLHhIAN  a06E000000aNOcVIAW  1601        00QE000000XJVJiMAP  701E00000006OB9IAM  2015-04-01 22:04:05 +0000  …  k4
⋮
N           …                                                                                                                                       kN
Everything Else
- explore: company
  joins:
    - join: account
      sql_on: ${company.account_id} = ${account.id}
      relationship: many_to_one
    - join: opportunity
      sql_on: ${company.opportunity_id} = ${opportunity.id}
      relationship: many_to_one
    - join: lead
      sql_on: ${company.lead_id} = ${lead.id}
      relationship: many_to_one
    - join: contact
      sql_on: ${contact.id} = ${company.contact_id}
      relationship: many_to_one
      fields: [export_set*]
    - join: campaign
      sql_on: ${company.campaign_id} = ${campaign.id}
      relationship: many_to_one
    - join: trial
      sql_on: ${company.trial_id} = ${trial.id}
      relationship: many_to_one
    - join: account_representative
      from: user
      sql_on: ${opportunity.owner_id} = ${account_representative.id}
      fields: [name, count]
      relationship: many_to_one
    - join: license
      sql_on: ${company.account_id} = ${license.salesforce_account_id}
      relationship: one_to_one
Explore and Visualize
Analyze - Lead Scoring
• Construct historical data set or “Look.”
• GET “Look” using Looker API.
• Train/test model in R.
• Output PMML file.
• EC2 hosts Openscoring REST service + PMML.
• Hit Salesforce API for new leads; score leads; update each lead record.
• View prioritized lists in Looker.
[Diagram labels: API 3.0, API, GET look, GET lead, UPDATE lead]
Why Our Customers Use Redshift
• Scale/Performance
  - Transactional databases are not ideal for analytics (slow).
  - Redshift scales quickly and is incredibly fast.
• Accessibility
  - SQL is in many analysts’ wheelhouse and is easy to adopt.
  - Obvious choice for those in the AWS ecosystem or who prefer managed offerings.
• Centralization of data
  - When it comes time to tie top-of-funnel actions to bottom-of-funnel behavior.
How Our Customers Use Redshift
• Backstage/Sonicbids: They built an artist search tool that uses social data from Facebook, Twitter, YouTube, and Soundcloud to inform booking agents on what sort of draw they could expect from a certain artist. They used Snowplow, Redshift, the Looker API, and Elasticsearch to build this system.
How Our Customers Use Redshift
• Smartling: sources website translation snippets from translators the world over. They maintain a database of translated snippets, like “the car is red” in Turkish, in order to validate incoming translations. So, when a request for “the car is blue” in Turkish comes in, they can make an assessment on the syntactic validity of the translation.
Learn more at www.looker.com