Introduction to Snowplow - an open source event analytics platform
Big Data & Data Science – Israel

Uploaded by alexander-dean, 27 Jan 2015

DESCRIPTION

Snowplow is an open source event analytics platform built on Hadoop, Amazon Kinesis & Redshift. Snowplow is a web and event analytics platform with a difference: rather than telling our users how they should analyze their data, we deliver their event-level data in their own data warehouse, on their own Amazon Redshift or Postgres database, so they can analyze it any way they choose. Snowplow is used by data-savvy media, retail, and SaaS businesses to better understand their audiences and how they engage with their websites and applications.

Agenda:
1. Intro to Snowplow - why we built it, what needs it solves
2. Current Snowplow design and architecture
3. Agile event analytics with Snowplow & Looker
4. Evolution of Snowplow - from web analytics to a business' digital nervous system with Amazon Kinesis
5. Snowplow research & roadmap - event grammars, unified logs, feedback loops

Presenter Bio: Alex Dean is the co-founder and technical lead at Snowplow Analytics. At Snowplow, Alex is responsible for Snowplow's technical architecture, stewarding the open source community and evaluating new technologies such as Amazon Kinesis. Prior to Snowplow, Alex was a partner at technology consultancy Keplar, where the idea for Snowplow was conceived. Before Keplar, Alex was a Senior Engineering Manager at OpenX, the open source ad technology company. Alex lives in London, UK.

TRANSCRIPT

Page 1

Introduction to Snowplow - an open source event analytics platform
Big Data & Data Science – Israel

Page 2

Agenda today

1. Introduction to Snowplow

2. Current Snowplow design and architecture

3. Agile event analytics with Snowplow & Looker

4. Evolution of Snowplow

5. Questions

Many thanks to the organizers:

Page 3

Introduction to Snowplow

Page 4

Snowplow is an open-source web and event analytics platform, first version released in early 2012

• Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business in 2008

• After leaving OpenX, Alex and Yali set up Keplar, a niche digital product and analytics consultancy

• We released Snowplow as a skunkworks prototype at the start of 2012:

github.com/snowplow/snowplow

• We started working full time on Snowplow in summer 2013

Page 5

At Keplar, we grew frustrated by significant limitations in traditional web analytics programs

Data collection:
• Sample-based (e.g. Google Analytics)
• Limited set of events, e.g. page views, goals, transactions
• Limited set of ways of describing events (custom dim 1, custom dim 2…)

Data processing:
• Data is processed ‘once’: no validation, and no opportunity to reprocess e.g. following an update to business rules
• Data is aggregated prematurely: only particular combinations of metrics / dimensions can be pivoted together (Google Analytics)
• Only particular types of analysis are possible on different types of dimension (e.g. sProps, eVars, conversion goals in SiteCatalyst)

Data access:
• Data is either aggregated (e.g. Google Analytics), or available as a complete log file for a fee (e.g. Adobe SiteCatalyst)
• As a result, data is siloed: hard to join with other data sets

Page 6

And we saw the potential of new “big data” technologies and services to solve these problems in a scalable, low-cost manner

These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis

(Built on: Amazon CloudFront, Amazon S3, Amazon EMR, Amazon Redshift)

Page 7

We wanted to take a fresh approach to web analytics

• Your own web event data, in your own data warehouse
• Your own event data model
• Slice / dice and mine the data in highly bespoke ways to answer your specific business questions
• Plug in the broadest possible set of analysis tools to drive value from your data

Data pipeline → data warehouse → analyse your data in any analysis tool

Page 8

Early on, we made a crucial decision: Snowplow should be composed of a set of loosely coupled subsystems

1. Trackers: generate event data from any environment. Launched with: JavaScript tracker
2. Collectors: log raw events from trackers. Launched with: CloudFront collector
3. Enrich: validate and enrich raw events. Launched with: HiveQL + Java UDF-based enrichment
4. Storage: store enriched events ready for analysis. Launched with: Amazon S3
5. Analytics: analyze enriched events. Launched with: HiveQL recipes

A, B, C, D = standardised data protocols between the subsystems (a sketch of one such protocol request follows below)

These turned out to be critical to allowing us to evolve the above stack
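To make "standardised data protocols" concrete: in the original pipeline, protocol A is simply a GET request for the collector's pixel, with the event encoded in the querystring. A hedged sketch follows; the field names here are illustrative assumptions, not a verbatim copy of the Snowplow tracker protocol:

GET http://collector.acme.com/i?e=pv&p=web&tv=js-0.8.0&aid=acme-site&url=http%3A%2F%2Facme.com%2Fabout&uid=user123

(e = event type, here a page view; p = platform; tv = tracker version; aid = application ID; url = the page being tracked.)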

Page 9

Our initial skunkworks version of Snowplow – it was basic but it worked, and we started getting traction

Snowplow data pipeline v1 (spring 2012):

Website / webapp (JavaScript event tracker) → CloudFront-based pixel collector → HiveQL + Java UDF “ETL” → Amazon S3

Page 10

What did people start using it for?

Warehousing their web event data, and agile aka ad hoc analytics, to enable:

• Marketing attribution modelling
• Customer lifetime value calculations
• Customer churn detection
• RTB fraud
• Product recommendations

Page 11

Current Snowplow design and architecture

Page 12

Our protocol-first, loosely-coupled approach made it possible to start swapping out existing components…

Snowplow data pipeline v2 (spring 2013):

Website / webapp (JavaScript event tracker) → CloudFront-based event collector or Clojure-based event collector → Scalding-based enrichment (replacing the HiveQL + Java UDF “ETL”) → Amazon S3 → Amazon Redshift / PostgreSQL

Page 13

Our protocol-first, loosely-coupled approach made it possible to start swapping out existing components…

(The same pipeline v2 diagram as above, annotated with the reasons for each new component:)

• Clojure-based event collector: allows Snowplow users to set a third-party cookie with a user ID; important for ad networks, widget companies, multi-domain retailers

• Amazon Redshift / PostgreSQL: because Snowplow users wanted a much faster query loop than HiveQL/MapReduce

• Scalding-based enrichment: we wanted a robust, feature-rich framework for managing validations, enrichments etc

Page 14

What is Scalding?

• Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop:

(Stack, bottom to top: Hadoop DFS → Hadoop MapReduce → Cascading, alongside Hive and Pig, all on Java → DSLs/APIs on top of Cascading: Scalding, Cascalog, PyCascading, cascading.jruby)
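For a flavour of the API, here is the classic Scalding word-count job, a minimal sketch using the fields-based API (input and output paths are passed as job arguments):

import com.twitter.scalding._

// Read lines, split into words, count occurrences of each word
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}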

Page 15

We chose Cascading because we liked its “plumbing” abstraction over vanilla MapReduce

Page 16

Why did we choose Scalding instead of one of the other Cascading DSLs/APIs?

• Lots of internal experience with Scala – could hit the ground running (only very basic awareness of Clojure when we started the project)

• Scalding created and supported by Twitter, who use it throughout their organization – so we knew it was a safe long-term bet

• We believe that data pipelines should be as strongly typed as possible – all the other DSLs/APIs on top of Cascading encourage dynamic typing. Strong typing lets you:

• Define the inputs and outputs of each of your data processing steps in an unambiguous way

• Catch errors as soon as possible – and report them in a strongly typed way too
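As an illustration of what strong typing buys you, here is a minimal sketch using Scalding's typed API (the PageView case class and paths are hypothetical): every step's input and output shape is checked by the Scala compiler, so a schema mismatch fails at compile time rather than hours into a Hadoop job.

import com.twitter.scalding._

// Hypothetical event type: the pipeline's shape is declared up front
case class PageView(userId: String, url: String, tstamp: Long)

class PageViewCountsJob(args: Args) extends Job(args) {
  TypedPipe.from(TypedTsv[(String, String, Long)](args("input")))
    .map { case (userId, url, tstamp) => PageView(userId, url, tstamp) }
    .groupBy(_.userId)  // typed as Grouped[String, PageView]
    .size               // page views per user: (String, Long)
    .toTypedPipe
    .write(TypedTsv[(String, Long)](args("output")))
}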

Page 17

Our “enrichment process” (formerly known as ETL) actually does two things: validation and enrichment

• Our validation model looks like this:

• Under the covers, we use a lot of monadic Scala (Scalaz) code

(Validation model: raw events → Enrichment Manager → “good” enriched events, plus “bad” raw events with the reasons why they are bad)
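A minimal sketch of that model with Scalaz's ValidationNel (assuming Scalaz 7.1; the event types and checks are hypothetical simplifications, not Snowplow's actual code). The applicative |@| combination accumulates every failure reason, rather than stopping at the first:

import scalaz._
import Scalaz._

case class RawEvent(ip: String, useragent: String)       // hypothetical
case class EnrichedEvent(ip: String, useragent: String)  // hypothetical

// Each check succeeds with a value or fails with a reason
def validateIp(e: RawEvent): ValidationNel[String, String] =
  if (e.ip.nonEmpty) e.ip.successNel else "missing IP address".failureNel

def validateUa(e: RawEvent): ValidationNel[String, String] =
  if (e.useragent.nonEmpty) e.useragent.successNel else "missing useragent".failureNel

// "Good" events come out enriched; "bad" events carry all the reasons they failed
def enrich(e: RawEvent): ValidationNel[String, EnrichedEvent] =
  (validateIp(e) |@| validateUa(e))(EnrichedEvent.apply)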

Page 18

Adding the enrichments that web analysts expect = very important to Snowplow uptake

• Web analysts are used to a very specific set of enrichments from Google Analytics, SiteCatalyst etc

• These enrichments have evolved over the past 15-20 years and are very domain specific:

• Page querystring -> marketing campaign information (utm_* fields; a sketch of this one follows after this list)

• Referer data -> search engine name, country, keywords

• IP address -> geographical location

• Useragent -> browser, OS, computer information
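A minimal sketch of the first of these in plain Scala (simplified: real campaign attribution handles many more edge cases):

import java.net.{URI, URLDecoder}

// Pull the utm_* campaign fields out of a page URL's querystring
def campaignFields(pageUrl: String): Map[String, String] = {
  val query = Option(new URI(pageUrl).getRawQuery).getOrElse("")
  query.split("&").filter(_.nonEmpty)
    .map(_.split("=", 2))
    .collect { case Array(k, v) if k.startsWith("utm_") =>
      k -> URLDecoder.decode(v, "UTF-8")
    }
    .toMap
}

// campaignFields("http://acme.com/?utm_source=google&utm_medium=cpc")
// -> Map(utm_source -> google, utm_medium -> cpc)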

Page 19

We aim to make our validation and enrichment process as modular as possible

(Diagram: pluggable enrichments around the Enrichment Manager; some not yet integrated)

• This encourages testability and re-use; it also widens the pool of contributors, compared with embedding this functionality directly in Snowplow

• The Enrichment Manager uses external libraries (hosted in a Snowplow repository) which can be used in non-Snowplow projects

Page 20

Agile event analytics with Snowplow and Looker

Page 21

Just last week we announced our official partnership with Looker

• Looker is a BI visualization and data modelling startup with some cool features:

1. Slice and dice any combination of dimensions and metrics

2. Quickly and easily define dimensions and metrics that are specific to your business using Looker's lightweight metadata model

3. Drill-up and drill-down to visitor-level and event-level data

4. Dashboards are a starting point for more involved analysis

5. Access your data from any application: Looker as a general purpose data server


Page 22

Demo – first let’s look at some enriched Snowplow events in Redshift

Page 23

Demo – now let’s see how that translates into Looker

Page 24

Evolution of Snowplow

Page 25

There are three big aspects to Snowplow’s roadmap

1. Make Snowplow work as well for non-web (e.g. mobile, IoT) environments as the web

2. Make Snowplow work as well with unstructured events as it does with structured events (aka page views, ecommerce transactions etc)

3. Move Snowplow away from an S3-based data pipeline to a unified log (Kinesis/Kafka)-based data pipeline

Page 26

Snowplow is developing into an event analytics platform (not just a web analytics platform)

(Diagram: collect event data from any connected device → data warehouse)

Page 27

So far we have open-sourced a few different trackers – with more planned

• JavaScript Tracker – the original
• No-JS aka pixel tracker
• Lua Tracker – for games
• Arduino Tracker – for the Internet of Things
• Python Tracker – releasing this week

Page 28

As we get further away from the web, we need to start supporting unstructured events

• By unstructured events, we mean events represented as JSONs with arbitrary name: value pairs (arbitrary to Snowplow, not to the company using Snowplow!)

_snaq.push(['trackUnstructEvent', 'Viewed Product', {
    product_id: 'ASO01043',
    category: 'Dresses',
    brand: 'ACME',
    returning: true,
    price: 49.95,
    sizes: ['xs', 's', 'l', 'xl', 'xxl'],
    available_since$dt: new Date(2013, 3, 7)
}]);

Page 29

Supporting structured and unstructured events is a difficult problem

• Almost all of our competitors fall on one or other side of the structured-unstructured divide:

(Diagram: competitors arranged along a spectrum, from structured events (page views etc) to unstructured events (JSONs))

Page 30

We want to bridge that divide, making it so that Snowplow comes with structured events “out of the box”, but is extensible with unstructured events

(Diagram: Snowplow spanning both structured events (page views etc) and unstructured events (JSONs))

Page 31

This is super-important to enable businesses to construct their own high-value bespoke analytics:

• What is the impact of different ad campaigns and creative on the way users behave, subsequently? What is the return on that ad spend?

• How do visitors use social channels (Facebook / Twitter) to interact around video content? How can we predict which content will “go viral”?

• How do updates to our product change the “stickiness” of our service? ARPU? Does that vary by customer segment?

Page 32

To achieve this, we are prototyping a new approach using JSON Schema, Thrift/Avro and a shredding library

• We are planning to replace the existing flow with a JSON Schema-driven approach:

Raw events in JSON format → Enrichment Manager → enriched events in Thrift or Avro format → Shredder → enriched events in TSV, ready for loading into the db

The JSON Schemas defining the events (1) define the structure of the raw JSON events, (2) drive validation in the Enrichment Manager, (3) define the structure of the enriched events, (4) drive the shredding, and (5) define the structure of the TSV output.
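To make this concrete, here is a hedged sketch of a JSON Schema for the “Viewed Product” event from Page 28 (my own illustration, not an actual Snowplow schema):

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "description": "Illustrative schema for a Viewed Product event",
  "type": "object",
  "properties": {
    "product_id": { "type": "string" },
    "category":   { "type": "string" },
    "brand":      { "type": "string" },
    "returning":  { "type": "boolean" },
    "price":      { "type": "number" },
    "sizes":      { "type": "array", "items": { "type": "string" } },
    "available_since": { "type": "string", "format": "date" }
  },
  "required": ["product_id", "price"],
  "additionalProperties": false
}

The same schema can then drive validation in the Enrichment Manager and tell the Shredder which columns the TSV output should contain.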

Page 33

JSON Schema just gives us a way of representing structure – we are also evolving a grammar to represent events

(Diagram: an Event decomposes into a Subject, a Verb, a Direct Object, an Indirect Object, a Prepositional Object and a Context)
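As a hedged illustration (my own example, not taken from the deck), the “Viewed Product” event from Page 28 might parse as: Subject = user 123, Verb = view, Direct Object = product ASO01043, Context = on acme.com, from a web browser, at 2013-04-07 12:00.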

Page 34

In parallel, we plan to evolve Snowplow from an event analytics platform into a “digital nervous system” for data driven companies

• The event data fed into Snowplow is written into a “Unified Log”

• This becomes the “single source of truth”, upstream from the data warehouse

• The same source of truth is used for real-time data processing as well as analytics, e.g.:
  • Product recommendations
  • Ad targeting
  • Real-time website personalisation
  • Systems monitoring

Snowplow will drive data-driven processes as well as off-line analytics

Page 35

(Diagram: a unified log architecture. Narrow data silos – search, e-comm, CRM, email marketing, ERP, CMS – are spread across SaaS vendors and your own cloud / data center, with some low-latency local loops. They feed a unified log via streaming APIs / web hooks. The log's eventstream is archived to Hadoop, giving wide data coverage and full data history on the high-latency side, while low-latency consumers include systems monitoring, product rec's, ad hoc analytics, management reporting, fraud detection, churn prevention and APIs.)

Some background on unified log based architectures

Page 36

We are part way through our Kinesis support, with additional components being released soon

Snowplow Trackers → Scala Stream Collector → raw event stream → Enrich Kinesis app → enriched event stream (plus a bad raw events stream) → S3 sink Kinesis app → S3, and Redshift sink Kinesis app → Redshift

• The parts in grey are still under development – we are working with Snowplow community members on these collaboratively
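For a sense of what sits behind these boxes, here is a minimal sketch of reading the raw event stream with the AWS Java SDK's Kinesis client (stream and shard names are hypothetical; a real app would use the Kinesis Client Library with checkpointing and continuous iteration, none of which is shown):

import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.{GetRecordsRequest, GetShardIteratorRequest}
import scala.collection.JavaConverters._

object RawStreamTap extends App {
  val kinesis = new AmazonKinesisClient()  // credentials via the default provider chain

  // Start reading the (hypothetical) raw event stream from its oldest record
  val iterator = kinesis.getShardIterator(
    new GetShardIteratorRequest()
      .withStreamName("snowplow-raw-events")
      .withShardId("shardId-000000000000")
      .withShardIteratorType("TRIM_HORIZON")
  ).getShardIterator

  // Fetch one batch of records; each payload is one raw Snowplow event
  val records = kinesis.getRecords(
    new GetRecordsRequest().withShardIterator(iterator).withLimit(100)
  ).getRecords.asScala

  records.foreach(r => println(new String(r.getData.array, "UTF-8")))
}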

Page 37

Questions?

http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata

To have a meeting, coffee or beer tomorrow (Monday): @alexcrdean or [email protected]

Page 38

Useful for answering questions…

(Repeat of the Snowplow data pipeline v2 diagram from Page 12: website / webapp with JavaScript event tracker → CloudFront-based or Clojure-based event collector → Scalding-based enrichment → Amazon S3 → Amazon Redshift / PostgreSQL)

Page 39

Useful for answering questions…

(Repeat of the five loosely coupled subsystems from Page 8: 1. Trackers, 2. Collectors, 3. Enrich, 4. Storage, 5. Analytics, connected by standardised data protocols A–D)