Snowplow: the evolving data pipeline


Snowplow: evolve your analytics stack with your business

Snowplow Meetup Tel Aviv, July 2016

Hello! I’m Yali

• Co-founder at Snowplow: open source event data pipeline
• Analytics Lead. Focus on business analytics

I work with our clients so they get more out of their data:

• Marketing / customer analytics: how do we engage users better?
• Product analytics: how do we improve our user-facing products?
• Content / merchandise analytics:
  • How do we write/produce/buy better content?
  • How do we optimise the use of our existing content?

Self-describing data + event data modeling = an event data pipeline that evolves with your business

Self-describing data: overview

Event data varies widely by company. As a Snowplow user, you can define your own events and entities.

Events:
• Build castle, Form alliance, Declare war (gaming)
• View product, Buy product, Deliver product (e-commerce)

Entities (contexts):
• Player, Game, Level, Currency (gaming)
• Product, Customer, Basket, Delivery van (e-commerce)

You then define a schema for each event and entity:

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for a fighter context",
  "self": {
    "vendor": "com.ufc",
    "name": "fighter_context",
    "format": "jsonschema",
    "version": "1-0-1"
  },
  "type": "object",
  "properties": {
    "FirstName": { "type": "string" },
    "LastName": { "type": "string" },
    "Nickname": { "type": "string" },
    "FacebookProfile": { "type": "string" },
    "TwitterName": { "type": "string" },
    "GooglePlusProfile": { "type": "string" },
    "HeightFormat": { "type": "string" },
    "HeightCm": { "type": ["integer", "null"] },
    "Weight": { "type": ["integer", "null"] },
    "WeightKg": { "type": ["integer", "null"] },
    "Record": { "type": "string", "pattern": "^[0-9]+-[0-9]+-[0-9]+$" },
    "Striking": { "type": ["number", "null"], "maxdecimal": 15 },
    "Takedowns": { "type": ["number", "null"], "maxdecimal": 15 },
    "Submissions": { "type": ["number", "null"], "maxdecimal": 15 },
    "LastFightUrl": { "type": "string" },
    "LastFightEventText": { "type": "string" },
    "NextFightUrl": { "type": "string" },
    "NextFightEventText": { "type": "string" },
    "LastFightDate": { "type": "string", "format": "timestamp" }
  },
  "additionalProperties": false
}

Upload the schema to Iglu.
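In an Iglu schema registry, the schema is stored at a path derived from its self-description (vendor / name / format / version); for the schema above that would be something like:

schemas/com.ufc/fighter_context/jsonschema/1-0-1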

Then send data into Snowplow as self-describing JSONs:

{
  "schema": "iglu:com.israel365/temperature_measure/jsonschema/1-0-0",
  "data": {
    "timestamp": "2016-07-11 17:53:21",
    "location": "Tel-Aviv",
    "temperature": 32,
    "units": "Centigrade"
  }
}

{ "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",

"description": "Schema for an ad impression event",

"self": {"vendor": “com.israel365","name": “temperature_measure","format": "jsonschema","version": "1-0-0"

},"type": "object",

"properties": { "timestamp": { "type": "string" }, "location": { "type": "string" }, … },…}

The event (first JSON above) carries a schema reference, its "schema" field, which points to the schema (second JSON above) in Iglu.
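For illustration, here is a minimal sketch of sending this event from code, assuming the snowplow-python-tracker package; the collector host is a placeholder, and exact constructor arguments vary between tracker versions:

# Minimal sketch: send a self-describing event with the Snowplow Python tracker.
from snowplow_tracker import Emitter, Tracker, SelfDescribingJson

emitter = Emitter("collector.example.com")   # placeholder collector endpoint
tracker = Tracker(emitter)

# The event is just the schema reference plus the data it describes
event = SelfDescribingJson(
    "iglu:com.israel365/temperature_measure/jsonschema/1-0-0",
    {
        "timestamp": "2016-07-11 17:53:21",
        "location": "Tel-Aviv",
        "temperature": 32,
        "units": "Centigrade",
    },
)
tracker.track_self_describing_event(event)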

The schemas can then be used in a number of ways:

• Validate the data (important for data quality); a sketch of this follows below
• Load the data into tidy tables in your data warehouse
• Make it easy / safe to write downstream data processing applications (for real-time users)
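For the validation point, a minimal sketch using the Python jsonschema library, with the properties abbreviated from the temperature_measure schema above:

import jsonschema

# Abbreviated properties from the temperature_measure schema above
schema = {
    "type": "object",
    "properties": {
        "timestamp": {"type": "string"},
        "location": {"type": "string"},
        "temperature": {"type": ["integer", "null"]},
        "units": {"type": "string"},
    },
}

good = {"timestamp": "2016-07-11 17:53:21", "location": "Tel-Aviv",
        "temperature": 32, "units": "Centigrade"}
bad = {"timestamp": "2016-07-11 17:53:21", "temperature": "hot"}

jsonschema.validate(good, schema)      # passes silently
try:
    jsonschema.validate(bad, schema)   # wrong type for "temperature"
except jsonschema.ValidationError as err:
    # In a Snowplow pipeline, events failing validation are diverted to a
    # "bad rows" stream rather than loaded into the warehouse
    print("rejected:", err.message)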

Event data modeling: overview

What is event data modeling?

Event data modeling is the process of using business logic to aggregate over event-level data to produce 'modeled' data that is simpler for querying.

Unmodeled data: immutable, unopinionated, hard to consume, not contentious.
Modeled data: mutable, opinionated, easy to consume, may be contentious.

In general, event data modeling is performed on the complete event stream:

• Late-arriving events can change the way you understand earlier-arriving events
• If we change our data models, we have the flexibility to recompute historical data based on the new model
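As a concrete illustration of the modeling step, here is a minimal pandas sketch that aggregates hypothetical event-level rows into a per-user table; all table and column names are illustrative, not Snowplow's actual tables:

import pandas as pd

# Hypothetical event-level data: one row per tracked event
events = pd.DataFrame({
    "user_id": ["a", "a", "b", "a"],
    "event": ["view", "buy", "view", "view"],
    "timestamp": pd.to_datetime([
        "2016-07-11 10:00", "2016-07-11 10:05",
        "2016-07-11 11:00", "2016-07-12 09:00",
    ]),
})

# Business logic: aggregate event-level rows into one row per user,
# producing a 'modeled' table that is simpler to query
users = events.groupby("user_id").agg(
    first_seen=("timestamp", "min"),
    last_seen=("timestamp", "max"),
    n_events=("event", "size"),
    n_purchases=("event", lambda s: (s == "buy").sum()),
)
print(users)

Because this step reads the complete event stream, changing the logic (say, adding a sessionization rule) just means rerunning it over history to regenerate the modeled table.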

The evolving event data pipeline

How do we handle pipeline evolution?

PUSH FACTORS: what is being tracked will change over time.
PULL FACTORS: what questions are being asked of the data will change over time.

Businesses are not static, so event pipelines should not be either.

Push example: new source of event data

• If data is self-describing, it is easy to add an additional source
• Self-describing data is good for managing bad data and pipeline evolution

"I'm an email send event, and I have information about the recipient (email address, customer ID) and the email (id, tags, variation)."
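As a sketch, such an email send event could be expressed as a self-describing JSON like the ones above; the vendor, schema name, and field names here are hypothetical:

{
  "schema": "iglu:com.acme/email_send/jsonschema/1-0-0",
  "data": {
    "recipient_email": "user@example.com",
    "recipient_customer_id": "c-1234",
    "email_id": "welcome-01",
    "email_tags": ["onboarding", "welcome"],
    "email_variation": "B"
  }
}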

Pull example: new business question

Answering the question: three possibilities

1. Existing data model supports the answer
   • Possible to answer the question with existing modeled data
2. Need to update the data model
   • Data collected already supports the answer
   • Additional computation required in the data modeling step (additional logic)
3. Need to update the data model and data collection
   • Need to extend event tracking
   • Need to update data models to incorporate the additional data (and potentially additional logic)

Self-describing data and the ability to recompute data models are essential to enable pipeline evolution.

Self-describing data lets you:
• Update existing events and entities in a backwards compatible way, e.g. add optional new fields
• Update existing events and entities in a backwards incompatible way, e.g. change field types, remove fields, add compulsory fields
• Add new event and entity types

Recomputing data models on the entire data set lets you:
• Add new columns to existing derived tables, e.g. add a new audience segmentation
• Change the way existing derived tables are generated, e.g. change sessionization logic
• Create new derived tables
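Snowplow expresses these schema changes through SchemaVer version numbers (MODEL-REVISION-ADDITION). As a sketch of how the two kinds of change above map to version bumps:

1-0-0 → 1-0-1: backwards compatible change, e.g. a new optional field is added
1-0-1 → 2-0-0: backwards incompatible change, e.g. a field's type changes or a compulsory field is added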
