Adding Data Schemas to Snowplow Big Data Budapest Meetup 5 June 2014

Upload: yalisassoon

Post on 06-May-2015

TRANSCRIPT

Page 1: Big data meetup budapest   adding data schemas to snowplow

Adding Data Schemas to Snowplow

Big Data Budapest Meetup - 5 June 2014

Page 2: Big data meetup budapest   adding data schemas to snowplow

Agenda today

1. Introduction to Snowplow

2. Evolution of Snowplow

3. The answer: schema all the things!

4. Snowplow roadmap

5. Questions

Page 3: Big data meetup budapest   adding data schemas to snowplow

Introduction to Snowplow

Page 4: Big data meetup budapest   adding data schemas to snowplow

Snowplow is an open-source web and event analytics platform, first version released in early 2012

• Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business, in 2008

• After leaving OpenX, Alex and Yali set up Keplar, a niche digital product and analytics consultancy

• We released Snowplow as a skunkworks prototype at start of 2012:

    github.com/snowplow/snowplow

• We started working full time on Snowplow in summer 2013

Page 5: Big data meetup budapest   adding data schemas to snowplow

We wanted to take a fresh approach to web analytics

• Your own web event data -> in your own data warehouse

• Your own event data model

• Slice / dice and mine the data in highly bespoke ways to answer your specific business questions

• Plug in the broadest possible set of analysis tools to drive value from your data

[Diagram labels: Data pipeline; Data warehouse; Analyse your data in any analysis tool]

Page 6: Big data meetup budapest   adding data schemas to snowplow

By spring 2013 we had arrived at a relatively stable batch-based processing architecture

[Diagram: Snowplow Hadoop data pipeline]

• Website / webapp

• JavaScript event tracker

• CloudFront-based event collector or Clojure-based event collector

• Scalding-based enrichment on Hadoop

• Amazon S3

• Amazon Redshift / PostgreSQL

Page 7: Big data meetup budapest   adding data schemas to snowplow

Evolution of Snowplow

Page 8: Big data meetup budapest   adding data schemas to snowplow

Snowplow is evolving from a web analytics platform into a general event analytics platform

[Diagram labels: Collect event data from any connected device; Data warehouse]

Page 9: Big data meetup budapest   adding data schemas to snowplow

Web analysts work with a small number of event types – outside of web, the number of possible event types is… infinite

Web events

• Page view
• Order
• Add to basket
• Page activity

All events

• Game saved
• Machine broke
• Car started
• Spellcheck run
• Screenshot taken
• Fridge empty
• App crashed
• Disk full
• SMS sent
• Screen viewed
• Tweet drafted
• Player died
• Taxi arrived
• Phonecall ended
• Cluster started
• Till opened
• Product returned
• ∞

Page 10: Big data meetup budapest   adding data schemas to snowplow

There are two historic approaches to dealing with the explosion of possible event types

• Web analytics vendors: Custom Variables

• Mobile and app analytics vendors: Schema-less JSONs

Page 11: Big data meetup budapest   adding data schemas to snowplow

Custom variables are very restrictive

1. Take a standard web event, like a page view:

    Page View

2. and add custom variables until it becomes something totally different:

    Page View + vehicle=taxi23 + status=arrived = a “taxi arrived” event, kind of!

Page 12: Big data meetup budapest   adding data schemas to snowplow

Schema-less JSONs are better, but they have a different set of problems

Issues with the event name:

• Separate from the event properties
• Not versioned
• Not unique – HBO video played versus Brightcove video played

Lots of unanswered questions about the properties:

• Is length required, and is it always a number?
• Is id required, and is it always a string?
• What other optional properties are allowed for a video play?

Other issues:

• What if the developer accidentally starts sending “len” instead of “length”? The data will end up split across two separate fields
• Why does the analyst need to keep an implicit schema in their head to analyze video played events?
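The “len” versus “length” problem can be made concrete with a small sketch. The event shape below (an event name plus a free-form properties object) and all values are assumptions for illustration; the JSON shown on the original slide did not survive in this transcript.

```python
from collections import Counter

# Hypothetical schema-less "video played" events; the shape and values are
# illustrative assumptions, not the events shown in the talk.
events = [
    {"event": "video played", "properties": {"id": "abc123", "length": 213}},
    {"event": "video played", "properties": {"id": "def456", "length": 98}},
    # A later tracker release accidentally renames the property.
    {"event": "video played", "properties": {"id": "ghi789", "len": 305}},
]


def property_usage(events):
    """Count how often each property name appears across the events."""
    counts = Counter()
    for event in events:
        counts.update(event["properties"].keys())
    return counts


usage = property_usage(events)
print(dict(usage))
```

Nothing rejects the third event, so the video duration now lives in two fields and every downstream query has to know about both.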

Page 13: Big data meetup budapest   adding data schemas to snowplow

The answer: schema all the things!

Page 14: Big data meetup budapest   adding data schemas to snowplow

When a developer or analyst defines a new event in JSON, let’s ask them to create a JSON Schema for that event

[Annotations on the example schema]

• Yes, length should always be a number

• Additional optional field we might not know about otherwise

• No other fields allowed
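The schema on this slide is not preserved in the transcript, so here is a minimal sketch of what a JSON Schema for the video played event could look like, written as a Python dict. The optional "quality" field is an invented example, and the checker is a deliberately tiny stand-in that only understands the keywords used here; a real pipeline would use a full JSON Schema validator.

```python
# Hypothetical schema for a "video played" event. "id" and "length" come from
# the questions on the previous slide; "quality" is an invented optional field.
VIDEO_PLAYED_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "length": {"type": "number"},
        "quality": {"type": "string"},
    },
    "required": ["id", "length"],
    "additionalProperties": False,
}

# Only the two primitive types used above are supported in this sketch.
_TYPES = {"string": str, "number": (int, float)}


def validate(event, schema):
    """Return a list of problems found in the event (empty means valid)."""
    problems = []
    props = schema["properties"]
    for name in schema.get("required", []):
        if name not in event:
            problems.append(f"missing required field: {name}")
    for name, value in event.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                problems.append(f"unexpected field: {name}")
            continue
        expected = _TYPES[props[name]["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for field: {name}")
    return problems


print(validate({"id": "abc123", "len": 213}, VIDEO_PLAYED_SCHEMA))
```

With the schema in place, the accidental “len” is reported as both a missing required field and an unexpected field instead of silently creating a second column.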

Page 15: Big data meetup budapest   adding data schemas to snowplow

But we need to let our event definitions evolve, so let’s add versioning – we’re calling this SchemaVer

MODEL-REVISION-ADDITION

• Start versioning at 1-0-0 – so 1-0-0, 1-0-1, 1-0-2, 1-1-0 etc

• Try to stick to backwards-compatible ADDITION upgrades as much as possible
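A rough sketch of how SchemaVer strings could be parsed, ordered and bumped follows. The class and method names are hypothetical. The reset of ADDITION on a REVISION bump follows the 1-0-2 to 1-1-0 step in the sequence above; resetting both lower parts on a MODEL bump is an assumption by analogy.

```python
from typing import NamedTuple


class SchemaVer(NamedTuple):
    """MODEL-REVISION-ADDITION version, e.g. 1-0-2 (hypothetical helper)."""

    model: int
    revision: int
    addition: int

    @classmethod
    def parse(cls, text: str) -> "SchemaVer":
        parts = text.split("-")
        if len(parts) != 3 or not all(p.isdigit() for p in parts):
            raise ValueError(f"not a SchemaVer string: {text!r}")
        return cls(*(int(p) for p in parts))

    def __str__(self) -> str:
        return f"{self.model}-{self.revision}-{self.addition}"

    def bump_addition(self) -> "SchemaVer":
        # Backwards-compatible change: only the last part moves.
        return SchemaVer(self.model, self.revision, self.addition + 1)

    def bump_revision(self) -> "SchemaVer":
        # ADDITION resets to 0, matching 1-0-2 -> 1-1-0 on the slide.
        return SchemaVer(self.model, self.revision + 1, 0)

    def bump_model(self) -> "SchemaVer":
        # Assumption: both lower parts reset on a MODEL change.
        return SchemaVer(self.model + 1, 0, 0)


v = SchemaVer.parse("1-0-2")
print(v.bump_addition(), v.bump_revision(), v.bump_model())
```

Because the version is a tuple of integers, ordinary comparison orders versions correctly, which plain string comparison would not do once a part reaches 10.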

Page 16: Big data meetup budapest   adding data schemas to snowplow

Where are our schemas going to live? We need a schema repository/registry

[Diagram: a schema repo {} serving each stage of the pipeline]

Raw events in JSON format -> Enrichment Manager -> Enriched events in Thrift or Avro format -> Shredder -> Enriched events in TSV ready for loading into db

What the schema repo is used for:

1. Test instrumentation

2. Validate events

3. Define structure

4. Drive shredding

5. Define structure
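As a rough illustration of the registry idea, the sketch below keeps schemas in memory, keyed by event name and version. The class name, the key shape and the rule that a published version cannot be overwritten are all assumptions; the talk does not describe the repository's interface.

```python
class SchemaRepo:
    """Hypothetical in-memory schema registry keyed by (name, version)."""

    def __init__(self):
        self._schemas = {}

    def register(self, name, version, schema):
        key = (name, version)
        # Assumption: a published version is immutable; changes need a new
        # version rather than an overwrite.
        if key in self._schemas:
            raise ValueError(f"schema already registered: {name} {version}")
        self._schemas[key] = schema

    def lookup(self, name, version):
        try:
            return self._schemas[(name, version)]
        except KeyError:
            raise LookupError(f"no schema for {name} {version}") from None


repo = SchemaRepo()
repo.register("video_played", "1-0-0", {"type": "object"})
print(repo.lookup("video_played", "1-0-0"))
```

Every stage in the diagram (instrumentation tests, validation, shredding) would call something like lookup, so they all agree on one definition of each event.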

Page 17: Big data meetup budapest   adding data schemas to snowplow

We need to namespace our schemas properly to prevent clashes and confusion in our schema repository

iglu:com.channel2.vod/video_played/jsonschema/1-0-0

We are calling our schema methodology “Iglu”

• com.channel2.vod: the vendor of this event

• video_played: event name

• jsonschema: schema format

• 1-0-0: schema version
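The four parts of the schema reference can be pulled apart mechanically. The helper below is a minimal sketch; the names SchemaKey and parse_iglu_uri are hypothetical, and the only format rule it enforces is the one visible in the example above (an iglu: prefix followed by four slash-separated, non-empty parts).

```python
from typing import NamedTuple


class SchemaKey(NamedTuple):
    """The four components of an Iglu schema reference (hypothetical type)."""

    vendor: str
    name: str
    format: str
    version: str


def parse_iglu_uri(uri: str) -> SchemaKey:
    """Split 'iglu:vendor/name/format/version' into its components."""
    prefix = "iglu:"
    if not uri.startswith(prefix):
        raise ValueError(f"missing 'iglu:' prefix: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected vendor/name/format/version: {uri!r}")
    return SchemaKey(*parts)


key = parse_iglu_uri("iglu:com.channel2.vod/video_played/jsonschema/1-0-0")
print(key.vendor, key.name, key.format, key.version)
```

Including the vendor in the key is what keeps an HBO video played event from colliding with a Brightcove one.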

Page 18: Big data meetup budapest   adding data schemas to snowplow

Bringing it all together, let’s now make the event JSONs self-describing, with a schema header and data body
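The example from this slide is missing from the transcript, so the sketch below shows one plausible shape for a self-describing event. The key names "schema" and "data" are assumptions based on the “schema header and data body” wording, and the property values are invented.

```python
import json


def make_self_describing(schema_uri, data):
    """Wrap an event body with a header naming the schema it conforms to."""
    return {"schema": schema_uri, "data": data}


def split_self_describing(payload):
    """Return (schema_uri, data), rejecting payloads missing either part."""
    missing = [k for k in ("schema", "data") if k not in payload]
    if missing:
        raise ValueError(f"not self-describing, missing: {missing}")
    return payload["schema"], payload["data"]


event = make_self_describing(
    "iglu:com.channel2.vod/video_played/jsonschema/1-0-0",
    {"id": "abc123", "length": 213},  # illustrative values
)
print(json.dumps(event, indent=2))
```

The event now carries its own name, vendor and version, so the analyst no longer needs an implicit schema in their head to interpret it.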

Page 19: Big data meetup budapest   adding data schemas to snowplow

And for good measure, let’s add in our schema information into the JSON Schema itself
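As a sketch of the same idea applied to the schema, the dict below embeds vendor, name, format and version inside the JSON Schema. The "self" key name and the property definitions are assumptions, since the slide's example is not preserved here.

```python
# Hypothetical self-describing JSON Schema; the "self" block carries the same
# four components as the iglu: reference.
VIDEO_PLAYED_SCHEMA = {
    "self": {
        "vendor": "com.channel2.vod",
        "name": "video_played",
        "format": "jsonschema",
        "version": "1-0-0",
    },
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "length": {"type": "number"},
    },
    "required": ["id", "length"],
    "additionalProperties": False,
}


def schema_uri(schema):
    """Build the iglu: reference that events would use to point at a schema."""
    s = schema["self"]
    return f"iglu:{s['vendor']}/{s['name']}/{s['format']}/{s['version']}"


print(schema_uri(VIDEO_PLAYED_SCHEMA))
```

Since the schema knows its own identity, a repository can file it correctly without relying on a file name or path.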

Page 20: Big data meetup budapest   adding data schemas to snowplow

Snowplow roadmap

Page 21: Big data meetup budapest   adding data schemas to snowplow

Self-describing JSON Schemas are coming in the next release of Snowplow

Page 22: Big data meetup budapest   adding data schemas to snowplow

We are also starting to define third-party events for Snowplow integration, starting with Zendesk customer support events

Page 23: Big data meetup budapest   adding data schemas to snowplow

Questions?

http://snowplowanalytics.com

https://github.com/snowplow/snowplow

@snowplowdata

To chat – @alexcrdean on Twitter or alex@snowplowanalytics.com