Scala eXchange: Building robust data pipelines in Scala


Building robust data pipelines in Scala: the Snowplow experience

Introducing myself: Alex Dean

Co-founder and technical lead at Snowplow, the open-source event analytics platform based here in London [1]

Weekend writer of Unified Log Processing, available on the Manning Early Access Program [2]

[1] https://github.com/snowplow/snowplow

[2] http://manning.com/dean

Snowplow: what is it?

Snowplow is an open source event analytics platform, built from five loosely coupled subsystems joined by standardised data protocols (A-D):

1a. Trackers / 1b. Webhooks
2. Collectors
3. Enrich
4. Storage
5. Analytics

Snowplow gives you your granular, event-level and customer-level data, in your own data warehouse: you can connect any analytics tool to your data, and join your event data with any other data set.

Today almost all users/customers are running a batch-based Snowplow configuration: Snowplow event tracking SDKs send events to an HTTP-based event collector, a Hadoop-based enrichment job processes them, and the results land in Amazon S3 and Amazon Redshift. The batch pipeline is normally run overnight, sometimes every 4-6 hours.

We also have a real-time pipeline for Snowplow in beta, built on Amazon Kinesis (Apache Kafka support coming next year). Snowplow Trackers feed the scala-stream-collector, which writes a raw event stream; scala-kinesis-enrich reads that stream and produces an enriched event stream plus a bad raw event stream. Downstream Kinesis apps fan these streams out: an S3 sink and a Redshift sink, the kinesis-elasticsearch-sink feeding Elasticsearch, and an event aggregator app feeding DynamoDB (some of these Kinesis apps are not yet released).

This supports two modes of working with the data:

Analytics on Read, for agile exploration of events, machine learning, auditing, re-processing

Analytics on Write, for operational reporting, real-time dashboards, audience segmentation, personalization

Snowplow and Scala

Today, Snowplow is primarily developed in Scala. Ruby is used for data modelling scripts and for Snowplow orchestration, but no event-level processing occurs in Ruby. Scala is used for event validation, enrichment and other processing; it is increasingly used for event storage, and starting to be used for event collection too.

Our initial skunkworks version of Snowplow had no Scala at all. In the Snowplow data pipeline v1, a JavaScript event tracker on the website / webapp sent events to a CloudFront-based pixel collector, and a HiveQL + Java UDF ETL processed them into Amazon S3.

But our schema-first, loosely coupled approach made it possible to start swapping out existing components. In the Snowplow data pipeline v2, the JavaScript event tracker sends events to a CloudFront-based event collector or a Clojure-based event collector; enrichment is either the original HiveQL + Java UDF ETL or the new Scalding-based enrichment; and events are stored in Amazon S3 and Amazon Redshift / PostgreSQL.

What is Scalding?

Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop. The stack: Hadoop DFS and Hadoop MapReduce at the bottom; Cascading (alongside Hive and Pig) on top of MapReduce; and above Cascading a family of DSLs/APIs, Java, Scalding, Cascalog, PyCascading and cascading.jruby.

We chose Cascading because we liked their plumbing abstraction over vanilla MapReduce.
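To give a flavour of that plumbing (the standard tutorial-style word count, not code from the talk), a Scalding job on the classic fields-based API looks roughly like this:

```scala
import com.twitter.scalding._

// Canonical Scalding word count (fields-based API): split each line into
// words, group by word, count, and write the counts out as TSV.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}
```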

Why did we choose Scalding instead of one of the other Cascading DSLs/APIs?

Lots of internal experience with Scala, so we could hit the ground running (we had only very basic awareness of Clojure when we started the project)

Scalding was created and is supported by Twitter, who use it throughout their organization, so we knew it was a safe long-term bet

A more controversial opinion (although maybe not at a Scala conference): we believe that data pipelines should be as strongly typed as possible, whereas all the other DSLs/APIs on top of Cascading encourage dynamic typing

Robust data pipelines

Robust data pipelines means strongly typed data pipelines. Why?

Catch errors as soon as possible, and report them in a strongly typed way too

Define the inputs and outputs of each of your data processing steps in an unambiguous way

Forces you to formally address the data types flowing through your system

Lets you write code like this:
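A minimal sketch in that spirit (a hypothetical job, not the code from the slide), using Scalding's typed API so that the shape of every record is checked by the compiler rather than discovered at runtime:

```scala
import com.twitter.scalding._

// Hypothetical event type: a malformed record cannot even be represented.
case class PageView(userId: String, pageUri: String, tstamp: Long)

class PageViewsPerUriJob(args: Args) extends Job(args) {
  TypedPipe.from(TypedTsv[(String, String, Long)](args("input")))
    .map { case (userId, pageUri, tstamp) => PageView(userId, pageUri, tstamp) }
    .filter(_.pageUri.startsWith("http"))  // type-checked access to fields
    .groupBy(_.pageUri)                    // the key is known to be a String
    .size                                  // => (pageUri, count)
    .write(TypedTsv[(String, Long)](args("output")))
}
```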

Robust data processing is a state of mind: failures will happen; don't panic, but don't sweep them under the carpet either.

Our basic processing model for Snowplow looks like this:

Look familiar? It's stdin, stdout and stderr.

Raw events go into the Snowplow enrichment process; good enriched events come out on one stream, and bad raw events (plus the reasons why they are bad) come out on another. This pattern is extremely composable, especially with Kinesis or Kafka streams/topics as the core building block.

Validation, the gateway drug to Scalaz

Inside and across our components, we use the Validation applicative functor from the Scalaz project extensively. Scalaz Validation lets us perform a variety of different event validations and enrichments, and then compose (i.e. collate) the failures.

This is really powerful!

The Scalaz codebase calls |@| "a DSL for constructing Applicative expressions"; I think of it as the Scream operator. Individual components of the enrichment process can themselves collate their own internal failures.
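For example (a hypothetical two-field validation, not the actual Snowplow code; Scalaz 7 shown), two independent checks composed with |@| so that every failure is collected:

```scala
import scalaz._
import Scalaz._

object EventValidation {

  case class Event(userId: String, pageUri: String)

  def validateUserId(raw: String): ValidationNel[String, String] =
    if (raw.nonEmpty) Success(raw)
    else Failure(NonEmptyList("Empty user ID"))

  def validatePageUri(raw: String): ValidationNel[String, String] =
    if (raw.startsWith("http")) Success(raw)
    else Failure(NonEmptyList(s"Invalid page URI: [$raw]"))

  // |@| (the "Scream operator") runs both validations and collates every
  // failure into one NonEmptyList instead of stopping at the first one.
  def validateEvent(rawUserId: String, rawPageUri: String): ValidationNel[String, Event] =
    (validateUserId(rawUserId) |@| validatePageUri(rawPageUri)) { Event(_, _) }
}

// EventValidation.validateEvent("", "ftp://x") ==
//   Failure(NonEmptyList("Empty user ID", "Invalid page URI: [ftp://x]"))
```

With the Scalaz syntax imports you could equally write raw.successNel and "...".failureNel (failNel on older Scalaz 7.0.x) instead of the explicit Success/Failure constructors.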

There is a great F# article by Scott Wlaschin which describes this approach as railway-oriented programming [1]

The Happy Path: if everything succeeds, then this path outputs an enriched event. Any individual failure along the path could switch us onto the failure path, and we never get back onto the happy path once we leave it.

The Failure Path: any failure can take us onto the failure path. We can choose whether to switch straight to the failure path (fail fast), or collate failures from multiple independent tests.

[1] http://fsharpforfunandprofit.com/posts/recipe-part2/

Putting it all together, the Snowplow enrichment process boils down to one big type transformation.
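In spirit, that transformation has roughly this shape (a simplified sketch; the real Snowplow types are richer than this):

```scala
import scalaz._

object Enrichment {

  type RawEvent = String                 // a raw collector payload line
  case class EnrichedEvent(/* many strongly typed fields in the real thing */)

  // Each raw event either becomes an enriched event (the happy path), or a
  // non-empty list of reasons why it failed (the failure path).
  def enrich(raw: RawEvent): ValidationNel[String, EnrichedEvent] = ???
}
```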

The key ingredients: types abstracting over simpler types; no mutable state; railway-oriented programming; collate failures inside a processing stage, fail fast between processing stages.

Think of Scott Wlaschin's "fruit as cargo" metaphor: each processing step either transforms the cargo or shunts it onto the failure track.

Currently Snowplow uses a Non-Empty List of Strings to collect our failures:
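Concretely (illustrative failure messages, not real Snowplow output), a failed event carries a NonEmptyList[String]:

```scala
import scalaz._

object FailedExample {
  case class EnrichedEvent()   // stand-in for the real enriched event type

  val failed: ValidationNel[String, EnrichedEvent] =
    Failure(NonEmptyList(
      "Field [tr_total]: cannot convert [GBP 49.99] to Double",
      "Field [dvce_tstamp]: [2014-13-32 25:61:00] is not a valid timestamp"))
}
```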

We are working on a ProcessingMessage case class, to capture much richer and more structured failures than we can using Strings

The only limitation is that the Failure Path restricts us to a single type

A brief aside on testing

On the testing side, we love Specs2 data tables. They let us test a variety of inputs and expected outputs without making the mistake of just duplicating the data processing functionality in the test:
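A sketch of the style (hypothetical validation function and table, not a real Snowplow spec):

```scala
import org.specs2.mutable.Specification
import org.specs2.matcher.DataTables

class CurrencyValidationSpec extends Specification with DataTables {

  // Hypothetical function under test
  def isValidCurrency(code: String): Boolean = Set("GBP", "USD", "EUR").contains(code)

  "isValidCurrency" should {
    "accept known ISO 4217 codes and reject everything else" in {
      "currency code" || "expected" |
      "GBP"           !! true       |
      "USD"           !! true       |
      "gbp"           !! false      |
      ""              !! false      |> { (code, expected) =>
        isValidCurrency(code) must_== expected
      }
    }
  }
}
```

Each row is one input/expected pair, and the assertion lives in one place at the bottom, so the test never re-implements the logic it is checking.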

We are also starting to do more with ScalaCheck. ScalaCheck is a property-based testing framework, originally inspired by Haskell's QuickCheck.

We use it in a few places, including to generate unpredictable bad data and to validate our new Thrift schema for raw Snowplow events:
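For instance (a hypothetical property over a stand-in validator, not the actual Thrift schema test), ScalaCheck can throw arbitrary junk at a validator and assert that it never blows up:

```scala
import org.scalacheck.{Prop, Properties}

object EventValidationProps extends Properties("event validation") {

  // Hypothetical stand-in: a validator must reject bad input, never throw.
  def validateEvent(raw: String): Either[String, String] =
    if (raw.nonEmpty && raw.forall(c => c.isLetterOrDigit || c == '-'))
      Right(raw)
    else
      Left(s"Invalid event payload: [$raw]")

  // ScalaCheck generates arbitrary (often nasty) Strings; the property fails
  // if validateEvent ever throws instead of returning a Left or Right.
  property("never throws on arbitrary input") = Prop.forAll { (raw: String) =>
    validateEvent(raw).isLeft || validateEvent(raw).isRight
  }
}
```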

Robustness in the face of user-defined types

Snowplow is evolving from a fixed-schema platform to a platform supporting user-defined JSONs.

Where other analytics tools depend on schema-less JSONs or custom variables, we use JSON Schema

Snowplow users send in events as self-describing JSONs, which have to include the URI of the schema that validates the event's JSON body:
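For example (an illustrative schema URI and payload, not necessarily a real Iglu schema), the event body is wrapped in an envelope carrying its schema URI:

```scala
object SelfDescribingExample {
  // An illustrative self-describing JSON: the "schema" field is the Iglu-style
  // URI of the JSON Schema that the "data" payload must validate against.
  val selfDescribingEvent: String = """
    {
      "schema": "iglu:com.acme/link_click/jsonschema/1-0-0",
      "data": {
        "targetUrl": "http://snowplowanalytics.com",
        "elementId": "hero-cta"
      }
    }
  """
}
```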

To support JSON Schema, we have open-sourced Iglu, a new schema repository system in Scala/Spray/Swagger/Jackson

Our Scala client library for Iglu lets us work with JSONs in a safe way from within Snowplow

If a JSON passes its JSON Schema validation, we should be able to deserialize it and work with it safely in Scala in a strongly-typed way:
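A sketch of that (hypothetical event type and helper, not the actual Snowplow code), using json4s with the Jackson bindings and wrapping the extraction in a Scalaz Validation:

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse
import scalaz._

object JsonExtraction {

  // Hypothetical typed representation of a schema-validated "data" payload
  case class LinkClick(targetUrl: String, elementId: String)

  implicit val formats: Formats = DefaultFormats

  // The JSON has already passed JSON Schema validation, but we still guard
  // against any drift between the schema and this Scala code.
  def extractLinkClick(json: String): ValidationNel[String, LinkClick] =
    try {
      Success(parse(json).extract[LinkClick])
    } catch {
      case e: MappingException =>
        Failure(NonEmptyList(s"Could not extract LinkClick: ${e.getMessage}"))
    }
}
```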

We use json4s with the Jackson bindings, as JSON Schema support in Java/Scala is Jackson-based

We still wrap our JSON deserialization in Scalaz Validations in case of any mismatch between the Scala deserialization code and the JSON schema

Questions?

http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata

To meet up or chat, @alexcrdean on Twitter or alex@snowplowanalytics.com

Discount code: ulogprugcf (43% off the Unified Log Processing eBook)