Scala eXchange: Building robust data pipelines in Scala


Building robust data pipelines in Scala: the Snowplow experience

Introducing myself: Alex Dean

Co-founder and technical lead at Snowplow, the open-source event analytics platform based here in London [1]

Weekend writer of Unified Log Processing, available on the Manning Early Access Program [2]

[1] https://github.com/snowplow/snowplow

[2] http://manning.com/dean

Snowplow: what is it?

Snowplow is an open source event analytics platform, built from five loosely coupled subsystems joined by standardised data protocols (A-D):

1a. Trackers / 1b. Webhooks
2. Collectors
3. Enrich
4. Storage
5. Analytics

Snowplow gives you your granular, event-level and customer-level data, in your own data warehouse: you can connect any analytics tool to your data, and join your event data with any other data set.

Today almost all users/customers are running a batch-based Snowplow configuration: Snowplow event tracking SDKs send events to an HTTP-based event collector, a Hadoop-based enrichment job processes them, and the results land in Amazon S3 and Amazon Redshift. The batch pipeline is normally run overnight, sometimes every 4-6 hours.

We also have a real-time pipeline for Snowplow in beta, built on Amazon Kinesis (Apache Kafka support coming next year). Snowplow Trackers feed the scala-stream-collector, which writes a raw event stream; scala-kinesis-enrich reads that stream and produces an enriched event stream plus a bad raw event stream. Downstream Kinesis apps fan these streams out: an S3 sink and a Redshift sink, the kinesis-elasticsearch-sink feeding Elasticsearch, and an event aggregator app feeding DynamoDB (some of these Kinesis apps are not yet released).

This supports two modes of working with the data:

Analytics on Read, for agile exploration of events, machine learning, auditing, re-processing

Analytics on Write, for operational reporting, real-time dashboards, audience segmentation, personalization

Snowplow and Scala

Today, Snowplow is primarily developed in Scala. Ruby is used for data modelling scripts and for Snowplow orchestration, but no event-level processing occurs in Ruby. Scala is used for event validation, enrichment and other processing; it is increasingly used for event storage, and starting to be used for event collection too.

Our initial skunkworks version of Snowplow had no Scala at all. In the Snowplow data pipeline v1, a JavaScript event tracker on the website / webapp sent events to a CloudFront-based pixel collector, and a HiveQL + Java UDF ETL processed them into Amazon S3.

But our schema-first, loosely coupled approach made it possible to start swapping out existing components. In the Snowplow data pipeline v2, the JavaScript event tracker sends events to a CloudFront-based event collector or a Clojure-based event collector; enrichment is either the original HiveQL + Java UDF ETL or the new Scalding-based enrichment; and events are stored in Amazon S3 and Amazon Redshift / PostgreSQL.

What is Scalding?

Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop. The stack: Hadoop DFS and Hadoop MapReduce at the bottom; Cascading (alongside Hive and Pig) on top of MapReduce; and above Cascading a family of DSLs/APIs, Java, Scalding, Cascalog, PyCascading and cascading.jruby.

We chose Cascading because we liked their plumbing abstraction over vanilla MapReduce.
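To give a flavour of that plumbing (the standard tutorial-style word count, not code from the talk), a Scalding job on the classic fields-based API looks roughly like this:

```scala
import com.twitter.scalding._

// Canonical Scalding word count (fields-based API): split each line into
// words, group by word, count, and write the counts out as TSV.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}
```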

Why did we choose Scalding instead of one of the other Cascading DSLs/APIs?

Lots of internal experience with Scala, so we could hit the ground running (we had only very basic awareness of Clojure when we started the project)

Scalding was created and is supported by Twitter, who use it throughout their organization, so we knew it was a safe long-term bet

A more controversial opinion (although maybe not at a Scala conference): we believe that data pipelines should be as strongly typed as possible, whereas all the other DSLs/APIs on top of Cascading encourage dynamic typing

Robust data pipelines

Robust data pipelines means strongly typed data pipelines. Why?

Catch errors as soon as possible, and report them in a strongly typed way too

Define the inputs and outputs of each of your data processing steps in an unambiguous way

Forces you to formally address the data types flowing through your system

Lets you write code like this:
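A minimal sketch in that spirit (a hypothetical job, not the code from the slide), using Scalding's typed API so that the shape of every record is checked by the compiler rather than discovered at runtime:

```scala
import com.twitter.scalding._

// Hypothetical event type: a malformed record cannot even be represented.
case class PageView(userId: String, pageUri: String, tstamp: Long)

class PageViewsPerUriJob(args: Args) extends Job(args) {
  TypedPipe.from(TypedTsv[(String, String, Long)](args("input")))
    .map { case (userId, pageUri, tstamp) => PageView(userId, pageUri, tstamp) }
    .filter(_.pageUri.startsWith("http"))  // type-checked access to fields
    .groupBy(_.pageUri)                    // the key is known to be a String
    .size                                  // => (pageUri, count)
    .write(TypedTsv[(String, Long)](args("output")))
}
```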

Robust data processing is a state of mind: failures will happen; don't panic, but don't sweep them under the carpet either.

Our basic processing model for Snowplow looks like this:

Look familiar? It's stdin, stdout and stderr.

Raw events go into the Snowplow enrichment process; good enriched events come out on one stream, and bad raw events (plus the reasons why they are bad) come out on another. This pattern is extremely composable, especially with Kinesis or Kafka streams/topics as the core building block.

Validation, the gateway drug to Scalaz

Inside and across our components, we use the Validation applicative functor from the Scalaz project extensively. Scalaz Validation lets us perform a variety of different event validations and enrichments, and then compose (i.e. collate) the failures.

This is really powerful!

The Scalaz codebase calls |@| "a DSL for constructing Applicative expressions"; I think of it as the Scream operator. Individual components of the enrichment process can themselves collate their own internal failures.
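For example (a hypothetical two-field validation, not the actual Snowplow code; Scalaz 7 shown), two independent checks composed with |@| so that every failure is collected:

```scala
import scalaz._
import Scalaz._

object EventValidation {

  case class Event(userId: String, pageUri: String)

  def validateUserId(raw: String): ValidationNel[String, String] =
    if (raw.nonEmpty) Success(raw)
    else Failure(NonEmptyList("Empty user ID"))

  def validatePageUri(raw: String): ValidationNel[String, String] =
    if (raw.startsWith("http")) Success(raw)
    else Failure(NonEmptyList(s"Invalid page URI: [$raw]"))

  // |@| (the "Scream operator") runs both validations and collates every
  // failure into one NonEmptyList instead of stopping at the first one.
  def validateEvent(rawUserId: String, rawPageUri: String): ValidationNel[String, Event] =
    (validateUserId(rawUserId) |@| validatePageUri(rawPageUri)) { Event(_, _) }
}

// EventValidation.validateEvent("", "ftp://x") ==
//   Failure(NonEmptyList("Empty user ID", "Invalid page URI: [ftp://x]"))
```

With the Scalaz syntax imports you could equally write raw.successNel and "...".failureNel (failNel on older Scalaz 7.0.x) instead of the explicit Success/Failure constructors.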

There is a great F# article by Scott Wlaschin which describes this approach as railway-oriented programming [1]

The Happy Path: if everything succeeds, then this path outputs an enriched event. Any individual failure along the path could switch us onto the failure path, and we never get back onto the happy path once we leave it.

The Failure Path: any failure can take us onto the failure path. We can choose whether to switch straight to the failure path (fail fast), or collate failures from multiple independent tests.

[1] http://fsharpforfunandprofit.com/posts/recipe-part2/

Putting it all together, the Snowplow enrichment process boils down to one big type transformation.
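In spirit, that transformation has roughly this shape (a simplified sketch; the real Snowplow types are richer than this):

```scala
import scalaz._

object Enrichment {

  type RawEvent = String                 // a raw collector payload line
  case class EnrichedEvent(/* many strongly typed fields in the real thing */)

  // Each raw event either becomes an enriched event (the happy path), or a
  // non-empty list of reasons why it failed (the failure path).
  def enrich(raw: RawEvent): ValidationNel[String, EnrichedEvent] = ???
}
```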

The key ingredients: types abstracting over simpler types; no mutable state; railway-oriented programming; collate failures inside a processing stage, fail fast between processing stages.

Think of Scott Wlaschin's "fruit as cargo" metaphor: each processing step either transforms the cargo or shunts it onto the failure track.

Currently Snowplow uses a Non-Empty List of Strings to collect our failures:
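Concretely (illustrative failure messages, not real Snowplow output), a failed event carries a NonEmptyList[String]:

```scala
import scalaz._

object FailedExample {
  case class EnrichedEvent()   // stand-in for the real enriched event type

  val failed: ValidationNel[String, EnrichedEvent] =
    Failure(NonEmptyList(
      "Field [tr_total]: cannot convert [GBP 49.99] to Double",
      "Field [dvce_tstamp]: [2014-13-32 25:61:00] is not a valid timestamp"))
}
```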

We are working on a ProcessingMessage case class, to capture much richer and more structured failures than we can using Strings

The only limitation is that the Failure Path restricts us to a single type

A brief aside on testing

On the testing side, we love Specs2 data tables. They let us test a variety of inputs and expected outputs without making the mistake of just duplicating the data processing functionality in the test:
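A sketch of the style (hypothetical validation function and table, not a real Snowplow spec):

```scala
import org.specs2.mutable.Specification
import org.specs2.matcher.DataTables

class CurrencyValidationSpec extends Specification with DataTables {

  // Hypothetical function under test
  def isValidCurrency(code: String): Boolean = Set("GBP", "USD", "EUR").contains(code)

  "isValidCurrency" should {
    "accept known ISO 4217 codes and reject everything else" in {
      "currency code" || "expected" |
      "GBP"           !! true       |
      "USD"           !! true       |
      "gbp"           !! false      |
      ""              !! false      |> { (code, expected) =>
        isValidCurrency(code) must_== expected
      }
    }
  }
}
```

Each row is one input/expected pair, and the assertion lives in one place at the bottom, so the test never re-implements the logic it is checking.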

We are also starting to do more with ScalaCheck. ScalaCheck is a property-based testing framework, originally inspired by Haskell's QuickCheck.

We use it in a few places, including to generate unpredictable bad data and to validate our new Thrift schema for raw Snowplow events:
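For instance (a hypothetical property over a stand-in validator, not the actual Thrift schema test), ScalaCheck can throw arbitrary junk at a validator and assert that it never blows up:

```scala
import org.scalacheck.{Prop, Properties}

object EventValidationProps extends Properties("event validation") {

  // Hypothetical stand-in: a validator must reject bad input, never throw.
  def validateEvent(raw: String): Either[String, String] =
    if (raw.nonEmpty && raw.forall(c => c.isLetterOrDigit || c == '-'))
      Right(raw)
    else
      Left(s"Invalid event payload: [$raw]")

  // ScalaCheck generates arbitrary (often nasty) Strings; the property fails
  // if validateEvent ever throws instead of returning a Left or Right.
  property("never throws on arbitrary input") = Prop.forAll { (raw: String) =>
    validateEvent(raw).isLeft || validateEvent(raw).isRight
  }
}
```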

Robustness in the face of user-defined types

Snowplow is evolving from a fixed-schema platform to a platform supporting user-defined JSONs.

Where other analytics tools depend on schema-less JSONs or custom variables, we use JSON Schema

Snowplow users send in events as self-describing JSONs, which have to include the URI of the schema that validates the event's JSON body:
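For example (an illustrative schema URI and payload, not necessarily a real Iglu schema), the event body is wrapped in an envelope carrying its schema URI:

```scala
object SelfDescribingExample {
  // An illustrative self-describing JSON: the "schema" field is the Iglu-style
  // URI of the JSON Schema that the "data" payload must validate against.
  val selfDescribingEvent: String = """
    {
      "schema": "iglu:com.acme/link_click/jsonschema/1-0-0",
      "data": {
        "targetUrl": "http://snowplowanalytics.com",
        "elementId": "hero-cta"
      }
    }
  """
}
```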

To support JSON Schema, we have open-sourced Iglu, a new schema repository system in Scala/Spray/Swagger/Jackson

Our Scala client library for Iglu lets us work with JSONs in a safe way from within Snowplow

If a JSON passes its JSON Schema validation, we should be able to deserialize it and work with it safely in Scala in a strongly-typed way:
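A sketch of that (hypothetical event type and helper, not the actual Snowplow code), using json4s with the Jackson bindings and wrapping the extraction in a Scalaz Validation:

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse
import scalaz._

object JsonExtraction {

  // Hypothetical typed representation of a schema-validated "data" payload
  case class LinkClick(targetUrl: String, elementId: String)

  implicit val formats: Formats = DefaultFormats

  // The JSON has already passed JSON Schema validation, but we still guard
  // against any drift between the schema and this Scala code.
  def extractLinkClick(json: String): ValidationNel[String, LinkClick] =
    try {
      Success(parse(json).extract[LinkClick])
    } catch {
      case e: MappingException =>
        Failure(NonEmptyList(s"Could not extract LinkClick: ${e.getMessage}"))
    }
}
```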

We use json4s with the Jackson bindings, as JSON Schema support in Java/Scala is Jackson-based

We still wrap our JSON deserialization in Scalaz Validations in case of any mismatch between the Scala deserialization code and the JSON schema

Questions?

http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata

To meet up or chat, @alexcrdean on Twitter or alex@snowplowanalytics.com

Discount code: ulogprugcf (43% off the Unified Log Processing eBook)