idml deep dive strata

IDML Deep Dive

Data preparation without the pain

Jon Davey

Strata 2015

1. Background

Remember 2011?

That McKinsey whitepaper

The world discovered Hadoop

The first Strata

Also in 2011

๏DataSift launched

๏Twitter firehose re-syndication

๏+ handful of other data sources

๏500-1000 lines of data preparation code per source

Now

๏More data sources - More to build and maintain

๏Many people with an interest in how data is prepared - Support, Product, Solutions

๏Lots of problems to solve - Scaling, stability, training customers and new staff

Many stakeholders in data ingestion

๏Support - “Why can’t customer X see field Y?”

๏Data Science - “Is field A populated enough to be statistically significant?”

๏Documentation - “What is the purpose of field A and how does it relate to field B?”

๏Test - “How do we measure the entropy in random IDs so we can be sure we aren’t losing data during de-duplication after redundancy?”

Engineering challenges

๏Detecting upstream schema changes

๏Supporting multiple data versions

๏Reducing boilerplate code

๏Software reusability

IDML (Ingestion Data Mapping Language)

๏Cleaner than a general purpose programming language

๏Readable by people who aren’t writing code every day

๏Wide range of features, extensible

2. What it does

A sample preparation task: Sanitize scraped content

Data preparation can be verbose...

It’s simpler if you use something designed for it

IDML is designed for data preparation

3. Closer look at features

Deeply nested structures (without NPEs)
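
The slide's original example is not preserved in this text. As a rough illustration of the idea only, and not IDML syntax, null-safe access to a nested path might be sketched in Python like this. The `MISSING` sentinel and the `get_path` helper are hypothetical names introduced here.

```python
from typing import Any

# Hypothetical sentinel meaning "nothing at this path"; not part of IDML.
MISSING = object()


def get_path(doc: Any, path: str) -> Any:
    """Walk a dotted path through nested dicts and lists.

    Returns MISSING instead of raising when any step is absent.
    """
    current = doc
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif (
            isinstance(current, list)
            and key.isdecimal()
            and int(key) < len(current)
        ):
            current = current[int(key)]
        else:
            return MISSING
    return current


# Illustrative input; the field names are made up for this sketch.
tweet = {"user": {"profile": {"name": "jon"}}, "tags": ["a", "b"]}
name = get_path(tweet, "user.profile.name")
absent = get_path(tweet, "user.location.city")
```

Reading `user.location.city` from a record that has no `location` yields the sentinel rather than an exception, which is the behaviour the slide title points at.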

Aliasing with coalesce
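
A minimal Python sketch of the coalesce idea, assuming two upstream schema versions that name the same field differently. This is not IDML syntax; `coalesce`, `author`, and the field names `username` and `screen_name` are illustrative assumptions.

```python
from typing import Any, Optional


def coalesce(*candidates: Any) -> Optional[Any]:
    """Return the first candidate that is not None, or None if all are."""
    for value in candidates:
        if value is not None:
            return value
    return None


# Two hypothetical versions of the same upstream record.
old_record = {"screen_name": "jon"}
new_record = {"username": "jon"}


def author(record: dict) -> Optional[str]:
    """Read the author from whichever alias the record happens to carry."""
    return coalesce(record.get("username"), record.get("screen_name"))
```

Both `author(old_record)` and `author(new_record)` resolve to the same output field, so downstream code does not need to know which version it received.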

Wide range of validation and transform functions

It’s there or it’s not - No try..catch
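
One way to picture "there or not" in ordinary code: transforms return "no value" instead of raising, and assignments silently skip absent values. The Python below is a sketch under that assumption, not IDML; `to_int` and `assign` are hypothetical helpers.

```python
import re
from typing import Any, Optional

_INT_PATTERN = re.compile(r"-?[0-9]+")


def to_int(value: Any) -> Optional[int]:
    """Convert to int, or return None when the value is not integer-like."""
    if isinstance(value, bool):
        return None  # bool is a subclass of int; treat it as not a number
    if isinstance(value, int):
        return value
    if isinstance(value, str) and _INT_PATTERN.fullmatch(value.strip()):
        return int(value.strip())
    return None


def assign(output: dict, key: str, value: Any) -> None:
    """Set output[key] only when a value is present; otherwise do nothing."""
    if value is not None:
        output[key] = value


# Illustrative input: one good field, one malformed, one absent.
record = {"followers": "1024", "friends": "lots"}
output: dict = {}
assign(output, "followers", to_int(record.get("followers")))
assign(output, "friends", to_int(record.get("friends")))
assign(output, "listed", to_int(record.get("listed")))
```

Only `followers` ends up in `output`; the malformed and the absent field are both simply not there, with no exception handling in the mapping itself.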

Lenient but consistent

The runtime figures things out

Arrays are easy to work with

Filter things
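
As a hedged illustration of mapping over an array and filtering its elements, here is a Python sketch. It is not IDML syntax, and the `links`, `url`, and `status` fields are invented for the example.

```python
from typing import Any, Dict, List

# Illustrative input with an array of nested objects.
post = {
    "links": [
        {"url": "http://example.com/a", "status": 200},
        {"url": "http://example.com/b", "status": 404},
        {"url": "http://example.com/c", "status": 200},
    ]
}


def live_urls(doc: Dict[str, Any]) -> List[str]:
    """Keep the URL of every link whose status is 200.

    A missing or null "links" field yields an empty list.
    """
    links = doc.get("links") or []
    return [
        link["url"]
        for link in links
        if isinstance(link, dict) and "url" in link and link.get("status") == 200
    ]
```

The same function handles a record with no `links` field at all, which keeps the rule lenient in the way the preceding slides describe.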

In-place validation

Other features

๏Detects fields that have not been mapped, making it easy to find data that’s not understood

๏Generates metrics about why a rule failed

๏Uniform interface allows the same syntax for JSON and XML
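
Detecting unmapped fields can be pictured as a set difference between the paths present in the input and the paths the mapping rules read. The Python sketch below assumes that model; how IDML actually tracks this is not described in the slides, and `leaf_paths` and `unmapped` are hypothetical helpers.

```python
from typing import Any, Iterable, Set


def leaf_paths(doc: Any, prefix: str = "") -> Set[str]:
    """Collect the dotted path of every leaf value in nested dicts.

    Lists and empty dicts are treated as leaves in this sketch.
    """
    if isinstance(doc, dict) and doc:
        paths: Set[str] = set()
        for key, value in doc.items():
            child = f"{prefix}.{key}" if prefix else key
            paths |= leaf_paths(value, child)
        return paths
    return {prefix} if prefix else set()


def unmapped(doc: Any, mapped: Iterable[str]) -> Set[str]:
    """Return input paths that no mapping rule has read."""
    return leaf_paths(doc) - set(mapped)


# Illustrative record and the paths a hypothetical mapping consumes.
record = {"id": 1, "user": {"name": "jon", "lang": "en"}, "geo": {"lat": 51.5}}
mapped_paths = ["id", "user.name"]
leftover = unmapped(record, mapped_paths)
```

Here `leftover` contains `user.lang` and `geo.lat`, the fields that arrived from upstream but are not yet understood by the mapping.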

4. Where it fits

Multiple deployment patterns

๏Deployable as a standalone service

๏Usable as a library

๏ Kafka consumer

๏ MapReduce mapper

๏ NSQ consumer

๏ Amazon SQS consumer

๏Command line, including REPL

Performance

๏It’s an interpreter, so it’s noticeably slower than hand-written code in contrived benchmarks

๏In real cases, IO has usually been the bottleneck

๏Unstructured data is inherently suboptimal - dynamic structures like JsonNode are backed with HashMaps and Trees

๏One day it might be faster. Runtimes can often be optimized in much smarter ways: consider how a JIT-compiled runtime such as the JVM can inline virtual method calls at run time, where C++ typically dispatches them through a vtable

Open sourcing it soon

๏May be rebranded as Ptolemy

๏Support for JSON and XML (and SGML - don’t ask)

๏May improve any of these areas, depending on interest:

๏ Performance

๏ More input and output types

๏ More integration: Spark, Kinesis

๏Would you use it on your own projects? Would you help?

QUESTIONS?

THANK YOU!
