IDML Deep Dive
Data preparation without the pain
Jon Davey
Strata 2015
1. Background
Remember 2011?
That McKinsey whitepaper
The world discovered Hadoop
The first Strata
Also in 2011
๏DataSift launched
๏Twitter firehose re-syndication
๏+ handful of other data sources
๏500-1000 lines of data preparation code per source
Now
๏More data sources - More to build and maintain
๏Many people with an interest in how data is prepared - Support, Product, Solutions
๏Lots of problems to solve - Scaling, stability, training customers and new staff
Many stakeholders in data ingestion
๏Support - “Why can’t customer X see field Y?”
๏Data Science - “Is field A populated enough to be statistically significant?”
๏Documentation - “What is the purpose of field A and how does it relate to field B?”
๏Test - “How do we measure the entropy in random IDs so we can be sure we aren’t losing data during de-duplication after redundancy?”
Engineering challenges
๏Detecting upstream schema changes
๏Supporting multiple data versions
๏Reducing boilerplate code
๏Software reusability
IDML (Ingestion Data Mapping Language)
๏Cleaner than a general purpose programming language
๏Readable by people who aren’t writing code every day
๏Wide range of features, extensible
2. What it does
A sample preparation task: Sanitize scraped content
Data preparation can be verbose...
It’s simpler if you use something designed for it
IDML is designed for data preparation
3. Closer look at features
Deeply nested structures (without NPEs)
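The slide's own example did not survive in this transcript. As a rough illustration of the idea only, here is a minimal Python sketch of null-safe nested access. It is not IDML syntax; `MISSING`, `get_path` and the sample record are hypothetical names invented for this sketch.

```python
from typing import Any

# Hypothetical sentinel meaning "no value at this path" (an assumption, not IDML).
MISSING = object()


def get_path(doc: Any, path: str) -> Any:
    """Walk a dotted path through nested dicts, returning MISSING instead of raising."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return MISSING
        current = current[key]
    return current


tweet = {"user": {"profile": {"location": "London"}}}

location = get_path(tweet, "user.profile.location")   # present at every level
timezone = get_path(tweet, "user.settings.timezone")  # "settings" is absent, no exception
```

The caller never has to check each intermediate level for null before descending.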
Aliasing with coalesce
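A possible reading of this slide: when the same logical field lives at different paths in different sources or versions, take the first path that has a value. The Python sketch below illustrates that under assumptions; `get_path`, `coalesce`, `author_name` and the field names are hypothetical and do not reproduce IDML syntax.

```python
from typing import Any, Optional


def get_path(doc: Any, path: str) -> Optional[Any]:
    """Follow a dotted path through nested dicts; None when any step is absent."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def coalesce(*values: Any) -> Optional[Any]:
    """Return the first argument that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None


def author_name(doc: dict) -> Optional[str]:
    # Two hypothetical schema versions that name the same field differently.
    return coalesce(get_path(doc, "author.name"), get_path(doc, "user.screen_name"))


old_format = {"author": {"name": "alice"}}
new_format = {"user": {"screen_name": "bob"}}
```

One mapping rule then covers both versions of the upstream schema.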
Wide range of validation and transform functions
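The functions shown on the slide are not in this transcript. As a hedged illustration of chaining transforms with a validator, here is a small Python sketch; `pipeline`, `strip`, `lowercase`, `email` and the deliberately loose `EMAIL_RE` pattern are assumptions made for the example, not the IDML function library.

```python
import re
from typing import Callable, Optional

# Loose illustrative pattern, not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

Step = Callable[[str], Optional[str]]


def strip(value: str) -> Optional[str]:
    return value.strip()


def lowercase(value: str) -> Optional[str]:
    return value.lower()


def email(value: str) -> Optional[str]:
    """Validator: pass the value through if it looks like an email, else None."""
    return value if EMAIL_RE.fullmatch(value) else None


def pipeline(value: Optional[str], *steps: Step) -> Optional[str]:
    """Apply steps left to right, stopping as soon as one yields None."""
    for step in steps:
        if value is None:
            return None
        value = step(value)
    return value


clean = pipeline("  Jon@Example.COM ", strip, lowercase, email)
rejected = pipeline("not an email", strip, lowercase, email)
```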
It’s there or it’s not - No try..catch
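One way to picture this: a cast that fails yields "no value", and the output field is simply left out, so the mapping code contains no exception handling. The Python sketch below assumes that behaviour; `to_int`, `map_record` and the field names are hypothetical.

```python
import re
from typing import Any, Dict, Optional

INT_RE = re.compile(r"-?\d+")


def to_int(value: Any) -> Optional[int]:
    """Return an int when the value is one or parses cleanly as one, otherwise None."""
    if isinstance(value, bool):
        return None  # bool is a subclass of int in Python; treat it as not a number here
    if isinstance(value, int):
        return value
    if isinstance(value, str) and INT_RE.fullmatch(value.strip()):
        return int(value.strip())
    return None


def map_record(raw: Dict[str, Any]) -> Dict[str, int]:
    """Copy numeric fields across; a field that fails to cast is omitted from the output."""
    out: Dict[str, int] = {}
    for field in ("followers", "friends"):
        number = to_int(raw.get(field))
        if number is not None:
            out[field] = number
    return out


record = map_record({"followers": "1200", "friends": "n/a"})
```

Here `friends` is dropped rather than raising or being stored as a bad value.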
Lenient but consistent
The runtime figures things out
Arrays are easy to work with
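The slide's example is missing from the transcript. A plausible illustration, sketched in Python under assumptions, is pulling one field out of every element of an array without writing a loop at each call site. `pluck` and the sample `post` are hypothetical.

```python
from typing import Any, List


def pluck(items: List[Any], key: str) -> List[Any]:
    """Collect `key` from every dict in `items`, skipping elements that lack it."""
    return [item[key] for item in items if isinstance(item, dict) and key in item]


post = {
    "hashtags": [
        {"text": "strata"},
        {"text": "bigdata"},
        {"indices": [0, 5]},  # no "text" key, so it is skipped
    ]
}

tags = pluck(post.get("hashtags", []), "text")
```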
Filter things
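As a hedged sketch of what filtering array elements by a condition can look like, here is a short Python example. `where`, `links` and the `status` field are made up for illustration and say nothing about IDML's actual filter syntax.

```python
from typing import Any, Callable, Dict, List

Item = Dict[str, Any]


def where(items: List[Item], predicate: Callable[[Item], bool]) -> List[Item]:
    """Keep only the elements for which the predicate is true."""
    return [item for item in items if predicate(item)]


links = [
    {"url": "http://example.com/a", "status": 200},
    {"url": "http://example.com/b", "status": 404},
    {"url": "http://example.com/c"},  # status unknown, so it fails the test below
]

live = where(links, lambda link: link.get("status") == 200)
```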
In-place validation
Other features
๏Detects fields that have not been mapped, making it easy to find data that’s not understood
๏Generates metrics about why a rule failed
๏Uniform interface allows the same syntax for JSON and XML
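The first bullet above can be sketched as a set difference between the paths present in the input and the paths a mapping reads. This Python version is an assumption about the general approach, not a description of how IDML implements it; `leaf_paths`, `unmapped` and the sample data are hypothetical.

```python
from typing import Any, Dict, Iterable, Set


def leaf_paths(doc: Dict[str, Any], prefix: str = "") -> Set[str]:
    """Return the dotted path of every leaf value in a nested dict."""
    paths: Set[str] = set()
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            paths |= leaf_paths(value, path)
        else:
            paths.add(path)
    return paths


def unmapped(doc: Dict[str, Any], mapped_paths: Iterable[str]) -> Set[str]:
    """Input paths that no mapping rule has read."""
    return leaf_paths(doc) - set(mapped_paths)


incoming = {"id": 1, "user": {"name": "alice", "lang": "en"}, "geo": {"lat": 51.4}}
rules_read = {"id", "user.name"}

not_understood = unmapped(incoming, rules_read)
```

Anything in `not_understood` is data the upstream source sends that no rule accounts for, which is also a cheap signal of an upstream schema change.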
4. Where it fits
Multiple deployment patterns
๏Deployable as a standalone service
๏Usable as a library
๏ Kafka consumer
๏ MapReduce mapper
๏ NSQ consumer
๏ Amazon SQS consumer
๏Command line, including REPL
Performance
๏It’s an interpreter so it’s noticeably slower than hand-written code in contrived benchmarks
๏In real cases, IO has usually been the bottleneck
๏Unstructured data is inherently suboptimal - dynamic structures like JsonNode are backed with HashMaps and Trees
๏One day it might be faster. Runtimes can often be optimized in much smarter ways: consider why Java can be faster than C++ at virtual method calls (a JIT can devirtualize and inline them using run-time information)
Open sourcing it soon
๏May be rebranded as Ptolemy
๏Support for JSON and XML (and SGML - don’t ask)
๏May improve any of these areas, depending on interest:
๏ Performance
๏ More input and output types
๏ More integration: Spark, Kinesis
๏Would you use it on your own projects? Would you help?
QUESTIONS?
THANK YOU!