
7 Key Recipes for Data Engineering



We will explore 7 key recipes about Data Engineering.

The 5th is absolutely game changing!

>>>> N E X T

Thank You


About Me: Jonathan WINANDY

BI and Big Data Consultant

Lead Data Engineer:
- Data Lake building,
- Audit / Coaching,
- Spark Training.

Founder of Univalence (BI / Big Data)

Co-Founder of CYM (IoT / Predictive Maintenance), Craft Analytics† (BI / Big Data), and Valwin (Health Care Data).

2016 has been amazing for Data Engineering!

but ...

1. It’s all about our Organisations

Data engineering is not about scaling computation.


Data engineering is not a support function for Data Scientists[1].

[1] whatever they are nowadays


Instead, Data engineering enables access to Data!


access to Data … in complex organisations.

[Diagram: Product, Ops, BI, You; new data]


[Diagram: new data; Entity 1 (Marketing, IT), Entity N (Marketing, IT)]


access to Data … in complex organisations.

It’s very frustrating!

We run a support group meetup if you are interested: Paris Data Engineers!


Small tips:
- Only one Hadoop cluster (no TEST/REC/INT/PREPROD).
- No Air-Data-Eng, it helps no one.
- Radical transparency with other teams.
- Hack that sh**.

2. Optimising our work

There are 3 key concerns governing our decisions:
- Lead time
- Impact
- Failure management

Lead time (noun): the period of time between the initial phase of a process and the emergence of results, as between the planning and completed manufacture of a product.

Short lead times are essential!

The Elastic stack helps a lot in this area.

Impact

To have impact, we have to analyse beyond immediate needs. That way, we’re able to provide solutions to entire kinds of problems.

Failure management

Things fail, be prepared!

On the same morning, both the RER A (public transport) and our Hadoop job tracker can fail.

Unprepared failures may pile up and lead to huge waste.

How to mitigate failure in 7 questions:
- “What is likely to fail?” $componentName_____
- “How? (root cause)”
- “Can we know if this will fail?”
- “Can we prevent this failure?”
- “What are the impacts?”
- “How to fix it when it happens?”
- “Can we facilitate today?”


Track your work!

3. Staging the Data

Data is moving around, freeze it!

Staging changed with Big Data. We moved from transient staging (FTP, NFS, etc.) to persistent staging in distributed solutions:
● In Streaming with Kafka, we may retain logs in Kafka for several months.
● In Batch, staging in HDFS may retain source Data for


Modern staging anti-patterns:
- Dropping destination places before moving the Data.
- Having incomplete data visible.
- Short log retention in streams (=> new failure modes).

Modern staging should be seen as a persistent data structure.

HDFS staging:

/staging
|-- $tablename
    |-- dtint=$dtint
        |-- $dsparam.value
            |-- ...
                |-- ...
                    |-- uuid=$uuid
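A minimal sketch of a staging write that avoids the anti-patterns listed earlier (nothing is dropped, nothing incomplete is visible). Assumptions: the local filesystem stands in for HDFS, the layout is simplified to table / dtint / uuid, and the names `Staging`, `loadDir`, `stage` and `part-00000` are hypothetical. On HDFS the publish step would be a rename through the Hadoop filesystem API instead of `Files.move`.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}
import java.util.UUID

object Staging {
  /** Directory of one load: <root>/<table>/dtint=<dtint>/uuid=<uuid> (simplified layout). */
  def loadDir(root: Path, table: String, dtint: Int, uuid: UUID): Path =
    root.resolve(table).resolve(s"dtint=$dtint").resolve(s"uuid=$uuid")

  /** Writes the lines into a temporary sibling directory, then publishes it with a single rename.
    * Previous loads are never deleted: every call adds a new uuid=... directory. */
  def stage(root: Path, table: String, dtint: Int, lines: Seq[String]): Path = {
    val uuid = UUID.randomUUID()
    val target = loadDir(root, table, dtint, uuid)
    val tmp = target.resolveSibling(s".tmp-$uuid")
    Files.createDirectories(tmp)
    Files.write(tmp.resolve("part-00000"), lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
    // Readers only ever see the final directory once it is complete.
    Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
  }
}
```

Calling `Staging.stage(root, "visits", 20160101, rows)` twice leaves two `uuid=...` directories side by side, which is what makes the staging area behave like a persistent data structure.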

4. Using RDDs or Dataframes

Dataframes have great performance, but are untyped and foreign.

RDDs have a robust Scala API, but are a pain to map from data sources.


Dataframes           | RDDs
Predicate push down  | Types!!
Bare metal / unboxed | Nested structures
Connectors           | Better unit tests
Pluggable Optimizer  | Less stages
SQL + Meta           | Scala * Scala

We should use RDDs in large ETL jobs:
- Loading the data with dataframe APIs,
- Basic case class mapping (or better, Datasets),
- Typesafe transformations,
- Storing with dataframe APIs.
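The four steps can be sketched without a cluster. Assumptions: plain Scala collections stand in for Spark, a `Map[String, Any]` stands in for an untyped dataframe row, and `Visit`, `toVisit`, `pagesPerVisitor` and `toRows` are hypothetical names. The point is only where the untyped boundary sits, not how Spark does it.

```scala
object TypedEtl {
  // Untyped row, standing in for a dataframe row in this sketch.
  type Row = Map[String, Any]

  final case class Visit(visitorId: String, pages: Int)

  // Basic case class mapping: the only place where untyped access happens.
  def toVisit(row: Row): Option[Visit] =
    (row.get("visitorId"), row.get("pages")) match {
      case (Some(id: String), Some(n: Int)) => Some(Visit(id, n))
      case _                                => None // malformed rows are dropped here
    }

  // Typesafe transformation: field names and types are checked by the compiler.
  def pagesPerVisitor(visits: Seq[Visit]): Map[String, Int] =
    visits.groupBy(_.visitorId).map { case (id, vs) => (id, vs.map(_.pages).sum) }

  // Back to untyped rows just before storing.
  def toRows(totals: Map[String, Int]): Seq[Row] =
    totals.toSeq.sortBy(_._1).map { case (id, n) => Map[String, Any]("visitorId" -> id, "pages" -> n) }

  def run(input: Seq[Row]): Seq[Row] =
    toRows(pagesPerVisitor(input.flatMap(r => toVisit(r))))
}
```

Everything between `toVisit` and `toRows` works on `Visit`, so a renamed or retyped field fails at compile time rather than in the middle of a long job.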

Dataframes are perfect for:
- Exploration, drill down,
- Light jobs,
- Dynamic jobs.


RDD based jobs are like marine mammals.

5. Cogroup all the things

The cogroup is the best operation to link data together.

It fundamentally changes the way we work with data.

join      (left:RDD[(K,A)],right:RDD[(K,B)]):RDD[(K,(       A ,       B ))]
leftJoin  (left:RDD[(K,A)],right:RDD[(K,B)]):RDD[(K,(       A ,Option[B]))]
rightJoin (left:RDD[(K,A)],right:RDD[(K,B)]):RDD[(K,(Option[A],       B ))]
outerJoin (left:RDD[(K,A)],right:RDD[(K,B)]):RDD[(K,(Option[A],Option[B]))]

cogroup   (left:RDD[(K,A)],right:RDD[(K,B)]):RDD[(K,(   Seq[A],   Seq[B]))]

groupBy   (rdd:RDD[(K,A)]):RDD[(K,Seq[A])]

With cogroup and groupBy, for a given key: K, there is exactly one row with that key in the output dataset.
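To make the one-row-per-key property concrete, here is a sketch of cogroup on plain Scala collections, with two of the joins recovered from it. Assumptions: `Seq` and `Map` stand in for RDDs, and the object `Cogroup` and its methods are written for illustration only. This is not Spark's implementation.

```scala
object Cogroup {
  /** Groups both sides by key; every key present on either side appears exactly once. */
  def cogroup[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Map[K, (Seq[A], Seq[B])] = {
    val l = left.groupBy(_._1)
    val r = right.groupBy(_._1)
    (l.keySet ++ r.keySet).iterator.map { k =>
      (k, (l.getOrElse(k, Seq.empty).map(_._2), r.getOrElse(k, Seq.empty).map(_._2)))
    }.toMap
  }

  /** Inner join recovered from cogroup: all pairs for keys present on both sides. */
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    cogroup(left, right).toSeq.flatMap { case (k, (as, bs)) =>
      for (a <- as; b <- bs) yield (k, (a, b))
    }

  /** Left join recovered from cogroup: a missing right side becomes None. */
  def leftJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, Option[B]))] =
    cogroup(left, right).toSeq.flatMap { case (k, (as, bs)) =>
      val rights: Seq[Option[B]] = if (bs.isEmpty) Seq(None) else bs.map(b => Some(b))
      for (a <- as; b <- rights) yield (k, (a, b))
    }
}
```

Each join is a `flatMap` over the cogrouped rows, so the cogroup output carries at least as much information as any of the joins.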


{case (k,(s1,s2)) => (k,( ,}


3k LoC, 30 minutes to run (non-blocking)

15 LoC, 11 hours to run (blocking)

What about tests?

Cogrouping allows us to have “ScalaCheck-like” tests, by minimising examples.

Test workflow:
- Write a predicate to isolate the bug.
- Get the minimal cogrouped row.
- Output the row in test resources.
- Reproduce the bug.
- Write tests and fix code.
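The first two steps of the workflow could look like the sketch below. Assumptions: cogrouped rows are plain tuples, "minimal" is taken to mean the fewest values on both sides, and `MinimalRow` and `smallestFailing` are hypothetical names.

```scala
object MinimalRow {
  type CogroupedRow[K, A, B] = (K, (Seq[A], Seq[B]))

  /** Among the rows matching `isBuggy`, returns one with the fewest values, if any. */
  def smallestFailing[K, A, B](rows: Seq[CogroupedRow[K, A, B]])(
      isBuggy: CogroupedRow[K, A, B] => Boolean): Option[CogroupedRow[K, A, B]] = {
    val failing = rows.filter(isBuggy)
    if (failing.isEmpty) None
    else Some(failing.minBy { case (_, (as, bs)) => as.size + bs.size })
  }
}
```

The returned row is small enough to be written to test resources and replayed in a unit test.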

6. Inline data quality


Data quality improves resilience to bad data.

But data quality concerns come second.

Example:

case class FixeVisiteur(
  devicetype: String,
  isrobot: Boolean,
  recherche_visitorid: String,
  sessions: List[FixeSession]) {
  def recherches: List[FixeRecherche] = sessions.flatMap(_.recherches)
}

object FixeVisiteur {
  @autoBuildResult
  def build(
    devicetype: Result[String],
    isrobot: Result[Boolean],
    recherche_visitorid: Result[String],
    sessions: Result[List[FixeSession]]
  ): Result[FixeVisiteur] = MacroMarker.generated_applicative
}
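The slides do not show how `Result` is defined or what the macro generates. As an illustration only, here is a hypothetical minimal `Result` that carries an optional value plus annotations, and a hand-written two-field `build` in applicative style. `InlineQuality`, `Visitor`, `ok`, `bad` and `map2` are assumptions, not the author's API.

```scala
object InlineQuality {
  /** Hypothetical stand-in for Result: an optional value plus data quality annotations. */
  final case class Result[+T](value: Option[T], annotations: List[String]) {
    def map[U](f: T => U): Result[U] = Result(value.map(f), annotations)
  }

  object Result {
    def ok[T](t: T): Result[T] = Result(Some(t), Nil)
    def bad(path: String, message: String): Result[Nothing] =
      Result(None, List(s"$path: $message"))

    /** Applicative combination: both sides are inspected, so no annotation is lost. */
    def map2[A, B, C](ra: Result[A], rb: Result[B])(f: (A, B) => C): Result[C] =
      Result(for (a <- ra.value; b <- rb.value) yield f(a, b), ra.annotations ++ rb.annotations)
  }

  final case class Visitor(deviceType: String, isRobot: Boolean)

  // Hand-written equivalent of what a generated build could do for two fields.
  def build(deviceType: Result[String], isRobot: Result[Boolean]): Result[Visitor] =
    Result.map2(deviceType, isRobot)((d, r) => Visitor(d, r))
}
```

Unlike a monadic chain that stops at the first error, the applicative form reports every bad field of a row at once.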

case class Annotation(
  anchor: Anchor,
  message: String,
  badData: Option[String],
  expectedData: List[String],
  remainingData: List[String],
  level: String @@ AnnotationLevel,
  annotationId: Option[AnnotationId],
  stage: String)

case class Anchor(path: String @@ AnchorPath, typeName: String)

Message:

Levels:

Data quality is available within the output rows.

case class HVisiteur(
  visitorId: String,
  atVisitorId: Option[String],
  isRobot: Boolean,
  typeClient: String @@ TypeClient,
  typeSupport: String @@ TypeSupport,
  typeSource: String @@ TypeSource,
  hVisiteurPlus: Option[HVisiteurPlus],
  sessions: List[HSession],
  annotations: Seq[HAnnotation])

(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> lib_source, message -> NOT_IN_ENUM, type -> String @@ LibSource, level -> WARNING)),657366)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> analyseInfos.analyse_typequoi, message -> EMPTY_STRING, type -> String @@ TypeRecherche, level -> WARNING)),201930)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> isrobot, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),15)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> rechercheInfos, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),566973)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> reponseInfos.reponse_nbblocs, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),571313)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> requeteInfos.requete_typerequete, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),315297)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> analyseInfos.analyse_typequoi_sec, message -> EMPTY_STRING, type -> String @@ TypeRecherche, level -> WARNING)),201930)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> typereponse, message -> EMPTY_STRING, type -> String @@ TypeReponse, level -> WARNING)),323614)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> grp_source, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),94)
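Counts of this shape can be obtained by grouping identical annotations across output rows. A minimal sketch on plain collections, assuming a simplified `Annotation` with only the fields visible in the dump above; `AnnotationKpi` and `counts` are hypothetical names, and how the original job computes its indicators is not shown in the slides.

```scala
object AnnotationKpi {
  // Simplified annotation: only the fields that appear in the indicator keys.
  final case class Annotation(stage: String, path: String, message: String, level: String)

  /** Counts identical annotations over all output rows (one inner Seq per row). */
  def counts(rows: Seq[Seq[Annotation]]): Map[Annotation, Int] =
    rows.flatten.groupBy(identity).map { case (a, as) => (a, as.size) }
}
```

Because the annotations travel inside the rows, these indicators need no separate data quality job.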

6. Inline data quality (presented in October 2015)

There are opportunities to make those approaches more “precepte-like”.

(DAG of workflow, provenance of every field, structure tags)

7. Create real programs

Most pipelines are designed as “stateless” computations.

They either require no state (good), or infer the current state from the filesystem’s state (bad).

Solution: allow pipelines to access a commit log to read about past executions and to push data for future executions.
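A minimal sketch of the idea, assuming an in-memory, append-only log and a job that processes named partitions. `CommitLog`, `Commit`, `Log`, `pending` and `run` are hypothetical names, and a real log would be durable and shared between jobs.

```scala
object CommitLog {
  final case class Commit(job: String, partition: String, ok: Boolean)

  /** In-memory, append-only log; a real one would be durable (this is only a sketch). */
  final class Log {
    private var entries = Vector.empty[Commit]
    def append(c: Commit): Unit = { entries = entries :+ c }
    def history(job: String): Vector[Commit] = entries.filter(_.job == job)
  }

  /** Partitions still to process: those without a successful commit for this job. */
  def pending(log: Log, job: String, available: Seq[String]): Seq[String] = {
    val done = log.history(job).filter(_.ok).map(_.partition).toSet
    available.filterNot(done)
  }

  /** Processes pending partitions and records every outcome, success or failure. */
  def run(log: Log, job: String, available: Seq[String])(process: String => Boolean): Unit =
    pending(log, job, available).foreach(p => log.append(Commit(job, p, process(p))))
}
```

The job decides what to do from its own recorded history instead of guessing from which directories happen to exist.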

In progress: project codename Kerguelen

Multi-level abstractions / commit-log backed / API for jobs.

Allows the creation of jobs with different levels of concern:
Level 1: name resolving
Level 2: smart intermediaries (schema capture, stats, delta, …)
Level 3: smart high-level scheduler (replay, load management, coherence)
Level 4: “code as data” (=> continuous delivery, auto QA, auto MEP (release to production))


Thank you for listening!