ReComp: Preserving the value of large scale data analytics over time through selective re-computation

ReComp: Preserving the value of large scale data analytics over time through selective re-computation (recomp.org.uk)
Paolo Missier, Jacek Cala, Manisha Rathi
School of Computing Science, Newcastle University
Keele University, Dec. 2016
(*) Painting by Johannes Moreelse: Panta Rhei (Heraclitus)

Upload: paolo-missier

Post on 25-Jan-2017


TRANSCRIPT

Page 1: ReComp: Preserving the value of large scale data analytics over time through selective re-computation


ReComp: Preserving the value of large scale data analytics over time through selective re-computation (recomp.org.uk)

Paolo Missier, Jacek Cala, Manisha Rathi

School of Computing Science, Newcastle University

Keele University, Dec. 2016

(*) Painting by Johannes Moreelse: Panta Rhei (Heraclitus)

Page 2:

Data Science

Big Data → The Big Analytics Machine → “Valuable Knowledge”

Meta-knowledge: Algorithms, Tools, Middleware, Reference datasets

Page 3:

Data Science over time

Big Data → The Big Analytics Machine → “Valuable Knowledge” (versions V1, V2, V3 as time t passes)

Meta-knowledge: Algorithms, Tools, Middleware, Reference datasets (all evolving over time t)

Page 4:

Example: supervised learning

Training set + Background Knowledge (prior) → Model learning (classification algorithms) → Predictive classifier

When the training set is no longer representative of current data, the model loses predictive power. Ex.: the training set is a sample from a social media stream (Twitter, Instagram, …)

• Incremental training: established (neural networks, Bayes classifiers, …)
• Incremental unlearning: some established work [1]

[1] Kidera, Takuya, Seiichi Ozawa, and Shigeo Abe. “An Incremental Learning Algorithm of Ensemble Classifier Systems.” Neural Networks, 2006, 6453–59. doi:10.1109/IJCNN.2006.247345.
[2] Polikar, R., L. Upda, S.S. Upda, and V. Honavar. “Learn++: An Incremental Learning Algorithm for Supervised Neural Networks.” IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 31, no. 4 (2001): 497–508. doi:10.1109/5326.983933.
[3] Diehl, C.P., and G. Cauwenberghs. “SVM Incremental Learning, Adaptation and Optimization.” Proceedings of the International Joint Conference on Neural Networks, 2003. 4, no. x (2003): 2685–90. doi:10.1109/IJCNN.2003.1223991.
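Incremental training in the Bayes-classifier family can be illustrated with a multinomial Naive Bayes whose sufficient statistics (class and token counts) are simply updated as new labelled batches arrive. This is a minimal sketch with made-up tokens, not the ensemble algorithms of [1–3]:

```python
from collections import defaultdict
import math

class IncrementalNB:
    """Multinomial Naive Bayes whose counts are updated batch by batch,
    so the model can track a drifting stream without full retraining."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing
        self.class_counts = defaultdict(int)    # N(c)
        self.feat_counts = defaultdict(lambda: defaultdict(int))  # N(w, c)
        self.vocab = set()

    def partial_fit(self, docs, labels):
        # docs: list of token lists; labels: parallel list of class labels
        for tokens, c in zip(docs, labels):
            self.class_counts[c] += 1
            for w in tokens:
                self.feat_counts[c][w] += 1
                self.vocab.add(w)

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best, best_lp = None, -math.inf
        for c, nc in self.class_counts.items():
            lp = math.log(nc / total)
            denom = sum(self.feat_counts[c].values()) + self.alpha * len(self.vocab)
            for w in tokens:
                lp += math.log((self.feat_counts[c][w] + self.alpha) / denom)
            if lp > best_lp:
                best, best_lp = c, lp
        return best

nb = IncrementalNB()
nb.partial_fit([["cheap", "pills"], ["meeting", "agenda"]], ["spam", "ham"])
nb.partial_fit([["cheap", "offer"]], ["spam"])   # later batch: update counts only
```

The second `partial_fit` call touches only the counters, which is what makes the refresh cheap relative to retraining from scratch.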

Page 5:

Example: stream analytics

Data stream + Background Knowledge → Time series analysis (pattern recognition algorithms) → temporal patterns, activity detection, user behaviour, …

• If the output is stable over time, can I save computation and deliver older outcomes instead?
• How do I quantify the quality/cost trade-offs?

Page 6:

Analytics functions and their dependencies can be complex

Y = f(X, D)
X: inputs (vector of arbitrary data structures, “big data”)
D: vector of dependencies: libraries, reference data
Y: outputs (vector of arbitrary data structures, “knowledge”)

Ex.: machine learning using Python and scikit-learn. Learn a model to recognise an activity pattern: model training takes a training + testing dataset and a config and produces a model. Dependencies: Scikit-learn, Numpy, Pandas; Python 3; Ubuntu x.y.z on an Azure VM.

Ex.: workflow to identify mutations in a patient’s genome. “Analyse input genome” takes an input genome and a config and produces variants. Dependencies: GATK/Picard/BWA, the workflow manager (and its own dependencies), reference genome and variant DBs; WF manager on a Linux VM cluster on Azure.
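The formulation Y = f(X, D) suggests recording, for every run, the dependency vector alongside inputs and outputs; a change to one dependency then marks the runs that consumed it as candidates for re-computation. A sketch with hypothetical names and toy values:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dependency:
    name: str        # e.g. "scikit-learn", "HG19", "ClinVar"
    version: str

@dataclass
class Execution:
    """One recorded run y = f(x, D): inputs, dependency vector, output."""
    inputs: tuple
    deps: frozenset      # frozenset of Dependency values
    output: object

def stale_executions(history, changed):
    """Past executions that consumed an older version of the changed
    dependency: the candidates for selective re-computation."""
    return [e for e in history
            if any(d.name == changed.name and d.version != changed.version
                   for d in e.deps)]

deps_v1 = frozenset({Dependency("ClinVar", "2016-02"), Dependency("BWA", "0.7")})
history = [Execution(("patient-1",), deps_v1, "GREEN"),
           Execution(("patient-2",), frozenset({Dependency("BWA", "0.7")}), "AMBER")]
stale = stale_executions(history, Dependency("ClinVar", "2016-05"))
```

Only the run that actually depends on ClinVar is flagged; the other run is left alone.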

Page 7:

Complex NGS pipelines

- Alignment: aligns the sample sequence to the HG19 reference genome using the BWA aligner
- Cleaning, duplicate elimination (Picard tools)
- Recalibration: corrects for system bias on quality scores assigned by the sequencer (GATK); computes coverage of each read
- Variant calling: operates on multiple samples simultaneously; splits samples into chunks; the haplotype caller detects both SNVs and longer indels
- Variant recalibration: attempts to reduce the false positive rate from the caller
- VCF subsetting by filtering, e.g. non-exomic variants
- Annovar functional annotations (e.g. MAF, synonymy, SNPs, …) followed by in-house annotations

Page 8:

Problem size: HPC vs Cloud deployment

Configuration:
- HPC cluster (dedicated nodes): 3x 8-core compute nodes, Intel Xeon E5640 2.67 GHz CPU, 48 GiB RAM, 160 GB scratch space
- Azure workflow engines: D13 VMs with 8-core CPU, 56 GiB of memory and 400 GB SSD

[Chart: response time [hh:mm] (00:00–72:00) vs number of samples (0–24), for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores)]

Big Data:
• raw sequences for Whole Exome Sequencing (WES): 5–20 GB per patient
• processed in cohorts of 20–40, or close to 1 TB per cohort
• time required to process a 24-sample cohort can easily exceed 2 CPU months
• WES is about 2% of what Whole Genome Sequencing analyses require

Page 9:

Understanding change: threats and opportunities

Big Data → Life Sciences Analytics → “Valuable Knowledge” (versions V1, V2, V3 over time t)

Meta-knowledge: Algorithms, Tools, Middleware, Reference datasets (all evolving over time t)

• Threats: will any of the changes invalidate prior findings?
• Opportunities: can the findings from the pipelines be improved over time?
• Cost: need to model future costs based on past history and pricing trends for virtual appliances
• Impact analysis:
  • Which patients/samples are likely to be affected?
  • How do we estimate the potential benefits on affected patients?
  • Can we estimate the impact of these changes without re-computing entire cohorts?

Changes:
• Algorithms and tools
• Accuracy of input sequences
• Reference databases (HGMD, ClinVar, OMIM GeneMap, GeneCard, …)

Page 10:

ReComp

Observe change: in big data, in meta-knowledge
Assess and measure: knowledge decay
Estimate: cost and benefits of refresh
Enact: reproduce (analytics) processes

[Diagram: Big Data → Life Sciences Analytics → “Valuable Knowledge” (versions V1, V2, V3 over time t), with Meta-knowledge: Algorithms, Tools, Middleware, Reference datasets]

A decision support system for selectively re-computing complex analytics in reaction to change:
- Generic: not just for the life sciences
- Customisable: e.g. for genomics pipelines
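The four steps (observe, assess, estimate, enact) can be sketched as a generic decision loop; every callback name below is a placeholder for the problem-specific estimators discussed later, and the numbers are toy values:

```python
def recomp_loop(change_events, history, estimate_impact, estimate_cost,
                budget, rerun):
    """Generic ReComp cycle sketch: for each observed change, estimate the
    benefit of refreshing each past outcome, and re-enact only the runs
    whose expected impact justifies their cost within the budget."""
    refreshed = []
    for change in change_events:                          # 1. observe
        scored = [(estimate_impact(e, change), e) for e in history]  # 2. assess
        scored.sort(key=lambda t: t[0], reverse=True)
        for impact, e in scored:                          # 3. estimate / select
            cost = estimate_cost(e, change)
            if impact > 0 and cost <= budget:
                budget -= cost
                refreshed.append(rerun(e, change))        # 4. enact
    return refreshed, budget

# Toy usage: impact = 1 for odd-numbered runs, each rerun costs 2, budget 3
hist = [1, 2, 3, 4]
out, left = recomp_loop(["dep-update"], hist,
                        estimate_impact=lambda e, c: e % 2,
                        estimate_cost=lambda e, c: 2,
                        budget=3,
                        rerun=lambda e, c: f"rerun-{e}")
```

Only one of the two high-impact runs fits in the budget, so the loop refreshes it and stops.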

Page 11:

Challenges

1. Observability: to what extent can we observe the process and its execution?
• Process structure
• Data flow provenance

2. Detecting and quantifying changes: in inputs, dependencies, outputs → diff() functions

3. Control: how much control do we have over the system?
• Re-run: how often?
• Total vs partial execution
• Input density / resolution / incremental update
• E.g. nonmonotonic learning / unlearning

ReComp Decision Support System: takes change events, diff(.,.) functions, “business rules” and a history of past knowledge assets; produces impact and cost estimates, a reproducibility assessment, and an optimal re-computation prioritisation.

Page 12:

General ReComp problem formulation

Page 13:

Change Impact

Page 14:

Example: NGS variant interpretation

Genomics: WES / WGS, variant calling, variant interpretation → diagnosis
- E.g. 100K Genome Project, Genomics England, GeCIP

Also, metagenomics: species identification, e.g. the EBI metagenomics portal

Interpretation can help to confirm/reject a hypothesis about a patient’s phenotype. It classifies variants into three categories, RED, GREEN, AMBER: pathogenic, benign, and unknown/uncertain.

Page 15:

The SVI example

Page 16:

Change in variant interpretation

What changes:
- Improved sequencing / variant calling
- ClinVar, OMIM evolve rapidly
- New reference data sources

Page 17:

ReComp Problem Statement

1. Estimate the impact of changes

2. Optimise ReComp decisions: select the subset of the population that maximises expected impact, subject to a budget constraint

Problem: P is computationally expensive
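Selecting the subset that maximises expected impact under a budget is a knapsack-style problem. A common heuristic, used here purely as an illustration and not as the project's actual method, is to re-compute greedily by impact-per-unit-cost:

```python
def select_for_recomp(candidates, budget):
    """Greedy knapsack heuristic: candidates are (id, est_impact, est_cost)
    triples; pick highest impact-per-unit-cost first until the budget runs out."""
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    chosen, spent = [], 0.0
    for cid, impact, cost in ranked:
        if spent + cost <= budget:
            chosen.append(cid)
            spent += cost
    return chosen, spent

# Hypothetical patients with estimated impact and re-computation cost
patients = [("p1", 0.9, 2.0), ("p2", 0.1, 1.0), ("p3", 0.5, 0.5)]
chosen, spent = select_for_recomp(patients, budget=2.5)
```

The greedy ratio rule is not optimal in general, but it is a standard cheap baseline for this kind of budgeted selection.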

Page 18:

Estimators: formalisation and a possible approach

Problem: f() is computationally expensive, and so is re-evaluating it under local changes to its inputs.

Approach: learn an approximation f’() of f(): a surrogate (emulator). For simplicity, write

  y = f’(x) + ε

where ε is a stochastic term that accounts for the error in approximating f, and is typically assumed to be Gaussian.

Learning f’() requires a training set { (xi, yi) } … If f’() can be found, then we can hope to use it to approximate the effect of input changes, which can then be used to carry out sensitivity analysis (given a change, assess its effect on the output).
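The surrogate idea can be sketched end to end: fit a cheap least-squares emulator f’() to a few (pretend-expensive) evaluations of f, then run sensitivity estimates on the emulator instead of on f. The quadratic form and all values below are illustrative assumptions, not the project’s actual estimator:

```python
def solve(A, b):
    """Tiny Gaussian elimination with partial pivoting for an n x n system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_surrogate(xs, ys):
    """Least-squares quadratic emulator f'(x) = a0 + a1*x + a2*x^2,
    fitted via the normal equations from a handful of evaluations of f."""
    A = [[sum(x ** (i + j) for x in xs) for j in range(3)] for i in range(3)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    a0, a1, a2 = solve(A, b)
    return lambda x: a0 + a1 * x + a2 * x * x

# Pretend f is an expensive analytics run; train the emulator on 5 samples
f = lambda x: 3 + 2 * x + x * x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
surrogate = fit_surrogate(xs, [f(x) for x in xs])

# Cheap sensitivity estimate: central difference on the surrogate, not on f
dfdx = (surrogate(2.01) - surrogate(1.99)) / 0.02
```

Because the toy f is itself quadratic, the emulator recovers it almost exactly; in practice the residual ε would be nonzero and the emulator only trustworthy near the training samples.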

Page 19:

Scope of change

1. Change: may affect a subset of the patient population → scope. Which patients will be affected?

2. Change: affects a single patient → partial re-run

Page 20:

Challenge 1: battleships

Patient / change impact matrix.

First challenge: precisely identify the scope of a change.

Blind reaction to change: recompute the entire matrix. Can we do better? Hit the high-impact cases (the X’s) without re-computing the entire matrix.

Page 21:

SVI process: detailed design

Inputs: patient variants, phenotype hypothesis
Reference data: GeneMap, ClinVar
Steps: phenotype to genes → variant selection → variant classification → classified variants

Page 22:

Baseline: Blind recomputation

• 17 minutes / patient (single-core VM)
• Runtime consistent across different phenotypes
• Changes to GeneMap/ClinVar have negligible impact on the execution time

Run time [mm:ss] per GeneMap version:

  GeneMap version   2016-03-08   2016-04-28   2016-06-07
  μ ± σ             17:05 ± 22   17:09 ± 15   17:10 ± 17

Page 23:

Inside a single instance: Partial re-computation

Change in ClinVar
Change in GeneMap

Page 24:

White-box granular provenance

[Diagram: process P with inputs x11, x12, dependencies D11, D12, and output y11]

- Using provenance metadata to identify fragments of SVI that are affected by the change in reference data
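A minimal sketch of the provenance-based idea, with a hypothetical record structure loosely modelled on the SVI steps: tasks that read the changed reference dataset, plus everything downstream of them, form the fragment that partial re-computation must re-run.

```python
# Each provenance record: task -> (datasets it read, tasks it feeds).
# Hypothetical structure, not eScience Central's actual provenance model.
provenance = {
    "phenotype_to_genes":     {"reads": {"GeneMap"}, "feeds": {"variant_selection"}},
    "variant_selection":      {"reads": set(),       "feeds": {"variant_classification"}},
    "variant_classification": {"reads": {"ClinVar"}, "feeds": set()},
}

def affected_fragment(provenance, changed_dataset):
    """Tasks that read the changed dataset, plus everything downstream:
    the minimal fragment to re-run after a reference-data change."""
    frontier = [t for t, p in provenance.items() if changed_dataset in p["reads"]]
    affected = set()
    while frontier:
        t = frontier.pop()
        if t not in affected:
            affected.add(t)
            frontier.extend(provenance[t]["feeds"])
    return affected
```

In this toy graph a ClinVar change re-runs only the final classification step, while a GeneMap change re-runs everything downstream of the first step, mirroring the asymmetric savings measured on the next slide.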

Page 25:

Results

Run time [mm:ss] and savings after a reference-data change:

  GeneMap version   2016-04-28         2016-06-07
  μ ± σ             11:51 ± 16 (31%)   11:50 ± 20 (31%)

  ClinVar version   2016-02            2016-05
  μ ± σ             9:51 ± 14 (43%)    9:50 ± 15 (42%)

• How much can we save? It depends on the process structure and on the first usage of the reference data
• Overhead: storing interim data required in partial re-execution: 20–22 MB for GeneMap changes and 2–334 kB for ClinVar changes

Page 26:

Partial re-computation using input difference

Idea: run SVI but replace the ClinVar query with a query on a ClinVar version diff: Q(CV) → Q(diff(CV1, CV2))

This works for SVI but is hard to generalise: it depends on the type of process. The gain is bigger when diff(CV1, CV2) is much smaller than CV2.

  GeneMap versions       To-version    Difference
  (from → to)            rec. count    rec. count   Reduction
  16-03-08 → 16-06-07    15910         1458         91%
  16-03-08 → 16-04-28    15871         1386         91%
  16-04-28 → 16-06-01    15897         78           99.5%
  16-06-01 → 16-06-02    15897         2            99.99%
  16-06-02 → 16-06-07    15910         33           99.8%

  ClinVar versions       To-version    Difference
  (from → to)            rec. count    rec. count   Reduction
  15-02 → 16-05          290815        38216        87%
  15-02 → 16-02          285042        35550        88%
  16-02 → 16-05          290815        3322         98.9%
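The version diff itself can be sketched as a record-level comparison keyed by record id. The record ids and classifications below are made up for illustration; real ClinVar records are far richer:

```python
def record_diff(old, new):
    """Record-level difference between two reference-DB versions, keyed by
    record id: entries that are new or whose payload changed."""
    return {k: v for k, v in new.items() if old.get(k) != v}

# Two hypothetical ClinVar snapshots
cv1 = {"rs1": "benign", "rs2": "pathogenic", "rs3": "uncertain"}
cv2 = {"rs1": "benign", "rs2": "benign", "rs3": "uncertain", "rs4": "pathogenic"}

delta = record_diff(cv1, cv2)
reduction = 1 - len(delta) / len(cv2)   # fraction of records the query skips
```

Querying `delta` instead of `cv2` is exactly the Q(diff(CV1, CV2)) idea: the smaller the delta relative to the full snapshot, the larger the reduction.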

Page 27:

Saving resources on stream processing

[Diagram: a raw stream x1 … xk, xk+1 … x2k is split into windows W1, W2, …; baseline stream processing computes yi = P(Wi) for every window, while conditional stream processing makes a comp / noComp decision per window and may deliver an earlier output yi-h (h < i) instead of computing y’i]

- If we could predict that yi+1 will be similar to yi, we could skip computing P(Wi+1), save resources, and instead deliver yi again
- Can we make optimal comp/noComp decisions? What is required?
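The comp / noComp idea can be sketched as a loop. The drift predictor below is a deliberately crude stand-in that probes one element per window; the deck's real predictor is a forecast over the derived diff() time series:

```python
def conditional_stream(windows, P, predict_drift, threshold):
    """Conditional stream processing sketch: compute P on a window only when
    the predicted drift exceeds `threshold`; otherwise replay the previous
    output (noComp), trading a little currency for a lot of computation."""
    outputs, last, prev, computed = [], None, None, 0
    for w in windows:
        if last is None or predict_drift(prev, w) > threshold:
            last = P(w)                  # comp: pay for a fresh result
            computed += 1
        outputs.append(last)             # may be a replayed stale value
        prev = w
    return outputs, computed

# Toy example: P is the window mean; the drift predictor compares only the
# first element of consecutive windows (a hypothetical cheap probe)
windows = [[1, 1, 1], [1, 1, 2], [9, 9, 9], [9, 9, 9]]
outs, n_computed = conditional_stream(
    windows,
    P=lambda w: sum(w) / len(w),
    predict_drift=lambda prev, w: abs(w[0] - prev[0]),
    threshold=2,
)
```

Here P runs only twice over four windows: once at the start and once when the probe detects the jump.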

Page 28:

Diff and currency functions

The quality of yi is initially maximal and decreases over time in a way that depends on how rapidly the new values yj diverge from yi.

Page 29:

Measuring DeComp performance

Evaluating the performance of comp / noComp decisions on each window.

Boundary cases for the cost:
- Very conservative: DeComp computes every value
- Very optimistic: only computes the first value
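A decision sequence can be scored between these two boundary cases under an assumed cost model; the model below (computation cost plus the error of delivered stale values) and all numbers are hypothetical:

```python
def policy_score(true_values, decisions, comp_cost, error):
    """Score a comp/noComp decision sequence: total computation cost plus the
    accumulated error of delivering stale outputs (hypothetical cost model)."""
    delivered, cost, err = None, 0.0, 0.0
    for y, comp in zip(true_values, decisions):
        if comp or delivered is None:
            delivered = y            # comp: deliver a fresh, exact value
            cost += comp_cost
        err += error(delivered, y)   # noComp windows accumulate staleness error
    return cost + err

ys = [1.0, 1.1, 5.0, 5.1]
always = [True] * 4                  # very conservative boundary case
never = [True] + [False] * 3         # very optimistic: only the first value
err = lambda a, b: abs(a - b)
conservative = policy_score(ys, always, 1.0, err)   # all cost, no error
optimistic = policy_score(ys, never, 1.0, err)      # one computation, large error
```

A good comp/noComp policy should land between the two extremes, paying for fresh computation only when the staleness error would exceed it.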

Page 30:

Diff time series

Page 31:

Forecasting drift

[Diagram: the comp / noComp decision for window Wi+1 is driven by a derived time series of past diffs, with drift forecasting predicting whether y’i will diverge from the last delivered output yi-h (h < i)]

Page 32:

Initial experiments: the DEBS’15 Taxi routes challenge

• Find the most frequent / most profitable taxi routes in Manhattan within each 30’ window

Sample of the raw data (fields: VehicleId, LicenseId, pickup date, drop-off date, duration, distance, pickup lon/lat, drop-off lon/lat, payment, fare, …):

  0729...,E775...,2013-01-01 00:00:00,2013-01-01 00:02:00,120,0.44,-73.956528,40.716976,-73.962440,40.715008,CSH, 3.50, ...
  22D7...,3FF2...,2013-01-01 00:02:00,2013-01-01 00:02:00,  0,0.00,  0.000000,  0.000000,  0.000000,  0.000000,CSH,27.00, ...
  0EC2...,778C...,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.71,-73.973145,40.752827,-73.965897,40.760445,CSH, 4.00, ...
  …

Page 33:

Diff time series – taxi routes

[Diagram: the raw data stream (trip records with start/finish times st, ft and coordinates) is aggregated per window W1, W2, W3 into a routes time series of (ft, R) pairs, which in turn yields a top-k time series of routes R1 … R2k with frequencies Freq1 … Freq2k]

Page 34:

Routes drift – comparing ranked lists

P outputs a list of the top most frequent/profitable routes. To compare lists we use the generalised Kendall’s tau (Fagin et al. [1]), which quantifies how much the top-k changes between one window and the next.

Input parameters determine stability / sensitivity: K (how many routes) and the window size (e.g. 30’).

[1] Fagin, Ronald, Ravi Kumar, and D. Sivakumar. “Comparing Top K Lists.” SIAM Journal on Discrete Mathematics 17, no. 1 (January 2003): 134–60. doi:10.1137/S0895480102412856.
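A sketch of the generalised Kendall distance for top-k lists, following the case analysis in Fagin et al. [1], with a penalty parameter p for pairs whose relative order cannot be determined (p = 0 is the optimistic variant). The route names are made up:

```python
from itertools import combinations

def kendall_topk(l1, l2, p=0.0):
    """Generalised Kendall distance between two top-k lists: counts pairwise
    order disagreements, charging p for pairs whose order is undetermined."""
    r1 = {x: i for i, x in enumerate(l1)}
    r2 = {x: i for i, x in enumerate(l2)}
    dist = 0.0
    for i, j in combinations(sorted(set(l1) | set(l2), key=repr), 2):
        in1 = (i in r1, j in r1)
        in2 = (i in r2, j in r2)
        if all(in1) and all(in2):
            # both ranked in both lists: disagreement iff the order flips
            if (r1[i] - r1[j]) * (r2[i] - r2[j]) < 0:
                dist += 1
        elif all(in1) and any(in2):
            # both in l1, exactly one in l2: the absent item is implicitly last
            present, absent = (i, j) if in2[0] else (j, i)
            dist += 1 if r1[absent] < r1[present] else 0
        elif all(in2) and any(in1):
            present, absent = (i, j) if in1[0] else (j, i)
            dist += 1 if r2[absent] < r2[present] else 0
        elif all(in1) or all(in2):
            # both in one list only: their order in the other is unknown
            dist += p
        else:
            # i and j each appear in a different list only: always a miss
            dist += 1
    return dist
```

Normalising the distance (e.g. by k², its maximum for disjoint lists at p = 0) gives a drift value in [0, 1] comparable across windows, which is one plausible reading of the normalised plots on the following slides.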

Page 35:

[Plots of the drift function for top-10 routes over date range 1/Jan 00:00–15/Jan 00:00, with window sizes 2h, 1h and 30m; normalised drift values lie between 0 and 1]

Page 36:

[Plots of the drift function for top-40, top-20 and top-10 routes, window size 1h, date range 1/Jan 00:00–15/Jan 00:00]

Page 37:

Approach: ARIMA forecasting

[Plot: actual normalised drift vs ARIMA(1,0,2)[1,0,1] forecast; drift function: top-10, window size = 1h, date range = 20/Jan 00:00–25/Jan 17:00]

Drift prediction using time series forecasting:
• This is the derived diff() time series!
• Autoregressive integrated moving average (ARIMA)
• Widely used, well understood, well supported
• Fast to compute
• Assumes normality of the underlying random variable

Poor prediction means computing P too often or too rarely.
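A full ARIMA(1,0,2) fit needs a statistics library, but the flavour can be shown with a plain AR(1) model y_t = c + φ·y_{t-1} fitted by least squares. This is a simplified stand-in for the models used on the drift series, with a made-up, exactly-geometric drift sequence:

```python
def fit_ar1(series):
    """Least-squares fit of y_t = c + phi * y_{t-1}: a minimal AR(1),
    standing in for the richer ARIMA models used in practice."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    phi = cov / var
    c = my - phi * mx
    return c, phi

def forecast(series, steps, c, phi):
    """Roll the fitted recurrence forward from the last observed value."""
    out, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out

# Toy drift series decaying geometrically: y_t = 0.5 * y_{t-1}
drift = [1.0, 0.5, 0.25, 0.125, 0.0625]
c, phi = fit_ar1(drift)
preds = forecast(drift, 2, c, phi)
```

On this noiseless series the fit recovers c = 0 and φ = 0.5 exactly; with real drift data the forecast error is precisely what drives the "too often or too rarely" failure mode above.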

Page 38:

The next steps: challenges

• Can we learn effective surrogate models and estimators of change impact?
• diff() functions and estimators seem very problem-specific. To what extent can the ReComp framework be made generic and reusable, yet still useful?
• Metadata infrastructure: a DB of past execution history
• Reproducibility: what really happens when I press the “ReComp” button?

Page 39:

Summary and challenges

ReComp: a meta-process to observe and control underlying analytics processes

Page 40:

ReComp scenarios

Columns: ReComp scenario | Target impact areas | Why is ReComp relevant? | Proof-of-concept experiments | Expected optimisation

1. Dataflow, experimental science | Genomics | Rapid knowledge advances; rapid scaling up of genetic testing at population level | WES/SVI pipeline, workflow implementation (eScience Central) | Timeliness and accuracy of patient diagnosis subject to budget constraints

2. Time series analysis | Personal health monitoring; smart city analytics; IoT data streams | Rapid data drift; cost of computation at the network edge (e.g. IoT) | NYC taxi rides challenge (DEBS’15) | Use of low-power edge devices when the outcome is predictable and data drift is low

3. Data layer optimisation | Tuning of a large-scale data management stack | Optimal data organisation sensitive to current data profiles | Graph DB re-partitioning | System throughput vs cost of re-tuning

4. Model learning | Applications of predictive analytics | Predictive models are very sensitive to data drift | Twitter content analysis | Sustained model predictive power over time vs retraining cost

5. Simulation | TBD | Repeated simulation: computationally expensive but often not beneficial | Flood modelling / CityCat Newcastle | Computational resources vs marginal benefit of a new simulation model

Page 41:

Observability / transparency

Structure (static view):
- White box: dataflow (eScience Central, Taverna, VisTrails, …); scripting (R, Matlab, Python, …); function semantics
- Black box: packaged components; third-party services

Data dependencies (runtime view):
- White box: provenance recording of inputs, reference datasets, component versions, outputs
- Black box: inputs and outputs only; no data dependencies; no details on individual components

Cost:
- White box: detailed resource monitoring
- Black box: cloud £££; wall clock time; service pricing; setup time (e.g. model learning)

Page 42:

Project structure

• 3 years of funding from the EPSRC (£585,000 grant) on the Making Sense from Data call, Feb. 2016 – Jan. 2019
• 2 RAs fully employed in Newcastle
• PI: Dr. Missier, School of Computing Science, Newcastle University (30%)
• Co-Investigators (8% each):
  • Prof. Watson, School of Computing Science, Newcastle University
  • Prof. Chinnery, Department of Clinical Neurosciences, Cambridge University
  • Dr. Phil James, Civil Engineering, Newcastle University

Builds upon the experience of the Cloud-e-Genome project (2013–2015):
- A collaboration between the Institute of Genetic Medicine and the School of Computing Science at Newcastle University
- Aims: to demonstrate cost-effective workflow-based processing of NGS pipelines on the cloud, and to facilitate the adoption of reliable genetic testing in clinical practice
- Funding: NIHR / Newcastle BRC (£180,000) plus a $40,000 Microsoft Research grant, “Azure for Research”