ReComp: Preserving the value of large-scale data analytics over time through selective re-computation
TRANSCRIPT
ReComp – Keele University, Dec. 2016 – P. Missier
ReComp: Preserving the value of large-scale data analytics over time through selective re-computation
recomp.org.uk
Paolo Missier, Jacek Cala, Manisha Rathi
School of Computing Science, Newcastle University
Keele University, Dec. 2016

Panta Rhei (Heraclitus) (*)
(*) Painting by Johannes Moreelse
Data Science

[Diagram] Big Data → The Big Analytics Machine → "Valuable Knowledge"
Meta-knowledge: algorithms, tools, middleware, reference datasets
Data Science over time

[Diagram] Big Data → The Big Analytics Machine → "Valuable Knowledge", versions V1, V2, V3 produced over time t
Meta-knowledge: algorithms, tools, middleware, reference datasets (all evolving over time t)
Example: supervised learning

[Diagram] Training set + classification algorithms + background knowledge (prior) → model learning → predictive classifier

When the training set is no longer representative of current data, the model loses predictive power. Ex.: the training set is a sample from a social media stream (Twitter, Instagram, …).
• Incremental training: established (neural networks, Bayes classifiers, …)
• Incremental unlearning: some established work [1]
[1] Kidera, Takuya, Seiichi Ozawa, and Shigeo Abe. "An Incremental Learning Algorithm of Ensemble Classifier Systems." Neural Networks, 2006, 6453–59. doi:10.1109/IJCNN.2006.247345.
[2] Polikar, R., L. Upda, S.S. Upda, and V. Honavar. "Learn++: An Incremental Learning Algorithm for Supervised Neural Networks." IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 31, no. 4 (2001): 497–508. doi:10.1109/5326.983933.
[3] Diehl, C.P., and G. Cauwenberghs. "SVM Incremental Learning, Adaptation and Optimization." Proceedings of the International Joint Conference on Neural Networks 4 (2003): 2685–90. doi:10.1109/IJCNN.2003.1223991.
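The incremental-training idea can be made concrete with a toy sketch (not the cited algorithms): a nearest-centroid classifier whose per-class means are updated in place with each new batch, so the model tracks a drifting stream without full retraining.

```python
from collections import defaultdict

class IncrementalCentroidClassifier:
    """Toy incremental learner: keeps a running mean vector per class.
    Each partial_fit call folds a new batch into the running means."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.means = {}  # class label -> running mean vector

    def partial_fit(self, X, y):
        for x, label in zip(X, y):
            n = self.counts[label] + 1
            self.counts[label] = n
            if label not in self.means:
                self.means[label] = list(x)
            else:
                m = self.means[label]
                for j, xj in enumerate(x):
                    m[j] += (xj - m[j]) / n  # running-mean update

    def predict(self, x):
        # nearest class centroid by squared Euclidean distance
        return min(self.means,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(x, self.means[c])))

clf = IncrementalCentroidClassifier()
clf.partial_fit([[0.0, 0.0], [0.2, 0.1]], ["low", "low"])
clf.partial_fit([[5.0, 5.0], [4.8, 5.1]], ["high", "high"])
```

Real incremental learners (e.g. incremental Bayes or the ensemble methods of [1]–[3]) follow the same pattern: a cheap per-batch update instead of a full re-fit.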
Example: stream analytics

[Diagram] Data stream + pattern-recognition algorithms + background knowledge → time-series analysis → temporal patterns, activity detection, user behaviour, …
• If the output is stable over time, can I save computation and deliver older outcomes instead?
• How do I quantify the quality/cost trade-offs?
Analytics functions and their dependencies can be complex

Y = f(X, D)
- X: inputs (vector of arbitrary data structures, "big data")
- D: vector of dependencies (libraries, reference data)
- Y: outputs (vector of arbitrary data structures, "knowledge")
Ex.: machine learning using Python and scikit-learn
- Task: learn a model to recognise an activity pattern
- Process: model training (training + testing dataset, config) → model
- Dependencies: scikit-learn, NumPy, pandas; Python 3; Ubuntu x.y.z on an Azure VM
Ex.: workflow to identify mutations in a patient's genome
- Task: analyse input genome variants
- Inputs: workflow specification, input genome, config, reference genome, variants DBs
- Dependencies: GATK/Picard/BWA, workflow manager (and its own dependencies), WF manager on a Linux VM cluster on Azure, Ubuntu on Azure
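One concrete way to track the dependency vector D is to fingerprint every input and dependency version together; this sketch is illustrative (names and versions are invented, not the ReComp implementation):

```python
import hashlib
import json

def execution_key(inputs: dict, dependencies: dict) -> str:
    """Fingerprint of (X, D) for Y = f(X, D): if any input or any
    dependency version changes, the key changes and any cached Y
    computed under the old key is potentially stale."""
    payload = json.dumps({"X": inputs, "D": dependencies}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# hypothetical example: the same patient genome under two ClinVar releases
k1 = execution_key({"genome": "patient42.vcf"},
                   {"gatk": "3.5", "ref_genome": "HG19", "clinvar": "2016-02"})
k2 = execution_key({"genome": "patient42.vcf"},
                   {"gatk": "3.5", "ref_genome": "HG19", "clinvar": "2016-05"})
```

Detecting that k1 != k2 is the trivial part; ReComp's question is whether the change behind it is worth a re-run.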
Complex NGS pipelines

- Alignment: aligns the sample sequence to the HG19 reference genome using the BWA aligner
- Cleaning and duplicate elimination (Picard tools)
- Recalibration: corrects for systematic bias in the quality scores assigned by the sequencer (GATK)
- Coverage: computes the coverage of each read
- VCF subsetting by filtering, e.g. non-exomic variants
- Annovar functional annotations (e.g. MAF, synonymy, SNPs, …), followed by in-house annotations
- Variant calling: operates on multiple samples simultaneously, splitting them into chunks; the haplotype caller detects both SNVs and longer indels
- Variant recalibration: attempts to reduce the false-positive rate from the caller
ReC
omp
– K
eele
Uni
vers
ityD
ec. 2
016
– P.
Mis
sier
8
Problem size: HPC vs Cloud deployment
Configuration:
- HPC cluster (dedicated nodes): 3x 8-core compute nodes, Intel Xeon E5640 2.67 GHz CPU, 48 GiB RAM, 160 GB scratch space
- Azure workflow engines: D13 VMs with an 8-core CPU, 56 GiB of memory and a 400 GB SSD
[Chart] Response time [hh:mm] (00:00–72:00) vs number of samples (0–24), for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores)
Big Data:
• raw sequences for Whole Exome Sequencing (WES): 5–20 GB per patient
• processed in cohorts of 20–40, or close to 1 TB per cohort
• the time required to process a 24-sample cohort can easily exceed 2 CPU-months
• WES is about 2% of what Whole Genome Sequencing analyses require
Understanding change: threats and opportunities

[Diagram] Big Data → Life Sciences Analytics → "Valuable Knowledge", versions V1, V2, V3 produced over time t
Meta-knowledge: algorithms, tools, middleware, reference datasets (all evolving over time t)
• Threats: will any of the changes invalidate prior findings?
• Opportunities: can the findings from the pipelines be improved over time?
• Cost: need to model future costs based on past history and pricing trends for virtual appliances
• Impact analysis:
  • Which patients/samples are likely to be affected?
  • How do we estimate the potential benefits for affected patients?
  • Can we estimate the impact of these changes without re-computing entire cohorts?

Changes:
• Algorithms and tools
• Accuracy of input sequences
• Reference databases (HGMD, ClinVar, OMIM GeneMap, GeneCard, …)
ReComp

A decision support system for selectively re-computing complex analytics in reaction to change:
• Observe change: in big data, in meta-knowledge
• Assess and measure: knowledge decay
• Estimate: cost and benefits of refresh
• Enact: reproduce (analytics) processes

- Generic: not just for the life sciences
- Customisable: e.g. for genomics pipelines
Challenges

1. Observability: to what extent can we observe the process and its execution?
• Process structure
• Data-flow provenance

2. Detecting and quantifying changes:
• In inputs, dependencies, outputs → diff() functions

3. Control: how much control do we have over the system?
• Re-run: how often?
• Total vs partial execution
• Input density / resolution / incremental update
• E.g. non-monotonic learning / unlearning

[Diagram] ReComp Decision Support System: change events, diff(.,.) functions, "business rules" and a history of past knowledge assets feed impact and cost estimates, reproducibility assessment, and optimal re-computation prioritisation
General ReComp problem formulation
Change Impact
Example: NGS variant interpretation

Genomics: WES/WGS, variant calling, variant interpretation → diagnosis
- E.g. 100K Genome Project, Genomics England, GeCIP
Also, metagenomics: species identification, e.g. the EBI metagenomics portal

Variant interpretation can help to confirm or reject a hypothesis about a patient's phenotype. It classifies variants into three categories: RED (pathogenic), GREEN (benign) and AMBER (unknown/uncertain).
The SVI example
Change in variant interpretation

What changes:
- Improved sequencing / variant calling
- ClinVar, OMIM evolve rapidly
- New reference data sources
ReComp Problem Statement

1. Estimate the impact of changes
2. Optimise ReComp decisions: select the subset of the population that maximises expected impact, subject to a budget constraint

Problem: P is computationally expensive
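Step 2 is a budgeted selection problem. A minimal greedy sketch (a knapsack-style heuristic, not the project's actual optimiser; the candidate data is invented) ranks candidates by estimated impact per unit cost:

```python
def select_for_recomp(candidates, budget):
    """Greedy budgeted selection. candidates: (id, est_impact, cost)
    triples. Pick the highest impact-per-cost first until the budget
    is exhausted; returns the chosen ids and the total cost spent."""
    chosen, spent = [], 0.0
    for cid, impact, cost in sorted(candidates,
                                    key=lambda c: c[1] / c[2],
                                    reverse=True):
        if spent + cost <= budget:
            chosen.append(cid)
            spent += cost
    return chosen, spent

# hypothetical patients: (id, estimated impact, re-computation cost)
patients = [("p1", 0.9, 2.0), ("p2", 0.1, 1.0),
            ("p3", 0.7, 1.0), ("p4", 0.4, 2.0)]
chosen, spent = select_for_recomp(patients, budget=3.0)
```

The greedy ratio rule is only near-optimal for knapsack-style problems, but it makes the cost/benefit trade-off explicit.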
Estimators: formalisation and a possible approach

Problem: f() is computationally expensive. (For simplicity, consider local changes.)

Approach: learn an approximation f'() of f(): a surrogate (emulator),

  y = f'(x) + ε

where ε is a stochastic term that accounts for the error in approximating f, and is typically assumed to be Gaussian.

Sensitivity analysis: given a change in the inputs, assess its effect on the output.

Learning f'() requires a training set { (x_i, y_i) }. If f'() can be found, then we can hope to use it to approximate the effect of such changes cheaply, which can then be used to carry out sensitivity analysis.
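The surrogate idea can be illustrated with a one-dimensional least-squares fit: sample the expensive f at a few points, fit a cheap f', and use f' to probe the effect of a local change without re-running f. This is purely illustrative (real emulators are typically Gaussian processes; the expensive_f here is a stand-in):

```python
def expensive_f(x):
    # stand-in for a costly analytics process
    return 3.0 * x + 1.0

# training set {(x_i, y_i)} from a few expensive runs
xs = [0.0, 1.0, 2.0, 3.0]
ys = [expensive_f(x) for x in xs]

# least-squares linear surrogate f'(x) = a*x + b
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
    / sum((x - mx) ** 2 for x in xs)
b = my - a * mx
surrogate = lambda x: a * x + b

# sensitivity: estimated effect of a local change dx, no re-run of f
dx = 0.1
dy_est = surrogate(1.0 + dx) - surrogate(1.0)
```

Here the surrogate recovers f exactly because f is linear; in general the residual ε quantifies how far the emulator can be trusted.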
Scope of change

1. Change: may affect a subset of the patient population (population scope). Which patients will be affected?
2. Change: affects a single patient → partial re-run
Challenge 1: battleships

Patient / change impact matrix

First challenge: precisely identify the scope of a change.
Blind reaction to change: recompute the entire matrix.
Can we do better? Hit the high-impact cases (the X's) without re-computing the entire matrix.
SVI process: detailed design

[Diagram] Phenotype hypothesis → phenotype-to-genes (uses GeneMap) → variant selection (uses patient variants) → variant classification (uses ClinVar) → classified variants
Baseline: blind recomputation

• 17 minutes / patient (single-core VM)
• Runtime consistent across different phenotypes
• Changes to GeneMap/ClinVar have negligible impact on the execution time

Run time [mm:ss], μ ± σ, per GeneMap version:
  2016-03-08: 17:05 ± 22    2016-04-28: 17:09 ± 15    2016-06-07: 17:10 ± 17
Inside a single instance: partial re-computation

Change in ClinVar / change in GeneMap
White-box granular provenance

[Diagram] Process P with inputs x11, x12, dependencies D11, D12 and output y11

- Using provenance metadata to identify the fragments of SVI that are affected by a change in the reference data
Results

Partial re-execution run times (μ ± σ, [mm:ss]) and savings over the blind baseline:
  GeneMap version 2016-04-28: 11:51 ± 16 (31% saving); 2016-06-07: 11:50 ± 20 (31% saving)
  ClinVar version 2016-02: 9:51 ± 14 (43% saving); 2016-05: 9:50 ± 15 (42% saving)

• How much can we save? Depends on the process structure and on the first usage of the reference data
• Overhead: storing the interim data required for partial re-execution; 20–22 MB for GeneMap changes and 2–334 kB for ClinVar changes
Partial re-computation using input difference

Idea: run SVI but replace the ClinVar query with a query on the ClinVar version diff: Q(CV) → Q(diff(CV1, CV2))
Works for SVI, but hard to generalise: it depends on the type of process.
Bigger gain: diff(CV1, CV2) is much smaller than CV2.

GeneMap versions (from → to)   To-version rec. count   Difference rec. count   Reduction
16-03-08 → 16-06-07            15910                   1458                    91%
16-03-08 → 16-04-28            15871                   1386                    91%
16-04-28 → 16-06-01            15897                   78                      99.5%
16-06-01 → 16-06-02            15897                   2                       99.99%
16-06-02 → 16-06-07            15910                   33                      99.8%

ClinVar versions (from → to)   To-version rec. count   Difference rec. count   Reduction
15-02 → 16-05                  290815                  38216                   87%
15-02 → 16-02                  285042                  35550                   88%
16-02 → 16-05                  290815                  3322                    98.9%
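The version-diff trick above can be sketched as a record-level comparison of two releases keyed by record id (the record ids and values below are invented, not the actual ClinVar schema):

```python
def db_diff(old, new):
    """Record-level diff of two reference-DB versions, each a dict
    keyed by record id. Returns only the added, removed or changed
    records: the records a partial re-run actually needs to look at."""
    added   = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: new[k] for k in old.keys() & new.keys()
               if old[k] != new[k]}
    return {**added, **removed, **changed}

# hypothetical two releases of a variant-classification table
cv1 = {"rs1": "benign", "rs2": "pathogenic", "rs3": "uncertain"}
cv2 = {"rs1": "benign", "rs2": "benign", "rs4": "pathogenic"}
delta = db_diff(cv1, cv2)
```

Querying delta instead of the full new release is what yields the 87–99.99% record-count reductions reported above.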
Saving resources on stream processing

[Diagram] Baseline stream processing: the raw stream x1, x2, …, xk, xk+1, …, x2k is split into windows W1, W2, …, and P computes y1 = P(W1), y2 = P(W2), …
[Diagram] Conditional stream processing: on window Wi+1, a comp/noComp decision either runs P to produce y'i or redelivers an earlier output yi-h (h < i)

- If we could predict that yi+1 will be similar to yi, we could skip computing P(Wi+1), save resources, and deliver yi again instead
- Can we make optimal comp/noComp decisions? What is required?
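The conditional-processing loop can be sketched as follows. In this toy the drift function peeks at the new window directly; in ReComp it would be a cheap forecast, which is exactly what makes the decision non-trivial:

```python
def conditional_stream(windows, process, drift, threshold):
    """Compute P only when the estimated drift since the last
    *computed* window exceeds the threshold; otherwise redeliver
    the stale output y_{i-h} instead of running P."""
    outputs, last_w, last_y = [], None, None
    for w in windows:
        if last_w is None or drift(last_w, w) > threshold:
            last_y, last_w = process(w), w   # comp: run P(W_i)
        outputs.append(last_y)               # deliver fresh or stale y
    return outputs

# toy run: P is the window mean; drift is the difference of means
process = lambda w: sum(w) / len(w)
drift = lambda w1, w2: abs(sum(w1) / len(w1) - sum(w2) / len(w2))
windows = [[1, 1], [1, 1], [1, 1], [9, 9], [9, 9]]
outs = conditional_stream(windows, process, drift, threshold=0.5)
```

On this input P runs only twice (windows 1 and 4); the other three outputs are redelivered.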
Diff and currency functions

The quality of yi is initially maximal and decreases over time, in a way that depends on how rapidly the new values yj diverge from yi.
Measuring DeComp performance

Evaluating the performance of the comp/noComp decisions on each window, in terms of cost.

Boundary cases:
- Very conservative: DeComp computes every value
- Very optimistic: DeComp only computes the first value
Diff time series
Forecasting drift

[Diagram] The comp/noComp decision on window Wi+1 is driven by a derived time series of diff values (y'i vs yi-h, h < i), which is used for drift forecasting
Initial experiments: the DEBS'15 taxi routes challenge

• Find the most frequent / most profitable taxi routes in Manhattan within each 30' window

Sample of the input data:
VehicId,LicId,Pickup date,Drop off date,Dur,Dist,PickupLon,PickupLat,DropoffLon,DropoffLat,Pay,Fare$, ...
0729...,E775...,2013-01-01 00:00:00,2013-01-01 00:02:00,120,0.44,-73.956528,40.716976,-73.962440,40.715008,CSH, 3.50, ...
22D7...,3FF2...,2013-01-01 00:02:00,2013-01-01 00:02:00,  0,0.00,  0.000000,  0.000000,  0.000000,  0.000000,CSH,27.00, ...
0EC2...,778C...,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.71,-73.973145,40.752827,-73.965897,40.760445,CSH, 4.00, ...
1390...,BE31...,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.48,-74.004173,40.720947,-74.003838,40.726189,CSH, 4.00, ...
3B41...,7077...,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.61,-73.987373,40.724861,-73.983772,40.730995,CRD, 4.00, ...
5FAA...,00B7...,2013-01-01 00:02:00,2013-01-01 00:03:00, 60,0.00,  0.000000,  0.000000,  0.000000,  0.000000,CRD, 2.50, ...
DFBF...,CF86...,2013-01-01 00:02:00,2013-01-01 00:03:00, 60,0.39,-73.981544,40.781475,-73.979439,40.784386,CRD, 3.00, ...
1E5F...,E0B2...,2013-01-01 00:03:00,2013-01-01 00:04:00, 60,0.00,-73.993973,40.751266,  0.000000,  0.000000,CSH, 2.50, ...
4682...,BB89...,2013-01-01 00:00:00,2013-01-01 00:04:00,240,1.71,-73.955383,40.779728,-73.967758,40.760326,CSH, 6.50, ...
5F78...,B756...,2013-01-01 00:00:00,2013-01-01 00:04:00,240,1.21,-73.973000,40.793140,-73.981453,40.778465,CRD, 6.00, ...
6BA2...,ED36...,2013-01-01 00:01:00,2013-01-01 00:04:00,180,0.74,-73.971138,40.758980,-73.972206,40.752502,CRD, 4.50, ...
75C9...,00B7...,2013-01-01 00:03:00,2013-01-01 00:04:00, 60,0.00,  0.000000,  0.000000,  0.000000,  0.000000,CRD, 3.00, ...
C306...,E255...,2013-01-01 00:01:00,2013-01-01 00:04:00,180,0.84,-73.942841,40.797031,-73.934540,40.797314,CSH, 4.50, ...
C4D6...,95B5...,2013-01-01 00:03:00,2013-01-01 00:04:00, 60,0.00,-73.989189,40.721924,  0.000000,  0.000000,CSH, 2.50, ...
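Finding the most frequent routes per window reduces, in sketch form, to bucketing trips by window and counting. The route encoding below is simplified (named cells instead of the challenge's actual coordinate grid):

```python
from collections import Counter

def top_k_routes(trips, window_minutes=30, k=3):
    """trips: (pickup_minute, pickup_cell, dropoff_cell) tuples.
    Returns {window_index: [(route, count), ...]} with the k most
    frequent routes in each time window."""
    windows = {}
    for minute, pick, drop in trips:
        w = minute // window_minutes
        windows.setdefault(w, Counter())[(pick, drop)] += 1
    return {w: c.most_common(k) for w, c in windows.items()}

# hypothetical trips spanning two 30-minute windows
trips = [(0, "A", "B"), (5, "A", "B"), (10, "C", "D"),
         (31, "C", "D"), (40, "C", "D"), (42, "A", "B")]
ranking = top_k_routes(trips, k=2)
```

The per-window ranked lists this produces are exactly the top-k time series compared on the next slides.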
Diff time series – taxi routes

[Diagram] Three derived time series, computed per window W1, W2, W3, …:
- Raw data stream: per trip, (start time st, finish time ft, pickup x/y, dropoff x/y)
- Routes time series: (finish time, route) pairs, e.g. (ft1, R1), (ft2, R2), …
- Top-k time series: per-window ranked lists (R1: Freq1, R2: Freq2, …, Rk: Freqk, Rk+1: Freqk+1, …)
Routes drift – comparing ranked lists

P outputs a list of the top most frequent / most profitable routes. To compare lists we use the generalised Kendall's tau (Fagin et al. [1]), which quantifies how much the top-k changes between one window and the next.

Input parameters determine stability / sensitivity: K (how many routes) and the window size (e.g. 30').

[1] Fagin, Ronald, Ravi Kumar, and D. Sivakumar. "Comparing Top K Lists." SIAM Journal on Discrete Mathematics 17, no. 1 (January 2003): 134–60. doi:10.1137/S0895480102412856.
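The generalised Kendall's tau for top-k lists can be sketched as follows, after Fagin et al.'s case analysis: items absent from a list are treated as ranked below all its members, and pairs that neither list resolves contribute a penalty parameter p (a sketch, not the paper's reference implementation):

```python
from itertools import combinations

def kendall_topk(l1, l2, p=0.5):
    """Generalised Kendall's tau distance between two top-k lists."""
    pos1 = {x: i for i, x in enumerate(l1)}
    pos2 = {x: i for i, x in enumerate(l2)}
    dist = 0.0
    for a, b in combinations(set(l1) | set(l2), 2):
        a1, b1, a2, b2 = a in pos1, b in pos1, a in pos2, b in pos2
        if a1 and b1 and a2 and b2:
            # both lists rank both items: penalty 1 if they disagree
            if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0:
                dist += 1
        elif a1 and b1:
            # l1 ranks both; l2 contains at most one of the pair
            present, other = (a, b) if a2 else (b, a)
            if a2 or b2:
                # l2 implicitly ranks `present` above `other`
                if pos1[other] < pos1[present]:
                    dist += 1
            else:
                dist += p  # neither item in l2: unresolved pair
        elif a2 and b2:
            present, other = (a, b) if a1 else (b, a)
            if a1 or b1:
                if pos2[other] < pos2[present]:
                    dist += 1
            else:
                dist += p
        else:
            # each item appears in exactly one (different) list:
            # the lists implicitly disagree on the pair's order
            dist += 1
    return dist

d = kendall_topk(["R1", "R2", "R3"], ["R2", "R1", "R4"])
```

Here d = 2.0: the swap of R1/R2 costs 1, and the R3-vs-R4 disagreement costs 1. Normalising d per window gives the drift series plotted on the following slides.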
[Plots] Drift function: top-10; window sizes 2h, 1h and 30m; date range [1/Jan 00:00–15/Jan 00:00); normalised drift values in [0, 1.2]
[Plots] Drift function: top-40, top-20 and top-10; window size 1h; date range [1/Jan 00:00–15/Jan 00:00); normalised drift values in [0, 1.2]
Approach: ARIMA forecasting

[Plot] Actual normalised drift vs ARIMA(1,0,2)[1,0,1] forecast. Drift function: top-10, window size = 1h, date range = [20/Jan 00:00–25/Jan 17:00); new-day markers shown

Drift prediction using time-series forecasting:
• This is the derived diff() time series!
• Autoregressive integrated moving average (ARIMA)
• Widely used, well understood and well supported
• Fast to compute
• Assumes normality of the underlying random variable

Poor prediction means we compute P too often or too rarely.
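Full ARIMA fitting is best left to a statistics library; the essence of the forecast step can be illustrated with a pure-Python AR(1) model, the simplest member of the ARIMA family (shown only to make the mechanics concrete, not the experiment's actual model):

```python
def fit_ar1(series):
    """Fit y_t = c + phi * y_{t-1} by ordinary least squares on
    consecutive pairs of the drift time series."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var = sum((xi - mx) ** 2 for xi in x)
    phi = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / var
    return my - phi * mx, phi  # (c, phi)

def forecast_next(series):
    """One-step-ahead drift forecast, driving the comp/noComp decision."""
    c, phi = fit_ar1(series)
    return c + phi * series[-1]

# alternating drift values: AR(1) captures the oscillation exactly
drift = [0.1, 0.9, 0.1, 0.9, 0.1, 0.9, 0.1]
nxt = forecast_next(drift)
```

Comparing the forecast against a threshold decides whether the next window is worth computing; a poor forecast, as the slide notes, means computing P too often or too rarely.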
The next steps – challenges

• Can we learn effective surrogate models and estimators of change impact?
• diff() functions and estimators seem very problem-specific. To what extent can the ReComp framework be made generic and reusable, yet still useful?
• Metadata infrastructure: a DB of past execution history
• Reproducibility: what really happens when I press the "ReComp" button?
Summary and challenges

ReComp: a meta-process to observe and control underlying analytics processes
ReComp scenarios

Scenario: dataflow, experimental science
- Target impact areas: genomics
- Why ReComp is relevant: rapid knowledge advances; rapid scaling-up of genetic testing at population level
- Proof-of-concept experiments: WES/SVI pipeline, workflow implementation (eScience Central)
- Expected optimisation: timeliness and accuracy of patient diagnosis, subject to budget constraints

Scenario: time-series analysis
- Target impact areas: personal health monitoring, smart-city analytics, IoT data streams
- Why ReComp is relevant: rapid data drift; cost of computation at the network edge (e.g. IoT)
- Proof-of-concept experiments: NYC taxi rides challenge (DEBS'15)
- Expected optimisation: use of low-power edge devices when the outcome is predictable and data drift is low

Scenario: data layer optimisation
- Target impact areas: tuning of a large-scale data management stack
- Why ReComp is relevant: optimal data organisation is sensitive to current data profiles
- Proof-of-concept experiments: graph DB re-partitioning
- Expected optimisation: system throughput vs cost of re-tuning

Scenario: model learning
- Target impact areas: applications of predictive analytics
- Why ReComp is relevant: predictive models are very sensitive to data drift
- Proof-of-concept experiments: Twitter content analysis
- Expected optimisation: sustained model predictive power over time vs retraining cost

Scenario: simulation
- Target impact areas: TBD
- Why ReComp is relevant: repeated simulation is computationally expensive but often not beneficial
- Proof-of-concept experiments: flood modelling / CityCat Newcastle
- Expected optimisation: computational resources vs marginal benefit of a new simulation model
Observability / transparency

White box:
- Structure (static view): dataflow systems (eScience Central, Taverna, VisTrails, …); scripting (R, Matlab, Python, …); function semantics
- Data dependencies (runtime view): provenance recording of inputs, reference datasets, component versions and outputs
- Cost: detailed resource monitoring

Black box:
- Structure (static view): packaged components, third-party services
- Data dependencies (runtime view): inputs and outputs only; no data dependencies; no details on individual components
- Cost: cloud £££, wall-clock time, service pricing, setup time (e.g. model learning)
Project structure

• 3 years of funding from the EPSRC (£585,000 grant) under the "Making Sense from Data" call
• Feb. 2016 – Jan. 2019
• 2 RAs fully employed at Newcastle
• PI: Dr. Missier, School of Computing Science, Newcastle University (30%)
• Co-investigators (8% each):
  • Prof. Watson, School of Computing Science, Newcastle University
  • Prof. Chinnery, Department of Clinical Neurosciences, Cambridge University
  • Dr. Phil James, Civil Engineering, Newcastle University

Builds upon the experience of the Cloud-e-Genome project (2013–2015):
- Aims: to demonstrate cost-effective workflow-based processing of NGS pipelines on the cloud, and to facilitate the adoption of reliable genetic testing in clinical practice
- A collaboration between the Institute of Genetic Medicine and the School of Computing Science at Newcastle University
- Funding: NIHR / Newcastle BRC (£180,000) plus a $40,000 Microsoft Research "Azure for Research" grant