systems analysis of innate immune mechanisms in infection – a role for hpc peter ghazal
TRANSCRIPT
Systems analysis of innate immune mechanisms in infection – a role for HPC
Peter Ghazal
What is Pathway Biology?
Pathway biology is….
A systems biology approach for understanding a biological process
- empirically by functional association of
multiple gene products & metabolites
- computationally by defining networks of cause-effect relationships.
Pathway Models link molecular; cellular; whole organism levels.
FORMAL MODELS --- ALLOW PREDICTING the outcome of Costly or Intractable Experiments
Focus and outline of talk
• High through-put approaches to mapping and understanding host-response to infection.
• Targeting the host NOT the “bug” as anti-infective strategy
• Making HPC more accessible: SPRINT a new framework for high dimensional biostatistic computation
Story starts at the bed side
Differentially expressed genes Differentially expressed genes in neonatesin neonatescontrol vs Infected (FDR p>1x10control vs Infected (FDR p>1x10-5-5, FC±4), FC±4)
Dealing with HTP data: Impact of data variability
Replicate
Patient where,
j
iby ijiij
),0(~ 2bi Nb ),0(~ 2
Nij
222
variationTechnical
variationBiological
variationTotal
b
• Model for introducing biological and technical variation:
Modelling patient variability and biomarkers for classification
Machine Learning methods:
Random Forest (RF)
Support Vector Machine (SVM)
Linear Discriminant Analysis (LDA)
K-Nearest Neighbour (K-NN)
How different data characteristics affect the misclassification errors?
Factors investigated:Data variability (biological and technical variations)
Training set sizeNumber of replications
Correlation between RNA biomarkers
Mizanur Khondoker
Error rate vs. (number of biomarkers, total variation) An example of a simulation model to quantify number of biomarkers
and level of patient variability
Conclusions from simulations• There is increased predictive value using multiple markers – although there is
no magic number that can be recommended as optimal in all situations.
• Optimal number greatly depends on the data under study.
• The important determining factors of optimal number of biomarkers are:
• The degree of differential expression (fold-change, p-values etc.)
• Amount of biological and technical variation in the data.
• The size of the training set upon which the classifier is to be built.
• The number of replication for each biomarkers.
• The degree of correlation between biomarkers.
• Now possible to predict optimal number through simulation.
Rule of five: Criteria for pathogenesis based biomarkers
• Readily accessible• Multiple markers• Appropriately powered statistical association
• Physiological relevance• Causally linked to phenotype
Key challenge is mapping biomarkers into: biological context and understanding
Requires an experimental model system
PluripotentStem Cell
MyeloidStem Cell
“Activated” Cytolytic Macrophage
Primed Macrophage
Resident Macrophage(immature)
ActivatedT-Lymphocyte
Promonocyte(Primary Signal)
InflammationIFN-gamma
(Secondary Signal)Endotoxin, IFN-gamma
Lymphokines
?
Monocyte?
Transcriptional profile of MΦ activated by Ifng
How do we tackle this?
PATHWAY BIOLOGYLiteratureData-mining
ModellingNetwork analysis
Experimentationgenetic screens
microarraysY2H
mechanism basedstudies
A sub-system study of cause effect relationships with a defined start (input) and end (output).
Mapping new nodes
PATHWAY BIOLOGYLiteratureData-mining
Experimentation
Transcriptional profile of MΦinfected with CMV
Hypothesis generation
• Blue zone vs red zone
Down regulation of sterol pathway
BUT… recorded changes are small – Do they have any
effect?
Next step modelling
PATHWAY BIOLOGY
Pure and applied modellingNetwork inference analysis
Experimentaldata
Workflow
ODE model
Literature derived model
Known parameters
Unknown parameters
Vary parametersby an order of
magnitude
Order of magnitudeestimation
Ensemble of ODE models Results
Ensembleaverage
Modelling
Where available, parameters obtained from the Brenda enzyme database
http://www.brenda-enzymes.info/
Cholesterol Synthesis
ODE model, Michaelis-Menten interactions• 57 Parameters
• 25 Known Parameters• 32 Unknown Parameters
Algorithm• Using the first three time points, calculate an equilibrium state
• Release model from equilibrium and simulate using enzyme data• For each unknown, consider this model across 3 orders of magnitude,
holding the other unknowns parameters fixed.
Cholesterol (output of sterol pathway) results from simulation and expts
Cholesterol rate/flux Cholesterol levels
Predictions: Experiments:
Lipidomic – mass spec results
• Infection down regulate cholesterol biosynthesis pathway and free intra-cellular cholesterol.
• Can now predict the behaviour of the pathway.
• But?• Just as a good as UK (Met Office) weather predictions……
because……
Scalability issues related to increased complexity
• Increasing complexity and size of biological data
• Solution: High Performance Computing (HPC)?
HPC for High Throughput Post-Genomic Data
Problems with large biological data sets
– Volume of data• Many research groups can now routinely generate high volumes of data
– Memory (RAM) handling:• Input data size is too big• Algorithms cause linear, exponential or other growth in
data volume
– CPU performance:• Routine analyses take too long
Limitation examples: Clustering
• Gene clustering using R on a high-spec workstation:– 16,000 genes, k=12 gene clusters runs for ~30min– 16,000 genes, k=40 gene clusters runs for ~10hrs
Partitioning-Around-Medoids, n genes, k=12 clusters requested
Memory fail limit
Outcome: Adverse effect on research
• Arbitrary size reduction of input data
• Batch processing of data
• Analyses in smaller steps
• Avoidance of some algorithms
• Failure to analyse
Solution: High Performance Computing
• HPC takes many forms:– clusters, networks, supercomputer, grid, GPUs, “cloud”, ...
• Provides more computational power
• HPC is technically accessible for most: – Department own, Eddie, HECToR,...
However!
HPC Access Hurdles
• Cost of access
• Time to adapt
• Complex, require specialist skills • Consultancy (e.g. EPCC) only feasible on
ad-hoc basis, not routinely
HPC Access Hurdles
HPC is (currently) optimal for:- Specific problems that can be tackled as a project- Individuals who are familiar with parallelisation and
system architectures
HPC is not optimal for:- Routine/casual analyses of high-throughput data- Ad-hoc and ever-changing analyses algorithms- Data analysts without time or knowledge to sidestep
into parallelisation software/hardware.
Need a step change (up!) to broaden HPC
access to all biologists
Challenge two fold!!
• Provide a generic solution
• Easy to use
SPRINT
Post Genomic
Data
R
Biological Results
Very Large Post Genomic
Data
R
Very Large Post Genomic
Data
R
Biological Results
HPC(Eddie)
A solution for analyses using R
SPRINT (DPM & EPCC))
SPRINTSPRINT has 2 components:1.HPC harness manages access to HPC2.Library of parallel R functions
e.g. cor (correlation) pam (clustering)
maxt (permutation
Allows non-specialists to make use of HPC resources, with analysis functions parallelised by us or the R community.
data(golub)smallgd <- golub[1:100,] classlabel <- golub.cl
resT <- mt.maxT(smallgd, classlabel, test="t", side="abs")
quit(save="no")
library("sprint")
data(golub)smallgd <- golub[1:100,] classlabel <- golub.cl
resT <- pmaxT(smallgd, classlabel, test="t", side="abs")
pterminate()
quit(save="no")
Code comparison
Permutation Benchmark
Input Array Data Size
Permutation Count
Estimatedmaxt 1 CPU
Pmaxt on 256 CPUs
(s)
36,612 x 76 500,000 6 hrs 73.18
36,612 x 76 1,000,000 12 hrs 46.64
73,224 x 76 500,000 10 hrs 148.46
100,000 x 320 1,000,000 20 hrs 294.61
Correlation Benchmark
Input Array Data Size Output Array Data Size pcor() on 256 CPUs
(s)
11,000 x 320(27 MB)
923 MB 4.76
22,000 x 320
(54 MB)
3.6 GB 13.87
35,000 x 320
(85 MB)
9.1 GB 36.64
45,000 x 320
(110 MB)
15 GB 42.18
Clustering Benchmarks
Future
• Cloud (confidentiality issues)
• GPU (limitations is data size)
New therapeutic and diagnostic opportunities
Viral Interaction Networks
Host Interaction Networks
Virus Antiviral HostSystemic Therapeutic
Bed-bench-models-almost back to bed
THANK YOU&
Acknowlegments to our sponsors
Mathieu Blanc
Steven Watterson
Mizanur Khondoker
Paul Dickinson
Thorsten Forster
Muriel Mewissen
Terry Sloan
Jon Hill
Michal Piotrowski
Arthur Trew
Acknowledgement
EPCC