Turning genomics data into Biology
Martijn Huynen
Nijmegen Center for Molecular Life Sciences, Centre for Molecular and Biomolecular Informatics
Comparative genomics
Prediction of protein function, pathways
The (somewhat) intelligent comparative genomics meat grinder
Evolution of biosystems
Method development
A phosphomannomutase (pmm) is predicted to have acquired a phosphoribomutase (deoB) function
[Figure: gene cluster deoD, deoC, deoA, cdd, pmm in M. genitalium and M. tuberculosis, shown with the pathway compounds deoxycytidine; deoxyuridine, deoxythymidine; purine deoxyribonucleosides; deoxyribose-1-P; deoxyribose-5-P; glyceraldehyde-3-P, acetaldehyde; and the enzymes Cdd, DeoA, DeoD, DeoC and DeoB (?)]
Predicting functional relations between genes using (conserved) genomic context
Genomic Context Types: Conserved Neighborhood, Gene Fusion, Co-occurrence
http://string.embl.de
Snel et al., NAR 1999; von Mering et al., NAR 2002; von Mering et al., NAR 2005
Dandekar et al., 1998; Overbeek et al., 1999
Marcotte et al., 1999; Enright et al., 1999
Huynen and Bork, 1998; Pellegrini et al., 1999
Gene fission in the evolution of carbamoyl phosphate synthase B (carB)
[Figure: tree of PyrAB and CarB sequences with the labels MJ1378 & MJ1381, MTH997 & MTH996, EC0033, HP0919, AQ2101 & AQ1172, AF1274, sll0370, Rv1384, YJL130C, D2085.1, YJR109C; branch values 83, 93, 100, 88, 96, 100, 100, 88, 92]
Predicting functional interactions between proteins by the co-occurrence of their genes in genomes.
Distribution of four M.genitalium genes among 25 genomes
MG299 (pta)   0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG357 (ackA)  0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG019 (dnaJ)  0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
MG305 (dnaK)  0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
Using the mutual information between genes as a scoring heuristic for their co-occurrence.
M(pta, ackA) = 0.69 (phosphotransacetylase, acetate kinase)
M(dnaJ, dnaK) = 0.55 (heat shock proteins)
M(dnaJ, ackA) = 0.19
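The co-occurrence score above can be sketched in a few lines. This is a minimal illustration, not the original implementation: it assumes the standard definition of mutual information between two binary presence/absence profiles and natural logarithms (an assumption, although with it the three values quoted on the slide are reproduced to two decimals). The function and variable names are chosen here for illustration.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Mutual information (in nats) between two presence/absence profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same set of genomes")
    n = len(a)
    joint = Counter(zip(a, b))   # counts of (presence_a, presence_b) per genome
    count_a = Counter(a)         # marginal counts for gene a
    count_b = Counter(b)         # marginal counts for gene b
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log(p_xy / ((count_a[x] / n) * (count_b[y] / n)))
    return mi


def parse(profile):
    """Turn a whitespace-separated 0/1 string into a list of ints."""
    return [int(c) for c in profile.split()]


# Profiles copied from the slide (25 genomes).
pta = parse("0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1")
ackA = parse("0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1")
dnaJ = parse("0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1")
dnaK = parse("0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1")

for name, (g1, g2) in {"pta/ackA": (pta, ackA),
                       "dnaJ/dnaK": (dnaJ, dnaK),
                       "dnaJ/ackA": (dnaJ, ackA)}.items():
    print(f"M({name}) = {mutual_information(g1, g2):.2f}")
```

Identical profiles score their own entropy, so a pair that is present in about half of the genomes (pta, ackA) scores higher than an identical pair that is present almost everywhere (dnaJ, dnaK).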
Evolutionary conservation of genomic context increases the likelihood of functional interaction
[Figure: fraction same pathway (KEGG), 0 to 1, plotted against evolutionary conservation score, 0 to 1, for Fusion, Gene Order and Co-occurrence]
Correlation between the strength of the genomic and functional associations
[Figure: number of COGs (log scale, 1 to 10000) and average metabolic distance (0 to 6) plotted against co-occurrences in operons (0 to 30)]
Genomic associations correlate with a wide array of functional interactions
[Figure: pie charts of the types of functional interaction per genomic context type. Legend: physical interaction, complex, metabolic pathway, non-metabolic pathway, process, hypothetical, unknown interaction. Slice percentages as listed: Gene Order Conservation 30%, 33%, 7%, 6%, 10%, 10%, 4%; Gene Fusion 7%, 15%, 22%, 56%; Co-occurrence in Genomes 23%, 4%, 25%, 14%, 23%, 11%]
Huynen et al., Genome Research 2000
Combining homology information with genomic association for function prediction
Repeated occurrence of MG009, a phosphohydrolase, with thymidylate kinase (tmk) suggests a role of MG009 in pyrimidine metabolism.
Conservation of gene order of the hypothetical gene MG134 with dnaX and recR suggests physical interaction between their gene products.
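Conservation of gene order, as used in the two predictions above, can be sketched as a count of the genomes in which two genes lie next to each other. This is a simplified illustration under stated assumptions: genomes are represented as linear lists of orthologous-group labels, strand and circularity are ignored, and the gene names and gene orders below are made up (hypothetical), not real genome data.

```python
def are_neighbors(gene_order, a, b, max_gap=1):
    """True if genes a and b lie within max_gap positions of each other."""
    if a not in gene_order or b not in gene_order:
        return False
    return abs(gene_order.index(a) - gene_order.index(b)) <= max_gap


def conserved_neighborhood(genomes, a, b, max_gap=1):
    """Return (genomes where a and b are neighbors, genomes containing both)."""
    both = [order for order in genomes.values() if a in order and b in order]
    adjacent = sum(are_neighbors(order, a, b, max_gap) for order in both)
    return adjacent, len(both)


# Hypothetical gene orders, for illustration only.
genomes = {
    "genome_1": ["geneA", "geneX", "geneY", "geneB"],
    "genome_2": ["geneC", "geneY", "geneX", "geneD"],
    "genome_3": ["geneX", "geneE", "geneF", "geneY"],
    "genome_4": ["geneG", "geneX", "geneH"],
}

adjacent, both = conserved_neighborhood(genomes, "geneX", "geneY")
print(f"neighbors in {adjacent} of {both} genomes that contain both genes")
```

Counting only over genomes that contain both genes keeps the neighborhood signal separate from the co-occurrence signal.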
Phylogenomics for protein function prediction
An ancient paralog of N7BM has been lost in the same lineages as N7BM itself, implicating a possible role in Complex I
Gabaldon et al. (2005) J. Mol. Biol.
Experimental confirmation of a role of the N7BM paralog in Complex I
J. Clin. Invest. (2005)
Verified function predictions: Making predictions is easy, testing them is another matter.
Protein | Context | Type of interaction | Function | Ref
Mt-Ku | gene order | physical interaction | double-stranded DNA repair | [56]
GnlK | gene order | physical interaction | signal transduction for ammonium transport | [57,58]
PH0272 | gene order | metabolic pathway | methylmalonyl-CoA racemase | [59]
PrpD | gene order | metabolic pathway | 2-methylcitrate dehydratase | [22,60]
arok | gene order | metabolic pathway | shikimate kinase | [61]
ComB | gene order | metabolic pathway | 2-phosphosulfolactate phosphatase | [62]
KynB | gene order | metabolic pathway | kynurenine formamidase | [63]
PvlArgDC | gene order | metabolic pathway | arginine decarboxylase | [64]
FabK | gene order | metabolic pathway | enoyl-ACP reductase | [65]
FabM | gene order | metabolic pathway | trans-2-decenoyl-ACP isomerase | [66]
COG0042 | gene order | tRNA modification | tRNA-dihydrouridine synthase | [67]
Yfh1 | co-occurrence | process | iron-sulfur cluster assembly | [68,69]
YchB | co-occurrence | metabolic pathway | terpenoid synthesis | [70]
SmpB | co-occurrence | process | trans-translation | [5,71]
ThyX | complementary | enzymatic activity | thymidylate synthase | [14,72]
ThiN | complementary | enzymatic activity | thiamine phosphate synthase | [73,74]
ThiE | complementary | enzymatic activity | thiamine phosphate synthase | [74]
Prx | fusion | pathway | peroxiredoxin | [75]
YgbB | fusion / gene order | metabolic pathway | terpenoid synthesis | [76]
SelR | fusion / order / co-o. | enzymatic activity | methionine sulfoxide reductase | [14,22,77]
FadE | reg. sequence | metabolic pathway | acyl CoA dehydrogenase | [78,79]
TogMNAB | reg. sequence | metabolic pathway | oligogalacturonide transport | [80,81]
MetD | reg. sequence | metabolic pathway | methionine transport | [82]
Huynen et al., Curr Op. Cell Biol. 2003
Experimentally confirmed protein functions, predicted with various types of context
[Figure: 13 gene order, 4 co-occurrence, 4 compl. distribution, 3 gene fusion, 3 regulatory element]
Predicting gene function by conservation of co-expression
Evolutionary conservation of co-expression increases the likelihood of functional interaction
Species | Total # of pairs | # of pairs > 0.6 | Observed fraction > 0.6 | Expected fraction > 0.6 | Observed/Expected
Gene-pairs with an orthologous gene-pair > 0.6
Worm | 18161 | 803 | 0.0442* | 0.00379 | 12
Yeast | 36548 | 1215 | 0.0332* | 0.00216 | 15
Gene-pairs with a paralogous gene-pair > 0.6
Worm | 207214 | 29031 | 0.1401* | 0.00379 | 37
Yeast | 38253 | 2167 | 0.0566* | 0.00216 | 26
Low but significant levels of conservation of co-expression (see Teichmann et al., TIBS 2002; Stuart et al., Science 2003)
van Noort et al, TIG, 2003
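The Observed/Expected column of the co-expression table can be recomputed from the other columns. The sketch below assumes that the observed fraction is the number of pairs above 0.6 divided by the total number of pairs, and that the expected fraction is a background rate taken as given from the slide; the function name and row labels are chosen here for illustration.

```python
def enrichment(total_pairs, pairs_above, expected_fraction):
    """Return (observed fraction, observed/expected ratio)."""
    observed = pairs_above / total_pairs
    return observed, observed / expected_fraction


# Rows copied from the table: (label, total pairs, pairs > 0.6, expected fraction).
rows = [
    ("orthologous, worm", 18161, 803, 0.00379),
    ("orthologous, yeast", 36548, 1215, 0.00216),
    ("paralogous, worm", 207214, 29031, 0.00379),
    ("paralogous, yeast", 38253, 2167, 0.00216),
]

for label, total, above, expected in rows:
    observed, ratio = enrichment(total, above, expected)
    print(f"{label}: observed {observed:.4f}, observed/expected {ratio:.0f}")
```

The observed fractions stay below 15%, while the ratios of 12 to 37 show how far they sit above the background rate.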
Conservation of protein-protein interaction measured by yeast-2-hybrid increases the likelihood of interaction
Comparison of Giot (Fly) and Ito (Yeast), Uetz (Yeast) y-2-h interactions
A “new”, conserved interaction: GTPase XAB1/CG3704 and the hypothetical GTPase YOR262/CG10222
XAB1 interacts with the DNA repair protein XPA1, inferred to be required for XPA1’s import into the nucleus.
Fraction hypothetical proteins in conserved Y2H interactions relatively low
Hypotheticals:
In conserved interactions | 13 | 5%
In complete genome | ~1600 | 27%
Dataset comparison | Protein interactions, both proteins in the other dataset | Conserved interactions | Fraction conserved interactions | Average fraction conserved interactions
Ito / Uetz (Yeast vs. Yeast) | 858 / 697 | 201 | 23.4% / 28.8% | 26.1%
Ito / Giot (Yeast vs. Fly) | 229 / 394 | 45 | 19.6% / 11.4% | 15.5%
Uetz / Giot (Yeast vs. Fly) | 120 / 168 | 33 | 27.5% / 19.6% | 23.5%
Physical interaction is reasonably well conserved between species (… compared to the “conservation” within a species …)
Huynen et al, TIG, 2004
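The fractions in the dataset comparison follow from the counts in the same table. A minimal sketch, assuming each fraction is the number of conserved interactions divided by the number of interactions in one dataset whose two proteins are both present in the other dataset, and that the last column is the plain average of the two fractions. The recomputed values agree with the table to within about 0.1 percentage point.

```python
def conserved_fractions(n_first, n_second, n_conserved):
    """Fractions of conserved interactions for both datasets, and their mean."""
    frac_first = n_conserved / n_first
    frac_second = n_conserved / n_second
    return frac_first, frac_second, (frac_first + frac_second) / 2


# Counts copied from the table: (comparison, interactions in A, in B, conserved).
comparisons = [
    ("Ito / Uetz (Yeast vs. Yeast)", 858, 697, 201),
    ("Ito / Giot (Yeast vs. Fly)", 229, 394, 45),
    ("Uetz / Giot (Yeast vs. Fly)", 120, 168, 33),
]

for label, n_a, n_b, conserved in comparisons:
    f_a, f_b, mean = conserved_fractions(n_a, n_b, conserved)
    print(f"{label}: {100 * f_a:.1f}% / {100 * f_b:.1f}%, average {100 * mean:.1f}%")
```

The within-species comparison (Ito versus Uetz, both yeast) gives the reference level against which the between-species fractions are read.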
Conservation of protein-protein interaction between species
Is the low level of conservation of co-expression between S. cerevisiae and C. elegans (< 5%) “real”, reflecting evolution and species-specific interactions, or are we just comparing noisy datasets?
Species-specific (idiosyncratic) coregulation:
“Efficient expression of the Saccharomyces cerevisiae glycolytic gene ADH1 is dependent upon a cis-acting regulatory element UASRPG found initially in genes encoding ribosomal proteins.” Tornow and Santangelo, Gene, 1990
Low (but significant) correlation between ChIP-on-chip data (sharing Transcription Factor Binding Sites) and expression data in S. cerevisiae
Noisy genomics data
Filtering out the noise by combining ChIP-on-chip and co-expression in yeast
Correlation of co-regulation with functional interactions
Data set of gene pairs | Percent same pathway | Number of gene pairs
r > 0.5 | 43 | 169,768
r > 0.6 | 52 | 65,430
r > 0.7 | 51 | 22,459
Sharing 1 TFBS | 50 | 356,947
Sharing 2 TFBS | 77 | 39,818
Sharing 1 TFBS and r > 0.3 | 86 | 19,386
Sharing 1 TFBS and r > 0.4 | 88 | 11,434
Sharing 1 TFBS and r > 0.5 | 90 | 6,687
Sharing 1 TFBS and r > 0.6 | 90 | 3,382
Sharing 1 TFBS and r > 0.7 | 86 | 1,156
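The combined filter in the table (gene pairs that share a transcription factor binding site and also exceed a co-expression threshold) can be sketched as follows. This is an illustration under assumptions, not the pipeline behind the table: co-expression is taken to be the Pearson correlation of expression profiles, TFBS sharing is taken to be a non-empty intersection of the sets of factors bound upstream of each gene, and the gene names, factor names and expression values are all hypothetical toy data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long profiles (0.0 if one is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def filtered_pairs(expression, tf_targets, r_min, min_shared=1):
    """Gene pairs sharing >= min_shared factors and with correlation > r_min."""
    pairs = []
    for g1, g2 in combinations(sorted(expression), 2):
        shared = tf_targets.get(g1, set()) & tf_targets.get(g2, set())
        if len(shared) >= min_shared and pearson(expression[g1], expression[g2]) > r_min:
            pairs.append((g1, g2))
    return pairs


# Hypothetical toy data: expression profiles and bound transcription factors.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.1, 2.1, 2.9, 4.2],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 2.2, 3.1, 3.9],
}
tf_targets = {
    "g1": {"TF_A"},
    "g2": {"TF_A", "TF_B"},
    "g3": {"TF_A"},
    "g4": {"TF_C"},
}

print(filtered_pairs(expression, tf_targets, r_min=0.5))
```

In the toy data, g1 and g4 are strongly co-expressed but share no factor, and g1 and g3 share a factor but are anti-correlated; only the pair that passes both criteria is kept, which is the intersection the table evaluates against pathway membership.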
High level of conservation of co-regulation after speciation
[Figure: frequency distribution (0 to 0.45) of co-expression correlation (r, -0.5 to 0.9) for worm orthologous gene pairs of yeast gene pairs with r > 0.6 and sharing TFBS, compared with all worm gene pairs; annotation: 76 %]
Comparing co-regulation in Bacteria indicates a level of conservation of 80% (operons in B. subtilis versus regulons in E. coli)
NB:
1) Based on operon conservation, the level is only 50%
2) Disregard cases of gene loss
Noisy genomics data lead to drastic underestimations of conservation of interactions
Conclusions co-regulation conservation
• Gene co-regulation tends to be conserved in Eukaryotes (76%) and in prokaryotes (80%)
• In the case of gene duplication one gene tends to maintain the co-regulatory link: there appears to be one functionally equivalent ortholog
Snel et al., Nucleic Acids Res 2004
Exploiting genomics data to predict the function for a hypothetical protein: BolA
An interaction of BolA with a mono-thiol glutaredoxin? (STRING)
BolA and Grx occur as neighbors in a number of genomes
BolA and Grx have an (almost) identical phylogenetic distribution
BolA and Grx have been shown to interact in Y2H in S. cerevisiae and D. melanogaster, and in Flag tag in S. cerevisiae
BolA phylogeny
BolA does have (predicted) interactions with cell-division / cell-wall proteins. Those appear secondary to the link with GrX
Genomic context analyses have obtained a higher resolution in function prediction than phenotypic analyses
Cell division / Cell wall; (oxidative) stress
BolA is homologous to the peroxide reductase OsmC, suggesting a similar function
Protein Family (PDB entry) | 3D similarity to BolA: DALI, Z-scores | Sequence profile similarity to BolA: COMPASS, SW-score (E-value)
OsmC (1ml8A/1lqlA), Ohr (1n2fA) | 5.8 / 5.5, 5.2 | 73 (2.4 E-5)
KH 1 (1hnxC) | 5.3 | 46 (9.4 E-3)
DUF150 (1ib8A) | 3.7 | 44 (4.2 E-2)
GMP synthase C (1gpmA) | 2.9 | 57 (7.0 E-4)
KH 2 (1egaB) | 3.8 | 35 (2.7 E-1)
RBFA (1kkgA) | 4.2 | 40 (9.6 E-2)
BolA is, relative to other class II KH folds and sequences, most similar to OsmC
OsmC uses thiol groups of two evolutionarily conserved cysteines to reduce substrates
Problem: The BolA family does not have conserved cysteines.
… It would have to obtain its reducing equivalents from elsewhere …
BolA family alignment
BolA is (homologous to) a reductase
BolA interacts with GrX?
GrX provides BolA with reducing equivalents !?
Prediction of interaction partner and molecular function complement each other
There is a wealth of functional and structural genomics data that can be related to the function of individual proteins. Exploiting that data is becoming a trade in itself (biochemistry by other means)