Selecting Targets which Probe Family and Function Space
![Page 1: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/1.jpg)
MCSG Site Visit, Argonne, January 30, 2003
Selecting Targets which Probe Family and Function Space
How many protein families can we identify in the genomes with/without structures?
Which families should we target to maximise the structural coverage of the genomes?
Can we optimise function coverage?
James Bray, David Lee, Russell Marsden, Annabel Todd, Janet Thornton, Andrzej Joachimiak
NIH Funded Midwest Consortium
CATH, Gene3D
![Page 2: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/2.jpg)
protein families
Identify protein families in the genomes
![Page 3: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/3.jpg)
domain families
Identify domain families and consider domain compositions of the protein families
![Page 4: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/4.jpg)
domain family of known structure
Identify structurally characterised domain families
![Page 5: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/5.jpg)
650,000 protein sequences from 120 completed genomes
14 eukaryotic genomes including human, mouse, fly, worm
92 bacterial genomes
14 archaeal genomes
Protein Families in Complete Genomes with Structural/Functional Annotations
Gene3D: Buchan, Thornton, Orengo, Genome Research (2002), NAR (2002)
Currently being updated with 30 more complete genomes
![Page 6: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/6.jpg)
BLAST all the sequences from 120 completed genomes against each other and cluster into protein families
For each protein family identify domain composition (by mapping CATH and Pfam domains)
Clustering Sequences into Protein Families of Known Domain Composition
PFscape - Protein Family Landscape
SAM-T99 - sequence mapping of CATH & Pfam (Karplus et al., NAR, 2000)
TRIBE-MCL - Markov Clustering (Enright & Ouzounis, Genome Research, 2002)
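The clustering step can be illustrated with a minimal Markov clustering sketch. This is not the TRIBE-MCL implementation: the function names, the toy similarity matrix and the inflation value of 2.0 are assumptions made for illustration. In a real run the matrix entries would be derived from the all-against-all BLAST scores.

```python
def normalise_columns(matrix):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_cluster(similarity, inflation=2.0, max_iterations=100, tolerance=1e-9):
    """Cluster sequences from a symmetric, non-negative similarity matrix."""
    n = len(similarity)
    # Add self-loops so that every column has a non-zero sum.
    flow = [[float(similarity[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]
    flow = normalise_columns(flow)
    for _ in range(max_iterations):
        # Expansion: one more step of the random walk (matrix squared).
        expanded = [[sum(flow[i][k] * flow[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: raise entries to a power, then renormalise the columns.
        inflated = normalise_columns([[v ** inflation for v in row] for row in expanded])
        change = max(abs(inflated[i][j] - flow[i][j])
                     for i in range(n) for j in range(n))
        flow = inflated
        if change < tolerance:
            break
    # Each non-empty row of the converged matrix lists the members of one cluster.
    clusters = set()
    for row in flow:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Toy example: sequences 0-2 hit each other, as do sequences 3-4.
similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
families = markov_cluster(similarity)
```

Raising the inflation value generally produces smaller, tighter clusters, which is one way to think about the granularity of clustering examined on the next slide.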
![Page 7: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/7.jpg)
Consistency of TribeMCL Clusters for Genes of Known Structure in CATH Database
[Figure: percentage of genes with common family annotation (y-axis) against granularity of clustering (x-axis)]
![Page 8: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/8.jpg)
Clustering ~650,000 genes from 120 complete genomes
PFscape
Protein Family 4
Protein Family 3
Protein Family 2
Protein Family 1
~50,000 protein families of 2 or more sequences, ~60,000 singletons
On average 10-15% of sequences in a genome are singletons
![Page 9: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/9.jpg)
Library of profiles (HMMs) built for representative sequences from each CATH and Pfam domain superfamily
E-value thresholds validated by structure comparison
Mapping CATH and Pfam Domains onto Genome Sequences
[Diagram: protein sequences from genomes -> scan against CATH & Pfam SAM-T99 HMM library (1467 CATH, 6190 Pfam) -> assign domains to CATH and Pfam families]
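The assignment step can be sketched as follows, assuming the HMM scan has already produced a list of hits. The `Hit` record, the family labels, the 0.001 E-value cut-off and the rule of keeping the best-scoring hit among overlapping ones are all hypothetical: the slides say only that thresholds were validated by structure comparison and give no values.

```python
from collections import namedtuple

# Hypothetical record for one HMM match on a genome sequence.
Hit = namedtuple("Hit", "sequence family start end evalue")


def assign_domains(hits, max_evalue=0.001):
    """Keep significant, non-overlapping hits per sequence, best E-value first."""
    assignments = {}
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > max_evalue:
            continue  # not significant under the assumed threshold
        accepted = assignments.setdefault(hit.sequence, [])
        overlaps = any(hit.start <= other.end and other.start <= hit.end
                       for other in accepted)
        if not overlaps:
            accepted.append(hit)
    # Report each sequence's domains in order along the chain.
    return {seq: sorted(doms, key=lambda h: h.start)
            for seq, doms in assignments.items()}


# Toy hits with placeholder family labels.
hits = [
    Hit("geneA", "CATH:superfamily_1", 10, 180, 1e-30),
    Hit("geneA", "Pfam:family_2", 20, 170, 1e-12),   # overlaps the stronger hit
    Hit("geneA", "Pfam:family_3", 200, 450, 1e-8),
    Hit("geneB", "Pfam:family_4", 5, 250, 0.5),      # fails the threshold
]
domain_map = assign_domains(hits)
```

The ordered list of families per sequence is the domain composition that is later compared across each protein family.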
![Page 10: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/10.jpg)
Performance of Sequence Mapping Method
1D-HMM (SAM-T99)
[Figure: Coverage vs Error rate (OHPS). x-axis: error rate (%), 0 to 0.1; y-axis: coverage, (%) of homologues found, 0% to 100%. Series: Sreps.v2.5_Sreps.v2.5, Sreps.v2.4_Sreps.v2.5]
Percentage of remote, structurally validated CATH homologues (<35% sequence identity) identified by SAM-T99
Library of 1D-HMM models detects >80% of remote homologues
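A coverage-versus-error curve of this kind can be computed from a ranked list of hits with known labels. The sketch below is an assumption about the benchmark, not the authors' procedure: the function name is hypothetical, and error rate is taken here to mean false positives per query, which the slide does not define.

```python
def coverage_at_error_rate(hits, total_homologues, n_queries, max_error_rate):
    """hits: iterable of (evalue, is_homologue) pairs.

    Returns (coverage, evalue_cutoff) at the loosest cut-off whose
    false positives per query stay within max_error_rate.
    """
    true_found = 0
    false_found = 0
    best = (0.0, None)
    for evalue, is_homologue in sorted(hits):
        if is_homologue:
            true_found += 1
        else:
            false_found += 1
        if false_found / n_queries > max_error_rate:
            break  # accepting this hit would exceed the allowed error rate
        best = (true_found / total_homologues, evalue)
    return best


# Toy benchmark: 5 structurally validated homologues, 10 queries.
scored_hits = [(1e-20, True), (1e-10, True), (1e-5, True),
               (1e-3, False), (1e-2, True), (0.1, False)]
coverage, cutoff = coverage_at_error_rate(scored_hits, total_homologues=5,
                                          n_queries=10, max_error_rate=0.1)
```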
![Page 11: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/11.jpg)
50,000 protein families in Gene3D
Use HMMs to identify CATH and Pfam domains in the genome sequences
Domain compositions for protein families in Gene3D
[Diagram legend: CATH, Pfam, NewFam]
![Page 12: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/12.jpg)
CATH and Pfam domain families cover approximately 60-90% of genome sequences
[Figure: percentage of genes annotated (y-axis, 0 to 100) per organism (x-axis), grouped as Eukaryotes, Archaea, Bacteria. Series: percentage of genes with a Pfam assignment; percentage of genes with a CATH assignment. Organisms shown: Ape, Afu, Hsp, Mja, Mka, Mac, Mma, Mth, Pae, Pab, Pfu, Pho, Sso, Sto, Tac, Tvo, Aae, Bsu, Ctr, Cpn, Cte, Dra, Eco, Fun, Hin, Hpy26695, HpyJ99, MtuCDC, MtuH37, Mge, Mpn, PaePAO1, Rpr, Sau, Sco, Syn, Tel, Tma, Tpa]
Gene3D database
![Page 13: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/13.jpg)
Iterative Profile Search Methodology
120 genomes clustered into ~50,000 protein families
structural domain assignments from CATH
functional domain assignments from Pfam
domain compositions for each protein family
Also: SWISS-PROT, EC, COGs, GO, KEGG annotations
Gene3D Database: Protein Families in 120 Completed Genomes
Gene3D
http://www.biochem.ucl.ac.uk/bsm/Gene3D
Buchan, Thornton, Orengo, 2002, Genome Research
Recent update submitted to Proteins (2004)
![Page 14: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/14.jpg)
Maximise structural coverage of the genomes by targeting the largest domain families
[Figure: three panels (CATH, Pfam, NewFam). x-axis: number of non-identical relatives; y-axis: percentage of families]
• NewFam families are very small
• Target large structurally uncharacterised Pfam families to increase structural coverage of genomes
![Page 15: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/15.jpg)
Genome Coverage by Domain Families
[Figure: CATH, Pfam, Unassigned Hlevels vs s100. x-axis: domain families ordered by size (#Hlevel targets), 0 to 35,000; y-axis: percentage of non-singleton domain sequences in 120 completed genomes (% total s100), 0 to 100]
~70% of genomes are contained in ~2000 largest CATH and/or Pfam domain families (1345 Pfam families with no structural representative)
-> Target large structurally uncharacterised Pfam families to increase coarse grained structural coverage of the genomes
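The coverage curve amounts to sorting the families by size and accumulating the fraction of domain sequences they contain. A minimal sketch, assuming only a list of family sizes is available; the function name and the toy sizes are hypothetical, not Gene3D data.

```python
def families_needed(family_sizes, target_fraction=0.70):
    """Number of largest families needed to cover target_fraction of sequences."""
    total = sum(family_sizes)
    covered = 0
    # Walk through the families from largest to smallest.
    for count, size in enumerate(sorted(family_sizes, reverse=True), start=1):
        covered += size
        if covered >= target_fraction * total:
            return count
    return len(family_sizes)


# Toy distribution: a few large families and a tail of small ones.
toy_sizes = [30, 500, 50, 300, 20, 100]
n_families = families_needed(toy_sizes)
```

Because family sizes are heavily skewed, a small number of large families accounts for most sequences, which is why ~2000 families can reach ~70% coverage.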
![Page 16: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/16.jpg)
Fine Grained Target Selection
Structural Family (CATH)
Close Sequence Family (30%ID)
Profile Family (HMM based/Pfam)
2000 of the largest domain families cover 70% of genome sequences (~650 CATH + ~1350 Pfam families)
How many fine grained targets should be selected to provide good homology models for all the relatives in these families?
![Page 17: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/17.jpg)
45,000 targets are needed to give good homology models for 70% of eukaryotic and prokaryotic domains?
[Figure: Number of Targets for Close Sequence Families. y-axis: percentage of non-singleton domain sequences. Series: prokaryotes, eukaryotes, eukaryotes plus prokaryotes. Marked values: 25,000, 45,000, 30,000]
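One way to estimate the number of fine grained targets is to group the domain sequences of a family into close sequence families and count one target per group. The sketch below uses single-linkage clustering at 30% sequence identity with a union-find structure. The linkage rule, the function name and the toy identities are assumptions, since the slides do not say how the close sequence families were built.

```python
def count_targets(domains, pairwise_identity, min_identity=30.0):
    """Count close sequence families (one target each) by single linkage."""
    parent = {d: d for d in domains}

    def find(d):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for (a, b), identity in pairwise_identity.items():
        if identity >= min_identity:
            parent[find(a)] = find(b)  # merge the two clusters

    return len({find(d) for d in domains})


# Toy family of five domain sequences with hypothetical percent identities.
toy_domains = ["d1", "d2", "d3", "d4", "d5"]
toy_identities = {
    ("d1", "d2"): 45.0,
    ("d2", "d3"): 32.0,
    ("d3", "d4"): 18.0,   # below 30%: starts a new close sequence family
    ("d4", "d5"): 60.0,
}
n_targets = count_targets(toy_domains, toy_identities)
```

Summing this count over the largest families gives a target total comparable in kind to the 45,000 figure quoted on the slide.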
![Page 18: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/18.jpg)
Target Selection Strategy
~2000 of the largest CATH and/or Pfam families cover >70% of domain sequences in the genomes
it is not feasible to target all the close sequence families in these families to build good homology models for all relatives (45,000 targets)
accurate homology models are not needed for all families
-> target sequence families of biological or medical interest (these could be small families or singletons)
-> target additional representatives in very large families, especially functionally diverse families
![Page 19: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/19.jpg)
Domain Recurrences in the Genomes
[Figure: number of families (y-axis, 0 to 100) against occurrences (x-axis, 1 to 219) for E.coli, M.jannaschii and S.cerevisiae. Annotated values: 730, 570]
large, extensively duplicated families
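A recurrence histogram like this can be derived from the list of domain family assignments in a single genome. A minimal sketch, with a hypothetical function name and toy family labels:

```python
from collections import Counter


def recurrence_histogram(domain_assignments):
    """Map 'occurrences in the genome' to 'number of families with that count'.

    domain_assignments: iterable of family ids, one entry per domain found.
    """
    per_family = Counter(domain_assignments)   # family -> occurrences
    return Counter(per_family.values())        # occurrences -> number of families


# Toy genome: famA occurs three times, famC twice, famB and famD once each.
toy_genome = ["famA", "famA", "famA", "famB", "famC", "famC", "famD"]
histogram = recurrence_histogram(toy_genome)
```

Families at the far right of such a histogram are the large, extensively duplicated ones singled out on the slide.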
![Page 20: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/20.jpg)
structural family (CATH)
close sequence family (30%)
profile family (Pfam)
in these very large families we will need finer grained selection of targets to understand the evolution of new functions/biological roles in different organisms
![Page 21: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/21.jpg)
In >87% of families -> changes in substrate specificity modulated by changes in domain partners
In >92% of these families -> conservation or semi-conservation of reaction chemistry
Changes in Domain Partnerships can Modulate Function
domain duplication
domain fusion, change in domain partner
67% of enzyme families in CATH show variation in functional properties of relatives
![Page 22: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/22.jpg)
Methionine Aminopeptidase Type 1 (1mat)
Creatinase (1chmA)
monomer/protein substrates
Change in Domain Partner Modulates Function
dimer/small molecule substrates
![Page 23: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/23.jpg)
representative structures for large families may also help to identify functional families
profile family (Pfam)
close sequence family (30%)
![Page 24: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/24.jpg)
ProFunc: Predicting Functional Sites
Most likely binding site
Surface clefts
Residue conservation
Conserved surface patches
Laskowski and Thornton
![Page 25: Selecting Targets which Probe Family and Function Space](https://reader035.vdocuments.mx/reader035/viewer/2022062803/568146bb550346895db3ec0c/html5/thumbnails/25.jpg)
Representative Structures for Superfamilies will help identify Functional Subfamilies
functional subclusters identified by:
- domain partnerships from Gene3D
- sequence conservation
- functional annotations stored in Gene3D
- results from ProFunc analysis
[Diagram: Superfamily divided into functional clusters family_1, family_2, family_3, family_4, family_5]
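The first of these criteria, domain partnerships, can be sketched as grouping the relatives of a superfamily by the other domains found in the same protein. The function name, the input layout and the labels below are hypothetical. The remaining criteria (sequence conservation, stored annotations, ProFunc results) are not modelled here.

```python
from collections import defaultdict


def subcluster_by_partners(relatives, superfamily):
    """Group proteins containing `superfamily` by their set of partner domains.

    relatives: {protein_id: [domain family ids along the chain]}.
    """
    groups = defaultdict(list)
    for protein, architecture in relatives.items():
        if superfamily not in architecture:
            continue  # not a relative of this superfamily
        partners = frozenset(d for d in architecture if d != superfamily)
        groups[partners].append(protein)
    return dict(groups)


# Toy domain compositions; "S" is the superfamily of interest.
toy_relatives = {
    "p1": ["S", "A"],
    "p2": ["A", "S"],
    "p3": ["S"],
    "p4": ["S", "B"],
    "p5": ["C"],
}
subclusters = subcluster_by_partners(toy_relatives, "S")
```

Each group is a candidate functional subfamily that would then need checking against the other lines of evidence.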