Concepts, historical milestones & the central place of bioinformatics in modern biology: a European perspective
Teresa K. Attwood, University of Manchester (04/21/23)

Overview
• Where the term bioinformatics originated
• Where the ‘modern’ concept originated
• Some key events & folk
• Its place in ‘the new biology’
Origin of Bioinformatics
• The origin of the term ‘bioinformatics’ has been attributed to Paulien Hogeweg
– Dutch theoretical biologist
• She & colleague Ben Hesper coined the term in the early ‘70s, defining it as
– “the study of informatic processes in biotic systems”
• Hogeweg, P. (2011) The roots of bioinformatics in theoretical biology. PLoS Computational Biology, 7(3), e1002021
• The term failed to gain traction for ~20 years
• The origins of the ‘modern’ concept of bioinformatics are rooted in sequence analysis
• Driven by the desire to
– collect
– annotate
– & analyse sequence data
• systematically (i.e., using computers)!
This concept of ‘bioinformatics’ was barely known pre 1990…
[Timeline figure, 1950–2010, showing: insulin (GIVEQCCASVCSLYQLENYCN / FVNQHLCGSHLVEALYLVCGERGFFYTPKA), ribonuclease, Dayhoff Atlas, CSD]
Margaret Dayhoff
• Pioneer of computer methods to compare proteins
– & to derive evolutionary histories from alignments
• Particular interest in deducing evolutionary connections from sequence evidence
• Collected all the known protein sequences
– made them available to the scientific community
• In 1965, she compiled a book
– Atlas of Protein Sequence & Structure
“There is a tremendous amount of information regarding the evolutionary history and biochemical function implicit in each sequence and the number of known sequences is growing explosively. We feel it is important to collect this significant information, correlate it into a unified whole and interpret it”
M.O. Dayhoff to C. Berkley, February 27, 1967
Strasser, B. (2008) “GenBank – Natural history in the 21st century?” Science, 322, 537-538
[Timeline figure, 1950–2010, now also showing: ARPAnet, PDB, automated protein sequencers, DNA sequencing, automated DNA sequencing; counts shown: 65, 7]
Exam 1
What pernicious, life-changing development occurred in 1971?
“the rate limiting step in the process of nucleic acid sequencing is now shifting from data acquisition towards the organization and analysis of that data”
Gingeras, T.R. & Roberts, R.J. (1980) “Steps toward Computer Analysis of Nucleotide Sequences,” Science, 209, 1322-1328
“a centralized data bank [is] essential for the efficient use of nucleic acid sequence information”
C. Anderson, Minutes, 1980
• While the US debated where to locate a new centralised resource, EMBL acted…
• The 1st internationally funded, public ‘central’ nucleotide sequence database was thus European
– the EMBL data library, Heidelberg
• preceded the 1st release of GenBank by ~6 months
Attwood, T.K. et al. (2011) Concepts, Historical Milestones & the Central Place of Bioinformatics in Modern Biology: A European Perspective. In Bioinformatics - Trends & Methodologies, Intech Online Publishers
• Copies of the EMBL data library & GenBank were being maintained in Cambridge
– together with their search tools, etc.
• An integrated system gave access to the dbs & tools
– “this system is presently being used by over 30 researchers in 8 departments in the University & in local research institutes. These users can keep in touch with each other via the MAIL command”!
[Timeline figure, 1950–2010, now also showing: email, Internet, EMBL, GenBank, PIR; counts shown: 568, 859]
Amos Bairoch
• A crazy postgrad student in Switzerland
– interested in space exploration & the search for ET life
• His project was to develop s/w to analyse protein & nucleotide sequences
– PC/Gene
• Published his 1st paper in 1982
– a letter to the BJ
• Suggested use of checksums
– “to facilitate detection of typographical & keyboard errors”
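The checksum idea can be illustrated with a minimal sketch. The slide does not say which checksum was proposed, so the CRC-32 from Python's standard `zlib` module is used here purely as a stand-in. The function names are hypothetical, and the test sequence is the 21-residue peptide shown on the timeline slide.

```python
import zlib


def sequence_checksum(sequence: str) -> str:
    """Return an 8-digit hexadecimal CRC-32 of a sequence.

    Whitespace and letter case are ignored, so line wrapping in a
    printed listing does not change the result.
    """
    normalised = "".join(sequence.split()).upper()
    return format(zlib.crc32(normalised.encode("ascii")), "08X")


def matches_checksum(sequence: str, published_checksum: str) -> bool:
    """Check a retyped sequence against the checksum printed alongside it."""
    return sequence_checksum(sequence) == published_checksum.upper()


# Sequence as published, and a retyped copy with the last residue mistyped.
original = "GIVEQCCASVCSLYQLENYCN"
retyped = "GIVEQCCASVCSLYQLENYCM"

published = sequence_checksum(original)
print("checksum printed with the sequence:", published)
print("faithful copy verifies:", matches_checksum(original, published))
print("mistyped copy verifies:", matches_checksum(retyped, published))
```

Printing a short checksum next to each sequence lets anyone who keys the sequence in by hand confirm the copy without proofreading it residue by residue.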
• Why?
• Alongside PC/Gene, he needed to supply a db
• The Atlas wasn’t available electronically
– typed in >1,000 protein sequences
– some from the literature
– most from the Atlas
• by 1981, this was a large book, plus several supplements, listing 1,660 proteins
• In 1983, he acquired a computer tape of the EMBL Data Library
– version 2, with 811 sequences
• In 1984, he received the 1st available computer tape copy of the Atlas
– (which became known as the PIR-PSD)
– but… he disliked the PIR format
• So he converted the PIR database into the semi-structured format of EMBL
– part manually & part automatically
• The result was PIR+
– & was distributed as part of PC/Gene (now commercial)
• In summer 1986, he finally released the database independently of PC/Gene
– to make it available to all, free of charge
• This new database was called Swiss-Prot
• 1st released on 21 July 1986
– the exact number of entries is unknown, as he lost the original floppy disks!
• As part of his work on PC/Gene, he created another key database
– diagnostic tool for characterising protein families
• 1st released March 1989, with 58 entries
– this was PROSITE
• Philosophy of his approach
– coupling high quality data analysis with manual annotation
[Figure labels: PRINTS; PROSITE]
[IVM]-[AS]-L-W-S-L-V(2)-L-A-[IV]-E-R-Y-[IV](3)-C-K-P-M
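A pattern of this kind is essentially a regular expression over residues, which a short sketch can make concrete. It assumes the repeat counts in the slide's pattern (the 2 after V and the 3 after [IV]) are written in the usual PROSITE form, V(2) and [IV](3). The converter below is a hypothetical helper covering only the basic syntax, not the PROSITE scanning software, and the test fragment was constructed by hand to satisfy the pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regular expression.

    Handles: '-' separators, x (any residue), [..] allowed residues,
    {..} excluded residues, (n) or (n,m) repeat counts, and the
    '<' / '>' sequence-end anchors when they stand outside brackets.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.replace("<", "^").replace(">", "$")   # end anchors
        element = element.replace("x", ".")                      # wildcard
        element = element.replace("{", "[^").replace("}", "]")   # exclusions
        element = element.replace("(", "{").replace(")", "}")    # repeat counts
        parts.append(element)
    return "".join(parts)


pattern = "[IVM]-[AS]-L-W-S-L-V(2)-L-A-[IV]-E-R-Y-[IV](3)-C-K-P-M"
regex = prosite_to_regex(pattern)

# Hypothetical fragment, built to satisfy the pattern, flanked by other residues.
sequence = "GGEIALWSLVVLAIERYVVVCKPMSN"
match = re.search(regex, sequence)

print("regex:", regex)
print("matched region:", match.group(0) if match else None)
```

A single mismatch anywhere in the motif makes the whole pattern fail, which is why patterns like this are precise but brittle diagnostics.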
• Database annotation…
[Figure labels: Database maintenance; Database annotation; Nirvana]
“It is quite depressive to think that we are spending millions in grants for people to perform experiments, produce new knowledge, hide this knowledge in often badly written text and then spend some more millions trying to second guess what the authors really did and found”
Bairoch, A. (2009) The future of annotation/biocuration. Nature Precedings
[Timeline figure, 1950–2010, now also showing: Swiss-Prot, PROSITE, PRINTS; count shown: 3,900]
• The number of sequences was growing
• The number of structures was growing
• The number of protein family signatures was growing
Exam 2
Two extraordinary developments had yet to take place. What were they?
[Timeline figure, 1950–2010, now also showing: HT DNA sequencing, www, the H. influenzae, S. cerevisiae, C. elegans, D. melanogaster & H. sapiens genomes, FlyBase, TrEMBL, Pfam, InterPro; counts shown: 105,000, 2,423]
[Figure: InterPro & its member databases: ProDom, PRINTS, Prosite, PANTHER, SMART, HAMAP, PIRSF, TIGRFAM, SUPERFAMILY, Gene3D, Pfam, Profiles]
[Timeline figure, 1950–2010, final build, now also showing: ENA, UniProt, and the organisations EMBnet, NCBI, EBI, SIB & ELIXIR; counts shown: >500B, 36.0M; annotated ‘thousands more’, ‘billions more’, ‘hundreds more’]
[Figure: database growth; values shown: 282 M, 540 K, 35 M, 84 K]
Red line: growth of EMBL since its inception
Green line: growth of manually annotated Swiss-Prot
Blue line: growth of PDB
By 2020, NGS & 3Gen technologies will be producing data a million times faster than the current rate
• Hopefully, this potted history speaks for itself
• In the last 30 years, bioinformatics has given us
– the first ‘complete’ catalogues of DNA & protein sequences
• including genomes & proteomes of organisms across the Tree of Life
– software to analyse biological data on an unprecedented scale
– & hence tools to help understand
• more about evolutionary processes in general
• our place on the Tree of Life in particular
• &, ultimately, more about health & disease
• It isn’t a panacea, but its contribution has been huge
Recommended reading
• Richon, A.B. A short history of bioinformatics (http://www.netsci.org/Science/Bioinform/feature06.html)
• Bairoch, A. (2000) Serendipity in bioinformatics, the tribulations of a Swiss bioinformatician through exciting times. Bioinformatics, 16(1), 48-64.
• Ashburner, M. (2006) Won for all – How the Drosophila genome was sequenced. Cold Spring Harbor Lab. Press
• Strasser, B.J. (2008) GenBank – Natural history in the 21st century? Science, 322, 537-538.
• Attwood, T.K., Gisel, A., Eriksson, N-E. & Bongcam-Rudloff, E. (2011) Concepts, Historical Milestones and the Central Place of Bioinformatics in Modern Biology: A European Perspective