University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature
![Page 1: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/1.jpg)
Extraction and Representation of in silico Biological Methods from the Literature
Geraint Duck
Supervisors: Robert Stevens, Goran Nenadic and David Robertson
Advisor: Joshua Knowles
School of Computer Science, University of Manchester
![Page 2: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/2.jpg)
Importance of Method in Science
• Understanding
  – Key part of research, central to science
  – Reproducibility and replication
  – What? Why? Where? How? When?
  – Extension
• Advise/evaluate
  – “Current Approach”
  – “Best Practice”
![Page 3: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/3.jpg)
Background
• In silico: performed on a computer, or through computer simulation
• Bioinformatics is a resource-focused domain
  – Numerous resources appearing
  – Literature is growing rapidly
• Resource availability and usage is central to biological research
• Current attempts are often manually curated and/or incomplete
![Page 4: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/4.jpg)
The Method to Obtain a Method
1. Extraction
   – Automatically extract resource and task mentions from the bioinformatics literature
     • This presentation focuses on this step
2. Representation and Analysis
   – Evaluate the extracted mentions for patterns of representation
3. Exploration
   – Provide a means of exploring the methods extracted to aid other research/researchers
![Page 5: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/5.jpg)
Key Hypothesis: Resource ordering implies method
• An analogy – baking a cake:
  – Ingredients: butter, eggs, flour, sugar, etc…
  – Recipe/method: Set oven to 180°C, mix in a bowl the butter and sugar… Divide between tins, cook in oven for 30 mins…
![Page 6: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/6.jpg)
Key Hypothesis: Resource ordering implies method
• An analogy – baking a cake:
  – Ingredients: butter, eggs, flour, sugar, etc…
  – Recipe/method: Set oven to 180°C, mix in a bowl the butter and sugar… Divide between tins, cook in oven for 30 mins…
Key: Resource; Task
![Page 7: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/7.jpg)
Example: Lagerström et al. (2006)
… all sequences were aligned … using … BLAT 3.0 … in which case the GenBank sequence was used… … divided … by BLAST searches … were combined into a FASTA file and aligned using … ClustalW 1.82 … The alignment was bootstrapped … using SEQBOOT from the … Phylip 3.6 package … [excerpt removed] … branch lengths were estimated in TreePuzzle using the following parameters … … constructed and scored automatically using a bash-script that utilized ClustalW as alignment engine and infoalign from the EMBOSS 2.8.0 package for scoring, … All statistical analysis was performed using MiniTab. Graphs were plotted using Microsoft Excel and MiniTab.
![Page 8: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/8.jpg)
Example: Lagerström et al. (2006)
… all sequences were aligned … using … BLAT 3.0 … in which case the GenBank sequence was used… … divided … by BLAST searches … were combined into a FASTA file and aligned using … ClustalW 1.82 … The alignment was bootstrapped … using SEQBOOT from the … Phylip 3.6 package … [excerpt removed] … branch lengths were estimated in TreePuzzle using the following parameters … … constructed and scored automatically using a bash-script that utilized ClustalW as alignment engine and infoalign from the EMBOSS 2.8.0 package for scoring, … All statistical analysis was performed using MiniTab. Graphs were plotted using Microsoft Excel and MiniTab.
Key: Resource; Task; Potential Challenge
![Page 10: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/10.jpg)
Example: Lagerström et al. (2006)
Key: Resource; Task
Figure: resource and task mentions, in the order shown on the slide
• GenBank
• BLAT, aligned
• BLAST, searched
• ClustalW, aligned
• SEQBOOT (Phylip), bootstrapped
• TreePuzzle, estimated
• ClustalW, aligned
• infoalign (EMBOSS), scored
• MiniTab, statistics
• MS Excel, graphs plotted
• MiniTab, graphs plotted
Stage labels in the figure: Sequence Alignment; Tree Construction; Sequence and Tree Analysis; Result Visualisation
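The resource and task pairs listed on this slide can be held as a simple ordered sequence. The following is a minimal sketch, assuming the pairs are transcribed by hand from the slide; the names `Mention`, `resource_order` and `transitions` are illustrative and are not part of bioNerDS.

```python
from typing import List, NamedTuple, Optional, Tuple


class Mention(NamedTuple):
    resource: str
    task: Optional[str]  # None when the slide gives no task for the resource


# Pairs transcribed from the Lagerström et al. (2006) slide, in slide order.
MENTIONS: List[Mention] = [
    Mention("GenBank", None),
    Mention("BLAT", "aligned"),
    Mention("BLAST", "searched"),
    Mention("ClustalW", "aligned"),
    Mention("SEQBOOT", "bootstrapped"),
    Mention("TreePuzzle", "estimated"),
    Mention("ClustalW", "aligned"),
    Mention("infoalign", "scored"),
    Mention("MiniTab", "statistics"),
    Mention("MS Excel", "graphs plotted"),
    Mention("MiniTab", "graphs plotted"),
]


def resource_order(mentions: List[Mention]) -> List[str]:
    """Return resources in first-mention order, without repeats."""
    seen = set()
    ordered = []
    for mention in mentions:
        if mention.resource not in seen:
            seen.add(mention.resource)
            ordered.append(mention.resource)
    return ordered


def transitions(mentions: List[Mention]) -> List[Tuple[str, str]]:
    """Return consecutive (resource, next resource) pairs."""
    names = [mention.resource for mention in mentions]
    return list(zip(names, names[1:]))
```

Mention order is only a proxy here: as a later slide notes, the order in which resources are mentioned does not necessarily match the order in which they were used.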
![Page 11: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/11.jpg)
Example…
• Multiple methods
  – Usage counts
  – Recentness of use
  – “best-practice”
![Page 12: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/12.jpg)
Challenges - Ambiguity
• leg
• white
• cab
• HIV
  – Human immunodeficiency virus
  – Human immunovirus
• analysis
• Network
• graph
• DIP
  – distal interphalangeal
  – Database of Interacting Proteins
![Page 13: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/13.jpg)
Challenges - Variability
• Orthographics
  – Swiss Prot
  – SWISS-PROT
  – SwissProt
• Misspellings and typos
  – One paper, same resource, spelt 3 different ways
• Abbreviations
  – Different authors can use different acronyms for the same thing
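Orthographic variants such as the three Swiss-Prot spellings can be collapsed onto one key before dictionary look-up. This is a minimal sketch of one possible normalisation, not the rule bioNerDS actually applies; `normalise_name` is a hypothetical helper.

```python
import re


def normalise_name(name: str) -> str:
    """Lowercase a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# The three spellings listed on the slide.
variants = ["Swiss Prot", "SWISS-PROT", "SwissProt"]
keys = {normalise_name(variant) for variant in variants}
```

All three variants share a single key after normalisation. Misspellings and author-specific abbreviations are not handled by this step and would need fuzzy matching or an abbreviation table.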
![Page 14: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/14.jpg)
Name Composition
• Majority are single nouns
  – includes acronyms
• 6% lowercase common nouns
  – affy, bioconductor
• A few contained numbers
  – S4, t2prhd
• A few misclassified as verbs
  – …each query protein is first BLASTed with…
  – …held near their equilibrium values using SHAKE.
  – …graphical representations were achieved using dot v1.10…
![Page 15: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/15.jpg)
Name Composition
• Longest names (most tokens)
  – Corpus: 5 tokens (Gene Expression Profile Analysis Suite)
  – Dictionary: 12 tokens (Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences)
• Evaluated token frequencies within our dictionary
  – Long-tail curve
  – 87% used only once
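The token-frequency evaluation can be sketched as a simple count over whitespace-separated tokens. The four names below are a small illustrative sample, not the real dictionary, and `token_frequencies` and `singleton_fraction` are hypothetical helpers; the 87% figure on the slide comes from the full dictionary and is not reproduced by this toy input.

```python
from collections import Counter
from typing import Iterable

# Small illustrative sample of resource names (an assumption, not the real dictionary).
SAMPLE_NAMES = [
    "Gene Expression Profile Analysis Suite",
    "Gene Ontology",
    "Protein Data Bank",
    "Database of Interacting Proteins",
]


def token_frequencies(names: Iterable[str]) -> Counter:
    """Count how often each lowercased token occurs across all names."""
    counts = Counter()
    for name in names:
        counts.update(name.lower().split())
    return counts


def singleton_fraction(counts: Counter) -> float:
    """Fraction of distinct tokens that occur exactly once."""
    if not counts:
        return 0.0
    singletons = sum(1 for count in counts.values() if count == 1)
    return singletons / len(counts)


counts = token_frequencies(SAMPLE_NAMES)
fraction = singleton_fraction(counts)
```

Sorting `counts.most_common()` gives the long-tail curve: a few tokens at the head, followed by a long run of tokens seen once.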
![Page 16: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/16.jpg)
Figure: Token Frequency within Dictionary: Database and Software Names. x-axis: Top 150 Tokens (Words), ticks 0 to 150; y-axis: Token Frequency, ticks 0 to 120. Labelled tokens: genome, protein, gene, sequence, human.
![Page 17: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/17.jpg)
Named Entity Recognition (NER)
• Variety of NER uses
  – Species
  – Gene/protein names
  – Chemical names
• Variety of NER accuracy
  – 95% F-score species (LINNAEUS)
  – 73% F-score (strict) gene name (ABNER)
  – Over 70% F-score chemical names (OSCAR3)
![Page 18: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/18.jpg)
bioNerDS
• Automatically matches database and software names in the literature
  – Uses dictionary, rules and clues
• F-scores between 63 and 91%
  – Mixed results depending on corpus
  – Issues of multiple mentions of a single resource in one paper
  – Ambiguity and variability…
http://bionerds.sourceforge.net/
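The combination of a dictionary with contextual clues can be illustrated with a toy matcher. This is a sketch under stated assumptions, not the bioNerDS implementation: the three-entry `DICTIONARY`, the version-number clue in `VERSION_CLUE` and the function `find_mentions` are all hypothetical.

```python
import re
from typing import List, Tuple

# Illustrative mini-dictionary; the real bioNerDS dictionary is not reproduced here.
DICTIONARY = {"blast", "clustalw", "genbank"}

# Assumed clue: a name directly followed by a version number, e.g. "Phylip 3.6" or "dot v1.10".
VERSION_CLUE = re.compile(r"\b([A-Za-z][A-Za-z0-9]+) v?\d+(?:\.\d+)+")

TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def find_mentions(text: str) -> List[Tuple[int, str, str]]:
    """Return (offset, name, source) tuples sorted by offset in the text."""
    mentions = []
    # Dictionary look-up: case-insensitive match on whole tokens.
    for match in TOKEN.finditer(text):
        if match.group(0).lower() in DICTIONARY:
            mentions.append((match.start(), match.group(0), "dictionary"))
    # Clue-based detection of names that are missing from the dictionary.
    for match in VERSION_CLUE.finditer(text):
        name = match.group(1)
        if name.lower() not in DICTIONARY:
            mentions.append((match.start(1), name, "clue"))
    return sorted(mentions)
```

A single clue like this will produce false positives on ordinary words that happen to precede a number, which is why a real system would need further rules and filtering.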
![Page 19: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/19.jpg)
System Overview
Figure: bioNerDS processing pipeline. Labels recovered from the diagram:
• Pre-processing: Stanford parser, libstemmer
• bioNerDS dictionary
• Dictionary look-up
• Find unknown mentions
• Filter likely FPs
• Combine the scores
• Second pass with mentions above the threshold
![Page 20: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/20.jpg)
Preliminary Analysis of Resource Usage
• Used bioNerDS to extract name mentions from two journals:
  – Genome Biology
  – BMC Bioinformatics
• Analysed differences
![Page 21: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/21.jpg)
bioNerDS: Results
• Over 36,000 mentions in BMC Bioinformatics
• Over 15,000 mentions in Genome Biology
• 78% of Genome Biology and 98% of BMC Bioinformatics papers contained at least one resource mention
• The most-mentioned resources were: R, BLAST, GO, GenBank, GEO and PDB
• The general trend across both journals is that most major resources are declining in usage
![Page 22: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/22.jpg)
Relative Usage within the Top 50
Figure: two panels (Genome Biology; BMC Bioinformatics), x-axis: years 2001 to 2011. Series: BLAST, Bioconductor, ClustalW, Ensembl, GenBank, Gene Ontology, R, Swiss-Prot.
![Page 23: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/23.jpg)
bioNerDS: Full PMC Set
• Run on full open-access PMC set
  – ~230,000 full-text articles
  – ~1000 different journals
  – Extracted ~1.8M mentions
• Method?
• Method fingerprints
• Trying to extract (data-mine):
  – Ordering
  – Patterns
  – Co-occurrence
  – Relationships
  – Association rules
  – Frequent subsets
  – “Networks”
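Co-occurrence and frequent subsets can be illustrated by counting resource pairs that appear together in the same article. The article sets below are hypothetical, and `pair_counts` and `frequent_pairs` are illustrative helpers rather than the analysis actually run on the PMC set.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

# Hypothetical per-article resource sets, for illustration only.
ARTICLES = [
    {"BLAST", "ClustalW", "GenBank"},
    {"BLAST", "GenBank"},
    {"R", "Bioconductor"},
    {"BLAST", "ClustalW"},
]


def pair_counts(articles: Iterable[Set[str]]) -> Counter:
    """Count how many articles mention each unordered pair of resources."""
    counts = Counter()
    for resources in articles:
        # Sorting makes each pair canonical, so (A, B) and (B, A) are the same key.
        for pair in combinations(sorted(resources), 2):
            counts[pair] += 1
    return counts


def frequent_pairs(counts: Counter, min_support: int) -> Dict[Tuple[str, str], int]:
    """Keep only pairs seen in at least min_support articles."""
    return {pair: n for pair, n in counts.items() if n >= min_support}


counts = pair_counts(ARTICLES)
frequent = frequent_pairs(counts, min_support=2)
```

Pairs are the smallest frequent subsets; the same counting extends to triples and larger sets, at the cost of a rapidly growing number of candidates.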
![Page 24: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/24.jpg)
Method Analysis and Exploration
• Mining “best-practice”: Metrics
  – Most common
  – Newest
  – Who uses it
  – What resources is it comprised of
• Challenges
  – Scientific discourse – provenance information
  – Mention order does not imply order of use
• Clustering and associations
• Fingerprints
![Page 25: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/25.jpg)
Conclusion
• Literature mining bioinformatics in silico methods
• Developed bioNerDS: automated resource name extraction
• Extracting and analysing patterns of resource usage
  – Full PMC corpus
• Provided a way to extract method for any resource-based domain
  – Applied this to bioinformatics
![Page 26: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/26.jpg)
Thank you
• Acknowledgements
  – Supervisors:
    • Robert Stevens
    • Goran Nenadic
    • David Robertson
  – Funding:
![Page 27: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/27.jpg)
Resource Mentions per Journal

| Journal | Total Articles | Total Mentions | Ratio (Mentions per Article) |
|---|---|---|---|
| Nucleic Acids Research | 7,192 | 200,339 | 27.8558 |
| PLoS One | 15,791 | 168,624 | 10.6785 |
| BMC Bioinformatics | 3,982 | 149,668 | 37.5861 |
| BMC Genomics | 3,203 | 90,396 | 28.2223 |
| Genome Biology | 2,321 | 48,976 | 21.1012 |
| Acta Crystallographica. Section E, Structure Reports Online | 11,834 | 41,383 | 3.497 |
| BMC Evolutionary Biology | 1,570 | 31,222 | 19.8866 |
| PLoS Computational Biology | 1,613 | 30,185 | 18.7136 |
| PLoS Genetics | 1,876 | 29,734 | 15.8497 |
| PLoS Pathogens | 1,691 | 20,661 | 12.2182 |
![Page 28: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/28.jpg)
Named Entity Recognition (NER)
• Variety of NER uses
  – Species
  – Gene/protein names
  – Chemical names
• Evaluating NER
  – True positives, false positives, false negatives
  – Precision: tp / (tp + fp)
  – Recall: tp / (tp + fn)
  – F-score: 2 × P × R / (P + R)
![Page 29: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/29.jpg)
Named Entity Recognition (NER)
• Evaluating NER
  – True positives, false positives, false negatives
    • tp: Correct
    • fp: Returned incorrect
    • fn: Missed
  – Precision: tp / (tp + fp)
    • How accurate are the results we obtained
  – Recall: tp / (tp + fn)
    • How many of the total correct results did we obtain
  – F-score: 2 × P × R / (P + R)
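The three measures follow directly from the counts of true positives, false positives and false negatives. A minimal sketch follows; the function names and the example counts are illustrative, and returning 0.0 for an empty denominator is a convention assumed here.

```python
def precision(tp: int, fp: int) -> float:
    """tp / (tp + fp): how accurate the returned results are."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """tp / (tp + fn): how many of the correct results were returned."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts: 80 correct mentions, 20 spurious, 40 missed.
p = precision(80, 20)
r = recall(80, 40)
f = f_score(80, 20, 40)
```

With these counts precision is 0.8, recall is 2/3 and the F-score is 8/11, which lies between the two and closer to the lower value.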
![Page 30: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature](https://reader034.vdocuments.mx/reader034/viewer/2022042723/58736b161a28abe7648b7c37/html5/thumbnails/30.jpg)
Named Entity Recognition (NER)
• Evaluating NER
  – True positives, false positives, false negatives
  – Precision: tp / (tp + fp)
  – Recall: tp / (tp + fn)
  – F-score: 2 × P × R / (P + R)
• Variety of NER accuracy
  – 95% F-score species (LINNAEUS)
  – 73% F-score (strict) gene name (ABNER)
  – Over 70% F-score chemical names (OSCAR3)