Post on 09-Jan-2017


Page 1: University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature

Extraction and Representation of in silico Biological Methods from the Literature

Geraint Duck

Supervisors: Robert Stevens, Goran Nenadic and David Robertson

Advisor: Joshua Knowles

School of Computer Science, University of Manchester

Page 2

Importance of Method in Science

• Understanding
  – Key part of research, central to science
  – Reproducibility and replication
  – What? Why? Where? How? When?
  – Extension

• Advise/evaluate
  – “Current Approach”
  – “Best Practice”

Page 3

Background

• In silico: performed on a computer, or through computer simulation

• Bioinformatics is a resource-focused domain
  – Numerous resources appearing
  – Literature is growing rapidly

• Resource availability and usage is central to biological research

• Current attempts often manually curated and/or incomplete

Page 4

The Method to Obtain a Method

1. Extraction
  – Automatically extract resource and task mentions from the bioinformatics literature
    • This presentation focuses on this step

2. Representation and Analysis
  – Evaluate the extracted mentions for patterns of representation

3. Exploration
  – Provide a means of exploring the methods extracted to aid other research/researchers

Page 5

Key Hypothesis: Resource ordering implies method

• An analogy – baking a cake:
  – Ingredients: butter, eggs, flour, sugar, etc…
  – Recipe/method: Set oven to 180°C, mix in a bowl the butter and sugar… Divide between tins, cook in oven for 30 mins…


Page 7

Example: Lagerström et al. (2006)

… all sequences were aligned … using … BLAT 3.0 … in which case the GenBank sequence was used… … divided … by BLAST searches … were combined into a FASTA file and aligned using … ClustalW 1.82 … The alignment was bootstrapped … using SEQBOOT from the … Phylip 3.6 package … [excerpt removed] … branch lengths were estimated in TreePuzzle using the following parameters … … constructed and scored automatically using a bash-script that utilized ClustalW as alignment engine and infoalign from the EMBOSS 2.8.0 package for scoring, … All statistical analysis was performed using MiniTab. Graphs were plotted using Microsoft Excel and MiniTab.


Page 10

Example: Lagerström et al. (2006)

Key: Resource; Task

• Sequence Alignment
  – GenBank
  – BLAT, aligned
  – BLAST, searched
  – ClustalW, aligned

• Tree Construction
  – SEQBOOT, bootstrapped (Phylip)
  – TreePuzzle, estimated

• Sequence and Tree Analysis
  – ClustalW, aligned
  – infoalign, scored (EMBOSS)

• Result Visualisation
  – MiniTab, statistics
  – MS Excel, graphs plotted
  – MiniTab, graphs plotted
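As an illustration only, the resource and task pairs read off this example could be held as an ordered list of steps grouped by stage. The names `Step`, `method`, `resource_order` and `stages` are hypothetical, and the assignment of steps to stages is one reading of the slide's layout, not a published schema.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Step:
    stage: str      # stage label as shown on the slide
    resource: str   # database or software name
    task: str       # task mentioned with the resource ("" if none)


# Steps in the order they are listed on the slide (assumed grouping).
method = [
    Step("Sequence Alignment", "GenBank", ""),
    Step("Sequence Alignment", "BLAT", "aligned"),
    Step("Sequence Alignment", "BLAST", "searched"),
    Step("Sequence Alignment", "ClustalW", "aligned"),
    Step("Tree Construction", "SEQBOOT", "bootstrapped"),
    Step("Tree Construction", "TreePuzzle", "estimated"),
    Step("Sequence and Tree Analysis", "ClustalW", "aligned"),
    Step("Sequence and Tree Analysis", "infoalign", "scored"),
    Step("Result Visualisation", "MiniTab", "statistics"),
    Step("Result Visualisation", "MS Excel", "graphs plotted"),
    Step("Result Visualisation", "MiniTab", "graphs plotted"),
]


def resource_order(steps):
    """Resource names in listed order, with immediate repeats collapsed."""
    return [name for name, _ in groupby(step.resource for step in steps)]


def stages(steps):
    """Consecutive steps grouped by stage, preserving order."""
    return [(stage, [s.resource for s in group])
            for stage, group in groupby(steps, key=lambda s: s.stage)]
```

Keeping the list ordered is what lets the "resource ordering implies method" hypothesis be tested later; a set of resource names alone would lose that information.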

Page 11

Example…

• Multiple methods
  – Usage counts
  – Recentness of use
  – “best-practice”

Page 12

Challenges - Ambiguity

• leg
• white
• cab

• HIV
  – Human immunodeficiency virus
  – Human immunovirus

• analysis
• Network
• graph

• DIP
  – distal interphalangeal
  – Database of Interacting Proteins

Page 13

Challenges - Variability

• Orthographics
  – Swiss Prot
  – SWISS-PROT
  – SwissProt

• Misspellings and typos
  – One paper, same resource, spelt 3 different ways

• Abbreviations
  – Different authors can use different acronyms for the same thing
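A minimal sketch of how the orthographic variants above could be collapsed to one key, and how a misspelling could be matched approximately. This is an assumption about one possible approach, not a description of how bioNerDS handles variability; `normalise` and `closest_known` are hypothetical names, and the 0.8 cutoff is arbitrary.

```python
import re
from difflib import get_close_matches


def normalise(name: str) -> str:
    """Lowercase a name and drop spaces, hyphens and underscores."""
    return re.sub(r"[\s\-_]+", "", name).lower()


def closest_known(name, known_keys, cutoff=0.8):
    """Return the known key most similar to the normalised name, or None."""
    matches = get_close_matches(normalise(name), list(known_keys),
                                n=1, cutoff=cutoff)
    return matches[0] if matches else None


variants = ["Swiss Prot", "SWISS-PROT", "SwissProt"]
keys = {normalise(v) for v in variants}   # all three collapse to one key
```

Normalisation covers spacing, case and hyphenation; approximate matching is a separate step for typos, and it would not resolve differing acronyms for the same resource.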

Page 14

Name Composition

• Majority are single nouns
  – includes acronyms

• 6% lowercase common nouns
  – affy, bioconductor

• A few contained numbers
  – S4, t2prhd

• A few misclassified as verbs
  – …each query protein is first BLASTed with…
  – …held near their equilibrium values using SHAKE.
  – …graphical representations were achieved using dot v1.10…

Page 15

Name Composition

• Longest Names (most tokens)
  – Corpus: 5 – Gene Expression Profile Analysis Suite
  – Dictionary: 12 – Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences

• Evaluated token frequencies within our dictionary
  – Long-tail curve
  – 87% used only once
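The token-frequency evaluation can be sketched as follows, assuming tokens are whitespace-separated, lowercased words. The four-name dictionary is a toy stand-in, so the resulting fraction is not expected to reproduce the 87% figure; `token_frequencies` and `fraction_used_once` are hypothetical names.

```python
from collections import Counter


def token_frequencies(names):
    """Count how often each lowercased token occurs across all names."""
    counts = Counter()
    for name in names:
        counts.update(name.lower().split())
    return counts


def fraction_used_once(counts):
    """Fraction of distinct tokens that occur exactly once."""
    if not counts:
        return 0.0
    once = sum(1 for c in counts.values() if c == 1)
    return once / len(counts)


# Toy dictionary for illustration only.
toy_dictionary = [
    "Gene Expression Profile Analysis Suite",
    "Gene Ontology",
    "Protein Data Bank",
    "Database of Interacting Proteins",
]
counts = token_frequencies(toy_dictionary)
share = fraction_used_once(counts)
```

Sorting `counts.most_common()` by frequency gives the long-tail curve directly: a few tokens with high counts followed by many tokens that occur once.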

Page 16

[Figure: “Token Frequency Within Dictionary Database and Software Names”. x-axis: Top 150 Tokens (words), 0 to 150; y-axis: Token Frequency, 0 to 120. Labelled tokens: genome; protein gene; sequence; human.]

Page 17

Named Entity Recognition (NER)

• Variety of NER uses
  – Species
  – Gene/protein names
  – Chemical names

• Variety of NER accuracy
  – 95% F-score species (LINNAEUS)
  – 73% F-score (strict) gene name (ABNER)
  – Over 70% F-score chemical names (OSCAR3)

Page 18

bioNerDS

• Automatically matches database and software names in the literature
  – Uses dictionary, rules and clues

• F-scores between 63 and 91%
  – Mixed results depending on corpus
  – Issues of multiple mentions of a single resource in one paper
  – Ambiguity and variability…

http://bionerds.sourceforge.net/
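To make "dictionary, rules and clues" concrete, here is a toy sketch of a dictionary look-up combined with a nearby clue word. It is not bioNerDS's actual rule set: the dictionary, the clue words, the three-token window and the name `find_mentions` are all assumptions made for illustration.

```python
import re

# Toy dictionary and clue words (assumed, not taken from bioNerDS).
DICTIONARY = {"BLAST", "ClustalW", "GenBank", "DIP"}
CLUES = {"database", "software", "package", "server", "tool"}


def find_mentions(sentence, window=3):
    """Return (name, has_clue) for each dictionary name in the sentence.

    has_clue is True when a clue word occurs within `window` tokens
    on either side of the matched name.
    """
    tokens = re.findall(r"[A-Za-z0-9\-]+", sentence)
    mentions = []
    for i, token in enumerate(tokens):
        if token in DICTIONARY:
            nearby = tokens[max(0, i - window): i + window + 1]
            has_clue = any(word.lower() in CLUES for word in nearby)
            mentions.append((token, has_clue))
    return mentions


resource_use = find_mentions("Interactions were taken from the DIP database.")
anatomy_use = find_mentions("Swelling of the DIP joint was recorded.")
```

The two example sentences show why a dictionary hit alone is not enough for an ambiguous name such as DIP: only the first has a supporting clue.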

Page 19

System Overview

[Figure: bioNerDS system overview. Legible labels: Stanford parser; libstemmer; bioNerDS; Dictionary; Dictionary look-up; Find unknown mentions; Filter likely FPs; Combine the scores; Second pass with mentions above the threshold.]

Page 20

Preliminary Analysis of Resource Usage

• Used bioNerDS to extract name mentions from two journals:
  – Genome Biology
  – BMC Bioinformatics

• Analysed differences

Page 21

bioNerDS: Results

• Over 36,000 mentions in BMC Bioinformatics

• Over 15,000 mentions in Genome Biology

• 78% of Genome Biology and 98% of BMC Bioinformatics papers contained at least one resource mention

• The most-mentioned resources were: R, BLAST, GO, GenBank, GEO and PDB

• The general trend across both journals has most major resources declining in usage

Page 22

[Figure: “Relative Usage within the Top 50”. Two panels, Genome Biology and BMC Bioinformatics, each with an x-axis of years 2001 to 2011. Series: BLAST, Bioconductor, ClustalW, Ensembl, GenBank, Gene Ontology, R, Swiss-Prot.]

Page 23

bioNerDS: Full PMC Set

• Run on full open-access PMC set
  – ~230,000 full-text articles
  – ~1000 different journals
  – Extracted ~1.8M mentions

• Method?
• Method fingerprints

• Trying to extract (data-mine):
  – Ordering
  – Patterns
  – Co-occurrence
  – Relationships
  – Association rules
  – Frequent subsets
  – “Networks”
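Two of the listed targets, co-occurrence and ordering, can be sketched with simple counting over per-article mention lists. The three articles below are made-up toy data, and `cooccurrence_counts` and `ordered_bigrams` are hypothetical names; nothing here reflects results from the PMC run.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(articles):
    """Count unordered resource pairs that appear in the same article."""
    pairs = Counter()
    for mentions in articles:
        for a, b in combinations(sorted(set(mentions)), 2):
            pairs[(a, b)] += 1
    return pairs


def ordered_bigrams(articles):
    """Count consecutive (earlier, later) mentions within each article."""
    bigrams = Counter()
    for mentions in articles:
        for a, b in zip(mentions, mentions[1:]):
            if a != b:  # ignore a resource repeated back to back
                bigrams[(a, b)] += 1
    return bigrams


# Toy input: one list of resource mentions per article, in mention order.
articles = [
    ["BLAST", "ClustalW", "Phylip"],
    ["BLAST", "ClustalW"],
    ["GenBank", "BLAST", "ClustalW"],
]
pairs = cooccurrence_counts(articles)
bigrams = ordered_bigrams(articles)
```

Pairs that co-occur often are candidates for association rules or frequent subsets, while the bigram counts keep the direction needed to study ordering.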

Page 24

Method Analysis and Exploration

• Mining “best-practice”: Metrics
  – Most common
  – Newest
  – Who uses it
  – Which resources it comprises

• Challenges
  – Scientific discourse – provenance information
  – Mention order does not imply order of use

• Clustering and associations
• Fingerprints
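The "most common" and "newest" metrics could be combined into a simple ranking, sketched below under the assumption that a method is represented as a tuple of resource names and that each use carries a publication year. The usage records are invented for illustration and `rank_methods` is a hypothetical name.

```python
from collections import defaultdict


def rank_methods(usages):
    """Rank methods by usage count, breaking ties by most recent year.

    usages: iterable of (method, year) pairs, where method is a tuple
    of resource names.
    """
    stats = defaultdict(lambda: {"count": 0, "latest": 0})
    for method, year in usages:
        stats[method]["count"] += 1
        stats[method]["latest"] = max(stats[method]["latest"], year)
    return sorted(stats.items(),
                  key=lambda item: (item[1]["count"], item[1]["latest"]),
                  reverse=True)


# Invented usage records, for illustration only.
usages = [
    (("BLAST", "ClustalW"), 2006),
    (("BLAST", "ClustalW"), 2009),
    (("BLAT", "ClustalW"), 2010),
]
ranked = rank_methods(usages)
```

Because mention order does not imply order of use, tuples built from raw mention order would need checking before any such ranking is read as best practice.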

Page 25

Conclusion

• Literature mining of in silico bioinformatics methods

• Developed bioNerDS: automated resource name extraction

• Extracting and analysing patterns of resource usage
  – Full PMC corpus

• Provided a way to extract method for any resource-based domain
  – Applied this to bioinformatics

Page 26

Thank you

• Acknowledgements
  – Supervisors:
    • Robert Stevens
    • Goran Nenadic
    • David Robertson
  – Funding:

Page 27

Resource Mentions per Journal

Journal                                                      Total Articles  Total Mentions    Ratio
Nucleic Acids Research                                                7,192         200,339  27.8558
PLoS One                                                             15,791         168,624  10.6785
BMC Bioinformatics                                                    3,982         149,668  37.5861
BMC Genomics                                                          3,203          90,396  28.2223
Genome Biology                                                        2,321          48,976  21.1012
Acta Crystallographica. Section E, Structure Reports Online          11,834          41,383    3.497
BMC Evolutionary Biology                                              1,570          31,222  19.8866
PLoS Computational Biology                                            1,613          30,185  18.7136
PLoS Genetics                                                         1,876          29,734  15.8497
PLoS Pathogens                                                        1,691          20,661  12.2182
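The Ratio column is total mentions divided by total articles. A short check on the first three rows, with the figures copied from the table; `mentions_per_article` is a hypothetical helper name.

```python
def mentions_per_article(articles: int, mentions: int) -> float:
    """Average number of resource mentions per article, to 4 decimal places."""
    return round(mentions / articles, 4)


# (journal, total articles, total mentions) as given in the table.
rows = [
    ("Nucleic Acids Research", 7192, 200339),
    ("PLoS One", 15791, 168624),
    ("BMC Bioinformatics", 3982, 149668),
]
ratios = {journal: mentions_per_article(a, m) for journal, a, m in rows}
```

Sorting by this ratio rather than by total mentions puts BMC Bioinformatics ahead of Nucleic Acids Research, which is why the table reports both.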


Page 29

Named Entity Recognition (NER)

• Evaluating NER
  – True positives, false positives, false negatives
    • tp: Correct
    • fp: Returned incorrect
    • fn: Missed
  – Precision: tp / (tp + fp)
    • How accurate are the results we obtained
  – Recall: tp / (tp + fn)
    • How many of the total correct results did we obtain
  – F-score: 2 x P x R / (P + R)
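The three measures translate directly into code. The sketch below follows the formulas on this slide; returning 0.0 when a denominator is zero is a convention assumed here, and the counts in the example are made up.

```python
def precision(tp: int, fp: int) -> float:
    """tp / (tp + fp): how many returned results were correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """tp / (tp + fn): how many correct results were returned."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_score(p: float, r: float) -> float:
    """2 x P x R / (P + R): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Made-up counts: 8 correct mentions found, 2 wrong ones returned, 4 missed.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=4)
f = f_score(p, r)
```

The harmonic mean penalises an imbalance between the two measures, so a system cannot reach a high F-score by maximising only precision or only recall.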

Page 30

Named Entity Recognition (NER)

• Evaluating NER
  – True positives, false positives, false negatives
  – Precision: tp / (tp + fp)
  – Recall: tp / (tp + fn)
  – F-score: 2 x P x R / (P + R)

• Variety of NER accuracy
  – 95% F-score species (LINNAEUS)
  – 73% F-score (strict) gene name (ABNER)
  – Over 70% F-score chemical names (OSCAR3)