GMOD Meeting, May 2005
Patent Pending, Caltech Proprietary
Textpresso: Search Engine for Biomedical Literature
~Eimear Kenny~
Born out of frustration….
• Search systems are effective at locating interesting papers ….. BUT …. you have to read the paper to get to the facts.
• Many data are not contained in the abstract or index …. therefore, important papers can be missed by search engines.
The Perfect System
Type in a question and the search engine tells you the answer!
Full text
“Conceptual search”
• Searches full text
  – returns any sentences that match your query
• Provides two ways to query
  – search raw data: Keyword search
  – search meta-data: Category search
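The two query modes can be illustrated with a minimal Python sketch. The in-memory sentence store, the (word, category) token layout, the category assignments, and the function names `keyword_search` and `category_search` are all assumptions made for illustration; they are not Textpresso's actual data model or API.

```python
# Hypothetical in-memory corpus: each sentence is a list of (word, category)
# pairs; category is None for words that carry no annotation.
SENTENCES = [
    [("lin-39", "gene"), ("acts", None), ("downstream", "pathway"),
     ("of", None), ("Ras", "gene")],
    [("The", None), ("vulva", "cell"), ("develops", None), ("normally", None)],
]


def keyword_search(sentences, keyword):
    """Return sentences containing the keyword as a whole word (case-insensitive)."""
    kw = keyword.lower()
    return [s for s in sentences if any(word.lower() == kw for word, _ in s)]


def category_search(sentences, category):
    """Return sentences with at least one word annotated with the category."""
    return [s for s in sentences if any(cat == category for _, cat in s)]


def as_text(sentence):
    """Join the words of a sentence back into a plain string."""
    return " ".join(word for word, _ in sentence)


keyword_hits = [as_text(s) for s in keyword_search(SENTENCES, "ras")]
category_hits = [as_text(s) for s in category_search(SENTENCES, "cell")]
print(keyword_hits)
print(category_hits)
```

The keyword query looks only at the raw words, while the category query looks only at the annotations, which is the distinction the slide draws between raw data and meta-data.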
Enter Textpresso
….. activation of let-7 RNA expression downregulates LIN-41 to relieve inhibition of lin-29.
  activation → Biological Process
  let-7 → Gene
  expression → Biological Process
  downregulates → Regulation
  LIN-41 → Molecular Function
  inhibition → Regulation
  lin-29 → Gene
<?xml version="1.0" encoding="ISO-8859-1" standalone="no" ?>
<!DOCTYPE article SYSTEM "/var/www/html/textpresso.dtd">
<article>
//
<sentence id='s7'>
//
  <process grammar='NN' source='textpresso' type='general' biosynthesis='no'> activation</process>
  <pposition grammar='IN' type='of'> of </pposition>
  <gene grammar='JJ' reference='direct'> let-7 </gene>
  <text>RNA</text>
  <process grammar='NN' source='textpresso' type='molecular' biosynthesis='expression'> expression</process>
  <regulation grammar='NNS' type='negative'> down regulates</regulation>
  <function grammar='NNP' reference='direct' source='textpresso' protein='yes'> LIN-41 </function>
  <pposition grammar='TO' type='to'>to </pposition>
  <text>relieve</text>
  <regulation grammar='NNS' type='negative'> inhibition </regulation>
  <pposition grammar='IN' type='of'> of</pposition>
  <gene grammar='NNP' reference='direct'> lin-29 </gene>
  <text>. </text>
</sentence>
//
</article>
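As a rough illustration of how such markup could be consumed, the sketch below reads an abbreviated sentence with Python's standard `xml.etree.ElementTree` and lists each element name with its text. The fragment is a shortened, well-formed copy of the slide's example (declaration, DOCTYPE, and most attributes left out), and the helper name `tagged_tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment modeled on the slide's markup; the XML declaration,
# DOCTYPE, and most attributes are omitted to keep the sketch short.
SENTENCE_XML = """
<sentence id='s7'>
  <process type='general'>activation</process>
  <pposition type='of'>of</pposition>
  <gene reference='direct'>let-7</gene>
  <text>RNA</text>
  <process type='molecular'>expression</process>
  <regulation type='negative'>down regulates</regulation>
  <function protein='yes'>LIN-41</function>
</sentence>
"""


def tagged_tokens(sentence_xml):
    """Return (element name, stripped text) for each child of <sentence>."""
    root = ET.fromstring(sentence_xml)
    return [(child.tag, (child.text or "").strip()) for child in root]


tokens = tagged_tokens(SENTENCE_XML)
genes = [text for tag, text in tokens if tag == "gene"]
categories = [tag for tag, _ in tokens]
print(tokens)
print(genes)
```

The list of element names (`categories`) is the kind of category sequence that a category query or a pattern model can then work on.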
Categories
GENE: Locus, let-60, eat-4, LIN-12
REGULATION: repress, enhanced, upregulate, inhibition
PATHWAY: precursor, upstream, cascade, descendants
CELL: Neuron, EMS, HSN, AB, Vulva precursor
37 Categories!!!
lin-39 acts downstream of Ras
lin-25 acts indirectly via sur-2
eor-1 and eor-2 are closely involved in Ras signaling
Find sentences from the literature that describe genetic interaction!
>= 2 named “Gene” && (>= 1 “Association” || >= 1 “Regulation”)
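The query above can be read as a simple predicate over the categories found in a sentence. The following Python sketch is one way to express it; the category lists attached to the example sentences are invented for illustration, and the sketch counts gene-tagged tokens without checking that the gene names are distinct.

```python
from collections import Counter


def matches_interaction_query(categories):
    """True if >= 2 gene tags and (>= 1 association or >= 1 regulation tag)."""
    counts = Counter(categories)
    return counts["gene"] >= 2 and (
        counts["association"] >= 1 or counts["regulation"] >= 1
    )


# Hypothetical category annotations for illustration only.
EXAMPLES = {
    "lin-39 acts downstream of Ras": ["gene", "association", "gene"],
    "lin-25 acts indirectly via sur-2": ["gene", "association", "gene"],
    "let-60 is expressed in the vulva": ["gene", "cell"],
}

selected = [text for text, cats in EXAMPLES.items() if matches_interaction_query(cats)]
print(selected)
```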
Using Textpresso to expedite curation
Sentences containing gene-gene interactions
Sampling 200 sentences ……
  Random:                      1 (0.5%)
  2 named genes:              13 (6.5%)
  2 named genes + 1 category: 39 (19.5%)
Adding Textpresso category enriches 3-fold!
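The percentages and the 3-fold figure follow directly from the counts on the slide. A short Python check, using only those numbers (the variable names are illustrative):

```python
SAMPLE_SIZE = 200  # sentences sampled per query type, as stated on the slide

# Number of sampled sentences that contained a gene-gene interaction.
hits = {
    "random": 1,
    "2 named genes": 13,
    "2 named genes + 1 category": 39,
}

rates = {name: count / SAMPLE_SIZE for name, count in hits.items()}
enrichment = hits["2 named genes + 1 category"] / hits["2 named genes"]

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
print(f"enrichment from adding a category: {enrichment:.1f}-fold")
```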
Installation and Adaptation of Textpresso for your Domain
Dependencies
• Tested on Red Hat 9.0 or Debian 3.1 (kernel 2.4.20 or higher)
  – should work on any Unix-based system
• Apache (1.3.29), Perl (5.6.1 or higher)
• Perl Modules:
  – XML::Parser
  – XML::RegExp
  – XML::XQL
  – XML::Checker
  – XML::DOM
  – XML::Parser::PerlSAX
  – PDF::Create
• Brill Tagger (needs a C compiler)
  – part-of-speech tagger (http://research.microsoft.com/~brill/)
• XPDF
  – pdftotext utility (http://www.foolabs.com/xpdf/)
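Before installing, it can help to check which of these dependencies are already present. The Python sketch below is one possible pre-flight check, not part of the Textpresso distribution: it looks up `perl` and `pdftotext` on the PATH and builds the usual `perl -MModule -e 1` commands for testing each Perl module. Apache and the Brill tagger are left out because their executable names vary between installations; the function names are hypothetical.

```python
import shutil

REQUIRED_TOOLS = ["perl", "pdftotext"]
PERL_MODULES = [
    "XML::Parser", "XML::RegExp", "XML::XQL", "XML::Checker",
    "XML::DOM", "XML::Parser::PerlSAX", "PDF::Create",
]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


def perl_module_checks(modules):
    """Return shell commands that exit non-zero if a Perl module is absent."""
    return [f"perl -M{module} -e 1" for module in modules]


print("missing executables:", missing_tools(REQUIRED_TOOLS))
for command in perl_module_checks(PERL_MODULES):
    print(command)
```

The `which` parameter exists only so the lookup can be replaced in tests; by default the standard `shutil.which` is used.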
Download
http://www.textpresso.org
http://www.gmod.org
Build Scripts
[Figure: build pipeline diagram. Sources: Journal Web-sites, Wormbase Database. Scripts: CollectPapers, CollectAbstracts, PDF2Text, Preprocessor, Text2XML, Index Maker. Data: Electronic PDF, Raw Text, Parts-of-speech Text, Annotated Text, Abstracts, Keywords. Other components: Textpresso Ontology, Textpresso Database.]
Tailoring Pt 1 – Text Collection
• Abstracts Collection
  – can be downloaded from a central resource such as PubMed
  – PubFetch!
• PDF Collection:
  – limited to open access journals (PLoS Biology) or journals to which you subscribe
  – inject_pmid script from Textpresso web-site (Allen Day)
  – manual download from journal web-site
Tailoring Pt 2 – Adapting Ontology
• Almost all “Relationship and Description” and “Syntax and Grammar” categories and some “Biological Concepts” categories are generic to the biomedical domain.
• Some new categories can use the existing category structure (e.g. yeast genes replace worm genes).
• Some de novo categories would be useful (Cell Cycle, Chromosomal Aberrations, Disease, etc.).
Overhaul Code
• Adding another layer of abstraction
  – definition files and modules
      use constant SY_ANNOTATION_FIELDS => { abstract => 'abstract/',
                                             body     => 'body/',
                                             title    => 'title/' };
    … defines which fields are to be annotated during the build process
• Advantages:
  – easy to adapt software (no script tweaking)
  – easy to add new modules
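The idea of driving the build from a definition rather than from edited scripts can be sketched as follows. This is a Python analogue of the Perl constant above, written only to illustrate the design; the `data/` root, the path layout, and the function name `annotation_targets` are assumptions and do not describe the real Textpresso build scripts.

```python
# Python analogue of SY_ANNOTATION_FIELDS: field name -> subdirectory.
ANNOTATION_FIELDS = {
    "abstract": "abstract/",
    "body": "body/",
    "title": "title/",
}


def annotation_targets(paper_id, fields=ANNOTATION_FIELDS, root="data/"):
    """List the (hypothetical) paths that a build step would annotate."""
    return [f"{root}{subdir}{paper_id}" for subdir in fields.values()]


# Changing which fields are annotated means changing the definition only;
# the function itself stays untouched.
print(annotation_targets("paper0001"))
print(annotation_targets("paper0001", fields={"title": "title/"}))
```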
Search for patterns in sentences
The life-extension phenotype of old-1 was completely suppressed by daf-16 ( m26 ) ( Figure 1e ) .
<determiner> <text> <phenotype> <preposition> <gene> <auxiliary> <effect> <regulation> <preposition> <gene> <bracket> <text> <bracket> <bracket> <text> <text> <bracket> <text>
Developed a hidden Markov model to identify common patterns of text surrounding the required entities.
Hidden Markov Model
[Figure: hidden Markov model diagram with Begin and End states, three Match states, and states labeled I; the tag labels shown are <gene>, <gene>, and <regulation>.]
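To make the scoring idea concrete, here is a minimal Python sketch of Viterbi scoring for a left-to-right model with match and insert states. Everything about the model is an assumption for illustration: the I states are read as insert states, the match states are ordered gene, regulation, gene, there are no delete states, and all probabilities are invented rather than trained. It is not the model used by the authors.

```python
import math

NEG_INF = float("-inf")


def viterbi_log_score(tags, match_tags, vocab, p_match=0.9):
    """Log-probability of the best path through a left-to-right HMM.

    Assumed layout: Begin, I0, M1, I1, ..., Mk, Ik, End. Every transition
    has probability 0.5: Begin and Mj go to Ij or to the next match state
    (End after Mk); Ij loops on itself or moves on in the same way.
    """
    n, k, v = len(tags), len(match_tags), len(vocab)
    log_half = math.log(0.5)
    log_insert = math.log(1.0 / v)  # insert states emit any tag uniformly

    def log_match(j, tag):
        """Match state j prefers its own tag; other tags share the rest."""
        if tag == match_tags[j - 1]:
            return math.log(p_match)
        return math.log((1.0 - p_match) / (v - 1))

    # m_score[j][i] / i_score[j][i]: best log-prob of being in Mj / Ij
    # after consuming the first i tags. Index 0 of m_score is Begin.
    m_score = [[NEG_INF] * (n + 1) for _ in range(k + 1)]
    i_score = [[NEG_INF] * (n + 1) for _ in range(k + 1)]
    m_score[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(k + 1):
            i_score[j][i] = (log_half + log_insert
                             + max(m_score[j][i - 1], i_score[j][i - 1]))
            if j >= 1:
                m_score[j][i] = (log_half + log_match(j, tags[i - 1])
                                 + max(m_score[j - 1][i - 1], i_score[j - 1][i - 1]))
    return log_half + max(m_score[k][n], i_score[k][n])


# Tag sequence of the example sentence about old-1 and daf-16.
TAGS = ("determiner text phenotype preposition gene auxiliary effect regulation "
        "preposition gene bracket text bracket bracket text text bracket text").split()
VOCAB = sorted(set(TAGS))
PATTERN = ["gene", "regulation", "gene"]

score_with_genes = viterbi_log_score(TAGS, PATTERN, VOCAB)
NO_GENES = ["text" if tag == "gene" else tag for tag in TAGS]
score_without_genes = viterbi_log_score(NO_GENES, PATTERN, VOCAB)
print(score_with_genes, score_without_genes)
```

With these assumed parameters, a sequence that contains the gene, regulation, gene pattern scores higher than the same sequence with the gene tags replaced, which is the property a score threshold would rely on.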
True test sentences have scores similar to those of the training sentences
Textpresso Team
Developers: Eimear Kenny, Hans-Michael Müller
Code Contributors:
  Allen Day (many patches including inject_pmid)
  Robert Li (alternative pdf2text converter)
  Stan Dong and Christopher Lane (code optimization for speed)
  Juancarlos Chan (web-site scripting)
Information Extraction Analysis: Andrei Petcherski
Paper Collection: Daniel Wang
Principal Investigator: Paul Sternberg