Biomedical Information Extraction
Post on 20-Dec-2015
Outline
  Intro to biomedical information extraction
  PASTA [Demetriou and Gaizauskas]
  Biomedical named entities
  Name variability [Cohen, Dolbey, Acquaah-Mensah, and Hunter]
  Name tagging [Tanabe and Wilbur]
Terminology Tagging
  protein
  species
  residue
  site
  region
  secondary structure
  supersecondary structure
  quaternary structure
  base
  atom
  non-protein compound
  interaction
Template Filling
  residue :=
    NAME: string
    SITE/FUN: string
    SEC_STRUCT: string
    QUAT_STRUCT: string
    REGION: string
    INTERACTION: string
  in_protein :=
    RESIDUE: residue
    PROTEIN: protein
  protein :=
    NAME: string
  species :=
    NAME: string
  in_species :=
    PROTEIN: protein
    SPECIES: species
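
The template definitions can be read as record types with slots. A minimal sketch in Python follows, assuming that every slot other than NAME may be left unfilled (the slides do not say which slots are optional). The class names and the example values are illustrative and do not come from the PASTA system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Protein:
    name: str


@dataclass
class Species:
    name: str


@dataclass
class Residue:
    name: str
    # Assumption: slots other than NAME may stay empty when the text gives no filler.
    site_fun: Optional[str] = None
    sec_struct: Optional[str] = None
    quat_struct: Optional[str] = None
    region: Optional[str] = None
    interaction: Optional[str] = None


@dataclass
class InProtein:
    """Relation template: a residue located in a protein."""
    residue: Residue
    protein: Protein


@dataclass
class InSpecies:
    """Relation template: a protein found in a species."""
    protein: Protein
    species: Species


# Illustrative fillers only, not taken from the slides.
residue = Residue(name="SER-195", site_fun="active site")
protein = Protein(name="chymotrypsin")
in_protein = InProtein(residue=residue, protein=protein)
in_species = InSpecies(protein=protein, species=Species(name="bovine"))
```

Keeping in_protein and in_species as separate relation records, as the slide does, lets one residue or protein take part in several relations without copying its slots.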
PASTA Architecture: Terminological Processing
  Morphological analysis
    biochemical morphemes, e.g. "-ase"
  Lexical lookup
    token lookup in databases
    token grammatical class tagging
  Terminology parsing
    create multi-token terms
    rule-based parsing using grammatical tags
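
The three terminological steps can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not the PASTA implementation: the lexicon, the class labels ("protein", "protein_modifier", "other") and the single merge rule are hypothetical, and only the "-ase" suffix comes from the slides.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mini-lexicon standing in for the databases used for lexical lookup.
LEXICON: Dict[str, str] = {"alcohol": "protein_modifier"}


def morphological_class(token: str) -> Optional[str]:
    """Morphological analysis: guess a class from a biochemical suffix."""
    if len(token) > 3 and token.lower().endswith("ase"):
        return "protein"
    return None


def lexical_lookup(token: str, lexicon: Dict[str, str]) -> str:
    """Lexical lookup: lexicon entry first, then the suffix guess, else 'other'."""
    return lexicon.get(token.lower()) or morphological_class(token) or "other"


def tag_terms(tokens: List[str], lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Terminology parsing with one assumed rule: modifier* head -> protein term."""
    terms: List[Tuple[str, str]] = []
    pending: List[str] = []  # modifier tokens waiting for a head
    for token in tokens:
        cls = lexical_lookup(token, lexicon)
        if cls == "protein_modifier":
            pending.append(token)
        elif cls == "protein":
            terms.append((" ".join(pending + [token]), "protein"))
            pending = []
        else:
            pending = []  # a non-term token breaks the candidate term
    return terms


terms = tag_terms("The enzyme alcohol dehydrogenase binds NAD".split(), LEXICON)
```

A bare suffix test over-generates (words such as "base" also end in "-ase"), which is one reason the lexicon entry is consulted before the morphological guess in this sketch.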
PASTA Architecture
  Syntactic and Semantic Processing
    Part-of-speech tags
    Phrase structure
    Compositional semantics
  Discourse Processing
    Semantic representations incorporated into a discourse model of concept hierarchy and inference rules
PASTA Architecture: Template Extraction
  Scan discourse model for template instances
  Check slots
  Build template
Performance (R = recall, P = precision)
               Dev       Inter-annotator   Test
  Terminology  88R/94P   92R/86P           82R/84P
  Template     69R/79P   78R/80P           69R/64P
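
To compare the recall/precision pairs with a single number, each pair can be combined into a balanced F-measure, F1 = 2RP / (R + P). The slides report only R and P, so the F1 values are a derived convenience rather than published results. The sketch assumes the figures are percentages; the helper names are hypothetical.

```python
import re
from typing import Tuple


def parse_rp(cell: str) -> Tuple[float, float]:
    """Parse a cell such as '82R/84P' into (recall, precision) as fractions."""
    match = re.fullmatch(r"(\d+)R/(\d+)P", cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(1)) / 100, int(match.group(2)) / 100


def f1(recall: float, precision: float) -> float:
    """Balanced F-measure: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Cells copied from the performance table (test column).
test_cells = {"Terminology": "82R/84P", "Template": "69R/64P"}
test_f1 = {task: f1(*parse_rp(cell)) for task, cell in test_cells.items()}

for task, score in test_f1.items():
    print(f"{task}: F1 = {score:.3f}")
```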
PASTAWeb
  Index
    document -> terminology, template
    terms -> templates from multiple documents
  IE tools need to be incorporated into effective interfaces for biology researchers
Contrast and Variability [Cohen, Dolbey, Acquaah-Mensah, and Hunter]
  Named Entities
    location vs. identification
Variability
  somatotropin
  rat somatotropin
  growth hormone
Variability
  Non-contrast (synonyms)
    tumor protein homolog vs tumour protein homologue
  Contrast (diffonyms?)
    ACE1 vs ACE2
Transformations
  1. Remove first character
  2. Remove first word
  3. Remove last character
  4. Remove last word
  5. Replace sequence of vowels with one letter
  6. Replace hyphen with space
  7. Remove parenthesized material
  8. Convert to lowercase
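
The eight transformations can be sketched as small string functions, together with a probe that reports which single transformation makes two names identical. Several details are assumptions: the placeholder letter "V" for a vowel run, whitespace as the word separator, and ignoring transformations that reduce a name to the empty string. The function names are hypothetical.

```python
import re


def remove_first_char(name):
    return name[1:]


def remove_first_word(name):
    return " ".join(name.split()[1:])


def remove_last_char(name):
    return name[:-1]


def remove_last_word(name):
    return " ".join(name.split()[:-1])


def collapse_vowels(name, placeholder="V"):
    # Assumption: every run of vowels becomes the same single placeholder letter.
    return re.sub(r"[aeiouAEIOU]+", placeholder, name)


def hyphen_to_space(name):
    return name.replace("-", " ")


def drop_parenthesized(name):
    return re.sub(r"\s*\([^)]*\)", "", name).strip()


def lowercase(name):
    return name.lower()


TRANSFORMATIONS = [
    remove_first_char, remove_first_word, remove_last_char, remove_last_word,
    collapse_vowels, hyphen_to_space, drop_parenthesized, lowercase,
]


def equalizing_transformations(a, b):
    """Names of the transformations under which a and b become the same non-empty string."""
    hits = []
    for transform in TRANSFORMATIONS:
        ta, tb = transform(a), transform(b)
        if ta and ta == tb:  # skip results that collapsed to the empty string
            hits.append(transform.__name__)
    return hits


spelling_variant = equalizing_transformations("tumor", "tumour")
contrast_pair = equalizing_transformations("ACE1", "ACE2")
```

In this sketch the spelling variants from the slides are unified only by the vowel-sequence transformation, while ACE1 and ACE2 are unified only by removing the last character.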
Experiment
  Collect groups of synonym gene names
  Get mouse, rat, and human genes from LocusLink
  Group OFFICIAL GENE NAME, PREFERRED GENE NAME, OFFICIAL SYMBOL, PREFERRED SYMBOL, PRODUCT, PREFERRED PRODUCT, ALIAS SYMBOL, ALIAS PROT entries together as synonyms
Results
  LMW, RMC, RMW identify contrastive variability
    Contrasts likely marked at name boundaries
  VS, HYPH, CASE, PM identify non-contrastive variability
Pattern Heuristics
  1. Equivalence of vowel sequences
  2. Optionality of hyphens
  3. Optionality of parenthesized material
  4. Case insensitivity
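
One way to use the four heuristics is to fold them into a single normalization key, so that names differing only in non-contrastive ways compare equal. This is a sketch under assumptions, not the authors' procedure: the order of the steps, the placeholder "V" for a vowel run, and treating a hyphen as a space are all choices made here for illustration.

```python
import re


def normalize(name: str) -> str:
    """Build a matching key that ignores the four non-contrastive differences."""
    key = re.sub(r"\([^)]*\)", " ", name)   # 3. parenthesized material is optional
    key = key.replace("-", " ")             # 2. hyphens are optional
    key = key.lower()                       # 4. case insensitivity
    key = re.sub(r"[aeiou]+", "V", key)     # 1. vowel sequences are equivalent
    return " ".join(key.split())            # tidy leftover whitespace


def same_entity_candidate(a: str, b: str) -> bool:
    """True when two names differ only in ways the heuristics treat as non-contrastive."""
    return normalize(a) == normalize(b)


print(same_entity_candidate("Tumour protein", "tumor protein"))
print(same_entity_candidate("ACE1", "ACE2"))
```

The final character is left untouched, so contrasts marked at a name boundary, such as ACE1 vs ACE2, survive normalization.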
Tagging Genes and Proteins [Tanabe and Wilbur]
ABGene
  Trained on MEDLINE abstracts
  Tested on PUBMED full texts
ABGene
  Transformation-based tagger
  False-positive and false-negative filters
  Compound term recovery
  Document ranking
Transformation-Based Tagging
  Learns a sequence of transformation rules of the form A -> B / C (change tag A to tag B in context C)
  Rules are chosen greedily, based on the number of errors corrected in the training data tags
  Applies the rules sequentially to tag new text
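
The greedy learning loop can be sketched on a toy scale. The sketch assumes a single rule template, "change tag A to B when the previous tag is C", and a single training sequence; a real transformation-based tagger uses many templates and a full corpus. All names are hypothetical.

```python
from typing import List, Optional, Tuple

Rule = Tuple[str, str, str]  # (A, B, C): change A to B when the previous tag is C


def apply_rule(tags: List[str], rule: Rule) -> List[str]:
    """Apply one rule; contexts are read from the tags as they were before the rule fired."""
    a, b, c = rule
    return [b if tag == a and i > 0 and tags[i - 1] == c else tag
            for i, tag in enumerate(tags)]


def count_errors(tags: List[str], gold: List[str]) -> int:
    return sum(tag != ref for tag, ref in zip(tags, gold))


def learn_rules(initial: List[str], gold: List[str], max_rules: int = 10) -> List[Rule]:
    """Greedy loop: repeatedly keep the rule that corrects the most errors."""
    current = list(initial)
    learned: List[Rule] = []
    for _ in range(max_rules):
        # Propose one candidate rule at every position that is still wrong.
        candidates = {(current[i], gold[i], current[i - 1])
                      for i in range(1, len(current)) if current[i] != gold[i]}
        best: Optional[Rule] = None
        best_gain = 0
        for rule in sorted(candidates):
            gain = count_errors(current, gold) - count_errors(apply_rule(current, rule), gold)
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:  # no candidate reduces the error count any further
            break
        current = apply_rule(current, best)
        learned.append(best)
    return learned


# Toy data: baseline POS tags and a gold standard that marks gene names as GENE.
baseline = ["DT", "NNP", "NN", "VBZ", "DT", "NNP", "NN"]
gold = ["DT", "GENE", "NN", "VBZ", "DT", "GENE", "NN"]
rules = learn_rules(baseline, gold)
```

The learned list is ordered, and that order is what the tagger replays when it applies the rules sequentially to new text.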
Gene Transformations
  GENE added as an additional POS tag
  NNP -> GENE / gene fgoodleft
  * -> GENE / hassuf -A
  * -> GENE / haspref c-
  NNP -> GENE / prev1or2wd genes
  NNP -> GENE / nextbigram ( GENE
  VBG -> JJ / nexttag GENE
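
Applying an ordered rule list to tagged text can be sketched as follows. The sketch covers four of the rules above and takes the predicate meanings from their names, which is an assumption: hassuf and haspref test the word's suffix and prefix, prev1or2wd tests the two preceding words, and nexttag tests the following tag. The fgoodleft and nextbigram rules are left out because their exact semantics are not given in the slides. The sentence and its baseline tags are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Token = Tuple[str, str]  # (word, tag)
Condition = Callable[[List[Token], int], bool]


@dataclass
class Rule:
    from_tag: str  # "*" matches any tag
    to_tag: str
    condition: Condition


def hassuf(suffix: str) -> Condition:
    return lambda toks, i: toks[i][0].endswith(suffix)


def haspref(prefix: str) -> Condition:
    return lambda toks, i: toks[i][0].startswith(prefix)


def prev1or2wd(word: str) -> Condition:
    return lambda toks, i: word in [w for w, _ in toks[max(0, i - 2):i]]


def nexttag(tag: str) -> Condition:
    return lambda toks, i: i + 1 < len(toks) and toks[i + 1][1] == tag


def apply_rules(tokens: List[Token], rules: List[Rule]) -> List[Token]:
    """Apply rules in order; each rule sees the output of all earlier rules."""
    current = list(tokens)
    for rule in rules:
        snapshot = current  # conditions are evaluated on the state before this rule
        current = [
            (word, rule.to_tag)
            if rule.from_tag in ("*", tag) and rule.condition(snapshot, i)
            else (word, tag)
            for i, (word, tag) in enumerate(snapshot)
        ]
    return current


RULES = [
    Rule("*", "GENE", haspref("c-")),
    Rule("NNP", "GENE", prev1or2wd("genes")),
    Rule("*", "GENE", hassuf("A")),
    Rule("VBG", "JJ", nexttag("GENE")),
]

sentence = [("the", "DT"), ("genes", "NNS"), ("Rad51", "NNP"), ("and", "CC"),
            ("c-myc", "NN"), ("encoding", "VBG"), ("BRCA", "NNP")]
tagged = apply_rules(sentence, RULES)
```

Order matters here: the VBG -> JJ rule fires on "encoding" only because an earlier rule has already retagged the following word as GENE.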
Problems in Full Text
  Terms that do not appear in abstracts
    restriction enzyme site, lab protocol kits, primers, vectors, supply companies, chemical reagents
  Figures and tables