  • Information Extraction from the Cancer Literature

    The Pediatric Hematology/Oncology Seminar Series
    Children's Hospital of Philadelphia
    March 8, 2005
    Philadelphia, PA

  • A Global Challenge

    Cell
      DNA sequence
      Genomic variation
      Microarrays
      RNAi
      Protein interactions

    Clinic
      Patient records
      Test results
      Clinical reports
      Procedures
      Phone calls

    Natural language understanding

  • Too Much Text

    Biomedical text
      15 million articles
      1.5 billion words

    Solution 1: Approximate
      What you can find
      What finds you

    Solution 2: Read everything
      Leukemia: 181,394 articles
      20/day = 25 years
      385,034 new articles by then

    Solution 3: Impose structure on the descriptions
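The "read everything" arithmetic can be checked in a few lines. This is a minimal sketch: the article counts and the reading rate come from the slide, while reading every calendar day (365.25 days per year) is an assumption made here.

```python
# Figures quoted on the slide
LEUKEMIA_ARTICLES = 181_394
ARTICLES_PER_DAY = 20
NEW_ARTICLES_BY_THEN = 385_034

# Assumption (not on the slide): reading happens every calendar day
DAYS_PER_YEAR = 365.25

days_needed = LEUKEMIA_ARTICLES / ARTICLES_PER_DAY
years_needed = days_needed / DAYS_PER_YEAR

# Implied publication rate while the backlog is being read
new_per_day = NEW_ARTICLES_BY_THEN / days_needed

print(f"Reading time: {days_needed:.0f} days, about {years_needed:.1f} years")
print(f"New articles arriving meanwhile: about {new_per_day:.1f} per day")
```

Under these assumptions the new literature arrives roughly twice as fast as it can be read, which is the point of the slide.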

  • IE Process

    Phase 1: Domain selection and definition

    Phase 2: Manual annotation

    Phase 3: Create and train machine-learning algorithms

    Phase 4: Active Annotation

    Phase 5: Utilization of annotations

  • Domain

    Biological Domains
      Genomic variations in malignancy
      Neuroblastoma

    Entity Classes
      Genes (genes, transcripts, proteins)
      Genomic variations (type, location, state)
      Malignant type
      Malignancy attributes
      Developmental state
      Clinical stage
      Histology
      Malignancy site
      Differentiation status
      Heredity status

  • Document Sets

    MEDLINE: Abstracts --> Full Text

    Annotation training set: 4,000 MEDLINE abstracts
      Genes commonly mutated in various malignancies
      Genes implicated in neuroblastoma

    Abstracts are manually annotated (dual pass)

    Results are used to train automated taggers

  • Workflow Management

  • Extraction Process

  • Parsing

    [Figure: parse diagram with word labels "MDS1", "gene", "alterations", "leukemia", "cause", "often", "Separate"]

  • Part-of-speech Tagging

  • Part-of-speech Tagging

  • Part-of-speech Tagging

  • Named Entity Recognition

  • Definitions: Process

    Initial Definitions: Domain Experts
      Analyze representative subset of text mentions
      Input of specific knowledge

    Manual Annotation
      Tag text with initial definitions
      Iterative re-definition process
      More text: Tighter and more robust definitions

    Widen Domain Expertise

    Publication and Utilization

  • Definitions

    [Figure labels: Individual Gene, Gene Superfamily, Gene Family]

  • Definitions

    Gene

    The Gene-Entity category includes genes as well as their downstream products such as transcripts and proteins, in addition to the more general groups of gene and protein families, super-families, and so forth. Note that the category name 'Gene-Entity' is not a completely accurate description of the members of this class, since the category includes things other than genes. However, most things in this class are genes, and everything is either a gene or gene-derived (transcripts and proteins). The diagram that follows attempts to illustrate this point and provides some examples.

    What is and What is Not Included?

    There are two ways to think about genes.

    1. Genes as conceptual entities. (This is what we want to capture.) Genes refer to segments of the genome which have been identified with a specific function or product (for example, the gene for eye color in a fly or a membrane receptor in humans). Although they are "things", they really represent abstract concepts. We can talk about the gene "K-Ras", but we are really referring to an abstract concept, an "ideal form" of the K-Ras gene, which has known attributes. We can't point to K-Ras; we can only point to instances of K-Ras. Each of these instances (a specific manifestation of the gene as described in #2 below) has the attributes and characteristics of the abstract concept of K-Ras, but the different instances of K-Ras may vary slightly from one another. (This parallels the concept of "species". We all have an intuitive grasp of the species concept and can tell most species apart: a grizzly bear from a polar bear. However, when we visit the zoo we encounter instances of a species -- individual bears -- and not the concept itself.) Although this may seem pedantic, there is an important reason for making this distinction, which we'll describe below.

    Let's consider some examples based upon this logic:

    a. For genes: c-kit, CD117, and alpha-smooth muscle actin

    b. A non-biology example: a 2003 Ferrari Modena. This is an abstract concept for a specific type of car. However, you can't point to an abstract 2003 Ferrari Modena; you can only point to specific instances, which may vary, even if slightly, from one another.

    c. K-Ras as investigated in Bob. This can be a tricky example since it would appear as though we are talking about a specific instance of K-Ras. But remember, in nearly all cases, genes are paired in humans (sometimes there are even more

  • Definitions

    Confounding Issues:

    Levels of specificity
      Protein/enzyme/kinase/tyrosine kinase/NTRK1
      TRK antibody
      Colon cancer vs. cancer of the colon

    Boundary issues
      Retinoblastoma
      Head and neck cancer
      MEN type 2B syndrome

  • Entity Annotation

  • Named Entity Recognition

  • Syntactic Analysis

  • Treebanking

  • Syntactic Analysis

  • Relation Tagging

  • Relation Tagging

  • Annotation Viewer

  • Annotations

    Annotation task     Start date   Annotated documents   Annotated words
    Pre-tagging         11/3/03      3834                  1,456,000
    Entity tagging      9/24/03      3829                  1,455,000
    POS tagging         8/27/03      2332                  886,160
    Treebanking         2/26/04      2300                  874,000
    Relation tagging    10/31/04     618                   234,000

  • Automated Algorithms

    Pretagger
      Assigns token, sentence, paragraph, section boundaries
      Nearly 100% accuracy
      Pipeline implementation: Finished

    Bio Part-of-speech tagger
      Assigns part-of-speech tags to tokens
      Uses pretagging annotations
      Accuracy of 97.3%
      Pipeline implementation: Finished
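To make the pretagger's job concrete, here is a toy illustration of sentence and token boundary assignment. It is a sketch only: the two regular expressions are assumptions chosen for this example, not the rules used by the pretagger described on the slide, and paragraph and section boundaries are not handled.

```python
import re

# Hypothetical rules: a token is a word (optionally hyphenated) or a single
# punctuation mark; a sentence ends at '.', '!' or '?' followed by whitespace.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def pretag(text):
    """Split text into sentences, then each sentence into tokens."""
    sentences = [s for s in SENTENCE_SPLIT_RE.split(text.strip()) if s]
    return [TOKEN_RE.findall(sentence) for sentence in sentences]


tagged = pretag("Missense mutation at codon 45 (TCT to TTT). KDR encodes VEGFR-2.")
for sentence_tokens in tagged:
    print(sentence_tokens)
```

Keeping hyphenated strings such as "VEGFR-2" as one token matters downstream, because gene names are often written that way.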

  • Entity Taggers

    Entity Taggers: Automated, machine-learning algorithms for named entity recognition in text

    Goals
      Highly accurate, precision > recall
      Rapid deployment
      Flexible design

    Technique
      Conditional random fields
      Text feature-based
      Uses pretagging, POS annotations
      Probabilistic maximization of feature weights
      Corrects for overfitting
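A "text feature-based" tagger sees each token as a bundle of features. The sketch below shows what such a bundle might look like when built from token and POS annotations. The feature names and the choice of features are illustrative assumptions; the slide does not list the features the actual taggers use, and no CRF training is shown here.

```python
def token_features(tokens, pos_tags, i):
    """Build a feature dictionary for token i (hypothetical feature set)."""
    word = tokens[i]
    features = {
        "lower": word.lower(),
        "pos": pos_tags[i],
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
        "is_upper": word.isupper(),
        "suffix3": word[-3:].lower(),
    }
    # Context features from the neighbouring tokens
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
        features["prev_pos"] = pos_tags[i - 1]
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
        features["next_pos"] = pos_tags[i + 1]
    else:
        features["EOS"] = True  # end of sentence
    return features


# Made-up sentence and POS tags, for illustration only
tokens = ["Mutations", "in", "VEGFR-2", "were", "found"]
pos_tags = ["NNS", "IN", "NN", "VBD", "VBN"]
features = [token_features(tokens, pos_tags, i) for i in range(len(tokens))]
print(features[2])
```

A CRF learns one weight per feature and label combination, so surface cues like digits, hyphens, and capitalization can push a token such as "VEGFR-2" toward a gene label.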

  • Entity Taggers

    GeneTaggerCRF
      Tags gene symbols, names, and descriptions
      KDR, VEGFR-2, VEGF receptor-2
      vascular endothelial growth factor receptor type 2
      86% precision/79% recall
      Pipeline implementation: Imminent

    VTag
      Simultaneously tags variation types, locations, states
      point mutation, loss of heterozygosity
      codon 12, 11q23, base pair 17, Ki-ras
      GGT, glycine, Asp
      85% precision/79% recall
      Pipeline implementation: Imminent

  • Entity Taggers

    Mtag
      Tags malignant type labels
      acute myeloid leukemias (AMLs)
      translocation t(9;11)-positive leukemia
      NB
      transitional cell carcinoma of the bladder
      Hypoplastic myelodysplastic syndrome
      predominantly cystic bilateral neuroblastomas
      85% precision/82% recall
      Pipeline implementation: Imminent

  • Entity Taggers

  • Relation Tagger

    Relation Taggers: Identifying relationships between entities

    Given this text:
      Missense mutation at codon 45 (TCT to TTT)

    Can we automatically identify:
      1. Pairwise associations [(codon 45 and TCT); (TCT and TTT); etc.]
      2. The entire mutation event:

    VARIATION EVENT #60609
      Variation type: missense mutation
      Variation location: codon 45
      Variation state 1: TCT
      Variation state 2: TTT
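The variation event shown on the slide maps naturally onto a small record type. The following is a minimal sketch: the class and field names are hypothetical, and enumerating every unordered pair of entities as a candidate association is an assumption (the slide names only two pairs and then says "etc.").

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class VariationEvent:
    """One variation event; field names are illustrative, not from the source."""
    event_id: int
    variation_type: str
    location: str
    state_1: str
    state_2: str

    def entities(self):
        return [self.variation_type, self.location, self.state_1, self.state_2]

    def candidate_pairs(self):
        # Assumption: every unordered pair of entities is a candidate association
        return list(combinations(self.entities(), 2))


event = VariationEvent(60609, "missense mutation", "codon 45", "TCT", "TTT")
pairs = event.candidate_pairs()
for pair in pairs:
    print(pair)
```

Seen this way, the two questions on the slide differ in granularity: the first asks for individual pairs, the second for the whole record.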

  • Relation Tagger

    Goals: Accurate, rapid, flexible

    Technique
      Maximum entropy
      Feature-based probabilistic model
      Events built upon binary associations
      Uses pretagging, POS, and entity annotations

    Domain
      Genomic variation events
      Tested on 447 abstracts: 1218 relations, 4773 entities
      38% of relations were non-binary
      Baseline: Two entities within 5 words = related
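The baseline rule is simple enough to write down directly. This sketch assumes that "within 5 words" means at most five tokens lie between the end of one mention and the start of the next; the slide does not define the distance measure, and the token offsets below were assigned by hand for the example sentence.

```python
from itertools import combinations


def baseline_related(entities, max_gap=5):
    """Pair up entities whose mentions are at most max_gap tokens apart.

    entities: list of (text, start, end) with token offsets, end exclusive.
    """
    ordered = sorted(entities, key=lambda entity: entity[1])
    related = []
    for first, second in combinations(ordered, 2):
        gap = second[1] - first[2]  # tokens between the two mentions
        if gap <= max_gap:
            related.append((first[0], second[0]))
    return related


# Tokens: Missense mutation at codon 45 ( TCT to TTT )
#         0        1        2  3     4  5 6   7  8   9
entities = [
    ("missense mutation", 0, 2),
    ("codon 45", 3, 5),
    ("TCT", 6, 7),
    ("TTT", 8, 9),
]
related_pairs = baseline_related(entities)
for pair in related_pairs:
    print(pair)
```

With this reading of the rule, the mutation type and the second state fall six tokens apart and are left unrelated, which shows how a pure distance cutoff can miss part of an event.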

  • Relation Tagger

    Results
      Binary
        Tagger: 77% precision/82% recall
        Baseline: 66% precision/77% recall
      Event-wide
        Tagger: 63% precision/77% recall
        Baseline: 43% precision/66% recall

    Example
      most common base change was a A -> G transition at codon 12 or 13

    Manual annotation:
      (transition, codon 12, A, G)
      (transition, codon 13, A, G)

    Automated annotation:
      (transition, codon 12, A, G)
      (transition, codon 13, A, G)
      (base change, codon 12, A, G)
      (base change, codon 13, A, G)
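The example on this slide can be scored to show how extra events affect the two metrics. The sketch below assumes exact-match scoring of whole event tuples, which may differ from how the reported figures were computed; the tuples themselves are copied from the slide.

```python
manual = {
    ("transition", "codon 12", "A", "G"),
    ("transition", "codon 13", "A", "G"),
}
automated = {
    ("transition", "codon 12", "A", "G"),
    ("transition", "codon 13", "A", "G"),
    ("base change", "codon 12", "A", "G"),
    ("base change", "codon 13", "A", "G"),
}

# Assumption: an automated event counts as correct only on an exact match
true_positives = len(manual & automated)
precision = true_positives / len(automated)
recall = true_positives / len(manual)

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

For this one sentence every manual event is recovered, but the two extra "base change" events halve the precision. That is the same direction as the gap between event-wide precision and recall in the results.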

  • Data Management

  • Annotation Pipeline

    [Figure: pipeline diagram with boxes labeled POS tagging, Document, Pretagging, Entity tagging, Relation tagging, Treebanking, Database, Normalization, Integration, Interface, Propbanking]
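The tagger slides state which annotations each automated stage uses: the POS tagger uses pretagging, the entity taggers use pretagging and POS, and the relation tagger uses pretagging, POS, and entity annotations. A minimal sketch can derive a run order from those stated dependencies. The stage identifiers are hypothetical, and treebanking, propbanking, normalization, and the database steps are left out because their inputs are not stated. `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages whose annotations it uses
dependencies = {
    "pretagging": set(),
    "pos_tagging": {"pretagging"},
    "entity_tagging": {"pretagging", "pos_tagging"},
    "relation_tagging": {"pretagging", "pos_tagging", "entity_tagging"},
}

# static_order() yields every stage after all of its prerequisites
order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(order))
```

Declaring dependencies rather than hard-coding the order means a new tagger can be added by listing what it needs.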

  • Annotation Pipeline

    Carolyn Felix

  • Biomedical Annotation Database

    Annotation Retrieval

  • Applications: Entity Lists

    What is this all good for, anyway?

    Objective: To align the literature with genomic objects

    Goal: Can we replicate a manually curated list of genes implicated in a biological process?

    Domain: Angiogenesis

    Rationale: To focus on the subset of genes implicated in the process of angiogenesis from whole-genome expression profiling

  • Applications: Entity Lists

    The manual list
      Genes represented on the Affy U133 chips
      340 genes, identified through:
        Prior knowledge
        Literature reviews
        PubMed searches
        Gene Ontology codes
        Gene family-based inference

  • Applications: Entity Lists

    The automated list
      Twelve partially specific angiogenic terms
      Concordancy searching of MEDLINE: 41,276 abstracts
      Trained GeneTaggerCRF with ~100 hand-annotated angiogenesis abstracts
      Tagged the document set
        104,118 mentions
        22,662 non-redundant mentions

  • Applications: Entity Lists

    Normalization

    Human gene/alias/identifier list
      Compiled identifiers from 19 public databases
      302,976 entries
      156,860 non-redundant entries
      All entries mapped to 25,096 official gene symbols

    Aligned normalized gene and tagged gene lists
      50.01% of entries matched a known gene term
      2,389 identified genes
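The normalization step amounts to collapsing redundant mentions and looking each one up in an alias table. The sketch below uses toy data only: the three aliases are the KDR examples from the GeneTaggerCRF slide, the mention list is invented for illustration, and the normalization rule (case-insensitive, whitespace collapsed) is an assumption, not the rule used in the study.

```python
def normalize(term):
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(term.lower().split())


# Toy alias table: every alias points to one official gene symbol
alias_to_symbol = {
    normalize(alias): "KDR"
    for alias in ["KDR", "VEGFR-2", "VEGF receptor-2"]
}

# Invented tagger output with redundant and unknown mentions
mentions = ["VEGFR-2", "vegfr-2", "KDR", "VEGF  receptor-2", "unknown factor X"]

unique_mentions = sorted({normalize(mention) for mention in mentions})
matched = {m: alias_to_symbol[m] for m in unique_mentions if m in alias_to_symbol}

match_rate = len(matched) / len(unique_mentions)
identified_genes = set(matched.values())

print(f"{len(unique_mentions)} non-redundant mentions, {match_rate:.0%} matched")
print(f"identified genes: {sorted(identified_genes)}")
```

Several distinct mentions can resolve to the same official symbol, so the number of identified genes is much smaller than the number of matched mentions.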

  • Applications: Entity Lists

    Gene    Description                           Frequency
    VEGF    Vascular endothelial growth factor

