
Extracting biological names and relations from texts

Ting-Yi Sung 宋定懿
Bioinformatics Program, TIGP
Institute of Information Science, Academia Sinica

2004/12/16

Motivation

To automatically extract information from natural language text.
The need arises from the rapid accumulation of biomedical literature.
  Expedite survey efforts
  Support database curation (automatically associate papers with database records)

Targets of Information Extraction
  Protein-Protein interaction/binding/inhibition
  Protein-Small Molecules
  Gene-Gene regulation
  Gene-Gene Product interaction
  Gene-Drug relation
  Protein-Subcellular location
  Amino Acid-Protein relation

Example relationships between genes and drugs:
  The gene is the drug target
  The gene confers resistance to the drug
  The gene metabolizes the drug

Information Extraction Tasks
  Identify Target Named Entities
  Identify Relations among Named Entities
  Identify Relations among Events and Named Entities
  Associate Results with existing database records

Outline
  NER (named entity recognition) in biomedical domain
  Challenges in biomedical NER
  State of progress in NER
  Abbreviation disambiguation
  Future works

What is NER?
NER: Named Entity Recognition
Including two tasks
  Identification of proper names in text
  Classification of proper names in text
Newswire Domain
  Person, Location, Organization
Biomedical Domain
  Protein, DNA, RNA, Body Part, Cell Type, Lipid, etc.

Example of NER - Biomedical

Protein

tissue

Disease

NER in biomedical domain
BioNER aims to recognize the following names
First Priority
  Protein name, DNA name, RNA name
Second Priority
  cell type, other organic compound, cell line, lipid, multi-cell, virus, cell component, body part, tissue, amino acid monomer, polynucleotide, mono-cell, inorganic, peptide, nucleotide, atom, other artificial source, carbohydrate, organic

The Overall Spectrum
BioNER is only the starting point of biological information extraction.
A whole suite of NLP techniques is needed to treat relations and events in literature mining.
Techniques developed for BioNER should be adaptable to problems in later stages, e.g. NE relation recognition.

Intrinsic Features of BioNER
  Unknown words
  Long compound words
  Variations of expressions
  Nested NEs

Unknown Words
Words containing hyphen, digit, letter, Greek letter, Roman numeral
  Alpha B1
  Adenylyl cyclase 76E
  Latent membrane protein 1
  4’-mycarosyl isovaleryl-CoA transferase
  oligodeoxyribonucleotide
  18-deoxyaldosterone
Abbreviation and Acronym
  IL, TECd, IFN, TPA

Long Compound Words
  interleukin 1 (IL-1)-responsive kinase
  interleukin 1-responsive kinase
  epidermal growth factor receptor
  SH2 domain containing tyrosine kinase Syk SH2 domain (GENIA example)

Various expressions of the same NE
Spelling variation
  N-acetylcysteine, N-acetyl-cysteine, NAcetylCysteine
Word permutation
  beta-1 integrin, integrin beta-1
Ambiguous expressions
  epidermal growth factor receptor, EGF receptor, EGFR
  c-jun, c-Jun, c jun

Various expressions: the name explains its function
  the Ras guanine nucleotide exchange factor Sos
  the Ras guanine nucleotide releasing protein Sos
  the Ras exchanger Sos
  the GDP-GTP exchange factor Sos
  Sos(mSos), a GDP/GTP exchange protein for Ras

Various expressions: the name includes a preposition and/or conjunction (ambiguity of dependencies)
  p85 alpha subunit of PI 3-kinase
  SH2 and SH3 domains of Src
  NF-AT1, AP-1, and NF-kB sites
  E2F1 and -3
  Residues 432, 435, 437, 438, and 440

Nested Named Entity
An NE embedded in another NE.
  IL-2: protein
  IL-2 gene: gene
  CBP/p300 associated factor: protein
  CBP/p300 associated factor binding promoter: DNA

Challenges of NER
  Unknown word identification
  Named entity boundary detection
  Class disambiguation

Challenges
Unknown word identification
  t (10;11) (p13; q14)
  DNA methyltransferase
  73 kDa protein
  interleukin 1 (IL-1)-responsive kinase (an NE may contain an abbreviation within it)
Some unknown words occur very few times in the corpus and are hard to recognize.

Challenges (cont’d)
NE boundary detection
A boundary word can be a regular English word, unknown word, Roman numeral, or digit.
  MHC Class II
  latent protein 1 (the left boundary is an adjective)
  cyclin-like UDG gene product
Conjunction (and, or, …)
  alpha- and beta-globin
  human and mouse gene

Challenges (cont’d)
Classification of abbreviations
  NF-AT
    Full name: nuclear factor of activated T cells
    Class: Protein
  HTLV-I
    Full name: Human T cell lymphotropic virus I
    Class: Virus
  TCDD
    Full name: 2,3,7,8-tetrachlorodibenzo-p-dioxin
    Class: Other Organic
  GRE
    Full name: glucocorticoid response element
    Class: DNA

State-of-the-art Systems on NER: Two evaluation contests
BioCreative 2004 (March)
  Critical Assessment of Information Extraction Systems in Biology
  Task 1: Entity extraction
    Target: genes (or proteins, where there is ambiguity)
    10,000 sentences from Medline as training data, and 5,000 sentences as testing data
BioNLP 2004 (August)
  GENIA Corpus as training data and 404 abstracts as testing data
  Target: 5 classes: protein, DNA, RNA, cell line, and cell type
Both use exact match scoring.
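Exact match scoring can be illustrated with a minimal sketch. It assumes each entity is represented as a (start, end, class) triple, a representation chosen here for illustration rather than either contest's official file format; the function name is hypothetical. A predicted entity counts as correct only when both boundaries and the class agree with a gold entity.

```python
def exact_match_prf(gold, predicted):
    """Return (precision, recall, F1) under exact-match scoring.

    gold and predicted are iterables of (start, end, entity_class) triples.
    A prediction with a boundary off by one token earns no credit.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: the first prediction misses the right boundary by one token.
gold = [(0, 2, "DNA"), (5, 6, "protein")]
pred = [(0, 1, "DNA"), (5, 6, "protein")]
print(exact_match_prf(gold, pred))
```

Under this scheme a boundary error is penalized twice: once as a false positive and once as a false negative.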

BioNLP 2004 Datasets

                 # of abstracts   # of sentences       # of tokens
Training Set     2,000            20,546 (10.27/abs)   472,006 (236.00/abs) (22.97/sen)
Test Set
  Total          404              4,260 (10.54/abs)    96,780 (239.55/abs) (22.72/sen)
  1978-1989      104              991 (9.53/abs)       22,320 (214.62/abs) (22.52/sen)
  1990-1999      106              1,115 (10.52/abs)    25,080 (236.60/abs) (22.49/sen)
  2000-2001      130              1,452 (11.17/abs)    33,380 (256.77/abs) (22.99/sen)
  S/1998-2001    204              2,254 (11.05/abs)    51,628 (253.08/abs) (22.91/sen)

R/P/F     1978-1989 set        1990-1999 set        2000-2001 set        S/1998-2001 set      Total
[Zho04]   75.3 / 69.5 / 72.3   77.1 / 69.2 / 72.9   75.6 / 71.3 / 73.8   75.8 / 69.5 / 72.5   76.0 / 69.4 / 72.6
[Fin04]   66.9 / 70.4 / 68.6   73.8 / 69.4 / 71.5   72.6 / 69.3 / 70.9   71.8 / 67.5 / 69.6   71.6 / 68.6 / 70.1
[Set04]   63.6 / 71.4 / 67.3   72.2 / 68.7 / 70.4   71.3 / 69.6 / 70.5   71.3 / 68.8 / 70.1   70.3 / 69.3 / 69.8
[Son04]   60.3 / 66.2 / 63.1   71.2 / 65.6 / 68.2   69.5 / 65.8 / 67.6   68.3 / 64.0 / 66.1   67.8 / 64.8 / 66.3
[Zha04]   63.2 / 60.4 / 61.8   72.5 / 62.6 / 67.2   69.1 / 60.2 / 64.7   69.2 / 60.3 / 64.4   69.1 / 61.0 / 64.8
[Rös04]   59.2 / 60.3 / 59.8   70.3 / 61.8 / 65.8   68.4 / 61.5 / 64.8   68.3 / 60.4 / 64.1   67.4 / 61.0 / 64.0
[Par04]   62.8 / 55.9 / 59.2   70.3 / 61.4 / 65.6   65.1 / 60.4 / 62.7   65.9 / 59.7 / 62.7   66.5 / 59.8 / 63.0
[Lee04]   42.5 / 42.0 / 42.2   52.5 / 49.1 / 50.8   53.8 / 50.9 / 52.3   52.3 / 48.1 / 50.1   50.8 / 47.6 / 49.1
BL        47.1 / 33.9 / 39.4   56.8 / 45.5 / 50.5   51.7 / 46.3 / 48.8   52.6 / 46.0 / 49.1   52.6 / 43.6 / 47.7

Current Methods
Machine Learning
  HMM, SVM, ME (Maximum Entropy), CRF (Conditional Random Field)
Hybrid methods
Dictionary Based
  Approximate string matching algorithm
Naming Rules
  Dynamic Programming

Features for Machine Learning Methods
  Morphological Features
  Orthographical Features
  POS Features
    Genia POS tagger
  Semantic Trigger Features
    Head-noun Features
      NF-kappaB consensus site
      IL-2 gene

Morphological Features

Prefix/Suffix   Example
~cin            actinomycin
~mide           cycloheximide
~zole           sulphamethoxazole
~lipid          phospholipids
~rogen          estrogen
~vitamin        dihydroxyvitamin
~blast          erythroblast
~cyte           thymocyte
~phil           eosinophil
phosph~         phosphorylation
methyl~         methyltransferase
immuno~         immunomodulator
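As a rough illustration, affix features of this kind can be computed with plain string tests. The sketch below uses only the prefixes and suffixes listed on this slide. The function name, the feature-string format, and the stripping of a trailing plural "s" (so that "phospholipids" still triggers ~lipid) are assumptions made for the example, not details of any system described here.

```python
# Affixes taken from the slide; a real system would use a larger list.
SUFFIXES = ("cin", "mide", "zole", "lipid", "rogen", "vitamin",
            "blast", "cyte", "phil")
PREFIXES = ("phosph", "methyl", "immuno")


def affix_features(token):
    """Return the list of affix features that fire for one token."""
    word = token.lower()
    # Assumption: drop one trailing 's' so plural forms match suffixes.
    stem = word[:-1] if word.endswith("s") else word
    feats = [f"suffix={s}" for s in SUFFIXES if stem.endswith(s)]
    feats += [f"prefix={p}" for p in PREFIXES if word.startswith(p)]
    return feats


for tok in ("actinomycin", "phospholipids", "methyltransferase", "kinase"):
    print(tok, affix_features(tok))
```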

Orthographical Features

Orthographical Feature   Example
AllCaps                  EBNA, NFAT
AlphaDigit               p50, p65
AlphaDigitAlpha          IL23R, E1A
ATGCSequence             CCGCCC
CapLowAlpha              Src, Ras, Epo
CapMixAlpha              NFkappaB
CapsAndDigits            IL2, STAT4, SH2
DigitAlpha               2xNFkappaB
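One way to assign these orthographical features is an ordered list of regular expressions, where the first full match wins. The feature names come from the slide. The patterns, their ordering, and the minimum length of four for ATGCSequence are assumptions reverse-engineered from the examples, so this is a sketch rather than the definition used by any particular system.

```python
import re

# Ordered (feature name, regex) pairs; more specific patterns come first.
ORTHO_PATTERNS = [
    ("ATGCSequence", r"[ATGC]{4,}"),             # assumed minimum length of 4
    ("AllCaps", r"[A-Z]+"),
    ("CapsAndDigits", r"[A-Z]+[0-9]+"),
    ("AlphaDigitAlpha", r"[A-Za-z]+[0-9]+[A-Za-z]+"),
    ("AlphaDigit", r"[A-Za-z]+[0-9]+"),
    ("DigitAlpha", r"[0-9]+[A-Za-z]+"),
    ("CapLowAlpha", r"[A-Z][a-z]+"),
    # Letters only, with at least one lowercase and one uppercase letter.
    ("CapMixAlpha", r"(?=.*[a-z])(?=.*[A-Z])[A-Za-z]+"),
]


def orthographic_feature(token):
    """Return the first matching feature name, or None if nothing matches."""
    for name, pattern in ORTHO_PATTERNS:
        if re.fullmatch(pattern, token):
            return name
    return None


for tok in ("EBNA", "p50", "IL23R", "CCGCCC", "Src", "NFkappaB", "STAT4"):
    print(tok, orthographic_feature(tok))
```

The ordering matters: CCGCCC would also satisfy AllCaps, and IL2 would also satisfy AlphaDigit, so the narrower classes are tested first.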

Head Nouns
Unigram
  factor, protein, receptor, alpha, NF-kappaB, IL-2, cytokine, kinase, transcription, domain, complex, TNF-alpha, Nuclear, p50, CD28, TNF, molecule, subunit, cell, STAT3, family, tumor, factor-alpha, expression, interleukin
Bigram
  NF-kappa B, transcription factor, I kappa, nuclear factor, protein kinase, B alpha, kinase C, tumor necrosis, T cell, glucocorticoid receptor, binding protein, factor alpha, adhesion molecule, monoclonal antibody, gene product, binding domain

Additional features used by Manning’s group: local features
Clues within a sentence
Include:
  Previous NEs
  Abbreviations: an abbr., a long form, neither
  Parenthesis-matching
  etc.

External resources used by Manning’s group
Motivation
  Contextual clues do not provide sufficient evidence for confident classification.
  External resources may be vulnerable to incompleteness, noise, and ambiguity.
Web
  Least vulnerable to incompleteness, highly vulnerable to noise.
  Prepare patterns for each class
    For genes: X gene, X antagonist, X mutation
    For RNA: X mRNA, …
    For proteins: X ligation, …
  Features: web-protein, web-RNA, O-web, …
  Does not work well in the BioNLP task.

External resources (2)
Gazetteers (dictionaries)
  Are arguably subject to all three, and yet have been used successfully in some systems.
  Compiled a list of gene names from databases (e.g. Locus Link) and GO, and the data from BioCreative Tasks 1A and 1B.
  Filtering
    Single-character entries, e.g., ‘A’, ‘1’
    Entries containing only digits, or symbols and digits, e.g., ‘37’, ‘3-1’
    Entries containing only words that can be found in an English dictionary (CELEX), e.g., ‘abnormal’, ‘brain tumor’
  1,731,581 entries
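The three filtering rules can be sketched as a single predicate. The small `english_words` set below is a stand-in for the CELEX dictionary, and the function name and tokenization by whitespace are assumptions made for this illustration.

```python
import re


def keep_entry(entry, english_words):
    """Return True if a gazetteer entry survives the three filters."""
    entry = entry.strip()
    # Rule 1: drop single-character entries such as 'A' or '1'.
    if len(entry) <= 1:
        return False
    # Rule 2: drop entries made only of digits and symbols ('37', '3-1').
    if not re.search(r"[A-Za-z]", entry):
        return False
    # Rule 3: drop entries whose words are all ordinary English words.
    if all(word in english_words for word in entry.lower().split()):
        return False
    return True


# Hypothetical stand-in for CELEX.
english_words = {"abnormal", "brain", "tumor"}
candidates = ["A", "37", "3-1", "abnormal", "brain tumor", "IL-2", "p53"]
print([c for c in candidates if keep_entry(c, english_words)])
```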

Larger context

State-of-the-art approaches
Machine learning + Post-processing
Our method (BioKDD2004)
  Maximum entropy
  Post-processing
    Boundary extension
    Re-classification
Zhou et al. approach
  HMM + SVM
  Post-processing
    Rule-based: used to resolve nested named entities.
  Top 1 in the NLPBA Task, F=72.5%
Manning et al. method
  Machine learning:
    ME Markov model
    Local features
    External resources and larger context
  Post-processing
    To correct gene boundaries (mainly for the BioCreative Task)
  Top 1 in BioCreative, F=83.2%
  Top 2 in NLPBA Task, F=70.1%

Our Method Overview
[Figure: system flow diagram]
Training Phase: Training Data; Construct boundary wordlists and dictionary; Mapping features; ME Learning
Knowledge input: Dictionary; Boundary wordlists
Testing Phase: Testing Data; ME; Post-processing (Boundary extension, Re-classify); NEs

Experimental Results: ME-based NER
  NE identification P/R/F: 0.56/0.589/0.574
  NE recognition P/R/F: 0.512/0.538/0.525

Post-Processing
Nested Named Entity
  Ex: CIITA mRNA
  Nested annotation: <RNA><DNA>CIITA</DNA> mRNA</RNA>
  ME sometimes only recognizes CIITA as DNA
  16.57% of NEs in GENIA 3.02 contain one or more shorter NEs [Zhang, 2003]
Post-processing method
  Boundary Extension
  Re-classification

Boundary Extension (1)
Boundary extension for nested NEs
  Extend the right boundary repeatedly if the NE is followed by another NE, a head noun, or an R-boundary word with a valid POS tag.
  Extend the left boundary repeatedly if the NE is preceded by an L-boundary word with a valid POS tag.
Example: ICAM-1 surface protein
  ME result: ICAM-1/1U surface/unknown protein/unknown (1: protein, U: single)
  Boundary extension
    surface: in R-boundary word list, valid POS tag
    Extension: ICAM-1 surface
    protein: in R-boundary word list, valid POS tag
    Extension: ICAM-1 surface protein
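The ICAM-1 example can be traced with a minimal sketch of the extension loop. It covers only the boundary-word condition and leaves out the "followed by another NE or head noun" case for brevity. The function names, the half-open span convention, the word lists, and the Penn Treebank style POS tags are all hypothetical values chosen for the illustration.

```python
def extend_right(tokens, pos_tags, end, r_boundary_words, valid_pos):
    """Move the exclusive right boundary while the next token is an
    R-boundary word carrying a valid POS tag."""
    while (end < len(tokens)
           and tokens[end].lower() in r_boundary_words
           and pos_tags[end] in valid_pos):
        end += 1
    return end


def extend_left(tokens, pos_tags, start, l_boundary_words, valid_pos):
    """Move the inclusive left boundary while the previous token is an
    L-boundary word carrying a valid POS tag."""
    while (start > 0
           and tokens[start - 1].lower() in l_boundary_words
           and pos_tags[start - 1] in valid_pos):
        start -= 1
    return start


tokens = ["ICAM-1", "surface", "protein", "is", "expressed"]
pos_tags = ["NN", "NN", "NN", "VBZ", "VBN"]
r_words = {"surface", "protein"}   # hypothetical R-boundary word list
valid = {"NN", "NNS", "JJ"}        # hypothetical set of valid POS tags

# The ME tagger found only tokens[0:1]; extend the span to the right.
end = extend_right(tokens, pos_tags, 1, r_words, valid)
print(" ".join(tokens[0:end]))
```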

Boundary extension (2)
Boundary extension for NEs containing brackets or slashes
  1. NE := NE + ( + NE + ) + {NE or head noun or R-boundary word with valid POS tag}
  2. NE := NE + / + NE ( + / + NE ) + {NE or head noun or R-boundary word with valid POS tag}
Example: granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene
  ME result: granulocyte-macrophage colony-stimulating factor, GM-CSF
  Extension: granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene

Re-classification
  Use dictionary lookup
  Use R-boundary word
    CIITA mRNA: RNA class
    granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene: DNA class
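A small sketch of re-classification follows. The mapping from R-boundary word to class contains only the two cases shown on the slide. The function name, the dictionary format, and the choice to try the dictionary before the boundary word are assumptions for illustration, since the slide does not state the order.

```python
# Hypothetical mapping built from the two slide examples.
R_BOUNDARY_CLASS = {"mrna": "RNA", "gene": "DNA"}


def reclassify(tokens, current_class, dictionary=None):
    """Return a possibly corrected class for an extended NE.

    tokens: the NE as a list of tokens after boundary extension.
    dictionary: optional map from full NE string to class.
    """
    name = " ".join(tokens)
    if dictionary and name in dictionary:
        return dictionary[name]
    # Fall back to the last token; keep the ME class if it is unknown.
    return R_BOUNDARY_CLASS.get(tokens[-1].lower(), current_class)


print(reclassify(["CIITA", "mRNA"], "DNA"))
long_ne = "granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene"
print(reclassify(long_ne.split(), "protein"))
```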

Experimental Results: NE Identification

Config     BE-1   BE-2   BE-3   NE Identification P/R/F
Baseline                        0.56/0.589/0.574
Conf1                           0.582/0.597/0.594
Conf2                           0.591/0.6/0.595
Conf3                           0.757/0.746/0.751
Conf4                           0.776/0.763/0.769

BE-1: boundary extension for nested NEs
BE-2: boundary extension for brackets and slashes
BE-3: with human name filter

Experimental Results: NE Recognition

Config     BE-1   BE-2   BE-3   RC-1   RC-2   NE Recognition P/R/F
Baseline                                      0.512/0.538/0.525
Conf4                                         0.645/0.634/0.639
Conf5                                         0.67/0.658/0.664
Conf6                                         0.707/0.695/0.701
Conf7                                         0.727/0.715/0.721

RC-1: re-classification using dictionary lookup
RC-2: re-classification using R-boundary words

Experimental Results: GENIA v3.02 (10-fold CV)

System                               Overall   Protein   DNA     RNA
Our System                           0.721     0.785     0.700   0.752
Zhou et al. (Bioinformatics, 2004)   0.666     0.758     0.633   0.612

Recently, Zhou improved the F-measure of his HMM model to 0.712 by combining it with SVM.

Error Analysis
GENIA inconsistent annotation
  IL-2 gene expression
  <DNA>IL-2 gene</DNA> expression
  <othername><DNA>IL-2 gene</DNA> expression</othername>
Conjunction
  Human and mouse gene
Boundary detection error (boundary not in boundary word file)
  Squirrel, manic, bursal…

Error Analysis
Abbreviation classification
  Orthographical form fits into at least two classes.
    Protein: SOS1, FLICE, GAG
    Other Organic: CD336
False negative
  A number of errors are due to low-frequency words or words not encountered in the training data.
False positive
  Ellipsis: Many inflammatory cytokine genes including TNF, IL-1, and IL-6


Outline
  NER (named entity recognition) in biomedical domain
  Challenges in biomedical NER
  Current methods and our method
  State of progress in NER
  Future works

Manning’s conclusion (I): Key factor for low performance
Task difficulty does not appear to be the primary factor leading to low performance.
  BioCreative: 1 class, BioNLP: 5 classes
Key factor: quality of the training and evaluation data
  Higher inconsistency in the annotation of the BioNLP data.
  Two of the authors independently reviewed 50 of the system’s errors; 34-35 were attributed to annotation.
  The authors do not think the annotation inconsistencies are due to biological subtleties.

Manning’s Conclusion (II)
To improve biomedical annotation
  BioNLP organizers emphasized that participants should focus on deep knowledge sources (coreference resolution and use of dependency relations) over “widely used lexical-level features” (POS, morphological, orthographical, etc.)
Proper exploitation of external resources
  In both tasks, external resources led to an improvement of only 1-2%.
  Consistent annotation might have led to a 70% reduction in error rate.

Disambiguation of abbreviation

Motivation (I)
Named entity recognition (NER) is the first step of information extraction.
NER contains two steps
  NE identification: extract named entities from text
  NE classification: classify a given NE into a specific class

Motivation (II)
Since many protein or gene names are long compound names, authors usually represent them with abbreviations.
  A2M: Alpha-2-macroglobulin
  A4GALT: alpha 1,4-galactosyltransferase
  EGFR: epidermal growth factor receptor, EGF receptor
  NF-AT: nuclear factor of activated T cells
  HTLV-I: Human T cell lymphotropic virus I
  TCDD: 2,3,7,8-tetrachlorodibenzo-p-dioxin
  GRE: glucocorticoid response element

Motivation (III)
Abbreviation identification task:
  It is easier than the classification task.
  Abbreviations often have orthographical clues.
    All capital letters, alphabet and digit hybrid, etc.
Abbreviation classification task:
  In some situations, it is hard to disambiguate an abbreviation’s class.
    Example: only the abbreviation is mentioned, without the full name

Challenges of abbreviation
Two cases

Case 1: sentence contains abbreviation and full name

Human immunodeficiency virus type 2 (HIV-2), like HIV-1, causes AIDS and is associated with AIDS cases primarily in West Africa.

Case 2: sentence contains only abbreviation

HIV-1 and HIV-2 display significant differences in nucleic acid sequence and in the natural history of clinical disease.

Case 1
Case 1 is easier than Case 2.
The classification can be solved by the following steps:
  Abbreviation – Full name association
  Disambiguate the full name’s class
  Assign the full name’s class to the abbreviation
The challenge has shifted from abbreviation classification to abbreviation-full name association.

Example of Case 1
Sentence:
  Human immunodeficiency virus type 2 (HIV-2), like HIV-1, causes AIDS and is associated with AIDS cases primarily in West Africa.
Step 1: Abbreviation – Full name association
  (Full name, Abbreviation) = (Human immunodeficiency virus type 2, HIV-2)
Step 2: Full name class assignment
  Name: Human immunodeficiency virus type 2
  Class: Virus
Step 3: Abbreviation class assignment
  Abbreviation: HIV-2
  Class: Virus

A solution method to Case 1
Schwartz and Hearst, PSB 2003.
Identify <long form, short form> pairs.
  Both long form and short form occur in the same sentence.
    long form ‘(’ short form ‘)’ (more frequent)
    short form ‘(’ long form ‘)’
Algorithm: Identify long form ‘(’ short form ‘)’
  Identify long form and short form candidates (using adjacency to parentheses).
  Identify the correct long form.
    Starting from the end of both candidates, move right to left, trying to find the shortest long form that matches the short form.
    Every character in the short form must match a character in the long form.
    The matched characters in the long form must be in the same order as the characters in the short form.
  <HSF, Heat shock transcription factor>
  <TTF-1, Thyroid transcription factor 1>: fail
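The right-to-left matching step can be sketched in Python as follows. This is an illustrative port of the matching rule, not the authors' released code, and the function name is hypothetical. It also applies the published algorithm's extra requirement that the first character of the short form match the start of a word in the long form, which the slide does not spell out. Candidate extraction around parentheses and the validity checks on the result are omitted.

```python
def find_best_long_form(short_form, long_form):
    """Return the shortest suffix of long_form (cut at a word start) whose
    characters cover short_form in order, or None if no match exists."""
    l_index = len(long_form) - 1
    # Walk the short form from its last character to its first.
    for s_index in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            continue  # hyphens and other symbols need no match
        # Move left until ch is found; the first short-form character
        # must additionally sit at the beginning of a word.
        while ((l_index >= 0 and long_form[l_index].lower() != ch)
               or (s_index == 0 and l_index > 0
                   and long_form[l_index - 1].isalnum())):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Cut at the start of the word containing the last matched character.
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]


print(find_best_long_form("HSF", "Heat shock transcription factor"))
print(find_best_long_form("TTF-1", "Thyroid transcription factor 1"))
print(find_best_long_form("ATN", "anterior thalamus"))
```

With this rule HSF recovers its full long form, while TTF-1 stops at "transcription factor 1" because both letters T are already satisfied inside "transcription". That is the failure noted on the slide.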

Error analysis
  Unused characters, e.g., <CNS1, cyclophilin seven suppressor>
  No pattern between long form and short form, e.g., <ATN, anterior thalamus>
  Partial matching: the long form includes additional words to the left of the match, e.g., <Pol I, RNA polymerase I>
  Out-of-order mapping
  First character matches an internal character of the long form.
  Non-continuous long form.
  Transformation in the mapping (2D -> two-dimensional)
  Short form of only one character.

Other types of abbreviations
Schwartz and Hearst’s algorithm only considers candidates in parentheses.
Challenge: finding all possible pairs is a more difficult problem.

Example of Case 2
It is hard to disambiguate an abbreviation’s class, even with context information.
Example:
  HIV-1 and HIV-2 display significant differences in nucleic acid sequence and in the natural history of clinical disease.
HIV-1 and HIV-2 are both viruses, but if we replace HIV-1 and HIV-2 with IL-2 and IL-10, the sentence still makes sense.
  IL-2 and IL-10 display significant differences in nucleic acid sequence and in the natural history of clinical disease.
  IL-2 and IL-10: gene names

Case 2
Left for future work.
Clues:
  Statistical methods
  Dictionary-based methods

What’s next after NER is solved?
Named entity relation recognition (NERR)
  Protein-Protein interaction/binding/inhibition
  Protein-Small Molecules
  Gene-Gene regulation
  Gene-Gene Product interaction
  Gene-Drug relation
  Protein-Subcellular location
  Amino Acid-Protein relation

Identify Relations among Named Entities
Target: Extract relations between various biological named entities.

Here we demonstrate that the c-myb proto-oncogene product, which is itself a DNA-binding protein, and transcriptional transactivator, can interact synergistically with Z.

Relation (Subject, Action, Object): (c-myb proto-oncogene product, interact, Z)

Future works
Few papers have been published on the following specific challenging topics of NER.
  Automated corpus correction
  Disambiguation of abbreviations (Schwartz & Hearst, 2003, …)
  Conjunction
  …
NERR (difficult)
  Parser
  Pronoun and anaphora resolution

Acknowledgements Bioinformatics: Yi-Feng Lin, Wen-Chi Chou

NLP: Tzong-Han Tsai, Cheng-Wei Lee

Postdoc: Kuen-Pin Wu

Colleague: Wen-Lian Hsu (Fu Chang is jumping on the bandwagon now.)

Lab Introduction

Research topics
Protein structure prediction
  Secondary structure prediction
  Tertiary structure prediction – local structure
  Members: Hsin-Nan Lin, Caster Chen, Jia-Ming Chang
Protein structure determination based on NMR data
  Backbone assignment
  Side chain assignment
  RDC
  Jia-Ming Chang, Caster Chen, Philip Chen
  Collaborator: Prof TH Huang, IBMS

Research topics
Mass spectrometry based proteomics
  Protein quantification
  Protein identification – for modification study
  Yi-Hwa Yian, Wen-Ting Lin, Jacky Chou, Wei-Nung Hung
  Collaborator: Prof YR Chen, Inst of Chemistry
Biological literature mining
  NER, NERR
  Yi-Feng Lin, Jacky Chou, Richard Tsai

Faculty PI: Wen-Lian Hsu, Ting-Yi Sung

Post-doc: Kuen-Pin Wu
