![Page 1: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/1.jpg)
Extracting biological names and relations from texts
Ting-Yi Sung 宋定懿
Bioinformatics Program, TIGP
Institute of Information Science
Academia Sinica
2004/12/16
![Page 2: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/2.jpg)
Motivation
To automatically extract information from natural language text.
The need arises from the rapid accumulation of biomedical literature.
  Expedite survey efforts
  Support database curation (automatically associate papers with database records)
![Page 3: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/3.jpg)
Targets of Information Extraction
Protein-Protein interaction/binding/inhibition
Protein-Small Molecules
Gene-Gene regulation
Gene-Gene Product interaction
Gene-Drug relation
Protein-Subcellular location
Amino Acid-Protein relation
Example relationships between genes and drugs:
  The gene is the drug target
  The gene confers resistance to the drug
  The gene metabolizes the drug
![Page 4: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/4.jpg)
Information Extraction Tasks
Identify Target Named Entities
Identify Relations among Named Entities
Identify Relations among Events and Named Entities
Associate Results with existing database records
![Page 5: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/5.jpg)
Outline
NER (named entity recognition) in biomedical domain
Challenges in biomedical NER
State of progress in NER
Abbreviation disambiguation
Future works
![Page 6: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/6.jpg)
What is NER?
NER: Named Entity Recognition
Includes two tasks
  Identification of proper names in text
  Classification of proper names in text
Newswire domain: Person, Location, Organization
Biomedical domain: Protein, DNA, RNA, Body Part, Cell Type, Lipid, etc.
![Page 7: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/7.jpg)
Example of NER - Biomedical
Protein
tissue
Disease
![Page 8: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/8.jpg)
NER in biomedical domain
BioNER aims to recognize the following names
First priority
  Protein name, DNA name, RNA name
Second priority
  cell type, other organic compound, cell line, lipid, multi-cell, virus, cell component, body part, tissue, amino acid monomer, polynucleotide, mono-cell, inorganic, peptide, nucleotide, atom, other artificial source, carbohydrate, organic
![Page 9: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/9.jpg)
The Overall Spectrum
BioNER is only the starting point of biological information extraction.
A whole suite of NLP techniques is needed to treat relations and events in literature mining.
Techniques developed for BioNER should be adaptable to problems in later stages, e.g. NE relation recognition.
![Page 10: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/10.jpg)
Intrinsic Features of BioNER
Unknown words
Long compound words
Variations of expressions
Nested NEs
![Page 11: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/11.jpg)
Unknown Words
Words containing a hyphen, digit, letter, Greek letter, or Roman numeral
  Alpha B1
  Adenylyl cyclase 76E
  Latent membrane protein 1
  4’-mycarosyl isovaleryl-CoA transferase
  oligodeoxyribonucleotide
  18-deoxyaldosterone
Abbreviations and acronyms
  IL, TECd, IFN, TPA
![Page 12: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/12.jpg)
Long Compound Words
interleukin 1 (IL-1)-responsive kinase
interleukin 1-responsive kinase
epidermal growth factor receptor
SH2 domain containing tyrosine kinase Syk SH2 domain (GENIA example)
![Page 13: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/13.jpg)
Various expressions of the same NE
Spelling variation
  N-acetylcysteine, N-acetyl-cysteine, NAcetylCysteine
Word permutation
  beta-1 integrin, integrin beta-1
Ambiguous expressions
  epidermal growth factor receptor, EGF receptor, EGFR
  c-jun, c-Jun, c jun
![Page 14: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/14.jpg)
Various expressions: the name explains its function
the Ras guanine nucleotide exchange factor Sos
the Ras guanine nucleotide releasing protein Sos
the Ras exchanger Sos
the GDP-GTP exchange factor Sos
Sos (mSos), a GDP/GTP exchange protein for Ras
![Page 15: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/15.jpg)
Various expressions: the name includes a preposition and/or conjunction (ambiguity of dependencies)
p85 alpha subunit of PI 3-kinase
SH2 and SH3 domains of Src
NF-AT1, AP-1, and NF-kB sites
E2F1 and -3
Residues 432, 435, 437, 438, and 440
![Page 16: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/16.jpg)
Nested Named Entity
An NE embedded in another NE.
  IL-2: protein
  IL-2 gene: gene
  CBP/p300 associated factor: protein
  CBP/p300 associated factor binding promoter: DNA
![Page 17: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/17.jpg)
Outline
NER (named entity recognition) in biomedical domain
Challenges in biomedical NER
State of progress in NER
Abbreviation disambiguation
Future works
![Page 18: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/18.jpg)
Challenges of NER
Unknown word identification
Named entity boundary detection
Class disambiguation
![Page 19: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/19.jpg)
Challenges
Unknown word identification
  t (10;11) (p13; q14)
  DNA methyltransferase
  73 kDa protein
  interleukin 1 (IL-1)-responsive kinase (an NE may contain an abbreviation within it)
Some unknown words occur very few times in the corpus and are hard to recognize.
![Page 20: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/20.jpg)
Challenges (cont’d)
NE boundary detection
  A boundary word can be a regular English word, an unknown word, a Roman numeral, or a digit.
    MHC Class II
    latent protein 1 (the left boundary is an adjective)
    cyclin-like UDG gene product
  Conjunction (and, or, …)
    alpha- and beta-globin
    human and mouse gene
![Page 21: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/21.jpg)
Challenges (cont’d)
Classification of abbreviations
  NF-AT
    Full name: nuclear factor of activated T cells
    Class: Protein
  HTLV-I
    Full name: Human T cell lymphotropic virus I
    Class: Virus
  TCDD
    Full name: 2,3,7,8-tetrachlorodibenzo-p-dioxin
    Class: Other Organic
  GRE
    Full name: glucocorticoid response element
    Class: DNA
![Page 22: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/22.jpg)
Outline
NER (named entity recognition) in biomedical domain
Challenges in biomedical NER
State of progress in NER
Abbreviation disambiguation
Future works
![Page 23: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/23.jpg)
State-of-the-art Systems on NER: Two evaluation contests
BioCreative 2004 (March)
  Critical Assessment of Information Extraction Systems in Biology
  Task 1: Entity extraction
  Target: genes (or proteins, where there is ambiguity)
  10,000 sentences from Medline as training data, and 5,000 sentences as testing data
BioNLP 2004 (August)
  GENIA Corpus as training data and 404 abstracts as testing data
  Target: 5 classes: protein, DNA, RNA, cell line, and cell type
Both use exact match scoring.
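Exact-match scoring counts a predicted entity as correct only when both its boundaries and its class agree with the gold annotation. The following is a minimal Python sketch of that idea, assuming entities are represented as hypothetical `(start, end, class)` tuples; it illustrates the metric and is not the official evaluation script of either contest.

```python
def exact_match_prf(gold, predicted):
    """Return (precision, recall, F1) under exact-match scoring.

    gold, predicted: iterables of (start, end, entity_class) tuples.
    The tuple layout is an assumption made for this illustration.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)  # boundaries and class must all agree
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, 2, "protein"), (5, 6, "DNA")]
# The second prediction overruns the gold boundary by one token, so it earns no credit.
predicted = [(0, 2, "protein"), (5, 7, "DNA"), (9, 10, "RNA")]
p, r, f = exact_match_prf(gold, predicted)
```

Under this scheme a near miss on a boundary costs both precision and recall, which is one reason boundary detection matters so much in the results that follow.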
![Page 24: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/24.jpg)
BioNLP 2004 Datasets

| Set | # of abstracts | # of sentences | # of tokens |
|---|---|---|---|
| Training Set | 2,000 | 20,546 (10.27/abs) | 472,006 (236.00/abs) (22.97/sen) |
| Test Set: Total | 404 | 4,260 (10.54/abs) | 96,780 (239.55/abs) (22.72/sen) |
| Test Set: 1978-1989 | 104 | 991 (9.53/abs) | 22,320 (214.62/abs) (22.52/sen) |
| Test Set: 1990-1999 | 106 | 1,115 (10.52/abs) | 25,080 (236.60/abs) (22.49/sen) |
| Test Set: 2000-2001 | 130 | 1,452 (11.17/abs) | 33,380 (256.77/abs) (22.99/sen) |
| Test Set: S/1998-2001 | 204 | 2,254 (11.05/abs) | 51,628 (253.08/abs) (22.91/sen) |
![Page 25: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/25.jpg)
| R/P/F | 1978-1989 set | 1990-1999 set | 2000-2001 set | S/1998-2001 set | Total |
|---|---|---|---|---|---|
| [Zho04] | 75.3 / 69.5 / 72.3 | 77.1 / 69.2 / 72.9 | 75.6 / 71.3 / 73.8 | 75.8 / 69.5 / 72.5 | 76.0 / 69.4 / 72.6 |
| [Fin04] | 66.9 / 70.4 / 68.6 | 73.8 / 69.4 / 71.5 | 72.6 / 69.3 / 70.9 | 71.8 / 67.5 / 69.6 | 71.6 / 68.6 / 70.1 |
| [Set04] | 63.6 / 71.4 / 67.3 | 72.2 / 68.7 / 70.4 | 71.3 / 69.6 / 70.5 | 71.3 / 68.8 / 70.1 | 70.3 / 69.3 / 69.8 |
| [Son04] | 60.3 / 66.2 / 63.1 | 71.2 / 65.6 / 68.2 | 69.5 / 65.8 / 67.6 | 68.3 / 64.0 / 66.1 | 67.8 / 64.8 / 66.3 |
| [Zha04] | 63.2 / 60.4 / 61.8 | 72.5 / 62.6 / 67.2 | 69.1 / 60.2 / 64.7 | 69.2 / 60.3 / 64.4 | 69.1 / 61.0 / 64.8 |
| [Rös04] | 59.2 / 60.3 / 59.8 | 70.3 / 61.8 / 65.8 | 68.4 / 61.5 / 64.8 | 68.3 / 60.4 / 64.1 | 67.4 / 61.0 / 64.0 |
| [Par04] | 62.8 / 55.9 / 59.2 | 70.3 / 61.4 / 65.6 | 65.1 / 60.4 / 62.7 | 65.9 / 59.7 / 62.7 | 66.5 / 59.8 / 63.0 |
| [Lee04] | 42.5 / 42.0 / 42.2 | 52.5 / 49.1 / 50.8 | 53.8 / 50.9 / 52.3 | 52.3 / 48.1 / 50.1 | 50.8 / 47.6 / 49.1 |
| BL | 47.1 / 33.9 / 39.4 | 56.8 / 45.5 / 50.5 | 51.7 / 46.3 / 48.8 | 52.6 / 46.0 / 49.1 | 52.6 / 43.6 / 47.7 |
![Page 26: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/26.jpg)
Current Methods
Machine Learning
  HMM, SVM, ME (Maximum Entropy), CRF (Conditional Random Field)
  Hybrid methods
Dictionary Based
  Approximate string matching algorithm
Naming Rules
  Dynamic Programming
![Page 27: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/27.jpg)
Features for Machine Learning Methods
Morphological Features
Orthographical Features
POS Features
  Genia POS tagger
Semantic Trigger Features
  Head-noun Features
    NF-kappaB consensus site
    IL-2 gene
![Page 28: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/28.jpg)
Morphological Features

| Prefix/Suffix | Example |
|---|---|
| ~cin, ~mide, ~zole | actinomycin, Cycloheximide, Sulphamethoxazole |
| ~lipid, ~rogen, ~vitamin | phospholipids, estrogen, dihydroxyvitamin |
| ~blast, ~cyte, ~phil | erythroblast, thymocyte, eosinophil |
| phosph~, methyl~, immuno~ | phosphorylation, methyltransferase, immunomodulator |
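Affix features of this kind can be sketched as a simple lookup. The Python below is an illustration only: the function name, the feature-string format, and the use of case-insensitive matching are assumptions, and the affix lists are limited to the ones shown on the slide.

```python
# Affixes taken from the slide; a real system would use a much longer list.
PREFIXES = ("phosph", "methyl", "immuno")
SUFFIXES = ("cin", "mide", "zole", "lipid", "rogen", "vitamin", "blast", "cyte", "phil")


def affix_features(token):
    """Return the morphological features that fire for a token.

    Matching is case-insensitive (an assumption), so "Cycloheximide"
    and "cycloheximide" produce the same features.
    """
    word = token.lower()
    features = [f"prefix={p}" for p in PREFIXES if word.startswith(p)]
    features += [f"suffix={s}" for s in SUFFIXES if word.endswith(s)]
    return features
```

Each returned string would typically become one binary feature for the learner; tokens that match no affix simply contribute nothing.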
![Page 29: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/29.jpg)
Orthographical Features

| Orthographical Feature | Example |
|---|---|
| AllCaps | EBNA, NFAT |
| AlphaDigit | p50, p65 |
| AlphaDigitAlpha | IL23R, E1A |
| ATGCSequence | CCGCCC |
| CapLowAlpha | Src, Ras, Epo |
| CapMixAlpha | NFkappaB |
| CapsAndDigits | IL2, STAT4, SH2 |
| DigitAlpha | 2xNFkappaB |
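One way to compute such features is with regular expressions. The sketch below is hedged: the slide gives only feature names and examples, so every pattern here is an assumed definition chosen to be consistent with those examples, and a token may fire more than one feature.

```python
import re

# Assumed definitions, reverse-engineered from the examples on the slide.
ORTHO_PATTERNS = {
    "AllCaps": re.compile(r"[A-Z]+"),
    "AlphaDigit": re.compile(r"[A-Za-z]+[0-9]+"),
    "AlphaDigitAlpha": re.compile(r"[A-Za-z]+[0-9]+[A-Za-z]+"),
    "ATGCSequence": re.compile(r"[ATGC]+"),
    "CapLowAlpha": re.compile(r"[A-Z][a-z]+"),
    # Starts with a capital, letters only, with a lowercase letter and a later capital.
    "CapMixAlpha": re.compile(r"(?=.*[a-z])(?=.+[A-Z])[A-Z][A-Za-z]+"),
    # Capitals and digits only, with at least one digit.
    "CapsAndDigits": re.compile(r"(?=.*[0-9])[A-Z][A-Z0-9]*"),
    "DigitAlpha": re.compile(r"[0-9]+[A-Za-z]+"),
}


def orthographic_features(token):
    """Return the set of orthographic feature names whose pattern matches the whole token."""
    return {name for name, pattern in ORTHO_PATTERNS.items() if pattern.fullmatch(token)}
```

Returning a set rather than a single label lets overlapping cases such as `CCGCCC` (both `AllCaps` and `ATGCSequence`) pass all of their evidence to the classifier.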
![Page 30: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/30.jpg)
Head Nouns
Unigram: factor, protein, receptor, alpha, NF-kappaB, IL-2, cytokine, kinase, transcription, domain, complex, TNF-alpha, Nuclear, p50, CD28, TNF, molecule, subunit, cell, STAT3, family, tumor, factor-alpha, expression, interleukin
Bigram: NF-kappa B, transcription factor, I kappa, nuclear factor, protein kinase, B alpha, kinase C, tumor necrosis, T cell, glucocorticoid receptor, binding protein, factor alpha, adhesion molecule, monoclonal antibody, gene product, binding domain
![Page 31: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/31.jpg)
Additional features used by Manning’s group: local features
Clues within a sentence
Include:
  Previous NEs
  Abbreviations: an abbr., a long form, neither
  Parenthesis-matching
  etc.
![Page 32: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/32.jpg)
External resources used by Manning’s group
Motivation
  Contextual clues do not provide sufficient evidence for confident classification.
  External resources may be vulnerable to incompleteness, noise, and ambiguity.
Web
  Least vulnerable to incompleteness, highly vulnerable to noise.
  Prepare patterns for each class
    For genes: X gene, X antagonist, X mutation
    For RNA: X mRNA, …
    For proteins: X ligation, …
  Features: web-protein, web-RNA, O-web, …
  Does not work well in the BioNLP task.
![Page 33: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/33.jpg)
External resources (2)
Gazetteers (dictionaries)
  Are arguably subject to all three problems (incompleteness, noise, and ambiguity), and yet have been used successfully in some systems.
  Compiled a list of gene names from databases (e.g. LocusLink) and GO, and the data from BioCreative Tasks 1A and 1B.
  Filtering
    Single-character entries, e.g., ‘A’, ‘1’; entries containing only digits or symbols and digits, e.g., ‘37’, ‘3-1’
    Entries containing only words that can be found in an English dictionary (CELEX), e.g., ‘abnormal’, ‘brain tumor’
  1,731,581 entries
Larger context
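The gazetteer filtering rules listed above can be sketched as a single predicate. This is an illustration under assumptions: `english_words` is a hypothetical stand-in for the CELEX word list (which is not bundled here), and "symbols and digits only" is approximated as "contains no letter".

```python
import re


def keep_entry(entry, english_words):
    """Return True if a gazetteer entry survives the filters described above.

    english_words: a set of lowercase common English words, used here as a
    hypothetical stand-in for the CELEX dictionary.
    """
    entry = entry.strip()
    if len(entry) <= 1:                       # single-character entries: 'A', '1'
        return False
    if not re.search(r"[A-Za-z]", entry):     # digits/symbols only: '37', '3-1'
        return False
    words = entry.lower().split()
    if all(word in english_words for word in words):  # 'abnormal', 'brain tumor'
        return False
    return True


english_words = {"abnormal", "brain", "tumor"}
raw_entries = ["A", "1", "37", "3-1", "abnormal", "brain tumor", "IL-2", "p53"]
gazetteer = [e for e in raw_entries if keep_entry(e, english_words)]
```

The point of the third rule is precision: an entry made up entirely of ordinary English words would match far too much running text to be useful as a gene-name clue.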
![Page 34: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/34.jpg)
State-of-the-art approaches
Machine learning + Post-processing
Our method (BioKDD2004)
  Maximum entropy
  Post-processing
    Boundary extension
    Re-classification
![Page 35: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/35.jpg)
Zhou et al. approach
HMM + SVM
Post-processing
  Rule-based: used to resolve nested named entities.
Top 1 in the NLPBA Task, F=72.5%
![Page 36: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/36.jpg)
Manning et al. method
Machine learning:
  ME Markov model
  Local features
  External resources and larger context
Post-processing
  To correct the gene’s boundary (mainly for the BioCreative Task)
Top 1 in BioCreative, F=83.2%
Top 2 in NLPBA Task, F=70.1%
![Page 37: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/37.jpg)
Our Method Overview (flowchart)
Training Phase: training data; construct boundary word lists and dictionary (knowledge input); mapping features; ME learning
Testing Phase: testing data; ME; post-processing (boundary extension, re-classify) using the dictionary and boundary word lists (knowledge input); NEs
![Page 38: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/38.jpg)
Experimental Results: ME-based NER
NE identification P/R/F: 0.56/0.589/0.574
NE recognition P/R/F: 0.512/0.538/0.525
![Page 39: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/39.jpg)
Post-Processing
Nested Named Entity
  Ex: CIITA mRNA
  Nested annotation: <RNA><DNA>CIITA</DNA> mRNA</RNA>
  ME sometimes recognizes only CIITA, as DNA
  16.57% of NEs in GENIA 3.02 contain one or more shorter NEs [Zhang, 2003]
Post-processing method
  Boundary Extension
  Re-classification
![Page 40: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/40.jpg)
Boundary Extension (1)
Boundary extension for nested NEs
  Extend the right boundary repeatedly if the NE is followed by another NE, a head noun, or an R-boundary word with a valid POS tag.
  Extend the left boundary repeatedly if the NE is preceded by an L-boundary word with a valid POS tag.
![Page 41: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/41.jpg)
Example: ICAM-1 surface protein
ME result: ICAM-1/1U surface/unknown protein/unknown (1: protein, U: single)
Boundary extension
  surface: in R-boundary word list, valid POS tag
    Extension: ICAM-1 surface
  protein: in R-boundary word list, valid POS tag
    Extension: ICAM-1 surface protein
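The extension rule and the ICAM-1 example can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function signature, the lowercase word lists, and the choice of `NN`/`NNS` as "valid" POS tags are all assumptions, since the slides do not say which tags count as valid.

```python
def extend_boundaries(tokens, pos_tags, start, end,
                      l_boundary_words, r_boundary_words,
                      head_nouns, valid_pos, other_nes=None):
    """Extend the NE span tokens[start:end] (end exclusive); return the new (start, end).

    other_nes: optional dict mapping the start index of another recognized NE
    to its exclusive end index (an assumed representation).
    """
    other_nes = other_nes or {}
    # Right extension: absorb a following NE, head noun, or R-boundary word.
    while end < len(tokens):
        word = tokens[end].lower()
        if end in other_nes:
            end = max(other_nes[end], end + 1)  # always advance by at least one token
        elif word in head_nouns or (word in r_boundary_words and pos_tags[end] in valid_pos):
            end += 1
        else:
            break
    # Left extension: absorb preceding L-boundary words.
    while (start > 0 and tokens[start - 1].lower() in l_boundary_words
           and pos_tags[start - 1] in valid_pos):
        start -= 1
    return start, end


tokens = ["the", "ICAM-1", "surface", "protein", "is", "expressed"]
pos_tags = ["DT", "NN", "NN", "NN", "VBZ", "VBN"]
span = extend_boundaries(tokens, pos_tags, 1, 2,
                         l_boundary_words=set(),
                         r_boundary_words={"surface", "protein"},
                         head_nouns=set(),
                         valid_pos={"NN", "NNS"})
extended = " ".join(tokens[span[0]:span[1]])
```

Starting from the single-token ME result `ICAM-1`, the loop absorbs `surface` and then `protein`, and stops at `is`, mirroring the two extension steps in the example.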
![Page 42: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/42.jpg)
Boundary extension (2)
Boundary extension for NEs containing brackets or slashes
  1. NE := NE + ( + NE + ) + {NE or head noun or R-boundary word with valid POS tag}
  2. NE := NE + / + NE ( + / + NE ) + {NE or head noun or R-boundary word with valid POS tag}
Example: granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene
  ME result: granulocyte-macrophage colony-stimulating factor, GM-CSF
  Extension: granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene
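Rule 1 (the bracket case) can be sketched as a merge over adjacent spans. The code below is a simplified illustration under stated assumptions: spans are `(start, end)` token indices with an exclusive end, the trailing element is reduced to a single optional word drawn from a hypothetical `trailing_words` set (head nouns and R-boundary words combined), and the POS check and the slash rule are omitted.

```python
def merge_bracketed(tokens, spans, trailing_words):
    """Merge 'NE ( NE )' into one span, optionally absorbing one trailing word.

    spans: sorted list of (start, end) token spans, end exclusive.
    trailing_words: lowercase head nouns / R-boundary words (assumed list).
    """
    merged = []
    i = 0
    while i < len(spans):
        start, end = spans[i]
        if i + 1 < len(spans):
            next_start, next_end = spans[i + 1]
            bracketed = (end < len(tokens) and tokens[end] == "("
                         and next_start == end + 1
                         and next_end < len(tokens) and tokens[next_end] == ")")
            if bracketed:
                end = next_end + 1          # include the closing bracket
                i += 1                      # the bracketed NE has been consumed
                if end < len(tokens) and tokens[end].lower() in trailing_words:
                    end += 1                # absorb e.g. "gene"
        merged.append((start, end))
        i += 1
    return merged


tokens = "granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene".split()
me_spans = [(0, 3), (4, 5)]   # the two NEs found by ME in the example
result = merge_bracketed(tokens, me_spans, trailing_words={"gene"})
```

Here the two ME spans collapse into one span covering the full name, the bracketed abbreviation, and the trailing `gene`.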
![Page 43: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/43.jpg)
Re-classification
Use dictionary lookup
Use R-boundary word
  CIITA mRNA: RNA class
  granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene: DNA class
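Re-classification after boundary extension can be sketched as two lookups. This is an assumption-laden illustration: the slides name the two knowledge sources but not how they are combined, so trying the dictionary first and then the final (R-boundary) word is a guess, and the small mapping below contains only the two cases shown in the examples.

```python
# Assumed mapping from R-boundary word to class, taken from the two examples above.
R_BOUNDARY_CLASS = {"mrna": "RNA", "gene": "DNA"}


def reclassify(ne_tokens, current_class, dictionary=None):
    """Return a possibly revised class for an extended NE.

    dictionary: optional dict from full NE string to class (hypothetical format).
    """
    phrase = " ".join(ne_tokens)
    if dictionary and phrase in dictionary:
        return dictionary[phrase]
    # Fall back to the last word of the NE; keep the ME class if it is uninformative.
    return R_BOUNDARY_CLASS.get(ne_tokens[-1].lower(), current_class)
```

For instance, `reclassify(["CIITA", "mRNA"], "DNA")` yields `"RNA"`, matching the nested-entity example where ME alone labels only `CIITA` as DNA.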
![Page 44: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/44.jpg)
Experimental Results: NE Identification

| Config | NE Identification P/R/F |
|---|---|
| Baseline | 0.56/0.589/0.574 |
| Conf1 | 0.582/0.597/0.594 |
| Conf2 | 0.591/0.6/0.595 |
| Conf3 | 0.757/0.746/0.751 |
| Conf4 | 0.776/0.763/0.769 |

Boundary extension settings:
BE-1: boundary extension for nested NEs
BE-2: boundary extension for brackets and slashes
BE-3: with human name filter
![Page 45: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/45.jpg)
Experimental Results: NE Recognition

| Config | NE Recognition P/R/F |
|---|---|
| Baseline | 0.512/0.538/0.525 |
| Conf4 | 0.645/0.634/0.639 |
| Conf5 | 0.67/0.658/0.664 |
| Conf6 | 0.707/0.695/0.701 |
| Conf7 | 0.727/0.715/0.721 |

Settings: boundary extension (BE-1, BE-2, BE-3) and re-classification (RC-1, RC-2)
RC-1: re-classification using dictionary lookup
RC-2: re-classification using R-boundary words
![Page 46: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/46.jpg)
Experimental Results: GENIA v3.02 (10-fold CV)

| System | Overall | Protein | DNA | RNA |
|---|---|---|---|---|
| Our System | 0.721 | 0.785 | 0.700 | 0.752 |
| Zhou et al. (Bioinformatics, 2004) | 0.666 | 0.758 | 0.633 | 0.612 |

Recently, Zhou improved the F-measure of his HMM model to 0.712 by combining it with an SVM.
![Page 47: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/47.jpg)
Error Analysis
GENIA inconsistent annotation
  IL-2 gene expression
    <DNA>IL-2 gene</DNA> expression
    <othername><DNA>IL-2 gene</DNA> expression</othername>
Conjunction
  Human and mouse gene
Boundary detection error (boundary word not in the boundary word file)
  Squirrel, manic, bursal…
![Page 48: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/48.jpg)
Error Analysis
Abbreviation classification
  The orthographical form fits into at least two classes.
    Protein: SOS1, FLICE, GAG
    Other Organic: CD336
False negative
  A number of errors are due to low-frequency words or words not encountered in the training data.
False positive
  Ellipsis:
    Many inflammatory cytokine genes including TNF, IL-1, and IL-6
![Page 49: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/49.jpg)
Outline
NER (named entity recognition) in biomedical domain
Challenges in biomedical NER
Current methods and our method
State of progress in NER
Future works
![Page 50: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/50.jpg)
Manning’s Conclusion (I): Key factor for low performance
Task difficulty does not appear to be the primary factor leading to low performance.
  BioCreative: 1 class, BioNLP: 5 classes
Key factor: quality of the training and evaluation data
  Higher inconsistency in the annotation of the BioNLP data.
  Two of the authors independently reviewed 50 of the system’s errors; 34-35 were attributed to annotation.
  The authors do not think the annotation inconsistencies are due to biological subtleties.
![Page 51: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/51.jpg)
Manning’s Conclusion (II)
To improve biomedical annotation
  BioNLP organizers emphasized that participants should focus on deep knowledge sources (coreference resolution and use of dependency relations) over “widely used lexical-level features” (POS, morphological, orthographical, etc.)
Proper exploitation of external resources
  In both tasks, external resources led to an improvement of only 1-2%.
  Consistent annotation might have led to a 70% reduction in error rate.
![Page 52: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/52.jpg)
Outline
NER (named entity recognition) in biomedical domain
Challenges in biomedical NER
State of progress in NER
Abbreviation disambiguation
Future works
![Page 53: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/53.jpg)
Disambiguation of abbreviation
![Page 54: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/54.jpg)
Motivation (I)
Named entity (NE) recognition (NER) is the first step of information extraction.
NER consists of two steps
  NE identification: extract named entities from text
  NE classification: classify a given NE into a specific class
![Page 55: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/55.jpg)
Motivation (II)
Since many protein or gene names are long compound names, authors usually refer to them by abbreviations.
  A2M: Alpha-2-macroglobulin
  A4GALT: alpha 1,4-galactosyltransferase
  EGFR: epidermal growth factor receptor, EGF receptor
  NF-AT: nuclear factor of activated T cells
  HTLV-I: Human T cell lymphotropic virus I
  TCDD: 2,3,7,8-tetrachlorodibenzo-p-dioxin
  GRE: glucocorticoid response element
![Page 56: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/56.jpg)
Motivation (III)
Abbreviation identification task:
  It is easier than the classification task.
  Abbreviations often have orthographical clues.
    All capital letters, mixed letters and digits, etc.
Abbreviation classification task:
  In some situations, it is hard to disambiguate an abbreviation’s class.
  Example: the text mentions only the abbreviation, without the full name
![Page 57: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/57.jpg)
Challenges of abbreviation
Two cases
Case 1: sentence contains abbreviation and full name
Human immunodeficiency virus type 2 (HIV-2), like HIV-1, causes AIDS and is associated with AIDS cases primarily in West Africa.
Case 2: sentence contains only abbreviation
HIV-1 and HIV-2 display significant differences in nucleic acid sequence and in the natural history of clinical disease.
![Page 58: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/58.jpg)
Case 1
Case 1 is easier than Case 2.
The classification can be solved by the following steps:
  Abbreviation – full name association
  Disambiguate the full name’s class
  Assign the full name’s class to the abbreviation
The challenge has shifted from abbreviation classification to abbreviation – full name association.
![Page 59: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/59.jpg)
Example of Case 1:
Sentence
  Human immunodeficiency virus type 2 (HIV-2), like HIV-1, causes AIDS and is associated with AIDS cases primarily in West Africa.
Step 1: Abbreviation – full name association
  (Full name, Abbreviation) = (Human immunodeficiency virus type 2, HIV-2)
Step 2: Full name class assignment
  Name: Human immunodeficiency virus type 2
  Class: Virus
Step 3: Abbreviation class assignment
  Abbreviation: HIV-2
  Class: Virus
![Page 60: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/60.jpg)
A solution method to Case 1
Schwartz and Hearst, PSB 2003.
Identify <long form, short form> pairs.
Both long form and short form occur in the same sentence.
  long form ‘(’ short form ‘)’ (more frequent)
  short form ‘(’ long form ‘)’
![Page 61: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/61.jpg)
Algorithm: Identify long form ‘(’ short form ‘)’
Identify long form and short form candidates (using adjacency to parentheses).
Identify the correct long form.
  Starting from the end of both candidates, move right to left, trying to find the shortest long form that matches the short form.
  Every character in the short form must match a character in the long form.
  The matched characters in the long form must be in the same order as the characters in the short form.
Examples
  <HSF, Heat shock transcription factor>
  <TTF-1, Thyroid transcription factor 1>: fail
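The right-to-left matching step can be sketched in Python as below. This is a simplified rendering of the idea described above, not a complete reimplementation: `extract_pair` is a hypothetical helper that handles only the first `long form (short form)` occurrence in a sentence, only alphanumeric characters of the short form are matched, and the short-form validity checks of the published algorithm are omitted.

```python
import re


def find_best_long_form(short_form, candidate):
    """Scan both strings right to left; return the shortest matching long form, or None."""
    s_idx = len(short_form) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        ch = short_form[s_idx].lower()
        if not ch.isalnum():            # skip hyphens and other symbols in the short form
            s_idx -= 1
            continue
        # Move left until the character matches; the first short-form character
        # must additionally match at the start of a word.
        while (l_idx >= 0 and candidate[l_idx].lower() != ch) or (
                s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum()):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
        s_idx -= 1
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def extract_pair(sentence):
    """Return (long form, short form) for the first 'long form (short form)', or None."""
    match = re.search(r"\(([^()]+)\)", sentence)
    if match is None:
        return None
    short_form = match.group(1).strip()
    if not short_form:
        return None
    words_before = sentence[:match.start()].split()
    # Candidate window: at most min(|A| + 5, |A| * 2) words before the parenthesis.
    window = min(len(short_form) + 5, len(short_form) * 2)
    long_form = find_best_long_form(short_form, " ".join(words_before[-window:]))
    return (long_form, short_form) if long_form else None


pair = extract_pair("Human immunodeficiency virus type 2 (HIV-2), like HIV-1, causes AIDS.")
hsf = find_best_long_form("HSF", "Heat shock transcription factor")
ttf = find_best_long_form("TTF-1", "Thyroid transcription factor 1")
```

With this sketch, `hsf` is the full phrase, while `ttf` comes back as `transcription factor 1`: the first `T` is satisfied by the start of `transcription`, so the scan stops before reaching `Thyroid`. That is the failure noted for the TTF-1 example.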
![Page 62: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/62.jpg)
Error analysis
Unused characters, e.g., <CNS1, cyclophilin seven suppressor>
No pattern between long form and short form, e.g., <ATN, anterior thalamus>
Partial matching:
  The long form includes additional words to the left of the matching, e.g., <Pol I, RNA polymerase I>
Out-of-order mapping
First character matches an internal character (of the long form).
Non-continuous long form.
Transformation in the mapping (2D -> two-dimensional)
Short form of only one character.
![Page 63: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/63.jpg)
Other types of abbreviations
Schwartz and Hearst’s algorithm only considers candidates in parentheses.
Challenge: finding all possible pairs is a more difficult problem.
![Page 64: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/64.jpg)
Example of Case 2
It is hard to disambiguate an abbreviation’s class, even with context information.
Example:
  HIV-1 and HIV-2 display significant differences in nucleic acid sequence and in the natural history of clinical disease.
  HIV-1 and HIV-2 are both viruses, but if we replace HIV-1 and HIV-2 with IL-2 and IL-10, the sentence still makes sense.
  IL-2 and IL-10 display significant differences in nucleic acid sequence and in the natural history of clinical disease.
  IL-2 and IL-10: gene names
![Page 65: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/65.jpg)
Case 2
Left for future work.
Clues:
  Statistical methods
  Dictionary-based methods
![Page 66: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/66.jpg)
Outline
NER (named entity recognition) in biomedical domain
Challenges in biomedical NER
State of progress in NER
Abbreviation disambiguation
Future works
![Page 67: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/67.jpg)
What’s next after NER is solved?
Named entity relation recognition (NERR)
  Protein-Protein interaction/binding/inhibition
  Protein-Small Molecules
  Gene-Gene regulation
  Gene-Gene Product interaction
  Gene-Drug relation
  Protein-Subcellular location
  Amino Acid-Protein relation
![Page 68: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/68.jpg)
Identify Relations among Named Entities
Target: Extract relations between various biological named entities.
  Here we demonstrate that the c-myb proto-oncogene product, which is itself a DNA-binding protein, and transcriptional transactivator, can interact synergistically with Z.
Relation (Subject, Action, Object): (c-myb proto-oncogene product, interact, Z)
![Page 69: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/69.jpg)
Future works
Few papers have been published on the following specific challenging topics of NER.
  Automated corpus correction
  Disambiguation of abbreviations (Schwartz & Hearst, 2003, …)
  Conjunction
  …
NERR (difficult)
  parser
  Pronoun and anaphora resolution
![Page 70: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/70.jpg)
Acknowledgements Bioinformatics: Yi-Feng Lin, Wen-Chi Chou
NLP: Tzong-Han Tsai, Cheng-Wei Lee
Postdoc: Kuen-Pin Wu
Colleague: Wen-Lian Hsu (Fu Chang is jumping on the bandwagon now.)
![Page 71: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/71.jpg)
Lab Introduction
![Page 72: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/72.jpg)
Research topics
Protein structure prediction
  Secondary structure prediction
  Tertiary structure prediction – local structure
  Members: Hsin-Nan Lin, Caster Chen, Jia-Ming Chang
Protein structure determination based on NMR data
  Backbone assignment
  Side chain assignment
  RDC
  Jia-Ming Chang, Caster Chen, Philip Chen
  Collaborator: Prof TH Huang, IBMS
![Page 73: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/73.jpg)
Research topics
Mass spectrometry based proteomics
  Protein quantification
  Protein identification – for modification study
  Yi-Hwa Yian, Wen-Ting Lin, Jacky Chou, Wei-Nung Hung
  Collaborator: Prof YR Chen, Inst of Chemistry
Biological literature mining
  NER, NERR
  Yi-Feng Lin, Jacky Chou, Richard Tsai
![Page 74: Extracting biological names and relations from texts](https://reader035.vdocuments.mx/reader035/viewer/2022081501/568148f0550346895db60f79/html5/thumbnails/74.jpg)
Faculty PI: Wen-Lian Hsu, Ting-Yi Sung
Post-doc: Kuen-Pin Wu