BIOI 7791: Projects in Bioinformatics, Spring 2005
March 22
© Kevin B. Cohen
PGES upregulates PGE2 production in human thyrocytes
(GeneRIF: 12145315)
• Syntax: what are the relationships between words/phrases?
• Parsing: figuring out the structure
– Full parse
– Shallow parse
• “Shallow parse” = “partial parse” = “syntactic chunking”
Full parse
PGES upregulates PGE2 production in human thyrocytes
Shallow parse
[NounGroup PGES] [VerbGroup upregulates] [NounGroup PGE2 production] [PrepositionalGroup in [NounGroup human thyrocytes]]
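The chunking above can be sketched as a simple scan over POS-tagged input. The tag groupings and the tags assigned to the example words are my assumptions for illustration, not a chunker from the course:

```python
# A minimal sketch of syntactic chunking: group (word, tag) pairs into
# flat chunks by simple tag patterns. Tag sets below are illustrative.
def chunk(tagged):
    noun_tags = ("DT", "JJ", "NN", "NNS", "NNP")
    chunks = []
    i = 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag in noun_tags:
            # Absorb a maximal run of noun-group tags into one chunk.
            j = i
            while j < len(tagged) and tagged[j][1] in noun_tags:
                j += 1
            chunks.append(("NounGroup", [w for w, _ in tagged[i:j]]))
            i = j
        elif tag.startswith("VB"):
            chunks.append(("VerbGroup", [word]))
            i += 1
        elif tag == "IN":
            chunks.append(("PrepositionalGroup", [word]))
            i += 1
        else:
            chunks.append(("O", [word]))
            i += 1
    return chunks

tagged = [("PGES", "NNP"), ("upregulates", "VBZ"),
          ("PGE2", "NNP"), ("production", "NN"),
          ("in", "IN"), ("human", "JJ"), ("thyrocytes", "NNS")]
for label, words in chunk(tagged):
    print(label, " ".join(words))
```

Note this flat output never builds the nested clause structure a full parse would; each chunk stops at the base phrase.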
Shallow vs. full parsing
• Different depths
– Full parse goes down to the level of individual words
– Shallow parse doesn’t go down any further than the base phrase
• Different “heights”
– Full parse goes “up” to the root node
– Shallow parse doesn’t (generally) go further up than the base phrase
Shallow vs. full parsing
• Different number of levels of structure
– Full parse has many levels
– Shallow parse has far fewer
Shallow vs. full parsing
• Either way, you need POS information…
POS tagging: why you need it
• All syntax is built on it
• Overcomes the sparseness problem by abstracting away from specific words
• Helps you decide how to stem
• Potential basis for entity identification
What “POS tagging” is
• POS: part of speech
• School: 8 (noun, verb, adjective, interjection…)
• Real life: 40 or more
How do you get from 8 to 80?
• Noun
– NN (noun, singular or mass)
– NNS (plural noun)
– NNP (proper noun)
– NNPS (plural proper noun)
How do you get from 8 to 80?
• Verb
– VB (base form)
– VBD (past tense)
– VBG (gerund)
– VBN (past participle)
– VBP (non-3rd-person singular present tense)
– VBZ (3rd-person singular present tense)
Others that are good to recognize
• Adjective
– JJ (adjective)
– JJR (comparative adjective)
– JJS (superlative adjective)
Others that are good to recognize
• Coordinating conjunctions: CC
• Determiners: DT
• Prepositions: IN
• To: TO
• Punctuation: , (comma), . (sentence-final), : (sentence-medial)
POS tagging
• Definition: assigning POS “tags” to a string of tokens
• Input:
– string of tokens
– tag set
• Output:
– best tag for each token
How do you define noun, verb, etc.?
• Semantic:
– “A noun is a person, place, or thing…”
– “A verb is…”
• Distributional characteristics:
– “A noun can take the plural and genitive morphemes”
– “A noun can appear in the environment All of my twelve hairy ___ left before noon”
Why’s it hard?
Time flies/VBZ like/IN an arrow, but fruit flies/NNS like/VBP a banana.
POS tagging: rule-based
1. Assign each word its list of potential parts of speech
2. Use rules to remove potential tags from the list
The EngCG system:
• 56,000-item dictionary
• 3,744 rules
Note that all taggers need a way to deal with unknown words (OOV or “out-of-vocabulary”).
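The two rule-based steps above can be sketched in a few lines. The mini-lexicon and the single removal rule below are invented for illustration; EngCG itself uses its 56,000-item dictionary and 3,744 rules:

```python
# Toy constraint-grammar-style tagging: step 1 assigns every potential
# tag from a dictionary, step 2 removes tags by rule.
LEXICON = {
    "the": ["DT"],
    "time": ["NN", "VB"],
    "flies": ["NNS", "VBZ"],
    "like": ["IN", "VBP"],
}

def disambiguate(tokens):
    # Step 1: each word gets its list of potential parts of speech.
    # (Unknown words default to noun here; real taggers do better.)
    cands = [list(LEXICON.get(t.lower(), ["NN"])) for t in tokens]
    # Step 2: one sample removal rule — no verb reading directly
    # after an unambiguous determiner.
    for i in range(1, len(cands)):
        if cands[i - 1] == ["DT"]:
            kept = [t for t in cands[i] if not t.startswith("VB")]
            if kept:                  # never empty a candidate list
                cands[i] = kept
    return cands

print(disambiguate(["the", "time", "flies"]))
# "time" keeps only its noun reading once the rule fires
```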
As always, (about) two approaches….
• Rule-based
• Learning-based
An aside: tagger input formats
apoptosis in a human tumor cell line .
apoptosis/NN in/IN a/DT human/JJ tumor/NN cell/NN line/NN ./.
apoptosis	NN
in	IN
a	DT
human	JJ
tumor	NN
cell	NN
line	NN
.	.
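The slash format and the one-token-per-line format carry the same information, so converting between them is a one-liner each way. A small sketch (splitting on the last slash, in case a token itself contains one):

```python
# Read "token/TAG token/TAG ..." into (token, tag) pairs.
def parse_slashed(line):
    pairs = []
    for item in line.split():
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs

pairs = parse_slashed(
    "apoptosis/NN in/IN a/DT human/JJ tumor/NN cell/NN line/NN ./."
)
# One token per line, tab-separated, for the vertical format:
print("\n".join(f"{tok}\t{tag}" for tok, tag in pairs))
```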
Just how ambiguous is natural language?
• Most English words are not ambiguous…
• …but, many of the most common ones are.
• Brown corpus: only 11.5% of word types ambiguous…
• …but > 40% of tokens ambiguous.
• Dictionary doesn’t give you a good estimate of the problem space…
…but corpus data does.
Empirical question: how ambiguous is biomedical text?
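The type/token distinction above is easy to measure once you have tagged corpus data. A sketch on an invented two-sentence corpus (the Brown-corpus figures on the slide come from the real corpus, not from anything this small):

```python
from collections import defaultdict

# Toy tagged corpus: "time flies like an arrow, fruit flies like a banana"
corpus = [("time", "NN"), ("flies", "VBZ"), ("like", "IN"),
          ("an", "DT"), ("arrow", "NN"), ("fruit", "NN"),
          ("flies", "NNS"), ("like", "VBP"), ("a", "DT"),
          ("banana", "NN")]

# Which tags has each word type been seen with?
tags_seen = defaultdict(set)
for word, tag in corpus:
    tags_seen[word].add(tag)

ambiguous_types = {w for w, ts in tags_seen.items() if len(ts) > 1}
type_rate = len(ambiguous_types) / len(tags_seen)
token_rate = sum(1 for w, _ in corpus if w in ambiguous_types) / len(corpus)
print(f"{type_rate:.0%} of types, {token_rate:.0%} of tokens ambiguous")
```

Even in this toy corpus the token rate outruns the type rate, because the ambiguous words (“flies”, “like”) are exactly the ones that recur.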
A statistical approach: TnT
• Second-order Markov model
• Smoothing by linear interpolation of n-grams
• λ estimated by deleted interpolation
• Tag probabilities learned for word endings; used for unknown words
TnT
• N-gram: an n-tag or n-word sequence
• Unigrams (N = 1)
– DET
– NOUN
– role
• Bigrams
– DET NOUN
– NOUN PREPOSITION
– a role
• Trigrams
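TnT's interpolated trigram probability can be written down directly from n-gram counts. The counts and λ weights below are invented for illustration; TnT estimates the λs from the training data by deleted interpolation:

```python
def p_interp(t1, t2, t3, uni, bi, tri, n, lambdas):
    """P(t3 | t1, t2) as a weighted mix of unigram, bigram, and
    trigram maximum-likelihood estimates (TnT-style smoothing)."""
    l1, l2, l3 = lambdas                 # assumed: l1 + l2 + l3 = 1
    p_uni = uni.get(t3, 0) / n
    p_bi = bi.get((t2, t3), 0) / uni[t2] if uni.get(t2) else 0.0
    p_tri = (tri.get((t1, t2, t3), 0) / bi[(t1, t2)]
             if bi.get((t1, t2)) else 0.0)
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Invented toy counts over 9 tag tokens:
uni = {"DT": 3, "NN": 4, "VBZ": 2}
bi = {("DT", "NN"): 3, ("VBZ", "DT"): 2}
tri = {("VBZ", "DT", "NN"): 1}
p = p_interp("VBZ", "DT", "NN", uni, bi, tri, n=9,
             lambdas=(0.1, 0.3, 0.6))
```

The interpolation is what makes the model robust: even when a trigram was never seen, the bigram and unigram terms keep its probability off zero.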
The Brill Tagger
The Brill tagger
• Uses rules
• …but the set of rules is induced automatically.
The Brill tagger
• Iterative error reduction
1. Assign most common tags, then
2. Evaluate performance, then
3. Propose rules to fix errors
4. Evaluate performance, then
5. If you’ve improved, GOTO 3, else END
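The five-step loop above can be sketched directly. The rule space here (retag X as Y when the previous tag is C) is a tiny subset of Brill's actual rule templates, and the training data is invented:

```python
def apply_rule(tags, rule):
    """Apply one transformation: change frm -> to after prev."""
    frm, to, prev = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == frm and tags[i - 1] == prev:
            out[i] = to
    return out

def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def learn(initial, gold, tagset):
    """Greedy iterative error reduction over one-template rules."""
    tags, rules = list(initial), []
    while True:
        # Step 3: propose every candidate rule; keep the best one.
        best = min(
            ((frm, to, prev) for frm in tagset for to in tagset
             for prev in tagset if frm != to),
            key=lambda r: errors(apply_rule(tags, r), gold),
        )
        # Step 5: if no rule improves the error count, END.
        if errors(apply_rule(tags, best), gold) >= errors(tags, gold):
            return rules
        tags = apply_rule(tags, best)
        rules.append(best)

# "The running of": most-common-tag assignment mistakes the nominal
# "running" for a verb; learning recovers the Determiner-context rule.
learned = learn(initial=["DT", "VB", "IN"], gold=["DT", "NN", "IN"],
                tagset=["DT", "NN", "VB", "IN"])
print(learned)
```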
The Brill tagger
• Change Determiner Verb “of”
• …to…
• Determiner Noun “of”
The/Determiner running/Verb of/IN → The/Determiner running/Noun of/IN
An aside: evaluating POS taggers
• Accuracy
• Confusion matrix
• How hard is the task? Domain/genre-specific…
– Baseline: give each word its most common tag
– Ceiling: interannotator agreement (usually high 90’s; low 90’s on some corpora!)
– State of the art: 96–97% total accuracy; lower for non-punctuation
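The most-common-tag baseline takes only a few lines given tagged training data (the toy training set below is invented):

```python
from collections import Counter, defaultdict

train = [("flies", "VBZ"), ("flies", "VBZ"), ("flies", "NNS"),
         ("like", "IN"), ("like", "VBP"), ("like", "IN")]

# Count how often each word carries each tag in training.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def baseline_tag(word):
    # Simplest unknown-word strategy: call them all nouns.
    if word not in counts:
        return "NN"
    return counts[word].most_common(1)[0][0]

print(baseline_tag("flies"), baseline_tag("like"), baseline_tag("PGES"))
```

Simple as it is, this baseline is what state-of-the-art figures have to be measured against, which is why 96–97% accuracy is less impressive than it first sounds.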
Confusion matrix
	JJ	NN	VBD
JJ	--	.6	4.6
NN	.5	--	
VBD	5.4	.01	--
Columns = tagger output
Rows = right answer
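A confusion matrix falls straight out of pairing gold tags with tagger output (the five-token example below is invented):

```python
from collections import Counter

gold   = ["JJ", "NN", "VBD", "NN", "JJ"]   # rows: right answer
tagged = ["JJ", "NN", "VBD", "JJ", "VBD"]  # columns: tagger output

# Each (gold, tagged) pair is one cell of the matrix.
confusion = Counter(zip(gold, tagged))
for (g, t), n in sorted(confusion.items()):
    print(f"gold={g} tagged={t}: {n}")
```

Off-diagonal cells (gold ≠ tagged) are the errors; a matrix like the one on the slide just reports those counts as percentages.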
An aside: unknown words
• Call them all nouns
• Learn most common POS from training data
• Use morphology
• Suffix trees
• Other features, e.g. hyphenation (JJ in Brown; biomed?), capitalization…
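Several of the unknown-word strategies above (morphology, capitalization, hyphenation, noun as default) can be combined into one guesser. The suffix list and the ordering of the checks are my assumptions for illustration:

```python
def guess_tag(word):
    """Guess a POS tag for an out-of-vocabulary word from its shape."""
    if word[:1].isupper():
        return "NNP"          # capitalization suggests a proper noun
    if "-" in word:
        return "JJ"           # hyphenated forms are often adjectival
    if word.endswith("ing"):
        return "VBG"
    if word.endswith("ed"):
        return "VBN"
    if word.endswith("s"):
        return "NNS"
    return "NN"               # fall back: call it a noun

print(guess_tag("thyrocytes"))
```

Whether these Brown-corpus-style cues carry over to biomedical text (where capitalized gene symbols like PGES are everywhere) is exactly the empirical question raised earlier.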
POS tagging: extension(s)
• Entity identification
• What else??
• First step in any POS tagging effort:
– Tokenization
– …maybe sentence segmentation
First programming assignment: tokenization
• What was hard?
• What if I told you that dictionaries don’t work for recognizing gene names, chemicals, or other “entities”?
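A regex tokenizer as naive as the one below is a natural first attempt for the assignment; part of what makes tokenization hard is that biomedical text quickly breaks patterns this simple:

```python
import re

def tokenize(text):
    # Runs of word characters, or any single non-space punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("apoptosis in a human tumor cell line."))
```

Note what this does to forms like “IL-2R” or “PGE(2)”: the hyphen and parentheses split off as separate tokens, which may or may not be what a downstream entity recognizer wants.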