Part-of-speech tagging and chunking with log-linear models
University of Manchester / National Centre for Text Mining (NaCTeM)
Yoshimasa Tsuruoka
Outline
• POS tagging and chunking for English
– Conditional Markov Models (CMMs)
– Dependency Networks
– Bidirectional CMMs
• Maximum entropy learning
• Conditional Random Fields (CRFs)
• Domain adaptation of a tagger
Part-of-speech tagging
• The tagger assigns a part-of-speech tag to each word in the sentence.
The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …
Algorithms for part-of-speech tagging
• Tagging speed and accuracy on WSJ
Method                     Tagging speed  Accuracy
Dependency Net (2003)      Slow           97.24
SVM (2004)                 Fast           97.16
Perceptron (2002)          ?              97.11
Bidirectional CMM (2005)   Fast           97.10
HMM (2000)                 Very fast      96.7*
CMM (1998)                 Fast           96.6*

* evaluated on a different portion of WSJ
Chunking (shallow parsing)
• A chunker (shallow parser) segments a sentence into non-recursive phrases
[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September] .
Chunking (shallow parsing)
• Chunking tasks can be converted into a standard tagging task
• Different approaches:– Sliding window
– Semi-Markov CRF
– …
He/BNP reckons/BVP the/BNP current/INP account/INP deficit/INP will/BVP narrow/IVP to/BPP
only/BNP #/INP 1.8/INP billion/INP in/BPP September/BNP .
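The conversion from chunks to tags can be sketched directly: each chunk start gets a B-prefixed tag and each continuation an I-prefixed tag, matching the BNP/INP encoding above. A minimal sketch; the function name and the (start, end, label) span representation are illustrative assumptions, not from any of the cited systems.

```python
def chunks_to_tags(tokens, chunks):
    """Encode chunk spans as B-/I- style tags (BNP, INP, ...).
    `chunks` holds (start, end, label) spans over token indices;
    tokens outside any chunk keep the tag "O"."""
    tags = ["O"] * len(tokens)
    for start, end, label in chunks:
        tags[start] = "B" + label            # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I" + label            # tokens inside the chunk
    return tags

tokens = ["He", "reckons", "the", "current", "account", "deficit"]
spans = [(0, 1, "NP"), (1, 2, "VP"), (2, 6, "NP")]
tags = chunks_to_tags(tokens, spans)
print(tags)  # ['BNP', 'BVP', 'BNP', 'INP', 'INP', 'INP']
```

The inverse mapping (tags back to spans) is equally mechanical, which is what makes chunking reducible to a standard tagging task.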
Algorithms for chunking
• Chunking speed and accuracy on Penn Treebank
Method                     Tagging speed  Accuracy
SVM + voting (2001)        Slow?          93.91
Perceptron (2003)          ?              93.74
Bidirectional CMM (2005)   Fast           93.70
SVM (2000)                 Fast           93.48
Conditional Markov Models (CMMs)
• Left to right decomposition (with the first-order Markov assumption)
P(t_1 … t_n | o) = Π_{i=1..n} P(t_i | t_{i-1}, o)

(figure: left-to-right chain over t_1, t_2, t_3, each conditioned on the observation o)
POS tagging with CMMs [Ratnaparkhi 1996; etc.]
• Left-to-right decomposition
– The local classifier uses the information on the preceding tag.
He/PRP runs/VBZ fast/RB (tags assigned left to right)

P(t_1 t_2 t_3 | o) = P(t_1 | o) P(t_2 | t_1, o) P(t_3 | t_2, o) …
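Decoding under this decomposition can be sketched as a greedy left-to-right pass that feeds each predicted tag into the next local decision; actual CMM taggers typically keep a beam of candidate sequences rather than a single greedy path. The `local_prob` interface and the toy model below are assumptions for illustration.

```python
def greedy_decode(words, local_prob, tagset):
    """Greedy left-to-right decoding for a CMM: at each position pick
    argmax_t P(t | previous tag, observation)."""
    tags = []
    prev = "<s>"  # sentence-start pseudo-tag
    for i in range(len(words)):
        best = max(tagset, key=lambda t: local_prob(t, prev, words, i))
        tags.append(best)
        prev = best  # the prediction becomes the next step's context
    return tags

# Toy local model: deterministic transitions standing in for a trained
# maximum entropy classifier.
table = {"<s>": "PRP", "PRP": "VBZ", "VBZ": "RB"}
local_prob = lambda t, prev, words, i: 1.0 if table.get(prev) == t else 0.0
tags = greedy_decode(["He", "runs", "fast"], local_prob, ["PRP", "VBZ", "RB"])
print(tags)  # ['PRP', 'VBZ', 'RB']
```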
Examples of the features for local classification
Word unigram       w_i, w_{i-1}, w_{i+1}
Word bigram        w_{i-1} w_i, w_i w_{i+1}
Previous tag       t_{i-1}
Tag/word           t_{i-1} w_i
Prefix/suffix      up to length 10
Lexical features   hyphen, number, etc.
He/PRP runs/? fast
(the tag of "runs" is predicted using the preceding tag PRP as a feature)
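The templates in the table can be instantiated as string-valued features; the template names and padding token below are illustrative choices, not the tagger's actual feature names.

```python
def extract_features(words, i, prev_tag):
    """Instantiate the feature templates above for position i."""
    w = lambda j: words[j] if 0 <= j < len(words) else "<pad>"
    feats = [
        "w0=" + w(i), "w-1=" + w(i - 1), "w+1=" + w(i + 1),  # word unigrams
        "w-1w0=" + w(i - 1) + "_" + w(i),                    # word bigrams
        "w0w+1=" + w(i) + "_" + w(i + 1),
        "t-1=" + prev_tag,                                   # previous tag
        "t-1w0=" + prev_tag + "_" + w(i),                    # tag/word
    ]
    for k in range(1, min(10, len(w(i))) + 1):               # prefix/suffix up to length 10
        feats.append("pre=" + w(i)[:k])
        feats.append("suf=" + w(i)[-k:])
    if "-" in w(i):                                          # lexical features
        feats.append("has-hyphen")
    if any(c.isdigit() for c in w(i)):
        feats.append("has-number")
    return feats

feats = extract_features(["He", "runs", "fast"], 1, "PRP")
print("t-1=PRP" in feats, "suf=s" in feats)  # True True
```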
POS tagging with Dependency Network [Toutanova et al. 2003]
• Use the information on the following tag as well
Score(t_1 … t_n | o) = Π_{i=1..n} P(t_i | t_{i-1}, t_{i+1}, o)

This is no longer a probability. You can use the following tag as a feature in the local classification model.

(figure: chain over t_1, t_2, t_3 with dependencies in both directions)
POS tagging with a Cyclic Dependency Network [Toutanova et al. 2003]
• Training cost is small, almost equal to CMMs.
• Decoding can be performed with dynamic programming, but it is still expensive.
• Collusion: the model can lock onto conditionally consistent but jointly unlikely sequences.
Bidirectional CMMs [Tsuruoka and Tsujii, 2005]
• Possible decomposition structures
• Bidirectional CMMs: we can find the "best" structure and tag sequences in polynomial time

(figure: four possible decomposition structures (a)-(d) over t_1, t_2, t_3)
Maximum entropy learning
• Log-linear modeling
p(y | x) = (1/Z(x)) exp( Σ_i λ_i f_i(x, y) )

Z(x) = Σ_y exp( Σ_i λ_i f_i(x, y) )

f_i: feature function, λ_i: feature weight
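In code, the model and its normalizer look like this; the feature representation (a function returning active feature names) and the toy weights are illustrative assumptions.

```python
import math

def cond_prob(x, y, labels, feats, lam):
    """p(y|x) = exp(sum_i lam_i f_i(x,y)) / Z(x), where
    Z(x) sums the same exponential over all candidate labels."""
    score = lambda yy: math.exp(sum(lam.get(f, 0.0) for f in feats(x, yy)))
    z = sum(score(yy) for yy in labels)  # the normalizer Z(x)
    return score(y) / z

# Toy binary-valued features and hand-set weights.
feats = lambda x, y: ["word=%s&tag=%s" % (x, y), "suffix=%s&tag=%s" % (x[-2:], y)]
lam = {"word=opened&tag=verb": 1.0, "suffix=ed&tag=verb": 0.5}
p = cond_prob("opened", "verb", ["noun", "verb"], feats, lam)
print(round(p, 3))  # 0.818
```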
Maximum entropy learning
• Maximum likelihood estimation
– Find the parameters that maximize the (log-)likelihood of the training data
• Smoothing
– Gaussian prior [Berger et al., 1996]
– Inequality constraints [Kazama and Tsujii, 2005]
LL(λ) = Σ_{x,y} p̃(x) p̃(y|x) log p(y|x)
Parameter estimation
• Algorithms for maximum entropy
– GIS [Darroch and Ratcliff, 1972], IIS [Della Pietra et al., 1997]
• General-purpose algorithms for numerical optimization– BFGS [Nocedal and Wright, 1999], LMVM [Benson and More, 2001]
• You need to provide the objective function and gradient:– Likelihood of training samples– Model expectation of each feature
∂LL(λ)/∂λ_i = E_p̃[f_i] − E_p[f_i]

LL(λ) = Σ_{x,y} p̃(x) p̃(y|x) log p(y|x)
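The gradient can be assembled exactly as described: count features on the observed pairs (empirical expectation), then subtract features weighted by the model's current conditional distribution (model expectation). The data representation is an assumption for this sketch; `data` is a list of (x, y) pairs.

```python
import math

def gradient(data, labels, feats, lam):
    """d LL / d lam_i = E_p~[f_i] - E_p[f_i]: observed feature count
    minus the count the current model expects."""
    grad = {}
    for x, y in data:
        # Empirical expectation: count features on the observed (x, y).
        for f in feats(x, y):
            grad[f] = grad.get(f, 0.0) + 1.0
        # Model expectation: features weighted by p(y'|x) under lam.
        score = {yy: math.exp(sum(lam.get(f, 0.0) for f in feats(x, yy)))
                 for yy in labels}
        z = sum(score.values())
        for yy in labels:
            for f in feats(x, yy):
                grad[f] = grad.get(f, 0.0) - score[yy] / z
    return grad

feats = lambda x, y: ["%s&%s" % (x, y)]
g = gradient([("opened", "verb")], ["noun", "verb"], feats, {})
print(round(g["opened&verb"], 2))  # 0.5
```

With all-zero weights the model is uniform, so it expects each feature half the time; the observed "opened&verb" fires once, leaving a positive gradient that pushes that weight up.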
Computing likelihood and model expectation
• Example– Two possible tags: “Noun” and “Verb”
– Two types of features: “word” and “suffix”
He/Noun opened/Verb it/Noun

Active features for "opened":
tag = noun: {tag=noun & word=opened, tag=noun & suffix=ed}
tag = verb: {tag=verb & word=opened, tag=verb & suffix=ed}

p(verb | opened) = exp(λ_{tag=verb & word=opened} + λ_{tag=verb & suffix=ed})
/ [ exp(λ_{tag=noun & word=opened} + λ_{tag=noun & suffix=ed}) + exp(λ_{tag=verb & word=opened} + λ_{tag=verb & suffix=ed}) ]
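With made-up weights for the four features in this example (the numeric values are invented purely for illustration), the normalized probability works out as:

```python
import math

# Hand-set weights for the word and suffix features of "opened".
lam = {
    ("word=opened", "noun"): 0.2, ("suffix=ed", "noun"): 0.1,
    ("word=opened", "verb"): 1.5, ("suffix=ed", "verb"): 0.8,
}

def p_tag(tag):
    """Normalize exp(summed weights) over the two candidate tags."""
    active = ["word=opened", "suffix=ed"]
    s = {t: math.exp(sum(lam[(f, t)] for f in active)) for t in ("noun", "verb")}
    return s[tag] / sum(s.values())

print(round(p_tag("verb"), 3))  # 0.881
```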
Conditional Random Fields (CRFs)
• A single log-linear model on the whole sentence
• One can use exactly the same techniques as maximum entropy learning to estimate the parameters.
• However, the number of classes, i.e. possible tag sequences, is huge (exponential in sentence length), so doing this naively is impossible in practice.
P(t_1 … t_n | o) = (1/Z(o)) exp( Σ_{i=1..F} λ_i f_i(t_1 … t_n, o) )
Conditional Random Fields (CRFs)
• Solution– Let’s restrict the types of features
– Then, you can use a dynamic programming algorithm that drastically reduces the amount of computation
• Features you can use (in first-order CRFs)– Features defined on the tag
– Features defined on the adjacent pair of tags
Features
• Feature weights are associated with states and edges
(figure: trellis over "He has opened it" with Noun and Verb states at each position)

State feature example: W0=He & Tag=Noun
Edge feature example: Tagleft=Noun & Tagright=Noun
A naive way of calculating Z(x)
Noun Noun Noun Noun = 7.2
Noun Noun Noun Verb = 1.3
Noun Noun Verb Noun = 4.5
Noun Noun Verb Verb = 0.9
Noun Verb Noun Noun = 2.3
Noun Verb Noun Verb = 11.2
Noun Verb Verb Noun = 3.4
Noun Verb Verb Verb = 2.5
Verb Noun Noun Noun = 4.1
Verb Noun Noun Verb = 0.8
Verb Noun Verb Noun = 9.7
Verb Noun Verb Verb = 5.5
Verb Verb Noun Noun = 5.7
Verb Verb Noun Verb = 4.3
Verb Verb Verb Noun = 2.2
Verb Verb Verb Verb = 1.9

Sum = 67.5
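The naive computation enumerates every tag sequence, so its cost is |tags|^n. A sketch with toy emission/transition factors (not the slide's numbers):

```python
from itertools import product

def naive_Z(n, tags, score):
    """Sum the unnormalized score of all |tags|**n tag sequences."""
    return sum(score(seq) for seq in product(tags, repeat=n))

# Toy factors standing in for exp(feature weights) on states and edges.
emit = {"Noun": 1.5, "Verb": 1.2}
trans = {("Noun", "Noun"): 0.5, ("Noun", "Verb"): 2.0,
         ("Verb", "Noun"): 1.0, ("Verb", "Verb"): 0.5}

def score(seq):
    s = 1.0
    for i, t in enumerate(seq):
        s *= emit[t]                     # state factor
        if i > 0:
            s *= trans[(seq[i - 1], t)]  # edge factor
    return s

print(round(naive_Z(2, ("Noun", "Verb"), score), 3))  # 7.245
```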
Dynamic programming
• Results of intermediate computation can be reused.
(figure: trellis over "He has opened it"; the partial sum at each Noun/Verb state is computed once and shared by all sequences passing through it)
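The same partition function can be computed in O(n·|tags|²) with a forward sweep, keeping one summed score per state per position instead of enumerating sequences. The toy factors below are assumptions, not trained weights.

```python
def forward_Z(n, tags, emit, trans):
    """Forward algorithm: alpha[t] holds the summed score of all
    prefixes ending in tag t; after n positions, Z = sum(alpha)."""
    alpha = {t: emit[t] for t in tags}                 # position 1
    for _ in range(n - 1):
        alpha = {t: emit[t] * sum(alpha[s] * trans[(s, t)] for s in tags)
                 for t in tags}                        # extend by one position
    return sum(alpha.values())

emit = {"Noun": 1.5, "Verb": 1.2}
trans = {("Noun", "Noun"): 0.5, ("Noun", "Verb"): 2.0,
         ("Verb", "Noun"): 1.0, ("Verb", "Verb"): 0.5}
print(round(forward_Z(2, ("Noun", "Verb"), emit, trans), 3))  # 7.245
```

For two tags and two positions this matches the brute-force sum over the four sequences, but the cost now grows linearly rather than exponentially with sentence length.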
Maximum entropy learning and Conditional Random Fields
• Maximum entropy learning
– Log-linear modeling + MLE
– Parameter estimation
• Likelihood of each sample
• Model expectation of each feature
• Conditional Random Fields
– Log-linear modeling on the whole sentence
– Features are defined on states and edges
– Dynamic programming
Named Entity Recognition
We have shown that [interleukin-1]protein ([IL-1]protein) and [IL-2]protein control [IL-2 receptor alpha (IL-2R alpha) gene]DNA transcription in [CD4-CD8- murine T lymphocyte precursors]cell_line.
Algorithms for Biomedical Named Entity Recognition
Method                                     Recall  Precision  F-score
SVM+HMM (2004)                             76.0    69.4       72.6
Semi-Markov CRF [Okanohara et al., 2006]   72.7    70.4       71.5
Sliding window                             75.8    67.5       70.8
MEMM (2004)                                71.6    68.6       70.1
CRF (2004)                                 70.3    69.3       69.8
• Shared task data from the COLING 2004 BioNLP workshop
Domain adaptation
• Large training sets are available for general domains (e.g. Penn Treebank WSJ)
• NLP tools trained on general-domain data are less accurate on biomedical text
• Developing domain-specific annotated data requires considerable human effort
Tagging errors made by a tagger trained on WSJ
• Accuracy of the tagger on the GENIA POS corpus: 84.4%
… and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ .
… two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS …
… by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN .
… to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN …
Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN …
Re-training of maximum entropy models
• Taggers trained as maximum entropy models
• Adapting maximum entropy models to target domains by re-training with domain-specific data
p(x) = (1/Z) exp( Σ_{i=1..F} λ_i f_i(x) )

f_i: feature function (given by the developer)
λ_i: model parameter
Methods for domain adaptation
• Combined training data: a model is trained from scratch with the original and domain-specific data
• Reference distribution: an original model is used as a reference probabilistic distribution of a domain-specific model
p_new(x) = p_orig(x) · (1/Z) exp( Σ_{i=1..F} λ_i f_i(x) )
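The reference-distribution approach can be sketched as a conditional model whose unnormalized score is the original model's probability times a new exponential factor; only the new weights would be trained on domain data. All interfaces below are assumptions for illustration.

```python
import math

def p_new(x, y, labels, feats_new, lam_new, p_orig):
    """Reference-distribution adaptation: the original model p_orig
    acts as a fixed prior factor; lam_new are the only weights that
    the target-domain training would adjust."""
    score = lambda yy: p_orig(x, yy) * math.exp(
        sum(lam_new.get(f, 0.0) for f in feats_new(x, yy)))
    z = sum(score(yy) for yy in labels)
    return score(y) / z

# With no new features (empty weights) the adapted model reduces to
# the original distribution.
p_orig = lambda x, y: {"noun": 0.3, "verb": 0.7}[y]
p = p_new("opened", "verb", ["noun", "verb"], lambda x, y: [], {}, p_orig)
print(round(p, 2))  # 0.7
```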
Adaptation of the part-of-speech tagger
• Relationships among training and test data are evaluated on the following corpora:
– WSJ: Penn Treebank WSJ
– GENIA: GENIA POS corpus [Kim et al., 2003]
• 2,000 MEDLINE abstracts selected by the MeSH terms Human, Blood cells, and Transcription factors
– PennBioIE: Penn BioIE corpus [Kulick et al., 2004]
• 1,100 MEDLINE abstracts about inhibition of the cytochrome P450 family of enzymes
• 1,157 MEDLINE abstracts about molecular genetics of cancer
– Fly: 200 MEDLINE abstracts on Drosophila melanogaster
Training and test sets

• Training sets
            # tokens   # sentences
WSJ         912,344    38,219
GENIA       450,492    18,508
PennBioIE   641,838    29,422
Fly                    1,024

• Test sets
            # tokens   # sentences
WSJ         129,654    5,462
GENIA       50,562     2,036
PennBioIE   70,713     3,270
Fly         7,615      326
Experimental results

Accuracy per test set:
Training data           WSJ     GENIA   PennBioIE  Fly     Training time (sec.)
WSJ+GENIA+PennBioIE     96.68   98.10   97.65      96.35
Fly only                                           93.91
Combined                96.69   98.12   97.65      97.94   30,632
Ref. dist.              95.38   98.17   96.93      98.08   21
Corpus size vs. accuracy (combined training data)

(figure: accuracy (%) on the Fly, WSJ, GENIA, and Penn test sets as the number of Fly training sentences grows from 8 to 1,024; y-axis 95.0 to 99.0)
Corpus size vs. accuracy (reference distribution)

(figure: accuracy (%) on the Fly, WSJ, GENIA, and Penn test sets as the number of Fly training sentences grows from 8 to 1,024; y-axis 94.0 to 99.0)
Summary
• POS tagging
– MEMM-like approaches achieve good performance with reasonable computational cost. CRFs seem too computationally expensive at present.
• Chunking
– CRFs yield good performance for NP chunking. Semi-Markov CRFs are promising, but we need to somehow reduce their computational cost.
• Domain adaptation
– One can easily use information about the original domain as the reference distribution.
References
• A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics.
• Adwait Ratnaparkhi. (1996). A Maximum Entropy Part-Of-Speech Tagger. Proceedings of EMNLP.
• Thorsten Brants. (2000). TnT: A Statistical Part-Of-Speech Tagger. Proceedings of ANLP.
• Taku Kudo and Yuji Matsumoto. (2001). Chunking with Support Vector Machines. Proceedings of NAACL.
• John Lafferty, Andrew McCallum, and Fernando Pereira. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of ICML.
• Michael Collins. (2002). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. Proceedings of EMNLP.
• Fei Sha and Fernando Pereira. (2003). Shallow Parsing with Conditional Random Fields. Proceedings of HLT-NAACL.
• K. Toutanova, D. Klein, C. Manning, and Y. Singer. (2003). Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. Proceedings of HLT-NAACL.
References
• Xavier Carreras and Lluis Marquez. (2003). Phrase recognition by filtering and ranking with perceptrons. Proceedings of RANLP.
• Jesús Giménez and Lluís Márquez. (2004). SVMTool: A general POS tagger generator based on Support Vector Machines. Proceedings of LREC.
• Sunita Sarawagi and William W. Cohen. (2004). Semi-Markov Conditional Random Fields for Information Extraction. Proceedings of NIPS 2004.
• Yoshimasa Tsuruoka and Jun'ichi Tsujii. (2005). Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data. Proceedings of HLT/EMNLP.
• Yuka Tateisi, Yoshimasa Tsuruoka and Jun'ichi Tsujii. (2006). Subdomain Adaptation of a POS Tagger with a Small Corpus. Proceedings of the HLT-NAACL BioNLP Workshop.
• Daisuke Okanohara, Yusuke Miyao, Yoshimasa Tsuruoka, and Jun'ichi Tsujii. (2006). Improving the Scalability of Semi-Markov Conditional Random Fields for Named Entity Recognition. Proceedings of COLING/ACL 2006.