Hindi Parts-of-Speech Tagging & Chunking
Baskaran S
MSRI
What's in?
Why POS tagging & chunking?
Approach
Challenges: unseen tag sequences, unknown words
Results
Future work
Conclusion
Intro & Motivation
POS
Parts-of-speech: Dionysius Thrax (ca. 100 BC) identified 8 types – noun, verb, pronoun, preposition, adverb, conjunction, participle and article
"I get my thing in action. (Verb, that's what's happenin') To work, (Verb!) To play, (Verb!) To live, (Verb!) To love... (Verb!...)"
- Schoolhouse Rock
Tagging
Assigning the appropriate POS or lexical class marker to words in a given text
Symbols, punctuation markers etc. are also assigned specific tag(s)
Why POS tagging?
Gives significant information about a word and its neighbours: adjectives occur near nouns, adverbs near verbs
Gives a clue to how a word is pronounced: OBject as a noun vs. obJECT as a verb
Useful for speech synthesis, full parsing of sentences, IR, word sense disambiguation etc.
Chunking
Identifying simple phrases: noun phrase, verb phrase, adjectival phrase…
Useful as a first step to parsing and to named entity recognition
POS tagging & Chunking
Stochastic approaches
Motivated by the availability of tagged corpora in large quantity; most approaches are based on HMMs
Weischedel '93; DeRose '88; Skut and Brants '98 – extending HMM to chunking; Zhou and Su '00; and lots more…
HMM
Bayes' rule: P(T|W) = P(W|T) P(T) / P(W)

Best tag sequence: \hat{T} = \arg\max_T P(T|W)

Trigram approximation:
P(T|W) \approx P(t_1)\, P(t_2|t_1) \prod_{i=3}^{n} P(t_i \mid t_{i-1}, t_{i-2}) \prod_{i=1}^{n} P(w_i \mid t_i)

The tag-sequence probability (the first three factors) and the word-emit probability (the final product) are estimated from an annotated corpus.

Assumptions: the probability of a word depends only on its tag; the tag history is approximated by the most recent two tags.
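To make the model concrete, the following is a minimal Python sketch (not from the talk) that scores a candidate tag sequence under exactly these assumptions; tag_unigram_p, tag_bigram_p, tag_trigram_p and emit_p are hypothetical probability tables estimated from an annotated corpus.

import math

# Illustrative sketch only: log of
# P(t1) P(t2|t1) * prod_{i>=3} P(ti | t_{i-2}, t_{i-1}) * prod_i P(wi | ti)
def sequence_log_prob(words, tags, tag_unigram_p, tag_bigram_p, tag_trigram_p, emit_p):
    logp = math.log(tag_unigram_p[tags[0]])                   # P(t1)
    if len(tags) > 1:
        logp += math.log(tag_bigram_p[(tags[0], tags[1])])    # P(t2|t1)
    for i in range(2, len(tags)):                             # two-tag history
        logp += math.log(tag_trigram_p[(tags[i - 2], tags[i - 1], tags[i])])
    for w, t in zip(words, tags):
        logp += math.log(emit_p[(t, w)])                      # word depends only on its tag
    return logp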
Structural tags
A triple – POS tag, structural relation & chunk tag
Originally proposed by Skut & Brants '98; seven relations
Enables embedded and overlapping chunks
Structural relations
[Figure: an example Hindi sentence annotated with chunk labels (NP, VG, SSF), Beg/End markers and structural-relation codes (00, 90, 09, 99).]
Decoding
Viterbi is mostly used (also A* or stack decoding)
Aims at finding the best path (tag sequence) given the observation sequence
Possible tags are identified for each transition, with associated probabilities; the best path is the one that maximizes the product of these transition probabilities
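As an illustration only, here is a minimal Viterbi sketch, simplified to a bigram tag history for brevity (the model above uses the two previous tags); start_p, trans_p and emit_p are hypothetical probability tables, and log probabilities are used to avoid underflow.

import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    # Column 0: start probability times the emission of the first word.
    V = [{t: (math.log(start_p[t]) + math.log(emit_p[(t, words[0])]), [t]) for t in tags}]
    for w in words[1:]:
        col = {}
        for t in tags:
            # Pick the predecessor that maximizes path score plus transition score.
            prev, (prev_score, prev_path) = max(
                V[-1].items(),
                key=lambda kv: kv[1][0] + math.log(trans_p[(kv[0], t)]))
            score = prev_score + math.log(trans_p[(prev, t)]) + math.log(emit_p[(t, w)])
            col[t] = (score, prev_path + [t])
        V.append(col)
    # The best complete path is the highest-scoring entry in the last column.
    return max(V[-1].values(), key=lambda sp: sp[0])[1]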
[Figure, shown over three slides: Viterbi lattice for an example Hindi sentence, with the candidate tags JJ, NLOC, NN, PREP, PRP, QFN, RB, VFM and SYM considered for each word.]
Issues
1. Unseen tag sequences
Smoothing (Add-One, Good-Turing) and/or backoff (deleted interpolation)
The idea is to redistribute some fractional probability mass from seen occurrences to unseen ones
Good-Turing re-estimates the probability mass of lower-count N-grams from that of higher counts:

c^* = (c + 1)\, \frac{N_{c+1}}{N_c}

where N_c is the number of N-grams occurring c times.
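A minimal sketch of this re-estimation (illustrative only; ngram_counts is a hypothetical dictionary of raw counts):

from collections import Counter

def good_turing(ngram_counts):
    # N[c] = number of distinct N-grams occurring exactly c times.
    N = Counter(ngram_counts.values())
    adjusted = {}
    for ngram, c in ngram_counts.items():
        # c* = (c + 1) * N_{c+1} / N_c; keep the raw count if no higher-count evidence exists.
        adjusted[ngram] = (c + 1) * N[c + 1] / N[c] if N[c + 1] else c
    return adjusted

The probability mass left over for unseen events then corresponds to N_1 divided by the total number of observations.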
2. Unseen words
The corpus is insufficient (even after 10 million words), and not all unseen words are proper names
Treat them as rare words that occur once in the corpus – Baayen and Sproat '96, Dermatas and Kokkinakis '95
Known Hindi corpus of 25 K words and an unseen corpus of 6 K words
All words vs. hapax vs. unknown words
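A sketch of the rare-word idea (illustrative only; tagged_corpus is a hypothetical list of (word, tag) pairs): the tag distribution of unknown words is approximated by that of hapax legomena, words seen exactly once.

from collections import Counter

def hapax_tag_distribution(tagged_corpus):
    word_counts = Counter(w for w, _ in tagged_corpus)
    # Tags of words that occur exactly once in the corpus.
    hapax_tags = Counter(t for w, t in tagged_corpus if word_counts[w] == 1)
    total = sum(hapax_tags.values())
    return {t: c / total for t, c in hapax_tags.items()}

This is the kind of distribution the next slide compares against those of all words and of unknown words.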
Tag distribution analysis
[Chart: tag probability distribution (y-axis: probability, 0 to 0.4; x-axis: tags) for all words, hapax words (count = 1), hapax words (count < 3) and unknown words.]
3. Features
Can we use other features? Capitalization, word endings and hyphenation
Weischedel '93 reports about a 66% reduction in error rate with word endings and hyphenation
Capitalization, though useful for proper nouns, is not very effective
Other candidates: string length, prefix & suffix of fixed character width, character encoding range
A complete analysis remains to be done; these features are expected to be very effective for morphologically rich languages and are to be experimented with for Tamil
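A sketch of the surface features listed above (string length, fixed-width prefix and suffix, character encoding range, hyphenation); the feature names and the width of 3 are illustrative choices, not from the talk.

def word_features(word, prefix_len=3, suffix_len=3):
    return {
        "length": len(word),
        "prefix": word[:prefix_len],        # fixed character width
        "suffix": word[-suffix_len:],
        "has_hyphen": "-" in word,
        # Character encoding range: the Devanagari block is U+0900 to U+097F.
        "is_devanagari": all("\u0900" <= ch <= "\u097f" for ch in word),
    }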
4. Multi-part words
Examples: In/ terms/ of/; United/ States/ of/ America/
More problematic in Hindi:
United/NNPC States/NNPC of/NNPC America/NNP
Central/NNC government/NN
(NNPC – compound proper noun, NNP – proper noun, NNC – compound noun, NN – noun)
How does the system identify the last word in a multi-part word? 10% of errors are due to this in Hindi (6 K words tested)
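One way to read the convention above is that every non-final part of a multi-part word carries a compound tag (NNPC, NNC) and the final part carries the plain tag (NNP, NN); the following sketch (illustrative only) groups words back into multi-part units on that assumption.

COMPOUND_TAGS = {"NNPC", "NNC"}    # tags of non-final parts of a multi-part word

def group_multipart(tagged_words):
    groups, current = [], []
    for word, tag in tagged_words:
        current.append(word)
        if tag not in COMPOUND_TAGS:   # a plain tag closes the (possibly one-word) unit
            groups.append(" ".join(current))
            current = []
    if current:                        # compound left open at the end of the input
        groups.append(" ".join(current))
    return groups

# group_multipart([("United", "NNPC"), ("States", "NNPC"), ("of", "NNPC"),
#                  ("America", "NNP"), ("Central", "NNC"), ("government", "NN")])
# -> ["United States of America", "Central government"]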
Results
Evaluation metrics
Tag precision
Unseen word accuracy: % of unseen words that are correctly tagged; estimates how well unseen words are handled
% reduction in error: reduction in error after the application of a particular feature
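The three metrics as a minimal sketch (gold and predicted are parallel tag lists, words the corresponding tokens, and seen_vocab the training vocabulary; the names are illustrative):

def tag_precision(gold, predicted):
    # % of tokens assigned the correct tag.
    return 100.0 * sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def unseen_word_accuracy(words, gold, predicted, seen_vocab):
    # % of unseen words (absent from the training vocabulary) that are correctly tagged.
    pairs = [(g, p) for w, g, p in zip(words, gold, predicted) if w not in seen_vocab]
    return 100.0 * sum(g == p for g, p in pairs) / len(pairs) if pairs else 0.0

def error_reduction(error_before, error_after):
    # % reduction in error after the application of a particular feature.
    return 100.0 * (error_before - error_after) / error_before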
Results - Tagger
No structural tags; better smoothing. Unseen data – significantly more unknowns
                       Dev     S-1     S-2     S-3     S-4     Test
# words                8511    6388    6397    6548    5847    5000
Correctly tagged       6749    5538    5504    5558    5060    3961
Precision (%)          79.29   86.69   86.04   86.06   86.54   79.22
# Unseen               1543    660     648     589     603     1012
Correctly tagged       672     354     323     265     312     421
Unseen precision (%)   43.55   53.63   49.84   44.99   51.74   41.6
Results – Chunk tagger
Training data 22 K words, development data 8 K (4-fold cross-validation), test data 5 K

             POS tagging    Chunk identification    Chunk labelling
             precision      Pre      Rec             Pre      Rec
Dev data     76.16          69.54    69.05           66.73    66.27
Average      85.02          72.26    73.52           70.01    71.35
Test data    76.49          58.72    61.28           54.36    56.73
Results – Tagging error analysis
Significant issues with nouns and multi-part words: NNP mis-tagged as NN, NNC mis-tagged as NN
Also VAUX/VFM and NVB/NN confusions, in both directions
HMM performance (English)
> 96% accuracies reported; about 85% for unknown words
Advantage: simple, and most suitable when annotated data is available
Conclusion
Future work
Handling unseen words: smoothing
Can we exploit other features, especially morphological ones?
Multi-part words
Summary
Statistical approaches now include linguistic features for higher accuracies
Improvement required:
Tagging – precision 79.22%, unknown words 41.6%
Chunking – precision 60%, recall 62%