Hindi Parts-of-Speech Tagging & Chunking
Baskaran S
MSRI
What's in?
Why POS tagging & chunking?
Approach
Challenges: unseen tag sequences, unknown words
Results
Future work
Conclusion
Intro & Motivation
POS
Parts-of-speech: Dionysius Thrax (ca. 100 BC) identified 8 types – noun, verb, pronoun, preposition, adverb, conjunction, participle and article
"I get my thing in action. (Verb, that's what's happenin') To work, (Verb!) To play, (Verb!) To live, (Verb!) To love... (Verb!...)"
- Schoolhouse Rock
Tagging
Assigning the appropriate POS or lexical class marker to words in a given text
Symbols, punctuation markers etc. are also assigned specific tag(s)
Why POS tagging?
Gives significant information about a word and its neighbours: adjectives occur near nouns, adverbs near verbs
Gives a clue to how a word is pronounced: OBject as a noun vs. obJECT as a verb
Useful for speech synthesis, full parsing of sentences, IR, word sense disambiguation etc.
Chunking
Identifying simple phrases: noun phrase, verb phrase, adjectival phrase…
Useful as a first step to parsing and to named entity recognition
POS tagging & Chunking
Stochastic approaches
Motivated by the availability of tagged corpora in large quantity; most approaches are based on HMMs
Weischedel '93; DeRose '88; Skut and Brants '98 – extending HMM to chunking; Zhou and Su '00; and lots more…
HMM
Bayes' rule: P(T|W) = P(W|T) P(T) / P(W)

Best tag sequence: \hat{T} = \arg\max_T P(T|W)

Trigram approximation:
P(T|W) \approx P(t_1)\, P(t_2|t_1) \prod_{i=3}^{n} P(t_i \mid t_{i-1}, t_{i-2}) \prod_{i=1}^{n} P(w_i \mid t_i)

The tag-sequence probability (the first three factors) and the word-emit probability (the final product) are estimated from an annotated corpus.

Assumptions: the probability of a word depends only on its tag; the tag history is approximated by the most recent two tags.
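To make the model concrete, the following is a minimal Python sketch (not from the talk) that scores a candidate tag sequence under exactly these assumptions; tag_unigram_p, tag_bigram_p, tag_trigram_p and emit_p are hypothetical probability tables estimated from an annotated corpus.

import math

# Illustrative sketch only: log of
# P(t1) P(t2|t1) * prod_{i>=3} P(ti | t_{i-2}, t_{i-1}) * prod_i P(wi | ti)
def sequence_log_prob(words, tags, tag_unigram_p, tag_bigram_p, tag_trigram_p, emit_p):
    logp = math.log(tag_unigram_p[tags[0]])                   # P(t1)
    if len(tags) > 1:
        logp += math.log(tag_bigram_p[(tags[0], tags[1])])    # P(t2|t1)
    for i in range(2, len(tags)):                             # two-tag history
        logp += math.log(tag_trigram_p[(tags[i - 2], tags[i - 1], tags[i])])
    for w, t in zip(words, tags):
        logp += math.log(emit_p[(t, w)])                      # word depends only on its tag
    return logp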
Structural tags
A triple – POS tag, structural relation & chunk tag
Originally proposed by Skut & Brants '98; seven relations
Enables embedded and overlapping chunks
Structural relations
[Figure: an example Hindi sentence annotated with chunk labels (NP, VG, SSF), Beg/End markers and structural-relation codes (00, 90, 09, 99).]
Decoding
Viterbi is mostly used (also A* or stack decoding)
Aims at finding the best path (tag sequence) given the observation sequence
Possible tags are identified for each transition, with associated probabilities; the best path is the one that maximizes the product of these transition probabilities
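As an illustration only, here is a minimal Viterbi sketch, simplified to a bigram tag history for brevity (the model above uses the two previous tags); start_p, trans_p and emit_p are hypothetical probability tables, and log probabilities are used to avoid underflow.

import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    # Column 0: start probability times the emission of the first word.
    V = [{t: (math.log(start_p[t]) + math.log(emit_p[(t, words[0])]), [t]) for t in tags}]
    for w in words[1:]:
        col = {}
        for t in tags:
            # Pick the predecessor that maximizes path score plus transition score.
            prev, (prev_score, prev_path) = max(
                V[-1].items(),
                key=lambda kv: kv[1][0] + math.log(trans_p[(kv[0], t)]))
            score = prev_score + math.log(trans_p[(prev, t)]) + math.log(emit_p[(t, w)])
            col[t] = (score, prev_path + [t])
        V.append(col)
    # The best complete path is the highest-scoring entry in the last column.
    return max(V[-1].values(), key=lambda sp: sp[0])[1]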
[Figure, shown over three slides: Viterbi lattice for an example Hindi sentence, with the candidate tags JJ, NLOC, NN, PREP, PRP, QFN, RB, VFM and SYM considered for each word.]
Issues
1. Unseen tag sequences
Smoothing (Add-One, Good-Turing) and/or backoff (deleted interpolation)
The idea is to redistribute some fractional probability mass from seen occurrences to unseen ones
Good-Turing re-estimates the probability mass of lower-count N-grams from that of higher counts:

c^* = (c + 1)\, \frac{N_{c+1}}{N_c}

where N_c is the number of N-grams occurring c times.
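A minimal sketch of this re-estimation (illustrative only; ngram_counts is a hypothetical dictionary of raw counts):

from collections import Counter

def good_turing(ngram_counts):
    # N[c] = number of distinct N-grams occurring exactly c times.
    N = Counter(ngram_counts.values())
    adjusted = {}
    for ngram, c in ngram_counts.items():
        # c* = (c + 1) * N_{c+1} / N_c; keep the raw count if no higher-count evidence exists.
        adjusted[ngram] = (c + 1) * N[c + 1] / N[c] if N[c + 1] else c
    return adjusted

The probability mass left over for unseen events then corresponds to N_1 divided by the total number of observations.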
2. Unseen words
The corpus is insufficient (even after 10 million words), and not all unseen words are proper names
Treat them as rare words that occur once in the corpus – Baayen and Sproat '96, Dermatas and Kokkinakis '95
Known Hindi corpus of 25 K words and an unseen corpus of 6 K words
All words vs. hapax vs. unknown words
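A sketch of the rare-word idea (illustrative only; tagged_corpus is a hypothetical list of (word, tag) pairs): the tag distribution of unknown words is approximated by that of hapax legomena, words seen exactly once.

from collections import Counter

def hapax_tag_distribution(tagged_corpus):
    word_counts = Counter(w for w, _ in tagged_corpus)
    # Tags of words that occur exactly once in the corpus.
    hapax_tags = Counter(t for w, t in tagged_corpus if word_counts[w] == 1)
    total = sum(hapax_tags.values())
    return {t: c / total for t, c in hapax_tags.items()}

This is the kind of distribution the next slide compares against those of all words and of unknown words.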
Tag distribution analysis
[Chart: tag probability distribution (y-axis: probability, 0 to 0.4; x-axis: tags) for all words, hapax words (count = 1), hapax words (count < 3) and unknown words.]
3. Features
Can we use other features? Capitalization, word endings and hyphenation
Weischedel '93 reports about a 66% reduction in error rate with word endings and hyphenation
Capitalization, though useful for proper nouns, is not very effective
Other candidates: string length, prefix & suffix of fixed character width, character encoding range
A complete analysis remains to be done; these features are expected to be very effective for morphologically rich languages and are to be experimented with for Tamil
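A sketch of the surface features listed above (string length, fixed-width prefix and suffix, character encoding range, hyphenation); the feature names and the width of 3 are illustrative choices, not from the talk.

def word_features(word, prefix_len=3, suffix_len=3):
    return {
        "length": len(word),
        "prefix": word[:prefix_len],        # fixed character width
        "suffix": word[-suffix_len:],
        "has_hyphen": "-" in word,
        # Character encoding range: the Devanagari block is U+0900 to U+097F.
        "is_devanagari": all("\u0900" <= ch <= "\u097f" for ch in word),
    }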
4. Multi-part words
Examples: In/ terms/ of/; United/ States/ of/ America/
More problematic in Hindi:
United/NNPC States/NNPC of/NNPC America/NNP
Central/NNC government/NN
(NNPC – compound proper noun, NNP – proper noun, NNC – compound noun, NN – noun)
How does the system identify the last word in a multi-part word? 10% of errors are due to this in Hindi (6 K words tested)
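One way to read the convention above is that every non-final part of a multi-part word carries a compound tag (NNPC, NNC) and the final part carries the plain tag (NNP, NN); the following sketch (illustrative only) groups words back into multi-part units on that assumption.

COMPOUND_TAGS = {"NNPC", "NNC"}    # tags of non-final parts of a multi-part word

def group_multipart(tagged_words):
    groups, current = [], []
    for word, tag in tagged_words:
        current.append(word)
        if tag not in COMPOUND_TAGS:   # a plain tag closes the (possibly one-word) unit
            groups.append(" ".join(current))
            current = []
    if current:                        # compound left open at the end of the input
        groups.append(" ".join(current))
    return groups

# group_multipart([("United", "NNPC"), ("States", "NNPC"), ("of", "NNPC"),
#                  ("America", "NNP"), ("Central", "NNC"), ("government", "NN")])
# -> ["United States of America", "Central government"]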
Results
Evaluation metrics
Tag precision
Unseen word accuracy: % of unseen words that are correctly tagged; estimates how well unseen words are handled
% reduction in error: reduction in error after the application of a particular feature
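The three metrics as a minimal sketch (gold and predicted are parallel tag lists, words the corresponding tokens, and seen_vocab the training vocabulary; the names are illustrative):

def tag_precision(gold, predicted):
    # % of tokens assigned the correct tag.
    return 100.0 * sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def unseen_word_accuracy(words, gold, predicted, seen_vocab):
    # % of unseen words (absent from the training vocabulary) that are correctly tagged.
    pairs = [(g, p) for w, g, p in zip(words, gold, predicted) if w not in seen_vocab]
    return 100.0 * sum(g == p for g, p in pairs) / len(pairs) if pairs else 0.0

def error_reduction(error_before, error_after):
    # % reduction in error after the application of a particular feature.
    return 100.0 * (error_before - error_after) / error_before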
Results - Tagger
No structural tags; better smoothing. Unseen data – significantly more unknowns
                       Dev     S-1     S-2     S-3     S-4     Test
# words                8511    6388    6397    6548    5847    5000
Correctly tagged       6749    5538    5504    5558    5060    3961
Precision (%)          79.29   86.69   86.04   86.06   86.54   79.22
# Unseen               1543    660     648     589     603     1012
Correctly tagged       672     354     323     265     312     421
Unseen precision (%)   43.55   53.63   49.84   44.99   51.74   41.6
Results – Chunk tagger
Training data 22 K words, development data 8 K (4-fold cross-validation), test data 5 K

             POS tagging    Chunk identification    Chunk labelling
             precision      Pre      Rec             Pre      Rec
Dev data     76.16          69.54    69.05           66.73    66.27
Average      85.02          72.26    73.52           70.01    71.35
Test data    76.49          58.72    61.28           54.36    56.73
Results – Tagging error analysis
Significant issues with nouns and multi-part words: NNP mis-tagged as NN, NNC mis-tagged as NN
Also VAUX/VFM and NVB/NN confusions, in both directions
HMM performance (English)
> 96% accuracies reported; about 85% for unknown words
Advantage: simple, and most suitable when annotated data is available
Conclusion
Future work
Handling unseen words: smoothing
Can we exploit other features, especially morphological ones?
Multi-part words
Summary
Statistical approaches now include linguistic features for higher accuracies
Improvement required:
Tagging – precision 79.22%, unknown words 41.6%
Chunking – precision 60%, recall 62%