Page 1: Hindi Parts-of-Speech Tagging & Chunking

Hindi Parts-of-Speech Tagging & Chunking

Baskaran S

MSRI

Page 2: Hindi Parts-of-Speech Tagging & Chunking

4 July 2006 NWAI 2

What's in?

Why POS tagging & chunking?
Approach
Challenges
  Unseen tag sequences
  Unknown words
Results
Future work
Conclusion

Page 3: Hindi Parts-of-Speech Tagging & Chunking

Intro & Motivation

Page 4: Hindi Parts-of-Speech Tagging & Chunking

POS

Parts-of-Speech
Dionysius Thrax (ca 100 BC)
8 types: noun, verb, pronoun, preposition, adverb, conjunction, participle and article

I get my thing in action. (Verb, that's what's happenin')
To work, (Verb!)
To play, (Verb!)
To live, (Verb!)
To love... (Verb!...)

- Schoolhouse Rock

Page 5: Hindi Parts-of-Speech Tagging & Chunking

Tagging

Assigning the appropriate POS or lexical class marker to words in a given text

Symbols, punctuation markers etc. are also assigned specific tag(s)

Page 6: Hindi Parts-of-Speech Tagging & Chunking

Why POS tagging?

Gives significant information about a word and its neighbours
  Adjective near noun
  Adverb near verb

Gives a clue to how a word is pronounced
  OBject as noun
  obJECT as verb

Useful for speech synthesis, full parsing of sentences, IR, word sense disambiguation etc.

Page 7: Hindi Parts-of-Speech Tagging & Chunking

Chunking

Identifying simple phrases
  Noun phrase, verb phrase, adjectival phrase…

Useful as a first step to
  Parsing
  Named entity recognition

Page 8: Hindi Parts-of-Speech Tagging & Chunking

POS tagging & Chunking

Page 9: Hindi Parts-of-Speech Tagging & Chunking

Stochastic approaches

Rely on the availability of tagged corpora in large quantity
Most are based on HMM
  Weischedel ’93
  DeRose ’88
  Skut and Brants ’98 (extending HMM to chunking)
  Zhou and Su ’00
  and lots more…

Page 10: Hindi Parts-of-Speech Tagging & Chunking

HMM

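$$P(T \mid W) = \frac{P(T)\,P(W \mid T)}{P(W)}$$

Dropping $P(W)$, which is the same for every candidate $T$, and applying the assumptions below:

$$P(T \mid W) \propto P(t_1)\,P(t_2 \mid t_1)\prod_{i=3}^{n} P(t_i \mid t_{i-2}, t_{i-1}) \cdot \prod_{i=1}^{n} P(w_i \mid t_i)$$

The factors over tags form the tag-sequence probability; the product over words is the word-emit probability. Both are estimated from an annotated corpus.

$$\hat{T} = \operatorname*{argmax}_{T} P(T \mid W)$$

Assumptions
  Probability of a word is dependent only on its tag
  Tag history is approximated by the most recent two tags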
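As a rough illustration of how the two factors could be estimated from an annotated corpus, here is a minimal Python sketch using plain relative-frequency counts. The function name `estimate_hmm`, the `<s>` padding symbol and the romanized toy corpus are assumptions made for this example and do not come from the slides. Padding each sentence with two start symbols is a common shortcut that folds P(t1) and P(t2/t1) into the trigram table.

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of trigram tag and word-emission probabilities."""
    tag_counts = Counter()       # count(t)
    emit_counts = Counter()      # count(w, t)
    context_counts = Counter()   # count(t_{i-2}, t_{i-1})
    trigram_counts = Counter()   # count(t_{i-2}, t_{i-1}, t_i)

    for sentence in tagged_sentences:
        # Two start symbols so that the first two tags also have a trigram context.
        tags = ["<s>", "<s>"] + [tag for _, tag in sentence]
        for word, tag in sentence:
            tag_counts[tag] += 1
            emit_counts[(word, tag)] += 1
        for i in range(2, len(tags)):
            context_counts[(tags[i - 2], tags[i - 1])] += 1
            trigram_counts[(tags[i - 2], tags[i - 1], tags[i])] += 1

    def p_trans(t2, t1, t):
        """P(t | t2, t1); 0.0 for a context never seen in training."""
        seen = context_counts[(t2, t1)]
        return trigram_counts[(t2, t1, t)] / seen if seen else 0.0

    def p_emit(word, tag):
        """P(word | tag); 0.0 for a tag never seen in training."""
        seen = tag_counts[tag]
        return emit_counts[(word, tag)] / seen if seen else 0.0

    return p_trans, p_emit

# Hypothetical three-sentence corpus, romanized for readability.
corpus = [
    [("raam", "NNP"), ("aayaa", "VFM"), ("।", "SYM")],
    [("siitaa", "NNP"), ("aayii", "VFM"), ("।", "SYM")],
    [("raam", "NNP"), ("ghar", "NN"), ("aayaa", "VFM"), ("।", "SYM")],
]
p_trans, p_emit = estimate_hmm(corpus)
print(p_trans("<s>", "NNP", "VFM"))  # share of sentence-initial NNP followed by VFM
print(p_emit("raam", "NNP"))         # share of NNP tokens that are "raam"
```

Unsmoothed estimates like these assign zero probability to any tag trigram or word that was not seen in training.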

Page 11: Hindi Parts-of-Speech Tagging & Chunking

Structural tags

A triple: POS tag, structural relation and chunk tag
Originally proposed by Skut & Brants ’98
Seven relations
Enables embedded and overlapping chunks

Page 12: Hindi Parts-of-Speech Tagging & Chunking

Structural relations

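[Figure: chunks of the example sentence labelled NP and VG, with two-digit structural relation codes (00, 90, 09, 99) and Beg, End and SSF markers; fragments shown include "परीक्षा में" and "श्रेणी प्राप्त"]

परीक्षा में भी प्रथम श्रेणी प्राप्त की और विद्यालय में कुलपति द्वारा विशेष पुरस्कार भी उन्हीं को प्राप्त हुआ ।
(Roughly: "He also obtained a first class in the examination, and the special prize from the vice-chancellor at the school also went to him.")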

Page 13: Hindi Parts-of-Speech Tagging & Chunking

Decoding

Viterbi mostly used (also A* or stack decoding)
Aims at finding the best path (tag sequence) given the observation sequence
Possible tags are identified for each transition, with associated probabilities
The best path is the one that maximizes the product of these transition probabilities
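The decoding step can be sketched in a few lines of Python. This is a simplified illustration: it uses a bigram HMM (one tag of history) rather than the two-tag history described in the slides, works in log space to avoid underflow, and all probabilities in the toy example are invented. The function name `viterbi` and the dictionary layout of its arguments are assumptions for this sketch.

```python
import math

def viterbi(words, tags, p_start, p_trans, p_emit):
    """Most probable tag sequence for `words` under a bigram HMM.

    p_start[t], p_trans[(prev, t)] and p_emit[(t, word)] hold probabilities;
    missing entries are treated as probability 0.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: log-probability of the best path that ends in tag t at position i.
    best = [{t: log(p_start.get(t, 0)) + log(p_emit.get((t, words[0]), 0)) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Pick the predecessor tag that gives the highest score for reaching t.
            prev_tag, score = max(
                ((q, best[i - 1][q] + log(p_trans.get((q, t), 0))) for q in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + log(p_emit.get((t, words[i]), 0))
            back[i][t] = prev_tag

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy model with invented numbers (romanized words, tags from the slides).
tags = ["QFN", "NN", "VFM"]
p_start = {"QFN": 0.6, "NN": 0.3, "VFM": 0.1}
p_trans = {("QFN", "NN"): 0.9, ("QFN", "VFM"): 0.1, ("NN", "NN"): 0.3,
           ("NN", "VFM"): 0.7, ("VFM", "NN"): 0.5, ("VFM", "VFM"): 0.5}
p_emit = {("QFN", "ek"): 1.0, ("NN", "ek"): 0.2, ("NN", "roop"): 0.8,
          ("VFM", "aayaa"): 1.0}
path = viterbi(["ek", "roop", "aayaa"], tags, p_start, p_trans, p_emit)
print(path)
```

Extending this to the trigram model would mean tracking pairs of tags as states instead of single tags.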

Page 14: Hindi Parts-of-Speech Tagging & Chunking

अब जीवन का एक अन्य रूप उनके सामने आया ।
(Roughly: "Now another side of life appeared before him.")

Candidate tags shown: JJ, NLOC, NN, PREP, PRP, QFN, RB, VFM, SYM

Page 17: Hindi Parts-of-Speech Tagging & Chunking

Issues

Page 18: Hindi Parts-of-Speech Tagging & Chunking

1. Unseen tag sequences

Smoothing (Add-One, Good-Turing) and/or Backoff (Deleted interpolation)
Idea is to distribute some fractional probability (of seen occurrences) to unseen ones

Good-Turing
  Re-estimates the probability mass of lower-count N-grams by that of higher counts

$$c^{*} = (c + 1)\,\frac{N_{c+1}}{N_c}$$

  $N_c$: number of N-grams occurring $c$ times
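A small Python sketch of the Good-Turing re-estimate may make the formula concrete. The function name `good_turing_counts`, the toy counts, and the fallback to the raw count when no N-gram occurs c + 1 times are assumptions of this example; practical implementations usually smooth the N_c values first.

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Return ({c: c*}, unseen_mass) with c* = (c + 1) * N_{c+1} / N_c."""
    freq_of_freq = Counter(ngram_counts.values())  # N_c: how many N-grams occur c times
    adjusted = {}
    for c, n_c in freq_of_freq.items():
        n_next = freq_of_freq.get(c + 1, 0)
        # Assumption for this sketch: keep the raw count when N_{c+1} is zero.
        adjusted[c] = (c + 1) * n_next / n_c if n_next else float(c)
    # Probability mass reserved for unseen N-grams: N_1 / N.
    unseen_mass = freq_of_freq.get(1, 0) / sum(ngram_counts.values())
    return adjusted, unseen_mass

# Invented tag-trigram counts: three seen once, two seen twice, one seen three times.
counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
adjusted, unseen_mass = good_turing_counts(counts)
print(adjusted)
print(unseen_mass)
```

Here the count of 1 is discounted to (1 + 1) * 2 / 3, and the mass freed up this way is what gets shared among unseen tag sequences.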

Page 19: Hindi Parts-of-Speech Tagging & Chunking

2. Unseen words

Insufficient corpus (even after 10 million words)
Not all of them are proper names
Treat them as rare words that occur once in the corpus - Baayen and Sproat ’96, Dermatas and Kokkinakis ’95
Known Hindi corpus of 25 K words and unseen corpus of 6 K words
All words vs. Hapax vs. Unknown
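The idea of treating unknown words like words seen only once can be illustrated with a short Python sketch that computes the tag distribution over hapax words; such a distribution could then stand in for the tag probabilities of an unknown word. The function name `hapax_tag_distribution`, the `max_count` parameter and the toy data are assumptions for illustration, not taken from the slides.

```python
from collections import Counter

def hapax_tag_distribution(tagged_words, max_count=1):
    """Tag distribution over tokens whose word form occurs at most `max_count` times."""
    word_counts = Counter(word for word, _ in tagged_words)
    tag_counts = Counter(
        tag for word, tag in tagged_words if word_counts[word] <= max_count
    )
    total = sum(tag_counts.values())
    return {tag: n / total for tag, n in tag_counts.items()} if total else {}

# Invented, romanized toy corpus as a flat list of (word, tag) pairs.
tagged = [
    ("ghar", "NN"), ("ghar", "NN"), ("raam", "NNP"), ("kitaab", "NN"),
    ("aayaa", "VFM"), ("aayaa", "VFM"), ("dillii", "NNP"), ("sundar", "JJ"),
]
dist = hapax_tag_distribution(tagged)
print(dist)
```

Raising `max_count` to 2 gives the looser "occurs fewer than three times" variant of the same estimate.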

Page 20: Hindi Parts-of-Speech Tagging & Chunking

Tag distribution analysis

[Chart: probability (0 to 0.4) of each tag for four word groups: all words, hapax (= 1), hapax (< 3) and unknown words]

Page 21: Hindi Parts-of-Speech Tagging & Chunking

3. Features

Can we use other features?
  Capitalization
  Word endings and hyphenations

Weischedel ’93 reports about 66% reduction in error rate with word endings and hyphenations
Capitalization, though useful for proper nouns, is not very effective

Page 22: Hindi Parts-of-Speech Tagging & Chunking

Contd…

String length
Prefix & suffix of fixed character width
Character encoding range
Complete analysis remains to be done
Expected to be very effective for morphologically rich languages
To be experimented with Tamil
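As one possible reading of these features, the following Python sketch extracts string length, fixed-width prefix and suffix, and a script check for a single word. The function name `affix_features`, the default width of 3, and the interpretation of "character encoding range" as membership in the Unicode Devanagari block (U+0900 to U+097F) are assumptions for this example.

```python
def affix_features(word, width=3):
    """Surface features of a word; slicing is per code point, not per syllable."""
    return {
        "length": len(word),
        "prefix": word[:width],
        "suffix": word[-width:],
        # True when every character lies in the Unicode Devanagari block.
        "is_devanagari": all("\u0900" <= ch <= "\u097F" for ch in word),
    }

hindi = affix_features("किताबें")
english = affix_features("running")
print(hindi)
print(english)
```

Because Devanagari vowel signs are separate code points, a fixed-width slice can cut through a syllable; a fuller treatment might slice on grapheme clusters instead.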

Page 23: Hindi Parts-of-Speech Tagging & Chunking

4. Multi-part words

Examples
  In/ terms/ of/
  United/ States/ of/ America/

More problematic in Hindi
  United/NNPC States/NNPC of/NNPC America/NNP
  Central/NNC government/NN

  NNPC: compound proper noun, NNP: proper noun
  NNC: compound noun, NN: noun

How does the system identify the last word in a multi-part word?
10% of errors are due to this in Hindi (6 K words tested)

Page 24: Hindi Parts-of-Speech Tagging & Chunking

Results

Page 25: Hindi Parts-of-Speech Tagging & Chunking

Evaluation metrics

Tag precision
Unseen word accuracy
  % of unseen words that are correctly tagged
  Estimates how well unseen words are handled
% reduction in error
  Reduction in error after the application of a particular feature
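These metrics are simple ratios, and a short Python sketch shows how they could be computed. The function names and the toy tag sequences are assumptions for illustration; the counts 3961 of 5000 and 421 of 1012 are the test-set figures reported in the tagger results.

```python
def percent(correct, total):
    """Share of correct items, as a percentage."""
    return 100.0 * correct / total

def unseen_word_accuracy(words, gold, predicted, known_vocabulary):
    """Percentage of words outside the training vocabulary that were tagged correctly."""
    pairs = [(g, p) for w, g, p in zip(words, gold, predicted)
             if w not in known_vocabulary]
    return percent(sum(g == p for g, p in pairs), len(pairs)) if pairs else 0.0

def error_reduction(baseline_precision, new_precision):
    """Relative drop in error rate (in %) when moving from baseline to new precision."""
    baseline_error = 100.0 - baseline_precision
    new_error = 100.0 - new_precision
    return 100.0 * (baseline_error - new_error) / baseline_error

print(round(percent(3961, 5000), 2))   # tag precision on the test set
print(round(percent(421, 1012), 2))    # unseen-word precision on the test set
print(error_reduction(80.0, 90.0))     # invented before/after precisions
```

Note that a jump from 80% to 90% precision halves the error, so the reduction in error is 50% even though precision rose by only ten points.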

Page 26: Hindi Parts-of-Speech Tagging & Chunking

Results - Tagger

No structural tags, so better smoothing
Unseen data: significantly more unknowns

                   Dev     S-1     S-2     S-3     S-4     Test
# words            8511    6388    6397    6548    5847    5000
Correctly tagged   6749    5538    5504    5558    5060    3961
Precision          79.29   86.69   86.04   86.06   86.54   79.22
# Unseen           1543    660     648     589     603     1012
Correctly tagged   672     354     323     265     312     421
Unseen precision   43.55   53.63   49.84   44.99   51.74   41.6

Page 27: Hindi Parts-of-Speech Tagging & Chunking

Results – Chunk tagger

Training 22 K, development data 8 K
4-fold cross-validation
Test data 5 K

            POS tagging   Chunk identification   Chunk labelling
            Precision     Pre       Rec          Pre       Rec
Dev data    76.16         69.54     69.05        66.73     66.27
Average     85.02         72.26     73.52        70.01     71.35
Test data   76.49         58.72     61.28        54.36     56.73

Page 28: Hindi Parts-of-Speech Tagging & Chunking

Results – Tagging error analysis

Significant issues with nouns/multi-part words
  NNP → NN
  NNC → NN
Also, VAUX → VFM; VFM → VAUX and NVB → NN; NN → NVB

Page 29: Hindi Parts-of-Speech Tagging & Chunking

HMM performance (English)

Reported accuracies above 96%
About 85% for unknown words
Advantage
  Simple and most suitable with the availability of annotated data

Page 30: Hindi Parts-of-Speech Tagging & Chunking

Conclusion

Page 31: Hindi Parts-of-Speech Tagging & Chunking

Future work

Handling unseen words
Smoothing
Can we exploit other features?
  Especially morphological ones
Multi-part words

Page 32: Hindi Parts-of-Speech Tagging & Chunking

Summary

Statistical approaches now include linguistic features for higher accuracies
Improvement required
  Tagging
    Precision: 79.22%
    Unknown words: 41.6%
  Chunking
    Precision: 60%
    Recall: 62%