
Page 1: Persian Part Of Speech Tagging

1

Persian Part Of Speech Tagging

Mostafa Keikha

Database Research Group (DBRG)

ECE Department, University of Tehran

Page 2: Persian Part Of Speech Tagging

2

Decision Trees

Decision Tree (DT): a tree in which the root and each internal node are labeled with a question.
The arcs represent the possible answers to the associated question.
Each leaf node represents a prediction of a solution to the problem.
Decision trees are a popular technique for classification; the leaf node indicates the class to which the corresponding tuple belongs.

Page 3: Persian Part Of Speech Tagging

3

Decision Tree Example

Page 4: Persian Part Of Speech Tagging

4

Decision Trees

A Decision Tree model is a computational model consisting of three parts: a decision tree, an algorithm to create the tree, and an algorithm that applies the tree to data.
Creation of the tree is the most difficult part.
Applying the tree to data is basically a search, similar to the search in a binary search tree (although a DT need not be binary).
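Below is a minimal sketch of the "apply the tree to data" part, assuming a simple question/answer node structure; the names are illustrative and not taken from the slides.

    # Each internal node holds a question and one branch per possible answer;
    # each leaf holds a class label, as in the definition on the earlier slide.
    class Node:
        def __init__(self, question=None, branches=None, label=None):
            self.question = question          # function: item -> answer
            self.branches = branches or {}    # answer -> child Node
            self.label = label                # set only on leaf nodes

    def classify(node, item):
        """Walk from the root to a leaf, following the arc for each answer."""
        while node.label is None:
            node = node.branches[node.question(item)]
        return node.label

    # Tiny example: tag a token as NUM if it is all digits, otherwise WORD.
    root = Node(question=lambda tok: tok.isdigit(),
                branches={True: Node(label="NUM"), False: Node(label="WORD")})
    print(classify(root, "1383"))   # NUM
    print(classify(root, "ketab"))  # WORD

The walk is the binary-search-like processing mentioned above, except that a node may have more than two branches.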

Page 5: Persian Part Of Speech Tagging

5

Decision Tree Algorithm

Page 6: Persian Part Of Speech Tagging

6

Using DT in POS Tagging

Compute ambiguity classes:
Each term may have different tags.
The ambiguity class of a term is the set of all tags it can take.
For each ambiguity class, count the number of occurrences of each tag.

Ambiguity class     # of occurrences
a b c d             10  20  25  40
b c d               40  39  50
b d                 60  55

Page 7: Persian Part Of Speech Tagging

7

Using DT in POS Tagging

Create a decision tree over the ambiguity classes.
At each level, delete the tag with the minimum number of occurrences:

a b c d   (10 20 25 40)
→ b c d   (40 39 50)
→ b d     (60 55)
→ b
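A minimal sketch of this construction, assuming the ambiguity classes and their per-tag counts from the previous slide are available as a dictionary (names are illustrative):

    # Build the decision path for an ambiguity class: at each level look up the
    # per-tag occurrence counts of the current tag set and drop the tag with the
    # fewest occurrences, exactly as in the chain above.
    def decision_path(tag_set, class_counts):
        """class_counts: frozenset of tags -> {tag: number of occurrences}."""
        path = [sorted(tag_set)]
        current = frozenset(tag_set)
        while len(current) > 1 and current in class_counts:
            counts = class_counts[current]
            weakest = min(counts, key=counts.get)   # tag with minimum occurrence
            current = current - {weakest}
            path.append(sorted(current))
        return path

    # The ambiguity classes from the previous slide.
    class_counts = {
        frozenset("abcd"): {"a": 10, "b": 20, "c": 25, "d": 40},
        frozenset("bcd"):  {"b": 40, "c": 39, "d": 50},
        frozenset("bd"):   {"b": 60, "d": 55},
    }
    print(decision_path("abcd", class_counts))
    # [['a', 'b', 'c', 'd'], ['b', 'c', 'd'], ['b', 'd'], ['b']]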

Page 8: Persian Part Of Speech Tagging

8

Using DT in POS Tagging

Advantages: easy to understand, easy to implement.
Disadvantage: context independent (the choice of tag ignores the surrounding words).

Page 9: Persian Part Of Speech Tagging

9

Using DT in POS Tagging

Known Tokens Results

Run       Known (%)   Tokens     Correct    Accuracy
1         97.97       393923     363764     92.34%
2         98.06       355630     328965     92.50%
3         97.96       397528     367789     92.51%
4         97.92       410561     381578     92.94%
5         97.97       403079     372305     92.36%
Average   97.976      392144.2   362880.2   92.474%

Page 10: Persian Part Of Speech Tagging

11

POS tagging using HMMs

Let W be a sequence of words: W = w1, w2, …, wn
Let T be the corresponding tag sequence: T = t1, t2, …, tn
Task: find the T that maximizes P(T | W)
T' = argmax_T P(T | W)

Page 11: Persian Part Of Speech Tagging

12

POS tagging using HMMs

By Bayes' rule,
P(T | W) = P(W | T) * P(T) / P(W)
T' = argmax_T P(W | T) * P(T)

Transition probability:
P(T) = P(t1) * P(t2 | t1) * P(t3 | t1 t2) * … * P(tn | t1 … tn-1)

Applying the trigram approximation:
P(T) = P(t1) * P(t2 | t1) * P(t3 | t1 t2) * … * P(tn | tn-2 tn-1)

Introducing a dummy tag, $, to represent the beginning of a sentence:
P(T) = P(t1 | $) * P(t2 | $ t1) * P(t3 | t1 t2) * … * P(tn | tn-2 tn-1)
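A short sketch of this transition score, assuming a helper prob(t, context) that returns the estimated P(t | context) from corpus counts (the helper is an assumption, not from the slides):

    # Score a tag sequence with the trigram transition model above:
    # P(T) = P(t1 | $) * P(t2 | $ t1) * P(t3 | t1 t2) * ... * P(tn | tn-2 tn-1)
    def transition_score(tags, prob):
        score = prob(tags[0], ("$",))                           # P(t1 | $)
        if len(tags) > 1:
            score *= prob(tags[1], ("$", tags[0]))              # P(t2 | $ t1)
        for i in range(2, len(tags)):
            score *= prob(tags[i], (tags[i - 2], tags[i - 1]))  # P(ti | ti-2 ti-1)
        return score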

Page 12: Persian Part Of Speech Tagging

13

POS tagging using HMMs

Smoothing Transition Probabilities

Sparse data problem

Linear interpolation method

P'(ti | ti-2 ti-1) = λ1 P(ti) + λ2 P(ti | ti-1) + λ3 P(ti | ti-2 ti-1)

such that λ1 + λ2 + λ3 = 1
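A one-line sketch of this interpolation, assuming maximum-likelihood unigram, bigram, and trigram estimates are available as functions p1, p2, p3 (assumed helpers):

    # P'(ti | ti-2 ti-1) = l1*P(ti) + l2*P(ti | ti-1) + l3*P(ti | ti-2 ti-1)
    def smoothed_transition(t, u, v, p1, p2, p3, lambdas):
        """t: current tag; u, v: the two preceding tags; lambdas sum to 1."""
        l1, l2, l3 = lambdas
        return l1 * p1(t) + l2 * p2(t, v) + l3 * p3(t, u, v)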

Page 13: Persian Part Of Speech Tagging

14

POS tagging using HMMs

Calculation of λs
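A common way to set the λs for this kind of interpolation is the deleted-interpolation procedure from Brants' TnT tagger (Brants, 2000); the sketch below follows that procedure and is not necessarily the exact method shown on this slide. c1, c2, c3 are assumed unigram/bigram/trigram count functions and N is the number of training tokens.

    # Deleted interpolation: for every trigram (u, v, t) in the training data,
    # compare three held-out ratios and credit the trigram's count to the lambda
    # whose ratio is largest, then normalize the three lambdas to sum to 1.
    def deleted_interpolation(trigrams, c1, c2, c3, N):
        lambdas = [0.0, 0.0, 0.0]
        for (u, v, t) in trigrams:
            f3 = (c3(u, v, t) - 1) / (c2(u, v) - 1) if c2(u, v) > 1 else 0.0
            f2 = (c2(v, t) - 1) / (c1(v) - 1) if c1(v) > 1 else 0.0
            f1 = (c1(t) - 1) / (N - 1)
            best = max(range(3), key=lambda i: (f1, f2, f3)[i])
            lambdas[best] += c3(u, v, t)
        total = sum(lambdas)
        return [l / total for l in lambdas]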

Page 14: Persian Part Of Speech Tagging

15

POS tagging using HMMs

Emission probability:

P(W | T) ≈ P(w1 | t1) * P(w2 | t2) * … * P(wn | tn)

Context dependency:

To make the model more dependent on the context, the emission probability is calculated as:

P(W | T) ≈ P(w1 | $ t1) * P(w2 | t1 t2) * … * P(wn | tn-1 tn)

Page 15: Persian Part Of Speech Tagging

16

POS tagging using HMMs

A smoothing technique is applied:

P'(wi | ti-1 ti) = θ1 P(wi | ti) + θ2 P(wi | ti-1 ti), where θ1 + θ2 = 1.

The θs are different for different words.
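A sketch of this smoothed emission, assuming the two estimates P(wi | ti) and P(wi | ti-1 ti) are available as functions and the per-word θ pairs as a dictionary (all assumed helpers):

    # P'(wi | ti-1 ti) = theta1 * P(wi | ti) + theta2 * P(wi | ti-1 ti)
    def smoothed_emission(w, prev_tag, tag, p_tag, p_ctx, thetas):
        """thetas: word -> (theta1, theta2), with theta1 + theta2 = 1."""
        th1, th2 = thetas.get(w, (0.5, 0.5))   # illustrative default for unseen words
        return th1 * p_tag(w, tag) + th2 * p_ctx(w, prev_tag, tag)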

Page 16: Persian Part Of Speech Tagging

17

POS tagging using HMMs


Page 17: Persian Part Of Speech Tagging

18

POS tagging using HMMs

Page 18: Persian Part Of Speech Tagging

19

POS tagging using HMMs

Page 19: Persian Part Of Speech Tagging

20

POS tagging using HMMs

Lexicon generation probability

Page 20: Persian Part Of Speech Tagging

21

POS tagging using HMMs

Page 21: Persian Part Of Speech Tagging

22

POS tagging using HMMs

P(N V ART N | flies like a flower) = 4.37 × 10^-6
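A value like the one above is the score of a single candidate tag sequence under the model of the previous slides. A brute-force sketch of the tagging task T' = argmax_T P(W | T) * P(T), using assumed wrappers around the transition and emission sketches above (a real tagger would use Viterbi decoding rather than enumeration):

    from itertools import product

    # Enumerate every possible tag sequence for the sentence, score it with the
    # transition model P(T) and the context-dependent emission model P(W | T),
    # and keep the best. Only practical for very short sentences.
    def best_tag_sequence(words, tagset, transition_score, emission_prob):
        best, best_score = None, 0.0
        for tags in product(tagset, repeat=len(words)):
            score = transition_score(tags)                 # P(T)
            prev = "$"
            for w, t in zip(words, tags):
                score *= emission_prob(w, prev, t)         # P(wi | ti-1 ti)
                prev = t
            if score > best_score:
                best, best_score = tags, score
        return best, best_score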

Page 22: Persian Part Of Speech Tagging

23

POS tagging using HMMs

Known Tokens Results

Run       Known (%)   Tokens     Correct    Accuracy
1         98.07       394290     382211     96.94%
2         98.16       345913     345913     97.18%
3         98.04       397849     343894     96.96%
4         98.02       410970     398487     96.96%
5         98.07       403460     391475     97.03%
Average   98.072      390496.4   372396     97.01%

Page 23: Persian Part Of Speech Tagging

24

Unknown Tokens Results

Run       Unknown (%)   Tokens    Correct   Accuracy
1         1.93          7760      5829      75.12%
2         1.84          6689      5357      80.09%
3         1.96          7956      6153      77.34%
4         1.98          8283      6435      77.69%
5         1.93          7945      6246      78.62%
Average   1.928         7726.6    6004      77.77%

Page 24: Persian Part Of Speech Tagging

25

Overall Results

Run       Tokens     Correct    Accuracy
1         402050     388040     96.52%
2         362658     351270     96.86%
3         405805     391890     96.57%
4         419253     404922     96.58%
5         411405     397721     96.67%
Average   400234.2   386768.6   96.64%