임성신 (sslim@pusan.ac.kr) Speech and Language Processing Ch8. Word Classes and Part-of-Speech Tagging


Upload: ashley-perry

Post on 05-Jan-2016


TRANSCRIPT

Page 1: 임성신sslim@pusan.ac.kr Speech and Language Processing Ch8. WORD CLASSES AND PART-OF- SPEECH TAGGING

임성신

sslim@pusan.ac.kr

Speech and Language Processing

Ch8. WORD CLASSES AND PART-OF-SPEECH TAGGING

Page 2:

Artificial Intelligence Laboratory

Agenda

What are they?
Distribution
Tagsets
Tagging
  Rules
  Probabilities
  Transformation-Based (Brill)

Page 3:

Parts of Speech

Start with eight basic categories
  Noun, verb, pronoun, preposition, adjective, adverb, article, conjunction

These categories are based on morphological and distributional properties (not semantics)

Some cases are easy, others are murky

Page 4:

Parts of Speech

Two kinds of category
  Closed class
    • Prepositions, articles, conjunctions, pronouns
  Open class
    • Nouns, verbs, adjectives, adverbs

Page 5:

Fig 8.1 Prepositions (and particles) of English from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.

Page 6:

Fig 8.2 English single-word particles from Quirk et al. (1985).

Page 7:

Fig 8.3 Coordinating and subordinating conjunctions of English from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.

Page 8:

Fig 8.4 Pronouns of English from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.

Page 9:

Fig 8.5 English modal verbs from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.

Page 10:

Sets of Parts of Speech: Tagsets

There are various standard tagsets to choose from; some have a lot more tags than others

The choice of tagset is based on the application

Accurate tagging can be done with even large tagsets

Page 11:

Fig 8.6 Penn Treebank part-of-speech tags (including punctuation).

Page 12:

Tagging

Part-of-speech tagging is the process of assigning parts of speech to each word in a sentence… Assume we have
  A tagset
  A dictionary that gives you the possible set of tags for each entry
  A text to be tagged
  A reason?

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
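The word/TAG notation in the example sentence is easy to take apart programmatically. The following is a minimal sketch, assuming tokens are separated by whitespace and that the tag follows the last slash of each token; the function name `parse_tagged` is illustrative, not something from the slides.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so a word that itself contains '/' stays whole.
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


sentence = ("The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
            "number/NN of/IN other/JJ topics/NNS ./.")
tagged = parse_tagged(sentence)
words = [w for w, _ in tagged]
tags = [t for _, t in tagged]
print(tagged[:3])
```

Keeping words and tags as parallel lists is a convenient shape for the taggers discussed on the later slides, which map a word sequence to a tag sequence of the same length.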

Page 13:

Figure 8.7 The number of word types in Brown corpus by degree of ambiguity (after DeRose (1988)).
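A count of this kind can be sketched in a few lines. The example below assumes a tagged corpus given as (word, tag) pairs and uses a tiny made-up corpus, so the numbers it prints say nothing about the Brown corpus; `toy_corpus` and `ambiguity_histogram` are hypothetical names.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (hypothetical data, NOT the Brown corpus).
toy_corpus = [
    ("the", "DT"), ("race", "NN"), ("to", "TO"), ("race", "VB"),
    ("that", "DT"), ("that", "IN"), ("that", "WDT"), ("jury", "NN"),
]


def ambiguity_histogram(tagged_words):
    """Map 'number of distinct tags' to 'number of word types with that many tags'."""
    tags_by_type = defaultdict(set)
    for word, tag in tagged_words:
        tags_by_type[word.lower()].add(tag)
    return Counter(len(tags) for tags in tags_by_type.values())


hist = ambiguity_histogram(toy_corpus)
for degree in sorted(hist):
    print(f"{degree} tag(s): {hist[degree]} word type(s)")
```

Note that the count is over word types, not tokens: a word that occurs many times with the same two tags still contributes a single entry to the "2 tags" row.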

Page 14:

Tagging - Rules

Hand-crafted rules for ambiguous words that test the context to make appropriate choices
  Early attempts fairly error-prone
  Extremely labor-intensive

Page 15:

Figure 8.8 Sample lexical entries from the ENGTWOL lexicon described in Voutilainen (1995) and Heikkila (1995).

Page 16:

Tagging - Probabilities

Advantages
  Given a sufficiently large tagged corpus, the statistics needed for tagging are easy to extract, so the approach scales well, applies to a wide range of text, and reaches relatively high overall accuracy

Disadvantages
  Dependent on the corpus
  Extracting meaningful statistics requires a tagged corpus above a certain size
  Building the corpus takes a lot of time and effort
  When the corpus is skewed or too small, data sparseness lowers reliability

Page 17:

Tagging - Probabilities

We want the best set of tags for a sequence of words (a sentence)

$$\operatorname*{argmax}_{T} P(T \mid W) = \operatorname*{argmax}_{T} \frac{P(W \mid T)\,P(T)}{P(W)}$$

$$\operatorname*{argmax}_{T} P(T \mid W) = \operatorname*{argmax}_{T} P(W \mid T)\,P(T)$$

W is a sequence of words
T is a sequence of tags

The probability of the word sequence P(W) will be the same for each tag sequence

$$\operatorname*{argmax}_{T} \prod_{i=1}^{n} P(w_i \mid t_i) \cdot P(t_1) \cdot \prod_{i=2}^{n} P(t_i \mid t_{i-1})$$
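One common way to carry out this argmax without enumerating every tag sequence is the Viterbi algorithm. The slides do not name a search procedure, so the following is only a sketch under stated assumptions: a bigram model with an initial-tag distribution, tag-transition probabilities and word-given-tag probabilities stored in plain dictionaries, a non-empty sentence, and made-up toy probabilities. All names (`viterbi`, `p_init`, `p_trans`, `p_emit`) are hypothetical.

```python
import math


def viterbi(words, tags, p_init, p_trans, p_emit):
    """Return the tag sequence maximizing
    P(t_1) * prod_i P(t_i | t_{i-1}) * prod_i P(w_i | t_i).
    p_trans is keyed by (previous_tag, tag); p_emit by (word, tag).
    Missing entries count as probability 0. Assumes words is non-empty."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: best log-probability of any tag path that ends in t at position i
    best = [{t: log(p_init.get(t, 0)) + log(p_emit.get((words[0], t), 0))
             for t in tags}]
    back = []
    for w in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: best[-1][s] + log(p_trans.get((s, t), 0)))
            scores[t] = (best[-1][prev] + log(p_trans.get((prev, t), 0))
                         + log(p_emit.get((w, t), 0)))
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))


# Toy model with made-up probabilities, for illustration only.
tags = ["TO", "NN", "VB"]
p_init = {"TO": 0.6, "NN": 0.3, "VB": 0.1}
p_trans = {("TO", "VB"): 0.8, ("TO", "NN"): 0.1}
p_emit = {("to", "TO"): 1.0, ("race", "NN"): 0.0006, ("race", "VB"): 0.0001}

result = viterbi(["to", "race"], tags, p_init, p_trans, p_emit)
print(result)
```

In this toy model "race" is more likely as a noun in isolation, but the high transition probability from TO to VB makes the verb reading win in context. Working in log space avoids underflow when many small probabilities are multiplied.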

Page 18:

Tagging - Transformation-Based (Brill tagging)

Combine rules and statistics…
  TBL (Transformation-Based Learning) is based on rules
  Rules are automatically induced from the data (ML)

Page 19:

Brill tagging - Examples

Race
  “race” as NN: .98
  “race” as VB: .02

So you’ll be wrong 2% of the time, which really isn’t bad

Patch the cases where you know it has to be a verb
  Change NN to VB when previous tag is TO
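The two steps on this slide, tagging every word with its most likely tag and then patching known mistakes, can be sketched as follows. The lexicon is hypothetical apart from "race" being most often NN, which comes from the slide; unknown words default to NN as a simplifying assumption.

```python
# Hypothetical most-likely-tag lexicon; only "race" -> NN is taken from the slide.
MOST_LIKELY_TAG = {
    "the": "DT", "is": "VBZ", "expected": "VBN", "to": "TO", "race": "NN",
}


def baseline_tag(words):
    """Assign each word its most likely tag (NN for unknown words, an assumption)."""
    return [MOST_LIKELY_TAG.get(w.lower(), "NN") for w in words]


def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the previous tag is prev_tag."""
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


words = "is expected to race".split()
initial = baseline_tag(words)                    # "race" is tagged NN here
patched = apply_rule(initial, "NN", "VB", "TO")  # the rule from the slide
print(list(zip(words, patched)))
```

The rule leaves "the race" untouched, because the tag before "race" is DT there, so the noun reading survives wherever the context does not call for a verb.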

Page 20:

Brill tagging - Rules

Where did that transformational rule come from?
  Define a hypothesis space of rules that might help decrease an error rate
  Search that space (exhaustively?) to find rules that most reduce an error rate.
  Continue to add rules until some stopping criterion is reached

Figure 8.9 Brill’s (1995) templates. Each begins with “Change tag a to tag b when: …”. The variables a, b, z and w range over parts-of-speech.
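The search loop described above can be sketched for a single template, "Change tag a to tag b when the previous tag is z". This is a simplified illustration, not Brill's implementation: it assumes the corpus is one flat list of tags (ignoring sentence boundaries), searches the rule space exhaustively, and stops when no rule lowers the error count or after `max_rules` rules. The data and all names are hypothetical.

```python
from itertools import product


def apply_rule(tags, rule):
    """Apply (a, b, z): change a to b when the previous tag is z."""
    a, b, z = rule
    return [b if i > 0 and t == a and tags[i - 1] == z else t
            for i, t in enumerate(tags)]


def count_errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(current, gold, tagset, max_rules=10):
    """Greedily pick, one at a time, the rule that most reduces the error count."""
    current = list(current)
    learned = []
    for _ in range(max_rules):
        best_rule, best_errors = None, count_errors(current, gold)
        # Hypothesis space: every instantiation of the one template.
        for a, b, z in product(tagset, repeat=3):
            if a == b:
                continue
            errors = count_errors(apply_rule(current, (a, b, z)), gold)
            if errors < best_errors:
                best_rule, best_errors = (a, b, z), errors
        if best_rule is None:  # stopping criterion: no rule helps any more
            break
        current = apply_rule(current, best_rule)
        learned.append(best_rule)
    return learned, current


# Toy data for "to race the race to race" (made up for illustration).
gold = ["TO", "VB", "DT", "NN", "TO", "VB"]
initial = ["TO", "NN", "DT", "NN", "TO", "NN"]  # most-likely-tag baseline
rules, final = learn_rules(initial, gold, ["DT", "NN", "TO", "VB"])
print(rules)
```

On this toy data the loop recovers the rule from the earlier slide, change NN to VB when the previous tag is TO, and then stops because no further rule reduces the error count. A real learner would use many templates and a much larger corpus, where exhaustive search becomes expensive.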