6. N-GRAMs
Pusan National University AI Lab (Seong-ja Choi)
Word prediction
“I’d like to make a collect …” → call, telephone, or person-to-person
- Spelling error detection
- Augmentative communication
- Context-sensitive spelling error correction
Language Model
Language Model (LM): a statistical model of word sequences
n-gram: uses the previous n-1 words to predict the next word
Applications
Context-sensitive spelling error detection and correction:
– “He is trying to fine out.” (fine → find)
– “The design an construction will take a year.” (an → and)
Machine translation
Counting Words in Corpora
Corpora: on-line text collections
Which words to count:
– What we are going to count
– Where we are going to find the things to count
Brown Corpus
1 million words, 500 texts
Varied genres (newspaper, novels, non-fiction, academic, etc.)
Assembled at Brown University in 1963-64
The first large on-line text collection used in corpus-based NLP research
Issues in Word Counting
Punctuation symbols (. , ? !)
Capitalization (“He” vs. “he”, “Bush” vs. “bush”)
Inflected forms (“cat” vs. “cats”)
– Wordform: cat, cats, eat, eats, ate, eating, eaten
– Lemma (stem): cat, eat
Types vs. Tokens
Tokens (N): total number of running words
Types (B): number of distinct words in a corpus (the size of the vocabulary)
Example: “They picnicked by the pool, then lay back on the grass and looked at the stars.”
– 16 word tokens, 14 word types (not counting punctuation)
※ Here “types” means wordform types, not lemma types; punctuation marks are generally counted as words.
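The token/type counts above can be checked with a short script (a minimal sketch; the tokenizer is a simple regular expression that skips punctuation, matching the slide's convention of not counting it):

```python
import re

sentence = ("They picnicked by the pool, then lay back "
            "on the grass and looked at the stars.")

# Crude tokenizer: runs of letters/apostrophes, punctuation ignored
tokens = re.findall(r"[A-Za-z']+", sentence)

print(len(tokens))       # 16 word tokens ("the" occurs three times)
print(len(set(tokens)))  # 14 word types
```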
How Many Words in English?
Shakespeare’s complete works:
– 884,647 wordform tokens
– 29,066 wordform types
Brown Corpus:
– 1 million wordform tokens
– 61,805 wordform types
– 37,851 lemma types
Simple (Unsmoothed) N-grams
Task: estimating the probability of a word
First attempt:
– Suppose no corpus is available
– Use a uniform distribution
– Assume: number of word types = V (e.g., 100,000)
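The slide's formula is preserved only as an image; with no data, the uniform estimate over V word types is:

```latex
P(w) = \frac{1}{V}, \qquad \text{e.g. } P(w) = \frac{1}{100{,}000} = 10^{-5}
```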
Simple (Unsmoothed) N-grams
Task: estimating the probability of a word
Second attempt:
– Suppose a corpus is available
– Assume: number of word tokens = N
– Let C(w) = number of times w appears in the corpus
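The formula itself is an image on the slide; the corresponding relative-frequency (unigram MLE) estimate is:

```latex
P(w) = \frac{C(w)}{N}
```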
Simple (Unsmoothed) N-grams
Task: estimating the probability of a word
Third attempt:
– Suppose a corpus is available
– Assume a word depends on its previous n-1 words
Simple (Unsmoothed) N-grams
n-gram approximation:
– wk depends only on its previous n-1 words
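The formulas on these slides are images; reconstructed in the standard chain-rule form (here the n in the subscript is the n-gram order, not the sentence length):

```latex
P(w_1^n) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})
\;\approx\; \prod_{k=1}^{n} P(w_k \mid w_{k-n+1}^{k-1})
```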
Bigram Approximation
Example:P(I want to eat British food)
= P(I|<s>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British)
<s>: a special word meaning “start of sentence”
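A direct implementation of this product, using hypothetical bigram probabilities (the values below are illustrative only, not estimated from any real corpus):

```python
# Hypothetical bigram probabilities (illustrative values only)
bigram_p = {
    ("<s>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "British"): 0.002, ("British", "food"): 0.60,
}

sentence = ["<s>", "I", "want", "to", "eat", "British", "food"]

# P(sentence) = product of bigram probabilities over adjacent word pairs
p = 1.0
for prev, cur in zip(sentence, sentence[1:]):
    p *= bigram_p[(prev, cur)]

print(p)  # ≈ 1.62e-05
```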
Note on Practical Problem
Multiplying many probabilities results in a very small number and can cause numerical underflow
Use logprob instead in the actual computation
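A sketch of the standard fix: sum log probabilities instead of multiplying raw ones (the probability values here are illustrative):

```python
import math

# Illustrative bigram probabilities for one six-word sentence
probs = [0.25, 0.32, 0.65, 0.26, 0.002, 0.60]

# Multiplying many small numbers underflows for long texts;
# summing logs is numerically safe
logprob = sum(math.log(p) for p in probs)

print(logprob)            # a moderate negative number, safe to store
print(math.exp(logprob))  # recover the raw probability only if needed
```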
Estimating N-gram Probability
Maximum Likelihood Estimate (MLE)
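The MLE formula on this slide is an image; for bigrams it is the relative frequency of the bigram given its first word:

```latex
P(w_n \mid w_{n-1})
= \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)}
= \frac{C(w_{n-1} w_n)}{C(w_{n-1})}
```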
Estimating Bigram Probability
Example:
– C(to eat) = 860
– C(to) = 3256
– P(eat | to) = 860 / 3256 ≈ 0.26
Two Important Facts
N-gram models become increasingly accurate as we increase the value of N
They depend very strongly on their training corpus (in particular on its genre and its size in words)
Smoothing
Any particular training corpus is finite
– Sparse data problem: many valid n-grams never occur in the training data
– Must deal with zero probabilities
Smoothing
Smoothing– Reevaluating zero probability n-grams and
assigning them non-zero probability
Also called Discounting– Lowering non-zero n-gram counts in order to
assign some probability mass to the zero n-grams
Add-One Smoothing for Bigram
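The slide's formula survives only as an image; the standard add-one (Laplace) estimate for bigrams, with V the vocabulary size, is:

```latex
P_{\text{add-1}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}
```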
Things Seen Once
Use the count of things seen once to help estimate the count of things never seen
Witten-Bell Discounting
Witten-Bell Discounting for Bigram
(Figure: bigram counts divided into seen and unseen events.)
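The Witten-Bell formulas on these slides are images; reconstructed in the standard bigram form, where T(w) is the number of distinct word types seen after w and Z(w) the number of word types never seen after w:

```latex
P(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{C(w_{i-1} w_i)}{C(w_{i-1}) + T(w_{i-1})} & \text{if } C(w_{i-1} w_i) > 0 \\[10pt]
\dfrac{T(w_{i-1})}{Z(w_{i-1}) \left( C(w_{i-1}) + T(w_{i-1}) \right)} & \text{otherwise}
\end{cases}
```

The total probability mass reserved for unseen bigrams after w_{i-1} is T(w_{i-1}) / (C(w_{i-1}) + T(w_{i-1})), shared equally among the Z(w_{i-1}) unseen continuations.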
Good-Turing Discounting for Bigram
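The formulas here are images; the standard Good-Turing re-estimate, where N_c is the number of bigram types occurring exactly c times and N the total count of observed bigrams, is:

```latex
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_c},
\qquad
P_{\text{GT}}(\text{all unseen bigrams}) = \frac{N_1}{N}
```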
Backoff
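The backoff formulas on these slides are images; the standard Katz-style idea, shown here for bigrams (P* is a discounted estimate and α(w_{n-1}) a normalizing backoff weight), is: use the bigram estimate if the bigram was seen, otherwise back off to the unigram:

```latex
P_{\text{bo}}(w_n \mid w_{n-1}) =
\begin{cases}
P^{*}(w_n \mid w_{n-1}) & \text{if } C(w_{n-1} w_n) > 0 \\[6pt]
\alpha(w_{n-1})\, P(w_n) & \text{otherwise}
\end{cases}
```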
Entropy
A measure of uncertainty
Used to evaluate the quality of n-gram models (how well a language model matches a given language)
Entropy H(X) of a random variable X:
– Measured in bits
– The number of bits needed to encode the information in the optimal coding scheme
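The definition on the slide is an image; in standard form:

```latex
H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)
```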
Example 1
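The example on this slide survives only as an image; a common illustration (assuming the usual 8-outcome setup) compares a uniform distribution with a skewed one:

```python
import math

def entropy(ps):
    """Entropy in bits: H = -sum p * log2(p); terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Uniform over 8 outcomes: log2(8) = 3 bits
print(entropy([1/8] * 8))  # 3.0

# Skewed distribution over the same 8 outcomes: fewer bits on average
skewed = [1/2, 1/4, 1/8, 1/16] + [1/64] * 4
print(entropy(skewed))     # 2.0
```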
Example 2
Perplexity
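The slide's formula is an image; the standard definition, equivalent to 2 raised to the per-word entropy, is:

```latex
PP(W) = 2^{H(W)} = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}}
```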
Entropy of a Sequence
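The formula is preserved only as an image; the standard per-word (entropy-rate) form for sequences of n words drawn from a language L is:

```latex
\frac{1}{n} H(w_1^n) = -\frac{1}{n} \sum_{w_1^n \in L} p(w_1^n) \log_2 p(w_1^n)
```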
Entropy of a Language
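Again reconstructing the image in standard form, the entropy of a language is the limit of the per-word entropy as sequences grow infinitely long:

```latex
H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1^n)
= \lim_{n \to \infty} -\frac{1}{n} \sum_{w_1^n \in L} p(w_1^n) \log_2 p(w_1^n)
```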
Cross Entropy
Used for comparing two language models
p: the actual probability distribution that generated some data
m: a model of p (an approximation to p)
Cross entropy of m on p:
Cross Entropy
By the Shannon-McMillan-Breiman theorem:
Property of cross entropy:
The difference between H(p, m) and H(p) is a measure of how accurate model m is
The more accurate the model, the lower its cross entropy
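The two formulas referenced above survive only as images; reconstructed in standard form, the theorem lets cross entropy be computed from a single long sample, and the cross entropy always upper-bounds the true entropy:

```latex
H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log_2 m(w_1 w_2 \cdots w_n),
\qquad
H(p) \le H(p, m)
```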