
Page 1: Lecture 4: n-grams and NLP - University of Pittsburgh (naraehan/ling1330/Lecture4.pdf)

Lecture 4: n-grams in NLP

LING 1330/2330: Introduction to Computational Linguistics

Na-Rae Han

Page 2

Objectives

Frequent n-grams in English

n-grams and statistical NLP

n-grams and conditional probability

Large n-gram resources

2/2/2017

Page 3

For fun: most frequent bigrams?


2551888 of the

1887475 in the

1041011 to the

861798 on the

676658 and the

648408 to be

578806 for the

561171 at the

498217 in a

479627 do n't

455367 with the

451460 from the

443547 of a

395939 that the

362176 is a

361879 going to

335255 by the

330828 as a

319846 with a

317431 I think

Source: http://www.ngrams.info/download_coca.asp
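Frequency lists like the one above are produced by pairing every word in a corpus with its successor and counting the pairs. A minimal Python sketch (the toy sentence here is made up for illustration; real lists like COCA's come from hundreds of millions of words):

```python
from collections import Counter

def bigrams(tokens):
    """Pair each token with its immediate successor."""
    return list(zip(tokens, tokens[1:]))

# Toy corpus for illustration
tokens = "the cat sat on the mat and the cat slept".split()
counts = Counter(bigrams(tokens))

# Most frequent bigrams first, mirroring the list above
for (w1, w2), n in counts.most_common(2):
    print(n, w1, w2)
```

The same `Counter` idiom scales to trigrams by zipping three shifted copies of the token list.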

Page 4

Most frequent trigrams?


198630 I do n't

140305 one of the

129406 a lot of

117289 the United States

79825 do n't know

76782 out of the

75015 as well as

73540 going to be

61373 I did n't

61132 to be a

Source: http://www.ngrams.info/download_coca.asp

Page 5

n-grams and statistical NLP


You have a good intuition as a native speaker.

Beyond intuition, it is possible to obtain a highly detailed & accurate set of n-gram statistics.

How? Through corpus data.

Corpus-sourced, large-scale n-grams are one of the biggest contributors to the recent advancement of statistical natural language processing (NLP) technologies.

Used for: spelling correction, machine translation, speech recognition, information extraction...

JUST ABOUT ANY NLP APPLICATION

Page 6

n-grams and conditional probability


Suppose 'is' is the current word. What is the most likely next word?

How likely are 'you' and 'your' as the next word?

Questions of conditional probability

Can be answered through n-gram data

*Source: http://norvig.com/ngrams/count_1w.txt
**Source: http://norvig.com/ngrams/count_2w.txt

'is' occurs 4,705,743,816 times (1)*

Bigrams beginning with 'is'**:
is a    476718990  (2)
is the  306482559
is not  276753375
is an   98762170
is to   97276807
is your 17051576   (3)
is you  1826931    (4)

'a' is the most likely next word with (2) / (1) = 0.10 probability.

'your' as the next word has (3) / (1) = 0.0036 probability.

'you' as the next word has (4) / (1) = 0.000388 probability.
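The arithmetic above can be checked directly. A small Python sketch using the counts quoted on this slide:

```python
# Counts quoted above from Norvig's count_1w.txt / count_2w.txt
COUNT_IS = 4_705_743_816   # (1) unigram count of 'is'
BIGRAMS_IS = {             # counts of 'is <word>'
    "a": 476_718_990,      # (2)
    "the": 306_482_559,
    "not": 276_753_375,
    "an": 98_762_170,
    "to": 97_276_807,
    "your": 17_051_576,    # (3)
    "you": 1_826_931,      # (4)
}

def p_next(word):
    """P(word | 'is') = count('is word') / count('is')."""
    return BIGRAMS_IS[word] / COUNT_IS

print(round(p_next("a"), 2))     # 0.1
print(round(p_next("your"), 4))  # 0.0036
print(round(p_next("you"), 6))   # 0.000388
```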

Page 7

Extremely large


"All our N-gram are Belong to You"

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Google Web 1T 5-Gram, released in August 2006 through LDC (Linguistic Data Consortium)

1-5 grams

Compiled from 1 trillion words of running web text

24 GB of compressed text

Source of Norvig's 1- and 2-gram frequency lists

Publication of this data triggered huge advances in NLP technologies and applications.

Page 8

Even larger


Google Books Ngram Corpus

http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

Basis for Google Books Ngram Viewer

1-5 grams

Freely downloadable (for those who can)

Compiled from over 5 million books, published up to 2008

Data has publication dates; good for charting historical trends

Books were digitized using OCR

In multiple languages

American/British English, Chinese, French, German, Hebrew, Italian, Russian, Spanish

Page 9

Large-ish


COCA n-gram lists

http://www.ngrams.info/download_coca.asp

Word 2-5 grams, each containing top ~1 million entries

Based on COCA (The Corpus of Contemporary American English) (http://corpus.byu.edu/coca/), 520 million words as of Jan 2017

COCA's full unigram list is not free.

COCA's top 5000 words/lemmas

http://www.wordfrequency.info/free.asp

Contains lemma and POS of top 5,000 words

Page 10

Excerpted, manageable


Natural Language Corpus Data: Beautiful Data

by Peter Norvig

http://norvig.com/ngrams/

Has lists of large-scale English n-gram data: character-level (1- and 2-grams) and word-level (1-, 2-, and 3-grams)

Data derived/excerpted from Google Web 1T 5-Gram corpus

¼ million most frequent bigrams

Google's original list has 315 million.

Page 11

1-grams/word list: Norvig vs. ENABLE


count_1w.txt (word + count, frequency-ordered):
the 23135851162
of 13151942776
and 12997637966
to 12136980858
a 9081174698
in 8469404971
for 5933321709
is 4705743816
on 3750423199
that 3400031103
by 3350048871
this 3228469771
with 3183110675
i 3086225277
...
goofel 12711
gooek 12711
gooddg 12711
gooblle 12711
gollgo 12711
golgw 12711

enable1.txt (alphabetical word list, no counts):
aa aah aahed aahing aahs aal aalii aaliis aals aardvark aardvarks aardwolf aardwolves aargh aarrgh ... zymotic zymurgies zymurgy zyzzyva zyzzyvas

Total # of entries: 333K

vs. 173K

Usefulness?

Overlap?
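The overlap question can be answered mechanically with set operations once each file is loaded into a set of words. A sketch with made-up miniature stand-ins for the two lists (the real files hold 333K and 173K entries):

```python
# Tiny illustrative stand-ins for the two word lists
norvig_words = {"the", "of", "and", "is", "goofel", "gooek"}  # from count_1w.txt
enable_words = {"aa", "aardvark", "the", "of", "and", "is"}   # from enable1.txt

overlap = norvig_words & enable_words      # words in both lists
norvig_only = norvig_words - enable_words  # e.g. web-noise entries like 'goofel'

print(sorted(overlap))
print(sorted(norvig_only))
```

On the real data, `norvig_only` would surface the non-dictionary noise in the web-derived list, while `enable_words - norvig_words` would show dictionary words too rare to make Norvig's cut.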

Page 12

2-grams: Norvig vs. COCA


count_2w.txt (Norvig; bigram + count):
you get 25183570
you getting 430987
you give 3512233
you go 8889243
you going 2100506
you gone 210111
you gonna 416217
you good 441878
you got 4699128
you gotta 668275
you graduate 117698
you grant 103633
you great 450637
you grep 120367
you grew 102321
you grow 398329
you guess 186565
you guessed 295086
you guys 5968988
you had 7305583
you hand 120379
you handle 336799
you hang 144949
you happen 627632
you happy 603963

w2_.txt (COCA; count + bigram):
39509 you get
30 you gets
31 you gettin
861 you getting
263 you girls
24 you git
5690 you give
138 you given
169 you giving
182 you glad
46 you glance
23594 you go
70 you god
54 you goddamn
115 you goin
9911 you going
1530 you gon
262 you gone
444 you good
25 you google
19843 you got

Compiled from: 1 trillion words

vs. 500 million words

Page 13

2-grams: Norvig vs. COCA


(Same count_2w.txt and w2_.txt excerpts as on the previous slide.)

*NOT Google's fault: Norvig kept only the top 0.1% of the 315 million bigrams.

Total # of entries: ¼ million*

vs. 1 million

Usefulness?

Page 14

Know your data


When using publicly available resources, you must evaluate and understand the data.

Origin?

Domain & genre?

Size?

Traits?

Merits and limitations?

Fit with your project?