LING 388: Language and Computers Sandiway Fong Lecture 27: 12/6


LING 388: Language and Computers

Sandiway Fong

Lecture 27: 12/6


Today’s Topic

• Computer Laboratory Class
  – n-gram statistics
• Homework #5
  – Due next Monday, 13th December
  – exam period


Software Required

• NSP
  – Ted Pedersen’s n-gram statistics package
  – written in Perl
  – free
  – http://www.d.umn.edu/~tpederse/nsp.html
• Active State Perl
  – free Perl
  – http://www.activestate.com/

[Diagram: software stack, with NSP on top of Perl, on top of Windows/Mac etc.]


Active Perl

• Installed on all the machines


NSP

• On the SBSRI computers
• Already present on the C drive
  – C:\nsp
• Otherwise access it using:
  – 1. Click "Start" and choose "Run"
  – 2. Type \\sbsri0\apps\nsp and click "OK"


Command Processor

• We will run NSP from the command line



Exercise 1

• Download and prepare text
  – 1. Google “marion jones steroids”
  – 2. Click on USAToday article


Exercise 1

• Download and prepare text
  – 3. Click on Print this
  – 4. Copy text of article into text editor


Exercise 1

• Reformat the text of the article into lines
• Lower case the first letter of a sentence when appropriate
• Question 1: (1pt)
  – How many lines of text are there in the article, including the headline?


Exercise 2

• In the command line environment ...
• perl count.pl --ngram 1 --newline unigrams.txt text.txt
  – Usage: count.pl [OPTIONS] DESTINATION SOURCE [[, SOURCE] ...]
    • Counts up the frequency of all n-grams occurring in SOURCE.
    • Sends to DESTINATION the list of n-grams found, along with the frequencies of combinations of the n tokens that the n-gram is composed of. If SOURCE is a directory, all text files in it are counted.
  – OPTIONS:
    • --ngram N
      – Creates n-grams of N tokens each. N = 2 by default.
    • --newLine
      – Prevents n-grams from spanning across the new-line character.
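To make the effect of the two options concrete, here is a minimal Python sketch of n-gram counting. It is not NSP's implementation: the tokenization rule (runs of word characters, or single punctuation marks), the function name count_ngrams and the two-line sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

def count_ngrams(text, n=2, newline=False):
    """Count n-grams of n tokens each; with newline=True, n-grams never span lines."""
    # Assumed tokenization: a run of word characters, or one punctuation mark.
    token_re = re.compile(r"\w+|[^\w\s]")
    # Treat each line separately when n-grams must not cross a new-line.
    chunks = text.splitlines() if newline else [text]
    counts = Counter()
    for chunk in chunks:
        tokens = token_re.findall(chunk)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical two-line sample, not the article from Exercise 1.
sample = "jones has denied using steroids .\njones has denied it ."
unigrams = count_ngrams(sample, n=1, newline=True)
bigrams = count_ngrams(sample, n=2, newline=True)
print(unigrams.most_common(3))
print(bigrams[("jones", "has")])
```

With newline=True the bigram (".", "jones") is never counted, because the full stop and the following word sit on different lines.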


Exercise 2

• Obtain unigrams from the text
• Question 1 (1pt)
  – How many different words are there in the text, excluding punctuation symbols (. , : ? etc.)?
• Question 2 (1pt)
  – Which is the most frequent non-closed class word?
• Question 3 (1pt)
  – Which is the 2nd most frequent non-closed class word?
• Question 4 (1pt)
  – Which person is mentioned most in the article?


Exercise 3

• Obtain bigrams from the text
• Obtain trigrams from the text
• Question 1: (2pts)
  – Compute the probability of the sequence “... Jones has denied using steroids” using the bigram approximation
  – show the workings of your answer
  – assume first word is Jones and p(Jones) is the unigram probability
• Question 2: (2pts)
  – Compute the probability of the sequence “... Jones has denied using steroids” using the trigram approximation
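As a reminder of what the two approximations look like for this sequence, the products can be written out as below. This is a sketch of the general shape only, with no counts filled in. How the first two words are handled in the trigram case (unigram for Jones, bigram for has) is an assumption carried over from the bigram question, not something the exercise states.

```latex
\begin{align*}
p_{\text{bigram}}(\text{Jones has denied using steroids})
  &\approx p(\text{Jones})\, p(\text{has} \mid \text{Jones})\, p(\text{denied} \mid \text{has})\,
           p(\text{using} \mid \text{denied})\, p(\text{steroids} \mid \text{using}) \\
p_{\text{trigram}}(\text{Jones has denied using steroids})
  &\approx p(\text{Jones})\, p(\text{has} \mid \text{Jones})\, p(\text{denied} \mid \text{Jones has})\,
           p(\text{using} \mid \text{has denied})\, p(\text{steroids} \mid \text{denied using})
\end{align*}
```

Each conditional term is a relative frequency, for example $p(w_n \mid w_{n-2} w_{n-1}) = f(w_{n-2} w_{n-1} w_n) / f(w_{n-2} w_{n-1})$ for the trigram case.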


Exercise 4

• Insert the start of sentence dummy symbol StarT where appropriate
• Question 1: (2pts)
  – Compute the probability of the sentence beginning “Jones has denied using steroids” using the bigram approximation
• Question 2: (1pt)
  – Compare your answer with the answer you gave in Exercise 3 Question 1
  – Which probability is greater and why?
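The following Python sketch shows one way the StarT symbol could enter the calculation, assuming one sentence per line as prepared in Exercise 1. The helper names add_start and bigram_prob and the two-sentence corpus are hypothetical. The counts are not those of the article, so the resulting number says nothing about the homework answer.

```python
from collections import Counter

def add_start(lines, start="StarT"):
    """Prefix every sentence (one per line) with the dummy start symbol."""
    return [[start] + line.split() for line in lines]

def bigram_prob(words, unigrams, bigrams):
    """Multiply the relative frequencies f(prev cur) / f(prev) along the sequence.

    Raises ZeroDivisionError if some prev word never occurs in the corpus.
    """
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

# Hypothetical corpus, not the article from Exercise 1.
sentences = add_start(["jones has denied using steroids", "she has denied it"])
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))

# p(jones | StarT) replaces the unigram p(jones) as the first factor.
p = bigram_prob(["StarT", "jones", "has", "denied", "using", "steroids"],
                unigrams, bigrams)
print(p)
```

Because StarT is counted like any other token, the first factor becomes the probability that a sentence starts with the given word.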


Smoothing

• Small and sparse dataset means that zero frequency values are a problem
  – Zero probabilities
    • $p(w_1 w_2 w_3 \ldots w_n) \approx p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1})$ (bigram model)
    • one zero and the whole product is zero
  – Zero frequencies are a problem
    • $p(w_n \mid w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1})$ (relative frequency)
    • word doesn’t exist in dataset and we’re dividing by zero
• What to do when frequencies are zero?
• Answer 1: get a larger corpus
  – (even with the BNC) never large enough, always going to have zeros
• Answer 2: (Smoothing)
  – assign some small non-zero value to unknown frequencies


Smoothing

• Simplest Algorithm – not the best way...
  – called Add-One
  – add one to the frequency counts for everything
  – bigram probability for sequence $w_{n-1} w_n$ now given by
    • $p(w_n \mid w_{n-1}) = (f(w_{n-1} w_n) + 1) / (f(w_{n-1}) + V)$
    • V = # different words in corpus
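A minimal Python sketch of the Add-One formula above, run on a hypothetical one-sentence corpus. The function name add_one_bigram_prob and the toy counts are assumptions for illustration, and V is taken to be the number of different words seen in that toy corpus.

```python
from collections import Counter

def add_one_bigram_prob(prev, cur, unigrams, bigrams, vocab_size):
    """Add-One estimate: (f(prev cur) + 1) / (f(prev) + V)."""
    # Counter returns 0 for unseen keys, so unseen bigrams need no special case.
    return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)

# Hypothetical corpus, not the article from Exercise 1.
tokens = "jones has denied using steroids".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # number of different words in the toy corpus

seen = add_one_bigram_prob("has", "denied", unigrams, bigrams, V)      # (1 + 1) / (1 + 5)
unseen = add_one_bigram_prob("has", "admitted", unigrams, bigrams, V)  # (0 + 1) / (1 + 5)
print(seen, unseen)
```

The unseen bigram now gets a small non-zero probability instead of zero, at the cost of lowering the estimate for the bigram that was actually observed.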


Exercise 5

• Using Add-One and StarT (from Exercise 4)
• Question 1: (2pts)
  – Recompute the bigram probability approximation for “Jones has denied using steroids”
• Question 2: (2pts)
  – Compute the bigram probability approximation for “Jones has admitted using steroids”
  – a sentence that does not exist in the original article


Homework Summary

• Total points on offer: 16
• Exercise 1
  – 1pt
• Exercise 2
  – 4pts
• Exercise 3
  – 4pts
• Exercise 4
  – 3pts
• Exercise 5
  – 4pts