LING 388: Language and Computers
Sandiway Fong
Lecture 27: 12/6
Today’s Topic
• Computer Laboratory Class
– n-gram statistics
• Homework #5
• Due next Monday – 13th December – exam period
Software Required
• NSP
– Ted Pedersen’s n-gram statistics package
– written in Perl
– free
– http://www.d.umn.edu/~tpederse/nsp.html
• ActiveState Perl
– a free Perl distribution
– http://www.activestate.com/
[Diagram: NSP runs on Perl, which runs on Windows/Mac etc.]
ActivePerl
• Installed on all the machines
NSP
• On the SBSRI computers, NSP is already present on the C drive:
– C:\nsp
• Otherwise, access it using:
– 1. Click "Start" and choose "Run"
– 2. Type \\sbsri0\apps\nsp and click "OK"
Command Processor
• We will run NSP from the command line
Exercise 1
• Download and prepare text
– 1. Google “marion jones steroids”
– 2. Click on the USA Today article
– 3. Click on “Print this”
– 4. Copy the text of the article into a text editor
• Reformat article of text into lines
• Lower case the first letter of a sentence when appropriate
• Question 1: (1pt)
– How many lines of text are there in the article, including the headline?
Exercise 2
• In the command-line environment ...
• perl count.pl --ngram 1 --newLine unigrams.txt text.txt
– Usage: count.pl [OPTIONS] DESTINATION SOURCE [[, SOURCE] ...]
• Counts up the frequency of all n-grams occurring in SOURCE.
• Sends to DESTINATION the list of n-grams found, along with the frequencies of combinations of the n tokens that each n-gram is composed of. If SOURCE is a directory, all text files in it are counted.
– OPTIONS:
• --ngram N
– Creates n-grams of N tokens each. N = 2 by default.
• --newLine
– Prevents n-grams from spanning across the new-line character.
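count.pl does the real work in Perl; as a rough illustration of what --ngram and --newLine compute, here is a minimal Python sketch (not part of NSP, and the sample sentences are made up for illustration):

```python
from collections import Counter

def count_ngrams(lines, n=1):
    """Count n-grams line by line, so an n-gram never spans a newline
    (the effect of count.pl's --newLine option)."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Illustrative two-line "article", not the real USA Today text.
text = ["jones has denied using steroids",
        "jones won five medals"]
unigrams = count_ngrams(text, n=1)
bigrams = count_ngrams(text, n=2)
print(unigrams[("jones",)])       # → 2
print(bigrams[("jones", "has")])  # → 1
```

The one-line-per-sentence formatting from Exercise 1 matters here: with --newLine, no bigram or trigram crosses a sentence boundary.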
• Obtain unigrams from the text
• Question 1: (1pt)
– How many different words are there in the text, excluding punctuation symbols (. , : ? etc.)?
• Question 2: (1pt)
– Which is the most frequent non-closed-class word?
• Question 3: (1pt)
– Which is the 2nd most frequent non-closed-class word?
• Question 4: (1pt)
– Which person is mentioned most in the article?
Exercise 3
• Obtain bigrams from the text
• Obtain trigrams from the text
• Question 1: (2pts)
– Compute the probability of the sequence “... Jones has denied using steroids” using the bigram approximation
– Show the workings of your answer
– Assume the first word is Jones and p(Jones) is the unigram probability
• Question 2: (2pts)
– Compute the probability of the sequence “... Jones has denied using steroids” using the trigram approximation
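The bigram approximation multiplies conditional probabilities along the sequence: p(w1 ... wn) ≈ p(w1) p(w2|w1) ... p(wn|wn-1), with p(wn|wn-1) = f(wn-1 wn)/f(wn-1). A minimal Python sketch of that calculation, using made-up placeholder counts (the real numbers come from your unigrams.txt and bigrams.txt):

```python
# Illustrative placeholder counts, NOT the article's actual frequencies.
unigram_f = {"jones": 10, "has": 4, "denied": 2, "using": 3, "steroids": 5}
bigram_f = {("jones", "has"): 2, ("has", "denied"): 2,
            ("denied", "using"): 1, ("using", "steroids"): 2}
total_tokens = 100  # illustrative corpus size

def p_bigram(sentence, unigram_f, bigram_f, total):
    """Bigram approximation: p(w1) * product of f(prev cur)/f(prev)."""
    words = sentence.split()
    p = unigram_f[words[0]] / total  # first word: unigram probability
    for prev, cur in zip(words, words[1:]):
        p *= bigram_f[(prev, cur)] / unigram_f[prev]
    return p

print(p_bigram("jones has denied using steroids",
               unigram_f, bigram_f, total_tokens))
```

With these placeholder counts the product is (10/100)(2/10)(2/4)(1/2)(2/3); show each factor like this in your own workings, but with the article's real frequencies.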
Exercise 4
• Insert the start-of-sentence dummy symbol StarT where appropriate
• Question 1: (2pts)
– Compute the probability of the sentence beginning “Jones has denied using steroids” using the bigram approximation
• Question 2: (1pt)
– Compare your answer with the answer you gave in Exercise 3, Question 1
– Which probability is greater, and why?
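With StarT inserted, the sentence-initial word is conditioned on StarT rather than receiving its raw unigram probability, i.e. the chain starts with p(w1|StarT) instead of p(w1). A sketch of just that first factor, again with made-up placeholder counts:

```python
# Illustrative placeholder counts, NOT the article's actual frequencies.
# f("StarT") is the number of sentences, since StarT begins each one.
unigram_f = {"StarT": 20, "jones": 10, "has": 4}
bigram_f = {("StarT", "jones"): 6, ("jones", "has"): 2}
total_tokens = 100

# Without StarT the first factor is the unigram probability p(jones);
# with StarT it is the conditional p(jones | StarT).
p_without_start = unigram_f["jones"] / total_tokens                # 10/100
p_with_start = bigram_f[("StarT", "jones")] / unigram_f["StarT"]   # 6/20
print(p_without_start, p_with_start)
```

If a word often begins a sentence, p(word|StarT) can be much larger than its unigram probability, which is the kind of difference Question 2 asks you to explain.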
Smoothing
• A small and sparse dataset means that zero frequency values are a problem
– Zero probabilities
• p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1) (bigram model)
• one zero and the whole product is zero
– Zero frequencies are a problem
• p(wn|wn-1) = f(wn-1wn)/f(wn-1) (relative frequency)
• if the word wn-1 doesn’t exist in the dataset, we’re dividing by zero
• What to do when frequencies are zero?
• Answer 1: get a larger corpus
– (even with the BNC) never large enough – there are always going to be zeros
• Answer 2: smoothing
– assign some small non-zero value to unknown frequencies
• Simplest algorithm – not the best way...
– called Add-One
– add one to the frequency counts for everything
– the bigram probability for sequence wn-1wn is now given by
• p(wn|wn-1) = (f(wn-1wn) + 1)/(f(wn-1) + V)
• V = # different words in the corpus
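The Add-One formula above can be sketched in a few lines of Python. The counts and V below are illustrative placeholders, not the article's real frequencies; the point is that an unseen bigram now gets a small non-zero probability instead of zero:

```python
def p_add_one(prev, cur, bigram_f, unigram_f, V):
    """Add-One smoothed bigram probability:
    p(cur|prev) = (f(prev cur) + 1) / (f(prev) + V)."""
    return (bigram_f.get((prev, cur), 0) + 1) / (unigram_f.get(prev, 0) + V)

# Illustrative placeholder counts, NOT the article's actual frequencies.
unigram_f = {"denied": 2, "admitted": 0, "using": 3}
bigram_f = {("denied", "using"): 1}
V = 50  # number of different words in the corpus

print(p_add_one("denied", "using", bigram_f, unigram_f, V))    # (1+1)/(2+50)
print(p_add_one("admitted", "using", bigram_f, unigram_f, V))  # unseen: (0+1)/(0+50)
```

Note that smoothing also shrinks the probabilities of bigrams that were seen, since every denominator grows by V.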
Exercise 5
• Using Add-One and StarT (from Exercise 4)
• Question 1: (2pts)
– Recompute the bigram probability approximation for “Jones has denied using steroids”
• Question 2: (2pts)
– Compute the bigram probability approximation for “Jones has admitted using steroids” – a sentence that does not occur in the original article
Homework Summary
• Total points on offer: 16
• Exercise 1
– 1pt
• Exercise 2
– 4pts
• Exercise 3
– 4pts
• Exercise 4
– 3pts
• Exercise 5
– 4pts