Creating and Exploring Frequency Lists
Marco Baroni
Computational skills for text analysis
Outline
Introduction
Perl tables
Collecting frequency lists
Exploring frequency lists
Stop(-word) lists
Hello Larry!
Why frequency?
- Occurrence and co-occurrence counts (“frequency”) are at the core of any statistical method to analyze/extract information from text
- Frequency analysis lets us study not only what is possible, but also what is common, “natural”
Collecting frequency lists
- To count words, start from tokenized text (so you know what a word is)
- We need a new type of variable to store a table in Perl
Outline
Introduction
Perl tables
Collecting frequency lists
Exploring frequency lists
Stop(-word) lists
Tables in Perl
- Hash tables, or hashes: complex variables that contain a table where each key X (variable, number, string) is associated with a value Y (variable, number, string)
- A hash: %table
- A key: $i
- The corresponding value: $table{$i}
Hash table
Contents of a hash table named %hash

  $x      $hash{$x}
  cat     12
  dog     23
  mouse   7
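The table above could be built and queried with a short sketch like this:

```perl
# build the table from the slide
$hash{"cat"} = 12;
$hash{"dog"} = 23;
$hash{"mouse"} = 7;
# look up a single value through its key
print "$hash{'cat'}\n"; # prints 12
```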
Populate a table
- Create/update a table row with the key dog and the value 23:
  $hash{"dog"} = 23;
- The same, but now dog is in $word and 23 is in $freq:
  $hash{$word} = $freq;
- Incrementing the value corresponding to the $word key:
  $hash{$word} += 1;
- More compactly:
  $hash{$word}++;
Printing all rows of a table
foreach $key (keys %hash) {
  print "$key $hash{$key}\n";
}

# NB: keys returns the key list (an array!)
# NB2: a new kind of loop:
# foreach item (array) {}
Outline
Introduction
Perl tables
Collecting frequency lists
Exploring frequency lists
Stop(-word) lists
Collecting frequency lists
From tokenized input

while (<>) {
  $input = $_;
  $input =~ s/\n//; # remove the newline
  # increment count for word in $input:
  $freqtable{$input}++;
}
# we traversed the whole input, the list is ready
# we print it
foreach $key (keys %freqtable) {
  print "$key $freqtable{$key}\n";
}
Sorting in alphabetical (ASCII) order
while (<>) {
  $input = $_;
  $input =~ s/\n//; # remove the newline
  # increment count for word in $input:
  $freqtable{$input}++;
}
# we traversed the whole input, the list is ready
# we sort it and print it
foreach $key (sort (keys %freqtable)) {
  print "$key $freqtable{$key}\n";
}
Sorting by decreasing frequency
Weird syntax, just learn it as an idiom

while (<>) {
  $input = $_;
  $input =~ s/\n//; # remove the newline
  # increment count for word in $input:
  $freqtable{$input}++;
}
# we traversed the whole input, the list is ready
# we sort it and print it (code on multiple lines
# just so it fits on the slide)
foreach $key
  (sort {$freqtable{$b} <=> $freqtable{$a}}
  (keys %freqtable)) {
  print "$key $freqtable{$key}\n";
}
Frequency of 2-word sequences
  $key        $freqtable{$key}
  the cat     10
  cat meows   2
  a cat       7
Frequency of 2-word sequences
while (<>) {
  $input = $_;
  $input =~ s/\n//;
  push @sequence,$input; # add to sequence
  if (!defined($sequence[1])) { next; } # not enough items yet
  $curr_string = join " ", @sequence; # concatenate
  $freqtable{$curr_string}++; # increment sequence count
  shift @sequence; # remove first item in sequence
}
foreach $key
  (sort {$freqtable{$b} <=> $freqtable{$a}}
  (keys %freqtable)) {
  print "$key $freqtable{$key}\n";
}
Longer sequences

$last_index = 2; # to control sequence size
while (<>) {
  $input = $_;
  $input =~ s/\n//;
  push @sequence,$input;
  if (!defined($sequence[$last_index])) { next; }
  $curr_string = join " ", @sequence;
  $freqtable{$curr_string}++;
  shift @sequence;
}
foreach $key
  (sort {$freqtable{$b} <=> $freqtable{$a}}
  (keys %freqtable)) {
  print "$key $freqtable{$key}\n";
}
Outline
Introduction
Perl tables
Collecting frequency lists
Exploring frequency lists
Stop(-word) lists
Exploring frequency lists
- Filtering frequency lists with:
  - Regexps on the words they contain
  - Numerical conditions (such as: > “greater than”) on the frequency counts
- Store the Brown word and bigram frequency lists in two files, so that we can play with them
The general frame
while (<>) {
  $input = $_;
  $input =~ s/\n//;
  ($word1,$freq) = split " ",$input;
  # split is the inverse of join
  # and with (), perl treats $word1 and $freq as the
  # first and second elements of the array implicitly
  # created by split
  # how would you extend to bigrams, trigrams?
  if (CONDITIONS) {
    print "$input\n";
  }
}
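One possible answer to the bigram question in the comment: with three-column input lines (word1 word2 freq, as in our bigram lists), split just needs a third variable. A sketch (the condition here is an invented example):

```perl
while (<>) {
  $input = $_;
  $input =~ s/\n//;
  # three columns instead of two:
  ($word1,$word2,$freq) = split " ",$input;
  # example condition: bigrams ending in cat, occurring more than 5 times
  if (($word2 eq "cat") && ($freq > 5)) {
    print "$input\n";
  }
}
```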
Example: love forms
while (<>) {
  $input = $_;
  $input =~ s/\n//;
  ($word1,$freq) = split " ",$input;
  if ($word1 =~ /^lov(e[sd]?|ing)$/) {
    print "$input\n";
  }
}
Example: love forms that occur at least 50 times
while (<>) {
  $input = $_;
  $input =~ s/\n//;
  ($word1,$freq) = split " ",$input;
  if (($word1 =~ /^lov(e[sd]?|ing)$/) &&
      ($freq > 49)) {
    print "$input\n";
  }
}
Practice time
- Extract the frequency of all words that contain at least 5 characters; do not print words that occur less than 5 times
- Extract the frequency of words that contain at least 6 characters and end in -ment or -ion
- Extract the bigrams that contain a form of love as first or second element
- You will need to formulate an or condition (with ||) since the love form could be the first or the second word of the bigram
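For the last exercise, the or condition might be sketched like this (one possible solution, assuming bigram-list lines of the form word1 word2 freq):

```perl
while (<>) {
  $input = $_;
  $input =~ s/\n//;
  ($word1,$word2,$freq) = split " ",$input;
  # keep the bigram if either word is a form of love
  if (($word1 =~ /^lov(e[sd]?|ing)$/) ||
      ($word2 =~ /^lov(e[sd]?|ing)$/)) {
    print "$input\n";
  }
}
```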
Outline
Introduction
Perl tables
Collecting frequency lists
Exploring frequency lists
Stop(-word) lists
The need for more cleaning
- When looking at keywords in context qualitatively, it made sense to preserve the original context in which keywords appeared
- However, once we start extracting frequency lists, numbers, punctuation marks and function words (such as of and the) are annoying, and clutter the resulting lists
- You can weed out numbers, punctuation marks and other non-(fully)-alphabetical stuff from your tokenized corpus with one or more regular expressions
- A rather draconian approach might involve something like:
  if ($input =~ /[^a-zA-Z'\-]/) { next; }
- (This assumes that the newline character was stripped off from $input)
- You can also clean up the frequency lists, instead of the tokenized corpus: what’s the difference?
- However, function words do not contain easy-to-spot special characters
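The draconian check above could be wrapped into a complete filter along these lines (a sketch: it reads a tokenized corpus, one token per line, and skips any token containing characters other than letters, apostrophes and hyphens):

```perl
while (<>) {
  $input = $_;
  $input =~ s/\n//; # strip the newline first
  # skip tokens with any non-alphabetical character
  if ($input =~ /[^a-zA-Z'\-]/) { next; }
  print "$input\n";
}
```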
Function words
- Function words (prepositions, articles, auxiliary verbs. . . ) such as the, of, is are very, very frequent, and (for most purposes) not that interesting
- They also “block” the harvesting of more interesting co-occurrences
- In state of the nation, the interesting bigram is state nation, not state of, of the, the nation
- Solution: collect a list of the most frequent words from the corpus (or another source), and write a program to filter words from this stop list out of the corpus
Creating a stop-word list
- From the Brown single-word frequency list, create a stop.txt file with all the words that occur N times or more in the Brown
- One word per line (no frequencies!)
- By looking at the sorted frequency list, I think that 500 occurrences might be a good threshold, but you can try other Ns, if you so wish
Removing stop-words

$stop_file = shift;
$tok_corpus = shift;
open STOP, $stop_file;
while (<STOP>) {
  $input = $_;
  $input =~ s/\n//;
  $stop_list{$input} = 1; # value is arbitrary, we will just
}                         # check if token is in %stop_list hash
close STOP;
open CORPUS, $tok_corpus;
while (<CORPUS>) {
  $input = $_;
  $input =~ s/\n//;
  if (!defined($stop_list{$input})) {
    print "$input\n";
  }
}
close CORPUS;
Removing stop-words
Opening, reading and closing files explicitly
- We need to follow an explicit procedure to get a file name from a command line argument, open the file, read one line at a time, and close the file
- We cannot use the while (<>) { ... } shortcut, because this time we need to juggle two input files (the stop-word list, and the tokenized corpus)
- We encountered the full procedure in the Basics slides, but we forgot about it
- Note the funny filehandle variables, without $ and (by convention) UPPERCASED, that Perl uses to connect to external files
Removing stop-words
Checking set membership in Perl
- We just store the set members (here, the stop words) as keys of a hash table with arbitrary values (here, 1s)
- We can then efficiently check if an item is in the set simply by checking whether we have a value corresponding to that item in the hash table: defined($set{$item})
- Note that instead of checking if the word is not in the stop list in order to print, we could have checked if the word is in the stop list, and in that case issued a next command to move to the next line in the tokenized file
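The hash-as-set idea in isolation (the three stop words here are just for illustration):

```perl
# store set members as keys; the value (1) is arbitrary
foreach $word ("the", "of", "is") {
  $stop_list{$word} = 1;
}
# membership check: is there a value for this key?
if (defined($stop_list{"of"})) { print "of is a stop word\n"; }
if (!defined($stop_list{"cat"})) { print "cat is not\n"; }
```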
Filtering the Brown
- Run the program to remove stop words on the tokenized Brown
- It will need two input arguments: the name of the stop list, and the name of the tokenized Brown file
- Extract a bigram frequency list from the filtered tokenized file you created, and compare it to the frequency list you obtained before stop-list filtering
- NB: it is also sometimes useful to perform keep-word filtering, i.e., to filter the corpus (or a frequency list) to preserve only words that are found in a certain list (this might be relevant to your project)