Page 1: Creating and Exploring Frequency Lists - CLIC-CIMECclic.cimec.unitn.it/marco/teaching/compskills/materials/lists.pdf · Creating and Exploring Frequency Lists Marco Baroni Computational

Creating and Exploring Frequency Lists

Marco Baroni

Computational skills for text analysis

Outline

Introduction

Perl tables

Collecting frequency lists

Exploring frequency lists

Stop(-word) lists

Hello Larry!

Why frequency?

- Occurrence and co-occurrence counts (“frequency”) are at the core of any statistical method to analyze or extract information from text
- Frequency analysis lets us study not only what is possible, but also what is common, “natural”

Collecting frequency lists

- To count words, start from tokenized text (so you know what a word is)
- We need a new type of variable to store a table in Perl

Tables in Perl

- Hash tables, or hashes: complex variables that contain a table where each key X (variable, number, string) is associated with a value Y (variable, number, string)
- A hash: %table
- A key: $i
- The corresponding value: $table{$i}

Hash table
Contents of a hash table named %hash

    $x       $hash{$x}
    cat      12
    dog      23
    mouse    7

Populate a table

- Create/update a table row with the key dog and the value 23:
    $hash{"dog"} = 23;
- The same, but now dog is in $word and 23 is in $freq:
    $hash{$word} = $freq;
- Incrementing the value corresponding to the $word key:
    $hash{$word} += 1;
- More compactly:
    $hash{$word}++;
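The steps above can be put together in a minimal sketch. It assumes a small hard-coded word list instead of real input, and the sub name count_words is only illustrative. The point it shows is that ++ on a key that is not yet in the table creates the row and sets it to 1, so no initialization is needed.

```perl
use strict;
use warnings;

# Count how often each word occurs in a list (hypothetical helper).
sub count_words {
    my @words = @_;
    my %hash;
    foreach my $word (@words) {
        # a missing key starts out undefined; ++ treats it as 0
        $hash{$word}++;
    }
    return %hash;
}

my %hash = count_words("dog", "cat", "dog");
print "dog $hash{dog}\n";   # prints "dog 2"
print "cat $hash{cat}\n";   # prints "cat 1"
```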

Printing all rows of a table

foreach $key (keys %hash) {
    print "$key $hash{$key}\n";
}

# NB: keys returns the list of keys (usable like an array!),
# in no particular order

# NB2: a new kind of loop:
# foreach item (array) {}

Collecting frequency lists
From tokenized input

while (<>) {
    $input = $_;
    $input =~ s/\n//; # remove the newline

    # increment count for word in $input:
    $freqtable{$input}++;
}
# we traversed the whole input, the list is ready

# we print it
foreach $key (keys %freqtable) {
    print "$key $freqtable{$key}\n";
}

Sorting in alphabetical (ASCII) order

while (<>) {
    $input = $_;
    $input =~ s/\n//; # remove the newline

    # increment count for word in $input:
    $freqtable{$input}++;
}
# we traversed the whole input, the list is ready

# we sort it and print it
foreach $key (sort (keys %freqtable)) {
    print "$key $freqtable{$key}\n";
}

Sorting by decreasing frequency
Weird syntax, just learn it as an idiom

while (<>) {
    $input = $_;
    $input =~ s/\n//; # remove the newline

    # increment count for word in $input:
    $freqtable{$input}++;
}
# we traversed the whole input, the list is ready

# we sort it and print it (code on multiple lines
# just so it fits on the slide)
foreach $key
    (sort {$freqtable{$b} <=> $freqtable{$a}}
    (keys %freqtable)) {
    print "$key $freqtable{$key}\n";
}
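A short sketch of what the idiom does, under a few assumptions: the sub name keys_by_freq and the sample counts are made up, and the table is passed as a hash reference (written \%freqtable), which the slides have not introduced. Inside the sort block, $a and $b stand for the two keys being compared, <=> compares their counts as numbers, and putting $b first gives decreasing order. The extra "or $a cmp $b" is an addition, not part of the slide: it orders words with equal counts alphabetically, which otherwise come out in no fixed order.

```perl
use strict;
use warnings;

# Return the keys of a frequency table, most frequent first;
# ties are broken alphabetically (hypothetical helper).
sub keys_by_freq {
    my ($freqtable) = @_;   # reference to the hash
    return sort {
        $freqtable->{$b} <=> $freqtable->{$a}   # higher count first
            or
        $a cmp $b                                # then ASCII order
    } keys %$freqtable;
}

my %freqtable = (cat => 12, dog => 23, mouse => 7, bird => 12);
foreach my $key (keys_by_freq(\%freqtable)) {
    print "$key $freqtable{$key}\n";
}
# prints, one per line: dog 23, bird 12, cat 12, mouse 7
```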

Frequency of 2-word sequences

    $key         $freqtable{$key}
    the cat      10
    cat meows    2
    a cat        7

Frequency of 2-word sequences

while (<>) {
    $input = $_;
    $input =~ s/\n//;
    push @sequence, $input; # add to sequence

    if (!defined($sequence[1])) { next; } # not enough items yet

    $curr_string = join " ", @sequence; # concatenate
    $freqtable{$curr_string}++; # increment sequence count

    shift @sequence; # remove first item in sequence
}

foreach $key
    (sort {$freqtable{$b} <=> $freqtable{$a}}
    (keys %freqtable)) {
    print "$key $freqtable{$key}\n";
}

Longer sequences

$last_index = 2; # to control sequence size (2 = sequences of 3 words)
while (<>) {
    $input = $_;
    $input =~ s/\n//;
    push @sequence, $input;

    if (!defined($sequence[$last_index])) { next; }

    $curr_string = join " ", @sequence;
    $freqtable{$curr_string}++;

    shift @sequence;
}

foreach $key
    (sort {$freqtable{$b} <=> $freqtable{$a}}
    (keys %freqtable)) {
    print "$key $freqtable{$key}\n";
}
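The same sliding-window logic can be sketched as a reusable sub. This is one possible arrangement, not the slide's own code: the name count_ngrams is hypothetical, tokens are passed as a list instead of being read from input, and the size is given as $n (the number of words, that is $last_index + 1). The test "@sequence < $n" compares the number of items in the array with $n, which plays the role of the defined() check above.

```perl
use strict;
use warnings;

# Count all sequences of $n consecutive tokens (hypothetical helper).
sub count_ngrams {
    my ($n, @tokens) = @_;
    my (@sequence, %freqtable);
    foreach my $token (@tokens) {
        push @sequence, $token;           # add to sequence
        next if @sequence < $n;           # not enough items yet
        $freqtable{ join " ", @sequence }++;
        shift @sequence;                  # remove first item
    }
    return %freqtable;
}

my %trigrams = count_ngrams(3, qw(the cat meows and the cat meows));
print "$_ $trigrams{$_}\n" foreach sort keys %trigrams;
# "the cat meows" is counted twice, the other trigrams once
```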

Exploring frequency lists

- Filtering frequency lists with:
  - Regexps on the words they contain
  - Numerical conditions (such as: > “greater than”) on the frequency counts
- Store the Brown word and bigram frequency lists in two files, so that we can play with them

The general frame

while (<>) {
    $input = $_;
    $input =~ s/\n//;

    ($word1,$freq) = split " ",$input;
    # split is the inverse of join
    # and with (), perl treats $word1 and $freq as the
    # first and second elements of the array implicitly
    # created by split

    # how would you extend to bigrams, trigrams?

    if (CONDITIONS) {
        print "$input\n";
    }
}
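One possible way to handle lines with more than one word, sketched under the assumption that the frequency is always the last field on the line (as in the lists printed earlier). The sub name parse_line is hypothetical. split " " breaks the line on any run of whitespace, and pop removes and returns the last element of the array, so the same code works for single words, bigrams and trigrams.

```perl
use strict;
use warnings;

# Split a frequency-list line into its words and its count
# (hypothetical helper; assumes the count is the last field).
sub parse_line {
    my ($input) = @_;
    my @fields = split " ", $input;   # split on any run of whitespace
    my $freq = pop @fields;           # last field is the frequency
    return ($freq, @fields);          # remaining fields are the words
}

my ($freq, $word1, $word2) = parse_line("the cat 10");
print "$word1 + $word2 occurs $freq times\n";
# prints "the + cat occurs 10 times"
```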

Example: love forms

while (<>) {
    $input = $_;
    $input =~ s/\n//;

    ($word1,$freq) = split " ",$input;

    if ($word1 =~ /^lov(e[sd]?|ing)$/) {
        print "$input\n";
    }
}

Example: love forms that occur at least 50 times

while (<>) {
    $input = $_;
    $input =~ s/\n//;

    ($word1,$freq) = split " ",$input;

    # ^ and $ anchor the regexp to the whole word
    if (($word1 =~ /^lov(e[sd]?|ing)$/) &&
        ($freq > 49)) {
        print "$input\n";
    }
}

Practice time

- Extract the frequency of all words that contain at least 5 characters; do not print words that occur fewer than 5 times
- Extract the frequency of words that contain at least 6 characters and end in -ment or -ion
- Extract the bigrams that contain a form of love as first or second element
  - You will need to formulate an or condition (with ||) since the love form could be the first or the second word of the bigram
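As a hint for the last exercise, here is a rough sketch of an or condition. It is only one way to do it: the sub name has_love_form is hypothetical, the regexp is the one from the love-forms example, and the three input lines are invented samples, not counts from the Brown.

```perl
use strict;
use warnings;

# True (1) if either word of a bigram is a form of "love"
# (hypothetical helper; regexp taken from the love-forms example).
sub has_love_form {
    my ($word1, $word2) = @_;
    my $love = qr/^lov(e[sd]?|ing)$/;
    return (($word1 =~ $love) || ($word2 =~ $love)) ? 1 : 0;
}

# invented sample lines in "word1 word2 freq" format
foreach my $input ("in love 25", "lovely day 3", "loved him 8") {
    my ($word1, $word2, $freq) = split " ", $input;
    print "$input\n" if has_love_form($word1, $word2);
}
# prints "in love 25" and "loved him 8"
```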

The need for more cleaning

- When looking at keywords in context qualitatively, it made sense to preserve the original context in which keywords appeared
- However, once we start extracting frequency lists, numbers, punctuation marks and function words (such as of and the) are annoying, and clutter the resulting lists
- You can weed out numbers, punctuation marks and other non-(fully)-alphabetical stuff from your tokenized corpus with one or more regular expressions
- A rather draconian approach might involve something like:
    if ($input =~ /[^a-zA-Z'\-]/) { next; }
- (This assumes that the newline character was stripped off from $input)
- You can also clean up the frequency lists, instead of the tokenized corpus: what’s the difference?
- However, function words do not contain easy-to-spot special characters
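The draconian filter can be tried out on a few tokens with a sketch like the following. It assumes that the apostrophe in the character class is a plain ASCII ', and it wraps the test in a sub with a made-up name, is_alphabetic_token. The check for the empty string is an addition: the regexp alone lets empty lines through, because they contain no forbidden character.

```perl
use strict;
use warnings;

# True (1) if a token consists only of ASCII letters, apostrophes
# and hyphens (hypothetical helper wrapping the regexp above).
sub is_alphabetic_token {
    my ($input) = @_;
    return 0 if $input eq "";                    # extra: skip empty lines
    return ($input =~ /[^a-zA-Z'\-]/) ? 0 : 1;   # reject any other character
}

foreach my $input ("nation", "1984", "don't", ",", "co-occurrence") {
    print "$input\n" if is_alphabetic_token($input);
}
# prints, one per line: nation, don't, co-occurrence
```

Since the character class only lists a-z and A-Z, words with accented letters are thrown away as well, which is part of what makes the approach draconian.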

Function words

- Function words (prepositions, articles, auxiliary verbs...) such as the, of, is are very, very frequent, and (for most purposes) not that interesting
- They also “block” the harvesting of more interesting co-occurrences
- In state of the nation, the interesting bigram is state nation, not state of, of the, the nation
- Solution: collect a list of the most frequent words from the corpus (or another source), and write a program to filter the words on this stop list out of the corpus

Creating a stop-word list

- From the Brown single word frequency list, create a stop.txt file with all the words that occur N times or more in the Brown
  - One word per line (no frequencies!)
- By looking at the sorted frequency list, I think that 500 occurrences might be a good threshold, but you can try other Ns, if you so wish
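A rough sketch of the thresholding step, assuming the frequency list has one "word freq" pair per line. The sub name stop_words is hypothetical and the counts below are invented for illustration; they are not real Brown frequencies.

```perl
use strict;
use warnings;

# Return the words whose frequency is at least $threshold
# (hypothetical helper; expects "word freq" lines).
sub stop_words {
    my ($threshold, @lines) = @_;
    my @stop;
    foreach my $input (@lines) {
        my ($word, $freq) = split " ", $input;
        push @stop, $word if $freq >= $threshold;   # N times or more
    }
    return @stop;
}

# made-up sample counts, not real Brown frequencies
my @lines = ("the 900", "of 650", "nation 12");
print "$_\n" foreach stop_words(500, @lines);
# prints "the" and "of", one per line
```

In a real script the lines would come from the frequency list file, and the output would be redirected to stop.txt on the command line.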

Removing stop-words

$stop_file = shift;
$tok_corpus = shift;

open(STOP, $stop_file) or die "cannot open $stop_file: $!";
while (<STOP>) {
    $input = $_;
    $input =~ s/\n//;
    $stop_list{$input} = 1; # value is arbitrary, we will just
                            # check if token is in %stop_list hash
}
close STOP;

open(CORPUS, $tok_corpus) or die "cannot open $tok_corpus: $!";
while (<CORPUS>) {
    $input = $_;
    $input =~ s/\n//;
    if (!defined($stop_list{$input})) {
        print "$input\n";
    }
}
close CORPUS;

Removing stop-words
Opening, reading and closing files explicitly

- We need to follow an explicit procedure: get a file name from a command line argument, open the file, read one line at a time, close the file
- We cannot use the while (<>) { ... } shortcut, because this time we need to juggle two input files (the stop word list, and the tokenized corpus)
- We encountered the full procedure in the Basics slides, but we forgot about it
- Note the funny filehandle variables, without $ and (by convention) UPPERCASED, that Perl uses to connect to external files

Removing stop-words
Checking set membership in Perl

- We just store the set members (here, the stop words) as keys of a hash table with arbitrary values (here, 1s)
- We can then efficiently check if an item is in the set simply by checking whether we have a value corresponding to that item in the hash table: defined($set{$item})
- Note that instead of checking if a word is not in the stop list in order to print, we could have checked if the word is in the stop list, and in that case issue a next command to move to the next line in the tokenized file
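The next variant mentioned in the last point might look like the sketch below. The sub names make_set and remove_stop_words are hypothetical, tokens are passed as a list rather than read from a file, and the set is handed over as a hash reference. The sketch uses the built-in exists instead of defined: exists asks whether the key is in the table at all, whatever its value, while defined asks whether the stored value is defined. With values that are all 1, the two give the same answer.

```perl
use strict;
use warnings;

# Build a set from a list: members become keys, values are arbitrary.
sub make_set {
    my %set = map { $_ => 1 } @_;
    return %set;
}

# Return the tokens that are not in the stop set (hypothetical helper).
sub remove_stop_words {
    my ($stop_list, @tokens) = @_;   # $stop_list is a hash reference
    my @kept;
    foreach my $input (@tokens) {
        next if exists $stop_list->{$input};   # in stop list: skip it
        push @kept, $input;
    }
    return @kept;
}

my %stop_list = make_set(qw(of the));
print join(" ", remove_stop_words(\%stop_list, qw(state of the nation))), "\n";
# prints "state nation"
```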

Filtering the Brown

- Run the program to remove stop words on the tokenized Brown
  - It will need two input arguments: the name of the stop list, and the name of the tokenized Brown file
- Extract a bigram frequency list from the filtered tokenized file you created, and compare to the frequency list you obtained before stop-list filtering
- NB: it is also sometimes useful to perform keep-word filtering, i.e., to filter the corpus (or a frequency list) to preserve only words that are found in a certain list (might be relevant to your project)
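Keep-word filtering is the mirror image of stop-word filtering: the membership test stays the same, only the decision is reversed. A minimal sketch, assuming the keep list is already stored as the keys of a hash; the sub name keep_words and the sample words are made up. The built-in grep returns the items of a list for which the condition in braces is true.

```perl
use strict;
use warnings;

# Return only the tokens found in the keep list (hypothetical helper).
sub keep_words {
    my ($keep_list, @tokens) = @_;   # $keep_list is a hash reference
    return grep { exists $keep_list->{$_} } @tokens;
}

my %keep_list = (love => 1, loves => 1, loved => 1);
print "$_\n" foreach keep_words(\%keep_list, qw(she loved the cat));
# prints "loved"
```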