Accessing files with NLTK Regular Expressions

Upload: janie-whitt

Post on 16-Dec-2015


Page 1: Accessing files with NLTK Regular Expressions. Accessing additional files Python has tools for accessing files from the local directories and also for

Accessing files with NLTK Regular Expressions


Accessing additional files

• Python has tools for accessing files in local directories and also for obtaining files from the web.
  – We have seen the tools for reading any file from a local directory.
  – Now, let’s see how to obtain files from the web.


Reminder, file access
• file(filename[, mode]) (Python 2 only; open() takes the same arguments and also works in Python 3)
• filename.close()
  – File no longer available
• filename.fileno()
  – returns the file descriptor; not usually needed.
• filename.read([size])
  – read at most size bytes. If size is not specified, read to end of file.
• filename.readline([size])
  – read one line. If size is provided, read at most that many bytes. An empty string is returned if EOF is encountered immediately.
• filename.readlines([sizehint])
  – return a list of lines. If sizehint is present, read whole lines totalling approximately that many bytes, possibly rounding up to fill a buffer.
• filename.write(string)

In the method calls, filename is the internal name of the file object. In file(filename[, mode]), it is the path of the file to open.

Mode is ‘r’ for read only, ‘w’ for write only, ‘r+’ for read and write, ‘a’ for append.
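The calls above can be tried on a throwaway file. This is a minimal sketch in Python 3, where open() replaces the Python 2 file() used on the slides; the file contents and variable names are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

# open() defaults to mode 'r'; the with-block calls close() for us.
with open(path) as source:
    head = source.read(5)             # at most 5 characters
    rest_of_line = source.readline()  # remainder of the current line
    remaining = source.readlines()    # list of the lines that are left

# Mode 'a' appends to the end instead of overwriting.
with open(path, "a") as sink:
    sink.write("fourth line\n")

with open(path) as source:
    line_count = sum(1 for _ in source)

os.remove(path)
print(repr(head), repr(rest_of_line), remaining, line_count)
```

The with statement is a design choice, not a requirement: it guarantees close() runs even if reading fails.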


Python module for web access
• urllib2
  – Note: this is for Python 2.x, not Python 3
    • Python 3 splits the urllib2 materials over several modules
  – import urllib2
  – urllib2.urlopen(url[, data][, timeout])
    • Establish a link with the server identified in the url and send either a GET or POST request to retrieve the page.
    • The optional data field provides data to send to the server as part of the request. If the data field is present, the HTTP request used is POST instead of GET.
      – Use to fetch content that is behind a form, perhaps a login page
      – If used, the data must be encoded properly for including in an HTTP request. See http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1
    • timeout defines the time in seconds to be used for blocking operations such as the connection attempt. If it is not provided, the system-wide default value is used.

http://docs.python.org/library/urllib2.html
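To make the GET versus POST distinction concrete, here is a small sketch using the Python 3 modules urllib.parse and urllib.request that took over from urllib2. The URL and form fields are hypothetical, and nothing is sent over the network: the code only builds the request objects.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields and URL, used only for illustration.
form_fields = {"user": "student", "query": "regular expressions"}
url = "http://www.example.com/search"

# urlencode produces application/x-www-form-urlencoded text;
# the request body must be bytes, so encode it.
body = urlencode(form_fields).encode("ascii")

get_request = Request(url)              # no data: GET
post_request = Request(url, data=body)  # data present: POST

print(body)
print(get_request.get_method(), post_request.get_method())
```

Passing either request object to urllib.request.urlopen would actually contact the server.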


URL fetch and use
• urlopen returns a file-like object with methods:
  – Same as for files: read(), readline(), readlines(), fileno(), close()
  – New for this class:
    • info() – returns meta information about the document at the URL
    • getcode() – returns the HTTP status code sent with the response (ex: 200, 404)
    • geturl() – returns the URL of the page, which may be different from the URL requested if the server redirected the request


Short example file read

filename = raw_input('File to read: ')
source = file(filename)   # Access is read-only
for line in source:
    print line

Recall what this does:
• Open the file for read access (default when no option specified)
• Step through the file, one line at a time (“for line in source”)
• Print each line


URL fetch

import urllib2
url = raw_input("Enter the URL of the page to fetch: ")
# "http://" is seven characters long, so the slice must be url[0:7];
# a six-character slice can never contain it.
if "http://" not in url[0:7]:
    url = "http://"+url
print "Attempting to open ", url
try:
    linecount=0
    page=urllib2.urlopen(url)
    result = page.getcode()
    if result == 200:
        for line in page:
            print line
            linecount+=1
        print "Page Information \n ", page.info()
        print "Result code = ", page.getcode()
        print "Page contains ",linecount," lines."
except:
    print "\nCould not open URL: ", url
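The prefix test is easy to get wrong by one character. As a hedged alternative (Python 3, with a helper name, ensure_scheme, invented for this sketch), str.startswith avoids counting characters by hand:

```python
def ensure_scheme(url, default_scheme="http://"):
    """Return url unchanged if it already names http or https, else prefix it."""
    if url.startswith(("http://", "https://")):
        return url
    return default_scheme + url

# A slice shorter than the prefix can never contain the prefix,
# so a membership test against a six-character slice is always False.
slice_check = "http://" in "http://www.example.com"[0:6]

print(ensure_scheme("www.example.com"))
print(ensure_scheme("http://www.example.com"))
print(slice_check)
```

Accepting https:// as well is an addition of this sketch, not something the slide's script does.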


URL info
• info() provides the header information that http returns when the HEAD request is used.
• ex:

>>> print mypage.info()
Date: Mon, 12 Sep 2011 14:23:44 GMT
Server: Apache/1.3.27 (Unix)
Last-Modified: Tue, 02 Sep 2008 21:12:03 GMT
ETag: "2f0d4-215f-48bdac23"
Accept-Ranges: bytes
Content-Length: 8543
Connection: close
Content-Type: text/html


URL status and code

>>> print mypage.getcode()
200

>>> print mypage.geturl()
http://www.csc.villanova.edu/~cassel/


Messy HTML
• HTML is not always perfect.
  – Browsers may be forgiving.
  – Human and computerized html generators make mistakes.
• Tools for dealing with imperfect html include Beautiful Soup.
  http://www.crummy.com/software/BeautifulSoup/
  – Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."


The NLP pipeline


import nltk
import urllib2
fail = False
url = raw_input("Enter the URL of the page to fetch: ")
if "http://" not in url[0:7]:
    url = "http://"+url
print "Attempting to open ", url
try:
    linecount=0
    page=urllib2.urlopen(url)
except:
    print "\nCould not open URL: ", url
    fail = True

if not fail:
    for line in page:
        # clean_html() is an NLTK 2.x function; NLTK 3 no longer implements it
        # and points to BeautifulSoup's get_text() instead.
        raw = nltk.clean_html(line)
        print raw

File: /Users/lcassel/pythonwork/classexample/url-fetch-clean.py
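Where nltk.clean_html is not available, tag stripping can be approximated with the standard library. The sketch below is an assumption-laden stand-in (Python 3, html.parser), not the NLTK implementation; the class and function names are invented here, and it does not skip script or style content.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text that appears between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text outside of tags.
        self.parts.append(data)

def strip_tags(html):
    """Return html with tags removed and character references decoded."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

cleaned = strip_tags("<p>Hello <b>world</b> &amp; more</p>")
print(cleaned)
```

Feeding the whole page at once, rather than line by line as in the script above, lets the parser handle tags that span several lines.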


Tokenizing

import re, nltk, urllib2, pprint

filename = raw_input('File to read: ')
infile = file(filename)   # Access is read-only
print "File chosen:", filename

source = infile.read(1000)

tokens = nltk.wordpunct_tokenize(source)
tokens = tokens[20:200]
text = nltk.Text(tokens)

words = [w.lower() for w in text]
vocab = sorted(set(words))
print vocab

File: /Users/lcassel/pythonwork/classexamples/openfile.py


Output from previous code:

['#', "'", '***', ',', '-', '.', '15', '2006', '2011', '2554', '28', ':', ';', '[', ']', 'a', 'about', 'almost', 'and', 'anywhere', 'at', 'author', 'away', 'bickers', 'but', 'by', 'children', 'constance', 'copy', 'cost', 'crime', 'dagny', 'date', 'deeply', 'doctor', 'dostoevsky', 'ebook', 'english', 'evenings', 'father', 'few', 'five', 'fyodor', 'garnett', 'give', 'gutenberg', 'hard', 'help', 'himself', 'his', 'in', 'included', 'it', 'john', 'language', 'last', 'license', 'lived', 'march', 'may', 'mother', 'no', 'november', 'of', 'online', 'only', 'or', 'org', 'parents', 'people', 'poor', 'preface', 'produced', 'project', 'punishment', 're', 'reader', 'release', 'religious', 'restrictions', 'rooms', 's', 'so', 'son', 'spent', 'start', 'terms', 'that', 'the', 'their', 'they', 'this', 'title', 'to', 'translated', 'translator', 'two', 'under', 'understand', 'updated', 'use', 'very', 'was', 'were', 'whatsoever', 'with', 'words', 'work', 'working', 'www', 'you']

Input file was “Crime and Punishment” as a local txt file, since crawling Gutenberg does not seem to work.
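The tokenize-then-vocabulary steps can be reproduced without NLTK. This sketch (Python 3) uses the pattern \w+|[^\w\s]+, which to my understanding is the one nltk.wordpunct_tokenize is built on; treat that equivalence as an assumption. The sample sentence is made up.

```python
import re

def wordpunct_like(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

sample = "Crime and Punishment, by Fyodor Dostoevsky"
tokens = wordpunct_like(sample)

# Lower-case every token, drop duplicates, and sort by character code.
vocab = sorted(set(t.lower() for t in tokens))

print(tokens)
print(vocab)
```

Punctuation sorts ahead of the letters, which is why symbols such as '#' and ',' lead the vocabulary list in the output shown above.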


Spot check
• Fetch a web page
• Print out the lines of the page as they are, and also as cleaned by nltk.
  – Compare the two versions. What is removed and what is retained? Is all html removed? If anything is left, what is it and why do you think it is retained?
• Tokenize the text of the page
  – Print the vocabulary


Character encoding
• ASCII, Unicode
  – ASCII: American Standard Code for Information Interchange
• Everything stored in the computer must be expressed as a bit pattern.
  – For numbers, easy: convert to binary
    • For integers, direct conversion
    • For real numbers, floating point
      – somewhat arbitrary choice of how to represent where the decimal point is, how much precision for the whole number part, how much for the exponent.
  – For non-numeric characters, some arbitrary choice of what bit pattern to assign to each character


Coding considerations
• If the numeric interpretation of the bit string assigned to one character is less than that for another character, the first will sort to an earlier position.
• Thus, assign the codes in the sort order desired.
  – Clearly, A before B
  – A before or after a?
  – 8 before or after A?
  – * before or after A, 8?
• Once the choices are made and the code is constructed, sort order is determined. Any need to change will have to be dealt with in individual applications.
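ASCII's answers to these questions can be checked directly. The sketch below (Python 3) sorts a few symbols by character code, then shows one way an individual application can override that order; the word lists are illustrative.

```python
symbols = ["a", "A", "8", "*"]

# sorted() compares strings by character code.
by_code = sorted(symbols)
codes = [ord(s) for s in by_code]

# An application that wants case-insensitive order has to ask for it.
words = ["apple", "Banana", "cherry"]
default_order = sorted(words)
folded_order = sorted(words, key=str.lower)

print(by_code, codes)
print(default_order, folded_order)
```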


Representing the bit patterns
• All the encodings can be represented as numeric values. Example: the ASCII code for “K” is one byte (two groups of four bits): 0100 1011
  – Decimal 75
    • familiar, but not really convenient for representing bits.
  – Hexadecimal 4B
    • one character for each four bits.
  – Octal 113 (01 001 011)
    • one character for each three bits, counting from the right, so the leftmost group has only two bits
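The same conversions can be done with built-ins. A short Python 3 sketch:

```python
code = ord("K")               # numeric value of the character

binary = format(code, "08b")  # eight bits, zero padded
hexadecimal = format(code, "X")
octal = format(code, "o")

print(code, binary, hexadecimal, octal)
```

chr() goes the other way, turning a numeric value back into its character.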


The ASCII code


Limitations of ASCII
• Original ASCII used only 7 of the available 8 bits
  – last bit kept for parity checking
• Limited the number of characters that can be represented.
• Extended – use the 8th bit
  – There are several variations
  – See http://www.ascii-code.com/


Source: http://www.cdrummond.qc.ca/cegep/informat/Professeurs/Alain/files/ascii.htm

Extended ASCII Hex 80 to FF

Some additional language characters, such as é and à and æ and the Greek alphabet. Many more missing.


Unicode
• ASCII is just one encoding example
• ASCII, even extended, does not have enough space for all needed encodings.
• Different schemes in use present potential conflict: different codes for the same symbol, different symbols with the same code if you deal with more than one scheme.
• Enter Unicode. See unicode.org


From unicode.org

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystems, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.


Unicode
• There are three encoding forms:
  – 8, 16, 32 bits
  – UTF-8 includes the ASCII codes
  – UTF-16: all commonly used symbols fit in one 16-bit unit; other symbols are available as pairs of 16-bit units
  – UTF-32: for when size is not an issue. Every symbol is a single 32-bit unit
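The size differences between the three forms show up when the same characters are encoded each way. A minimal Python 3 sketch; the four sample characters are chosen for illustration (A, é, the euro sign, and an emoji outside the 16-bit range):

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# Map each character to its encoded length in bytes under each form.
byte_counts = {
    ch: [len(ch.encode(enc)) for enc in encodings]
    for ch in samples
}

for ch, counts in byte_counts.items():
    print(hex(ord(ch)), counts)
```

The "-le" variants are used so that no byte-order mark is added to the counts.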


Using unicode


Regular Expressions
• Processing text often involves selecting for specific characteristics
• Regular expressions – powerful tool for describing the characteristics of interest
• Access in python: import re
  – Raw string notation: precede a string with r
  – r’\n’ means backslash then n, not new line
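The effect of the r prefix is easy to see by comparing lengths. A small Python 3 sketch:

```python
import re

plain = "\n"   # one character: a newline
raw = r"\n"    # two characters: a backslash and the letter n

# Without the r prefix, each backslash meant for the regex engine
# has to be doubled.
same_pattern = ("\\d+" == r"\d+")

digits = re.findall(r"\d+", "room 101, floor 3")

print(len(plain), len(raw), same_pattern, digits)
```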


Regular Expression special characters – pt 1

• ‘^’ (caret) matches the start of the string
• ‘$’ matches the end of the string, or just before the newline at the end of a string
• ‘.’ matches any single character (except a newline, by default)
• ‘*’ matches 0 or more repetitions of the preceding re. 0*1 matches any number of 0s followed by 1: 1, 01, 001, 0001, etc.
• ‘+’ matches 1 or more repetitions. 0+1 matches 01, 001, 0001, etc., but not 1
• ‘?’ matches 0 or 1 repetitions. 0?1 matches 1 and 01 only
• {m,n} matches between m and n repetitions. With no n, as in {m}, it matches exactly m repetitions.
  – 0{2,4}1 matches 001, 0001, 00001
  – 0{3}1 matches only 0001
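The repetition examples above can be verified mechanically. This sketch (Python 3) uses re.fullmatch so that each pattern must account for the whole candidate string; the helper name matches is invented here.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["1", "01", "001", "0001", "00001"]

star = matches(r"0*1", candidates)         # zero or more 0s
plus = matches(r"0+1", candidates)         # one or more 0s
optional = matches(r"0?1", candidates)     # at most one 0
bounded = matches(r"0{2,4}1", candidates)  # two to four 0s
exact = matches(r"0{3}1", candidates)      # exactly three 0s

print(star, plus, optional, bounded, exact, sep="\n")
```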


Regular Expression special characters – pt 2

• {m,n}? matches as few repetitions as possible (the non-greedy form).
  – On the string 0000, 0{2,4} matches all four 0s, while 0{2,4}? matches only the first two.
• \ escapes a special character, so you can search for * or ? etc.
• [ ] used to indicate a set of characters.
  – [abc] will match a or b or c
  – range: [0-9A-Za-z] will match any digit or letter, upper or lower case
  – Special characters lose their special meaning inside a set: [*?] matches a literal * or ?. Backslash still escapes inside a set, so [\*] matches only *.
  – ^ as the first character negates the set: [^0-9] will match anything except a digit
• | means “or”
  – A|B matches either A or B, where A and B can be single characters or longer expressions.
  – Options are tested left to right and the search quits when a match is found. This gives priority to the option listed first.
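A few of these rules are easier to trust after seeing them run. The following Python 3 sketch checks the non-greedy qualifier, backslash inside a set, set negation, and left-to-right alternation; the test strings are made up.

```python
import re

zeros = "0000"
greedy = re.search(r"0{2,4}", zeros).group()  # takes as many as allowed
lazy = re.search(r"0{2,4}?", zeros).group()   # takes as few as allowed

# Inside a set, * is literal; the backslash is an escape, not a member.
star_only = re.findall(r"[\*]", r"a\*b")

# ^ at the start of a set negates it.
non_digits = re.findall(r"[^0-9]", "a1b22c")

# Alternatives are tried left to right, so the first that fits wins.
first_wins = re.search(r"ab|abc", "abc").group()

print(greedy, lazy, star_only, non_digits, first_wins)
```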


Python re

import nltk
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
print [w for w in wordlist if re.search('ed$', w)]

matches all words in the list that end in ed

Take it step by step:
(Get all the English words in the wordlist)

wordlist = [w for w in nltk.corpus.words.words('en')]
print wordlist[0:200]

['A', 'a', 'aa', 'aal', 'aalii', 'aam', 'Aani', 'aardvark', 'aardwolf', 'Aaron', 'Aaronic', 'Aaronical', 'Aaronite', 'Aaronitic', 'Aaru', 'Ab', 'aba', 'Ababdeh', 'Ababua', 'abac', 'abaca', 'abacate', 'abacay', 'abacinate', 'abacination', 'abaciscus', 'abacist', 'aback', 'abactinal', 'abactinally', 'abaction', 'abactor', 'abaculus', 'abacus', 'Abadite', 'abaff', 'abaft', 'abaisance',


Restrict to lower case words:

from __future__ import division
import nltk, re, pprint

wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
print wordlist[0:200]

['a', 'aa', 'aal', 'aalii', 'aam', 'aardvark', 'aardwolf', 'aba', 'abac', 'abaca', 'abacate', 'abacay', 'abacinate', 'abacination', 'abaciscus', 'abacist', 'aback', 'abactinal', 'abactinally', 'abaction', 'abactor', …

Then keep only those of the first 200 lower case words that end in ed:

from __future__ import division
import nltk, re, pprint

wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
wordlist = wordlist[0:200]
print [w for w in wordlist if re.search('ed$', w)]

['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed']


Wildcard
• . matches any single character

Crossword match example:

[w for w in wordlist if re.search('^..j..t..$', w)]

  ^     word beginning
  .     single character
  j, t  specific letter
  $     word end

Crossword match result:

['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', 'objectee', 'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted', 'unjustly']


Spot check
• Your Turn: The caret symbol ^ matches the start of a string, just like the $ matches the end. What results do we get with the above example if we leave out both of these, and search for «..j..t..»?
  – Think about it first. What do you expect?
  – Then run it.

Crossword match result:

['abjectedness', 'abjection', 'abjective', 'abjectly', 'abjectness', 'adjection', 'adjectional', 'adjectival', 'adjectivally', 'adjective', 'adjectively', 'adjectivism', 'adjectivitis', 'adjustable', 'adjustably', 'adjustage', 'adjustation', 'adjuster', 'adjustive', 'adjustment', 'antejentacular', 'antiprojectivity', 'bijouterie', 'coadjustment', 'cojusticiar', 'conjective', 'conjecturable', 'conjecturably', 'conjectural', 'conjecturalist', 'conjecturality', 'conjecturally', 'conjecture', 'conjecturer', 'coprojector', 'counterobjection', 'dejected', 'dejectedly', 'dejectedness', 'dejectile', 'dejection', …

There will always be at least two letters before j, exactly two letters between j and t, and at least two letters after t. Nothing else is specified.
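The difference the anchors make can be reproduced on a handful of words without loading the NLTK word list. A Python 3 sketch with a small hypothetical word list:

```python
import re

# Hypothetical stand-in for the much larger NLTK word list.
wordlist = ["abjectly", "majestic", "adjective", "object", "unjustly"]

# Anchored: the pattern must cover the whole eight-letter word.
anchored = [w for w in wordlist if re.search(r"^..j..t..$", w)]

# Unanchored: the pattern may sit anywhere inside a longer word.
unanchored = [w for w in wordlist if re.search(r"..j..t..", w)]

print(anchored)
print(unanchored)
```

"object" fails both tests because nothing follows its t, while "adjective" only passes once the anchors are removed.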


? as optional character
• ? indicates 0 or 1 occurrences
  – ^e-?mail$
    • matches either email or e-mail
  – ^[Ee]-?mail$
    • allows either upper or lower case E
    • Note that [^Ee] matches anything that is not E or e
      – the negation is inside the [ ]


Texting example

• First letter from ghi, second from mno, then jlk, then def

[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]

['gold', 'golf', 'hold', 'hole']

• Take away the ^ and $ (excerpt from the much longer result):

'tinkerlike', 'tinkerly', 'tinkershire', 'tinkershue', 'tinkerwise', 'tinlet', 'titleholder', 'toolholder', 'toolholding', 'touchhole', 'trainless', 'traphole', 'trinkerman', 'trinket', 'trinketer', 'trinketry', 'trinkety', 'triole', 'trioleate', 'triolefin', 'trioleic', …


Python use of re
• re.search(pattern, string[, flags])
  – scan through string looking for pattern. Return None if not found.
• re.match(pattern, string)
  – if zero or more characters at the beginning of string match the re pattern, return a corresponding MatchObject instance. Return None if string does not match the pattern.
• re.split(pattern, string)
  – Split string by occurrences of pattern.

from: http://docs.python.org/library/re.html (some options not included)
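The practical difference between search and match is where the pattern is allowed to start. A brief Python 3 sketch on a made-up string:

```python
import re

text = "Words, words, words."

found = re.search(r"words", text)    # first match anywhere in the string
at_start = re.match(r"words", text)  # None: the string starts with "Words"
at_start_ci = re.match(r"words", text, re.IGNORECASE)

pieces = re.split(r",\s*", text)     # split on a comma plus any spaces

print(found.start(), at_start, at_start_ci.group(), pieces)
```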


Some shortened forms

>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']

>>> re.split('(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']

>>> re.split('\W+', 'Words, words, words.', 1)
['Words', 'words, words.']

>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

\w = word class: equivalent to [a-zA-Z0-9_]

\W = complement of \w: all characters other than letters, digits, and the underscore

“If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.” Thus, in the second example the split is still on the runs of non-word characters, but those characters are also included in the resulting list.

Ref: http://docs.python.org/library/re.html


• re.findall(pattern, string[, flags])
  – return all non-overlapping matches of pattern in string, as a list of strings. String scanned left-to-right. Matches returned in order found.


Applications of re
• Extract word pieces

>>> word = 'supercalifragilisticexpialidocious'
>>> re.findall(r'[aeiou]', word)
['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
>>> len(re.findall(r'[aeiou]', word))
16

• another

>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
...                    for vs in re.findall(r'[aeiou]{2,}', word))
>>> fd.items()

vu50390:ch3 lcassel$ python re2.py
[('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253), ('ee', 217), ('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95), ('ei', 86), ('oi', 65), ('oa', 59), ('eo', 39), ('iou', 27), ('eu', 18), ('oe', 15), ('iu', 14), ('ae', 11), ('eau', 10), ('uo', 8), ('ao', 6), ('oui', 6), ('eou', 5), ('uou', 5), ('uee', 4), ('aa', 3), ('ieu', 3), ('uie', 3), ('eei', 2), ('aia', 1), ('aii', 1), ('aiia', 1), ('eea', 1), ('iai', 1), ('iao', 1), ('ioa', 1), ('oei', 1), ('ooi', 1), ('ueui', 1), ('uu', 1)]
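The counting idea does not depend on NLTK. As a rough sketch in Python 3, collections.Counter can stand in for nltk.FreqDist; the four-word list below is a hypothetical sample, not the treebank vocabulary.

```python
import re
from collections import Counter

word = "supercalifragilisticexpialidocious"
vowel_count = len(re.findall(r"[aeiou]", word))

# Hypothetical sample in place of the treebank word list.
words = ["beautiful", "queue", "cooperation", "radio"]

# Count every run of two or more vowels across the sample.
vowel_runs = Counter(
    run for w in words for run in re.findall(r"[aeiou]{2,}", w)
)

print(vowel_count)
print(vowel_runs.most_common())
```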


Spot check

Your Turn: In the W3C Date Time Format, dates are represented like this: 2009-12-31. Replace the ? in the following Python code with a regular expression, in order to convert the string '2009-12-31' to a list of integers [2009, 12, 31]:

[int(n) for n in re.findall(?, '2009-12-31')]


Processing some text

Noting redundancy in English and eliminating internal word vowels:

>>> regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
>>> def compress(word):
...     pieces = re.findall(regexp, word)
...     return ''.join(pieces)
...
>>> english_udhr = nltk.corpus.udhr.words('English-Latin1')
>>> print nltk.tokenwrap(compress(w) for w in english_udhr[:75])

Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and
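The compress function needs only the re module, so it can be tried on a short phrase without the UDHR corpus. A Python 3 sketch, with the sample phrase typed in by hand:

```python
import re

# Keep leading vowels, trailing vowels, and every non-vowel;
# vowels in the middle of a word match none of the three alternatives.
regexp = r"^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]"

def compress(word):
    pieces = re.findall(regexp, word)
    return "".join(pieces)

phrase = "Universal Declaration of Human Rights"
compressed = " ".join(compress(w) for w in phrase.split())

print(compressed)
```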


Tabulating combinations

>>> rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
>>> cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cfd = nltk.ConditionalFreqDist(cvs)
>>> cfd.tabulate()

     a    e    i    o    u
k  418  148   94  420  173
p   83   31  105   34   51
r  187   63   84   89   79
s    0    0  100    2    1
t   47    8    0  148   37
v   93   27  105   48   49

Rotokas is an East Papuan language


Inspecting the words behind the numbers

>>> cv_word_pairs = [(cv, w) for w in rotokas_words
...                          for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cv_index = nltk.Index(cv_word_pairs)
>>> cv_index['su']
['kasuari']
>>> cv_index['po']
['kaapo', 'kaapopato', 'kaipori', 'kaiporipie', 'kaiporivira', 'kapo', 'kapoa', 'kapokao', 'kapokapo', 'kapokapo', 'kapokapoa', 'kapokapoa', 'kapokapora', 'kapokapora', 'kapokaporo', 'kapokaporo', 'kapokari', 'kapokarito', 'kapokoa', 'kapoo', 'kapooto', 'kapoovira', 'kapopaa', 'kaporo', 'kaporo', 'kaporopa', 'kaporoto', 'kapoto', 'karokaropo', 'karopo', 'kepo', 'kepoi', 'keposi', 'kepoto']
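Both the counts and the index can be approximated with standard-library containers. In this Python 3 sketch, Counter and defaultdict are stand-ins for nltk.ConditionalFreqDist and nltk.Index, and the three-word list is a tiny hypothetical sample of the Rotokas data.

```python
import re
from collections import Counter, defaultdict

words = ["kasuari", "kapo", "kepo"]  # tiny stand-in for the Rotokas word list

# Every consonant-vowel pair, together with the word it came from.
pairs = [(cv, w) for w in words for cv in re.findall(r"[ptksvr][aeiou]", w)]

# How often each pair occurs.
counts = Counter(cv for cv, _ in pairs)

# Which words contain each pair.
index = defaultdict(list)
for cv, w in pairs:
    index[cv].append(w)

print(counts)
print(index["su"], index["po"])
```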


Stemming
• Simple approach:

>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word


Building a stemmer
• Build a disjunction of all suffixes

re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

• Take a look. What do we have here?
  – r: raw string. Interpret everything just as what you see.
  – ^ from the beginning
  – . match anything
  – * repeat the match anything 0 or more times
  – (ing|ly|ed|ious|ies|ive|es|s|ment): look for one of these
  – $ at the end of the string
  – ‘processing’: the string
  – result = ['ing']


To get the whole word
• Need to add ?: at the start of the group

>>> re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['processing']


Split the word into stem and suffix

• Some subtleties involved

>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
[('process', 'ing')]

Looks ok, but

>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('processe', 's')]

The * is a greedy operator. It takes as much as it can get.

>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('process', 'es')]

*? is the non-greedy version.

>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')
[('language', '')]

? makes the suffix list optional, so the pattern still matches when no suffix is present
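The four variants can be lined up side by side to see what each change buys. This Python 3 sketch only re-runs the expressions shown above; the SUFFIXES constant and the variable names are introduced here for readability.

```python
import re

SUFFIXES = r"ing|ly|ed|ious|ies|ive|es|s|ment"

# One capturing group: findall returns only the suffix.
suffix_only = re.findall(r"^.*(" + SUFFIXES + r")$", "processing")

# (?:...) groups without capturing: findall returns the whole match.
whole_word = re.findall(r"^.*(?:" + SUFFIXES + r")$", "processing")

# Two groups, greedy stem: the stem swallows the 'e' of 'es'.
greedy = re.findall(r"^(.*)(" + SUFFIXES + r")$", "processes")

# Non-greedy stem: the longest suffix that fits is split off.
lazy = re.findall(r"^(.*?)(" + SUFFIXES + r")$", "processes")

# Optional suffix: words with no listed suffix still match.
no_suffix = re.findall(r"^(.*?)(" + SUFFIXES + r")?$", "language")

print(suffix_only, whole_word, greedy, lazy, no_suffix, sep="\n")
```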


A stemming function

>>> def stem(word):
...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]
...     return stem
...
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = nltk.word_tokenize(raw)
>>> [stem(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond',
'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
'.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from',
'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']

Note some strange “words” returned as the stem: basi from basis, and deriv, execut, etc.


The Porter Stemmer
• Official home: http://tartarus.org/martin/PorterStemmer/index-old.html
• The python version: http://tartarus.org/martin/PorterStemmer/python.txt


>>> from nltk.corpus import gutenberg, nps_chat
>>> moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
>>> moby.findall(r"<a> (<.*>) <man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave

( ) means only that part is returned

>>> chat = nltk.Text(nps_chat.words())
>>> chat.findall(r"<.*> <.*> <bro>")
you rule bro; telling you bro; u twizted bro

>>> chat.findall(r"<l.*>{3,}")
lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la


nltk.re_show

import nltk, re
sent = "Colorless green ideas sleep furiously"
nltk.re_show('l', sent)
nltk.re_show('gree', sent)

Output:

Co{l}or{l}ess green ideas s{l}eep furious{l}y
Colorless {gree}n ideas sleep furiously
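The brace marking can be imitated with re.sub when NLTK is not at hand. This Python 3 sketch is an approximation written for illustration, not NLTK's own code, and the function name mark_matches is invented here.

```python
import re

def mark_matches(pattern, text, left="{", right="}"):
    """Return text with every match of pattern wrapped in braces."""
    return re.sub(pattern, lambda m: left + m.group() + right, text)

sent = "Colorless green ideas sleep furiously"
marked_l = mark_matches("l", sent)
marked_gree = mark_matches("gree", sent)

print(marked_l)
print(marked_gree)
```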


Word patterns

>>> from nltk.corpus import brown
>>> hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
>>> hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")
speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals


Spot Check
• How would you find all instances of the pattern as x as y?
  • example: as easy as pie
• Can you handle this: as pretty as a picture


More on Stemming>>> porter = nltk.PorterStemmer()>>> lancaster = nltk.LancasterStemmer()>>> [porter.stem(t) for t in tokens]['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond','distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern','.', 'Suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from','the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']

>>> [lancaster.stem(t) for t in tokens]['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut','sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem','execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not','from', 'som', 'farc', 'aqu', 'ceremony', '.']

>>> wnl = nltk.WordNetLemmatizer()>>> [wnl.lemmatize(t) for t in tokens]['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond','distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of','government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a','mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical','aquatic', 'ceremony', '.']

Only keeps stems if in dictionary
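The difference between stripping suffixes blindly and keeping a stem only when it is a known word can be sketched with a regular expression. Everything here is illustrative: naive_stem, dictionary_stem, the suffix list, and the tiny vocabulary are assumptions, and neither function reproduces the Porter, Lancaster, or WordNet algorithms.

```python
import re

# Non-greedy stem followed by an optional suffix from a small, assumed list.
SUFFIX_RE = re.compile(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$')

def naive_stem(word):
    """Strip one suffix from the assumed list, with no further checks."""
    stem, _suffix = SUFFIX_RE.match(word).groups()
    return stem

def dictionary_stem(word, vocabulary):
    """Keep the stripped form only if it appears in `vocabulary`."""
    stem = naive_stem(word)
    return stem if stem in vocabulary else word

vocabulary = {'pond', 'sword'}   # hypothetical stand-in for a real dictionary
words = ['ponds', 'swords', 'basis', 'women']

print([naive_stem(w) for w in words])
print([dictionary_stem(w, vocabulary) for w in words])
```

The blind version turns basis into basi, while the dictionary check leaves it alone because basi is not a known word.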

Page 54

Tokenizing
• We have done split, but it was not very complete.
• Built in re abbreviation for any kind of white space: \s

>>> re.split(r'\s+', raw)
['Dennis:', 'Listen,', 'strange', 'women', 'lying', 'in', 'ponds',
'distributing', 'swords', 'is', 'no', 'basis', 'for', 'a', 'system', 'of',
'government.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a',
'mandate', 'from', 'the', 'masses,', 'not', 'from', 'some', 'farcical',
'aquatic', 'ceremony.']
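A short comparison shows why splitting on a single space is incomplete. The sample string is hypothetical, chosen to contain a tab, a newline, and a double space.

```python
import re

# Hypothetical sample with a tab, a newline, and a double space.
raw = "Dennis: Listen,\tstrange women\nlying in  ponds"

space_split = raw.split(' ')          # splits on single spaces only
regex_split = re.split(r'\s+', raw)   # splits on any run of whitespace

print(space_split)
print(regex_split)
```

The first result still has the tab and newline inside tokens and contains an empty string from the double space; the second does not.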

Page 55

Tokenizing
• Split on anything other than a word character (A-Za-z0-9_)

>>> re.split(r'\W+', raw)
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in',
'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without',
'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered',
'']

Note: I'M became I M

re.findall(r'\w+', raw) matches the words themselves, instead of splitting on the separators.

«\w+|\S\w*» will first try to match any sequence of word characters. If no match is found, it will try to match any non-whitespace character (\S is the complement of \s) followed by further word characters. This means that punctuation is grouped with any following letters (e.g. 's) but that sequences of two or more punctuation characters are separated.
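The three approaches can be compared side by side on one short string. The sample sentence is hypothetical, chosen because it contains a contraction and surrounding quotes.

```python
import re

# Hypothetical sample sentence containing a contraction.
raw = "'I won't have any pepper in my kitchen AT ALL.'"

split_tokens = re.split(r'\W+', raw)          # split on runs of non-word characters
word_tokens = re.findall(r'\w+', raw)         # match the words instead
mixed_tokens = re.findall(r'\w+|\S\w*', raw)  # also keep punctuation

print(split_tokens)
print(word_tokens)
print(mixed_tokens)
```

The split version has empty strings at both ends because the text starts and ends with punctuation; the findall versions do not, and only the last one keeps the apostrophes and the full stop.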

Page 56

Getting there

>>> re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t",
'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does',
'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that',
'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']

Word-internal marks are now kept, as in 'M and 't.
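A possible further refinement, offered as an assumption about a natural next step and not as part of the slide, is to let a word continue across an internal hyphen or apostrophe so that won't and hot-tempered stay whole. The pattern and sample text below are a sketch only.

```python
import re

# Hypothetical sample with a contraction, a hyphenated word, and an ellipsis.
raw = "'I won't have any pepper: it makes people hot-tempered,'..."

# Alternatives, tried left to right:
#   \w+(?:[-']\w+)*  a word, optionally continued by -word or 'word
#   '                a lone quote mark
#   [-.(]+           runs of hyphens, dots, or opening parentheses
#   \S\w*            any other non-space character plus following word characters
pattern = r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*"

tokens = re.findall(pattern, raw)
print(tokens)
```

The group is written as (?:...) so that re.findall returns whole tokens instead of the group contents.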

Page 57

Regular expression symbols
• Summary

Symbol  Function
\b      Word boundary (zero width)
\d      Any decimal digit (equivalent to [0-9])
\D      Any non-digit character (equivalent to [^0-9])
\s      Any whitespace character (equivalent to [ \t\n\r\f\v])
\S      Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w      Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W      Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t      The tab character
\n      The newline character
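The symbols in the summary can be tried out directly with the re module. The sample strings are hypothetical. One caveat for Python 3: on ordinary (Unicode) strings, \w and \d also match non-ASCII letters and digits, so the bracketed equivalents hold exactly only when the re.ASCII flag is set.

```python
import re

sample = "Room 101,\tfloor_2\n"   # hypothetical sample text

digits = re.findall(r'\d', sample)       # each decimal digit
spaces = re.findall(r'\s', sample)       # each whitespace character
words = re.findall(r'\w+', sample)       # runs of word characters
non_words = re.findall(r'\W', sample)    # each non-word character

# \b is zero width, so report positions instead of matched text.
boundaries = [m.start() for m in re.finditer(r'\b', 'hot dog')]

# Unicode-aware by default; ASCII-only when the flag is given.
unicode_words = re.findall(r'\w+', 'naïve')
ascii_words = re.findall(r'\w+', 'naïve', flags=re.ASCII)

print(digits, spaces, words, non_words, boundaries, unicode_words, ascii_words)
```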

Page 58

Tokenizer in Python

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)          # set flag to allow verbose regexps
...     (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
...   | \w+(?:-\w+)*            # words with optional internal hyphens
...   | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.                  # ellipsis
...   | [][.,;"'?():-_`]        # these are separate tokens
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
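One detail worth checking when a tokenizer pattern like this is run through re.findall: plain parentheses are capturing groups, and findall then returns the group contents instead of the whole match. Whether nltk.regexp_tokenize is affected depends on the NLTK version, so treat that as an assumption to verify locally. The sketch below uses only the standard re module and a shortened, hypothetical text.

```python
import re

text = 'poster-print costs'   # hypothetical, shortened sample

# Capturing group: findall returns what the group last matched.
with_capture = re.findall(r'\w+(-\w+)*', text)

# Non-capturing group: findall returns the whole match.
without_capture = re.findall(r'\w+(?:-\w+)*', text)

print(with_capture)
print(without_capture)
```

Writing the groups as (?:...) keeps the grouping needed for repetition while leaving the returned tokens intact.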

Page 59

Spot Check

☼ Describe the class of strings matched by the following regular expressions.

– [a-zA-Z]+
– [A-Z][a-z]*
– p[aeiou]{,2}t
– \d+(\.\d+)?
– ([^aeiou][aeiou][^aeiou])*
– \w+|[^\w\s]+

Test your answers using nltk.re_show().

Page 60

Exercises

For next week:

◑ Read in some text from a corpus, tokenize it, and print the list of all wh-word types that occur. (wh-words in English are used in questions, relative clauses and exclamations: who, which, what, and so on.) Print them in order. Are any words duplicated in this list, because of the presence of case distinctions or punctuation?

For two weeks from now:

★ Obtain raw texts from two or more genres and compute their respective reading difficulty scores as in the earlier exercise on reading difficulty. E.g. compare ABC Rural News and ABC Science News (nltk.corpus.abc). Use Punkt to perform sentence segmentation.