A Smattering of Natural Language Processing in Python



DESCRIPTION

For the original iPython Notebook, please see: https://github.com/charlieg/A-Smattering-of-NLP-in-Python

Back in the dark ages of data science, each group or individual working in Natural Language Processing (NLP) generally maintained an assortment of homebrew utility programs designed to handle many of the common tasks involved with NLP. Despite everyone's best intentions, most of this code was lousy, brittle, and poorly documented -- not a good foundation upon which to build your masterpiece. Fortunately, over the past decade, mainstream open source software libraries like the Natural Language Toolkit for Python (NLTK) have emerged to offer a collection of high-quality reusable NLP functionality. These libraries allow researchers and developers to spend more time focusing on the application logic of the task at hand, and less on debugging an abandoned method for sentence segmentation or reimplementing noun phrase chunking. This talk covers a handful of the NLP building blocks provided by NLTK, including extracting text from HTML, stemming & lemmatization, frequency analysis, and named entity recognition. Several of these components are then assembled to build a very basic document summarization program. Source code for all of the examples in this presentation is available on GitHub: https://github.com/charlieg/A-Smattering-of-NLP-in-Python

TRANSCRIPT


A Smattering of NLP in Python by Charlie Greenbacker [@greenbacker](https://twitter.com/greenbacker)


Part of a joint meetup on Natural Language Processing (http://www.meetup.com/stats-prog-dc/events/177772322/) - 9 July 2014

Statistical Programming DC (http://www.meetup.com/stats-prog-dc/)

Data Wranglers DC (http://www.meetup.com/Data-Wranglers-DC/)

DC Natural Language Processing (http://dcnlp.org/)

Introduction

Back in the dark ages of data science, each group or individual working in Natural Language Processing (NLP) generally maintained an assortment of homebrew utility programs designed to handle many of the common tasks involved with NLP. Despite everyone's best intentions, most of this code was lousy, brittle, and poorly documented -- not a good foundation upon which to build your masterpiece. Fortunately, over the past decade, mainstream open source software libraries like the Natural Language Toolkit for Python (NLTK) (http://www.nltk.org/) have emerged to offer a collection of high-quality reusable NLP functionality. These libraries allow researchers and developers to spend more time focusing on the application logic of the task at hand, and less on debugging an abandoned method for sentence segmentation or reimplementing noun phrase chunking.

This presentation will cover a handful of the NLP building blocks provided by NLTK (and a few additional libraries), including extracting text from HTML, stemming & lemmatization, frequency analysis, and named entity recognition. Several of these components will then be assembled to build a very basic document summarization program.


Initial Setup

Obviously, you'll need Python installed on your system to run the code examples used in this presentation. We enthusiastically recommend using Anaconda (https://store.continuum.io/cshop/anaconda/), a Python distribution provided by Continuum Analytics (http://www.continuum.io/). Anaconda is free to use, it includes nearly 200 of the most commonly used Python packages for data analysis (http://docs.continuum.io/anaconda/pkg-docs.html) (including NLTK), and it works on Mac, Linux, and yes, even Windows.


We'll make use of the following Python packages in the example code:

nltk (http://www.nltk.org/install.html) (comes with Anaconda)
readability-lxml (https://github.com/buriy/python-readability)
BeautifulSoup4 (http://www.crummy.com/software/BeautifulSoup/) (comes with Anaconda)
scikit-learn (http://scikit-learn.org/stable/install.html) (comes with Anaconda)


Please note that the readability package is not distributed with Anaconda, so you'll need to download & install it separately using something like easy_install readability-lxml or pip install readability-lxml.

If you don't use Anaconda, you'll also need to download & install the other packages separately using similar methods. Refer to the homepage of each package for instructions.

You'll want to run nltk.download() one time to get all of the NLTK packages, corpora, etc. (see below). Select the "all" option. Depending on your network speed, this could take a while, but you'll only need to do it once.

Java libraries (optional)

One of the examples will use NLTK's interface to the Stanford Named Entity Recognizer (http://www-nlp.stanford.edu/software/CRF-NER.shtml#Download), which is distributed as a Java library. In particular, you'll want the following files handy in order to run this particular example:

stanford-ner.jar
english.all.3class.distsim.crf.ser.gz


Getting Started

The first thing we'll need to do is import nltk:

In []: import nltk

Downloading NLTK resources


The first time you run anything using NLTK, you'll want to go ahead and download the additional resources that aren't distributed directly with the NLTK package. Upon running the nltk.download() command below, the NLTK Downloader window will pop up. In the Collections tab, select "all" and click on Download. As mentioned earlier, this may take several minutes depending on your network connection speed, but you'll only ever need to run it a single time.

In []: nltk.download()
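
If you'd rather not grab everything, nltk.download() also accepts individual resource names. As a rough sketch -- assuming the standard resource identifiers, which can shift between NLTK releases -- the examples in this presentation lean on roughly the following subset:

import nltk

# download just the pieces used below instead of the full "all" collection
for resource in ['punkt', 'stopwords', 'wordnet', 'words',
                 'maxent_ne_chunker', 'maxent_treebank_pos_tagger', 'reuters']:
    nltk.download(resource)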

Extracting text from HTML

Now the fun begins. We'll start with a pretty basic and commonly-faced task: extracting text content from an HTML page. Python's urllib package gives us the tools we need to fetch a web page from a given URL, but we see that the output is full of HTML markup that we don't want to deal with.

(N.B.: Throughout the examples in this presentation, we'll use Python slicing (e.g., [:500] below) to only display a small portion of a string or list. Otherwise, if we displayed the entire item, sometimes it would take up the entire screen.)

In []: from urllib import urlopen

url = "http://venturebeat.com/2014/07/04/facebooks-little-social-experiment-got-you-bummed-out-get-over-it/"html = urlopen(url).read()html[:500]

Stripping out HTML formatting

Fortunately, NLTK provides a method called clean_html() to get the raw text out of an HTML-formatted string. It's still not perfect, though, since the output will contain page navigation and all kinds of other junk that we don't want, especially if our goal is to focus on the body content from a news article, for example.

In []: text = nltk.clean_html(html)
text[:500]
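
(If your copy of NLTK no longer includes clean_html() -- later releases dropped it in favor of dedicated HTML parsers -- a rough stand-in using BeautifulSoup's get_text() is sketched below. It approximates, rather than reproduces, the original method's behavior.)

from bs4 import BeautifulSoup

# approximate clean_html(): parse the markup and keep only the visible text
text = BeautifulSoup(html).get_text()
text[:500]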

Identifying the Main Content


If we just want the body content from the article, we'll need to use two additional packages. The first is a Python port of a Ruby port of a Javascript tool called Readability, which pulls the main body content out of an HTML document and subsequently "cleans it up." The second package, BeautifulSoup, is a Python library for pulling data out of HTML and XML files. It parses HTML content into an easily-navigable nested data structure. Using Readability and BeautifulSoup together, we can quickly get exactly the text we're looking for out of the HTML, (mostly) free of page navigation, comments, ads, etc. Now we're ready to start analyzing this text content.

In []: from readability.readability import Document
from bs4 import BeautifulSoup

readable_article = Document(html).summary()
readable_title = Document(html).title()
soup = BeautifulSoup(readable_article)
print '*** TITLE *** \n\"' + readable_title + '\"\n'
print '*** CONTENT *** \n\"' + soup.text[:500] + '[...]\"'

Frequency Analysis

Here's a little secret: much of NLP (and data science, for that matter) boils down to counting things. If you've got a bunch of data that needs analyzin' but you don't know where to start, counting things is usually a good place to begin. Sure, you'll need to figure out exactly what you want to count, how to count it, and what to do with the counts, but if you're lost and don't know what to do, just start counting.

Perhaps we'd like to begin (as is often the case in NLP) by examining the words that appear in our document. To do that, we'll first need to tokenize the text string into discrete words. Since we're working with English, this isn't so bad, but if we were working with a non-whitespace-delimited language like Chinese, Japanese, or Korean, it would be much more difficult.

In the code snippet below, we're using two of NLTK's tokenize methods to first chop up the article text into sentences, and then each sentence into individual words. (Technically, we didn't need to use sent_tokenize(), but if we only used word_tokenize() alone, we'd see a bunch of extraneous sentence-final punctuation in our output.) By printing each token alphabetically, along with a count of the number of times it appeared in the text, we can see the results of the tokenization. Notice that the output contains some punctuation & numbers, hasn't been lowercased, and counts BuzzFeed and BuzzFeed's separately. We'll tackle some of those issues next.

In []: tokens = [word for sent in nltk.sent_tokenize(soup.text) for word in nltk.word_tokenize(sent)]

for token in sorted(set(tokens))[:30]:
    print token + ' [' + str(tokens.count(token)) + ']'

Word Stemming

Stemming (http://en.wikipedia.org/wiki/Stemming) is the process of reducing a word to its base/stem/root form. Most stemmers are pretty basic and just chop off standard affixes indicating things like tense (e.g., "-ed") and possessive forms (e.g., "-'s"). Here, we'll use the Snowball stemmer for English, which comes with NLTK.

Once our tokens are stemmed, we can rest easy knowing that BuzzFeed and BuzzFeed's are now being counted together as... buzzfe? Don't worry: although this may look weird, it's pretty standard behavior for stemmers and won't affect our analysis (much). We also (probably) won't show the stemmed words to users -- we'll normally just use them for internal analysis or indexing purposes.

In []: from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
stemmed_tokens = [stemmer.stem(t) for t in tokens]

for token in sorted(set(stemmed_tokens))[50:75]:
    print token + ' [' + str(stemmed_tokens.count(token)) + ']'

Lemmatization

Although the stemmer very helpfully chopped off pesky affixes (and made everything lowercase to boot), there are some word forms that give stemmers indigestion, especially irregular words. While the process of stemming typically involves rule-based methods of stripping affixes (making them small & fast), lemmatization involves dictionary-based methods to derive the canonical forms (i.e., lemmas) of words. For example, run, runs, ran, and running all correspond to the lemma run. However, lemmatizers are generally big, slow, and brittle due to the nature of the dictionary-based methods, so you'll only want to use them when necessary.


The example below compares the output of the Snowball stemmer with the WordNet lemmatizer (also distributed with NLTK). Notice that the lemmatizer correctly converts women into woman, while the stemmer turns lying into lie. Additionally, both replace eyes with eye, but neither of them properly transforms told into tell.

In []: lemmatizer = nltk.WordNetLemmatizer()
temp_sent = "Several women told me I have lying eyes."

print [stemmer.stem(t) for t in nltk.word_tokenize(temp_sent)]
print [lemmatizer.lemmatize(t) for t in nltk.word_tokenize(temp_sent)]
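
(Part of the problem is that the WordNet lemmatizer assumes every word is a noun unless told otherwise. A quick sketch of passing an explicit part-of-speech hint -- hard-coding 'v' for verbs here rather than wiring in pos_tag() output -- shows the difference:)

# with a verb hint, irregular forms resolve to their lemmas
print lemmatizer.lemmatize('told', pos='v')    # tell
print lemmatizer.lemmatize('lying', pos='v')   # lie
print lemmatizer.lemmatize('women')            # woman (the noun default already works here)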

NLTK Frequency Distributions

Thus far, we've been working with lists of tokens that we're manually sorting, uniquifying, and counting -- all of which can get to be a bit cumbersome. Fortunately, NLTK provides a data structure called FreqDist that makes it more convenient to work with these kinds of frequency distributions. The code snippet below builds a FreqDist from our list of stemmed tokens, and then displays the top 25 tokens appearing most frequently in the text of our article. Wasn't that easy?

In []: fdist = nltk.FreqDist(stemmed_tokens)

for item in fdist.items()[:25]:
    print item
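
(The slice above relies on older NLTK releases returning items() sorted by frequency. If your version hands back an unsorted view instead, FreqDist also behaves like a Counter, so something along these lines should produce the same top-25 listing:)

# Counter-style interface: most_common() returns (token, count) pairs sorted by count
for item in fdist.most_common(25):
    print item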

Filtering out Stop Words

Notice in the output above that most of the top 25 tokens are worthless. With the exception of things like facebook, content, user, and perhaps emot (emotion?), the rest are basically devoid of meaningful information. They don't really tell us anything about the article, since these tokens will appear in just about any English document. What we need to do is filter out these stop words (http://en.wikipedia.org/wiki/Stop_words) in order to focus on just the important material.

While there is no single, definitive list of stop words, NLTK provides a decent start. Let's load it up and take a look at what we get:

In []: sorted(nltk.corpus.stopwords.words('english'))[:25]


Now we can use this list to filter out stop words from our list of stemmed tokens before we create the frequency distribution. You'll notice in the output below that we still have some things like punctuation that we'd probably like to remove, but we're much closer to having a list of the most "important" words in our article.

In []: stemmed_tokens_no_stop = [stemmer.stem(t) for t in stemmed_tokens if t not in nltk.corpus.stopwords.words('english')]

fdist2 = nltk.FreqDist(stemmed_tokens_no_stop)

for item in fdist2.items()[:25]:
    print item

Named Entity Recognition

Another task we might want to do to help identify what's "important" in a text document is named entity recognition (NER) (http://en.wikipedia.org/wiki/Named-entity_recognition). Also called entity extraction, this process involves automatically extracting the names of persons, places, organizations, and potentially other entity types out of unstructured text. Building an NER classifier requires lots of annotated training data and some fancy machine learning algorithms (http://en.wikipedia.org/wiki/Conditional_random_field), but fortunately, NLTK comes with a pre-built/pre-trained NER classifier ready to extract entities right out of the box. This classifier has been trained to recognize PERSON, ORGANIZATION, and GPE (geo-political entity) entity types.

(At this point, I should include a disclaimer stating No True Computational Linguist (http://en.wikipedia.org/wiki/No_true_Scotsman) would ever use a pre-built NER classifier in the "real world" without first re-training it on annotated data representing their particular task. So please don't send me any hate mail -- I've done my part to stop the madness.)


Retrain my classifier models? Ain't nobody got time for that!

In the example below (inspired by this gist from Gavin Hackeling (https://gist.github.com/gavinmh/4735528/) and this post from John Price (http://freshlyminted.co.uk/blog/2011/02/28/getting-band-and-artist-names-nltk/)), we're defining a method to perform the following steps:

take a string as input
tokenize it into sentences
tokenize the sentences into words
add part-of-speech tags to the words using nltk.pos_tag()
run this through the NLTK-provided NER classifier using nltk.ne_chunk()
parse these intermediate results and return any extracted entities

We then apply this method to a sample sentence and parse the clunky output format provided by nltk.ne_chunk() (it comes as an nltk.tree.Tree (http://www.nltk.org/_modules/nltk/tree.html)) to display the entities we've extracted. Don't let these nice results fool you -- NER output isn't always this satisfying. Try some other sample text and see what you get.

In []: def extract_entities(text):
    entities = []
    for sentence in nltk.sent_tokenize(text):
        chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
        entities.extend([chunk for chunk in chunks if hasattr(chunk, 'node')])
    return entities

for entity in extract_entities('My name is Charlie and I work for Altamira in Tysons Corner.'):
    print '[' + entity.node + '] ' + ' '.join(c[0] for c in entity.leaves())
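
(One caveat: on newer NLTK releases the node attribute on Tree objects was replaced by a label() method, so both the filter and the printing need a small tweak. A hypothetical variant under that API might look like this:)

# same approach, but using label() where newer NLTK versions dropped Tree.node
def extract_entities_v3(text):
    entities = []
    for sentence in nltk.sent_tokenize(text):
        chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
        entities.extend([chunk for chunk in chunks if hasattr(chunk, 'label')])
    return entities

for entity in extract_entities_v3('My name is Charlie and I work for Altamira in Tysons Corner.'):
    print '[' + entity.label() + '] ' + ' '.join(c[0] for c in entity.leaves())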

If you're like me, you've grown accustomed over the years to working with the Stanford NER (http://nlp.stanford.edu/software/CRF-NER.shtml) library for Java, and you're suspicious of NLTK's built-in NER classifier (especially because it has chunk in the name). Thankfully, recent versions of NLTK contain a special NERTagger interface that enables us to make calls to Stanford NER from our Python programs, even though Stanford NER is a Java library (the horror!). Not surprisingly (http://www.yurtopic.com/tech/programming/images/java-and-python.jpg), the Python NERTagger API is slightly less verbose than the native Java API for Stanford NER.

To run this example, you'll need to follow the instructions for installing the optional Java libraries, as outlined in the Initial Setup section above. You'll also want to pay close attention to the comment that says # change the paths below to point to wherever you unzipped the Stanford NER download file.

In []: from nltk.tag.stanford import NERTagger

# change the paths below to point to wherever you unzipped the Stanford NER download file
st = NERTagger('/Users/cgreenba/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
               '/Users/cgreenba/stanford-ner/stanford-ner.jar', 'utf-8')

for i in st.tag('Up next is Tommy, who works at STPI in Washington.'.split()):
    print '[' + i[1] + '] ' + i[0]
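
(Similarly, newer NLTK releases renamed this class to StanfordNERTagger and take the encoding as a keyword argument. A rough equivalent of the cell above under that API, assuming the same local paths, would be:)

from nltk.tag.stanford import StanfordNERTagger

# same model/jar paths as above; adjust to wherever you unzipped the Stanford NER download
st = StanfordNERTagger('/Users/cgreenba/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
                       '/Users/cgreenba/stanford-ner/stanford-ner.jar', encoding='utf-8')

for i in st.tag('Up next is Tommy, who works at STPI in Washington.'.split()):
    print '[' + i[1] + '] ' + i[0]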

Automatic Summarization

Now let's try to take some of what we've learned and build something potentially useful in real life: a program that will automatically summarize (http://en.wikipedia.org/wiki/Automatic_summarization) documents. For this, we'll switch gears slightly, putting aside the web article we've been working on until now and instead using a corpus of documents distributed with NLTK.

The Reuters Corpus contains nearly 11,000 news articles about a variety of topics and subjects. If you've run the nltk.download() command as previously recommended, you can then easily import and explore the Reuters Corpus like so:

In []: from nltk.corpus import reuters

print '** BEGIN ARTICLE: ** \"' + reuters.raw(reuters.fileids()[0])[:500] + ' [...]\"'

Our painfully simplistic (http://anthology.aclweb.org/P/P11/P11-3014.pdf) automatic summarization tool will implement the following steps:

assign a score to each word in a document corresponding to its level of "importance"
rank each sentence in the document by summing the individual word scores and dividing by the number of tokens in the sentence
extract the top N highest scoring sentences and return them as our "summary"

Sounds easy enough, right? But before we can say "voila!," we'll need to figure out how to calculate an "importance" score for words. As we saw above with stop words, etc., simply counting the number of times a word appears in a document will not necessarily tell you which words are most important.

Term Frequency - Inverse Document Frequency (TF-IDF)

Consider a document that contains the word baseball 8 times. You might think, "wow, baseball isn't a stop word, and it appeared rather frequently here, so it's probably important." And you might be right. But what if that document is actually an article posted on a baseball blog? Won't the word baseball appear frequently in nearly every post on that blog? In this particular case, if you were generating a summary of this document, would the word baseball be a good indicator of importance, or would you maybe look for other words that help distinguish or differentiate this blog post from the rest?

Context is essential. What really matters here isn't the raw frequency of the number of times each word appeared in a document, but rather the relative frequency comparing the number of times a word appeared in this document against the number of times it appeared across the rest of the collection of documents. "Important" words will be the ones that are generally rare across the collection, but which appear with an unusually high frequency in a given document.

We'll calculate this relative frequency using a statistical metric called term frequency - inverse document frequency (TF-IDF) (http://en.wikipedia.org/wiki/Tf%E2%80%93idf). We could implement TF-IDF ourselves using NLTK, but rather than bore you with the math, we'll take a shortcut and use the TF-IDF implementation provided by the scikit-learn (http://scikit-learn.org/) machine learning library for Python.
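
(For the curious, a do-it-yourself version isn't much code. The sketch below uses the common log-scaled IDF formulation -- one of several variants, and not necessarily the exact weighting scikit-learn applies internally -- just to make the idea concrete:)

import math

def tf_idf(term, doc_tokens, all_docs_tokens):
    # term frequency: how often the term appears in this document
    tf = doc_tokens.count(term) / float(len(doc_tokens))
    # inverse document frequency: penalize terms that appear in many documents
    num_docs_with_term = sum(1 for doc in all_docs_tokens if term in doc)
    idf = math.log(len(all_docs_tokens) / float(1 + num_docs_with_term))
    return tf * idf

docs = [['the', 'baseball', 'game', 'was', 'great'],
        ['the', 'stock', 'market', 'fell', 'today'],
        ['the', 'weather', 'today', 'was', 'great']]
print tf_idf('baseball', docs[0], docs)  # rare across the collection -> positive score
print tf_idf('the', docs[0], docs)       # appears in every document -> score at or below zero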

Chevy Chase: "It was my understanding that there would be no math."

Building a Term-Document Matrix

We'll use scikit-learn's TfidfVectorizer class to construct a term-document matrix (http://en.wikipedia.org/wiki/Document-term_matrix) containing the TF-IDF score for each word in each document in the Reuters Corpus. In essence, the rows of this sparse matrix correspond to documents in the corpus, the columns represent each word in the vocabulary of the corpus, and each cell contains the TF-IDF value for a given word in a given document.

(http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)


Inspired by a computer science lab exercise from Duke University (http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html), the code sample below iterates through the Reuters Corpus to build a dictionary of stemmed tokens for each article, then uses the TfidfVectorizer and scikit-learn's own built-in stop words list to generate the term-document matrix containing TF-IDF scores.

In []: import datetime, re, sys
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

token_dict = {}
for article in reuters.fileids():
    token_dict[article] = reuters.raw(article)

tfidf = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words='english', decode_error='ignore')

print 'building term-document matrix... [process started: ' + str(datetime.datetime.now()) + ']'
sys.stdout.flush()

tdm = tfidf.fit_transform(token_dict.values())  # this can take some time (about 60 seconds on my machine)
print 'done! [process finished: ' + str(datetime.datetime.now()) + ']'

TF-IDF Scores

Now that we've built the term-document matrix, we can explore its contents:

In []: from random import randint

feature_names = tfidf.get_feature_names()

print 'TDM contains ' + str(len(feature_names)) + ' terms and ' + str(tdm.shape[0]) + ' documents'

print 'first term: ' + feature_names[0]
print 'last term: ' + feature_names[len(feature_names) - 1]

for i in range(0, 4):
    print 'random term: ' + feature_names[randint(1, len(feature_names) - 2)]

Generating the Summary

That's all we'll need to produce a summary for any document in the corpus. In the example code below, we start by randomly selecting an article from the Reuters Corpus. We iterate through the article, calculating a score for each sentence by summing the TF-IDF values for each word appearing in the sentence. We normalize the sentence scores by dividing by the number of tokens in the sentence (to avoid bias in favor of longer sentences). Then we sort the sentences by their scores, and return the highest-scoring sentences as our summary. The number of sentences returned corresponds to roughly 20% of the overall length of the article.

Since some of the articles in the Reuters Corpus are rather small (i.e., a single sentence in length) or contain just raw financial data, some of the summaries won't make sense. If you run this code a few times, however, you'll eventually see a randomly-selected article that provides a decent demonstration of this simplistic method of identifying the "most important" sentences from a document.

In []: from __future__ import division
import math

article_id = randint(0, tdm.shape[0] - 1)
article_text = reuters.raw(reuters.fileids()[article_id])

sent_scores = []
for sentence in nltk.sent_tokenize(article_text):
    score = 0
    sent_tokens = tokenize_and_stem(sentence)
    for token in (t for t in sent_tokens if t in feature_names):
        score += tdm[article_id, feature_names.index(token)]
    sent_scores.append((score / len(sent_tokens), sentence))

summary_length = int(math.ceil(len(sent_scores) / 5))
sent_scores.sort(key=lambda sent: sent[0], reverse=True)

print '*** SUMMARY ***'
for summary_sentence in sent_scores[:summary_length]:
    print summary_sentence[1]

print '\n*** ORIGINAL ***'
print article_text

Improving the Summary

That was fairly easy, but how could we improve the quality of the generated summary? Perhaps we could boost the importance of words found in the title or any entities we're able to extract from the text. After initially selecting the highest-scoring sentence, we might discount the TF-IDF scores for duplicate words in the remaining sentences in an attempt to reduce repetitiveness. We could also look at cleaning up the sentences used to form the summary by fixing any pronouns missing an antecedent, or even pulling out partial phrases instead of complete sentences. The possibilities are virtually endless.
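
(As a rough illustration of the first idea, here's a minimal sketch of a sentence scorer that up-weights terms which also appear in an article's title. The helper and its boost parameter are hypothetical -- they aren't part of the original notebook -- and it expects the tdm, feature_names, and tokenize_and_stem objects defined earlier to be passed in:)

def score_sentence(sentence, article_id, title_stems, tdm, feature_names, tokenize_and_stem, boost=2.0):
    # sum TF-IDF values for the sentence's tokens, counting title words more heavily
    score = 0
    sent_tokens = tokenize_and_stem(sentence)
    for token in (t for t in sent_tokens if t in feature_names):
        weight = boost if token in title_stems else 1.0
        score += weight * tdm[article_id, feature_names.index(token)]
    # normalize by sentence length, guarding against empty token lists
    return score / max(len(sent_tokens), 1)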

Next Steps

Want to learn more? Start by working your way through all the examples in the NLTK book (aka "the Whale book"):


Natural Language Processing with Python (book) (http://oreilly.com/catalog/9780596516499/)
(free online version: nltk.org/book (http://www.nltk.org/book/))


Additional NLP Resources for Python

NLTK HOWTOs (http://www.nltk.org/howto/)
Python Text Processing with NLTK 2.0 Cookbook (book) (http://www.packtpub.com/python-text-processing-nltk-20-cookbook/book)
Python wrapper for the Stanford CoreNLP Java library (https://pypi.python.org/pypi/corenlp)
guess_language (Python library for language identification) (https://bitbucket.org/spirit/guess_language)
MITIE (new C/C++-based NER library from MIT with a Python API) (https://github.com/mit-nlp/MITIE)
gensim (topic modeling library for Python) (http://radimrehurek.com/gensim/)

Attend future DC NLP meetups


dcnlp.org (http://dcnlp.org/) | [@DCNLP](https://twitter.com/DCNLP/)