Text Analytics in Python and R with Examples from Tobacco Control

Text Analytics with Python and R

(w/ examples from Tobacco Control)

@BenHealey

The Process

Bright Idea → Gather → Clean → Standardise → De-dup and select → Look intensely / Frequencies / Classification

http://scrapy.org

Scrapy: Spiders, Items, Pipelines

R:
- readLines, XML / RCurl / scrapeR packages
- tm package (factiva plugin), twitteR

Python:
- Beautiful Soup
- Pandas (eg, financial data)

http://blog.siliconstraits.vn/building-web-crawler-scrapy/


• Translating text to consistent form
– Scrapy returns unicode strings
– Māori → Maori
    SWAPSET = [[u"Ā", "A"], [u"ā", "a"], [u"ä", "a"]]
    translation_table = dict([(ord(k), unicode(v)) for k, v in settings.SWAPSET])
    cleaned_content = html_content.translate(translation_table)
– Or… use the unidecode package
    test = u'Māori'    (you already have unicode)
    unidecode(test)    (returns 'Maori')

• Dealing with non-Unicode
– http://nedbatchelder.com/text/unipain.html
– Some scraped html will be in latin1 (mismatch with UTF-8)
– Have your datastore default to UTF-8
– Learn to love whack-a-mole

• Dealing with too many spaces:
– newstring = ' '.join(mystring.split())
– Or… use re

• Don't forget the metadata!
– Define a common data structure early if you have multiple sources
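The cleaning steps above can be sketched in Python 3, where every str is already Unicode (the slide's snippets are Python 2). This is a minimal sketch, not the talk's code: the helper names are hypothetical, the UTF-8-then-latin1 fallback is just one assumed way to handle mismatched encodings, and strip_accents is a standard-library stand-in for the unidecode package.

```python
import re
import unicodedata

# Same swaps as the SWAPSET on the slide.
SWAPSET = [["Ā", "A"], ["ā", "a"], ["ä", "a"]]
TRANSLATION_TABLE = {ord(k): v for k, v in SWAPSET}


def to_text(raw: bytes) -> str:
    """Decode scraped bytes, trying UTF-8 first and falling back to latin1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin1 maps every byte to a character, so this never raises.
        return raw.decode("latin-1")


def swap_characters(text: str) -> str:
    """Apply the explicit character swaps from SWAPSET."""
    return text.translate(TRANSLATION_TABLE)


def strip_accents(text: str) -> str:
    """Alternative to a swap list: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def squeeze_spaces(text: str) -> str:
    """Collapse any run of whitespace to a single space."""
    return re.sub(r"\s+", " ", text).strip()


raw = "  Māori   health \n news ".encode("utf-8")
cleaned = squeeze_spaces(swap_characters(to_text(raw)))
print(cleaned)
```

A swap list keeps control over exactly which characters change; the decomposition approach covers more characters but also strips accents you may want to keep.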


Text Standardisation

• Stopwords
– "a, about, above, across, ... yourself, yourselves, you've, z"

• Stemmers
– "some sample stemmed words" → "some sampl stem word"

• Tokenisers (eg, for bigrams)
– BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
– tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
– 'and said', 'and security'

Tools: Natural Language Toolkit (Python), tm package (R)
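For comparison with the R snippets, stopword removal and bigram tokenising can be sketched in plain Python. This is a toy illustration only: the stopword set is a tiny assumed subset, the function names are hypothetical, and stemming is left out because a real stemmer (such as those shipped with NLTK) is much more involved.

```python
import re
from typing import List

# Illustrative subset only; real stopword lists are far longer.
STOPWORDS = {"a", "about", "above", "across", "and", "the", "of", "to", "is"}


def tokenise(text: str) -> List[str]:
    """Lower-case the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens: List[str]) -> List[str]:
    """Drop any token found in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


def bigrams(tokens: List[str]) -> List[str]:
    """Pair each token with its successor (n-grams with min = max = 2)."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


tokens = tokenise("The minister said, and security officials agreed.")
print(remove_stopwords(tokens))
print(bigrams(tokens))
```

Building bigrams before removing stopwords keeps pairs such as 'and security'; removing stopwords first would produce different pairs.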

Text Standardisation

libs = c("RODBC", "RWeka", "Snowball", "wordcloud", "tm", "topicmodels")

…

cleanCorpus = function(corpus) {
  corpus.tmp = tm_map(corpus, tolower)  # ??? Not sure.
  corpus.tmp = tm_map(corpus.tmp, removePunctuation)
  corpus.tmp = tm_map(corpus.tmp, removeWords, stopwords("english"))
  corpus.tmp = tm_map(corpus.tmp, stripWhitespace)
  return(corpus.tmp)
}

posts.corpus = cleanCorpus(posts.corpus)
posts.corpus_stemmed = tm_map(posts.corpus, stemDocument)

Text Standardisation

• Using dictionaries for stem completion

politi.tdm <- TermDocumentMatrix(politi.corpus)
politi.tdm = removeSparseTerms(politi.tdm, 0.99)
politi.tdm = as.matrix(politi.tdm)

# get word counts in decreasing order, put these into a plain text doc.
word_freqs = sort(rowSums(politi.tdm), decreasing=TRUE)
length(word_freqs)
smalldict = PlainTextDocument(names(word_freqs))

politi.corpus_final = tm_map(politi.corpus_stemmed, stemCompletion, dictionary=smalldict, type="first")
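The idea behind completing stems from a frequency-ordered dictionary can be sketched in Python. This is an assumed simplification, not tm's implementation: complete_stems is a hypothetical helper, the dictionary words are made up, and leaving unmatched stems unchanged is this sketch's own choice.

```python
def complete_stems(stems, dictionary):
    """Replace each stem with the first dictionary word that starts with it."""
    completed = []
    for stem in stems:
        # Fall back to the stem itself when nothing in the dictionary matches.
        match = next((word for word in dictionary if word.startswith(stem)), stem)
        completed.append(match)
    return completed


# Dictionary ordered most frequent first, so "first match" means "most common".
dictionary = ["smoking", "smokefree", "announced", "business"]
completed = complete_stems(["smok", "announc", "busi", "quit"], dictionary)
print(completed)
```

Because the dictionary is sorted by decreasing frequency, taking the first match picks the most common full form of each stem.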

Deduplication

• Python sets
– shingles1 = set(get_shingles(record1['standardised_content']))

• Shingling and Jaccard similarity
– (a,rose,is,a,rose,is,a,rose)
– {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose)}
– As a set: {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}
– Jaccard similarity: J(A, B) = |A ∩ B| / |A ∪ B|

• References
– http://infolab.stanford.edu/~ullman/mmds/ch3.pdf (a free text)
– http://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf
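The shingling example above can be reproduced with a short Python sketch. The slide names get_shingles but does not show it, so the body here is an assumption (word-level shingles of size 4, matching the rose example); jaccard and record2 are likewise hypothetical.

```python
def get_shingles(text, size=4):
    """Return every run of `size` consecutive words as a tuple."""
    words = text.split()
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]


def jaccard(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    if not set_a and not set_b:
        return 1.0  # two empty documents are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


record1 = {"standardised_content": "a rose is a rose is a rose"}
record2 = {"standardised_content": "a rose is a flower is a rose"}

# Converting to a set drops the repeated shingles.
shingles1 = set(get_shingles(record1["standardised_content"]))
shingles2 = set(get_shingles(record2["standardised_content"]))

similarity = jaccard(shingles1, shingles2)
print(len(shingles1), len(shingles2), similarity)
```

Pairs of records whose similarity exceeds a chosen threshold can then be treated as near-duplicates; the threshold itself is a judgment call for each dataset.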


Frequency Analysis

• Document-Term Matrix
– politi.dtm <- DocumentTermMatrix(politi.corpus_stemmed, control = list(wordLengths=c(4,Inf)))

• Frequent and co-occurring terms
– findFreqTerms(politi.dtm, 5000)
    [1] "2011" "also" "announc" "area" "around"
    [6] "auckland" "better" "bill" "build" "busi"
– findAssocs(politi.dtm, "smoke", 0.5)
        smoke  tobacco     quit smokefre   smoker     2025 cigarett
         1.00     0.74     0.68     0.62     0.62     0.58     0.57
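To show what these two calls compute, here is a rough Python analogue using only the standard library. It is a sketch under assumptions: the four documents are made up, find_freq_terms and find_assocs are hypothetical names that mimic (not reproduce) the tm functions, and association is taken to be the Pearson correlation between per-document term counts.

```python
from collections import Counter
from math import sqrt

# Hypothetical, already-stemmed documents standing in for the corpus.
docs = [
    "smoke tobacco quit smoke",
    "tobacco smoke bill",
    "build auckland area",
    "quit smoke tobacco",
]

# Document-term matrix: one Counter of term counts per document.
dtm = [Counter(doc.split()) for doc in docs]
terms = sorted(set().union(*dtm))


def find_freq_terms(dtm, lowfreq):
    """Terms whose total count across all documents is at least `lowfreq`."""
    totals = Counter()
    for row in dtm:
        totals.update(row)
    return sorted(t for t, n in totals.items() if n >= lowfreq)


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def find_assocs(dtm, terms, term, corlimit):
    """Other terms whose counts correlate with `term` at or above `corlimit`."""
    target = [row[term] for row in dtm]
    scores = {t: pearson(target, [row[t] for row in dtm]) for t in terms if t != term}
    return {t: round(r, 2) for t, r in scores.items() if r >= corlimit}


frequent = find_freq_terms(dtm, 3)
assocs = find_assocs(dtm, terms, "smoke", 0.5)
print(frequent, assocs)
```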

Mentions of the 2025 goal

Top 100 terms: Tariana Turia

Note: Documents from Aug 2011 – July 2012

Wordcloud package

Top 100 terms: Tony Ryall

Note: Documents from Aug 2011 – July 2012

• Exploration and feature extraction
– Metadata gathered at time of collection (eg, Scrapy)
– RODBC or MySQLdb with plain ol' SQL
– Native or package functions for length of strings, sna, etc.

• Unsupervised
– nltk.cluster
– tm, topicmodels, as.matrix(dtm) → kmeans, etc.

• Supervised
– First hurdle: Training set
– nltk.classify
– tm, e1071, others…
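To make the supervised route concrete, here is a toy Naive Bayes text classifier in plain Python. It is a sketch only, not what nltk.classify or e1071 do internally: the labelled training set is invented, the labels and function names are hypothetical, and add-one smoothing is an assumed choice.

```python
from collections import Counter, defaultdict
from math import log


def train(labelled_docs):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled_docs:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        score = log(n_docs / total_docs)  # log prior for this label
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            score += log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# The "first hurdle": a hand-labelled training set (made up here).
training_set = [
    ("quit smoke patch week", "quitting"),
    ("smoke quit start day", "quitting"),
    ("bill announc build auckland", "politics"),
    ("area busi bill better", "politics"),
]
model = train(training_set)
print(classify("quit patch day", model), classify("bill auckland", model))
```

The code is the easy part; assembling enough reliably labelled examples is usually where the effort goes.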

Classification

Cohort: New users (posters) in Q1 2012

                        Users   % of users    Posts   % of posts
2 posts or fewer          846        41.0%    1,157         1.3%
More than 750 posts        23         1.1%   45,499        50.1%

• LDA (topicmodels)

– New users

Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
good      smoke     just      smoke     feel
day       time      day       quit      day
thank     week      get       can       dont
well      patch     realli    one       like
will      start     think     will      still

– Highly active users

Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
quit      good      day       like      feel
smoke     one       well      day       thing
can       take      great     your      just
will      stay      done      now       get
luck      strong    awesom    get       time

• LDA (topicmodels)

– Highly active users (HAU)

Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
quit      good      day       like      feel
smoke     one       well      day       thing
can       take      great     your      just
will      stay      done      now       get
luck      strong    awesom    get       time

– Topic shares by user

                     Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
HAU1 (F, 38, PI)         18%       14%       40%        8%       20%
HAU2 (F, 33, NZE)        31%       21%       27%        6%       16%
HAU3 (M, 48, NZE)        16%        9%       21%       49%        5%

Recap

• Your text will probably be messy
– Python, R-based tools reduce the pain

• Simple analyses can generate useful insight

• Combine with data of other types for context
– source, quantities, dates, network position, history

• May surface useful features for classification

Slides, Code: [email protected]
