Text analytics in Python and R with examples from Tobacco Control

Text Analytics with Python and R (w/ examples from Tobacco Control) @BenHealey

Upload: ben-healey

Post on 09-May-2015


DESCRIPTION

Ben has been doing data sciencey work since 1999 for organisations in the banking, retailing, health and education industries. He is currently on contracts with Pharmac and Aspire2025 (a Tobacco Control research collaboration) where, happily, he gets to use his data-wrangling powers for good. This presentation focuses on analysing text, with Tobacco Control as the context. Examples include monitoring mentions of NZ's smokefree goal by politicians and examining media uptake of BATNZ's Agree/Disagree PR campaign. It covers common obstacles during data extraction, cleaning and analysis, along with the key Python and R packages you can use to help clear them.

TRANSCRIPT

Page 1: Text analytics in Python and R with examples from Tobacco Control

Text Analytics with Python and R

(w/ examples from Tobacco Control) @BenHealey

Page 2: Text analytics in Python and R with examples from Tobacco Control

The Process

Bright Idea → Gather → Clean → Standardise → De-dup and select → Look intensely / Frequencies / Classification

Page 3: Text analytics in Python and R with examples from Tobacco Control

http://scrapy.org

Spiders, Items, Pipelines

R:
- readLines, XML / RCurl / scrapeR packages
- tm package (factiva plugin), twitteR

Python:
- Beautiful Soup
- Pandas (eg, financial data)

http://blog.siliconstraits.vn/building-web-crawler-scrapy/

Page 4: Text analytics in Python and R with examples from Tobacco Control
Page 5: Text analytics in Python and R with examples from Tobacco Control
Page 6: Text analytics in Python and R with examples from Tobacco Control
Page 7: Text analytics in Python and R with examples from Tobacco Control

• Translating text to a consistent form
  – Scrapy returns unicode strings
  – Māori → Maori
    • SWAPSET = [[u"Ā", "A"], [u"ā", "a"], [u"ä", "a"]]
    • translation_table = dict([(ord(k), unicode(v)) for k, v in settings.SWAPSET])
    • cleaned_content = html_content.translate(translation_table)
  – Or…
    • test = u'Māori' (you already have unicode)
    • unidecode(test) (returns 'Maori'; from the unidecode package)
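The slide's snippet is Python 2 (it calls `unicode`). A minimal Python 3 sketch of the same idea follows. `swap_characters` and `strip_accents` are illustrative names, and `strip_accents` uses the standard library's `unicodedata` as a stand-in for Unidecode, so it only removes combining marks rather than transliterating arbitrary scripts.

```python
import unicodedata

# Illustrative swap list mirroring the slide's SWAPSET.
SWAPSET = [("Ā", "A"), ("ā", "a"), ("ä", "a")]

# str.translate expects a mapping from code points (ints) to replacements.
translation_table = {ord(src): dst for src, dst in SWAPSET}


def swap_characters(text):
    """Replace only the characters listed in SWAPSET."""
    return text.translate(translation_table)


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(swap_characters("Māori"))
print(strip_accents("Māori"))
```

The explicit table only touches characters you have listed, which is predictable but needs maintaining; the normalisation approach catches accents you did not anticipate.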

Page 8: Text analytics in Python and R with examples from Tobacco Control

• Dealing with non-Unicode
  – http://nedbatchelder.com/text/unipain.html
  – Some scraped HTML will be in latin1 (mismatch with UTF-8)
  – Have your datastore default to UTF-8
  – Learn to love whack-a-mole

• Dealing with too many spaces:
  – newstring = ' '.join(mystring.split())
  – Or… use re

• Don't forget the metadata!
  – Define a common data structure early if you have multiple sources
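The three cleaning points above can be sketched in Python 3 as follows. This is only one possible shape: `decode_bytes`, `collapse_whitespace` and the `Document` record (including its field names) are hypothetical, and the UTF-8-then-latin1 fallback is a common heuristic rather than a guaranteed fix for every mis-encoded page.

```python
import re
from dataclasses import dataclass


def decode_bytes(raw):
    """Try UTF-8 first; fall back to latin1, which accepts any byte sequence."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


@dataclass
class Document:
    """Hypothetical common record shared by every source."""
    source: str      # eg, "press release", "forum", "news"
    url: str
    published: str   # ISO date string
    content: str


raw = "Smokefree   café\n\n policy".encode("latin-1")  # not valid UTF-8
doc = Document(
    source="example",
    url="http://example.org/item/1",
    published="2012-03-01",
    content=collapse_whitespace(decode_bytes(raw)),
)
print(doc)
```

Because latin1 never fails to decode, the fallback can silently produce wrong characters for other encodings, which is where the whack-a-mole comes in.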

Page 9: Text analytics in Python and R with examples from Tobacco Control

Text Standardisation

• Stopwords
  – "a, about, above, across, ... yourself, yourselves, you've, z"

• Stemmers
  – "some sample stemmed words" → "some sampl stem word"

• Tokenisers (eg, for bigrams)
  – BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  – tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
  – 'and said', 'and security'

Python: Natural Language Toolkit
R: tm package
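To show what stopword removal and bigram tokenising do without installing NLTK or tm, here is a small standard-library sketch. The stopword list is a tiny illustrative subset, and `tokenise`, `remove_stopwords` and `ngrams` are hypothetical helper names, not functions from either package. Stemming is left out because a faithful stemmer is more than a few lines.

```python
import re

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "about", "above", "across", "and", "is", "of", "the"}


def tokenise(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n=2):
    """Return overlapping n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenise("The price of oil, and security")
print(remove_stopwords(tokens))
print(ngrams(tokens, 2))
```

Bigrams such as 'and security' only appear when stopwords are kept, so the order in which you apply these steps matters.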

Page 10: Text analytics in Python and R with examples from Tobacco Control

Text Standardisation

libs = c("RODBC", "RWeka", "Snowball", "wordcloud", "tm", "topicmodels")

cleanCorpus = function(corpus) {
  corpus.tmp = tm_map(corpus, tolower)  # ??? Not sure. (tm >= 0.6 expects content_transformer(tolower))
  corpus.tmp = tm_map(corpus.tmp, removePunctuation)
  corpus.tmp = tm_map(corpus.tmp, removeWords, stopwords("english"))
  corpus.tmp = tm_map(corpus.tmp, stripWhitespace)
  return(corpus.tmp)
}

posts.corpus = cleanCorpus(posts.corpus)
posts.corpus_stemmed = tm_map(posts.corpus, stemDocument)

Page 11: Text analytics in Python and R with examples from Tobacco Control

Text Standardisation

• Using dictionaries for stem completion

politi.tdm <- TermDocumentMatrix(politi.corpus)
politi.tdm = removeSparseTerms(politi.tdm, 0.99)
politi.tdm = as.matrix(politi.tdm)

# get word counts in decreasing order, put these into a plain text doc.
word_freqs = sort(rowSums(politi.tdm), decreasing=TRUE)
length(word_freqs)
smalldict = PlainTextDocument(names(word_freqs))

politi.corpus_final = tm_map(politi.corpus_stemmed, stemCompletion, dictionary=smalldict, type="first")
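The idea behind stem completion with a frequency-ordered dictionary can be illustrated in a few lines of Python. This is a rough sketch of the concept, not tm's implementation: `complete_stem` is a hypothetical helper, the dictionary is invented, and returning the stem unchanged when nothing matches is an assumption made here for simplicity.

```python
def complete_stem(stem, dictionary):
    """Return the first dictionary word starting with the stem, else the stem."""
    for word in dictionary:
        if word.startswith(stem):
            return word
    return stem


# Invented dictionary, ordered by descending frequency like word_freqs above.
dictionary = ["smoke", "smokefree", "announced", "business", "building"]
stems = ["smoke", "smokefre", "announc", "busi", "build", "xyz"]

completed = [complete_stem(s, dictionary) for s in stems]
print(completed)
```

Ordering the dictionary by frequency means an ambiguous stem resolves to its most common full form, which is usually what you want for readable word clouds and tables.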

Page 12: Text analytics in Python and R with examples from Tobacco Control

Deduplication

• Python sets
  – shingles1 = set(get_shingles(record1['standardised_content']))

• Shingling and Jaccard similarity
  – (a,rose,is,a,rose,is,a,rose)
  – {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose)}
    • {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf (a free text)
http://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf
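The slide names `get_shingles` but does not show its body, so the version below is a guess: word-level shingles of length 4, which reproduces the "a rose is a rose" example. `jaccard` is likewise an illustrative helper, and the second text is invented for the comparison.

```python
def get_shingles(text, k=4):
    """Return every run of k consecutive words as a tuple (duplicates kept)."""
    tokens = text.split()
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


shingles1 = set(get_shingles("a rose is a rose is a rose"))
shingles2 = set(get_shingles("a rose is a rose"))

print(len(shingles1), len(shingles2))
print(jaccard(shingles1, shingles2))
```

A pair of records whose similarity exceeds a chosen threshold can be treated as near-duplicates; the threshold itself is something to tune against your own data.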

Page 13: Text analytics in Python and R with examples from Tobacco Control

Frequency Analysis

• Document-Term Matrix
  – politi.dtm <- DocumentTermMatrix(politi.corpus_stemmed, control = list(wordLengths=c(4,Inf)))

• Frequent and co-occurring terms
  – findFreqTerms(politi.dtm, 5000)

    [1] "2011"     "also"   "announc" "area"  "around"
    [6] "auckland" "better" "bill"    "build" "busi"

  – findAssocs(politi.dtm, "smoke", 0.5)

    smoke  tobacco  quit  smokefre  smoker  2025  cigarett
    1.00   0.74     0.68  0.62      0.62    0.58  0.57
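For readers on the Python side, the same three steps (build a document-term matrix, list frequent terms, find terms correlated with a target) can be sketched with the standard library. The function names and the toy documents are invented, and using Pearson correlation between term columns is an assumption about how the association scores are computed.

```python
from collections import Counter
from math import sqrt


def document_term_matrix(docs, min_length=4):
    """Count terms per document, keeping terms of at least min_length characters."""
    counts = [Counter(t for t in doc.split() if len(t) >= min_length) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def frequent_terms(vocab, matrix, min_count):
    """Terms whose total count across all documents is at least min_count."""
    totals = [sum(col) for col in zip(*matrix)]
    return [t for t, n in zip(vocab, totals) if n >= min_count]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def associated_terms(vocab, matrix, term, min_corr):
    """Other terms whose column correlates with `term` at min_corr or above."""
    cols = dict(zip(vocab, zip(*matrix)))
    scores = {t: pearson(cols[term], col) for t, col in cols.items() if t != term}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return {t: round(r, 2) for t, r in ranked if r >= min_corr}


docs = [
    "smoke tobacco quit smoke",
    "tobacco smoke budget",
    "budget auckland build",
    "auckland build budget",
]
vocab, matrix = document_term_matrix(docs)
print(frequent_terms(vocab, matrix, 3))
print(associated_terms(vocab, matrix, "smoke", 0.5))
```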

Page 14: Text analytics in Python and R with examples from Tobacco Control
Page 15: Text analytics in Python and R with examples from Tobacco Control

Mentions of the 2025 goal

Page 16: Text analytics in Python and R with examples from Tobacco Control

Mentions of the 2025 goal

Page 17: Text analytics in Python and R with examples from Tobacco Control

Top 100 terms: Tariana Turia

Note: Documents from Aug 2011 – July 2012

Wordcloud package

Page 18: Text analytics in Python and R with examples from Tobacco Control

Top 100 terms: Tony Ryall

Note: Documents from Aug 2011 – July 2012

Page 19: Text analytics in Python and R with examples from Tobacco Control

Classification

• Exploration and feature extraction
  – Metadata gathered at time of collection (eg, Scrapy)
  – RODBC or MySQLdb with plain ol' SQL
  – Native or package functions for length of strings, sna, etc.

• Unsupervised
  – nltk.cluster
  – tm, topicmodels, as.matrix(dtm) → kmeans, etc.

• Supervised
  – First hurdle: Training set
  – nltk.classify
  – tm, e1071, others…
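To make the supervised route concrete, here is a minimal multinomial naive Bayes classifier written with the standard library only. It is a stand-in for what nltk.classify or e1071 would provide, not a description of the classifier used in the talk, and the training posts and labels are invented. It also shows why the training set is the first hurdle: nothing works until someone has labelled examples.

```python
from collections import Counter, defaultdict
from math import log


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training set: forum-style posts with hand-assigned labels.
train_texts = [
    "well done stay strong",
    "great work awesome day",
    "really want a smoke today",
    "feel awful craving a cigarette",
]
train_labels = ["encouragement", "encouragement", "struggle", "struggle"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("stay strong awesome"))
print(model.predict("craving a smoke"))
```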

Page 20: Text analytics in Python and R with examples from Tobacco Control

                       Users           Posts
2 posts or fewer       846 (41.0%)     1,157 (1.3%)
more than 750 posts    23 (1.1%)       45,499 (50.1%)

Page 21: Text analytics in Python and R with examples from Tobacco Control

Cohort: New users (posters) in Q1 2012

Page 22: Text analytics in Python and R with examples from Tobacco Control

• LDA (topicmodels)

– New users

Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
good      smoke     just      smoke     feel
day       time      day       quit      day
thank     week      get       can       dont
well      patch     realli    one       like
will      start     think     will      still

– Highly active users

Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
quit      good      day       like      feel
smoke     one       well      day       thing
can       take      great     your      just
will      stay      done      now       get
luck      strong    awesom    get       time

Page 23: Text analytics in Python and R with examples from Tobacco Control

• LDA (topicmodels)

– Highly active users (HAU)

Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
quit      good      day       like      feel
smoke     one       well      day       thing
can       take      great     your      just
will      stay      done      now       get
luck      strong    awesom    get       time

                     Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
HAU1 (F, 38, PI)     18%       14%       40%       8%        20%
HAU2 (F, 33, NZE)    31%       21%       27%       6%        16%
HAU3 (M, 48, NZE)    16%       9%        21%       49%       5%

Page 24: Text analytics in Python and R with examples from Tobacco Control

Recap

• Your text will probably be messy
  – Python, R-based tools reduce the pain

• Simple analyses can generate useful insight

• Combine with data of other types for context
  – source, quantities, dates, network position, history

• May surface useful features for classification

Slides, Code: [email protected]