katy perry and trend detection red dirt
TRANSCRIPT
![Page 1: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/1.jpg)
Katy Perry and Trend Detec.on using
Ma5 Kirk – Wetpaint
![Page 2: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/2.jpg)
`whoami`
Former finance quant turned Ruby hacker
twi5er: @mjkirk
website: ma5hewkirk.com
![Page 3: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/3.jpg)
I work for wetpaint.com
![Page 4: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/4.jpg)
Who the hell is Wetpaint?
![Page 5: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/5.jpg)
wetpaint.com
![Page 6: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/6.jpg)
WTF am I doing here at RedDirt??!!!
![Page 7: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/7.jpg)
6 months ago we started building an engine
![Page 8: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/8.jpg)
To find relevant news
![Page 9: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/9.jpg)
That U.lizes
• Sta.s.cal processing • Natural Language Processing • Hardcore mathema.cs
And Magic
![Page 10: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/10.jpg)
Ruby doesn’t really have tools for this sort of thing
Except for magic
![Page 11: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/11.jpg)
Java Packages of help
• Apache commons math • Stanford CoreNLP • OpenNLP • Weka
• LingPipe • Mahout
• etc, etc
![Page 12: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/12.jpg)
Who wants to write Java all day long though?
![Page 13: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/13.jpg)
Java + JRuby = Awesome
![Page 14: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/14.jpg)
So what about Katy Perry….
![Page 15: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/15.jpg)
JRuby helped us find Katy Perry
![Page 16: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/16.jpg)
In the context of Glee
![Page 17: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/17.jpg)
How we went about finding Katy Perry
1. Detect ac.vity around shows on Twi5er, Facebook etc.
2. Extract a5ributes about that ac.vity 3. Cluster everything together to reduce clu5er
![Page 18: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/18.jpg)
Back in November she tweeted
Oh...My...Gosh... this just brought a sweet tear to my eye! Teenage Dream on GLEE makes my heart go WEEEEE!
h5p://t.co/8SAFkGl
![Page 19: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/19.jpg)
Which fed into a re-‐tweet frenzy
![Page 20: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/20.jpg)
Obviously that’s an outlier
![Page 21: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/21.jpg)
But how would we find that?
• Fit the Poisson distribu.on to the last day and figure out the percen.le of the current data point.
• If it’s greater than say 95% there’s something weird going on
![Page 22: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/22.jpg)
Poisson Distribu.on
![Page 23: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/23.jpg)
Let’s use the Apache Commons Math Package!!!
![Page 24: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/24.jpg)
require 'java’; require’math.jar’!Poisson = org.apache.commons.math.distribution.PoissonDistributionImpl!
mean = 20 # tweets per five minutes!
fishy = Poisson.new(mean)!
fishy.cumulative_probability(30) # => 98.6%
![Page 25: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/25.jpg)
Coding a Poisson distribu.on in ruby wouldn’t be as much fun
![Page 26: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/26.jpg)
Ok so we know there’s something there. What is it?
Let’s assume we don’t know it was Katy Perry
![Page 27: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/27.jpg)
Extract some a5ributes
• Using a tool like the Stanford CoreNLP
• Extract – n-‐gram phrases – words – urls
![Page 28: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/28.jpg)
Probably get a5ributes like
n_grams = ["sweet tear", "teenage dream"]!
words = ["oh", "gosh", "just", "brought", "sweet", "tear", "eye", "teenage", "dream", "glee", "heart", "go", "weeeee"]!
urls = ["http://t.co/8SAFkGl"]!
![Page 29: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/29.jpg)
Stanford Core NLP
require 'java'; require 'nlp.jar’!include_class "edu.stanford.nlp.ie.machinereading.domains.ace.reader.RobustTokenizer”!
# Seriously wtf guys...!
RT = RobustTokenizer!
![Page 30: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/30.jpg)
Stanford Core NLP
tok = RT.new(katy_perry_tweet)!
tokens = tok.tokenize.map(&:to_s)!urls = tokens.select do |t| ! RT.is_url(t) !end!
words = tokens.uniq - urls - punctuation - stopwords!
![Page 31: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/31.jpg)
We have a bag full of words and urls. Now what?
![Page 32: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/32.jpg)
Cluster it!
• Li5le more of a difficult problem. In our case we wrote our own package.
• Apache Commons math and Weka both have k-‐means clustering in them
![Page 33: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/33.jpg)
![Page 34: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/34.jpg)
Quickest solu.on is to use Apache Commons Math
![Page 35: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/35.jpg)
Build a Point Class First
class Point! attr_reader :attrs! include
org.apache.commons.math.stat.clustering.Clusterable!
def initialize(attrs = [])! !@attrs = attrs! end! def distanceFrom(point)! ((point.attrs | @attrs) - (point.attrs &
@attrs)).length! end!
![Page 36: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/36.jpg)
Build a Point Class First
def centroidOf(points)! u = Point.new(points.map(&:attributes).flatten.uniq)!
guess = points.first! points.each do |point|! !if u.distanceFrom(point) < u.distanceFrom(best_guess)!
!!guess = point! !end! end! best_guess!
end!end
![Page 37: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/37.jpg)
Feed it into Apache Commons
require 'java'; require 'math.jar’!include_class "org.apache.commons.math.stat.clustering.KMeansPlusPlusClusterer”!
clusterer = KMeansPlusPlusClusterer(java.util.Random.new)!
num_clusters = ? # Depends...!max_iter = -1 # no max!clusterer.cluster(collection_of_points, num_clusters, max_iter)
![Page 38: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/38.jpg)
Now that we’ve clustered, our peeps can find Katy Perry
![Page 39: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/39.jpg)
And now we can have a party
![Page 40: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/40.jpg)
Conclusion
• Detect • Extract • Cluster
![Page 41: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/41.jpg)
Conclusion
• Java + JRuby = Killer combo for fun development of sta.s.cs apps.
![Page 42: Katy perry and trend detection red dirt](https://reader033.vdocuments.mx/reader033/viewer/2022053020/557d780ad8b42a2c428b4bb9/html5/thumbnails/42.jpg)
THANKS!!!
If you want to work with Females 18-‐34. We’re Hiring!!!
h5p://bit.ly/wpdevs