nerddit demo presentation
More than a reddit N-gram viewer
JERRY PRAWIHARJO
INSIGHT DATA ENGINEERING FELLOW
nerddit
Motivation
N-grams:
◦ Allows data scientists to do topic trend analysis and language analysis
◦ Allows for a “type-ahead” feature
Subreddits network graph:
[Diagram: subreddit nodes SR1, SR2, SR3 and user nodes U]
1-grams
My name is Jerry
“My” “name” “is” “Jerry”
2-grams
My name is Jerry
“My name” “name is” “is Jerry”
3-grams
My name is Jerry
“My name is” “name is Jerry”
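The 1-, 2- and 3-gram examples above can be reproduced with a sliding window over the tokens. This is a minimal sketch that assumes plain whitespace tokenization; the function name `ngrams` is illustrative and not taken from the nerddit code.

```python
def ngrams(text, n):
    """Return all n-grams of `text` as space-joined strings."""
    tokens = text.split()  # assumption: whitespace tokenization
    # Slide a window of width n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "My name is Jerry"
for n in (1, 2, 3):
    print(n, ngrams(sentence, n))
```

A sentence with fewer than n tokens yields an empty list, so short comments simply contribute no higher-order n-grams.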
Pipeline
6x m4.xlarge: $1.43/hour
5x m4.xlarge: $28.7/day
t2.micro: free
~10GB
10/2007-12/2015
>1TB uncompressed
4x m4.large: $11.5/day
Reddit Statistics
Year  Date        Comments  Unique authors  Unique subreddits
2015  2015-12-01  10000     25000           60000
2014  2014-12-01  50000     35000           40000
N-gram
Ngram            Date     N  Count  Percentage
Hallows          2011-04  1  10     0.1
Deathly Hallows  2011-04  2  50     0.1

Ngram            N  Subreddit  Count (counter type)
Hallows          1  movies     1000
Deathly Hallows  2  movies     5000

N-gram cluster against subreddits
Time series n-grams
word-parser
(“2011-04”, [“old”, “lady”, .., “Deathly”, “Hallows”, …], “movies”)
(“2011-04::old::movies”, 1)
(“2011-04::lady::movies”, 1)
…
(“2011-04::Deathly::movies”, 1)
(“2011-04::old::movies”, 10)
(“2011-04::lady::movies”, 5)
…
(“2011-04::Deathly::movies”, 2)
Job took ~2 days to complete
Regex filters: URLs, IMG links, Unicode characters
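The word-parser steps above (record, then one (key, 1) pair per word, then summed counts) can be sketched in plain Python. The regular expressions here are hypothetical stand-ins, since the slide names only the filter categories; the helper names `clean`, `to_pairs` and `reduce_by_key` are also illustrative. In Spark the last step would typically be a reduce-by-key over the pairs.

```python
import re
from collections import Counter

# Hypothetical patterns: the slide only says URLs, IMG links and
# Unicode are filtered, not how. Image links are covered by URL_RE.
URL_RE = re.compile(r"https?://\S+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")


def clean(body):
    """Strip URLs and non-ASCII runs from a comment body."""
    body = URL_RE.sub(" ", body)
    return NON_ASCII_RE.sub(" ", body)


def to_pairs(month, words, subreddit):
    """Emit one ("month::word::subreddit", 1) pair per word."""
    return [(f"{month}::{w}::{subreddit}", 1) for w in words]


def reduce_by_key(pairs):
    """Sum the counts of pairs that share a key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


words = clean("old lady saw old http://example.com/a.png").split()
counts = reduce_by_key(to_pairs("2011-04", words, "movies"))
print(counts)
```

Packing month, word and subreddit into one string key keeps the reduce step a simple sum over identical keys.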
Subreddits Graph
Year  node1     node2
2011  movies    {politics: 10, games: 5, …}
2014  politics  {games: 3, conservative: 2, …}

Year  Distinct authors  Subreddit  Comments
2011  TheOceldoc        movies     100
2011  JohnDoe           politics   200

(TheOceldoc, (movies, politics, games, …))
(JohnDoe, (politics, conservative, …))

(“movies::politics”, 10)
(“movies::games”, 5)
(“politics::games”, 3)
…
(“politics::conservative”, 2)
Edge weight
Filter degree < 100
Clustering
Force Atlas 2 layout
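One way to get from the per-author subreddit lists to weighted edges is to count, for each pair of subreddits, how many authors commented in both. That reading of "edge weight" is an assumption, as the slide does not define it. The function name `edge_weights` is illustrative, and subreddit names are sorted inside each pair so that one edge always maps to one key, which the slide's keys do not necessarily do.

```python
from collections import Counter
from itertools import combinations


def edge_weights(author_subreddits):
    """Count, per subreddit pair, the authors active in both.

    author_subreddits maps an author name to the subreddits
    that author commented in.
    """
    weights = Counter()
    for subs in author_subreddits.values():
        # Sorting makes "a::b" and "b::a" collapse to a single key.
        for a, b in combinations(sorted(set(subs)), 2):
            weights[f"{a}::{b}"] += 1
    return weights


authors = {
    "TheOceldoc": ["movies", "politics", "games"],
    "JohnDoe": ["politics", "conservative"],
}
print(edge_weights(authors))
```

An author active in k subreddits contributes k(k-1)/2 pairs, so very prolific authors dominate the pair count.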
Spark Tuning
[Bar chart: time in minutes (y-axis, 0 to 18) for cases A, B, C, D]
Case  RDD compress  Kryo
A     FALSE         FALSE
B     TRUE          FALSE
C     TRUE          TRUE
D     FALSE         TRUE
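The four cases in the table can be written out as Spark configuration properties. `spark.rdd.compress` and `spark.serializer` are standard Spark property names, but mapping the slide's "RDD compress" and "Kryo" columns onto them is an assumption, and the helpers `spark_conf` and `submit_flags` are illustrative. The sketch only builds the settings and does not launch Spark.

```python
KRYO = "org.apache.spark.serializer.KryoSerializer"

# The four tuning cases from the table above.
CASES = {
    "A": {"rdd_compress": False, "kryo": False},
    "B": {"rdd_compress": True, "kryo": False},
    "C": {"rdd_compress": True, "kryo": True},
    "D": {"rdd_compress": False, "kryo": True},
}


def spark_conf(case):
    """Return the Spark properties for one tuning case."""
    settings = CASES[case]
    conf = {"spark.rdd.compress": str(settings["rdd_compress"]).lower()}
    if settings["kryo"]:
        # Omitting this key leaves Spark on its default Java serializer.
        conf["spark.serializer"] = KRYO
    return conf


def submit_flags(case):
    """Render the properties as spark-submit --conf arguments."""
    return [f"--conf {k}={v}" for k, v in sorted(spark_conf(case).items())]


for case in CASES:
    print(case, " ".join(submit_flags(case)))
```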
Jerry Prawiharjo
PhD in Optoelectronics from Southampton, England
◦ Distributed computation on Beowulf cluster (MPI)
Product Development Engineer at NeoPhotonics
◦ Test software development and data analysis
Senior Test Development Engineer at Cisco
◦ Test station development (hardware and software) for 100G transceiver module
Back Up
Challenges
Sheer amount of data: >1TB
◦ Scoping the project: monthly time bucket (as opposed to daily or weekly)
◦ Filter foreign-language subreddits
◦ Spark tuning
S3 rate limit: process data on a file-per-file basis
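The monthly time bucket mentioned above could be derived from each comment's timestamp as follows. This sketch assumes the timestamp is a Unix epoch in seconds (UTC); the function name `month_bucket` is illustrative. It produces keys in the same “2011-04” form used in the n-gram tables.

```python
from datetime import datetime, timezone


def month_bucket(created_utc):
    """Map a Unix timestamp (seconds, UTC) to a "YYYY-MM" bucket."""
    moment = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return moment.strftime("%Y-%m")


print(month_bucket(1301616000))  # first second of April 2011, UTC
```

Compared with daily or weekly buckets, a monthly bucket yields far fewer distinct keys per n-gram, which is what makes the scoping choice reduce the data volume.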