TweetMogaz - The Arabic Tweets Platform: presented by Ahmed Adel, BADR
October 13-16, 2015 • Austin, TX
TweetMogaz: The Arabic Tweets Platform
Ahmed Adel
Team Lead, BADR
Who Am I?
• B.Sc. in Engineering from Alexandria University
• BADR Co-Founder
• Now: Part-time Team Lead @ BADR
• 8+ years of experience in software development
• Mainly Java, JavaScript
• Solr, Hadoop, Hive, ...
BADR
• Established software house in Egypt
• Founded in 2006
• Provides Big Data consulting services and solutions
• Machine Learning, NLP, Data Science, ...
• Hadoop, Solr, Spark, Hive, Flume, Incorta, ...
Agenda
• What is TweetMogaz
• System Modules
  • Tweets processing
  • Indexing
  • Event detection
  • Archivers
  • …
• System Architecture
• Tricks and Challenges
• What’s Next
What Is TweetMogaz?
• Innovation and applied research project @ BADR
• Portal for browsing, filtering and searching Arabic tweets
• ... and event detection
• Based on several research papers
  • Magdy W., A. Ali, and K. Darwish. A Summarization Tool for Time-Sensitive Social Media. CIKM 2012
  • Magdy W. TweetMogaz: A News Portal of Tweets. SIGIR 2013
  • Elsawy E., M. Mokhtar, and W. Magdy. TweetMogaz v2: Identifying News Stories in Social Media. CIKM 2014
  • Magdy W. and T. Elsayed. Adaptive Method for Following Dynamic Topics on Twitter. ICWSM 2014
Why Arabic
• 230 million speakers
• 6th most spoken language in the world (native + second-language speakers)
• One of the 6 official UN languages
[Chart: speakers in millions (native and second-language, axis from 0 to 1,200) for Mandarin Chinese, English, Hindi, Spanish, Russian, Arabic, German, Bengali, Portuguese and Japanese]
Main Features
• Classifying
• Browsing
• Searching
• Event detection
• Time machine
System Modules
• Tweets processing module
• Indexing module
• Event detection module
  • Events
  • Active hashtags
• WordCloud generator
• Archivers
  • Short-term
  • Long-term
• Analytics
Tweets Processing Module
• Retrieves tweets (streams and search queries)
• Filters out inappropriate tweets
• Text pre-processing
  • Normalization
    • ي ، ى
    • أ ، ا ، إ ، آ
    • ه ، ة
    • Kashida: ـ ، ْ
  • Removing stop-words
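The normalization groups above could be applied roughly as in the sketch below. The slide lists which letters are grouped together but not which form is kept, so the mapping direction here (alef variants to bare alef, alef maksura to ya, ta marbuta to ha) is an assumption, as are the function names and the tiny stop-word list.

```python
import re

# Assumed canonical forms: the slide only names the letter groups.
_CHAR_MAP = str.maketrans({
    "\u0649": "\u064A",  # alef maksura -> ya
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0629": "\u0647",  # ta marbuta -> ha
})

# Kashida/tatweel (U+0640) and the sukun mark (U+0652) are dropped.
_STRIP = re.compile("[\u0640\u0652]")


def normalize(text):
    """Remove kashida and sukun, then fold letter variants together."""
    return _STRIP.sub("", text).translate(_CHAR_MAP)


# Illustrative stop words only ("fi", "min", "ala"); they are normalized
# so that they match tokens that went through normalize().
STOP_WORDS = {normalize(w) for w in ("\u0641\u064A", "\u0645\u0646", "\u0639\u0644\u0649")}


def preprocess(text):
    """Normalize a tweet and return its tokens without stop words."""
    return [tok for tok in normalize(text).split() if tok not in STOP_WORDS]
```

Normalizing the stop-word list with the same function keeps the lookup consistent with the normalized tokens.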
• Classification at indexing time
  • Multiple classes map to a multi-valued field (politics, sport, religion, etc.)
  • Boolean classifier
  • Adaptive classifier (Naïve Bayes/SVM, experimental)
• Scoring at indexing time
  • Recent (date): latest tweets in a specific category
  • Top (score field): trending tweets (high retweet rate in the past 48 hours)
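As a rough illustration of classification at indexing time, the sketch below treats the Boolean classifier as a set of keyword rules and writes every matching class into one multi-valued field. The rule set, the English keywords and the field name `classes` are hypothetical; the talk does not give the actual rules or schema.

```python
# Hypothetical keyword rules: a tweet belongs to a class if it contains
# at least one of that class's keywords.
CLASS_RULES = {
    "politics": {"election", "parliament", "president"},
    "sport": {"match", "league", "goal"},
}


def classify(tokens):
    """Return every class whose keyword set intersects the tweet tokens."""
    token_set = set(tokens)
    return sorted(name for name, keywords in CLASS_RULES.items()
                  if token_set & keywords)


def to_solr_doc(tweet_id, text, tokens):
    """Build the document to index; 'classes' is a multi-valued field."""
    return {"id": tweet_id, "text": text, "classes": classify(tokens)}
```

A tweet can land in several classes at once, which is why a multi-valued field is a natural fit.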
Score
[Chart: tweet score (0 to 0.018) versus tweet age in seconds (0 to 39k)]
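The talk does not state the scoring formula. One plausible reading of "high retweet rate in the past 48 hours" is retweets per second of tweet age, cut off after 48 hours. The sketch below implements that reading only; the function name and the cut-off behaviour are assumptions, not the system's actual score.

```python
WINDOW_SECONDS = 48 * 3600  # the 48-hour window mentioned for "Top" tweets


def retweet_rate_score(retweet_count, age_seconds):
    """Assumed score: retweets per second of age, zero outside the window."""
    if age_seconds < 0:
        raise ValueError("age_seconds must be non-negative")
    if age_seconds > WINDOW_SECONDS:
        return 0.0
    # Avoid dividing by zero for tweets indexed in their first second.
    return retweet_count / max(age_seconds, 1)
```

Under this reading a tweet's score falls as it ages unless retweets keep arriving.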
Indexing Module
• Responsible for indexing tweets to the corresponding Solr cores
• Realtime core (< 10 mins)
• Up-to-48-hours cores
  • Media: photos, videos
  • Text only and text that contains links
  • All tweets
• Short-term archive cores (> 48 hours and < 30 days)
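The age thresholds above can be read as a routing rule. The sketch below is one possible interpretation: core names are hypothetical, and it assumes each tweet lives in exactly one age tier, with the 48-hour tier split into an "all" core plus either a media or a text core. The 30-day hand-off to the long-term archive comes from the archiving slide.

```python
from datetime import timedelta


def target_cores(age, has_media):
    """Pick the (hypothetically named) cores for a tweet of the given age.

    age is a timedelta since the tweet was created.
    """
    if age < timedelta(minutes=10):
        return ["realtime"]
    if age <= timedelta(hours=48):
        return ["recent_all", "recent_media" if has_media else "recent_text"]
    if age < timedelta(days=30):
        return ["short_term_archive"]
    return ["long_term_archive"]
```

Whether the realtime core overlaps with the 48-hour cores in the real system is not stated, so the exclusive tiers here are a simplification.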
Event Detection Module
• Responsible for detecting events
  • Elsawy E., M. Mokhtar, and W. Magdy. TweetMogaz v2: Identifying News Stories in Social Media. CIKM 2014
• Feature-pivot (term) approach
• Within an 8-hour window
  • Facet on the processed text, using min_count
  • Build facets for stems
  • For each facet, calculate the distance to all other facets: O(n²)
• Clusters are created based on a distance threshold (fuzzy clusters)
  • Distance threshold: 0.4 (experimental)
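A minimal sketch of the pairwise clustering step follows. The slides do not name the distance measure, so Jaccard distance over the sets of tweets containing each stem is used here as a stand-in. The function names and the `min_count` default are illustrative; only the 0.4 threshold comes from the slide.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two sets of tweet ids."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0


def fuzzy_clusters(term_docs, threshold=0.4, min_count=2):
    """Group stems whose tweet sets lie within the distance threshold.

    term_docs maps each stem to the set of tweet ids containing it.
    Clusters are fuzzy: a stem may appear in more than one cluster.
    """
    terms = {t: docs for t, docs in term_docs.items() if len(docs) >= min_count}
    neighbours = {t: {t} for t in terms}
    for a, b in combinations(terms, 2):  # every pair once: O(n^2)
        if jaccard_distance(terms[a], terms[b]) <= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    clusters = {frozenset(members) for members in neighbours.values()
                if len(members) > 1}
    return sorted(sorted(cluster) for cluster in clusters)
```

For example, stems that occur in almost the same tweets end up together, while a stem below `min_count` never enters the comparison, which keeps n small.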
• Cluster enrichment
  • Enhances clusters with fewer than 6 terms
  • Runs a Solr AND query with all keywords and selects the terms with the highest TF-IDF to enrich the cluster
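The enrichment step could look roughly like the sketch below. It builds the AND query string, then ranks the terms of the matching tweets by a simple TF-IDF computed in Python. The field name `text`, the helper names, and the exact TF-IDF variant are assumptions; the real system may well get these statistics from Solr itself.

```python
import math
from collections import Counter


def and_query(keywords, field="text"):
    """Build a Solr query requiring every cluster keyword."""
    return f"{field}:(" + " AND ".join(sorted(keywords)) + ")"


def enrich(cluster, matched_docs, doc_freq, total_docs, target_size=6):
    """Pad a small cluster with the highest TF-IDF terms of matching tweets.

    matched_docs: token lists of the tweets returned by the AND query.
    doc_freq: term -> number of indexed tweets containing it.
    """
    if len(cluster) >= target_size:
        return list(cluster)
    tf = Counter(tok for doc in matched_docs for tok in doc if tok not in cluster)

    def tfidf(term):
        return tf[term] * math.log(total_docs / (1 + doc_freq.get(term, 0)))

    ranked = sorted(tf, key=lambda term: (-tfidf(term), term))
    return list(cluster) + ranked[: target_size - len(cluster)]
```

Terms that are frequent in the matching tweets but rare in the index overall rank highest, which is the usual motivation for TF-IDF here.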
• Cluster de-duplication over time
  • Search using the keywords of each detected cluster
  • For each response result, build a stem frequency vector
  • Compare the two vectors for similarity (cosine = 0.5: experimental)
  • Old clusters are updated to maintain the chronological order of events
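The vector comparison used for de-duplication can be sketched as follows. It assumes each search result is already available as a list of stems, and that a cosine similarity at or above 0.5 means "same event". The function names are illustrative.

```python
import math
from collections import Counter


def stem_vector(results):
    """Stem frequency vector over all tweets returned for a cluster."""
    return Counter(stem for stems in results for stem in stems)


def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def is_duplicate(new_results, old_results, threshold=0.5):
    """Treat two clusters as the same event if their vectors are close."""
    return cosine(stem_vector(new_results), stem_vector(old_results)) >= threshold
```

When a new cluster is flagged as a duplicate, the older cluster would be updated rather than a new event being created, which is how the chronological order is preserved.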
• Relevant tweets retrieval
  • Query against the 48-hour cores
• Active hashtag detection
  • Separate field added at index time
  • Stored in the events core with type hashtag
  • Build normalized top-hashtag facets every 24 hours for the past week
  • Query Solr for hashtags older than 1 week and eliminate them
WordCloud
Word Cloud: Bi-gram detection
• Facet for a specific class
• Facets next to each other, within a specific threshold, tend to be a bi-gram
• For example: ريال مدريد - كأس العالم (Real Madrid - World Cup)
• min_count applies
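One way to read the bi-gram heuristic: two words that always appear together get nearly identical facet counts, so they sit next to each other in the count-sorted facet list. The sketch below pairs adjacent facets whose counts differ by less than a relative gap. The gap value, the `min_count` default and the function name are assumptions, and the result is only a list of candidate pairs with no guarantee of word order.

```python
def bigram_candidates(facets, max_gap=0.05, min_count=10):
    """Pair adjacent facets whose counts are within a relative gap.

    facets: list of (term, count) tuples sorted by count, descending.
    """
    kept = [(term, count) for term, count in facets if count >= min_count]
    pairs = []
    for (t1, c1), (t2, c2) in zip(kept, kept[1:]):
        largest = max(c1, c2)
        if largest > 0 and abs(c1 - c2) / largest <= max_gap:
            pairs.append((t1, t2))
    return pairs
```

Applying `min_count` first keeps rare terms, whose counts are close to each other by chance, from producing spurious pairs.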
Archiving Module
• Why?
  • Space is finite!
  • Faster searching on the recent cores
• Short-term archiving
  • Archives tweets that are older than 48 hours
  • Same Solr instance
• Long-term archiving
  • Archives tweets that are older than 30 days
  • Separate Solr instance
System Architecture
• SolrCloud
  • 2 shards
  • Replication factor of 2
  • ZooKeeper ensemble for distribution management
  • SolrJ API
• Front-end
  • Node.js
  • AngularJS (web and mobile web)
• Long-term archive
  • Separate Solr instance
Analytics and Visualization
• Banana dashboards
  • Deployed on both realtime and archive
  • Insights on the tweet distribution per class and on trends over time of specific search queries
  • Realtime on production with the ‘Auto-refresh’ feature
  • Users with highest retweets
Challenges and Tricks
Tricks
• Archiving
  • Initially developed on Solr 4.4
  • Upgraded to 4.7+ for deep paging
• Archiver syncing
  • The short-term archiver is writing while the long-term archiver is reading
  • The two have to be synced when deep paging is in use
[Diagram: short-term cores and long-term cores, with the short-term archiver (W) and the long-term archiver (R)]
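Deep paging in Solr 4.7+ is done with the `cursorMark` parameter: the first request sends `cursorMark=*`, the sort must include the unique key field, and each response returns a `nextCursorMark` to use for the next page. The sketch below shows that loop as an archiver might use it. The `search` callable, the `id` key field and the example query are assumptions, and the sketch does not address the writer/reader syncing issue noted above.

```python
def iter_all_docs(search, query, rows=500):
    """Yield every document matching the query using Solr cursor paging.

    search(params) is any callable that sends the params to Solr's
    /select handler and returns the parsed JSON response as a dict.
    """
    cursor = "*"
    while True:
        response = search({
            "q": query,
            "sort": "id asc",      # cursor paging needs the unique key in the sort
            "rows": rows,
            "cursorMark": cursor,
        })
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged cursor means no more results
            break
        cursor = next_cursor
```

A long-term archiver could call this with a date-range query such as `created_at:[* TO NOW-30DAYS]` (hypothetical field name) and copy each yielded document to the separate Solr instance.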
Challenges
• Twitter (micro-blog) text is very short
• Arabic has many dialects: colloquial, formal, regional variations
Next Steps
• Integrating an adaptive classifier that can handle the characteristics of micro-blogs
• Search query trend over time
• Engage system users
• Integrate R for statistical processing (classification, detection, …)