TweetMogaz - The Arabic Tweets Platform: Presented by Ahmed Adel, BADR

OCTOBER 13-16, 2015 AUSTIN, TX



TweetMogaz: The Arabic Tweets Platform

Ahmed Adel
Team Lead, BADR

Who Am I?

• B.Sc. in Engineering from Alexandria University
• BADR co-founder
• Now: part-time Team Lead @ BADR
• 8+ years of experience in software development
• Mainly Java and JavaScript
• Solr, Hadoop, Hive, ...

BADR

• Established software house in Egypt
• Founded in 2006
• Provides big data consulting services and solutions
• Machine Learning, NLP, Data Science, ...
• Hadoop, Solr, Spark, Hive, Flume, Incorta, ...

Agenda

• What is TweetMogaz
• System Modules
  • Tweets processing
  • Indexing
  • Event detection
  • Archivers
  • …
• System Architecture
• Tricks and Challenges
• What’s Next

What Is TweetMogaz?

• Innovation and applied research project @ BADR
• Portal for browsing, filtering, and searching Arabic tweets
• ... and event detection
• Based on several research papers
  • Magdy W., A. Ali, and K. Darwish. A Summarization Tool for Time-Sensitive Social Media. CIKM 2012
  • Magdy W. TweetMogaz: A News Portal of Tweets. SIGIR 2013
  • Elsawy E., M. Mokhtar, and W. Magdy. TweetMogaz v2: Identifying News Stories in Social Media. CIKM 2014
  • Magdy W. and T. Elsayed. Adaptive Method for Following Dynamic Topics on Twitter. ICWSM 2014

Why Arabic

• 230 million speakers
• 6th largest language in the world (native + 2nd-language speakers)
• One of the 6 UN official languages

[Bar chart: speakers in millions (native and 2nd language, axis from 0 to 1,200) for Mandarin Chinese, English, Hindi, Spanish, Russian, Arabic, German, Bengali, Portuguese, and Japanese]

Main Features

• Classifying
• Browsing
• Searching
• Event detection
• Time machine

System Modules

• Tweets processing module
• Indexing module
• Event detection module
  • Events
  • Active hashtags
• WordCloud generator
• Archivers
  • Short-term
  • Long-term
• Analytics

Tweets Processing Module

• Retrieves tweets (streams and search queries)
• Filters out inappropriate tweets
• Text pre-processing
  • Normalization
    • ي ، ى
    • أ ، ا ، إ ، آ
    • ه ، ة
    • Kashida: ـ ، ْ
  • Removing stop-words
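The normalization and stop-word steps above could be sketched roughly as follows. This is only an illustration: the slide lists the character groups but not the direction of each mapping, so the folding targets below are assumptions, and the stop-word list and function names are hypothetical.

```python
# Folding table for the character groups listed on the slide.
# Assumption: each group is folded to the bare form shown on the right.
FOLD = str.maketrans({
    "\u0649": "\u064a",  # alef maqsura -> ya
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0629": "\u0647",  # ta marbuta -> ha
    "\u0640": None,      # kashida (tatweel) is dropped
    "\u0652": None,      # sukun is dropped
})


def normalize(text: str) -> str:
    """Fold the character variants and strip kashida and sukun."""
    return text.translate(FOLD)


# Hypothetical, tiny stop-word list; it is normalized as well so that it
# matches tokens after normalization.
STOP_WORDS = {normalize(w) for w in ("في", "من", "على", "إلى")}


def preprocess(text: str) -> list[str]:
    """Normalize, split on whitespace, and drop stop-words."""
    return [tok for tok in normalize(text).split() if tok not in STOP_WORDS]
```

Normalizing the stop-word list with the same table keeps the two steps consistent: a stop-word that contains one of the folded characters would otherwise never match a normalized token.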

• Classification at indexing time
  • Multiple classes map to a multi-valued field (politics, sport, religion, etc.)
  • Boolean classifier
  • Adaptive classifier (Naïve Bayes/SVM, experimental)
• Scoring at indexing time
  • Recent (date): latest tweets in a specific category
  • Top (score field): trending tweets (high retweet rate in the past 48 hours)
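A minimal sketch of how a Boolean classifier could fill a multi-valued class field at indexing time. The keyword lists, the field names (`classes`, `text`), and both function names are hypothetical; the slides do not show the actual rules.

```python
# Hypothetical keyword rules: a tweet belongs to a class if it contains
# at least one of that class's keywords (a Boolean OR over the keywords).
CLASS_RULES = {
    "politics": {"election", "parliament", "minister"},
    "sport": {"match", "league", "goal"},
    "religion": {"mosque", "prayer"},
}


def classify(tokens):
    """Return every class whose rule matches; a tweet may get several."""
    token_set = set(tokens)
    return sorted(label for label, keywords in CLASS_RULES.items()
                  if token_set & keywords)


def to_document(tweet_id, tokens):
    """Build the document to index, with the classes as a multi-valued field."""
    return {
        "id": tweet_id,
        "text": " ".join(tokens),
        "classes": classify(tokens),  # list value -> multi-valued field
    }
```

Because the classes are computed once at indexing time, browsing a category becomes a plain filter on the `classes` field rather than a per-query classification step.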

Score

[Plot: score (y-axis, 0 to 0.018) against tweet age in seconds (x-axis, 0 to 39k)]

Indexing Module

• Responsible for indexing tweets to the corresponding Solr cores
• Realtime core (< 10 mins)
• Up-to-48-hours cores
  • Media: photos, videos
  • Text only and text that contains links
  • All tweets
• Short-term archive cores (> 48 hours and < 30 days)
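The routing of a tweet to cores by age could look roughly like the sketch below. The core names are hypothetical, and two points are assumptions rather than statements from the slides: that a realtime tweet is also written to the 48-hour cores, and that anything older than 30 days belongs to the long-term archive.

```python
from datetime import timedelta

REALTIME = timedelta(minutes=10)
RECENT = timedelta(hours=48)
SHORT_TERM = timedelta(days=30)


def cores_for(age, has_media):
    """Return the (hypothetical) core names a tweet of this age belongs to."""
    if age > SHORT_TERM:
        return ["long_term_archive"]
    if age > RECENT:
        return ["short_term_archive"]
    # Within 48 hours: the "all tweets" core plus the media or text core.
    cores = ["recent_all", "recent_media" if has_media else "recent_text"]
    if age < REALTIME:
        cores.insert(0, "realtime")
    return cores
```

For example, a five-minute-old tweet with a photo would map to the realtime, all-tweets, and media cores, while a three-day-old tweet would map only to the short-term archive.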

Event Detection Module

• Responsible for detecting events
• Elsawy E., M. Mokhtar, and W. Magdy. TweetMogaz v2: Identifying News Stories in Social Media. CIKM 2014
• Feature-pivot (term) approach

• Clusters are created based on a distance threshold (fuzzy clusters)
• Distance threshold: 0.4 (experimental)
• In an 8-hour window
  • Faceting on the processed text, using min_count
  • Builds facets for stems
  • For each facet, calculate the distance to all other facets: O(n²)
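The pairwise step above can be sketched as follows. The slides do not name the distance measure, so this sketch assumes Jaccard distance between the sets of tweet IDs in which each stem occurs; `postings`, `jaccard_distance`, and `fuzzy_clusters` are hypothetical names.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|; 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def fuzzy_clusters(postings, threshold=0.4):
    """Group stems whose tweet sets lie within `threshold` of a seed stem.

    `postings` maps each stem facet to the set of tweet IDs containing it.
    Every stem acts as a seed once, so the loop is O(n^2) in the number of
    facets, and a stem may end up in more than one cluster (fuzzy clusters).
    """
    terms = sorted(postings)
    clusters = set()
    for seed in terms:
        members = frozenset(
            t for t in terms
            if jaccard_distance(postings[seed], postings[t]) <= threshold
        )
        if len(members) > 1:  # ignore stems with no close neighbour
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

Applying min_count before this step matters because the quadratic loop runs over every surviving facet; rare stems are the cheapest ones to discard.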

• Cluster enrichment
  • Enhancing clusters with fewer than 6 terms
  • Running a Solr AND query with all keywords and selecting the terms with the highest TF-IDF to enrich the cluster
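An in-memory sketch of the enrichment idea, standing in for the Solr AND query. The TF-IDF weighting used here (raw term frequency in the matching tweets times log(N/df)) and the choice to top the cluster up to 6 terms are assumptions; `enrich` and its parameters are hypothetical names.

```python
import math
from collections import Counter


def enrich(cluster, docs, min_terms=6):
    """Add the highest-scoring co-occurring terms to a small cluster.

    `docs` is a list of token lists. Tweets containing all cluster
    keywords play the role of the AND query result.
    """
    if len(cluster) >= min_terms:
        return list(cluster)
    keywords = set(cluster)
    matching = [d for d in docs if keywords <= set(d)]
    # Term frequency inside the matching tweets, excluding the keywords.
    tf = Counter(t for d in matching for t in d if t not in keywords)
    # Document frequency over the whole collection.
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    ranked = sorted(tf, key=lambda t: (-tf[t] * math.log(n / df[t]), t))
    return list(cluster) + ranked[:min_terms - len(cluster)]
```

In a real deployment the term statistics would come from the index rather than from a Python list, but the ranking logic would stay the same.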

• Cluster de-duplication over time
  • Search using the cluster keywords of each detected cluster
  • For each response result, build a stem frequency vector
  • Compare the two vectors for similarity (cosine = 0.5, experimental)
  • Old clusters are updated to maintain the chronological order of events
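The vector comparison above can be sketched with plain stem counts. Treating a similarity of 0.5 or more as a duplicate is an assumption (the slide only gives the value 0.5), and `is_duplicate` is a hypothetical name.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two stem-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(new_stems, old_stems, threshold=0.5):
    """True if the two result sets describe the same event (assumed >=)."""
    return cosine(Counter(new_stems), Counter(old_stems)) >= threshold
```

Here `new_stems` and `old_stems` would be the stems collected from the search results of the newly detected cluster and of an existing one.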

• Relevant tweets retrieval
  • Query against the 48-hour cores

• Active hashtag detection
  • Separate field added at index time
  • Stored in the events core with type hashtag
  • Build normalized top hashtag facets every 24 hours for the past week
  • Query Solr for hashtags older than 1 week and eliminate them

Word Cloud: Bi-gram Detection

• Facet for a specific class
• Facets next to each other, within a specific threshold, tend to form a bi-gram
• For example: ريال مدريد - كأس العالم (Real Madrid - World Cup)
• min_count applies
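One possible reading of the heuristic above, sketched under stated assumptions: "next to each other" is taken to mean adjacent in the facet list sorted by count, and the threshold is taken to be a maximum relative gap between the two counts. Both interpretations, the `max_gap` value, and the function name are hypothetical.

```python
def candidate_bigrams(facets, min_count=2, max_gap=0.1):
    """Pair up facet terms that are adjacent by count and nearly equal.

    `facets` maps a term to its facet count within one class. Two words
    that almost always appear together get almost the same count, so they
    end up next to each other in the sorted facet list.
    """
    ranked = sorted(
        ((term, count) for term, count in facets.items() if count >= min_count),
        key=lambda item: (-item[1], item[0]),
    )
    pairs = []
    for (t1, c1), (t2, c2) in zip(ranked, ranked[1:]):
        if (c1 - c2) / c1 <= max_gap:
            pairs.append((t1, t2))
    return pairs
```

The pairs come out in count order, not reading order, so a further check against the tweet text would be needed to decide which word comes first.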

Archiving Module

• Why?
  • Space is finite!
  • Faster searching of recent cores
• Short-term archiving
  • Archives tweets that are older than 48 hours
  • Same Solr instance
• Long-term archiving
  • Archives tweets that are older than 30 days
  • Separate Solr instance

System Architecture

• SolrCloud
  • 2 shards
  • Replication factor of 2
  • ZooKeeper ensemble for distribution management
  • SolrJ API
• Front-end
  • Node.js
  • AngularJS (web and mobile web)
• Long-term archive
  • Separate Solr instance

Analytics and Visualization

• Banana dashboards
• Deployed on both realtime and archive
• Insights on the tweet distribution per class and on trends over time of specific search queries
• Realtime on production with the ‘Auto-refresh’ feature
• Users with the highest retweets

Challenges and Tricks

Tricks

• Archiving
  • Initially developed on Solr 4.4
  • Upgraded to 4.7+ for deep paging
• Archiver syncing
  • The short-term archiver is writing while the long-term archiver is reading
  • The two have to be synced in case of deep paging

[Diagram: short-term cores, long-term cores, short-term archiver (W), long-term archiver (R)]
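Deep paging in Solr 4.7+ is cursor-based: the first request sends `cursorMark=*`, each response returns a `nextCursorMark`, and reading stops when the returned mark equals the one that was sent. The loop below sketches that protocol against an in-memory stand-in; `fake_search` and `read_all` are hypothetical names, and a real archiver would issue Solr queries (sorted on the unique key) instead.

```python
def fake_search(docs, cursor_mark, rows):
    """In-memory stand-in for a Solr query sorted by id with a cursorMark."""
    ordered = sorted(docs, key=lambda d: d["id"])
    if cursor_mark != "*":
        # Resume strictly after the last id that was already returned.
        ordered = [d for d in ordered if d["id"] > cursor_mark]
    page = ordered[:rows]
    next_mark = page[-1]["id"] if page else cursor_mark
    return page, next_mark


def read_all(search, rows=2):
    """Yield every document by following the cursor until it stops moving."""
    cursor = "*"
    while True:
        page, next_cursor = search(cursor, rows)
        yield from page
        if next_cursor == cursor:  # unchanged mark: nothing left to read
            break
        cursor = next_cursor
```

A reader built this way only sees documents that sort after its current position, which is one reason a concurrent writer and reader on the same cores need to be coordinated.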

Challenges

• Tweets (micro-blogs) are very short texts
• Arabic has many dialects: colloquial, formal, regional variations

Next Steps

• Integrating an adaptive classifier that can handle the characteristics of micro-blogs
• Search query trends over time
• Engage system users
• Integrate R for statistical processing (classification, detection, …)

Thank you!

Ahmed Adel
email: [email protected]
twitter: @ahmadadel
website: badrit.com