Computational Framework for Generating Visual Summaries of Topical Clusters in Twitter Streams*

Authors: Miray Kas, Bongwon Suh
Presenter: Sebastian Alfers - HTW Berlin

Semantic Modeling

* http://link.springer.com/chapter/10.1007%2F978-3-319-02993-1_9

Upload: sebastian-alfers

Post on 14-Jul-2015

TRANSCRIPT

Visual Summaries of Twitter Streams

http://flowingdata.com/wp-content/uploads/2010/02/treemap-revised1.gif

http://www.infobarrel.com/media/image/54054.jpg

• Step 1: get & pre-process data
- Keywords
- Stream Tweets
- Preprocessing / Cleaning

• Step 2: construct graph & clustering
- Construct Graph
- Clustering

• Step 3: extract keywords & summarize
- Select Relevant Clusters
- Extract Topical Keywords
- Visual Cluster Summary

Input: Keywords

• initial set of keywords

• similar to Twitter Search

Step 1: Stream Tweets

• HTTP-based API
- JSON, REST

• OAuth + HTTP

• here: Java library with Scala and Play! Framework

Step 1: Preprocessing

• transform tweets
- easy-to-analyze / clean format

• Process of cleaning:
1. lowercase
2. remove URLs, user mentions and stop words, like @user, "a" or "123"
3. remove special characters (#,.)
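The three cleaning steps can be sketched in a few lines of Python. This is only an illustration, not the code used in the talk: the function name `clean_tweet` and the tiny `STOP_WORDS` set are assumptions, and stop words are filtered last so that a hashtag like "#scala" has already lost its "#" by then.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run needs a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to", "and", "for", "with", "on"}

def clean_tweet(text):
    """Apply the cleaning steps from the slide to one tweet and return its tokens."""
    text = text.lower()                          # 1. lowercase
    text = re.sub(r"https?://\S+", " ", text)    # 2. remove URLs
    text = re.sub(r"@\w+", " ", text)            # 2. remove user mentions
    text = re.sub(r"[^a-z0-9\s-]", " ", text)    # 3. remove special characters (keeps '-')
    tokens = [t.strip("-") for t in text.split()]
    # 2. remove stop words and purely numeric tokens such as "123"
    return [t for t in tokens if t and t not in STOP_WORDS and not t.isdigit()]

tokens = clean_tweet("Akka-HTTP based Reactive Streams with #Scala @user http://t.co/x 123")
```

Keeping the hyphen is a choice made here so that a token such as "akka-http" survives as one word.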

Step 1: Preprocessing

• Example keywords: SCALA, Scala, scala and #scala all become scala

• LingPipe library*
- remove tense and plurals

* http://alias-i.com/lingpipe/
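To make the normalization concrete, here is a rough Python sketch. It does not use LingPipe: the plural rule below is a crude stand-in assumed for illustration, it ignores tense entirely, and it will mis-handle words such as "this". The name `normalize_keyword` is hypothetical.

```python
def normalize_keyword(word):
    """Map variants such as 'SCALA', 'Scala' and '#scala' to one form."""
    word = word.lower().lstrip("#")
    # Crude plural stripping: an assumption standing in for a real stemmer.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word

variants = {normalize_keyword(w) for w in ["SCALA", "Scala", "scala", "#scala"]}
```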

Step 1: Preprocessing

• Example tweets (after cleaning)
- new york time reactive programming tool scala scale techrepublic
- akka-http based reactive stream scala scaladay

Step 2: Graph

• Word Co-Occurrence Graph
- Word = Node (unigrams)
- Tweet = Link between Nodes

• Example
- tweet: akka-http based reactive stream scala scaladay
- nodes: akka-http, based, reactive, stream, scala, scaladay
- links: between the words of the tweet

Step 2: Graph

• Co-Occurrence Graph
- connect nodes (words) within and between tweets
- add strength (weight) and cost (distance)

• Words that co-occur more frequently
- increase the strength
- decrease the cost
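A small Python sketch of such a graph, under two assumptions that are not stated on the slide: strength is the number of tweets in which two words co-occur, and cost is its reciprocal. The reciprocal is chosen because it reproduces the distances 0.5 and 1 used in the clustering slides. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_graph(tweets):
    """tweets: list of token lists. Returns (strength, cost) keyed by sorted word pairs."""
    strength = Counter()
    for tokens in tweets:
        # Every unordered pair of distinct words in one tweet co-occurs once.
        for a, b in combinations(sorted(set(tokens)), 2):
            strength[(a, b)] += 1
    # Assumption: cost (distance) is the reciprocal of strength.
    cost = {pair: 1.0 / s for pair, s in strength.items()}
    return strength, cost

tweets = [
    ["new", "york", "time", "reactive", "programming", "tool", "scala", "scale", "techrepublic"],
    ["akka-http", "based", "reactive", "stream", "scala", "scaladay"],
]
strength, cost = build_cooccurrence_graph(tweets)
```

With the two example tweets, "reactive" and "scala" appear together twice, so their link is the strongest and cheapest one in the graph.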

Step 2: Graph

• Summary
- [Figure: two word graphs are added (+) to give one combined graph (=); visible labels: reactive, scala, based, stream, programming, uses]

Step 2: Clustering

• Here: "complete link (max) clustering" algorithm*
- hierarchical clustering algorithm that forms clusters by merging subgroups

• Group words from tweets
- words that frequently appear together belong to one topic
- cluster = topic

* http://nlp.stanford.edu/IR-book/html/htmledition/single-link-and-complete-link-clustering-1.html

Step 2: Clustering

• Here: "complete link (max) clustering" algorithm

• each node starts as an individual cluster
- cluster = node = word in a tweet

• close clusters are successively merged
- distance between two clusters = highest cost between any two of their words
- the two clusters with the lowest such distance are merged next
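The merging procedure can be sketched as a naive agglomerative loop in Python. This is a minimal illustration of complete-link clustering in general, not the implementation from the paper; the function name, the `missing` default for unlinked words and the example costs are assumptions.

```python
from itertools import combinations

def complete_link(words, cost, missing=float("inf")):
    """Return the merge history [(distance, merged_cluster), ...]."""
    def dist(a, b):
        # Look the pair up in either order; unlinked words are `missing` apart.
        return cost.get((a, b), cost.get((b, a), missing))

    def cluster_dist(c1, c2):
        # Complete link: the distance of the farthest pair of members.
        return max(dist(a, b) for a in c1 for b in c2)

    clusters = [frozenset([w]) for w in words]  # each word starts as its own cluster
    merges = []
    while len(clusters) > 1:
        d, c1, c2 = min(
            ((cluster_dist(x, y), x, y) for x, y in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((d, c1 | c2))
    return merges

# Illustrative costs: only "reactive"/"scala" co-occur twice (cost 0.5).
cost = {
    ("reactive", "scala"): 0.5, ("based", "reactive"): 1.0, ("based", "scala"): 1.0,
    ("based", "stream"): 1.0, ("reactive", "stream"): 1.0, ("scala", "stream"): 1.0,
}
merges = complete_link(["reactive", "scala", "based", "stream"], cost)
```

The loop recomputes all pairwise cluster distances in every round, which is fine for a handful of words but too slow for a real vocabulary.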

Step 2: Clustering

• Example: graph representation and cluster representation
- nodes: reactive, scala, based, stream
- link costs (= distances) shown: 0.5 and 1
- clusters are merged step by step at distances 0.5, 1, 1 and finally 2

Step 2: Clustering

• Final step: Dendrogram
- tree diagram
- represents the arrangement of hierarchical clusters

• Why?
- easy to apply threshold metrics

Step 2: Clustering

• Final step: Dendrogram
- closer to the root = lower similarity
- first cluster: reactive, scala
- other leaves: new, york, programming, …, akka-http, based, stream, scaladay
- thresholds cut the tree into clusters
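Applying a threshold to a dendrogram can be sketched as follows, assuming the tree is stored as a merge history of (distance, merged cluster) pairs in merge order. That storage format, the function name and the example distances are assumptions made for illustration.

```python
def cut_dendrogram(words, merges, threshold):
    """Return the clusters that remain when merges above `threshold` are ignored."""
    clusters = {frozenset([w]) for w in words}
    for distance, merged in merges:
        if distance > threshold:
            continue
        # Replace the parts of `merged` by the merged cluster itself.
        clusters = {c for c in clusters if not c <= merged}
        clusters.add(merged)
    return clusters

words = ["reactive", "scala", "based", "stream"]
# Illustrative merge history with distances 0.5, 1 and 1.
merges = [
    (0.5, frozenset({"reactive", "scala"})),
    (1.0, frozenset({"based", "stream"})),
    (1.0, frozenset({"reactive", "scala", "based", "stream"})),
]
tight = cut_dendrogram(words, merges, threshold=0.75)
loose = cut_dendrogram(words, merges, threshold=1.0)
```

A low threshold keeps only the closely related pair together, while a higher one collapses everything into a single cluster near the root.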

Step 3: Extract topical keywords

• Preprocessing / Cleaning

• Construct Graph

• Extract Topical Keywords

Step 3: Extract topical keywords

• Keywords
- express a topic
- are frequently used
- summarize the tweets' content

• Questions
- "What are the relevant keywords?"
- "In what clusters do they appear?"

Step 3: Extract topical keywords

• How?
- "topical tweets" vs. "general tweets"

• frequent in topical tweets
- search keywords "reactive scala"

• not frequent in general tweets
- general Twitter stream (all tweets)

Step 3: Extract topical keywords

• Strength of a word
- is a word relevant for that topical cluster?

• Compare two frequencies (each low or high)
- frequency in topical tweets
- frequency in general tweets

• ✔ relevant for topic / cluster
- high frequency in topical tweets, low frequency in general tweets
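One way to turn this comparison into a number is a ratio of relative frequencies. The slides do not give the actual formula, so the score below, its smoothing term and the function name are assumptions chosen only to show the idea: high in topical tweets and low in general tweets yields a high score.

```python
from collections import Counter

def topical_strength(topical_tweets, general_tweets, smoothing=1.0):
    """Score words by tweet frequency in topical tweets relative to general tweets.

    Both arguments are non-empty lists of token lists.
    """
    topical = Counter(w for t in topical_tweets for w in set(t))
    general = Counter(w for t in general_tweets for w in set(t))
    n_topical, n_general = len(topical_tweets), len(general_tweets)
    scores = {}
    for word, count in topical.items():
        p_topical = count / n_topical
        # Smoothing keeps words unseen in the general stream from dividing by zero.
        p_general = (general[word] + smoothing) / (n_general + smoothing)
        scores[word] = p_topical / p_general
    return scores

topical_tweets = [["reactive", "scala"], ["scala", "stream"], ["scala", "new"]]
general_tweets = [["new", "york"], ["new", "day"], ["good", "day"], ["new", "scala"]]
scores = topical_strength(topical_tweets, general_tweets)
```

In this toy data "new" is common in the general stream, so it scores lower than "scala" even though both appear in topical tweets.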

Step 3: Extract topical keywords

• Result
- topical strength for each keyword
- sort them by relevancy
- select the top 20 keywords

• choose clusters that contain these words
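The selection step can be sketched in Python as below, assuming the strengths are held in a dict and each cluster is a set of words. The function name and the example values are hypothetical.

```python
def select_keywords_and_clusters(scores, clusters, k=20):
    """Pick the k strongest keywords and the clusters that contain any of them."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    top_set = set(top)
    relevant = [c for c in clusters if top_set & set(c)]
    return top, relevant

# Illustrative strengths and clusters.
scores = {"scala": 2.5, "reactive": 1.6, "new": 0.4, "york": 0.3}
clusters = [{"reactive", "scala"}, {"new", "york"}, {"based", "stream"}]
top, relevant = select_keywords_and_clusters(scores, clusters, k=2)
```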

Final Step

• Combine clusters and keywords

• create visual summary

Final Step

• Keywords ordered from high relevancy to low relevancy
- Keyword1
- Keyword2
- Keyword3
- Keyword4
- …

Final Step

• Treemap visualisation
- color = cluster
- area of word = frequency of word

• Word cloud visualisation
- color = cluster
- size of word = frequency of word
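The "area of word = frequency of word" rule can be illustrated with the simplest possible treemap layout, a single row of slices. This is only a sketch: the function name is hypothetical, real treemaps usually use a squarified layout, and the cluster colour is left out.

```python
def slice_layout(frequencies, width=1.0, height=1.0):
    """Place one rectangle per word side by side; area is proportional to frequency."""
    total = sum(frequencies.values())
    rects, x = {}, 0.0
    for word, freq in sorted(frequencies.items(), key=lambda kv: -kv[1]):
        w = width * freq / total
        rects[word] = (x, 0.0, w, height)  # (x, y, width, height)
        x += w
    return rects

# Illustrative word frequencies.
rects = slice_layout({"scala": 6, "reactive": 3, "stream": 1})
```

The same proportional scaling could drive the font size in a word cloud instead of the rectangle width.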

Final Notes

• 4 million topical tweets

• 15 days

• User study
- Treemap vs. Word Cloud

Thank You!

• Discussion
- Losing precision while cleaning tweets
- Losing meaning when removing stop words like "not" (negation)
- Unigram vs. multigram?
- ?