Discovering Hot Topics using Twitter Streaming Data “Social Topics Detection and Geographic Clustering” Hwi-Gang Kim, Seongjoo Lee, and Sunghyon Kyeong† Mathematical Analytics Team, National Institute for Mathematical Sciences 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013) Niagara Falls, Canada, August 25-28, 2013 †: corresponding author

Page 1: Discovering Hot Topics using Twitter Streaming Data

Discovering Hot Topics using Twitter Streaming Data “Social Topics Detection and Geographic Clustering”

Hwi-Gang Kim, Seongjoo Lee, and Sunghyon Kyeong†

Mathematical Analytics Team, National Institute for Mathematical Sciences

2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013)

Niagara Falls, Canada, August 25-28, 2013

†: corresponding author

Page 2: Discovering Hot Topics using Twitter Streaming Data

Outline

• Introduction

• Dataset

• Analysis Methods and Results

• Conclusion

Page 3: Discovering Hot Topics using Twitter Streaming Data

Introduction

Page 4: Discovering Hot Topics using Twitter Streaming Data

Role of SNSs

• Reporting breaking news (Twitter Journalism)

• Expressing one’s feelings and emotions

• Communication tool in daily life

• Research tools for studying social behaviors, human communication, detection of a flu epidemic, and text mining

Page 5: Discovering Hot Topics using Twitter Streaming Data

In this study

• The Twitter streaming API and MongoDB were used for data collection.

• We propose a measure for detecting the social hot topics of the day.

• Geographic communities were detected for the weather-related keywords and visualized using Google Fusion Tables.

Page 6: Discovering Hot Topics using Twitter Streaming Data

Related Works

• Mei et al. (2006) proposed probabilistic latent semantic indexing (PLSI) to discover spatiotemporal theme patterns on weblogs.

• Wang et al. (2007) proposed the location aware topic model (LATM) to incorporate the relationship between locations and words.

• Yin et al. (2011) proposed Latent Geographical Topic Analysis (LGTA), a novel location-text joint model.

• In general, the EM algorithm takes a huge amount of computing time, and the previous studies did not directly classify locations by topics.

EM: expectation maximization

Page 7: Discovering Hot Topics using Twitter Streaming Data

Dataset

Page 8: Discovering Hot Topics using Twitter Streaming Data

Data collection

• Geo-tagged public statuses tweeted in the United States.

• A total of ~19 million geo-tagged Twitter statuses were obtained from March 23 to April 1, 2013.

• This period includes events such as a spring snowfall, the same-sex marriage issue before the US court, a World Cup qualifier match between the US and Mexico, basketball games, and Easter.

[Figure: Twitter streaming data in the US]

Page 9: Discovering Hot Topics using Twitter Streaming Data

MongoDB Sharding

[Figure: MongoDB sharded cluster. A client application connects through a MongoS router to three shards (Shard1, Shard2, Shard3), each a replica set of three Mongod instances, together with three config servers (C1, C2, C3), each a Mongod instance.]

Page 10: Discovering Hot Topics using Twitter Streaming Data

Analysis Methods and Results

Page 11: Discovering Hot Topics using Twitter Streaming Data

Word frequency

$$wf^{\omega} = \sum_{t \in T} \sum_{s \in S} f^{\omega}_{ts}$$

$f^{\omega}_{ts}$: frequency function for a word ($\omega$) in a US state ($s$) at time ($t$).

The most frequently tweeted words are not social topics but emotional words expressing one’s feelings.

[Figure: Top 5 words and Easter]
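
The word-frequency sum on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy records, the tokenization into word lists, and the helper names `count_f` and `word_frequency` are hypothetical and do not come from the slides.

```python
from collections import Counter

# Hypothetical toy records: (day, US state, tokenized tweet text).
tweets = [
    ("2013-03-24", "MO", ["snow", "lol"]),
    ("2013-03-24", "PA", ["snow", "cold"]),
    ("2013-03-25", "MO", ["lol", "love"]),
]

def count_f(records):
    """Return f[(word, day, state)]: occurrences of a word in a state on a day."""
    f = Counter()
    for day, state, words in records:
        for word in words:
            f[(word, day, state)] += 1
    return f

def word_frequency(f):
    """Sum f over all days and all states to obtain the total count per word."""
    wf = Counter()
    for (word, _day, _state), n in f.items():
        wf[word] += n
    return wf

f = count_f(tweets)
wf = word_frequency(f)
print(wf.most_common(3))
```

Keeping the per-day, per-state counts in `f` rather than only the totals matters for the later steps, which sum the same counts over states only or over topic word sets.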

Page 12: Discovering Hot Topics using Twitter Streaming Data

Distribution of Word Freq.

[Figure: distribution of word frequency on log-log axes; x-axis: log10(word frequency), y-axis: log10(Counts); the words lol, like, love, and Easter are marked.]

※ scale-free distribution
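
A distribution like this is usually inspected by counting how many words fall into each bin of log10(word frequency). The sketch below shows one way to do that binning; the frequency table and the function name `log_histogram` are hypothetical, and the sketch only builds the histogram without testing whether the distribution is actually scale-free.

```python
import math
from collections import Counter

# Hypothetical word-frequency table (word -> total count over the period).
wf = {"lol": 1200, "like": 900, "love": 800, "easter": 120,
      "bunny": 12, "egg": 9, "fool": 3}

def log_histogram(freqs, bin_width=1.0):
    """Count words per log10(frequency) bin; bin b covers [b*w, (b+1)*w)."""
    bins = Counter()
    for n in freqs.values():
        if n > 0:  # log10 is undefined for zero counts
            bins[math.floor(math.log10(n) / bin_width)] += 1
    return dict(sorted(bins.items()))

hist = log_histogram(wf)
print(hist)
```

Plotting log10 of each bin count against the bin position gives the log-log view; an approximately straight line there is what the scale-free label refers to.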

Page 13: Discovering Hot Topics using Twitter Streaming Data

The ratio of word frequency, $R^{\omega}_t$: a measure of social topics

Page 14: Discovering Hot Topics using Twitter Streaming Data

Ratio of Word Freq.

The definition of a ratio of word frequency to measure social topics:

$$R^{\omega}_t = \frac{F^{\omega}_t - F^{\omega}_{t-1}}{F^{\omega}_t + F^{\omega}_{t-1}}$$

$$F^{\omega}_t = \sum_{s \in S} f^{\omega}_{ts}$$

$F^{\omega}_t$: the time series function for a word ($\omega$) integrated over the spatial index ($s$).

[Figure: $R^{\omega}_t$ for Easter, lol, like, and love; x-axis: Mar/24 to Apr/1, y-axis: -1.0 to 1.0]
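
As a minimal sketch of the ratio defined on this slide, the function below takes a daily count series for one word, already summed over states, and returns the day-over-day ratio. The input numbers are made up for illustration, and returning 0.0 when both days have zero counts is an assumption of this sketch, since the slides do not say how that case is handled.

```python
def ratio_series(F):
    """Return R_t = (F_t - F_{t-1}) / (F_t + F_{t-1}) for t = 1 .. len(F) - 1."""
    R = []
    for prev, cur in zip(F, F[1:]):
        total = cur + prev
        # Assumption: R is set to 0.0 when the word is absent on both days.
        R.append((cur - prev) / total if total else 0.0)
    return R

# Hypothetical daily counts, already summed over US states.
F_topic = [10, 12, 300, 20]          # a word that bursts on one day
F_emotion = [1000, 1040, 990, 1010]  # a frequent but steady word

R_topic = ratio_series(F_topic)
R_emotion = ratio_series(F_emotion)
print(R_topic)
print(R_emotion)
```

Because the difference is divided by the sum, the ratio always lies between -1 and 1, so a rare word that suddenly bursts scores far higher than a very frequent word whose count barely moves.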

Page 15: Discovering Hot Topics using Twitter Streaming Data

Social Topics by $R^{\omega}_t$

Topics | Top words in terms of frequency
Weather | H1 = {weather, snow, winter, cold, sick}
Daily life | H2 = {class, school, gym, lunch, job, jobs, tweetmyjobs}
Weekend | H3 = {bar, party, drinking, beer, movies, drunk, club}
US law | H4 = {gay, marriage}
Sports 1 | H5 = {soccer, usa, mexico}
Sports 2 | H6 = {basketball, chicago, bulls, lebron, miami, heat, kevin, leg, injury, michigan}
TV show | H7 = {thewalkingdead, walking, dead}
Easter | H8 = {easter, church, blessed, bunny, jesus, happy, happyeaster, basket, candy, egg, eggs, god, lord}
April Fools’ Day | H9 = {april, joke, fool}
Emotions | H10 = {lol, like, love, shit, fuck, haha, oh, ass}

Page 16: Discovering Hot Topics using Twitter Streaming Data

Topic - Weather, H1

• According to US newspapers, there was heavy snowfall in about six states from the Midwest to the East, from Missouri to Pennsylvania, on March 24, 2013.

• The snowfall stopped on March 25. Interestingly, $R^{\omega}_t$ decreased dramatically for the word set H1 on March 26.

[Figure: $R^{\omega}_t$ for weather, snow, winter, cold, and sick; x-axis: Mar/24 to Apr/1, y-axis: -0.6 to 0.6]

Page 17: Discovering Hot Topics using Twitter Streaming Data

Topic - Weekend, H3

• Topic words during the weekend include entertainment words such as movies and party, but these are also used steadily during the week, albeit less frequently.

[Figure: $R^{\omega}_t$ for bar, party, drinking, beer, movies, drunk, and club; x-axis: Mar/24 to Apr/1, y-axis: -0.4 to 0.4]

Page 18: Discovering Hot Topics using Twitter Streaming Data

Topic - US Law, H4

• On March 26, the hot topic was the same-sex marriage issue before the US court, and we can see the corresponding rapid increase in $R^{\omega}_t$ on that day.

[Figure: $R^{\omega}_t$ for gay and marriage; x-axis: Mar/24 to Apr/1, y-axis: -0.8 to 0.8]

Page 19: Discovering Hot Topics using Twitter Streaming Data

Topic - Sports, H5

• As the US and Mexico played a World Cup qualifying match in Mexico on March 26, we found that $R^{\omega}_t$ for the topic ‘Sports 1’ peaked around the match.

[Figure: $R^{\omega}_t$ for soccer, usa, and mexico; x-axis: Mar/24 to Apr/1, y-axis: -0.8 to 0.8]

Page 20: Discovering Hot Topics using Twitter Streaming Data

Topic - Easter, H8

• On March 31, we can see that $R^{\omega}_t$ increases for words about Easter, such as easter, happy, bunny, egg(s), god, and jesus.

• This is expected, as Easter is one of the most celebrated Christian festivals in the US.

[Figure: $R^{\omega}_t$ for easter, blessed, bunny, jesus, happy, happyeaster, basket, candy, egg, eggs, god, and lord; x-axis: Mar/24 to Apr/1, y-axis: -1.0 to 1.0]

Page 21: Discovering Hot Topics using Twitter Streaming Data

Topic - Emotions, H10

• The $R^{\omega}_t$ for emotional words showed only a small fluctuation ($|R^{\omega}_t| < 0.1$) even though these words ranked higher in word frequency.

• These results suggest that the frequency of expressions of feelings and emotions is relatively constant over time.

[Figure: $R^{\omega}_t$ for lol, like, love, shit, fuck, haha, oh, and ass; x-axis: Mar/24 to Apr/1, y-axis: -0.1 to 0.1]

Page 22: Discovering Hot Topics using Twitter Streaming Data

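Geographic Clustering

• For each set of hot topic words $H_k$, we computed the spatiotemporal matrix for the k-th hot topic as follows:

$$\Phi^{k}_{ts} = \sum_{\omega \in H_k} f^{\omega}_{ts}$$

• Then we obtained the adjacency matrix from Pearson’s correlation coefficient between US states:

$$A^{k}_{ij} = \mathrm{Corr}(\Phi^{k}_{\bullet i}, \Phi^{k}_{\bullet j})$$

• Modularity (Q) was computed from the weighted graph using the Louvain community detection algorithm, which maximizes Q:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{s_i s_j}{2m} \right] \delta(C_i, C_j)$$

Here $s_i = \sum_j A_{ij}$ is the strength of node $i$, $m = \frac{1}{2} \sum_{i,j} A_{ij}$ is the total edge weight, and $\delta(C_i, C_j)$ is 1 when nodes $i$ and $j$ belong to the same community and 0 otherwise.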
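
The two building blocks of this slide, the correlation-based adjacency matrix and the modularity of a given partition, can be sketched in plain Python. This is not the Louvain algorithm itself: it only evaluates Q for a partition supplied by hand, whereas Louvain searches for the partition that maximizes Q. The toy matrix `phi`, the function names, and the choices to clip negative correlations to zero and to drop self-loops are assumptions of this sketch, not details given in the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def adjacency(phi):
    """phi[t][s] -> A[i][j] = correlation between the time series of states i and j."""
    cols = list(zip(*phi))  # one time series per state
    n = len(cols)
    # Assumption: negative correlations are clipped to 0 and self-loops are dropped.
    return [[max(pearson(cols[i], cols[j]), 0.0) if i != j else 0.0
             for j in range(n)] for i in range(n)]

def modularity(A, communities):
    """Q = (1/2m) * sum_ij [A_ij - s_i*s_j/(2m)] * delta(C_i, C_j)."""
    n = len(A)
    s = [sum(row) for row in A]  # node strengths
    two_m = sum(s)
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += A[i][j] - s[i] * s[j] / two_m
    return q / two_m

# Hypothetical topic counts: rows are days, columns are four states.
phi = [[1, 2, 9, 8],
       [2, 3, 7, 7],
       [4, 5, 3, 2],
       [8, 9, 1, 1]]
A = adjacency(phi)
q_split = modularity(A, [0, 0, 1, 1])  # states 0-1 together, states 2-3 together
q_single = modularity(A, [0, 0, 0, 0])  # everything in one community
print(q_split, q_single)
```

In the toy data the first two states rise together while the last two fall together, so the split partition scores a clearly higher Q than lumping all states into one community, which by construction scores zero.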

Page 23: Discovering Hot Topics using Twitter Streaming Data

Graph Theory

[Figure: example graph with nodes A, B, C, and D]

Page 24: Discovering Hot Topics using Twitter Streaming Data

Types of Graph

1. What is degree?
2. Betweenness centrality?
3. Global/local network efficiency?
4. Modular structure

• undirected binary graph

• directed binary graph

• directed weighted graph

[Figure: example graph with nodes 1 to 6]

Adjacency Matrix

$$A_{ij} = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$
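
To make the first question on this slide concrete, the sketch below computes node degrees from the six-node adjacency matrix shown here. The matrix is copied as transcribed; it is not symmetric (row 2, column 5 is 1 while row 5, column 2 is 0), so the sketch reports row sums and column sums separately rather than assuming an undirected graph. The function names are hypothetical.

```python
# Adjacency matrix as transcribed on the slide (nodes 1..6 map to indices 0..5).
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 0],
]

def out_degree(A):
    """Row sums: number of edges leaving each node."""
    return [sum(row) for row in A]

def in_degree(A):
    """Column sums: number of edges entering each node."""
    return [sum(col) for col in zip(*A)]

def is_symmetric(A):
    """True when A[i][j] == A[j][i] for every pair, i.e. the graph is undirected."""
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(n))

print(out_degree(A))
print(in_degree(A))
print(is_symmetric(A))
```

For an undirected binary graph the two degree lists coincide, and either one answers the degree question directly.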

Page 25: Discovering Hot Topics using Twitter Streaming Data

Network Analysis Ex.

• co-authorship network formed by author list (Newman, PNAS 101 (2004) 5200-5205)

• semantic network formed by free association (Steyvers, Cognitive Science 29 (2005) 41–78)

Page 26: Discovering Hot Topics using Twitter Streaming Data

Geographic Clustering

[Figure panels: Geographic Clustering; Adjacency Matrix]

Page 27: Discovering Hot Topics using Twitter Streaming Data

Conclusion

• The ratio of word frequency properly detected social hot topics of the day by identifying increasing or decreasing frequency of keywords in Twitter messages, while suppressing non-topic keywords such as frequently tweeted emotional words (e.g., lol, like, and love).

• The social topic detection method may be applied on a different time scale, e.g., hourly, monthly, or yearly.

• The geographic clustering based on a social topic appropriately reflected not only the pathway of the spring storm but also the properties of US geography.

Page 28: Discovering Hot Topics using Twitter Streaming Data

Thank you for your attention