mining twitter data with resource constraints - ieee/acm conference on web intelligence 2014

Mining Twitter Data with Resource Constraints

George Valkanas, Ioannis Katakis, Dimitrios Gunopulos, Anthony Stefanidis

August 12, 2015

Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 1 / 18

Research Question

Is the 1% sample provided by the Twitter API sufficient for spatio-temporal analysis tasks? ... which tasks?

→ We compare with the 10% sample (Garden Hose)


Outline

1 Problem and Motivation

2 Data Collection

3 Experiments in Various Tasks

Geo-location Coverage

Sentiment Analysis

Popular Topic Detection

Graph Evolution

4 Conclusions


Introduction

Twitter Samples

Two ways to access the stream

Public Stream: 1% Sample

Garden Hose: 10% Sample

... in both cases, we don’t know details about the sampling method.


Introduction

Constraints

Financial cost

Licences for larger samples are costly and difficult to obtain.

Computational cost

7 gigabytes per minute

Off-the-shelf approaches are unable to operate in such settings

In practice: those who engage in social media analytical tasks have practically no choice but to resort to the downsized information. However, being only a small fraction of the entire stream, it is unclear how reliable this information is for each type of application.


Introduction

A more concrete example

The INSIGHT Project: Improve understanding, prediction and warning of emergencies through real-time processing of data streams, including social data.

(a) Floods in Germany (2013) (b) Control Center in Dublin CC

How much data is sufficient for our task?


Introduction

Tasks we look into...

Sentiment Analysis

Geo-located information

Popular tweets

Social Graph Evolution

Linguistic Analysis


Data

The data

Figure: Comparing default and gardenhose samples for volume over time. Panels: (c) All tweets, (d) GPS-tagged tweets. Axes: tweet count versus hours (0 to 100). Series: Default, Gardenhose.

4-day period, November 2013

The two samples differ by an order of magnitude

Both exhibit the same temporal pattern

Geotagged tweets make up between 1% and 2% of their respective samples

The curves for geotagged tweets are more flattened out


Experiments

Geo-location coverage - Experiment 1

Bounding Box

Twitter also allows its users to ask for geotagged information.

The user provides a bounding box by specifying 4 coordinates in the form [(lat_min, lon_min), (lat_max, lon_max)], and Twitter returns tweets that fall within this region.

Figure: plot of the example bounding box region (axes: lon, lat).

In this particular case, where geotagged tweets are requested instead of a general sample, the volume of the returned results is the same for the two samples!
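The bounding-box test described above can be sketched in a few lines. This is only an illustration of the geometric check, not Twitter's implementation: the `BoundingBox` type, the `coords` field on a tweet, and the sample coordinates are all assumptions made for the example, and boxes that cross the antimeridian are not handled.

```python
from typing import NamedTuple, Optional


class BoundingBox(NamedTuple):
    """Hypothetical container for [(lat_min, lon_min), (lat_max, lon_max)]."""
    lat_min: float
    lon_min: float
    lat_max: float
    lon_max: float


def contains(box: BoundingBox, lat: float, lon: float) -> bool:
    """Return True if the point lies inside the box (edges included)."""
    return box.lat_min <= lat <= box.lat_max and box.lon_min <= lon <= box.lon_max


def filter_geotagged(tweets: list, box: BoundingBox) -> list:
    """Keep only tweets that carry coordinates falling inside the box.

    Each tweet is assumed to be a dict with an optional "coords" entry
    holding a (lat, lon) pair; this layout is illustrative only.
    """
    kept = []
    for tweet in tweets:
        coords: Optional[tuple] = tweet.get("coords")
        if coords is not None and contains(box, coords[0], coords[1]):
            kept.append(tweet)
    return kept


# Illustrative box and tweets (values chosen for the example only).
example_box = BoundingBox(lat_min=51.0, lon_min=-1.0, lat_max=52.0, lon_max=1.0)
example_tweets = [
    {"id": 1, "coords": (51.5, -0.1)},   # inside
    {"id": 2, "coords": (48.8, 2.3)},    # outside
    {"id": 3},                           # not geotagged
]
inside = filter_geotagged(example_tweets, example_box)
```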


Experiments

Geo-location coverage - Experiment 2

4 different crawls in London area

Figure: tweet count per half-hour interval for the four crawls (series: Loc1, Loc2, Loc3, Loc4).

As the overlap between the bounding boxes increases, so does the similarity between two different crawls.
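One way to quantify the overlap between two bounding boxes is the ratio of their intersection area to their union area. The slides do not say which overlap measure was used, so the sketch below is an assumption; it also treats latitude and longitude as planar coordinates, which is only a rough approximation for small regions.

```python
def box_area(box: tuple) -> float:
    """Area of a (lat_min, lon_min, lat_max, lon_max) box in squared degrees."""
    lat_min, lon_min, lat_max, lon_max = box
    return max(0.0, lat_max - lat_min) * max(0.0, lon_max - lon_min)


def overlap_ratio(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two boxes; 0.0 if disjoint, 1.0 if identical."""
    inter_lat = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_lon = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_lat * inter_lon
    union = box_area(a) + box_area(b) - intersection
    return intersection / union if union > 0 else 0.0


# Illustrative boxes (not the crawl regions from the experiment).
box_a = (0.0, 0.0, 1.0, 2.0)
box_b = (0.0, 1.0, 1.0, 3.0)
ratio = overlap_ratio(box_a, box_b)
```

With a measure like this, the claim on the slide could be checked by plotting crawl similarity against `overlap_ratio` for each pair of crawls.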


Experiments

Sentiment Analysis

Figure: Positive and Negative Sentiment Ratio. Three panels plotting ratio versus hours (0 to 100). Series: Sample 1% and Sample 10% in the first two panels; Pos 1%, Neg 1%, Pos 10%, Neg 10% in the third.

- Dictionary-based sentiment analysis
- The ratio of sentiment-bearing tweets is the same in both samples
- Ratios in geotagged tweets are lower, meaning that geotagged tweets offer less sentiment-oriented information
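A dictionary-based sentiment ratio can be computed roughly as follows. The slides do not name the dictionary or the exact scoring rule, so the tiny word lists and the "more positive than negative words" rule below are placeholders chosen for illustration.

```python
# Toy lexicons: placeholders, not the dictionary used in the study.
POSITIVE_WORDS = {"good", "great", "happy", "love"}
NEGATIVE_WORDS = {"bad", "sad", "hate", "terrible"}


def classify(text: str) -> str:
    """Label a tweet 'pos', 'neg' or 'neutral' by counting lexicon hits."""
    words = {w.strip(".,!?#@").lower() for w in text.split()}
    pos_hits = len(words & POSITIVE_WORDS)
    neg_hits = len(words & NEGATIVE_WORDS)
    if pos_hits > neg_hits:
        return "pos"
    if neg_hits > pos_hits:
        return "neg"
    return "neutral"


def sentiment_ratios(tweets: list) -> tuple:
    """Return (positive ratio, negative ratio) over a batch of tweet texts."""
    if not tweets:
        return 0.0, 0.0
    labels = [classify(t) for t in tweets]
    total = len(labels)
    return labels.count("pos") / total, labels.count("neg") / total


# One batch, e.g. the tweets of a single hour from one sample.
hour_batch = ["I love this", "so sad today.", "just a tweet", "Great day!"]
pos_ratio, neg_ratio = sentiment_ratios(hour_batch)
```

Running the same function on the hourly batches of each sample gives two ratio series that can be compared directly.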


Experiments

Popular Topic Detection - Experiment

1 Extract the top-k most retweeted posts that appear in our data (both samples).

2 Compare the two lists (Kendall Correlation)

3 Compare the two lists with the ground truth (= actual retweet count information included in the tweet)
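The steps above can be sketched as follows. This is a minimal illustration, not the authors' code: it counts retweets by tweet id, builds a top-k list per sample, and computes Kendall's tau over the items the two lists share. How the study handled items missing from one list is not stated on the slides, so restricting to common items is an assumption.

```python
from collections import Counter
from itertools import combinations
from typing import Optional


def top_k(retweeted_ids: list, k: int) -> list:
    """Return the k most frequently retweeted ids, most retweeted first."""
    return [tweet_id for tweet_id, _ in Counter(retweeted_ids).most_common(k)]


def kendall_tau(list_a: list, list_b: list) -> Optional[float]:
    """Kendall's tau over items present in both ranked lists.

    Returns None when fewer than two items are shared. Each list is
    assumed to contain no duplicates, so there are no tied ranks.
    """
    rank_a = {item: i for i, item in enumerate(list_a)}
    rank_b = {item: i for i, item in enumerate(list_b)}
    common = [item for item in list_a if item in rank_b]
    if len(common) < 2:
        return None
    concordant = discordant = 0
    for x, y in combinations(common, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(common) * (len(common) - 1) // 2
    return (concordant - discordant) / pairs


def common_fraction(list_a: list, list_b: list) -> float:
    """Share of items that appear in both lists, relative to the longer list."""
    longest = max(len(list_a), len(list_b))
    return len(set(list_a) & set(list_b)) / longest if longest else 0.0


# Illustrative retweet events (ids of the original tweets being retweeted).
sample_small = ["x", "y", "x", "z", "y", "x"]
sample_large = ["x", "z", "x", "z", "y", "x", "z"]
tau = kendall_tau(top_k(sample_small, 3), top_k(sample_large, 3))
```

The same `kendall_tau` call can compare either sample's list against a list ranked by the retweet count carried in the tweet itself, which is the ground truth in step 3.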


Experiments

Popular Topic Detection - Results

Figure: Comparing the top-N most retweeted items. Panels: (a) Kendall correlation versus list items, (b) Common items (%) versus list items, (c) Kendall correlation versus the ground truth, for Sample 1% and Sample 10%. Series in (a) and (b): S1-S10, S1-S10P1, S1-S10P2, S10P1-S10P2, S1-S1P1.

Conclusions

For lists of up to 10 items, the 1% sample is adequate. That is not the case, however, for lists with more than 1000 items.

Comparison with Ground Truth: 10% has higher correlation.

Experiments

Graph Evolution Study - Experiment

Study the re-tweet graph (directed)

Edges are weighted (more re-tweets → larger weight) and decay over time

Edges are removed when their weight drops below a certain threshold

Method 1 (Iter): At each time interval, extract a new graph

Method 2 (Glb): At each time interval, aggregate the new nodes to the current graph
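The two methods can be sketched as below. The slides do not give the decay function, the threshold, or the edge direction, so this example assumes multiplicative decay per interval, a weight of 1 per retweet, and edges pointing from the retweeting user to the original author. All parameter values are illustrative.

```python
class DecayingRetweetGraph:
    """Glb-style graph: edges accumulate across intervals, decay, and get pruned."""

    def __init__(self, decay: float = 0.5, threshold: float = 0.2) -> None:
        self.decay = decay          # assumed multiplicative decay per interval
        self.threshold = threshold  # edges below this weight are removed
        self.weights: dict = {}     # (retweeter, author) -> weight

    def step(self, retweets: list) -> None:
        """Advance one time interval with a list of (retweeter, author) pairs."""
        # Decay the weight of every existing edge.
        for edge in self.weights:
            self.weights[edge] *= self.decay
        # Each new retweet adds 1 to the weight of its edge.
        for edge in retweets:
            self.weights[edge] = self.weights.get(edge, 0.0) + 1.0
        # Remove edges whose weight has dropped below the threshold.
        self.weights = {e: w for e, w in self.weights.items() if w >= self.threshold}

    def nodes(self) -> set:
        """Users that still have at least one surviving edge."""
        return {user for edge in self.weights for user in edge}


def iter_graph(retweets: list) -> dict:
    """Iter-style graph: built from a single interval only, with no history."""
    weights: dict = {}
    for edge in retweets:
        weights[edge] = weights.get(edge, 0.0) + 1.0
    return weights


# Three retweets in the first interval, then three quiet intervals.
glb = DecayingRetweetGraph(decay=0.5, threshold=0.2)
glb.step([("a", "b"), ("a", "b"), ("c", "b")])
first_interval = dict(glb.weights)
for _ in range(3):
    glb.step([])
```

In this run the edge with two retweets survives three quiet intervals while the edge with one retweet is pruned, which is the behaviour the weighting and threshold rules describe.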


Experiments

Results

Figure: Statistical properties of the extracted retweet graph, over time. Panels: (a) Size, (b) Largest connected component size, (c) Clustering coefficient. Axes: value versus iteration (0 to 1200). Series: Iter 1%, Glb 1%, Iter 10%, Glb 10% in (a) and (c); Glb 1%, Glb 10% in (b).

Conclusions

No significant differences between the two samples

The largest connected component (LCC) does not follow the 24-hour pattern

The clustering coefficient of the 10% sample is similar to that of 100%

Experiments

More on the paper...

Retweet Burstiness

The rate at which users retweet information plays an important role in capturing trending topics

We investigate whether there is a difference between the rates of receiving retweets in the two samples

Linguistic Analysis

Is there a correlation between the spoken languages in Twitter and the ground truth obtained from studies in the physical world?

What are the differences between the two samples in this context?

We use language detection tools and ground truth information from Wikipedia.


Summary and Conclusions

Conclusions

Research question: Is the default sample sufficient? For which tasks?

Focused on spatio-temporal tasks

We compared 1% with 10% sample

The samples have quite similar properties

However, when you get into the details (less popular re-tweets), the bigger sample is better


Summary and Conclusions

The End...

Thank You!

Contact: @iokat // [email protected] // www.katakis.eu

Acknowledgement

This work has been co-financed by EU and Greek National funds through the Operational Program “Education and Lifelong Learning” of the National Strategic Reference Framework (NSRF) - Research Funding Programs: Heraclitus II fellowship, THALIS - GeomComp, THALIS - DISFER, ARISTEIA - MMD and the EU funded project INSIGHT.

