Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014
DESCRIPTION
Social media analysis is a scientific field that is rapidly gaining ground due to its numerous research challenges and practical applications, as well as the unprecedented availability of data in real time. Several of these applications, such as journalism, crisis management, and advertising, have significant social and economic impact. However, two issues regarding these applications have to be confronted. The first is the financial cost. Despite the abundance of information, it typically comes at a premium price, and only a fraction is provided free of charge. For example, Twitter, a predominant social media online service, grants researchers and practitioners free access to only a small proportion (1%) of its publicly available stream. The second issue is the computational cost. Even when the full stream is available, off-the-shelf approaches are unable to operate in such settings due to the real-time computational demands. Consequently, real-world applications as well as research efforts that exploit such information are limited to utilizing only a subset of the available data. In this paper, we evaluate the extent to which analytical processes are affected by this limitation. In particular, we apply a range of analysis processes to two subsets of Twitter public data, obtained through the service's sampling APIs. The first is the default 1% sample, whereas the second is the Gardenhose sample that our research group has access to, returning 10% of all public data. We extensively evaluate their relative performance in numerous scenarios.
TRANSCRIPT
Mining Twitter Data with Resource Constraints
George Valkanas, Ioannis Katakis, Dimitrios Gunopulos, Anthony Stefanidis
August 12, 2015
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 1 / 18
Research Question
Is the 1% sample provided by the Twitter API sufficient for spatio-temporal analysis tasks? ... which tasks?
→ We compare with the 10% sample (Garden Hose)
Outline
1 Problem and Motivation
2 Data Collection
3 Experiments in Various Tasks
Geo-location Coverage
Sentiment Analysis
Popular Topic Detection
Graph Evolution
4 Conclusions
Introduction
Twitter Samples
Two ways to access the stream
Public Stream: 1% Sample
Garden Hose: 10% Sample
... in both cases, we don’t know details about the sampling method.
Constraints
Financial cost
Licences for larger samples are costly and difficult to obtain.
Computational cost
7 gigabytes per minute
Off-the-shelf approaches are unable to operate in such settings
In practice: those who engage in social media analytical tasks have practically no choice but to resort to the downsized information. However, because it is only a small fraction of the entire stream, it is unclear how reliable this information is for each type of application.
A more concrete example
The INSIGHT Project: Improve understanding, prediction and warning of emergencies through real-time processing of data streams including social data.
(a) Floods in Germany (2013) (b) Control Center in Dublin CC
How much data is sufficient for our task?
Tasks we look into...
Sentiment Analysis
Geo-located information
Popular tweets
Social Graph Evolution
Linguistic Analysis
Data
The data
Figure: Comparing default and gardenhose samples for volume over time. Panels: (c) All tweets (x-axis: Hours, y-axis: Tweet Count), (d) GPS-tagged tweets (x-axis: Hours, y-axis: GPS Tweet Count). Series: Default, Gardenhose.
4-day period, November 2013
The two samples differ by an order of magnitude
Exhibit the same temporal pattern
Geotagged tweets make up 1-2% of their respective sampled data
Geotagged tweet volume is flatter over time
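The volume comparison above can be sketched as a simple hourly binning. This is only an illustration, not the authors' pipeline: the record layout (a `timestamp` in epoch seconds and an optional `coordinates` field) and the function names are assumptions made for the example.

```python
from collections import Counter

def hourly_counts(tweets):
    """Bin tweets by hour offset from the earliest tweet.

    `tweets` is a list of dicts with a hypothetical layout:
    {"timestamp": <epoch seconds>, "coordinates": <(lat, lon) or None>}.
    Returns (all_counts, geo_counts), two Counters keyed by hour offset.
    """
    all_counts, geo_counts = Counter(), Counter()
    if not tweets:
        return all_counts, geo_counts
    start = min(t["timestamp"] for t in tweets)
    for t in tweets:
        hour = int((t["timestamp"] - start) // 3600)
        all_counts[hour] += 1
        if t.get("coordinates") is not None:
            geo_counts[hour] += 1
    return all_counts, geo_counts

def geotag_share(all_counts, geo_counts):
    """Fraction of tweets that carry coordinates, over the whole period."""
    total = sum(all_counts.values())
    return sum(geo_counts.values()) / total if total else 0.0
```

Running the same two functions on the default and gardenhose samples would give the per-hour series and the geotagged share that the bullets above summarize.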
Experiments
Geo-location coverage - Experiment 1
Bounding Box
Twitter also allows its users to ask for geotagged information.
The user provides a bounding box, by specifying 4 coordinates in the form [(lat_min, lon_min), (lat_max, lon_max)], and Twitter returns tweets that fall within this region.
Figure: example bounding box (x-axis: lon, y-axis: lat).
In this particular case, where geotagged tweets are requested instead of a general sample, the volume of the returned results is the same for the two samples!
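The bounding-box test described above can be sketched locally as follows. This is a minimal client-side illustration, not Twitter's server-side filter: the `BoundingBox` type and `filter_by_box` helper are hypothetical names, and the sketch assumes the box does not cross the antimeridian.

```python
from typing import NamedTuple

class BoundingBox(NamedTuple):
    """Box in the form [(lat_min, lon_min), (lat_max, lon_max)]."""
    lat_min: float
    lon_min: float
    lat_max: float
    lon_max: float

    def contains(self, lat, lon):
        # Assumes lon_min <= lon_max, i.e. no antimeridian crossing.
        return (self.lat_min <= lat <= self.lat_max
                and self.lon_min <= lon <= self.lon_max)

def filter_by_box(points, box):
    """Keep only the (lat, lon) points that fall within the box."""
    return [(lat, lon) for lat, lon in points if box.contains(lat, lon)]
```

For example, with an illustrative box `BoundingBox(-50, 60, 25, 150)`, the point `(-37.81, 144.96)` is kept and `(51.5, -0.12)` is dropped.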
Geo-location coverage - Experiment 2
4 different crawls in London area
Figure: tweet count per half-hour interval for the four crawls (Loc1, Loc2, Loc3, Loc4).
As the overlap between the bounding boxes increases, so does the similarity between two different crawls.
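The slides do not say how overlap or crawl similarity were measured. One plausible way to quantify both, shown here purely as an assumption, is the Jaccard ratio: intersection over union of the box areas (treating degrees as planar coordinates) and of the sets of tweet ids returned by each crawl. All names below are hypothetical.

```python
def box_area(box):
    """Area of (lat_min, lon_min, lat_max, lon_max) in squared degrees.

    Degenerate or inverted boxes have area 0.
    """
    lat_min, lon_min, lat_max, lon_max = box
    return max(0.0, lat_max - lat_min) * max(0.0, lon_max - lon_min)

def box_overlap_ratio(a, b):
    """Intersection-over-union of two boxes (planar approximation)."""
    inter = (max(a[0], b[0]), max(a[1], b[1]),
             min(a[2], b[2]), min(a[3], b[3]))
    inter_area = box_area(inter)
    union_area = box_area(a) + box_area(b) - inter_area
    return inter_area / union_area if union_area else 0.0

def crawl_similarity(ids_a, ids_b):
    """Jaccard similarity of the tweet ids returned by two crawls."""
    a, b = set(ids_a), set(ids_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0
```

Plotting `crawl_similarity` against `box_overlap_ratio` for each pair of crawls would be one way to check the trend stated above.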
Sentiment Analysis
Figure: Positive and Negative Sentiment Ratio. Panels plot Ratio against Hour(s) for Sample 1% and Sample 10%, and for Pos 1%, Neg 1%, Pos 10%, Neg 10%.
- Dictionary-based sentiment analysis
- Ratio of tweets is the same in both samples
- Ratios in geotagged tweets are lower, meaning that geotagged tweets offer less sentiment-oriented information
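A dictionary-based sentiment ratio can be sketched as below. The word lists, the tokenizer, and the function name are illustrative assumptions; the slides do not name the dictionary that was actually used.

```python
import re

# Tiny illustrative lexicons; a real study would use a published dictionary.
POSITIVE = {"good", "great", "happy", "love"}
NEGATIVE = {"bad", "sad", "hate", "awful"}
TOKEN = re.compile(r"[a-z']+")

def sentiment_ratios(texts):
    """Return (positive_ratio, negative_ratio) over a list of tweet texts.

    A tweet counts as positive (negative) if it contains at least one
    positive (negative) dictionary word; it may count as both.
    """
    pos = neg = 0
    for text in texts:
        tokens = set(TOKEN.findall(text.lower()))
        if tokens & POSITIVE:
            pos += 1
        if tokens & NEGATIVE:
            neg += 1
    n = len(texts)
    return (pos / n, neg / n) if n else (0.0, 0.0)
```

Computing these ratios per hour for each sample, and separately for the geotagged subset, would reproduce the kind of comparison summarized in the bullets above.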
Popular Topic Detection - Experiment
1 Extract the top-k most retweeted posts that appear in our data (both samples).
2 Compare the two lists (Kendall Correlation)
3 Compare the two lists with the ground truth (= actual retweet count information included in the tweet)
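The three steps above can be sketched as follows. The slides do not state which Kendall variant was used or how items missing from one list were handled; this sketch assumes tau-a computed over the items common to both lists, with no ties. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def top_k_retweeted(retweeted_ids, k):
    """Rank original tweet ids by how often they were seen retweeted.

    `retweeted_ids` holds one original-tweet id per observed retweet.
    """
    return [tid for tid, _ in Counter(retweeted_ids).most_common(k)]

def kendall_tau(list_a, list_b):
    """Kendall tau-a over items present in both ranked lists.

    Lists must not contain duplicates. Returns None when fewer than
    two items are shared.
    """
    rank_b = {item: i for i, item in enumerate(list_b)}
    common = [item for item in list_a if item in rank_b]
    if len(common) < 2:
        return None
    concordant = discordant = 0
    for x, y in combinations(common, 2):
        # x precedes y in list_a; check whether list_b agrees.
        if rank_b[x] < rank_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def common_fraction(list_a, list_b):
    """Share of items that the two top-k lists have in common."""
    return len(set(list_a) & set(list_b)) / max(len(list_a), len(list_b), 1)
```

For step 3, the ground-truth list would be the same tweets re-ranked by the retweet count carried in the tweet itself, then passed to `kendall_tau` against each sample's list.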
Popular Topic Detection - Results
Figure: Comparing the top-N most retweeted items. Panels: (a) Kendall correlation vs. list items and (b) common items (%) vs. list items, for the pairs S1-S10, S1-S10P1, S1-S10P2, S10P1-S10P2, S1-S1P1; (c) Kendall correlation vs. the ground truth per iteration, for Sample 1% and Sample 10%.
Conclusions
For up to 10 items, 1% is adequate. That is not, however, the case for lists with more than 1000 items.
Comparison with Ground Truth: 10% has higher correlation.
Graph Evolution Study - Experiment
Study the re-tweet graph (directed)
Edges are weighted (more re-tweets → larger weight) and decay over time
Edges are removed when their weight drops below a certain threshold
Method 1: Iter. At each time interval, extract a new graph
Method 2: Glb. At each time interval, aggregate the new nodes to the current graph
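The weighted, decaying retweet graph can be sketched as below. The multiplicative decay, the unit weight per retweet, and the class and parameter names are assumptions for illustration; the slides do not give the decay function or the threshold value. The sketch follows the aggregated (Glb) style, where one graph is carried across intervals.

```python
class DecayingRetweetGraph:
    """Directed retweet graph with decaying edge weights (illustrative)."""

    def __init__(self, decay=0.5, threshold=0.1):
        self.decay = decay          # assumed multiplicative decay per interval
        self.threshold = threshold  # edges below this weight are removed
        self.edges = {}             # (retweeter, original_author) -> weight

    def step(self, retweets):
        """Advance one interval: decay, prune, then add new retweets."""
        decayed = {e: w * self.decay for e, w in self.edges.items()}
        self.edges = {e: w for e, w in decayed.items() if w >= self.threshold}
        for retweeter, author in retweets:
            edge = (retweeter, author)
            # More retweets between the same pair -> larger weight.
            self.edges[edge] = self.edges.get(edge, 0.0) + 1.0

    def nodes(self):
        """Nodes that still have at least one surviving edge."""
        return {n for edge in self.edges for n in edge}
```

An Iter-style run could be approximated by creating a fresh `DecayingRetweetGraph` for every interval instead of reusing one instance.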
Results
Figure: Statistical properties of the extracted retweet graph, over time. Panels: (a) Size (Iter 1%, Glb 1%, Iter 10%, Glb 10%), (b) Largest connected component size (Glb 1%, Glb 10%), (c) Clustering Coefficient (Iter 1%, Glb 1%, Iter 10%, Glb 10%). x-axis: Iteration, y-axis: Value.
Conclusions
No significant differences between the two samples
LCC does not follow the 24-hour pattern
Clustering coefficient of 10% similar to 100%
More on the paper...
Retweet Burstiness
The rate at which users retweet information plays an important role in capturing trending topics
We investigate whether there is a difference between the rates of receiving retweets in the two samples
Linguistic Analysis
Is there a correlation between the spoken languages in Twitter, and the ground truth obtained from studies in the physical world?
What are the differences between the two samples in this context?
We use language detection tools and ground truth information from Wikipedia.
Summary and Conclusions
Conclusions
Research question: Is the default sample sufficient? For which tasks?
Focused on spatio-temporal tasks
We compared 1% with 10% sample
The samples have quite similar properties
However, when you get into the details (less popular re-tweets), the bigger sample is better
The End...
Thank You!
Contact: @iokat // [email protected] // www.katakis.eu
Acknowledgement
This work has been co-financed by EU and Greek National funds through the Operational Program “Education and Lifelong Learning” of the National Strategic Reference Framework (NSRF) - Research Funding Programs: Heraclitus II fellowship, THALIS - GeomComp, THALIS - DISFER, ARISTEIA - MMD and the EU funded project INSIGHT.