How Does the Data Sampling Strategy Impact the Discovery of Information Diffusion in Social Media?
Munmun De Choudhury (1), Yu-Ru Lin (1), Hari Sundaram (1), K. Selcuk Candan (1), Lexing Xie (2), Aisling Kelliher (1)
(1) Arizona State University, Tempe, AZ
(2) IBM T. J. Watson Research Center, Hawthorne, NY
This talk is about sampling the social web
Outline
• Background and Motivation
• Problem Definition
• Sampling Diffusion Data
• Evaluation of Diffusion Samples
• Experimental Study
• Conclusions
04/13/2023 4
Modern Social Interactional Modes
Facebook, Slashdot, Engadget, Flickr, LiveJournal, Digg, YouTube, Blogger, MetaFilter, Reddit, MySpace, Orkut
We are attracted to social media, in part due to large-scale datasets
Viral Marketing, Advertising Campaigns
Collaboration, “Wisdom of the Crowds”
Crisis management w.r.t. real-time events
Is there something more fundamental happening here than just scale?
Information Diffusion
140 characters can cause revolutions
Inference is based on data quality
$y = f(x)$
What has been done?
Snowball
Random walk
Forest Fire [Leskovec et al., KDD 2005]
Designed to capture topology
But not context or content!
Problem Definition
Two simple questions
What is the role of context in the sampling process?
What fraction of the social data should we sample?
Twitter
• “Tweets”: 140 character length shared content.
– RT (or re-tweet feature), hashtags (e.g. #iranelection), bit.ly encoded URLs
• Follower / Following relationship.
• “Trending topics” e.g. #musicmonday, #formulaone.
• Diffusion via (1) RT feature, (2) shared URL (e.g. bit.ly, tinyurl), (3) same hashtag
(1) RT based diffusion
(2) URL based diffusion
(3) hashtag based diffusion
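The three diffusion channels listed above can be detected from tweet text alone. The following is a minimal sketch, not the authors' pipeline: the regular expressions and the function name `diffusion_channels` are illustrative assumptions, and real tweets follow looser conventions than these patterns capture.

```python
import re

# Hypothetical patterns (assumptions for illustration only).
RT_PATTERN = re.compile(r"\bRT\s+@(\w+)")                        # "RT @user"
URL_PATTERN = re.compile(r"https?://(?:bit\.ly|tinyurl\.com)/\S+")  # shortened URLs
HASHTAG_PATTERN = re.compile(r"#(\w+)")                          # "#topic"


def diffusion_channels(text):
    """Return the diffusion evidence found in one tweet's text."""
    return {
        "rt_sources": RT_PATTERN.findall(text),
        "urls": URL_PATTERN.findall(text),
        # Hashtags are lower-cased so "#IranElection" and "#iranelection" match.
        "hashtags": [tag.lower() for tag in HASHTAG_PATTERN.findall(text)],
    }


example = diffusion_channels(
    "RT @alice: big news http://bit.ly/abc123 #IranElection"
)
```

Two users would then be linked in a diffusion series when one follows the other and both tweets share an RT source, a URL, or a hashtag.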
Diffusion Series
Social graph
Diffusion series
Sampling Diffusion Data
What are our sampling strategies?
Assume we are given N, the number of nodes to pick
And θ, the topic
Plus the social graph G
What if we ignored topology?
We can also sample topology using Forest Fire
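The contrast between ignoring topology and sampling it with Forest Fire can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the graph is a plain adjacency dict, `users_by_topic` is a hypothetical mapping from a topic θ to the users who posted on it, and `p_forward` is an assumed forward-burning probability in the spirit of the Forest Fire model.

```python
import random
from collections import deque


def random_topic_sample(users_by_topic, theta, n, rng):
    """Ignore topology: pick n users uniformly among those who posted on theta."""
    candidates = sorted(users_by_topic.get(theta, ()))
    return set(rng.sample(candidates, min(n, len(candidates))))


def forest_fire_sample(graph, n, rng, p_forward=0.7):
    """Forest-Fire-style sample of n nodes from {node: [neighbors]}."""
    nodes = sorted(graph)
    target = min(n, len(nodes))
    burned = set()
    while len(burned) < target:
        # Ignite (or re-ignite, if the fire died out) at a random unburned node.
        seed = rng.choice([v for v in nodes if v not in burned])
        burned.add(seed)
        frontier = deque([seed])
        while frontier and len(burned) < target:
            node = frontier.popleft()
            unburned = [v for v in graph.get(node, ()) if v not in burned]
            rng.shuffle(unburned)
            # Burn a geometrically distributed number of outgoing links.
            k = 0
            while k < len(unburned) and rng.random() < p_forward:
                k += 1
            for v in unburned[:k]:
                if len(burned) >= target:
                    break
                burned.add(v)
                frontier.append(v)
    return burned


rng = random.Random(0)
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": ["d"], "d": ["e"], "e": []}
users_by_topic = {"#iranelection": {"a", "c", "e"}}
topic_sample = random_topic_sample(users_by_topic, "#iranelection", 2, rng)
fire_sample = forest_fire_sample(graph, 3, rng)
```

A seed attribute such as user activity could replace the uniform choice of `seed`; that variation is left out here to keep the sketch short.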
Evaluation of Diffusion Samples
Diffusion Saturation Metrics
• User-based (Volume, participation, dissemination)
• Topology-based (Reach, spread, cascade instances, collection size)
• Time-based (Rate)
Volume
Participation
Dissemination
Reach
Spread
Cascade Instances
Collection Size
Rate
Distortion
$F_\theta(m; S) = \frac{m_\theta - m_\theta(S)}{m_\theta}$
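As a small worked example of the distortion ratio, the sketch below assumes that the first quantity is a saturation metric measured on the full (reference) data for a topic and the second is the same metric measured on a sample S. The function and parameter names are illustrative.

```python
def distortion(reference_value, sample_value):
    """Relative gap between a metric on the reference data and on a sample."""
    if reference_value == 0:
        raise ValueError("reference metric must be non-zero")
    return (reference_value - sample_value) / reference_value


# Hypothetical numbers: a topic reaches 200 users in the reference data
# but only 150 users in the sampled data.
reach_distortion = distortion(200, 150)
```

A value near 0 means the sample preserves the metric; a value near 1 means most of the diffusion measured by that metric was lost in sampling.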
Diffusion Response Metrics
• Search and News Trends, i.e. ability of the social graph sample to correlate with external temporal variables like user search behavior and news items featured online (http://news.google.com/), given as:
$1 - D(E_D^{N+1}(\theta), E_S^{N+1}(\theta))$, where $E_S^{N+1}(\theta)$ is the search volume, and
$E_D^{N+1}(\theta) = \sum_{m \le N+1} |l_m(S_{N+1}(\theta))| / Q_D$,
where $|l_m(S_{N+1}(\theta))|$ is the number of nodes at slot $l_m$ in the collection $S_{N+1}(\theta)$, and $Q_D = \sum_m |l_m(S_{N+1}(\theta))|$.
Similarly, $1 - D(E_D^{N+1}(\theta), E_N^{N+1}(\theta))$, where $E_N^{N+1}(\theta)$ is the news volume.
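The response metric can be sketched in code under two assumptions that the slide does not state: that D is the maximum absolute difference between two cumulative, normalized series, and that the external search or news volume is normalized the same way as the diffusion series. Function names are illustrative.

```python
def cumulative_fraction(counts):
    """Cumulative share of the total reached by each time slot."""
    total = sum(counts)
    if total == 0:
        raise ValueError("series must have a non-zero total")
    running, fractions = 0, []
    for count in counts:
        running += count
        fractions.append(running / total)
    return fractions


def response(diffusion_counts, external_counts):
    """1 - D between a diffusion series and an external volume series.

    ASSUMPTION: D is the largest absolute gap between the two cumulative
    normalized curves; the source does not define D.
    """
    if len(diffusion_counts) != len(external_counts):
        raise ValueError("series must cover the same time slots")
    e_d = cumulative_fraction(diffusion_counts)
    e_x = cumulative_fraction(external_counts)
    d = max(abs(a - b) for a, b in zip(e_d, e_x))
    return 1 - d


# Hypothetical per-slot counts: nodes in the sampled diffusion collection
# versus search volume for the same topic.
search_response = response([1, 1, 2], [2, 1, 1])
```

Under these assumptions a response of 1 means the sampled diffusion tracks the external trend exactly, and lower values mean the two curves diverge.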
Experimental Study
Reference Set
500 seed users using mashable.com
Experimental Setup
• ~465K users, ~836K edges (“follower” / “following” relationships) and 29.5M tweets.
• 125 randomly chosen “trending topics” from Twitter, between Oct and Nov 2009.
• Trending topic – theme association based on OpenCalais (http://www.opencalais.com/).
Themes | Trending topics
Politics | Obama, Senate, Afghanistan, Tehran, Healthcare
Entertainment_Culture | Beyonce, Eagles, Michael Jackson, #britney3premiere
Sports | Chargers, Cliff Lee, Dodgers, Formula One, New York Yankees
Technology_Internet | Android 2, Bing, Google Wave, Windows 7, #Firefox5
Social Issues | Swine Flu, Unemployment, #BeatCancer, #Stoptheviolence
• Dataset released for non-commercial research purposes: http://www.public.asu.edu/~mdechoud/temp/released-data/
R1
Is this a slam dunk for forest fire + activity?
Looking at themes tells a more nuanced story
What is a “good” sampling ratio?
What happens when ρ = 0.3?
Conclusions
Social networks are causing significant changes in our lives
Inference about social phenomena is affected by data quality
Topic + topology + seed attribute makes a difference to sampling
Thanks!
[email protected]
Publications / Datasets: http://www.public.asu.edu/~mdechoud/
Twitter: @munmun10
Questions?