Network Structure and its Role in Information Diffusion and User Behavior Brendan Meeder June 12, 2012 Computer Science Department School of Computer Science Carnegie Mellon University Pittsburgh, PA Thesis Committee: Luis von Ahn, chair Manuel Blum Christos Faloutsos Jon Kleinberg, Cornell Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Copyright c 2012 Brendan Meeder


Page 1: Network Structure and its Role in Information Diffusion and User …bmeeder/proposal/proposal.pdf · ther examination of the temporal nature of information cascades. Rather than examine

Network Structure and its Role in Information Diffusion and User Behavior

Brendan Meeder

June 12, 2012

Computer Science Department
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA

Thesis Committee:
Luis von Ahn, chair
Manuel Blum
Christos Faloutsos
Jon Kleinberg, Cornell

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2012 Brendan Meeder


Keywords: social network analysis, network structure, information cascades, stochastic models


Abstract

Social networks such as Twitter and Facebook are becoming a significant source of information production and consumption for many people. These systems have hundreds of millions of users producing hundreds of millions to billions of pieces of information each day. What are the patterns of communication between users? How can we use the network structure to make predictions about the spread of information and filter out what is timely and relevant for a particular user? Making sense of increasingly large quantities of social network data requires analytical techniques that scale to many billions of messages and networks with hundreds of millions of nodes and billions of edges.

This thesis investigates the interplay between network structure, information diffusion, and user behavior. Our main goal is to develop analyses and models that capture the information diffusion process. What do large information cascades in a network look like and how do those cascades differ across topics, network topologies, and user characteristics? Using a corpus of over five billion messages and a snapshot of the network structure between over one-hundred million users with over five billion edges, we examine the structure and evolution of large-scale information cascades and find significant differences across topics. We have also developed methods to accurately estimate when edges in the network were created, allowing us to study the temporal evolution of a large subgraph of the Twitter network and observe how real-world events influence user behaviors in the network. Finally, we have studied how network structure and communication patterns between users vary over time.

Based on this completed work, we propose two areas of further study. The first is further examination of the temporal nature of information cascades. Rather than examine a characterization of information cascades, we want to see how various graph properties such as centrality and density change as the cascade evolves. This work includes both measured observations across different types of information as well as theoretical analyses using commonly used network models. We also propose examining the role of social relationships on the learning and contributions of users in the Duolingo language-learning website. How do social constructs within the site affect the amount that users contribute, the speed at which they progress through the site, and measurements of how well the user has learned the language?


Contents

1 Introduction
1.1 Motivation
1.2 Completed work
1.3 Proposed work

2 Survey
2.1 An Overview of Twitter
2.2 Network Models
2.3 Information Cascades

3 Completed Work
3.1 Twitter: General Observations and Patterns
3.2 Topical Differences in Information Diffusion
3.3 Triadic Closure and User Interactions
3.4 Recovering Social Graph Time-stamps

4 Proposed Work and Conclusion
4.1 Task 1: Topological Evolution of Information Cascades
4.2 Task 2: Social Interactions and Contribution Patterns in Duolingo
4.3 Timeline
4.4 Conclusion


Chapter 1

Introduction

1.1 Motivation

Social networks such as Twitter and Facebook are increasingly important media through which we exchange information, and are especially important for facilitating the spread of time-sensitive information such as breaking news and reports of natural disasters. As of 2012, hundreds of millions to billions of messages are generated and exchanged between hundreds of millions of users across these networks every day. Understanding the pattern of these communications, and how relationships between users affect these patterns, can help identify novel information, both globally and for an individual user. Since we have only so much time to allocate to these activities, the usefulness of social networking diminishes as the burden of checking many messages increases.

Our work studies the interplay between network structure, information diffusion, and user behaviors. A rich collection of research has examined the structure and evolution of networks. Social and communication networks have always been a part of the World Wide Web, first in the form of bulletin boards, chat networks, and email, followed by the recent trends set by services such as Facebook and Twitter, photo-sharing networks like Flickr and Instagram, and many others. These networks comprise not only a network structure, but also pieces of information that are shared, and actions taken on those pieces of information such as repeating and reposting, ‘liking’ or ‘favoriting,’ and replying. We primarily focus on the connections between these pieces as they manifest in the Twitter network.

The first question we ask is what differences exist in how information spreads through networks. Events that exist outside of the network, such as earthquakes, sporting events, or breaking news, appear and spread with a different pattern than pieces of information that are endogenous to the network. Two types of endogenous information are URLs and hashtags (tokens starting with the # symbol) spreading through Twitter. These two types of information are especially interesting because they can be easily traced as they (potentially) spread between users, thereby allowing us to extract many thousands of information cascades. We want to succinctly characterize the diffusion of information, preferably with some kind of model, so that we can examine differences and similarities between different types of information.

Next, how does the network structure affect the information diffusion process? This question addresses both the topology of the network, as well as characteristics of the nodes in the network. Users are not all the same; some use Twitter primarily as a means of hearing about news and other interesting information on the internet, while others use it to keep in touch with friends. How do these different kinds of users contribute to and shape the flow of information? Do certain pieces of information require a particular network topology to support their spread? In addition to studying the impact of network structure and user behavior on information diffusion, we also study what we can learn from the network structure to make predictions about users. For example, if we know that one user follows another, can we predict whether that relationship is reciprocated? How does the amount of interaction between two users change as they start to follow and interact with others?

We are entering a time where the effort required to keep up with a constant stream of information is becoming onerous. The development of tools to extract timely and relevant information is important for reducing the burden of information overload, and we believe that such tools will benefit greatly from an understanding of information diffusion and user behaviors in large social networks. Our work offers extensive observations and methods for analyzing social network data comprising both graph data and messages created and exchanged in the system.

1.2 Completed work

The work we have performed includes network analysis, several detailed studies about information diffusion, and several studies about changing user behavior over time. As a prerequisite to any of this analysis we created a distributed crawling infrastructure and have collected many billions of messages and the user profiles and follower relationships for over 100 million users. Managing and manipulating this data set required the development of several tools and extensive use of a clustered Hadoop system. Below we outline some of the analyses that we have performed using this data.

Network Structure

On the topic of network analysis, we have examined the structure of the Twitter social network using HEigen [Kang et al., 2011], a distributed eigensolver written for the Hadoop platform. Using this system we find outliers that correspond to spam accounts that link to each other and several political figures that have particularly well-connected or especially sparse neighborhoods. Twitter does not provide exact times at which edges form between users; however, it does return the in-edges and out-edges of a node in the order in which they were created. We developed a method for recovering accurate estimates of the time at which edges in the network were created [Meeder et al., 2011] [http://www.cs.cmu.edu/~bmeeder/proposal/timest…]. This theoretical analysis and experimental evaluation allowed us to time-stamp over 860 million edges in the network and examine the time-evolution of this large subgraph in detail.

Since Twitter has directed social links, it is interesting to study the cases in which those links are reciprocated. We study the problem of link prediction and reciprocated-link prediction [Cheng et al., 2011] [http://www.cs.cmu.edu/~bmeeder/proposal/predicting_reciprocity.pdf]. Given only the network neighborhood of two users, we predict whether the directed edge (U, V) exists. Additionally, we study the case in which we know that an edge (U, V) exists and ask whether the reciprocal edge (V, U) is present. We have also studied macroscopic properties of the network such as diameter, degree distribution, density, and connected components, and find that Twitter follows many of the observations about other networks. However, some anomalous characteristics are discovered, such as several spikes in the out-degree distribution.


User Behaviors

We have studied the change in following relationships over time in the presence of information overload. We model a user’s utility as a function of how much information they must consume. An empty timeline is not particularly interesting and has little value, while being overwhelmed with messages is also undesirable. In [Borgs et al., 2010] we study a theoretical model in which a set of ‘celebrity’ accounts must decide at what rate to produce messages so as to maximize the product of their message rate and the number of followers they have. Followers are assumed to have an ordered preference list across celebrities and will follow as many celebrities as possible until their ‘information quota’ is exceeded. In addition to analyzing several theoretical models, we also observed the follower relationships of hundreds of users that author messages at a variety of rates. We see that when these users overloaded their followers with messages it did cause users to stop following them but, paradoxically, it also increased the rate at which new users started following them.

We have studied [Meeder et al., 2010] the behavior of users repeating what others have said in Twitter. This behavior, called retweeting, constitutes a significant fraction of all messages. The focus of this study was to show that information that is likely desired to be kept private, such as telephone numbers, comments about spouses, bosses, or coworkers, and other sensitive information, is often leaked through retweeting. Additionally, we examine the topology of retweet cascades and the latency distribution for retweeting and replying to messages. We find that almost all retweet cascades are shallow and have a highly variable fan-out, and that most retweets and replies happen within several hours of the original message. The distribution of time between tweets has a long tail and has noticeable spikes at one, two, three, etc. minutes for short times and at every hour for longer times.

Finally, we studied how the amount of interaction between users changes as the network evolves [Romero et al., 2011a] [http://www.cs.cmu.edu/~bmeeder/proposal/balance_exchange.pdf]. We extract a subgraph of time-stamped edges in which a directed edge (U, V) exists if U mentions V at least k times, with the creation time of the edge being the time when the kth such message was composed. This network can be thought of as a proxy for the attention that users pay to each other. Taking only edges that are reciprocated, we study the process of triadic closure and its impact on the amount of communication between users. Two competing sociological theories of balance and exchange, which predict what happens as triads close, both seem to appear in our results. Shortly after a triad closes, communication along existing links decreases, while at long times after the triad closes the probability of a continued interaction is increased.
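The mention-graph construction described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in the study: the input layout (tuples of timestamp, author, mentioned user), the function names, and the choice to date a reciprocated pair by the later of its two directed edges are all assumptions.

```python
from collections import defaultdict


def build_mention_graph(mentions, k):
    """Return {(u, v): t}, where t is the time of the k-th mention of v by u.

    `mentions` is an iterable of (timestamp, author, mentioned_user) tuples;
    this layout is an assumption made for illustration.
    """
    counts = defaultdict(int)
    edges = {}
    for t, u, v in sorted(mentions):  # process mentions in time order
        counts[(u, v)] += 1
        if counts[(u, v)] == k:       # the edge is created by the k-th mention
            edges[(u, v)] = t
    return edges


def reciprocated_edges(edges):
    """Keep only pairs that have an edge in both directions.

    Assumption: a reciprocated pair is dated by the later of its two edges.
    """
    return {
        frozenset((u, v)): max(t, edges[(v, u)])
        for (u, v), t in edges.items()
        if (v, u) in edges
    }
```

For example, with k = 2, a user who mentions another user twice creates a directed edge stamped with the time of the second mention; a single mention creates no edge.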

Information Diffusion

Twitter provides an excellent environment in which to study the spread of information. Particular pieces of information, such as hashtags and URLs, are easily traced through the network, and potential paths along which the information could spread are known. Additionally, we can study the volume of key words in tweets over time and detect events that occur outside of Twitter. In [Motoyama et al., 2010] we use tweets to detect when online services are unavailable. Using the messages created surrounding several known service outages, we train a simple exponential-weighted moving average model to detect a signal in the stream of messages and then apply this model to extract over 50 supposed service outages. We confirm that 70 percent of these events were in fact actual service outages, and that for the 15 events for which we didn’t find an official news article or blog post, there is evidence that a service interruption had occurred.
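To make the idea of an exponentially weighted moving average detector concrete, here is a minimal Python sketch. It is not the model from [Motoyama et al., 2010]: the smoothing factor, the threshold, the warm-up length, the floor on the standard deviation, and the function name are all illustrative assumptions. The input is a list of per-interval counts of messages matching some outage-related keyword.

```python
import math


def ewma_alerts(counts, alpha=0.2, threshold=3.0, warmup=5):
    """Return indices of intervals whose count is far above the EWMA baseline.

    All parameter values are illustrative assumptions, not tuned settings.
    """
    if not counts:
        return []
    mean = float(counts[0])  # running EWMA of the counts
    var = 0.0                # running EWMA estimate of the variance
    alerts = []
    for i, x in enumerate(counts[1:], start=1):
        std = max(math.sqrt(var), 1.0)  # floor avoids alerts on tiny variance
        if i >= warmup and x > mean + threshold * std:
            # Flag the interval and leave the baseline untouched so that
            # a burst does not inflate the estimate of normal volume.
            alerts.append(i)
        else:
            diff = x - mean
            mean += alpha * diff
            var = (1 - alpha) * (var + alpha * diff * diff)
    return alerts
```

Skipping the baseline update during an alert is one possible design choice; updating on every interval would make the detector recover faster after a sustained change in volume but would also shorten the alerts it reports.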


In [Romero et al., 2011b] [http://www.cs.cmu.edu/~bmeeder/proposal/info_diffusion_topics.pdf] we define a probabilistic mechanism by which we suppose information spreads through the network. Specifically, we define a probabilistic cascade process in which after k neighbors of a node use a particular token (are ‘infected’), a node has a probability P(k) of using that token. We study the P(k) curves across the 500 most popular hashtags in our Twitter corpus and discover two important characterizations of these curves that vary across different topics of hashtags. Furthermore, we examine the importance of the initial network topology on the successful growth or failure of a cascade for a fixed P(k) curve. We find that the starting topology is critical for the survival and spread of certain topics such as political hashtags.
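The cascade process can be illustrated with a short simulation. The sketch below is one plausible reading of the mechanism, not the simulator used in the paper: each time a node gains a k-th infected neighbor it gets a single chance to adopt with probability P(k). The `followers` mapping (who is exposed when a given user adopts), the function name, and the `rng` argument are assumptions introduced here.

```python
import random
from collections import deque


def simulate_cascade(followers, seeds, p, rng=None):
    """Simulate one cascade and return the set of adopters.

    followers: dict mapping a user to the users who see that user's messages
               (hypothetical layout, chosen for illustration).
    seeds:     users who adopt the token at the start.
    p:         function k -> probability of adopting when the k-th
               neighbor adopts.
    """
    rng = rng or random.Random()
    seeds = list(dict.fromkeys(seeds))  # drop duplicates, keep order
    adopted = set(seeds)
    exposure = {}                       # number of adopting neighbors so far
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in followers.get(u, ()):
            if v in adopted:
                continue
            exposure[v] = exposure.get(v, 0) + 1
            # One adoption attempt per new exposure level.
            if rng.random() < p(exposure[v]):
                adopted.add(v)
                queue.append(v)
    return adopted
```

Running the simulation many times from different seed sets, with the same P(k) curve, is one way to examine how much the starting topology matters for whether a cascade grows or dies out.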

1.3 Proposed work

We propose two main lines of future work. The first is an extension of our analysis of topical spread of information on Twitter. In that work, we primarily focused on the macroscopic properties of hashtags spreading through the network and offered a succinct characterization for each hashtag that captures the stickiness and persistence of that piece of information. We want to further investigate the temporal spread of information through a network and examine the structure of the cascade over time. How centralized does the cascade remain as it grows? What relationships exist between properties of a user (message creation rate, types of messages she authors, number of people she follows, etc.) and her effectiveness at getting her neighbors to participate in the cascade? The main extension in this line of work is our focus on the temporal dynamics of the spread rather than the ‘after-the-fact’ analysis of the entire subgraph induced by the cascade.

Our second line of work involves analyzing the contribution patterns of users in Duolingo, an online language-learning website. In Duolingo users learn a foreign language by doing lessons of hand-crafted sentences as well as translating real-world content such as news articles and blogs. As with other contribution-driven sites such as Wikipedia and Stack Overflow, we expect there to be a very skewed distribution of user activity. However, because translating real-world content is a central element of learning on Duolingo, we expect there to be a different set of incentives encouraging users to translate sentences. We seek to understand the relationship between social relationships in Duolingo, the amount of time spent on the site, and the learning outcome of the user.


Chapter 2

Survey

2.1 An Overview of Twitter

Our completed work primarily focuses on observations of data from the Twitter social network. We familiarize the reader with the necessary concepts we discuss throughout this proposal. Users on Twitter compose short messages, known as tweets, of no more than 140 characters. The possibly asymmetric relationship of following is the primary social link in the network. When a user U follows user V, the messages that V composes will appear in U’s timeline, a reverse chronological listing of messages composed by users that U is following. Unless a user changes her account settings to the ‘private’ state, any messages that she composes will be visible to others in Twitter; additionally, the set of users she follows and the set of users following her will also be visible.

Although tweets are short messages, a rich set of community conventions has been created. Users can reference each other by the @-mention convention. One simply prefixes another user’s username with the @ sign; for example, “I saw @bmeeder give his thesis proposal.” When the @-mention occurs at the beginning of the message it signals that the message has an intended recipient; however, the message is still publicly visible. Additionally, users can repeat something they see to all of their followers; these messages are called retweets. Tokens that start with the # symbol, called hashtags, are used to assign a topic and allow for messages to be ‘aggregated’ in a general way. Hashtags are used to designate the topic of a message or to make it easier to search for the message; for example, users who share haikus use the #haiku hashtag to make it easy to find their compositions. More generally, popular hashtags have arisen for political topics such as healthcare reform and revolutions in the Middle East, sporting events, TV shows, and breaking news events.

We believe a distinctive characteristic of Twitter is that it is both a social network and an information network. The follower relationships require low social cost; this is unlike Facebook, LinkedIn, or other social networking sites where both users must agree to enter the relationship. This asymmetry gives rise to celebrities and ‘informational accounts’ such as those of news organizations, corporations, or brands. We can measure the spread of particular pieces of information such as URLs, retweets, and hashtags through the network in the study of information cascades. The underlying social graph provides an opportunity to study network evolution and various sociological theories such as triadic closure and homophily. Overall, Twitter is a rich social and informational network, with many users from across the world using the service in many different ways.


2.2 Network Models

The networks we study are naturally expressed as undirected or directed graphs in which we might also consider the nodes to have attribute vectors. Models of network formation have a rich history starting with the Erdős-Rényi random graphs [Erdos and Renyi, 1960]. Although this model is thoroughly studied and well understood [Bollobas, 2001], Erdős-Rényi random graphs fail to have many important properties observed in real-world graphs. For example, graphs such as citation networks, the World Wide Web, who-emails-whom networks, and links between blogs have a power-law or heavy-tailed degree distribution [Chakrabarti et al., 2004, Faloutsos et al., 1999, Kleinberg et al., 1999, Newman, 2004]. Additionally, real-world graphs are found to have a small diameter, densification over time, and a high degree of triadic closure. Some examples of random graph models and generators include the Barabási-Albert model [Barabasi and Albert, 1999], the forest-fire model [Leskovec et al., 2005], and Kronecker graphs [Leskovec and Faloutsos, 2007]. Watts and Strogatz [Watts and Strogatz, 1998] propose a simple model for generating small-world graphs, and Kleinberg [Kleinberg, 2000] studies the efficiency of routing messages using only local information.

In addition to the evolution and structure of graphs, researchers have looked at identifying the importance of individual nodes and pathways in networks. Two famous methods for studying the importance of nodes are the hubs and authorities method [Kleinberg, 1999] and PageRank [Page et al., 1998]. In [Kossinets et al., 2008], the authors look at the critical paths required for the timely dissemination of information in the network.

2.3 Information Cascades

Kempe et al. [Kempe et al., 2003] provide a (1 − 1/e)-approximation for maximizing the spread of a rumor under the independent cascade model and the linear threshold model. Subsequent works have used the technique of submodular function optimization to obtain similar results for various influence-maximization/minimization problems [Budak et al., 2011]. Classical models from epidemiology and marketing include the SIS and SIR models [Jackson, 2010] and the Bass model [Bass, 1969]. These models describe the fraction of the population in a particular state with a set of differential equations and do not consider any network structure that might exist between individuals.

The common patterns of cascades that occur in a product recommendation network are examined in [Leskovec et al., 2006]. In [Leskovec et al., 2007b] the authors study the effect of time on the popularity of links going to blog posts. The temporal volume of topics during the 2008 US presidential election is studied in [Leskovec et al., 2009]. [Yang and Leskovec, 2011] examines the temporal variation of topics discussed in Twitter and across 170 million blog posts.


Chapter 3

Completed Work

3.1 Twitter: General Observations and Patterns

We have created a distributed crawler that uses the Twitter API to collect data. This system runs on over eighty machines and has collected nearly 6 billion tweets and a snapshot of the social graph of more than 100 million users. Below we describe some of the descriptive statistics we have calculated over this data.

In Figure 3.1 we show the distributions of account in- and out-degrees, number of tweets, and number of favorites. The distribution of the number of friends (someone a user is following) has several noticeable spikes at r = 20, 220, and 440. These spikes occur because Twitter’s suggested users list had an option to follow a random subset of 20 users. The suggested users list initially had approximately 220 users on it, and was later extended to 440 users. At approximately r = 5000 there is another spike in the distribution. Twitter limits the number of accounts one can follow until they also get followed back; this prevents spammers from tracking a very large number of users.

The exponents α in each box are the power-law fits for the distribution, calculated according to [Clauset et al., 2009]. This method uses a lower-bound cutoff, fitting the distribution considering only the counts for k ≥ k_min (labeled x_min in Figure 3.1). We use lower bounds of k_min = 6 and 25.
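For readers who want to reproduce a fit of this kind, the following Python sketch estimates α for a fixed cutoff using the approximate maximum-likelihood estimator for discrete data given by Clauset et al., α ≈ 1 + n / Σ ln(x_i / (k_min − 1/2)). It is an illustration under stated assumptions rather than the code behind the reported exponents: the function name is ours, the cutoff is supplied by the caller rather than selected from the data, and the approximation is generally considered reasonable only when the cutoff is not too small.

```python
import math


def powerlaw_alpha(values, k_min):
    """Approximate discrete power-law MLE exponent for the tail values >= k_min.

    Uses alpha ~= 1 + n / sum(ln(x / (k_min - 0.5))); the fixed cutoff is an
    input here, which is an assumption of this sketch.
    """
    if k_min < 1:
        raise ValueError("k_min must be at least 1")
    tail = [x for x in values if x >= k_min]
    if not tail:
        raise ValueError("no observations at or above k_min")
    log_sum = sum(math.log(x / (k_min - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum
```

Calling `powerlaw_alpha(friend_counts, 6)` and `powerlaw_alpha(friend_counts, 25)` on a list of per-user friend counts (a hypothetical variable) would give the two exponents corresponding to the two cutoffs used in the text.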

In Figure 3.2 we aggregate the inter-arrival times between messages on Twitter. For each user, we look at her timeline and take the time difference between consecutive messages. These time-deltas are then aggregated across all users in Twitter. We notice that at regular intervals such as 30 minutes, one hour, two hours, and one day there is a significant deviation from the trend. For example, more than ten times as many message-pairs have a time difference of one hour than we would expect based on the trend seen (nearly 10^6 time deltas of 3,600 seconds seen, compared to fewer than 10^5 we would expect based on neighboring counts).
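The aggregation just described is straightforward to express in code. The sketch below is illustrative only: it assumes each user's messages are available as a list of integer timestamps in seconds, and the `spike_ratio` helper, which compares one time delta against the average of its neighboring deltas, is a hypothetical stand-in for "the trend seen".

```python
from collections import Counter


def interarrival_histogram(timelines):
    """Count time differences between consecutive messages of each user.

    timelines: dict mapping a user to a list of message timestamps in
               seconds (assumed layout).
    """
    hist = Counter()
    for times in timelines.values():
        ordered = sorted(times)
        for earlier, later in zip(ordered, ordered[1:]):
            hist[later - earlier] += 1
    return hist


def spike_ratio(hist, delta, window=5):
    """Ratio of the count at `delta` to the mean count of nearby deltas."""
    neighbors = [
        hist.get(delta + d, 0) for d in range(-window, window + 1) if d != 0
    ]
    baseline = sum(neighbors) / len(neighbors)
    return hist.get(delta, 0) / baseline if baseline else float("inf")
```

A ratio near 1 at a given delta indicates no deviation from the local trend; the one-hour spike described above would correspond to `spike_ratio(hist, 3600)` being larger than 10.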

In Figure 3.3 we show the number of Tweets occurring in each minute of the week, where the Tweet time-stamps have been converted to the user’s local time. A clear diurnal cycle exists, with two peaks around 3pm and 9pm. Also, the behavior of users on Friday, Saturday, and Sunday is noticeably different from the Monday-Thursday pattern. The most striking feature of this graph is that the 0th minute of each hour sees approximately 8-10 percent more Tweets generated in that single minute than [...]. These two analyses suggest that a large number of Tweets are composed automatically. We estimate that upwards of 10 percent of messages observed in our data set are automatically generated by programs using the API.
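A minute-of-week histogram of this kind can be computed as sketched below. This is an illustration with assumed inputs, not the analysis code: each post is taken to be a pair of a UTC epoch timestamp and the author's UTC offset in seconds, and `top_of_hour_excess` is a hypothetical helper that compares the average volume in the 0th minute of an hour with the average volume in the other 59 minutes.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

MINUTES_PER_DAY = 24 * 60
HOURS_PER_WEEK = 7 * 24


def minute_of_week_counts(posts):
    """Count posts per local minute of the week (Monday 0:00 is minute 0).

    posts: iterable of (utc_epoch_seconds, utc_offset_seconds) pairs;
           this layout is an assumption made for illustration.
    """
    counts = Counter()
    for epoch, offset in posts:
        local = datetime.fromtimestamp(epoch, tz=timezone(timedelta(seconds=offset)))
        minute = local.weekday() * MINUTES_PER_DAY + local.hour * 60 + local.minute
        counts[minute] += 1
    return counts


def top_of_hour_excess(counts):
    """Relative excess of the 0th minute of an hour over the other minutes."""
    top = sum(c for m, c in counts.items() if m % 60 == 0) / HOURS_PER_WEEK
    rest = sum(c for m, c in counts.items() if m % 60 != 0) / (HOURS_PER_WEEK * 59)
    if rest == 0:
        raise ValueError("no posts outside the 0th minute of an hour")
    return top / rest - 1.0
```

Under these assumptions, a return value of about 0.08 to 0.10 from `top_of_hour_excess` would correspond to the 8-10 percent figure quoted above.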

[Figure 3.1 panels, log-scaled axes:
Number of friends (r) vs. users with r friends: α = 1.9598 (x_min = 25), α = 1.6799 (x_min = 6)
Number of followers (f) vs. users with f followers: α = 1.8945 (x_min = 25), α = 1.853 (x_min = 6)
Number of statuses (s) vs. users with s statuses: α = 1.5951 (x_min = 25), α = 1.4863 (x_min = 6)
Number of favorites (a) vs. users with a favorites: α = 1.8579 (x_min = 25), α = 1.8111 (x_min = 6)]

Figure 3.1: The in- and out-degree distributions, number of Tweets distribution, and number of favorites distribution. Notice that the following-degree distribution has many spikes that occur due to the suggested users list and following limits imposed by Twitter.

We have performed a detailed analysis of the structure of the entire Twitter graph using HEigen [Kang et al., 2011]. This technique detected outlier users such as spammers who follow each other in large cliques and the political candidates John McCain and Barack Obama. Compared to the general trend between the number of followers and number of triangles in their neighborhood, these two accounts had far fewer and far more triangles, respectively, than is expected.


[Figure 3.2 plot, log-scaled axes: Number of Occurrences vs. Interarrival Time T]

Figure 3.2: The distribution of inter-arrival times of Tweets, aggregated across all users. We notice a heavy-tail distribution, and notable anomalous spikes at 30 minutes, 1 hour, and 2 hours.

[Figure 3.3 plot: Number of Updates (×10^5) vs. Local Time, Mon 0:00 through Sun 0:00]

Figure 3.3: The number of posts during each minute of the week, adjusted to the local time of the user. The main points of interest are the change in post volume on Friday and Saturday compared to the rest of the week, and that on the 0th minute of every hour approximately 8 percent more messages are created.


3.2 Topical Differences in Information Diffusion

See http://www.cs.cmu.edu/~bmeeder/proposal/info_diffusion_topics.pdf for the full paper.

Overview

A growing line of recent research has studied the spread of information on-line, investigating the tendency for people to engage in activities such as forwarding messages, linking to articles, joining groups, purchasing products, or becoming fans of pages after some number of their friends have done so [Adar et al., 2004, Backstrom et al., 2006, Cosley et al., 2010, Crane and Sornette, 2008, Gruhl et al., 2004, Leskovec et al., 2007a,c, Liben-Nowell and Kleinberg, 2008, Sun et al., 2009]. The work in this area has thus far focused primarily on identifying properties that generalize across different domains and different types of information, leading to principles that characterize the process of on-line information diffusion and drawing connections with sociological work on the diffusion of innovations [Rogers, 1995, Strang and Soule, 1998].

We look at a sequential, probabilistic model of information cascades in which the probability that a node uses a hashtag depends on the number of their neighbors that have already used the hashtag. From our corpus of Tweet data, we extract the 500 most used hashtags and the time at which each user used a hashtag h. Rather than using the set of following-relationships between users, we extract a network based on the communication patterns between users.

Defining P vs. K curves

We say that a user is k-exposed to hashtag h if he has not used h, but has edges to k other users who have used h in the past. Given a user u that is k-exposed to h we would like to estimate the probability that u will use h in the future. Here are two basic ways of doing this.

Ordinal time estimate. Assume that user u is k-exposed to some hashtag h. We will estimate the probability that u will use h before becoming (k + 1)-exposed. Let E(k) be the number of users who were k-exposed to h at some time, and let I(k) be the number of users that were k-exposed and used h before becoming (k + 1)-exposed. We then conclude that the probability of using the hashtag h while being k-exposed to h is p(k) = I(k)/E(k).

Snapshot estimate. Given a time interval T = (t1, t2), assume that a user u is k-exposed to some hashtag h at time t = t1. We will estimate the probability that u will use h sometime during time interval T. We let E(k) be the number of users who were k-exposed to h at time t = t1, and let I(k) be the number of users who were k-exposed to h at time t = t1 and used h sometime before t = t2. We then conclude that p(k) = I(k)/E(k) is the probability of using h before time t = t2, conditioned on being k-exposed to h at time t = t1. We will refer to p(k) as an exposure curve; we will also informally refer to it as an influence curve, although it is being used only for prediction, not necessarily to infer causal influence.

The ordinal time approach requires more detailed data than the snapshot method. Since our data are detailed enough that we are able to generate the ordinal time estimate, we only present the results based


on the ordinal time approach; however, we have confirmed that the conclusions hold regardless of which approach is followed. Figure 3.4 shows the average P vs. K curve across all 500 topics:

Figure 3.4: The point-wise average of all 500 P vs. K curves.

Extracting the Network

From our Tweet dataset, we build a network on the users from the structure of interaction via @-messages; for users X and Y, if X includes “@Y” in at least t tweets, for some threshold t, we include a directed edge from X to Y. These edges are time-stamped based on the time at which the t-th tweet occurs.
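The edge construction just described can be illustrated with a short Python sketch. This is not the code used for the thesis; the function name `build_mention_network`, the `(timestamp, author, text)` tuple layout, and the regular expression used to find mentions are all assumptions made for the example, and handle case-folding is ignored.

```python
import re
from collections import defaultdict

# Assumed, simplified pattern for "@Y" mentions.
MENTION = re.compile(r"@(\w+)")

def build_mention_network(tweets, t=3):
    """tweets: iterable of (timestamp, author, text) tuples, in any order.

    Returns {(X, Y): time of the t-th tweet by X that includes "@Y"}.
    Pairs with fewer than t such tweets get no edge.
    """
    counts = defaultdict(int)
    edges = {}
    for ts, author, text in sorted(tweets):        # process in time order
        for target in set(MENTION.findall(text)):  # count each tweet once per target
            if target == author:
                continue
            pair = (author, target)
            counts[pair] += 1
            if counts[pair] == t:
                edges[pair] = ts                   # time stamp of the t-th tweet
    return edges
```

Counting a tweet at most once per mentioned user follows the "in at least t tweets" wording; a different reading (counting every mention) would only change the `set(...)` call.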

For a given user X, we call the set of other users to whom X has an edge the neighbor set of X. As users in X’s neighbor set each mention a given hashtag H in a tweet for the first time, we look at the probability that X will first mention it as well; in effect, we are asking, “How do successive exposures to H affect the probability that X will begin mentioning it?” Concretely, following the methodology of [Cosley et al., 2010], we look at all users X who have not yet mentioned H, but for whom k neighbors have; we define p(k) to be the fraction of such users who mention H before a (k + 1)st neighbor does so. In other words, p(k) is the fraction of users who adopt the hashtag directly after their kth “exposure” to it, given that they hadn’t yet adopted it.
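To make the definition of p(k) concrete, here is a minimal Python sketch of the ordinal time estimate p(k) = I(k)/E(k) for a single hashtag. The function name, the input layout, and the choice to date an exposure at the later of the edge time and the neighbor's first use are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import defaultdict

def ordinal_exposure_curve(edges, first_use):
    """edges: {(x, y): time at which the edge x -> y formed}
    first_use: {user: time the user first used the hashtag}
    Returns {k: I(k) / E(k)} for every k with E(k) > 0.
    """
    # Assumption: x is exposed via y once the edge exists and y has used the tag.
    exposures = defaultdict(list)
    for (x, y), edge_time in edges.items():
        if y in first_use:
            exposures[x].append(max(edge_time, first_use[y]))

    exposed = defaultdict(int)   # E(k): users who were k-exposed at some time
    adopted = defaultdict(int)   # I(k): users who adopted while k-exposed
    for x, times in exposures.items():
        use_time = first_use.get(x, math.inf)
        # Number of exposures x had received strictly before adopting.
        m = sum(1 for e in times if e < use_time)
        for k in range(1, m + 1):
            exposed[k] += 1
        if m >= 1 and use_time < math.inf:
            adopted[m] += 1
    return {k: adopted[k] / exposed[k] for k in sorted(exposed)}
```

In this sketch a user who adopts after m exposures contributes to E(1), ..., E(m) and to I(m) only, which matches "adopt directly after their kth exposure".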

Hashtag Categories

The authors partitioned the hashtags into eight categories: celebrity, games, idioms, movies/TV, music, political, sports, and technology. A set of independent volunteers then evaluated the placement of each of the 500 hashtags, and a high level of agreement was found between the authors’ assignment and the volunteer responses. Most categories are self-explanatory, but it is worth mentioning the meaning of the ‘idioms’ category. We define an idiom to be a tag representing a conversational theme on Twitter, consisting of a concatenation of at least two common words. Examples include ‘#cantlivewithout’ and ‘#dontyouhate’.


[Figure 3.5 panels: (a) Technology, (b) Music, (c) Political.]

Figure 3.5: The P vs. K curves for hashtags within a particular category compared to the average P vs. K curve across all topics. Note that Technology is not as ‘infectious’ as most hashtags, while music and political hashtags have a greater tendency to draw other users into the conversation.

Results and Conclusions

The first conclusion is that the P vs. K curves can distinguish between different topics. In Figure 3.5, the red curve depicts the average P vs. K curve for hashtags in the given category. Additionally, we plot the average P vs. K curve across all hashtags in blue, as well as the interval in which 95 percent of the point-wise values fall. We see that for the topic music the P vs. K curves are significantly higher than across all topics; hashtags about music are much more infectious. On the other hand, topics relating to technology seem to be much less infectious than is typical among all hashtags.

Besides characterizing the P vs. K curves, we also studied how the initial set of nodes affects the final size of the cascade. We run simulations taking the first N nodes that use a particular hashtag H1, with the P vs. K curve for a hashtag H2, and run the sequential cascade method until no new nodes use the hashtag. H1 and H2 are either randomly drawn from a particular category or across all hashtags. We find that the initial set of users makes a significant difference in the expected size of the cascade. For example, if we pick a random political P vs. K curve and simulate the cascade using a random political start set, we find that the cascade is, in expectation, two to six times as large as when we start the cascade with the starting nodes for a random hashtag. See section 5 of the paper for more details and graphs.
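One plausible reading of such a simulation is sketched below in Python. It is only an illustration under stated assumptions: the function name `simulate_cascade`, the representation of the P vs. K curve as a function `p(k)`, and the breadth-first processing order are ours, and the full paper should be consulted for the actual sequential cascade procedure.

```python
import random
from collections import defaultdict, deque

def simulate_cascade(edges, seeds, p, rng=None):
    """edges: iterable of (x, y), meaning x has an edge to (pays attention to) y.
    seeds: distinct initial adopters, e.g. the first N users of hashtag H1.
    p: function k -> adoption probability at the k-th exposure (curve of H2).
    Returns the set of all adopters once no new node adopts.
    """
    rng = rng or random.Random()
    watchers = defaultdict(list)      # y -> users who become exposed when y adopts
    for x, y in edges:
        watchers[y].append(x)

    adopted = set(seeds)
    exposure = defaultdict(int)       # number of exposures received so far
    queue = deque(seeds)
    while queue:
        y = queue.popleft()
        for x in watchers[y]:
            if x in adopted:
                continue
            exposure[x] += 1
            # One Bernoulli trial per new exposure, with probability p(k).
            if rng.random() < p(exposure[x]):
                adopted.add(x)
                queue.append(x)
    return adopted
```

Averaging `len(simulate_cascade(...))` over many runs and many (H1, H2) pairs would give an estimate of the expected cascade size for a given choice of start set and curve.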


3.3 Triadic Closure and User Interactions

See http://www.cs.cmu.edu/~bmeeder/proposal/balance_exchange.pdf for the full paper.

When users interact with one another on social media sites, the volume and frequency of their communication can shift over time, as their interaction is affected by a number of factors. In particular, if two users develop mutual relationships to third parties, this can exert a complex effect on the level of interaction between the two users – it has the potential to strengthen their relationship, through processes related to triadic closure, but it can also weaken their relationship, by drawing their communication away from one another and toward these newly formed connections. One belief is that triadic closure, or the increased likelihood that two individuals will be friends if they have mutual acquaintances, influences the evolution of social networks. Romero and Kleinberg [Romero and Kleinberg, 2010] have studied the role of directed triadic closure on the followers of so-called micro-celebrities in Twitter.

We analyze the interplay of these competing forces and relate the underlying issues to classical theories in sociology – the theory of balance, the theory of exchange, and betweenness. Our setting forms an intriguing testing ground for these two theories, in that it provides a scenario in which their qualitative predictions are largely at odds with one another.

Exchange and Betweenness

First, we consider the force of balance. Suppose we have a user B who is friends with users A and C. The principle of balance argues that if A and C do not have a social tie, this absence introduces latent strain into the B-A and B-C relationships, and this strain can be alleviated if an A-C tie forms [Heider, 1958, Rapoport, 1953]. Hence, balance is a force that causes the formation of an A-C tie to strengthen the B-A tie, when C is also linked to B.¹

Counterbalancing this is an equally natural force, which is the principle of exchange [Emerson, 1962, Willer (editor), 1999]. Let’s return to the user B who is friends with users A and C. If A were to become friends with C, this provides A with more social interaction options than she had previously. The theory of exchange argues that this makes A less dependent on B for social interaction, thereby weakening the B-A tie.

Extracted Social Graph

The primary analysis of this data is to extract all @-messages and build a temporal network of ‘attention relationships.’ A directed edge exists from user A to B if A sends at least k @-messages to B; the time this edge is created, t_D(A,B), is the time at which the kth @-message is sent. In our analyses we use k = 3. There are multiple ways of defining a network, and our definition is one way of defining a proxy for the attention that a user A pays to other users. The resulting network contains 8,509,140 non-isolated nodes and 50,814,366 links.

¹ One sees balance theory applied in two related contexts when we consider scenarios such as this, when B has positive relations with A and C. In one line of argument, the absence of an A-C link produces stress that needs to be resolved. A related line of argument considers situations in which there is in fact antagonism between A and C, which produces even stronger forms of stress [Cartwright and Harary, 1956]. Both of these situations point to the same conclusions, and both fall under the principle of balance.


[Figure 3.6 panels: (a) d = 1, (b) d = 3, (c) d = 5.]

Figure 3.6: Percentage of messages from A to B vs. the number of days after creation of the open triad. The green curve is based on the d-open triads and the red curve is based on the d-closed triads. A must have sent from 200 to 1000 messages in total after day 0, and A must have sent at least one message on days 1, d, and 2d.

From this directed, temporal network we extract an undirected, temporal network of ties. An undirected edge between two users A and B is formed when A has sent at least 3 @-messages to B and B has sent at least 3 @-messages to A. The edge E = (A,B) has time-stamp equal to t(A,B) = max{t_D(A,B), t_D(B,A)}, the later of the times when the two directed edges were formed. This tie network contains 20,492,393 ties between 3,701,860 users, and although fewer than half of the users remain in the tie network, over 80% of attention relationships contribute to a tie.

We define an open triad O as a graph of three nodes A, B, and C containing the ties (A,B) and (B,C). The time-stamp of the open triad is O_t = max{t(A,B), t(B,C)}, the time at which the last of the two ties forms. Open triads O = (A,B,C) in which the undirected (A,C) edge eventually forms are said to close. We define an open triad that closes d days after O_t (t(A,C) is d days after O_t) to be a d-closed triad.
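The tie and triad definitions above can be sketched in Python as follows. This is an illustrative sketch, not the analysis code: the function names, the dictionary layouts, the assumption that time stamps are in seconds, and the rounding of d down to whole days are all choices made for the example.

```python
from collections import defaultdict
from itertools import combinations

DAY = 24 * 60 * 60  # assumption: time stamps are measured in seconds

def tie_network(directed):
    """directed: {(A, B): t_D(A, B)} for the attention network.
    Returns {frozenset({A, B}): t(A, B)} with t(A, B) = max of both directions.
    """
    ties = {}
    for (a, b), t_ab in directed.items():
        t_ba = directed.get((b, a))
        if a != b and t_ba is not None:
            ties[frozenset((a, b))] = max(t_ab, t_ba)
    return ties

def triads(ties):
    """Yield (A, B, C, O_t, d) for each open triad with center B.
    d is None if the (A, C) tie never forms, else whole days from O_t to t(A, C).
    """
    neighbors = defaultdict(set)
    for pair in ties:
        a, b = tuple(pair)
        neighbors[a].add(b)
        neighbors[b].add(a)
    for b, around in neighbors.items():
        for a, c in combinations(sorted(around), 2):
            o_t = max(ties[frozenset((a, b))], ties[frozenset((b, c))])
            t_ac = ties.get(frozenset((a, c)))
            if t_ac is None:
                yield (a, b, c, o_t, None)                 # never closes
            elif t_ac > o_t:
                yield (a, b, c, o_t, (t_ac - o_t) // DAY)  # d-closed triad
            # If the (A, C) tie predates O_t, the triad was never open at B.
```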

Main Results

We look at the differences between d-open and d-closed triads in Figure 3.6. For each day after the creation of the open or closed triad, we look at what fraction of the messages sent were between A and B. We require that A composed at least one message on days d and 2d, for d = 1, 3, 5.

The main observation is that the closed triads experience significantly less decay in A-B communications, especially after long periods of time. The results for d = 1, 3, 5 are shown in Figure 3.6.

Exchange Theory and Spill-Over Effects

In the previous section, we observed that in the triad (A,B,C) the communication between A and B benefits in the long run from the triad’s closing. At a more general level, we will now ask what can be predicted about the A-B interaction from knowledge of how active A was with respect to users other than B.

Exchange theory posits that as A has more “outside options” provided by communication partners who are not B, A will spend less time communicating with B. One hypothesis, then, is that as A spends more


time talking to her friends who are not B, A’s communication with B will decrease. Alternatively, we can consider a simple model based on the schematic picture in Figure 3.7, where A first decides how much time to spend on Twitter, and then divides that time evenly between all of her friends on Twitter. According to this model, the more time A spends talking to anyone on Twitter, the more time she will spend talking to B as well.

Figure 3.7: The outside option: A must distribute her attention not only among users on Twitter, but also among real-world tasks like paying bills.

We find that as a user spends more time on Twitter, as indicated by the number of messages that A sends to users other than B, the number of A-B messages increases.

Conclusions

There are many forces that affect the strength and longevity of ties on social media sites, and it is a challenge to separate these into their distinct effects. This work offers a set of data analysis methodologies that lets us begin to isolate the effect of three such forces: balance, in which ties are strengthened when they close triads; exchange, in which ties are weakened when one end of the tie has other opportunities; and betweenness, in which ties are strengthened when they serve as conduits for information. Our analyses show the power of balance in the domain we study, Twitter. They also show that exchange theory should be broadened to conceptually include off-site opportunities for participants in a tie, reflecting the rapid rate at which ties decay.

3.4 Recovering Social Graph Time-stamps

See http://www.cs.cmu.edu/~bmeeder/proposal/timestamping.pdf for the full paper.

When studying the temporal evolution of a network we require some kind of ordering over the edge formation events. Obviously, if the time at which each edge is created is known then the evolution of the graph can be fully reconstructed. The Twitter API does not provide time-stamps for each edge; rather, it returns the edges in the local order in which they were created. Thus, we know which of two given edges


with the same source or destination were created first in the network. Using this information across all edges it is possible to create a partial ordering over edge creation events; however, we still have no idea as to the correspondence between these events and the real-world time at which edges were created.

We exploit one additional feature of Twitter to develop a method for recovering lower bounds for when an edge is created. Namely, after a user signs up for the service they are presented with a list of recommended users to follow. Many celebrities and news organizations are on this suggested users list, and therefore a large fraction of new account registrations start following at least one of these recommended accounts. We use the fact that some users follow these recommended accounts shortly after their account is created to derive highly accurate estimates of when edges are created.

Method Overview

We use two simple facts to determine a lower-bound for the creation time of every edge incident to a vertex C. Namely, for any edge (u, v) in the graph, the time at which the edge is created must be after the creation time of users u and v. Moreover, if we know that the edge (u, C) is created before (v, C), and that u created their account after user v, we know that (v, C) was created after u registered their account. This is a stronger result (a larger lower-bound) than what can be inferred by looking at the edge (v, C) alone. Therefore, our method is simply to take the maximum of every lower-bound we can infer for an edge (u, C) and assume that the edge is created at that time. Succinctly, the estimated follow time F̂_u is given by

$$\hat{F}_u = \max_{a \le u} C_a,$$

where a ≤ u means that a comes before u in C’s follower list, and C_a is the account creation time for user a.
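Because the estimate is a running maximum over the follower list, it can be computed in a single pass. The following Python sketch illustrates this; the function name and the assumption that the input is ordered from the earliest follower to the most recent one are ours.

```python
from itertools import accumulate

def estimate_follow_times(creation_times):
    """creation_times: account creation times C_a of C's followers,
    listed in the order in which they started following C (earliest first).
    Returns the estimated follow time for each follower: the running maximum
    of the creation times seen so far, including the follower's own.
    """
    return list(accumulate(creation_times, max))
```

For example, a follower whose account was created at time 3 but who appears in the list after a follower created at time 5 receives the estimate 5, since the edge cannot predate the earlier follower's registration.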

Empirical Evaluation and Theoretical Result

How well does this method work in practice? We define the error of the method as applied to a particular user to be the maximum difference between the estimated follow time and the actual follow time for any follow event. It is clear that in order for this method to produce estimates with small error, users must start following the celebrity C very shortly after they create their account. Furthermore, the rate at which record-breaker users start following C must be sufficiently high so that the error of timestamp estimates between record-breaker users does not grow too large. To perform an empirical evaluation we choose approximately 1,800 ‘celebrity’ users who either have a large number of followers or are on Twitter’s suggested users list. We pick the 1,000 most followed Twitter users according to Twitaholic.com and an additional 800 that are on Twitter’s suggested users list. A total of 862 million follow-links connect to these users from over 74 million users.

To determine the actual follow time, we repeatedly crawl the celebrity users’ followers within a small timeframe. For each new edge that we observe we can determine an interval in which that edge is formed. After applying the time-stamping method we then derive lower- and upper-bounds for the estimate error for that edge. If an edge is created between [T1, T2] and we have an estimated creation time of E, the lower-bound for the error is max(0, T1 − E) and the upper-bound is T2 − E.
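As a small worked illustration of these bounds, the Python sketch below evaluates them for one edge. The function name and the use of plain numbers (for example, seconds) for times are assumptions for the example.

```python
def error_bounds(estimate, t1, t2):
    """estimate: estimated creation time E of an edge.
    t1, t2: ends of the crawl interval [T1, T2] in which the edge appeared.
    Returns (lower bound, upper bound) on the estimation error.
    """
    lower = max(0, t1 - estimate)  # error is zero if E may fall inside the interval
    upper = t2 - estimate
    return lower, upper
```

With a five-minute crawl interval [160, 460] and an estimate E = 100, the error lies between 60 and 360; with E = 200 it lies between 0 and 260.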

Over a period of two weeks we repeatedly crawl the follower lists of these 1,800 celebrities every fiveminutes to get a five minute interval in which each edge could have been created. At the end of the two


week period we crawl the entire follower list of each celebrity and apply our time-stamping method. For each follower edge we find the five-minute interval in which the edge could have been created and calculate the upper- and lower-bounds on the error for the time-stamp estimate. Our results show that for users who gained at least 10,00 new followers in the two-week period, the average time-stamp error was less than 10 minutes and the maximum error across any follow event was no more than 8 hours; for many of these celebrities, the maximum timestamp error was less than one hour.

[Figure 3.8 plots: fraction of celebrities with more than k followers vs. celebrity degree k (x 10^6); fraction of users following k celebrities vs. user degree k.]

Figure 3.8: The distribution of celebrity degree and follower degree. We notice that anomalies in the follower degree-distribution occur because Twitter has automatic settings at 20 followers, as well as there being 220 and 440 users who were on the suggested users list at some point.

How well can the method work in theory? If the latency with which new users can follow others is arbitrarily small, then the probability that the estimated error is larger than some δ > 0 diminishes as the rate at which users follow C increases. Formally, let ℓ(t) be the latency probability density such that for every α > 0,

$$\int_0^{\alpha} \ell(t)\,dt > 0.$$

We assume that every λ seconds C gets a new follower. Given a desired error bound of δ > 0, if

$$\left( \int_0^{\delta/2} \ell(t)\,dt \right)^{\delta/2\lambda} < \epsilon$$

then the probability that the estimated timestamp has error bigger than δ, for any user, is less than ε.

Applications of the Method

The first application is to evaluate models of network formation on a large, real-world graph. We evaluated random-attachment, preferential-attachment, and preferential-attachment with fitness. Random-attachment is clearly not a good model of the network evolution. Furthermore, the two preferential-attachment models have difficulty capturing pop-star phenomena such as Lady Gaga and Justin Bieber. These users became exceptionally famous after many other celebrities such as Oprah Winfrey were already well-established on Twitter. It would be worth investigating the growth of the graph where the historical state is discarded.

Additionally, we use the time-stamping to see how real-world events affect the following behavior of users. We apply our method to the timeline of the fifty most popular accounts on Twitter from early 2009


until October 2010. Using the time-stamped follower lists we look at what fraction of follow events go to each of these fifty celebrities every day. A graph of these fractions over time is shown in Figure 3.9.
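The daily fractions can be computed with a simple aggregation, sketched below in Python. The function name, the `(day, celebrity)` event layout, and the choice to normalize over the events passed in (rather than over all follow events on Twitter) are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def daily_follow_share(events):
    """events: iterable of (day, celebrity) pairs, one per time-stamped follow.
    Returns {day: {celebrity: fraction of that day's follow events}}.
    """
    per_day = defaultdict(Counter)
    for day, celebrity in events:
        per_day[day][celebrity] += 1
    shares = {}
    for day, counts in per_day.items():
        total = sum(counts.values())
        shares[day] = {c: n / total for c, n in counts.items()}
    return shares
```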

[Figure 3.9 plots. One panel: relative popularity (0.00 to 0.14) vs. date (January 2009 to October 2010) for Lady Gaga, Ashton Kutcher, Justin Bieber, Oprah Winfrey, Taylor Swift, and random attachment. Another panel, “Accurate Celebrity Follow and Account Creation Rates”: follow events (per hour) and account creations (per day) vs. date, showing the hourly follow rate, the four-day average hourly follow rate, and the daily account creation rate.]

Figure 3.9: Relative follow-popularity of five celebrities over time. (1) Ashton Kutcher appears on Oprah Winfrey’s TV show to discuss Twitter. (2) A spike in Taylor Swift’s popularity when she was interrupted by Kanye West at the Music Awards. (3) Lady Gaga performing at the Emmy’s and (4) a release of her “Telephone” music video.


Chapter 4

Proposed Work and Conclusion

4.1 Task 1: Topological Evolution of Information Cascades

We seek to continue our studies of information diffusion as observed in Twitter.

One part of this work is to examine how cascades evolve under different mechanisms on several families of graphs. For a given graph G and a set I of vertices starting the cascade, how does the topology of the evolution change as the cascade unfolds? We are interested in different centrality measurements of the cascade over time. We hypothesize that for information that is of interest to a particular group the cascade will remain highly clustered. Other topics, which have a wider reach, will be less clustered. For some pieces of information, especially those which are of a specific interest such as technology, shopping, or something region-specific

We will examine the following graph properties as the cascade evolves:

• Number of connected components

• Diameter of the largest connected component

• Edge density (based on both the total number of edges and the number of edges along which theinformation propagation could have happened).

• Centrality measures (betweenness, closeness, eigenvalue)
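Several of the properties listed above can be computed directly from the subgraph induced by the users who have adopted so far. The Python sketch below is a minimal, standard-library illustration under our own assumptions: edges are treated as undirected, density is taken over all possible pairs of adopters, and the function and key names are hypothetical. Centrality measures are omitted.

```python
from collections import deque

def cascade_properties(edges, adopters):
    """edges: iterable of (u, v) pairs from the underlying network.
    adopters: users who have joined the cascade so far.
    Returns the component count, the diameter of the largest component,
    and the edge density of the induced (undirected) cascade subgraph.
    """
    adopters = set(adopters)
    adj = {u: set() for u in adopters}
    for u, v in edges:
        if u in adopters and v in adopters and u != v:
            adj[u].add(v)
            adj[v].add(u)

    def bfs(src):
        # Hop distances from src to every node in its component.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    components, seen = [], set()
    for u in adj:
        if u not in seen:
            comp = set(bfs(u))
            seen |= comp
            components.append(comp)

    largest = max(components, key=len, default=set())
    # Exact diameter via one BFS per node; fine for small cascades only.
    diameter = max((max(bfs(u).values()) for u in largest), default=0)
    n = len(adopters)
    m = sum(len(nb) for nb in adj.values()) // 2
    density = m / (n * (n - 1) / 2) if n > 1 else 0.0
    return {"components": len(components), "diameter": diameter, "density": density}
```

Calling this function on the adopter set at successive times gives a time series of each property as the cascade evolves; at the scale of the full Twitter graph an approximate diameter would be needed instead of the all-pairs search used here.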

Using the results of this study, we will build a system and apply it to a real-time feed of approximately 10 percent of traffic on Twitter. This has several advantages over the data collection and analysis methodology we have used so far:

• It allows us to analyze what is currently happening across all users. Our crawling methodology, although almost complete for users that we crawl, results in a corpus that is not particularly recent.

• We can evaluate the effectiveness of the method on uniformly sampled message data.

• Users can interact with the system and evaluate its effectiveness at bringing relevant information to their attention.


4.2 Task 2: Social Interactions and Contribution Patterns in Duolingo

Duolingo provides a unique opportunity to study a brand new community of users. With Duolingo, users learn from both hand-crafted sentences, which are relatively simplistic, and sentences from real-world content that can be quite unconstrained and grammatically difficult. We wish to study the contribution patterns of users over an extended duration. Similar to what has been observed on sites such as Wikipedia, we have seen a small fraction of the userbase spend a considerable amount of time on the site each day. Unlike Wikipedia, we believe that there are increasing returns for spending time on the site as a user becomes more proficient in the foreign language.

Duolingo also has a social component whereby users can follow each other’s progress, and this also serves as a form of friendly competition. We want to measure the efficacy of these interactions in engaging users and improving the learning outcome. Additionally, there is a question-and-answer component of the site in which users can ask and answer questions about each language. This is another manner in which users can contribute to the site.

4.3 Timeline

We expect that these two projects can be carried out in parallel. In particular, the empirical studies of user behavior on Duolingo will need to take place over an extended duration; an invitation-based beta started in November 2011. In mid-June the site will be open to the public, and we would expect to study users over a period of at least six months.

The following outlines our expected progress, with an expected completion by August 2013.

• June 2012: Thesis proposal.

• June - August 2012: Analyze cascades from streaming sampled data. Simulate various cascade models against synthetic and observed network data, perform topological analysis on resulting cascades.

• September 2012: Write up results for WWW 2013 conference.

• October - December 2012: Perform first large-scale analysis of users in Duolingo. This will be our first opportunity to perform analysis after three or more months using the site.

• January - February 2013: Build system to process sampled Twitter feed and a simple web interface for this tool.

• March - May 2013: Revisit Duolingo analysis after many users have been on the site for six to nine months. Measure impact of different social recommendation systems.

• June - July 2013: Thesis writing.

• August 2013: Thesis defense.


4.4 Conclusion

In this thesis we analyze the interplay between network structure, user behavior, and information diffusion. We have developed the infrastructure for collecting and analyzing very large quantities of information from the Twitter social network. Additionally, we have access to data from Duolingo, a language-learning website in which users simultaneously learn a language and translate real-world content.

Our contributions so far include:

• A method for accurately estimating the times at which more than 860 million edges in the Twitter follower-graph were created. Using these time-stamped edges, we are able to see the effect of real-world events on the following behavior of users in the network.

• Discovering differences in the spread of topical hashtags. We present a probabilistic model of information diffusion and show statistically significant differences in both the model parameters and the topology of the initial nodes mentioning each hashtag.

• Measuring the impact of information overload on user following and unfollowing behavior.

• An analysis of the reciprocity of relationships in Twitter based on the network neighborhoods of two users.

• Learning and applying a simple model for detecting when online services are unavailable, based on a temporal analysis of message content in Twitter.

The proposed work is summarized as follows:

• Analyze and characterize the temporal characteristics of information cascades in Twitter. We will focus on the change in graph-theoretical properties of the cascade subgraph over time.

• Develop a tool that allows a user to see the results of our analyses applied to her account. This will primarily integrate our results from neighborhood analysis of the user’s network neighborhood and temporal analysis of message data.

• Understand the contribution patterns of users in Duolingo and how social interactions and algorithmic decisions affect these patterns.


Bibliography

E. Adar, L. Zhang, L. A. Adamic, and R. M. Lukose. Implicit structure and the dynamics of blogspace.In Workshop on the Weblogging Ecosystem, 2004.10

L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: Mem-bership, growth, and evolution. InProc. 12th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining, 2006.10

A. L. Barabasi and R. Albert. Emergence of scaling in random networks. Science (New York,N.Y.), 286(5439):509–512, Oct. 1999. ISSN 1095-9203. doi: 10.1126/science.286.5439.509. URLhttp://dx.doi.org/10.1126/science.286.5439.509 . 6

F. M. Bass. A new product growth for model consumer durables.Management Science, 15(5):215–227,1969.6

B. Bollobas.Random Graphs. Cambridge University Press, 2001.6

C. Borgs, J. Chayes, B. Karrer, B. Meeder, R. Ravi, R. Reagans, and A. Sayedi. Game-theoretic models of information overload in social networks. In R. Kumar and D. Sivakumar,editors, Algorithms and Models for the Web-Graph, volume 6516 ofLecture Notes in Com-puter Science, pages 146–161. Springer Berlin / Heidelberg, 2010. ISBN 978-3-642-18008-8. URL http://dx.doi.org/10.1007/978-3-642-18009-5_14 . 10.1007/978-3-642-18009-514.3

C. Budak, D. Agrawal, and A. El Abbadi. Limiting the spread ofmisinformation in social networks.In Proceedings of the 20th international conference on World wide web, WWW ’11, pages 665–674,New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0632-4. doi: 10.1145/1963405.1963499. URLhttp://doi.acm.org/10.1145/1963405.1963499 . 6

D. Cartwright and F. Harary. Structure balance: A generalization of Heider’s theory.Psychological Review,63(5):277–293, Sept. 1956.13

D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining.SIAM Int. Conf.on Data Mining, Apr. 2004.6

J. Cheng, D. Romero, B. Meeder, and J. Kleinberg. Predictingreciprocity in social networks. InIEEE ThirdInternational Conference on Social Computing, 2011.2

A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data.SIAM Rev., 51(4):661–703, Nov. 2009. ISSN 0036-1445. doi: 10.1137/070710111. URLhttp://dx.doi.org/10.1137/070710111 . 7

D. Cosley, D. P. Huttenlocher, J. M. Kleinberg, X. Lan, and S.Suri. Sequential influence models in socialnetworks. InProc. 4th International Conference on Weblogs and Social Media, 2010.10, 11

R. Crane and D. Sornette. Robust dynamic classes revealed bymeasuring the response function of a social

22

Page 27: Network Structure and its Role in Information Diffusion and User …bmeeder/proposal/proposal.pdf · ther examination of the temporal nature of information cascades. Rather than examine

system.Proc. Natl. Acad. Sci. USA, 105(41):15649–15653, 29 September 2008.10

R. M. Emerson. Power-dependence relations.American Sociological Review, 27:31–40, 1962.13

P. Erdos and A. Renyi. On the evolution of random graphs.Publ. Math. Inst. Hungary. Acad. Sci., 5:17–61,1960.6

M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology.SIG-COMM, pages 251–262, Aug-Sept. 1999.6

D. Gruhl, D. Liben-Nowell, R. V. Guha, and A. Tomkins. Information diffusion through blogspace. InProc.13th International World Wide Web Conference, 2004.10

F. Heider. The Psychology of Interpersonal Relations. John Wiley & Sons, 1958. 13

M. Jackson. Social and Economic Networks. Princeton University Press, 2010. 6

U. Kang, B. Meeder, and C. Faloutsos. Spectral analysis for billion-scale graphs: Discoveries and implementation. In J. Huang, L. Cao, and J. Srivastava, editors, Advances in Knowledge Discovery and Data Mining, volume 6635 of Lecture Notes in Computer Science, pages 13–25. Springer Berlin / Heidelberg, 2011. ISBN 978-3-642-20846-1. doi: 10.1007/978-3-642-20847-8_2. URL http://dx.doi.org/10.1007/978-3-642-20847-8_2. 2, 8

D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD ’03, 2003. 6

J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999. URL http://citeseer.ist.psu.edu/kleinberg99authoritative.html. 6

J. M. Kleinberg. Navigation in a small world. Nature, 406(6798), August 2000. ISSN 0028-0836. doi: 10.1038/35022643. URL http://dx.doi.org/10.1038/35022643. 6

J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. S. Tomkins. The Web as a graph: Measurements, models and methods. Lecture Notes in Computer Science, 1627:1–17, 1999. 6

G. Kossinets, J. Kleinberg, and D. Watts. The structure of information pathways in a social communication network. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’08, pages 435–443, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-193-4. doi: 10.1145/1401890.1401945. URL http://doi.acm.org/10.1145/1401890.1401945. 6

J. Leskovec and C. Faloutsos. Scalable modeling of real graphs using Kronecker multiplication. In Proceedings of the 24th international conference on Machine learning, ICML ’07, pages 497–504, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-793-3. doi: 10.1145/1273496.1273559. URL http://doi.acm.org/10.1145/1273496.1273559. 6

J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD ’05, pages 177–187, New York, NY, USA, 2005. ACM. ISBN 1-59593-135-X. doi: 10.1145/1081870.1081893. URL http://doi.acm.org/10.1145/1081870.1081893. 6

J. Leskovec, A. Singh, and J. Kleinberg. Patterns of influence in a recommendation network. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2006. 6

J. Leskovec, L. Adamic, and B. Huberman. The dynamics of viral marketing. ACM Transactions on the Web, 1(1), May 2007a. 10

J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Cascading behavior in large blog graphs. SIAM International Conference on Data Mining (SDM), 2007b. 6

J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Cascading behavior in large blog graphs. In Proc. SIAM International Conference on Data Mining, 2007c. 10

J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’09, pages 497–506, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9. doi: 10.1145/1557019.1557077. URL http://doi.acm.org/10.1145/1557019.1557077. 6

D. Liben-Nowell and J. Kleinberg. Tracing information flow on a global scale using Internet chain-letter data. Proc. Natl. Acad. Sci. USA, 105(12):4633–4638, Mar. 2008. 10

B. Meeder, J. Tam, P. G. Kelley, and L. F. Cranor. : Widespread violation of privacy settings in the Twitter social network, 2010. 3

B. Meeder, B. Karrer, A. Sayedi, R. Ravi, C. Borgs, and J. Chayes. We know who you followed last summer: inferring social link creation times in Twitter. In Proceedings of the 20th international conference on World wide web, WWW ’11, pages 517–526, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0632-4. doi: 10.1145/1963405.1963479. URL http://doi.acm.org/10.1145/1963405.1963479. 2

M. Motoyama, B. Meeder, K. Levchenko, G. M. Voelker, and S. Savage. Measuring online service availability using Twitter. In Proceedings of the 3rd conference on Online social networks, WOSN’10, pages 13–13, Berkeley, CA, USA, 2010. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1863190.1863203. 3

M. E. J. Newman. Power laws, Pareto distributions and Zipf’s law, December 2004. 6

L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998. URL citeseer.ist.psu.edu/page98pagerank.html. 6

A. Rapoport. Spread of information through a population with socio-structural bias I: Assumption of transitivity. Bulletin of Mathematical Biophysics, 15(4):523–533, Dec. 1953. 13

E. Rogers. Diffusion of Innovations. Free Press, fourth edition, 1995. 10

D. M. Romero and J. M. Kleinberg. The directed closure process in hybrid social-information networks, with an analysis of link formation on Twitter. In ICWSM, 2010. 13

D. M. Romero, B. Meeder, V. Barash, and J. M. Kleinberg. Maintaining ties on social media sites: The competing effects of balance, exchange, and betweenness. In ICWSM, 2011a. 3

D. M. Romero, B. Meeder, and J. Kleinberg. Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on Twitter. In Proceedings of the 20th international conference on World wide web, WWW ’11, pages 695–704, New York, NY, USA, 2011b. ACM. ISBN 978-1-4503-0632-4. doi: 10.1145/1963405.1963503. URL http://doi.acm.org/10.1145/1963405.1963503. 4

D. Strang and S. Soule. Diffusion in organizations and social movements: From hybrid corn to poison pills. Annual Review of Sociology, 24:265–290, 1998. 10

E. Sun, I. Rosenn, C. Marlow, and T. M. Lento. Gesundheit! Modeling contagion through Facebook News Feed. In Proc. 3rd International Conference on Weblogs and Social Media, 2009. 10

D. J. Watts and S. H. Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393(6684):440–442, June 1998. ISSN 0028-0836. doi: 10.1038/30918. URL http://dx.doi.org/10.1038/30918. 6

D. Willer (editor). Network Exchange Theory. Praeger, 1999. 13

J. Yang and J. Leskovec. Patterns of temporal variation in online media. In Proceedings of the fourth ACM international conference on Web search and data mining, WSDM ’11, pages 177–186, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0493-1. doi: 10.1145/1935826.1935863. URL http://doi.acm.org/10.1145/1935826.1935863. 6
