Transcending our views to sequential data Markus Luczak-Roesch | @mluczak University of Southampton, Web and Internet Science http://markus-luczak.de


Post on 13-Apr-2017


[Figure: number of observed documents (y-axis) over time (x-axis); labels HF and LF]

[1] Kleinberg, J. (2003). Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery, 7(4), 373-397.

[2] Subašić, I., & Berendt, B. (2013). Story graphs: Tracking document set evolution using dynamic graphs. Intelligent Data Analysis, 17(1), 125-147.

Content streams as automata [1]

“The key notion of TTM is burstiness – sudden increases in frequency of text fragments, and all TTM methods aim to model burstiness.” [2]

[Figure: activity timelines over time t for System A, System B and System C]

Related activity?

Building transcendental information cascades

only local understanding of its use but also an abstract global view. This lets us propose a new model that we call transcendental information cascades. Informed by Kleinberg's work on bursty structures in document streams [2], it regards time as the only ascertainable condition for relationships between any pairs of resources, meaning that we focus on the coincidence of information sharing activities rather than socially-determined conditionality.

In [20] we presented the initial definition of a transcendental information cascade as a 4-tuple $TC = (V, E, R, F)$. This 4-tuple represents a directed network consisting of a set of nodes $V$ and edges $E$, derived by applying a set of matching functions $F$ to a set of resources $R = \{r_1, r_2, \ldots, r_m\}$, $r_i = (u_i, t_i, c_i)$, where every $u_i$ is a unique identifier of a resource $r_i$ that was shared at the time $t_i$ with the content $c_i$. Nodes in the network are those resources from $R$ that contain a set $I_i$ of one or multiple cascade identifiers. A cascade identifier is any unique informational pattern that is recognized by applying a matching function to the content or any other inherent properties of a resource (e.g. simple string matching algorithms to identify keywords in content). Formally, a matching function $f_k \in F$, $k \in \mathbb{N}$, $k \le n$ is defined as:

\[
f_k(c_i) =
\begin{cases}
\{i_1, i_2, \ldots, i_x\} & \text{if } f_k \text{ matches patterns } \{i_1, i_2, \ldots, i_x\} \text{ in } c_i,\ x \in \mathbb{N} \\
\emptyset & \text{otherwise}
\end{cases}
\]
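As a rough illustration of the definition above, the following sketch shows two possible matching functions. Both `hashtag_matcher` and `uri_matcher`, as well as their regular expressions, are hypothetical choices made here for illustration and are not taken from the paper. Each returns the set of identifiers it finds, or the empty set otherwise.

```python
import re
from typing import FrozenSet


def hashtag_matcher(content: str) -> FrozenSet[str]:
    """Hypothetical matching function: each '#word' token is one cascade identifier."""
    return frozenset(re.findall(r"#\w+", content))


def uri_matcher(content: str) -> FrozenSet[str]:
    """Hypothetical matching function: each http(s) URI is one cascade identifier."""
    return frozenset(re.findall(r"https?://\S+", content))


# A resource's content c_i, made up for this example.
content = "Slides at http://markus-luczak.de #A #B"

print(sorted(hashtag_matcher(content)))   # identifiers found by the hashtag matcher
print(sorted(uri_matcher(content)))       # identifiers found by the URI matcher
print(hashtag_matcher("no pattern here")) # empty set: the 'otherwise' case
```

Returning a `frozenset` keeps the result hashable and makes the "no match" case a plain empty set rather than a special value.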

Nodes $V$ and edges $E$ are then given as follows:

\[
V = \{v_1, v_2, \ldots, v_p\}, \quad v_y = (u_y, t_y, I_y),
\]
\[
E = \{e_1, e_2, \ldots, e_q\}, \quad e_z = (u_a, u_b, \Lambda_z)
\]

with $I_i = \{i_1, i_2, \ldots, i_o\} = f_1(c_i) \cup f_2(c_i) \cup \ldots \cup f_n(c_i)$ being the result of the concatenation of all identifiers found by all matching functions². An edge exists between any two nodes that share a unique subset of all the cascade identifiers that were found for them. Neither this subset nor any of its subsets is part of the identifiers found for any node that was created in the time period between the creation of the two linked nodes.

\[
\Lambda_z = \{\, i_r \mid i_r \in I_a \wedge i_r \in I_b,\ \forall i_r \rightarrow V' = \{\, v_c \mid v_c = (u_c, t_c, I_c),\ i_r \in I_c \wedge t_a \le t_c \le t_b \,\} = \emptyset,\ v_c \in V,\ r \in \mathbb{N},\ r \le |I_b| \,\}
\]
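The node and edge definitions above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: timestamps are assumed to be distinct, "between" is read as excluding the two linked nodes themselves, and the names `build_cascade` and `hashtags` are hypothetical. Under these assumptions, linking each identifier of a new node to the most recent earlier node carrying it gives the same edges as checking every pair of nodes against the definition.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, List, Sequence, Set, Tuple

Resource = Tuple[str, float, str]           # r_i = (u_i, t_i, c_i)
Node = Tuple[str, float, FrozenSet[str]]    # v_y = (u_y, t_y, I_y)
Edge = Tuple[str, str, FrozenSet[str]]      # e_z = (u_a, u_b, Lambda_z)
Matcher = Callable[[str], FrozenSet[str]]


def build_cascade(resources: Sequence[Resource],
                  matchers: Sequence[Matcher]) -> Tuple[List[Node], List[Edge]]:
    nodes: List[Node] = []
    edges: List[Edge] = []
    last_seen: Dict[str, str] = {}  # identifier -> most recent node carrying it

    for u, t, c in sorted(resources, key=lambda r: r[1]):
        # I_i: union of the identifiers found by all matching functions.
        ids = frozenset().union(*(f(c) for f in matchers))
        if not ids:
            continue  # resources without identifiers do not become nodes
        nodes.append((u, t, ids))

        # Group this node's identifiers by the latest earlier node sharing them;
        # no node in between can carry such an identifier, by construction.
        labels: Dict[str, Set[str]] = defaultdict(set)
        for i in ids:
            if i in last_seen:
                labels[last_seen[i]].add(i)
            last_seen[i] = u
        for source, shared in labels.items():
            edges.append((source, u, frozenset(shared)))
    return nodes, edges


def hashtags(content: str) -> FrozenSet[str]:
    """Hypothetical matcher: whitespace-separated tokens starting with '#'."""
    return frozenset(w for w in content.split() if w.startswith("#"))


# Four made-up resources, given as (identifier, time, content).
resources = [
    ("2", 2.0, "#A #B"),
    ("3", 3.0, "#A"),
    ("4", 4.0, "#A"),
    ("5", 5.0, "#A #B #C"),
]
nodes, edges = build_cascade(resources, [hashtags])
for a, b, label in edges:
    print(a, "->", b, sorted(label))
```

For these four resources the sketch yields the edges 2→3, 3→4 and 4→5 labelled with #A, and 2→5 labelled with #B only, because #A reappears in nodes 3 and 4 before node 5 is created.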

A node that contains a cascade identifier that was not detected for any other node before is called the identifier root. Besides this, we call a node without any incoming edges a network root and a node that has no outgoing edges a stub. Our cascade model clearly yields different outputs depending on the data to hand (e.g. determined by the extent of the Web crawl), and the matching algorithms determining which cascade identifiers will be spotted (e.g. reuse of hashtags, URIs, quotes, images, or maybe exploiting wider semantics or sentiment) as depicted in Figure ??.

² Please note that [20] contains an unintentionally malformed equation for this as the wrong symbol was used to refer to the concatenation of the matching functions.
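The three node roles named above (identifier root, network root, stub) can be read off a cascade once its nodes and edges are known. The sketch below is illustrative only; the function name `classify_nodes` and the hard-coded example cascade are assumptions made for this example.

```python
from typing import Dict, FrozenSet, Sequence, Set, Tuple

Node = Tuple[str, float, FrozenSet[str]]    # (u, t, I)
Edge = Tuple[str, str, FrozenSet[str]]      # (u_a, u_b, label)


def classify_nodes(nodes: Sequence[Node], edges: Sequence[Edge]) -> Dict[str, Set[str]]:
    all_ids = {u for u, _, _ in nodes}
    has_outgoing = {a for a, _, _ in edges}
    has_incoming = {b for _, b, _ in edges}

    # Identifier roots: nodes carrying at least one identifier not seen earlier.
    seen: Set[str] = set()
    identifier_roots: Set[str] = set()
    for u, _, ids in sorted(nodes, key=lambda n: n[1]):
        if ids - seen:
            identifier_roots.add(u)
        seen |= ids

    return {
        "identifier_roots": identifier_roots,
        "network_roots": all_ids - has_incoming,  # no incoming edges
        "stubs": all_ids - has_outgoing,          # no outgoing edges
    }


# Made-up cascade with four nodes and four labelled edges.
example_nodes = [
    ("2", 2.0, frozenset({"#A", "#B"})),
    ("3", 3.0, frozenset({"#A"})),
    ("4", 4.0, frozenset({"#A"})),
    ("5", 5.0, frozenset({"#A", "#B", "#C"})),
]
example_edges = [
    ("2", "3", frozenset({"#A"})),
    ("3", "4", frozenset({"#A"})),
    ("4", "5", frozenset({"#A"})),
    ("2", "5", frozenset({"#B"})),
]
roles = classify_nodes(example_nodes, example_edges)
print(roles)
```

In this example node 2 is both an identifier root (it introduces #A and #B) and the only network root, node 5 is an identifier root (it introduces #C) and the only stub.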

Fig. 1. Depending on the applied matching functions, different transcendental information cascade representations can be generated for the same input data.

A fictive example of a transcendental cascade based on our model is shown in Figure 2. Consider a system that features hashtags as an established form of identifying content patterns. The visualisation uses the following approach to represent distinct identifiers and time: nodes are chronologically ordered along the horizontal dimension from left (the oldest node) to right (the most recent node); additionally, nodes are ordered along the vertical dimension depending on the set of identifiers present in a node (each unique set is assigned to a distinct level). Consequently, the visualisation represents the content creation sequence (“#A”) - (“#A#B”) - (“#A”) - (“#A”) - (“#A#B#C”) - (“#C”) - (“#A”) - (“#B#D”) - (“#A”).
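The layout rule just described can be sketched as follows: the horizontal position is the chronological rank of a node, and the vertical position is a level shared by all nodes with the same identifier set. Assigning levels in order of first appearance is an assumption made here, as is the function name `layout`; the actual vertical ordering used in the figure is not stated.

```python
from typing import Dict, FrozenSet, Iterable, List, Sequence, Tuple


def layout(identifier_sets: Sequence[Iterable[str]]
           ) -> Tuple[List[Tuple[int, int]], Dict[FrozenSet[str], int]]:
    """Return (x, level) per node plus the level assigned to each unique set."""
    levels: Dict[FrozenSet[str], int] = {}
    positions: List[Tuple[int, int]] = []
    for x, ids in enumerate(identifier_sets):  # input is already in time order
        key = frozenset(ids)
        if key not in levels:
            levels[key] = len(levels)          # new unique set gets a new level
        positions.append((x, levels[key]))
    return positions, levels


# The content creation sequence from the example, oldest node first.
sequence = [
    {"#A"}, {"#A", "#B"}, {"#A"}, {"#A"}, {"#A", "#B", "#C"},
    {"#C"}, {"#A"}, {"#B", "#D"}, {"#A"},
]
positions, levels = layout(sequence)
print(positions)
print(len(levels), "distinct levels")
```

For this sequence the sketch produces five distinct levels, one per unique identifier set.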

Fig. 2. Example of a cascade that emerges along five different identifiers. #A, #B, #A#B#C, #B#D and #C are fictive hashtags (or hashtag combinations respectively) treated as the identifying content patterns.

In order to understand how edges are labelled we highlight the sub-graph involving the nodes 2, 3, 4, and 5. Conforming to our cascade model an edge exists between nodes 2 and 3


Markus Luczak-Roesch, Ramine Tinati, and Nigel Shadbolt. 2015. When Resources Collide: Towards a Theory of Coincidence in Information Spaces. To appear in WWW’15 Companion, May 18–22, 2015, Florence, Italy. http://dx.doi.org/10.1145/2740908.2743973

Transcendental information cascades

[Figure: cascade over time t with identifier levels #A, #A#B, #A#B#C, #B#D and #C]

Cascade motifs as an indicator of state?


Markus Luczak-Roesch, Ramine Tinati, Max van Kleek, and Nigel Shadbolt. 2015. From coincidence to purposeful flow? Properties of transcendental information cascades. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, FR.

Analyzing low-level properties of the multiple states of a system that exist at the same time

Fig. 4. Overview of the results of the cascade comparison. Cascade size distribution and Wiener index are plotted on a log-log scale; identifier entropy is plotted with a log scale on the y-axis.

contain one or few identifiers equally distributed. Very large hashtag cascades in contrast become very fuzzy, meaning that even though loads of identifiers are covered (indicating much information) the informativeness of the entire cascade is very shallow. The other three entropy distribution profiles instead show that there is a more even distribution of information in non-trivial cascades with multiple identifiers, with the largest cascades still falling into the same category as the largest hashtag cascade.
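The excerpt does not spell out how identifier entropy is computed. As a hedged illustration only, the sketch below assumes the Shannon entropy (in bits) of how often each identifier occurs across the nodes of one cascade; the function name `identifier_entropy` and this exact definition are assumptions, not the authors' formula.

```python
import math
from collections import Counter
from typing import Iterable


def identifier_entropy(node_identifier_sets: Iterable[Iterable[str]]) -> float:
    """Shannon entropy (bits) of identifier occurrences in one cascade (assumed definition)."""
    counts = Counter(i for ids in node_identifier_sets for i in ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log2(p)
    return entropy


# A single-identifier cascade carries no uncertainty about which identifier occurs.
print(identifier_entropy([{"#A"}, {"#A"}, {"#A"}]))

# Two identifiers occurring equally often give one bit.
print(identifier_entropy([{"#A"}, {"#B"}]))
```

Under this assumed definition, entropy grows with the number of identifiers and with how evenly they are spread across a cascade.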

VI. DISCUSSION

In this section we reflect on the results of our study against the original questions asked, and then consider how our content-centric approach to cascade construction provides an alternative way to consider information flows on the Web.

A. Summary of Experiments

Our experiments show that it is possible to generate structurally different cascades from a single source dataset, depending on the pattern matching used. By exploring cascade sub-structures within each of the four resulting cascade datasets, we found that in comparison to cascades that use actual object identifiers (KID, APH, URIs), cascades which are based on hashtags tend to be either trivial (single identifier cascades) or consist of multiple roots (the origin of the cascade) that are merging and diverging so that they form one massive connected component.

For instance, in A1 cascades, there may be two hashtags, #A and #B, which originate in different, independent posts, by different users. However, over the course of the evolution of the cascade, these hashtags merge, most likely as a consequence of a user bringing them together in a single post. These hashtags then may become part of several merges and diverges, which can end up located within a single stub. As a consequence of this, information can be perceived as lost, as the hashtags do not remain present in a distinct cascade, but are subsumed by another one. This is reflected in Figure 4, where a large proportion of the node types are those that are either merging or diverging information.

In comparison to this, the results of cascade types A2 and A3 reveal cascades which are less structurally viral (a lower Wiener index), thus tending to form shorter chains of single or few identifier cascades. As a consequence, information is rarely lost or gained as cascades do not merge often. It is more likely that when a branch node is observed (for


[Figure: cascade motifs for Tags, URIs, and KID & APH: single node motifs, long uniform paths, short uniform paths, long non-uniform paths]

[Figure: identifier entropy for Tags, URIs, and KID & APH]

Fig. 4. Overview of the results of the cascade comparison. Cascade size distribution and wiener index are plotted on a log-log scale; identifier entropy isplotted with a log scale on the y-axis.

contain one or few identifiers equally distributed. Very largehashtag cascades in contrast become very fuzzy, meaning thateven though loads of identifiers are covered (indicating manyinformation) the informativeness of the entire cascade is veryshallow. The other three entropy distribution profiles insteadshow that there is a more even distribution of information innon-trivial cascades with multiple identifiers, with the largestcascades still falling into the same category as the largesthashtag cascade.

VI. DISCUSSION

In this section we reflect the results of our study againstthe original questions asked, and then consider how ourcontent-centric approach to cascade construction provides analternative way to consider information flows on the Web.

A. Summary of Experiments

Our experiments show that it is possible to generate struc-turally different cascades from a single source dataset, depend-ing on the pattern matching used. By exploring cascade sub-structures within each of the four resulting cascade datasets,we found that in comparison to cascades that use actual object

identifiers (KID, APH, URIs), cascades which are based onhashtags tend to be either trivial (single identifier cascades)or consist of multiple roots (the origin of the cascade) thatare merging and diverging so that they form one massiveconnected component.

For instance, in A1 cascades, there may be two hashtags,#A and #B, which originate in different, independent posts, bydifferent users. However, over the course of the evolution of thecascade, these hashtags merge, most likely as a consequence ofa user bringing them together in a single post. These hashtagsthen may become part of several merges and diverges, whichcan end up located within a single stub. As a consequence ofthis, information can be perceived as lost, as they do not remainpresent in a distinct cascade, but are subsumed by anotherone. This is reflected in Figure 4, where a large proportion ofthe node types are those that are either merging or diverginginformation.

In comparison to this, the results of cascade types A2 andA3 reveal cascades which are less structurally viral (a lowerwiener index), thus tending to form shorter chains of singleor few identifier cascades. As a consequence, informationis rarely lost or gained as cascades do not merge often. Itis more likely that when a branch node is observed (for

Fig. 4. Overview of the results of the cascade comparison. Cascade size distribution and wiener index are plotted on a log-log scale; identifier entropy isplotted with a log scale on the y-axis.

contain one or few identifiers equally distributed. Very largehashtag cascades in contrast become very fuzzy, meaning thateven though loads of identifiers are covered (indicating manyinformation) the informativeness of the entire cascade is veryshallow. The other three entropy distribution profiles insteadshow that there is a more even distribution of information innon-trivial cascades with multiple identifiers, with the largestcascades still falling into the same category as the largesthashtag cascade.

VI. DISCUSSION

In this section we reflect the results of our study againstthe original questions asked, and then consider how ourcontent-centric approach to cascade construction provides analternative way to consider information flows on the Web.

A. Summary of Experiments

Our experiments show that it is possible to generate struc-turally different cascades from a single source dataset, depend-ing on the pattern matching used. By exploring cascade sub-structures within each of the four resulting cascade datasets,we found that in comparison to cascades that use actual object

identifiers (KID, APH, URIs), cascades which are based onhashtags tend to be either trivial (single identifier cascades)or consist of multiple roots (the origin of the cascade) thatare merging and diverging so that they form one massiveconnected component.

For instance, in A1 cascades, there may be two hashtags,#A and #B, which originate in different, independent posts, bydifferent users. However, over the course of the evolution of thecascade, these hashtags merge, most likely as a consequence ofa user bringing them together in a single post. These hashtagsthen may become part of several merges and diverges, whichcan end up located within a single stub. As a consequence ofthis, information can be perceived as lost, as they do not remainpresent in a distinct cascade, but are subsumed by anotherone. This is reflected in Figure 4, where a large proportion ofthe node types are those that are either merging or diverginginformation.

In comparison to this, the results of cascade types A2 andA3 reveal cascades which are less structurally viral (a lowerwiener index), thus tending to form shorter chains of singleor few identifier cascades. As a consequence, informationis rarely lost or gained as cascades do not merge often. Itis more likely that when a branch node is observed (for

Fig. 4. Overview of the results of the cascade comparison. Cascade size distribution and wiener index are plotted on a log-log scale; identifier entropy isplotted with a log scale on the y-axis.

contain one or few identifiers equally distributed. Very largehashtag cascades in contrast become very fuzzy, meaning thateven though loads of identifiers are covered (indicating manyinformation) the informativeness of the entire cascade is veryshallow. The other three entropy distribution profiles insteadshow that there is a more even distribution of information innon-trivial cascades with multiple identifiers, with the largestcascades still falling into the same category as the largesthashtag cascade.

VI. DISCUSSION

In this section we reflect on the results of our study against the original questions asked, and then consider how our content-centric approach to cascade construction provides an alternative way to consider information flows on the Web.

A. Summary of Experiments

Our experiments show that it is possible to generate structurally different cascades from a single source dataset, depending on the pattern matching used. By exploring cascade sub-structures within each of the four resulting cascade datasets, we found that in comparison to cascades that use actual object identifiers (KID, APH, URIs), cascades which are based on hashtags tend to be either trivial (single identifier cascades) or consist of multiple roots (the origin of the cascade) that are merging and diverging so that they form one massive connected component.

For instance, in A1 cascades, there may be two hashtags, #A and #B, which originate in different, independent posts, by different users. However, over the course of the evolution of the cascade, these hashtags merge, most likely as a consequence of a user bringing them together in a single post. These hashtags then may become part of several merges and diverges, which can end up located within a single stub. As a consequence of this, information can be perceived as lost, as it does not remain present in a distinct cascade, but is subsumed by another one. This is reflected in Figure 4, where a large proportion of the node types are those that are either merging or diverging information.

In comparison to this, the results of cascade types A2 and A3 reveal cascades which are less structurally viral (a lower Wiener index), thus tending to form shorter chains of single or few identifier cascades. As a consequence, information is rarely lost or gained as cascades do not merge often. It is more likely that when a branch node is observed (for

varying profiles of increasing randomness with growing cascade size

From information co-occurrence to the discovery of hidden structure in Wikipedia

Metric                            Trigram MF
Nodes                             18,896
Links                             17,004
Matched identifiers               1,745
Identifier roots                  1,599
Stubs                             1,645
Nodes without any links           146
Avg identifier path length        11.53
Shortest path (links)             2
Longest path (links)              1,373
Average path duration (hours)     369
Longest path duration (hours)     2,133 (88 days)
Shortest path duration (hours)    0
Cascades                          1,379
Largest cascade (links)           8,068
Smallest cascade (links)          2
Average cascade size (links)      13.70

Table 1: Results of the experiments. The Trigram MF matches on a 3 noun-phrase sequence.

tigated this in more detail by assessing the identifier entropy. We find two cascade types: (a) a significant proportion of cascades with an identifier entropy of 0; (b) the entropy for all captured cascades is lower than 5. While (a) reflects the existence of a significant number of single identifier cascades again, (b) lets us conclude that multi-identifier cascades tend to be dominated by some identifiers, resulting in an unequal distribution of the identifiers in those cascades. Both observations support the findings from the analysis of the Wiener index in relation to cascade size.
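The identifier entropy discussed above can be illustrated with a minimal sketch. It assumes the measure is the Shannon entropy (base 2) of how often each identifier occurs within one cascade; the excerpt does not spell out the exact definition, and the function name and sample data are hypothetical.

```python
import math
from collections import Counter

def identifier_entropy(identifiers):
    """Shannon entropy (in bits) of the identifier occurrences in one cascade.

    Assumption: one list entry per matched identifier occurrence.
    """
    counts = Counter(identifiers)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy

# A single-identifier cascade has entropy 0 (case (a) in the text).
single = identifier_entropy(["U.S. Supreme Court"] * 5)
# A cascade dominated by one identifier has lower entropy than an even one.
skewed = identifier_entropy(["a"] * 8 + ["b", "c"])
even = identifier_entropy(["a", "b", "c", "d"])
```

Under this reading, a low but non-zero entropy in a multi-identifier cascade indicates that a few identifiers dominate it.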

Burstiness. We measured three kinds of burstiness: (1) the burstiness of all captured edits independent from the cascade they belong to; (2) the burstiness of all edits captured within specific fully-connected cascade networks; (3) the burstiness of all edits that match a particular identifier (identifier burstiness). As described in Section 2, burstiness refers to periods of high activity in a stream of activity, and offers a way to detect behavior that is correlated with a particular event. In the context of the Wikipedia editing stream, bursts of editing activity across a set of Wikipedia articles could be related to some external (or internal) social phenomenon such as a controversial topic, the injection of biased information, or some form of vandalism. The overall burstiness reveals only very few periods of significantly high activity. Naturally, the amount of activity increases as the TIC model will capture additional identifiers the longer the edit stream is observed. This results in an increasing likelihood to match observed edit events to older ones.

As a more fine-grained indicator of bursts of related information, we computed the cascade burstiness for each structurally connected cascade network derived from the overall edit stream individually. We observe that it is possible to differentiate between cascades that show a similar burstiness pattern as the overall burstiness and others that are significantly different and become visible only on this microscopic level. TICs allow activity streams to be mapped into a three-dimensional space. In Figure 1 we zoomed into a period of 1500 edits happening in about 40 minutes and highlight that within this dense global activity we can identify various local bursts ((1) and (2) mark the two most prominent local bursts). Generally, this mapping of Transcendental Information Cascades allows us to analyse (a) global bursts of high activity involving diverse information and (b) local bursts of significant occurrence of the same information.
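The distinction between overall and identifier burstiness can be sketched with a simple heuristic. Note that this is not Kleinberg's automaton model and not necessarily the measure used in the study: it only counts events per fixed time window and flags windows whose count exceeds the mean by k standard deviations. The window width, the threshold k and the sample data are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev

def window_counts(timestamps, window):
    """Count events per fixed-width time window (window index -> count)."""
    return Counter(int(t // window) for t in timestamps)

def bursty_windows(timestamps, window, k=2.0):
    """Return indices of windows whose count exceeds mean + k * std."""
    counts = window_counts(timestamps, window)
    if not counts:
        return []
    first, last = min(counts), max(counts)
    # Include empty windows so quiet periods lower the baseline.
    series = [counts.get(i, 0) for i in range(first, last + 1)]
    threshold = mean(series) + k * pstdev(series)
    return [first + i for i, c in enumerate(series) if c > threshold]

# Hypothetical edit stream: (timestamp in seconds, matched identifier).
edits = [(i * 60, "bg") for i in range(20)]
edits += [(95, "x"), (305, "x"), (905, "x")]
edits += [(600 + i, "x") for i in range(10)]  # ten edits within ten seconds

overall = bursty_windows([t for t, _ in edits], window=60)
identifier_x = bursty_windows([t for t, ident in edits if ident == "x"], window=60)
```

The same function serves the overall stream, a single connected cascade, or a single identifier; only the subset of timestamps passed in changes.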

Wikipedia Article Network (WAN) Comparison. We compared the link structure of the cascades with the explicit (embedded) links in Wikipedia articles. We constructed two networks, the Cascade Article Network (CAN) and the Wikipedia Article Network (WAN; see footnote 3). Table 2 provides an overview of the CAN and WAN. For comparative purposes, the metrics of the WAN have been computed on the subset of articles contained within the CAN. Figure 2a provides a visual representation of the CAN structure, with three labelled strongly connected components, (A), (B), and (C).

Figure 1: Wikipedia edits in a three-dimensional space. The dimensions are (1) time; (2) information diversity as the chronological order in which unique identifier sets are found; (3) information specificity as the index for each unique identifier set, which is incremented with each occurrence of the respective set over time.
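The three dimensions named in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions: events are (timestamp, identifier set) pairs, diversity is the 1-based order in which a set was first seen, and specificity is a per-set occurrence counter. The function name and sample events are hypothetical.

```python
def map_to_3d(events):
    """Map (timestamp, identifiers) events to (time, diversity, specificity)."""
    first_seen = {}   # identifier set -> order in which it first appeared
    occurrences = {}  # identifier set -> number of occurrences so far
    points = []
    for timestamp, identifiers in sorted(events, key=lambda e: e[0]):
        key = frozenset(identifiers)
        if key not in first_seen:
            first_seen[key] = len(first_seen) + 1
        occurrences[key] = occurrences.get(key, 0) + 1
        points.append((timestamp, first_seen[key], occurrences[key]))
    return points

events = [
    (0, {"U.S. Supreme Court"}),
    (5, {"U.S. District Court"}),
    (7, {"U.S. Supreme Court"}),
    (9, {"U.S. Supreme Court", "U.S. District Court"}),
    (12, {"U.S. Supreme Court"}),
]
points = map_to_3d(events)
```

In such a mapping, a run of points that climbs quickly along the third dimension within a short time span corresponds to a local burst of the same information.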

Metric                    CAN       WAN*
Total Nodes (Articles)    7,293     5,716,808
Total Edges (A-to-A)      23,560    5,705,827
Avg. Edges                3.1       142
Avg. Degree               6.46      343

Table 2: A comparison of the cascade links between articles with the Wikipedia article graph. WAN - Wikipedia Article Network. *The WAN graph metrics are based on the subset of matching Wikipedia articles, not the complete article base.

Due to the articles which reside outside the set of articles identified within the CAN, the WAN has a higher average degree and more edges per article. However, in comparison to the WAN's structure, which contained one large connected component of articles (within the given subset of articles), the CAN featured three strongly connected components. As labelled in Figure 2a, these components related to articles containing content about (A) South Korea, (B) the United States of America (Geographic articles), and (C) Political articles.

We compared the edges between articles formed by the cascades to the edges within the WAN, and found that only 4.4% of edges in the CAN could be identified within the WAN. Only 2 articles from the CAN had a 100% overlap with the WAN. Furthermore, we found that 94.7% of articles within the CAN had an overlap of less than 1%. These findings suggest that the article links formed within the CAN may be forming article structure which is not explicitly found within Wikipedia.
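The overlap comparison described above can be sketched with plain set operations. This assumes edges are directed (source article, target article) pairs and that the per-article overlap is the share of an article's outgoing CAN edges that also exist in the WAN; the excerpt does not state either detail, and the article names below are placeholders.

```python
def edge_overlap(can_edges, wan_edges):
    """Share of CAN edges that also appear as links in the WAN."""
    can, wan = set(can_edges), set(wan_edges)
    return len(can & wan) / len(can) if can else 0.0

def per_article_overlap(can_edges, wan_edges):
    """For each source article, the share of its CAN edges found in the WAN."""
    wan = set(wan_edges)
    totals, shared = {}, {}
    for source, target in set(can_edges):
        totals[source] = totals.get(source, 0) + 1
        if (source, target) in wan:
            shared[source] = shared.get(source, 0) + 1
    return {article: shared.get(article, 0) / n for article, n in totals.items()}

# Placeholder article names, not taken from the study.
can = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
wan = [("A", "B"), ("C", "D"), ("D", "A")]

overall = edge_overlap(can, wan)
by_article = per_article_overlap(can, wan)
```

A low overall share together with many articles near zero is the pattern the text interprets as structure not explicitly present in Wikipedia's hyperlinks.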

3 A node represents a Wikipedia article, and an edge represents either a matched identifier between two edits (for CAN) or an explicit link within the Wikipedia graph (for WAN).

Tinati, Ramine, Luczak-Roesch, Markus, Hall, Wendy and Shadbolt, Nigel (2016) More than an edit: using transcendental information cascades to capture hidden structure in Wikipedia. At 25th International World Wide Web Conference, Montreal, Canada, 11 - 15 Apr 2016. ACM (doi:10.1145/2872518.2889401).

Tinati, R., Luczak-Rösch, M., & Hall, W. Finding Structure in Wikipedia Edit Activity: An Information Cascade Approach. In WikiWorkshop 2016, co-located with WWW 2016.

Events detected:
•  Edward Snowden speech at SXSW conference
•  US Supreme Court case on same-sex marriage

Matching identifier                Associated Root Article                     Edges
U.S. Supreme Court                 Hillman v. Maretta                          17,893
NATO Joint Jet                     Fighter Pilot                               13,868
U.S. District Court                BJU Press                                   5,584
Mehr News Agency                   To the Youth in Europe and North America    2,078
U.S. Religious Landscape Survey    Utah                                        1,500

Table 3: 5 highest connected cascades. Each cascade is formed by a particular identifier, and can be associated with a Wikipedia article where the identifier was first used (the root).

(a) Cascade Article Network (CAN): Nodes represent unique Wikipedia articles, edges are shared edits based on a shared identifier matched. A force directed layout has been applied, with edge path lengths determined by edge weight. The strongly connected component (A) contains articles associated with South Korean media, (B) and (C) contain articles related to the USA.

(b) Cascade-to-Cascade path network graph: Nodes are cascades, edges are the shared articles between cascades. The central strongly connected component is established by the identifiers shown in Table 3. A force directed layout has been applied, with edge path lengths determined by edge weight.

Figure 2: Article networks

Cascade Category Co-Occurrence. In order to examine the relatedness between Wikipedia content, we used DBpedia to obtain the category classification labels (dct:subject) associated with a given Wikipedia article. These labels, which are machine and human generated, provide a general classification for the subject (or topic), based on the article's content. We then calculate the co-occurrence of categories between nodes (articles) within a cascade path [14]. Using the co-occurrence measure of a cascade provides us with a way to measure the potential similarity between the subject and content of the articles within a given cascade. Using DBpedia, our queries found that 78.2% of the total articles within the WAN were identified with at least one category. On average, an article was associated with 2 categories. Of the 1,745 unique cascade pathways, 521 were found to contain at least one node (article) mapped to a set of categories, and 360 cascade pathways were identified to have two or more articles with categories associated with them. For the analysis, we removed duplicate nodes within a cascade, which were identified as nodes related to the same Wikipedia article, as their categories would be the same, thus skewing the results.

Based on the remaining cascades which had duplicate nodes removed, and two or more nodes with categories associated with them (20% of total cascades), we calculate the co-occurrence of categories between articles within a given cascade. As shown in Table 4, there was an average co-occurrence of 63.6% between article categories within a given cascade pathway. We also extracted the top 10 categories based on co-occurrence frequency. The findings suggest that the articles within a given cascade tend to relate to the same subject or share similar content. We also found that the most frequent co-occurring topics reflect the strongly connected components found in the CAN network, shown in Figure 2a.
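One possible reading of the category co-occurrence measure is sketched below. The excerpt does not give the formula, so this is an assumption: after removing duplicate articles, the score is the share of article pairs within a cascade that have at least one category in common. The function name, the category labels and the article-to-category mapping are made up for illustration.

```python
from itertools import combinations

def category_cooccurrence(cascade_articles, categories):
    """Share of distinct article pairs in a cascade sharing at least one category.

    Returns None when fewer than two distinct articles have categories.
    """
    # Remove duplicate articles and those without any category.
    articles = sorted({a for a in cascade_articles if categories.get(a)})
    pairs = list(combinations(articles, 2))
    if not pairs:
        return None
    sharing = sum(1 for a, b in pairs if categories[a] & categories[b])
    return sharing / len(pairs)

# Made-up category labels, standing in for dct:subject values from DBpedia.
categories = {
    "Utah": {"States of the United States"},
    "Idaho": {"States of the United States", "Pacific Northwest"},
    "Seoul": {"Capitals in Asia"},
}
cascade = ["Utah", "Idaho", "Utah", "Seoul", "Uncategorised article"]
score = category_cooccurrence(cascade, categories)
```

Here only the Utah and Idaho pair shares a category, so one of three pairs co-occurs. Averaging such scores over all qualifying cascades would give a figure comparable in spirit to the 63.6% reported, though the study's exact computation may differ.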

Metric                                 CTC Network
Total Nodes (Article)                  18,896
Matched Article                        14,776
Unique Categories                      1,605
Avg. Category per article              2
Avg. Duplicate Article per Cascade     43.7%
Avg. Cascade Category co-occurrence    63.6%

Table 4: Overview of the cascade mapping to DBpedia categories. Avg. Cascade Category Overlap is calculated on cascades with two or more nodes that are associated with different Wikipedia articles.

5. DISCUSSION

RQ1: Structural Properties. The structure of Wikipedia can be considered as an explicit and static network of hyperlinks connecting articles with articles, and with external resources (e.g., hyperlinks to URLs not prefixed by wikipedia.org). We examined whether an underlying structure between Wikipedia articles occurred, and whether this complements, or mimics the explicit linking structure. Our analysis of the Wiener index and identifier entropy of the resulting cascades highlights an over-representation of cascades that are long uniform paths with only one matched identifier. Such single identifier cascades can still be suited to find implicit links between articles and detect bursts around trending topics. But it means that only a small proportion of cascades is suited to find implicit relationships between matched identifier phrases.
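For readers unfamiliar with the Wiener index, a minimal sketch follows. It uses the standard graph-theoretic definition (the sum of shortest-path lengths over all unordered node pairs) and assumes the cascade is treated as an undirected, connected graph. Whether the study normalises the value is not stated in this excerpt, and the example graphs are hypothetical.

```python
from collections import deque

def wiener_index(edges):
    """Sum of shortest-path lengths over all unordered node pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    total = 0
    for source in adjacency:
        # Breadth-first search gives hop distances from this source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
    return total // 2  # each pair was counted once from each end

path = wiener_index([(1, 2), (2, 3), (3, 4)])  # a uniform chain of four nodes
star = wiener_index([(0, 1), (0, 2), (0, 3)])  # one node branching to three
```

For the same number of nodes, the chain yields a higher value than the star, which is why long uniform paths stand out in this measure.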

We conducted the analysis of patterns of burstiness in order to examine the time dimension on the macro and the micro level of the captured edits. The TIC model is based on the principle of capturing elements from a stream that contain a particular informational pattern and bringing subsets of these elements together as branching and merging cascades when a pattern matches multiple pieces of information in some of the elements, so that sequences are linked together. As such it is a generalisation of Kleinberg's approach presented in [16], which is based on flat sequences of elements from a stream in which only one particular matched piece of information occurs. While the overall burstiness does not show significant bursts from the macroscopic

Discrete vs. continuous data

Image source: https://en.wikipedia.org/wiki/Electroencephalography#/media/File:Spike-waves.png, CC BY-SA 2.0

EEG brain wave recordings


Linking based on similarity of spectral density (Euclidean distance)
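How continuous recordings might be linked by spectral similarity can be sketched as below. The slide names only the criterion, so everything else is an assumption: the signal is cut into equal-length windows, each window's power spectrum is estimated with a naive DFT periodogram, and a window is linked to the most recent earlier window whose spectrum lies within a chosen Euclidean distance. The threshold and the synthetic signals are hypothetical.

```python
import cmath
import math

def power_spectrum(signal):
    """Naive DFT periodogram: power at each non-negative frequency bin."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum

def spectral_distance(a, b):
    """Euclidean distance between the power spectra of two equal-length windows."""
    return math.dist(power_spectrum(a), power_spectrum(b))

def link_similar_windows(windows, threshold):
    """Link each window to the most recent earlier window within the threshold."""
    links = []
    for j in range(1, len(windows)):
        for i in range(j - 1, -1, -1):
            if spectral_distance(windows[i], windows[j]) <= threshold:
                links.append((i, j))
                break
    return links

n = 32
slow = [math.sin(2 * math.pi * 2 * i / n) for i in range(n)]
fast = [math.sin(2 * math.pi * 8 * i / n) for i in range(n)]
shifted = [math.sin(2 * math.pi * 2 * i / n + 1.0) for i in range(n)]

links = link_similar_windows([slow, fast, shifted], threshold=1.0)
```

Because the power spectrum discards phase, the phase-shifted slow wave is linked back to the first window while the fast wave stays unlinked. For real recordings a library FFT would be the practical choice over this quadratic-time DFT.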

[Diagram: matching functions F1 … Fn applied over time t, yielding cascades C11, C21, C22, C23]

Formalising the multiple possible representations of a system at any time and their relationships. Not all representing purposeful action but reflecting useful informational properties.

•  Applying Transcendental Information Cascades to
   – data from the complex engineering industries (e.g. shipping)
   – urban traffic data
   – disaster response data

Reducing risk and enhancing security by understanding coincidence in information spaces (RECOIN)

PI: Markus Luczak-Roesch

Transcendental Information Cascades
Generic time-ordered networks of information co-occurrence

[Diagram: matching functions F1 … Fn applied over time t, yielding cascades C11, C21, C22, C23; edges are labelled with the time difference between the linked items, e.g. t1-t0, t2-t1, t4-t2, t8-t6]
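A time-ordered network of information co-occurrence can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: resources are (identifier, time, content) triples as in the 4-tuple definition, a matching function extracts identifiers from the content, and each resource is linked to the latest earlier resource that contained the same identifier, with the time difference as edge label. The hashtag matcher and the sample posts are hypothetical.

```python
import re

def build_cascade(resources, matchers):
    """Link each resource to the latest earlier resource sharing an identifier."""
    nodes, edges = {}, []
    last_seen = {}  # identifier -> uid of the latest node that contained it
    for uid, timestamp, content in sorted(resources, key=lambda r: r[1]):
        identifiers = set()
        for match in matchers:
            identifiers.update(match(content))
        if not identifiers:
            continue  # resources without any match do not become nodes
        nodes[uid] = (timestamp, identifiers)
        for identifier in sorted(identifiers):
            if identifier in last_seen:
                source = last_seen[identifier]
                # Edge label: time elapsed between the two co-occurrences.
                edges.append((source, uid, identifier, timestamp - nodes[source][0]))
            last_seen[identifier] = uid
    return nodes, edges

def hashtags(content):
    """Hypothetical matching function: extract hashtags from text."""
    return re.findall(r"#\w+", content)

resources = [
    ("r1", 0, "first post about #A"),
    ("r2", 1, "independent post about #B"),
    ("r3", 4, "bringing #A and #B together"),
    ("r4", 6, "no identifier here"),
    ("r5", 9, "#B continues alone"),
]
nodes, edges = build_cascade(resources, [hashtags])
```

In this toy stream, r3 merges the two previously independent roots r1 and r2, which mirrors the hashtag merge described for A1 cascades. Swapping in a different matching function changes the resulting network without touching the construction step.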