The Complex Dynamics of Collaborative Tagging

Harry Halpin, University of Edinburgh
Valentin Robu, CWI, Netherlands
Hana Shepherd, Princeton University

WWW 2007



Introduction

A continuing central concern: how should metadata for web resources be generated?
– The concern is with both efficiency and efficacy

Social bookmarking
– An increasingly influential web application
– del.icio.us, Flickr, Furl, Rojo, Connotea, Technorati, etc.

Folksonomies vs. ontologies
– Categorization (tagging) by unsupervised users vs. classification by formal ontologies defined by experts
– Multiple categories vs. exactly one class

Benefits and Drawbacks of Collaborative Tagging

Benefits
– Higher malleability and adaptability (“users do not have to agree on a hierarchy of tags or detailed taxonomy”)
– Enables retrieving and sharing data more efficiently

Drawbacks
– Ambiguity in the meaning of tags
– The use of synonyms creates informational redundancy
– The central concern: does the system become relatively stable with time and use?

The most problematic claim for tagging systems: because users are not under a centralized controlling vocabulary, no coherent categorization scheme can emerge at all from collaborative tagging.

The Dynamics of Tagging

Tag distribution
– The collection of all tags and their frequencies, ordered by rank frequency, for a given resource

Features of complex systems
– A large number of users
– A lack of central coordination
– Non-linear dynamics

Two important features of collaborative tagging systems
– Imitation of others
– Shared knowledge
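As a minimal sketch of the tag distribution defined above, the following Python snippet counts the tags applied to one resource and orders them by rank. The function name `tag_distribution` and the sample tags are illustrative assumptions, not data from the study.

```python
from collections import Counter


def tag_distribution(tags):
    """Return (tag, relative frequency) pairs ordered by rank (most frequent first)."""
    counts = Counter(tags)
    total = sum(counts.values())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(tag, count / total) for tag, count in ranked]


# Hypothetical tags assigned to a single resource by different users.
sample_tags = ["web", "design", "web", "css", "web", "design"]
distribution = tag_distribution(sample_tags)
```

Relative rather than absolute frequencies are used here so that distributions for resources with different numbers of users can be compared directly.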

The Tripartite Structure of Tagging

Figure: tripartite graph structure of a tagging system. An edge linking a user, a tag and a resource (website) represents one tagging instance

Tags provide the link between the users and the resources (search tagging [feedback])
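One way to picture the tripartite structure is to store each tagging instance as a (user, tag, resource) triple and index the triples by tag. The sketch below is only an illustration: the user names, URLs and the helper `index_by_tag` are hypothetical.

```python
from collections import defaultdict

# Each tagging instance is one (user, tag, resource) triple; the values are made up.
tagging_instances = [
    ("alice", "python", "http://example.org/a"),
    ("bob", "python", "http://example.org/a"),
    ("bob", "tutorial", "http://example.org/a"),
    ("alice", "python", "http://example.org/b"),
]


def index_by_tag(instances):
    """Map each tag to the sets of users and resources it connects."""
    users_by_tag = defaultdict(set)
    resources_by_tag = defaultdict(set)
    for user, tag, resource in instances:
        users_by_tag[tag].add(user)
        resources_by_tag[tag].add(resource)
    return users_by_tag, resources_by_tag


users_by_tag, resources_by_tag = index_by_tag(tagging_instances)
```

Indexing by tag reflects the role the slide gives to tags as the link between users and resources; indexing by user or by resource would follow the same pattern.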

A Generative Model

Preferential attachment
– Known popularly as the “rich get richer” model
– $P(a)$ = the probability of a user committing a tagging action
– $P(o)$ = the probability that an “old tag” is reinforced
– If an old tag $x$ is added, it happens with the probability $\frac{R(x)}{\sum_i R(i)}$

Preferential attachment does not explain why a particular new tag is added.
– In practice, a new tag may be added that uncovers an informational dimension not captured by older tags.
– Information value: the information conveyed by the tag

Linear combination:

$$P(x) = \lambda\, P(I(x)) + (1 - \lambda)\, P(a)\, P(o)\, \frac{R(x)}{\sum_i R(i)}$$
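The preferential attachment part of the model can be sketched as a small simulation. This is a simplified illustration under stated assumptions, not the authors' implementation. It reads R(x) as the number of times tag x has been used so far. Every step is treated as a tagging action, a fixed `p_old` stands in for P(o), and the information-value term is dropped.

```python
import random
from collections import Counter


def simulate_preferential_attachment(steps, p_old, seed=0):
    """Simulate tag choices where old tags are reused in proportion to past use.

    Assumptions (not from the slides): every step is a tagging action, a fixed
    probability p_old plays the role of P(o), and the information-value term
    is ignored.
    """
    rng = random.Random(seed)
    counts = Counter()
    history = []  # one entry per past tag use, so uniform sampling is proportional to R(x)
    for _ in range(steps):
        if history and rng.random() < p_old:
            tag = rng.choice(history)  # chosen with probability R(x) / sum_i R(i)
        else:
            tag = f"tag{len(counts)}"  # introduce a brand-new tag
        counts[tag] += 1
        history.append(tag)
    return counts


counts = simulate_preferential_attachment(steps=5000, p_old=0.9)
ranked = counts.most_common()  # (tag, count) pairs, most frequent first
```

Sampling uniformly from the usage history is a simple way to pick a tag with probability proportional to its past use. Inspecting `ranked` typically shows a few heavily reused tags and a long tail of rarely used ones.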

An Example of Preferential Attachment

Figure: an example of how shuffling leads to preferential attachment. This process produces a power law distribution.

Abstract Example of Information Value

$I(t_1) = 1$, $I(t_3) = 0$, $I(t_2) > I(t_4)$, $I(t_2, t_4) = 1$, $I(t_1, t_5) = 0$ (not additive)

Following Zipf’s famous “Principle of Least Effort”, users presumably minimize the number of tags used.

Empirical Study

Data set
– 500 sites from the “Popular” section of del.icio.us (mean 2074.8 users, standard deviation of 92.9)
– 500 from the “Recent” section (mean 286.1 users, standard deviation of 18.2)

Power law distribution

$$y = c\,x^{\alpha}$$

$$\log y = \alpha \log x + \log c$$
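Because the power law becomes a straight line in log-log space, the exponent α can be estimated with an ordinary least-squares fit on the logarithms. The sketch below assumes this simple regression approach and uses synthetic frequencies; `fit_power_law` is an illustrative name, and the numbers are not from the study.

```python
import math


def fit_power_law(ranks, frequencies):
    """Least-squares fit of log y = alpha * log x + log c; returns (alpha, c)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope of the regression line in log-log space is the exponent alpha.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    alpha = sxy / sxx
    c = math.exp(mean_y - alpha * mean_x)
    return alpha, c


# Synthetic frequencies for ranks 1..25 generated from y = 0.3 * x^(-1.2) (made-up values).
ranks = list(range(1, 26))
frequencies = [0.3 * r ** -1.2 for r in ranks]
alpha, c = fit_power_law(ranks, frequencies)
```

On exact power-law input the fit recovers the generating exponent and constant up to floating-point error; on real tag counts the residuals indicate how well the power law describes the data.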

Power Law Regression for Popular Sites

Figure: frequency of tag usage, based on relative position (the 25 most frequently used tags)

Average α = -1.22, standard deviation ±0.03

Empirical Results for Popular Sites

Figure: cumulative frequency of tag use, based on relative position

Positions seven to ten show a considerably sharper drop

Regression Results for Less Popular Sites

Average α = -3.9, standard deviation ±4.63

The Dynamics of Tag Distributions

Study how the shape of these distributions forms in time from the tagging actions of individual users

Kullback-Leibler Divergence (relative entropy)

$$D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

Two complementary ways to detect whether or not a distribution has converged to a steady state
– Take the relative entropy between every two consecutive points in time of the distribution
– Take the relative entropy of the tag distribution for each time point with respect to the final tag distribution
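Both convergence checks can be sketched with one small function. The example below is a hedged illustration: the snapshot values are invented, and treating the earlier distribution as P and the later one as Q is an assumption. It also assumes Q(x) is positive wherever P(x) is positive; real data may need smoothing to guarantee that.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for two distributions given as dicts mapping tag -> probability.

    Terms with P(x) = 0 contribute nothing; Q(x) must be positive wherever P(x) > 0.
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


# Hypothetical tag distributions of one resource at three points in time.
snapshots = [
    {"web": 0.5, "design": 0.5, "css": 0.0},
    {"web": 0.6, "design": 0.3, "css": 0.1},
    {"web": 0.6, "design": 0.3, "css": 0.1},
]

# Method 1: divergence between consecutive time points.
consecutive = [kl_divergence(a, b) for a, b in zip(snapshots, snapshots[1:])]
# Method 2: divergence of each time point from the final distribution.
to_final = [kl_divergence(s, snapshots[-1]) for s in snapshots]
```

A distribution that has settled shows values near zero under both methods: the consecutive divergences stop changing, and the divergence from the final distribution decays toward zero.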

Empirical Results for Tag Dynamics (1/2)

Figure: relative entropy between tag frequency distributions at consecutive time-steps

Empirical Results for Tag Dynamics (2/2)

Figure: the relative entropy of the tag distribution for each time point with respect to the final distribution

Constructing Inter-Tag Correlation Graphs

The information value of tags is a central aspect governing the evolution of tag distributions.

Distance between two tags

$N(T_i)$ = the number of pages tagged by $T_i$

$$Dist(T_i, T_j) = \frac{N(T_i, T_j)}{\sqrt{N(T_i)\, N(T_j)}}$$
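The inter-tag measure can be computed from a mapping of pages to tag sets. The sketch below rests on two assumptions: a cosine-style normalization with a square root in the denominator, and N(Ti, Tj) read as the number of pages carrying both tags. The page identifiers and tags are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def tag_similarity(pages):
    """Compute N(Ti, Tj) / sqrt(N(Ti) * N(Tj)) for every co-occurring tag pair.

    `pages` maps a page identifier to the set of tags applied to it.
    """
    n_single = defaultdict(int)  # N(Ti): number of pages tagged with Ti
    n_pair = defaultdict(int)    # N(Ti, Tj): number of pages tagged with both
    for tags in pages.values():
        for tag in tags:
            n_single[tag] += 1
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(tags), 2):
            n_pair[(a, b)] += 1
    return {
        (a, b): n / math.sqrt(n_single[a] * n_single[b])
        for (a, b), n in n_pair.items()
    }


# Hypothetical pages and their tag sets.
pages = {
    "p1": {"complexity", "chaos"},
    "p2": {"complexity", "chaos", "networks"},
    "p3": {"complexity", "networks"},
    "p4": {"networks"},
}
similarity = tag_similarity(pages)
```

The resulting pair weights can serve as edge weights of a tag correlation graph, for example by keeping only pairs above a chosen threshold.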

Tag Correlation Network

Figure: visualization of a tag correlation network, considering only the correlations corresponding to one central node “complexity”

Tag Correlation Network

Figure: visualization of a tag correlation network, considering all relevant correlations (“small world” structure Zipf’s law)

Conclusion and Future Work

This work has explored a number of issues highly relevant to the question of whether a coherent way of organizing metadata can emerge from distributive tagging systems.

It is shown that tagging distributions tend to stabilize into power law distributions.

Using an example domain, we explored one of the most empirically challenging aspects of the generative model: the information value of a tag as a function of the number of pages.

Future work will elaborate on the results presented here regarding categorization schemes based on tag co-occurrence and information value, and will examine whether these results hold among many different tagging applications.