mining social media communities and content akshay java ph.d. dissertation defense

Mining Social Media Communities and Content

Akshay Java

Ph.D. Dissertation Defense

October 16th 2008

“It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, content.”

Thesis Statement

Key Observations

1. Understanding communication in social media requires identifying and modeling communities

2. Communities are a result of collective, social interactions and usage.

1. Developed and evaluated innovative approaches for community detection

– A new algorithm for finding communities in social datasets

– SimCUT, a novel algorithm for combining structural and semantic information

2. First to comprehensively analyze two important, new social media forms

– Feed Readership

– Microblogging Usage and Communities

3. Built systems, infrastructure and datasets for the social media research community

Contributions

Outline

• Introduction

• Detecting Communities in Social Media

• Combining Semantic Information

• Case Studies

– Feed Usage and Distillation

– Microblogging Communities

• Future Work

• Conclusions

Social Media

Describes the online technologies and practices that people use to share opinions, insights, experiences, and perspectives and engage with each other.

UGC + Social Network

~Wikipedia

What you…

Think blogs

Say Podcasts

See Flickr, YouTube

Hear Pandora, Last.fm

Do Twitter, Jaiku, Pownce

It’s about YOU!

Who are our...

Friends Facebook

Colleagues LinkedIn

Virtual Avatars secondlife

Also about US

What we share

Knowledge Wikipedia

Links del.icio.us, StumbleUpon

Love/Hate yelp, Upcoming

Location FireEagle, BrightKite

Spaces Ustream, Qik

How We Share

Social interactions build communities

• Shared Interests

• Common Beliefs

• Events

• Organization/Location

Communities


A community in the real world is represented in a graph as a set of nodes that have more links within the set than outside it.

Graph

• Citation Network

• Affiliation Network

• Sentiment Information

• Shared Resource (tags, videos, ...)

Political Blogs

Twitter Network

Facebook Network

What is a Community

Existing Approaches

Clustering Approach

1. Agglomerative/Hierarchical

Incrementally group similar nodes to form clusters

Communities in Football League

(Hierarchical Clustering)

Football Teams

Existing Approaches

Clustering Approach

1. Agglomerative/Hierarchical

Topological Overlap: Similarity is measured in terms of the number of nodes that both i and j link to. (Ravasz et al.)

Existing Approaches

Clustering Approach

1. Agglomerative/Hierarchical

2. Divisive/Partition based (Girvan, Newman)

Normalized Cut (NCut) (Shi, Malik)

Political Books

$$\mathrm{NCut}(A,B) = \mathrm{Cut}(A,B)\left(\frac{1}{\mathrm{vol}(A)} + \frac{1}{\mathrm{vol}(B)}\right)$$

Existing Approaches

• The graph is partitioned using the eigenspectrum of the Laplacian. (Shi and Malik)

• The eigenvector of the graph Laplacian corresponding to its second smallest eigenvalue is the Fiedler vector.

• The graph can be recursively partitioned using the sign of the values in its Fiedler vector.

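Graph Laplacian

$$L = D - W, \qquad \text{normalized: } I - D^{-1/2}\, W\, D^{-1/2}$$

Normalized Cuts

$$\mathrm{NCut}(A,B) = \mathrm{Cut}(A,B)\left(\frac{1}{\mathrm{vol}(A)} + \frac{1}{\mathrm{vol}(B)}\right)$$

Cut(A,B): Cost of edges deleted to disconnect the graph

vol(B): Total cost of all edges that start from B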
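The recursive partitioning step described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: it approximates the Fiedler vector of the normalized Laplacian by power iteration in plain Python, and the function name `fiedler_bisect`, the iteration count, and the toy graph are assumptions made for illustration.

```python
import math

def fiedler_bisect(W, iters=500):
    """Split a graph in two by the sign of an approximate Fiedler vector.

    W is a symmetric adjacency matrix (list of lists) with no isolated nodes.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    s = [1.0 / math.sqrt(d) for d in deg]
    # M = 2I - (I - D^{-1/2} W D^{-1/2}) shares eigenvectors with the normalized
    # Laplacian; its second-largest eigenvalue matches the second-smallest there.
    M = [[(1.0 if i == j else 0.0) + s[i] * W[i][j] * s[j] for j in range(n)]
         for i in range(n)]
    # Trivial eigenvector (Laplacian eigenvalue 0) is D^{1/2} 1, normalized.
    norm = math.sqrt(sum(deg))
    top = [math.sqrt(d) / norm for d in deg]

    v = [float(i + 1) for i in range(n)]  # arbitrary deterministic start vector
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]  # project out trivial vector
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]

    positive = {i for i in range(n) if v[i] >= 0}
    negative = set(range(n)) - positive
    return positive, negative

# Toy graph (assumed): two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
part_a, part_b = fiedler_bisect(W)
```

On this toy graph the sign split separates the two triangles. A production system would more likely use a sparse eigensolver, since the dense matrix here costs O(n^2) memory.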

Existing Approaches

• Modularity Score (Newman et al.)

– Measure of quality of clustering

$$Q = \sum_i \left(e_{ii} - a_i^2\right), \qquad a_i = \sum_j e_{ij}$$

e_ii: fraction of intra-community edges

a_i^2: expected value of e_ii disregarding communities

– Q = 0: Communities are random

– Q > 0: Higher values are better

• Optimizing modularity is NP-Hard*

– Spectral Methods

– Heuristics

* (Brandes et al.)
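As a worked illustration of the modularity score, the sketch below computes Q = sum_i (e_ii - a_i^2) for an undirected, unweighted edge list. The function name `modularity` and the toy graph are assumptions for illustration; each cross-community edge contributes half of its weight to a_i on either side, which matches a_i = sum_j e_ij for a symmetric e.

```python
def modularity(edges, community):
    """Compute Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> community id.
    """
    m = len(edges)
    e_in = {}  # e_ii: fraction of edges that fall inside community i
    a = {}     # a_i: fraction of edge ends attached to community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            e_in[cu] = e_in.get(cu, 0.0) + 1.0 / m
        a[cu] = a.get(cu, 0.0) + 0.5 / m
        a[cv] = a.get(cv, 0.0) + 0.5 / m
    return sum(e_in.get(c, 0.0) - a[c] ** 2 for c in a)

# Toy graph (assumed): two triangles joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
single = {node: "A" for node in range(6)}

q_split = modularity(edges, split)    # 6/7 - 1/2, roughly 0.357
q_single = modularity(edges, single)  # 0: one community explains nothing
```

Placing every node in one community gives Q = 0, while the natural two-triangle split scores higher, which is the behavior the slide describes.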

Existing methods

1. Do not scale well for Web graphs

2. Fail to exploit the underlying graph’s distributions

3. Are unable to use available meta-data and semantic features

Limitations



• The Long Tail

– 80/20 Rule or Pareto distribution

– Few blogs get most attention/links

– Most are sparsely connected

• Motivation

– Web graphs are large, but sparse

– Expensive to compute community structure over the entire graph

• Goal

– Approximate the membership of the nodes using only a small portion of the entire graph.

Special Properties of Social Datasets

Special Properties of Social Datasets

• Intuition

– Communities are defined by the core (A), and the membership of the rest of the network (B) can be approximated by how they link to the core.

• Direct Method

– NCut (Baseline)

• Approximation

– Singular Value Decomposition (SVD)

– Sampling

– Heuristic

• SVD (low rank)

• Sampling based Approach

– Communities can be extracted by sampling only columns from the head (Drineas et al.)

• Heuristic

– Cluster the head to find initial communities, then assign each tail node to the cluster that it most frequently links to.
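The heuristic can be sketched as a simple voting step. This is a minimal illustration rather than the thesis code: it assumes the head nodes have already been clustered (for example with NCut), and the names `assign_tail`, `links`, and `head_labels` are hypothetical.

```python
from collections import Counter

def assign_tail(links, head_labels):
    """Assign each tail node to the head community it links to most often.

    links: dict mapping node -> list of nodes it links to.
    head_labels: dict mapping head node -> community id (already clustered).
    """
    labels = dict(head_labels)
    for node, targets in links.items():
        if node in head_labels:
            continue  # head nodes keep the label from the initial clustering
        votes = Counter(head_labels[t] for t in targets if t in head_labels)
        if votes:
            labels[node] = votes.most_common(1)[0][0]
        # Tail nodes with no links into the head stay unassigned.
    return labels

# Toy data (assumed): three head blogs in two communities, three tail blogs.
head_labels = {"h1": "politics", "h2": "politics", "h3": "tech"}
links = {
    "t1": ["h1", "h2", "h3"],  # two votes for politics, one for tech
    "t2": ["h3"],
    "t3": [],
}
labels = assign_tail(links, head_labels)
```

Only the links from tail nodes into the head are inspected, which is why the cost depends on the number of sampled head columns rather than on the full graph.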

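Approximating Communities

Nodes ordered by degree

SVD (low rank):

$$A \approx U_r \Sigma_r V_r^T$$

Sampling:

$$\begin{pmatrix} A & B \\ B^T & C \end{pmatrix} \approx \begin{pmatrix} A & B \\ B^T & B^T A^{-1} B \end{pmatrix} = \begin{pmatrix} U \\ B^T U \Lambda^{-1} \end{pmatrix} \Lambda \begin{pmatrix} U^T & \Lambda^{-1} U^T B \end{pmatrix}, \qquad A = U \Lambda U^T$$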

ICWSM ‘08

Approximating Communities

1. Dataset: A blog dataset of 6000 blogs.

ICWSM ‘08

Original Adjacency Heuristic Approximation

Modularity = 0.51

Approximating Communities

Low Modularity, More Time

Similar Modularity, Lower Time

• Advantage: Faster detection using a small portion of the graph, with less memory.

• SVD O(n^3), NCut O(nk), Sampling O(r^3), Heuristic O(rk), where n = number of blogs, k = number of clusters, r = number of columns

ICWSM ‘08

Approximating Communities

ICWSM ‘08

1. Blog Dataset

2. Social network datasets

Additional evaluations using Variation of Information score


Tags are free meta-data!

Other semantic features:

• Sentiments

• Named Entities

• Readership information

• Geolocation information

• etc.

How to combine this for detecting communities?

Social Media Graphs

Links Between Nodes Links Between Nodes and Tags

Simultaneous Cuts

A community in the real world is identified in a graph as a set of nodes that have more links within the set than outside it and share similar tags.

Communities in Social Media

SimCUT: Simultaneously Clustering Tags and Graphs

[Figure: combined block matrix with Nodes-Nodes, Nodes-Tags, Tags-Nodes and Tags-Tags blocks; Fiedler Vector Polarity]

$$W' = \begin{pmatrix} \beta W & C \\ C^T & I \end{pmatrix}$$

β = 0: Entirely ignore link information

β = 1: Equal importance to blog-blog and blog-tag

β >> 1: NCut

WebKDD ‘08
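To make the role of β concrete, the sketch below assembles a combined node-and-tag matrix. It assumes the block layout W' = [[βW, C], [C^T, I]], which is our reading of the slide and may not match the thesis exactly; the function name `simcut_matrix` and the toy inputs are hypothetical. The resulting matrix would then be partitioned spectrally, as with NCut.

```python
def simcut_matrix(W, C, beta):
    """Assemble W' = [[beta * W, C], [C^T, I]] as a list of lists.

    W: n x n node-node adjacency matrix.
    C: n x t node-tag matrix.
    beta: weight given to link information relative to tag information.
    The block layout is an assumption reconstructed from the slide.
    """
    n, t = len(W), len(C[0])
    # Top rows: scaled node-node links followed by node-tag associations.
    top = [[beta * W[i][j] for j in range(n)] + list(C[i]) for i in range(n)]
    # Bottom rows: tag-node associations (C transposed) followed by identity.
    bottom = [
        [C[i][k] for i in range(n)] + [1 if k == l else 0 for l in range(t)]
        for k in range(t)
    ]
    return top + bottom

# Toy input (assumed): two linked blogs and three tags.
W = [[0, 1],
     [1, 0]]
C = [[1, 0, 1],
     [0, 1, 1]]
combined = simcut_matrix(W, C, beta=2)
```

With β = 0 the node-node block vanishes, so only tag co-occurrence drives the partition; as β grows, the link block dominates and the result approaches a plain NCut on W.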

SimCUT: Simultaneously Clustering Tags and Graphs

[Figure: Clustering Only Links vs. Clustering Links + Tags]

Datasets

• Citeseer (Getoor et al.)

– Agents, AI, DB, HCI, IR, ML

– Words used in place of tags

• Blog data

– Derived from the WWE/Buzzmetrics dataset

– Tags associated with blogs derived from del.icio.us

– For dimensionality reduction, 100 topics derived from blog homepages using LDA (Latent Dirichlet Allocation)

• Pairwise similarity computed

– RBF Kernel for Citeseer

– Cosine for blogs

Clustering Tags and Graphs

Clustering Only Links

Clustering Links + Tags

Clustering Tags and Graphs

Accuracy = 36% Accuracy = 62%

Higher accuracy by adding ‘tag’ information

Varying Scaling Parameter β

β >> 1 (Only Graph): Accuracy = 36%

β = 0 (Only Tags): Accuracy = 39%

β = 1 (Graphs & Tags): Accuracy = 62%

Higher accuracy by adding ‘tag’ information

Simple K-means ~23% (content only, binary)

Content only ~52% (Getoor et al. 2004)

Mutual Information

• Measures the dependence between two random variables.

• Compares results with ground truth
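A minimal sketch of how mutual information can compare a clustering with ground-truth labels is shown below. It computes plain Shannon mutual information in bits from two label lists; whether the thesis uses this form or a normalized variant is not stated on the slide, and the function name `mutual_information` is an assumption.

```python
import math
from collections import Counter

def mutual_information(labels_a, labels_b):
    """Shannon mutual information (in bits) between two labelings of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log2((c * n) / (count_a[x] * count_b[y]))
    return mi

truth = [0, 0, 1, 1]
perfect = ["a", "a", "b", "b"]    # same grouping, different label names
unrelated = ["a", "b", "a", "b"]  # statistically independent of truth

mi_perfect = mutual_information(truth, perfect)      # 1 bit
mi_unrelated = mutual_information(truth, unrelated)  # 0 bits
```

Because the score depends only on how items are grouped, the cluster ids do not need to match the ground-truth label names.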

Effect of Number of Tags, Clusters

Citeseer

Link only has lower MI

More semantics helps

Similar results for real, blog datasets


Tags are one type of meta-data!

Other semantic information:

• Sentiments

• Named Entities

• Readership information

• Geolocation information

• etc.

How do we get additional semantics?

Additional Semantics

• BlogVox: Sentiments and Opinions (TREC 06, IJCAI/AND 07)

• SemNews: Named Entities, beliefs, facts (AAAI SS 05, HICS 06, IJSWIS)

• Link Polarity: Sentiment from anchor text (ICWSM 07)

• Readership: Feed subscriptions and usage (ICWSM 07)


Feeds Readership

http://ftm.umbc.edu

Folders

Use folder label as topics/tags. Group similar folders together. Rank Feeds under a “topic”

ICWSM ‘07

http://ftm.umbc.edu/

• 83K publicly listed subscribers

• 2.8M feeds, 500K are unique

• 26K users (35%) use folders to organize subscriptions

• Data collected in May 2006

Although there may be ~ 50M+ Blogs, only a small fraction get continued user attention in the form of subscriptions

Feed Subscription Statistics

ICWSM ‘07

• Communities from Feed Subscriptions

– A common vocabulary emerges from folder names

– Folder names are used as topics. Lower-ranked folders are merged into a higher-ranked folder if there is an overlap and a high cosine similarity
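The folder-merging rule can be sketched as follows. This is an illustrative reading of the slide, not the thesis implementation: folders are treated as sets of feeds, ranked by the number of feeds they contain, and a lower-ranked folder is merged into the first higher-ranked folder whose cosine similarity reaches a cutoff. The cutoff value of 0.5 and the names `cosine` and `merge_folders` are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two folders viewed as binary feed vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def merge_folders(folders, threshold=0.5):
    """Merge lower-ranked folders into similar higher-ranked ones.

    folders: dict mapping folder name -> set of feed URLs.
    threshold: hypothetical similarity cutoff; the slide only says "high".
    """
    # Rank folders by number of feeds, largest first (ties broken by name).
    ranked = sorted(folders, key=lambda name: (-len(folders[name]), name))
    merged = {}
    for name in ranked:
        feeds = folders[name]
        target = next(
            (m for m in merged if cosine(merged[m], feeds) >= threshold), None
        )
        if target is None:
            merged[name] = set(feeds)  # becomes a topic of its own
        else:
            merged[target] |= feeds    # absorbed by the higher-ranked folder
    return merged

# Toy data (assumed): feed names are placeholders.
folders = {
    "politics": {"feedA", "feedB", "feedC", "feedD"},
    "political": {"feedA", "feedB", "feedC"},
    "knitting": {"feedX", "feedY"},
}
topics = merge_folders(folders)
```

A cosine similarity of zero implies no shared feeds, so the overlap condition is covered by any positive threshold.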

Feeds That Matter

Folder Usage

[Figure: number of users using a folder, plotted against the rank of the folder (by number of feeds in it)]


Folder names are used as topics. Lower-ranked folders are merged into a higher-ranked folder if there is an overlap and a high cosine similarity.

Tag Cloud After Merging

• Two feeds are similar if they are categorized under similar folders

Feed Recommendations

If you like X you will like…..

Feed Distillation for “Politics”

Merged folders: “political”, “political blogs”

• Talking Points Memo: by Joshua Micah Marshall

• Daily Kos: State of the Nation

• Eschaton

• The Washington Monthly

• Wonkette, Politics for People with Dirty Minds

• http://instapundit.com/

• Informed Comment

• Power Line

• AMERICAblog: Because a great nation deserves the truth

• Crooks and Liars

[Figure labels: Tech, Knitting]



Wikipedia is our collective wisdom

Twitter is our collective consciousness

Easily share status messages

Twitter post

Current Status

Friends

Microblogging

SNAKDD ‘07

Twitterment

Rank City

1 Tokyo

2 New York

3 San Francisco

4 Seattle

5 Los Angeles

6 Chicago

7 Toronto

8 Austin

9 Singapore

10 Madrid

http://twitterment.umbc.edu

• First Twitter search engine

• Uses Lucene to index public timeline

• Provides search and analytics

• Built a social network of users

• 1.3 M Tweets

• 83 K Users

• Two months of data

Search and Trend analytics on Microblogs

Microblogging Trend Analytics

[Figure: trend lines for the terms lunch, dinner, work, coffee]

Clique Percolation Method (CPM)

Two nodes belong to the same community if they can be connected through adjacent k-cliques. (Palla et al.)

Gaming Community

Microblogging Communities

Finds overlapping communities

A community is a union of all k-clique subgraphs

[Figure label: 3-Clique]
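A brute-force sketch of clique percolation is given below for illustration. It enumerates all k-cliques, treats two cliques as adjacent when they share k-1 nodes, and returns the union of each connected group of cliques. The function name and the toy graph are assumptions; the enumeration is exponential in general, so it is only suitable for very small graphs and is not the implementation used in the study.

```python
from itertools import combinations

def k_clique_communities(adj, k=3):
    """Return overlapping communities as unions of adjacent k-cliques.

    adj: dict mapping node -> set of neighbouring nodes (undirected graph).
    """
    nodes = sorted(adj)
    # Enumerate every k-subset whose members are all pairwise connected.
    cliques = [
        frozenset(c)
        for c in combinations(nodes, k)
        if all(v in adj[u] for u, v in combinations(c, 2))
    ]

    # Union-find over cliques: two cliques are adjacent if they share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(groups.values(), key=sorted)

# Toy graph (assumed): triangles 0-1-2 and 1-2-3 share an edge; triangle 3-4-5
# touches them only at node 3, so node 3 belongs to two communities.
adj = {
    0: {1, 2},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {1, 2, 4, 5},
    4: {3, 5},
    5: {3, 4},
}
communities = k_clique_communities(adj, k=3)
```

Node 3 appears in both resulting communities, which illustrates why the method finds overlapping communities.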

SNAKDD ‘07

INFORMATION HUB

Information Source: Communities connected via Robert Scoble, an A-list blogger

INFORMATION BRIDGE

Information Source, Information Seeker: Different roles in different communities

STAR NETWORKS / SMALL CLIQUES

Friendship-relation: Small groups among friends/co-workers


Future Work

• Social media content is challenging; many improvements are needed in textual analysis, sentiment detection, named entity detection and language understanding in such systems.

• Temporal analysis of community structures

• Feed distillation and ranking in blog search

• Index quality vs. index freshness

• User intention and personalization


Conclusions

• Demonstrated a fast community detection algorithm well suited for social datasets.

• Implemented SimCUT, a technique that outperforms simple graph-based approaches for community detection.

• Evaluated and tested proposed algorithms on real social media datasets and benchmark datasets.

• Conducted the first comprehensive study of feed readership and microblogging usage.

• Built systems, infrastructure and datasets for the social media research community.

Conclusions

• We have presented a framework for analyzing social media content and structure making use of certain special properties and features in such systems.

• We study the Social Web from a user perspective and analyze not just how people are using these systems but also why.

• Social Media is connecting people and building communities by bridging the gap between content production and consumption.

Thanks!

The Future….

• Location

– Social, mobile applications

– Geographically relevant, query(less) search

• Social Advertising and Personalization

– Role of influence and communities in advertising

• Real-Time, Social Information Streams

– Event detection / Breaking News

– How effective is the advertising?

• Social Web to solve challenging AI problems

– Just as tagging has helped image search

– Availability of social tools and Wikipedia provide opportunities to work on difficult AI problems like disambiguation and common sense reasoning.

http://ebiquity.umbc.edu

http://socialmedia.typepad.com
