Mining Social Media Communities and Content Akshay Java Ph.D. Dissertation Defense October 16 th 2008


Page 1: Mining Social Media Communities and  Content Akshay Java Ph.D. Dissertation Defense

Mining Social Media Communities and Content

Akshay Java

Ph.D. Dissertation Defense, October 16th 2008


Thesis Statement

“It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, content.”


Key Observations

1. Understanding communication in social media requires identifying and modeling communities

2. Communities are a result of collective, social interactions and usage.


Contributions

1. Developed and evaluated innovative approaches for community detection
– A new algorithm for finding communities in social datasets
– SimCUT, a novel algorithm for combining structural and semantic information

2. First to comprehensively analyze two important, new social media forms
– Feed Readership
– Microblogging Usage and Communities

3. Built systems, infrastructure and datasets for the social media research community


Outline

• Introduction
• Detecting Communities in Social Media
• Combining Semantic Information
• Case Studies
– Feed Usage and Distillation
– Microblogging Communities
• Future Work
• Conclusions



Social Media

Describes the online technologies and practices that people use to share opinions, insights, experiences, and perspectives and engage with each other. ~Wikipedia

UGC + Social Network


It’s about YOU!

What you…
Think: blogs
Say: Podcasts
See: Flickr, YouTube
Hear: Pandora, Last.fm
Do: Twitter, Jaiku, Pownce


Also about US

Who are our...
Friends: Facebook
Colleagues: LinkedIn
Virtual Avatars: Second Life


How We Share

What we share
Knowledge: Wikipedia
Links: del.icio.us, StumbleUpon
Love/Hate: Yelp, Upcoming
Location: FireEagle, BrightKite
Spaces: Ustream, Qik


Communities

Social interactions build communities

• Shared Interests
• Common Beliefs
• Events
• Organization/Location



What is a Community

A community in the real world is represented in a graph as a set of nodes that have more links within the set than outside it.

Graph
• Citation Network
• Affiliation Network
• Sentiment Information
• Shared Resource (tags, videos..)

Examples shown: Political Blogs, Twitter Network, Facebook Network
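The definition above can be checked directly on a small graph. The following is a minimal sketch, assuming an undirected edge list; the function name `link_counts` and the toy graph (two triangles joined by one edge) are illustrative and not taken from the talk.

```python
def link_counts(edges, members):
    """Count links that stay inside a candidate community and links that leave it."""
    members = set(members)
    internal = external = 0
    for u, v in edges:
        endpoints_inside = (u in members) + (v in members)
        if endpoints_inside == 2:
            internal += 1
        elif endpoints_inside == 1:
            external += 1
    return internal, external


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]

internal, external = link_counts(edges, {0, 1, 2})
looks_like_community = internal > external  # more links within the set than outside it
```

Under this definition the set {0, 1, 2} qualifies, while a set that straddles the bridge, such as {2, 3}, does not.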


Existing Approaches

Clustering Approach
1. Agglomerative/Hierarchical

Incrementally group similar nodes to form clusters

Figure: Communities in Football League (Hierarchical Clustering); nodes are Football Teams


Existing Approaches

Clustering Approach
1. Agglomerative/Hierarchical

Topological Overlap: Similarity is measured in terms of the number of nodes that both i and j link to. (Ravasz et al.)
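A small sketch of the topological overlap measure follows. The slide only states that similarity counts the nodes both i and j link to; the normalization by the smaller degree and the extra count for a direct link follow one common formulation and are assumptions here, as are the function name and the toy graph.

```python
def topological_overlap(adj, i, j):
    """Overlap of i and j: shared neighbors (plus a direct link), scaled by the smaller degree."""
    common = len(adj[i] & adj[j])          # nodes that both i and j link to
    direct = 1 if j in adj[i] else 0       # assumption: a direct link also counts
    smaller_degree = min(len(adj[i]), len(adj[j]))
    return (common + direct) / smaller_degree if smaller_degree else 0.0


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

same_group = topological_overlap(adj, 0, 1)     # nodes inside one triangle
across_bridge = topological_overlap(adj, 2, 3)  # the two bridge endpoints
far_apart = topological_overlap(adj, 0, 4)      # nodes in different triangles
```

An agglomerative method would then merge the pairs with the highest overlap first, so nodes inside a triangle are grouped before the bridge endpoints.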


Existing Approaches

Clustering Approach
1. Agglomerative/Hierarchical
2. Divisive/Partition based (Girvan Newman)

Normalized Cut (NCut) (Shi, Malik)

$$\mathrm{NCut}(A,B) = \mathrm{Cut}(A,B)\left(\frac{1}{\mathrm{vol}(A)} + \frac{1}{\mathrm{vol}(B)}\right)$$

Figure: Political Books


Existing Approaches

Normalized Cuts

• The graph is partitioned using the eigenspectrum of the Laplacian. (Shi and Malik)

• The eigenvector corresponding to the second smallest eigenvalue of the graph Laplacian is the Fiedler vector.

• The graph can be recursively partitioned using the sign of the values in its Fiedler vector.

Graph Laplacian: $L = D - W$, with normalized form $I - D^{-1/2} W D^{-1/2}$

$$\mathrm{NCut}(A,B) = \mathrm{Cut}(A,B)\left(\frac{1}{\mathrm{Vol}(A)} + \frac{1}{\mathrm{Vol}(B)}\right)$$

– Cut(A,B): cost of edges deleted to disconnect the graph
– Vol(B): total cost of all edges that start from B
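The partitioning step can be sketched with NumPy as follows. This is an illustrative sketch, not the implementation used in the thesis: it assumes a symmetric weight matrix with no isolated nodes, and the function names and the toy graph are hypothetical.

```python
import numpy as np


def fiedler_partition(W):
    """Split a graph in two using the sign of the Fiedler vector of the normalized Laplacian."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                       # node degrees (assumed non-zero)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigenvectors = np.linalg.eigh(L_sym)  # eigenvalues come back in ascending order
    fiedler = d_inv_sqrt * eigenvectors[:, 1]  # eigenvector of the second smallest eigenvalue
    return fiedler >= 0


def ncut(W, mask):
    """NCut(A, B) = Cut(A, B) * (1 / Vol(A) + 1 / Vol(B)) for the split given by a boolean mask."""
    W = np.asarray(W, dtype=float)
    cut = W[mask][:, ~mask].sum()
    return cut * (1.0 / W[mask].sum() + 1.0 / W[~mask].sum())


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    W[u, v] = W[v, u] = 1.0

mask = fiedler_partition(W)
score = ncut(W, mask)
```

Applying `fiedler_partition` again to each side gives the recursive partitioning described in the third bullet.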


Existing Approaches

• Modularity Score (Newman et al.)
– Measure of quality of clustering

$$Q = \sum_i \left(e_{ii} - a_i^2\right), \qquad a_i = \sum_j e_{ij}$$

– $e_{ii}$: fraction of intra-community edges
– $a_i^2$: expected value of $e_{ii}$ disregarding communities
– Q = 0: communities are random
– Q > 0: higher values are better

• Optimizing modularity is NP-Hard (Brandes et al.)
– Spectral Methods
– Heuristics
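The modularity score can be computed directly from an edge list and a community assignment. The sketch below assumes an undirected, unweighted graph; the function name and the toy graph are illustrative.

```python
from collections import defaultdict


def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted edge list."""
    m = len(edges)
    e_ii = defaultdict(float)  # fraction of edges with both ends in community i
    a = defaultdict(float)     # fraction of edge ends attached to community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        a[cu] += 0.5 / m
        a[cv] += 0.5 / m
        if cu == cv:
            e_ii[cu] += 1.0 / m
    return sum(e_ii[c] - a[c] ** 2 for c in a)


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]

two_groups = modularity(edges, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
one_group = modularity(edges, {n: "all" for n in range(6)})
```

Splitting the graph into its two triangles scores higher than placing every node in a single community, which scores zero.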


Limitations

Existing methods
1. Do not scale well for Web graphs
2. Fail to exploit the underlying graph’s distributions
3. Are unable to use available meta-data and semantic features




Special Properties of Social Datasets

• The Long Tail
– 80/20 Rule or Pareto distribution
– Few blogs get most attention/links
– Most are sparsely connected

• Motivation
– Web graphs are large, but sparse
– Expensive to compute community structure over the entire graph

• Goal
– Approximate the membership of the nodes using only a small portion of the entire graph.


Special Properties of Social Datasets

• Intuition
– Communities are defined by the core (A), and the membership of the rest of the network (B) can be approximated by how they link to the core.

• Direct Method
– NCut (Baseline)

• Approximation
– Singular Value Decomposition (SVD)
– Sampling
– Heuristic


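Approximating Communities

• SVD (low rank): rank-$r$ approximation $U_r \Sigma_r V_r^T$

• Sampling based approach
– Communities can be extracted by sampling only columns from the head (Drineas et al.)
– With nodes ordered by degree, the adjacency matrix is written in blocks and approximated from the head block $A$ and the head-to-tail block $B$:

$$\begin{pmatrix} A & B \\ B^T & C \end{pmatrix} \approx \begin{pmatrix} A & B \\ B^T & B^T A^{-1} B \end{pmatrix} = \begin{pmatrix} U \\ B^T U \Lambda^{-1} \end{pmatrix} \Lambda \begin{pmatrix} U^T & \Lambda^{-1} U^T B \end{pmatrix}, \qquad A = U \Lambda U^T$$

• Heuristic: Cluster the head to find initial communities. Assign each tail node to the cluster that it most frequently links to.

ICWSM ‘08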
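The heuristic step can be sketched as a majority vote over links into the head. This is a minimal sketch under stated assumptions: the head nodes are already clustered (by any method), edges are undirected, and the function name, node names, and the choice to leave unreachable tail nodes unlabeled are all illustrative rather than taken from the thesis.

```python
from collections import Counter


def assign_tail_nodes(edges, head_labels):
    """Give each tail node the cluster that its head neighbors most frequently belong to."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    labels = dict(head_labels)
    for node, linked in neighbors.items():
        if node in head_labels:
            continue
        votes = Counter(head_labels[n] for n in linked if n in head_labels)
        # Assumption: a tail node with no link into the head stays unlabeled (None).
        labels[node] = votes.most_common(1)[0][0] if votes else None
    return labels


# Hypothetical example: four high-degree head nodes in two clusters, three tail nodes.
head_labels = {"h1": 0, "h2": 0, "h3": 1, "h4": 1}
edges = [
    ("h1", "h2"), ("h3", "h4"), ("h2", "h3"),
    ("t1", "h1"), ("t1", "h2"), ("t1", "h3"),  # t1 links mostly to cluster 0
    ("t2", "h4"),                              # t2 links only to cluster 1
    ("t3", "t1"),                              # t3 has no link into the head
]
labels = assign_tail_nodes(edges, head_labels)
```

Only the head needs to be clustered with an expensive method; the tail assignment is a single pass over the edges.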


Approximating Communities

1. Dataset: A blog dataset of 6000 blogs.

ICWSM ‘08

Figure panels: Original Adjacency; Heuristic Approximation

Modularity = 0.51


Approximating Communities

Figure annotations: Low Modularity, More Time; Similar Modularity, Lower Time

• Advantage: Faster detection using a small portion of the graph, and less memory.

• Complexity: SVD $O(n^3)$, NCut $O(nk)$, Sampling $O(r^3)$, Heuristic $O(rk)$, where n = number of blogs, k = number of clusters, r = number of columns

ICWSM ‘08


Approximating Communities

ICWSM ‘08

1. Blog Dataset
2. Social network datasets:

Additional evaluations using Variation of Information score



Tags are free meta-data!

Other semantic features:
• Sentiments
• Named Entities
• Readership information
• Geolocation information
• etc.

How to combine this for detecting communities?


Social Media Graphs

Figure panels: Links Between Nodes; Links Between Nodes and Tags

Simultaneous Cuts


A community in the real world is identified in a graph as a set of nodes that have more links within the set than outside it and share similar tags.

Communities in Social Media


SimCUT: Simultaneously Clustering Tags and Graphs

Figure: example Nodes × Nodes matrix and the combined matrix with Nodes × Tags and Tags × Nodes blocks; communities are read from the Fiedler Vector Polarity

$$W' = \begin{pmatrix} \beta W & C \\ C^T & I \end{pmatrix}$$

β = 0: Entirely ignore link information
β = 1: Equal importance to blog-blog and blog-tag links
β >> 1: NCut

WebKDD ‘08
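The idea of cutting links and tags at the same time can be sketched as a Fiedler-vector split of a combined matrix. This is an illustrative sketch, not the SimCUT implementation: the block layout (β-scaled node-node links W, node-tag matrix C, identity on the tag-tag block) is an assumption inferred from the slide, and the function name and toy data are hypothetical.

```python
import numpy as np


def simcut_bipartition(W, C, beta=1.0):
    """Split nodes and tags together using the Fiedler vector of a combined matrix."""
    W = np.asarray(W, dtype=float)   # node-node links, shape (n_nodes, n_nodes)
    C = np.asarray(C, dtype=float)   # node-tag associations, shape (n_nodes, n_tags)
    n_nodes, n_tags = C.shape
    # Assumed block layout: beta scales the link information relative to the tags.
    W_prime = np.block([[beta * W, C], [C.T, np.eye(n_tags)]])

    d = W_prime.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    size = n_nodes + n_tags
    L_sym = np.eye(size) - d_inv_sqrt[:, None] * W_prime * d_inv_sqrt[None, :]
    _, eigenvectors = np.linalg.eigh(L_sym)      # ascending eigenvalues
    fiedler = d_inv_sqrt * eigenvectors[:, 1]
    side = fiedler >= 0                          # polarity of the Fiedler vector
    return side[:n_nodes], side[n_nodes:]


# Hypothetical toy data: nodes 0-1 and 2-3 are linked pairs joined by the edge (1, 2);
# nodes 0 and 1 use tag 0, nodes 2 and 3 use tag 1.
W = np.zeros((4, 4))
for u, v in [(0, 1), (2, 3), (1, 2)]:
    W[u, v] = W[v, u] = 1.0
C = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

node_side, tag_side = simcut_bipartition(W, C, beta=1.0)
```

Each tag lands on the same side as the nodes that use it, so the cut yields communities together with their descriptive tags.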


SimCUT: Simultaneously Clustering Tags and Graphs

$$W' = \begin{pmatrix} \beta W & C \\ C^T & I \end{pmatrix}$$

β = 0: Entirely ignore link information
β = 1: Equal importance to blog-blog and blog-tag links
β >> 1: NCut

Figure panels: Clustering Only Links; Clustering Links + Tags

WebKDD ‘08


Datasets

• Citeseer (Getoor et al.)
– Agents, AI, DB, HCI, IR, ML
– Words used in place of tags

• Blog data
– Derived from the WWE/Buzzmetrics dataset
– Tags associated with blogs derived from del.icio.us
– For dimensionality reduction, 100 topics derived from blog homepages using LDA (Latent Dirichlet Allocation)

• Pairwise similarity computed
– RBF Kernel for Citeseer
– Cosine for blogs
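The two pairwise similarity measures named above can be sketched with NumPy. The feature matrix, the `gamma` bandwidth, and the function names are illustrative assumptions; the slide does not give the parameters that were used.

```python
import numpy as np


def rbf_kernel(X, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2) for the rows of X."""
    X = np.asarray(X, dtype=float)
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clip tiny negatives from rounding


def cosine_similarity(X):
    """S[i, j] = cosine of the angle between rows i and j of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms == 0, 1.0, norms)        # leave all-zero rows as zeros
    return unit @ unit.T


# Hypothetical feature rows, e.g. word or topic weights per document.
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
K = rbf_kernel(X, gamma=1.0)
S = cosine_similarity(X)
```

Cosine ignores vector length, so rows 0 and 2 are maximally similar under it, while the RBF kernel still separates them by distance.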


Clustering Tags and Graphs

Figure panels: Clustering Only Links; Clustering Links + Tags


Clustering Tags and Graphs

Accuracy = 36% Accuracy = 62%

Higher accuracy by adding ‘tag’ information


Varying Scaling Parameter β

Accuracy = 36% Accuracy = 62%

Higher accuracy by adding ‘tag’ information

Simple Kmeans ~23% Content only, binary
Content only ~52% (Getoor et al. 2004)

β >> 1, β = 1, β = 0

Accuracy = 39%

Only Graph, Only Tags, Graphs & Tags


Effect of Number of Tags, Clusters

Mutual Information
• Measures the dependence between two random variables.
• Compares results with ground truth

Citeseer
– Link only has lower MI
– More semantics helps

Similar results for real blog datasets
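Mutual information between a clustering and the ground truth can be computed from label counts. The sketch below uses natural logarithms and hypothetical label lists; the function name is illustrative.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """I(A; B) in nats between two labelings of the same items."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) )
        mi += (n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
    return mi


# Hypothetical labelings of four items.
ground_truth = ["AI", "AI", "DB", "DB"]
perfect = mutual_information([0, 0, 1, 1], ground_truth)    # same grouping, different names
unrelated = mutual_information([0, 1, 0, 1], ground_truth)  # grouping independent of the truth
```

The score does not depend on how clusters are named, which makes it suitable for comparing detected communities with ground-truth categories.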



Tags are one type of meta-data!

Other semantic information:
• Sentiments
• Named Entities
• Readership information
• Geolocation information
• etc.

How do we get additional semantics?


Additional Semantics

• BlogVox: Sentiments and Opinions (TREC 06, IJCAI/AND 07)

• SemNews: Named Entities, beliefs, facts (AAAI SS 05, HICS 06, IJSWIS)

• Link Polarity: Sentiment from anchor text (ICWSM 07)

• Readership: Feed subscriptions and usage (ICWSM 07)




Feeds Readership

http://ftm.umbc.edu

Folders
• Use folder labels as topics/tags.
• Group similar folders together.
• Rank feeds under a “topic”.

ICWSM ‘07


Feed Subscription Statistics

• 83K publicly listed subscribers

• 2.8M feeds, 500K are unique

• 26K users (35%) use folders to organize subscriptions

• Data collected in May 2006

Although there may be ~50M+ blogs, only a small fraction get continued user attention in the form of subscriptions.

ICWSM ‘07


Feeds That Matter

• Communities from Feed Subscriptions
– A common vocabulary emerges from folder names
– Folder names are used as topics. Lower ranked folders are merged into a higher ranked folder if there is an overlap and a high cosine similarity

Figure: Folder Usage. Axes: Rank of a Folder (by number of feeds in it); # of Users Using a Folder

http://ftm.umbc.edu

ICWSM ‘07
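The folder merging rule can be sketched as follows. This is a minimal sketch under stated assumptions: each folder is represented by the set of feeds filed under it, folders are ranked by number of feeds, and the similarity threshold of 0.5 is a hypothetical value (the slide only says "high"). The function names and feed names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feed sets treated as binary vectors."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0


def merge_folders(folders, threshold=0.5):
    """Merge lower ranked folders into a higher ranked one when their feeds overlap strongly."""
    ranked = sorted(folders, key=lambda name: (-len(folders[name]), name))
    merged = {}   # surviving folder name -> set of feeds
    alias = {}    # merged-away folder name -> surviving folder name
    for name in ranked:
        feeds = folders[name]
        target = None
        for kept in merged:
            # A positive cosine implies an overlap; the threshold asks for a high one.
            if cosine(feeds, merged[kept]) >= threshold:
                target = kept
                break
        if target is None:
            merged[name] = set(feeds)
        else:
            merged[target] |= feeds
            alias[name] = target
    return merged, alias


# Hypothetical folders and feeds.
folders = {
    "politics": {"dailykos", "powerline", "wonkette", "eschaton"},
    "political": {"dailykos", "powerline", "wonkette"},
    "tech": {"slashdot", "techcrunch"},
}
merged, alias = merge_folders(folders)
```

Here "political" is folded into the larger "politics" folder, while "tech" shares no feeds with it and stays separate.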


Folder names are used as topics. Lower ranked folders are merged into a higher ranked folder if there is an overlap and a high cosine similarity.

Tag Cloud After Merging


Feed Recommendations

• Two feeds are similar if they are categorized under similar folders

If you like X you will like…

Figure labels: Tech, Knitting

Feed Distillation for “Politics”
Merged folders: “political”, “political blogs”
• Talking Points Memo: by Joshua Micah Marshall
• Daily Kos: State of the Nation
• Eschaton
• The Washington Monthly
• Wonkette, Politics for People with Dirty Minds
• http://instapundit.com/
• Informed Comment
• Power Line
• AMERICAblog: Because a great nation deserves the truth
• Crooks and Liars

http://ftm.umbc.edu

ICWSM ‘07
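The similarity rule behind the recommendations can be sketched by describing each feed with the folders users file it under. The cosine measure, the function names, and the sample subscriptions below are assumptions for illustration; the slide only states that feeds under similar folders are similar.

```python
import math
from collections import defaultdict


def feed_vectors(subscriptions):
    """Map each feed to a {folder name: count} vector from (folder, feed) pairs."""
    vectors = defaultdict(lambda: defaultdict(int))
    for folder, feed in subscriptions:
        vectors[feed][folder] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def recommend(feed, vectors, top_n=3):
    """Return the feeds most similar to the given feed, best first."""
    scored = [(cosine(vectors[feed], vec), other) for other, vec in vectors.items() if other != feed]
    scored = [(score, other) for score, other in scored if score > 0]
    return [other for score, other in sorted(scored, key=lambda p: (-p[0], p[1]))[:top_n]]


# Hypothetical (folder, feed) pairs gathered across users.
subscriptions = [
    ("politics", "A"), ("politics", "A"),
    ("politics", "B"), ("politics", "B"), ("tech", "B"),
    ("tech", "C"),
]
vectors = feed_vectors(subscriptions)
if_you_like_a = recommend("A", vectors)
if_you_like_c = recommend("C", vectors)
```

Feed B is suggested for readers of both A and C because it is filed under both folders, while A and C share no folder and are never recommended to each other.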



Wikipedia is our collective wisdom

Twitter is our collective consciousness


Easily share status messages

Twitter post

Current Status

Friends

Microblogging

SNAKDD ‘07


Twitterment

http://twitterment.umbc.edu

• First Twitter search engine
• Uses Lucene to index public timeline
• Provides search and analytics
• Built a social network of users
• 1.3 M Tweets
• 83 K Users
• Two months of data

Rank  City
1     Tokyo
2     New York
3     San Francisco
4     Seattle
5     Los Angeles
6     Chicago
7     Toronto
8     Austin
9     Singapore
10    Madrid


Microblogging Trend Analytics

http://twitterment.umbc.edu

Search and Trend analytics on Microblogs

Figure labels: lunch, dinner, work, coffee


Microblogging Communities

Clique Percolation Method (CPM) (Palla et al.)
Two nodes belong to the same community if they can be connected through adjacent k-cliques.

A community is a union of all k-clique subgraphs that can be reached from one another through adjacent k-cliques.

Finds overlapping communities

Figure labels: 3 Clique; Gaming Community

SNAKDD ‘07
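The clique percolation idea can be sketched in a few lines for small graphs. This sketch enumerates k-cliques by brute force, which is only practical for toy inputs, and treats two k-cliques as adjacent when they share k-1 nodes; the function name and the example graph are illustrative and not the tooling used in the study.

```python
from itertools import combinations


def k_clique_communities(edges, k=3):
    """Union k-cliques that are connected through adjacent k-cliques (sharing k-1 nodes)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Brute-force enumeration of k-cliques (fine for small graphs only).
    cliques = [
        frozenset(group)
        for group in combinations(sorted(adj), k)
        if all(b in adj[a] for a, b in combinations(group, 2))
    ]

    # Union-find over cliques: adjacent cliques end up in the same community.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return list(communities.values())


# Hypothetical graph: triangles {0,1,2} and {1,2,3} share an edge; triangle {3,4,5} shares only node 3.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
communities = k_clique_communities(edges, k=3)
```

Node 3 appears in both resulting communities, which illustrates how the method finds overlapping communities.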


INFORMATION HUB

Information Source: Communities connected via Robert Scoble, an A-list blogger


INFORMATION BRIDGE

Information Source, Information Seeker: Different roles in different communities


STAR NETWORKS / SMALL CLIQUES

Friendship-relation: Small groups among friends/co-workers



Thesis Statement

“It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, content.”

Observations

1. Understanding communication in social media requires identifying and modeling communities

2. Communities are a result of collective, social interactions and usage.


Future Work

• Social media content is challenging; many improvements are needed in textual analysis, sentiment detection, named entity detection and language understanding in such systems.
• Temporal analysis of community structures
• Feed distillation and ranking in blog search
• Index quality vs. index freshness
• User intention and personalization



Conclusions

• Demonstrated a fast community detection algorithm well suited for social datasets.

• Implemented SimCUT, a technique that outperforms simple graph based approaches for community detection.

• Evaluated and tested proposed algorithms on real social media datasets and benchmark datasets.

• Conducted the first comprehensive study of feed readership and microblogging usage.

• Built systems, infrastructure and datasets for the social media research community.


Conclusions

• We have presented a framework for analyzing social media content and structure, making use of certain special properties and features in such systems.

• We study the Social Web from a user perspective and analyze not just how people are using these systems but also why.

• Social Media is connecting people and building communities by bridging the gap between content production and consumption.


Thanks!


The Future…

• Location
– Social, mobile applications
– Geographically relevant, query(less) search

• Social Advertising and Personalization
– Role of influence and communities in advertising

• Real-Time, Social Information Streams
– Event detection/Breaking News
– How effective is the advertising?

• Social Web to solve challenging AI problems
– Just as tagging has helped image search
– Availability of social tools and Wikipedia provide opportunities to work on difficult AI problems like disambiguation and common sense reasoning.


http://ebiquity.umbc.edu
http://socialmedia.typepad.com