Mining Social Media Communities and Content
Akshay Java
Ph.D. Dissertation Defense
October 16th, 2008
“It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, content.”
Thesis Statement
Key Observations
1. Understanding communication in social media requires identifying and modeling communities
2. Communities are a result of collective, social interactions and usage.
1. Developed and evaluated innovative approaches for community detection
– A new algorithm for finding communities in social datasets
– SimCUT, a novel algorithm for combining structural and semantic information
2. First to comprehensively analyze two important, new social media forms
– Feed Readership
– Microblogging Usage and Communities
3. Built systems, infrastructure and datasets for the social media research community
Contributions
Outline
• Introduction
• Detecting Communities in Social Media
• Combining Semantic Information
• Case Studies
– Feed Usage and Distillation
– Microblogging Communities
• Future Work
• Conclusions
Social Media
Describes the online technologies
and practices that people use to
share opinions, insights,
experiences, and perspectives
and engage with each other.
UGC + Social Network
~Wikipedia
What you…
Think blogs
Say Podcasts
See Flickr, YouTube
Hear Pandora, Last.fm
Do Twitter, Jaiku, Pownce
It’s about YOU!
Who are our...
Friends Facebook
Colleagues LinkedIn
Virtual Avatars secondlife
Also about US
What we share
Knowledge Wikipedia
Links del.icio.us, StumbleUpon
Love/Hate yelp, Upcoming
Location FireEagle, BrightKite
Spaces Ustream, Qik
How We Share
Social interactions build communities
• Shared Interests
• Common Beliefs
• Events
• Organization/Location
Communities
A community in the real world is represented in a graph as a set of nodes that have more links within the set than outside it.
Graph
• Citation Network
• Affiliation Network
• Sentiment Information
• Shared Resource (tags, videos, ...)
Political Blogs
Twitter Network
Facebook Network
What is a Community
Existing Approaches
Clustering Approach
1. Agglomerative/Hierarchical
Incrementally group similar nodes to form clusters.
[Figure: Communities in Football League (Hierarchical Clustering); nodes are football teams]
Topological Overlap: Similarity is measured in terms of the number of nodes that both i and j link to. (Ravasz et al.)
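A minimal sketch of the topological overlap idea as worded above: the raw count of nodes that both i and j link to. The function name `common_targets` and the toy link table are illustrative assumptions; published variants also normalize the count by node degree, which is omitted here.

```python
def common_targets(links, i, j):
    """Count the nodes that both i and j link to (un-normalized overlap)."""
    return len(links[i] & links[j])


# Hypothetical toy data: each node maps to the set of nodes it links to.
links = {
    "i": {"a", "b", "c"},
    "j": {"b", "c", "d"},
    "k": {"e"},
}

overlap_ij = common_targets(links, "i", "j")  # shares b and c
overlap_ik = common_targets(links, "i", "k")  # shares nothing
```

An agglomerative scheme would repeatedly merge the pair of nodes (or clusters) with the highest overlap.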
Existing Approaches
Clustering Approach
1. Agglomerative/Hierarchical
2. Divisive/Partition based (Girvan-Newman)
Normalized Cut (NCut) (Shi, Malik)
Political Books
$$\mathrm{NCut}(A,B) = \mathrm{Cut}(A,B)\left(\frac{1}{\mathrm{vol}(A)} + \frac{1}{\mathrm{vol}(B)}\right)$$
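As a rough illustration of the normalized cut score, the sketch below evaluates NCut for a given two-way partition of a small weighted, undirected graph. The edge-dictionary representation and the example graph (two triangles joined by one edge) are assumptions made for this example only.

```python
def ncut(weights, group_a):
    """NCut(A, B) = Cut(A, B) * (1 / vol(A) + 1 / vol(B)).

    weights: {(u, v): w} with each undirected edge listed once.
    group_a: nodes on side A; every other node is on side B.
    Both sides must contain at least one edge endpoint, otherwise a
    volume is zero and the score is undefined.
    """
    group_a = set(group_a)
    cut = vol_a = vol_b = 0.0
    for (u, v), w in weights.items():
        # Each endpoint adds the edge weight to the volume of its side.
        for node in (u, v):
            if node in group_a:
                vol_a += w
            else:
                vol_b += w
        # An edge is cut when its endpoints fall on different sides.
        if (u in group_a) != (v in group_a):
            cut += w
    return cut * (1.0 / vol_a + 1.0 / vol_b)


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
edges = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (2, 3): 1,
         (3, 4): 1, (3, 5): 1, (4, 5): 1}

balanced = ncut(edges, {0, 1, 2})  # cuts only the bridge
lopsided = ncut(edges, {0})        # isolates a single node
```

The balanced split scores lower than the lopsided one, which is the behavior the volume terms are meant to encourage.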
Existing Approaches
• The graph is partitioned using the eigenspectrum of the Laplacian. (Shi and Malik)
• The eigenvector of the graph Laplacian with the second smallest eigenvalue is the Fiedler vector.
• The graph can be recursively partitioned using the sign of the values in its Fiedler vector.
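Graph Laplacian
$$L = D - W, \qquad \text{normalized form: } I - D^{-1/2}\, W\, D^{-1/2}$$
Normalized Cuts
$$\mathrm{NCut}(A,B) = \mathrm{Cut}(A,B)\left(\frac{1}{\mathrm{vol}(A)} + \frac{1}{\mathrm{vol}(B)}\right)$$
Cut(A,B): cost of edges deleted to disconnect the graph
vol(B): total cost of all edges that start from B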
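The sign-based split can be sketched without a linear algebra library. The example below is an illustration under assumptions, not the solver used in the talk: it finds the Fiedler vector of the normalized Laplacian by power iteration on D^{-1/2} W D^{-1/2} + I while projecting out the trivial eigenvector, then splits nodes by sign. The function name, the iteration count, and the toy graph are all hypothetical, and every node is assumed to have at least one edge.

```python
import math


def fiedler_partition(adj, iterations=200):
    """Split a connected graph in two by the sign of its Fiedler vector.

    adj: symmetric adjacency matrix (list of lists), no isolated nodes.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    # M = D^{-1/2} W D^{-1/2}; the normalized Laplacian is I - M.
    m = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # Trivial eigenvector of M (eigenvalue 1) is D^{1/2} * 1, normalized.
    norm = math.sqrt(sum(deg))
    trivial = [math.sqrt(d) / norm for d in deg]
    x = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        # Remove the trivial component so iteration finds the next one.
        proj = sum(a * b for a, b in zip(x, trivial))
        x = [a - proj * b for a, b in zip(x, trivial)]
        # Multiply by (M + I); shifting keeps all eigenvalues non-negative.
        x = [x[i] + sum(m[i][j] * x[j] for j in range(n)) for i in range(n)]
        scale = math.sqrt(sum(a * a for a in x))
        x = [a / scale for a in x]
    side_a = {i for i in range(n) if x[i] >= 0}
    return side_a, set(range(n)) - side_a


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
adj[4][2] = 0  # keep the matrix symmetric: only (2, 3) bridges the triangles
parts = fiedler_partition(adj)
```

Applying the same split recursively to each side gives the recursive partitioning described above.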
Existing Approaches
• Modularity Score (Newman et al.)
– Measure of quality of clustering
$$Q = \sum_i \left(e_{ii} - a_i^2\right), \qquad a_i = \sum_j e_{ij}$$
– e_ii: fraction of intra-community edges
– a_i^2: expected value of e_ii disregarding communities
– Q = 0: communities are random
– Q > 0: higher values are better
• Optimizing modularity is NP-Hard*
– Spectral Methods
– Heuristics
* (Brandes et al.)
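A small sketch of the modularity score for an unweighted, undirected graph, computed as the sum over communities of (fraction of intra-community edges) minus (fraction of edge ends in the community) squared. The edge-list input format and the example graph are assumptions for illustration.

```python
def modularity(edges, community_of):
    """Q = sum_i (e_ii - a_i^2) for an unweighted, undirected edge list."""
    m = len(edges)
    intra = {}      # number of edges with both ends in community c
    edge_ends = {}  # number of edge ends attached to community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        edge_ends[cu] = edge_ends.get(cu, 0) + 1
        edge_ends[cv] = edge_ends.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    q = 0.0
    for c, ends in edge_ends.items():
        e_cc = intra.get(c, 0) / m  # fraction of edges inside c
        a_c = ends / (2 * m)        # fraction of edge ends in c
        q += e_cc - a_c ** 2
    return q


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Placing every node in one community gives Q = 0, while the two-triangle split gives a positive score.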
Existing methods
1. Do not scale well for Web graphs
2. Fail to exploit the underlying graph’s distributions
3. Are unable to use available meta-data and semantic features
Limitations
“It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, content.”
Thesis Statement
• The Long Tail
– 80/20 Rule or Pareto distribution
– Few blogs get most attention/links
– Most are sparsely connected
• Motivation
– Web graphs are large, but sparse
– Expensive to compute community structure over the entire graph
• Goal
– Approximate the membership of the nodes using only a small portion of the entire graph.
Special Properties of Social Datasets
• Intuition
– Communities are defined by the core (A), and the membership of the rest of the network (B) can be approximated by how they link to the core.
• Direct Method
– NCut (Baseline)
• Approximation
– Singular Value Decomposition (SVD)
– Sampling
– Heuristic
• SVD (low rank)
• Sampling-based approach
– Communities can be extracted by sampling only columns from the head (Drineas et al.)
• Heuristic
– Cluster the head to find initial communities, then assign each tail node to the cluster it most frequently links to.
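The heuristic's second step can be sketched as a simple vote. This example assumes the head nodes have already been clustered by some other method (for instance NCut on the head block) and only shows the tail assignment; `assign_tail`, the link table, and the labels are hypothetical names and data.

```python
from collections import Counter


def assign_tail(out_links, head_labels):
    """Give each tail node the head community it links to most often.

    out_links:   {node: iterable of nodes it links to}
    head_labels: {head node: community label}, assumed precomputed.
    Tail nodes with no links into the head are labeled None.
    """
    labels = dict(head_labels)
    for node, targets in out_links.items():
        if node in head_labels:
            continue
        votes = Counter(head_labels[t] for t in targets if t in head_labels)
        labels[node] = votes.most_common(1)[0][0] if votes else None
    return labels


# Hypothetical data: three high-degree head blogs in two communities.
head_labels = {"h1": "A", "h2": "A", "h3": "B"}
out_links = {
    "t1": ["h1", "h2", "h3"],  # two votes for A, one for B
    "t2": ["h3"],              # one vote for B
    "t3": [],                  # no links into the head
}
labels = assign_tail(out_links, head_labels)
```

Only the head block needs an expensive clustering step; the tail is labeled in a single pass over its links.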
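Approximating Communities
[Figure: adjacency matrix with nodes ordered by degree]
Low-rank SVD approximation (rank r):
$$W \approx U_r \Sigma_r V_r^{T}$$
Sampling-based approximation (A: head block, B: links between head and tail, C: tail block):
$$\begin{pmatrix} A & B \\ B^{T} & C \end{pmatrix} \approx \begin{pmatrix} A & B \\ B^{T} & B^{T} A^{-1} B \end{pmatrix} = \begin{pmatrix} U \\ B^{T} U \Lambda^{-1} \end{pmatrix} \Lambda \begin{pmatrix} U^{T} & \Lambda^{-1} U^{T} B \end{pmatrix}, \qquad A = U \Lambda U^{T}$$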
ICWSM ‘08
Approximating Communities
1. Dataset: A blog dataset of 6000 blogs.
[Figure: Original Adjacency vs. Heuristic Approximation; Modularity = 0.51]
Approximating Communities
[Figure: modularity and running time per method; annotations: "Low modularity, more time" and "Similar modularity, lower time"]
• Advantage: faster detection using a small portion of the graph, and less memory.
• Complexity: SVD O(n^3), NCut O(nk), Sampling O(r^3), Heuristic O(rk), where n = number of blogs, k = number of clusters, r = number of columns
Approximating Communities
1. Blog Dataset
2. Social network datasets:
Additional evaluations using Variation of Information score
Tags are free meta-data!
Other semantic features:
• Sentiments
• Named Entities
• Readership information
• Geolocation information
• etc.
How to combine this for detecting communities?
Social Media Graphs
Links Between Nodes Links Between Nodes and Tags
Simultaneous Cuts
A community in the real world is identified in a graph as a set of nodes that have more links within the set than outside it and share similar tags.
Communities in Social Media
SimCUT: Simultaneously Clustering Tags and Graphs
[Figure: an example node-node adjacency matrix and node-tag matrix combined into one block matrix over nodes and tags; communities are read from the Fiedler vector polarity]
$$W' = \begin{pmatrix} I & C \\ C^{T} & \beta W \end{pmatrix}$$
β = 0: entirely ignore link information
β = 1: equal importance to blog-blog and blog-tag links
β >> 1: NCut
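To make the role of β concrete, the sketch below assembles a combined matrix from a node-node matrix W and a tag-node matrix C. The block layout (tags first, then nodes, identity on the tag block, β scaling the link block) is an assumption made for this example, as are the function name and the toy matrices; the resulting matrix would then be partitioned with a spectral method such as NCut.

```python
def combined_matrix(w, c, beta):
    """Build [[I, C], [C^T, beta * W]] as a list of lists.

    w:    n x n node-node link matrix.
    c:    t x n tag-node matrix (c[a][j] = 1 if tag a is used on node j).
    beta: weight of link information relative to tag information.
    """
    n, t = len(w), len(c)
    size = t + n
    m = [[0.0] * size for _ in range(size)]
    for a in range(t):
        m[a][a] = 1.0                    # identity block for tags
        for j in range(n):
            m[a][t + j] = float(c[a][j])  # tag-node block C
            m[t + j][a] = float(c[a][j])  # node-tag block C^T
    for i in range(n):
        for j in range(n):
            m[t + i][t + j] = beta * w[i][j]  # scaled link block
    return m


# Hypothetical example: two linked nodes, one tag used only on node 0.
w = [[0, 1], [1, 0]]
c = [[1, 0]]

links_and_tags = combined_matrix(w, c, beta=2)
tags_only = combined_matrix(w, c, beta=0)
```

With β = 0 the link block vanishes and only tag co-usage connects nodes; as β grows, the link block dominates and the result approaches clustering on links alone.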
WebKDD ‘08
SimCUT: Simultaneously Clustering Tags and Graphs
[Figure: Clustering Only Links vs. Clustering Links + Tags]
Datasets
• Citeseer (Getoor et al.)
– Agents, AI, DB, HCI, IR, ML
– Words used in place of tags
• Blog data
– Derived from the WWE/Buzzmetrics dataset
– Tags associated with blogs derived from del.icio.us
– For dimensionality reduction, 100 topics derived from blog homepages using LDA (Latent Dirichlet Allocation)
• Pairwise similarity computed
– RBF kernel for Citeseer
– Cosine for blogs
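The two pairwise similarity measures named above can be written in a few lines. This is a generic sketch, not the exact configuration used in the experiments: the feature vectors are hypothetical, and the RBF bandwidth `sigma` is an assumed parameter that the slides do not specify.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))


# Hypothetical topic vectors for three blogs.
blog_a = [1.0, 2.0, 0.0]
blog_b = [2.0, 4.0, 0.0]   # same direction as blog_a
blog_c = [0.0, 0.0, 3.0]   # no topics in common with blog_a

same_direction = cosine_similarity(blog_a, blog_b)
no_overlap = cosine_similarity(blog_a, blog_c)
```

Cosine ignores vector length, which suits sparse tag or topic counts; the RBF kernel decays with distance, so its result depends on the chosen bandwidth.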
Clustering Tags and Graphs
• Clustering Only Links: Accuracy = 36%
• Clustering Links + Tags: Accuracy = 62%
Higher accuracy by adding ‘tag’ information
Varying Scaling Parameter β
• β >> 1 (Only Graph): Accuracy = 36%
• β = 0 (Only Tags): Accuracy = 39%
• β = 1 (Graphs & Tags): Accuracy = 62%
Baselines
• Simple K-means: ~23% (content only, binary)
• Content only: ~52% (Getoor et al. 2004)
Mutual Information
• Measures the dependence between two random variables.
• Compares results with ground truth
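A short sketch of mutual information between a clustering and ground-truth labels, using the standard definition (sum over label pairs of p(a,b) log(p(a,b) / (p(a) p(b))), natural log). The function name and the toy labelings are illustrative; normalized variants are not shown.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi


# Hypothetical ground truth and two candidate clusterings.
truth = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]   # same grouping, different label names
unrelated = [0, 1, 0, 1]   # independent of the ground truth

mi_match = mutual_information(truth, relabeled)
mi_unrelated = mutual_information(truth, unrelated)
```

Because the score depends only on how items are grouped, a clustering that matches the ground truth up to label names still gets the maximum value.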
Effect of Number of Tags, Clusters
[Figure: mutual information on Citeseer as the number of tags and clusters varies]
• Link only has lower MI
• More semantics helps
Similar results for real blog datasets
Tags are one type of meta-data!
Other semantic information:
• Sentiments
• Named Entities
• Readership information
• Geolocation information
• etc.
How do we get additional semantics?
Additional Semantics
• BlogVox: Sentiments and Opinions (TREC 06, IJCAI/AND 07)
• SemNews: Named Entities, beliefs, facts (AAAI SS 05, HICS 06, IJSWIS)
• Link Polarity: Sentiment from anchor text (ICWSM 07)
• Readership: Feed subscriptions and usage (ICWSM 07)
Key Observations
1. Understanding communication in social media requires identifying and modeling communities
2. Communities are a result of collective, social interactions and usage.
Feeds Readership
http://ftm.umbc.edu
Folders
• Use folder labels as topics/tags.
• Group similar folders together.
• Rank feeds under a “topic”.
ICWSM ‘07
• 83K publicly listed subscribers
• 2.8M feeds, 500K are unique
• 26K users (35%) use folders to organize subscriptions
• Data collected in May 2006
Although there may be 50M+ blogs, only a small fraction get continued user attention in the form of subscriptions.
Feed Subscription Statistics
• Communities from Feed Subscriptions
– A common vocabulary emerges from folder names
– Folder names are used as topics. Lower-ranked folders are merged into a higher-ranked folder if there is an overlap and a high cosine similarity
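One way the folder-merging rule could look in code is sketched below. Several details are assumptions, since the slides do not spell them out: folders are ranked by number of feeds, "overlap" is read as a shared word in the folder names, cosine similarity is computed over the sets of feeds, and the 0.5 threshold is arbitrary. The function names and sample data are hypothetical.

```python
import math


def feed_cosine(feeds_a, feeds_b):
    """Cosine similarity between two feed sets treated as binary vectors."""
    if not feeds_a or not feeds_b:
        return 0.0
    return len(feeds_a & feeds_b) / math.sqrt(len(feeds_a) * len(feeds_b))


def merge_folders(folders, threshold=0.5):
    """Merge lower-ranked folders into similar higher-ranked ones.

    folders: {folder name: set of feeds filed under it}
    Rank is by number of feeds (assumed); threshold is an assumed cutoff.
    """
    ranked = sorted(folders, key=lambda name: len(folders[name]), reverse=True)
    merged = {}
    for name in ranked:
        target = None
        for kept in merged:
            names_overlap = bool(set(name.split()) & set(kept.split()))
            if names_overlap and feed_cosine(folders[name], merged[kept]) >= threshold:
                target = kept
                break
        if target is None:
            merged[name] = set(folders[name])  # becomes its own topic
        else:
            merged[target] |= folders[name]    # absorbed by a higher-ranked topic
    return merged


# Hypothetical folder data.
folders = {
    "political": {"tpm", "dailykos", "eschaton", "powerline"},
    "political blogs": {"tpm", "dailykos", "wonkette"},
    "knitting": {"yarnharlot", "purlbee"},
}
topics = merge_folders(folders)
```

Here "political blogs" is absorbed into "political", while "knitting" stays a separate topic.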
Feeds That Matter
[Figure: Folder Usage. x-axis: Rank of a Folder (by number of feeds in it); y-axis: # of Users Using a Folder]
Folder names are used as topics. Lower-ranked folders are merged into a higher-ranked folder if there is an overlap and a high cosine similarity.
Tag Cloud After Merging
• Two feeds are similar if they are categorized under similar folders
Feed Recommendations
If you like X you will like…..
Feed Distillation for “Politics”
Merged folders: “political”, “political blogs”
• Talking Points Memo: by Joshua Micah Marshall
• Daily Kos: State of the Nation
• Eschaton
• The Washington Monthly
• Wonkette, Politics for People with Dirty Minds
• http://instapundit.com/
• Informed Comment
• Power Line
• AMERICAblog: Because a great nation deserves the truth
• Crooks and Liars
[Figure: feed distillation panels labeled “Tech” and “Knitting”]
Wikipedia is our collective wisdom
Twitter is our collective consciousness
Easily share status messages
Twitter post
Current Status
Friends
Microblogging
SNAKDD ‘07
Twitterment
Rank City
1 Tokyo
2 New York
3 San Francisco
4 Seattle
5 Los Angeles
6 Chicago
7 Toronto
8 Austin
9 Singapore
10 Madrid
http://twitterment.umbc.edu
• First Twitter search engine
• Uses Lucene to index the public timeline
• Provides search and analytics
• Built a social network of users
• 1.3M tweets
• 83K users
• Two months of data
Search and Trend Analytics on Microblogs
[Figure: trend plots for the terms “lunch”, “dinner”, “work”, and “coffee”]
Microblogging Trend Analytics
Clique Percolation Method (CPM)
Two nodes belong to the same community if they can be connected through adjacent k-cliques. (Palla et al.)
[Figure: Gaming Community]
Microblogging Communities
• Finds overlapping communities
• A community is the union of all k-cliques that can be reached from one another through adjacent k-cliques (example shown: 3-clique)
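A brute-force sketch of clique percolation for small graphs, written under the usual definition that two k-cliques are adjacent when they share k-1 nodes. It enumerates every k-node subset, so it is only meant to illustrate the idea and would not scale to the Twitter network; the helper names and the toy graph are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def clique_percolation(adjacency, k=3):
    """Return communities as unions of k-cliques linked by sharing k-1 nodes."""
    nodes = sorted(adjacency)
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(v in adjacency[u] for u, v in combinations(c, 2))]
    parent = list(range(len(cliques)))  # union-find over cliques

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return list(communities.values())


# Hypothetical graph: triangles (0,1,2) and (1,2,3) share an edge;
# triangle (3,4,5) touches them only at node 3.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
communities = clique_percolation(build_adjacency(edges), k=3)
```

Node 3 ends up in both communities, which is how the method produces overlapping communities.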
SNAKDD ‘07
INFORMATION HUB
Information Source: Communities connected via Robert Scoble, an A-list blogger
INFORMATION BRIDGE
Information Source, Information Seeker: Different roles in different communities
STAR NETWORKS / SMALL CLIQUES
Friendship-relation: Small groups among friends/co-workers
“It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, content.”
Observations
1. Understanding communication in social media requires identifying and modeling communities
2. Communities are a result of collective, social interactions and usage.
Thesis Statement
Future Work
• Social media content is challenging; many improvements are needed in textual analysis, sentiment detection, named entity detection and language understanding in such systems.
• Temporal analysis of community structures
• Feed distillation and ranking in blog search
• Index quality vs. index freshness
• User intention and personalization
Conclusions
• Demonstrated a fast community detection algorithm well suited for social datasets.
• Implemented SimCUT, a technique that outperforms simple graph-based approaches for community detection.
• Evaluated and tested proposed algorithms on real social media datasets and benchmark datasets.
• Conducted the first comprehensive study of feed readership and microblogging usage.
• Built systems, infrastructure and datasets for the social media research community.
Conclusions
• We have presented a framework for analyzing social media content and structure that makes use of certain special properties and features of such systems.
• We study the Social Web from a user perspective and analyze not just how people are using these systems but also why.
• Social Media is connecting people and building communities by bridging the gap between content production and consumption.
Thanks!
The Future….
• Location
– Social, mobile applications
– Geographically relevant, query(less) search
• Social Advertising and Personalization
– Role of influence and communities in advertising
• Real-Time, Social Information Streams
– Event detection / Breaking News
– How effective is the advertising?
• Social Web to solve challenging AI problems
– Just as tagging has helped image search
– Availability of social tools and Wikipedia provides opportunities to work on difficult AI problems like disambiguation and common sense reasoning.
http://ebiquity.umbc.edu
http://socialmedia.typepad.com