cse509 lecture 5
Upload: web-science-research-group-at-institute-of-business-administration-karachi-pakistan
Post on 11-May-2015
Lecture 5 of CSE509: Web Science and Technology Summer Course
Arjumand Younus
Web Science Research Group
Institute of Business Administration (IBA)
CSE509: Introduction to Web Science and Technology
Lecture 5: Social Network Analysis
Last Time…
• Web Data Explosion
• Part I
  • MapReduce Basics
  • MapReduce Example and Details
  • MapReduce Case-Study: Web Crawler based on MapReduce Architecture
• Part II
  • Large-Scale File Systems
  • Google File System Case-Study
August 06, 2011
Today
• Transition from Web 1.0 to Web 2.0
• Social Media Characteristics
• Part I: Theoretical Aspects
  • Social Networks as a Graph
  • Properties of Social Networks
• Part II: Getting Hands-On Experience on Social Media Analytics
  • Twitter Data Hacks
• Part III: Example Researches
Quick Survey
• Do you have a Facebook, MySpace, Twitter, or LinkedIn account?
• Do you own a blog?
• Do you read blogs?
• Have you ever searched for something on Wikipedia?
• Have you ever submitted content to a social network?
Web 1.0 vs. Web 2.0
Borrowed from SIGKDD 2008 tutorial slides of Professor Huan Liu and Professor Nitin Agarwal with permission
What is so Different about Web 2.0?
• User-Generated Content
• Collaborative Environment: Participatory Web, Citizen Journalism
• User is the Driving Factor
A Paradigm Shift rather than a Technology Shift
Top 20 Most Visited Web Sites
Internet traffic report by Alexa on July 29th 2008
Rank  Site               Rank  Site
1     Yahoo!             11    Orkut
2     Google             12    RapidShare
3     YouTube            13    Baidu.com
4     Windows Live       14    Microsoft Corporation
5     Microsoft Network  15    Google India
6     Myspace            16    Google Germany
7     Wikipedia          17    QQ.Com
8     Facebook           18    EBay
9     Blogger            19    Hi5
10    Yahoo! Japan       20    Google France
Borrowed from SIGKDD 2008 tutorial slides of Professor Huan Liu and Professor Nitin Agarwal with permission
Characteristics of Social Media
• “Consumers” become “Producers”
• Rich User Interaction
• User-Generated Contents
• Collaborative Environment
• Collective Wisdom
• Long Tail
Broadcast Media: Filter, then Publish
Social Media: Publish, then Filter
PART I: Theoretical Aspects
Networks and Representation
Social Network: A social structure made of nodes (individuals or organizations) and edges that connect nodes in various relationships like friendship, kinship etc.
[Figure: the same network shown in Graph Representation and in Matrix Representation]
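As a rough illustration of the two representations named on this slide, the sketch below stores one small network both as an adjacency list (graph form) and as an adjacency matrix. The four people and their friendships are hypothetical and are not taken from the slide's figure.

```python
# Hypothetical friendship network: four people, undirected edges.
nodes = ["Ali", "Bina", "Chand", "Dua"]
edges = [("Ali", "Bina"), ("Ali", "Chand"), ("Bina", "Chand"), ("Chand", "Dua")]

# Graph (adjacency-list) representation: node -> set of neighbours.
adjacency = {n: set() for n in nodes}
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

# Matrix representation: matrix[i][j] is 1 when nodes i and j are connected.
index = {n: i for i, n in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for u, v in edges:
    matrix[index[u]][index[v]] = 1
    matrix[index[v]][index[u]] = 1

for name, row in zip(nodes, matrix):
    print(name, row)
```

The matrix is symmetric because friendship is treated as undirected here; a directed relationship such as "follows" would fill only one of the two cells.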
Properties of Large-Scale Networks
• Networks in social media are typically huge, involving millions of actors and connections
• Large-scale networks in the real world demonstrate similar patterns
  • Scale-free Distributions
  • Small-world Effect
  • Strong Community Structure
Scale-Free Distributions
Degree distribution in large-scale networks often follows a power law.
A.k.a. long tail distribution, scale-free distribution
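A power law here means that the fraction of nodes with degree k falls off roughly as k raised to a negative constant. The sketch below only shows how a degree distribution can be tabulated from an edge list; the edge list is hypothetical and far too small to exhibit a power law on its own.

```python
from collections import Counter

# Hypothetical undirected edge list with one well-connected "hub".
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]

# Degree of each node = number of edges touching it.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# degree_distribution[k] = number of nodes that have degree k.
degree_distribution = Counter(degree.values())
for k in sorted(degree_distribution):
    print("degree", k, "->", degree_distribution[k], "node(s)")
```

On real social network data, plotting these counts on log-log axes is the usual way to check whether the long tail is present.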
[Figure: plot with axes labelled Degrees and Nodes]
Small-World Effect
“Six Degrees of Separation”
• A famous experiment conducted by Travers and Milgram (1969)
  • Subjects were asked to send a chain letter to an acquaintance in order to reach a target person
  • The average path length is around 5.5
• Verified on a planetary-scale IM network of 180 million users (Leskovec and Horvitz 2008)
  • The average path length is 6.6
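The average path length quoted above can be computed on any connected graph with breadth-first search. The following is a minimal sketch on a hypothetical four-person chain; the function names are illustrative and not from the lecture.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Return hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def average_path_length(adjacency):
    """Mean shortest-path length over all ordered pairs of distinct, connected nodes."""
    total, pairs = 0, 0
    for source in adjacency:
        for target, d in bfs_distances(adjacency, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

# Hypothetical chain of acquaintances: a - b - c - d
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(average_path_length(chain))
```

Running one BFS per node is fine for small graphs; networks with millions of nodes typically need sampling or approximation instead.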
Small World Facebook Experiment by Yahoo! Labs
Anyone in the world can get a message to anyone else in just "six degrees of separation" by passing it from friend to friend. Sociologists have tried to prove (or disprove) this claim for decades, but it is still unresolved.
http://smallworld.sandbox.yahoo.com/
Community Structure
• Community: people in a group interact with each other more frequently than with those outside the group
• Friends of a friend are likely to be friends as well
• Measured by clustering coefficient: density of connections among one’s friends
  • Ci = ki / (di*(di-1)/2)
  • di = number of neighbors of node Ni
  • ki = number of edges among node Ni’s neighbors
Clustering Coefficient
• d6 = 4, N6 = {4, 5, 7, 8}
• k6 = 4, as e(4,5), e(5,7), e(5,8), e(7,8)
• C6 = 4/(4*3/2) = 2/3
• Average clustering coefficient
C = (C1 + C2 + … + Cn)/n
• C = 0.61 for the left network
• In a random graph, the expected coefficient is 14/(9*8/2) = 0.39.
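The calculation for node 6 can be reproduced with a few lines of code. This sketch uses only the edges named on the slide (node 6's four links and the four links among its neighbours); the rest of the example network is not reproduced, so only C6 is meaningful here.

```python
from itertools import combinations

# Edges named on the slide: node 6's links, then links among its neighbours.
edges = [(6, 4), (6, 5), (6, 7), (6, 8), (4, 5), (5, 7), (5, 8), (7, 8)]

adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

def clustering_coefficient(adjacency, node):
    """Fraction of possible edges among a node's neighbours that actually exist."""
    neighbours = adjacency[node]
    d = len(neighbours)
    if d < 2:
        return 0.0
    # k = number of neighbour pairs that are themselves connected
    k = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
    return k / (d * (d - 1) / 2)

c6 = clustering_coefficient(adjacency, 6)
print(c6)
```

Averaging this value over every node of a complete network gives the average clustering coefficient C.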
Challenges
• Scalability
  • Social networks are often on a scale of millions of nodes and connections
  • Traditional network analysis often deals with at most hundreds of subjects
• Heterogeneity
  • Various types of entities and interactions are involved
• Evolution
  • Timelines are emphasized in social media
• Collective Intelligence
  • How to utilize the wisdom of crowds in the form of tags, wikis, reviews
• Evaluation
  • Lack of ground truth and of complete information due to privacy
Social Computing Tasks
• Social Computing: a young and vibrant field
• Conferences: KDD, WSDM, WWW, ICML, AAAI/IJCAI, SocialCom, etc.
• Tasks
  • Centrality Analysis and Influence Modeling
  • Community Detection
  • Classification and Recommendation
  • Privacy, Spam and Security
Centrality Analysis and Influence Modeling
• Centrality Analysis: identify the most important actors or edges
  • E.g. PageRank in Google
  • Various other criteria
• Influence Modeling
  • How is information diffused?
  • How do people influence one another?
• Related Problems
  • Viral marketing: word-of-mouth effect
  • Influence maximization
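To make the PageRank example concrete, here is a minimal power-iteration sketch. It is a textbook-style simplification, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the tiny link graph are all assumptions chosen for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: node -> list of nodes it links to.

    Every link target must also appear as a key of out_links.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical link graph: "a", "b" and "d" all point at "c".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
most_important = max(ranks, key=ranks.get)
print(most_important, ranks)
```

The same idea carries over to social graphs if "links to" is read as "follows" or "retweets", although which relation best captures importance is a modelling choice.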
Community Detection
• A community is a set of nodes between which the interactions are (relatively) frequent
  • A.k.a. group, cluster, cohesive subgroup, module
• Applications: recommendation based on communities, network compression, visualization of a huge network
• New lines of research in social media
  • Community Detection in Heterogeneous Networks
  • Community Evolution in Dynamic Networks
  • Scalable Community Detection in Large-Scale Networks
Classification and Recommendation
• Common in social media applications
  • Tag suggestion, Product/Friend/Group Recommendation
• Link prediction
• Network-Based Classification
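One simple way to approach link prediction for friend recommendation is to score unconnected pairs by how many friends they share. The slides do not name a specific method, so the common-neighbours heuristic below is an assumption, and the friendship graph is hypothetical.

```python
from itertools import combinations

# Hypothetical friendship graph: person -> set of friends.
friends = {
    "ali": {"bina", "chand"},
    "bina": {"ali", "chand", "dua"},
    "chand": {"ali", "bina", "dua"},
    "dua": {"bina", "chand"},
    "eman": set(),
}

def common_neighbour_scores(friends):
    """Score each not-yet-connected pair by its number of shared friends."""
    scores = {}
    for u, v in combinations(sorted(friends), 2):
        if v in friends[u]:
            continue  # already friends, nothing to predict
        shared = len(friends[u] & friends[v])
        if shared:
            scores[(u, v)] = shared
    return scores

scores = common_neighbour_scores(friends)
print(scores)
```

Pairs with the highest scores would be suggested first; a person with no friends yet, like "eman" here, gets no suggestions from this heuristic.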
Privacy, Spam and Security
• Privacy is a big concern in social media
  • Facebook and Google Buzz often appear in debates about privacy
  • The Netflix Prize sequel was cancelled due to privacy concerns
  • Simple anonymization does not necessarily protect privacy
• Spam blogs (splogs), spam comments, fake identities, etc. all require new techniques
• As private information is involved, a secure and trustworthy system is critical
• Need to achieve a balance between sharing and privacy
PART II: Practical SNA with Twittersphere Mining
Pre-Requisites
• Python is expected to be installed, and you should have some hands-on experience with it
• Dependencies: networkx, twitter (Twitter API for Python)
• For Windows users: install ActivePython (comes bundled with easy_install), then run
    easy_install networkx
    easy_install twitter
• For Linux users:
    sh setuptools-0.6c11-py2.6.egg
    sudo easy_install networkx
    sudo easy_install twitter
Getting Tweets from Twitter Search API
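import json
import twitter

# Search API endpoint as used in the lecture (2011)
twitter_search = twitter.Twitter(domain="search.twitter.com")
search_results = []
# Fetch 5 pages of up to 100 results each for the query "pakistan"
for page in range(1, 6):
    search_results.append(twitter_search.search(q="pakistan", rpp=100, page=page))
print(json.dumps(search_results, sort_keys=True, indent=1))
# Collect the text of every tweet across all pages
tweets = [r['text'] for result in search_results for r in result['results']]
print(tweets)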
Lexical Diversity for Tweets
# Split every tweet on whitespace and pool the tokens
words = []
for t in tweets:
    words += t.split()
# Ratio of unique tokens to total tokens
lexical_diversity = 1.0 * len(set(words)) / len(words)
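To try the measure without calling the Twitter API, the same computation can be wrapped in a function and run on a couple of made-up tweets. The function name and the sample tweets are hypothetical.

```python
def lexical_diversity(tweets):
    """Unique tokens divided by total tokens, using whitespace tokenisation."""
    words = [w for t in tweets for w in t.split()]
    return len(set(words)) / float(len(words)) if words else 0.0

# Hypothetical tweets: six tokens in total, four of them distinct.
sample_tweets = ["floods hit sindh", "floods hit punjab"]
diversity = lexical_diversity(sample_tweets)
print(diversity)
```

A value near 1 means almost every token is new; a low value suggests heavy repetition, which is common when a search result is dominated by retweets.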
What People are Tweeting: Frequency Analysis
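import nltk
freq_dist = nltk.FreqDist(words)
# 50 most frequent tokens, as (token, count) pairs
freq_dist.most_common(50)
# 50 least frequent tokens, as (token, count) pairs
freq_dist.most_common()[-50:]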
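The same kind of frequency analysis can be sketched with only the standard library, which may help when NLTK is not installed. This is an alternative to the lecture's NLTK code rather than a copy of it, and the sample tweets are hypothetical.

```python
from collections import Counter

# Hypothetical tweets standing in for real search results.
sample_tweets = [
    "RT @ali floods in sindh",
    "floods relief in karachi",
    "karachi floods",
]

# Lower-case and split on whitespace, then count each token.
words = [w.lower() for t in sample_tweets for w in t.split()]
freq = Counter(words)

top_word = freq.most_common(1)
print(top_word)
```

In practice the most frequent tokens tend to be stop words and "RT", so filtering those out is usually the next step before reading anything into the counts.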
Extracting Relationships from Tweets (1/3)
Step 1: Extracting Graph Data
import re
import networkx as nx
import twitter

g = nx.DiGraph()
twitter_search = twitter.Twitter(domain="search.twitter.com")
search_results = []
for page in range(1, 6):
    search_results.append(twitter_search.search(q="pakistan", rpp=100, page=page))
all_tweets = [tweet for page in search_results for tweet in page["results"]]

# Matches "RT @user" and "via @user"; the second group holds the user name(s)
rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

def get_rt_sources(tweet):
    return [match[1].strip() for match in rt_patterns.findall(tweet)]

for tweet in all_tweets:
    rt_sources = get_rt_sources(tweet["text"])
    if not rt_sources:
        continue
    # Edge direction: original author -> user who retweeted
    for rt_source in rt_sources:
        g.add_edge(rt_source, tweet["from_user"], tweet_id=tweet["id"])
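The retweet-extraction step can be tried offline on hand-written tweets. The sketch below reuses the regular expression from the slide but stores edges in a plain list instead of a networkx graph; the sample tweets and user names are hypothetical, and only the `id`, `from_user` and `text` fields used on the slide are modelled.

```python
import re

# Same pattern as on the slide: "RT" or "via" followed by one or more @names.
RT_PATTERN = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

def get_rt_sources(text):
    """Return the @names that a tweet credits via RT/via."""
    return [match[1].strip() for match in RT_PATTERN.findall(text)]

# Hypothetical search results.
sample_tweets = [
    {"id": 1, "from_user": "bina", "text": "RT @ali: Flood relief camps open"},
    {"id": 2, "from_user": "chand", "text": "Relief update via @ali"},
    {"id": 3, "from_user": "dua", "text": "Karachi is calm today"},
]

# Each edge is (original author, retweeter, tweet id).
edges = []
for tweet in sample_tweets:
    for source in get_rt_sources(tweet["text"]):
        edges.append((source, tweet["from_user"], tweet["id"]))

print(edges)
```

Because the pattern is case-insensitive and not anchored to word starts, it can also fire on words that merely end in "rt" before a mention, so real data usually needs some manual spot-checking.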
Extracting Relationships from Tweets (2/3)
Step 2: Generating DOT File
import io
OUT = "pakistan_search_results.dot"
# One DOT edge per retweet relationship, labelled with the tweet id
dot = ['"%s" -> "%s" [tweet_id=%s]' % (n1, n2, g[n1][n2]['tweet_id']) for n1, n2 in g.edges()]
# Write as UTF-8 so that non-ASCII user names survive
with io.open(OUT, 'w', encoding='utf-8') as f:
    f.write(u'strict digraph {\n%s\n}' % (';\n'.join(dot),))
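To see what the generated DOT text looks like, the formatting step can be isolated into a small function and run on a fixed edge list. The helper name `to_dot` and the two edges are hypothetical; the output format follows the slide's template.

```python
def to_dot(edges):
    """Format (source, target, tweet_id) triples as a strict DOT digraph."""
    lines = ['"%s" -> "%s" [tweet_id=%s]' % (src, dst, tweet_id)
             for src, dst, tweet_id in edges]
    return 'strict digraph {\n%s\n}' % ';\n'.join(lines)

# Hypothetical retweet edges: (original author, retweeter, tweet id).
edges = [("@ali", "bina", 1), ("@ali", "chand", 2)]
dot_text = to_dot(edges)
print(dot_text)
```

Quoting the node names matters because Twitter handles start with "@", which is not valid in an unquoted DOT identifier.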
Extracting Relationships from Tweets (3/3)
Step 3: Visualizing the Retweet Data in Graphical Form
• For Windows users
• For Linux users:
    circo -Tpng -Osnl_search_results pakistan_search_results.dot
PART III: Example Researches
Million Follower Fallacy (New York Times)
Twitter: More a News Medium than a Social Network (PC World)
Twitter for World Peace (Business Week)
SocialFlow: Social Media Optimization
• Social Media Optimization Platform
• Works in Domains of Viral and Word-of-Mouth Marketing
• Provides Services to Major Media Outlets
• Recent study
  • How different audiences consumed and rebroadcast messages that news organizations were sending out: AlJazeera English, BBC News, CNN, The Economist, Fox News and New York Times
Twitter as a Real-Time News Analysis Service
Studying Ins and Outs of News
Using Twitter to study hot news items people are heavily tweeting about
Algorithm for Identification of Popular News
1. Crawl daily news data and send the unranked news articles list to the UI module of the system.
2. Extract news title, news summary and news text for each news article per day.
3. From the news summary, extract named entities per news article through a named entity recognition approach.
4. Match each named entity across entities in the common entity corpus and tag each named entity per news article as common or uncommon.
5. Use the Boolean query model to compose the query per news article:
   a. Use the AND predicate with each common entity and the OR predicate with each uncommon entity. If all entities for a particular news article are common, use the news title for the query construction.
6. For all articles per day:
   a. Send a request to the Twitter Search API and extract result tweets per news article per day.
   b. For each t in result tweets:
      i. Use t’s metadata to find following and follower statistics for each unique Twitterer who has tweeted about the news article.
      ii. Calculate the rank of each article using the ranking function.
7. Send the ranked news articles list to the UI module of the system.
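Step 5 of the algorithm above can be sketched as follows. The slide does not say exactly how the AND and OR parts are combined, so this is one possible reading: common entities are AND-ed together and then AND-ed with an OR-group of the uncommon entities. The function name, the sample entities and the common-entity corpus are all hypothetical, and the ranking function of step 6 is not shown because the slide does not define it.

```python
def compose_query(title, entities, common_entities):
    """Build a Boolean search query for one news article (one reading of step 5)."""
    common = [e for e in entities if e in common_entities]
    uncommon = [e for e in entities if e not in common_entities]
    if not uncommon:
        # All entities are common (or none were found): fall back to the title.
        return title
    parts = []
    if common:
        parts.append(" AND ".join(common))
    parts.append("(" + " OR ".join(uncommon) + ")")
    return " AND ".join(parts)

# Hypothetical corpus of entities that occur in most articles.
common_corpus = {"Pakistan", "Karachi"}

query = compose_query(
    "SCBA presidential election",
    ["Karachi", "SCBA", "Asma Jahangir"],
    common_corpus,
)
print(query)
```

The fallback to the title reflects the slide's rule for articles whose entities are all common, since a query made only of very frequent names would match far too many tweets.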
Application Prototype
Observations (1/3)
The percentage of news in tweets per day is greater than 50% for all days except one
Observations (2/3)
Highest Number of Recorded Tweets per Day
Observations (3/3)
DATE       EVENT                                                    SCOPE
17th Oct.  Karachi violence                                         local
18th Oct.  Hopes fade for trapped Chinese miners                    international
19th Oct.  Lakki Marwat suicide attack attempt foiled               local
20th Oct.  Rebels raid parliament in Grozny; 7 dead                 international
21st Oct.  Obama to visit next year                                 local
22nd Oct.  Nuclear plant completes decade of good performance       local
23rd Oct.  Ghazi shrine suicide bomber back home                    local
24th Oct.  WikiLeaks makes fresh claim about Iraq deaths            international
25th Oct.  Court orders Iraqi parliament back to work               international
26th Oct.  SCBA presidential election                               local