cse509 lecture 5

Arjumand Younus

Web Science Research Group

Institute of Business Administration (IBA)

CSE509: Introduction to Web Science

and Technology

Lecture 5: Social Network Analysis

2

Last Time…

Web Data Explosion

Part I MapReduce Basics MapReduce Example and Details MapReduce Case-Study: Web Crawler based on MapReduce Architecture

Part II Large-Scale File Systems

Google File System Case-Study

August 06, 2011

3

Today

Transition from Web 1.0 to Web 2.0 Social Media Characteristics

Part I: Theoretical Aspects Social Networks as a Graph Properties of Social Networks

Part II: Getting Hands-On Experience on Social Media Analytics Twitter Data Hacks

Part III: Example Researches

August 06, 2011

4

Quick Survey

Do you have a Facebook, MySpace, Twitter, or LinkedIn account?

Do you own a blog? Do you read blogs? Have you ever searched for something on Wikipedia? Have you ever submitted content to a social network?

August 06, 2011

5

Web 1.0 vs. Web 2.0

August 06, 2011

Borrowed from SIGKDD 2008 tutorial slides of Professor Huan Liu and Professor Nitin Agarwal with permission

6

What is so Different about Web 2.0?

User Generated Content Collaborative Environment: Participatory Web, Citizen

Journalism User is the Driving Factor

August 06, 2011

A Paradigm Shift rather than a Technology Shift

7

Top 20 Most Visited Web Sites

Internet traffic report by Alexa on July 29th 2008

August 06, 2011

1 Yahoo! 11 Orkut

2 Google 12 RapidShare

3 YouTube 13 Baidu.com

4 Windows Live 14 Microsoft Corporation

5 Microsoft Network 15 Google India

6 Myspace 16 Google Germany

7 Wikipedia 17 QQ.Com

8 Facebook 18 EBay

9 Blogger 19 Hi5

10 Yahoo! Japan 20 Google France

Borrowed from SIGKDD 2008 tutorial slides of Professor Huan Liu and Professor Nitin Agarwal with permission

9

Characteristics of Social Media

“Consumers” become “Producers” Rich User Interaction User-Generated Contents Collaborative environment Collective Wisdom Long Tail

Broadcast MediaFilter, then Publish

Social MediaPublish, then Filter

August 06, 2011

10August 06, 2011

11

PART I: Theoretical Aspects

August 06, 2011

12

Networks and Representation

Social Network: A social structure made of nodes (individuals or organizations) and edges that connect nodes in various relationships like friendship, kinship etc.

August 06, 2011

Graph Representation Matrix Representation

13

Properties of Large-Scale Networks

Networks in social media are typically huge, involving millions of actors and connections

Large-scale networks in real world demonstrate similar patterns Scale-free Distributions Small-world Effect Strong Community Structure

August 06, 2011

14

Scale-Free Distributions

Degree distribution in large-scale networks often follows a power law.

A.k.a. long tail distribution, scale-free distribution

August 06, 2011

Degrees

Nodes

15

Small-World Effect

“Six Degrees of Separation”

A famous experiment conducted by Travers and Milgram (1969) Subjects were asked to send a chain letter to his acquaintance in order to

reach a target person The average path length is around 5.5

Verified on a planetary-scale IM network of 180 million users (Leskovec and Horvitz 2008) The average path length is 6.6

August 06, 2011

16

Small World Facebook Experiment by Yahoo! Labs

Anyone in the world can get a message to anyone else in just "six degrees of separation" by passing it from friend to friend. Sociologists have tried to prove (or disprove) this claim for decades, but it is still unresolved.

http://smallworld.sandbox.yahoo.com/

August 06, 2011



17

Community Structure

Community: People in a group interact with each other more frequently than those outside the group

ki = number of edges among node Ni’s neighbors Friends of a friend are likely to be friends as well Measured by clustering coefficient:

Density of connections among one’s friends

August 06, 2011

18

Clustering Coefficient

August 06, 2011

• d6=4, N6= {4, 5, 7,8}

• k6=4 as e(4,5), e(5,7), e(5,8), e(7,8)

• C6 = 4/(4*3/2) = 2/3

• Average clustering coefficient

C = (C1 + C2 + … + Cn)/n

• C = 0.61 for the left network

• In a random graph, the expected coefficient is 14/(9*8/2) = 0.19.

19

Challenges

Scalability Social networks are often in a scale of millions of nodes and connections Traditional network analysis often deals with at most hundreds of subjects

Heterogeneity Various types of entities and interactions are involved

Evolution Timelines are emphasized in social media

Collective Intelligence How to utilize wisdom of crowds in forms of tags, wikis, reviews

Evaluation Lack of ground truth, and complete information due to privacy

August 06, 2011

20

Social Computing Tasks

Social Computing: a young and vibrant field Conferences: KDD, WSDM, WWW, ICML, AAAI/IJCAI, SocialCom,

etc. Tasks

Centrality Analysis and Influence Modeling Community Detection Classification and Recommendation Privacy, Spam and Security

August 06, 2011

21

Centrality Analysis and Influence Modeling

Centrality Analysis: Identify the most important actors or edges

E.g. PageRank in Google

Various other criteria

Influence modeling: How is information diffused? How does one influence each other?

Related Problems Viral marketing: word-of-mouth effect Influence maximization

August 06, 2011

22

Community Detection

A community is a set of nodes between which the interactions are (relatively) frequent A.k.a., group, cluster, cohesive subgroups, modules

Applications: Recommendation based communities, Network Compression, Visualization of a huge network

New lines of research in social media Community Detection in Heterogeneous Networks Community Evolution in Dynamic Networks Scalable Community Detection in Large-Scale Networks

August 06, 2011

23

Classification and Recommendation

Common in social media applications Tag suggestion, Product/Friend/Group Recommendation

August 06, 2011

Link prediction

Network-Based Classification

24

Privacy, Spam and Security

Privacy is a big concern in social media Facebook, Google buzz often appear in debates about privacy NetFlix Prize Sequel cancelled due to privacy concern Simple anonymization does not necessarily protect privacy

Spam blog (splog), spam comments, fake identity, etc., all requires new techniques

As private information is involved, a secure and trustable system is critical

Need to achieve a balance between sharing and privacy

August 06, 2011

25

PART II: Practical SNA with Twittersphere Mining

August 06, 2011

26

Pre-Requisites

Expectation that Python is installed and you have some hands-on experience with it

Dependencies easy_install networkx twitter (Twitter API for Python)

For Windows users Install ActivePython: comes bundled with easy_install easy_install networkx easy_install twitter

For Linux users sh setuptools-0.6c11-py2.6.egg sudo easy_install networkx sudo easy_install twitter

August 06, 2011

27

Getting Tweets from Twitter Search API

import twitter

import json

twitter_search=twitter.Twitter(domain="search.twitter.com")

search_results=[]

for page in range(1,6):

search_results.append(twitter_search.search(q="pakistan",rpp=100,page=page))

print json.dumps(search_results, sort_keys=True, indent=1)

tweets=[r['text'] for result in search_results for r in result['results']]

print tweets

August 06, 2011

28

Lexical Diversity for Tweets

words=[]

for t in tweets:

words+= [w for w in t.split()]

lexical_diversity=1.0*len(set(words))/len(words)

August 06, 2011

29

What People are Tweeting: Frequency Analysis

freq_dist=nltk.FreqDist(words)

freq_dist.keys()[:50]

freq_dist.keys()[-50:]

August 06, 2011

30

Extracting Relationships from Tweets (1/3)

Step 1: Extracting Graph Dataimport networkx as nx

import re

g=nx.DiGraph()

twitter_search=twitter.Twitter(domain="search.twitter.com")

search_results=[]

for page in range(1,6):search_results.append(twitter_search.search(q="pakistan",rpp=100,page=page))

all_tweets=[tweet for page in search_results for tweet in page["results"]]

def get_rt_sources(tweet):rt_patterns=re.compile(r"(RT|via)((?:\b\W*@\w+)+)",re.IGNORECASE)

return [source.strip() for tuple in rt_patterns.findall(tweet) for source in tuple if source not in ("RT", "via")]

for tweet in all_tweets:rt_sources=get_rt_sources(tweet["text"])

if not rt_sources:continue

for rt_source in rt_sources:g.add_edge(rt_source,tweet["from_user"],{"tweet_id":tweet["id"]})

August 06, 2011

31


Step 2: Generating DOT FileOUT = "pakistan_search_results.dot“

dot=['"%s" -> "%s" [tweet_id=%s]' % (n1.encode('utf-8'), n2.encode('utf-8'), g[n1][n2]['tweet_id']) for n1, n2 in g.edges()]

f=open(OUT, 'w')

f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),))

f.close()

August 06, 2011

32


Step 3: Visualizing the Retweet Data in Graphical Form For Windows users

For Linux users circo -Tpng -Osnl_search_results pakistan_search_results.dot

August 06, 2011

33

PART III: Example Researches

August 06, 2011

34

Million Follower Fallacy (New York Times)

August 06, 2011

35

Twitter: More a News Medium than a Social Network (PC World)

August 06, 2011

36

Twitter for World Peace (Business Week)

August 06, 2011

37

SocialFlow: Social Media Optimiza-tion

Social Media Optimization Platform Works in Domains of Viral and Word-of-Mouth Marketing Provides Services to Major Media Outlets Recent study

How different audiences consumed and rebroadcast messages news organizations were sending out: AlJazeera English, BBC News, CNN, The Economist, Fox News and New York Times

August 06, 2011

38August 06, 2011

Twitter as a Real-Time News Analysis

Service

39

Studying Ins and Outs of News

Using Twitter to study hot news items people are heavily tweeting about

August 06, 2011

40

Algorithm for Identification of Popu-lar News

August 06, 2011

1. Crawl daily news data and send the unranked news articles list to the UI module of the system.

2. Extract news title, news summary and news text for each news article per day. 3. From the news summary extract named entities per news article through a named

entity recognition approach.4. Match each named entity across entities in the common entity corpus and tag each

named entity per news article as common or uncommon.5. Use Boolean query model to compose the query per news article:

a. Use AND predicate with each common entity and OR predicate with each uncommon entity. If all entities for a particular news article are common use news title for the query construction.

6. For all articles per day:a. Send a request to Twitter Search API and extract result tweets per news article

per dayb. For each t in result tweet:

i. Use t’s metadata to find following and follower statistics for each unique Twitterer who has tweeted about the news article

ii. Calculate rank of each article using the ranking function 7. Send the ranked news articles list to the UI module of the system

41

Application Prototype

August 06, 2011

42

Observations (1/3)

August 06, 2011

Percentage of news in tweets per day greater than 50% for all days except one

day

43

Observations (2/3)

August 06, 2011

Highest Number of Recorded Tweets per Day

44

Observations (3/3)

August 06, 2011

DATE EVENT

17th Oct. Karachi violence (local)

18th Oct. Hopes fade for trapped Chinese miners (international)

19th Oct. Lakki Marwat suicide attack attempt foiled (local)

20th Oct. Rebels raid parliament in Grozny; 7 dead (international)

21st Oct. Obama to visit next year (local)

22nd Oct. Nuclear plant completes decade of good performance (local)

23rd Oct. Ghazi shrine suicide bomber back home (local)

24th Oct. WikiLeaks makes fresh claim about Iraq deaths (international)

25th Oct. Court orders Iraqi parliament back to work (international)

26th Oct. SCBA presidential election (local)

cse509 lecture 5

Technology

web crawler

participatory web

social structure

professor nitin agarwal

representationsocial

example researchesaugust

nodes individuals

linkedin account