cse509 lecture 5

43
Arjumand Younus Web Science Research Group Institute of Business Administration (IBA) CSE509: Introduction to Web Science and Technology Lecture 5: Social Network Analysis

Category:

Technology


1 download

DESCRIPTION

Lecture 5 of CSE509:Web Science and Technology Summer Course

TRANSCRIPT

Page 1: CSE509 Lecture 5

Arjumand Younus

Web Science Research Group

Institute of Business Administration (IBA)

CSE509: Introduction to Web Science

and Technology

Lecture 5: Social Network Analysis

Page 2: CSE509 Lecture 5

2

Last Time…

Web Data Explosion

Part I MapReduce Basics MapReduce Example and Details MapReduce Case-Study: Web Crawler based on MapReduce Architecture

Part II Large-Scale File Systems

Google File System Case-Study

August 06, 2011

Page 3: CSE509 Lecture 5

3

Today

Transition from Web 1.0 to Web 2.0 Social Media Characteristics

Part I: Theoretical Aspects Social Networks as a Graph Properties of Social Networks

Part II: Getting Hands-On Experience on Social Media Analytics Twitter Data Hacks

Part III: Example Researches

August 06, 2011

Page 4: CSE509 Lecture 5

4

Quick Survey

Do you have a Facebook, MySpace, Twitter, or LinkedIn account?

Do you own a blog? Do you read blogs? Have you ever searched for something on Wikipedia? Have you ever submitted content to a social network?

August 06, 2011

Page 5: CSE509 Lecture 5

5

Web 1.0 vs. Web 2.0

August 06, 2011

Borrowed from SIGKDD 2008 tutorial slides of Professor Huan Liu and Professor Nitin Agarwal with permission

Page 6: CSE509 Lecture 5

6

What is so Different about Web 2.0?

User Generated Content Collaborative Environment: Participatory Web, Citizen

Journalism User is the Driving Factor

August 06, 2011

A Paradigm Shift rather than a Technology Shift

Page 7: CSE509 Lecture 5

7

Top 20 Most Visited Web Sites

Internet traffic report by Alexa on July 29th 2008

August 06, 2011

1 Yahoo! 11 Orkut

2 Google 12 RapidShare

3 YouTube 13 Baidu.com

4 Windows Live 14 Microsoft Corporation

5 Microsoft Network 15 Google India

6 Myspace 16 Google Germany

7 Wikipedia 17 QQ.Com

8 Facebook 18 EBay

9 Blogger 19 Hi5

10 Yahoo! Japan 20 Google France

Borrowed from SIGKDD 2008 tutorial slides of Professor Huan Liu and Professor Nitin Agarwal with permission

Page 8: CSE509 Lecture 5

9

Characteristics of Social Media

“Consumers” become “Producers” Rich User Interaction User-Generated Contents Collaborative environment Collective Wisdom Long Tail

Broadcast MediaFilter, then Publish

Social MediaPublish, then Filter

August 06, 2011

Page 9: CSE509 Lecture 5

10August 06, 2011

Page 10: CSE509 Lecture 5

11

PART I: Theoretical Aspects

August 06, 2011

Page 11: CSE509 Lecture 5

12

Networks and Representation

Social Network: A social structure made of nodes (individuals or organizations) and edges that connect nodes in various relationships like friendship, kinship etc.

August 06, 2011

Graph Representation Matrix Representation

Page 12: CSE509 Lecture 5

13

Properties of Large-Scale Networks

Networks in social media are typically huge, involving millions of actors and connections

Large-scale networks in real world demonstrate similar patterns Scale-free Distributions Small-world Effect Strong Community Structure

August 06, 2011

Page 13: CSE509 Lecture 5

14

Scale-Free Distributions

Degree distribution in large-scale networks often follows a power law.

A.k.a. long tail distribution, scale-free distribution

August 06, 2011

Degrees

Nodes

Page 14: CSE509 Lecture 5

15

Small-World Effect

“Six Degrees of Separation”

A famous experiment conducted by Travers and Milgram (1969) Subjects were asked to send a chain letter to his acquaintance in order to

reach a target person The average path length is around 5.5

Verified on a planetary-scale IM network of 180 million users (Leskovec and Horvitz 2008) The average path length is 6.6

August 06, 2011

Page 15: CSE509 Lecture 5

16

Small World Facebook Experiment by Yahoo! Labs

Anyone in the world can get a message to anyone else in just "six degrees of separation" by passing it from friend to friend. Sociologists have tried to prove (or disprove) this claim for decades, but it is still unresolved.

http://smallworld.sandbox.yahoo.com/

August 06, 2011

Page 16: CSE509 Lecture 5

17

Community Structure

Community: People in a group interact with each other more frequently than those outside the group

ki = number of edges among node Ni’s neighbors Friends of a friend are likely to be friends as well Measured by clustering coefficient:

Density of connections among one’s friends

August 06, 2011

Page 17: CSE509 Lecture 5

18

Clustering Coefficient

August 06, 2011

• d6=4, N6= {4, 5, 7,8}

• k6=4 as e(4,5), e(5,7), e(5,8), e(7,8)

• C6 = 4/(4*3/2) = 2/3

• Average clustering coefficient

C = (C1 + C2 + … + Cn)/n

• C = 0.61 for the left network

• In a random graph, the expected coefficient is 14/(9*8/2) = 0.19.

Page 18: CSE509 Lecture 5

19

Challenges

Scalability Social networks are often in a scale of millions of nodes and connections Traditional network analysis often deals with at most hundreds of subjects

Heterogeneity Various types of entities and interactions are involved

Evolution Timelines are emphasized in social media

Collective Intelligence How to utilize wisdom of crowds in forms of tags, wikis, reviews

Evaluation Lack of ground truth, and complete information due to privacy

August 06, 2011

Page 19: CSE509 Lecture 5

20

Social Computing Tasks

Social Computing: a young and vibrant field Conferences: KDD, WSDM, WWW, ICML, AAAI/IJCAI, SocialCom,

etc. Tasks

Centrality Analysis and Influence Modeling Community Detection Classification and Recommendation Privacy, Spam and Security

August 06, 2011

Page 20: CSE509 Lecture 5

21

Centrality Analysis and Influence Modeling

Centrality Analysis: Identify the most important actors or edges

E.g. PageRank in Google

Various other criteria

Influence modeling: How is information diffused? How does one influence each other?

Related Problems Viral marketing: word-of-mouth effect Influence maximization

August 06, 2011

Page 21: CSE509 Lecture 5

22

Community Detection

A community is a set of nodes between which the interactions are (relatively) frequent A.k.a., group, cluster, cohesive subgroups, modules

Applications: Recommendation based communities, Network Compression, Visualization of a huge network

New lines of research in social media Community Detection in Heterogeneous Networks Community Evolution in Dynamic Networks Scalable Community Detection in Large-Scale Networks

August 06, 2011

Page 22: CSE509 Lecture 5

23

Classification and Recommendation

Common in social media applications Tag suggestion, Product/Friend/Group Recommendation

August 06, 2011

Link prediction

Network-Based Classification

Page 23: CSE509 Lecture 5

24

Privacy, Spam and Security

Privacy is a big concern in social media Facebook, Google buzz often appear in debates about privacy NetFlix Prize Sequel cancelled due to privacy concern Simple anonymization does not necessarily protect privacy

Spam blog (splog), spam comments, fake identity, etc., all requires new techniques

As private information is involved, a secure and trustable system is critical

Need to achieve a balance between sharing and privacy

August 06, 2011

Page 24: CSE509 Lecture 5

25

PART II: Practical SNA with Twittersphere Mining

August 06, 2011

Page 25: CSE509 Lecture 5

26

Pre-Requisites

Expectation that Python is installed and you have some hands-on experience with it

Dependencies easy_install networkx twitter (Twitter API for Python)

For Windows users Install ActivePython: comes bundled with easy_install easy_install networkx easy_install twitter

For Linux users sh setuptools-0.6c11-py2.6.egg sudo easy_install networkx sudo easy_install twitter

August 06, 2011

Page 26: CSE509 Lecture 5

27

Getting Tweets from Twitter Search API

import twitter

import json

twitter_search=twitter.Twitter(domain="search.twitter.com")

search_results=[]

for page in range(1,6):

search_results.append(twitter_search.search(q="pakistan",rpp=100,page=page))

print json.dumps(search_results, sort_keys=True, indent=1)

tweets=[r['text'] for result in search_results for r in result['results']]

print tweets

August 06, 2011

Page 27: CSE509 Lecture 5

28

Lexical Diversity for Tweets

words=[]

for t in tweets:

words+= [w for w in t.split()]

lexical_diversity=1.0*len(set(words))/len(words)

August 06, 2011

Page 28: CSE509 Lecture 5

29

What People are Tweeting: Frequency Analysis

freq_dist=nltk.FreqDist(words)

freq_dist.keys()[:50]

freq_dist.keys()[-50:]

August 06, 2011

Page 29: CSE509 Lecture 5

30

Extracting Relationships from Tweets (1/3)

Step 1: Extracting Graph Dataimport networkx as nx

import re

g=nx.DiGraph()

twitter_search=twitter.Twitter(domain="search.twitter.com")

search_results=[]

for page in range(1,6):search_results.append(twitter_search.search(q="pakistan",rpp=100,page=page))

all_tweets=[tweet for page in search_results for tweet in page["results"]]

def get_rt_sources(tweet):rt_patterns=re.compile(r"(RT|via)((?:\b\W*@\w+)+)",re.IGNORECASE)

return [source.strip() for tuple in rt_patterns.findall(tweet) for source in tuple if source not in ("RT", "via")]

for tweet in all_tweets:rt_sources=get_rt_sources(tweet["text"])

if not rt_sources:continue

for rt_source in rt_sources:g.add_edge(rt_source,tweet["from_user"],{"tweet_id":tweet["id"]})

August 06, 2011

Page 30: CSE509 Lecture 5

31

Extracting Relationships from Tweets (2/3)

Step 2: Generating DOT FileOUT = "pakistan_search_results.dot“

dot=['"%s" -> "%s" [tweet_id=%s]' % (n1.encode('utf-8'), n2.encode('utf-8'), g[n1][n2]['tweet_id']) for n1, n2 in g.edges()]

f=open(OUT, 'w')

f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),))

f.close()

August 06, 2011

Page 31: CSE509 Lecture 5

32

Extracting Relationships from Tweets (3/3)

Step 3: Visualizing the Retweet Data in Graphical Form For Windows users

For Linux users circo -Tpng -Osnl_search_results pakistan_search_results.dot

August 06, 2011

Page 32: CSE509 Lecture 5

33

PART III: Example Researches

August 06, 2011

Page 33: CSE509 Lecture 5

34

Million Follower Fallacy (New York Times)

August 06, 2011

Page 34: CSE509 Lecture 5

35

Twitter: More a News Medium than a Social Network (PC World)

August 06, 2011

Page 35: CSE509 Lecture 5

36

Twitter for World Peace (Business Week)

August 06, 2011

Page 36: CSE509 Lecture 5

37

SocialFlow: Social Media Optimiza-tion

Social Media Optimization Platform Works in Domains of Viral and Word-of-Mouth Marketing Provides Services to Major Media Outlets Recent study

How different audiences consumed and rebroadcast messages news organizations were sending out: AlJazeera English, BBC News, CNN, The Economist, Fox News and New York Times

August 06, 2011

Page 37: CSE509 Lecture 5

38August 06, 2011

Twitter as a Real-Time News Analysis

Service

Page 38: CSE509 Lecture 5

39

Studying Ins and Outs of News

Using Twitter to study hot news items people are heavily tweeting about

August 06, 2011

Page 39: CSE509 Lecture 5

40

Algorithm for Identification of Popu-lar News

August 06, 2011

1. Crawl daily news data and send the unranked news articles list to the UI module of the system.

2. Extract news title, news summary and news text for each news article per day. 3. From the news summary extract named entities per news article through a named

entity recognition approach.4. Match each named entity across entities in the common entity corpus and tag each

named entity per news article as common or uncommon.5. Use Boolean query model to compose the query per news article:

a. Use AND predicate with each common entity and OR predicate with each uncommon entity. If all entities for a particular news article are common use news title for the query construction.

6. For all articles per day:a. Send a request to Twitter Search API and extract result tweets per news article

per dayb. For each t in result tweet:

i. Use t’s metadata to find following and follower statistics for each unique Twitterer who has tweeted about the news article

ii. Calculate rank of each article using the ranking function 7. Send the ranked news articles list to the UI module of the system

Page 40: CSE509 Lecture 5

41

Application Prototype

August 06, 2011

Page 41: CSE509 Lecture 5

42

Observations (1/3)

August 06, 2011

Percentage of news in tweets per day greater than 50% for all days except one

day

Page 42: CSE509 Lecture 5

43

Observations (2/3)

August 06, 2011

Highest Number of Recorded Tweets per Day

Page 43: CSE509 Lecture 5

44

Observations (3/3)

August 06, 2011

DATE EVENT

17th Oct. Karachi violence (local)

18th Oct. Hopes fade for trapped Chinese miners (international)

19th Oct. Lakki Marwat suicide attack attempt foiled (local)

20th Oct. Rebels raid parliament in Grozny; 7 dead (international)

21st Oct. Obama to visit next year (local)

22nd Oct. Nuclear plant completes decade of good performance (local)

23rd Oct. Ghazi shrine suicide bomber back home (local)

24th Oct. WikiLeaks makes fresh claim about Iraq deaths (international)

25th Oct. Court orders Iraqi parliament back to work (international)

26th Oct. SCBA presidential election (local)