Social Network Data Analysis Techniques, December 7, 2012, KAIST, Jae-Gil Lee


Big Data Analysis Techniques for Social Networks

December 7, 2012

KAIST

Jae-Gil Lee

About the Speaker: Prof. Jae-Gil Lee

Career
Dec. 2010 - present: Assistant Professor, Dept. of Knowledge Service Engineering, KAIST
Sep. 2008 - Nov. 2010: Researcher, IBM Almaden Research Center
Jul. 2006 - Aug. 2008: Postdoctoral Researcher, University of Illinois at Urbana-Champaign

Research Areas
Spatio-temporal data mining (trajectory and traffic data)
Social network and graph data mining
Big data analysis (MapReduce and Hadoop)

Contact
E-mail:
Homepage: http://dm.kaist.ac.kr/jaegil

Contents

1. Big Data and Social Networks
2. Online Analysis
3. Offline Analysis
4. Summary

1. Big Data and Social Networks

The online social network (OSN) is one of the main sources of big data.

Data Growth in Facebook

Data Growth in Twitter

Some Statistics on OSNs

Twitter is estimated to have 140 million users, generating 340 million tweets a day and handling over 1.6 billion search queries per day.

As of May 2012, Facebook had more than 900 million active users; Facebook had 138.9 million monthly unique U.S. visitors in May 2011.

Data Characteristics

Relationship data: e.g., followers, …
Content data: e.g., tweets, …
Location data

Graph Data

A social network is usually modeled as a graph:
A node → an actor
An edge → a relationship or an interaction

The graph is diverse:
directed vs. undirected
weighted vs. unweighted
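As a minimal sketch of this modeling (the user names and weights are illustrative, not from the slides), a directed, weighted social graph can be held as an adjacency list in Python:

```python
# Minimal sketch: a directed, weighted social graph as an adjacency list.
# A node is an actor; an edge is a relationship/interaction with a weight.
graph = {
    "alice": {"bob": 1.0, "carol": 0.5},   # alice follows bob and carol
    "bob":   {"carol": 2.0},
    "carol": {},
}

def followees(user):
    """Whom does `user` follow? (outgoing edges)"""
    return list(graph.get(user, {}))

def followers(user):
    """Who follows `user`? (incoming edges; a full scan here,
    whereas a real store would keep a reverse index)"""
    return [u for u, nbrs in graph.items() if user in nbrs]
```

An undirected graph would simply store each edge in both directions; an unweighted one would use sets instead of weight maps.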


Categories of Graph Data Analysis

Online analysis
Example: for a given user, finding anyone whose first name is "David" among his friends, his friends' friends, and his friends' friends' friends
Typically using graph databases (e.g., Neo4j, HyperGraphDB, FlockDB)

Offline analysis
Example: calculating PageRank for the entire graph
Typically using distributed, parallel systems (e.g., MapReduce, Pregel, Trinity)

2. Online Analysis


Graph Databases (1/2)

Neo4j
http://neo4j.org/
Open source, current version: 1.8 (as of Dec. 2012)
Running on a single machine

HyperGraphDB
http://www.hypergraphdb.org/
Kobrix Software, current version: 1.2 (as of Dec. 2012)
Running on a single machine

FlockDB
https://github.com/twitter/flockdb
Open source, current version: 1.8 (as of Dec. 2012)
Running on a cluster of machines
Being used by Twitter to store social graphs and indexes

Graph Databases (2/2)

Storing data as nodes and relationships
Both nodes and relationships can hold properties in a key/value fashion
Being able to navigate the structure

FlockDB

A distributed graph database for storing adjacency lists, with goals of supporting:
A high rate of add/update/remove operations
Potentially complex set arithmetic queries
Paging through query result sets (over 1M entries)
Ability to "archive" and later restore archived edges
Horizontal scaling including replication
Online data migration

but not including:
Multi-hop queries (or graph-walking queries)
Automatic shard migrations

FlockDB for Twitter

As of April 2010:
Storing 13+ billion edges
Sustaining 20k writes/second at peak
Sustaining 100k reads/second at peak

Temporal and Count Operations

Counts
Temporal
Intersection

Set Operations

A tweet may need to be delivered to the people who follow both @aplusk (13M followers) and @foursquare (530K followers), i.e., the intersection of two follower sets.

Adjacency Lists (1/2)

Storing the follower relationship as an edge:

source_id: int64
destination_id: int64
position: int64 (used for sorting, e.g., current time)
state: int8 (Normal, Removed, or Archived)

Adjacency Lists (2/2)

Storing an edge in both directions:

Forward:
source_id    destination_id    position    state
20           12                20:50:14
20           13                20:51:32
20           16                20:54:26

Backward:
destination_id    source_id    position    state
12                20           20:50:14
12                32           20:51:42
12                16           20:53:24

Indexed and partitioned by node; can efficiently answer the question "Who follows A?" as well as "Whom is A following?"
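A toy sketch of this dual-index idea, with plain dicts standing in for the sharded tables FlockDB actually uses (the edge values are the ones from the tables above):

```python
# Toy sketch: each edge is stored in both directions, as in the
# forward/backward tables above (dicts stand in for sharded storage).
forward = {}    # source_id      -> {destination_id: position}
backward = {}   # destination_id -> {source_id: position}

def add_edge(src, dst, position):
    forward.setdefault(src, {})[dst] = position
    backward.setdefault(dst, {})[src] = position

add_edge(20, 12, "20:50:14")
add_edge(20, 13, "20:51:32")
add_edge(32, 12, "20:51:42")

def following(src):
    """'Whom is src following?' -- answered from the forward index."""
    return sorted(forward.get(src, {}))

def followers(dst):
    """'Who follows dst?' -- answered from the backward index."""
    return sorted(backward.get(dst, {}))
```

Each lookup touches only one index (and hence one partition), which is why both question directions stay cheap.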


Partitioning / Sharding

Data is partitioned by node, so the queries can be answered by a single partition, using an indexed range query.
The app servers (affectionately called "flapps") are stateless and are horizontally scalable.

Example Queries

How many people are following user 1?
    flock.select(nil, :follows, 1).to_a

Who's reciprocally following user 1?
    flock.select(1, :follows, nil).intersect(nil, :follows, 1).to_a

How about the union then?
    flock.select(1, :follows, nil).union(nil, :follows, 1).to_a

Who's following user 1 that user 1 is not following back?
    flock.select(nil, :follows, 1).difference(1, :follows, nil).to_a
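The set-arithmetic semantics behind these queries can be mimicked with plain Python sets (a sketch with made-up user IDs; FlockDB evaluates the same operations against its indexed edge tables):

```python
# Sketch of FlockDB-style set arithmetic with plain Python sets.
followers_of_1 = {2, 3, 4, 5}    # like select(nil, :follows, 1)
following_of_1 = {3, 5, 7}       # like select(1, :follows, nil)

reciprocal        = following_of_1 & followers_of_1   # intersect
either            = following_of_1 | followers_of_1   # union
not_followed_back = followers_of_1 - following_of_1   # difference
```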

3. Offline Analysis


Analysis at Scale

Example: running PageRank across users to calculate reputations, giving any Twitter user a score from 1 to 10 based on their followers' networks of followers.

PageRank Overview (1/4)

Google describes PageRank: "… PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value. … and our technology uses the collective intelligence of the web to determine a page's importance."

A page referenced by many high-quality pages is also a high-quality page.

PageRank Overview (2/4)

Formula:

PR(A) = (1 - d)/N + d * ( PR(B1)/L(B1) + … + PR(Bn)/L(Bn) ),
where B1, …, Bn are the pages linking to A

PR(A): PageRank of a page A
d: the probability, at any step, that the person will continue clicking, called the damping factor (usually set to 0.85)
L(B): the number of outbound links on a page B
N: the total number of pages

OR, in the non-normalized variant:

PR(A) = (1 - d) + d * ( PR(B1)/L(B1) + … + PR(Bn)/L(Bn) )

PageRank Overview (3/4)

Example (graph: A → B, B → C, C → A, C → B):

PR(A) = (1-d) * (1/N) + d * (PR(C) / 2)
PR(B) = (1-d) * (1/N) + d * (PR(A) / 1 + PR(C) / 2)
PR(C) = (1-d) * (1/N) + d * (PR(B) / 1)

Set d = 0.70 for ease of calculation:

PR(A) = 0.1 + 0.35 * PR(C)
PR(B) = 0.1 + 0.70 * PR(A) + 0.35 * PR(C)
PR(C) = 0.1 + 0.70 * PR(B)

Iteration 1: PR(A) = 0.33, PR(B) = 0.33, PR(C) = 0.33
Iteration 2: PR(A) = 0.22, PR(B) = 0.45, PR(C) = 0.33
Iteration 3: PR(A) = 0.22, PR(B) = 0.37, PR(C) = 0.41
…
Iteration 9: PR(A) = 0.23, PR(B) = 0.39, PR(C) = 0.38
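A few lines of Python reproduce this iteration (a sketch using simultaneous updates, i.e., each iteration reads only the previous iteration's values, which matches the numbers above):

```python
# Power iteration for the 3-page example (d = 0.70, N = 3).
# incoming[p] lists (q, L(q)) for each page q that links to p.
d, N = 0.70, 3
incoming = {"A": [("C", 2)], "B": [("A", 1), ("C", 2)], "C": [("B", 1)]}
pr = {p: 1 / N for p in incoming}            # iteration 1: all 0.33

for _ in range(8):                           # iterations 2 through 9
    pr = {p: (1 - d) / N + d * sum(pr[q] / l for q, l in incoming[p])
          for p in incoming}

# pr is now roughly {'A': 0.23, 'B': 0.39, 'C': 0.38}
```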


PageRank Overview (4/4)

A random surfer selects a page and keeps clicking links until getting bored, then randomly selects another page.
PR(A) is the probability that such a surfer visits A.
(1 - d) is the probability of getting bored at a page (d is called the damping factor).
The PageRank matrix can be computed offline.
Google takes into account both the relevance of the page and its PageRank.

MapReduce Basics

To handle big data, Google proposed a new approach called MapReduce.
MapReduce can crunch huge amounts of data by splitting the task over multiple computers that can operate in parallel.
No matter how large the problem is, you can always increase the number of processors (which today are relatively cheap).

Two Steps of MapReduce

Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. Each worker node processes its smaller problem and passes the answer back to the master node.

Reduce step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve.

Example – Programming Model

employees.txt (tab-separated):

# LAST     FIRST    SALARY
Smith      John     $90,000
Brown      David    $70,000
Johnson    George   $95,000
Yates      John     $80,000
Miller     Bill     $65,000
Moore      Jack     $85,000
Taylor     Fred     $75,000
Smith      David    $80,000
Harris     John     $90,000
...        ...      ...

Q: "What is the frequency of each first name?"

mapper:

    def getName(line):
        return line.split('\t')[1]

reducer:

    def addCounts(hist, name):
        hist[name] = hist.get(name, 0) + 1
        return hist

driver:

    input = open('employees.txt', 'r')
    intermediate = map(getName, input)
    result = reduce(addCounts, intermediate, {})

Note: pp. 31~36 are borrowed from KDD 2011 tutorial "Large-scale Data Mining: MapReduce and Beyond".

Example – Programming Model

The same job with explicit key-value pairs (same employees.txt; key-value iterators flow between the stages):

mapper:

    def getName(line):
        return (line.split('\t')[1], 1)

reducer:

    def addCounts(hist, (name, c)):   # Python 2 tuple-parameter syntax
        hist[name] = hist.get(name, 0) + c
        return hist

driver:

    input = open('employees.txt', 'r')
    intermediate = map(getName, input)
    result = reduce(addCounts, intermediate, {})

Example – Programming Model: Hadoop / Java

public class HistogramJob extends Configured implements Tool {

  public static class FieldMapper extends MapReduceBase
      implements Mapper<LongWritable,Text,Text,LongWritable> {

    private static LongWritable ONE = new LongWritable(1);
    private static Text firstname = new Text();

    @Override
    public void map (LongWritable key, Text value,
        OutputCollector<Text,LongWritable> out, Reporter r) {
      firstname.set(value.toString().split("\t")[1]);
      out.collect(firstname, ONE);
    }
  } // class FieldMapper

Example – Programming Model: Hadoop / Java

  public static class LongSumReducer extends MapReduceBase
      implements Reducer<Text,LongWritable,Text,LongWritable> {

    private static LongWritable sum = new LongWritable();

    @Override
    public void reduce (Text key, Iterator<LongWritable> vals,
        OutputCollector<Text,LongWritable> out, Reporter r) {
      long s = 0;
      while (vals.hasNext())
        s += vals.next().get();
      sum.set(s);
      out.collect(key, sum);
    }
  } // class LongSumReducer

Example – Programming Model: Hadoop / Java

  public int run (String[] args) throws Exception {
    JobConf job = new JobConf(getConf(), HistogramJob.class);
    job.setJobName("Histogram");
    FileInputFormat.setInputPaths(job, args[0]);
    job.setMapperClass(FieldMapper.class);
    job.setCombinerClass(LongSumReducer.class);
    job.setReducerClass(LongSumReducer.class);
    // ...
    JobClient.runJob(job);
    return 0;
  } // run()

  public static void main (String[] args) throws Exception {
    ToolRunner.run(new Configuration(), new HistogramJob(), args);
  } // main()

} // class HistogramJob

Execution Model: Flow

Input file → splits (SPLIT 0-3) → mappers (sequential scan over key/value iterators) → all-to-all shuffle (hash partitioning, sort-merge) → reducers → output files (PART 0, PART 1)

Example: the records "Smith John $90,000" and "Yates John $80,000" each map to (John, 1); the reducer merges these into (John, 2).

Apache Hadoop

The most popular open-source implementation of MapReduce
http://hadoop.apache.org/
Ecosystem components: Core, Avro, HDFS, MapReduce, ZooKeeper, HBase, Hive, Pig, Chukwa

PageRank on MapReduce (1/2)

Map: distributing PageRank "credit" to link targets
Reduce: summing up PageRank "credit" from multiple sources to compute new PageRank values
Iterate until convergence

PageRank on MapReduce (2/2)

Map (nid n, node N)
  p ← N.PageRank / |N.AdjacencyList|
  emit (nid n, node N)       // Pass along the graph structure
  for nid m ∈ N.AdjacencyList do
    emit (nid m, p)          // Pass a PageRank value to its neighbors

Reduce (nid m, [p1, p2, …])
  s ← 0
  for p ∈ [p1, p2, …] do
    if IsNode(p) then
      M ← p                  // Recover the graph structure
    else
      s ← s + p              // Sum up the incoming PageRank contributions
  M.PageRank ← s
  emit (nid m, node M)
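The shape of one such round can be simulated in a few lines of Python (a sketch that mirrors the pseudocode; like the pseudocode, it omits the damping factor and dangling-node handling):

```python
# Sketch of one PageRank round in MapReduce style, mirroring the
# pseudocode above (no damping, no dangling-node handling).
from collections import defaultdict

def map_phase(graph):
    for n, (pr, adj) in graph.items():
        yield n, ("NODE", adj)                 # pass along the graph structure
        for m in adj:
            yield m, ("MASS", pr / len(adj))   # contribution to each neighbor

def reduce_phase(pairs):
    grouped = defaultdict(list)
    for key, val in pairs:                     # shuffle: group by node id
        grouped[key].append(val)
    new_graph = {}
    for m, vals in grouped.items():
        s, adj = 0.0, []
        for tag, v in vals:
            if tag == "NODE":
                adj = v                        # recover the graph structure
            else:
                s += v                         # sum incoming contributions
        new_graph[m] = (s, adj)
    return new_graph

# The 3-node example graph: node -> (PageRank, adjacency list)
graph = {"A": (1/3, ["B"]), "B": (1/3, ["C"]), "C": (1/3, ["A", "B"])}
graph = reduce_phase(map_phase(graph))         # one iteration
```

Running the two phases repeatedly corresponds to iterating the MapReduce job until convergence.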


Implementation

Cloud9
http://lintool.github.com/Cloud9/
Jimmy Lin and Michael Schatz. Design Patterns for Efficient Graph Algorithms in MapReduce. Proceedings of the 2010 Workshop on Mining and Learning with Graphs (MLG-2010), July 2010, Washington, D.C.

Pig

Pig raises the level of abstraction for processing large datasets, turning the transformations into a series of MapReduce jobs.
The language used to express data flows is called Pig Latin.

A Real Pig Script (shown alongside the Java program for the same task)

4. Summary


Summary

FlockDB: Real-time Analysis
Hadoop: Storing and Analyzing Data
Cassandra: Storing Tweets (http://cassandra.apache.org/)
HBase: Searching People (http://hbase.apache.org/)
Pig: Easier (SQL-like) Analysis (http://pig.apache.org/)
Scribe: Log Data Collection (https://github.com/facebook/scribe)

THANK YOU