Social Network Data Analysis Techniques - kse...
2012-12-07 2
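About the Speaker - Prof. Jae-Gil Lee (이재길)
Career
Dec. 2010 to present: Assistant Professor, Department of Knowledge Service Engineering, KAIST
Sep. 2008 to Nov. 2010: Researcher, IBM Almaden Research Center
Jul. 2006 to Aug. 2008: Postdoctoral Researcher, University of Illinois at Urbana-Champaign
Research Areas
Spatio-temporal data mining (trajectory and traffic data)
Social network and graph data mining
Big data analysis (MapReduce and Hadoop)
Contact
E-mail:
Homepage: http://dm.kaist.ac.kr/jaegil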
2012-12-07 5
Big Data Social Networks
The online social network (OSN) is one of the
main sources of big data
2012-12-07 8
Some Statistics on OSNs
Twitter is estimated to have 140 million users,
generating 340 million tweets a day and handling
over 1.6 billion search queries per day
As of May 2012, Facebook has more than 900
million active users; in May 2011, it had 138.9 million
monthly unique U.S. visitors
2012-12-07 9
Data Characteristics
Relationship data: e.g., follower, …
Content data: e.g., tweets, …
Location data
[Figure: a user at the center, linked to contents, relationships, and locations]
2012-12-07 10
Graph Data
A social network is usually
modeled as a graph
A node → an actor
An edge → a relationship
or an interaction
Graphs come in different forms
directed vs. undirected
weighted vs. unweighted
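As a minimal sketch of the graph model above, the following Python snippet stores a small directed, weighted graph as a dictionary of adjacency dictionaries and derives undirected and unweighted views from it. The user names and weights are made up for illustration.

```python
# Directed, weighted graph: graph[u][v] is the weight of the edge u -> v.
# Names and weights are hypothetical.
graph = {
    "alice": {"bob": 3.0, "carol": 1.0},
    "bob": {"alice": 2.0},
    "carol": {},
}

def to_unweighted(g):
    """Drop the weights and keep only the set of out-neighbors per node."""
    return {u: set(nbrs) for u, nbrs in g.items()}

def to_undirected(g):
    """Ignore direction: u and v are neighbors if either u -> v or v -> u exists."""
    result = {u: set(nbrs) for u, nbrs in g.items()}
    for u, nbrs in g.items():
        for v in nbrs:
            result.setdefault(v, set()).add(u)
    return result

unweighted = to_unweighted(graph)
undirected = to_undirected(graph)
```

Here "carol" has no outgoing edges in the directed view but has "alice" as a neighbor in the undirected view.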
2012-12-07 11
Categories of Graph Data Analysis
Online analysis
Example: for a given user, finding anyone whose first
name is “David” among his friends, his friends’ friends,
and his friends’ friends’ friends
Typically using graph databases (e.g., Neo4j,
HyperGraphDB, FlockDB)
Offline analysis
Example: calculating PageRank for the entire graph
Typically using distributed, parallel systems (e.g.,
MapReduce, Pregel, Trinity)
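The online example above (finding anyone named "David" within three hops) can be sketched as a depth-limited breadth-first search. This is an in-memory illustration only, not how a graph database such as Neo4j implements traversals; the function name, the `friends` dictionary, and the sample data are assumptions.

```python
from collections import deque

def find_within_hops(friends, first_names, start, target_name, max_hops=3):
    """Breadth-first search limited to max_hops; returns matching user ids."""
    seen = {start}
    queue = deque([(start, 0)])
    matches = []
    while queue:
        user, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in friends.get(user, ()):
            if friend in seen:
                continue
            seen.add(friend)
            if first_names.get(friend) == target_name:
                matches.append(friend)
            queue.append((friend, hops + 1))
    return matches

# Hypothetical data: user 6 is named David but is four hops away from user 1.
friends = {1: [2, 3], 2: [4], 3: [], 4: [5], 5: [6], 6: []}
first_names = {1: "Ann", 2: "David", 3: "Eve", 4: "Bob", 5: "David", 6: "David"}
result = find_within_hops(friends, first_names, start=1, target_name="David")
```

The query touches only the neighborhood of one user, which is why it suits an online system, whereas PageRank has to read the entire graph.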
2012-12-07 13
Graph Databases (1/2)
Neo4j
http://neo4j.org/
Open source, current version: 1.8 (as of Dec. 2012)
Running on a single machine
HyperGraphDB
http://www.hypergraphdb.org/
Kobrix Software, current version: 1.2 (as of Dec. 2012)
Running on a single machine
FlockDB
https://github.com/twitter/flockdb
Open source, current version: 1.8 (as of Dec. 2012)
Running on a cluster of machines
Being used by Twitter to store social graphs and indexes
2012-12-07 14
Graph Databases (2/2)
Storing data as nodes and relationships
Both nodes and relationships can hold properties in a
key/value fashion
Being able to navigate the structure
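A minimal sketch of this data model, assuming nothing about any particular product's API: nodes and relationships are plain Python objects that each carry a key/value property dictionary, and navigation follows relationships of a given type. The class and function names are hypothetical.

```python
class Node:
    def __init__(self, **properties):
        self.properties = properties   # key/value properties of the node
        self.relationships = []        # outgoing relationships

class Relationship:
    def __init__(self, start, end, rel_type, **properties):
        self.start, self.end, self.type = start, end, rel_type
        self.properties = properties   # key/value properties of the relationship
        start.relationships.append(self)

def neighbors(node, rel_type):
    """Navigate the structure: follow outgoing relationships of one type."""
    return [r.end for r in node.relationships if r.type == rel_type]

alice = Node(name="Alice")
bob = Node(name="Bob")
Relationship(alice, bob, "FOLLOWS", since=2010)
names = [n.properties["name"] for n in neighbors(alice, "FOLLOWS")]
```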
2012-12-07 15
FlockDB
A distributed graph database for storing
adjacency lists, with goals of supporting:
A high rate of add/update/remove operations
Potentially complex set arithmetic queries
Paging through query result sets (over 1M entries)
Ability to “archive” and later restore archived edges
Horizontal scaling including replication
Online data migration
but not including:
Multi-hop queries (or graph-walking queries)
Automatic shard migrations
2012-12-07 16
FlockDB for Twitter
As of April 2010:
Storing 13+ billion edges
Sustaining 20k writes/second at peak
Sustaining 100k reads/second at peak
2012-12-07 18
Set Operations
This tweet needs to be delivered to people who
follow both @aplusk (13M followers) and
@foursquare (530K followers)
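The delivery rule above is a set intersection over two follower lists. A toy sketch with Python sets follows; the follower ids are invented, and the real lists hold millions of entries.

```python
# Hypothetical follower id sets for the two accounts.
followers = {
    "aplusk": {101, 102, 103, 104},
    "foursquare": {103, 104, 105},
}

# Only users who follow both accounts should receive the tweet.
recipients = followers["aplusk"] & followers["foursquare"]

# Union and difference are the other set operations on adjacency lists.
either = followers["aplusk"] | followers["foursquare"]
only_aplusk = followers["aplusk"] - followers["foursquare"]
```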
2012-12-07 19
Adjacency Lists (1/2)
Storing the follower relationship as an edge
Edge fields:
source_id: int64
destination_id: int64
position: int64, used for sorting (e.g., current time)
state: int8, one of Normal, Removed, or Archived
2012-12-07 20
Adjacency Lists (2/2)
Storing an edge in both directions
Forward
source_id  destination_id  position  state
20         12              20:50:14
20         13              20:51:32
20         16              20:54:26

Backward
destination_id  source_id  position  state
12              20         20:50:14
12              32         20:51:42
12              16         20:53:24
Indexed and partitioned by
Can efficiently answer the question “Who follows A?”
as well as “Whom is A following?”
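The idea of storing each edge in both directions can be sketched with two in-memory indexes. This is an illustration, not FlockDB's storage code: the class name is hypothetical, the state column is omitted, and positions are encoded as HHMMSS integers taken from the tables above.

```python
from collections import defaultdict

class EdgeStore:
    def __init__(self):
        self.forward = defaultdict(dict)   # source_id -> {destination_id: position}
        self.backward = defaultdict(dict)  # destination_id -> {source_id: position}

    def add(self, source_id, destination_id, position):
        # Write the same edge into both indexes.
        self.forward[source_id][destination_id] = position
        self.backward[destination_id][source_id] = position

    def following(self, user_id):
        """Whom is user_id following? Ordered by position."""
        row = self.forward.get(user_id, {})
        return sorted(row, key=row.get)

    def followers(self, user_id):
        """Who follows user_id? Ordered by position."""
        row = self.backward.get(user_id, {})
        return sorted(row, key=row.get)

store = EdgeStore()
for src, dst, pos in [(20, 12, 205014), (20, 13, 205132), (20, 16, 205426),
                      (32, 12, 205142), (16, 12, 205324)]:
    store.add(src, dst, pos)
```

Each question is answered from a single row of one index, at the cost of writing every edge twice.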
2012-12-07 21
Partitioning / Sharding
Data is partitioned by
node, so the queries can
be answered by a single
partition, using an
indexed range query
The app servers
(affectionately called
“flapps”) are stateless and
are horizontally scalable
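A rough sketch of why partitioning by node lets one partition answer a query: all edges of a node live in the same shard, sorted by position, so a page of results is one range lookup. The modulo placement rule, the shard count, and the function names are assumptions for illustration; the slides do not describe FlockDB's actual placement scheme.

```python
import bisect

NUM_SHARDS = 4  # hypothetical shard count

def shard_for(node_id, num_shards=NUM_SHARDS):
    # Assumed placement rule: node id modulo the number of shards.
    return node_id % num_shards

# Each shard maps a node id to its edges, kept sorted as (position, neighbor).
shards = [dict() for _ in range(NUM_SHARDS)]

def add_edge(source_id, destination_id, position):
    rows = shards[shard_for(source_id)].setdefault(source_id, [])
    bisect.insort(rows, (position, destination_id))

def page(source_id, after_position, limit):
    """Range query on one shard: edges with position greater than after_position."""
    rows = shards[shard_for(source_id)].get(source_id, [])
    start = bisect.bisect_right(rows, (after_position, float("inf")))
    return rows[start:start + limit]

add_edge(20, 16, 205426)
add_edge(20, 12, 205014)
add_edge(20, 13, 205132)
next_page = page(20, after_position=205014, limit=2)
```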
2012-12-07 22
Example Queries
Who is following user 1?
flock.select(nil, :follows, 1).to_a
Who's reciprocally following user 1?
flock.select(1, :follows, nil).intersect(nil, :follows, 1).to_a
How about the union then?
flock.select(1, :follows, nil).union(nil, :follows, 1).to_a
Who's following user 1 that user 1 is not following back?
flock.select(nil, :follows, 1).difference(1, :follows, nil).to_a
2012-12-07 24
Analysis at Scale
Example: Running PageRank across users to
calculate reputations
To give any Twitter user a score from 1 to 10 based on
their followers’ networks of followers
2012-12-07 25
PageRank Overview (1/4)
Google describes PageRank:
“… PageRank also considers
the importance of each page
that casts a vote, as votes
from some pages are
considered to have greater
value, thus giving the linked
page greater value. … and
our technology uses the collective intelligence of the
web to determine a page's importance”
A page referenced by many high-quality pages
is also a high-quality page
2012-12-07 26
PageRank Overview (2/4)
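Formula
$$PR(A) = \frac{1-d}{N} + d \left( \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \cdots \right)$$
OR
$$PR(A) = \frac{1-d}{N} + d \sum_{B \,:\, B \to A} \frac{PR(B)}{L(B)}$$
where B, C, ... are the pages that link to A
PR(A): PageRank of a page A
d: the damping factor, i.e., the probability, at any step, that the
person will continue clicking links (usually set to 0.85)
L(B): the number of outbound links on a page B
N: the total number of pages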
2012-12-07 27
PageRank Overview (3/4)
Example
PR(A) = (1–d) * (1/N) + d * (PR(C) / 2)
PR(B) = (1–d) * (1/N) + d * (PR(A) / 1 + PR(C) / 2)
PR(C) = (1–d) * (1/N) + d * (PR(B) / 1)
Set d = 0.70 for ease of calculation
PR(A) = 0.1 + 0.35 * PR(C)
PR(B) = 0.1 + 0.70 * PR(A) + 0.35 * PR(C)
PR(C) = 0.1 + 0.70 * PR(B)
Iteration 1: PR(A) = 0.33, PR(B) = 0.33, PR(C) = 0.33
Iteration 2: PR(A) = 0.22, PR(B) = 0.45, PR(C) = 0.33
Iteration 3: PR(A) = 0.22, PR(B) = 0.37, PR(C) = 0.41
…
Iteration 9: PR(A) = 0.23, PR(B) = 0.39, PR(C) = 0.38
[Figure: three pages with links A → B, B → C, C → A, and C → B]
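The worked example can be reproduced with a short iterative sketch. It assumes the link structure implied by the three equations (A links to B, B links to C, C links to A and B) and that every page has at least one outbound link; the function name is hypothetical.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Synchronous PageRank updates starting from a uniform distribution.

    Assumes every node has at least one outbound link (no dangling nodes).
    """
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - d) / n for v in nodes}
        for src, targets in out_links.items():
            share = rank[src] / len(targets)  # PR(src) / L(src)
            for dst in targets:
                new_rank[dst] += d * share
        rank = new_rank
    return rank

graph = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
# "Iteration 1" in the example is the uniform start, so "Iteration 9"
# corresponds to 8 updates.
ranks = pagerank(graph, d=0.70, iterations=8)
```

With d = 0.70 and 8 updates, the values round to the Iteration 9 row of the example.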
2012-12-07 28
PageRank Overview (4/4)
A random surfer selects a page and keeps
clicking links until getting bored, then randomly
selects another page
PR(A) is the probability that such a user visits A
(1-d) is the probability of getting bored at a page (d is
called the damping factor)
PageRank matrix can be computed offline
Google takes into account both the relevance of
the page and PageRank
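The random surfer interpretation can be checked with a small simulation: at each step the surfer follows a random outbound link with probability d and otherwise jumps to a uniformly random page. The sketch below reuses the three-page example graph and d = 0.70; the function name, step count, and seed are arbitrary choices for illustration.

```python
import random

def simulate_surfer(out_links, d, steps, seed=0):
    """Estimate visit frequencies of a random surfer by simulation."""
    rng = random.Random(seed)
    pages = sorted(out_links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        if rng.random() < d and out_links[current]:
            current = rng.choice(out_links[current])  # keep clicking
        else:
            current = rng.choice(pages)               # bored: jump anywhere
    return {p: visits[p] / steps for p in pages}

graph = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
freq = simulate_surfer(graph, d=0.70, steps=200_000)
```

The visit frequencies should come out close to the converged PageRank values of the example (roughly 0.23, 0.39, and 0.38), up to sampling noise.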
2012-12-07 29
MapReduce Basics
To handle big data, Google
proposed a new approach
called MapReduce
MapReduce can crunch
huge amounts of data by
splitting the task over
multiple computers that can
operate in parallel
No matter how large the problem is, you can always increase the
number of processors (which are relatively cheap today)
2012-12-07 30
Two Steps of MapReduce
Map step: The master node takes the input, divides it into smaller
sub-problems, and distributes them to worker nodes. Each worker node
processes its sub-problem and passes the answer back to the master node.
Reduce step: The master node then collects the answers to all the
sub-problems and combines them in some way to form the output, i.e.,
the answer to the problem it was originally trying to solve.
Example:
2012-12-07 31
Example – Programming Model
employees.txt
# LAST FIRST SALARY
Smith John $90,000
Brown David $70,000
Johnson George $95,000
Yates John $80,000
Miller Bill $65,000
Moore Jack $85,000
Taylor Fred $75,000
Smith David $80,000
Harris John $90,000
... ... ...
Q: “What is the frequency of each first name?”
from functools import reduce

# mapper: extract the first name (second tab-separated field)
def getName(line):
    return line.split('\t')[1]

# reducer: add one occurrence of the name to the histogram
def addCounts(hist, name):
    hist[name] = hist.get(name, 0) + 1
    return hist

with open('employees.txt', 'r') as infile:
    intermediate = map(getName, infile)
    result = reduce(addCounts, intermediate, {})
Note: pp. 31~36 are borrowed from KDD 2011 tutorial “Large-scale Data
Mining: MapReduce and Beyond”
2012-12-07 32
Example – Programming Model
employees.txt: same input as on the previous slide
Q: “What is the frequency of each first name?”
Key-value iterators
from functools import reduce

# mapper: emit a (key, value) pair of (first name, 1)
def getName(line):
    return (line.split('\t')[1], 1)

# reducer: add the count c to the histogram entry for the name
def addCounts(hist, pair):
    name, c = pair
    hist[name] = hist.get(name, 0) + c
    return hist

with open('employees.txt', 'r') as infile:
    intermediate = map(getName, infile)
    result = reduce(addCounts, intermediate, {})
2012-12-07 33
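Example – Programming Model Hadoop / Java
non-boilerplate
typed…
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class HistogramJob extends Configured implements Tool {
  public static class FieldMapper extends MapReduceBase
      implements Mapper<LongWritable,Text,Text,LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private static Text firstname = new Text();
    @Override
    public void map (LongWritable key, Text value,
        OutputCollector<Text,LongWritable> out, Reporter r)
        throws IOException {
      // Emit (first name, 1) for each tab-separated input line
      firstname.set(value.toString().split("\t")[1]);
      out.collect(firstname, ONE);
    }
  } // class FieldMapper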
2012-12-07 34
Example – Programming Model Hadoop / Java
  public static class LongSumReducer extends MapReduceBase
      implements Reducer<Text,LongWritable,Text,LongWritable> {
    private static LongWritable sum = new LongWritable();
    @Override
    public void reduce (Text key, Iterator<LongWritable> vals,
        OutputCollector<Text,LongWritable> out, Reporter r)
        throws IOException {
      // Sum all the counts emitted for this first name
      long s = 0;
      while (vals.hasNext())
        s += vals.next().get();
      sum.set(s);
      out.collect(key, sum);
    }
  } // class LongSumReducer
2012-12-07 35
Example – Programming Model Hadoop / Java
  public int run (String[] args) throws Exception {
    JobConf job = new JobConf(getConf(), HistogramJob.class);
    job.setJobName("Histogram");
    FileInputFormat.setInputPaths(job, args[0]);
    job.setMapperClass(FieldMapper.class);
    job.setCombinerClass(LongSumReducer.class);
    job.setReducerClass(LongSumReducer.class);
    // ...
    JobClient.runJob(job);
    return 0;
  } // run()
  public static void main (String[] args) throws Exception {
    ToolRunner.run(new Configuration(), new HistogramJob(), args);
  } // main()
} // class HistogramJob
2012-12-07 36
Execution Model: Flow
[Figure: Input file → SPLIT 0 to SPLIT 3 → four MAPPERs → two REDUCERs → output file (PART 0, PART 1)]
Stage labels: sequential scan, key/value iterators, all-to-all hash partitioning, sort-merge
Example: the input lines “Smith John $90,000” and “Yates John $80,000” are mapped to (John, 1) and (John, 1), which are reduced to (John, 2)
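The flow in the figure can be imitated in a single process to make the stages concrete. The sketch below is a simulation, not Hadoop: the function names are hypothetical, splits are contiguous slices of a list, and the hash partitioner uses `zlib.crc32` simply because it is deterministic across runs.

```python
import zlib
from itertools import groupby

def run_job(lines, mapper, reducer, num_splits=4, num_reducers=2):
    # Split the input into contiguous chunks, one per mapper.
    size = max(1, -(-len(lines) // num_splits))  # ceiling division
    splits = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Map each record, then hash-partition the pairs (all-to-all shuffle).
    partitions = [[] for _ in range(num_reducers)]
    for split in splits:
        for line in split:
            for key, value in mapper(line):
                part = zlib.crc32(key.encode("utf-8")) % num_reducers
                partitions[part].append((key, value))
    # Sort each partition by key and reduce every group of equal keys.
    outputs = []
    for pairs in partitions:
        pairs.sort(key=lambda kv: kv[0])
        outputs.append({key: reducer(key, [v for _, v in group])
                        for key, group in groupby(pairs, key=lambda kv: kv[0])})
    return outputs

def first_name_mapper(line):
    yield line.split("\t")[1], 1

def sum_reducer(key, values):
    return sum(values)

lines = ["Smith\tJohn\t$90,000", "Brown\tDavid\t$70,000",
         "Johnson\tGeorge\t$95,000", "Yates\tJohn\t$80,000"]
parts = run_job(lines, first_name_mapper, sum_reducer)
```

Because all pairs with the same key hash to the same partition, both "John" records meet at one reducer even though they come from different splits.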
2012-12-07 37
Apache Hadoop
The most popular open-source implementation of
MapReduce
http://hadoop.apache.org/
[Figure: Hadoop components: Core, Avro, HDFS, MapReduce, ZooKeeper, HBase, Hive, Pig, Chukwa]
2012-12-07 38
PageRank on MapReduce (1/2)
Map: distributing PageRank “credit” to link targets
Reduce: summing up PageRank “credit” from multiple
sources to compute new PageRank values
Iterate until
convergence
2012-12-07 39
PageRank on MapReduce (2/2)
Map (nid n, node N)
  p ← N.PageRank / |N.AdjacencyList|
  emit (nid n, node N) // Pass along the graph structure
  for nid m ∈ N.AdjacencyList do
    emit (nid m, p) // Pass a PageRank value to its neighbors
Reduce (nid m, [p1, p2, …])
  M ← ∅
  s ← 0
  for p ∈ [p1, p2, …] do
    if IsNode(p) then
      M ← p // Recover the graph structure
    else
      s ← s + p // Sum up the incoming PageRank contributions
  M.PageRank ← s
  emit (nid m, node M)
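One iteration of the pseudocode above can be simulated in plain Python by grouping the mapper output by key before calling the reducer. Like the pseudocode, this sketch omits the damping factor and assumes every node has at least one outbound link; the dictionary-based node layout and the function names are assumptions.

```python
from collections import defaultdict

def map_node(nid, node):
    yield nid, node                       # pass along the graph structure
    p = node["rank"] / len(node["adj"])
    for m in node["adj"]:
        yield m, p                        # pass a PageRank share to each neighbor

def reduce_node(nid, values):
    node, s = None, 0.0
    for v in values:
        if isinstance(v, dict):
            node = v                      # recover the graph structure
        else:
            s += v                        # sum up incoming contributions
    return nid, {"adj": node["adj"], "rank": s}

def iterate(graph):
    shuffled = defaultdict(list)          # stands in for the shuffle phase
    for nid, node in graph.items():
        for key, value in map_node(nid, node):
            shuffled[key].append(value)
    return dict(reduce_node(nid, vals) for nid, vals in shuffled.items())

graph = {
    "A": {"adj": ["B"], "rank": 1 / 3},
    "B": {"adj": ["C"], "rank": 1 / 3},
    "C": {"adj": ["A", "B"], "rank": 1 / 3},
}
after_one = iterate(graph)
```

Calling `iterate` repeatedly, feeding each output back in as the next input, corresponds to chaining one MapReduce job per iteration until the values converge.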
2012-12-07 40
Implementation
Cloud9
Jimmy Lin and Michael Schatz. Design Patterns for
Efficient Graph Algorithms in MapReduce.
Proceedings of the 2010 Workshop on Mining and
Learning with Graphs (MLG-2010), July
2010, Washington, D.C.
http://lintool.github.com/Cloud9/
2012-12-07 41
Pig
Pig raises the level of abstraction for processing
large datasets
Turning the transformations into a series of
MapReduce jobs
The language used to express
data flows is called Pig Latin
2012-12-07 45
Summary
FlockDB: Real-time Analysis
Hadoop: Storing and Analyzing Data
Cassandra: Storing Tweets
http://cassandra.apache.org/
HBase: Searching People
http://hbase.apache.org/
Pig: Easier (SQL-like) Analysis
http://pig.apache.org/
Scribe: Log Data Collection
https://github.com/facebook/scribe