
Techniques that Facebook Uses to Analyze and Query Social Graphs

Abdulfattah Safa (1), Haneen Droubi (2), Sharehan Bakri (3)

Web Data Management Course, Master in Computing
Birzeit University, Palestine

(1) [email protected], (2) [email protected], (3) [email protected]

This report is part of the Web Data Management Course, Master in Computing at Birzeit University, Palestine. Each student is given a topic and asked to read the most important scientific literature on that topic, criticize it, and link it with the topics taught in the course.

ABSTRACT

In this paper, we present subgraph discovery, one of the most important topics in social network graph analysis, by discussing three different approaches: the first addresses large-scale cohesive subgraphs; the second finds the top-k dense subgraphs, first in a static graph collection and then extended to a dynamic one; and the third finds large dense subgraphs in a massive graph. We then discuss Memcache, a general-purpose distributed memory caching system that Facebook uses to speed up its dynamic, database-driven pages. After that we present Unicorn, the system Facebook uses for searching the social graph. Facebook uses many systems to analyze its data; in this paper we illustrate two of them, Hive and Scuba, and compare them with another effective system used for analyzing Google's social network products, Dremel, a distributed system for querying big datasets stored as nested data.

Keywords

Social Graph, Query, Typeahead, Memcache, Unicorn, Apply Operator, Facebook, Aggregation, Hive, Dremel, Scuba, Hadoop, Google, MySQL, Subgraph, Graph Discovery, Cohesive Subgraph, Dense Subgraph.

1. Introduction and Motivation

Social network data are stored in graphs, where the vertices represent the social actors and the edges represent the relations between them. Graph search and discovery is one of the most important graph operations, as it is the core of finding how graph vertices (actors) are connected to each other. This is very important for social networks: by studying the interests and habits of the people acting in the network, suggestions can be provided to them based on the results of that study. By definition, a highly connected graph is called a cohesive one; in a social network, a cohesive graph is a graph whose vertices share a large set of common social properties (relations).

One of the most popular social networks is Facebook. Over Facebook's history several search systems were built, which Facebook had to unify in order to build Graph Search. At first, the old search on Facebook (called PPS) was keyword based: the searcher entered keywords and the search engine produced a results page that was personalized and could be filtered to focus on specific kinds of entities. In 2009, Facebook started work on a new search product (called Typeahead) that would deliver search results as the searcher typed, i.e., by prefix matching.

Because of the massive number of users who would be unhappy with any latency or bug in a Facebook page, Facebook's data management systems have become one of the hot topics in the technology world today. Their core challenges are coping with the continuous, rapid growth of data (scalability) and with the diversity of the jobs users submit (execution times, data delivery deadlines, etc.). Moreover, some users want to run ad hoc query analyses to test hypotheses or to answer functional questions. As a result, storing and representing user data become major challenges.

In this paper we start with cohesive subgraph discovery in social networks, discussing three main approaches. In Section 4, Memcache is discussed as a general-purpose distributed memory caching system. In Sections 5 and 6 we discuss the Typeahead and Unicorn search systems. In the following sections, the data analysis systems (Hive, Scuba, and Dremel) are discussed in detail.

2. THE SOCIAL GRAPH

The database that Facebook uses to maintain the inter-relationships between the people and things in the real world is called the social graph. It consists of nodes signifying people and things, and edges representing relationships between two nodes; in other words, the entities are the nodes and the relationships are the edges [3,1]. Each entity (node) has a primary key, which is a 64-bit identifier (id). Facebook also stores the edges between entities. Facebook has many types of edges: directional edges (e.g., the inverse of likes is likers, which is an edge from a page to a user who has liked that page), symmetric edges (e.g., the friend relation), and many other edge types. Figure (1) illustrates a running example of how a user's checkin might be mapped to objects and associations.

Figure 1: A checkin mapped to objects and associations [1].
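As an illustration of this object-and-association view, the sketch below (ours, assuming a simplified TAO-like model rather than Facebook's actual schema; all ids and type names are made up) shows how a checkin could be represented:

# Our simplified sketch of nodes (objects) and typed edges (associations);
# the ids and entity types are illustrative only.
objects = {
    685: {"type": "USER",     "name": "Jon Jones"},
    663: {"type": "CHECKIN"},
    534: {"type": "LOCATION", "name": "Golden Gate Bridge"},
}
# associations: (source id, edge type, destination id)
associations = [
    (685, "authored",    663),
    (663, "authored_by", 685),   # a directional edge and its inverse
    (663, "checkin_at",  534),
    (534, "checkin",     663),
]
# a symmetric edge such as friend is stored once in each direction with the same type
friends = [(685, "friend", 700), (700, "friend", 685)]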

3. Cohesive Subgraph Discovery in Social Networks

This section discusses three approaches to searching for cohesive subgraphs in social networks.

3.1 Large-Scale Cohesive Subgraphs

This approach aims to develop an algorithm that discovers cohesive subgraphs and then to use it to build two solutions, one for memory and one for databases. To define the problem statement, the following terms are used:

Definition 1: "A k-core is a connected subgraph g such that each vertex v has degree d(v) ≥ k within the subgraph g." A related property concerns edges in cliques: each edge in a k-clique is supported by k-2 triangles within that clique.

Definition 2: "A k-mutual-friend is a connected subgraph g ⊆ G such that each edge is supported by at least k pairs of edges forming a triangle with that edge within g. The k-mutual-friend number of this subgraph, denoted as M(g), equals k."

Definition 3: "A maximal k-mutual-friend subgraph is a k-mutual-friend subgraph that is not a proper subgraph of any other k-mutual-friend subgraph" [14].

The main problem statement is: find all the maximal k-mutual-friend subgraphs of a graph G(V,E). Two solutions were developed to solve this problem, one for memory and one for disk: a memory-based solution is first developed for small graphs and then extended to a disk-based solution for large graphs.

3.1.1 Memory Solution

The basic idea of the memory solution for finding maximal k-mutual-friend subgraphs is to iteratively drop all the vertices and edges that do not satisfy the condition; in other words, edges that are not part of at least k triangles are removed one by one until every remaining edge satisfies Tr(e) ≥ k, where Tr(e) is the number of triangles containing edge e.

The solution is outlined in Figure (2). Applied to the example graph with k = 2, the algorithm proceeds as follows: in the first iteration, the edge (e-i) is removed, together with the newly isolated vertex i, as it is part of no triangle. Then the edges (e-h), (e-g), and (f-h) are all removed. In the third iteration, the edges (d-g), (f-g), and (g-h) become part of fewer than 2 triangles after the previous removals, so they are removed as well. All the edges in the resulting graph satisfy Tr(e) ≥ 2, so the graph is a cohesive subgraph.

Figure 2: Offline memory solution.
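To make the procedure concrete, the following Python sketch (ours, not the authors' implementation) iteratively removes edges whose triangle support is below k:

# Minimal sketch: repeatedly remove edges supported by fewer than k triangles.
def k_mutual_friend(edges, k):
    """edges: iterable of (u, v) pairs; returns the surviving edge set."""
    adj, edge_set = {}, set()
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        edge_set.add(frozenset((u, v)))
    changed = True
    while changed:
        changed = False
        for e in list(edge_set):
            u, v = tuple(e)
            # triangle support Tr(e) = number of common neighbours of u and v
            if len(adj[u] & adj[v]) < k:
                edge_set.discard(e)
                adj[u].discard(v)
                adj[v].discard(u)
                changed = True
    return edge_set

# Example: a 4-clique survives for k = 2, a pendant edge does not.
g = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
print(k_mutual_friend(g, 2))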

Definition 1: "The maximum core C(G) of a graph G is a subgraph of G containing vertices with a degree at least β, where the value of β is the maximum possible." [15]

To make the idea clear, consider Figure (3). Vertices v9, v10, and v11 are removed first, since their degree is 1. The resulting graph is the 2-core of G, since all remaining vertices have degree at least 2. Next, vertex v7 is removed, and consequently vertices v6 and v8 are also removed, because their degrees have been reduced by the removal of v7. The resulting graph, composed of the vertices v1, v2, v3, v4, and v5, is the 3-core of G, since every vertex has degree at least 3. If we continue this process, we end up with an empty graph; therefore, the maximum core value of G is 3.

Figure 3: Example of maxflow computations [14].

The TopKDense algorithm (Algorithm 3 in Figure (4)) is developed using this procedure.
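A minimal sketch of this peeling procedure (our illustration, assuming the graph is given as an adjacency dictionary) is the following:

# Core peeling: repeatedly remove vertices of degree < beta, raising beta until
# the graph becomes empty; the last non-empty graph is the maximum core.
def maximum_core(adj):
    """adj: dict mapping vertex -> set of neighbours (undirected graph)."""
    adj = {v: set(ns) for v, ns in adj.items()}
    beta, best = 0, set(adj)
    while adj:
        beta += 1
        changed = True
        while changed:
            changed = False
            for v in [v for v, ns in adj.items() if len(ns) < beta]:
                for u in adj.pop(v):
                    if u in adj:
                        adj[u].discard(v)
                changed = True
        if adj:
            best = set(adj)            # vertices of the beta-core
    return beta - 1, best              # maximum core value and its vertex set

# Example: a triangle with one pendant vertex has maximum core value 2.
print(maximum_core({1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}))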

3.1.2 Disk Solution

The disk solution is based on using graph databases; streaming is then used to enhance it. A graph database is a database designed specifically to store graph data, keeping vertices and edges in a graph structure instead of storing the data conventionally in different tables. It also has index-free adjacency, where every vertex and edge holds a direct reference to its adjacent elements.

A solution for finding maximal k-mutual-friend subgraphs in a graph database is developed from the algorithm of the memory solution, with two main changes: first, graph traversal is used to access vertices and edges and to compute triangle counts; second, an index on edge attributes is used to mark edges as deleted and to record each edge's triangle count.

Figure 4: TopKDense algorithm [15].

3.2 Discovery of Top-k Dense Subgraphs in Dynamic Graph Collections

This approach starts with developing an algorithm for discovering dense subgraphs in a set of graphs, then uses it to develop another algorithm for a stream of graphs. The density of a graph can be defined as the average degree of all its nodes. Based on this definition, the densest subgraph of a graph can be found using Goldberg's method, which states: "The computation of the densest subgraph of a graph G requires a logarithmic number of steps, where in each step a maxflow computation is performed. The maxflow computations are performed on an augmented graph G' and not on the original graph G. More specifically, G is converted to a directed graph by substituting an edge between nodes u and v by two directed edges from u to v and backwards. These edges are assigned a capacity of 1" [15]. Based on this, the problem statement can be written as: find the top-k most dense subgraphs in a dynamic stream of graphs.

3.2.1 Dense Subgraph Discovery in a Set of Graphs

The procedure is as follows: first, the densest subgraph of each graph is computed; second, the edges that make up the densest subgraphs are removed, together with the resulting isolated nodes; the procedure is repeated until k subgraphs have been obtained [15]. One thing to notice in this procedure is that two dense subgraphs cannot share edges, although they may share some nodes. The Goldberg theorem can be used to avoid a significant number of maxflow operations, which makes the computation more efficient.

In the streaming setting, updating the top-k subgraphs starts with finding the densest subgraph of the newly arrived graph. If its density is greater than the current density threshold, the subgraph is added to the set, the least dense subgraph is removed, and the top-k subgraphs are updated accordingly; the TopKDense algorithm is extended to insert the new subgraph. The next issue is the expiration of a graph, where two cases exist: the expired graph either contributes no subgraph to the top-k set or it does. The first case is quite simple, as removing the expired graph has no effect, whereas the second case requires rescanning the active graphs for substitute subgraphs.
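A minimal sketch of the bookkeeping involved (ours; it uses the average-degree density definition above and a heap for the top-k set, and leaves out the maxflow-based densest-subgraph computation itself) looks like this:

import heapq

def density(num_edges, num_nodes):
    # average degree = 2|E| / |V|
    return 2.0 * num_edges / num_nodes if num_nodes else 0.0

class TopKDense:
    """Keeps the k densest subgraphs seen so far in a min-heap."""
    def __init__(self, k):
        self.k = k
        self.heap = []                  # entries are (density, subgraph_id)

    def threshold(self):
        # density of the least dense kept subgraph (0 until the set is full)
        return self.heap[0][0] if len(self.heap) >= self.k else 0.0

    def offer(self, subgraph_id, num_edges, num_nodes):
        d = density(num_edges, num_nodes)
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (d, subgraph_id))
        elif d > self.threshold():
            heapq.heapreplace(self.heap, (d, subgraph_id))   # drop the least dense

topk = TopKDense(k=3)
topk.offer("g1", num_edges=10, num_nodes=5)   # density 4.0
topk.offer("g2", num_edges=3,  num_nodes=4)   # density 1.5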

3.3 Discovering Large Dense Subgraphs in Massive Graphs

This approach aims to develop an algorithm that finds the most cohesive subgraphs by fingerprinting the whole graph.

3.3.1 Discovering Large Dense Subgraphs

The algorithm looks for clusters of nodes that link to largely the same destinations. Basically, the algorithm does the following. First, shingling is applied to each node in the graph, over the set of destinations linked to from that node, resulting in c shingles per node. Second, the sets of nodes sharing the same shingle are grouped. Third, to begin clustering nodes, the sets of nodes associated with a large number of shingles are found. Fourth, shingles that tend to occur on the same nodes are grouped together; this is yet another application of the (s, c) shingling algorithm, this time applied to the set of nodes associated with a particular shingle, and it brings together shingles that have significant overlap in the nodes on which they occur. These steps may be repeated several times as required. Figure (5) shows this recursive flow.

Figure 5: Recursive shingling flow [16].
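The following is a simplified sketch of (s, c) shingling (our illustration, not the paper's exact algorithm): for each of c hash seeds, the s out-neighbours with the smallest hash values are combined into one shingle, so nodes that link to similar destinations tend to share shingles.

import hashlib

def _h(seed, x):
    return hashlib.md5(f"{seed}:{x}".encode()).hexdigest()

def shingles(destinations, s=2, c=3):
    result = set()
    for seed in range(c):
        # the s destinations with the smallest hash values under this seed
        smallest = sorted(destinations, key=lambda d: _h(seed, d))[:s]
        result.add(_h(seed, "|".join(map(str, smallest))))
    return result

# Nodes with heavily overlapping destination sets are likely to share shingles.
a = shingles({"x.com", "y.com", "z.com", "w.com"})
b = shingles({"x.com", "y.com", "z.com", "q.com"})
print(len(a & b))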

4. Memcache

Memcached is a general-purpose distributed memory caching system. It is often used to speed up dynamic, database-driven websites by caching data and objects in RAM to reduce the number of times an external data source (such as a database or an API) must be read. Memcached's API provides a giant hash table (key-value store) distributed across multiple machines. Applications using Memcached typically layer requests and additions into RAM before falling back on a slower backing store, such as a database. Memcached provides a simple set of operations: set, get, and delete.

At Facebook, the high degree of output customization, combined with the high update rate of a typical user's News Feed, makes it impossible to generate the views presented to users ahead of time. Thus, the data set must be retrieved and rendered on the fly in a few hundred milliseconds. Facebook has always realized that even the best relational database technology available is a poor match for this challenge unless it is supplemented by a large distributed cache that offloads the persistent store. Memcache has played that role since Mark Zuckerberg installed it on Facebook's Apache web servers back in 2005; see Figure (6).

Figure 6: Memcache as a demand-filled look-aside cache. The left half illustrates the read path for a web server on a cache miss. The right half illustrates the write path [9].

When a web server needs data, it first requests the value from memcache by providing a string key. If the item addressed by that key is not cached, the web server retrieves the data from the database or another backend service and populates the cache with the key-value pair. For write requests, the web server issues SQL statements to the database and then sends a delete request to memcache that invalidates any stale data. After conversion to Memcached, a read might look like the following (pseudocode) [9]:

function get_foo(int userid) {
    /* first try the cache */
    data = memcached_fetch("userrow:" + userid);
    if (!data) {
        /* not found: request the database */
        data = db_select("SELECT * FROM users WHERE userid = ?", userid);
        /* then store in cache until the next get */
        memcached_add("userrow:" + userid, data);
    }
    return data;
}

The client first checks whether a Memcached value with the unique key "userrow:userid" exists, where userid is some number. If the result does not exist, it selects the row from the database as usual and sets the unique key using the Memcached API add call.
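For completeness, the write path described above (update the database first, then invalidate the stale cache entry) can be illustrated together with the read path by the following self-contained sketch (our illustration; plain dicts stand in for memcache and the database):

# Minimal look-aside caching demo: a dict stands in for memcache, another for the DB.
cache, database = {}, {"userrow:42": {"name": "Jon Jones"}}

def read(key):
    value = cache.get(key)              # 1. try the cache first
    if value is None:
        value = database.get(key)       # 2. cache miss: go to the backing store
        cache[key] = value              # 3. populate the cache for later reads
    return value

def write(key, value):
    database[key] = value               # 1. update the authoritative store
    cache.pop(key, None)                # 2. delete (invalidate) the stale cache entry

print(read("userrow:42"))               # miss, then cached
write("userrow:42", {"name": "Jon J."})
print(read("userrow:42"))               # re-fetched after invalidation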

5. TYPEAHEAD

A typeahead search is a dropdown menu that appears while you are searching for something; it guesses what you are searching for so you can find it faster. A typeahead query consists of a string, which is a prefix of the name of the individual the user is seeking. For example, if a user is typing the name "Jon Jones", the typeahead backend would sequentially receive queries for "J", "Jo", "Jon", "Jon ", "Jon J", etc. For each prefix, the backend returns a ranked list of individuals for whom the user might be searching. Some of these individuals will be within the user's explicit circle of friends and networks [4].
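A minimal sketch of prefix matching for such a backend (ours, not Facebook's implementation): binary-search a sorted name index for the typed prefix, then rank the searcher's friends ahead of everyone else.

import bisect

names = sorted(["jon jones", "jon jacobs", "joan jett", "lea lin"])
friends = {"jon jones"}                   # friends of the searcher

def typeahead(prefix, limit=5):
    lo = bisect.bisect_left(names, prefix)
    hi = bisect.bisect_right(names, prefix + "\uffff")
    matches = names[lo:hi]
    # friends of the searcher first, then everyone else alphabetically
    return sorted(matches, key=lambda n: (n not in friends, n))[:limit]

print(typeahead("jon"))                   # ['jon jones', 'jon jacobs']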

6. Unicorn

Unicorn is a system for searching the social graph. There had been an effort within Facebook to build an inverted-index system called Unicorn since 2009. By late 2010, Unicorn had been used for many projects at Facebook as an in-memory "database" for looking up entities given combinations of attributes. In 2011, Facebook decided to extend Unicorn into a search engine and to migrate all existing search backends to Unicorn as the first step towards building Graph Search. [5]

6.1 DATA MODEL

The social graph is sparse, so it is logical to represent it as a set of adjacency lists (posting lists). Unicorn is an inverted-index service that implements an adjacency-list service. Each posting list contains a sorted list of hits, which are (DocId, HitData) pairs. A DocId (document identifier) is a pair (sort-key, id), and HitData is just an array of bytes that stores data. The sort-key is an integer and enables the most globally important ids to be stored earlier in the posting list. Hits are sorted first by sort-key (highest first) and second by id (lowest first) [3].

Figure 7 : Converting a node’s edges into posting lists in an

inverted index. Users who are friends of id 8 correspond to ids in

hits of the posting list. Each hit also has a corresponding sort-key

and (optional) HitData byte array [3] .

PostingList_n → (Hit_{n,0}, Hit_{n,1}, ..., Hit_{n,k-1})
Hit_{i,j} → (DocId_{i,j}, HitData_{i,j})
DocId_{i,j} → (sort-key_{i,j}, id_{i,j})

In a full-text search system, HitData would be the place where positional information for matching documents is kept. In Unicorn, HitData is not present for all terms; it is used for storing extra data that is useful for filtering results. For example, a posting list might contain the ids of all users who graduated from a specific university, and in this case the HitData could store the graduation year and major.
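The following sketch of this data model is ours (illustrative only): hits are (DocId, HitData) pairs, a DocId is (sort-key, id), and hits are kept sorted by sort-key descending and then id ascending.

from typing import NamedTuple, Optional

class DocId(NamedTuple):
    sort_key: int
    id: int

class Hit(NamedTuple):
    doc_id: DocId
    hit_data: Optional[bytes] = None      # optional per-hit payload

def sorted_posting_list(hits):
    return sorted(hits, key=lambda h: (-h.doc_id.sort_key, h.doc_id.id))

# term "friend:8" -> users who are friends of id 8
friend_8 = sorted_posting_list([
    Hit(DocId(sort_key=30, id=5)),
    Hit(DocId(sort_key=30, id=2)),
    Hit(DocId(sort_key=70, id=11), hit_data=b"2009"),
])
print([h.doc_id for h in friend_8])        # highest sort-key first, then lowest id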

Unicorn is sharded (partitioned) by result-id (the ids in the query output) instead of by term, and it is optimized for handling graph queries. Unicorn chose to partition by result-id to keep the system available in the event of a dead machine or a network partition (the failure of a network device that causes a network to be split). For example, when queried for the friends of "Jon Jones", it is better to return some fraction of the friends of Jon Jones than no friends at all [3].

Posting lists in Unicorn are referenced by terms, which, by convention, are of the form:

<edge-type>:<id>

where edge-type is merely a string such as friend or like.

6.2 ARCHITECTURE

Client queries are sent to a Unicorn top-aggregator, which sends the query to one rack-aggregator per rack. These rack-aggregators send the query to all index servers in their respective racks. Figure (8) shows a top-aggregator communicating with a single tier (a set of all index partitions). Each index server is responsible for serving and accepting updates for one shard of the index. The rack-aggregators and the top-aggregator are responsible for combining and truncating results from multiple index shards in a sensible way before the top-aggregator finally sends a response back to the client.

Figure 8: Example of Unicorn cluster architecture with multiple verticals. The top-aggregator determines which vertical(s) each query needs to be sent to, and it sends the query to the racks for that vertical [3].

6.3 QUERY LANGUAGE

Clients send queries to Unicorn as Thrift requests (Thrift is a software library that implements cross-language RPC communication for any interfaces defined using it) [8]. "Thrift is used as the underlying protocol and transport layer for the Facebook Search service. The multi-language code generation is well suited for search because it allows for application development in an efficient server side language (C++) and allows the Facebook PHP-based web application to make calls to the search service using Thrift PHP libraries" [7].

And and Or operators:

Like many other search systems, Unicorn supports And and Or operators. For example, a client who wishes to find all female friends of Jon Jones (id = 5) would issue the query (and friend:5 gender:1). If there exists another user, Lea Lin, with id = 7, we could find all friends of Jon Jones or Lea Lin by issuing the query (or friend:5 friend:7).

Difference operator:

Unicorn also supports a Difference operator, which returns results from the first operand that are not present in the second operand. Continuing with the example above, we could find the female friends of Jon Jones who are not friends of Lea Lin by using (difference (and friend:5 gender:1) friend:7) [3].

"For some queries, results are simply returned in DocId ordering" [3]. In many applications, however, it is useful to give preference to results that match more terms, so Unicorn also uses a ranking method that sorts results by the number of terms they matched. For example, for the query (or friend:5 friend:7), some of the results will match friend:5, some will match friend:7, and some will match both; a higher rank is given to the results that matched both.
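The sketch below (ours, not Unicorn's implementation) shows And, Or, and Difference over posting lists represented as Python sets of result ids, plus ranking the Or results by how many operand lists they matched:

from collections import Counter

friend_5 = {2, 3, 9, 11}        # friends of Jon Jones (id 5)
friend_7 = {3, 4, 11, 20}       # friends of Lea Lin (id 7)
gender_1 = {3, 4, 9}            # female users

and_q  = friend_5 & gender_1                    # (and friend:5 gender:1)
or_q   = friend_5 | friend_7                    # (or friend:5 friend:7)
diff_q = (friend_5 & gender_1) - friend_7       # (difference (and ...) friend:7)

# rank Or results by the number of matching terms, highest first
counts = Counter()
for term_results in (friend_5, friend_7):
    counts.update(term_results)
ranked = sorted(or_q, key=lambda rid: -counts[rid])
print(and_q, diff_q, ranked)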

WeakAnd operator:

One problem with the Typeahead approach is that it makes no provision for social relevance: a query for "Jon" would not be likely to select people named "Jon" who are in the user's circle of friends. WeakAnd addresses this. It can be used when the user Jon Jones performs the query (weak-and (term friend:3 :optional-hits 2) (term melanie) (term mars*)), here for "Melanie Mars" (a prefix of the full name "Melanie Marshall"). In this query, Unicorn allows only a finite number of hits to be non-friends of Jon Jones. The benefit of WeakAnd is that it ensures some results will be friends of Jon Jones if any match the text, while also ensuring that very good results who are not friends with Jon Jones are not missed [3].

StrongOr operator:

StrongOr is a modification of Or that is useful for enforcing diversity in a result set. For example, if most of Jon Jones' friends live in either Beijing (100), San Francisco (101), or Palo Alto (102), we could fetch a geographically diverse sample of Jon's friends by executing:

(strong-or friend:5
    (and friend:5 live-in:100 :optional-weight 0.2)
    (and friend:5 live-in:101 :optional-weight 0.2)
    (and friend:5 live-in:102 :optional-weight 0.1))

The optional weights mean that at least 20% of the results must be people who live in Beijing, another 20% must be people who live in San Francisco, and 10% must be people who live in Palo Alto. The remainder of the results can be people from anywhere (including the cities above) [3].
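The quota idea behind these weights can be sketched as follows (our illustration, not Unicorn's code): reserve a minimum fraction of the result set for each weighted sub-query, then fill the remainder from the unconstrained operand.

def strong_or(main_results, weighted_results, limit):
    """weighted_results: list of (result_list, min_fraction) pairs."""
    picked, seen = [], set()

    def take(candidates, count):
        for rid in candidates:
            if count <= 0:
                break
            if rid not in seen:
                picked.append(rid)
                seen.add(rid)
                count -= 1

    for results, fraction in weighted_results:
        take(results, int(round(fraction * limit)))   # satisfy each quota first
    take(main_results, limit - len(picked))           # fill the rest from anywhere
    return picked[:limit]

friends      = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
in_beijing   = [2, 5]
in_sf        = [3, 7]
in_palo_alto = [9]
print(strong_or(friends, [(in_beijing, 0.2), (in_sf, 0.2), (in_palo_alto, 0.1)], limit=10))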

7. GRAPH SEARCH

Graph Search returns results that are more than one edge away from the source nodes. For example, we might want to know the pages liked by friends of Haneen Droubi (100001599373168) who like the Birzeit University page (129221627113013). We can answer this by first executing the query (and friend:100001599373168 likers:129221627113013); assume the result ids are 684969149 for Sharehan and 100000062277593 for Abdulfattah. We then collect the results and create a new query that produces the union of the pages liked (the edge name is likes) by any of these individuals:

Inner: (and friend:100001599373168 likers:129221627113013) → 684969149, 100000062277593
Outer: (or likes:684969149 likes:100000062277593)

This requires supporting queries that take more than one round-trip between the index server and the top-aggregator. Graph Search provides this through features like the Apply operator (Section 7.1).

7.1 Apply

Graph Search allows clients to query for a set of ids and then use those ids to construct and execute a new query; the Apply operator, described below, does this in a single request [3]. For example, if a client wants to find all the companies that employ his friends who live in Ramallah ("employers of my friends who live in Ramallah"), the client would issue a query of the form:

(apply works-at: (and friend:100001599373168 live-in:103122493061917))

The inner query, (and friend:100001599373168 live-in:103122493061917), is executed first and collects N results. After the results of the inner query are collected, the outer query is constructed from them. In our example, if the results of the inner query are the ids {022222131000001, 022220999100300, 0013009011}, then the top-aggregator constructs a new query (or works-at:022222131000001 works-at:022220999100300 works-at:0013009011) and sends this new query, again, to all user racks. Once the results of this outer query are gathered, they are re-sorted and truncated if necessary, and then they are returned to the client.
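A minimal sketch of these Apply semantics (ours, not Unicorn's code) is shown below: run the inner query, build an Or query over the prefixed inner results, and run that outer query.

def search(query, index):
    """Tiny evaluator for (and ...), (or ...) and plain terms over a dict index."""
    op, *args = query if isinstance(query, tuple) else ("term", query)
    if op == "term":
        return set(index.get(args[0], ()))
    parts = [search(a, index) for a in args]
    return set.intersection(*parts) if op == "and" else set.union(*parts)

def apply_op(prefix, inner_query, index, n=100):
    inner_ids = list(search(inner_query, index))[:n]              # inner round-trip
    outer_query = ("or", *[f"{prefix}:{i}" for i in inner_ids])   # e.g. works-at:<id>
    return search(outer_query, index)                             # outer round-trip

index = {
    "friend:1":    {10, 11},      # friends of user 1
    "live-in:99":  {11, 12},      # users who live in city 99
    "works-at:11": {500},         # employers of user 11
}
print(apply_op("works-at", ("and", "friend:1", "live-in:99"), index))   # {500}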

8. HIVE

Hive is an open-source framework built for analyzing data at Facebook (ad hoc querying, reporting, etc.). Hive maps its data tables to HDFS directories and works on top of Hadoop, reducing execution time and improving the scalability of map-reduce by adding querying facilities. HiveQL is its query language; the Hive driver compiles HiveQL queries into the map-reduce jobs and HDFS commands that carry them out. Hive makes it easier for users to specify different queries through an interface, and transformations can be done easily through user-defined functions, table functions, or HiveQL scripts. Many interfaces are used to work with Hive, such as a web-based GUI, the Hive CLI (command-line interface), and HiPal; these interfaces are used for ad hoc querying [11].

8.1 Hive - Hadoop

There are two sources of data at Facebook: (1) the federated MySQL tier, which contains all the Facebook data, and (2) the web servers, which produce all the log data. Hadoop uses Scribe servers to aggregate log data from the different web servers and place it in the Scribe-Hadoop clusters (scribeh) located in the data centers, where the data is written as HDFS files. The scale of the data in scribeh influences the design in several ways: around 30 TB of data are transferred daily, and if this data were uncompressed it would cause high traffic on the network, which is why it is kept in the scribeh data center. If the data is compressed, the latency depends on the log categories, which must fill a compression buffer; when a category is smaller than the compression buffer size, a delay is incurred until the buffer fills up or is flushed. Copier jobs are used to compress the scribeh cluster data and transfer it to the Hive-Hadoop clusters as HDFS data, where it is ready to be published and consumed. Compressed federated MySQL datasets are dumped into the Hive-Hadoop clusters using a scrape process; designing the scrapes to be resilient is a very important aspect of the process, to prevent failures and heavy load, and a replicated tier is used to run the scrape process before loading the data [13].

Figure 9: Dremel splitting records into tables [10].

Figure 10: Hive-Hadoop architecture [13].

There are two Hive-Hadoop clusters in which data is ready to be consumed by the down-stream processes: (1) the production Hive-Hadoop cluster, used for jobs that need to be executed on committed deadlines, and (2) the ad hoc Hive-Hadoop cluster, used for jobs that do not have to be executed quickly. The reason for separating the two clusters is the serious danger that poorly written ad hoc jobs will starve the production jobs. A Hive replication process is used to copy the updated rows of Hive tables and to propagate the metadata changes from the production cluster to the ad hoc cluster whenever both clusters need a specific dataset. The published data resides either in the Hive-Hadoop clusters (for future analysis) or in the federated MySQL tier (for Facebook to use when serving users) [13].

9. SCUBA

Scuba is a DBMS used to analyze data at Facebook, and it has many use cases. One of the most common is performance monitoring, using dashboards that run queries over data that is only a few seconds old to watch server load, network throughput, cache requests, misses, hits, and so on. These metrics are presented through different visualization graphs, such as pie charts and Sankey views; the graphs show week-over-week performance, where noticing a changed column makes it easy to discover a bug and try to fix it within minutes to hours. Trend analysis is another use case for Scuba: graphs show the frequencies of terms in posts over time, broken down by specific information such as age, country, and gender. Pattern mining is a further use case: product specialists are interested in collecting information about how people respond to new application versions and devices and in analyzing this information by age, gender, country, and so on, without caring about technical issues (how the data is received and collected) [12].

9.1 Scuba Architecture

The Scuba storage engine is deployed independently on many servers. Scuba consists of logical units organized as a tree (a root and leaves): each server is divided into eight leaves (determined by the current number of CPU cores), and each leaf holds the data tables that queries are executed against [12].

9.1.1 Data Architecture

Every data table consists of rows whose columns have one of these data types: integer, string, set of strings, or vector of strings. Scuba uses Scribe to log incoming data: a tailer process divides the data into batches and sends each one to Scuba via the Scuba Thrift API. Each batch has a compressed copy stored on disk; the batch is then read and its new rows are added to tables located in the Scuba leaves, where they are stored in memory together with their timestamps [12].

9.1.2 Query architecture

A Scuba query consists of a where clause, which specifies the minimum and maximum timestamps, and an aggregation function; conditions, order by, and group by are optional. Scuba relies on several interfaces for specifying queries: a web-based interface, where the user specifies the visualization type (pie chart, etc.) and chooses the query parameters through the dashboard; a command-line interface that takes an SQL query; and application code based on the Thrift API. Scuba contains several types of aggregators to execute a query. The root aggregator, specified by the user as the root of the cluster, is connected to intermediate aggregators with a fan-out of 5; it receives back the results of the queries from the intermediate aggregators and sends them to the user's view. Intermediate aggregators are linked to leaf aggregators with a fan-out of 5 to deliver the query; they receive the results back from the leaf aggregators and send them up to the root aggregator. Leaf aggregators send the query to the leaf servers and collect the results from each leaf server. Leaf servers evaluate the query in parallel on each node [12].
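This fan-out aggregation tree can be sketched as follows (our illustration only): each aggregator forwards the query to up to five children and merges their partial counts before passing the result up.

FANOUT = 5

def aggregate(node, query):
    """node: ('leaf', rows) or ('agg', children); query: predicate on a row."""
    kind, payload = node
    if kind == "leaf":
        return sum(1 for row in payload if query(row))     # leaf server: local scan
    return sum(aggregate(child, query) for child in payload[:FANOUT])

leaves = [("leaf", [{"latency": v} for v in rows])
          for rows in ([10, 90, 200], [15, 300], [5, 7, 9])]
root = ("agg", [("agg", leaves)])                          # root -> intermediate -> leaves
print(aggregate(root, lambda row: row["latency"] > 50))    # 3 slow requests in total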

10. DREMEL DATA MANAGEMENT SYSTEM

10.1 Dremel Architecture

Dremel runs many ad hoc queries on data without translating them into map-reduce jobs, as Hive does. Dremel's most distinctive characteristic is that it uses columnar storage for analyzing relational, nested data. To create and execute queries in Dremel quickly, the query processor and the data management tools must interoperate through two layers: (1) a common storage layer, such as GFS (a distributed file system), and (2) a shared storage format, where columnar storage is applied to the nested, record-based data found in Google+. There are several kinds of record data fields: a required field must exist in every record; an optional field can be absent from the record; and a repeated field can occur multiple times. A web-document schema may contain all these types: DocId is one of the most important required fields; Name is a field that contains Name.Url (the reference of the document); and the Links field (with backward and forward lists) is an optional field. The full path of a field in the record is expressed with dots between the field names, for example Name.Language.Country. The figure presents a schema that fits the records r1 and r2.

To store these records in columns as nested data records, several challenges arise, which are handled with repetition and definition levels. The repetition level is used to track where repeated values recur, whether across two records or within one record, while the definition level records how many fields on the path are actually defined, to prevent ambiguity when optional fields are missing from a record. Figure (9) clarifies how the records are reshaped into columns with a repetition level (r) and a definition level (d), where NULL values are used to specify which fields have a missing value, especially if such a value lies between two existing values [10].

To illustrate the idea, consider the DocId column: its repetition level r is 0 and its definition level d is also 0, because there is no need to spend bits on a required field that can never be repeated or missing. What about the Name.Language.Code column? Scanning record r1 from top to bottom, the Code field occurs three times inside three Name entries. The first occurrence of Code, the value 'en-us' inside the first Name, takes repetition level 0, since nothing on its path has repeated yet. For the definition level, one bit is counted for the Name field and one for the Language field, while Code itself contributes nothing because it is a required field inside the (optional) Language group (if the scanned path were Name.Language.Country, another bit would be added for the optional Country field); the d column is the sum of these bits, which is 2. Continuing the scan, the second occurrence of Code in the same Name entry, the value 'en', takes a repetition level of 2, since the repetition happens at the second repeated field on the path (Language). When the second Name entry is scanned, a NULL value indicates that Code is missing there: its r column is 1, because the repetition happens at the first repeated field (Name), and only the one bit of the Name field is added to the d column, giving d = 1. At the last occurrence of Code, the value 'en-gb', the r column is again 1, because only Name has repeated, and d is 2. In this way, the records are encoded into column format.

Splitting the records: to split the records of the Google+ datasets into columns with repetition and definition levels, an algorithm was developed that scans the records, maintains counters for the levels, and uses field writers to update the fields. Record assembly: after splitting the records into columns, there is a need to reassemble the records again, but only with the specified fields, while the others are pruned, to improve the efficiency of data reads or retrieval by map-reduce jobs. A finite state machine (FSM) is used for reading the fields' values [10].
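To make the encoding concrete, the sketch below (ours) lays out the Name.Language.Code column for the example records r1 and r2 used in the Dremel paper as (value, repetition level, definition level) triples, and shows how the column can be read back:

# Name.Language.Code column for the Dremel paper's example records r1 and r2,
# written as (value, repetition_level, definition_level) triples.
name_language_code = [
    ("en-us", 0, 2),  # r1, first Name, first Language
    ("en",    2, 2),  # r1, same Name, second Language (repetition at level 2)
    (None,    1, 1),  # r1, second Name has no Language (only Name is defined)
    ("en-gb", 1, 2),  # r1, third Name (repetition at level 1: Name)
    (None,    0, 1),  # r2 starts here (r = 0); its Name has no Language
]

# Reading the column back: r = 0 marks the start of a new record, and Code is
# actually present only when the definition level reaches 2.
records = []
for value, r, d in name_language_code:
    if r == 0:
        records.append([])
    if d == 2:
        records[-1].append(value)
print(records)   # [['en-us', 'en', 'en-gb'], []]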

10.2 Query Architecture

After the datasets have been stored in nested columnar storage, Dremel supports an SQL-based query language that produces one nested table from its input nested tables. For example, the query below selects the documents whose DocId is less than 20, aggregates the occurrences of Name.Language.Code with a COUNT expression WITHIN the Name sub-record, and displays Name.Url, for Urls beginning with http, concatenated with the language code as the value str [10]:

SELECT DocId AS Id,
       COUNT(Name.Language.Code) WITHIN Name AS cnt,
       Name.Url + ',' + Name.Language.Code AS str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

Figure 11: Results of the SQL query in Dremel [10].

To execute queries, Dremel passes each query through a multi-level serving tree. The root server receives incoming queries and reads the required data from the tables. The intermediate servers aggregate the partial results of query execution. The leaf servers communicate with the storage layer (e.g., GFS). Many queries need to be executed at the same time, since Dremel is a multi-user system; a query dispatcher schedules the queries according to their priorities and provides tolerance against failed or slow servers and replicas [10].

11. CONCLUSION

In this paper, we have described some of the techniques that Facebook uses to analyze and query social graphs. The high degree of output customization, combined with the high update rate of a typical user's News Feed, is a major feature of social networking systems and imposes a heavy load on backend data stores. We have described Memcache, a general-purpose distributed memory caching system that Facebook and other social networks use to retrieve and render data on the fly within a few hundred milliseconds. Facebook Graph Search uses the Unicorn engine in the background, which takes nodes as input and returns nodes as output.


Hive, Scuba, and Dremel are three systems used to analyze data in social networks such as Facebook and Google+. There are several differences between them: Hive is a data warehouse infrastructure used as a schema layer over Hadoop data, and it executes HiveQL (SQL-like) queries after translating them into map-reduce jobs, whereas Scuba answers queries over in-memory data and Dremel queries nested columnar data directly, without map-reduce.

This paper also discussed three different approaches to discovering cohesive subgraphs in a social graph. The first approach tends to be complete: it provides the algorithm, implements it in different solutions, and provides a visual system. The second approach provides a generic solution for finding the k most cohesive subgraphs; the significant area covered by this approach is the stream of graphs, which is a very important case for social networks, but the solution is not as complete as the first approach. The third approach seems to be more flexible for application to different sets of graphs.

12. ACKNOWLEDGMENTS

We take this opportunity to express our gratitude and deep

regards to our guide Dr. Mustafa Jarrar for his guidance,

monitoring and constant encouragement throughout the course.

13. REFERENCES

[1] Bronson, Nathan, Zach Amsden, George Cabrera, Prasad

Chakka, Peter Dimov, Hui Ding, Jack Ferris et al. "Tao:

Facebook’s distributed data store for the social graph."

In USENIX ATC. 2013.

[2] Facebook – Company Info. http://newsroom.fb.com.

[3] Curtiss, Michael, Iain Becker, Tudor Bosman, Sergey

Doroshenko, Lucian Grijincu, Tom Jackson, Sandhya

Kunnatur et al. "Unicorn: a system for searching the social

graph." Proceedings of the VLDB Endowment 6, no. 11

(2013): 1150-1161.

[4] Li, Guoliang, et al. "Efficient type-ahead search on

relational data: a tastier approach." Proceedings of the 2009

ACM SIGMOD International Conference on Management of

data. ACM, 2009.

[5] Facebook - Facebook Engineering: https://www.facebook.com/notes/facebook-engineering/under-the-hood-building-out-the-infrastructure-for-graph-search/10151347573598920

[6] Facebook - Facebook Engineering: https://www.facebook.com/note.php?note_id=389105248919

[7] Slee, Mark, Aditya Agarwal, and Marc Kwiatkowski.

"Thrift: Scalable cross-language services

implementation." Facebook White Paper 5 (2007).

[8] Abraham, Lior, John Allen, Oleksandr Barykin, Vinayak

Borkar, Bhuwan Chopra, Ciprian Gerea, Daniel Merl et al.

"Scuba: diving into data at facebook."Proceedings of the

VLDB Endowment 6, no. 11 (2013): 1057-1067.

[9] Nishtala, Rajesh, Hans Fugal, Steven Grimm, Marc

Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy et

al. "Scaling memcache at facebook." InProceedings of the

10th USENIX conference on Networked Systems Design and

Implementation, pp. 385-398. USENIX Association, 2013.

[10] Melnik, Sergey, Andrey Gubarev, Jing Jing Long, Geoffrey

Romer, Shiva Shivakumar, Matt Tolton, and Theo

Vassilakis. "Dremel: interactive analysis of web-scale

datasets." Proceedings of the VLDB Endowment 3, no. 1-2

(2010): 330-339.

[11] Apache Hadoop :http://wiki.apache.org/hadoop.

[12] Abraham, Lior, John Allen, Oleksandr Barykin, Vinayak

Borkar, Bhuwan Chopra, Ciprian Gerea, Daniel Merl et al.

"Scuba: diving into data at facebook."Proceedings of the

VLDB Endowment 6, no. 11 (2013): 1057-1067.

[13] Thusoo, Ashish, Zheng Shao, Suresh Anthony, Dhruba

Borthakur, Namit Jain, Joydeep Sen Sarma, Raghotham

Murthy, and Hao Liu. "Data warehousing and analytics

infrastructure at facebook." In Proceedings of the 2010 ACM

SIGMOD International Conference on Management of data,

pp. 1013-1020. ACM, 2010.

[14] Zhao, Feng, and Anthony KH Tung. "Large scale cohesive

subgraphs discovery for social network visual analysis."

In Proceedings of the 39th international conference on Very

Large Data Bases, pp. 85-96. VLDB Endowment, 2012.

[15] Valari, Elena, Maria Kontaki, and Apostolos N.

Papadopoulos. "Discovery of top-k dense subgraphs in

dynamic graph collections." In Scientific and Statistical

Database Management, pp. 213-230. Springer Berlin

Heidelberg, 2012.

[16] Gibson, David, Ravi Kumar, and Andrew Tomkins.

"Discovering large dense subgraphs in massive graphs."

In Proceedings of the 31st international conference on Very

large data bases, pp. 721-732. VLDB Endowment, 2005.