Techniques that Facebook Uses to Analyze and Query Social Graphs

Abdulfattah Safa, Haneen Droubi, Sharehan Bakri
Web Data Management Course
Master in Computing, Birzeit University, Palestine

This report is part of the Web Data Management Course, Master in Computing at Birzeit University, Palestine. Each student is given a topic, asked to read the most important scientific literature on that topic, and to criticize it and link it with the topics taught in the course.
ABSTRACT
In this paper we present subgraph discovery, one of the most important topics in social network graph analysis, by discussing three different approaches: the first discovers large-scale cohesive subgraphs; the second finds the top-k dense subgraphs, first for a static collection of graphs and then for a dynamic one; and the third finds large dense subgraphs in a massive graph. The paper then discusses Memcache, a general-purpose distributed memory caching system that Facebook uses to speed up its dynamic database-driven workload. After that we present Unicorn, the system Facebook uses for searching the social graph. Facebook uses many systems to analyze data; in this paper we illustrate two of them, Hive and Scuba, and compare them with an effective system used for analyzing Google's social network products: Dremel, a distributed system for querying big datasets stored as nested data.
Keywords
Social Graph, Query, Typeahead, Memcache, Unicorn, Apply Operator, Facebook, Aggregation, Hive, Dremel, Scuba, Hadoop, Google, MySQL, Subgraph, Graph Discovery, Cohesive Subgraph, Dense Subgraph.
1. Introduction and Motivation
Social network data are stored in graphs, where the vertices represent the social actors and the edges represent relations between them. Graph search and discovery is one of the most important graph operations, as it is the core of finding how graph vertices (actors) are connected to each other. This is very important for social networks: it allows studying the interests and habits of the people acting in them and then providing suggestions based on the results of that study. By definition, a highly connected graph is called a cohesive one. In a social network, a cohesive graph is a graph whose vertices share a large set of common social properties (relations).
One of the most popular social networks is Facebook, and with Facebook's history come several old search systems that Facebook had to unify in order to build Graph Search. At first, the old search on Facebook (called PPS) was keyword based: the searcher entered keywords, and the search engine produced a results page that was personalized and could be filtered to focus on specific kinds of entities. In 2009, Facebook started work on a new search product (called Typeahead) that would deliver search results as the searcher typed, i.e., by prefix matching.
Given the massive number of users who will be unhappy at any latency or bug in a Facebook page, Facebook's data management systems have become one of the hot topics in the technology world today. Their core challenges are how to handle the continuous rapid growth of data (scalability) and the diversity of the characteristics of user-submitted jobs (execution time, data delivery deadlines, etc.). Moreover, some users want to run ad hoc query analyses to test hypotheses or to answer functional questions, so data storage and representation become central concerns when dealing with user data.
In this paper we start with cohesive subgraph discovery in social networks, discussing three main approaches. In Section 4, Memcache is discussed as a general-purpose distributed memory caching system. In Sections 5 and 6 we discuss the Typeahead and Unicorn search systems. In the remaining sections, the data analysis systems (Hive, Scuba and Dremel) are discussed in detail.
2. THE SOCIAL GRAPH
The database that Facebook uses to maintain the inter-relationships between the people and things in the real world is called the social graph. It consists of nodes signifying people and things, and edges representing relationships between two nodes; in other words, the entities are the nodes and the relationships are the edges [3,1].
Each entity (node) has a primary key, which is a 64-bit identifier (id). Facebook also stores the edges between entities, and it has many types of edges: directional edges (e.g., the inverse of likes is likers, an edge from a page to a user who has liked that page), symmetric edges (e.g., the friend relation), and many others. Figure (1) illustrates a running example of how a user's checkin might be mapped to objects and associations.
Figure 1: A user's checkin mapped to objects and associations [1].
3. Cohesive Subgraph Discovery in Social
Networks
This section discusses three approaches to discovering cohesive subgraphs in social networks.
3.1 Large Scale Cohesive Subgraphs
This approach aims to develop an algorithm to discover cohesive subgraphs and then to use it in two solutions, one for memory and one for disk. To state the problem, the following terms are used:
Definition 1: "A k-core is a connected subgraph g such that each vertex v has degree d(v) ≥ k within the subgraph g." A related property concerns the edges of a clique: each edge in a k-clique is supported by k-2 triangles within the clique.
Definition 2: "A k-mutual-friend is a connected subgraph g ∈ G such that each edge is supported by at least k pairs of edges forming a triangle with that edge within g. The k-mutual-friend number of this subgraph, denoted as M(g), equals k."
Definition 3: "A maximal k-mutual-friend subgraph is a k-mutual-friend subgraph that is not a proper subgraph of any other k-mutual-friend subgraph" [14].
The main problem statement: find all the maximal k-mutual-friend subgraphs of a graph G(V,E). Two solutions were developed to solve this problem, one memory based and one disk based. First a memory-based solution is developed for small graphs; it is then extended to large graphs as the disk solution.
3.1.1 Memory Solution
The basic idea of the memory solution for finding maximal k-mutual-friend subgraphs is to iteratively drop all the vertices and edges that do not satisfy the condition; in other words, to remove, one by one, all edges that do not participate in at least k triangles, until the condition Tr(e) ≥ k holds for every remaining edge, where Tr(e) denotes the number of triangles containing edge e.
The solution is illustrated in Figure (2). The algorithm is applied on the graph with k = 2 as follows: in the first iteration, the edge (e-i) is removed along with the newly isolated vertex i, as it is part of no triangle. Then the edges (e-h), (e-g) and (f-h) are removed. In the third iteration, the edges (d-g), (f-g) and (g-h) become part of fewer than 2 triangles after the removal of the other edges and the vertex, so they are removed as well. Every edge in the resulting graph satisfies Tr(e) ≥ 2, so that graph is a cohesive subgraph.
Figure 2: Offline memory solution.
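To make the peeling idea concrete, the following sketch (in Python, over an in-memory adjacency-set representation; the names are ours and not taken from [14]) removes unsupported edges until every remaining edge participates in at least k triangles:

    def k_mutual_friend(adj, k):
        # adj: dict mapping each vertex to the set of its neighbors;
        # assumes vertex labels are mutually comparable (e.g. all strings)
        changed = True
        while changed:
            changed = False
            for u in list(adj):
                for v in list(adj[u]):
                    # Tr(e) = number of common neighbors of u and v
                    if u < v and len(adj[u] & adj[v]) < k:
                        adj[u].discard(v)   # drop the unsupported edge
                        adj[v].discard(u)
                        changed = True
            for u in [x for x in adj if not adj[x]]:
                del adj[u]                  # drop newly isolated vertices
        return adj

On the graph of Figure (2) with k = 2, this peeling reproduces the iterations described above.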
3.1.2 Disk Solution
The disk solution is based on using a graph database, with streaming then used to enhance it. A graph database is a database designed specifically to store graph data: it stores vertices and edges as a graph structure instead of storing data conventionally in tables. It also offers index-free adjacency, where every vertex and edge holds a direct reference to its adjacent elements.
The solution for finding maximal k-mutual-friend subgraphs in a graph database is derived from the algorithm of the memory solution, with two main changes. The first is that graph traversal is used to access vertices and edges as well as to compute triangle counts. The second is that an index is kept on edge attributes to mark edges as deleted and to record each edge's triangle count.
3.2 Discovery of Top-k Dense Subgraphs in
Dynamic Graph Collections
This approach starts by developing an algorithm for discovering dense subgraphs in a set of graphs, and then uses it to develop another one for a stream of graphs.
One can define the density of a graph as the average degree of its nodes. Based on this definition, one can find the densest subgraph of a graph using Goldberg's method, which states: "The computation of the densest subgraph of a graph G requires a logarithmic number of steps, where in each step a maxflow computation is performed. The maxflow computations are performed on an augmented graph G' and not on the original graph G. More specifically, G is converted to a directed graph by substituting an edge between nodes u and v by two directed edges from u to v and backwards. These edges are assigned a capacity of 1" [15].
Based on Goldberg's method, one can state the problem as: find the top-k most dense subgraphs in a dynamic stream of graphs.
3.2.1 Dense Subgraph Discovery in a Set of Graphs
The procedure is as follows: first the densest subgraph is computed; second, the edges that comprise it are removed, along with the resulting isolated nodes. The procedure is repeated until k subgraphs have been obtained [15]. One consequence of this procedure is that two dense subgraphs cannot share edges, although they may still share some nodes. While carrying out the steps above, one can prune a significant number of maxflow operations, which results in a more efficient computation. To enable this pruning, the following definition is used:
Definition: "The maximum core C(G) of a graph G is a subgraph of G containing vertices with a degree at least β, where the value of β is the maximum possible" [15].
To make the idea clear, consider Figure (3). Vertices v9, v10 and v11 are removed first, since their degree is 1. The resulting graph is the 2-core of G, since all vertices have degree at least 2. Next, vertex v7 is removed, and consequently vertices v6 and v8 are also removed, because their degrees were reduced by the removal of v7. The resulting graph, composed of the vertices v1, v2, v3, v4 and v5, is the 3-core of G, since every vertex has degree at least 3. Continuing the process from this point would yield an empty graph, so the maximum core value of G is 3.
Figure 3: Example of maximum core computation [15].
The TopKDense algorithm, Algorithm 3 in Figure (4), is developed using this procedure.
Figure 4: TopKDense algorithm [15].
Updating the top-k subgraphs starts with finding the densest subgraph of a newly arrived graph. If its density is greater than the density threshold of the current top-k set, the subgraph is added to the set and the process continues.
The TopKDense algorithm can be updated to insert the new subgraph: when one is inserted, another is removed, and the top-k subgraphs are updated accordingly. The next concern is the expiration of a graph. Here two cases exist: the first is when the expired graph has no subgraph in the top-k set, and the other is when it does. The first case is quite simple, as removing the expired graph has no effect, whereas the second requires rescanning the active graphs for substitute subgraphs.
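To make the procedure of Section 3.2.1 concrete, here is a rough sketch (our own code; for brevity it uses the simple greedy peeling heuristic in place of Goldberg's exact maxflow computation, so it only approximates the densest subgraph):

    def density(adj):
        # density = average degree = (sum of degrees) / |V|
        return sum(len(v) for v in adj.values()) / len(adj) if adj else 0.0

    def densest_subgraph(adj):
        # greedy peeling: repeatedly remove a minimum-degree vertex,
        # remembering the densest intermediate graph seen
        adj = {u: set(vs) for u, vs in adj.items()}
        best = {u: set(vs) for u, vs in adj.items()}
        while adj:
            u = min(adj, key=lambda x: len(adj[x]))
            for v in adj.pop(u):
                adj[v].discard(u)
            if density(adj) > density(best):
                best = {x: set(vs) for x, vs in adj.items()}
        return best

    def top_k_dense(adj, k):
        # extract a dense subgraph, remove its edges and the nodes it
        # isolates, and repeat until k subgraphs have been found
        found, adj = [], {u: set(vs) for u, vs in adj.items()}
        while len(found) < k and any(adj.values()):
            d = densest_subgraph(adj)
            found.append(d)
            for u, vs in d.items():
                for v in vs:
                    adj[u].discard(v)
            for u in [x for x in adj if not adj[x]]:
                del adj[u]
        return found

As in the text, the extracted subgraphs are edge-disjoint but may still share nodes.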
3.3 Discovering Large Dense Subgraphs in
Massive Graphs
This approach aims to develop an algorithm that finds the most cohesive subgraphs by fingerprinting the whole graph.
3.3.1 Discovering Large Dense Subgraphs
The algorithm looks for clusters of nodes that tend to link to the same destinations. Basically, it does the following: first, shingling is applied to each node's set of linked destinations, yielding c shingles per node. Second, the nodes that share a shingle are grouped together. Third, to begin clustering nodes, the sets of nodes associated with a large number of shingles are found. Fourth, shingles that tend to occur on the same nodes are grouped: the (s, c) shingling algorithm is applied once more, this time to the set of nodes associated with each particular shingle, bringing together shingles that have significant overlap in the nodes on which they occur. This sequence may be repeated as many times as required. Figure (5) shows these steps.
Figure 5: Recursive shingling flow [16].
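A minimal sketch of the first two steps (our own illustrative code; the real (s, c) shingling of [16] relies on carefully constructed min-hash functions to run at web scale):

    from collections import defaultdict

    def shingles(dests, s=2, c=3):
        # produce c shingles for a node: each shingle is the tuple of s
        # destinations that minimize one of c salted hash functions
        out = []
        for salt in range(c):
            ranked = sorted(dests, key=lambda d: hash((salt, d)))
            out.append(tuple(ranked[:s]))
        return out

    def group_by_shingle(graph):
        # graph: dict node -> set of destinations it links to;
        # returns, for each shingle, the set of nodes carrying it
        groups = defaultdict(set)
        for node, dests in graph.items():
            if len(dests) >= 2:
                for sh in shingles(dests):
                    groups[sh].add(node)
        return groups

Applying the same grouping a second time, to the sets of nodes associated with each shingle, clusters shingles that co-occur on the same nodes, which is the recursion that Figure (5) describes.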
4. Memcache
Memcached is a general-purpose distributed memory caching system. It is often used to speed up dynamic, database-driven websites by caching data and objects in RAM, reducing the number of times an external data source (such as a database or API) must be read. Memcached's API provides a giant hash table (key-value store) distributed across multiple machines. Applications using Memcached typically consult this RAM-resident layer first before falling back on a slower backing store, such as a database. Memcached provides a simple set of operations: set, get, and delete.
At Facebook, the high degree of output customization, combined with the high update rate of a typical user's News Feed, makes it impossible to generate the views presented to users ahead of time. Thus, the data set must be retrieved and rendered on the fly within a few hundred milliseconds. Facebook has always recognized that even the best relational database technology available is a poor match for this challenge unless it is supplemented by a large distributed cache that offloads the persistent store. Memcache has played that role since Mark Zuckerberg installed it on Facebook's Apache web servers back in 2005; see Figure (6).
Figure 6: Memcache as a demand-filled look-aside cache. The left half illustrates the read path for a web server on a cache miss. The right half illustrates the write path [9].
When a web server needs data, it first requests the value from memcache by providing a string key. If the item addressed by that key is not cached, the web server retrieves the data from the database or another backend service and populates the cache with the key-value pair. For write requests, the web server issues SQL statements to the database and then sends a delete request to memcache that invalidates any stale data. After conversion to Memcached, a database read might look like the following (pseudocode) [9]:
function get_foo(int userid) {
    /* first try the cache */
    data = memcached_fetch("userrow:" + userid);
    if (!data) {
        /* not found: request the database */
        data = db_select("SELECT * FROM users WHERE userid = ?", userid);
        /* then store in the cache until the next get */
        memcached_add("userrow:" + userid, data);
    }
    return data;
}
The client first checks whether a Memcached value with the unique key "userrow:userid" exists, where userid is some number. If it does not, the client selects from the database as usual and sets the key using the Memcached add call.
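The look-aside pattern can be sketched end-to-end as follows (self-contained Python; plain dicts stand in for memcache and the database, and the names are illustrative rather than Facebook's actual API):

    cache, database = {}, {"userrow:7": {"name": "Jon Jones"}}

    def get_user(userid):
        key = "userrow:%d" % userid
        data = cache.get(key)       # 1. try the cache first
        if data is None:
            data = database[key]    # 2. on a miss, read the backing store
            cache[key] = data       # 3. populate the cache for later reads
        return data

    def update_user(userid, row):
        key = "userrow:%d" % userid
        database[key] = row         # 1. the write goes to the database
        cache.pop(key, None)        # 2. a delete invalidates stale data

    get_user(7)                     # miss: fills the cache
    get_user(7)                     # hit: served from memory

Deleting rather than updating the cached value on writes is the choice described in [9], since deletes are idempotent and safe under concurrent writers.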
5. TYPEAHEAD
A typeahead search is a dropdown menu that appears while the user is searching for something: it guesses what the user is searching for, so that it can be found faster. A typeahead query consists of a string which is a prefix of the name of the individual the user is seeking. For example, if a user is typing in the name "Jon Jones", the typeahead backend would sequentially receive queries for "J", "Jo", "Jon", "Jon ", "Jon J", etc. For each prefix, the backend returns a ranked list of individuals for whom the user might be searching. Some of these individuals will be within the user's explicit circle of friends and networks [4].
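As a toy illustration of prefix matching (our own sketch; the production backend ranks by social relevance rather than by a static score):

    def typeahead(prefix, people, limit=10):
        # return a ranked list of people whose name starts with the prefix
        p = prefix.lower()
        hits = [x for x in people if x["name"].lower().startswith(p)]
        return [x["name"] for x in sorted(hits, key=lambda x: -x["score"])][:limit]

    people = [{"name": "Jon Jones", "score": 9},
              {"name": "Jon Jonas", "score": 4}]
    for q in ["J", "Jo", "Jon J"]:      # queries arrive as the user types
        print(q, "->", typeahead(q, people))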
6. Unicorn
Unicorn is a system for searching the social graph. Starting in 2009, there was an effort within Facebook to build an inverted-index system called Unicorn, and by late 2010 Unicorn had been used in many projects at Facebook as an in-memory "database" for looking up entities given combinations of attributes. In 2011, Facebook decided to extend Unicorn into a search engine and to migrate all existing search backends to it as the first step towards building Graph Search [5].
6.1 DATA MODEL
The social graph is sparse, so it is logical to represent it as a set of adjacency lists (posting lists). Unicorn is an inverted-index service that implements an adjacency-list service. Each posting list contains a sorted list of hits, which are (DocId, HitData) pairs. A DocId (document identifier) is a pair (sort-key, id), and HitData is just an array of bytes that stores data. The sort-key is an integer, and it enables the most globally important ids to be stored earlier in the posting list. Hits are sorted first by sort-key (highest first) and then by id (lowest first) [3].
Figure 7: Converting a node's edges into posting lists in an inverted index. Users who are friends of id 8 correspond to ids in the hits of the posting list. Each hit also has a corresponding sort-key and an (optional) HitData byte array [3].
PostingList_n → (Hit_{n,0}, Hit_{n,1}, ..., Hit_{n,k-1})
Hit_{i,j} → (DocId_{i,j}, HitData_{i,j})
DocId_{i,j} → (sort-key_{i,j}, id_{i,j})
In a full-text search system, HitData would be the place where positional information for matching documents is kept. In Unicorn, HitData is not present for all terms; it is used for storing extra data useful for filtering results. For example, a posting list might contain the ids of all users who graduated from a specific university, and in that case the HitData could store the graduation year and major.
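The data model can be sketched as follows (illustrative Python; the type names mirror the terms above, but the layout is of our choosing):

    from typing import NamedTuple, Optional

    class DocId(NamedTuple):
        sort_key: int               # globally important ids get higher sort-keys
        id: int

    class Hit(NamedTuple):
        doc_id: DocId
        hit_data: Optional[bytes]   # optional payload, e.g. graduation year

    def sorted_hits(hits):
        # sort-key highest first, then id lowest first
        return sorted(hits, key=lambda h: (-h.doc_id.sort_key, h.doc_id.id))

    posting_lists = {"friend:8": sorted_hits([Hit(DocId(3, 10), None),
                                              Hit(DocId(7, 5), None)])}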
Unicorn is sharded (partitioned) by result-id (the ids in the query output) instead of by term, and it is optimized for handling graph queries. Unicorn chose to partition by result-id to keep the system available in the event of a dead machine or a network partition (the failure of a network device that causes a network to be split): for example, when queried for the friends of Jon Jones, it is better to return some fraction of his friends than no friends at all [3].
Posting lists in Unicorn are referenced by terms, which, by
convention, are of the form:
<edge-type>:<id>
Edge-type is merely a string such as friend or like.
6.2 ARCHITECTURE
Client queries are sent to a Unicorn top-aggregator, which sends the query to one rack-aggregator per rack; these rack-aggregators in turn send the query to all index servers in their respective racks. Figure (8) shows a top-aggregator communicating with a single tier (a set of all index partitions).
Each index server is responsible for serving, and accepting updates for, one shard of the index. The rack-aggregators and the top-aggregator are responsible for combining and truncating results from multiple index shards in a sensible way before the top-aggregator finally sends a response back to the client.
Figure 8: Example of a Unicorn cluster architecture with multiple verticals. The top-aggregator determines which vertical(s) each query needs to be sent to, and it sends the query to the racks for that vertical [3].
6.3 QUERY LANGUAGE
Clients send queries to Unicorn as Thrift requests (Thrift is a software library that implements cross-language RPC communication for any interfaces defined using it) [8]. "Thrift is used as the underlying protocol and transport layer for the Facebook Search service. The multi-language code generation is well suited for search because it allows for application development in an efficient server-side language (C++) and allows the Facebook PHP-based web application to make calls to the search service using Thrift PHP libraries" [7].
And and Or operators:
Like many other search systems, Unicorn supports And and Or operators. For example, a client who wishes to find all female friends of Jon Jones (id = 5) would issue the query (and friend:5 gender:1). If there exists another user Lea Lin with id = 7, we could find all friends of Jon Jones or Lea Lin by issuing the query (or friend:5 friend:7).
Difference operator:
Unicorn also supports a Difference operator, which returns results from the first operand that are not present in the second operand. Continuing with the example above, we could find female friends of Jon Jones who are not friends of Lea Lin by using (difference (and friend:5 gender:1) friend:7) [3].
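These set operators are easy to picture over posting lists of plain ids (a toy evaluator of our own; real Unicorn hits also carry sort-keys and HitData, and execution is distributed):

    def evaluate(index, query):
        # query is a nested tuple, e.g. ("and", ("term", "friend:5"), ...)
        op, *args = query
        if op == "term":
            return set(index.get(args[0], ()))
        sets = [evaluate(index, a) for a in args]
        if op == "and":
            return set.intersection(*sets)
        if op == "or":
            return set.union(*sets)
        if op == "difference":
            return sets[0] - set.union(*sets[1:])

    index = {"friend:5": [2, 3, 7], "friend:7": [3, 9], "gender:1": [3, 7]}
    q = ("difference", ("and", ("term", "friend:5"), ("term", "gender:1")),
         ("term", "friend:7"))
    print(evaluate(index, q))   # {7}: female friends of 5 not friends of 7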
"For some queries, results are simply returned in DocId ordering" [3], but in many applications it is useful to give preference to results that match more terms, so Unicorn offers another ranking method that sorts results by the number of matched terms. For example, for the query (or friend:5 friend:7), some of the results will match friend:5, some will match friend:7, and some will match both; the highest rank is given to the results that matched both.
WeakAnd operator:
One problem with the Typeahead approach is that it makes no provision for social relevance: a query for "Jon" would not be likely to select people named "Jon" who are in the user's circle of friends. WeakAnd addresses this. When the user Jon Jones performs the query (weak-and (term friend:3 :optional-hits 2) (term melanie) (term mars*)), here seeking "Melanie Mars" (a prefix of the full name "Melanie Marshall"), Unicorn allows only a finite number of hits to be non-friends of Jon Jones. The benefit of WeakAnd is that it ensures that some results will be friends of Jon Jones if any friends match the text, while also ensuring that we do not miss very good results that are not friends with Jon Jones [3].
StrongOr operator
StrongOr is a modification of Or that is useful for enforcing diversity in a result set. For example, if most of Jon Jones' friends live in either Beijing (100), San Francisco (101), or Palo Alto (102), we could fetch a geographically diverse sample of his friends by executing:
(strong-or friend:5
(and friend:5 live-in:100
:optional-weight 0.2)
(and friend:5 live-in:101
:optional-weight 0.2)
(and friend:5 live-in:102
:optional-weight 0.1))
The optional weights mean that at least 20% of the results must be people who live in Beijing, another 20% must be people who live in San Francisco, and 10% must be people who live in Palo Alto. The remainder of the results can be people from anywhere (including the cities above) [3].
7. GRAPH SEARCH
Graph Search returns results that are more than one edge away from the source nodes. For example, we might want to know the pages liked by friends of Haneen Droubi (100001599373168) who like the BirzeitUniversity page (129221627113013). We can answer this by first executing the inner query (and friend:100001599373168 likers:129221627113013); assume its result ids are 684969149 for Sharehan and 100000062277593 for Abdulfattah. We collect these results and create a new query that produces the union of the pages liked (the edge name is likes) by any of these individuals:
Inner: (and friend:100001599373168 likers:129221627113013) → 684969149, 100000062277593
Outer: (or likes:684969149 likes:100000062277593)
This requires supporting queries that take more than one round-trip between the index server and the top-aggregator. Graph Search has features like Apply (Section 7.1).
7.1 Apply
Graph Search allows clients to query for a set of ids and then use those ids to construct and execute a new query [3]. For example, if a client wants to find all companies that employ his friends who live in Ramallah ("employers of my friends who live in Ramallah"), the client issues the query:
(apply works-at: (and friend:100001599373168 live-in:103122493061917))
The inner query (and friend:100001599373168 live-in:103122493061917) collects N results.
After collecting the results of the inner query, the outer query is constructed from them. In our example, if the results of the inner query are the ids {022222131000001, 022220999100300, 0013009011}, then the top-aggregator constructs a new query (or works-at:022222131000001 works-at:022220999100300 works-at:0013009011) and sends this new query, again, to all user racks. Once the results for this outer query are gathered, they are re-sorted and truncated if necessary, and then returned to the client.
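Reusing the toy evaluate function from Section 6.3, the operator can be sketched as follows (our own simplification; the real top-aggregator also re-sorts, truncates, and routes across verticals):

    def apply_operator(index, prefix, inner_query, n=5000):
        # run the inner query, then or-together terms built from its ids
        inner_ids = sorted(evaluate(index, inner_query))[:n]   # collect N results
        if not inner_ids:
            return set()
        outer = ("or",) + tuple(("term", prefix + str(i)) for i in inner_ids)
        return evaluate(index, outer)

    # Employers of my friends who live in Ramallah (ids as in the example):
    # apply_operator(index, "works-at:",
    #                ("and", ("term", "friend:100001599373168"),
    #                        ("term", "live-in:103122493061917")))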
8. HIVE
Hive is an open source framework built for analyzing data at Facebook (ad hoc querying, reporting, etc.). Hive maps its data tables onto HDFS directories and runs on top of Hadoop, adding querying facilities that reduce development time while preserving the scalability of map-reduce. HiveQL is Hive's query language; the Hive driver compiles a HiveQL query into the map-reduce jobs and HDFS commands that execute it. Using Hive makes it easier for users to specify different queries through an interface, and transformations can be done easily through user-defined functions, table functions, or HiveQL scripts. Many interfaces are available for working with Hive, such as a web-based GUI, the Hive CLI (command line interface), and HiPal; these interfaces are used for ad hoc querying [11].
8.1 Hive - Hadoop
There are two sources of data at Facebook: (1) the federated MySQL tier, which contains all the Facebook site data, and (2) the web servers, which produce all the log data. Scribe servers aggregate log data from the different web servers and deposit it in scribe-Hadoop clusters (scribeh) located in the data centers, where the data is written as HDFS files. The scale of the data influences scribeh in several ways: about 30 TB of data are transferred daily, and if this data were uncompressed it would cause high network traffic, which is why the scribeh clusters are located in the data centers. Compression, in turn, depends on the log categories filling the compression buffer: if a category's data is below the compression buffer size, its delivery is delayed until the buffer fills up or is flushed. Copier jobs then compress and transfer the scribeh cluster data to the Hive-Hadoop clusters as HDFS data, where it is ready to be published and consumed. Compressed federated MySQL datasets are dumped into the Hive-Hadoop clusters by a scrape process; designing the scrapes to be resilient is a very important aspect of the process, to prevent failures and heavy load, and a replicated tier is used to run the scrape process before loading the data [13].
Figure 10: Hive-Hadoop architecture [13].
There are two Hive-Hadoop clusters in which data is ready to be consumed by down-stream processes: (1) the production Hive-Hadoop cluster, used for jobs that must be executed on committed deadlines, and (2) the ad hoc Hive-Hadoop cluster, used for jobs without strict deadlines. The reason for separating the two clusters is the danger that poorly written ad hoc jobs would starve the production jobs. A Hive replication process copies the row data of updated Hive tables, together with the corresponding metadata changes, from the production cluster to the ad hoc cluster whenever both clusters need a specific dataset. The published data thus lives either in a Hive-Hadoop cluster (for future analysis) or in the federated MySQL tier (for Facebook's use in serving users) [13].
9. SCUBA
Scuba is a DBMS used to analyze data at Facebook, and it has many use cases. One of the most common is performance monitoring, using dashboards that run queries over data that is only a few seconds old to watch server load, network throughput, cache requests, misses, hits, etc. These metrics are presented through different visualizations, such as pie charts and Sankey views, and the graphs can show week-over-week performance, where spotting a changed column makes it easy to discover a bug and fix it within minutes to hours. Trend analysis is another use case for Scuba: graphs show the frequencies of terms appearing in posts over time, broken down by specific attributes such as age, country, and gender. In pattern mining, product specialists are interested in collecting information about how people respond to new application versions and devices, and in analyzing attributes such as age, gender, and country, without caring about the technical issues (how the data is received and collected) [12].
9.1 Scuba Architecture
Scuba's storage engine is deployed on many servers that operate independently. Scuba is organized into logical units that form a tree (consisting of a root and leaves): each server is divided into 8 leaves (a number chosen to match the current count of CPU cores), and each leaf holds the data tables on which queries are executed [12].
9.1.1 Data Architecture
Every data table consists of rows whose columns have one of the following types: integer, string, set of strings, or vector of strings. Scuba uses Scribe to log incoming data: a tailer process divides the data into batches and sends each batch to Scuba via Scuba's Thrift API. Each batch is stored compressed on disk, and its rows are read and added to tables located in the Scuba leaves, held in memory with their timestamps [12].
9.1.2 Query Architecture
A Scuba query contains a where clause that specifies the minimum and maximum timestamps, and an aggregation function; conditions, order by, and group by are optional. Scuba offers several interfaces for specifying a query: a web-based interface, where the user picks the visualization type (pie chart, etc.) and chooses the query parameters through the dashboard; a command-line interface that accepts an SQL query; and a Thrift-based API that application code can call. Scuba uses several kinds of aggregators to execute a query. The root aggregator, chosen as the root of the cluster for the query, is connected to intermediate aggregators with a fan-out of 5; it returns the results received from the intermediate aggregators to the user's view. Each intermediate aggregator is linked, again with a fan-out of 5, to leaf aggregators, delivers the query down to them, and sends their results back to the root aggregator. A leaf aggregator sends the query to the leaf servers and collects the results from each of them, and every leaf server processes the query on its leaves in parallel [12].
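The aggregation tree can be pictured with a few lines of Python (our own toy code, counting matching rows with the paper's fan-out of 5; real Scuba leaves scan compressed in-memory tables):

    def aggregate(node, predicate):
        if "rows" in node:          # leaf server: scan its in-memory rows
            return sum(1 for r in node["rows"] if predicate(r))
        # root or intermediate aggregator: fan out to up to 5 children
        return sum(aggregate(c, predicate) for c in node["children"][:5])

    leaf = {"rows": [{"latency_ms": 12}, {"latency_ms": 80}]}
    root = {"children": [{"children": [leaf] * 5}] * 5}
    print(aggregate(root, lambda r: r["latency_ms"] > 50))   # 25 slow requests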
10. DREMEL DATA MANAGEMENT
SYSTEM
10.1 Dremel Architecture
Dremel executes many ad hoc queries on data in situ, without translating them into map-reduce jobs as Hive does. Its most distinctive characteristic is its columnar storage for analyzing nested relational data. For queries to be created and executed quickly in Dremel, the query processor and the data management tools must interoperate through two layers: (1) a common storage layer such as GFS, the distributed file system that holds data such as Google+'s, and (2) a shared storage format, where columnar storage of Dremel's nested records is very useful.
A record's data fields can be of several kinds: a required field must exist in the record; an optional field can be absent from it; and a repeated field may occur multiple times. A web-document schema may contain all of these kinds: DocId is one of the most important required fields; Name is a repeated group that contains the Url of the document; and the Links field (holding backward and forward links) is optional. The full path of a field is expressed with dots between the field names, for example Name.Language.Country. Figure (9) presents a schema that fits the records r1 and r2 and shows how they are split into columns.
Figure 9: Dremel splitting records into tables [10].
Storing these records in columns as nested data raises a challenge, addressed by repetition and definition levels: the repetition level is used to reassemble repeated values correctly, whether they repeat across two records or within one record, while the definition level records which fields on a path are defined, so that missing optional fields do not cause records to be reassembled incorrectly. Figure (9) also shows how the records are reshaped into columns with a repetition level (r) and a definition level (d), where NULL values mark fields with missing values, especially when such a value lies between two existing values [10].
To illustrate the idea, consider the DocId column: its repetition level r is 0 and its definition level d is also 0, because there is no need to keep bits for a required field that is never repeated. Now consider the Name.Language.Code column. Scanning record r1 from top to bottom, the field Code occurs three times inside three Name fields, so its repetition values range from 0 to 2. The first occurrence of Code, the value en-us inside the first Name field, takes r = 0. For the definition level, one bit is counted for Name and one bit for the (optional) Language, but none for the required Code; d is the sum of these bits, which is 2 (if the scanned path were Name.Language.Country, another bit would be added for the optional Country). Continuing the scan, the second occurrence of Code in the same Name field, the value en, takes r = 2, since the repetition happens at Language, the second field on the path. When the second Name field is scanned, a NULL value indicates that Code is missing there; r is 1 because Name has occurred before, and only the one bit of the Name field contributes to d. At the last occurrence of Code, the value en-gb, r is 1 because the repeated field on the path is Name (level 1). Hence the records are encoded into a column format.
Splitting the records: to split the records of the Google+ datasets into columns with repetition and definition levels, an algorithm was developed that scans the records, maintains counters for each level, and uses field writers to emit the field values. Record assembly: after splitting the records into columns, the records must be reassembled, but with only the specified fields, the others being pruned to improve the efficiency of data reads and retrieval in a map-reduce job. A finite state machine (FSM) is used for reading the field values [10].
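The walk-through above amounts to the following column for Name.Language.Code in record r1, written out as Python data for readability (record r2 would contribute one further NULL entry; its exact levels depend on r2's shape and are not covered by the text):

    name_language_code = [   # (value, repetition level r, definition level d)
        ("en-us", 0, 2),     # first Code in r1: nothing has repeated yet
        ("en",    2, 2),     # repeats at Language (level 2) in the same Name
        (None,    1, 1),     # second Name has no Code: NULL, defined to Name only
        ("en-gb", 1, 2),     # a new Name repeats at level 1
    ]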
10.2 Query Architecture
After the datasets are stored in nested columnar storage, Dremel provides an SQL-based query language that produces one nested table from its input nested tables. In the example below, the query selects the documents whose DocId is less than 20, counts the occurrences of Name.Language.Code WITHIN each sub-record Name using the COUNT expression, and displays, as the value Str, the Name.Url beginning with http concatenated with the language code [10]:
SELECT DocId AS Id,
       COUNT(Name.Language.Code) WITHIN Name AS Cnt,
       Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;
Figure 11: Result of the SQL query in Dremel [10]:

t1
  Id: 10
  Name
    Cnt: 2
    Language
      Str: 'http://A,en-us'
      Str: 'http://A,en'
  Name
    Cnt: 0
To execute a query, Dremel pushes it down through a multi-level serving tree. The root server receives the incoming query and reads the data from the tables it refers to; the intermediate servers aggregate the partial results of query execution; and the leaf servers communicate with the storage layer (e.g., GFS). Dremel is a multi-user system, so many queries may need to be executed at the same time; a query dispatcher schedules queries according to their priorities, balances the load, and provides fault tolerance when a server becomes slow or a replica becomes unreachable [10].
11. CONCLUSION
In this paper we have described some of the techniques that Facebook uses to analyze and query social graphs. The high degree of output customization, combined with the high update rate of a typical user's News Feed, is a major feature of social networking systems and imposes a heavy load on backend data stores. We described Memcache, a general-purpose distributed memory caching system that Facebook and other social networks use so that data can be retrieved and rendered on the fly within a few hundred milliseconds. Facebook's Graph Search uses the Unicorn engine in the background, which takes nodes as input and returns nodes as output.
Hive, Scuba, and Dremel are three systems used to analyze data in social networks such as Facebook and Google+. There are several differences between them: Hive is a data warehouse infrastructure that provides a schema over data stored in Hadoop and executes HiveQL (an SQL-like language) queries by translating them into map-reduce jobs, whereas Scuba and Dremel execute queries directly over in-memory and columnar data, respectively.
This paper also discussed three different approaches to discovering cohesive subgraphs in a social graph. The first approach tends to be complete: it provides the algorithm, implements it in different solutions, and provides a visual system. The second approach provides a generic solution for finding the k most cohesive subgraphs; the significant area covered by this approach is the stream of graphs, an important setting because it matches the reality of social networks, although its solution is not as complete as the first approach's. The third approach seems to be the most flexible for application to different sets of graphs.
12. ACKNOWLEDGMENTS
We take this opportunity to express our gratitude and deep
regards to our guide Dr. Mustafa Jarrar for his guidance,
monitoring and constant encouragement throughout the course.
13. REFERENCES
[1] Bronson, Nathan, Zach Amsden, George Cabrera, Prasad
Chakka, Peter Dimov, Hui Ding, Jack Ferris et al. "Tao:
Facebook’s distributed data store for the social graph."
In USENIX ATC. 2013.
[2] Facebook – Company Info. http://newsroom.fb.com.
[3] Curtiss, Michael, Iain Becker, Tudor Bosman, Sergey
Doroshenko, Lucian Grijincu, Tom Jackson, Sandhya
Kunnatur et al. "Unicorn: a system for searching the social
graph." Proceedings of the VLDB Endowment 6, no. 11
(2013): 1150-1161.
[4] Li, Guoliang, et al. "Efficient type-ahead search on
relational data: a tastier approach." Proceedings of the 2009
ACM SIGMOD International Conference on Management of
data. ACM, 2009.
[5] Facebook - Facebook Engineering: https://www.facebook.com/notes/facebook-engineering/under-the-hood-building-out-the-infrastructure-for-graph-search/10151347573598920
[6] Facebook - Facebook Engineering: https://www.facebook.com/note.php?note_id=389105248919
[7] Slee, Mark, Aditya Agarwal, and Marc Kwiatkowski.
"Thrift: Scalable cross-language services
implementation." Facebook White Paper 5 (2007).
[8] Abraham, Lior, John Allen, Oleksandr Barykin, Vinayak
Borkar, Bhuwan Chopra, Ciprian Gerea, Daniel Merl et al.
"Scuba: diving into data at facebook."Proceedings of the
VLDB Endowment 6, no. 11 (2013): 1057-1067.
[9] Nishtala, Rajesh, Hans Fugal, Steven Grimm, Marc
Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy et
al. "Scaling memcache at facebook." InProceedings of the
10th USENIX conference on Networked Systems Design and
Implementation, pp. 385-398. USENIX Association, 2013.
[10] Melnik, Sergey, Andrey Gubarev, Jing Jing Long, Geoffrey
Romer, Shiva Shivakumar, Matt Tolton, and Theo
Vassilakis. "Dremel: interactive analysis of web-scale
datasets." Proceedings of the VLDB Endowment 3, no. 1-2
(2010): 330-339.
[11] Apache Hadoop :http://wiki.apache.org/hadoop.
[12] Abraham, Lior, John Allen, Oleksandr Barykin, Vinayak
Borkar, Bhuwan Chopra, Ciprian Gerea, Daniel Merl et al.
"Scuba: diving into data at facebook."Proceedings of the
VLDB Endowment 6, no. 11 (2013): 1057-1067.
[13] Thusoo, Ashish, Zheng Shao, Suresh Anthony, Dhruba
Borthakur, Namit Jain, Joydeep Sen Sarma, Raghotham
Murthy, and Hao Liu. "Data warehousing and analytics
infrastructure at facebook." In Proceedings of the 2010 ACM
SIGMOD International Conference on Management of data,
pp. 1013-1020. ACM, 2010.
[14] Zhao, Feng, and Anthony KH Tung. "Large scale cohesive
subgraphs discovery for social network visual analysis."
In Proceedings of the 39th international conference on Very
Large Data Bases, pp. 85-96. VLDB Endowment, 2012.
[15] Valari, Elena, Maria Kontaki, and Apostolos N.
Papadopoulos. "Discovery of top-k dense subgraphs in
dynamic graph collections." In Scientific and Statistical
Database Management, pp. 213-230. Springer Berlin
Heidelberg, 2012.
[16] Gibson, David, Ravi Kumar, and Andrew Tomkins.
"Discovering large dense subgraphs in massive graphs."
In Proceedings of the 31st international conference on Very
large data bases, pp. 721-732. VLDB Endowment, 2005.