
4/10/2013

Case Study

LinkedIn and its System, Network and Analytics – Data Storage

Sai Srinivas K (B09016), Sai Sagar J (B09014), Rajeshwari R (B09026) and Ashish K Gupta (B09008)

Distributed Database Systems, Spring 2013, IIT Mandi

Instructor: Dr. Arti Kashyap


Abstract

This paper is a case study of LinkedIn, a social networking website for people in professional occupations, covering its data storage systems and a few of the System, Network and Analytics (SNA) aspects of the site. The SNA team at LinkedIn maintains a website that hosts the open-source projects built by the group. Notable among these projects is Project Voldemort, a distributed, low-latency, key-value structured storage system similar in purpose to Amazon.com's Dynamo and Google's BigTable. We survey the research to date and the back-end systems of a website that reports more than 200 million registered users in more than 200 countries and territories.

I. Introduction

LinkedIn Corporation is a social networking website for professional networking among people in various occupations. The company was founded by Reid Hoffman together with founding team members from PayPal and Socialnet.com (Allen Blue, Lee Hower, Eric Ly, David Eves, Ian McNish, Chris Saccheri, Jean-Luc Vaillant, Konstantin Guericke, Stephen Beitzel and Yan Pujante), and the site was launched on May 5, 2003 in Santa Monica, California [1]. LinkedIn's CEO is Jeff Weiner, previously a Yahoo! Inc. executive, and founder Reid Hoffman, previously CEO of LinkedIn, is now Chairman of the Board.

1.1 Features

The site supports professional social networking by letting each user maintain a list of connections containing the contact details of everyone 'connected' to them. A user can invite anyone, whether a site user or not, to become a connection. However, if an invitee selects "I don't know" or "Spam", this counts as a report against the inviter, and if the inviter receives too many such responses the account may be restricted or closed. The list of connections can then be used in a number of ways:

- A network of contacts is built up from a user's direct connections, their second-degree connections (the connections of each of their connections) and their third-degree connections (the connections of the second-degree connections). This is similar to the concept of "Mutual Friends" on Facebook, and it lets a user gain an introduction to someone he or she finds interesting.
- Users can upload their resumes or build them within their profiles in order to share their work and community experience.
- The network can be used to find jobs, people and business opportunities recommended by someone in one's contact network.
- Employers can list jobs and search for potential candidates.
- Job seekers can review the profiles of hiring managers and discover which of their existing contacts can introduce them.
- Users can post their own photos and view photos of others to aid in identification.
- Users can follow companies and receive notifications about new hires and available offers.
- Users can save or bookmark jobs that they would like to apply for.

The "gated-access approach" (where contact with any professional requires either an existing relationship or the intervention of a mutual contact) is intended to build trust among the service's users and is one of LinkedIn's distinguishing features. The "LinkedIn Answers" feature, similar to "Yahoo! Answers", allows users to ask questions for the community to answer. The feature is free, and the main differences from the latter are that questions are potentially more business-oriented and that the identity of the people asking and answering questions is known. LinkedIn has since cited a new "focus on development of new and more engaging ways to share and discuss professional topics across LinkedIn", a recent development that may retire the ageing "LinkedIn Answers" feature.

Other LinkedIn features include LinkedIn Polls as a form of research (for users) and LinkedIn DirectAds as a form of sponsored advertising. LinkedIn also allows users to endorse each other's skills. This feature lets users efficiently provide commentary on other users' profiles, reinforcing the build-up of the network; however, there is no way of flagging anything other than positive content.

1.1.1 Applications

The Applications Platform allows other online services to be embedded within a member's profile page, for example the Amazon Reading List, which lets LinkedIn members display books they are reading; a connection to TripIt for travel itineraries; and WordPress and TypePad applications, which let members display their latest blog postings within their LinkedIn profile. Later on, LinkedIn allowed businesses to list products and services on company profile pages; it also permitted LinkedIn members to "recommend" products and services and write reviews.

1.1.2 Groups

LinkedIn also supports the formation of interest groups (a feature equally popular on many social networking sites and blogs). The majority are related to employment, although a very wide range of topics is covered, mainly around professional and career issues, and the current focus is on groups for both academic and corporate alumni. Groups support a limited form of discussion area, moderated by the group owners and managers. Since groups offer the ability to reach a wide audience without easily falling foul of anti-spam measures, there is a constant stream of spam postings, and a range of firms now offer a spamming service for this very purpose. Groups also keep their members informed through emails with updates to the group, including the most talked-about discussions within their professional circles. Groups may be private, accessible to members only, or open to Internet users in general to read, though they must join in order to post messages.

1.1.3 Job listings

LinkedIn allows users to research companies they may be interested in working for. When the name of a given company is typed into the search box, statistics about the company are provided. These may include the location of the company's headquarters and offices, a list of present and former employees, the percentage of the most common titles/positions held within the company, and so on. LinkedIn later launched a feature allowing companies to include an "Apply with LinkedIn" button on job listing pages, a genuinely useful development: the plug-in allows potential employees to apply for positions using their LinkedIn profiles as resumes. All applications are also saved under a "Saved Jobs" tab.

II. SNA LinkedIn

The Search, Network, and Analytics (SNA) group at LinkedIn hosts the open-source projects built by the group on its data blog. Notable among these projects is Project Voldemort, a distributed, low-latency, key-value structured storage system similar in purpose to Amazon's Dynamo and Google's BigTable. The data team at LinkedIn works on LinkedIn's information retrieval systems, the social graph system, data-driven features, and the supporting data infrastructure.

2.1 Project Voldemort

Voldemort is a distributed key-value storage system. It has the following properties:

- Data is automatically replicated over multiple servers (data replication).
- Data is automatically partitioned so that each server contains only a subset of the total data (data partitioning).
- Server failures are handled transparently, oblivious to the users (transparent failures).
- Pluggable serialization is supported to allow rich keys and values, including lists and tuples with named fields, and to integrate with common serialization frameworks like Protocol Buffers, Thrift, Avro and Java Serialization.
- Data items are versioned to maximize data integrity in failure scenarios without compromising the availability of the system (versioning).
- Each node is independent of other nodes, with no central point of failure or coordination (node independence).
- Good single-node performance: one can expect 10-20k operations per second depending on the machines, the network, the disk system, and the data replication factor.
- Pluggable data placement strategies are supported to allow distribution across data centres that are geographically far apart (data placement).

2.1.1 Comparison with the Relational Database

Voldemort is not a relational database; it does not attempt to satisfy arbitrary relations while satisfying ACID properties. Nor is it an object database that attempts to transparently map object reference graphs, nor does it introduce a new abstraction such as document orientation. It is basically just a big, distributed, persistent, fault-tolerant hash table.

For applications that could otherwise use an O/R mapper like ActiveRecord or Hibernate, Voldemort provides horizontal scalability and much higher availability, but at a great loss of convenience. For large applications under internet-type scalability pressure, a system is likely to consist of a number of functionally partitioned services or APIs, which may manage storage resources across multiple data centres using storage systems that may themselves be horizontally partitioned.

Voldemort offers a number of advantages:

- Voldemort combines in-memory caching with the storage system, so a separate caching tier is not required (instead the storage system itself is just fast).
- Unlike MySQL replication, both reads and writes scale horizontally.
- Data partitioning is transparent and allows for cluster expansion without rebalancing all data.
- Data replication and placement are decided by a simple API, so a wide range of application-specific strategies can be accommodated.
- The storage layer is completely mockable, so development and unit testing can be done against a throw-away in-memory storage system without needing a real cluster (or even a real storage system) for simple testing.

For applications in this space, arbitrary in-database joins are already impossible, since all the data is not available in any single database. A typical pattern is to introduce a caching layer, which requires hash-table semantics anyway. Voldemort is also used for certain high-scalability storage problems where simple functional partitioning is not sufficient. It is still a new system under development, which may have rough edges and probably plenty of uncaught bugs.


2.1.2 Design

Key-Value Storage

Project Voldemort, created by LinkedIn, is deliberately simple key-value data storage, since the primary concern is high performance and availability for users. Both keys and values can be complex compound objects, including lists or maps, but nonetheless the only supported queries are effectively the following:

value = store.get(key)
store.put(key, value)
store.delete(key)
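
As a concrete illustration, the following is a minimal sketch of how this three-operation interface looks from a Java client, following the client API documented on the Voldemort project site; the bootstrap URL and the store name "test" are placeholders for a real cluster configuration.

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class VoldemortClientSketch {
    public static void main(String[] args) {
        // Bootstrap against one node of the cluster (placeholder URL).
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));

        // "test" is assumed to be a store defined in the cluster's stores.xml.
        StoreClient<String, String> client = factory.getStoreClient("test");

        // put: associate a value with a key.
        client.put("member:42", "hello voldemort");

        // get: returns the value wrapped with its version information.
        Versioned<String> versioned = client.get("member:42");
        System.out.println(versioned.getValue());

        // Modify and write back under the same version to avoid a spurious conflict.
        versioned.setObject("updated value");
        client.put("member:42", versioned);

        // delete: remove the key.
        client.delete("member:42");
    }
}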

This may not be good enough for all storage problems, and it comes with a variety of trade-offs: no complex query filters, all joins must be done in code, no foreign-key constraints, no triggers, and so on.

2.1.3 System Architecture

The representation below [2] is the logical view, in which each layer implements a simple storage interface of put, get, and delete. Each layer is responsible for performing one function, such as TCP/IP network communication, serialization, version reconciliation, or inter-node routing. For example, the routing layer is responsible for taking an operation, say a PUT, and delegating it to all N storage replicas in parallel while handling any failures. [3]
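
To make the role of the routing layer concrete, here is a minimal, hypothetical sketch (not LinkedIn's actual implementation) of a routing layer that fans a put out to all replicas in parallel and treats the operation as successful once a required number of replicas acknowledge it.

import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical per-node storage interface mirroring the put/get/delete contract.
interface StorageNode {
    void put(String key, byte[] value) throws Exception;
}

class RoutingLayer {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    /** Sends the put to every replica in parallel; returns true once 'requiredWrites' acknowledge. */
    boolean put(String key, byte[] value, List<StorageNode> replicas, int requiredWrites)
            throws InterruptedException {
        CountDownLatch acks = new CountDownLatch(requiredWrites);
        for (StorageNode node : replicas) {
            pool.submit(() -> {
                try {
                    node.put(key, value);
                    acks.countDown();            // count successful replicas only
                } catch (Exception failure) {
                    // A failed replica is tolerated; read-repair or hinted handoff fixes it later.
                }
            });
        }
        // Block until the required number of replicas acknowledge, or give up after a timeout.
        return acks.await(500, TimeUnit.MILLISECONDS);
    }
}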

There is flexibility in where the intelligent routing of data to partitions is done; it can live in any of those layers. One could also add a compression layer that compresses byte values at any level below the serialization level. This could be done on the client side or on the server side, for example to support hardware load-balanced HTTP clients.

The representation below [4] is the physical architecture, with front-end, back-end and Voldemort clusters connected through a load balancer (either a hardware load balancer or a round-robin software load balancer) and "partition-aware routing", which is the storage system's internal routing. All the possible tier architectures are denoted in the diagram. Eliminating hops is efficient from the latency perspective, because there are obviously fewer network round trips, and from the throughput perspective, since there are fewer potential bottlenecks, but it comes at the cost of moving the routing intelligence up the stack.

Beyond that, this architectural flexibility makes high-performance configurations possible. Disk access is the single biggest performance cost in storage; the second is network hops. Disk access can be avoided by partitioning the data set and caching as much as possible, while eliminating network hops requires architectural flexibility. In the diagram shown, one can implement 3-hop, 2-hop, or 1-hop remote services using different configurations. This enables very high performance when it is possible to route service calls directly to the appropriate server.

2.1.3.1 Data partitioning and replication [5]

Data needs to be partitioned across a cluster of servers so that no single server has to hold the complete data set. Even when the data can fit on a single disk, disk access for small values is dominated by seek time, so partitioning improves cache efficiency by splitting the data into smaller chunks. The servers in the cluster are therefore not interchangeable: requests must be routed to a server that holds the requested data, not to just any available server at random.

Similarly, servers regularly fail, become overloaded, or are brought down for maintenance. If there are S servers and each server fails independently with probability p in a given day, then the probability of losing at least one server in a day is 1 - (1 - p)^S. For example, with S = 100 servers and p = 0.01, the chance of losing at least one server on a given day is 1 - 0.99^100, roughly 63%. We therefore cannot store data on only one server, or the probability of some data being lost grows roughly in proportion to the cluster size.

The simplest way to accomplish this would be to cut the data into S partitions (one per server) and store copies of a given key K on R servers. One way to associate the R servers with key K is to take a = K mod S and store the value on servers a, a+1, ..., a+R-1. For any probability p one can then pick an appropriate replication factor R to achieve an acceptably low probability of data loss. This scheme has the nice property that anyone can calculate the location of a value just by knowing its key, which allows look-ups to be done in a peer-to-peer fashion without contacting a central metadata server that holds a mapping of all keys to servers. The downside of this approach appears when a server is added to or removed from the cluster: in that case a may change for every key, and all data will shift between servers. Even where a does not change, load will not distribute evenly from a single removed or failed server to the rest of the cluster.

Consistent hashing is a technique that avoids these problems, and Voldemort uses it to compute the location of each key on the cluster. With this technique, when a server fails, load distributes equally over all the remaining servers in the cluster; likewise, when a new server is added to a cluster of S servers, only 1/(S+1) of the values must be moved to the new machine.

To visualize the consistent hashing method, picture the possible integer hash values as a ring beginning with 0 and circling around to 2^31 - 1. This ring is divided into Q equally sized partitions with Q >> S, and each of the S servers is assigned Q/S of them. A key is mapped onto the ring using an arbitrary hash function, and the list of R servers responsible for that key is computed by taking the first R unique nodes encountered when moving over the partitions in a clockwise direction. The diagram [6] pictures a hash ring for servers A, B, C and D; the arrows indicate keys mapped onto the hash ring and the resulting list of servers that will store the value for each key if R = 3.
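
The following sketch is a simplified illustration of this placement rule rather than Voldemort's actual code: each server owns several partitions on the ring, and a lookup walks clockwise from the key's position collecting the first R distinct servers.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

// Simplified consistent-hash ring: partitions are points on the ring, each owned by a server.
class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();  // ring position -> server

    void addServer(String server, int partitionsPerServer) {
        for (int i = 0; i < partitionsPerServer; i++) {
            // Hash "server:i" to place each of the server's partitions on the ring.
            ring.put(hash(server + ":" + i), server);
        }
    }

    /** Walk clockwise from the key's position and collect the first R distinct servers. */
    List<String> serversFor(String key, int replicas) {
        List<String> result = new ArrayList<>();
        if (ring.isEmpty()) return result;
        Integer start = ring.ceilingKey(hash(key));
        if (start == null) start = ring.firstKey();           // wrap around the ring
        Iterator<String> owners = ownersFrom(start);
        while (owners.hasNext() && result.size() < replicas) {
            String server = owners.next();
            if (!result.contains(server)) result.add(server);
        }
        return result;
    }

    private Iterator<String> ownersFrom(int start) {
        List<String> ordered = new ArrayList<>(ring.tailMap(start).values());
        ordered.addAll(ring.headMap(start).values());          // wrap around
        return ordered.iterator();
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff;                      // non-negative value in [0, 2^31 - 1]
    }
}

Removing a server only reassigns the keys of its own partitions to the next servers clockwise, which is why load redistributes evenly instead of shifting the whole data set.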

Related features such as load balancing and semantic partitioning are likewise implemented by other LinkedIn systems such as Kafka and Sensei DB.

2.1.4 Data Format & Queries

In Voldemort, data is divided into "stores", unlike a relational database where it is broken into two-dimensional tables. The word "table" is not used because the data need not be tabular (a value can contain lists and mappings, which are not allowed in a strict relational model). Each key is unique to a store, and each key can have at most one value.

2.1.4.1 Queries

Voldemort supports hash-table semantics, so a single value can be modified at a time and retrieval is by primary key. This makes distribution across machines particularly easy, since everything can be split by the primary key.

Voldemort can support lists as values in place of one-to-many relations, since both accomplish the same thing, so it is possible to store a reasonable number of values associated with a single key. In most cases this denormalization is a huge performance improvement, since it requires only a single set of disk seeks; but for very large one-to-many relationships (say, where a key maps to tens of millions of values), which must be kept on the server and streamed lazily via a cursor, this approach is not practical. Such rare cases must be broken up into sub-queries or otherwise handled at the application level.
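
As a small illustration of this denormalization, a one-to-many relation such as a member's connections can be stored as a single list value under the member's key. The sketch below assumes a StoreClient configured with a list serializer; the store name, key format and wrapper class are hypothetical.

import java.util.ArrayList;
import java.util.List;

import voldemort.client.StoreClient;

// Hypothetical wrapper that keeps a member's connection IDs as one denormalized list value.
class ConnectionListStore {
    private final StoreClient<String, List<Integer>> store;

    ConnectionListStore(StoreClient<String, List<Integer>> store) {
        this.store = store;
    }

    /** One get returns the whole connection list: a single set of disk seeks. */
    List<Integer> connectionsOf(int memberId) {
        List<Integer> connections = store.getValue("connections:" + memberId);
        return connections != null ? connections : new ArrayList<>();
    }

    /** Read-modify-write of the denormalized list; fine for reasonably sized lists. */
    void addConnection(int memberId, int otherMemberId) {
        List<Integer> connections = connectionsOf(memberId);
        connections.add(otherMemberId);
        store.put("connections:" + memberId, connections);
    }
}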

The simplicity of the queries can be an advantage: since each query has very predictable performance, it is easy to break down the performance of a service into the number of storage operations it performs and quickly estimate the load. In contrast, SQL queries are often opaque and execution plans can be data dependent, so it can be very difficult to estimate whether a given query will perform well with realistic data under load (especially for a new feature which has neither data nor load).

Also, having a three-operation interface makes it possible to transparently mock out the entire storage layer and to unit test using a mock-storage implementation that is little more than a HashMap. This makes unit testing outside of a particular container or environment much more practical.

2.1.5 Consistency & Versioning

When multiple simultaneous writes are distributed across multiple servers, and perhaps multiple data centres, consistency of the data becomes a difficult problem. The traditional solution is distributed transactions, but these are both slow (due to the many round trips) and fragile, since they require all servers to be available to process a transaction. In particular, any algorithm that must talk to more than 50% of the servers to ensure consistency becomes quite problematic if the application runs in multiple data centres, where the latency of cross-data-centre operations is extremely high.

An alternative is to tolerate the possibility of inconsistency and to resolve inconsistencies at read time. Applications usually perform a read-modify-update sequence when modifying data: for example, if a user adds an email address to their account, we might load the user object, add the email, and then write the new values back to the database. Database transactions are a solution to this problem, but they are not a real option when the transaction must span multiple page loads (which may or may not complete, and which can complete on any particular time frame).

The value for a given key is consistent if, in the absence of updates, all reads of that key return the same value. In a read-only world, data is created in a consistent way and never changed. When we add both writes and replication, we encounter problems: now we need to update multiple values on multiple machines and still leave things in a consistent state. In the presence of server failures this is very hard, and in the presence of network partitions it is provably impossible (a partition is when, for example, A and B can reach each other and C and D can reach each other, but A and B cannot reach C and D).

There are several methods for reaching consistency, with different guarantees and performance trade-offs: two-phase commit, Paxos-style consensus, and read-repair. The first two approaches prevent permanent inconsistency. The third approach writes all inconsistent versions and then, at read time, detects the conflict and resolves it; this is the approach used by the SNA team. It involves little coordination and is completely failure tolerant, but it may require additional application logic to resolve conflicts. It has the best availability guarantees and the highest efficiency (only W network round trips are required for N replicas, where W can be configured to be less than N); 2PC typically requires 2N blocking round trips, and Paxos variations vary quite a bit but are comparable to 2PC.
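
A minimal sketch of how read-time resolution might look at the application level, assuming the store hands back all concurrent versions of a value: the reader picks or merges a winner and writes it back, repairing the replicas. The classes below are illustrative, not Voldemort's API, and last-write-wins is just one possible resolution policy.

import java.util.List;

// Illustrative versioned value: a payload plus a timestamp standing in for a vector clock.
class VersionedValue {
    final String value;
    final long timestamp;

    VersionedValue(String value, long timestamp) {
        this.value = value;
        this.timestamp = timestamp;
    }
}

// Illustrative store interface that can return concurrent siblings for a key.
interface VersionedStore {
    List<VersionedValue> getAllVersions(String key);
    void put(String key, VersionedValue resolved);
}

class ReadRepairReader {
    private final VersionedStore store;

    ReadRepairReader(VersionedStore store) {
        this.store = store;
    }

    /** Detect a conflict at read time, resolve it (here: last-write-wins), and write it back. */
    String read(String key) {
        List<VersionedValue> versions = store.getAllVersions(key);
        if (versions.isEmpty()) return null;
        VersionedValue winner = versions.get(0);
        for (VersionedValue candidate : versions) {
            if (candidate.timestamp > winner.timestamp) winner = candidate;
        }
        if (versions.size() > 1) {
            store.put(key, winner);   // read-repair: push the resolved value back to the replicas
        }
        return winner.value;
    }
}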

Another technique used to reach consistency is hinted handoff. During a write, if a destination node is found to be down (failure handling), a "hint" of the updated value is stored on one of the live nodes. When the down node comes back up, the hints are pushed to it, making the data consistent again.
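
A rough sketch of the hinted-handoff idea, again with illustrative interfaces rather than Voldemort's own: writes aimed at a down node are parked on a live node and replayed once the node returns.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Illustrative hinted handoff: writes aimed at a down node are parked here and replayed later.
class HintedHandoff {
    // Hints held on a live node, keyed by the identifier of the node that was down.
    private final Map<String, List<String[]>> hints = new HashMap<>();

    /** Called when a write to 'downNode' fails: remember the key and value for later. */
    void storeHint(String downNode, String key, String value) {
        hints.computeIfAbsent(downNode, n -> new ArrayList<>()).add(new String[] { key, value });
    }

    /** Called when 'downNode' comes back up: push every parked write to it, then clear. */
    void replay(String downNode, BiConsumer<String, String> putOnRecoveredNode) {
        for (String[] hint : hints.getOrDefault(downNode, Collections.emptyList())) {
            putOnRecoveredNode.accept(hint[0], hint[1]);
        }
        hints.remove(downNode);
    }
}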

2.1.6 Routing Parameters

Any persistent system needs to answer the question "where is my stuff?". This is a very easy question for a centralized database, since the answer is always "somewhere on the database server". In a partitioned key system there are multiple machines that may hold the data, so a read needs to reach at least one server to get the answer, and a write needs to (eventually) reach all N replicas.

There are thus three parameters that matter:

N - the number of replicas
R - the number of machines to read from
W - the number of writes to block for

Note that if R + W > N, then we are guaranteed to "read our writes". If W = 0, then writes are non-blocking and there is no guarantee of success whatsoever. Puts and deletes are neither immediately consistent nor isolated. The semantics are as follows: if a put/delete operation succeeds without an exception, then it is guaranteed that at least W nodes carried out the operation; if the write fails (say, because too few nodes succeeded in carrying out the operation), then the state is unspecified. If at least one put/delete succeeded, then the value will eventually become the new value; if none succeeded, the value is lost. If the client wants to ensure the state after a failed write operation, it must issue another write.
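
The R + W > N rule can be made concrete with a small helper. This is a sketch of the quorum arithmetic described above, not Voldemort's configuration code; the example values are invented.

// Sketch of the N/R/W quorum arithmetic described above.
class QuorumParameters {
    final int replicas;        // N: copies of each value
    final int requiredReads;   // R: nodes a read must reach
    final int requiredWrites;  // W: nodes a write blocks for

    QuorumParameters(int n, int r, int w) {
        this.replicas = n;
        this.requiredReads = r;
        this.requiredWrites = w;
    }

    /** R + W > N means every read set overlaps every successful write set. */
    boolean readsSeeLatestWrite() {
        return requiredReads + requiredWrites > replicas;
    }

    /** W = 0 means writes return immediately with no guarantee of success at all. */
    boolean writesAreNonBlocking() {
        return requiredWrites == 0;
    }

    public static void main(String[] args) {
        QuorumParameters strict = new QuorumParameters(3, 2, 2);  // N=3, R=2, W=2
        System.out.println("read-your-writes: " + strict.readsSeeLatestWrite());  // true
        QuorumParameters fast = new QuorumParameters(3, 1, 1);    // N=3, R=1, W=1
        System.out.println("read-your-writes: " + fast.readsSeeLatestWrite());    // false
    }
}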

2.1.7 Performance [7]

Getting real applications deployed requires simple, well understood, predictable performance, and understanding and tuning the performance of a cluster of machines is an important criterion too. Note that there are a number of tunable parameters: the cache size on a node, the number of nodes read from and written to on each operation, the amount of data on a server, and so on.

Estimating network latency and data/cache ratios

Disk is far and away the slowest and lowest-throughput resource. Disk seeks take 5-10 ms, and a lookup could involve multiple disk seeks. When the hot data is primarily in memory you are benchmarking the software; when it is primarily on disk you are benchmarking your disk system.

The calculation done when planning a feature is to take the estimated total data size, multiply it by the replication factor, and divide by the number of nodes; this gives the amount of data per node. Comparing that to the cache size per node gives the fraction of the total data that can be served from memory, which can then be compared to some estimate of how "hot" the data is. For example, if requests are completely random, then a high proportion should be in memory. If instead the requests concern particular members, only some fraction of members are logged in at once, and one member session produces many requests, then a much lower fraction may suffice.
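
This back-of-the-envelope calculation is easy to capture in code; the numbers below are invented purely for illustration.

// Sketch of the data-per-node versus cache-size estimate used when planning a feature.
class CapacityEstimate {
    public static void main(String[] args) {
        double totalDataGb = 1200.0;       // estimated total data size (illustrative)
        int replicationFactor = 2;         // copies of each value
        int nodes = 24;                    // cluster size
        double cachePerNodeGb = 20.0;      // memory available for caching on each node

        // data per node = total size * replication factor / number of nodes
        double dataPerNodeGb = totalDataGb * replicationFactor / nodes;

        // fraction of each node's data that can be served from its cache
        double cacheableFraction = Math.min(1.0, cachePerNodeGb / dataPerNodeGb);

        System.out.printf("data per node: %.1f GB%n", dataPerNodeGb);                  // 100.0 GB
        System.out.printf("servable from memory: %.0f%%%n", cacheableFraction * 100);  // 20%
    }
}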

Network is the second biggest bottleneck after disk. The maximum throughput one Java client can get for round trips through a socket to a service that does absolutely nothing seems to be about 30-40k requests per second over localhost; adding work on the client or server side, or adding network latency, can only decrease this.

Some results of LinkedIn's performance tests [8]:

"The throughput we see from a single multithreaded client talking to a single server where the 'hot' data set is in memory under artificially heavy load in a performance lab:

Reads: 19,384 req/sec
Writes: 16,559 req/sec

Note that this is to a single-node cluster, so the replication factor is 1. Obviously, doubling the replication factor will halve the client req/sec since it is doing 2x the operations. So these numbers represent the maximum throughput from one client; by increasing the replication factor, decreasing the cache size, or increasing the data size on the node, we can make the performance arbitrarily slow. Note that in this test the server is actually fairly lightly loaded since it has only one client, so this does not measure the maximum throughput of a server, just the maximum throughput from a single client."

2.2 Support for batch computed data – Read only stores

One of the most data-intensive storage needs is storing batch-computed data about members and content. These jobs often deal with the relationships between entities (e.g. related users, or related news articles), and so for N entities they can produce up to N^2 relationships. An example at LinkedIn is member networks, which are in the 12 TB range if stored explicitly for all members. Batch processing of data is generally much more efficient than random access, which means one can easily produce more batch-computed data than the live system can readily serve; Hadoop greatly expands this ability. A Voldemort persistence backend supporting very efficient read-only access was therefore created, which takes a lot of the pain out of building, deploying, and managing large, read-only, batch-computed data sets.

Much of the pain of dealing with batch computing comes from the "push" process that transfers data from a data warehouse or Hadoop instance to the live system. In a traditional database this often means rebuilding the index on the live system with the new data. Doing millions of SQL insert or update statements is generally not at all efficient, so typically in a SQL database the data is deployed as a new table and then swapped in to replace the current data once the new table is completely built. This is better than doing millions of individual updates, but it still means the live system is building a many-gigabyte index for the new data set while simultaneously serving live traffic.

This alone can take hours or days, and it may destroy the performance of live queries. Some have fixed this by swapping out at the database level (e.g. keeping an online and an offline database and then swapping), but this requires effort and means that only half the hardware is being utilized. Voldemort fixes this process by making it possible to prebuild the index itself offline (on Hadoop or wherever), push it out to the live servers, and transparently swap it in.

A driver program initiates the fetch-and-swap procedure in parallel across the whole Voldemort cluster. In LinkedIn's tests this process can reach the I/O limit of either the Hadoop cluster or the Voldemort cluster. It also helps associate the 'hot' data with its corresponding keys.

Benchmarking anything that involves disk access is notoriously difficult because of its sensitivity to three factors:

1. The ratio of data to memory
2. The performance of the disk subsystem
3. The entropy of the request stream

The ratio of data to memory and the entropy of the request stream determine how many cache misses will be sustained, so these are critical. A random request stream is more or less uncacheable, but fortunately almost no real request streams are random; they tend to have strong temporal locality, which is what page-cache eviction algorithms exploit. So we can assume a large ratio of memory to disk and test against a simulated request stream to get performance information. Any build process will consist of three stages: (1) partitioning the data into separate sets for each destination node, (2) gathering all data for a given node, and (3) building the lookup structure for that node.
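
A simplified sketch of the first stage, written as a Hadoop mapper: each record is routed to its destination node, so that the shuffle (stage 2) gathers one node's data together and the reducer (stage 3) can build that node's lookup structure. The record format and node-assignment function are stand-ins, not LinkedIn's actual build job, which derives placement from the cluster's consistent-hash layout.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Stage 1 of a read-only store build: route each key/value record to its destination node.
public class PartitionByNodeMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private static final int NUM_NODES = 10;   // illustrative cluster size

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assume tab-separated "key<TAB>value" records (a stand-in format).
        String key = line.toString().split("\t", 2)[0];

        // Stand-in placement function; the real system derives this from its hash ring.
        int destinationNode = (key.hashCode() & 0x7fffffff) % NUM_NODES;

        // Emit under the destination node so the shuffle groups all of its records together.
        context.write(new IntWritable(destinationNode), line);
    }
}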

2.2.1 Build Time [8]

The measured time is the complete build time, including mapping the data out to the appropriate node chunk, shuffling the data to the nodes that will do the build, and finally creating the 'store' files. In general, the time was roughly evenly split between the map, shuffle and reduce phases. The number of map and reduce tasks is a very important parameter: experiments on a smaller data set showed that varying the number of tasks could change the build time by more than 25%, but due to time constraints LinkedIn used the Hadoop defaults for testing. The times taken were:

100 GB: 28 min (400 mappers, 90 reducers)
512 GB: 2 hr 16 min (2,313 mappers, 350 reducers)
1 TB: 5 hr 39 min (4,608 mappers, 700 reducers)

This neglects the additional benefits of Hadoop for handling failures, dealing with slower nodes, and so on.


In addition, this process is scalable: it can be run on a number of machines equal to the number of chunks (700 in the 1 TB case), not the number of destination nodes (only 10). Data transfer between the clusters happens at a steady rate bound by the disk or the network; on LinkedIn's Amazon instances this is around 40 MB/second.

2.2.2 Online Performance [8]

Lookup time for a single Voldemort node also compares well with a single MySQL instance. Consider a local test against the 100 GB per-node data from the 1 TB test, run on an Amazon Extra Large instance with 15 GB of RAM and the four ephemeral disks in a RAID 10 configuration. One million requests from a real request stream recorded on the production system were replayed against each storage system, giving the following performance for one million requests against a single node:

                            MySQL       Voldemort
Requests per sec            727         1,291
Median request time         0.23 ms     0.05 ms
Average request time        13.7 ms     7.7 ms
99th percentile req. time   127.2 ms    100.7 ms

Both sets of numbers are for local requests with no network involved, as the only intention is to benchmark the storage layer of the two systems.

2.3 White Elephant: The Hadoop Tool

LinkedIn's solution for managing and analyzing its Hadoop usage is a tool called "White Elephant". At LinkedIn, Hadoop is used for product development (e.g. predictive analytics applications like 'People You May Know' and 'Endorsements'), descriptive statistics for powering internal dashboards, ad hoc analysis by data scientists, and ETL. White Elephant parses Hadoop logs to provide visual drill-downs and roll-ups of task statistics for a Hadoop cluster, including total task time, slots used, CPU time, and failed job counts.

White Elephant fills several needs:

- Scheduling: when there are only a handful of periodic jobs it is easy to reason about when they should run, but that quickly stops scaling. The ability to schedule jobs at periods of low utilization helps maximize cluster efficiency.
- Capacity planning: to plan for future hardware needs, operations need to understand the resource usage growth of jobs.
- Billing: Hadoop clusters have finite capacity, so in a multi-tenant environment it is important to weigh the resources used by a product feature against its business value.

2.3.1 Architecture [10]

The diagram [10] outlines the White Elephant architecture. There are three Hadoop grids, A, B, and C, for which White Elephant computes statistics as follows:

1. Upload task: a task that periodically runs on the JobTracker of each grid and incrementally copies new log files into a Hadoop grid for analysis.
2. Compute: a sequence of MapReduce jobs, coordinated by a Job Executor, parses the uploaded logs and computes aggregate statistics.
3. Viewer: a viewer application incrementally loads the aggregate statistics, caches them locally, and exposes a web interface that can be used to slice and dice the statistics for the Hadoop clusters.

2.4 Sensei DB

Sensei DB is a distributed, searchable database that handles complex semi-structured queries and can be used to power consumer search systems with rich structured data. It is an open-source, distributed, real-time, semi-structured database that powers the LinkedIn homepage and LinkedIn Signal.

Some features of this database include:

- Full-text search
- Fast real-time updates
- Structured and faceted search
- BQL, an SQL-like query language
- Fast key-value lookup
- High performance under concurrent heavy update and query volumes
- Hadoop integration

Sensei enables faceted search on the rich structured data that LinkedIn incorporates into user profiles. The fundamental paradigm is to give individuals an easy and natural way to slice and dice through search results, or simply through content, so a faceted search paradigm is ideal not only for retrieval but also for navigation and discovery. Since a LinkedIn member profile has these rich structural dimensions along with rich text data, it was only a matter of time before such an interface was created.

A click on a facet value is equivalent to filtering the search results by that value. For example, searching for "John" and then selecting "San Francisco" should return only people in San Francisco called John, i.e. "John" + facet_value("San Francisco") = "John AND location:(San Francisco)". Navigating through results this way never leads to a dead end.

What was implemented is essentially a query engine for the following type of query:

SELECT f1, f2, ..., fn FROM members
WHERE c1 AND c2 AND c3 ...
MATCH (fulltext query, e.g. "java engineer")
GROUP BY fx, fy, fz ...
ORDER BY fa, fb ...
LIMIT offset, count

Deferring this query to a traditional RDBMS over tens to hundreds of millions of rows with a sub-second query latency SLA is not feasible, so a distributed system like Sensei that handles the above query at internet scale is necessary. A faceted search snapshot is shown below. [11]

2.5 Avatara: OLAP for Web-scale Analytics Products

The last important part of LinkedIn's SNA stack described in this paper is Avatara, an OLAP system for web analytics products. LinkedIn has many analytical insight products, such as "Who's Viewed My Profile?" and "Who's Viewed This Job?". At their core these are multidimensional queries: for example, "Who's Viewed My Profile?" takes someone's profile views and breaks them down by industry, geography, company, school and so on, to show the range of people who viewed their profile or their job posting [12].


Online analytical processing (OLAP) has been the traditional approach to solving such multidimensional analytical problems. However, LinkedIn had to build a solution that can answer these queries in milliseconds across 175+ million members, and so built Avatara: LinkedIn's scalable, low-latency, highly available OLAP system for "sharded" multidimensional queries within the time constraints of a request/response loop.

An interesting insight for LinkedIn's use cases is that queries span relatively few dimensions, usually tens to at most a hundred, so the data can be sharded across a primary dimension. For "Who's Viewed My Profile?", the cube can be sharded by the member herself, as the product does not allow analyzing the profile views of anyone other than the member currently logged in. As shown in the figure below, Avatara consists of two components:

1. An offline engine that computes cubes in batch

2. An online engine that serves queries in real time

The offline engine computes cubes with high throughput by leveraging Hadoop for batch processing, and then writes the cubes to the Voldemort distributed database. The online engine queries the Voldemort store when a member loads a page. Every piece of this architecture runs on commodity hardware and can easily be scaled horizontally. The diagram also shows how Hadoop integrates with LinkedIn's key-value store, Voldemort.

2.5.1 Offline Engine

The offline batch engine processes data through a pipeline with three phases:

1. Pre-processing
2. Projections and joins
3. Cubification

Each phase runs one or more Hadoop jobs and produces output that is the input to the subsequent phase. Hadoop is used for its built-in high throughput, fault tolerance and horizontal scalability. The pipeline pre-processes raw data as needed, projects out the dimensions of interest, performs user-defined joins, and finally transforms the data into cubes. The result of the batch engine is a set of sharded small cubes, represented as key-value pairs, where each key is a shard (for example, member_id for "Who's Viewed My Profile?") and the value is the cube for that shard.

2.5.2 Online Engine

All cubes are bulk-loaded into Voldemort. The online query engine retrieves and processes data from Voldemort and returns the results to the client. It provides SQL-like operators, such as select, where and group by, plus some math operations; the widespread adoption of SQL makes it easy for application developers to interact with Avatara. With Avatara, 80% of queries are satisfied within 10 ms and 95% of queries are answered within 25 ms for "Who's Viewed My Profile?" on a high-traffic day.
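
To illustrate the shard-by-member design, here is a hypothetical sketch of the online path (none of these classes are Avatara's actual API): fetch the member's small, pre-computed cube from the key-value store, then group and filter it in memory within the request/response loop.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative cube row: one fact (a profile view) with a few dimensions and a measure.
class CubeRow {
    final String industry;
    final String geography;
    final long viewCount;

    CubeRow(String industry, String geography, long viewCount) {
        this.industry = industry;
        this.geography = geography;
        this.viewCount = viewCount;
    }
}

// Stand-in for the key-value lookup: member_id maps to that member's small cube.
interface CubeStore {
    List<CubeRow> cubeFor(long memberId);
}

class WhosViewedMyProfileQuery {
    private final CubeStore store;

    WhosViewedMyProfileQuery(CubeStore store) {
        this.store = store;
    }

    /** Roughly SELECT industry, SUM(viewCount) ... GROUP BY industry, evaluated in memory per shard. */
    Map<String, Long> viewsByIndustry(long memberId) {
        Map<String, Long> totals = new HashMap<>();
        for (CubeRow row : store.cubeFor(memberId)) {
            totals.merge(row.industry, row.viewCount, Long::sum);
        }
        return totals;
    }
}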

2.6 Conclusions

When the scale of data began to overload the LinkedIn servers, their solution was not to add more nodes but to cut out some of the matching heuristics that required too much compute power. Instead of writing algorithms to make "People You May Know" more accurate, the team worked on getting LinkedIn's Hadoop infrastructure in place and built the distributed database Voldemort. They then built Azkaban, an open-source scheduler for batch processes such as Hadoop jobs, and Kafka, another open-source tool described as "the big data equivalent of a message broker". At a high level, Kafka is responsible for managing the company's real-time data and getting those hundreds of feeds to the applications that subscribe to them with minimal latency. A 2012 study comparing systems for storing APM monitoring data reported that Voldemort, Cassandra, and HBase offered linear scalability in most cases, with Voldemort having the lowest latency and Cassandra the highest throughput.

Why hasn't LinkedIn shifted away from a NoSQL database like Voldemort? "The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes. This is completely wrong for large-scale web applications, where response time is critical. It's made much worse by the serial nature of most applications. Each component of the page blocks on reads from the data store, as well as the completion of the operations that come before it. Non-relational data stores reverse this model completely, because they don't have the complex read operations of SQL," as cited by the LinkedIn SNA team from the 'Interview with Ryan King'.

Acknowledgements

The authors of this paper would like to acknowledge LinkedIn's data team, which has open-sourced its data stores such as Voldemort and SNA tools such as Sensei DB, Avatara and Azkaban, thereby providing various means for research.


References

[1] http://en.wikipedia.org/wiki/LinkedIn
[2] Dynamo: Amazon's Highly Available Key-Value Store.
[3] http://data.linkedin.com/ – the data team which manages the SNA of LinkedIn.
[4] http://www.project-voldemort.com/voldemort/design.html
[5] http://en.wikipedia.org/wiki/Voldemort_%28distributed_data_store%29
[6] Time, Clocks, and the Ordering of Events in a Distributed System – for the versioning details.
[7] Eventual Consistency Revisited – a discussion on Werner Vogels' blog about the developer's interaction with the storage system and what the trade-offs mean in practical terms.
[8] Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services – consistency, availability and partition tolerance.
[9] Berkeley DB performance – a somewhat biased overview of BDB performance.
[10] Google's Bigtable – for comparison, a very different approach.
[11] "One Size Fits All": An Idea Whose Time Has Come and Gone – a very interesting paper by the creator of Ingres, Postgres and Vertica.
[12] One Size Fits All? Part 2: Benchmarking Results – benchmarks mentioned in the paper.
[13] Consistency in Amazon's Dynamo – blog posts on Dynamo.
[14] Paxos Made Simple; Two-Phase Commit – Wikipedia description.
[15] The Life of a Typeahead Query – the various technical aspects and challenges of real-time typeahead search in the context of a social network.
[16] Efficient Type-Ahead Search on Relational Data: a TASTIER Approach – a relational approach to typeahead searching using specialized index structures and algorithms for joining related tuples in the database.
[17] http://gigaom.com/2013/03/03/how-and-why-linkedin-is-becoming-an-engineering-powerhouse/ – "LinkedIn, a powerhouse": interviews with the development team.
[18] http://www.cloudera.com/hadoop-training-basic – the principles behind MapReduce and Hadoop.
[19] https://groups.google.com/forum/?fromgroups#!forum/project-voldemort