TDD: Topics in Distributed Databases
Distributed Query Processing
MapReduce
Vertex-centric models for querying graphs
Distributed query evaluation by partial evaluation
Coping with the sheer volume of big data

Using an SSD with a 6GB/s scan rate, a linear scan of a data set D would take
– 1.9 days when D is of 1PB (10^15 bytes)
– 5.28 years when D is of 1EB (10^18 bytes)

Can we do better if we have n processors/SSDs? Divide and conquer: split the work across the n processors.

[Figure: a shared-nothing architecture — n nodes, each with a processor (P), memory (M) and a database partition (DB), connected by an interconnection network]

How can we make use of parallelism in query answering?
MapReduce
A programming model with two primitive functions:
– Map: <k1, v1> → list(k2, v2)
– Reduce: <k2, list(v2)> → list(k3, v3)

How does it work?
• Input: a list of key-value pairs <k1, v1>
• Map: applied to each pair, computes intermediate key-value pairs <k2, v2>
• The intermediate key-value pairs are hash-partitioned on k2; each partition (k2, list(v2)) is sent to a reducer
• Reduce: takes a partition as input, and computes key-value pairs <k3, v3>

The process may reiterate: multiple map/reduce rounds
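To make the model concrete, below is a minimal single-machine sketch of one MapReduce round in Python (the harness map_reduce and the word-count functions are illustrative, not Hadoop's API). The Map/Reduce pseudocode on the following slides can be dropped into the same harness.

from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, mapper, reducer):
    # Map phase: apply the mapper to every <k1, v1> pair
    intermediate = [kv for pair in inputs for kv in mapper(*pair)]
    # Shuffle: group intermediate pairs by k2 (a real system hash-partitions;
    # sorting and grouping yields the same partitions on one machine)
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: apply the reducer to each partition (k2, list(v2))
    return [kv
            for k2, group in groupby(intermediate, key=itemgetter(0))
            for kv in reducer(k2, [v for _, v in group])]

# Example: word count
def wc_map(key, line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

docs = [(0, "big data big graphs"), (1, "big queries")]
print(map_reduce(docs, wc_map, wc_reduce))
# [('big', 3), ('data', 1), ('graphs', 1), ('queries', 1)]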
Architecture (Hadoop)

No need to worry about how the data is stored and sent:
• the input is stored in a DFS, partitioned into blocks (64MB); one block for each mapper (a map task)
• the intermediate <k2, v2> pairs are hash-partitioned on k2 and kept in the local store of the mappers
• the reducers aggregate the results
• multiple map/reduce steps can be chained

[Figure: dataflow of one round — <k1, v1> blocks → mappers → <k2, v2> partitions → reducers → <k3, v3> results]
Connection with parallel database systems

What parallelism does MapReduce exploit? Data-partitioned parallelism: in the dataflow above, both the map phase and the reduce phase are parallel computations over partitions of the data.
Parallel database systems

A shared-nothing parallel database system ([Figure: nodes, each with a processor P, memory M and database partition DB, connected by an interconnection network]) exploits the same data-partitioned parallelism. The difference: MapReduce supports a restricted query language with only two primitives, Map and Reduce, rather than full SQL.
MapReduce implementation of relational operators

Projection π_A(R)
Input: for each tuple t in R, a pair (key, value), where value = t; the key is not necessarily a key of R

Map(key, t)
• emit (t[A], t[A])

Reduce(hkey, hvalue[ ])
• emit (hkey, hkey)

Map is applied to each input tuple in parallel, emitting tuples with the projected attributes; the mappers process the tuples in parallel. The reducer is not strictly necessary, but it eliminates duplicates. Why?
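A sketch on the illustrative map_reduce harness above, with R of schema (A, B) assumed:

# Projection pi_A(R); R has schema (A, B)
R = [(None, (1, "x")), (None, (1, "y")), (None, (2, "z"))]

def proj_map(key, t):
    return [(t[0], t[0])]         # emit (t[A], t[A])

def proj_reduce(hkey, hvalues):
    return [(hkey, hkey)]         # one output per distinct A-value

print(map_reduce(R, proj_map, proj_reduce))   # [(1, 1), (2, 2)]

All duplicates of a projected value carry the same key, so they land in the same partition, and the reducer emits the value once; this is why the reduce step eliminates duplicates.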
Selection

Selection σ_C(R)
Input: for each tuple t in R, a pair (key, value), where value = t

Map(key, t)
• if C(t)
• then emit (t, "1")

Reduce(hkey, hvalue[ ])
• emit (hkey, hkey)

Map is applied to each input tuple in parallel, selecting the tuples that satisfy condition C; the reducers eliminate duplicates. How? All copies of a selected tuple t share the key t, and hence reach the same reducer, which emits t once.
Union

Union R1 ∪ R2
Input: for each tuple t in R1 and s in R2, a pair (key, value); a mapper is assigned chunks from either R1 or R2, and processes tuples of R1 and R2 uniformly

Map(key, t)
• emit (t, "1")

Reduce(hkey, hvalue[ ])
• emit (hkey, hkey)

A mapper just passes each input tuple on to a reducer; the reducers simply eliminate duplicates
Set difference

Set difference R1 − R2
Input: for each tuple t in R1 and s in R2, a pair (key, value); tag each tuple with its source, so that tuples from R1 and R2 are distinguishable

Map(key, t)
• if t is in R1
• then emit (t, "1")
• else emit (t, "2")

Reduce(hkey, hvalue[ ])
• if only "1" appears in the list hvalue
• then emit (hkey, hkey)

The reducers do the checking. Why does it work?
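A sketch on the harness above (the source tag is attached when the input pairs are prepared, since each mapper knows which relation its chunk comes from):

# Set difference R1 - R2 via source tagging
R1 = [(1, "a"), (2, "b")]
R2 = [(2, "b")]
tagged = [(None, (t, "1")) for t in R1] + [(None, (t, "2")) for t in R2]

def diff_map(key, value):
    t, tag = value
    return [(t, tag)]             # key each pair by the tuple itself

def diff_reduce(t, tags):
    # t is in the answer iff no copy of t came from R2
    return [(t, t)] if "2" not in tags else []

print(map_reduce(tagged, diff_map, diff_reduce))  # [((1, 'a'), (1, 'a'))]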
Reduce-side join

Reduce-side join (based on hash partitioning)
Natural join R1 ⋈_{R1.A = R2.B} R2, where R1[A, C] and R2[B, D]
Input: for each tuple t in R1 and s in R2, a pair (key, value)

Map(key, t)
• if t is in R1
• then emit (t[A], ("1", t[C]))
• else emit (t[B], ("2", t[D]))

Reduce(hkey, hvalue[ ])
• for each ("1", t[C]) and each ("2", s[D]) in the list hvalue
• emit ((hkey, t[C], s[D]), hkey)

Hashing on the join attributes sends all tuples that may join to the same reducer; each reducer then runs a nested loop over its list. Inefficient. Why? And how would you implement R1 ⋈ R2 ⋈ R3?
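A sketch on the harness above, with R1[A, C] and R2[B, D]:

# Reduce-side join of R1(A, C) and R2(B, D) on A = B
R1 = [(1, "c1"), (2, "c2")]
R2 = [(1, "d1"), (1, "d2")]
tagged = [(None, ("1", t)) for t in R1] + [(None, ("2", s)) for s in R2]

def join_map(key, value):
    tag, t = value
    return [(t[0], (tag, t[1]))]        # key = the join attribute

def join_reduce(hkey, hvalues):
    cs = [c for tag, c in hvalues if tag == "1"]
    ds = [d for tag, d in hvalues if tag == "2"]
    return [((hkey, c, d), hkey) for c in cs for d in ds]  # nested loop

print(map_reduce(tagged, join_map, join_reduce))
# [((1, 'c1', 'd1'), 1), ((1, 'c1', 'd2'), 1)]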
Map-side join

Map-side join (partitioned join): recall R1 ⋈_{R1.A = R2.B} R2
• Partition R1 and R2 into n partitions each, by the same partitioning function on R1.A and R2.B, via either range or hash partitioning
• Compute Ri1 ⋈_{R1.A = R2.B} Ri2 locally at processor i
• Merge the local results

The input relations are partitioned and sorted on the join keys; map over R1 and read from the corresponding partition of R2, as in a merge join

map(key, t)
• read Ri2
• for each tuple s in relation Ri2
• if t[A] = s[B] then emit ((t[A], t[C], s[D]), t[A])

Limitation: both inputs must agree on the sort order and the partitioning
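A sketch of the co-partitioning idea in plain Python (the helper hash_partition is illustrative, and a nested loop stands in for the merge scan):

# Map-side (partitioned) join: co-partition both relations on the join
# key; each map task joins one pair of co-partitioned blocks, no shuffle
def hash_partition(tuples, n, attr=0):
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[hash(t[attr]) % n].append(t)
    return parts

n = 2
R1_parts = hash_partition([(1, "c1"), (2, "c2")], n)   # R1(A, C)
R2_parts = hash_partition([(1, "d1"), (2, "d2")], n)   # R2(B, D)

result = []
for i in range(n):                      # each i would run as one map task
    for t in R1_parts[i]:
        for s in R2_parts[i]:           # only the co-partitioned block
            if t[0] == s[0]:
                result.append((t[0], t[1], s[1]))
print(result)   # [(2, 'c2', 'd2'), (1, 'c1', 'd1')], grouped by partition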
In-memory join

In-memory join (fragment-and-replicate join): recall R1 ⋈_{R1.A < R2.B} R2; partitioned joins do not apply to such inequality conditions
• Partition R1 into n fragments, by any partitioning method, and distribute them across n processors
• Replicate the other relation R2 across all the processors
• Compute Rj1 ⋈_{R1.A < R2.B} R2 locally at processor j
• Merge the local results

map(key, t)
• for each tuple s in relation R2 (local)
• if t[A] < s[B] then emit ((t[A], t[C], s[D]), t[A])

Broadcast join: the smaller relation is broadcast to each node and stored in its local memory; the other relation is partitioned and distributed across the mappers
Limitation: the replicated relation must fit in memory
Aggregation

Leveraging the MapReduce framework: given R(A, B, C), compute sum(B) group by A

Map(key, t)
• emit (t[A], t[B])

Reduce(hkey, hvalue[ ])
• sum := 0
• for each value s in the list hvalue
• sum := sum + s
• emit (hkey, sum)

The grouping is done by the MapReduce framework itself (the shuffle on t[A]); each reducer computes the aggregate for one group
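On the harness above (schema R(A, B, C) assumed):

# sum(B) group by A over R(A, B, C)
R = [(None, ("a1", 5, "c")), (None, ("a1", 7, "c")), (None, ("a2", 1, "c"))]

def agg_map(key, t):
    return [(t[0], t[1])]            # emit (t[A], t[B])

def agg_reduce(hkey, hvalues):
    return [(hkey, sum(hvalues))]    # the shuffle already built the group

print(map_reduce(R, agg_map, agg_reduce))   # [('a1', 12), ('a2', 1)]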
Practice: validation of functional dependencies

A functional dependency (FD) defined on schema R: X → Y
– For any instance D of R, D satisfies the FD if for any pair of tuples t and t', if t[X] = t'[X], then t[Y] = t'[Y]
– Violations of the FD in D: { t | there exists t' in D such that t[X] = t'[X], but t[Y] ≠ t'[Y] }

Develop a MapReduce algorithm to find all the violations of the FD in D
– Map: for each tuple t, emit (t[X], t); that is, t[X] serves as the key
– Reduce: find all tuples t in the partition such that there exists t' with t[X] = t'[X] but t[Y] ≠ t'[Y]; emit (1, t)

Exercise: write a MapReduce algorithm to validate a set of FDs
Does MapReduce support recursion?
Transitive closures

The transitive closure TC of a relation R[A, B]:
– R is a subset of TC
– if (a, b) and (b, c) are in TC, then (a, c) is in TC
That is,
• TC(x, y) :- R(x, y);
• TC(x, z) :- TC(x, y), TC(y, z).

Exercise: develop a MapReduce algorithm that, given R, computes TC
It is a fixpoint computation:
– How to determine when to terminate?
– How to minimize redundant computation?
A naïve MapReduce algorithm

Given R(A, B), compute TC; initially, the input is the relation R

Map((a, b), value)
• emit (a, ("r", b)); emit (b, ("l", a))

Reduce(b, hvalue[ ])
• for each ("l", a) and each ("r", c) in hvalue: emit (a, c)
• for each ("l", a) in hvalue: emit (a, b)
• for each ("r", c) in hvalue: emit (b, c)

One round of the recursive computation: apply the transitivity rule to derive (a, c), and restore the pairs (a, b) and (b, c). Why restore them?
Iteration: the output of the reducers becomes the input of the mappers in the next round of MapReduce computation. Termination?
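A sketch of the rounds on the harness above, with a non-MapReduce driver that iterates to a fixpoint:

# Naive TC: iterate one-round MapReduce jobs until nothing changes
edges = [((1, 2), None), ((2, 3), None), ((3, 4), None)]

def tc_map(edge, value):
    a, b = edge
    return [(a, ("r", b)), (b, ("l", a))]

def tc_reduce(b, hvalues):
    lefts  = [a for tag, a in hvalues if tag == "l"]
    rights = [c for tag, c in hvalues if tag == "r"]
    out  = [((a, c), None) for a in lefts for c in rights]  # transitivity
    out += [((a, b), None) for a in lefts]                  # restore (a, b)
    out += [((b, c), None) for c in rights]                 # restore (b, c)
    return out

pairs = edges
while True:                              # the driver controls termination
    new_pairs = sorted(set(map_reduce(pairs, tc_map, tc_reduce)))
    if new_pairs == sorted(set(pairs)):  # fixpoint: no new pairs
        break
    pairs = new_pairs
print([p for p, _ in pairs])
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]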
A MapReduce algorithm

The naïve algorithm is not very efficient. Why?

Given R(A, B), compute TC, with Map and Reduce as above
Termination: when the intermediate result no longer changes, controlled by a non-MapReduce driver
How many rounds does it take? How can we improve it?
Smart transitive closure (recursive doubling)

Implement smart TC in MapReduce:
1. P0(x, y) := { (x, x) | x is in R };
2. Q0 := R;
3. i := 0;
4. while Qi ≠ ∅ do
   1. i := i + 1;
   2. Pi(x, y) := { (x, y) | Q_{i-1}(x, z) and P_{i-1}(z, y) };
   3. Pi := Pi ∪ P_{i-1};
   4. Qi(x, y) := { (x, y) | Q_{i-1}(x, z) and Q_{i-1}(z, y) };
   5. Qi := Qi − Pi;

Invariants, viewing R as the edge relation of a graph:
• Pi(x, y): the distance from x to y is in [0, 2^i − 1]
• Qi(x, y): the distance from x to y is exactly 2^i
Step 2 adds the pairs at distance [2^{i-1}, 2^i − 1]; step 5 removes the pairs that already have a shorter path

How many rounds? Recursive doubling: O(log |R|)
In MapReduce: you know how to do join, union and set difference (a project)
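A sketch with sets standing in for relations; each composition, union and difference below would be one MapReduce job (the helper compose is illustrative):

# Smart TC by recursive doubling
def compose(Q, P):                   # { (x, y) | Q(x, z) and P(z, y) }
    return {(x, y) for x, z in Q for z2, y in P if z == z2}

R = {(1, 2), (2, 3), (3, 4), (4, 5)}
P = {(x, x) for e in R for x in e}   # P0: pairs at distance 0
Q = set(R)                           # Q0: pairs at distance exactly 1
rounds = 0
while Q:
    P_new = compose(Q, P) | P        # distances in [0, 2^i - 1]
    Q = compose(Q, Q) - P_new        # distance exactly 2^i
    P = P_new
    rounds += 1
print(rounds, sorted(p for p in P if p[0] != p[1]))
# 3 rounds on this 4-edge chain: the full closure in O(log) rounds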
Advantages of MapReduce

Simplicity: one only needs to define two functions; no need to worry about how the data is stored and distributed, or how the operations are scheduled
Fault tolerance: why? (next slide)
Scalability: a large number of low-end machines
• scale out (scale horizontally): add new computers to the distributed application; low-cost "commodity" hardware
• scale up (scale vertically): upgrade, i.e., add (costly) resources to a single node
Flexibility: independent of data models or schemas
Independence: it can work with various storage layers
Fault tolerance

MapReduce is able to handle an average of 1.2 failures per analysis job:
• the input data is triplicated (three replicas in the DFS), so a lost block can be re-read elsewhere
• failures are detected, and the tasks of failed nodes are reassigned to healthy nodes
• redundant task execution is also used to achieve load balancing
Pitfalls of MapReduce

No schema: schema-free
No index: index-free
A single dataflow: a single input and a single output
No high-level language: no SQL
Why is this bad? For one thing, it is inefficient to do joins

Why low efficiency?
• Poor I/O optimization and resource utilization
• Each round carries its invariant input data again; there is no support for incremental computation, hence redundant computation
• The model provides no mechanism to maintain global data structures that can be accessed and updated by all mappers and reducers
• Communication is maximized: all-to-all data shipment between mappers and reducers
A functional programming model: designed to minimize human effort, not to maximize performance
Inefficiency of MapReduce

Blocking: the reduce phase does not start until all map tasks have completed
Unnecessary shipment of invariant input data in each MapReduce round (HaLoop fixes this)
Despite these issues, MapReduce is popular in industry
MapReduce platforms

Apache Hadoop, used by Facebook, Yahoo!, …
– Hive, Facebook, HiveQL (SQL)
– Pig, Yahoo!, Pig Latin (SQL-like)
– SCOPE, Microsoft, SQL

NoSQL systems
– Cassandra, Facebook, CQL (no joins)
– HBase, an open-source implementation of Google's distributed BigTable
– MongoDB, document-oriented

Distributed graph query engines (we will study some of these)
– Pregel, Google (a vertex-centric model)
– TAO, Facebook
– GraphLab, for machine learning and data mining
– Neo4j, Neo Technology; Trinity, Microsoft; HyperGraphDB (knowledge bases)
Vertex-centric models
The BSP model (Pregel)

Bulk Synchronous Parallel: vertex-centric, message passing
Vertex: computation is defined to run on each vertex
Supersteps: within each superstep, all vertices compute in parallel; each vertex
– modifies its state and that of its outgoing edges
– sends messages to other vertices (delivered in the next superstep)
– receives the messages sent to it (in the previous superstep)
Synchronization: supersteps, analogous to MapReduce rounds; within a superstep, the vertices run asynchronously
Termination:
– each vertex votes to halt
– the computation ends when all vertices are inactive and no messages are in transit
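A minimal single-machine sketch of the superstep loop, computing single-source shortest paths in the Pregel style (the dictionaries and the driver loop are illustrative, not Google's API):

import math

# SSSP from vertex 1; graph maps each vertex to (neighbour, weight) pairs
graph = {1: [(2, 4), (3, 1)], 2: [(4, 1)], 3: [(2, 1)], 4: []}
dist = {v: math.inf for v in graph}        # per-vertex state
inbox = {v: [] for v in graph}
inbox[1] = [0]                             # seed message to the source

while any(inbox.values()):                 # halt: all inactive, none in transit
    outbox = {v: [] for v in graph}
    for v in graph:                        # one superstep, "in parallel"
        if inbox[v]:                       # a vertex with no messages stays halted
            d = min(inbox[v])
            if d < dist[v]:
                dist[v] = d                    # update the local state
                for u, w in graph[v]:          # messages are delivered in
                    outbox[u].append(d + w)    # the next superstep
    inbox = outbox

print(dist)   # {1: 0, 2: 2, 3: 1, 4: 3}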
The vertex-centric model of GraphLab

Vertex: computation is defined to run on each vertex; all vertices compute in parallel
– each vertex reads and writes data on adjacent vertices or edges
Asynchronous: no supersteps, unlike Pregel
Consistency (serializability of concurrent updates):
– full consistency: the scopes of concurrent updates do not overlap
– edge consistency: exclusive read-write on a vertex and its adjacent edges; read-only on adjacent vertices
– vertex consistency: all updates proceed in parallel (sync operations)
Designed for machine learning and data mining
Vertex-centric models vs. MapReduce

Vertex-centric: think like a vertex; MapReduce: think like a graph
Vertex-centric: maximizes parallelism via asynchronous execution, and minimizes data shipment via message passing
MapReduce: inefficiency caused by blocking, and by distributing intermediate results
Vertex-centric: limited to graphs; MapReduce: general-purpose
Shared limitations:
– lack of global control: no ordering for processing vertices in recursive computations, no incremental computation, etc.
– algorithms have to be re-cast in these models; it is hard to reuse existing (incremental) algorithms
Can we do better?
GRAPE: A parallel model based on partial evaluation
Querying distributed graphs

Given a big graph G and n processors S1, …, Sn:
– G is partitioned into fragments (G1, …, Gn), dividing the big G into small fragments of manageable size
– G is distributed: fragment Gi is stored at processor Si
– each processor Si processes its local fragment Gi in parallel

Parallel query answering
– Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
– Output: Q(G), the answer to Q in G
The idea: compute Q(G) by combining the local answers Q(G1), Q(G2), …, Q(Gn). How does it work?
GRAPE (GRAPh Engine)
Divide and conquer (data-partitioned parallelism):
• partition G into fragments (G1, …, Gn) of manageable sizes, distributed to various sites
• upon receiving a query Q, evaluate Q(Gi) on the smaller fragments in parallel
• collect the partial answers at a coordinator site, and assemble them to find the answer Q(G) in the entire G
Each machine (site) Si processes the same query Q, using only the data stored in its local fragment Gi
Partial evaluation

The connection between partial evaluation and parallel processing: to compute f(s, d), where s is the known part of the input and d is the yet unavailable input, conduct the part of the computation that depends only on s, and generate a partial answer: a residual function of d.

Partial evaluation in distributed query processing:
• at each site, the local fragment Gi is the known input, and the other fragments Gj are the yet unavailable input
• evaluate Q(Gi) in parallel; the partial matches are residual functions
• collect the partial matches at a coordinator site, and assemble them to find the answer Q(G) in the entire G
Coordinator
Each machine (site) Si is either a coordinator or a worker (a worker conducts local computation and produces partial answers)
Coordinator: receive/post queries, control termination, and assemble the answers. Upon receiving a query Q:
• post Q to all workers
• initialize a status flag for each worker, mutable by that worker
• terminate the computation when all flags are true
• assemble the partial answers from the workers, and produce the final answer Q(G)
Workers

Worker: conduct local computation and produce partial answers
Upon receiving a query Q:
• evaluate Q(Gi) in parallel, using the local data Gi only (partial evaluation, possibly recursive)
• send messages to request data for the "border nodes" (nodes with edges to other fragments)
Incremental computation: upon receiving new messages M, evaluate Q(Gi + M) in parallel, incrementally
This step repeats until the partial answer at site Si is ready; the worker then sets its flag true (no more changes to its partial results) and sends its partial answer to the coordinator
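A single-machine sketch of the whole loop on a toy query, reachability from a source node, over two fragments (all names and the message format are illustrative, not the actual GRAPE interface):

# GRAPE-style evaluation: partial evaluation + incremental rounds
fragments = {                       # fragment id -> local adjacency lists
    0: {1: [2], 2: [3]},            # node 3 lives in fragment 1: a border node
    1: {3: [4], 4: []},
}
owner = {1: 0, 2: 0, 3: 1, 4: 1}    # which fragment owns each node

def local_eval(g, seeds, reached):
    # evaluate reachability inside one fragment only; return requests
    # (messages) for border nodes owned by other fragments
    msgs, stack = [], [s for s in seeds if s not in reached]
    while stack:
        v = stack.pop()
        reached.add(v)
        for u in g[v]:
            if u in g:
                if u not in reached:
                    stack.append(u)   # local node: keep going
            else:
                msgs.append(u)        # border node: message its owner
    return msgs

reached = {i: set() for i in fragments}
inbox = {0: [1], 1: []}               # the query: which nodes does 1 reach?
while any(inbox.values()):            # coordinator: stop when no messages
    outbox = {i: [] for i in fragments}
    for i, seeds in inbox.items():    # each worker, in parallel
        for b in local_eval(fragments[i], seeds, reached[i]):
            outbox[owner[b]].append(b)
    inbox = outbox

print(sorted(set().union(*reached.values())))   # assembled answer: [1, 2, 3, 4]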
Subgraph isomorphism

Input: a pattern graph Q and a graph G
Output: all the matches of Q in G, i.e., all subgraphs of G that are isomorphic to Q
A match is a bijective function f on nodes such that (u, u') is an edge in Q iff (f(u), f(u')) is an edge in G

How to compute the partial answers?
Partition G into fragments (G1, …, Gn), distributed to n workers
Coordinator: upon receiving a pattern query Q,
• post the pattern Q to each worker
• set flag[i] false for each worker Si
Local computation at each worker
Worker: upon receiving the query Q,
• send messages to request the d-neighbourhood of each border node, where d is the diameter of Q; denote the fetched data by Gd
• compute the partial answer Q(Gi ∪ Gd), by calling any existing subgraph isomorphism algorithm
• set flag[i] true, and send Q(Gi ∪ Gd) to the coordinator
Coordinator: the final answer Q(G) is simply the union of all the partial answers Q(Gi ∪ Gd)
No incremental computation is needed
The algorithm is correct. Why? Correctness follows from the data locality of subgraph isomorphism: every match of Q is contained in the d-neighbourhood of each of its nodes, where d is the diameter of Q
Performance guarantee

Partial evaluation + data-partitioned parallelism
Complexity analysis: let
– t(|G|, |Q|) be the sequential complexity
– T(|G|, |Q|, n) be the parallel complexity, where n is the number of processors
Then T(|G|, |Q|, n) = O(t(|G|, |Q|)/n)
Proof: bound the computational costs plus the communication costs (a polynomial reduction)
Parallel scalability: the more processors, the faster the algorithm (provided that n << |G|). Why? Compare this with MapReduce
Reuse of existing algorithms: a worker can employ any existing sequential algorithm for subgraph isomorphism
Necessary parts for your project: correctness proof, complexity analysis, and performance guarantees
Transitive closures in GRAPE

TC(x, y) :- R(x, y); TC(x, z) :- TC(x, y), TC(y, z).
Represent R as a graph G: the elements are the nodes, and R is the edge relation; partition G into fragments (G1, …, Gn), distributed to n workers

Worker:
• compute TC(Gi) in parallel
• send messages to request data for the "border nodes"
• given new messages M, incrementally compute TC(Gi + M) (the incremental step)
Termination: repeat until there are no more changes
Compared with MapReduce: the incremental steps minimize unnecessary recomputation
Summary and review

• What is the MapReduce framework? Its pros? Its pitfalls?
• How do we implement joins in the MapReduce framework?
• Develop algorithms in MapReduce
• Vertex-centric models for querying big graphs
• Distributed query processing with performance guarantees, by partial evaluation
• What performance guarantees do we want when evaluating graph queries on distributed graphs?
• Compare the four parallel models: MapReduce, BSP, vertex-centric, and GRAPE
• Correctness proofs, complexity analysis and performance guarantees for your algorithms
Project (1)

A "smart" algorithm for computing the transitive closure of R is given in "Distributed Algorithms for the Transitive Closure":
http://asingleneuron.files.wordpress.com/2013/10/distributedalgorithmsfortransitiveclosure.pdf

1. TC(x, y) := ∅; Q := R;
2. while Q ≠ ∅ do
   a) TC := Q ∪ TC;
   b) TC := (Q ⋈ TC) ∪ TC;   (the join uses TC from the previous iteration)
   c) Q := (Q ⋈ Q) − TC;
Recall that you can implement union and join in MapReduce

Implement the algorithm in the programming models of
• MapReduce
• GRAPE
Give a correctness proof and complexity analysis; experimentally evaluate and compare the implementations
Flashback: graph simulation

Input: a pattern graph Q and a graph G
Output: Q(G), a binary relation S on the nodes of Q and G such that
• each node u in Q is mapped to a node v in G with (u, v) ∈ S
• for each (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G with (u', v') ∈ S
Complexity: O((|V| + |VQ|)(|E| + |EQ|)) time
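A sequential fixpoint sketch (simplified and illustrative: it starts from all label-compatible pairs and omits the final check that every pattern node retains a match):

# Graph simulation: repeatedly discard pairs (u, v) where some pattern
# edge (u, u') has no matching G-edge (v, v') with (u', v') still in S
def simulation(Qedges, Qnodes, Gedges, Gnodes, label_match):
    S = {(u, v) for u in Qnodes for v in Gnodes if label_match(u, v)}
    changed = True
    while changed:
        changed = False
        for (u, v) in list(S):
            for u2 in Qedges.get(u, []):
                if not any((u2, w) in S for w in Gedges.get(v, [])):
                    S.discard((u, v))
                    changed = True
                    break
    return S

Qn, Qe = ["A", "B"], {"A": ["B"]}
Gn, Ge = [1, 2, 3], {1: [2], 3: [2]}
lbl = {1: "A", 2: "B", 3: "A"}
print(simulation(Qe, Qn, Ge, Gn, lambda u, v: lbl[v] == u))
# {('A', 1), ('B', 2), ('A', 3)} (a set; order may vary)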
Project (2)

Develop two algorithms that, given a graph G and a pattern Q, compute Q(G) via graph simulation, based on two of the following models:
• MapReduce
• BSP (Pregel)
• the vertex-centric model of GraphLab
• GRAPE
Give a correctness proof and complexity analysis of your algorithms
Experimentally evaluate your algorithms
Project (3)

Given a directed graph G and a pair of nodes (s, t) in G, the distance from s to t is the length of a shortest path from s to t.
Develop two algorithms that, given a graph G, compute the shortest distances between all pairs of nodes in G, based on two of the following models:
• MapReduce
• BSP (Pregel)
• the vertex-centric model of GraphLab
• GRAPE
Give a correctness proof and complexity analysis of your algorithms
Experimentally evaluate your algorithms
Project (4)

Show that GRAPE can be revised to support relational queries, based on what you have learned about parallel/distributed query processing. Show
• how to support the relational operators
• how to optimize your queries
Implement a lightweight system to support relational queries in GRAPE:
• basic relational operators
• SPC queries
Demonstrate your system
Project (5)

Write a survey on any of the following topics, by evaluating 5-6 representative papers/systems:
• parallel models for query evaluation
• distributed graph query engines
• distributed database systems
• MapReduce platforms
• NoSQL systems
• …
Develop a set of criteria, and evaluate the techniques/systems based on those criteria
Demonstrate your understanding of the topic
Reading for the next week (http://homepages.inf.ed.ac.uk/wenfei/publication.html)

1. W. Fan, F. Geerts, and F. Neven. Making Queries Tractable on Big Data with Preprocessing. VLDB 2013. (BD-tractability)
2. Y. Tao, W. Lin, and X. Xiao. Minimal MapReduce Algorithms. SIGMOD 2013. (MMC) http://www.cse.cuhk.edu.hk/~taoyf/paper/sigmod13-mr.pdf
3. J. Hellerstein. The Declarative Imperative: Experiences and Conjectures in Distributed Logic. SIGMOD Record 39(1), 2010. (communication complexity) http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.pdf
4. Y. Cao, W. Fan, and W. Yu. Bounded Conjunctive Queries. VLDB 2014. (scale independence)
5. W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression. SIGMOD 2012. (query-preserving compression)
6. W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries Using Views. ICDE 2014. (query answering using views)
7. F. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. D. Ullman. Map-Reduce Extensions and Recursive Queries. EDBT 2011. https://asterix.ics.uci.edu/pub/EDBT11-afrati.pdf