TDD: Topics in Distributed Databases
Distributed Query Processing
MapReduce
Vertex-centric models for querying graphs
Distributed query evaluation by partial evaluation
Coping with the sheer volume of big data

Using an SSD with a 6GB/s scan rate, a linear scan of a data set D would take
– 1.9 days when D is of 1PB (10^15 bytes)
– 5.28 years when D is of 1EB (10^18 bytes)

Can we do better if we have n processors/SSDs? Divide and conquer: split the work across the n processors.

[Figure: a shared-nothing architecture — n nodes, each with a processor (P), memory (M) and a database partition (DB), connected by an interconnection network]

How can we make use of parallelism in query answering?
MapReduce
A programming model with two primitive functions:
– Map: <k1, v1> → list(k2, v2)
– Reduce: <k2, list(v2)> → list(k3, v3)

How does it work?
• Input: a list of key-value pairs <k1, v1>
• Map: applied to each pair, computes intermediate key-value pairs <k2, v2>
• The intermediate key-value pairs are hash-partitioned on k2; each partition (k2, list(v2)) is sent to a reducer
• Reduce: takes a partition as input, and computes key-value pairs <k3, v3>

The process may reiterate: multiple map/reduce rounds
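To make the model concrete, below is a minimal single-machine sketch of one MapReduce round in Python (the harness map_reduce and the word-count functions are illustrative, not Hadoop's API). The Map/Reduce pseudocode on the following slides can be dropped into the same harness.

from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, mapper, reducer):
    # Map phase: apply the mapper to every <k1, v1> pair
    intermediate = [kv for pair in inputs for kv in mapper(*pair)]
    # Shuffle: group intermediate pairs by k2 (a real system hash-partitions;
    # sorting and grouping yields the same partitions on one machine)
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: apply the reducer to each partition (k2, list(v2))
    return [kv
            for k2, group in groupby(intermediate, key=itemgetter(0))
            for kv in reducer(k2, [v for _, v in group])]

# Example: word count
def wc_map(key, line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

docs = [(0, "big data big graphs"), (1, "big queries")]
print(map_reduce(docs, wc_map, wc_reduce))
# [('big', 3), ('data', 1), ('graphs', 1), ('queries', 1)]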
Architecture (Hadoop)

No need to worry about how the data is stored and sent:
• the input is stored in a DFS, partitioned into blocks (64MB); one block for each mapper (a map task)
• the intermediate <k2, v2> pairs are hash-partitioned on k2 and kept in the local store of the mappers
• the reducers aggregate the results
• multiple map/reduce steps can be chained

[Figure: dataflow of one round — <k1, v1> blocks → mappers → <k2, v2> partitions → reducers → <k3, v3> results]
Connection with parallel database systems

What parallelism does MapReduce exploit? Data-partitioned parallelism: in the dataflow above, both the map phase and the reduce phase are parallel computations over partitions of the data.
Parallel database systems

A shared-nothing parallel database system ([Figure: nodes, each with a processor P, memory M and database partition DB, connected by an interconnection network]) exploits the same data-partitioned parallelism. The difference: MapReduce supports a restricted query language with only two primitives, Map and Reduce, rather than full SQL.
MapReduce implementation of relational operators

Projection π_A(R)
Input: for each tuple t in R, a pair (key, value), where value = t; the key is not necessarily a key of R

Map(key, t)
• emit (t[A], t[A])

Reduce(hkey, hvalue[ ])
• emit (hkey, hkey)

Map is applied to each input tuple in parallel, emitting tuples with the projected attributes; the mappers process the tuples in parallel. The reducer is not strictly necessary, but it eliminates duplicates. Why?
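A sketch on the illustrative map_reduce harness above, with R of schema (A, B) assumed:

# Projection pi_A(R); R has schema (A, B)
R = [(None, (1, "x")), (None, (1, "y")), (None, (2, "z"))]

def proj_map(key, t):
    return [(t[0], t[0])]         # emit (t[A], t[A])

def proj_reduce(hkey, hvalues):
    return [(hkey, hkey)]         # one output per distinct A-value

print(map_reduce(R, proj_map, proj_reduce))   # [(1, 1), (2, 2)]

All duplicates of a projected value carry the same key, so they land in the same partition, and the reducer emits the value once; this is why the reduce step eliminates duplicates.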
Selection

Selection σ_C(R)
Input: for each tuple t in R, a pair (key, value), where value = t

Map(key, t)
• if C(t)
• then emit (t, "1")

Reduce(hkey, hvalue[ ])
• emit (hkey, hkey)

Map is applied to each input tuple in parallel, selecting the tuples that satisfy condition C; the reducers eliminate duplicates. How? All copies of a selected tuple t share the key t, and hence reach the same reducer, which emits t once.
Union

Union R1 ∪ R2
Input: for each tuple t in R1 and s in R2, a pair (key, value); a mapper is assigned chunks from either R1 or R2, and processes tuples of R1 and R2 uniformly

Map(key, t)
• emit (t, "1")

Reduce(hkey, hvalue[ ])
• emit (hkey, hkey)

A mapper just passes each input tuple on to a reducer; the reducers simply eliminate duplicates
Set difference

Set difference R1 − R2
Input: for each tuple t in R1 and s in R2, a pair (key, value); tag each tuple with its source, so that tuples from R1 and R2 are distinguishable

Map(key, t)
• if t is in R1
• then emit (t, "1")
• else emit (t, "2")

Reduce(hkey, hvalue[ ])
• if only "1" appears in the list hvalue
• then emit (hkey, hkey)

The reducers do the checking. Why does it work?
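A sketch on the harness above (the source tag is attached when the input pairs are prepared, since each mapper knows which relation its chunk comes from):

# Set difference R1 - R2 via source tagging
R1 = [(1, "a"), (2, "b")]
R2 = [(2, "b")]
tagged = [(None, (t, "1")) for t in R1] + [(None, (t, "2")) for t in R2]

def diff_map(key, value):
    t, tag = value
    return [(t, tag)]             # key each pair by the tuple itself

def diff_reduce(t, tags):
    # t is in the answer iff no copy of t came from R2
    return [(t, t)] if "2" not in tags else []

print(map_reduce(tagged, diff_map, diff_reduce))  # [((1, 'a'), (1, 'a'))]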
Reduce-side join

Reduce-side join (based on hash partitioning)
Natural join R1 ⋈_{R1.A = R2.B} R2, where R1[A, C] and R2[B, D]
Input: for each tuple t in R1 and s in R2, a pair (key, value)

Map(key, t)
• if t is in R1
• then emit (t[A], ("1", t[C]))
• else emit (t[B], ("2", t[D]))

Reduce(hkey, hvalue[ ])
• for each ("1", t[C]) and each ("2", s[D]) in the list hvalue
• emit ((hkey, t[C], s[D]), hkey)

Hashing on the join attributes sends all tuples that may join to the same reducer; each reducer then runs a nested loop over its list. Inefficient. Why? And how would you implement R1 ⋈ R2 ⋈ R3?
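A sketch on the harness above, with R1[A, C] and R2[B, D]:

# Reduce-side join of R1(A, C) and R2(B, D) on A = B
R1 = [(1, "c1"), (2, "c2")]
R2 = [(1, "d1"), (1, "d2")]
tagged = [(None, ("1", t)) for t in R1] + [(None, ("2", s)) for s in R2]

def join_map(key, value):
    tag, t = value
    return [(t[0], (tag, t[1]))]        # key = the join attribute

def join_reduce(hkey, hvalues):
    cs = [c for tag, c in hvalues if tag == "1"]
    ds = [d for tag, d in hvalues if tag == "2"]
    return [((hkey, c, d), hkey) for c in cs for d in ds]  # nested loop

print(map_reduce(tagged, join_map, join_reduce))
# [((1, 'c1', 'd1'), 1), ((1, 'c1', 'd2'), 1)]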
Map-side join

Map-side join (partitioned join): recall R1 ⋈_{R1.A = R2.B} R2
• Partition R1 and R2 into n partitions each, by the same partitioning function on R1.A and R2.B, via either range or hash partitioning
• Compute Ri1 ⋈_{R1.A = R2.B} Ri2 locally at processor i
• Merge the local results

The input relations are partitioned and sorted on the join keys; map over R1 and read from the corresponding partition of R2, as in a merge join

map(key, t)
• read Ri2
• for each tuple s in relation Ri2
• if t[A] = s[B] then emit ((t[A], t[C], s[D]), t[A])

Limitation: both inputs must agree on the sort order and the partitioning
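A sketch of the co-partitioning idea in plain Python (the helper hash_partition is illustrative, and a nested loop stands in for the merge scan):

# Map-side (partitioned) join: co-partition both relations on the join
# key; each map task joins one pair of co-partitioned blocks, no shuffle
def hash_partition(tuples, n, attr=0):
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[hash(t[attr]) % n].append(t)
    return parts

n = 2
R1_parts = hash_partition([(1, "c1"), (2, "c2")], n)   # R1(A, C)
R2_parts = hash_partition([(1, "d1"), (2, "d2")], n)   # R2(B, D)

result = []
for i in range(n):                      # each i would run as one map task
    for t in R1_parts[i]:
        for s in R2_parts[i]:           # only the co-partitioned block
            if t[0] == s[0]:
                result.append((t[0], t[1], s[1]))
print(result)   # [(2, 'c2', 'd2'), (1, 'c1', 'd1')], grouped by partition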
In-memory join

In-memory join (fragment-and-replicate join): recall R1 ⋈_{R1.A < R2.B} R2; partitioned joins do not apply to such inequality conditions
• Partition R1 into n fragments, by any partitioning method, and distribute them across n processors
• Replicate the other relation R2 across all the processors
• Compute Rj1 ⋈_{R1.A < R2.B} R2 locally at processor j
• Merge the local results

map(key, t)
• for each tuple s in relation R2 (local)
• if t[A] < s[B] then emit ((t[A], t[C], s[D]), t[A])

Broadcast join: the smaller relation is broadcast to each node and stored in its local memory; the other relation is partitioned and distributed across the mappers
Limitation: the replicated relation must fit in memory
Aggregation

Leveraging the MapReduce framework: given R(A, B, C), compute sum(B) group by A

Map(key, t)
• emit (t[A], t[B])

Reduce(hkey, hvalue[ ])
• sum := 0
• for each value s in the list hvalue
• sum := sum + s
• emit (hkey, sum)

The grouping is done by the MapReduce framework itself (the shuffle on t[A]); each reducer computes the aggregate for one group
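On the harness above (schema R(A, B, C) assumed):

# sum(B) group by A over R(A, B, C)
R = [(None, ("a1", 5, "c")), (None, ("a1", 7, "c")), (None, ("a2", 1, "c"))]

def agg_map(key, t):
    return [(t[0], t[1])]            # emit (t[A], t[B])

def agg_reduce(hkey, hvalues):
    return [(hkey, sum(hvalues))]    # the shuffle already built the group

print(map_reduce(R, agg_map, agg_reduce))   # [('a1', 12), ('a2', 1)]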
Practice: validation of functional dependencies

A functional dependency (FD) defined on schema R: X → Y
– For any instance D of R, D satisfies the FD if for any pair of tuples t and t', if t[X] = t'[X], then t[Y] = t'[Y]
– Violations of the FD in D: { t | there exists t' in D such that t[X] = t'[X], but t[Y] ≠ t'[Y] }

Develop a MapReduce algorithm to find all the violations of the FD in D
– Map: for each tuple t, emit (t[X], t); that is, t[X] serves as the key
– Reduce: find all tuples t in the partition such that there exists t' with t[X] = t'[X] but t[Y] ≠ t'[Y]; emit (1, t)

Exercise: write a MapReduce algorithm to validate a set of FDs
Does MapReduce support recursion?
Transitive closures

The transitive closure TC of a relation R[A, B]:
– R is a subset of TC
– if (a, b) and (b, c) are in TC, then (a, c) is in TC
That is,
• TC(x, y) :- R(x, y);
• TC(x, z) :- TC(x, y), TC(y, z).

Exercise: develop a MapReduce algorithm that, given R, computes TC
It is a fixpoint computation:
– How to determine when to terminate?
– How to minimize redundant computation?
A naïve MapReduce algorithm

Given R(A, B), compute TC; initially, the input is the relation R

Map((a, b), value)
• emit (a, ("r", b)); emit (b, ("l", a))

Reduce(b, hvalue[ ])
• for each ("l", a) and each ("r", c) in hvalue: emit (a, c)
• for each ("l", a) in hvalue: emit (a, b)
• for each ("r", c) in hvalue: emit (b, c)

One round of the recursive computation: apply the transitivity rule to derive (a, c), and restore the pairs (a, b) and (b, c). Why restore them?
Iteration: the output of the reducers becomes the input of the mappers in the next round of MapReduce computation. Termination?
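A sketch of the rounds on the harness above, with a non-MapReduce driver that iterates to a fixpoint:

# Naive TC: iterate one-round MapReduce jobs until nothing changes
edges = [((1, 2), None), ((2, 3), None), ((3, 4), None)]

def tc_map(edge, value):
    a, b = edge
    return [(a, ("r", b)), (b, ("l", a))]

def tc_reduce(b, hvalues):
    lefts  = [a for tag, a in hvalues if tag == "l"]
    rights = [c for tag, c in hvalues if tag == "r"]
    out  = [((a, c), None) for a in lefts for c in rights]  # transitivity
    out += [((a, b), None) for a in lefts]                  # restore (a, b)
    out += [((b, c), None) for c in rights]                 # restore (b, c)
    return out

pairs = edges
while True:                              # the driver controls termination
    new_pairs = sorted(set(map_reduce(pairs, tc_map, tc_reduce)))
    if new_pairs == sorted(set(pairs)):  # fixpoint: no new pairs
        break
    pairs = new_pairs
print([p for p, _ in pairs])
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]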
A MapReduce algorithm

The naïve algorithm is not very efficient. Why?

Given R(A, B), compute TC, with Map and Reduce as above
Termination: when the intermediate result no longer changes, controlled by a non-MapReduce driver
How many rounds does it take? How can we improve it?
Smart transitive closure (recursive doubling)

Implement smart TC in MapReduce:
1. P0(x, y) := { (x, x) | x is in R };
2. Q0 := R;
3. i := 0;
4. while Qi ≠ ∅ do
   1. i := i + 1;
   2. Pi(x, y) := { (x, y) | Q_{i-1}(x, z) and P_{i-1}(z, y) };
   3. Pi := Pi ∪ P_{i-1};
   4. Qi(x, y) := { (x, y) | Q_{i-1}(x, z) and Q_{i-1}(z, y) };
   5. Qi := Qi − Pi;

Invariants, viewing R as the edge relation of a graph:
• Pi(x, y): the distance from x to y is in [0, 2^i − 1]
• Qi(x, y): the distance from x to y is exactly 2^i
Step 2 adds the pairs at distance [2^{i-1}, 2^i − 1]; step 5 removes the pairs that already have a shorter path

How many rounds? Recursive doubling: O(log |R|)
In MapReduce: you know how to do join, union and set difference (a project)
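A sketch with sets standing in for relations; each composition, union and difference below would be one MapReduce job (the helper compose is illustrative):

# Smart TC by recursive doubling
def compose(Q, P):                   # { (x, y) | Q(x, z) and P(z, y) }
    return {(x, y) for x, z in Q for z2, y in P if z == z2}

R = {(1, 2), (2, 3), (3, 4), (4, 5)}
P = {(x, x) for e in R for x in e}   # P0: pairs at distance 0
Q = set(R)                           # Q0: pairs at distance exactly 1
rounds = 0
while Q:
    P_new = compose(Q, P) | P        # distances in [0, 2^i - 1]
    Q = compose(Q, Q) - P_new        # distance exactly 2^i
    P = P_new
    rounds += 1
print(rounds, sorted(p for p in P if p[0] != p[1]))
# 3 rounds on this 4-edge chain: the full closure in O(log) rounds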
Advantages of MapReduce

Simplicity: one only needs to define two functions; no need to worry about how the data is stored and distributed, or how the operations are scheduled
Fault tolerance: why? (next slide)
Scalability: a large number of low-end machines
• scale out (scale horizontally): add new computers to the distributed application; low-cost "commodity" hardware
• scale up (scale vertically): upgrade, i.e., add (costly) resources to a single node
Flexibility: independent of data models or schemas
Independence: it can work with various storage layers
Fault tolerance

MapReduce is able to handle an average of 1.2 failures per analysis job:
• the input data is triplicated (three replicas in the DFS), so a lost block can be re-read elsewhere
• failures are detected, and the tasks of failed nodes are reassigned to healthy nodes
• redundant task execution is also used to achieve load balancing
Pitfalls of MapReduce

No schema: schema-free
No index: index-free
A single dataflow: a single input and a single output
No high-level language: no SQL
Why is this bad? For one thing, it is inefficient to do joins

Why low efficiency?
• Poor I/O optimization and resource utilization
• Each round carries its invariant input data again; there is no support for incremental computation, hence redundant computation
• The model provides no mechanism to maintain global data structures that can be accessed and updated by all mappers and reducers
• Communication is maximized: all-to-all data shipment between mappers and reducers
A functional programming model: designed to minimize human effort, not to maximize performance
Inefficiency of MapReduce

Blocking: the reduce phase does not start until all map tasks have completed
Unnecessary shipment of invariant input data in each MapReduce round (HaLoop fixes this)
Despite these issues, MapReduce is popular in industry
MapReduce platforms

Apache Hadoop, used by Facebook, Yahoo!, …
– Hive, Facebook, HiveQL (SQL)
– Pig, Yahoo!, Pig Latin (SQL-like)
– SCOPE, Microsoft, SQL

NoSQL systems
– Cassandra, Facebook, CQL (no joins)
– HBase, an open-source implementation of Google's distributed BigTable
– MongoDB, document-oriented

Distributed graph query engines (we will study some of these)
– Pregel, Google (a vertex-centric model)
– TAO, Facebook
– GraphLab, for machine learning and data mining
– Neo4j, Neo Technology; Trinity, Microsoft; HyperGraphDB (knowledge bases)
Vertex-centric models
The BSP model (Pregel)

Bulk Synchronous Parallel: vertex-centric, message passing
Vertex: computation is defined to run on each vertex
Supersteps: within each superstep, all vertices compute in parallel; each vertex
– modifies its state and that of its outgoing edges
– sends messages to other vertices (delivered in the next superstep)
– receives the messages sent to it (in the previous superstep)
Synchronization: supersteps, analogous to MapReduce rounds; within a superstep, the vertices run asynchronously
Termination:
– each vertex votes to halt
– the computation ends when all vertices are inactive and no messages are in transit
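A minimal single-machine sketch of the superstep loop, computing single-source shortest paths in the Pregel style (the dictionaries and the driver loop are illustrative, not Google's API):

import math

# SSSP from vertex 1; graph maps each vertex to (neighbour, weight) pairs
graph = {1: [(2, 4), (3, 1)], 2: [(4, 1)], 3: [(2, 1)], 4: []}
dist = {v: math.inf for v in graph}        # per-vertex state
inbox = {v: [] for v in graph}
inbox[1] = [0]                             # seed message to the source

while any(inbox.values()):                 # halt: all inactive, none in transit
    outbox = {v: [] for v in graph}
    for v in graph:                        # one superstep, "in parallel"
        if inbox[v]:                       # a vertex with no messages stays halted
            d = min(inbox[v])
            if d < dist[v]:
                dist[v] = d                    # update the local state
                for u, w in graph[v]:          # messages are delivered in
                    outbox[u].append(d + w)    # the next superstep
    inbox = outbox

print(dist)   # {1: 0, 2: 2, 3: 1, 4: 3}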
The vertex-centric model of GraphLab

Vertex: computation is defined to run on each vertex; all vertices compute in parallel
– each vertex reads and writes data on adjacent vertices or edges
Asynchronous: no supersteps, unlike Pregel
Consistency (serializability of concurrent updates):
– full consistency: the scopes of concurrent updates do not overlap
– edge consistency: exclusive read-write on a vertex and its adjacent edges; read-only on adjacent vertices
– vertex consistency: all updates proceed in parallel (sync operations)
Designed for machine learning and data mining
Vertex-centric models vs. MapReduce

Vertex-centric: think like a vertex; MapReduce: think like a graph
Vertex-centric: maximizes parallelism via asynchronous execution, and minimizes data shipment via message passing
MapReduce: inefficiency caused by blocking, and by distributing intermediate results
Vertex-centric: limited to graphs; MapReduce: general-purpose
Shared limitations:
– lack of global control: no ordering for processing vertices in recursive computations, no incremental computation, etc.
– algorithms have to be re-cast in these models; it is hard to reuse existing (incremental) algorithms
Can we do better?
GRAPE: A parallel model based on partial evaluation
Querying distributed graphs

Given a big graph G and n processors S1, …, Sn:
– G is partitioned into fragments (G1, …, Gn), dividing the big G into small fragments of manageable size
– G is distributed: fragment Gi is stored at processor Si
– each processor Si processes its local fragment Gi in parallel

Parallel query answering
– Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
– Output: Q(G), the answer to Q in G
The idea: compute Q(G) by combining the local answers Q(G1), Q(G2), …, Q(Gn). How does it work?
GRAPE (GRAPh Engine)
Divide and conquer (data-partitioned parallelism):
• partition G into fragments (G1, …, Gn) of manageable sizes, distributed to various sites
• upon receiving a query Q, evaluate Q(Gi) on the smaller fragments in parallel
• collect the partial answers at a coordinator site, and assemble them to find the answer Q(G) in the entire G
Each machine (site) Si processes the same query Q, using only the data stored in its local fragment Gi
Partial evaluation

The connection between partial evaluation and parallel processing: to compute f(s, d), where s is the known part of the input and d is the yet unavailable input, conduct the part of the computation that depends only on s, and generate a partial answer: a residual function of d.

Partial evaluation in distributed query processing:
• at each site, the local fragment Gi is the known input, and the other fragments Gj are the yet unavailable input
• evaluate Q(Gi) in parallel; the partial matches are residual functions
• collect the partial matches at a coordinator site, and assemble them to find the answer Q(G) in the entire G
Coordinator
Each machine (site) Si is either a coordinator or a worker (a worker conducts local computation and produces partial answers)
Coordinator: receive/post queries, control termination, and assemble the answers. Upon receiving a query Q:
• post Q to all workers
• initialize a status flag for each worker, mutable by that worker
• terminate the computation when all flags are true
• assemble the partial answers from the workers, and produce the final answer Q(G)
Workers

Worker: conduct local computation and produce partial answers
Upon receiving a query Q:
• evaluate Q(Gi) in parallel, using the local data Gi only (partial evaluation, possibly recursive)
• send messages to request data for the "border nodes" (nodes with edges to other fragments)
Incremental computation: upon receiving new messages M, evaluate Q(Gi + M) in parallel, incrementally
This step repeats until the partial answer at site Si is ready; the worker then sets its flag true (no more changes to its partial results) and sends its partial answer to the coordinator
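A single-machine sketch of the whole loop on a toy query, reachability from a source node, over two fragments (all names and the message format are illustrative, not the actual GRAPE interface):

# GRAPE-style evaluation: partial evaluation + incremental rounds
fragments = {                       # fragment id -> local adjacency lists
    0: {1: [2], 2: [3]},            # node 3 lives in fragment 1: a border node
    1: {3: [4], 4: []},
}
owner = {1: 0, 2: 0, 3: 1, 4: 1}    # which fragment owns each node

def local_eval(g, seeds, reached):
    # evaluate reachability inside one fragment only; return requests
    # (messages) for border nodes owned by other fragments
    msgs, stack = [], [s for s in seeds if s not in reached]
    while stack:
        v = stack.pop()
        reached.add(v)
        for u in g[v]:
            if u in g:
                if u not in reached:
                    stack.append(u)   # local node: keep going
            else:
                msgs.append(u)        # border node: message its owner
    return msgs

reached = {i: set() for i in fragments}
inbox = {0: [1], 1: []}               # the query: which nodes does 1 reach?
while any(inbox.values()):            # coordinator: stop when no messages
    outbox = {i: [] for i in fragments}
    for i, seeds in inbox.items():    # each worker, in parallel
        for b in local_eval(fragments[i], seeds, reached[i]):
            outbox[owner[b]].append(b)
    inbox = outbox

print(sorted(set().union(*reached.values())))   # assembled answer: [1, 2, 3, 4]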
Subgraph isomorphism

Input: a pattern graph Q and a graph G
Output: all the matches of Q in G, i.e., all subgraphs of G that are isomorphic to Q
A match is a bijective function f on nodes such that (u, u') is an edge in Q iff (f(u), f(u')) is an edge in G

How to compute the partial answers?
Partition G into fragments (G1, …, Gn), distributed to n workers
Coordinator: upon receiving a pattern query Q,
• post the pattern Q to each worker
• set flag[i] false for each worker Si
Local computation at each worker
Worker: upon receiving the query Q,
• send messages to request the d-neighbourhood of each border node, where d is the diameter of Q; denote the fetched data by Gd
• compute the partial answer Q(Gi ∪ Gd), by calling any existing subgraph isomorphism algorithm
• set flag[i] true, and send Q(Gi ∪ Gd) to the coordinator
Coordinator: the final answer Q(G) is simply the union of all the partial answers Q(Gi ∪ Gd)
No incremental computation is needed
The algorithm is correct. Why? Correctness follows from the data locality of subgraph isomorphism: every match of Q is contained in the d-neighbourhood of each of its nodes, where d is the diameter of Q
Performance guarantee

Partial evaluation + data-partitioned parallelism
Complexity analysis: let
– t(|G|, |Q|) be the sequential complexity
– T(|G|, |Q|, n) be the parallel complexity, where n is the number of processors
Then T(|G|, |Q|, n) = O(t(|G|, |Q|)/n)
Proof: bound the computational costs plus the communication costs (a polynomial reduction)
Parallel scalability: the more processors, the faster the algorithm (provided that n << |G|). Why? Compare this with MapReduce
Reuse of existing algorithms: a worker can employ any existing sequential algorithm for subgraph isomorphism
Necessary parts for your project: correctness proof, complexity analysis, and performance guarantees
Transitive closures in GRAPE

TC(x, y) :- R(x, y); TC(x, z) :- TC(x, y), TC(y, z).
Represent R as a graph G: the elements are the nodes, and R is the edge relation; partition G into fragments (G1, …, Gn), distributed to n workers

Worker:
• compute TC(Gi) in parallel
• send messages to request data for the "border nodes"
• given new messages M, incrementally compute TC(Gi + M) (the incremental step)
Termination: repeat until there are no more changes
Compared with MapReduce: the incremental steps minimize unnecessary recomputation
Summary and review

• What is the MapReduce framework? Its pros? Its pitfalls?
• How do we implement joins in the MapReduce framework?
• Develop algorithms in MapReduce
• Vertex-centric models for querying big graphs
• Distributed query processing with performance guarantees, by partial evaluation
• What performance guarantees do we want when evaluating graph queries on distributed graphs?
• Compare the four parallel models: MapReduce, BSP, vertex-centric, and GRAPE
• Correctness proofs, complexity analysis and performance guarantees for your algorithms
Project (1)

A "smart" algorithm for computing the transitive closure of R is given in "Distributed Algorithms for the Transitive Closure":
http://asingleneuron.files.wordpress.com/2013/10/distributedalgorithmsfortransitiveclosure.pdf

1. TC(x, y) := ∅; Q := R;
2. while Q ≠ ∅ do
   a) TC := Q ∪ TC;
   b) TC := (Q ⋈ TC) ∪ TC;   (the join uses TC from the previous iteration)
   c) Q := (Q ⋈ Q) − TC;
Recall that you can implement union and join in MapReduce

Implement the algorithm in the programming models of
• MapReduce
• GRAPE
Give a correctness proof and complexity analysis; experimentally evaluate and compare the implementations
Flashback: graph simulation

Input: a pattern graph Q and a graph G
Output: Q(G), a binary relation S on the nodes of Q and G such that
• each node u in Q is mapped to a node v in G with (u, v) ∈ S
• for each (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G with (u', v') ∈ S
Complexity: O((|V| + |VQ|)(|E| + |EQ|)) time
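A sequential fixpoint sketch (simplified and illustrative: it starts from all label-compatible pairs and omits the final check that every pattern node retains a match):

# Graph simulation: repeatedly discard pairs (u, v) where some pattern
# edge (u, u') has no matching G-edge (v, v') with (u', v') still in S
def simulation(Qedges, Qnodes, Gedges, Gnodes, label_match):
    S = {(u, v) for u in Qnodes for v in Gnodes if label_match(u, v)}
    changed = True
    while changed:
        changed = False
        for (u, v) in list(S):
            for u2 in Qedges.get(u, []):
                if not any((u2, w) in S for w in Gedges.get(v, [])):
                    S.discard((u, v))
                    changed = True
                    break
    return S

Qn, Qe = ["A", "B"], {"A": ["B"]}
Gn, Ge = [1, 2, 3], {1: [2], 3: [2]}
lbl = {1: "A", 2: "B", 3: "A"}
print(simulation(Qe, Qn, Ge, Gn, lambda u, v: lbl[v] == u))
# {('A', 1), ('B', 2), ('A', 3)} (a set; order may vary)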
Project (2)

Develop two algorithms that, given a graph G and a pattern Q, compute Q(G) via graph simulation, based on two of the following models:
• MapReduce
• BSP (Pregel)
• the vertex-centric model of GraphLab
• GRAPE
Give a correctness proof and complexity analysis of your algorithms
Experimentally evaluate your algorithms
Project (3)

Given a directed graph G and a pair of nodes (s, t) in G, the distance from s to t is the length of a shortest path from s to t.
Develop two algorithms that, given a graph G, compute the shortest distances between all pairs of nodes in G, based on two of the following models:
• MapReduce
• BSP (Pregel)
• the vertex-centric model of GraphLab
• GRAPE
Give a correctness proof and complexity analysis of your algorithms
Experimentally evaluate your algorithms
Project (4)

Show that GRAPE can be revised to support relational queries, based on what you have learned about parallel/distributed query processing. Show
• how to support the relational operators
• how to optimize your queries
Implement a lightweight system to support relational queries in GRAPE:
• basic relational operators
• SPC queries
Demonstrate your system
Project (5)

Write a survey on any of the following topics, by evaluating 5-6 representative papers/systems:
• parallel models for query evaluation
• distributed graph query engines
• distributed database systems
• MapReduce platforms
• NoSQL systems
• …
Develop a set of criteria, and evaluate the techniques/systems based on those criteria
Demonstrate your understanding of the topic
Reading for the next week (http://homepages.inf.ed.ac.uk/wenfei/publication.html)

1. W. Fan, F. Geerts, and F. Neven. Making Queries Tractable on Big Data with Preprocessing. VLDB 2013. (BD-tractability)
2. Y. Tao, W. Lin, and X. Xiao. Minimal MapReduce Algorithms. SIGMOD 2013. (MMC) http://www.cse.cuhk.edu.hk/~taoyf/paper/sigmod13-mr.pdf
3. J. Hellerstein. The Declarative Imperative: Experiences and Conjectures in Distributed Logic. SIGMOD Record 39(1), 2010. (communication complexity) http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.pdf
4. Y. Cao, W. Fan, and W. Yu. Bounded Conjunctive Queries. VLDB 2014. (scale independence)
5. W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression. SIGMOD 2012. (query-preserving compression)
6. W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries Using Views. ICDE 2014. (query answering using views)
7. F. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. D. Ullman. Map-Reduce Extensions and Recursive Queries. EDBT 2011. https://asterix.ics.uci.edu/pub/EDBT11-afrati.pdf