
Advanced Partitioning Techniques for Massively Distributed Computation

Jingren Zhou, Microsoft ([email protected])

Nicolas Bruno, Microsoft ([email protected])

Wei Lin, Microsoft ([email protected])

ABSTRACT

An increasing number of companies rely on distributed data storage and processing over large clusters of commodity machines for critical business decisions. Although plain MapReduce systems provide several benefits, they carry certain limitations that impact developer productivity and optimization opportunities. Higher level programming languages plus conceptual data models have recently emerged to address such limitations. These languages offer a single machine programming abstraction and are able to perform sophisticated query optimization and apply efficient execution strategies. In massively distributed computation, data shuffling is typically the most expensive operation and can lead to serious performance bottlenecks if not done properly. An important optimization opportunity in this environment is that of judicious placement of repartitioning operators and choice of alternative implementations. In this paper we discuss advanced partitioning strategies, their implementation, and how they are integrated in the Microsoft Scope system. We show experimentally that our approach significantly improves performance for a large class of real-world jobs.

Categories and Subject Descriptors

H.2 [Information Systems]: Database Management

Keywords

Scope, Partitioning, Distributed Computation, Query Optimization

1. INTRODUCTION

An increasing number of companies rely on the results of massive data computation for critical business decisions. Such analysis is crucial in many ways, such as to improve service quality and support novel features, to detect changes in patterns over time, and to detect fraudulent activities. Usually the scale of the data volumes to be stored and processed is so large that traditional, centralized database system solutions are no longer practical or even viable.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD '12, May 20-24, 2012, Scottsdale, Arizona, USA.
Copyright 2012 ACM 978-1-4503-1247-9/12/05 ...$10.00.

For that reason, several companies have developed distributed data storage and processing systems on large clusters of thousands of shared-nothing commodity servers. Examples of such initiatives include Google's MapReduce [5], Hadoop [1] from the open-source community and used at Yahoo, and Cosmos/Dryad [3, 12] at Microsoft. In the MapReduce approach, developers provide map and reduce functions in procedural languages like C++, which perform data transformation and aggregation. The underlying runtime system achieves parallelism by partitioning the data and processing each partition concurrently using multiple machines. This model scales reasonably well to massive data sets and has sophisticated mechanisms to achieve load-balancing, outlier detection, and failure recovery, among others.

This approach, however, has its own set of limitations. Users are required to translate their application logic to the MapReduce model in order to achieve parallelism. For some applications this mapping is very unnatural. Users have to provide implementations for the map and reduce functions, even for simple operations like projection and selection. Such custom code is error-prone and hardly reusable. Moreover, for complex applications that require multiple MapReduce stages, there are often many valid evaluation strategies and execution orders. Having developers manually implement and combine multiple MapReduce functions is equivalent to asking them to specify physical execution plans directly in relational database systems, an approach that became obsolete with the introduction of the relational model over three decades ago. Moreover, optimizing complex, multi-step MapReduce jobs is difficult, since it is not usually possible to do complex reasoning over sequences of opaque, primitive MapReduce operations.

To address this problem, higher level programming languages plus conceptual data models were recently proposed, including Jaql [2], Scope [3, 24], Tenzing [4], Dremel [15], Pig [18], Hive [21], and DryadLINQ [23]. These languages offer a single machine programming abstraction and allow developers to focus on application logic, while providing systematic optimizations for the distributed computation. These optimizations take advantage of a full view of the application logic, and can perform sophisticated query optimization strategies that are simply not possible otherwise.

In massively distributed computation, data shuffling is typically the most expensive operation and can lead to serious performance bottlenecks if not done properly. A very important optimization opportunity in this type of environment is that of judicious placement of repartitioning operators and choice of different implementation techniques. Complex scripts and reasoning about properties of intermediate results open the door to a rich class of optimization opportunities regarding data partitioning. It is crucial to avoid repartitioning unless absolutely necessary, and to do so as efficiently as possible. Previous work focused on minimizing the number of partitioning operations while executing complex scripts [24]. However, independently of how much we optimize input scripts, there are still scenarios that require data shuffling. There are also different ways to perform data shuffling, and each one is preferable under different conditions. The optimal choice of shuffling operations depends on many factors, including data and code properties, and scalability requirements. It is valuable to integrate this reasoning with the query optimizer and systematically consider alternatives with the rest of the query.

In this paper we describe advanced partitioning techniques that leverage input data properties and optimize data movement across the network. As a result, we are able to greatly improve shuffling efficiency, by either avoiding unnecessary repartitioning or partially repartitioning the input data set. The techniques are applicable both at compilation time and at runtime. We also design a novel partitioning strategy that indexes intermediate partitioned results and stores them in a single sequential file. This approach fundamentally solves the scalability challenge for data partitioning and efficiently supports an arbitrarily large number of partitions.

All the partitioning techniques are implemented in the Scope system at Microsoft, which is running in production over tens of thousands of machines. The query optimizer of Scope considers the different alternatives in a single optimization framework and chooses the optimal solution in a cost-based fashion. Experiments show that the proposed techniques improve data shuffling efficiency severalfold for real-world queries. Although this paper uses Scope as the underlying data and computation platform, the ideas are applicable to any distributed system that relies on a query optimizer and performs data shuffling.

The rest of the paper is structured as follows. In Section 2 we review the necessary technical background to support the remaining sections. In Section 3 we describe the implementation of partial repartitioning and its interaction with the query optimizer. In Section 4 we discuss a scalable index-based partitioning strategy. Section 5 reports an experimental evaluation of our strategies on real-world data. Section 6 discusses related work, and Section 7 concludes the paper.

2. PRELIMINARIES

Scope [3, 24] (Structured Computations Optimized for Parallel Execution) is the distributed computation platform for Microsoft's online services targeted for large scale data analysis. It incorporates the best characteristics of both MapReduce and parallel database systems. The system runs over large clusters of tens of thousands of machines, executes tens of thousands of jobs daily, and is on its way to becoming an exabyte store. We start with an overview of Scope, including the query language, query optimization, and different partitioning types. The reader can find more details on these concepts in the literature [3, 24, 12].

2.1 Query Language

The Scope language is declarative and intentionally reminiscent of SQL. The select statement is retained along with join variants, aggregation, and set operators. Like SQL, data is modeled as sets of rows composed of typed columns, and every rowset has a well-defined schema. At the same time, the language is highly extensible and is deeply integrated with the .NET framework. Users can easily define their own functions and implement their own versions of relational operators: extractors (parsing and constructing rows from a raw file), processors (row-wise processing), reducers (group-wise processing), combiners (combining rows from two inputs), and outputters (formatting and outputting final results). This flexibility allows users to solve problems that cannot be easily expressed in SQL, while the system is still able to perform sophisticated reasoning over scripts.

Figure 1(a) shows a very simple Scope script that counts the different 4-grams of a given single-column string data set. In the figure, NGramProcessor is a C# user-defined operator that outputs, for each input row, all its n-grams (n = 4 in the example). Conceptually, the intermediate output of the processor is a regular rowset that is processed by the SQL-like main query (note that intermediate results are not necessarily materialized between operators at runtime).
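To make the example concrete, the following C# fragment is a hedged sketch of the NGramProcessor logic. The method shape and types are simplified stand-ins rather than Scope's actual user-defined operator API; only the n-gram enumeration itself follows the description above.

using System.Collections.Generic;

public static class NGramProcessorSketch
{
    // For each single-column input row, emit every n-gram of length n
    // (n = 4 in the paper's example) as one output row.
    public static IEnumerable<string> Process(IEnumerable<string> rows, int n = 4)
    {
        foreach (string line in rows)
            for (int i = 0; i + n <= line.Length; i++)
                yield return line.Substring(i, n);
    }
}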

2.2 Query Compilation and Optimization

A Scope script goes through a series of transformations before it is executed in the cluster. Initially, the Scope compiler parses the input script, unfolds views and macro directives, performs syntax and type checking, and resolves names. The result of this step is an annotated abstract syntax tree, which is passed to the query optimizer. Figure 1(b) shows an input tree for the sample script.

The Scope optimizer is a cost-based transformation engine based on the Cascades framework [9], and generates efficient execution plans for input trees. Since the language is heavily influenced by SQL, Scope is able to leverage existing work on relational query optimization and perform rich and non-trivial query rewritings that consider the input script in a holistic manner. The Scope optimizer extends Cascades by fully integrating parallel plan optimization and performing more complex structural data property reasoning, such as partitioning and sorting of intermediate results.

The optimizer returns an execution plan that specifies the steps that are required to efficiently execute the script. Figure 1(c) shows the output from the optimizer, which defines specific implementations for each operation (e.g., stream-based aggregation), data partitioning operations (e.g., the repartition operator), and additional implementation details (e.g., the initial sort after the processor, and the unfolding of the aggregate into a local/global pair).

Finally, code generation produces the final algebra (which details the units of execution and data dependencies among them) and the assemblies that contain both user-defined code and runtime implementations of the operators in the execution plan. Figure 1(c) shows dotted lines for the two units of execution corresponding to the input script. This package is then sent to the cluster, where it is actually executed. Users can monitor the progress of running jobs, and there are management utilities to submit, queue, and prioritize scripts in the shared cluster.

2.3 Data Partitioning

A key feature of distributed query processing is partitioning data into smaller subsets and processing partitions in parallel on multiple machines. This requires operators for splitting a single input into smaller partitions, merging multiple partitions into a single output, and repartitioning an already partitioned input into a new set of partitions.

Figure 1: Definition, Compilation, and Scheduling of a Simple Scope Script. (a) Scope script; (b) input tree: Get(input.ss) → Process(NGramProcessor) → Aggregate({ngram}, count) → Output(output.txt); (c) output tree: Get(input.ss) → Process(NGramProcessor) → Sort(ngram) → Local Stream Agg({ngram}, count) → Repartition(ngram) → Global Stream Agg({ngram}, count) → Output(output.txt), with execution units SV1 and SV2; (d) scheduled graph: five SV1 instances feeding three SV2 instances.

This can be done by a single operator, the data exchange operator, which repartitions data from n inputs to m outputs [8]. After an exchange operator, the data is partitioned into m subsets that are then processed independently and in parallel using standard operators, until the data flows into the next exchange operator.

Exchange is implemented by one or two physical operators: a partition operator and/or a merge operator. Each partition operator simply partitions its input, while each merge operator collects and merges the partial results that belong to its result partition. Suppose we want to repartition n input partitions, each one on a different machine, into m output partitions on a different set of machines. The processing is done by n partition operators, one on each input machine, each of which reads its input and splits it onto m local partitions, and m merge operators, one on each output machine, which collect the data for their partition from the n corresponding local partitions. In Figure 1(d), five partition operators at the end of each SV1 vertex instance split their input into three partitions each, and three merge operators at the beginning of each SV2 vertex instance merge the five pieces of a corresponding partition.

Figure 2: Different Types of Data Exchange. (a) Full repartitioning; (b) initial split; (c) full merge.

2.3.1 Exchange Topology

Figure 2 shows the main classes of exchange operators. Full Repartitioning (Figure 2(a)) consumes n input partitions and produces m output partitions, partitioned in a different way. Every input partition contributes to every output partition, resulting in n · m connections. In the example of Figure 1(c), the plan uses full repartitioning on the nGram column in SV1, so that the input, which is not partitioned by nGram, can be aggregated correctly by SV2. Initial Split (Figure 2(b)) is a special case of full repartitioning where there is a single input stream that is partitioned into m streams, without merge operators. Finally, Full Merge (Figure 2(c)) is a special case of full repartitioning where there is a single output stream, merged from n input streams without partition operators. In this paper we explore alternative implementations of exchange operators that are based on additional topologies.

2.3.2 Partitioning Schemes

An instance of a partition operator takes one input stream and generates multiple output streams. It consumes one row at a time and writes the row to the output stream selected by a partitioning function applied to the row in a FIFO manner (so that the order of two rows r1 and r2 in the input stream is preserved if they are assigned to the same partition). There are several different types of partitioning schemes. Hash Partitioning applies a hash function to the partitioning columns to generate the partition number to which the row is output. Range Partitioning divides the domain of the partitioning columns into a set of disjoint ranges, as many as the desired number of partitions. A row is assigned to the partition determined by the value of its partitioning columns, producing ordered partitions. Other non-deterministic partitioning schemes, in which the data content of a row does not affect which partition the row is assigned to, include round-robin and random. The example in Figure 1(c) uses hash partitioning.

2.3.3 Merging Schemes

An instance of a merge operator combines data from multiple input streams into a single output stream. Depending on whether the input streams are sorted individually and how rows from different input streams are ordered, we have several types of merge operations. Random Merge randomly pulls rows from different input streams and merges them into a single output stream, so the ordering of rows from the same input stream is preserved. Sort Merge takes a list of sort columns as a parameter and a set of input streams sorted on the same columns. The input streams are merged together into a single sorted output stream. Concat Merge concatenates multiple input streams into a single output stream. It consumes one input stream at a time and outputs its rows in order to the output stream. That is, it maintains the row order within an input stream but it does not guarantee the order in which the input streams are consumed. Finally, Sort-Concat Merge takes a list of sort columns as a parameter. First, it picks one row (usually the first one) from each input stream, sorts them on the values of the sort columns, and uses the row order to decide the order in which to concatenate the input streams. This is useful for merging range-partitioned inputs into a fully ordered output. In the example of Figure 1(c), the different partitions are ordered by nGram due to the sort operator, so we use the Sort Merge variant to ensure the right row ordering when consumed by the global stream aggregate operator.

2.4 Job Scheduling

Scope relies on Dryad [12] to schedule the compiled script inside the cluster. The execution of a script can be modeled as a graph, where each vertex represents an instance of a computation unit, and each edge corresponds to data flow between vertices¹. Figure 1(d) shows the execution graph for the sample script, assuming that the input data is laid out on five machines and the optimizer determines that data would be better aggregated into three partitions.

Vertex scheduling is rather sophisticated and takes different factors into consideration when deciding which vertices to run next, and on which machine to place such vertices (e.g., data locality, average execution time and memory consumption). Additionally, the nature of the graph imposes some natural scheduling constraints. A vertex can only start when all its inputs have finished processing. For instance, in Figure 1(d), any SV2 vertex can only begin after all SV1 vertices finish. This is required not only to avoid producers and consumers running concurrently, but also because we need to simultaneously Sort-Merge SV2's inputs.

2.4.1 Aggregation Trees

Vertex scheduling attempts to minimize the overall job latency. One important aspect of achieving this goal is to reduce recovery costs due to failures. Consider a merge operator that consumes data from a very large number of partition vertices (i.e., suppose in Figure 1(d) that there are a thousand SV1 partition vertices connecting to each one of the three SV2 vertices). The chance of a random failure in SV2 while reading SV1 outputs increases with the number of input connections. Moreover, any such failure causes the whole SV2 vertex instance to restart, wasting partial work. To alleviate this problem, aggregates are typically done using aggregation trees, which can be seen as checkpoints during aggregation. When the number of input connections to an aggregate vertex is beyond a certain limit, we introduce partial aggregate vertices that operate over fragments of the input partition vertices. This works well for queries where merge operations interact with algebraic partial aggregation, which reduces the input size and can be applied multiple times at different levels without changing query correctness. We can then aggregate the inputs within the same rack before sending them out, reducing the overall network traffic among racks. As an example, if the threshold is 250 input connections and we have a thousand SV1 instances in Figure 1(d), for each one of the three SV2 instances we would introduce four partial aggregate vertices, each working on one fourth of the SV1 instances, based on the network topology.
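The arithmetic in this example generalizes to a ceiling division; the one-line C# helper below is an illustrative sketch of our own, not system code:

public static class AggregationTrees
{
    // Number of partial aggregate vertices per merge vertex, e.g.,
    // PartialAggregates(1000, 250) == 4 as in the example above.
    public static int PartialAggregates(int inputVertices, int fanInThreshold)
        => (inputVertices + fanInThreshold - 1) / fanInThreshold;
}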

¹To simplify the presentation, we do not discuss mechanisms that dynamically expand or contract the graph at runtime.

2.5 Structured Streams

In Scope, structured data can be efficiently stored as structured streams. Like tables in traditional databases, a structured stream has a well-defined schema that every record follows. Additionally, structured streams can be directly stored in a partitioned way, which can be either hash- or range-based over a set of columns. Data in a partition is typically processed together (i.e., a partition represents a computation unit). Each partition may contain a group of extents, which is the unit of storage in Scope. In the example of Figure 1(c), data is read from a structured stream input.ss, which is not partitioned by nGram (and therefore a repartition operator is needed). If the input structured stream were partitioned by nGram already, the resulting plan would consist of a single computation unit SV1, in which the final partitioning operator is replaced by the output operator. Relying on pre-partitioned data significantly reduces latency by removing both the data exchange and the superfluous global aggregate operators.

2.5.1 Indexes for Random Access

Within each partition, a local sorting order is maintained through a B+-tree index. This organization not only allows sorted access to the content of a partition, but also enables fast key lookup on a prefix of the sorting keys. Such support is very useful for queries that select only a small portion of the underlying data, and also for more sophisticated strategies such as index-based joins. In our example, if input.ss were partitioned and sorted by nGram, we could remove not only SV2 as explained earlier, but also the sort operator in SV1, with an additional improvement in performance.

2.5.2 Data Affinity

Scope does not require all the data that belongs to a partition to be stored on a single machine. Instead, Scope attempts to store all the data in a partition close together by utilizing store affinity. Every extent has an optional affinity id. All the extents with the same affinity id belong to an affinity group. The system tries to place all the extents of an affinity group on the same machine unless the machine is already overloaded. In this case, the extents are placed in the same rack (or a close rack if the rack itself is overloaded). Each partition of a structured stream is assigned an affinity id. As extents are created within the partition, they get assigned the same affinity id, suggesting that they should be stored together. Processing a partition can then be done efficiently either on a single machine or within a rack. Store affinity is a very powerful mechanism to achieve maximum data locality without sacrificing uniform data distribution.

The store affinity functionality can also be used to affinitize the partitioning of an output stream with that of a referenced stream. This causes the output stream to mirror the partitioning choices (i.e., partitioning function and number of buckets) of the referenced stream. Additionally, each partition in the output stream uses the affinity id of the corresponding partition in the referenced stream. Therefore, the two streams not only are partitioned in the same way, but their partitions are physically placed close to each other in the cluster. This layout significantly improves parallel join performance, as data need not be transferred across the network.

3. PARTIAL DATA REPARTITIONING

As described in Section 2.3, data partitioning typically consists of one or two physical operators: a partition operator, which splits its input into local partitions, and/or a merge operator, which collects and merges the local partitions that belong to the corresponding output partition. In the absence of additional information, as illustrated in Figure 2, every merge vertex needs to connect and read data from every partition vertex. As we discuss in this section, however, by carefully defining the partition scheme of the exchange operator (e.g., number of partitions and partition boundaries), we can guarantee that certain local partitions will be empty. Additionally, we can influence how much data has to be transferred, and to which destination.

This general approach has several advantages during execution. First, by carefully defining partition boundaries, we can drastically reduce data transfer between partition and merge vertices by taking data locality into account. Second, partition vertices need to reserve fewer memory buffers and less storage due to local partitions that are guaranteed to be empty. Third, the job manager does not need to maintain explicit connection state between partition and merge vertices for which the local partitions are empty, which reduces the footprint of the scheduler. Fourth, merge vertices do not have to wait for all partition vertices to finish executing, but only for those that actually contribute to the corresponding partition. This is important because a single partition outlier can delay all merge vertices from starting, increasing overall latency. Finally, fewer input connections to merge vertices reduce the chance of failures, thus requiring fewer intermediate aggregates (see Section 2.4.1), which improves overall performance.

In this section we explore efficient alternatives to perform data repartitioning that leverage properties of the input data to be repartitioned. Depending on the partitioning scheme, we discuss how to construct and implement data exchange operators that are defined over effective partitioning boundaries and only require a subset of connections between partition and merge vertices. Additionally, we discuss how to integrate these alternatives during query optimization. To illustrate the different partitioning techniques, we use the simple script below:

SELECT a, UDAgg(b) AS aggB
FROM SSTREAM "input.ss"
GROUP BY a;

OUTPUT TO SSTREAM "output.ss"
[HASH | RANGE] CLUSTERED BY a;

The input in the script is a structured stream distributed in the cluster. The script performs a user-defined aggregation UDAgg on column a and outputs the result into another structured stream, either hash- or range-partitioned by a.

3.1 Hash-based Partitioning

We consider the case in which the script writes a structured stream hash-partitioned by column a. We assume hash partitioning is done by first applying a hash function to the partitioning columns and then taking the hash value modulo the number of output partitions to generate the partition number to which the row goes. As long as the hash function and the number of output partitions are fixed, hash partitioning is deterministic.

Figure 3: Execution Plan for a Simple Script. SV1: Get(input.ss) → Repartition(a); SV2: Stream Agg({a}, UDAgg(b)) → Output(output.txt).

Figure 3 shows an execution plan for the script. Assuming that the input is not already partitioned by column a, the plan reads the input in parallel and hash-repartitions the data on column a (SV1 in the figure), and then merges the partitions, performs the user-defined aggregate, and outputs the result (SV2 in the figure). Note that in our example UDAgg is neither associative nor commutative, so we cannot use local/global aggregates as in Figure 1(c). If the input were already partitioned by a, a different execution plan without repartitioning would be preferable (i.e., read each partition, perform the aggregation, and write results in parallel).

Figure 4: Partial Hash Repartitioning. SV1: Get(input.ss) → Repartition(a, 200); SV2: Stream Agg({a}, UDAgg(b)) → Repartition(a, 50); SV3: Output(output.txt).

Suppose that the input is indeed hash-partitioned by column a into 100 partitions. However, the user-defined aggregate is very expensive, so we choose to repartition the input into 200 buckets. Additionally, the output of the aggregate is smaller than the input, so we choose to additionally repartition it into 50 buckets before the final output, in order to avoid writing too many small fragments. The resulting execution plan is shown in Figure 4. In principle, the execution plan looks rather expensive due to two full repartition operators. However, in this scenario there are important optimization opportunities. In fact, repartitioning a 100-way partitioned input into 200 partitions (by the same column) can be done by locally splitting each of the original partitions into two, without any network data transfer. Also, repartitioning a 200-way input into 50 partitions can be done by partially merging the input partitions. We next formalize this approach, and characterize when it is effective.

3.1.1 Determining Data Flow Connections

Suppose that we want to hash partition an input T into po partitions, and T is already hash-partitioned into pi partitions by the same columns. The naive execution plan contains pi vertices P0, ..., Ppi−1, each one partitioning its input into po local partitions, followed by po merge vertices M0, ..., Mpo−1, each one reading and merging the i-th local partition from each of the pi partition vertices. Interestingly, for certain values of pi and po, some local partitions in Pi vertices would always be empty and do not need to be read by Mj vertices.

Example 1. Suppose that pi = 4 and po = 2 (i.e., we want to partition 2-ways an input that is already 4-way partitioned). Every row in P0 satisfies h(C) ≡ 0 mod 4, where h is the hash function and C are the partitioning columns. Figure 5(a) shows the default partitioning strategy, which connects every input vertex with every output vertex. In this case, we know that h(C) ≡ 0 mod 2 as well, and therefore P0 would never generate a row satisfying h(C) ≡ 1 mod 2. Thus, M1 does not need to read the empty local partition produced by P0. In general, M0 only reads from P0 and P2, and M1 from P1 and P3. Figure 5(b) shows the refined merge graph. A similar strategy can be applied when pi = 2 and po = 4. Figure 5(c) shows the refined partitioning graph.

Figure 5: Different Hash Partitioning Strategies. (a) Full partitioning; (b) partial merge; (c) partial partitioning.

In general, merge vertex Mi (0 ≤ i < po) needs to read from partition vertex Pj (0 ≤ j < pi) if there might be a row in Pj whose hash value modulo po is i. In other words, if there exists an integer k such that k ≡ j mod pi and k ≡ i mod po. By definition, this implies that there are integers k1 and k2 such that k = k1·po + i and k = k2·pi + j, and therefore k1·po + i = k2·pi + j, or

po·k1 + (−pi)·k2 = j − i

This is a linear diophantine equation of the form a·x + b·y = c, which has integer (x, y) solutions if and only if c is a multiple of the greatest common divisor of a and b. In our case, therefore, Mi needs to read from Pj if and only if (j − i) is a multiple of gcd(po, −pi), or, more concisely, if and only if

i ≡ j mod gcd(pi, po)

If pi and po are coprime, then gcd(pi, po) = 1 and the optimized technique degenerates to the traditional one (i.e., we need to connect every partition and merge vertex instance).
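The connectivity rule is cheap to evaluate. The C# sketch below (illustrative code of our own, not the optimizer's implementation) enumerates the required connections and reproduces Example 1:

using System;

public static class HashRepartitionConnectivity
{
    static int Gcd(int a, int b) => b == 0 ? a : Gcd(b, a % b);

    // Merge vertex Mi must read from partition vertex Pj
    // iff i ≡ j (mod gcd(pi, po)).
    public static bool MustConnect(int i, int j, int pi, int po)
    {
        int g = Gcd(pi, po);
        return ((i - j) % g + g) % g == 0; // normalized modulus, safe when i < j
    }

    public static void Main()
    {
        // Example 1: pi = 4, po = 2 gives gcd = 2, so M0 reads {P0, P2}
        // and M1 reads {P1, P3}.
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 4; j++)
                if (MustConnect(i, j, pi: 4, po: 2))
                    Console.WriteLine($"M{i} <- P{j}");
    }
}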

3.2 Range-based Partitioning

The ideas in the previous section can also be applied to range partitioning scenarios. Consider again the script in the previous section, now writing a structured stream range-partitioned by column a, and suppose that the input is range-partitioned by columns (a, b). Note that this property does not imply that the data is partitioned by a alone, since two rows that share the same a value might be in different partitions due to varying b values. In this case, the same plan of Figure 3 can certainly be used, with a full repartitioning on a. However, an input that is range-partitioned by (a, b) can be repartitioned by a in a cheaper way, by locally splitting and merging the original partitions, as shown next.

Figure 6: Refining Range Partitions from (a, b) to a. Input partitions: P1 = [(1,A)…(1,C)), P2 = [(1,C)…(2,E)), P3 = [(2,E)…max); output partitions: P′1 = [1…2), P′2 = [2…3), P′3 = [3…max).

Example 2. Consider the data set at the left of Figure 6, which is range-partitioned by columns (a, b) into three partitions. Partition P2, for instance, contains all rows with values in the interval [(1,B), (2,D)) (note that the interval is closed on the left and open on the right). As shown in the figure, we can repartition this data set by column a without connecting each partition output with each merge input. The precise connectivity map can be statically computed at compile time given the original partitioning boundaries. Note that the output partition P′1 receives data from input partitions P1 and P2. While P1 is fully contained in P′1, we need to filter P2, keeping only rows with a < 2, to avoid incorrect results. Compared to the initial physical partitions Pi, we call the P′i logical partitions, each of which conceptually reads one or more physical partitions with optional filtering predicates. The end result is that P′i is range-partitioned on column a.

In general, partial range-based repartitioning can be applied whenever the input and output partition schemes share a common column prefix (otherwise, there is no choice but to connect each partition vertex with every merge vertex). The general approach consists of two steps. First, we need to determine the range boundaries for each output partition. With this information, we can then generate code that determines the output partition for each incoming row, and determine which partition and merge vertices are connected. We next discuss these steps in detail.

3.2.1 Determining Partitioning Boundaries

The main goal of query optimization is to minimize the resulting job latency. In the context of range partitioning, and especially with respect to boundary determination, there are two main aspects that contribute to overall latency. First, the resulting partitions need to be evenly distributed. Any skewed partition is likely to introduce outliers during execution, as a partition that is significantly larger than the average increases the latency of the overall job (and likely propagates skewed partitions to subsequent execution vertices). Second, the cost of the repartitioning operator itself is important, and this cost is directly proportional to the amount of network data transfer. As a consequence, we should try to align input and output partition boundaries as much as possible. This way, we avoid splitting input partitions into multiple large fragments that are sent across the network.

Suppose that we want to repartition data on columns C, where each partition is roughly of size T. When the input is already partitioned by columns C′ (where C and C′ share a common prefix), the input partitioning metadata itself can be viewed as a (coarse) histogram B over columns C′, with each partition representing a bucket in the histogram². We assume that buckets maintain the number of rows that are included in their boundaries, that bucket boundaries are closed on the left and open on the right, and that the last range on the right ends in a special maximum value that is larger than any other valid value. The information from such histograms can help the system choose partition boundaries.

Some very common column types (e.g., string or binary columns) are not amenable to interpolation, which limits what we can infer about value distributions inside a histogram bucket. Specifically, in this case we know all bucket boundaries but we cannot interpolate and generate intermediate values within input buckets. Under this natural restriction, Algorithm 1 describes a greedy approach to determine partition boundaries that are compatible with C and would result in partitions of size around T.

Algorithm 1: PartitionBoundaries(C, T, B)

Input: Columns C, Partition size T, Buckets B
Output: Partition boundaries P

/* Assume that C and B.cols share common prefix CP,
   and for each 1 < i < |B|: B[i-1].hi = B[i].lo */
/* Output is partitioned by CP, which implies C,
   and each partition size is around T */

CB = ∪i [ΠCP B[i].lo, ΠCP B[i].hi)   // project B on CP
idx = 0
while idx < |CB| do
    actLo = CB[idx].lo
    actSize = CB[idx].size
    idx++
    while idx < |CB| and (actLo = CB[idx].lo or
                          CB[idx].size / 2 < T - actSize) do
        actSize += CB[idx].size
        idx++
    end
    P = P ∪ [actLo, CB[idx-1].hi)
end
return P
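For concreteness, the following is a runnable C# rendering of Algorithm 1 under simplifying assumptions of our own: buckets have already been projected onto the common prefix CP and collapsed to (Lo, Hi, Size) triples over a single comparable column, with CB[i-1].Hi == CB[i].Lo.

using System.Collections.Generic;

public record Bucket(long Lo, long Hi, long Size);

public static class BoundaryPlanner
{
    public static List<(long Lo, long Hi)> PartitionBoundaries(
        IReadOnlyList<Bucket> cb, long targetSize)
    {
        var partitions = new List<(long Lo, long Hi)>();
        int idx = 0;
        while (idx < cb.Count)
        {
            long actLo = cb[idx].Lo;
            long actSize = cb[idx].Size;
            idx++;
            // Absorb buckets that share the partition's lower bound (possible
            // after projection onto CP) or whose inclusion keeps the partition
            // closer to the target size.
            while (idx < cb.Count &&
                   (cb[idx].Lo == actLo ||
                    cb[idx].Size / 2 < targetSize - actSize))
            {
                actSize += cb[idx].Size;
                idx++;
            }
            partitions.Add((actLo, cb[idx - 1].Hi));
        }
        return partitions;
    }
}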

First, we project the histogram buckets onto the common prefix between the input and output partitioning columns. Then, we walk the projected buckets and accumulate them into new output partitions. We keep accumulating buckets into a new partition as long as (i) the lower bound of the bucket is the same as the current lower bound of the partition (which can happen when we project the original bucket boundaries onto the common prefix), or (ii) the resulting partition stays as close as possible to the target size T. Note that due to bucket granularity and the need to collapse together buckets that share the same lower bounds, the resulting partitions might not exactly match the expected size T. This data skew is known at compile time and taken into account during optimization, as we discuss later.

In some cases, however, there is additional information that we can leverage. For instance, if the input data comes from a structured stream, the optional local B+-tree index provides a detailed distribution of values within each input partition. Also, there could be a histogram obtained from statistics on the relevant columns. Finally, for column types that are amenable to interpolation (e.g., integer or float), a uniform data distribution within each partition can be assumed. In these cases, we can additionally leverage such value distribution and further improve the quality of the resulting partitions by slightly extending Algorithm 1.

²Input partitioning metadata is obtained directly for base structured streams, and propagated throughout query plans analogously to other properties such as cardinality information.

Figure 7: Getting Partition Boundaries at Runtime. Source → StatCollector(a) → Coordinator → Repartition(a).


Partition boundaries can be chosen not only at compile time but also at runtime. If the optimizer detects potentially unbalanced partition boundaries, it also considers an alternative plan that uses a more expensive repartitioning operator at runtime but results in good-quality partition boundaries. Figure 7 shows how to achieve this goal by using additional computation stages. We first intercept the input rowset and compute a histogram on the partitioning columns (the StatCollector operator in the figure). This intermediate information is aggregated by a Coordinator, which obtains the global histogram for the partitioning columns. The coordinator then computes the global partition boundaries as before, and broadcasts them to the actual partition vertices, which additionally receive the input to be partitioned and are executed in the traditional way. The optimizer considers this alternative along with the ones defined in the previous section and chooses the one that is the cheapest in estimated cost, considering both the local partitioning cost and the increase in latency from unbalanced partitions.

3.2.2 Determining Data Flow Connections

Once the partition boundaries are calculated, it is straightforward to determine which output partition and input merge vertices should be connected to each other. Suppose that [POi.lo, POi.hi) is the i-th partition range boundary (i.e., it corresponds to the i-th merge vertex in the execution graph). We should connect this merge vertex with a partition vertex defined over the input range [PIj.lo, PIj.hi) whenever:

ΠCP [POi.lo, POi.hi) ∩ ΠCP [PIj.lo, PIj.hi) ≠ ∅

where CP is the longest common prefix between the output and input partitioning columns, and the projection of a range with columns C is defined as follows:

ΠCP [lo, hi) = [lo, hi) if CP = C, and [ΠCP(lo), ΠCP(hi)] otherwise

Additionally, we add a filter predicate that keeps incoming rows in the appropriate range unless we can guarantee that all input rows qualify, which happens whenever:

ΠCP [PIj.lo, PIj.hi) ⊆ [POi.lo, POi.hi)

The range predicate can be implemented by exploiting an index if available, or otherwise by scanning an input partition and removing extra rows on the fly. This choice is known at compile time and incorporated into the cost model.
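Both tests reduce to simple interval arithmetic once the ranges are projected onto CP. The C# sketch below is illustrative only: it assumes a hypothetical single-column integer prefix and, for simplicity, treats both projected ranges as half-open [lo, hi) intervals.

public static class RangeRepartitionConnectivity
{
    // Connect merge vertex i to partition vertex j iff the projected
    // ranges intersect.
    public static bool MustConnect((long Lo, long Hi) po, (long Lo, long Hi) pi)
        => po.Lo < pi.Hi && pi.Lo < po.Hi;

    // The filter predicate can be omitted iff the projected input range
    // is contained in the output range.
    public static bool CanSkipFilter((long Lo, long Hi) po, (long Lo, long Hi) pi)
        => po.Lo <= pi.Lo && pi.Hi <= po.Hi;
}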


3.3 Optimizer Integration

As discussed in Section 2, structured streams are partitioned when stored in the cluster. The structured stream partitioning scheme might or might not be compatible with that of subsequent repartition operators. Defining the partitioning scheme, as well as which columns to partition on, is part of the physical data design problem, which is crucial in traditional database systems and is also gaining importance in distributed computation engines. Partitioning choices are generally made based on usage patterns, as there is no optimal physical design independent of a workload.

In this section, we discuss how to integrate the different partition strategies into the optimization framework, particularly in the presence of structured streams, and identify some optimization opportunities. The optimizer considers all the alternatives in a uniform framework and chooses the optimal solution based on estimated costs. The optimizer search space increases by adding new partitioning strategies. The exact overhead heavily depends on queries and physical design but is reduced by effective cost-based pruning. Overall, we observed only a marginal increase in compilation time in the cluster due to these extensions. We also note that our environment has relaxed requirements on optimization time compared to traditional database systems, so a slight increase in optimization time is acceptable.

We enhance the traditional optimizer cost model [24] in two main directions. First, in addition to CPU and I/O cost, each operator estimates network transfer based on the amount of data to be shuffled. Second, we adjust for the impact of partition skew by explicitly charging each operator the cost of the largest partition it needs to process. We omit details due to space constraints.

3.3.1 General Repartitioning Opportunities

The Scope optimizer chooses to repartition data based on requirements from subsequent operators. For instance, a group-by operator would cause the optimizer to require its input to be partitioned by a subset of the grouping columns. To determine the number of partitions, the optimizer considers the total data size that needs to be processed, and also the cost to process each row. The goal of this estimation is to obtain partitions that are processed in a reasonable amount of time (partitions should not be too large, which increases recovery time, nor too small, which increases scheduling and vertex start-up overhead).

If the input is partitioned by a different but compatible set of columns, the optimizer considers partial partitioning strategies besides the full repartitioning alternative. Additionally, if the input is partitioned by the right columns but with a different number of partitions (or different ranges for range partitioning), the optimizer considers alternatives that, while not optimal with respect to partition size, might result in a faster repartitioning and cheaper overall plan cost. For instance, suppose that the optimizer determines that the number of partitions should be 100. However, the input is already hash-partitioned by the right columns into 45 partitions. The optimizer would also consider an alternative with 90 partitions, which would result in slightly larger partitions but a more efficient repartition operation.

3.3.2 Opportunities for N-ary Operators

Operators that receive multiple inputs, such as union and join variants, have an important requirement: all their inputs need to be partitioned in the same way to prevent wrong results. For instance, suppose we are joining two inputs R and S on columns (R.a, R.b, R.c) = (S.d, S.e, S.f). The result would be correct whenever we partition both inputs on any subset of these columns, as long as we use the same subset on each input (modulo join equality). Otherwise, rows that would join might end up in different partitions and we would miss rows in the result. We next discuss some optimization opportunities for n-ary operators:

Pushing partition schemes from one input to others. If a join input is already partitioned by a subset of the join columns, we attempt to use such a partitioning scheme for the other input. That way, if the other input is partitioned in a compatible way (e.g., same columns but different number of partitions for hash partitioning, or a common column prefix for range partitioning) we might be able to use a cheaper partial repartition operator on the remaining inputs, leading to a better plan overall.

Heuristic range partitioning. Suppose that all inputs of an n-ary operator are range-partitioned by the same columns, but with different ranges. If these inputs are defined over different domains of data, there might not be a single partitioning scheme among the inputs that can be pushed to the others without resulting in large partition skew. To address this scenario, the optimizer additionally derives a common partitioning scheme that takes into account the overall distribution of the inputs. We first collect all histogram buckets from each input, union them together, and use a slightly refined version of Algorithm 1 to obtain the overall required partition boundaries. Finally, we push this partitioning requirement to each input.

Broadcast optimization. When one side of the join is very large and the other is small, it is common to rely on a broadcast join variant, where the small input is sent to all partitions of the large input. As the size of the small input increases, this strategy loses its competitive advantage. Interestingly enough, if both inputs are partitioned in such a way that there is a common prefix between the partitioning and join columns, we can improve the traditional broadcast join optimization and handle larger "small inputs" as follows. We project the partition boundaries from the big input on the common prefix as the required partition boundaries for the small input. Rather than sending the whole small input to each partition of the large input, we determine, for each partition p in the large input, all the partitions in the small input that have rows which might join with rows in p. In this situation, we might send data from the small input multiple times to different partitions of the big input, but we avoid the expensive full repartitioning altogether.

3.3.3 Eliminating Repartitioning

An extreme case of repartitioning optimization happens when the optimizer is able to remove a partitioning operator completely by leveraging additional data properties. Suppose that we want to partition the input data on column a into 100 partitions, and the input is already hash-partitioned by column b into the same number of partitions. If there is a functional dependency b → a, then we know that the input is also partitioned by column a and we can completely eliminate the repartition operation. As another example, if the input is partitioned by (a, b) but we can determine that b is a constant (e.g., perhaps due to some filter operation), the input is effectively partitioned by a.

It is crucial to note, however, that while these schemes correctly partition data by column a, they do not produce the same partitions that we would obtain by directly hash partitioning by a alone. To illustrate this point, suppose that a = b + 1, and so b → a. Every row belongs to partition h(b) mod 100, which is not the same as h(a) mod 100 = h(b+1) mod 100. In some scenarios we can leverage this partitioning scheme (e.g., if the partitions are consumed by a group-by operator), but not in others (e.g., if the partitions are consumed by an n-ary operator that requires the same partitioning scheme on all its inputs). The optimizer is aware of the context of each partitioning request and can determine when fully eliminating a partitioning operation based on functional dependencies or constraints is possible.
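A toy C# demonstration of this caveat, with hypothetical values of our own choosing (here a = b + 1, so rows with equal a also agree on b and therefore share a partition):

using System;

public static class FdPartitioningDemo
{
    static int Partition(int v, int n) => (v.GetHashCode() & int.MaxValue) % n;

    public static void Main()
    {
        const int N = 100;
        int b = 41, a = b + 1;
        Console.WriteLine(Partition(b, N)); // where the row actually lives
        Console.WriteLine(Partition(a, N)); // where hash-by-a would place it
        // The two numbers generally differ: a group-by on column a is still
        // correct, but an n-ary operator expecting h(a) mod N placement on
        // all of its inputs is not served by this layout.
    }
}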

3.3.4 Other Considerations

Skewed partitions in input structured streams or intermediate results during query execution can have a very negative impact on query performance. A partition that is significantly larger than average increases the overall job latency. On the other hand, too many small partitions may increase the number of vertices and the overall scheduling overhead. Similar to Algorithm 1, we consider refining partition boundaries both at compilation time and at runtime. In this case, the refined partitions are still partitioned by the same columns but with a more even distribution.

Finally, Algorithm 1 assumes the input and output partition columns share a common prefix. The same concept applies to string prefix optimization. That is, if the input data is range-partitioned by a string column s, we can partially partition the data by a prefix of column s.

4. INDEX-BASED PARTITIONING

Consider again the example of Figure 1, but suppose that there are several terabytes of input data (and therefore several thousand SV1 vertex instances). In that case, we need to fully repartition the input into, say, ten thousand partitions (i.e., ten thousand SV2 vertices). Each SV1 vertex needs to partition its output into ten thousand different streams, which are later read and merged by SV2 vertices. When the number of partitions is very large, this procedure presents some scalability challenges. In this section we introduce an alternative way to repartition data that leverages structured stream technology (see Section 2.5 for details), has similarities with the duality between sorting and hashing that was discussed in [10], and can be applied to both range- and hash-based partitioning schemes.

In the example in Figure 1, rows arrive at the repartition operator sorted by nGram. However, the hash function effectively randomizes the partition that each row belongs to. To avoid constant random writes to local storage for each partitioned value, we need to keep one buffer in memory for each partition and flush the buffer when it gets full. For a large number of partitions, the combined memory requirements for such buffers start interfering with those of other operators. More importantly, the underlying store is optimized for large volumes of data and sequential access. This goal is achieved by reserving a minimum amount of sequential space on disk for each stream (in the order of megabytes) and making it the append and compression unit. This means that we would have to reserve a very large amount of space in the local disk volume to support all the local partitions, which not only decreases performance, but also wastes a significant amount of storage for the duration of the job.

An alternative approach, which we call index-based partitioning, can be used to significantly improve I/O efficiency. Suppose that we want to repartition a rowset by column a (Figure 8(a) shows the traditional way of implementing this operation). For each partition vertex, instead of writing to a different stream per partition, we change the operation as follows (see Figure 8(b)). First, we introduce a new computed column, called pa, to the input rowset that needs to be partitioned. The value of this column is equal to p(a), the partition number that would be assigned to the corresponding row with value a (e.g., for hash-based partitioning into N partitions with hash function h, p(a) = h(a) mod N, and for range-based partitioning, p(a) is the index of the bucket that contains value a).

Then, we insert a stable sort operator on column pa. A stable sort operator sorts the input rowset by pa and keeps the original row order in case of ties. Therefore, the output of such a sort contains all input rows grouped together by the partition they belong to, and within each partition the rows satisfy an ordering consistent with the original one. This is very important, as the optimizer can continue to exploit the original input ordering for the following operations. In other words, this approach produces the same result as if we had actually created all partitions in the default way and concatenated them together in order of partition number.

Finally, we store this "partitioned" result locally as a structured stream, with a B+-tree index on column pa. The merge operation stays the same, except that it reads data from this structured stream by key value: instead of locating the corresponding file and streaming all the rows, we perform a key lookup of the partition number on the B+-tree and return all relevant rows.

[Figure 8: Scalable Index-based Partitioning. (a) Traditional partitioning: a Partition(a) operator writes one stream per partition, read by downstream Merge vertices. (b) Indexed partitioning: Compute pa = p(a), then Stable Sort(pa), then a single Local Output, read by Merge vertices via key lookup.]
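The following is a minimal, self-contained sketch of this pipeline (our own Python illustration; the function names are hypothetical, and a dictionary of row offsets stands in for the B+-tree on pa):

from itertools import groupby

def index_based_partition(rows, p, key=lambda row: row[0]):
    # Step 1: attach the computed column pa = p(a) to every row.
    tagged = [(p(key(row)), row) for row in rows]
    # Step 2: stable sort by pa alone; Python's sort is stable, so the
    # original order survives within each partition.
    tagged.sort(key=lambda t: t[0])
    # Step 3: one local output plus an index from partition number to
    # (start, end) offsets, standing in for the B+-tree over pa.
    out, index = [], {}
    for pa, group in groupby(tagged, key=lambda t: t[0]):
        start = len(out)
        out.extend(row for _, row in group)
        index[pa] = (start, len(out))
    return out, index

def merge_read(local_output, index, pa):
    # What a merge vertex does: a key lookup instead of reading a
    # dedicated stream per partition.
    start, end = index.get(pa, (0, 0))
    return local_output[start:end]

# Example: 4-way hash partitioning on an integer column a.
rows = [(a, "payload-%d" % a) for a in [7, 3, 11, 3, 2, 7]]
out, idx = index_based_partition(rows, p=lambda a: a % 4)
print(merge_read(out, idx, pa=3))   # a in {7, 3, 11}, original order kept

Note how the stable sort is what preserves the original ordering within each partition, which the optimizer can keep exploiting downstream.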

The index-based approach as discussed above operates outside of the optimizer. That is, after the optimizer determines that repartitioning is required, we can decide in a post-processing step whether to implement the operator in the traditional or the index-based manner. This choice is necessary because either alternative might be the faster one. On the one hand, index-based partitioning improves I/O efficiency for partition operators. At the same time, it performs B+-tree lookups to return the corresponding partitions and, more importantly, it introduces an additional stable sort operation. This is especially wasteful when the stable sort is placed directly or indirectly on top of another sort operation, due to the following equivalence:

Stable-Sort_b(Sort_a(R)) ≡ Sort_{b,a}(R)
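The equivalence can be checked mechanically; this illustrative snippet (ours, not from the Scope implementation) verifies it on random data, exploiting the fact that Python's sort is stable:

import random

rows = [(random.randrange(5), random.randrange(5)) for _ in range(1000)]

# Stable-Sort_b(Sort_a(R)): sort by column a, then stably re-sort by
# column b only; ties on b keep the a-sorted order.
lhs = sorted(sorted(rows, key=lambda r: r[0]), key=lambda r: r[1])

# Sort_{b,a}(R): one sort on the compound key (b, a).
rhs = sorted(rows, key=lambda r: (r[1], r[0]))

assert lhs == rhs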

We incorporate index-based partitioning into the Scope optimizer in a principled way. In particular, we enhance the set of enforcer rules [9] in the Scope optimizer. Every time we enforce a partitioning request, we generate both a traditional repartitioning operator and an index-based alternative. The index-based alternative consists of a compute scalar operator for a new column p, as in Figure 8(b), directly below the index-based partitioning operator (note that we do not explicitly insert a stable sort on p as in the figure). Instead, the index-based partitioning operator requires its input to be sorted by (p, X), where X is the sort requirement (if any) of the original partitioning operator. We then let the optimizer explore different alternatives, relying on transformation rules that, for instance, exploit the sort equivalence described above, take advantage of functional dependencies (since p is functionally determined by the partitioning columns), and reorder operators (e.g., pushing the compute scalar operator and corresponding sorts down the tree). This approach lets the optimizer choose, in a cost-based manner, the most efficient execution plan, including both alternatives in Figure 8 as well as other options.
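Schematically, the enforcer generates two alternatives for each partitioning request, each with its own requirement on the child expression. The sketch below is a loose Python illustration of that idea (all names are hypothetical; the actual optimizer manipulates Cascades-style groups and requirements [9]):

from dataclasses import dataclass, field

@dataclass
class Req:
    partitioning: tuple = ()   # required partitioning columns
    sorting: tuple = ()        # required sort columns

@dataclass
class Alt:
    name: str
    child_req: Req
    below: list = field(default_factory=list)  # operators inserted below

def enforce_partitioning(columns, sort_req=()):
    # Alternative 1: traditional repartitioning; the child only has to
    # satisfy the original sort requirement X, if any.
    traditional = Alt("Repartition(%s)" % ",".join(columns),
                      child_req=Req(sorting=tuple(sort_req)))
    # Alternative 2: index-based partitioning; a compute scalar adds
    # p = p(columns) below it, and the input must be sorted by (p, X).
    indexed = Alt("IndexPartition(p)",
                  child_req=Req(sorting=("p",) + tuple(sort_req)),
                  below=["ComputeScalar(p = p(%s))" % ",".join(columns)])
    return [traditional, indexed]

for alt in enforce_partitioning(("domain", "host"), sort_req=("host",)):
    print(alt.name, "-> child sorted by", alt.child_req.sorting)

The (p, X) requirement is what lets a later transformation absorb the stable sort into an existing sort via the equivalence above, instead of paying for it separately.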

5. EXPERIMENTAL EVALUATION

We implemented all the advanced partitioning techniques and their optimization in Scope, which is deployed on production clusters consisting of tens of thousands of machines at Microsoft. Tens of thousands of jobs are executed daily, reading and writing tens of petabytes of data in total and powering different online services.

In this section, we perform an experimental evaluation of the different techniques and report results from a small test cluster of a hundred machines. Each machine has two six-core AMD Opteron processors running at 1.8GHz, 24 GB of DRAM, and four 1TB SATA disks. All machines run Windows Server 2008 R2 Enterprise X64 Edition. The results correlate very well with observations we made on production clusters consisting of thousands of machines. Because the underlying data and business information are confidential, we report performance trends rather than actual numbers for all the experiments.

5.1 Partial Partitioning

In the context of web data analytics, it is common to calculate aggregates of raw data at different granularities (e.g., averages per domain, per host, and so on). Consider a large structured stream that contains information about web pages. A typical schema for this data set is shown in Table 1, where each page URL is decomposed into the (reversed) domain, (reversed) host, top-level-directory, and URL-suffix columns. The data is range-partitioned and sorted by (domain, host, top-level-directory) in a structured stream, so that related pages are clustered together. Due to very skewed domain distributions, we cannot partition the data by domain alone, or we would get partitions with a very large number of rows (the same trend is observed when partitioning by (domain, host)).

Consider now a query that groups the input table by domain and host values, and performs several user-defined aggregates over the remaining columns3:

SELECT domain, host, Agg(col1), ..., Agg(coln)
FROM SSTREAM "WebPages.ss"
GROUP BY domain, host

3 This is a simplified version of a family of pipelines that are very common in the context of web experiments. A large fraction of our production cluster utilization goes into executing pipelines that use this pattern in one way or another.

Since the input is partitioned by a superset of the grouping columns, we need to repartition the input to the group-by operator. Figure 9 shows the results of evaluating this query under two alternative execution plans: one that performs a traditional full repartition on columns (domain, host), and another that applies the partial repartitioning approach discussed in Section 3.2. We can see in Figure 9(a) that partial repartitioning significantly improves query latency, with a 3.5x speedup. We also calculate the total work for each query by adding up the latencies of all vertices. As shown in Figure 9(b), partial repartitioning reduces the total work by over 7 times. The improvement in latency is smaller than that in total work because the strategy that uses partial repartitioning sometimes requires reading slightly larger partitions due to boundary alignment and overlap (see Algorithm 1).

To understand the performance gain, Figure 9(c) shows the fraction of I/O incurred by the new alternative, compared with traditional full repartitioning, for data writes and data reads, respectively. The traditional approach has to read and write the whole data set when performing a full repartitioning, and then read the repartitioned data again. In contrast, the new approach avoids an intermediate write/read step and directly reads the right partitions, using an appropriate filter that is evaluated efficiently by exploiting the underlying index. Besides the savings in I/O, the new approach has significantly fewer vertices (e.g., no explicit partition and merge vertices) and does not have the heavily connected scheduling graph between the partition and merge vertices of the full repartitioning case, both of which greatly improve scheduling efficiency at runtime.

5.2 Optimizing N-ary Operators

We next show a more complex example of partial range partitioning in the case of an N-ary operation, where obtaining the right partition boundaries is crucial for performance. This scenario is similar to the previous one, with an important difference. Rather than having a single input stream, we have four sources of data, each one obtained in a different period of time over a different domain. Although each data source is range-partitioned by the same columns, the corresponding partition boundaries are different, because the input sources are inherently biased in different ways. The modified query, which unions together all input sources before aggregating the result, is shown below:

SELECT domain, host, Agg(col1), ..., Agg(coln)
FROM (
    SELECT * FROM T1 UNION ALL
    SELECT * FROM T2 UNION ALL
    SELECT * FROM T3 UNION ALL
    SELECT * FROM T4
)
GROUP BY domain, host

Figure 10 shows three different execution alternatives: partial repartitioning, or PR (which uses the technique described in Section 3.3.2 to obtain a global set of bucket boundaries); full repartitioning, or FR (which performs the traditional partitioning); and partial repartitioning without boundary merge, or PRN (which, instead of considering all inputs when choosing partitioning boundaries, uses the partitioning boundaries of one input to partially repartition the others).

Figures 10(a) and 10(b) show that PR improves both the latency and total work of the query by a 6x factor compared to FR.


Domain         | Host    | Top-level-directory | URL-suffix                                         | Data
com.microsoft  | www     | download/           | en/default.aspx?WT.mc_id=MSCOM_HP_US_Nav_Downloads | ...
com.microsoft  | windows | products/           | home                                               | ...
com.bing       | www     | videos/             | browse?FORM=Z9LH6                                  | ...
...            | ...     | ...                 | ...                                                | ...

Table 1: Sample Information for a Web-pages Structured Stream

[Figure 9: An Aggregation Query over Web-pages. (a) Latency and (b) Total Work, in time units, comparing Partial Repartitioning and Full Repartitioning; (c) Data Written and Data Read as I/O relative to Full Repartitioning.]

[Figure 10: A Union-All Query on Web-pages. (a) Latency and (b) Total Work, in time units, for Partial Repartitioning, Full Repartitioning, and Partial Repartitioning/No Boundary Merge; (c) Total Data I/O relative to Full Repartitioning.]

Specifically, Figure 10(c) shows that PR incurs around 20% of the total I/O of FR. The ratio is smaller than 50% because the FR alternative needs to perform intermediate aggregation stages, as described in Section 2.4.1.

It is also interesting to discuss the relative performance of

the PRN strategy. Although the total I/O for PRN is less than 80% of that of FR, the total work and latency are much worse for PRN. In particular, the latency of PRN is 7 times longer than that of FR. This non-intuitive behavior is explained by looking more carefully at the input data. When reusing the partition boundaries of one input to repartition and union all inputs together, we introduce a big source of data skew: some partitions are much smaller and others much larger than the average. While this does not significantly affect the total amount of work done, it substantially increases latency. This illustrates the importance of optimizing partitioning boundaries in order to improve overall query performance.

5.3 Index-based Partitioning

To measure the benefits of index-based partitioning, we executed a script that requires repartitioning on a single column, and varied the number of output partitions from 250 to 4,000. Figure 11 shows the average latency of a partitioning vertex, normalized to the time it takes the traditional technique to do a 250-way repartitioning. We can see in the figure that the performance of traditional partitioning degrades visibly as the number of partitions increases, due to increased random I/Os resulting in poor I/O performance. Not shown in the figure, but also important, is the fact that the amount of storage wasted on reserved but unused disk space also increases with the number of partitions. On the other hand, index-based partitioning has a longer, fixed start-up cost due to the additional sort by partition id, and it is worse than the traditional approach for fewer than 500 partitions. However, the performance of the index-based approach is almost independent of the number of partitions (it increases slightly when adding more partitions due to a larger index structure in the structured stream). Index-based partitioning achieves parity with the traditional approach at around 500 partitions and is already 1.6x faster when using 4,000 partitions.

[Figure 11: Partitioning Scalability. Normalized average latency per partition vertex as the number of partitions grows from 250 to 4,000, for Original Repartition and Index-based Repartition.]

6. RELATED WORK

The last decade witnessed the emergence of various solutions for massive distributed data storage and computation. Distributed file systems, such as the Google File System [7], the Hadoop Distributed File System [1], and Microsoft Cosmos [3], provide scalable and fault-tolerant storage solutions. Google's MapReduce [5] provides a simple abstraction of common group-by-aggregation operations, where map functions correspond to groupings and reduce functions correspond to aggregations. Microsoft's Dryad [12] provides additional flexibility: a distributed computation job is represented as a dataflow graph, and scripting languages make it easy to compose distributed dataflow programs. These programming models help developers write distributed applications, but they require dealing with implementation details to achieve good performance.

To tackle these shortcomings, high-level programming languages plus conceptual data models were recently proposed, including Jaql [2], Scope [3, 24], Tenzing [4], Dremel [15], Pig [18], Hive [21], and DryadLINQ [23]. Regardless of the language differences, their declarative nature hides system complexities from users, and they can benefit from an optimizer that generates efficient execution plans.

Scope relies on a cost-based query optimizer [24] that

considers performance tradeoffs of the entire system, including language, runtime, and distributed store. It leverages database optimization techniques and also incorporates unique requirements derived from the context of distributed query processing. Scope also supports data in both unstructured and structured formats. Rich structural properties and access methods from structured streams provide many unique opportunities for efficient physical data design and distributed query processing. Previous work [24] mentioned "refined partitioning" and "refined merge" but did not offer details. To the best of our knowledge, this is the first paper to discuss advanced partitioning strategies based on input data properties and index strategies, and to fully integrate the reasoning into a uniform optimization framework.

Many of the core optimization techniques that the Scope

optimizer implements originated from early research in parallel databases [6, 13] and traditional databases [11, 14, 16, 17, 19, 20, 22]. The Scope optimizer enhances previous work in several ways, such as by fully integrating parallel plan optimization and reasoning with more complex property derivation relying on structural data properties. Our work in this paper enables the optimizer to consider additional partitioning strategies and come up with more efficient query execution plans.

7. CONCLUSION

Massive data analysis in cloud-scale data centers plays a crucial role in making critical business decisions and improving quality of service. High-level scripting languages free users from understanding various system trade-offs and complexities, support a transparent abstraction of the underlying system, and provide the system with great opportunities, as well as challenges, for query optimization. Data shuffling is the most expensive operation in such an environment, and its judicious placement and implementation play a vital role in the effectiveness and efficiency of cloud-scale query execution. We describe several advanced partitioning techniques that significantly improve data shuffling efficiency, and we integrate such complex reasoning into the query optimizer to generate much more efficient query plans. The system intelligently exploits input data properties and performs partial partitioning, moving only a small subset of the input data whenever possible. A novel index-based partitioning strategy also allows the system to efficiently support massive data partitioning operations that generate thousands of partitions. These techniques are incorporated in Scope, running over clusters of tens of thousands of machines, and have proven effective, greatly improving query performance for a wide range of real-world jobs.

8. REFERENCES

[1] Apache. Hadoop. http://hadoop.apache.org/.

[2] K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Eltabakh, C.-C. Kanne, F. Ozcan, and E. J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. In Proceedings of VLDB Conference, 2011.

[3] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. In Proceedings of VLDB Conference, 2008.

[4] B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing: A SQL implementation on the MapReduce framework. In Proceedings of VLDB Conference, 2011.

[5] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI Conference, 2004.

[6] D. DeWitt and J. Gray. Parallel database systems: The future of high performance database processing. Communications of the ACM, 36(6), 1992.

[7] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proceedings of SOSP Conference, 2003.

[8] G. Graefe. Encapsulation of parallelism in the Volcano query processing system. In Proceedings of SIGMOD Conference, 1990.

[9] G. Graefe. The Cascades framework for query optimization. Data Engineering Bulletin, 18(3), 1995.

[10] G. Graefe, A. Linville, and L. Shapiro. Sort versus hash revisited. IEEE Transactions on Knowledge and Data Engineering, 6(6), 1994.

[11] G. Graefe and W. J. McKenna. The Volcano optimizer generator: Extensibility and efficient search. In Proceedings of ICDE Conference, 1993.

[12] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of EuroSys Conference, 2007.

[13] A. Jhingran, T. Malkemus, and S. Padmanabhan. Query optimization in DB2 parallel edition. Data Engineering Bulletin, 20(2), 1997.

[14] H. Lu. Query Processing in Parallel Relational Database Systems. IEEE Computer Society Press, 1994.

[15] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. In Proceedings of VLDB Conference, 2010.

[16] T. Neumann and G. Moerkotte. A combined framework for grouping and order optimization. In Proceedings of VLDB Conference, 2004.

[17] T. Neumann and G. Moerkotte. An efficient framework for order optimization. In Proceedings of ICDE Conference, 2004.

[18] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proceedings of SIGMOD Conference, 2008.

[19] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proceedings of SIGMOD Conference, 1979.

[20] D. Simmen, E. Shekita, and T. Malkemus. Fundamental techniques for order optimization. In Proceedings of SIGMOD Conference, 1996.

[21] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive: A petabyte scale data warehouse using Hadoop. In Proceedings of ICDE Conference, 2010.

[22] X. Wang and M. Cherniack. Avoiding sorting and grouping in processing queries. In Proceedings of VLDB Conference, 2003.

[23] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of OSDI Conference, 2008.

[24] J. Zhou, P.-A. Larson, and R. Chaiken. Incorporating partitioning and parallel plans into the SCOPE optimizer. In Proceedings of ICDE Conference, 2010.