
RI:Small:Collaborative Research: Distributed Inference Algorithms for Machine Learning

Alexander J. Smola and David G. Andersen

Our goal is to design, analyze, and implement novel inference algorithms that take advantage of emerging hardware paradigms to enable learning on and mining of massive datasets. In particular, we commit to: a) developing new optimization algorithms for machine learning; b) analyzing their convergence properties; and c) releasing open-source code for all our algorithms. We base this research upon four likely shifts in the designs of datacenters of the future:

1. Lightweight CPUs for power efficiency: such CPUs will become common in server farms, exploiting the up to 10x power-efficiency advantage of designs such as those found in mobile devices.

2. Heterogeneous architectures: homogeneous multi-core CPUs are already being replaced by combinations of special purpose units such as CPUs + GPGPUs (graphics processors).

3. Solid state storage: SSDs will become more prevalent in server centers, increasing the rate of random read operations by 1,000 to 10,000 times relative to hard drives.

4. High bisection bandwidth networks: these will increasingly replace traditional aggregation trees, enabled by merchant silicon and software-defined networking.

Intellectual Merit: Many learning algorithms fail to take advantage of these changes. Moreover, many excellent single-machine codes exist that would greatly benefit from parallelization, yet re-engineering each of them is infeasible. The main thrust of our proposed work is to develop a toolkit for designing (and retrofitting) the next generation of systems-aware, efficient, scalable inference algorithms. To keep the research relevant to practitioners, we will leverage collaborations with Intel, Calxeda, and Google to test our ideas on reference hardware and real-life datasets.

Our deliverables will lay the foundation for applying machine learning in server centers of the future. Towards this end, we will design and analyze parameter distribution algorithms, fault tolerant replication schemes, and asynchronous optimization algorithms. While some of these problems are understood in a piecemeal fashion in domains such as systems, networking, and databases, they are novel in the context of statistical data analysis. Our proposal, which is a collaborative effort between PIs with complementary expertise, takes a holistic approach to the problem and draws on this confluence of systems research, optimization, and machine learning.

Broader Impact: Different researchers and research groups in machine learning are developing their own seemingly disparate models specialized for their particular application area, often with single-machine or single-threaded codes. By tackling distributed inference and parameter distribution we can potentially benefit many of these applications. Having an abstract glue layer via a parameter server will enable retrofitting many existing efficient single-machine algorithms for parallel inference. We will actively strive to build a vibrant community in academia and industry around the released code by adopting a public, open development model. Moreover, our open source platform will serve as a teaching tool, and we will organize summer schools and workshops focused on scalable machine learning. We also plan to develop a new Scalable Data Analysis course which will be disseminated online.

Keywords: Machine learning, Optimization, Mobile Processors, SSDs, Fault Tolerance.


RI:Small:Collaborative Research: Distributed Inference Algorithms for Machine Learning

Alexander J. Smola and David G. Andersen

1 Introduction

Our goal is to design, analyze, and implement novel inference algorithms that take advantage of emerging hardware paradigms to enable analysis and mining of massive datasets.

What are the limitations of existing machine learning algorithms? Existing algorithms assume that data either can be loaded into main memory or can be scanned repeatedly [25, 54, 55, 90–92]. Others lack fault tolerance in large scale deployments [54, 55], or their fault-tolerant deployments are often not very efficient [29]. Furthermore, many popular inference codes are still sequential, restricted to a single machine [28, 41, 49]. Yet others are parallel but heavily optimized for specific problems [76]. We aim to build a toolkit that is distributed, fault tolerant, and efficient, and that allows for re-use of legacy code.

Can advances in systems and hardware help? Computer hardware and networks are undergoing dramatic changes, and we believe that server centers of the future will be fundamentally different. This requires redesigned algorithms.

Power: many server farms will be made up of large numbers of lightweight CPUs, like the ones found in mobile devices, due to a 10x advantage in power efficiency. The past 5 years have seen a revolution in energy efficiency in mobile devices, motivated by severely limited battery resources. This selective pressure has led to highly efficient micro-architectures such as the ARM A9 and A15 designs. Several companies, such as Calxeda,^1 Marvell,^2 and AMD,^3 are developing server processors based on these designs.

Architecture: homogeneous multicore processors are already being replaced by heterogeneous combinations of special purpose units such as CPUs (central processing units) + GPGPUs (general purpose graphics processing units). For instance, by now most major chip manufacturers sell processors with integrated graphics cores^4 for laptops and desktops. This progress is fueled by the need for rich media content and games on power efficient consumer grade hardware, and will eventually trickle down to the server farms.

Storage: solid state drives (SSDs) will become more prevalent in server centers, and this will allow us to increase the rate of random read operations by 1,000 to 10,000 times relative to hard drives. These rates are likely to improve further since SSDs benefit from the common advances in microchip process technology.

Networks: traditional hierarchical aggregation trees in server centers will be replaced by homogeneous high bandwidth configurations using software defined networking. This is a necessity since the number of machines is increasing dramatically while the fan-in of conventional network switches has essentially stagnated over the past decades. Progress in network design has led to clusters that are practically flat and fully connected [35].

^1 http://www.calxeda.com/technology/products/processors/
^2 http://www.marvell.com/embedded-processors/armada-xp/
^3 http://www.amd.com/us/aboutamd/newsroom/Pages/fact-sheet-2012oct29.aspx
^4 http://www.intel.com/content/www/us/en/architecture-and-technology/hd-graphics/


Unfortunately, existing learning algorithms are not well suited to take advantage of these changes. We propose to develop the next generation of systems-aware, efficient, scalable optimization algorithms which can be applied to machine learning tasks.

Why the focus on optimization? Optimization is at the heart of many machine learning algorithms. While different machine learning algorithms can appear dissimilar on the surface, the underlying optimization problems are often strikingly similar. By understanding these similarities, we can focus on the core problems that will lead to good scaling of most machine learning methods. The problems faced by distributed optimization often closely resemble those in distributed sampling or related inference methods. In other words, we will use optimization as a stand-in testbed to develop distributed inference techniques, with the understanding that many of our techniques can also be carried over to methods like sampling.

Why do we need specialized optimization algorithms? Off-the-shelf optimization routines are typically designed for dealing with general objective functions and constraints. That is, they typically fail to take advantage of the special properties inherent in many machine learning problems (e.g., streaming observations from disk, prior knowledge in problem partitioning). This means that while they excel for prototyping small applications, their runtime on massive datasets with billions of variables and observations is typically considerable. However, the objective functions encountered in machine learning usually have a well-defined structural form. Because the underlying optimization problem presents a computational bottleneck, significant performance gains can be realized by developing specialized algorithms that exploit the structure in the objective function.

Why do we need yet another parallel toolkit? Our proposal describes a modular architecture that allows us to parallelize a large family of existing algorithms without dramatic modification of the single-machine component. This is achieved by means of a parameter server that decouples computation from synchronization in a fully asynchronous fashion. A key benefit is that this often leads to efficient integration of legacy codes. No existing toolkit provides this functionality at present. Many recent parallel processing frameworks either lack higher order update semantics [69] or do not allow random writes in a large and complex state space [90]. Nonetheless, a toolkit is desirable since improvements on the synchronization side will immediately translate into improvements for all the algorithms using it.

In summary, the proposal addresses a need manifest in machine learning both in industry and in academia: the need for flexible and scalable optimization algorithms that are robust, take advantage of modern hardware, provide fault tolerance, are easy to deploy, and allow users to upgrade legacy codes without a complete redesign. In other words, we aim to commoditize the difficult part of parallelization for machine learning by providing a distributed synchronization layer. We commit to the following deliverables:

1. Optimization algorithms for machine learning that exploit recent advances in hardware. Our proposed algorithms are distributed, asynchronous, and fault tolerant.

2. Theoretical analysis and convergence bounds for the algorithms. This will extend current results in variable decomposition methods to obtain faster rates beyond dual decomposition.

3. An open-source platform for the code developed in this project. In particular, it will provide a fault tolerant parameter server and distributed inference controller that enables rapid deployment of our new algorithms as well as parallelization of existing legacy codes.


2 Background

Many machine learning problems can be interpreted as penalized risk minimization; consequently optimization provides a useful unifying thread. For instance, regularized risk bounds and maximum-a-posteriori estimates [44, 56, 61, 63, 74, 82] are a staple of statistical estimation. Furthermore, variational bounds extend this reasoning to cases where one would otherwise more commonly employ sampling [16]. A benefit is that it endows the problems with a well-defined objective function that makes comparison of algorithms fairly straightforward.

This is clearly not the only way to approach statistical inference. For instance, sampling is a formidable tool in statistics [31]. However, convergence analysis tends to be more easily achievable for optimization than for finding mixing guarantees for sampling algorithms. Moreover, the problems encountered in optimization are also prevalent in sampling (e.g., synchronization of sufficient statistics). Hence we limit ourselves to discussing optimization algorithms for machine learning.

2.1 Machine Learning

To illustrate the problems incurred in large scale inference we give a cartoon view of machine learning. It is understood that the reality is considerably more varied and complex. A large class of estimation problems can be viewed as regularized risk minimization:

\[
\min_{w} R(w) \quad \text{where} \quad \underbrace{R(w)}_{\text{regularized risk}} := \underbrace{\lambda \Omega(w)}_{\text{regularizer}} + \underbrace{\frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, w)}_{\text{empirical risk}} \tag{1}
\]

Here w is the parameter of the model, x_i ∈ X ⊆ R^d are the training instances, y_i ∈ Y are the corresponding labels, and l is the loss function. The trade-off between the regularizer and the risk is determined by the constant λ > 0. Four aspects are worth considering:

• There is no guarantee that l or Ω are convex. However, quite often Ω is convex and l is convex in a subset of variables provided that the remainder is fixed. Examples include collaborative filtering problems, where the objective function is convex in the inner product between subsets of the parameter vectors [12], or topic models [17], where the objective is convex in the document-specific and topic-specific parameters respectively.
• Ω may have a nontrivial structure in terms of subsets of variables interacting. This can typically be represented by a sparse graph [11, 57, 64]. Usually Ω is convex, but not always differentiable, e.g., in the case of l1-based regularization [15].
• The subset of coefficients interacting in any given term l(x_i, y_i, w) can be small compared to the dimensionality of w. This occurs, e.g., in text analysis, where only a small number of words out of a much larger dictionary occur in any given document [40].
• The set of coefficients can often be decomposed into subgroups with different access and locality characteristics [89]. This occurs, for example, when mixing global with local personalized models for mail filtering.

This means that in many cases we can rewrite the regularized risk functional as a sum over functions involving subsets of variables, i.e.,

\[
R(w) = \sum_{C \in \mathcal{C}} R_C(w_C) \tag{2}
\]


Here C are maximal cliques of coordinates in w and \mathcal{C} denotes the set of maximal cliques associated with R(w). Moreover, w_C is the restriction of w on the clique C. This means that the problem can often be decomposed such that each part only involves subsets of variables. Notably, however, any variable may occur in multiple cliques: the cliques are not disjoint. Such decompositions are also typical in graphical models [67]. The Hammersley-Clifford decomposition [14, 37] provides further examples of this situation. That is, (2) occurs in both directed and undirected graphical models. Many efficient algorithms in this context exploit the generalized distributive law [5].

Equations (1) and (2) characterize the tension in solving optimization problems in machine learning: the first form (1) can be optimized for efficient data access, e.g., minimization by stochastic gradient descent when streaming data sequentially from disk [49, 72]. The second form (2) instead allows for local computation that operates only on a subset of variables [54]. The resulting subproblems may be considerably easier to solve in this case, at the expense of nonuniform data access. Furthermore, if we distribute subproblems over several machines, we need to keep overlapping variables synchronized [76].
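To make the data-access side of this tension concrete, the following minimal sketch streams observations sequentially and applies stochastic gradient steps to the regularized risk (1). It assumes a squared loss and an l2 regularizer purely for illustration; the function and variable names are our own and not part of any released code.

```python
import numpy as np

def sgd_regularized_risk(stream, d, lam=0.1, eta=0.01, epochs=1):
    """Stochastic gradient descent on form (1): sequential, disk-friendly
    access, one update per observation."""
    w = np.zeros(d)
    for _ in range(epochs):
        for x, y in stream():
            grad_loss = (x @ w - y) * x   # gradient of the squared loss l(x, y, w)
            grad_reg = lam * w            # gradient of the l2 regularizer Omega(w)
            w -= eta * (grad_loss + grad_reg)
    return w

# Toy usage: an in-memory generator standing in for sequential disk reads.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(1000, 20)), rng.normal(size=1000)
w_hat = sgd_regularized_risk(lambda: zip(X, Y), d=20)
```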

2.2 Partitioning

To accelerate computation relative to a single machine, we must partition the problem into parts that can be processed on many machines. There are four common strategies for handling this:

1. Naive observation partitioning. Each machine updates its local copy of all parameters based upon results from its own partition of the observations. It provides high bandwidth to read the observation data, and can store large amounts of this data spread across many machines. Unfortunately, storing the full parameter vector w on each machine may require more memory than is available, particularly when harnessing energy-efficient nodes.

2. Parameter-limiting partitioning. This improves upon the naive approach by distributing data judiciously such that the number of parameters per machine is limited. This works, e.g., in content personalization problems where many parameters can be kept machine-local [45, 89].

3. Clique decomposition. We may decompose R_C(w_C) directly and partition by the set of maximal cliques. A possible side-effect is that the data distribution may have overlapping parts.

4. Joint partitioning. This partitions data and interactions into subgroups. That is, we compute updates on the variables in several machines separately and aggregate within each block.

The three improved strategies amount to row-, column-, and biclustering-type partitionings of the model.

Parameter-limiting partitioning The risk in (1) is computed by averaging the loss over the training instances. This implies that gradients, too, can be computed by a distribute & average operation, e.g., via MapReduce [22]. We divide the training instances into as many subsets as there are processors, give each processor its share of the data to compute the loss and its gradient (the map operation), and average the results at the end (the reduce operation). This is computationally attractive only if we have convex optimization problems and whenever the data naturally decomposes into a shared and a local part, where the size of the local parameter space dwarfs any shared terms (for simple problems). This applies, e.g., to personalized spam filtering [89].
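A minimal sketch of this distribute & average pattern is given below. The shards are plain in-memory arrays standing in for per-worker data partitions, and the quadratic loss is an illustrative assumption; none of the names correspond to an existing system.

```python
import numpy as np

def shard_gradient(shard, w, lam):
    """'Map' step: gradient of the regularized loss on one data shard."""
    X, y = shard
    return X.T @ (X @ w - y) / len(y) + lam * w

def distributed_gradient(shards, w, lam=0.1):
    """'Reduce' step: average the per-shard gradients. In a real deployment
    each shard lives on a different worker; here they are local arrays."""
    return np.mean([shard_gradient(s, w, lam) for s in shards], axis=0)

# Toy usage: split the data into 4 shards and take one averaged gradient step.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(400, 10)), rng.normal(size=400)
shards = [(X[i::4], y[i::4]) for i in range(4)]
w = np.zeros(10)
w -= 0.1 * distributed_gradient(shards, w)
```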

In general, however, data partitioning is challenging since not all data cleanly decomposes into sets of bounded range of interaction. For instance, when annotating user activities and preferences we have implicit interactions between users that frequent the same locations. Similarly, topic models implicitly create interactions between documents via the topics they contain.

Clique Decomposition A second way to solve the regularized risk minimization problem is to exploit the decomposition of R(w) in (2) and employ many machines to solve the sub-problems. Denote by \mathcal{C}_i a subset of cliques such that
\[
\bigcup_i \mathcal{C}_i = \mathcal{C} \quad \text{and let} \quad R_i(w) := \sum_{C \in \mathcal{C}_i} R_C(w_C). \tag{3}
\]

That is, we decompose R into R = \sum_i R_i, where each R_i depends only on the union of all coordinates occurring in \mathcal{C}_i. Depending on the problem type, this set may be considerably smaller than the full set of nonzero variables. For instance, for movie recommendation, we could partition the set of observations by users. Here each subset of data containing only parameters of a small set of users is amenable to separate optimization. However, the set of movie-related parameters is largely shared between all partitions. Minimizing the partitioning cost can be formalized as follows:

\[
\min_{\{\mathcal{C}_i\}} f(C_i) \quad \text{where} \quad C_i = \bigcup_{C \in \mathcal{C}_i} C \quad \text{subject to} \quad \bigcup_i \mathcal{C}_i = \mathcal{C} \tag{4}
\]

Here C_i denotes the subset of coordinates that are required in partition i, as arising from the clique partitioning into subsets \mathcal{C}_i, and f is a monotonic function in |C_i|. The following special cases illustrate how one may want to choose f.

• Choosing f(C_i) = max_i |C_i| means that we are trying to minimize the maximum number of variables in any given partition. This is related to minimizing the vertex cut of a graph, where one attempts to find (balanced) cuts such that the number of neighboring vertices per partition is small. In our setting this goal is desirable since the amount of memory required for each partition is linear in |C_i|, hence minimizing this upper bound is important.
• Choosing f(C_i) = \sum_{i \neq j} |C_i ∩ C_j| minimizes the number of edges between partitions. This is related to finding a (balanced) edge cut of a graph. In the context of optimization it corresponds to the amount of network traffic that a naive partitioning and synchronization scheme requires.

Several such partitioning systems have been used in the past. For instance, [88, 94] partition the data according to users. This means that the overlap in C_i only contains movie parameters. [3, 76] partition by documents or users. Due to the increased sparsity in the data this restricts the overlap further to only frequently used items. [34] discuss both greedy and random partitioning for natural graphs. Finally, [30] discuss a blocked decomposition of variables with intermittent synchronization steps. None of the aforementioned heuristics comes with tangible optimality guarantees. It is therefore desirable to tackle the partitioning problem systematically. A promising strategy is to view (4) as a submodular load-balancing problem [78].
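As an illustration, the sketch below evaluates the two choices of f discussed above for a candidate assignment of cliques to machines. The representation (cliques as Python sets of coordinate indices) is an assumption made for clarity; the overlap cost is computed once per unordered pair.

```python
from itertools import combinations

def partition_costs(clique_partition):
    """Given one list of cliques per machine (each clique a set of coordinate
    indices), compute the two costs f discussed in the text."""
    coords = [set().union(*cliques) if cliques else set()
              for cliques in clique_partition]                     # C_i per machine
    max_vars = max(len(c) for c in coords)                         # f = max_i |C_i|
    overlap = sum(len(a & b) for a, b in combinations(coords, 2))  # f = sum_{i<j} |C_i ∩ C_j|
    return max_vars, overlap

# Toy usage: three cliques split over two machines.
machine_0 = [{0, 1, 2}, {2, 3}]
machine_1 = [{3, 4, 5}]
print(partition_costs([machine_0, machine_1]))   # (4, 1)
```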

Block Partitioning Finally, one may combine both strategies to partition the set of influence variables into subsets such that only parts of the data and only parts of the variables are kept locally. Recent work by [66] shows that this can be efficient in terms of dealing with functions involving many variables. Moreover, the implementation of [47] deals with practical issues of messaging when using block decomposition on disk. Note that none of these strategies is particularly efficient in terms of automatic graph partitioning, which leaves an opportunity to obtain more efficient algorithms.


2.3 Distributed Systems

Problem partitioning is only the first part of the challenge in solving large-scale problems. We need to decide how to solve the individual subproblems efficiently, and how to distribute the storage and update of the shared parameters (e.g., the overlapping nodes among the cliques in \mathcal{C}).

Of particular importance is fault tolerance: our proposal aims to generate solutions that are practically relevant in the large scale setting. However, large numbers of computers almost inevitably exhibit failures, or more likely, jobs may be preempted or even terminated by the cluster manager (e.g., on Amazon spot instances). Hence fault tolerance is not an optional but a necessary attribute.

Key to fault tolerance is the ability to recover state upon the failure of a node. This recovery may take place by redundancy (replicating the state on multiple machines) or by re-execution. There are three key questions that underlie the design of a fault-tolerant system:

• Is it possible to recompute lost data incrementally, i.e., without restarting from scratch?
• If so, what is the cost to recompute the data?
• What consistency is required between the primary copy of the data and any replicas?

Two extremes help illustrate the designs in this space. First is MapReduce, in which computations are forced to be both individually idempotent and synchronized only at the end of a full iteration of Map and Reduce. As a result, work issued to a single worker node can be re-executed upon failure without re-computing any other data. Second are traditional high-performance computing applications, e.g., finite element simulation, which execute computations in lockstep at a fine granularity with a completely consistent view of the shared state between machines. Fault tolerance in HPC applications is typically achieved by pausing the entire system synchronously, taking a consistent snapshot to stable storage, and resuming execution. This adds very high overhead.

Attractively, this design question is directly related to the strategies of Section 3.1, in which our goal is to find solutions for optimization that are either partitionable or that can operate asynchronously. In general, strategies for accelerating distributed computation also ease the requirements for fault tolerance. For a partitionable problem (where sub-problems can be solved independently and then combined at the end) it suffices to provide fault tolerance for each sub-problem independently. A failure requires re-executing or recovering a single sub-problem, but not the global instance. For a problem that can be executed asynchronously, we can typically relax the consistency required of replication, taking advantage of more "sloppy" asynchronous replication strategies. This is useful for optimization algorithms that already have a built-in tolerance to inaccurate partial solutions.

The replication strategies we plan to bring to bear (as appropriate given the algorithmic requirements) are, from "strongest" (but most expensive) to "weakest" (and therefore cheapest): globally consistent snapshotting; individual node consistent replication using high-performance Paxos variants; causally consistent replication of updates; and "eventually" consistent replication (i.e., almost no consistency guarantees at all). Of these, the causally consistent option bears further mention: it is the strongest consistency model that can be provided without requiring synchronous communication between the replicas, but it can still (in many cases) maintain consistency invariants relating to the order and grouping of updates. For example, under causal consistency, it is possible to ensure, for two updates u1 and u2 generated by the same node, "do not apply update u2 to w until update u1 has been applied."^5

^5 For this we plan to harness PI Andersen's prior expertise in both strongly consistent key-value stores [8] and scalable, high-performance causally consistent key-value stores [53].
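The following toy sketch illustrates the causal-ordering guarantee mentioned above: a replica buffers an update until all of the updates it names as dependencies have been applied. It is an illustration of the consistency model only, not of any particular key-value store; all names are hypothetical.

```python
class CausalReplica:
    """Minimal causally consistent replica: updates carry explicit dependencies
    and are buffered until those dependencies have been applied locally."""

    def __init__(self):
        self.state = {}        # parameter id -> value
        self.applied = set()   # ids of applied updates
        self.pending = []      # updates whose dependencies are not yet met

    def receive(self, update_id, deps, key, value):
        self.pending.append((update_id, frozenset(deps), key, value))
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for u in list(self.pending):
                uid, deps, key, value = u
                if deps <= self.applied:   # all dependencies already applied
                    self.state[key] = value
                    self.applied.add(uid)
                    self.pending.remove(u)
                    progress = True

# Toy usage: u2 depends on u1, so it is held back until u1 arrives.
r = CausalReplica()
r.receive("u2", {"u1"}, "w[3]", 0.7)   # buffered
r.receive("u1", set(), "w[3]", 0.5)    # applied; then u2 is applied
print(r.state)                          # {'w[3]': 0.7}
```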


2.4 Hardware

As outlined in Section 1, a key driver for the current research proposal is the dramatic changes that hardware is currently undergoing in terms of power, architecture, storage, and networks. In practice this means that our algorithms must exhibit strong scaling behavior, i.e., the ability to operate on large numbers of resource constrained processors.

Such lightweight systems are relatively constrained in terms of RAM. However, this problem is mitigated by the fact that SSDs (solid state drives) offer high speed out-of-core storage. This is advantageous whenever the data has nonuniform access properties, as is common in natural data [26]. For instance, we may only want to keep the most frequently used keys in memory and prefetch the remainder on demand to hide latency. This requires algorithms that are able to re-arrange data.

For power-law distributions, frequencies of occurrence follow O(x^{-a}) for some exponent a (e.g., a > 2 for English). In this case, the probability of seeing any item of rank x or higher is O(x^{1-a}). With suitable constants this means that caching the 10^5 most frequent items would lead to a miss rate of around 10^{-4}. Given a rate of 10^5 IOPS on modern SSDs, this means that we can handle at least 10^9 operations per second when combining RAM and SSDs. While quite obvious, few machine learning algorithms currently take advantage of these basic properties.
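The sketch below makes this back-of-the-envelope calculation explicit: it computes the fraction of accesses that miss a cache of the top-k items under a Zipf-like law. The vocabulary size and the exponents are illustrative assumptions; the exact miss rate depends on the constants, as noted above.

```python
import numpy as np

def zipf_miss_rate(exponent, cache_size, vocab=10**7):
    """Fraction of accesses falling outside a cache holding the `cache_size`
    most frequent items, under p(rank) proportional to rank**(-exponent)."""
    ranks = np.arange(1, vocab + 1, dtype=float)
    p = ranks ** (-exponent)
    p /= p.sum()
    return p[cache_size:].sum()

# Toy usage: the heavier the tail, the larger the cache needed for a given miss rate.
for a in (1.5, 2.0, 2.5):
    print(a, zipf_miss_rate(a, cache_size=10**5))
```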

Secondly, it is a commonly stated misperception that machine learning algorithms are disk and IO-bound. Indeed, when designing algorithms geared towards loading data, processing it once and then discarding the observations, the network and disk interfaces are considerably slower than what the main memory interface offers. To make things more explicit we list a range of systems below:

             Infiniband   Ethernet   Disk      SSD       RAM      Cache
Capacity     n.a.         n.a.       3TB       512GB     16GB     16MB
Bandwidth    1GB/s        100MB/s    200MB/s   500MB/s   30GB/s   100GB/s
IOPS         10^6         10^4       10^2      10^5      10^8     10^9

This means that any algorithm processing data only once will almost inevitably be data-starved since memory and processors are significantly faster. A simple means to address this is to reuse instances that have been loaded into memory more than once before evicting them. This instantly rebalances data access and processing. Preliminary results in [59] indicate that exploiting this property can yield significant performance gains.
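A minimal sketch of this reuse strategy appears below: each block of observations that has been loaded into memory is swept several times before it is evicted and the next block is read. The block loader, reuse count, and quadratic loss are illustrative assumptions.

```python
import numpy as np

def sgd_with_block_reuse(load_block, num_blocks, d, reuse=5, eta=0.01, lam=0.1):
    """Sweep each in-memory block `reuse` times before eviction, raising the
    amount of computation per byte read from disk or network."""
    w = np.zeros(d)
    for b in range(num_blocks):
        X, y = load_block(b)                       # expensive: disk/network read
        for _ in range(reuse):                     # cheap: data already in RAM
            for i in np.random.permutation(len(y)):
                w -= eta * ((X[i] @ w - y[i]) * X[i] + lam * w)
    return w

# Toy usage with three in-memory "blocks".
rng = np.random.default_rng(2)
blocks = [(rng.normal(size=(100, 8)), rng.normal(size=100)) for _ in range(3)]
w_hat = sgd_with_block_reuse(lambda b: blocks[b], num_blocks=3, d=8)
```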

3 Proposed Work

We propose to advance the state of the art in the following aspects:

Infrastructure: We aim to build an open-source distributed (key,vector) storage system for optimization between multiple machines. This constitutes the systems component of a parallel inference toolkit. Fault tolerance, replication, key distribution, and self repair will be studied.

Optimization: We will use the above architecture to solve distributed optimization problems. Our general strategy is to extend dual decomposition methods to incorporate second order asynchronous updates in the dual outer loop.

Partitioning: The above problems require good partitioning algorithms. At present there are no well-established techniques for scalable graph partitioning. That is, while there exist plenty of tools that minimize the number of edges cut in partitioning [10, 32, 43], this is not entirely the case for decompositions that are efficient for memory and computation.


Integration: A key point of our design is to allow for easy integration of third party implementations. Codes such as Vowpal Wabbit (VW, http://hunch.net/~vw) or LibLinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/) are highly effective single machine solvers. Individually engineering them to allow for parallelization is nontrivial. That said, variable decomposition allows for efficient problem distribution.

Applications: These serve as a testbed for large scale inference. Given the targeted problem sizes, we mainly aim to address data-descriptive and unsupervised problems on large amounts of data, such as topic modeling, factorization, and recommendation problems.

3.1 Optimization

To perform distributed inference we have essentially two alternatives:

Partition and Merge: We partition the optimization problem judiciously into subsets and solve them separately. Once sub-solutions are obtained, we combine them into a joint solution.

Continuous Synchronization: While solving the optimization problem we continuously keep partial solutions adjusted and consistent, possibly with delay.

Partition and Merge is quite popular for convex problems. [58, 95] show that it is possible to solve subproblems for distinct partitions and to perform a final merger operation to obtain results that are comparable to global convex solutions. This holds since we know that a) averaging local solutions can only improve estimates and b) near optimality, averaging solutions amounts to variance reduction.

Unfortunately, for nonconvex problems such as clustering, matters are not quite so straightforward. For instance, the inherent invariance in clustering immediately leads to the problem that identical clusters will have different identifiers on different machines. Generally, for such nonconvex settings a great deal of care is required in the Partition and Merge scenario. For instance, [65] discuss solving intermediate linear assignment problems and [50] use staged optimization.

This leaves us with the second approach, Continuous Synchronization, as the only viable alternative in the general nonconvex case. It is well known that this accelerates convergence. For instance, the stochastic EM algorithm [20] for clustering and topic models [60], sequential Monte Carlo methods for topic models [4, 19], and collapsed samplers [36] all bear witness to this fact, albeit on single processors.

Our approach is to extend dual decomposition algorithms to asynchronous and second order strategies. This addresses two problems with ADMM-style algorithms [18]. The problem

\[
\min_{w} \sum_i f_i(w) \quad \text{is equivalent to} \quad \min_{w_i, z} \sum_i f_i(w_i) \quad \text{subject to} \quad w_i = z. \tag{5}
\]

Subsequently we may specify a matching Lagrange function via

\[
L(w_i, z, \mu_i, \lambda_i) = \sum_i f_i(w_i) + \frac{\mu_i}{2} \|w_i - z\|^2 + \langle \lambda_i, w_i - z \rangle \tag{6}
\]

Hence the problem decomposes into subproblems pertaining to w_i, a joint problem in terms of z that is straightforward to solve, and dual problems in terms of µ and λ. The strategy proposed by [18] is to perform dual ascent in µ and λ. This is problematic since gradient descent algorithms often suffer from slow convergence [13] (also confirmed by preliminary experiments). We aim to address this by employing a second order method in the dual.
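For reference, the sketch below implements the standard synchronous, first-order consensus form of (5)-(6): local minimization in each w_i (here by a few gradient steps), an averaging step for z, and dual ascent in λ_i. It is a baseline illustration under assumed quadratic local objectives, not the asynchronous second-order scheme proposed here.

```python
import numpy as np

def admm_consensus(local_grads, d, n_workers, mu=1.0, eta=0.1, iters=100):
    """Synchronous consensus ADMM on (5)-(6): local w_i updates, a global
    averaging step for z, and first-order dual ascent in lambda_i."""
    w = np.zeros((n_workers, d))
    lam = np.zeros((n_workers, d))
    z = np.zeros(d)
    for _ in range(iters):
        for i in range(n_workers):             # local and parallelizable
            for _ in range(5):                 # a few gradient steps on (6) in w_i
                g = local_grads[i](w[i]) + mu * (w[i] - z) + lam[i]
                w[i] -= eta * g
        z = (w + lam / mu).mean(axis=0)        # exact minimization of (6) in z
        lam += mu * (w - z)                    # dual ascent
    return z

# Toy usage: quadratic local objectives f_i(w) = 0.5 * ||w - c_i||^2.
rng = np.random.default_rng(3)
centers = rng.normal(size=(4, 6))
grads = [lambda w, c=c: w - c for c in centers]
z_star = admm_consensus(grads, d=6, n_workers=4)   # approaches the mean of the centers
```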


[Figure 1: Parameter distribution and replication in the parameter server. Data is exchanged between clients and a distributed pool of servers. Fault tolerance is achieved by synchronizing machine state with a consensus server implementing the Paxos protocol [48] and a database backend.]

Secondly, the optimization in (6) is carried out synchronously. This slows down convergence between machines and also makes the algorithm sensitive to slow and faulty machines. With thousands of machines the probability of experiencing at least one malfunction is very high [77].

Proposed Strategy An alternative is to perform asynchronous gradient ascent. The second order stochastic gradient descent method with exponential rates, as described by [73], can likely be modified to address the scenario where gradients are updated asynchronously. The key difference is that while [73] use the update strategy on instantaneous loss gradients, we use it on subsets of variables in the dual. We plan on drawing on previous experience with distributed second order algorithms [79] in our work.

Lastly, we plan on using the memory hierarchy for efficient caching. Preliminary experiments by [59] show that dual parameter updates yield dramatic acceleration relative to algorithms that use data only once per pass. The challenge is that it is infeasible to communicate the values of the associated instance-based Lagrange multipliers, as is the default in [59]. However, by virtue of saddle point conditions we believe that it is possible to synchronize only a much smaller subset of the parameter space while still retaining near optimality.

3.2 Parameter Server

Distributed optimization requires an efficient mechanism for collecting, updating, and rebroadcasting parameters. This is what the Parameter Server is meant to address.

1. We need a parameter distribution mechanism which can be executed without the need for a central directory manager to decide where parts of the model are stored.

2. We need a mechanism for replication and persistent storage of parameters in case some of the machines die. This is needed for unreliable resources such as Amazon's spot instances.

3. We need to decompose optimization problems such that optimization on individual client machines can proceed without much modification while the parameter server acts as glue between them. What level of 'intelligence' does the parameter server need?

4. We need scheduling algorithms to decide which parameters to synchronize more aggressively than others. [54] has demonstrated significant gains from effective prioritization.

As outlined in Section 3.3, many of these questions have been resolved satisfactorily in the context of distributed (key,value) pair storage [23], block replication for distributed file systems [86, 87], and consistent hashing [21, 42]. Our design is an extension of the fully connected bipartite replication mechanism introduced in [3, 76].


The proposed design is particularly timely since software defined networks are beginning to offer the ability to build virtually flat networks. That is, it is now possible to connect thousands of machines while effectively maintaining a network where point to point bandwidth is nearly maximal and almost uniform [35]. Available cloud computing instances from all major providers appear to be following these network characteristics in practice.

The key idea is that rather than synchronizing parameters directly between all machines holding a given parameter, we synchronize it with a global parameter server (GPS). It is the task of the GPS to reconcile local solutions with global values. In this way, we trade off communication for latency: in the case of n machines needing to synchronize a parameter, we effectively replace a complete graph of n vertices by a star of n + 1 vertices.

1. Key distribution over machines is achieved via consistent hashing [42]. For our purposes (since we need to deal with replication) we require a consistent hash list s_k(i, M) that selects a subset of k machines from a pool of M machines, e.g., using the allocation in CRUSH or by using distributed hash tables [21, 42, 86]. This lets us identify the servers that hold the keys (a minimal sketch of such a mapping follows this list).

2. The hash mapping s_k(i, M) lets us replicate keys. Fault tolerance and repair yield generalizations of the repair protocols discussed in [23, 48]: instead of simply comparing values we now have objects in a vector space with an associated metric.

3. Each client solves the optimization problem on the associated subset of variables. To ensure consistency between partial solutions we can resort to a dual decomposition and penalty approach as discussed in Section 3.1. In practice this means that each machine solves the inference problem separately and asynchronously.

4. Likely one of the more demanding parts of the proposal is to address the issue of scheduling. There is ample evidence [33, 34] that good scheduling can contribute significantly to convergence. That is, we need to decide when the server will push updates to the clients. A brute-force, i.e., uniform, approach was shown to perform quite well in [2]. However, the experiments in [2] also clearly demonstrate the benefits of rapid synchronization.
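One simple way to realize the hash list s_k(i, M) referenced in item 1 is rendezvous (highest-random-weight) hashing, sketched below. The choice of hash function and the naming are illustrative assumptions rather than the design we will necessarily adopt.

```python
import hashlib

def s_k(key, machines, k):
    """Return a replica set of k machines for `key`: rank machines by the hash
    of (key, machine) and keep the top k (rendezvous hashing)."""
    def score(m):
        return hashlib.md5(f"{key}:{m}".encode()).hexdigest()
    return sorted(machines, key=score)[:k]

# Toy usage: adding a machine only moves keys for which it enters the top k.
machines = [f"server{j}" for j in range(8)]
print(s_k("w[12345]", machines, k=3))
print(s_k("w[12345]", machines + ["server8"], k=3))
```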

The above design satisfies our desiderata. Engineering such a system is no minor task. It requires serialization libraries, libraries for remote procedure calls, efficient memory allocators, and a highly fault tolerant parameter repository. Fortunately there exists a significant amount of open source software designed for general purpose systems: serialization is achieved via Google's Protocol Buffers (http://code.google.com/p/protobuf/). Remote procedure calls and fault tolerance can be obtained via the Ice toolkit (http://www.zeroc.com/ice.html). Distributed, consistent storage of the set of live machine IDs is possible via ZooKeeper (http://wiki.apache.org/hadoop/ZooKeeper). Lastly, persistent storage is provided by traditional database-backed (key,value) services such as Basho Riak.

3.3 Systems Research

The problems outlined thus far create two very serious systems challenges. The first is to enable the programmer to cope with the many levels of hierarchy in the individual system and cluster without introducing excess complexity or fragility. The second is the extremely high throughput and low latency demands placed on intra- and inter-machine communication and data transfers. We believe that there is substantial potential to alleviate these problems by co-design of the systems and algorithmic solutions, instead of the more traditional approach of systems researchers doing systems research and learning researchers using the existing systems.


Machine Learning in a Complex Hardware Hierarchy: In the past decade, we have seen the end of "classical" processor scaling, in which each generation of CPUs had an increased clock rate, improving performance for nearly all applications without software changes. Instead, we have entered the multi-core era, in which harnessing the performance of a new generation of CPUs requires increasing algorithmic and application parallelism. Coupled with this is an increase in the number of levels in the memory hierarchy: modern CPUs not only have L1, L2, and L3 caches, but even the bandwidth and latency of memory access is non-uniform on many machines, where some of the memory is attached to one processor and the rest is attached to the other. As noted in Section 2.4, designing algorithms that take advantage of this hierarchy can yield huge speed gains.

At the same time, the levels within this hierarchy represent specific technologies with unique behaviors and properties. Consider flash storage. As noted earlier, it has substantially faster random read speeds than disk (in the range of 10^5 I/O operations per second). However, it does not offer the same small random write performance. In order to write to a particular location, the flash device must first erase the entire "erase block" to which that location belongs. A typical erase block is 256KB in size. As a result, modifying a single bit could require reading, modifying, and writing 256KB of data. Modern flash SSDs mask much of this by using an internal log-structured approach, but they typically do so only at a 2KB or 4KB page granularity. As a consequence, with flash memory, writing 1 byte is just as expensive as writing 2KB. This property helps shape the types of algorithms and data access patterns that work well and poorly on SSDs.

Our primary research goal is to co-design a data storage substrate that is optimized for, e.g., matrix access and caching updates, and that operates well on both SSDs and disk. Our prior successful experience in developing the FAWN-KV storage system [83] suggests that a log-structured approach will work well. At its most basic, the idea is to store updates by appending them to the end of a growing log, while maintaining a small in-memory index to locate the value on reading. Machine learning and sparse updates, however, introduce serious challenges that are unaddressed by prior work: the entry sizes are occasionally very small (e.g., a 64-bit floating point number) and the access rates are orders of magnitude higher than those needed for simple remote key-value stores.
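The sketch below captures the basic log-structured idea just described: writes append records to a growing log (here an in-memory buffer standing in for an append-only file on SSD) and a small in-memory index maps each key to the offset of its latest record. It is an illustration of the access pattern, not the FAWN-KV implementation.

```python
import io
import pickle
import numpy as np

class LogStructuredStore:
    """Toy log-structured (key, vector) store: append-only writes plus a small
    in-memory index from key to the offset of the most recent record."""

    def __init__(self):
        self.log = io.BytesIO()   # stand-in for an append-only file on SSD
        self.index = {}           # key -> offset of the latest record

    def put(self, key, vec):
        offset = self.log.seek(0, io.SEEK_END)
        pickle.dump((key, vec), self.log)   # sequential, SSD-friendly write
        self.index[key] = offset

    def get(self, key):
        self.log.seek(self.index[key])
        _, vec = pickle.load(self.log)      # one random read per lookup
        return vec

# Toy usage: a later put supersedes the earlier record for the same key.
store = LogStructuredStore()
store.put("w[7]", np.arange(4.0))
store.put("w[7]", np.arange(4.0) * 2)
print(store.get("w[7]"))   # [0. 2. 4. 6.]
```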

Two approaches are likely to bear fruit in solving this problem. The first arises because, in many of the workloads we have discussed, some portion of the dataset is static over the course of execution. In most applications, for example, the input data that is being evaluated satisfies this property. In stochastic approximation, for instance, while the order (or selected subset) of the data is permuted on each pass, the data itself is unchanged. Static data is typically amenable to the use of more extensive pre-processing to speed subsequent access, using one (or multiple) of several techniques: layout optimization, index computation, and duplication. In the case of stochastic approximation, for example, it may prove worthwhile to store an additional permutation of the data on disk to ensure that during processing, the data can be read sequentially at high speed.

The second approach that we will explore is optimistic access clustering: if it is necessary to read item A from an SSD, it is "free" to read all other items stored in the same 512 byte disk block. The disk driver and operating system must already bring these items into main memory. In keeping with our earlier algorithmic philosophy, then, we will seek opportunities to perform additional computation on these items, but will also ensure that the cache management system knows that they should be easily evicted from memory in favor of items of known higher utility.

[Figure 2: Effect of problem partitioning on convergence for a natural graph factorization problem on 200 million vertices. A greedy approach which attempts to partition the entire graph immediately yields considerable improvement in speed of convergence (value of the objective function) relative to a hierarchical partitioning strategy. Random partitioning would not have been feasible in terms of memory footprint.]

Fast remote access: The second challenge we face is that updates to shared state, such as the parameter server, benefit from higher bandwidth and lower latency than is physically possible using the available networking technology. Given that perfection is not achievable, we must strive to optimize our use of the network and, as discussed earlier, make appropriate decisions about what updates to send over the network at any given time (scheduling). Complicating this decision is the fact that network bandwidth is not static: in many computing environments, it is a shared resource whose availability varies with time and the number of competing applications running. As a result, simple, static scheduling decisions are unlikely to work well in practice.

Our second systems research goal is therefore to design an appropriate messaging interface that exposes sufficient information to the networking layer to allow fine-grained, real-time scheduling decisions to be made that take advantage of just-in-time bandwidth availability information. The interface should further satisfy our efficiency goals: in particular, batch operations are more efficient than processing a long sequence of individual, small reads and writes; and operations that perform only a simple remote read or write (but with no computation) can often be optimized using networking hardware support for RDMA (remote direct memory access). Here, too, experience has shown that making the best use of technologies such as RDMA requires co-design of the networking and systems considerations and the algorithms making use of them.

3.4 Graph Partitioning

Problem decomposition is necessary to minimize the communication overhead, as outlined in Section 2.2. At minimum, we must design algorithms that partition sufficiently well so as to not exceed the memory available in each machine. Notably, graph partitioning is itself a challenge when dealing with the billions of vertices that may be found in large collections of webpages or realistic online social networks. Because partitioning is essentially a preprocessing step, it makes little sense to obtain an optimal solution at great cost. We believe it is possible to design algorithms that will find an acceptable solution at bounded expense.

Existing work essentially randomly partitions the underlying problem [47, 55]. Our preliminary experiments show that even a greedy strategy can accelerate convergence considerably and that the quality of the partitioning has significant influence on how easy the problem is to solve (see Figure 2). Given the submodular nature of the objective function, it is only fitting that a greedy partitioning procedure shows good promise. We believe that further analysis of the algorithm of [78] should give us fast approximate partitioning strategies.
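As an illustration of such a greedy heuristic, the sketch below assigns cliques to machines one at a time, choosing the machine whose coordinate set grows the least (with ties broken toward the smaller machine for balance). This is a generic greedy baseline for objective (4), not the specific algorithm of [78]; the representation and tie-breaking rule are assumptions.

```python
def greedy_partition(cliques, n_machines):
    """Greedily assign each clique (a set of coordinate indices) to the machine
    whose coordinate set C_i grows the least; break ties toward smaller C_i."""
    coords = [set() for _ in range(n_machines)]
    assignment = []
    for clique in cliques:
        growth = [len(clique - c) for c in coords]
        best = min(range(n_machines), key=lambda i: (growth[i], len(coords[i])))
        coords[best] |= clique
        assignment.append(best)
    return assignment, coords

# Toy usage: overlapping cliques tend to land on the same machine.
cliques = [{0, 1}, {1, 2}, {5, 6}, {2, 3}, {6, 7}]
print(greedy_partition(cliques, n_machines=2))   # ([0, 0, 1, 0, 1], ...)
```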

An alternative approach is to write variants of the partitioning problem as a linear program. Unfortunately, the constraint matrix is not totally unimodular. However, randomized rounding methods may help us find a feasible and near optimal solution efficiently.


One of the attractions of the problem is that it is quite different from the commonly studied edge-cut minimization setting. Hence many of the conventional cut strategies do not apply [10, 32, 43, 75]. Instead, there is a need to modify and extend these strategies to new problem settings.

3.5 Applications and Integration

Beyond designing optimization and distributed inference systems of our own, we wish to capitalize on existing software packages already available in the open source community. Two of the more popular ones are LibLinear [28] and VW [49]. They are widely used in industry and academia. Both rely on risk minimization with a quadratic penalty for capacity control.

As can be seen from (6), quadratic penalties are precisely the mechanism that can also be used for distributed optimization and to ensure that sub-solutions converge to the same consensus. We therefore believe it will be feasible to modify existing code to make it parallel without changing the entire control flow of the single-machine code. Because several codes keep track of variables in primal space, they are easily retrofitted with associated decomposition algorithms.

A second area of applications can be found in Bayesian inference. For instance, machines can be synchronized via Expectation Propagation [39] or, more generally, via Bayesian model averaging [81]. In this case evidence about parameter uncertainty arising from different sources can be pooled. Moreover, we may use optimization for sampling variational models. For instance, [60] describe an algorithm for inferring parameters by performing a variational approximation on the sufficient statistics of a conditional exponential model. This could be parallelized efficiently by decoupling computation and synchronization of sufficient statistics. In other words, this has the potential to yield significantly faster latent variable inference models than those described by [2, 76].

Third, a distributed (key,vector) store is useful for many other distributed inference libraries, such as Graphlab (graphlab.org). In interactions with the Graphlab team, they have expressed considerable interest in the above architecture.

Finally, we approach the problem of applications purely from the viewpoint of demonstrating the feasibility of our model. Access to data is likely not going to be a major problem due to the availability of large collections.^6 Specific statistical modeling challenges will be addressed in separate research projects.

^6 See, e.g., Amazon's public 5 billion page crawl: http://aws.amazon.com/datasets

4 Broader Impact

Different researchers and research groups in machine learning are developing their own seemingly disparate models specialized for their particular application area, often with single machine or single threaded codes. By tackling distributed optimization, machine learning, and systems research jointly we can potentially benefit many of these areas. Having an abstract glue layer via a parameter server will enable retrofitting many existing efficient single-machine algorithms for parallel optimization. We are also very committed to disseminating our results and training the next generation of machine learners.

Course on Big Data Data scientists are currently in high demand in industry, and universities are barely able to keep up with training them. We plan on offering a course on Scalable Data Analysis that ties together systems research, data mining, machine learning, and statistics. At present, many of the techniques required in industry for the successful design of data processing systems are being taught to rather different demographics. For instance, it is rare that statisticians can be enticed to take courses on systems and hardware design. The course on Big Data intends to address this. Video and slides of the course will be made publicly available, as the PIs have done in the past.^7 This will be a unique opportunity for students to learn from a set of instructors who are experts in complementary areas.

^7 http://alex.smola.org/teaching/berkeley2012

Workshops and Summer School Our overarching goal is to bring together leading researchers from computer systems, optimization, machine learning, and data mining so that they can talk to each other and exchange ideas. Towards this end, we will organize interdisciplinary workshops at leading conferences in these areas (e.g., NIPS and ICML for machine learning, KDD and ICDM for data mining, USENIX for systems, and ISMP for optimization).

Another vehicle for educating students and bringing together researchers from different areas is summer schools. Smola has extensive experience in organizing Machine Learning Summer Schools, having organized 6 of them in the past. Capitalizing on this experience, we will organize a summer school on Big Data. This two-week event will introduce advanced undergraduates and graduate students, amongst others, to concepts in machine learning and the usage of our algorithms. Partial funding requirements for this event, covering approximately 50% of the cost, are included in the proposal. We will try to obtain the remainder from industrial sponsors and other agencies.

Open-source Software All the PIs have significant experience in releasing research software under permissive open source licenses. An open development model will be adopted with the aim of building a vibrant community of developers and users. Access to an open source parameter server that goes beyond simple (key,value) storage is highly desirable for many researchers and industry labs alike. In fact, preliminary discussions have met with strong interest from both LinkedIn (http://www.project-voldemort.com/) and the Graphlab (http://www.graphlab.org) team. We believe that our toolkit will make it considerably easier to scale large distributed inference problems.

Enhancing Research Training The ideas in this proposal have significant potential to be applied to real-world problems. As such, we have strong support from collaborators at Google and Microsoft. Our partnerships with these research labs will give graduate students on this project access to resources such as large datasets, real-world problems, and large clusters of computers. Furthermore, they will get a chance to participate in competitive internships at Google. This experience will enrich their training and help them in their future careers.

5 Team Qualifications

PIs Andersen and Smola have complementary expertise, which is crucial for the success of this project. Andersen’s research interests focus on computer systems in networked environments and the design of large scale lightweight systems, such as fast arrays of wimpy nodes [9]. This expertise will be invaluable in building a highly reliable, smart (key,vector) store and in identifying significant opportunities for optimizing machine learning algorithms. Smola has extensive experience in designing machine learning algorithms.

7 http://alex.smola.org/teaching/berkeley2012


Furthermore, his work at Google and Yahoo has given him insight into the requirements of practical large scale machine learning. This ensures that we will be able to design tools that are practically useful and that advance science.

6 Project Plan and Milestones

The project comprises a number of components, some of which are prerequisites for the remainder:

Parameter Server (2013-2014): This component is fundamental to our approach. We expect that after one year a workable system will be available for other teams to use. Overall, we anticipate completing this component within two years.

Graph Partitioning (2013-2014): Likewise, this component is needed for efficient inference. Hence we need to build a workable graph partitioner within the first year and refine it subsequently.

Optimization (2013-2015): In the first year we will most likely spend considerable time understanding the theoretical properties of asynchronous distributed optimization algorithms. System building will commence in 2014, using the parameter server and the partitioning algorithm. The work will conclude with the end of the grant.

Applications and Integration (2014-2015): Integrating the components requires a good understanding of the optimization problems involved and access to a workable parameter server. Hence we anticipate work on applications and integration to begin in 2014 and to require a significant part of our attention in 2015.

Systems Codesign (2013-2015): This is an ongoing effort that will be inherent in the overall derivation and implementation of our algorithms.

7 Results of Prior NSF Funding

PI Smola has not received prior NSF funding.

Co-PI Andersen currently serves as PI on “DC: Medium: Designing and Programming a Low-Power Cluster Architecture for Data-Intensive Workloads” (CCF-0964474, 7/1/2010–6/30/2013). This award underlies the FAWN, or Fast Array of Wimpy Nodes, project, which aims to develop methods for “big data” processing on large clusters of individually slow and memory-constrained nodes. The research from this project has resulted in three papers at SOSP [8, 52, 53], including a best paper award; two at SOCC [27, 84]; and two algorithm engineering papers at JEA and ALENEX [51, 62]. He is also co-PI on “FIA: Collaborative Research: A Content and Service Friendly Architecture with Intrinsic Security and Explicit Trust” (CNS-1040801, 9/1/2010–8/31/2013) [6, 38, 93]. This project, as well as several others in the FIA program, incorporates numerous aspects of self-certifying networking via the Accountable Internet Protocol (AIP) [7] explored in Andersen’s prior Cybertrust award “CT-T: Toward a More Accountable Internet” (CNS-0716287, 8/1/2007–7/31/2010). Andersen’s CAREER award, CNS-0546551, on Data-Oriented Transfer examined the challenges and opportunities of using a higher layer of abstraction for data transfers, both in the wide area and in the data center [1, 24, 46, 68, 70, 71, 80]. Of particular relevance was an in-depth study of the “incast” problem facing data center networks during massive producer-consumer data transfers [68, 85].


8 Bibliography

[1] M. Afanasyev, D. G. Andersen, and A. C. Snoeren. Efficiency through eavesdropping: Link-layer packet caching. In Proc. 5th USENIX NSDI, San Francisco, CA, Apr. 2008.

[2] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. Smola. Scalable inference in latent variable models. In Web Search and Data Mining (WSDM), 2012.

[3] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM), 2012.

[4] A. Ahmed, Q. Ho, J. Eisenstein, E. Xing, A. Smola, and C. Teo. Unified analysis of streaming news. In Proceedings of WWW, Hyderabad, India, 2011. IW3C2, Sheridan Printing.

[5] S. Aji and R. McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 46:325–343, 2000.

[6] A. Anand, F. R. Dogar, D. Han, B. Li, H. Lim, M. Machado, W. Wu, A. Akella, D. G. Andersen, J. W. Byers, S. Seshan, and P. Steenkiste. XIA: An architecture for an evolvable and trustworthy internet. In Proc. ACM HotNets-X, Cambridge, MA, USA, Nov. 2011.

[7] D. G. Andersen, H. Balakrishnan, N. Feamster, T. Koponen, D. Moon, and S. Shenker. Accountable Internet Protocol (AIP). In Proc. ACM SIGCOMM, Seattle, WA, Aug. 2008.

[8] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: A fast array of wimpy nodes. In Proc. 22nd ACM Symposium on Operating Systems Principles (SOSP), Big Sky, MT, Oct. 2009.

[9] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: A fast array of wimpy nodes. Communications of the ACM, 54(7):101–109, July 2011.

[10] R. Andersen, D. Gleich, and V. Mirrokni. Overlapping clusters for distributed computation. In E. Adar, J. Teevan, E. Agichtein, and Y. Maarek, editors, Proceedings of the Fifth International Conference on Web Search and Web Data Mining, WSDM 2012, Seattle, WA, USA, February 8-12, 2012, pages 273–282. ACM, 2012.

[11] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. CoRR, abs/1108.0775, 2011.

[12] R. M. Bell and Y. Koren. Lessons from the Netflix Prize challenge. SIGKDD Explorations, 9(2):75–79, 2007.

[13] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.

[14] J. Besag. Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society, Series B, 36(2):192–236, 1974.

[15] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4), 2008.


[16] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[17] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, Jan. 2003.

[18] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–123, 2010.

[19] K. R. Canini, L. Shi, and T. L. Griffiths. Online inference of topics with latent Dirichlet allocation. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.

[20] G. Celeux, D. Chauveau, and J. Diebolt. Stochastic versions of the EM algorithm: an experimental study in the mixture case. Journal of Statistical Computation and Simulation, 55(4):287–314, 1996.

[21] A. Chawla, B. Reed, K. Juhnke, and G. Syed. Semantics of caching with SPOCA: A stateless, proportional, optimally-consistent addressing algorithm. In USENIX, 2011.

[22] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. CACM, 51(1):107–113, 2008.

[23] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s highly available key-value store. In T. C. Bressoud and M. F. Kaashoek, editors, Symposium on Operating Systems Principles, pages 205–220. ACM, 2007.

[24] F. Dogar, A. Phanishayee, H. Pucha, O. Ruwase, and D. Andersen. Ditto - a system for opportunistic caching in multi-hop wireless mesh networks. In Proc. ACM MobiCom, San Francisco, CA, Sept. 2008.

[25] C. Engle, A. Lupher, R. Xin, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: Fast data analysis using coarse-grained distributed memory. In SIGMOD, May 2012.

[26] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In SIGCOMM, pages 251–262, 1999.

[27] B. Fan, H. Lim, D. G. Andersen, and M. Kaminsky. Small cache, big effect: Provable load balancing for randomly partitioned cluster services. In Proc. 2nd ACM Symposium on Cloud Computing (SOCC), Cascais, Portugal, Oct. 2011.

[28] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, Aug. 2008.

[29] Apache Software Foundation. Mahout project, 2012. http://mahout.apache.org.

[30] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Conference on Knowledge Discovery and Data Mining, pages 69–77, 2011.


[31] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman & Hall, 1995.

[32] M. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115–1145, 1995.

[33] J. Gonzalez, Y. Low, and C. Guestrin. Residual splash for optimally parallelizing belief propagation. In Artificial Intelligence and Statistics (AISTATS), Clearwater Beach, Florida, 2009.

[34] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2012), 2012.

[35] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: a scalable and flexible data center network. Communications of the ACM, 54(3):95–104, Mar. 2011.

[36] T. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.

[37] J. M. Hammersley and P. E. Clifford. Markov fields on finite graphs and lattices. Unpublished manuscript, 1971.

[38] D. Han, A. Anand, F. Dogar, B. Li, H. Lim, M. Machado, A. Mukundan, W. Wu, A. Akella, D. G. Andersen, J. W. Byers, S. Seshan, and P. Steenkiste. XIA: Efficient support for evolvable internetworking. In Proc. 9th USENIX NSDI, San Jose, CA, Apr. 2012.

[39] R. Herbrich, T. Minka, and T. Graepel. TrueSkill™: A Bayesian skill rating system. In NIPS, 2007.

[40] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. Technical Report 23, LS VIII, University of Dortmund, 1997.

[41] T. Joachims. Making large-scale SVM learning practical. In B. Scholkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184, Cambridge, MA, 1999. MIT Press.

[42] D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, and R. Panigrahy. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Symposium on the Theory of Computing (STOC), pages 654–663, New York, May 1997. Association for Computing Machinery.

[43] G. Karypis and V. Kumar. MeTis: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 2.0, 1995.

[44] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.


[45] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.

[46] E. Krevat, V. Vasudevan, A. Phanishayee, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. On application-level approaches to avoiding TCP throughput collapse in cluster-based storage systems. In Proc. Petascale Data Storage Workshop at Supercomputing’07, Nov. 2007.

[47] A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In OSDI, Hollywood, CA, 2012.

[48] L. Lamport. Recent discoveries from Paxos. In Dependable Systems and Networks, page 3. IEEE Computer Society, 2004.

[49] J. Langford, L. Li, and A. Strehl. Vowpal Wabbit online learning project, 2007. http://hunch.net/?p=309.

[50] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. Technical report, arXiv, 2011.

[51] H. Lim, D. G. Andersen, and M. Kaminsky. Practical batch-updatable external hashing with sorting. In Proc. Meeting on Algorithm Engineering and Experiments (ALENEX), Jan. 2013.

[52] H. Lim, B. Fan, D. G. Andersen, and M. Kaminsky. SILT: A memory-efficient, high-performance key-value store. In Proc. 23rd ACM Symposium on Operating Systems Principles (SOSP), Cascais, Portugal, Oct. 2011.

[53] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Don’t settle for eventual: Scalable causal consistency for wide-area storage with COPS. In Proc. 23rd ACM Symposium on Operating Systems Principles (SOSP), Cascais, Portugal, Oct. 2011.

[54] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence, 2010.

[55] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In PVLDB, 2012.

[56] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

[57] O. L. Mangasarian. Linear and nonlinear separation of patterns by linear programming. Oper. Res., 13:444–452, 1965.

[58] G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1231–1239, 2009.


[59] S. Matsushima, S. Vishwanathan, and A. Smola. Linear support vector machines via dual cached loops. In Q. Yang, D. Agarwal, and J. Pei, editors, The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, pages 177–185. ACM, 2012.

[60] D. Mimno, M. Hoffman, and D. Blei. Sparse stochastic inference for latent Dirichlet allocation. In International Conference on Machine Learning, 2012.

[61] T. M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

[62] I. Moraru and D. G. Andersen. Exact pattern matching with feed-forward Bloom filters. Journal of Experimental Algorithmics (JEA), 17(1), July 2012.

[63] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[64] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. CoRR, abs/1010.2731, 2010.

[65] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed algorithms for topic models. Journal of Machine Learning Research, 10:1801–1828, 2009.

[66] N. Parikh and S. Boyd. Graph projection block splitting for distributed optimization, 2012. Submitted.

[67] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2001.

[68] A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Measurement and analysis of TCP throughput collapse in cluster-based storage systems. In Proc. USENIX Conference on File and Storage Technologies (FAST), San Jose, CA, Feb. 2008.

[69] R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In R. H. Arpaci-Dusseau and B. Chen, editors, Operating Systems Design and Implementation, OSDI, pages 293–306. USENIX Association, 2010.

[70] H. Pucha, D. G. Andersen, and M. Kaminsky. Exploiting similarity for multi-source downloads using file handprints. In Proc. 4th USENIX NSDI, Cambridge, MA, Apr. 2007.

[71] H. Pucha, M. Kaminsky, D. G. Andersen, and M. A. Kozuch. Adaptive file transfers for diverse environments. In Proc. USENIX Annual Technical Conference, Boston, MA, June 2008.

[72] N. Ratliff, J. Bagnell, and M. Zinkevich. (Online) subgradient methods for structured prediction. In Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS), March 2007.

[73] N. L. Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets, 2012. Short version at NIPS’12.

[74] B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.


[75] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[76] A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Very Large Databases (VLDB), 2010.

[77] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In S. Srinivasan, K. Ramamritham, A. Kumar, M. P. Ravindra, E. Bertino, and R. Kumar, editors, Conference on World Wide Web, pages 607–614. ACM, 2011.

[78] Z. Svitkina and L. Fleischer. Submodular approximation: Sampling-based algorithms and lower bounds. In FOCS, pages 697–706. IEEE Computer Society, 2008.

[79] C. H. Teo, S. V. N. Vishwanathan, A. J. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11:311–365, January 2010.

[80] N. Tolia, M. Kaminsky, D. G. Andersen, and S. Patil. An architecture for Internet data transfer. In Proc. 3rd Symposium on Networked Systems Design and Implementation (NSDI), San Jose, CA, May 2006.

[81] V. Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000.

[82] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[83] V. Vasudevan, M. Kaminsky, and D. Andersen. Using vector interfaces to deliver millions of IOPS from a networked key-value storage server. In Proc. ACM Symposium on Cloud Computing (SOCC), San Jose, CA, 2012.

[84] V. Vasudevan, M. Kaminsky, and D. G. Andersen. Using vector interfaces to deliver millions of IOPS from a networked key-value storage server. In Proc. 3rd ACM Symposium on Cloud Computing (SOCC), San Jose, CA, Oct. 2012.

[85] V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, and B. Mueller. Safe and effective fine-grained TCP retransmissions for datacenter communication. In Proc. ACM SIGCOMM, Barcelona, Spain, Aug. 2009.

[86] S. Weil, S. Brandt, E. Miller, D. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In OSDI, pages 307–320. USENIX Association, 2006.

[87] S. Weil, S. Brandt, E. Miller, and C. Maltzahn. CRUSH: Controlled, scalable, decentralized placement of replicated data. In ACM/IEEE Conference on Supercomputing, page 122. ACM Press, 2006.

[88] M. Weimer, A. Karatzoglou, Q. Le, and A. J. Smola. CofiRank - maximum margin matrix factorization for collaborative ranking. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.

[89] K. Weinberger, A. Dasgupta, J. Attenberg, J. Langford, and A. J. Smola. Feature hashing for large scale multitask learning. In L. Bottou and M. Littman, editors, International Conference on Machine Learning, 2009.


[90] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Fast and interactive analytics over Hadoop data with Spark. USENIX ;login:, 37(4):45–51, August 2012.

[91] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, April 2012.

[92] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In HotCloud 2010, June 2010.

[93] X. Zhang, H.-C. Hsiao, G. Hasker, H. Chan, A. Perrig, and D. G. Andersen. SCION: Scalability, control, and isolation on next-generation networks. In Proc. IEEE Symposium on Security and Privacy, Oakland, CA, May 2011.

[94] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix Prize. In Proceedings of the 4th International Conference on Algorithmic Aspects in Information and Management, pages 337–348, 2008.

[95] M. Zinkevich, A. J. Smola, M. Weimer, and L. Li. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems 23, pages 2595–2603, 2010.
