
Implementing Domain-Specific Languages for Heterogeneous Parallel Computing

HyoukJoong Lee, Kevin J. Brown, Arvind K. Sujeeth, Hassan Chafi, and Kunle Olukotun, Stanford University
Tiark Rompf and Martin Odersky, École Polytechnique Fédérale de Lausanne

Domain-specific languages offer a solution to the performance and productivity issues in heterogeneous computing systems. The Delite compiler framework simplifies the process of building embedded parallel DSLs. DSL developers can implement domain-specific operations by extending the DSL framework, which provides static optimizations and code generation for heterogeneous hardware. The Delite runtime automatically schedules and executes DSL operations on heterogeneous hardware.

Power constraints have limited the ability of microprocessor vendors to scale single-core performance with each new generation. Instead, vendors are increasing the number of processor cores and incorporating specialized hardware to improve performance [1]. For example, GPUs have become essential components of modern systems because of their massively parallel computing capability [2,3]. However, the advent of heterogeneous systems creates the problem of having too many programming models. Common examples are Pthreads or OpenMP for multicore CPUs, OpenCL or CUDA for GPUs, and the message passing interface (MPI) for clusters. As a result, application programmers pursuing higher performance must have expertise not only in the application domain but also in disparate parallel programming models and hardware implementations. In addition, the relative performance improvement is difficult to predict until the program is written and executed in different models with elaborate optimization phases that might also depend on the input data. Even worse, the optimized code for one system is neither portable nor guaranteed to perform well on another system. Ideally, application programmers shouldn't have to manage low-level implementation details, so they can focus on algorithmic description and still obtain high performance.

A successful parallel programming model should have three broad characteristics (known as the three Ps): productivity, performance, and portability. To provide productivity, a parallel programming model must raise the level of abstraction above that of current low-level programming models such as Pthreads and CUDA. Ideally, such a programming model would also be general, allowing the application developer to express arbitrary semantics.


However, despite decades of research, no such programming model exists. Instead, it seems that generality, productivity, and performance are usually at odds with one another, and successful programming languages must carefully trade them off. For example, C++ is general and high performance, but usually not considered as productive as higher-level languages such as Python and Ruby. On the other hand, Python and Ruby can't compete with C++ in terms of performance. Another approach is to focus on performance and productivity while trading off generality, for example by focusing on a particular domain using domain-specific languages (DSLs) [4,5]. The ability to exploit domain knowledge to pursue high performance and productivity makes DSLs an ideal platform for attacking the heterogeneous parallel programming problem. However, making the DSL approach useful on a large scale requires lowering the barrier for DSL development.

To facilitate development of embedded parallel DSLs, we designed the Delite compiler framework. DSL developers can implement domain-specific operations by extending the DSL framework, which provides static optimizations and code generation for heterogeneous hardware. The Delite runtime automatically schedules and executes DSL operations on heterogeneous hardware. We evaluated this approach with OptiML, a machine-learning DSL, and found significant performance benefits when running OptiML applications on a system using multicore CPUs and a GPU.

Domain-specific languages

DSLs provide carefully designed APIs that are easy for developers in the domain to use. DSL code often more closely resembles pseudocode than C code, deferring most if not all of the implementation details to the language. This deferral of responsibility lets the DSL developer use the most efficient parallel implementation and target different devices transparently to the application. This is feasible only because the DSL doesn't try to do everything; instead, it tries to do a few things very well. DSLs provide a structured foundation to identify and exploit parallel execution patterns specific to particular domains. They aren't a silver bullet, however. They can't parallelize existing sequential code, and they don't on their own eliminate the parallel programming burden for the DSL developer.

For parallel DSLs to be tractable, they must be easy to create for many domains. Traditionally, there are two types of DSLs.

Internal DSLs are embedded in a host language, and are sometimes called the "just-a-library" approach [6]. These DSLs typically use a flexible host language to provide syntactic sugar over library calls. Although this is the easiest approach (no additional compiler is necessary), it constrains the DSL's capabilities. A purely embedded, or library-based, DSL can't build or analyze an intermediate representation (IR) of user programs. Thus, such DSLs can only perform dynamic analyses and optimizations, and these handicaps can severely impact achievable parallel performance. More importantly, without an IR, DSLs can't do their own code generation, which prevents retargeting DSL code to heterogeneous devices.
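To make the contrast concrete, the following sketch (in plain Scala, with invented names EagerVector, StagedVector, and Plus) shows the difference between a purely library-based operation, which executes immediately and leaves nothing behind to analyze, and an IR-building operation, which only records a symbolic node that a later compiler stage could optimize and retarget. It is an illustration of the idea, not code from any actual DSL.

// Hypothetical illustration: eager library DSL vs. IR-building DSL.
object EagerVector {
  // Library-based: the addition happens right here; no IR survives the call.
  def plus(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (x, y) => x + y }
}

object StagedVector {
  // IR-building: operations only construct nodes describing the computation.
  sealed trait Exp
  case class Const(v: Array[Double]) extends Exp
  case class Plus(a: Exp, b: Exp) extends Exp // symbolic, analyzable node

  def plus(a: Exp, b: Exp): Exp = Plus(a, b) // nothing is computed yet
}

object EmbeddingDemo extends App {
  val v1 = Array(1.0, 2.0)
  val v2 = Array(3.0, 4.0)
  println(EagerVector.plus(v1, v2).toList) // List(4.0, 6.0)

  import StagedVector._
  val ir = plus(Const(v1), Const(v2)) // Plus(Const(...), Const(...))
  println(ir) // an IR that a compiler could optimize and retarget
}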

External DSLs are implemented as stand-alone languages [7]. Although these DSLs obviously don't have the limitations of internal DSLs, they're extremely difficult to build. The DSL developer must define a grammar and implement the entire compiler framework as well as tooling to make it useful (for example, IDE support). This clearly isn't a scalable approach.

Our hybrid approach addresses this dilemma. We use the concept of language virtualization [8] to characterize a host language that lets us implement embedded DSLs that are virtually indistinguishable from stand-alone DSLs. A virtualizable host language provides an expressive, flexible front end that the DSL can borrow, while letting the DSL leverage metaprogramming facilities to build and optimize an IR. Figure 1 demonstrates this separation.

Language virtualization is an effective way to define and implement DSLs inside a sufficiently flexible host language. However, building parallel DSLs adds new challenges, such as implementing parallel patterns, launching and scheduling parallel tasks, synchronizing communication, and managing multiple address spaces. Therefore, we implemented a reusable compiler infrastructure and runtime (the Delite compiler framework and runtime) to make developing parallel DSLs even easier. The framework provides common parallel execution patterns, and DSL developers can easily implement a DSL operation by mapping it to one of these patterns. This mapping process requires minimum effort because most of the behavior is already encoded in the pattern. For example, to implement an operation that iterates over a collection and updates each element by applying a given function, the DSL developer needs only to specify the function behavior. The framework and runtime manage the boilerplate code (for example, iterating over the loop, updates, and parallelization). The framework also provides code generators for those patterns, so the DSL can target heterogeneous hardware (multicore CPUs, GPUs, and so on) without developing code generators for each target. In addition, the runtime manages efficient scheduling, exploiting both task and data parallelism with proper synchronization; these tasks used to be additional burdens on the DSL developer. Therefore, DSL developers can focus on the language design rather than the implementation details and can exploit heterogeneous parallel performance without writing any low-level parallel code.
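As a rough illustration of that division of labor, the sketch below expresses a DSL operation by supplying only the per-element function to a map-like pattern; the looping (and, in a real framework, the parallelization and code generation) lives in the pattern. The names MapPattern and VectorLog and the use of plain Scala collections are simplifying assumptions for this example, not Delite's actual interfaces.

import scala.math.log

// A simplified stand-in for a framework-provided data-parallel pattern:
// it owns the iteration and, conceptually, the parallelization strategy.
trait MapPattern[A, B] {
  def input: Seq[A]
  def func(e: A): B // the only thing a DSL operation must supply
  final def execute(): Seq[B] =
    input.map(func) // a real framework would split this loop across cores and devices
}

// A DSL developer defines an operation by filling in just the element function.
case class VectorLog(input: Seq[Double]) extends MapPattern[Double, Double] {
  def func(e: Double): Double = log(e) // element-wise natural logarithm
}

object PatternDemo extends App {
  println(VectorLog(Seq(1.0, 10.0, 100.0)).execute()) // List(0.0, 2.30..., 4.60...)
}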

OptiML and Delite

Figure 2 shows the overall operation of the Delite compiler framework and runtime. It gives an example of OptiML [9], a machine-learning DSL developed with our framework. OptiML provides matrix, vector, and graph data structures; domain-specific operations, including linear algebra; and domain-specific control structures such as sum, which is used in the code snippet in Figure 2. The sum construct accumulates the result of the given block for every iteration (0 to m) and is implemented by extending the framework's DeliteOpMapReduce parallel pattern. The code snippet shown in Figure 2 is the following OptiML fragment:

// x : TrainingSet[Double]
// mu0, mu1 : Vector[Double]
val sigma = sum(0, m) { i =>
  if (x.labels(i) == false) {
    ((x(i)-mu0).t) ** (x(i)-mu0)
  } else {
    ((x(i)-mu1).t) ** (x(i)-mu1)
  }
}

[Figure 2. Operation of the Delite compiler framework and runtime. The application written in OptiML is transformed into an IR and optimized by the framework to generate the kernels for heterogeneous targets along with the Delite execution graph. The runtime schedules the kernels on heterogeneous hardware for parallel execution. (DEG: Delite execution graph.)]

To generate optimized executables from the high-level representation of the DSL operations, the Delite compiler builds an IR of the application and applies various optimizations on the IR nodes. For example, because the two vector minus operations (x(i)-mu0) within the sum are redundant, common subexpression elimination removes the latter operation by reusing the former result. After building the optimized IR, the framework's code generators automatically emit computation kernels for both the CPU and the GPU. When all of the application kernels have been generated along with the Delite execution graph (DEG), which encodes the data and control dependencies of the kernels, the Delite runtime starts analyzing the DEG to schedule the kernels and generate an execution plan for each target. The execution plan consists of the kernel calls scheduled on the target with necessary memory transfers and synchronization. Finally, target language compilers (such as Scala and CUDA) compile and link the kernels from the framework with the execution plans from the runtime to generate executable files that run on the system.

[Figure 1. Typical compilation phases (lexer, parser, type checker, analysis, optimization, code generator) and the language virtualization approach to embedded domain-specific languages. This approach lets the DSL adopt the front end from a highly expressive embedding language, while customizing its intermediate representation (IR) and participating in the back-end phases of its compilation.]

Building embedded parallel DSLs

For a high-performance DSL to target heterogeneous parallel systems, its IR should have at least three major characteristics:

- It should be able to accommodate traditional compiler optimizations on DSL operations and data types.
- It should expose common parallel patterns for structured parallelism.
- It should encode enough domain information to allow implementation flexibility and domain-specific optimizations.

The Delite compiler framework, a reusable compiler infrastructure for developing DSLs, structures the intermediate representation of DSLs in a way that meets these requirements.

Intermediate representation

A single IR node can be viewed from three perspectives, which provide different optimizations and code-generation strategies. We built the Delite compiler framework using the concept of a multiview IR, as Figure 3 illustrates.

[Figure 3. Multiview of IR nodes in the Delite compilation framework. Here, the matrix addition operation (M1 = M2 + M3) is represented as a MatrixPlus IR node, which extends the DeliteOpZipWith IR node, which again extends the Definition IR node. The framework uses the generic IR view for traditional compiler optimizations, the parallel IR view for exposing parallel patterns and loop-fusing optimizations, and the domain-specific IR view for domain-specific optimizations.]

The most basic view of an IR node is a symbol and its definition (the generic IR view), which is similar to a node in a traditional compiler framework's flow graph. We can therefore apply all the well-known static optimizations at this level. The primary difference is that our representation has a coarser granularity because each node is a DSL operation rather than an individual instruction, and this often leads to better optimization results. For example, we can apply common subexpression elimination (CSE) to the vector operations (x(i)-mu0, x(i)-mu1), as Figure 2 shows, instead of just to scalar operations. Currently applied optimizations include CSE, constant propagation, dead code elimination, and code motion.
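A minimal sketch of how symbol/definition bookkeeping yields CSE at the granularity of whole DSL operations is shown below. The node and table names (Sym, VectorMinus, the hash-consing map) are illustrative simplifications, not the framework's actual data structures.

import scala.collection.mutable

// Symbols name IR nodes; definitions describe whole DSL operations.
case class Sym(id: Int)
sealed trait Def
case class VectorMinus(a: Sym, b: Sym) extends Def // one coarse-grained node

object IRBuilder {
  private var nextId = 0
  private val defs = mutable.Map.empty[Def, Sym] // structural hash-consing table

  // Returning an existing symbol for a structurally equal definition is CSE:
  // the second x(i)-mu0 inside the sum maps back to the first one's symbol.
  def toAtom(d: Def): Sym =
    defs.getOrElseUpdate(d, { nextId += 1; Sym(nextId) })
}

object CseDemo extends App {
  val xi = Sym(100) // stand-ins for previously built nodes
  val mu0 = Sym(101)
  val first = IRBuilder.toAtom(VectorMinus(xi, mu0))
  val second = IRBuilder.toAtom(VectorMinus(xi, mu0))
  println(first == second) // true: only one vector subtraction is ever emitted
}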

A generic IR node can be characterized by its parallel execution pattern (the parallel IR view). Therefore, on top of the generic IR view, the Delite compiler framework provides a finite number of common structured parallel execution patterns in the form of DeliteOp IR nodes. Examples include DeliteOpMap, which encodes disjoint element access patterns without ordering constraints, and DeliteOpForeach, which allows a DSL-defined consistency model for overlapping elements. The DeliteOpSequential IR node is for patterns that aren't parallelizable. Because certain parallel patterns share a common notion of loops, we can fuse multiple loop patterns into a single loop. The parallel IR optimizer iterates over all the IR nodes of the various loop types (DeliteOpMap, DeliteOpZipWith, and so on), and fuses those with the same number of iterations into a single loop. This optimization removes unnecessary memory allocations and improves cache behavior by eliminating multiple passes over data, which is especially useful for memory-bound applications.
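The sketch below gives a flavor of how loop-shaped parallel IR nodes expose enough structure for fusion: each node reports its trip count, and adjacent loops with equal counts are merged into one pass over the data. The class fields and the fusion rule are simplified assumptions made for this example.

// Simplified loop-shaped parallel IR nodes that expose their trip count.
sealed trait DeliteOpLoop { def size: Int }
case class OpMap(size: Int, f: Double => Double) extends DeliteOpLoop
case class OpZipWith(size: Int, f: (Double, Double) => Double) extends DeliteOpLoop

// A fused node that runs several element functions in a single loop.
case class FusedLoop(size: Int, body: Seq[DeliteOpLoop]) extends DeliteOpLoop

object LoopFusion {
  // Toy fusion rule: adjacent loops with the same number of iterations are
  // grouped into a single loop, eliminating extra passes over the data.
  def fuse(loops: Seq[DeliteOpLoop]): Seq[DeliteOpLoop] =
    loops.foldLeft(Vector.empty[DeliteOpLoop]) { (acc, l) =>
      acc.lastOption match {
        case Some(FusedLoop(n, body)) if l.size == n => acc.init :+ FusedLoop(n, body :+ l)
        case Some(prev) if prev.size == l.size => acc.init :+ FusedLoop(l.size, Seq(prev, l))
        case _ => acc :+ l
      }
    }
}

object FusionDemo extends App {
  val ops = Seq(OpMap(1000, _ * 2.0), OpMap(1000, _ + 1.0), OpZipWith(500, _ + _))
  println(LoopFusion.fuse(ops)) // the two 1,000-element maps end up in one FusedLoop
}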

In addition to the parallel execution pattern, each domain operation has its own semantic information encoded in the corresponding domain-specific IR node (the domain-specific IR view). This view enables the framework to apply domain-specific optimizations, such as linear algebra simplifications, through IR transformations. The transformation rules are simply described by pattern matching on domain-specific IR nodes, and the optimizer replaces the matched nodes with a simpler set of nodes. Examples of the transformation on matrix operations include (A^T)^T = A and A*B + A*C = A*(B + C). The separation of the domain-specific IR from the parallel IR makes it easy to design implicitly parallel DSLs, abstracting parallel execution patterns away from DSL users.
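A minimal sketch of such pattern-matching rewrites is shown below, using invented node names (MatT, MatMul, MatAdd); the two rules correspond to the (A^T)^T = A and A*B + A*C = A*(B + C) simplifications mentioned above.

// Illustrative domain-specific IR nodes for matrix expressions.
sealed trait MatExp
case class MatSym(name: String) extends MatExp
case class MatT(a: MatExp) extends MatExp                 // transpose
case class MatMul(a: MatExp, b: MatExp) extends MatExp
case class MatAdd(a: MatExp, b: MatExp) extends MatExp

object LinearAlgebraSimplify {
  // Rewrite rules expressed directly as pattern matches on IR nodes.
  def rewrite(e: MatExp): MatExp = e match {
    case MatT(MatT(a)) => rewrite(a)                       // (A^T)^T = A
    case MatAdd(MatMul(a1, b), MatMul(a2, c)) if a1 == a2 =>
      MatMul(rewrite(a1), MatAdd(rewrite(b), rewrite(c)))  // A*B + A*C = A*(B + C)
    case MatT(a) => MatT(rewrite(a))
    case MatMul(a, b) => MatMul(rewrite(a), rewrite(b))
    case MatAdd(a, b) => MatAdd(rewrite(a), rewrite(b))
    case sym: MatSym => sym
  }
}

object RewriteDemo extends App {
  val (a, b, c) = (MatSym("A"), MatSym("B"), MatSym("C"))
  println(LinearAlgebraSimplify.rewrite(MatT(MatT(a))))                      // MatSym(A)
  println(LinearAlgebraSimplify.rewrite(MatAdd(MatMul(a, b), MatMul(a, c)))) // MatMul(MatSym(A), MatAdd(...))
}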

This multiview IR greatly simplifies the process of developing a new DSL because all DSLs can reuse the generic IR and parallel IR in the framework, so DSL developers only need to design a domain-specific IR for each domain operation as an extension. Because DSL developers have expertise in the parallel execution pattern of each domain operation in the DSL, they extend the appropriate Delite parallel IR node to create a domain-specific IR for each operation. In other words, DSL developers are only exposed to a high-level parallel instruction set (the parallel IR nodes), and the Delite compiler framework automatically manages the implementation details of each pattern on multiple targets.
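The extension pattern described above, and depicted in Figure 3, might look roughly like the following sketch: the DSL-facing node (MatrixPlus) carries the domain-specific view, its DeliteOpZipWith-style parent carries the parallel view, and the root Definition type carries the generic symbol/definition view. The member signatures are simplified assumptions, not the framework's actual APIs.

// Generic IR view: every node is ultimately a definition bound to a symbol.
abstract class Definition

// Parallel IR view: a framework-provided pattern that combines two collections
// element-wise; code generators for CPU and GPU are written once, against this.
abstract class DeliteOpZipWith[A, B, R] extends Definition {
  def inA: IndexedSeq[A]
  def inB: IndexedSeq[B]
  def zip(a: A, b: B): R
}

// Domain-specific IR view: the DSL developer only names the operation and
// supplies the element-wise function; parallelization and codegen come for free.
case class MatrixPlus(inA: IndexedSeq[Double], inB: IndexedSeq[Double])
    extends DeliteOpZipWith[Double, Double, Double] {
  def zip(a: Double, b: Double): Double = a + b
}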

To build the IR from a DSL application, the Delite compiler framework uses the lightweight modular staging (LMS) technique [10]. As the application starts executing within the framework, the framework translates each operation into a symbolic representation to form an IR node rather than actually executing the operation. The IR nodes track all dependencies among one another, and the framework applies the various optimizations on the IR as mentioned earlier. After building the machine-independent optimized IR, the Delite compiler framework starts the code-generation phase to target heterogeneous parallel hardware.
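In LMS, operations on staged types (conventionally written Rep[T]) build IR nodes instead of computing values. The tiny sketch below captures that idea with a hand-rolled Rep and a staged integer addition; it is a simplification for illustration, not the LMS library's actual API.

// A toy staged representation in the spirit of LMS: Rep[T] values are IR nodes.
sealed trait Rep[T]
case class Const[T](value: T) extends Rep[T]
case class Plus(lhs: Rep[Int], rhs: Rep[Int]) extends Rep[Int]

object Staging {
  // "Executing" the user's + merely records a Plus node and its dependencies.
  implicit class IntOps(lhs: Rep[Int]) {
    def +(rhs: Rep[Int]): Rep[Int] = Plus(lhs, rhs)
  }
  def unit(x: Int): Rep[Int] = Const(x)
}

object StagingDemo extends App {
  import Staging._
  // The application code reads like ordinary arithmetic...
  val program: Rep[Int] = unit(1) + unit(2) + unit(3)
  // ...but what actually got built is an IR the framework can optimize and retarget.
  println(program) // Plus(Plus(Const(1),Const(2)),Const(3))
}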

Heterogeneous target code generation

Generating a single binary executable for the application at compile time limits the application's portability and requires runtime and hardware systems to rediscover dependency information to make machine-specific scheduling decisions. The Delite compiler framework defers such decisions by generating kernels for each IR node in multiple target programming models as well as the DEG describing the dependencies among kernels. Currently supported targets are Scala, C++, and CUDA.

DEG and kernel generation. The Delite generator controls multiple target generators. It first schedules IR nodes to form kernels in the DEG, and then iterates over the list of available target generators to generate corresponding target code for each kernel. Because of the restrictions of the target hardware and programming model, it might not generate the kernel for all targets, but kernel generation will succeed as long as at least one target succeeds. Because each IR node has multiple viewpoints, kernels can be generated differently for each view. For example, a matrix addition kernel could be generated by the domain-specific view code generator written by a DSL developer, but also by the parallel view because the operation is implemented by extending the DeliteOpZipWith parallel IR. Because the Delite compiler framework provides parallel implementations for the parallel IR nodes (DeliteOps), DSL developers don't have to provide code generators for the DSL operations that extend one of them. When DSL developers already have an efficient implementation of the kernel (such as basic linear algebra subroutine [BLAS] libraries for matrix multiplication), they can generate calls to the external library using DeliteOpExternal.
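The "succeed if any target succeeds" policy can be sketched as follows; the generator trait, its method signature, and the error handling are assumptions made for this example rather than the framework's real interfaces.

import scala.util.Try

// Illustrative target generators; a real one would emit Scala, C++, or CUDA source.
trait TargetGenerator {
  def target: String
  def emitKernel(kernelId: String): String // may throw if the target can't express the kernel
}

object KernelGeneration {
  // Try every available generator and keep whichever variants succeed.
  // Generation of the kernel as a whole fails only if *all* targets fail.
  def generate(kernelId: String, gens: Seq[TargetGenerator]): Map[String, String] = {
    val variants = for {
      g <- gens
      code <- Try(g.emitKernel(kernelId)).toOption
    } yield g.target -> code
    require(variants.nonEmpty, s"no target could generate kernel $kernelId")
    variants.toMap
  }
}

The runtime can then choose among whichever variants were produced when it schedules the kernel.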

GPU code generation. GPU code generation requires additional work because the programming model has more constraints than the Scala and C++ targets. Memory allocation, for example, is a major issue. Because dynamic memory allocation within the kernel is either impossible or impractical for performance in GPU programming models, the Delite runtime preallocates all device memory allocations within the kernel before launching it. To achieve this, the CUDA generator collects the memory requirement information and passes it to the runtime through a metadata field in the DEG. In addition, because the GPU resides in a separate address space, input and output data transfer functions are generated so that the Delite runtime can manage data communication. The CUDA generator also produces kernel configuration information (the dimensionality of the thread blocks and the size of each dimension).

Variants. When multiple data-parallel operations are nested, various parallelization strategies exist. In a simple case, a DeliteOpMap operation within a DeliteOpMap can parallelize the outer loop, the inner loop, or both. Therefore, the Delite compiler framework generates a data-parallel operation in both a sequential version and a parallel version to provide flexible parallelization options when they're nested. This feature is especially useful for the CUDA target generator to improve the coverage of GPU kernels, because parallelizing the outer loop isn't always possible for GPUs due to the kernel's memory-allocation requirements. In those cases, the outer loop is serialized and only the inner loop is parallelized as a GPU kernel.

Target-specific optimizations. Whereas machine-independent optimizations are applied when building the IR, machine-specific optimizations are applied during code generation. For example, the memory-access patterns that allow better bandwidth utilization might not always be the same on the CPU and the GPU. Consider a data-parallel operation on each row of a matrix stored in a row-major format. For the CPU, where each core has a private cache, assigning each row to each core naturally exploits spatial cache locality and prevents false sharing. However, the GPU prefers the opposite access pattern, where each thread accesses each column, because the memory controller can coalesce requests from adjacent threads into a single transfer. Therefore, the CUDA generator emits code that uses a transposed matrix with inverted indices for efficient GPU execution. In addition, to exploit single-instruction, multiple-data (SIMD) units for data-parallel operations on the CPU, we generate source code that the target compiler can vectorize. It would also be straightforward to generate explicit SIMD instructions (such as streaming SIMD extensions and advanced vector extensions).
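The transposed-layout idea can be made concrete with the small sketch below, which computes the flat element index a generator might emit for a row-major CPU layout versus a transposed layout in which adjacent GPU threads (adjacent values of tid) touch adjacent memory locations. The function names and the flat-array model are assumptions for illustration only.

object RowAccessLayout {
  // CPU: thread `tid` owns row `tid` of a row-major matrix, so the elements it
  // touches are contiguous and stay in that core's private cache.
  def cpuIndex(tid: Int, col: Int, numCols: Int): Int = tid * numCols + col

  // GPU: store the matrix transposed, so at any fixed `col` step, adjacent
  // threads (tid, tid + 1, ...) read adjacent addresses and the accesses coalesce.
  def gpuIndexTransposed(tid: Int, col: Int, numRows: Int): Int = col * numRows + tid
}

object LayoutDemo extends App {
  val (numRows, numCols) = (4, 3)
  // Addresses touched by threads 0..3 at column 1 under each layout:
  println((0 until numRows).map(RowAccessLayout.cpuIndex(_, 1, numCols)))           // Vector(1, 4, 7, 10): strided
  println((0 until numRows).map(RowAccessLayout.gpuIndexTransposed(_, 1, numRows))) // Vector(4, 5, 6, 7): contiguous
}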

Executing embedded parallel DSLs

DSLs targeting heterogeneous parallelism require a runtime to manage application execution. This phase of execution includes generating a great deal of "plumbing" code focused on managing parallel execution on a specific parallel architecture. The implementation can be difficult to get right, both in terms of correctness and efficiency, but is common across DSLs. We therefore built a heterogeneous parallel runtime to provide these shared services for all Delite DSLs.

Scheduling the DEG

The Delite runtime combines the machine-agnostic DEG with the current machine's specifications (for example, the number of CPUs or GPUs) to schedule the application across the available hardware resources. It schedules the application before beginning execution using the static knowledge provided in the DEG. Because branch directions are still unknown, the Delite runtime generates a partial schedule for every straight-line path in the application and resolves how to execute those schedules during execution. The scheduling algorithm attempts to minimize communication among kernels by scheduling dependent kernels on the same hardware resource and bases device decisions on kernel and hardware availability. It schedules sequential kernels on a single resource while splitting data-parallel kernels selected for CPU execution to execute on multiple hardware resources simultaneously. Because the best strategy for parallelizing and synchronizing these data-parallel chunks isn't known until after scheduling, the runtime rather than the compiler framework is responsible for generating the decomposition. In the case of a reduce kernel, for example, the framework's code generator emits the reduction function and the runtime generates a tree-reduction implementation that's specialized to the number of processors selected to perform the reduction.
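A heavily simplified sketch of that locality-driven placement heuristic follows: each kernel goes to the resource that already holds the most of its inputs, restricted to resource kinds for which a kernel variant exists. The Kernel and Resource structures and the tie-breaking behavior are invented for this example and are much simpler than the real scheduler.

// Toy DEG node: a kernel, its dependencies, and the targets it was generated for.
case class Kernel(id: String, deps: Seq[String], targets: Set[String])
case class Resource(name: String, kind: String) // kind: "cpu" or "gpu"

object DegScheduler {
  // Greedy placement: prefer the resource already holding most of the kernel's
  // inputs (to avoid transfers), among resources of a kind the kernel supports.
  def schedule(kernels: Seq[Kernel], resources: Seq[Resource]): Map[String, Resource] =
    kernels.foldLeft(Map.empty[String, Resource]) { (placed, k) =>
      val usable = resources.filter(r => k.targets.contains(r.kind))
      val chosen = usable.maxBy(r => k.deps.count(d => placed.get(d).contains(r)))
      placed + (k.id -> chosen)
    }
}

object SchedulerDemo extends App {
  val kernels = Seq(
    Kernel("k1", Nil, Set("cpu", "gpu")),
    Kernel("k2", Seq("k1"), Set("cpu", "gpu")),
    Kernel("k3", Seq("k1", "k2"), Set("cpu"))) // no GPU variant was generated for k3
  val resources = Seq(Resource("cpu0", "cpu"), Resource("gpu0", "gpu"))
  DegScheduler.schedule(kernels, resources).foreach(println) // all three land on cpu0
}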

Generating execution plans for hardware resources

Dynamically dispatching kernels into a thread pool can have high overhead. However, the knowledge provided by the DEG and the application's static schedule is sufficient to generate an execution plan for each hardware resource and compile them to create executable files. Each executable file launches the kernels and performs the necessary synchronization for its resource according to the partial schedules. The combination of generating custom executable files for the chosen schedule and delaying the injection of synchronization code until after scheduling allows for multiple optimizations in the compiled schedule that minimize runtime overhead. For example, data that doesn't escape a given resource doesn't require any synchronization. In addition, the synchronization implementations are customized to the underlying memory model between the communicating resources. When shared memory is available, the implementation simply passes the necessary pointers, and when the resources reside in separate address spaces, it performs the necessary data transfers. Minimizing runtime overhead by eliminating unnecessary synchronization and removing the central kernel dispatch bottleneck lets applications scale with much less work per kernel.

Managing execution on heterogeneous parallel hardware

Executing on heterogeneous hardware is more challenging than executing on traditional uniprocessor or even multicore systems. The introduction of multiple address spaces requires expensive data transfers that should be minimized. The Delite runtime minimizes data transfers through the detailed kernel dependency information provided by the DEG. The graph specifies which inputs a kernel will simply read and which it will mutate. This information, combined with the schedule, lets the runtime determine at any time during the execution whether the version of an input data structure in a given address space is nonexistent, valid, or old.
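That per-address-space bookkeeping can be pictured as a small state table like the one below; the three states mirror the text (nonexistent, valid, old), while the update rules and types are simplifying assumptions for this example.

import scala.collection.mutable

// Validity of one data structure's copy in one address space.
sealed trait Validity
case object Nonexistent extends Validity // never transferred to that space
case object Valid extends Validity       // up to date, safe to read
case object Old extends Validity         // a copy exists but has been invalidated

class CopyTracker(spaces: Seq[String]) {
  private val state =
    mutable.Map.empty[(String, String), Validity].withDefaultValue(Nonexistent)

  // A kernel produced (or received) `data` in `space`: that copy is now valid.
  def markValid(data: String, space: String): Unit = state((data, space)) = Valid

  // A kernel mutated `data` in `space`: every other existing copy becomes stale.
  def markMutated(data: String, space: String): Unit = {
    markValid(data, space)
    for (s <- spaces if s != space && state((data, s)) != Nonexistent)
      state((data, s)) = Old
  }

  // A transfer is needed only when the reader's copy isn't valid.
  def needsTransfer(data: String, space: String): Boolean = state((data, space)) != Valid
}

object TrackerDemo extends App {
  val t = new CopyTracker(Seq("cpu", "gpu"))
  t.markValid("x", "cpu")
  println(t.needsTransfer("x", "gpu")) // true: the GPU copy is nonexistent
  t.markValid("x", "gpu")
  t.markMutated("x", "cpu")
  println(t.needsTransfer("x", "gpu")) // true again: the GPU copy is now old
}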

Managing memory allocations in each of these address spaces is also critical. The Delite runtime uses the Java Virtual Machine to automatically manage memory for all CPU kernels, but GPUs have no such facilities. In addition, all memory used by a GPU kernel must be allocated before launching the kernel. To deal with these issues, the Delite runtime preallocates all the data structures for a given GPU kernel by using the allocation information supplied by the framework's GPU code generator. The runtime also performs liveness analysis using the schedule to determine the earliest point at which the GPU no longer needs each kernel's inputs and outputs. By default, the GPU host thread attempts to run ahead asynchronously as much as possible, but when this creates memory pressure, it uses the liveness information to wait until enough data becomes dead, free it, and continue executing.

Experiments

We evaluated a set of machine-learning applications written in OptiML. We used two quad-core Xeon 2.67-GHz processors with 24 Gbytes of memory and an Nvidia Tesla C2050 GPU for the performance analysis. We didn't include the initialization phase, including input data reading, in the execution time. We report the average of the last five executions.

For the performance comparison with OptiML, we implemented the applications in three other ways: sequential C++ with a library, parallel Matlab for multicore CPUs, and Matlab for the GPU. Matlab is the most widely used programming model in the machine-learning community. Moreover, its performance is often competitive with C++ for machine-learning kernels due to its efficient implementation of linear algebra operations (such as BLAS). We made a reasonable effort in optimizing and parallelizing the code and selecting the most efficient implementation among the libraries. However, this was more challenging than we expected. We couldn't predict which optimization strategy (for example, vectorization, or the parallel construct parfor in Matlab) would perform best, because many factors, such as the number of cores on the system, affect performance. In addition, the best optimization strategy depends on the particular use case (for example, the matrix's size or the amount of work in the operation). Requiring the application to specify these low-level implementation details often results in multiple versions of the code and makes porting to new devices difficult. OptiML does not suffer from this issue because the operations are implicitly parallel and the optimizations are applied automatically in the Delite compiler framework and runtime rather than manually in the source code.

We evaluated the C++ implementations using the Armadillo linear algebra library [11] to show that the OptiML single-core baseline performs comparably to the C++ library-based implementation. Next, we compared the OptiML performance results on multicore CPUs and the GPU against the Matlab implementations. Figure 4 shows that OptiML outperforms Matlab in nearly all applications, both in absolute execution time and in scalability. This is mainly because Delite offers efficient code generation, while Matlab has interpretive overhead. Naïve Bayes, K-Means, and linear regression especially benefit from our ability to generate user-specified functions into specialized GPU kernels, which yields superior performance compared to Matlab's GPU support.

[Figure 4. Execution time of the machine-learning applications implemented in OptiML, Matlab CPU and GPU, and C++: Gaussian Discriminant Analysis (GDA) (a), naïve Bayes (b), K-Means (c), support vector machine (d), linear regression (e), and Restricted Boltzmann Machine (RBM) (f). Execution times are normalized to the OptiML single-CPU case, and speedup numbers are reported at the top of each bar.]

When we compare OptiML performance on multicore CPUs and the GPU, we see that Gaussian Discriminant Analysis (GDA) and Restricted Boltzmann Machine (RBM) show better performance on the GPU. This is because those applications consist of data-parallel operations with regular memory-access patterns, which is a good fit for GPUs. The other four applications don't perform well on the GPU because the initial memory transfer to the GPU takes too much time (naïve Bayes) or because of frequent communication between the CPU and GPU. From this result, we conclude that neither the CPU nor the GPU is always the optimal solution, and therefore we need a hybrid approach with enough flexibility, as in the Delite runtime. In addition, OptiML applications don't need to be changed at all to run on different targets, whereas the other implementations require source code modifications and optimization effort to run reasonably well on each target.

Figure 5 shows the performance impact of static optimizations on a downsampling application. The C++ implementation is hand-optimized with manual operation fusing. Without the Delite compiler framework's static optimizations, OptiML is 3 times slower than C++. However, with the fusing optimization, which combines multiple iterations into a single pass, and the code motion of hoisting operations out of the loops, OptiML obtains slightly better performance than C++.

[Figure 5. Impact of static optimizations on a downsampling application. Fusing and code motion optimizations of the Delite compiler framework significantly improve OptiML performance, even beyond manually optimized C++ code. Speedup numbers are reported on top of each bar.]

Using the Delite compiler framework, we're currently implementing DSLs for other domains, including graph analysis, database querying, and mesh-based solvers. They all share the framework infrastructure for the IR optimizations and heterogeneous target code generation, demonstrating our system's effectiveness in building implicitly parallel DSLs that target heterogeneous systems. However, the current Delite framework isn't well suited for expressing all possible DSLs. For example, highly dynamic languages that modify classes or dispatch methods at runtime would be difficult to implement, given that Delite promotes a functional rather than object-oriented design. In addition, Delite doesn't support DSLs whose front ends can't be fully embedded in Scala. This isn't a fundamental limitation, however, and we expect to address it with further development.

Acknowledgments

We thank the anonymous reviewers for their valuable feedback. This research was sponsored by US Army contract AHPCRC W911NF-07-2-0027-1; DARPA contract Oracle order US1032821; the European Research Council (ERC) under grant 587327 "DOPPLER"; the Stanford Pervasive Parallelism Lab affiliates program (Nvidia, Oracle/Sun, AMD, NEC, and Intel); and the SGF, SOE, and KFAS fellowship programs.

References

1. J. Held, J. Bautista, and S. Koehl, eds., "From a Few Cores to Many: A Tera-Scale Computing Research Review," white paper, Intel, 2006.
2. "The Industry-Changing Impact of Accelerated Computing," white paper, Advanced Micro Devices, 2008.
3. J. Nickolls and W.J. Dally, "The GPU Computing Era," IEEE Micro, vol. 30, no. 2, 2010, pp. 56-69.


4. P. Hudak, "Building Domain-Specific Embedded Languages," ACM Computing Surveys, vol. 28, no. 1, 1996, p. 196.
5. A. van Deursen, P. Klint, and J. Visser, "Domain-Specific Languages: An Annotated Bibliography," ACM SIGPLAN Notices, vol. 35, no. 6, 2000, pp. 26-36.
6. J. Peterson, P. Hudak, and C. Elliott, "Lambda in Motion: Controlling Robots with Haskell," Proc. Practical Aspects of Declarative Languages (PADL 99), Springer-Verlag, 1998, pp. 91-105.
7. D. Bruce, "What Makes a Good Domain-Specific Language? APOSTLE, and Its Approach to Parallel Discrete Event Simulation," Proc. 1st ACM SIGPLAN Workshop Domain-Specific Languages (DSL 97), ACM Press, 1997, pp. 17-35.
8. H. Chafi et al., "Language Virtualization for Heterogeneous Parallel Computing," Proc. ACM Int'l Conf. Object Oriented Programming Systems Languages and Applications, ACM Press, 2010, pp. 835-847.
9. A.K. Sujeeth et al., "OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning," Proc. 28th Int'l Conf. Machine Learning (ICML 11), ACM Press, 2011, pp. 609-616.
10. T. Rompf and M. Odersky, "Lightweight Modular Staging: A Pragmatic Approach to Runtime Code Generation and Compiled DSLs," Proc. 9th Int'l Conf. Generative Programming and Component Engineering (GPCE 10), ACM Press, 2010, pp. 127-136.
11. C. Sanderson, Armadillo: An Open Source C++ Linear Algebra Library for Fast Prototyping and Computationally Intensive Experiments, tech. report, NICTA, 2006; www.nicta.com.au/pub?id=4314.

HyoukJoong Lee is a PhD candidate in electrical engineering at Stanford University. His research interests include parallel computer architecture and general-purpose GPU computing with their programming models. Lee has an MS in electrical engineering from Stanford University.

Kevin J. Brown is a PhD candidate in electrical engineering at Stanford University. His research interests include parallel computer architecture and heterogeneous runtime systems. Brown has an MS in electrical engineering from Stanford University.

Arvind K. Sujeeth is a PhD candidate in electrical engineering at Stanford University. His research interests include parallel programming models and domain-specific languages for machine learning. Sujeeth has an MS in electrical engineering from Stanford University.

Hassan Chafi is a PhD candidate in electrical engineering at Stanford University. His research interests include parallel programming models, domain-specific languages, and frameworks for simplifying programming heterogeneous systems. Chafi has an MS in electrical engineering from Stanford University.

Kunle Olukotun is a professor of electrical engineering and computer science at Stanford University and director of the Stanford Pervasive Parallelism Lab. His research interests include computer architecture and parallel programming environments. Olukotun has a PhD in computer engineering from the University of Michigan. He's a fellow of IEEE and the ACM.

Tiark Rompf is a PhD candidate at École Polytechnique Fédérale de Lausanne. His research interests include programming languages (in particular Scala), parallelism, and compilers, with a focus on lowering the cost of very high level programming abstractions. Rompf has an MS in computer science from the University of Lübeck, Germany.

Martin Odersky is a professor in the school of computer and communication sciences at École Polytechnique Fédérale de Lausanne. His research interests cover fundamental as well as applied aspects of programming languages, including semantics, type systems, programming language design, and compiler construction. Odersky has a PhD in computer science from ETH Zürich. He's a fellow of the ACM.

Direct questions and comments about this article to HyoukJoong Lee, Gates Building, Room 454, Stanford, CA 94305; [email protected].
