Resource Elastic Virtualization for FPGAs using OpenCL

Anuj Vaishnav, Khoa Dang Pham, Dirk Koch and James Garside
School of Computer Science, The University of Manchester, Manchester, UK

Email: {anuj.vaishnav, khoa.pham, dirk.koch, james.garside}@manchester.ac.uk

Abstract—FPGAs are rising in popularity for acceleration in all kinds of systems. However, even in cloud environments, FPGA devices are typically still used exclusively by one application only. To overcome this, and as an approach to manage FPGA resources with OS functionality, this paper introduces the concept of resource elastic virtualization which allows shrinking and growing of accelerators in the spatial domain with the help of partial reconfiguration. With this, we can serve multiple applications simultaneously on the same FPGA and optimize the resource utilization and consequently the overall system performance. We demonstrate how an implementation of resource elasticity can be realized for OpenCL accelerators along with how it can achieve 2.3x better FPGA utilization and 49% better performance on average while simultaneously lowering waiting time for tasks.

I. INTRODUCTION

Modern SRAM-based FPGAs provide device capacities well beyond a million LUTs which, in many cases, is enough to perform multiple tasks in parallel. Furthermore, High-Level Synthesis (HLS) has become mature enough to allow widespread deployment of FPGAs for general purpose acceleration, and recent cloud service installations such as Microsoft Azure, Amazon F1 Instances, and Alibaba Cloud underline this development. However, while these developments are very impressive, FPGA run-time management and hardware virtualization techniques have not progressed at the same pace.

In this paper, we introduce and evaluate the concepts behind resource elastic FPGA virtualization which manages FPGA resources in the spatial domain (i.e. how many resources to allocate for a hardware task) as well as in the time domain (i.e. when to occupy a certain FPGA region and for how long). The important novelty of this paper is that applications can readjust resources transparently to the calling application, to maximize FPGA utilization and consequently to optimize overall performance.

This paper assumes that the FPGA provides its resources in slots as shown in Figure 1. In the figure, and for the rest of the paper, we use one-dimensional tiling of the FPGA resources into adjacent slots. However, the methodology proposed in this paper could also be applied to systems that tile resources in two dimensions, if needed.

In our approach to virtualization, modules may occupy one or more adjacent slots. Modules may have more than one physical implementation, called an implementation alternative, for using different numbers of resources (e.g., for trading resources for throughput). We also assume that the operation of a long acceleration job can be preempted and that the time to reach the next possible preemption point is known or at least bounded.

In the software world, context switching between tasks for virtualization can be performed almost instantaneously. In the FPGA case, however, the configuration time of hardware modules is very expensive (typically in the range of many milliseconds).

Moreover, extra time is needed to deal with the internal state of hardware tasks (which is mainly defined by internal flip-flop values and other memory elements). Therefore, it often makes sense to continue operation before changing the resource allocation for the corresponding hardware task if this saves time/effort to deal with the internal state of the hardware module. For example, if we want to preempt a hardware task that implements a two-dimensional convolution window (as used for several image filters and for Convolutional Neural Network (CNN) classifiers), then it makes sense to only preempt the module at the end of a frame. In this case, the relatively large line buffers that normally keep a history for the convolution window contain no state information needing preservation and, consequently, can be discarded.

In most cases, it is relatively easy to define such dedicated preemption points. In the case of OpenCL (the de-facto standard for describing accelerators in a high-level manner) this is directly dictated by the programming model, where a larger compute problem is decomposed into small chunks (called work-groups). Here each work-group follows a run-to-completion model, where a completed work-group (and the corresponding hardware module) does not contain any internal state that needs to be considered further. In this paper, we will introduce and demonstrate resource elastic virtualization following OpenCL (see Section IV). However, the concept of resource elasticity can be applied in a much broader context and there are further approaches that implement the idea of preemption points where the state is minimal. In [1], the concept is automated within an HLS compilation framework where a compiler performs live variable analysis to identify operational states with a minimum of memory elements to store.

In a nutshell, assuming a system analogous to the one shown in Figure 1 and a set of somehow preemptable modules, resource elastic virtualization aims at maximizing the FPGA resource allocation transparently to the application which calls the hardware acceleration. This means that if a newly arriving task needs FPGA resources, we have to somehow shrink the running module layout to accommodate this task and, if a task terminates, we will expand the remaining tasks such that the FPGA is kept highly utilized for delivering overall better performance. From an application's point of view, only a generic accelerator call is exposed and the entire resource allocation and hardware accelerator management (including FPGA configuration) will be carried out by an FPGA virtualization layer.
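To make the shrink-on-arrival / expand-on-termination behaviour concrete, the following is a minimal Python sketch assuming an even redistribution policy; the function name and the policy itself are illustrative, not the scheduler proposed in this paper.

```python
# Hypothetical even-redistribution policy: split a fixed pool of slots
# among the currently active tasks (assuming len(tasks) <= slots, so
# every task gets at least one slot).
def allocate(slots, tasks):
    if not tasks:
        return {}
    base, extra = divmod(slots, len(tasks))
    return {t: base + (1 if i < extra else 0) for i, t in enumerate(tasks)}

allocate(4, ["A"])            # A runs alone and gets all 4 slots
allocate(4, ["A", "B"])       # B arrives: A shrinks -> {"A": 2, "B": 2}
allocate(4, ["A", "B", "C"])  # C arrives: {"A": 2, "B": 1, "C": 1}
```

When a task terminates, calling the same function with the reduced task set expands the survivors again, which is the essence of keeping the device fully utilized.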

The contributions of this paper include:
• Concepts of resource elastic FPGA virtualization (Section III)
• Implementing FPGA virtualization using OpenCL for resource elasticity (Section IV)
• Evaluation of resource elastic schedulers using simulation (Section V)
• Case study on resource elastic schedulers (Section VI).

Fig. 1: Physical FPGA layout featuring four partial reconfiguration regions (depicted as module slots). Hardware modules may take one or more adjacent slots and the communication is provided through a hardware operating system (OS) infrastructure.

II. RELATED WORK

FPGA resource virtualization has been examined several times before [2] and the different approaches can be classified in three major directions: overlays, preemptive module execution and virtual FPGAs, as described below.

A. Overlays

Overlays or intermediate fabrics [3] provide a layer of programmability that abstracts the low-level details of an FPGA. This approach is commonly used to enhance design productivity by providing a software-like compilation flow rather than a hardware CAD tool flow. Because of this abstraction, overlays can also be used for virtualization.

One direction of virtualization using overlays is targeting portability across different FPGAs (and FPGA vendors) and has been shown for coarse-grained ALU-based arrays (CGRAs) [4], [5] as well as for fine-grained LUT-based overlays [6] or even hybrid systems [7]. This method of virtualization shares some ideas of a Java virtual machine where the same bytecode can be executed on different hardware targets.

Another way to use overlays for virtualization is to apply context switching techniques on the overlay. In its simplest fashion, a softcore processor running an OS would implement this technique. However, the extra layer of programmability of an overlay comes at a substantial cost in performance.

B. Preemptive Hardware Multitasking

The idea of preemptive hardware multitasking is adopted from software systems which virtualize CPUs in the time domain, where a scheduler allocates CPU time slots to a set of tasks. Similarly, in the FPGA case, systems were built that can stop a running module on the FPGA, capture its state and use the FPGA resources for another hardware task. The preempted task may then be resumed later in a fully transparent manner.

Unfortunately, the state of a hardware module is difficult to capture in the general case, and existing solutions: 1) require restrictions on the preemptive hardware modules, 2) are very target-FPGA specific, and 3) are costly in time (e.g., when using configuration readback techniques [8]–[10]) and/or in resources (e.g., when state access is implemented inside the module's user logic [11]). Hardware design techniques including multi-cycle paths, multi-cycle I/O transactions, pipeline registers in primitives such as multipliers and memory blocks, latches, or optimizations such as retiming are not sufficiently applicable (if at all) to preemptable hardware, hence making this approach impractical for many real-world applications.

C. Virtual FPGAs

The idea of virtual FPGAs is to divide a large FPGA into smaller logical units that can be allocated at run-time. While some resource allocators allow picking a variable number of virtual FPGAs with respect to the currently available resources, the resource allocation is commonly not changed until the allocated resources are released upon task completion [12].

In summary, overlays and preemptive hardware multitasking commonly incorporate too much overhead, and virtual FPGAs do not provide enough flexibility to support a holistic solution to FPGA virtualization. Moreover, FPGA virtualization in the time domain alone is, for most FPGA virtualization systems, a pitfall because it omits the spatial programming model which is commonly used along with FPGAs. In other words, we believe that an FPGA should ideally first be virtualized in the space domain and (only if needed) in the time domain.

III. SPACE-TIME VIRTUALIZATION: RESOURCE-ELASTIC HARDWARE

Consider a mostly compute-bound system where some randomly arriving and terminating tasks run in parallel. To understand virtualization, let us start from a software scenario where the system would run on a single CPU. In this case, a scheduler would allocate time slots to the currently running tasks following a scheduling policy that keeps the CPU utilization high with useful work and that implements some means of fairness (or quality of service).

In contrast to this, when using an FPGA, we should keep device resource utilization high with useful work. Because tasks with different resource requirements may arrive and terminate over time, this implies that modules are fractionable in resource slots that can be allocated by a run-time system. In other words, for virtualizing FPGAs, it is desirable to have a certain level of resource elasticity in the system which we define as:

Hardware Resource Elasticity: the capability of a hardware accelerator to change its resource allocation transparently to the task that is using it. Resource elasticity allows for trading resources for throughput at run-time.

To support resource elasticity, a scheduler has to perform space domain multiplexing (SDM) by considering various implementation alternatives and/or a variable number of accelerator instances. To demonstrate how this can change utilization and performance, let us assume the scenario illustrated in Figure 2. Because of the ability to change the allocation and to choose the best implementation on an event, we can see that a resource elastic scheduler would allow maximizing the utilization compared to conventional scheduling which only considers fixed-size accelerators. With event we refer to a time when scheduling and resource allocation decisions may be taken. In this paper, events include: 1) arrival of a new task, 2) completion of a task and 3) reaching a preemption point, as shown in Figure 2. That example reveals that resource elasticity includes some reconfiguration overhead but results in better resource utilization and consequently in faster execution (e.g. Task B) and/or higher throughput (Task A).

A resource elastic scheduler must perform three essential trade-offs which are commonly not considered by normal time domain schedulers:

1) Multiple instances vs Different sized modules


Fig. 2: Resource allocation for kernels (tasks) A, B, C and D over time when using a) Normal fixed module scheduling and b) Resource Elastic scheduling on a 4 slot FPGA. The circled events highlight cases where resources are needed to accommodate newly arriving tasks (①, ②, ③) or cases where tasks complete (④, ⑤).

Fig. 3: Logical slot configuration example where a) shows using multiple instances of a single slot module, while b) shows using a different sized module, which may have super-linear speed-up compared to a single slot module.

2) Run to completion vs Changing module layout
3) Collocated change vs Distributed change

Trade-off 1 arises at run-time when a module can utilize additional slots for better performance. However, the penalty of pausing the currently running module and performing partial reconfiguration to change to a different sized module may not be the best option given the work remaining and the possible speed-up achievable with the different sized module. Figure 3 shows the possible module layout alternatives which a space domain scheduler would have to choose from, depending on the aim.

Trade-off 2 considers the case when the work left for the kernel is not enough to amortize the partial reconfiguration overhead. This holds regardless of whether we use reconfiguration for shrinking and expanding modules or for defragmenting the module layout. In this case, a decision must be made on whether changing the module layout to achieve better system-level performance, at the cost of sacrificing performance for certain modules, is beneficial or not.
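A back-of-envelope form of this amortization check can be sketched as follows; the function and the numbers are purely illustrative, not taken from the paper's scheduler.

```python
# Trade-off 2 as an amortization test: switch only if the time saved by
# the larger module outweighs the partial reconfiguration (PR) latency.
def worth_switching(remaining, speedup, pr_cost):
    # remaining: time left on the current module; speedup: factor of the
    # larger module; pr_cost: PR latency (all in the same time unit).
    return remaining / speedup + pr_cost < remaining

worth_switching(100.0, 2.0, 20.0)  # True:  50 + 20 < 100
worth_switching(30.0, 2.0, 20.0)   # False: 15 + 20 >= 30
```

The same inequality applies whether the reconfiguration serves expansion, shrinking or defragmentation: only the remaining work and the achievable speed-up change.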

Moreover, the decision of selecting a multi-instance option over a different module size may also depend on the location of available free slots, which is expressed by Trade-off 3. For example, consider the scenario shown in Figure 4 where the only available resource slot is at one end of the FPGA. In this scenario, the possible options for maximizing resource utilization for Task A are to either replicate A or to defragment the FPGA such that available slots are collocated to use a larger implementation alternative for A.

Note that the scenarios used in the above description of the trade-offs are relatively simple as the focus is just on kernel A. The complexity of the possible situations increases as we consider multiple distinct kernels, where it may be necessary to sacrifice more performance for a particular kernel to achieve higher overall performance at system level. Further, the aim of the scheduler may not be performance but fairness or energy etc., which would change the policies.

Similar decisions have to be taken by traditional operating systems, and different objectives result in different policies (e.g., an embedded real-time operating system follows rather different objectives than a Linux OS). From a distance, Figure 1 gives the impression that resource elastic FPGA virtualization is similar to multicore scheduling as done by many

Fig. 4: Example scenario where there are two possible alternatives: 1) to replicate A or 2) to perform defragmentation to use a different sized module for A.

software operating systems. However, FPGA virtualization has to consider several very different aspects, such as that modules may occupy multiple adjacent resources (slots) simultaneously (causing fragmentation issues), that modules may have implementation alternatives that may work on different problem sizes, and that context switching is significantly more expensive.

Moreover, if the available slots are not sufficient to accommodate all tasks even when selecting the smallest implementation alternatives, our virtualization approach uses time division multiplexing (TDM) as a kind of fallback approach. Analogous to a virtual memory subsystem using disk swapping to provide more than the physically available RAM at a performance penalty, TDM transparently allows the provision of more FPGA slots at some reconfiguration penalty.
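The TDM fallback can be illustrated as a rotating queue; this is a hypothetical simplification (it assumes every module fits the slot capacity on its own), not the Round Robin implementation of Section IV-B.

```python
from collections import deque

# One TDM turn: configure tasks from the front of the queue until the
# slot capacity is reached, then re-queue them behind the waiting tasks
# so that every task eventually gets a turn (Round Robin).
def tdm_step(queue, capacity, size):
    onboard, used = [], 0
    for _ in range(len(queue)):
        task = queue[0]
        if used + size[task] > capacity:
            break
        queue.popleft()
        onboard.append(task)
        used += size[task]
    queue.extend(onboard)  # go to the back for the next turn
    return onboard

q = deque(["A", "B", "C"])                # three 1-slot tasks, 2 slots free
tdm_step(q, 2, {"A": 1, "B": 1, "C": 1})  # -> ["A", "B"]; next turn C goes first
```

Each call corresponds to one swap of the module layout, which is exactly where the reconfiguration penalty of the disk-swapping analogy is paid.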

In this section, we highlighted the specific issues of resource elastic FPGA virtualization; in the following section, we will provide a concrete implementation. In this paper, we introduce space-time virtualization for one FPGA in a single operating system scenario. By moving the virtualization management layer from the OS into a hypervisor, the mechanisms proposed here could be used for virtualizing entire operating systems.

IV. FPGA VIRTUALIZATION FOR OPENCL

We selected the OpenCL execution model [13] for our case study as it is an industry standard for High-Level Synthesis (HLS) and heterogeneous computing in general. It is described in the following sub-section, while the design details of our run-time resource manager for performing resource elasticity will be revealed in Section IV-B.

A. Execution Model

An OpenCL application is made up of a host program and kernels. A host program is responsible for submitting computation commands and manipulating memory objects while executing on the host machine. Kernels are the compute-heavy functions and are executed on accelerators which are also called compute units. When a host program issues a command for kernel execution, an abstract index space is defined, known as NDRange. A kernel is executed exactly one time for each point in the NDRange index and this unit of work is called a work-item. There is no execution order defined by the OpenCL standard for work-items. However, synchronization can be achieved with barriers for local memory. Work-items are collated into work-groups and executed on compute units altogether. This provides a coarse-grained decomposition of an NDRange and allows work-items to share and synchronize data inside a work-group. Note that there is no order of execution or sharing of data defined by the OpenCL standard among work-groups. This property allows context switches at work-group granularity since work-groups only share global memory without any synchronization restrictions and because all the results are written back to global memory outside the accelerators at the end of their execution. Further, the independence of work-groups allows launching multiple


Fig. 5: Design flow for the OpenCL kernel development (design-time) and execution (run-time) with resource elasticity.

instances of the same accelerator for scheduling different work-groups concurrently.
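The decomposition described above can be mimicked in plain Python; this is an illustration of the index-space semantics only, not an OpenCL API call, and the helper name is made up.

```python
# Pure-Python illustration of the NDRange decomposition: a 1-D index
# space split into independent work-groups of work-items.
def work_groups(ndrange, wg_size):
    return [list(range(start, min(start + wg_size, ndrange)))
            for start in range(0, ndrange, wg_size)]

work_groups(10, 4)  # -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
# Each group runs to completion and shares no state with the others, so
# two groups could execute on two instances of the same kernel at once,
# and a context switch is safe at any group boundary.
```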

Furthermore, the implementation alternatives for OpenCL can be synthesized by using Vivado HLS (or another HLS tool) with different optimization pragmas to implement area vs performance trade-offs. For example, a larger core for matrix multiplication may benefit from better reuse of operands and from more logic for arithmetic. This allows a resource elastic system to benefit from more FPGA resources at run-time.

The complete design flow used for the development and execution of the OpenCL kernel with resource elasticity is shown in Figure 5. The input required from the programmer is an OpenCL kernel, optionally with different optimization levels, and host code. From the user viewpoint, the only difference compared to the standard Xilinx OpenCL flow for FPGAs [14] is the generation of different kernel versions, which can be performed easily by selecting the relevant HLS optimization options.

Note that OpenCL could be swapped with any other High-Level Language supported by HLS tools, given similar execution semantics, with minor modifications. Moreover, it is not a requirement that each module is implemented with different resource/performance variants as we also allow implementing resource elasticity by instantiating an accelerator multiple times. Our resource manager (see Section IV-B) can arbitrarily handle multiple accelerator instances of different size.

B. Resource Manager Design

We have implemented a resource manager that consists of four different components: Waiting Queue, Scheduler, PR Manager and Data Manager, as shown in Figure 6. The Waiting Queue keeps track of kernels waiting for execution and contributes to the implementation of a Round Robin policy for TDM, which is used if the demand for accelerators exceeds the available resources (typically given in terms of slots). The Scheduler performs Space Domain Multiplexing (SDM), i.e. it decides which modules to shrink or to grow and in what manner, given the meta-data (profiling information and available bitstreams) of kernels. Further, the Scheduler is also responsible for identifying which module should be replaced when performing TDM. The PR Manager performs the partial reconfiguration requests issued by the Scheduler, while the Data Manager keeps track of the work-group execution for each kernel; it is also responsible for programming the accelerators for the next work-group. Note that the whole flow is only triggered when an execution command is issued from a host program or when a kernel finishes its work-group execution.

This model shares some ideas of a cooperative operating system that can guarantee real-time behaviour in the absence of a

Fig. 6: Resource manager architecture for virtualizing the resource footprint of the OpenCL kernels at run-time.

global (interrupt) timer. Because the execution of a work-group requires that all data is available and because the execution time of work-items is commonly not data dependent (best practice), an upper execution time can be defined, which allows the application of resource elasticity under real-time constraints.

Upon the scheduler wake-up call, the resource manager retrieves data from all accelerators which have completed their work-group execution and goes through the following stages:

1) Kernel Selection: At first, the resource manager selects the set of kernels which need to be considered for the allocation of resources. This mainly depends on three cases based on the waiting queue size and the FPGA state. i) When the FPGA is empty and the waiting queue is non-empty, it extracts from the queue the maximum number of kernels which can be run on the FPGA concurrently, based on their minimum size modules. ii) When the FPGA is non-empty but the waiting queue is empty, the selected kernels are the kernels currently running on the FPGA. iii) When the waiting queue and the FPGA are both non-empty, it performs Round Robin scheduling in the time domain by closing all the kernels (removing them from the FPGA) which have completed their work-group execution and inserting them back into the waiting queue. After this, it shrinks the currently running kernels by employing a heuristic for recovering the maximum possible number of slots. We implemented the heuristic to be fair to all the waiting kernels by allocating resources to them as soon as possible. It is possible to replace this heuristic with another to cater for other scheduling aims such as performance or energy. Note that at this stage of scheduling we may have kernels which have one or more accelerators currently running and which cannot be moved around. In this case, we select those kernels plus the maximum number of kernels from the queue which can be executed together, given the free slots available.
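The three selection cases can be sketched as below; this is a deliberately simplified, hypothetical version that ignores the shrinking heuristic and treats the waiting queue as a plain list.

```python
# Simplified sketch of the three kernel-selection cases; min_size maps a
# kernel to its smallest implementation alternative (in slots).
def select_kernels(running, waiting, free_slots, min_size):
    def fill(chosen, budget):
        used = 0
        for k in waiting:
            if used + min_size[k] <= budget:
                chosen.append(k)
                used += min_size[k]
        return chosen

    if not running and waiting:   # i) empty FPGA: pull as many as fit
        return fill([], free_slots)
    if running and not waiting:   # ii) nothing waiting: keep running set
        return list(running)
    # iii) both non-empty: running kernels cannot relocate, so keep them
    # and add waiting kernels that still fit into the free slots.
    return fill(list(running), free_slots)

select_kernels([], ["A", "B", "C"], 4, {"A": 1, "B": 2, "C": 1})
# -> ["A", "B", "C"]  (1 + 2 + 1 slots fill the empty 4-slot FPGA)
```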

2) Generation of Layout Options: Once the kernels are selected, we compute the module layout (i.e. the exact position and size for each accelerator) for the selected kernels. This problem has a large search space as each kernel may have different implementation alternatives available (i.e. different accelerator sizes) which may also be replicated, hence increasing the number of possible combinations. However, there are three constraints which substantially prune the search space:

i. Availability of only certain sized modules rather than all implementation alternatives (e.g., a kernel may only have modules which take at least 2 slots, making it infeasible to be used when only one free slot is available).

ii. Run-time constraints: executing kernels cannot relocate, which may constrain the available free space.

iii. Over-allocation, i.e. we cannot allocate more slots than the FPGA can provide.


Fig. 7: Logical slot module layout example where a) is completely fair but wastes one of the slots, while b) is not absolutely fair but presents an acceptable trade-off with better utilization.

Currently, we perform this computation by enumerating all possible combinations and removing infeasible module layouts based on the three constraints. Note, since the number of slots on an FPGA tends to be small, the enumeration space will be small as well. However, one may possibly employ heuristics for better run-time at the risk of losing the optimal solution.
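Such a pruned enumeration might look as follows; this is an illustrative sketch (names are made up) that encodes constraint i through the per-kernel size lists, checks constraint iii explicitly, and reflects constraint ii by passing only the slots not pinned down by non-relocatable running kernels.

```python
from itertools import product

# Enumerate one implementation size per selected kernel; sizes[k] lists
# the module sizes (in slots) that actually exist for kernel k, and any
# combination exceeding the free slot budget is dropped.
def feasible_layouts(kernels, sizes, free_slots):
    for combo in product(*(sizes[k] for k in kernels)):
        if sum(combo) <= free_slots:
            yield dict(zip(kernels, combo))

list(feasible_layouts(["A", "B"], {"A": [1, 2], "B": [2]}, 3))
# -> [{'A': 1, 'B': 2}]  (A's 2-slot variant would over-allocate)
```

Because the slot count is small, the Cartesian product stays tiny in practice, matching the observation above that full enumeration is affordable.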

3) Resource Allocation: After layout generation, the resource allocator evaluates module layouts based on fairness with an adapted version of the Jain index [15] which accounts for the utilization of slots. Note that with fairness we refer to the resource allocation for a set of kernels only; currently, we do not take the time domain into consideration for fairness. The adapted version of the Jain index is given by Equation 1, where x is the set of kernels in a module layout, n is the number of kernels and x_k denotes the slot allocation for kernel k in the module layout. We added the term u(x), which states the number of slots utilized by the module layout and which can be calculated by Equation 2. We need to consider utilization (u(x)) for fairness because absolute fairness may come at the cost of resource wastage. Consider the example shown in Figure 7: if we were to choose module layout a) we would be completely fair in the allocation between kernels by giving each kernel a single slot; however, we would be wasting one of the available slots. If, instead, we choose module layout b) we would not be completely fair but would have an acceptable trade-off for maximizing resource utilization and performance.

The module layout with the maximum score is selected by the resource allocator, which is formalized by Equation 3, where r(x, n) is the fairness score of the module layout x and Lf is the set of all feasible module layouts.

$$r(x, n) = \frac{\left(\sum_{k=1}^{n} x_k\right)^2}{n \times \sum_{k=1}^{n} x_k^2} + u(x) \quad (1)$$

$$u(x) = \sum_{i=1}^{n} x_i \quad (2)$$

$$a(x) = \arg\max \{\, r(x, n) \mid x \in L_f \,\} \quad (3)$$
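To make the allocation step concrete, the rating and selection of Equations 1–3 can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation; the function names and the example kernel bounds are our own.

```python
from itertools import product

def jain_rating(x):
    """Rate a module layout x (slots per kernel) with the adapted
    Jain index: the classic fairness term plus u(x), the number of
    slots the layout utilizes (Equations 1 and 2)."""
    n = len(x)
    u = sum(x)                                        # Eq. 2: slots utilized
    fair = sum(x) ** 2 / (n * sum(v * v for v in x))  # classic Jain index
    return fair + u                                   # Eq. 1

def feasible_layouts(min_slots, max_slots, total_slots):
    """Enumerate all slot allocations respecting each kernel's
    minimum/maximum module size and the device capacity."""
    ranges = [range(lo, hi + 1) for lo, hi in zip(min_slots, max_slots)]
    return [x for x in product(*ranges) if sum(x) <= total_slots]

def allocate(min_slots, max_slots, total_slots):
    """Eq. 3: select the feasible layout with the maximum rating."""
    return max(feasible_layouts(min_slots, max_slots, total_slots),
               key=jain_rating)

# Two kernels, each fitting in 1 or 2 slots, on a 3-slot FPGA:
# layout (1, 2) beats the perfectly fair (1, 1) because it wastes
# no slot, mirroring the trade-off of Figure 7.
print(allocate([1, 1], [2, 2], 3))  # → (1, 2)
```

As in the text, the enumeration space stays small for realistic slot counts, so brute force is adequate here.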

4) Resource Binding: after allocation of resources for each kernel, the resource binder identifies the best possible module and number of instances for it in terms of performance, considering the resource constraints, based on the ranking function. We project the performance possible for each implementation alternative by calculating the time to completion of each kernel based on Equation 4, where xk is the number of slots allocated to kernel k, Tc(m) is the time to completion of module m, P(m) is the time taken by partial reconfiguration for module m, and L_f^{x_k} is the possible combination of modules and number of instances for kernel k with a resource constraint of xk. The time to completion for a given module can be calculated using information given by the HLS tool, profiling and/or annotated by the programmer as meta-data. For instance, OpenCL kernel synthesis can derive the work-group execution time (Tw(m, k)) while a programmer can highlight the work-group size (Ws). Here, the completion time is estimated based on Equation 5, where Wl(k) is the number of work-items left for the execution of kernel k and is known by the Data Manager. The partial reconfiguration cost is specific to the FPGA used and generally scales linearly with the number of slots. It can be modelled by Equation 6, where P(m) is the total latency for partial reconfiguration of module m, Ps is the configuration bitstream size of a single resource slot, Nm is the number of slots required by module m, and Pt is the throughput of the configuration port (e.g., the Internal Configuration Access Port - ICAP) of the FPGA.

$$p(x_k) = \arg\min \{\, T_c(m) + P(m) \mid m \in L_f^{x_k} \,\} \quad (4)$$

$$T_c(m_k) \approx \frac{T_w(m_k) \times W_l(k)}{W_s(m_k)} \quad (5)$$

$$P(m) = \frac{P_s \times N_m}{P_t} \quad (6)$$

Our projection of performance takes into account the acceleration possible with the given module (by calculating its time to completion) and the overhead of partial reconfiguration, to help tackle Trade-offs 1 and 2 as mentioned in Section III. The same technique is used to break ties in favour of the first best-performing configuration when ranking by the resource allocator. Thereafter, the resource binder requests the PR Manager to load the new module layout.
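As a rough illustration of Equations 4–6, the binder's projection can be sketched as follows. The candidate tuples and parameter values are hypothetical, and a real binder would also track the number of instances per module.

```python
def completion_time(tw, wl, ws):
    """Eq. 5: projected time to completion of a module with
    work-group execution time tw and work-group size ws, given
    wl work-items left for the kernel."""
    return tw * wl / ws

def pr_cost(n_slots, ps, pt):
    """Eq. 6: partial reconfiguration latency, with per-slot
    bitstream size ps and configuration-port throughput pt."""
    return ps * n_slots / pt

def bind(candidates, wl, ps, pt):
    """Eq. 4: pick the implementation alternative minimizing
    projected completion time plus reconfiguration latency.
    Each candidate is a (tw, ws, n_slots) tuple; Python's min()
    keeps the first best entry, matching the paper's tie-break."""
    return min(candidates,
               key=lambda c: completion_time(c[0], wl, c[1])
                             + pr_cost(c[2], ps, pt))

# A 1-slot module (tw=10, ws=1) vs. a 4-slot module (tw=4, ws=4)
# for 4000 pending work-items, with a per-slot configuration cost
# of 50: 40000 + 50 = 40050 vs. 4000 + 200 = 4200, so the larger
# module wins despite its higher reconfiguration cost.
print(bind([(10, 1, 1), (4, 4, 4)], wl=4000, ps=50, pt=1))  # → (4, 4, 4)
```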

5) Partial Reconfiguration Manager: upon receiving a reconfiguration request, our current PR manager performs partial reconfiguration for each instance one at a time, based on the module layout selected by the resource allocator and resource binder. After reconfiguration, the PR Manager instructs the Data Manager to provide input to the new accelerator.

6) Programming Kernels: the Data Manager programs the kernels with new input data and also retrieves the output data when available. Further, it handles the case when a kernel change to a differently sized implementation alternative leads to a change in work-group size, such that the final outcome of the kernel still remains the same. It does this by calculating the next work-group indices such that no work-items are left out at the new granularity, which may require certain work-items to be re-evaluated. However, this is safe to perform as kernels commonly write their output to a different memory location from where the input operands are stored and as each implementation alternative has the same functionality for each work-item.
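The index recalculation described above might be sketched as follows, assuming work-items are processed in order and `done_items` counts completed work-items; the helper names are our own.

```python
def next_start(done_items, new_ws):
    """First work-group index at the new work-group size whose span
    still covers the first pending work-item. Rounding down may
    re-evaluate a few already-completed items, which is safe since
    every implementation alternative computes identical per-item
    results."""
    return done_items // new_ws

def remaining_groups(total_items, done_items, new_ws):
    """Work-group indices still to dispatch at the new granularity,
    guaranteeing no work-item is left out."""
    first = next_start(done_items, new_ws)
    last = -(-total_items // new_ws)  # ceiling division to cover the tail
    return range(first, last)

# 100 work-items, 37 already done, switching to work-group size 16:
# group 2 (items 32-47) re-covers items 32-36; groups 2..6 finish the rest.
print(list(remaining_groups(100, 37, 16)))  # → [2, 3, 4, 5, 6]
```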

This section revealed that resource elasticity fits the OpenCL programming model directly and that our resource manager provides a reference implementation for the operating system services which can consider various resource elasticity trade-offs for running OpenCL accelerators. While other heuristics may be tailored to meet specific system requirements (e.g., performance, response time, power consumption), a run-time manager for a resource elastic system always has to provide the space-time mapping of resources, perform reconfiguration and manage the execution of kernels.


V. EXPERIMENTS

To evaluate the characteristics of the scheduling algorithm, we conduct a series of simulation experiments to explore how different scheduling decisions impact our resource elastic virtualization approach. A working case study is presented in Section VI, while the following subsections describe the simulation experiments and findings in detail.

A. Experimental setup

With simulation, we explored the effects of scheduling on a wide variety of applications by changing the characteristics of synthetic applications in terms of area requirements and completion time. We also used this to investigate scheduling effects on different sized FPGAs by varying the slots available for scheduling.

We tested our system with 1000 different compute-bound scheduling requirements, where randomly arriving and terminating tasks run in parallel. To model a compute-bound schedule, we generate a long-running kernel of 4000 work-groups at arrival time zero, while the number of accompanying kernels in the schedule is generated in the range [1, 12] for each experiment. The characteristics of each kernel are generated randomly (except for the arrival time and number of work-groups of the long-running kernel): the base latency of work-groups is in the range [10, 100], the arrival time is in [1, 10000], the minimum slot size of kernels is in [1, min(4, n-1)] where n is the total number of slots available on the FPGA, and the maximum slot size of a kernel is in [minimum slot size, min(4, n)]. The speedup achievable with different size kernels is in [3, 10] and the number of work-groups is in [50, 500]. The respective parameter selection is based on a uniform random distribution. We assume the I/O requirements of kernels are not on the critical path of the application. The partial reconfiguration cost is modelled as linearly proportional to the number of slots, with the cost of configuring a single slot being 5x the smallest latency possible for a work-group. Note that, given the range of work-group latencies and the speedup available from configuring another kernel, it is likely that the cost of reconfiguration would not be recovered from a single work-group execution, as would likely be the case in a real-world scenario. Furthermore, restricting the kernel size to a maximum of 4 slots allows us to study the scenario where a user runs kernels of the given sizes on a much bigger FPGA. The scenario assumed here looks similar to the example in Figure 2b, showing quite a fine-grained schedule and consequently a relatively high configuration overhead. While the configuration cost is basically the number of slots to be reconfigured over time, the relative overhead depends on how long the module runs after configuration. This means that the coarser the scheduling granularity, the lower the relative configuration overhead. Further, it is important to note that our simulation workload shares some similarity with workload traces found in Google clusters [16], with its mix of long and short running kernels.
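For reproducibility, the synthetic kernel generation described above could be sketched as below; the dictionary keys and the seeding scheme are illustrative, not the authors' code.

```python
import random

def make_schedule(n_slots, seed=0):
    """Generate one synthetic compute-bound experiment following the
    ranges stated in the text (all draws uniform random)."""
    rng = random.Random(seed)
    cap = min(4, n_slots)  # kernels are capped at 4 slots

    def kernel(arrival, work_groups):
        lo = rng.randint(1, min(4, n_slots - 1))  # minimum slot size
        return {
            "arrival": arrival,
            "work_groups": work_groups,
            "wg_latency": rng.randint(10, 100),   # base work-group latency
            "min_slots": lo,
            "max_slots": rng.randint(lo, cap),
            "speedup": rng.uniform(3, 10),        # largest vs. smallest module
        }

    # one long-running kernel with fixed arrival time and work-group count
    kernels = [kernel(arrival=0, work_groups=4000)]
    # 1..12 accompanying kernels with fully random characteristics
    for _ in range(rng.randint(1, 12)):
        kernels.append(kernel(rng.randint(1, 10000), rng.randint(50, 500)))
    return kernels
```

One such schedule would be generated per experiment, with the partial reconfiguration cost per slot fixed at 5x the smallest work-group latency (i.e., 50 in these units).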

We ran the randomly generated schedule requirements on five different schedulers. The first three schedulers act as a baseline for our implementations and are as follows: 1) Normal Scheduler (NS), which simply allocates tasks at a First-Fit slot and runs them to completion. This is the most commonly used strategy [17]–[19] and is beneficial for FPGAs as the partial reconfiguration cost is quite high. 2) Conservative Cooperative Scheduler (CCS) is a time domain scheduler which performs a context switch when a task voluntarily relinquishes control (in our case at the end of a work-group execution). However, since standard cooperative schedulers do not take into account the availability of implementation alternatives, it always uses the minimum sized module to operate in conservative mode. This strategy aims at minimizing the number of reconfigurations required, as it leaves maximum space available for new incoming kernels. On the contrary, we also compare against the version labelled 3) Aggressive Cooperative Scheduler (ACS), which employs the biggest module available to achieve the maximum possible performance at the risk of a higher reconfiguration count.

Our two implementations of resource elastic scheduling are 1) Standard Resource Elastic Scheduler (SRES) and 2) Performance driven Resource Elastic Scheduler (PRES). The implementation of SRES is discussed in Section IV-B. PRES follows the same implementation, but its resource allocation rating is performed with the aim of maximizing performance. Equation 7 captures the rating used for PRES by minimizing the total completion time for the online kernels, where Tc(xi) is the completion time calculated using Equation 5, x is the module layout and Lf is the set of all feasible module layouts.

$$a(x) = \arg\min \left\{ \sum_{i=1}^{n} T_c(x_i) \;\middle|\; x \in L_f \right\} \quad (7)$$
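A minimal sketch of the PRES rating in Equation 7: given the feasible layouts and a per-kernel completion-time estimate, pick the layout minimizing the summed completion time. The `tc` helper is an assumed stand-in for the Equation 5 estimate.

```python
def pres_allocate(layouts, tc):
    """Eq. 7: among all feasible module layouts, minimize the summed
    projected completion time of the online kernels. tc(k, slots) is
    an assumed helper returning the completion-time estimate for
    kernel k when allocated `slots` slots."""
    return min(layouts,
               key=lambda x: sum(tc(k, s) for k, s in enumerate(x)))

# Two kernels with remaining work 100 and 400 (arbitrary units) and
# speedup proportional to slots: giving the heavier kernel more slots
# minimizes total completion time (300 vs. 450).
work = [100, 400]
print(pres_allocate([(1, 2), (2, 1)], lambda k, s: work[k] / s))  # → (1, 2)
```

Unlike the fairness-driven rating of SRES, this rating can make greedy choices, which is exactly the behaviour observed in the case study of Section VI.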

B. Results

The average completion time (the most relevant performance indicator) for each scheduler is shown in Figure 8a and the average waiting time for each kernel is shown in Figure 8b. Note that we calculate the wait time as wi = si − ai, where ai is the arrival time of the kernel and si is the time when it begins its execution (after partial reconfiguration). Among the baseline schedulers, the completion time is lowest for NS on small FPGAs as it does not pay the higher reconfiguration penalties of the other schedulers. Note that, despite better performance, NS has the poorest wait time of all the baseline schedulers, as it lets all kernels run to completion without interruption. On the other hand, resource elastic schedulers provide considerably lower completion time and similar wait time characteristics as the cooperative schedulers (Figure 8b), even when incorporating higher reconfiguration overhead. Specifically, if we compare the performance-targeting versions of the schedulers, i.e. PRES and ACS, we can see that PRES achieves 39% to 64% higher performance than ACS. However, the performance advantage does not scale linearly with the number of resource slots available, as slots become so abundant that all kernels can run concurrently in their full sized modules for most of the experiments without higher reconfiguration cost when using ACS. Resource elastic schedulers achieve this performance advantage by considering all the possible implementation alternatives and employing them dynamically at run-time to maximize performance.

We measured the resource utilization as the total number of occupied slots after every scheduler wake-up call, as plotted in Figure 8c after normalization. We can see that employing ACS leads to the highest utilization among the baselines due to the heuristic of always using the largest module for each kernel. However, it does not tend towards the maximum


Fig. 8: Performance characteristics of the NS, CCS, ACS, SRES, and PRES schedulers on FPGAs with slot sizes 2, 4, 8 and 16, where a) is completion time, b) is average wait time, c) is average utilization of an FPGA, d) is the number of partial reconfigurations performed, and e) is the average overhead of the scheduler's wake-up call.

possible utilization as it cannot overcome the fragmentation repercussions of its heuristic. On the other hand, the ability to adjust the allocation of resources and replicate modules for higher performance helps our resource elastic schedulers gain almost full utilization, which is on average about 2.3x higher than ACS and about 2.7x better than NS. The drop in utilization with respect to an increase in the number of slots available occurs because resource elastic schedulers tend to employ the biggest module possible for a kernel to maximize performance at the cost of fragmentation, and because of scenarios where kernels did not offer the smallest module size to fill up the holes left in a particular module layout. Note that the higher utilization is achieved at the cost of a higher number of reconfiguration calls; the implication of this is captured in Figures 8a and 8b, as reconfiguration time is included in the total completion and wait times.

To quantify and contrast the overhead caused by the resource elastic schedulers, we measured the time taken for execution of single wake-up calls on an x86 Intel Core i7-6850K running Ubuntu 16.04 LTS and the total number of partial reconfiguration calls performed by the schedulers across all the experiments. The results are shown in Figures 8e and 8d, respectively. The computation overhead for a resource elastic scheduler is between 10x and 100x higher than for the baseline schedulers. This is mostly due to our implementation, which enumerates all possible module layouts and uses expensive rating functions. However, the total execution time is still below tens of microseconds, which is negligible compared to the partial reconfiguration cost in the range of milliseconds. Similarly, Figure 8d shows that the resource elastic schedulers require about 12x to 100x more configuration calls than NS and a similar number of configuration calls to ACS. Despite this, the performance of the resource elastic schedulers is much better, with lower waiting time for the kernels, as shown in Figures 8a and 8b. This is due to frequent partial reconfiguration increasing resource utilization which, in turn, improves performance.

VI. CASE STUDY

In our case study, we deploy resource elastic scheduling on a recent Xilinx Zynq UltraScale+ platform for the same baseline and resource elastic schedulers as analyzed in Section V. The platform on which we conducted the experiment is a TE8080 board featuring an XCZU9EG-FFVC900-2I-ES1 MPSoC device. Our system has been partitioned into a static part, which includes the ARM cores and AXI interfaces, and a partially reconfigurable part, which is reserved for partially reconfigurable modules. In detail, the partially reconfigurable part has four regions with identical resource footprints which serve as slots to host partially reconfigurable modules, as shown in Figure 9. In our design, the reconfigurable part occupies approximately 50% of the whole chip resources, and a slot takes around 3 ms for partial reconfiguration.

OpenCL applications are implemented in one or more reconfigurable slots. They can be relocated at run-time using a configuration controller based on the tool BitMan [20]. This allows reusing the bitstreams for different slots and considerably reduces the total number of bitstreams which need to be stored, as compared to the standard Xilinx PR flow. The workload is offloaded to these kernels from the ARM cores, which share the main memory with them.

We conducted the scheduling experiments on this platform with applications from communication, arithmetic and machine learning: CRC32, matrix multiplication (mmult), and Euclidean distance (e-dist) for k-means. The matrix multiplication kernel has two different physical implementations of slot sizes 1 and 4, where the 4-slot version offers 5x the performance of the 1-slot module due to better reuse of the data and a bigger work-group size. The CRC32 and Euclidean distance kernels occupy 1 slot each.

The scheduling requirements for our case study model matrix multiplication as a long-running kernel which needs


Fig. 9: Physical implementation and floor layout of the chip (XCZU9EG-FFVC900) with a dedicated area for partial reconfiguration (four slots connected via an AXI bus macro, hosting the E-dist 1-slot, CRC32 1-slot, Mmult 1-slot and Mmult 4-slot modules) and the static system, such that OpenCL kernels can use the neighbouring slots if required.

Fig. 10: Execution trace of the schedulers, where a) is PRES, b) is SRES, c) is ACS, and d) is NS and CCS.

to run 20480 work-items with an arrival time of zero, while CRC32 and Euclidean distance are modelled as short-running kernels with 16384 work-items of low latency each and an arbitrarily chosen arrival time of 60 ms, such that they interrupt the long-running kernel. The overall decisions taken by the different schedulers are shown in Figure 10 and their respective performance characteristics are captured in Table I. We can see that ACS provides the highest performance and utilization among the baseline schedulers. However, with the ability to grow and shrink modules, SRES and PRES effectively maximize the utilization and provide 36% and 26.5% better performance than ACS, respectively, with similar waiting time. In this particular scenario, SRES outperforms PRES due to the lack of look-ahead in PRES, which leads to a greedy decision of accelerating matrix multiplication over CRC32 and, hence, a higher total completion time.

                       SRES    PRES    ACS     CCS/NS
mmult wait time        12 ms   12 ms   12 ms   3 ms
CRC32 wait time        4 ms    4 ms    4 ms    3 ms
e-dist wait time       7 ms    7 ms    7 ms    6 ms
Total completion time  320 ms  368 ms  501 ms  1221 ms

TABLE I: Wait time of kernels and completion time of the schedule for the case study, where A/B denotes that scheduling policies A and B provide the same results.

VII. CONCLUSION

In this paper, we introduced the concept of resource elasticity as a novel run-time resource management solution to allow dynamic allocation of reconfigurable resources for FPGAs. We demonstrated how a resource manager can be designed for OpenCL to virtualize reconfigurable resources for a variable number of running kernels in order to improve utilization and, consequently, performance. Our evaluation against classical resource allocation strategies on a simulator and a physical implementation found that resource elasticity can provide about 2.3x higher utilization and 49% better performance on average while also delivering lower waiting times for the kernels.

Our results are a strong indicator that future FPGA operating systems and virtual machines need to consider mapping accelerators firstly in the spatial domain. We demonstrated that cooperative scheduling is a potentially better fit for FPGAs, due to its trade-off between overhead and the ability to reallocate resources, compared to state-of-the-art preemptive and run-to-completion approaches.

VIII. ACKNOWLEDGEMENTS

This work is supported by the European Commission under the H2020 Programme and the ECOSCALE project (grant agreement 671632). We would also like to thank Xilinx for supporting this research through their university programme.

REFERENCES

[1] A. Bourge et al., "Generating Efficient Context-Switch Capable Circuits Through Autonomous Design Flow," ACM TRETS, Dec. 2016.

[2] A. Vaishnav, K. D. Pham and D. Koch, "A Survey on FPGA Virtualization," in 28th FPL, Aug 2018.

[3] J. Coole and G. Stitt, "Intermediate Fabrics: Virtual Architectures for Circuit Portability and Fast Placement and Routing," in CODES+ISSS, Oct 2010.

[4] A. K. Jain et al., "Are Coarse-Grained Overlays Ready for General Purpose Application Acceleration on FPGAs?" in 14th DASC/14th PiCom/2nd DataCom/2nd CyberSciTech. IEEE, 2016.

[5] C. Liu et al., "QuickDough: A Rapid FPGA Loop Accelerator Design Framework Using Soft CGRA Overlay," in FPT, Dec 2015.

[6] A. Brant et al., "ZUMA: An Open FPGA Overlay Architecture," in 20th FCCM. IEEE Computer Society, 2012.

[7] D. Koch et al., "An Efficient FPGA Overlay for Portable Custom Instruction Set Extensions," in 23rd FPL, Sept 2013.

[8] H. Simmler et al., "Multitasking on FPGA Coprocessors," in 10th FPL. Springer, 2000.

[9] A. Morales-Villanueva et al., "Partial Region and Bitstream Cost Models for Hardware Multitasking on Partially Reconfigurable FPGAs," in IPDPS, May 2015.

[10] M. Happe et al., "Preemptive Hardware Multitasking in ReconOS," in Applied Reconfigurable Computing, 2015.

[11] D. Koch et al., "Efficient Hardware Checkpointing – Concepts, Overhead Analysis, and Implementation," in 15th FPGA. ACM, 2007.

[12] M. Asiatici et al., "Virtualized Execution Runtime for FPGA Accelerators in the Cloud," IEEE Access, vol. 5, 2017.

[13] A. Munshi, "The OpenCL Specification," in Hot Chips 21 (HCS), 2009.

[14] L. Wirbel, "Xilinx SDAccel: A Unified Development Environment for Tomorrow's Data Center," The Linley Group Inc, 2014.

[15] R. Jain et al., "Throughput Fairness Index: An Explanation," Dept. of CIS, Ohio State University, Tech. Rep., 1999.

[16] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes, "Omega: Flexible, Scalable Schedulers for Large Compute Clusters," in Proc. of the 8th ACM European Conference on Computer Systems, 2013.

[17] S. Byma et al., "FPGAs in the Cloud: Booting Virtualized Hardware Accelerators with OpenStack," in 22nd FCCM, May 2014.

[18] F. Chen et al., "Enabling FPGAs in the Cloud," in Proceedings of the 11th ACM Conference on Computing Frontiers. ACM, 2014.

[19] E. Lübbers and M. Platzner, "ReconOS: An RTOS Supporting Hard- and Software Threads," in 17th FPL, Aug 2007.

[20] K. D. Pham, E. Horta, and D. Koch, "BITMAN: A Tool and API for FPGA Bitstream Manipulations," in DATE. IEEE, 2017.