
Efficiently Scheduling Runtime Reconfigurations

JAVIER RESANO, JUAN ANTONIO CLEMENTE, CARLOS GONZALEZ, and DANIEL MOZOS

Universidad Complutense de Madrid

and

FRANCKY CATTHOOR

IMEC vzw and Katholieke Universiteit Leuven

Due to the emergence of portable devices that must run complex dynamic applications, there is a need for flexible platforms for embedded systems. Runtime reconfigurable hardware can provide this flexibility, but the reconfiguration latency can significantly decrease the performance. When dealing with task graphs, runtime support that schedules the reconfigurations in advance can drastically reduce this overhead. However, executing complex scheduling heuristics at runtime may generate an excessive penalty. Hence, we have developed a hybrid design-time/runtime reconfiguration scheduling heuristic that generates its final schedule at runtime but carries out most computations at design time. We have tested our approach on a PowerPC 405 processor embedded on an FPGA, demonstrating that it generates a very small runtime penalty while providing almost as good schedules as a full runtime approach.

Categories and Subject Descriptors: C.1.3 [Processor Architectures]: Other Architecture Styles—Adaptable architectures; C.3 [Special-Purpose and Application-Based Systems]—Real-time and embedded systems

General Terms: Performance, Algorithms, Design

Additional Key Words and Phrases: Reconfigurable architectures, runtime/design-time scheduling, hardware multitasking, FPGAs

ACM Reference Format: Resano, J., Clemente, J. A., Gonzalez, C., Mozos, D., and Catthoor, F. 2008. Efficiently scheduling runtime reconfigurations. ACM Trans. Des. Autom. Electron. Syst. 13, 4, Article 58 (September 2008), 12 pages. DOI = 10.1145/1391962.1391966 http://doi.acm.org/10.1145/1391962.1391966

This research was supported by PR34/07-15821 and TIN2006-03274.
Authors’ addresses: J. Resano, J. A. Clemente, C. Gonzalez, and D. Mozos, Department of Computer Architecture, Universidad Complutense de Madrid, Spain, 28040 Madrid; F. Catthoor, Department of Electrical Engineering (ESAT), Katholieke Universiteit Leuven, B-3001 Heverlee, Belgium and IMEC vzw, B-3001 Heverlee, Belgium.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2008 ACM 1084-4309/2008/09-ART58 $5.00 DOI 10.1145/1391962.1391966 http://doi.acm.org/10.1145/1391962.1391966

ACM Transactions on Design Automation of Electronic Systems, Vol. 13, No. 4, Article 58, Pub. date: Sept. 2008.


1. INTRODUCTION

A new generation of small portable devices, such as mobile phones or PDAs, has become popular. These devices must run complex dynamic applications. The best way to cope with them is to design flexible platforms that respond to the variable demands of the applications. Reconfigurable HW, for example, some FPGAs (Lysaght et al. [2006]), can be the key component of these platforms, since it provides high performance and flexibility. Its main disadvantage is that the reconfiguration process generates significant execution-time overheads, as explained in Shang and Jha [2002] and Resano et al. [2004]. Nevertheless, previous works like Li and Hauck [2002], Noguera and Badia [2004], Resano et al. [2005b], and Qu et al. [2006] have demonstrated that, with the appropriate scheduling support, this overhead can be drastically reduced. However, for dynamic applications it is impossible to know at design time when the reconfigurations will be demanded. Hence, previous works either compute their schedules at runtime or target only static applications. In the first case, authors normally do not evaluate how much the execution of the scheduler decreases the system performance.

1.1 Contributions of the Article

Markovskiy et al. [2006] demonstrate that a runtime task scheduling approach for reconfigurable HW may generate a significant overhead, and they propose hybrid design-time/runtime solutions that reduce the runtime computations while providing high-quality schedules. In this article we propose to follow a similar approach to schedule runtime reconfigurations. To this end we have developed a specific hybrid heuristic for this problem. The idea of this hybrid heuristic was initially introduced in Resano et al. [2005a]; since then, we have implemented our approach in an actual FPGA-based embedded system, we have measured its execution time for graphs of different sizes and for two different memory hierarchies, and we have carried out new experiments taking into account the overhead due to both the reconfigurations and the runtime computations. In addition, we have implemented the runtime reconfiguration scheduler presented in Resano et al. [2004] to compare the results. The experiments show that our hybrid reconfiguration scheduler reduces the overall overhead while demanding fewer resources than the runtime approach.

2. PROBLEM DESCRIPTION

Figure 1 depicts our target architecture. This is basically a heterogeneous multiprocessor system that includes one or several processors (ISP), some ASIC resources for the most critical functionality, a set of Reconfigurable Processing Units (RPU), and some communication resources. In addition, it includes a configuration memory that stores all the configurations that may be loaded, and a reconfiguration circuitry that connects this memory with the RPUs. As in current commercial FPGA platforms, we will assume that reconfigurations are always carried out sequentially.

Fig. 1. Target architecture.

Each RPU is wrapped with a fixed interface that provides the basic operating system (OS) and communication support functionality. In order to support runtime reconfiguration, all these interfaces have identical connections that are known at design time. With this support, each RPU can independently execute a subtask (which is our basic scheduling unit) and communicate with the other processing elements. This scheme was proposed by Marescaux et al. [2002], who also presented an OS extension for reconfigurable HW, and it has also been adopted by other research groups (Noguera and Badia [2004]; Walder and Platzner [2004]). With this approach an RPU is similar to any other processing element. Hence, any task scheduler initially developed for heterogeneous multiprocessor systems can also be used for this architecture. However, one important difference remains: the reconfiguration process. In order to easily adapt a multiprocessor scheduler to this architecture, we propose to add a final scheduling step with specific support to optimize the reconfigurations.

To develop our approach we have selected the TCM scheduling environment presented in Peng et al. [2001], although our work can be integrated in other scheduling environments as long as they provide all the needed information at design time and runtime. Since TCM is not one of the contributions of this article, we only provide some basic information about it and refer the reader to the references for further details. We selected TCM because it is a hybrid approach developed for heterogeneous multiprocessors that shares our objective of selecting good schedules at runtime while carrying out most computations at design time. At design time the TCM scheduler analyzes all task graphs and generates for each one a set of feasible schedules with different trade-offs. If a task exhibits too much dynamic behavior, several task graphs are used to represent the most relevant runtime scenarios, and each one is analyzed separately. Hence, TCM generates all the possible task schedules at design time, and our approach takes advantage of this information to reduce the runtime computations. At runtime the TCM scheduler identifies the proper scenario and selects one of the precomputed schedules for each active task, attempting to meet the deadlines while minimizing the energy consumption. TCM has already succeeded in scheduling complex dynamic applications, as in Wong et al. [2001] and Yang and Catthoor [2004].
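To make the design-time/runtime split concrete, the following Python sketch shows the kind of lightweight decision a TCM-style runtime scheduler takes: given a few precomputed schedules for one task (each with an execution time and an energy cost), it picks the lowest-energy one that meets the deadline. The ParetoPoint type, the pick_schedule function, the fallback to the fastest schedule, and all numbers are hypothetical, and the sketch considers one task in isolation, whereas the TCM runtime scheduler selects a schedule for every active task.

```python
from typing import NamedTuple, Sequence

class ParetoPoint(NamedTuple):
    """One precomputed (design-time) schedule of a task graph (hypothetical)."""
    schedule_id: str
    exec_ms: int      # execution time of the task with this schedule
    energy_mj: float  # energy consumed with this schedule

def pick_schedule(points: Sequence[ParetoPoint], deadline_ms: int) -> ParetoPoint:
    """Runtime step: choose the cheapest precomputed schedule that meets the deadline."""
    feasible = [p for p in points if p.exec_ms <= deadline_ms]
    if not feasible:
        # No schedule meets the deadline: fall back to the fastest one.
        return min(points, key=lambda p: p.exec_ms)
    return min(feasible, key=lambda p: p.energy_mj)

# Design time: a few trade-off points for one hypothetical task graph.
points = [ParetoPoint("A", 20, 5.0), ParetoPoint("B", 30, 3.0), ParetoPoint("C", 45, 2.0)]
print(pick_schedule(points, deadline_ms=35).schedule_id)  # B
print(pick_schedule(points, deadline_ms=10).schedule_id)  # A
```

The runtime work is a scan over a handful of points, because the expensive exploration of the schedule space already happened at design time.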

Our work targets a dynamic environment where external events may trigger the execution of one or several tasks. In our scheduling environment these tasks are represented as Control Data Flow Graphs (CDFG). The nodes of these graphs (called subtasks) are the basic scheduling unit. The TCM runtime scheduler analyzes these tasks and selects a proper schedule for each of them, taking into account all the important issues, such as task deadlines, intertask communications, and energy consumption.


Fig. 2. Execution time of the runtime reconfiguration scheduler.

However, since TCM was not developed for reconfigurable hardware, it neglects the reconfiguration overheads. Hence, the actual execution may greatly deviate from the initial schedule, leading to possible deadline misses. In our previous work we demonstrated that, with the appropriate runtime support, the initial schedules can be updated with the needed reconfigurations while hiding most of the reconfiguration delays. This goal was achieved by applying a prefetch technique for the configurations that identifies the needed reconfigurations, analyzes the given schedules, and attempts to carry out all the reconfigurations in advance in order to hide the reconfiguration latency. We presented a reconfiguration scheduler that obtains good results in Resano et al. [2004], and in Resano et al. [2005b] we extended it with a replacement technique that collaborates with the scheduler, based on the well-known LFD (Longest Forward Distance) replacement policy (Belady [1966]). Using these techniques, in our experiments we hid at least 93% of the initial reconfiguration overhead, even for highly dynamic applications. The complexity of this reconfiguration scheduling heuristic is O(N log N), where N is the number of reconfigurations. Initially this may look affordable, but since an embedded processor must carry out the computations, we should evaluate the runtime delays that it generates. To this end we have run our reconfiguration scheduler on a PowerPC 405 processor embedded in an XC2VP30 FPGA (XILINX [2007]). We implemented the system using the XILINX EDK environment, which provides basic components to develop an embedded system based on this processor (EDK [2007]). We measured the execution time with a HW timer for 26 different task graphs obtained from actual multimedia applications, grouped in categories according to the number of reconfigurations demanded (Section 4 provides more details about these applications).
We carried out the measurements twice, storing the code in an on-chip memory and in an off-chip memory, respectively. Figure 2 depicts the average results for each category, showing that when the code is stored in an off-chip memory the execution takes several milliseconds. Using an on-chip memory greatly improves these results. However, the execution time may still be unaffordable in many situations. In addition, the code occupies 124 KB, which is 40% of the total on-chip memory resources. Hence, we need to reduce both the execution time and the code size of this reconfiguration scheduler.
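The LFD policy mentioned above evicts the resident configuration whose next use lies farthest in the future, which a scheduler can evaluate when the upcoming subtask sequence is known from the selected schedules. The Python sketch below shows only this victim-selection rule under that assumption; lfd_victim, its arguments, and the configuration names are hypothetical and do not reproduce the replacement module of Resano et al. [2005b].

```python
from typing import Sequence

def lfd_victim(resident: Sequence[str], future_uses: Sequence[str]) -> str:
    """Return the resident configuration to evict under Belady's LFD policy.

    resident    -- configurations currently loaded in the RPUs
    future_uses -- upcoming subtask executions, in order
    """
    def next_use(cfg: str) -> float:
        # Position of the next use; configurations never used again are ideal victims.
        try:
            return future_uses.index(cfg)
        except ValueError:
            return float("inf")
    return max(resident, key=next_use)

# C is never used again, so it is evicted first.
print(lfd_victim(["A", "B", "C"], ["B", "A", "B", "D"]))  # C
# Otherwise the configuration needed latest (A, needed after B) is evicted.
print(lfd_victim(["A", "B"], ["B", "D", "A"]))  # A
```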


Fig. 3. Scheduling process.

3. THE HYBRID RECONFIGURATION SCHEDULER

As shown in Figure 3, our reconfiguration scheduler includes a design-time module and a runtime module. In TCM all the task-graph schedules are available at design time. However, in order to support subtask reuse, we cannot carry out the reconfiguration schedule fully at design time. Due to the well-known effect of temporal locality, it is very likely that subtasks that were loaded and executed at a given point in time will be executed again soon. However, the reuse opportunities depend on the runtime events. Hence, reuse can only be addressed at runtime.

We have considered two different options for splitting the reconfiguration scheduling process between a design-time and a runtime phase. The first option is to follow an approach similar to TCM: generating one schedule at design time for each possible runtime situation and, at runtime, identifying the proper schedule. However, if there are many different situations, this option becomes too costly; hence we decided to develop another approach. The design-time module generates for each graph only one schedule of the reconfigurations, which is optimal under certain conditions, and the runtime module carries out some small adjustments when needed.

3.1 Reconfiguration Design-Time Scheduler

As shown in our previous work, a prefetch approach can hide the latency of most reconfigurations. However, for certain subtasks it may fail to meet its objective, because there is not always enough time available to schedule all the loads in advance. Clearly, the subtasks whose loads cannot be hidden are the most critical for system performance. The first step of our reconfiguration design-time scheduler is to identify these subtasks. Thus, for each design-time task schedule, the reconfiguration scheduler identifies a minimum set of critical subtasks (CS) that fulfills the following conditions:

1. If all these CS are loaded at the beginning of the task execution, there will be no overhead due to the loads of the remaining subtasks.

2. If any of these CS is not loaded at the beginning of the task execution, the reconfiguration scheduler module will not fully hide its load.

Fig. 4. Pseudo-code that identifies the set of critical subtasks.

Figure 4 shows the pseudo-code of the CS selection process. The process starts by assigning a weight to each subtask in the graph. These weights are computed by performing an ALAP (as late as possible) scheduling and, afterwards, computing for each subtask the longest path to the end of the execution of the whole graph. After computing the weights, the first iteration starts assuming that all the reconfigurations must be scheduled. Under this assumption a scheduling function attempts to hide all the reconfigurations. This function can be set to use any scheduling algorithm. In our current implementation we use a branch & bound algorithm that guarantees the optimal solution for small and medium-size graphs, and a simpler scheduling heuristic that generates good schedules in an affordable time for large graphs (Resano et al. [2004] provide more details of this heuristic). Combining these two functions, our scheduler always finds optimal or near-optimal solutions without consuming too much design time. However, if needed, any other scheduling heuristic can easily be included in the system.
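The weight described above can be computed with a memoized traversal of the task graph: without resource constraints, the longest path from a subtask to the end of the graph equals the graph's makespan minus the subtask's ALAP start time, so one recursive pass suffices. The Python sketch below assumes the graph is given as execution times plus successor lists; the function name and the example graph are hypothetical.

```python
from typing import Dict, List

def subtask_weights(exec_ms: Dict[int, int], succs: Dict[int, List[int]]) -> Dict[int, int]:
    """Weight of each subtask = length of the longest path from its start to the end of the graph."""
    memo: Dict[int, int] = {}

    def longest(s: int) -> int:
        if s not in memo:
            # Own execution time plus the heaviest chain of successors.
            memo[s] = exec_ms[s] + max((longest(t) for t in succs.get(s, [])), default=0)
        return memo[s]

    return {s: longest(s) for s in exec_ms}

# Hypothetical 4-subtask graph: 1 -> {2, 3} -> 4 (times in ms).
exec_ms = {1: 5, 2: 8, 3: 9, 4: 7}
succs = {1: [2, 3], 2: [4], 3: [4]}
print(subtask_weights(exec_ms, succs))  # {1: 21, 2: 15, 3: 16, 4: 7}
```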

Once the first schedule is done, the design-time module compares it with the initial task schedule that does not include the reconfigurations and checks whether there is any execution-time overhead. If this is the case, the function Add_subtask_to_CS identifies which reconfigurations have generated this overhead and moves the subtask with the greatest weight among them to the CS set, which was initially empty. The following iterations repeat this process, assuming that all the subtasks assigned to the CS set are reused (i.e., they do not demand any reconfiguration), until the Schedule_reconfigurations function finds a final schedule without any reconfiguration overhead. This schedule is the input of the runtime reconfiguration scheduler.
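The iterative CS selection can be sketched as follows. To keep it short, this Python sketch replaces the branch & bound and heuristic scheduling functions with a simple simulation under stated assumptions: every subtask runs on its own RPU, a single reconfiguration circuitry performs one load at a time, and loads are issued in the order of the initial task schedule. The names simulate and select_critical_subtasks are hypothetical (they play the roles of the scheduling function and the CS-update step of Figure 4), and the 4 ms latency and the graph are illustrative.

```python
from typing import Dict, List, Set, Tuple

RECONF_MS = 4  # reconfiguration latency (illustrative)

def simulate(order: List[int], exec_ms: Dict[int, int],
             preds: Dict[int, List[int]], loaded: Set[int]) -> Tuple[int, List[int]]:
    """Prefetch every subtask not in `loaded`, one load at a time, in `order`.

    Returns the makespan and the subtasks whose start was set by their load.
    Assumes `order` is a topological order and each subtask has its own RPU.
    """
    port_free = 0
    finish: Dict[int, int] = {}
    delayed: List[int] = []
    for s in order:
        deps_done = max((finish[p] for p in preds.get(s, [])), default=0)
        ready = 0
        if s not in loaded:
            port_free += RECONF_MS   # load starts as soon as the circuitry is idle
            ready = port_free
        if ready > deps_done:
            delayed.append(s)        # this reconfiguration, not the data, delays s
        finish[s] = max(ready, deps_done) + exec_ms[s]
    return max(finish.values()), delayed

def select_critical_subtasks(order, exec_ms, preds, weights) -> List[int]:
    """Grow the CS set until the prefetch schedule has no reconfiguration overhead."""
    ideal, _ = simulate(order, exec_ms, preds, loaded=set(order))  # no reconfigurations
    cs: List[int] = []
    while True:
        makespan, delayed = simulate(order, exec_ms, preds, loaded=set(cs))
        if makespan == ideal:
            return cs
        # Among the subtasks delayed by their loads, the heaviest becomes critical.
        cs.append(max(delayed, key=lambda s: weights[s]))

exec_ms = {1: 5, 2: 8, 3: 9, 4: 7}
preds = {2: [1], 3: [1], 4: [2, 3]}
weights = {1: 21, 2: 15, 3: 16, 4: 7}  # longest path to the end of the graph
print(select_critical_subtasks([1, 2, 3, 4], exec_ms, preds, weights))  # [1, 3]
```

In this example the load of subtask 1 can never be hidden; once it is treated as reused, the load of subtask 3 still delays the graph. With both in the CS set, the loads of subtasks 2 and 4 fit in the available slack and the makespan equals the ideal 21 ms.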

3.2 Reconfiguration Runtime Scheduler

The output of the reconfiguration design-time scheduler assumes that all the subtasks from the CS set are reused, whereas the remaining subtasks assigned to RPUs are prefetched. However, these assumptions will probably not hold at runtime. The first task of the runtime module is to carry out an initialization phase that loads all the CS that cannot be reused. The loading order was decided at design time according to the subtasks' weights. After this, the module checks whether any noncritical subtask can be reused. In that case it cancels the prefetch of those subtasks without modifying the rest of the schedule.
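Because the design-time output already fixes the order of all loads, the runtime adjustment described above reduces to two filters over that output. The Python sketch below is a hypothetical illustration: adapt_schedule takes the CS list sorted by weight, the design-time prefetch order of the noncritical subtasks, and the set of configurations currently resident in the RPUs. In this sketch each check is a set lookup, so the adjustment is linear in the number of subtasks.

```python
from typing import Iterable, List, Set, Tuple

def adapt_schedule(cs_by_weight: Iterable[int], prefetch_order: Iterable[int],
                   resident: Set[int]) -> Tuple[List[int], List[int]]:
    """Adjust a design-time reconfiguration schedule to the runtime situation.

    Returns (initialization-phase loads, remaining prefetches).
    """
    # CS that are not already configured must be loaded before the task starts,
    # in the weight order fixed at design time.
    init_phase = [s for s in cs_by_weight if s not in resident]
    # Noncritical subtasks that can be reused: cancel their prefetch, keep the rest as is.
    prefetches = [s for s in prefetch_order if s not in resident]
    return init_phase, prefetches

# Design time: CS = [1, 3]; prefetch order for the rest = [2, 4].
# Runtime: subtasks 3 and 4 are still configured from a previous execution.
print(adapt_schedule([1, 3], [2, 4], resident={3, 4}))  # ([1], [2])
```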


Fig. 5. Pseudo-code for the process that steers the execution of a task.

3.3 Intertask Optimization

Up to now the scheduling optimizations have only been applied inside the boundaries of a task. The reason is that the actual sequence of executed tasks is not known at design time. However, simple optimizations can be done at runtime if enough information is available. In the TCM environment the runtime scheduler is invoked periodically, and it generates as output a sequence of scheduled tasks. Using again the idea of critical subtasks, we have included an intertask optimization technique in our hybrid heuristic. Basically, for each task the reconfiguration scheduler uses the final idle period of the reconfiguration circuitry to carry out the initialization phase of the subsequent task. If this is possible, the subsequent task will not generate any overhead due to its reconfigurations.
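The intertask step only needs to know when the reconfiguration circuitry becomes idle within the current task and which CS the next task needs. The Python sketch below assumes, hypothetically, that a preload counts only if it completes before the current task ends and that each preload needs one free RPU; preload_next_task and its parameters are illustrative names.

```python
from typing import List, Set

RECONF_MS = 4  # reconfiguration latency (illustrative)

def preload_next_task(idle_from_ms: int, task_end_ms: int, next_cs: List[str],
                      resident: Set[str], free_rpus: int) -> List[str]:
    """Pick the next task's critical subtasks to load in the final idle period
    of the reconfiguration circuitry (loads are sequential)."""
    pending = [s for s in next_cs if s not in resident]       # already loaded: nothing to do
    slots = max(0, task_end_ms - idle_from_ms) // RECONF_MS   # loads that fit in the gap
    return pending[:min(slots, free_rpus)]

# Circuitry idle from 20 ms, current task ends at 29 ms: two 4 ms loads fit.
print(preload_next_task(20, 29, ["a", "b", "c"], resident={"b"}, free_rpus=2))  # ['a', 'c']
print(preload_next_task(20, 29, ["a", "b", "c"], resident={"b"}, free_rpus=1))  # ['a']
```

When every pending CS of the next task fits, that task needs no initialization phase, which corresponds to the case in which it generates no reconfiguration overhead.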

3.4 Task Execution

Figure 5 shows the pseudo-code of the process that controls the execution of a task. This process starts with the initialization phase that loads all the CS that cannot be reused. Then it executes the first subtask of the graph while carrying out the first reconfiguration in parallel. After that, the process monitors the event queue. The events inform the process when a subtask execution or a reconfiguration has finished. Thus, each time the process receives an event, it checks the schedule, and if any subtask is ready, it starts its execution. In addition, if the reconfiguration circuitry is idle, it attempts to start the following reconfiguration. If all the reconfigurations have already been carried out, the process checks whether it has received information about the next task that is going to be executed. If this is the case, and there are enough free resources, it starts loading the CS of the following task.
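The event-driven control loop described above can be approximated in Python with a priority queue of completion events. This is a hypothetical simplification: it assumes the initialization phase has already run (the CS are configured), each subtask has its own RPU, loads follow the given order, and the intertask step is omitted; execute_task and its arguments are illustrative names, and the function returns the completion time of the task.

```python
import heapq
from typing import Dict, List

RECONF_MS = 4  # reconfiguration latency (illustrative)

def execute_task(exec_ms: Dict[int, int], preds: Dict[int, List[int]],
                 load_order: List[int]) -> int:
    """Run one task graph; return its completion time in ms."""
    loaded = set(exec_ms) - set(load_order)  # CS configured by the initialization phase
    pending_loads = list(load_order)
    started, done = set(), set()
    events: list = []  # min-heap of (time, kind, subtask)
    port_busy = False
    now = 0

    def dispatch() -> None:
        nonlocal port_busy
        # Start every subtask that is configured and whose predecessors have finished.
        for s in exec_ms:
            if s not in started and s in loaded and all(p in done for p in preds.get(s, [])):
                started.add(s)
                heapq.heappush(events, (now + exec_ms[s], "exec_done", s))
        # Keep the reconfiguration circuitry busy with the next scheduled load.
        if not port_busy and pending_loads:
            port_busy = True
            heapq.heappush(events, (now + RECONF_MS, "load_done", pending_loads.pop(0)))

    dispatch()
    while events:
        now, kind, s = heapq.heappop(events)
        if kind == "exec_done":
            done.add(s)
        else:
            loaded.add(s)
            port_busy = False
        dispatch()
    return now

exec_ms = {1: 5, 2: 8, 3: 9, 4: 7}
preds = {2: [1], 3: [1], 4: [2, 3]}
print(execute_task(exec_ms, preds, load_order=[2, 4]))        # 21: CS {1, 3} preloaded
print(execute_task(exec_ms, preds, load_order=[1, 2, 3, 4]))  # 28: nothing preloaded
```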

Figure 6 illustrates with an example the different steps of the reconfiguration scheduling flow, starting from an initial schedule that does not include reconfigurations (6a) and finishing with the final schedule executed at runtime (6c). Figure 6b is the schedule selected at design time, assuming that subtask 1 is reused (since it is the only CS) while the others must be loaded (since they are not critical, these loads do not generate any delay). At runtime the scheduler checks whether the previous assumptions hold. In this particular case, the scheduler finds that subtask 3 is the only one that can be reused. Since some of the assumptions are not correct, the reconfiguration runtime scheduler updates the design-time schedule by including an initialization phase (6c.1) where subtask 1 is loaded, and it removes the load of subtask 3 because it can be reused (6c.2). Finally, it attempts to apply an intertask optimization. In this example the task scheduler already knows which task is going to be executed next. Hence, it preloads one of its critical subtasks (6d.3).

Fig. 6. Example of the different steps in the scheduling process. Ex i: execution of subtask i. L i: load of subtask i.

Fig. 7. Execution time of the runtime and hybrid reconfiguration schedulers.

4. EXPERIMENTAL RESULTS

As a first experiment we have repeated the measurements presented in Figure 2, this time also including our hybrid approach (Figure 7). These results show an average speed-up factor of 22 over the runtime scheduler for both the on-chip and off-chip implementations. In addition, the code that identifies the reusable tasks, updates the reconfiguration schedule, and manages the task graph execution is 14 times smaller (9 KB, compared with 124 KB for the runtime scheduler).

In the following experiments we compare the quality of the schedules obtained with both approaches. In these experiments the reconfiguration latency has been set to 4 ms, which is the time needed to reconfigure one fifth of an XC2VP30 FPGA, since, after implementing the task graphs using the XILINX ISE design environment (ISE [2007]), all the subtasks considered in these experiments fit in this area.


Table I. Set of Multimedia Benchmarks. The last four columns include the execution time of the application without the reconfiguration overhead (Ideal); including the overhead without applying a prefetch approach (with Overhead); applying the runtime prefetch approach presented in Resano et al. [2004] (runtime); and finally applying our hybrid approach (hybrid).

Application   | Number of subtasks | Number of CS | Ideal ex. time | Ex. time with Overhead | Ex. time with runtime | Ex. time with hybrid
Pattern Rec.  | 6                  | 1            | 94 ms          | 110 ms                 | 98 ms                 | 98 ms
JPEG dec.     | 4                  | 1            | 81 ms          | 97 ms                  | 85 ms                 | 85 ms
Parallel JPEG | 8                  | 1            | 57 ms          | 77 ms                  | 61 ms                 | 61 ms
MPEG encoder  | 5                  | 2            | 33 ms          | 51 ms                  | 39 ms                 | 41 ms
Pocket GL     | 5                  | 2            | 26 ms          | 45 ms                  | 32 ms                 | 33 ms
Average       | 6                  | 1.4          | 58 ms          | 78 ms                  | 63 ms                 | 63.6 ms

Table I includes some information about the applications that we have used to test our approach: a sequential and a parallel version of the JPEG decoder, an MPEG encoder, a Pattern Recognition application that applies the Hough transform over a matrix of pixels in order to identify geometrical figures, and a highly dynamic 3D rendering application based on the open-source Pocket-GL library (POCKET GL [2007]). For the MPEG encoder there are in fact three different graphs corresponding to three different scenarios (encoding of a B, P, or I frame). The Pocket GL application is somewhat more complex, since it includes 20 task graphs corresponding to 20 different runtime scenarios. Table I includes average data for the MPEG and Pocket GL applications. The results obtained in these experiments demonstrate that our hybrid approach provides almost as good results as the previous runtime approach. There is a clear reason for these results. The hybrid scheduler generates at design time an optimal schedule for the noncritical subtasks; hence, no scheduling approach can improve this part. The critical subtasks can still generate a significant overhead but, by definition, they also generate overheads (although sometimes slightly smaller ones) when the reconfiguration schedule is computed at runtime. In these experiments we assume that no configuration is reused. In addition, we do not apply the intertask optimization.

The following two experiments evaluate our approach including the possibility of subtask reuse (identified at runtime) and the impact of the intertask optimization. In the first experiment (Figure 8) we have simulated 1000 iterations of the execution of the first four applications from Table I for different numbers of RPUs, randomly selecting which applications are executed in each iteration. The first three columns present the execution time overhead due to the reconfigurations when using the runtime prefetch scheduler presented in Resano et al. [2004] (RT), the same scheduler including the intertask optimization (RT+IT), and finally the hybrid scheduler (Hybrid). The next three columns (+Tex) again present this overhead but also add the delays generated by the runtime prefetch computations. The results show that the intertask optimization significantly reduces the overhead, and that the benefits of applying a runtime approach decrease when its execution time is included in the measurements. In fact, once this data is included, our hybrid approach provides significantly better results than the runtime approach.
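The values in Table I can be turned into the share of the reconfiguration overhead that each prefetch scheduler hides, (with Overhead − scheduled) / (with Overhead − Ideal). The short Python script below only performs this arithmetic on the table's numbers; the hidden fraction is a derived metric, not one reported in the article.

```python
# Table I execution times in ms: (ideal, with overhead, runtime prefetch, hybrid prefetch).
TABLE_I = {
    "Pattern Rec.": (94, 110, 98, 98),
    "JPEG dec.": (81, 97, 85, 85),
    "Parallel JPEG": (57, 77, 61, 61),
    "MPEG encoder": (33, 51, 39, 41),
    "Pocket GL": (26, 45, 32, 33),
}

def hidden_fraction(ideal: float, with_overhead: float, scheduled: float) -> float:
    """Share of the reconfiguration overhead removed by a prefetch scheduler."""
    return (with_overhead - scheduled) / (with_overhead - ideal)

for app, (ideal, ovh, rt, hyb) in TABLE_I.items():
    print(f"{app:14} runtime {hidden_fraction(ideal, ovh, rt):6.1%}  "
          f"hybrid {hidden_fraction(ideal, ovh, hyb):6.1%}  "
          f"remaining (hybrid) {hyb - ideal} ms")
```

For the three applications with a single CS, both schedulers hide the same share (75% to 80%), and the remaining 4 ms coincides with one reconfiguration; the hybrid scheduler falls slightly behind only on the two applications with two CS.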


Fig. 8. Reconfiguration overhead for the first 4 applications depicted in Table I and a variable number of RPUs.

Fig. 9. Reconfiguration overhead for a Pocket GL 3D rendering application, for different numbers of RPUs.

We have carried out a similar experiment with the Pocket GL application. This application includes 20 different runtime scenarios, each one represented as a task graph, and in each iteration one of them is randomly selected. Figure 9 summarizes the results, showing that again the overall overhead, including the delays due to reconfigurations and due to the computations of the prefetch scheduler, is greatly reduced with our hybrid approach.

5. CONCLUSIONS

Reconfigurable hardware provides the flexibility demanded by new-generation embedded systems that must deal with dynamic applications. However, the reconfiguration latency can drastically reduce its performance. Previous works have demonstrated that this overhead can be greatly reduced by applying a runtime prefetch approach that attempts to schedule reconfigurations in advance. However, when an embedded processor must carry out the prefetch computations, this approach can generate significant delays. In order to efficiently schedule the reconfigurations, we have developed a hybrid prefetch scheduler that carries out most computations at design time and only slightly adjusts the schedule at runtime. To test our approach we have implemented this scheduler and the full runtime approach presented in Resano et al. [2004] on a PowerPC processor embedded in a Virtex-II Pro FPGA. The results demonstrate that the runtime computations of our hybrid scheduler generate a very small penalty on the performance of the system (22 times smaller than the runtime approach) while producing schedules of similar quality, leading to important overall overhead savings. As future work we plan to develop HW micro-architecture support to further improve the efficiency of the reconfiguration scheduler and to test it in a real multitasking system.

REFERENCES

BELADY, L. A. 1966. A study of replacement algorithms for virtual storage computers. IBM Syst. J. 5, 78–101.

EDK. 2007. Embedded System Tools Manual. http://www.xilinx.com/ise/embedded/edk91i_docs/est_rm.pdf.

ISE. 2007. http://www.xilinx.com/publications/prod_mktg/pn0010867.pdf.

LI, Z. AND HAUCK, S. 2002. Configuration prefetching techniques for partial reconfigurable coprocessor with relocation and defragmentation. In Proceedings of the ACM/SIGDA 10th International Symposium on Field-Programmable Gate Arrays. Monterey, CA, 187–195.

LYSAGHT, P., BLODGET, B., MASON, J., YOUNG, J., AND BRIDGFORD, B. 2006. Enhanced architectures, design methodologies and CAD tools for dynamic reconfiguration of Xilinx FPGAs. In Proceedings of the 2006 International Conference on Field Programmable Logic and Applications. Madrid, Spain, 1–6.

MARESCAUX, T., BARTIC, A., VERKEST, D., VERNALDE, S., AND LAUWEREINS, R. 2002. Interconnection networks enable fine-grain dynamic multi-tasking FPGAs. In Proceedings of the 12th International Conference on Field-Programmable Logic and Applications. Montpellier, France, 795–805.

MARKOVSKIY, Y., CASPI, E., HUANG, R., YEH, J., CHU, M., WAWRZYNEK, J., AND DEHON, A. 2006. Analysis of quasi-static scheduling techniques in a virtualized reconfigurable machine. In Proceedings of the ACM/SIGDA 10th International Symposium on Field-Programmable Gate Arrays. Monterey, CA, 196–205.

NOGUERA, J. AND BADIA, R. M. 2004. Multitasking on reconfigurable architectures: Microarchitecture support and dynamic scheduling. ACM Trans. Embed. Comput. Syst. 3, 2, 385–406.

PENG, Y., WONG, CH., MARCHAL, P., CATTHOOR, F., DESMET, D., VERKEST, D., AND LAUWEREINS, R. 2001. Energy-aware runtime scheduling for embedded-multiprocessor SOCs. IEEE Des. Test Comput. 18, 5, 46–58.

POCKET GL. 2007. http://www.sundialsoft.freeserve.co.uk/pgl.htm.

QU, Y., SOININEN, J., AND NURMI, J. 2006. A parallel configuration model for reducing the runtime reconfiguration overhead. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE). Munich, Germany, 965–969.

RESANO, J., MOZOS, D., VERKEST, D., VERNALDE, S., AND CATTHOOR, F. 2004. A hybrid design-time/run-time scheduling flow to minimize the reconfiguration overhead of FPGAs. J. Microproc. Microarchi. 28, 5–6, 291–301.

RESANO, J., MOZOS, D., AND CATTHOOR, F. 2005a. A hybrid prefetch scheduling heuristic to minimize at run-time the reconfiguration overhead of dynamically reconfigurable HW. In Proceedings of the Design Automation and Test in Europe Conference. Munich, Germany, 106–111.

RESANO, J., MOZOS, D., CATTHOOR, F., AND VERKEST, D. 2005b. A reconfiguration manager for dynamically reconfigurable hardware. IEEE Design & Test Comput. 22, 5, 452–460.

SHANG, L. AND JHA, N. K. 2002. Hardware-software co-synthesis of low power real-time distributed embedded systems with dynamically reconfigurable FPGAs. In Proceedings of the Conference on Asia South Pacific Design Automation/VLSI Design. Bangalore, India, 345–360.

WALDER, H. AND PLATZNER, M. 2004. A runtime environment for reconfigurable operating systems. In Proceedings of the 14th International Conference on Field-Programmable Logic and Applications. Leuven, Belgium, 831–835.

WONG, C., MARCHAL, P., AND YANG, P. 2001. Task concurrency management methodology to schedule the MPEG4 IM1 player on a highly parallel processor platform. In Proceedings of the 9th International Symposium on Hardware/Software Codesign. Copenhagen, Denmark, April 25–27, 170–175.

XILINX. 2007. http://www.xilinx.com/univ/xupv2p.html.

YANG, P. AND CATTHOOR, F. 2004. Dynamic mapping and ordering tasks of embedded real-time systems on multiprocessor platforms. In Proceedings of Software and Compilers for Embedded Systems, 8th International Workshop. 167–181.

Received April 2007; revised November 2007, April 2008; accepted April 2008
