
INTERNATIONAL JOURNAL OF MODELING AND SIMULATION FOR THE PETROLEUM INDUSTRY, VOL. 1, NO. 1, AUGUST 2007 63

Reconfigurable Platforms for High Performance Processing

Sergio B. Nascimento1,3, Jordana Seixas1, Edson Barbosa1,2, Stelita Silva1, Abner Correa1, Viviane Lucy1, Victor Medeiros1, Arthur Rolim1, Dercy Lima1 and Manoel Eusebio de Lima1

1Centro de Informática - UFPE - Caixa Postal 7851 - Cidade Universitária
Fone: + 55 81 2126.8430, Fax: + 55 81 2126.8438 Recife - PE - Brasil

2Centro Federal de Educação Tecnológica de Sergipe - CEFET-SE
Av. Engenheiro Gentil Tavares da Mota - Getúlio Vargas, CEP: 49055-260

Fone: +55 79 3216.3100 Aracaju - SE - Brasil
3Centro Federal de Educação Tecnológica de Pernambuco - CEFET-PE

Av. Prof. Luiz Freire - 500, Cidade Universitária, CEP: 50740-740
Fone: +55 81 2125.1716 Recife - PE - Brasil

{psbn, jls, ebl2, sms, acb, vlss, vwcm, auar, dol, mel}@cin.ufpe.br

Abstract—This work deals with reconfigurable computation platforms for the high-speed simulation of physical phenomena, based on numerical models of algebraic linear systems. This type of simulation is of extreme importance in research centers such as CENPES/Petrobras, which develops geophysical processing applications for oil and gas prospection. Currently, these applications are implemented on conventional PC clusters. A new approach for this type of problem is presented here, based on reconfigurable computer systems using Field Programmable Gate Array (FPGA) technology, and its implications regarding hardware/software partitioning, the operating system, memory connections, communication and device drivers. Such technologies make appreciable gains possible in terms of performance, electric power and processing speed when compared to conventional clusters. This solution also promotes cost reduction when applied to the massive computation, high-complexity, large-data applications normally used in scientific computing.

A typical application is the Portable, Extensible Toolkit for Scientific Computation (PETSc). Models of reconfigurable architectures for high performance and their computational properties will be presented, with emphasis on platforms based on FPGAs.

A platform currently under development in the HPCIn project, called Aquarius, will also be presented. This is a project from the Center for Informatics at the Federal University of Pernambuco in cooperation with CENPES/Petrobras. HPCIn is also part of the RPCMod research network.

Index Terms—High Performance Computing, FPGAs, Computer Clusters

I. INTRODUCTION

In this work, possible solutions for the simulation of physical phenomena will be presented, taking advantage of programmable logic technology. These devices, acting as hardware accelerators, can substantially increase information processing performance. This technology reduces computation time by exploiting the potential parallelism, implementing operations at the bit level in a matrix of programmable logic, as shown in [22], [23].

Potential parallelism is almost always present in the numerical models used in simulation [6], [44], [47], [48], [49], [50].

The intense exploitation of parallelism with FPGAs can also result in lower-cost cluster implementations and reduce the need for processors operating at high frequencies (conventional CPUs operate at around ∼3GHz, while FPGAs run at around ∼10MHz to ∼300MHz).

The reduction in clock frequency is compensated by the high degree of parallelism in the FPGAs, resulting in high computation speed together with reduced electric energy consumption.
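As a back-of-the-envelope illustration of this trade-off, the figures below are hypothetical, chosen only to match the orders of magnitude quoted above: a CPU issuing one operation per cycle at 3 GHz against an FPGA clocked at 100 MHz with 64 parallel functional units.

```c
#include <stdint.h>

/* Hypothetical throughput comparison: clock rate times number of
 * operations issued per cycle.  The unit counts are illustrative,
 * not measurements from any particular device. */
static uint64_t ops_per_second(uint64_t clock_hz, uint64_t parallel_units)
{
    return clock_hz * parallel_units;
}

/* ops_per_second(3000000000ULL, 1)  -> 3.0e9  (CPU, one unit)
 * ops_per_second(100000000ULL, 64)  -> 6.4e9  (FPGA, 64 units)  */
```

With these assumed numbers, the slower clock is more than compensated by parallelism, which is the point the text makes.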

In this context, we will explore reconfigurable architectures for high performance computing in massive data problems, with emphasis on the resolution of large Algebraic Linear Systems (ALSs) [44], [48], [49]. A simple and illustrative example of an ALS operation is matrix multiplication, as shown in Figure 1.
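The parallelism in this operation comes from splitting each row-column product into independent partial sums that are then integrated. A minimal sketch follows; the matrix size and segment count are arbitrary choices for illustration, and the "parallel" segments are simply computed in sequence here, whereas in hardware each would have its own multiplier/adder tree.

```c
#define N 4
#define SEGMENTS 2           /* sub-operations per row-column product */

/* One segment of a row-column product (Figure 1a): in hardware each
 * segment would be computed by an independent functional unit. */
static int dot_segment(const int row[N], const int col[N], int seg)
{
    int len = N / SEGMENTS, sum = 0;
    for (int k = seg * len; k < (seg + 1) * len; k++)
        sum += row[k] * col[k];
    return sum;
}

/* Integration step (the summation in Figure 1b): the partial results
 * of all segments are accumulated into the final element. */
static int dot(const int row[N], const int col[N])
{
    int total = 0;
    for (int s = 0; s < SEGMENTS; s++)
        total += dot_segment(row, col, s);
    return total;
}
```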

The main parallel processing idea, in this case, consists of segmenting the row-column operation into sub-operations (Figure 1a), so that the multiplications are carried out in parallel. After that, the results are integrated (∑), as depicted in Figure 1b.

We can apply the same idea to other compound operations and implement them using the high degree of parallelism in FPGAs to obtain the acceleration and power reduction. The efficiency of these devices in accelerating complex computational tasks has already been proven by results of applications presented in the academic literature [1], [2], [4], [8], [13], [18], [19], [20], [26], [27], [28], [36], [39], [40] and in commercial platforms [43], [45], [46], [62], [63], [64], [65]. However, there are still many challenges to be overcome before this technology becomes friendly to application specialists. These specialists should not have to be involved with implementation details; instead, they must concentrate on the


Fig. 1. Matrix multiplication: a simple linear algebraic system.

definition of the functionalities of the system at a high level of abstraction.

The next sections are dedicated to the discussion of topics related to the typical implementation flow of high performance applications. Section II describes a possible general design flow, its hardware/software partitioning approach and the subsequent implementation.

In Section III, a brief description of the PETSc scientific computation library [66] is presented. After that, Sections IV and V consider, respectively, the implications for operating systems in reconfigurable computation, and temporal partitioning and task scheduling. Sections VI and VII present the software/hardware interface components and the generation of arithmetic library implementations for FPGAs.

Examples of high performance processing systems are presented in Section VIII. An experimental high performance processing platform based on a case study under development in the RPCMod network [83], called Aquarius, is presented in Section IX. Section X presents some considerations and our conclusions.

II. APPLICATIONS DEVELOPMENT

Reconfigurable computation applications running on a high performance platform need a hardware/software co-design (hw/sw codesign) [14], [37] platform approach. In this platform, software and hardware components work together through special interfaces. In general, the hardware components (FPGAs, ASICs) work as co-processors hardwired to the software components. The software components run, in general, on general-purpose microprocessors (CPUs) or DSP processors.

Although it may seem that a solution executing as many tasks as possible in hardware results in the best implementation, this is not necessarily true. We must also take into account the cost/benefit of this hw/sw approach, considering the fact that a good part of an application can have an essentially sequential behavior, and is therefore not well suited for hardware.

Good FPGA performance is only obtained by applications that can take advantage of the device's intrinsic characteristics. Therefore, in high performance systems, a critical analysis of the system as a whole (profiling) is necessary, in order to better identify the speed bottlenecks that justify the effective use of these co-processors (FPGAs).

Figure 2 shows a possible design flow that takes into account the hw/sw co-design methodology [14], based on a detailed analysis of the application tasks, considering the partitioning of the system into software (CPU) and hardware (FPGA co-processor) components.

Fig. 2. HPCIn project design flow.

The basic principle that must guide the design flow is the specification of the applications and their implementation in terms of software and hardware components, as shown in [22], [23]. In this sense, the application designer should need the minimum possible information about the implementation of both the software and the hardware components.

The application designer must be concerned basically with the algorithm that solves his problem and its coding in a high-level language such as C, C++, Java or FORTRAN.

For the software components, the correspondence between the specification and the implementation of the components is guaranteed by compilers.

In order to verify the need to migrate an implementation from software to hardware, with the objective of reaching the desired performance requirements, a set of profiling-based analyses can be carried out to generate performance information, which allows parts of the application code to be selected as candidates for acceleration in FPGA co-processors.

The tasks executed in these regions are then sped up by means of direct code implementation in the reconfigurable hardware. In this work, the co-processors are FPGAs. After the


hw/sw partitioning and interface generation, the hardware design specialists carry out the digital synthesis of the modules to be mapped into the FPGAs [23]. The software components are compiled. To make the concept defined in Figure 2 possible, it is necessary for an application designer to be able to map tasks onto target architectures that use the FPGA, independently of the hardware development team.

This can be done by reusing previously designed components stored in libraries (FPGA configurations). These modules must implement the typical functions used in the application domain. The modules can then be combined to compose the application. The designer will be able to invoke these modules as simple software functions inside the application code. These functions, through an operating system call, are responsible for activating the components in the FPGA, transferring the data for processing and reading back the results.

Figure 3 shows how these function calls, which abstract the hardware, can be implemented.

In this figure we have an example of mathematical acceleration of PETSc library functions [66]. A function in the PETSc library is usually implemented in software. To carry out its acceleration in an FPGA, it is necessary to extend the library with calls implemented by a new library, the reconfigurable PETSc library (RC-PETSc). The new modules that form the hardware library are generated through high-level synthesis (HLS) methods [79]. For each PETSc function to be implemented, there will be a new extension in hardware, called an RC function. This new function carries out the same task as its software version, but the computation now uses the hardware module. The RC function keeps the same call form, parameter passing and return variables. Internally, the RC function must activate the reconfigurable hardware (call_hardware). This function comprises three steps: it sends the core to the FPGA (function_hw), sends the data to be processed (par_in), and returns the results (rep).

Fig. 3. FPGA hardware function call
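The three-step call form just described can be sketched in code. This is a hypothetical illustration only: the names call_hardware, function_hw, par_in and rep come from the text, but the signatures, the hw_core type and the RC_VecSum example function are assumptions, with the FPGA side simulated by a simple accumulator.

```c
#include <stddef.h>

/* Hypothetical sketch of an RC function and its call_hardware helper.
 * The hardware is simulated: send_data "computes" a vector sum the way
 * an FPGA core would, and read_result returns it (rep). */

typedef struct { const unsigned char *bitstream; size_t len; } hw_core;

static double hw_accumulator;                 /* simulated FPGA state */

static void load_core(const hw_core *function_hw)   /* 1. send core   */
{
    (void)function_hw;                        /* configuration stub   */
}

static void send_data(const double *par_in, size_t n)  /* 2. send data */
{
    hw_accumulator = 0.0;
    for (size_t i = 0; i < n; i++)
        hw_accumulator += par_in[i];          /* "hardware" computes  */
}

static double read_result(void)               /* 3. return results    */
{
    return hw_accumulator;
}

static double call_hardware(const hw_core *function_hw,
                            const double *par_in, size_t n)
{
    load_core(function_hw);
    send_data(par_in, n);
    return read_result();
}

/* RC version of a hypothetical vector-sum function: same call form as
 * its software counterpart, but routed through the hardware module. */
static double RC_VecSum(const hw_core *core, const double *x, size_t n)
{
    return call_hardware(core, x, n);
}
```

The key property the sketch preserves is that the caller of RC_VecSum never sees the configuration and transfer steps.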

III. SCIENTIFIC LIBRARY - PETSC

PETSc stands for Portable, Extensible Toolkit for Scientific Computation [66] and constitutes a software library that contains data structures and functions for scalable, parallel scientific applications. These applications are modeled by means of ordinary and partial differential equations. Figure 4 shows the structure of PETSc. This library allows the implementation of object-oriented simulation programs in C++ [9], with the capacity to solve problems with about 500 million variables. They can be run in parallel on clusters with 6000 processors or more.

Fig. 4. Library structure for PETSc functions.

Currently, in the HPCIn project, we are interested in speeding up the algebraic linear processing operations in PETSc. These operations are controlled by the linear solver layer, which implements methods based on Krylov subspaces [66]. The nucleus of the linear functions that we initially intend to speed up is in the BLAS (Basic Linear Algebra Subprograms) sub-library.

The BLAS has a set of operations that involve vectors and matrices and, as we can observe in Figure 4, is at the same level as the MPI and MPI-IO libraries.

These libraries contain data communication functions. This suggests that the BLAS functions are atomic and executed integrally in each node. Thus, a linear system simulation is mapped onto a cluster by partitioning it into sub-groups of BLAS-type operations, each sub-group being executed in a node of the network. This means that the simple inclusion of FPGAs in the cluster can accelerate the BLAS functions, and hence the system processing, reducing the number of processors. This reduction in the number of processors implies a reduction in installation and maintenance costs, a reduction in the performance loss due to communication bottlenecks, a reduction in the need for physical space and savings in electric energy consumption.

Thus, from a simple profiling-based analysis, we can define which BLAS functions must be executed in the FPGA, i.e., those representing the most time-consuming operations. The idea is to keep in the PETSc library, for each operation, multiple implementations of the BLAS functions that can be mapped for simulation in an interchangeable way.

We must have an original implementation of the function in software and several implementations of the same function in hardware (FPGA), with different degrees of parallelism.

The application designer will then have to choose the best option (function) to be used, depending on the acceleration needs and the logic resources available in the FPGA. The way to implement the extensions is illustrated in Figure 3.
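The idea of keeping interchangeable implementations behind a single call form can be sketched with a function pointer. This is a hypothetical illustration, not the actual RC-PETSc mechanism: the dot-product operation, the dot_hw8 variant name and the select_impl helper are all assumptions, and the "hardware" variant simply delegates to the software one here.

```c
#include <stddef.h>

/* Hypothetical sketch: several interchangeable implementations of one
 * BLAS-like operation (here, a dot product) behind a common function
 * pointer, so call sites never change when the variant is swapped. */

typedef double (*dot_impl)(const double *x, const double *y, size_t n);

static double dot_sw(const double *x, const double *y, size_t n)
{
    double s = 0.0;                    /* plain software version */
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Placeholder for an FPGA-backed variant (say, 8-way parallel); in
 * this sketch it just delegates to the software version. */
static double dot_hw8(const double *x, const double *y, size_t n)
{
    return dot_sw(x, y, n);
}

/* The active implementation, chosen once at configuration time. */
static dot_impl dot = dot_sw;

static void select_impl(int fpga_available)
{
    dot = fpga_available ? dot_hw8 : dot_sw;
}
```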

The parameter passing and return mechanisms are kept identical to those of the software BLAS functions.

The code inside the function is modified and implemented with configuration and communication calls to the FPGA,


through the hardware activation function (call_hardware) (see Figure 3). Keeping the original call form makes the substitution of the sequential software algorithms by reconfigurable computation transparent to the programmer. This procedure does not affect the programming style and increases the final simulation performance.

The idea is to make the system as transparent as possible for the user, avoiding the need to rewrite existing application code (legacy code). It must be enough to instruct the compilation to use the functions appropriately chosen in a hw/sw co-design stage [14].

These functions are part of the library extended for hardware (RC-PETSc).

IV. OPERATING SYSTEMS IMPLICATIONS

The reduction in cost, the increase in computational power and the possibility of dynamically reconfiguring FPGAs [7], [10], [11], [12], [17], [21], [29], [30], [35], [37], [38], [41] opened the way for the use of this technology in high performance applications. These advantages also created the possibility of executing several of these applications in the same device. However, the resources in these devices are only used adequately under efficient management. Although FPGAs today are fast, cheap and consume little energy, they are not yet widely used as a computational resource.

The limited use of these devices as computation elements is largely caused by the need for engineering knowledge in computing and electronics, especially digital electronics. Sophisticated hardware design methodologies (high-level, logic and physical synthesis) are also necessary. Normally, application designers are “experts” only in their own development areas, in restricted application domains. This makes design with FPGAs impracticable for them. Thus, the FPGA-based prototyping process involves [17], [38] some difficulties, such as:

• a more complex design flow;
• high software license costs and complex development platforms;
• no easy portability of solutions between devices from different manufacturers or from different generations of the same manufacturer.

The implementation of an operating system that considers the presence of FPGA devices as powerful computational resources, abstracting their complexity, has the potential to remove most of the operational problems in such architectures. These operating systems are called Operating Systems for Reconfigurable Computers (R.O.S) [7], [10], [30], [31], [32], [33], [34], [42], [67], [76]. Figure 5 presents, as an example, the organization of a typical R.O.S that we are developing for the Aquarius platform, in the context of the HPCIn project.

The software architecture presented in Figure 5 aims to provide data communication and reconfigurable resource management, and to facilitate the interaction between users and programs.

This approach makes the FPGA reconfiguration process (Virtex-II board [43]) more flexible and robust. The main software components are: 1- the operating system kernel;

2- the file system; 3- device drivers and user processes, such as PETSc. In the Aquarius platform, a Linux version ported to embedded systems [67], [70], [71], [72], [73], called uCLinux, is used. In this case, release 2.6, adapted to the NIOS-II embedded system from Altera [45], is used. This OS runs on a commercial FPGA prototyping board, the Stratix-II board [45], which works as the Aquarius controller.

Fig. 5. ROS - Reconfigurable Operating System for the Aquarius Platform.

The element that introduces the characteristics of a reconfigurable OS into this platform is the IP-SelectMap module, used during the Virtex-II FPGA reconfiguration. This module is a peripheral connected to the NIOS-II bus system, AVALON [69]. The SelectMap module is responsible for the hardware configurations, according to the FPGA application in the user program. The hardware in this case is represented by an FPGA, a Virtex-II device from Xilinx [43], on a Memec board [43].

Other hardware modules are being developed to implement the data communication tasks between the Virtex-II and the NIOS-II. The physical device, IP-SelectMap, is abstracted from the Operating System through the HAL (Hardware Abstraction Layer) and its driver, the IP-SelectMap-Driver.

The HAL layer makes the R.O.S see the FPGA as a set of logical registers of the IP-SelectMap. The IP-SelectMap allows the definition of the programming mode and the sending of the FPGA configuration bytes (bitstreams) through the 32-bit AVALON bus. In this way, the FPGA is seen as a file by the uCLinux OS. Thus, an application will be able to configure or reconfigure the functions in the FPGA by associating this device with a standard file of the operating system. Hence, all the modules implemented in the FPGA that compose the extension of the PETSc library (RC-PETSc) can be generated as configuration files and stored in the NIOS-II memory system. These standard files are then treated as uCLinux files.
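Since the FPGA is exposed as a standard file, configuration reduces to writing the bitstream to it. A minimal sketch, assuming a hypothetical device path (the actual node name used by the IP-SelectMap driver is not given in the text):

```c
#include <stdio.h>

/* Hypothetical sketch: configure the FPGA by writing a bitstream to
 * the file that the OS exposes for the IP-SelectMap device.  The path
 * passed in (e.g. "/dev/selectmap") is an assumption for illustration;
 * the bytes would go out over the 32-bit AVALON bus. */
static int configure_fpga(const char *device_path,
                          const unsigned char *bitstream, size_t len)
{
    FILE *dev = fopen(device_path, "wb");
    if (!dev)
        return -1;                               /* device unavailable */
    size_t written = fwrite(bitstream, 1, len, dev);
    fclose(dev);
    return written == len ? 0 : -1;              /* 0 on full transfer */
}
```

Because the target is just a file, the same code path can be exercised against an ordinary file for testing.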

More details of this platform will be presented in Section IX, which also describes some experimental results already achieved.

V. TEMPORAL PARTITIONING AND TASK SCHEDULING FOR DYNAMIC RECONFIGURABLE ARCHITECTURES

In general, an application is composed of a set of computational tasks which, in the case of massive data processing involving numerical simulation models, can be represented by large dataflows of floating-point operations. These operations process large masses of data that can reach hundreds of terabytes. In many situations, the amount of logic necessary to implement all the computation can surpass the logic resources available in the FPGA. In this case, it is necessary


to apply two techniques to map the applications:

• temporal partitioning;
• reconfigurable system scheduling.

In massive computation applications that involve large amounts of data, temporal partitioning [35], [51], [52], [53], [77], [78] can be used to create several logic contexts, or partitions, in the FPGA, which are executed sequentially. In this case, groups of tasks are mapped into each one of these partitions and can therefore use a large amount of resources to implement their computation, as they do not need to share the resources with the tasks placed in other partitions. In these cases, we have an increase in the amount of logic available to the tasks. This results in a performance increase, with significant gains in computation time. These gains, however, are only possible if the amount of data processed by each partition is large enough for the acceleration of the data computation to outweigh the loss due to the FPGA reconfiguration time. Figure 6 shows the temporal partitioning flow [77], [78] to be incorporated into the HPCIn project.

The temporal partitioning is responsible for dividing the application into groups of tasks that fit in the FPGAs. Each group is then configured and executed in the FPGA in a different period of time, by means of a time-multiplexing technique. The temporal partitioning method and the analysis of the FPGA logic resources that will be incorporated into Aquarius are proposed by Nascimento in [77], [78] (Figure 6).

This work describes techniques for Design Space Exploration (DSE) [54], [55], [77], [78]. In this approach, the temporal partitioning is directly related to the balanced distribution of the FPGA resources among the tasks of each partition. Thus, it is possible to spend less execution time on the applications.

Fig. 6. Temporal partitioning and Design Space Exploration.

The method presented in [77], [78] also seeks to define the partitions as a function of the best combination of tasks, allowing optimum use of the FPGA resources. In this method, a search process chooses the best implementation for each task from a pre-synthesized library of components. This library keeps multiple different implementations for each task. Each implementation has its own degree of parallelism, with an associated timing (TTask) and FPGA area (ATask). Each task in the library is characterized by a Pareto-optimal curve [54], [55], where the points (ATask, TTask) of each available implementation are located on an “FPGA area” × “computation time” graph.

By carrying out a local search on these curves, for each task present in a partition, it is possible to choose the best combination of task implementations, resulting in the shortest computation time. These local search techniques on Pareto-optimal curves are combined with searches based on heuristics. These heuristics are variations of the Tabu Search method [80], [81], applied to temporal partitioning implementations.
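The selection problem can be sketched in miniature: pick one (ATask, TTask) point per task so that the partition fits the FPGA area budget and the computation time is minimized. The numbers below are made up, and an exhaustive scan over two tasks stands in for the Tabu-search variants cited in the text.

```c
/* Hypothetical Pareto points for two tasks, three implementations
 * each: larger area buys a faster implementation.  Values invented
 * purely for illustration. */
#define NIMPLS 3
static const int area[2][NIMPLS]  = { {10, 20, 40}, {15, 30, 60} };
static const int time_[2][NIMPLS] = { {80, 45, 25}, {90, 50, 30} };

/* Returns the minimal partition time of any implementation pair whose
 * combined area fits in `budget`, or -1 if no pair fits.  Tasks in one
 * partition run concurrently, so the partition time is the maximum of
 * the two task times. */
static int best_time(int budget)
{
    int best = -1;
    for (int i = 0; i < NIMPLS; i++)
        for (int j = 0; j < NIMPLS; j++) {
            if (area[0][i] + area[1][j] > budget)
                continue;                        /* does not fit */
            int t = time_[0][i] > time_[1][j] ? time_[0][i] : time_[1][j];
            if (best < 0 || t < best)
                best = t;
        }
    return best;
}
```

A larger area budget admits faster implementation pairs, which is exactly the area-versus-time trade the Pareto curves encode.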

Scheduling for reconfigurable systems [7], [31], [32], [33], [34], [42], [76] is strongly related to the application temporal partitioning process. A degree of freedom exists in which the temporal partitioning can allow different partitions to execute in the very same time slot, when the platform has more than one FPGA.

Thus, it is necessary to define the best order for the partitions and their best allocation on the platform with respect to its logic resources. This task is carried out by schedulers that are very similar to those used in software. However, it is necessary to incorporate the intrinsic characteristics of reconfigurable logic. An example of a scheduler for real-time applications can be seen in Eskinazy [76]. In this work, the author describes a static scheduler that defines the best resource allocation and the best task order. This scheduler uses methods based on Petri nets [74], [76] and on a left-edge allocation algorithm [76] to take the scheduling decisions. The decisions rely on estimations of the task times and task sizes in the FPGAs. Since each partition generated by a temporal partitioning method can be considered a single task, we can adapt this scheduler [76] to our approach.

Fig. 7. Eskinazy model for static scheduling of tasks in a multi-FPGA platform.

The scheduling method proposed by Eskinazy is presented, as an example, in Figure 7. In this model, each task Taski is an FPGA partition, which is represented by a transition in the Petri net graph. The execution of a start transition (PAR_S) initiates the execution of the tasks, which can have serial, parallel or arbitrary behavior (in the example of Figure 7, all the tasks can be executed in parallel). The tasks share a set of

68 INTERNATIONAL JOURNAL OF MODELING AND SIMULATION FOR THE PETROLEUM INDUSTRY, VOL. 1, NO. 1, AUGUST 2007

reconfigurable logic resources, represented by the number of tokens (the tokens are in the place called ‘FPGAs’ in Figure 7). In the presented example, we have only 3 tokens (3 FPGAs). This means that only three partitions (Taski's) can be executed simultaneously. In this case, the scheduler must decide which three partitions must be initiated simultaneously and which must stay in a wait state due to the non-availability of FPGA resources. This problem can be solved, for any arbitrary set of tasks, by means of state coverability analysis based on the Petri net, as presented in [74], [76]. It is also necessary to determine onto which of the available FPGAs each partition must be mapped. For this, the work presented in [76] proposes the left-edge algorithm, originally created to allocate variables to registers [76].
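The token-limited execution idea can be sketched numerically: the tokens model the available FPGAs, and a partition starts only when one is free. This greedy first-free loop is a stand-in chosen for illustration; it replaces the Petri-net coverability analysis and left-edge allocation of [76], and the partition durations are invented.

```c
#define NFPGAS 3   /* tokens: FPGAs available at any instant */
#define NPARTS 5   /* partitions to execute                  */

/* Given each partition's (hypothetical) duration, returns the total
 * completion time when at most NFPGAS partitions run at once.  Each
 * partition takes the FPGA whose token frees earliest, waits for it,
 * then runs; partitions start in index order. */
static int makespan(const int dur[NPARTS])
{
    int finish[NFPGAS] = {0};            /* when each token frees */
    for (int p = 0; p < NPARTS; p++) {
        int f = 0;                       /* FPGA freeing earliest */
        for (int i = 1; i < NFPGAS; i++)
            if (finish[i] < finish[f])
                f = i;
        finish[f] += dur[p];             /* acquire token and run */
    }
    int m = 0;                           /* overall completion    */
    for (int i = 0; i < NFPGAS; i++)
        if (finish[i] > m)
            m = finish[i];
    return m;
}
```

With three tokens, the first three partitions start at once and the remaining two wait, exactly the behavior the Petri-net model expresses.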

The methods for temporal partitioning and scheduling presented in this section are being integrated into a design environment that also incorporates commercial FPGA synthesis tools, to implement the design flow depicted in Figure 2.

VI. MEMORY ARCHITECTURE AND BUS COMMUNICATION

FPGAs, with their potential parallelism, are an attractive option for the acceleration of high performance system applications that demand parallel processing [6], [56], [57]. However, this acceleration is intrinsically related to the efficiency of the memory system [15], [16], [24], [25]. Massive data computation applications need high-speed, large-bandwidth memory systems. In this respect, FPGAs are attractive because of their high density of general-purpose I/O pins (GPIOs). These GPIOs can be used to create multiple memory access channels. As an example, the Virtex-II XC2V2000 FPGA [43] has 2 million gates (one gate is equivalent to a two-input NAND gate) for task implementation and ∼600 I/O pins. In this architecture, it is possible to have 10 memory channels, each one with a 24MB address space and a 16-bit data bus width. Simultaneous data access is also possible through shared memory systems. In Figure 8, we have two possible shared memory space architectures.

Fig. 8. CrossBar and HyperCube architectures.

In the upper left part of Figure 8, we have a CrossBar connection [22] among 4 memory circuits (A, B, C, D) and 4 FPGAs (W, X, Y, Z). This architecture allows complete connectivity. Thus, the FPGAs can access any memory block.

However, this architecture is suitable only for a small numberof elements, because of the high cost and complexity of thetheir connections.

A good option that preserves a good degree of connectivity with a low cost implementation is the HyperCube architecture [56], [57], shown at the bottom of Figure 8. In this approach the data traffic between any pair of nodes goes through intermediate nodes, always taking the shortest possible path, which guarantees the lowest latency. Unlike the CrossBar model, which keeps a constant access time between any FPGA-memory pair, in the HyperCube model this access time is proportional to the minimal distance between the elements. The HyperCube requires fewer connections and presents better scalability than the CrossBar architecture.
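The "minimal distance" in a HyperCube has a simple closed form: nodes are labeled with binary strings, one hop flips one bit, so the shortest path length between two nodes is the Hamming distance of their labels. A small sketch of this routing (our illustration, not taken from [56], [57]):

```python
def hypercube_distance(a, b):
    """Minimum number of hops between nodes a and b in a hypercube:
    the Hamming distance between their binary labels."""
    return bin(a ^ b).count("1")

def hypercube_route(a, b):
    """One shortest path from a to b, flipping one differing bit per hop."""
    path = [a]
    diff = a ^ b
    bit = 0
    while diff:
        if diff & 1:
            a ^= 1 << bit
            path.append(a)
        diff >>= 1
        bit += 1
    return path
```

For example, nodes 000 and 101 differ in two bits, so they are two hops apart regardless of cube size.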

Neither architecture, however, takes into account any particular characteristic of a given application. Thus, although these architectures can be useful for the external connection of FPGAs and memory systems, they may not be a good option for the internal FPGA-memory structure. On the other hand, the large internal routing resources of FPGAs make it easy to customize connections according to each application.

He Chuan et al. [2], [50] present the advantages of FPGAs over conventional CPUs with respect to memory structures:

• Because of the memory hierarchy present in CPU-based systems, there is a significant distance between the theoretical peak processor performance, in Mflops, and the real performance that can be reached. According to Chuan, even with manual optimization it is sometimes only possible to reach 50% of the theoretical performance; in more realistic cases this number falls to 5%.

• FPGAs can provide much more memory bandwidth, due to their high density of GPIOs. As an example: the management of 8 simultaneous 64-bit DDRAM channels [50].

• FPGAs allow explicit management of the connections between registers, internal memory blocks and functional units. This allows the “data cache design” and “data buffering” to be made according to the application data flow.

• FPGAs do not need to execute stored programs. This eliminates the need to occupy the memory system with instruction streams.

The characteristics above, presented by Chuan, indicate that a memory hierarchy style can be adopted as a mixed distributed structure of caches and functional units inside the FPGA. In this approach data reuse can also be better exploited during task implementation. To exemplify these ideas, an example produced by Chuan in [50] is presented in Figure 9.

In this example, we see how the connections with the external memories and the cache are organized. The connections with the external memory use four channels (to access the speed and pressure field tables). One of the memory channels feeds a cache memory system with a standard organization (line buffers and sample registers P(a,b)) that is specific for an efficient implementation of the acoustic wave equation resolution algorithm [50]. All the functional units and the datapath associated to these units are represented by the computing module engine (circle in the center of Figure 9). We intend to use the same principles of this design to implement the BLAS sub-library functions, present in PETSc, as efficient IP-Cores in FPGA.

Fig. 9. IP-Core FPGA implementation for simulation of a 2D acoustic wave propagation.
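The line-buffer organization can be mimicked in software to see why it saves bandwidth: each sample is fetched from external memory once, and the 5-point neighborhood a wave-equation kernel consumes is assembled from buffered lines. This is our own simplified sketch of the idea, not Chuan's circuit:

```python
from collections import deque

def stencil_stream(samples, width):
    """Stream a row-major 2D field through line buffers and return,
    for each interior point, the 5-point neighborhood (N, W, C, E, S).
    Each sample is read from 'memory' exactly once; all reuse comes
    from the three buffered lines."""
    rows = deque(maxlen=3)         # the three most recent lines
    row = []
    out = []
    for s in samples:
        row.append(s)
        if len(row) == width:      # a full line arrived: shift the buffers
            rows.append(row)
            row = []
            if len(rows) == 3:
                top, mid, bot = rows
                for x in range(1, width - 1):
                    out.append((top[x], mid[x - 1], mid[x], mid[x + 1], bot[x]))
    return out
```

For a 3x3 field numbered 1..9, the single interior point 5 yields the neighborhood (2, 4, 5, 6, 8).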

We consider that one of the main challenges of the HPCIn project consists in the generation of these IP-Cores associated with a design style and code reuse. This methodology must allow transparency between the development of an application and its implementation in FPGA, as described in Section II.

VII. FLOATING-POINT ARITHMETIC UNITS

In many numerical simulation problems, such as those presented in the oil industry [2], [47], [48], [49], [50] and in mechanical simulations, it is necessary that the calculations of complex systems be carried out with high precision arithmetic. For this, double precision floating-point numbers based on the IEEE 754 standard [60], [61] have been used. These IP-Cores, based on the IEEE 754 standard, need, in our case, to be implemented targeting an FPGA mapping. These cores for dataflow processing can significantly speed up the BLAS sub-library operations from PETSc.
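For reference, the IEEE 754 binary64 layout those cores implement is 1 sign bit, 11 exponent bits (bias 1023) and 52 fraction bits. The decomposition can be checked in software:

```python
import struct

def ieee754_fields(x):
    """Split a Python float (IEEE 754 binary64) into its sign,
    11-bit biased exponent and 52-bit fraction fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction

# 1.0 is stored as sign 0, exponent 1023 (bias 1023, value 2**0), fraction 0.
assert ieee754_fields(1.0) == (0, 1023, 0)
assert ieee754_fields(-2.0) == (1, 1024, 0)
```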

In the context of the HPCIn project, we intend to implement high efficiency IP-Cores (minimum area and high speed) for the PETSc BLAS sub-library that incorporate floating-point units.

All these IP-Cores will also have a standard communication interface (wrapper). Figure 10 shows a schematic representation of the strategy in an FPGA architecture.

IP-Cores for floating-point operations are implemented as functional units (addition, subtraction, multiplication, square root, etc.) and stored in a low-level library (called the arithmetic library).

The BLAS operations extended for the reconfigurable system must be developed based on this library. These modules can be

Fig. 10. Strategies for the implementation of floating-point operations in FPGA.

supplied as soft-cores or firm-cores, or still be developed in a proprietary approach. In the Aquarius platform the basic floating-point modules will be specified in languages like C, C++, SystemC [59] or in an HDL (Hardware Description Language) like VHDL or Verilog.

Figure 11 shows a portion of a programmable logic array from the Xilinx Virtex-II FPGA [43]. Currently there are no FPGA devices on the market with dedicated floating-point units; today, only integer multiplier blocks, RAM memory blocks and CPUs are available as hard blocks.

In the case of the Virtex-II, in Figure 11, there are 18-bit integer multipliers. Each multiplier occupies a space equivalent to 4 CLBs (Configurable Logic Blocks).

Fig. 11. Current FPGA technology and specialized blocks in the logic array.

Thus, due to the inexistence of floating-point multiplier units, these are implemented from logic blocks (CLBs), reducing the final operation speed and occupying a considerable internal area. However, despite these difficulties, this arithmetic in FPGA can surpass the real performance of conventional modern CPUs [12], [21].

Although it is currently possible to incorporate matrices of floating-point blocks in FPGAs, the present demand for such high precision architectures does not yet justify the inclusion of such hard-cores inside commercial devices, probably because of their high implementation cost. Figure 12 shows a matrix equivalent to that presented in Figure 11, but with floating-point blocks.

Fig. 12. Future commercial FPGA technology with the introduction of specialized floating-point units.

This comparison makes evident the high cost of implementing floating-point units in current FPGA technology. In this example, each floating-point unit consumes 18 CLBs.

VIII. COMMERCIAL RECONFIGURABLE PLATFORMS

Currently a considerable number of hardware accelerating platforms is available. These systems range from general purpose ones to application-specific ones, like bioinformatics application platforms [62]. These platforms, in general, are developed for installation on large servers for HPC (High Performance Computing) [65], like the SGI platforms (Figure 13), down to personal computers, like PC desktops or laptops [62].

Figure 13 illustrates a series of available commercial platforms that have been investigated.

In the scope of the HPCIn project, a prototype must be developed on a PC platform. Thus, four platforms have been selected to be detailed in this work:

• XtremeData - XD1000 (Figure 13a) [63].
• CLC-Bio Cube (Figure 13c) [62].
• AVNet - Virtex-II Board (Figure 14) [64].
• Aquarius Platform (under development in the HPCIn project) (Figure 15) [78], [82].

A. XtremeData-XD1000 Platform

In the XtremeData-XD1000 platform [63] the co-processor, a Xilinx Virtex-II FPGA device, is directly connected to the socket of a dual core Opteron CPU by a high speed bus (5.4GB/s). The FPGA is connected to a Socket 940 bus in a multi-CPU board. The FPGA has direct access to the motherboard's local DDRAM memory bank. The platform has device drivers available for the Linux operating system. This OS allows the FPGA reconfiguration control with libraries of pre-synthesized modules and a complete framework for the development of new applications.

Fig. 13. Commercial platforms for hardware acceleration.

B. CLC-Bio Cube Platform

The CLC-Bio Cube platform was developed for bioinformatics applications (genomics, proteomics) [62]. Its concept is to allow an easy installation, where the co-processors can be connected through a simple USB external communication interface. Thus, any PC, desktop or laptop, can easily be connected. However, the USB interface can represent a communication bottleneck for some types of computation that involve massive data transfer between the main CPU and the FPGA co-processor.

C. AVNet - Virtex-II Platform

This is a PCI-X based platform. In this architecture the co-processor is implemented in a Virtex-II. The PCI core is implemented in the FPGA together with the application. Figure 14 shows the PCI-X board from AVNET.

Fig. 14. Prototype board AVNet - Virtex-II.

All the DDRAM memory infrastructure also needs to be included in the FPGA bitstreams. The synthesis tools are made available by the FPGA manufacturer (ISE and EDK from Xilinx [43]). The data communication between the hardware co-processor and the host computer is done through the PCI/PCI-X bus. These platforms are currently being studied towards the HPCIn project platform definition. The final version must take into account the involved cost, performance, power and target applications.


IX. EXPERIMENTAL RESULTS

In parallel to the use and study of commercial design kits, a reconfigurable platform, called Aquarius, is under development. This platform follows the flow described in Section II and the OS considerations from Section IV. The platform also takes into account the application temporal partitioning and scheduling methods presented in Section V. Figure 15 shows the Aquarius prototype board and its main components.

Fig. 15. Aquarius reconfigurable computer platform with the uCLinux operating system.

A. Aquarius Platform

The Aquarius platform [78], [82], which is being developed by the HPCIn project group, currently has as its main goal to allow the implementation of small scale application examples, to validate its acceleration methods (proof of concept). This platform consists of a reconfigurable computer based on FPGAs. It allows massive data processing applications to run on a virtual hardware machine [10], [11], [29], [30], [35], [77], [78], based on FPGA dynamic reconfiguration methods. Its architecture was conceived to allow the accomplishment of static and dynamic FPGA reconfigurations, taking advantage of the techniques for temporal partitioning [77], [78] and task scheduling [76].

Although still small in memory capacity and communication resources, its conception will allow scalability with the inclusion of: new, larger FPGAs, to increase the processing capacity; new peripheral devices for communication; and more memory capacity to store a great amount of data. Aquarius uses a customized embedded operating system based on Linux, the uCLinux, which was adapted to take advantage of the reconfigurable architecture and its resources.

The platform is, in fact, a hybrid platform composed of two commercial prototype boards: one with an Altera Stratix-II FPGA [45] and one with a Xilinx Virtex-II FPGA [43].

This environment provides all the support for the Virtex-II configuration, used as a hardware co-processor for accelerating bottleneck blocks, and implements a system based on the NIOS-II soft-core microprocessor [75] from Altera. The NIOS-II runs uCLinux [72], responsible for all the Aquarius platform control.

The Virtex-II FPGA is used as a reconfigurable logic resource for the implementation of computational tasks in hardware, because of demands in parallelism and performance that cannot be met by a software implementation on the NIOS-II. Thus, Aquarius is a hw-sw codesign platform, where the software components are executed on the NIOS-II and the hardware ones in the Virtex-II. For Virtex-II programming a special module was developed, the IP-SelectMAP. This module, which is instantiated inside the Stratix-II and connected to the system by the NIOS-II bus, controls the SelectMap interface module [43], [68], in charge of the Virtex-II configuration.

Aquarius allows multiple hardware component configurations. The configuration bitstreams can be stored in its CompactFlash (connected to the Stratix-II FPGA). Interfaces for communication and control of the tasks are under development. All interfaces must be system compatible, guaranteed by wrappers, as in Figure 10.

B. Temporal Partitioning and Task Scheduling

We have developed heuristics for design space exploration that solve the temporal partitioning problems of applications and choose task implementations from IP-core libraries. These heuristics have been described in Section V and are presented in [77], [78]. In these works, experimental results show the power of the heuristics to generate high performance implementations in a low computational time. They have also demonstrated the capacity to face the implementation of massive data applications, running only 2% (on average) slower than the best possible time (the minimum time possible regarding the reconfigurable platform resource constraints). In many cases, our heuristics find the best solutions, allowing real high performance solutions.

The scheduling methods in [76] have also been validated and integrated with the temporal partitioning approach, generating tools that can create fast implementations, with good quality and a better use of FPGA resources, for high precision scientific computation problems.

C. uCLinux Operating System on the Aquarius Platform

The Aquarius platform is controlled by uCLinux [67], [70], [71], [72], [73]. In order to provide transparency for the operating system resources, a series of special components has been developed, targeting the reconfigurable hardware. In this direction, two IP-SelectMap control drivers have been created for the reconfigurable logic management (Virtex-II). The drivers are listed in Table I below.

These drivers abstract the SelectMap Virtex-II programming interface, which is seen by uCLinux as a system device and accessed as a file. The configurations (bitstreams), as said before, are stored in the on-board CompactFlash memory. Any chosen FPGA configuration can be sent to the Virtex-II by a standard OS read/write file operation. Both device drivers follow the architecture shown in Figure 16.
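Because the driver exposes the configuration interface as a file, sending a bitstream reduces to ordinary open/write calls. The sketch below is illustrative only: the device path "/dev/ipsm" and the helper name are our assumptions, not the actual Aquarius utilities.

```python
def configure_fpga(bitstream_path, device="/dev/ipsm"):
    """Stream a configuration bitstream to the FPGA device file in
    32-bit (4-byte) blocks, mirroring the IP-SelectMap transfer unit.
    Returns the number of bytes sent."""
    sent = 0
    with open(bitstream_path, "rb") as src, open(device, "wb") as dev:
        while True:
            word = src.read(4)      # one 32-bit configuration word
            if not word:
                break
            dev.write(word)
            sent += len(word)
    return sent
```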


TABLE I
uCLinux DEVICE DRIVERS FOR Virtex-II CONFIGURATION CONTROL IN Aquarius.

Driver name: Driver functionality:

IPSM_WAIT: Transfers the configuration bitstream in 32-bit blocks from the NIOS-II to the Virtex-II through the IP-SelectMap module, using a polling synchronization strategy based on the Avalon Wait State [ ].

IPSM_IRQ: Transfers the configuration bitstream to the Virtex-II through IP-SelectMap, in a sequence of 32-bit blocks, using the Interrupt Request (IRQ) method.

Fig. 16. Layer-based architectural model for the Aquarius platform device drivers.

The device drivers are OS components that must perform the following functions: 1 - control a specific hardware component; 2 - provide access services to other software components; 3 - deal with errors in the manipulation of hardware devices. Analyzing the device drivers in Figure 16, we see that the HAL (Hardware Abstraction Layer) is the layer closest to the physical device. This layer supplies access functions that depend on the hardware characteristics. Through the functions implemented in the HAL, the NIOS-II processor can access the data, control and status registers of the IP-SelectMAP.

The next layer is the kernel, considered the “brain” of the device drivers. This layer is responsible for the IP-SelectMap operation mode, involving its control and the register access sequence. These parameters define the services offered to the application layers. The kernel is also responsible for the service consistency verification. In Aquarius, for instance, this layer performs the “correctness” verification of the FPGA reconfiguration processes.

The upper driver layer is called the API (Operating System Application Program Interface). The API controls communication between the applications and the kernel. This resource provides an easy way for users to access all the services offered by the FPGA co-processor.
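The three layers can be caricatured in a few lines to show the call direction (application to API, API to kernel, kernel to HAL). All names and the register map below are invented for illustration; they are not the Aquarius sources:

```python
REGS = {"data": 0, "ctrl": 0, "status": 1}   # toy register map; status 1 = ready

def hal_write(reg, value):
    """HAL layer: raw, hardware-dependent register access."""
    REGS[reg] = value

def kernel_send_word(word):
    """Kernel layer: enforces the access sequence and consistency checks."""
    if REGS["status"] != 1:
        raise IOError("IP-SelectMap not ready")
    hal_write("data", word)

def api_write(words):
    """API layer: the service an application actually calls."""
    for w in words:
        kernel_send_word(w)
    return len(words)
```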

Table II presents the API functions of the system device drivers (IPSM).

The device drivers presented in this section create an abstraction of the Virtex-II FPGA, which is then accessed by the uCLinux OS as a standard device. To complete the software

TABLE II
API FUNCTIONS OF THE IPSM DEVICE DRIVERS IN Aquarius.

API function: Service description:

ipsm_open: Opens the device and associates a file with the FPGA configuration.

ipsm_write: Writes a specified amount of bytes from a user buffer space to the IP-SelectMAP interface.

ipsm_init: Configures and initializes the driver.

ipsm_ctrl: Specifies the driver operation mode.

interrupt_handler: Handles the interrupts generated by IP-SelectMAP.

architecture, three utility programs have been developed. These programs are listed in Table III with their respective syntaxes.

TABLE III
Aquarius UTILITY PROGRAMS FOR FPGA DYNAMIC RECONFIGURATION.

Program: Syntax: Description: Drivers:

ipsm_prog: ipsm_prog t total.bin - Total Virtex-II reconfiguration. Driver: IPSM_WAIT.

ipsm_prog: ipsm_prog p parcial.bin - Partial Virtex-II reconfiguration. Driver: IPSM_WAIT.

ipsm_sch: ipsm_sch s schedule.txt - Total and partial Virtex-II reconfiguration based on the partitioning results of the static scheduling process. Driver: IPSM_IRQ.

The software components listed in Tables I and II are special elements that customize the uCLinux OS, so that it starts to provide resources as an Operating System for Reconfigurable Hardware (ROS), as defined in Section IV.

Currently, the Aquarius platform is being extended to incorporate other components into its ROS, in order to provide a more efficient implementation of massive data processing applications. These new functionalities include the introduction of a scheduler for reconfigurable hardware, memory management and communication among processes.

X. CONCLUSIONS

This work presented an overview of important points related to high performance processing systems based on reconfigurable logic devices, here represented by FPGA co-processors. A possible design flow was suggested, as well as aspects related to hardware/software partitioning, process scheduling, device driver generation and the implications of commercial operating systems when adapted for reconfigurable ones. Some high performance commercial platforms were presented, as well as the Aquarius platform. Aquarius is under development in the Computation Engineering Group (GRECO), Center of Informatics, at the Federal University of Pernambuco. This platform will be used for the implementation of case studies to validate the suggested design flow.

Particularly, Aquarius was presented as a hybrid platform, based on Xilinx Virtex-II and Altera Stratix-II FPGA platforms. It is part of the HPCIn project, which integrates the Research Network in Computational Modeling (RPCMod), in partnership with the Petrobras Research Center (CENPES).

In this network, Aquarius will be used for the acceleration of scientific simulations in oil and gas exploration applications, aiming to meet the high performance needs present in such applications. This solution also aims to deliver low cost, high processing speed and reduced electric power consumption when compared to traditional systems.

The use of temporal partitioning promotes a better use of the FPGA resources, which can now be shared by distinct tasks since they run in different time intervals. This has the effect of increasing the amount of resources available in the FPGA for each task, further allowing a greater exploitation of parallelism and an increase in performance. In the case of a multi-FPGA platform, scheduling methods for resource management have also been considered, taking into account the best sequence of partition activations and their allocation to FPGAs.

This work also discussed the importance of hardware code reuse. As in software, this methodology gives designers the chance to reuse code previously used, along with information from its previous design space exploration. Thus, the time for designing simulators that make use of the reconfigurable logic resources is significantly reduced.

The proposed model provides facilities that are requested in many high performance application areas, such as medical diagnosis by image analysis, bioinformatics and digital signal processing, among others, which can be adapted to the platform without any change to its structure.

The use of an embedded operating system like uCLinux was extremely important to manage resources such as board memory access, the file system and the reconfigurable logic resources, guaranteeing hardware abstraction and easing application development. These facilities were made possible by the development of device drivers for the Aquarius platform, extending the uCLinux standard with ROS characteristics. New features such as a process scheduler, a memory management system and inter-process communication will soon be incorporated into this ROS.

Heuristic methods for design space exploration analysis are still under development. IP-Cores for BLAS functions, applied in scientific applications that make intensive use of linear algebraic solutions, will also be implemented as hardware components to improve their performance.

In the near future we intend to integrate the design flow tools and circuit implementation with our heuristics and all the reconfigurable platform resources. As a real case study, Aquarius will be applied to oil and gas exploration industry applications.

XI. ACKNOWLEDGMENT

We would like to thank the Petrobras Research Center (CENPES), in the name of Mr. Ismael H. F. Santos, for his interest and dedication to this work. Thanks also to FINEP and the RPCMod network coordination.

REFERENCES

[1] BORGATTI, M. et al. A reconfigurable signal processing IC with embedded FPGA and multi-port flash memory. In: DAC '03: Proceedings of the 40th Conference on Design Automation. New York, NY, USA: ACM Press, 2003. p. 691–695. ISBN 1-58113-688-9.

[2] HE, C. et al. Prestack Kirchhoff time migration on a high performance reconfigurable computing platform. In: Proceedings of the SEG 2005 International Exposition and 75th Annual Meeting. [s.n.], 2005. Available at: <http://faculty.cs.tamu.edu/zhao/papers/conf/2005/0511-IEAM-HSLZ.pdf>.

[3] PLAZA, A.; VALENCIA, D.; PLAZA, J. High-performance computing in remotely sensed hyperspectral imaging: The purity index algorithm as a case study. In: Proceedings of the 7th Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC). Rhodes Island, Greece: [s.n.], 2006.

[4] YOSHIMI, M. et al. The design of a scalable stochastic biochemical simulator on FPGA. In: Proceedings of the IEEE 2005 Conference on Field Programmable Technology (FPT05). Singapore: [s.n.], 2005.

[5] RAMOS, J. et al. High-performance, dependable multiprocessor. In: Proceedings of the IEEE Aerospace Conference. Big Sky, MN: [s.n.], 2006.

[6] ABU-GHAZALEH, N. et al. Exploiting HPC for parallel discrete event simulation. In: Proceedings of the 2004 Users Group. [S.l.: s.n.], 2004.

[7] CARVALHO, E. et al. PaDReH: a framework for the design and implementation of dynamically and partially reconfigurable systems. In: SBCCI '04: Proceedings of the 17th Symposium on Integrated Circuits and System Design. New York, NY, USA: ACM Press, 2004. p. 10–15. ISBN 1-58113-947-0.

[8] CHODOWIEC, P. et al. Experimental testing of the gigabit IPSec-compliant implementations of Rijndael and Triple DES using the SLAAC-1V FPGA accelerator board. In: ISC '01: Proceedings of the 4th International Conference on Information Security. London, UK: Springer-Verlag, 2001. p. 220–234. ISBN 3-540-42662-0.

[9] COAD, P.; YOURDON, E. Object-Oriented Design. Upper Saddle River, NJ, USA: Yourdon Press, 1991. ISBN 0-13-630070-7.

[10] COMPTON, K. Programming Architectures for Run-Time Reconfigurable Systems. M.Sc. Dissertation, Northwestern University, Dept. of ECE, December 1999. Available at: <citeseer.ist.psu.edu/compton99programming.html>.

[11] COMPTON, K.; HAUCK, S. Reconfigurable computing: a survey of systems and software. ACM Comput. Surv., ACM Press, New York, NY, USA, v. 34, n. 2, p. 171–210, 2002. ISSN 0360-0300.

[12] COMPTON, K.; HAUCK, S. Flexibility measurement of domain-specific reconfigurable hardware. In: FPGA '04: Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM Press, 2004. p. 155–161. ISBN 1-58113-829-6.

[13] LEESER, M. et al. Parallel-beam backprojection: An FPGA implementation optimized for medical imaging. J. VLSI Signal Process. Syst., Kluwer Academic Publishers, Hingham, MA, USA, v. 39, n. 3, p. 295–311, 2005. ISSN 0922-5773.

[14] DEMICHELI, G. Hardware/software co-design: Application domains and design technologies. In: DEMICHELI, G.; SAMI, M. (Ed.). Hardware/Software Co-design. Kluwer Academic Publishers, 1996. p. 1–28.

[15] DINIZ, P. C.; PARK, J. Data reorganization engines for the next generation of system-on-a-chip FPGAs. In: FPGA '02: Proceedings of the 2002 ACM/SIGDA Tenth International Symposium on Field-Programmable Gate Arrays. New York, NY, USA: ACM Press, 2002. p. 237–244. ISBN 1-58113-452-5.

[16] DINIZ, P.; PARK, J. Automatic synthesis of data storage and control structures for FPGA-based computing engines. In: FCCM '00: Proceedings of the 2000 IEEE Symposium on Field-Programmable Custom Computing Machines. Washington, DC, USA: IEEE Computer Society, 2000. p. 91. ISBN 0-7695-0871-5.

[17] DITTMANN, F.; RETTBERG, A.; SCHULTE, F. A Y-chart based tool for reconfigurable system design. In: Workshop on Dynamically Reconfigurable Systems (DRS). Innsbruck, Austria: VDE Verlag, 2005. p. 67–73.


[18] DRAPER, B. A. et al. Accelerated image processing on FPGAs. IEEE Transactions on Image Processing, v. 12, n. 12, p. 1543–1551, 2003. Available at: <http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1257391>.

[19] ELDREDGE, J. G.; HUTCHINGS, B. L. Density enhancement of a neural network using FPGAs and run-time reconfiguration. In: BUELL, D. A.; POCEK, K. L. (Ed.). IEEE Workshop on FPGAs for Custom Computing Machines. Los Alamitos, CA: IEEE Computer Society Press, 1994. p. 180–188. Available at: <citeseer.ist.psu.edu/eldredge94density.html>.

[20] GRAHAM, P.; NELSON, B. FPGA-based sonar processing. In: FPGA '98: Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM Press, 1998. p. 201–208. ISBN 0-89791-978-5.

[21] GUO, Z. et al. A quantitative analysis of the speedup factors of FPGAs over processors. In: FPGA '04: Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM Press, 2004. p. 162–170. ISBN 1-58113-829-6.

[22] HAUCK, S. The roles of FPGAs in reprogrammable systems. Proceedings of the IEEE, v. 86, n. 4, p. 615–638, 1998. Available at: <http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=663540>.

[23] HAUCK, S.; AGARWAL, A. Software technologies for reconfigurable systems. Submitted to Proceedings of the IEEE, 1997. Available at: <citeseer.ist.psu.edu/hauck96software.html>.

[24] CORBAL, J.; ESPASA, R.; VALERO, M. Command vector memory systems: High performance at low cost. In: PACT '98: Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques. Washington, DC, USA: IEEE Computer Society, 1998. p. 68. ISBN 0-8186-8591-3.

[25] JAYASENA, N.; DALLY, W. J. Streams and vectors: A memory system perspective. 6th Workshop on Media and Streaming Processors, Portland, OR, USA, December 2004.

[26] LEESER, M.; MILLER, S.; YU, H. Smart camera based on reconfigurable hardware enables diverse real-time applications. In: FCCM '04: Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. Washington, DC, USA: IEEE Computer Society, 2004. p. 147–155. ISBN 0-7695-2230-0.

[27] LOCKWOOD, J. W. et al. Reprogrammable network packet processing on the field programmable port extender (FPX). In: FPGA '01: Proceedings of the 2001 ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM Press, 2001. p. 87–93. ISBN 1-58113-341-3.

[28] MOISSET, P.; DINIZ, P.; PARK, J. Matching and searching analysis for parallel hardware implementation on FPGAs. In: FPGA '01: Proceedings of the 2001 ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM Press, 2001. p. 125–133. ISBN 1-58113-341-3.

[29] MURPHY, C. Virtual hardware using dynamically reconfigurable field programmable gate arrays. In: GARS2005: Proceedings of the GERI Annual Research Symposium. [S.l.: s.n.], 2005.

[30] NEEMA, S.; BAPTY, T.; SCOOT, J. Adaptive computing and runtime reconfiguration. In: MAPLDCon99. [S.l.: s.n.], 1999.

[31] WALDER, H.; PLATZNER, M. Online scheduling for block-partitioned reconfigurable devices. In: DATE '03: Proceedings of the Conference on Design, Automation and Test in Europe. Washington, DC, USA: IEEE Computer Society, 2003. p. 10290. ISBN 0-7695-1870-2.

[32] WALDER, H.; PLATZNER, M. Reconfigurable hardware operating systems: From concepts to realizations. In: Int'l Conf. on Engineering of Reconfigurable Systems and Architectures (ERSA). [s.n.], 2003. Available at: <citeseer.ist.psu.edu/walder03reconfigurable.html>.

[33] STEIGER, C.; WALDER, H.; PLATZNER, M. Operating systems for reconfigurable embedded platforms: Online scheduling of real-time tasks. IEEE Trans. Comput., IEEE Computer Society, Washington, DC, USA, v. 53, n. 11, p. 1393–1407, 2004. ISSN 0018-9340.

[34] PLESSL, C. Reconfigurable hardware in wearable computing nodes. In: ISWC '02: Proceedings of the 6th IEEE International Symposium on Wearable Computers. Washington, DC, USA: IEEE Computer Society, 2002. p. 215. ISBN 0-7695-1816-8.

[35] PLESSL, C.; PLATZNER, M. Virtualization of hardware - introduction and survey. ERSA, 2004.

[36] RIZZO, D.; COLAVIN, O. A video compression case study on a reconfigurable VLIW architecture. In: DATE '02: Proceedings of the Conference on Design, Automation and Test in Europe. Washington, DC, USA: IEEE Computer Society, 2002. p. 540.

[37] ROSENSTIEL, W. Reconfigurable systems – a new era in digital system design has just begun. University of Tubingen, WSI – Dept. of Computer Engineering, 2001.

[38] SCHAUMONT, P. et al. A quick safari through the reconfigurationjungle. In: DAC ’01: Proceedings of the 38th conference on Designautomation. New York, NY, USA: ACM Press, 2001. p. 172–177. ISBN1-58113-297-2.

[39] Spaanenburg, L. et al. Natural learning of neural networks by recon-figuration. In: Rodriguez-Vazquez, A.; Abbott, D.; Carmona, R. (Ed.).Bioengineered and Bioinspired Systems. Edited by Rodriguez-Vazquez,Angel; Abbott, Derek; Carmona, Ricardo. Proceedings of the SPIE, Volume5119, pp. 273-284 (2003). [S.l.: s.n.], 2003. (Presented at the Society ofPhoto-Optical Instrumentation Engineers (SPIE) Conference, v. 5119), p.273–284.

[40] SWAMINATHAN, S. et al. A dynamically reconfigurable adaptiveviterbi decoder. In: FPGA ’02: Proceedings of the 2002 ACM/SIGDA tenthinternational symposium on Field-programmable gate arrays. New York,NY, USA: ACM Press, 2002. p. 227–236. ISBN 1-58113-452-5.

[41] TOMMISKA, M. Applications of Reprogrammability in Algorithm Acceleration. PhD Thesis — Department of Electrical and Communications Engineering, Helsinki University of Technology, March 2005.

[42] WIGLEY, G.; KEARNEY, D. Research issues in operating systems for reconfigurable computing. In: Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA). [S.l.: s.n.], 2002. p. 10–16.

[43] XILINX. Virtex II Platform FPGA User Guide. [S.l.], December 2002.

[44] BARRETT, R. et al. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. Philadelphia: Society for Industrial and Applied Mathematics, 1994. Also available at http://www.netlib.org/templates/Templates.html. Available at: <http://citeseer.ist.psu.edu/46493.html>.

[45] Site of Altera - www.altera.com.

[46] AVNET. SoC-RDS01 Rapid Development System. [S.l.].

[47] SMITH, M. C. Scientific computing beyond CPUs: FPGA implementations of common scientific kernels. MAPLD, 2005.

[48] ZHUO, L.; PRASANNA, V. K. High performance linear algebra operations on reconfigurable systems. In: SC '05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing. Washington, DC, USA: IEEE Computer Society, 2005. p. 2. ISBN 1-59593-061-2.

[49] WANG, X.; ZIAVRAS, S. G. Parallel direct solution of linear equations on FPGA-based machines. In: IPDPS '03: Proceedings of the 17th International Symposium on Parallel and Distributed Processing. Washington, DC, USA: IEEE Computer Society, 2003. p. 113.1. ISBN 0-7695-1926-1.

[50] HE, C. High-order finite difference modeling on reconfigurable computing platform. In: SEG 2005: Proceedings of the International Exposition and 75th Annual Meeting. [S.l.: s.n.], 2005.

[51] TANOUGAST, C. et al. A partitioning methodology that optimises the area on reconfigurable real-time embedded systems. EURASIP Journal on Applied Signal Processing, v. 2003, n. 6, p. 494–501, 2003. DOI: 10.1155/S1110865703212051.

[52] KAUL, M.; VEMURI, R. Temporal partitioning combined with design space exploration for latency minimization of run-time reconfigured designs. In: DATE '99: Proceedings of the conference on Design, automation and test in Europe. New York, NY, USA: ACM Press, 1999. p. 43. ISBN 1-58113-121-6.

[53] OUNI, B.; MTIBAA, A.; ABID, M. Synthesis and time partitioning for reconfigurable systems. [S.l.]: Springer, 2004. v. 9, p. 177–191.

[54] GRIES, M. Methods for Evaluating and Covering the Design Space during Early Design Development. [S.l.], August 2003. Available at: <http://www.gigascale.org/pubs/385.html>.

[55] ANDERSSON, J. A Survey of Multiobjective Optimization in Engineering Design. Linköping, Sweden, 2000. Available at: <citeseer.ist.psu.edu/andersson00survey.html>.

[56] LEIGHTON, F. T. Introduction to parallel algorithms and architectures: arrays, trees, hypercubes. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1992. ISBN 1-55860-117-1.

[57] EL-REWINI, H.; ABD-EL-BARR, M. Advanced Computer Architecture and Parallel Processing (Wiley Series on Parallel and Distributed Computing). [S.l.]: Wiley-Interscience, 2005. ISBN 0471467405.

[58] Site of PDesigner - www.pdesigner.org.

[59] Site of SystemC - www.systemc.org.

[60] KAHAN, W. IEEE Standard 754 for binary floating-point arithmetic. Lecture Notes, Elec. Eng. and Computer Science, University of California, Berkeley, May 1996.

[61] DOU, Y. et al. 64-bit floating-point FPGA matrix multiplication. In: FPGA '05: Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays. New York, NY, USA: ACM Press, 2005. p. 86–95. ISBN 1-59593-029-9.

[62] Site of CLC-Bio - www.clcbio.com.

[63] Site of XtreamData - www.xtreamdata.com.

[64] Site of AVNet - www.avnet.com.

[65] Site of SGI - www.sgi.com.

[66] Site of PETSc - www.petsc.com.

[67] WILLIAMS, J. W.; BERGMANN, N. Embedded Linux as a platform for dynamically self-reconfiguring systems-on-chip. In: ERSA. [S.l.: s.n.], 2004. p. 163–169.

[68] XILINX. Using a Microprocessor to Configure Xilinx FPGAs via Slave Serial or SelectMAP Mode. [S.l.], November 2002.

[69] Avalon Bus - Altera: www.altera.com/literature/manual/mnl_avalon_spec.pdf.

[70] CORBET, J.; RUBINI, A.; KROAH-HARTMAN, G. Linux Device Drivers, 3rd Edition. [S.l.]: O'Reilly Media, Inc., 2005. ISBN 0596005903.

[71] UCLinux - www.uclinux.org.

[72] UCLinux NIOS Port - Microtronix: www.microtronix.com.

[73] BOVET, D. P.; CESATI, M. Understanding the Linux Kernel. Third edition. [S.l.]: O'Reilly, 2005. xvi + 923 p. ISBN 0-596-00565-2 (paperback).

[74] MURATA, T. Petri nets: Properties, analysis and applications. Proceedings of the IEEE, v. 77, n. 4, p. 541–580, 1989. Available at: <http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=24143>.

[75] ALTERA. Nios II Processor Reference Handbook. Altera Corp., 2005.

[76] SANT'ANNA, R. E. A Methodology for Real-Time Task Scheduling in a Dynamically Reconfigurable Architecture. PhD Thesis — Federal University of Pernambuco (UFPE), Informatics Center, August 2006.

[77] NASCIMENTO, P. S. B. do; LIMA, M. E. de. Temporal partitioning for image processing based on time-space complexity in reconfigurable architectures. In: DATE '06: Proceedings of the conference on Design, automation and test in Europe. 3001 Leuven, Belgium: European Design and Automation Association, 2006. p. 375–380. ISBN 3-9810801-0-6.

[78] NASCIMENTO, P. S. B. do et al. Mapping of image processing systems to FPGA computer based on temporal partitioning and design space exploration. In: SBCCI '06: Proceedings of the 19th annual symposium on Integrated circuits and systems design. New York, NY, USA: ACM Press, 2006. p. 50–55. ISBN 1-59593-479-0.

[79] GUPTA, S. User Manual for the SPARK Parallelizing High-Level Synthesis Framework - Version 1.1. San Diego and Irvine, April 2004.

[80] GLOVER, F. Tabu search - Part I. ORSA Journal on Computing, v. 1, n. 3, p. 190–206, 1989.

[81] GLOVER, F. Tabu search - Part II. ORSA Journal on Computing, v. 2, p. 4–32, 1990.

[82] SEIXAS, J. L. Aquarius - Uma Plataforma para Desenvolvimento de Sistemas Digitais Dinamicamente Reconfiguráveis (Aquarius - A Platform for the Development of Dynamically Reconfigurable Digital Systems). Master's Dissertation — Federal University of Pernambuco (UFPE), Informatics Center, February 2007.

[83] Site of RPCMod: www.demec.ufpe.br/rpcmod/index.html.