
Low Cost High Performance Reconfigurable Computing

Javier Castillo, Jose Luis Bosque, Cesar Pedraza, Emilio Castillo, Pablo Huerta, and Jose Ignacio Martinez

Abstract High Performance Reconfigurable Computing (HPRC) has emerged as an alternative way to accelerate applications using FPGAs. Although these HPRC systems offer performance comparable to standard supercomputers at a much lower cost, they are still not affordable for many institutions. We present a low-cost HPRC system built on standard FPGA boards with an architecture that can execute many scientific applications faster than Graphics Processing Units and traditional supercomputers. The system is made up of 32 low-cost FPGA boards and a custom-made high-speed network using RocketIO interfaces. We have designed a SystemC methodology and CAD framework that allow the designer to simulate any MPI scientific application before generating the final implementation files. The software runs on the PowerPC processor embedded in the FPGA on a light ad-hoc implementation of MPI, and the hardware is automatically translated from SystemC to Verilog and connected to the PowerPC. This makes the SMILE HPRC system fully compatible with any existing MPI application. The proof of concept of the SMILE HPRC has been exhaustively tested with two complex and demanding applications: Monte Carlo financial simulation and Boolean synthesis using genetic algorithms. The results show remarkable performance, reasonable costs, small power consumption, no need for cooling systems, small physical space requirements, system scalability and software portability.

J. Castillo (✉) • P. Huerta • J.I. Martinez
Universidad Rey Juan Carlos, Madrid, Spain
e-mail: [email protected]; [email protected]; [email protected]

J.L. Bosque • E. Castillo
Universidad de Cantabria, Santander, Spain
e-mail: [email protected]; [email protected]

C. Pedraza
Universidad Santo Tomas, Bogota, Colombia
e-mail: [email protected]

W. Vanderbauwhede and K. Benkrid (eds.), High-Performance Computing Using FPGAs, DOI 10.1007/978-1-4614-1791-0_15, © Springer Science+Business Media, LLC 2013



1 Introduction

The use of hardware accelerators to speed up computationally intensive applications has become a major trend in the supercomputing field. As the size of supercomputers increases, different problems arise regarding communication bottlenecks and other common issues such as power consumption, cooling or the physical space needed to install a machine of these characteristics.

Due to these problems, researchers have been proposing architectures based on different types of accelerators for many years. In this context, the arrival of two key technologies, reconfigurable logic (RC) and graphics processing units (GPUs), has been of enormous importance. In both cases the underlying concept is basically the same: finding the implementation of the algorithm best suited to the specific characteristics of the target architecture. In the context of RC, the target is an FPGA device mainly full of programmable logic resources, whilst for the GPUs the device is made up of a large number of execution cores. These two alternatives are not mutually exclusive: there are hybrid machines that use both technologies in order to make the most of their intrinsic characteristics [1].

In this context, the use of RC and FPGAs has created a new field of research, named high-performance reconfigurable computing (HPRC). HPRC systems are usually classified into two main groups [2]: uniform node non-uniform systems (UNNSs), where the nodes are made up of FPGAs or microprocessors connected through a high-speed network, and non-uniform node uniform systems (NNUSs), where only one type of node is used, each one containing a microprocessor with an FPGA tightly coupled. The well-known Cray XD1 [3] supercomputer belongs to this category.

The SMILE project (Scientific Parallel Multiprocessing based on Low-Cost Reconfigurable Hardware) presents a new HPRC architecture and a development methodology using commercial off-the-shelf (COTS) FPGA boards [4]. The SMILE architecture is an NNUS architecture where all the nodes are equal, containing a built-in PowerPC processor and the FPGA logic tightly coupled through the system bus. The main features of the SMILE architecture are the following:

• High performance: The architecture, where parts of the applications are accelerated using custom hardware, has a performance similar to standard general-purpose microprocessor clusters and to GPU-accelerated systems.

• Low power consumption: Each SMILE node consumes only 5 W, according to our lab measurements, which is an order of magnitude smaller than the power consumption of a personal computer. This is a really important advantage for building a supercomputer because the SMILE architecture does not need any cooling system.

• Scalability: For most applications, adding more nodes to a SMILE cluster is just as easy as connecting new FPGA boards to the high-speed interconnection network.

• Portability of the parallel applications: The system runs an MPI library implementation, therefore any parallel application already programmed can be ported to the new environment. The only additional effort required is the development of the custom hardware to speed the application up, keeping the whole communication scheme unchanged.

• Low cost: By using low-cost commercial FPGA boards, the nodes of the system have a reasonable price.

• Low space utilization: The FPGA boards can be placed in a stacked configuration that takes up a very small physical space.

The main contributions of the SMILE project are the design and implementation of a new HPRC architecture based on low-cost FPGA boards. A full SystemC-based methodology was developed to help port any MPI application to the SMILE system, as well as a complete CAD framework to use this methodology. This framework enables the simulation and debugging of the whole system before generating and downloading the final implementation to the HPRC system. Two different applications were developed and tested using the methodology: a Monte Carlo financial simulation and a combinational circuit synthesis using genetic algorithms. The whole framework, methodology and SMILE HPRC system have been fully tested through an exhaustive experimental evaluation, and the results have been compared to two other high-performance architectures: GPUs and clusters. The importance of today's GPUs, as mentioned in the second paragraph of this section, can be appreciated in the TOP500 [5], where a total of 28 systems on the list use GPU technology. The two Chinese systems, Tianhe (number one) and Nebulae (number three), and the new Japanese Tsubame 2.0 system (number four) [6] all use NVIDIA GPUs to accelerate the computations. In all cases the GPUs are coupled to the processor through an extension bus (PCI-e, HyperTransport, etc.).

Section 2 continues with a review of similar research work in the literature. In Sect. 3 the SMILE HPRC architecture is described in detail. Section 4 presents the methodology developed to create any SMILE application using our SystemC framework. In Sect. 5 the two case studies (benchmarks) are described for the three different architectures. Section 6 presents the results of both benchmarks running on the SMILE HPRC system, GPUs and the ALTAMIRA cluster. Finally, we extract some conclusions and suggest future work.

2 Background

In the HPRC field, different vendors have started to offer machines that include both FPGAs and high-end processors, introducing the concept of HPRCs [7, 8]. These machines combine the hardware customization of the FPGAs and the flexibility of the software running on a general-purpose processor, providing new ways to explore the design space to obtain the best performance.

One of the main HPC vendors, Silicon Graphics International (SGI), provides SGI Reconfigurable Application Specific Computing (SGI-RASC) [9], a technology that can be used with the Altix line of HPC servers. The RC100 model [10] includes two Virtex-4 LX200 FPGAs and two NUMAlink interfaces with 12.8 GB/s bandwidth. Another vendor, SRC Computers, offers the H MAP and I MAP reconfigurable processors used in the SRC-7 family of products [11].

Another well-known HPC vendor, Cray Inc., developed the XD1 system as a distributed-memory HPC system [3]. The Cray XD1 entry-level supercomputer range uses AMD Opteron 64-bit CPUs and incorporates Xilinx Virtex-II Pro FPGAs for application-specific implementations.

The research and academic community has also had several proposals for HPRC systems. Splash 2 was an attached processor system using Xilinx XC4010 FPGAs as its processing elements, developed at the Supercomputing Research Center [12]. Another system that uses FPGAs in a cluster of independent operating systems was developed in the project called Sepia [13], which used the DEC PCI Pamette FPGA board to build a low-cost cluster for image processing applications.

The reconfigurable computing cluster (RCC) project worked on the feasibility of using FPGAs to build cost-effective petascale cluster computers, building a cluster of 64 Xilinx ML-410 development boards with a Virtex-4 FPGA [14]. Yoshimi et al. [15] designed a 512-FPGA cube made up of eight 64-FPGA boards, mainly to be used for physics, financial simulation and massively parallel cryptographic key cracking.

Cathey et al. [16] present a reconfigurable data flow processing architecture that explicitly targets both fine- and coarse-grained parallelism at the same time. This architecture is based on multiple FPGAs organized in a scalable direct network.

One of the most relevant projects on HPRC is the Research Accelerator for Multiple Processors (RAMP) [17]. In this project, several universities are researching the next generation of tools for computer architecture and computer science. RAMP seeks to take advantage of the high degree of parallelism and density of FPGAs to emulate new highly parallel computer systems. The project uses a custom platform named Berkeley Emulation Engine 2 (BEE2). BEE2 provides a large amount of FPGA resources, DRAM memory, and high-bandwidth I/O channels on one single board. One of the main milestones of the project is to build a proof-of-concept HPRC named the RAMP Blue cluster.

Although the capacity of current FPGAs has grown enormously in recent years, sometimes the number of functions to be executed in hardware exceeds the FPGA resource limit. To tackle this issue, El-Ghazawi et al. [2] propose a technique called virtualization, i.e. using partial run-time reconfiguration to switch between the different hardware functions programmed in the FPGA, multiplexing the FPGA resources over time.

The main benefit of HPRC systems, namely the design of new hardware functions for every application to take full advantage of the FPGA logic and the characteristics of each application, is at the same time their main drawback when the general scientific community, with no experience in hardware design, tries to use them. To facilitate the hardware design, EDA vendors are constantly making efforts to help with high-level synthesis tools based on C-like languages. For example, Mitrion-C and Handel-C tools work for the SGI RASC, and Impulse C for the Cray XD1.


Another important trend is the use of hybrid machines combining GPU+FPGA. This provides a designer with different targets for the different application tasks, so that the most appropriate resource can be found for every task in terms of performance or other considerations like timing, usage, etc. Tsoi and Luk [1] present a heterogeneous machine using nodes made up of a mix of different accelerators, as well as a map-reduce framework to map these tasks onto the different resources. Showerman et al. [18] describe another hybrid system of Virtex-4 FPGAs and NVIDIA Quadro GPUs tested in weather forecasting, molecular dynamics and cosmology applications, with up to a 48x speedup in some tests.

3 SMILE Architecture

The SMILE project is built as a custom distributed-memory parallel computing machine with low-cost FPGA boards. The programming model is based on the execution of concurrent processes with message-passing communication and synchronization. The communication uses an MPI standard library that provides the portability of applications to different platforms. The concurrent processes are divided into a software part, running on the PowerPC processor of each FPGA, and a custom hardware accelerator designed for the application using the methodology presented in Sect. 4.

The SMILE cluster is made up of up to 32 FPGA nodes and a host computer monitoring the cluster operations and sharing the storage space (hard disks). Currently, each node is a Digilent XUPV2P board, selected for its low price, low power consumption and high performance. Although all the infrastructure can be migrated to more advanced FPGA families (plug and play), the Virtex-II Pro is still used due to the board price. With a moderate cost of $499 per node it is possible to compete with supercomputers in some applications, as can be seen in Sect. 6. The power consumption of the SMILE cluster is not only several orders of magnitude smaller than the power consumption of a traditional supercomputer, but it also means that no cooling system is needed for the SMILE cluster. The physical size needed to accommodate the SMILE cluster is also several orders of magnitude smaller than the space needed for a conventional supercomputer (table size vs. room size).

The board includes a Xilinx V2P30 FPGA with two PowerPC 405 microprocessors and 8 multi-gigabit transceivers that are used for high-speed communications between the boards. The dedicated hardware implemented in the FPGA logic is connected to the PowerPC processor through the on-chip peripheral bus (OPB). The board also contains the peripherals needed to develop complex applications such as DDR SDRAM controllers, System ACE controllers (for compact flash memories) and RS232 interfaces.

With all these elements it is possible to run a full version of the Linux kernel on the PowerPC microprocessor, using all the programs and libraries available for this operating system.


The main tool for the SMILE cluster is an ad-hoc MPI implementation that provides the management of the cluster communication using a standard API. For the initial versions of the SMILE cluster we used a standard MPI implementation, more precisely LAM/MPI, freely available on the web. But the overhead introduced by this MPI version was unacceptable: more than 46 s just to launch the processes on all the nodes. This overhead forced the development of our own MPI implementation, called SMPI: a lightweight implementation of the MPI standard. The SMPI library offers the possibility of sending data between nodes through the Ethernet connection or through the RocketIOs of the board. The library implements just a subset of the MPI standard, targeting the best performance with the minimum possible overhead.

The MPI_Init function opens all the sockets needed for the processes to communicate, and once opened, the Send and Receive functions act as simple communication links between nodes. The data can also be sent through the RocketIO interfaces of the boards because they are integrated into the system like any other standard Ethernet device of the Linux kernel (standard TCP sockets, etc.).
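As an illustration, the following minimal round-trip shows the kind of program SMPI targets. The chapter only names MPI_Init, Send and Receive explicitly, so treating MPI_Comm_rank, MPI_Comm_size and MPI_Finalize as part of the subset is an assumption; the sketch compiles against any standard MPI implementation.

```cpp
// Minimal MPI round-trip of the kind SMPI supports. Only MPI_Init, MPI_Send
// and MPI_Recv are confirmed by the text; the other calls are assumed to be
// part of the subset as well.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);              // opens the sockets between the nodes

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double payload = 3.14;
    if (rank == 0) {
        for (int dst = 1; dst < size; ++dst)     // host sends one parameter
            MPI_Send(&payload, 1, MPI_DOUBLE, dst, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&payload, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        std::printf("node %d received %f\n", rank, payload);
    }

    MPI_Finalize();
    return 0;
}
```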

3.1 Network

The network system is a critical issue in parallel architectures. The Ethernet network interface provided by Xilinx is not fast enough to support cluster communications, so this network interface is used only for management tasks. To avoid the communication bottleneck introduced by this low-speed network, a high-performance network has been designed using the three bidirectional SATA channels included on the board. This network works at 1.5 Gbps, offering a performance similar to current parallel system networks.

The management of the RocketIO transceivers has been simplified by the use of the Aurora core provided by Xilinx. This communication channel can now be used in the SMILE cluster thanks to the new ad-hoc interface developed by our team for the Aurora core and the Linux operating system. This interface, called the SMILE Communication Element (SCE), has two main goals: to appear as a conventional network resource in Linux (reachable through the SMPI library) and to provide the network routing channels between the boards.

The SCE has three different parts (Fig. 1). Since the board has three connectors, the SCE includes one Aurora core per connector to manage the data exchange on that link, the Send and Receive FIFOs to store the packets and deal with congestion problems, and some ad-hoc logic to implement the routing algorithm.

Fig. 1 SMILE communication element

The network topology is defined in terms of groups of four nodes named SMILE Block Elements (SBEs). Every node in an SBE is connected to its SBE neighbors with a bidirectional channel. To allow routing between nodes in different SBEs, the top node of each SBE is connected to the top nodes of the other SBEs in a ring topology. With this configuration, the SMILE cluster of 32 nodes has a diameter of 10 steps.

The routing algorithm uses a reservation packet that finds the path between the nodes. Once the path is found, the SCEs of the nodes in the path are marked and reserved, appearing as a shortcut for the following packets. When the frame transmission ends, the SCEs are released and free to be used in other requests. If a congestion problem appears (two transmissions on the same link), the packets are stored in the SCE FIFOs until the path is free again.

A simple routing algorithm has been designed to find the best path in the SMILE cluster. If the packet destination is the working node, the SCE delivers the packet to the PowerPC processor of the working node. If not, the packet needs to be routed. In this case the SCE selects which one of the other two interfaces on the board is going to be used. When the destination is in the same SBE, the address can be lower than the working node (the packet is sent to the previous node) or higher (the packet is sent to the next node). When the destination is in a different SBE, the data goes up towards the top node of its SBE and is then sent to the destination SBE. Once in the correct destination SBE, the packet goes down the SBE nodes to its final destination node. All the information needed by the routing algorithm (working node address, neighbors, SBE neighbor, etc.) is included in the SCE by the Linux driver during system start-up.
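As a rough software model of this decision logic, the sketch below follows the description above (SBE size of four, up/down movement inside an SBE, a ring between top nodes); the flat address scheme and the port names are illustrative, not the actual SCE implementation.

```cpp
// Software model of the SCE routing decision described above. The SBE size
// of 4 and the up/down/ring split follow the text; names are illustrative.
#include <cstdio>

enum Port { kLocal, kDown, kUp, kRing };

Port route(int working, int dest) {
    const int kSbeSize = 4;
    if (dest == working) return kLocal;          // deliver to the PowerPC
    int mySbe = working / kSbeSize;
    int destSbe = dest / kSbeSize;
    if (mySbe == destSbe)                        // stay inside the SBE
        return (dest < working) ? kDown : kUp;
    // Different SBE: climb to the top node, then cross the ring of top nodes.
    bool atTop = (working % kSbeSize) == kSbeSize - 1;
    return atTop ? kRing : kUp;
}

int main() {
    std::printf("node 1 -> node 9: port %d\n", route(1, 9));  // kUp
    return 0;
}
```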


4 SMILE SystemC Model

Any MPI application running on a distributed-memory parallel machine using a message-passing interface library is suitable to be ported to SMILE. These applications are made up of a set of processes running in parallel on processing nodes and a well-defined communication and synchronization scheme based on the standard functions provided by the MPI library. These applications can run on SMILE with no modifications at all, just by compiling the application for the PowerPC processor. However, to make the most of the SMILE architecture, a HW/SW co-design methodology should be used. Therefore, the original MPI application is refined following a set of steps to reach a final implementation made up of the original processes running on the FPGA boards, each one accelerated by a custom hardware accelerator.

The idea is to start with a high-level model of the system and refine the model down to the final implementation. In the SMILE application context, the entry point is a parallel application using an MPI library that needs to be accelerated by custom hardware. The steps involved in obtaining the final system implementation are:

• Profiling. The first step is to profile the application, finding the time-consuming parts that are appropriate to be implemented in hardware (GNU profiling tools).

• Development of the SystemC model. A SMILE application SystemC model is an MPI application that runs as a SystemC thread. Under this approach, it is possible to run a set of SystemC models in parallel that communicate data through the MPI primitives. This model is fully equivalent to the original application.

• Design of the hardware high-level model. This hardware high-level model is a functional implementation of the final hardware used to speed the application up, and is later added to the SystemC system model. The connection between the software and the hardware models is done through untimed sc_fifo channels, as shown in Fig. 2 (see the sketch after this list).

Fig. 2 SystemC untimed model


Fig. 3 SystemC model ready for synthesis

• Refinement of the hardware high-level model. There are two different options: using a high-level synthesis tool such as Impulse C or AutoPilot, or designing an ad-hoc RT-level model and following a traditional RT-level synthesis methodology. In either case, this step produces a final hardware version ready to be implemented and simulated in the framework.

• Redesign of the communication link between software and hardware. A SystemC model of the Intellectual Property Interface (IPIF) provided by Xilinx is connected to the RT model. The IPIF is a hardware block that connects the PowerPC processor to the IP-cores implemented in the FPGA through the OPB and PLB buses. It supports different configurations such as DMA, interrupt-driven or register-based communications. The SystemC IPIF model covers all these configurations and enables the connection of the RT model to the MPI application.

• Replacement of the functions used by the MPI application to communicate with the hardware model with the Xilinx functions provided to communicate with the IP-core through the IPIF (Fig. 3).
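As referenced above, the shape of the untimed model of Fig. 2 can be sketched in plain SystemC: the MPI process runs as a thread and exchanges data with the functional hardware model over sc_fifo channels. Module names and the stand-in computation are illustrative.

```cpp
// Untimed SystemC model of one SMILE node: the "software" thread plays the
// MPI-process role and talks to the hardware model over sc_fifo channels.
#include <systemc.h>

SC_MODULE(HwAccel) {                 // high-level model of the coprocessor
    sc_fifo_in<double>  in;
    sc_fifo_out<double> out;
    void run() {
        for (;;) {
            double x = in.read();    // blocks, like the real FIFO would
            out.write(x * x);        // stand-in for the accelerated kernel
        }
    }
    SC_CTOR(HwAccel) { SC_THREAD(run); }
};

SC_MODULE(SwProcess) {               // the MPI process, run as a thread
    sc_fifo_out<double> to_hw;
    sc_fifo_in<double>  from_hw;
    void run() {
        to_hw.write(3.0);
        cout << "hw returned " << from_hw.read() << endl;
        sc_stop();
    }
    SC_CTOR(SwProcess) { SC_THREAD(run); }
};

int sc_main(int, char *[]) {
    sc_fifo<double> req(4), rsp(4);  // the untimed channels of Fig. 2
    HwAccel hw("hw");   hw.in(req);    hw.out(rsp);
    SwProcess sw("sw"); sw.to_hw(req); sw.from_hw(rsp);
    sc_start();
    return 0;
}
```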

These are the steps to follow in order to test the final SMILE system:

1. The MPI application is compiled for the platform using the Xilinx IPIF libraries.
2. The hardware is synthesized using the appropriate CAD tool. We translate the hardware modules to Verilog using a custom tool developed by the authors, called sc2v [19].
3. The hardware netlist is connected to the physical IPIF interface in the EDK environment.
4. The system is then synthesized, and the bitstream, ready to program the FPGAs, is generated and downloaded.


5 Benchmarks

Two applications have been developed as benchmarks for the SMILE HPRC system: a Monte Carlo simulation for financial problems (the European option pricing problem) and the optimization of Boolean circuit synthesis for many variables. Both problems have been tested on a high-performance cluster, a GPU and the SMILE HPRC using the developed SystemC environment. For the parallelization of the benchmarks, we followed the four-step methodology of Ian Foster [20], which encourages the development of scalable parallel algorithms. The methodology provides the best portability of the applications; therefore, the processes and communication schemes are identical for both the SMILE HPRC and the cluster. The portability of the applications is thus assured at the design level. However, it is worth mentioning that each specific implementation has been optimized to obtain the maximum performance for the corresponding architecture, so that we can make a fair comparison between the different architectures.

5.1 Monte Carlo Financial Simulation

The Monte Carlo simulation is widely used in many problems. Its main drawback is that it is very demanding from a computational point of view, with a relatively slow convergence rate. As a result, a great deal of effort has been made to accelerate the Monte Carlo simulation [21–25].

The Monte Carlo simulation is used to solve the European option pricing problem, as described in this subsection. In financial terms, an option is an agreement: a buyer buys the right, but not the obligation, to buy or sell an asset at a certain price. Black and Scholes proposed a differential equation that is able to calculate a good approximation to the option value based on several assumptions [26]. The Black–Scholes model assumes a perfect market hypothesis where financial markets are efficient and prices of traded assets already reflect all the known information. Under this hypothesis, the security price follows a Markov process and the value of the asset can be represented as a Brownian motion. Solving this equation yields:

$S_T = S_0 \cdot e^{(r - 0.5\mu^2)T + \mu\sqrt{T}\,N(0,1)}$ (1)

where $r$ is the rate we can expect in a riskless market and $S_T$ is the price of the option depending on the random variations of the market, modeled by the normal distribution. It is possible to calculate the expectation of $V_{\mathrm{call}}(S,T)$ by generating a large number of $N(0,1)$ samples and computing the average estimated profit:

$V_{\mathrm{mean}} = \frac{1}{N}\sum_{i=1}^{N} V_i(S,T)$ (2)


If the return value of the money in a riskless investment is subtracted from the expected profit, we obtain the current value of the option. From this equation, the evaluation of the expected profit is just a question of generating a large number of Gaussian random samples and evaluating the expected profit for each one.
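For reference, Eqs. (1) and (2) translate into a few lines of sequential code. The sketch below prices a European call under stated assumptions: the volatility plays the role of μ in Eq. (1), and the strike K with the max(S_T − K, 0) payoff are details the text leaves implicit.

```cpp
// Sequential Monte Carlo pricing of a European call, following Eqs. (1)-(2).
// sigma plays the role of mu in Eq. (1); the strike K and the call payoff
// are assumptions, as the chapter does not spell them out.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>

int main() {
    const double S0 = 100.0, K = 100.0, r = 0.05, sigma = 0.2, T = 1.0;
    const long N = 1000000;                      // number of simulated paths

    std::mt19937 gen(42);                        // Mersenne Twister, as in SMILE
    std::normal_distribution<double> n01(0.0, 1.0);

    double sum = 0.0;
    for (long i = 0; i < N; ++i) {
        // Eq. (1): terminal price driven by one N(0,1) sample.
        double ST = S0 * std::exp((r - 0.5 * sigma * sigma) * T
                                  + sigma * std::sqrt(T) * n01(gen));
        sum += std::max(ST - K, 0.0);            // per-path profit V_i(S,T)
    }
    // Eq. (2) average, discounted back to the present value of the option.
    std::printf("call price ~ %f\n", std::exp(-r * T) * sum / N);
    return 0;
}
```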

5.1.1 Hardware Implementation on SMILE

The hardware implementation on SMILE follows the SystemC methodology presented in Sect. 4, and the SMILE application runs on a set of nodes with a hardware coprocessor attached to the PowerPC in the FPGA. The development steps begin with the profiling of the MPI application. From this profiling we found that the most time-consuming part of the algorithm is the path calculation, therefore a custom hardware coprocessor has to be implemented for this computation. The system operation can be described as follows:

• The application running on the SMILE host sends the simulation parameters and the iteration number to the nodes.

• Each node receives the data and sends the parameters to the coprocessor through the system bus.

• The coprocessor calculates the expected profit and the confidence value for the indicated number of paths.

• The expected profit and confidence values are sent back to the host, where the final values are calculated and displayed.

Following the SystemC methodology, we first develop a SystemC untimed model of the coprocessor. Once the simulation is working properly, we write a SystemC RT model of the coprocessor, changing the communication channels from the untimed models to the modeled IPIF interface. In this benchmark, the SystemC RT Monte Carlo coprocessor uses several external IP-cores: a VHDL exponentiator, a Mersenne Twister random number generator and some floating-point adders and multipliers. Therefore, we design SystemC models of all the IP-cores, so that they can be replaced later in the netlist for the place-and-route process. When the SystemC RT simulation of the whole system is running properly, the core is translated into Verilog using the sc2v tool and then synthesized, placed and routed.

The inputs of the coprocessor are: the parameters of the simulation, the output of the Gaussian random number generator, and the expected profit and confidence values from the previous iterations. The generation of normally distributed random samples can be divided into two steps: the generation of uniformly distributed samples and their conversion to Gaussian-distributed samples. This conversion is usually carried out by the Box–Muller equations, which generate the Gaussian samples from two uniform samples. Our approach implements a new algorithm that represents the equations in the polar coordinate system instead of the Cartesian system. This representation dramatically reduces the number of operations because it is no longer necessary to implement the sin and cos functions. The general structure of the Gaussian random number generator is shown in Fig. 4.


Fig. 4 Gaussian random number generator

Fig. 5 Pipelined datapath

It is important to notice that, in order to have the function tabulated and reduce the area of the generator, a fixed-point representation with one sign bit, 23 fractional bits and 8 integer bits has been used.

The inputs to the Box–Muller conversion module are two samples of a uniform distribution in the range [0,1) generated by two Mersenne Twister random number generators. The result goes into the tabulated function, which produces two samples per cycle.
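A software-level sketch of the polar reformulation is shown below (Marsaglia's polar method, which matches the sin/cos-free description above; the hardware version additionally tabulates the function in the fixed-point format just described, which this sketch does not model).

```cpp
// Polar (Marsaglia) variant of Box-Muller: two uniform samples in, two
// Gaussian samples out, with no sin/cos evaluation. The hardware replaces
// the log/sqrt with a tabulated fixed-point function; this is the plain
// floating-point form.
#include <cmath>
#include <cstdio>
#include <random>

void polar_box_muller(std::mt19937 &gen, double &z0, double &z1) {
    std::uniform_real_distribution<double> u(-1.0, 1.0);
    double x, y, s;
    do {                          // reject points outside the unit circle
        x = u(gen);
        y = u(gen);
        s = x * x + y * y;
    } while (s >= 1.0 || s == 0.0);
    const double f = std::sqrt(-2.0 * std::log(s) / s);
    z0 = x * f;                   // two independent N(0,1) samples per call
    z1 = y * f;
}

int main() {
    std::mt19937 gen(7);
    double a, b;
    polar_box_muller(gen, a, b);
    std::printf("%f %f\n", a, b);
    return 0;
}
```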

Figure 5 shows a pipelined version of the datapath (each element shows its number of cycles). The number of cycles in each stage is limited by the latency of the non-pipelined exponentiation, which takes 3 cycles per calculation. The process is divided into two parts: the first calculates the expected profit and compares it with zero to check if the option is valid. This part of the algorithm uses single-precision IEEE 754 floating-point arithmetic. Afterwards, the expected profit of the whole operation and the confidence value are computed in double-precision arithmetic, which implies a conversion step from single to double precision. At the end of the process, sum_reg and sum_reg2 contain the expected profit and the confidence value. The coprocessor's control unit, connected to the PowerPC system bus, controls the number of iterations of the coprocessor.

5.1.2 GPU Architecture and Programming Model

For this subsection, we used the Monte Carlo option pricing program provided by NVIDIA in the NVIDIA CUDA SDK 2.0 in order to compare the SMILE HPRC with an optimized GPU version of the algorithm. The process is the same as described in Sect. 5.1.1.

The goal of this implementation is to generate a large number of threads to keep the GPU efficiently busy. The number of options is typically in the hundreds, but the number of paths per option is in the millions. Therefore, the most appropriate distribution is to use multiple blocks per option to hide the latency of reading the random input values.

Finally, a data distribution is done by splitting a bi-dimensional grid into blocks, in terms of the number of options and the number of paths per option. With this approach, each thread computes and sums the payoff for multiple simulation paths of different options.

5.1.3 Parallel Implementation

The best way to carry out the parallelization of the Monte Carlo algorithm is domain decomposition, because of the independence of the data and the high degree of data parallelism. In this context, we consider a bi-dimensional problem where the options are the first dimension and the paths for each of the options the second. Given that there are no data dependences, each task will generate a set of pseudo-random numbers (the paths) and compute a subset of the pay-off values for some of the options. This approach has several advantages: selecting the most suitable degree of parallelism and balancing the computation and communication times to optimize the performance. Additionally, each random number in the sequence is used for all the options, therefore increasing the locality and reducing the memory requirements.

In this context, all the tasks need the Mersenne Twister parameters to generate the sequence of pseudo-random numbers; therefore, a general broadcast is compulsory. Once each task has obtained its pay-off values, the average over all of these paths should be calculated. Hence, a parallel reduction can be used on a tree communication pattern. Each option has to be reduced, but all of them can be done in parallel. Finally, a node has to gather the results of all the options, so a gather operation over the nodes with the final results is needed. However, this approach can increase the communication overhead because of the reductions.


To minimize the impact of the communication overhead on the performance, a single node is in charge of collecting all the partial results and computing the average for each option.

With all these considerations in mind, there are two different kinds of processes: a master process that is in charge of broadcasting the Mersenne Twister parameters to the rest of the processes and gathering the partial results, and a set of slave processes that calculate the values of the pay-off function. The master process is assigned to the front-end of the cluster and the slave processes are allocated to the computational nodes.
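Sketched with standard MPI collectives, this master/slave scheme looks as follows; the single seed standing in for the full Mersenne Twister parameter set and the stub payoff computation are simplifications.

```cpp
// Master/slave scheme of this subsection with standard MPI collectives:
// rank 0 broadcasts the RNG parameters (a single seed here, for brevity)
// and gathers the partial payoff sums computed by the slaves.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    unsigned seed = 12345;                        // meaningful on the master only
    MPI_Bcast(&seed, 1, MPI_UNSIGNED, 0, MPI_COMM_WORLD);

    // Each slave would evaluate its share of the paths here (stubbed out).
    double partial = (rank == 0) ? 0.0 : 1.0 / rank;

    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("gathered payoff sum: %f\n", total);

    MPI_Finalize();
    return 0;
}
```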

5.2 Boolean Synthesis with SMILE

Boolean synthesis is a design-flow process that optimizes and reduces the number of logic gates of a circuit in order to minimize cost and chip area, and to increase performance. The use of evolutionary algorithms (EAs) is a new trend for finding original solutions to the problem. In EAs, hardware is represented by a chromosome and managed with the Darwinian concept of natural selection [27]. The chromosomes mutate and cross with others to create a new population of individuals. As in nature, when a population of individuals is generated, a fitness function determines which ones are suitable for accomplishing the target function requirements; then a selection process excludes some members while the rest mutate and cross again, creating a new population. This process is repeated until a set of individuals that satisfies the requirements and restrictions of the target function is obtained. For any combinational system problem, the fitness function evaluates the truth table to see if the individual solves the problem, as well as other optimization parameters such as the number of gates or the number of logic levels. It is important to notice that for hardware synthesis it is necessary to use a variation of the simple genetic algorithm (SGA) known as genetic programming (GP) [27], which is able to modify the chromosome length and to create new mutation and crossing operators.

Chromosome Representation

The representation in GP is the way a logic circuit is coded as a bit array in order to be managed in the evolution process [28]. This representation must be able to cover all the different solutions of the problem; moreover, the crossing and mutation operators should not generate invalid individuals and must cover the whole solution space so that the search is truly random. There are different ways of representing combinational hardware for a genetic algorithm [27, 29, 30]. The 2-D tree representation is appropriate for implementing parallel systems because it enables the chromosomes to be split to balance the computational load [31]. Figure 6 shows the selected cell-based structure representation. Each cell has 3 functions f and 4 input variables v coded in binary. This representation allows more cells to be added to represent larger circuits (if a more complex solution is necessary) and is suitable for directly translating each cell to the FPGA LUT architecture.

Fig. 6 Cell structure and its representation inside the chromosome
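One possible in-memory packing of this representation, as a sketch: the per-field bit widths are not given in the text, so plain bytes are used here for clarity.

```cpp
// A chromosome as an array of cells, following Fig. 6: each cell carries
// 3 binary-coded functions and 4 binary-coded input variables. Using one
// byte per field is an assumption; the real encoding packs them as bits.
#include <cstdint>
#include <vector>

struct Cell {
    std::uint8_t f[3];  // function selectors
    std::uint8_t v[4];  // input-variable selectors
};

using Chromosome = std::vector<Cell>;  // more cells -> larger circuits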

Fitness Function

Equation (3) shows the fitness function of our GA, which is responsible for quantifying how well a chromosome or individual fulfills the requirements. The constants ω1, ω2, and ω3 establish the weights of the parameters that determine the fitness function. The double-summation term calculates the number of coincidences of the individual X with the target function Y over all possible output combinations. The P(X) function calculates the number of logic gates of a chromosome, taking into account the introns: segments of the genotype string that have no associated function and do not contribute to the result of the represented logic circuit. The function L(X) determines the number of levels of the circuit, in other words, the number of gates in the critical path. The constant m refers to the number of outputs of the circuit and n to the number of possible input combinations of the circuit.

$\mathrm{fitness} = \omega_1\left[\sum_{j=1}^{m}\sum_{i=1}^{n} Y(j,i)-X(j,i)\right] + \omega_2\,P(X) + \omega_3\,L(X)$ (3)
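Read as code, Eq. (3) amounts to the following sketch; the internals of P(X) and L(X) (gate and level counting) are left as inputs, since in SMILE they are computed by the hardware unit described in Sect. 5.2.1.

```cpp
// Direct software reading of Eq. (3). Y is the target truth table, X the
// table produced by the candidate; gatesPX and levelsLX stand for P(X)
// and L(X), whose computation is delegated to the FCU in SMILE.
#include <cstdlib>
#include <vector>

double fitness(const std::vector<std::vector<int>> &Y,
               const std::vector<std::vector<int>> &X,
               double w1, double w2, double w3,
               double gatesPX, double levelsLX) {
    double diff = 0.0;
    for (std::size_t j = 0; j < Y.size(); ++j)         // m circuit outputs
        for (std::size_t i = 0; i < Y[j].size(); ++i)  // n input combinations
            diff += std::abs(Y[j][i] - X[j][i]);       // output mismatches
    return w1 * diff + w2 * gatesPX + w3 * levelsLX;
}
```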

Genetic Operators

The selection operator is responsible for the identification of the best individuals in the population, taking into account both exploitation and exploration [31]. The former allows the individuals with better fitness to survive and reproduce more often, and the latter searches more areas, i.e., finds better results. On the other hand, the mutation operator modifies the chromosome randomly in order to increase the search space. It changes (1) an operator or variable or (2) a segment of the chromosome. Both are executed randomly and with a certain probability. A mutation probability that varies during the execution of the algorithm (evolvable mutation) [32] is more effective for evolvable systems. Finally, the crossing operator combines two selected individuals to obtain two additional individuals to add to the population. A crossing scheme with one or two randomly selected crossing points has been implemented because it is more efficient for evolvable systems [30].
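A bit-level sketch of the two variation operators described above follows; the mutation probability and the single crossing point are illustrative parameter choices.

```cpp
// Random bit mutation and one-point crossover over the chromosome bit
// array, as described above. Equal-length parents are assumed, and the
// mutation probability is an illustrative default.
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

void mutate(std::vector<std::uint8_t> &chrom, std::mt19937 &gen,
            double p = 0.01) {
    std::bernoulli_distribution flip(p);
    for (auto &gene : chrom)
        if (flip(gen)) gene ^= std::uint8_t(1u << (gen() % 8)); // flip one bit
}

void crossover(std::vector<std::uint8_t> &a, std::vector<std::uint8_t> &b,
               std::mt19937 &gen) {
    std::uniform_int_distribution<std::size_t> cut(1, a.size() - 1);
    const std::size_t c = cut(gen);                    // single crossing point
    for (std::size_t i = c; i < a.size(); ++i) std::swap(a[i], b[i]);
}
```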


Fig. 7 FCU block diagram

5.2.1 Hardware Implementation on SMILE

Once again we use the proposed SystemC methodology to generate a working SMILE HPRC system. The profiling of the algorithm determined that the most time-consuming parts are the fitness function calculation and the generation of new individuals (25% and 35% of the execution time, respectively). Therefore, these two have been specifically accelerated with a coprocessor connected to the PowerPC processor.

The fitness calculation unit (FCU) calculates the three parameters using the objective function, the chromosome and the number of variables as inputs. This coprocessor is connected to the PowerPC 405 processor through the PLB bus using a custom interface. The interface allows register-based and DMA communication to transfer the objective function and the chromosome efficiently. Figure 7 shows the FCU's structure. Once the chromosome has been read from the DDR memory by the memory controller, all the basic cells are converted into their Look-Up Table (LUT) equivalents through a ROM-based translation. The next block computes the minterm value (the number of hits of that individual) using the information from the objective function and a counter as inputs. After computing the number of gates and the logic levels, the fitness calculation block computes the final fitness value, which is sent back to the PowerPC processor. Finally, in order to accelerate the generation of new individuals (crossing and mutation), a Mersenne Twister-based pseudo-random number generator was inserted between the registers of the PLB interface. To further accelerate the fitness evaluation process, 4 coprocessors were implemented in each FPGA of the cluster; therefore, 4 individuals can be evaluated at the same time in each node. All the components were modeled in SystemC and then translated to Verilog in order to reach the final SMILE implementation.
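From the PowerPC side, driving one FCU could look like the hypothetical sketch below: the base address, register offsets and handshake protocol are all invented for illustration, as the chapter does not document the register map.

```cpp
// Hypothetical driver-side view of one FCU over the PLB custom interface:
// write the chromosome's physical address, start the unit, poll, read the
// fitness. Base address, offsets and status bits are all invented.
#include <cstdint>

volatile std::uint32_t *const kFcu =
    reinterpret_cast<volatile std::uint32_t *>(0x80000000u);
enum { kRegChromAddr = 0, kRegCtrl = 1, kRegStatus = 2, kRegFitness = 3 };

std::uint32_t evaluate(std::uint32_t chrom_phys_addr) {
    kFcu[kRegChromAddr] = chrom_phys_addr;  // DMA source for the chromosome
    kFcu[kRegCtrl] = 1;                     // start the fitness pipeline
    while ((kFcu[kRegStatus] & 1u) == 0) {} // busy-wait for completion
    return kFcu[kRegFitness];               // final fitness value
}
```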


Fig. 8 Block diagram of the CUDA implementation

5.2.2 GPU Architecture and Programming Model

Random Number Generation

During the GP execution, a large quantity of random numbers is required to generate an initial population and to mutate and cross individuals. Generating these numbers on the CPU and moving them to the GPU is not feasible because it takes too long. For this reason, a Mersenne Twister algorithm is executed on the GPU before the GP kernel to generate a buffer of random numbers in global memory.

Kernel Structure

Figure 8 shows the way the GP has been implemented on the graphics device. A kernel is executed in a thread and is able to generate a micropopulation and perform the selection, mutation and crossing operations for the required number of generations (Fig. 9). After P generations, M individuals are transferred to the global memory and then to the host device (CPU system). The number P is known as the migration frequency and the number M is called the migration factor.

It is important to highlight that each thread can cooperate with other threads inside the same block through the shared memory, sharing the best individuals and improving the efficiency of the GP.


Fig. 9 Thread execution model

5.2.3 Parallel Implementation

As natural evolution works with a whole population and not with a single individual (except for selection and reproduction), some operations can be done separately, meaning that almost all operations in a GP are implicitly parallel. Using the island approach, the population is divided into subpopulations that evolve on each processor of the cluster or parallel architecture. When the system starts, each processor creates its subpopulation and starts the evolution process, made up of fitness evaluation, selection, crossing, mutation, and reproduction. These processes are asynchronous because each node starts and ends independently. Once the system reaches a certain number of generations, a percentage of the individuals are selected to be transferred from one processor to another. A master processor is in charge of collecting the in-transfer individuals and moving them to the rest of the nodes (slaves), increasing the probability of convergence of the algorithm. The data exchange ratio (increasing the number of best individuals exchanged increases the probability of finding a better solution) and the migration frequency are important parameters for improving the performance of the algorithm.
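A migration step of this island scheme can be sketched with MPI collectives; the gather-then-broadcast pool and the adopt-the-first-M policy are simplifications of the master's selection role.

```cpp
// Island-model migration step with MPI collectives: every island hands its
// M best individuals to the master, which redistributes the whole pool.
// Individuals are reduced to one 32-bit word each for brevity.
#include <mpi.h>
#include <cstdint>
#include <vector>

void migrate(std::vector<std::uint32_t> &best, int M, MPI_Comm comm) {
    int size = 1;
    MPI_Comm_size(comm, &size);
    std::vector<std::uint32_t> pool(static_cast<std::size_t>(M) * size);

    // Master (rank 0) collects M migrants from every island, itself included.
    MPI_Gather(best.data(), M, MPI_UINT32_T,
               pool.data(), M, MPI_UINT32_T, 0, comm);
    // The master could rank the pool here; then everyone receives it back.
    MPI_Bcast(pool.data(), M * size, MPI_UINT32_T, 0, comm);

    best.assign(pool.begin(), pool.begin() + M);  // stub adoption policy
}
```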


6 Evaluation

This section presents a set of empirical experiments to evaluate the SMILE HPRC system and compares the results with the GPU and conventional cluster approaches. Again, the main goals are to validate the viability of the SMILE HPRC architecture for efficiently solving high-performance computing applications and to verify the performance and scalability of the SMILE HPRC system.

As mentioned in Sect. 5, each specific implementation has been optimized to obtain the maximum performance on the corresponding architecture, so that we can make a fair comparison between the different approaches.

Three different systems have been used in the experiments to compare the different architectures and implementations.

1. The graphics processing unit (GPU) is an NVIDIA GeForce 330M with up to 96 stream processors at 1436 MHz, connected to the PC host by a PCI Express bus. It has 1 GB of GDDR3 memory at a 2-GHz clock rate.

2. The cluster set-up, ALTAMIRA, is made up of 18 eServer BladeCenters, with 256 JS20 nodes (512 processors) linked together using a 1-Gbps Myrinet network.

3. The SMILE configuration is made up of up to 32 FPGA nodes, with the architecture described in Sect. 3.

6.1 Experimental Results for Monte Carlo Simulation

Figure 10 shows the speed-up of the GPU vs. the SMILE HPRC, whilst Fig. 11 shows the speed-up of the SMILE HPRC vs. the ALTAMIRA cluster, as a function of the number of paths per option and for different numbers of options. In the GPU-CUDA combination, the number of threads is 4,096, and the number of nodes of SMILE and ALTAMIRA is 32 (the largest configuration available in SMILE at that time).

The excellent performance of the GPU compared with the SMILE HPRC has to be highlighted. However, as can be seen in Fig. 10, the speed-up decreases as the workload grows, either in the number of options or in the number of paths per option. This can be explained by the limitations of the GPU memory. Managing a large amount of data increases the number of global memory accesses, which incur a much higher latency, heavily increasing the response time. This leads to a serious scalability problem. Figure 10 only shows values up to 50 million paths because, above this value, there is a memory overflow and the GPU stops working.

As a result, although the GPU has a better performance than the SMILE HPRC, it also has serious scalability problems because of the data size, even reaching the point of system crash.


Fig. 10 Speed-up of GPU vs. SMILE

Fig. 11 Speed-up of SMILE vs. ALTAMIRA

On the other hand, from Fig. 11, it should be pointed out that there is an excellent performance improvement of the SMILE HPRC compared to the ALTAMIRA cluster for the same number of nodes (32 nodes).


Fig. 12 Elapsed time (s) for different numbers of nodes

Additionally, both architectures are free of the GPU scalability problems already seen. It is worth mentioning that the speed-up of the SMILE HPRC vs. the ALTAMIRA cluster grows with the workload.

This also means that the SMILE architecture is even more scalable than the ALTAMIRA cluster for this problem size. With these settings, the SMILE HPRC is about 5 times faster than the ALTAMIRA cluster. Another key to understanding the results is that even though the ALTAMIRA cluster has CPU nodes running at 2.2 GHz, compared with the 100 MHz of the hardware accelerator in the SMILE HPRC, the hardware is able to generate a valid result every clock cycle. In comparison, the ALTAMIRA CPUs spend millions of cycles running the code needed to generate random numbers and produce a result.

Finally, Fig. 12 shows the elapsed time of the SMILE HPRC for different numbers of nodes. The elapsed time decreases quickly as the number of nodes increases. This behavior is explained by the communication time being practically negligible for this application. Hence the SMILE HPRC presents excellent scalability features. The same behavior is observed for the ALTAMIRA cluster.

6.2 Experimental Results of Boolean Synthesis

The experimental results obtained with the implementations of the Boolean synthesis problem described in Sect. 5.2 are presented in this section. In these experiments the number of nodes in the SMILE HPRC and the ALTAMIRA cluster goes from 2 to 16, 16 being the largest configuration of the SMILE architecture, and the population size goes from 512 to 2,048 individuals.

Figures 13–15 show the response time for all the architectures in terms of the number of nodes (SMILE and ALTAMIRA) and the number of threads (GPU).


Fig. 13 ALTAMIRA response time with 2,048 individuals for different numbers of variables and 16 nodes

In general, there is a significant decrease in the response times as the number of nodes (threads) increases for all the architectures. On the GPU, the response time stabilizes at about 200 threads.

The first key aspect that should be noted is the strong impact of the number of variables on the response time in both the ALTAMIRA cluster and the GPU. The increase in the response time when the number of variables goes from 8 to 12 is quite remarkable. However, in the SMILE HPRC this effect is really small or virtually disappears. This can be explained because the number of variables produces an exponential growth in the search space of the genetic algorithm and leads to a great increase in the amount of computation needed to simulate the circuit in the ALTAMIRA cluster and in the GPU. In SMILE, however, the circuit is directly tested in hardware, which takes far less time. Thus, the impact on the response time is much smaller.

Regarding scalability, all three architectures present good features. The GPU does not have the limitations observed in the Monte Carlo simulation because the size of the data used in the Boolean synthesis is much smaller. This shows that the GPU architecture is very sensitive to the memory requirements of the application. The same is true when using a single FPGA, but not when using a cluster: if the memory requirements grow, we only need to increase the number of nodes in the cluster to ensure the system scalability.

Finally, Fig. 16 and Table 1 show the speed-up of the SMILE HPRC architecture vs. the ALTAMIRA cluster and the GPU, respectively. In both cases, there is an excellent improvement in the performance offered by the SMILE HPRC compared to the other two alternatives.


Fig. 14 ALTAMIRA response time with 2,048 individuals for different numbers of variables

Fig. 15 Response time for the GP on the GPU from 32 up to 256 threads

Table 1 shows that the speed-up increases when 256 threads are launched instead of 32, no matter the number of islands or CUDA blocks.


Fig. 16 Speed-up of SMILE vs. ALTAMIRA

Table 1 Speed-up of SMILE vs. NVIDIA 450GTS with 12 variables

                             32 threads   256 threads
NVIDIA 450GTS, 2 islands     41.6         250
NVIDIA 450GTS, 32 islands    41.6         250.6

The reason is that the utilization of the processing elements inside the GPU (192) grows with the number of threads. The maximum speed-up for 12 variables goes from 41 to 250 vs. the GPU, and from 150 to 450 vs. the ALTAMIRA cluster.

In the figures, the speed-up increases dramatically with the number of variables, due to the exponential growth in the search space of the genetic algorithm explained before. Likewise, the speed-up increases, although moderately, with the number of processors. This confirms the excellent scalability properties of the SMILE HPRC.

7 Conclusions

In this chapter the SMILE HPRC, a new HPRC architecture based on a cluster of FPGA boards, has been proposed and fully described. The nodes are interconnected through a specifically designed network with a bandwidth in the Gigabit/s range. The most significant features of the SMILE HPRC are its reasonable cost, small power consumption, lack of cooling systems, small physical space requirements, high performance for specific applications, system scalability and software portability. The architecture can execute any MPI parallel application, and also takes advantage of the FPGA adaptability, reconfigurability and performance. Moreover, a new SystemC methodology has been developed to facilitate the development and debugging of applications for the SMILE HPRC architecture. This methodology and its associated CAD framework enable the simulation of the full system architecture at the system level (the parallel program and the communication patterns) as well as at the node level (the custom hardware developed for an application).

An empirical evaluation has determined both the performance and the scalability of the SMILE HPRC architecture. As benchmarks for these experiments, two well-known applications, the Monte Carlo simulation for financial problems and the Boolean synthesis of digital circuits, have been used and fully detailed. The experiments compared three different architectures: a GPU programmed with CUDA, a high-performance cluster running a parallel MPI application, and the SMILE HPRC with ad-hoc hardware implementations. The experimental results highlight the excellent behavior of the SMILE HPRC in performance and scalability for both applications. For the Boolean synthesis problem, the SMILE HPRC delivers an outstanding performance compared to the ALTAMIRA cluster and the GPU. However, in the case of the Monte Carlo simulation, the GPU outperforms the SMILE HPRC with the current configuration. In terms of scalability, the properties of the SMILE HPRC are much better than those of the other architectures in all the experiments. Another important feature is portability: any parallel application developed with MPI can be implemented on the SMILE HPRC by replacing slow software functions with faster custom hardware.

The scalability of SMILE in terms of memory and communications deserves a closer look. One of the biggest advantages of SMILE is its distributed-memory architecture: each node can upgrade its memory and thus process more data. The direct connection of the memory to the FPGA is one of the key aspects of SMILE; the more data there is to process, the bigger the speed-ups SMILE achieves against GPUs and CPUs.
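
A minimal sketch illustrates this scheme; it assumes a generic MPI-1 API, and the data set, element type, and kernel below are illustrative rather than taken from the SMILE applications. Each node keeps only its slice of the data in local memory, and only a small per-node result crosses the network:

    /* Minimal sketch: distributed-memory processing with one scatter
     * and one reduce. Sizes and the kernel are illustrative only. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1024  /* total elements; assumed divisible by node count */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;
        float *data = NULL;
        float *local = malloc(chunk * sizeof(float));

        if (rank == 0) {              /* root holds the full data set */
            data = malloc(N * sizeof(float));
            for (int i = 0; i < N; i++) data[i] = (float)i;
        }

        /* one scatter: each node keeps only its slice in local memory */
        MPI_Scatter(data, chunk, MPI_FLOAT,
                    local, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

        /* local processing; on SMILE this is where the custom hardware
         * attached to the PowerPC would do the heavy work */
        float partial = 0.0f;
        for (int i = 0; i < chunk; i++) partial += local[i] * local[i];

        /* only one float per node travels back: low communication cost */
        float total = 0.0f;
        MPI_Reduce(&partial, &total, 1, MPI_FLOAT, MPI_SUM,
                   0, MPI_COMM_WORLD);

        if (rank == 0) printf("sum of squares = %f\n", total);

        free(local);
        free(data);
        MPI_Finalize();
        return 0;
    }

Upgrading a node's memory raises the chunk each node can hold, and adding nodes shrinks the chunk per node; in both cases the communication remains a single scatter and a single one-float reduce.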

On the other hand, communication is one of the weak points of SMILE, as in any parallel architecture. SMILE is best suited for algorithms with little data transfer between nodes and large data sets to process inside the FPGA; the presented examples follow that scheme. With other algorithms it is still possible to obtain big speed-ups, but the outcome depends enormously on the communication patterns. In some cases it may be possible to rearrange the high-speed board connections to optimize the behavior of the system for a given application.
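
A first-order model, a common textbook approximation rather than a measurement of SMILE, makes this dependence explicit. With $p$ nodes,

    $$T(p) \approx \frac{T_{\text{comp}}}{p} + T_{\text{comm}}(p), \qquad S(p) = \frac{T(1)}{T(p)},$$

so the speed-up stays close to $p$ only while $T_{\text{comm}}(p) \ll T_{\text{comp}}/p$; algorithms that move large data sets between nodes make $T_{\text{comm}}$ dominate and the speed-up flatten, which is exactly why the board connections may need rearranging for such applications.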

As future work, we propose to extend the system to support any board available on the market through the SystemC framework, and to continue our research in order to add new applications that can take full advantage of the proposed SMILE HPRC architecture.

