Cellular Automata for Structural Optimization on Reconfigurable
Computers
Thomas R. Hartka
Thesis submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of
Master of Science
in
Computer Engineering
Dr. Mark T. Jones, Chair
Dr. Peter M. Athanas
Dr. Michael S. Hsiao
May 12, 2004
Blacksburg, Virginia
Keywords: configurable computing, cellular automata, design optimization
Copyright © 2004, Thomas R. Hartka
Cellular Automata for Structural Optimization on Reconfigurable
Computers
Thomas R. Hartka
(ABSTRACT)
Structural analysis and design optimization are important to a wide variety of disciplines. The
current methods for these tasks require significant time and computing resources. Reconfig-
urable computers have shown the ability to speed up many applications, but cannot efficiently
handle the precision requirements of traditional analysis and optimization tech-
niques. Cellular automata theory provides a method to model these problems in a format
conducive to representation on a reconfigurable computer. The calculations do not need to be
executed with high precision and can be performed in parallel. By implementing cellular au-
tomata simulations on a reconfigurable computer, structural analysis and design optimization
can be performed significantly faster than conventional methods.
This work was partially supported by NSF grant #9908057 as well as by the Virginia
Tech Aspires program.
Acknowledgements
I would first like to thank my advisor, Dr. Mark Jones, for his guidance through my entire
research. Without his guidance I never would have been able to complete this thesis.
Thanks to Dr. Athanas for serving on my thesis committee and for making development
on the DINI board possible.
Thanks to Dr. Hsiao for serving on my committee and being an excellent teacher.
Thanks to Dr. Gurdal and his researchers for providing the mathematics for the cellular
automata models and for all the effort they spent learning about reconfigurable computing
so that the equations could be mapped efficiently.
Thanks to all the professors and students involved with the Configurable Computing Lab
for making it a great place to work.
Thanks to all the people that helped in the process of reviewing and editing this thesis.
I am forever indebted to anyone who will review sixty pages of my writing.
Thanks to everyone else who I have not mentioned that helped with my work. I could
not have done it without the support from the people around me.
Contents
1 Introduction 1
1.1 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 4
2.1 Cellular Automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Configurable Computers for Scientific Computations . . . . . . . . . . . . . . 8
2.3 Limited Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 System Design 13
3.1 Design Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.1 System Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.2 Distributed Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Problem Specific Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Program Based Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4 Results 32
4.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Problem Specific Design Results . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Program Based Design Results . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4 Comparison of Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5 Conclusions 48
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A 51
Vita 55
List of Figures
3.1 Setup of the configurable computer used for simulating the CA model. . . . 15
3.2 Distribution of logical CA cells among PEs. . . . . . . . . . . . . . . . . . . 16
3.3 Arithmetic unit for Problem Specific design. . . . . . . . . . . . . . . . . . . 19
3.4 PE layout for Problem Specific design . . . . . . . . . . . . . . . . . . . . . . 20
3.5 Return data chains for Program Based design. . . . . . . . . . . . . . . . . . 23
3.6 Control Unit for Program Based design. . . . . . . . . . . . . . . . . . . . . 24
3.7 Analysis cycle flow and precision for each operation. . . . . . . . . . . . . . . 26
3.8 Computational unit for Program Based design. . . . . . . . . . . . . . . . . . 27
3.9 Multiply accumulator used in computational unit. . . . . . . . . . . . . . . . 27
3.10 MSB data return chain, used for determining the most significant ‘1’ of residuals. 28
3.11 Unit for shifting the precision of intermediate results. . . . . . . . . . . . . . 28
3.12 Matrix accumulator used for analysis updates. . . . . . . . . . . . . . . . . . 29
3.13 Data flow for uploading and downloading data to FPGAs. . . . . . . . . . . 30
4.1 Diagram of CA model for performing analysis on a beam. . . . . . . . . . . . 33
4.2 Beam analysis problem modeled on the configurable computer. . . . . . . . . 33
4.3 Precision of PE vs. % Utilization of FPGA for Problem Specific design. . . 35
4.4 % Utilization of FPGA and maximum clock frequency vs. number of PEs for
Problem Specific design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.5 Cell updates per second vs. number of PEs for the Problem Specific design. . 37
4.6 Actual results and results from Problem Specific design for beam problem. . 39
4.7 Precision of PE vs. % Utilization of FPGA for Program Based design. . . . . 40
4.8 Efficiency vs. number of inner iterations per analysis cycle. . . . . . . . . . . 42
4.9 % Utilization of FPGA and maximum clock frequency vs. number of PEs for
Program Based design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.10 Cell updates per second vs. number of PEs for Program Based design. . . . 44
4.11 Actual results and results from Program Based design for beam problem. . . 46
A.1 Spreadsheet with position of control signals and short description. . . . . . . 52
A.2 Spreadsheet containing update program . . . . . . . . . . . . . . . . . . . . . 53
A.3 Spreadsheet converting signals to the form used by the Program Based model 53
A.4 Spreadsheet containing the data values in a form that can be loaded into
memory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
List of Tables
4.1 Times for operations associated with Problem Specific analysis cycle for DINI
board. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Clock cycles for different phases of residual-update analysis cycle. . . . . . . 41
4.3 Time for operations associated with analysis on Program Based design. . . . 45
4.4 Maximum cell updates per second for both implementations. . . . . . . . . . 47
4.5 Maximum cell updates per second and speed up for both implementations
compared to PC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Chapter 1
Introduction
Structural analysis and design optimization are an integral part of many industries. Appli-
cations range from simple tasks, such as testing and optimizing a support beam, to very
complicated ones, such as optimizing the structure of a car for crash resistance.
Performing the design iterations manually is very time consuming. Therefore, a significant
amount of research has been conducted to develop efficient methods to automate the design
process.
Traditional methods for automating design have involved running simulations on general
purpose processors. In these methods, calculations must be performed in high precision.
Parallelization of the calculations, where possible, is done through expensive supercomputers
with hundreds of processors. Even with massive amounts of computing power, the simula-
tions will usually take hours to complete.
Cellular Automata (CA) has proved to be a very powerful tool for modeling physical
phenomena. CA models have successfully captured the behavior of complex systems such
as fluid flow around a wing and pedestrian traffic [?, ?]. Recently, CA theory has been
extended to structural analysis and design optimization [?]. Using CA in structural models
changes analysis and design optimization into a highly parallelizable form that does not require
high-precision calculations. This provides the potential for significant speed-up.
1.1 Thesis Statement
Using CA provides a method to efficiently map structural design optimization problems onto
FPGAs. By exploiting the inherent parallelism of FPGAs there is the potential for speed-up
over general purpose processors.
To achieve this objective, distributed processing systems were implemented on a con-
figurable computer. The system consisted of a host PC connected to a PCI based board
with five FPGAs. Two designs were developed for the FPGAs to rapidly iterate CA mod-
els for structural analysis. The two designs represent significantly different approaches to
accomplishing the same objective.
The author’s contributions to this work are the following:
- developed a custom FPGA design for simulating a beam CA model,
- developed a separate FPGA design that executes programs to simulate CA models,
- wrote programs for the FPGA design to simulate a beam CA model, and
- implemented a limited precision method in hardware for solving iterative improvement
problems.
1.2 Thesis organization
Chapter 2 presents background information about CA theory, scientific computations on con-
figurable computers, and the limited precision method used in the designs. Chapter 3 gives
details on the two implementations developed to solve the CA models. Chapter 4 presents
the results for each of the two implementations and comparisons to traditional methods.
Chapter 5 summarizes the work performed for this project and the results obtained.
Chapter 2
Background
This chapter presents previous work in areas related to this research. The combined contri-
butions discussed were used in the completion of the simulation environment and prototype
presented in this thesis.
2.1 Cellular Automata
The concept of Cellular Automata (CA) theory is to be able to model systems with many
objects that interact [?]. The systems are divided into discrete units, or cells, that act
autonomously. The advantage of using CA is that the behavior of some complex systems
can be captured using relatively simple rules for each cell [?]. Attempting to reproduce this
behavior without breaking those systems into autonomous units, even if possible, would be
complicated.
Each cell in a CA model can be in a single state at any given point in the simulation. The
number of states the cell may be in depends on the problem being solved. In many models
the number of elements in the set of states is small (eight or fewer), but there is a new class of
CA models that use a continuous state space. These continuous state space CA models are
known as coupled map-lattice or cell dynamic schemes. The next state of a cell is based on
an update rule, sometimes referred to as a transition rule, which is a function of its current
state and the current state of its neighbors [?]. The collective state of all of the cells in the
model at any given point is known as the global state [?].
Stanislaw Ulam is generally credited with the first work in CA, originally referred to
as cellular space or automata networks. John von Neumann extended Ulam’s work and
proposed CA as a way to model self-reproducing biological systems [?, ?]. The work of
Ulam and von Neumann provides a formal method for simulating complex systems. Their
research, and much of the current research in CA, focused on modeling dynamic systems in
which time and space are discrete. Each calculation of the next state of all the cells in a
system represents a step in time [?]. A good example of this type of CA model is Conway’s
Game of Life in which cells can be in one of two states: alive or dead. Each update of the
global state represents a new generation of organisms [?].
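As an illustration of how simple an update rule can be, one global-state update of the Game of Life can be sketched as follows. This is a minimal version, assuming a finite grid whose out-of-range cells are treated as permanently dead:

```python
def life_step(grid):
    """One global-state update of Conway's Game of Life on a finite grid.

    Each cell's next state is a function only of its current state and
    the number of live cells in its Moore neighborhood; cells beyond
    the edge of the grid are treated as permanently dead.
    """
    rows, cols = len(grid), len(grid[0])

    def live_neighbors(r, c):
        # Count live cells among the eight surrounding positions.
        return sum(
            grid[r + dr][c + dc]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)
            and 0 <= r + dr < rows and 0 <= c + dc < cols
        )

    # A live cell survives with 2 or 3 neighbors; a dead cell becomes
    # alive (a new organism is born) with exactly 3 neighbors.
    return [
        [1 if live_neighbors(r, c) == 3
             or (grid[r][c] == 1 and live_neighbors(r, c) == 2)
         else 0
         for c in range(cols)]
        for r in range(rows)
    ]
```

Applying the function to a "blinker" pattern (three live cells in a column) produces the same three cells rotated into a row, and a second application restores the original grid.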
There are a number of architectures for CA models, each resulting in different behavior.
The number of dimensions of the CA model can differ greatly depending on the system being
modeled. In practice, models typically have one, two, or three dimensions. However, there
is no limit to the number of dimensions that can be used [?]. The number of dimensions of
the grid has a large effect on the communication network among cells, known as the cellular
neighborhood.
In the work on two-dimensional grids there are two common cellular neighborhoods. The
first is the von Neumann neighborhood, in which each cell communicates only with the four
cells that are orthogonally adjacent to it. The second is a Moore neighborhood in which a
cell communicates with all eight cells surrounding it [?]. Though von Neumann and Moore
neighborhoods are common, cells are not limited to communicating only with those that
are adjacent. The “MvonN Neighborhood” uses the nine cells (including the center cell)
in the Moore neighborhood as well as the four cells orthogonally one space away from the
current cell. Additionally, the communication of the cells within a model is not required to
be consistent throughout the model [?].
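The neighborhoods described above can be written as sets of (row, column) offsets. In this sketch, the four extra "MvonN" cells are assumed to sit at an orthogonal distance of two from the center cell, i.e., one space beyond the Moore ring:

```python
# Offsets (dr, dc) defining common two-dimensional cellular neighborhoods.

# von Neumann: the four orthogonally adjacent cells.
VON_NEUMANN = [(-1, 0), (1, 0), (0, -1), (0, 1)]

# Moore: all eight surrounding cells.
MOORE = [(dr, dc)
         for dr in (-1, 0, 1) for dc in (-1, 0, 1)
         if (dr, dc) != (0, 0)]

# "MvonN": the nine Moore cells (including the center) plus the four
# cells orthogonally one space beyond the Moore ring (an assumption
# about the exact offsets used).
MVONN = [(0, 0)] + MOORE + [(-2, 0), (2, 0), (0, -2), (0, 2)]

def neighbors(r, c, offsets):
    """Coordinates of the cells that cell (r, c) communicates with."""
    return [(r + dr, c + dc) for dr, dc in offsets]
```

Since each offset list is just data, a non-uniform model could assign different lists to different regions of the grid.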
There has also been a significant amount of work investigating non-rectangular cell sys-
tems. Gas lattice automata are a subset of cellular automata that commonly use the FHP
model. The FHP model uses a hexagonal grid, where cells communicate with their six im-
mediate neighbors [?,?]. The use of triangular and regular polygonal lattices is common in
specialized applications of cellular automata because they can better capture the behavior
of certain systems [?].
Models in which communication and update rules are consistent throughout the model
are called uniform. Though most of the work in the area of CA has used uniform models,
the use of non-uniform rules does not necessarily detract from the effectiveness of using CA.
A number of experiments have been conducted to model the effect of “damaged” areas of
a grid where cells use different rules [?]. In terms of simulating a CA model on a serial
processor, a uniform grid has the advantage that only one update rule is needed [?].
The grid for a CA model may be finite or infinite. In his work, von Neumann examined
infinite grids as a method to construct a universal computer [?]. Although von Neumann’s
work on infinite grids was theoretical, methods for representing and calculating CA models
on infinite grids have been developed [?]. Finite grids are much simpler to implement and
process in parallel because the maximum size of the active area is known before processing
begins. However, the use of finite grids introduces the problem of how to calculate cells on
the edge of the grid, known as the boundary conditions.
There are several ways to handle the processing of cells on the edge of a finite grid. The
first method is to logically connect the cells on one edge of the grid to cells on the opposite
edge, producing a loop. Another way to handle boundary conditions is to use a fixed value
for cells at the perimeter of the grid. In systems with fixed boundary conditions, the edge
cells are known as dummy cells because they do not need to be updated. The third method
for calculating the update for edge cells is to use an update rule that is different than that
used in the internal cells [?]. An example of such a rule would be an edge cell that simply
mirrored the value of the closest internal cell. The type of boundary condition used depends
largely on the problem being modeled.
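The three boundary-handling schemes can be sketched for a one-dimensional grid by defining how an out-of-range neighbor index is resolved; the mirror rule below stands in for the special-edge-rule case:

```python
def neighbor_value(cells, i, boundary, fixed=0):
    """Resolve the value at neighbor index i on a finite 1-D grid.

    boundary = "wrap":   edges are logically connected, forming a loop
    boundary = "fixed":  out-of-range positions are dummy cells holding
                         a constant value
    boundary = "mirror": edge positions reflect the closest internal
                         cell (one example of a special edge rule)
    """
    n = len(cells)
    if 0 <= i < n:
        return cells[i]          # interior cell: no special handling
    if boundary == "wrap":
        return cells[i % n]      # index wraps around to the other edge
    if boundary == "fixed":
        return fixed             # dummy cell, never updated
    if boundary == "mirror":
        return cells[0] if i < 0 else cells[n - 1]
    raise ValueError("unknown boundary condition: " + boundary)
```

An update rule can then treat every cell identically, calling this helper for its left and right neighbor indices.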
The early work in CA theory concentrated on theoretical computational questions, such
as computational universality. In later work, it has been used as a method to study social,
physical, and biological systems [?]. A number of studies have been conducted to use CA
to capture the aggregate behavior of groups of autonomous beings, for example, car traffic,
pedestrian flow, and ant colonies [?,?,?]. In scientific computing, many successful attempts
have been made to model such phenomena as fluid dynamics, chemical absorption, and
heat transfer using CA [?,?,?].
Some of the most recent work in CA has been in the field of structural analysis and
design. The first work in this area was the development of methods to optimize the angle
and cross-section area of trusses in a fixed structure [?]. These methods proved successful in
merging analysis and design into a CA model and showed powerful computational properties.
This success prompted more work to extend CA theory to create models for other structural
design problems.
A model was developed to minimize the weight of a beam needed to prevent buckling
[?]. The beam is represented in sections, to which constraints and external forces can be
applied independently. The cross sectional area for each section is determined to produce the
minimum total weight of the beam. Experiments with this method showed models converged
to the correct minimum solution. This area of CA research shows substantial promise for
accelerating structural design optimization through parallel computing.
2.2 Configurable Computers for Scientific Computations
FPGAs, which are usually the basis for configurable computers, show considerable speed-up
for a variety of applications when compared to general purpose processors. These applications
include signal processing, pattern searching, and encryption. The use of FPGAs for these
tasks, which mainly involve bit manipulation, has shown orders of magnitude acceleration
[?,?,?]. These accelerations are possible because the tasks can be broken down into simple
operations. The operations can then be performed in parallel throughout the chip.
Although pattern matching and bit manipulation have been widely studied on FPGAs,
FPGAs have typically not been used for scientific computations. Two significant deterrents
to using FPGAs are the limited amount of available programmable logic and the slow clock speeds. In
the past, FPGAs have only been able to represent circuits that had gate counts in the low
thousands [?]. This low gate count is restrictive for scientific computations. For example, a
32-bit parallel multiplier could not be emulated by most of the FPGAs in the Xilinx XC3000
family, chips that were first produced in the mid 1990s (based on number of CLBs). This
becomes even more of a handicap because FPGAs typically operate at clock speeds much
lower than average CPUs. A general purpose processor will usually outperform an FPGA if
the FPGA cannot carry out parallel or deeply pipelined operations.
These limitations of FPGAs have been greatly reduced in recent chips because of the
much larger transistor densities. The latest Xilinx FPGAs can emulate circuits with up
to 10 million gates [?]. With increased programmable logic, it is possible to have many
more arithmetic units performing complicated operations in parallel. In comparison to the
previous example, a Xilinx XC2V8000 chip, currently Xilinx’s largest FPGA, has enough
logic to represent thirty-five 32-bit multipliers (based on number of CLBs). Floating-point
operations continue to require a large percentage of available resources. Still, researchers have
begun exploring scientific computations on FPGAs. A paper from researchers at Virginia
Tech used the flexibility of FPGAs to develop representations of floating-point numbers
that are more efficient on FPGAs [?]. In 2002, researchers published a paper detailing the
development of a limited precision floating-point library and an optimizer to determine the
minimum precision needed in DSP calculations [?].
Using the least precision possible is important on an FPGA. General purpose processors
usually compute operations in higher precision than is needed because of the limited choices
for precision. However, the fine-grain control of the logic in an FPGA allows custom arith-
metic units of any precision. This flexibility can be extended to dynamically controlling the
precision of different calculations on the same unit. Other work has been presented on a
variable precision coprocessor for a configurable computer and algorithms given for variable
precision arithmetic units [?]. Two papers have been published investigating how to manage
dynamically varying precision and to show how the overall runtime is substantially decreased
by using minimal precision [?,?].
CA has been used in the computer science community for some time. In 1985, a book was
published describing implementations for CA simulations on massively parallel computers
[?]. However, there has been little work done in trying to run these models on configurable
computers. There have been some papers written on using CA on FPGAs, but they all focus
on models with simple cell update rules and small state sets. For example, the CAREM
system was developed to efficiently model CA on FPGAs [?]. The two models published as
examples of using the CAREM system were an image thinning algorithm and a forest fire
simulation. In both cases the models were simple, having state set sizes of 4 or less. Other
cellular automata simulation systems have been proposed for reconfigurable computers which
concentrated on fluid dynamics [?,?]. However, like CAREM, these systems are only capable
of handling simple models with a very limited number of states.
Custom hardware architectures were implemented by Norman Margolus from MIT for
processing CA. The most successful was known as the Cellular Automata Machine 8 (CAM-
8) [?]. The CAM-8 is based on custom SIMD processors that are connected in three
dimensions. Each processor is responsible for a section of data in the model which is stored
in a DRAM. Processing on each cell’s data is performed using look-up tables (LUTs) stored in
SRAM. This architecture shows impressive results, generating up to 3 billion cell updates
per second. However, the LUT based processing limits models to a fairly small state size.
There have been a number of projects which use the CAM-8 in areas such as modeling fluid
motion [?] and gas lattices [?]. The CAM-8 is now sold commercially.
2.3 Limited Precision
The use of configurable computers has renewed study in the area of limited precision com-
puting. Determining the least number of bits needed for a task was important when many
chips were custom designed and silicon was expensive. With the rise of cheaper fabrication
methods and inexpensive, powerful CPUs, this area has become less important. The use of
general purpose processors with dedicated floating-point units lessens the penalty for using
floating-point for all calculations. However, as configurable computers become popular, the
use of limited precision for calculations has again become important [?].
All configurable computers are based on programmable logic at some level of granularity.
Historically, the most popular type of programmable logic is the FPGA. FPGAs have bit-
level granularity so arithmetic units can be built with any precision. In most cases, each
additional bit of precision of an arithmetic unit will require more chip resources. Also, the
maximum clock frequency for an arithmetic unit may decrease with each additional bit of
precision. This high sensitivity makes using the lowest precision possible very important to
optimizing a design on an FPGA.
Limiting precision has been extended further for FPGAs for solving iterative problems in
a recent paper [?]. This paper describes a method for performing low precision calculations
that are collated into high precision results. Similar ideas were developed for CPUs, but
those studies focused on using single precision floating-point calculations to find double
precision solutions [?,?]. As mentioned earlier, FPGAs have a much finer grain of control
over precision, and floating-point calculations are expensive when using FPGAs. So a new,
modified version of this concept has recently been investigated specifically suited for use on
a configurable computer [?].
The reason that low-precision arithmetic can be used in iterative improvement problems
is that the answer converges gradually. During each step, a correction is found that improves
the solution. When the correction is large and the highest bits of the solution are converging,
the low bits do not hold any useful information. Therefore, there is no advantage to using
a precision that calculates the low order bits before the upper bits have converged. As
the solution becomes closer to the final answer, the refinement at each step becomes less.
Because the refinement is small, the high order bits no longer change. At this point there is
no longer any reason to recompute the high bits of the solution.
This property of iterative improvement problems makes it possible to use fewer bits to
calculate the correction than the number of bits that are in the final result. In this way,
only the high order bits are calculated while the correction is large; inversely, only the low
order bits are computed when the correction becomes small. This is possible by calculating
the error (residual) in the equation for the iterative improvement problem. The goal of the
example below is to find a value for x which satisfies the equation

A x_i = b.                                  (2.1)

The residual (or error) in this equation can be written as

r = b - A x_i.                              (2.2)

Instead of using the initial equation, the change in x can be calculated as

\Delta x_i = A^{-1} r.                      (2.3)

The previous calculation can be performed with lower precision arithmetic. This step is
iterated a number of times, then \Delta x_i is added back into the previous x:

x_{i+1} = x_i + \Delta x_i.                 (2.4)
This method has been shown to converge to the correct solution [?]. It is applicable to our
work in CA because the CA models we use for structural design optimization are in a form
that utilizes this method. The advantage of using this method on reconfigurable computers
comes from the fact that the bulk of the operations are performed during the update phase.
A large number of resources can be devoted to accelerating the update calculations because
the update can be calculated at a low precision.
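A minimal software sketch of this scheme follows, using float32 as a stand-in for the narrow fixed-point units used on the FPGA and float64 for the collated high-precision solution; the direct low-precision solve is an illustrative substitute for the iterative update phase:

```python
import numpy as np

def refine(A, b, inner_dtype=np.float32, steps=10):
    """Iterative improvement with a low-precision correction step.

    The residual r = b - A x_i and the running solution
    x_{i+1} = x_i + dx_i are kept in high precision, while the bulk of
    the work -- computing the correction dx_i from the residual -- is
    done in low precision.
    """
    A = np.asarray(A, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    A_lo = A.astype(inner_dtype)            # low-precision copy of A
    x = np.zeros_like(b)
    for _ in range(steps):
        r = b - A @ x                       # high-precision residual
        # Low-precision correction; only the leading bits of dx matter.
        dx = np.linalg.solve(A_lo, r.astype(inner_dtype))
        x = x + dx.astype(np.float64)       # accumulate in high precision
    return x
```

Each pass shrinks the residual by roughly the relative accuracy of the low-precision correction, so a handful of cheap iterations yields a solution accurate to full precision.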
Chapter 3
System Design
This section describes two approaches to implementing CA models on FPGAs. Both ap-
proaches use an array of uniform, simple processing elements (PEs) spread throughout the
chip. A large number can fit on a single FPGA because the PEs are relatively simple. This
distributed computing is effective because of the parallel nature of solving the CA models.
The two designs described in this section illustrate a fundamental tradeoff in hardware
design, flexibility versus speed. The first implementation is a custom circuit developed
to solve the analysis equation for a given design. The second implementation executes a
program stored in memory that controls arithmetic operations. Both designs solve the same
analysis problem.
It is important to note that the underlying theory behind the two designs is the same. In
both cases, the design is intended to determine the displacement and rotation of sections of
a beam given the constraints and external forces on the beam. Though they solve the same
problem, the motivation behind each design is fundamentally different. Therefore, although
the same equations are used for solving for the beam variables, the form of the equations are
optimized for the specific implementation.
3.1 Design Background
When performing operations on an FPGA, it is much more efficient to use fixed-point arith-
metic than floating-point arithmetic. For this reason, both of the designs represent numbers
in fixed-point notation. The nature of CA models allows for this type of representation. The
position of the radix point depends on the architecture and the type of data being stored. The
number of bits of precision varies based on the operation being performed. In both models,
intermediate values produced during calculations are stored in increasing precision to avoid
loss of data. The data is then truncated before the final value is stored.
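This fixed-point scheme can be sketched with integers and an explicit scale factor; the 8-bit fractional storage format below is an arbitrary choice for illustration. A multiply doubles the fractional precision of the intermediate value, which is then truncated back to the storage format:

```python
FRAC_BITS = 8  # example storage format: 8 fractional bits (an assumption)

def to_fixed(x, frac_bits=FRAC_BITS):
    """Encode a real number as a fixed-point integer."""
    return int(round(x * (1 << frac_bits)))

def to_float(f, frac_bits=FRAC_BITS):
    """Decode a fixed-point integer back to a real number."""
    return f / (1 << frac_bits)

def fixed_mul(a, b, frac_bits=FRAC_BITS):
    """Multiply two fixed-point values.

    The raw product carries 2*frac_bits fractional bits -- the
    higher-precision intermediate; shifting right truncates it back to
    the storage precision before the result is stored.
    """
    wide = a * b               # intermediate value with doubled precision
    return wide >> frac_bits   # truncate to the storage format
```

For example, multiplying the encodings of 1.5 and 2.25 yields the encoding of 3.375, exactly representable in 8 fractional bits.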
These designs were developed to perform calculations for a one dimensional CA model,
with two degrees of freedom for each cell. The arithmetic for higher dimensional problems can
be performed without significant changes to the structure of the PE. The main difference
in higher dimensional problems is the change in the communication pattern. In the one-
dimensional models considered, each cell only needs to communicate with its immediate
right and left neighbors. In the case of two-dimensional problems, cells often need to
communicate data with four to eight neighboring cells.
3.1.1 System Hardware
The concepts presented in this thesis for solving CA models on configurable computers can
be applied to many hardware configurations; however, both designs were developed with a
particular system in mind. The system uses a host PC connected over a PCI bus to a card
containing five FPGAs (see Figure 3.1). The FPGAs are all Xilinx Virtex II - XC2V4000
chips [?]. There are several features which make the Virtex II desirable for simulating CA
models.
The first advantage of the Virtex II is the large amount of internal RAM distributed
throughout the chip. These internal BlockRAMs have customizable width and depth. They
also have two ports that can independently read and write to different addresses. Transfer-
ring data on and off chip is an expensive operation and is typically the bottleneck in most
applications. By utilizing these memories, we avoid having to transfer data to external banks
of RAM.

Figure 3.1: Setup of the configurable computer used for simulating the CA model.
The second advantage of the Virtex II is the built-in multiplication units. In the sea-of-
gates model for FPGAs, implementing multipliers is expensive. This is especially true if the
precision is large because the size of a multiplier grows with the square of the number of bits
of precision. In the Virtex II, there is a built-in multiplier associated with each BlockRAM.
This lends itself to the distributed processor models we used.
3.1.2 Distributed Layout
FPGAs are designed to be as flexible as possible so they can be used in many applications,
but this flexibility comes at a cost in terms of space and speed for any arithmetic unit when
compared to custom VLSI. Chips such as general purpose processors have custom designed
arithmetic units that have a significant advantage in executing sequential operations. The
reason an FPGA has the potential for speed up versus a general purpose processor is that it
can perform many operations in parallel or deeply pipeline the operations.
In order to maximize the ability of an FPGA to perform operations in parallel, as much
of the reconfigurable resources as possible should be in use at the same time. To accomplish
this objective, both designs use many uniform, simple PEs operating in parallel. Each PE
is responsible for calculating the next value for a section of cells in the CA model (see
Figure 3.2).
Figure 3.2: Distribution of logical CA cells among PEs.
This distribution is simplified because each cell in the CA model is governed by the
same equations. The arithmetic units in each PE implement the governing equation; each
logical cell is represented by the data values that are inserted into the equation. There is a
BlockRAM associated with each processing unit that stores the set of data values for each
cell. The number of cells represented by a PE is determined by the number of logical cell
data sets that can be stored in the BlockRAM.
This concept of having multiple cells per PE greatly increases the number of logical cells
that can be represented in a design. A certain amount of chip resources is needed to calculate
a cell update. If only one cell was represented in each PE, then the PE could be slightly
smaller and a BlockRAM would not be needed. However, moving from a PE that updates a
single cell to one that handles many cells does not greatly increase the resources required. There are
enough BlockRAMs on the Virtex-II so that the number of BlockRAMs does not limit the
number of PEs that can fit on the chip.
During a single iteration, all of the logical cells contained within a processing unit are
updated once. The update for a cell depends on its right and left neighbors. To calculate
the update for cells on the edge of the section of logical cells a PE represents, the PE needs
data from the PEs to its right and left. At the end of an iteration, each PE transfers the
data from its leftmost cell to the PE representing cells to the left. Likewise, the data from
the rightmost cell must be transferred to the PE representing the cells to the right.
After this transfer, each processing unit has all of the information needed to compute the
next update for all of the cells it represents. Calculations for all cells can start simultaneously
because the necessary information about all cells is known at the beginning of the iteration.
In both designs, registers are placed between arithmetic units. If an arithmetic unit required
more than one cycle to complete, a pipelined version of the component was used. This
pipelining allows multiple cell updates to be computed concurrently, because cells do not
need to wait until the previous cell has completely finished processing.
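The iteration scheme above can be sketched in software terms. The following is a hypothetical Python model, not the hardware itself; the PE count, cells-per-PE, and the averaging rule are all illustrative stand-ins:

```python
# Software sketch of the PE distribution: each PE owns a contiguous slice of
# logical cells and exchanges only its edge cells with neighbours per iteration.
# All names and sizes here are illustrative, not taken from the actual design.

NUM_PES = 4
CELLS_PER_PE = 8  # limited by BlockRAM capacity in the real design

def iterate(pes, update_cell):
    """One CA iteration: every PE updates all its cells using neighbour data
    captured at the start of the iteration, mimicking the edge transfer."""
    new_pes = []
    for i, cells in enumerate(pes):
        # Edge data from neighbouring PEs (fixed boundary values at the ends)
        left = pes[i - 1][-1] if i > 0 else 0.0
        right = pes[i + 1][0] if i < len(pes) - 1 else 0.0
        padded = [left] + cells + [right]
        # Every update uses only start-of-iteration values, so in hardware
        # all of these calculations could proceed in parallel.
        new_pes.append([update_cell(padded[j - 1], padded[j], padded[j + 1])
                        for j in range(1, len(padded) - 1)])
    return new_pes

# Example: a simple averaging rule as a stand-in for the beam equations
pes = [[float(i * CELLS_PER_PE + j) for j in range(CELLS_PER_PE)]
       for i in range(NUM_PES)]
pes = iterate(pes, lambda l, c, r: 0.5 * (l + r))
```

Because each PE only ever reads its neighbours' edge values, the communication pattern stays fixed no matter how many cells a PE holds.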
3.2 Problem Specific Design
The original direction of this project was to develop a toolset that could rapidly produce
custom hardware models based on specific problems. A designer who wanted to use the tools
would specify the problem in a custom programming language. A compiler would interpret
the input code and produce a custom FPGA configuration to solve the problem. It was
expected that a toolset could be developed for creating custom bitstreams rapidly enough
to make the system useful.
The first step in this development process was to analyze typical CA analysis equations
and manually create an optimized layout. The equations used are based on an analysis
problem with two degrees of freedom, v and Θ.
v̄c = (C0 ∗ (vl + vr) + C1 ∗ (Θl − Θr)) + Fc
Θ̄c = (C2 ∗ (vl − vr) + C3 ∗ (Θl + Θr)) + Mc (3.1)
The variable vc represents the v value for the current cell being processed. The variables
vl and vr are the v values of the current cell's left and right neighbors, respectively; the
Θ terms are defined analogously. Fc represents an external force and Mc an external moment.
v̄c is the value of vc at the next time step. These equations can be used to
solve a one-dimensional CA analysis problem, such as deflection of a uniform beam. This
form of the equations was chosen because it can be mapped to a small, linear circuit. The
main goal was to minimize the number of multiplications needed because multiplication units
are costly in terms of space on the FPGA.
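As a software sketch, one update under Equations 3.1 looks like the following. The coefficient values here are illustrative placeholders, not the beam's actual constants:

```python
# Sketch of one cell update from Equations 3.1. The coefficients C0..C3 are
# placeholder values; in the thesis they come from the beam's physical
# parameters and are fixed for a uniform beam.
C0, C1, C2, C3 = 0.5, 0.25, -0.25, 0.5  # illustrative constants

def update(vl, vr, theta_l, theta_r, Fc, Mc):
    # Only four multiplications per cell update, matching the goal of
    # minimizing multiplier count on the FPGA.
    v_next = C0 * (vl + vr) + C1 * (theta_l - theta_r) + Fc
    theta_next = C2 * (vl - vr) + C3 * (theta_l + theta_r) + Mc
    return v_next, theta_next
```

The two output values share no intermediate results, which is what allows the hardware to compute them simultaneously.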
An optimized design was built to solve these equations. For each operation a variety of
components was considered, and multiple layouts were investigated in the implementation
of the equations. Maximum clock frequency, latency, and size were examined when selecting
each component. To further optimize the circuit, v̄c and Θ̄c are computed simultaneously,
which is possible because their calculations are independent. Figure 3.3 shows the final optimized design.
The outputs of all the components shown in Figure 3.3 are registered. Additionally, the
constant multipliers are pipelined. The resulting latency through the circuit is 6 clock cycles.
The circuit is designed such that all information to compute the update value is provided
at the point at which it is needed. In particular, the Fc and Mc values are loaded 5 clock
cycles after the corresponding Θ and v values. In this design, when this pipeline is filled the
circuit can produce an updated value every clock cycle.
The constant multipliers were used because they had a much lower latency and were
much smaller than traditional multipliers. Using constant multipliers is only possible if
the coefficients in Equations 3.1 are fixed. In the case of analyzing the deflection of a
uniform beam, these coefficients are constant. These multipliers have the characteristic of
Figure 3.3: Arithmetic unit for Problem Specific design.
having a structure independent of the constant multiplicand. Therefore, if the location in
the bitstream of the constant multiplier is known, the values in the FPGA look-up tables
(LUTs) could be modified directly to reflect changes in the coefficient.
The disadvantage of using constant multipliers is that design optimizations made to the
density of the beam would require that a different type of multiplier be used. Also, if the
beam was not uniform, the value of vc and Θc would be needed to compute updated values,
v̄c and Θ̄c. This narrows the usefulness of this design, but it provides an optimized baseline
for comparing other designs.
Each PE in the Problem Specific design contains arithmetic logic, a finite state machine
(FSM), and a BlockRAM. The BlockRAM contains all of the values for the cells. The FSM
controls the addresses from which data is loaded and stored in the BlockRAM. The PE
operates most efficiently when the pipeline is filled. When the pipeline is filled, a new set of
data needs to be applied each clock cycle, and updated values need to be stored each clock
cycle. To accommodate this flow of data, one port of the BlockRAM is devoted to loading
data and the other is devoted to storing data (see Figure 3.4).
Figure 3.4: PE layout for Problem Specific design
The Edge Registers are used to communicate data to neighboring PEs. When the update
for a cell on either end of the section of the model for which the PE is responsible is calculated,
the new value is stored in the Edge Registers. Each PE has access to these registers in its
right and left neighbor. When data is needed from a neighbor, the values are loaded from
the Edge Registers instead of from the BlockRAM. For PEs on the boundary of the model,
the Edge Registers are connected to constant values.
To implement design optimization, FPGA configuration bitstreams need to be produced
for both analysis and design phases. The FPGA would first be loaded with the analysis
design and the configuration would be iterated until the data values converged. After the
cells converge, the data needed for the design improvement phase is stored in the internal
BlockRAMs. The FPGA would next perform a partial reconfiguration and load the design
improvement bitstream, during which the contents of the BlockRAMs would not be changed.
In this way, data would be passed between the analysis and design phases.
The results would be extracted from the board through readback. During the readback
operation, the FPGA dumps its entire configuration including flip-flops and BlockRAM
contents. Once the contents of the FPGA are dumped, careful filtering of the data would
yield the current results. This method negates the need for using specialized hardware to
support downloading data.
The residual-update method, described in the Background chapter, can be used in finding
the solution for a CA model because it is an iterative improvement problem. The advantage
to using this method would be that low precision calculations can be used to generate a high
precision result. The reconfiguration between analysis and design phases would provide the
opportunity needed for loading updated coefficients to the FPGAs. The result of implementing
the residual-update method would be that an 8-bit design could produce results with
precisions such as 16 or 32 bits.
3.3 Program Based Design
The Program Based design represents a fundamentally different approach to solving the same
analysis problem as the Problem Specific design. The Problem Specific design can perform
analysis updates very rapidly because it uses custom hardware. However, using a custom
design means that for each new problem an optimized circuit must be designed, and an FPGA
configuration must be generated. The overhead of building a custom configuration for each
problem could easily erase any speed advantage. On the opposite end of the spectrum, a
compiled program running on a general purpose CPU has very low initial overhead, but it
cannot take advantage of the inherent parallelism of CA. The Program Based design was
developed to bridge the gap between the analysis speed of custom hardware and the flexibility
of a general purpose processor.
The first major change, compared to the Problem Specific model, is that the Program
Based design executes a program stored in internal BlockRAM to control data accesses
and the arithmetic units in the PEs. In the Problem Specific model these operations were
performed using a fixed finite state machine. Another significant change is that the control
logic is removed from the PEs and placed in a central control unit. The signals are then
propagated to the PEs throughout the chip. The third important modification is that the
equations are represented in a matrix form to provide a more flexible architecture that
can handle a variety of problems. This matrix arithmetic is expressed in the layout of
the arithmetic units. The last major difference is that the Program Based model has the
capability to compute results in both high precision and low precision forms on the FPGA
and then combine the two results.
The goal of flexibility for the Program Based design is reflected in the form of the equations
for the model. The hardware is designed to solve problems set up in matrix form. This
provides a simpler method to implement, and eventually automate, CA design algorithms.
The matrix form of the beam equations is shown in the following equations:
    [ v̄c ]   [ C0 C1 ] [ vl ]   [ K0 K1 ] [ vc ]   [ C4 C5 ] [ vr ]   [ Fc ]
    [ Θ̄c ] = [ C2 C3 ] [ Θl ] + [ K2 K3 ] [ Θc ] + [ C6 C7 ] [ Θr ] + [ Mc ]    (3.2)
This equation solves the same analysis problem as the Problem Specific design. This is
one of a range of two-dimensional problems that can be solved by the Program Based design.
Equations can be implemented with any number of terms and are expressed in matrix form.
The Problem Specific model only solves problems that can be represented in the form of the
beam equations, while the Program Based model has the capability to capture the behavior
of a variety of problems.
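The matrix-form update of Equation 3.2 can be sketched as three 2x2 matrix-vector products plus the external load vector. The coefficient matrices below are illustrative placeholders:

```python
# Sketch of one cell update in the matrix form of Equation 3.2. The
# coefficient matrices are placeholders; the real values come from the
# problem being modeled, and boundary cells simply use different matrices.
def matvec2(M, x):
    # 2x2 matrix times 2x1 vector
    return [M[0][0] * x[0] + M[0][1] * x[1],
            M[1][0] * x[0] + M[1][1] * x[1]]

def cell_update(C_left, K, C_right, left, centre, right, load):
    terms = [matvec2(C_left, left), matvec2(K, centre),
             matvec2(C_right, right), load]
    return [sum(t[0] for t in terms), sum(t[1] for t in terms)]
```

Handling boundary cells by changing coefficients rather than logic is what lets every PE run identical hardware, as the design requires.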
The complexity of control logic increased greatly in the Program Based model as compared
to the Problem Specific model. The finite state machines that controlled the load and store
logic of the Problem Specific model are ill-equipped to handle the increase in complexity.
The control logic in the Program Based model uses significantly more resources, so it is
advantageous to move the control logic to a centralized location. There is a penalty involved
in distributing the control signals; however, the size of each PE would more than double if
the control logic was not centralized.
The architecture of having a single control unit makes the Program Based design similar to
a Single Instruction, Multiple Data (SIMD) parallel computer (see Figure 3.5). Removing the
control from the individual PEs is possible because all cells, including boundary cells, can be
represented by changing the coefficients in the matrix equation. Historically, there has been
a lack of widespread interest in SIMD parallel computers because they are inflexible and
require custom processors. However, SIMD machines have been successful in multimedia
and DSP applications [?]. These applications involve repetitive calculations that can be
performed in parallel, similar to those needed for CA models.
Figure 3.5: Return data chains for Program Based design.
The Control Unit (CU) requires feedback from the PEs, for example a flag indicating
that calculations are complete. The routing resources around the CU would be consumed
quickly because there are a large number of PEs that need to communicate with the CU. To
avoid this problem, there are multiple PEs on each return data bus so only the last PE in
the chain needs to be routed directly to the CU. The drawback is that extra computational
cycles are needed. This is because the returning data takes an extra clock cycle to propagate
back to the CU for every link.
Instructions stored in the CU are not like those of a traditional microprocessor. The
instructions for a traditional general purpose processor are encoded, while the instructions
in this design are stored as a 72-bit word that requires no decoding. The result is that
most control signals can be connected directly from the memory in the CU to the PEs (see
Figure 3.6). This method of storing instructions is fast and allows any combination of
control signals to be asserted, achieving maximum parallelism.
Figure 3.6: Control Unit for Program Based design.
The instructions contain two main parts: flow control information and control signals. The
flow control portion interacts with the flow control logic in the CU to determine which
instruction is executed next. The flow control logic allows for increments to the program
counter, branches, and conditional branches. The control signals manage operations in the
PEs. These include: clearing registers, loading data, and shifting data. The signals for
controlling the BlockRAMs in the PEs are fed through address logic to allow absolute and
relative address jumps.
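The unencoded 72-bit instruction word can be pictured as a set of bit fields wired straight to their destinations. The thesis specifies only the 72-bit width; the field names and widths below are invented purely for illustration:

```python
# Hypothetical layout of an unencoded 72-bit control word. Each field's bits
# drive signals directly, with no decode stage. These field names and widths
# are NOT from the thesis; they are invented to show the packing idea.
FIELDS = [("flow_control", 8), ("branch_target", 16),
          ("pe_signals", 32), ("blockram_addr", 16)]  # widths sum to 72

def pack(values):
    word, pos = 0, 0
    for name, width in FIELDS:
        # Mask each field to its width and place it at its fixed offset.
        word |= (values[name] & ((1 << width) - 1)) << pos
        pos += width
    assert pos == 72
    return word
```

Because no decoding is needed, any combination of field values is legal, which is the property the text attributes to this instruction format.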
Though there is a plan to automate the process of writing the programs loaded into the
control unit, the first programs were written manually. A spreadsheet was used to select the
values of every control signal at each time step. The spreadsheet was set up to automatically
insert the signal values into the proper bit position (see Appendix A for example). The integer
equivalent of the binary number is loaded into the control unit memory at compilation time.
In the current design, the program cannot be changed at run time.
To understand the reasoning behind the arithmetic logic in the PEs, it is necessary to
understand the process for using a residual and an update to calculate results. It is possible
to use a residual-update method, described in Chapter 2, to find the solution because the CA
solutions are attained by iterative improvement. This method has the advantage of using
low precision arithmetic for most calculations. In describing this method, n is the number
of bits used in high precision calculations and k is the number of bits used in low precision
calculations.
This method works by first calculating the residual, or error in the equation that is being
solved. The residual calculation must be performed in n bits for every cell in the model. The
most significant k bits of the residuals are then extracted and stored. The k bits must be
taken from the same position in every residual. The largest element in the residual vector
dictates which bits are selected. The update equation is then calculated in k bits, and the
k-bit version of the residual is used in place of the Fc and Mc in Equation 3.2. This k-bit
update is performed until the results converge. After the k-bit updates are found, they are
added into an accumulated version of the variables at the same offset as the bits that were
taken out of the residual. The cycle repeats using the accumulated version of the variables
in the residual equation. These iterations are repeated until the accumulated versions of the
variables converge.
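The residual-update cycle described above can be sketched in floating-point software. This is a simplified illustration of the idea (high-precision residuals driving low-precision inner iterations), applied here to a Jacobi-style solve; it does not reproduce the hardware bit layout:

```python
# Simplified sketch of the residual-update cycle applied to solving A*x = b.
# Full-precision residuals are scaled so their top K bits drive low-precision
# inner iterations; the k-bit correction is shifted back to its true
# magnitude before being accumulated. A software illustration only.

K = 8  # bits used for the low-precision update phase

def residual_update_solve(A, b, outer=20, inner=50):
    n = len(b)
    x = [0.0] * n
    for _ in range(outer):
        # Residual computed in full precision for every cell
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        largest = max(abs(v) for v in r) or 1.0
        scale = largest / (2 ** (K - 1))      # align top K bits of residuals
        rk = [round(v / scale) for v in r]    # k-bit view of the residual
        # Low-precision inner iterations, residual in place of the load terms
        u = [0.0] * n
        for _ in range(inner):
            u = [(rk[i] - sum(A[i][j] * u[j] for j in range(n) if j != i))
                 / A[i][i] for i in range(n)]
        # Add the correction back at the proper magnitude
        x = [x[i] + u[i] * scale for i in range(n)]
    return x
```

Each outer cycle shrinks the residual, so the scale factor shrinks with it and the accumulated variables gain precision well beyond K bits, which is the point of the method.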
The flow chart in Figure 3.7 shows an analysis cycle using this method. This method
is effective because the majority of the time is spent in calculating the update in k bits.
More parallel arithmetic units can be used to speed up the calculations because the update
Figure 3.7: Analysis cycle flow and precision for each operation.
calculation is performed in k bits.
There are three main parts to the PEs used in the Program-Based design (as shown in
Figure 3.8):
- Multiply Accumulator: calculates the residual in n bits
- Shift Unit: extracts k bits from n-bit residual and adds the update into accumulated
variables
- Matrix Accumulator: calculates cell updates in k bits
The Multiply Accumulator is simply a multiplication unit and an adder with the registered
version of its output connected to one of its inputs (see Figure 3.9). There is only one
multiply accumulator per PE because it uses n-bit arithmetic, and these n-bit precision
units are large. The Multiply Accumulator takes advantage of the built-in 18x18 multiplier
units on the Virtex-II FPGAs to save resources.
The minimization of the hardware results in multiple clock cycles being needed to compute
Figure 3.8: Computational unit for Program Based design.
Figure 3.9: Multiply accumulator used in computational unit.
residual values. The latency through each unit is one clock cycle, so the pipeline is two stages.
For the equation proposed in the beginning of this section, it takes 16 clock cycles to calculate
the residuals for one cell. The expense of the residual calculation is tolerable because many
update calculations are performed between residual calculations.
After the residual is calculated it must be converted to a k-bit number. During the
residual calculation, the most significant bit of the largest residual value is found. There is
a mechanism in each PE that stores the absolute value of the largest residual calculated.
This value is passed along the return data chain until it arrives at the control unit. Each
PE performs a logical OR on the value passed to it and the largest value it has calculated.
This process destroys the actual value of the largest residual, but the number passed to the
control unit shows the position of the most significant ‘1’. This position is used to determine
which bits of the residual are stored for the update phase.
Figure 3.10: MSB data return chain, used for determining the most significant ‘1’ of residuals.
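The OR-chain behavior can be sketched briefly. The function names below are illustrative; only the OR-and-pass-along behavior comes from the text:

```python
# Sketch of the OR-based return chain: each PE ORs the absolute value of its
# largest residual into the value passed along the chain. The OR destroys the
# magnitudes, but the position of the most significant '1' of the overall
# largest residual survives, which is all the control unit needs.
def chain_or(pe_max_residuals):
    acc = 0
    for r in pe_max_residuals:
        acc |= abs(r)          # one link of the return data chain
    return acc

def msb_position(chained):
    return chained.bit_length() - 1   # bit index of the most significant '1'
```

For example, with per-PE maxima of 12, 100, and 37, the chained OR preserves bit 6, the MSB position of 100, even though the value 100 itself is lost.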
The logic to extract the bits is based on a multiplexer, a register, and a right shifter. The
multiplexer selects between an input from memory or a right shift version of the value stored
in the register. The value output from the multiplexer is loaded into the register. During the
bit shifting phase, the register is first loaded with the n-bit value. The right-shifted value
is then selected from the multiplexer. The value is looped through the right shifter until the
desired bits are in the lowest position. The number of clock cycles required is dependent on
the number of bit positions the value needs to be shifted.
Figure 3.11: Unit for shifting the precision of intermediate results.
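Behaviorally, the shift unit reduces to the following sketch, where the one-bit-per-cycle loop mirrors the hardware's iterative shifting (the function name and interface are invented for illustration):

```python
# Behavioral sketch of the shift unit: the n-bit residual is loaded into a
# register, then shifted right one bit per clock cycle until the k desired
# bits occupy the lowest positions. The shift amount (and thus the cycle
# count) comes from the MSB position found during the residual phase.
def extract_k_bits(residual, msb_pos, k):
    shift = max(msb_pos - (k - 1), 0)   # clock cycles of shifting needed
    value = residual
    for _ in range(shift):              # one right shift per clock cycle
        value >>= 1
    return value, shift
```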
The adder, after the shift logic, is used during the addition phase at the end of the outer
analysis cycle. During the addition phase, the update is loaded into the highest bits and
shifted to the correct position. It is then added to the previous value, which is read from
memory. A signal from the control unit selects which value is output from the unit.
The final piece of the PE for the Program Based design is the Matrix Accumulator. The
Matrix Accumulator is similar to the Multiply Accumulator unit, except the arithmetic is
performed in k bits and more hardware is used to speed up calculations. The unit is designed
specifically to be able to multiply a 2x2 matrix by a 2x1 matrix. For example, Figure 3.12
shows the circuit calculating the equation:
    [ v̄ ]   [ C0 C1 ] [ v ]
    [ Θ̄ ] = [ C2 C3 ] [ Θ ]    (3.3)
Figure 3.12: Matrix accumulator used for analysis updates.
The multiplier has a three clock cycle latency and is fully pipelined. The entire unit has
latency of five clock cycles. The update for each cell using the matrix version of the beam
equations, described earlier in this section, takes 9 clock cycles. However, when the pipeline
is filled, the circuit can produce an update every five clock cycles, and this circuit calculates
the update for both analysis variables simultaneously.
Every PE can select to have the input to Port B of the memory connected directly to the
output of Port A of its left or right neighbor. In this way, each PE transfers data about the
cells on the edge of the section of the CA model for which it is responsible. This system is also
used to upload and download data from the FPGA. The PE that calculates the values for
the cells on the left end of the model can read data from the PCI bus, while the PE that
calculates the values for cells on the right end of the model can write data to the PCI bus.
Figure 3.13: Data flow for uploading and downloading data to the FPGAs.
To upload coefficients and external forces, as well as to initialize variable values, the host
computer begins by writing the data destined for the rightmost PE into the memory of the
leftmost PE. The data is then shifted through the PEs until it reaches the proper place,
while new data is shifted into the leftmost PE behind it. Downloading the results is a similar
process: the data is shifted right and read off the rightmost PE. An external clock is
used to keep data transfers synchronized.
Although reconfiguration is not part of the analysis cycle, the implementation of the
system for performing design optimization will use reconfiguration in a number of ways.
Each analysis model has fixed connections for communicating among PEs. It is possible
to pass data through intermediate PEs to transfer data between PEs that are not directly
connected. However, to achieve maximum efficiency, communication should be done over
direct connections when possible. The design system will have a number of different analysis
models, each with a different communication pattern. When the user specifies the initial
problem the system will select the bitstream for the most appropriate model and load it into
the FPGAs.
Design optimization may be performed in a number of different ways. The first possible
technique is to use reconfiguration. A bitstream developed to perform design optimization
could be loaded on the FPGAs using partial reconfiguration. The data would be passed
between analysis and design models through the BlockRAMs, like the method proposed
for the Problem Specific model. Another technique would be to use the analysis model
to perform the design calculations. Design would require new coefficients, which could be
loaded into the FPGA using the uploading and downloading methods described earlier. The
disadvantage of this method is that the analysis design might not be capable of performing
all of the operations needed, or the operations may be very inefficient. The final possibility
is to use a Virtex-II Pro FPGA, which contains built-in PowerPC processors. These internal
processors could be used to run a program to calculate the new design values.
-
Chapter 4
Results
4.1 Problem Formulation
The results in this section are based on solving the analysis of a CA model of a one-
dimensional beam. The model is formulated from work by researchers at Virginia Tech
[?]. The beam is divided into cells that have two degrees of freedom, vertical displacement
(v) and rotation (Θ). Each cell also has a separate vertical thickness, which is the design
variable. The thickness of the beam is specified at the middle of each cell, and then linearly
interpolated in between the specified points (see Figure 4.1). Cells in the model are evenly
distributed along the beam.
There are a number of possible configurations for each cell. The cell can have a fixed
displacement, a fixed rotation, a fixed displacement and rotation, or it can be free in dis-
placement and rotation. External forces can be applied to any cell. The forces can be in
the form of a vertical force (F) or a bending moment (M). These different configurations are
represented by changing the coefficients in the equation that is solved by each model. Using
these available cell configurations, many classical static beam problems can be solved.
The CA model, shown in Figure 4.2, was modeled with 20 cells and was run on both the
Figure 4.1: Diagram of the CA model for performing analysis on a beam.
Figure 4.2: Beam analysis problem modeled on the configurable computer.
Problem Specific and Program Based designs. Twenty cells were chosen so the model could be
simulated quickly. The first cell in the model is a dummy cell, for which no computations
are performed. The cells (1 and 19) on the ends of the beam have fixed displacement and
rotation. Cell 14 has a fixed vertical displacement. All other cells in the model are free
in displacement and rotation. There is a vertical force pushing up on cell 9. The force is
scaled to produce a maximum displacement of slightly less than 127, so the result can be
represented in 8 bits.
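The test problem above can be written out as per-cell settings. This is a hypothetical representation; the field names are invented, and in the actual designs these choices become coefficient values in each cell's equation:

```python
# Hypothetical sketch of the 20-cell test problem: boundary conditions and
# loads are per-cell settings that ultimately become coefficient choices in
# each cell's update equation. Field names here are invented.
def cell(fixed_v=False, fixed_theta=False, F=0.0, M=0.0):
    return {"fixed_v": fixed_v, "fixed_theta": fixed_theta, "F": F, "M": M}

cells = [cell() for _ in range(20)]        # free in displacement and rotation
cells[0] = cell(fixed_v=True, fixed_theta=True)   # dummy cell, no computation
cells[1] = cell(fixed_v=True, fixed_theta=True)   # fixed beam end
cells[19] = cell(fixed_v=True, fixed_theta=True)  # fixed beam end
cells[14] = cell(fixed_v=True)                    # fixed vertical displacement
cells[9] = cell(F=1.0)   # upward force, scaled so the result fits in 8 bits
```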
4.2 Problem Specific Design Results
The designs presented in the implementation section were intentionally developed independently
of any fixed precision for the results. There is a trade-off between the number of bits of
the solution that will be calculated and the number of cells that can be represented in the
system. In addition, larger precision results in lower maximum clock frequency and/or an
increased pipeline length.
Another factor in changing the precision is memory access. The BlockRAMs have pro-
grammable port widths that can accommodate some changes in precision. The BlockRAMs’
two ports can each handle up to 36 bits and can independently read or write. Once the data
transfer limit has been exceeded, accessing the data needed will take multiple clock cycles
or more memories must be used in the design.
During development, multiple versions of the Problem Specific design were built that used
different precision for calculations. Figure 4.3 shows the growth of the size of a PE as the
precision of the calculations is increased. The graph shows that size grows rapidly as the
number of bits is increased. This makes it very important to use the least precision possible
for calculations.
From this data, and based on the beam problem being solved, 8 bits of precision was
chosen as likely to be the most effective. In most of the following analysis, an 8-bit model
was studied for the Problem Specific design. However, this precision could change based on
the type of problem being solved and the number of cells in the model. In this respect,
the Problem Specific design would have more flexibility with regard to precision than the
Program Based design because the Problem Specific design is custom-made for each problem.
Once 8 bits was selected for the precision, the number of PEs needed to be selected. The
maximum number of PEs that can fit on an FPGA is limited by the programmable logic
and routing resources on the chip. However, when the chip usage gets high, the routing gets
inefficient and the maximum frequency at which the circuit can be clocked drops rapidly.
[Plot: Problem Specific design, % utilization vs. PE precision. x-axis: PE precision (bits); y-axis: % chip utilization (based on Virtex II-4000).]
Figure 4.3: Precision of PE vs. Percent Utilization of FPGA for Problem Specific design
[Plot: % utilization and maximum clock frequency vs. number of PEs. x-axis: number of PEs; left y-axis: % chip utilization (based on Virtex II-4000); right y-axis: maximum clock frequency (MHz).]
Figure 4.4: % Utilization of FPGA and maximum clock frequency vs. number of PEs for
Problem Specific design.
Figure 4.4 shows the chip utilization and the maximum clock frequency versus the number of
PEs. The number of logical cells that can be represented increases linearly with the number
of PEs. However, the maximum clock frequency decreases gradually as the number of PEs
increases, then drops quickly after the 35th PE.
The Problem Specific and Program Based designs vary widely in the number of cells they
can represent and the precision of the result. In order to compare these differing designs,
the maximum number of cell updates per second is used as a metric. This is also used as the
metric to determine the speed-up over a program running on a general purpose processor.
The number of cell updates per second for the Problem Specific design is simply the
[Plot: cell updates per second vs. number of PEs. x-axis: number of PEs; y-axis: cell updates per second (million).]
Figure 4.5: Cell updates per second vs. number of PEs for the Problem Specific design.
number of PEs multiplied by the maximum clock frequency. This is because in the Problem
Specific implementation, each PE produces a result every clock cycle during analysis. Fig-
ure 4.5 shows a peak in the maximum number of cell updates when 35 PEs are on the chip.
With 35 PEs the 8-bit design has a maximum clock frequency of 64.5 MHz. In comparison,
a 12-bit model with 35 PEs cannot fit on the Virtex-II 4000 FPGA.
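The throughput metric works out as a simple product, shown here for the peak configuration quoted above:

```python
# Worked example of the cell-updates-per-second metric for the Problem
# Specific design: once the pipeline is full, each PE produces one cell
# update per clock cycle, so throughput is PE count times clock frequency.
num_pes = 35
clock_hz = 64.5e6                        # maximum clock with 35 PEs
updates_per_second = num_pes * clock_hz  # roughly 2.26 billion updates/s
```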
There are additional factors that limit the cell updates per second that can actually be
performed on the system (see Table 4.1). The most costly factors are the reconfiguration
and readback times on the FPGA and the time it takes the host to compute the coefficients
for a given design. These times are important because the communication with the host is
                                Time (ms)
Operation            1 PE     1 FPGA    DINI Board
Reconfig             N/A      1190      4760
Readback             (not yet implemented in the DINI API)
Host computations    1.11     39        156
Table 4.1: Times for operations associated with Problem Specific analysis cycle for DINI
board.
done through reconfiguration and readback.
These results are dependent on the design being able to accurately produce analysis
results. The problem described earlier in this chapter in the Problem Formulation section
(see Figure 4.2) was modeled on the Problem Specific design. The force was scaled so the
result could be represented in 8 bits. The actual results were calculated using a
C++ program running on a PC which used floating-point arithmetic for all calculations. The
results show (see Figure 4.6) that the system was able to produce results that were similar,
but not exactly correct. The mean of the error between the actual results and the results
attained from the Problem Specific model for the displacement and rotation were 38.4% and
41.4% respectively. This large error is due to the rounding that takes place because fixed
point arithmetic is used. If better accuracy were needed, the results could be improved by
using reconfiguration and the residual-update method for iterative improvement.
4.3 Program Based Design Results
The Program Based design has much the same sensitivity to precision as the Problem Specific
design, but the Program Based design does not have quite as much flexibility in terms of
precision. Because the full precision calculations of the residual rely on the built-in 18x18
multipliers, it is difficult to increase the full precision result to more than 18 bits. However,
[Plot: actual results vs. Problem Specific design results. x-axis: position (m); y-axis: displacement (mm); series: actual displacement, Prob Spec displacement, actual rotation, Prob Spec rotation.]
Figure 4.6: Actual results and results from Problem Specific design for beam problem.
[Plot: chip utilization vs. PE precision. x-axis: PE precision (bits); y-axis: % chip utilization (based on Virtex II-4000).]
Figure 4.7: Precision of PE vs. % Utilization of FPGA for Program Based design.
there is some flexibility in the precision of the update calculations. Figure 4.7 shows how
the size of the PE grows as the precision of the update arithmetic is increased. The growth
is similar to that of the Problem Specific model.
Based on this data, 6 bits was selected for the precision of the update calculations. The
6-bit precision design attains the maximum cell updates per second with 60 PEs, at a maximum
clock frequency of 94.8 MHz. If the same design used 8 bits of precision, the maximum clock
frequency would be 88.1 MHz.
The precision selected for the update is also a trade-off between having smaller update units
and having to perform the outer iteration more often. When the precision of the update
calculation is larger, more inner iterations are performed before new residuals need to be
calculated. Using a smaller precision has the advantage of being able to devote more, smaller
units to calculating the update. The number of clock cycles needed for each phase of the
analysis cycle is shown in Table 4.2.

Analysis Cycle Phase   Clock Cycles
Residual Calc          550
Shift Residual         330-990
Cell Update            190 * Inner Iterations
Add                    410-1135

Table 4.2: Clock cycles for different phases of the residual-update analysis cycle.
As the number of inner iterations increases during each analysis cycle, the Program Based
model becomes more efficient. Figure 4.8 shows the increase in the efficiency of the design
versus the number of inner update iterations for each residual calculation. For this graph,
the average number of cycles for the Shifting and Adding phases of the analysis cycle was
used. When the number of inner iterations is below 10, more than half the time is spent in
calculating the residual or shifting the results. However, the model rapidly becomes more
efficient. With 35 inner iterations per analysis cycle this design achieves 75% efficiency, and
at 90 inner iterations the efficiency is 90%. The number of iterations needed will depend on
the type of problem and the number of cells in the model.
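The efficiency trend described above can be reproduced from the cycle counts in Table 4.2. The following C++ sketch assumes, as in the text, the midpoints of the Shift Residual and Add ranges as their average cycle counts:

```cpp
// Efficiency of one analysis cycle: the fraction of clock cycles spent in
// the Cell Update phase, using the cycle counts from Table 4.2. The Shift
// Residual and Add phases vary with the model, so their range midpoints
// are used here as the averages.
double analysisEfficiency(int innerIterations) {
    const double residual = 550.0;                    // Residual Calc
    const double shift    = (330.0 + 990.0) / 2.0;    // Shift Residual midpoint
    const double add      = (410.0 + 1135.0) / 2.0;   // Add midpoint
    const double update   = 190.0 * innerIterations;  // Cell Update
    return update / (residual + shift + update + add);
}
```

Under these assumptions the function gives roughly the figures quoted in the text: well under 50% below 10 inner iterations, about three quarters at 35, and about 90% at 90.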
The total number of cell updates per second that can be performed with the whole chip depends
on the number of PEs on the FPGA. The maximum number of cell updates per second for the
Program Based design is achieved slightly before the chip resources are exhausted because
of routing congestion (see Figure 4.9). Figure 4.10 shows how the maximum number of cell
updates per second rises then deteriorates as the number of PEs is increased.
The Program Based model depends on communication with a host through the PCI bus
in the current design. Before the calculation can begin, the coefficients for a specific design
need to be loaded into each of the PEs. Then, after the analysis is complete, the results
need to be downloaded back to the host. The transfer times for these operations are shown
Figure 4.8: Efficiency vs. number of inner iterations per analysis cycle.
Figure 4.9: % Utilization of FPGA and maximum clock frequency vs. number of PEs for
Program Based design.
-
44
10 20 30 40 50 60 700
500
1000
1500Cell updates per second vs. # of PEs
Cel
l upd
ates
per
sec
ond
(mill
ion)
Number of PEs
Figure 4.10: Cell updates per second vs. number of PEs for Program Based design.
                         Time (ms)
Operation                1 PE    1 FPGA   DINI Board
Uploading Coefficients   2.10    114      228
Downloading Results      0.311   18.5     37
Host Computations        0.360   21.5     86.0

Table 4.3: Time for operations associated with analysis on the Program Based design.
in Table 4.3. This communication is synchronized by an external clock supplied by the host.
There is additional overhead due to the computation of analysis coefficients on the host for
each design.
The problem shown in Figure 4.2 was modeled on the Program Based design. This was the same
model run on the Problem Specific design, except the force was scaled up so that the maximum
result was closer to an 18-bit number. The results were again compared to a C++ simulation
that used floating-point arithmetic for all calculations. Figure 4.11 shows that the results
attained from the Program Based model were very close to the actual results. The mean percent
error between the Program Based model data and the actual results was 0.099% for displacement
and 0.118% for rotation.
The results for the Program Based model were much more accurate than the results from
the Problem Specific model because the Program Based model uses 18 bits of precision
during the residual calculations. The speed of the Program Based model comes from the use
of only 6 bits for the update calculations. The external force was scaled so the maximum
displacement would be close to 18 bits. After they were computed, the results were scaled
down to match the original problem.
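The scaling step can be sketched in C++ as follows; the choice of a power-of-two scale factor and the estimate of the maximum unscaled result are illustrative assumptions, not details from the hardware design:

```cpp
#include <vector>

// Pick a power-of-two scale factor so that the largest expected result,
// once scaled, approaches but does not exceed the signed 18-bit limit
// (2^17 - 1 = 131071). A power of two keeps the scaling exact.
double chooseScale(double maxExpected) {
    const double limit = 131071.0;
    double scale = 1.0;
    while (maxExpected * scale * 2.0 <= limit) scale *= 2.0;
    return scale;
}

// After analysis, divide the downloaded results by the same factor to
// recover values matching the original, unscaled problem.
std::vector<double> unscale(const std::vector<double>& scaled, double scale) {
    std::vector<double> out;
    for (double v : scaled) out.push_back(v / scale);
    return out;
}
```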
Figure 4.11: Actual results and results from Program Based design for beam problem.
                   Maximum Cell Updates per Second (millions)
Design             1 PE    1 FPGA   DINI Board
Problem Specific   65.1    2279     9116
Program Based      18.9    1137     4548

Table 4.4: Maximum cell updates per second for both implementations.
Design             Maximum Cell Updates per Second (millions)   Speed-up
PC                 48.9                                         -
Problem Specific   9116                                         186.4
Program Based      4548                                         93.0

Table 4.5: Maximum cell updates per second and speed-up for both implementations compared to the PC.
4.4 Comparison of Designs
Most of the results presented in the earlier sections were for systems using one FPGA.
However, the DINI board intended for this system has 4 FPGAs. The total computing
power increases linearly because all the FPGAs can run in parallel. Table 4.4 shows the
maximum number of cell updates per second for each design at each level of the system.
To compare these FPGA based designs to conventional methods, a C++ program was
written to calculate the results. To make the comparison as fair as possible, the program
uses integer arithmetic instead of slower floating-point arithmetic. Integer computations are
closer to the fixed point math used by the FPGA designs. The program was compiled using
GCC with optimization enabled and executed on a PC with a 1.7 GHz processor and 1 GB
of RAM running Debian Linux. Table 4.5 shows the speed-up attained by the FPGA designs
over the general purpose processor version.
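The board-level figures can be checked with a short C++ sketch using the values from Tables 4.4 and 4.5; the linear four-FPGA scaling is the assumption stated above:

```cpp
#include <cmath>

// Board rate: the DINI board's FPGAs run in parallel, so the board total
// is the single-FPGA rate times the number of FPGAs.
double boardRate(double perFpgaRate, int numFpgas) {
    return perFpgaRate * numFpgas;
}

// Speed-up over the PC baseline is simply the ratio of maximum
// cell-update rates.
double speedup(double updatesPerSec, double pcUpdatesPerSec) {
    return updatesPerSec / pcUpdatesPerSec;
}
```

With the single-FPGA rates of 2279 and 1137 million updates per second, four FPGAs give the 9116 and 4548 million figures of Table 4.4, and dividing by the PC's 48.9 million reproduces the 186.4x and 93.0x speed-ups of Table 4.5.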
-
Chapter 5
Conclusions
5.1 Summary
Cellular Automata (CA) theory has been studied for decades, with most of the work done
on modeling natural systems. Recently, the use of CA theory has been extended to provide a
system for structural analysis and design optimization. This work has proved to be successful,
but the calculations are very slow on traditional general-purpose processors. The parallel
nature of these CA models makes them a good candidate for implementation on a reconfigurable
computer. The work presented in this thesis shows the initial steps toward making
an automated tool to perform structural design optimization accelerated by a reconfigurable
computer.
The contribution of this thesis was to design and implement two models for the analysis
phase of the CA structural design optimization cycle. Both designs take advantage of the
parallel nature of cellular automata by using a distributed array of processing elements.
For the Problem Specific implementation, these processing elements are customized to each
problem. The Program Based implementation has a more flexible processing unit that
is controlled by a program designed to simulate a specific cellular automata model. The
Program Based implementation also has the built-in capability to use a residual-update
method to accelerate calculations and improve accuracy.
5.2 Results
The results show the Problem Specific design and the Program Based design were able to
generate cell updates at the rate of 9.12 and 4.55 billion per second, respectively. Though the
Problem Specific design proved to be able to generate updates more rapidly, this increase
in speed came at the expense of precision and flexibility. The Program Based model’s
competitive speed, improved accuracy, and ability to handle a range of update rules make it
the architecture that provides the most potential for an automated system.
Both hardware implementations of this CA model for structural analysis were very
successful in terms of performance. When compared to a 1.7 GHz Pentium 4 processor, the
Problem Specific design proved to be 186 times faster. The Program Based design, which
was slightly slower, was still 93 times faster than the general purpose processor version.
These speed-ups are a step towards making a CA system for structural design optimization
that significantly outperforms traditional methods.
5.3 Future Work
There are a number of interesting areas that need to be studied in order to design an
automated tool to perform structural design optimization using CA. The most immediate
may be the need for a translator and compiler for the programs used by the Program Based
design. For the work in this thesis, the programs were all written by hand. This process was
very difficult and time consuming. A compiler is needed to take a higher-level abstraction and
generate machine-level instructions. The end product should be a compiler that could take a
problem specified by a design engineer who has no knowledge of the hardware implementation
and produce the necessary instructions.
Additionally, an efficient method for design calculations must be implemented. In the
current system, the results are downloaded to a host computer where design calculations can
be performed. However, this is an inefficient technique. A number of possibilities exist for
executing the design calculations on the board, such as using partial reconfiguration or
on-board processors. These possibilities need to be investigated to identify the best method.
Another area that needs work is implementing multi-grid on the system. The multi-grid
approach to these CA problems would be to calculate results while varying the resolution of
the grids. In other words, the number of cells representing the system would increase and
decrease based on certain algorithms. Multi-grid could also be used to blend analysis and
design steps into a single cycle. These methods have the potential for huge reductions in the
number of calculations needed.
-
Appendix A
This appendix gives an example of how the programs for the Program Based model are
written.
The Processing Elements (PEs) in the Program Based model were developed to perform
high- and low-precision arithmetic and to convert between the two forms. The control logic
needed to simulate a cellular automata model is complex, so programs are used to set the
control signals. The program is stored in an internal BlockRAM contained in the Control
Unit (CU). As the PEs were designed, the control signals for each unit were assigned to
particular bits of the BlockRAM in the CU. The position of the bits and a short description
of their function was recorded in an Excel spreadsheet. Figure A.1 shows a screenshot of this
spreadsheet.
Each phase of the analysis cycle was written as a separate program. There are control
signals for each functional unit of the PE, but only one functional unit is in use during
each phase. The first step in writing a program was to identify the signals needed for the
particular phase. The pertinent signals were placed across the top of a spreadsheet and the
value of the signal was specified below. Each horizontal line represents a clock-cycle step.
Figure A.2 shows an example of a program. This particular program calculates the update
during the inner iteration of the analysis cycle.
The signals on the left are used to determine the order in which instructions are executed.
Figure A.1: Spreadsheet with position of control signals and short description.
The program counter will increment by one unless a loop is specified. The remaining signals
control the PEs. Signals with only two options are specified as Y or N; signals with more
options are specified as a number or letter from a defined set.
There is a second spreadsheet which determines the numerical value for each signal. The
values of the signals are then converted into an intermediate form. The intermediate form is
the numerical value of the signal multiplied by two to the power of its bit position. The final
value is the sum of all the intermediate values (see Figure A.3). This is the number that is
loaded into the CU BlockRAM. These final values are then put in a form that can be read
into memory (see Figure A.4).
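The encoding can be sketched in C++ as follows; the Signal structure and any particular bit positions are illustrative, since the actual assignments are recorded in the spreadsheet of Figure A.1:

```cpp
#include <cstdint>

// Pack control-signal values into one instruction word as the spreadsheets
// do: each signal's numeric value is multiplied by two to the power of its
// bit position (a left shift) and the results are summed. Because the
// signal fields do not overlap, the sum is equivalent to a bitwise OR of
// the shifted fields.
struct Signal {
    uint32_t value;       // numeric value chosen for this signal
    uint32_t bitPosition; // lowest bit of this signal's field in the word
};

uint32_t packControlWord(const Signal* signals, int count) {
    uint32_t word = 0;
    for (int i = 0; i < count; ++i)
        word += signals[i].value << signals[i].bitPosition; // value * 2^pos
    return word;
}
```

Each row of the program spreadsheet would be packed this way into one BlockRAM word before being written into the CU's memory.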
Figure A.2: Spreadsheet containing the update program.
Figure A.3: Spreadsheet converting signals to the form used by the Program Based model.
Figure A.4: Spreadsheet containing the data values in a form that can be loaded into memory.
-
Vita
Thomas Hartka was born in June 1980 in Baltimore, Maryland. He attended Archbishop
High School in Severn, Maryland. Thomas enrolled in the College of Engineering at
Virginia Tech in the fall of 1998. He graduated Cum Laude with a Bachelor of Science in
Computer Engineering. Thomas chose to remain at Virginia Tech to pursue his Master's
Degree. He became involved in research at the Virginia Tech Configurable Computing Lab. After
graduating, Thomas will attend Johns Hopkins’ Post-Baccalaureate Premedical Program.