
  • Cellular Automata for Structural Optimization on Reconfigurable

    Computers

    Thomas R. Hartka

    Thesis submitted to the Faculty of the

    Virginia Polytechnic Institute and State University

    in partial fulfillment of the requirements for the degree of

    Master of Science

    in

    Computer Engineering

    Dr. Mark T. Jones, Chair

    Dr. Peter M. Athanas

    Dr. Michael S. Hsiao

    May 12, 2004

    Blacksburg, Virginia

    Keywords: configurable computing, cellular automata, design optimization

    Copyright © 2004, Thomas R. Hartka

  • Cellular Automata for Structural Optimization on Reconfigurable

    Computers

    Thomas R. Hartka

    (ABSTRACT)

    Structural analysis and design optimization are important to a wide variety of disciplines. The current methods for these tasks require significant time and computing resources. Reconfigurable computers have shown the ability to speed up many applications, but are unable to handle efficiently the precision requirements for traditional analysis and optimization techniques. Cellular automata theory provides a method to model these problems in a format conducive to representation on a reconfigurable computer. The calculations do not need to be executed with high precision and can be performed in parallel. By implementing cellular automata simulations on a reconfigurable computer, structural analysis and design optimization can be performed significantly faster than conventional methods.

    This work was partially supported by NSF grant #9908057 as well as by the Virginia

    Tech Aspires program.

  • Acknowledgements

    I would first like to thank my advisor, Dr. Mark Jones, for his guidance through my entire

    research. Without his guidance I never would have been able to complete this thesis.

    Thanks to Dr. Athanas for serving on my thesis committee and for making development

    on the DINI board possible.

    Thanks to Dr. Hsiao for serving on my committee and being an excellent teacher.

    Thanks to Dr. Gurdal and his researchers for providing the mathematics for the cellular automata models and for the effort they spent learning about reconfigurable computing so that the equations mapped efficiently.

    Thanks to all the professors and students involved with the Configurable Computing Lab

    for making it a great place to work.

    Thanks to all the people that helped in the process of reviewing and editing this thesis.

    I am forever indebted to anyone who will review sixty pages of my writing.

    Thanks to everyone else who I have not mentioned that helped with my work. I could

    not have done it without the support from the people around me.


  • Contents

    1 Introduction 1

    1.1 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

    1.2 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

    2 Background 4

    2.1 Cellular Automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

    2.2 Configurable Computers for Scientific Computations . . . . . . . . . . . . . . 8

    2.3 Limited Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    3 System Design 13

    3.1 Design Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    3.1.1 System Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    3.1.2 Distributed Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    3.2 Problem Specific Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

    3.3 Program Based Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    4 Results 32


  • 4.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

    4.2 Problem Specific Design Results . . . . . . . . . . . . . . . . . . . . . . . . . 34

    4.3 Program Based Design Results . . . . . . . . . . . . . . . . . . . . . . . . . . 38

    4.4 Comparison of Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

    5 Conclusions 48

    5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

    5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

    5.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

    A 51

    Vita 55


  • List of Figures

    3.1 Setup of the configurable computer used for simulating the CA model. . . . . 15

    3.2 Distribution of logical CA cells among PEs. . . . . . . . . . . . . . . . . . . 16

    3.3 Arithmetic unit for Problem Specific design. . . . . . . . . . . . . . . . . . . 19

    3.4 PE layout for Problem Specific design . . . . . . . . . . . . . . . . . . . . . . 20

    3.5 Return data chains for Program Based design. . . . . . . . . . . . . . . . . . 23

    3.6 Control Unit for Program Based design. . . . . . . . . . . . . . . . . . . . . 24

    3.7 Analysis cycle flow and precision for each operation. . . . . . . . . . . . . . . 26

    3.8 Computational unit for Program Based design. . . . . . . . . . . . . . . . . . 27

    3.9 Multiply accumulator used in computational unit. . . . . . . . . . . . . . . . 27

    3.10 MSB data return chain, used for determining the most significant ‘1’ of residuals. 28

    3.11 Unit for shifting the precision of intermediate results. . . . . . . . . . . . . . 28

    3.12 Matrix accumulator used for analysis updates. . . . . . . . . . . . . . . . . . 29

    3.13 Data flow for uploading and downloading data to FPGAs. . . . . . . . . . . 30

    4.1 Diagram of the CA model for performing analysis on a beam. . . . . . . . . 33

    4.2 Beam analysis problem modeled on the configurable computer. . . . . . . . . 33


  • 4.3 Precision of PE vs. Percent Utilization of FPGA for Problem Specific design 35

    4.4 % Utilization of FPGA and maximum clock frequency vs. number of PEs for

    Problem Specific design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

    4.5 Cell updates per second vs. number of PEs for the Problem Specific design. . 37

    4.6 Actual results and results from Problem Specific design for beam problem. . 39

    4.7 Precision of PE vs. % Utilization of FPGA for Program Based design. . . . . 40

    4.8 Efficiency vs. number of inner iterations per analysis cycle. . . . . . . . . . . 42

    4.9 % Utilization of FPGA and maximum clock frequency vs. number of PEs for

    Program Based design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

    4.10 Cell updates per second vs. number of PEs for Program Based design. . . . 44

    4.11 Actual results and results from Program Based design for beam problem. . . 46

    A.1 Spreadsheet with position of control signals and short description. . . . . . . 52

    A.2 Spreadsheet containing update program . . . . . . . . . . . . . . . . . . . . . 53

    A.3 Spreadsheet converting signals to the form used by the Program Based model 53

    A.4 Spreadsheet containing the data values in a form that can be loaded into

    memory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54


  • List of Tables

    4.1 Times for operations associated with Problem Specific analysis cycle for DINI

    board. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

    4.2 Clock cycles for different phases of residual-update analysis cycle. . . . . . . 41

    4.3 Time for operations associated with analysis on Program Based design. . . . 45

    4.4 Maximum cell updates per second for both implementations. . . . . . . . . . 47

    4.5 Maximum cell updates per second and speed up for both implementations

    compared to PC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47


  • Chapter 1

    Introduction

    Structural analysis and design optimization are an integral part of many industries. Applications range from simple tasks, such as testing and optimizing a support beam, to very complicated ones, such as optimizing the structure of a car for crash resistance.

    Performing the design iterations manually is very time consuming. Therefore, a significant

    amount of research has been conducted to develop efficient methods to automate the design

    process.

    Traditional methods for automating design have involved running simulations on general

    purpose processors. In these methods, calculations must be performed in high precision.

    Parallelization of the calculations, where possible, is done through expensive supercomputers with hundreds of processors. Even with massive amounts of computing power, the simulations will usually take hours to complete.

    Cellular Automata (CA) has proved to be a very powerful tool for modeling physical

    phenomena. CA models have successfully captured the behavior of complex systems such

    as fluid flow around a wing and pedestrian traffic [?, ?]. Recently, CA theory has been

    extended to structural analysis and design optimization [?]. Using CA in structural models

    changes analysis and design optimization into a highly parallelizable form that does not require


    high-precision calculations. This provides the potential for significant speed-up.

    1.1 Thesis Statement

    Using CA provides a method to efficiently map structural design optimization problems onto

    FPGAs. By exploiting the inherent parallelism of FPGAs there is the potential for speed-up

    over general purpose processors.

    To achieve this objective, distributed processing systems were implemented on a configurable computer. The system consisted of a host PC connected to a PCI-based board with five FPGAs. Two designs were developed for the FPGAs to rapidly iterate CA models for structural analysis. The two designs represent significantly different approaches to accomplishing the same objective.

    The author’s contributions to this work are the following:

    - developed a custom FPGA design for simulating a beam CA model,

    - developed a separate FPGA design that executes programs to simulate CA models,

    - wrote programs for the FPGA design to simulate a beam CA model, and

    - implemented a limited precision method in hardware for solving iterative improvement

    problems.

    1.2 Thesis organization

    Chapter 2 presents background information about CA theory, scientific computations on configurable computers, and the limited precision method used in the designs. Chapter 3 gives

    details on the two implementations developed to solve the CA models. Chapter 4 presents


    the results for each of the two implementations and comparisons to traditional methods.

    Chapter 5 summarizes the work performed for this project and the results obtained.

  • Chapter 2

    Background

    This chapter presents previous work in areas related to this research. The combined contributions discussed were used in the completion of the simulation environment and prototype

    presented in this thesis.

    2.1 Cellular Automata

    The concept of Cellular Automata (CA) theory is to model systems made up of many interacting objects [?]. The systems are divided into discrete units, or cells, that act autonomously. The advantage of using CA is that the behavior of some complex systems can be captured using relatively simple rules for each cell [?]. Attempting to reproduce this behavior without breaking those systems into autonomous units, even if possible, would be complicated.

    Each cell in a CA model can be in a single state at any given point in the simulation. The

    number of states the cell may be in depends on the problem being solved. In many models

    the number of elements in the set of states is small (8 or less), but there is a new class of

    CA models that use a continuous state space. These continuous state space CA models are


    known as coupled map-lattice or cell dynamic schemes. The next state of a cell is based on

    an update rule, sometimes referred to as a transition rule, which is a function of its current

    state and the current state of its neighbors [?]. The collective state of all of the cells in the

    model at any given point is known as the global state [?].

    Stanislaw Ulam is generally credited with the first work in CA, originally referred to

    as cellular space or automata networks. John von Neumann extended Ulam’s work and

    proposed CA as a way to model self-reproducing biological systems [?, ?]. The work of

    Ulam and von Neumann provides a formal method for simulating complex systems. Their

    research, and much of the current research in CA, focused on modeling dynamic systems in

    which time and space are discrete. Each calculation of the next state of all the cells in a

    system represents a step in time [?]. A good example of this type of CA model is Conway’s

    Game of Life in which cells can be in one of two states: alive or dead. Each update of the

    global state represents a new generation of organisms [?].
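    To make the update rule concrete, the following minimal Python sketch (an illustration, not code from this thesis) computes one global state update of the Game of Life on a finite grid, treating out-of-range neighbors as dead:

        def life_step(grid):
            # One synchronous update of Conway's Game of Life.
            # grid is a list of rows of 0 (dead) and 1 (alive);
            # cells outside the grid are treated as permanently dead.
            rows, cols = len(grid), len(grid[0])
            def alive(r, c):
                return grid[r][c] if 0 <= r < rows and 0 <= c < cols else 0
            nxt = [[0] * cols for _ in range(rows)]
            for r in range(rows):
                for c in range(cols):
                    n = sum(alive(r + dr, c + dc)
                            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                            if (dr, dc) != (0, 0))
                    # transition rule: survive with 2-3 neighbors, birth with 3
                    nxt[r][c] = 1 if (n == 3 or (grid[r][c] and n == 2)) else 0
            return nxt

    Every cell applies the same rule as a function of its own state and its neighbors' states, and the entire grid is updated from the old global state at once.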

    There are a number of architectures for CA models, each resulting in different behavior.

    The number of dimensions of the CA model can differ greatly depending on the system being

    modeled. In practice, models are typically one, two, or three dimensional. However, there

    is no limit to the number of dimensions that can be used [?]. The number of dimensions of

    the grid has a large effect on the communication network among cells, known as the cellular

    neighborhood.

    In the work on two-dimensional grids there are two common cellular neighborhoods. The

    first is the von Neumann neighborhood, in which each cell communicates only with the four

    cells that are orthogonally adjacent to it. The second is a Moore neighborhood in which a

    cell communicates with all eight cells surrounding it [?]. Though von Neumann and Moore

    neighborhoods are common, cells are not limited to communicating only with those that

    are adjacent. The “MvonN Neighborhood” uses the nine cells (including the center cell)

    in the Moore neighborhood as well as the four cells orthogonally one space away from the

    current cell. Additionally, the communication of the cells within a model is not required to


    be consistent throughout the model [?].

    There has also been a significant amount of work investigating non-rectangular cell systems. Gas lattice automata are a subset of cellular automata that commonly use the FHP model. The FHP model uses a hexagonal grid, where cells communicate with their six immediate neighbors [?,?]. The use of triangular and regular polygonal lattices is common in

    specialized applications of cellular automata because they can better capture the behavior

    of certain systems [?].

    Models in which communication and update rules are consistent throughout the model

    are called uniform. Though most of the work in the area of CA has used uniform models,

    the use of non-uniform rules does not necessarily detract from the effectiveness of using CA.

    A number of experiments have been conducted to model the effect of “damaged” areas of

    a grid where cells use different rules [?]. In terms of simulating a CA model on a serial

    processor, a uniform grid has the advantage that only one update rule is needed [?].

    The grid for a CA model may be finite or infinite. In his work, von Neumann examined

    infinite grids as a method to construct a universal computer [?]. Although von Neumann’s

    work on infinite grids was theoretical, methods for representing and calculating CA models

    on infinite grids have been developed [?]. Finite grids are much simpler to implement and

    process in parallel because the maximum size of the active area is known before processing

    begins. However, the use of finite grids introduces the problem of how to calculate cells on

    the edge of the grid, known as the boundary conditions.

    There are several ways to handle the processing of cells on the edge of a finite grid. The

    first method is to logically connect the cells on one edge of the grid to cells on the opposite

    edge, producing a loop. Another way to handle boundary conditions is to use a fixed value

    for cells at the perimeter of the grid. In systems with fixed boundary conditions, the edge

    cells are known as dummy cells because they do not need to be updated. The third method

    for calculating the update for edge cells is to use an update rule that is different than that

    used in the internal cells [?]. An example of such a rule would be an edge cell that simply mirrored the value of the closest internal cell. The type of boundary condition used depends

    largely on the problem being modeled.
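    As an illustration of these three strategies (a sketch, not from the thesis), the boundary condition can be viewed as the policy for resolving an out-of-range neighbor index on a finite one-dimensional grid:

        def neighbor(cells, i, mode, fixed=0):
            # Resolve the value of logical neighbor i on a finite 1-D grid.
            n = len(cells)
            if 0 <= i < n:
                return cells[i]
            if mode == "wrap":    # edges logically connected into a loop
                return cells[i % n]
            if mode == "fixed":   # dummy cells hold a constant value
                return fixed
            if mode == "mirror":  # special edge rule: copy closest internal cell
                return cells[0] if i < 0 else cells[n - 1]
            raise ValueError(mode)

        cells = [1, 2, 3, 4]
        print(neighbor(cells, -1, "wrap"), neighbor(cells, 4, "mirror"))  # 4 4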

    The early work in CA theory concentrated on theoretical computational questions, such

    as computational universality. In later work, it has been used as a method to study social,

    physical, and biological systems [?]. A number of studies have been conducted to use CA

    to capture the aggregate behavior of groups of autonomous beings, for example, car traffic,

    pedestrian flow, and ant colonies [?,?,?]. In scientific computing, many successful attempts

    have been made to model such phenomena as fluid dynamics, chemical absorption, and

    heat transfer using CA [?,?,?].

    Some of the most recent work in CA has been in the field of structural analysis and

    design. The first work in this area was the development of methods to optimize the angle

    and cross-section area of trusses in a fixed structure [?]. These methods proved successful in

    merging analysis and design into a CA model and showed powerful computational properties.

    This success prompted more work to extend CA theory to create models for other structural

    design problems.

    A model was developed to minimize the weight of a beam needed to prevent buckling

    [?]. The beam is represented in sections, to which constraints and external forces can be

    applied independently. The cross sectional area for each section is determined to produce the

    minimum total weight of the beam. Experiments with this method showed that models converged

    to the correct minimum solution. This area of CA research shows substantial promise for

    accelerating structural design optimization through parallel computing.


    2.2 Configurable Computers for Scientific Computations

    FPGAs, which are usually the basis for configurable computers, show considerable speed-up

    for a variety of applications when compared to general purpose processors. These applications

    include signal processing, pattern searching, and encryption. The use of FPGAs for these

    tasks, which mainly involve bit manipulation, has shown orders of magnitude acceleration

    [?,?,?]. These accelerations are possible because the tasks can be broken down into simple

    operations. The operations can then be performed in parallel throughout the chip.

    Although pattern matching and bit manipulation have been widely studied on FPGAs,

    FPGAs have typically not been used for scientific computations. Two significant deterrents

    in using FPGAs are the limited available programmable logic and the slow clock speeds. In

    the past, FPGAs have only been able to represent circuits that had gate counts in the low

    thousands [?]. This low gate count is restrictive for scientific computations. For example, a

    32-bit parallel multiplier could not be emulated by most of the FPGAs in the Xilinx XC3000

    family, chips that were first produced in the mid 1990s (based on number of CLBs). This

    becomes even more of a handicap because FPGAs typically operate at clock speeds much

    lower than average CPUs. A general purpose processor will usually outperform an FPGA if

    the FPGA cannot carry out parallel or deeply pipelined operations.

    These limitations of FPGAs have been greatly reduced in recent chips because of the

    much larger transistor densities. The latest Xilinx FPGAs can emulate circuits with up

    to 10 million gates [?]. With increased programmable logic, it is possible to have many

    more arithmetic units performing complicated operations in parallel. In comparison to the

    previous example, a Xilinx XC2V8000 chip, currently Xilinx’s largest FPGA, has enough

    logic to represent thirty-five 32-bit multipliers (based on number of CLBs). Floating-point

    operations continue to require a large percentage of available resources. Still, researchers have

    begun exploring scientific computations on FPGAs. A paper from researchers at Virginia


    Tech used the flexibility of FPGAs to develop representations of floating-point numbers

    that are more efficient on FPGAs [?]. In 2002, researchers published a paper detailing the

    development of a limited precision floating-point library and an optimizer to determine the

    minimum precision needed in DSP calculations [?].

    Using the least precision possible is important on an FPGA. General purpose processors

    usually compute operations in higher precision than is needed because of the limited choices

    for precision. However, the fine-grain control of the logic in an FPGA allows custom arithmetic units of any precision. This flexibility can be extended to dynamically controlling the

    precision of different calculations on the same unit. Other work has been presented on a

    variable precision coprocessor for a configurable computer and algorithms given for variable

    precision arithmetic units [?]. Two papers have been published investigating how to manage

    dynamically varying precision and to show how the overall runtime is substantially decreased

    by using minimal precision [?,?].

    CA has been used in the computer science community for some time. In 1985, a book was

    published describing implementations for CA simulations on massively parallel computers

    [?]. However, there has been little work done in trying to run these models on configurable

    computers. There have been some papers written on using CA on FPGAs, but they all focus

    on models with simple cell update rules and small state sets. For example, the CAREM

    system was developed to efficiently model CA on FPGAs [?]. The two models published as

    examples of using the CAREM system were an image thinning algorithm and a forest fire

    simulation. In both cases the models were simple, having state set sizes of 4 or less. Other

    cellular automata simulation systems have been proposed for reconfigurable computers which

    concentrated on fluid dynamics [?,?]. However, like CAREM, these systems are only capable of handling simple models with a very limited number of states.

    Custom hardware architectures were implemented by Norman Margolus from MIT for processing CA. The most successful was known as the Cellular Automata Machine 8 (CAM-8) [?]. The CAM-8 is based on custom SIMD processors that are connected in three


    dimensions. Each processor is responsible for a section of data in the model which is stored

    in a DRAM. Processing on each cell's data is performed using look-up tables (LUTs) stored in SRAM. This architecture shows impressive results, generating up to 3 billion cell updates

    per second. However, the LUT based processing limits models to a fairly small state size.

    There have been a number of projects which use the CAM-8 in areas such as modeling fluid

    motion [?] and gas lattices [?]. The CAM-8 is now sold commercially.

    2.3 Limited Precision

    The use of configurable computers has renewed study in the area of limited precision computing. Determining the least number of bits needed for a task was important when many chips were custom designed and silicon was expensive. With the rise of cheaper fabrication methods and inexpensive, powerful CPUs, this area has become less important. The use of

    general purpose processors with dedicated floating-point units lessens the penalty for using

    floating-point for all calculations. However, as configurable computers become popular, the

    use of limited precision for calculations has again become important [?].

    All configurable computers are based on programmable logic at some level of granularity.

    Historically, the most popular type of programmable logic is the FPGA. FPGAs have bit-level granularity, so arithmetic units can be built with any precision. In most cases, each

    additional bit of precision of an arithmetic unit will require more chip resources. Also, the

    maximum clock frequency for an arithmetic unit may decrease with each additional bit of

    precision. This high sensitivity makes using the lowest precision possible very important to

    optimizing a design on an FPGA.

    Limiting precision has been extended further for FPGAs for solving iterative problems in

    a recent paper [?]. This paper describes a method for performing low precision calculations

    that are collated into high precision results. Similar ideas were developed for CPUs, but

    those studies focused on using single precision floating-point calculations to find double


    precision solutions [?,?]. As mentioned earlier, FPGAs have a much finer grain of control

    over precision, and floating-point calculations are expensive when using FPGAs. So a new,

    modified version of this concept has recently been investigated specifically suited for use on

    a configurable computer [?].

    The reason that low-precision arithmetic can be used in iterative improvement problems

    is that the answer converges gradually. During each step, a correction is found that improves

    the solution. When the correction is large and the highest bits of the solution are converging,

    the low bits do not hold any useful information. Therefore, there is no advantage to using

    a precision that calculates the low order bits before the upper bits have converged. As

    the solution becomes closer to the final answer, the refinement at each step becomes less.

    Because the refinement is small, the high order bits no longer change. At this point there is

    no longer any reason to recompute the high bits of the solution.

    This property of iterative improvement problems makes it possible to use fewer bits to

    calculate the correction than the number of bits that are in the final result. In this way,

    only the high order bits are calculated while the correction is large; inversely, only the low

    order bits are computed when the correction becomes small. This is possible by calculating

    the error (residual) in the equation for the iterative improvement problem. The goal of the

    example below is to find a value for x which satisfies the equation.

    A x_i = b \qquad (2.1)

    The residual (or error) in this equation can be written as

    r = b - A x_i \qquad (2.2)

    Instead of using the initial equation, the change in x can be calculated

    \Delta x_i = A^{-1} r \qquad (2.3)


    The previous calculation can be performed with lower precision arithmetic. This step is iterated a number of times, and \Delta x_i is then added back into the previous x:

    x_{i+1} = x_i + \Delta x_i \qquad (2.4)

    This method has been shown to converge to the correct solution [?]. It is applicable to our

    work in CA because the CA models we use for structural design optimization are in a form

    that utilizes this method. The advantage of using this method on reconfigurable computers

    comes from the fact that the bulk of the operations are performed during the update phase.

    A large number of resources can be devoted to accelerating the update calculations because

    the update can be calculated at a low precision.
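    To show Equations 2.1-2.4 in action, here is a small Python sketch (illustrative values, with a crude quantizer standing in for low-precision hardware arithmetic; not code from the thesis) that solves a diagonally dominant 2x2 system by computing the residual in full precision and the correction with low-precision inner iterations:

        import math

        def quantize(x, bits):
            # crude model of low-precision arithmetic: keep only the
            # top `bits` significant bits of x
            if x == 0.0:
                return 0.0
            scale = 2.0 ** (bits - 1 - math.floor(math.log2(abs(x))))
            return round(x * scale) / scale

        def refine(A, b, x, inner_bits=8, outer_steps=20):
            for _ in range(outer_steps):
                # residual in full precision (Eq. 2.2)
                r = [b[i] - sum(A[i][j] * x[j] for j in range(2))
                     for i in range(2)]
                # correction via low-precision Jacobi sweeps (Eq. 2.3)
                dx = [0.0, 0.0]
                for _ in range(10):
                    dx = [quantize((r[i] - A[i][1 - i] * dx[1 - i]) / A[i][i],
                                   inner_bits) for i in range(2)]
                # fold the correction back into the solution (Eq. 2.4)
                x = [x[i] + dx[i] for i in range(2)]
            return x

        A, b = [[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]
        print(refine(A, b, [0.0, 0.0]))   # approaches [1/11, 7/11]

    Even though each correction carries only about eight significant bits, the outer loop accumulates a much more precise answer, which is the property the hardware designs in Chapter 3 exploit.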

  • Chapter 3

    System Design

    This chapter describes two approaches to implementing CA models on FPGAs. Both approaches use an array of uniform, simple processing elements (PEs) spread throughout the

    chip. A large number can fit on a single FPGA because the PEs are relatively simple. This

    distributed computing is effective because of the parallel nature of solving the CA models.

    The two designs described in this section illustrate a fundamental tradeoff in hardware

    design: flexibility versus speed. The first implementation is a custom circuit developed

    to solve the analysis equation for a given design. The second implementation executes a

    program stored in memory that controls arithmetic operations. Both designs solve the same

    analysis problem.

    It is important to note that the underlying theory behind the two designs is the same. In

    both cases, the design is intended to determine the displacement and rotation of sections of

    a beam given the constraints and external forces on the beam. Though they solve the same

    problem, the motivation behind each design is fundamentally different. Therefore, although

    the same equations are used for solving for the beam variables, the form of the equations is optimized for the specific implementation.


    3.1 Design Background

    When performing operations on an FPGA, it is much more efficient to use fixed-point arithmetic than floating-point arithmetic. For this reason, both of the designs represent numbers

    in fixed-point notation. The nature of CA models allows for this type of representation. The

    position of the binary point depends on the architecture and the type of data being stored. The

    number of bits of precision varies based on the operation being performed. In both models,

    intermediate values produced during calculations are stored in increasing precision to avoid

    loss of data. The data is then truncated before the final value is stored.
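    As a small illustration of this representation (a sketch, not the designs' actual bit widths), a fixed-point value can be modeled as an integer with an implied binary point; a multiply produces a wider intermediate that is truncated back to the storage format:

        FRAC = 8   # assumed fractional bits for illustration

        def to_fixed(x):
            return int(round(x * (1 << FRAC)))

        def to_float(f):
            return f / (1 << FRAC)

        def fx_mul(a, b):
            # the raw product carries 2*FRAC fractional bits (the
            # higher-precision intermediate); shifting right truncates
            # it back to FRAC bits before the value is stored
            wide = a * b
            return wide >> FRAC

        a, b = to_fixed(1.5), to_fixed(-0.75)
        print(to_float(fx_mul(a, b)))   # -1.125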

    These designs were developed to perform calculations for a one-dimensional CA model,

    with two degrees of freedom for each cell. The arithmetic for higher dimensional problems can

    be performed without significant changes to the structure of the PE. The main difference

    in higher dimensional problems is the change in the communication pattern. In the one-dimensional models considered, each cell only needs to communicate with its immediate right and left neighbors. In the case of a two-dimensional problem, cells often need to

    communicate data with four to eight neighboring cells.

    3.1.1 System Hardware

    The concepts presented in this thesis for solving CA models on configurable computers can

    be applied to many hardware configurations; however, both designs were developed with a

    particular system in mind. The system uses a host PC connected over a PCI bus to a card

    containing five FPGAs (see Figure 3.1). The FPGAs are all Xilinx Virtex II - XC2V4000

    chips [?]. There are several features which make the Virtex II desirable for simulating CA

    models.

    The first advantage of the Virtex II is the large amount of internal RAM distributed

    throughout the chip. These internal BlockRAMs have customizable width and depth. They

    also have two ports that can independently read and write to different addresses.

    Figure 3.1: Setup of the configurable computer used for simulating the CA model.

    Transferring data on and off chip is an expensive operation and is typically the bottleneck in most

    applications. By utilizing these memories, we avoid having to transfer data to external banks

    of RAM.

    The second advantage of the Virtex II is the built-in multiplication units. In the sea-of-gates model for FPGAs, implementing multipliers is expensive. This is especially true if the precision is large because the size of a multiplier grows with the square of the number of bits

    of precision. In the Virtex II, there is a built-in multiplier associated with each BlockRAM.

    This lends itself to the distributed processor models we used.

    3.1.2 Distributed Layout

    FPGAs are designed to be as flexible as possible so they can be used in many applications,

    but this flexibility comes at a cost in terms of space and speed for any arithmetic unit when

    compared to custom VLSI. Chips such as general purpose processors have custom designed

    arithmetic units that have a significant advantage in executing sequential operations. The


    reason an FPGA has the potential for speed up versus a general purpose processor is that it

    can perform many operations in parallel or deeply pipeline the operations.

    In order to maximize the ability of an FPGA to perform operations in parallel, as much

    of the reconfigurable resources as possible should be in use at the same time. To accomplish

    this objective, both designs use many uniform, simple PEs operating in parallel. Each PE

    is responsible for calculating the next value for a section of cells in the CA model (see

    Figure 3.2).

    Figure 3.2: Distribution of logical CA cells among PEs.

    This distribution is simplified because each cell in the CA model is governed by the

    same equations. The arithmetic units in each PE implement the governing equation; each

    logical cell is represented by the data values that are inserted into the equation. There is a

    BlockRAM associated with each processing unit that stores the set of data values for each

    cell. The number of cells represented by a PE is determined by the number of logical cell

    data sets that can be stored in the BlockRAM.

    This concept of having multiple cells per PE greatly increases the number of logical cells

    that can be represented in a design. A certain amount of chip resources is needed to calculate

    a cell update. If only one cell were represented in each PE, then the PE could be slightly smaller and a BlockRAM would not be needed. However, the resources required are not greatly increased by moving from a PE that calculates the update for one cell to one that handles many. There are

    enough BlockRAMs on the Virtex-II so that the number of BlockRAMs does not limit the

    number of PEs that can fit on the chip.

    During a single iteration, all of the logical cells contained within a processing unit are

    updated once. The update for a cell depends on its right and left neighbors. To calculate

    the update for cells on the edge of the section of logical cells a PE represents, the PE needs

    data from the PEs to its right and left. At the end of an iteration, each PE transfers the

    data from its leftmost cell to the PE representing cells to the left. Likewise, the data from

    the rightmost cell must be transferred to the PE representing the cells to the right.

    After this transfer, each processing unit has all of the information needed to compute the

    next update for all of the cells it represents. Calculations for all cells can start simultaneously

    because the necessary information about all cells is known at the beginning of the iteration.

    In both designs, registers are placed between arithmetic units. If an arithmetic unit required

    more than one cycle to complete, a pipelined version of the component was used. This

    pipelining allows multiple cell updates to be computed concurrently, because cells do not

    need to wait until the previous cell has completely finished processing.
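    In software terms, the iteration scheme can be sketched as follows (an illustration of the data movement, not the hardware): each PE owns a slice of the logical cells, and the edge values exchanged at the end of the previous iteration serve as that PE's view of its neighbors:

        def iterate(pes, update, left_bc=0.0, right_bc=0.0):
            # pes is a list of PEs, each a list of its logical cells' values;
            # update maps (left, center, right) to a cell's next value.
            # edge registers: each PE sees its neighbors' boundary cells
            lefts = [left_bc] + [pe[-1] for pe in pes[:-1]]
            rights = [pe[0] for pe in pes[1:]] + [right_bc]
            new = []
            for p, pe in enumerate(pes):
                nxt = []
                for i, v in enumerate(pe):
                    l = pe[i - 1] if i > 0 else lefts[p]
                    r = pe[i + 1] if i < len(pe) - 1 else rights[p]
                    nxt.append(update(l, v, r))
                new.append(nxt)
            return new

        # e.g. a simple averaging rule over 3 PEs of 4 cells each
        pes = [[0.0] * 4, [1.0] * 4, [0.0] * 4]
        pes = iterate(pes, lambda l, c, r: (l + r) / 2)

    Because every PE already holds its neighbors' edge values when an iteration starts, all cell updates can proceed in parallel, which is what the pipelined hardware exploits.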

    3.2 Problem Specific Design

    The original direction of this project was to develop a toolset that could rapidly produce

    custom hardware models based on specific problems. A designer who wanted to use the tools

    would specify the problem in a custom programming language. A compiler would interpret

    the input code and produce a custom FPGA configuration to solve the problem. It was

    expected that a toolset could be developed for creating custom bitstreams rapidly enough

    to make the system useful.

    The first step in this development process was to analyze typical CA analysis equations


    and manually create an optimized layout. The equations used are based on an analysis

    problem with two degrees of freedom, v and Θ.

    \bar{v}_c = C_0 (v_l + v_r) + C_1 (\Theta_l - \Theta_r) + F_c

    \bar{\Theta}_c = C_2 (v_l - v_r) + C_3 (\Theta_l + \Theta_r) + M_c \qquad (3.1)

    The variable vc represents the v value for the current cell being processed. The variables

    vl and vr are the v values for the current cell's left and right neighbors. F represents an

    external force. v̄c is the value of vc at the next time step. These equations can be used to

    solve a one-dimensional CA analysis problem, such as deflection of a uniform beam. This

    form of the equations was chosen because it can be mapped to a small, linear circuit. The

    main goal was to minimize the number of multiplications needed because multiplication units

    are costly in terms of space on the FPGA.
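    In software form, the update in Equations 3.1 is just a handful of multiply-adds per cell. The sketch below (with illustrative coefficient values, not those of the actual beam model) shows the arithmetic the optimized circuit implements with fixed-point constant multipliers:

        # illustrative coefficients; the real values come from the beam model
        C0, C1, C2, C3 = 0.5, 0.25, -0.25, 0.5

        def cell_update(vl, tl, vr, tr, Fc, Mc):
            # next-state (v, Theta) for one cell from its neighbors'
            # current values and the external force/moment, per Eq. 3.1
            v_next = C0 * (vl + vr) + C1 * (tl - tr) + Fc
            t_next = C2 * (vl - vr) + C3 * (tl + tr) + Mc
            return v_next, t_next

        print(cell_update(0.1, 0.0, 0.3, 0.05, 0.01, 0.0))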

    An optimized design was built to solve these equations. For each operation a variety of

    components was considered, and multiple layouts were investigated in the implementation

    of the equations. Maximum clock frequency, latency, and size were examined when selecting

    each component. To further optimize the circuit, vc and Θc are computed simultaneously, because they are independent. Figure 3.3 shows the final optimized design.

    The outputs of all the components shown in Figure 3.3 are registered. Additionally, the

    constant multipliers are pipelined. The resulting latency through the circuit is 6 clock cycles.

    The circuit is designed such that all information to compute the update value is provided

    at the point at which it is needed. In particular, the Fc and Mc values are loaded 5 clock

    cycles after the corresponding Θ and v values. In this design, when this pipeline is filled the

    circuit can produce an updated value every clock cycle.

    The constant multipliers were used because they had a much lower latency and were

    much smaller than traditional multipliers. Using constant multipliers is only possible if

    the coefficients in Equations 3.1 are fixed. In the case of analyzing the deflection of a

    uniform beam, these coefficients are constant.

    Figure 3.3: Arithmetic unit for Problem Specific design.

    These multipliers have the characteristic of having a structure independent of the constant multiplicand. Therefore, if the location in the bitstream of the constant multiplier is known, the values in the FPGA look-up tables (LUTs) could be modified directly to reflect changes in the coefficient.
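    The saving offered by a constant multiplier can be seen in a simple software analogue (a sketch, not the actual LUT structure): multiplying by a known constant reduces to a fixed pattern of shifts and adds, one term per set bit of the constant:

        def const_mul(x, c):
            # multiply x by the fixed constant c using only shifts and
            # adds -- the pattern is fixed once the constant is known
            acc, shift = 0, 0
            while c:
                if c & 1:
                    acc += x << shift
                c >>= 1
                shift += 1
            return acc

        print(const_mul(37, 11), 37 * 11)   # both print 407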

    The disadvantage of using constant multipliers is that design optimizations made to the

    density of the beam would require that a different type of multiplier be used. Also, if the

    beam was not uniform, the value of vc and Θc would be needed to compute updated values,

    v̄c and Θ̄c. This narrows the usefulness of this design, but it provides an optimized baseline

    for comparing other designs.

    Each PE in the Problem Specific design contains arithmetic logic, a finite state machine

    (FSM), and a BlockRAM. The BlockRAM contains all of the values for the cells. The FSM

    controls the addresses from which data is loaded and stored in the BlockRAM. The PE

    operates most efficiently when the pipeline is filled. When the pipeline is filled, a new set of

    data needs to be applied each clock cycle, and updated values need to be stored each clock

    cycle. To accommodate this flow of data, one port of the BlockRAM is devoted to loading

    data and the other is devoted to storing data (see Figure 3.4).


    Figure 3.4: PE layout for Problem Specific design.

    The Edge Registers are used to communicate data to neighboring PEs. When the update

    for a cell on either end of the section of the model for which the PE is responsible is calculated,

    the new value is stored in the Edge Registers. Each PE has access to these registers in its

    right and left neighbor. When data is needed from a neighbor, the values are loaded from

    the Edge Registers instead of from the BlockRAM. For PEs on the boundary of the model,

    the Edge Registers are connected to constant values.

    To implement design optimization, FPGA configuration bitstreams need to be produced

    for both analysis and design phases. The FPGA would first be loaded with the analysis

    design and the configuration would be iterated until the data values converged. After the

    cells converge, the data needed for the design improvement phase is stored in the internal

    BlockRAMs. The FPGA would next perform a partial reconfiguration and load the design

    improvement bitstream, during which the contents of the BlockRAMs would not be changed.

    In this way, data would be passed between the analysis and design phases.

    The results would be extracted from the board through readback. During the readback


    operation, the FPGA dumps its entire configuration including flip-flops and BlockRAM

    contents. Once the contents of the FPGA are dumped, careful filtering of the data would

    yield the current results. This method negates the need for using specialized hardware to

    support downloading data.

    The residual-update method, described in the Background chapter, can be used in finding

    the solution for a CA model because it is an iterative improvement problem. The advantage

    to using this method would be that low precision calculations can be used to generate a high

    precision result. The reconfiguration between analysis and design phases would provide the

    opportunity needed for loading updated coefficients to the FPGAs. The result of implement-

    ing the residual-update method would be that an 8-bit design could produce results with

    precisions such as 16 or 32 bits.

    3.3 Program Based Design

    The Program Based design represents a fundamentally different approach to solving the same

    analysis problem as the Problem Specific design. The Problem Specific design can perform

    analysis updates very rapidly because it uses custom hardware. However, using a custom

    design means that for each new problem an optimized circuit must be designed, and an FPGA

    configuration must be generated. The overhead of building a custom configuration for each

    problem could easily erase any speed advantage. On the opposite end of the spectrum, a

    compiled program running on a general purpose CPU has very low initial overhead, but it

    cannot take advantage of the inherent parallelism of CA. The Program Based design was

    developed to bridge the gap between the analysis speed of custom hardware and the flexibility

    of a general purpose processor.

    The first major change, compared to the Problem Specific model, is that the Program

    Based design executes a program stored in internal BlockRAM to control data accesses

    and the arithmetic units in the PEs. In the Problem Specific model these operations were


    performed using a fixed finite state machine. Another significant change is that the control

    logic is removed from the PEs and placed in a central control unit. The signals are then

    propagated to the PEs throughout the chip. The third important modification is that the

    equations are represented in a matrix form to provide a more flexible architecture that

    can handle a variety of problems. This matrix arithmetic is expressed in the layout of

    the arithmetic units. The last major difference is that the Program Based model has the

    capability to compute results in both high precision and low precision forms on the FPGA

    and then combine the two results.

    The goal of flexibility for the Program Based design is reflected in the form of the equations

    for the model. The hardware is designed to solve problems set up in matrix form. This

    provides a simpler method to implement, and eventually automate, CA design algorithms.

    The matrix form of the beam equations is shown in the following equations:

    \begin{bmatrix} \bar{v}_c \\ \bar{\Theta}_c \end{bmatrix} =
    \begin{bmatrix} C_0 & C_1 \\ C_2 & C_3 \end{bmatrix}
    \begin{bmatrix} v_l \\ \Theta_l \end{bmatrix} +
    \begin{bmatrix} K_0 & K_1 \\ K_2 & K_3 \end{bmatrix}
    \begin{bmatrix} v_c \\ \Theta_c \end{bmatrix} +
    \begin{bmatrix} C_4 & C_5 \\ C_6 & C_7 \end{bmatrix}
    \begin{bmatrix} v_r \\ \Theta_r \end{bmatrix} +
    \begin{bmatrix} F_c \\ M_c \end{bmatrix} \qquad (3.2)

    This equation solves the same analysis problem as the Problem Specific design. This is

    one of a range of two-dimensional problems that can be solved by the Program Based design.

    Equations can be implemented with any number of terms and are expressed in matrix form.

    The Problem Specific model only solves problems that can be represented in the form of the

    beam equations, while the Program Based model has the capability to capture the behavior

    of a variety of problems.

    The complexity of control logic increased greatly in the Program Based model as compared

    to the Problem Specific model. The finite state machines that controlled the load and store

    logic of the Problem Specific model are ill-equipped to handle the increase in complexity.

    The control logic in the Program Based model uses significantly more resources, so it is

    advantageous to move the control logic to a centralized location. There is a penalty involved

    in distributing the control signals; however, the size of each PE would more than double if


    the control logic was not centralized.

    The architecture of having a single control unit makes the Program Based design similar to

    a Single Instruction, Multiple Data (SIMD) parallel computer (see Figure 3.5). Removing the

    control from the individual PEs is possible because all cells, including boundary cells, can be

    represented by changing the coefficients in the matrix equation. Historically, there has been

    a lack of widespread interest in SIMD parallel computers because they are inflexible and

    require custom processors. However, SIMD machines have been successful in multimedia

    and DSP applications [?]. These applications involve repetitive calculations that can be

    performed in parallel, similar to those needed for CA models.

    Figure 3.5: Return data chains for Program Based design.

    The Control Unit (CU) requires feedback from the PEs, for example a flag indicating

    that calculations are complete. The routing resources around the CU would be consumed

    quickly because there are a large number of PEs that need to communicate with the CU. To

    avoid this problem, there are multiple PEs on each return data bus so only the last PE in

    the chain needs to be routed directly to the CU. The drawback is that extra computational

    cycles are needed. This is because the returning data takes an extra clock cycle to propagate

    back to the CU for every link.


    Instructions stored in the CU are not like those of a traditional microprocessor. The

    instructions for a traditional general purpose processor are encoded, while the instructions

    in this design are stored as a 72-bit word that requires no decoding. The result is that

    most control signals can be connected directly from the memory in the CU to the PEs (see

    Figure 3.6). This method of storing instructions has the advantages of being both fast and

    allowing any combination of control signals to achieve maximum parallelism.

    Figure 3.6: Control Unit for Program Based design.

    The instructions contain two main parts: the flow control logic and the control signals. The

    flow control portion interacts with the flow control logic in the CU to determine which

    instruction is executed next. The flow control logic allows for increments to the program

    counter, branches, and conditional branches. The control signals manage operations in the

    PEs. These include: clearing registers, loading data, and shifting data. The signals for

    controlling the BlockRAMs in the PEs are fed through address logic to allow absolute and

    relative address jumps.
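    The undecoded format can be pictured as a wide word in which every control signal owns a fixed bit position. The Python sketch below uses hypothetical field positions for illustration; the real layout is captured in the spreadsheet described next:

        # hypothetical layout of the 72-bit control word
        FIELDS = {
            "next_addr":   (0, 12),   # flow control: branch target
            "branch_en":   (12, 1),
            "cond_branch": (13, 1),
            "clear_regs":  (16, 1),   # control signals for the PEs
            "load_data":   (17, 1),
            "shift_data":  (18, 1),
            "bram_addr":   (20, 10),
        }

        def encode(**signals):
            # pack named signals into one word; every other bit is 0.
            # no decoding is needed -- each bit drives a control line
            word = 0
            for name, value in signals.items():
                pos, width = FIELDS[name]
                assert 0 <= value < (1 << width)
                word |= value << pos
            return word

        print(f"{encode(load_data=1, bram_addr=5):072b}")

    Because any combination of bits can be set in a single word, independent operations in the PEs can be triggered in the same cycle, which is the parallelism advantage noted above.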


    Though there is a plan to automate the process of writing the programs loaded into the

    control unit, the first programs were written manually. A spreadsheet was used to select the

    values of every control signal at each time step. The spreadsheet was set up to automatically

    insert the signal values into the proper bit position (see Appendix A for example). The integer

    equivalent of the binary number is loaded into the control unit memory at compilation time.

    In the current design, the program cannot be changed at run time.

    To understand the reasoning behind the arithmetic logic in the PEs, it is necessary to

    understand the process for using a residual and an update to calculate results. It is possible

    to use a residual-update method, described in Chapter 2, to find the solution because the CA

    solutions are attained by iterative improvement. This method has the advantage of using

    low precision arithmetic for most calculations. In describing this method, n is the number

    of bits used in high precision calculations and k is the number of bits used in low precision

    calculations.

    This method works by first calculating the residual, or error in the equation that is being

    solved. The residual calculation must be performed in n bits for every cell in the model. The

    most significant k bits of the residuals are then extracted and stored. The k bits must be

    taken from the same position in every residual. The largest element in the residual vector

    dictates which bits are selected. The update equation is then calculated in k bits, and the

    k-bit version of the residual is used in place of the Fc and Mc in Equation 3.2. This k-bit

    update is performed until the results converge. After the k-bit updates are found, they are

    added into an accumulated version of the variables at the same offset as the bits that were

    taken out of the residual. The cycle repeats using the accumulated version of the variables

    in the residual equation. These iterations are repeated until the accumulated versions of the

    variables converge.
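    One outer cycle of this scheme can be modeled as in the sketch below (hypothetical precisions and example residual values; integers stand in for the hardware registers). The key points are that the shift amount is shared by every residual and that the converged update is added back at the same offset:

        N_BITS, K_BITS = 32, 8   # assumed n and k for illustration

        def extract_k_bits(residuals, k=K_BITS):
            # take the same k-bit window from every n-bit residual;
            # the window position is set by the most significant '1'
            # among all of the residuals
            msb = max(abs(r) for r in residuals).bit_length()
            shift = max(msb - k, 0)
            return [r >> shift for r in residuals], shift

        residuals = [4096, -300, 72]        # example n-bit residuals
        small, shift = extract_k_bits(residuals)
        # ... the k-bit update iterations would run here on `small` ...
        updates = small                      # stand-in for converged updates
        accum = [a + (u << shift) for a, u in zip([0, 0, 0], updates)]
        print(small, shift, accum)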

    The flow chart in Figure 3.7 shows an analysis cycle using this method. This method

    is effective because the majority of the time is spent in calculating the update in k bits.

    More parallel arithmetic units can be used to speed up the calculations because the update calculation is performed in k bits.

    Figure 3.7: Analysis cycle flow and precision for each operation.

    There are three main parts to the PEs used in the Program Based design (as shown in

    Figure 3.8):

    - Multiply Accumulator: calculates the residual in n bits

    - Shift Unit: extracts k bits from n-bit residual and adds the update into accumulated

    variables

    - Matrix Accumulator: calculates cell updates in k bits

    The Multiply Accumulator is simply a multiplication unit and an adder with the registered

    version of its output connected to one of its inputs (see Figure 3.9). There is only one

    multiply accumulator per PE because it uses n-bit arithmetic, and these n-bit precision

    units are large. The Multiply Accumulator takes advantage of the built-in 18x18 multiplier

    units on the Virtex-II FPGAs to save resources.

    The minimization of the hardware results in multiple clock cycles being needed to compute residual values. The latency through each unit is one clock cycle, so the pipeline is two stages.

    Figure 3.8: Computational unit for Program Based design.

    Figure 3.9: Multiply accumulator used in computational unit.

    For the equation proposed in the beginning of this section, it takes 16 clock cycles to calculate

    the residuals for one cell. The expense of the residual calculation is tolerable because many

    update calculations are performed between residual calculations.

    After the residual is calculated it must be converted to a k-bit number. During the

    residual calculation, the most significant bit of the largest residual value is found. There is

    a mechanism in each PE that stores the absolute value of the largest residual calculated.

    This value is passed along the return data chain until it arrives at the control unit. Each

    PE performs a logical OR on the value passed to it and the largest value it has calculated.


    This process destroys the actual value of the largest residual, but the number passed to the

    control unit shows the position of the most significant ‘1’. This position is used to determine

    which bits of the residual are stored for the update phase.

    Figure 3.10: MSB data return chain, used for determining the most significant ‘1’ of residuals.
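    The effect of the chain is easy to model in software (a sketch): OR-ing the per-PE maxima destroys their exact values but preserves the position of the highest set bit, which is the only information the control unit needs:

        def msb_position(per_pe_maxima):
            # each link ORs in its PE's largest absolute residual;
            # only the top '1' position survives the chain
            combined = 0
            for m in per_pe_maxima:
                combined |= abs(m)
            return combined.bit_length() - 1

        print(msb_position([0b1010, 0b0111, 0b0001]))   # 3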

    The logic to extract the bits is based on a multiplexer, a register, and a right shifter. The

    multiplexer selects between an input from memory or a right shift version of the value stored

    in the register. The value output from the multiplexer is loaded into the register. During the

    bit shifting phase, the register is first loaded with the n-bit value. The right shifted value

    is then selected from the multiplexer. The value is looped through the right shifter until the

    desired bits are in the lowest position. The number of clock cycles required is dependent on

    the number of bit positions the value needs to be shifted.

    Figure 3.11: Unit for shifting the precision of intermediate results.


    The adder, after the shift logic, is used during the addition phase at the end of the outer

    analysis cycle. During the addition phase, the update is loaded into the highest bits and

    shifted to the correct position. It is then added to the previous value, which is read from

    memory. A signal from the control unit selects which value is output from the unit.

    The final piece of the PE for the Program Based design is the Matrix Accumulator. The

    Matrix Accumulator is similar to the Multiply Accumulator unit, except the arithmetic is

    performed in k bits and more hardware is used to speed up calculations. The unit is designed

    specifically to be able to multiply a 2x2 matrix by a 2x1 matrix. For example, Figure 3.12

    shows the circuit calculating the equation:

    \begin{bmatrix} \bar{v} \\ \bar{\Theta} \end{bmatrix} =
    \begin{bmatrix} C_0 & C_1 \\ C_2 & C_3 \end{bmatrix}
    \begin{bmatrix} v \\ \Theta \end{bmatrix} \qquad (3.3)

    Figure 3.12: Matrix accumulator used for analysis updates.

    The multiplier has a three clock cycle latency and is fully pipelined. The entire unit has a latency of five clock cycles. The update for each cell using the matrix version of the beam

    equations, described earlier in this section, takes 9 clock cycles. However, when the pipeline


    is filled, the circuit can produce an update every five clock cycles, and this circuit calculates

    the update for both analysis variables simultaneously.
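    Functionally, each term handled by the Matrix Accumulator is a 2x2-by-2x1 product, and an update is a sum of such terms. In software terms (a sketch of the arithmetic, not the pipeline):

        def mat2_vec2(C, x):
            # 2x2 matrix times 2x1 vector, the operation performed in
            # k-bit arithmetic for each term of the update equation
            return [C[0][0] * x[0] + C[0][1] * x[1],
                    C[1][0] * x[0] + C[1][1] * x[1]]

        def accumulate(terms):
            # sum several matrix-vector terms, e.g. those of Eq. 3.2
            total = [0, 0]
            for C, x in terms:
                p = mat2_vec2(C, x)
                total = [total[0] + p[0], total[1] + p[1]]
            return total

        print(mat2_vec2([[1, 2], [3, 4]], [5, 6]))   # [17, 39]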

    Every PE can select to have the input to Port B of the memory connected directly to the

    output of Port A of its left or right neighbor. In this way, PEs transfer data about the cells

    on the edge of the section of the CA model for which each PE is responsible. This system is also

    used to upload and download data from the FPGA. The PE that calculates the values for

    the cells on the left end of the model can read data from the PCI bus, while the PE that

    calculates the values for cells on the right end of the model can write data to the PCI bus.

Figure 3.13: Data flow for uploading and downloading data to the FPGAs.

To upload coefficients and external forces, as well as to initialize variable values, the host

computer begins by writing the data intended for the rightmost PE into the memory of the

leftmost PE. The data is then shifted through all the PEs until it reaches its proper place,

while new data is shifted into the leftmost PE. Downloading the results is a similar process:

the data is shifted right and read off the rightmost PE. An external clock is

    used to keep data transfers synchronized.
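The upload path therefore behaves like a shift register of PE memories. The following C++ sketch models that behavior (the container names are illustrative; in hardware the shifting happens through the BlockRAM ports):

    #include <cstdint>
    #include <deque>
    #include <vector>

    // Sketch of the upload path: the word destined for the rightmost PE
    // enters first at the leftmost PE, and every transfer step shifts each
    // PE's staged word one position to the right.
    std::vector<uint32_t> uploadThroughChain(std::deque<uint32_t> hostWords, int numPEs) {
        std::vector<uint32_t> peMemory(numPEs, 0);
        for (int step = 0; step < numPEs; ++step) {
            for (int pe = numPEs - 1; pe > 0; --pe)
                peMemory[pe] = peMemory[pe - 1]; // Port B reads the left neighbor's Port A
            peMemory[0] = hostWords.front();     // leftmost PE reads from the PCI bus
            hostWords.pop_front();
        }
        return peMemory; // the first word pushed has shifted all the way to the rightmost PE
    }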

    Although reconfiguration is not part of the analysis cycle, the implementation of the

    system for performing design optimization will use reconfiguration in a number of ways.

    Each analysis model has fixed connections for communicating among PEs. It is possible

    to pass data through intermediate PEs to transfer data between PEs that are not directly

    connected. However, to achieve maximum efficiency, communication should be done over

    direct connections when possible. The design system will have a number of different analysis

models, each with a different communication pattern. When the user specifies the initial

problem, the system will select the bitstream for the most appropriate model and load it into

    the FPGAs.

    Design optimization may be performed in a number of different ways. The first possible

    technique is to use reconfiguration. A bitstream developed to perform design optimization

    could be loaded on the FPGAs using partial reconfiguration. The data would be passed

    between analysis and design models through the BlockRAMs, like the method proposed

    for the Problem Specific model. Another technique would be to use the analysis model

    to perform the design calculations. Design would require new coefficients, which could be

loaded into the FPGA using the uploading and downloading mechanisms described earlier. The

    disadvantage of this method is that the analysis design might not be capable of performing

    all of the operations needed, or the operations may be very inefficient. The final possibility

    is to use a Virtex-II Pro FPGA, which contains built-in PowerPC processors. These internal

    processors could be used to run a program to calculate the new design values.

Chapter 4

    Results

    4.1 Problem Formulation

The results in this section are based on the analysis of a CA model of a one-dimensional

beam. The model is formulated from work by researchers at Virginia Tech

[?]. The beam is divided into cells that have two degrees of freedom, vertical displacement

(w) and rotation (θ). Each cell also has a separate vertical thickness, which is the design

    variable. The thickness of the beam is specified at the middle of each cell, and then linearly

    interpolated in between the specified points (see Figure 4.1). Cells in the model are evenly

    distributed along the beam.

    There are a number of possible configurations for each cell. The cell can have a fixed

    displacement, a fixed rotation, a fixed displacement and rotation, or it can be free in dis-

    placement and rotation. External forces can be applied to any cell. The forces can be in

    the form of a vertical force (F) or a bending moment (M). These different configurations are

    represented by changing the coefficients in the equation that is solved by each model. Using

    these available cell configurations, many classical static beam problems can be solved.

Figure 4.1: Diagram of the CA model for performing analysis on a beam.

Figure 4.2: Beam analysis problem modeled on the configurable computer.

The CA model, shown in Figure 4.2, contained 20 cells and was run on both the

Problem Specific and Program Based designs. Twenty cells were chosen so the model could be

    quickly simulated. The first cell in the model is a dummy cell, for which no computations

    are performed. The cells (1 and 19) on the ends of the beam have fixed displacement and

    rotation. Cell 14 has a fixed vertical displacement. All other cells in the model are free

    in displacement and rotation. There is a vertical force pushing up on cell 9. The force is

    scaled to produce a maximum displacement of slightly less than 127, so the result can be

    represented in 8 bits.
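For illustration, the configuration of this problem could be captured as follows. This is a hypothetical encoding used only to restate the setup; in the actual system, the boundary conditions and forces are expressed through the equation coefficients loaded for each cell:

    #include <array>

    // Hypothetical encoding of the 20-cell beam problem of Figure 4.2.
    // In the real system these conditions are expressed through equation
    // coefficients; explicit flags are used here only to restate the setup.
    struct Cell {
        bool fixedDisplacement = false;
        bool fixedRotation     = false;
        double verticalForce   = 0.0; // F, scaled so max displacement stays below 127
    };

    std::array<Cell, 20> makeBeamProblem(double scaledForce) {
        std::array<Cell, 20> cells{};           // cell 0 is the dummy cell
        cells[1]  = {true, true, 0.0};          // left end: displacement and rotation fixed
        cells[19] = {true, true, 0.0};          // right end: displacement and rotation fixed
        cells[14].fixedDisplacement = true;     // fixed vertical displacement at cell 14
        cells[9].verticalForce = scaledForce;   // upward vertical force at cell 9
        return cells;
    }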


    4.2 Problem Specific Design Results

    The designs presented in the implementation section were intentionally developed indepen-

dently of any fixed precision for the results. There is a trade-off between the number of bits of

    the solution that will be calculated and the number of cells that can be represented in the

    system. In addition, larger precision results in lower maximum clock frequency and/or an

    increased pipeline length.

    Another factor in changing the precision is memory access. The BlockRAMs have pro-

    grammable port widths that can accommodate some changes in precision. The BlockRAMs’

two ports can each handle up to 36 bits and can independently read or write. Once this

transfer limit is exceeded, accessing the needed data takes multiple clock cycles,

or more memories must be used in the design.
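As a rough illustration of this limit (a sketch that ignores the parity-bit packing granularity of the Virtex-II BlockRAM port widths), the number of packed values available per port per clock falls as precision grows:

    // Values readable from one BlockRAM port per clock, given the 36-bit
    // port limit quoted above (a rough sketch that ignores parity-bit
    // packing granularity on the Virtex-II).
    constexpr int valuesPerAccess(int portWidthBits, int precisionBits) {
        return portWidthBits / precisionBits;
    }
    static_assert(valuesPerAccess(36, 8)  == 4, "8-bit values: four per access");
    static_assert(valuesPerAccess(36, 18) == 2, "18-bit values: two per access");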

    During development, multiple versions of the Problem Specific design were built that used

    different precision for calculations. Figure 4.3 shows the growth of the size of a PE as the

    precision of the calculations is increased. The graph shows that size grows rapidly as the

    number of bits is increased. This makes it very important to use the least precision possible

    for calculations.

    From this data, and based on the beam problem being solved, 8 bits of precision was

chosen as likely to be the most effective. In most of the following analysis, an 8-bit model

was studied for the Problem Specific design. However, this precision could change based on

    the type of problem being solved and the number of cells in the model. In this respect,

    the Problem Specific design would have more flexibility with regard to precision than the

    Program Based design because the Problem Specific design is custom-made for each problem.

Once 8 bits was selected for the precision, the number of PEs needed to be determined. The

    maximum number of PEs that can fit on an FPGA is limited by the programmable logic

and routing resources on the chip. However, when chip usage gets high, the routing becomes

    inefficient and the maximum frequency at which the circuit can be clocked drops rapidly.


Figure 4.3: Precision of PE vs. percent utilization of FPGA (based on Virtex-II 4000) for the Problem Specific design.


Figure 4.4: Percent utilization of FPGA (based on Virtex-II 4000) and maximum clock frequency (MHz) vs. number of PEs for the Problem Specific design.

    Figure 4.4 shows the chip utilization and the maximum clock frequency versus the number of

    PEs. The number of logical cells that can be represented increases linearly with the number

of PEs. However, the maximum clock frequency decreases gradually as the number of PEs

    increases, then drops quickly after the 35th PE.

    The Problem Specific and Program Based designs vary widely in the number of cells they

    can represent and the precision of the result. In order to compare these differing designs,

    the maximum number of cell updates per second is used as a metric. This is also used as the

    metric to determine the speed-up over a program running on a general purpose processor.

    The number of cell updates per second for the Problem Specific design is simply the


Figure 4.5: Cell updates per second (millions) vs. number of PEs for the Problem Specific design.

    number of PEs multiplied by the maximum clock frequency. This is because in the Problem

    Specific implementation, each PE produces a result every clock cycle during analysis. Fig-

    ure 4.5 shows a peak in the maximum number of cell updates when 35 PEs are on the chip.

    With 35 PEs the 8-bit design has a maximum clock frequency of 64.5 MHz. In comparison,

    a 12-bit model with 35 PEs cannot fit on the Virtex-II 4000 FPGA.
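As a worked example of this metric, 35 PEs each producing one update per clock at roughly 65 MHz give approximately 35 × 65 × 10^6 ≈ 2.3 billion cell updates per second, which corresponds to the peak visible in Figure 4.5.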

    There are additional factors that limit the cell updates per second that can actually be

    performed on the system (see Table 4.1). The most costly factors are the reconfiguration

    and readback times on the FPGA and the time it takes the host to compute the coefficients

for a given design. These times are important because the communication with the host is done through reconfiguration and readback.

                              Time (ms)
Operation             1 PE    1 FPGA    DINI Board
Reconfiguration       N/A     1190      4760
Readback              (not yet implemented in the DINI API)
Host computations     1.11    39        156

Table 4.1: Times for operations associated with the Problem Specific analysis cycle on the DINI board.

    These results are dependent on the design being able to accurately produce analysis

    results. The problem described earlier in this chapter in the Problem Formulation section

    (see Figure 4.2) was modeled on the Problem Specific design. The force was scaled so the

result could be represented in 8 bits. The actual results were calculated using a

    C++ program running on a PC which used floating-point arithmetic for all calculations. The

    results show (see Figure 4.6) that the system was able to produce results that were similar,

but not exactly correct. The mean error between the actual results and the results

attained from the Problem Specific model was 38.4% for displacement and 41.4% for

rotation. This large error is due to the rounding that occurs because fixed-point

arithmetic is used. If better accuracy were needed, the results could be improved by

using reconfiguration and the residual-update method for iterative improvement.
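The accuracy comparison amounts to a mean relative error between the fixed-point results read back from the FPGA and the floating-point reference. A minimal C++ sketch (the names are illustrative, and skipping zero-valued reference cells is an assumption of this sketch):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Sketch of the accuracy comparison: mean relative error between the
    // fixed-point results read back from the FPGA and the floating-point
    // reference computed on the host.
    double meanPercentError(const std::vector<double>& reference,
                            const std::vector<double>& fpgaResult) {
        double sum = 0.0;
        std::size_t n = 0;
        for (std::size_t i = 0; i < reference.size(); ++i) {
            if (reference[i] == 0.0) continue; // skip fixed (zero) cells
            sum += std::fabs((fpgaResult[i] - reference[i]) / reference[i]);
            ++n;
        }
        return n ? 100.0 * sum / n : 0.0;      // mean error as a percentage
    }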

    4.3 Program Based Design Results

The Program Based design has much the same sensitivity to precision as the Problem Specific

design, but the Program Based design does not have quite as much flexibility in terms of

    precision. Because the full precision calculations of the residual rely on the built-in 18x18

    multipliers, it is difficult to increase the full precision result to more than 18 bits. However,


Figure 4.6: Actual results and results from the Problem Specific design for the beam problem (displacement, mm, vs. position, m).


Figure 4.7: Precision of PE vs. percent utilization of FPGA (based on Virtex-II 4000) for the Program Based design.

    there is some flexibility in the precision of the update calculations. Figure 4.7 shows how

    the size of the PE grows as the precision of the update arithmetic is increased. The growth

    is similar to that of the Problem Specific model.

    Based on this data, 6 bits was selected for the precision of the update calculations. The

6-bit precision design attains the maximum cell updates per second with 60 PEs. The maximum

clock frequency is 94.8 MHz. If the same design used 8 bits of precision, the maximum clock

frequency would be 88.1 MHz.

The precision selected for the update is also a trade-off between having smaller update units

    and having to perform the outer iteration more often. When the precision of the update

    calculation is larger, more inner iterations are performed before new residuals need to be

    calculated. Using a smaller precision has the advantage of being able to devote more, smaller

    units to calculating the update. The number of clock cycles needed for each phase of the

analysis cycle is shown in Table 4.2.

Analysis Cycle Phase    Clock Cycles
Residual Calc           550
Shift Residual          330-990
Cell Update             190 × Inner Iterations
Add                     410-1135

Table 4.2: Clock cycles for the different phases of the residual-update analysis cycle.

    As the number of inner iterations increases during each analysis cycle, the Program Based

    model becomes more efficient. Figure 4.8 shows the increase in the efficiency of the design

    versus the number of inner update iterations for each residual calculation. For this graph,

the average number of cycles for the Shifting and Adding phases of the analysis cycle was

used. When the number of inner iterations is below 10, more than half the time is spent

calculating the residual or shifting the results. However, the model rapidly becomes more

    efficient. With 35 inner iterations per analysis cycle this design achieves 75% efficiency, and

    at 90 inner iterations the efficiency is 90%. The number of iterations needed will depend on

    the type of problem and the number of cells in the model.
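Using the cycle counts of Table 4.2 with mid-range values for the Shift and Add phases, the efficiency curve can be approximated as the fraction of cycles spent in the update phase. The sketch below reproduces the quoted figures to within a few percent; the small differences presumably come from overheads not broken out in Table 4.2:

    // Approximate efficiency of the residual-update cycle, using the
    // cycle counts of Table 4.2 with mid-range Shift (660) and Add (772.5)
    // costs. Efficiency is the fraction of cycles spent computing updates.
    double analysisEfficiency(int innerIterations) {
        const double residual = 550.0;
        const double shift    = (330.0 + 990.0) / 2.0;   // midpoint of 330-990
        const double add      = (410.0 + 1135.0) / 2.0;  // midpoint of 410-1135
        const double update   = 190.0 * innerIterations;
        return 100.0 * update / (residual + shift + update + add);
    }
    // analysisEfficiency(35) is roughly 77% and analysisEfficiency(90) is
    // roughly 90%, close to the figures quoted above.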

    The total cell updates per second that can be performed with the whole chip is dependent

    on the number of PEs on the FPGA. The maximum number of cell updates per second for the

    Program Based design is achieved slightly before the chip resources are exhausted because

    of routing congestion (see Figure 4.9). Figure 4.10 shows how the maximum number of cell

    updates per second rises then deteriorates as the number of PEs is increased.

    The Program Based model depends on communication with a host through the PCI bus

    in the current design. Before the calculation can begin, the coefficients for a specific design

    need to be loaded into each of the PEs. Then, after the analysis is complete, the results

need to be downloaded back to the host. The transfer times for these operations are shown

in Table 4.3. This communication is synchronized by an external clock supplied by the host.

There is additional overhead due to the computation of analysis coefficients on the host for

each design.

Figure 4.8: Efficiency (percent of time spent calculating updates) vs. number of inner iterations per analysis cycle.


Figure 4.9: Percent utilization of FPGA (based on Virtex-II 4000) and maximum clock frequency (MHz) vs. number of PEs for the Program Based design.


Figure 4.10: Cell updates per second (millions) vs. number of PEs for the Program Based design.


                                  Time (ms)
Operation                 1 PE     1 FPGA    DINI Board
Uploading Coefficients    2.10     114       228
Downloading Results       0.311    18.5      37
Host Computations         0.360    21.5      86.0

Table 4.3: Times for operations associated with analysis on the Program Based design.

    The problem shown in Figure 4.2 was modeled on the Program Based design. This was

the same model run on the Problem Specific design, except the force was scaled up so that

the maximum result was closer to an 18-bit number. The results were again compared to a

C++ simulation that used floating-point for all calculations. Figure 4.11 shows that the results

    attained from the Program Based model were very close to the actual results. The mean of

    the percent error between the Program Based model data and the actual results was 0.099%

for displacement and 0.118% for rotation.

    The results for the Program Based model were much more accurate than the results from

    the Problem Specific model because the Program Based model uses 18 bits of precision

    during the residual calculations. The speed of the Program Based model comes from the use

    of only 6 bits for the update calculations. The external force was scaled so the maximum

    displacement would be close to 18 bits. After they were computed, the results were scaled

    down to match the original problem.


Figure 4.11: Actual results and results from the Program Based design for the beam problem (displacement, mm, vs. position, m).



    4.4 Comparison of Designs

    Most of the results presented in the earlier sections were for systems using one FPGA.

    However, the DINI board intended for this system has 4 FPGAs. The total computing

    power increases linearly because all the FPGAs can run in parallel. Table 4.4 shows the

maximum number of updates for each design at each level of the system.

                      Maximum Cell Updates per Second (millions)
Design                1 PE     1 FPGA    DINI Board
Problem Specific      65.1     2279      9116
Program Based         18.9     1137      4548

Table 4.4: Maximum cell updates per second for both implementations.

    To compare these FPGA based designs to conventional methods, a C++ program was

    written to calculate the results. To make the comparison as fair as possible, the program

    uses integer arithmetic instead of slower floating-point arithmetic. Integer computations are

closer to the fixed-point math used by the FPGA designs. The program was compiled using

    GCC with optimization enabled and executed on a PC with a 1.7 GHz processor and 1 GB

of RAM running Debian Linux. Table 4.5 shows the speed-up attained by the FPGA designs

over the general purpose processor version.

Design                Maximum Cell Updates per Second (millions)    Speed-up
PC                    48.9                                          -
Problem Specific      9116                                          186.4
Program Based         4548                                          93.0

Table 4.5: Maximum cell updates per second and speed-up for both implementations compared to the PC.
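For reference, the comparison program's inner loop has the general shape of an integer update sweep over the cells. The sketch below is illustrative only: the coefficient layout and the 8-bit fixed-point scaling shift are assumptions, not the actual benchmark code:

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sketch of the PC comparison program's inner loop: an integer update
    // sweep over the cells, repeated until convergence. Boundary cells are
    // left at zero, standing in for fixed ends.
    struct CellState { int32_t w; int32_t theta; };

    void updateSweep(std::vector<CellState>& cells,
                     const std::vector<std::array<int32_t, 4>>& coeff) {
        std::vector<CellState> next(cells.size()); // value-initialized to zero
        for (std::size_t i = 1; i + 1 < cells.size(); ++i) {
            // Each new value is an integer combination of the neighbors'
            // states, rescaled by a fixed-point shift.
            next[i].w     = (coeff[i][0] * cells[i - 1].w
                           + coeff[i][1] * cells[i + 1].w) >> 8;
            next[i].theta = (coeff[i][2] * cells[i - 1].theta
                           + coeff[i][3] * cells[i + 1].theta) >> 8;
        }
        cells = next;
    }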

Chapter 5

    Conclusions

    5.1 Summary

    Cellular Automata (CA) theory has been studied for decades, with most of the work done

    on modeling natural systems. Recently, the use of CA theory has been extended to provide a

    system for structural analysis and design optimization. This work has proved to be success-

    ful, but the calculations are very slow on traditional general purpose processors. The parallel

    nature of these CA models makes them a good candidate for implementation on a reconfig-

    urable computer. The work presented in this thesis shows the initial steps toward making

    an automated tool to perform structural design optimization accelerated by a reconfigurable

    computer.

    The contribution of this thesis was to design and implement two models for the analysis

    phase of the CA structural design optimization cycle. Both designs take advantage of the

    parallel nature of cellular automata by using a distributed array of processing elements.

    For the Problem Specific implementation, these processing elements are customized to each

    problem. The Program Based implementation has a more flexible processing unit that

    is controlled by a program designed to simulate a specific cellular automata model. The


    Program Based implementation also has the built-in capability to use a residual-update

    method to accelerate calculations and improve accuracy.

    5.2 Results

    The results show the Problem Specific design and the Program Based design were able to

    generate cell updates at the rate of 9.12 and 4.55 billion per second, respectively. Though the

    Problem Specific design proved to be able to generate updates more rapidly, this increase

    in speed came at the expense of precision and flexibility. The Program Based model’s

    competitive speed, improved accuracy, and ability to handle a range of update rules make it

    the architecture that provides the most potential for an automated system.

Both hardware implementations of the CA model for structural analysis were very

successful in terms of performance. When compared to a 1.7 GHz Pentium 4 processor, the

    Problem Specific design proved to be 186 times faster. The Program Based design, which

    was slightly slower, was still 93 times faster than the general purpose processor version.

    These speed-ups are a step towards making a CA system for structural design optimization

    that significantly outperforms traditional methods.

    5.3 Future Work

    There are a number of interesting areas that need to be studied in order to design an

automated tool for performing structural design optimization using CA. The most immediate

    may be the need for a translator and compiler for the programs used by the Program Based

    design. For the work in this thesis, the programs were all written by hand. This process was

    very difficult and time consuming. A compiler is needed to take a higher level abstraction and

    generate machine level instructions. The end product should be a compiler that could take a

problem specified by a design engineer who has no knowledge of the hardware implementation


    and produce the necessary instructions.

    Additionally, an efficient method for design calculations must be implemented. In the

    current system, the results are downloaded to a host computer where design calculations can

    be performed. However, this is an inefficient technique. A number of possibilities exist for

    executing the design calculations on the board, such as using partial reconfiguration or on

    board processors. These possibilities need to be investigated to identify the best method.

    Another area that needs work is implementing multi-grid on the system. The multi-grid

    approach to these CA problems would be to calculate results while varying the resolution of

    the grids. In other words, the number of cells representing the system would increase and

    decrease based on certain algorithms. Multi-grid could also be used to blend analysis and

    design steps into a single cycle. These methods have the potential for huge reductions in the

    number of calculations needed.

Appendix A

    This appendix gives an example of how the programs for the Program Based model are

    written.

    The Processing Elements (PEs) in the Program Based model were developed to perform

    high and low precision arithmetic and convert between the two forms. The control logic

    needed to simulate a cellular automata model is complex, so programs are used to set the

    control signals. The program is stored in an internal BlockRAM contained in the Control

    Unit(CU). As the PEs were designed, the control signals for each unit were assigned to

    particular bits of the BlockRAM in the CU. The position of the bits and a short description

    of their function was recorded in a Excel spreadsheet. Figure A.1 shows a screenshot of this

    spreadsheet.

    Each phase of the analysis cycle was written as a separate program. There are control

    signals for each functional unit of the PE, but only one functional unit is in use during

    each phase. The first step in writing a program was to identify the signals needed for the

    particular phase. The pertinent signals were placed across the top of a spreadsheet and the

value of the signal was specified below. Each horizontal line represents a clock cycle step.

    Figure A.2 shows an example of a program. This particular program calculates the update

    during the inner iteration of the analysis cycle.

    The signals on the left are used to determine the order in which instructions are executed.


    Figure A.1: Spreadsheet with position of control signals and short description.

The program counter will increment by one unless a loop is specified. The signals on the right

    control the PEs. Signals with only two options are specified as Y or N. Signals with more

    options are specified as a number or letter from a certain set.

    There is a second spreadsheet which determines the numerical value for each signal. The

    values of the signals are then converted into an intermediate form. The intermediate form is

    the numerical value of the signal multiplied by two to the power of its bit position. The final

    value is the sum of all the intermediate values (see Figure A.3). This is the number that is

    loaded into the CU BlockRAM. These final values are then put in a form that can be read

    into memory (see Figure A.4).
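The conversion from spreadsheet rows to memory words is thus a weighted sum of signal values. A C++ sketch of the packing step (the field names are illustrative):

    #include <cstdint>
    #include <vector>

    // Sketch of the spreadsheet-to-memory conversion: each control
    // signal's numerical value is shifted to its assigned bit position
    // (value * 2^position) and the intermediate values are summed into
    // one instruction word for the CU BlockRAM.
    struct SignalField {
        int bitPosition;  // position recorded in the signal spreadsheet
        uint64_t value;   // value chosen for this clock-cycle step
    };

    uint64_t packInstruction(const std::vector<SignalField>& signals) {
        uint64_t word = 0;
        for (const SignalField& s : signals)
            word += s.value << s.bitPosition; // intermediate form: value * 2^bitPosition
        return word;                          // final value loaded into the CU BlockRAM
    }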


Figure A.2: Spreadsheet containing the update program.

Figure A.3: Spreadsheet converting signals to the form used by the Program Based model.


    Figure A.4: Spreadsheet containing the data values in a form that can be loaded into memory.

Vita

Thomas Hartka was born in June 1980 in Baltimore, Maryland. He attended Archbishop

High School in Severn, Maryland. Thomas enrolled in the College of Engineering at

Virginia Tech in the fall of 1998. He graduated cum laude with a Bachelor of Science in

Computer Engineering. Thomas chose to remain at Virginia Tech to pursue his Master's

degree. He became involved in research at the Virginia Tech Configurable Computing Lab. After

    graduating, Thomas will attend Johns Hopkins’ Post-Baccalaureate Premedical Program.
