
  • Cellular Automata for Structural Optimization on Reconfigurable

    Computers

    Thomas R. Hartka

    Thesis submitted to the Faculty of the

    Virginia Polytechnic Institute and State University

    in partial fulfillment of the requirements for the degree of

    Master of Science

    in

    Computer Engineering

    Dr. Mark T. Jones, Chair

    Dr. Peter M. Athanas

    Dr. Michael S. Hsiao

    May 12, 2004

    Blacksburg, Virginia

    Keywords: configurable computing, cellular automata, design optimization

    Copyright © 2004, Thomas R. Hartka

  • Cellular Automata for Structural Optimization on Reconfigurable

    Computers

    Thomas R. Hartka

    (ABSTRACT)

    Structural analysis and design optimization are important to a wide variety of disciplines. The current methods for these tasks require significant time and computing resources. Reconfigurable computers have shown the ability to speed up many applications, but are unable to handle efficiently the precision requirements for traditional analysis and optimization techniques. Cellular automata theory provides a method to model these problems in a format conducive to representation on a reconfigurable computer. The calculations do not need to be executed with high precision and can be performed in parallel. By implementing cellular automata simulations on a reconfigurable computer, structural analysis and design optimization can be performed significantly faster than conventional methods.

    This work was partially supported by NSF grant #9908057 as well as by the Virginia

    Tech Aspires program.

  • Acknowledgements

    I would first like to thank my advisor, Dr. Mark Jones, for his guidance through my entire

    research. Without his guidance I never would have been able to complete this thesis.

    Thanks to Dr. Athanas for serving on my thesis committee and for making development

    on the DINI board possible.

    Thanks to Dr. Hsiao for serving on my committee and being an excellent teacher.

    Thanks to Dr. Gurdal and his researchers for providing the mathematics for the cellular automata models and for the effort they spent learning about reconfigurable computing so that the equations mapped efficiently.

    Thanks to all the professors and students involved with the Configurable Computing Lab

    for making it a great place to work.

    Thanks to all the people that helped in the process of reviewing and editing this thesis.

    I am forever indebted to anyone who will review sixty pages of my writing.

    Thanks to everyone else who I have not mentioned that helped with my work. I could

    not have done it without the support from the people around me.


  • Contents

    1 Introduction 1

    1.1 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

    1.2 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

    2 Background 4

    2.1 Cellular Automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

    2.2 Configurable Computers for Scientific Computations . . . . . . . . . . . . . . 8

    2.3 Limited Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    3 System Design 13

    3.1 Design Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    3.1.1 System Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    3.1.2 Distributed Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    3.2 Problem Specific Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

    3.3 Program Based Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    4 Results 32


  • 4.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

    4.2 Problem Specific Design Results . . . . . . . . . . . . . . . . . . . . . . . . . 34

    4.3 Program Based Design Results . . . . . . . . . . . . . . . . . . . . . . . . . . 38

    4.4 Comparison of Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

    5 Conclusions 48

    5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

    5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

    5.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

    A 51

    Vita 55


  • List of Figures

    3.1 Setup of the configurable computer used for simulating the CA model. . . . . 15

    3.2 Distribution of logical CA cells among PEs. . . . . . . . . . . . . . . . . . . 16

    3.3 Arithmetic unit for Problem Specific design. . . . . . . . . . . . . . . . . . . 19

    3.4 PE layout for Problem Specific design . . . . . . . . . . . . . . . . . . . . . . 20

    3.5 Return data chains for Program Based design. . . . . . . . . . . . . . . . . . 23

    3.6 Control Unit for Program Based design. . . . . . . . . . . . . . . . . . . . . 24

    3.7 Analysis cycle flow and precision for each operation. . . . . . . . . . . . . . . 26

    3.8 Computational unit for Program Based design. . . . . . . . . . . . . . . . . . 27

    3.9 Multiply accumulator used in computational unit. . . . . . . . . . . . . . . . 27

    3.10 MSB data return chain, used for determining the most significant ‘1’ of residuals. 28

    3.11 Unit for shifting the precision of intermediate results. . . . . . . . . . . . . . 28

    3.12 Matrix accumulator used for analysis updates. . . . . . . . . . . . . . . . . . 29

    3.13 Data flow for uploading and downloading data to FPGAs. . . . . . . . . . . 30

    4.1 Diagram of the CA model for performing analysis on a beam. . . . . . . . . 33

    4.2 Beam analysis problem modeled on the configurable computer. . . . . . . . . 33


  • 4.3 Precision of PE vs. Percent Utilization of FPGA for Problem Specific design 35

    4.4 % Utilization of FPGA and maximum clock frequency vs. number of PEs for

    Problem Specific design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

    4.5 Cell updates per second vs. number of PEs for the Problem Specific design. . 37

    4.6 Actual results and results from Problem Specific design for beam problem. . 39

    4.7 Precision of PE vs. % Utilization of FPGA for Program Based design. . . . . 40

    4.8 Efficiency vs. number of inner iterations per analysis cycle. . . . . . . . . . . 42

    4.9 % Utilization of FPGA and maximum clock frequency vs. number of PEs for

    Program Based design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

    4.10 Cell updates per second vs. number of PEs for Program Based design. . . . 44

    4.11 Actual results and results from Program Based design for beam problem. . . 46

    A.1 Spreadsheet with position of control signals and short description. . . . . . . 52

    A.2 Spreadsheet containing update program . . . . . . . . . . . . . . . . . . . . . 53

    A.3 Spreadsheet converting signals to the form used by the Program Based model 53

    A.4 Spreadsheet containing the data values in a form that can be loaded into

    memory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54


  • List of Tables

    4.1 Times for operations associated with Problem Specific analysis cycle for DINI

    board. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

    4.2 Clock cycles for different phases of residual-update analysis cycle. . . . . . . 41

    4.3 Time for operations associated with analysis on Program Based design. . . . 45

    4.4 Maximum cell updates per second for both implementations. . . . . . . . . . 47

    4.5 Maximum cell updates per second and speed up for both implementations

    compared to PC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47


  • Chapter 1

    Introduction

    Structural analysis and design optimization are an integral part of many industries. Applications range from simple tasks, such as testing and optimizing a support beam, to very complicated ones, such as optimizing the structure of a car for crash resistance.

    Performing the design iterations manually is very time consuming. Therefore, a significant

    amount of research has been conducted to develop efficient methods to automate the design

    process.

    Traditional methods for automating design have involved running simulations on general

    purpose processors. In these methods, calculations must be performed in high precision.

    Parallelization of the calculations, where possible, is done through expensive supercomputers with hundreds of processors. Even with massive amounts of computing power, the simulations will usually take hours to complete.

    Cellular Automata (CA) has proved to be a very powerful tool for modeling physical

    phenomena. CA models have successfully captured the behavior of complex systems such

    as fluid flow around a wing and pedestrian traffic [?, ?]. Recently, CA theory has been

    extended to structural analysis and design optimization [?]. Using CA in structural models

    changes analysis and design optimization into a highly parallelizable form that does not require


    high-precision calculations. This provides the potential for significant speed-up.

    1.1 Thesis Statement

    Using CA provides a method to efficiently map structural design optimization problems onto

    FPGAs. By exploiting the inherent parallelism of FPGAs there is the potential for speed-up

    over general purpose processors.

    To achieve this objective, distributed processing systems were implemented on a configurable computer. The system consisted of a host PC connected to a PCI-based board with five FPGAs. Two designs were developed for the FPGAs to rapidly iterate CA models for structural analysis. The two designs represent significantly different approaches to accomplishing the same objective.

    The author’s contributions to this work are the following:

    - developed a custom FPGA design for simulating a beam CA model,

    - developed a separate FPGA design that executes programs to simulate CA models,

    - wrote programs for the FPGA design to simulate a beam CA model, and

    - implemented a limited precision method in hardware for solving iterative improvement

    problems.

    1.2 Thesis organization

    Chapter 2 presents background information about CA theory, scientific computations on configurable computers, and the limited precision method used in the designs. Chapter 3 gives

    details on the two implementations developed to solve the CA models. Chapter 4 presents


    the results for each of the two implementations and comparisons to traditional methods.

    Chapter 5 summarizes the work performed for this project and the results obtained.

  • Chapter 2

    Background

    This chapter presents previous work in areas related to this research. The combined contributions discussed were used in the completion of the simulation environment and prototype

    presented in this thesis.

    2.1 Cellular Automata

    The concept of Cellular Automata (CA) theory is to model systems made up of many interacting objects [?]. The systems are divided into discrete units, or cells, that act autonomously. The advantage of using CA is that the behavior of some complex systems can be captured using relatively simple rules for each cell [?]. Attempting to reproduce this behavior without breaking those systems into autonomous units, even if possible, would be complicated.

    Each cell in a CA model can be in a single state at any given point in the simulation. The

    number of states the cell may be in depends on the problem being solved. In many models

    the number of elements in the set of states is small (8 or less), but there is a new class of

    CA models that use a continuous state space. These continuous state space CA models are


    known as coupled map-lattice or cell dynamic schemes. The next state of a cell is based on

    an update rule, sometimes referred to as a transition rule, which is a function of its current

    state and the current state of its neighbors [?]. The collective state of all of the cells in the

    model at any given point is known as the global state [?].

    Stanislaw Ulam is generally credited with the first work in CA, originally referred to

    as cellular space or automata networks. John von Neumann extended Ulam’s work and

    proposed CA as a way to model self-reproducing biological systems [?, ?]. The work of

    Ulam and von Neumann provides a formal method for simulating complex systems. Their

    research, and much of the current research in CA, focused on modeling dynamic systems in

    which time and space are discrete. Each calculation of the next state of all the cells in a

    system represents a step in time [?]. A good example of this type of CA model is Conway’s

    Game of Life in which cells can be in one of two states: alive or dead. Each update of the

    global state represents a new generation of organisms [?].
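    To make the update rule concrete, the following minimal Python sketch (an illustration, not code from this thesis) computes one global state update of the Game of Life on a finite grid, treating out-of-range neighbors as dead:

        def life_step(grid):
            # One synchronous update of Conway's Game of Life.
            # grid is a list of rows of 0 (dead) and 1 (alive);
            # cells outside the grid are treated as permanently dead.
            rows, cols = len(grid), len(grid[0])
            def alive(r, c):
                return grid[r][c] if 0 <= r < rows and 0 <= c < cols else 0
            nxt = [[0] * cols for _ in range(rows)]
            for r in range(rows):
                for c in range(cols):
                    n = sum(alive(r + dr, c + dc)
                            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                            if (dr, dc) != (0, 0))
                    # transition rule: survive with 2-3 neighbors, birth with 3
                    nxt[r][c] = 1 if (n == 3 or (grid[r][c] and n == 2)) else 0
            return nxt

    Every cell applies the same rule as a function of its own state and its neighbors' states, and the entire grid is updated from the old global state at once.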

    There are a number of architectures for CA models, each resulting in different behavior.

    The number of dimensions of the CA model can differ greatly depending on the system being

    modeled. In practice, models are typically one, two, or three dimensional. However, there

    is no limit to the number of dimensions that can be used [?]. The number of dimensions of

    the grid has a large effect on the communication network among cells, known as the cellular

    neighborhood.

    In the work on two-dimensional grids there are two common cellular neighborhoods. The

    first is the von Neumann neighborhood, in which each cell communicates only with the four

    cells that are orthogonally adjacent to it. The second is a Moore neighborhood in which a

    cell communicates with all eight cells surrounding it [?]. Though von Neumann and Moore

    neighborhoods are common, cells are not limited to communicating only with those that

    are adjacent. The “MvonN Neighborhood” uses the nine cells (including the center cell)

    in the Moore neighborhood as well as the four cells orthogonally one space away from the

    current cell. Additionally, the communication of the cells within a model is not required to


    be consistent throughout the model [?].

    There has also been a significant amount of work investigating non-rectangular cell systems. Gas lattice automata are a subset of cellular automata that commonly use the FHP model. The FHP model uses a hexagonal grid, where cells communicate with their six immediate neighbors [?,?]. The use of triangular and regular polygonal lattices is common in

    specialized applications of cellular automata because they can better capture the behavior

    of certain systems [?].

    Models in which communication and update rules are consistent throughout the model

    are called uniform. Though most of the work in the area of CA has used uniform models,

    the use of non-uniform rules does not necessarily detract from the effectiveness of using CA.

    A number of experiments have been conducted to model the effect of “damaged” areas of

    a grid where cells use different rules [?]. In terms of simulating a CA model on a serial

    processor, a uniform grid has the advantage that only one update rule is needed [?].

    The grid for a CA model may be finite or infinite. In his work, von Neumann examined

    infinite grids as a method to construct a universal computer [?]. Although von Neumann’s

    work on infinite grids was theoretical, methods for representing and calculating CA models

    on infinite grids have been developed [?]. Finite grids are much simpler to implement and

    process in parallel because the maximum size of the active area is known before processing

    begins. However, the use of finite grids introduces the problem of how to calculate cells on

    the edge of the grid, known as the boundary conditions.

    There are several ways to handle the processing of cells on the edge of a finite grid. The

    first method is to logically connect the cells on one edge of the grid to cells on the opposite

    edge, producing a loop. Another way to handle boundary conditions is to use a fixed value

    for cells at the perimeter of the grid. In systems with fixed boundary conditions, the edge

    cells are known as dummy cells because they do not need to be updated. The third method

    for calculating the update for edge cells is to use an update rule that is different than that

    used in the internal cells [?]. An example of such a rule would be an edge cell that simply mirrored the value of the closest internal cell. The type of boundary condition used depends

    largely on the problem being modeled.
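    As an illustration of these three strategies (a sketch, not from the thesis), the boundary condition can be viewed as the policy for resolving an out-of-range neighbor index on a finite one-dimensional grid:

        def neighbor(cells, i, mode, fixed=0):
            # Resolve the value of logical neighbor i on a finite 1-D grid.
            n = len(cells)
            if 0 <= i < n:
                return cells[i]
            if mode == "wrap":    # edges logically connected into a loop
                return cells[i % n]
            if mode == "fixed":   # dummy cells hold a constant value
                return fixed
            if mode == "mirror":  # special edge rule: copy closest internal cell
                return cells[0] if i < 0 else cells[n - 1]
            raise ValueError(mode)

        cells = [1, 2, 3, 4]
        print(neighbor(cells, -1, "wrap"), neighbor(cells, 4, "mirror"))  # 4 4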

    The early work in CA theory concentrated on theoretical computational questions, such

    as computational universality. In later work, it has been used as a method to study social,

    physical, and biological systems [?]. A number of studies have been conducted to use CA

    to capture the aggregate behavior of groups of autonomous beings, for example, car traffic,

    pedestrian flow, and ant colonies [?,?,?]. In scientific computing, many successful attempts

    have been made to model such phenomena as fluid dynamics, chemical absorption, and

    heat transfer using CA [?,?,?].

    Some of the most recent work in CA has been in the field of structural analysis and

    design. The first work in this area was the development of methods to optimize the angle

    and cross-section area of trusses in a fixed structure [?]. These methods proved successful in

    merging analysis and design into a CA model and showed powerful computational properties.

    This success prompted more work to extend CA theory to create models for other structural

    design problems.

    A model was developed to minimize the weight of a beam needed to prevent buckling

    [?]. The beam is represented in sections, to which constraints and external forces can be

    applied independently. The cross sectional area for each section is determined to produce the

    minimum total weight of the beam. Experiments with this method showed that models converged

    to the correct minimum solution. This area of CA research shows substantial promise for

    accelerating structural design optimization through parallel computing.


    2.2 Configurable Computers for Scientific Computations

    FPGAs, which are usually the basis for configurable computers, show considerable speed-up

    for a variety of applications when compared to general purpose processors. These applications

    include signal processing, pattern searching, and encryption. The use of FPGAs for these

    tasks, which mainly involve bit manipulation, has shown orders of magnitude acceleration

    [?,?,?]. These accelerations are possible because the tasks can be broken down into simple

    operations. The operations can then be performed in parallel throughout the chip.

    Although pattern matching and bit manipulation have been widely studied on FPGAs,

    FPGAs have typically not been used for scientific computations. Two significant deterrents

    in using FPGAs are the limited available programmable logic and the slow clock speeds. In

    the past, FPGAs have only been able to represent circuits that had gate counts in the low

    thousands [?]. This low gate count is restrictive for scientific computations. For example, a

    32-bit parallel multiplier could not be emulated by most of the FPGAs in the Xilinx XC3000

    family, chips that were first produced in the mid 1990s (based on number of CLBs). This

    becomes even more of a handicap because FPGAs typically operate at clock speeds much

    lower than average CPUs. A general purpose processor will usually outperform an FPGA if

    the FPGA cannot carry out parallel or deeply pipelined operations.

    These limitations of FPGAs have been greatly reduced in recent chips because of the

    much larger transistor densities. The latest Xilinx FPGAs can emulate circuits with up

    to 10 million gates [?]. With increased programmable logic, it is possible to have many

    more arithmetic units performing complicated operations in parallel. In comparison to the

    previous example, a Xilinx XC2V8000 chip, currently Xilinx’s largest FPGA, has enough

    logic to represent thirty-five 32-bit multipliers (based on number of CLBs). Floating-point

    operations continue to require a large percentage of available resources. Still, researchers have

    begun exploring scientific computations on FPGAs. A paper from researchers at Virginia


    Tech used the flexibility of FPGAs to develop representations of floating-point numbers

    that are more efficient on FPGAs [?]. In 2002, researchers published a paper detailing the

    development of a limited precision floating-point library and an optimizer to determine the

    minimum precision needed in DSP calculations [?].

    Using the least precision possible is important on an FPGA. General purpose processors

    usually compute operations in higher precision than is needed because of the limited choices

    for precision. However, the fine-grain control of the logic in an FPGA allows custom arithmetic units of any precision. This flexibility can be extended to dynamically controlling the

    precision of different calculations on the same unit. Other work has been presented on a

    variable precision coprocessor for a configurable computer and algorithms given for variable

    precision arithmetic units [?]. Two papers have been published investigating how to manage

    dynamically varying precision and to show how the overall runtime is substantially decreased

    by using minimal precision [?,?].

    CA has been used in the computer science community for some time. In 1985, a book was

    published describing implementations for CA simulations on massively parallel computers

    [?]. However, there has been little work done in trying to run these models on configurable

    computers. There have been some papers written on using CA on FPGAs, but they all focus

    on models with simple cell update rules and small state sets. For example, the CAREM

    system was developed to efficiently model CA on FPGAs [?]. The two models published as

    examples of using the CAREM system were an image thinning algorithm and a forest fire

    simulation. In both cases the models were simple, having state set sizes of 4 or less. Other

    cellular automata simulation systems have been proposed for reconfigurable computers which

    concentrated on fluid dynamics [?,?]. However, like CAREM, these systems are only capable of handling simple models with a very limited number of states.

    Custom hardware architectures were implemented by Norman Margolus from MIT for processing CA. The most successful was known as the Cellular Automata Machine 8 (CAM-8) [?]. The CAM-8 is based on custom SIMD processors that are connected in three


    dimensions. Each processor is responsible for a section of data in the model which is stored

    in a DRAM. Processing on each cell's data is performed using look-up tables (LUTs) stored in SRAM. This architecture shows impressive results, generating up to 3 billion cell updates

    per second. However, the LUT based processing limits models to a fairly small state size.

    There have been a number of projects which use the CAM-8 in areas such as modeling fluid

    motion [?] and gas lattices [?]. The CAM-8 is now sold commercially.

    2.3 Limited Precision

    The use of configurable computers has renewed study in the area of limited precision computing. Determining the least number of bits needed for a task was important when many chips were custom designed and silicon was expensive. With the rise of cheaper fabrication methods and inexpensive, powerful CPUs, this area has become less important. The use of

    general purpose processors with dedicated floating-point units lessens the penalty for using

    floating-point for all calculations. However, as configurable computers become popular, the

    use of limited precision for calculations has again become important [?].

    All configurable computers are based on programmable logic at some level of granularity.

    Historically, the most popular type of programmable logic is the FPGA. FPGAs have bit-level granularity, so arithmetic units can be built with any precision. In most cases, each

    additional bit of precision of an arithmetic unit will require more chip resources. Also, the

    maximum clock frequency for an arithmetic unit may decrease with each additional bit of

    precision. This high sensitivity makes using the lowest precision possible very important to

    optimizing a design on an FPGA.

    Limiting precision has been extended further for FPGAs for solving iterative problems in

    a recent paper [?]. This paper describes a method for performing low precision calculations

    that are collated into high precision results. Similar ideas were developed for CPUs, but

    those studies focused on using single precision floating-point calculations to find double


    precision solutions [?,?]. As mentioned earlier, FPGAs have a much finer grain of control

    over precision, and floating-point calculations are expensive when using FPGAs. So a new,

    modified version of this concept has recently been investigated specifically suited for use on

    a configurable computer [?].

    The reason that low-precision arithmetic can be used in iterative improvement problems

    is that the answer converges gradually. During each step, a correction is found that improves

    the solution. When the correction is large and the highest bits of the solution are converging,

    the low bits do not hold any useful information. Therefore, there is no advantage to using

    a precision that calculates the low order bits before the upper bits have converged. As

    the solution becomes closer to the final answer, the refinement at each step becomes less.

    Because the refinement is small, the high order bits no longer change. At this point there is

    no longer any reason to recompute the high bits of the solution.

    This property of iterative improvement problems makes it possible to use fewer bits to

    calculate the correction than the number of bits that are in the final result. In this way,

    only the high order bits are calculated while the correction is large; inversely, only the low

    order bits are computed when the correction becomes small. This is possible by calculating

    the error (residual) in the equation for the iterative improvement problem. The goal of the

    example below is to find a value for x which satisfies the equation.

    A x_i = b \qquad (2.1)

    The residual (or error) in this equation can be written as

    r = b - A x_i \qquad (2.2)

    Instead of using the initial equation, the change in x can be calculated

    \Delta x_i = A^{-1} r \qquad (2.3)


    The previous calculation can be performed with lower precision arithmetic. This step is iterated a number of times, and \Delta x_i is then added back into the previous x:

    x_{i+1} = x_i + \Delta x_i \qquad (2.4)

    This method has been shown to converge to the correct solution [?]. It is applicable to our

    work in CA because the CA models we use for structural design optimization are in a form

    that utilizes this method. The advantage of using this method on reconfigurable computers

    comes from the fact that the bulk of the operations are performed during the update phase.

    A large number of resources can be devoted to accelerating the update calculations because

    the update can be calculated at a low precision.
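    To show Equations 2.1-2.4 in action, here is a small Python sketch (illustrative values, with a crude quantizer standing in for low-precision hardware arithmetic; not code from the thesis) that solves a diagonally dominant 2x2 system by computing the residual in full precision and the correction with low-precision inner iterations:

        import math

        def quantize(x, bits):
            # crude model of low-precision arithmetic: keep only the
            # top `bits` significant bits of x
            if x == 0.0:
                return 0.0
            scale = 2.0 ** (bits - 1 - math.floor(math.log2(abs(x))))
            return round(x * scale) / scale

        def refine(A, b, x, inner_bits=8, outer_steps=20):
            for _ in range(outer_steps):
                # residual in full precision (Eq. 2.2)
                r = [b[i] - sum(A[i][j] * x[j] for j in range(2))
                     for i in range(2)]
                # correction via low-precision Jacobi sweeps (Eq. 2.3)
                dx = [0.0, 0.0]
                for _ in range(10):
                    dx = [quantize((r[i] - A[i][1 - i] * dx[1 - i]) / A[i][i],
                                   inner_bits) for i in range(2)]
                # fold the correction back into the solution (Eq. 2.4)
                x = [x[i] + dx[i] for i in range(2)]
            return x

        A, b = [[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]
        print(refine(A, b, [0.0, 0.0]))   # approaches [1/11, 7/11]

    Even though each correction carries only about eight significant bits, the outer loop accumulates a much more precise answer, which is the property the hardware designs in Chapter 3 exploit.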

  • Chapter 3

    System Design

    This chapter describes two approaches to implementing CA models on FPGAs. Both approaches use an array of uniform, simple processing elements (PEs) spread throughout the

    chip. A large number can fit on a single FPGA because the PEs are relatively simple. This

    distributed computing is effective because of the parallel nature of solving the CA models.

    The two designs described in this section illustrate a fundamental tradeoff in hardware

    design: flexibility versus speed. The first implementation is a custom circuit developed

    to solve the analysis equation for a given design. The second implementation executes a

    program stored in memory that controls arithmetic operations. Both designs solve the same

    analysis problem.

    It is important to note that the underlying theory behind the two designs is the same. In

    both cases, the design is intended to determine the displacement and rotation of sections of

    a beam given the constraints and external forces on the beam. Though they solve the same

    problem, the motivation behind each design is fundamentally different. Therefore, although

    the same equations are used for solving for the beam variables, the form of the equations is optimized for the specific implementation.


    3.1 Design Background

    When performing operations on an FPGA, it is much more efficient to use fixed-point arithmetic than floating-point arithmetic. For this reason, both of the designs represent numbers

    in fixed-point notation. The nature of CA models allows for this type of representation. The

    position of the binary point depends on the architecture and the type of data being stored. The

    number of bits of precision varies based on the operation being performed. In both models,

    intermediate values produced during calculations are stored in increasing precision to avoid

    loss of data. The data is then truncated before the final value is stored.
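    As a small illustration of this representation (a sketch, not the designs' actual bit widths), a fixed-point value can be modeled as an integer with an implied binary point; a multiply produces a wider intermediate that is truncated back to the storage format:

        FRAC = 8   # assumed fractional bits for illustration

        def to_fixed(x):
            return int(round(x * (1 << FRAC)))

        def to_float(f):
            return f / (1 << FRAC)

        def fx_mul(a, b):
            # the raw product carries 2*FRAC fractional bits (the
            # higher-precision intermediate); shifting right truncates
            # it back to FRAC bits before the value is stored
            wide = a * b
            return wide >> FRAC

        a, b = to_fixed(1.5), to_fixed(-0.75)
        print(to_float(fx_mul(a, b)))   # -1.125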

    These designs were developed to perform calculations for a one-dimensional CA model,

    with two degrees of freedom for each cell. The arithmetic for higher dimensional problems can

    be performed without significant changes to the structure of the PE. The main difference

    in higher dimensional problems is the change in the communication pattern. In the one-dimensional models considered, each cell only needs to communicate with its immediate right and left neighbors. In the case of a two-dimensional problem, cells often need to

    communicate data with four to eight neighboring cells.

    3.1.1 System Hardware

    The concepts presented in this thesis for solving CA models on configurable computers can

    be applied to many hardware configurations; however, both designs were developed with a

    particular system in mind. The system uses a host PC connected over a PCI bus to a card

    containing five FPGAs (see Figure 3.1). The FPGAs are all Xilinx Virtex II - XC2V4000

    chips [?]. There are several features which make the Virtex II desirable for simulating CA

    models.

    The first advantage of the Virtex II is the large amount of internal RAM distributed

    throughout the chip. These internal BlockRAMs have customizable width and depth. They

    also have two ports that can independently read and write to different addresses.

    Figure 3.1: Setup of the configurable computer used for simulating the CA model.

    Transferring data on and off chip is an expensive operation and is typically the bottleneck in most

    applications. By utilizing these memories, we avoid having to transfer data to external banks

    of RAM.

    The second advantage of the Virtex II is the built-in multiplication units. In the sea-of-gates model for FPGAs, implementing multipliers is expensive. This is especially true if the precision is large because the size of a multiplier grows with the square of the number of bits

    of precision. In the Virtex II, there is a built-in multiplier associated with each BlockRAM.

    This lends itself to the distributed processor models we used.

    3.1.2 Distributed Layout

    FPGAs are designed to be as flexible as possible so they can be used in many applications,

    but this flexibility comes at a cost in terms of space and speed for any arithmetic unit when

    compared to custom VLSI. Chips such as general purpose processors have custom designed

    arithmetic units that have a significant advantage in executing sequential operations. The


    reason an FPGA has the potential for speed up versus a general purpose processor is that it

    can perform many operations in parallel or deeply pipeline the operations.

    In order to maximize the ability of an FPGA to perform operations in parallel, as much

    of the reconfigurable resources as possible should be in use at the same time. To accomplish

    this objective, both designs use many uniform, simple PEs operating in parallel. Each PE

    is responsible for calculating the next value for a section of cells in the CA model (see

    Figure 3.2).

    Figure 3.2: Distribution of logical CA cells among PEs.

    This distribution is simplified because each cell in the CA model is governed by the

    same equations. The arithmetic units in each PE implement the governing equation; each

    logical cell is represented by the data values that are inserted into the equation. There is a

    BlockRAM associated with each processing unit that stores the set of data values for each

    cell. The number of cells represented by a PE is determined by the number of logical cell

    data sets that can be stored in the BlockRAM.

    This concept of having multiple cells per PE greatly increases the number of logical cells

    that can be represented in a design. A certain amount of chip resources is needed to calculate

    a cell update. If only one cell were represented in each PE, then the PE could be slightly smaller and a BlockRAM would not be needed. However, the resources required are not greatly increased by moving from a PE that calculates the update for one cell to one that handles many. There are

    enough BlockRAMs on the Virtex-II so that the number of BlockRAMs does not limit the

    number of PEs that can fit on the chip.

    During a single iteration, all of the logical cells contained within a processing unit are

    updated once. The update for a cell depends on its right and left neighbors. To calculate

    the update for cells on the edge of the section of logical cells a PE represents, the PE needs

    data from the PEs to its right and left. At the end of an iteration, each PE transfers the

    data from its leftmost cell to the PE representing cells to the left. Likewise, the data from

    the rightmost cell must be transferred to the PE representing the cells to the right.

    After this transfer, each processing unit has all of the information needed to compute the

    next update for all of the cells it represents. Calculations for all cells can start simultaneously

    because the necessary information about all cells is known at the beginning of the iteration.

    In both designs, registers are placed between arithmetic units. If an arithmetic unit required

    more than one cycle to complete, a pipelined version of the component was used. This

    pipelining allows multiple cell updates to be computed concurrently, because cells do not

    need to wait until the previous cell has completely finished processing.
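    In software terms, the iteration scheme can be sketched as follows (an illustration of the data movement, not the hardware): each PE owns a slice of the logical cells, and the edge values exchanged at the end of the previous iteration serve as that PE's view of its neighbors:

        def iterate(pes, update, left_bc=0.0, right_bc=0.0):
            # pes is a list of PEs, each a list of its logical cells' values;
            # update maps (left, center, right) to a cell's next value.
            # edge registers: each PE sees its neighbors' boundary cells
            lefts = [left_bc] + [pe[-1] for pe in pes[:-1]]
            rights = [pe[0] for pe in pes[1:]] + [right_bc]
            new = []
            for p, pe in enumerate(pes):
                nxt = []
                for i, v in enumerate(pe):
                    l = pe[i - 1] if i > 0 else lefts[p]
                    r = pe[i + 1] if i < len(pe) - 1 else rights[p]
                    nxt.append(update(l, v, r))
                new.append(nxt)
            return new

        # e.g. a simple averaging rule over 3 PEs of 4 cells each
        pes = [[0.0] * 4, [1.0] * 4, [0.0] * 4]
        pes = iterate(pes, lambda l, c, r: (l + r) / 2)

    Because every PE already holds its neighbors' edge values when an iteration starts, all cell updates can proceed in parallel, which is what the pipelined hardware exploits.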

    3.2 Problem Specific Design

    The original direction of this project was to develop a toolset that could rapidly produce

    custom hardware models based on specific problems. A designer who wanted to use the tools

    would specify the problem in a custom programming language. A compiler would interpret

    the input code and produce a custom FPGA configuration to solve the problem. It was

    expected that a toolset could be developed for creating custom bitstreams rapidly enough

    to make the system useful.

    The first step in this development process was to analyze typical CA analysis equations


    and manually create an optimized layout. The equations used are based on an analysis

    problem with two degrees of freedom, v and Θ.

    \bar{v}_c = C_0 (v_l + v_r) + C_1 (\Theta_l - \Theta_r) + F_c

    \bar{\Theta}_c = C_2 (v_l - v_r) + C_3 (\Theta_l + \Theta_r) + M_c \qquad (3.1)

    The variable vc represents the v value for the current cell being processed. The variables

    vl and vr are the v values for the current cell's left and right neighbors. F represents an

    external force. v̄c is the value of vc at the next time step. These equations can be used to

    solve a one-dimensional CA analysis problem, such as deflection of a uniform beam. This

    form of the equations was chosen because it can be mapped to a small, linear circuit. The

    main goal was to minimize the number of multiplications needed because multiplication units

    are costly in terms of space on the FPGA.
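    In software form, the update in Equations 3.1 is just a handful of multiply-adds per cell. The sketch below (with illustrative coefficient values, not those of the actual beam model) shows the arithmetic the optimized circuit implements with fixed-point constant multipliers:

        # illustrative coefficients; the real values come from the beam model
        C0, C1, C2, C3 = 0.5, 0.25, -0.25, 0.5

        def cell_update(vl, tl, vr, tr, Fc, Mc):
            # next-state (v, Theta) for one cell from its neighbors'
            # current values and the external force/moment, per Eq. 3.1
            v_next = C0 * (vl + vr) + C1 * (tl - tr) + Fc
            t_next = C2 * (vl - vr) + C3 * (tl + tr) + Mc
            return v_next, t_next

        print(cell_update(0.1, 0.0, 0.3, 0.05, 0.01, 0.0))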

    An optimized design was built to solve these equations. For each operation a variety of

    components was considered, and multiple layouts were investigated in the implementation

    of the equations. Maximum clock frequency, latency, and size were examined when selecting

    each component. To further optimize the circuit, vc and Θc are computed simultaneously, because they are independent. Figure 3.3 shows the final optimized design.

    The outputs of all the components shown in Figure 3.3 are registered. Additionally, the

    constant multipliers are pipelined. The resulting latency through the circuit is 6 clock cycles.

    The circuit is designed such that all information to compute the update value is provided

    at the point at which it is needed. In particular, the Fc and Mc values are loaded 5 clock

    cycles after the corresponding Θ and v values. In this design, when this pipeline is filled the

    circuit can produce an updated value every clock cycle.

    The constant multipliers were used because they had a much lower latency and were

    much smaller than traditional multipliers. Using constant multipliers is only possible if

    the coefficients in Equations 3.1 are fixed. In the case of analyzing the deflection of a

    uniform beam, these coefficients are constant.

    Figure 3.3: Arithmetic unit for Problem Specific design.

    These multipliers have the characteristic of having a structure independent of the constant multiplicand. Therefore, if the location in the bitstream of the constant multiplier is known, the values in the FPGA look-up tables (LUTs) could be modified directly to reflect changes in the coefficient.
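    The saving offered by a constant multiplier can be seen in a simple software analogue (a sketch, not the actual LUT structure): multiplying by a known constant reduces to a fixed pattern of shifts and adds, one term per set bit of the constant:

        def const_mul(x, c):
            # multiply x by the fixed constant c using only shifts and
            # adds -- the pattern is fixed once the constant is known
            acc, shift = 0, 0
            while c:
                if c & 1:
                    acc += x << shift
                c >>= 1
                shift += 1
            return acc

        print(const_mul(37, 11), 37 * 11)   # both print 407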

    The disadvantage of using constant multipliers is that design optimizations made to the

    density of the beam would require that a different type of multiplier be used. Also, if the

    beam was not uniform, the value of vc and Θc would be needed to compute updated values,

    v̄c and Θ̄c. This narrows the usefulness of this design, but it provides an optimized baseline

    for comparing other designs.

    Each PE in the Problem Specific design contains arithmetic logic, a finite state machine

    (FSM), and a BlockRAM. The BlockRAM contains all of the values for the cells. The FSM

    controls the addresses from which data is loaded and stored in the BlockRAM. The PE

    operates most efficiently when the pipeline is filled. When the pipeline is filled, a new set of

    data needs to be applied each clock cycle, and updated values need to be stored each clock

    cycle. To accommodate this flow of data, one port of the BlockRAM is devoted to loading

    data and the other is devoted to storing data (see Figure 3.4).


    Figure 3.4: PE layout for Problem Specific design.

    The Edge Registers are used to communicate data to neighboring PEs. When the update

    for a cell on either end of the section of the model for which the PE is responsible is calculated,

    the new value is stored in the Edge Registers. Each PE has access to these registers in its

    right and left neighbor. When data is needed from a neighbor, the values are loaded from

    the Edge Registers instead of from the BlockRAM. For PEs on the boundary of the model,

    the Edge Registers are connected to constant values.

    To implement design optimization, FPGA configuration bitstreams need to be produced

    for both analysis and design phases. The FPGA would first be loaded with the analysis

    design and the configuration would be iterated until the data values converged. After the

    cells converge, the data needed for the design improvement phase is stored in the internal

    BlockRAMs. The FPGA would next perform a partial reconfiguration and load the design

    improvement bitstream, during which the contents of the BlockRAMs would not be changed.

    In this way, data would be passed between the analysis and design phases.

    The results would be extracted from the board through readback. During the readback


    operation, the FPGA dumps its entire configuration including flip-flops and BlockRAM

    contents. Once the contents of the FPGA are dumped, careful filtering of the data would

    yield the current results. This method negates the need for using specialized hardware to

    support downloading data.

    The residual-update method, described in the Background chapter, can be used in finding

    the solution for a CA model because it is an iterative improvement problem. The advantage

    to using this method would be that low precision calculations can be used to generate a high

    precision result. The reconfiguration between analysis and design phases would provide the

    opportunity needed for loading updated coefficients to the FPGAs. The result of implement-

    ing the residual-update method would be that an 8-bit design could produce results with

    precisions such as 16 or 32 bits.

    3.3 Program Based Design

    The Program Based design represents a fundamentally different approach to solving the same

    analysis problem as the Problem Specific design. The Problem Specific design can perform

    analysis updates very rapidly because it uses custom hardware. However, using a custom

    design means that for each new problem an optimized circuit must be designed, and an FPGA

    configuration must be generated. The overhead of building a custom configuration for each

    problem could easily erase any speed advantage. On the opposite end of the spectrum, a

    compiled program running on a general purpose CPU has very low initial overhead, but it

    cannot take advantage of the inherent parallelism of CA. The Program Based design was

    developed to bridge the gap between the analysis speed of custom hardware and the flexibility

    of a general purpose processor.

    The first major change, compared to the Problem Specific model, is that the Program

    Based design executes a program stored in internal BlockRAM to control data accesses

    and the arithmetic units in the PEs. In the Problem Specific model these operations were


    performed using a fixed finite state machine. Another significant change is that the control

    logic is removed from the PEs and placed in a central control unit. The signals are then

    propagated to the PEs throughout the chip. The third important modification is that the

    equations are represented in a matrix form to provide a more flexible architecture that

    can handle a variety of problems. This matrix arithmetic is expressed in the layout of

    the arithmetic units. The last major difference is that the Program Based model has the

    capability to compute results in both high precision and low precision forms on the FPGA

    and then combine the two results.

    The goal of flexibility for the Program Based design is reflected in the form of the equations

    for the model. The hardware is designed to solve problems set up in matrix form. This

    provides a simpler method to implement, and eventually automate, CA design algorithms.

    The matrix form of the beam equations is shown in the following equations:

    \begin{bmatrix} \bar{v}_c \\ \bar{\Theta}_c \end{bmatrix} =
    \begin{bmatrix} C_0 & C_1 \\ C_2 & C_3 \end{bmatrix}
    \begin{bmatrix} v_l \\ \Theta_l \end{bmatrix} +
    \begin{bmatrix} K_0 & K_1 \\ K_2 & K_3 \end{bmatrix}
    \begin{bmatrix} v_c \\ \Theta_c \end{bmatrix} +
    \begin{bmatrix} C_4 & C_5 \\ C_6 & C_7 \end{bmatrix}
    \begin{bmatrix} v_r \\ \Theta_r \end{bmatrix} +
    \begin{bmatrix} F_c \\ M_c \end{bmatrix} \qquad (3.2)

    This equation solves the same analysis problem as the Problem Specific design. This is

    one of a range of two-dimensional problems that can be solved by the Program Based design.

    Equations can be implemented with any number of terms and are expressed in matrix form.

    The Problem Specific model only solves problems that can be represented in the form of the

    beam equations, while the Program Based model has the capability to capture the behavior

    of a variety of problems.

    The complexity of control logic increased greatly in the Program Based model as compared

    to the Problem Specific model. The finite state machines that controlled the load and store

    logic of the Problem Specific model are ill-equipped to handle the increase in complexity.

    The control logic in the Program Based model uses significantly more resources, so it is

    advantageous to move the control logic to a centralized location. There is a penalty involved

    in distributing the control signals; however, the size of each PE would more than double if


    the control logic was not centralized.

    The architecture of having a single control unit makes the Program Based design similar to

    a Single Instruction, Multiple Data (SIMD) parallel computer (see Figure 3.5). Removing the

    control from the individual PEs is possible because all cells, including boundary cells, can be

    represented by changing the coefficients in the matrix equation. Historically, there has been

    a lack of widespread interest in SIMD parallel computers because they are inflexible and

    require custom processors. However, SIMD machines have been successful in multimedia

    and DSP applications [?]. These applications involve repetitive calculations that can be

    performed in parallel, similar to those needed for CA models.

    Figure 3.5: Return data chains for Program Based design.

    The Control Unit (CU) requires feedback from the PEs, for example a flag indicating

    that calculations are complete. The routing resources around the CU would be consumed

    quickly because there are a large number of PEs that need to communicate with the CU. To

    avoid this problem, there are multiple PEs on each return data bus so only the last PE in

    the chain needs to be routed directly to the CU. The drawback is that extra computational

    cycles are needed. This is because the returning data takes an extra clock cycle to propagate

    back to the CU for every link.


    Instructions stored in the CU are not like those of a traditional microprocessor. The

    instructions for a traditional general purpose processor are encoded, while the instructions

    in this design are stored as a 72-bit word that requires no decoding. The result is that

    most control signals can be connected directly from the memory in the CU to the PEs (see

    Figure 3.6). This method of storing instructions has the advantages of being both fast and

    allowing any combination of control signals to achieve maximum parallelism.

    Figure 3.6: Control Unit for Program Based design.

    The instructions contain two main parts: the flow control logic and the control signals. The

    flow control portion interacts with the flow control logic in the CU to determine which

    instruction is executed next. The flow control logic allows for increments to the program

    counter, branches, and conditional branches. The control signals manage operations in the

    PEs. These include: clearing registers, loading data, and shifting data. The signals for

    controlling the BlockRAMs in the PEs are fed through address logic to allow absolute and

    relative address jumps.
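    The undecoded format can be pictured as a wide word in which every control signal owns a fixed bit position. The Python sketch below uses hypothetical field positions for illustration; the real layout is captured in the spreadsheet described next:

        # hypothetical layout of the 72-bit control word
        FIELDS = {
            "next_addr":   (0, 12),   # flow control: branch target
            "branch_en":   (12, 1),
            "cond_branch": (13, 1),
            "clear_regs":  (16, 1),   # control signals for the PEs
            "load_data":   (17, 1),
            "shift_data":  (18, 1),
            "bram_addr":   (20, 10),
        }

        def encode(**signals):
            # pack named signals into one word; every other bit is 0.
            # no decoding is needed -- each bit drives a control line
            word = 0
            for name, value in signals.items():
                pos, width = FIELDS[name]
                assert 0 <= value < (1 << width)
                word |= value << pos
            return word

        print(f"{encode(load_data=1, bram_addr=5):072b}")

    Because any combination of bits can be set in a single word, independent operations in the PEs can be triggered in the same cycle, which is the parallelism advantage noted above.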


    Though there is a plan to automate the process of writing the programs loaded into the

    control unit, the first programs were written manually. A spreadsheet was used to select the

    values of every control signal at each time step. The spreadsheet was set up to automatically

    insert the signal values into the proper bit position (see Appendix A for example). The integer

    equivalent of the binary number is loaded into the control unit memory at compilation time.

    In the current design, the program cannot be changed at run time.

    To understand the reasoning behind the arithmetic logic in the PEs, it is necessary to

    understand the process for using a residual and an update to calculate results. It is possible

    to use a residual-update method, described in Chapter 2, to find the solution because the CA

    solutions are attained by iterative improvement. This method has the advantage of using

    low precision arithmetic for most calculations. In describing this method, n is the number

    of bits used in high precision calculations and k is the number of bits used in low precision

    calculations.

    This method works by first calculating the residual, or error in the equation that is being

    solved. The residual calculation must be performed in n bits for every cell in the model. The

    most significant k bits of the residuals are then extracted and stored. The k bits must be

    taken from the same position in every residual. The largest element in the residual vector

    dictates which bits are selected. The update equation is then calculated in k bits, and the

    k-bit version of the residual is used in place of the Fc and Mc in Equation 3.2. This k-bit

    update is performed until the results converge. After the k-bit updates are found, they are

    added into an accumulated version of the variables at the same offset as the bits that were

    taken out of the residual. The cycle repeats using the accumulated version of the variables

    in the residual equation. These iterations are repeated until the accumulated versions of the

    variables converge.
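    One outer cycle of this scheme can be modeled as in the sketch below (hypothetical precisions and example residual values; integers stand in for the hardware registers). The key points are that the shift amount is shared by every residual and that the converged update is added back at the same offset:

        N_BITS, K_BITS = 32, 8   # assumed n and k for illustration

        def extract_k_bits(residuals, k=K_BITS):
            # take the same k-bit window from every n-bit residual;
            # the window position is set by the most significant '1'
            # among all of the residuals
            msb = max(abs(r) for r in residuals).bit_length()
            shift = max(msb - k, 0)
            return [r >> shift for r in residuals], shift

        residuals = [4096, -300, 72]        # example n-bit residuals
        small, shift = extract_k_bits(residuals)
        # ... the k-bit update iterations would run here on `small` ...
        updates = small                      # stand-in for converged updates
        accum = [a + (u << shift) for a, u in zip([0, 0, 0], updates)]
        print(small, shift, accum)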

    The flow chart in Figure 3.7 shows an analysis cycle using this method. This method

    is effective because the majority of the time is spent in calculating the update in k bits.

    More parallel arithmetic units can be used to speed up the calculations because the update calculation is performed in k bits.

    Figure 3.7: Analysis cycle flow and precision for each operation.

    There are three main parts to the PEs used in the Program Based design (as shown in

    Figure 3.8):

    - Multiply Accumulator: calculates the residual in n bits

    - Shift Unit: extracts k bits from n-bit residual and adds the update into accumulated

    variables

    - Matrix Accumulator: calculates cell updates in k bits

    The Multiply Accumulator is simply a multiplication unit and an adder with the registered

    version of its output connected to one of its inputs (see Figure 3.9). There is only one

    multiply accumulator per PE because it uses n-bit arithmetic, and these n-bit precision

    units are large. The Multiply Accumulator takes advantage of the built-in 18x18 multiplier

    units on the Virtex-II FPGAs to save resources.

    The minimization of the hardware results in multiple clock cycles being needed to compute residual values. The latency through each unit is one clock cycle, so the pipeline is two stages.

    Figure 3.8: Computational unit for Program Based design.

    Figure 3.9: Multiply accumulator used in computational unit.

    For the equation proposed in the beginning of this section, it takes 16 clock cycles to calculate

    the residuals for one cell. The expense of the residual calculation is tolerable because many

    update calculations are performed between residual calculations.

    After the residual is calculated it must be converted to a k-bit number. During the

    residual calculation, the most significant bit of the largest residual value is found. There is

    a mechanism in each PE that stores the absolute value of the largest residual calculated.

    This value is passed along the return data chain until it arrives at the control unit. Each

    PE performs a logical OR on the value passed to it and the largest value it has calculated.


    This process destroys the actual value of the largest residual, but the number passed to the

    control unit shows the position of the most significant ‘1’. This position is used to determine

    which bits of the residual are stored for the update phase.

    Figure 3.10: MSB data return chain, used for determining the most significant ‘1’ of residuals.
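    The effect of the chain is easy to model in software (a sketch): OR-ing the per-PE maxima destroys their exact values but preserves the position of the highest set bit, which is the only information the control unit needs:

        def msb_position(per_pe_maxima):
            # each link ORs in its PE's largest absolute residual;
            # only the top '1' position survives the chain
            combined = 0
            for m in per_pe_maxima:
                combined |= abs(m)
            return combined.bit_length() - 1

        print(msb_position([0b1010, 0b0111, 0b0001]))   # 3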

    The logic to extract the bits is based on a multiplexer, a register, and a right shifter. The

    multiplexer selects between an input from memory or a right shift version of the value stored

    in the register. The value output from the multiplexer is loaded into the register. During the

    bit shifting phase, the register is first loaded with the n-bit value. The right shifted value

    is then selected from the multiplexer. The value is looped through the right shifter until the

    desired bits are in the lowest position. The number of clock cycles required is dependent on

    the number of bit positions the value needs to be shifted.

    Figure 3.11: Unit for shifting the precision of intermediate results.


    The adder, after the shift logic, is used during the addition phase at the end of the outer

    analysis cycle. During the addition phase, the update is loaded into the highest bits and

    shifted to the correct position. It is then added to the previous value, which is read from

    memory. A signal from the control unit selects which value is output from the unit.

    The final piece of the PE for the Program Based design is the Matrix Accumulator. The

    Matrix Accumulator is similar to the Multiply Accumulator unit, except the arithmetic is

    performed in k bits and more hardware is used to speed up calculations. The unit is designed

    specifically to be able to multiply a 2x2 matrix by a 2x1 matrix. For example, Figure 3.12

    shows the circuit calculating the equation:

    \begin{bmatrix} \bar{v} \\ \bar{\Theta} \end{bmatrix} =
    \begin{bmatrix} C_0 & C_1 \\ C_2 & C_3 \end{bmatrix}
    \begin{bmatrix} v \\ \Theta \end{bmatrix} \qquad (3.3)

    Figure 3.12: Matrix accumulator used for analysis updates.

    The multiplier has a three clock cycle latency and is fully pipelined. The entire unit has a latency of five clock cycles. The update for each cell using the matrix version of the beam

    equations, described earlier in this section, takes 9 clock cycles. However, when the pipeline


    is filled, the circuit can produce an update every five clock cycles, and this circuit calculates

    the update for both analysis variables simultaneously.
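    Functionally, each term handled by the Matrix Accumulator is a 2x2-by-2x1 product, and an update is a sum of such terms. In software terms (a sketch of the arithmetic, not the pipeline):

        def mat2_vec2(C, x):
            # 2x2 matrix times 2x1 vector, the operation performed in
            # k-bit arithmetic for each term of the update equation
            return [C[0][0] * x[0] + C[0][1] * x[1],
                    C[1][0] * x[0] + C[1][1] * x[1]]

        def accumulate(terms):
            # sum several matrix-vector terms, e.g. those of Eq. 3.2
            total = [0, 0]
            for C, x in terms:
                p = mat2_vec2(C, x)
                total = [total[0] + p[0], total[1] + p[1]]
            return total

        print(mat2_vec2([[1, 2], [3, 4]], [5, 6]))   # [17, 39]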

    Every PE can select to have the input to Port B of the memory connected directly to the

    output of Port A of its left or right neighbor. In this way, PEs transfer data about the cells

    on the edge of the section of the CA model for which each PE is responsible. This system is also

    used to upload and download data from the FPGA. The PE that calculates the values for

    the cells on the left end of the model can read data from the PCI bus, while the PE that

    calculates the values for cells on the right end of the model can write data to the PCI bus.

Figure 3.13: Data flow for uploading and downloading data to the FPGAs.

To upload coefficients and external forces, as well as to initialize variable values, the host

computer begins by writing the data intended for the rightmost PE into the memory of the

leftmost PE. The data is then shifted through all the PEs until it reaches its proper place,

while new data is shifted into the leftmost PE. Downloading the results is a similar process:

the data is shifted right and read off the rightmost PE. An external clock is

    used to keep data transfers synchronized.
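The upload path therefore behaves like a shift register of PE memories. The following C++ sketch models that behavior (the container names are illustrative; in hardware the shifting happens through the BlockRAM ports):

    #include <cstdint>
    #include <deque>
    #include <vector>

    // Sketch of the upload path: the word destined for the rightmost PE
    // enters first at the leftmost PE, and every transfer step shifts each
    // PE's staged word one position to the right.
    std::vector<uint32_t> uploadThroughChain(std::deque<uint32_t> hostWords, int numPEs) {
        std::vector<uint32_t> peMemory(numPEs, 0);
        for (int step = 0; step < numPEs; ++step) {
            for (int pe = numPEs - 1; pe > 0; --pe)
                peMemory[pe] = peMemory[pe - 1]; // Port B reads the left neighbor's Port A
            peMemory[0] = hostWords.front();     // leftmost PE reads from the PCI bus
            hostWords.pop_front();
        }
        return peMemory; // the first word pushed has shifted all the way to the rightmost PE
    }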

    Although reconfiguration is not part of the analysis cycle, the implementation of the

    system for performing design optimization will use reconfiguration in a number of ways.

    Each analysis model has fixed connections for communicating among PEs. It is possible

    to pass data through intermediate PEs to transfer data between PEs that are not directly

    connected. However, to achieve maximum efficiency, communication should be done over

    direct connections when possible. The design system will have a number of different analysis

models, each with a different communication pattern. When the user specifies the initial

problem, the system will select the bitstream for the most appropriate model and load it into

    the FPGAs.

    Design optimization may be performed in a number of different ways. The first possible

    technique is to use reconfiguration. A bitstream developed to perform design optimization

    could be loaded on the FPGAs using partial reconfiguration. The data would be passed

    between analysis and design models through the BlockRAMs, like the method proposed

    for the Problem Specific model. Another technique would be to use the analysis model

    to perform the design calculations. Design would require new coefficients, which could be

loaded into the FPGA using the uploading and downloading mechanisms described earlier. The

    disadvantage of this method is that the analysis design might not be capable of performing

    all of the operations needed, or the operations may be very inefficient. The final possibility

    is to use a Virtex-II Pro FPGA, which contains built-in PowerPC processors. These internal

    processors could be used to run a program to calculate the new design values.

Chapter 4

    Results

    4.1 Problem Formulation

The results in this section are based on the analysis of a CA model of a one-dimensional

beam. The model is formulated from work by researchers at Virginia Tech

[?]. The beam is divided into cells that have two degrees of freedom, vertical displacement

(w) and rotation (θ). Each cell also has a separate vertical thickness, which is the design

    variable. The thickness of the beam is specified at the middle of each cell, and then linearly

    interpolated in between the specified points (see Figure 4.1). Cells in the model are evenly

    distributed along the beam.

    There are a number of possible configurations for each cell. The cell can have a fixed

    displacement, a fixed rotation, a fixed displacement and rotation, or it can be free in dis-

    placement and rotation. External forces can be applied to any cell. The forces can be in

    the form of a vertical force (F) or a bending moment (M). These different configurations are

    represented by changing the coefficients in the equation that is solved by each model. Using

    these available cell configurations, many classical static beam problems can be solved.

Figure 4.1: Diagram of the CA model for performing analysis on a beam.

Figure 4.2: Beam analysis problem modeled on the configurable computer.

The CA model, shown in Figure 4.2, contained 20 cells and was run on both the

Problem Specific and Program Based designs. Twenty cells were chosen so the model could be

    quickly simulated. The first cell in the model is a dummy cell, for which no computations

    are performed. The cells (1 and 19) on the ends of the beam have fixed displacement and

    rotation. Cell 14 has a fixed vertical displacement. All other cells in the model are free

    in displacement and rotation. There is a vertical force pushing up on cell 9. The force is

    scaled to produce a maximum displacement of slightly less than 127, so the result can be

    represented in 8 bits.
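For illustration, the configuration of this problem could be captured as follows. This is a hypothetical encoding used only to restate the setup; in the actual system, the boundary conditions and forces are expressed through the equation coefficients loaded for each cell:

    #include <array>

    // Hypothetical encoding of the 20-cell beam problem of Figure 4.2.
    // In the real system these conditions are expressed through equation
    // coefficients; explicit flags are used here only to restate the setup.
    struct Cell {
        bool fixedDisplacement = false;
        bool fixedRotation     = false;
        double verticalForce   = 0.0; // F, scaled so max displacement stays below 127
    };

    std::array<Cell, 20> makeBeamProblem(double scaledForce) {
        std::array<Cell, 20> cells{};           // cell 0 is the dummy cell
        cells[1]  = {true, true, 0.0};          // left end: displacement and rotation fixed
        cells[19] = {true, true, 0.0};          // right end: displacement and rotation fixed
        cells[14].fixedDisplacement = true;     // fixed vertical displacement at cell 14
        cells[9].verticalForce = scaledForce;   // upward vertical force at cell 9
        return cells;
    }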


    4.2 Problem Specific Design Results

    The designs presented in the implementation section were intentionally developed indepen-

dently of any fixed precision for the results. There is a trade-off between the number of bits of

    the solution that will be calculated and the number of cells that can be represented in the

    system. In addition, larger precision results in lower maximum clock frequency and/or an

    increased pipeline length.

    Another factor in changing the precision is memory access. The BlockRAMs have pro-

    grammable port widths that can accommodate some changes in precision. The BlockRAMs’

two ports can each handle up to 36 bits and can independently read or write. Once this

transfer limit is exceeded, accessing the needed data takes multiple clock cycles,

or more memories must be used in the design.
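As a rough illustration of this limit (a sketch that ignores the parity-bit packing granularity of the Virtex-II BlockRAM port widths), the number of packed values available per port per clock falls as precision grows:

    // Values readable from one BlockRAM port per clock, given the 36-bit
    // port limit quoted above (a rough sketch that ignores parity-bit
    // packing granularity on the Virtex-II).
    constexpr int valuesPerAccess(int portWidthBits, int precisionBits) {
        return portWidthBits / precisionBits;
    }
    static_assert(valuesPerAccess(36, 8)  == 4, "8-bit values: four per access");
    static_assert(valuesPerAccess(36, 18) == 2, "18-bit values: two per access");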

    During development, multiple versions of the Problem Specific design were built that used

    different precision for calculations. Figure 4.3 shows the growth of the size of a PE as the

    precision of the calculations is increased. The graph shows that size grows rapidly as the

    number of bits is increased. This makes it very important to use the least precision possible

    for calculations.

    From this data, and based on the beam problem being solved, 8 bits of precision was

chosen as likely to be the most effective. In most of the following analysis, an 8-bit model

was studied for the Problem Specific design. However, this precision could change based on

    the type of problem being solved and the number of cells in the model. In this respect,

    the Problem Specific design would have more flexibility with regard to precision than the

    Program Based design because the Problem Specific design is custom-made for each problem.

Once 8 bits was selected for the precision, the number of PEs needed to be determined. The

    maximum number of PEs that can fit on an FPGA is limited by the programmable logic

and routing resources on the chip. However, when chip usage gets high, the routing becomes

    inefficient and the maximum frequency at which the circuit can be clocked drops rapidly.


Figure 4.3: Precision of PE vs. percent utilization of FPGA (based on Virtex-II 4000) for the Problem Specific design.


Figure 4.4: Percent utilization of FPGA (based on Virtex-II 4000) and maximum clock frequency (MHz) vs. number of PEs for the Problem Specific design.

    Figure 4.4 shows the chip utilization and the maximum clock frequency versus the number of

    PEs. The number of logical cells that can be represented increases linearly with the number

of PEs. However, the maximum clock frequency decreases gradually as the number of PEs

    increases, then drops quickly after the 35th PE.

    The Problem Specific and Program Based designs vary widely in the number of cells they

    can represent and the precision of the result. In order to compare these differing designs,

    the maximum number of cell updates per second is used as a metric. This is also used as the

    metric to determine the speed-up over a program running on a general purpose processor.

    The number of cell updates per second for the Problem Specific design is simply the


Figure 4.5: Cell updates per second (millions) vs. number of PEs for the Problem Specific design.

    number of PEs multiplied by the maximum clock frequency. This is because in the Problem

    Specific implementation, each PE produces a result every clock cycle during analysis. Fig-

    ure 4.5 shows a peak in the maximum number of cell updates when 35 PEs are on the chip.

    With 35 PEs the 8-bit design has a maximum clock frequency of 64.5 MHz. In comparison,

    a 12-bit model with 35 PEs cannot fit on the Virtex-II 4000 FPGA.
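As a worked example of this metric, 35 PEs each producing one update per clock at roughly 65 MHz give approximately 35 × 65 × 10^6 ≈ 2.3 billion cell updates per second, which corresponds to the peak visible in Figure 4.5.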

    There are additional factors that limit the cell updates per second that can actually be

    performed on the system (see Table 4.1). The most costly factors are the reconfiguration

    and readback times on the FPGA and the time it takes the host to compute the coefficients

for a given design. These times are important because the communication with the host is done through reconfiguration and readback.

                              Time (ms)
Operation             1 PE    1 FPGA    DINI Board
Reconfiguration       N/A     1190      4760
Readback              (not yet implemented in the DINI API)
Host computations     1.11    39        156

Table 4.1: Times for operations associated with the Problem Specific analysis cycle on the DINI board.

    These results are dependent on the design being able to accurately produce analysis

    results. The problem described earlier in this chapter in the Problem Formulation section

    (see Figure 4.2) was modeled on the Problem Specific design. The force was scaled so the

result could be represented in 8 bits. The actual results were calculated using a

    C++ program running on a PC which used floating-point arithmetic for all calculations. The

    results show (see Figure 4.6) that the system was able to produce results that were similar,

but not exactly correct. The mean error between the actual results and the results

attained from the Problem Specific model was 38.4% for displacement and 41.4% for

rotation. This large error is due to the rounding that occurs because fixed-point

arithmetic is used. If better accuracy were needed, the results could be improved by

using reconfiguration and the residual-update method for iterative improvement.
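The accuracy comparison amounts to a mean relative error between the fixed-point results read back from the FPGA and the floating-point reference. A minimal C++ sketch (the names are illustrative, and skipping zero-valued reference cells is an assumption of this sketch):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Sketch of the accuracy comparison: mean relative error between the
    // fixed-point results read back from the FPGA and the floating-point
    // reference computed on the host.
    double meanPercentError(const std::vector<double>& reference,
                            const std::vector<double>& fpgaResult) {
        double sum = 0.0;
        std::size_t n = 0;
        for (std::size_t i = 0; i < reference.size(); ++i) {
            if (reference[i] == 0.0) continue; // skip fixed (zero) cells
            sum += std::fabs((fpgaResult[i] - reference[i]) / reference[i]);
            ++n;
        }
        return n ? 100.0 * sum / n : 0.0;      // mean error as a percentage
    }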

    4.3 Program Based Design Results

The Program Based design has much the same sensitivity to precision as the Problem Specific

design, but the Program Based design does not have quite as much flexibility in terms of

    precision. Because the full precision calculations of the residual rely on the built-in 18x18

    multipliers, it is difficult to increase the full precision result to more than 18 bits. However,


Figure 4.6: Actual results and results from the Problem Specific design for the beam problem (displacement, mm, vs. position, m).


Figure 4.7: Precision of PE vs. percent utilization of FPGA (based on Virtex-II 4000) for the Program Based design.

    there is some flexibility in the precision of the update calculations. Figure 4.7 shows how

    the size of the PE grows as the precision of the update arithmetic is increased. The growth

    is similar to that of the Problem Specific model.

    Based on this data, 6 bits was selected for the precision of the update calculations. The

6-bit precision design attains the maximum cell updates per second with 60 PEs. The maximum

clock frequency is 94.8 MHz. If the same design used 8 bits of precision, the maximum clock

frequency would be 88.1 MHz.

The precision selected for the update is also a trade-off between having smaller update units

    and having to perform the outer iteration more often. When the precision of the update

    calculation is larger, more inner iterations are performed before new residuals need to be

    calculated. Using a smaller precision has the advantage of being able to devote more, smaller

    units to calculating the update. The number of clock cycles needed for each phase of the

analysis cycle is shown in Table 4.2.

Analysis Cycle Phase    Clock Cycles
Residual Calc           550
Shift Residual          330-990
Cell Update             190 × Inner Iterations
Add                     410-1135

Table 4.2: Clock cycles for the different phases of the residual-update analysis cycle.

    As the number of inner iterations increases during each analysis cycle, the Program Based

    model becomes more efficient. Figure 4.8 shows the increase in the efficiency of the design

    versus the number of inner update iterations for each residual calculation. For this graph,

the average number of cycles for the Shifting and Adding phases of the analysis cycle was

used. When the number of inner iterations is below 10, more than half the time is spent

calculating the residual or shifting the results. However, the model rapidly becomes more

    efficient. With 35 inner iterations per analysis cycle this design achieves 75% efficiency, and

    at 90 inner iterations the efficiency is 90%. The number of iterations needed will depend on

    the type of problem and the number of cells in the model.
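Using the cycle counts of Table 4.2 with mid-range values for the Shift and Add phases, the efficiency curve can be approximated as the fraction of cycles spent in the update phase. The sketch below reproduces the quoted figures to within a few percent; the small differences presumably come from overheads not broken out in Table 4.2:

    // Approximate efficiency of the residual-update cycle, using the
    // cycle counts of Table 4.2 with mid-range Shift (660) and Add (772.5)
    // costs. Efficiency is the fraction of cycles spent computing updates.
    double analysisEfficiency(int innerIterations) {
        const double residual = 550.0;
        const double shift    = (330.0 + 990.0) / 2.0;   // midpoint of 330-990
        const double add      = (410.0 + 1135.0) / 2.0;  // midpoint of 410-1135
        const double update   = 190.0 * innerIterations;
        return 100.0 * update / (residual + shift + update + add);
    }
    // analysisEfficiency(35) is roughly 77% and analysisEfficiency(90) is
    // roughly 90%, close to the figures quoted above.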

    The total cell updates per second that can be performed with the whole chip is dependent

    on the number of PEs on the FPGA. The maximum number of cell updates per second for the

    Program Based design is achieved slightly before the chip resources are exhausted because

    of routing congestion (see Figure 4.9). Figure 4.10 shows how the maximum number of cell

    updates per second rises then deteriorates as the number of PEs is increased.

    The Program Based model depends on communication with a host through the PCI bus

    in the current design. Before the calculation can begin, the coefficients for a specific design

    need to be loaded into each of the PEs. Then, after the analysis is complete, the results

need to be downloaded back to the host. The transfer times for these operations are shown

in Table 4.3. This communication is synchronized by an external clock supplied by the host.

There is additional overhead due to the computation of analysis coefficients on the host for

each design.

Figure 4.8: Efficiency (percent of time spent calculating updates) vs. number of inner iterations per analysis cycle.


Figure 4.9: Percent utilization of FPGA (based on Virtex-II 4000) and maximum clock frequency (MHz) vs. number of PEs for the Program Based design.


Figure 4.10: Cell updates per second (millions) vs. number of PEs for the Program Based design.


                                  Time (ms)
Operation                 1 PE     1 FPGA    DINI Board
Uploading Coefficients    2.10     114       228
Downloading Results       0.311    18.5      37
Host Computations         0.360    21.5      86.0

Table 4.3: Times for operations associated with analysis on the Program Based design.

    The problem shown in Figure 4.2 was modeled on the Program Based design. This was

the same model run on the Problem Specific design, except the force was scaled up so that

the maximum result was closer to an 18-bit number. The results were again compared to a

C++ simulation that used floating-point for all calculations. Figure 4.11 shows that the results

    attained from the Program Based model were very close to the actual results. The mean of

    the percent error between the Program Based model data and the actual results was 0.099%

for displacement and 0.118% for rotation.

    The results for the Program Based model were much more accurate than the results from

    the Problem Specific model because the Program Based model uses 18 bits of precision

    during the residual calculations. The speed of the Program Based model comes from the use

    of only 6 bits for the update calculations. The external force was scaled so the maximum

    displacement would be close to 18 bits. After they were computed, the results were scaled

    down to match the original problem.


Figure 4.11: Actual results and results from the Program Based design for the beam problem (displacement, mm, vs. position, m).



    4.4 Comparison of Designs

    Most of the results presented in the earlier sections were for systems using one FPGA.

    However, the DINI board intended for this system has 4 FPGAs. The total computing

    power increases linearly because all the FPGAs can run in parallel. Table 4.4 shows the

maximum number of updates for each design at each level of the system.

                      Maximum Cell Updates per Second (millions)
Design                1 PE     1 FPGA    DINI Board
Problem Specific      65.1     2279      9116
Program Based         18.9     1137      4548

Table 4.4: Maximum cell updates per second for both implementations.

    To compare these FPGA based designs to conventional methods, a C++ program was

    written to calculate the results. To make the comparison as fair as possible, the program

    uses integer arithmetic instead of slower floating-point arithmetic. Integer computations are

closer to the fixed-point math used by the FPGA designs. The program was compiled using

    GCC with optimization enabled and executed on a PC with a 1.7 GHz processor and 1 GB

of RAM running Debian Linux. Table 4.5 shows the speed-up attained by the FPGA designs

over the general purpose processor version.

Design                Maximum Cell Updates per Second (millions)    Speed-up
PC                    48.9                                          -
Problem Specific      9116                                          186.4
Program Based         4548                                          93.0

Table 4.5: Maximum cell updates per second and speed-up for both implementations compared to the PC.
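For reference, the comparison program's inner loop has the general shape of an integer update sweep over the cells. The sketch below is illustrative only: the coefficient layout and the 8-bit fixed-point scaling shift are assumptions, not the actual benchmark code:

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sketch of the PC comparison program's inner loop: an integer update
    // sweep over the cells, repeated until convergence. Boundary cells are
    // left at zero, standing in for fixed ends.
    struct CellState { int32_t w; int32_t theta; };

    void updateSweep(std::vector<CellState>& cells,
                     const std::vector<std::array<int32_t, 4>>& coeff) {
        std::vector<CellState> next(cells.size()); // value-initialized to zero
        for (std::size_t i = 1; i + 1 < cells.size(); ++i) {
            // Each new value is an integer combination of the neighbors'
            // states, rescaled by a fixed-point shift.
            next[i].w     = (coeff[i][0] * cells[i - 1].w
                           + coeff[i][1] * cells[i + 1].w) >> 8;
            next[i].theta = (coeff[i][2] * cells[i - 1].theta
                           + coeff[i][3] * cells[i + 1].theta) >> 8;
        }
        cells = next;
    }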

Chapter 5

    Conclusions

    5.1 Summary

    Cellular Automata (CA) theory has been studied for decades, with most of the work done

    on modeling natural systems. Recently, the use of CA theory has been extended to provide a

    system for structural analysis and design optimization. This work has proved to be success-

    ful, but the calculations are very slow on traditional general purpose processors. The parallel

    nature of these CA models makes them a good candidate for implementation on a reconfig-

    urable computer. The work presented in this thesis shows the initial steps toward making

    an automated tool to perform structural design optimization accelerated by a reconfigurable

    computer.

    The contribution of this thesis was to design and implement two models for the analysis

    phase of the CA structural design optimization cycle. Both designs take advantage of the

    parallel nature of cellular automata by using a distributed array of processing elements.

    For the Problem Specific implementation, these processing elements are customized to each

    problem. The Program Based implementation has a more flexible processing unit that

    is controlled by a program designed to simulate a specific cellular automata model. The


    Program Based implementation also has the built-in capability to use a residual-update

    method to accelerate calculations and improve accuracy.

    5.2 Results

    The results show the Problem Specific design and the Program Based design were able to

    generate cell updates at the rate of 9.12 and 4.55 billion per second, respectively. Though the

    Problem Specific design proved to be able to generate updates more rapidly, this increase

    in speed came at the expense of precision and flexibility. The Program Based model’s

    competitive speed, improved accuracy, and ability to handle a range of update rules make it

    the architecture that provides the most potential for an automated system.

Both hardware implementations of the CA model for structural analysis were very

successful in terms of performance. When compared to a 1.7 GHz Pentium 4 processor, the

    Problem Specific design proved to be 186 times faster. The Program Based design, which

    was slightly slower, was still 93 times faster than the general purpose processor version.

    These speed-ups are a step towards making a CA system for structural design optimization

    that significantly outperforms traditional methods.

    5.3 Future Work

    There are a number of interesting areas that need to be studied in order to design an

automated tool for performing structural design optimization using CA. The most immediate

    may be the need for a translator and compiler for the programs used by the Program Based

    design. For the work in this thesis, the programs were all written by hand. This process was

    very difficult and time consuming. A compiler is needed to take a higher level abstraction and

    generate machine level instructions. The end product should be a compiler that could take a

problem specified by a design engineer who has no knowledge of the hardware implementation


    and produce the necessary instructions.

    Additionally, an efficient method for design calculations must be implemented. In the

    current system, the results are downloaded to a host computer where design calculations can

    be performed. However, this is an inefficient technique. A number of possibilities exist for

    executing the design calculations on the board, such as using partial reconfiguration or on

    board processors. These possibilities need to be investigated to identify the best method.

    Another area that needs work is implementing multi-grid on the system. The multi-grid

    approach to these CA problems would be to calculate results while varying the resolution of

    the grids. In other words, the number of cells representing the system would increase and

    decrease based on certain algorithms. Multi-grid could also be used to blend analysis and

    design steps into a single cycle. These methods have the potential for huge reductions in the

    number of calculations needed.

Appendix A

    This appendix gives an example of how the programs for the Program Based model are

    written.

    The Processing Elements (PEs) in the Program Based model were developed to perform

    high and low precision arithmetic and convert between the two forms. The control logic

    needed to simulate a cellular automata model is complex, so programs are used to set the

    control signals. The program is stored in an internal BlockRAM contained in the Control

    Unit(CU). As the PEs were designed, the control signals for each unit were assigned to

    particular bits of the BlockRAM in the CU. The position of the bits and a short description

    of their function was recorded in a Excel spreadsheet. Figure A.1 shows a screenshot of this

    spreadsheet.

    Each phase of the analysis cycle was written as a separate program. There are control

    signals for each functional unit of the PE, but only one functional unit is in use during

    each phase. The first step in writing a program was to identify the signals needed for the

    particular phase. The pertinent signals were placed across the top of a spreadsheet and the

value of the signal was specified below. Each horizontal line represents a clock cycle step.

    Figure A.2 shows an example of a program. This particular program calculates the update

    during the inner iteration of the analysis cycle.

    The signals on the left are used to determine the order in which instructions are executed.


    Figure A.1: Spreadsheet with position of control signals and short description.

The program counter will increment by one unless a loop is specified. The signals on the right

    control the PEs. Signals with only two options are specified as Y or N. Signals with more

    options are specified as a number or letter from a certain set.

    There is a second spreadsheet which determines the numerical value for each signal. The

    values of the signals are then converted into an intermediate form. The intermediate form is

    the numerical value of the signal multiplied by two to the power of its bit position. The final

    value is the sum of all the intermediate values (see Figure A.3). This is the number that is

    loaded into the CU BlockRAM. These final values are then put in a form that can be read

    into memory (see Figure A.4).
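The conversion from spreadsheet rows to memory words is thus a weighted sum of signal values. A C++ sketch of the packing step (the field names are illustrative):

    #include <cstdint>
    #include <vector>

    // Sketch of the spreadsheet-to-memory conversion: each control
    // signal's numerical value is shifted to its assigned bit position
    // (value * 2^position) and the intermediate values are summed into
    // one instruction word for the CU BlockRAM.
    struct SignalField {
        int bitPosition;  // position recorded in the signal spreadsheet
        uint64_t value;   // value chosen for this clock-cycle step
    };

    uint64_t packInstruction(const std::vector<SignalField>& signals) {
        uint64_t word = 0;
        for (const SignalField& s : signals)
            word += s.value << s.bitPosition; // intermediate form: value * 2^bitPosition
        return word;                          // final value loaded into the CU BlockRAM
    }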


Figure A.2: Spreadsheet containing the update program.

Figure A.3: Spreadsheet converting signals to the form used by the Program Based model.


    Figure A.4: Spreadsheet containing the data values in a form that can be loaded into memory.

Vita

Thomas Hartka was born in June 1980 in Baltimore, Maryland. He attended Archbishop

High School in Severn, Maryland. Thomas enrolled in the College of Engineering at

Virginia Tech in the fall of 1998. He graduated cum laude with a Bachelor of Science in

Computer Engineering. Thomas chose to remain at Virginia Tech to pursue his Master's

degree. He became involved in research at the Virginia Tech Configurable Computing Lab. After

    graduating, Thomas will attend Johns Hopkins’ Post-Baccalaureate Premedical Program.
