

Hardware Acceleration for Full-Wave Analysis of Integrated Circuits

John R. Humphrey, Michael R. Bodnar, Eric J. Kelmelis, James P. Durbano Accelerated Computing Division

EM Photonics, Inc. 51 E. Main St. Suite 203, Newark, DE, 19711

USA

Abstract: - As circuit speeds increase and sizes decrease, lumped-element and quasi-static methods become inaccurate, and rigorous analysis via the Method of Moments is required to achieve accurate results. Unfortunately, this requires the solution of a large, dense matrix. Despite algorithmic advances, there is still a need to solve the system in less time. To this end, we have developed a prototype solver that implements a state-of-the-art algorithm in an accelerated hardware environment. In this paper, we examine our progress to date, which consists of the development of a matrix-vector-multiply unit, the key component of the design.

Key-Words: - Integrated circuits, full-wave analysis, PILOT, method of moments, FPGA, hardware acceleration

1 Introduction
As circuit speeds increase and sizes decrease, lumped-element and quasi-static analysis becomes inaccurate, and rigorous full-wave electromagnetic (EM) simulation is required to obtain correct results. When analyzing integrated circuits, the Method of Moments (MOM) is particularly useful because Green's theorem allows the volume integrals to be reduced to surface integrals. The problem size is therefore considerably smaller than it would be with competing volume-based techniques such as the Finite-Difference Time-Domain (FDTD) method or the Finite Element Method (FEM). MOM simulations generate dense systems of linear equations that must be solved. Many options exist for evaluating the resulting matrices. Direct and iterative solutions are both simple to implement but inefficient, requiring at least O(N²) time. More modern solutions, such as the Fast Multipole Method (FMM), have brought this to a nearly ideal O(N log N) time, but are of limited use in the full-wave analysis of circuits due to the complexities of layered Green's functions. To overcome these limitations, the current state-of-the-art approach for full-wave simulation of electronic circuits is the pre-determined interaction-list supported oct-trees (PILOT) method, which maintains O(N log N) solution time while remaining independent of the Green's function. The ability to solve these complex problems in O(N log N) time signals a maturity of the algorithms in this field. Despite this success, processing can still take many hours to days for problems of even moderate size. Further reducing calculation time therefore requires a better implementation of a proven algorithm. To this end, we have developed a hardware-accelerated version of PILOT using field-programmable gate array (FPGA) technology. In this paper, we describe the PILOT algorithm, our approach for accelerating computationally intense algorithms with reconfigurable hardware, and our success in implementing a prototype PILOT solver.

2 The PILOT Algorithm
The PILOT algorithm is an extension of the IES3 (Integral Equation Solver for 3-dimensional problems with arbitrary kernels) method, a kernel-independent solver developed at Bell Labs. IES3 exploits the low-rank properties of MOM submatrices. To clarify this point, consider an M×N submatrix of the MOM coefficient matrix. Physically, this submatrix represents the interaction between N sources and M observers. When the sources and observers are well separated, the interaction has few degrees of freedom; mathematically, this means the M×N submatrix is low rank. In this case, a low-rank approximation can be obtained via Singular Value Decomposition (SVD) or QR decomposition. The SVD technique is conceptually straightforward, but QR decomposition leads to better results, so it is typically the compression method used [1].
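As an illustration of this low-rank property, consider the following sketch (plain Python, not the PILOT code; the 1/r kernel, point positions, and drop tolerance are invented for the example). The interaction matrix between two well-separated point clusters turns out to have a numerical rank far below its dimension, which is exactly what makes SVD/QR compression effective:

```python
# Illustrative sketch only -- not the PILOT implementation.  The kernel
# (1/r), the point positions, and the tolerance are invented for this
# example.  It shows that the interaction matrix between two
# well-separated clusters has a numerical rank far below its dimension.

def interaction_matrix(observers, sources):
    """A[i][j] = 1/|x_i - y_j|, a simple static-kernel stand-in."""
    return [[1.0 / abs(x - y) for y in sources] for x in observers]

def numerical_rank(A, tol=1e-6):
    """Count columns with a non-negligible component orthogonal to the
    previous columns (modified Gram-Schmidt with a drop tolerance)."""
    m = len(A)
    cols = [[A[i][j] for i in range(m)] for j in range(len(A[0]))]
    norm0 = max(sum(c * c for c in col) ** 0.5 for col in cols)
    basis = []
    for col in cols:
        v = col[:]
        for q in basis:
            dot = sum(vi * qi for vi, qi in zip(v, q))
            v = [vi - dot * qi for vi, qi in zip(v, q)]
        nv = sum(vi * vi for vi in v) ** 0.5
        if nv > tol * norm0:
            basis.append([vi / nv for vi in v])
    return len(basis)

# 20 observers near x = 0 and 20 sources near x = 100: well separated.
observers = [i * 0.05 for i in range(20)]
sources = [100.0 + i * 0.05 for i in range(20)]
A = interaction_matrix(observers, sources)
print(len(A), "x", len(A[0]), "matrix, numerical rank", numerical_rank(A))
```

A production code would use a pivoted QR from a linear-algebra library rather than this hand-rolled Gram-Schmidt, but the effect is the same: only a handful of columns carry independent information.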

Proceedings of the 10th WSEAS International Conference on APPLIED MATHEMATICS, Dallas, Texas, USA, November 1-3, 2006


The most difficult issue in submatrix compression is determining which regions are compressible. IES3 uses a two-fold approach. First, the nodes are numbered such that the difference between node numbers is representative of the physical distance between them; an optimal ordering is nearly impossible to obtain, which is what causes sub-optimal compression of the coefficient matrix. Second, the regions on which compression is attempted are identified by a statistical algorithm. If the statistical algorithm is too conservative, compression opportunities are missed; if it is too liberal, time is wasted attempting to compress incompressible regions. The main advantage of IES3 is that, once the compressed representation is obtained, solving the system of equations is extremely efficient, largely because the solver works with a much smaller data set. For instance, compared to FMM (the FASTCAP implementation), the IES3 data set is reduced by 80%. This directly translates into fewer operations per matrix-vector multiplication. However, IES3 is encumbered by an exceptionally long problem setup time: unless a large number of right-hand sides must be solved, the advantage of working with a smaller data set only slightly outweighs the time required for compression.

IES3 offers several advantages over FMM. The key benefits are:
• Fast matrix-vector multiply time
• Smaller data set
• Easily applied to any kernel
• Easy extension to multilayer dielectric structures
• Simple, regular operations to perform the MVM, with no transcendental function evaluation

The only disadvantage of IES3 compared to FMM is the long setup time. The PILOT algorithm overcomes this limitation by applying an FMM-like setup with the goal of attaining a QR-compressed coefficient matrix. Essentially, a multilevel spatial oct-tree is applied to the problem, grouping the unknowns by their locations in space instead of by element number [2]. The result of the PILOT compression algorithm is a better-compressed coefficient matrix in a much shorter time. The primary reason is that the algorithm identifies compressible regions more accurately, because the decision is based on distance rather than statistical prediction. Additionally, PILOT uses "merged interaction lists": certain interactions are common among cells of a given level, so their results can be cached and reused for additional efficiency. Now that we have described the PILOT algorithm, we can discuss the acceleration process. In the next two sections, we describe the ideas behind hardware acceleration and the specifics of accelerating PILOT.

3 Reconfigurable, Hardware-Based Acceleration
Current personal computers (PCs) and high-performance computers (HPCs) are designed as general computational machines that can be applied to a wide array of problems. However, such an approach results in a "jack of all trades, master of none" paradigm. Rather than use generic compilers to map algorithms onto general-purpose hardware, scientific computations need an approach that allows an ideal mapping of algorithms onto application-specific hardware. To this end, our research focuses on accelerating computationally intensive algorithms, such as PILOT, on hardware platforms. A reconfigurable hardware platform, specifically field-programmable gate arrays (FPGAs), is used to reduce cost and development time while increasing flexibility and reusability. Alternative hardware platforms are not well suited to the PILOT method. Microprocessors and digital signal processors (DSPs) lack the required computational power. Application-specific integrated circuits (ASICs) could provide it, but they are orders of magnitude more expensive to manufacture and have significantly longer development cycles than FPGAs. Furthermore, unlike ASICs, the reconfigurability of FPGAs can be exploited to create "all-in-one" devices whose functionality can be changed dynamically. We have already demonstrated the promise of this technology by developing a hardware accelerator for the finite-difference time-domain (FDTD) algorithm [3].
The acceleration platform consists of a host PC, a CAD interface, and a custom, FPGA-based PCI card. The user describes the design to analyze through the CAD tool, which sends the problem description to the accelerator card via the PCI bus. The accelerator card consists of 16 GB of DDR SDRAM, 36 Mb of DDR SRAM, a Xilinx Virtex-II 8000 FPGA, and a PLX 9656 external PCI controller (Fig. 1). These components provide a peak memory bandwidth of 9.7 GB/s and a computational throughput of 40 GFLOPS. Using this reconfigurable platform, our hardware-based FDTD accelerator demonstrated performance comparable to a 150-node computer cluster.

Fig. 1. EM Photonics Celerity™ accelerator board. This card connects to the standard PCI bus of a desktop machine and houses a Xilinx Virtex-II 8000 FPGA and 16 GB of DDR SDRAM.

4 Accelerating PILOT
The PILOT method is extremely well suited to the full-wave analysis of electronic circuits. While it is faster than the alternative algorithms, its performance is still limited by the platform on which it is implemented, commonly a microprocessor-based system. In the previous section, we discussed our approach for accelerating computationally intense algorithms. We now present our success using this technique to improve PILOT's performance. The PILOT method produces a series of QR-compressed submatrices that represent the MOM coefficient matrix. The final solution is then obtained via an iterative solver, in which each matrix-vector product is approximated by the operation Q*(R*V). This is the bottleneck of the solution process, and thus the piece most logically addressed when translating the algorithm into an accelerated environment. We first developed the high-level architecture for our system (Fig. 2). The solver implemented in the FPGA receives all input data from the PCI bus. This data is stored in internal caches or routed to external DRAM until it is needed; similarly, results are sent back to the PC via the PCI bus. The main computational portion of the design is an array of matrix-vector-multiply (MVM) units that work in parallel to produce the matrix-vector product of the compressed submatrices. The layout of an MVM unit is shown in Fig. 3. It consists of a single floating-point multiplier, a floating-point adder, caches for the input and output vectors, and control logic to bind them together. The control logic is primarily concerned with managing the caches to ensure the correct data is produced and consumed.
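To see why the factored product Q*(R*V) is the operation worth accelerating, the following sketch (hypothetical sizes; pure Python for self-containment) compares the multiply counts of the factored and explicit forms and checks that they produce the same result:

```python
# Sketch of why the factored product Q*(R*v) is cheap (hypothetical
# sizes; pure Python for self-containment).  For an m-by-n submatrix
# stored as Q (m-by-k) times R (k-by-n) with rank k << min(m, n),
# applying it to a vector costs k*(n + m) multiplies instead of m*n.

def matvec(M, v):
    """Dense matrix-vector product over lists of lists."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

m, n, k = 200, 200, 8            # hypothetical submatrix sizes and rank
full_cost = m * n                # multiplies for the uncompressed product
compressed_cost = k * (n + m)    # R*v first, then Q*(R*v)
print(full_cost, "vs", compressed_cost)        # 40000 vs 3200

# Tiny numerical check that Q*(R*v) equals the explicit (Q*R)*v.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # 3x2
R = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0]]         # 2x3
v = [1.0, 1.0, 1.0]
QRv = matvec(Q, matvec(R, v))                  # factored product
A = [[sum(Q[i][t] * R[t][j] for t in range(2)) for j in range(3)]
     for i in range(3)]                        # explicit Q*R
assert matvec(A, v) == QRv                     # [3.0, 4.0, 7.0]
```

The same ordering is what each MVM unit implements in hardware: a stream of multiplies and adds over R*V, then over Q times that intermediate vector.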

Fig. 2. Overall MVM system architecture. This shows the FPGA connected to its peripherals: the PCI bus, DRAMs, and SRAM. A very high-level view of the contents of the FPGA is also shown.

Fig. 3. The internal architecture of a single MVM unit, showing the arithmetic hardware with the accompanying caches and control logic.

To prove viability, we produced a prototype of the PILOT accelerator on our reconfigurable hardware platform. The prototype is a scaled-down version of the final system and consists of a single MVM unit. To test this system, we integrated our hardware implementation of the MVM solver directly with a PILOT solver provided by Dr. Vikram Jandhyala, the developer of the PILOT method. Dr. Jandhyala also provided us with sample data for testing, some of which is shown in Fig. 4. The results are very encouraging, showing a maximum error of 5.5e-4 percent between the two implementations. Most of the error is due to our hardware operating in single precision, while the reference data was calculated in double precision; our results are thus quite accurate. In terms of performance, the system exceeds a baseline MATLAB implementation by about 10%. While MATLAB is not a fast language, it is still a reasonable comparison point for our prototype system.
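The precision argument can be checked in isolation. Rounding a double to IEEE-754 single precision introduces a relative error of at most 2^-24, i.e. below about 6e-6 percent per value; accumulated over the many operations of a matrix-vector product, errors on the order of the reported 5.5e-4 percent are plausible. The sketch below (illustrative values, not the paper's data set) uses Python's struct module to emulate the float32 rounding the hardware performs:

```python
# Sketch of the single- vs. double-precision error argument (illustrative
# values, not the paper's data set).  struct round-trips a Python double
# through the IEEE-754 single-precision format, emulating what hardware
# computing in float32 would store.
import struct

def to_float32(x):
    """Round a Python float (double precision) to the nearest single."""
    return struct.unpack('f', struct.pack('f', x))[0]

reference = [0.123456789, 3.14159265358979, 2.71828182845905e-4]
hardware = [to_float32(r) for r in reference]  # emulated float32 results

percent_errors = [abs(h - r) / abs(r) * 100.0
                  for h, r in zip(hardware, reference)]
print(max(percent_errors))   # prints a value below 1e-4 (percent)
```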


Fig. 4. A comb filter, courtesy of Dr. Vikram Jandhyala of the University of Washington.

The final statistics of the hardware prototype and subsequent projections for the full solver are shown in Table 1. The first three rows of the table are registers, Block RAMs, and LUTs: the basic resources at our disposal when building an FPGA system. Utilizations are expressed as percentages of the entire chip (a Xilinx Virtex-II 8000 FPGA).

Table 1. Resource utilization for the prototype hardware system.

Category      Prototype Solver   Full Solver (Projected)
Registers     3705 (3%)          (~80%)
Block RAMs    10 (3%)            (~99%)
LUTs          3098 (3%)          (~80%)
Clock speed   94.9 MHz           150 MHz
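As a rough sanity check on these figures, the headroom in Table 1 can be turned into a speedup estimate (illustrative arithmetic only; it assumes performance scales linearly with the number of parallel MVM units and with clock frequency, which is an idealization):

```python
# Rough speedup projection from the FPGA resource headroom in Table 1
# (illustrative arithmetic only; assumes performance scales linearly
# with parallel MVM units and with clock frequency).

prototype_utilization = 0.03    # fraction of FPGA area per MVM unit
prototype_clock_mhz = 94.9
projected_clock_mhz = 150.0

max_parallel_units = int(1 / prototype_utilization)          # 33 units
clock_gain = projected_clock_mhz / prototype_clock_mhz
projected_speedup = max_parallel_units * clock_gain
print(f"{max_parallel_units} units x {clock_gain:.2f} clock gain "
      f"= {projected_speedup:.0f}x projected speedup")
```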

Performance of an algorithm implemented in an FPGA is primarily a function of the speed of the hardware and the amount of parallelism that can be exploited. As shown in Table 1, our prototype solver uses approximately 3% of the FPGA's area, meaning we can theoretically increase parallelism by a factor of 33. Additionally, our current design runs at 94.9 MHz; based on our benchmarks and prior experience, we expect to increase the clock speed to the 130-150 MHz range. Combining these factors, and assuming no additional optimizations, we should achieve roughly a 50x speedup over our current design. At that speed, we will be able to quickly solve problems that could not previously have been attempted on a desktop machine.

5 Conclusion
Although algorithms for the full-wave EM analysis of integrated circuits have reached a highly mature state, there is still a need for faster analysis of larger systems, and it therefore becomes desirable to accelerate the solution of these algorithms. Our work has shown that a dedicated hardware acceleration system is feasible and can theoretically provide large speedups over a single PC. Our prototype accurately solves the problems of interest in a reasonable timeframe and can be improved to replace large PC clusters with a single desktop machine. Our ongoing work focuses on building the final version of the system whose design was presented above. This system will solve much larger problems more quickly than traditional software implementations, making full-wave analysis of practical electronic circuits feasible.

References:
[1] S. Kapur and D. E. Long, "IES3: Efficient Electrostatic and Electromagnetic Simulation," IEEE Computational Science and Engineering, vol. 5, no. 4, pp. 60-67, October-December 1998.

[2] D. Gope and V. Jandhyala, "Oct-Tree Based Multilevel Low-Rank Decomposition Algorithm for Rapid 3D Parasitic Extraction," IEEE Trans. Computer-Aided Design of Integrated Circuits & Systems, vol. 21, no. 11, pp. 1575-1580, November 2004.

[3] J. P. Durbano, J. R. Humphrey, F. E. Ortiz, P. F. Curt, D. W. Prather, and M. S. Mirotznik, “Hardware Acceleration of the 3D Finite-Difference Time-Domain Method,” IEEE AP-S International Symposium on Antennas & Propagation, June 2004.
