Optimized Implementation across Slice Fabric on FPGA
By
Aqib Perwaiz
2006-NUST-TfrPhD-ComE-06
Supervisor
Dr. Shoab Ahmed Khan
College of Electrical and Mechanical Engineering National University of Sciences and Technology, Pakistan
August 2013
Optimized Implementation across Slice Fabric on FPGA
By
Aqib Perwaiz
2006-NUST-TfrPhD-ComE-06
A thesis submitted in partial fulfillment of the requirement for the
Degree of Doctor of Philosophy
Supervisor
Dr. Shoab Ahmed Khan
Department of Computer Engineering
College of Electrical and Mechanical Engineering Pakistan
August 2013
ACKNOWLEDGEMENT
First of all, I am thankful to ALLAH Almighty for His mercy, help and guidance, without
which this work would not have been possible. I would like to express my gratitude to
Prof. Dr. Younus Javed, Dean ASG. He has always emphasized higher education
and advanced studies in the University and has tried to promote a culture of research and
technological development; without his efforts and interest, this work would not have
been possible.
It was a great honor for me to be supervised by Dr. Shoab A Khan. Besides his
significant contribution to this work, he influenced my development as a member of the
research community in this field. Ever since I started my studies for the Bachelor of
Electrical Engineering, Dr. Shoab has been a role model for me.
I would like to thank the members of the Research Monitoring Committee and the
foreign experts who have guided me throughout my work and helped me keep my
research on the right path.
I owe my parents for every success in my life; their encouragement and support are a key
factor in every achievement that I have ever made. I am also indebted to my wife for
her continuous encouragement and patience during the course of my PhD. I would also
like to thank my children for their patience and for the motivation that I always find in their
smiles.
Special thanks to the Higher Education Commission for their financial support.
This dissertation is dedicated to my family for
their love, deep understanding, endless
patience and especially my wife for her
encouragement at all times.
SUMMARY
This thesis proposes a mathematical-modeling-based technique that optimizes the mapping
of Digital Signal Processing (DSP) algorithms on FPGAs. The thesis mathematically
models the problem by defining an objective function that optimizes attributes such as area,
power, and timing under a set of design constraints. The constraints list the embedded
blocks on FPGAs as resources. Any high-end DSP system consists of multiple sub-systems,
and each sub-system has multiple architectural options to select from; multiple
architectural options for a Software Defined Radio / Software Defined Jammer are
discussed. Besides the architectural design options, many other attributes
directly affect the mapped resources. Word-length quantization plays a critical
role in further optimizing the selected architectural option. The thesis models all these
attributes, and the solution lists the resources required for the optimized mapping. The
thesis then indexes the results to select the best FPGA from the database. The model
also works on an already selected FPGA and optimizes its resources to best fit a complex
design in the available hardware (HW). The thesis further discusses the effect of word
length on HW complexity. The experiments demonstrate that increasing the word length of
intermediate variables does not improve performance beyond a certain
point. The thesis explores the intricate relationship between intermediate variable lengths and
the overall accuracy of the results, and links it with the complexity of the HW. Several design
examples are listed to show the validity of the findings. As an example, the CORDIC
algorithm is explored to analyze the effect of bit resolution on hardware
complexity and least mean square error.
In the design space exploration, several architectural options are discussed. The
options include bit-serial, byte-serial, folded, unfolded, and distributed-arithmetic-based
architectures. In the discussion, novel techniques for mapping algorithms on these
architectures are also presented. For example, while discussing bit-serial architectures,
a novel design for serial multiplication is presented. The multiplier created in the process
is used in the design of subsystems. In this context, the design of a serial least mean
square adaptive filter is presented. A bitwise serial CORDIC architecture used in a direct
digital frequency synthesizer is also explored.
The thesis further focuses on architectural design options that map best on
FPGAs. Architectures that are optimal for custom design may perform poorly once
mapped on an FPGA. This observation is substantiated by design examples from
compression trees. These trees are fundamental to DSP architectures due to their
wide use in general-purpose multiplication, multiplication by constants, and multiple-operand
addition and subtraction. Different compression ratios for the Wallace tree have
been explored to identify the ratio that maps best on
LUT-based FPGAs.
Mapping a DSP algorithm on hardware entails floating-point to fixed-point
conversion. Matlab ® has been used to map the above-mentioned algorithms
on the hardware, Xilinx ® tools have been used to synthesize them, and LP solve has been
used to solve the complex mathematical model.
LIST OF ACRONYMS
DSP Digital Signal Processing
FPGA Field Programmable Gate Array
ASIC Application Specific Integrated Circuit
COMB Combined application of WLA and HLS
DCT Discrete Cosine Transform
DFG Data Flow Graph
FIR Finite Impulse Response
FU Functional Unit
HOM Homogeneous-architecture approach
HET Heterogeneous-architecture approach
HLS High-Level Synthesis
IIR Infinite Impulse Response
IOB Input / Output Block
LE Logic Element
LMS Least Mean Squares
LSB Least Significant Bit
LUT Look-Up Table
MILP Mixed Integer Linear Programming
MSB Most Significant Bit
MSE Mean Square Error
MUX Multiplexer
MWL Multiple Word-Length
RTL Register Transfer Logic
SEQ Sequential application of WLA and HLS
SFG Signal Flow Graph
SNR Signal to Noise Ratio
SQNR Signal to Quantization Noise Ratio
UWL Uniform Word-Length
WLA Word-Length Allocation
Contents
Summary
List of Acronyms
List of Figures
1. Overview
1.1 Introduction
1.2 Problem statement
1.3 Structure of this thesis
1.4 References
2. An Optimal Designing Solution for Efficient Utilization and Mapping of Resources on FPGAs
2.1 Introduction
2.2 Optimization Mathematical Model
2.3 WCDMA Receiver Example
2.4 Results
2.5 Conclusion
2.6 References
3. Hardware Mapping On FPGA
3.1 Overview
3.2 Hardware resources available on FPGA
3.2.1 Express fabric technology
3.2.2 Routing and interconnect architecture
3.2.3 Block Rams
3.2.4 Clock management
3.2.5 Dedicated MAC modules
3.3 Look up Table
3.4 Digital Signal Processor (DSP 48)
3.5 References
4. Trading Off Word Length with Optimized Area
4.1 Overview
4.2 Proposed Algorithm
4.2.1 Format Conversion
4.2.2 Insertion of accuracy handlers
4.2.3 Modeling of hardware utilization for the selected word length
4.2.4 Iteration on word length
4.2.5 Analysis for percentage increase in area, decrease in timing and increase in accuracy
4.2.6 Design space determination
4.2.7 Design space exploration
4.2.8 Selecting the appropriate word length that offers best accuracy, area and timing trade off
4.3 Design Example
4.3.1 CORDIC Algorithm
4.3.2 CORDIC Modeling
4.3.3 CORDIC synthesis on XILINX
4.3.4 Experimental Results
4.4 Conclusion
4.5 References
5. Optimizing Bit Serial Architecture
5.1 Overview
5.2 Bit Serial Multiplication
5.3 Algorithm for Bit Wise Serial Multiplication
5.4 Design Example of Bit Serial Multiplier
5.5 Architecture
5.6 Implementation and Results
5.7 The LMS FIR filter Using Bit Serial Compressor
5.6.1 Bit Serial Adder
5.8 LMS Filter Architecture
5.9 Implementation and Results
5.10 References
6. Optimization On FPGA Slice Fabric
6.1 Overview
6.2 Optimization Techniques vs FPGA architecture
6.2.1 Compression Trees
6.2.2 Multiplier pipelining
6.2.3 Optimization of Bit resolution
6.3 Design Optimization
6.3.1 Optimization of FIR filter
6.3.2 Optimization of IIR filter
6.4 Complex Multiplier
6.5 Experimental Results
6.5.1 FIR filter
6.5.2 IIR filter
6.6 Complex Multiplier Synthesis
6.6.1 Optimization of Bit width
6.7 Conclusion
6.8 References
7. Conclusion and Future Work
List of Figures
Fig 2.1 Block layout of WCDMA receiver
Fig 2.2 Data rate of WCDMA receiver
Fig 3.1 Block diagram of Virtex-5 6-input LUT
Fig 3.2 LUT showing programmable I/O blocks
Fig 3.3 Internal architecture of Digital Signal processor DSP 48 showing the registers and carry chain
Fig 4.1 I/O systems with multiple inputs and outputs
Fig 4.2 Effects of increase in bit width, hardware complexity and its effects on LMS error in the design
Fig 4.3 Bit resolution Vs LMS error where bit width of X,Y= bit width of ф
Fig 4.4 Bit resolution Vs LMS error where bit width of X,Y< bit width of ф
Fig 4.5 Bit resolution Vs LMS error where bit width of X,Y> bit width of ф
Fig 4.6 Analysis on no. of slices, registers and IOs with bit resolution of X,Y (varying) and ф (fixed)
Fig 4.7 Analysis on no. of slices, registers and IOs with bit resolution of X,Y (fixed) and ф (varying)
Fig 5.1 Multiplication of two numbers having bit width of 8 x bits each
Fig 5.2 Serial compression of two numbers illustrated in dot notation
Fig 5.3 Compression cycles for serial multiplication shown in dot notation
Fig 5.4 Serial multiplication input to triangular compressor
Fig 5.5 Multiplication of two four x bit numbers
Fig 5.6 Bit wise dot product of first bit of A and B
Fig 5.7 Bit serial compressor based multiplication architecture showing the inputs X and Y, output p, cycle tracker, terms generator and triangular serial compressor
Fig 5.8 LMS FIR filter with serial i/p and o/p
Fig 5.9 Bit wise serial adder
Fig 5.10 Architecture of bit wise serial LMS filter composed of triangular compressor, serial adders, error calculator and filter weight adjuster
Fig 5.11 Bit serial CORDIC architecture
Fig 5.12 Flow chart of algorithm for the calculation of sine and cosine
Fig 5.13 Bit serial modified CORDIC architecture
Fig 5.14 Error analysis
Fig 6.1 6 input LUT’s, CLB’s and carry chain of Virtex-5 exploded view
Fig 6.2 Virtex-5 FPGA DSP 48 slice
Fig 6.3 FIR filter having seven taps
Fig 6.4 Systolic FIR filter with cut sets represented by dashed lines
Fig 6.5 Schematic of 6:3 type compression trees
Fig 6.6 Schematic of 3:2 type compression trees
Fig 6.7 Schematic of 4:2 type compression trees
Fig 6.8 Schematic of 7:3 type compression trees
Fig 6.9 IIR filter of first order
Fig 6.10 First order transformation of IIR filter
Fig 6.11 Schematic of Complex multiplier
Fig 6.12 Complex multiplier incorporating booth encoded Wallace tree reduction technique
Fig 6.13 The frequency (MHz) and number of utilized LUTs in CSD by using different compression trees for FIR filter compression
Fig 6.14 The number of utilized LUTs and frequency (MHz) in CSD by using different compression trees for FIR filter compression
Fig 6.15 Complex multiplier using different compression trees for comparison of LUTs and path delay
Fig 6.16 LUTs and clock rates for FIR filter
Overview 2013
Chapter 1
Overview
1.1 Introduction
In signal processing systems, Field Programmable Gate Arrays (FPGAs) are widely used
for prototyping and evaluating algorithms with respect to timing performance and
system throughput. The latest FPGAs have virtual embedded computational blocks
[1] which offer higher-speed computational units. When designing a specific algorithm,
the structure of the embedded blocks, the resources available on the hardware, the bit width of
the inputs and the depth of pipelining play a vital role in achieving area and timing
performance [2].
The latest FPGAs offer reconfigurable logic blocks custom designed for
high-throughput multiply-accumulate operations, dedicated carry chain support, Block
Random Access Memories (RAMs) and an internal slice cascade structure [3]. The layout
of logic elements in the blocks of an FPGA restricts the application of customary optimization
techniques and creates a need for specialized techniques specific to the resources
available in the FPGA. Traditional optimization techniques [4] that have proved well
suited to one family of FPGAs may not exhibit the same superior performance on a
different family; it is therefore essential to emphasize separate optimization methods for
each family in order to generate an optimal hardware architecture [5]. Advanced applications
therefore require custom optimizations on a particular FPGA with the goal of
maximizing performance. In short, the extent of algorithm optimization depends heavily
on the target device configuration.
1.2 Problem Statement
The objective of this research is to develop novel optimal techniques for implementing
signal processing algorithms such as Infinite Impulse Response (IIR) and Finite Impulse
Response (FIR) filters, the Direct Digital Frequency Synthesizer (DDFS) and the Coordinate
Rotation Digital Computer (CORDIC) algorithm on FPGA-based architectures. DSP
algorithms have constraints, and different architectural options within the FPGA design
space can realize the same design. Selecting a suitable option while implementing the
design therefore becomes a complex problem. An algorithm has been
developed which identifies the architectural option to use given the design
constraints. The optimization is performed based on the throughput requirement and the
FPGA fabric architecture. These algorithms are selected for their wide use in
many DSP applications. As the optimization is based on enhancing
throughput and accuracy, bit-serial and word-serial architectures are also considered
for lower throughput requirements. The research also explores the trade-off
between word length and the accuracy, area and timing of the design.
The thesis first presents the novel mathematical model for optimization
within the design space and then lays a foundation by giving an elaborate account of the
high-speed computational resources available in new-generation FPGAs. Virtex-5 is used as
the platform of choice. The thesis then discusses the optimization effects of varying word
length and hardware mapping on the FPGA.
For lower throughput requirements, bit-serial architectures are proposed. An
algorithm for a serial multiplier has been developed, and multiple instances of the
multiplier have been used to realize a bit-serial Least Mean Squares (LMS) filter. A serial
implementation of the CORDIC algorithm is also discussed to model a bit-serial
CORDIC which can be used as a DDFS.
Fixed-point arithmetic is used for the implementation of DSP circuits. To minimize
the design cost and the least mean square error, the word length has to be selected very
precisely. An algorithm for the word-length optimization of CORDIC [6] has been
developed; synthesis of the algorithm shows that, as the bit resolution increases, the
hardware complexity increases linearly while the least mean square error decreases
only marginally. It is therefore necessary to find an optimum point where performance and
minimum hardware complexity converge.
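The trade-off described above can be illustrated with a minimal sketch. It is not the CORDIC experiment of this thesis: it simply quantizes a sine wave to a chosen number of fractional bits and measures the mean square error against the floating-point value. The function names `quantize` and `mean_square_error`, the sample count and the choice of a sine test signal are all assumptions made for illustration.

```python
import math


def quantize(value: float, frac_bits: int) -> float:
    """Round value to a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def mean_square_error(frac_bits: int, samples: int = 1024) -> float:
    """MSE between sin(theta) and its fixed-point quantized version.

    Assumption: only the final result is quantized; a real CORDIC also
    accumulates rounding error in its intermediate variables.
    """
    total = 0.0
    for k in range(samples):
        theta = 2.0 * math.pi * k / samples
        exact = math.sin(theta)
        total += (exact - quantize(exact, frac_bits)) ** 2
    return total / samples


if __name__ == "__main__":
    # Print the error for a few illustrative word lengths.
    for bits in (4, 8, 12, 16):
        print(bits, mean_square_error(bits))
```

In such a model the absolute error reduction obtained from each extra bit shrinks rapidly, while the datapath keeps widening, which is in line with the observation that the error decreases only marginally at higher resolutions.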
For the desired optimization, correct selection of the target device is vital.
Programmable devices are an attractive choice for system designers, as their
reconfigurable capabilities make FPGAs [7] a suitable prototyping platform. FPGAs have
been used to analyze the algorithms developed during this work, as the latest
advancements in FPGAs offer new possibilities for implementing high-performance DSP
algorithms. Optimal usage of the resources available in the FPGA gives an insight into the
mapping of different compression trees. For Virtex-5, it has been concluded through
experimentation that a compression ratio of 6 to 3 efficiently implements multiplication
using addition, yielding reduced-area and high-speed implementations.
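One plausible intuition for this result, offered here as an illustration rather than as the experiment performed in the thesis, is that each output bit of a 6:3 counter is a Boolean function of the same six inputs and therefore fits a single 6-input LUT. The sketch below builds the three 64-entry truth tables; the helper names `counter_6_3_truth_tables` and `compress_column` are hypothetical.

```python
from itertools import product


def counter_6_3_truth_tables():
    """Return three 64-entry truth tables, one per output bit of a 6:3 counter.

    A 6:3 counter takes six bits of equal weight and outputs their
    population count (0..6) as a 3-bit number.
    """
    tables = [[0] * 64 for _ in range(3)]
    for bits in product((0, 1), repeat=6):
        # Address of this input combination inside a 6-input LUT.
        address = sum(b << k for k, b in enumerate(bits))
        ones = sum(bits)
        for out_bit in range(3):
            tables[out_bit][address] = (ones >> out_bit) & 1
    return tables


def compress_column(bits, tables):
    """Look up the 3-bit count of six equal-weight bits in the tables."""
    address = sum(b << k for k, b in enumerate(bits))
    return sum(tables[out_bit][address] << out_bit for out_bit in range(3))


if __name__ == "__main__":
    luts = counter_6_3_truth_tables()
    print(compress_column((1, 0, 1, 0, 0, 1), luts))
```

Under this view, six partial-product bits in a column are reduced to three bits using three LUTs that share the same inputs.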
1.3 Structure of this Thesis
The work consists of seven chapters, including this introductory chapter.
The second chapter covers the optimization of a DSP algorithm by considering multiple
architectural options and selecting the appropriate FPGA device to meet the design
constraints. The third chapter covers hardware mapping on the FPGA; the different
resources available within an FPGA, including computation blocks and multiplier units,
are discussed. Chapter 4 describes trading off word length against optimized area; the
CORDIC algorithm is implemented to analyze the results and to determine the effect of
varying word length on least mean square error. Chapter 5 covers the bit-serial design
of the CORDIC algorithm and of a bit-serial multiplier; multiple instances of the same multiplier are
used to realize a bit-serial LMS filter. Chapter 6 covers optimized implementation
on the slice fabric of the FPGA; different compression trees are analyzed with respect to a
specific family of FPGAs, and word-length optimization techniques for FIR and IIR digital
filters and complex multipliers are also discussed. Chapter 7 concludes the research
and highlights future work.
1.4 References
[1] C. H. Ho, P. H. W. Leong and W. Luk, "Virtual Embedded Blocks: A Methodology
for Evaluating Embedded Elements in FPGAs," 14th Annual IEEE Symposium
on Field-Programmable Custom Computing Machines (FCCM'06), 0-7695-2661-6/06.
[2] L. W. Couch II, Modern Communication Systems, Prentice Hall, 1994.
[3] L. K. Tan, et al., "An 800-MHz quadrature digital synthesizer," IEEE JSSC,
vol. 30, no. 12, pp. 1463-1473, 1995.
[4] R. El-Ashry, M. Rehan, Hassan El Kamchouchi and F. Gebali,
"Performance-optimized FPGA implementation for the flexible triangle
search block-based motion estimation algorithm," 24th Canadian Conference on
Electrical and Computer Engineering (CCECE), May 2011.
[5] J. E. Volder, "The CORDIC trigonometric computing technique," IRE
Transactions on Electronic Computers, vol. EC-8, pp. 330-334, 1959.
[6] Er. Manoj Arora, Er. R S Chauhan, Er. Lalit Bagg, "FPGA Prototyping of
Hardware Implementation of CORDIC Algorithm," International Journal of
Scientific & Engineering Research, Volume 3, Issue 1, January 2012, ISSN
2229-5518.
[7] Steve Kilts, Advanced FPGA Design: Architecture, Implementation, and
Optimization, chapter 1.
An Optimal Designing Solution for Efficient Utilization and Mapping of Resources on FPGAs 2013
Chapter 2
An Optimal Designing Solution for Efficient Utilization and
Mapping of Resources on FPGAs
2.1 Introduction
This chapter presents a novel model for optimizing device resources based on the
design constraints. The model also identifies the target device to be used based on the
optimization constraints. For a DSP algorithm, multiple optimization options are available,
based on a set of constraints, to give the best solution in terms of throughput, area, timing
and power consumption [1]. High-end FPGAs have millions of embedded as well as
distributed resources. Complex applications can be mapped by adopting multiple design
options for each architectural option, e.g. folding, unfolding and parallel design options,
where the selection is mainly based on the throughput of the design. The same design can be
implemented using different sets of resources to achieve the set criteria. There are
multiple mapping options in which an algorithm or component can be mapped so that it
uses different types of resources within the same device [2]. These options pose an intricate
optimization problem to any designer of a complex digital system. This chapter presents
a mathematical model in which multiple design options can be worked out, based on
constraints, to select the best available optimization within a multi-variable FPGA [3] – [9]
design space. The algorithm is used to select the best option when multiple
solutions are available for the hardware resource constraints. The chapter considers the design
and implementation of a high data rate WCDMA receiver for optimization using the
proposed technique.
2.2 Optimization Mathematical model
The optimization problem is first modeled as an integer programming problem. To
demonstrate the working of the model, the design of a WCDMA receiver is considered.
The receiver consists of several blocks, and for each block multiple architectural design
options are available based on the throughput requirements. The designer has to
explore the trade-offs in this multi-variable design space to find the optimal solution that
best fits a selected FPGA and optimizes its resources while meeting the throughput
constraints. The complexity of the problem grows exponentially for complex designs, and
a tool is therefore required to make the selection for the designer. The proposed technique develops
an integer programming model for the problem. To demonstrate the effectiveness of the
technique, the model is applied to a WCDMA receiver to optimize the target clock, the
number of MACs, the number of adders, the number of LUTs and the number of registers while
meeting the throughput constraint.
The modeling starts by defining decision variables. Let $x_{ij}$ be a decision variable in the
optimization problem, where $j$ is the component and $i$ is the architectural option
available for that component; $x_{ij} = 1$ if option $i$ is selected for component $j$ and
$x_{ij} = 0$ otherwise. A few of the possible architectural options for each design
component are described in Table 2.1.
Table 2.1. Architectural options and their description
Ser Option number Description
a. 0 Embedded resources
b. 1 Distributed resources
c. 2 Bit serial/ word serial architecture
d. 3 Folded architecture
e. 4 Unfolded architecture
The optimization problem is solved for a set of design constraints. These constraints
relate to the resources on the FPGA and the throughput requirements of each
component. The constraints are listed as follows:
Area Constraints
This set of constraints relates to the resources on the FPGA. The designer can
budget these resources for each part of the design and impose the budgeted number as a
constraint, or can let the optimization model find a globally optimal solution for the
complete design, with the solution constrained by the available resources.
The Adders Constraint
The adder constraint relates to the adders on the FPGA for a component $j$ having
architectural option $i$. The designer can fix the number of adders in a design to be
implemented on a specific target device. If $a_{ij}$ represents the adders for a component $j$
having architectural option $i$ and $A$ represents the total number of adders available on
the FPGA, then the adder constraint is defined as follows:
$$\sum_{j}\sum_{i} a_{ij}\, x_{ij} \le A \qquad (1)$$
The Multiplier Constraint
The multiplier constraint relates to the MACs on the FPGA for a component $j$ having
architectural option $i$. The designer can plan the number of MACs in a design to be
implemented on a specific target device. If $m_{ij}$ represents the MACs for a component $j$ having
architectural option $i$ and $M$ represents the total number of MACs available on the FPGA,
then the MAC constraint is written as follows:
$$\sum_{j}\sum_{i} m_{ij}\, x_{ij} \le M \qquad (2)$$
The Register Constraint
The register constraint relates to the registers on the FPGA for a component $j$ having
architectural option $i$. The designer can plan the number of registers in a design to be
implemented on a specific target device. If $r_{ij}$ represents the registers for a component $j$
having architectural option $i$ and $R$ represents the total number of registers available on
the FPGA, then the register constraint is defined as follows:
$$\sum_{j}\sum_{i} r_{ij}\, x_{ij} \le R \qquad (3)$$
The Look Up Table (LUT) Constraint
The LUT constraint relates to the LUTs on the FPGA for a component $j$ having
architectural option $i$. The designer can plan the number of LUTs in a design to be
implemented on a specific target device. If $l_{ij}$ represents the LUTs in the design for a
component $j$ having architectural option $i$ and $L$ represents the total number of LUTs
available on the FPGA, then the LUT constraint is defined as follows:
$$\sum_{j}\sum_{i} l_{ij}\, x_{ij} \le L \qquad (4)$$
Memory Constraints (SRAM Block constraint)
This constraint is optional. It optimizes the use of RAM blocks, which is
directly related to the power consumption of the FPGA. If $ram_{ij}$ represents the RAM blocks
for a component $j$ having architectural option $i$ and $RAM$ represents the total RAM blocks
available on the FPGA, then the RAM constraint is defined as follows:
$$\sum_{j}\sum_{i} ram_{ij}\, x_{ij} \le RAM \qquad (5)$$
Power Constraint
This is an optional constraint and can be imposed for designs with low-power objectives. If
$p_{ij}$ represents the desired power level in the design for a component $j$ having
architectural option $i$ and $P$ represents the total power the FPGA can handle, then the
power constraint is defined as follows:
$$\sum_{j}\sum_{i} p_{ij}\, x_{ij} \le P \qquad (6)$$
As discussed earlier, the decision variable $x_{ij}$ must satisfy equation (7), so that exactly one
architectural option is selected for each of the $N$ components in the design.
$$\sum_{i} x_{ij} = 1 \quad \forall\, j = 1, 2, 3, 4, \dots, N \qquad (7)$$
For the design, it is required to minimize the following expression:
$$\sum_{j}\sum_{i}\left(\alpha_{l}\, l_{ij} x_{ij} + \alpha_{m}\, m_{ij} x_{ij} + \alpha_{r}\, r_{ij} x_{ij} + \alpha_{ram}\, ram_{ij} x_{ij} + \alpha_{a}\, a_{ij} x_{ij} + \alpha_{p}\, p_{ij} x_{ij}\right) \qquad (8)$$
where $\alpha_{l}$ is the weight of the LUT term $l_{ij}$ in the design, $\alpha_{m}$ is the weight of the MAC
term $m_{ij}$, $\alpha_{r}$ is the weight of the register term $r_{ij}$, $\alpha_{ram}$ is the weight of the memory
term $ram_{ij}$, $\alpha_{a}$ is the weight of the adder term $a_{ij}$ and $\alpha_{p}$ is the weight of the
power term $p_{ij}$ in the design. For any problem, these equations are solved using the tool
LPsolve, which determines the architectural option to be selected under the specified
constraints. These constraints map onto the resources of the FPGA and therefore determine
the target device for the implementation. For better understanding, this mathematical
model has been applied to a WCDMA receiver.
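The model in equations (1) to (8) can be sketched on a toy problem. The thesis solves the model with LPsolve; the sketch below instead uses plain exhaustive enumeration so that it has no external dependencies, and every component name, cost, budget and weight in it is a made-up assumption rather than data from this work. Choosing exactly one option per component enforces equation (7), the budget check stands in for the resource constraints, and the weighted sum plays the role of the objective (8).

```python
from itertools import product

# Hypothetical per-option resource costs: COSTS[component][option] -> resources.
# All numbers are illustrative assumptions, not values from the thesis.
COSTS = {
    "down_sampler": {0: {"lut": 40, "mac": 4, "reg": 120},
                     1: {"lut": 300, "mac": 0, "reg": 200}},
    "correlator": {0: {"lut": 60, "mac": 8, "reg": 150},
                   1: {"lut": 500, "mac": 0, "reg": 260},
                   2: {"lut": 90, "mac": 1, "reg": 80}},
    "equalizer": {0: {"lut": 50, "mac": 6, "reg": 100},
                  3: {"lut": 120, "mac": 2, "reg": 90}},
}
BUDGET = {"lut": 600, "mac": 10, "reg": 400}          # assumed device limits
WEIGHTS = {"lut": 0.001, "mac": 0.05, "reg": 0.002}   # assumed alpha weights


def solve(costs, budget, weights):
    """Return (objective, selection, usage) of the cheapest feasible design.

    Picking one option per component corresponds to sum_i x_ij = 1.
    Returns None when no combination fits the budget.
    """
    components = list(costs)
    best = None
    for choice in product(*(costs[c] for c in components)):
        # Total usage of each resource for this combination of options.
        usage = {r: sum(costs[c][o][r] for c, o in zip(components, choice))
                 for r in budget}
        if any(usage[r] > budget[r] for r in budget):
            continue  # violates a resource constraint
        objective = sum(weights[r] * usage[r] for r in budget)
        if best is None or objective < best[0]:
            best = (objective, dict(zip(components, choice)), usage)
    return best


if __name__ == "__main__":
    print(solve(COSTS, BUDGET, WEIGHTS))
```

Enumeration is only practical for very small instances; for a realistic number of components an integer programming solver is the appropriate tool, as used in this work.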
2.3 WCDMA Receiver Example
The purpose of a digital receiver is to recover the baseband signal without synchronous
demodulation; it includes the signal processing immediately after the Analog Front End
(AFE), from detecting the start of burst to recovering the actual stream of information
intended for communication. The proposed mathematical model is applied to a WCDMA
receiver for a Software Defined Radio under development at the Center for Advanced
Research in Engineering. The component layout and interconnection of the WCDMA receiver for
the Software Defined Radio is shown in Figure 2.1. For each of the components / blocks
there are multiple design options based on the throughput, timing and other design
constraints. The typical data rate is high, and to achieve it every block /
component has to operate at a high clock rate, which in turn depends on pipelining and the
number of registers; the data rate of each component is shown in Figure 2.2. The more
registers there are, the more power is consumed. The selection therefore becomes a complex
problem to solve. If the exact weightage of each subcomponent is known at design time, then
the architecture to implement is known exactly and the FPGA that supports the complete
design is also identified.
[Figure 2.1: block diagram of the WCDMA receiver. Blocks: Signal down sampling; Correlation of received signal with spreading sequence; Start of burst detection and timing compensation; Data despreading; Coarse frequency estimation; Channel estimation and compensation; Coarse frequency compensation; Fine frequency estimation and compensation; Channel equalization and phase adjustment; Training sequence removal; Forward error correction; Symbol demapping.]
Figure 2.1. Block layout of WCDMA receiver
Since there are multiple modules, the throughput of each module differs
depending on the data stream it handles. The following formula was used for the chip rate
calculation:
$$\text{Chip rate} = \frac{\text{Throughput} \times \text{Spreading gain} \times (\text{Data length} + \text{Training length})}{\text{Bits per symbol} \times \text{Forward error correction} \times \text{Data length}}$$
The constraints in this design are as per Table 2.3 below:
Table 2.3. Design parameters for WCDMA Receiver
Serial   Parameter                  Target design
1.       Training Length            32
2.       Spreading Factor           16
3.       Data Length                288
4.       Modulation Index           4
5.       Modulation Schemes         QPSK
6.       Target throughput          512 kbps
7.       Forward Error Correction   ½
8.       Chip Rate                  9.102 Mcps
9.       Up sampling Factor         4
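With the above values, the chip rate is
$$\text{Chip rate} = \frac{512000 \times 16 \times (288 + 32)}{2 \times 0.5 \times 288} = 9.102\ \text{Mcps}$$
Since the up-sampling factor is 4, the actual bandwidth becomes
$$\text{Bandwidth} = \text{Chip rate} \times \text{Up-sampling factor} = 9.102 \times 4 = 36.408\ \text{Mcps}$$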
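The arithmetic above can be checked with a short sketch. The function name `chip_rate` and its parameter names are illustrative; the input values are those of Table 2.3, with 2 bits per symbol for QPSK and a forward error correction rate of 0.5, as used in the calculation.

```python
def chip_rate(throughput_bps, spreading_gain, data_length,
              training_length, bits_per_symbol, fec_rate):
    """Chip rate in chips per second for the burst format described above."""
    numerator = throughput_bps * spreading_gain * (data_length + training_length)
    denominator = bits_per_symbol * fec_rate * data_length
    return numerator / denominator


if __name__ == "__main__":
    rate = chip_rate(512_000, 16, 288, 32, 2, 0.5)
    up_sampling_factor = 4
    print(f"chip rate = {rate / 1e6:.3f} Mcps")
    # Sample rate at the receiver input after up-sampling by 4.
    print(f"bandwidth = {rate * up_sampling_factor / 1e6:.3f} Mcps")
```

The unrounded chip rate is about 9.1022 Mcps, so multiplying the rounded value 9.102 by the up-sampling factor of 4 gives the quoted 36.408 Mcps, while the unrounded product is closer to 36.409 Mcps.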
[Figure 2.2: the receiver block chain of Figure 2.1 annotated with the data rate at each stage. Rates shown: 36.408 Mcps, 9.102 Mcps (chip rate), 568.89 Ksps, 512 Ksps, 256 Ksps and 512 Kbps, ending in the demodulated bit stream.]
Figure 2.2. Data rates for WCDMA receiver
The WCDMA receiver has a total of 12 sub-modules, and several architectural options are
available for each module, as listed in Table 2.4.
Table 2.4. Architectural options for WCDMA receiver
Serial  Module                                                   Throughput    Options available
1.      Signal down sampling                                     36.408 Mcps   4
2.      Correlation of received signal with spreading sequence   9.102 Mcps    4
3.      Start of burst detection and timing sequence             568.89 Ksps   3
4.      Data de-spreading                                        568.89 Ksps   4
5.      Coarse frequency estimation                              568.89 Ksps   4
6.      Channel estimation and compensation                      568.89 Ksps   4
7.      Coarse frequency compensation                            568.89 Ksps   4
8.      Fine frequency estimation and compensation               568.89 Ksps   5
9.      Channel equalization and phase adjustment                568.89 Ksps   5
10.     Training sequence removal                                568.89 Ksps   5
11.     Forward error correction                                 256 Ksps      4
12.     Signal de-mapping                                        512 Kbps      4
The whole design was expressed in terms of the equations defined above and the constraint
values were set. The LP_Solve tool then solved the equations and provided the best
architectural option for each sub-component/block in the design. The initial values of
the design constraints are given in Table 2.5.
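The selection step can be illustrated with a small exhaustive search. This is a minimal sketch and not the LP_Solve formulation used in this work: the module names are borrowed from Table 2.4, but every cost figure, weight and budget below is hypothetical, and a brute-force search stands in for the solver.

```python
from itertools import product

def select_options(modules, weights, budgets):
    """Pick one option per module so that the weighted resource cost is minimal.

    modules: list of (name, required_rate, options); each option is a dict with a
             'rate' entry plus one entry per resource named in budgets.
    weights: weight of each resource in the cost (keys must also be in budgets).
    budgets: upper limit on the total use of each resource.
    """
    # Keep only the options that meet the throughput required by each module.
    per_module = [
        [i for i, opt in enumerate(options) if opt["rate"] >= required_rate]
        for _name, required_rate, options in modules
    ]
    best_choice, best_cost = None, None
    for choice in product(*per_module):
        picked = [modules[m][2][i] for m, i in enumerate(choice)]
        usage = {r: sum(opt[r] for opt in picked) for r in budgets}
        if any(usage[r] > budgets[r] for r in budgets):
            continue  # this combination does not fit the device budget
        cost = sum(weights[r] * usage[r] for r in weights)
        if best_cost is None or cost < best_cost:
            best_choice, best_cost = choice, cost
    return best_choice, best_cost

# Hypothetical numbers (rates in ksps), for illustration only.
modules = [
    ("signal down sampling", 36408.0, [
        {"rate": 40000.0, "slices": 50, "dsp": 4},    # embedded resources
        {"rate": 40000.0, "slices": 400, "dsp": 0},   # distributed resources
        {"rate": 10000.0, "slices": 60, "dsp": 0},    # folded, too slow here
    ]),
    ("data de-spreading", 568.89, [
        {"rate": 9000.0, "slices": 40, "dsp": 2},
        {"rate": 9000.0, "slices": 300, "dsp": 0},
        {"rate": 600.0, "slices": 45, "dsp": 0},
    ]),
]
weights = {"slices": 1.0, "dsp": 50.0}
budgets = {"slices": 500, "dsp": 4}

choice, cost = select_options(modules, weights, budgets)
print(choice, cost)
```

Changing the weights or budgets changes the selected options, which mirrors the behaviour described for the two sets of constraint values.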
Table 2.5. Initial values of constraints
Serial  Constraint  Value (Option 1)  Value (Option 2)
a.      ∝           0.0004            0.00028
b.      ∝           0.0416            0.0357
c.      ∝           0.0003            0.00025
d.      ∝           0.00019           0.0001530
e.      ∝           0.01              0.008
f.      ∝           0.083             0.002
g.      A           5200              6500
h.      M           24                28
i.      R           3000              4000
j.      L           2500              3500
k.      RAM         120               500
l.      P           100               125
Against these constraints, a 12 x 5 grid (12 components with up to 5 options each) was
initialized, and the options selected by the LP_Solve solution are highlighted. As the
constraints change, the selected options and ultimately the target device also change;
the details are given in Table 2.6.
Table 2.6. Grid illumination of selected architectural option for WCDMA receiver. The green dots and the
interconnect represent the option 1 constraint values and the blue dots and interconnect represent the
option 2 values of the design constraints
Columns: Serial; Embedded resources; Distributed resources; Bit/word serial; Folded; Unfolded
[Grid of rows 1 to 12, one per module, against the five options; the selection markers are graphical.]
The selected architecture for each sub-component/block is given in Table 2.7.
Table 2.7. Selected implementation option for each module of WCDMA receiver
Serial  Sub-component / block                                    Throughput    Selected option          Selected option
                                                                               (Option 1 constraint)    (Option 2 constraint)
1.      Signal down sampling                                     36.408 Mcps   0 (Embedded)             0 (Embedded)
2.      Correlation of received signal with spreading sequence   9.102 Mcps    1 (Distributed)          3 (Folded)
3.      Start of burst detection and timing sequence             568.89 Ksps   1 (Distributed)          3 (Folded)
4.      Data de-spreading                                        568.89 Ksps   3 (Folded)               1 (Distributed)
5.      Coarse frequency estimation                              568.89 Ksps   3 (Folded)               1 (Distributed)
6.      Channel estimation and compensation                      568.89 Ksps   0 (Embedded)             1 (Distributed)
7.      Coarse frequency compensation                            568.89 Ksps   0 (Embedded)             1 (Distributed)
8.      Fine frequency estimation and compensation               568.89 Ksps   1 (Distributed)          4 (Unfolded)
9.      Channel equalization and phase adjustment                568.89 Ksps   0 (Embedded)             3 (Folded)
10.     Training sequence removal                                568.89 Ksps   0 (Embedded)             3 (Folded)
11.     Forward error correction                                 256 Ksps      0 (Embedded)             3 (Folded)
12.     Signal de-mapping                                        512 Kbps      0 (Embedded)             3 (Folded)
As the values of the constraints are changed, the selected architectural options also change.
Since the architectural options are tied to the resources of the FPGA, the constraint values
have a direct impact on the selection of the target device.
2.4 Results
The WCDMA receiver for software defined radio was implemented using the
mathematical model to select the architectural option for each component/block of the
design. Two values were given for each constraint and the model was solved using
LP_Solve. The results show that, for the two sets of constraint values, two FPGAs meet the
requirement: the first is the Spartan 3A device xc3sd3400a-4cs484 [12] and the second
is the Virtex 5 device XC5VSX240T. The specifications of these devices are given in Table
2.8.
Table 2.8. Resources available on Spartan 3A and Virtex 5 FPGA
Serial  Resources      Spartan 3A  Virtex 5
1.      Slices         23872       37440
2.      4-input LUTs   47774       149760
3.      Flip-flops     47774       149760
4.      DSP blocks     126         1056
5.      Block RAMs     126         516
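The device figures in Table 2.8 lend themselves to a simple fit check. The sketch below assumes a design can be summarised by a few resource totals; the two requirement sets are hypothetical and only loosely echo the two constraint options, and the dictionary keys are illustrative names.

```python
# Resource figures copied from Table 2.8; keys are illustrative names.
DEVICES = {
    "Spartan 3A": {"slices": 23872, "luts": 47774, "flip_flops": 47774,
                   "dsp": 126, "bram": 126},
    "Virtex 5": {"slices": 37440, "luts": 149760, "flip_flops": 149760,
                 "dsp": 1056, "bram": 516},
}

def fitting_devices(requirement):
    """Return the devices whose resources cover every entry of the requirement."""
    return [
        name for name, available in DEVICES.items()
        if all(requirement[r] <= available[r] for r in requirement)
    ]

# Hypothetical requirement sets, for illustration only.
small_design = {"slices": 5200, "dsp": 24, "bram": 120}
large_design = {"slices": 6500, "dsp": 28, "bram": 500}

print(fitting_devices(small_design))
print(fitting_devices(large_design))
```

In this illustration the block RAM count alone is enough to rule out the smaller device for the second requirement set.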
2.5 Conclusion
This mathematical model presents a novel technique that helps an algorithm
designer map an algorithm onto the available architectural options; by
adjusting the weightages of the different resources, the best-fit target FPGA is also
identified. The complex example of a WCDMA receiver has been discussed: with the
given throughput requirement at each stage, the design maps onto the Spartan
3A FPGA under the option 1 constraints and onto the Virtex 5 FPGA under the option 2 constraints.
The implementation results and the LP_Solve solution confirm the novelty of the
algorithm. A system can be fitted into the FPGA design space through
fine adjustment of the constraints, and by carefully adjusting the constraints low
power solutions are realizable. Another application of this model could be
modern software defined jammers, which have almost the same complex
components with a few additions for the spectrum search.
2.6 References
1. Vinoo Sumeri and Ranga Venuri, “Throughput optimization with design space
exploration during partitioning of multi FPGA Architectures,” Laboratory for
Digital Design Environment.
2. Alastair M. Smith, George A. Constantinides, and Peter Y. K. Cheung, “FPGA
Architecture Optimization using Geometric Programming.”
3. Ognjen Šcekic, “FPGA comparative analysis,” pages 2 – 140.
4. J. Lamoureux and S. J. E. Wilton, “On the Interaction between Power-Aware
FPGA CAD Algorithms,” IEEE International Conference on Computer-Aided
Design, Nov. 2003.
5. M. French, L. Wang, T. Anderson, M. Wirthlin, “Integrated Tool Suite for Post
Synthesis FPGA Power Consumption Analysis,” Military and Aerospace
Programmable Logic Devices (MAPLD) International Conference, Washington,
D.C., September 2005.
6. B. Hutchings, P. Bellows, J. Hawkins, S. Hemmert, B. Nelson, “A CAD Suite for
High Performance FPGA Design,” Field Customizable Computing Machines,
1999.
7. L. Shang, A. Kaviani, and K. Bathala, “Dynamic Power Consumption in
Virtex-II FPGA Family,” FPGA ’02, Monterey, California, February 2002.
8. M. French, L. Wang, T. Anderson, and M. Wirthlin, “Post Synthesis-Level
Power Estimation for FPGAs,” IEEE Symposium on Field-Programmable
Custom Computing Machines, April 2005.
9. L. Wang, M. French, A. Davoodi, D. Agarwal, “FPGA Dynamic Power
Minimization Through Placement and Routing Constraints.”
10. A public domain version of LP_Solve is maintained by the Open Source
Community at the URL: http://sourceforge.net/projects/lpsolve
11. LP_Solve Mixed Integer Linear Programming (MILP) solver, originally
developed by Michel Berkelaar in ANSI C as non-public domain software,
available via anonymous FTP at ftp://ftp.es.ele.tue.nl/pub
12. http://www.xilinx.com/support/documentation/data_sheets/ds610.pdf
Hardware Mapping on FPGA 2013
Chapter 3
Hardware Mapping on FPGA
3.1 Overview
Latest generation FPGAs [1], [2] have high integration densities and a large number of
dedicated resources for processing and storage at high clock speeds. These features
make them an attractive choice for mapping complex DSP algorithms to achieve the desired
performance.
An algorithm designer focuses mainly on the optimization of performance
parameters such as area, power and timing delays [3], [4]. If the algorithm bit width is
correctly mapped onto the bit handling capacity of the resources available on the target device,
the complete design area can be taken as the sum of the individual component areas. In
the same way, the overall power consumption can be computed as the sum of the switching
power of all the input signals and the mean power consumption of each functional unit
(FU) [5], [6], [7], [8], [9], [10]. To optimize cost in terms of resource usage, the intrinsic
features of the FPGA architecture must be included and properly modeled within the
optimization process [11], [12], [13], [14].
The latest FPGAs have built-in specialized blocks. When the algorithm is correctly mapped,
in terms of the internal pipelining of the multiplier compressor, onto the DSP block of the FPGA,
it results in reductions in design cost [15], [16], [17], [18] and design time. The designer
must therefore understand the resources available on the target device in order to map
the algorithm for optimum performance.
Trading off world length with optimized area 2013
3.2 Hardware resources available on FPGA
The hardware resources available on the FPGA play a vital role in mapping the
algorithm onto the target device. For all practical purposes the Xilinx Virtex-5 FPGA is
considered here, and a few of the resources available on it are discussed
below:
3.2.1 Express Fabric technology
Express Fabric technology is based on a 6-input LUT architecture and routing.
The combination of carry chains/dedicated multiplexers, Look-Up Tables (LUTs)
and Flip-Flops (FFs) determines the efficiency and performance of implementing
arithmetic and logic functions. The Virtex-5 family has a fully independent (not
shared) 6-input LUT as shown in Figure 3.1.
Figure 3.1 Block Diagram of a Virtex-5 6-Input LUT
The LUT input architecture is the determining factor in minimizing the critical
path delay, which ultimately represents the performance of the logic fabric. To
minimize the critical path, the algorithm has to be mapped exactly onto the 6-input
LUT; otherwise the wider-input LUTs are used inefficiently and the
die size, which determines the area, also increases.
3.2.2 Routing and interconnect Architecture
Interconnect timing delays, which can account for more than 50% of the critical
path delay, are minimized in the Virtex-5 FPGA by changing the interconnect pattern.
The diagonally symmetric interconnect pattern enhances performance through
a reduction in the places vs hop ratio and an improvement in the connections vs
hop ratio. This design helps in finding the optimal routes.
3.2.3 Block RAMs
RAM blocks are used for on-chip data storage. The block RAM base size in
the Virtex-5 family is double that of the Virtex-4 family, which has resulted
in deeper pipelining, larger memory arrays and the use of a full RAM as two half
RAMs. The Virtex-5 block RAM, when operated in Simple
Dual Port mode, effectively doubles the block RAM bandwidth. The enhanced block RAM is
well suited to maximizing performance and managing power.
3.2.4 Clock management
These blocks are used for synthesizing various clock signals. Being dedicated,
they boost the internal performance while increasing the board system
frequency.
3.2.5 Dedicated MAC modules
The Virtex-5 family introduced the DSP48E slice, a new DSP slice that has
an enhanced multiplier width (25 x 18), an independent C register and logic unit
functionality. The family also offers a dedicated hardware central processing unit in the
form of a hard PowerPC core. If the bit width of the algorithm is exactly mapped onto this
DSP slice, the desired area and timing performance can be achieved.
3.3 Look Up Table (LUT)
A LUT in an FPGA is an array of interconnected programmable logic blocks (transistors).
These programmable logic blocks are programmed to switch on or off, which interconnects
the wires; a large number of these blocks can be wired in this way. Input to and output from the
FPGA is via special I/O pads which contain sequential logic circuitry.
The Virtex-5 architecture has a true 6-input LUT with dual-LUT capability. It provides a total of
64 bits of logic programming space and 6 independent inputs, so any function of 6
inputs, and numerous combinations of one or two smaller functions, can easily be
implemented.
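The statement that 64 bits of programming space cover any function of 6 inputs can be illustrated with a small software model. This is a behavioural sketch only: the function names are illustrative, the 64-bit word is simply a truth table indexed by the input bits, and no vendor primitive or tool is being modelled.

```python
def make_lut6(func):
    """Build the 64-bit truth table of a 6-input Boolean function.

    Bit number addr of the result holds func evaluated on the six bits of addr
    (input 0 is the least significant address bit).
    """
    table = 0
    for addr in range(64):
        bits = [(addr >> i) & 1 for i in range(6)]
        if func(*bits):
            table |= 1 << addr
    return table

def lut6_read(table, inputs):
    """Look up the output for six input bits (each 0 or 1)."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return (table >> addr) & 1

# Example: a 6-input parity (XOR) function stored in one LUT.
parity_table = make_lut6(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)
print(hex(parity_table))
print(lut6_read(parity_table, [1, 0, 1, 1, 0, 0]))
```

Any other 6-input function differs only in the 64-bit word, which is why a function wider than 6 inputs, or one that uses fewer inputs than the LUT offers, maps less efficiently onto the fabric.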
Figure 3.2 Look up table showing programmable I/O blocks
The 6-input LUT also includes associated carry logic, MUXs, and a flip-flop as shown in
Figure 3.2.
3.4 Digital signal processor (DSP 48E)
The DSP48E is the digital signal processing slice in the Virtex-5 FPGA. By using several slices
together, efficient digital filters can be realized. If design styles as shown in Figure 3.3 are
incorporated, substantial savings can result [20][Xilinx.com].
Figure 3.3 Internal Architecture of digital signal processor (DSP 48) showing the Registers and carry
chain
Pipelining of DSP algorithms is often required to achieve the desired performance and power
characteristics. The DSP48E slice has three pipelining stages; when it is used as
a multiplier with all the stages utilized, performance is guaranteed. Enabling the
MREG shown in Figure 3.3 results in a saving of almost 15% of the overall
slice.
The resources discussed above are part of almost every recent FPGA. A designer
realizing a DSP algorithm therefore has to map the algorithm onto the resources available
on the FPGA in order to save resources and improve the efficiency of the algorithm.
3.5 References
[1] Altera Corp. www.altera.com.
[2] Xilinx Inc. http://www.xilinx.com.
[3] A. Nayak, M. Haldar, A. Choudhary, and P. Banerjee. Accurate Area and Delay
Estimators for FPGAs. In Proc. Design, Automation and Test in Europe
Conference and Exhibition, 2002.
[4] C. Brandolese, W. Fornaciari, and F. Salice. An Area Estimation Methodology for
FPGA Based Designs at SystemC-Level. In Proc. Design Automation
Conference, pages 129–132, 2004.
[5] S. Bilavarn, G. Gogniat, and J.L. Philippe. Area Time Power Estimation for FPGA
Based Designs at a Behavioral Level. In Proc. Int. Conf. on Electronics, Circuits
and Systems, volume 1, pages 524–527, 2000.
[6] J.A. Clarke, A.A. Gaffar, and G.A. Constantinides. Parameterized Logic Power
Consumption Models for FPGA-based Arithmetic. In Proc. Int. Conf. on Field
Programmable Logic and Applications, pages 626 – 629, 2005.
[7] J.A. Clarke, A.A. Gaffar, and G.A. Constantinides. Fast Word-Level Power Models
for Synthesis of FPGA-Based Arithmetic. In Proc. IEEE Int. Symp. on Circuits and
Systems, pages 1299–1302, 2006.
[8] R. Jevtic, C. Carreras, and G. Caffarena. High-level Switching Activity Models for
Multipliers in FPGAs. In Proc. ACM/SIGDA Int. Symp. on Field Programmable
Gate Arrays, pages 224–225. ACM Press, 2007.
[9] R. Jevtic, C. Carreras, and G. Caffarena. Switching Activity Models for Power
Estimation in FPGA Multipliers. In Proc. Int. Workshop on Applied
Reconfigurable Computing, pages 201–213, 2007.
[10] C.S. Bouganis, G.A. Constantinides, and P.Y.K. Cheung. A Novel 2D Filter
Design Methodology for Heterogeneous Devices. In Proc. IEEE Symposium on
Field-Programmable Custom Computing Machines, 2005.
[11] G. Caffarena, J. A. López, C. Carreras, and O. Nieto-Taladriz. High-Level Synthesis
of Multiple Word-Length DSP Algorithms using Heterogeneous-Resource FPGAs.
In Proc. Field Programmable Logic and Applications, pages 675–678, 2006.
[12] G. Caffarena, J. A. López, C. Carreras, and O. Nieto-Taladriz. Optimized
Implementation of DSP Cores on FPGAs Using Logic-based and Embedded
Resources. In Symp. on System-on-Chip, pages 103–106, 2006.
[13] D. Chen and J. Cong. Register Binding and Port Assignment for Multiplexer
Optimization. In Proc. IEEE Asilomar Conf. on Signals, Systems and Computers,
volume 1, pages 68–73, 1994.
[14] P. Metzgen and D. Nancekievill. Multiplexer Restructuring for FPGA
Implementation Cost Reduction. In Proc. Design Automation Conference, pages
421 – 426, 2005.
[15] H.A. Atat and I. Ouaiss. Register Binding for FPGAs with Embedded Memory. In
Proc. IEEE Symp.on Field-Programmable Custom Computing Machines, pages
165–175, 2004.
[16] G.W. Morris, G.A. Constantinides, and P.Y.K. Cheung. Using DSP Blocks for
ROM Replacement: A Novel Synthesis Flow . In Proc. Int. Conf. Field
Programmable Logic and Applications, pages 77–82, 2005.
[17] S.J.E. Wilton. Implementing Logic in FPGA Memory Arrays: Heterogeneous
Memory Architectures. In Proc. IEEE Int. Conf. on Field-Programmable
Technology, 2002.
[18] X. Liang, J.S. Vetter, M.C. Smith, and A.S. Bland. Balancing FPGA Resource
Utilities. In Proc. Int. Conf. on Eng. of Reconf. Systems and Algorithms, pages
156–162, 2005.
Chapter 4
Trading off word length with optimized area
4.1 Overview
In a digital signal processing system, the number of bits processed at a time is
a major source of resource wastage. The word-lengths of variables are
selected to meet the application's output error tolerance. The designer's aim is to
determine a word length at which the cost and the output distortion meet
criteria that depend on the application under consideration.
Figure 4.1 (a) I/O system with multiple inputs and outputs (b) Optimal word length: cost vs distortion
tradeoff
Consider an I/O system comprising M inputs, N outputs, an internal variable S and
desired quantization Q, as shown in Figure 4.1(a). For the quantization error T to
stay within given limits, the requirement is to determine the sizes of the different variables and
states that give the desired quantization error with minimum hardware (word length,
registers, multipliers etc.).
For an algorithm-specific quantization error, the widths of the input variables constrain the
size of Q. This relationship can be determined empirically, which helps in
setting the optimal bit widths of the different variables to achieve the desired Q. The example
of the CORDIC algorithm discussed below analyzes the effect on Q of varying
the bit width of the input variables. To achieve the optimal word length, fixed-point arithmetic
is used for the implementation. The tradeoff is shown in Figure 4.1(b): the cost and
distortion curves show that a longer word length may improve application
performance but at an increased hardware cost, whereas a shorter word
length reduces the hardware cost but may increase quantization errors and
overflows [1] [2]. The aim is to find an optimal point at which the performance
of the application is maximized with minimum hardware cost and minimum quantization
error. The outcome of this research is an algorithm for word length optimization; the
details are discussed below.
4.2 Proposed Algorithm
The algorithm for word-length optimization is an eight-step process. The details are as
follows:
4.2.1 Format conversion
This is the first step, in which format conversion is carried out. The floating-point
input is converted to fixed point for implementation in hardware.
4.2.2 Insertion of accuracy handlers
Since the conversion introduces quantization noise, accuracy handlers are inserted to log the
quantization error, in order to find an optimal point in the design space that minimizes
area while maintaining the required quantization performance.
4.2.3 Modeling of HW utilization for the selected word length
The system is analyzed for different word lengths based on the resources available
on a particular architecture; then, based on the quantization error
constraint on the output, the area, timing and word length are selected. This selection
varies from application to application.
4.2.4 Iterating on word length
Iterations are carried out to explore the design space. These iterations usually require
changing the inputs bit by bit. The outputs are analyzed for the desired level of
performance.
4.2.5 Analysis of percentage increase in area, decrease in timing
and percentage increase in accuracy
Analysis is carried out for different sets of word lengths to indicate the increase in
timing, performance and area, and the reduction in quantization error.
4.2.6 Design Space Determination
The exhaustive search is reduced by finding the relationship between the word
lengths of the different input signals. There is usually a strong relationship among the
input signals that governs the quantization performance of the output signals. These
relationships can be extracted from the mathematical dependencies of the inputs
and outputs or, for highly complex algorithms, determined empirically by
running the algorithm for different word lengths.
4.2.7 Design space exploration
In exploring the design space, an optimum point is sought beyond which any increase in
the word length has the least effect on the quantization error and which offers the best tradeoff for
area and resource usage.
4.2.8 Selecting the appropriate word length that offers best
accuracy, area and timing tradeoff
The optimum point identified in the previous step gives the selected word length.
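The steps above can be sketched in a few lines of Python. This is an illustrative sketch under simplifying assumptions: a sampled sine wave stands in for the signal, only the number of fractional bits is varied, the mean square error is used as the logged quantization measure, and hardware cost is not modelled. All function names are hypothetical.

```python
import math

def quantize(value, frac_bits):
    """Format conversion: round a floating-point value to a fixed-point grid."""
    scale = 1 << frac_bits
    return round(value * scale) / scale

def mean_square_error(samples, frac_bits):
    """Accuracy handler: quantization error of a set of samples at one word length."""
    return sum((s - quantize(s, frac_bits)) ** 2 for s in samples) / len(samples)

def smallest_word_length(samples, tolerance, max_frac_bits=24):
    """Iterate on the word length, one bit at a time, until the error is tolerable."""
    error_log = {}
    for frac_bits in range(1, max_frac_bits + 1):
        error_log[frac_bits] = mean_square_error(samples, frac_bits)
        if error_log[frac_bits] <= tolerance:
            return frac_bits, error_log
    return None, error_log

# Stand-in signal: one period of a sine wave.
samples = [math.sin(2 * math.pi * k / 64) for k in range(64)]
bits, log = smallest_word_length(samples, tolerance=1e-6)

for frac_bits, err in log.items():
    print(frac_bits, f"{err:.3e}")
print("selected fractional bits:", bits)
```

The logged errors shrink with every added bit, so the first word length that meets the tolerance is also the cheapest one in this simplified setting.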
4.3 Design Example
To illustrate the proposed algorithm, the CORDIC algorithm has been
implemented.
4.3.1 CORDIC Algorithm
The CORDIC (Coordinate Rotation Digital Computer) algorithm is used for the generation
of digital sine and cosine [3] [6]; this digital transformation is achieved by
iterating the equations recursively. The accuracy of the algorithm is, however, proportional
to the bit width of the angle $d\Phi$. A vector $(a_1, b_1)$ is transformed into a
new vector $(a_2, b_2)$. In equation form this can be represented as:
$a_2 = a_1 \cos(\Phi) - b_1 \sin(\Phi)$ (1)
$b_2 = a_1 \sin(\Phi) + b_1 \cos(\Phi)$ (2)
where
$\Phi_2 = \Phi_1 + d\Phi$ (3)
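The rotation can be illustrated with the standard shift-and-add form of CORDIC. The sketch below is the textbook rotation-mode iteration in integer arithmetic and is not necessarily the exact structure implemented in this work; the iteration count, the number of fractional bits and the function name are assumptions chosen for illustration.

```python
import math

def cordic_rotate(a1, b1, angle, iterations=16, frac_bits=16):
    """Rotate the vector (a1, b1) by angle (radians, about -1.74 to 1.74).

    Uses only shifts, additions and a small arctangent table, all on integers
    scaled by 2**frac_bits; the constant CORDIC gain is removed at the end.
    """
    scale = 1 << frac_bits
    x = round(a1 * scale)
    y = round(b1 * scale)
    z = round(angle * scale)            # residual angle still to be rotated
    atan_table = [round(math.atan(2.0 ** -i) * scale) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    for i in range(iterations):
        if z >= 0:                      # rotate counter-clockwise
            x, y, z = x - (y >> i), y + (x >> i), z - atan_table[i]
        else:                           # rotate clockwise
            x, y, z = x + (y >> i), y - (x >> i), z + atan_table[i]
    return x / scale / gain, y / scale / gain

a2, b2 = cordic_rotate(1.0, 0.0, math.pi / 6)
print(a2, math.cos(math.pi / 6))
print(b2, math.sin(math.pi / 6))
```

Rotating the unit vector (1, 0) yields the cosine and sine of the angle, which is how the algorithm serves as a sine and cosine generator; lowering frac_bits or the iteration count shows the loss of accuracy directly.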
4.3.2 CORDIC Modeling
MATLAB has been used for the design, modeling and simulation of the
CORDIC algorithm. The built-in quantization functionality of MATLAB [7] - [9] has been
used to map the algorithm arithmetic from floating point to fixed point.
Figure 4.3, Figure 4.4 and Figure 4.5 show the LMS error vis-a-vis
the bit width resolution. The analysis follows a trend according to which the least
mean square error is minimized (almost approaching zero) when the relation in Equation (4)
holds.
Bit width of input A, B > Bit width of angle Φ (4)
Any increase in bit width beyond a certain point has minimal effect on the reduction of
the LMS error but a drastic effect on the hardware complexity; this
conclusion is illustrated in Figure 4.2.
Figure 4.2 Effects of increase in bit width on hardware complexity and on LMS error in the design
space
It follows that, for the CORDIC algorithm, the condition below gives
minimum hardware utilization with minimum quantization error.
Min {bit width (X, Y) > (bit width (Φ) minus 2)} = Min least mean square error
Figure 4.3 Bit resolution vs LMS error where bit width of X, Y = bit width of Φ
Figure 4.4 Bit resolution vs LMS error where bit width of X, Y < bit width of Φ
Figure 4.5 Bit resolution vs LMS error where bit width of X, Y > bit width of Φ
After the effect of increasing bit width on the least mean
square error within the design space had been analyzed, the same algorithm was explored to
analyze the effect of bit width resolution on the hardware complexity. The CORDIC algorithm
has been implemented in ModelSim, followed by synthesis with the Xilinx tools.
4.3.3 CORDIC Synthesis on XILINX
ModelSim was used for the implementation of the CORDIC algorithm and Xilinx
software was used for its synthesis. The 1's complement value of angles ranging
from 0 to π was used as the system input, and different iterations were realized to
reduce the LMS error.
Table 4.1a and Table 4.1b show the results of synthesis. In Table 4.1a the bit width of X
and Y was varied from 10 bits to 30 bits with the bit resolution of angle Φ kept fixed
at 9 bits, whereas in Table 4.1b the bit width of X and Y was kept fixed and the bit
resolution of angle Φ was varied from 9 bits to 16 bits.
Table 4.1a Synthesis results with bit resolution of X, Y varied and Φ fixed
Selected device: v50fg256-6
Serial  Device utilization  Bit resolution (X and Y, Φ)
                            10,9  11,9  12,9  13,9  14,9  15,9  16,9  17,9
1       No. of slices       67    77    85    93    98    107   108   124
2       No. of registers    43    48    53    44    50    57    53    60
3       No. of IOs          33    35    37    39    41    43    45    49
Table 4.1b Synthesis results with bit resolution of X, Y fixed and Φ varied
Selected device: v50fg256-6
Serial  Device utilization  Bit resolution (X and Y, Φ)
                            20,9  20,10  20,11  20,12  20,13  20,14  20,15  20,16
1       No. of slices       163   164    164    165    166    167    167    168
2       No. of registers    74    75     76     77     78     79     80     81
3       No. of IOs          61    62     63     64     65     66     67     68
The variation in resource utilization for different bit width selections of X, Y and
Φ is shown in Figure 4.6 and Figure 4.7 respectively.
[Bar chart of slices, registers and IOs (y-axis 0 to 140) against the bit resolution of X, Y, Φ (x-axis 10,10,9 to 16,16,9)]
Figure 4.6 Analysis of no. of slices, registers and IOs with bit resolution of X, Y (varying) and Φ (fixed)
[Bar chart of slices, registers and IOs (y-axis 0 to 200) against the bit resolution of X, Y, Φ (x-axis 20,20,9 to 20,20,15)]
Figure 4.7 Analysis of no. of slices, registers and IOs with bit resolution of X, Y (fixed) and Φ (varying)
4.3.4 Experimental Results
The experimental results show an increase in resource utilization with increasing bit
width resolution of X, Y and Φ. The LMS error, however, decreases where the condition
bit resolution of X, Y > bit resolution of Φ holds. For the CORDIC
algorithm the ideal bit width is 11 bits for X and Y and 9 bits for Φ. The resource utilization at
this bit width selection is tabulated in Table 4.2.
Table 4.2 Device Utilization of CORDIC
Selected device: v50fg256-6
Serial  Resource utilization  Bit width resolution (X, Y, Φ) = 11,11,9
1       No. of slices         77
2       Slice flip-flops      48
3       IOs                   35
4.4 Conclusion
Fixed point arithmetic is used for mapping most FPGA designs because of the high
complexity and cost of floating point hardware. For all practical purposes, the bit
resolution of the input variables should be greater than the bit resolution of the angle when
CORDIC is used as a Direct Digital Frequency Synthesizer (DDFS).
4.5 References
[1] L.W. Couch II, Modern Communication Systems, Prentice Hall, 1994.
[2] L.K. Tan, et al. "An 800-MHz quadrature digital synthesizer," IEEE JSSC, vol.
30, N 12, pp.1463-1473, 1995.
[3] J.E. Volder, "The CORDIC trigonometric computing technique," IRE
Transactions on Electronic Computers, vol. EC-8, pp. 330-334, 1959.
[4] V.F. Kroupa, "Spectral Properties of DDFS: Computer Simulations and
Experimental Verifications," IEEE International Frequency Control Symposium,
pp.613-23, 1994.
[5] M.J. Flanagan, G.A. Zimmerman, "Spur-reduced digital sinusoid synthesis," IEEE
Trans. Comm., vol. 43, No. 7, pp. 2254-2262, 1995.
[6] C.M. Rader, "VLSI systolic arrays for adaptive nulling," IEEE Signal Processing
Magazine, 1996.
[7] Mathworks Corp, MATLAB Technical Computing Environment,
www.Mathworks.com, Jan. 2003.
[8] L. Presti, G. Cardamone, "A direct digital frequency synthesizer using an IIR
filter implemented with a DSP microprocessor," IEEE ICASSP-94, vol. 3, 1994
[9] E. Grayver, B. Daneshrad, "Reconfigurable Signal Processing ASIC
Architecture for High Speed Data Communications," ISCAS 98, June 1998
Optimizing Bit Serial Architecture 2013
Chapter 5
Optimizing Bit Serial Architecture
5.1 Overview
Bit serial architectures are an attractive choice for applications where data I/O is on a
serial interface. Many high speed serial interfaces are in everyday use (such as the
Telecom Serial Interface Port (TSIP) and the DSP serial peripheral interface).
In these applications it is very tempting to use the serial clock to execute
the design. This requires innovative designs that can work on a bit by bit basis. This
section presents two designs of considerable complexity to demonstrate the feasibility
of mapping algorithms on serial architectures: an adaptive filter application and the
CORDIC algorithm. As the multiplier and adder are basic components in most
signal processing applications, their architectures are discussed first and are then
used in the complex examples to show the effect of efficient
component design on the overall application.
Bit-serial arithmetic reduces pin count, floor space and wire length requirements in
VLSI designs. However, performing bit-serial arithmetic poses challenging design and
implementation problems. Research in bit-serial arithmetic using conventional binary
representations has focused on the design of multipliers and squarers [15] - [18].
5.2 Bit Serial Multiplication
Bit serial multiplication can be performed either by the serial-serial multiplication
technique or by the serial-parallel multiplication technique. We have used the serial-serial
technique to realize a triangular compressor which performs efficient bit
serial multiplication.
Background research on serial multiplication reveals that significant work has been
done in the past. R. F. Lyon [1] described a very efficient pipelined two's complement
serial multiplier; its drawback was that it was heavy on resources. H. J. Sips [2] and
N. R. Strader and V. T. Rhyne [3] focused on the multiplication of unsigned numbers
and designed a modular full precision bit serial multiplier. R. Gnanasekaran [4] developed
a complex multiplication scheme which automatically caters for the negative weight of the
most significant bit of the operands in the two's complement representation. Rhyne and
Strader [5] presented a Booth recoded multiplication scheme in which n
identical cells produce a 2n-bit product, but this design resulted in unnecessary complexity
[6]. A few serial/parallel implementations were also presented by Gnanasekaran [7].
Denyer and Renshaw used the modified Booth's algorithm [8] and designed an NMOS
serial multiplier which utilized multiplier cells [9]. Kanopoulos presented a bit serial 3 x 3
matrix/vector multiplier [10]. After reviewing the serial multipliers which have
been designed and implemented, and keeping in view our requirement of handling
video streaming, which involves bit serial multiplication, a bit serial multiplication
Optimizing Bit Serial Architecture 2013
45
algorithm was realized based on the serial-serial multiplication technique. Figure 4.1
illustrates multiplication of two [12] numbers, both the numbers have a bit width of eight
x bits. As a result of multiplication eight x partial products (PP_0 - PP_7) are generated
as shown below:-
                                   A7   A6   A5   A4   A3   A2   A1   A0
                                   B7   B6   B5   B4   B3   B2   B1   B0
                                   A7B0 A6B0 A5B0 A4B0 A3B0 A2B0 A1B0 A0B0  PP_0
                              A7B1 A6B1 A5B1 A4B1 A3B1 A2B1 A1B1 A0B1       PP_1
                         A7B2 A6B2 A5B2 A4B2 A3B2 A2B2 A1B2 A0B2            PP_2
                    A7B3 A6B3 A5B3 A4B3 A3B3 A2B3 A1B3 A0B3                 PP_3
               A7B4 A6B4 A5B4 A4B4 A3B4 A2B4 A1B4 A0B4                      PP_4
          A7B5 A6B5 A5B5 A4B5 A3B5 A2B5 A1B5 A0B5                           PP_5
     A7B6 A6B6 A5B6 A4B6 A3B6 A2B6 A1B6 A0B6                                PP_6
A7B7 A6B7 A5B7 A4B7 A3B7 A2B7 A1B7 A0B7                                     PP_7
P14  P13  P12  P11  P10  P9   P8   P7   P6   P5   P4   P3   P2   P1   P0
Figure 5.1 Multiplication of two numbers having a bit width of 8 bits each
Figure 5.1 illustrates the bit serial multiplication. As the first bits A0 and B0 arrive
for the multiplication, a dot product takes place and results in A0B0, which is P0 (the
LSB of the final product P), along with a carry out. As the cycles continue, the serial
multiplication proceeds with the arrival of each successive bit of A and B, as
illustrated in the figure above.
In each cycle the number of new terms increases, following the trend 2n+1, where n is
the cycle number counted from n = 0. The arrangement of terms in Figure 5.1 forms a
triangle, which is why the technique is termed the triangular compression technique.
Figure 5.2 shows the multiplication in dot notation.
Figure 5.2 Serial Compression of two numbers illustrated in dot notation
The detailed working and the partial product generation in each cycle are shown in
Figure 5.3; the complexity of this algorithm is O(n).
Figure 5.3 Compression cycles for serial multiplication shown in dot notation
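The way terms become available cycle by cycle can be sketched in a few lines of Python. This is only an illustrative software model under stated assumptions, not the thesis hardware: the function name `serial_serial_multiply` is hypothetical, inputs are taken as unsigned, and bits are assumed to arrive LSB first, one pair per cycle.

```python
def serial_serial_multiply(a, b, width):
    """Model serial-serial multiplication of two unsigned `width`-bit integers.

    In cycle n the bits a_n and b_n arrive. They can be combined with each
    other and with every bit received earlier, giving 2n + 1 new terms.
    """
    a_bits, b_bits = [], []       # bits received in earlier cycles
    product = 0
    new_terms_per_cycle = []
    for n in range(width):
        a_n = (a >> n) & 1
        b_n = (b >> n) & 1
        # Each term is (bit value, binary weight).
        terms = [(a_n & b_n, 2 * n)]
        for j in range(n):
            terms.append((a_n & b_bits[j], n + j))
            terms.append((a_bits[j] & b_n, n + j))
        new_terms_per_cycle.append(len(terms))
        product += sum(bit << weight for bit, weight in terms)
        a_bits.append(a_n)
        b_bits.append(b_n)
    return product, new_terms_per_cycle
```

For two four-bit inputs the model reports 1, 3, 5 and 7 new terms in successive cycles, which is the 2n+1 trend with n counted from zero.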
5.3 Algorithm for Bit wise Serial Multiplication
An algorithm for bit wise serial multiplication is described here; the algorithm can be
mapped on any bit wise serial multiplication architecture.
The algorithm is as follows:
Algorithm
INPUT: A, B
OUTPUT: X (product bits p[i])
INITIALIZE: A[i] = 0 and B[i] = 0 for i > W-1 (where W is the width of the input)
            carry[i][j] = 0 and sum[i][j] = 0 for all i, j
Generation of Terms
begin
  for i = 0 to W-1
  begin
    for j = 0 to W-1
    begin
      (A[i] & B[j]) + carry[i][j] + sum[i-1][j+1] = 2*carry[i][j+1] + sum[i][j];
    end
    p[i] = sum[i][0];
  end
  for i = W to 2W-1
    p[i] = sum[W-1][i-W+1];
end
Triangular compression
begin
  for i = 0 to W-1
  begin
    for j = 0 to W-1
    begin
      {carry[i+1], product[i-1]} = cycle[i][j] + cycle[i+1][j] + product[i];
    end
  end
end
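One possible software reading of the compression step is sketched below in Python. It is an assumption-laden illustration rather than the thesis design: the helper names `partial_product_columns` and `compress_columns` are hypothetical, and the sketch simply sums each weight column together with the carry from the column below it, emitting one product bit per column.

```python
def partial_product_columns(a, b, width):
    """Group the one-bit terms A_i & B_j by their binary weight i + j."""
    columns = [[] for _ in range(2 * width - 1)]
    for i in range(width):
        for j in range(width):
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    return columns


def compress_columns(columns):
    """Reduce the columns to product bits (LSB first), rippling the carry upward."""
    product_bits = []
    carry = 0
    for column in columns:
        total = sum(column) + carry
        product_bits.append(total & 1)   # bit that stays in this column
        carry = total >> 1               # excess moves to the next column
    while carry:                         # flush any remaining carry bits
        product_bits.append(carry & 1)
        carry >>= 1
    return product_bits
```

With A = 0101 and B = 1111 the sketch returns the bits of 1001011, LSB first.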
Figure 5.4 Serial Multiplication Input to Triangular Compressor
5.4 Design Example of Bit Serial Multiplier
For a better understanding of the algorithm, an example of the proposed multiplier is
discussed. Let A and B be two four-bit serial inputs such that
A = 0101
B = 1111
      0101   A
    x 1111   B
   -------
      0101   pp_1
     0101    pp_2
    0101     pp_3
   0101      pp_4
   -------
   1001011   X
Figure 5.5 Multiplication of two four-bit numbers
The step by step multiplication, following the algorithm discussed above, is as follows:
1. The dot product of the first bit of A and the first bit of B results in the first bit
of X and a carry out, as illustrated in Figure 5.6.
2. The dot product of the second bit of A and the second bit of B results in the second
bit of X and no carry forward, as indicated in Figure 5.6.
3. The dot product of the third bit of A and the third bit of B results in the third bit
of X and a carry forward, as shown in Figure 5.6.
Figure 5.6 (a) Bit wise dot product of first bit of A and B (b) Bit wise dot product of second bit of A and B
(c) Bit wise dot product of third bit of A and B (d) Bit wise dot product of fourth bit of A and B
The partial products of the fourth cycle and the carry of the third cycle yield the final
product X, as shown in Figure 5.6.
5.5 Architecture
The architecture of the bit wise serial multiplier is shown in Figure 5.7. The bit wise
serial output is available in the cycle immediately after the input is received serially.
Since the bit width of each serial input is eight bits, the final product is sixteen bits,
of which the first eight bits (the most significant half) contain most of the
information. To keep the output limited to eight bits, the last eight bits of the output
are truncated. This results in a loss of precision, but for applications such as video
streaming the loss is tolerable, and the saving in hardware resources vis-a-vis the
precision compromised is large.
Figure 5.7 Bit Serial Compressor Based Multiplication Architecture showing the input X and Y, output P,
cycle tracker, terms generator and triangular serial compressor
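The size of the precision loss caused by truncation can be checked exhaustively for eight-bit operands. The Python sketch below is only an illustration; it assumes unsigned operands and assumes that the retained half is the most significant eight bits of the sixteen-bit product. The function names are hypothetical.

```python
def truncated_product(a, b, width=8):
    """Keep only the upper `width` bits of the 2*width-bit product."""
    return (a * b) >> width


def max_truncation_error(width=8):
    """Largest difference between the full product and its truncated version."""
    worst = 0
    for a in range(1 << width):
        for b in range(1 << width):
            full = a * b
            approx = truncated_product(a, b, width) << width
            worst = max(worst, full - approx)
    return worst
```

Under these assumptions the discarded part is always below 2^8, i.e. less than one unit of the retained eight-bit result.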
5.6 Implementation and Results
The efficiency of the proposed bit serial multiplier is compared with that of a
conventional bit serial multiplier [11]. Both designs were implemented on FPGA and the
implementation results are shown in Table 5.1. The results show about 38% saving in the
number of look up tables, 30% saving in the number of flip flops and a 25% increase in
the operating frequency.
Table 5.1 Implementation Results
                                          Look Up Tables (Numbers)   Flip Flops (Numbers)       Clock Frequency (MHz)
                                          Xilinx      Altera         Xilinx      Altera         Xilinx      Altera
                                          Virtex5     Stratix-III    Virtex5     Stratix-III    Virtex5     Stratix-III
Conventional bit wise serial multiplier   13          11             12          12             454         656
Proposed bit wise serial multiplier       9           8              8           8              565         840
5.7 The LMS FIR filter Using Bit Serial Compressor
In the least mean square (LMS) FIR filter, the filter output at any instant of time is a
weighted linear sum of the present and past K samples of the input signal.
Mathematically it can be represented as
$$x(i) = v^T(i)\,y(i) = \sum_{j=0}^{N-1} v_{ij}\,y_{i-j} \qquad (1)$$
where
$$y(i) = [y(i),\ y(i-1),\ y(i-2),\ \ldots,\ y(i-k+1)]^T$$
$$v(i) = [v_0(i),\ v_1(i),\ v_2(i),\ \ldots,\ v_{k-1}(i)]^T$$
Equation 2 and Equation 3 update the weights of the algorithm:
$$v(i+1) = v(i) + \mu\,e(i)\,y(i) \qquad (2)$$
$$e(i) = d(i) - x(i) \qquad (3)$$
Here d(i) is the signal which is used as the reference and \mu is the step size.
The filter computation and the adaptation require O(D) computation [13], and the
computation involves 2D additions and 2D+1 multiplications.
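The filter output, error and weight update can be sketched in floating-point Python as a system identification experiment. This is an illustrative model only and not the bit serial implementation: the function name `lms_identify`, the step size `mu = 0.05`, the noise-free reference and the uniformly distributed input are all assumptions made for the sketch.

```python
import random


def lms_identify(target, mu=0.05, steps=2000, seed=0):
    """Adapt weights v so that the filter output tracks a reference filter `target`."""
    rng = random.Random(seed)
    taps = len(target)
    v = [0.0] * taps        # adaptive weights
    window = [0.0] * taps   # present and past input samples
    for _ in range(steps):
        # Shift in a new input sample.
        window = [rng.uniform(-1.0, 1.0)] + window[:-1]
        d = sum(h * y for h, y in zip(target, window))   # reference signal d(i)
        x = sum(w * y for w, y in zip(v, window))        # filter output x(i)
        e = d - x                                        # error e(i)
        v = [w + mu * e * y for w, y in zip(v, window)]  # weight update
    return v
```

Calling `lms_identify([0.5, -0.3, 0.2])` drives the three weights towards the reference coefficients; each iteration uses one multiply-accumulate pass for the output and one for the update, in line with the operation count given above.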
Figure 5.8 LMS FIR Filter with serial i/p and o/p
Figure 5.8 shows a 3-tap LMS FIR filter; both i/p are bit wise serial. Three multipliers
form a part of the filter. Multiple instances of the proposed triangular compressor based
serial multiplier have been used to realize this filter. For the addition, the bit wise
serial adder discussed below has been implemented.
5.7.1 Bit Serial Adder
Figure 5.9 shows a bit wise serial adder which performs the operations as per
Equation 4 and Equation 5.
$$Sum = A \oplus B \oplus Carry_{in} \qquad (4)$$
$$Carry_{out} = AB + Carry_{in}\,(A \oplus B) \qquad (5)$$
Figure 5.9 Bit wise serial adder
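The behaviour of such an adder can be modelled in Python as a single full adder whose carry is stored between cycles. The sketch assumes LSB-first bit streams of equal length; the function names are hypothetical and serve only as an illustration.

```python
def to_bits(value, width):
    """Unsigned integer to a list of bits, LSB first."""
    return [(value >> k) & 1 for k in range(width)]


def from_bits(bits):
    """List of bits (LSB first) back to an unsigned integer."""
    return sum(bit << k for k, bit in enumerate(bits))


def bit_serial_add(a_bits, b_bits):
    """Add two LSB-first bit streams, one bit pair per cycle."""
    carry = 0                      # models the carry flip-flop
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sum_bits, carry
```

For example, adding 200 and 100 as eight-bit streams yields the sum bits of 44 with a final carry of 1, i.e. 300.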
5.8 LMS Filter Architecture
Figure 5.10 shows the architecture of the above filter as implemented. The architecture
comprises the proposed bit wise serial multipliers, bit wise serial adders, registers, an
error calculator and a filter weight adjuster which adjusts the weights based on the
difference between the final filter o/p and the reference signal.
Figure 5.10 Architecture of bit wise serial LMS filter composed of triangular compressors, serial adders,
error calculator and filter weight adjuster
5.9 Implementation and Results
Two versions of the LMS adaptive filter one utilizing the bit serial triangular compressor
based multiplier and the second utilizing the conventional bit serial multiplier were
implemented on FPGA. The results of both the filter versions are tabulated in Table 5.2.
Table 5.2 Implementation Results
                               Look Up Tables (Numbers)   Flip Flops (Numbers)       Clock Frequency (MHz)
                               Xilinx      Altera         Xilinx      Altera         Xilinx      Altera
                               Virtex5     StratixIII     Virtex5     StratixIII     Virtex5     StratixIII
Conventional adaptive filter   39          33             36          36             454         565
Proposed adaptive filter       27          24             8           24             656         840
The results show 38% saving in the number of Look Up Tables, 30% saving in the
number of Flip Flops and 25 % increase in the clock frequency.
5.10 References
[1] R. F. Lyon, “Two’s complement pipeline multipliers,” IEEE Trans.
Communication, vol. COM-24, no. 4, pp. 418-425, Apr. 1976.
[2] H. J. Sips, “Comments on ‘An O(n) parallel multiplier with bit sequential input and
output,’” IEEE Trans. Computer, vol. C-31, no. 4, pp. 325-327, Apr. 1982.
[3] N. R. Strader and V. T. Rhyne, “A canonical bit-sequential multiplier,” IEEE
Trans. Computer, vol. C-31, no. 8, pp. 791-795, Aug. 1982.
[4] R. Gnanasekaran, “On a bit-serial input and bit-serial output multiplier,” IEEE
Trans. Computer, vol. C-32, no. 9, pp. 878-880, Sept. 1983.
[5] T. Rhyne and N. R. Strader, II, “A signed bit-sequential multiplier,” IEEE Trans.
Computer, vol. C-35, no. 10, pp. 896-901, Oct. 1986.
[6] L. Dadda, “On serial-input multipliers for two’s complement numbers,” IEEE
Trans. Computer, vol. 38, no. 9, pp. 1341-1345, Sept. 1989.
[7] R. Gnanasekaran, “A fast serial-parallel binary multiplier,” IEEE Trans. Computer,
vol. C-34, no. 8, pp. 741-744, 1985.
[8] P. Denyer and D. Renshaw, VLSI Signal Processing: A Bit-Serial Approach,
Addison-Wesley, 1985.
[9] J. Newkirk and R. Mathews, The VLSI Designer’s Library. Addison- Wesley,
1983.
[10] N. Kanopoulos, “A bit-serial architecture for digital signal processing,” IEEE
Trans. Circuits Sys., vol. CAS-32, no. 3, pp. 289-291, 1985.
[11] C. W. Ng, N. Wong and T. S. Ng, “Efficient FPGA implementation of bit stream
multipliers,” Electronics Letters, online no. 20070293, Department of Electrical and
Electronic Engineering, The University of Hong Kong, 26 April 2007.
[12] Woon-Seng Gan and Sen M. Kuo, “Teaching DSP Software Development: From
Design to Fixed-Point Implementations,” IEEE Transactions on Education, Vol.
49, No. 1, February 2006.
[13] Daniel Allred, Venkatesh Krishnan, Walter Huang and David Anderson,
“Implementation of an LMS Adaptive Filter on an FPGA Employing Multiplexed
Multiplier Architecture,” Center for Signal and Image Processing, Georgia Institute of
Technology, Atlanta, GA 30332-0250.
[14] C. W. Ng, N. Wong and T. S. Ng, “Efficient FPGA implementation of bit stream
multipliers,” Electronics Letters, online no. 20070293, Department of Electrical and
Electronic Engineering, The University of Hong Kong, 26 April 2007.
[15] Dadda, L., “On Serial-Input Multipliers for Two’s Complement Numbers”, IEEE
Transactions on Computers, Vol. 38, No. 9, pp. 1341-1345, Sep. 1989.
[16] Denyer, P. and D. Renshaw, VLSI Signal Processing: A Bit-Serial Approach,
Addison-Wesley, 1985.
[17] Ercegovac, M.D. and T. Lang, Division and Square Root: Digit-Recurrence
Algorithms and Implementations, Kluwer, Boston, 1994.
[18] Strader, N.R. and V.T. Rhyne, “A Canonical Bit-Sequential Multiplier”, IEEE
Transactions on Computers, Vol. C-31, No. 8.
[19] Andraka R., “Building a high performance bit serial processor in an FPGA,”
On-Chip System Design Conference, North Kingstown, 1996.
[20] http://comparch.doc.ic.ac.uk/publications/files/osk00jvlsisp.ps
Optimization on FPGA Slice Fabric 2013
60
Chapter 6
Optimization on FPGA Slice Fabric
6.1 Overview
FPGAs are an essential part of almost every modern communication system that involves
software defined signal processing applications. The reason is that design algorithms
are tested for their performance in terms of accuracy, timing, complexity, area and
power consumption after being mapped on the FPGA. All these parameters are related to
the bit width processed at a time, which ultimately depends upon the architecture of
the FPGA, i.e. the available resources. A mismatch of even a single bit may degrade
performance by orders of magnitude; therefore, taking the built-in structure of the
target device into account at the design stage leads to optimal performance [1].
This work extends the application of the methods described in [2] [3] [4], which
resulted in a reduced critical path. By introducing multiple pipelining stages along
with some other techniques, optimization has been achieved in the design of different
digital filters. Compression trees play a vital role in the overall optimization; they
come in different configurations and can optimize the algorithm if selected properly.
The same is the case with pipelining and bit width reduction, depending upon the type of
optimization required. These techniques, when applied at the very beginning, i.e. at the
design stage, result in considerable optimization. The optimization techniques that map
on the structure of the FPGA are described below, as this is the first step towards
performance maximization.
6.2 Optimization Techniques vs FPGA architecture
The internal structure of the Virtex-5 FPGA is shown in Figure 6.1. The express fabric
consists of CLBs, each of which has multiple LUTs and a dedicated carry chain for high
speed data propagation.
Figure 6.1 6-input LUTs, CLBs and carry chain of Virtex-5 slice, exploded view
Different i/p and o/p patterns of the LUTs provide different options for implementing
combinational logic. From Figure 6.1 it is clear that the LUT is 6-i/p; for an optimized
implementation each available resource has to be used carefully, keeping in view the
data it can handle.
Figure 6.2 Virtex-5 FPGA DSP48 slice
The DSP slice of the Virtex-5 is shown in Figure 6.2. With a rated frequency of 550
MHz and a 25 x 18 bit resolution, this feature, if fully utilized, helps in the design
and prototyping of high-performance digital filters [5].
A detailed analysis of the Virtex-5 reveals that an algorithm based on multiplication
reduction methods maps best on this FPGA using a 6:3 compression tree structure, as
opposed to 3:2, 4:2 or other similar structures. This is due to the presence of 6-input
LUTs, which reduce the number of hops and thereby the critical path.
6.2.1 Compression Trees
Compression trees are used for multiplication instead of dedicated multiplier
hardware; one example is the Wallace compression tree [4]. The Wallace tree
with a compression ratio of 4:2 has been recognized as the most efficient, but with
the 6-i/p LUT present in the Virtex-5 FPGA, the Wallace tree with a compression ratio
of 6:3 provides the best results in terms of performance.
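The fit between a 6:3 compressor and a 6-input LUT can be illustrated with a small Python model. A 6:3 compressor counts the ones among six bits of equal weight and outputs the count as three bits; each output bit is a function of the same six inputs, so it can be held in one 64-entry truth table. The function names below are hypothetical and the model says nothing about how a synthesis tool would actually pack the logic.

```python
def build_6_3_luts():
    """Return three 64-entry truth tables, one per output bit of a 6:3 compressor."""
    luts = [[0] * 64 for _ in range(3)]
    for address in range(64):
        ones = bin(address).count("1")       # number of set input bits (0..6)
        for k in range(3):
            luts[k][address] = (ones >> k) & 1
    return luts


def compress_6_3(bits, luts):
    """Look up the 3-bit count (LSB first) for six input bits."""
    address = sum(bit << n for n, bit in enumerate(bits))
    return [luts[k][address] for k in range(3)]
```

Six set inputs give the output 110 in binary (returned LSB first as [0, 1, 1]), obtained with a single table lookup per output bit.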
6.2.2 Multiplier Pipelining
The DSP48 is the block which performs multiplication and accumulation in the
Virtex-5 FPGA. It has three inherent pipeline stages, which enable up to 4
levels of pipelining without incurring any additional hardware
resources. The design’s throughput can be enhanced from 80 MHz to 500 MHz
[5] by using this slice efficiently.
6.2.3 Optimization of Bit Resolution
An appropriate choice of the bit width of an algorithm has a direct impact on
power consumption, mean square error (MSE) and complexity [8]. The bit
width has to be selected carefully so that it maps accurately on the internal
resource structure of the FPGA, which results in an optimized design.
6.3 Design Optimizations
To analyze the hardware design optimizations, a few digital filters were studied. The
first was an FIR filter [6], which was implemented in different forms. The same filter
was then converted to its CSD form and synthesized using different compression trees.
The second was an IIR filter, implemented in direct and pipelined forms. Thirdly, a
complex multiplier was also implemented.
6.3.1 Optimization of FIR filter
Figure 6.3 shows an FIR filter with seven taps which was synthesized on an FPGA of
the Xilinx Virtex-5 family. The design was implemented using 8 DSP48 blocks
running at 73.678 MHz [9] [10].
Figure 6.3 FIR filter having seven taps
The systolic implementation of the same filter, shown in Figure 6.4, resulted in a
clock frequency about 8 times higher, i.e. 592 MHz.
Figure 6.4 Systolic FIR filter with cut-set represented by dashed lines
Compression trees with compression ratios of 3:2, 4:2, 6:3 and 7:3 were used
after transforming the same filter into Canonic Signed Digit (CSD) form. Figure 6.5,
Figure 6.6, Figure 6.7 and Figure 6.8 show the schematics of the various
compression trees. The number of ones in a coefficient is reduced by around
33% using the CSD representation [13].
Figure 6.5 Schematic of 6:3 type compression trees
Figure 6.6 Schematic of 3:2 type compression trees
Figure 6.7 Schematic of 4:2 type compression trees
Figure 6.8 Schematic of 7:3 type compression trees
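The CSD transformation of a coefficient can be sketched in Python as follows. The sketch handles non-negative integer coefficients only and returns digits from the set {-1, 0, 1}, LSB first; the function name `to_csd` is hypothetical and the routine is a generic textbook-style recoding, not necessarily the one used for the filters above.

```python
def to_csd(value):
    """Recode a non-negative integer into canonic signed digits (LSB first)."""
    digits = []
    while value:
        if value & 1:
            # Low two bits 01 -> digit +1, low two bits 11 -> digit -1.
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits
```

For instance, the coefficient 7 (binary 0111, three ones) becomes 8 - 1, which has only two nonzero digits, so one fewer partial product has to be added.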
6.3.2 Optimization of IIR filter
Figure 6.9 shows the implementation of a 1st order IIR filter [7] [8]. Pipeline stages
were added to the filter, as shown in Figure 6.10, by application of the look-ahead
transformation [2]. Synthesis of both filters shows an increase in clock
speed from 247.588 MHz to 370.157 MHz.
$$x(i+1) = a\,x(i) + b\,y(i+1) \qquad (6)$$
Figure 6.9 IIR Filter of first order
After applying the transform, we have
$$x(i+2) = a\,x(i+1) + b\,y(i+2) = a^2 x(i) + ab\,y(i+1) + b\,y(i+2) \qquad (7)$$
Figure 6.10 First order transformation of IIR filter
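The equivalence of the direct and the look-ahead forms can be checked numerically. The Python sketch below assumes the first order recursion x(i+1) = a x(i) + b y(i+1), where y is the input sequence; this form, the initial state `x0` and the function names are assumptions made for illustration.

```python
def iir_direct(a, b, y, x0=0.0):
    """Direct form: each state depends on the immediately preceding state."""
    x = [x0]
    for i in range(len(y) - 1):
        x.append(a * x[i] + b * y[i + 1])
    return x


def iir_look_ahead(a, b, y, x0=0.0):
    """Look-ahead form: each state depends on the state two steps back."""
    x = [x0, a * x0 + b * y[1]]
    for i in range(len(y) - 2):
        x.append(a * a * x[i] + a * b * y[i + 1] + b * y[i + 2])
    return x
```

Because the loop in the look-ahead form spans two samples, an extra register can be placed in the feedback path, which is what allows the additional pipeline stage.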
6.4 Complex Multiplier
Each complex multiplication involves 4 multiplications, 1 addition and 1 subtraction:
(a + ib) x (c + id) = (ac - bd) + i(ad + bc)    (9)
Figure 6.11 shows the schematic of the complex multiplier.
Figure 6.11 Schematic of Complex multiplier
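The operation count can be made explicit with a direct Python transcription of the product (a + ib)(c + id). The function name is hypothetical and the sketch uses ordinary integer or floating-point arithmetic rather than a hardware model.

```python
def complex_multiply(a, b, c, d):
    """Return (real, imag) of (a + ib)(c + id) using 4 multiplications."""
    ac = a * c
    bd = b * d
    ad = a * d
    bc = b * c
    real = ac - bd   # the single subtraction
    imag = ad + bc   # the single addition
    return real, imag
```

For example, `complex_multiply(3, 4, 5, -2)` gives (23, 14), matching (3 + 4i)(5 - 2i) = 23 + 14i.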
The complex multiplier is implemented with a LUT based method; utilizing the carry
chain made the implementation very efficient. Partial products are generated using the
Booth recoding algorithm [11], which reduces their number by half, and are then reduced
by a Wallace tree. Compression trees incorporating different compression ratios are
implemented to compare the LUTs used, and the path delays are optimized by left to
right scanning of the operands.
The two’s complement equivalent of an i-bit multiplier X is described by the following
equations:
$$X = -2^{i-1} b_{i-1} + \sum_{k=0}^{i-2} 2^{k} b_k \qquad (10)$$
$$X = (b_{i-3} + b_{i-2} - 2b_{i-1})\,2^{i-2} + \cdots + (b_{-1} + b_0 - 2b_1) \qquad (11)$$
$$X = \sum_{k=0}^{i/2-1} (b_{2k-1} + b_{2k} - 2b_{2k+1})\,2^{2k} \qquad (12)$$
Here i is even, b_{i-1} represents the sign bit and b_{-1} = 0. The following equation
gives the product:
$$Y = C \sum_{k=0}^{i/2-1} (b_{2k-1} + b_{2k} - 2b_{2k+1})\,2^{2k} \qquad (13)$$
The overall architecture of the optimized complex multiplier is given below. It encodes
two consecutive bits into a single digit by scanning three consecutive bits, which
reduces the number of partial products by half.
Figure 6.12 Complex multiplier incorporating Booth encoded Wallace tree reduction technique
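The recoding that halves the number of partial products can be sketched in Python. Each digit is formed from three consecutive bits as b(2k-1) + b(2k) - 2 b(2k+1), with the bit below the LSB taken as zero, so a word of even width W yields W/2 digits in {-2, -1, 0, 1, 2}. The function names and the signed-integer interface are assumptions for illustration only.

```python
def booth_radix4_digits(x, width):
    """Radix-4 Booth digits (least significant first) of a signed `width`-bit value."""
    bits = [(x >> k) & 1 for k in range(width)]   # two's complement bits
    digits = []
    prev = 0                                      # bit below the LSB
    for k in range(width // 2):
        low, high = bits[2 * k], bits[2 * k + 1]
        digits.append(prev + low - 2 * high)
        prev = high
    return digits


def booth_multiply(c, x, width):
    """Multiply c by x by summing one partial product per Booth digit."""
    digits = booth_radix4_digits(x, width)
    return sum((c * d) << (2 * k) for k, d in enumerate(digits))
```

An 8-bit multiplier therefore needs four partial products instead of eight; for example `booth_multiply(13, -7, 8)` returns -91.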
6.5 Experimental Results
Figure 6.3 and Figure 6.9 represent the FIR and IIR filters. These filters have been
realized by implementing compression trees having different compression ratios. The
designs were coded in VHDL, synthesized using Xilinx ISE and simulated using the
ModelSim simulator.
6.5.1 FIR Filter
The designs were synthesized with a focus on the clock frequency. From the
synthesis results, the minimum clock period and the logic utilization are compared.
The results of different implementations of the FIR filter, mapped with
compression trees of different compression ratios, were compared.
Figure 6.13 Comparison of frequency (MHz) and number of utilized LUTs in CSD using different
compression trees for the FIR filter
6.5.2 IIR Filter
IIR filters are compared by implementing different forms and incorporating
different compression tree ratios.
Figure 6.14 Comparison of number of utilized LUTs and frequency (MHz) in CSD using different
compression trees for the IIR filter
6.6 Complex Multiplier Synthesis
With the same optimization parameters, a 32 bit complex multiplier was synthesized
using compression trees with different compression ratios, and the results were
compared. Figure 6.15 shows the results of the synthesis in terms of look up tables
utilized and path delays.
6.6.1 Optimization of Bit Width
A direct form FIR filter was synthesized using the CSD implementation [12] for various
bit widths of the input and the filter coefficients. Figure 6.16 shows the resulting
LUTs and clock speed.
Figure 6.15 Comparison of LUTs and path delay of the complex multiplier using different
compression trees
Figure 6.16 LUTs and Clock rates for FIR filter
The results show a 10% saving in the look up tables and an increase of 1.1% in clock
speed. The error has also been reduced; the results show a variance of 0.1704 in
the LMS error when the Q1.15 format was used during the implementation.
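For reference, the Q1.15 format represents a value with one sign bit and 15 fractional bits. The Python sketch below shows one common way to quantize to this format (round to nearest, then saturate); the rounding and saturation choices and the function names are assumptions and may differ from the implementation evaluated above.

```python
def to_q15(value):
    """Quantize a real value in [-1, 1) to a 16-bit Q1.15 integer code."""
    scaled = int(round(value * (1 << 15)))
    # Saturate to the representable range of a signed 16-bit word.
    return max(-(1 << 15), min((1 << 15) - 1, scaled))


def from_q15(code):
    """Convert a Q1.15 integer code back to a real value."""
    return code / (1 << 15)
```

With rounding to nearest, the quantization error of any in-range value is at most half an LSB, i.e. 2^-16.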
6.7 Conclusion
Key components of DSP systems have been implemented. Throughout the
implementation the focus was on reducing the LUT count and the critical path delay,
keeping in view the available resources on the target platform. Compression trees with
different compression ratios were realized during the implementation, and the results
show that the compression ratio of 6:3 correctly maps on the inherent structure of the
Virtex-5 FPGA for all practical purposes.
6.8 References
[1] Xcell Journal, “Achieve high performance with Virtex-5 FPGA,” fourth quarter
2006.
[2] K. Satoh, J. Tada, H. Yanagida and Y. Tamura, “Parallel Image Reconstruction
Operation by Dedicated Hardware for Three Dimensional Ultrasound
Imaging,” Proc. of IEEE UFFC, pp. 1522-1525, Nov. 2007.
[3] Keshab K. Parhi, “Pipelined and parallel recursive and adaptive filters,” chapter 10 of
Pipelined Adaptive Digital Filters.
[4] Keshab K. Parhi, “Bit level arithmetic architectures,” chapter 13 of Pipelined Adaptive
Digital Filters.
[5] Vojin G. Oklobdzija, “The Computer Engineering Handbook,” CRC Press.
[6] Anna Kuncheva and George Yanchev, “Synthesis and implementation of DSP
Algorithm in Advanced Programmable architectures,” Proc. of ISCCS 2008.
[7] Antoli Sergyienko, Volodymir Lepekha, Juri Kanevski and Przemyslaw Soltan,
“Implementation of IIR Digital Filters in FPGA,” Poland.
[8] Shanthala S and S. Y. Kulkarni, “High speed and low power FPGA Implementation
of FIR Filter for DSP Applications,” European Journal of Scientific Research, ISSN
1450-216X, Vol. 31, No. 1 (2009), pp. 19-28.
[9] Xilinx Co., Xcell Journal, vol. 58.59, Spring 2007.
[10] D. Phanthavong, “Designing with DSP48 blocks using Precision Synthesis,” Xcell
Journal, 2005.
[11] Ki-seon Cho, Jong, Jin Seok, Goang Choi, “54x54 bit Radix 4 Multiplier based on
modified Booth algorithm,” ACM 2003 1-58113-677.
[12] Aqib Perwaiz and Shoab A. Khan, “Effect of Bit Precision on hardware
complexity for DDFS architecture,” IEEE Conference.
Conclusion and Future Work
2013
76
Chapter 7
Conclusions and Future Work
This work has addressed optimization techniques tailored to the target technology
under consideration. A mathematical model that optimizes the mapping of Digital Signal
Processing (DSP) algorithms on FPGAs has been presented. Any high-end DSP
system consists of multiple sub-systems. Each sub-system can be defined by multiple
architectural options based on the design constraints. Besides architectural design
options, there are many other attributes that directly affect the mapped resources. The
word length quantization plays a critical role in further optimizing the selected
architectural option. The thesis has modeled all these attributes, and the solution lists
the resources required for the optimized mapping. The target device is selected based
on the results and the constraints defined in the design. By adjusting the constraints the
target device is changed and low power solutions are possible. The experiments
demonstrate that increasing the word length of intermediate variables does not help in
improving the performance beyond a certain point. The thesis has also explored the
intricate relationship of intermediate variable lengths with the overall accuracy of the
results and links it with the complexity of the hardware. Several design examples have
been listed to confirm the validity of the findings.
In the design space exploration, several architectural options have been discussed. The
options include bit serial, byte serial, folded, unfolded, and distributed arithmetic based
architectures. The architectures that are optimal for custom design may perform poorly
once mapped on an FPGA. This observation is substantiated by design examples
from compression trees. These trees are fundamental to DSP architectures due to
their wide use in general purpose multiplication, multiplication with constants, and
multiple operand addition and subtraction. Different compression ratios for the Wallace
tree have been explored to identify the ratio that best maps on LUT based FPGAs.
The inherent architecture of the device under consideration plays an important role in
optimizing the mapping of the algorithm on the FPGA. An automatic technique that
explores different architectural options subject to design constraints can save FPGA
resources. The automatic technique is based on a sound mathematical model that helps in
suggesting the best target device that meets all the constraints in an optimized solution.
Besides exploring architectural options, there are many other design parameters that
further help in optimizing the design to meet the required specifications. The
quantization of each variable in the algorithm is very critical. Optimized Word-Length
Allocation (WLA) tailors the precision of arithmetic operations and results in savings in
area and cost. The thesis lists techniques for optimization and implements them while
pursuing the ultimate goal of algorithm design. These contributions directly save a high
percentage of resources in the case of FIR and IIR filters, or for that matter any
complex multiplier. The deductions from the thesis are listed below:
1. The mathematical model presented in chapter 2 helps an algorithm designer to
map an algorithm on the different available architectural options; by adjusting the
weightages of different resources, the best fit target FPGA is also
identified. The complex example of a WCDMA receiver has been discussed:
with the given throughput requirement at each stage, the design maps
on the Spartan 3A FPGA under one set of constraints and on the Virtex-5 FPGA under
the other set of constraints. Any system can be optimally designed to fit in the FPGA
design space based on fine adjustment of the constraints. By carefully
adjusting the constraints, low power solutions are realizable.
2. In a particular digital signal processing system, the number of bits processed at a
time is a major source of resource wastage. The word lengths of variables are
selected to meet the application’s output error tolerance. The target for an algorithm
designer is an optimum word length at which the cost and the output distortion
meet criteria set by the application. In the case of CORDIC (discussed in chapter 4),
for all practical purposes the bit resolution of the input variables should be greater
than the bit resolution of the angle when CORDIC is used as a Direct Digital Frequency
Synthesizer (DDFS).
3. A DSP algorithm designer must determine the dynamic range and desired
precision of input, intermediate, and output signals in a design implementation to
ensure that the algorithm fidelity criteria are met. In most cases the results show
a linear increase in hardware complexity with increasing bit resolution.
Going beyond a certain bit resolution is not advisable, as it only adds to the
hardware complexity and contributes nothing towards reducing the least
mean square error.
4. For bit serial multiplication in DSP algorithms, the proposed bit serial
multiplier proved to be more efficient than the conventional bit serial multiplier.
5. Compression trees are used to add the partial products of a multiplication
and eliminate the need for dedicated multiplier hardware. Traditionally
the 4 to 2 Wallace tree has been considered the most efficient compression
choice, but in our case the 6 to 3 compression technique is a better
option, as it maps exactly on the inherent structure of FPGAs which have 6-i/p
LUTs.
Combining the above deductions yields a procedure for optimization across the slice
fabric of an FPGA. The optimization steps are as follows:
1. Ascertain the word length allocation.
2. Check the internal pipelining of the DSP blocks within the FPGA in use.
3. For multiplication, use the compression technique that maps exactly on the
internal architecture of the target device.
4. For serial multiplication, use the proposed bit wise serial triangular compressor
multiplier.
5. The proposed CORDIC can be used as a DDFS.
Further extension of the work leads to the compilation of a component library for
different FPGA vendors to automate the optimization of DSP algorithms by designers.
Another extension of the work is the implementation of multiple implementation
techniques on the components of any complex digital signal processing systems