
Source: cas.ee.ic.ac.uk/people/gac1/pubs/KanFCCM13.pdf

Accuracy-Performance Tradeoffs on an FPGA Through Overclocking

Kan Shi, David Boland and George A. Constantinides
Department of Electrical and Electronic Engineering
Imperial College London, London, United Kingdom
Email: {k.shi11, david.boland03, g.constantinides}@imperial.ac.uk

Abstract—Embedded applications can often demand stringent latency requirements. While high degrees of parallelism within custom FPGA-based accelerators may help to some extent, it may also be necessary to limit the precision used in the datapath to boost the operating frequency of the implementation. However, by reducing the precision, the engineer introduces quantization error into the design. In this paper, we demonstrate that for many applications it would be preferable to simply overclock the design and accept that timing violations may arise. Since the errors introduced by timing violations occur rarely, they will cause less noise than quantization errors. Through the use of analytical models and empirical results on a Xilinx Virtex-6 FPGA, we show that a geometric mean reduction of 67.9% to 98.8% in error expectation or a geometric mean improvement of 3.1% to 27.6% in operating frequency can be obtained using this alternative design methodology.

Keywords-FPGA; overclocking; probabilistic design;

I. INTRODUCTION

FPGA-based accelerators have demonstrated significant performance gains over software designs across a range of applications [1], [2]. However, one of the major factors that limits the performance of these accelerators is that they typically run at much lower clock frequencies than general purpose processors (GPPs) or GPUs. While it is unlikely that FPGA-based accelerators will ever be run at the same clock frequency as GPPs or GPUs, timing analysis tools typically recommend that a user should run their implementation at a very conservative clock frequency in order to avoid the possibility of timing violations. This substantially limits the potential performance of the device.

The standard techniques to boost the operating frequency of a datapath are either to heavily pipeline the design or to reduce the precision used. While pipelining may boost the maximum frequency, it will not tend to reduce the latency of the circuit. As a result, this method is not applicable to many embedded applications, which typically have strict latency requirements, or to datapaths containing feedback where C-slow retiming is inappropriate. Reducing the datapath precision will reduce the latency of the accelerator at the cost of introducing quantization error into the design. Due to the freedom of FPGAs to employ customized variable representations, exploiting the potential benefits of using the minimum precision necessary to satisfy a design specification, such as the maximum tolerable error, has been an extensive research topic within the FPGA community [3].

However, the choice of precision is not the only source of error in a datapath. Recently, we have seen a growth of research that explores the potential power or performance benefits that can be obtained when operating circuits beyond the deterministic region. This topic is expected to be of growing importance due to increasingly stringent timing and power requirements, design complexity, and environmental and process variations, all of which accompany the continuous scaling of process technologies [4]. As pointed out by the International Technology Roadmap for Semiconductors (ITRS07) [5], while future technologies may suffer from much poorer timing performance, extra benefits in manufacturing, test and power consumption can be obtained if the tight requirement of absolute correctness is relaxed for devices and interconnect.

Research in this area has typically focused on relaxing the design constraints and the safety margins that are conventionally used. A line of work named “Better Than Worst-Case Design” introduced a universal structure with cores (which operate with high performance) and checkers (which check and recover the system from timing errors) [6]. As an exemplary design, the Razor project [7] scaled the supply voltage and clock frequency beyond the most conservative values, while monitoring the error rate by utilizing a self-checking circuit. This work demonstrated that the benefits brought by removing the safety margin outweigh the cost of monitoring and recovering from errors. Related work involves operating circuits slightly slower than the critical path delay with dedicated checker circuits to ensure that timing errors will not occur [8], developing timing analysis tools that decide the optimum operating frequencies in the non-deterministic region due to process variation [9], or dynamically voltage scaling an FPGA upon detection of timing errors to prevent them occurring in the future [10].

Alternatively, there is research focusing on designing “probabilistic circuits” which trade accuracy for performance, power and silicon area improvements by using techniques such as voltage overscaling and imprecise architectures. For example, Palem et al. [11] described a technique for the ripple carry adder that employed different voltage regions for different bits along a carry chain.


That is, a higher voltage would be applied for computations generating the most significant bits, and vice versa. However, the ability to implement non-uniform voltage scaling is limited in practical situations. For the second approach, Lu et al. proposed a simplified datapath that can be employed to mimic and speculate the original logic functions [12]. Similarly, Gupta et al. developed approximate adders at the transistor level and compared the energy efficiency of their proposed architectures against truncation of the input word-length in conventional structures [13]. Both articles are based on the observation that errors only occur with specific input patterns. However, the link between the probability of output correctness and the energy saving is not analysed. In addition, these techniques cannot be directly applied to FPGAs.

In this work, we attempt to bring together the strands of research on arithmetic precision determination and overclocking. We evaluate the probabilistic behavior of basic arithmetic primitives with different datapath precisions when operating beyond the deterministic region. We suggest that for certain applications it is beneficial to move away from the traditional model of creating a conservative design that is guaranteed to avoid timing violations. Instead, it may be preferable to create a design in which timing violations may occur, under the knowledge that they are unlikely to occur frequently because they require specific input patterns to generate errors. To support this hypothesis, we initially present probabilistic models of the errors generated in this process for basic arithmetic operators: the ripple carry adder (RCA) and the constant coefficient multiplier (CCM). We follow this with experimental data from a Xilinx Virtex-6 FPGA, across a range of benchmark circuits and applications. We show that not only does this approach reduce the need for a conservative timing margin; more importantly, our models and experimental results demonstrate that performance benefits can be achieved in comparison to the traditional situation where target latency is limited by the choice of precision. The main contributions of this paper are:

∙ Detailed descriptions of how to create probabilistic models for overclocking and truncation errors for basic arithmetic primitives,

∙ Analytical and empirical results from FPGA implementation that demonstrate that allowing rare timing violations to occur results in less error than truncating a datapath to meet timing.

The rest of the paper is organized as follows: we first present theoretical probabilistic error models for the RCA and the CCM in Section II and Section III, respectively. This is followed by the description of a practical experimental setup on the Xilinx Virtex-6 FPGA in Section IV, and the demonstration of the benefits of our proposed approach in Section V, before drawing conclusions in Section VI.

II. RIPPLE CARRY ADDER

A. Adder Structures in FPGAs

Adders serve as a key building block for arithmetic operations. Generally speaking, the ripple carry adder (RCA) is the most straightforward and widely used adder structure. As such, the philosophy of our approach is first exemplified with the analysis of an RCA. We later describe how this methodology can be extended to other arithmetic operators in Section III by discussing the CCM, which is commonly used in DSP applications and numerical algorithms.

Typically the maximum frequency of an RCA is determined by the longest carry propagation. Consequently, modern FPGAs offer built-in architectures for very fast ripple carry addition. For instance, the Altera Cyclone series uses fast tables [14], while the Xilinx Virtex series employs dedicated multiplexers and encoders for the fast carry logic [15]. Figure 1 illustrates the structure of an n-bit RCA, which is composed of n serially connected full adders (FAs) and utilizes the internal fast carry logic of the Virtex-6 FPGA.

While the fast carry logic reduces each individual carry-propagation delay, the overall delay of carry propagation will eventually overwhelm the delay of sum generation in each LUT as operand word-lengths increase. For our initial analysis, we assume that the carry propagation delay of each FA is a constant value μ, which is a combination of logic delay and routing delay, and hence the critical path delay of the RCA is μ_RCA = nμ, as shown in Figure 1. For an n-bit RCA, it follows that if the sampling period T_S is greater than μ_RCA, correct results will be sampled. If, however, T_S < μ_RCA, intermediate results will be sampled, potentially generating errors.

In the following sections, we consider two methods that would allow the circuit to run at a frequency higher than 1/T_S. The first is a traditional circuit design approach where operations occur without timing violations. To this end, the operand word-length is truncated in order to meet the timing requirement. This process results in truncation or roundoff error. In our proposed new scenario, circuits are implemented with the greater word-length, but are clocked beyond the safe region so that timing violations sometimes occur. This process generates “overclocking error”.

B. Probabilistic Model of Truncation Error

For ease of discussion, we assume that the input to our circuit is a fixed point number scaled to lie in the range [−1, 1). For our initial analysis, we assume every bit of each input is uniformly and independently generated. However, this assumption will be relaxed in Section V, where the predictions are verified using real image data. The errors at the output are evaluated in terms of their absolute value and the probability of their occurring. These two metrics are combined as the error expectation.

Figure 1. An n-bit ripple carry adder in Virtex-6 FPGA.

If the input signal of a circuit is k bits, truncation error occurs when the input signal is truncated from k bits to n bits. Under this premise, the mean value of the truncated bits at the signal input (E_Tin) is given by (1).

E_Tin = (1/2) Σ_{i=n+1}^{k} 2^{-i} = 2^{-n-1} − 2^{-k-1}    (1)

Since we assume there are two mutually independent inputs to the RCA, the overall expectation of truncation error for the RCA is given by (2).

E_T = { 2^{-n} − 2^{-k},  if n < k
      { 0,                otherwise    (2)
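As an illustrative aside (not part of the original paper), the closed form in (2) can be sanity-checked by Monte Carlo simulation. The function names and the simplification to unsigned fractions in [0, 1) are our own assumptions:

```python
import random

def truncation_error_expectation(k, n, trials=200_000, seed=1):
    """Monte Carlo estimate of the mean truncation error of an adder with
    two independent uniform k-bit inputs, each truncated to n bits.
    For simplicity, inputs are modeled as unsigned fractions in [0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = rng.getrandbits(k) / 2.0**k      # uniform k-bit fraction
        b = rng.getrandbits(k) / 2.0**k
        a_t = int(a * 2**n) / 2.0**n         # keep n fractional bits
        b_t = int(b * 2**n) / 2.0**n
        total += (a + b) - (a_t + b_t)       # error added by truncation
    return total / trials

def truncation_error_model(k, n):
    """Equation (2): E_T = 2^-n - 2^-k for n < k, and 0 otherwise."""
    return 2.0**-n - 2.0**-k if n < k else 0.0
```

For example, with k = 8 and n = 4 the simulated mean agrees with (2) to within Monte Carlo noise.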

C. Probabilistic Model of Overclocking Error

1) Generation of Overclocking Error: For a given T_S, the maximum length of error-free carry propagation is described by (3), where f_S denotes the sampling frequency.

b := ⌈ T_S / μ ⌉ = ⌈ 1 / (μ · f_S) ⌉    (3)
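In code, (3) is a one-line helper; the function name is our own illustrative choice:

```python
import math

def max_safe_chain_length(T_s, mu):
    """Equation (3): the longest carry chain that still completes within
    the sampling period T_s, when each full-adder stage contributes a
    delay of mu (both in the same time units)."""
    return math.ceil(T_s / mu)
```

For instance, with T_s = 2.0 ns and μ = 0.6 ns, carry chains of up to b = 4 bits complete safely.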

However, since the length of an actual carry chain during execution is dependent upon input patterns, in general, the worst case may occur rarely. To determine when this timing constraint is not met, and the size of the error in this case, we expand standard results [16] to the following statements, which examine carry generation, propagation and annihilation, as well as the corresponding summation result of a single bit i, according to the relationship between its input bits A_i and B_i:

∙ If A_i = B_i = 1, a new carry chain is generated at bit i, and S_i = C_{i-1};

∙ If A_i ≠ B_i, the carry propagates for this carry chain at bit i, and S_i = 0;

∙ If A_i = B_i = 0, the current carry chain annihilates at bit i, and S_i = 1.

2) Absolute Value of Overclocking Error: For an n-bit RCA, let C_tm denote the carry chain generated at bit S_t with a length of m bits. For a certain f_S, the maximum length of error-free carry propagation, b, is determined through (3). The presence of overclocking error requires m > b. Since the length of a carry chain cannot be greater than n, parameters t and m are bounded by (4) and (5):

0 ≤ t ≤ n − b    (4)

b < m ≤ n + 1 − t    (5)

For C_tm, correct results will be generated from bit S_t to bit S_{t+b-1}. Hence the absolute value of the error seen at the output, normalized to the MSB (2^n), is given by (6), where S_i and Ŝ_i denote the actual and the error-free output of bit i respectively.

e_tm = | Σ_{i=t+b}^{n} (S_i − Ŝ_i) · 2^i | / 2^n    (6)

S_i and Ŝ_i can be determined using the statements in Section II-C1. In the error-free case, the carry will propagate from bit S_t to bit S_{t+m-1}, and we will obtain Ŝ_{t+b} = Ŝ_{t+b+1} = ⋯ = Ŝ_{t+m-2} = 0 for carry propagation, and Ŝ_{t+m-1} = 1 for carry annihilation. However, when a timing violation occurs, the carry will not propagate through all of these bits. Substituting these values into (6) yields (7). Interestingly, the value of the overclocking error has no dependence on the length of the carry chain m.

e_tm = | 2^{t+m-1} − 2^{t+m-2} − ⋯ − 2^{t+b} | / 2^n = 2^{t+b-n}    (7)
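The m-independence in (7) follows from the telescoping sum 2^{t+m-1} − (2^{t+m-2} + ⋯ + 2^{t+b}) = 2^{t+b}. A brief numeric check (our own illustration, with arbitrarily chosen t and b):

```python
# Verify |2^(t+m-1) - 2^(t+m-2) - ... - 2^(t+b)| = 2^(t+b) for every
# chain length m > b: the error magnitude does not depend on m.
t, b = 2, 3
for m in range(b + 1, 16):
    err = abs(2**(t + m - 1) - sum(2**i for i in range(t + b, t + m - 1)))
    assert err == 2**(t + b)
```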

3) Probability of Overclocking Error: The carry chain C_tm occurs when there is a carry generated at bit t, a carry annihilated at bit t + m − 1, and the carry propagates in between. Consequently, its probability P_tm is given by (8).

P_tm = P(A_t = B_t = 1) · P(A_{t+m-1} = B_{t+m-1}) · Π_{i=t+1}^{t+m-2} P(A_i ≠ B_i)    (8)

Under the assumption that A and B are mutually independent and uniformly distributed, we have P(A_i = B_i = 1) = 1/4, P(A_i ≠ B_i) = 1/2 and P(A_i = B_i) = 1/2, so P_tm can be obtained by (9). Note that (9) takes into account that carry annihilation always occurs when t + m − 1 = n.

P_tm = { (1/2)^{m+1},  if t + m − 1 < n
       { (1/2)^m,      if t + m − 1 = n    (9)

4) Expectation of Overclocking Error: The expectation of overclocking error can be expressed by (10).

E_O = Σ_t Σ_m P_tm · e_tm    (10)

Using e_tm and P_tm from (7) and (9) respectively, E_O can be obtained by (11).


E_O = { 2^{-b} − 2^{-n-1},  if b ≤ n
      { 0,                  otherwise    (11)
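The model behind (11) can be checked exhaustively with a small bit-level simulator of an overclocked RCA. The sketch below is our own illustration (not the paper's test platform): a carry is assumed to be lost once it has travelled b or more stages within one sampling period.

```python
def overclocked_rca(a, b_in, n, b_lim):
    """Bit-level model of an n-bit RCA sampled too early: the carry of a
    chain generated at bit t is observed at bit i only if i - t < b_lim.
    Returns (error-free sum, sampled sum), each n+1 bits wide."""
    carry, dist = 0, 0              # active carry and hops since generation
    exact = sampled = 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b_in >> i) & 1
        seen = carry if dist < b_lim else 0   # late carry not yet arrived
        exact |= (ai ^ bi ^ carry) << i
        sampled |= (ai ^ bi ^ seen) << i
        if ai & bi:                 # generate: a new chain starts at bit i
            carry, dist = 1, 1
        elif (ai ^ bi) and carry:   # propagate: chain grows by one stage
            dist += 1
        elif not (ai | bi):         # annihilate: chain ends at bit i
            carry, dist = 0, 0
    exact |= carry << n                             # carry-out bit S_n
    sampled |= (carry if dist < b_lim else 0) << n
    return exact, sampled

def mean_overclocking_error(n, b_lim):
    """Exact expectation of the error, normalized to 2^n, over all inputs."""
    total = 0
    for a in range(2**n):
        for b_in in range(2**n):
            exact, sampled = overclocked_rca(a, b_in, n, b_lim)
            total += exact - sampled
    return total / (4.0**n * 2**n)
```

Enumerating all input pairs for small n reproduces E_O = 2^{-b} − 2^{-n-1}, and gives zero error when b > n, matching (11).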

D. Comparison between Two Scenarios

In the traditional scenario, the word-length of the RCA must be truncated, using n = b − 1 bits, in order to meet a given f_S. The error expectation is then given by (12).

E_trad = 2^{-b+1} − 2^{-k}    (12)

Overclocking errors are allowed to happen in the second scenario, therefore the word-length of the RCA is set to be equal to the input word-length, that is, n = k. Hence we obtain (13) according to (11).

E_new = 2^{-b} − 2^{-k-1}    (13)

Comparing (13) and (12), we have (14). This equation indicates that by allowing timing violations, the overall error expectation of the RCA outputs drops by a factor of 2 in comparison to the traditional scenario. This provides the first hint that our approach is useful in practice.

E_new / E_trad = (2^{-b} − 2^{-k-1}) / (2^{-b+1} − 2^{-k}) = 1/2    (14)
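The factor of two in (14) is exact for any b ≤ k, since E_new is term-by-term half of E_trad. A one-line numeric check (our illustration):

```python
# Equation (14): the overclocked design halves the error expectation of
# the truncated design, independently of b and k.
for b in range(2, 12):
    for k in range(b, 20):
        e_trad = 2.0**(-b + 1) - 2.0**-k        # equation (12)
        e_new = 2.0**-b - 2.0**-(k + 1)         # equation (13)
        assert abs(e_new / e_trad - 0.5) < 1e-12
```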

III. CONSTANT COEFFICIENT MULTIPLIER

As another key primitive of arithmetic operations, a CCM can be implemented using RCAs and shifters. For example, the operation B = 9A is equivalent to B = A + 8A = A + (A << 3), which can be built using one RCA and one shifter. We first focus on a single-RCA, single-shifter structure. We describe how more complex structures consisting of multiple RCAs and multiple shifters can be built in accordance with this baseline structure in Section III-C.

In this CCM structure, let the two inputs of the RCA be denoted by A_S and A_O respectively, which are both two's complement numbers. A_S denotes the “shifted signal”, with zeros padded after the LSB, while A_O denotes the “original signal” with MSB sign extension. For an n-bit input signal, it should be noted that an n-bit RCA is sufficient for this operation, because no carry will be generated or propagated when adding with zeros, as shown in Figure 2.

Figure 2. Different types of carry chain in a constant coefficient multiplier. The notation s denotes the shifted bits and BP denotes the binary point.

A. Probabilistic Model of Truncation Error

Let E_Tin and E_Tout denote the expectation of truncation error at the input and the output of the CCM respectively. We then have (15), where coe denotes the coefficient value of the CCM, and E_Tin can be obtained according to (2).

E_Tout = |coe| · E_Tin    (15)

B. Probabilistic Model of Overclocking Error

1) Absolute Value of Overclocking Error: The absolute value of the overclocking error of carry chain C_tm is increased by a factor of 2^s due to shifting, compared to the RCA. Hence e_tm in the CCM can be modified from (7) to give (16).

e_tm = 2^{t+b-n+s}    (16)

2) Probability of Overclocking Error: Due to the dependencies in a CCM, carry generation requires a_t = a_{t-s} = 1, and the propagation and annihilation of a carry chain are best considered separately for four types of carry chain generated at bit t. We label these C_tm1 to C_tm4 in Figure 2, defined by the end region of the carry chain. For C_tm1, we have:

∙ Carry propagation: a_i ≠ a_{i+s} where i ∈ [t+1, n−s−2];

∙ Carry annihilation: a_j = a_{j+s} where j ∈ [t+1, n−s−1].

Similarly for C_tm2, we have:

∙ Carry propagation: a_i ≠ a_{n-1} where i ∈ [n−s−1, n−3]; or a_i ≠ a_{i+s} where i ∈ [t+1, n−s−2];

∙ Carry annihilation: a_j = a_{n-1} where j ∈ [n−s−1, n−2].

For the first two types of carry chain, C_tm1 and C_tm2, the probability of carry propagation and annihilation is 1/2 and the probability of carry generation is 1/4, under the premise that all bits of the input signal are mutually independent. Therefore (17) can be obtained by substituting this into (8).

P_tm = (1/2)^{m+1},  if t + m − 1 ≤ n − 2    (17)

For carry annihilation of C_tm3, a_{n-1} = a_{n-1}, which is always true. Thus the probability of C_tm3 is given by (18).

P_tm = (1/2)^m,  if t + m − 1 = n − 1    (18)

C_tm4 represents a carry chain that annihilates beyond a_{n-1}, which would require carry propagation with a_{n-1} ≠ a_{n-1}. This means C_tm4 never occurs in a CCM.

Altogether, P_tm for a CCM is given by (19).

P_tm = { (1/2)^{m+1},  if t + m − 1 < n − 1
       { (1/2)^m,      if t + m − 1 = n − 1    (19)

3) Expectation of Overclocking Error: Since the carry chain of a CCM will not propagate over a_{n-1}, the upper bounds of parameters t and m should be modified from (4) and (5) to give (20) and (21).

0 ≤ t ≤ n − b − 1    (20)

b < m ≤ n − t    (21)

Finally, by substituting (19) and (16), with the modified bounds of t and m, into (10), we obtain the expectation of overclocking error for a CCM to be given by (22).

E_O = { 2^{s-b-1} − 2^{s-n-1},  if b ≤ n − 1
      { 0,                      otherwise    (22)
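Equation (22) can be cross-checked by evaluating the double sum (10) directly with the error value (16), the probabilities (19) and the bounds (20)-(21); the function below is our own verification sketch:

```python
def ccm_overclocking_error(n, b, s):
    """Evaluate E_O = sum_t sum_m P_tm * e_tm for the CCM, using
    e_tm = 2^(t+b-n+s) from (16), P_tm from (19), and the bounds
    0 <= t <= n-b-1 from (20) and b < m <= n-t from (21)."""
    total = 0.0
    for t in range(0, n - b):                  # bound (20)
        for m in range(b + 1, n - t + 1):      # bound (21)
            p = 0.5**(m + 1) if t + m - 1 < n - 1 else 0.5**m
            total += p * 2.0**(t + b - n + s)
    return total
```

Across small parameter sweeps this matches the closed form 2^{s-b-1} − 2^{s-n-1} for b ≤ n − 1.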

C. CCM with Multiple RCAs and Shifters

In the case where a CCM is composed of two shifters and one RCA, such as the operation B = 20A = (A << 2) + (A << 4), let the shifted bits be denoted s_1 and s_2 respectively. Hence the equivalent s in (22) can be obtained through (23).

s = |s_1 − s_2|    (23)

For operations such as B = 37A = (A << 5) + (A << 2) + A, the CCM can be built using a tree structure. Each root node is the baseline CCM and the errors are propagated through an adder tree, whose error can be determined based on our previous RCA model.
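The shift-add decompositions above follow from the binary expansion of the coefficient. A small sketch (our own; a real CCM generator would also exploit subtraction, e.g. canonical signed digit recoding, which we ignore here):

```python
def shift_amounts(coef):
    """One shift per set bit of a positive coefficient, so that
    coef * a == sum(a << s for s in shift_amounts(coef))."""
    assert coef > 0
    return [i for i in range(coef.bit_length()) if (coef >> i) & 1]

def ccm(a, coef):
    """Constant coefficient multiplier as a tree of shifters and adders."""
    acc = 0
    for s in shift_amounts(coef):
        acc += a << s       # each term is one shifter feeding an adder
    return acc
```

For example, shift_amounts(9) gives [0, 3] (B = A + (A << 3)) and shift_amounts(37) gives [0, 2, 5].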

IV. TEST PLATFORM

In our experiments, we compare two design perspectives. In the first scenario, the word-length of the input signal is truncated before propagating through the datapath in order to meet a given latency. In our proposed overclocking scenario, the circuit is overclocked while keeping the original operand word-length. The benefits of the proposed methodology are demonstrated over a set of DSP example designs, which are implemented on the Xilinx ML605 board with a Virtex-6 FPGA (XC6VLX240T-1FFG1156).

A. Experimental Setup

We initially build a test framework on an FPGA. The general architecture is depicted in Figure 3. The main body of the test framework consists of the circuit under test (CUT), the test frequency generator and the control logic, as shown in the dotted box in Figure 3. The I/Os of the CUT are registered by the launch registers (LRs) and the sample registers (SRs), which are all triggered by the test clock. Input test vectors are stored in the on-chip memory during initialization. The results are sampled using Xilinx ChipScope. Finally, we perform an offline comparison of the output of the original circuit at the rated frequency with the outputs of the overclocked as well as the truncated designs, using the same input vectors.

The test frequency generator is implemented using two cascaded mixed-mode clock managers (MMCMs), created using Xilinx Core Generator [17]. Besides the outputs, the corresponding input vectors and memory addresses are also recorded into the comparator, as can be seen in Figure 3, in order to ensure that the recorded errors arise from overclocking the CUT rather than the surrounding circuitry when high test frequencies are applied.

B. Benchmark Circuits

Figure 3. FPGA test framework, which is composed of a measurement architecture (the dotted box) and an off-line comparator.

Three types of DSP designs are tested: digital filters (FIR, IIR and Butterworth), a Sobel edge detector and a direct implementation of a Discrete Cosine Transformation (DCT). The filter parameters are generated through the MATLAB filter design toolbox, and they are normalized to integers for implementation. Table I summarizes the operating frequency of each implemented design in Xilinx ISE 14.1 when the word-length of the input signal is 8 bits.

Table I
RATED FREQUENCIES OF EXAMPLE DESIGNS.

Design               | Frequency (MHz) | Description
FIR Filter           | 126.2           | 5th order
Sobel Edge Detector  | 196.7           | 3×3
IIR Filter           | 140.3           | 7th order
Butterworth Filter   | 117.1           | 9th order
DCT                  | 176.7           | 4-point

The input data are generated from two sources. One is called “uniform independent inputs”, which are randomly sampled from a uniform distribution of 8-bit numbers. The other is referred to as “real inputs”, which denote the 8-bit pixel values of the 512×512 Lena image.

C. Exploring the Conservative Timing Margin

Generally, the operating frequency provided by EDA tools tends to be conservative, to ensure correct functionality under a wide range of operating environments and workloads. In a practical situation, this may result in a large gap between the predicted frequency and the actual frequency under which correct operation is maintained [18].

For example, the predicted frequencies and the actual frequencies of a 5th order FIR filter using different word-lengths are depicted in Figure 4. The “actual” maximum frequencies are computed by increasing the operating frequency from the rated value until errors are observed at the output; the maximum operating frequency with correct output is recorded for the current word-length. As can be seen in Figure 4, in our experiments the circuit can operate without errors at a much higher frequency in practice than predicted. A maximum speed differential of 3.2× is obtained when the input signal is 5-bit.


Figure 4. The maximum operating frequencies for different input word-lengths of an FIR filter. The dotted line depicts the rated frequency reported by the Xilinx Timing Analyzer. The solid line is obtained through real FPGA tests using our platform.

In our experiments in Section V, the conservative timing margin is removed in the traditional scenario for a fairer comparison with the overclocking scenario. To do this, for each truncated word-length, we select the maximum frequency at which we see no overclocking error on the FPGA board in our lab. For example, in Figure 4, the operating frequencies of the design when the word-lengths are truncated to 8, 5 and 2 bits are 400 MHz, 450 MHz and 500 MHz respectively.

Figure 4 also demonstrates that truncating the circuit allows it to operate at a higher frequency than the full-precision implementation. However, a non-uniform period change can be observed in both results. For instance, the maximum operating frequency stays almost constant when the operand word-length reduces from 8 to 6 or from 5 to 3, both in the experimental results and in those of the timing analyzer. This will cause a slight deviation from our analytical model, which assumes the single-bit carry propagation delay to be a constant value, as discussed in (12) with the expression n = b − 1. This deviation will be influenced by many factors, including how the architecture has been packed onto LUTs and CLBs, and process variation causing non-uniform interconnect delays [19]. However, we shall see in Section V that our model remains close to the true empirical results.

D. Evaluation Metric of Outputs

The results are evaluated in terms of the mean relative error (MRE), which represents the percentage of error at the outputs. The MRE is given by (24), where E_error and E_out refer to the mean value of the error and of the correct output respectively.

MRE = | E_error / E_out | × 100%    (24)

E. Computing Model Parameters

The accuracy of our proposed models is examined with practical results on a Virtex-6 FPGA. We first determine the model parameters. There are two types of parameters in the models of overclocking error. The first type depends on the circuit architecture: for example, the word-length of the RCAs and CCMs (𝑛), the number of bits shifted by the shifters in the CCM (𝑠), and the word-length of the input signal (𝑘). These are determined through static analysis. The second type depends on timing information, such as the single bit carry propagation delay 𝜇. To keep consistency with the assumption made in the models that 𝜇 is a fixed value, it is obtained from the actual FPGA measurement results.

Initially the maximum error-free frequency 𝑓0 is applied. In this case we have (25), where 𝑑𝑐 is a constant value which denotes the interconnection delay. The frequency is then increased such that (26) is obtained. This process repeats until the maximum frequency 𝑓𝑛−1 is applied in (27). Based on these frequency values, 𝜇 can be determined.

1/𝑓0 = 𝑛𝜇 + 𝑑𝑐 (25)

1/𝑓1 = (𝑛 − 1)𝜇 + 𝑑𝑐 (26)

⋅ ⋅ ⋅

1/𝑓𝑛−1 = 𝜇 + 𝑑𝑐 (27)
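The extraction of 𝜇 and 𝑑𝑐 can be sketched as a least-squares fit, assuming the measured clock periods (the reciprocals of the frequencies) follow the linear model of (25)–(27). The frequency values below are fabricated from that model purely for illustration; on the real platform they would be the measured 𝑓0 … 𝑓𝑛−1.

```python
import numpy as np

n = 8
mu_true, dc_true = 0.15, 0.8            # ns; used only to fabricate test data
# Hypothetical measured frequencies f_0..f_{n-1}, assuming 1/f_k = (n-k)*mu + d_c
freqs_GHz = 1.0 / (np.arange(n, 0, -1) * mu_true + dc_true)

# Periods T_k = 1/f_k are linear in the remaining carry-chain length (n - k),
# so a degree-1 polynomial fit yields mu (slope) and d_c (intercept).
periods = 1.0 / freqs_GHz
stages = np.arange(n, 0, -1)            # n, n-1, ..., 1
mu_est, dc_est = np.polyfit(stages, periods, 1)
print(f"mu = {mu_est:.3f} ns, d_c = {dc_est:.3f} ns")
```

With noisy measurements the fit averages out per-point variation instead of relying on any single frequency step.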

V. RESULTS AND DISCUSSION

A. Case study: FIR filter

We first assess the accuracy of our proposed error models. The modeled values of both overclocking error and truncation error of the FIR filter are presented in Figure 5 (dotted lines), as well as the actual measurements on the FPGA (solid lines) with two types of input data. The results demonstrate that our models match well with the practical results obtained using the uniform independent inputs.

[Figure 5 plot: mean relative error (%) on a logarithmic scale (10⁻⁴ to 10³) against operating frequency (400–700 MHz); curves for the traditional scenario, the overclocking scenario with uniform and real data, and the corresponding models, with annotations at n = 7, 5, 2 and 1.]

Figure 5. A demonstration of two design perspectives with a 5th order FIR filter, which is implemented on a Virtex-6 FPGA. The modeled values of both overclocking errors and truncation errors are presented as dotted lines. The actual FPGA measurements are depicted using solid lines. Two types of inputs are employed in the overclocking scenario: uniformly distributed data and real image data from Lena.

According to Figure 5, output errors are reduced in the overclocking scenario for both input types in comparison to the traditional scenario, as expected by our models. In addition, we see that with real data a more significant reduction of MRE is achieved, and that no errors are observed when the frequency is initially increased. This is because for real data, long carry chains are typically generated with even smaller probabilities, and the longest carry chain rarely occurs.

(a) 425MHz, n=8, no errors observed (b) 430MHz, n=8, SNR=47.15dB (c) 480MHz, n=8, SNR=24.1dB (d) 520MHz, n=8, SNR=10.86dB
(e) 425MHz, n=7, SNR=26.06dB (f) 430MHz, n=5, SNR=24.85dB (g) 480MHz, n=2, SNR=6.57dB (h) 520MHz, n=1, SNR=3.95dB

Figure 6. Output images of the FIR filter for both the overclocking scenario (top row) and the traditional scenario (bottom row) under various operating frequencies.

The output images of the FIR filter for both scenarios under increasing frequencies are presented in Figure 6, from which we can clearly see the differences between the errors generated in the two scenarios. In the overclocking scenario, we observe errors in the MSBs for certain input patterns. This leads to “salt and pepper noise”, as shown in the images in the top row of Figure 6. In the traditional scenario, truncation causes an overall degradation of the whole image, as can be seen in the bottom row of Figure 6. Furthermore, it is difficult to recover from the latter type of error, since it is generated by precision loss.
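The qualitative difference between the two error types can be illustrated with a toy simulation (all rates and magnitudes here are assumptions for illustration, not measurements from our experiments): rare MSB flips stand in for overclocking errors, while dropping LSBs stands in for truncation.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.integers(0, 256, size=10_000).astype(np.int64)  # 8-bit samples

def snr_db(clean, noisy):
    """Signal-to-noise ratio in dB."""
    noise = noisy.astype(float) - clean.astype(float)
    return 10 * np.log10(np.sum(clean.astype(float) ** 2) / np.sum(noise ** 2))

# Overclocking-style error: with an assumed 0.1% probability a timing
# violation flips the MSB -- rare but large-magnitude "salt and pepper" errors.
overclocked = signal.copy()
flip = rng.random(signal.size) < 0.001
overclocked[flip] ^= 0x80                       # flip bit 7 (the MSB)

# Truncation-style error: drop the 3 least-significant bits of every sample,
# a small-magnitude error on every single output.
truncated = (signal >> 3) << 3

print(f"overclocking SNR: {snr_db(signal, overclocked):.1f} dB")
print(f"truncation   SNR: {snr_db(signal, truncated):.1f} dB")
```

The overclocked signal is bit-exact almost everywhere, whereas the truncated one is uniformly degraded; this is the visual distinction between the two rows of Figure 6.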

B. Potential Benefits in Circuit Design

Our results could be of interest to a circuit designer in two ways. Typically, either the designer will want to create a circuit that can run at a given frequency with the minimum possible MRE, or the algorithm designer will wish to run as fast as possible whilst maintaining a specific error tolerance. In the first case, the experimental results for all five example designs on the FPGA are summarized in Table II in terms of the relative reduction of MRE as given in (28), where 𝑀𝑅𝐸𝑇𝑟𝑎𝑑 and 𝑀𝑅𝐸𝑜𝑣𝑟𝑐 denote the values obtained in the traditional scenario and in the overclocking scenario, respectively.

(𝑀𝑅𝐸𝑇𝑟𝑎𝑑 − 𝑀𝑅𝐸𝑜𝑣𝑟𝑐) / 𝑀𝑅𝐸𝑇𝑟𝑎𝑑 × 100% (28)
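Evaluating (28) is straightforward; a minimal sketch follows, with example MRE values that are purely illustrative (chosen so the result matches one Table II entry).

```python
def relative_mre_reduction(mre_trad, mre_ovrc):
    """Relative reduction of MRE per (28), in percent."""
    return (mre_trad - mre_ovrc) / mre_trad * 100.0

# Illustrative pair: a traditional MRE of 1.0% against an overclocked MRE of
# 0.2772% reproduces the 72.28% reduction reported for one Table II entry.
print(f"{relative_mre_reduction(1.0, 0.2772):.2f}%")
```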

In this table, the frequency is normalized to the maximum error-free frequency for each design when the input signal is 8-bit. The N/A entries in Table II refer to situations where a certain frequency simply cannot be achieved in the traditional scenario. It can be seen that a significant reduction of MRE is achieved using the proposed overclocking scenario, with the geometric mean reduction varying from 67.9% to 95.4% using uniform input data. Even larger reductions of MRE can be observed when testing with real image data for each design, ranging from 83.6% to 98.8%, as expected given the results shown in Figure 5.

Table III illustrates the frequency speedups for each design when the specified error tolerance varies from 0.05% to 50%. For all designs, we see that the overclocking scenario still outperforms the traditional scenario for each MRE budget in terms of operating frequency. Likewise, the frequency speedup is higher for real image inputs than for uniform inputs. Geometric mean frequency speedups of 3.1% to 21.8% can be achieved using uniform data, and of 5.3% to 27.6% using real image data.
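The second design perspective, maximizing frequency under an error budget, amounts to selecting the highest measured frequency whose MRE stays within budget in each scenario. A sketch with hypothetical (frequency, MRE%) measurement pairs:

```python
# Hypothetical measured (frequency_MHz, MRE%) curves for the two scenarios.
traditional = [(400, 0.0), (430, 0.0), (450, 2.1), (480, 9.5)]
overclocked = [(400, 0.0), (430, 0.01), (450, 0.3), (480, 4.2), (520, 30.0)]

def best_frequency(curve, budget):
    """Highest tested frequency whose MRE stays within the error budget."""
    feasible = [f for f, mre in curve if mre <= budget]
    return max(feasible) if feasible else None

budget = 1.0  # percent
f_trad = best_frequency(traditional, budget)
f_ovrc = best_frequency(overclocked, budget)
print(f"speedup at {budget}% budget: {(f_ovrc - f_trad) / f_trad * 100:.1f}%")
```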

Table II
RELATIVE REDUCTION OF MRE IN OVERCLOCKING SCENARIO FOR VARIOUS NORMALIZED FREQUENCIES BASED ON (28).

Normalized   FIR               Sobel             IIR               Butterworth       DCT4              Geo. Mean
Frequency    Uniform  Lena     Uniform  Lena     Uniform  Lena     Uniform  Lena     Uniform  Lena     Uniform  Lena
1.04         99.85%   100.00%  99.51%   99.74%   72.28%   90.09%   79.03%   100.00%  83.58%   98.06%   86.14%   97.50%
1.08         98.93%   99.97%   96.26%   93.75%   71.64%   90.50%   78.81%   100.00%  83.26%   98.45%   85.15%   96.46%
1.12         94.27%   98.82%   96.25%   93.62%   73.63%   88.25%   81.88%   84.87%   89.44%   99.56%   86.68%   92.84%
1.16         99.66%   99.91%   73.73%   93.62%   73.10%   89.92%   79.30%   84.23%   79.07%   99.44%   80.44%   93.23%
1.20         96.03%   99.90%   81.55%   81.52%   70.76%   75.67%   64.96%   84.50%   N/A*     N/A*     77.46%   84.95%
1.24         98.46%   99.32%   81.43%   81.67%   70.47%   76.12%   63.66%   84.23%   N/A*     N/A*     77.44%   84.92%
1.28         95.39%   99.29%   60.41%   78.24%   N/A*     N/A*     54.38%   75.15%   N/A*     N/A*     67.92%   83.58%
1.32         95.37%   98.75%   N/A*     N/A*     N/A*     N/A*     N/A*     N/A*     N/A*     N/A*     95.37%   98.75%

* Current frequency cannot be achieved in the traditional scenario. These points are excluded from the calculation of geometric means.

Table III
FREQUENCY SPEEDUPS IN OVERCLOCKING SCENARIO UNDER VARIOUS ERROR BUDGETS.

Error        FIR               Sobel             IIR               Butterworth       DCT4              Geo. Mean
Budget %     Uniform  Lena     Uniform  Lena     Uniform  Lena     Uniform  Lena     Uniform  Lena     Uniform  Lena
0.05         4.76%    21.43%   6.82%    6.82%    0.95%    1.26%    12.40%   24.03%   0.72%    0.96%    3.07%    5.32%
0.5          19.05%   21.43%   13.64%   6.82%    0.63%    10.06%   24.03%   24.03%   0.48%    12.44%   4.52%    13.45%
1            19.05%   28.57%   13.64%   18.18%   10.06%   16.35%   24.03%   24.03%   0.48%    12.44%   7.86%    19.10%
5            19.15%   25.53%   18.18%   18.18%   0.54%    0.82%    0.63%    0.94%    7.66%    12.44%   3.91%    5.36%
10           19.15%   25.53%   10.64%   10.64%   0.54%    1.09%    6.92%    13.21%   4.91%    4.91%    5.19%    7.19%
20           19.15%   25.53%   5.77%    15.39%   8.70%    8.70%    3.26%    3.26%    6.70%    10.88%   7.32%    10.39%
50           15.69%   15.69%   10.53%   19.30%   42.50%   50.00%   46.74%   54.89%   15.06%   19.25%   21.80%   27.59%

VI. CONCLUSION

This paper has explored the probabilistic behavior of key arithmetic primitives in an FPGA when operating beyond the conservative region. We have developed models for the errors generated due to both overclocking and truncation of inputs. These models indicate that it may be preferable to allow timing violations to occur, in the knowledge that they will only occur rarely. We support this hypothesis with empirical results on a Virtex-6 FPGA that demonstrate a geometric mean reduction of 67.9% to 98.8% in mean relative error, or a geometric mean improvement in operating frequency of 3.1% to 27.6%, in real applications over the conventional scenario. In the future, we wish to expand our methodology by incorporating silicon area as a third evaluation metric, and to analyze the tradeoffs using alternative architectures.

ACKNOWLEDGMENT

This work is supported by the EPSRC (Grants EP/I020557/1 and EP/I012036/1).

REFERENCES

[1] K. Underwood, “FPGAs vs. CPUs: trends in peak floating-point performance,” in Proc. Int. Symp. Field Programmable Gate Arrays, 2004, pp. 171–180.

[2] J. Fowers, G. Brown, P. Cooke, and G. Stitt, “A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications,” in Proc. Int. Symp. Field Programmable Gate Arrays, 2012, pp. 47–56.

[3] G. Constantinides, N. Nicolici, and A. Kinsman, “Numerical data representations for FPGA-based scientific computing,” IEEE Design & Test of Computers, vol. 28, no. 4, pp. 8–17, 2011.

[4] B. Colwell, “We may need a new box,” Computer, vol. 37, no. 3, pp. 40–41, 2004.

[5] S. I. Association, “International technology roadmap for semiconductors (ITRS),” 2007.

[6] T. Austin, V. Bertacco, D. Blaauw, and T. Mudge, “Opportunities and challenges for better than worst-case design,” 2005, pp. 2–7.

[7] D. Ernst, N. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, et al., “Razor: A low-power pipeline based on circuit-level timing speculation,” in Int. Symp. on Microarchitecture, 2003, pp. 7–18.

[8] A. Uht, “Going beyond worst-case specs with TEAtime,” Computer, vol. 37, no. 3, pp. 51–56, 2004.

[9] K. Keutzer and M. Orshansky, “From blind certainty to informed uncertainty,” in Proc. Int. Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, 2002, pp. 37–41.

[10] J. Levine, E. Stott, G. Constantinides, and P. Cheung, “Online measurement of timing in circuits: For health monitoring and dynamic voltage & frequency scaling,” 2012, pp. 109–116.

[11] Z. Kedem, V. Mooney, K. Muntimadugu, and K. Palem, “An approach to energy-error tradeoffs in approximate ripple carry adders,” in Int. Symp. on Low Power Electronics and Design, 2011, pp. 211–216.

[12] S. Lu, “Speeding up processing with approximation circuits,” IEEE Computer, vol. 37, no. 3, pp. 67–73, 2004.

[13] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, “Low-power digital signal processing using approximate adders,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 32, no. 1, pp. 124–137, 2013.

[14] Altera, “Cyclone device handbook,” 2008.

[15] Xilinx, “Virtex-6 FPGA configurable logic block user guide,” 2009.

[16] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design Perspective (2nd edition). Prentice-Hall, 2003.

[17] Xilinx, “Virtex-6 FPGA clocking resources user guide,” 2011.

[18] B. Gojman, S. Nalmela, N. Mehta, N. Howarth, and A. DeHon, “GROK-LAB: generating real on-chip knowledge for intra-cluster delays using timing extraction,” in Proc. Int. Symp. on Field Programmable Gate Arrays, 2013, pp. 81–90.

[19] H. Wong, L. Cheng, Y. Lin, and L. He, “FPGA device and architecture evaluation considering process variations,” in Proc. Int. Conf. on Computer-Aided Design, 2005, pp. 19–24.