a floating-point advanced cordic processor

13
Journalof VLSISignalProcessing,10, 53-65 (1995) O 1995Kluwer Academic Publishers, Boston. Manufactured in The Netherlands. A Floating-Point Advanced Cordic Processor D.E. METAFAS AND C.E. GOUTIS VLSIDesign Laboratory,Department of ElectricalEngineering, Universityof Patras, Patras 26110, Greece Received April21, 1992; Revised September 10, 1993 Abstract. In this paper, a novel architecture of a floating-point digital signal processor is presented. It introduces a single hardware structure with a full set of elementary arithmetic functions which includes sin, cos, tan, arctanh, circular rotation and vectoring, sinh, cosh, tanh, arctanh, hyperbolic rotation and vectoring, square root, logarithm, exponential as well as addition, multiplication and division. The architecture of the processor is based on the COordinate Rotation Digital Computer (CORDIC) and the Convergence Computing Method (CCM) algorithms for computing arithmetic functions and it is fully parallel and pipelined. Its advanced functionality is achieved without significant increase in hardware, in comparison to ordinary CORDIC processor, and makes it an ideal processing element in high speed multiprocessor applications, e.g. real time Digital Signal Processing (DSP) and computer graphics. 1 Introduction In the signal and image processing domains, the huge amount of data involved in computations has to be pro- cessed in real time. In most applications there is clearly the need for multiprocessor systems with very powerful processing elements in terms of throughput and func- tionality. Until now, several architectures of powerful processing elements, which have been proposed in the literature, are based on the COordinate Rotation Digital Computer (CORDIC) algorithms [1-5]. Ahmed et al. [6] have shown that CORDIC arithmetic units are bet- ter building blocks than the multipliers/accumulators in DSP systems. The CORDIC algorithms introduced by Voider [7] and extended by Walther [8] evaluates a variety of plane coordinate transformations with iterative procedures involving only additions and shift operations. Mod- ifications of the standard CORDIC algorithms have been proposed which increase the processing speed but also the hardware complexity. The on-line CORDIC architecture doubles the hardware complexity of a conventional CORDIC module [9]. Also the VLSI im- plementation of CORDIC using redundancy, requires to perform two conventional CORDIC iterations in par- allel [10]. The CORDIC based processing elements proposed by de Lange [2] and Schmidt [1] reduce the hardware complexity but they implement a subset of the CORDIC functions. Another iterative method for computing elementary arithmetic functions is the Convergence Computing Method (CCM) proposed by Specker [ 11] and extended by Chen [12]. It evaluates arithmetic functions not in- cluded in the CORDIC scheme, such as logarithms of any base, exponential and square root. The implemen- tation of CCM algorithms in digital signal processors has not been studied before. In this paper, the architecture of a very powerful floating-point, fully parallel and pipelined digital sig- nal processor is presented, having a host of arithmetic capabilities. It provides a superior functionality in comparison to previous designs [1-5]. Its set of high speed arithmetic functions includes square root, divi- sion, multiplication, addition, logarithm, exponential, trigonometric and hyperbolic functions. It outperforms the conventional multiplier/accumulator approach in high speed applications, e.g. real-time digital signal processing [6, 13-15], matrix equation solving algo- rithms [16], robotics and computer graphics [17]. The proposed architecture, compared to a pure fully par- allel and pipelined CORDIC processor, has advanced functionality. The trade-off is a small additional hard- ware cost, about a 2.2% increase to the number of transistor, as it is shown in section 5. The use of CORDIC processor in applications that involve high throughput computations of logarithms, exponentials and square roots, e.g. real time cepstrum evaluation, is row extended.

Upload: teipir

Post on 19-Feb-2023

0 views

Category:

Documents


0 download

TRANSCRIPT

Journal of VLSI Signal Processing, 10, 53-65 (1995) O 1995 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

A Floating-Point Advanced Cordic Processor

D.E. METAFAS AND C.E. GOUTIS VLSI Design Laboratory, Department of Electrical Engineering, University of Patras, Patras 26110, Greece

Received April 21, 1992; Revised September 10, 1993

Abstract . In this paper, a novel architecture of a floating-point digital signal processor is presented. It introduces a single hardware structure with a full set of elementary arithmetic functions which includes sin, cos, tan, arctanh, circular rotation and vectoring, sinh, cosh, tanh, arctanh, hyperbolic rotation and vectoring, square root, logarithm, exponential as well as addition, multiplication and division. The architecture of the processor is based on the COordinate Rotation Digital Computer (CORDIC) and the Convergence Computing Method (CCM) algorithms for computing arithmetic functions and it is fully parallel and pipelined. Its advanced functionality is achieved without significant increase in hardware, in comparison to ordinary CORDIC processor, and makes it an ideal processing element in high speed multiprocessor applications, e.g. real time Digital Signal Processing (DSP) and computer graphics.

1 Introduct ion

In the signal and image processing domains, the huge amount of data involved in computations has to be pro- cessed in real time. In most applications there is clearly the need for multiprocessor systems with very powerful processing elements in terms of throughput and func- tionality. Until now, several architectures of powerful processing elements, which have been proposed in the literature, are based on the COordinate Rotation Digital Computer (CORDIC) algorithms [1-5]. Ahmed et al. [6] have shown that CORDIC arithmetic units are bet- ter building blocks than the multipliers/accumulators in DSP systems.

The CORDIC algorithms introduced by Voider [7] and extended by Walther [8] evaluates a variety of plane coordinate transformations with iterative procedures involving only additions and shift operations. Mod- ifications of the standard CORDIC algorithms have been proposed which increase the processing speed but also the hardware complexity. The on-line CORDIC architecture doubles the hardware complexity of a conventional CORDIC module [9]. Also the VLSI im- plementation of CORDIC using redundancy, requires to perform two conventional CORDIC iterations in par- allel [10]. The CORDIC based processing elements proposed by de Lange [2] and Schmidt [1] reduce the

hardware complexity but they implement a subset of the CORDIC functions.

Another iterative method for computing elementary arithmetic functions is the Convergence Computing Method (CCM) proposed by Specker [ 11] and extended by Chen [12]. It evaluates arithmetic functions not in- cluded in the CORDIC scheme, such as logarithms of any base, exponential and square root. The implemen- tation of CCM algorithms in digital signal processors has not been studied before.

In this paper, the architecture of a very powerful floating-point, fully parallel and pipelined digital sig- nal processor is presented, having a host of arithmetic capabilities. It provides a superior functionality in comparison to previous designs [1-5]. Its set of high speed arithmetic functions includes square root, divi- sion, multiplication, addition, logarithm, exponential, trigonometric and hyperbolic functions. It outperforms the conventional multiplier/accumulator approach in high speed applications, e.g. real-time digital signal processing [6, 13-15], matrix equation solving algo- rithms [16], robotics and computer graphics [17]. The proposed architecture, compared to a pure fully par- allel and pipelined CORDIC processor, has advanced functionality. The trade-off is a small additional hard- ware cost, about a 2.2% increase to the number of transistor, as it is shown in section 5. The use of CORDIC processor in applications that involve high throughput computations of logarithms, exponentials and square roots, e.g. real time cepstrum evaluation, is row extended.

54 Metafas and Goutis

The architecture of the proposed processor is based on the CORDIC and the CCM algorithms, in order to obtain the direct realization of the full set of elementary arithmetic functions (Table 1), in a single hardware structure. This is achieved by modifying the CORDIC and CCM algorithm, so that they can be mapped into the same pipe segments. The implementation of the two algorithms in a common architecture is partly studied in [18-19] by the authors.

The architecture is floating-point, fully parallel and pipelined resulting a high throughput processing ele- ment which fulfils the requirements of modern appli- cations. The pipeline length, the evaluation time and the accuracy are independent of the function itself.

The rest of the manuscript is organized as follows: The CORDIC and CCM algorithms, as well as the adaptations which are required, are presented in sec- tions 2 and 3 respectively. In section 4, the architecture of the proposed processor is presented, while in sec- tion 5 the VLSI implementation of a 21-bit processor is studied in detail.

2 The C O R D I C Algorithms

The CORDIC algorithms present an efficient way to compute recursively a variety of arithmetic functions by performing only add and shift operations at each step. The algorithms are based on rotations in circular, linear and hyperbolic coordinate systems [8]. Each function is evaluated using three variables: x, y and z. These variables are initialized to the input values and after the completion of the algorithm they contain the desired outputs. The CORDIC iterative equations are applied N times to guarantee N bits accuracy.

The generalized CORDIC iterations are:

X i + I ~ X i - - m f f i S i Y i

Y i + l = Y i - t - f f i ~ i x i

Z i + l ~ Zi - - f f iOlm, i

i E {0, 1 . . . . . N} (1)

The parameter m can have three values, corresponding to the circular (m = 1), linear (m = 0) and hyperbolic (m = - 1 ) coordinate systems. The iteration step ~i is set equal to r - i which is an integral power of the machine radix r (r = 2), and so it is easily implemented by a shift operation. The sequence of the values of i depends on the parameter m as shown in Table 2 for the Walther's standard CORDIC iterations. The vale of cri (o'i = - 1 or + 1) is selected so that either y or z is driven towards zero to obtain the desired results. The

Table 1. The set of functions of the advanced CORDIC processor.

Function Outputs

Circular x cos(z) - y sin(z)

rotation x sin(z) + y cos(z) 0

Circular ( V / ~ + y2) vectoring 0

z + t a n - l ( y / x )

x

Multiplication y n t x z

0

x Division 0

z + y/x

Hyperbolic x cosh(z) + y sinh(z) rotation x sinh(z) + y cosh(z)

0 Hyperbolic ~ y2)

vectoring 0 z q- tanh - l ( y / x )

Exponential y b z

0 Logari thm 1

z + logb(y) Inverse x / 4"- f i

square root I

Table 2. Values of standard CORDIC parameters.

Coordinate Domain of Scaling factor

system m i sequence convergence K,,,

1 0, 1, 2 . . . . ~1 .74 ~1 .65 0 1, 2, 3 . . . . 1.0 1.0

- 1 1 , 2 , 3 , 4 , 4 , 5 . . . . * ~ I . 1 3 ~ 0 . 8

*For m = - 1 the following integers are repeated: {4, 13 . . . . . k, 3k + 1 . . . . }.

Otm,i denotes the CORDIC angle

1 - - - tan - l (Si~/-m) (2) Olin, i - -

Depending on the parameter m the algorithm perform three classes of functions: circular, linear and hyper- bolic. In each class there are two modes: rotation and vectoring.

Taking cri = sign(zi), i.e. adopting the rotation mode and N iterations, Eq. (1) results in

X N + 1 = K m ( c o S ( , v / - - m Otm)XO - - vrmsin(.v/mot,,,)yo)

YN+~ = K m s i n ( ~ ' ~ , , , ) x o + cos(v'-'m~m)yo

zlv+l = 0 (3)

A Floating-Point Advanced Cordic Processor 55

Table 3. Values of modified CORDIC parameters.

Coordinate

system m i sequence Domain of

convergence Scaling factor

iterations

I

0 - 1

- 1 " , 0, 1,2 . . . . . 16

0, 1 ,2 ,3 . . . . . 16

1 , 1 , 2 , 3 , 4 , 4 . . . . . 13,13,14,15,16,16

~3 .14

2 ~1 .57

yi = +l fori = 2 , 3 , 4, 7, 8, 10, 12, 14, 16

)4 = O V i Yi = + 1 f o r / = 2 , 3 , 15and

)4 = - 1 fo r i = 7, 11, 12

*Special initial iteration for range extension.

where Km depends only on the number of iterations

K,,, = H ~ ( 1 + ma~) i

and

Olin : E ffiOlm'i "~ ZO

(4)

(5)

adopting the

i

Similarly taking cri = -sign(y/), i.e. vectoring mode, Eq. (1) results in

xtr = K,, ,~x 0 + my 2

YN+I = 0

Z N + , = Z o - - m arctan ( ~ 0 m ) (6)

The CORDIC convergence relationship is

N

a,n,k -- ~_~ Olin, j <_< Otm,N, k = O, 1 . . . . N 1

j = k + l

(7)

In the case of hyperbolic operation modes, Eq. (1) is repeated for specific values of i to satisfy the conver- gence relationship (7). This technique is called double pass iterations [4]

Xi+I ,2 = Xi+I , I + f f i+ l , l~ iY i+l , l

Yi+I ,2 ~- Y i+ l , I + f f i+l , l~iXi+l,1

Z/+I ,2 = Zi+I,1 - - O'/+l,10t--l , / ( 8 )

Furthermore, in order to force the scale factor (Km) to converge toward unity another set of equations is used, called scaling factor iterations [4]

Xi+I , 2 = X i+ l , 1 --~ }ti~iXi+l, 1

Yi+I ,2 = Yi+I,1 + YiaiYi+l,1 (9)

where Vi = - 1,0 or 1, depends on m and i.

For a fully pipelined implementation of the CORDIC algorithms, the double pass iterations and the scaling factor iterations, in the case of hyperbolic operation modes, are preferable to be executed for different i values, in order to achieve high hardware utilization, as it's shown in section 5.

The range of convergence in CORDIC algorithms is

Iz01 _< Cm (10a)

for the rotation mode and

arctan < Cm (10b)

for the vectoring mode,

= ~ c%.i + Otm.N-I (10c) where Cm i

A special initial iteration is applied in the case of circu- lar mode to extend the range of convergence to [-Jr, 7r]

X0, 2 = -croy o

Y0,2 : - ff0x0

Zo.2 = Zo - Cro~- (11)

The double pass and scaling factor iterations have been adopted instead of other techniques [13, 16], so that the full set of the CORDIC functions is included in the processor operation set and also the CCM equa- tions to be mapped on the same architecture's pipe seg- ments, as it's shown in section 4.

The modified parameters of the CORDIC algorithms for 16-bit accuracy have been evaluated, according to the selected scheme and are shown in Table 3. Us- ing these parameters we obtain range of convergence [ - { , {] for the hyperbolic functions, [-Jr, rr] for the

1 1 circular functions and normalization factors ,.- , K, which guarantee 16-bit accuracy of the outputs, as it's shown in section 5.

56 M e t a f a s a n d Gou t i s

3 The CCM Algorithms

The CCM algorithms are based on a normalization pro- cess and compute certain elementary functions [ 11-12, 20]. A class of the CCM algorithms for computing the logarithm, exponential and the square root functions is described below. N iterations are required in order to obtain accuracy of N bits.

The logarithm of any base is computed using the following normalization [ 12]

l-'~ iN= l a i t=l

N

-- ~_~ logt,(ai ) (12) i=1

Defining xi+l = x ia i , Yi+l,i : Yi - l O g b ( a i ) and ai =

1 -k- di 2-i the algorithm for the logh(x) function is

Xi+ 1 ~ X i --~ d i 2 - i x i

Yi+I =Yi - l o g b ( l +di 2-i) i = 1,2 . . . . . N (13)

ifxi(1 + 2 - i ) > 1 thendi = 0 e l s e d i = 1. After N iterations it results in

XN+ I ~ 1

YN+I = Yt -- logb(xl) (14)

The range of convergence is xl ~ [0.5, 1). The exponential function b x is computed using the

normalization [ 12]

Hi=I ai'l-logt' I"Ii=l ai ~-~ b x = bX_logl, u N = aibX+Y~ui=l(_logbai)

i=1 (15)

where b is a predefined constant. /v

When the sequence x + Y-~.i=l ( - lOgb ai) tends to zero

the limit of 1-I~V_i ai is b x .

Defining xi+l = xi - logb(oti), Yi+J = yioti where ai : 1 + di 2i the algorithm for the b x function is

xi+l = xi - lOgh(1 + di 2-i)

Yi+l : Yi q- d i 2 - i y i i = 0, 1 . . . . . N (16)

i f x i - logt,(1 + 2 i) < 0 then di : 0 else d i = 1.

After N + 1 iterations it results in

XN+ 1 ~- 0

YN+I = yob x~ (17)

The range of convergence is x0 6 [0.5, 1).

The inverse square root is computed using the nor- malization [12]

y y I"IiNl ai = ( 1 8 )

x ]--IL, a~

N 9 When the sequence x l-Ii=l a7 tends to 1, the sequence

y I-lUi=t ai will approach "~xx" Defining xi+, = x ia? and yi+l = yiai the iterative equations for the square root function are

Xi+l , l -~ Xi q- d i 2 - i x i

Xi+l , 2 ~ Xi+l, 1 "~ d i 2 - i x i + l , l

Yi+I : Yi " ~ - d i 2 - i x i i = 1,2 . . . . . N (19)

i fx i ( l + 2- i ) 2 > i then di = 0 elsedi = 1.

After N iterations it results in

X N : I

Yt YN : (20)

The range of convergence is xl ~ [0.25, 1). The radix-2 class of CCM algorithms is adopted in-

stead of higher radix classes [20] in order to achieve compatibility with CORDIC algorithms.

4 Architecture of the Advanced CORDIC Processor

The floating-point implementation of the processor is based on a modification of the standard floating-point arithmetic, which eliminates all exponent alignments and mantissa normalizations, otherwise needed in each addition/subtraction, except for the first and the last. The numerical accuracy remains the same, assuming a sufficient number of overflow and guard bits as it is shown in section 5. The hardware overhead is small in comparison to a fixed-point implementation and a similar data throughput is obtained.

The processor architecture (Fig. 1) is composed of three main blocks: a) the floating-point preprocessor, which is a special purpose floating-point to fixed-point number format converter, b) the fixed-point CORDIC pipeline and c) the floating-point postprocessor, which is a special purpose fixed-point to floating-point num- ber format converter.

A Floating-Point Advanced Cordic Processor 57

X. Y. Z In In in

FLOATING POINT PREPROCESSOR

RBB

RBB

EXP PATH

EXP PATH

EXP PATH

C In

SCU

SCU

SCU

Fig. 1. Processor architecture.

RBB

FLOATING POINT POSTPROCESSOR

x out Yout Zout

EXP PATH

E• PATH

SCU

s c u

c out

4.1 Floating-point Preprocessor

The architecture of the floating-point preprocessor is shown in Fig. 2. The entries of the preprocessor are floating-point numbers with normalized mantissa and exponent in. two's complement format. We denote m!~,

m an and m myin and m~ n the mantissas of the inputs, e x , ey e z out and m ~ the the exponents of the inputs, and m ~ my

mantissas of the outputs. The exponent output e ~ of the preprocessor is moving without changes through the fixed-point CORDIC pipeline and it is used by the postprocessor to calculate the exponents of the proces- sor's outputs. The outputs of the preprocessor, which are the inputs of the fixed-point pipeline, have to be

in the range of convergence according to the opera- tion mode (Table 4). The function of the preprocessor, depending on the operation mode, is described below.

In the Circular and Hyperbolic modes, the prepro- cessor adapts the mix n or miv ~ mantissa which the associ- ated exponent is the smallest exponents, according to the difference between the exponent values, such that

out and out have equal ex- the resulting mantissas m x my ponents. The largest exponent of them is outputted. In order to achieve an important reduction in the com- plexity of the algorithm, in the two rotation modes, we restrict the range of the z input, representing the angle, in [ -C , , , Cm] where Cm is defined in Eq. (10) [16]. The z input is converted to fixed-point number format.

58 Metafas and Goutis

Fig. 2.

Table 4.

e in e In

mxj, BARREL

SHIFTER

BARREL

SHIFTER

m o u t j, y

Floating-point preprocessor architecture.

The range of inputs in the processor's fixed-point pipeline.

Input range

Function X path Y path Z path

Circular rotation [ -1 , 1) [ -1 , 1) [-rr, n] Circular vectoring arctan( Y / X ) < rc [ -4 , 4) Multiplication [ -1 , 1) [ -1 , 1) [ -2 , 2l Division Y / X < = 2 [ -4 , 4) Hyperbolic rotation [ -1 , 1) [ -1 , 1) [ - r r /2 , rr/2] Hyperbolic vectoring arctanh(Y/X) < 7r/2 [ -4 , 4) Exponential - [ -1 , 1) [0.5, 1) Logarithm - [0.5, 1) [ -4 , 4) Inverse square root [ -1 , I) [0.25, 1) -

in in if ed > 0 then Denoting ed = ey -- e x ,

out in --ca m x = m x 2

out in m y = m y

out in -ez / n z ~ /"n z 2

eOUt in = ey

else

The preprocessing, shown lowing expressions:

(21a)

out in m x = m x

out = miyn2ea m y

out m~n2e~ m z =

e ~ eix n (21b)

above, is based on the fol-

/n o

/n'

BARREL

SHIFTER

m~ z

m eyd l

u:

eOUt[

�9 r o o d 2 Y

3-

For rotation modes

.. e,. sin (m~2 e~") rnixn2 e!" cos (rnizn2 e~") -t- rny 2 !"

= (mOUtcos (mOUt) + m OUtsin (mOUt))U .... (22a)

and for vectoring modes

~/(mixn2e!~") 2 -'1-" (miyn2e~) 2 = ~/(mx~ 2 + (m~Ut) 2 2 e~

(22b)

in e~" [ my n2e~ ~ [ my ut '~ m z 2 + a r c t a n / ~ / = out+ arctan/~_~out /

\ m x 2 x / mz \mOU ]

(22c)

In the Multiplication mode, the preprocessor adapts the mantissas such that the mantissas of the two terms of the multiplication result, y and x z have equal exponents.

in (e~ n + if ed > 0 then Denoting ed = e r -- eixn),

out in 2-ea m x = m x

out in my = my

out in m z = m z

e ~ = eyin (23a)

else out in

m x ~ m x out in 2ca m y = my

out in m z = m, z

eOUt in + eix n (23b) = e z

A Floating-Point Advanced Cordic Processor 59

The preprocessing is based on the following expression

�9 i n �9 i n In e r In e z / out out outh,'~e ~ min2el '" q- m x 2 m z 2 -I- (24) = t m y m x m z ) z y

Also in the Division mode, the preprocessor adapts the mantissas such that the mantissas of the two terms of the division result, z and x / y have equal exponents. In the Exponential mode no adaptations are required and the range of z input is restricted to [0.5, 1) (or ezin = 0 ) . In the Logarithm mode the preprocessor adapts the mantissa of z input such as its exponent becomes 0 and the range of the z input is restricted to (0, 1).

In the Inverse Square Root mode the preprocessor adapts the mantissa of the y input such that the exponent of the resulting expression can be calculated

out in m x = m x

out in 2-(e~,~mod2) m y = m y

eOUt in in �9 = e x - ey dw 2 (25)

The preprocessing is based on the following expression

in e! n out 2 e~ m x 2 = . m x (26) " ~ / m o u t g e ~ nmOd2

V . ~ y ~ -

Denoting as Ne the number of bits of the exponents and Nm the number of bits of the mantissas, the ba- sic hardware modules required for the preprocessor (Fig. 2), includes five Ne-bit adder/subtractor circuits for the calculation of ed, seven MUXs and three barrel- shifters. The size of the barrel-shifters is 2 N'-I rows of Nm bits each.

4 .2 F i x e d - p o i n t P i p e l i n e

The architecture of the pipeline fully exploits the common basic structure of the CORDIC and CCM algorithms. If N is the number of bits of the operand mantissas, the architecture is composed of N + 2 re- cursion building blocks, where as Recursion Building Block (RBB) is defined as the amount of hardware used for one algorithm recursion. Each recursion is assigned to the corresponding RBB, characterized by a shift value i. Two iterations are executed in general in one RBB; a main iteration and an supplementary one with the same characteristic shift value i.

Two different types of RBBs are used, called type-A and type-B which are shown in Figs. 3 and 4 respec- tively. The need for the two different types stems from the specific characteristics of the elementary function

xP+ O xP+ O

q,,

Fig. 3. Type-A recursion bui lding b lock architecture.

algorithms, like double pass iterations in hyperbolic mode (Eq. 8), scale factor iterations in circular and hy- perbolic modes (Eq. 9) and double iterations in square root mode (Eq. 19).

The functional description of type-A RBB for the ith recursion is given below:

o = 0 [x ( t )+Crx2- iy ( t ) if s x

0 1 x t = l x ( t ) + C r x 2 - i x ( t ) if s x

I : = 0 0 x ( t ) if s x s x

I 2 = 0 1 x ( t + 1)= x, if s x s x

I s~ = 10 Xt - [ - y x 2 - i x t if s x

(27a)

o = 0 y ( t ) q- O'y 2 - - i x ( t ) if Sy o 1 Y t = y ( t ) + Oy 2 - i y ( t ) if Sy =

y ( t + 1)=

and

I s~ = 0 0 y ( t ) if sy

I 2 = 0 1 Yt if Sy Sy

i 2 10 Yt + y y 2 - i y t if sy Sy =

o = 0 z ( t + l ) = z ( t ) if s z o 1 z ( t ) + C r : m if s z =

(27b)

(27c)

where xt and y, temporary variables. The s ~ s~, s 2, 0 S1, 2 and o Sy, Sy S z are the control signals selecting the

appropriate microoperation of the type-A RBB accord- ing to the operation mode and its index in the pipe.

60 Metafas and Goutis

Fig. 4.

yp~2) ~ za§

Type-B recursion building block architecture.

The ax, O'y, O-z, Yx and Kv are control signals for the adder/subtractors. The Cm is a constant selected among arctan(2-i), 2 -i , arctanh(2 - i ) and logb(1 + 2 - / ) ac- cording to the RBB operation mode and index i.

Also, the functional description of type-B RBB for the ith recursion is

x( t + 1) / o o _ x ( t ) +Crx2-iy( t ) if s x =

- - " o 1 [ x ( t ) + O'x2-'x(t) if s x =

x (t + 2)

x ( t ) if

x( t + 1) if

x( t + 1) + E~2-iy(t + 1) if

x ( t -k- 1) -t- yx2 - i x ( t -t- I) if

y( t + 1)

_ ] y( t ) + fly 2 - i x ( t )

-- I y( t ) -k- ay 2 - i y ( t )

o = 0 if Sy 0 1 if Sy =

y( t + 2)

y( t )

y( t + 1)

y( t + 1) + yy 2 - i x ( t + 1)

y( t + 1) -t- ~'y 2 - i y ( t + 1)

s~s 2 = O0

44=Ol SxXS2 = 10

I 9 SxS x = 11

(28a)

l 2 = 0 0 if Sy Sy

1 S y = 0 1 if Sy l 2 10 if Sy S r = 1 2 11 if Sy Sy =

(28b)

and

z(t q- 1) = z(t) + ~rzCm

z(t)

z(t + 2 ) = z(t + l)

z( t + 1) + yzCm

if o l = O0 S z S z

if s z~ = 01 (28c)

if 0 l = l 0 Sz Sz

1 o l s~,sO, s~,,Sy, sOandszare thecontro l where the s x, s x , signals selecting the appropriate microoperation of the type-B RBB according to the operation mode and its index in the pipe.

The ax, fly, o z , Fx-, Fy and Fz are control signals for the adder/subtractors. The c,, is a constant selected among arctan (2-i) , 2 -/ , arctanh(2 - i ) and lOgb (i -t-2 -i) according to the RBB operation mode and index i.

Defining as pipe segment the elementary hardware of the pipe which has registers at the inputs and isolated control, the type-A RBB consists of one pipe segment while the type-B RBB of two. The type-A RBB in- cludes two shift/add units for each of the X and Y path and one for the Z path (Fig. 3). In the type-B RBB each of the two pipe segments includes one shift/add unit for each one of the X, Y and Z path (Fig. 4). The shift unit is implemented as a hard-wired shifter.

Type-A RBBs are used for the recursions of the algorithms which include only a main iteration or for those whose supplementary iteration is indepen- dent from the main one (scaling factor iteration of the

A Floating-Point Advanced Cordic Processor 61

/ I b~e

\ /

N - I btt~ / N

N ," t ~ wor162 w o r a _ l

. ' * , - I I , 0 . z / ( N b~t~

...

**~ J, ~L C O n t r o l b = t

N-bit adder / eubtractor

Fig. 5.

I bits

[ MUX

N - I bits

\ /

g I / N b , t s ...

\

... ~ w o r n . z

t r o l b , t

Supplementary N-bit adder/subtractor

Architecture of shift/add units in type-A RBB.

CORDIC algorithms and double iterations of inverse square rooting). On the contrary, recursions including supplementary iterations depended on the results of the main iterations need to be mapped on type-B RBBs (double pass iteration of the CORDIC hyperbolic algo- rithms). Two simplified RBBs, called Init-A RBB and Init-B RBB, are used for range extension iteration in CORDIC circular modes and exponential mode. Sup- plementary iterations are not executed in these RBBs. The first one implements Eq. (11) and the second one Eq. (8) orEq. (16) for i = 0.

In type-A RBBs, when using standard N-bit ripple carry adders, composed of 1-bit full adder cells and XOR gates (called FAD cells), in the two shift/add units of each of the X and Y path, the total delay of the i-th RBB is N q- i FAD delays. In order to reduce the total delay to N FAD delays, a special purpose architecture of adders [13, 21] is used, exploiting pipelining at the bit level (Fig. 5). Denoting as [x] the i top bits of x then the i top bits of the result will be [x] - 1, [x] or [x] + 1. The blGcks PA compute these three quantities in parallel with the production of x and when the sign of the second operand and the carry of the N - i FADs becomes available, the PG circuit selects the appropri- ate value. The function of the circuits FAD, PA and PG in the ith RBB is described below: In type-A and type-B RBBs for 1 < j < N - i,

FADj sj = (Yj-] ~)ctrl) ~ (xj ~ C j _ I )

cj = (Yj-I ~ ctrl) �9 (xj ~ cj-1) + (xj ~ c j - ] )

where ctrl selects addition or subtraction. In type-A RBBs for N - i < j < N,

FAj sj = (X j - 1 (~ ctrl) ~ Cj-I cj = (x j -] ~ ctrl) . c j - l

In type-B RBBs,

PG ao = not(not(cN-i) + (YN ~ ctrl)) al = not(cN-i ~ (YN ~ ctrl)) a2 = not(cN-i + not(yN ~ ctrl))

In type-B RBBs for N - i < j < N,

PAy C1, j = CI, j_ 1 �9 Xj

C2,j = CI,j--I + X j sj = c], j- i �9 x j

x j not(cl.j_] �9 x j )

if a0 = 1 or if al = 1 or if a2 = 1

The control signals which select the operation mode of the processor and also the exponent are moving without changes through the fixed-point pipeline. The range of valid input data, in this proposed fixed-point pipeline, depends on the operation mode and they are presented in Table 4.

4.3 Floating-point Postprocessor

The architecture of the floating-point postprocessor is shown in Fig. 6. The entries of the postprocessor are

62 Metafas and Goutis

X

4, J BARREL SHIFTER

DO

out m 8 x x

I d"

B~SBHI~ ER~EL~

out ut m o

yorz yorz

e/n

Fig. 6. Floating-point postprocessor architecture.

the outputs of the fixed-point pipeline and they are in �9 i n two's complement format. We denote mix n, m~" and m z

the mantissas of the inputs, e in the exponent, outputted from the preprocessor and moved unchanged through the fixed-point pipeline, m x~ mrOUt and m z~ the man- tissas of the outputs and e ~ out ey ande ~ the exponents of the outputs.

The function of the postprocessor is described below: For the Circular, Linear, Hyperbolic, Exponential and Square rooting modes denoting ea = I - Number O f Sign Bits(m in) where I is the number of bits in the integer part of mantissa including the sign bit, we have

m out = min2-ea

e ~ = e in + ea (29)

for the one or two output values of the processor. As it's shown in Table 1, the output values are x and y in circular rotation, multiplication and hyperbolic rota- tion modes, x and z in circular vectoring, division and hyperbolic vectoring modes, y in exponential mode, z in logarithm mode, and x in inverse square root mode.

In the Logarithm mode the calculation of e in logb(2) value is required before the postnormalization process [22], which can be easily produced using a look-up table. Therefore the postprocessor functions as follows

t i n m z = m z + e inlogb(2 )

and rn~ ut -- mlz2-e:,

out ~ t e z e d (3o)

where e' a = I - Number O f Sign Bits (re'z). Denoting Ne the number of bits of the exponent e in,

Nm the number of bits of the mantissas and Nf the num- ber of overflow bits in the mantissas, the basic hard- ware modules required for the postprocessor (Fig. 6), includes two detect shift circuits for the calculation of ea, two Ne-bit adder/subtractor circuits, one Nm- bit adder/subtractor circuit, three MUXs, two barrel- shifters and one look-up table. The size of the look-up table is 2 N,-1 words of Nm bits each, while the size of the barrel-shifters is 2 N'-l + N f rows of gm bits.

5 A 21-bit Floating-point Advanced CORDIC Processor

In this section, the VLSI design characteristics of an 21- bit floating-point CORDIC processor with advanced operation set, based on the previous analysis, are pre- sented.

The inputs of the processor are 21-bit floating-point numbers with 16-bit mantissa and 5-bit exponent, both in two's complement format. In order to achieve 16 bits accuracy, 16 main iterations are required for ev- ery elementary arithmetic function of the processor.

A Floating-Point Advanced Cordic Processor 63

Table 5. The range of inputs of the advanced CORDIC processor.

Input range

Function X path Y path Z path

Circular rotation [-216, 216) [-216, 216) [-rr, zr] Circular vectoring arctan(Y/X) < 7r [-4, 4) Multiplication [-216 , 216 ) [-216 , 216 ) [-216 , 216 ) Division [--216 , 216 ) [-216 , 216 ) [-216 , 216 ) Hyperbolic rotation [-216, 216 ) [-216, 216 ) [-;,r/2, rr/2l Hyperbolic vectoring arctanh(Y/X) < zr/2 [-4, 4) Exponential - [--216, 216) [0.5, 1) Logarithm -- (0, 216) [--4, 4) Inverse square root [-216, 216) [0, 216 ) -

The supplementary iterations, which are required for the convergence of the CORDIC hyperbolic algorithms (double pass iterations) and also for range extension of the angle in these modes, depend on the number of main iterations. For 16 main iterations, the dou- ble pass iterations (Eq. 8) are executed for i E G1, G1 = {1,4, 13, 16}, satisfying CORDIC convergence relationship (Eq. 7), in the corresponding RBBs. The scale factor produced, is estimated as follows

16

K-1 = I--I ~/(1 - 2 -2i) I"I ~/(1 - 2 -zj) (31) i=1 jEGI

The scaling factor iterations (Eq. 9) for the hyper- bolic modes are executed with gi = +1 for i E G2, G2 = {2, 3, 15} and with Yi = - 1 for i E G3, G3 = {7, 11, 12}. A shift value i included in G1 f'l(G2 UG3), means that both double pass iteration and scale factor iteration have to be executed in the ith RBB. In this case additional hardware in that RBB is required. Therefore G1, G2 and G3 are produced such that the following condition is satisfied

GI f3 (G2 U G3) = 0 (32)

Also the scaling factor iterations of the CORDIC circular functions are executed for i ~ G4, G4 = {2, 3, 4, 7, 8, 10, 12, 14, 16} with yi = +1. Both the scaling factor iterations for hyperbolic and circular

l 1 modes result in normalization factors, ~ and K-"? respectively, which guarantee 16-bit accuracy of the outputs.

The architecture of the fixed-point pipeline consists of 12 type-A RBBs with associated shift values 2- 3, 5-12, 14 and 15, 4 type-B RBBs with associated

shift values 1, 4, 13, 16 and two special RBBs for the range extension iterations of the circular, linear and exponential modes. Type-A RBB consists one pipe segment while type-B RBB two. So the pipe includes 22 pipe segments.

The X and Y paths of the fixed-point pipeline include 34 shift/add units each, while the Z path includes 22 shift/add units. The input data for the X and Y path of the pipe are 16-bit fixed-point numbers normalized to the range [ -1 , 1) and for the Z path 18-bit fixed-point numbers normalized to the range [ -4 , 4).

Three overflow bits are required for the X and Y paths inside the fixed-point pipeline which are deter- mined by the maximum values of x and y outputs,

max(x in, yin)Icosh ( 2 ) + sinh ( 2 ) ] = 4.81 < 23

(33) and 1 overflow bit for the Z path.

Also due to the rounding errors which caused by a maximum of 29 shift/add operations in hyperbolic modes, 4 guard bits are required to guarantee the 16- bits accuracy of the output data, such that

2912 -(16+4) < 2 -16 (34)

resulting in a 23-bit datapath width in each of the X, Y and Z path. The processor's latency is approximately 24 23-bit adder/subtractor delays while its throughput is one 23-bit adder/subtractor delay.

The 21-bit Advanced CORDIC processor chip has been designed using the Mentor Graphics Cell Station VLSI CAD tool and fabricated in 1.5/~m ES2 standard cell technology. It consists of 154,000 transistors and performs 30 106 operations per second. The size of

64 M e t a f a s a n d Gout i s

Table 6. Hardware utilization of the arithmetic functions.

Circular Hyperbolic Linear Square rot./vect, rot./vect, mult./div, log/,(x) b x root

74% 80% 57% 35% 37% 71%

the chip is equal to 1 cm 2 whi le its power dissipation

is es t imated to 1.5 Watt. The size o f the chip could be

drast ical ly reduced using h igh-complexi ty cells or full

cus tom design.

The proport ion of processor ' s hardware required for

independent pipel ined implementa t ion of each e lemen-

tary funct ion is shown in Table 6. It is calculated ap-

proximate ly as the ratio o f shift/add units involved in

the funct ion operat ion and the total amount o f proces-

sor 's shift/add units (90 units). For the implementa t ion

of a pure, fully parallel and pipel ined C O R D I C proces-

sor 88 shift/add units are required which results in ap-

proximate ly 97.8% of the proposed advanced C O R D I C

processor.

6 Conclusions

The floating-point , fully parallel and pipelined, ad-

vanced C O R D I C processor which is presented in this

paper, is an ideal processing e lement for real- t ime ap-

plications. The advanced functional i ty of the processor

is der ived f rom the implementa t ion of the C O R D I C

and the C C M algori thms for e lementary ari thmetic

funct ions ' evaluation, in a single hardware structure.

This extension o f C O R D I C processor ' s functionali ty

is obtained without significant increase in hardware.

It should be noted that it is usually possible to find a

more efficient scheme to compute a particular function,

i f this is all that is required, however the advantage of

the proposed processor lies in its universality.

References

1. G. Schmidt, D. Timmermann, H. Hahn, J.E Bohme, B.J. Hosticka, and G. Zimmer, "Parameter Optimization of the CORDIC algorithm and implementation in a CMOS chip," Pro- ceedings of EUSIPCO 1986, pp. 1219-1222.

2. A.A.J. de Lange, A.J. van de Hoeven, E.F. Deprettere, and J. Bu, "An optimal floating-point pipeline CMOS CORDIC processor," Proceedings of lSCAS 1988, pp. 2043-2047.

3. T. Sung, T. Parng, Y. Hu, and P. Chou, "Design and implemen- tation of a VLSI CORDIC processor," Proceedings of ISCAS 1986, pp. 934-935.

4. G.L. Haviland and A.A. Tuszynski, "A CORDIC arithmetic processor chip," IEEE Transactions on Computers, Vol. C-29, pp. 68-79, Feb. 1980.

5. D. Timmermann, H. Hahn, B.J. Hosticka, and G. Schmidt, "A programmable CORDIC chip for digital signal processing applications," IEEE Journal of Solid State Circuits, Vol. 26, pp. 1317-1320, Sept. 1991.

6. J.E. Voider, "The CORDIC trigonometric computing technique," IRE Transactions on Electronics and Computer, Vol. EC-8, pp. 330-334, Sept. 1959.

7. J.S. Walther, "A unified algorithm for elementary functions," Proceedings of Spring Joint Computer Conference 1971, Vol. 38, pp. 379-385.

8. H.M. Ahmed, J.M. Delosme, and M. Morf, "Highly concurrent computing structures for matrix arithmetic and signal process- ing," IEEE Computer, Vol. 15, pp. 65-82, Jan. 1982.

9. H.X. Lin and H.J. Sips, "On-Line CORDIC algorithms," IEEE Transaction on Computers, Vol. C-39, pp. 1038-1052, Aug. 1990.

10. J. Duprat and J. Muller, "Fast VLSI implementation of CORDIC using redundancy," AlgorithrrL~ and Parallel VLSI Architec- tures, Vol. B, Elsevier Science Publishers, pp. 155-164, 1991.

1 I. W.H. Specker, "A class of algorithms for In x, exp x, sin x, cos x, tan- 1 x and cot- I x," IRE Transactions on Electronics and Computers, Vol. EC-14, pp. 85-86, Feb. 1965.

12. R.C. Chen, "Automatic computation of exponentials, loga- rithms, ratios and square roots," 1BM Journal on Research and Development, Vol. 16, pp. 380-388, 1972.

13. E.E Deprettere, P. Dewilde, and R. Udo, "Pipelined CORDIC architectures for fast VLSI filtering and array processing," Pro- ceedings of lCASSP 1984, pp. 41. A.6.1-41.A.6.4.

14. A.M. Despain, "Fourier transform computers using CORDIC iterations," IEEE Transaction on Computers, Vol. C-23, pp. 993-1001, Oct. 1974.

15. L.W. Chang and S.W. Lee, "Systolic arrays for the Discrete Hartley Transform," IEEE Transactions on Signal Processing, Vol. 39, pp. 2411-2418, Nov. 1991.

16. J.R. Cavallaro and F.T. Luk, "CORDIC arithmetic for an SVD processor," Journal on Parallel and Distributed Computing, Vol. 5, pp. 271-290, June 1988.

17. B. Yang, D. Timmermann, J.F. Bohme, H. Hahn, B.J. Hosticka, G. Schmidt, and G. Zimmer, "Special computers: Graphics, Robotics," Proceedings of CompEuro 1987, pp. 727-730.

18. D.E. Metafas and C.E. Goutis, "A DSP processor with a power- ful set of elementary arithmetic operations based on CORDIC and CCM algorithms" Proceedings of EUROMICRO 1990, pp. 51-58.

19. D.E. Metafas and C.E. Goutis, "A floating-point pipeline CORD1C processor with extended operation set," Proceedings of lSCAS 1991, pp. 3066--3069.

20. M.R.D. Rodrigues, J.H.P. Zurawski, and J.B. Gosling, "Hard- ware evaluation of mathematical functions," IEEE Proceedings, Vol 128, Pt. E, No. 4, pp. 155-164, July 1981.

21. D.E. Metafas, G. Krikis, and C.E. Goutis, "VLSI design of an 8-bit fixed-point CORDIC processor with extended operation set," Proceedings ofEUROASIC 1991, pp. 158-161.

22. O.G. Koufopavlou, D.E. Metafas, and C.E. Goutis, "Archi- tecture and VLSI module generator for expressing digital sig- nals in decibels," International Journal of Electronics, Vol. 71, pp. 297-307, 1971.

A F l o a t i n g - P o i n t A d v a n c e d C o r d i c P r o c e s s o r 65

Editor Note: The journal has been unable to obtain corrected proofs of this paper from the author. We have made our best effort to present the material in a fault-free manner and apologize for any errors that may have occurred in the publication process.

Dr. Dimitris E. Metafas was born in Piraeus, Greece, on December 1, 1964. He received the Diploma in electrical engineering in 1987 and the Ph.D. degree in 1993 from University of Patras, Patras, Greece. He has been working in research projects from ES- PRIT, RACE and National Programs. His research interests include Computer-Aided VLSI Design, Silicon Compilers for Computation- ally Intensive Systems, Digital Signal Processing and System Design for Image Processing.

D. Costas Goutis was a Lecturer at the School of Physics and Math- ematics at the University of Athens from 1970 to 1972. In 1973, he was the Technical Manager in the Greek Post Office (Telecom- munications) responsible for the installation and maintenance of he telephone exchanges in provincial region. From 1976 to 1977, he was a Research Assistant in the Department of Electronic and Elec- trical Engineering at the University of Strathclyde, U.K., working on Spectrum Analysis. From 1977 to 1979, he was a Research Fellow at the above department, working on Image Processing (SERC grant). From 1979 to 1985, he was a Lecturer at the Department of Electri- cal and Electronic Engineering, at the University of Newcastle-upon- Tyne, U.K. He then became an Associate Professor in the Department of Electrical Engineering, at the University of Patras, Greece. From 1989, he has been a Full Professor at the same department, where he directs the VLSI Design Laboratory. His recent research interests fo- cuses on VLSI Circuit Design, Low Voltage VLSI Design, Systems Design, Analysis and Design of Systems for Image Processing, and Neural Networks. He has published more than 70 papers in interna- tional journals and conferences. He has been awarded a large number of Research Contracts from ESPRIT, RACE and National Programs.