lut

4
Optimization of LUT for Memory-Based VLSI Structures Jerlin Ajith Davidson.C, II M.E., Applied Electronics. , PSN College of Engg and Technology.Tirunelveli. (Email id: [email protected]) Abstract - Recently, the systems antisymmetric product coding (APC) and odd-multiple-storage (OMS) techniques for lookup-table (LUT) design for memory-based multipliers to be used in digital signal processing applications were proposed. Each of these techniques results in the reduction of the LUT size by a factor of two. In this brief, we present a different form of APC and a modified OMS scheme, in order to combine them for efficient memory-based multiplication. The proposed combined approach provides a reduction in LUT size to one-fourth of the conventional LUT. I have also suggested a simple technique for selective sign reversal to be used in the proposed design. It is shown that the proposed LUT design for small input sizes can be used for efficient implementation of high- precision multiplication by input operand decomposition. It is found that the proposed LUT-based multiplier involves comparable area and time complexity for a word size of 8 bits, but for higher word sizes, it involves significantly less area and less multiplication time than the canonical-signed-digit (CSD)-based multipliers. I. INTRODUCTION Along with the progressive device scaling, semiconductor memory has become cheaper, faster, and more power-efficient. Moreover, according to the projections of the international technology roadmap for semiconductors, embedded memories will have dominating presence in the system on- chips (SoCs), which may exceed 90% of the total SoC content. It has also been found that the transistor packing density of memory components is not only higher but also increasing much faster than those of logic components. Apart from that, memory- based computing structures are more regular than the multiplyaccumulate structures and offer many other advantages, e.g., greater potential for high-throughput and low-latency implementation and less dynamic power consumption. Memory- based computing is well suited for many digital signal processing (DSP) algorithms, which involve multiplication with a fixed set of coefficients. A conventional lookup-table (LUT)based multiplier is shown in Fig. 1, where A is a fixed coefficient, and X is an input word to be multiplied with A. Assuming X to be a positive binary number of word length L, there can be 2L possible values of X, and accordingly, there can be 2L possible values of product C = A X. Therefore, for memory-based, multiplication, an LUT of 2L words, consisting of precomputed product values corresponding to all possible values of X, is conventionally used. Fig.1.Conventional LUT-based multiplier. Several architectures have been reported in the literature for memory-based implementation of DSP algorithms involving orthogonal transforms and digital filters.However, we do not find any significant work on LUT optimization for memory-based ICETT-2011 www.icett.com Page 1 of 4

Upload: ravicse533

Post on 11-Mar-2015

214 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: lut

Optimization of LUT for Memory-Based VLSI Structures

Jerlin Ajith Davidson.C, II M.E., Applied Electronics.,

PSN College of Engg and Technology.Tirunelveli.

(Email id: [email protected])

Abstract - Recently, the systems antisymmetricproduct coding (APC) and odd-multiple-storage(OMS) techniques for lookup-table (LUT) designfor memory-based multipliers to be used indigital signal processing applications wereproposed. Each of these techniques results in thereduction of the LUT size by a factor of two. Inthis brief, we present a different form of APCand a modified OMS scheme, in order tocombine them for efficient memory-basedmultiplication. The proposed combined approachprovides a reduction in LUT size to one-fourth ofthe conventional LUT. I have also suggested asimple technique for selective sign reversal to beused in the proposed design. It is shown that theproposed LUT design for small input sizes canbe used for efficient implementation of high-precision multiplication by input operanddecomposition. It is found that the proposedLUT-based multiplier involves comparable areaand time complexity for a word size of 8 bits, butfor higher word sizes, it involves significantlyless area and less multiplication time than thecanonical-signed-digit (CSD)-based multipliers.

I. INTRODUCTION

Along with the progressive device scaling,semiconductor memory has become cheaper,faster, and more power-efficient. Moreover,according to the projections of the internationaltechnology roadmap for semiconductors,embedded memories will have dominatingpresence in the system on- chips (SoCs), whichmay exceed 90% of the total SoC content. It hasalso been found that the transistor packingdensity of memory components is not onlyhigher but also increasing much faster than those

of logic components. Apart from that, memory-based computing structures are more regular thanthe multiply–accumulate structures and offermany other advantages, e.g., greater potential forhigh-throughput and low-latency implementationand less dynamic power consumption. Memory-based computing is well suited for many digitalsignal processing (DSP) algorithms, whichinvolve multiplication with a fixed set ofcoefficients.

A conventional lookup-table (LUT)basedmultiplier is shown in Fig. 1, where A is a fixedcoefficient, and X is an input word to bemultiplied with A. Assuming X to be a positivebinary number of word length L, there can be 2Lpossible values of X, and accordingly, there canbe 2L possible values of product C = A ・ X.Therefore, for memory-based, multiplication, anLUT of 2L words, consisting of precomputedproduct values corresponding to all possiblevalues of X, is conventionally used.

Fig.1.Conventional LUT-based multiplier.

Several architectures have been reportedin the literature for memory-basedimplementation of DSP algorithms involvingorthogonal transforms and digitalfilters.However, we do not find any significantwork on LUT optimization for memory-based

ICETT-2011

www.icett.com

Page 1 of 4

Page 2: lut

multiplication.I have presented a new approachto LUT design, where only the odd multiples ofthe fixed coefficient are required to be stored,which we have referred to as the odd-multiple-storage (OMS) scheme in this brief. In addition,we have shown that, by the antisymmetricproduct coding (APC) approach, the LUT sizecan also be reduced to half.

II. PROPOSED LUT OPTIMIZATIONS

I discuss here the proposed APCtechnique and its further optimization bycombining it with a modified form of OMS

A. APC for LUT Optimization

For simplicity of presentation, we assumebothX and A to be positive integers.2 The productwords for different values of X for L = 5 areshown in Table I. It may be observed in this tablethat the input word X on the first column of eachrow is the two’s complement of that on the thirdcolumn of the same row. In addition, the sum ofproduct values corresponding to these two inputvalues on the same row is 32A. Let the productvalues on the second and fourth columns of arow be u and v, respectively. Since one can writeu = [(u + v)/2 − (v − u)/2] and v = [(u + v)/2 + (v− u)/2], for (u + v) = 32A, we can have

u = 16A − [(v-u)/2] v = 16A + [(v-u)/2]

This behavior of the product words can be usedto reduce the LUT size, where, instead of storingu and v, only [(v − u)/2] is stored for a pair ofinput on a given row. The 4-bit LUT addressesand corresponding coded words are listed on thefifth and sixth columns of the table, respectively.Since the representation of the product is derivedfrom the antisymmetric behavior of the products,we can name it as antisymmetric product code.The desired product could be obtained by

Product word = 16A + (sign value) × (APCword)

The product value for X = (10000) correspondsto APC value “zero,” which could be derived byresetting the LUT output, instead of storing thatin the LUT.

B. Modified OMS for LUT Optimization

TABLE IOMS-BASED DESIGN OF LUT FOR L = 5

For the multiplication of any binary wordX of size L, with a fixed coefficient A, instead ofstoring all the 2L possible values of C = A ・ X,only (2L/2) words corresponding to the oddmultiples of A may be stored in the LUT, whileall the even multiples of A could be derived byleft-shift operations of one of those oddmultiples. Based on the above assumptions, theLUT for the multiplication of an L-bit input witha W-bit coefficient could be designed by thefollowing strategy.1) A memory unit of [(2L/2) + 1] words of (W +L)-bit width is used to store the product values,where the first (2L/2) words are odd multiples ofA, and the last word is zero.

ICETT-2011

www.icett.com

Page 2 of 4

Page 3: lut

2) A barrel shifter for producing a maximum of(L − 1) left shifts is used to derive all the evenmultiples of A.3) The L-bit input word is mapped to the (L − 1)-bit address of the LUT by an address encoder,and control bits for the barrel shifter are derivedby a control circuit.

The even multiples 2A, 4A, and 8A arederived by left-shift operations of A. Similarly,6A and 12A are derived by left shifting 3A, while10A and 14A are derived by left shifting 5A and7A, respectively. A barrel shifter for producingamaximum of three left shifts could be used toderive all theeven multiples of A.

III. IMPLEMENTATION OF LUT USINGTHE OPTIMIZATION SCHEME

In this section, we discuss theimplementation of the LUT-based multiplierusing the proposed scheme, where the LUT isoptimized by a combination of the proposed APCscheme and a modified OMS technique.

A. Implementation of the LUT Multiplier UsingAPC for L = 5

The structure and function of the LUT-based multiplier for L = 5 using the APCtechnique is shown in Fig. 2. It consists of a four-input LUT of 16 words to store the APC valuesof product words. Besides, it consists of anaddress-mapping circuit and an add/subtractcircuit. The address-mapping circuit generatesthe desired address. The address-mapping circuit,however, can be optimized to be realized bythree XOR gates, three AND gates, two ORgates, and a NOT gate.

B. Implementation of the Optimized LUT UsingModified OMS

The proposed APC–OMS combineddesign of the LUT for L = 5 and for any

coefficient width W is shown in Fig. 3. It consistsof an LUT of nine words of (W + 4)-bit width, afour-to-nine-line address decoder, a barrelshifter, an address generation circuit, and acontrol circuit for generating the RESET signaland control word (s1s0) for the barrel shifter.The precomputed values of A × (2i + 1) arestored as Pi, for i = 0, 1, 2, . . . , 7, at the eightconsecutive locations of the memory array, Thedecoder takes the 4-bit address from the addressgenerator and generates nine word-select signals,i.e., {wi, for 0 ≤ i ≤ 8}, to select the referencedword from the LUT. The 4-to-9-line decoder is asimple modification of 3-to-8-line decoder. Thecontrol bits s0 and s1 to be used by the barrelshifter to produce the desired number of shifts ofthe LUT output are generated by the controlcircuit, according to the corresponding relations.

Fig. 2. LUT-based multiplier for L = 5 using the APCtechnique.

ICETT-2011

www.icett.com

Page 3 of 4

Page 4: lut

Fig. 3. Proposed APC–OMS combined LUT design.

Fig. 4. (a) Four-to-nine-line address-decoder.(b) Control circuit for generation of s0, s1, and RESET.

The RESET signal can alternatively be generatedas (d3 AND x4). The control circuit to generatethe control word and RESET is shown in Fig.4(b). The address-generator circuit receives the5-bit input operand X and maps that onto the 4bit address word (d3d2d1d0).

IV. RESULTS AND CONCLUSION

The proposed LUT multipliers for wordsize L = W = 8,16, and 32 bits are coded inVHDL and synthesized in Xilinx ISE 9.1i.Simulation is done in Modelsim 5.8c, wheretheLUTs are implemented as arrays of constants,and additions are implemented by the Wallacetree and ripple carry array. The CSD-based

multipliers having the same addition schemes arealso synthesized with the same technologylibrary. In this brief, we have shown thepossibility of using LUTbased multipliers toimplement the constant multiplication for DSPapplications.

REFERENCES

[1]J.-I. Guo, C.-M. Liu, and C.-W. Jen, “The efficientmemory-based VLSI array design for DFT and DCT,”IEEE Trans. Circuits Syst. II, Analog Digit. SignalProcess.,Oct. 1992.

[2]H.-R. Lee, C.-W. Jen, and C.-M. Liu, “On the designautomation of the memory-based VLSI architectures forFIR filters,” IEEE Trans. Consum. Electron., Aug. 1993.[3] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T.Stouraitis, “A systolic array architecture for the discretesine transform,” IEEE Trans. Signal Process.,Sep. 2002.

[4]A. K. Sharma, Advanced Semiconductor Memories:Architectures,Designs, and Applications. Piscataway, NJ:IEEE Press, 2003.

[5] H.-C. Chen, J.-I. Guo, T.-S. Chang, and C.-W. Jen, “Amemory-efficient realization of cyclic convolution and itsapplication to discrete cosine transform,” IEEE Trans.Circuits Syst. Video Technol., Mar. 2005.

[6] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T.Stouraitis, “Systolic algorithms and a memory-baseddesign approach for a unified architecture for thecomputation of DCT/DST/IDCT/IDST,” IEEE Trans.Circuits Syst. I,Jun. 2005.

[7]P. K. Meher, “Systolic designs for DCT using a lowcomplexity concurrent convolutional formulation,” IEEETrans. Circuits Syst. Video Technol., Sep. 2006.

[8] P. K. Meher, “Memory-based hardware for resource-constrained digital signal processing systems,” Dec. 2007.

[9] P. K. Meher, “New approach to LUT implementationand accumulation for memory-based multiplication,”May2009.

[10] P. K. Meher, “New look-up-table optimizations formemory-based multiplication,” in Proc. ISIC, Dec. 2009,

ICETT-2011

www.icett.com

Page 4 of 4