[ieee 2009 norchip - trondheim, norway (2009.11.16-2009.11.17)] 2009 norchip - hardware...

Hardware implementation of mapper for faster-than-Nyquist signaling transmitter

Deepak Dasalukunte, Fredrik Rusek, Viktor Owall

Dept. of Electrical and Information Technology, Lund University, Lund, SE-22100, Sweden.

Email: {dde,fredrik.rusek,viktor.owall}@eit.lth.se

Karthik Ananthanarayanan, Murali Kandasamy

Dept. of Electronics and Communication Engineering, National Institute of Technology, Trichy, India.

Email: [email protected], [email protected]

Abstract—This paper presents the implementation of the mapper block in a faster-than-Nyquist (FTN) signaling transmitter. The architecture is Look-Up Table (LUT) based, and its complexity is reduced to a few adders and a buffer that stores intermediate results. Two flavors of the architecture have been designed and evaluated in this article: one uses a register based implementation of the buffer, and the other uses a Random Access Memory (RAM). The tradeoff between the two is throughput versus area. The register based implementation is fast, requiring only one clock cycle to complete the calculation (i.e., a read, calculate and write back) for every incoming FTN symbol. However, it becomes prohibitive when systems with a large number of sub-carriers (>64) are considered. The RAM based implementation provides a better solution in terms of area, with slightly lower throughput. The mapper has been targeted for both FPGA (Xilinx Virtex-II Pro) and ASIC (130nm standard cell CMOS) implementations. The design has been successfully tested on the FPGA and its output verified against the reference MATLAB model.

I. INTRODUCTION

Faster-than-Nyquist signaling is a technique for transmitting information at a rate higher than that allowed by Nyquist’s orthogonality limit [1] [2]. Systems employing this technique have been shown to achieve higher information rates at the cost of increased processing in the transmitter and the receiver. This increase in complexity is due to the interference that the information symbols introduce onto each other as they become non-orthogonal. This paper deals with the implementation of an FTN transmitter for OFDM based systems. Fig. 1 shows how the OFDM and the FTN symbols appear on the time-frequency grid. The separation between the OFDM symbols is such that they do not interfere with each other either in time or in frequency (𝑇Δ𝐹Δ = 1). FTN symbols are packed tighter, destroying orthogonality (𝑇Δ𝐹Δ < 1). It is evident from Fig. 1 that, for the same block size, the FTN technique can transmit more information symbols than the OFDM system. Furthermore, from a hardware implementation point of view, the FTN symbols cannot be transmitted as they are, because they can appear on fractional frequency sub-carriers or time instances. One approach is to represent each FTN symbol in terms of a chosen orthogonal basis, which also retains the IFFT as an effective implementation of multi-carrier modulation. The idea is to introduce minimal changes in the original transmitter so that FTN can be included as an add-on functionality. This can be achieved by the above mentioned approach of representing the FTN symbol as an interference pattern on the orthogonal basis. This approach is henceforth called mapping, as the FTN symbols are mapped onto a predefined set of orthogonal basis functions.

Fig. 1. Time-frequency grid showing FTN and OFDM symbols

A generic architecture of such an FTN mapper is shown in Fig. 2, where the resultant information to be transmitted on a particular frequency sub-carrier is calculated from the incoming FTN symbols. A detailed study of the choice of orthogonal basis, the choice of the number of projection coefficients in time and frequency (𝑁𝑡, 𝑁𝑓), and implementation approaches is presented in [3]. It was also shown there that a look-up table (LUT) approach is efficient in terms of hardware implementation of the mapper block. The LUTs store the pre-calculated projection coefficients of each FTN symbol onto the orthogonal basis. The LUT approach is attractive because, for a given time-frequency spacing of the FTN symbols with respect to the OFDM grid, the projection coefficients of the FTN symbols show a repetitive pattern in both time (𝑇𝑟𝑒𝑝) and frequency (𝐹𝑟𝑒𝑝). This repetitive pattern makes the size of the LUT small and fixed, independent of the number of sub-carriers or time instances at which the OFDM system is operating. The architectural concept of the mapper was developed in [3]; in this article the ideas are refined and taken all the way to hardware implementation.

978-1-4244-4311-6/09/$25.00 ©2009 IEEE

Fig. 2. A generic architecture of the mapper detailing the calculation of output on a single sub-carrier

Fig. 3. Block diagram of FTN transmitter

The block diagram of a complete FTN transmitter in an OFDM based system is shown in Fig. 3. It is a serially concatenated system with a (7,5) convolutional coder as the outer encoder and the FTN mapper as the inner encoder. The information is transmitted as blocks of data and, in principle, a larger block size (interleaver size) yields improved BER performance for a given SNR [4]. However, in practice this results in huge system latency and a large interleaver at the transmitter and receiver. Further, at the receiver, the decoding process cannot be initiated until the receiver has buffered the entire transmitted block. The complexity of iterative decoding also increases due to the large number of symbols being decoded. On the other hand, a small block size improves the speed of encoding and decoding but degrades performance. Hence, a block size in the range of 1000-1500 can be said to be a reasonable choice in the performance-complexity tradeoff. In this article the number of sub-carriers for the OFDM system is set to 𝑁 = 128 for 10 time instances, resulting in a block size of 1280 coded FTN symbols, i.e. 1280/2 = 640 uncoded information bits. The block size of 128 × 10 = 1280 was chosen to keep the hardware complexity reasonable without compromising performance.
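The block-size arithmetic above can be checked with a minimal Python sketch of a rate-1/2 (7,5) convolutional encoder. The sketch assumes the common reading of (7,5) as octal generator polynomials 111 and 101 with constraint length 3, a zero initial state and no tail bits; the function name `conv_encode_75` is hypothetical, and the interleaver and OQAM mapping are omitted. None of these details are specified in this paper.

```python
import random

def conv_encode_75(bits):
    """Rate-1/2 convolutional encoder with generators 7 and 5 (octal).

    Assumption: constraint length 3, zero initial state, no termination.
    """
    s1 = s2 = 0  # two-bit shift register holding the previous two input bits
    out = []
    for b in bits:
        out.append(b ^ s1 ^ s2)  # generator 7 = 111: taps on b, s1, s2
        out.append(b ^ s2)       # generator 5 = 101: taps on b, s2
        s1, s2 = b, s1           # shift the register
    return out

N_SUBCARRIERS = 128   # N in the text
TIME_INSTANCES = 10
BLOCK_SIZE = N_SUBCARRIERS * TIME_INSTANCES  # coded symbols per block

info_bits = [random.randint(0, 1) for _ in range(BLOCK_SIZE // 2)]
coded = conv_encode_75(info_bits)
print(len(info_bits), len(coded))  # 640 1280
```

Each information bit produces two coded bits, so 640 information bits fill exactly one 128 × 10 block.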

II. FTN MAPPER IMPLEMENTATION

The FTN mapper primarily consists of LUTs that store the pre-calculated values of the representations (also referred to as projection coefficients) of the FTN symbols. Each FTN symbol, when projected onto an orthogonal basis, can be represented by 𝑁𝑡 basis functions in time and 𝑁𝑓 basis functions in frequency. These 𝑁𝑡 × 𝑁𝑓 projections can be used to reconstruct the original FTN symbol. Thus, for an FTN system with time-frequency spacing 𝑇Δ𝐹Δ < 1, it is enough to store (𝑁𝑡 × 𝑁𝑓) × 𝐹𝑟𝑒𝑝 × 𝑇𝑟𝑒𝑝 values in the LUT, and this will satisfy the requirement of a system with any number of sub-carriers or time instances.
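As a rough illustration of why the LUT size is independent of the number of sub-carriers, the following Python sketch evaluates the storage formula above next to an 𝑁 × 𝑁𝑡 buffer. The values 𝑇𝑟𝑒𝑝 = 𝐹𝑟𝑒𝑝 = 2 are hypothetical examples, not figures from the paper, and the function names are illustrative only.

```python
def lut_entries(n_t, n_f, t_rep, f_rep):
    """Number of stored projection coefficients: (Nt x Nf) x Frep x Trep."""
    return (n_t * n_f) * f_rep * t_rep

def buffer_entries(n_subcarriers, n_t):
    """Entries in an N x Nt buffer, which does scale with N."""
    return n_subcarriers * n_t

# Nt = Nf = 3 as used in the paper; Trep = Frep = 2 is a hypothetical example.
N_T, N_F, T_REP, F_REP = 3, 3, 2, 2

for n in (64, 128, 1024):
    print(n, lut_entries(N_T, N_F, T_REP, F_REP), buffer_entries(n, N_T))
```

Under these assumed values the LUT holds 36 coefficients for every 𝑁, while the buffer grows linearly with 𝑁.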

Fig. 4 shows the block diagram of the implementation of the FTN mapper. The inputs to the mapper are Offset-QAM (OQAM) modulated symbols (±1). The FTN system can be configured for a particular time-frequency spacing (𝐹Δ, 𝑇Δ). Though these parameters can take any values between 0 and 1 (𝐹Δ = 1/𝑇Δ), only those that are rational fractions [5] are of interest, as they exhibit repetition, which is exploited in the LUT approach to store a finite number of projection coefficients. The coefficients for such 𝐹Δ, 𝑇Δ are pre-calculated and the LUT is generated to be used within the mapper. The input, 𝑇Δ𝐹Δ, to the FTN mapper is used by the configuration block to set the following system parameters:

∙ repetition of projection coefficients in time (𝑇𝑟𝑒𝑝) and frequency (𝐹𝑟𝑒𝑝)

∙ selecting the appropriate LUT for calculations

∙ generating adder array depending on the number of projection coefficients (𝑁𝑡 × 𝑁𝑓)

∙ generating buffer depending on sub-carriers (𝑁) and projection coefficients in time (𝑁𝑡).

The buffer size and the number of projection points (and in turn the adder array) are fixed at the time of synthesis through the 𝑁, 𝑁𝑡 and 𝑁𝑓 parameters. However, the time-frequency spacing of the FTN system can be changed at run time, but it has to remain fixed during the calculation of one complete block. A different choice of 𝑇Δ𝐹Δ means choosing the appropriate LUT and reconfiguring 𝑇𝑟𝑒𝑝 and 𝐹𝑟𝑒𝑝, which are automatically updated by the configuration block. The FSM that provides enable/control signals to the rest of the logic will adapt accordingly.
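The reason rational spacings lead to a finite LUT can be illustrated with a short Python sketch. It assumes that the repetition can be read as the period of the fractional offset of the FTN symbols relative to the orthogonal grid; the spacing 4/5 and the function names are hypothetical, and the actual derivation of 𝑇𝑟𝑒𝑝 and 𝐹𝑟𝑒𝑝 is given in [3] and [5], not reproduced here.

```python
from fractions import Fraction

def fractional_offsets(spacing, count):
    """Offsets of the first `count` FTN symbols relative to the orthogonal grid."""
    return [(k * spacing) % 1 for k in range(count)]

def repetition_period(spacing):
    """Smallest r > 0 such that r * spacing lands back on a grid point."""
    return Fraction(spacing).denominator

t_delta = Fraction(4, 5)  # hypothetical rational time spacing
offsets = fractional_offsets(t_delta, 10)
period = repetition_period(t_delta)
print(period)                                # 5
print([str(o) for o in offsets[:period]])    # ['0', '4/5', '3/5', '2/5', '1/5']
print(offsets[:period] == offsets[period:])  # True
```

With an irrational spacing the offsets would never repeat, so no finite set of pre-calculated coefficients would cover all symbols.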

1) Datapath controller: The datapath controller is a state machine that keeps track of the time instance and frequency sub-carrier of the incoming FTN symbols in order to generate an LUT address. The LUT values at this address correspond to the representation of the input FTN symbol in terms of the orthogonal basis. In other words, the LUT location stores the precalculated ISI/ICI pattern of the incoming FTN symbol. There are two modulo counters, counting up to 𝐹𝑟𝑒𝑝 and 𝑇𝑟𝑒𝑝, to keep track of the sub-carrier and time instance indices. The controller resets itself to the initial state to begin the transmission of a new block of data when it has reached the pre-defined block size.
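A behavioral Python sketch of such a controller is given below. It is only an illustration under stated assumptions: symbols are taken to arrive sub-carrier by sub-carrier within each time instance, the LUT address layout `t_cnt * f_rep + f_cnt` is assumed, and the class name and the values 𝐹𝑟𝑒𝑝 = 𝑇𝑟𝑒𝑝 = 2 are hypothetical.

```python
class DatapathController:
    """Behavioral model of the address-generating state machine."""

    def __init__(self, n_subcarriers, n_time, f_rep, t_rep):
        self.n_subcarriers = n_subcarriers
        self.block_size = n_subcarriers * n_time
        self.f_rep, self.t_rep = f_rep, t_rep
        self.reset()

    def reset(self):
        self.f_cnt = 0   # modulo-Frep counter (sub-carrier index)
        self.t_cnt = 0   # modulo-Trep counter (time-instance index)
        self.sub = 0     # position within the current time instance
        self.count = 0   # symbols seen in the current block

    def step(self):
        """Return the LUT address for the incoming symbol, then advance."""
        addr = self.t_cnt * self.f_rep + self.f_cnt  # assumed address layout
        self.count += 1
        self.sub += 1
        self.f_cnt = (self.f_cnt + 1) % self.f_rep
        if self.sub == self.n_subcarriers:           # time instance complete
            self.sub = 0
            self.f_cnt = 0
            self.t_cnt = (self.t_cnt + 1) % self.t_rep
        if self.count == self.block_size:            # block complete
            self.reset()
        return addr

ctrl = DatapathController(n_subcarriers=128, n_time=10, f_rep=2, t_rep=2)
addresses = [ctrl.step() for _ in range(1280)]
print(addresses[:4], addresses[128:132])  # [0, 1, 0, 1] [2, 3, 2, 3]
```

After 1280 calls the model is back in its initial state, mirroring the reset at the end of a block.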

A. Register based FTN mapper

Fig. 4. FTN mapper architecture (Register based).

Fig. 5. Timing diagram showing reading LUT, buffer contents; calculation and write back in register based architecture

The register based FTN mapper uses a bank of registers as the buffer (Fig. 4). The advantage of using registers in the buffer is that the calculation corresponding to each incoming FTN symbol can be completed within a single clock cycle. The LUT has been implemented as combinatorial logic and has a small delay when reading out the values. Summing the values read out from the LUT with the proper set of locations in the buffer is also combinatorial in nature. Hence it is only required to choose the registers, whose values are available at their outputs, to be added to the LUT values using the adder array. The result at the output of the adder array is thus ready by the next clock edge to be stored back into the registers. The write back of the result can be done by a simple enable signal on the register. The timing diagram in Fig. 5 shows the read, calculate and write back operations happening within one clock cycle.
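The read, calculate and write back step can be sketched behaviorally in Python, with one function call standing in for one clock cycle. The coefficient values, the small 𝑁, the centring of the coefficients on the incoming sub-carrier and the dropping of taps at the band edges are all assumptions made for illustration; they are not taken from the paper.

```python
N, N_T, N_F = 8, 3, 3  # small N for illustration; the paper uses N = 128

# Hypothetical LUT entry: N_F x N_T projection coefficients of one FTN symbol.
COEFFS = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]

def clock_cycle(registers, symbol, subcarrier, coeffs=COEFFS):
    """One clock cycle: read registers, add LUT values, write back.

    `symbol` is +1 or -1, so each product reduces to an add or a subtract.
    Assumption: coefficients are centred on `subcarrier`, and taps that fall
    outside the N sub-carriers are dropped.
    """
    for df in range(N_F):
        row = subcarrier + df - N_F // 2
        if 0 <= row < N:
            for t in range(N_T):
                # N_F * N_T = 9 additions, matching a 9-adder array
                registers[row][t] += symbol * coeffs[df][t]
    return registers

registers = [[0] * N_T for _ in range(N)]
clock_cycle(registers, +1, 0)
clock_cycle(registers, -1, 1)
print(registers[:3])  # [[1, 2, 1], [-1, -2, -1], [-1, -2, -1]]
```

The overlapping updates show how neighboring FTN symbols accumulate into the same buffer locations.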

Though this approach seems to be the preferred solution when throughput is of concern, it has to be noted that the multiplexer and demultiplexer between the adder array and the bank of registers depend on 𝑁, 𝑁𝑡 and 𝑁𝑓. In general, an (𝑁 × 𝑁𝑡) : (𝑁𝑓 × 𝑁𝑡) multiplexer from the register outputs to the adder array and an equal size demultiplexer from the adder output to the register input will be required. If 𝑁 = 128 and 𝑁𝑓 = 𝑁𝑡 = 3, then a 384 : 9 line multiplexer and a 9 : 384 line demultiplexer will be required. Since this amounts to a large amount of combinatorial resources and requires a significant amount of routing, this approach is not attractive for implementation, especially when 𝑁 > 64. Further details about the area resources of the register based FTN mapper are provided in the results section. In such a scenario, a RAM based approach is employed, as explained in the following section.
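The growth of the multiplexer with 𝑁 follows directly from the (𝑁 × 𝑁𝑡) : (𝑁𝑓 × 𝑁𝑡) expression. A small Python sketch, with a hypothetical helper name, makes the scaling explicit:

```python
def mux_lines(n_subcarriers, n_t, n_f):
    """Return (inputs, outputs) of the register-to-adder multiplexer."""
    return n_subcarriers * n_t, n_f * n_t

for n in (64, 128, 1024):
    inputs, outputs = mux_lines(n, n_t=3, n_f=3)
    print(f"N={n}: {inputs}:{outputs} mux, {outputs}:{inputs} demux")
```

For 𝑁 = 128 this reproduces the 384 : 9 and 9 : 384 figures quoted above; the number of input lines grows linearly with 𝑁 while the adder side stays fixed.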

B. RAM based FTN mapper

A considerable amount of the resources consumed for routing and multiplexing in the register based FTN mapper can be avoided by the use of RAMs, with some tradeoff in speed. The tradeoff arises because only one location of a RAM can be accessed at a time, unlike the register based approach in which any number of registers can be read by just connecting to their outputs. The RAM based architecture is shown in Fig. 6, where the RAMs are used as buffers. Each column in the original buffer (Fig. 4) is replaced by a RAM module with the same depth (𝑁) as the original buffer. Each RAM now stores one value per location, corresponding to a time instance, and hence 3 RAMs are required when 𝑁𝑡 = 3. One reason for using different RAMs is that, by doing so, 3 values can be read out simultaneously. This can also be done by having a single RAM that holds 3 values in one location. However, when it comes to shifting out the data, the RAM cannot be used for calculations while the partial result (that corresponding to the earliest time instance) is being shifted out. During this time the calculation for new FTN symbols has to be stalled. Further, since only a part of the contents of the RAM corresponds to the output, the remaining values have to be written back after re-formatting, resulting in many data transfer operations, which is inefficient in terms of power consumption. In summary, the use of a single RAM as a buffer leads to ‘process and wait’ situations for the FTN mapper and a lot of data rearrangement.

Fig. 7. Timing diagram for RAM based FTN mapper with and without pipelined adder

In order to have pipelined operation between the calculation stages, 3 separate RAMs, one corresponding to each column, are instantiated so that data can be read out of and written into the RAMs simultaneously. Further, to let the shifting out of the result and the calculation on newer incoming data happen in parallel, an extra RAM is instantiated. From Fig. 6 it can be seen that at any given time 3 RAMs are involved with the datapath controller to perform the calculation, while the fourth one contains the result of the most recent calculation, which needs to be passed on to the IFFT block. The RAM holding the result is handled by the shift-out-logic, which reads out the data, clears the contents and prepares the RAM to be used for the next set of calculations by the datapath controller. The RAMs involved in the calculation are active in a cyclic fashion, and only the outputs corresponding to the active RAMs are selected and read/written by the datapath controller, while the fourth RAM is left to the control of the shift-out-logic. In Fig. 6, the greyed out portion shows the RAMs currently used by the datapath controller to perform calculations; their input and output ports are connected to the adder array (shown by solid lines). The fourth RAM is not involved in the calculation of the outputs and is hence logically disconnected from the datapath controller (shown by dashed lines).
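The cyclic use of the four RAMs can be modeled in a few lines of Python. This is a behavioral sketch only: the class and method names are hypothetical, and the shift-out, which in hardware runs in parallel with the next calculations, is modeled as an instantaneous read and clear.

```python
from collections import deque

class RamBank:
    """Four RAMs: three active for calculation, one owned by the shift-out logic."""

    def __init__(self, depth, n_t=3):
        self.n_t = n_t
        self.rams = deque([[0] * depth for _ in range(n_t + 1)])

    def active(self):
        """RAMs currently connected to the adder array, earliest time first."""
        return list(self.rams)[:self.n_t]

    def advance_time(self):
        """Hand the earliest RAM to the shift-out logic and rotate the roles."""
        self.rams.rotate(-1)       # earliest active RAM becomes the fourth RAM
        done = self.rams[-1]
        result = list(done)        # shift-out logic reads the data ...
        done[:] = [0] * len(done)  # ... and clears the RAM for later reuse
        return result

bank = RamBank(depth=4)
bank.active()[0][2] += 5   # accumulate into the earliest time instance
bank.active()[1][0] -= 1   # and into the following one
out = bank.advance_time()
print(out)                 # [0, 0, 5, 0]
print(bank.active()[0])    # [-1, 0, 0, 0]
```

After the rotation, the previously cleared fourth RAM becomes the newest active RAM, so the calculation never has to wait for the shift-out.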

When it comes to arithmetic units, only 3 (𝑁𝑡) adders are now sufficient in the adder array, as only 3 values can be read out from the RAMs in a particular clock cycle. The LUT contents are therefore also modified to provide only 3 values at a time. This means that the datapath controller is slightly modified, i.e. the calculation for every FTN symbol is performed in 3 steps because of the limitation in reading from the RAM. Further, the one clock cycle read latency constrains the calculation to a total of 9 clock cycles per FTN symbol (3 memory locations × 3 clock cycles per location), although this can be reduced by the use of a pipelined adder. The two scenarios are shown in the timing diagram in Fig. 7: the first case is without a pipelined adder and requires 3 clock cycles per memory location (and hence a total of 9 clock cycles per FTN symbol), while the second one uses a pipelined adder and the total reduces to 5 clock cycles. The pipelined adder version of the RAM based approach was chosen for implementation, as the rate of calculation can be almost doubled with an additional pipeline stage at the adder outputs. Further, the RAM is also utilized effectively, as it can be accessed for a read/write during every cycle of operation, whereas idle states exist in the non-pipelined version.
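The cycle counts quoted above are consistent with a simple three-stage pipeline model, sketched here in Python. The breakdown into read, add and write back stages of one clock cycle each is an assumed reading of the timing diagram; only the totals of 9 and 5 cycles come from the text.

```python
def cycles_non_pipelined(locations, cycles_per_location):
    """Each memory location is processed to completion before the next starts."""
    return locations * cycles_per_location

def cycles_pipelined(locations, stages):
    """Pipeline fill formula: one new location enters the pipeline every cycle."""
    return locations + stages - 1

LOCATIONS = 3  # memory locations touched per FTN symbol
STAGES = 3     # read, add, write back (assumed one clock cycle each)

slow = cycles_non_pipelined(LOCATIONS, STAGES)
fast = cycles_pipelined(LOCATIONS, STAGES)
print(slow, fast, slow / fast)  # 9 5 1.8
```

The ratio of 1.8 corresponds to the statement that the rate of calculation is almost doubled.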

TABLE I
FTN MAPPER AREA COMPARISON ON XILINX FPGA

| Architecture | Logic Cells | RAMs (128 × 16) | No. of adders | 18 × 18 Multipliers |
|---|---|---|---|---|
| Register based | 16979 | 0 | 9 | - |
| RAM based | 1009 | 4 | 3 | 3 |
| 128 pt. IFFT core | 1712 | 7 | - | 9 |

III. RESULTS

The two flavors of the FTN mapper architecture have been implemented in VHDL and targeted for both FPGA and ASIC (UMC 130nm process using standard cell libraries from Faraday Technology) [6] [7]. This section provides the synthesis results for both FPGA and ASIC. The results refer to an OFDM system with 128 sub-carriers and 𝑁𝑡 = 𝑁𝑓 = 3 projection points for the FTN mapper. Table I provides the area figures for the design targeted for an FPGA (Virtex-II Pro). Apart from the resource utilization of the FTN mapper, the table also provides the resource usage of a 128-point IFFT used for MCM in the OFDM transmitter. This IFFT core is generated using the Xilinx CORE Generator™ [8]. This provides a comparison between the MCM block, one of the significantly large blocks in the transmitter, and the FTN mapper. It can be seen that the RAM based FTN mapper uses less than 60% of the logic cells of the IFFT block. The Block RAMs used and the actual arithmetic resources (adders/multipliers) are also about half of those of the IFFT. The FTN mapper has been successfully tested and verified on the FPGA. The outputs from the FPGA are compared with the reference MATLAB model. The LUT uses 12 bits for representing the values, with 11 bits for the fractional part. The output is 16 bits wide, while the input is just 1 bit, as it corresponds to the OQAM modulated symbols, i.e. ±1.
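The 12-bit LUT word with 11 fractional bits can be illustrated with a minimal fixed-point sketch in Python. It assumes a signed two's-complement format with saturation and round-to-nearest; the paper does not state the rounding or overflow behavior, and the coefficient value 0.3 and the function names are hypothetical.

```python
FRAC_BITS = 11
WORD_BITS = 12

def to_fixed(x, word_bits=WORD_BITS, frac_bits=FRAC_BITS):
    """Quantize x to a signed two's-complement integer with saturation (assumed)."""
    scaled = round(x * (1 << frac_bits))
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, scaled))

def to_float(q, frac_bits=FRAC_BITS):
    """Convert the stored integer back to its real value."""
    return q / (1 << frac_bits)

coeff = 0.3                # hypothetical projection coefficient
q = to_fixed(coeff)
print(q, to_float(q))      # 614 0.2998046875
print(to_fixed(1.0))       # 2047 (saturated; +1.0 is not representable)
```

Under this assumed format the representable range is [−1, 1 − 2⁻¹¹] with a resolution of 2⁻¹¹.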

The register based architecture is quite expensive in terms of resource usage and has to be avoided when the number of sub-carriers of the OFDM system is > 64. It also has to be noted that in a system with a large number of sub-carriers, even the IFFT block is not a direct mapped implementation, for reasons of area. With a time multiplexed design for the MCM block, the preceding FTN mapper can also employ a similar approach, saving hardware real estate. The RAM based implementation provides 𝑁 outputs from the mapper to the IFFT block at regular intervals. The architecture of the IFFT can be chosen to match the output rate of the FTN mapper so that the data is computed in a pipelined fashion.

Fig. 6. RAM based FTN mapper architecture

TABLE II
FTN MAPPER AREA COMPARISON IN 130NM STANDARD CMOS PROCESS

| Resource | Reg. based arch | RAM based arch |
|---|---|---|
| Total no. of std. cells | 20479 | 3936 |
| Buffer | 13451 | - |
| Adder array | 689 | 246 |
| Shift out logic | 4353 | 154 |
| Datapath controller | 1330 | 1703 |
| Look-up Tables | 631 | 1505 |
| Configuration block | 25 | 21 |
| No. of RAMs (128 × 16) | 0 | 4 |

Table II gives the comparison of the FTN mapper in 130nm technology for the two architectures. The table details the resources (in terms of standard cells) consumed by each block within the FTN mapper, and it can be seen that the buffer and the shift-out-logic in the register based version consume the most area. This is avoided by instantiating RAMs for the buffers, which also reduces the area occupied by the shift-out-logic, as it now just reads out the result values from one of the RAMs. Fig. 8 shows the final layout of the RAM based FTN mapper targeted for the 130nm standard CMOS process. It is evident from the figure that it is the memories that consume the most area, while the actual logic (LUT, controller, adder array) is significantly smaller. The design on the FPGA can run up to a maximum clock frequency of 51MHz, while it reports a speed of 330MHz in the 130nm process. The dynamic and leakage power for the ASIC implementation are 28.5 mW and 32.3 μW respectively.
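As a back-of-the-envelope illustration, the reported clock frequencies can be combined with the 5 clock cycles per FTN symbol of the pipelined RAM based mapper. The Python sketch below assumes that the quoted frequencies apply to that design and ignores any shift-out or control overhead, so the resulting rates are estimates rather than measured results.

```python
F_FPGA_HZ = 51e6        # reported maximum FPGA clock frequency
F_ASIC_HZ = 330e6       # reported clock frequency in the 130nm process
CYCLES_PER_SYMBOL = 5   # pipelined RAM based mapper (assumed to apply here)
BLOCK_SIZE = 1280       # FTN symbols per block

def symbol_rate(f_clk_hz, cycles_per_symbol=CYCLES_PER_SYMBOL):
    """Estimated FTN symbols processed per second."""
    return f_clk_hz / cycles_per_symbol

for name, f_clk in (("FPGA", F_FPGA_HZ), ("ASIC", F_ASIC_HZ)):
    rate = symbol_rate(f_clk)
    block_time_us = BLOCK_SIZE / rate * 1e6
    print(f"{name}: {rate / 1e6:.1f} Msymbols/s, {block_time_us:.1f} us per block")

# Standard cell counts from Table II; the RAM macros are not included.
area_ratio = 20479 / 3936
print(f"register based / RAM based std. cells: {area_ratio:.1f}x")
```

The standard cell ratio excludes the four RAM macros, so it overstates the total area saving of the RAM based version.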

IV. CONCLUSION

In this paper we have implemented the mapper for a faster-than-Nyquist signaling transmitter in hardware. The architecture has been refined to use RAMs instead of registers to save on hardware real estate. When the number of sub-carriers in the OFDM system becomes large (> 64), the register based implementation is prohibitive. In the case of the RAM based approach, it is mainly the memory size that scales with the number of sub-carriers, and hence it is attractive for implementation even for systems with thousands of sub-carriers. The compromise in the RAM based approach is the throughput, but this tradeoff is acceptable compared to the savings in area it brings. The results also support the original argument in [3] that the inclusion of the FTN technique into the OFDM transmitter is quite simple and can be included as an add-on processing block without any major changes.

Fig. 8. Layout of the RAM based architecture implemented in 130nm standard cell CMOS

REFERENCES

[1] J. E. Mazo, “Faster-than-Nyquist signaling,” Bell System Technical Journal, vol. 54, pp. 1451–1462, Oct 1975.

[2] F. Rusek and J. Anderson, “Multistream faster than Nyquist signaling,” Communications, IEEE Transactions on, vol. 57, no. 5, pp. 1329–1340, May 2009.

[3] D. Dasalukunte, F. Rusek, J. B. Anderson, and V. Owall, “A Transmitter Architecture for Faster-than-Nyquist Signaling Systems,” in Proc. IEEE Intl. Symp. Circuits and Syst., Taipei, Taiwan, May 2009.

[4] S. Benedetto, D. Divsalar, G. Montorsi, and F. Pollara, “Serial concatenation of interleaved codes: performance analysis, design, and iterative decoding,” Information Theory, IEEE Transactions on, vol. 44, no. 3, pp. 909–926, May 1998.

[5] F. Rusek, “Partial Response and Faster-than-Nyquist Signaling,” Ph.D. dissertation, Dept of Electrical and Information Technology, Lund Univ, 2007.

[6] “UMC 0.13𝜇𝑚 process.” [Online]. Available: http://www.umc.com/English/pdf/0.13DM.pdf

[7] “Faraday 0.13𝜇𝑚 libraries and IP.” [Online]. Available: http://freelibrary.faraday-tech.com/ips/013library.html

[8] “Xilinx CORE generator.” [Online]. Available: http://www.xilinx.com/tools/coregen.htm
