reconfigurable architectures for adaptable mobile systems

Reconfigurable Architectures for Adaptable MobileSystems(Invited Paper)

Gerard J.M. Smit, Gerard K. RauwerdaUniversity of Twente, department EEMCS

P.O. Box 217, 7500 AE Enschede, the [email protected]

Abstract— Mobile wireless terminals tend to becomemulti-mode wireless communication devices. Furthermore,these devices become adaptive. Heterogeneous reconfig-urable hardware provides the flexibility, performance andefficiency to enable the implementation of these devices.The implementation of a WCDMA and an OFDM receiverin the same coarse-grained reconfigurable MONTIUM pro-cessor is discussed.

Index Terms— Heterogeneous reconfigurable hardware,Software defined radio, SoC, MONTIUM , WidebandCDMA, OFDM

I. INTRODUCTION

Future wireless communications systems tend to be-come multi-mode, multi-functional devices. Adaptivitybecomes ever more important. These systems have toadapt to changing environmental conditions (e.g. moreor less users in a cell or varying noise figures due toreflections or user movements) as well as to changinguser demands (bandwidth, traffic patterns and QoS).When the system can adapt – at run-time – to theenvironment, significant savings in computational costscan be obtained [1], [2]. Furthermore, the hardwarearchitectures have to be extremely efficient as these areused in battery-operated terminals and cost effective asthey are used in consumer products.

Heterogeneous reconfigurable hardware offers the nec-essary flexibility for performing multiple wireless com-munication standards and can achieve the performancerequired by the wireless standards. Furthermore, thecombination of mixed-grained reconfigurable hardwareenables energy-efficient implementations of the wire-less standards. Much work has been done on SoftwareDefined Radio (SDR) in the SDR forum context [3].However this forum mainly focuses on general-purposeprocessors and they do not concentrate on reconfigurableplatforms and energy-efficiency.

As already stated in the introduction one of themain reasons for introducing reconfigurable hardwarein a wireless terminal is to support multiple wirelesscommunication standards. The support of multiple wire-less communication standards introduces a first level ofadaptivity in the wireless terminal because the terminalcan switch between wireless communication standards.For example when packet data transport is performedover UMTS and a WLAN hotspot becomes available theterminal can switch from UMTS to a WLAN standard.This is referred to asstandards leveladaptivity. Stan-dards level adaptivity has an impact on the digital signalprocessing in the wireless terminal because the wirelesscommunication standard defines the DSP functions thathave to be performed to implement the standard.

Although a wireless communication standard usuallydefines the DSP functions which have to be performedto implement the standard, it usually does not definethe algorithms that have to be used to implement thesefunctions. So the communication system can therefore‘adapt the algorithms’ that are used to implement aDSP function. ‘Adapt the algorithms’ means that thecommunication system selects an algorithm from a setof algorithms that implement the same DSP function.Therefore this second level of adaptivity is referred toasalgorithm-selection leveladaptivity.

For a specific algorithm, there are also opportuni-ties for adaptivity by changing parameters of the algo-rithm. This third level of adaptivity is calledalgorithm-parameter leveladaptivity [4].

An adaptive wireless terminal only makes sense whenit better suits the needs of a user at an acceptablecomplexity and efficiency compared to a traditionalnon-adaptive terminal. The standards level of adaptivityallows the terminal to adapt the communication standardthat is used to the Quality of Service (QoS) require-ments and the communication environment. Here the

term communication environmentis used to indicate theavailable wireless communication standards and wirelesschannel conditions at a certain location. The algorithm-selection level of adaptivity allows the terminal to selectthe algorithms that satisfy the QoS requirements in thegiven communication environment in the most efficientmanner. The algorithm-parameter level of adaptivity al-lows the terminal to do the same with the parameters ofa specific algorithm. So a reconfigurable and adaptiveterminal is able to track the QoS requirements andcommunication environment on a finer grained scale thana traditional non-adaptive terminal.

The complexity of an adaptive terminal is determinedby the complexity of the standards that it has to support,the complexity of the algorithms that are used to imple-ment the DSP functions of the standard and the com-plexity of the measurement and control overhead that isrequired to make the terminal adaptive. The complexityof the algorithms in the set of algorithms that is used foralgorithm-selection level adaptivity should therefore becentered around the complexity of the algorithm that isused in a traditional non-adaptive terminal. To minimizethe overhead of measuring QoS and channel conditions,measurements that are already required by the wirelesscommunication standard should be (re)used whereverpossible. The complexity of any additional measurementalgorithms should be kept low. The same is true forcontrol algorithms.

In this paper we discuss the implementation of wire-less communication systems in heterogeneous reconfig-urable hardware. The implementation of a flexible RAKEreceiver, used for UMTS communications, and the im-plementation of an OFDM receiver, used in HiperLAN/2,is studied to show the feasibility of implementing multi-mode communication systems using reconfigurable hard-ware.

II. RELATED WORK

So far most algorithmic level research on reconfig-urability in UMTS, as for example in theMuMoR [5]project, has focussed on multi-mode reconfigurabilityto enable Software Defined Radios (SDRs) supportingmultiple communication standards. TheEASYproject [6]aims at developing a power/cost efficient System-on-Chip (SoC) implementation of the HiperLAN/2 standard.In theAdaptive Wireless Networking (AWGN)project [4],however, reconfigurability will be used to allow the com-munication system to adapt to changing environmentalconditions.

Both academy and industry show interest in coarse-grained reconfigurable architectures. The Pleiadesproject at UC Berkeley focuses on an architectural tem-plate for ultra low-power high-performance multimediacomputing [7]. In the Pleiades architecture template ageneral-purpose microprocessor is surrounded by a het-erogeneous array of autonomous, special-purpose satel-lite processors. The Pleiades SoC design methodologyassumes a (very) specific algorithm domain. The extremeprocessor platform (XPP) of PACT is based on clustersof coarse-grained processing array elements (PAEs) [8].Actual PAEs are tailored to the algorithm domain of aparticular XPP processor. Silicon Hive [9] offers coarse-grained reconfigurable block accelerators (e.g. Avispaand Moustique) and stream accelerators (e.g. Bresca)for high performance and low power applications. Thearchitecture consists of VLIW-like datapath elements.

III. R ECONFIGURABLE HETEROGENEOUS

ARCHITECTURE

Heterogeneous reconfigurable systems might becomethe future of mobile hardware. The basic idea behindthe use of heterogeneous reconfigurable hardware isthat one can match the granularity of the DSP algo-rithms with the granularity of the hardware. For instancesome algorithms perform operations best on bit-levelwhile other perform best on word-level. Four typesof processing elements can be distinguished:general-purposeprocessor,fine-grained reconfigurablehardware,coarse-grained reconfigurablehardware anddedicatedhardware.

A. The Chameleon System-on-Chip

GPP

GPP

FPGA

FPGA

FPGA

FPGA

ASIC

ASIC

Montium

DSP

DSP

Montium

Montium

Montium

Montium

Montium

Fig. 1. The Chameleon SoC.

We propose a System-on-Chip (SoC), which consistsof the above mentioned processing elements. Figure 1shows the SoC, which we call the Chameleon SoC.The SoC contains a tiled architecture, where tiles canbe processing elements of different granularities. Thetiles in the SoC are interconnected by a Network-on-Chip (NoC). Both the SoC and NoC are dynamicallyreconfigurable, which means that the programs (running

on the reconfigurable processing elements) as well as thecommunication links between the processing elementsare defined at run-time [10].

The Chameleon SoC contains several general-purposetiles, i.e. GPP and DSP. The FPGA tiles in the SoCrepresent fine-grained processing elements. The coarse-grained processing elements in the Chameleon SoC areimplemented by MONTIUM tile processors. Some tiles inthe SoC are implemented as ASIC, which have dedicatedfunctionality.

B. The Montium Tile Processor

M01 M02

Communication and Configuration Unit

M03 M04 M05 M06 M07 M08 M09 M10

ALU5

A C DB

W

OUT2 OUT1

ALU4 E

A C DB

W

OUT2 OUT1

ALU3 E

A C DB

W

OUT2 OUT1

ALU2 E

A C DB

W

OUT2 OUT1

ALU1 E

A C DB

OUT2 OUT1

Sequencer

Memorydecoder

Crossbardecoder

Registerdecoder

ALUdecoder

Fig. 2. The MONTIUM Tile Processor.

The MONTIUM is an example of a coarse-grainedreconfigurable processor. The MONTIUM [11], [10] tar-gets the 16-bit digital signal processing (DSP) algorithmdomain. A single MONTIUM processing tile is depictedin Figure 2. At first glance the MONTIUM architecturebears a resemblance to a VLIW processor. However,the control structure of the MONTIUM is very different.For (energy-) efficiency it is imperative to minimize thecontrol overhead. This can be accomplished by staticallyscheduling instructions as much as possible at compiletime.

The lower part of Figure 2 shows the Communica-tion and Configuration Unit (CCU) and the upper partshows the reconfigurable Tile Processor (TP). The CCUimplements the interface for off-tile communication. Thedefinition of the off-tile interface depends on the NoCtechnology that is used in the SoC. The CCU enablesthe MONTIUM to run in ’streaming’ as well as in ’block’

mode. In ’streaming’ mode the CCU and the MONTIUM

run in parallel (communication and computation overlapin time). In ’block’ mode the CCU first reads a blockof data, then starts the MONTIUM, and finally aftercompletion of the MONTIUM the CCU sends the resultsto the next tile.

The TP is the computing part that can be configuredto implement a particular algorithm. Figure 2 revealsthat the hardware organization of the tile processor isvery regular. The five identical ALUs (ALU1· · · ALU5)in a tile can exploit spatial concurrency to enhanceperformance. This parallelism demands a very highmemory bandwidth, which is obtained by having 10local memories (M01· · · M10) in parallel. The smalllocal memories are also motivated by the locality ofreference principle. The data path has a width of 16-bitsand the ALUs support both signed integer and signedfixed-point arithmetic. The ALU input registers providean even more local level of storage. Locality of referenceis one of the guiding principles applied to obtain energy-efficiency in the MONTIUM. A vertical segment thatcontains one ALU together with its associated inputregister files, a part of the interconnect and two localmemories is called a Processing Part (PP). The fiveProcessing Parts together are called the Processing PartArray (PPA). A relatively simple sequencer controlsthe entire PPA. The sequencer selects configurable PPAinstructions that are stored in the decoders of Figure 2.

Each local SRAM is 16-bit wide and has a depthof 512 positions, which adds up to a storage capacityof 8 Kbit per local memory. A reconfigurable AddressGeneration Unit (AGU) accompanies each memory. It isalso possible to use the memory as a lookup table forcomplicated functions that cannot be calculated usingan ALU, such as sine or division (with one constant).A memory can be used for both integer and fixed-pointlookups.

Figure 3 shows the ALU that is used in the MONTIUM.A single ALU has four 16-bit inputs. Each input hasa private input register file that can store up to fouroperands. The input register file cannot be bypassed,i.e. an operand is always read from an input register.Input registers can be written by various sources via aflexible interconnect. An ALU has two 16-bit outputs,which are connected to the interconnect. The ALU isentirely combinational and consequentially there are nopipeline registers within the ALU. Neighbouring ALUscan also communicate directly on level 2. The West-output of an ALU connects to the East-input of the ALUneighbouring on the left. The East-West connection does

not introduce a delay or pipeline, as it is not registered.

functionunit 2

functionunit 1

functionunit 3

functionunit 4

multiplier

adder

in_A in_B in_C in_D

out_2 out_1

in_East

out_West

level 1

level 2

Fig. 3. The MONTIUM ALU.

IV. A PPLICATION DOMAIN

A. Software Defined Radio

Software Defined Radio (SDR) denotes wireless com-munication systems that are characterized by an analogfront-end followed by a programmable, digital basebandprocessing part. In the analog front-end, the radio signalis received, filtered and amplified. The filtered, amplifiedradio signal is converted to digital samples, which arethe input of the digital baseband processing part. Aprogrammable, digital baseband processing part enablesreprogramming of the functional modules that have tobe performed, like modulation/demodulation techniques.

A complete hardware based radio system (e.g. anASIC solution) has limited utility since parameters foreach of the functional modules are fixed. A radio systembuilt using SDR technology extends the utility of thesystem to a wide range of applications using differentlink-layer protocols and modulation/demodulation tech-niques. SDR provides an efficient and relatively inexpen-sive solution to the design of multi-mode, multi-band,multi-functional wireless devices that can be enhancedusing software upgrades only.

SDR-enabled devices (i.e. mobile terminals) can bedynamically programmed to reconfigure the characteris-tics of the device. So, the same hardware can be adaptedto perform different functions at different times.

Another advantage of the SDR template is the fact thatreal-adaptive systems can be implemented. Traditionalalgorithms in wireless communications are rather static.The recent emergence of new applications that requiresophisticated adaptive, dynamical algorithms based onsignal and channel statistics to achieve optimum perfor-mance has drawn renewed attention to run-time recon-figurability [12].

B. Wideband CDMA receiver

The Universal Mobile Telecommunications System(UMTS) standard, defined by ETSI, is an example of

a Third Generation (3G) mobile communication system.The communication system has an air interface thatis based on Code Division Multiple Access (CDMA).We will investigate the possibilities of implementingthe DSP functionality of a UMTS receiver in recon-figurable hardware. We only focus on the downlink ofthe UMTS receiver at the mobile terminal in the FDDmode, the most relevant UMTS properties are shown inTable I [13].

TABLE I

DOWNLINK UMTS PROPERTIES.

chip rate 3.84 Mega chips/sscrambling code length 38400 chipsspreading factor (SF) 4 – 512output symbol rate 7.5 – 960 kilo symbols/smodulation QPSK, 16-QAM

Figure 4 shows the baseband processing, performedin the W-CDMA receiver. Since multi-path fading is acommon phenomenon in wireless communication sys-tems, the receiver has to combat for the effects ofmulti-path fading. In the UMTS communication systemthe signals from the strongest multi-paths are receivedindividually. This means that the path searcher of thereceiver searches for the strongest received paths andestimates the path-delays. Whenever the delay of anindividual path is known, the receiver will performthe de-scrambling and de-spreading operations on thedelayed signal. The operations of de-scrambling andde-spreading are also denoted as a RAKE finger. Inthe Maximal Ratio Combiner (MRC) the received soft-values of the individual RAKE fingers are combined andindividually weighted to provide optimal Signal-to-NoiseRatio (SNR). The weighting factors of the individualRAKE fingers are determined by a channel estimator.The RAKE fingers in co-operation with the MRC arecalled RAKE receiver.

Pulse

shapingDe-mapper

De-scrambling De-spreading




Maxim

alR

atio

Com

bin

ing

samples bitssymbolsPulse

shapingDe-mapper





Maxim

alR

atio

Com

bin

ing

Pulse

shapingDe-mapper





Maxim

alR

atio

Com

bin

ing

samples bitssymbols

Fig. 4. W-CDMA baseband functions in the receiver.

C. OFDM receiver

HiperLAN/2 is a wireless local area network (WLAN)access technology and is similar to the IEEE 802.11aWLAN standard. HiperLAN/2 operates in the 5 GHzfrequency band and makes use of orthogonal frequencydivision multiplexing (OFDM) to transmit the analoguesignals. The bit rate of HiperLAN/2 at the physical leveldepends on the modulation type and is either 12, 24, 48or 72 Mbit/s.

The basic idea of OFDM is to transmit high data rateinformation by dividing the data into several parallel bitstreams, and let each one of these bit streams modulatea separate subcarrier. A HiperLAN/2 channel contains52 subcarriers and has a channel spacing of 20 MHz. 48subcarriers carry actual data and 4 carry pilots.

Prefixremoval

Freq. offsetcorrection

InverseOFDM

HiperLAN/2 receiver

EqualizationDe-

mappingPhase offsetcorrection

Fig. 5. HiperLAN/2 baseband functions in the receiver.

The receiver not only performs the inverse operationof the transmitter, it also has to correct for all thedistortions that are introduced in the wireless channel.Figure 5 depicts a model of the HiperLAN/2 receiver.In general, the model can be used for any OFDM-likesystem. The diffent standards for OFDM-like systems,e.g. HiperLAN/2, DAB, DRM, are generally different inthe number of carriers and the transmission bandwidth.Table II summarizes the OFDM properties for differentstandards.

TABLE II

PROPERTIES OF THE DIFFERENTOFDM STANDARDS.

Hiper DAB DRMLAN/2 I II III IV A B

Bandwidth [MHz] 20 1.54 1.54 1.54 1.54 0.012 0.012# carriers 52 1536 384 192 768 203 181Symbol time [µs] 4 1,246 312 156 623 26,667 26,667Frame time [ms] 2 96 24 24 48 400 400

The synchronization of the receiver is performed intwo steps. Firstly, coarse synchronization is performed inorder to synchronize the receiver with the frame. Duringcoarse-synchronization the received signal is correlatedwith known preambles, which indicate the start of aframe. Secondly, the prefix information of an OFDMsymbol is used for fine-synchronization. After fine-

synchronization, the prefix is removed from the OFDMsymbol.

Differences between the oscillator frequencies of thetransmitter and the receiver result in frequency offsetand cause inter-subcarrier interference. The HiperLAN/2receiver can compensate for frequency offset by multi-plying the data samples of an OFDM symbol with thefrequency offset correction coefficient. The frequencyoffset correction coefficient can be determined by usinginformation from the received preamble sections of theMAC frame.

The inverse OFDM part of the receiver converts thereceived signal into received subcarrier values. The re-ceived sub-carrier values may still suffer from distortionsthat need to be corrected before de-mapping them to abitstream.

The equalizer corrects the distortions caused by fre-quency selective fading. The coefficients for the equal-izer can be determined by using information from thereceived preamble sections of the MAC frame. Since thecoherence time of a HiperLAN/2 channel is about 20 msand a burst of a MAC frame has a duration of 2 ms, thecoefficients need to be determined only at the start ofthe MAC frame [14].

Based on the equalized pilot values, the phase dis-tortion of the received signal is corrected. The phasecorrection coefficient is determined using pilots.

The received complex-number samples will be trans-lated into an useful received bitstream. The de-mapfunction assumes that the most likely symbol that wastransmitted, was the symbol that maps to the valueclosest to the received value.

V. IMPLEMENTATION

Good development tools are essential for quick imple-mentation of applications in reconfigurable architectures.The intention of the MONTIUM development tools isto start with a high-level description of an application(in C/C++ or Matlab) and translate this descriptionto a MONTIUM configuration. The development toolscomprise, among other things, a stand-alone MONTIUM

simulator, a MONTIUM assembler and a MONTIUM

configuration editor [10]. Until now, majority of theapplications are implemented in the MONTIUM using theassembler and the configuration editor.

For functional simulations we use a co-simulationenvironment with Matlab and ModelSim, as depicted inFigure 6. Using this environment, one can simulate the(baseband) functionality of a complete communication

system in software (i.e. using Matlab) and partly inhardware (i.e. using ModelSim).

In Matlab code all functions of the system undersimulation are defined. The software part of the simulatoris furthermore responsible for the control mechanismsin the system under simulation. VHDL code describesthe functionality implemented in hardware. ModelSimuses this code to perform simulations of the functionalityin hardware. The software and hardware counterpartscooperate in one simulation with each other via filetransfers.

Fig. 6. The co-simulation environment.

A. Wideband CDMA receiver

The W-CDMA receiver has been implemented in het-erogeneous reconfigurable hardware. Since most base-band processing consists of multiply-accumulate (MAC)operations, the baseband processing of the receiver wasimplemented in coarse-grained reconfigurable hardware,in our case the MONTIUM. The scrambling code in thereceiver will be generated with simple combinationallogic, consisting of shift-registers and XOR gates. Theseare typical operations that can be performed in fine-grained reconfigurable hardware, like an FPGA. We as-sume that the control-oriented functionality is performedin the GPP and provides the right information to thebaseband processing part of the W-CDMA receiver.

In our design the pulse shape filter, which can beimplemented as a FIR filter, is implemented in oneMONTIUM tile. The output streams of the pulse shapefilter are the input for the RAKE receiver, which isimplemented in a second MONTIUM tile. Figure 7 showsthe functional blocks in the W-CDMA receiver that areimplemented in each processing tile.

The W-CDMA receiver runs in ’streaming’ mode. Thereceiver can process four individual paths of the receivedsignal. Consequently, the receiver requires four complex-number data streams for the four fingers. All fingers

Fig. 7. The RAKE receiver in heterogeneous processing tiles.

require the same scrambling code. The receiver takesthe complex-number scrambling code stream as an input.The spreading code is stored in local memory, becausethe code is relatively small with a maximum length of512 samples. Furthermore, the spreading code is as-signed to a particular user in the UMTS communicationsystem and, therefore, the spreading code will not changefrequently. The received symbols of the individual signalpaths – fingers – are combined, while each symbol isscaled with a complex-number coefficient. These coef-ficients are provided by the channel estimator, which isperformed on the GPP. The receiver outputs a bit streamwith the received data.

Figure 2 shows that the CCU is directly connectedto the global buses inside the MONTIUM. The CCUimplements the interface for off-tile communication andso it guarantees that during ’streaming’ mode the correctsignals are available for the MONTIUM tile. Figure 8depicts typical signal activity on the global buses insidethe MONTIUM during RAKE processing. The differentsignal streams, which are streamed from outside theMONTIUM, are indicated with characters (’A’ till ’J’) inFigure 8. The MONTIUM is able to process two RAKEfingers in parallel. The chips of two RAKE fingers canbe de-scrambled and de-spread in two clock cycles. Thetypical signal activity reveals the regular organization ofthe implemented receiver. First one chip of finger 1 andone of finger 2 are de-scrambled and de-spread, in thenext 2 clock cycles one chip of finger 3 and one offinger 4 are de-scrambled and de-spread. This typicalsequence of signal processing repeats till a completesymbol (consisting of SF chips) is de-scrambled and de-spread. The next 5 clock cycles are used for combiningthe results of the 4 fingers and de-mapping the symbolsto a bit stream. So, in total4×SF + 5 clock cycles areneeded to process one output symbol.

01 00 10 11 10 11

Clk

SIO 01 00 10 11 10 11

(1)

(2)

(3)

(4)

(5)

(6)

(7)

(8)

(9)

(10)

A

B

C

D

E

F

G

H

I

A - data finger 1B - data finger 2C - data finger 3D - data finger 4E - scrambling code

F - MRC coefficient finger 1G - MRC coefficient finger 2H - MRC coefficient finger 3I - MRC coefficient finger 4J - output bitstream

J

A C

B D

E

A

B

C

D

A C

B D

A

B

C

D

A C

B D

A

B

C

D

A C

B D

E

E

E

E

E

E

E

E

Fig. 8. Signal activity inside the MONTIUM on the global buses (1)· · · (10).

1) Configuration: The configuration size of theflexible RAKE receiver in the MONTIUM is only 858bytes. One tile can be configured for RAKE receivingin 429 clock cycles. For a configuration clock frequencyof 100 MHz this means that a RAKE receiver with 4fingers can be configured in 4.29µs.

In case the spreading factor changes, and so thespreading code, the new spreading code only has to beloaded in the local memory of the MONTIUM and a con-stant in the MONTIUM configuration has to be changed.Loading a particular spreading code and reconfiguringthe constant costsSF + 1 clock cycles.

The signal streams for the different fingers arebuffered in local memories inside the MONTIUM. Whenthe delay of one of the paths changes, then the bufferingstrategy in the local memories has to be changed. Thebuffering strategy of the memories is configured with 24bytes. These 24 bytes can be reconfigured in 12 clockcycles. Consequently, the RAKE receiver can update itscomplete path delay profile in 120 ns, assuming that theconfiguration clock frequency is 100 MHz.

The signal activity in Figure 8 shows that the signalprocessing of 4 RAKE fingers is very regular. The ideabehind the regular structure of the 4-RAKE receiver isthat it can be easily adapted to another configuration withfor instance less fingers. Suppose we want to change thereceiver to a 2 finger one, this means that finger 3 andfinger 4 are no longer needed. The CCU will thereforestall the streaming of stream ’C’ and ’D’ onto globalbuses 1 till 4 (Figure 8). So, the de-scrambling and de-spreading phase of finger 3 and finger 4 (data streams ’C’and ’D’) can be bypassed and the number of operationsin the combining phase can also be reduced. In total, forreconfiguring the number of fingers from 4 to 2, only24 bytes have to be reconfigured in the configuration

memory of the MONTIUM. Assuming that the clockfrequency of the processor tile during reconfiguration is100 MHz, the RAKE receiver can be reconfigured in 120ns, which corresponds to 12 clock cycles.

2) Dynamic Voltage and Frequency Scaling:Volt-age and frequency scaling is an important measure tocontrol the power consumption of embedded systems. InCMOS design, the power consumption depends quadrat-ically on the supply voltage and linearly on frequency.The main idea of dynamic voltage and frequency scalingis that the supply voltage should be kept as low aspossible. Besides, the maximum operating frequency istightly coupled to the supply voltage level. This meansthat by scaling the clock frequency of hardware, thesupply voltage can be scaled as well, resulting in aquadratic decrease of the power consumption.

From Figure 8 can be seen that the clock frequency ofthe MONTIUM during RAKE processing of 4 fingers isabout 4 times the chip rate. Moreover, when the RAKEreceiver is reconfigured to 2 finger processing, then theclock frequency of the MONTIUM can be reduced toabout 2 times the chip rate.

Using power estimation tooling, we estimated thedynamic power consumption of a typical multiply-accumulate operation in the MONTIUM to be about0.5 mW/MHz, realized in 0.13µm CMOS technology.Consequently, the power consumption of the imple-mented RAKE receiver will be 5 mW in 2-finger modeand 10 mW in 4-finger mode.

An efficient ASIC implementation of a W-CDMARAKE receiver was described in [15]. The receiverwas implemented in 0.13 CMOS technology. Accordingto [15], the power dissipation of the ASIC implemen-tation is about 1.5 mW, regardless whether 2 or 4RAKE fingers are implemented. When we compare the

power consumption of the ASIC implementation withthe MONTIUM implementation, we can conclude that thepower consumption of the MONTIUM is about 3 to 7times larger. As expected, the ASIC implementation ismore energy-efficient than an implementation in recon-figurable hardware, however, the ASIC implementationis fixed and the functionality of the ASIC can not bechanged.

B. HiperLAN/2 receiver

The HiperLAN/2 receiver has been implemented inthe same reconfigurable hardware. Figure 9 shows thefunctional blocks in the receiver that are implemented ineach processing tile. The synchronization part (prefix-removal) has still not been implemented. Neverthelessthe function, which contains many correlation opera-tions, can easily be implemented in the MONTIUM.

Fig. 9. The HiperLAN/2 receiver in heterogeneous processing tiles.

Irregular tasks, which are outside the algorithm do-main of the MONTIUM, are performed in software (i.e.on a GPP). The irregular processes in the HiperLAN/2receiver are the estimation of frequency offset and com-putation of equalization coefficients. These coefficientshave to be determined only once per MAC frame, i.e.once per 2 ms.

During frequency offset correction, which is per-formed in the MONTIUM tile, every complex-numbersample is multiplied with the frequency offset correctionfactor. The frequency offset is estimated in software bythe GPP once per MAC frame. One OFDM symbol, con-taining 64 complex-number samples, can be corrected in67 clock cycles.

A Fast Fourier Transform (FFT) on a vector of 64complex-number time samples can perform the inverse

OFDM function. Using the MONTIUM, the 64-FFT canbe performed in 204 clock cycles for one OFDM symbol.

The equalizer, phase offset correction and de-mappingfunctionality are implemented in one MONTIUM tile ina pipelined fashion. The coefficients for equalization aredetermined once every 2 ms in software by the GPP. Dur-ing equalization, the received carriers are multiplied withthe equalization coefficients. After equalization the pilotvalues are used to determine the phase offset correctionfactor. The phase offset correction factor is determinedin the MONTIUM, since the phase offset can vary forevery OFDM symbol and the correction factor has to bedetermined on an OFDM symbol basis (once every 4µs).Hence, determining the phase offset correction factor insoftware (i.e. GPP) would create large communicationoverhead between the GPP and the MONTIUM tile. Phaseoffset correction invokes also a complex multiplication,like equalization. As a consequence the equalizer andphase offset corrector use the same functionality of theMONTIUM. In a pipelined, parallel manner the correctedcomplex-number samples are translated into a bitstream.Hard-decision de-mapping is implemented with LUTfunctionality. A parametrizable de-mapper has been im-plemented, which can be used for QPSK, 16-QAM and64-QAM modulated signals by only changing the LUTtable in the memory of the MONTIUM.

TABLE III

PROPERTIES OF THEHIPERLAN/2 IMPLEMENTATION .

Frequency Equalizer,offset Inverse Phase offset,

correction OFDM De-mapper

Execution time [cycles] 67 204 110Communication time [cycles] 128 116 <100

Minimum system clock withstreaming communication [MHz] 17 51 28Minimum processor clockwith block communication 25 72 37(@ 100 MHz) [MHz]

Configuration size [bytes] 274 946 576Configuration time [cycles] 137 473 288

1) Configuration: The total configuration sizes ofthe MONTIUM are small for the different functions(Table III). The FFT implementation in the MONTIUM

requires the largest configuration size, which is less than1 Kbyte of data. The configuration data can be writteninto the configuration memory of the MONTIUM in about500 clock cycles, since 2 bytes are written in one clockcycle. Suppose that the MONTIUM is running at a clockfrequency of 100 MHz, then this MONTIUM tile canbe (re-)configured in 4.73µs. Notice that the maximumradio turn-around time of the HiperLAN/2 system is 6µs [16], so the implemented HiperLAN/2 receiver can

be considered as a real-time dynamically reconfigurablereceiver.

2) Frequency scaling:All operations in the phys-ical layer are performed on OFDM symbols. So, oneshould assure that each 4µs a new OFDM symbol can beprocessed. When a streaming on-chip network betweenthe processors is assumed, the communication time isnot a bottleneck and one only has to guarantee that,for example, the data processing for frequency offsetcorrection is performed during 67 clock cycles in 4µs.Hence, the minimum clock frequency of the MONTIUM

is 17 MHz, when a streaming on-chip network betweenthe tiles is assumed.

Typically, the clock frequency of the NoC will be fixedand only the clock frequency of the processing tiles canbe varied. When we assume the clock frequency of theNoC to be fixed at 100 MHz, then the clock frequencyof the MONTIUM for frequency offset correction has tobe at least 25 MHz (Table III).

VI. CONCLUSIONS

Because heterogeneous reconfigurable systems mightbecome the future of mobile hardware, we proposed aheterogeneous System-on-Chip (SoC) containing recon-figurable processing elements of different grain sizes.The processing elements in the SoC are dynamicallyinterconnected by a Network-on-Chip (NoC).

The MONTIUM architecture showed to have sufficientflexibility and processing capabilities for implementingnext generation wireless communication systems. Thefeasibility of using heterogeneous hardware is demon-strated by implementing a RAKE receiver and a Hiper-LAN/2 receiver.

The flexible RAKE receiver implements the basebandprocessing for receiving WCDMA signals. It is flexiblebecause the number of RAKE fingers can be adjustedin real-time. In less than 5µs a MONTIUM can beconfigured for RAKE procesing. One MONTIUM onlyhas to be partially reconfigured to change the number offingers in the RAKE receiver. Adjusting the number offingers from 4 to 2 only takes 120 ns; short enough toclassify as dynamic reconfiguration.

The same reconfigurable hardware can be configuredas a HiperLAN/2 receiver. The HiperLAN/2 receiver canbe implemented in four MONTIUM tiles. The perfor-mance requirements of the receiver can be met at fairlylow clock frequencies, with low configuration overhead.The MONTIUM tiles can be configured for HiperLAN/2baseband processing in less than 5µs.

ACKNOWLEDGEMENT

This research is supported by the EU-FP6 project 4S(Smart Chips for Smart Surroundings)(IST-001908) andthe Freeband Knowledge Impulse programme, a jointinitiative of the Dutch Ministry of Economic Affairs,knowledge institutions and industry.

REFERENCES

[1] Lodewijk T. Smit, Gerard J. M. Smit, and Johann L. Hurink.Energy-efficient Wireless Communication for Mobile Multime-dia Terminals. InProceedings of The International ConferenceOn Advances in Mobile Multimedia, pages 115–124, Jakarta,Indonesia, September 2003.

[2] Lodewijk T. Smit. Energy-Efficient Wireless Communication.PhD thesis, University of Twente, Enschede, the Netherlands,January 2004.

[3] SDR Forum.http://www.sdrforum.org.[4] Gerard Rauwerda, Jordy Potman, Fokke Hoeksema, and Gerard

Smit. Adaptive Wireless Networking. InProceedings of the 4thPROGRESS Symposium on Embedded System, pages 205–211,Nieuwegein, the Netherlands, October 2003.

[5] MuMoR project. http://www.mumor.org.[6] EASY project. http://easy.intranet.gr.[7] A. Abnous. Low-Power Domain-Specific Processors for Dig-

ital Signal Processing. PhD thesis, University of California,Berkeley, USA, 2001.

[8] V. Baumgarte, F. May, A. Nuckel, M. Vorbach, and M. Wein-hardt. PACT XPP – A Self-Reconfigurable Data ProcessingArchitecture. InProceedings Engineering of ReconfigurableSystems and Algorithms, pages 64–70, Las Vegas, Nevada,USA, June 2001.

[9] Silicon Hive. http://www.siliconhive.com.[10] Paul M. Heysters.Coarse-Grained Reconfigurable Processors

– Flexibility meets Efficiency. PhD thesis, University of Twente,Enschede, the Netherlands, September 2004.

[11] Paul M. Heysters, Gerard J. M. Smit, and Egbert Molenkamp.A Flexible and Energy-Efficient Coarse-Grained ReconfigurableArchitecture for Mobile Systems.Journal of Supercomputing,26(3):283–308, November 2003.

[12] Jordy Potman, Fokke Hoeksema, and Kees Slump. Tradeoffsbetween Spreading Factor, Symbol Constellation Size and RakeFingers in UMTS. InProceedings of PRORISC 2003, pages543–548, Veldhoven, the Netherlands, November 2003.

[13] H. Holma and A. Toskala.WCDMA for UMTS: Radio Accessfor Third Generation Mobile Communications. John Wiley &Sons, 2001.

[14] Anna Berno. Time and Frequency Synchronization Algorithmsfor HIPERLAN/2. Master’s thesis, University of Padova, Italy,October 2001.

[15] Max Nilsson. Efficient ASIC implementation of a WCDMARake Receiver. Master’s thesis, Lulea University of Technology,Sweden, April 2002.

[16] ETSI. Broadband Radio Access Networks (BRAN); HiperLANType 2; Data Link Control (DLC) Layer Part 1: Basic DataTransport Functions. ETSI TS 101 761-1 v1.1.1 (2000-04),April 2000.

reconfigurable architectures for adaptable mobile systems

Documents