

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 58, NO. 5, MAY 2011 1129

Interpolation-Based QR Decomposition and Channel Estimation Processor for MIMO-OFDM System

Po-Lin Chiu, Member, IEEE, Lin-Zheng Huang, Li-Wei Chai, and Yuan-Hao Huang, Member, IEEE

Abstract—This paper presents a modified interpolation-based QR decomposition algorithm for grouped-ordering multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) systems. Based on the original research that integrates the calculations of the frequency-domain channel estimation and the QR decomposition for the MIMO-OFDM system, this study proposes a modified algorithm with a scalable property that saves power in interpolation-based QR decomposition under the variable-rank MIMO scheme. Furthermore, we develop general equations and a timing scheduling method for the hardware design of the proposed QR decomposition processor for higher-dimension MIMO systems. Based on the proposed algorithm, a configurable interpolation-based QR decomposition and channel estimation processor was designed and implemented using a 90-nm one-poly nine-metal CMOS technology. The processor supports 2×2, 2×4, and 4×4 QR-based MIMO detection for the 3GPP-LTE MIMO-OFDM system and achieves a throughput of 35.16 MQRD/s at its maximum clock rate of 140.65 MHz.

Index Terms—Interpolation, multiple-input multiple-output (MIMO), QR decomposition (QRD).

I. INTRODUCTION

More and more ubiquitous applications of wireless communication in our daily lives have increased the demand for high-data-rate and high-quality wireless access. Thus, wideband communication techniques have been developed to increase the service quality. The adoption of orthogonal frequency division multiplexing with multiple-input multiple-output technology (MIMO-OFDM) promises a significant increase in data rate and spectral efficiency without bandwidth expansion. The spatial multiplexing (SM) technique conveys independent data streams simultaneously via different transmit antennas so as to increase the data rate [1]. The MIMO receiver therefore obtains a combined data stream that has suffered from wireless channel effects. Thus, a MIMO detector, which detects the transmitted data of each transmit antenna, plays an important role in the MIMO system.

Manuscript received June 17, 2010; revised September 09, 2010; accepted October 26, 2010. Date of publication December 17, 2010; date of current version April 27, 2011. This work was supported by ITRI, Hsinchu, Taiwan, R.O.C., under Grant 98-EC-17-A-05-01-0626. This paper was recommended by Associate Editor G. Sobelman.
Y.-H. Huang is with the Institute of Communications Engineering and Department of Electrical Engineering, National Tsing-Hua University, Hsinchu, Taiwan 30013, R.O.C. (e-mail: [email protected]).
P.-L. Chiu is with the Department of Communication Engineering, National Chiao-Tung University, Hsinchu, Taiwan 30010, R.O.C. and the ITRI, Hsinchu, Taiwan 31040, R.O.C. (e-mail: [email protected]).
L.-Z. Huang and L.-W. Chai are with the Department of Electrical Engineering, National Tsing-Hua University, Hsinchu, Taiwan 30013, R.O.C.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSI.2010.2092090
1549-8328/$26.00 © 2010 IEEE

Besides the extremely high-complexity maximum likelihood (ML) MIMO detector, several suboptimal MIMO detection algorithms have been proposed to improve the complexity and detection performance. The iterative detection schemes [2], [3] and the tree-search-based detection schemes [4], [5] are the two most popular kinds of algorithms. An iterative detection algorithm, such as successive interference cancellation (SIC), has lower complexity than ML and can be implemented easily. A tree-search-based detection algorithm, such as sphere decoding or the K-best algorithm, has near-ML performance, but its complexity is higher than that of the iterative detection algorithm. Both kinds of detection algorithms need QR decomposition as preprocessing to avoid the complicated pseudo-inverse computation of the channel matrix. The subsequent detection processes then become simpler.

Throughput and complexity are two implementation issues of the QR decomposition, and many researchers [6]–[12] have worked to improve its processing latency and hardware cost. However, most of them considered single-carrier QR decomposition, and these tone-by-tone QR decompositions cannot fulfill the real-time processing requirements of the multi-carrier MIMO system if the MIMO dimension is very high. The complexity of QR decomposition is $O(n^3)$ for an $n \times n$ matrix and grows proportionally with the FFT size in the OFDM system. Because the FFT size is usually very large and the MIMO detection must be performed on every subcarrier, the complexity of QR decomposition becomes tremendous in MIMO-OFDM systems. In order to support much higher data rates, the MIMO dimension shows an increasing trend in wireless communication systems. For example, IEEE 802.16e uses a 2×2 MIMO scheme [13], and a 4×4 MIMO scheme is supported by IEEE 802.11n [14] and 3GPP-LTE [15]. Furthermore, higher MIMO dimensions, such as 8×8, are discussed in the IEEE 802.16m [16] and 3GPP LTE-Advanced [17] standard committees. Thus, the increasing complexity of QR decomposition caused by high MIMO dimensions is an essential issue for next-generation wireless communication systems.

Accordingly, the interpolation-based QR decomposition (IQRD) algorithms [18], [19] were proposed to mitigate the complexity issue of QR decomposition in the MIMO-OFDM system. The interpolation-based QR decomposition algorithm efficiently combines the calculations of channel estimation in the OFDM system and the QR decomposition in the MIMO detection so as to greatly reduce the overall complexity. The algorithm performs QR decomposition only on the pilot subcarriers and obtains the mapped QR matrices of the data subcarriers by interpolation. Based on this generic idea, we proposed a scalable interpolation-based QR decomposition algorithm for the high-dimension MIMO-OFDM system [20]. The proposed algorithm has better stability than the Gram-Schmidt-based interpolation-based QR decomposition algorithm [18] because it does not need square root and division operations before the inverse mapping process. Moreover, in this paper, we establish a configurable hardware architecture to support various MIMO configurations according to the MIMO channel rank. In a reduced-rank MIMO channel, the proposed hardware architecture can save computation power through the scalability property of the proposed algorithm. We also develop a timing-schedule analysis algorithm based on the proposed hardware architecture so that the hardware architecture and timing schedule can be easily extended to higher-dimension MIMO schemes.

The remainder of this paper is organized as follows. The signal model and system specifications are defined in Section II. Section III introduces the traditional and the proposed interpolation-based QR decomposition algorithms. In Section IV, we present the timing schedule analysis method for the configurable hardware. Then, the proposed configurable hardware architecture and chip implementation results are shown in Section V. Finally, Section VI gives a brief conclusion.

II. SIGNAL MODEL AND SYSTEM SPECIFICATIONS

A. Signal Model

For the spatial multiplexing MIMO system with $N_T$ transmit antennas and $N_R$ receive antennas, where $N_R \ge N_T$, the data vector $\mathbf{x}$ is transmitted via the $N_T$ transmit antennas. The transmitted data is then affected by an $N_R \times N_T$ MIMO fading channel $\mathbf{H}$ and additive white Gaussian noise (AWGN) $\mathbf{n}$. Hence, the signal vector $\mathbf{y}$ is received at the receiver side, and the MIMO signal model can be expressed as

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}.$$

Generally, in the MIMO receiver, the channel information is estimated by the channel estimator. The MIMO detector then uses the received data and the estimated channel response to detect the transmitted data. Usually, the channel is assumed to be perfectly known in the MIMO detector to simplify the problem.

In the MIMO-OFDM system, as shown in Fig. 1, the signal bandwidth is divided into many subcarriers according to the FFT size. The system usually needs to perform one MIMO detection per subcarrier in each OFDM symbol; thus, the computational complexity is tremendously high. If the number of antennas increases, the computational complexity grows dramatically, but the available execution time is still restricted to one OFDM symbol. Therefore, for the QR-based MIMO detector, the QR decomposition and the MIMO detection are two essential implementation issues.
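To make the role of the QR preprocessing concrete, the following is a minimal sketch of QR-based successive interference cancellation on a single subcarrier. It is not the processor described in this paper: it uses a textbook classical Gram-Schmidt QR (with square root and division), unnormalized QPSK symbols ±1±1j, and the hypothetical helper names `qr_gram_schmidt`, `qpsk_slice`, and `qr_sic_detect`, all of which are assumptions made for illustration.

```python
import math


def qr_gram_schmidt(H):
    """Classical Gram-Schmidt QR of a complex matrix given as a list of rows.

    Returns (Q, R) with H = Q R, where Q has orthonormal columns and R is
    upper triangular. Assumes H has full column rank.
    """
    n_r, n_t = len(H), len(H[0])
    cols = [[H[i][k] for i in range(n_r)] for k in range(n_t)]
    q_cols = []
    R = [[0j] * n_t for _ in range(n_t)]
    for k in range(n_t):
        v = list(cols[k])
        for j in range(k):
            # Projection of column k onto the already-computed q_j.
            R[j][k] = sum(q.conjugate() * h for q, h in zip(q_cols[j], cols[k]))
            v = [vi - R[j][k] * qi for vi, qi in zip(v, q_cols[j])]
        norm = math.sqrt(sum(abs(vi) ** 2 for vi in v))
        R[k][k] = complex(norm)
        q_cols.append([vi / norm for vi in v])
    Q = [[q_cols[k][i] for k in range(n_t)] for i in range(n_r)]
    return Q, R


def qpsk_slice(z):
    """Nearest unnormalized QPSK point (+-1 +-1j); illustrative constellation."""
    return complex(1.0 if z.real >= 0 else -1.0, 1.0 if z.imag >= 0 else -1.0)


def qr_sic_detect(H, y):
    """Detect x from y = H x + n by rotating with Q^H and back-substituting."""
    Q, R = qr_gram_schmidt(H)
    n_r, n_t = len(H), len(H[0])
    # z = Q^H y = R x + Q^H n, so the last stream is interference-free.
    z = [sum(Q[i][k].conjugate() * y[i] for i in range(n_r)) for k in range(n_t)]
    x = [0j] * n_t
    for k in reversed(range(n_t)):
        residual = z[k] - sum(R[k][j] * x[j] for j in range(k + 1, n_t))
        x[k] = qpsk_slice(residual / R[k][k])
    return x
```

Because R is upper triangular, each stream is detected after the interference of the previously detected streams has been subtracted, which is why no pseudo-inverse of the channel matrix is needed.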

Fig. 1. Block diagram of the MIMO-OFDM system.

Fig. 2. Pilot locations for the first transmit antenna in the 3GPP-LTE system.

B. System Specification

In this paper, we follow the specification of 3GPP-LTE Release 8 [15] to perform the analysis, simulation, and implementation. 3GPP-LTE supports a maximum 2048-point FFT size at 20-MHz bandwidth and has three MIMO configurations: 2×2, 2×4, and 4×4. In order to facilitate the operation of these MIMO configurations, 3GPP-LTE specifies the pilot pattern within a unit called a resource block (RB), which contains twelve contiguous subcarriers. Fig. 2 shows an example of the pilot pattern in every RB of a subframe at the first antenna port; the location may be shifted in the frequency domain for different subframes. In the simulations of this paper, we use the reference channel models specified by 3GPP-LTE [21], including the Extended Vehicular A (EVA), Extended Pedestrian A (EPA), and Extended Typical Urban (ETU) channel models.

III. INTERPOLATION-BASED QR DECOMPOSITION

In general, the traditional tone-by-tone QR decomposition in a MIMO-OFDM system first estimates the channel matrices of all subcarriers by interpolating the MIMO channel matrices of the pilot subcarriers and then performs the QR decomposition on each subcarrier, as shown in Fig. 3(a). The computational complexity of this brute-force QR decomposition is extremely high for a large number of subcarriers and a high-dimension MIMO scheme. Thus, the interpolation method [18] was proposed to avoid computing the QR decomposition on every subcarrier so as to greatly reduce the overall complexity.
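As a rough illustration of where the saving comes from, the sketch below counts how many full QR decompositions each approach performs in one OFDM symbol. The figure of 1200 occupied subcarriers is an assumption made only for this example (the text states the 2048-point FFT size, not the number of occupied subcarriers); the pilot spacing of six subcarriers is the one quoted later for 3GPP-LTE. The interpolation-based approach still has to interpolate and inverse-map every data subcarrier, which this count does not include.

```python
def qrd_counts(num_subcarriers, pilot_interval):
    """Number of full QR decompositions per OFDM symbol for both approaches.

    num_subcarriers and pilot_interval are illustrative inputs; pilots are
    assumed to sit on every pilot_interval-th subcarrier starting at index 0.
    """
    num_pilots = len(range(0, num_subcarriers, pilot_interval))
    return {
        "tone_by_tone": num_subcarriers,    # one QRD on every subcarrier
        "interpolation_based": num_pilots,  # one QRD on every pilot only
    }


counts = qrd_counts(num_subcarriers=1200, pilot_interval=6)
print(counts)
```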


Fig. 3. (a) Traditional tone-by-tone QR decomposition algorithm. (b) Traditional interpolation-based QR decomposition algorithm in the MIMO-OFDM system.

A. Traditional Interpolation-Based QRD

The generic interpolation-based QR decomposition algorithm [18] is a Gram-Schmidt-based algorithm. The Gram-Schmidt orthogonalization is an iterative process expressed by

(1)

Here, $\mathbf{h}_k$ and $\mathbf{q}_k$ denote the $k$th column vectors of $\mathbf{H}$ and $\mathbf{Q}$, respectively, and $\mathbf{r}_k^T$ denotes the $k$th row vector of $\mathbf{R}$. The superscripts $H$ and $T$ represent the Hermitian transpose and the transpose of a matrix, respectively. The interpolation-based QR decomposition [18] verifies that only Laurent polynomial (LP) matrices can be interpolated. This algorithm is conceptually depicted in Fig. 3(b). Instead of estimating the channel matrices of all subcarriers, the algorithm only needs the channel matrices of the pilot subcarriers and computes their corresponding $\mathbf{Q}$ and $\mathbf{R}$ matrices. Because the $\mathbf{Q}$ and $\mathbf{R}$ matrices are not LP matrices, the polynomial interpolation technique cannot be applied directly. Therefore, an invertible mapping function was introduced to obtain mapped LP matrices; consequently, interpolation of the mapped $\mathbf{Q}$ and $\mathbf{R}$ matrices becomes applicable. Then, the $\mathbf{Q}$ and $\mathbf{R}$ matrices of the data subcarriers are obtained by inverse mapping from their interpolated mapped matrices. The mapping and inverse mapping functions are formulated by

(2)

(3)

and

B. Proposed Interpolation-Based QRD

The traditional Gram-Schmidt-based interpolation-based QR decomposition requires square root and division operations in (1) and in the mapping function. Thus, we try to further reduce these calculations by exploring (1) in a 2×2 MIMO system as an example. The 2×2 QR decomposition is written as

Then, according to (2), the mapped $\mathbf{Q}$ and $\mathbf{R}$ matrices become

(4)

and

(5)

We found that the entries of the mapped $\mathbf{Q}$ and $\mathbf{R}$ matrices contain no square root or division operations, and that all product terms in the mapped $\mathbf{Q}$ and $\mathbf{R}$ matrices can be found in the entries of the channel matrix and its Hermitian matrix.

This property exists in any MIMO dimension. Accordingly, we propose a modified interpolation-based QR decomposition (MIQRD) algorithm which computes the Hermitian matrix of the channel matrix before performing the QR decomposition. Then, the mapped $\mathbf{Q}$ and $\mathbf{R}$ matrices are computed directly from the entries of the channel matrix and its corresponding Hermitian matrix with only multiplications and additions. After the mapped $\mathbf{Q}$ and $\mathbf{R}$ matrices are calculated by this one-step processing, the subsequent processes, such as interpolation and inverse mapping, are the same as in the traditional interpolation-based QR decomposition algorithm. The proposed algorithm is illustrated in Fig. 4(a).

Furthermore, we extend the proposed algorithm to the 4×4 QR decomposition to show its scalable property. The proposed one-step processing is divided into several micro steps according to the number of transmit antennas, and these micro steps are performed sequentially as follows:

(6a)

(6b)

(6c)

(6d)

(6e)

These sequential processes follow an order similar to that of the Gram-Schmidt algorithm. The first column of the mapped $\mathbf{Q}$ and the first row of the mapped $\mathbf{R}$ are computed first. Then, the second column and the second row are computed by utilizing the results of the previous micro-step. The other columns and rows can be deduced by analogy.

Fig. 4. Proposed interpolation-based QR decomposition for (a) the same column order and (b) the grouped column order.

The proposed algorithm features a scalability property for different MIMO schemes. In a practical MIMO-OFDM system, the equivalent channel matrix may not be square because of a non-full-rank channel or the system deployment. For example, for a 4×3 channel matrix, the dimension of the corresponding Hermitian matrix is 3×3. Therefore, calculations for (6e) and or , where , in (6b)–(6e) can be eliminated. In summary of (6b)–(6e), for the scalable channel matrix and , – represent the calculations for , – for , and for , and for . Without loss of generality, if the mapped $\mathbf{Q}$ and $\mathbf{R}$ of a channel matrix are computed, the mapped $\mathbf{Q}$ and $\mathbf{R}$ matrices of all lower-dimension channel matrices can be obtained without extra calculations in the proposed algorithm. Thus, the proposed algorithm features a scalability property that saves hardware energy consumption when only limited antenna resources are assigned in the system.
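The 2×2 observation of this subsection can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's equations (4) to (6): it assumes the Hermitian matrix is $\mathbf{G} = \mathbf{H}^H\mathbf{H}$ and that the mapping of [18] scales column $k$ of $\mathbf{Q}$ and row $k$ of $\mathbf{R}$ by $\Delta_{k-1} r_{kk}$, with $\Delta_0 = 1$ and $\Delta_k = \Delta_{k-1} r_{kk}^2$. Under those assumptions the mapped quantities reduce to products and sums of entries of $\mathbf{H}$ and $\mathbf{G}$. The function names are hypothetical.

```python
import math


def gram(H):
    """Hermitian matrix G = H^H H for H given as a list of rows (assumed form)."""
    n_r, n_t = len(H), len(H[0])
    return [[sum(H[i][a].conjugate() * H[i][b] for i in range(n_r))
             for b in range(n_t)] for a in range(n_t)]


def mapped_qr_2x2(H):
    """Mapped Q (list of columns) and R (list of rows) for a two-column H,
    using only multiplications and additions."""
    G = gram(H)
    h1 = [row[0] for row in H]
    h2 = [row[1] for row in H]
    q1 = list(h1)                                          # r11 * q1 = h1
    q2 = [G[0][0] * b - G[0][1] * a for a, b in zip(h1, h2)]
    r1 = [G[0][0], G[0][1]]
    r2 = [0j, G[0][0] * G[1][1] - G[0][1] * G[1][0]]
    return [q1, q2], [r1, r2]


def mapped_qr_2x2_reference(H):
    """Same quantities via Gram-Schmidt (square root and division) followed
    by the assumed mapping; used only to cross-check mapped_qr_2x2."""
    h1 = [row[0] for row in H]
    h2 = [row[1] for row in H]
    r11 = math.sqrt(sum(abs(a) ** 2 for a in h1))
    u1 = [a / r11 for a in h1]
    r12 = sum(a.conjugate() * b for a, b in zip(u1, h2))
    v = [b - r12 * a for a, b in zip(u1, h2)]
    r22 = math.sqrt(sum(abs(c) ** 2 for c in v))
    u2 = [c / r22 for c in v]
    d1 = r11               # Delta_0 * r11 with Delta_0 = 1
    d2 = r11 ** 2 * r22    # Delta_1 * r22 with Delta_1 = r11^2
    q = [[d1 * a for a in u1], [d2 * c for c in u2]]
    r = [[d1 * r11, d1 * r12], [0j, d2 * r22]]
    return q, r
```

The first mapped column and row are available as soon as the first row of $\mathbf{G}$ is known, and the second pair reuses those values, which mirrors the micro-step ordering described above.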


Fig. 5. Hermitian matrix before and after exchanging order of and .

TABLE I
RELATIONSHIPS OF EXCHANGING AND

C. Application to Sorted QR Decomposition

In QR-based MIMO detection, sorted QR decomposition is usually applied to improve the detection performance. However, the traditional interpolation-based QR decomposition becomes inapplicable in this situation because the matrices at different pilot subcarriers have different column orders. In order to solve this ordering problem, we propose a grouped-ordering modified interpolation-based QR decomposition (GO-MIQRD) algorithm. We divide all subcarriers into several groups, and the subcarriers in one group have the same column order. Fig. 4(b) shows an example with two groups. The traditional algorithm must perform the QR decomposition of a pilot subcarrier at a group boundary, such as subcarrier 7, twice, once for each column order; thus, the computational complexity doubles.

Nevertheless, the proposed algorithm can obtain the mapped $\mathbf{Q}$ and $\mathbf{R}$ matrices of two different column orders by sharing the Hermitian matrix entries and performing a few additional computations. For example, if and of one channel matrix are exchanged to form the channel matrix of the other column order, all entries of the Hermitian matrix for the second column order are the same as those for the original column order except that their indices are exchanged, as shown in Fig. 5. If we compute (6b)–(6e) for the index-exchanged Hermitian matrix, the order-exchanged pertaining to the original can be calculated as listed in Table I. In this best case, only needs to be recomputed, and and share the same Hermitian matrix. Therefore, the computational complexity can be greatly reduced. Assume that there are groups in an OFDM symbol and pilots in each group. The GO-MIQRD algorithm is summarized in the following.

1) Set .
2) Determine the column order of group .
3) Set .
4) If , change the column order of the pilot in group to the column order of group , and go to step 6 by using the sharing property. Otherwise, go to the next step.
5) Calculate the Hermitian matrix of pilot .
6) Set .
7) Calculate the column of and the row of .
8) If , go to the next step. Otherwise, set and go back to step 7.
9) If , go to the next step. Otherwise, set and go back to step 5.
10) Interpolate the and of these pilots to obtain the and of the other subcarriers in group .
11) If , go to the next step. Otherwise, set and go back to step 2.
12) For each subcarrier, apply .

For an channel matrix, the complexity of computing the and matrices in the traditional algorithm [18], [22] is

where is the number of complex multiplications for computing the Givens-rotation-based QR decomposition, and denotes the cost of the mapping function . Furthermore, the complexity of the proposed algorithm for the same result is

The computational complexities of the traditional and the proposed algorithms for various MIMO configurations are compared in Fig. 6. The proposed algorithm has a lower complexity than the traditional algorithm, especially for the grouped-ordering scheme, and the amount of complexity reduction depends on the column order. In Fig. 6(b), although the worst-case complexity is higher than that of the traditional algorithm for 8×8 MIMO, the average cost still approximates that of the traditional algorithm. Most importantly, the proposed algorithm has much lower complexity than the traditional algorithm for lower-rank MIMO schemes due to its scalability property.
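The sharing property that GO-MIQRD relies on at a group boundary can be illustrated with a small sketch: exchanging two columns of the channel matrix only exchanges the corresponding indices of the Hermitian matrix, so no entry has to be recomputed. The code assumes the Hermitian matrix is $\mathbf{G} = \mathbf{H}^H\mathbf{H}$; the helper names `gram`, `swap_columns`, and `exchange_indices` are hypothetical and 0-based.

```python
def gram(H):
    """Hermitian matrix G = H^H H for H given as a list of rows (assumed form)."""
    n_r, n_t = len(H), len(H[0])
    return [[sum(H[i][a].conjugate() * H[i][b] for i in range(n_r))
             for b in range(n_t)] for a in range(n_t)]


def swap_columns(H, a, b):
    """Return a copy of H with columns a and b exchanged."""
    out = [list(row) for row in H]
    for row in out:
        row[a], row[b] = row[b], row[a]
    return out


def exchange_indices(G, a, b):
    """Re-index an existing Hermitian matrix for the exchanged column order.

    Only lookups are performed; no multiplication or addition is needed.
    """
    n = len(G)
    perm = list(range(n))
    perm[a], perm[b] = perm[b], perm[a]
    return [[G[perm[i]][perm[j]] for j in range(n)] for i in range(n)]


H = [[1 + 1j, 2 - 1j, 0.5j],
     [3 + 0j, -1j, 1 + 2j],
     [0.2 - 0.4j, 1 + 1j, -2 + 0j],
     [1j, 1 + 0j, 0.3 + 0.3j]]
G_original = gram(H)
G_reused = exchange_indices(G_original, 0, 2)    # reuse, no recomputation
G_recomputed = gram(swap_columns(H, 0, 2))       # brute-force reference
```

Here `G_reused` and `G_recomputed` hold the same entries, so a pilot at a group boundary only needs its Hermitian matrix computed once for both column orders.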

D. High-Dimensional Extension

As described in Section III-B, the proposed algorithm can be extended to any MIMO dimension. For the proposed one-step process, the straightforward method expands (1) and (2) in the same way as (4) and (5). Then, each micro step of the one-step process, such as (6b) and (6c), can be determined from these deduced forms. However, in a higher-dimension MIMO system, the equation expansion becomes very complicated, and the micro steps of the proposed one-step algorithm are difficult to derive correctly. Thus, we further present an iterative algorithm that can efficiently derive the micro steps and make the proposed algorithm more useful.

Fig. 6. Computational complexity comparisons of the traditional method and the proposed one-step method for (a) four receiving antennas and (b) eight receiving antennas.

As mentioned in Section III-B, the down-scalability describes that and for , , etc., can be derived from and for an channel matrix. Conversely, up-scalability should also be applicable, such that and for , , etc., can also be derived from and for an channel matrix. By investigating the results for 2×2, 3×3, and 4×4 in (6b)–(6e), we found that includes and terms, includes and terms, and includes and terms. Therefore, we deduce that for 5×5 QR matrices should include and terms, and so on. According to these rules, we can derive a generalized method to compute the and matrices for any MIMO dimension.

Definition III.1: For a set , where , there exists a function

which is defined as

where means element is excluded in the set .

Definition III.2: For an channel matrix, the entries of and are defined as

where and .

According to Definition III.1 and Definition III.2, (6b)–(6e) can be summarized as

Similarly, all entries of the and for other MIMO dimensions can be derived quickly by these two definitions.

E. Interpolation Scheme

The and matrices of the data subcarriers are generated by interpolation and then delivered to the subsequent MIMO detector after inverse mapping. Hence, the accuracy of the interpolated and affects the final detection performance. In order to choose a proper interpolation scheme, we evaluate the detection performance by utilizing the MCS-QR-SIC MIMO detector [23] under the ETU channel model, because the ETU model has the longest tap delay among the 3GPP-LTE channel models. In the simulation, we used 1024 × 1000 data symbols and evaluated the (4,4,1,1) and (16,1,1,1) cases in the 4×4 MCS-QR-SIC MIMO detector for the QPSK and 16QAM modulations, respectively. The pilot subcarrier interval in 3GPP-LTE is six subcarriers, as illustrated in Fig. 2. In Fig. 7, we can see that there is a large performance gap between the perfect channel case and the interpolated channel case, especially for the high-order constellation, even for the traditional channel estimation method.

Fig. 7. SER performance results by utilizing different interpolation intervals and MCS-QR-SIC MIMO detectors for the (a) QPSK (4,4,1,1) case and (b) 16QAM (16,1,1,1) case.

Because the ETU channel is very frequency selective, shortening the interpolation interval is a possible method to improve the accuracy of interpolation. In Fig. 7, the symbol error rate (SER) results for different interpolation intervals show that the detection performance improves with a shorter interpolation interval. Thus, we can perform time-domain interpolation first to generate additional pseudo-pilot channels, and then the pilot subcarrier interval in the frequency-domain interpolation can be shortened from six to three subcarriers. The simulation shows that the performance degradation is small with the linear polynomial interpolation scheme when is around 16 dB and 20 dB for the QPSK and 16QAM modulations, respectively.

Fig. 8. Basic processing elements (a) norm and (b) MAC.
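A minimal sketch of entry-wise linear interpolation between two pilot subcarriers is given below, for the shortened interval of three subcarriers mentioned above. It is only an illustration: the function name `interpolate_linear`, the list-of-rows matrix layout, and the sample values are assumptions, and the inverse mapping that must follow the interpolation is not shown.

```python
def interpolate_linear(pilot_a, pilot_b, interval):
    """Entry-wise linear interpolation between two complex matrices.

    pilot_a and pilot_b are the (mapped) matrices at two pilots that are
    `interval` subcarriers apart. Returns interval + 1 matrices, for
    subcarrier offsets 0 .. interval, the first and last being the pilots.
    """
    matrices = []
    for offset in range(interval + 1):
        w = offset / interval  # weight of the right-hand pilot
        matrices.append([[(1 - w) * a + w * b for a, b in zip(row_a, row_b)]
                         for row_a, row_b in zip(pilot_a, pilot_b)])
    return matrices


# Illustrative 1x2 "matrices" at two pilots three subcarriers apart.
left = [[0j, 2 + 0j]]
right = [[3 + 3j, 2 + 0j]]
between = interpolate_linear(left, right, interval=3)
```

Halving the interval from six to three subcarriers halves the distance over which the straight-line approximation has to hold, which is the motivation given above for generating pseudo-pilots in the time domain first.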

IV. TIMING SCHEDULE ANALYSIS

The proposed one-step process [see (6a)–(6e)] has two special properties. The first property is that all computations are composed of addition and multiplication operations. Thus, two basic hardware elements, norm and multiplication-and-accumulation (MAC), are proposed to implement these operations, as shown in Fig. 8. The norm element calculates the inner product and the MAC element accumulates the products. According to the complexity analysis in Section III-C, many norm and MAC elements are required to realize a parallel implementation, especially in the high-dimension MIMO system. The second property is that data dependencies exist between the micro-steps in the one-step process. Therefore, we must carefully partition the algorithm into several subparallel computations on the norm and MAC elements. The most important design issue is to balance the tradeoff between the hardware complexity and the computation cycle count with an appropriate subparallel hardware architecture and an efficient timing schedule of all operations.

We use norms and MACs, that is complex multipliers for receive antennas, to approach the cycle bound, so that the hardware efficiency can approach 100%. As the increases, the design complexity grows enormously because there is a large number of norm and MAC combinations, and scheduling their operations for the optimal product, where is the hardware complexity defined as the number of complex multipliers and is the computational cycle count, becomes very difficult.

Therefore, we propose a method to obtain the optimal timing schedule with a scheduling program. First, in order to apply the scalability property to the timing schedule, we construct a dependency tree to determine the output order of the entries of and, as shown in Fig. 9. Then, we can further expand the tree for according to Definition III.1 and Definition III.2. Fig. 10 shows the expanded tree structure for . We can find that all computations in the one-step process can be covered by all possible , where , and . Thus, we use an -bit binary variable to represent the . For each bit, “1” means


Fig. 9. Output order tree of and for 4 × 4 channel matrix.

Fig. 10. Dependency tree of for 4 × 4 channel matrix.

the existence of the corresponding , and “0” represents its absence. For example, “0101” represents {1, 3} and “1110” represents {2, 3, 4}. Then, a dependency tree can be built up based on the data dependency of the micro-steps. For a given combination of the norms and MACs, we can traverse the dependency tree to schedule the operations of the proposed algorithm on the norms and MACs. A node can be scheduled only after all of its child nodes have been scheduled. Note that because the computation of the Hermitian matrix has the highest priority, we schedule dedicated norm elements to perform the Hermitian matrix calculations to reduce the computational cycle count. After the Hermitian matrix computation is finished, the norm elements are shared by the other calculations. The MAC elements are mostly scheduled for vectors because each of them occupies

Fig. 11. (a) Timing schedule results for different MIMO configurations. (b) Scheduled smallest product values for 12 × 12 channel matrix under different timing constraints.

Fig. 12. Proposed MIQRD hardware architecture.

complex multipliers over several cycles. We ran the scheduling program with different combinations of norms and MACs based on these rules, and obtained the scheduled cycle count versus the hardware complexity ( multipliers) for various MIMO configurations, as shown in Fig. 11(a). If the one-step processing module must meet the throughput requirement of the system, we can use Fig. 11(a) to obtain the minimum hardware complexity under the constrained cycle time. On the other hand, the product versus timing constraint is plotted in Fig. 11(b). If the energy cost is the design constraint, the optimal hardware combination can be determined by choosing the smallest timing constraint that approximates the lower bound, that is, 200 cycles for the example of the 12 × 12 channel matrix in Fig. 11(b).
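The scheduling procedure described above can be illustrated with a minimal Python sketch. It is not the authors' scheduling program: the task graph `TASKS`/`DEPS` is a toy example, every operation is assumed to take one cycle, each norm or MAC element is assumed to cost one complex multiplier, and the priority given to the Hermitian matrix computation is not modeled. The helper `mask_to_set` decodes the bit-string encoding from the text, with the rightmost bit taken as index 1, which matches the two examples given.

```python
from itertools import product


def mask_to_set(bits: str) -> set:
    """Decode e.g. "0101" into {1, 3}; the rightmost bit is index 1."""
    return {i + 1 for i, b in enumerate(reversed(bits)) if b == "1"}


def list_schedule(tasks: dict, deps: dict, units: dict) -> dict:
    """Greedy list scheduling, one cycle per task (an assumption).

    tasks: {name: unit_type}, deps: {name: set of child tasks},
    units: {unit_type: number of available elements}.
    Returns {name: cycle in which the task is issued}, cycles start at 1.
    """
    done = {}
    cycle = 0
    while len(done) < len(tasks):
        cycle += 1
        free = dict(units)
        # A task is ready once all of its child nodes finished earlier.
        ready = [t for t in tasks
                 if t not in done and all(c in done for c in deps.get(t, ()))]
        issued = 0
        for t in ready:
            if free.get(tasks[t], 0) > 0:
                free[tasks[t]] -= 1
                done[t] = cycle
                issued += 1
        if issued == 0:
            raise ValueError("nothing can be issued: missing unit or cyclic dependency")
    return done


# Hypothetical toy dependency tree: two norm operations feed three MAC operations.
TASKS = {"n1": "norm", "n2": "norm", "m1": "mac", "m2": "mac", "m3": "mac"}
DEPS = {"m1": {"n1"}, "m2": {"n1", "n2"}, "m3": {"m1", "m2"}}


def cycle_count(n_norm: int, n_mac: int) -> int:
    """Scheduled cycle count T for a given norm/MAC combination."""
    schedule = list_schedule(TASKS, DEPS, {"norm": n_norm, "mac": n_mac})
    return max(schedule.values())


def sweep(max_units: int = 2) -> dict:
    """Map (norms, macs) to (C, T, C*T), with C = norms + macs (assumed cost)."""
    out = {}
    for n, m in product(range(1, max_units + 1), repeat=2):
        t = cycle_count(n, m)
        c = n + m
        out[(n, m)] = (c, t, c * t)
    return out
```

Sweeping the combinations and picking the entry with the smallest product, or the cheapest entry whose cycle count meets a deadline, mirrors how Fig. 11(a) and Fig. 11(b) are used in the text.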

V. HARDWARE ARCHITECTURE AND CIRCUIT IMPLEMENTATION

Based on the proposed interpolation-based QR decomposition algorithm and the timing schedule program, we designed and implemented the QR decomposition processor for the 3GPP-LTE MIMO-OFDM system. The proposed QR decomposition processor consists of one-step processing, interpolation, and inverse mapping, as shown in Fig. 12.

A. One-Step Processing

In the system, the pilot channel matrix is updated every three cycles of the system clock and four or vectors must be


Fig. 13. Computation cycle counts for different combinations of the norm and the MAC in a 4 × 4 MIMO system.

Fig. 14. Timing schedule of 4 × 4 one-step with two norms and four MACs.

Fig. 15. Block diagram of one-step process with two norms and four MACs.

generated within four cycles for a 4 × 4 channel matrix. Therefore, in order to generate the and for three data subcarriers, the processing time constraint for the one-step processing module is twelve cycles of the operating clock. Then, two norms and four MACs are selected in the one-step processing module according to Fig. 13, and the final timing schedule can be derived from the scheduling program, as shown in Fig. 14. For 2 × 2 and 4 × 2 channel matrices, one-step processing requires only four cycles, so a doubled clock rate is sufficient. The top-level block diagram of the proposed one-step process is illustrated in Fig. 15. The memory stores five temporary complex values and 4 × 4 complex matrices, including , , , and the Hermitian matrix.
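The relation between the cycle budget and the clock rate can be checked with a short calculation. The sketch below assumes that the one-step module must finish within the three system-clock cycles between pilot updates and that the operating clock is an integer multiple of the system clock; the function name `min_clock_multiplier` is illustrative only.

```python
import math


def min_clock_multiplier(one_step_cycles: int, system_cycles_per_pilot: int = 3) -> int:
    """Smallest integer clock multiple that fits the one-step cycle count
    into the interval between two pilot updates."""
    return math.ceil(one_step_cycles / system_cycles_per_pilot)


# 4 x 4 case: 3 subcarriers x 4 cycles each = 12 operating-clock cycles.
cycles_4x4 = 3 * 4
```

Under these assumptions, twelve cycles lead to a quadruple clock rate, and the four cycles quoted for the 2 × 2 and 4 × 2 cases lead to a doubled clock rate, in line with the figures in the text.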

B. Interpolation

In 3GPP-LTE, the locations of pilot subcarriers in an OFDM symbol may be shifted in the frequency domain for different subframes, and therefore the pilots may not be located at the edge of

Fig. 16. Three possible pilot locations at the resource block boundary.

Fig. 17. (a) Interpolation architecture of and and (b) the linear real-valued interpolator.

resource block. Three possible locations of the pilot and pseudo-pilot (estimated by time-domain interpolation) are summarized in Fig. 16. In 3GPP-LTE, the user equipment (UE) only receives the data located in the assigned resource blocks. In case 2 and case 3, the first pilot is not located in the resource block. In order to design a regular hardware architecture for the interpolation module, the first pilot channel is extrapolated by the channel estimator.

The architecture of the and interpolation is shown in Fig. 17(a), in which linear real-valued interpolators are utilized, as shown in Fig. 17(b). Because the proposed MIQRD is designed to output one column of and one row of per cycle, it requires fifteen real-valued interpolators, eight for and seven for calculations. The memories after the interpolators store the results for inverse mapping. If the pilot is a pseudo-pilot, it needs to be de-mapped, so its result must also be stored. In addition, for 4 × 2 and 2 × 2 channel matrices, additional memories are required to wait for the processing of inverse mapping since the processing time of the one-step module is shorter than that of the 4 × 4 case.
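A linear real-valued interpolator of the kind mentioned above can be sketched in a few lines of Python. The pilot spacing of three subcarriers and the linear extrapolation of the missing first pilot are assumptions made for illustration; the text does not specify how the channel estimator extrapolates. One reading consistent with the count of fifteen is that a column of four complex entries needs eight real interpolators, while a row with one real diagonal entry and three complex entries needs seven.

```python
def interpolate_between_pilots(p0: float, p1: float, spacing: int = 3) -> list:
    """Values at subcarrier offsets 0 .. spacing-1 on the line from p0 to p1."""
    step = (p1 - p0) / spacing
    return [p0 + k * step for k in range(spacing)]


def extrapolate_first_pilot(p1: float, p2: float) -> float:
    """Linear extrapolation one pilot spacing before p1 (assumed method)."""
    return 2 * p1 - p2


def interpolate_complex(c0: complex, c1: complex, spacing: int = 3) -> list:
    """A complex entry uses two real interpolators, one per component."""
    re = interpolate_between_pilots(c0.real, c1.real, spacing)
    im = interpolate_between_pilots(c0.imag, c1.imag, spacing)
    return [complex(a, b) for a, b in zip(re, im)]
```

For example, `interpolate_between_pilots(0.0, 3.0)` yields the values at the pilot subcarrier and the two subcarriers that follow it.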

C. Inverse Mapping

Fig. 18(a) shows the block diagram of the inverse mapping module, which is composed of the demapping factor calculation and the demapper. In order to simplify the clock design,


Fig. 18. Block diagram of (a) the inverse mapping, (b) the demapping factor calculation, and (c) the parallel multiplications of demapper.

Fig. 19. Fixed-point simulation of MIQRD.

the inverse mapping hardware also operates at a quadruple frequency of the system clock. According to the word-length determined from the fixed-point simulation and the critical time

Fig. 20. SER performances of fixed-point simulations for different word-lengths of and .

of the synthesis result, both the square root and the division are implemented with a two-stage pipelined architecture to meet the timing constraint, as depicted in Fig. 18(b). Then, 15 real-valued multipliers are used to de-map one column of and one row of to the corresponding and , as shown in Fig. 18(c).
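To make the role of the square root, the division, and the real-valued multiplications concrete, the following Python sketch shows one possible mapping and inverse mapping. It assumes the mapping commonly described for interpolation-based QRD, in which column k of Q and row k of R are both scaled by the real factor Δ_{k-1}·r_kk with Δ_k = r_11²···r_kk²; this is an assumption for illustration and not necessarily the exact modified mapping of this paper. Under that assumption the k-th diagonal entry of the mapped R equals Δ_k, so the demapping factor is one over the square root of the product of two consecutive mapped diagonal entries.

```python
import math


def forward_map(q_cols: list, r_rows: list):
    """Scale column k of Q and row k of R by delta_{k-1} * r_kk (assumed mapping)."""
    delta_prev = 1.0
    qt, rt = [], []
    for k, (q, r) in enumerate(zip(q_cols, r_rows)):
        gain = delta_prev * r[k]          # r[k] is the real diagonal entry r_kk
        qt.append([gain * x for x in q])
        rt.append([gain * x for x in r])
        delta_prev *= r[k] * r[k]         # delta_k = delta_{k-1} * r_kk^2
    return qt, rt


def inverse_map(qt: list, rt: list):
    """Recover Q and R with one sqrt, one division and real multiplies per k."""
    q_cols, r_rows = [], []
    prev_diag = 1.0                       # plays the role of delta_0
    for k, (q, r) in enumerate(zip(qt, rt)):
        factor = 1.0 / math.sqrt(prev_diag * r[k])   # demapping factor
        q_cols.append([factor * x for x in q])
        r_rows.append([factor * x for x in r])
        prev_diag = r[k]
    return q_cols, r_rows


# Toy 2 x 2 real example: Q is the identity, R is upper triangular.
Q_COLS = [[1.0, 0.0], [0.0, 1.0]]
R_ROWS = [[2.0, 1.0], [0.0, 3.0]]
QT, RT = forward_map(Q_COLS, R_ROWS)
Q_BACK, R_BACK = inverse_map(QT, RT)
```

With complex entries, one column (four complex values) and one row (one real and three complex values) would need fifteen real multiplications by the same real factor, which is consistent with the multiplier count quoted in the text.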

D. Fixed-Point Simulation

We use the symbol error rate (SER) as the metric to determine the word-length of the signal in the fixed-point simulation. Because the clipping error has a larger impact on the performance of the proposed architecture than the truncation error, we determine the word-length of the integer part of the signal first, and then that of the fraction part. A fixed-point 2’s complement number is represented by (I, F), in which I and F denote the word-lengths of the integer part and the fraction part, respectively. According to the fixed-point simulation in Fig. 19, the word-length of the signals is chosen to achieve with QPSK and 16QAM modulations. Fig. 20 shows the SER performances of fixed-point simulations for different word-lengths of and in a 4 × 4 16QAM system. Table II lists the word-lengths of the signals in the three modules. The word-length for the and is larger than those of the others and dominates the operating frequency because the dynamic range of grows with the dimension of the channel matrix. The processing time schedules of the whole MIQRD processor for different channel matrices are shown in Fig. 21. The latencies for the and cases are 27 cycles and 15 cycles, respectively, to produce one column/row vector per cycle at the MIQRD output.
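The (I, F) representation and the two error sources can be illustrated with a small Python model. It is a sketch under stated assumptions: the sign bit is counted inside the I integer bits, out-of-range values saturate rather than wrap, and truncation drops the low-order bits (rounding toward minus infinity). The function name `to_fixed` is illustrative.

```python
import math


def to_fixed(x: float, int_bits: int, frac_bits: int) -> float:
    """Quantize x to a 2's complement (I, F) value, returned as a float."""
    scale = 1 << frac_bits
    raw = math.floor(x * scale)                    # truncation error comes from here
    lo = -(1 << (int_bits + frac_bits - 1))        # most negative code
    hi = (1 << (int_bits + frac_bits - 1)) - 1     # most positive code
    raw = max(lo, min(hi, raw))                    # clipping error comes from here
    return raw / scale
```

With (I, F) = (3, 2), the value 1.3 is truncated to 1.25, an error bounded by one least significant bit, whereas 10.0 is clipped to 3.75, an error that grows with the signal amplitude. This asymmetry is one way to see why the integer word-length is fixed before the fraction word-length.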

E. Circuit Implementation

The proposed interpolation-based QR decomposition processor was designed and implemented using UMC 90-nm one-poly nine-metal CMOS technology. The processor occupies with core area. Fig. 22 shows the micrograph of the chip. The function of the chip was verified and the performance was measured using a digital test station. The chip consumes 49 mW at its maximum frequency of 140.65 MHz. The comparison of the implementation results of the proposed and other QR decomposition processors is summarized in


Fig. 21. Processing time of each module in MIQRD for (a) channel matrix and (b) channel matrix.

TABLE II WORD-LENGTHS OF THE MODULES IN THE PROPOSED MIQRD ARCHITECTURE

Table III. Since the proposed interpolation-based QR decomposition processor also executes the task of channel estimation, it requires a larger gate count than the other works. However, the throughput of the proposed chip is 35.16 MQRD/s, and the normalized throughput of our chip is much higher than those of other works based on the tone-by-tone sole QR decomposition algorithm. Although the raw gate count is very large in our design, the gate efficiency, either including or excluding the channel estimation, is not the worst compared with the others in the literature. Therefore, this architecture is suitable for designing a high-throughput QR decomposition processor for a MIMO-OFDM receiver.
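The reported throughput can be related to the clock rate with a short calculation. The sketch below assumes that the processor delivers one QR decomposition per system-clock cycle and that the 140.65 MHz operating clock is four times the system clock; the energy per decomposition is a derived figure, not one stated in the text.

```python
MAX_CLOCK_HZ = 140.65e6      # maximum operating clock reported for the chip
CLOCK_MULTIPLIER = 4         # assumed ratio of operating clock to system clock
POWER_W = 49e-3              # power consumption at the maximum clock

# One QRD per system-clock cycle (assumption).
throughput_qrd_per_s = MAX_CLOCK_HZ / CLOCK_MULTIPLIER
throughput_mqrd_per_s = throughput_qrd_per_s / 1e6

# Derived energy per QR decomposition in nanojoules.
energy_per_qrd_nj = POWER_W / throughput_qrd_per_s * 1e9
```

Under these assumptions the throughput evaluates to about 35.16 MQRD/s, matching the reported value, and the energy is roughly 1.4 nJ per decomposition.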

VI. CONCLUSION

This paper proposes a scalable interpolation-based QR decomposition algorithm and a grouped-ordering scheme for the

Fig. 22. Chip micrograph.

MIMO-OFDM system. The derived general equations and the proposed timing scheduling method facilitate the architecture design of the proposed algorithm. A configurable interpolation-based QR decomposition processor is also presented in this work. The processor supports 2 × 2, 4 × 2, and 4 × 4 channel matrices to meet the requirements of next-generation wireless communication systems. Moreover, it achieves much higher data throughput while balancing the hardware cost, which makes it well suited to low-complexity MIMO-OFDM systems.


TABLE III COMPARISON RESULTS

ACKNOWLEDGMENT

The authors would like to thank the Chip Implementation Center of National Applied Research Laboratories in Taiwan for technical support.

REFERENCES

[1] G. J. Foschini, “Layered space-time architecture for wireless communication in a fading environment when using multiple antennas,” Bell Lab. Tech. J., vol. 1, no. 2, pp. 41–59, 1996.

[2] P. W. Wolniansky, G. J. Foschini, G. D. Golden, and R. A. Valenzuela, “V-BLAST: An architecture for realizing very high data rates over the rich-scattering wireless channel,” in Proc. URSI Int. Symp. Signals, Systems, and Electronics, Sep. 1998, pp. 295–300.

[3] R. Böhnke, D. Wübben, V. Kühn, and K. D. Kammeyer, “Reduced complexity MMSE detection for BLAST architectures,” Proc. Globecom ’03, vol. 4, pp. 2258–2262, Dec. 2003.

[4] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, “Closest point search in lattices,” IEEE Trans. Inform. Theory, vol. 48, no. 8, pp. 2201–2214, Aug. 2002.

[5] K. W. Wong, C. Y. Tsui, R. S. K. Cheng, and W. H. Mow, “A VLSI architecture of a K-best lattice decoding algorithm for MIMO channels,” in Proc. IEEE Int. Symp. Circuits and Systems, May 2002, vol. 3, pp. 273–276.

[6] P. Luethi, A. Burg, S. Haene, D. Perels, N. Felber, and W. Fichtner, “VLSI implementation of a high-speed iterative sorted MMSE QR decomposition,” in Proc. IEEE Int. Symp. Circuits and Systems, May 2007, pp. 1421–1424.

[7] C. K. Singh, S. H. Prasad, and P. T. Balsara, “VLSI architecture for matrix inversion using modified Gram-Schmidt based QR decomposition,” in Proc. 20th Int. Conf. VLSI Design (Held Jointly with 6th Int. Conf. on Embedded Systems), Jan. 2007, pp. 836–841.

[8] P. Salmela, A. Burian, H. Sorokin, and J. Takala, “Complex-valued QR decomposition implementation for MIMO receivers,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Mar. 2008, pp. 1433–1436.

[9] K. H. Lin, R. C. H. Chang, C. L. Huang, F. C. Chen, and S. C. Lin, “Implementation of QR decomposition for MIMO-OFDM detection systems,” in Proc. 15th IEEE Int. Conf. Electronics, Circuits and Systems, Aug. 2008, pp. 57–60.

[10] P. Luethi, C. Studer, S. Duetsch, E. Zgraggen, H. Kaeslin, N. Felber, and W. Fichtner, “Gram-Schmidt-based QR decomposition for MIMO detection: VLSI implementation and comparison,” in Proc. IEEE Asia Pacific Conf. Circuits and Systems, Nov. 2008, pp. 830–833.

[11] D. Patel, M. Shabany, and P. G. Gulak, “A low-complexity high-speed QR decomposition implementation for MIMO receivers,” in Proc. IEEE Int. Symp. Circuits and Systems, May 2009, pp. 33–36.

[12] R. C. H. Chang, C. H. Lin, K. H. Lin, C. L. Huang, and F. C. Chen, “Iterative QR decomposition architecture using the modified Gram-Schmidt algorithm for MIMO systems,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 5, pp. 1–8, May 2010.

[13] IEEE 802.16e-2005-Amendment 2: Physical and Medium Access Control Layers for Combined Fixed and Mobile Operation in Licensed Bands, IEEE Std. 802.16, 2006.

[14] IEEE 802.11n-2009-Amendment 5: Enhancements for Higher Throughput, IEEE Std. 802.11, 2009.

[15] 3GPP TS 36.211-Physical Channels and Modulation, 3GPP Technical Specification, Rev. 8.9.0 2009.

[16] IEEE.16m-Amendment: Air Interface for Fixed and Mobile Broadband Wireless Access Systems—Advanced Air Interface, IEEE Draft, Rev. D5 2010.

[17] 3GPP TR 36.814-Further Advancements for E-UTRA Physical Layer Aspects 2010, 3GPP Technical Report, Rev. 9.0.0.

[18] D. Cescato, M. Borgmann, H. Bölcskei, J. Hansen, and A. Burg, “Interpolation-based QR decomposition in MIMO-OFDM systems,” in Proc. IEEE 6th Workshop on Signal Processing Advances in Wireless Communications, Jun. 2005, pp. 945–949.

[19] D. Wübben and K. D. Kammeyer, “Interpolation-based successive interference cancellation for per-antenna-coded MIMO-OFDM systems using P-SQRD,” in Proc. IEEE Workshop on Smart Antennas, Mar. 2006.

[20] P. L. Chiu, L. Z. Huang, and Y. H. Huang, “Scalable interpolation-based QRD architecture for subcarrier-grouped-ordering MIMO-OFDM system,” in Proc. 43rd IEEE Asilomar Conf. on Signals, Systems, and Computers, Nov. 2009, pp. 708–712.

[21] 3GPP TS 36.101-User Equipment (UE) Radio Transmission and Reception 2010, 3GPP Technical Specification, Rev. 8.9.0.

[22] A. Burg, “VLSI Circuits for MIMO Communication Systems,” Ph.D., Swiss Federal Inst. Technol., Zurich, Switzerland, 2006.

[23] P. L. Chiu and Y. H. Huang, “A scalable MIMO detection architecture with non-sorted multiple-candidate selection,” in Proc. IEEE Int. Symp. Circuits and Systems, May 2009, pp. 689–692.

Po-Lin Chiu (M’09) was born in Tainan, Taiwan, R.O.C., in 1975. He received the B.S. and M.S. degrees in electrical engineering from National Central University, Taoyuan, Taiwan, R.O.C., in 1997 and 1999, respectively. He is currently pursuing the Ph.D. degree at the Department of Communication Engineering, National Chiao-Tung University, Hsinchu, Taiwan.

Since 2004, he has been with the Department of Communication Engineering, National Chiao-Tung University. He is also a Member of Technical Staff at ITRI, Hsinchu. His research interests include baseband signal processing, multiple-input multiple-output signal processing, and VLSI design of wireless communications.


Lin-Zheng Huang was born in Taiwan in 1985. He received the B.S. degree in electronic engineering from National Cheng Kung University, Tainan, Taiwan, R.O.C., in 2007 and the M.S. degree in electrical engineering from National Tsing-Hua University, Hsinchu, Taiwan, R.O.C., in 2010. He is currently in military service in Taiwan, R.O.C. His research interests include VLSI design and implementation of communication applications.

Li-Wei Chai was born in Taiwan, R.O.C., in 1984. He received the B.S. degree in communications engineering from Feng-Chia University, Taichung, Taiwan, in 2008 and the M.S. degree in communications engineering from National Tsing-Hua University, Hsinchu, Taiwan, in 2010.

He is currently in military service in Taiwan. His research interests include VLSI design and implementation of communication applications.

Yuan-Hao Huang (S’98–M’02) was born in Taiwan, R.O.C., in 1973. He received the B.S. and Ph.D. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1995 and 2001, respectively.

He was a Member of Technical Staff with VXIS Technology Corporation, Hsin-Chu, Taiwan, from 2001 to 2005. Since 2005, he has been with the Department of Electrical Engineering, Institute of Communications Engineering, National Tsing-Hua University, Taiwan, where he is currently an Assistant Professor. His research interests include VLSI design for digital signal processing systems and telecommunication systems.

Dr. Huang is a Technical Committee Member of the Signal Processing Systems (DiSPS) Technical Committee of the IEEE Signal Processing Society.