
Random Matrix Methods for Wireless Communications

Blending theoretical results with practical applications, this book provides an introduction to random matrix theory and shows how it can be used to tackle a variety of problems in wireless communications. The Stieltjes transform method, free probability theory, combinatoric approaches, deterministic equivalents, and spectral analysis methods for statistical inference are all covered from a unique engineering perspective. Detailed mathematical derivations are presented throughout, with thorough explanations of the key results and all fundamental lemmas required for the readers to derive similar calculus on their own. These core theoretical concepts are then applied to a wide range of real-world problems in signal processing and wireless communications, including performance analysis of CDMA, MIMO, and multi-cell networks, as well as signal detection and estimation in cognitive radio networks. The rigorous yet intuitive style helps demonstrate to students and researchers alike how to choose the correct approach for obtaining mathematically accurate results.

Romain Couillet is an Assistant Professor at the Chair on System Sciences and the Energy Challenge at Supélec, France. Previously he was an Algorithm Development Engineer for ST-Ericsson, and he received his PhD from Supélec in 2010.

Mérouane Debbah is a Professor at Supélec, where he holds the Alcatel-Lucent Chair on Flexible Radio. He is the recipient of several awards, including the 2007 General Symposium IEEE Globecom best paper award and the Wi-Opt 2009 best paper award.


Random Matrix Methods for

Wireless Communications

Romain Couillet and Mérouane Debbah, École Supérieure d'Électricité, Gif-sur-Yvette, France


Cambridge University Press
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi, Tokyo, Mexico City

Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK

Published in the United States of America by Cambridge University Press, New York

www.cambridge.org
Information on this title: www.cambridge.org/9781107011632

© Cambridge University Press 2011

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2011

Printed in the United Kingdom at the University Press, Cambridge

A catalogue record for this publication is available from the British Library

Library of Congress Cataloguing in Publication data
Couillet, Romain, 1983–
Random matrix methods for wireless communications / Romain Couillet, Mérouane Debbah.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-107-01163-2 (hardback)
1. Wireless communication systems – Mathematics. 2. Matrix analytic methods.
I. Debbah, Mérouane, 1975– II. Title.
TK5103.2.C68 2011
621.3840151–dc23
2011013189

ISBN 978-1-107-01163-2 Hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.


To my family,

– Romain Couillet

To my parents,
– Mérouane Debbah


Contents

Preface page xiii
Acknowledgments xv
Acronyms xvi
Notation xviii

1 Introduction 1
1.1 Motivation 1
1.2 History and book outline 6

Part I Theoretical aspects 15

2 Random matrices 17
2.1 Small dimensional random matrices 17
2.1.1 Definitions and notations 17
2.1.2 Wishart matrices 19
2.2 Large dimensional random matrices 29
2.2.1 Why go to infinity? 29
2.2.2 Limit spectral distributions 30

3 The Stieltjes transform method 35
3.1 Definitions and overview 35
3.2 The Marčenko–Pastur law 42
3.2.1 Proof of the Marčenko–Pastur law 44
3.2.2 Truncation, centralization, and rescaling 54
3.3 Stieltjes transform for advanced models 57
3.4 Tonelli theorem 61
3.5 Central limit theorems 63

4 Free probability theory 71
4.1 Introduction to free probability theory 72
4.2 R- and S-transforms 75
4.3 Free probability and random matrices 77
4.4 Free probability for Gaussian matrices 84


4.5 Free probability for Haar matrices 87

5 Combinatoric approaches 95
5.1 The method of moments 95
5.2 Free moments and cumulants 98
5.3 Generalization to more structured matrices 105
5.4 Free moments in small dimensional matrices 108
5.5 Rectangular free probability 109
5.6 Methodology 111

6 Deterministic equivalents 113
6.1 Introduction to deterministic equivalents 113
6.2 Techniques for deterministic equivalents 115
6.2.1 Bai and Silverstein method 115
6.2.2 Gaussian method 139
6.2.3 Information plus noise models 145
6.2.4 Models involving Haar matrices 153
6.3 A central limit theorem 175

7 Spectrum analysis 179
7.1 Sample covariance matrix 180
7.1.1 No eigenvalues outside the support 180
7.1.2 Exact spectrum separation 183
7.1.3 Asymptotic spectrum analysis 186
7.2 Information plus noise model 192
7.2.1 Exact separation 192
7.2.2 Asymptotic spectrum analysis 195

8 Eigen-inference 199
8.1 G-estimation 199
8.1.1 Girko G-estimators 199
8.1.2 G-estimation of population eigenvalues and eigenvectors 201
8.1.3 Central limit for G-estimators 213
8.2 Moment deconvolution approach 218

9 Extreme eigenvalues 223
9.1 Spiked models 223
9.1.1 Perturbed sample covariance matrix 224
9.1.2 Perturbed random matrices with invariance properties 228
9.2 Distribution of extreme eigenvalues 230
9.2.1 Introduction to the method of orthogonal polynomials 230
9.2.2 Limiting laws of the extreme eigenvalues 233
9.3 Random matrix theory and eigenvectors 237


10 Summary and partial conclusions 243

Part II Applications to wireless communications 249

11 Introduction to applications in telecommunications 251
11.1 Historical account of major results 251
11.1.1 Rate performance of multi-dimensional systems 252
11.1.2 Detection and estimation in large dimensional systems 256
11.1.3 Random matrices and flexible radio 259

12 System performance of CDMA technologies 263
12.1 Introduction 263
12.2 Performance of random CDMA technologies 264
12.2.1 Random CDMA in uplink frequency flat channels 264
12.2.2 Random CDMA in uplink frequency selective channels 273
12.2.3 Random CDMA in downlink frequency selective channels 281
12.3 Performance of orthogonal CDMA technologies 284
12.3.1 Orthogonal CDMA in uplink frequency flat channels 285
12.3.2 Orthogonal CDMA in uplink frequency selective channels 285
12.3.3 Orthogonal CDMA in downlink frequency selective channels 286

13 Performance of multiple antenna systems 293
13.1 Quasi-static MIMO fading channels 293
13.2 Time-varying Rayleigh channels 295
13.2.1 Small dimensional analysis 296
13.2.2 Large dimensional analysis 297
13.2.3 Outage capacity 298
13.3 Correlated frequency flat fading channels 300
13.3.1 Communication in strongly correlated channels 305
13.3.2 Ergodic capacity in strongly correlated channels 309
13.3.3 Ergodic capacity in weakly correlated channels 311
13.3.4 Capacity maximizing precoder 312
13.4 Rician flat fading channels 316
13.4.1 Quasi-static mutual information and ergodic capacity 316
13.4.2 Capacity maximizing power allocation 318
13.4.3 Outage mutual information 320
13.5 Frequency selective channels 322
13.5.1 Ergodic capacity 324
13.5.2 Capacity maximizing power allocation 325
13.6 Transceiver design 328
13.6.1 Channel matrix model with i.i.d. entries 331
13.6.2 Channel matrix model with generalized variance profile 332


14 Rate performance in multiple access and broadcast channels 335
14.1 Broadcast channels with linear precoders 336
14.1.1 System model 339
14.1.2 Deterministic equivalent of the SINR 341
14.1.3 Optimal regularized zero-forcing precoding 348
14.1.4 Zero-forcing precoding 349
14.1.5 Applications 353
14.2 Rate region of MIMO multiple access channels 355
14.2.1 MAC rate region in quasi-static channels 357
14.2.2 Ergodic MAC rate region 360
14.2.3 Multi-user uplink sum rate capacity 364

15 Performance of multi-cellular and relay networks 369
15.1 Performance of multi-cell networks 369
15.1.1 Two-cell network 373
15.1.2 Wyner model 376
15.2 Multi-hop communications 378
15.2.1 Multi-hop model 379
15.2.2 Mutual information 382
15.2.3 Large dimensional analysis 382
15.2.4 Optimal transmission strategy 388

16 Detection 393
16.1 Cognitive radios and sensor networks 393
16.2 System model 396
16.3 Neyman–Pearson criterion 399
16.3.1 Known signal and noise variances 400
16.3.2 Unknown signal and noise variances 406
16.3.3 Unknown number of sources 407
16.4 Alternative signal sensing approaches 412
16.4.1 Condition number method 413
16.4.2 Generalized likelihood ratio test 414
16.4.3 Test power and error exponents 416

17 Estimation 421
17.1 Directions of arrival 422
17.1.1 System model 422
17.1.2 The MUSIC approach 423
17.1.3 Large dimensional eigen-inference 425
17.1.4 The correlated signal case 429
17.2 Blind multi-source localization 432
17.2.1 System model 434
17.2.2 Small dimensional inference 436


17.2.3 Conventional large dimensional approach 438
17.2.4 Free deconvolution approach 440
17.2.5 Analytic method 447
17.2.6 Joint estimation of number of users, antennas and powers 469
17.2.7 Performance analysis 471

18 System modeling 477
18.1 Introduction to Bayesian channel modeling 478
18.2 Channel modeling under environmental uncertainty 480
18.2.1 Channel energy constraints 481
18.2.2 Spatial correlation models 484

19 Perspectives 501
19.1 From asymptotic results to finite dimensional studies 501
19.2 The replica method 505
19.3 Towards time-varying random matrices 506

20 Conclusion 511
References 515
Index 537


Preface

More than sixty years have passed since the 1948 landmark paper of Shannon providing the capacity of a single antenna point-to-point communication channel. The method was based on information theory and led to a revolution in the field, especially in how communication systems were designed. The tools then showed their limits when we wanted to extend the analysis and design to the multi-terminal multiple antenna case, which is the basis of the wireless revolution since the nineties. Indeed, in the design of these networks, engineers frequently stumble on the scalability problem. In other words, as the number of nodes or the bandwidth increases, problems become harder to solve and the determination of the precise achievable rate region becomes an intractable problem. Moreover, engineering insight progressively disappears and we can only rely on heavy simulations with all their caveats and limitations. However, when the system is sufficiently large, we may hope that a macroscopic view could provide a more useful abstraction of the network. The properties of the new macroscopic model nonetheless need to account for microscopic considerations, e.g. fading, mobility, etc. We may then sacrifice some structural details of the microscopic view, but the macroscopic view will preserve sufficient information to allow for a meaningful network optimization solution and the derivation of insightful results in a wide range of settings.

Recently, a number of research groups around the world have taken this approach and have shown how tools borrowed from physical and mathematical frameworks, e.g. percolation theory, continuum models, game theory, electrostatics, mean field theory, stochastic geometry, just to name a few, can capture most of the complexity of dense random networks in order to unveil some relevant features on network-wide behavior.

The following book falls within this trend and aims to provide a comprehensive understanding of how random matrix theory can model the complexity of the interaction between wireless devices. It has been more than fifteen years since random matrix theory was successfully introduced into the field of wireless communications to analyze CDMA and MIMO systems. One of the useful features, especially of the large dimensional random matrix theory approach, is its ability to predict, under certain conditions, the behavior of the empirical eigenvalue distribution of products and sums of matrices. The results are striking in terms of accuracy compared to simulations with reasonable matrix sizes, and the theory has been shown to be an efficient tool to predict the behavior of wireless systems with only a few meaningful parameters. Random matrix theory is also increasingly making its way into the statistical signal processing field with the generalization of detection and inference methods, e.g. array processing, hypothesis tests, parameter estimation, etc., to the multi-variate case. This comes as a small revolution in modern signal processing as legacy estimators, such as the MUSIC method, become increasingly obsolete and ill-adapted to large sensing arrays with few observations.

The authors are confident and have no doubt about the usefulness of the tool for the engineering community in the upcoming years, especially as networks become denser. They also think that random matrix theory should sooner or later become a major tool for electrical engineers, taught at the graduate level in universities. Indeed, engineering education programs of the twentieth century were mostly focused on Fourier transform theory due to the omnipresence of the frequency spectrum. Twenty-first century engineers know by now that space is the next frontier, due to the omnipresence of spatial spectrum modes, which refocuses the programs towards a Stieltjes transform theory.

We sincerely hope that this book will inspire students, teachers, and engineers, and answer their present and future problems.

Romain Couillet and Mérouane Debbah


Acknowledgments

This book is the fruit of many years of the authors' involvement in the field of random matrix theory for wireless communications. This topic, which has gained increasing interest in the last decade, was brought to light in the telecommunication community in particular through the work of Stephen Hanly, Ralf Müller, Shlomo Shamai, Emre Telatar, David Tse, Antonia Tulino, and Sergio Verdú, among others. It then rapidly grew into a joint research framework gathering both telecommunication engineers and mathematicians, among whom Zhidong Bai, Vyacheslav L. Girko, Leonid Pastur, and Jack W. Silverstein.

The authors are especially indebted to Prof. Silverstein for the agreeable time spent discussing random matrix matters. Prof. Silverstein has a very insightful approach to random matrices, which it was a delight to share with him. The general point of view taken in this book is mostly influenced by Prof. Silverstein's methodology. The authors are also grateful to the many colleagues working in this field whose knowledge and wisdom about applied random matrix theory contributed significantly to its current popularity and elegance. This book gathers many of their results and intends above all to deliver to the readers this simplified approach to applied random matrix theory. The colleagues involved in long and exciting discussions as well as collaborative works are Florent Benaych-Georges, Pascal Bianchi, Laura Cottatellucci, Maxime Guillaud, Walid Hachem, Philippe Loubaton, Mylène Maïda, Xavier Mestre, Aris Moustakas, Ralf Müller, Jamal Najim, and Øyvind Ryan.

Regarding the book manuscript itself, the authors would also like to sincerely thank the anonymous reviewers for their wise comments, which contributed substantially to improving the overall quality of the final book, and more importantly the few people who dedicated a long time to thoroughly reviewing the successive drafts and who often came up with inspiring remarks. Among the latter are David Gregoratti, Jakob Hoydis, Xavier Mestre, and Sebastian Wagner.

The success of this book relies in a large part on these people.

Romain Couillet and Mérouane Debbah


Acronyms

AWGN additive white Gaussian noise
BC broadcast channel
BPSK binary phase shift keying
CDMA code division multiple access
CI channel inversion
CSI channel state information
CSIR channel state information at receiver
CSIT channel state information at transmitter
d.f. distribution function
DPC dirty paper coding
e.s.d. empirical spectral distribution
FAR false alarm rate
GLRT generalized likelihood ratio test
GOE Gaussian orthogonal ensemble
GSE Gaussian symplectic ensemble
GUE Gaussian unitary ensemble
i.i.d. independent and identically distributed
l.s.d. limit spectral distribution
MAC multiple access channel
MF matched filter
MIMO multiple input multiple output
MISO multiple input single output
ML maximum likelihood
LMMSE linear minimum mean square error
MMSE minimum mean square error
MMSE-SIC MMSE and successive interference cancellation
MSE mean square error
MUSIC multiple signal classification
NMSE normalized mean square error
OFDM orthogonal frequency division multiplexing


OFDMA orthogonal frequency division multiple access
p.d.f. probability density function
QAM quadrature amplitude modulation
QPSK quadrature phase shift keying
ROC receiver operating characteristic
RZF regularized zero-forcing
SINR signal-to-interference plus noise ratio
SISO single input single output
SNR signal-to-noise ratio
TDMA time division multiple access
ZF zero-forcing


Notation

Linear algebra
X Matrix
I_N Identity matrix of size N × N
X_ij Entry (i, j) of matrix X (unless otherwise stated)
(X)_ij Entry (i, j) of matrix X
[X]_ij Entry (i, j) of matrix X
{f(i, j)}_{i,j} Matrix with (i, j) entry f(i, j)
(X_ij)_{i,j} Matrix with (i, j) entry X_ij
x Vector (column by default)
x* Vector of the complex conjugates of the entries of x
x_i Entry i of vector x
F^X Empirical spectral distribution of the Hermitian matrix X
X^T Transpose of X
X^H Hermitian transpose of X
tr X Trace of X
det X Determinant of X
rank(X) Rank of X
∆(X) Vandermonde determinant of X
‖X‖ Spectral norm of the Hermitian matrix X
diag(x_1, ..., x_n) Diagonal matrix with (i, i) entry x_i
ker(A) Null space of the matrix A, ker(A) = {x : Ax = 0}
span(A) Subspace generated by the columns of the matrix A

Real and complex analysis
N The space of natural numbers
R The space of real numbers
C The space of complex numbers
A* The space A \ {0}
x+ Right-limit of the real x
x− Left-limit of the real x


(x)^+ For x ∈ R, max(x, 0)
sgn(x) Sign of the real x
ℜ[z] Real part of z
ℑ[z] Imaginary part of z
z* Complex conjugate of z
i Square root of −1 with positive imaginary part
f′(x) First derivative of the function f
f″(x) Second derivative of the function f
f‴(x) Third derivative of the function f
f^(p)(x) Derivative of order p of the function f
‖f‖ Norm of a function f, ‖f‖ = sup_x |f(x)|
1_A(x) Indicator function of the set A, 1_A(x) = 1 if x ∈ A, 1_A(x) = 0 otherwise
δ(x) Dirac delta function, δ(x) = 1_{0}(x)
∆(x|A) Convex indicator function, ∆(x|A) = 0 if x ∈ A, ∆(x|A) = ∞ otherwise
Supp(F) Support of the distribution function F
x_1, x_2, ... Series of general term x_n
x_n → x Simple convergence of the series x_1, x_2, ... to x
x_n = o(y_n) Upon existence, x_n/y_n → 0 as n → ∞
x_n = O(y_n) There exists K such that x_n ≤ K y_n for all n
n/N → c As n → ∞ and N → ∞, n/N → c
W(z) Lambert-W function satisfying W(z)e^{W(z)} = z
Ai(x) Airy function
Γ(x) Gamma function, Γ(n) = (n − 1)! for n integer

Probability theory
(Ω, F, P) Probability space Ω with σ-field F and measure P
P_X(x) Density of the random variable X
p_X(x) Density of the scalar random variable X
P_(X_i)(x) Unordered density of the random variables X_1, ..., X_N
P_≥(X_i)(x) Ordered density of the random variables X_1 ≥ ... ≥ X_N
P_≤(X_i)(x) Ordered density of the random variables X_1 ≤ ... ≤ X_N
µ_X Probability measure of X, µ_X(A) = P(X ∈ A)
µ_X Probability distribution of the eigenvalues of the matrix X
µ_X^∞ Probability distribution associated with the l.s.d. of X
P_X(x) Density of the random variable X, P_X(x)dx = µ_X(dx)
F_X(x) Distribution function of X (real), F_X(x) = µ_X((−∞, x])
E[X] Expectation of X, E[X] = ∫_Ω X(ω)dω


E[f(X)] Expectation of f(X), E[f(X)] = ∫_Ω f(X(ω))dω
var(X) Variance of X, var(X) = E[X^2] − E[X]^2
X ∼ L X is a random variable with density L
N(µ, Σ) Real Gaussian distribution of mean µ and covariance Σ
CN(µ, Σ) Complex Gaussian distribution of mean µ and covariance Σ
W_N(n, R) Real zero mean Wishart distribution with n degrees of freedom and covariance R
CW_N(n, R) Complex zero mean Wishart distribution with n degrees of freedom and covariance R
Q(x) Gaussian Q-function, Q(x) = P(X > x), X ∼ N(0, 1)
F^+ Tracy–Widom distribution function
F^− Conjugate Tracy–Widom d.f., F^−(x) = 1 − F^+(−x)
x_n →(a.s.) x Almost sure convergence of the series x_1, x_2, ... to x
F_n ⇒ F Weak convergence of the d.f. series F_1, F_2, ... to F
X_n ⇒ X Weak convergence of the series X_1, X_2, ... to the random variable X

Random matrix theory
m_F(z) Stieltjes transform of the function F
m_X(z) Stieltjes transform of the eigenvalue distribution of X
V_F(z) Shannon transform of the function F
V_X(z) Shannon transform of the eigenvalue distribution of X
R_F(z) R-transform of the function F
R_X(z) R-transform of the eigenvalue distribution of X
S_F(z) S-transform of the function F
S_X(z) S-transform of the eigenvalue distribution of X
η_F(z) η-transform of the function F
η_X(z) η-transform of the eigenvalue distribution of X
ψ_F(z) ψ-transform of the function F
ψ_X(z) ψ-transform of the eigenvalue distribution of X
µ ⊞ ν Additive free convolution of µ and ν
µ ⊟ ν Additive free deconvolution of µ and ν
µ ⊠ ν Multiplicative free convolution of µ and ν
µ ⧄ ν Multiplicative free deconvolution of µ and ν

Topology
A^c Complementary of the set A
#A Cardinality of the discrete set A
A ⊕ B Direct sum of the spaces A and B
⊕_{1≤i≤n} A_i Direct sum of the spaces A_i, 1 ≤ i ≤ n


‖x, A‖ Norm of the orthogonal projection of x on the space A

Miscellaneous
x ≜ y x is defined as y
sgn(σ) Signature (or parity) of the permutation σ, sgn(σ) ∈ {−1, 1}


1 Introduction

1.1 Motivation

We initiate the book with a classical example, which exhibits both the non-obvious behavior of large dimensional random matrices and the motivation behind their study.

Consider a random sequence x_1, ..., x_n of n independent and identically distributed (i.i.d.) observations of a given random process. The classical law of large numbers states that the sequence x_1, (x_1 + x_2)/2, ..., with nth term (1/n) Σ_{k=1}^n x_k, tends almost surely to the deterministic value E[x_1], the expectation of this process, as n tends to infinity. Denote (Ω, F, P) the probability space that generates the infinite sequence x_1, x_2, .... For a given realization ω ∈ Ω, we will denote x_1(ω), x_2(ω), ... the realization of the random sequence x_1, x_2, ....

We recall that almost sure convergence means that there exists A ⊂ Ω, with P(A) = 1, such that, for ω ∈ A,

$$\frac{1}{n}\sum_{k=1}^{n} x_k(\omega) \;\to\; \mathrm{E}[x_1] \triangleq \int_\Omega x_1(\omega)\,d\omega.$$

We also remind briefly that the notation (Ω, F, P) designates the triplet composed of the space of random realizations Ω, i.e. in our case ω ∈ Ω is the realization of a series x_1(ω), x_2(ω), ...; F, a σ-field on Ω, which can be seen as the space of the measurable events on Ω, e.g. the space B = {x_1(ω) > 0} ∈ F is such an event; and P, a probability measure on F, i.e. P is a function that assigns to every event in F a probability.

This law of large numbers is fundamental in the sense that it provides a deterministic feature for a process ruled by 'chance' (or more precisely, ruled by a deterministic process, the precise nature of which the observer is unaware). This allows the observer to retrieve deterministic information from random variables based on any observed random sequence (within a space of probability one). If, for instance, x_1, ..., x_n are successive samples of a stationary zero mean white noise waveform x(t), i.e. E[x(t)x(t − τ)] = σ²δ(τ), it is the usual signal processing problem to estimate the power σ² = E[|x_1|²] of the noise process; the empirical variance σ²_n, i.e.

$$\sigma_n^2 = \frac{1}{n}\sum_{i=1}^{n} |x_i|^2$$

is a classical estimate of σ² which, according to the law of large numbers, is such that σ²_n → σ² almost surely when n → ∞. It is often said that σ²_n is a consistent estimator of σ², as it is asymptotically and almost surely equal to σ². To avoid confusion with the two-dimensional case treated next, we will say instead that σ²_n is an n-consistent estimator of σ², as it is asymptotically accurate as n grows large. Obviously, we are never provided with an infinitely long observation time window, so that n is usually large but finite, and therefore σ²_n is merely an approximation of σ².
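As a quick numerical illustration (ours, not from the book), the following minimal Python sketch checks the n-consistency of the empirical variance estimator above; the true power σ² = 2 and the sample sizes are arbitrary assumptions made for the example.

```python
import numpy as np

# Hypothetical illustration: the empirical variance sigma_n^2 = (1/n) sum_i |x_i|^2
# of a zero mean white process approaches the true power sigma^2 as n grows.
rng = np.random.default_rng(0)
sigma2 = 2.0                                   # assumed true power
for n in (100, 10_000, 1_000_000):
    x = rng.normal(scale=np.sqrt(sigma2), size=n)
    sigma2_n = np.mean(np.abs(x) ** 2)         # n-consistent estimator of sigma^2
    print(n, sigma2_n)
```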

With the emergence of multiple antenna systems, channel spreading codes, sensor networks, etc., signal processing problems have become more and more concerned with vectorial inputs rather than scalar inputs. For a sequence of n i.i.d. random vectors, the law of large numbers still applies. For instance, for x_1, x_2, ... ∈ C^N randomly drawn from a given N-variate zero mean random process

$$R_n = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^H \;\to\; R \triangleq \mathrm{E}\left[x_1 x_1^H\right] \qquad (1.1)$$

almost surely as n → ∞, where the convergence is considered for any matrix norm, i.e. ‖R − R_n‖ → 0 on a set of probability one. The matrix R_n is often referred to as the empirical covariance matrix or as the sample covariance matrix, as it is computed from observed vector samples. We will use this last phrase throughout the book. Following the same semantic field, the matrix R will be referred to as the population covariance matrix, as it characterizes the innate nature of all stochastic vectors x_i from the overall population of such vectors. The empirical R_n is again an n-consistent estimator of the covariance matrix R of x_1 and, as before, as n is taken very large for N fixed, R_n is a good approximation of R in the sense of the aforementioned matrix norm. However, in practical applications, it might be that the number of available snapshots x_k is indeed very large but not extremely large compared to the vector size N. This situation arises in diverse application fields, such as biology, finance, and, of course, wireless communications. If this is the case, as will become obvious in the following examples and against intuition, the difference R − R_n can be far from zero even for large n.

Since the DNA of many organisms has now been entirely sequenced, biologists and evolutionary biologists are interested in the correlations between genes, e.g.: How does the presence of a given gene (or gene sequence) in an organism impact the probability of the presence of another given gene? Does the activation of a given gene come along with the activation of several other genes? To be able to study the joint correlation between a large population of the several ten thousands of human genes, call this number N, we need a large sample of genome sequences extracted from human beings, call the number of such samples n. It is therefore typical that the N × n matrix of the n gene sequence samples does not have many more columns than rows, or, worse, may even have more rows than columns. We see already that, in this case, the sample covariance matrix R_n is necessarily rank-deficient (of rank at most n, hence with at least N − n null eigenvalues when N > n), while R has all the chances to be full rank. Therefore, R_n is obviously no longer a good approximation of R, even if n were very large in the first place, since the eigenvalues of R_n and R differ by at least N − n terms.

In the field of finance, the interest of statisticians lies in the interactions between assets in the market and the joint time evolution of their stock market indices. The vectors x_1, ..., x_n here may be representative of n months of market index evolution of N different brands of a given product, say soda, the ith entry of the column vector x_k being the evolution of the market index of soda i in month k. Obviously, this case differs from the independent vector case presented up to now since the evolution at month k + 1 is somewhat correlated to the evolution at previous month k, but let us assume for simplicity that the month evolution is at least an uncorrelated process (which does not imply independence). Similar to the gene case for biologists, it often turns out that the N × n matrix under study contains few columns compared to the number of rows, although both dimensions are typically large compared to 1. Of specific importance to traders is the largest eigenvalue of the population covariance matrix R of (a centered and normalized version of) the random process x_1, which is an indicator of the maximal risk against investment returns taken by a trader who constitutes a portfolio from these assets. From the biology example above, it has become clear that the eigenvalues of R_n may be a very inaccurate estimate of those of R; thus, R_n cannot be relied on to estimate the largest eigenvalue and hence the trading risk. The case of wireless communications will be thoroughly detailed in Part II, and a first motivation is given in the next paragraph.

Returning to the initial sample covariance matrix model, we have already mentioned that in the scalar case the strong law of large numbers ensures that it suffices for n to be quite large compared to 1 for σ²_n to be a good estimator of σ². In the case where data samples are vectors, if n is large compared to 1, whatever N, then the (i, j) entry R_{n,ij} of R_n is a good estimator of the (i, j) entry R_{ij} of R. This might (mis)lead us to assume that, as n is much greater than one, R_n ≈ R in some sense. However, if both N and n are large compared to 1 but n is not large compared to N, then the peculiar thing happens: the eigenvalue distribution of R_n (see this as a histogram of the eigenvalues) in general converges, but does not converge to the eigenvalue distribution of R. This has already been pointed out in the degenerate case N > n, for which R_n has N − n null eigenvalues, while R could be of full rank. This behavior is evidenced in Figure 1.1, in which we consider x_1 ∼ CN(0, I_N) and then R = I_N, for N = 500, n = 2000.


Figure 1.1 Histogram of the eigenvalues of R_n = (1/n) Σ_{k=1}^n x_k x_k^H, x_k ∈ C^N, for n = 2000, N = 500 (empirical eigenvalues against the Marčenko–Pastur law; x-axis: eigenvalues of R_n, y-axis: density).

In that case, notice that R_n converges point-wise to I_N when n is large, as R_{n,ij}, the entry (i, j) of R_n, is given by:

$$R_{n,ij} = \frac{1}{n}\sum_{k=1}^{n} x_{ik} x_{jk}^{*}$$

which is close to one if i = j and close to zero if i ≠ j. This is obviously irrespective of N, which is not involved in the calculus here. However, the eigenvalues of R_n do not converge to a single mass in 1 but are spread around 1. This apparent contradiction is due to the fact that N grows along with n, but n/N is never large. We say in that case that, while R_n is an n-consistent estimator of R, it is not an (n, N)-consistent estimator of R. The seemingly paradoxical behavior of the eigenvalues of R_n, while R_n converges point-wise to I_N, lies in fact in the rate of convergence of the entries of R_n towards the entries of I_N. Due to central limit arguments for the sample mean of scalar i.i.d. random variables, R_{n,ij} − E[R_{n,ij}] is of order O(1/√n). When determining the eigenvalues of R_n, the deviations around the means are negligible when n is large and N fixed. However, for N and n both large, these residual deviations of the entries of R_n (their number is N²) are no longer negligible and the eigenvalue distribution of R_n is not a single mass in 1. In some sense, we can see R_n as a matrix close to the identity but whose entries all carry some small residual "energy," which becomes relevant as much of this small energy accumulates.
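The following short Python sketch (ours, not the authors') reproduces the experiment of Figure 1.1 under the stated assumptions x_1 ∼ CN(0, I_N), N = 500, n = 2000: the entries of R_n are individually close to those of I_N, yet the eigenvalues spread over an interval rather than concentrating at 1, in agreement with the standard Marčenko–Pastur density for ratio c = N/n < 1 recalled in the code.

```python
import numpy as np

# Sketch of the Figure 1.1 experiment: sample covariance matrix of n = 2000
# i.i.d. CN(0, I_N) vectors of dimension N = 500 (population covariance R = I_N).
N, n = 500, 2000
c = N / n
rng = np.random.default_rng(0)
X = (rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))) / np.sqrt(2)
R_n = X @ X.conj().T / n                      # sample covariance matrix R_n
eig = np.linalg.eigvalsh(R_n)                 # its (real) eigenvalues

# entry-wise, R_n is close to I_N (deviations of order 1/sqrt(n) per entry) ...
print("largest entry-wise deviation from I_N:", np.max(np.abs(R_n - np.eye(N))))
# ... yet the eigenvalues spread over roughly [0.25, 2.25] instead of clustering at 1
print("eigenvalue range:", eig.min(), eig.max())

# Marcenko-Pastur density for ratio c < 1, supported on [(1-sqrt(c))^2, (1+sqrt(c))^2]
a, b = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
mp = lambda x: np.sqrt(np.maximum((b - x) * (x - a), 0.0)) / (2 * np.pi * c * x)

hist, edges = np.histogram(eig, bins=10, range=(a, b), density=True)
centers = (edges[:-1] + edges[1:]) / 2
for h, x in zip(hist, centers):               # empirical histogram vs. limiting law
    print(f"x = {x:.2f}   empirical {h:.3f}   Marcenko-Pastur {mp(x):.3f}")
```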

This observation has very important consequences, which motivate the need for singling out the study of large empirical covariance matrices and more generally of large random Hermitian matrices as a unique field of mathematics. Wireless communications may be the one research field in which large matrices have started to play a fundamental role. Indeed, current and more importantly future wireless communication systems are multi-dimensional in several respects (spatial with antennas, temporal with random codes, cellular-wise with a large number of users, multiple cooperative network nodes, etc.) and random in other respects (time-varying fading channels, noisy communications, etc.). The study of the behavior of large wireless communication systems therefore calls for advanced mathematical tools that can easily deal with large dimensional random matrices. Consider for instance a multiple input multiple output (MIMO) complex channel matrix H ∈ C^{N×n} between an n-antenna transmitter and an N-antenna receiver, the entries of which are independent and complex Gaussian with zero mean and variance 1/n. If uniform power allocation across the antennas is used at the transmit antenna array and the additive channel noise is white and Gaussian, the achievable transmission rates over this channel are all rates less than the channel mutual information

$$I(\sigma^2) = \mathrm{E}\left[\log_2 \det\left(I_N + \frac{1}{\sigma^2} H H^H\right)\right] \qquad (1.2)$$

where σ^{−2} now denotes the signal-to-noise ratio (SNR) at the receiver and the expectation is taken over the realizations of the random channel H, varying according to the Gaussian distribution. Now note that HH^H = Σ_{i=1}^n h_i h_i^H, with h_i ∈ C^N the ith column of H, h_1, ..., h_n being i.i.d. random vectors. The matrix HH^H can then be seen as the sample covariance matrix of some hypothetical random N-variate variable √n h_1. From our previous discussion, denoting HH^H = UΛU^H the spectral decomposition of HH^H, we have:

$$I(\sigma^2) = \mathrm{E}\left[\log_2 \det\left(I_N + \frac{1}{\sigma^2}\Lambda\right)\right] = \mathrm{E}\left[\sum_{i=1}^{N}\log_2\left(1 + \frac{\lambda_i}{\sigma^2}\right)\right] \qquad (1.3)$$

with λ_1, ..., λ_N the eigenvalues of HH^H, which again are not all close to one, even for n and N large. The achievable transmission rates are then explicitly dependent on the eigenvalue distribution of HH^H.

More generally, it will be shown in Chapters 12–15 that random matrix theory provides a powerful framework, with multiple methods, to analyze the achievable transmission rates and rate regions of a large range of multi-dimensional setups (MIMO, CDMA, multi-user transmissions, MAC/BC channels, etc.) and to derive the capacity-achieving signal covariance matrices for some of these systems, i.e. to determine the non-negative definite matrix P ∈ C^{n×n} which, under some trace constraint tr P ≤ P, maximizes the expression

$$I(\sigma^2; P) = \mathrm{E}\left[\log \det\left(I_N + \frac{1}{\sigma^2} H P H^H\right)\right]$$

for numerous fading channel models for H.
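As a hedged illustration of (1.2)–(1.3), not taken from the book, the following Python sketch estimates the ergodic mutual information by Monte Carlo averaging over i.i.d. Rayleigh channel realizations H with CN(0, 1/n) entries; the antenna numbers, SNR, and number of channel draws are arbitrary choices for the example.

```python
import numpy as np

# Monte Carlo estimate of I(sigma^2) = E[ log2 det(I_N + HH^H / sigma^2) ]
# for H in C^{N x n} with i.i.d. CN(0, 1/n) entries (illustrative parameters).
N, n = 4, 4
snr = 10.0                                  # 1 / sigma^2
trials = 2000
rng = np.random.default_rng(0)

rates = np.empty(trials)
for t in range(trials):
    H = (rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))) / np.sqrt(2 * n)
    lam = np.linalg.eigvalsh(H @ H.conj().T)          # eigenvalues of HH^H
    rates[t] = np.sum(np.log2(1.0 + snr * lam))       # log2 det via the eigenvalues, as in (1.3)
print("estimated I(sigma^2):", rates.mean(), "bit/s/Hz")
```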


1.2 History and book outline

The present book is divided into two parts: a first part on the theoretical fundamentals of random matrix theory, and a second part on the applications of random matrix theory to the field of wireless communications. The first part will give a rather broad, although not exhaustive, overview of fundamental and recent results concerning random matrices. However, the main purpose of this part goes beyond a listing of important theorems. Instead, it aims on the one hand at providing the reader with a large, yet incomplete, range of techniques to handle problems dealing with random matrices, and on the other hand at developing sketches of proofs of the most important results in order to provide further intuition to the reader. Part II will be more practical as it will apply most of the results derived in Part I to problems in wireless communications, such as system performance analysis, signal sensing, parameter estimation, receiver design, channel modeling, etc. Every application will be commented on with regard to the theoretical results developed in Part I, for the reader to have a clear understanding of the reasons why the practical results hold, of their main limitations, and of the questions left open. Before moving on to Part I, in the following we introduce in detail the objectives of both parts through a brief historical account of eighty years of random matrix theory.

The origin of the study of random matrices is usually said to date back to 1928 with the pioneering work of the statistician John Wishart [Wishart, 1928]. Wishart was interested in the behavior of sample covariance matrices of i.i.d. random vector processes x_1, ..., x_n ∈ C^N, in the form of the matrix R_n previously introduced

$$R_n = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^H. \qquad (1.4)$$

Wishart provided an expression of the joint probability distribution of the entries of such a matrix when its column vector entries are themselves independent and have an identical standard complex Gaussian distribution, i.e. x_ij ∼ CN(0, 1). These normalized matrices with i.i.d. standard Gaussian entries are now called Wishart matrices. Wishart matrices were thereafter generalized and extensively studied. Today there exists in fact a large pool of properties on the joint distribution of the eigenvalues, the distribution of the extreme eigenvalues, the distribution of the ratios between extreme eigenvalues, etc.

The first asymptotic considerations, i.e. the first results on matrices of asymptotically large dimensions, appeared with the work of the physicist Eugene Wigner [Wigner, 1955] on nuclear physics, who considered (properly scaled) symmetric matrices with independent entries uniformly distributed in {1, −1} and proved the convergence of the marginal probability distribution of the eigenvalues towards the deterministic semi-circle law, as the dimension of the matrix grows to infinity.


Figure 1.2 Histogram of the eigenvalues of a Wigner matrix and the semi-circle law, for n = 500 (empirical eigenvalue distribution against the semi-circle law; x-axis: eigenvalues, y-axis: density).

Hermitian n × n matrices with independent upper-triangular entries of zero mean and variance 1/n are now referred to as Wigner matrices. The empirical eigenvalues of a large Wigner matrix and the semi-circle law are illustrated in Figure 1.2, for a matrix of size n = 500. From this time on, infinite size random matrices have drawn increasing attention in many domains of physics [Mehta, 2004] (nuclear physics [Dyson, 1962a], statistical mechanics, etc.), finance [Laloux et al., 2000], evolutionary biology [Arnold et al., 1994], etc. The first accounts of work on large dimensional random matrices for wireless communications are attributed to Tse and Hanly [Tse and Hanly, 1999] on the performance of large multi-user linear receivers, and Verdú and Shamai [Verdú and Shamai, 1999] on the capacity of code division multiple access (CDMA) systems, among others. The pioneering work of Telatar [Telatar, 1995] on the transmission rates achievable with multiple antennas, paralleled by Foschini [Foschini and Gans, 1998], is on the contrary a particular example of the use of small dimensional random matrices for capacity considerations. In its final version of 1999, the article also mentions asymptotic laws for capacity [Telatar, 1999]. We will see in Chapter 13 that, while Telatar's original proof of the capacity growth rate for an increasing number of antennas in a multiple antenna setup is somewhat painstaking, large random matrix theory provides a straightforward result. In Chapter 2, we will explore some of the aforementioned results on random matrices of small dimension, which will be shown to be difficult to manipulate for simply structured matrices and rather intractable to extend to more structured matrices.
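To make the semi-circle law concrete, here is a minimal Python sketch (our illustration, with an assumed real symmetric construction) that draws a Wigner matrix of size n = 500, as in Figure 1.2, and compares the histogram of its eigenvalues with the semi-circle density √(4 − x²)/(2π) on [−2, 2].

```python
import numpy as np

# Sketch of Figure 1.2: eigenvalues of an n x n Wigner matrix (independent
# upper-triangular entries of zero mean and variance 1/n) vs. the semi-circle law.
n = 500
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n)) / np.sqrt(2)
W = (A + A.T) / np.sqrt(n)                        # real symmetric, off-diagonal variance 1/n
eig = np.linalg.eigvalsh(W)

semi_circle = lambda x: np.sqrt(np.maximum(4.0 - x ** 2, 0.0)) / (2 * np.pi)
hist, edges = np.histogram(eig, bins=10, range=(-2, 2), density=True)
centers = (edges[:-1] + edges[1:]) / 2
for h, x in zip(hist, centers):                   # empirical histogram vs. limiting density
    print(f"x = {x:+.2f}   empirical {h:.3f}   semi-circle {semi_circle(x):.3f}")
```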

The methods used for random matrix-based calculus are mainly segmented into: (i) the analytical methods, which treat asymptotic eigenvalue distributions of large matrices in a comprehensive framework of analytical tools, among which the important Stieltjes transform, and (ii) the moment-based methods, which establish results on the successive moments of the asymptotic eigenvalue probability distribution.¹ The analytical framework allows us to solve a large range of problems in wireless communications such as those related to capacity evaluation in both random and orthogonal CDMA networks and in large MIMO systems, but also to address questions such as signal sensing in large networks or statistical inference, i.e. estimation of network parameters. These analytic methods are mostly used when the random matrices under consideration are sample covariance matrices, doubly correlated i.i.d. matrices, information plus noise matrices (to be defined later), isometric matrices, or the sums and products of such matrices. They are generally preferred over the alternative moment-based methods since they consider the eigenvalue distribution of large dimensional random matrices as the central object of study, while the moment approach is dedicated to the specific study of the successive moments of the distribution. Note in particular that not all distributions have moments of all orders, and for those that do have moments of all orders, not all are uniquely defined by the series of their moments. However, in some cases of very structured matrices whose entries are non-trivially correlated, as in the example of Vandermonde matrices [Ryan and Debbah, 2009], the moment-based methods convey a more accessible treatment. Both analytical and moment-based methods are not completely disconnected from one another as they share a common denominator when it comes to dealing with unitarily invariant random matrices, such as standard Gaussian or Haar matrices, i.e. unitarily invariant unitary matrices. This common denominator, namely the field of free probability theory, bridges the analytical tools to the moment-based methods via derivatives of the Stieltjes transform, the R-transform, and the S-transform. The latter can be expressed in power series with coefficients intimately linked to moments and cumulants of the underlying random matrix eigenvalue distributions. The free probability tool, due to Voiculescu [Voiculescu et al., 1992], was not initially meant to deal specifically with random matrices but with more abstract non-commutative algebras, large dimensional random matrices being a particular case of such algebras. The extension of classical probability theory to free probability provides interesting and often surprising results, such as a strong equivalence between some classical probability distributions, e.g. Poisson, Gaussian, and the asymptotic probability distribution of the eigenvalues of some random matrix models, e.g. Wishart matrices and Wigner matrices. Some classical probability tools, such as the characteristic function, are also extensible through analytic tools of random matrix theory.

¹ Since the terminology method of moments is already dedicated to the specific technique which aims at constructing a distribution function from its moments (under the condition that the moments uniquely determine the distribution), see, e.g. Section 30 of [Billingsley, 1995], we will carefully avoid referring to any random matrix technique based on moments as the method of moments.


The division between analytical and moment-based methods, with free probability theory lying in between, can be seen from another point of view. It will turn out that the analytical methods, and most particularly the Stieltjes transform approach, take full advantage of the independence between the entries of large dimensional random matrices. As for moment-based methods, from a free probability point of view, they take full advantage of the invariance properties of large dimensional matrices, such as the invariance by left or right product with unitary matrices. The theory of orthogonal polynomials follows the same pattern, as it benefits from the fact that the eigenvalue distribution of unitarily invariant random matrices can be studied regardless of the (uniform) eigenvector distribution. In this book, we will see that the Stieltjes transform approach can solve most problems involving random matrices with invariance properties as well. This makes the distinction between random matrices with independent entries and random matrices with invariance properties not so obvious to us. For this reason, we will keep distinguishing between the analytical approaches that deal with the eigenvalue distribution as the central object of concern and the moment-based approaches that are only concerned with successive moments. We will also briefly introduce the rather old theory of orthogonal polynomials, which has received much interest lately regarding the study of limiting laws of the largest eigenvalues of random matrices but which requires significant additional mathematical effort for proper usage, while applications to wireless communications are to this day rather limited, although in constant expansion. We will therefore mostly state the important results from this field, particularly in terms of limit theorems of extreme eigenvalues, see Chapter 9, without development of the corresponding proofs.

In Chapter 3, Chapter 4, and Chapter 5, we will introduce the analytical and moment-based methods, as well as notions of free probability theory, which are fundamental to understanding the important concept of asymptotic freeness for random matrices. We will also provide in these chapters a sketch of the proof of the convergence of the eigenvalue distribution of Wishart and Wigner matrices to the Marčenko–Pastur law, depicted in Figure 1.1, and the semi-circle law, depicted in Figure 1.2, using the Stieltjes transform and the method of moments, respectively. Generic methods to determine (almost sure) limiting distributions of the eigenvalues of large dimensional random matrices, as well as other functionals of such large matrices (e.g. the log determinant), will be reviewed in detail in these chapters. Chapter 6 will discuss the alternative methods used when the empirical eigenvalue distribution of large random matrices does not necessarily converge when the dimensions increase: in that case, in place of limit distributions, we will introduce the so-called deterministic equivalents, which provide deterministic approximations of functionals of random matrices of finite size. These approximations are (almost surely) asymptotically accurate as the matrix dimensions grow to infinity, making them consistent with the methods developed in Chapter 3.


In addition to limiting eigenvalue distributions and deterministic equivalents, in Chapter 3 and Chapter 6, central limit theorems that extend the convergence theorems to a higher precision order will be introduced. These central limit theorems constitute a first step into a more thorough analysis of the asymptotic deviations of the spectrum around its almost sure limit or around its deterministic equivalent. Chapter 7 will discuss advanced results on the spectrum of both the sample covariance matrix model and the information plus noise model, which have been extensively studied and for which many results have been provided in the literature, such as the proof of the asymptotic absence of eigenvalues outside the support of the limiting distribution. Beyond the purely mathematical convenience of such a result, being able to characterize where the eigenvalues, and especially the extreme eigenvalues, are expected to lie is of fundamental importance to perform hypothesis testing decisions and in statistical inference. In particular, the characterization of the spectrum of sample covariance matrices will be used to retrieve information on functionals of the population covariance matrix from the observed sample covariance matrix, or functionals of the signal space matrix from the observed information plus noise matrix. Such methods will be referred to as eigen-inference techniques and are developed in Chapter 8. The first part will then conclude with Chapter 9, which extends the analysis of Section 7.1 to the expression of the limiting distributions of the extreme eigenvalues. We will also introduce in this chapter the spiked models, which have recently received a lot of attention for their many practical implications. These objects are necessary tools for signal sensing in large dimensional networks, which are currently of major interest with regard to the recent incentive for cognitive radios. In Chapter 10, the essential results of Part I will finally be summarized and rediscussed with respect to their applications to the field of wireless communications.

The second part of this book is dedicated to the application of the differentmethods described in the rst chapter to different problems in wirelesscommunications. As already mentioned, the rst applications of random matrixtheory to wireless communications are exclusively related to asymptotic systemperformance analysis, and especially channel capacity considerations. The ideaof considering asymptotically large matrix approximations was initially linked

to studies in CDMA communications, where both the number of users and thelength of the spreading codes are potentially very large [Li et al., 2004; Tse andHanly, 1999 ; Tse and Verd´u, 2000; Tse and Zeitouni, 2000 ; Zaidel et al., 2001]. Itthen occurred to researchers that large matrix approximations work rather wellwhen the size of the effective matrix under study is not so large, e.g. for matricesof size 8×8 or even 4 ×4 (in the case of random unitary matrices, simulationssuggest that approximations for matrices of size 2 ×2 are even acceptable). Thismotivated further studies in systems where the number of relevant parametersis moderately large. In particular, studies of MIMO communications [Chuahet al. , 2002; Hachem et al., 2008b; Mestre et al., 2003; Moustakas and Simon,

2005; Muller, 2002], designs of multi-user receivers [Honig and Xiao , 2001;Muller and Verd u, 2001], multi-cell communications [Abdallah and Debbah,2004; Couillet et al., 2011a; Peacock et al., 2008], multiple access channels and


broadcast channels [Couillet et al., 2011a; Wagner et al., 2011] started to be considered. More recent topics featuring large decentralized systems, such as game theory-based cognitive radios [Meshkati et al., 2005] and ad-hoc and mesh networks [Fawaz et al., 2011; Leveque and Telatar, 2005] were also treated using similar random matrix theoretical tools. The initial fundamental reason for the attractiveness of asymptotic results of random matrix theory for the study of system performance lies in the intimate link between the Stieltjes transform and the information-theoretic expression of mutual information. It is only recently, thanks to a new wave of theoretical results, that the wireless communication community has realized that many more questions can be addressed than just mutual information and system performance evaluations.

In Chapter 11, we will start with an introduction to the important results of random matrix theory for wireless communications and their connections to the methods detailed in the first part of this book. In Chapters 12–15, we will present the latest results concerning achievable rates for a wide range of wireless models, mostly taken from the aforementioned references.

From an even more practical, implementation-oriented point of view, we will also introduce some ideas, rather scattered in the literature, which allow engineers to reduce the computational burden of transceivers by anticipating complex calculus and also to reduce the feedback load within communication networks by benefiting from deterministic approximations brought by large dimensional matrix analysis. The former will be introduced in Chapter 13, where we discuss means to reduce the computational difficulty of implementing large minimum mean square error (MMSE) CDMA or MIMO receivers, which require real-time matrix inversions to be performed, from basis expansion models. Besides, it will be seen that in a multiple antenna uplink cellular network, the ergodic capacity maximizing precoding matrices of all users, computed at the central base station, can be fed back to the users in the form of partial information contained within a few bits (the number of which does not scale with the number of transmit antennas), which are sufficient for every user to compute their own transmit covariance matrix. These ideas contribute to the cognitive radio incentive for distributed intelligence within large dimensional networks.

Of major interest these days are also the questions of statistical estimation and detection in large decentralized networks or cognitive radio networks, spurred especially by the cognitive radio framework, a trendy example of which is the femto-cell incentive [Calin et al., 2010; Claussen et al., 2008]. Take for example the problem of radar detection using a large antenna array composed of N sensors. Each signal arising at the sensor array originates from a finite number K of sources positioned at specific angles with respect to the sensor support. The population covariance matrix R ∈ C^{N×N} of the N-dimensional sample vectors is composed of K distinct eigenvalues, with multiplicity linked to the size of every individual detected object, and possibly some null eigenvalues if N is large. Gathering n successive samples of the N-dimensional vectors, we can construct


a sample covariance matrix R_n as in (1.1). We might then be interested in retrieving the individual eigenvalues of R, which translate the distance with respect to each object, or retrieving the multiplicity of every eigenvalue, which is then an indicator of the size of the detected objects, or even detecting the angles of arrival of the incoming waveforms, which is a further indication of the geographical position of the objects. The multiple signal classification (MUSIC) estimator [Schmidt, 1986] has long been considered efficient for such a treatment and was indeed proved n-consistent. However, with N and n of similar order of magnitude, i.e. if we wish to increase the number of sensors, the MUSIC algorithm is largely biased [Mestre, 2008a]. Random matrix considerations, again based on the Stieltjes transform, were recently used to arrive at alternative (N, n)-consistent solutions [Karoui, 2008; Mestre, 2008b; Rao et al., 2008].

These methods of eigenvalue and eigenvector retrieval are referred to as eigen-inference methods. In the case of a cognitive radio network, say a femto-cell network, every femto-cell, potentially composed of a large number of sensors, must be capable of discovering its environment, i.e. detecting the presence of surrounding users in communication, in order to exploit spectrum opportunities, i.e. unoccupied transmission bandwidth, unused by the surrounding licensed networks. One of the first objectives of a femto-cell is therefore to evaluate the number of surrounding users and to infer their individual transmit powers. The study of the power estimation capabilities of femto-cells is performed both through analytical and moment-based methods in [Couillet et al., 2011c; Rao and Edelman, 2008; Ryan and Debbah, 2007a]. Statistical eigen-inference is also used to address hypothesis testing problems, such as signal sensing. Consider that an array of N sensors captures n samples of the incoming waveform and generates the empirical covariance matrix R_N. The question is whether R_N indicates the presence of a signal issued by a source or only indicates the presence of noise. It is natural to consider that, if the histogram of the eigenvalues of R_N is close to that of Figure 1.1, this evidences the presence of noise but the absence of a transmitted signal. A large range of methods have been investigated, in different scenarios such as multi-source detection, multiple antenna signal sensing, known or unknown signal-to-noise ratio (SNR), etc. to come up with (N, n)-consistent detection tests [Bianchi et al., 2011; Cardoso et al., 2008]. These methods are here of particular interest when the system dimensions are extremely large to ensure low rates of false positives (declaring pure noise to be a transmitted signal) and of false negatives (missing the detection of a transmitted signal). When a small number of sensors are used, small dimensional random matrix models, however more involved, may be required [Couillet and Debbah, 2010a]. Note finally that, along with inference methods for signal sensing, small dimensional random matrices are also used in the new field of Bayesian channel modeling [Guillaud et al., 2007], which will be introduced in Chapter 18 and thoroughly detailed.


Chapter 16, Chapter 17 and Chapter 18 discuss the solutions brought by the field of asymptotically large and small random matrices to the problems of estimation, detection, and system modeling, respectively. Chapter 19 will then discuss the perspectives and challenges envisioned for the future of random matrices in topics related to wireless communications. Finally, in Chapter 20, we draw the conclusions of this book.


Part I

Theoretical aspects


2 Random matrices

It is often assumed that random matrices is a field of mathematics which treats matrix models as if matrices were of infinite size and which then approximates functionals of realistic finite size models using asymptotic results. We wish first to insist on the fact that random matrices are necessarily of finite size, so we do not depart from conventional linear algebra. We start this chapter by introducing initial considerations and exact results on finite size random matrices. We will see later that, for some matrix models, it is then interesting to study the features of some random matrices with large dimensions. More precisely, we will see that the eigenvalue distribution function F^{B_N} of some N×N random Hermitian matrices B_N converges in distribution (often almost surely so) to some deterministic limit F when N grows to infinity. The results obtained for F can then be turned into approximative results for F^{B_N}, and therefore help to provide approximations of some functionals of B_N. Even if it might seem simpler for some to think of F as the eigenvalue distribution of an infinite size matrix, this does not make much sense in mathematical terms, and we will never deal with such objects as infinite size matrices, but only with sequences of finite dimensional matrices of increasing size.

2.1 Small dimensional random matrices

We start with a formal definition of a random matrix and introduce some notations.

2.1.1 Definitions and notations

Definition 2.1. An N×n matrix X is said to be a random matrix if it is a matrix-valued random variable on some probability space (Ω, F, P) with entries in some measurable space (R, G), where F is a σ-field on Ω with probability measure P and G is a σ-field on R. As per conventional notations, we denote X(ω) the realization of the variable X at point ω ∈ Ω.

The rigorous introduction of the probability space (Ω, F, P) is only necessary here for some details on proofs given in this chapter. When not mentioning either


Ω or the σ-field F in the rest of this book, it will be clear what implicit probability space is being referred to. In general, though, this formalism is unimportant. Also, unless necessary for clarity, we will in the following equally refer to X as the random variable and as its random realization X(ω) for ω ∈ Ω. The space R will often be taken to be either R or C, i.e. for all ω ∈ Ω, X(ω) ∈ R^{N×n} or X(ω) ∈ C^{N×n}.

We define the probability distribution of X to be µ_X, the joint probability distribution of the entries of X, such that, for A ∈ G^{N×n}
$$\mu_X(A) \triangleq P(\omega,\, X(\omega) \in A).$$
In most practical cases, µ_X will have a probability density function (p.d.f.) with respect to whatever measure on R^{N×n}, which we will denote P_X, i.e. for dY ∈ G an elementary volume around Y ∈ R^{N×n}
$$P_X(Y)\,dY \triangleq \mu_X(dY) = P(\omega,\, X(\omega) \in dY).$$

In order to differentiate the probability distribution function of real random variables X from that of multivariate entities, we will often denote p_X ≜ P_X in lowercase characters. For vector-valued real random variables (X_1, ..., X_N), we further denote P_{(X_1,...,X_N)}(x_1, ..., x_N) or P_{(X_i)}(x_1, ..., x_N) the density of the unordered values X_1, ..., X_N, P^{≤}_{(X_1,...,X_N)}(x_1, ..., x_N) or P^{≤}_{(X_i)}(x_1, ..., x_N) the density of the non-decreasing values X_1 ≤ ... ≤ X_N, and P^{≥}_{(X_1,...,X_N)}(x_1, ..., x_N) or P^{≥}_{(X_i)}(x_1, ..., x_N) the density of the non-increasing values X_1 ≥ ... ≥ X_N. The (cumulative) distribution function (d.f.) of a real random variable will often be denoted by the letters F, G, or H, e.g. for x ∈ R
$$F(x) \triangleq \mu_X((-\infty, x])$$

denotes the d.f. of X. We will in particular often consider the marginal probability distribution function of the eigenvalues of random Hermitian matrices X. Unless otherwise stated, the d.f. of the real eigenvalues of X will be denoted F^X.

Remark 2.1. As mentioned earlier, in most applications, the probability space Ω need not be defined but may have interesting interpretations. In wireless communications, if H ∈ C^{n_r × n_t} is a random MIMO channel between an n_t-antenna transmitter and an n_r-antenna receiver, Ω can be seen as the space of "possible environments," the elements ω of which being all valid snapshots of the physical world at a given instant. The random value H(ω) is therefore the realization of an instantaneous n_r × n_t multiple antenna propagation channel.

We will say that a sequence F_1, F_2, ... of real-supported distribution functions converges weakly to the function F if, for x ∈ R a continuity point of F
$$\lim_{n\to\infty} F_n(x) = F(x).$$


This will be denoted as

F n ⇒ F.

We will also say that a sequence x_1, x_2, ... of random variables converges almost surely to the constant x if
$$P\left(\lim_n x_n = x\right) = P\left(\omega,\, \lim_n x_n(\omega) = x\right) = 1.$$
The underlying probability space (Ω, F, P) here is assumed to be the space that generates the sequences x_1(ω), x_2(ω), ... (and not the individual entries), for ω ∈ Ω. This will be denoted
$$x_n \xrightarrow{\text{a.s.}} x.$$

We will in particular be interested in the almost sure weak convergence of distribution functions, i.e. in proving that, for some sequence F_1, F_2, ... of d.f. with weak limit F, there exists a space A ∈ F, such that P(A) = 1 and, for all x ∈ R, ω ∈ A implies
$$\lim_n F_n(x;\omega) = F(x)$$
with F_1(·;ω), F_2(·;ω), ... one realization of the d.f.-valued random sequence F_1, F_2, ....

Although this will rarely be used, we also mention the notation for convergence in probability of a sequence x_1, x_2, ... to x. A sequence x_1, x_2, ... is said to converge in probability to x if, for all ε > 0
$$\lim_n P\left(|x_n - x| > \varepsilon\right) = 0.$$

2.1.2 Wishart matrices

In the following section, we provide elementary results on the distribution of random matrices with Gaussian entries and of their eigenvalue distributions.

As mentioned in the Introduction, the very first random matrix considerations date back to 1928 [Wishart, 1928], with the expression of the p.d.f. of random matrices XX^H, for X ∈ C^{N×n} with columns x_1, ..., x_n ∼ CN(0, R). Such a matrix is called a (central) Wishart matrix.

Definition 2.2. The N×N random matrix XX^H is a (real or complex) central Wishart matrix with n degrees of freedom and covariance matrix R if the columns of the N×n matrix X are zero mean independent (real or complex) Gaussian vectors with covariance matrix R. This is denoted
$$XX^{\mathsf H} = XX^{\mathsf T} \sim \mathcal W_N(n, R)$$
for real Wishart matrices and
$$XX^{\mathsf H} \sim \mathcal{CW}_N(n, R)$$


for complex Wishart matrices.

Defining the Gram matrix associated with any complex matrix X as being the matrix XX^H, XX^H ∼ CW_N(n, R) is by definition the Gram matrix of a matrix with Gaussian i.i.d. columns of zero mean and covariance R. When R = I_N, it is usual to refer to X as a standard Gaussian matrix.

The interest of Wishart matrices lies primarily in the following remark.

Remark 2.2. Let x_1, ..., x_n ∈ C^N be n independent samples of the random process x_1 ∼ CN(0, R). Then, denoting X = [x_1, ..., x_n]
$$\sum_{i=1}^{n} x_i x_i^{\mathsf H} = XX^{\mathsf H}.$$

For this reason, the random matrix R_n ≜ (1/n) XX^H is often referred to as an (empirical) sample covariance matrix associated with the random process x_1. This is to be contrasted with the population covariance matrix R, whose relation to R_n was already evidenced as non-trivial when n and N are of the same order of magnitude. Of particular importance is the case when R = I_N. In this situation, XX^H, sometimes referred to as a zero (or null) Wishart matrix, is proportional to the sample covariance matrix of a white Gaussian process. The zero (or null) terminology is due to the signal processing problem of hypothesis testing, in which we have to decide whether the observed X emerges from a white noise process or from an information plus noise process. The noise hypothesis is often referred to as the null hypothesis.
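To fix ideas, the following short numerical sketch (not part of the original text; it only assumes numpy is available) builds the sample covariance matrix R_n of Remark 2.2 for a white complex Gaussian process (R = I_N) and illustrates that, when n and N are of the same order of magnitude, the eigenvalues of R_n spread widely around 1 instead of concentrating at the population eigenvalue.

import numpy as np

N, n = 64, 128                       # dimension and number of samples (N/n = 0.5)
rng = np.random.default_rng(0)
# columns x_1, ..., x_n are i.i.d. CN(0, I_N)
X = (rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))) / np.sqrt(2)
R_n = (X @ X.conj().T) / n           # R_n = (1/n) X X^H, the sample covariance matrix

eigs = np.linalg.eigvalsh(R_n)
print(eigs.min(), eigs.max())        # spread roughly over [(1-sqrt(N/n))^2, (1+sqrt(N/n))^2]
# whereas the population covariance R = I_N has all eigenvalues equal to 1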

Wishart provides us with the p.d.f. of Wishart matrices W in the space of non-negative definite matrices with elementary volume dW, as follows.

Theorem 2.1 ([Wishart, 1928]). The p.d.f. of the complex Wishart matrix XX^H ∼ CW_N(n, R), X ∈ C^{N×n}, in the space of N×N non-negative definite complex matrices, for n ≥ N, is
$$P_{XX^{\mathsf H}}(B) = \frac{\pi^{-\frac{1}{2}N(N-1)}}{\det(R)^{n}\prod_{i=1}^{N}(n-i)!}\, e^{-\mathrm{tr}(R^{-1}B)}\,\det(B)^{\,n-N}. \qquad (2.1)$$

Note in particular that, for N = 1, this is the distribution of a real random variable X with 2X chi-square-distributed with 2n degrees of freedom.

Proof. Since X is a Gaussian matrix of size N×n with zero mean and covariance R, we know that
$$P_X(A)\,dX = \frac{1}{\pi^{Nn}\det(R)^{n}}\, e^{-\mathrm{tr}(R^{-1}AA^{\mathsf H})}\, dX$$
which is the probability of an elementary volume dX around point A in the space of N×n complex matrices with measure dX ≜ ∏_{i,j} dX_{ij}, with X_{ij} the entry (i, j) of X. Now, to derive the probability P_{XX^H}(B) d(XX^H), it suffices to


operate a variable change between X and XX^H from the space of N×n complex matrices to the space of N×N non-negative definite complex matrices. This is obtained by Wishart in [Wishart, 1928], who shows that
$$dX = \frac{\pi^{-\frac{1}{2}N(N-1)+Nn}}{\prod_{i=1}^{N}(n-i)!}\,\det(XX^{\mathsf H})^{\,n-N}\, d(XX^{\mathsf H})$$
and therefore:
$$P_{XX^{\mathsf H}}(B)\, d(XX^{\mathsf H}) = \frac{\pi^{-\frac{1}{2}N(N-1)}}{\det(R)^{n}\prod_{i=1}^{N}(n-i)!}\, e^{-\mathrm{tr}(R^{-1}B)}\,\det(B)^{\,n-N}\, d(XX^{\mathsf H})$$
which is the intended formula.

In [Ratnarajah and Vaillancourt, 2005, Theorem 3], Ratnarajah and Vaillancourt extend the result from Wishart to the case of singular matrices, i.e. when n < N. This is given as follows.

Theorem 2.2 ([Ratnarajah and Vaillancourt, 2005]). The p.d.f. of the complex Wishart matrix XX^H ∼ CW_N(n, R), X ∈ C^{N×n}, in the space of N×N non-negative definite complex matrices of rank n, for n < N, is
$$P_{XX^{\mathsf H}}(B) = \frac{\pi^{-\frac{1}{2}N(N-1)+n(n-N)}}{\det(R)^{n}\prod_{i=1}^{n}(n-i)!}\, e^{-\mathrm{tr}(R^{-1}B)}\,\det(\Lambda)^{\,n-N}$$
with Λ ∈ C^{n×n} the diagonal matrix of the positive eigenvalues of B.

As already noticed from Equation (1.3) of the mutual information of MIMO communication channels and from the brief introduction of eigenvalue-based methods for detection and estimation, the center of interest of random matrices often lies in the distribution of their eigenvalues. For null Wishart matrices, notice that P_{XX^H}(B) = P_{XX^H}(UBU^H), for any unitary N×N matrix U.¹ Otherwise stated, the eigenvectors of the random variable R_N are uniformly distributed over the space U(N) of unitary N×N matrices. As such, the eigenvectors do not carry relevant information, and P_{XX^H}(B) is only a function of the eigenvalues of B.

1 We recall that a unitary matrix U ∈ C^{N×N} is such that UU^H = U^H U = I_N.

The joint p.d.f. of the eigenvalues of zero Wishart matrices was studied simultaneously in 1939 by different authors [Fisher, 1939; Girshick, 1939; Hsu, 1939; Roy, 1939]. The main two results are summarized in the following.

Theorem 2.3. Let the entries of X ∈ C^{N×n} be i.i.d. Gaussian with zero mean and unit variance. Denote m = min(n, N) and M = max(n, N). The joint p.d.f. P^{≥}_{(λ_i)} of the positive ordered eigenvalues λ_1 ≥ ... ≥ λ_m of the zero Wishart matrix XX^H is given by:

$$P^{\ge}_{(\lambda_i)}(\lambda_1,\ldots,\lambda_m) = e^{-\sum_{i=1}^{m}\lambda_i}\,\prod_{i=1}^{m}\frac{\lambda_i^{\,M-m}}{(m-i)!\,(M-i)!}\,\Delta(\Lambda)^2$$
where, for a Hermitian non-negative m×m matrix Λ,² Δ(Λ) denotes the Vandermonde determinant of its eigenvalues λ_1, ..., λ_m
$$\Delta(\Lambda) \triangleq \prod_{1\le i<j\le m}(\lambda_j - \lambda_i).$$
The marginal p.d.f. p_λ (≜ P_λ) of the unordered eigenvalues is
$$p_\lambda(\lambda) = \frac{1}{m}\sum_{k=0}^{m-1}\frac{k!}{(k+M-m)!}\left[L_k^{M-m}(\lambda)\right]^2 \lambda^{M-m}\, e^{-\lambda}$$
where the L_k^n are the Laguerre polynomials defined as
$$L_k^n(\lambda) = \frac{e^{\lambda}}{k!\,\lambda^{n}}\,\frac{d^{k}}{d\lambda^{k}}\left(e^{-\lambda}\lambda^{n+k}\right).$$
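As a quick numerical sanity check of the marginal density above (a sketch not taken from the original text; it assumes numpy and scipy are available), one can evaluate p_λ through scipy's generalized Laguerre polynomials and compare it with simulated eigenvalues of zero Wishart matrices:

import numpy as np
from scipy.special import genlaguerre, factorial

N, n, trials = 4, 8, 2000            # example sizes, so m = N and M = n here
m, M = min(N, n), max(N, n)

def p_lambda(lam):
    # marginal p.d.f. of an unordered eigenvalue of XX^H (Theorem 2.3)
    out = np.zeros_like(lam, dtype=float)
    for k in range(m):
        Lk = genlaguerre(k, M - m)(lam)                  # L_k^{M-m}(lambda)
        out += factorial(k) / factorial(k + M - m) * Lk**2
    return out * lam**(M - m) * np.exp(-lam) / m

rng = np.random.default_rng(0)
eigs = []
for _ in range(trials):
    X = (rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))) / np.sqrt(2)
    eigs.extend(np.linalg.eigvalsh(X @ X.conj().T))      # eigenvalues of the zero Wishart matrix

grid = np.linspace(0.05, 30, 600)
step = grid[1] - grid[0]
print(np.sum(p_lambda(grid)) * step)                     # close to 1: the density integrates to 1
print(np.mean(eigs), np.sum(grid * p_lambda(grid)) * step)  # empirical vs theoretical mean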

The generalized case of (non-zero) central Wishart matrices is more involved since it requires advanced tools of multivariate analysis, such as the fundamental Harish–Chandra integral [Chandra, 1957]. We will mention the result of Harish–Chandra, which is at the core of major results in signal sensing and channel modeling presented in Chapter 16 and Chapter 18, respectively.

Theorem 2.4 ([Chandra, 1957]). For non-singular N×N positive definite Hermitian matrices A and B of respective eigenvalues a_1, ..., a_N and b_1, ..., b_N, such that, for all i ≠ j, a_i ≠ a_j and b_i ≠ b_j, we have:
$$\int_{U\in U(N)} e^{\kappa\,\mathrm{tr}(AUBU^{\mathsf H})}\, dU = \left(\prod_{i=1}^{N-1} i!\right)\kappa^{-\frac{1}{2}N(N-1)}\,\frac{\det\left(\left\{e^{\kappa\, b_j a_i}\right\}_{1\le i,j\le N}\right)}{\Delta(A)\,\Delta(B)}$$
where, for any bivariate function f, {f(i, j)}_{1≤i,j≤N} denotes the N×N matrix of (i, j) entry f(i, j), U(N) is the space of N×N unitary matrices, and dU denotes the invariant measure on U(N) normalized to make the total measure unity.

The remark that dU is a normalized measure on U(N) arises from the fact that, contrary to spaces such as R^N, U(N) is compact. As such, it can be attached a measure dU such that ∫_{U(N)} dU = V for V the volume of U(N) under this measure. We arbitrarily take V = 1 here. In other publications, e.g., [Hiai and Petz, 2006], the authors choose other normalizations. As for the invariant measure terminology (called Haar measure in subsequent sections), it refers to the measure such that dU is a constant (taken to equal 1 here). In the rest of this section, we take this normalization.

2 We will respect the convention that x (be it a scalar or a Hermitian matrix) is non-negative if x ≥ 0, while x is positive if x > 0.

Theorem 2.4 enables the calculus of the marginal joint-eigenvalue distribution of (non-zero) central Wishart matrices [Itzykson and Zuber, 2006] given as:

Theorem 2.5 (Section 8.7 of [James, 1964]). Let the columns of X ∈ C^{N×n} be i.i.d. zero mean Gaussian with positive definite covariance R, and n ≥ N. The joint p.d.f. P^{≥}_{(λ_i)} of the ordered positive eigenvalues λ_1 ≥ ... ≥ λ_N of the central Wishart matrix XX^H reads:
$$P^{\ge}_{(\lambda_i)}(\lambda_1,\ldots,\lambda_N) = (-1)^{\frac{1}{2}N(N-1)}\,\frac{\det\left(\left\{e^{-\frac{\lambda_i}{r_j}}\right\}_{1\le i,j\le N}\right)}{\det(R)^{n}}\,\frac{\Delta(\Lambda)}{\Delta(R^{-1})}\,\prod_{j=1}^{N}\frac{\lambda_j^{\,n-N}}{(n-j)!}$$
with r_1 > ... > r_N > 0 the eigenvalues of R and Λ = diag(λ_1, ..., λ_N).

Proof. The idea behind the proof is first to move from the space of XX^H non-negative definite matrices to the product space of matrices (U, Λ), U being unitary and Λ diagonal with positive entries. For this, it suffices to notice from a Jacobian calculus that
$$d(XX^{\mathsf H}) = \frac{\pi^{\frac{1}{2}N(N-1)}}{\prod_{i=1}^{N} i!}\,\prod_{i<j}^{N}(\lambda_j - \lambda_i)^2\, d\Lambda\, dU$$
with XX^H = UΛU^H.³ It then suffices to integrate out the matrices U for fixed Λ from the probability distribution P_{XX^H}. This requires the Harish–Chandra formula, Theorem 2.4.

Noticing that the eigenvalue labeling is in fact irrelevant, we obtain the joint unordered eigenvalue distribution of central Wishart matrices as follows.

Theorem 2.6 ([James, 1964]). Let the columns of X ∈ C^{N×n} be i.i.d. zero mean Gaussian with positive definite covariance R and n ≥ N. The joint p.d.f. P_{(λ_i)} of the unordered positive eigenvalues λ_1, ..., λ_N of the central Wishart matrix XX^H reads:
$$P_{(\lambda_i)}(\lambda_1,\ldots,\lambda_N) = (-1)^{\frac{1}{2}N(N-1)}\,\frac{1}{N!}\,\frac{\det\left(\left\{e^{-\frac{\lambda_i}{r_j}}\right\}_{1\le i,j\le N}\right)}{\det(R)^{n}}\,\frac{\Delta(\Lambda)}{\Delta(R^{-1})}\,\prod_{j=1}^{N}\frac{\lambda_j^{\,n-N}}{(n-j)!}$$
with r_1 > ... > r_N > 0 the eigenvalues of R and Λ = diag(λ_1, ..., λ_N).

Note that it is necessary here that the eigenvalues of R be distinct. This is because the general expression of Theorem 2.6 involves integration over the unitary space which expresses in closed-form in this scenario thanks to the Harish–Chandra formula, Theorem 2.4.

3 It is often found in the literature that d(XX^H) = ∏_{i<j}^{N}(λ_j − λ_i)² dΛ dU. This results from another normalization of the invariant measure on U(N).

A similar expression for the singular case when n < N is provided in [Ratnarajah and Vaillancourt, 2005, Theorem 4], but the final expression is less convenient to express in closed-form as it features hypergeometric functions, which we introduce in the following. Other extensions of the central Wishart p.d.f. to non-central Wishart matrices, i.e. originating from a matrix with non-centered Gaussian entries, have also been considered, leading to results on the joint-entry distribution [Anderson, 1946], eigenvalue distribution [Jin et al., 2008], largest and lowest eigenvalue marginal distribution, condition number distribution [Ratnarajah et al., 2005a], etc. The results mentioned so far are of considerable importance when matrices of small dimension have to be considered in concrete applications. We presently introduce the result of [Ratnarajah and Vaillancourt, 2005, Theorem 4] that generalizes Theorem 2.6 and the result that concerns non-central complex Wishart matrices [James, 1964], extended in [Ratnarajah et al., 2005b]. For this, we need to introduce a few definitions.

Definition 2.3. For κ = (k_1, ..., k_N) with k_1 + ... + k_N = k for some (k, N), we denote for x ∈ C
$$[x]_\kappa = \prod_{i=1}^{N}(x - i + 1)_{k_i}$$
where (u)_k = u(u+1)...(u+k−1). We also denote, for X ∈ C^{N×N} with complex eigenvalues λ_1, ..., λ_N
$$C_\kappa(X) = \chi_{[\kappa]}(1)\,\chi_{[\kappa]}(X)$$
the complex zonal polynomial of X, where χ_{[κ]}(1) is given by:
$$\chi_{[\kappa]}(1) = k!\,\frac{\prod_{i<j}^{N}(k_i - k_j - i + j)}{\prod_{i=1}^{N}(k_i + N - i)!}$$
and χ_{[κ]}(X) reads:
$$\chi_{[\kappa]}(X) = \frac{\det\left(\left\{\lambda_i^{\,k_j + N - j}\right\}_{i,j}\right)}{\det\left(\left\{\lambda_i^{\,N-j}\right\}_{i,j}\right)}.$$
The hypergeometric function of a complex matrix X ∈ C^{N×N} is then defined by
$$\;_pF_q(a_1,\ldots,a_p;\, b_1,\ldots,b_q;\, X) = \sum_{k=0}^{\infty}\sum_{\kappa}\frac{[a_1]_\kappa \cdots [a_p]_\kappa}{[b_1]_\kappa \cdots [b_q]_\kappa}\,\frac{C_\kappa(X)}{k!}$$
where a_1, ..., a_p, b_1, ..., b_q ∈ C, Σ_κ is a summation over all partitions κ of k elements into N and [x]_κ = ∏_{i=1}^{N}(x − i + 1)_{k_i}.


We then define the hypergeometric function of two complex matrices of the same dimension X, Y ∈ C^{N×N} to be
$$\;_pF_q(X, Y) = \int_{U(N)} \;_pF_q(XUYU^{\mathsf H})\, dU.$$
We have in particular that
$$\;_0F_0(X) = e^{\mathrm{tr}\, X}$$
$$\;_1F_0(a;\, X) = \det(I_N - X)^{-a}.$$

From this observation, we remark that the Harish–Chandra formula, Theorem 2.4, actually says that
$$\;_0F_0(A, B) = \int_{U\in U(N)} \;_0F_0(AUBU^{\mathsf H})\, dU = \left(\prod_{i=1}^{N-1} i!\right)\frac{\det\left(\left\{e^{b_j a_i}\right\}_{1\le i,j\le N}\right)}{\Delta(A)\,\Delta(B)}$$
as long as the eigenvalues of A and B are all distinct. In particular, Theorem 2.6 can be rewritten in its full generality, i.e. without the constraint of distinctness of the eigenvalues of R, when replacing the determinant expressions by the associated hypergeometric function.

Further generalizations of hypergeometric functions of multiple Hermitian matrices, non-necessarily of the same dimensions, exist. In particular, for A_1, ..., A_p all symmetric with A_k ∈ C^{N_k×N_k}, we mention the definition, see, e.g., [Hanlen and Grant, 2003]
$$\;_0F_0(A_1, \ldots, A_p) = \sum_{k=0}^{\infty}\sum_{\kappa}\frac{C_\kappa(A_1)}{k!}\prod_{i=2}^{p}\frac{C_\kappa(A_i)}{C_\kappa(I_N)} \qquad (2.2)$$
with N = max_k N_k and the partitions κ have dimension min_k N_k. For A ∈ C^{N×N} and B ∈ C^{n×n}, n ≤ N, we therefore have the general expression of ₀F₀(A, B) as
$$\;_0F_0(A, B) = \sum_{k=0}^{\infty}\sum_{\kappa}\frac{C_\kappa(A)\, C_\kappa(B)}{k!\, C_\kappa(I_N)}. \qquad (2.3)$$

Sometimes, the notation ₀F₀^{(n)} is used where the superscript (n) indicates that the partitions κ are n-dimensional. For a deeper introduction to zonal polynomials and matrix-valued hypergeometric functions, see, e.g., [Muirhead, 1982].

The extension of Theorem 2.6 to the singular case is as follows.

Theorem 2.7 (Theorem 4 of [Ratnarajah and Vaillancourt, 2005]). Let the columns of X ∈ C^{N×n} be i.i.d. zero mean Gaussian with non-negative definite covariance R and n < N. The joint p.d.f. P_{(λ_i)} of the unordered positive eigenvalues λ_1, ..., λ_n of the central Wishart matrix XX^H reads:
$$P_{(\lambda_i)}(\lambda_1,\ldots,\lambda_n) = \Delta(\Lambda)^2\;_0F_0\!\left(-R^{-1}, \Lambda\right)\prod_{j=1}^{n}\frac{\lambda_j^{\,N-n}}{(n-j)!\,(N-j)!}$$
where Λ = diag(λ_1, ..., λ_n).

Note here that R^{-1} ∈ C^{N×N} and Λ ∈ C^{n×n}, n < N, so that ₀F₀(−R^{-1}, Λ) does not take the closed-form conveyed by the Harish–Chandra theorem, Theorem 2.4, but can be evaluated from Equation (2.3).

The result from James on non-central Wishart matrices is as follows.

Theorem 2.8 ([James, 1964]). If X ∈ C^{N×n} is a complex Gaussian matrix with mean M and invertible covariance Σ = (1/n) E[(X − M)(X − M)^H], then XX^H is distributed as
$$P_{XX^{\mathsf H}}(B) = \frac{\pi^{-\frac{1}{2}N(N-1)}}{\prod_{i=1}^{N}(n-i)!}\,\frac{\det(B)^{\,n-N}}{\det(\Sigma)^{n}}\, e^{-\mathrm{tr}\,\Sigma^{-1}(MM^{\mathsf H}+B)}\;_0F_1\!\left(n;\, \Sigma^{-1}MM^{\mathsf H}\Sigma^{-1}B\right).$$
Also, the density of the unordered eigenvalues λ_1, ..., λ_N of Σ^{-1}XX^H expresses as a function of the unordered eigenvalues m_1, ..., m_N of Σ^{-1}MM^H as
$$P_{(\lambda_i)}(\lambda_1,\ldots,\lambda_N) = \Delta\!\left(\Sigma^{-1}MM^{\mathsf H}\right)^2\, e^{-\sum_{i=1}^{N}(m_i+\lambda_i)}\,\prod_{i=1}^{N}\frac{\lambda_i^{\,n-N}}{(n-i)!\,(N-i)!}$$
if m_i ≠ m_j for all i ≠ j.

We complete this section by the introduction of an identity that allows us to extend results such as Theorem 2.6 to the case when an eigenvalue of R has multiplicity greater than one. As can be observed in the expression of the eigenvalue density in Theorem 2.6, equating population eigenvalues leads in general to bringing numerators and denominators to zero. To overcome this difficulty when some eigenvalues have multiplicities larger than one, we have the following result which naturally extends [Simon et al., 2006, Lemma 6].

Theorem 2.9 ([Couillet and Guillaud, 2011; Simon et al., 2006]). Let f_1, ..., f_N be a family of infinitely differentiable functions and let x_1, ..., x_N ∈ R. Denote
$$R(x_1, \ldots, x_N) \triangleq \frac{\det\left(\left\{f_i(x_j)\right\}_{i,j}\right)}{\prod_{i<j}(x_j - x_i)}.$$


which, after application of L'Hospital's rule, equals (upon existence of the limit) the limiting ratio between the numerator
$$\sum_{\sigma\in S_N}\mathrm{sgn}(\sigma)\left[f_{\sigma_1}'(y_1+\varepsilon)\,f_{\sigma_2}(y_1+2\varepsilon)\prod_{i=3}^{N}f_{\sigma_i}(x_i) + 2\,f_{\sigma_1}(y_1+\varepsilon)\,f_{\sigma_2}'(y_1+2\varepsilon)\prod_{i=3}^{N}f_{\sigma_i}(x_i)\right]$$
and the denominator
$$\prod_{i>j>2}(x_i-x_j)\prod_{i=3}^{N}(x_i-y_1-\varepsilon)(x_i-y_1-2\varepsilon) + \varepsilon\left[\prod_{i>j>2}(x_i-x_j)\prod_{i=3}^{N}(x_i-y_1-\varepsilon)(x_i-y_1-2\varepsilon)\right]'$$
as ε → 0, with S_N the set of permutations of {1, ..., N} whose elements σ = (σ_1, ..., σ_N) have signature (or parity) sgn(σ). Calling σ' the permutation such that σ'_1 = σ_2, σ'_2 = σ_1 and σ'_i = σ_i for i ≥ 3, taking the limit when ε → 0, we can reorder the sum in the numerator as
$$\sum_{(\sigma,\sigma')\subset S_N}\mathrm{sgn}(\sigma)\left[f_{\sigma_1}'(y_1)f_{\sigma_2}(y_1)\prod_{i=3}^{N}f_{\sigma_i}(x_i) + 2 f_{\sigma_1}(y_1)f_{\sigma_2}'(y_1)\prod_{i=3}^{N}f_{\sigma_i}(x_i)\right] + \mathrm{sgn}(\sigma')\left[f_{\sigma_1'}'(y_1)f_{\sigma_2'}(y_1)\prod_{i=3}^{N}f_{\sigma_i'}(x_i) + 2 f_{\sigma_1'}(y_1)f_{\sigma_2'}'(y_1)\prod_{i=3}^{N}f_{\sigma_i'}(x_i)\right].$$
But from the definition of σ', we have sgn(σ') = −sgn(σ) and then this becomes
$$\sum_{(\sigma,\sigma')\subset S_N}\mathrm{sgn}(\sigma)\left[-f_{\sigma_1}'(y_1)f_{\sigma_2}(y_1)\prod_{i=3}^{N}f_{\sigma_i}(x_i) + f_{\sigma_1}(y_1)f_{\sigma_2}'(y_1)\prod_{i=3}^{N}f_{\sigma_i}(x_i)\right]$$
or equivalently
$$\sum_{\sigma\in S_N}\mathrm{sgn}(\sigma)\,f_{\sigma_1}(y_1)\,f_{\sigma_2}'(y_1)\prod_{i=3}^{N}f_{\sigma_i}(x_i)$$
the determinant of the expected numerator. As for the denominator, it clearly also converges to the expected result, which finally proves Theorem 2.9 in this simple case.

As exemplified by the last results given in Theorem 2.7 and Theorem 2.8, the mathematical machinery required to obtain relevant results, already for such a simple case of correlated non-centered Gaussian matrices, is extremely involved. This is the main reason for the increasing attractiveness of large dimensional random matrices, which, as is discussed in the following, provide surprisingly simple results in comparison.


2.2 Large dimensional random matrices

2.2.1 Why go to infinity?

When random matrix problems relate to rather large dimensional matrices, it is often convenient to mentally assume that these matrices have in fact extremely large dimensions in order to blindly apply the limiting results for large matrices to the effective finite-case scenario. It turns out that this approximate technique is often stunningly precise and can be applied to approximate scenarios where the effective matrix under consideration is not larger than 8×8, and even sometimes 4×4 and 2×2. Before delving into the core of large dimensional random matrix theory, let us explain further what we mean by "assuming extremely large size and applying results to the finite size case." As was already evidenced, we are largely interested in the eigenvalue structure of random matrices and in particular in the marginal density of their eigenvalues.

Consider an N×N (non-necessarily random) Hermitian matrix T_N. Define its empirical spectral distribution (e.s.d.) F^{T_N} to be the distribution function of the eigenvalues of T_N, i.e. for x ∈ R
$$F^{T_N}(x) = \frac{1}{N}\sum_{j=1}^{N}\mathbf 1_{\{\lambda_j \le x\}}$$
where λ_1, ..., λ_N are the eigenvalues of T_N.⁴ For such problems as signal source detection from a multi-dimensional sensor (e.g. scanning multiple frequencies), T_N might be a diagonal matrix with K distinct eigenvalues, of multiplicity N/K each, representing the power transmitted by each source on N/K independent frequency bands. Assuming a random signal matrix S_N ∈ C^{N×n} with i.i.d. entries is transmitted by the joint-source (composed of the concatenation of N-dimensional signal vectors transmitted in n successive time instants), and N, n are both large, then a rather simple model is to assume that the sensor observes a matrix Y_N = T_N^{1/2} S_N from which it needs to infer the K distinct eigenvalues of T_N. This model is in fact unrealistic, as it should consider other factors such as the additive thermal noise, but we will stick here to this easier model for the sake of simplicity. Inferring the K eigenvalues of T_N is a hard problem in finite size random matrix theory, for which the maximum likelihood method boils down to a K-dimensional space search. If, however, we consider the hypothetical series Y_{pN} = T_{pN}^{1/2} S_{pN}, p ∈ N, with S_{pN} ∈ C^{pN×pn} whose entries follow the same distribution as those of S_N, and T_{pN} = T_N ⊗ I_p, with ⊗ the Kronecker product, then the system has just been made larger without impacting its structural ingredients. Indeed, we have just made the problem larger by growing N into pN and n into pn. The most important fact to observe here is that the distribution function of T_N has not been affected, i.e. F^{T_{pN}} = F^{T_N}, for each p. It turns out, as will be detailed in Chapter 16, that when p → ∞, F^{Y_{pN}Y_{pN}^H} has a deterministic weak limit and there exist computationally efficient techniques to retrieve the exact K distinct eigenvalues of T_N from the non-random limit of the e.s.d. lim_{p→∞} F^{Y_{pN}Y_{pN}^H}. Going back to the finite (but large enough) N case, we can apply the same deterministic techniques to the random F^{Y_N Y_N^H} instead of lim_{p→∞} F^{Y_{pN}Y_{pN}^H} to obtain a good approximation of the eigenvalues of T_N. These approximated values are consistent with a proportional growth of n and N, as they are almost surely exact when N and n tend to infinity with positive limiting ratio, and are therefore (n, N)-consistent.

4 The Hermitian property is fundamental to ensure that all eigenvalues of T_N belong to the real line. However, the extension of the e.s.d. to non-Hermitian matrices is sometimes required; for a definition, see (1.2.2) of [Bai and Silverstein, 2009].

This is basically how most random matrix results work: (i) we artificially let both n, N dimensions grow to infinity with constant ratio, (ii) very often, assuming large dimensions asymptotically leads to deterministic expressions, i.e. independent of the realization ω (at least for ω in a subset of Ω of probability one), which are simpler to manipulate, and (iii) we can then apply the deterministic results obtained in (ii) to the finite dimensional stochastic observation ω at hand and usually have a good approximation of the small dimensional matrix behavior. The fact that small dimensional matrices enjoy similar properties as their large dimensional counterparts makes the above approach extremely attractive, notably to wireless communication engineers, who often deal with not-so-large matrix models.

2.2.2 Limit spectral distributions

Let us now focus on large dimensional random matrices and abandon for now the practical applications, which are discussed in Part II. The relevant aspect of some classes of large N×N Hermitian random matrices X_N is that their (random) e.s.d. F^{X_N} converges, as N → ∞, towards a non-random distribution F. This function F, if it exists, will be called the limit spectral distribution (l.s.d.) of X_N. Weak convergence of F^{X_N} to F, i.e. F^{X_N}(x) − F(x) → 0 for all x where F is continuous, is often sufficient to obtain relevant results; this is denoted
$$F^{X_N} \Rightarrow F.$$
In most cases, though, the weak convergence of F^{X_N} to F will only be true on a set of matrices X_N = X_N(ω) of measure one. This will be mentioned with the phrase F^{X_N} ⇒ F almost surely.

We detail in the following the best-known examples of such convergence. The first result on limit spectral distributions is due to Wigner [Wigner, 1955, 1958], who establishes the convergence of the eigenvalue distribution of a particular case of the now-called Wigner matrices. In its generalized form [Arnold, 1967, 1971; Bai and Silverstein, 2009], this is:


Theorem 2.11 (Theorem 2.5 and Theorem 2.9 of [Bai and Silverstein, 2009]). Consider an N×N Hermitian matrix X_N, with independent entries (1/√N) X_{N,ij} such that E[X_{N,ij}] = 0, E[|X_{N,ij}|²] = 1 and there exists ε such that the X_{N,ij} have a moment of order 2 + ε. Then F^{X_N} ⇒ F almost surely, where F has density f defined as
$$f(x) = \frac{1}{2\pi}\sqrt{(4 - x^2)^+}. \qquad (2.4)$$
Moreover, if the X_{N,ij} are identically distributed, the result holds without the need for existence of a moment of order 2 + ε.
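As a quick numerical illustration of Theorem 2.11 (a sketch, not part of the original text; it assumes numpy only), the eigenvalue histogram of a single large Wigner matrix is already close to the density (2.4):

import numpy as np

N = 1000
rng = np.random.default_rng(1)
A = rng.standard_normal((N, N))
X = (A + A.T) / np.sqrt(2 * N)        # real symmetric, off-diagonal entries of variance 1/N
eigs = np.linalg.eigvalsh(X)

x = np.linspace(-2, 2, 101)
f = np.sqrt(np.maximum(4 - x**2, 0)) / (2 * np.pi)     # semi-circle density (2.4)
hist, edges = np.histogram(eigs, bins=40, range=(-2.2, 2.2), density=True)
centers = (edges[:-1] + edges[1:]) / 2
print(np.max(np.abs(hist - np.interp(centers, x, f))))  # small deviation from the semi-circle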

The l.s.d. F is the semi-circle law,⁵ depicted in Figure 1.2. The sketch of a proof of the semi-circle law, based on the method of moments (Section 30 of [Billingsley, 1995]), is presented in Section 5.1.

This result was then followed by limiting results on other types of matrices, such as the full circle law for non-symmetric random matrices, which is a largely more involved problem, starting with the fact that the eigenvalues of such matrices are no longer restricted to the real axis [Hwang, 1986; Mehta, 2004]. Although Girko was the first to provide a proof of this result, the most general result is due to Bai in 1997 [Bai, 1997].

Theorem 2.12. Let X_N ∈ C^{N×N} have i.i.d. entries (1/√N) X_{N,ij}, 1 ≤ i, j ≤ N, such that X_{N,11} has zero mean, unit variance and finite sixth order moment. Additionally, assume that the joint distribution of the real and imaginary parts of (1/√N) X_{N,11} has bounded density. Then, with probability one, the e.s.d. of X_N tends to the uniform distribution on the unit complex disc. This distribution is referred to as the circular law, or full circle law.
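A short numerical illustration of Theorem 2.12 (a sketch, not from the original text; numpy assumed): for a single realization, the eigenvalues fill the unit disc roughly uniformly, so the fraction falling inside radius r is close to the disc area r².

import numpy as np

N = 1000
rng = np.random.default_rng(2)
# i.i.d. complex Gaussian entries of variance 1/N
X = (rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))) / np.sqrt(2 * N)
eigs = np.linalg.eigvals(X)                    # complex eigenvalues, not restricted to the real axis

for r in (0.25, 0.5, 0.75, 1.0):
    print(r, np.mean(np.abs(eigs) <= r), r**2)  # empirical fraction vs uniform-disc prediction r^2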

The circular law is depicted in Figure 2.1.

Figure 2.1 Eigenvalues of X_N = (1/√N)(X_{ij}^{(N)})_{ij} with X_{ij}^{(N)} i.i.d. standard Gaussian, for N = 500, against the circular law.

Theorem 2.11 and Theorem 2.12 and the aforementioned results have important consequences in nuclear physics and statistical mechanics, largely documented in [Mehta, 2004]. In wireless communications, though, Theorem 2.11 and Theorem 2.12 are not the most fundamental and widely used results. Instead, in the wireless communications field, we are often interested in sample covariance matrices or even more general matrices such as i.i.d. matrices multiplied both on the left and on the right by deterministic matrices, or i.i.d. matrices with a variance profile, i.e. with independent entries of zero mean but different variances. Those matrices are treated in problems of detection, estimation and capacity evaluation which we will further discuss in Part II. The best known result with a large range of applications in telecommunications is the convergence of the e.s.d. of the Gram matrix of a random matrix with i.i.d. entries of zero mean and normalized variance. This result is due to Marcenko and Pastur [Marcenko and Pastur, 1967], so that the limiting e.s.d. of the Gram matrix is called the Marcenko–Pastur law. The result unfolds as follows.

5 Note that the semi-circle law sometimes refers, instead of F, to the density f, which is the "semi-circle"-shaped function.

Theorem 2.13. Consider a matrix X ∈ C^{N×n} with i.i.d. entries (1/√n) X_{N,ij}, such that X_{N,11} has zero mean and unit variance. As n, N → ∞ with N/n → c ∈ (0, ∞), the e.s.d. of R_n = XX^H converges weakly and almost surely to a non-random distribution function F_c with density f_c given by:
$$f_c(x) = (1 - c^{-1})^+\,\delta(x) + \frac{1}{2\pi c x}\sqrt{(x - a)^+ (b - x)^+} \qquad (2.5)$$
where a = (1 − √c)², b = (1 + √c)², and δ(x) = 1_{\{0\}}(x).

Note that, similar to the notation introduced previously, R_n is the sample covariance matrix associated with the random vector (X_{N,11}, ..., X_{N,N1})^T, with population covariance matrix I_N. The d.f. F_c is named the Marcenko–Pastur law with limiting ratio c.⁶ This is depicted in Figure 2.2 for different values of the limiting ratio c. Notice in particular that, as is expected from the discussion in the Preface, when c tends to be small and approaches zero, the Marcenko–Pastur law reduces to a single mass in 1. A proof of Theorem 2.13, which follows a Stieltjes transform-based method, is proposed in Section 3.2. Since the Marcenko–Pastur law has bounded support, it has moments of all orders, which are explicitly given in the following result.

Figure 2.2 Marcenko–Pastur law for different limit ratios c = lim N/n.

6 As with the semi-circle law, the Marcenko–Pastur law can also refer to the density f_c of F_c.

Theorem 2.14. Let F_c be the Marcenko–Pastur law with ratio c and with density f_c given by (2.5). The successive moments M_1, M_2, ... of F_c are given, for all integer k, by
$$M_k = \frac{1}{k}\sum_{i=0}^{k-1}\binom{k}{i}\binom{k}{i+1}c^{i}.$$
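The following sketch (not part of the original text; numpy assumed) checks Theorem 2.13 and Theorem 2.14 numerically: the empirical moments (1/N) tr(R_n^k) of a single large sample covariance matrix are close to the limiting expressions M_k above.

import numpy as np
from math import comb

N, n = 500, 1000
c = N / n
rng = np.random.default_rng(3)
# entries of X have variance 1/n, so R_n = XX^H is a sample covariance matrix
X = (rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))) / np.sqrt(2 * n)
eigs = np.linalg.eigvalsh(X @ X.conj().T)

def M(k, c):
    # moments of the Marcenko-Pastur law with ratio c (Theorem 2.14)
    return sum(comb(k, i) * comb(k, i + 1) * c**i for i in range(k)) / k

for k in range(1, 5):
    print(k, np.mean(eigs**k), M(k, c))        # empirical moment vs limiting moment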

Note that further generalizations of Theorem 2.13 have been provided in the literature, the most general form of which comes as follows.

Theorem 2.15 (Theorem 3.10 of [Bai and Silverstein, 2009]). Consider a matrix X ∈ C^{N×n} with entries (1/√n) X_{N,ij}, independent for all i, j, n and such that X_{N,ij} has zero mean, unit variance, and finite 2 + ε order moment (ε being independent of i, j, n). Then, as n, N → ∞ with N/n → c ∈ (0, ∞), the e.s.d. of R_n = XX^H converges almost surely to the Marcenko–Pastur law F_c, with density given by (2.5).

This last result goes beyond the initial Marcenko–Pastur identity as it does not assume identically distributed entries in X. However, the result would no longer stand if the variances of the entries were different, or were not independent. These scenarios are of relevance in wireless communications to model multi-dimensional channels with correlation or with a variance profile, for instance.


The above theorems characterize the eigenvalue distribution of Gram matrices of N×n matrices with i.i.d. entries. In terms of eigenvector distribution, though, not much is known. We know that the eigenvectors of a Wishart matrix, i.e. a matrix XX^H as above, where the entries are constrained to be Gaussian, are uniformly distributed on the unit sphere. That is, the eigenvectors do not point to any privileged direction. This fact is true for all finite dimension N. Now, obviously, this result does not stand for all finite N for random matrices with i.i.d. but non-Gaussian entries. Still, we clearly feel that in some sense the eigenvectors of XX^H must be "isotropically distributed in the limit." This is stated precisely for the real case under the following theorem, due to Silverstein [Silverstein, 1979, 1981, 1984].

Theorem 2.16. Let X ∈ R^{N×n} be random with i.i.d. real-valued entries of zero mean and all moments uniformly bounded. Denote U = [u_1, ..., u_N] ∈ R^{N×N}, with u_j the eigenvector of the jth largest eigenvalue of XX^T. Additionally, denote x ∈ R^N an arbitrary vector with unit Euclidean norm and y = (y_1, ..., y_N)^T the random vector defined by
$$y = Ux.$$
Then, as N, n → ∞ with limiting ratio N/n → c, 0 < c ≤ 1, for all t ∈ [0, 1]
$$\sum_{k=1}^{\lfloor tN\rfloor} y_k^2 \xrightarrow{\text{a.s.}} t$$
where ⌊x⌋ is the greatest integer smaller than x.
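Below is a small simulation of Theorem 2.16 (a sketch not present in the original text; numpy assumed, and Gaussian entries are used, for which U is exactly Haar distributed): the partial sums of the y_k² stay close to t uniformly in t.

import numpy as np

N, n = 400, 800
rng = np.random.default_rng(4)
X = rng.standard_normal((N, n))
w, U = np.linalg.eigh(X @ X.T)          # eigh returns eigenvalues in ascending order
U = U[:, ::-1]                           # reorder so u_j corresponds to the j-th largest eigenvalue

x = np.zeros(N); x[0] = 1.0              # an arbitrary unit-norm vector
y = U @ x                                # the random vector of Theorem 2.16
partial = np.cumsum(y**2)                # partial sums sum_{k <= tN} y_k^2
t = np.arange(1, N + 1) / N
print(np.max(np.abs(partial - t)))       # small, uniformly over t (order 1/sqrt(N))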

This result indicates some sort of uniformity in the distribution of the eigenvectors of XX^T. In [Silverstein, 1986], Silverstein extends Theorem 2.16 into a limit theorem of the fluctuations of the random process Σ_{k=1}^{⌊tN⌋} y_k², the weak limit being a Brownian bridge. Apart from these results, though, not much more is known about eigenvectors of random matrices. This subject has however recently gained some more interest and recent results will be given later in Chapter 8.

In the following chapters, we introduce the important tools known to this day to characterize the l.s.d. of a large range of random matrix classes. We start with the Stieltjes transform and provide a thorough proof of the Marcenko–Pastur law via the Stieltjes transform method. This proof will allow the reader to have a clear view on the building blocks required for deriving most results of large dimensional random matrix theory for wireless communications.


3 The Stieltjes transform method

This chapter is the first of three chapters dedicated to providing the reader with an overview of the most important tools used in problems related to large random matrices. These tools will form a strong basis for the reader to be able to appreciate the extensions discussed in the more technical Chapters 6–9. We first visit in this chapter the main results proved via the Stieltjes transform, to be defined subsequently. The Stieltjes transform tool is at first not very intuitive and not as simple as the moment-based methods developed later. For this reason, we start with a step-by-step proof of the Marcenko–Pastur law, Theorem 2.13, for large dimensional matrices with i.i.d. entries, before we can address more elaborate random matrix models with non-independent or not identically distributed entries. We will then introduce the Stieltjes transform related tools that are the R-transform and the S-transform, which bear interesting properties related to moments of the e.s.d. of some random matrix models. The R-transform and the S-transform, along with the free probability theory from which they originate, are the fundamental link between the Stieltjes transform and the moment-based methods. These are discussed thoroughly in Chapters 4–5.

3.1 Definitions and overview

To be able to handle the powerful tools that are the Stieltjes transform and moment-based methods, we start with some prior definitions of the Stieltjes transform and other related functionals, which will be often used in the subsequent chapters.

We first introduce the Stieltjes transform.

Definition 3.1. Let F be a real-valued bounded measurable function over R. Then the Stieltjes transform m_F(z) of F,¹ for z ∈ Supp(F)^c, the complex space complementary to the support of F,² is defined as
$$m_F(z) \triangleq \int_{-\infty}^{\infty}\frac{1}{\lambda - z}\, dF(\lambda). \qquad (3.1)$$

1 We borrow here the notation m from a large number of contributions by Bai, Silverstein et al. In other works, the notation s or S for the Stieltjes transform is used. However, in this work, the notation S will be reserved for the S-transform to be defined later.

For all F which admit a Stieltjes transform, the inverse transformation exists and formulates as follows.

Theorem 3.1. If x is a continuity point of F, then:
$$F(x) = \frac{1}{\pi}\lim_{y\to 0^+}\int_{-\infty}^{x}\Im\left[m_F(x + iy)\right]dx. \qquad (3.2)$$

Proof. Since the function t ↦ 1/(t − (x + iy)) is continuous and tends to zero as |t| → ∞, it has uniformly bounded norm on the support of F and we can then apply Tonelli's theorem, Theorem 3.16, and write
$$\frac{1}{\pi}\int_a^b \Im\left[m_F(x + iy)\right]dx = \frac{1}{\pi}\int_a^b\!\int \frac{y}{(t - x)^2 + y^2}\, dF(t)\, dx = \frac{1}{\pi}\int\!\int_a^b \frac{y}{(t - x)^2 + y^2}\, dx\, dF(t) = \frac{1}{\pi}\int\left[\tan^{-1}\!\left(\frac{b - t}{y}\right) - \tan^{-1}\!\left(\frac{a - t}{y}\right)\right]dF(t).$$
As y → 0⁺, this tends to ∫ 1_{[a,b]}(t)\,dF(t) = F(b) − F(a).

In all practical applications considered in this book, F will be a distribution function. Therefore, there exists an intimate link between distribution functions and their Stieltjes transforms. More precisely, if F_1 and F_2 are two distribution functions (therefore right-continuous by definition, see, e.g. Section 14 of [Billingsley, 1995]) that have the same Stieltjes transform, then F_1 and F_2 coincide everywhere, and the converse is true. As a consequence, m_F uniquely determines F and vice-versa. It will turn out that, while working on the distribution functions of the empirical eigenvalues of large random matrices is often a tedious task, the approach via Stieltjes transforms simplifies greatly the study. The initial intuition behind the Stieltjes transform approach for random matrices lies in the following remark. For a Hermitian matrix X ∈ C^{N×N}
$$m_{F^X}(z) = \int \frac{1}{\lambda - z}\, dF^X(\lambda) = \frac{1}{N}\,\mathrm{tr}\left(\Lambda - zI_N\right)^{-1} = \frac{1}{N}\,\mathrm{tr}\left(X - zI_N\right)^{-1}$$
in which we denoted Λ the diagonal matrix of eigenvalues of X. Working with the Stieltjes transform of F^X then boils down to working with the matrix (X − zI_N)^{-1}, and more specifically the sum of its diagonal entries. From matrix inversion lemmas and several fundamental matrix identities, it is then rather simple to derive limits of traces (1/N) tr(X − zI_N)^{-1} as N grows large, and therefore to derive a limit of the Stieltjes transform of F^X. For instance, in the case of large sample covariance matrices R_n, we will see that it is rather easy to show that m_{F^{R_n}} tends almost surely to a function m, which is itself the Stieltjes transform of a distribution function F. Thanks to Theorem 3.10, this will prove that F^{R_n} ⇒ F almost surely. For notational simplicity, we may denote m_X ≜ m_{F^X} the Stieltjes transform of the e.s.d. of the Hermitian matrix X and call m_X the Stieltjes transform of X.

2 We recall that the support Supp(F) of a d.f. F with density f is the closure of the set {x ∈ R, f(x) > 0}.
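The remark above translates directly into code. The following sketch (not from the original text; numpy assumed) computes m_{F^X}(z) of a sample covariance matrix both from its eigenvalues and as (1/N) tr(X − zI_N)^{-1}, and then uses the inversion formula (3.2) with a small imaginary part y to recover an approximate spectral density.

import numpy as np

N, n = 300, 600
rng = np.random.default_rng(5)
H = (rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))) / np.sqrt(2 * n)
R = H @ H.conj().T                        # Hermitian matrix whose e.s.d. we study
eigs = np.linalg.eigvalsh(R)

z = 1.5 + 0.1j
m_from_eigs = np.mean(1.0 / (eigs - z))                         # integral form of (3.1)
m_from_trace = np.trace(np.linalg.inv(R - z * np.eye(N))) / N    # resolvent trace form
print(m_from_eigs, m_from_trace)          # identical up to numerical precision

# Inversion formula (3.2): (1/pi) Im[m_F(x + iy)] approximates the density for small y > 0
y = 1e-2
x_grid = np.linspace(0.05, 3, 200)
density = np.array([np.mean(1.0 / (eigs - (x + 1j * y))).imag / np.pi for x in x_grid])
print(density.max())                      # a smoothed version of the spectral density of R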

An identity of particular interest is the relation between the Stieltjes transform of AB and BA when AB is Hermitian.

Lemma 3.1. Let A ∈ C^{N×n}, B ∈ C^{n×N}, such that AB is Hermitian. Then, for z ∈ C \ R
$$\frac{n}{N}\, m_{F^{BA}}(z) = m_{F^{AB}}(z) + \frac{N - n}{N}\,\frac{1}{z}.$$
Also, for X ∈ C^{N×n} and for z ∈ C \ R⁺, we have:
$$\frac{n}{N}\, m_{F^{X^{\mathsf H}X}}(z) = m_{F^{XX^{\mathsf H}}}(z) + \frac{N - n}{N}\,\frac{1}{z}.$$

The identity follows directly from the fact that both AB and BA have the same eigenvalues except for additional zero eigenvalues for the larger matrix. Hence, say n ≥ N, the larger matrix has N eigenvalues being the same as the eigenvalues of the smaller matrix, plus additional (n − N) eigenvalues equal to zero. Each one of the latter leads to the addition of a term 1/(0 − z) = −1/z, hence the identity.
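A short numerical check of Lemma 3.1 (a sketch, not in the original text; numpy assumed):

import numpy as np

N, n = 5, 8
rng = np.random.default_rng(6)
A = rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))
B = A.conj().T                                          # then AB = AA^H is Hermitian
z = 0.7 + 0.3j

m_AB = np.mean(1.0 / (np.linalg.eigvals(A @ B) - z))    # Stieltjes transform of F^{AB}
m_BA = np.mean(1.0 / (np.linalg.eigvals(B @ A) - z))    # Stieltjes transform of F^{BA}
print(n / N * m_BA, m_AB + (N - n) / N / z)             # the two sides of Lemma 3.1 coincide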

Also, we have the following trivial relation.

Lemma 3.2. Let X ∈ C^{N×N} be Hermitian and a be a non-zero real. Then, for z ∈ C \ R
$$m_{aX}(az) = \frac{1}{a}\, m_X(z).$$


This unfolds by noticing that
$$m_{aX}(az) = \int \frac{1}{at - az}\, dF^X(t) = \frac{1}{a}\int \frac{1}{t - z}\, dF^X(t).$$

For practical calculus in the derivations of the subsequent chapters, we need to introduce the following important properties for the Stieltjes transform of distribution functions, see, e.g., [Hachem et al., 2007].

Theorem 3.2. Let m_F be the Stieltjes transform of a distribution function F, then:
• m_F is analytic over C⁺,
• if z ∈ C⁺, then m_F(z) ∈ C⁺,
• if z ∈ C⁺, |m_F(z)| ≤ 1/ℑ[z] and ℑ[1/m_F(z)] ≤ −ℑ[z],
• if F(0⁻) = 0,³ then m_F is analytic over C \ R⁺. Moreover, z ∈ C⁺ implies z m_F(z) ∈ C⁺ and we have the inequalities
$$|m_F(z)| \le \begin{cases} \dfrac{1}{|\Im[z]|}, & z \in \mathbb C \setminus \mathbb R \\ \dfrac{1}{|z|}, & z < 0 \\ \dfrac{1}{\mathrm{dist}(z, \mathbb R^+)}, & z \in \mathbb C \setminus \mathbb R^+ \end{cases}$$
with dist the Euclidean distance.

Conversely, if m is a function analytical on C⁺ such that m(z) ∈ C⁺ if z ∈ C⁺ and
$$\lim_{y\to\infty} -iy\, m(iy) = 1 \qquad (3.3)$$
then m is the Stieltjes transform of a distribution function F given by
$$F(b) - F(a) = \lim_{y\to 0}\frac{1}{\pi}\int_a^b \Im\left[m(x + iy)\right]dx.$$
If, moreover, z m(z) ∈ C⁺ for z ∈ C⁺, then F(0⁻) = 0, in which case m has an analytic continuation on C \ R⁺.

The first inequalities will often be used when providing limiting results for some large dimensional random matrix models involving the Stieltjes transform. The converse results will be more rarely used, restricted mainly to some technical points in the proof of advanced results. Note that, if the limit in Equation (3.3) is finite but different from 1, then m(z) is said to be the Stieltjes transform of a finite measure on R⁺.

An interesting corollary of Theorem 3.2, which will often be reused in technical proofs, is the following.

3 We will denote F (0−) and F (0 + ) the limit of F (x) when x tends to zero from below or fromabove, respectively.

Page 63: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 63/562

3.1. Denitions and overview 39

Corollary 3.1. Let t > 0 and mF (z) be the Stieltjes transform of a distribution function F . Then, for z ∈C +

11 + tm F (z) ≤ |

z

|[z].

Proof. It suffices to realize here, from the properties of the Stieltjes transformgiven in the converse of Theorem 3.2, that

−1z(1 + tm F (z))

is the Stieltjes transform of some distribution function. It therefore unfolds, fromthe Stieltjes transform inequalities of Theorem 3.2, that

−1z(1 + tm F (z)) ≤ 1

[z].

A further corollary of Corollary 3.1 then reads:

Corollary 3.2. Let x ∈C N , t > 0 and A ∈C N ×N be Hermitian non-negative denite. Then, for z ∈C +

11 + tx H (A −zI N )−1 x ≤ |

z

|[z].

Proof. This unfolds by writing A = U H ΛU , the spectral decomposition of A ,with Λ = diag( λ1 , . . . , λ N ) ∈C N ×N diagonal and U ∈C N ×N . Denoting y = Uxand y = ( y1 , . . . , y N )T , we have:

x H (A −zI N )−1 x =i

|yi |2λ i −z

.

Under this notation, it is clear that the function f (z) = x H (A

−zI N )−1 x maps

C + to C + and that lim y→∞−iyf (iy) = j |yj |2 > 0. Therefore, up to a positivescaling factor, f (z) is the Stieltjes transform of a probability measure. We cantherefore use Corollary 3.2, which completes the proof.

In wireless communications, we are often interested in calculating the datatransmission rate achievable on a multi-dimensional N ×n communicationchannel H . We are therefore often led to evaluate functions in the form of ( 1.2).It turns out that the Stieltjes transform is directly connected to this expressionof mutual information through the so-called Shannon transform , initially coinedby Tulino and Verd´ u.

Denition 3.2 (Section 2.3.3 of [Tulino and Verd´u, 2004]). Let F be aprobability distribution dened on R + . The Shannon transform VF of F is

Page 64: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 64/562

40 3. The Stieltjes transform method

dened, for x ∈R + , as

VF (x)

0

log(1 + xλ )dF (λ). (3.4)

The Shannon transform of F is related to its Stieltjes transform mF throughthe expression

VF (x) = ∞1x

1t −mF (−t) dt = x

0

1t −

1t2 mF −

1t

dt. (3.5)

The expression in brackets in ( 1.2) is N times the right-hand side of ( 3.4) if F is chosen to be the e.s.d. of HH H . To evaluate ( 1.2), it is therefore sufficient toevaluate the Stieltjes transform of F . This is the very starting point of capacityevaluations using the Stieltjes transform.

Another important characteristic of the Stieltjes transform, both from atheoretical and a practical point of view, is its relationship to moments of theunderlying distribution. We have in particular the following result.

Theorem 3.3. If F has compact support included in [a, b], 0 < a < b < ∞, then, for z ∈C \ R such that |z| > b, mF (z) can be expanded in Laurent series as

mF (z) =

1

z

k=0

M k

zk (3.6)

where

M k = ∞

−∞λk dF (λ)

is the kth order moment of F .

By successive differentiations of zmF (−1/z ), we can recover the series of moments M 1 , M 2 , . . . of F ; the Stieltjes transform is then a moment generating

function (at least) for compactly supported probability distributions. If F is thee.s.d. of the Hermitian matrix X , then M k is also called the kth order moment of X . The above result provides therefore a link between the Stieltjes transform of X and the moments of X . The moments of random Hermitian matrices are in factof practical interest whenever direct usage of Stieltjes transform-based methodsare too difficult. Chapter 5 is dedicated to an account of these moment-basedconsiderations.

Before concluding this section, we introduce a few additional tools, all derivedfrom the Stieltjes transform, which have fundamental properties regarding themoments of Hermitian matrices. From a theoretical point of view, they helpbridge classical probability theory to free probability theory [Hiai and Petz , 2006;Voiculescu et al., 1992], to be introduced in Chapter 4. For a slightly moreexhaustive account of the most commonly used functionals of e.s.d. of large

Page 65: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 65/562

3.1. Denitions and overview 41

dimensional random matrices and their main properties, refer to [Tulino andVerd u, 2004].

We consider rst the R-transform , dened as follows.

Denition 3.3. Let F be a distribution function on R and let mF be its Stieltjestransform, then the R-transform of F , denoted RF , is such that

mF (RF (z) + z−1) = −z (3.7)

or equivalently

mF (z) = 1

RF (

−mF (z))

−z

. (3.8)

If F is associated with a probability measure µ, then Rµ will also denote theR-transform of F . Also, if F is the e.s.d. of the Hermitian matrix X , then RF

will be called the R-transform of X . The importance of the R-transform lies inTheorem 4.6, to be introduced in Chapter 4. Roughly speaking, the R-transformis the random matrix equivalent to the characteristic function of scalar randomvariables in the sense that, under appropriate conditions (independence is notsufficient) on the Hermitian random matrices A and B , the R-transform of thel.s.d. of A + B is the sum of the R-transforms of the l.s.d. of A and of the l.s.d.

of B (upon existence of these limits).Similar to the R-transform, which has additive moment properties, we denethe S -transform, which has product moment properties.

Denition 3.4. Let F be a distribution function on R and let mF be its Stieltjestransform, then the S -transform of F , denoted S F , satises

mF z + 1zS F (z)

= −zS F (z).

If F has probability measure µ, then S µ denotes also the S -transform of F .Under suitable conditions, the S -transform of the l.s.d. of a matrix product ABis the product of the S -transforms of the l.s.d. of A and the l.s.d. of B .

Both R- and S -transforms are particularly useful when dealing with matrixmodels involving unitary matrices; in particular, they will be used in Chapter 12to evaluate the capacity of networks using orthogonal CDMA communications.

We also introduce two alternative forms of the Stieltjes transform, namelythe η-transform and the ψ-transform, which are sometimes preferred over theStieltjes transform because they turn out to be more convenient for readability

in certain derivations, especially those derivations involving the R- and S -transform. The η-transform is dened as follows.

Page 66: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 66/562

42 3. The Stieltjes transform method

Denition 3.5. For F a distribution function with support in R + , we denethe η-transform of F , denoted ηF , to be the function dened for z ∈C \ R − as

ηF (z) 11 + zt dF (t).

As such, the η-transform can be expressed as a function of the Stieltjestransform as

ηF (z) = 1z

mF −1z

.

The ψ-transform is dened similarly in the following.

Denition 3.6. For F a distribution function with support in R + , we denethe ψ-transform of F , denoted ψF , to be the function dened for z ∈C \ R + as

ψF (z) zt1 −zt

dF (t).

Therefore, the ψ-transform can be written as a function of the Stieltjestransform as

ψF (z) = −1 − 1z

mF 1z

.

These tools are obviously totally equivalent and are used in place of theStieltjes transform only in order to simplify long derivations.

The next section introduces the proof of one of the pioneering fundamentalresults in large dimensional random matrix theory. This proof will demonstratethe power of the Stieltjes transform tool.

3.2 The Marcenko–Pastur law

As already mentioned, the Stieltjes transform was used by Marcenko and Pastur

in [Marcenko and Pastur, 1967] to derive the Marcenko–Pastur law of largedimensional Gram matrices of random matrices with i.i.d. entries. We start withsome reminders before providing the essential steps of the proof of the Marcenko–Pastur law and providing the complete proof.

We recall, from Theorem 3.1, that studying the Stieltjes transform of aHermitian matrix X is equivalent to studying the distribution function F X of the eigenvalues of X . The celebrated result that triggered the now extensive useof the Stieltjes transform is due to Marcenko and Pastur [Marcenko and Pastur,1967] on the limiting distribution of the e.s.d. of sample covariance matrices withidentity population covariance matrix. Although many different proofs exist bynow for this result, some using different approaches than the Stieltjes transformmethod, we will focus here on what we will later refer to as the the Marcenko–Pastur method. This method is both simple and pedagogical for it uses the

Page 67: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 67/562

3.2. The Marcenko–Pastur law 43

building blocks of the analytical aspect of large dimensional random matrixtheory in an elegant manner. We give in the following rst the key steps andthen the precise steps of this derivation. This proof will serve as grounds for thederivations of further results, which utilize mostly the same approach, and alsothe same lemmas and identities. In Chapter 6, we will discuss the drawbacksof the method as it fails to generalize to some more involved matrix models,especially when the e.s.d. of the large matrix under study does not converge.Less intuitive but more powerful methods will then be proposed, among whichthe Bai and Silverstein approach and the Gaussian methods .

We recall that the result we want to prove is the following. Let X ∈C N ×n bea matrix with i.i.d. entries 1√ n X N,ij , such that X N, 11 has zero mean and unit

variance. As n, N → ∞ with N n → c ∈ (0, ∞), the e.s.d. of R n XX H converges

almost surely to a non-random distribution function F c with density f c given by:

f c(x) = (1 −c−1)+ δ (x) + 12πcx (x −a)+ (b−x)+

where a = (1 −√ c)2 , b = (1 + √ c)2 and δ (x) = 1 0(x).Before providing the extended proof, let us outline the general derivation. The

idea behind the proof is to study the Stieltjes transform 1N tr( XX H

−zI N )−1 of F XX H

instead of F XX H

itself, and more precisely to study the diagonal entriesof the matrix ( XX H

−zI N )−1 , often called the resolvent of X . The main steps

consists of the following.

• It will rst be observed, through algebraic manipulations, involving matrixinversion lemmas, that the diagonal entry (1 , 1) of (XX H

−zI N )−1 can bewritten as a function of a quadratic form y H (Y H Y −zI n )−1y , with y the rstcolumn of X H and Y H dened as X H with column y removed. Precisely, wewill have

(XX H

−zI N )−111 =

1

−z −zy H (Y H Y −zI n )−1y.

• Due to the important trace lemma , which we will introduce and provebelow, the quadratic form y H (Y H Y −zI n )−1y can then be shown to beasymptotically very close to 1

n tr( Y H Y −zI n )−1 (asymptotically meaning herefor increasingly large N and almost surely). This is:

y H (Y H Y −zI n )−1y 1n

tr( Y H Y −zI n )−1

where we non-rigorously use the symbol “ ” to mean “almost surely equal inthe large N limit.”

• Another lemma, the rank- 1 perturbation lemma , will state that a perturbationof rank 1 of the matrix Y H Y , e.g. the addition of the rank-1 matrix yy H toY H Y , does not affect the value of 1

n tr( Y H Y −zI n )−1 in the large dimensionallimit. In particular, 1

n tr( Y H Y −zI n )−1 can be approximated in the large n

Page 68: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 68/562

44 3. The Stieltjes transform method

limit by 1n tr( X H X −zI n )−1 . With our non-rigorous formalism, this is:

1

n tr( Y H Y

−zI n )−1 1

n tr( X H X

−zI n )−1

an expression which is now independent of y , and in fact independent of thechoice of the column of X H , which is initially taken out. The same derivationtherefore holds true for any diagonal entry of ( XX H

−zI N )−1 .

• But then, we know that 1n tr( XX H

−zI N )−1 can be written as a function of 1n tr( X H X −zI n )−1 from Lemma 3.1. This is:

1n

tr( X H X −zI n )−1 = 1n

tr( XX H

−zI N )−1 + N −n

n1z

.

• It follows that each diagonal entry of ( XXH

−zI N )−1

can be written as afunction of 1n tr( XX H

−zI N )−1 itself. By summing up all N diagonal elementsand averaging by 1 /N , we end up with an approximated relation between1N tr( XX H

−zI N )−1 and itself

1N

tr XX H

−zI N −1 11 − N

n −z −z N n

1N tr ( XX H −zI N )−1

which is asymptotically exact almost surely.

• Since this appears to be a second order polynomial in 1N tr XX H

−zI N −1 ,this can be solved and we end up with an expression of the limiting Stieltjestransform of XX H . From Theorem 3.1, we nally nd the explicit form of thel.s.d. of F XX H

, i.e. the Marcenko–Pastur law.

We now proceed to the thorough proof of Theorem 2.13. For this, we restrictthe entries of the random matrix X to have nite eighth order moment. Theextension to the general case can be handled in several ways. We will mentionafter the proof the ideas behind a powerful technique known as the truncation,centralization, and rescaling method , due to Bai and Silverstein [Silverstein andBai, 1995], which allows us to work with truncated random variables, i.e. random

variables on a bounded support (therefore having moments of all orders), in placeof the actual random variables. It will be shown why working with the truncatedvariables is equivalent to working with the variables themselves and why thegeneral version of Theorem 2.13 unfolds.

3.2.1 Proof of the Marcenko–Pastur law

We wish this proof to be complete in the sense that all notions of random matrixtheory and probability theory tools are thoroughly detailed. As such, the proof contains many embedded lemmas, with sometimes further embedded results. Thereader may skip most of these secondary results for ease of read. Those lemmasare nonetheless essential to the understanding of the basics of random matrixtheory using the Stieltjes transform and deserve, as such, a lengthy explanation.

Page 69: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 69/562

3.2. The Marcenko–Pastur law 45

Let X ∈C N ×n be a random matrix with i.i.d. entries of zero mean, variance1/n , and nite eighth order moment, and denote R N = XX H . We start bysingling out the rst row y H

C 1×n of X , and we write

Xy H

Y.

Now, for z ∈C + , we have

(R N −zI N )−1 =y H y −z y H Y H

Yy YY H

−zI N −1

−1

(3.9)

the trace of which is the Stieltjes transform of F R N . Our interest being tocompute this Stieltjes transform, and then the sum of the diagonal elements

of the matrix in ( 3.9), we start by considering the entry (1 , 1). For this, we needa classical matrix inversion lemma

Lemma 3.3. Let A ∈C N ×N , D ∈C n ×n be invertible, and B ∈C N ×n , C ∈C n ×N . Then we have:

A BC D

−1

= (A −BD −1C )−1 −A −1B (D −CA −1B )−1

−(A −BD −1C )−1CA −1 (D −CA −1B )−1 . (3.10)

We apply Lemma 3.3 to the block matrix ( 3.9) to obtain the upper left entry

−z + y H I N −Y H YY H

−zI N −1 −1Y y −1

.

From the relation IN −A N (I N + B N A N )−1B N = ( I N + A N B N )−1 , thisfurther expresses as

−z −zy H (Y H Y −zI n )−1y −1.

We then have

(R N

−zI N )−1

11

= 1

−z −zyH

(YH

Y −zI n )−1y

. (3.11)

To go further, we need an additional result, proved initially in [Bai andSilverstein, 1998] , that we formulate in the following theorem.

Theorem 3.4. Let A 1 , A 2 , . . . , with A N ∈C N ×N , be a series of matrices with uniformly bounded spectral norm. Let x1 , x 2 , . . . , with x N ∈C N , be random vectors with i.i.d. entries of zero mean, variance 1/N , and eighth order moment of order O(1/N 4), independent of A N . Then

x H

N A

N x

N − 1

N tr A

N

a .s.

−→ 0 (3.12)

as N → ∞.

Page 70: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 70/562

46 3. The Stieltjes transform method

We mention besides that, in the case where the quantities involved are realand not complex, the entries of x N have fourth order moment of order O(1/N 2)and A N has l.s.d. A, a central limit of the variations of x T

N A N x N

− 1N tr A N is

proved in [Tse and Zeitouni, 2000] , i.e.

√ N x T

N A N x N − 1N

tr A N ⇒ Z (3.13)

as N → ∞,4 with Z ∼N (0, v), for some variance v depending on the l.s.d. F A

of A N and on the fourth moment of the entries of xN as follows.

v = 2 t2dF A (t) + (E[ x411 ]−3) tdF A (t)

2

.

Intuitively, taking for granted that x HN A N x N does have a limit, this limit must

coincide with the limit of E[ x HN A N x N ]. But for nite N

E[x HN A N x N ] =

N

i=1

N

j =1

AN,ij E[x∗N,i xN,j ]

= 1N

N

i =1

N

j =1

AN,ij δ ji

= 1N

tr A N

which is the expected result. We hereafter provide a rigorous proof of thealmost sure limit, which has the strong advantage to introduce very classicalprobability theoretic tools which will be of constant use in the detailed proofs of the important results of this book. This proof may be skipped in order not todisrupt the ow of the proof of the Marcenko–Pastur law.

Proof. We start by introducing the following two fundamental results of probability theory. The rst result is known as the Markov inequality .

Theorem 3.5 (Markov Inequality, (5.31) of [Billingsley, 1995] ). For X a real random variable, α > 0, we have for all integer k

P (ω, |X (ω)| ≥ α) ≤ 1α k E |X |k .

The second result is the rst Borel–Cantelli lemma .

Theorem 3.6 (First Borel–Cantelli Lemma, Theorem 4.3 in [Billingsley, 1995]).Let AN be F -sets of Ω. If N P (AN ) < ∞, then P (lim sup N AN ) = 0 . When AN has a limit, this implies that P (lim N AN ) = 0 .

The symbol lim sup N AN stands for the set

k≥0 n ≥k An . An element ω ∈ Ω

belongs to limsup N AN if, for all integer k, there exists N

≥ k such that ω

∈ AN ,

4 The notation X N ⇒ X for X N and X random variables, with distribution function F N andF , respectively, is equivalent to F N ⇒ F .

Page 71: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 71/562

3.2. The Marcenko–Pastur law 47

i.e. ω ∈ lim supN AN if ω belongs to innitely many sets AN . Informally, an eventAN such that P (lim sup N AN ) = 0 is an event that, with probability one, doesnot happen innitely often (denoted i.o.). The set lim sup N AN is sometimeswritten AN i.o.

The technique to prove Theorem 3.4 consists in nding an integer k such that

E x H

N A N x N − 1N

tr A N

k

≤ f N (3.14)

where f N is constant independent of both A N and xN , such that N f N < ∞.Then, for some ε > 0, from the Markov inequality, we have that

P (ω, Y N (ω) ≥ ε) ≤ 1εk E Y kN (3.15)

with Y N x HN A N x N −(1/N ) tr A N . Since the right-hand side of ( 3.15) is

summable, it follows from the rst Borel–Cantelli lemma, Theorem 3.6, that

P (ω, Y N (ω) ≥ ε i.o.) = 0 .

Since ε > 0 was arbitrary, the above is true for all rational ε > 0. Because thecountable union of sets of probability zero is still a set of probability zero (see

[Billingsley, 1995] ), we nally have that

P ( p,q )∈(

N∗) 2

ω, Y N (ω) ≥ pq

i.o. = 0 .

The complementary of the set in parentheses above satises: for all ( p, q )there exists N 0(ω) such that, for all N ≥ N 0(ω), |Y N (ω)| ≤ p

q . This set hasprobability one, and therefore Y N has limit zero with probability one, i.e.x H

N A N x N − 1N tr A N

a .s.

−→ 0. It therefore suffices to nd an integer k such that

(3.14) is satised.For k = 4, expanding x HN A N x N as a sum of terms xH

i xj Ai,j and distinguishingthe cases when i = j or i = j , we have:

E x HN A N x N −

1N

tr A N

4

≤ 8N 4

E N

i =1AN,ii (|xN,i |2 −1)

4

+ Ei= j

AN,ij x∗N,i xN,j

4

where the inequality comes from: |x + y|k ≤ (|x|+ |y|)k ≤ 2k−1(|x|k + |y|k ). Thelatter arises from the H¨ older’s inequality, which states that, for p,q > 0 suchthat 1 /p + 1 /q = 1, applied to two sets x1 , . . . , x N and y1 , . . . , y N [Billingsley,

Page 72: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 72/562

48 3. The Stieltjes transform method

1995]

N

n =1 |xn yn | ≤ N

n =1 |xn | p

1/p N

n =1 |yn |q

1/q

taking N = 2, x1 = x, x2 = y, y1 = y2 = 1 and p = 4, we have immediately theresult. Since the xi have nite eighth order moment (and therefore nite kthorder moment of all k ≤ 8) and that A N has uniformly bounded norm, all theterms in the rst sum are nite. Now, expanding the sum as a 4-fold sum, thenumber of terms that are non-identically zeros is of order O(N 2). The secondsum is treated identically, with an order of O(N 2) non-identically null terms.Therefore, along with the factor 1 /N 4 in front, there exists K > 0 independent

of N , such that the sum is less than K/N 2

. This is summable and we have provedTheorem 3.4, when the xk have nite eighth order moment.

Remark 3.1. Before carrying on the proof of the Marcenko–Pastur law, we takethe opportunity of the introduction of the trace lemma to mention the followingtwo additional results. The rst result is an extension of the trace lemma to thecharacterization of x H Ay for independent x , y vectors.

Theorem 3.7. For A 1 , A 2 , . . . , with A N ∈C N ×N with uniformly bounded spectral norm, x1 , x 2 , . . . and y1 , y 2 , . . . two series of i.i.d. variables such that xN

∈C N

and yN ∈C N have zero mean, variance 1/N , and fourth order moment of order O(1/N 2), we have:

x H

N A N y N a .s.

−→ 0.

Proof. The above unfolds simply by noticing that E |xHN A N y N |4 ≤ c/N 2 for some

constant c. We give below the precise derivation for this rather easy case.

E x HN A N y N

4 = E

i 1 ,...,i 4j 1 ,...,j 4

x∗i 1 x i 2 x∗i 3 x i 4 yj 1 y∗j 2 yj 3 y∗j 4 Ai 1 ,j 1 A∗i 2 ,j 2 Ai 3 ,j 3 A∗i 4 ,j 4 .

If one of the xi k or yj k appears an odd number of times in one of the terms of the sum, then the expectation of this term is zero. We therefore only accountfor terms xi k and yj k that appear two or four times. If i1 = i2 = i3 = i4 and j1 = j 2 = j 3 = j 4 , then:

E x∗i 1 x i 2 x∗i 3 x i 4 yj 1 y∗j 2 yj 3 y∗j 4 Ai 1 ,j 1 A∗i 2 ,j 2 Ai 3 ,j 3 A∗i 4 ,j 4 = 1N 4 |Ai 1 ,j 1 |4E[|x1|4]E[|y1|4]

= O(1/N 4).

Since there are as many as N 2 such congurations of i1 , . . . , i 4 and j1 , . . . , j 4 ,these terms contribute to an order O(1/N 2) in the end. If i1 = i2 = i3 = i4 and

Page 73: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 73/562

3.2. The Marcenko–Pastur law 49

j1 = j 3 = j 2 = j 4 , then:

E x∗i 1 x i 2 x∗i 3 x i 4 yj 1 y∗j 2 yj 3 y∗j 4 Ai 1 ,j 1 A∗i 2 ,j 2 Ai 3 ,j 3 A∗i 4 ,j 4 = 1N 4 |Ai 1 ,j 1 |2|Ai 1 ,j 3 |2E[|x1|4]

= O(1/N 4).

Noticing that i 1 ,j 1 ,j 3 |Ai 1 ,j 1 |2|Ai 1 ,j 3 |2 = i 1 ,j 1 |Ai 1 ,j 1 |2( j 3 |Ai 1 ,j 3 |2), and thati,j |Ai,j |2 = tr A N A

HN = O(N ) from the bounded norm condition on A N , we

nally have that the sum over all possible i1 and j1 = j 3 is of order O(1/N 2). Thesame is true for the combination j1 = j 2 = j 3 = j 4 and i1 = i3 = i2 = i4 , whichresults in a term of order O(1/N 2). It remains the case when i1 = i3 = i2 = i4

and j1 = j 3 = j 2 = j 4 . This leads to the terms

E x∗i 1 x i 2 x∗i 3 x i 4 yj 1 y∗j 2 yj 3 y∗j 4 Ai 1 ,j 1 A∗i 2 ,j 2 Ai 3 ,j 3 A∗i 4 ,j 4 = 1

N 4

|Ai 1 ,j 1

|2

|Ai 2 ,j 3

|2

= O(1/N 4).

Noticing that i 1 ,i 3 ,j 1 ,j 3 |Ai 1 ,j 1 |2|Ai 2 ,j 3 |2 = i 1 ,j 1 |Ai 1 ,j 1 |2( i 3 ,j 3 |Ai 1 ,j 3 |2) fromthe same argument as above, we have that the last term is also of order O(1/N 2).Therefore, the total expected sum is of order O(1/N 2). The Markov inequalityand the Borel–Cantelli lemma give the nal result.

The second result is a generalized version of the fourth order momentinequality that led to the proof of the trace lemma above, when the entriesof x

N = ( x

N, 1, . . . , x

N,N )T have moments of all orders. This result unfolds from

the same combinatorics calculus as presented in the proof of Theorem 3.7.

Theorem 3.8 (Lemma B.26 of [Bai and Silverstein, 2009] ). Under the conditions of Theorem 3.4, if for all N, k , E[|√ Nx N,k |m ] ≤ ν m , then, for all p ≥ 1

E x H

N A N x N − 1N

tr A N

p

≤ C pN p

(ν 4 tr( AA H ))p2 + ν 2 p tr( AA H )

p2

for C p a constant depending only on p.

Returning to the proof of the Marcenko–Pastur law, in ( 3.11), y ∈C n isextracted from X , which has independent columns, so that y is independentof Y . Besides, y has i.i.d. entries of variance 1 /n . For large n, we therefore have

(R N −zI N )−1

11 − 1

−z −z 1n tr( Y H Y −zI n )−1

a .s.

−→ 0

where the convergence is ensured by verifying that the denominator of thedifference has imaginary part uniformly away from zero.

We feel at this point that, for large n, the normalized trace 1n tr( Y H Y

−zI n )−1

should not be much different from 1n tr( X H X −zI n )−1 , since the difference

between both matrices here is merely the rank-1 matrix yy H . This is formalizedin the following second theorem.

Page 74: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 74/562

50 3. The Stieltjes transform method

Theorem 3.9 ([Silverstein and Bai, 1995] ). For z ∈C \ R + , we have the following quadratic form identities.

(i) Let z ∈C \ R , A ∈CN

×N

, B ∈CN

×N

with B Hermitian, and v ∈CN

. Then

tr (B −zI N )−1 −(B + vv H

−zI N )−1 A ≤ A

| [z]|with A the spectral norm of A .

(ii) Moreover, if B is non-negative denite, for z ∈R −

tr (B −zI N )−1 −(B + vv H

−zI N )−1 A ≤ A

|z| .

This theorem can be further rened for z ∈R +

. Generally speaking, it isimportant to take z away from the support of the eigenvalues of B and B + vv H

to make sure that both matrices remain invertible. With z purely complex, theposition of the eigenvalues of B and B + vv H on the real line does not matter,and similarly for B non-negative denite and real z < 0, the position of theeigenvalues of B does not matter.

In the present situation, A = I n and therefore, irrespective of the actualproperties of y (it might even be a degenerated vector with large Euclideannorm), we have

1n tr( Y H Y −zI n )−1 −mF X H X (z)

= 1n

tr( Y H Y −zI n )−1 − 1n

tr( X H X −zI n )−1

= 1n

tr( Y H Y −zI n )−1 − 1n

tr( Y H Y + yy H

−zI n )−1

→ 0

and therefore:

(R N

−zI N )−1

11 − 1

−z −zm F X H X (z)

a .s.

−→ 0.

By the denition of the Stieltjes transform, since the non-zero eigenvalues of R N = XX H and X H X are the same, we have from Lemma 3.1

mF X H X (z) = N n

mF R N (z) + N −n

n1z

(3.16)

which leads to

(R N −zI N )−1

11 − 1

1 − N n −z −z N

n mF R N (z)a .s.

−→ 0. (3.17)

The second term in the difference is independent on the initial choice of theentry (1 , 1) in (R N −zI N )−1 . Due to the symmetric structure of X , the resultis also true for all diagonal entries ( i, i ), i = 1, . . . , N . Summing them up and

Page 75: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 75/562

3.2. The Marcenko–Pastur law 51

averaging, we conclude that 5

mF R N (z)

− 1

1 − N

n −z −zN n mF R N (z)

a .s.

−→ 0. (3.18)

Take R 1 , R 2 , . . . a particular sequence for which ( 3.18) is veried (suchsequences lie in a space of probability one). Since mF R N (z) ≤ 1/ [z] fromTheorem 3.2, the sequence mF R 1 (z), m F R 2 (z), . . . is uniformly bounded in acompact set. Consider now any subsequence mF

R ψ (1) (z), m F R ψ (2) (z), . . . ; along

this subsequence, ( 3.18) is still valid. Since mF R ψ ( n ) is uniformly bounded

from above, we can select a further subsequence mF R φ ( ψ (1)) , m F

R φ ( ψ (2)) , . . . of mF

R ψ (1) (z), m F R ψ (2) (z), . . . which converges (this is an immediate consequence

of the Bolzano–Weierstrass theorem). Its limit, call it m(z; φ, ψ) is still a Stieltjestransform, as can be veried from Theorem 3.2, and is one solution of the implicitequation in m

m = 1

1 −c −z −zcm. (3.19)

The form of the implicit Equation ( 3.19) is often the best we can obtain moreinvolved models than i.i.d. X matrices. It will indeed often turn out that noexplicit equation for the limiting Stieltjes transform mF (which we have not yetproved exist) will be available. Additional tools will then be required to ensurethat (3.19) admits either (i) a unique scalar solution, when seen as an equationin the dummy variable m, or (ii) a unique functional solution, when seen as anequation in the dummy function-variable m of z. The difference is technicallyimportant for practical applications. Indeed, if we need to recover the Stieltjestransform of a d.f. F (for instance to evaluate its associated Shannon transform),it is important to know whether a classical xed-point algorithm is expected toconverge to solutions of ( 3.19) other than the desired solution. This will bediscussed further later in this section.

For the problem at hand, though, ( 3.19) can be rewritten as the second order

polynomial m(1 −c −z −zcm ) = 0 in the variable m, a unique root of whichis the Stieltjes transform of a distribution function taken at z. This limit isof course independent of the choice of φ and ψ. Therefore, any subsequence of mF R 1 (z), m F R 2 (z), . . . admits a further subsequence, which converges to somevalue mF (z), which is the Stieltjes transform of some distribution functionF . Therefore, from the semi-converse of the Bolzano–Weierstrass theorem,mF R 1 (z), m F R 2 (z), . . . converges to mF (z). The latter is given explicitly by

mF (z) = 1−c

2cz − 12c − (1 −c −z)2 −4cz

2cz (3.20)

5 We use here the fact that the intersection of countably many sets of probability one on whichthe result holds is itself of probability one.

Page 76: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 76/562

52 3. The Stieltjes transform method

where the branch of (1 −c −z)2 −4cz is chosen such that mF (z) ∈C + forz ∈C + , mF (z) ∈C − for z ∈C − and mF (z) > 0 for z < 0.6

Using the inverse-Stieltjes transform formula ( 3.2), we then verify that mF R N

has a limit mF , the Stieltjes transform of the Marcenko–Pastur law, with density

F (x) = (1 −c−1)+ δ (x) + 12πcx (x −a)+ (b−x)+

where a = (1 −√ c)2 and b = (1 + √ c)2 . The term 12x (x −a)+ (b−x)+ is

obtained by computing lim y→0 mF (x + iy), and taking its imaginary part. Thecoefficient c is then retrieved from the fact that we know rst how many zeroeigenvalues should be added and second that the density should integrate to 1.

To prove that the almost sure convergence of mF R N (z) a.s.

−→ mF (z) induces the

weak convergence F R N

⇒ F with probability one, we nally need the followingtheorem.

Theorem 3.10 (Theorem B.9 of [Bai and Silverstein , 2009]). Let F N be a set of bounded real functions such that limx→−∞F N (x) = 0 . Then, for all z ∈C +

limN →∞

mF N (z) = mF (z)

if and only if there exists F such that limx→−∞F (x) = 0 and |F N (x) −F (x)| → 0 for all x ∈R .

Proof. For z ∈C + , the function f : (x, z ) → 1z−x is continuous and tends

to zero when |x| → ∞. Therefore, |F N (x) −F (x)| → 0 for all x ∈R impliesthat 1

z−x d(F N −F )(x) → 0. Conversely, from the inverse Stieltjes transformformula ( 3.2), for a, b continuity points of F , |(F N (b) −F N (a)) −(F (b) −F (a)) | ≤ 1

π limy→0+ ba |(mN −m)(x + iy)|dx, which tends to zero as N grows

large. Therefore, we have:

|F N (x) −F (x)| ≤ |(F N (x) −F (x)) −(F N (a) −F (a)) |+ |F N (a) −F (a)|which tends to zero as we take, e.g. a = N and N → ∞ (since both F N (a) → 1and F (a) → 1).

The sure convergence of the Stieltjes transform mF N to mF therefore ensuresthat F N ⇒ F and conversely. This is the one theorem, along with the inversionformula ( 3.2), that fully justies the usage of the Stieltjes transform to study theconvergence of probability measures.

Back to the proof of the Marcenko–Pastur law, we have up to now proved that

mF R N (z) −mF (z) a .s.

−→ 0

6 We use a minus sign here in front of √ (1 −c−z ) 2 −4cz

2cz for coherence with the principal squareroot branch when z < 0.

Page 77: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 77/562

3.2. The Marcenko–Pastur law 53

for some initially xed z ∈C + . That is, there exists a subspace C z ∈F , with(Ω, F , P ) the probability space generating the series R 1 , R 2 , . . . , such thatP (C z ) = 1 for which ω

∈ C z implies that mF R N ( ω ) (z)

−mF (z)

→ 0. Since we

want to prove the almost sure weak convergence of F R N to the Marcenko–Pasturlaw, we need to prove that the convergence of the Stieltjes transform holds for all z ∈C \ R + on a common space of probability one. We then need to showthat there exists C ∈F , with P (C ) = 1 such that, for all z ∈C + , ω ∈ C impliesthat mF R N ( ω ) (z) −mF (z) → 0. This requires to use Vitali’s convergence theorem[Titchmarsh, 1939] on a countable subset of C + (or alternatively, Montel’stheorem).

Theorem 3.11. Let f 1 , f 2 , . . . be a sequence of functions, analytic on a region

D ⊂C

, such that |f n (z)| ≤ M uniformly on n and z ∈ D . Further, assume that f n (zj ) converges for a countable set z1 , z2 , . . . ∈ D having a limit point inside D .Then f n (z) converges uniformly in any region bounded by a contour interior toD . This limit is furthermore an analytic function of z.

The convergence mF R N (z) −mF (z) a.s.

−→ 0 is valid for any z inside a boundedregion of C + . Take countably many z1 , z2 , . . . having a limit point in somecompact region of C + . For each i, we have mF R N ( ω ) (zi ) −mF (zi )

a .s.

−→ 0 forω ∈ C z i , some set with P (C z i ) = 1. The set C =

i C z i ∈F over which the

convergence holds for all zi has probability one, as the countable union of setsof probability one. As a consequence, for ω ∈ C , mF R N ( ω ) (ω) (zi ) −mF (zi ) → 0for any i. From Vitali’s convergence theorem, since mF R N (z) −mF (z) is clearlyan analytic function of z, this holds true uniformly in all sets interior to regionswhere mF R N (z) and mF (z) are uniformly bounded, i.e. in all regions that excludethe real positive half-line. From Theorem 3.10, F R N (ω) (x) −F (x) → 0 for allx ∈R and for all ω ∈ C , so that, for ω ∈ C , F R N (ω) (x) −F (x) → 0. This ensuresthat F R N

⇒ F almost surely. This proves the almost sure weak convergence of the e.s.d. of R N to the Marcenko-Pastur law.

The reason why the proof above constrains the entries of X to have nite

eighth order moment is due to the trace lemma, Theorem 3.4, which is onlyproved to work with this eighth order moment assumption. We give below ageneralization of Theorem 3.4 when the random variables under considerationare uniformly bounded in some sense, no longer requiring the nite eighth ordermoment assumption. We will then present the truncation, centralization, and rescaling steps, which allow us to prove the general version of Theorem 2.13.These steps consist in replacing the variables X ij by truncated versions of thesevariables, i.e. replacing X ij by zero whenever |X ij | exceeds some predenedthreshold. The centralization and centering steps are then used to recenter themodied X ij around its mean and to preserve its variance. The main objectivehere is to replace variables, which may not have bounded moments, by modiedversions of these variables that have moments of all orders. The surprising resultis that it is asymptotically equivalent to consider X ij or their altered versions;

Page 78: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 78/562

54 3. The Stieltjes transform method

as a consequence, in many practical situations, it is unnecessary to make anyassumptions on the existence of any moments of order higher than 2 for X ij . Wewill subsequently show how this operates and why it is sufficient to work withsupportly compacted random X ij variables.

3.2.2 Truncation, centralization, and rescaling

A convenient trace lemma for truncated variables can be given as follows.

Theorem 3.12. Let A 1 , A 2 , . . ., A N ∈C N ×N , be a series of matrices of growing sizes and x 1 , x 2 , . . ., xN ∈C N , be random vectors with i.i.d. entries bounded by N −1

2 logN , with zero mean and variance 1/N , independent of A N .

Then

E x H

N A N x N − 1N

tr A N

6

≤ K A N 6 log12 N

N 3

for some constant K independent of N .

The sixth order moment here is upper bounded by a bound on the values of the entries of the xk instead of a bound on the moments. This alleviates theconsideration of the existence of any moment on the entries. Note that a similar

result for the fourth order moment also exists that is in general sufficient forpractical purposes, but, since going to higher order moments is now immaterial,this result is slightly stronger. The proof of Theorem 3.12 unfolds from the samemanipulations as for Theorem 3.4.

Obviously, applying the Markov inequality, Theorem 3.5, and the Borel–Cantelli lemma, Theorem 3.6, the result above implies the almost sureconvergence of the difference x H

N A N x N − 1N tr A N to zero. A second obvious

remark is that, if the elements of √ N x N are bounded by some constant C instead of log( N ), the result still holds true.

The following explanations follow precisely Section 3.1.3 in [Bai andSilverstein, 2009] . We start by the introduction of two important lemmas.

Lemma 3.4 (Corollary A.41 in [Bai and Silverstein, 2009 ]). For A ∈C N ×n and B ∈C N ×n

L4 F AA H

, F BB H

≤ 2N

tr( AA H + BB H ) 1N

tr([A −B ][A −B ]H )

where L(F, G ) is the Levy distance between F and G, given by:

L(F, G ) inf

ε,

x

∈R , F (x

−ε)

−ε

≤ G(x)

≤ F (x + ε) + ε

.

The Levy distance can be thought of as the length of the side of the largestsquare that can t between the functions F and G. Of importance to us presently

Page 79: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 79/562

3.2. The Marcenko–Pastur law 55

is the property that, for a sequence F 1 , F 2 , . . . of d.f., L(F N , F ) → 0 implies theweak convergence of F N to F .

The second lemma is a rank inequality.

Lemma 3.5 (Theorem A.44 in [Bai and Silverstein, 2009] ). For A ∈C N ×n and B ∈C N ×n

F AA H

−F BB H

≤ 1N

rank( A −B )

with f supx |f (x)|.We take the opportunity of the introduction of this rank inequality to mention

also the following useful result.

Lemma 3.6 (Lemma 2.21 and Lemma 2.23 in [Tulino and Verd´ u, 2004]). For A , B ∈C N ×N Hermitian

F A −F B ≤ 1N

rank( A −B ).

Also, denoting λX1 ≤ . . . ≤ λX

N the ordered eigenvalues of the Hermitian X

1N

N

i=1(λA

i −λBi )2 ≤

1N

tr( A −B )2 .

Returning to our original problem, let C be a xed positive real number. Thetruncation step consists rst of generating a matrix X with entries 1√ n X N,ij

dened as a function of the entries 1√ N X N,ij of X as follows.

X N,ij = X N,ij 1|X N,ij |<C (X N,ij ).

This is the truncation step that cuts off the tail of the distribution of X 11 .Now, since the distribution of X 11 is not necessarily centered around its mean,we recenter it as follows. We create a further matrix X with entries 1√ N X N,ij

such thatX N,ij = X N,ij −E[ X N,ij ]. (3.21)

This is the centralization step.The remaining problem is that the random variable X 11 has lost through this

process some of its weight in the cut tails. So we need to further rescale theresulting variable. For this, we create the variable X , with entries X N,ij denedas

X N,ij = 1σ(C )

X N,ij

with σ(C ) dened as

σ(C )2 = E[ | X N,ij |].

Page 80: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 80/562

56 3. The Stieltjes transform method

The idea now is to show that the limiting distribution of F R N is the samewhether we use the i.i.d. entries X N,ij or their truncated, centered, and rescaledversions X N,ij . If this is so, it is equivalent to work with the X N,ij or with X N,ij

in order to derive the Marcenko–Pastur law, with the strong advantage that inthe truncation process above no moment assumption was required. Therefore,if we can prove that F R N converges to the Marcenko–Pastur law with X N,ij

replaced by X N,ij , then we prove the convergence of F R N for all distributionsof X N,ij without any moment constraint of higher order than 2. This last resultis straightforward as it simply requires to go through every step of the proof of the Marcenko–Pastur law and replace every call to the trace lemma, Theorem3.4, by the updated trace lemma, Theorem 3.12. The remainder of this sectionis dedicated to proving that the limiting spectrum of R N remains the same if

the X N,ij are replaced by ¯X N,ij .We have from Lemma 3.4

L4 F XX H

, F X X H

≤ 21

Nn i,j|X N,ij |2 + | X N,ij |2

1Nn i,j

|X N,ij − X N,ij |2

≤ 41

Nni,j

|X N,ij |21

Nni,j

|X N,ij |21|X N,ij |>C (X N,ij ) .

From the law of large numbers, both terms in the right-hand side tend to their

means almost surely, i.e.

1Nn

i,j|X N,ij |2

1Nn

i,j|X N,ij |21|X N,ij |>C (X N,ij )

a .s.

−→ E |X N, 11 |21|X N, 11 |>C (X N,ij ) .

Notice that this goes to zero as C grows large (since the second order momentof X N, 11 exists). Now, from Lemma 3.5 and Equation ( 3.21), we also have

F X X H

−F X X H

≤ 1

N rank(E[ X ]) =

1

N .

This is a direct consequence of the fact that the entries are i.i.d. and thereforeE[ X N,ij ] = E[ X N, 11 ], entailing that the matrix composed of the E[ X N,ij ] has unitrank. The right-hand side of the inequality goes to zero as N grows large.

Finally, from Lemma 3.4 again

L4 F X X H

−F X X H

≤ 21 + σ(C )2

Nni,j

| X N,ij |21 −σ(C )2

nN i,j

| X N,ij |2

the right-hand side of which converges to 2(1

−σ(C )4) N

n almost surely as N grows large. Notice again that σ(C ) converges to 1 as C grows large.

At this point, we go over the proof of the Marcenko–Pastur law but with X inplace of X . The derivations unfold identically but, thanks to Theorem 3.12, we

Page 81: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 81/562

3.3. Stieltjes transform for advanced models 57

nowhere need any moment assumption further than the existence of the secondorder moment of the entries of X . We then have that, almost surely

F X X H

⇒ F.But since the constant C , which denes X , was arbitrary from the very

beginning, it can be set as large as we want. For x ∈R and ε > 0, we can takeC large enough to ensure that

lim supN |F XX H

(x) −F X X H

(x)| < ε

for a given realization of X . Therefore

limsupN |

F XX H

(x)

−F (x)

| < ε.

Since ε is arbitrary, we nally have

F XX H

⇒ F

this event having probability one. This is our nal result.The proof of the Marcenko–Pastur law, with or without truncation steps,

can be applied to a large range of models involving random matrices with i.i.d.entries. The rst known extension of the Marcenko–Pastur law concerns thel.s.d. of a certain class of random matrices, which contains in particular theN ×N sample covariance matrices R n of the type (1.4), where the vector samplesx i ∈C N have covariance matrix R . This is presented in the subsequent section.

3.3 Stieltjes transform for advanced models

We recall that, if N is xed and n grows large, then we have the almost sureconvergence F R n

⇒ F R of the e.s.d. of the sample covariance matrix R n ∈C N ×N originating from n observations towards the population covariance matrixR . This is a consequence of the law of large numbers in classical probabilitytheory. This is not so if both n and N grow large with limit ratio n/N → c, suchthat 0 < c < ∞. In this situation, we have the following result instead.

Theorem 3.13 ([Silverstein and Bai , 1995]). Let B N = A N + X HN T N X N ,

where X N = 1√ n X N,iji,j ∈

C N ×n with the X N,ij independent with zero mean,

unit variance, and nite moment of order 2 + ε for some ε > 0 ( ε is independent of N,i , j ), T N ∈C N ×N diagonal with real entries and whose e.s.d. F T N

converges weakly and almost surely to F T , and A N

∈C n ×n Hermitian whose

e.s.d. F T N converges weakly and almost surely to F A , N/n tends to c, with 0 < c < ∞ as n, N grow large. Then, the e.s.d. of B N converges weakly and almost surely to F B such that, for z ∈C + , mF B (z) is the unique solution with

Page 82: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 82/562

58 3. The Stieltjes transform method

positive imaginary part of

mF B (z) = mF A z −c

t

1 + tm F B (z)dF T (t) . (3.22)

Moreover, if the X N has identically distributed entries, then the result holds without requiring that a moment of order 2 + ε exists.

Remark 3.2. In [Bai and Silverstein, 2009 ] it is precisely shown that the non-i.i.d. case holds if the random variables X N,ij meet a Lindeberg-like condition[Billingsley, 1995 ]. The existence of moments of order 2 + ε implies that theX N,ij meet the condition, hence the result. The Lindeberg-like condition statesexactly here that, for any ε > 0

1N 2 ij

E |X N,ij |2

·1|X N,ij |≥ε√ N (X N,ij ) → 0.

These conditions merely impose that the distribution of X N,ij , for all pairs ( i, j ),has light tails, i.e. large values have sufficiently low probability.

In the case of Gaussian X N , T N can be taken Hermitian non-diagonal. Thisis because the joint distribution of X N is in this particular case invariant byright-unitary product, i.e. X N U N has the same joint entry distribution as X N

for any unitary matrix U N ∈C N ×N . Therefore, T N can be replaced by anyHermitian matrix U N T N U

HN for U N unitary. As previously anticipated for this

simple extension of the Marcenko–Pastur law, mF B does not have a closed-formexpression.

The particular case when A N = 0 is interesting in many respects. In this case,(3.22) becomes

mF (z) = − z −c t1 + tm F (z)

dF T (t)−1

(3.23)

where we denoted F F B . This special notation will often be used in Section7.1 to differentiate the l.s.d. F of the matrix T

12N X N X

HN T

12N from the l.s.d. F

of the reversed Gram matrix XH

N T N X N . Note indeed that, similar to ( 3.16),the Stieltjes transform mF of the l.s.d. F of X HN T N X N is linked to the Stieltjes

transform mF of the l.s.d. F of T12N X N X

HN T

12N through

mF (z) = cmF (z) + ( c −1)1z

(3.24)

and then we also have access to a characterization of F , which is the asymptoticeigenvalue distribution of the sample covariance matrix model introduced earlierin (1.4), when the columns x1 , . . . , x n of X N = √ nT

12N X N form a sequence of

independent vectors with zero mean and covariance matrix E[ x 1x H1 ] = T N , with

T12N a Hermitian square root of T N . Note however that, contrary to the strict

denition of the sample covariance matrix model, we do not impose identicaldistributions of the vectors of X N here, but only identically mean and covariance.

Page 83: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 83/562

3.3. Stieltjes transform for advanced models 59

In addition to the uniqueness of the pair ( z, m F (z)) in the set z ∈C + , m F (z) ∈C + solution of (3.23), an inverse formula for the Stieltjestransform can be written in closed-form, i.e. we can dene a function zF (m)on m ∈C + , zF (m) ∈C + , such that

zF (m) = − 1m

+ c t1 + tm

dF T (t). (3.25)

This will turn out to be extremely useful to characterize the spectrum of F .More on this topic is discussed in Section 7.1 of Chapter 7. From a wirelesscommunication point of view, even if this is yet far from obvious, ( 3.25) isthe essential ingredient to derive in particular ( N, n )-consistent estimates of the diagonal entries with large multiplicities of the diagonal matrix P for the

channel matrix models Y = P

12

X ∈C N

×n

and also Y = HP

12

X + W ∈C N

×n

.The latter, in which H , X and W have independent entries, can be used to modelthe n sampled vectorial data Y = [y 1 , . . . , y n ] received at an array of N sensorsoriginating from K sources with respective powers P 1 , . . . , P K gathered in thediagonal entries of P . In this model, H ∈C N ×K , X ∈C K ×n and W ∈C N ×n

may denote, respectively, the concatenated K channel vectors (in columns),the concatenated K transmit data (in rows), and the additive noise vectors,respectively. It is too early at this stage to provide any insight on the reason why(3.25) is so fundamental here. More on this subject will be successively discussedin Chapter 7 and Chapter 17.

We do not prove Theorem 3.13 in this section, which is a special case of Theorem 6.1 in Chapter 6, for which we will provide an extended sketch of theproof.

Theorem 3.13 was further extended by different authors for matrix modelswhen either X N has i.i.d. non-centered elements [Dozier and Silverstein , 2007a],X N has a variance prole , i.e. with independent entries of different variances,and centered [Girko, 1990 ] or non-centered entries [Hachem et al., 2007], X N is asum of Gaussian matrices with separable variance prole [Dupuy and Loubaton,2009], B N is a sum of such matrices [Couillet et al., 2011a; Peacock et al., 2008],

etc. We will present rst the two best known results of the previous list, whichhave now been largely generalized in the contributions just mentioned. The rstresult of importance is due to Girko [Girko, 1990] on X N matrices with centeredi.i.d. entries with a variance prole.

Theorem 3.14. Let the complex N ×n random matrix X N be composed of independent entries 1√ n X N,ij , such that X N,ij has zero mean, variance σ2

N,ij ,and the σN,ij X N,ij are identically distributed. Further, assume that the σ2

N,ijare uniformly bounded. Assume that the σ2

N,ij converge, as N grows large, to a

bounded limit density pσ2

(x, y ), (x, y ) ∈ [0, 1)2

, as n, N → ∞, n/N → c. That is,dening pN,σ 2 (x, y ) as

pN,σ 2 (x, y ) σ2N,ij

Page 84: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 84/562

60 3. The Stieltjes transform method

for i−1N ≤ x < i

N and y−1N ≤ y < j

N , pN,σ 2 (x, y ) → pσ 2 (x, y ), as N, n → ∞. Then the e.s.d. of B N = X N X

HN converges weakly and almost surely to a distribution

function F whose Stieltjes transform is given by:

mF (z) = 1

0u(x, z )dx

and u(x, z ) satises the xed-point equation

u(x, z ) = −z + c

0

pσ 2 (x, y )dy1 + 1

0 u(x , z) pσ 2 (x , y)dx

−1

the solution of which being unique in the class of functions u(x, z ) ≥ 0, analytic for

[z] > 0 and continuous on the plan section

(x, y ), x

∈ [0, 1]

.

Girko’s proof is however still not well understood .7 In Chapter 6, a moregeneric form of Theorem 3.14 and a sketch of the proof are provided. This resultis fundamental for the analysis of the capacity of MIMO Gaussian channels witha variance prole. In a single-user setup, the usual Kronecker channel model isa particular case of such a model, referred to as a channel model with separablevariance prole. That is, σ2

ij can be written in this case as a separable product r i t j

with r1 , . . . , r N the eigenvalues of the receive correlation matrix and t1 , . . . , t n

the eigenvalues of the transmit correlation matrix. However, the requirement

that the variance prole σ2N,ij converges to a limit density pσ 2 (x, y ) for largeN is often an unpractical assumption. More useful and more general results,

e.g., [Hachem et al., 2007], that do not require the existence of a limit will bediscussed in Chapter 6, when introducing the so-called deterministic equivalents for mF X N (z).

The second result of importance in wireless communications deals with theinformation plus noise models, as follows.

Theorem 3.15 ([Dozier and Silverstein, 2007a] ). Let X N be N ×n with i.i.d.entries of zero mean and unit variance, A N be N

×n independent of X N such

that F 1n A N A HN

⇒ H almost surely. Let also σ be a positive integer and denote

B N = 1n

(A N + σX N )(A N + σX N )H .

Then, for n, N → ∞ with n/N → c > 0, F B N

⇒ F almost surely, where F is a non-random distribution function whose Stieltjes transform mF (z), for z ∈C + ,is the unique solution of

mF (z) = dH (t)t

1+ σ 2 cm F (z ) −(1 + σ2cmF (z))z + σ2(1 −c)

7 As Bai puts it [Bai and Silverstein, 2009] : “his proofs have puzzled many who attempt tounderstand, without success, Girko’s arguments.”

Page 85: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 85/562

3.4. Tonelli theorem 61

such that mF (z) ∈C + and zmF (z) ∈C + .

It is rather clear why the model B N is referred to as information plus noise .In practical applications, this result is used in various contexts, such as theevaluation of the MIMO capacity with imperfect channel state information[Vallet and Loubaton, 2009 ], or the capacity of MIMO channels with line-of-sight components [Hachem et al., 2007]. Both results were generalized in[Hachem et al., 2007] for the case of non-centered matrices with i.i.d. entrieswith a variance prole, of particular appeal in wireless communications sinceit models completely the so-called Rician channels, i.e. non-centered Rayleighfading channels with a variance prole. The works [Couillet et al., 2011a; Dupuyand Loubaton, 2009; Hachem et al., 2007] will be further discussed in Chapters

13–14 as they provide asymptotic expressions of the capacity of very generalwireless models in MIMO point-to-point channels, MIMO frequency selectiveRayleigh fading channels, as well as the rate regions of multiple access channelsand broadcast channels. The technical tools required to prove the latter nolonger rely on the Marcenko–Pastur approach, although the latter can be used toprovide an insight on the expected results. The main limitations of the Marcenko–Pastur approach will be evidenced when proving one of the aforementionedresults, namely the result of [Couillet et al., 2011a], in Chapter 6.

Before introducing an important central limit theorem, we make a smalldigression about the Tonelli theorem, also known as Fubini theorem. This resultis of interest when we want to extend results that are known to hold for matrixmodels involving deterministic matrices converging weakly to some l.s.d. to thecase when those matrices are now random, converging almost surely to somel.s.d.

3.4 Tonelli theorem

The Tonelli theorem for probability spaces can be stated as follows.

Theorem 3.16 (Theorem 18.3 in [Billingsley, 1995]). If (Ω, F , P ) and (Ω , F , P )are two probability spaces, then for f an integrable function with respect to the product measure Q on F ×F

Ω×Ωf (x, y )Q(d(x, y )) = Ω Ω

f (x, y )P (dy) P (dx)

and

Ω×Ω

f (x, y )Q(d(x, y )) =

Ω

Ω

f (x, y )P (dy) P (dx).

Moreover, the existence of one of the right-hand side values ensures the integrability of f with respect to the product measure Q.

Page 86: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 86/562

Page 87: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 87/562

3.5. Central limit theorems 63

the trace lemma, Theorem 3.4, and the rank-1 perturbation lemma, Theorem3.9, in the case when the matrices involved in both results do not havebounded spectral norm but only almost sure bounded spectral norm for all largedimensions. These generalizations are required to study zero-forcing precodersin multiple input single output broadcast channels, see Section 14.1.

This closes this parenthesis on the Tonelli theorem. We return now to furtherconsiderations of asymptotic laws of large dimensional matrices, and to the studyof (central) limit theorems.

3.5 Central limit theorems

Due to the intimate relation between the Stieltjes and Shannon transforms ( 3.5),it is now obvious that the capacity of large dimensional communication channelscan be approximated using deterministic limits of the Stieltjes transform.For nite dimensional systems, this however only provides a rather roughapproximation of quasi-static channel capacities or alternatively a ratheraccurate approximation of ergodic channel capacities. No information aboutoutage capacities is accessible to this point, since the variations of thedeterministic limit F of the Stieltjes transform F X N of some matrix X N ∈C N ×N

under study are unknown. To this end, we need to study more precisely theuctuations of the random quantity

r N [mF X N (z) −mF (z)]

for some rate rN , increasing with N . For X N a sample covariance matrix, it turnsout that under some further assumptions on the moments of the distribution of the random entries of X N , the random variable N [mF X N (z) −mF (z)] has acentral limit. This central limit generalizes to any well-behaved functional of X N .

The rst central limit result for functionals of large dimensional randommatrices is due to Bai and Silverstein for the covariance matrix model, as follows.

Theorem 3.17 ([Bai and Silverstein , 2004]). Let X N = 1√ n X N,ijij ∈

C N ×n

have i.i.d. entries, such that X N, 11 has zero mean, unit variance, and nite fourth order moment. Let T N ∈C N ×N be non-random Hermitian non-negative denite with uniformly bounded spectral norm (with respect to N ) for which we assume that F T N

⇒ H , as N → ∞. We denote τ 1 ≥ . . . ≥ τ N the eigenvalues of T N .Consider the random matrix

B N = T12N X N X

HN T

12N

as well as

B N = X HN T N X N .

Page 88: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 88/562

64 3. The Stieltjes transform method

We know from Theorem 3.13 that F B N

⇒ F for some distribution function F ,as N, n → ∞ with limit ratio c = lim N N/n . Denoting F N this limit distribution if the series F T 1 , F T 2 , . . . were to converge to H = F T N , let

GN N F B N −F N .

Consider now k functions f 1 , . . . , f k dened on R that are analytic on the segment

lim inf N

τ N 1(0 ,1) (c)(1 −√ c)2 , lim supn

τ 11(0 ,1) (c)(1 + √ c)2 .

Then, if (i) X N, 11 is real, T N is real and E[(X N, 11 )4] = 3, or (ii) if X N, 11 is complex, E[(X N, 11 )2] = 0 and E[(X N, 11 )4] = 2, then the random vector

f 1(x)dGN (x), . . . , f k (x)dGN (x)

converges weakly to a Gaussian vector (X f 1 , . . . , X f k ) with means (E[X f 1 ], . . . , E[X f k ]) and covariance matrix Cov(X f , X g ), (f, g ) ∈ f 1 , . . . , f k2 ,such that, in case (i)

E[X f ] = − 12πi f (z)

c m(z)3 t2(1 + tm )−3dH (t)(1 −c

m(z)2 t2(1 + tm (z))−2dH (t))2 dz

and Cov( X f , X g ) = −

12πi f (z1)g(z2)

(m(z1) −m(z2))2 m (z1)m (z2)dz1dz2

while in case (ii) E[X f ] = 0 and

Cov( X f , X g ) = − 14πi f (z1)g(z2)

(m(z1) −m(z2))2 m (z1)m (z2)dz1dz2 (3.26)

for any couple (f, g ) ∈ f 1 , . . . , f k2 , and for m(z) the Stieltjes transform of the l.s.d. of B N . The integration contours are positively dened with winding number

one and enclose the support of F .

The mean and covariance expressions of Theorem 3.17 are not easilyexploitable except for some specic f 1 , . . . , f k functions for which the integrals(or an expression of the covariances) can be explicitly computed. For instance,when f 1 , . . . , f k are taken to be f i (x) = xi and T N = I N , Theorem 3.17 givesthe central limit of the joint moments of the distribution N [F B N −F ] with F the Marcenko–Pastur law.

Corollary 3.3 ([Bai and Silverstein, 2004 ]). Under the conditions of Theorem 3.17 with H = 1 [1,∞) , in case (ii), denote vN the vector of j th entry

(vN )j = tr( B jN ) −NM j

Page 89: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 89/562

3.5. Central limit theorems 65

where M j is the limiting j th order moment of the Marcenko–Pastur law, as N → ∞. Then, as N, n → ∞ with limit ratio N/n → c

v N ⇒ v ∼CN

(0, Q )where the (i, j )th entry of Q is

Q i,j = ci + ji−1

k 1 =0

j

k 2 =0

ik1

jk2

1 −cc

k 1 + k 2

×i−k 1

l=1

l2i −1 −k1 −l

i −12 j −1 −k2 + l

j −1.

We will see in Chapter 5 that, in the Gaussian case, combinatorial moment-based approaches can also be used to derive the above result in a much fasterway. In case (ii), for Gaussian X N , the rst coefficients of Q can be alternativelyevaluated as a function of the limiting free cumulants C 1 , C 2 , . . . of B N [Raoet al. , 2008], which will be introduced in Section 5.2. Explicitly, we have:

Q11 = C 2 −C 21Q21 = −4C 1C 2 + 2 C 31 + 2C 3Q22 = 16C 21 C 2 −6C 22 −6C 41 −8C 1C 3 + 4 C 4 (3.27)

with C k dened as a function of the moments M k limN 1

N tr B k

N throughEquation ( 5.3).

Obtaining central limits requires somewhat elaborate tools, involving thetheory of martingales in particular, see Section 35 of [Billingsley, 1995]. Animportant result for wireless communications, for which we will provide a sketchof the proof using martingales, is the central limit for the log determinant of Wishart matrices. In its full form, this is:

Theorem 3.18. Let X N ∈C N ×n have i.i.d. entries 1√ n X N,ij , such that X N, 11

has zero mean, unit variance, and nite fourth order moment. Denote B N =X N X HN ∈C N ×N . We know that, as N, n → ∞ with N/n → c, F B N

⇒ F almost surely with F the Marcenko–Pastur law with ratio c. Then the Shannon transform VB N (x) log(1 + xt )dF B N (t) of B N satises

N (VB N (x) −E [VB N (x)]) ⇒ X ∼N (0, Θ2)

with

Θ2 = −log 1 − cmF (−1/x )

(1 + cmF (−1/x ))2 + κ cmF (−1/x )

(1 + cmF (−1/x ))2

κ = E( X N, 11 )4

−2 and mF (z) the Stieltjes transform of the Marcenko–Pastur

law (3.20).

Note that in the case where X N, 11 is Gaussian, κ = 0.

Page 90: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 90/562

66 3. The Stieltjes transform method

We hereafter provide a sketch of the proof of Theorem 3.18, along with a shortintroduction to martingale theory.

Proof. Martingale theory requires notions of conditional probability.Denition 3.7 ((33.8) of [Billingsley, 1995]). Let (Ω, F , P ) be a probabilityspace, and X be a random variable on this probability space. Let G be a σ-eld in F . For A ∈F , we denote P [A G] the random variable with realizationsP [A G]ω , ω ∈ Ω, which is measurable G, integrable, and such that, for all G ∈G

GP [A G]dP = P (A ∩G).

If G is the nite set of the unions of the F -sets B1 , . . . , B K , then, for ω ∈B i , P (B i )P [A G]ω = P (A

∩B i ), i.e. P [A G]ω = P (A

|B i ), consistently with the

usual denition of conditional probability. We can therefore see P [A G] as theprobability of the event A when the result of the experiment G is known. Assuch, the σ-eld G

⊂F can be seen as an information lter , in the sense that it

brings a rougher vision of the events of Ω. Of interest to us is the extension of this information ltering under the form of a so-called ltration F 1 ⊂F 2 ⊂ . . .for a given sequence of σ-eld F 1 , F 2 , . . . ⊂F .

Denition 3.8. Consider a probability space (Ω , F , P ) and a ltration F 1 ⊂F 2 ⊂ . . . with, for each i, F i ⊂F . Dene X 1 , X 2 , . . . a sequence of randomvariables such that X i is measurable F i and integrable. Then X 1 , X 2 , . . . is amartingale with respect to F 1 , F 2 , . . . if, with probability one

E [X n +1 F n ] = X n

where E[X G] is the conditional expectation dened, for G ∈G as

GE[X G]dP = G

XdP.

If X 1 , X 2 , . . . and X 1 , X 2 , . . . are both martingales with respect to F 1 , F 2 , . . . ,then X 1 −X 1 , X 2 −X 2 , . . . is a martingale difference relative to F 1 , F 2 , . . . . More

generally, we have the following denition.Denition 3.9. If Z 1 , Z 2 , . . . is such that Z i is measurable F i and integrable,for a ltration F 1 , F 2 , . . . , and satises

E[Z n +1 F n ] = 0

then Z 1 , Z 2 , . . . is a martingale difference with respect to F 1 , F 2 , . . . .

Informally, this implies that, when the experiment F n is known beforeexperiment F n +1 has been realized, the expected observation X n +1 at timen + 1 is exactly equal to the observation X n at time n. Note that, by takingF = F N = F N +1 = . . ., the ltration can be limited to a nite sequence.

The link to random matrix theory is the following: consider a randommatrix X N ∈C N ×n with independent columns x1 , . . . , x n ∈C N . The columns

Page 91: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 91/562

3.5. Central limit theorems 67

x k (ω), . . . , x n (ω) of X N (ω), ω ∈ Ω, can be thought of as the result of theexperiment F n −k +1 that consists in unveiling successive matrix columns fromthe last to the rst: F 1 unveils x n (ω), F 2 unveils xn −1(ω), etc. This results ina ltration F 0 ⊂F 1 ⊂ . . . ⊂F n , with F 0 = ∅, Ω. The reason why consideringmartingales will be helpful to the current proof takes the form of Theorem 3.19,introduced later, which provides a central limit result for sums of martingaledifferences. Back to the hypotheses of Theorem 3.18, consider the above ltrationbuilt from the column space of X N and denote

α n,j En −j +1 log det I N + xX N XHN −En −j log det I N + xX N X

HN

where E n −j +1 [X ] stands for E[X F n −j +1 ] for X measurable F n −j +1 . For all j ,the αn,j satisfy

En −j +1 [α n,j ] = 0and therefore αn,n , . . . , α 1,n is a martingale difference relative to the ltrationF 1 ⊂ . . . ⊂F n . Note now that the variable of interest, namely

β n log det I N + xX N XHN −E log det I N + xX N X

HN

satises

β n n

j =1(α n,j −α n,j +1 ) .

Denoting X ( j ) [x 1 , . . . , x j −1 , x j +1 , . . . , x n ], notice that

En −j +1 log det I N + xX ( j ) X H

( j ) = E n −j log det I N + xX ( j ) X H

( j )

and therefore, adding and subtracting E n −j +1 log det I N + xX ( j ) X H

( j )

β n = E n −j

log det I N + xX ( j ) X H

( j )

log det I N + xX N XHN −

n

j =1

En −j +1

log det I N + xX ( j ) X H

( j )

log det I N + xX N XHN

= E n −j

log det I n

−1 + xX H

( j )X ( j )

log det I n + xX HN X N −

n

j =1En −j +1

log det I n

−1 + xX H

( j )X ( j )

log det I n + xX HN X N

=n

j =1En −j +1 log 1 + x H

j X ( j ) X H

( j ) + 1x

I N −1

x j

−En −j log 1 + x H

j X ( j ) X H

( j ) + 1x

I N −1

x j

where the last equality comes rst from the expression of a matrix inverse as afunction of determinant and cofactors

X HN X N +

1x

I n−1

jj

=det X H

( j ) X ( j ) + 1x I n −1

det X HN X N + 1

x I n

Page 92: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 92/562

68 3. The Stieltjes transform method

and second by remembering from ( 3.11) that

X H

N X N +

1

xI n

−1

jj

= x

1 + x Hj (X ( j ) X H( j ) + 1x I N )−1x j.

Using now the fact that

En −j +1 log 1 + 1n

tr X ( j ) X H

( j ) + 1x

I N −1

= E n −j log 1 + 1n

tr X ( j ) X H

( j ) + 1x

I N −1

then adding and subtracting

log(1) = log1 + 1

n tr X ( j ) X H

( j ) + 1x I N

−1

1 + 1n tr X ( j ) X H

( j ) + 1x I N

−1

this is further equal to

β n =n

j =1

En −j +1 log1 + x H

j X ( j ) X H

( j ) + 1x I N

−1x j

1 + 1

n tr X ( j ) X H

( j ) + 1

xI N

−1

−En −j log1 + x H

j X ( j ) X H

( j ) + 1x I N

−1x j

1 + 1n tr X ( j ) X H

( j ) + 1x I N

−1

=n

j =1En −j +1

n −j logx H

j X ( j ) X H

( j ) + 1x I N

−1x j − 1

n tr X ( j ) X H

( j ) + 1x I N

−1

1 + 1n tr X ( j ) X H

( j ) + 1x I N

−1

=n

j =1

γ j

with E n −j +1n −j X = E n −j +1 X −En −j X , and with

γ j = E n −j +1n −j log (1 + Aj )

En −j +1n −j log

x Hj X ( j ) X H

( j ) + 1x I N

−1x j − 1

n tr X ( j ) X H

( j ) + 1x I N

−1

1 + 1n tr X ( j ) X H

( j ) + 1x I N

−1 .

The sequence γ n , . . . , γ 1 is still a martingale difference with respect to theltration F 1 ⊂ . . . ⊂F n .

We now introduce the fundamental theorem, which justies the previous stepsthat led to express β n as a sum of martingale differences.

Page 93: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 93/562

3.5. Central limit theorems 69

Theorem 3.19 (Theorem 35.12 of [Billingsley, 1995]). Let δ 1 , . . . , δ n be a sequence of martingale differences with respect to the ltration F 1 ⊂ . . . ⊂F n .Assume there exists Θ2 such that

n

j =1

E δ 2j F j → Θ2

in probability, as n → ∞. Moreover assume that, for all ε > 0n

j =1E δ 2j 1|δj |≥ε → 0.

Then n

j =1 δ j ⇒ X ∼N

(0, Θ2

).

It therefore suffices here to prove that the Lindeberg-like condition is satisedand to determine Θ 2 . For simplicity, we only carry out this second step here.Noticing that Aj is close to zero by the trace lemma, Theorem 3.4

n

j =1En −j γ 2j

n

j =1En −j (En −j +1 Aj −En −j Aj )2

n

j =1

En −j (En −j +1 Aj )2

where we use the symbol ‘ ’ to denote that the difference in the terms on eitherside tends to zero for large N and where

En −j (En −j +1 Aj )2

En −j x Hj X N X

HN +

1x I N −1 x j − 1

n tr X ( j ) X H

( j ) + 1x I N

−1 2

1 + 1n tr X N X

HN +

1x I N −1 2 .

Further calculus shows that the term in the numerator expands as

En −j x Hj X N X

HN +

1x

I N −1

x j − 1n

tr X ( j ) X H

( j ) + 1x

I N −1 2

1n2 tr En −j X ( j ) X H

( j ) + 1x

I N −2

+ κN

i =1

En −j X ( j ) X H

( j ) + 1x

I N −1

ii

where the term κ appears, and where the term 1n 2 tr En −j X ( j ) X H

( j ) + 1x I N

−2

further develops into

1n2 tr En −j X ( j ) X H

( j ) + 1x

I N −2 1n tr E X N X HN −zI N −1

1 − n −j −1n

1n tr E (X N X H

N −z I N )−1

1+ 1n tr E (X N X H

N −z I N )−1 2

.

Page 94: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 94/562

70 3. The Stieltjes transform method

Expressing the limits as a function of mF , we nally obtainn

j =1

En −j (E n −j +1 Aj )2

1

(1 + cmF (−1/x ))2

1

0

cmF (−1/x )2

1 −(1 −y) cm F (

−1/x )

(1+ cm F (−1/x )) 2

dy

+ κ cmF (−1/x )2

(1 + mF (−1/x ))2

from which we fall back on the expected result.

This completes this rst chapter on limiting distribution functions of somelarge dimensional random matrices and the Stieltjes transform method. In thecourse of this chapter, the Stieltjes transform has been illustrated to be a powerfultool for the study of the limiting spectrum distribution of large dimensionalrandom matrices. A large variety of practical applications in the realm of wireless communications will be further developed in Part II. However, whilei.i.d. matrices, with different avors (with variance prole, non-centered, etc.),and Haar matrices have been extensively studied, no analytic result concerningmore structured matrices used in wireless communications, such as Vandermondeand Hankel matrices, has been found so far. In this case, some moment-basedapproaches are able to ll in the gap by providing results on all successivemoments (when they exist). In the subsequent chapter, we will rst introducethe basics of free probability theory, which encompasses a very different theory of large dimensional random matrices than discussed so far and which provides

a rather comprehensive framework for moment-based approaches, which willthen be extended in a further chapter. We will come back to the Stieltjestransform methods for determining limiting spectral distributions in Chapter6 in which we discuss more elaborated techniques than the Marcenko–Pasturmethod introduced here.

Page 95: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 95/562

4 Free probability theory

In this chapter, we introduce free probability theory, a different approach tothe already introduced tools for random matrix theory. Free probability will beshown to provide a very efficient framework to study limiting distributions of some models of large dimensional random matrices with symmetric features.Although to this day it does not overcome the techniques introduced earlier andcan only be applied to very few random matrix models, this approach has thestrong advantage of often being faster at determining the l.s.d. for these models.In particular, some results are derived in a few lines of calculus in the following,which are generalized in Chapter 6 using more advanced tools.

It is in general a difficult problem to deduce the eigenvalue distribution of thesum or the product of two generic Hermitian matrices A , B as a function of the eigenvalue distributions of A and B . In fact, it is often impossible as the

eigenvectors of A and B intervene in the expression of the eigenvalues of theirsum or product. In Section 3.2, we provided a formula, Theorem 3.13, linkingthe Stieltjes transform of the l.s.d. of the sum or the product of a deterministicmatrix ( A N or T N ) and a matrix X N X

HN , where X N has i.i.d. entries, to the

Stieltjes transform of the l.s.d. of A N or T N . In this section, we will see thatTheorem 3.13 can be derived in a few lines of calculus under certain symmetryassumptions on X N . More general results will unfold from this type of calculusunder these symmetry constraints on X N .

The approach originates from the work of Voiculescu [Voiculescu et al.,1992] on a very different subject. Voiculescu was initially interested in themathematical description of a theory of probability on non-commutativealgebras, called free probability theory . The random variables here are elements of a non-commutative probability space ( A , φ), with A a non-commutative algebraand φ a given linear functional. The algebra of Hermitian random matrices isa particular case of such a probability space, for which the random variables,i.e. the random matrices, do not commute with respect to the matrix product.We introduce hereafter the basics of free probability theory that are required tounderstand the link between algebras of non-commutative random variables andrandom matrices.

Page 96: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 96/562

72 4. Free probability theory

4.1 Introduction to free probability theory

We rst dene non-commutative probability spaces.

Denition 4.1. A non-commutative probability space is a couple (A, φ) whereA is a non-commutative unital algebra, that is an algebra over C having a unitdenoted by 1, and φ : A →C is a linear functional such that φ(1) = 1.

When the functional φ satises φ(ab) = φ(ba), it is also called a trace . As willappear below, the role of φ can be compared to the role of the expectation inclassical probability theory.

Denition 4.2. Let (A, φ) be a non-commutative probability space. In thecontext of free probability, a random variable is an element a of A. We callthe distribution of a the linear functional ρa on C [X ], the algebra of complexpolynomials in one variable, dened by

ρa : C [X ] → C

P → φ (P (a)) .

The distribution of a non-commutative random variable a is characterized byits moments , which are dened to be the sequence φ(a), φ(a2), . . . , the successive

images by ρa of the normalized monomials of C [X ]. The distribution of a non-commutative random variable can often be associated with a real probabilitymeasure µa in the sense that

φ(ak ) = Rtk dµa (t)

for each k ∈N . In this case, the moments of all orders of µa are of course nite.In free probability theory, it is more conventional to use probability distributions µ instead of distribution functions F (we recall that F (x) = µ(−∞, x] if F is thed.f. associated with the measure µ on R ), so we will keep these notations in thepresent section.

Consider the algebra A N of N ×N random matrices whose entries are denedon some common probability space (meant in the classical sense) and have alltheir moments nite. In what follows, we refer to X ∈ A N as a random matrix in the sense of Denition 2.1, and not as a particular realization X (ω), ω ∈ Ω,of X . A non-commutative probability space of random variables is obtained byassociating to A N the functional τ N given by:

τ N (X ) = 1N

E (tr X ) = 1N

N

i =1

E[X ii ] (4.1)

with X ij the (random variable) entry ( i, j ) of X . This is obviously a trace in thefree probability sense. This space will be denoted by ( A N , τ N ).

Page 97: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 97/562

4.1. Introduction to free probability theory 73

Suppose X is a random Hermitian matrix with real random eigenvaluesλ1 , . . . , λ N . The distribution ρX of X is dened by the fact that its action oneach monomial X k of C [X ] is given by

ρX (X k ) = τ N (X k ) = 1N

N

i =1E[λk

i ].

This distribution is of course associated with the probability measure µX

dened, for all bounded continuous f , by

f (t)dµX (t) = 1N

N

i =1

E[f (λki )].

The notion of distribution introduced in Denition 4.2 is subsequentlygeneralized to the case of multiple random variables. Let a1 and a2 be tworandom variables in a non-commutative probability space ( A , φ). Considernon-commutative monomials in two indeterminate variables of the formX k1

i 1X k 2

i 2. . . X k n

i n, where for all j , ij ∈ 1, 2 , kj ≥ 1 and ij = ij +1 . The algebra

C X 1 , X 2 of non-commutative polynomials with two indeterminate variables isdened as the linear span of the space containing 1 and the non-commutativemonomials. The joint distribution of a1 and a2 is then dened as the linearfunctional on C X 1 , X 2 satisfying

ρ : C X 1 , X 2 → C

X k 1i 1

X k 2i 2

. . . X kni n → ρ(X k 1

i 1X k 2

i 2. . . X k n

i n) = φ(ak1

i 1ak2

i 2. . . a k n

i n).

More generally, denote by C X i | i ∈ 1, . . . , I the algebra of non-commutative polynomials in I variables, which is the linear span of 1 and the non-commutative monomials of the form X k 1

i 1X k 2

i 2. . . X kn

i n, where kj ≥ 1 and i1 = i2 ,

i2 = i3 , . . ., in −1 = in are smaller than or equal to I . The joint distribution of the random variables a1 , . . . , a I in (A, φ) is the linear functional

ρ : C X i

|i

∈ 1, . . . , I

−→ C

X k 1i 1 X k 2

i 2 . . . X kni n −→ ρ(X k1

i 1 X k 2i 2 . . . X k n

i n ) = φ(ak 1i 1 ak 2

i 2 . . . a k ni n ) .

In short, the joint distribution of the non-commutative random variablesa1 , . . . , a I is completely specied by their joint moments. We now introduce theimportant notion of freeness, which will be seen as the free probability equivalentto the notion of independence in classical probability theory.

Denition 4.3. Let ( A , φ) be a non-commutative probability space. A family

A1 , . . . , A I of unital subalgebras of A is free if φ(a1a2 . . . a n ) = 0 for all n-uples(a

1, . . . , a

n) satisfying

1. aj ∈ A i j for some ij ≤ I and i1 = i2 , i2 = i3 , . . ., in −1 = in .2. φ(a j ) = 0 for all j ∈ 1, . . . , n .

Page 98: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 98/562

74 4. Free probability theory

A family of subsets of A is free if the family of unital subalgebras generated byeach one of them is free. Random variables a1 , . . . , a n are free if the family of subsets

a1

, . . . ,

an

is free.

Note that in the statement of condition 1, only two successive random variablesin the argument of φ(a1a2 . . . a n ) belong to two different subalgebras. Thiscondition does not forbid the fact that, for instance, i1 = i3 . Note in particularthat, if a1 and a2 belong to two different free algebras, then φ(a1a2a1a2) = 0whenever φ(a1) = φ(a2) = 0. This relation cannot of course hold if a1 and a2

are two real-valued independent random variables and if φ coincides with theclassical mathematical expectation operator. Therefore freeness, often referredto as a free probability equivalent to independence, cannot be considered as anon-commutative generalization of independence because algebras generated byindependent random variables in the classical sense are not necessarily free.

Let us make a simple computation involving freeness. Let A 1 and A 2 be twofree subalgebras in A . Any two elements a1 and a2 of A 1 and A 2 , respectively,can be written as ai = φ(a i )1 + a i (1 is here the unit of A ), so φ(a i ) = 0. Now

φ(a1a2) = φ ((φ(a1)1 + a1) (φ(a2)1 + a2)) = φ(a1)φ(a2).

In other words, the expectations of two free random variables factorize.By decomposing a random variable ai into φ(a i )1 + a i , the principle of thiscomputation can be generalized to the case of more than two random variables

and to the case of higher order moments, and we can check that noncommutativity plays a central role there.

Theorem 4.1 ([Biane , 2003]). Let A 1 , . . . , A I be free subalgebras in ( A , φ) and let a1 , . . . , a n ⊂ A be such that, for all j ∈ 1, . . . , n , a j ∈ A i j , 1 ≤ ij ≤ I . Let Π be the partition of 1, . . . , n associated with the equivalence relation j ≡ k ⇔i j = ik , i.e. the random variables aj are gathered together according to the free algebras to which they belong. For each partition π of 1, . . . , n , let

φπ =

j 1 ,...,j r ∈πj 1 <...<j r

φ(a j 1 . . . a j r ).

There exists universal coefficients c(π, Π) such that

φ(a1 . . . a n ) =π ≤Π

c(π, Π)φπ

where “ π ≤ Π” stands for “ π is ner than Π,” i.e. every element of π is a subset of an element of Π.

The main consequence of this result is that, given a family of free algebras A 1 , . . . , A I in A , only restrictions of φ to the algebras A i are needed to computeφ(a1 . . . a n ) for any a1 , . . . , a n ∈ A such that, for all j ∈ 1, . . . , n , we have a j ∈Ai j , 1 ≤ ij ≤ I . The problem of computing explicitly the universal coefficients

Page 99: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 99/562

4.2. R- and S -transforms 75

c(π, Π) has been solved using a combinatorial approach and is addressed inSection 5.2.

Let µ and ν be two compactly supported probability measures on [0 ,

∞).

Then, from [Hiai and Petz, 2006] , it always exists two free random variables a1

and a2 in some non-commutative probability space ( A , φ) having distributionsµ and ν , respectively. We can see that the distributions of the random variablesa1 + a2 and a1a2 depend only on µ and on ν . The reason for this is thefollowing: Denition 4.2 states that the distributions of a1 + a2 and a1a2 arefully characterized by the moments φ ((a1 + a2)n ) and φ((a1a2)n ), respectively.To compute these moments, we just need the restriction of φ to the algebrasgenerated by a1 and a2, according to Theorem 4.1. In other words,φ ((a1 + a2)n ) and φ((a1a2)n ) depend on the moments of a1 and a2 only.

As such, the distributions of a1 + a2 and a1a2 can, respectively, be associatedwith probability measures called free additive convolution and free multiplicative convolution of the distributions µ and ν of these variables. The free additiveconvolution of µ and ν is denoted µ ν , while the free multiplicative convolutionof µ and ν is denoted µ ν . Both µ ν and µ ν are compactly supported on[0, ∞), see, e.g. page 30 of [Voiculescu et al., 1992]. Also, both additive andmultiplicative free convolutions are commutative, e.g. µ ν = ν µ, and themoments of µ ν and µ ν are related in a universal manner to the momentsof µ and to those of ν . We similarly denote µ ν the free additive deconvolution ,which is such that, if η = µ ν , then µ = η ν and ν = η µ, and µ ν the free multiplicative deconvolution which is such that, if η = ν ν , then µ = η ν and ν = η µ.

In the following, we express the fundamental link between the operations µ ν and µ ν and the R- and S -transforms, respectively.

4.2 R - and S -transforms

The R-transform, introduced in Denition 3.3 from the point of view of its

connection to the Stieltjes transform, fully characterizes µ ν as a function of µ and ν . This is given by the following result.

Theorem 4.2. Let µ and ν be compactly supported probability measures of R .Dene the R-transform Rµ of the probability distribution µ by the formal series

Rµ (z)k≥1

C k zk−1

where the C 1 , C 2 , . . . are iteratively evaluated from

M n =π∈NC( n ) V ∈π

C |V | (4.2)

Page 100: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 100/562

76 4. Free probability theory

with

M n

xn µ(dx)

the moment of order k of µ and NC(n) the set of non-crossing partitions of

1, . . . , n . Then, we have that

Rµ ν (z) = Rµ (z) + Rν (z).

If X is a non-commutative random variable with probability measure µ, RX

will also denote the R-transform of µ.More is said in Section 5.2 about NC( n), which plays an important role in

combinatorial approaches for free probability theory, in a similar way as P (n),

the set of partitions of 1, . . . , n , which plays a fundamental role in classicalprobability theory.Similarly, the S -transform introduced in Denition 3.4 allows us to turn

multiplicative free convolution into a mere multiplication of power series. Inthe context of free probability, the S -transform is precisely dened through thefollowing result.

Theorem 4.3. Given a probability measure µ on R with compact support, let ψµ (z) be the formal power series dened by

ψµ (z) =k≥1

zk

tk dµ(t) = zt1 −zt dµ(t). (4.3)

Let χ µ be the unique function analytic in a neighborhood of zero, satisfying

χ µ (ψµ (z)) = z

for |z| small enough. Let also

S µ (z) = χ µ (z)1 + z

z .

The function S µ is called the S -transform of µ, introduced in Denition 3.4.Moreover the S -transform S µ ν of µ ν satises

S µ ν = S µ S ν .

Similar to the R-transform, if X is a non-commutative random variable withprobability measure µ, S X will also denote the S -transform of µ.

Remark 4.1. We mention additionally that, as in classical probability theory, alimit theorem for free random variables exists, and is given as follows.

Theorem 4.4 ([Bercovici and Pata , 1996]). Let A1 , A2 , . . . be a sequence of free random variables on the non-commutative probability space ( A , φ), such that, for all i, φ(Ai ) = 0 , φ(A2

i ) = 1 and for all k, supi |φ(Aki )| < ∞. We then have, as

Page 101: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 101/562

4.3. Free probability and random matrices 77

N → ∞1

√ N (A1 + . . . + AN ) ⇒ A

where A is a random variable whose distribution has R-transform RA (z) = z.This distribution is the semi-circle law of Figure 1.2 , with density dened in (2.4).

The semi-circle law is then the free probability equivalent of the Gaussiandistribution.

Now that the basis of free probability theory has been laid down, we move tothe application of free probability theory to large dimensional random matrixtheory, which is the core interest of this chapter.

4.3 Free probability and random matrices

Voiculescu discovered very important relations between free probability theoryand random matrix theory. Non-diagonal random Hermitian matrices are clearlynon-commutative random variables. In [Voiculescu, 1991] , it is shown that certainindependent matrix models exhibit asymptotic free relations.

Denition 4.4. Let X N, 1 , . . . , X N,I be a family of random N ×N matricesbelonging to the non-commutative probability space ( A N , τ N ) with τ N denedin (4.1). The joint distribution has a limit distribution ρ on C X i |i ∈ 1, . . . , I as N → ∞ if

ρ(X k 1i 1

. . . X kni n

) = limN →∞

τ N (X k 1N,i 1

. . . X k nN,i n

)

exists for any non-commutative monomial in C X i |i ∈ 1, . . . , I .

Consider the particular case where I = 1 (we replace X N, 1 by X N to simplifythe notations) and assume X

N has real eigenvalues and that the distribution of

X N has a limit distribution ρ. Then, for each k ≥ 0

ρ(X k ) = limN →∞ tk dµX N (t) (4.4)

where µX N is the measure associated with the distribution of X N (seen as arandom variable in the free probability framework and not as a random variablein the classical random matrix framework).

Remark 4.2. If ρ is associated with a compactly supported probability measureµ, the convergence of the moments of µX N to the moments of µ expressed by(4.4) implies the weak convergence of the sequence µX 1 , µX 2 , . . . to µ, i.e.

f (t)dµ(t) = limN →∞ f (t)dµX N (t)

Page 102: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 102/562

78 4. Free probability theory

for each continuous bounded function f (t), from Theorem 30.2 of [Billingsley,1995]. This is a particular case of the method of moments , see Section 5.1, whichallows us to determine the l.s.d. of a random matrix model from the successivelimiting moments of the e.s.d.

We now dene asymptotic freeness, which is the extension of the notion of freeness for the algebras of Hermitian random matrices.

Denition 4.5. The family X N, 1 , . . . , X N,I of random matrices in ( A N , τ N )is said to be asymptotically free if the following two conditions are satised:

1. For every integer i ∈ 1, . . . , I , X N,i has a limit distribution on C [X i ].2. For every family i1 , . . . , i n ⊂ 1, . . . , I with i1 = i2 , . . . , i n −1 = in , and

for every family of polynomials P 1 , . . . , P n in one indeterminate variablesatisfying

limN →∞

τ N P j (X N,i j ) = 0, j ∈ 1, . . . , n (4.5)

we have:

limN →∞

τ N

n

j =1P j (X N,i j ) = 0 . (4.6)

The conditions 1 and 2 are together equivalent to the following two conditions:the family X N, 1 , . . . , X N,I has a joint limit distribution on C X i |i ∈ 1, . . . , I that we denote ρ and the family of algebras C [X 1], . . . , C [X I ]is free in the non-commutative probability space ( C X i |i ∈ 1, . . . , I , ρ).

The type of asymptotic freeness introduced by Hiai and Petz in [Hiai andPetz , 2006] is also useful because it deals with almost sure convergence under thenormalized classical matrix traces instead of convergence under the functionalsτ N . Following [Hiai and Petz, 2006 ], the family X N, 1 , . . . , X N,I in ( A N , τ N ) issaid to have a limit ρ almost everywhere if

ρ(X k1i 1

. . . X kni n

) = limN →∞

1N

tr( X k1N,i 1

. . . X knN,i n

)

almost surely for any non-commutative monomial in C X i | i ∈ 1, . . . , I . Inthe case where N = 1 and X N has real eigenvalues, if the almost sure limitdistribution of X N is associated with a compactly supported probability measureµ, this condition means that

f (t)dµ(t) = limN →∞

1N

N

i =1f (λ i,N )

almost surely for each continuous bounded function f (t). In other words, thee.s.d. of X N converges weakly and almost surely to the d.f. associated with themeasure µ.

Page 103: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 103/562

4.3. Free probability and random matrices 79

The family X N, 1 , . . . , X N,I in ( A N , τ N ) is said to be asymptotically free almost everywhere if, for every i ∈ 1, . . . , I , X N,i has a non-random limitdistribution on C [X i ] almost everywhere and if the condition 2 above is satisedwith the operator τ N (·) replaced by 1N tr( ·) in (4.5) and ( 4.6), where the limitsin these equations are understood to hold almost surely. These conditions implyin particular that X N, 1 , . . . , X N,I has a non-random limit distribution almosteverywhere on C X i |i ∈ 1, . . . , I .

Along with Remark 4.2, we can adapt Theorem 4.3 to the algebra of Hermitianrandom matrices. In this case, µ and ν are the e.s.d. of two random matrices Xand Y , and the equality S µ ν = S µ S ν is understood to hold in the almost suresense. The same is true for Theorem 4.2.

It is important however to remember that the (almost everywhere) asymptotic

freeness condition only applies to a limited range of random matrices. Rememberthat, for A N ∈C N ×N deterministic with l.s.d. A and B N = X N XHN with l.s.d.

B and X N ∈C N ×n with i.i.d. entries of zero mean and variance 1 /n , we saw inTheorem 3.13 that the l.s.d. of X N A N X

HN and therefore the l.s.d. of A N B N are

functions of A and B alone. It is however not obvious, and maybe not true atall, under the mere i.i.d. conditions on the entries of B N that A N and B N areasymptotically free. Free probability theory therefore does not necessarily embedall the results derived in Chapter 3. The exact condition for which the l.s.d. of A N B N is only a function of A and B is in fact unknown. Free probability theorytherefore provides a partial answer to this problem by introducing a sufficient condition for this property to occur, which is the (almost everywhere) asymptoticfreeness condition. In fact, not so many families of random matrices are known tobe asymptotically free. Of importance to applications in Part II is the followingasymptotic freeness result.

Theorem 4.5 (Theorem 4.3.11 of [Hiai and Petz , 2006]). Let X N, 1 , . . . , X N,I be a family of N ×N complex bi-unitarily invariant random matrices, i.e. whose joint distribution is invariant both by left- and right-unitary products, and let

D N, 1 , . . . , D N,J be a family of non-random diagonal matrices. Suppose that, as

N tends to innity, the e.s.d. of X N,i XH

N,i and D N,j DH

N,j converge in distribution and almost surely to compactly supported d.f. Then the family

X N,i i∈1,...,I , X HN,i i∈1,...,I , D N,j j∈1,...,J , D H

N,j j∈1,...,J is asymptotically free almost everywhere as N → ∞.

Theorem 4.5 allows us in particular to derive the (almost sure) l.s.d. of thefollowing random matrices:

• sums and products of random (non-necessarily square) Gaussian matrices;

• sums and products of random Gaussian or unitary matrices and deterministicdiagonal matrices;

Page 104: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 104/562

80 4. Free probability theory

• the sum or product of a Hermitian A N ∈C N ×N by X N B N XHN ∈C N ×N , with

B N ∈C n ×n Hermitian and X N ∈C N ×n Gaussian or originating from columnsof a unitary matrix.

Theorem 4.5 can be used for sums and products of non-square Gaussian matricessince the D N,j matrices can be taken such that their rst ( N −n) diagonal entriesare ones and their last n entries are zeros. It can then be used to derive the l.s.d.of A N + X N B N X

HN or A N X N B N X

HN for non-diagonal Hermitian A N and B N ,

since the result holds if X N is replaced by U N X N V N , with U N the stackedcolumns of eigenvectors for A N and V N the stacked columns of eigenvectors forB N . Basically, and roughly speaking, for any couple ( X , Y ) of matrices, withmutually independent entries and such that at least one of these matrices isunitarily invariant by left and right product, then X and Y are asymptoticallyfree almost everywhere. Intuitively, almost everywhere asymptotic freeness of the couple ( X , Y ) means that the random variables X and Y have independententries and that their respective eigenspaces are asymptotically “disconnected,”as their dimensions grow, in the sense that the eigenvectors of each randommatrix are distributed in a maximally uncorrelated way. In the case where oneof the matrices has isotropically distributed eigenvectors, necessarily it will beasymptotically free almost everywhere with respect to any other independentmatrix, be it random unitarily invariant or deterministic. On the opposite, twodeterministic matrices cannot be free as their eigenvectors point in deterministic

and therefore “correlated” directions.Unitarily invariant unitary matrices, often called Haar matrices, are animportant class of matrices, along with Gaussian matrices, for which freeprobability conveys quite a few results regarding their summation or product toGaussian matrices, deterministic matrices, etc. Haar matrices can be constructedand dened in the following denition-theorem.

Denition 4.6. Let X ∈C N ×N be a random matrix with independent Gaussianentries of zero mean and unit variance. Then the matrix W ∈C N ×N , dened as

W = X X H X −1

2

is uniformly distributed on the space U (N ) of N ×N complex unitary matrices.This random matrix is called a Haar matrix . Moreover, as N grows large, thee.s.d. F W of W converges almost surely to the uniform distribution on thecomplex unit circle.

For the applications to come, we especially need the following two corollariesof Theorem 4.5.

Corollary 4.1. Let T 1 , . . . , T K be N ×N Hermitian random matrices,and let W 1 , . . . , W K be Haar distributed independent from all T k . Assume that the empirical eigenvalue distributions of all T k converge almost surely

Page 105: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 105/562

4.3. Free probability and random matrices 81

toward compactly supported probability distributions. Then, the random matrices W 1T 1W H

1 , . . . , W K T K WHK are asymptotically free almost surely as N → ∞.

Corollary 4.2 (Proposition 4.3.9 of [Hiai and Petz, 2006 ]). Let A N and T N

be N ×N Hermitian random matrices, and let W N be a Haar distributed unitary random matrix independent from A N and T N . Assume that the empirical eigenvalue distributions of A N and of T N converge almost surely toward compactly supported probability distributions. Then, the random matrices A N

and W N T N WHN are asymptotically free almost surely as N → ∞.

The above propositions, along with the R- and S -transforms of Theorem 4.2and Theorem 4.3, will be used to derive the l.s.d. of some random matrix models

for which the Stieltjes transform approach is more painful to use.The R-transform and S -transform introduced in the denition-theorem,

Theorem 4.2, and the denition-theorem, Theorem 4.3, respectively, can beindeed redened in terms of transforms of the l.s.d. of random matrices, as theywere already quickly introduced in Denition 3.3 and Denition 3.4. In this case,Theorem 4.2 extends to the following result.

Theorem 4.6. Let A N ∈C N ×N and B N ∈C N ×N be two random matrices.If A N and B N are asymptotically free almost everywhere and have respective

(almost sure) asymptotic eigenvalue probability distribution µA and µB , then A N + B N has an asymptotic eigenvalue distribution µ, which is such that, if RA

and RB are the respective R-transforms of the l.s.d. of A N and B N

RA+ B (z) = RA (z) + RB (z)

almost surely, with RA + B the R-transform of the asymptotic eigenvalue distribution of A N + B N . The distribution µ is often denoted µA + B , and we write

µA + B = µA µB .

From the denition of the R-transform, we in particular have the followinguseful property.

Lemma 4.1. Let X ∈C N ×N be some Hermitian matrix and a ∈R . Then

Ra X (z) = aR X (az ).

Also, if X is random with limiting l.s.d. F , in the large N limit, the R-transform RF ( a ) of the l.s.d. F (a ) of the random matrix aX , for some a > 0, satises

RF ( a ) (z) = aR F (az ).

Page 106: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 106/562

82 4. Free probability theory

This is immediate from Lemma 3.2 and Denition 3.3. Indeed, applying thedenition of the R-transform at point az, we have:

mX RX (az ) + 1az = −az

while am a X (ay) = mX (y) from Lemma 3.2. Together, this is therefore:

am a X aR X (az ) + 1z

= −az.

Removing the a on each side, we have that aR X (az ) is the R-transform associatedwith the Stieltjes transform of aX , from Denition 3.3 again.

Similarly, Theorem 4.3 for the S -transform extends to the following result.

Theorem 4.7. Let A N ∈C N ×N and B N ∈C N ×N be two random matrices.If A N and B N are asymptotically free almost everywhere and have respective (almost sure) asymptotic eigenvalue distribution µA and µB , then A N B N has an asymptotic eigenvalue distribution µ, which is such that, if S A and S B are the respective S -transforms of the l.s.d. of A N and B N

S AB (z) = S A (z)S B (z)

almost surely, with S AB the S -transform of the l.s.d. of A N B N . The distribution µ is often denoted µAB and we write

µAB = µA µB .

An equivalent scaling result is also valid for the S -transform. Precisely, wehave the following lemma.

Lemma 4.2. Let X ∈C N ×N be some Hermitian matrix and let a be a non-zeroreal. Then

S a X (z) = 1a

S X (z).

If X is random with limiting l.s.d. F , then the S -transform S F ( a ) of the l.s.d.F (a ) of the random matrix aX satises

S F ( a ) (z) = 1a

S F (z).

This unfolds here from noticing that, by the (denition-)Theorem 4.3

S a X (z) = 1 + z

z ψ−1

a X (z)

with ψa X the ψ-transform of F a X . Now, by denition of the ψ-transform

ψa X (z) = azt1 −azt

dF X (t) = ψX (az )

Page 107: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 107/562

4.3. Free probability and random matrices 83

and therefore:

ψ−1a X (z) =

1a

ψ−1X (z)

which gives the result.Of interest to practical applications of the S -transform is also the following

lemma.

Lemma 4.3. Let A ∈C N ×n and B ∈C n ×N such that AB is Hermitian non-negative. Then, for z ∈C \ R +

S AB (z) = z + 1z + n

N S BA

N n

z .

If AB is random with l.s.d. and BA has l.s.d. F as N, n → ∞ with N/n → c,0 < c < ∞, then we also have

S F (z) = z + 1z + 1

cS F (cz) .

This result can be proved starting from the denition of the S -transform,Denition 3.4, as follows. Notice rst from Denition 3.4 that

ψAB (z) = −1 − 1z

mAB (z−1)

= nN −1 −

1z

mBA (z−1)

= nN

ψBA (z).

Now, by denition

ψAB z

1 + zS AB (z) = z

from which:

ψBA z

1 + zS AB (z) =

N n

z.

Taking ψ−1BA (with respect to composition) on each side, this leads to

S AB (z) = 1 + z

z ψ−1

BAN n

z

=N n + N

n z1 + N

n z1 + N

n zN n z

ψ−1BA

N n

z

=N n +

N n z1 + N

n z S BA N n z

which is the nal result.

Page 108: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 108/562

84 4. Free probability theory

In the next section, we apply the above results on the R- and S -transforms torandom matrix models involving Gaussian matrices, before considering modelsinvolving Haar matrices in the subsequent section.

4.4 Free probability for Gaussian matrices

In the following, we will demonstrate in a few lines the result of Theorem 3.13 onthe limiting distribution of B N = A N + X H

N T N X N in the case when T N = I N ,A N ∈C n ×n has uniformly bounded spectral norm and X N ∈C N ×n has Gaussianentries of zero mean and variance 1 /n . To begin with, we introduce some classicalresults on the R- and S -transforms of classical laws in random matrix theory.

Theorem 4.8. The semi-circle law and Marcenko–Pastur law have the following properties.

(i) The R-transform RF c (z) of the Marcenko–Pastur law F c with ratio c, i.e. the almost sure l.s.d. of X N X

HN , X N ∈C N ×n , with i.i.d. entries of zero mean and

variance 1/n , as N/n → c, whose density f c is given by (2.5), reads:

RF c (z) = 11 −cz

and the S -transform S F c (z) of F c reads:

S F c (z) = 11 + cz

.

Similarly, the R-transform RF c (z) of the complementary Marcenko–Pastur law F c , i.e. the almost sure l.s.d. of X H

N X N , reads:

RF c (z) = c1 −z

and the S -transform S F c (z) of F c is given by:

S F c (z) = 1c + z

.

(ii) The R-transform RF (z) of the semi-circle law F , with density f given by (2.4), reads:

RF (z) = z.

If the entries of X N are Gaussian distributed, from Corollary 4.2, A N andX H

N X

N are asymptotically free almost everywhere. Therefore, from Theorem4.6, almost surely, the R-transform RB of the l.s.d. of B N satises

RB (z) = RA (z) + RF c (z)

Page 109: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 109/562

4.4. Free probability for Gaussian matrices 85

with A the l.s.d. of A N . From the earlier Denition 3.3, Equation ( 3.8), of theR-transform

mB (z) = 1

RB (−mB (z)) −z

= 1

RA (−mB (z)) + c1+ m B (z ) −z

almost surely. This is equivalent to

RA (−mB (z)) + 1

−mB (z) = z −

c1 + mB (z)

almost surely.From Denition 3.3, Equation ( 3.7), taking the Stieltjes transform of A on

both sides

mB (z) = mA z − c

1 + mB (z)

which is consistent with Theorem 3.13. Slightly more work is required togeneralize this result to T N different from I N . This study is carried out in Section4.5 for the case where X N is a Haar matrix instead.

We present now one of the important results known so far concerning Gaussian-based models of deep interest for applications in wireless communications. In[Ryan and Debbah, 2007b ], Ryan and Debbah provide indeed a free probability

expressions of the free convolution and free deconvolution for the informationplus noise model, summarized as follows.

Theorem 4.9 ([Ryan and Debbah, 2007b ]). Let X N ∈C N ×n be a random matrix with i.i.d. Gaussian entries of zero mean and unit variance, and R N

a (non-necessarily random) matrix such that the eigenvalue distribution of R N = 1

n R N RHN converges weakly and almost surely to the compactly supported

probability distribution µR , as n, N → ∞ with limit ratio N/n → c > 0. Then the eigenvalue distribution of

B N = 1n (R N + σX N ) (R N + σX N )H

converges weakly and almost surely to the compact supported measure µB such that

µB = (( µR µc) δ σ 2 ) µc (4.7)

with µc the probability distribution with distribution function the Marcenko– Pastur law and δ σ 2 the probability distribution of a single mass in σ2 (with kthorder moment σ2k ). Equation (4.7) is the free convolution of the information plus noise model. This can be reverted as

µR = (( µB µc) δ σ 2 ) µc

which is the free deconvolution of the information plus noise model.

Page 110: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 110/562

86 4. Free probability theory

To understand the importance of the result above, remember for instance thatwe have already shown that matrix models of the type B N = A N + X N T N X

HN

or B N = ( A N + X N )(A N + X N )H , where X N has i.i.d. (possibly Gaussian)entries, can be treated by Stieltjes transform methods (Theorem 3.13 andTheorem 3.15), which provide a complete description of the l.s.d. of B N , throughits Stieltjes transform. However, while this powerful method is capable of treatingthe very loose case of X N with i.i.d. entries, there does not yet exist a unifyingframework that allows us to derive easily the limit of the Stieltjes transform formore involved models in which successive sums, products and information plusnoise models of i.i.d. matrices are taken. In the Stieltjes transform approach,every new model must be dedicated a thorough analysis that consists in (i)deriving an implicit equation for the Stieltjes transform of the l.s.d., (ii) ensuring

that the implicit equation has a unique solution, (iii) studying the convergenceof xed-point algorithms that solve the implicit equations. Consider for instancethe model

B N = HT12 X + σW HT

12 X + σW

H

(4.8)

where H , X , and W are independent and all have i.i.d. Gaussian entries, T isdeterministic and σ > 0. This model arises in wireless communications when aGaussian signal process T

12 X ∈C n t ×L is sent from an nt -antenna transmitter

during L sampling instants through a MIMO nr

×n t channel H

∈C n r ×n t with

additive white Gaussian noise W ∈C n r ×L with entries of variance σ2 . Thematrix X is assumed to have i.i.d. Gaussian entries and therefore the nt -dimensional signal vector has transmit covariance matrix T ∈C n t ×n t . In theframework of free probability, the model ( 4.8) is not difficult to treat, since itrelies on elementary convolution and deconvolution operation. Precisely, it is thesuccession of the multiplicative free convolution of the l.s.d. of X and T

12 and the

information plus noise-free convolution of the l.s.d. of T12 X and σW . As we can

already guess from the denitions of the free probability framework and as wewill see in detail in Chapter 5, this problem can also be treated from elementary

combinatorics computations based on the successive moments of the d.f. understudy; these computations will be shown in particular to be easily implementedon a modern computer. The analysis of model ( 4.8) by the Stieltjes transformapproach is more difficult and has been treated in [Couillet et al., 2011c], in whichit is proved that mF (z), z ∈C + , the Stieltjes transform of the l.s.d. F of B N , isthe unique solution with positive imaginary part of a given implicit equation, seeTheorem 17.5. However, computer trials suggest that the region of convergence of the xed-point algorithm solving the implicit equation for mF (z) has a very smallradius, which is very unsatisfactory in practice, therefore giving more interest tothe combinatorial approach. Moreover, the complete derivation of the limitingspectrum of B N requires a similar mathematical machinery as in the proof of the Marcenko–Pastur, this being requested for every new model. These are twostrong arguments that motivate further investigations on automated numerical

Page 111: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 111/562

4.5. Free probability for Haar matrices 87

methods based on combinatoric calculus. Nonetheless, the Stieltjes transformapproach is able to treat the case when X has i.i.d. non-necessarily Gaussian entries, e.g. M -QAM, M -PSK modulated signals, which are not proved to satisfythe requirement demanded in the free probability approach. Moreover, as willbecome clear in Part II, the simplicity and the exibility of the combinatoricmethods are rarely any match for the accuracy and efficiency of the Stieltjestransform approach.

In the following section, we derive further examples of applications of Theorem4.6 and Theorem 4.7 to the free probability framework introduced in this sectionfor Haar random matrices.

4.5 Free probability for Haar matricesWe start with the introduction of a very simple model, from which the applicationof the R-transform and the S -transform are rather straightforward. Considerthe matrix model B N = W N T N W

HN + A N , where A N ∈C N ×N , T N ∈C n ×n

and W N ∈C N ×n is a Haar random matrix. Note that this model is the Haarequivalent of Theorem 3.13.

From Corollary 4.2, the matrices A N and W N T N WHN are asymptotically free

almost everywhere and therefore we can consider retrieving the l.s.d. of B N asa function of those of A N and W N T N W

HN using the R-transform.

For the problem at hand, denoting F the l.s.d. of B N , as N → ∞RF (z) = RT (z) + RA (z)

almost surely, with T the l.s.d. of T N and A the l.s.d. of A N . Indeed, W N beingunitary, the successive traces of powers of W N T N W

HN are also the successive

traces of powers of T N , so that the distribution of the free random variableW N T N W

HN is also the distribution of T N .

Consider the special case when both T N and A N have l.s.d. T with densityT (x) = 1

2 δ 0(x) + 12 δ 1(x), then:

mT (z) = 12

2z −1z(z −1)

which, from ( 3.8), leads to

RF (z) = z −1 −√ z2 + 1

zwhose Stieltjes transform can be computed into the form

mF (z) = − 1

z(z −2)

using (3.7). We nally obtain the l.s.d. using ( 3.2)

F (x) = 1 [0,2](x)1π

arcsin( x −1)

Page 112: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 112/562

88 4. Free probability theory

−0.5 0 0.5 1 1.5 2 2.50

0.5

1

1.5

2

Eigenvalues

D e n s i t y

Empirical eigenvalue distributionArcsinus law

Figure 4.1 Histogram of the eigenvalues of W N T N W HN + A N and the arcsinus law,

for N = 1000.

which is the arcsinus law, depicted in Figure 4.1.The same kind of reasoning can be applied to the product of asymptotically

free random variables, using the S -transform in place of the R-transform.For instance, the e.s.d. of B N = W N T N W HN A N admits an almost sure l.s.d.

F satisfying

S F (z) = S T (z)S A (z) a .s.

−→ 0

as N → ∞, with T and A the respective l.s.d. of T N and A N .From the R- and S -transform tools dened and used above, the l.s.d. of more

elaborate matrix models involving unitary matrices can be evaluated [Couilletand Debbah, 2009; Hachem, 2008]. From the above Corollary 4.1, for HaarW k matrices and deterministic T k matrices with limiting compactly supportedd.f., the family W 1T 1W H

1 , . . . , W K T K WHK is asymptotically free almost

everywhere. Moreover, for D a Hermitian matrix with compactly supportedlimited d.f., the family W k T k W H

k , D is asymptotically free almost everywhere,due to Corollary 4.2. This allows us to perform seamlessly the operations of S -and R-transforms of Theorem 4.6 and Theorem 4.7, respectively, for all matrixmodels involving random W k T k W H

k and deterministic D matrices. We hereafterprovide three applications of such large dimensional matrix models.

On the R-transform side, we have the following result.

Theorem 4.10 ([Couillet and Debbah , 2009]). Let T 1 , . . . , T K ∈C N ×N be diagonal non-negative denite matrices and W 1 , . . . , W K , W k ∈C N ×N , be independent Haar matrices. Denote T k the l.s.d. of T k . Finally, let B N ∈C N ×N

Page 113: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 113/562

4.5. Free probability for Haar matrices 89

denote the matrix

B N =K

k=1

W k T k W H

k .

Then, as N grows large, the e.s.d. of B N tends to F whose η-transform ηF is given by:

ηF (x) = 1 + xK

k=1

β k (x)−1

where the functions β k (x), k ∈ 1, . . . , K , satisfy the K xed-point equations

β k (x) = t 1 + x t

−β k (x)

1 + x K i =1 β i (x)

−1

dT k (t). (4.9)

Also, the Shannon transform VF (x) of F is given by:

VF (x) = log 1 + xK

k =1

β k (x) +K

k=1 log(1 + xη(x)[t −β k (x)])dT k (t).

(4.10)

This result can be used to characterize the performance of multi-cellular

orthogonal CDMA communications in frequency at channels, see, e.g., [Couilletand Debbah, 2009; Peacock et al., 2008].

Proof. From the R-transform Denition 3.3 and the η-transform Denition 3.5,for a given distribution function G, we have:

RG (−xηG (x)) = −1x

1 − 1ηG (x)

(4.11)

ηG − 1

RG (x) + 1x

= xR G (x) + 1 . (4.12)

Denoting RF the R-transform of F and Rk the R-transform of the l.s.d. of W k T k W H

k , we have from Equation (4.12) that

xR k (x) + 1 = 11 − t

R k (x )+ 1x

dT k (t)

which is equivalent to

Rk (x) = 1x t

Rk (x) + 1x −t

dT k (t).

Evaluating the above in −xηF (x) and denoting β k (x) = Rk (−xηF (x)), we have:

β k (x) = 1x t

1 −xηF (x)β k (x) + xηF (x)tdT k (t).

Page 114: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 114/562

90 4. Free probability theory

Remember now from Corollary 4.1 that the matrices W k T k W Hk are

asymptotically free. We can therefore use the fact that

RF (−xηF (x)) =K

k=1

Rk (−xηF (x)) =K

k=1

β k (x)

from Theorem 4.6. From Equation (4.11), we nally have

K

k =1

β k (x) = −1x

1 − 1ηF (x)

which is the expected result.To obtain ( 4.10), notice from Equation ( 3.4) that the Shannon transform can

be expressed as

VF (x) = x

0

1t (1 −ηF (t)) dt.

We therefore seek an integral form for the η-transform. For this, the strategyis to differentiate logarithm expressions of the ηF (x) and β k (x) expressionsobtained above and seek for a link between the derivatives. It often turns out thatthis strategy leads to a compact expression involving only the aforementionedlogarithm expressions.

Note rst that

1x

(1 −ηF (x)) = 1x

1 − 1 + xK

k =1

β k (x)−1

=K

k=1

β k (x)ηF (x).

Also note from ( 4.9) that

1 −xηF (x)β k (x) = 1−xηF (x)β k (x)1 −xηF (x)β k (x) + xηF (x)t

dT k (t)

and therefore that1 = 1

1 −xηF (x)β k (x) + xηF (x)tdT k (t). (4.13)

Now, for any k, the derivative along x of

C k (x) log(1 −xηF (x)β k (x) + xηF (x)t)dT k (t)

is

C k (x)= [−ηF (x)β k (x) −xηF (x)β k (x) −xηF (x)β k (x)] + [ηF (x) + xηF (x)]t

1 −xηF (x)β k (x) + xηF (x)t dT k (t).

Page 115: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 115/562

4.5. Free probability for Haar matrices 91

Recalling now ( 4.13) and ( 4.9), this yields

C k (x)

= −ηF (x)β k (x) −xηF (x)β k (x) −xηF (x)β k (x) + ( ηF (x) + xηF (x))β k (x)= −xηF (x)β k (x).

We also have

log 1 + xK

k=1

β k (x) = ηF (x)K

k=1

β k (x) + xK

k =1

ηF (x)β k (x).

Adding this last expression to

K k=1 C k (x), we end up with the desired

K k=1 β k (x)ηF (x). Verifying that VF (0) = 0, we nally obtain ( 4.10).

Note that the strategy employed to obtain the Shannon transform is a verygeneral approach that is very effective for a large range of models. In the case of the sum of Gram matrices of doubly correlated i.i.d. matrices, the Gram matrix of sums of doubly correlated i.i.d. matrices or matrices with a variance prole, thisapproach was identically used to derive expressions for the Shannon transform,see further Chapter 6.

As for S -transform related derivations, we have the following result.

Theorem 4.11 ([Hachem , 2008]). Let D ∈CN

×N

and T ∈CN

×N

be diagonal non-negative matrices, and W ∈C N ×N be a Haar matrix. Denote D and T the respective l.s.d. of D and T . Denote B N the matrix

B N = D12 WTW H D

12 .

Then, as N grows large, the e.s.d. of B N converges to F whose η-transform ηF (x) satises

ηF (x) =

(xγ (x)t + 1) −1 dD (t)

γ (x) = t (ηF (x) + xδ (x)t)−1 dT (t)

δ (x) = t (xγ (x)t + 1) −1 dD (t)

and whose Shannon transform VF (x) satises

VF (x) = log (1 + xγ (x)t) dD (t) + log(ηF (x) + xδ (x)t)dT (t).

This last result is proved similarly to Theorem 4.10, using the S -transformidentity

S F (x) = S D (x)S T (x) (4.14)

Page 116: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 116/562

92 4. Free probability theory

emerging from the fact that D and WTW H are asymptotically free randommatrices (from Corollary 4.2) and that the e.s.d. of WTW H is the e.s.d. of thematrix T .

Proof. The proof requires the introduction of the functional ζ (z) dened as

ζ (z) ψ(z)1 + ψ(z)

with ψ(z) introduced in Denition ( 3.6). It is shown in [Voiculescu, 1987 ] that ζ is analytical on C + . We denote its analytical inverse ζ −1 . From the denition of the S -transform as a function of ψ, we nally have S (z) = 1

z ζ −1(z).From the almost sure asymptotic freeness of D and WTW H , and from ( 4.14),

we have:

ζ −1F (z) = ζ −1

D (z)S T (z).

Hence, replacing z by ζ F (−z)

−z = ζ −1D (ζ F (−z))S T (ζ F (−z))

which, according to the denition of ζ F , gives

ζ −1D

ψF (−z)1 + ψF (−z)

= −z

S T ψF (−z )1+ ψ F (

−z )

.

Taking ψD on both sides, this is:

ψF (−z) = ψD −z

S T ψF (−z )1+ ψ F (−z )

(4.15)

since ζ −1D maps x

1+ x to ψ−1D (x), and hence ψF (−z )

1+ ψ F (−z ) to ψ−1D (ψF (−z)).

Denoting

γ (z) = 1

S T ψF (−z )1+ ψ F (−z )

this is

ψF (−z) = ψD (−zγ (z)) .

Computing ψF 1+ ψ F

and taking ζ −1T of the resulting expression, we obtain

ζ −1T

ψF (−z)1 + ψF (−z)

= 1γ (z)

ψF (−z)1 + ψF (−z)

.

Taking ψT on both sides, this is nally

ψF (−z) = ψT 1γ (z)

ψF (−z)1 + ψF (−z)

. (4.16)

Page 117: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 117/562

4.5. Free probability for Haar matrices 93

To fall back on the system of equations, notice that ηF (z) = 1 −ψF (−z).Equation ( 4.15) becomes

ηF (z) = ηD (zγ (z))while Equation ( 4.16) leads to

γ (z) = tηF (z) − ψD (−zγ (z ))

γ (z ) tdT (t)

which are exactly the desired results with δ (z) dened as

δ (z) = ψD (−zγ (z))

−zγ (z) .

Putting together Theorem 4.10 and Theorem 4.11, we can show the followingmost general result.

Theorem 4.12. Let D ∈C N ×N , T 1 , . . . , T K ∈C N ×N be diagonal non-negative denite matrices, and W k ∈C N ×N , k ∈ 1, . . . , K , be independent Haar matrices. Denote D , T k the l.s.d. of the matrices D and T k , respectively. Finally,denote B N ∈C N ×N the matrix

B N =K

k=1

D12 W k T k W H

k D12 .

Then, as N grows large, the e.s.d. of B N tends to the F whose η-transform ηF

is given by:

ηF (x) = 1 + xδ (x)K

k=1

γ k (x)−1

where γ 1 , . . . , γ K and δ are solutions to the system of equations

δ (x) = t 1 + xtK

k=1

γ k (x)−1

dD (t)

γ k (x) = t (1 −xδ (x)γ k (x) + xδ (x)t)−1 dT k (t).

Also, the Shannon transform VF (x) is given by:

VF (x) =

log 1 + tx

K

k=1

γ k (x) dD (t)

+K

k=1 log (1 −xδ (x)γ k (x) + xδ (x)t) dT k (t).

Page 118: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 118/562

94 4. Free probability theory

To the best of our knowledge, this last result is as far as free probabilityreasoning can go. In particular, if the products D

12 W i T

12i were replaced by

the more general D12

i W i T

12

i , then it is impossible to use free probability

results any longer, as the family D12i W i T i W

Hi D

12i , i ∈ 1, . . . , K is no longer

asymptotically free almost everywhere. To deal with this model, which isfundamental to the study of multi-cellular frequency selective orthogonal CDMAsystems, we will show in Section 6.2.4 that the Stieltjes transform approach isour only asset. The Stieltjes transform approach is however much less direct andrequires more work than the free probability framework, although it is muchmore exible.

In this chapter, we introduced free probability theory and its applications tolarge dimensional random matrices through a denition involving joint momentsof spectral distributions with compact supports. Apart from the above analyticalresults based on the R-transform and S -transform, the common use of freeprobability theory for random matrices is related to the study of successivemoments of distributions, which is in fact more appealing when the modelsunder study are less tractable through analytic approaches. This is the subjectof the following chapter, which introduces the combinatoric moment methodsand which goes beyond the strict free probability framework to deal with morestructured random matrix models enjoying some symmetry properties. To thisday, these models have not been addressed through any analytical method but,

due to their symmetric structure, combinatorics calculus can be performed.

Page 119: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 119/562

5 Combinatoric approaches

We start the discussion regarding moment-based approaches by mentioning thewell-known method of moments , which is a tool aimed at retrieving distributionfunctions based on moments from a classical probability theory point of view.This method of moments will be opposed to the moment-based methods ormoment methods which is the center of interest of this chapter since ittargets specically combinatoric calculus in non-commutative algebras of randommatrices.

5.1 The method of moments

In this section, we will discuss the technique known as the method of moments

to derive a probability distribution function from its moments, see, e.g., [Baiand Silverstein , 2009; Billingsley, 1995 ]. When we are able to infer the limitingspectral distribution of some matrix model, although not able to prove it throughclassical analytic approaches, it is possible under certain conditions to deriveall successive limiting moments of the distribution instead. Under Carleman’scondition, to be introduced hereafter, these moments uniquely determine thelimiting distribution.

We start by introducing Carleman’s condition.

Theorem 5.1. Let F be a distribution function, and denote M 1, M

2, . . . its

sequence of moments which are assumed to be all nite. If the condition

k =1

M − 12 k

2k = ∞ (5.1)

is fullled, then F is uniquely determined by the sequence M 1 , M 2 , . . . .

Therefore, if we only have access to the moments M 1 , M 2 , . . . of somedistribution F and that Carleman’s condition is met, then F is uniquelydetermined by its moments.

In the specic case where M 1 , M 2 , . . . are the moments of the l.s.d. of largeHermitian matrices, we will need the following moment convergence theorem .

Page 120: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 120/562

96 5. Combinatoric approaches

Theorem 5.2 (Lemma 12.1 and Lemma 12.3 of [Bai and Silverstein, 2009] ).Let F 1 , F 2 , . . . be a sequence of distribution functions, such that, for each n, F nhas nite moment M n,k of order k for all k, with M n,k

→ M k <

∞ as n

→ ∞.

Assume additionally that the sequence of moments M 1 , M 2 , . . . fullls Carleman’s condition (5.1). Then F n converges to the unique distribution function F with moments M k .

To prove that the e.s.d. of a given sequence of Hermitian random matricestends to a limit distribution function, all that is needed to show is the convergenceof the empirical moments to limiting moments that meet Carleman’s condition.It then suffices to match these moments to some previously inferred l.s.d., or totry to determine this l.s.d. directly from the moments. In the following, we give

the main steps of the proof of the semi-circle law of Theorem 2.11, which wepresently recall.Consider a Hermitian matrix X N ∈C N ×N , with independent entries 1√ N X N,ij

such that E[ X N,ij ] = 0, E[|X N,ij |2] = 1 and there exists ε such that the X N,ij

have a moment of order 2 + ε. Then F X N

⇒ F almost surely, where F has densityf dened as

f (x) = 12π (4 −x2)+ .

Moreover, if the X N,ij are identically distributed, the result holds without the

need for the existence of a moment of order 2 + ε.

Proof of Theorem 2.11. We wish to show that, on a space A ⊂ Ω of probabilityone, the empirical moments M N, 1(ω), M N, 2(ω), . . . , ω ∈ A, of the e.s.d. of arandom Wigner matrix with i.i.d. entries converge to the moments M 1 , M 2 , . . .of the semi-circle law, and that these moments satisfy Carleman’s condition. Themoments of the semi-circle law are computed as

M 2k +1 = 12π 2

−2x2k +1 4 −x2dx = 0

M 2k = 12π

2

−2x2k

4 −x2dx = 1k + 1

2kk

without difficulty, by the change of variable x = 2√ y. The value of M 2k is thekth Catalan number . Using Stirling’s approximation formula, for k large

M 2k − 12 k = ( k + 1)

12 k

(k!)1k

((2k)!) 12 k ∼

(k + 1) 12 k

12

(√ 2πk )1k

(√ 4πk ) 12 k

which tends to 12 . Therefore, there exists k0 > 0, such that k ≥ k0 implies

M − 12 k

2k > 14 , which ensures that the series

k M 2k − 1

2 k diverges, and Carleman’scondition is satised.

The rest of this proof follows from [Anderson et al., 2006, 2010; Bai andSilverstein, 2009] , where more details are found. The idea is rst to use

Page 121: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 121/562

5.1. The method of moments 97

truncation, centralization, and rescaling steps as described in Section 3.2 so tomove from random entries of the Wigner matrix with no moment constraintto random entries with a distribution that admits moments of all orders. Byshowing that it is equivalent to prove the convergence of F X N for either X N,ij

or their truncated versions, we can now work with variables with momentsof all orders. We need to show that E[ M N,k ] M N,k (ω)dP (ω) → M k and

N E[(M N,k −M k )2] < ∞. This will be sufficient to ensure the almost sureconvergence of the empirical moments to the moments of the semi-circle law,by applying the Markov inequality, Theorem 3.5, and the rst Borel–Cantellilemma, Theorem 3.6.

To determine the limit of E[ M N,k ], we resort to combinatorial and particularlygraph theoretical tools. It is required indeed, when developing the trace of X k

N

E[M N,k ] = 1N

E[tr( X kN )]

= 1N

p∈1,...,N k

EX N,p 1 p2 X N,p 2 p3 . . . X N,p k p1

with pi the ith entry of p , to be able to determine which moments of the X N,ij

are non-negligible in the limit. These graph theoretical tools are developed indetail in [Bai and Silverstein , 2009] and [Anderson et al., 2006]. What arises inparticular is that all odd moments vanish, while it can be shown that, for the

even moments

E[M N, 2k ] = (1 + O(1/N )) 1N

k−1

j =0

E[M N, 2j ]E[M N, 2( k−j −1) ]

which establishes an asymptotic recursive equation for E[ M N, 2k ]. For k = 1, wehave E[M N, 2] → 1. Finally, by comparing the recursive formula to the denitionof the Catalan numbers C k , given later in ( 5.5), we obtain the required result

EM N, 2k → C k .

The previous proof provided the successive moments of the Wigner semi-circlelaw, recalled in the following theorem.

Theorem 5.3. Let F be the semi-circle law distribution function with density

f given by

f (x) = 12π (4 −x2)+ .

Page 122: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 122/562

98 5. Combinatoric approaches

This distribution is compactly supported and therefore has successive moments M 1 , M 2 , . . . , given by:

M 2k +1 = 0,

M 2k = 1k + 1

2kk

.

This general approach can be applied to prove the convergence of the e.s.d. of a random matrix model to a given l.s.d. However, this requires to know a priorithe limit distribution sought for. Also, it assumes the existence of momentsof all orders, which might already be a stringent assumption, but truncation,centralization, and rescaling steps can be performed prior to using the methodof moments to alleviate this condition as was performed above. To the best of our knowledge, for more general random matrix models than Wigner or Wishartmatrices, the method of moments leads to rather involved calculus, from whichnot much can be inferred. We therefore close the method of moments parenthesishere and will never mention it again.

We now move to the moment methods originating from the free probabilityframework, which are concerned specically with random matrix models. Westart with an introduction of the free moments and free cumulants of randomnon-commutative variables and random matrices.

5.2 Free moments and cumulants

Remember that we mentioned that Theorem 4.1 is a major result since it allowsus to derive the successive moments of the l.s.d. of products of free randommatrices from the moments of the l.s.d. of the individual random matrices. Forinstance, for free random variables A and B, we have:

φ(AB ) = φ(A)φ(B )φ(ABAB ) = φ(A2)φ(B )2 + φ(A)2φ(B 2) −φ(A)2φ(B )2

φ(AB 2A) = φ(A2)φ(B 2).

Translated in terms of traces of random matrices, we can compute the so-called free moments , i.e. the moments of the (almost sure) l.s.d., of sums and productsof random matrices as a function of the free moments of the operands.

It is then possible to derive the limiting spectrum of the sum of asymptoticallyfree random matrices A N and B N from the sequences of the free moments of thel.s.d. of A N + B N . Theorem 4.2 has already shown that A N + B N is connectedto A N and B N through their respective R-transforms. Denote A and B two non-commutative random variables, with d.f. the l.s.d. of A N and B N , respectively.

Page 123: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 123/562

5.2. Free moments and cumulants 99

The formal R-transform series of A can be expressed as

RA (z) =∞

k =1

C k zk−1 (5.2)

where C k is called the kth order free cumulant of A. From Theorem 4.2, wetherefore have

RA + B (z) =∞

k=1

[C k (A) + C k (B )]zk−1

with C k (A) and C k (B ) the respective free cumulants of the l.s.d. of A N andB N . In the same way as cumulants of independent random variables add up inclassical probability theory, free cumulants of free random variables add up in

free probability theory. This summarizes into the following result from Voiculescu[Voiculescu , 1986].

Theorem 5.4. Let µA and µB be compactly supported probability distributions,with respective free cumulants C 1(A), C 2(A), . . . and C 1(B ), C 2(B ), . . . . Then the free cumulants C 1(A + B ), C 2(A + B ), . . . of the distribution µA + B µA µB

satisfy

C k (A + B ) = C k (A) + C k (B ).

We recall that, in reference to classical probability theory, the binary additiveoperation ‘ µA µB ’ is called the free additive convolution of the distributionsµA and µB . We equivalently dened the binary operation ‘ µA µB ’ as the free additive deconvolution of µB from µA .

The cumulants of a given distribution µA , with bounded support, can becomputed recursively using Theorem 4.2. Equating coefficients of the terms inzk allows us to derive the kth order free cumulant C k of A as a function of therst k free moments M 1 , . . . , M k of A. In particular, the rst three cumulantsread

C 1 = M 1C 2 = M 2 −M 21C 3 = M 3 −3M 1M 2 + 2 M 31 .

It is therefore possible to evaluate explicitly the successive moments of theprobability distribution µA + B = µA µB of the sum of the random free variablesA and B from the successive moments of µA and µB . The method comes asfollows: given µA and µB , (i) compute the moments and then the cumulantsC k (A) of A and C k (B ) of B, (ii) compute the sum C k (A + B ) = C k (A) + C k (B ),(iii) from the cumulants C k (A + B ), retrieve the corresponding moments of A + B , from which the distribution µA + B , assumed of compact support, canbe found. This approach can be conducted in a combinatorial way, using non-crossing partitions . The method is provided by Speicher [Speicher, 1998], which

Page 124: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 124/562

100 5. Combinatoric approaches

simplies many free probability calculus introduced by Voiculescu. In particular,the method described above allows us to recover the moments of A + B from thecumulants of A and B in the following result.

Theorem 5.5. Let A and B be free random variables with respective cumulants

C k (A) and C k (B ). The nth order free moment M n (A + B ) of the random variable A + B reads:

M n (A + B ) =π∈NC( n ) V ∈π

C |V |(A) + C |V |(B )

with NC(n) the set of non-crossing partitions of 1, . . . , n and |V | the cardinality of the subset V in the non-crossing partition partition π.

Setting B = 0, we fall back on the relation between the momentsM 1(A), M 2(A), . . . and the cumulants C 1(A), C 2(A), . . . of A

M n (A) =π∈NC( n ) V ∈π

C |V |(A) (5.3)

introduced in Theorem 4.2.

Remark 5.1. Note that, since the limiting distribution functions under study arecompactly supported, from Theorem 3.3, the Stieltjes transform mF A of the d.f.

F A

associated with µA can be written, for z ∈C+

in the convergence region of the series

mF A (z) = −∞

k=0

M k z−k−1

where M k is the kth order moment of F A , i.e. the kth order free moment of A. There therefore exists a strong link between the Stieltjes transform of thecompactly supported distribution function F A and the free moments of the non-commutative random variable A.

Contrary to the Stieltjes transform approach, though, the combinatorialmethod requires to compute all successive cumulants, or a sufficient numberof them, to better estimate the underlying distribution. Remember though that,for compactly supported distributions, M k /k ! vanishes fast for large k and thenan estimate of only a few rst moments might be good enough. This is inparticular convenient when the studied distribution µ consists of K masses. If so,assuming the respective weights of each mass are known, the rst K cumulantsare sufficient to evaluate the full distribution function. Indeed, in the specialcase of evenly distributed masses, given M 1 , . . . , M K the rst K moments of the

distribution, µ(x) = 1K

K

i =1 δ (x −λ i ) where λ1 , . . . , λ K are the K roots of thepolynomial

X K −Π1X K −1 + Π 2X K −2 −. . . + ( −1)K ΠK

Page 125: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 125/562

5.2. Free moments and cumulants 101

where Π1 , . . . , Πn are the elementary symmetric polynomials , recursivelycomputed from the Newton–Girard formula

(−1)K K ΠK +K

i=1(−1)K + i M i ΠK −i = 0. (5.4)

See [Seroul, 2000] for more details on the Newton–Girard formula.Similar relations hold for the product of free random variables. In terms of

moments and non-crossing partitions, the result is provided by Nica and Speicher[Nica and Speicher, 1996] in the following theorem.

Theorem 5.6. Let A and B be free random variables with respective free

cumulants C k (A) and C k (B ). Then the nth order free moment M n (AB )of the random variable AB reads:

M n (AB ) =(π 1 ,π 2 )∈NC( n ) V 1∈π 1

V 2∈π 2

C |V 1 |(A)C |V 2 |(B ).

This formula enables the computation of all free moments of the distribution of AB from the free moments of µA and µB . We recall that the product distributionis denoted µA µB and is called multiplicative free convolution . Reverting thepolynomial formulas in the free cumulants enables also the computation of thefree moments of µA from the free moments of µAB and µB . This therefore allowsus to recover the distribution of the multiplicative free deconvolution µAB µB .

Before debating the important results of free moments and cumulants forwireless communications, we shortly introduce non-crossing partitions and therelations between partitions in classical probability theory and non-crossingpartitions in free probability theory.

In (classical) probability theory, we have the following relation between thecumulants cn and the moments mn of a given probability distribution

mn = π∈P (n ) V ∈π

c|V |

where P (n) is the set of partitions of 1, . . . , n . For instance, P (3) iscomposed of the ve sets 1, 2, 3, 1, 2, 3, 1, 3, 2, 2, 3, 1and

1, 2, 3. The cardinality of P (n) is called the Bell number Bn , recursivelydened by

B0 = 1Bn +1 =

nk=0

nk Bk .

From the example above, B3 = 5.We recall for instance that the cumulant c1 = m1 is the distribution mean,

c2 = m2 −m21 is the variance, c3 is known as the skewness , and c4 is the kurtosis .

Page 126: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 126/562

102 5. Combinatoric approaches

Free probability theory provides the similar formula ( 5.3), where the sum isnot taken over the partitions of 1, . . . , n but over the non-crossing partitions of

1, . . . , n

. Non-crossing partitions are dened as follows.

Denition 5.1. Consider the set 1, . . . , n . The partition π ∈P (n) is said to benon-crossing if there does not exist a < b < c < d elements of 1, . . . , n (orderedmodulo n), such that both a, c ∈ π and b, d ∈ π.

Otherwise stated, this means that in a circular graph of the elements of

1, . . . , n where the elements V 1 , . . . , V |V | of a given π ∈P (n) are representedby |V | polygons, with polygon k connecting the elements of V k and such thatthe edges of polygon i and polygon j , i = j , never cross. This is depicted inFigure 5.1 in the case of π =

1, 3, 4

,

2

,

5, 6, 7

,

8

, for n = 8. The number

of such non-crossing partitions, i.e. the cardinality of NC( n), is known as theCatalan number , denoted C n , which was seen incidentally to be connected to themoments of the semi-circle law, Theorem 5.3. This is summarized and proved inthe following.

Theorem 5.7 ([Anderson et al., 2006]). The cardinality C n of NC(n), for n ≥ 1,satises the recursion equation

C 1 = 1 ,

C n =n

k =1

C n −k C k−1

and is explicitly given by:

C n = 1n + 1

2nn

. (5.5)

We provide below the proof of this result, which is rather short and intuitive.

Proof. Let π ∈ NC(n) and denote j the smallest element connected to 1 with

j = 1 if 1 ∈ π, e.g. j = 3 in Figure 5.1. Then necessarily both sets 1, . . . , j −1and j + 1 , . . . , n are non-crossing and, for xed link (1 , j ), the number of non-crossing partitions in 1, . . . , n is the product between the number of non-crossing partitions in the sets 1, . . . , j −1, i.e. C j −1 , and j + 1 , . . . , n , i.e.C n −j . We then have the relation

C n =n

j =1C j −1C n −j

as expected, along with the obvious fact that C 1 = 1. By recursion calculus, it

is then easy to see that the expression (5.5) satises this recursive equality.We now return to practical applications of the combinatorial moment

framework. All results mentioned so far are indeed of practical use for problems

Page 127: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 127/562

5.2. Free moments and cumulants 103

1

2

3

4

5

6

7

8

Figure 5.1 Non-crossing partition π = 1, 3, 4, 2, 5, 6, 7, 8 of NC(8).

related to large dimensional random matrices in wireless communication settings.Roughly speaking, we can now characterize the limiting eigenvalue distributionsfor sums and products of matrices involving random Gaussian matrices, randomunitary matrices, deterministic Hermitian matrices, etc. based on combinatorialcalculus of their successive free moments. This is particularly suitable when thefull limiting distribution is not required but only a few moments are needed,

and when the random matrix model involves a large number of such matrices.The authors believe that all derivations handled by free probability theory canbe performed using the Stieltjes transform method, although this requires morework. In particular, random matrix models involving unitary matrices are farmore easily handled using free probability approaches than Stieltjes transformtools, as will be demonstrated in the subsequent sections. In contrast, theapplication range of free probability methods is seriously limited by the needfor eigenvalue distributions to be compactly supported probability measuresand more importantly by the need for the random matrices under study to beunitarily invariant (or more exactly asymptotically free).

Let us for instance apply the moment-cumulant relations as an applicationof Theorem 4.9. Denote Bk the kth order moment of µB and Rk the kth ordermoment of µR . Equation ( 4.7) provides a relation between µB and µR underthe form of successive free convolution or deconvolution operations involvingin particular the Marcenko–Pastur law. From the moments of the Marcenko–Pastur law, Theorem 2.14, and the above free addition and free product theorems,Theorem 5.5 and Theorem 5.6, we can then obtain polynomial relations betweenthe Bk and the Rk . Following this procedure, Theorem 4.9 entails

B1 = R

1 + 1 ,

B2 = R2 + (2 + 2 c)R1 + (1 + c),B3 = R3 + (3 + 3 c)R2 + 3 cR2

1 + (3 + 9 c + 3 c2 + 3) R1 + (1 + 3 c + c2) (5.6)

Page 128: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 128/562

104 5. Combinatoric approaches

the subsequent Bk being long expressions that can be derived by computersoftware. As a matter of fact, moment-cumulant relations can be applied toany type of random matrix model involving sums, products, and informationplus noise models of asymptotically free random matrices, the proper calculusbeing programmable on modern computers. The moment framework that arisesfrom free probability theory therefore allows for very direct derivations of thesuccessive moments of involved random matrix models. In this sense, this is muchmore convenient than the Stieltjes transform framework which requires involvedmathematical tools to be deployed for every new matrix model.

In [Rao and Edelman, 2008] , in an attempt to fully exploit the above remark,Rao and Edelman developed a systematic computation framework for a class of random matrices, including special cases of (i) information plus noise models,

(ii) products and sums of Wishart matrices within themselves, or (iii) productsand sums of Wishart matrices with deterministic matrices. This class is denedby the authors as the class of algebraic random matrices . Roughly speaking, thisclass gathers all random Hermitian matrices X with l.s.d. F for which thereexists a bivariate complex-valued polynomial L(x, y ) satisfying

L(z, m F (z)) = 0 .

The class of algebraic random matrices is large enough to cover many practical

applications in wireless communications. However, it does not include even themost basic model X N T N X

HN , where X N has i.i.d. Gaussian entries and the

l.s.d. H of T N has a connected component . If H is a discrete sum of masses inλk , k = 1 , . . . , K , then X N T N X

HN is algebraic. Indeed, from Theorem 3.13, the

Stieltjes transform m(z) of the l.s.d. of X N T N XHN satises

m(z) c 1K

K

k=1

λk

1 + λk m(z) −z = 1

which, after multiplication on both sides by k (1 + λk m(z)), leads to a bivariatepolynomial expression in ( z, m (z)). This is not true in general when H has acontinuous support. Note also that several computer codes for evaluating freemoments of algebraic random matrices are provided in [Rao et al., 2008].

As repeatedly mentioned, the free probability framework is very limited inits application scope as it is only applied to large dimensional random matriceswith unitarily invariant properties. Nonetheless, the important combinatorialmachinery coming along with free probability can be efficiently reused to extendthe initial results on Gaussian and Haar matrices to more structured types of matrices that enjoy other symmetry properties. The next chapter introduces themain results obtained for these extended methods in which new partition setsappear.

Page 129: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 129/562

5.3. Generalization to more structured matrices 105

5.3 Generalization to more structured matrices

In wireless communications, research focuses mainly on Gaussian and Haarrandom matrices, but not only. Other types of random matrices, more structured,are desirable to study. This is especially the case of random Vandermondematrices, dened as follows.

Denition 5.2. The Vandermonde matrix V ∈C N ×n generated from the vector(α 1 , . . . , α n )T is the matrix with ( i, j )th entry V ij = α j −1

i

V =

1 1 . . . 1α 1 α2 . . . αn...

.

.. . . ....

α N −11 αN −1

2 . . . αN −1n

.

A random Vandermonde matrix is a normalized (by 1√ N ) Vandermonde matrixwhose generating vector ( α 1 , . . . , α n )T is a random vector.

In [Ryan and Debbah, 2009 ], the authors derive the successive moments of thee.s.d. of matrix models involving Vandermonde matrices with generating vectorentries drawn uniformly and independently from the complex unit circle. The

main result is as follows.

Theorem 5.8 ([Ryan and Debbah , 2009]). Let D 1 , . . . , D L be L diagonal matrices of size n ×n such that D i has an almost sure l.s.d. as n → ∞, for all i. Let V ∈C N ×n be a random Vandermonde matrix with generators α1 , . . . , α n

drawn independently and uniformly from the unit complex circle. Call α a random variable on [0, 2π) distributed as α1 . For ρ ∈P (L), the set of partitions of

1, . . . , L , dene K ρ,α,N as

K ρ,α,N = N |ρ|−L−1

[0,2π ) |ρ |

L

k=11 −e

jN (α b ( k

−1)

−α b ( k ) )

1 −ej (α b ( k −1) −α b ( k ) )

|ρ|

i=1dα i

with b(k) the index of the set of ρ containing k (since the αi are i.i.d., the set indexing in ρ is arbitrary). If the limit

K ρ,α = limN →∞

K ρ,α,N

exists, then it is called a Vandermonde mixed moment expansion coefficient. If it exists for all ρ ∈P (L), then we have, as N, n → ∞ with n/N → c, 0 < c < ∞

1n

tr L

i =1

D i VH V →

ρ∈P (L )

K ρ,α c|ρ|−1Dρ

Page 130: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 130/562

106 5. Combinatoric approaches

almost surely, where, for ρ = ρ1 , . . . , ρ K , we denote

Dρk = lim

n →∞1

ntr

i∈ρk

D i

and

Dρ =K

k=1

Dρk .

Contrary to the previous results presented for Gaussian random matrices,the extent of knowledge on the analytical approaches of random matrix theoryso far does not enable us to determine the l.s.d. of random Vandermonde

matrices in a closed-form. Only the aforementioned free probability approachis known to tackle this problem at this time. Note also that the support of the l.s.d. of such random Vandermonde matrices is not compact. It is thereforea priori uncertain whether the l.s.d. of such matrices can be determined bythe limiting moments. For this, we need to verify that Carleman’s condition,Theorem 5.1, is met. This has in fact been shown by Tucci and Whiting in[Tucci and Whiting, 2010] . For wireless communication purposes, it is possibleto evaluate the capacity of a random Vandermonde channel model under the formof a series of moments. Such a channel arises whenever signals emerging fromn sources impinge on an N -fold linear antenna array in line-of-sight. Assumingthe sources are sufficiently far from the sensing array, the signals emerging fromone particular source are received with equal amplitude by each antenna butwith phases rotated proportionally to the difference of the optical path lengths,i.e. proportionally to both the antenna index in the array and the sinus of the incoming angle. Therefore, calling di the power of signal source i at theantenna array and V the Vandermonde matrix with generating vector the nphases of the incoming signals, the matrix VD , D = diag( d1 , . . . , d n ), modelsthe aforementioned communication channel.

Since the moments M k of the (almost sure) l.s.d. of VDV H are only dependent

on polynomial expressions of the moments D1 , . . . , D k of the l.s.d. of D (in thenotations of Theorem 5.8, D 1 = . . . = D L so that Dk D1,...,k ), it is possibleto recover the Dk from the M k and hence obtain an estimate of Dk from thelarge dimensional observation VDV H . Assuming a small number of sources, theestimates of D1 , D 2 , . . . provide a further estimate of the respective distance of the signal sources. In [Ryan and Debbah , 2009], the rst moments for this setupare provided. Denoting M k cM k and Dk cDk , we have the relations

M 1 = D1

M 2 = D2 + D 21

M 3 = D3 + 3 D2 D1 + D 31

M 4 = D4 + 4 D3 D1 + 83

D 22 + 6 D2 D 2

1 + D 41

Page 131: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 131/562

5.3. Generalization to more structured matrices 107

M 5 = D5 + 5 D4 D1 + 25

3D3 D2 + 10 D3 D 2

1 + 40

3D 2

2 D1 + 10 D2 D 3

1 + D 51

from which D1 , D2 , . . . can be written as a function of M 1 , M 2 , . . . .The successive moments of other structured matrices can be studied similarly

for random Toeplitz or Hankel matrices. The moment calculus can in fact bedirectly derived from the moment calculus of Vandermonde matrices [Ryan andDebbah, 2011] . We have in particular the following results.

Theorem 5.9. Let X N ∈R N ×N be the Toeplitz matrix given by:

X N = 1√ N

X 0 X 1 X 2 . . . X N −2 X N −1

X 1 X 0 X 1 X N −2

X 2 X 1 X 0. . .

......

. . . X 2X N −2 X 0 X 1X N −1 X N −2 . . . X 2 X 1 X 0

with X 0 , X 1 , . . . real independent Gaussian with zero mean and unit variance.

Then the moment M k of order k of the l.s.d. of X N is given by M k = 0 for kodd and the rst even moments are given by:

M 2 = 1

M 4 = 83

M 6 = 11

M 8 = 1435

24 .

Remember for instance that slow fading frequency selective channels can bemodeled by Toeplitz matrices. When the matrices are Wiener-class [Gray, 2006],i.e. when the series formed of the elements of the rst row is asymptoticallysummable (this being in particular true when a nite number of elements arenon-zero), it is often possible to replace the Toeplitz matrices by circulantmatrices without modifying the l.s.d., see Theorem 12.1. Since circulant matricesare diagonalizable in the Fourier basis, their study is simpler, so that thisassumption is often considered. However, in many practical cases, the Wiener-class assumption does not hold so results on the l.s.d. of Toeplitz matrices areof major importance.

A similar result holds for Hankel matrices.

Page 132: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 132/562

108 5. Combinatoric approaches

Theorem 5.10. Let X N ∈R N ×N be the Hankel matrix dened by

X N = 1√ N

X 0 X 1 X 2 . . . X N −2 X N −1

X 1 X 2 X 3 X N

X 2 X 3 X 4. . .

......

. . . X 2N −2

X N −2 X 2N −2 X 2N −1

X N −1 X N . . . X 2N −2 X 2N −1 X 2N

with X 0 , X 1 , . . . real independent Gaussian with zero mean and unit variance.Then the free moment M k of order k of X N is given by M k = 0 for k odd and the rst even moments are given by:

M 2 = 1

M 4 = 83

M 6 = 14M 8 = 100 .

In general, the exact features that random matrices must fulll for resultssuch as Theorem 5.8 to be easily derived are not yet fully understood. It seemshowever that any random matrix X whose joint entry probability distribution is

invariant by left product with permutation matrices enters the same scheme asrandom Vandermonde, Toeplitz and Hankel matrices, i.e. the free moments of matrix products of the type L

k =1 D k XX H can be derived from the momentsof D 1 , . . . , D L . This is a very new, yet immature, eld of research.

As already stated in Chapter 4, free probabilistic tools can also be used in placeof classical random theoretical tools to derive successive moments of probabilitydistributions of large random matrices. We will mention here an additional usageof moment-based approaches on the results from free probability theory describedpreviously that allows us to obtain exact results on the expected eigenvaluedistribution of small dimensional random matrices, instead of the almost surel.s.d. of random matrices.

5.4 Free moments in small dimensional matrices

Thanks to the unitary invariance property of Gaussian matrices, standardcombinatorics tools such as non-crossing partitions allow us to further generalizemoment results obtained asymptotically, as the matrix dimensions grow large,to exact results on the moments of the expected e.s.d. of matrices for all xeddimensions. We have in particular the following theorem for small dimensionalWishart and deterministic matrix products.

Page 133: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 133/562

5.5. Rectangular free probability 109

Theorem 5.11 ([Masucci et al., 2011]). Let X N ∈C N ×n have i.i.d. standard Gaussian entries and T N be a (deterministic) N ×N matrix. For any positive integer p, we have:

E 1N

tr T N 1n

X N XHN

p

=π∈

P ( p)

nk ( π )− pN l( π )−1T π |odd (5.7)

where π ∈P (2 p) is the permutation such that

π (2 j −1) = 2 π−1( j ), j ∈ 1, 2, . . . , p π (2 j ) = 2 π( j ) −1, j ∈ 1, 2, . . . , p .

Every such π is attached the equivalence relation ∼π , dened as

j

∼π π ( j ) + 1 .

In (5.7), π|odd is the set consisting in the equivalence classes/blocks of π which are contained within the odd numbers, k(ρ) is the number of blocks in ρ consisting of only even numbers, l(ρ) is the number of blocks in ρ consisting of only odd numbers, T ρ = k

i =11N tr T |ρ i |N whenever ρ = ρ1 , . . . , ρ k is a partition with

blocks ρi , and |ρi | is the number of elements in ρi .

An information plus noise equivalent to Theorem 4.9 is also provided in[Masucci et al., 2011] which requires further considerations of set partitions.

Both results arise from a generic diagrammatic framework to compute successivemoments of matrices invariant by row or column permutations.We complete this introduction on extended combinatorics tools with recent

advances in the study of rectangular random matrices from a free probabilityapproach.

5.5 Rectangular free probability

A recent extension of free probability theory for Hermitian random matrices tothe most general rectangular matrices has been proposed by Benaych-Georges[Benaych-Georges , 2009]. The quantity of interest is no longer the empiricaldistribution of the eigenvalues of square Hermitian matrices but the symmetrized singular law of rectangular matrices.

Denition 5.3. Let M ∈C N ×n be a rectangular random matrix on (Ω , F , P ).The singular law µ of M is the uniform distribution of its singular valuess1 , . . . , s min( n,N ) . The kth order moment M k of µ is dened as

M k = 1

min( n, N ) tr MM H

k2 .

The symmetrized singular law of M is the probability distribution ˜ µ such that,for any Borel set A ∈F , µ(A) = 1

2 (µ(A) + µ(−A)).

Page 134: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 134/562

110 5. Combinatoric approaches

We have similar results for rectangular matrices as for Hermitian matrices. Inparticular, we dene a rectangular additive free convolution operator ‘ c’, withc = lim N N/n , which satises the following.

Theorem 5.12. Let M 1 , M 2 be independent bi-unitarily invariant N ×nmatrices whose symmetrized singular laws converge, respectively, to the measures µ1 and µ2 , as n, N grow to innity with limit ratio N/n → c. Then the symmetrized singular law of M 1 + M 2 converges to a symmetric probability measure, dependent on µ1 , µ2 , and c only, and which we denote µ1 c µ2 .

The rectangular additive free convolution can be computed explicitly from an

equivalent rectangular R-transform which has the same property as in the squarecase of summing cumulants of convolution of symmetrized singular laws.

Theorem 5.13. For a given symmetric distribution µ, denote

Rµ,c (z)∞

n =1C 2n,c (µ)zn

where C 2n,c (µ) are the rectangular free cumulants of µ with ratio c, linked to the free moments M n (µ) of µ by

M n (µ) =π∈NC (2 n )

ce(π )

V ∈π

C |V |,c (µ)

where NC (2n) is the subset of non-crossing partitions of 1, . . . , 2n with all blocks of even cardinality and e(π) is the number of blocks of π with even cardinality.

Then, for two distributions µ1 and µ2 , we have:

Rµ 1 c µ 2 ,c (z) = Rµ 1 ,c (z) + Rµ 2 ,c (z).

This is as far as we will go with rectangular random matrices, which is also avery new eld of research in mathematics, with still few applications to wirelesscommunications; see [Gregoratti et al., 2010] for an example in the context of relay networks with unitary precoders.

To conclude the last three theoretical chapters, we recollect the differenttechniques introduced so far in a short conclusion on the methodology to adoptwhen addressing problems of random matrix theory. The methods to considerheavily depend on the application sought for, on the time we have to invest onthe study, and obviously on the feasibility of every individual problem.

Page 135: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 135/562

5.6. Methodology 111

5.6 Methodology

It is fundamental to understand why we would address a question regarding somerandom matrix model from the analytical or the moments approach.

Say our intention is to study the e.s.d. F X N of a given random Hermitianmatrix X N ∈C N ×N . Using the analytical methods, F X N will be treated as asystem parameter and will be shown to satisfy some classical analytic properties,such as: F X N has a weak limit F , the Stieltjes transform of F is solution of someimplicit equation, etc. The moment-based methods will focus on establishingresults on the successive moments M 1 , M 2 , . . . of F (or E[F X N ]) when they exist,such as: M k is linked to the moments M i , i = 1 , . . . , k , of another distribution F ,M k vanishes for k > 2 for growing dimensions of X N , etc. Both types of methods

will therefore ultimately give mutually consistent results. However, the choice of a particular method over the other is often motivated by the following aspects.

1. Mathematical attractiveness . Both methods involve totally differentmathematical tools and it often turns out that one method is preferableover the other in this respect. In particular, moment-based methods, whileleading to tedious combinatorial computations, are very attractive due to theirmechanical and simple way of working. In contrast, the analytical methods arenot so exible in some respects and are not yet able to solve many problemsalready addressed by different moment-based methods. For instance, inSection 5.3, we introduced results on large dimensional random Vandermonde,Toeplitz, Hankel matrices and bi-unitarily invariant rectangular matrices,which analytical methods are far from being able to provide. However, theconverse also holds: some random matrix models can be studied by theStieltjes transform approach, while moment-based methods are unusable. Thisis in fact the case for all random matrix models involving matrices withindependent entries that are often not unitarily invariant. It might also turnout that the moment-based methods are of no use for the evaluation of certainfunctionals of F . For instance, the fact that the series expansion of log(1 + x)

has convergence radius 1 implies that the moments of F do not allow us toestimate log(1 + xλ )dF (λ) for large x. The immediate practical consequenceis that most capacity expressions cannot be evaluated from the moments of F .

2. Application context . The most important drawback of the moment-basedmethods lies in their results consisting of a series of properties concerningthe individual or joint free moments. If we are interested in studying thelimiting distribution F of the e.s.d. of a given random matrix X N ∈C N ×N ,two cases generally occur: (i) F is a step function with K discontinuities, i.e.F has K distinct eigenvalues with large multiplicities, in which case results onthe rst M 1 , . . . , M K may be sufficient to obtain (or estimate) F completely(especially if the multiplicities are a priori known), although this estimateis likely to perform poorly if only K moments are used, and (ii) F is a non-

Page 136: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 136/562

112 5. Combinatoric approaches

trivial function, in which case all moments are in general needed to accuratelyevaluate it. In case (ii), moment-based methods are not desirable because alarge number of moments need to be estimated (this often goes along with highcomputational complexity) and because the moment estimates are themselvescorrelated according to some non-trivial joint probability distribution, which iseven more computationally complex to evaluate. Typically, a small error in theestimate of the moment of order 1 propagates into a larger error in the estimateof the moment of order 2, which itself propagates forward into higher momentestimates. These errors need to be tracked precisely to optimally exploit thesuccessive estimates. Analytical methods, if numerically solvable, are muchmore appealing in this case. In general, the result of these methods expressesas an approximation of mF by another Stieltjes transform, the approximation

being asymptotically accurate as the system dimensions grow large. Theanalytical approach will especially be shown often to rely on computationallyinexpensive xed-point algorithms with proven convergence, see, e.g. Chapters12–15, while moment-based methods require involved combinatorial calculuswhen a large number of moments has to be taken into account.

3. Interpretation purpose . In general, analytical methods provide verycompact expressions, from which the typical behavior of the relevant problemparameters can be understood. The example of the optimality of the water-lling algorithm in the capacity maximization problem is a typical case wherethis phenomenon appears, see Chapters 13–14. On the moment-based method

side, results appear in the form of lengthy combinatorial calculus, from whichphysical interpretation is not always possible.

This concludes the set of three chapters on the limiting results of largedimensional random matrices, using the Stieltjes transform, free probabilitytheory, and related methods using moments. The next chapter will bededicated to further extensions of the Stieltjes transform approach, which areof fundamental use in the applicative context of wireless communications. Therst of these approaches extends the limiting results of Theorems 3.13, 3.14,3.15, 4.10, 4.11, etc., to the case where the e.s.d. of the underlying matrices does

not necessarily converge and allows us to provide accurate approximations of theempirical Stieltjes transform for all nite matrix dimensions. This will be shownto have crucial consequences from a practical point of view, in particular for theperformance study of large dimensional wireless communication systems.


6 Deterministic equivalents

6.1 Introduction to deterministic equivalents

The first applications of random matrix theory to the field of wireless communications, e.g., [Tse and Hanly, 1999; Tse and Verdú, 2000; Verdú and Shamai, 1999], originally dealt with the limiting behavior of some simple random matrix models. In particular, these results are attractive as these limiting behaviors only depend on the limiting eigenvalue distribution of the deterministic matrices of the model. This is in fact the case of all the results we have derived and introduced so far; for instance, Theorem 3.13 unveils the limiting behavior of the e.s.d. of $B_N = A_N + X_N^H T_N X_N$ when both e.s.d. of $A_N$ and $T_N$ converge toward given deterministic distribution functions and $X_N$ is random with i.i.d. entries. However, for practical applications, it might turn out that:

(i) the e.s.d. of $A_N$ or $T_N$ do not necessarily converge to a limiting distribution;
(ii) even if the e.s.d. of the deterministic matrices in the model do all converge to their respective l.s.d., the e.s.d. of the output matrix $B_N$ might not converge. This is of course not the case in Theorem 3.13, but we will show that this may happen for more involved models, e.g. the models treated by [Couillet et al., 2011a] and [Hachem et al., 2007].

Let us introduce a simple scenario for which the e.s.d. of the random matrix does not converge. This example is borrowed from [Hachem et al., 2007]. Define $X_N \in \mathbb{C}^{2N\times 2N}$ as
$$X_N = \begin{pmatrix} \check{X}_N & 0 \\ 0 & 0 \end{pmatrix} \qquad (6.1)$$
with the entries of the $N\times N$ matrix $\check{X}_N$ being i.i.d. with zero mean and variance $\frac1N$. Consider in addition the matrix $T_N \in \mathbb{C}^{2N\times 2N}$ defined as
$$T_N = \begin{cases} \begin{pmatrix} I_N & 0 \\ 0 & 0 \end{pmatrix}, & N \text{ even} \\[2mm] \begin{pmatrix} 0 & 0 \\ 0 & I_N \end{pmatrix}, & N \text{ odd.} \end{cases} \qquad (6.2)$$
Then, taking $B_N = (T_N + X_N)(T_N + X_N)^H$, $F^{B_{2N}}$ and $F^{B_{2N+1}}$ both converge weakly towards limit distributions, as $N\to\infty$, but those distributions


[Figure 6.1: Histogram of the eigenvalues of $B_N = (T_N + X_N)(T_N + X_N)^H$ modeled in (6.1)–(6.2), for $N = 1000$ (top) and $N = 1001$ (bottom); both panels plot density against the eigenvalues.]

differ. Indeed, for $N$ even, half of the spectrum of $B_N$ is formed of zeros, while for $N$ odd, half of the spectrum of $B_N$ is formed of ones, the rest of the spectrum being a weighted version of the Marcenko–Pastur law. And therefore there does not exist a limit to $F^{B_N}$, while $F^{X_N X_N^H}$ tends to the uniformly weighted sum of the Marcenko–Pastur law and a mass in zero, and $F^{T_N T_N^H}$ tends to the uniformly weighted sum of two masses in zero and one. This is depicted in Figure 6.1.

In such situations, there is therefore no longer any interest in looking at the asymptotic behavior of the e.s.d. Instead, we will be interested in finding deterministic equivalents for the underlying model.
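The non-convergence in the example above is easy to reproduce numerically. The short script below is a minimal sketch (ours, not from the book; the function name `example_esd` and the chosen dimensions are purely illustrative) that builds $B_N$ for the model (6.1)–(6.2) and compares the empirical spectra for consecutive values of $N$.

```python
import numpy as np

def example_esd(N, rng):
    """Eigenvalues of B_N = (T_N + X_N)(T_N + X_N)^H for the model (6.1)-(6.2).

    X_N is 2N x 2N, zero except for its upper-left N x N block, whose entries
    are i.i.d. with zero mean and variance 1/N; T_N depends on the parity of N.
    """
    X = np.zeros((2 * N, 2 * N))
    X[:N, :N] = rng.standard_normal((N, N)) / np.sqrt(N)
    T = np.zeros((2 * N, 2 * N))
    if N % 2 == 0:
        T[:N, :N] = np.eye(N)   # N even: identity block on top of the random block
    else:
        T[N:, N:] = np.eye(N)   # N odd: identity block away from the random block
    H = T + X
    return np.linalg.eigvalsh(H @ H.T)

rng = np.random.default_rng(0)
for N in (500, 501):
    eig = example_esd(N, rng)
    print(f"N = {N}: fraction of eigenvalues at zero = {np.mean(eig < 1e-8):.2f}, "
          f"fraction exactly at one = {np.mean(np.abs(eig - 1) < 1e-8):.2f}")
```

For $N$ even roughly half of the spectrum collapses at zero, for $N$ odd roughly half sits at one, so the two empirical distributions cannot share a common limit.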


Definition 6.1. Consider a series of Hermitian random matrices $B_1, B_2, \ldots$, with $B_N \in \mathbb{C}^{N\times N}$, and a series $f_1, f_2, \ldots$ of functionals of $1\times 1$, $2\times 2, \ldots$ matrices. A deterministic equivalent of $B_N$ for the functional $f_N$ is a series $\bar B_1, \bar B_2, \ldots$, where $\bar B_N \in \mathbb{C}^{N\times N}$, of deterministic matrices, such that
$$f_N(B_N) - f_N(\bar B_N) \to 0$$
as $N\to\infty$, where the convergence will often be with probability one. Note that $f_N(B_N)$ does not need to have a limit as $N\to\infty$. We will similarly call $g_N \triangleq f_N(\bar B_N)$ the deterministic equivalent of $f_N(B_N)$, i.e. the deterministic series $g_1, g_2, \ldots$ such that $f_N(B_N) - g_N \to 0$ in some sense.

We will often take $f_N$ to be the normalized trace of $(B_N - zI_N)^{-1}$, i.e. the Stieltjes transform of $F^{B_N}$. When $f_N(B_N)$ does not have a limit, the Marcenko–Pastur method, developed in Section 3.2, will fail. This is because, at some point, all the entries of the underlying matrices will have to be taken into account and not only the diagonal entries, as in the proof we provided in Section 3.2. However, the Marcenko–Pastur method can be tweaked adequately into a technique that can cope with deterministic equivalents. In the following, we first introduce this technique, which we will call the Bai and Silverstein technique, and then discuss an alternative technique, known as the Gaussian method, which is particularly suited to random matrix models with Gaussian entries. Hereafter, we detail these methods by successively proving two (similar) results of importance in wireless communications, see further Chapters 13–14.

6.2 Techniques for deterministic equivalents

6.2.1 Bai and Silverstein method

We first introduce a deterministic equivalent for the model
$$B_N = \sum_{k=1}^K R_k^{\frac12} X_k T_k X_k^H R_k^{\frac12} + A$$
where the $K$ matrices $X_k$ have i.i.d. entries for each $k$, mutually independent for different $k$, and the matrices $T_1, \ldots, T_K$, $R_1, \ldots, R_K$ and $A$ are ‘bounded’ in some sense to be defined later. This is more general than the model of Theorem 3.13 in several respects:

(i) left product matrices $R_k$, $1 \leq k \leq K$, have been introduced. As an exercise, it can already be verified that a l.s.d. for the model $R_1^{\frac12} X_1 T_1 X_1^H R_1^{\frac12} + A$ may not exist even if $F^{R_1}$ and $F^{A}$ both converge vaguely to deterministic limits, unless some severe additional constraint is put on the eigenvectors of $R_1$ and $A$, e.g. $R_1$ and $A$ are codiagonalizable. This suggests that the Marcenko–Pastur method will fail to treat this model;


(ii) a sum of $K$ such models is considered ($K$ does not grow along with $N$ here);
(iii) the e.s.d. of the (possibly random) matrices $T_k$ and $R_k$ are not required to converge.

While the result to be introduced hereafter is very likely to hold for $X_1, \ldots, X_K$ with non-identically distributed entries (as long as they have common mean and variance and some higher order moment condition), we only present here the result where these entries are identically distributed, which is less general than the conditions of Theorem 3.13.

Theorem 6.1 ([Couillet et al., 2011a]). Let $K$ be some positive integer. For some integer $N$, let
$$B_N = \sum_{k=1}^K R_k^{\frac12} X_k T_k X_k^H R_k^{\frac12} + A$$
be an $N\times N$ matrix with the following hypotheses, for all $k \in \{1, \ldots, K\}$:

1. $X_k = \frac{1}{\sqrt{n_k}}(X_{k,ij}) \in \mathbb{C}^{N\times n_k}$ is such that the $X_{k,ij}$ are identically distributed for all $N$, $i$, $j$, independent for each fixed $N$, and $E|X_{k,11} - E X_{k,11}|^2 = 1$;
2. $R_k^{\frac12} \in \mathbb{C}^{N\times N}$ is a Hermitian non-negative definite square root of the non-negative definite Hermitian matrix $R_k$;
3. $T_k = \mathrm{diag}(\tau_{k,1}, \ldots, \tau_{k,n_k}) \in \mathbb{C}^{n_k\times n_k}$, $n_k \in \mathbb{N}^*$, is diagonal with $\tau_{k,i} \geq 0$;
4. the sequences $F^{T_1}, F^{T_2}, \ldots$ and $F^{R_1}, F^{R_2}, \ldots$ are tight, i.e. for all $\varepsilon > 0$, there exists $M > 0$ such that $1 - F^{T_k}(M) < \varepsilon$ and $1 - F^{R_k}(M) < \varepsilon$ for all $n_k$, $N$;
5. $A \in \mathbb{C}^{N\times N}$ is Hermitian non-negative definite;
6. denoting $c_k = N/n_k$, for all $k$, there exist $0 < a < b < \infty$ for which
$$a \leq \liminf_N c_k \leq \limsup_N c_k \leq b. \qquad (6.3)$$

Then, as all $N$ and $n_k$ grow large, with ratio $c_k$, for $z \in \mathbb{C}\setminus\mathbb{R}^+$, the Stieltjes transform $m_{B_N}(z)$ of $B_N$ satisfies
$$m_{B_N}(z) - m_N(z) \xrightarrow{\text{a.s.}} 0 \qquad (6.4)$$
where
$$m_N(z) = \frac1N \operatorname{tr}\left(A + \sum_{k=1}^K \int\frac{\tau_k\, dF^{T_k}(\tau_k)}{1 + c_k \tau_k e_{N,k}(z)}\, R_k - zI_N\right)^{-1} \qquad (6.5)$$
and the set of functions $e_{N,1}(z), \ldots, e_{N,K}(z)$ forms the unique solution to the $K$ equations
$$e_{N,i}(z) = \frac1N \operatorname{tr} R_i\left(A + \sum_{k=1}^K \int\frac{\tau_k\, dF^{T_k}(\tau_k)}{1 + c_k \tau_k e_{N,k}(z)}\, R_k - zI_N\right)^{-1} \qquad (6.6)$$
such that $\mathrm{sgn}(\Im[e_{N,i}(z)]) = \mathrm{sgn}(\Im[z])$, if $z \in \mathbb{C}\setminus\mathbb{R}$, and $e_{N,i}(z) > 0$ if $z$ is real negative.

Moreover, for any $\varepsilon > 0$, the convergence of Equation (6.4) is uniform over any region of $\mathbb{C}$ bounded by a contour interior to
$$\mathbb{C}\setminus\left(\{z : |z| \leq \varepsilon\} \cup \{z = x + iv : x > 0, |v| \leq \varepsilon\}\right).$$
For all $N$, the function $m_N$ is the Stieltjes transform of a distribution function $F_N$, and
$$F^{B_N} - F_N \Rightarrow 0$$
almost surely as $N\to\infty$.

In [Couillet et al., 2011a], Theorem 6.1 is completed by the following result.

Theorem 6.2. Under the conditions of Theorem 6.1, the scalars $e_{N,1}(z), \ldots, e_{N,K}(z)$ are also explicitly given by:
$$e_{N,i}(z) = \lim_{t\to\infty} e_{N,i}^t(z)$$
where, for all $i$, $e_{N,i}^0(z) = -1/z$ and, for $t \geq 1$,
$$e_{N,i}^t(z) = \frac1N \operatorname{tr} R_i\left(A + \sum_{j=1}^K \int\frac{\tau_j\, dF^{T_j}(\tau_j)}{1 + c_j \tau_j e_{N,j}^{t-1}(z)}\, R_j - zI_N\right)^{-1}.$$

This result, which ensures the convergence of the classical fixed-point algorithm for an adequate initial condition, is of fundamental importance for practical purposes as it ensures that the $e_{N,1}(z), \ldots, e_{N,K}(z)$ can be determined numerically in a deterministic way. Since the proof of Theorem 6.2 relies heavily on the proof of Theorem 6.1, we will prove Theorem 6.2 later.
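In practice the fixed-point iteration of Theorem 6.2 amounts to a few lines of code. The following sketch (ours, not the book's; the function and variable names are illustrative) computes the $e_{N,k}(z)$ by iterating the above map and then evaluates $m_N(z)$ through (6.5), the matrices $T_k$ being given by their diagonal entries.

```python
import numpy as np

def deterministic_equivalent(z, A, R_list, T_diag_list, n_list, n_iter=500, tol=1e-10):
    """Fixed-point iteration of Theorem 6.2 for the model of Theorem 6.1.

    A           : N x N Hermitian non-negative definite matrix.
    R_list      : list of the K correlation matrices R_k (N x N).
    T_diag_list : list of K 1-D arrays with the diagonal entries tau_{k,i} of T_k.
    n_list      : list of the K dimensions n_k.
    Returns (m_N(z), [e_{N,1}(z), ..., e_{N,K}(z)]).
    """
    N = A.shape[0]
    K = len(R_list)
    c = [N / n for n in n_list]
    e = np.full(K, -1.0 / z, dtype=complex)       # initialization e^0_{N,k} = -1/z

    def build_D(e_vec):
        # D = A + sum_k ( int tau dF^{T_k}(tau) / (1 + c_k tau e_k) ) R_k - z I_N,
        # the integral being the average over the diagonal entries of T_k.
        D = A.astype(complex) - z * np.eye(N)
        for k in range(K):
            tau = T_diag_list[k]
            D = D + np.mean(tau / (1.0 + c[k] * tau * e_vec[k])) * R_list[k]
        return D

    for _ in range(n_iter):
        D_inv = np.linalg.inv(build_D(e))
        e_new = np.array([np.trace(R_list[k] @ D_inv) / N for k in range(K)])
        if np.max(np.abs(e_new - e)) < tol:
            e = e_new
            break
        e = e_new

    m_N = np.trace(np.linalg.inv(build_D(e))) / N  # equation (6.5)
    return m_N, e
```

For $z = -\sigma^2 < 0$, the case of interest in the application chapters, the iterates stay real positive, and the returned $m_N(z)$ can be checked against the empirical Stieltjes transform $\frac1N \operatorname{tr}(B_N - zI_N)^{-1}$ of a realization of $B_N$.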

Several remarks are in order before we prove Theorem 6.1. We have given much detail on the conditions for Theorem 6.1 to hold. We hereafter discuss the implications of these conditions. Condition 1 requires that the $X_{k,ij}$ be identically distributed across $N$, $i$, $j$, but not necessarily across $k$. Note that the identical distribution condition could be further released under additional mild conditions (such as all entries must have a moment of order $2+\varepsilon$, for some $\varepsilon > 0$), see Theorem 3.13. Condition 4 introduces tightness requirements on the e.s.d. of $R_k$ and $T_k$. Tightness can be seen as the probabilistic equivalent to boundedness for deterministic variables. Tightness ensures here that no mass of the $F^{R_k}$ and $F^{T_k}$ escapes to infinity as $n$ grows large. Condition 6 is more general than the requirement that $c_k$ has a limit as it allows $c_k$, for all $k$, to wander between two positive values.

From a practical point of view, $R_k^{\frac12} X_k T_k^{\frac12}$ will often be used to model a multiple antenna $N\times n_k$ channel with i.i.d. entries with transmit and receive correlations. From the assumptions of Theorem 6.1, the correlation matrices $R_k$ and $T_k$ are only required to be ‘bounded’ in the sense of tightness of their e.s.d. This means that, as the number of antennas grows, the eigenvalues of $R_k$ and $T_k$ can only blow up with increasingly low probability. If we increase the number $N$ of antennas on a bounded three-dimensional space, then the rough tendency is for the eigenvalues of $T_k$ and $R_k$ to be all small except for a few of them, which grow large but have a probability of order $O(1/N)$, see, e.g., [Pollock et al., 2003]. In that context, Theorem 6.1 holds, i.e. for $N\to\infty$, $F^{B_N} - F_N \Rightarrow 0$.

It is also important to remark that the matrices $T_k$ are constrained to be diagonal. This is unimportant when the matrices $X_k$ are assumed Gaussian in practical applications, as the $X_k$, being bi-unitarily invariant, can be multiplied on the right by any deterministic unitary matrix without altering the final result. This limitation is linked to the technique used for proving Theorem 6.1. For mathematical completion, though, it would be convenient for the matrices $T_k$ to be unconstrained. We mention that Zhang and Bai [Zhang, 2006] derive the limiting spectral distribution of the model $B_N = R_1^{\frac12} X_1 T_1 X_1^H R_1^{\frac12}$ for unconstrained Hermitian $T_1$, using a different approach than that presented below.

For practical applications, it will be easier in the following to write ( 6.6) in amore symmetric way. This is discussed in the following remark.

Remark 6.1. In the particular case where $A = 0$, the $K$ implicit Equations (6.6) can be developed into the $2K$ linked equations
$$e_{N,i}(z) = \frac1N \operatorname{tr} R_i\left(-z\left[I_N + \sum_{k=1}^K \bar e_{N,k}(z) R_k\right]\right)^{-1}$$
$$\bar e_{N,i}(z) = \frac1{n_i} \operatorname{tr} T_i\left(-z\left[I_{n_i} + c_i e_{N,i}(z) T_i\right]\right)^{-1} \qquad (6.7)$$
whose symmetric aspect is both more readable and more useful for practical reasons that will be evidenced later in Chapters 13–14. As a consequence, $m_N(z)$ in (6.5) becomes
$$m_N(z) = \frac1N \operatorname{tr}\left(-z\left[I_N + \sum_{k=1}^K \bar e_{N,k}(z) R_k\right]\right)^{-1}.$$
In the literature and, as a matter of fact, in some deterministic equivalents presented later in this chapter, the variables $e_{N,i}(z)$ may be normalized by $\frac1{n_i}$ instead of $\frac1N$ in order to avoid carrying the factor $c_i$ in front of $e_{N,i}(z)$ in the second fixed-point equation of (6.7). In the application chapters, Chapters 12–15, depending on the situation, either one or the other convention will be taken.
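The symmetric system (6.7) is the form most often implemented in practice. Below is a minimal Monte Carlo sanity check (our own illustration, with arbitrarily chosen correlation profiles and dimensions) that iterates the $2K$ equations for $A = 0$ and $z = -\sigma^2$, and compares the resulting $m_N(z)$ with the empirical Stieltjes transform of one realization of $B_N$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, K, z = 256, 512, 2, -0.1                      # z = -sigma^2, illustrative sizes
c = N / n
R = [np.fromfunction(lambda i, j: (0.5 + 0.3 * k) ** np.abs(i - j), (N, N)) for k in range(K)]
tau = [np.linspace(0.5, 1.5 + 0.5 * k, n) for k in range(K)]   # diagonals of the T_k

# Fixed point on (e_k, e_bar_k) as in (6.7), with A = 0 and init e_k = -1/z.
e, e_bar = np.full(K, -1.0 / z), np.zeros(K)
for _ in range(1000):
    for k in range(K):
        e_bar[k] = np.mean(tau[k] / (-z * (1.0 + c * e[k] * tau[k])))
    M = np.linalg.inv(-z * (np.eye(N) + sum(e_bar[k] * R[k] for k in range(K))))
    e_new = np.array([np.trace(R[k] @ M) / N for k in range(K)])
    if np.max(np.abs(e_new - e)) < 1e-12:
        e = e_new
        break
    e = e_new
m_det = np.trace(M) / N                              # deterministic m_N(z)

# One realization of B_N = sum_k R_k^(1/2) X_k T_k X_k^H R_k^(1/2).
B = np.zeros((N, N))
for k in range(K):
    X = rng.standard_normal((N, n)) / np.sqrt(n)
    H = np.linalg.cholesky(R[k]) @ X * np.sqrt(tau[k])
    B += H @ H.T
m_emp = np.trace(np.linalg.inv(B - z * np.eye(N))) / N

print(f"deterministic equivalent m_N(z) = {m_det:.5f}   empirical = {m_emp:.5f}")
```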

We present hereafter the general techniques, based on the Stieltjes transform, to prove Theorem 6.1 and other similar results introduced in this section. As opposed to the proof of the Marcenko–Pastur law, we cannot prove that there exists a space of probability one over which $m_{B_N}(z) \to m(z)$ for all $z \in \mathbb{C}\setminus\mathbb{R}^+$, for a certain limiting function $m$. Instead, we prove that there exists a space of probability one over which $m_{B_N}(z) - m_N(z) \to 0$ for all $z$, for a certain series of Stieltjes transforms $m_1(z), m_2(z), \ldots$. There are in general


to compute $m_N(z)$ for $z \in D$, which is in particular of interest for practical applications when $z = -\sigma^2 < 0$. In the proof of Theorem 6.1, we will introduce both results for completion. In the proof of Theorem 6.17, we will directly proceed to proving the convergence of the fixed-point algorithm for $z$ real negative.

When the uniqueness of the Stieltjes transform $m_N(z)$ has been made clear, the last step is to prove that, in the large $N$ limit,
$$m_{B_N}(z) - m_N(z) \xrightarrow{\text{a.s.}} 0.$$

This step is not so immediate. To this point, we indeed only know that $m_{B_N}(z) - h_N(m_{B_N}(z); z) \xrightarrow{\text{a.s.}} 0$ and $m_N(z) - h_N(m_N(z); z) = 0$. This does not imply immediately that $m_{B_N}(z) - m_N(z) \xrightarrow{\text{a.s.}} 0$. If there are several point-wise solutions to $m - h_N(m; z) = 0$, we need to verify that $m_N(z)$ was chosen to be the one that will eventually satisfy $m_{B_N}(z) - m_N(z) \xrightarrow{\text{a.s.}} 0$. This will conclude the proof.

We now provide the specific proof of Theorem 6.1. In order to determine the above function $h_N$, we first develop the Marcenko–Pastur method (for simplicity for $K = 2$ and $A = 0$). We will realize that this method fails unless all $R_k$ and $A$ are constrained to be co-diagonalizable. To cope with this limitation, we will introduce the more powerful Bai and Silverstein method, whose idea is to guess along the derivations the suitable form of $h_N$. In fact, as we will shortly realize, the problem is slightly more difficult here as we will not be able to find such a function $h_N$ (which may actually not exist at all in the first place). We will however be able to find functions $f_{N,i}$ such that, for each $i$,
$$e_{B_N,i}(z) - f_{N,i}(e_{B_N,1}(z), \ldots, e_{B_N,K}(z); z) \xrightarrow{\text{a.s.}} 0$$
where $e_{B_N,i}(z) \triangleq \frac1N \operatorname{tr} R_i (B_N - zI_N)^{-1}$. We will then look for a function $e_{N,i}(z)$ that satisfies
$$e_{N,i}(z) = f_{N,i}(e_{N,1}(z), \ldots, e_{N,K}(z); z).$$
From there, it will be easy to determine a further function $g_N$ such that
$$m_{B_N}(z) - g_N(e_{B_N,1}(z), \ldots, e_{B_N,K}(z); z) \xrightarrow{\text{a.s.}} 0$$
and
$$m_N(z) - g_N(e_{N,1}(z), \ldots, e_{N,K}(z); z) = 0.$$
We will therefore have finally
$$m_{B_N}(z) - m_N(z) \xrightarrow{\text{a.s.}} 0.$$

Proof of Theorem 6.1. In order to have a first insight on what the deterministic equivalent $m_N$ of $m_{B_N}$ may look like, the Marcenko–Pastur method will be applied with the (strong) additional assumption that $A$ and all $R_k$, $1 \leq k \leq K$, are diagonal and that the e.s.d. $F^{T_k}$, $F^{R_k}$ converge for all $k$ as $N$ grows large. In this scenario, $m_{B_N}$ has a limit when $N\to\infty$ and the method, however more tedious than in the proof of the Marcenko–Pastur law, leads naturally to $m_N$.


Consider the case when $K = 2$, $A = 0$ for simplicity and denote $H_k = R_k^{\frac12} X_k T_k^{\frac12}$. Following similar steps as in the proof of the Marcenko–Pastur law,

we start with matrix inversion lemmas

H 1H H

1 + H 2H H

2 −zI N −111

= −z −z[h H

1 h H

2 ]U H

1U H

2[U 1U 2]−zI n 1 + n 2

−1h 1h 2

−1

with the denition H Hi = [h i U H

i ]. Using the block matrix inversion lemma, theinner inversed matrix in this expression can be decomposed into four submatrices.The upper-left n1 ×n1 submatrix reads:

−zU H

1 (U 2U H

2

−zI N

−1)−1U 1

−zI n 1

−1

while, for the second block diagonal entry, it suffices to revert all ones in twos andvice-versa. Taking the limits, using Theorem 3.4 and Theorem 3.9, we observethat the two off-diagonal submatrices will not play a role, and we nally have

H 1H H

1 + H 2H H

2 −zI N −111

−z −zr 111

n1tr T 1 −zH H

1 (H 2H H

2 −zI N )−1H 1 −zI n 1−1

−zr 211

n2tr T 2 −zH H

2 (H 1H H

1 −zI N )−1H 1 −zI n 2−1 −1

where the symbol “ ” denotes some kind of yet unknown large N convergence and where we denoted r ij the j th diagonal entry of R i .Observe that we can proceed to a similar derivation for the matrixT 1 −zH H

1 (H 2H H2 −zI N )−1H 1 −zI n 1

−1 that now appears. Denoting now H i =[h i U i ], we have indeed

T 1 −zH H

1 (H 2H H

2 −zI N )−1H 1 −zI n 1−1

11

= τ 11

−z

−zh H

1 U 1 U H

1 + H 2H H

2

−zI N

−1h 1

−1

τ 11 −z −zc1τ 111N

tr R 1 H 1H H

1 + H 2H H

2 −zI N −1 −1

with τ ij the j th diagonal entry of T i . The limiting result here arises from thetrace lemma, Theorem 3.4 along with the rank-1 perturbation lemma, Theorem3.9. The same result holds when changing ones in twos.

We now denote by ei and ei the (almost sure) limits of the random quantities

eB N ,i = 1N

tr R i H 1H H

1 + H 2H H

2 −zI N −1

and

eB N ,i = 1N

tr T i −zH H

1 (H 2H H

2 −zI N )−1H 1 −zI n 1−1



respectively, as F T i and F R i converge in the large N limit. These limits existhere since we forced R 1 and R 2 to be co-diagonalizable. We nd

ei = limN →∞1N tr R i (−zeB N ,i R 1 −zeB N ,i R 2 −zI N )−1

ei = limN →∞

1N

tr T i (−zci eB N ,i T i −zI n i )−1

where the type of convergence is left to be determined. From this short calculus,we can infer the form of ( 6.7).

This derivation obviously only provides a hint on the deterministic equivalentfor mN (z). It also provides the aforementioned observation that mN (z) is notitself solution of a xed-point equation, although eN, 1(z), . . . , e N,K (z) are. Toprove Theorem 6.1, irrespective of the conditions imposed on R 1 , . . . , R K ,T 1 , . . . , T K and A , we will successively go through four steps, given below.For readability, we consider the case K = 1 and discard the useless indexes.The generalization to K ≥ 1 is rather simple for most of the steps but requirescumbersome additional calculus for some particular aspects. These pieces of calculus are not interesting here, the reader being invited to refer to [Couilletet al. , 2011a] for more details. The four-step procedure is detailed below.

• Step 1. We rst seek a function f N , such that, for z ∈C +

eB N (z)

−f N (eB N (z); z) a.s.

−→ 0

as N → ∞, where eB N (z) = 1N tr R (B N −zI N )−1 . This function f N was

already inferred by the Marcenko–Pastur approach. Now, we will make thisstep rigorous by using the Bai and Silverstein approach , as is done in,e.g., [Dozier and Silverstein, 2007a ; Silverstein and Bai, 1995] . Basically, thefunction f N will be found using an inference procedure. That is, startingfrom a very general form of f N , i.e. f N = 1

N tr RD −1 for some matrix D ∈C N ×N (not yet written as a function of z or eB N (z)), we will evaluate thedifference eB N (z) −f N and progressively discover which matrix D will makethis difference increasingly small for large N .

• Step 2. For xed N , we prove the existence of a solution to the implicitequation in the dummy variable e

f N (e; z) = e. (6.9)

This is often performed by proving the existence of a sequence eN, 1 , eN, 2 , . . . ,lying in a compact space such that f N (eN,k ; z) −eN,k converges to zero, inwhich case there exists at least one converging subsequence of eN, 1 , eN, 2 , . . . ,whose limit eN satises (6.9).

• Step 3. Still for xed N , we prove the uniqueness of the solution of ( 6.9)lying in some specic space and we call this solution eN (z). This is classicallyperformed by assuming the existence of a second distinct solution and byexhibiting a contradiction.



• Step 4. We nally prove that

eB N (z) −eN (z) a.s.

−→ 0

and, similarly, that

mB N (z) −mN (z) a.s.

−→ 0

as N → ∞, with mN (z) gN (eN (z); z) for some function gN .

At rst, following the works of Bai and Silverstein, a truncation, centralization,and rescaling step is required to replace the matrices X , R , and T by truncatedversions X , R , and T , respectively, such that the entries of X have zero mean,

X ≤ k log(N ), for some constant k, R ≤ log(N ) and T ≤ log(N ). Similar

to the truncation steps presented in Section 3.2.2, it is shown in [Couillet et al.,2011a] that these truncations do not restrict the generality of the nal result for

F T and F R forming tight sequences, that is:

F R12 X T X H R

12

−F R12 XTX H R

12

⇒ 0

almost surely, as N grows large. Therefore, we can from now on work with thesetruncated matrices. We recall that the main interest of this procedure is to beable to derive a deterministic equivalent (or l.s.d.) of the underlying randommatrix model without the need for any moment assumption on the entries of X ,

by replacing the entries of X by truncated random variables that have momentsof all orders. Here, the interest is in fact two-fold, since, in addition to truncatingthe entries of X , also the entries of T and R are truncated in order to be ableto prove results for matrices T and R that in reality have eigenvalues growingvery large but that will be assumed to have entries bounded by log( N ). Forreadability in the following, we rename X , T , and R the truncated matrices.

Remark 6.2. Alternatively, expected values can be used to discard the stochasticcharacter. This introduces an additional convergence step, which is the approachfollowed by Hachem, Najim, and Loubaton in several publications, e.g., [Hachem

et al. , 2007] and [Dupuy and Loubaton, 2009 ]. This additional step consists inrst proving the almost sure weak convergence of F B N −GN to zero, for GN

some auxiliary deterministic distribution (such as GN = E[F B N ]), before provingthe convergence GN −F N ⇒ 0.

Step 1. First convergence stepWe start with the introduction of two fundamental identities.

Lemma 6.1 (Resolvent identity) . For invertible A and B matrices, we have the

identity A −1 −B −1 = −A −1(A −B )B −1 .



This can be veried easily by multiplying both sides on the left by A and onthe right by B (the resulting equality being equivalent to Lemma 6.1 for A andB invertible).

Lemma 6.2 (A matrix inversion lemma, (2.2) in [Silverstein and Bai , 1995]).Let A ∈C N ×N be Hermitian invertible, then, for any vector x ∈C N and any scalar τ ∈C , such that A + τ xx H is invertible

x H (A + τ xx H )−1 = x H A −1

1 + τ x H A −1x.

This is veried by multiplying both sides by A + τ xx H from the right.Lemma 6.1 is often referred to as the resolvent identity , since it will be

mainly used to take the difference between matrices of type ( X −zI N )−1 and(Y −zI N )−1 , which we remind are called the resolvent matrices of X and Y ,respectively.
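Both identities are easy to check numerically; the snippet below is a toy verification of ours (random matrices, arbitrary dimensions) confirming the resolvent identity of Lemma 6.1 and the rank-one inverse formula of Lemma 6.2 to machine precision.

```python
import numpy as np

rng = np.random.default_rng(4)
N, tau = 8, 2.5

# Two random invertible (here positive definite) matrices and a random vector.
G = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
A = G @ G.conj().T + np.eye(N)
B = G.conj().T @ G + 2 * np.eye(N)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Lemma 6.1: A^{-1} - B^{-1} = -A^{-1} (A - B) B^{-1}.
lhs1 = np.linalg.inv(A) - np.linalg.inv(B)
rhs1 = -np.linalg.inv(A) @ (A - B) @ np.linalg.inv(B)
print("resolvent identity error:", np.max(np.abs(lhs1 - rhs1)))

# Lemma 6.2: x^H (A + tau x x^H)^{-1} = x^H A^{-1} / (1 + tau x^H A^{-1} x).
lhs2 = x.conj() @ np.linalg.inv(A + tau * np.outer(x, x.conj()))
rhs2 = x.conj() @ np.linalg.inv(A) / (1 + tau * x.conj() @ np.linalg.inv(A) @ x)
print("rank-one inverse formula error:", np.max(np.abs(lhs2 - rhs2)))
```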

The fundamental idea of the approach by Bai and Silverstein is to guess thedeterministic equivalent of mB N (z) by writing it under the form 1

N tr D −1 at rst,where D needs to be determined. This will be performed by taking the differencemB N (z) − 1

N tr D −1 and, along the lines of calculus, successively determining thegood properties D must satisfy so that the difference tends to zero almost surely.

We then start by taking z ∈C + and D ∈C N ×N some invertible matrix whose

normalized trace would ideally be close to mB N (z) = 1N tr( B N −zI N )−

1. Wethen write

D −1 −(B N −zI N )−1 = D −1(A + R12 XTX H R

12 −zI N −D )(B N −zI N )−1

(6.10)using Lemma 6.1.

Notice here that, since B N is Hermitian non-negative denite, and z ∈C + , theterm ( B N −zI N )−1 has uniformly bounded spectral norm (bounded by 1 / [z]).Since D −1 is desired to be close to ( B N −zI N )−1 , the same property should alsohold for D −1 . In order for the normalized trace of ( 6.10) to be small, we needtherefore to focus exclusively on the inner difference on the right-hand side. Itseems then interesting at this point to write D A −zI N + pN R for pN left tobe dened. This leads to

D −1 −(B N −zI N )−1

= D −1R12 XTX H R

12 (B N −zI N )−1 − pN D −1R (B N −zI N )−1

= D −1n

j =1τ j R

12 x j x H

j R12 (B N −zI N )−1 − pN D −1R (B N −zI N )−1

where in the second equality we used the fact that XTX H = nj =1 τ j x j x H

j , withx j ∈C N the j th column of X and τ j the j th diagonal element of T . DenotingB ( j ) = B N −τ j R

12 x j x H

j R12 , i.e. B N with column j removed, and using Lemma



6.2 for the matrix B ( j ) , we have:

D −1 −(B N −zI N )−1

=n

j =1

τ j D −1R12 x j x Hj R

12 (B ( j ) −zI N )−1

1 + τ j x H R12 (B ( j ) −zI N )−1R

12 x j − pN D −1R (B N −zI N )−1 .

Taking the trace on each side, and recalling that, for a vector x and a matrixA , tr( Axx H ) = tr( x H Ax ) = x H Ax , this becomes

1N

tr D −1 − 1N

tr( B N −zI N )−1

= 1N

n

j =1

τ jx H

j R12 (B ( j ) −zI N )−1D −1R

12 x j

1 + τ j x H R12 (B ( j )

−zI N )−1R

12 x j − pN

1N

tr R (B N −zI N )−1D −1

(6.11)

where quadratic forms of the type x H Ax appear.Remembering the trace lemma, Theorem 3.4, which can a priori be applied to

the terms x Hj R

12 (B ( j ) −zI N )−1D −1R

12 x j since x j is independent of the matrix

R12 (B ( j ) −zI N )−1D −1R

12 , we notice that by setting

pN = 1n

n

j =1

τ j1 + τ j c 1

N tr R (B N −zI N )−1 .

Equation ( 6.11) becomes

1N

tr D −1 − 1N

tr( B N −zI N )−1

= 1N

n

j =1

τ j x H

j R12 (B ( j ) −zI N )−1D −1R

12 x j

1 + τ j x H R12 (B ( j ) −zI N )−1R

12 x j −

1n tr R (B N −zI N )−1D −1

1 + cτ j 1N tr R (B N −zI N )−1

(6.12)

which is suspected to converge to zero as N grows large, since both thenumerators and the denominators converge to one another. Let us assume forthe time being that the difference effectively goes to zero almost surely. Equation(6.12) implies

1N

tr( B N −zI N )−1 − 1N

tr A + 1n

n

j =1

τ j R1 + τ j c 1

N tr R (B N −zI N )−1 −zI N

−1

a .s.

−→ 0

which determines mB N (z) = 1N tr( B N −zI N )−1 as a function of the trace

1N tr R (B N

−zI N )−1 , and not as a function of itself. This is the observation

made earlier when we obtained a rst hint on the form of mN (z) using theMarcenko–Pastur method, according to which we cannot nd a function f N

such that mB N (z) −f N (mB N (z), z) a.s.

−→ 0. Instead, running the same steps as



above, it is rather easy now to observe that

1

N tr RD −1

− 1

N tr R (B N

−zI N )−1

= 1N

n

j =1

τ jx H

j R12 (B ( j ) −zI N )−1RD −1R

12 x j

1 + τ j x Hj R

12 (B ( j ) −zI N )−1R

12 x j −

1n tr R (B N −zI N )−1RD −1

1 + τ j cN tr R (B N −zI N )−1

where R ≤ log N . Then, denoting eB N (z) 1N tr R (B N −zI N )−1 , we suspect

to have also

eB N (z) − 1N

tr R A + 1n

n

j =1

τ j1 + τ j ceB N (z)

R −zI N

−1

a .s.

−→ 0

and

mB N (z) − 1N

tr A + 1n

n

j =1

τ j1 + τ j ceB N (z)

R −zI N

−1

a .s.

−→ 0

which is exactly what was required, i.e. eB N (z) −f N (eB N (z); z) a.s.

−→ 0 with

f N (e; z) = 1N tr R A +

1n

n

j =1

τ j1 + τ j ce R −zI N

−1

and mB N (z) −gN (eB N (z); z) a.s.

−→ 0 with

gN (e; z) = 1N

tr A + 1n

n

j =1

τ j1 + τ j ce

R −zI N

−1

.

We now prove that the right-hand side of (6.12) converges to zero almost

surely. This rather technical part justies the use of the truncation steps andis the major difference between the works of Bai and Silverstein [Dozier andSilverstein, 2007a; Silverstein and Bai, 1995] and the works of Hachem et al.[Hachem et al., 2007]. We rst dene

wN n

j =1

τ jN

x Hj R

12 (B ( j ) −zI N )−1RD −1R

12 x j

1 + τ j x Hj R

12 (B ( j ) −zI N )−1R

12 x j −

1n tr R (B N −zI N )−1RD −1

1 + τ j cN tr R (B N −zI N )−1

which we then divide into four terms, in order to successively prove theconvergence of the numerators and the denominators. Write

wN = 1N

n

j =1τ j d1

j + d2j + d3

j + d4j



where

d1j =

x Hj R

12 (B ( j ) −zI N )−1RD −1R

12 x j

1 + τ j xH

j R1

2 (B ( j ) −zI N )−1R

1

2 x j −

x Hj R

12 (B ( j ) −zI N )−1RD −1

( j ) R12 x j

1 + τ j xH

j R1

2 (B ( j ) −zI N )−1R

1

2 x j

d2j =

x Hj R

12 (B ( j ) −zI N )−1RD −1

( j ) R12 x j

1 + τ j x Hj R

12 (B ( j ) −zI N )−1R

12 x j −

1n tr R (B ( j ) −zI N )−1RD −1

( j )

1 + τ j x Hj R

12 (B ( j ) −zI N )−1R

12 x j

d3j =

1n tr R (B ( j ) −zI N )−1RD −1

( j )

1 + τ j x Hj R

12 (B ( j ) −zI N )−1R

12 x j −

1n tr R (B N −zI N )−1RD −1

1 + τ j x Hj R

12 (B ( j ) −zI N )−1R

12 x j

d4j =

1n tr R (B N −zI N )−1RD −1

1 + τ j x Hj R

12 (B ( j ) −zI N )−1R

12 x j −

1n tr R (B N −zI N )−1RD −1

1 + cτ j eB N

where we introduced D ( j ) = A + 1n

nk =1

τ k1+ τ k ce B ( j ) (z ) R −zI N , i.e. D with

eB N (z) replaced by eB ( j ) (z). Under these notations, it is simple to show thatwN

a .s.

−→ 0 since every term dkj can be shown to go fast to zero.

One of the difficulties in proving that the dkj tends to zero at a sufficiently fast

rate lies in providing inequalities for the quadratic terms of the type y H (A −zI N )−1y present in the denominators. For this, we use Corollary 3.2, whichstates that, for any non-negative denite matrix A , y ∈C N and for z ∈C +

11 + τ j y H (A −zI N )−1y ≤ |z|

[z]. (6.13)

Also, we need to ensure that D −1 and D −1( j ) have uniformly bounded spectral

norm. This unfolds from the following lemma.

Lemma 6.3 (Lemma 8 of [Couillet et al., 2011a]). Let D = A + iB + ivI N ,with A ∈C N ×N Hermitian, B ∈C N ×N Hermitian non-negative and v > 0. Then

D ≤ v−1 .

Proof. Noticing that DD H = ( A + iB )(A −iB ) + v2 I N + 2vB , the smallesteigenvalue of DD H is greater than or equal to v2 and therefore D −1 ≤ v−1 .

At this step, we need to invoke the generalized trace lemma, Theorem 3.12.From Theorem 3.12, (6.13), Lemma 6.3 and the inequalities due to the truncationsteps, we can then show that

τ j |d1j | ≤ x j

2 c log7 N |z|3N [z]7

τ j |d2j | ≤

log N x Hj R

12 (B ( j ) −zI N )−1RD −1

( j ) R12 x j − 1

n tr R (B ( j ) −zI N )−1RD −1( j )

[z]|z|−1

τ j

|d3

j

| ≤ |z| log3 N

[z]N

1

[z]2 +

c|z|2 log3 N

[z]6

τ j |d4j | ≤

log4 N x Hj R

12 (B ( j ) −zI N )−1R

12 x j − 1

n tr R (B ( j ) −zI N )−1 + log N N [z ]

[z]3|z|−1 .



Applying the trace lemma for truncated variables, Theorem 3.12, and classicalinequalities, there exists K > 0 such that we have simultaneously

E| x j 2 −1|6 ≤¯

K log12

N N 3

and

E|xHj R

12 (B ( j ) −zI N )−1RD −1

( j ) R12 x j −

1n

tr R (B ( j ) −zI N )−1RD −1( j ) |6

≤K log24 N N 3 [z]12

and

E|xH

j R12 (B ( j ) −zI N )−1R

12 x j −

1n tr R

12 (B ( j ) −zI N )−1R

12 |6

≤K log18 N N 3 [z]6

.

All three moments above, when summed over the n indexes j and multiplied byany power of log N , are summable. Applying the Markov inequality, Theorem3.5, the Borel–Cantelli lemma, Theorem 3.6, and the line of arguments usedin the proof of the Marcenko–Pastur law, we conclude that, for any k > 0,logk N max j ≤n τ j dj

a .s.

−→ 0 as N → ∞, and therefore:

eB N (z) −f N (eB N (z); z) a .s.−→ 0mB N (z) −gN (eB N (z); z) a .s.

−→ 0.

This convergence result is similar to that of Theorem ( 3.22), although in thelatter each side of the minus sign converges, when the eigenvalue distributionsof the deterministic matrices in the model converge. In the present case, even if the series F T and F R converge, it is not necessarily true that either eB N (z)or f N (eB N (z), z) converges.

We wish to go further here by showing that, for all nite N , f N (e; z) = e

has a solution (Step 2), that this solution is unique in some space (Step 3)and that, denoting eN (z) this solution, eN (z) −eB N (z) a.s.

−→ 0 (Step 4). This willimply naturally that mN (z) gN (eN (z); z) satises mB N (z) −mN (z) a.s.

−→ 0, forall z ∈C + . Vitali’s convergence theorem, Theorem 3.11, will conclude the proof by showing that mB N (z) −mN (z) a.s.

−→ 0 for all z outside the positive real half-line.

Step 2. Existence of a solution We now show that the implicit equation e = f N (e; z) in the dummy variable e hasa solution for each nite N . For this, we use a special trick that consists in growingthe matrices dimensions asymptotically large while maintaining the deterministiccomponents untouched, i.e. while maintaining F R and F T the same. The idea isto x N and consider for all j > 0 the matrices T [j ] = T

⊗I j ∈C jn ×jn , R [j ] =



R⊗

I j ∈C jN ×jN and A [j ] = A⊗

I j ∈C jN ×jN . For a given x

f [j ](x; z) 1

jN tr R [j ] A [j ] +

τdF T [j ] (τ )

1 + cτx R [j ]

−zI Nj

−1

which is constant whatever j and equal to f N (x; z). Dening

B [j ] = A [j ] + R12[j ]XT [j ]X

H R12[j ]

for X ∈C Nj ×nj with i.i.d. entries of zero mean and variance 1 / (nj )

eB [j ] (z) = 1 jN

tr R [j ](A [j ] + R12[j ]XT [j ]X

H R12[j ] −zI Nj )−1 .

With the notations of Step 1, wNj → 0 as j → ∞, for all sequences B [1], B [2], . . .

in a set of probability one. Take such a sequence. Noticing that both eB [j ] (z)and the integrand τ

1+ cτe B [j ] (z ) of f [j ](x, z ) are uniformly bounded for xed N and growing j , there exists a subsequence of eB [1] , eB [2] , . . . over which they bothconverge, when j → ∞, to some limits e and τ (1 + cτe)−1 , respectively. But sincewjN → 0 for this realization of eB [1] , eB [2] , . . . , for growing j , we have that e =limj f [j ](e, z ). But we also have that, for all j , f [j ](e, z ) = f N (e, z ). We thereforeconclude that e = f N (e, z ) and we have found a solution.

Step 3. Uniqueness of a solution

Uniqueness is shown classically by considering two hypothetical solutions e ∈C+

and e ∈C + to (6.6) and by showing then that e −e = γ (e −e), where |γ | mustbe shown to be less than one. Indeed, taking the difference e −e, we have withthe resolvent identity

e −e = 1N

tr RD −1e −

1N

tr RD −1e

= 1N

tr RD −1e cτ 2(e −e)dF T (τ )

(1 + cτe)(1 + cτe)RD −1

e

in which D e and D e are the matrix D with eB N (z) replaced by e and e,

respectively. This leads to the expression of γ as follows.

γ = cτ 2

(1 + cτe)(1 + cτe)dF T (τ )

1N

tr D −1e RD −1

e R .

Applying the Cauchy–Schwarz inequality to the diagonal elements of 1N D −1

e R √ cτ 1+ cτe dF T (τ ) and of 1

N D −1e R √ cτ

1+ cτe dF T (τ ), we then have

|γ | ≤ cτ 2dF T (τ )

|1 + cτe|2N tr D −1

e R (D He )−1R cτ 2dF T (τ )

|1 + cτe|2N tr D −1

e R (D eH )−1R

√ α√ α.We now proceed to a parallel computation of [e] and [e] in the hope

of retrieving both expressions in the right-hand side of the above equation.



Introducing the product ( D He )−1D H

e in the trace, we rst write e under the form

e = 1

N tr D −1

e R (D H

e )−1 A +

τ

1 + cτe∗

dF T (τ ) R

−z∗I N . (6.14)

Taking the imaginary part, this is:

[e] = 1N

tr D −1e R (D H

e )−1 cτ 2 [e]

|1 + cτe|2dF T (τ ) R + [z]I N

= [e]α + [z]β

where

β 1N

tr D −1e R (D H

e )−1

is positive whenever R = 0, and similarly [e] = α [e] + [z]β , β > 0 with

β 1N

tr D −1e R (D H

e )−1 .

Notice also that

α = α [e]

[e] =

α [e]α [e] + β [z]

< 1

and

α = α [e]

[e] = α [e]

α [e] + β [z] < 1.

As a consequence

|γ | ≤ √ α√ α = [e]α[e]α + [z]β [e]α

[e]α + [z]β < 1

as requested. The case R = 0 is easy to verify.

Remark 6.3. Note that this uniqueness argument is slightly more technical whenK > 1. In this case, uniqueness of the vector e1 , . . . , e K (under the notations of Theorem 6.1) needs be proved. Denoting e (e1 , . . . , e K )T , this requires to showthat, for two solutions e and e of the implicit equation, ( e −e ) = Γ (e −e ), whereΓ has spectral radius less than one. To this end, a possible approach is to showthat |Γij | ≤ α

12ij α

12ij , for αij and α ij dened similar as in Step 3. Then, applying

some classical matrix lemmas (Theorem 8.1.18 of [Horn and Johnson, 1985] andLemma 5.7.9 of [Horn and Johnson , 1991]), the previous inequality implies that

Γ ≤ (α12ij α

12ij )ij

where (α12

ijα

12

ij)

ij is the matrix with ( i, j ) entry α

12

ijα

12

ij and the norm is the matrix

spectral norm. We further have that

(α12ij α

12ij ) ij ≤ A

12 A

12



where A and A are now matrices with ( i, j ) entry α ij and α ij , respectively.The multi-dimensional problem therefore boils down to proving that A < 1and

A < 1. This unfolds from yet another classical matrix lemma (Theorem

2.1 of [Seneta, 1981]), which states in our current situation that, if we have thevectorial relation

[e ] = A [e ] + [z]b

with [e ] and b vectors of positive entries and [z] > 0, then A < 1. The aboverelation generalizes, without much difficulty, the relation [e] = [e]α + [z]β obtained above.

Step 4. Final convergence stepWe nally need to show that eN −eB N (z) a.s.

−→ 0. This is performed using asimilar argument as for uniqueness, i.e. eN −eB N (z) = γ (eN −eB N (z)) + wN ,where wN → 0 as N → ∞ and |γ | < 1; this is true for any eB N (z) taken froma space of probability one such that wN → 0. The major difficulty compared tothe previous proof is to control precisely wN .

The details are as follows. We will show that, for any > 0, almost surely

limN →∞

log N (eB N −eN ) = 0 . (6.15)

Let α N , β N be the values as above for which [eN ] = [eN ]α N + [z]β N . Usingtruncation inequalities

[eN ]α N

β N ≤ [eN ]c log N τ 2

|1 + cτeN |2dF T (τ )

= −log N τ 1 + cτeN

dF T (τ )

≤ log2 N |z| [z]−1 .

Therefore

α N = [eN ]α N

[eN ]α N + [z]β N

=[eN ]α N

β N

[z] + [eN ]α N β N

≤ log2 N |z|[z]2 + log 2 N |z|

. (6.16)

We also have

eB N (z) = 1N

tr D −1R −wN .



We write as in Step 3

[eB N ]

= 1N

tr D −1R (D H )−1

cτ 2

[eB N ]|1 + cτeB N |2

dF T (τ ) R + [z]I N − [wN ]

[eB N ]α B N + [z]β B N − [wN ].

Similarly to Step 3, we have eB N −eN = γ (eB N −eN ) + wN , where now

|γ | ≤ √ α B N √ α N .

Fix an > 0 and consider a realization of B N for which wN log N → 0, where = max( + 1 , 4) and N large enough so that

|wN | ≤ [z]3

4c|z|2 log3 N . (6.17)

As opposed to Step 2, the term [z]β B N − [wN ] can be negative. The idea is toverify that in both scenarios where [z]β B N − [wN ] is positive and uniformlyaway from zero, or is not, the conclusion |γ | < 1 holds. First suppose β B N ≤[z ]2

4c|z |2 log 3 N . Then by the truncation inequalities, we get

α B N ≤ c [z]−2|z|2 log3 Nβ B N ≤ 14

which implies

| ≤ 1

2. Otherwise we get from ( 6.16) and ( 6.17)

|γ | ≤ √ α N [eB N ]α B N

[eB N ]α B N + [z]β B N − [wN ]

≤ logN |z|[z]2 + log N |z|

.

Therefore, for all N large

log N |eB N −eN | ≤ (log N )wN

1 − log 2 N

|z

|[v]2 +log 2 N |z |

12

≤ 2 [z]−2( [z]2 + log 2 N |z|)(log N )wN

→ 0

as N → ∞, and ( 6.15) follows. Once more, the multi-dimensional case is muchmore technical; see [Couillet et al., 2011a] for details.

We nally show

mB N −mN a .s.

−→ 0 (6.18)

as N → ∞. Since mB N = 1N tr D −

1N − wN (for some wN dened similar to wN ),we have

mB N −mN = γ (eB N −eN ) − wN




where D t is dened as D with eB N (z) replaced by etN (z). From the Cauchy–

Schwarz inequality and the different truncation bounds on the D t , R , and Tmatrices, we have:

γ t ≤ |z|2c[z]4

log4 N N

. (6.22)

This entails

et +1N −et

N < K |z|2c[z]4

log4 N N

etN −et−1

N (6.23)

for some constant K .Let 0 < ε < 1, and take now a countable set z1 , z2 , . . . possessing a limit point,

such that

K |zk |2c[zk ]4

log4 N N

< 1 −ε

for all zk (this is possible by letting [zk ] > 0 be large enough). On this countableset, the sequences e1

N , e2N , . . . are therefore Cauchy sequences on C K : they all

converge. Since the etN are holomorphic functions of z and bounded on every

compact set included in C \ R + , from Vitali’s convergence theorem, Theorem3.11, et

N converges on such compact sets.From the fact that we forced the initialization step to be e0

N = −1/z , e0N is

the Stieltjes transform of a distribution function at point z. It now suffices toverify that, if etN = et

N (z) is the Stieltjes transform of a distribution function atpoint z, then so is et +1

N . From Theorem 3.2, this requires to ensure that: (i) z ∈C + and et

N (z) ∈C + implies et +1N (z) ∈C + , (ii) z ∈C + and zet

N (z) ∈C + impliesze t +1

N (z) ∈C + , and (iii) lim y→∞−yetN (iy) < ∞implies that lim y→∞−yet

N (iy) <

∞. These properties follow directly from the denition of etN . It is not difficult

to show also that the limit of etN is a Stieltjes transform and that it is solution

to (6.6) when K = 1. From the uniqueness of the Stieltjes transform, solutionto (6.6) (this follows from the point-wise uniqueness on C + and the fact thatthe Stieltjes transform is holomorphic on all compact sets of C

\R + ), we then

have that etN converges for all j and z ∈C \ R + , if e0N is initialized at a Stieltjestransform. The choice e0

N = −1/z follows this rule and the xed-point algorithmconverges to the correct solution.

This concludes the proof of Theorem 6.2.

From Theorem 6.1, we now wish to provide deterministic equivalents for otherfunctionals of the eigenvalues of B N than the Stieltjes transform. In particular,we wish to prove that

f (x)d(F B N

−F N )(x) a.s.

−→ 0

for some function f . This is valid for all bounded continuous f from thedominated convergence theorem, which we recall presently.

Page 159: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 159/562

6.2. Techniques for deterministic equivalents 135

Theorem 6.3 (Theorem 16.4 in [Billingsley, 1995 ]). Let f N (x) be a sequence of real measurable functions converging point-wise to the measurable function f (x),and such that

|f N (x)

| ≤ g(x) for some measurable function g(x) with

g(x)dx <

∞. Then, as N → ∞ f N (x)dx → f (x)dx.

In particular, if F N ⇒ F , the F N and F being d.f., for any continuous bounded function h(x)

h(x)dF N (x) → h(x)dF (x).

However, for application purposes, such as the calculus of MIMO capacity, seeChapter 13, we would like in particular to take f to be the logarithm function.Proving such convergence results is not at all straightforward since f is hereunbounded and because F B N may not have bounded support for all large N .This requires additional tools which will be briey evoked here and which willbe introduced in detail in Chapter 7.

We have the following result [Couillet et al., 2011a].

Theorem 6.4. Let x be some positive real number and f be some continuous

function on the positive half-line. Let B N be a random Hermitian matrix as dened in Theorem 6.1 with the following additional assumptions.

1. There exists α > 0 and a sequence rN , such that, for all N

max1≤k≤K

max( λT kr N +1 , λ R k

r N +1 ) ≤ α

where λX1 ≥ . . . ≥ λX

N denote the ordered eigenvalues of the N ×N matrix X .2. Denoting bN an upper-bound on the spectral norm of the T k and R k , k ∈

1, . . . , K , and β some real, such that β > K (b/a )(1 + √ a)2 (with a and b

such that a < lim inf N ck ≤ limsup N ck < b for all k), then aN = b2N β satises

r N f (aN ) = o(N ). (6.24)

Then, for large N , nk

f (x)dF B N (x) − f (x)dF N (x) a.s.

−→ 0

with F N dened in Theorem 6.1.

In particular, if f (x) = log( x), under the assumption that ( 6.24) is fullled,we have the following corollary.



Corollary 6.1. For A = 0 , under the conditions of Theorem 6.4 with f (t) =log(1 + xt ), the Shannon transform VB N of B N , dened for positive x as

VB N (x) = ∞0log(1 + xλ )dF B N (λ)

= 1N

logdet( I N + xB N ) (6.25)

satises

VB N (x) −VN (x) a.s.

−→ 0

where VN (x) is dened as

VN (x) = 1N log det I N + xK

k=1R k

τ k dF T k

(τ k )1 + ck eN,k (−1/x )τ k

+K

k=1

1ck log (1 + ck eN,k (−1/x )τ k ) dF T k (τ k )

+ 1x

mN (−1/x ) −1

with mN and eN,k dened by (6.5) and (6.6), respectively.

Again, it is more convenient, for readability and for the sake of practicalapplications in Chapters 12–15 to remark that

VN (x) = 1N

log det I N +K

k =1

eN,k (−1/x )R k

+K

k =1

1N

logdet( I n k + ck eN,k (−1/x )T k )

− 1x

K

k =1

eN,k (−1/x )eN,k (−1/x ) (6.26)

with eN,k dened in ( 6.7).Observe that the constraint

max1≤k≤K

max( λT kr N +1 , λ R k

r N +1 ) ≤ α

is in general not strong, as the F T k and the F R k are already known to formtight sequences as N grows large. Therefore, it is expected that only o(N ) largesteigenvalues of the T k and R k grow large. Here, we impose only a slightly strongerconstraint that does not allow for the smallest eigenvalues to exceed a constantα . For practical applications, we will see in Chapter 13 that this constraint is metfor all usual channel models, even those exhibiting strong correlation patterns(such as densely packed three-dimensional antenna arrays).
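As a quick numerical illustration of Corollary 6.1 (our own sketch, not taken from the book; dimensions, SNR and correlation profiles are arbitrary), the following script evaluates the deterministic Shannon transform through the form (6.26) for $K = 1$ and compares it with $\frac1N \log\det(I_N + x B_N)$ for a single channel realization.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, x = 200, 400, 10.0                 # x = 1/sigma^2, illustrative values
c, z = N / n, -1.0 / x
R = np.fromfunction(lambda i, j: 0.7 ** np.abs(i - j), (N, N))   # receive correlation
t = np.linspace(0.2, 1.8, n)                                     # diagonal of T

# Fixed point (6.7) for K = 1, A = 0, evaluated at z = -1/x, initialized at -1/z.
e, e_bar = -1.0 / z, 0.0
for _ in range(1000):
    e_bar = np.mean(t / (-z * (1.0 + c * e * t)))
    M = np.linalg.inv(-z * (np.eye(N) + e_bar * R))
    e_new = np.trace(R @ M) / N
    if abs(e_new - e) < 1e-12:
        e = e_new
        break
    e = e_new

# Deterministic equivalent of the Shannon transform, form (6.26).
V_det = (np.linalg.slogdet(np.eye(N) + e_bar * R)[1] / N
         + np.sum(np.log(1.0 + c * e * t)) / N
         - (1.0 / x) * e * e_bar)

# Empirical Shannon transform for one realization B_N = R^(1/2) X T X^H R^(1/2).
X = rng.standard_normal((N, n)) / np.sqrt(n)
H = np.linalg.cholesky(R) @ X * np.sqrt(t)
V_emp = np.linalg.slogdet(np.eye(N) + x * (H @ H.T))[1] / N

print(f"deterministic V_N(x) = {V_det:.4f}   empirical = {V_emp:.4f}")
```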



Proof of Theorem 6.4 and Corollary 6.1. The only problem in translating theweak convergence of the distribution function F B N −F N in Theorem 6.1 tothe convergence of

fd [F B N

−F N ] in Theorem 6.4 is that we must ensure

that f behaves nicely. If f were bounded, no restriction in the hypothesis of Theorem 6.1 would be necessary and the weak convergence of F B N −F N to zerogives the result. However, as we are particularly interested in the unbounded,though slowly increasing, logarithm function, this no longer holds. In essence, theproof consists rst in taking a realization B 1 , B 2 , . . . for which the convergenceF B N −F N ⇒ 0 is satised. Then we divide the real positive half-line in twosets [0, d] and (d, ∞), with d an upper bound on the 2 Kr N th largest eigenvalueof B N for all large N , which we assume for the moment does exist. For anycontinuous f , the convergence result is ensured on the compact [0 , d]; if the largest

eigenvalue λ1 of B

N is moreover such that 2 Kr N f (λ1) = o(N ), the integrationover (d, ∞) for the measure dF B N is of order o(1), which is negligible in the nalresult for large N . Moreover, since F N (d) −F B N (d) → 0, we also have that, forall large N , 1 −F N (d) = ∞d dF N ≤ 2Kr N /N , which tends to zero. This nallyproves the convergence of fd [F B N −F N ]. The major difficulty here lies inproving that there exists such a bound on the 2 Kr N th largest eigenvalue of B N .The essential argument that validates the result is the asymptotic absence of eigenvalues outside the support of the sample covariance matrix . This is a resultof utmost importance (here, we cannot do without it) which will be presentedlater in Section 7.1. It can be exactly proved that, almost surely, the largest

eigenvalue of X k X Hk is uniformly bounded by any constant C > (1 + √ b)2 forall large N , almost surely. In order to use the assumptions of Theorem 6.4, wenally need to introduce the following eigenvalue inequality lemma.

Lemma 6.4 ([Fan, 1951] ). Consider a rectangular matrix A and let sAi denote

the ith largest singular value of A , with sAi = 0 whenever i > rank( A). Let m, n

be arbitrary non-negative integers. Then for A , B rectangular of the same size

sA + Bm + n +1 ≤ sA

m +1 + sBn +1

and for A , B rectangular for which AB is dened

sABm + n +1 ≤ sA

m +1 sBn +1 .

As a corollary, for any integer r ≥ 0 and rectangular matrices A 1 , . . . , A K , all of the same size

sA 1 + ... + A KKr +1 ≤ sA 1

r +1 + . . . + sA Kr +1 .

Since λT ki and λR k

i are bounded by α for i ≥ rN + 1 and that X k X Hk is

bounded by C , we have from Lemma 6.4 that the 2 Kr N th largest eigenvalue of B N is uniformly bounded by CKα 2 . We can then take d any positive real, suchthat d > CKα 2 , which is what we needed to show, up to some ne tuning onthe nal bound.



As for the explicit form of log(1 + xt )dF N (t) given in (6.26), it resultsfrom a similar calculus as in Theorem 4.10. Precisely, we expect the Shannontransform to be somehow linked to 1

N log det I N + K k=1 eN,k (

−z)R k and

1N logdet( I n k + ck eN,k (−z)T k ). We then need to nd a connection between thederivatives of these functions along z and 1

z −mN (−z), i.e. the derivative of theShannon transform. Notice that

1z −mN (−z) =

1N

(zI N )−1 − z I N +K

k=1

eN,k R k

−1

=K

k=1

eN,k (−z)eN,k (−z).

Since the Shannon transform VN (x) satises VN (x) = ∞1/x [w−1 −mN (−w)]dw,

we need to nd an integral form for K k=1 eN,k (−z)eN,k (−z). Notice now that

ddz

1N

log det I N +K

k=1

eN,k (−z)R k = −zK

k=1

eN,k (−z)eN,k (−z)

ddz

1N

logdet( I n k + ck eN,k (−z)T k ) = −zeN,k (−z)eN,k (−z)

and

ddz

zK

k=1

eN,k (−z)eN,k (−z) =K

k=1

eN,k (−z)eN,k (−z)

−zK

k=1

eN,k (−z)eN,k (−z) + eN,k (−z)eN,k (−z) .

Combining the last three equations, we have:

K

k =1

eN,k (−z)eN,k (−z)

= ddz −

1N

log det I N +K

k=1

eN,k (−z)R k

−K

k =1

1N

logdet( I n k + ck eN,k (−z)T k ) + zK

k=1

eN,k (−z)eN,k (−z)

which after integration leads to

z

1

w −mN (

−w) dw

= 1N

log det I N +K

k=1

eN,k (−z)R k



+K

k =1

1N

logdet( I n k + ck eN,k (−z)T k ) −zK

k =1

eN,k (−z)eN,k (−z)

which is exactly the right-hand side of ( 6.26) for z = −1/x .

Theorem 6.4 and Corollary 6.1 have obvious direct applications in wirelesscommunications since the Shannon transform VB N dened above is the per-dimension capacity of the multi-dimensional channel, whose model is givenby K

k=1 R12k X k T

12k . This is the typical model used for evaluating the rate

region of a narrowband multiple antenna multiple access channel. This topicis discussed and extended in Chapter 14, e.g. to the question of nding thetransmit covariance matrix that maximizes the deterministic equivalent (hencethe asymptotic capacity).

6.2.2 Gaussian method

The second result that we present is very similar in nature to Theorem 6.1 butinstead of considering sums of matrices of the type

B N =K

k=1

R12k X k T k X H

k R12k

we treat the question of matrices of the type

B N = K

k=1

R12k X k T

12k

K

k=1

R12k X k T

12k

H

.

To obtain a deterministic equivalent for this model, the same technique as beforecould be used. Instead, we develop an alternative method, known as the Gaussian method , when the X k have Gaussian i.i.d. entries, for which fast convergence ratesof the functional of the mean e.s.d. can be proved.

Theorem 6.5 ([Dupuy and Loubaton , 2009]). Let K be some positive integer.For two positive integers N, n , denote

B N =k=1

R12k X k T

12k

k=1

R12k X k T

12k

H

where the notations are the same as in Theorem 6.1, with the additional assumptions that n1 = . . . = nK = n, the random matrix X k ∈C N ×n k has independent Gaussian entries (of zero mean and variance 1/n ) and the spectral norms R k and T k are uniformly bounded with N . Note additionally that,

from the unitarily invariance of X k , T k is not restricted to be diagonal. Then,denoting as above mB N the Stieltjes transform of B N , we have

N (E[mB N (z)] −mN (z)) = O (1/N )




We then have the integration by parts formula

E[xk f (x )] =N

i =1

r ki E∂f (x )

∂x∗iwith rki the entry (k, i ) of R .

This relation will be used to derive directly the deterministic equivalent, whichsubstitutes to the ‘guess-work’ step of the proof of Theorem 6.1. Note inparticular that it requires us to use all entries of R here and not simply itseigenvalues. This generalizes the Marcenko–Pastur method that only handleddiagonal entries. However, as already mentioned, the introduction of theexpectation in front of xk f (x ) cannot be avoided;

• the Nash–Poincare inequality

Theorem 6.7 ([Pastur , 1999]). Let x and f be as in Theorem 6.6 , and let

∇z f = [∂f/∂z 1 , . . . ,∂ f /∂z N ]T . Then, we have the following Nash–Poincare inequality

var( f (x )) ≤ E ∇x f (x )T R (∇x f (x ))∗+ E (∇x∗f (x )) H R∇x∗f (x ) .

This result will be used to bound the deviations of the random matrices underconsideration.

For more details on Gaussian methods, see [Hachem et al., 2008a]. We now

give the main steps of the proof of Theorem 6.5.Proof of Theorem 6.5 . We rst consider E( B N −zI N )−1 . Noting that −z(B N −zI N )−1 = I N −(B N −zI N )−1B N , we apply the integration by parts, Theorem6.6, in order to evaluate the matrix

E (B N −zI N )−1B N .

To this end, we wish to characterize every entry

E (B N −zI N )−1B N aa

= 1≤k, k≤K E (B N −zI N )−1

R

12

k (X k T

12

k R

12

k )(X k T

12

k )H

aa .

This is however not so simple and does not lead immediately to a nice formenabling us to use the Gaussian entries of the X k as the inputs of Theorem 6.6.Instead, we will consider the multivariate expression

E (B N −zI N )−1ab (R

12k X k T

12k )cd (R

12k X k T

12k )H

ea

for some k, k ∈ 1, . . . , K and given a, a ,b,c,d,e . This enables us to somehowunfold easily the matrix products before we set b = c and d = e, and simplify the

management of the Gaussian variables. This being said, we take the vector x of Theorem 6.6 to be the vector whose entries are denoted

xk,c,d x(k−1) Nn +( c−1) N + d = ( R12k X k T

12k )cd

Page 166: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 166/562

Page 167: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 167/562

6.2. Techniques for deterministic equivalents 143

uniformly bounded matrix E

E tr E (B N

−zI N )−1 = tr E (

−z[I N +

K

k=1

eN,k (z)R k ])−1 + O 1

N

from which N (E[eB N ,k (z)] −eN,k (z)) = O(1/N ) (for E = R k ) and nallyN (E[mB N (z)] −mN (z)) = O(1/N ) (for E = I N ). This is performed in a similarway as in the proof for Theorem 6.1, with the additional results coming from theNash–Poincare inequality.

The Gaussian method, while requiring more intensive calculus, allows us tounfold naturally the deterministic equivalent under study for all types of matrixcombinations involving Gaussian matrices. It might as well be used as a tool

to infer the deterministic equivalent of more involved models for which suchdeterministic equivalents are not obvious to ‘guess’ or for which the Marcenko–Pastur method for diagonal matrices cannot be used. For the latest resultsderived from this technique, refer to, e.g., [Hachem et al., 2008a; Khorunzhyet al. , 1996; Pastur, 1999 ]. It is believed that Haar matrices can be treated usingthe same tools, to the effort of more involved computations but, to the best of our knowledge, there exists no reference of such a work, yet.

In the same way as we derived the expression of the Shannon transform of themodel B N of Theorem 6.1 in Corollary 6.1, we have the following result for B N

in Theorem 6.5.

Theorem 6.8 ([Dupuy and Loubaton, 2010 ]). Let B N ∈C N ×N be dened as in Theorem 6.5. Then the Shannon transform VB N of B N satises

N (E[VB N (x)] −VN (x)) = O(1/N )

where VN (x) is dened, for x > 0, as

VN (x) = 1N

log det I N +K

k=1

eN,k (−1/x )R k

+ 1N

log det I n +K

k =1

eN,k (−1/x )T k

− nN

1x

K

k=1

eN,k (−1/x )eN,k (−1/x ). (6.28)

Note that the expressions of ( 6.26) and ( 6.28) are very similar, apart from theposition of a summation symbol.

Both Theorem 6.1 and Theorem 6.5 can then be compiled into an even moregeneral result, as follows. This is however not a corollary of Theorem 6.1 andTheorem 6.5, since the complete proof must be derived from the beginning.

Page 168: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 168/562

144 6. Deterministic equivalents

Theorem 6.9. For k = 1, . . . , K , denote H k ∈C N ×n k the random matrix such that, for a given positive Lk

H k =L k

l=1

R 12k,l X k,l T 12

k,l

for R12k,l a Hermitian non-negative square root of the Hermitian non-negative

R k,l ∈C N ×N , T12k,l a Hermitian non-negative square root of the Hermitian non-

negative T k,l ∈C n k ×n k and X k,l ∈C N ×n k with Gaussian i.i.d. entries of zeromean and variance 1/n k . All R k,l and T k,l are uniformly bounded with respect to N , nk . Denote also for all k, ck = N/n k .

Call mB N (z) the Stieltjes transform of B N =

K k =1 H k H H

k , i.e. for z ∈C \ R +

mB N (z) = 1N

tr K

k =1

H k H Hk −zI N

−1

.

We then have

N (E[mB N (z)] −mN (z)) → 0

where mN (z) is dened as

mN (z) = 1

N tr

−z

K

k=1

L k

l=1

eN ;k,l (z)R k,l + I N

−1

and eN ;k,l solves the xed-point equations

eN ;k,l (z) = 1nk

tr T k,l −zL k

l =1

eN ;k,l (z)T k,l + I n k

−1

eN ;k,l (z) = 1nk

tr R k,l −zK

k =1

L k

l =1

eN ;k ,l (z)R k ,l + I N

−1

.

We also have that the Shannon transform VB N (x) of B N satises

N (E[VB N (x)] −VN (x)) → 0

where

VN (x) = 1N

log det K

k=1

L k

l=1

eN ;k,l (−1/x )R k,l + I N

+K

k=1

1N

log detL k

l=1

eN ;k,l (−1/x )T k,l + I n k

− 1x

K

k=1

nk

N

L k

l=1

eN ;k,l (−1/x )eN ;k,l (−1/x ).

Page 169: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 169/562

6.2. Techniques for deterministic equivalents 145

For practical applications, this formula provides the whole picture for theergodic rate region of large MIMO multiple access channels, with K multipleantenna users, user k being equipped with nk antennas, when the differentchannels into consideration are frequency selective with Lk taps for user k, slowfading in time, and for each tap modeled as Kronecker with receive and transmitcorrelation R k,l and T k,l , respectively.

We now move to another type of deterministic equivalents, when the entriesof the matrix X are not necessarily of zero mean and have possibly differentvariances.

6.2.3 Information plus noise models

In Section 3.2, we introduced an important limiting Stieltjes transform result,Theorem 3.14, for the Gram matrix of a random i.i.d. matrix X ∈C N ×n with avariance prole σ2

ij /n , 1 ≤ i ≤ N and 1 ≤ j ≤ n. One hypothesis of Girkos’slaw is that the prole σij converges to a density σ(x, y ) in the sense that

σij − i

N

i −1N

jn

j −1n

σ(x, y )dxdy → 0.

It will turn out in practical applications that such an assumption is in generalunusable. Typically, suppose that σij is the channel fading between antenna iand antenna j , respectively, at the transmitter and receiver of a multiple antennachannel. As one grows N and n simultaneously, there is no reason for the σij

to converge in any sense to a density σ(x, y ). In the following, we thereforerewrite Theorem 3.14 in terms of deterministic equivalents without the need forany assumption of convergence. This result is in fact a corollary of the verygeneral Theorem 6.14, presented later in this section, although the deterministicequivalent is written in a slightly different form. A sketch of the proof using theBai and Silverstein approach is also provided.

Theorem 6.10. Let X N ∈C N ×n have independent entries xij with zero mean,variance σ2

ij /n and 4 + ε moment of order O(1/N 2+ ε/ 2), for some ε. Assume that the σij are deterministic and uniformly bounded, over n, N . Then, as N , ngrow large with ratio cn N/n such that 0 < lim inf n cn ≤ limsup n cn < ∞, the e.s.d. F B N of B N = X N X

HN satises

F B N −F N ⇒ 0

almost surely, where F N is the distribution function of Stieltjes transform mN (z),z

∈C

\R + , given by:

mN (z) = 1N

N

k =1

11n

ni =1 σ2

ki1

1+ eN,i (z ) −z

Page 170: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 170/562

146 6. Deterministic equivalents

where eN, 1(z), . . . , e N,n (z) form the unique solution of

eN,j (z) = 1n

N

k =1

σ2kj

1n ni =1 σ2ki 11+ eN,i (z ) −z (6.29)

such that all eN,j (z) are Stieltjes transforms of a distribution function.

The reason why point-wise uniqueness of the eN,j (z) is not provided here is dueto the approach of the proof of uniqueness followed by Hachem et al. [Hachemet al. , 2007] which is a functional proof of uniqueness of the Stieltjes transformsthat the applications z → eN,i (z) dene. This does not mean that point-wiseuniqueness does not hold but this is as far as this theorem goes.

Theorem 6.10 can then be written is a more compact and symmetric form byrewriting eN,j (z) in (6.29) as

eN,j (z) = −1z

1n

N

k =1

σ2kj

1 + eN,k (z)

eN,k (z) = −1z

1n

n

i =1

σ2ki

1 + eN,i (z). (6.30)

In this case, mN (z) is simply

mN (z) = −1z

1N

N

k=1

11 + eN,k (z)

.

Note that this version of Girko’s law, Theorem 3.14, is both more generalin the assumptions made, and more explicit. We readily see in this result thatxed-point algorithms, if they converge at all, allow us to recover the 2 n coupledEquations ( 6.30), from which mN (z) is then explicit.

For the sake of understanding and to further justify the strength of thetechniques introduced so far, we provide hereafter the rst steps of the proof using the Bai and Silverstein technique. A complete proof can be found as aparticular case of [Hachem et al., 2007; Wagner et al., 2011].

Proof. Instead of studying mN (z), let us consider the more general eA N (z), adeterministic equivalent for

1N

tr A N X N XHN −zI N −1

.

Using Bai and Silverstein approach, we introduce F

∈C N ×N some matrix yet

to be dened, and compute

eA N (z) = 1N

tr A N (F −zI N )−1 .

Page 171: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 171/562

6.2. Techniques for deterministic equivalents 147

Using the resolvent identity, Lemma 6.1, and writing X N XHN = n

i =1 x i xHi ,

we have:1N tr A N X N X

H

N −zI N −1

− 1N tr A N (F −zI N )−

1

= 1N

tr A N X N XHN −zI N −1

F (F −zI N )−1

− 1N

n

i =1tr A N X N X

H

N −zI N −1x i x H

i (F −zI N )−1

from which we then express the second term on the right-hand side under theform of sums for i ∈ 1, . . . , N of x H

i (F −zI N )−1 A N X N XHN −zI N −1 x i and

we use Lemma 6.2 on the matrix X N XHN −zI N −1 to obtain

1N

tr A N X N XH

N −zI N −1

− 1N

tr A N (F −zI N )−1

= 1N

tr A N X N XH

N −zI N −1F (F −zI N )−1

− 1N

n

i =1

x Hi (F −zI N )−1 A N X ( i ) X H

( i ) −zI N −1

x i

1 + x Hi X ( i ) X H

( i ) −zI N −1

x i

(6.31)

with X ( i ) = [x 1 , . . . , x i−1 , x i+1 , . . . , x n ].

Under this form, x i and X ( i ) XH

( i ) −zI N −1

have independent entries.However, x i does not have identically distributed entries, so that Theorem 3.4cannot be straightforwardly applied. We therefore dene y i ∈C N as

x i = Σ i y i

with Σ i ∈C N ×N a diagonal matrix with kth diagonal entry equal to σki , and y i

has identically distributed entries of zero mean and variance 1 /n . Replacing alloccurrences of x i in (6.31) by Σ i y i , we have:

1N tr A N X N X

H

N −zI N −1

− 1N tr A N (F −zI N )−

1

= 1N

tr A N X N XH

N −zI N −1F (F −zI N )−1

− 1N

n

i =1

y Hi Σ i (F −zI N )−1 A N X ( i ) X H

( i ) −zI N −1

Σ i y i

1 + y Hi Σ i X ( i ) X H

( i ) −zI N −1

Σ i y i

. (6.32)

Applying the trace lemma, Theorem 3.4, the quadratic terms of the formy H

i Yy i are close to 1n tr Y . Therefore, in order for ( 6.32) to converge to zero, F

ought to take the form

F = 1n

n

i =1

11 + eB N ,i (z)

Σ 2i

Page 172: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 172/562

148 6. Deterministic equivalents

with

eB N ,i (z) = 1n

tr Σ 2i X N X

H

N −zI N −1.

We therefore infer that eN,i (z) takes the form

eN,i (z) = 1n

N

k =1

σ2ki

1n

ni =1 σ2

ki1

1+ eN,i (z ) −z

by setting A N = Σ 2i .

From this point on, the result unfolds by showing the almost sure convergencetowards zero of the difference eN,i (z) − 1

n tr Σ 2i X N X

HN −zI N −1 and the

functional uniqueness of the implicit equation for the eN,i (z).

The symmetric expressions ( 6.30) make it easy to derive also a deterministicequivalent of the Shannon transform.

Theorem 6.11. Let B N be dened as in Theorem 6.10 and let x > 0. Then, as N , n grow large with uniformly bounded ratio cn = N/n , the Shannon transform VB N (x) of B N , dened as

VB N (x) 1N

logdet( I N + xB N )

satises

E[VB N (x)] −VN (x) → 0

where VN (x) is given by:

VN (x) = 1N

N

k=1

log 1 + eN,k (−1x

) + 1N

n

i=1

log 1 + eN,i (−1x

)

− x

nN 1≤k≤N 1≤i≤n

σ2ki

1 + eN,k (−1x ) 1 + eN,i (−1x ).

It is worth pointing out here that the Shannon transform convergence resultis only stated in the mean sense and not, as was the case in Theorem 6.4, in thealmost sure sense. Remember indeed that the convergence result of Theorem 6.4depends strongly on the fact that the empirical matrix B N can be proved to havebounded spectral norm for all large N , almost surely. This is a consequence of spectral norm inequalities and of Theorem 7.1. However, it is not known whetherTheorem 7.1 holds true for matrices with a variance prole and the derivationof Theorem 6.4 can therefore not be reproduced straightforwardly.

It is in fact not difficult to show the convergence of the Shannon transform inthe mean via a simple dominated convergence argument. Indeed, remembering

Page 173: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 173/562

6.2. Techniques for deterministic equivalents 149

the Shannon transform denition, Denition 3.2, we have:

E[VB N (x)]

−VN (x) =

∞1x

1

t −E[mB N (

−t)] dt

− ∞1x

1

t −mN (

−t) dt

(6.33)

for which we in particular have

1t −E[mB N (−t)] −

1t −mN (−t)

≤1t −E[mB N (−t)] +

1t −mN (−t)

=

1

t −

1

λ + tE[dF B N (λ)] +

1

t −

1

λ + tdF N (λ)

≤ 1t2 λE[dF B N (λ)] +

1t2 λdF N (λ).

It is now easy to prove from standard expectation calculus that both integralsabove are upper-bound by lim sup N sup i R i < ∞. Writing Equation ( 6.33)under the form of a single integral, we have that the integrand tends to zeroas N → ∞ and is summable over the integration parameter t. Therefore, fromthe dominated convergence theorem, Theorem 6.3, E[VB N (x)] −VN (x) → 0.

Note now that, in the proof of Theorem 6.10, there is no actual need for the

matrices Σ k to be diagonal. Also, there is no huge difficulty added by consideringthe matrix X N XHN + A N , instead of X N X

HN for any deterministic A N . As such,

Theorem 6.10 can be further generalized as follows.

Theorem 6.12 ([Wagner et al., 2011]). Let X N ∈C N ×n have independent columns x i = H i y i , where y i ∈C N i has i.i.d. entries of zero mean, variance 1/n , and 4 + ε moment of order O(1/n 2+ ε/ 2), and H i ∈C N ×N i are such that R i H i H H

i has uniformly bounded spectral norm over n, N . Let also A N ∈C N ×N be Hermitian non-negative and denote B N = X N X

HN + A N . Then, as N ,

N 1 , . . . , N n , and n grow large with ratios ci N i /n and c0 N/n satisfying 0 <liminf n ci ≤ lim supn ci < ∞ for 0 ≤ i ≤ n, we have that, for all non-negative Hermitian matrix C N ∈C N ×N with uniformly bounded spectral norm

1n

tr C N (B N −zI N )−1 − 1n

tr C N 1n

n

i =1

11 + eN,i (z)

R i + A N −zI N

−1a .s.

−→ 0

where eN, 1(z), . . . , e N,n (z) form the unique functional solution of

eN,j (z) = 1n

tr R j1n

n

i=1

11 + eN,i (z)

R i + A N −zI N

−1

(6.34)

such that all eN,j (z) are Stieltjes transforms of a non-negative nite measure on R + . Moreover, (eN, 1(z), . . . , e N,n (z)) is given by eN,i (z) = lim k→∞e

(k )N,i (z), where

Page 174: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 174/562

150 6. Deterministic equivalents

e(0)N,i = −1/z and, for k ≥ 0

e(k+1)N,j (z) =

1n tr R j

1n

n

i =1

11 + e(k )

N,i (z) R i + A N −zI N

−1

.

Also, for x > 0, the Shannon transform VB N (x) of B N , dened as

VB N (x) 1N

logdet( I N + xB N )

satises

E[VB N (x)] −VN (x) → 0

where VN (x) is given by:

VN (x) = 1N

log det I N + x1n

n

i =1

11 + eN,i (−1

x )R i + A N

+ 1N

n

i =1

log 1 + eN,i (−1x

) − 1N

n

i =1

eN,i (−1x )

1 + eN,i (−1x )

.

Remark 6.5. Consider the identically distributed entries x1 , . . . , x n in Theorem6.12, and take n1 , . . . , n K to be K integers such that

i n i = n. Dene

R 1 , . . . , R K

∈C N ×N to be K non-negative denite matrices with uniformly

bounded spectral norm and T 1 ∈C n 1 ×n 1 , . . . , T K ∈C n K ×n K to be K diagonalmatrices with positive entries, T k = diag( tk1 , . . . , t kn k ). Denote R k = R j t ji ,k ∈ 1, . . . , n , with j the smallest integer such that k −(n1 + . . . + n j −1) > 0,n0 = 0, and i = k −(n1 + . . . + n j −1). Under these conditions and notations, upto some hypothesis restrictions, Theorem 6.12 with H i = R

12i also generalizes

Theorem 6.1 applied to the sum of K Gram matrices with left correlation matrixR 1 , . . . , R K and right correlation matrices T 1 , . . . , T K .

From Theorem 6.12, taking A N = 0, we also immediately have that the

distribution function F N with Stieltjes transform

mN (z) = 1N

tr1n

n

i =1

11 + eN,i (z)

R i −zI N

−1

(6.35)

where

eN,j (z) = 1n

tr R j1n

n

i =1

11 + eN,i (z)

R i −zI N

−1

(6.36)

is a deterministic equivalent for F X N X H

N . An interesting result with applicationin low complex lter design, see Section 13.6 of Chapter 13, is the description inclosed-form of the successive moments of the distribution function F N .

Page 175: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 175/562

Page 176: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 176/562

Page 177: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 177/562

Page 178: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 178/562

154 6. Deterministic equivalents

where the symbol “ ” stands for some approximation in the large N limit. Noticethen that Πx is, up to a basis change, a vector composed of N −n + 1 i.i.d.standard Gaussian entries and n

−1 zeros. Hence Πx 2

N

−n

→ 1. Dening now W

such that WW H −ww H = UU H , the reasoning remains valid, and this entails(6.38).

Since B N in Theorem 6.15 is assumed of uniformly bounded spectral norm,w H B N w is uniformly bounded also. Hence, if N, n grow large with ratio n/N uniformly away from one, the term 1

N −n w H B N w tends to zero. This thereforeentails the following corollary, which can be seen as a rank-1 perturbation of Theorem 6.15.

Corollary 6.2. Let W and B N be dened as in Theorem 6.15 , with N and nsuch that lim supn

nN < 1. Then, as N, n grow large, for w any column of W

w H B N w − 1

N −n tr B N I N −WW H a.s.

−→ 0.

Corollary 6.2 only differs from Theorem 6.15 by the fact that the projector Πis changed into IN −WW H .

Also, when B N is independent of W , we fall back on the same result as forthe i.i.d. case.

Corollary 6.3. Let W be dened as in Theorem 6.15 , and let A ∈C N ×N be independent of W and have uniformly bounded spectral norm. Then, as N grows large, for w any column of W , we have:

w H Aw − 1N

tr A a.s.

−→ 0.

Theorem 6.15 is the basis for establishing deterministic equivalents involvingisometric matrices. In the following, we introduce a result, based on Silverstein

and Bai’s approach, which generalizes Theorems 4.10, 4.11, and 4.12 to the casewhen the W i matrices are multiplied on the left by different non-necessarily co-diagonalizable matrices. These models are the basis for studying the propertiesof multi-user or multi-cellular communications both involving unitary precodersand taking into account the frequency selectivity of the channel. From amathematical point of view, there exists no simple way to study such modelsusing tools extracted solely from free probability theory. In particular, it isinteresting to note that in [Peacock et al., 2008], the authors already generalizedTheorem 4.12 to the case where the left-product matrices are different but co-diagonalizable. To do so, the authors relied on tools from free probability asthe basic instruments and then need some extra matrix manipulation to derivetheir limiting result, in a sort of hybrid method between free probability andanalytical approach. In the results to come, though, no mention will be made to

Page 179: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 179/562

6.2. Techniques for deterministic equivalents 155

free probability theory, as the result can be derived autonomously from the toolsdeveloped in this section.

The following results are taken from [Couillet et al., 2011b], where detailedproofs can be found. We start by introducing the fundamental equations.

Theorem 6.16 ([Couillet et al., 2011b]). For i ∈ 1, . . . , K , let T i ∈C n i ×n i be Hermitian diagonal and let H i ∈C N ×N i . Dene R i H i H

Hi ∈C N ×N , ci = n i

N iand ci = N i

N . Then the following system of equations in (e1(z), . . . , eK (z)) :

ei (z) = 1N

tr T i (ei (z)T i + [ ci −ei (z)ei (z)]I n i )−1

ei (z) = 1

N tr R i

K

j =1

ej (z)R j

−zI N

−1

(6.39)

has a unique solution (e1(z), . . . , eK (z)) ∈C(C , C ) satisfying (e1(z), . . . , e K (z)) ∈S(R + )K and, for z real negative, 0 ≤ ei (z) < c i ci / ei (z) for all i. Moreover, for each real negative z

ei (z) = limt→∞

e( t )i (z)

where e( t )i (z) is the unique solution of

e( t )

i (z) =

1

N tr T i e( t )

i (z)T i + [ ci

−e( t )

i (z)e( t )

i (z)]I n

i

−1

within the interval [0, ci ci /e( t )i (z)) , e(0)

i (z) can take any positive value and e( t )i (z)

is recursively dened by

e( t )i (z) =

1N

tr R i

K

j =1

e( t−1)j (z)R j −zI N

−1

.

We then have the following theorem on a deterministic equivalent for the e.s.d.of the model B N =

K k=1 H i W i T i W

Hi H H

i .

Theorem 6.17 ([Couillet et al., 2011b]). For i ∈ 1, . . . , K , let T i ∈C n i ×n i

be a Hermitian non-negative matrix with spectral norm bounded uniformly along n i and W i ∈C N i ×n i be ni ≤ N i columns of a unitary Haar distributed random matrix. Consider H i ∈C N ×N i a random matrix such that R i H i H

Hi ∈C N ×N

has uniformly bounded spectral norm along N , almost surely. Dene ci = n iN i and

ci = N iN and denote

B N =K

i =1

H i W i T i WHi H H

i .

Then, as N , N 1 , . . . , N K , n 1 , . . . , n K grow to innity with ratios ci satisfying 0 < liminf ci ≤ lim sup ci < ∞ and 0 ≤ ci ≤ 1 for all i, the following limit holds

Page 180: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 180/562

156 6. Deterministic equivalents

true almost surely

F B N −F N ⇒ 0

where F N is the distribution function with Stieltjes transform mN (z) dened by

mN (z) = 1N

tr K

i =1ei (z)R i −zI N

−1

where (e1(z), . . . , eK (z)) are given by Theorem 6.16 .

Consider the case when, for each i, ci = 1 and H i = R12i for some square

Hermitian non-negative square root R12i of R i . We observe that the system of

Equations ( 6.39) is very similar to the system of Equations ( 6.7) established forthe case of i.i.d. random matrices. The noticeable difference here is the additionof the extra term −ei ei in the expression of ei . Without this term, we fall back onthe i.i.d. case. Notice also that the case K = 1 corresponds exactly to Theorem4.11, which was treated for c1 = 1.

Another point worth commenting on here is that, when z < 0, the xed-pointalgorithm to determine ei can be initialized at any positive value, while the xed-point algorithm to determine ¯ ei must be initialized properly. If not, it is possiblethat e( t )

i diverges. Also, if we naively run the xed-point algorithm jointly overei and ei , we may end up not converging to the correct solution at all. Based onexperience, this case arises sometimes if no particular care is taken.

We hereafter provide both a sketch of the proof and a rather extensivederivation, which explains how (6.39) is derived and how uniqueness is proved.We will only treat the case where, for all i, limsup ci < 1, ci = 1, H i = R

12i and

the R i are deterministic with uniformly bounded spectral norm in order both tosimplify notations and for the derivations to be close in nature to those proposedin the proof of Theorem 6.1. The case where in particular limsup ci = 1 for acertain i only demands some additional technicalities, which are not necessaryhere. Nonetheless, note that, for practical applications, all these hypotheses are

essential, as unitary precoding systems such as code division or space divisionmultiple access systems, e.g. CDMA and SDMA, may require square unitaryprecoding matrices (hence ci = 1) and may involve rectangular multiple antennachannel matrices H i ; these channels being modeled as Gaussian i.i.d.-basedmatrices with almost surely bounded spectral norm. The proof follows thederivation in [Couillet et al., 2011b], where a detailed derivation can be found.

The main steps of the proof are similar to those developed for the proof of Theorem 6.1. In order to propose different approaches than in previousderivations, we will work almost exclusively with real negative z, instead of z with positive imaginary part. We will also provide a shorter proof of thenal convergence step mB N (z) −mN (z) a.s.−→ 0, relying on restrictions of thedomain of z along with arguments from Vitali’s convergence theorem. Theseapproaches are valid here because upper bounds on the spectral norms of R i and

Page 181: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 181/562

6.2. Techniques for deterministic equivalents 157

T i are considered, which was not the case for Theorem 6.1. Apart from thesetechnical considerations, the main noticeable difference between the deterministicequivalent approaches proposed for matrices with independent entries and forHaar matrices lies in the rst convergence step, which is much more intricate.

Proof. We rst provide a sketch of the proof for better understanding, which willenhance the aforementioned main novelty. As usual, we wish to prove that thereexists a matrix F = K

i =1 f i R i , such that, for all non-negative A with A < ∞1N

tr A (B N −zI N )−1 − 1N

tr A (F −zI N )−1 a.s.

−→ 0.

Contrary to classical deterministic equivalent approaches for random matriceswith i.i.d. entries, nding a deterministic equivalent for 1

N tr A (B N −zI N )−1

is not straightforward. The reason is that during the derivation, terms suchas 1N −n i

tr I N −W i WHi A

12 (B N −zI N )−1 A

12 , with the I N −W i W

Hi prex

will naturally appear, as a result of applying the trace lemma, Theorem 6.15,that will be required to be controlled. We proceed as follows.

• We rst denote for all i, δ i 1N −n i

tr I N −W i WHi R

12i (B N −zI N )−1 R

12i

some auxiliary variable. Then, using the same techniques as in the proof of Theorem 6.1, denoting further f i 1

N tr R i (B N −zI N )−1 , we prove

f i − 1N

tr R i (G −zI N )−1 a.s.

−→ 0

with G = K j =1 gj R j and

gi = 1

1 −ci + 1N

n il=1

11+ t il δi

1N

n i

l=1

t il

1 + t il δ i

where ti 1 , . . . , t in i are the eigenvalues of T i . Noticing additionally that

(1 −ci )δ i −f i + 1N

n i

l=1

δ i1 + t il δ i

a .s.

−→ 0

we have a rst hint on a rst deterministic equivalent for f i . Precisely, weexpect to obtain the set of fundamental equations

∆ i = 11 −ci

ei − 1N

n i

l=1

∆ i

1 + t il ∆ i

ei = 1N

tr R i

K

j =1

11 −cj + 1

N n jl=1

11+ t jl ∆ j

1N

n j

l=1

t jl

1 + t jl ∆ jR j −zI N

−1

.

• The expressions of gi and their deterministic equivalents are however not very

convenient under this form. It is then shown that

gi − 1N

n i

l=1

t il

1 + t il f i −f i gi= gi −

1N

tr T i (f i T i + [1 −f i gi ]I n i )−1 a.s.

−→ 0

Page 182: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 182/562

Page 183: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 183/562

6.2. Techniques for deterministic equivalents 159

which is solved by arguments borrowed from the work of Hachem et al. [Hachemet al. , 2007], using a restriction on the denition domain of z, which simpliesgreatly the calculus.

We now turn to the precise proof. We use again the Bai and Silverstein steps:the convergence f i − 1

N tr R iK j =1 f j R j −zI N

−1 a .s.

−→ 0 in a rst step, the

existence and uniqueness of a solution to ei = 1N tr R i

K j =1 ej R j −zI N

−1

in a second, and the convergence ei −f ia .s.

−→ 0 in a third. Although precisecontrol of the random variables involved needs be carried out, as is detailedin [Couillet et al., 2011b], we hereafter elude most technical parts for simplicityand understanding.

Step 1: First convergence stepIn this section, we take z < 0, until further notice. Let us rst introducethe following parameters. We will denote T = max ilim sup T i , R =max ilim sup R i and c = max ilim sup ci.

We start with classical deterministic equivalent techniques. Let A ∈C N ×N bea Hermitian non-negative denite matrix with spectral norm uniformly boundedby A. Taking G = K

j =1 gj R j , with g1 , . . . , gK left undened for the moment,we have:

1

N tr A (B N

−zI N )−1

− 1

N tr A (G

−zI N )−1

= 1N

tr A (B N −zI N )−1K

i =1

R12i −W i T i W H

i + gi I N R12i (G −zI N )−1

=K

i=1

gi1N

tr A (B N −zI N )−1R i (G −zI N )−1

− 1N

K

i=1

n i

l=1

t il w H

il R12i (G −zI N )−1A (B N −zI N )−1R

12i w il

=K

i=1gi 1N tr A (B N −zI N )−1R i (G −zI N )−1

− 1N

K

i=1

n i

l=1

t il w Hil R

12i (G −zI N )−1A (B ( i,l ) −zI N )−1R

12i w il

1 + t il w Hil R

12i (B ( i,l ) −zI N )−1R

12i w il

, (6.40)

with ti 1 , . . . , t in i the eigenvalues of T i .The quadratic forms w H

il R12i (G −zI N )−1A (B ( i,l ) −zI N )−1R

12i w il and

w Hil R

12i (B ( i,l ) −zI N )−1R

12i w il are not asymptotically close to the trace of the

inner matrix, as in the i.i.d. case, but to the trace of the inner matrix multipliedby (I N −W i W H

i ). This complicates the calculus. In the following, we willtherefore study the following stochastic quantities, namely the random variablesδ i , β i and f i , introduced below.

Page 184: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 184/562

160 6. Deterministic equivalents

For every i ∈ 1, . . . , K , denote

δ i 1

N −n i

tr I N

−W i W H

i R12i (B N

−zI N )−1 R

12i

f i 1N

tr R i (B N −zI N )−1

both being clearly non-negative. We may already recognize that f i is a keyquantity for the subsequent derivations, as it will be shown to be asymptoticallyclose to ei , the central parameter of our deterministic equivalent.

Writing W i = [w i, 1 , . . . , w i,n i ] and W i WHi = n i

l=1 w il w Hil , we have from

standard calculus and the matrix inversion lemma, Lemma 6.2, that

(1

−ci )δ i = f i

− 1

N

n i

l=1

w H

ilR

12

i (B N

−zI N )−1 R

12

i w il

= f i − 1N

n i

l=1

w Hil R

12i B ( i,l ) −zI N −1 R

12i w il

1 + t il w Hil R

12i B ( i,l ) −zI N −1 R

12i w il

(6.41)

with B ( i,l ) = B N −t il R12i w il w H

il R12i .

Since z < 0, δ i ≥ 0, so that 11+ t il δi

is well dened. We recognize already

from Theorem 6.15 that each quadratic term w Hil R

12i B ( i,l ) −zI N −1 R

12i w il is

asymptotically close to δ i . By adding the term 1N

n il=1

δi1+ t il δ i

on both sides,

(6.41) can further be rewritten

(1 −ci )δ i −f i + 1N

n i

l=1

δ i1 + t il δ i

= 1N

n i

l=1

δ i1 + t il δ i −

w Hil R

12i B ( i,l ) −zI N −1 R

12i w il

1 + t il w Hil R

12i B ( i,l ) −zI N −1 R

12i w il

.

We now apply the trace lemma, Theorem 6.15, which ensures that

E (1 −ci )δ i −f i + 1N

n i

l=1δ i1 + t il δ i

4

= O 1N 2 . (6.42)

We do not provide the precise derivations of the fourth order moment inequalitieshere and in all the equations that follow, our main purpose being concentratedon the fundamental steps of the proof. Precise calculus and upper bounds canbe found in [Couillet et al., 2011b]. This is our rst relation that links δ i tof i = 1

N tr R i (B N −zI N )−1 .Introducing now an additional A (G −zI N )−1 matrix in the argument of the

trace of δ i , with G , A

∈C N ×N any non-negative denite matrices,

A

≤ A, we

denote

β i 1N −n i

tr I N −W i W H

i R12i (G −zI N )−1 A (B N −zI N )−1 R

12i .

Page 185: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 185/562

6.2. Techniques for deterministic equivalents 161

We then proceed similarly as for δ i by showing

β i = 1

N −n itr R

12i (G

−zI N )−1 A (B N

−zI N )−1 R

12i

− 1

N −n i

n i

l=1

w Hil R

12i (G −zI N )−1 A B ( i,l ) −zI N −1 R

12i w il

1 + t il w Hil R

12i B ( i,l ) −zI N −1 R

12i w il

from which we have:

1N −n i

tr R12i (G −zI N )−1 A (B N −zI N )−1 R

12i −

1N −n i

n i

l=1

β i1 + t il δ i −β i

= 1N −n i

n i

l=1

w H

ilR

12

i (G

−zI

N )−1 A B

( i,l ) −zI

N −1 R

12

i w

il1 + t il w H

il R12i B ( i,l ) −zI N −1 R

12i w il − β i1 + t il δ i .

Since numerators and denominators converge again to one another, we canshow from Theorem 6.15 again that

Ew H

il R12i (G −zI N )−1 A B ( i,l ) −zI N −1 R

12i w il

1 + t il w Hil R

12i B ( i,l ) −zI N −1 R

12i w il

− β i

1 + t il δ i

4

= O 1N 2

.

(6.43)

Hence

E1N

tr R i (G −zI N )−1 A (B N −zI N )−1 −β i 1 −ci + 1N

n i

l=1

11 + t il δ i

4

= O 1N 2

. (6.44)

This provides us with the second relation that links β i to1N tr R 12

i (G −zI N )−1 A (B N −zI N )−1 R 12i . That is, we have expressed

both δ i and β i as a function of the traces 1N tr R

12i (B N −zI N )−1 R

12i and

1N tr R

12i (G −zI N )−1 A (B N −zI N )−1 R

12i , which are more conventional to

work with.We are now in position to determine adequate expressions for ¯ g1 , . . . , gK .

From the fact that w Hil R

12i (B ( i,l ) −zI N )−1R

12i w il is asymptotically close to δ i

and that w Hil R

12i (G −zI N )−1A (B ( i,l ) −zI N )−1R

12i w il is asymptotically close to

β i , we choose, based on ( 6.44) especially

gi = 1

1 −ci + 1N

n il=1

11+ t il δi

1N

n i

l=1

t il

1 + t il δ i.

Page 186: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 186/562

162 6. Deterministic equivalents

We then have1N

tr A (B N −zI N )−1 − 1N

tr A (G −zI N )−1

=K

i =1

1N

n il=1

t il1+ t il δi

1N tr R i (G −zI N )−1 A (B N −zI N )−1

1 −ci + 1N

n il=1

11+ t il δi

− 1N

K

i =1

n i

l=1

t il w Hil R

12i (G −zI N )−1A (B ( i,l ) −zI N )−1R

12i w il

1 + t il w Hil R

12i (B ( i,l ) −zI N )−1R

12i w il

=K

i =1

1N

n i

l=1

t il

1N tr R i (G −zI N )−1 A (B N −zI N )−1

(1 −ci + 1N

n il =1

11+ t i,l δ i

)(1 + t il δ i )

−w H

ilR

12

i (G

−zI N )−1A (B ( i,l )

−zI N )−1R

12

i w il

1 + t il w Hil R

12i (B ( i,l ) −zI N )−1R

12i w il

.

To show that this last difference tends to zero, notice that 1 + t il δ i ≥ 1 and

1 −ci ≤ 1 −ci + 1N

n i

l=1

11 + t il δ i ≤ 1

which ensure that we can divide the term in the expectation in the left-handside of (6.44) by 1 + t il δ i and 1 −ci + 1

N n il=1

11+ t il δi

without risking alteringthe order of convergence. This results in

Eβ i

1 + t il δ i −1N tr R

12i (G −zI N )−1 A (B N −zI N )−1 R

12i

1 −ci + 1N

n il=1

11+ t il δi

(1 + t il δ i )

4

= O 1N 2

.

(6.45)

From ( 6.43) and ( 6.45), we nally have that

E1N tr R i (G −zI N )−1 A (B N −zI N )−1

1 −ci + 1N

n il=1

11+ t il δ i

(1 + t il δ i )

− w H

il R12i (G −zI N )−1 A B ( i,l ) −zI N −1 R

12i w il

1 + t il w Hil R

12i B ( i,l ) −zI N −1 R

12i w il

4

= O 1N 2

(6.46)

from which we obtain

E1N

tr A (B N −zI N )−1 − 1N

tr A (G −zI N )−14

= O 1N 2

. (6.47)

This provides us with a rst interesting result, from which we could infera deterministic equivalent of mB N (z), which would be written as a functionof deterministic equivalents of δ i and deterministic equivalents of f i , for i =

1, . . . , K . However this form is impractical to work with and we need to gofurther in the study of ¯gi .

Page 187: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 187/562

6.2. Techniques for deterministic equivalents 163

Observe that gi can be written under the form

gi = 1

N

n i

l=1

t il

(1 −ci + 1N

n i

l=11

1+ t il δi ) + t il δ i (1 −ci + 1N

n i

l=11

1+ t il δi ).

We will study the denominator of the above expression and show that it can besynthesized into a much more attractive form.

From ( 6.42), we rst have

E f i −δ i 1 −ci + 1N

n i

l=1

11 + t il δ i

4

= O 1N 2

.

Noticing that

1 −gi δ i 1 −ci + 1N

n i

l=1

11 + t il δ i

= 1 −ci + 1N

n i

l=1

11 + t il δ i

we therefore also have

E (1 −gi f i ) − 1 −ci + 1N

n i

l=1

11 + t il δ i

4

= O 1N 2

.

The two relations above lead to

E gi − 1N

n i

l=1

t il

t il f i + 1 −f i gi

4

= E1N

n i

l=1

t ilt il [f i −δ i κ i ] + [1 −f i gi −κ i ][κ i + t il δ i κ i ] [t il f i + 1 −f i gi ]

4

(6.48)

where we denoted κi 1 −ci + 1N

n il=1

11+ t il δi

.Again, all differences in the numerator converge to zero at a rate O(1/N 2).

However, the denominator presents now the term til f i + 1 −f i gi , which must

be controlled and ensured to be away from zero. For this, we can notice thatgi ≤ T / (1 −c) by denition, while f i ≤ R/ |z|, also by denition. It is thereforepossible, by taking z < 0 sufficiently small, to ensure that 1 −f i gi > 0. Wetherefore from now on assume that such z are considered.

Equation ( 6.48) becomes in this case

E gi − 1N

n i

l=1

t il

t il f i + 1 −f i gi

4

= O 1N 2

.

We are now ready to introduce the matrix F . Consider

F =K

i=1

f i R i ,

Page 188: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 188/562

164 6. Deterministic equivalents

with f i dened as the unique solution to the equation in x

x = 1

N

n i

l=1

t il

1 −f i x + f i t il(6.49)

within the interval 0 ≤ x < c i /f i . To prove the uniqueness of the solution withinthis interval, note simply that

ci

f i ≥ 1N

n i

l=1

t il

1 −f i (ci /f i ) + f i t il

0 ≤ 1N

n i

l=1

t il

1 −f i ·0 + f i t il

and that the function x → 1N

n il=1

t il1−f i x + f i t il

is convex. Hence the uniquenessof the solution in [0 , ci /f i ]. We also show that this solution is an attractor of thexed-point algorithm, when correctly initialized. Indeed, let x0 , x1 , . . . be denedby

xn +1 = 1N

n i

l=1

t il

1 −f i xn + f i t il

with x0 ∈ [0, ci /f i ]. Then, xn ∈ [0, ci /f i ] implies 1 −f i xn + f i t il ≥ 1 −ci +f i t il > f i t il and therefore f i xn +1

≤ ci , so x0 , x1 , . . . are all contained in [0 , ci /f i ].

Now observe that

xn +1 −xn = 1N

n i

l=1

f i (xn −xn −1)(1 + t il f i −f i xn )(1 + t il f i −f i xn −1)

so that the differences xn +1 −xn and xn −xn −1 have the same sign. Thesequence x0 , x1 , . . . is therefore monotonic and bounded: it converges. Callingx∞ this limit, we have:

x∞

= 1

N

n i

l=1

t il

1 + t il f i −f i x∞as required.

To nally prove that 1N tr A (B N −zI N )−1 − 1

N tr A (F −zI N )−1 a.s.

−→ 0, wewant now to show that ¯gi − f i tends to zero at a sufficiently fast rate. For this,we write

E gi − f i4

≤ 8E gi − 1N

n i

l=1

t il

t il f i + 1 −f i gi

4

+ 8E 1N

n i

l=1

t il

t il f i + 1 −f i gi − 1N

n i

l=1

t il

t il f i + 1 −f i f i

4

Page 189: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 189/562

6.2. Techniques for deterministic equivalents 165

= 8E gi − 1N

n i

l=1

t il

t il f i + 1 −f i gi

4

+ E gi − f i4 1

N

n i

l=1

t il f i(t il f i + 1 −f i f i )( t il f i + 1 −f i gi )

4

.

(6.50)

We only need to ensure now that the coefficient multiplying gi − f i in theright-hand side term is uniformly smaller than one. This unfolds again fromnoticing that the numerator can be made very small, with the denominator keptaway from zero, for sufficiently small z < 0. For these z, we can therefore provethat

E gi − f i4 = O

1N 2

.

It is important to notice that this holds essentially because we took f i to be theunique solution of ( 6.49) lying in the interval [0 , ci /f i ). The other solution (thathappens to equal 1 /f i for ci = 1) does not satisfy this fourth moment inequality.

Finally, we can proceed to proving the deterministic equivalent relations.

1

N tr A (G

−zI N )−1

− 1

N tr A (F

−zI N )−1

=K

i =1

1N

n i

l=1

t il

1N tr R i A (G −zI N )−1 (F −zI N )−1

(1 −ci + 1N

n il =1

11+ t i,l δi

)(1 + t il δ i )

−1N tr R i A (G −zI N )−1 (F −zI N )−1

1 −f i f i + t il f i

=K

i =1

1N

n i

l=1

t il 1

(1 −ci + 1N

n il =1

11+ t i,l δi

)(1 + t il δ i ) − 1

1 −f i gi + t il f i

+ 1

1 −f i gi + t il f i − 1

1 −f i f i + t il f i 1N tr R i A (G −zI N )−

1

(F −zI N )−1

.

The rst difference in brackets is already known to be small from previousconsiderations on the relations between ¯ gi and δ i . As for the second difference,it also goes to zero fast as E[ |gi − f i |4] is summable. We therefore have

E1N

tr A (G −zI N )−1 − 1N

tr A (F −zI N )−14

= O 1N 2

.

Together with ( 6.47), we nally have

E1N

tr A (B N −zI N )−1 − 1N

tr A (F −zI N )−14

= O 1N 2

.

Page 190: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 190/562

166 6. Deterministic equivalents

Applying the Markov inequality, Theorem 3.5, and the Borel–Cantelli lemma,Theorem 3.6, this entails

1N tr A (B N −zI N )−

1

− 1N tr A (F −zI N )−

1 a .s.

−→ 0 (6.51)

as N grows large. This holds however to this point for a restricted set of negativez. But now, from the Vitali convergence theorem, Theorem 3.11, and the factthat 1

N tr A (B N −zI N )−1 and 1N tr A (F −zI N )−1 are uniformly bounded on

all closed subset of C not containing the positive real half-line, we have that theconvergence ( 6.51) holds true for all z ∈C \ R + , and that this convergence isuniform on all closed subsets of C \ R + .

Applying the result for A = R j , this is in particular

f j − 1N

tr R j

K

i =1

f i R i −zI N −1

a .s.

−→ 0

where we recall that f i is the unique solution to

x = 1N

n i

i =1

t il

1 −f i x + t il f i

within the set [0 , ci /f i ).For A = I N , this says that

mB N (z) − 1N

tr R j

K

i =1

f i R i −zI N −1

a .s.

−→ 0

which proves the sought convergence of the Stieltjes transform. We now move toproving the existence and uniqueness of the set ( e1 , . . . , e K ) = ( e1(z), . . . , e K (z)).

Step 2: Existence and uniqueness The existence step unfolds similarly as in the proof of Theorem 6.1. It sufficesto consider the matrices T [ p],i ∈C n i p and R [ p],i ∈C Np for all i dened as the

Kronecker products T [ p],i T i ⊗I p , R [ p],i R i ⊗I p , which have, respectively,the d.f. F T i and F R i for all p. Similar to the i.i.d. case, it is easy to see that ei

is unchanged by substituting the T [ p],i and R [ p],i to the T i and R i , respectively.Denoting in the same way f [ p],i the equivalent of f i for T [ p],i and R [ p],i , from theconvergence result of Step 1, we can choose f [1],i , f [2],i , . . . a sequence of the setof probability one where convergence is ensured as p grows large (N and the ni

are kept xed). This sequence is uniformly bounded (by R/ |z|) in C \ R + , andtherefore we can extract a converging subsequence out of it. The limit over thissubsequence satises the xed-point equation, which therefore proves existence.It is easy to see that the limit is also the Stieltjes transform of a nite measureon R + by verifying the conditions of Theorem 3.2.

We will prove uniqueness of positive solutions e1 , . . . , e K > 0 for z < 0 andthe convergence of the classical xed-point algorithm to these values. We rst

Page 191: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 191/562

6.2. Techniques for deterministic equivalents 167

introduce some notations and useful identities. Notice that, similar to Step 1with the δ i terms, we can dene, for any pair of variables xi and x i , with x i

dened as the solution y to y = 1N

n il=1

t il1+ x j t il

−x j y such that 0

≤ y < cj /x j , the

auxiliary variables ∆ 1 , . . . , ∆ K , with the properties

x i = ∆ i 1 −ci + 1N

n i

l=1

11 + t il ∆ i

= ∆ i 1 − 1N

n i

l=1

t il ∆ i

1 + t il ∆ i

and

1 −x i x i = 1 −ci + 1N

n i

l=1

11 + t il ∆ i

= 1 − 1N

n i

l=1

t il ∆ i

1 + t il ∆ i.

The uniqueness of the mapping between the x i and ∆ i can be proved. In fact,it turns out that ∆ i is a monotonically increasing function of x i with ∆ i = 0 forx i = 0.

We take the opportunity of the above denitions to notice that, for xi > x iand x i , ∆ i dened similarly as x i and ∆ i

x i x i −x i x i = 1N

n i

l=1

t il (∆ i −∆ i )(1 + t il ∆ i )(1 + t il ∆ i )

> 0 (6.52)

whenever T i = 0. Therefore xi x i is a growing function of xi (or equivalently of ∆ i ). This will turn out a useful remark later.

We are now in position to prove the step of uniqueness. Dene, for i ∈1, . . . , K , the functions

h i : (x1 , . . . , x K ) → 1N

tr R i

K

j =1x j R j −zI N

−1

with x j the unique solution of the equation in y

y = 1N

n j

l=1

t jl

1 + x j t jl

−x j y

(6.53)

such that 0 ≤ y ≤ cj /x j .We will prove in the following that the multivariate function h = ( h1 , . . . , h K )

is a standard function , dened in [Yates , 1995], as follows.

Denition 6.2. A function h (x1 , . . . , x K ) ∈R K , h = ( h1 , . . . , h K ), is said to bea standard function or a standard interference function if it fullls the followingconditions

1. Positivity: for all j , if x1 , . . . , x K > 0 then hj (x1 , . . . , x K ) > 0,

2. Monotonicity: if x1 > x 1 , . . . , x K > x K , then hj (x1 , . . . , x K ) >h j (x1 , . . . , x K ), for all j ,

3. Scalability: for all α > 1 and j , αh j (x1 , . . . , x K ) > h j (αx 1 , . . . , α x K ).

Page 192: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 192/562

168 6. Deterministic equivalents

The important result regarding standard functions [Yates, 1995] is given asfollows.

Theorem 6.18. If a K -variate function h (x1 , . . . , x K ) is standard and there exists (x1 , . . . , x K ) such that, for all j , xj ≥ h j (x1 , . . . , x K ), then the xed-point algorithm that consists in setting

x( t +1)j = hj (x( t )

1 , . . . , x ( t )K )

for t ≥ 1 and for any initial values x(0)1 , . . . , x (0)

K > 0 converges to the unique jointly positive solution of the system of K equations

x j = hj (x1 , . . . , x K )

with j ∈ 1, . . . , K .

Proof. The proof of the uniqueness unfolds easily from the standard functionassumptions. Take ( x1 , . . . , x K ) and ( x1 , . . . , x K ) two sets of supposedly distinctall positive solutions. Then there exists j such that xj < x j , αx j = xj , and αx i ≥x i for i = j . From monotonicity and scalability, it follows that

x j = hj (x1 , . . . , x K ) ≤ hj (αx 1 , . . . , α x K ) < αh j (x1 , . . . , x K ) = αx j

a contradiction. The convergence of the xed-point algorithm from any point(x1 , . . . , x K ) unfolds from similar arguments, see [Yates, 1995] for more details.

Therefore, by showing that h (h1 , . . . , h K ) is standard, we will prove that theclassical xed-point algorithm converges to the unique set of positive solutionse1 , . . . , e K , when z < 0.

The positivity condition is straightforward as ¯ x i is positive for x i positive andtherefore hj (x1 , . . . , x K ) is always positive whenever x1 , . . . , x K are.

The scalability is also rather direct. Let α > 1, then:

αh j (x1 , . . . , x K )

−h j (αx 1 , . . . , α x K )

= 1N

tr R j

K

k =1

xk

α R k −

I N

−1

− 1N

tr R j

K

k =1

x(a )k R k −zI N

−1

where we denoted x(a )j the unique solution to ( 6.53) with xj replaced by αx j ,

within the set [0 , cj / (αx j )). Since αx i > x i , from the property ( 6.52), we haveαx k x(α )

k > x k xk or equivalently x(a )k − x k

α > 0. We now dene the two matricesA k

k=1x kα R k − z

α I N and A (α ) kk =1 x(α )

k R k −zI N . For any vector a ∈C N

a H A −A (α ) a = K

k=1

xk

α − x(α )k a H R k a + z 1 − 1

αa H a ≤ 0

Page 193: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 193/562

6.2. Techniques for deterministic equivalents 169

since z < 0, 1− 1α > 0 and x k

α − x(α )k < 0. Therefore A −A (α ) is non-positive

denite. Now, from [Horn and Johnson, 1985, Corollary 7.7.4], this implies thatA −1

−(A (α ) )−1 is non-negative denite. Writing

1N

tr R j A −1 −(A (α ) )−1 = 1N

N

i =1

r H

j,i A −1 −(A (α ) )−1 r j,i

with r j,i the ith column of R j , this ensures αh j (x1 , . . . , x K ) > h j (αx 1 , . . . , α x K ).The monotonicity requires some more lines of calculus. This unfolds from

considering x i as a function of ∆ i , by verifying that dd∆ i

x i is negative.

d

d∆ ix i =

1

∆2i

1

− 1

1 − 1N n

il=1 t il ∆ i1+ t il ∆ i

+ 1

∆2i

1N

n il=1

t il ∆ i(1+ t il ∆ i )2

1 − 1N

n il=1 t il ∆ i

1+ t il ∆ i

2

= − 1N

n il=1

t il ∆ i1+ t il ∆ i

1 − 1N

n il=1

t il ∆ i1+ t il ∆ i

+ 1N

n il=1

t il ∆ i(1+ t il ∆ i ) 2

∆ 2i 1 − 1

N n il=1

t il ∆ i1+ t il ∆ i

2

= 1N

n il=1

t il ∆ i1+ t il ∆ i

2

− 1N

n il=1

t il ∆ i1+ t il ∆ i

+ 1N

n il=1

t il ∆ i(1+ t il ∆ i )2

∆ 2i 1 − 1

N n il=1

t il ∆ i1+ t il ∆ i

2

= 1N

n i

l=1t il ∆ i

1+ t il ∆ i

2

− 1N

n i

l=1

( t il ∆ i )2

(1+ t il ∆ i ) 2

∆ 2i 1 − 1

N n il=1

t il ∆ i1+ t il ∆ i

2 .

From the Cauchy–Schwarz inequality, we have:

n i

l=1

1N

t il ∆ i

1 + t il ∆ i

2

≤n i

l=1

1N 2

n i

l=1

(t il ∆ i )2

(1 + t il ∆ i )2

= ci1N

n i

l=1

(t il ∆ i )2

(1 + t il ∆ i )2

< 1N

n i

l=1

(t il ∆ i )2

(1 + t il ∆ i )2

which is sufficient to conclude that dd∆ i

x i < 0. Since ∆ i is an increasing functionof xi , we have that x i is a decreasing function of xi , i.e. d

dx ix i < 0. This being

said, using the same line of reasoning as for scalability, we nally have that, fortwo sets x1 , . . . , x K and x1 , . . . , x K of positive values such that xj > x j

h j (x1 , . . . , x K ) −h(x1 , . . . , x K )

= 1N

tr R j

K

k=1

xk R k −zI N −1

− K

k =1

xk R k −zI N −1

> 0

Page 194: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 194/562

170 6. Deterministic equivalents

with x j dened equivalently as x j , and where the terms (¯xk − xk ) are all positivedue to negativity of d

dx ix i . This proves the monotonicity condition.

We nally have from Theorem 6.18 that ( e1 , . . . , e K ) is uniquely denedand that the classical xed-point algorithm converges to this solution fromany initialization point (remember that, at each step of the algorithm, the sete1 , . . . , eK must be evaluated, possibly thanks to a further xed-point algorithm).

Consider now two sets of Stieltjes transforms (e1(z), . . . , e K (z)) and(e1(z), . . . , e K (z)), z ∈C \ R + , functional solutions of the xed-point Equation(6.39). Since ei (z) −ei (z) = 0 for all i and for all z < 0, and ei (z) −ei (z) isholomorphic on C \ R + as the difference of Stieltjes transforms, by analyticcontinuation (see, e.g., [Rudin, 1986]), ei (z) −ei (z) = 0 over C \ R + . Thistherefore proves, in addition to point-wise uniqueness on the negative half-

line, the uniqueness of the Stieltjes transform solution of the functional implicitequation and dened over C \ R + .We nally complete the proof by showing that the stochastic f i and the

deterministic ei are asymptotically close to one another as N grows large.

Step 3: Convergence of ei −f iFor this step, we follow the approach in [Hachem et al., 2007]. Denote

εiN f i

− 1

N tr R i

K

k=1

f k R k

−zI N

−1

and recall the denitions of f i , ei , f i and ei :

f i = 1N

tr R i (B N −zI N )−1

ei = 1N

tr R i

K

j −1ej R j −zI N

−1

f i = 1N

n i

l=1t i,l

1 −f i f i + t i,l f i , f i ∈ [0, ci /f i ]

ei = 1N

n i

l=1

t i,l

1 −ei ei + t i,l ei, ei ∈ [0, ci /e i ].

From the denitions above, we have the following set of inequalities

f i ≤ R

|z|, ei ≤

R

|z|, f i ≤

T 1 −ci

, ei ≤ T 1 −ci

. (6.54)

We will show in the sequel that

ei −f ia .s.

−→ 0 (6.55)

Page 195: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 195/562

6.2. Techniques for deterministic equivalents 171

for all i ∈ 1, . . . , N . Write the following differences

f i −

ei =

K

j =1

(ej −

f j) 1

N tr R

i

K

k=1

ek

Rk −

zIN

−1

Rj

K

k=1

f k

Rk −

zIN

−1

+ εi

N

ei − f i = 1N

n i

l=1

t2i,l (f i −ei ) −t i,l f i f i −ei ei

(1 + t i,l ei −ei ei )(1 + t i,l f i − f i f i )

and

f i f i −ei ei = f i (f i −ei ) + ei ( f i −ei ).

For notational convenience, we dene the following values

α supi E |f i −ei |4

α supi

E |f i −ei |4 .

It is thus sufficient to show that α is summable to prove ( 6.55). By applying(6.54) to the absolute of the rst difference, we obtain

|f i −ei | ≤ KR 2

|z|2 sup

i |f i −ei |+ supi |εi

N |and hence

α ≤ 8K 4R8

|z|8 α + 8C

N 2 (6.56)

for some constant C > 0 such that E[ |sup i εiN |4] ≤ C/N 2 . This is possible since

E[|sup i εiN |4] ≤ 8K sup i E[|εi

N |4] and E[|εiN |4] has been proved to be of order

O(1/N 2). Similarly, we have for the third difference

|f i f i −ei ei | ≤ |f i ||f i −ei |+ |ei ||f i −ei |≤

T 1 −c

supi |f i −ei |+

R

|z| sup

i |f i −ei |with c an upper bound on max i lim supn ci , known to be inferior to one. Thisresult can be used to upper bound the second difference term, which writes

|f i −ei | ≤ 1

(1 −c)2 T 2 supi |f i −ei |+ T |f i f i −ei ei |

≤ 1

(1 −c)2 T 2 supi |f i −ei |+ T

T 1 −c

supi |f i −ei |+

R

|z| sup

i |f i −ei |=

T 2(2 −c)(1 −c)3 sup

i |f i −ei |+ RT

|z|(1 −c)2 supi |f i −ei |.

Henceα ≤

8T 8(2 −c)4

(1 −c)12 α + 8R4T 4

|z|4(1 −c)8 α. (6.57)

Page 196: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 196/562

172 6. Deterministic equivalents

For a suitable z, satisfying |z| > 2RT (1−c)2 , we have 8R 4 T 4

|z |4 (1−c) 8 < 1/ 2 and, thus,moving all terms proportional to α on the left

α < 16T 8

(2 −c)4

(1 −c)12 α.

Plugging this result into ( 6.56) yields

α ≤ 128K 4R8T 8(2 −c)4

|z|8(1 −c)12 α + 8C N 2

.

Take 0 < ε < 1. It is easy to check that, for |z| > 128 1 / 8 RT √ K (2−c)

(1−c)3 / 2 (1−ε ) 1 / 8 ,128 K 4 R 8 T 8 (2−c) 4

|z |8 (1−c) 12 < 1 −ε and thus

α < 8C εN 2

. (6.58)

Since C does not depend on N , α is clearly summable which, along with theMarkov inequality and the Borel–Cantelli lemma, concludes the proof.

Finally, taking the same steps as above, we also have

E |mB N (z) −mN (z)|4 ≤ 8C εN 2

for some |z| large enough. The same conclusion therefore holds: for thesez, mB N (z)

−mN (z) a.s.

−→ 0. From Vitali convergence theorem, since f i and ei

are uniformly bounded on all closed sets of C \ R + , we nally have that theconvergence is true for all z ∈C \ R + . The almost sure convergence of theStieltjes transform implies the almost sure weak convergence of F B N −F N tozero, which is our nal result.

As a (not immediate) corollary of the proof above, we have the following result,important for application purposes, see Section 12.2.

Theorem 6.19. Under the assumptions of Theorem 6.17 with T i diagonal for all i, denoting w ij the j th column of W i , tij the j th diagonal entry of T i , and

z ∈C \ R +

w Hij H H

i B N −t ij H i w ij w Hij H H

i −zI N −1H i w ij −

ei (z)ci −ei (z)ei (z)

a .s.

−→ 0. (6.59)

where ei (z) and ei (z) are dened in Theorem 6.17.

Similar to the i.i.d. case, a deterministic equivalent for the Shannon transformcan be derived. This is given by the following proposition.

Theorem 6.20. Under the assumptions of Theorem 6.17 with z =

−1/x , for

x > 0, denoting

VB N (x) = 1N

logdet( xB N + I N )

Page 197: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 197/562

6.2. Techniques for deterministic equivalents 173

the Shannon transform of B N , we have:

VB N (x) −VN (x) a .s.

−→ 0

where

VN (x) = 1N

log det I N + xK

i=1

ei R i +K

i =1

1N

log det ([ci −ei ei ]I n i + ei T i )

+K

i =1

[(1 −ci )log( ci −ei ei ) −ci log(ci )] . (6.60)

The proof for the deterministic equivalent of the Shannon transform followsfrom similar considerations as for the i.i.d. case, see Theorem 6.4 and Corollary6.1, and is detailed below.

Proof. For the proof of Theorem 6.20, we again take ci = 1, R i deterministic of bounded spectral norm for simplicity and we assume ci ≤ 1 from the beginning,the trace lemma, Theorem 6.15, being unused here. First note that the system of Equations ( 6.39) is unchanged if we extend the T i matrices into N ×N diagonalmatrices lled with N −n i zero eigenvalues. Therefore, we can assume that allT i have size N ×N , although we restrict the F T i to have a mass 1 −ci in zero.Since this does not alter the Equations ( 6.39), we have in particular ¯ei < 1/e i .

This being said, ( 6.60) now needs to be rewritten

VN (x) = 1N

log det I N + xK

i =1

ei R i +K

i =1

1N

log det ([1 −ei ei ]I N + ei T i ) .

Calling V the function

V : (x1 , . . . , x K , x1 , . . . , xK , x) → 1N

log det I N + xK

i =1

x i R i

+K

i=1

1

N log det ([1

−x

ix

i]I

N + x

iT

i)

we have:

∂V ∂x i

(e1 , . . . , e K , e1 , . . . , eK , x) = ei −ei1N

N

l=1

11 −ei ei + ei t il

∂V ∂ x i

(e1 , . . . , e K , e1 , . . . , eK , x) = ei −ei1N

N

l=1

11 −ei ei + ei t il

.

Noticing now that

1 = 1N

N

l=1

1 −ei ei + ei t il

1 −ei ei + ei t il= (1 −ei ei )

1N

N

l=1

11 −ei ei + ei t il

+ ei ei

Page 198: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 198/562

174 6. Deterministic equivalents

we have:

(1

−ei ei ) 1

− 1

N

N

l=1

1

1 −ei ei + ei t il

= 0 .

But we also know that 0 ≤ ei < 1/e i and therefore 1 −ei ei > 0. This entails

1N

N

l=1

11 −ei ei + ei t il

= 1 . (6.61)

From ( 6.61), we conclude that

∂V ∂x i

(e1 , . . . , e K , e1 , . . . , eK , x) = 0

∂V ∂ x i

(e1 , . . . , e K , e1 , . . . , eK , x) = 0 .

We therefore have that

ddx

VN (x) =K

i =1

∂V ∂e i

∂e i

∂x +

∂V ∂ ei

∂ ei

∂x+

∂V ∂x

= ∂V ∂x

=

K

i =1ei

1N tr R i I N + x

K

j =1ej R j

−1

= 1x −

1x2

1N

tr1x

I N +K

j =1

ej R j

−1

.

Therefore, along with the fact that VN (0) = 0, we have:

VN (x) = x

0

1t −

1t2 mN −

1t

dt

and therefore VN (x) is the Shannon transform of F N , according to Denition3.2.

In order to prove the almost sure convergence VB N (x) −VN (x) a .s.

−→ 0, we needsimply to notice that the support of the eigenvalues of B N is bounded. Indeed, thenon-zero eigenvalues of W i W

Hi have unit modulus and therefore B N ≤ KT R .

Similarly, the support of F N is the support of the eigenvalues of K i =1 ei R i ,

which are bounded by KT R as well.As a consequence, for B 1 , B 2 , . . . a realization for which F B N −F N ⇒ 0, we

have, from the dominated convergence theorem, Theorem 6.3

∞0

log (1 + xt ) d[F B N −F N ](t) → 0.

Hence the almost sure convergence.

Page 199: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 199/562

6.3. A central limit theorem 175

Applications of the above results are found in various telecommunicationsystems employing random isometric precoders, such as random CDMA, SDMA[Couillet et al., 2011b]. A specic application to assess the optimal numberof stream transmissions in multi-antenna interference channels is in particularprovided in [Hoydis et al., 2011a], where an extension of Theorem 6.17 tocorrelated i.i.d. channel matrices H i is provided. It is worth mentioning thatthe approach followed in [Hoydis et al., 2011a] to prove this extension relieson an “inclusion” of the deterministic equivalent of Theorem 6.12 into thedeterministic equivalent of Theorem 6.17. The nal result takes a surprisinglysimple expression and the proof of existence, uniqueness, and convergence of the implicit equations obtained do not require much effort. This “deterministicequivalent of a deterministic equivalent” approach is very natural and is expected

to lead to very simple results even for intricate communication models; recall e.g.Theorem 6.9.We conclude this chapter on deterministic equivalents by a central limit

theorem for the Shannon transform of the non-centered random matrix withvariance prole of Theorem 6.14.

6.3 A central limit theorem

Central limit theorems are also demanded for more general models than thesample covariance matrix of Theorem 3.17. In wireless communications, itis particularly interesting to study the limiting distribution of the Shannontransform of doubly correlated random matrices, e.g. to mimic Kronecker models,or even more generally matrices of i.i.d. entries with a variance prole. Indeed,the later allows us to study, in addition to the large dimensional ergodic capacityof Rician MIMO channels, as provided by Theorem 6.14, the large dimensionaloutage mutual information of such channels. In [Hachem et al., 2008b], Hachemet al. provide the central limit theorem for the Shannon transform of this model.

Theorem 6.21 ([Hachem et al., 2008b]). Let Y N be N ×n whose (i, j )th entry is given by:

Y N,ij = σij (n)

√ n X N,ij

with σij (n)ij uniformly bounded with respect to n, and X N,ij is the (i, j )thentry of an N ×n matrix X N with i.i.d. entries of zero mean, unit variance, and nite eighth order moment. Denote B N = Y N Y

HN . We then have, as N, n → ∞

with limit ratio c = lim N N/n , that the Shannon transform

VB N (x) 1N

log det( I N + xB N )

Page 200: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 200/562

176 6. Deterministic equivalents

of B N satises

N θn

(VB N (x) −E[VB N (x)]) ⇒ X ∼N (0, 1)

with

θ2n = −log det( I n −J n ) + κ tr( J n )

κ = E[( X N, 11 )4]−3E[(X N, 11 )3] for real X N, 11 , κ = E[ |X N, 11 |4]−2E[|X N, 11 |2] for complex X N, 11 , and J n the matrix with (i, j )th entry

J n,ij = 1n

1n

N k=1 σ2

ki (n)σ2kj (n)tk (−1/x )2

1 + 1n

N k=1 σ2

ki (n)tk (−1/x )2

with ti (z) such that (t1(z), . . . , t N (z)) is the unique Stieltjes transform vector solution of

t i (z) = −z + 1n

n

j =1

σ2ij (n)

1 + 1n

N l=1 σ2

lj (n)t l (z)

−1

.

Observe that the matrix J n is in fact the Jacobian matrix associated with thefundamental equations in the eN,i (z), dened in the implicit relations ( 6.29) of Theorem 6.10 as

eN,i (z) = 1n

N

k=1

σ2ki (n)tk (z) =

1n

N

k=1

σ2ki (n)

1

−z + 1n

nl=1

σ 2kl (n )

1+ eN,l (z )

.

Indeed, for all eN,k (−1/x ) xed but eN,j (−1/x ), we have:

∂ ∂eN,j (−1/x )

1n

N

k=1

σ2ki (n)

11x + 1

nnl=1

σ 2kl (n )

1+ eN,l (−1/x )

= 1

n

N

k=1

σ2ki (n)

1n σ2

kj (n)

(1 + eN,j (−1/x ))2

1

1x + 1n nl=1σ 2

kl(n )

1+ eN,l (−1/x )

2

= 1n

N

k=1

1n σ2

ki (n)σ2kj (n)tk (−1/x )2

(1 + eN,j (−1/x ))2

= J n,ji .

So far, this observation seems to generalize to all central limits derivedfor random matrix models with independent entries. This is however only anintriguing but yet unproven fact.

Similar to Theorem 3.17, [Hachem et al., 2008b] provides more than anasymptotic central limit theorem for the Shannon transform of the informationplus noise model VB N −E[VB N ], but also the uctuations for N large of thedifference between VB N and its deterministic equivalent VN , provided in [Hachem

Page 201: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 201/562

6.3. A central limit theorem 177

et al. , 2007]. In the case where X N has Gaussian entries, this takes a very compactexpression.

Theorem 6.22. Under the conditions of Theorem 6.21 with the additional assumption that the entries of X N are complex Gaussian, we have:

N

−logdet( I n −J n )(VB N (x) −VN (x)) ⇒ X ∼N (0, 1)

where VN is dened as

VN (x) = 1N

N

i =1log

xt i (−1/x )

+ 1N

n

j =1log 1 +

1n

N

l=1

σ2lj (n)t l (−1/x )

− 1Nn1≤i≤N 1≤j ≤n

σ2

ij(n)t

i(

−1/x )

1 + 1n

N l=1 σ2

lj (n)t l (−1/x )

with t1 , . . . , t N and J n dened as in Theorem 6.21.

The generalization to distributions of the entries of X N with a non-zerokurtosis κ introduces an additional bias term corresponding to the limitingvariations of N (E[VB N (x)] −VN (x)). This converges instead to zero in theGaussian case or, as a matter of fact, in the case of any distribution with nullkurtosis.

This concludes this short section on central limit theorems for deterministicequivalents.

This also closes this chapter on the classical techniques used for deterministicequivalents, when there exists no limit to the e.s.d. of the random matrix understudy. Those deterministic equivalents are seen today as one of the most powerfultools to evaluate the performance of large wireless communication systemsencompassing multiple antennas, multiple users, multiple cells, random codes,fast fading channels, etc. which are studied with scrutiny in Part II. In orderto study complicated system models involving e.g. doubly-scattering channels,

multi-hop channels, random precoders in random channels, etc., the current trendis to study nested deterministic equivalents; that is, deterministic equivalentsthat account for the stochasticity of multiple independent random matrices, seee.g. Hoydis et al. [2011a,b].

In the following, we turn to a rather different subject and study more deeplythe limiting spectra of the sample covariance matrix model and of the informationplus noise model. For these, much more than limiting spectral densities isknown. It has especially been proved that, under some mild conditions, theextreme eigenvalues for both models do not escape the support of the l.s.d.and that a precise characterization of the position of some eigenvalues can bedetermined. Some additional study will characterize precisely the links betweenthe population covariance matrix (or the information matrix) and the samplecovariance matrix (or the information plus noise matrix), which are fundamental

Page 202: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 202/562

178 6. Deterministic equivalents

to address the questions of inverse problems and more precisely statistical eigen-inference for large dimensional random matrix models. These questions are atthe core of the very recent signal processing tools, which enable novel signalsensing techniques and ( N, n )-consistent estimation procedures adapted to largedimensional networks.

Page 203: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 203/562

7 Spectrum analysis

In this chapter, we further study the spectra of the important random matrixmodels for wireless communications that are the sample covariance matrix andthe information plus noise models. It has already been shown in Chapter 3that, as the e.s.d. of the population covariance matrix (or of the informationmatrix) converges, the e.s.d. of the sample covariance matrix (or the informationplus noise matrix) converges almost surely. The limiting d.f. can then be fullycharacterized as a function of the l.s.d. of the population covariance matrix (or of the information matrix). It is however not convenient to invert the problem andto describe the l.s.d. of the population covariance matrix (or of the informationmatrix) as a function of the l.s.d. of the observed matrices. The answer to thisinverse problem is provided in Chapter 8, which however requires some effortto be fully accessible. The development of the tools necessary for the statistical

eigen-inference methods of Chapter 8 is one of the motivations of the currentchapter.

The starting motivation, initiated by the work of Silverstein and Choi[Silverstein and Choi, 1995] , which resulted in the important Theorem 7.4(accompanied later by an important corollary, due to Mestre [Mestre, 2008a],Theorem 7.5), was to characterize the l.s.d. of the sample covariance matrix inclosed-form. Remember that, up to this point, we can only characterize the l.s.d.F of a sample covariance matrix through the expression of its Stieltjes transform,as the unique solution mF (z) of some xed-point equation for all z ∈C \ R + .To obtain an explicit expression of F , it therefore suffices to use the inverseStieltjes transform formula ( 3.2). However, this suggests having a closer look atthe limiting behavior of mF (z) as z approaches the positive real half-line, aboutwhich we do not know much yet. Therefore, up to this point in our analysis, itis impossible to describe the support of the l.s.d., apart from rough estimationsbased on the expression of [mF (z)], for z = x + iy, y being small. It is alsonot convenient to depict F (x): the solution is to take z = x + iy, with y smalland x spanning from zero to innity, and to draw the curve z → 1

π [mF (z)] forsuch z. In the following, we will show that, as z tends to x > 0, mF (z) has alimit which can be characterized in two different ways, depending on whether x

belongs to the support of F or not. In any case, this limit is also characterized asthe solution to an implicit equation, although particular care must be taken as towhich of the multiple solutions of this implicit equation needs to be considered.

Page 204: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 204/562

180 7. Spectrum analysis

Before we detail this advanced spectrum characterization, we provide adifferent set of results, fundamental to the validation of the eigen-inferencemethods proposed in Chapter 8. These results, namely the asymptotic absenceof eigenvalues outside the support of F , Theorem 7.1, and the exact separationof the support into disjoints clusters, Theorem 7.2, are once more due to Baiand Silverstein [Bai and Silverstein, 1998, 1999] . Their object is the analysis, ontop of the characterization of F , of the behavior of the particular eigenvaluesof the e.s.d. of the sample covariance matrix as the dimensions grow large.It is fundamental to understand here, and this will be reminded again in thenext section, that the convergence of the e.s.d. toward F , as the matrix sizeN grows large, does not imply the convergence of the largest eigenvalue of the sample covariance matrix towards the right edge of the support. Indeed,

the largest eigenvalues, having weight 1 /N in the spectrum, do not contributeasymptotically to the support of F . As such, it may well be found outside thesupport of F for all nite N , without invalidating Theorem 3.13. This particularcase in the Marcenko–Pastur model where eigenvalues are found outside thesupport almost surely when the entries of the random i.i.d. matrix X N inTheorem 3.13, T N = I N , have innite fourth order moment. In this scenario,it is even proved in [Silverstein et al., 1988] that the largest eigenvalue growswithout bound as the system dimensions grow to innity, while all the massof the l.s.d. is asymptotically kept in the support; if the fourth order momentis nite. Under nite fourth moment assumption though [Bai and Silverstein,

1998; Yin et al., 1988], the important result to be detailed below is that noeigenvalue is to be found outside the limiting support and that the eigenvaluesare found where they ought to be. This last statement is in fact slightly erroneousand will be adequately corrected when discussing the spiked models that leadsome eigenvalues to leave the limiting support. To be more precise, when themoment of order four of the entries of X N exists, we can characterize exactlythe subsets of R + where no eigenvalue is asymptotically found, almost surely.Further discussions on the extreme eigenvalues of sample covariance matrices areprovided in Chapter 9, where (non-central) limiting theorems for the distribution

of these eigenvalues are provided.

7.1 Sample covariance matrix

7.1.1 No eigenvalues outside the support

As observed in the previous sections, most early results of random matrix theorydealt with the limiting behavior of e.s.d. For instance, the Marcenko–Pasturlaw ensures that the e.s.d. of the sample covariance matrix R N of vectors withi.i.d. entries of zero mean and unit variance converges almost surely towardsa limit distribution function F . However, the Marcenko–Pastur law does notsay anything about the behavior of any specic eigenvalue, say for instance the

Page 205: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 205/562

7.1. Sample covariance matrix 181

extreme lowest and largest eigenvalues λmin and λmax of R N . It is relevant inparticular to wonder whether λmin and λmax can be asymptotically found outside the support of F . Indeed, if all eigenvalues but the extreme two are in the supportof F , then the l.s.d. of R N is still F , which is still consistent with the Marcenko–Pastur law. It turns out that this is not the case in general. Under some mildassumption on the entries of the sample covariance matrix, no eigenvalue is foundoutside the support. We specically have the following theorem.

Theorem 7.1 ([Bai and Silverstein , 1998; Yin et al., 1988]). Let the matrix X N = 1√ n X N,ij ∈C N ×n have i.i.d. entries, such that X N, 11 has zero mean,unit variance, and nite fourth order moment. Let T N ∈C N ×N be non-random,with uniformly bounded spectral norm T N , whose e.s.d. F T N converge weakly

to H . From Theorem 3.13 , the e.s.d. of B N = T12N X N X HN T

12N ∈C N ×N converges

weakly and almost surely towards some distribution function F , as N , n goto innity with ratio cN = N/n → c, 0 < c < ∞. Similarly, the e.s.d. of B N =X H

N T N X N ∈C n ×n converges towards F given by:

F (x) = cF (x) + (1 −c)1[0,∞) (x).

Denote F N the distribution with Stieltjes transform mF N (z), which is solution, for z ∈C + , of the following equation in m

m = − z − N n

τ 1 + τm dF

T N

(τ )−1

(7.1)

and dene F N the d.f. such that

F N (x) = N n

F N (x) + 1 − N n

1[0,∞) (x).

Let N 0 ∈N , and choose an interval [a, b], a, b ∈ (0, ∞], lying in an open interval outside the union of the supports of F and F N for all N ≥ N 0 . For ω ∈ Ω,the random space generating the series X 1 , X 2 , . . . , denote L N (ω) the set of eigenvalues of B N (ω). Then

P (ω, L N (ω) ∩[a, b] = ∅ i.o.) = 0 .

This means concretely that, given a segment [ a, b] outside the union of thesupports of F and F N 0 , F N 0 +1 , . . . , for all series B 1(ω), B 2(ω), . . . , with ω insome set of probability one, there exists M (ω) such that, for all N ≥ M (ω),there will be no eigenvalue of B N (ω) in [a, b]. By denition, F K is the l.s.d.of an hypothetical B N with H = F T K . The necessity to consider the supportsof F N 0 , F N 0 +1 , . . . is essential when a few eigenvalues of T N are isolated andeventually contribute with probability zero to the l.s.d. H . Indeed, it is ratherintuitive that, if the largest eigenvalue of T N is large compared to the rest,at least one eigenvalue of B N will also be large compared to the rest (taken N to be convinced). Theorem 7.1 states exactly here that there will be

Page 206: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 206/562

182 7. Spectrum analysis

neither any eigenvalue outside the support of the main mass of F B N , nor anyeigenvalue around the largest one. Those models in which some eigenvalues of T N

are isolated are referred to as spiked models . These are thoroughly discussed inChapter 9. In wireless communications and modern signal processing, Theorem7.1 is of key importance for signal sensing and hypothesis testing methods since itallows us to verify whether the eigenvalues empirically found in sample covariancematrix spectra originate either from noise contributions or from signal sources.In the simple case where signals sensed at an antenna array originate either fromwhite noise or from a coherent signal source impaired by white noise, this can beperformed by simply verifying if the extreme eigenvalue of the sample covariancematrix is inside or outside the support of the Marcenko–Pastur law (Figure 1.1);see further Chapter 16.

We give hereafter a sketch of the proof, which again only involves the Stieltjestransform.

Proof. Surprisingly, the proof unfolds from a mere (though non-trivial)renement of the Stieltjes transform relation proved in Theorem 3.13. Let F N

be dened as above and let mN be its Stieltjes transform. It is possible to showthat, for z = x + ivN , with vN = N −1/ 68

supx∈[a,b ]

|mB N (z) −mN (z)| = o 1N

vN

almost surely. This result is in fact also true when [z] equals √ 2vN , √ 3vN , . . .or √ 34vN . Note that this renes the known statement that the difference is of order o(1). We take this property, which requires more than ten pages of calculus,for granted. We now have that

max1≤k≤34

supx∈[a,b ]

mB N (x + ik12 vN ) −mN (x + ik

12 vN ) = o(v67

N )

almost surely. Expanding the Stieltjes transforms and considering only theimaginary parts, we obtain

max1≤k≤34

supx∈[a,b ] d(F B N (λ) −F N (λ))

(x −λ)2 + kv2N

= o(v66N )

almost surely. Taking successive differences over the 34 values of k, we end upwith

supx

[a,b ]

(v2

N )33 d(F B N (λ) −F N (λ))34k=1 ((x

−λ)2 + kv2

N )= o(v66

N ) (7.2)

almost surely, from which the term v66N simplies on both sides. Consider now

a < a and b > b such that [ a , b ] is outside the support of F . We then divide

Page 207: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 207/562

7.1. Sample covariance matrix 183

(7.2) into two terms, as (remember that 1 /N = v68N )

supx∈[a,b ] 1R + \[a ,b ](λ)d(F B N (λ)

−F N (λ))

34k=1 ((x −λ)2 + kv2N ) + λ j∈[a ,b ]

v68N

34k =1 ((x −λ j )2 + kv2N )

= o(1)

almost surely. Assume now that, for a subsequence φ(1), φ(2) , . . . of 1, 2, . . . , therealways exists at least one eigenvalue of B φ (N ) in [a, b]. Then, for x taken equalto this eigenvalue, one term of the discrete sum above (whose summands areall non-negative) is exactly 1 / 34!, which is uniformly bounded away from zero.This implies that the integral must also be bounded away from zero. Howeverthe integrand of the integral is clearly uniformly bounded on [ a , b ] and, from

Theorem 3.13, F B

N −F ⇒ 0. Therefore the integral tends to zero as N → ∞.This is a contradiction. Therefore, the probability that there is an eigenvalueof B N in [a, b] innitely often is null. Now, from [Yin et al., 1988], the largesteigenvalue of 1

n X N XHN is almost surely asymptotically bounded. Therefore, since

T N is also bounded by hypothesis, the theorem applies also to b = ∞.

Note that the niteness of the fourth order moment of the entries X N,ij isfundamental for the validity of Theorem 7.1. It is indeed proved in [Yin et al.,1988] and [Silverstein et al., 1988] that:

• if the entries X N,ij have nite fourth order moment, with probability one, thelargest eigenvalue of X N XHN tends to the edge (1 + √ c)2 , c = lim N N/n of

the support of the Marcenko–Pastur law, which is an immediate corollary of Theorem 7.1 with T N = I N ;

• if the entries X N,ij do not have a nite fourth order moment then, withprobability one, the limit superior of the largest eigenvalue of X N X

HN is

innite, i.e. with probability one, for all A > 0, there exists N such that thelargest eigenvalue of X N X

HN is larger than A. It is therefore important never

to forget the underlying assumption made on the tails of the distribution of the entries in X N .

We now move to an extension of Theorem 7.1.

7.1.2 Exact spectrum separation

Now assume that the e.s.d. of T N converges to the distribution function of,say, three evenly weighted masses in λ1 , λ2 , and λ3 . For not-too-large ratioscN = N/n , it is observed that the support of F is divided into up to threeclusters of eigenvalues. In particular, when n becomes large while N is keptxed, the clusters consist of three punctual masses in λ1 , λ2 , and λ3 , as requiredby classical probability theory. This is illustrated in Figure 7.1 in the case of athree-fold clustered and a two-fold clustered support of F . The reason why weobserve sometimes three and sometimes less clusters is linked to the spreading

Page 208: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 208/562

184 7. Spectrum analysis

of each cluster due to the limiting ratio c; the smaller c, the thinner the clusters,as already observed in the simple case of the Marcenko–Pastur law, Figure 2.2.Considering Theorem 7.1, it is tempting to assume that, in addition to eachcluster of F being composed of one third of the total spectrum mass, each clusterof B N contains exactly one third of the eigenvalues of B N . However, Theorem7.1 only ensures that no eigenvalue is found outside the support of F for allN larger than a given M , and does not say how the eigenvalues of B N aredistributed in the various clusters. The answer to this question is provided in[Bai and Silverstein , 1999] in which the exact separation properties of the l.s.d.of such matrices B N is discussed.

Theorem 7.2 ([Bai and Silverstein, 1999] ). Assume the hypothesis of Theorem

7.1 with T N non-negative denite. Consider similarly 0 < a < b < ∞ such that [a, b] lies in an open interval outside the support of F and F N for all large N .Denote additionally λk and τ k the kth eigenvalues of B N and T N in decreasing order, respectively. Then we have:

1. If c(1 −H (0)) > 1, then the smallest value x0 in the support of F is positive and λN → x0 almost surely, as N → ∞.

2. If c(1 −H (0)) ≤ 1, or c(1 −H (0)) > 1 but [a, b] is not contained in [0, x0],then 1

P (λ i N > b, λ i N +1 < a for all large N ) = 1

where iN is the unique integer such that

τ i N > −1/m F (b),τ i N +1 < −1/m F (a).

Theorem 7.2 ensures in particular the exact separation of the spectrum whenτ 1 , . . . , τ N take values in a nite set. Consider for instance the rst plot inFigure 7.1 and an interval [ a, b] comprised between the second and third clusters.

What Theorem 7.2 claims is that, if iN and iN + 1 are the indexes of the rightand left eigenvalues when F B N jumps from one cluster to the next, and N islarge enough, then there is an associated jump from the corresponding iN th and(iN + 1)th eigenvalues of T N (for instance, at the position of the discontinuityfrom eigenvalue 7 to eigenvalue 3).

This bears some importance for signal detection. Indeed, consider the problemof the transmission of information plus noise. Given the dimension p of the signal

1 The expression “ P (AN for all large N ) = 1” is used in place of “there exists B ⊂ Ω, withP (B ) = 1, such that, for ω

∈ B , there exists N 0 (ω) for which N > N 0 (ω) implies ω

∈ A N .” It

is particularly important to note that “for all large N ” is somewhat misleading as it does not indicate the existence of a universal N 0 such that N > N 0 implies ω ∈ A N for all ω ∈ B , butrather the existence of an N 0 (ω) for each such ω. Here, A N = ω, λ i N (ω) > b, λ i N +1 (ω) < a and the space Ω is the generator of the series B 1 (ω), B 2 (ω), . . . .

Page 209: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 209/562

7.1. Sample covariance matrix 185

1 3 70

0.2

0.4

0.6

Eigenvalues

D e n s i t y

Empirical eigenvalue distributionLimit law (from Theorem 3.13 )

1 3 40

0.2

0.4

0.6

Eigenvalues

D e n s i t y

Empirical eigenvalue distributionLimit law (from Theorem 3.13 )

Figure 7.1 Histogram of the eigenvalues of B N = T12N X N X H

N T12N , N = 300, n = 3000,

with T N diagonal composed of three evenly weighted masses in (i) 1, 3, and 7 on top,(ii) 1, 3, and 4 on the bottom.

space and n − p of the noise space, for large c, Theorem 7.2 allows us to isolatethe eigenvalues corresponding to the signal space from those corresponding tothe noise space. If both eigenvalue spaces are isolated in two distinct clusters,then we can exactly determine the dimension of each space and infer, e.g. thenumber of transmitting entities. The next question that then naturally arises isto determine for which values of c = lim N n/N the support of F separates into1, 2, or more clusters.

Page 210: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 210/562

186 7. Spectrum analysis

7.1.3 Asymptotic spectrum analysis

For better understanding in the following, we will take the convention that the

(hypothetical) single mass at zero in the spectrum of F is not considered asa ‘cluster’. We will number the successive clusters from left to right, from oneto K F with K F the number of clusters in F , and we will denote kF the clustergenerated by the population eigenvalue tk , to be introduced shortly. For instance,if two sample eigenvalues ti and ti +1 = t i generate a unique cluster in F (as in thebottom graph in Figure 7.1, where t2 = 3 and t3 = 4 generate the same cluster),then iF = ( i + 1) F ). The results to come will provide a unique way to denekF mathematically and not only visually. To this end, we need to study in moredepth the properties of the limiting spectrum F of the sample covariance matrix.

Remember rst that, for the model B N = X HN T N X N

∈C n ×n of l.s.d. F , where

X N ∈C N ×n has i.i.d. entries of zero mean and variance 1 /n , T N has l.s.d. H and N/n → c, mF (z), z ∈C + , Equation ( 3.22) has an inverse formula, given by:

zF (m) = − 1m

+ c t1 + tm

dH (t) (7.3)

for m ∈C + . The equation zF (m) = z ∈C + has a unique solution m with positiveimaginary part and this solution equals mF (z) by Theorem 3.13. Of course, B N

and B N only differ from |N −n| zero eigenvalues, so it is equivalent to study thel.s.d. of B N or that of B N . The link between their respective Stieltjes transforms

is given by:mF (z) = cmF (z) + ( c −1)

1z

from (3.16). Since F turns out to be simpler to study, we will focus on B N instead of the sample covariance matrix B N itself.

Now, according to the Stieltjes inversion formula ( 3.2), for every continuitypoints a, b of F

F (b) −F (a) = limy

→0+

b

a[mF (x + iy)]dx.

To determine the distribution F , and therefore the distribution F , we mustdetermine the limit of mF (z) as z ∈C + tends to x ∈R∗. It can in fact be shownthat this limit exists.

Theorem 7.3 ([Silverstein and Choi, 1995 ]). Let B N ∈C n ×n be dened as previously, with almost sure l.s.d. F . Then, for x ∈R∗

limz→xz∈

C +

mF (z) m(x) (7.4)

exists and the function m is continuous on R∗. For x in the support of F , the density f (x) F (x) equals 1

π [m(x)]. Moreover, f is analytic for all x ∈R∗

such that f (x) > 0.

Page 211: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 211/562

7.1. Sample covariance matrix 187

The study of m makes it therefore possible to describe the complete supportS F of F as well as the limiting density f . Since S F equals S F but for an additionalmass in zero, this is equivalent to determining the support of S F . Choi andSilverstein provided an accurate description of the function m, as follows.

Theorem 7.4 ([Silverstein and Choi, 1995] ). Let B = m | m = 0 , −1/m ∈ S cH ,with S cH the complementary of S H , and xF be the function dened on B by

xF (m) = − 1m

+ c t1 + tm

dH (t). (7.5)

For x ∈R∗, we can determine the limit m(x) of mF (z) as z → x, z ∈C + , along the following rules:

1. If x ∈ S F , then m(x) is the unique solution in B with positive imaginary part of the equation x = xF (m) in the dummy variable m.

2. If x ∈ S cF , then m(x) is the unique real solution in B of the equation x =xF (m) in the dummy variable m such that xF (m0) > 0. Conversely, for m ∈B , if xF (m) > 0, then xF (m) ∈ S cF .

From rule 1, along with Theorem 7.3, we can evaluate for every x > 0 thelimiting density f (x), hence F (x), by nding the complex solution with positiveimaginary part of x = xF (m).

Rule 2 makes it simple to determine analytically the exact support of F . Itindeed suffices to draw xF (m) for −1/m ∈ S cH . Whenever xF is increasing onan interval I , xF (I ) is outside S F . The support S F of F , and therefore of F (modulo the mass in zero), is then dened exactly by the complementary set

S F = R \a,b∈

R

a<b

xF ((a, b)) | ∀m ∈ (a, b), xF (m) > 0 .

This is depicted in Figure 7.2 in the case when H is composed of three evenlyweighted masses t1 , t 2 , t 3 in 1, 3, 5 or 1, 3, 10 and c = 1/ 10. Notice that, inthe case where t3 = 10, F is divided into three clusters, while, when t3 = 5, F isdivided into only two clusters, which is due to the fact that xF is non-increasingin the interval ( −1/ 3, −1/ 5). For applicative purposes, we will see in Chapter 17that it might be essential that the consecutive clusters be disjoint. This is onereason why Theorem 7.6 is so important.

We do not provide a rigorous proof of Theorem 7.4. In fact, while thoroughlyproved in 1995, this result was already intuited by Marcenko and Pastur in1967 [Marcenko and Pastur, 1967] . The fact that xF (m) increases outside thespectrum of F and is not increasing elsewhere is indeed very intuitive, and is notactually limited to the sample covariance matrix case. Observe indeed that, forany F , and any x0 ∈R∗ outside the support of F , mF (x0) is clearly well dened

Page 212: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 212/562

188 7. Spectrum analysis

−1 −13 − 1

10zero

1

3

10

m −1m +

1

m 1

m

x F

( m )

x F (m )Support of F

−1 −13 −1

5zero

1

3

5

m

x F

( m )

x F (m )Support of F

Figure 7.2 xF (m) for m real, T N diagonal composed of three evenly weighted massesin 1, 3, and 10 (top) and 1, 3, and 5 (bottom), c = 1 / 10 in both cases. Local extremaare marked in circles, inexion points are marked in squares. The support of F can beread on the right vertical axises.

and

mF (x0) = 1(λ −x0)2 dF (λ) > 0.

Therefore mF (x) is continuous and increasing on an open neighborhood of x0 .This implies that it is locally a one-to-one mapping on this neighborhood andtherefore admits an inverse xF (m), which is also continuous and increasing. Thisexplains why xF (m) increases when its image is outside the spectrum of F .Now, if for some real m0 , xF (m0) is continuous and increasing, then it is locallyinvertible and its inverse ought to be mF (x), continuous and increasing, in which

Page 213: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 213/562

7.1. Sample covariance matrix 189

case x is outside the spectrum of F . Obviously, this reasoning is far from beinga proof (at least the converse requires much more work).

From Figure 7.2 and Theorem 7.4, we now observe that, when the e.s.d. of population matrix is composed of a few masses, xF (m) = 0 has exactly 2 K F

solutions with K F the number of clusters in F . Denote these roots in increasingorder m−1 < m +

1 ≤ m−2 < m +2 < . . . ≤ m−K F

< m +K F

. Each pair ( m−j , m+j ) is such

that xF ([m−j , m+j ]) is the j th cluster in F . We therefore have a way to determine

the support of the asymptotic spectrum through the function xF . This ispresented in the following result.

Theorem 7.5 ([Couillet et al., 2011c; Mestre , 2008a]). Let B N ∈C N ×N be dened as in Theorem 7.1. Then the support S F of the l.s.d. F of B N is

S F =K F

j =1

[x−j , x+j ]

where x−1 , x+1 , . . . , x −K F

, x+K F

are dened by

x−j = − 1m−j

+K

r =1cr

t r

1 + t r m−j

x+j = −

1m−j

+K

r =1cr

t r

1 + t r m+j

with m−1 < m +1 ≤ m−2 < m +

2 ≤ . . . ≤ m−K F < m +

K F the 2K F (possibly counted

with multiplicity) real roots of the equation in mK

r =1cr

t2r m2

(1 + t r m2)2 = 1 .

Note further from Figure 7.2 that, while xF (m) might not have roots on someintervals ( −1/t k−1 , −1/t k ), it always has a unique inexion point there. This isproved in [Couillet et al., 2011c] by observing that xF (m) = 0 is equivalent to

K

r =1cr

t3r m3

(1 + t r m)3 −1 = 0

the left-hand side of which has always positive derivative and shows asymptotesin the neighborhood of tr ; hence the existence of a unique inexion point onevery interval ( −1/t k−1 , −1/t k ), for 1 ≤ k ≤ K , with convention t0 = 0+ . WhenxF increases on an interval ( −1/t k−1 , −1/t k ), it must have its inexion pointin a point of positive derivative (from the concavity change induced by theasymptotes). Therefore, to verify that cluster kF is disjoint from clusters ( k

−1)F

and ( k + 1) F (when they exist), it suffices to verify that the ( k −1)th and kthroots mk−1 and mk of xF (m) are such that xF (mk−1) > 0 and xF (mk ) > 0. Fromthis observation, we therefore have the following result.

Page 214: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 214/562

190 7. Spectrum analysis

Theorem 7.6 ([Couillet et al., 2011c; Mestre, 2008b ]). Let B N be dened as in Theorem 7.1, with T N = diag( τ 1 , . . . , τ N ) ∈R N ×N , diagonal containing K distinct eigenvalues 0 < t 1 < .. . < t K , for some xed K . Denote N k the multiplicity of the kth largest distinct eigenvalue (assuming ordering of the τ i ,we may then have τ 1 = . . . = τ N 1 = t1 , . . . , τ N −N K +1 = . . . = τ N = tK ). Assume also that, for all 1 ≤ r ≤ K , N r /n → cr > 0, and N/n → c, with 0 < c < ∞.Then the cluster kF associated with the eigenvalue tk in the l.s.d. F of B N is distinct from the clusters (k −1)F and (k + 1) F (when they exist), associated with tk−1 and tk+1 in F , respectively, if and only if

K

r =1cr

t2r m2

k(1 + t r m2

k )2 < 1

K

r =1cr

t2r m2k +1

(1 + t r m2k+1 )2 < 1 (7.6)

where m1 , . . . , m K are such that mK +1 = 0 and m1 < m 2 < .. . < m K are the K solutions of the equation in m

K

r =1cr

t3r m3

(1 + t r m)3 = 1.

For k = 1, this condition ensures 1F = 2 F

−1. For k = K , this ensures K F =

(K −1)F + 1 . For 1 < k < K , this ensures (k −1)F + 1 = kF = ( k + 1) F −1.

Remark now that the conditions of Equation ( 7.6) are left unchanged if allt1 , . . . , t K are scaled by a common constant. Indeed, if tj becomes αt j for all j , then m1 , . . . , m K become m1 /α , . . . ,m K /α and the scaling effects cancel outin Equation ( 7.6). Therefore, in the case K = 2, the separability condition onlydepends on the ratios c1 , c2 and on t1/t 2 . If c1 = c2 = c/ 2, then we can depict theplot of the critical ratio 1 /c as a function of t1 /t 2 for which cluster separabilityhappens. This is depicted in Figure 7.3. Since 1/c is the limit of the ratio n/N ,

Figure 7.3 determines, for a xed observation size N , the limiting number of samples per observation size required to achieve cluster separability. Observehow steeply the plot of 1 /c increases when t1 gets close to t2 ; this suggeststhat the tools to be presented later that require this cluster separability willbe very inefficient when it comes to separate close sources (the denition of ‘closeness’ depending on each specic study, e.g. close directions of signal arrivalsin radar applications, close transmit powers in signal sensing, etc.). Figure 7.4depicts the regions of separability of all clusters in the case K = 3, for xedc = 0 .1, c1 = c2 = c3 , as a function of the ratios t3/t 1 and t2 /t 1 . Observe that thetriplets (1 , 3, 7) and (1 , 3, 10) are well inside the separability region as suggested,respectively, by Figure 7.1 (top) and Figure 7.2 (top); on the contrary, notice thatthe triplets (1 , 3, 4) and (1 , 3, 5) are outside the separability region, conrmingthen the observations of Figure 7.1 (bottom) and Figure 7.2 (bottom).

Page 215: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 215/562

7.1. Sample covariance matrix 191

0 0.2 0.4 0.6 0.80

20

40

60

80

100

cluster separability region

t1 /t 2

1 / c

Figure 7.3 Limiting ratio c to ensure separability of ( t1 , t 2 ), t1 ≤ t2 , K = 2, c1 = c2 .

0 1 3 5 100

1

3

5

10

t2 /t 1

t 3 / t 1

Figure 7.4 Subset of ( t1 , t 2 , t 3 ) that satisfy cluster separability condition, c1 = c2 = c3 ,c = 0 .1, in crosshatched pattern.

After establishing these primary results for the sample covariance matrixmodels, we now move to the information plus noise model. According to theprevious remark borrowed from Marcenko and Pastur in [Marcenko and Pastur,1967], we infer that it will still be the case that the Stieltjes transform mF (x),extended to the real axis, has a local inverse xF (m), which is continuousand increasing, and that the range where xF (m) increases is exactly thecomplementary to the support of F . This statement will be shown to be

Page 216: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 216/562

192 7. Spectrum analysis

somewhat correct. The main difference with the sample covariance matrixmodel is that there does not exist an explicit inverse xF (m), as in (7.5) andtherefore mF (x) may have various inverses xF (m) for different subsets in thecomplementary of the support of F .

7.2 Information plus noise model

The asymptotic absence of eigenvalues outside the support of unconstrainedinformation plus noise matrices (when the e.s.d. of the information matrixconverges), i.e. with i.i.d. noise matrix components, is still at the stage of conjecture. While promising developments are being currently carried out, there

exists to this day no proof of this fact, let alone a proof of the exact separationof information plus noise clusters. Nonetheless, in the particular case where thenoise matrix is Gaussian, the two results have been recently proved [Vallet et al.,2010]. Those results are given hereafter.

7.2.1 Exact separation

We recall that an information plus noise matrix B N is dened by

B N = 1n (A N + σX N )(A N + σX N )

H

(7.7)

where A N is deterministic, representing the deterministic signal, X N is randomand represents the noise matrix, and σ > 0.

We start by introducing the theorem which states that, for all large N , noeigenvalue is found outside the asymptotic spectrum of the information plusnoise model.

Theorem 7.7. Let B N be dened as in (7.7), with A N ∈C N ×n such that H N

F 1n A N A H

N

⇒ H and supN 1n A N A

H

N < ∞, X N ∈C N

×n

with entries X N,ijindependent for all i, j , N , Gaussian with zero mean and unit variance. Further denote cN = N/n and assume cN → c, positive and nite. From Theorem 3.15,we know that F B N converges almost surely to a limit distribution F with Stieltjes transform mF (z) solution of the equation in m

m1 + σ2cN m

= mH z(1 + σ2cN m)2 −σ2(1 −cN )(1 + σ2cN m) (7.8)

this solution being unique for z ∈C + , m ∈C + and [zm ] ≥ 0. Denote now mN (z) this solution when mH is replaced by mH N and F N the distribution function with Stieltjes transform mN (z).

Let N 0 ∈N , and choose an interval [a, b] outside the union of the supports of F and F N for all N ≥ N 0 . For ω ∈ Ω, the probability space generating the noise

Page 217: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 217/562

7.2. Information plus noise model 193

sequences X 1 , X 2 , . . . , denote L N (ω) the set of eigenvalues of B N (ω). Then

P (ω, L N (ω) ∩[a, b] = ∅ i.o.) = 0 .

The next theorem ensures that the repartition of the eigenvalues in theconsecutive clusters is exactly as expected.

Theorem 7.8 ([Vallet et al., 2010]). Let B N be as in Theorem 7.7. Let a < b be such that [a, b] lies outside the support of F . Denote λk and ak the kth eigenvalues smallest of B N and 1

n A N AHN , respectively. Then we have:

1. If c(1 −H (0)) > 1, then the smallest eigenvalue x0 of the support of F is positive and λN → x0 almost surely, as N → ∞.

2. If c(1 −H (0)) ≤ 1, or c(1 −H (0)) > 1 but [a, b] is not contained in [0, x0],then:

P (λ i N > b, λ i N +1 < a for all large N ) = 1

where iN is the unique integer such that

τ i N > −1/m F (b)τ i N +1 < −1/m F (a).

We provide hereafter a sketch of the proofs of both Theorem 7.7 and Theorem

7.8 where considerations of complex integration play a fundamental role. In thefollowing chapter, Chapter 8, we introduce in detail the methods of complexintegration for random matrix theory and particularly for statistical inference.

Proof of Theorem 7.7 and Theorem 7.8 . As already mentioned, these results areonly known to hold for the Gaussian case for the time being. The way theseresults are achieved is similar to the way Theorem 7.1 and Theorem 7.2 wereobtained, although the techniques are radically different. Indeed, somewhatsimilarly to Theorem 7.1, the rst objective is to show that the differencemN (z) −E[mB N (z)] between the deterministic equivalent mN (z) of the empirical

Stieltjes transform mB N (z) and E[ mB N (z)] goes to zero at a sufficiently fast rate.In the Gaussian case, this rate is of order O(1/N 2). Remember from Theorem 6.5that such a convergence rate was already observed for doubly correlated Gaussianmodels and allowed us to ensure that N (mN (z) −E[mB N (z)]) → 0. Using thefact, established precisely in Chapter 8, that, for holomorphic functions f and adistribution function G

f (x)dG(x) = − 12πi f (z)mG (z)dz

on a positively oriented contour encircling the support of F , we can infer the

recent result from [Haagerup et al., 2006]

E f (x)[F B N −F N ](dx) = O 1N 2

.

Page 218: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 218/562

194 7. Spectrum analysis

Take f any innitely differentiable function that is identically one on [ a, b] ⊂R

and identically zero outside ( a −ε, b + ε) for some small positive ε, such that(a

−ε, b + ε) is outside the support of F . From the convergence rate above, we

rst have.

E N

k=1

λk 1(a−ε,b + ε ) (λk ) = N (F N (b) −F N (a)) + O 1N

and therefore, for large N , we have in expectation the correct mass of eigenvaluesin (a −ε, b + ε). But we obviously want more than that: i.e., we want todetermine the asymptotic exact number of these eigenvalues. Using the Nash–Poincare inequality, Theorem 6.7, we can in fact show that, for this choice of f

E f (x)[F B N −F N ](dx)2

= O 1N 4

.

This is enough to prove, thanks to the Markov inequality, Theorem 3.5, that

P f (x)[F B N −F N ](dx) > 1N

43

< K N

43

for some constant K . From there, the Borel–Cantelli lemma, Theorem 3.6,ensures that the above event is innitely often true with probability zero; i.e.the event

N

k =1

λk 1(a −ε,b + ε ) (λk ) −N (F N (b) −F N (a)) > K N

13

is innitely often true with probability zero. Therefore, with probability one,there exists N 0 such that, for N > N 0 there is no eigenvalue in ( a −ε, b + ε).This proves the rst result.

Take now [ a, b] not necessarily outside the support of F and ε such that ( a −ε, a )∪(b, b+ ε) is outside the support of F . Then, repeating the same procedureas above but to characterize now

N

k =1

λk 1[a,b ](λk ) −N (F N (b) −F N (a))

we nd that this term equalsN

k =1

λk 1(a −ε,b + ε ) (λk ) −N (F N (b) −F N (a))

almost surely in the large N limit since there is asymptotically no eigenvalue in(a

−ε, a )

(b, b+ ε). This now says that the asymptotic number of eigenvaluesin [a, b] is N (F N (b) −F N (a)) almost surely. The fact that the indexes of theseeigenvalues are those expected is obvious. If it were not the case, then we canalways nd an interval on the left or on the right of [ a, b] which does not contain

Page 219: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 219/562

7.2. Information plus noise model 195

1 3 40

0.1

0.2

0.3

0.4

0.5

Eigenvalues

D e n s i t y

Empirical eigenvalue distributionLimit law (from Theorem 3.15 )

1 3 100

0.1

0.2

0.3

0.4

Eigenvalues

D e n s i t y

Empirical eigenvalue distributionLimit law (from Theorem 3.15 )

Figure 7.5 Empirical and limit eigenvalue distribution of the information plus noisemodel B N = 1

n (A N + σX N )( A N + σX N )H , N = 300, n = 3000 ( c = 1 / 10), F 1N

A N A H

N

has three evenly weighted masses at 1 , 3, 4 (top) and 1 , 3, 10 (bottom).

the right amount of eigenvalues, which is contradictory from this proof. Thiscompletes the proof of both results.

7.2.2 Asymptotic spectrum analysis

A similar spectrum analysis as in the case of sample covariance matrices whenthe population covariance matrix has a nite number of distinct eigenvalues canbe performed for the information plus noise model. As discussed previously, the

Page 220: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 220/562

196 7. Spectrum analysis

extension of mF (z) to the real positive half-line is locally invertible and increasingwhen outside the support of F . The semi-converse is again true: if xF (m) is aninverse function for mF (x) continuous with positive derivative, then its image isoutside the support of F . However here, xF (m) is not necessarily unique, as willbe conrmed by simulations. Let us rst state the main result.

Theorem 7.9 ([Dozier and Silverstein , 2007b]). Let B N = 1n (A N +

σX N )(A N + σX N )H , with A N ∈C N ×n such that H N F 1n A N A H

N ⇒ H and

supN 1n A N A

HN < ∞, X N = ( X N,ij ) ∈C N ×n with X N,ij independent for all

i, j , N with zero mean and unit variance (we release here the non-necessary Gaussian hypothesis). Denote S F and S H the supports of F and H , respectively.Take (h1 , h 2) ⊂ S cH . Then there is a unique interval (mF, 1 , m F, 2) ⊂ (− 1

σ 2 c , ∞)

such that the function

m → m

1 + σ2cm

maps (mF, 1 , m F, 2) to (mH, 1 , m H, 2) ⊂ (−∞, 1σ 2 c ), where we introduced

(mH, 1 , m H, 2) = mH ((h1 , h 2)) . On (h1 , h2), mH is invertible, and then we can dene

xF (m) = 1b2 m−1

H 1σ2c

1 − 1b

+ 1b

σ2(1 −c)

with b = 1 + σ2cm.Then:

1. if for m ∈ (mF, 1 , m F, 2), x(m) ∈ S cF , then x (m) > 0;2. if xF (m) > 0 for b ∈ (mF, 1 , m F, 2), then xF (m) ∈ S cF and m = mF (xF (m)) .

Similar to the sample covariance matrix case, Theorem 7.9 gives readily away to determine the support of F : for m varying in ( mF, 1 , m F, 2), wheneverxF (m) increases, its image is outside the support of F . The support of F is

therefore the complementary set to the union of all such intervals. We mustnonetheless be aware that the denition of xF (m) is actually linked to the choiceof the interval ( h1 , h2) ⊂ S cH . In Theorem 7.4, we had a unique explicit inversefor xF (m) as a function of m, whatever the choice of the pre-image of mH

(the Stieltjes transform of the l.s.d. of the population covariance matrix); thisstatement no longer holds here.

In fact, if S H is subdivided into K H clusters, we can expect at most K H + 1different local inverses for xF (m) as m varies along R . This is in fact exactlywhat is observed. Figure 7.6 depicts the situation when H is composed of threeevenly weighted masses in (1 , 3, 4), then (1 , 3, 10). Observe that K H + 1 differentinverses exist that have the aforementioned behavior.

Now, also similar to the sample covariance matrix model, a lot more can besaid in the case where H is composed of a nite number of masses. The exact

Page 221: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 221/562

7.2. Information plus noise model 197

−4 −2 0 2 4

0

5

10

15

m

x F

( m )

x F (m )Support of F

−4 −2 0 2 4−2

0

2

4

6

8

m

x F

( m )

x F (m )Support of F

Figure 7.6 Information plus noise model, xF (m) for m real, F 1N

A N A H

N ⇒ H , where H

has three evenly weighted masses in 1, 3, and 10 (top) and 1, 3, and 4 (bottom),c = 1 / 10, σ = 0 .1 in both cases. The support of F can be read on the central verticalaxises.

determination of the boundary of F can be determined. The result is summarizedas follows.

Theorem 7.10 ([Vallet et al., 2010]). Let B N be dened as in Theorem 7.9,where F

1n A N A H

N = H is composed of K eigenvalues h1 , . . . , h K (we implicitly assume N takes only values consistent with F

1n A N A H

N = H ). Let φ be the function on R \ h1 , . . . , h K dened by

φ(w) = w(1 −σ2cmH (w))2 + (1 −c)σ2(1 −σ2cmH (w)) .

Page 222: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 222/562

198 7. Spectrum analysis

Then φ(w) has 2K F , K F ≤ K , local maxima, such that 1−σ2cmH (w) > 0 and φ(w) > 0. We denote these maxima w−1 , w+

1 , w−2 , w+2 , . . . , w−K F

, w+K F

in the order

w−1 < 0 < w+1 ≤ w−2 < w

+2 ≤ . . . ≤ w−K F < w

+K F .

Furthermore, denoting x−k = φ(w−k ) and x+k = φ(w+

k ), we have:

0 < x −1 < x +1 ≤ x−2 < x +

2 ≤ . . . ≤ x−K F < x +

K F .

The support S F of F is the union of the compact sets [x−k , x+k ], k ∈ 1, . . . , K F

S F =K F

k =1

[x−k , x+k ].

Note that this alternative approach, via the function φ(w), allows us to givea deterministic expression of the subsets [ x−k , x+

k ] without the need to explicitlyinvert mH in K + 1 different inverses, which is more convenient.

A cluster separability condition can also be established, based on the resultsof Theorem 7.10. Namely, we say that the cluster in F corresponding to theeigenvalue hk is disjoint from the neighboring clusters if there exists kF ∈1, . . . , K F such that

hk−1 < w−k F < h k < w +

kF < h k+1

with convention h0 = 0, hK +1 = ∞, and we say that kF is the cluster associated

with hk in F .This concludes this chapter on spectral analysis of the sample covariance

matrix and the information plus noise models. As mentioned in the Introductionof this chapter, these results will be applied to solve eigen-inference problems,i.e. inverse problems concerning the eigenvalue or eigenvector structure of theunderlying matrix models. We will then move to the last chapter, Chapter 9,of the theoretical part, which is concerned with limiting results on the extremeeigenvalues for both the sample covariance matrix and information plus noisemodels. These results will push further the theorems of exact separation byestablishing the limiting distributions of the extreme eigenvalues (although solelyin the Gaussian case) and also some properties on the corresponding eigenvectors.

Page 223: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 223/562

8 Eigen-inference

In the introductory chapter of this book, we mentioned that the samplecovariance matrix R n ∈C N ×N obtained from n independent samples of a randomprocess x

∈C N is a consistent estimate of R E[xx H ] as n

→ ∞ for N xed,

in the sense that, for any given matrix norm R n −R → 0 as n → ∞ (theconvergence being almost sure under mild assumptions). As such, R n wasreferred to as an n-consistent estimator of R . However, it was then shown bymeans of the Marcenko–Pastur law that R n is not an (n, N )-consistent estimatorfor R in the sense that, as both ( n, N ) grow large with ratio bounded away fromzero and ∞, the spectral norm of the matrix difference stays often away fromzero. We then provided an explicit expression for the asymptotic l.s.d. of R n inthis case. However, in most estimation problems, we are actually interested inknowing R itself, and not R n (or its limit). That is, we are more interested in

the inverse problem of nding R given R n , rather than in the direct problem of nding the l.s.d. of R n given R .

8.1 G-estimation

8.1.1 Girko G-estimators

The rst well-known examples of ( n, N )-consistent estimators were provided byGirko, see, e.g., [Girko], who derived more than fty ( n, N )-consistent estimators

for various functionals of random matrices. Those estimators are called G-estimators after Girko’s name, and are numbered in sequence as G1 , G2 , etc.

The G1 , G3 , and G4 estimators may be rather useful in the context of wirelesscommunications and are given hereafter. The rst estimator G1 is a consistentestimator of the log determinant of the population covariance matrix R , alsoreferred to as the generalized variance .

Theorem 8.1. Let x1 , . . . , x n ∈R N be n i.i.d. realizations of a given random process with covariance E[(x 1 −E[x 1])(x 1 −E[x 1])H ] = R and n > N . Denote R

n the sample covariance matrix dened as

R n 1n

n

i=1

(x i − x )(x i − x )H

Page 224: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 224/562

200 8. Eigen-inference

where x 1n

N i =1 x i . Dene G1 the functional

G1(R n ) = α−1

nlogdet( R n ) + log

n(n −1)N

(n −N ) N k=1 (n −k)

with αn any sequence such that α−2n log(n/ (n −N )) → 0. We then have

G1(R n ) −α−1n log det( R ) → 0

in probability.

The G3 estimator deals with the inverse covariance matrix. The result here issurprisingly simple.

Theorem 8.2. Let R ∈R N ×N invertible and R n ∈R N ×N be dened as in Theorem 8.1. Dene G3 as the function

G3(R n ) = 1 − N n

R −1n .

Then, for a ∈R N , b ∈R N of uniformly bounded norm, we have:

a T G3(R n )b −a T R −1b → 0

in probability.

The G4 estimator is a consistent estimator of the second order moment of thepopulation covariance matrix, in a sample covariance matrix model. This unfoldsfrom Theorem 3.13 and is given in the following.

Theorem 8.3. Let R ∈R N ×N and R n ∈R N ×N be dened as in Theorem 8.1.Dene G4 the function

G4(R n ) = 1N

tr R 2n −

1nN

(tr R n )2 .

Then

G4(R n ) − 1N

tr R 2 → 0

in probability.

This last result is compliant with the free probability estimator, for lessstringent hypotheses on R n .

It is then possible to derive some functionals of R based on the observation R n .Note in particular that the multiplicative free deconvolution operation presentedin Chapters 4 and 5 allows us to obtain ( n, N )-consistent estimates of thesuccessive moments of the eigenvalue distribution of R as a function of themoments of the e.s.d. of R n . Those can therefore be seen as G-estimators of themoments of the l.s.d. of R . Now, we may be interested in an even more difficult

Page 225: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 225/562

8.1. G-estimation 201

inverse problem: provide an ( n, N )-consistent estimator of every eigenvalue in R .In the case where R has eigenvalues with large multiplicities, this problem hasbeen recently solved by Mestre in [Mestre , 2008b]. The following section presentsthis recent G-estimation result and details the mathematical approach used byMestre to determine this estimator.

8.1.2 G-estimation of population eigenvalues and eigenvectors

For ease of read, we come back to the notations of Section 7.1. In this case, wehave the following result

Theorem 8.4 ([Mestre, 2008b] ). Let B N = T12N X N X

HN T

12N

∈C N ×N be dened

as in Theorem 7.6, i.e. T N has K distinct eigenvalues t1 < .. . < t K with multiplicities N 1 , . . . , N K , respectively, for all r , N r /n → cr , 0 < c r < ∞.Further denote λ1 ≤ . . . ≤ λN the eigenvalues of B N and λ = ( λ1 , . . . , λ N )

T . Let k ∈ 1, . . . , K and dene

tk = nN k m∈

N k

(λm −µm ) (8.1)

with N k = k−1j =1 N j + 1 , . . . , k

j =1 N j and µ1 ≤ . . . ≤ µN are the ordered

eigenvalues of the matrix diag(λ )

− 1

n√ λ √ λ T

.Then, if condition (7.6) of Theorem 7.6 is fullled for k, i.e. cluster kF in F

is mapped to tk only, we have:

tk −tk → 0

almost surely as N, n → ∞, N/n → c, 0 < c < ∞.

The performance of the estimator of Theorem 8.4 is demonstrated in Figure 8.1for K = 3, t1 = 1, t2 = 3, t3 = 10 in the cases when N = 6, n = 18, and N = 30,

n = 90. Remember from Figure 7.4 that the set (1 , 3, 10) fullls condition ( 7.6),so that Theorem 8.4 is valid. Observe how accurate the G-estimates of the tk arealready for very small dimensions. We will see both in the current chapter andin Chapter 17 that, under the assumption that a cluster separability conditionis met (here, condition ( 7.6)), this method largely outperforms the moment-based approach that consists in deriving consistent estimates for the rst ordermoments and inferring tk from these moments. Note already that the naiveapproach that would consist in taking the mean of the eigenvalues inside eachcluster (bottom of Figure 8.1) shows a potentially large bias in the estimatedeigenvalue, although the estimator variance seems to be smaller than with theconsistent G-estimator.

The reason why condition ( 7.6) must be fullled is far from obvious but willbecome evident once we understand the proof of Theorem 8.4. Before proceeding,

Page 226: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 226/562

Page 227: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 227/562

8.1. G-estimation 203

(i) If z ∈ U is contained in the surface described by C, then for any f holomorphic on U

12πi C

f (ω)ω −z dω = f (z). (8.2)

If the contour C is negatively oriented, then the right-hand side becomes −f (z).(ii) If z ∈ U is outside the surface described by C, then:

12πi C

f (ω)ω −z

dω = 0. (8.3)

Note that this second result is compliant with the fact that, for f continuous,dened on the real axis, the integral of f along a closed contour C

⊂R (i.e. a

contour that would go from a to b and backwards from b to a) is null.Consider f some complex holomorphic function on U ⊂C , H a distributionfunction, and denote G the functional

G(f ) = f (z)dH (z).

From Theorem 8.5, we then have, for a negatively oriented closed path C

enclosing the support of H and with winding number one

G(f ) = 12πi

C

f (ω)z

−ω

dωdH (z)

= 12πi C f (ω)

z −ωdH (z)dω

= 12πi C

f (ω)mH (ω)dω (8.4)

the integral inversion being valid since f (ω)/ (z −ω) is bounded for ω ∈C. Notethat the sign inversion due to the negative contour orientation is compensatedby the sign reversal of ( ω −z) in the denominator.

If dH is a sum of nite or countable masses and we are interested in evaluatingf (tk ), with tk the value of the kth mass with weight lk , then on a negativelyoriented contour Ck enclosing tk and excluding tj , j = k

lk f (tk ) = 12πi C k

f (ω)mH (ω)dω. (8.5)

This last expression is particularly convenient when we have access to tk onlythrough an expression of the Stieltjes transform of H .

Now, in terms of random matrices, for the sample covariance matrix modelB N = T

12N X N X

HN T

12N , we have already noticed that the l.s.d. F of B N (or

equivalently the l.s.d. F of B N = X HN T N X N ) can be rewritten under the form

(3.23), which can be further rewrittenc

mF (z)mH −

1mF (z)

= −zm F (z) + ( c −1) (8.6)

Page 228: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 228/562

204 8. Eigen-inference

where H (previously denoted F T ) is the l.s.d. of T N . Note that it is allowed toevaluate mH in −1/m F (z) for z ∈C + since −1/m F (z) ∈C + .

As a consequence, if we only have access to F B N (from the observation of B N ), then the only link between the observation of B N and H is obtained by (i)the fact that F B N

⇒ F almost surely and (ii) the fact that F and H are relatedthrough ( 8.6). Evaluating a functional f of the eigenvalue tk of T N is then madepossible by ( 8.5). The relations ( 8.5) and ( 8.6) are the essential ingredients behindthe proof of Theorem 8.4, which we detail below.

Proof of Theorem 8.4. We have from Equation ( 8.5) that, for any continuous f and for any negatively oriented contour Ck that circles around tk but none of thet j for j = k, f (tk ) can be written under the form

N kN

f (tk ) = 12πi C k

f (ω)mH (ω)dω

= 12πi C k

1N

K

r =1N r

f (ω)t r −ω

with H the limit F T N

⇒ H . This provides a link between f (tk ) for all continuousf and the Stieltjes transform mH (z).

Letting f (x) = x and taking the limit N → ∞, N k /N → ck /c , with c c1 +. . . + cK the limit of N/n , we have:

ck

c tk = 1

2πi C k

ωmH (ω)dω. (8.7)

We now want to express mH as a function of mF , the Stieltjes transform of the l.s.d. F of B N . For this, we have the two relations ( 3.24), i.e.

mF (z) = cmF (z) + ( c −1)1z

and ( 8.6) with F T = H , i.e.

cmF (z) mH −

1mF (z) = −zm F (z) + ( c −1).

Together, those two equations give the simpler expression

mH − 1

mF (z) = −zm F (z)mF (z). (8.8)

Applying the variable change ω = −1/m F (z) in (8.7), we end up with

ck

c tk =

12πi C F ,k

zmF (z)mF (z)c

+ 1−c

cmF (z)m2

F (z) dz

= 1c

12πi C F ,k

z mF (z)mF (z)

dz, (8.9)

Page 229: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 229/562

8.1. G-estimation 205

where CF ,k is the (well-dened) preimage of Ck by −1/m F . The second equality(8.9) comes from the fact that the second term in the previous relation is thederivative of ( c

−1)/ (cmF (z)), which therefore integrates to zero on a closed

path, as per classical real or complex integration rules [Rudin, 1986]. Obviously,since z ∈C + is equivalent to −1/m F (z) ∈C + (the same being true if C + isreplaced by C −), CF ,k is clearly continuous and of non-zero imaginary partwhenever [z] = 0. Now, we must be careful about the exact choice of CF ,k .

Since k is assumed to satisfy the separability conditions of Theorem 7.6, thecluster kF associated with k in F is distinct from the clusters ( k −1)F and(k + 1) F (whenever they exist). Let us then pick x( l)

F and x( r )F two real values

such that

x+(k

−1) F

< x ( l)F < x −k F

< x +kF

< x ( r )F < x −(k+1) F

with x−1 , x+1 , . . . , x −K F

, x+K F the support boundary of F , as dened in Theorem

7.5. That is, we take a point x( l)F right on the left side of cluster kF and a point

x( r )F right on the right side of cluster kF . Now remember Theorem 7.4 and Figure

7.2; for x( l)F as dened previously, mF (z) has a limit m( l)

∈R as z → x( l)

F , z ∈C + ,and a limit m(r )

∈R as z → x( r )

F , z ∈C + , those two limits verifying

tk−1 < x ( l) < t k < x ( r ) < t k+1 (8.10)

with x( l)

−1/m ( l) and x( r )

−1/m ( r ) .

This is the most important outcome of our integration process. Let us chooseCF ,k to be any continuously differentiable contour surrounding cluster kF suchthat CF ,k crosses the real axis in only two points, namely x( l)

F and x(r )F . Since

−1/m F (C + ) ⊂C + and −1/m F (C −) ⊂C −, Ck does not cross the real axiswhenever CF ,k is purely complex and is obviously continuously differentiablethere; now Ck crosses the real axis in x( l) and x(r ) , and is in fact continuousthere. Because of ( 8.10), we then have that Ck is (at least) continuous andpiecewise continuously differentiable and encloses only tk . This is what is requiredto ensure the validity of ( 8.9). In Figure 8.2, we consider the case where T N is

formed of three evenly weighted eigenvalues t1 = 1, t2 = 3 and t3 = 10, and wedepict the contours Ck , preimages of CF ,k , k ∈ 1, 2, 3, circular contours aroundthe clusters kF such that they cross the real line in the positions x( l)

F and x(r )F ,

corresponding to the inexion points of xF (m) (and an arbitrary large value forthe extreme right point).

The difficult part of the proof is completed. The rest will unfold more naturally.We start by considering the following expression

tk 12πi

nN k

C F ,k

zmB N

(z)mB N (z)

dz

= 12πi

nN k C F ,k

z1n

ni =1

1(λ i −z )2

1n

ni=1

1λ i −z

dz, (8.11)

Page 230: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 230/562

206 8. Eigen-inference

1 3 10−0.25−0.2

−0.15−0.1

−5 ·10−2

0

5 ·10−2

0.10.15

0.2

0.25

(z)

( z )

C 1

C 2

C 3

Figure 8.2 Integration contours Ck , k ∈ 1, 2, 3, preimage of CF ,k by −1/m F , for CF ,k

a circular contour around cluster kF , when T N composed of three distinct entries,t1 = 1, t2 = 3, t3 = 10, N 1 = N 2 = N 3 , N/n = 1 / 10.

where we recall that B N X HN T N X N and, if n ≥ N , λN +1 = . . . = λn = 0.

The value tk can be viewed as the empirical counterpart of tk . Now, weknow from Theorem 3.13 that mB N (z) a.s.−→ mF (z) and mB N (z) a.s.−→ mF (z). Itis not difficult to verify, from the fact that mF is holomorphic, that the sameconvergence holds for the successive derivatives.

At this point, we need the two fundamental results that are Theorem 7.1 andTheorem 7.2. We know that, for all matrices B N in a set of probability one, allthe eigenvalues of B N are contained in the support of F for all large N , andthat the eigenvalues of B N contained in cluster kF are exactly λ i , i ∈N k forthese large N . Take such a B N . For all large N , mB N (z) is uniformly boundedover N and z ∈CF ,k , since CF ,k is away from the support of F . The integrand

in the right-hand side of ( 8.11) is then uniformly bounded for all large N andfor all z ∈CF ,k . By the dominated convergence theorem, Theorem 6.3, we thenhave that tk −tk

a .s.

−→ 0.It then remains to prove that tk takes the form ( 8.1). This is performed by

residue calculus [Rudin , 1986], i.e. by determining the poles in the expandedexpression of tk (when developing mB N (z) in its full expression).

For this, we open a short parenthesis to introduce the basic rules of complexintegration, required here. First, we need to dene poles and residues.

Denition 8.1. Let γ be a continuous, piecewise continuously differentiablecontour on C . If f is holomorphic inside γ but on a, i.e.

limz→a |f (z)| = ∞

Page 231: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 231/562

Page 232: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 232/562

208 8. Eigen-inference

From the same reasoning as above, with the dominated convergence theoremargument, Theorem 6.3, we have that, for sufficiently large N and almost surely

C F ,k

mB N

(z)m2

B N (z) dz < 12 . (8.13)

We now proceed to residue calculus in order to compute the integral in theleft-hand side of ( 8.13). Following the above procedure, notice that the polesof (8.12) are the λi and the µi that lie inside the integration contour CF ,k , allof order one with residues equal to −1 and 1, respectively. These residues areobtained using in particular L’Hospital rule, Theorem 2.10, as detailed belowfor the nal calculus. Therefore, ( 8.12) equals the number of such λi minus thenumber of such µi (remember that the integration contour is negatively oriented,

so we need to reverse the signs). We however already know that this difference,for large N , equals either zero or one, since only the position of the leftmost µi isunknown yet. But since the integral is asymptotically less than 1 / 2, this impliesthat it is identically zero, and therefore the leftmost µi (indexed by min N k ) alsolies inside the integration contour.

We have therefore precisely characterized N k . We can now evaluate ( 8.11). Thiscalls again for residue calculus, the steps of which are detailed below. Denoting

f (z) = zmB N

(z)mB N (z)

,

we nd that λi (inside CF ,k ) is a pole of order 1 with residue

limz→λ i

(z −λ i )f (z) = −λ i

which is straightforwardly obtained from the fact that f (z) ∼ 1λ i −z as z ∼ λ i .

Also µi (inside CF ,k ) is a pole of order 1 with residue

limz→µ i

(z −µi )f (z) = µi

which is obtained using L’Hospital rule: upon existence of a limit, we indeedhave

limz→µ i

(z −µi )f (z) = limz→µ i

ddz (z −µi )zm B N

(z)d

dz mB N (z)

which expands as

limz→µ i

(z −µi )f (z) = limz→µ i

zm B N (z) + z(z −µi )mB N

(z) + ( z −µi )mB N (z)

mB N (z)

.

Notice now that |mB N (z)| is positive and uniformly bounded by 1 /ε 2 for

min i

|λ i

−z

| > ε . Therefore, the ratio is always well dened and, for z

→ µi

with µi poven away from all λi , we nally have

limz→µ i

(z −µi )f (z) = limz→µ i

z = µi .

Page 233: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 233/562

8.1. G-estimation 209

Since the integration contour is chosen to be negatively oriented , it must bekept in mind that the signs of the residues need be inverted in the nal relation.

It now remains to verify that µ1 , . . . , µ N are also the eigenvalues of diag( λ )

−1n √ λ √ λT

. This is immediate from the following lemma.

Lemma 8.1 ([Couillet et al., 2011c],[Gregoratti and Mestre , 2009]). Let A ∈C N ×N be diagonal with entries λ1 , . . . , λ N and y ∈C N . Then the eigenvalues of (A −yy ∗) are the N real solutions in x of

N

i =1

y2i

λ i −x = 1 .

Proof. Let A ∈C N ×N be a Hermitian matrix and y ∈C N . If µ is an eigenvalue

of (A −yy ∗) with eigenvector x , we have the equivalent relations(A −yy ∗)x = µx ,(A −µI N )x = y∗xy ,

x = y∗x (A −µI N )−1y ,y∗x = y∗xy ∗(A −µI N )−1y ,

1 = y∗(A −µI N )−1y .

Take A diagonal with entries λ1 , . . . , λ N , we then haveN

i=1y

2i

λ i −µ = 1.

Taking A = diag( λ ) and yi = √ λ i / √ n, we have the expected result. Thiscompletes the proof of Theorem 8.4.

Other G-estimators can be derived from this technique. In particular, notethat, for x , y ∈C N given vectors, and T N = K

k=1 tk U k U Hk ∈C N ×N the spectral

distribution of T N in Theorem 8.4, we have from residue calculus

x H U k U Hk y = 1

2πi C k

x H (T N −zI N )−1y dz

with Ck a negatively oriented contour enclosing tk , but none of the ti , i = k.From this remark, using similar derivations as above for the quadratic formx H (T N −zI N )−1y instead of the Stieltjes transform 1

N tr( T N −zI N )−1 , we thenhave the following result.

Theorem 8.7 ([Mestre , 2008b]). Let B N be dened as in Theorem 8.4,and denote B N =

N k=1 λk b k b H

k , b Hk b i = δ ik , the spectral decomposition of B N .

Similarly, denote T N = K k=1 tk U k U

H

k , UH

k U k = I n k , with U k ∈CN

×N k

the eigenspace associated with tk . For given vectors x , y ∈C N , denote

u(k; x , y ) x H U k U Hk y .

Page 234: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 234/562

210 8. Eigen-inference

Then we have:

u(k; x , y ) −u(k; x , y ) a .s.

−→ 0

as N, n → ∞ with ratio cN = N/n → c, where

u(k; x , y )N

i=1

θk (i)x H b k b H

k y

and θk (i) is dened by

θi (k) = −φk (i) , i /∈N k

1 + ψk (i) , i ∈N k

with φk (i) =

r∈N k

λr

λ i −λ r − µr

λ i −µr

ψk (i) =r /∈

N k

λr

λ i −λ r − µr

λ i −µr

and N k , µ1 , . . . , µ N dened as in Theorem 8.4.

This result will be shown to be appealing in problems of direction of arrival

(DoA) detection, see Chapter 17.We complete this section with modied versions of Girko’s G-1 estimator,

Theorem 8.1, which are obtained from similar sample covariance arguments asabove. The rst result is merely a generalization of the convergence in probabilityof Theorem 8.1 to almost sure convergence.

Theorem 8.8 (Theorem 1 in [Kammoun et al., 2011]). Dene the matrix Y N =T N X N + 1√ x W N ∈C N ×M for x > 0, with X N ∈C n ×M and W N ∈C N ×M

random matrices with independent entries of zero mean, unit variance and nite 2 + ε order moment for some ε > 0 and T N

∈C N ×n deterministic such that

T N THN has uniformly bounded spectral norm along growing N . Assume that the

e.s.d. of T N THN converges weakly to H as N → ∞. Denote B N = 1

M Y N YHN .

Then, as N,n, M → ∞, with M N → c > 1 and N

n → c0

1N

log det I N + xT N THN

− 1N

log det( xB N ) + M −N

N log

M −N M

+ 1 a.s.

−→ 0.

Under this setting, the G-estimator is exactly the estimator of the Shannontransform of T N T H

N at point x or equivalently of the capacity of a deterministicmultiple antenna link T N under additive noise variance 1 /x from theobservations of the data vectors y 1 , . . . , y M such that Y = [y 1 , . . . , y M ]. A simple

Page 235: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 235/562

8.1. G-estimation 211

way to derive this result is to use the Shannon transform relation of Equation(3.5)

1N log det( I N + xT N T HN ) =

x

01t − 1t2 mH −1t dt (8.14)

along with the fact, similar to Equation ( 8.8), that, for z ∈C \ R +

mH − 1

mF (z) − 1x

= −zm F (z)mF (z)

with F the l.s.d. of B N = 1M Y N Y

HN and F the l.s.d. of 1

M YHN Y N . The change of

variable t = (1 /m F (u) + 1 /x )−1 in Equation ( 8.14) allows us to write VT N T H

N (x)

as a function of mF from which we obtain directly the above estimator.The second result introduces an additional deterministic matrix R N , which in

an applicative sensing context can be used to infer the achievable communicationrate over a channel under unknown interference pattern. We precisely have thefollowing.

Theorem 8.9 (Theorem 2 in [Kammoun et al., 2011]). Dene the matrix Y N = T N X N + 1√ x W N ∈C N ×M for x > 0 where X N ∈C n ×M , and W N ∈C N ×M are random matrices with Gaussian independent entries of zero mean and unit variance, T N ∈C n ×n is deterministic with uniformly bounded spectral

norm for which the e.s.d. of T N TH

N converges weakly, and let R N ∈C N ×N be a deterministic non-negative Hermitian matrix. Then, as N,n,M → ∞ with 1 < lim inf M/N ≤ lim sup M/N < ∞ and 0 < lim inf N/n ≤ lim sup N/n < ∞,we have:

1N

log det I N + x R N + T N TH

N

− 1N

logdet( x[B N + yN R N ]) + M −N

N log(yN ) +

M N

(1 −yN ) a.s.

−→ 0.

with yN the unique positive solution of the equation in y

y = 1M

tr yR N (yR N + B N )−1 + M −N

M . (8.15)

This result is particularly useful in a rate inference scenario when R N = HH H

for some multiple antenna channel matrix H but unknown colored interferenceT N x k + 1√ x w k . Theorem 8.9 along with Theorem 8.8 allow for a consistentestimation of the capacity of the MIMO channel H based on M successiveobservations of noise-only signals (or the residual terms after data decoding).

The proof of Theorem 8.9 arises rst from the fact that, for given y and R N ,a deterministic equivalent for

1N

logdet( yR N + B N )

Page 236: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 236/562

212 8. Eigen-inference

was derived in [Vallet and Loubaton , 2009] and expresses as

1N

logdet( yR N + B N )

− 1N

log det yR N + T N T

HN + xI N

1 + κ(y)+

M N

log(1 + κ(y)) −M N κ(y)

1 + κ(y) a .s.

−→ 0

with κ(y) the unique positive solution for y > 0 of

κ(y) = 1M

tr[T N TH

N + xI N ] yR N + 1

1 + κ(y)[T N T

H

N + xI N ]−1

.

This last result is obtained rather directly using deterministic equivalentmethods detailed in Chapter 6. Now, we observe that, if y = 1

1+ κ (y ) , the term1N log det( yR N + y[T N T HN + xI N ]) appears, which is very close to what we need.

Observing that this has a unique solution, asymptotically close to the uniquesolution of Equation ( 8.15), we easily infer the nal result. More details aregiven in [Kammoun et al., 2011].

Similar results are also available beyond the restricted case of samplecovariance matrices, in particular for the information plus noise models. Wemention especially the information plus noise equivalent to Theorem 8.7,provided in [Vallet et al., 2010], whose study is based on Theorem 7.10.

Theorem 8.10. Let B N be dened as in Theorem 7.8 , where we assume that F 1n A N A H

N = H for all N of practical interest, i.e. we assume F 1n A N A H

N

is composed of K masses in h1 < .. . < h K with respective multiplicities N 1 , . . . , N K . Further suppose that h1 = 0 and let Π be the associated eigenspace of h1 (the kernel of 1

n A N AHN ). Denote B N = N

k=1 λk u k u Hk the spectral

decomposition of B N , with u Hk u j = δ jk , and denote

π(x ) x H Πx .

Then, we have that

π(x ) − π (x ) a .s.

−→ 0where π (x ) is dened as

π(x )N

k =1

β k x H u k u H

k x

with β k dened as

β k = 1 + σ2

N

N

l= N −N 1 +1

1λ l

−λk

+ 2σ2

N

N

l= N −N 1 +1

λk

(λk

−λ l )2

−σ2(1 −c) N

l= N −N 1 +1

1λ l −λk −

N

l= N −N 1 +1

1µl −λk

, 1 ≤ k ≤ N −N 1

Page 237: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 237/562

8.1. G-estimation 213

β k = 1 + σ2

N

N −N 1

l=1

1λ l −λk

+ 2σ2

N

N −N 1

l=1

λk

(λk −λ l )2

−σ2(1 −c)N −N 1

l=1

1λk −µl −

N −N 1

l=1

1λk −λ l

, N −N 1 + 1 ≤ k ≤ N

with µ1 , . . . , µ N the N real roots of mB N (x) = −1/σ 2 .

We do not further develop information plus noise model considerations in thissection and move now to second order statistics for G-estimators.

8.1.3 Central limit for G-estimators

The G-estimators derived above are consistent with increasingly large systemdimensions but are applied to systems of nite, sometimes small, dimensions.This implies some inevitable inaccuracy in the successive estimates, as observedfor instance in Figure 8.1. For application purposes, it is fundamental to beable to assess the quality of these estimates. In mathematical terms, thisimplies computing statistics of the estimates. Various central limit theoremsfor estimators of functionals of sample covariance matrix can be found in theliterature, notably in Girko’s work, see, e.g., [Girko], where central limits forG-estimators are provided.

This section introduces instead a recent result on the limiting distribution of n(tk −tk ) where tk and tk are dened in Theorem 8.4 as the entries of T N forthe sample covariance matrix B N = T

12N X N X

HN T

12N ∈C N ×N , X N ∈C N ×n with

entries 1√ n X ij , i.i.d., such that X 11 has zero mean, unit variance, and fourth ordermoment E[ |X 11 |4] = 2 and T N ∈C N ×N with distinct eigenvalues t1 < .. . < t K

of multiplicity N 1 , . . . , N K , respectively. Specically, we will show that the vector(n(tk −tk ))1≤k≤K is asymptotically Gaussian with zero mean and a covariancewhich we will evaluate, as N → ∞.

The nal result, due to Yao, unfolds as follows.

Theorem 8.11 ([Yao et al., 2011]). Let B N be dened as in Theorem 8.4with E[|X N,ij |4] = 2 and N i /N = ci + o(1/N ) for all i, 0 < c i < ∞. Denote I⊂ 1, . . . , K the set of indexes k such that k satises the separability condition

of Theorem 7.6 . Then, for every set J = j1 , . . . , j p ⊂I , as N, n grow large

n(tk −tk ) k∈J ⇒ X

with X a Gaussian p-dimensional vector with zero mean and covariance Θ J with (k, k ) entry ΘJ

k,k dened as

ΘJk,k

− 14π2c2ci cj C j k C j k

m (z1)m (z2)(m(z1) −m(z2))2 − 1

(z1 −z2)2 dz1dz2

m(z1)m(z2)(8.16)

Page 238: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 238/562

214 8. Eigen-inference

1 3 100

1

2

3

Estimates

D e n s i t y

Histogram of the t k

Theoretical limiting distribution

Figure 8.3 Comparison of empirical against theoretical variances for the estimator of Theorem 8.4, based on Theorem 8.11, K = 3, t1 = 1, t2 = 3, t3 = 10,N 1 = N 2 = N 3 = 20, n = 600.

where the contour Ck encloses the limiting support of the eigenvalues of B N

indexed by N k = k−1j =1 N j + 1, . . . , k

j =1 N j , only, i.e. the cluster kF as dened in Theorem 7.6 . Moreover

ΘJk,k −ΘJ

k,ka .s.

−→ 0

as N, n → ∞, where ΘJj k ,j k

is dened by

ΘJk,k

n2

N k N k ( i,j )∈N j k ×N j k

−1(µi −µj )2mB N

(µi )mB N (µj )

+ δ kki

N k

mB N (µi )

6mB N (µi )3 −

mB N (µi )2

4mB N (µi )4 (8.17)

with the quantities µ1 , . . . , µ N dened as in Theorem 8.4.

In Figure 8.3, the performance of Theorem 8.11 is evaluated against 10 000Monte Carlo simulations of a scenario of three users, with t1 = 1 , t 2 = 3, t3 =10, N 1 = N 2 = N 3 = 20, N = 60, and n = 600. It appears that the limitingdistribution is very accurate for these values of N, n . Further simulations toobtain empirical estimates ΘJ

k,k of ΘJk,k suggest that Θk,k is an accurate estimator

as well.We provide hereafter a sketch of the proof of Theorem 8.11.

Proof. The idea of the proof relies on the following remarks:

• from Theorem 3.17, well-behaved functionals of B N have a Gaussian limit;

Page 239: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 239/562

8.1. G-estimation 215

• then, from the proof of Theorem 8.4 and in particular Equation ( 8.9), theestimator tk of tk expresses as an integral function of the Stieltjes transformof F B N and of its derivative. Applying Theorem 3.17, the variations of N (mB N (z) −mF (z)) and of N (mB N (z) −mF (z)), with F the l.s.d. of B N

can be proved to be asymptotically Gaussian;

• from there, the Gaussian limits of both N (mB N (z) −mF (z)) andN (mB N

(z) −mF (z)) can be further extended to the uctuations of theintegrand in the expression of tk in Equation ( 8.11), using the so-called delta method , to be introduced subsequently;

• nal tightness arguments then ensure that the limiting Gaussian uctuationsof the integrand propagate to the uctuations of the integral, i.e. to n(tk −tk ).

This is the general framework of the proof, for which we provide a sketchhereafter.

Following Theorem 3.17, denote N (F B N −F N ) the difference between thee.s.d. of B N and the l.s.d. of B N modied in such a way that F T N replaces thelimiting law of T 1 , T 2 , . . . and N i /n replaces ci . Then, for a family f 1 , . . . , f p of functions holomorphic on R + , the vector

n f i (x)d(F B N −F N )(x)1≤i≤ p

(8.18)

converges to a Gaussian random variable with zero mean and covariance V

with(i, j ) entry V ij given by:

V ij = − 1

4π2c2 f i (z1)f j (z2)vij (z1 , z2)dz1dz2

with

vij (z1 , z2) = m (z1)m (z2)(m(z1) −m(z2))2 −

1(z1 −z2)2

where the integration is over positively oriented contours that circle around theintersection of the supports of F N for all large N . If we ensure a sufficientlyfast convergence of the spectral law of T N and of N i /n , then we can replaceF N by F , the almost sure l.s.d. of B N , in (8.18). This explains the assumptionN i /n = ci + o(1/N ).

Now, consider Equation ( 8.9), where tk is expressed under the form of acomplex integral of the Stieltjes transform mF (z) of F and of its derivative mF (z)(we remind that F is the l.s.d. of B N = X H

N T N X N ) over a contour CF ,k thatencloses kF only. Since the functions ( x −z)−1 and ( x −z)−2 at any point z of theintegration contour CF ,k are holomorphic on R + , we can apply straightforwardlyTheorem 3.17 to ensure that any vector with entries n(mB N (zi ) −mF (zi )) andn(mB N

(zi ) −mF (zi )), for any nite set of zi away from the support of F , isasymptotically Gaussian with zero mean and a certain covariance. Then, notice

Page 240: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 240/562

216 8. Eigen-inference

that

n zmB N

(zi )m

B N (z

i)

−z

mF (zi )m

F (z

i)

= nzi mB N

(zi )mF (zi ) −zi mF (zi )mB N (zi )m

B N (z

i)m

F (z

i)

which we would like to express in terms of the differences n(mB N (zi ) −mF (zi ))and n(mB N

(zi ) −mF (zi )).To this end, we apply Slutsky’s lemma , given as follows.

Theorem 8.12 ([Van der Vaart , 2000]). Let X 1 , X 2 , . . . be a sequence of random variables converging weakly to a random variable X and Y 1 , Y 2 , . . . converging in probability to a constant c. Then, as n → ∞

Y n X n ⇒ cX.

Applying Theorem 8.12 rst to the variables Y n = mB N (z) a.s.

−→ mF (z) andX n = n(mB N (z) −mF (z)), and then to the variables Y n = mB N

(z) a.s.

−→ mF (z)and X n = n(mB N

(z) −mF (z)), we have rather immediately that

nzi mB N

(zi )mF (zi ) −zi mF (zi )mB N (zi )mB N (zi )mF (zi ) ⇒ zi

mF (zi )X −mF (zi )Y mF (zi )2

with X and Y two random variables such that n[mB N (zi ) −mF (zi )] ⇒ X andn[mB N

(zi ) −mF (zi )] ⇒ Y . This last form can be rewritten

f (X, Y ) = f (X −0, Y −0) = zim

F (z

i)

mF (zi )2 X −zi mF (zi )mF (zi )2 Y

where f is therefore a linear function in ( X, Y ), differentiable at (0 , 0). In orderto pursue, we then introduce the fundamental tool required in this proof, thedelta method . The delta method allows us to transfer Gaussian behavior from arandom variable to a functional of it, according to the following theorem.

Theorem 8.13. Let X 1 , X 2 , . . . ∈R n be a random sequence such that

an (X n −µ) ⇒ X ∼N (0, V )

for some sequence a1 , a 2 , . . . ↑ ∞. Then for f : R n →R N , a function differentiable at µ

an (f (X n ) −f (µ)) ⇒ J (f )X

with J (f ) the Jacobian matrix of f .

Using the delta method on the variables X and Y for different zi , and appliedto the function f , we have that the vector

n zimB N

(zi )

mB N (zi ) −zi

mF (zi )

mF (zi ) 1≤i≤ p

i.e. the deviation of p points of the integrands in ( 8.11), is asymptoticallyGaussian.

Page 241: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 241/562

8.1. G-estimation 217

In order to propagate the Gaussian limit of the deviations in the integrandsof (8.11) to the deviations in tk itself, it suffices to study the behavior of thesum of Gaussian variables over the integration contour CF ,k . Since the integralcan be written as the limit of a nite Riemann sum and that a nite Riemannsum of Gaussian random variable is still Gaussian, it suffices to ensure that thenite Riemann sum is still Gaussian in the limit. This requires an additionalingredient: the tightness of the sequences

n zmB N

(z)mB N (z) −z

mF (z)mF (z)

for growing n and for all z in the contour, see [Billingsley, 1968, Theorem 13.1].This naturally unfolds from a direct application of [Billingsley, 1968 , Theorem

13.2], following a similar proof as in [Bai and Silverstein, 2004] , and we haveproven the Gaussian limit of vectors ( n(tk −tk )) k∈

J .The last step of the proof is the calculus of the covariance of the Gaussian

limit. This requires to evaluate for all k, k

n2E C j k C j k

zk mB N (zk )

mB N (zk ) −zk mF (zk )mF (zk )

zk mB N (zk )

mB N (zk ) −zk mF (zk )

mF (zk )dzk dzk .

Integrations by parts simplify the result and lead to (8.16). In order to obtain(8.17), residue calculus is nally performed similar to the proof of Theorem

8.4.Note that the proof relies primarily on the central limit theorem of Bai and

Silverstein, Theorem 3.17. In particular, for other models more involved thanthe sample covariance matrix model, the Gaussian limit of the deviations of functionals of the e.s.d. of B N must be proven in order both to prove asymptoticcentral limit of the estimator and even to derive the asymptotic variance of the estimator. This calls for a generalization of Bai and Silverstein central limittheorem to advanced random matrix models, see examples of such models in thecontext of statistical inference for cognitive radios in Chapter 17.

This recent incentive for eigen-inference based on the Stieltjes transform istherefore strongly constrained by the limited amount of central limit theoremsavailable today. As an alternative to the Stieltjes transform method for statisticalinference, we have already mentioned that free probability and methods derivedfrom moments can perform similar inference, based on consistent estimation of the moments only. From the information on the estimated moments, assumingthat these moments alone describe the l.s.d., an estimate of the functional understudy can be determined. These moment approaches can well substitute Stieltjestransform methods when (i) the model under study is too involved to proceed tocomplex integration or (ii) when the analytic approach fails, as in the case whenclusters mapped to population eigenvalues are not disjoint. Note in particularthat the Stieltjes transform method requires exact separation properties, whichto this day is known only for very few models.

Page 242: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 242/562

218 8. Eigen-inference

8.2 Moment deconvolution approach

Remember that the free probability framework allows us to evaluate thesuccessive moments of compactly supported l.s.d. of products, sums, andinformation plus noise models of asymptotically free random matrices basedon the moments of the individual random matrix l.s.d. The operation thatevaluates the moments of the output random matrices from the moments of the input deterministic matrices was called free convolution. It was also shownthat the moments of an input matrix can be retrieved from those of the resultingoutput matrix and the other operands (this assumes large matrix dimensions):the associated operation was called free deconvolution. From combinatoricson non-crossing partitions, we stated that it is rather easy to automatize the

calculus of free convolved and deconvolved moments. We therefore have alreadya straightforward way to perform eigen-inference on the successive momentsof the l.s.d. of the population covariance matrix from the observed samplecovariance matrix, or on the moments of the l.s.d. of the information matrixfrom the observed information plus noise matrix and so on. Since the l.s.d. arecompactly supported, the moments determine the distribution and thereforecan be used to perform eigen-inference on various functionals of the l.s.d.However, since moments are unbounded functionals of the eigenvalue spectrum,the moment estimates are usually very inaccurate, more particularly so forhigh order moments. When estimating the eigenvalues of population covariancematrices T N , as in Theorem 8.4, moment approaches can be derived althoughrather impractical, as we will presently see. This is the main drawback of thisapproach, which, although much more simple and systematic than the previouslyintroduced methods, is fundamentally inaccurate in practice.

Consider again the inference on the individual eigenvalues of T N in the modelB N = T

12N X N X

HN T

12N ∈C N ×N of Theorem 8.4. For simplicity, we assume that

the K distinct eigenvalues of T N have the same mass; this fact being known to theexperimenter. Since T N has K distinct positive eigenvalues t1 < t 2 < .. . < t K ,we have already mentioned that we can recover these eigenvalues from the rst

K moments of the l.s.d. H of T N , recovered from the rst K (free deconlvolved)moments of the l.s.d. F of B N . Those moments are the K roots of the Newton–Girard polynomial ( 5.2), computed from the moments of H . A naive approachmight therefore consist in estimating the moments of T N by free deconvolutionof the e.s.d. of B N = B N (ω), for N nite, and then solving the Newton–Girardpolynomial for the estimated moments. We provide hereafter an example for thecase K = 3.

From the method described in Section 5.2, we obtain that the momentsB1 , B 2 , B 3 of F are given as a function of the moments T 1 , T 2 , T 3 of H , as

B1 = T 1 ,B2 = T 2 + cT 21 ,

Page 243: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 243/562

8.2. Moment deconvolution approach 219

B3 = T 3 + 3 cT 2T 1 + c2T 31

with c = lim n →∞N/n . Note that, from the extension to nite size random

matrices presented in Section 5.4, we could consider the expected e.s.d. of B N in place of the l.s.d. of B N , in which case B1 and B2 remain the same, and B3

becomes

B3 = (1 + n−2)T 3 + 3 cT 2T 1 + c2T 31 .

We can then obtain an expression of the T k by reverting the above equations,as

T 1 = B1 ,T 2 = B2 −cB 2

1 ,

T 3 = (1 + n−2)−1 B3 −3cB2B1 + 2 c2B 31 . (8.19)

By deconvolving the empirical moments Bk 1N tr B k

N , 1 ≤ k ≤ 3, of B N withthe method above, we obtain estimates T 1 , T 2 , T 3 of the moments T 1 , T 2 , T 3 , inplace of T 1 , T 2 , T 3 themselves. We then obtain estimates t1 , t2 , t3 of t1 , t 2 , t 3 bysolving the system of equations

T 1 = 13

t1 + t2 + t3 ,

T 2 = 13

t21 + t2

2 + t23 ,

T 3 = 13

t31 + t3

2 + t33 .

We recover the Newton–Girard polynomial by computing the successiveelementary symmetric polynomials Π 1 , Π2 , Π3 , using (5.4)

Π1 = 3 T 1 ,

Π2 = −12

T 2 + 92

T 21 ,

Π3 = T 3 − 15

2T 2 T 1 +

272

T 31 .

The three roots of the equation

X 3 −Π1X 2 + Π 2X −Π3 = 0

are the estimates t1 , t2 , t3 of t1 , t2 , t 3 .However, this method has several major practical drawbacks.

• Inverting the Newton–Girard equation does not ensure that the solutions areall real, since the estimator is not constrained to be real. When running theprevious algorithm for not too large N , a large portion of the estimatedeigenvalues are indeed returned as purely complex. When this happens, itis difficult to decide what to do with the algorithm output. The G-estimatorof Theorem 8.4 on the opposite, being an averaged sum of the eigenvalues of non-negative denite, necessarily provides real positive estimates;

Page 244: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 244/562

220 8. Eigen-inference

• from the system of Equations ( 8.19), it turns out that the kth order momentT k is determined by the moments B1 , . . . , B k . As a consequence, whensubstituting B i to Bi , 1

≤ i

≤ k, in (8.19), a small difference between the

limiting B i and the empirical B i entails possibly large estimation errors in allT k , k ≥ i. This engenders a snowball effect on the resulting estimates T k forlarger k, this effect being increased by the intrinsic growing error between Bi

and B i for growing i.

On the positive side, while the G-estimators based on complex analysis are tothis day not capable of coping with situations when successive clusters overlap,the moment approach is immune against such situations. The performance of themoment technique against the G-estimator proposed in Section 8.1.2 is providedin Figure 8.4 for the same scenario as in Figure 8.1. We can observe that, although

asymptotically unbiased (as H is uniquely determined by its rst free moments),the moment-based estimator performs very inaccurately compared to the G-estimator of Theorem 8.4. Note that in the case N = 30, n = 90, some of theestimates were purely complex and were discarded; running simulations for thescenario N = 6, n = 18, as in Figure 8.1 leads to even worse results, most of whichbeing purely complex. Now, the limitations of this approach can be corrected bypaying more attention on the aforementioned snowball effect for the momentestimates. In particular, thanks to Theorem 3.17, we know that, for the samplecovariance matrix model under study, the k-multivariate random variable

N xd[F −F B N ](x) , N x2d[F −F B N ](x) , . . . , N xk d[F −F B N ](x)T

has a central limit with covariance matrix given in Corollary 3.3. As aconsequence, an alternative estimate t (k )

ML (t(k )

ML ,1 , . . . , t (k )ML ,K )

T of the K massesin H is the maximum likelihood (ML) estimate for ( t1 , . . . , t K )T based on theobservation of k successive moments of B N . This is given by:

t (k )ML = argmin

t(b −b (t )) T Q (t )−1(b −b (t )) + log det Q (t )

where t = ( T 1 , . . . , T k )T , b = ( B1 , . . . , Bk )T , b (t ) = ( B1 , . . . , B k ) assuming T k =1K

K i =1 tki , and Q (t ) is obtained as in ( 3.27); see [Masucci et al., 2011; Raoet al. , 2008] for more details. This method is however computationally expensivein this form, since all vectors t must be tested, for which every time Q (t )has to be evaluated. Suboptimal methods are usually envisioned to reduce thecomputational complexity of the ML estimate down to a reasonable level. InChapter 17, such methods will be discussed for more elaborate models.

Page 245: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 245/562

8.2. Moment deconvolution approach 221

1 3 100

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

Estimated tk

D e n s i t y

Moment-based estimatorG-estimator, Theorem 8.4

1 3 100

2

4

6

Estimated tk

D e n s i t y

Moment-based estimatorG-estimator, Theorem 8.4

Figure 8.4 Estimation of t1 , t 2 , t 3 in the model B N = T12N X N X H

N T12N based on rst

three empirical moments of B N and Newton–Girard inversion, forN 1 /N = N 2 /N = N 3 /N = 1 / 3 ,N/n = 1 / 10, for 100 000 simulation runs; Top N = 30,n = 90, bottom N = 90, n = 270. Comparison is made against the G-estimator of Theorem 8.4.

Page 246: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 246/562

Page 247: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 247/562

9 Extreme eigenvalues

This last chapter of Part I introduces very recent mathematical advances of deepinterest to the eld of wireless communications, related to the limiting behaviorof the extreme eigenvalues and of their corresponding eigenvectors. Again, themain objects which have been extensively studied in this respect are derivativesof the sample covariance matrix and of the information plus noise matrix.

This chapter will be divided into two sections, whose results emerge fromtwo very different random matrix approaches. The rst results, about thelimiting extreme eigenvalues of the spiked models, unfold from the previous exactseparation results described in Chapter 7. It will in particular be proved thatin a sample covariance matrix model, when all population eigenvalues are equalbut for the few largest ones, the l.s.d. of the sample covariance matrix is stillthe Marcenko–Pastur law, but a few eigenvalues may now be found outside the

support of the l.s.d. The second set of results concerns mostly random matrixmodels with Gaussian entries, for which limiting results on the behavior of extreme eigenvalues are available. These results use very different approachesthan those proposed so far, namely the theory of orthogonal polynomials anddeterminantal representations. This subject, which requires many additionaltools, is briey introduced in this chapter. For more information about thesetools, see, e.g. the tutorial [Johnstone, 2006] or the book [Mehta , 2004].

We start this section with the spiked models.

9.1 Spiked models

We rst discuss the sample covariance matrix model, which can be seen underthe spiked model assumption as a perturbed sample covariance matrix withidentity population covariance matrix. We will then move to a different setof models, using free probability arguments in random matrix models withrotational invariance properties.

Page 248: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 248/562

224 9. Extreme eigenvalues

9.1.1 Perturbed sample covariance matrix

Let us consider the so-called spiked models for the sample covariance matrix

model B N = T

12

N X N XH

N T N , with X N ∈CN

×n

random with i.i.d. entries of zero mean and variance 1 /n , which arise whenever the e.s.d. of the populationcovariance matrix T N ∈C N ×N contains a few outlying eigenvalues. What wemean by “a few outlying eigenvalues” is described in the following. Assume thee.s.d. of the series of matrices T 1 , T 2 , . . . converges weakly to some d.f. H anddenote τ 1 , . . . , τ N the eigenvalues of T N . Consider now M integers k1 , . . . , k M .Consider also a set α1 , . . . , α M of non-negative reals taken outside the union of the sets τ 1 , . . . , τ N for all N . Then the e.s.d. of the series T 1 , T 2 , . . . of diagonalmatrices given by:

T N = diag( α 1 , . . . , α 1

k1

, . . . , α M , . . . , α M

kM

, τ 1 , . . . , τ N − M i =1 k i

)

also converges to H as N → ∞, with M and the αk kept xed. Indeed, the nitelymany eigenvalues α1 , . . . , α M of nite multiplicities will have null measure inthe asymptotic set of eigenvalues of T N when N → ∞. These eigenvalues willhowever lie outside the support of H .

The question that now arises is whether those α1 , . . . , α K will induce thepresence of some eigenvalues of B N T

12N X N X

HN T

12N outside the limiting

support of the l.s.d. F of B N . First, it is clear that F = F . Indeed, from Theorem3.13, F is uniquely determined by H , and therefore the limiting distributionF of B N is nothing but F itself. If there are eigenvalues found outside thesupport of F , they asymptotically contribute with no mass. These isolatedeigenvalues will then be referred to as spikes in the following, as they will beoutlying asymptotically zero weight eigenvalues. It is important at this pointto remind that the existence of eigenvalues outside the support of F is notat all in contradiction with Theorem 7.1. Indeed, Theorem 7.1 precisely statesthat (with probability one), for all large N , there is no eigenvalue of B N in a

segment [ a, b] contained both in the complementary of the support of F and inthe complementary of the supports of F N , determined by the solutions of ( 7.1),for all large N . In the case where τ j = 1 for all j ≥ M

i =1 ki , F (x) = lim N F N (x)is the Marcenko–Pastur law, which does not exclude F N from containing largeeigenvalues of mass ki /N → 0; therefore, it is possible for B N to asymptoticallyhave eigenvalues outside the support of F .

The importance of spiked models in wireless communications arises whenperforming signal sensing, where the (hypothetical) signal space has smalldimension compared to the noise space. In this case, we might wish to be ableto decide on the presence of a signal based on the spectrum of the samplecovariance matrix B N . Typically, if an eigenvalue is found outside the predictednoise spectrum of B N , then this must indicate the presence of a signal bearinginformative data, while if all the eigenvalues are inside the limiting support, then

Page 249: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 249/562

9.1. Spiked models 225

this should indicate the absence of such a signal. More on this is discussed indetail in Chapter 16.

However, as we will show in the sequel, it might not always be true that aspike in T N results in a spike in B N found outside the support of F , in the sensethat the support of F N may “hide” the spike in some sense. This is especiallytrue when the size of the main clusters of eigenvalues (linked to the ratio N/n )is large enough to “absorb” the spike of B N that would have resulted from thepopulation spike of T N . In this case, for signal detection purposes, whether asignal bearing informative data is present or not, there is no way to decide onthe presence of this signal by simply looking at the asymptotic spectrum. Thecondition for decidability when T N = I N is given in the following result.

Theorem 9.1 ([Baik and Silverstein , 2006]). Let ¯B N =

¯T

12

N X N XH

N ¯T

12

N , where X N ∈C N ×n has i.i.d. entries of zero mean, variance 1/n , and fourth order moment of order O(1/n 2), and T N ∈R N ×N is diagonal given by:

T N = diag( α 1 , . . . , α 1

k 1

, . . . , α M , . . . , α M

k M

, 1, . . . , 1

N − M i =1 k i

)

with α1 > .. . > α M > 0 for some M . We denote here c = lim N N/n . Call M 0 =# j, α j > 1 + √ c. For c < 1, take also M 1 to be such that M −M 1 = # j, α j <1 −√ c. Denote additionally λ1 , . . . , λ N the eigenvalues of B N , ordered as λ1 ≥. . . ≥ λN . We then have

• for 1 ≤ j ≤ M 0 , 1 ≤ i ≤ kj

λk 1 + ... + k j −1 + ia .s.

−→ α j + cαj

α j −1

• for the other eigenvalues, we must discriminate upon c– if c < 1

* for M 1 + 1 ≤ j ≤ M , 1 ≤ i ≤ kj

λN −k j −... −k M + ia .s.

−→ α j + cαj

αj −

1

* for the indexes of eigenvalues of T N inside [1−√ c, 1 + √ c]

λk 1 + ... + k M 0 +1a .s.

−→ (1 + √ c)2

λN −k M 1 +1 −... −k M a .s.

−→ (1 −√ c)2

– if c > 1

λna .s.

−→ (1 −√ c)2

λn +1 = . . . = λN = 0

– if c = 1

λmin( n,N )a .s.

−→ 0.

Page 250: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 250/562

226 9. Extreme eigenvalues

Therefore, when c is large enough, the segment [max(0 , 1 −√ c), 1 + √ c] willcontain some of the largest eigenvalues of T N (those closest to one). If this occursfor a given αk , the corresponding eigenvalues of B N will be “attracted” by theleft or right end of the support of the l.s.d. of B N . If c < 1, small populationspikes αk in T N may generate spikes of B N in the interval (0 , (1 −√ c)2); whenc > 1, though, the αk smaller than one will result in null eigenvalues of B N .

Remember from the exact separation theorem, Theorem 7.2, that there is acorrespondence between the eigenvalues of T N and those of B N inside eachcluster. Since exactly α 1 , . . . , α M 0 are above 1 + √ c then, asymptotically,exactly k1 + . . . + kM 0 eigenvalues will lie on the right-end side of the supportof the Marcenko–Pastur law. This is depicted in Figure 9.1 where we considerM = 2 spikes α 1 = 2 and α 2 = 3, both of multiplicity k1 = k2 = 2. We illustrate

the decidability condition depending on c by considering rst c = 1 / 3, in whichcase 1 + √ c 1.57 < α 1 < α 2 and then we expect two spikes of B N at positionα 1 + cα1(α 1 −1)−1 2.67 and two spikes of B N at position α2 + cα2(α 2 −2)−1 = 3 .5. We then move c to c = 5 / 4 for which α1 < 1 + √ c 2.12 < α 2 ; wetherefore expect only the two eigenvalues associated with α2 at position α2 +cα2(α 2 −2)−1 4.88 to lie outside the spectrum of F . This is approximatelywhat is observed.

The fact that spikes are non-discernible for large c leads to a seeminglyparadoxical situation. Consider indeed that the sample space if xed to n sampleswhile the population space of dimension N increases, so that we increase thecollection of input data to improve the quality of the experiment. In the contextof signal sensing, if we rely only on a global analysis of the empirical eigenvaluesof the input covariance matrix to declare that “if eigenvalues are found outsidethe support, a signal is detected,” then we are better off limiting N to a minimalvalue and therefore we are better off with a mediocre quality of the experiment;otherwise the decidability threshold is severely impacted. This point is criticaland it is essential to understand that the problem here lies in the non-suitabilityof the decision criterion (that consists just in looking at the eigenvalues outsideor inside the support) rather than in the intrinsic non-decidable nature of the

problem, which for nite N is not true. If N is large and such that there is nospike outside the support of F while T N does have spikes, then we will need tolook more closely into the tail of the Marcenko–Pastur law, which, for xed N ,contains more than the usual amount of eigenvalues; however, we will see thateven this strategy is bound to fail, for very large N . In this case, we may haveto resort to studying the joint eigenvalue distribution of B N , which containsthe full information. In Chapter 16, we will present a scheme for signal sensing,which aims at providing an optimal sensing decision, based on the complete jointeigenvalue distribution of the input signals, instead of assuming large dimensionalassumptions. In these scenarios, the rule of thumb that suggests that smalldimensional systems are well approximated by large dimensional analysis nowfails, and a signicant advantage is provided by the small dimensional analysis.

Page 251: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 251/562

9.1. Spiked models 227

α 1 + cα 1α 1 −1 , α 2 + cα 2

α 2 −10

0.2

0.4

0.6

0.8

Eigenvalues

D e n s i t y

Marcenko–Pastur law, c = 1 / 3Empirical eigenvalues

α 2 + cα 2α 2 −1

0

0.2

0.4

0.6

0.8

1

1.2

Eigenvalues

D e n s i t y

Marcenko–Pastur law, c = 5 / 4Empirical eigenvalues

Figure 9.1 Eigenvalues of B N = T12N X N X N

H T12N , where T N is a diagonal of ones but

for the rst four entries set to 3, 3, 2, 2. On top, N = 500, n = 1500. One thebottom, N = 500, n = 400. Theoretical limit eigenvalues of B N are stressed.

Another way of observing practically when the e.s.d. at hand is close to theMarcenko–Pastur law F is to plot the empirical eigenvalues against the quantilesF −1( k−1/ 2

N ) for k = 1 , . . . , N . This is depicted in Figure 9.2, for the case c = 1/ 3with the same set of population spikes 2, 2, 3, 3 in T N as before. We observeagain the presence of four outlying eigenvalues in the e.s.d. of B N .

We subsequently move to a different type of results, dealing with thecharacterization of the extreme eigenvalues and eigenvectors of some perturbedunitarily invariant matrices. These recent results are due to the work of Benaych-Georges and Rao [Benaych-Georges and Rao, 2011] .

Page 252: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 252/562

228 9. Extreme eigenvalues

(1 −√ c)2 (1 + √ c)2

(1 −√ c)2

(1 + √ c)2

Eigenvalues

Q u a n t i l e s

Empirical eigenvaluesy = x

Figure 9.2 Eigenvalues of B N = T N

12 X N X N

HT N

12 , where T N is a diagonal of ones

but for the rst four entries set to 3, 3, 2, 2, against the quantiles of theMarcenko–Pastur law, N = 500, n = 15000, c = 1 / 3.

9.1.2 Perturbed random matrices with invariance properties

In [Benaych-Georges and Rao , 2011], the authors consider perturbations of unitarily invariant random matrices (or random matrices with unitarily invariantperturbation). What is meant by perturbation is either the addition of a smallrank matrix to a large dimensional random matrix, or the product of a largedimensional random matrix by a perturbed identity matrix in the sense justdescribed. Thanks to the unitarily invariance property of either of the twomatrices, we obtain the following very general results.

Theorem 9.2 ([Benaych-Georges and Rao, 2011] ). Let X N ∈C N ×N be a Hermitian random matrix with ordered eigenvalues λN

1

≥ . . .

≥ λN

N for which

we assume that the e.s.d. F X N converges almost surely toward F with compact support with inmum a and supremum b, such that λN

1a .s.

−→ b and λN N

a .s.

−→ a.Consider also a perturbation matrix A N of rank r , with ordered non-zeroeigenvalues aN

1 ≥ . . . ≥ aN r . Denote s the integer such that as > 0 > a s +1 . We

further assume that either X N or A N (or both) are bi-unitarily invariant. Denote Y N the matrix dened as

Y N = X N + A N

with ordered eigenvalues ν N 1

≥ . . .

≥ ν N

N . Then, as N grows large, for i

≥ 1

ν N i

a .s.

−→ −m−1F (1/a i ) , if 1 ≤ i ≤ r and 1/a i < −mF (b+ )

b , otherwise

Page 253: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 253/562

9.1. Spiked models 229

where the inverse is with respect to composition. Also, for i ≥ 0

ν N n

−i

a .s.

−→−m−1

F (1/a r −i ) , if i < r −s and 1/a i > −mF (a−)

a , otherwise .We also have the same result for multiplicative matrix perturbations, as

follows.

Theorem 9.3. Let X N and A N be dened as in Theorem 9.2 . Denote ZN the matrix

Z N = X N (I N + A N )

with ordered eigenvalues µN 1

≥ . . .

≥ µN

N . Then, as N grows large, for i

≥ 1

µN i

a .s.

−→ψ−1

F (a i ) , if 1 ≤ i ≤ s and 1/a i < ψ F (b+ )b , otherwise

and, for i ≥ 0

µN n −r + i

a .s.

−→ψ−1

F (a i ) , if i < r −s and 1/a i > ψ F (a−)a , otherwise

where ψF is the ψ-transform of F , dened in (4.3) as

ψF (z) = tz −t dF (t) = −1 −

1z mF

1z .

This result in particular encompasses the case when X N = W N WHN , with

W N lled with i.i.d. Gaussian entries, perturbed in the sense of Theorem 9.1. Inthis sense, this result generalizes Theorem 9.1 for unitarily invariant matrices,although it does not encompass the general i.i.d. case of Theorem 9.1. Recentextensions of the above results on the second order uctuations of the extremeeigenvalues can be found in [Benaych-Georges et al., 2010].

As noticed above, the study of extreme eigenvalues carries some importance

in problems of detection of signals embedded in white noise, but not only. Fieldssuch as speech recognition, statistical learning, or nance also have interests inextreme eigenvalues of covariance matrices. For the particular case of nance, see,e.g., [Laloux et al., 2000; Plerous et al., 2002], consider X N is the N ×n matrixin which each row stands for a market product, while every column stands for atime period, say a month, as already presented in Chapter 1. The ( i, j )th entry of X N contains the evolution of the market index for product i in time period j . If all time-product evolutions are independent in the sense that the evolution of thevalue of product A for a given month does not impact the evolution of the value of product B , then it is expected that the rows of X N are statistically independent.Also, if the time scale is chosen such that the evolution of the price of productA over a given time period is roughly uncorrelated with its evolution on thesubsequent time period, then the columns will also be statistically independent.

Page 254: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 254/562

Page 255: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 255/562

9.2. Distribution of extreme eigenvalues 231

random matrix X N ∈C N ×N , the largest eigenvalue λN N has density

P λ N N

(λN N ) =

λ

N 1

. . .

λ

N N −1

P ≤(λ N 1 ,...,λ N

N ) (λN 1 , . . . , λ N

N )dλ N 1 . . . dλ N

N . (9.1)

In the case where the order of the eigenvalues is irrelevant, we have that

P (λ N 1 ,...,λ N

N ) (λN 1 , . . . , λ N

N ) = 1N !

P ≤(λ N 1 ,...,λ N

N ) (λN 1 , . . . , λ N

N )

with P (λ N 1 ,...,λ N

N ) the density of the unordered eigenvalues.From now on, the eigenvalue indexes 1 , . . . , N are considered to be just labels

instead of ordering indexes. From the above equality, it is equivalent, and as willturn out actually simpler, to study the unordered eigenvalue distribution ratherthan the ordered eigenvalue distribution. In the particular case of a zero Wishart

matrix with n ≥ N degrees of freedom, this property holds and we have fromTheorem 2.3 that

P (λ N 1 ,...,λ N

N ) (λN 1 , . . . , λ N

N ) = e− N i =1 λ N

i

N

i =1

(λN i )n −N

(n −i)!i! i<j

(λN i −λN

j )2 .

Similarly, we have for Gaussian Wigner matrices [Tulino and Verd´u, 2004], i.e.Wigner matrices with upper-diagonal entries complex standard Gaussian anddiagonal entries real standard Gaussian

P (λ N 1 ,...,λ N N ) (λN 1 , . . . , λ

N N ) =

1(2π) N

2 e−N i =1 (λ N

i )2N

i=1

1i! i<j (λ

N i −λ

N j )

2

. (9.2)

The problem now is to be able to compute the multi-dimensionalmarginalization for either of the above distributions, or for more involveddistributions. We concentrate on the simpler Gaussian Wigner case in whatfollows.

To be able to handle the marginalization procedure, we will use the reproducing kernel property, given below which can be found in [Deift, 2000].

Theorem 9.4. Let K n

∈C n ×n with (i, j ) entry K ij = f (x i , x j ) for some

complex-valued function f of two real variables and a real vector x = ( x1 , . . . , x n ).The function f is said to satisfy the reproducing kernel property with respect toa real measure µ if

f (x, y )f (y, z )dµ(y) = f (x, z ).

Under this condition, we have that

det K n dµ(xn ) = ( q −(n −1)) det K n −1 (9.3)

with

q = f (x, x )dµ(x).

Page 256: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 256/562

232 9. Extreme eigenvalues

The above property is interesting in the sense that, if such a reproducingkernel property can be exhibited, then we can successively iterate (9.3) in orderto perform marginalization calculus such as in ( 9.1).

By working on the expression of the eigenvalue distribution of Gaussian Wignermatrices ( 9.2), it is possible to write P (λ N

1 ,...,λ N N ) under the form

P (λ N 1 ,...,λ N

N ) (λN 1 , . . . , λ N

N ) = C det e−12 (λ N

j )2π i−1(λN

j )1≤i,j ≤N

2

for any set of polynomials ( π0 , . . . , π N −1) with πk of degree k and leadingcoefficient 1, and for some normalizing constant C . A proof of this fact stemsfrom similar arguments as for the proof of Lemma 16.1, namely that the matrixabove can be written under the form of the product of a diagonal matrix with

entries e12 (λ

N j )

2

and a matrix with polynomial entries πi (x j ), the determinantof which is proportional to the product of e j (λ N

j )2times the Vandermonde

determinant i<j (λN i −λN

j ).Now, since we have the freedom to take any set of polynomials ( π0 , . . . , π N −1)

with leading coefficient 1, we choose a set of orthogonal polynomials with respectto the weighting coefficient e−x 2

, i.e. we dene (π0 , . . . , π N −1) to be such that

e−x 2π i (x)π j (x)dx = δ ji .

Denoting now K

N ∈C N

×N

the matrix with ( i, j ) entry

K ij = kN (λN i , λN

j )N −1

k=0

e−12 (λ N

i )2πk (λN

i ) e−12 (λ N

j )2πk (λN

j )

we observe easily, from the fact that det( A 2) = det( A T A ), that

P (λ N 1 ,...,λ N

N ) (λN 1 , . . . , λ N

N ) = C det K N .

From the construction of K N , through the orthogonality of the polynomialsπ0(x), . . . , π N −1(x), we have that

kN (x, y )kN (y, z )dy = kN (x, y )

and the function kN has the reproducing kernel property.This ensures that

. . . det K N dxk +1 . . . dx N = ( N −k)!det K k

where the term ( N −k)! follows from the computation of

kn (x, x )dx for n ∈

k + 1 , . . . , N

.

To nally compute the probability distribution of the largest eigenvalue, notethat the probability that it is greater than ξ is complementary to the probabilitythat there is no eigenvalue in B = ( ξ, ∞). The latter, called the hole probability ,

Page 257: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 257/562

9.2. Distribution of extreme eigenvalues 233

expresses as

QN (B ) =

. . .

P (λ N

1 ,...,λ N

N ) (λN

1 , . . . , λ N N )

N

k=1

(1

−1B (λN

k ))dλ N 1 . . . dλ N

N .

From the above discussion, expanding the product term, this can be shown toexpress as

QN (B ) =N

i =0

(−1)i 1i! B

. . . Bdet K i dλN

1 . . . dλ N i .

This last expression is in fact a Fredholm determinant , denoted det( I −K N ),where K N is called an integral operator with kernel kN acting on square

integrable functions on B. These Fredholm determinants are well-studied objects,and it is in particular possible to derive the limiting behavior of QN (B ) as N grows large, which leads presently to the complementary of the Tracy–Widomdistribution, and to results such as Theorem 9.5.

Before completing this short introduction, we also mention that alternativecontributions such as the recent work of Tucci [Tucci, 2010] establish expressionsfor functionals of eigenvalue distributions, without resorting to the orthogonalpolynomial machinery. In [Tucci , 2010], Tucci provides in particular a closed-formformula for the quantity

f (t)dF X N T N XH

N (t)

for X N ∈C N ×n a random matrix with Gaussian entries of zero mean and unitvariance and T N a deterministic matrix, given under the form of the determinantof a matrix with entries given in an integral form of f and the eigenvalues of T N . This form is not convenient in its full expression, although it constitutes arst step towards the generalization of the average spectral analysis of Gaussianrandom matrices. Incidentally, Tucci provides a novel integral expression of theergodic capacity of a point-to-point 2 ×2 Rayleigh fading MIMO channel.

Further information on the tools above can be found in the early book fromMehta [Mehta , 2004], the very clear tutorial from Fyodorov [Fyodorov, 2005] , andthe course notes from Guionnet [Guionnet , 2006], among others. In the following,we introduce the main results concerning limit laws of extreme eigenvalues knownto this day.

9.2.2 Limiting laws of the extreme eigenvalues

The major result on the limiting density of extreme eigenvalues is due to Tracyand Widom. It comes as follows.

Theorem 9.5 ([Tracy and Widom, 1996 ]). Let X N ∈C N ×N be Hermitian with independent Gaussian off-diagonal entries of zero mean and variance 1/N .

Page 258: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 258/562

234 9. Extreme eigenvalues

Denote λ−N and λ+N the smallest and largest eigenvalues of X N , respectively.

Then, as N → ∞N

23

λ+N −2 ⇒ X

+

∼ F +

N 23 λ−N + 2 ⇒ X − ∼ F −

where F + is the Tracy–Widom law given by:

F + (t) = exp − ∞t

(x −t)2q 2(x)dx (9.4)

with q the Painleve II function that solves the differential equation

q (x) = xq (x) + 2 q 3(x)

q (x) ∼x→∞ Ai(x)in which Ai(x) is the Airy function, and F − is dened as

F −(x) 1 −F + (−x).

This theorem is in fact extended in [Tracy and Widom , 1996] to a moregeneral class of matrix spaces, including the space of real symmetric matricesand that of quaternion-valued symmetric matrices, with Gaussian i.i.d. entries.Those are therefore all special cases of Wigner matrices. The space of Gaussianreal symmetric matrices is referred to as the Gaussian orthogonal ensemble (denoted GOE), that of complex Gaussian Hermitian matrices is referred to asthe Gaussian unitary ensemble (GUE), and that of quaternion-valued symmetricGaussian matrices is referred to as the Gaussian symplectic ensemble (GSE). Theseemingly strange “orthogonal” and “unitary” denominations arise from deeperconsiderations on these ensembles, involving orthogonal polynomials, see, e.g.,[Faraut, 2006] for details.

It was later shown [Bianchi et al., 2010] that the random variables λ+N and λ−N

are asymptotically independent, giving therefore a simple description of theirratio, the condition number of X N .

Theorem 9.6 ([Bianchi et al., 2010]). Under the assumptions of Theorem 9.5

N 23 λ+

N −2 , N 32 λ−N + 2 ⇒ (X + , X −)

where X + and X − are independent random variables with respective distributions F + and F −. The random variable λ+

N /λ −N satises

N 23

λ+N

λ−N + 1 ⇒ −

12

X + + X − .

The result of interest to our study of extreme eigenvalues of Wishart andperturbed Wishart matrices was proposed later on by Johansson for thelargest eigenvalue in the complex case [Johansson, 2000] , followed by Johnstone

Page 259: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 259/562

9.2. Distribution of extreme eigenvalues 235

[Johnstone, 2001 ] for the largest eigenvalue in the real case, while it took tenyears before Feldheim and Sodin provided a proof of the result on the smallesteigenvalue in both real and complex cases [Feldheim and Sodin, 2010] . We onlymention here the complex case.

Theorem 9.7 ([Feldheim and Sodin , 2010; Johansson, 2000 ]). Let X N ∈C N ×n

be a random matrix with i.i.d. Gaussian entries of zero mean and variance 1/n .Denoting λ+

N and λ−N the largest and smallest eigenvalues of X N XHN , respectively,

we have:

N 23

λ+N −(1 + √ c)2

(1 + √ c)43 √ c ⇒ X ∼ F +

N 23 λ−N −(1 −√ c)2

−(1 −√ c)43 √ c ⇒ X ∼ F +

as N, n → ∞ with c = lim N N/n < 1 and F + the Tracy–Widom distribution dened in Theorem 9.5 . Moreover, the convergence result for λ+

N holds also for c ≥ 1.

The empirical against theoretical distributions of the largest eigenvalues of X N X

HN are depicted in Figure 9.3, for N = 500, c = 1/ 3.

Observe that the Tracy–Widom law is largely weighted on the negative half line. This means that the largest eigenvalue of X N X

HN has a strong tendency

to lie much inside the support of the l.s.d. rather than outside. For the samescenario N = 500, c = 1 / 3, we now depict in Figure 9.4 the Tracy–Widom lawagainst the empirical distribution of the largest eigenvalue of T

12N X N X

HN T

12N in

the case where T N ∈R N ×N is diagonal composed of all ones but for T 11 = 1.5.From Theorem 7.2, no eigenvalue is found outside the asymptotic spectrum of the Marcenko–Pastur law. Figure 9.4 suggests that the largest eigenvalue of T

12N X N X

HN T

12N does not converge to the Tracy–Widom law since it shows a much

heavier tail in the positive side; this is however not true asymptotically. Theasymptotic limiting distribution of the largest eigenvalue of T

12N X N X H

N T12N is

still the Tracy–Widom law, but the convergence towards the second order limitarises at a seemingly much slower rate. This is proved in the following theorem.To appreciate the convergence towards the Tracy–Widom law, N must then betaken much larger.

Theorem 9.8 ([Baik et al., 2005]). Let X N ∈C N ×n have i.i.d. Gaussian entries of zero mean and variance 1/n and T N = diag( τ 1 , . . . , τ N ) ∈R N ×N .Assume, for some xed r and k, τ r +1 = . . . = τ N = 1 and τ 1 = . . . = τ k while

τ k+1 , . . . , τ r lie in a compact subset of (0, τ 1). Assume further that the ratio N/nis constant, equal to c < 1 as N, n grow. Denoting λ+

N the largest eigenvalue of T

12N X N X

HN T

12N , we have:

Page 260: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 260/562

Page 261: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 261/562

9.3. Random matrix theory and eigenvectors 237

−4 −2 0 2 40

0.1

0.2

0.3

0.4

0.5

Centered-scaled largest eigenvalue of XX H

D e n s i t y

Empirical eigenvaluesTracy–Widom law F +

Figure 9.3 Density of N 23 c−1

2 (1 + √ c)−43 λ +

N −(1 + √ c)2 against the Tracy–Widomlaw for N = 500, n = 1500, c = 1 / 3, for the covariance matrix model XX H of Theorem 9.6. Empirical distribution taken over 10 000 Monte-Carlo simulations.

T12N X N X

HN T

12N was the inability to visually determine the presence of a spike

τ 1 < 1 + √ c from the asymptotic spectrum of T

12

N X

N X H

N T

12

N . Now, it turns outthat even the distribution of the largest eigenvalue in that case is asymptoticallythe same as that when T N = I N . There is therefore not much left to be done inthe asymptotic regime to perform signal detection under the detection threshold1 + √ c. In that case, we may resort to further limit orders, or derive exactexpressions of the largest eigenvalue distribution. Similar considerations areaddressed in Chapter 16.

We also mention that, in the real case X N ∈R N ×n , if τ 1 > 1 + √ c hasmultiplicity one, Paul proves that the limiting distribution of λ+

N − τ 1 + τ 1 cτ 1 −1

is still Gaussian but with variance double that of the complex case, i.e.2n τ 21 − τ 21 c

(τ 1 −1) 2 [Paul , 2007]. Theorem 9.8 is also extended in [Karoui, 2007]to more general Gaussian sample covariance matrix models, where it is provedthat under some conditions on the population covariance matrix, for any integerk xed, the largest k eigenvalues of the sample covariance matrix have a Tracy–Widom distribution with the same scaling factor but different centering andscaling coefficients.

9.3 Random matrix theory and eigenvectors

Fewer results are known relative to the limiting distribution of the largesteigenvectors. We mention in the following the limiting distribution of the largest

Page 262: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 262/562

238 9. Extreme eigenvalues

−4 −2 0 2 40

0.1

0.2

0.3

0.4

0.5

Centered-scaled largest eigenvalue of XX H

D e n s i t y

Empirical eigenvaluesTracy–Widom law F +

Figure 9.4 Distribution of N 23 c−1

2 (1 + √ c)−43 λ +

N −(1 + √ c)2 against theTracy–Widom law for N = 500, n = 1500, c = 1 / 3, for the covariance matrix modelT

12 XX H T

12 with T diagonal with all entries 1 but for T 11 = 1 .5. Empirical

distribution taken over 10 000 Monte-Carlo simulations.

eigenvector in the spiked model for Gaussian sample covariance matrices, i.e.

normalized Wishart matrices with population covariance matrix composed of eigenvalues that are all ones but for a few larger eigenvalues. This is given in thefollowing.

Theorem 9.9 ([Paul , 2007]). Let X N ∈R N ×n have i.i.d. real Gaussian entries of zero mean and variance 1/n and T N ∈R N ×N be dened as

T N = diag( α 1 , . . . , α 1

k 1

, . . . , α M , . . . , α M

k M

, 1, . . . , 1

N − M

i =1 k i

)

with α1 > .. . > α M > 0 for some positive integer M . Then, as n, N → ∞ with limit ratio N/n → c, 0 < c < 1, for all i ∈ k1 + . . . + kj −1 + 1 , . . . , k 1 + . . . +kj −1 + kj , the eigenvector p i associated with the ith largest eigenvalue λi of T

12N X N X

HN T

12N satises

p Ti e N,i

2 a.s.

−→1− c

( α j −1) 2

1+ cα j −1

, if α j > 1 + √ c0 , otherwise

where eN,i

∈R N denotes the vector with all zeros but a one in position i.

Also, if α j > 1 + √ c has multiplicity one, denoting k M l=1 kl , we write p i =(p T

A,i , p TB,i )T , with p A,i ∈R k the vector of the rst k coordinates and p B,i ∈R k

the vector of the last N −k coordinates. We further take the convention that the

Page 263: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 263/562

9.3. Random matrix theory and eigenvectors 239

coordinate i of p i is non-negative. Then we have, for i = k1 + . . . + kj −1 + 1 , as N, n → ∞ with N/n −c = o(1/ √ n)

(i) the vector p A,i satises

√ n p A,i

p A,i −e M,i ⇒ X

where X is an M -variate Gaussian vector with zero mean and covariance Σ j

given by:

Σ j = 1 − c

(α j −1)2

−1

1≤l≤M l= j

α l α j

(α l −α j )2 e M,l e T

M,l

(ii) the vector p B,i / p B,i is uniformly distributed on the unit sphere of dimension N −k −1 and is independent of p A,i .

Note that (ii) is valid for all nite dimensions. As a matter of fact, (ii) is validfor any i ∈ 1, . . . , min( n, N ). Theorem 9.9 is important as it states in essencethat only some of the eigenvectors corresponding to the largest eigenvalues of aperturbed Wishart matrix carry information. Obviously, as αj tends to 1 + √ c,the almost sure limit of p T

i e N,i tends to zero, while the variance of the secondorder statistics tends to innity, meaning that increasingly less information can

be retrieved as αj → 1 + √ c.In Figure 9.5, the situation of a single population spike α1 = α with

multiplicity one is considered. The matrix dimensions N and n are taken tobe such that N/n = 1/ 3, and N ∈ 100, 200, 400. We compare the averagedempirical projections p T

i e N,i against Theorem 9.9. That is, we evaluate theaveraged absolute value of the rst entry in the eigenvector matrix U N inthe spectral decomposition of T

12N X N X

HN T

12N = U N diag(λ1 , . . . , λ N )U H

N . Weobserve that the convergence rate of the limiting projection is very slow.Therefore, although nothing can be said asymptotically on the eigenvectors of a

spiked model, when α < 1 + √ c, there exists a large range of values of N and nfor which this is not so.We also mention the recent result from Benaych-Georges and Rao [Benaych-

Georges and Rao , 2011] which, in addition to providing limiting positions forthe eigenvalues of some unitarily invariant perturbed random matrix models,Theorem 9.2, provides projection results ` a la Paul. The main result is as follows.

Theorem 9.10. Let X N ∈C N ×N be a Hermitian random matrix with ordered eigenvalues λ1 ≥ . . . ≥ λN . We assume that the e.s.d. F X N converges weakly and almost surely toward F with compact support with inmum a and supremum b,such that λ1 a .s.−→ b and λN a .s.−→ a. Consider also a perturbation Hermitian matrix A N of rank r , with ordered non-zero eigenvalues a1 ≥ . . . ≥ ar . Finally, assume that either X N or A N (or both) are bi-unitarily invariant. Denote Y N the matrix

Page 264: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 264/562

240 9. Extreme eigenvalues

1 2 3 4 50

0.2

0.4

0.6

0.8

1

Population spike value α

A v e r a g e d

| U N

, 1 1

|

Simulation, N = 100Simulation, N = 200Simulation, N = 400Limiting

|U N, 11

|

Figure 9.5 Averaged absolute rst entry |U N, 11 | of the eigenvector corresponding to

the largest eigenvalue in T12N X N X H

N T12N = U diag( λ 1 , . . . , λ N )U H , with X N lled with

i.i.d. Gaussian entries CN (0, 1/n ) and T N ∈R N ×N diagonal with all entries one butfor the rst entry equal to α, N/n = 1 / 3, for varying N .

dened as

Y N = X N + A N

with order eigenvalues ν 1 ≥ . . . ≥ ν N . For i ∈ 1, . . . , r such that 1/a i ∈(−mF (a−), −mF (b+ )) , call zi = ν i if ai > 0 or zi = ν N −r + i if ai < 0, and v i

an eigenvector associated with zi in the spectral decomposition of Y N . As N grows large, we have:

(i)

v i , ker( a i I N −A N ) 2 a.s.

−→ 1

a2i mF − 1

m −1F (1 /a i )

(ii)

v i ,⊕j = i ker( a j I N −A N ) 2 a.s.

−→ 0

where the notation ker(X ) denotes the kernel or nullspace of X , i.e. the space of vectors y such that Xy = 0 and x, A is the norm of the orthogonal projection of x on A.

A similar result for multiplicative matrix perturbations is also available.

Page 265: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 265/562

9.3. Random matrix theory and eigenvectors 241

Theorem 9.11. Let X N and A N be dened as in Theorem 9.10 . Denote ZN

the matrix

ZN =

XN (

IN +

AN )

with ordered eigenvalues µ1 ≥ . . . ≥ µN . For i ∈ 1, . . . , r such that 1/a i ∈(ψ−1

F (a−), ψ−1F (b+ )) with ψF the ψ-transform of F (see Denition 3.6 ), call

zi = ν i if ai > 0 or zi = ν N −r + i if ai < 0, and v i an eigenvector associated with zi in the spectral decomposition of ZN . Then, as N grows large, for i ≥ 1:

(i)

v i , ker( a i I N −A N ) 2 a .s.

−→ − 1

a2i ψ−1

F (1/a i )ψF ψ−1F (1/a i ) + a i

(ii)v i ,⊕j = i ker( a j I N −A N ) 2 a.s.

−→ 0.

The results above have recently been extended by Couillet and Hachem[Couillet and Hachem , 2011], who provide a central limit theorem for the jointuctuations of the spiky sample eigenvalues and eigenvector projections, for theproduct perturbation model of Theorem 9.11. These results are particularlyinteresting in the applicative context of local failure localization in largedimensional systems, e.g. sensor failure or sudden parameter change in large

sensor networks, or link failure in a large interconnected graphs. The underlyingidea is that a local failure may change the network topology, modeled though thecovariance matrix of successive nodal observations, by a small rank perturbation.The perturbation matrix is a signature of the failure which is often easier toidentify from its eigenvector properties than from its eigenvalues, particularly soin homogeneous networks where each failure leads to similar amplitudes of theextreme eigenvalues.

This completes this short section on extreme eigenvectors. Many more resultsare expected to be available on this subject in the near future. Before moving tothe application part, we summarize the rst part of this book and the importantresults introduced so far.

Page 266: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 266/562

Page 267: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 267/562

Page 268: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 268/562

244 10. Summary and partial conclusions

matrices have some nice symmetric structure. It is at this point that we bridgedrandom matrix theory and free probability theory. The latter provides differenttools to study sums and products of a certain class of large dimensional randommatrices. This class of random matrices is characterized by the asymptoticfreeness property, which is linked to rotational invariance properties. Froma practical point of view, asymptotic freeness arises only for random matrixmodels based on Haar matrices and on Gaussian random matrices. We alsointroduced the free probability theory from a moment-cumulant viewpoint, whichwas illustrated to be a very convenient tool to study the l.s.d. of involvedrandom matrix models through their successive free moments. In particular,we showed that, through combinatorics calculus that can be automated on amodern computer, the successive moments of the l.s.d. of (potentially involved)

sums, products, and information plus noise models of asymptotically free randommatrices can be easily derived. This bears some advantages compared to theStieltjes transform approach for which case-by-case treatment of every randommatrix model must be performed.

From the basic combinatorial grounds of free probability theory, it was thenshown that we can go one step further into the study of more structuredrandom matrices. Specically, it was shown that, by exploiting softer invariancestructures of some random matrices, such as the left permutation invarianceof some types of random matrices, it is possible to derive expressions of the successive moments of more involved random matrices, such as random

Vandermonde matrices. Moreover, the rotational invariance of these matrixmodels allows us to extend expressions of the moments of the l.s.d. to expressionsof the moments of the expected e.s.d. for all nite dimensions. This allows us torene the moment estimates when dealing with small dimensional matrices. Thisextrapolation of free probability however is still in its infancy, and is expectedto produce a larger number of results in the coming years.

However, the moment-based methods, despite their inherent simplicity,suffer from several shortcomings. First, apart from the very recent results onVandermonde random matrices, which can be seen as a noticeable exception,

moment-based approaches are only useful in practice for dealing with Gaussianand Haar matrices. This is extremely restrictive compared to the models treatedwith the analytical Stieltjes transform approach, which encompass to this daymatrix models based on random matrices with independent entries (samplecovariance matrices, matrices with a variance prole, doubly correlated sums of such matrices, etc.) as well as models based on Haar matrices. Secondly, providingan expression of successive moments to approximate a distribution functionassumes that the distribution function under study is uniquely characterized byits moments, and more importantly that these moments do exist. If the formerassumption fails to be satised, the computed moments are mostly unusable; if

the latter assumption fails, then these moments cannot even be computed. Dueto these important limitations, we dedicated most of Part I to a deep study of the Stieltjes transform tool and of the Stieltjes transform-based methods used

Page 269: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 269/562

245

to determine the l.s.d. of involved random matrix models, rather than momentsmethods.

To be complete, we must mention that recent considerations, mostly spurredby Pastur, suggest that, for most classical matrix models discussed so far, itis possible to prove that the l.s.d. of matrix models with independent entriesor with Gaussian independent entries are asymptotically the same. This can beproved by using the Gaussian method, introduced in Chapter 6, along witha generalized integration by parts formula and Nash–Poincare inequality forgeneric matrices with independent entries. Thanks to this method, Pastur alsoshows that central limit theorems for matrices with independent entries can berecovered from central limit theorems for Gaussian matrices (again accessiblethrough the Gaussian method). Since the latter is much more convenient and

much more powerful, as it relies on appreciable properties of the Gaussiandistribution, this last approach may adequately replace in the future the Stieltjestransform method, which is sometimes rather difficult to handle. The tool thatallows for an extension of the results obtained for matrices with Gaussian entriesto unconstrained random matrices with independent entries is referred to as theinterpolation trick , see, e.g., [Lytova and Pastur, 2009] . Incidentally, for simplerandom matrix models, such as X N X

HN , where √ nX N ∈C N ×n has i.i.d. entries

of zero mean, unit variance, and some order four cumulant κ, but not only, itcan be shown that the variance of the central limit for linear statistics of theeigenvalues can be written under the form σ2

Gauss + κσ 2 , where σ2Gauss is the

variance of the central limit for the Gaussian case and σ2 is some additionalparameter (remember that κ = 0 in the Gaussian case).

Regarding random matrix models, for which not only the eigenvalue but alsothe eigenvector distribution plays a role, we further rened the concept of l.s.d.to the concept of deterministic equivalents of the e.s.d. These deterministicequivalents are rst motivated by the fact that there might not exist a l.s.d.in the rst place and that the deterministic matrices involved in the models(such as the side correlation matrices R k and T k in Theorem 6.1) may not havea l.s.d. as they grow large. Also, even if there is a limiting d.f. F , it is rather

inconvenient that two different series of e.s.d. F B 1

, F B 2

, . . . and F B 1

, F B 2

, . . . ,both converging to the same l.s.d. F , are attributed the same deterministicapproximation. Instead, we introduced the concept of deterministic equivalentswhich provide a specic approximate d.f. F N for F B N such that, as N growslarge, F N −F B N

⇒ 0 almost surely; this way, F N can approximate F B N moreclosely than would F and the limiting result would still be valid even if F B N

does not have a limit d.f.. We then introduced different methodologies (i) todetermine, through its Stieltjes transform mN , the deterministic equivalent F N ,(ii) to show that the xed-point equation to which mN (z) is a solution admits aunique solution on some restriction of the complex plane, and (iii) to show that

mN (z) −mF B N (z) a.s.−→ 0 and therefore F N −F B N

⇒ 0 almost surely. This leadsto a much more involved work than just showing that there exists a l.s.d. to B N ,but it is necessary to ensure the stability of the applications derived from these

Page 270: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 270/562

246 10. Summary and partial conclusions

results. Deterministic equivalents were presented for various models, such as thesum of doubly correlated Gram matrices or general Rician models. For furtherapplication purposes, we also derived deterministic equivalents for the Shannontransform of these models.

We then used the Stieltjes transform tool to dig deeper into the study of the empirical eigenvalue distribution of some matrix models, and especially theempirical eigenvalue distribution of sample covariance matrices and informationplus noise matrices. While it is already known that the e.s.d. of a samplecovariance matrix has a limit d.f. whenever the e.s.d. of the population covariancematrix has a limit, we showed that more can be said about the e.s.d. We observedrst that, as the matrix dimensions grow large, no eigenvalue is found outsidethe support of the limiting d.f. with probability one (under mild assumptions).

Then we observed that, when the l.s.d. is formed of a nite union of compactsets, as the matrix dimensions grow large, the number of eigenvalues found inevery set (called a cluster) is exactly what we would expect. In particular, if the population covariance matrix in the sample covariance matrix model (or theinformation matrix in the information plus noise matrix model) is formed of anite number K of distinct eigenvalues, with respective multiplicities N 1 , . . . , N K

(each multiplicity growing with the system dimensions), then the number of eigenvalues found in every cluster of the e.s.d. exactly matches each N k oran exact sum of consecutive N k with high probability. Many properties forsample covariance matrix models were then presented: determination of the exact

limiting support, condition for cluster separability, etc. The same types of resultswere presented for the information plus noise matrix models. However, in thisparticular case, only exact separation for the Gaussian case has been establishedso far.

For both sample covariance matrix and information plus noise matrixmodels, eigen-inference, i.e. inverse problems based on the matrix eigenstructure,was performed to provide consistent estimates for some functionals of thepopulation eigenvalues. Those estimators, that are consistent in the sense of beingasymptotically unbiased as both matrix dimensions grow large with comparable

sizes, were named G-estimators after Girko who calculated a large number of such consistent estimators.We then introduced models of sample covariance matrices whose population

covariance matrix has a nite number of distinct eigenvalues, but for whichsome of the eigenvalues have nite multiplicity, not growing with the systemdimensions. These models were referred to as spiked models, as these eigenvalueswith small multiplicity have an outlying behavior. For these types of matrices,the previous analysis no longer holds and specic study was made. It wasshown that these models exhibit a so-called ‘phase transition’ effect in the sensethat, if a population eigenvalue with small multiplicity is greater than a given

threshold, then eigenvalues of the sample covariance matrix will be found withhigh probability outside the support of the l.s.d., the number of which beingequal to the multiplicity of the population eigenvalue. On the contrary, if the

Page 271: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 271/562

247

population eigenvalue is smaller than this threshold, then no sample eigenvalueis found outside the support of the l.s.d., again with high probability, and it wasshown that the outlying population eigenvalue was then mapped to a sampleeigenvalue that converges to the right edge of the spectrum. It was thereforeconcluded that, from the observation of the l.s.d. (or in practice, from theobservation of the e.s.d. for very large matrices), it is impossible to infer onthe presence of an outlying eigenvalue in the population covariance matrix. Wethen dug deeper again to study the behavior of the right edge of the Marcenko–Pastur law and more specically the behavior of the largest eigenvalue in thee.s.d. of normalized Wishart matrices.

The study of the largest eigenvalue requires different tools than theStieltjes transform method, among which large deviation analysis, orthogonal

polynomials, etc. We observed a peculiar behavior of the largest eigenvalue of the e.s.d. of uncorrelated normalized Wishart matrices, which does not have aclassical central limit with convergence rate O(N

12 ) but a Tracy–Widom limit

with convergence rate O(N 23 ). It was then shown that spiked models in which the

isolated population eigenvalues are below the aforementioned critical thresholddo not depart from this asymptotic behavior, i.e. the largest eigenvalue convergesin distribution to the Tracy–Widom law, although the matrix size must benoticeably larger than in the non-spiked model for the asymptotic behaviorto match the empirical distribution. When the largest isolated populationeigenvalue is larger than the threshold and of multiplicity one, then it has a

central limit with rate O(N 12 ).

If the results given in Part I have not all been introduced in view of practicalwireless communications applications, at least most of them have. Some of them have in fact been designed especially for telecommunication purposes,e.g. Theorems 5.8, 6.4, 6.5, etc. We show in Part II that the techniquespresented in the present part can be used for a large variety of problemsin wireless communications, aside from the obvious applications to MIMOcapacity, MAC and BC rate regions, signal sensing already mentioned. Of particular interest will be the adaption of deterministic equivalent methods

to the performance characterization of linearly precoded broadcast channelswith transmit correlation, general user path-loss pattern, imperfect channelstate information, channel quantization errors, etc. Very compact deterministicequivalent expressions will be obtained, where exact results are mathematicallyintractable. This is to say the level of details and the exibility that randommatrix theory has reached to cope with more and more realistic systemmodels. Interesting blind direction of arrival and distance estimators will alsobe presented that use and extend the analysis of the l.s.d. of sample covariancematrix models developed in Section 7.1 and Chapter 8. These examples translatethe aptitude of random matrix theory to deal today with more general questions

than merely capacity evaluation.

Page 272: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 272/562

Page 273: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 273/562

Part II

Applications to wirelesscommunications

Page 274: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 274/562

Page 275: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 275/562

11 Introduction to applications in

telecommunications

In the preface of [Bai and Silverstein, 2009] , Silverstein and Bai provide atable of the number of scientic publications in the domain of random matrixtheory for ten-year periods. The table reveals that the number of publicationsroughly doubled from one period to the next, with an impressive total of more than twelve hundred publications for the 1995-2004 period. This trendis partly due to the mathematical tools developed over the years that allowfor more and more possibilities for matrix model analysis. The major reasonthough is related to the increasing complexity of the system models employed inmany elds of physics which demand low complexity analysis. We have alreadymentioned in the introductory chapter that nuclear physics, biology, nance,and telecommunications are among the elds in which the system complexityinvolved in the daily work of engineers is growing at a rapid pace. The second

part of this book is entirely devoted to wireless communications and to somerelated signal processing topics. The reader must nonetheless be aware thatmany models developed here can be adapted to other elds of research, thetypical example of such models being the sample covariance matrix model.

In the following section, we provide a brief historical account of thepublications in wireless communications dealing with random matrices (fromboth small and large dimensional viewpoints), from the earlier results in idealtransmission channels down to recent rened examples reecting more realisticcommunication environments. It will appear in particular to the reader that, inthe latest works, the hypotheses made on channel conditions are precise enoughto take into account (sometimes simultaneously) multi-user transmissions, verygeneral channel models with both transmit and receive correlations, Ricianmodels, imperfect channel state information at the sources and the receivers,integration of linear precoders and decoders, inter-cell interference, etc.

11.1 Historical account of major results

It is often mentioned that Tse and Hanly [Tse and Hanly , 1999] initiated therst contribution of random matrix theory to information theory. We wouldlike to insist, following our point that random matrix theory deals both withlarge dimensional random matrices and small dimensional random matrices, that

Page 276: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 276/562

252 11. Introduction to applications in telecommunications

the important work from Telatar on the multiplexing gain of multiple antennacommunications [Telatar, 1995, 1999 ] was also part of this multi-dimensionalsystem analysis trend of the late nineties. From this time on, the interest inrandom matrix theory grew vividly, to such an extent that more and moreresearch laboratories dedicated their time to various applications of randommatrix theory in large systems. The major driver for this dramatic increaseof work in applied random matrix theory is the recent growth of all systemdimensions in recent telecommunication systems. The now ten-year-old storyof random matrices for wireless communications started with the performancestudy of code division multiple access.

11.1.1 Rate performance of multi-dimensional systems

The rst large dimensional system which was approached by asymptotic analysisis the code division multiple access (CDMA) technology that came along withthe third generation of mobile phone communications. We remind that CDMAsucceeded the time division multiple access (TDMA) technology used for thesecond generation of mobile phone communications. In a network with TDMAresource sharing policy, users are successively allocated an exclusive amountof time to exchange data with the access points. Due to the established xedpattern of time division, one of the major issues of the standard was thenthat each user could only be allocated a unique time slot, while at the sametime a very strict maximal number of users could be accepted by a givenaccess point, regardless of the users’ requests in terms of quality of service.In an effort to increase the number of users for a given access point, whiledynamically balancing the quality of service offered to each terminal, the CDMAsystem was selected for the subsequent mobile phone generation. In a CDMAsystem, each user is allocated a (usually long) spreading code that is maderoughly orthogonal to the other users’ codes, in such a way that all users cansimultaneously receive data while experiencing a limited amount of interferencefrom concurrent communications, due to code orthogonality. Equivalently, in the

uplink, the users can simultaneously transmit orthogonal streams that can bedecoded free of interference at the receiver. Since the spreading codes are rarelyfully orthogonal (unless orthogonal codes such as Hadamard codes are used),the more users served by an access point, the more the interference and then theless the quality of service; but at no time is a user rejected for lack of availableresource (unless an excessive number of users wishes to access the network).While the achievable transmission data rate for TDMA systems is rather easy toevaluate, the capacity for CDMA networks depends on the precoding strategyapplied. One strategy is to build purely orthogonal codes so that all users donot interfere with each other; we refer to this precoding policy as orthogonal CDMA. This has the strong advantage of making decoding easy at the receiverand discards the so-called near-far effect that leads non-orthogonal users thattransmit much power to interfere with more than other users that transmit less

Page 277: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 277/562

11.1. Historical account of major results 253

power. However, this is not better than TDMA in terms of achievable rates, ismore demanding in terms of time synchronization (i.e. the codes of two users notproperly synchronous are no longer orthogonal), and suffers signicantly from thefrequency selectivity of the transmission medium (which induces convolutions of the orthogonal codes, breaking then the orthogonality to some extent). For allthese reasons, it is often more sensible to use random i.i.d. codes. This secondprecoding policy is called random CDMA . In practice, codes may originatefrom random vector generators tailored so to mitigate inter-user interference;these are called pseudo-random CDMA codes. Now the problem is to evaluatethe communication rates achieved by such precoders. Indeed, in the orthogonalCDMA approach, assuming frequency at channel conditions for all users andchannel stability over a large number of successive symbol periods, the rates

achieved in the uplink (from user terminals to access points) are maximal whenthe orthogonal codes are as long as the number of users N , and we have thesystem capacity C orth (σ2) for a noise power σ2 given by:

C orth (σ2) = 1N

log det I N + 1σ2 WHH H W H

where W ∈C N ×N is the unitary matrix whose columns are the CDMA codesand H = diag( h1 , . . . , h N ) is the diagonal matrix of the channel gains of the users1, . . . , N . By the property det( I + AB ) = det( I + BA ) for matrices A , B suchthat both AB and BA are square matrices and the fact that W is unitary, this

reduces to

C orth (σ2) = 1N

log det I N + 1σ2 HH H =

1N

N

i=1

log 1 + |h i |2σ2 .

This justies our previous statement on the equivalence between TDMA andCDMA rate performance. When it comes to evaluate the capacity C rand (σ2) of random CDMA systems, under the same conditions, we have:

C rand (σ2) = 1N

log det I N + 1σ2 XHH H X H

with X ∈C N ×N the matrix whose columns are the users’ random codes. Theresult here is no longer trivial and appeals to random matrix analysis. Since thenumber of users attached to a given access point is usually assumed large, wecan use results from large dimensional random matrix theory. The analysis of such systems in the large dimensional regime was performed by Shamai, Tse,and Verd´u in [Tse and Verd´u, 2000; Verd u and Shamai , 1999].

These capacity expressions may however not be realistic achievable rates inpractice, in the sense that they imply non-linear processing at the receivingaccess points (e.g. decoding based on successive interference cancellation). Forcomplexity reasons, such non-linear processing might not be feasible, so thatlinear precoders or decoders are often preferred. For random CDMA codes,the capacity achieved by linear decoders in CDMA systems such as matched-

matched filters, linear minimum mean square error (LMMSE) decoders, etc., has been extensively studied, from the early work of Tse and Hanly [Tse and Hanly, 1999] in frequency flat channels, Evans and Tse [Evans and Tse, 2000] in frequency selective channels, Li and Tulino for reduced-rank LMMSE decoders [Li et al., 2004], to Verdú and Shamai [Verdú and Shamai, 1999] for several receivers in frequency flat channels.

While the single-cell orthogonal CDMA capacity does not require the random matrix machinery, the study of random CDMA systems is more difficult. Paradoxically, when it comes to linear decoders, the tendency is the opposite, as the performance study is in general more involved in the orthogonal case than in the random i.i.d. case. The first substantial work on the performance of linear orthogonal precoded systems appeared in 2003 with Debbah and Hachem [Debbah et al., 2003a] on the performance of MMSE decoders in orthogonal CDMA frequency flat fading channels. The subsequent work of Chaufray and Hachem [Chaufray et al., 2004] deals with more realistic sub-optimum MMSE receivers when only partial channel state information is available at the receiver. This comes as a first attempt to consider more realistic transmission conditions. Both aforementioned works have the particularity of providing genuine mathematical results that were at the time not part of the available random matrix literature. As will often turn out for the intricate models introduced hereafter, the rapid involvement of the wireless communication community in random matrices came along with a fast exhaustion of all the "plug-and-play" results available in the mathematical literature. As a consequence, most results discussed hereafter involving non-trivial channel models often require a deep mathematical introduction, which fully justifies Part I of the present book.

In the same vein as linear systems, linear precoders and decoders for multi-user detection have been given a lot of attention in the past ten years, with the increasing demand for the study of practical scenarios involving a large number of users. The motivation for using random matrix theory here is the increase in the number of users, as well as the increase in the number of antennas used for multi-user transmission and decoding. Among the notable works in this domain, we mention the analysis of Tse and Zeitouni [Tse and Zeitouni, 2000] on linear multi-user receivers for random CDMA systems. The recent work by Wagner and Couillet [Wagner et al., 2011] derives deterministic equivalents for the sum rate in multi-user broadcast channels with linear precoding and imperfect channel state information at the transmitter. The sum rate expressions obtained in [Wagner et al., 2011] provide different system characterizations. In particular, the optimal training time for channel estimation in quasi-static channels can be evaluated, the optimal cell coverage or number of users to be served can be assessed, etc. We come back to the performance of such linear systems in Chapters 12 and 14.

One of the major contributions to the analysis of multi-dimensional system performance concerns the capacity derivation of multiple antenna technologies
when several transmit and receive antennas are assumed. The derivation from Telatar [Telatar, 1999] of the capacity of frequency flat point-to-point multiple antenna transmissions with a Gaussian i.i.d. channel matrix involves complicated small dimensional random matrix calculus. The generalization of Telatar's calculus to non-Gaussian or correlated Gaussian models is even more involved and cannot be treated in general. Large dimensional random matrices help derive large dimensional capacity expressions for these more exotic channel models. The first results for correlated channels are due to Chuah and Tse [Chuah et al., 2002], Mestre and Fonollosa [Mestre et al., 2003], and Tulino and Verdú [Tulino and Verdú, 2005]. In the last mentioned article, an implicit expression for the asymptotic ergodic capacity of Kronecker channel models is derived, which does not require integral calculus. We recall that a Kronecker-modeled channel is defined as a matrix H = R^{1/2} X T^{1/2} ∈ C^{n_r × n_t}, where R^{1/2} ∈ C^{n_r × n_r} and T^{1/2} ∈ C^{n_t × n_t} are deterministic non-negative Hermitian matrices, and X ∈ C^{n_r × n_t} has Gaussian independent entries of zero mean and normalized variance. This is somewhat less general than similar models where X has non-necessarily Gaussian entries. We must therefore make the distinction between the Kronecker model and the doubly correlated i.i.d. matrix model, when necessary. Along with the Kronecker model of [Tulino and Verdú, 2005], the capacity of the very general Rician channel model H = A + X is derived in [Hachem et al., 2007], where A ∈ C^{n_r × n_t} is deterministic and X ∈ C^{n_r × n_t} has entries X_ij = σ_ij Y_ij, where the elements Y_ij are i.i.d. (non-necessarily Gaussian) and the factors σ²_ij form the deterministic variance profile.

These results on point-to-point communications in frequency flat channels are further generalized to multi-user communications, to frequency selective communications, and to the most general multi-user frequency selective communications, in doubly correlated channels. The frequency selective channel results are successively due to Moustakas and Simon [Moustakas and Simon, 2007], who conjecture the capacity of Kronecker multi-path channels using the replica method briefly discussed in Section 19.2, followed by Dupuy and Loubaton, who prove these earlier results and additionally determine the capacity achieving signal precoding matrix in [Dupuy and Loubaton, 2010]. On the multi-user side, using tools from free probability theory, Peacock and Honig [Peacock et al., 2008] derive the limit capacity of multi-user communications. Their analysis is based on the assumptions that the number of antennas per user grows large and that all user channels are modeled as Kronecker, such that the left correlation matrices are co-diagonalizable. In a parallel work, Couillet et al. [Couillet et al., 2011a] relax the constraints on the channels of [Peacock et al., 2008] and provide a deterministic equivalent for the points in the rate region of multi-user multiple access and broadcast channels corresponding to deterministic precoders, with general doubly correlated channels. Moreover, [Couillet et al., 2011a] provides an expression of the ergodic capacity maximizing precoding matrix for each user in the multiple access uplink channel and therefore an expression for the boundary of the ergodic multiple access rate region. Both
results can then be combined to derive a deterministic equivalent of the rate region for multi-user communications over frequency selective Kronecker channels.

To establish results on the outage capacity of multiple antenna transmissions, we need to go beyond the expression of deterministic equivalents of the ergodic capacity and study limiting results on the capacity around its mean. Most such results are central limit theorems. We mention in particular the important result from Hachem and Najim [Hachem et al., 2008b] on a central limit theorem for the capacity of Rician multiple antenna channels, which follows from their previous work in [Hachem et al., 2007].

The previous multi-user multiple antenna considerations may then be extended to multi-cellular systems. In [Zaidel et al., 2001], the achievable rates in multi-cellular CDMA networks are studied in Wyner's infinite linear cell array model [Wyner, 1994]. Abdallah and Debbah [Abdallah and Debbah, 2004] provide conditions on network planning to improve the system capacity of a CDMA network with matched-filter decoding, assuming an infinite number of cells in the network. Circular cell array models are studied by Hoydis et al. [Hoydis et al., 2011d], where the optimal number of serving base stations per user is derived in a multi-cell multiple antenna channel model with finite channel coherence time. The finite channel coherence duration implies a limitation of the gain of multi-cell cooperation due to the time required for synchronization (linked to the fundamental diversity-multiplexing trade-off [Tse and Zheng, 2003]), hence a non-trivial rate optimum. Limiting results on the sum capacity in large cellular multiple antenna networks are also studied in [Aktas et al., 2006] and [Huh et al., 2010].

Networks with relays and ad-hoc networks are also studied using tools from random matrix theory. We mention in particular the work of Leveque and Telatar [Leveque and Telatar, 2005] who derive scaling laws for large ad-hoc networks. On the relay network side, Fawaz et al. [Fawaz et al., 2011] analyze the asymptotic capacity of relay networks in which relays perform decode-and-forward. Game theoretical aspects of large dimensional systems were also investigated in light of the results from random matrix theory, among which is the work of Bonneau et al. [Bonneau et al., 2007] on power allocation in large CDMA networks. Chapters 13–15 develop the subject of large multi-user and multiple antenna channels in detail.

11.1.2 Detection and estimation in large dimensional systems

One of the recent hot topics in applied random matrix theory for wireless communications deals with signal detection and source separation capabilities in large dimensional systems. The sudden interest in the late nineties for signal detection using large matrices was spurred simultaneously by the recent mathematical developments and by the need for detection capabilities in large dimensional networks. The mathematical milestone might well be the early work of Geman in 1980 [Geman, 1980], who proved that the largest eigenvalue of
XX^H, where X has properly normalized central i.i.d. entries with some assumption on the moments, converges almost surely to the right edge of the support of the Marcenko–Pastur law. This work was followed by generalizations under less constrained assumptions, and then by the work of Tracy and Widom [Tracy and Widom, 1996] on the limiting distribution of the largest eigenvalue for this model when X is Gaussian.

On the application side, when source detection is to be performed with the help of a single sensing device, i.e. a single antenna, the optimal decision criterion is given by the Neyman–Pearson test, which was originally derived by Urkowitz for Gaussian channels [Urkowitz, 1967]. This method was then refined for more realistic channel models in, e.g., [Kostylev, 2002; Simon et al., 2003]. The Neyman–Pearson test assumes the occurrence of two possible events H_0 and H_1 with respective probabilities p_0 and p_1 = 1 − p_0. The event H_0 is called the null hypothesis, which in our context corresponds to the case where only noise is received at the sensing device, while H_1 is the complementary event, which corresponds to the case where a source is emitting. Upon reception of a sequence of n symbols gathered in the vector y = (y_1, ..., y_n)^T at the sensing device, the Neyman–Pearson criterion states that H_1 must be decided if

\frac{P(H_1 | y)}{P(H_0 | y)} = \frac{P(y | H_1)\, p_1}{P(y | H_0)\, p_0} > \gamma

for a given threshold γ. If this condition is not met, H_0 must be decided. In the Gaussian channel context, the approach is simple as it merely consists in computing the empirical received power, i.e. the averaged square amplitude (1/n) Σ_i |y_i|², and deciding the presence or the absence of a signal source based on whether the empirical received power is greater or less than the threshold γ. Due to the Gaussian assumption, it is then easy to derive the probability of false negatives, i.e. the probability of missing the presence of a source, and of false positives, i.e. the probability of declaring the presence of a source when there is none, which are classical criteria to evaluate the performance of a source detector. Observe already that estimating these performances requires knowing the statistics of the y_i or, in the large n regime, knowing limiting results on Σ_i |y_i|².

The generalization to multi-dimensional data y_1, ..., y_n ∈ C^N could consist in summing the total empirical power received across the N sensors, i.e. Σ_{i=1}^n ||y_i||², and then comparing this value to some threshold. This is what is usually done for arrays of N sensors, as it achieves an N-fold increase in performance. Calling Y = [y_1, ..., y_n] ∈ C^{N×n}, the empirical power reduces to the normalized trace of YY^H. However simple and widely used, see, e.g., [Meshkati et al., 2005], this solution may not always be Neyman–Pearson optimal. Couillet and Debbah [Couillet and Debbah, 2010a] derive the Neyman–Pearson optimal detector in finitely large multiple antenna channels. Their work assumes that both the transmit symbols and the channel fading links are Gaussian, which are shown to be optimal assumptions in the maximum-entropy principle sense [Jaynes,
1957a,b] when limited prior information is known about the communication channel. The derived Neyman–Pearson criterion for this model turns out to be a rather complex formula involving all the eigenvalues of YY^H and not only their sum. Moreover, under the realistic assumption that the signal-to-noise ratio is not known a priori, the decision criterion takes an inconvenient integral form. This is presented in detail in Section 16.3. Simpler solutions are however sought, which may benefit from asymptotic considerations on the eigenvalue distribution of YY^H. A first approach, initially formulated by Zeng and Liang [Zeng and Liang, 2009] and accurately studied by Cardoso and Debbah [Cardoso et al., 2008], consists in considering the ratio of the largest to the smallest eigenvalue of YY^H and deciding the presence of a signal when this ratio exceeds some threshold. This approach is simple and does not assume any prior knowledge of the signal-to-noise ratio. A further approach, actually more powerful, consists in replacing the difficult Neyman–Pearson test by the suboptimal though simpler generalized likelihood ratio test. This is performed by Bianchi et al. [Bianchi et al., 2011]. These methods are studied in detail in Section 16.4.
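
As a rough illustration of the two simple detection statistics just mentioned (the normalized trace of YY^H and the ratio of its extreme eigenvalues), the following toy sketch (ours, not from the text; the dimensions, noise level, and seed are arbitrary) computes both statistics under the noise-only hypothesis H_0 and under H_1 with a single transmitting source.

    import numpy as np

    rng = np.random.default_rng(1)
    N, n, sigma2 = 8, 200, 1.0          # sensors, samples, noise variance

    def trace_statistic(Y):
        # (1/(N*n)) tr(Y Y^H): average received power per sensor and per sample
        return np.real(np.trace(Y @ Y.conj().T)) / (N * n)

    def eig_ratio_statistic(Y):
        # ratio of the largest to the smallest eigenvalue of (1/n) Y Y^H
        eig = np.linalg.eigvalsh(Y @ Y.conj().T / n)
        return eig[-1] / eig[0]

    noise = np.sqrt(sigma2 / 2) * (rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n)))
    h = (rng.standard_normal((N, 1)) + 1j * rng.standard_normal((N, 1))) / np.sqrt(2)
    s = (rng.standard_normal((1, n)) + 1j * rng.standard_normal((1, n))) / np.sqrt(2)

    Y0 = noise                  # hypothesis H0: noise only
    Y1 = h @ s + noise          # hypothesis H1: one source over an i.i.d. fading channel

    print(trace_statistic(Y0), trace_statistic(Y1))
    print(eig_ratio_statistic(Y0), eig_ratio_statistic(Y1))

Both statistics are then compared to a threshold chosen according to the desired false alarm rate; the asymptotic distributions required to set such thresholds are precisely the object of the results discussed above.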

Another non-trivial problem of large matrices with both dimensions growing at a similar rate is the consistent estimation of functionals of eigenvalues and eigenvectors. As already largely introduced in Chapter 8, in many disciplines we are interested in estimating, e.g., population covariance matrices, eigenvalues of such matrices, etc., from the observation of the empirical sample covariance matrix. With both the number of samples n and the sample size N growing large at the same rate, it is rather obvious that no good estimate of the whole population covariance matrix can be relied on. However, consistent estimates of functionals of the eigenvalues and eigenvectors have been shown to be obtainable. In the context of wireless communications, sample covariance matrices can be used to model the downlink of multiple antenna Gaussian channels with directions of arrival. Using subspace methods, the directions of arrival can be evaluated as a functional of the eigenvectors of the population null space, i.e. the space orthogonal to the space spanned by the transmit vectors. Being able to evaluate these directions of arrival makes it possible to characterize the position of a source. In [Mestre, 2008a], Mestre proves that the classical n-consistent MUSIC approach from Schmidt [Schmidt, 1986] to estimate the directions of arrival is largely biased when (n, N) grow large at a similar rate. Mestre then offers an alternative estimator in [Mestre, 2008b], which unfolds directly from Theorem 8.4, under a separability assumption on the clusters in the l.s.d. of the sample covariance matrix. This method is shown to be (n, N)-consistent, while exhibiting slightly larger estimate variances. A similar direction of arrival estimation is performed by Vallet et al. [Vallet et al., 2010] in the more realistic scenario where the received data vectors are still deterministic but unknown to the sensors. This approach uses an information plus noise model in place of the sample covariance matrix model.
The problem of detecting multiple multiple-antenna signal sources, and of estimating the respective distances or powers of each source based on the data collected from an array of sensors, enters the same category of problems. Couillet and Silverstein [Couillet et al., 2011c] derive an (n, N)-consistent estimator for this purpose. This estimator assumes simultaneous transmissions over i.i.d. fading channels and is sufficiently general to allow the transmitted signals to originate from different symbol constellations. The blind estimation by an array of sensors of the distances to sources is particularly suited to the recently introduced self-configurable femto-cells. In a few words, femto-cells are required to co-exist with neighboring networks, by reusing spectrum opportunities left available by these networks. The blind detection of adjacent licensed users therefore allows for a dynamical update of the femto-cell coverage area that ensures a minimum interference to the licensed network. See, e.g., [Calin et al., 2010; Claussen et al., 2008] for an introduction to femto-cells and [Chandrasekhar et al., 2009] for an understanding of the need for femto-cells to evaluate the distance to neighboring users.

The aforementioned methods, although very efficient, are however often constrained by strong requirements that restrict their usage. Suboptimal methods that do not suffer from these shortcomings are then sought. Among them, we mention inversion methods based on convex optimization, an example of which is the inversion algorithm for the sample covariance matrix derived by El Karoui [Karoui, 2008], and moment-based approaches, as derived by Couillet and Debbah [Couillet and Debbah, 2008], Rao and Edelman [Rao et al., 2008], Ryan and Debbah [Ryan and Debbah, 2007a], etc. These methods are usually much more computationally demanding and are no match, in either complexity or performance, for the analytical approaches mentioned before. However, they are much more reliable as they are not constrained by strong prior assumptions on asymptotic spectrum separability. These different problems are discussed at length in Chapter 17.

In the following section, we open a short parenthesis to introduce the currently active field of cognitive radios, which provides numerous open problems related to signal sensing, estimation, and optimization in possibly large dimensional networks.

11.1.3 Random matrices and flexible radio

The field of cognitive radios, also referred to as flexible radios, has attracted increasing interest over the past ten years, spurred by the concept of software defined radios coined by Mitola [Mitola III and Maguire Jr, 1999]. Software defined radios are reconfigurable telecommunication service providers that are meant to dynamically adapt to the client demand. That is, in order to increase the total throughput of multi-protocol communications in a given geographical area [Akyildiz et al., 2006; Tian and Giannakis, 2006], software defined radios will provide various protocol services to satisfy all users in a cell. For instance, cellular phone users in a cognitive radio network will be able to browse the Internet
seamlessly through either WiFi, WiMAX, LTE, or 3G protocols, depending on the available spectral resources. This idea, although not yet fully accepted in the industrial world, attempts to efficiently reuse the largely unused (we might say "wasted") bandwidth, which is of fundamental importance at a time when no more room is left in the electromagnetic frequency spectrum for future high-speed technologies, see, e.g., [Tandra et al., 2009].

The initial concept of software defined radios has then grown into the more general idea of intelligent, flexible, and self-reconfigurable radios [Hur et al., 2006]. Such radios are composed of smart systems at all places of the communication networks, from the macroscopic infrastructure that spans across thousands of kilometers and that must be globally optimized in some sense, down to the microscopic local in-house networks that must ensure a high quality of service to local users with minimum harm to neighboring communications. From the macroscopic viewpoint, the network must harmonize a large number of users at the network and MAC layers. These aspects may be studied in large game-theoretical frameworks, which call for large dimensional analysis. Random matrices, but also mean field theory (see, e.g., [Bordenave et al., 2005; Buchegger and Le Boudec, 2005; Sharma et al., 2006]), are common tools to analyze such large decentralized networks. The random matrix approach allows us to characterize deterministic behaviors in large decentralized games. The study of games with a large number of users invokes concepts such as that of the Wardrop equilibrium [Haurie and Marcotte, 1985; Wardrop, 1952]. The first applications to large wireless communication networks mixing both large random matrix analysis and games are due to Bonneau et al. [Bonneau et al., 2007, 2008], who study the equilibrium of distributed games for power allocation in the uplink of large CDMA networks. The addition of random matrix tools to the game theoretic settings of such large CDMA networks allows the players in the game, i.e. the CDMA users, to derive deterministic approximations of their functions of merit or cost functions.

On the microscopic side, individual terminals are now required to be smart in the sense that they ought to be able to compete for available spectral resources in decentralized networks. The interest in the decentralized approach typically appears in large networks of extremely mobile users where centralized resource allocation is computationally prohibitive. In such a scenario, mobile users must be able to (i) understand their environment and its evolution by developing sensing abilities (this phase of the communication process is often called exploration) and (ii) take optimal decisions concerning resource sharing (this phase is referred to as exploitation). These rather novel constraints on the mobile terminals call for an original, technologically disruptive framework for future mobile communications. Part of the exploration requirements for smart terminals is the ability to detect on-going communications in the available frequency resources. This point has already been discussed and is thoroughly dealt with in Chapter 16. Now, another requirement for smart terminals, actually prior to signal sensing, is that of channel modeling. Given some prior information
on the environment, such as the geographical location or the number of usually surrounding base stations, the terminal must be able to derive a rough model for the expected communication channel; in this way, sensing procedures will be made faster and more accurate. This requirement again calls for a new framework for channel modeling. This is introduced in Chapter 18, where, based on statistical prior information known to the user terminal, various a priori channel probability distributions are derived. This chapter uses extensively the works from Guillaud, Müller, and Debbah [Debbah and Müller, 2005; Guillaud et al., 2007] and relies primarily on small dimensional random matrix tools.

The chapter ordering somewhat parallels the ordering of Part I, by chronologically introducing models that only require limit distribution theorems, then models that require the introduction of deterministic equivalents, and finally detection criteria that rely either on extreme eigenvalue distributions or on asymptotic spectrum considerations and eigen-inference methods. The next chapter deals with system performance evaluation in CDMA communication systems, which is the most documented subject to this day, gathering more than ten years of progress in wireless communications.

12 System performance of CDMA technologies

12.1 Introduction

The following four chapters are dedicated to the analysis of the rate at which information can be reliably transmitted over a physical channel characterized by its space, time, and frequency dimensions. We will often consider situations where one or several of these dimensions can be considered large (and complex) in some sense. Notably, the space dimension will be said to be large when multiple transmit sources or receive sensors are used, which will be the case of multiple antenna, multi-user, or multi-cell communications; the complexity of these channels arises here from the joint statistical behavior of all point-to-point channel links. The time dimension can be said to be large and complex when the inputs to the physical channel exhibit a correlated behavior, such as when signal precoders are used. It may be argued here that the communication channel itself is not complex, but for simplified information-theoretic treatment, the transmission channel in such a case is assumed to be the virtual medium formed by the ensemble precoder and physical channel. Finally, the frequency dimension exhibits complexity when the channel shows fluctuations in the frequency domain, i.e. frequency selectivity, which typically arises as multi-path reflections come into play in the context of wireless communications or when multi-modal transmissions are used in fiber optics.

To respect historical progress in the use of random matrix theory in wireless communications, we should start with the work of Telatar [Telatar, 1995] on multiple antenna communication channels. However, asymptotic considerations were not immediately applied to multiple antenna systems and, as a matter of fact, not directly applied to evaluate mutual information through Shannon transform formulas. Instead, the first applications of large dimensional random matrix theory [Biglieri et al., 2000; Shamai and Verdú, 2001; Tse and Hanly, 1999; Verdú and Shamai, 1999] were motivated by the similarity between expressions of the signal-to-interference plus noise ratio (SINR) in CDMA precoded transmissions and the Stieltjes transform of some distribution functions. We therefore start by considering the performance of such CDMA systems, either with random or with unitary codes.

12.2 Performance of random CDMA technologies

We briefly recall that code division multiple access consists in the allocation of (potentially large) orthogonal or quasi-orthogonal codes w_1, ..., w_K ∈ C^N to a series of K users competing for spectral resource access. Every transmitted symbol, either in the downlink (from the access point to the users) or in the uplink (from the users to the access point), is then modulated by the code of the intended user. By making the codes quasi-orthogonal in the time domain, i.e. in the downlink, by ensuring w_i^H w_j ≃ 0 for i ≠ j, user k receives its dedicated message through a classical matched filter (i.e. by multiplying the input signal by w_k^H), while being minimally interfered with by the messages intended for other users. In the uplink, the access point can easily recover the messages from all users by matched filtering the input signal successively by w_1^H, ..., w_K^H. Matched filtering is however only optimal (from an output SINR viewpoint) when the codes are perfectly orthogonal and when the communication channel does not break orthogonality, i.e. in frequency flat channels. A refined filter in that case is the minimum mean square error (MMSE) decoder, which is optimal in terms of the SINR experienced at the receiver. Ideally, though, regardless of the code being used, the sum rate optimal solution consists, either in the uplink or in the downlink, in proceeding to joint decoding at the receiver. Since this requires a lot of information about all user channel conditions, these optimal filters are never considered in the downlink. The performance of all the aforementioned filters is presented in the following under more and more constraining communication scenarios.

We first start with uplink CDMA communications before proceeding to the downlink scenario, and consider random i.i.d. CDMA codes before studying orthogonal CDMA codes.

12.2.1 Random CDMA in uplink frequency flat channels

We consider the uplink of a random code division multiple access transmission, with K users simultaneously sending their signals to a unique access point or base station, as in, e.g., [Biglieri et al., 2001; Grant and Alexander, 1998; Madhow and Honig, 1994; Müller, 2001; Rapajic and Popescu, 2000; Schramm and Müller, 1999]. Denote N the length of each CDMA spreading code. User k, k ∈ {1, ..., K}, has code w_k ∈ C^N, which has i.i.d. entries of zero mean and variance 1/N. At time l, user k transmits the Gaussian symbol s_k^{(l)}. The channel from user k to the base station is assumed to be non-selective in the frequency domain, constant over the spreading code length, and is written as the product h_k √P_k of a fast varying parameter h_k and a long-term parameter √P_k accounting for the power transmitted by user k and for the shadowing effect. We refer for simplicity to P_k as the power of user k. At time l, we therefore have the transmission model
y^{(l)} = \sum_{k=1}^{K} h_k w_k \sqrt{P_k}\, s_k^{(l)} + n^{(l)}    (12.1)

where n^{(l)} ∈ C^N denotes the additive Gaussian noise vector with entries of zero mean and variance σ², and y^{(l)} ∈ C^N is the signal vector received at the base station. The expression (12.1) can be written in the more compact form

y^{(l)} = W H P^{1/2} s^{(l)} + n^{(l)}

with s^{(l)} = [s_1^{(l)}, ..., s_K^{(l)}]^T ∈ C^K a Gaussian vector of zero mean and covariance E[s^{(l)} s^{(l)H}] = I_K, W = [w_1, ..., w_K] ∈ C^{N×K}, P ∈ C^{K×K} diagonal with kth entry P_k, and H ∈ C^{K×K} diagonal with kth entry h_k.

In the following, we successively study the general expressions of the deterministic equivalents for the capacity achieved by the matched filter, the minimum mean square error filter, and the optimal decoder. Then, we apply our results to the particular cases of the additive white Gaussian noise (AWGN) channel and of Rayleigh fading channels. The AWGN channel corresponds to the channel for which h_k = 1 for all k. The Rayleigh channel is the channel for which h_k has independent real and imaginary parts of variance 1/2, so that |h_k| is Rayleigh distributed and |h_k|² is χ²_2-distributed, with density p(|h_k|²) = e^{−|h_k|²}.
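
The following short sketch (ours, not part of the original text; the dimensions, powers, and random seed are arbitrary choices) generates one realization of the compact model y^{(l)} = W H P^{1/2} s^{(l)} + n^{(l)} with Rayleigh fading and equal powers, and may serve as a basis for the Monte Carlo validations reported later in this chapter.

    import numpy as np

    rng = np.random.default_rng(2)
    N, K, sigma2 = 32, 16, 0.1
    P = np.ones(K)                                                                          # powers P_1, ..., P_K

    W = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2 * N)   # i.i.d. codes, variance 1/N
    h = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)                 # Rayleigh fading h_1, ..., h_K
    s = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)                 # unit-variance Gaussian symbols
    n = np.sqrt(sigma2 / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))        # noise of variance sigma^2

    y = W @ (h * np.sqrt(P) * s) + n       # equals W H P^(1/2) s + n with H = diag(h_k), P = diag(P_k)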

12.2.1.1 Matched filter

We first assume that, at the base station, a matched filter is applied to the received signal. That is, the base station takes the product w_k^H y^{(l)} to retrieve the data transmitted by user k. In this case, the signal power after matched filtering reads:

P_k |h_k|^2 |w_k^H w_k|^2

while the interference plus noise power reads:

w_k^H \left( W H P H^H W^H - P_k |h_k|^2 w_k w_k^H + \sigma^2 I_N \right) w_k

from which the signal-to-interference plus noise ratio γ_k^{(MF)} relative to the data of user k is

\gamma_k^{(MF)} = \frac{P_k |h_k|^2 |w_k^H w_k|^2}{w_k^H \left( W H P H^H W^H - P_k |h_k|^2 w_k w_k^H + \sigma^2 I_N \right) w_k} .

Clearly, from the law of large numbers, as N → ∞, w_k^H w_k → 1 almost surely. Also, from Theorem 3.4, for large N, K such that 0 < lim inf_N K/N ≤ lim sup_N K/N < ∞, we expect to have

w_k^H \left( W H P H^H W^H - P_k |h_k|^2 w_k w_k^H + \sigma^2 I_N \right) w_k - \frac{1}{N} \mathrm{tr}\left( W H P H^H W^H - P_k |h_k|^2 w_k w_k^H + \sigma^2 I_N \right) \to 0
almost surely. However, the conditions of Theorem 3.4 impose that the inner matrix term W H P H^H W^H − P_k|h_k|² w_k w_k^H + σ² I_N be uniformly bounded in spectral norm. This is not the case here, since there is a non-zero probability for the largest eigenvalue of W H P H^H W^H to grow large. Nonetheless, Theorem 3.4 can be generalized to the case where the inner matrix W H P H^H W^H − P_k|h_k|² w_k w_k^H + σ² I_N only has almost surely bounded spectral norm, from a simple application of Tonelli's theorem. Lemma 14.2 provides the exact statement and the proof of this result, which is even more critical for applications in multi-user MIMO communications, see Section 14.1. From Theorem 7.1, the almost sure uniformly bounded norm of W H P H^H W^H − P_k|h_k|² w_k w_k^H + σ² I_N is valid here for uniformly bounded H P H^H. Precautions must however be taken for unbounded H H^H, such as for Rayleigh channels. Now

\frac{1}{N} \mathrm{tr}\left( W H P H^H W^H - P_k |h_k|^2 w_k w_k^H + \sigma^2 I_N \right) - \frac{1}{N} \mathrm{tr}\left( W H P H^H W^H + \sigma^2 I_N \right) = -\frac{1}{N} P_k |h_k|^2 w_k^H w_k \to 0

almost surely. Together, this leads to

w_k^H \left( W H P H^H W^H - P_k |h_k|^2 w_k w_k^H + \sigma^2 I_N \right) w_k - \frac{1}{N} \mathrm{tr}\left( W H P H^H W^H + \sigma^2 I_N \right) \to 0

almost surely. The second term on the left-hand side can be divided into σ² and the normalized trace of W H P H^H W^H. The trace can be further rewritten

\frac{1}{N} \mathrm{tr}\left( W H P H^H W^H \right) = \frac{1}{N} \sum_{i=1}^{K} P_i |h_i|^2 w_i^H w_i .

Assume additionally that the entries of the w_i have finite eighth order moment. Then, from the trace lemma, Theorem 3.4,

\frac{1}{N} \mathrm{tr}\left( W H P H^H W^H \right) - \frac{1}{N} \sum_{i=1}^{K} P_i |h_i|^2 \xrightarrow{\rm a.s.} 0 .

We finally have that the signal-to-interference plus noise ratio γ_k^{(MF)} for the signal of user k after the matched filtering process satisfies

\gamma_k^{(MF)} - \frac{P_k |h_k|^2}{\frac{1}{N} \sum_{i=1}^{K} P_i |h_i|^2 + \sigma^2} \xrightarrow{\rm a.s.} 0 .    (12.2)

If P_1 = · · · = P_K ≜ P, K/N → c, and the |h_i| are now Rayleigh distributed, the denominator of the deterministic equivalent in (12.2) converges to

\sigma^2 + c P \int t e^{-t}\, dt = \sigma^2 + P c .

In the simpler case where P_1 = · · · = P_K ≜ P, K/N → c, and h_i = 1 for all i, we simply have

\gamma_k^{(MF)} \xrightarrow{\rm a.s.} \frac{P}{Pc + \sigma^2} .

The spectral efficiency associated with the matched filter, denoted C_MF(σ²), is the maximum number of bits that can be reliably transmitted per second and per Hertz of transmission bandwidth. In the case of large dimensions, the interference being Gaussian, the spectral efficiency in bits/s/Hz is well approximated by

C_{\rm MF}(\sigma^2) = \frac{1}{N} \sum_{k=1}^{K} \log_2\left( 1 + \gamma_k^{(MF)} \right) .

From the discussion above, we therefore have, for N, K large, that

C_{\rm MF}(\sigma^2) - \frac{1}{N} \sum_{k=1}^{K} \log_2\left( 1 + \frac{P_k |h_k|^2}{\frac{1}{N} \sum_{i=1}^{K} P_i |h_i|^2 + \sigma^2} \right) \xrightarrow{\rm a.s.} 0

and we have therefore exhibited a deterministic equivalent of the spectral efficiency of the matched filter for all deterministic channels.

When the |h_i| arise from a Rayleigh distribution, all the P_i equal P, and K/N → c, we therefore infer that

C_{\rm MF}(\sigma^2) \xrightarrow{\rm a.s.} c \int \log_2\left( 1 + \frac{P t}{P c + \sigma^2} \right) e^{-t}\, dt = -c \log_2(e)\, e^{\frac{P c + \sigma^2}{P}}\, \mathrm{Ei}\left( -\frac{P c + \sigma^2}{P} \right)

with Ei(x) the exponential integral function

\mathrm{Ei}(x) = -\int_{-x}^{\infty} \frac{1}{t} e^{-t}\, dt .

To obtain this expression, particular care must be taken since, as mentioned previously, it is hazardous to take H H^H non-uniformly bounded in spectral norm. However, if the tail of the Rayleigh distribution is truncated at C > 0, a mere application of the dominated convergence theorem, Theorem 6.3, ensures the convergence. Growing C large leads to the result.

In the AWGN case, where h_i = 1, P_i = P for all i and K/N → c, this is instead

C_{\rm MF}(\sigma^2) \xrightarrow{\rm a.s.} c \log_2\left( 1 + \frac{P}{P c + \sigma^2} \right) .
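
The following sketch (ours, not from the text; the dimensions, noise level, and seed are arbitrary) compares, for one realization, the empirical matched filter SINRs with the deterministic equivalent (12.2) and the resulting spectral efficiencies.

    import numpy as np

    rng = np.random.default_rng(3)
    N, K, sigma2 = 256, 128, 0.1
    P = np.ones(K)
    h = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)    # Rayleigh fading
    W = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2 * N)
    q = P * np.abs(h) ** 2                                                      # P_i |h_i|^2

    B = (W * q) @ W.conj().T + sigma2 * np.eye(N)       # W H P H^H W^H + sigma^2 I_N
    gamma_emp = np.empty(K)
    for k in range(K):
        wk = W[:, k]
        num = q[k] * np.abs(wk.conj() @ wk) ** 2
        den = np.real(wk.conj() @ (B @ wk)) - q[k] * np.abs(wk.conj() @ wk) ** 2
        gamma_emp[k] = num / den
    gamma_det = q / (np.sum(q) / N + sigma2)            # deterministic equivalent (12.2)

    C_emp = np.sum(np.log2(1 + gamma_emp)) / N
    C_det = np.sum(np.log2(1 + gamma_det)) / N
    print(C_emp, C_det)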

12.2.1.2 MMSE receiver

We assume now that the base station performs the more elaborate minimum mean square error (MMSE) decoding. The signal y^{(l)} received at the base station
at time l is now filtered by multiplying it by the MMSE decoder, as

\sqrt{P_k}\, h_k^{*} w_k^H \left( W H P H^H W^H + \sigma^2 I_N \right)^{-1} y^{(l)} .

In this scenario, the signal-to-interference plus noise ratio γ_k^{(MMSE)} relative to the signal of user k is slightly more involved to obtain. The signal power is

P_k |h_k|^2 \left| w_k^H \Big( \sum_{1 \le i \le K} P_i |h_i|^2 w_i w_i^H + \sigma^2 I_N \Big)^{-1} w_k \right|^2

while the interference plus noise power is

P_k |h_k|^2 \sum_{j \ne k} P_j |h_j|^2 \left| w_k^H \Big( \sum_{i} P_i |h_i|^2 w_i w_i^H + \sigma^2 I_N \Big)^{-1} w_j \right|^2 + \sigma^2 P_k |h_k|^2 w_k^H \Big( \sum_{i} P_i |h_i|^2 w_i w_i^H + \sigma^2 I_N \Big)^{-2} w_k .

Working on the interference power, by writing

\sum_{j \ne k} P_j |h_j|^2 w_j w_j^H = \Big( \sum_{1 \le j \le K} P_j |h_j|^2 w_j w_j^H + \sigma^2 I_N \Big) - \sigma^2 I_N - P_k |h_k|^2 w_k w_k^H

we obtain for the interference plus noise power

P_k |h_k|^2 w_k^H \Big( \sum_{i} P_i |h_i|^2 w_i w_i^H + \sigma^2 I_N \Big)^{-1} w_k - \left( P_k |h_k|^2 w_k^H \Big( \sum_{i} P_i |h_i|^2 w_i w_i^H + \sigma^2 I_N \Big)^{-1} w_k \right)^2 .

This simplifies the expression of the ratio of signal power against interference plus noise power as

\gamma_k^{(MMSE)} = \frac{P_k |h_k|^2 w_k^H \big( \sum_{1 \le i \le K} P_i |h_i|^2 w_i w_i^H + \sigma^2 I_N \big)^{-1} w_k}{1 - P_k |h_k|^2 w_k^H \big( \sum_{1 \le i \le K} P_i |h_i|^2 w_i w_i^H + \sigma^2 I_N \big)^{-1} w_k} .

Applying the matrix inversion lemma, Lemma 6.2, to √P_k h_k w_k, we finally have the compact form of the MMSE SINR

\gamma_k^{(MMSE)} = P_k |h_k|^2 w_k^H \Big( \sum_{1 \le i \le K,\ i \ne k} P_i |h_i|^2 w_i w_i^H + \sigma^2 I_N \Big)^{-1} w_k .
Now, from Theorem 3.4, as N and K grow large with ratio K/N such that 0 < lim inf_N K/N ≤ lim sup_N K/N < ∞,

w_k^H \Big( \sum_{1 \le i \le K,\ i \ne k} P_i |h_i|^2 w_i w_i^H + \sigma^2 I_N \Big)^{-1} w_k - \frac{1}{N} \mathrm{tr}\left( W H P H^H W^H - P_k |h_k|^2 w_k w_k^H + \sigma^2 I_N \right)^{-1} \xrightarrow{\rm a.s.} 0 .

From Theorem 3.9, we also have that

\frac{1}{N} \mathrm{tr}\left( W H P H^H W^H - P_k |h_k|^2 w_k w_k^H + \sigma^2 I_N \right)^{-1} - \frac{1}{N} \mathrm{tr}\left( W H P H^H W^H + \sigma^2 I_N \right)^{-1} \to 0

where the convergence is sure. Together, the last two equations entail

w_k^H \Big( \sum_{i \ne k} P_i |h_i|^2 w_i w_i^H + \sigma^2 I_N \Big)^{-1} w_k - \frac{1}{N} \mathrm{tr}\left( W H P H^H W^H + \sigma^2 I_N \right)^{-1} \xrightarrow{\rm a.s.} 0 .

The trace term in this last expression is the Stieltjes transform m_{WHPH^HW^H}(−σ²) of the e.s.d. of W H P H^H W^H evaluated at −σ² < 0. Notice that this quantity is independent of the choice of k. Now, if the e.s.d. of H P H^H for all N form a tight sequence, since W has i.i.d. entries of zero mean and variance 1/N, we are in the conditions of Theorem 6.1. The Stieltjes transform m_{WHPH^HW^H}(−σ²) therefore satisfies

m_{WHPH^HW^H}(-\sigma^2) - m_N(-\sigma^2) \xrightarrow{\rm a.s.} 0

where m_N(−σ²) is the unique positive solution of the equation in m (see Theorem 3.13, under different hypotheses)

m = \left( \frac{1}{N} \mathrm{tr}\left[ H P H^H \left( m\, H P H^H + I_K \right)^{-1} \right] + \sigma^2 \right)^{-1} .    (12.3)

We then have that the signal-to-interference plus noise ratio for the signal originating from user k is close to

P_k |h_k|^2\, m_N(-\sigma^2)

for all large N, K, where m_N(−σ²) is the unique positive solution to

m = \left( \sigma^2 + \frac{1}{N} \sum_{1 \le i \le K} \frac{P_i |h_i|^2}{1 + m P_i |h_i|^2} \right)^{-1} .    (12.4)
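
In practice, (12.4) is easily solved numerically by iterating the fixed-point map from any positive initialization, as in the following sketch (ours, not from the text; the dimensions, number of iterations, and seed are arbitrary, and convergence of this plain iteration is what we observe for the standard models considered here).

    import numpy as np

    def m_fixed_point(gains, sigma2, N, n_iter=200):
        # iterates m <- 1 / (sigma^2 + (1/N) sum_i g_i / (1 + m g_i)), with g_i = P_i |h_i|^2
        m = 1.0 / sigma2
        for _ in range(n_iter):
            m = 1.0 / (sigma2 + np.sum(gains / (1.0 + m * gains)) / N)
        return m

    rng = np.random.default_rng(4)
    N, K, sigma2 = 256, 128, 0.1
    gains = np.abs((rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)) ** 2  # P_i |h_i|^2, P_i = 1

    m = m_fixed_point(gains, sigma2, N)
    sinr_det = gains * m        # deterministic equivalent of the per-user MMSE SINR
    print(m, sinr_det[:3])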

In the Rayleigh case, the e.s.d. of H H^H converges almost surely to the (exponential) distribution of the |h_k|², so that F^{HH^H} (indexed by N) forms a tight sequence, which implies the tightness of F^{HPH^H}. As a consequence, m_N(−σ²) has a deterministic almost sure limit
m(−σ²) as N, K → ∞, K/N → c, which is the unique positive solution to the equation in m

m = \left( \sigma^2 + c \int \frac{P t}{1 + P t m}\, e^{-t}\, dt \right)^{-1} .    (12.5)

The corresponding spectral efficiency C_MMSE(σ²) of the MMSE receiver, for noise variance equal to σ², takes the form

C_{\rm MMSE}(\sigma^2) = \frac{1}{N} \sum_{k=1}^{K} \log_2\left( 1 + P_k |h_k|^2 w_k^H \Big( \sum_{i \ne k} P_i |h_i|^2 w_i w_i^H + \sigma^2 I_N \Big)^{-1} w_k \right) .

For increasing N, K, K/N → c, P_1 = · · · = P_K = P, and |h_i| Rayleigh distributed for every i, we have:

C_{\rm MMSE}(\sigma^2) \xrightarrow{\rm a.s.} c \int \log_2\left( 1 + P t\, m(-\sigma^2) \right) e^{-t}\, dt

which can be verified once more by an adequate truncation of the tail of the Rayleigh distribution at C > 0 and then taking C → ∞.

In the AWGN scenario, (12.4) becomes

m = \frac{1 + m P}{\sigma^2 + c P + m P \sigma^2}

which can be expressed as a second order polynomial in m, whose unique positive solution m(−σ²) reads:

m(-\sigma^2) = \frac{-(\sigma^2 + (c-1)P) + \sqrt{(\sigma^2 + (c-1)P)^2 + 4 P \sigma^2}}{2 P \sigma^2}    (12.6)

which is therefore the almost sure limit of m_N(−σ²) for N, K → ∞. The spectral efficiency therefore has the deterministic limit

C_{\rm MMSE}(\sigma^2) \xrightarrow{\rm a.s.} c \log_2\left( 1 + \frac{-(\sigma^2 + (c-1)P) + \sqrt{(\sigma^2 + (c-1)P)^2 + 4 P \sigma^2}}{2 \sigma^2} \right) .
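
The closed form (12.6) and the corresponding MMSE spectral efficiency limit can be evaluated directly, as in this small sketch (ours; the values of P, c, and σ² are arbitrary), which also checks that (12.6) indeed solves the fixed-point equation displayed above it.

    import numpy as np

    P, c, sigma2 = 1.0, 0.5, 0.1
    a = sigma2 + (c - 1.0) * P
    m = (-a + np.sqrt(a ** 2 + 4.0 * P * sigma2)) / (2.0 * P * sigma2)    # closed form (12.6)

    residual = m - (1.0 + m * P) / (sigma2 + c * P + m * P * sigma2)      # fixed point above (12.6); ~0
    C_mmse_limit = c * np.log2(1.0 + P * m)                               # deterministic limit of C_MMSE
    print(m, residual, C_mmse_limit)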

12.2.1.3 Optimal receiver

The optimal receiver jointly decodes the data streams from all users. Its spectral efficiency C_opt(σ²) is simply

C_{\rm opt}(\sigma^2) = \frac{1}{N} \log\det\left( I_N + \frac{1}{\sigma^2} W H P H^H W^H \right) .
A straightforward application of Corollary 6.1 leads to the limiting result

C_{\rm opt}(\sigma^2) - \log_2\left( 1 + \frac{1}{\sigma^2 N} \sum_{k=1}^{K} \frac{P_k |h_k|^2}{1 + P_k |h_k|^2 m_N(-\sigma^2)} \right) - \frac{1}{N} \sum_{k=1}^{K} \log_2\left( 1 + P_k |h_k|^2 m_N(-\sigma^2) \right) - \log_2(e) \left( \sigma^2 m_N(-\sigma^2) - 1 \right) \xrightarrow{\rm a.s.} 0

for P_1|h_1|², ..., P_K|h_K|² uniformly bounded across N, where m_N(−σ²) is the unique positive solution of (12.4).

In the case of growing system dimensions and Rayleigh fading on all links, with similar arguments as previously, we have the almost sure convergence

C_{\rm opt}(\sigma^2) \xrightarrow{\rm a.s.} \log_2\left( 1 + \frac{c}{\sigma^2} \int \frac{P t}{1 + P t\, m(-\sigma^2)}\, e^{-t}\, dt \right) + c \int \log_2\left( 1 + P t\, m(-\sigma^2) \right) e^{-t}\, dt + \log_2(e) \left( \sigma^2 m(-\sigma^2) - 1 \right)

with m(−σ²) defined here as the unique positive solution to (12.5). In the AWGN scenario, this is instead

C_{\rm opt}(\sigma^2) \xrightarrow{\rm a.s.} \log_2\left( 1 + \frac{c P}{\sigma^2 (1 + P m(-\sigma^2))} \right) + c \log_2\left( 1 + P m(-\sigma^2) \right) + \log_2(e) \left( \sigma^2 m(-\sigma^2) - 1 \right)

with m(−σ²) defined here by (12.6).
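
The following sketch (ours, not from the text; the dimensions and parameters are arbitrary) compares, for one realization, the empirical optimal-receiver spectral efficiency with the deterministic equivalent built from m_N(−σ²), the solution of (12.4).

    import numpy as np

    rng = np.random.default_rng(5)
    N, K, sigma2 = 256, 128, 0.1
    q = np.abs((rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)) ** 2   # P_k |h_k|^2, P_k = 1
    W = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2 * N)

    _, logdet = np.linalg.slogdet(np.eye(N) + (W * q) @ W.conj().T / sigma2)               # empirical capacity
    C_emp = logdet / np.log(2) / N

    m = 1.0 / sigma2
    for _ in range(200):                                                                   # fixed point (12.4)
        m = 1.0 / (sigma2 + np.sum(q / (1.0 + m * q)) / N)

    C_det = (np.log2(1.0 + np.sum(q / (1.0 + q * m)) / (sigma2 * N))
             + np.sum(np.log2(1.0 + q * m)) / N
             + (sigma2 * m - 1.0) * np.log2(np.e))
    print(C_emp, C_det)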

In Figures 12.1 and 12.3, a comparison is made between the matched filter, the MMSE filter, and the optimal decoder, for K = 16 users and N = 32 chips per CDMA code, for different SNR values, in the scenarios of AWGN channels and Rayleigh fading channels, respectively. In these graphs, theoretical expressions are compared against Monte Carlo simulations, the average and standard deviation of which are displayed. We observe, already for these small values of K and N, that the deterministic equivalents for the matched filter, the minimum mean square error filter, and the optimal decoder fall within (plus or minus) one standard deviation of the empirical spectral efficiency.

In Figures 12.2 and 12.4, the analysis of the spectral efficiency for different limiting ratios c = lim K/N is performed for these same decoders, for AWGN and Rayleigh fading channels, respectively. The SNR is set to 10 dB. Note that the rate achieved by the optimal decoder grows unbounded as c grows large. This emerges naturally from the fact that the longer the CDMA codes, the greater the data redundancy. The matched filter also benefits from short code lengths. On the contrary, the MMSE decoder benefits only from moderate ratios K/N and especially suffers from large K/N.
[Figure 12.1. Spectral efficiency [bits/s/Hz] versus SNR [dB] for random CDMA decoders over AWGN channels: comparison between simulations and deterministic equivalents (det. eq.) for the matched filter, the MMSE decoder, and the optimal decoder, K = 16 users, N = 32 chips per code. Error bars indicate two standard deviations.]

[Figure 12.2. Spectral efficiency [bits/s/Hz] versus the asymptotic ratio c = lim K/N for random CDMA decoders over AWGN channels, SNR = 10 dB: deterministic equivalents for the matched filter, the MMSE decoder, and the optimal decoder.]

This can be interpreted from the fact that, for N small compared to K, every user data stream at the output of the MMSE filter is strongly interfered with by inter-code interference from the other users. The SINR of every stream is therefore heavily impacted, to the extent that the spectral efficiency of the MMSE decoder is significantly reduced.

[Figure 12.3. Spectral efficiency [bits/s/Hz] versus SNR [dB] for random CDMA decoders over Rayleigh fading channels: comparison between simulations and deterministic equivalents (det. eq.) for the matched filter, the MMSE decoder, and the optimal decoder, K = 16 users, N = 32 chips per code. Error bars indicate two standard deviations.]

[Figure 12.4. Spectral efficiency [bits/s/Hz] versus the asymptotic ratio c = lim K/N for random CDMA decoders over Rayleigh fading channels, SNR = 10 dB: deterministic equivalents for the matched filter, the MMSE decoder, and the optimal decoder.]

12.2.2 Random CDMA in uplink frequency selective channels

We consider the same transmission model as in Section 12.2.1, but the transmission channels are now frequency selective, i.e. they contain multiple paths. We then replace the flat channel fading coefficients h_k in (12.1) by the convolution
Toeplitz matrix H_k^{(0)} ∈ C^{N×N}, constant over time and given by:

H_k^{(0)} \triangleq \begin{pmatrix}
h_{k,0}     & 0            & \cdots       & \cdots       & \cdots & 0 \\
\vdots      & \ddots       & \ddots       &              &        & \vdots \\
h_{k,L_k-1} & \ddots       & h_{k,0}      & \ddots       &        & \vdots \\
0           & h_{k,L_k-1}  & \ddots       & h_{k,0}      & \ddots & \vdots \\
\vdots      & \ddots       & \ddots       & \ddots       & \ddots & 0 \\
0           & \cdots       & 0            & h_{k,L_k-1}  & \cdots & h_{k,0}
\end{pmatrix}

where the coefficient h_{k,l} stands for the lth path of the multi-path channel H_k^{(0)}, and the number of relevant such paths is supposed equal to L_k. We assume the h_{k,l} uniformly bounded over both k and l. In addition, because of multi-path, inter-symbol interference arises between the symbols s_k^{(l−1)} transmitted at time l − 1 and the symbols s_k^{(l)}. Under these conditions, (12.1) now has an additional contribution due to the interference from the previously sent symbols and we finally have

y^{(l)} = \sum_{k=1}^{K} H_k^{(0)} w_k \sqrt{P_k}\, s_k^{(l)} + \sum_{k=1}^{K} H_k^{(1)} w_k \sqrt{P_k}\, s_k^{(l-1)} + n^{(l)}    (12.7)

where H_k^{(1)} is defined by

H_k^{(1)} \triangleq \begin{pmatrix}
0      & \cdots & 0      & h_{k,L_k-1} & \cdots & h_{k,1} \\
\vdots &        &        & \ddots      & \ddots & \vdots \\
\vdots &        &        &             & \ddots & h_{k,L_k-1} \\
\vdots &        &        &             &        & 0 \\
\vdots &        &        &             &        & \vdots \\
0      & \cdots & \cdots & \cdots      & \cdots & 0
\end{pmatrix} .

If max_k(L_k) is small compared to N, as N grows large, it is rather intuitive that the term due to inter-symbol interference can be neglected when estimating the performance of the CDMA decoders. This is because the matrices H_k^{(1)} are filled with zeros except for the (1/2)L_k(L_k − 1) upper-right elements. This number is of order o(N) and the elements are uniformly bounded with increasing N. Informally, the following therefore ought to be somewhat correct for large N and K

y^{(l)} \simeq \sum_{k=1}^{K} H_k^{(0)} w_k \sqrt{P_k}\, s_k^{(l)} + n^{(l)} .

We could now work with this model and evaluate the performance of the different decoding modes, which will feature in this case the Gram matrix of a random matrix with independent columns w_k left-multiplied by H_k^{(0)} for k ∈ {1, ..., K}. Therefore, it will be possible to invoke Theorem 6.12 to compute
in particular the performance of the matched filter, MMSE, and optimal uplink decoders. Nonetheless, the resulting final expressions would be given as a function of the matrices H_k^{(0)}, rather than as a function of the entries h_{k,j}. Instead, we will continue the successive model approximations by further modifying H_k^{(0)} into a more convenient matrix form, asymptotically equivalent to H_k^{(0)}.

From the same line of reasoning as previously, we can indeed guess that filling the triangle of (1/2)L_k(L_k − 1) upper-right entries of H_k^{(0)} with bounded elements should not alter the final result in the large N, K limit. We may therefore replace H_k^{(0)} by H̄_k ≜ H_k^{(0)} + H_k^{(1)}, leading to

y^{(l)} \simeq \sum_{k=1}^{K} \bar{H}_k w_k \sqrt{P_k}\, s_k^{(l)} + n^{(l)} .    (12.8)

The right-hand side of (12.8) is more interesting to study, as H̄_k is a circulant matrix

\bar{H}_k \triangleq \begin{pmatrix}
h_{k,0}     & 0            & \cdots  & h_{k,L_k-1} & \cdots & h_{k,1} \\
\vdots      & \ddots       & \ddots  &             & \ddots & \vdots \\
h_{k,L_k-1} & \ddots       & h_{k,0} & \ddots      &        & h_{k,L_k-1} \\
0           & h_{k,L_k-1}  & \ddots  & h_{k,0}     & \ddots & \vdots \\
\vdots      & \ddots       & \ddots  & \ddots      & \ddots & 0 \\
0           & \cdots       & 0       & h_{k,L_k-1} & \cdots & h_{k,0}
\end{pmatrix}

which can be written under the form H̄_k = F_N^H D_k F_N, with F_N the unitary discrete Fourier transform matrix of order N with entries F_{N,ab} = (1/√N) e^{−2πi(a−1)(b−1)/N}. Moreover, the diagonal entries of D_k are the discrete Fourier transform coefficients of the first column of H̄_k [Gray, 2006], i.e., denoting d_{ab} ≜ [D_a]_{bb},

d_{ab} = \sum_{n=0}^{L_a - 1} h_{a,n}\, e^{-2\pi i \frac{b n}{N}} .
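
Numerically, the circulant structure and the coefficients d_{k,n} are conveniently obtained through the fast Fourier transform, as in the following sketch (ours, not from the text; the dimensions, taps, and seed are arbitrary), which checks the diagonalization F_N H̄_k F_N^H = D_k.

    import numpy as np

    rng = np.random.default_rng(6)
    N, L = 64, 4
    taps = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2 * L)     # h_{k,0}, ..., h_{k,L-1}

    first_col = np.concatenate([taps, np.zeros(N - L, dtype=complex)])
    H_circ = np.stack([np.roll(first_col, j) for j in range(N)], axis=1)                # circulant \bar H_k

    F = np.exp(-2j * np.pi * np.outer(np.arange(N), np.arange(N)) / N) / np.sqrt(N)     # unitary DFT matrix F_N
    d = np.fft.fft(first_col)                                                           # d_{k,n} via the FFT

    print(np.allclose(F @ H_circ @ F.conj().T, np.diag(d)))                             # True up to numerical error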

All H̄_k matrices are therefore diagonalizable in a common eigenvector basis, and we have that the right-hand side of (12.8), multiplied on the left by F_N, reads:

z^{(l)} \triangleq \sum_{k=1}^{K} D_k \bar{w}_k \sqrt{P_k}\, s_k^{(l)} + \bar{n}^{(l)}    (12.9)

with w̄_k ≜ F_N w_k and n̄^{(l)} ≜ F_N n^{(l)}.

In order to simplify the problem, we now need to make the strong assumption that the vectors w_k have independent Gaussian entries. This ensures that w̄_k also has i.i.d. entries (incidentally Gaussian).

We wish to study the performance of linear decoders for the model (12.7). To prove that it is equivalent to work with (12.7) or with (12.9), as N, K grow large, we need to prove that the difference between the figure of merit (say
here, the SINR) for the model y^{(l)} and the figure of merit for the model z^{(l)} is asymptotically almost surely zero. This can be proved, see, e.g., [Chaufray et al., 2004], using Szegő's theorem, the Markov inequality, Theorem 3.5, and the Borel–Cantelli lemma, Theorem 3.6, in a similar way as in the proof of Theorem 3.4 for instance. For this condition to hold, it suffices that L/N → 0 as N grows large, with L an upper bound on L_1, ..., L_K for all K large, that the H̄_k matrices are bounded in spectral norm, and that there exists a < b such that 0 < a < P_k < b < ∞ for all k, uniformly in K.

Indeed, the asymptotic equivalence of Toeplitz and circulant matrices is formulated rigorously by Szegő's theorem, given below.

Theorem 12.1 (Theorem 4.2 of [Gray, 2006]). Let ..., t_{−2}, t_{−1}, t_0, t_1, t_2, ... be a summable sequence of real numbers, i.e. such that

\sum_{k=-\infty}^{\infty} |t_k| < \infty .

Denote T_N ∈ C^{N×N} the Toeplitz matrix with kth column the vector (t_{−k+1}, ..., t_{N−k})^T. Then, denoting τ_{N,1}, ..., τ_{N,N} the eigenvalues of T_N, for any positive s

\lim_{N \to \infty} \frac{1}{N} \sum_{k=0}^{N-1} \tau_{N,k}^{s} = \frac{1}{2\pi} \int_0^{2\pi} f(\lambda)^{s}\, d\lambda

with

f(\lambda) \triangleq \sum_{k=-\infty}^{\infty} t_k e^{i k \lambda}

the Fourier transform of ..., t_{−2}, t_{−1}, t_0, t_1, t_2, .... In particular, if t_k = 0 for k < 0 and for k ≥ K for some constant K, then the sequence is finite and hence absolutely summable, and the l.s.d. of T_N is also the l.s.d. of the circulant matrices with first column (t_0, ..., t_{K−1})^T.
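
The asymptotic equivalence invoked here can be observed numerically: the sketch below (ours, not from the text; the dimensions, number of taps, and tap statistics are arbitrary) compares the first empirical eigenvalue moments of H_k^{(0)} H_k^{(0)H} and of its circulant counterpart H̄_k H̄_k^H, which already agree closely for moderate N since the two matrices differ by a perturbation of rank at most L − 1.

    import numpy as np

    rng = np.random.default_rng(7)
    N, L = 256, 4
    taps = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2 * L)

    idx = np.subtract.outer(np.arange(N), np.arange(N))                      # i - j
    band = (idx >= 0) & (idx < L)
    T = np.where(band, taps[np.clip(idx, 0, L - 1)], 0)                      # banded Toeplitz H_k^(0)
    idx_mod = np.mod(idx, N)
    C = np.where(idx_mod < L, taps[np.clip(idx_mod, 0, L - 1)], 0)           # circulant \bar H_k

    eT = np.linalg.eigvalsh(T @ T.conj().T)
    eC = np.linalg.eigvalsh(C @ C.conj().T)
    print(np.mean(eT), np.mean(eC))              # first moments of the two e.s.d.
    print(np.mean(eT ** 2), np.mean(eC ** 2))    # second moments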

From now on, we claim that model (12.9) is equivalent to model (12.7) for the studies to come, in the sense that we can work either with (12.9) or with (12.7) and will end up with the same asymptotic performance results, for a set of w_1, ..., w_K of probability one. Equation (12.9) can be written more compactly as

z^{(l)} = X P^{1/2} s^{(l)} + \bar{n}^{(l)}

where the (i, j)th entry X_ij of X ∈ C^{N×K} has zero mean, E[|X_ij|²] = |d_{ji}|²/N, and the elements X_ij / d_{ji} are identically distributed. This is the situation of a channel model with a variance profile. This type of model was first studied in Theorem 3.14, when the matrix of the d_{ij} has a limiting spectrum, and then in Theorem 6.14 in terms of a deterministic equivalent, in a more general case.
We consider successively the performance of the matched filter, the MMSE decoder, and the optimal receiver in the frequency selective case.

12.2.2.1 Matched filter

The matched filter here consists, for user k, in filtering z^{(l)} by the kth column of X as x_k^H z^{(l)}. From the previous derivation, we have that the SINR γ_k^{(MF)} at the output of the matched filter reads:

\gamma_k^{(MF)} = \frac{P_k |x_k^H x_k|^2}{x_k^H \left( X P X^H - P_k x_k x_k^H + \sigma^2 I_N \right) x_k} .    (12.10)

The x_k are defined as x_k = D_k w̄_k, where w̄_k has i.i.d. entries of zero mean and variance 1/N, independent of the D_k. The trace lemma therefore ensures that

\bar{w}_k^H D_k^H D_k \bar{w}_k - \frac{1}{N} \mathrm{tr}\left( D_k^H D_k \right) \xrightarrow{\rm a.s.} 0

where the trace can be rewritten

\frac{1}{N} \mathrm{tr}\left( D_k^H D_k \right) = \frac{1}{N} \sum_{i=1}^{N} |d_{k,i}|^2 .

As for the denominator of (12.10), notice that the inner matrix has entries independent of x_k, which we would ideally like to be of almost surely uniformly bounded spectral norm, so that the trace lemma, Lemma 14.2, can operate as before, i.e.

x_k^H \left( X P X^H - P_k x_k x_k^H + \sigma^2 I_N \right) x_k = \bar{w}_k^H D_k^H \left( X P X^H - P_k x_k x_k^H + \sigma^2 I_N \right) D_k \bar{w}_k

would satisfy

x_k^H \left( X P X^H - P_k x_k x_k^H + \sigma^2 I_N \right) x_k - \frac{1}{N} \mathrm{tr}\left[ D_k D_k^H \left( X P X^H + \sigma^2 I_N \right) \right] \xrightarrow{\rm a.s.} 0 .

However, although extensive Monte Carlo simulations suggest that the spectral norm of X P X^H is indeed uniformly bounded almost surely, this has not been proved to this day. For the rest of this section, we therefore mainly conjecture this result.

From the definition of X, we then have

\frac{1}{N} \mathrm{tr}\left[ D_k D_k^H \left( X P X^H + \sigma^2 I_N \right) \right] - \frac{1}{N} \sum_{n=1}^{N} |d_{k,n}|^2 \left( \sigma^2 + \frac{1}{N} \sum_{i=1}^{K} P_i |d_{i,n}|^2 \right) \xrightarrow{\rm a.s.} 0 .    (12.11)

And we finally have

\gamma_k^{(MF)} - \frac{P_k \left( \frac{1}{N} \sum_{n=1}^{N} |d_{k,n}|^2 \right)^2}{\frac{1}{N^2} \sum_{n=1}^{N} \sum_{i=1}^{K} P_i |d_{k,n}|^2 |d_{i,n}|^2 + \sigma^2 \frac{1}{N} \sum_{n=1}^{N} |d_{k,n}|^2} \xrightarrow{\rm a.s.} 0

where the deterministic equivalent is now clearly dependent on k.

Page 302: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 302/562

Page 303: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 303/562

12.2. Performance of random CDMA technologies 279

The inner part of the inverse matrix is independent of the entries of xk .Since xk = D k w k with w k a vector of i.i.d. entries with variance 1 /N , we have,similarly as before, that

γ (MMSE)k −

P kN

tr D k D H

k XPX H + σ2 I N −1 a.s.

−→ 0

as N , n grow large. Now notice that XP12 is still a matrix with independent

entries and variance prole P j σ2ij , 1 ≤ i ≤ N , 1 ≤ j ≤ n. From Theorem 6.10,

it turns out that the trace on the left-hand side satisesP kN

tr D k D Hk XPX H + σ2 I N −1

−ek (−σ2) a.s.

−→ 0

where ek (z), z ∈C \ R + , is dened as the unique Stieltjes transform that satises

ek (z) = −1z

1N

N

n =1

P k |dkn |21 + K N en (z)

where en (z) is given by:

en (z) = −1z

1K

K

i =1

P i |din |21 + ei (z)

.

Note that the factor K/N is placed in the denominator of the term ek (z) hereinstead of the denominator of the term ¯ ek (z), contrary to the initial statement

of Theorem 6.10. This is due to the fact that X

is here an N ×K matrix withentries of variance |dij |2 /N and not |dij |2 /K . Particular care must therefore betaken here when propagating the term K/N in the formula of Theorem 6.10.

We conclude that the spectral efficiency C (MMSE) (σ2) for the MMSE decoderin that case satises

C (MMSE) (σ2) − 1N

K

k =1

log2 1 + ek (−σ2) a.s.

−→ 0.

Similar to the MF case, there does not exist a straightforward limit toC (MMSE) (σ2) for practical distributions of h1 , . . . , h L . However, if the user-frequency channel decay |dk,n |2 converges to a density |h(κ, f )|2 for users indexedby k κK and for normalized frequencies n f N , and that users within dκ of κhave power P (κ), then C (MMSE) (σ2) has a deterministic almost sure limit, givenby:

C (MMSE) (σ2) a.s.

−→ c 1

0log2 (1 + e(κ)) dκ

for e(κ) the function dened as a solution to the differential equations

e(κ) = 1

σ2

1

0

P (κ)|h(κ, f )|21 + ce(f )

df

e(f ) = 1σ2 1

0

P (κ)|h(κ, f )|21 + e(κ)

dκ. (12.13)

Page 304: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 304/562

280 12. System performance of CDMA technologies

12.2.2.3 Optimal decoderThe spectral efficiency C opt (σ2) of the frequency selective uplink CDMA reads:

C opt (σ2) = 1N log det I N + 1σ2 XPX H .

A deterministic equivalent for C opt (σ2) is then provided by extendingTheorem 6.11 with the conjectured asymptotic boundedness of XPX H by astraightforward application of the dominated convergence theorem, Theorem 6.3.Namely, we have:

C opt (σ2)− 1N

N

n =1log2 1 +

K N

en (−σ2) + 1N

K

k=1

log2 1 + ek (−σ2)

− log2(e)

σ21

N 21≤n ≤N 1≤k≤K

|dkn |21 + K

N en (−σ2) (1 + ek (−σ2))a .s.

−→ 0

where the en (−σ2) and ek (−σ2) are dened as in the previous MMSE case. Theconjectured result is however known to hold in expectation by applying directlythe result of Theorem 6.11.

As for the linear decoders, if |dk,n |2 has a density limit |h(κ, f )|2 , then C opt (σ2)converges almost surely as follows.

C opt (σ2) a.s.

−→ 1

0log2 (1 + ce(f )) df + c

1

0log2 (1 + e(κ)) dκ

− log2(e)

σ2 c 1

0 1

0|h(κ, f )|2

(1 + ce(f )) (1 + e(κ))dκdf

where the functions e and e are solutions of ( 12.13).In Figure 12.5, we compare the performance of random CDMA detectors as

a function of the channel frequency selectivity, for different ratios K/N . Wesuccessively assume that the channel is Rayleigh fading, of length L = 1, L = 2,

and L = 8, the coefficients hl being i.i.d. Gaussian of variance 1 /L . For simulationpurposes, we consider a single realization, i.e. we do not average realizations, of the channel condition for an N = 512 random CDMA transmission. We observevarious behaviors of the detectors against frequency selectivity. In particular, theoptimal decoder clearly benets from channel diversity. Nonetheless, althoughthis is not represented here, no further gain is obtained for L > 8; therefore,there exists a diversity threshold above which the uplink data rate does notincrease. The matched-lter follows the same trend, as it benets as well fromfrequency diversity. The case of the MMSE decoder is more intriguing, as thelatter benets from frequency selectivity only for ratios K/N lower than one, i.e.for less users than code length, and suffers from channel frequency selectivity forK/N 1. In our application example, K/N ≤ 1 is the most likely assumption,in order for the CDMA codes to be almost orthogonal.

Page 305: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 305/562

12.2. Performance of random CDMA technologies 281

0 1 2 3 40

1

2

3

4

5

K/N

S p e c t r a l e ffi c i e n c y

[ b i t s / s / H z ]

MFMMSEOptimal

Figure 12.5 Spectral efficiency of random CDMA decoders, for different ratios K/N ,SNR=10 dB, Rayleigh frequency selective fading channels. Deterministic equivalentsfor the matched-lter, the MMSE decoder, and the optimal decoder; N = 512, L = 1in dashed lines, L = 4 in dotted lines, L = 8 in plain lines.

12.2.3 Random CDMA in downlink frequency selective channels

We now consider the downlink CDMA setting, where the base station issues datafor the K terminals. Instead of characterizing the complete rate region of thebroadcast channel, we focus on the capacity achieved by a specic terminal k ∈1, . . . , K . In the case of frequency selective transmissions, the communicationmodel can be written at time l as

y ( l)k = P k H (0)

k Ws ( l) + P k H (1)k Ws ( l−1) + n ( l)

k (12.14)

with y ( l)k ∈C N the N -chip signal received by terminal k at time l, W =

[w 1 , . . . , w K ] ∈C N ×K , where w i is now the downlink random CDMA codeintended for user i, s( l) = [s ( l)

1 , . . . , s ( l)K ]

T

C K with s( l)i the signal intended for

user i at time l, P k is the mean transmit power of user k, H (0)k ∈C N ×N is

the Topelitz matrix corresponding to the frequency selective channel from thebase station to user k, H (1)

k ∈C N ×N the upper-triangular matrix that takes intoaccount the inter-symbol interference, and n ( l)

k ∈C N is the N -dimensional noisereceived by terminal k at time l.

Similar to the uplink case, we can consider for ( 12.14) the approximated model

y ( l)k P k H k Ws ( l) + n ( l)

k (12.15)

where H k

∈C N ×K is the circulant matrix equivalent to H (0)

k . It can indeed beshown, see, e.g., [Debbah et al., 2003a], that the asymptotic SINR sought for arethe same in both models. We study here the two receive linear decoders that arethe matched-lter and the MMSE decoder. The optimal joint decoding strategy

Page 306: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 306/562

282 12. System performance of CDMA technologies

is rather awkward in the downlink, as it requires highly inefficient computationalloads at the user terminals. This will not be treated.

12.2.3.1 Matched-lterSimilar to the uplink approach, the matched-lter in the downlink consists foruser k to lter the input y ( l)

k by its dedicated code convoluted by the channelw H

k H Hk . The SINR γ (MF)

k for user k is then simply given by:

γ (MF)k =

P k w Hk H H

k H k w k2

w Hk H H

k P k H k WW H H Hk −P k H k w k w H

k H Hk + σ2 I N H k w k

.

Using similar tools as for the uplink case, we straightforwardly have that

γ (MF)k − 1

N 2P

k

N

n =1 |d

k,n |2

2

σ2 1N

N n =1 |dk,n |2 + K −1

N 2 P k N n =1 |dk,n |4

a .s.−→ 0 (12.16)

where dk,n is dened as above by

dk,n L k −1

l=0

hk,l e−2πi nlN .

From this expression, we then have that the sum rate C MF achieved by thebroadcast channel satises

C MF (σ2) − 1N

K

k=1

log2 1 + P k 1N

N n =1 |dk,n |

22

σ2 1N

N n =1 |dk,n |2 + K −1

N 2 P k N n =1 |dk,n |4

a .s.

−→ 0.

When P 1 , . . . , P K have a limiting density P (κ), the |dk,n | have a limitingdensity |h(κ, f )|, and K/N → c, then asymptotically

C MF (σ2) a .s.

−→ c 1

0log2 1 + P (κ) 1

0 |h(κ, f )|2df 2

σ2 10 |h(κ, f )|2df + c 1

0 |h(κ, f )|4df dκ .

12.2.3.2 MMSE decoderFor the more advanced MMSE decoder, i.e. the linear decoderthat consists in retrieving the transmit symbols from the productw H

k H Hk P k H k WW H H H

k + σ2 I N −1 y ( l)k , the SINR γ (MMSE)

k for the dataintended for user k is explicitly given by:

γ (MMSE)k = P k w H

k H Hk P k H k WW H H H

k −P k H k w k w Hk H H

k + σ2 I N −1H k w k .

As in the uplink case, the central matrix is independent of w k and therefore thetrace lemma ensures that the right-hand side expression is close to the normalized

trace of the central matrix, which is its Stieltjes transform at point −σ2. Wetherefore use again the deterministic equivalent of Theorem 6.1 to obtain

γ (MMSE)k −ek (−σ2) a.s.

−→ 0

Page 307: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 307/562

12.2. Performance of random CDMA technologies 283

where ek (−σ2) is the unique real positive solution to the equation in e

e = 1

N

N

n =1

P k |dk,n |211+ e K −1N P k |dk,n |2 + σ2

. (12.17)

The resulting achievable sum rate C MMSE for the MMSE decoded broadcastchannel is then such that

C MMSE (σ2) − 1N

K

k=1

log2 1 + ek (−σ2) a .s.

−→ 0.

When the channel has a limiting space-frequency power density |h(κ, f )|2and the users within dκ of κ have inverse path loss P (κ), the sum rate has

a deterministic limit given by:

C MMSE (σ2) a.s.

−→ c 1

0log2 (1 + e(κ))

where the function e(κ) satises

e(κ) = 1

0

P (κ)|h(κ, f )|2c1+ e(κ ) P (κ)|h(κ, f )|2 + σ2 df.

Before moving to the study of orthogonal CDMA transmissions, let us recall[Poor and Verd´ u, 1997] that the multiple access interference incurred by the non-orthogonal users can be considered roughly Gaussian in the large dimensionalsystem limit. As a consequence, the bit error rate BER induced, e.g. by theMMSE decoder for user k and for QPSK modulation, is of order

BER Q γ (MMSE)k

with Q the Gaussian Q-function, dened by

Q(x) 1√ 2π

∞x

e−t 22 dt.

We can therefore give an approximation of the average bit error rate in thedownlink decoding. This is provided in Figure 12.6, where it can be seen that, forsmall ratios K/N , the deterministic approximation of the bit error rate is veryaccurate even for not too large K and N . In contrast, for larger K/N ratios,large K , N are demanded for the deterministic approximation to be accurate.

Note additionally that asymptotic gaussianity of the SINR at the output of the MMSE receiver for all cases above can be also proved, although this isnot detailed here, see, e.g., [Guo et al., 2002; Tse and Zeitouni , 2000]. Also,extensions of the above results to the multiple antenna case were studied in [Baiand Silverstein , 2007; Hanly and Tse, 2001] as well as to asynchronous randomCDMA in [Cottatellucci et al., 2010a,b ; Hwang, 2007; Mantravadi and Veeravalli,2002].

Page 308: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 308/562

284 12. System performance of CDMA technologies

−5 0 5 10 1510−4

10−3

10−2

10−1

100

SNR [dB]

B i t e r r o r r a t e

Sim., K = 16, N = 32Sim., K = 64, N = 128Det. eq., c = 1 / 2Sim., K = 32, N = 32Sim., K = 128, N = 128Det. eq., c = 1

Figure 12.6 Bit error rate achieved by random CDMA decoders in the downlink,AWGN channel. Comparison between simulations (sim.) and deterministic equivalents(det. eq.) for the MMSE decoder, K = 16 users, N = 32 chips per code.

12.3 Performance of orthogonal CDMA technologies

The initial incentive for using CDMA schemes in multi-user wirelesscommunications is based on the idea that orthogonality between users can bebrought about by codes, instead of separating user transmissions by using timedivision or frequency division multiplexing. This is all the more convenient whenthe codes are perfectly orthogonal and the communication channel is frequencyat. In this scenario, the signals received either in the uplink by the base stationor in the downlink by the users are perfectly orthogonal. If the channel isfrequency selective, the code orthogonality is lost so that orthogonal codes are

not much better than random codes, as will clearly appear in the following.Moreover, it is important to recall that orthogonality is preserved on the solecondition that all codes are sent and received simultaneously. In the uplink, thisimposes all users to be synchronous in their transmissions and that the delayincurred by the difference of wave travel distance between the user closest andthe user furthest to the base station is small enough.

For all these reasons, in days when CDMA technologies are no longer usedexclusively for communications over narrowband channels, the question is posedof whether orthogonal CDMA is preferable to random CDMA. This sectionprovides the orthogonal version of the results derived in the previous sectionfor i.i.d. codes. It will be shown that orthogonal codes perform always betterthan random codes in terms of achievable rates, although the difference becomesmarginal as the channel frequency selectivity increases.

Page 309: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 309/562

12.3. Performance of orthogonal CDMA technologies 285

Similar to the previous section, we start by the study of orthogonal CDMA inthe uplink.

12.3.1 Orthogonal CDMA in uplink frequency at channels

We start with the transmission model ( 12.1) where now w 1 , . . . , w K are K ≤ N columns of a Haar matrix, and W = [w 1 , . . . , w K ] is such that W H W = I K . Wedene H and P as before.

Consider the matched-lter for which the SINR ¯ γ (MF)k for user k is given by:

γ (MF)k =

P k |hk |2w Hk w k

w Hk WHPH H W H −P k |hk |2w k w H

k + σ2 I N w k

= P k |hk |2w

H

k w kσ2

which unfolds from the code orthogonality.Since we have, for all N

w Hi w j = δ ji

with δ ji the Kronecker delta, we have simply

γ (MF)

k =

P k |hk |2σ2

and the achievable sum rate C (MF) satises

C (MF) = 1N

K

k=1

log2 1 + P k |hk |2

σ2 .

This is the best we can get as this also corresponds to the capacity of boththe MMSE and the optimal joint decoder.

12.3.2 Orthogonal CDMA in uplink frequency selective channels

We now move to the frequency selective model ( 12.8).

12.3.2.1 Matched-lterThe SINR γ (MF)

k for the signal originating from user k reads as before

γ (MF)k =

P k x Hk x k

2

x Hk XPX H −P k x k x H

k + σ2 I N x k

where x k = D k F N w k , D k being diagonal dened as previously, and F N is theFourier transform matrix. Since W is formed of columns of a Haar matrix, F N Wis still Haar distributed.

Page 310: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 310/562

286 12. System performance of CDMA technologies

From the trace lemma, Corollary 6.3, we have again that the numeratorsatises

wH

k FH

N DH

k D k F N w k − 1N tr D

H

k D k

a .s.

−→ 0as long as the inner matrix has almost surely uniformly bounded spectral norm.The second term in the left-hand side is simply 1

N N n =1 |dkn |2 with dkn the nth

diagonal entry of D k . To handle the denominator, we need the following result.

Lemma 12.1 ([Bonneau et al., 2005]). Let W = [w 1 , . . . , w K ] ∈C N ×K , K ≤ N ,be K columns of a Haar random unitary matrix, and A ∈C N ×K be independent of W and have uniformly bounded spectral norm. Denote X = [x 1 , . . . , x K ] the matrix with (i, j )th entry wij a ij . Then, for k ∈ 1, . . . , K

x H

k XX H x k −1

N 2

N

n =1 j = k|ank |2|anj |2 −

1N 3

j = k

N

n =1akn a∗jn

2a .s.

−→ 0.

Remark 12.1. Compared to the case where w i has i.i.d. entries (see, e.g., ( 12.11)),observe that the i.i.d. and Haar cases only differ by the additional second termin brackets. Observe also that w H

k XX H w k is necessarily asymptotically smallerwhen W is Haar than if W has i.i.d. entries. Therefore, the interference term atthe output of the matched-lter is asymptotically smaller and the resulting SINRlarger. Note also that the almost sure convergence is easy to verify compared tothe (only conjectured) i.i.d. counterpart, since the eigenvalues of unitary matricesare all of unit norm.

Applying Lemma 12.1 to the problem at hand, we nally have that thedifference between γ (MF)

k and

P k 1N

N n =1 |dk,n |2

2

i= kN n =1

P iN 2 |dk,n |2|di,n |2 − i = k

P iN 3

N n =1 dk,n d∗i,n

2+ σ 2

N N n =1 |dk,n |2

is asymptotically equal to zero in the large N, K limit, almost surely. Theresulting deterministic equivalent for the capacity unfolds directly.To this day, a convenient deterministic equivalent for the capacity of the

MMSE decoder in the frequency selective class has not been proposed, althoughnon-convenient forms can be obtained using similar derivations as in the rststeps of the proof of Theorem 6.17. This is because the communication modelinvolves random Haar matrices with a variance prole, which are more involvedto study.

12.3.3 Orthogonal CDMA in downlink frequency selective channelsWe now consider the downlink model ( 12.15) and, again, turn W into K columnsof an N ×N Haar matrix.

Page 311: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 311/562

12.3. Performance of orthogonal CDMA technologies 287

12.3.3.1 Matched-lterThe SINR γ (MF)

k for user k at the output of the matched-lter is given by:

γ (MF)k = P k w

Hk H

Hk H k w k

2

w Hk H H

k P k H k WW H H Hk −P k H k w k w H

k H Hk + σ2 I N H k w k

.

Since H k is a circulant matrix and W is unitarily invariant, by writingF N H k F H

N = D k and w k = F N w k , W = [w 1 , . . . , w K ] ( W is still unitary andunitarily invariant), D k is diagonal and the SINR reads:

γ (MF)k =

P k w Hk D H

k D k w k2

w Hk D H

k P k D k W W H D Hk −P k D k w k w H

k D Hk + σ2 I N D k w k

.

The numerator is as usual such that

w H

k D H

k D k w k − 1N

N

i =1|dk,n |2

a.s.

−→ 0.

As for the denominator, we invoke once more Lemma 12.1 with a varianceprole a2

ij with aij constant over j . We therefore obtain

w Hk D H

k D k W W H D Hk −D k w k w H

k D Hk D k w k

− K −1N 2

N

n =1|dk,n |4 − K −1

N 3 N

n =1|dk,n |2

2a .s.

−→ 0.

A deterministic equivalent for the SINR therefore unfolds as

γ (MF)k −

P k 1N

N i=1 |dk,n |2

2

P k (K −1)N 2

N n =1 |dk,n |4 − P k (K −1)

N 3N n =1 |dk,n |2

2+ σ2

N N i =1 |dk,n |2

a .s.

−→ 0.

Compared to Equation ( 12.16), observe that the term in the denominatoris necessarily inferior to the term in the denominator of the deterministicequivalent of the SINR in ( 12.16), while the numerator is unchanged. As aconsequence, at least asymptotically, the performance of the orthogonal CDMAtransmission is better than that of the random CDMA scheme. Notice also, fromthe boundedness assumption on the |dn,k |, that

K −1N 3

N

n =1|dk,n |2

2

≤ K −1

N max

n |dk,n |4

the right-hand side of which is of order O(K/N ). The difference between thei.i.d. and orthogonal CDMA performance is then marginal for small K .

Page 312: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 312/562

288 12. System performance of CDMA technologies

This expression has an explicit limit if |dk,n | and P k have limiting densities

|h(κ, f )| and P (κ), respectively, and K/N → c. Precisely, we have:

γ (MF)k

a .s.

−→P (κ) 1

0 |h(κ, f )|2df 2

P (κ)c 10 |h(κ, f )|4df −P (κ)c 1

0 i 10

N n =1 |h(κ, f )|2df

2+ σ2 1

0 |h(κ, f )|2df .

The resulting limit of the deterministic equivalent of the capacity C (MF)

unfolds directly.

12.3.3.2 MMSE decoderThe MMSE decoder leads to the SINR ¯γ (MMSE)

k of the form

γ (MMSE)k = P k w H

k H H

k P k H k WW H H H

k −P k H k w k w H

k H H

k + σ2 I N −1H k w k .

A direct application of Theorem 6.19 when the sum of Gram matrices is takenover a single term leads to

γ (MMSE)k −

P k ek

1 −ek ek

where ek and ek satisfy the implicit equation

ek = K N

P k1 + P k ek −ek ek

ek = 1N

N

n =1

|dk,n |2ek |dk,n |2 + σ2 .

In [Debbah et al., 2003a,b], a free probability approach is used to derive theabove deterministic equivalent. The precise result of [Debbah et al., 2003a] isthat

γ (MMSE)k −ηk

a .s.

−→ 0

where ηk is the unique positive solution to

ηk

ηk + 1 =

1N

N

n =1

P k |dn,k |2cP k |dn,k |2 + σ2(1 −c)ηk + σ2 .

It can be shown that both expressions are consistent, i.e. that ηk = P k ek1−ek ek

, bywriting

P k ek

1

−ek ek

P k ek

1

−ek ek

+ 1−1

= 1N

N

n =1

P k |dn,k |2cP k

|dn,k

|2 + σ2(1

−ek ek + P k ek )

= 1N

N

n =1

P k |dn,k |2cP k |dn,k |2 + σ2(1 + (1 −c) P k ek

1−ek ek)

Page 313: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 313/562

Page 314: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 314/562

290 12. System performance of CDMA technologies

0 0.5 1 1.5 20

1

2

3

K/N

S p e c t r a l e ffi c i e n c y

[ b i t s / s / H z ]

MMSE, i.i.d.MMSE, orth.MF, i.i.d.MF, orth.

Figure 12.7 Spectral efficiency of random and orthogonal CDMA decoders, fordifferent ratios K/N , K = 512, SNR=10 dB, Rayleigh frequency selective fadingchannels L = 1, in the downlink.

orthogonal case). That is, we consider that the orthogonal codes received by theusers are perfectly orthogonal. The SNR is set to 10 dB, and the number of

receivers taken for simulation is K = 512. In this case, we observe indeed thatboth the matched-lter and the MMSE lter for the orthogonal codes perform thesame (as no inter-code interference is present), while the linear lters for the i.i.d.codes are highly suboptimal, as predicted. Then, in Figure 12.8, we consider thescenario of an L = 8-tap multi-path channel, which now shows a large advantageof the MMSE lter compared to the matched-lter, both for i.i.d. codes andfor orthogonal codes. Nonetheless, in spite of the channel convolution effect, thespectral efficiency achieved by orthogonal codes is still largely superior to thatachieved by random codes. This is different from the uplink case, where differentchannels affect the different codes. Here, from the point of view of the receiver,the channel convolution effect affects identically all user codes, therefore limitingthe orthogonality reduction due to frequency selectivity and thus impacting thespectral efficiency in a limited manner.

This completes the chapter on CDMA technologies. We now move to the studyof multiple antenna communications when the number of antennas on eithercommunication side is large.

Page 315: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 315/562

12.3. Performance of orthogonal CDMA technologies 291

0 0.5 1 1.5 20

1

2

3

K/N

S p e c t r a l e ffi c i e n c y

[ b i t s / s / H z ]

MMSE, i.i.d.MMSE, orth.MF, i.i.d.MF, orth.

Figure 12.8 Spectral efficiency of random and orthogonal CDMA decoders, fordifferent ratios K/N , K = 512, SNR=10 dB, Rayleigh frequency selective fadingchannels, L = 8, in the downlink.

Page 316: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 316/562

Page 317: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 317/562

13 Performance of multiple antenna

systems

In this section, we study the second most investigated application of randommatrix theory to wireless communications, namely multiple antenna systems,rst introduced and motivated by the pioneering works of Telatar [Telatar,1995] and Foschini [Foschini and Gans , 1998]. While large dimensional systemanalysis is easily defensible in CDMA networks, which typically allow for a largenumber of users with large orthogonal or random codes, it is not so for multipleantenna communications. Indeed, when it comes to applying approximatedresults provided by random matrix theory analysis, we expect that the typicalsystem dimensions are of order ten to a thousand. However, for multiple inputmultiple output (MIMO) setups, the system dimensions can be of order 4, oreven 2. Asymptotic results for such systems are then of minor interest. However,it will turn out in some specic scenarios that the difference between the

ergodic capacity for multiple antenna systems and their respective deterministicequivalents is sometimes of order O(1/N ), N being the typical system dimension.The per-receive antenna rate, which is of interest for studying the cost and gain of bringing additional antennas on nite size devices, can therefore be approximatedwithin O(1/N 2). This is a rather convenient rate, even for small N . In fact, as willbe observed through simulations, the accuracy of the deterministic equivalentsis often even better.

13.1 Quasi-static MIMO fading channels

We hereafter recall the foundations of multiple antenna communications. We rstassume a simple point-to-point communication between a transmitter equippedwith n t antennas and a receiver equipped with n r antennas. The communicationchannel is assumed linear, frequency at, and is modeled at any instant by thematrix H ∈C n r ×n t , with ( i, j ) entry hij . At time t, the transmitter emits thedata vector x ( t )

∈C n t through H , which is corrupted by additive white Gaussian

noise σn ( t )∈

C n r with entries of variance σ2 and received as y ( t )∈

C n r . We

therefore have the classical linear transmission model

y ( t ) = Hx ( t ) + σn ( t ) .

Page 318: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 318/562

Page 319: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 319/562

13.2. Time-varying Rayleigh channels 295

The capacity C (n r ,n t ) found above, when H is constant over a long time(sufficiently long to be considered innite) and H is known at the transmitterside, will be further referred to as the quasi-static channel capacity . That is,it concerns the scenario when H is a fading channel static over a long time,compared to the data transmission duration.

13.2 Time-varying Rayleigh channels

However, in mobile communications, it is often the case that H is varyingfast, and often too fast for the transmitter to get to know the transmission

environment perfectly prior to transmission. For simplicity here, we assumethat some channel information is nonetheless emitted by the transmitter in thedirection of the receiver prior to proper communication so that the receiver is atall times fully aware of H . We also assume that the feedback effort is negligible interms of consumed bit rate (think of it as being performed on an adjacent controlchannel). In this case, the computation of the mutual information between thetransmitter and the receiver therefore assumes that the exact value of H isunknown to the transmitter, although the joint probability distribution P H (H )of H is known (or at least that some statistical information about H has beengathered). The computation of Shannon’s capacity C (n r ,n t )

ergodic in this case reads:

C (n r ,n t )ergodic (σ2) max

Ptr P ≤P log2 det I n r +

1σ2 HPH H dP H (H )

which is the maximization over P of the expectation over H of the mutualinformation I (n r ,n t ) (σ2 ; P ) given in (13.1).

This capacity is usually referred to as the ergodic capacity . Indeed, we assume

that H is drawn from an ergodic process, that is a process whose probabilitydistribution can be deduced from successive observations. Determining the exactvalue of C (n r ,n t )

ergodic in this case is more involved as an integral has to be solved andmaximized over P .

We now recall the early result from Telatar [Telatar, 1995 , 1999] on the ergodiccapacity of multiple antenna at Rayleigh fading channels. Telatar assumes thenow well spread i.i.d. Gaussian model, i.e. H has Gaussian i.i.d. entries of zeromean and variance 1 /n t . This assumption amounts to assuming that the physicalchannel between the transmitter side and the receiver side is lled with numerousscatterers and that there exists no line-of-sight component. The choice of lettingthe entries of H have variances proportional to 1 /n t changes the power constraintinto 1

n ttr P ≤ P , which will turn out to be often more convenient for practical

calculus.

Page 320: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 320/562

296 13. Performance of multiple antenna systems

13.2.1 Small dimensional analysis

When H is i.i.d. Gaussian, it is unitarily invariant so that the ergodic capacity for

the channel HU , for U ∈C n t

×n t

unitary, is identical to the ergodic capacity forH itself. The optimal precoding matrix P can therefore be considered diagonal(non-negative denite) with no generality restriction. Denote Π (n t ) the set of permutation matrices of size nt ×n t , whose cardinality is ( n t !). Note [Telatar,1999], by the concavity of log 2 det( I n t + 1

σ 2 HPH H ) seen as a function of P , thatthe matrix Q 1

n t ! Π∈Π

( n t ) ΠPΠ H is such that

log2 det I n t + 1σ2 HQH H

≥ 1n t !

Π∈Π

( n t )

log2 det I n t + 1σ2 HΠPΠ H H H

= log 2 det I n t + 1σ2 HPH H .

This follows from Jensen’s inequality. Since P was arbitrary, Q maximizes thecapacity. But now notice that Q is, by construction, necessarily a multiple of theidentity matrix. With the power constraint, we therefore have Q = I n t .

It therefore remains to evaluate the ergodic capacity as

C (n r ,n t )ergodic (σ2) =

log2 det I n r +

1σ2 HH H dP H (H )

where P H is the density of an ( n r ×n t )-variate Gaussian variable with entriesof zero mean and variance 1 /n t . To this purpose, we rst diagonalize HH H andwrite

C (n r ,n t )ergodic (σ2) = n r ∞

0log2 1 +

λn t σ2 pλ (λ)dλ

where pλ is the marginal eigenvalue distribution of the null Wishart matrixn

tHH H . Remember now that this is exactly stated in Theorem 2.3. Hence, we

have that

C (n r ,n t )ergodic (σ2) = ∞

0log2 1 +

λn t σ2

m −1

k=0

n r k!(k + n −m)!

[Ln −mk (λ)]2λn −m e−λ dλ

(13.3)with m = min( n r , n t ), n = max( n r , n t ), and Lj

i are the Laguerre polynomials.This important result is however difficult to generalize to more involved

channel conditions, e.g. by introducing side correlations or a variance prole tothe channel matrix. We will therefore quickly move to large dimensional analysis,where many results can be found to approximate the capacity of point-to-pointMIMO communications in various channel conditions, through deterministicequivalents.

Page 321: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 321/562

13.2. Time-varying Rayleigh channels 297

13.2.2 Large dimensional analysis

From a large dimensional point of view, the ergodic capacity evaluation for the

Rayleigh i.i.d. channel is a simple application of the Marcenko–Pastur law. Wehave that the per-receive antenna capacity 1n r

C (n r ,n t )ergodic satises

1n r

C (n r ,n t )ergodic (σ2) → ∞

0log2 1 +

xσ2 (x −(1 −√ c)2)((1 + √ c −x)2 −x)

2πcx dx

as (n t , n r ) grow large with asymptotic ratio nr /n t → c, 0 < c < ∞. This resultwas already available in Telatar’s pioneering article [Telatar, 1999 ]. In fact, weeven have that the quasi-static mutual information for P = I n t converges almostsurely to the right-hand side value. Since the channels H for which this is notthe case lie in a space of zero measure, this implies that the convergence holdssurely in expectation. Now, an explicit expression for the above integral canbe derived. It suffices here to apply, e.g. Theorem 6.1 or Theorem 6.8 to theextremely simple case where the channel matrix has no correlation. In that case,the equations leading to the deterministic equivalent for the Stieltjes transformare explicit and we nally have the more interesting result

1n r

C (n r ,n t )ergodic (σ2)

− log2 1 + 1

σ2(1 + n r

n tδ )

+ nt

n rlog2 1 +

nr

n tδ + log 2(e) σ2δ −1 → 0

where δ is the positive solution to

δ = 1

1 + n rn t

δ + σ2

−1

which is explicitly given by:

δ = 12

1σ2 1 −

nt

n r − nt

n r+ 1σ2 1 −

nt

n r − nt

n r

2

+ 4 nt

n r σ2 .

Also, since H has Gaussian entries, by invoking, e.g. Theorem 6.8, it is knownthat the convergence is as fast as O(1/n 2

t ). Therefore, we also have that

C (n r ,n t )ergodic (σ2)

− n r log2 1 + 1

σ2(1 + n rn t

δ )+ n t log2 1 +

nr

n tδ + n r log2(e) σ2δ −1

= O(1/n t ). (13.4)

In Table 13.1, we evaluate the absolute difference between C (n r ,n t ) (σ2) (fromEquation ( 13.3)) and its deterministic equivalent (given in Equation ( 13.4)),relative to C (n r ,n t ) (σ2), for (n t , n r ) ∈ 1, . . . , 82 . The SNR is 10 dB. We observethat, even for very small values of nt and nr , the relative difference does not

Page 322: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 322/562

298 13. Performance of multiple antenna systems

n r , n t 1 2 3 4 5 6 7 81 0.0630 0.0129 0.0051 0.0027 0.0016 0.0011 0.0008 0.00062 0.0116 0.0185 0.0072 0.0035 0.0020 0.0013 0.0009 0.00073 0.0039 0.0072 0.0080 0.0044 0.0025 0.0016 0.0011 0.00084 0.0019 0.0032 0.0046 0.0044 0.0029 0.0019 0.0013 0.00095 0.0011 0.0017 0.0025 0.0030 0.0028 0.0020 0.0014 0.00106 0.0007 0.0010 0.0015 0.0019 0.0021 0.0019 0.0015 0.00117 0.0005 0.0007 0.0009 0.0012 0.0015 0.0015 0.0014 0.00118 0.0003 0.0005 0.0006 0.0008 0.0010 0.0012 0.0012 0.0011

Table 13.1. Relative difference between true ergodic n r ×n t MIMO capacity andassociated deterministic equivalent.

exceed 6% and is of order 0 .5% for nt and nr of order 4. This simple examplemotivates the use of large dimensional analysis to approximate the real capacityof even small dimensional systems. We will see in this chapter that this trendcan be extended to more general models, although particular care has to betaken for some degenerated cases. It is especially of interest to determine thetransmit precoders that maximize the capacity of MIMO communications understrong antenna correlations at both communication ends. It will be shown thatthe capacity in this corner case can still be approximated using deterministic

equivalents, although fast convergence of the deterministic equivalents cannotbe ensured and the resulting estimators can therefore be very inaccurate. Inthis case, theory requires that very large system dimensions be assumed toobtain acceptable results. Nonetheless, simulations still suggest, apart from verydegenerated models, where, e.g. both transmit and receive sides have rank-1correlation proles, that the deterministic equivalents are still very accurate forsmall system dimensions.

Note additionally that random matrix theory, in addition to providingconsistent estimates for the ergodic capacity, also ensures that, with probabilityone, as the system dimension grows large, the instantaneous mutual informationof a given realization of a Rayleigh distributed channel is within o(1) of thedeterministic equivalent (this assumes however that only statistical channelstate information is available at the transmitter). This unveils some sort of deterministic behavior for the achievable data rates as the number of antennasgrows, which can be thought of as a channel hardening effect [Hochwald et al.,2004], i.e. as the system dimensions grow large, the variance of the quasi-staticcapacity is signicantly reduced.

13.2.3 Outage capacityFor practical nite dimensional quasi-static Rayleigh fading channel realizations,whose realizations are unknown beforehand, the value of the ergodic capacity,

Page 323: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 323/562

13.2. Time-varying Rayleigh channels 299

that can be only seen as an a priori “expected” capacity, is not a proper measureof the truly achievable transmission data rate. In fact, if the Rayleigh fadingchannel realization is unknown, the largest rate to which we can ensure data istransmitted reliably is in fact null. Indeed, for every given positive transmissionrate, there exists a non-zero probability that the channel realization has a lessercapacity. As this statement is obviously not convenient, it is often preferable toconsider the so-called q -outage capacity dened as the largest transmission datarate that is achievable at least a fraction q of the time. That is, for a randomchannel H with realizations H (ω) and instantaneous capacity C (n t ,n r ) (σ2 ; ω),ω ∈ Ω, the q -outage capacity C (n t ,n r )

outage (σ2 ; q ) is dened as

C (n t ,n r )outage (σ2 ; q ) = sup

R

≥0

P ω, C (n t ,n r ) (σ2 ; ω) > R ≤ q

= supR ≥0

P C (n t ,n r ) (σ2) > R (13.5)

with C (n t ,n r ) (σ2) seen here as a random variable of H .It is often difficult to characterize fully the outage capacity under perfect

channel knowledge at the transmitter, since the transmit precoding policy isdifferent for each channel realization. In the following, we will in general referto the outage capacity as the outage rate obtained under deterministic (oftenuniform) power allocation at the transmitter. In this scenario, we have insteadthe outage mutual information I (n t ,n r )

outage (σ2 ; P ; q ), for a specic precoding matrix

P

I (n t ,n r )outage (σ2 ; P ; q ) = sup

R ≥0P I (n t ,n r ) (σ2 ; P ) > R ≤ q . (13.6)

To determine the outage capacity of a given communication channel, it sufficesto be able to determine the complete probability distribution of C (n t ,n r ) (σ2) forvarying H , if we consider the outage capacity denition ( 13.5), or the probabilitydistribution of I (n t ,n r ) (σ2 ; P ) for a given precoder P if we consider the denition(13.6). For the latter, with P = I n t , for the Rayleigh fading MIMO channel

H ∈C n r ×n t , it suffices to describe the distribution of

log2 I N + 1σ2 HH H =

n r

k=1

log2 1 + λk

σ2

for the random variables H , with λ1 , . . . , λ n r the eigenvalues of HH H .Interestingly, as both nt and nr grow large, the distribution of the

zero mean random variable I (n t ,n r ) (σ2 ; P ) −E[I (n t ,n r ) (σ2 ; P )] turns out tobe asymptotically Gaussian [Kamath et al., 2002]. This result is in fact astraightforward consequence of the central limit Theorem 3.17. Indeed, considerTheorem 3.17 in the case when the central T N matrix is identity. Taking theuniform precoder P = I n t and letting f 1 be dened as f 1(x) = log 2 1 + x

σ 2 ,clearly continuous on some closed set around the limiting support of the

Page 324: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 324/562

300 13. Performance of multiple antenna systems

−4 −2 0 2 40

1 ·10−2

2 ·10−2

3 ·10−2

4 ·10−2

Centered capacity

D e n s i t y

SimulationCentral limit

Figure 13.1 Simulated I ( n t ,n r ) (σ2 ; I n t ) −E[I ( n t ,n r ) (σ2 ; I n t )] against central limit,σ2 = −10 dB.

Marcenko–Pastur law F , we obtain that

n r

log2 1 +

λσ2 dF HH H

(λ) −dF (λ) ⇒ X

with X a random real Gaussian variable with zero mean and variance computedfrom (3.26) as being equal to

E X 2 = −log 1 − σ4

16c (1 + √ c)2

σ2 + 1 − (1 −√ c)2

σ2 + 14

.

This result is also a direct application of Theorem 3.18 for Gaussian distributedrandom entries of X N .

13.3 Correlated frequency at fading channels

Although the above analysis has the advantage to predict the potentialcapacity gains brought by multiple antenna communications, i.i.d. Rayleighfading channel links are usually too strong an assumption to model practicalcommunication channels. In particular, the multiplexing gain of order min( n r , n t )announced by Telatar [Telatar, 1999 ] and Foschini [Foschini and Gans, 1998]relies on the often unrealistic supposition that the channel links are frequencyat and have a multi-variate independent zero mean Gaussian distribution. Inpractical communication channels, this model faces strong limitations. We lookat these limitations from a receiver point of view, although the same reasoningcan be performed on the transmitter side.

Page 325: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 325/562

13.3. Correlated frequency at fading channels 301

0 2 4 6 8 10 12 140

0.2

0.4

0.6

0.8

1

Achievable rate

D i s t r i b u t i o n f u n c t i o n

SISO, det. eq.SISO, sim.MIMO 2

×2, det. eq.

MIMO 2 ×2, sim.MIMO 4 ×4, det. eq.MIMO 4 ×4, sim.

Figure 13.2 Distribution function of C ( n t ,n r ) (σ2 ), σ2 = 0 .1, for different values of n t , n r , and comparison against deterministic equivalents.

• To ensure conditional independence (with respect to the transmitted data)of the waveforms received by two distinct antennas but emerging from asingle source, the propagation environments must be assumed decorrelatedin some sense. Roughly speaking, two incoming waveforms can be statedindependent if they propagate along different paths in the communicationmedium. This physically constrains the distance between receive antennas tobe of an order larger than the transmission wavelength. Introducing a specicmodel of channel correlation, it can be in particular shown that increasingthe number of antennas to innity on nite size devices leads to a physicallyfundamental rate saturation, see, e.g., [Couillet et al., 2008; Pollock et al.,2003]. To model more realistic channel matrices, statistical correlation between

transmit antennas and receive antennas must therefore be taken into account.The most largely spread channel model which accounts for both transmit andreceive signal correlations is the Kronecker channel model. This model assumesthat the transmitted signals are rst emitted from a correlated source (e.g.close transmit antennas, privileged direction of wave transmission, etc.), thenpropagate through a largely scattered environment, which acts as a randomi.i.d. linear lter decorrelated from transmit and receive parts, to be nallyreceived on a correlated antenna array. Again, the correlation at the receiveris due either to the fact that receive antennas are so close, or that the solidangle of direction of arrival is so thin, that all incoming signals are essentiallythe same on all antennas. Note that this model, although largely spread, hasbeen criticized and claimed unrealistic to some extent by eld measurementsin, e.g., [Ozcelik et al., 2003; Weichselberger et al., 2006].

Page 326: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 326/562

302 13. Performance of multiple antenna systems

• The i.i.d. Gaussian channel matrix model also assumes that the propagationpaths from a given transmit antenna to a given receive antenna have anaverage fading gain, which is independent of the selected antenna pair. Thisis a natural assumption for long-distance transmissions over a communicationmedium with a large number of scatterers. For xed communication systemswith smaller distances, this does not take into account the specic impactof the inter-antenna distance. A more general channel matrix model in thissense is to let the channel matrix entries be independent Gaussian entries withdifferent variances, though, i.e. with a variance prole.

• Both previous generalization models still suffer from the lack of line-of-sight components in the channel. Indeed, line-of-sight components cannot beconsidered Gaussian, or, for that matter, random in the short-term, but are

rather modeled as deterministic components with possibly a varying phaserotation angle. A more adequate model, that however assumes decorrelatedtransmissions over the random channel components, consists in summing adeterministic matrix, standing for the line-of-sight component, and a randommatrix with Gaussian entries and a variance prole. To account for therelative importance of the line-of-sight component and the random part,both deterministic and random matrices are scaled accordingly. This model isreferred to as the Rician model.

• Although this section only deals with frequency at fading channels, werecall that wideband communication channels, i.e. communication channels

that span over a large range of frequencies (the adjective “large” qualifyingthe fact that the transmission bandwidth is several times larger than thechannel coherence bandwidth) induce frequency dependent channel matrices.Therefore, in wideband transmissions, channel models cannot be simplyrepresented as a single matrix H ∈C n r ×n t at any given instant, but rather asa matrix-valued continuous function H (f ) ∈C n r ×n t , with f ∈ [−W/ 2, W/ 2]the communication bandwidth. This motivates in particular communicationschemes such as OFDM, which practically exploit these frequency properties.This is however not the subject of the current section, which will be given

deeper considerations in Section 13.5.In this section, we discuss the case of communication channels with correlation

patterns both at the transmitter and the receiver and a very scattered mediumin between; i.e. we consider here the Kronecker channel model. The results tobe introduced were initially derived in, e.g., [Debbah and M uller, 2003] usingtools from free probability theory. We will see in Chapter 18 that the Kroneckermodel has deep information-theoretic grounds, in the sense that it constitutesthe least informative channel model when statistical correlation matrices at bothcommunication sides are a priori known by the system modeler.

Letting H ∈C n r

×n t

be a narrowband Kronecker communication channelmatrix, we will write

H = R12 XT

12

Page 327: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 327/562

13.3. Correlated frequency at fading channels 303

where X ∈C n r ×n t is a random matrix with independent Gaussian entries of zero mean and variance 1 /n t , which models the rich scattering environment,R

12

∈C n r ×n r is a non-negative denite Hermitian square root of the non-negative

denite receive correlation matrix R , and T12 ∈C n t ×n t is a non-negative denite

Hermitian square root of the non-negative denite receive correlation matrix T .Note that the fact that X has Gaussian i.i.d. entries allows us to assume withoutloss of generality that both R and T are diagonal matrices (a remark which nolonger holds in multi-user scenarios). The achievable quasi-static bit rate C (n t ,n r )

for the additive white Gaussian noise channel under this medium ltering matrixmodel reads:

C (n t ,n r ) (σ2) = supP

1n

ttr P

≤P

I (n r ,n t ) (σ2 ; P )

where we dene as before the mutual information I (n r ,n t ) (σ2 ; P ) as

I (n t ,n r ) (σ2 ; P ) log2 det I n r + 1σ2 HPH H

= log 2 det I n r + 1σ2 R

12 XT

12 PT

12 X H R

12

where σ2 is the variance of the individual i.i.d. receive noise vector entries.The corresponding ergodic capacity takes the same form but with an additional

expectation in front of the log determinant, which is taken over the random Xmatrices.

Evaluating the ergodic capacity and the corresponding optimal covariancematrix in closed-form is however rather involved. This has been partially solvedin [Hanlen and Grant, 2003] where an exact expression of the ergodic mutualinformation is given as follows.

Theorem 13.1 (Theorem 2 in [Hanlen and Grant , 2003]). Let H = R12 XT

12 ,

with R ∈C n r ×n r and T ∈C n t ×n t deterministic and X ∈C n r ×n t random with i.i.d. Gaussian entries of zero mean and variance 1/n t . Then, denoting

I (n t ,n r ) (σ2 ; P ) = log 2 det I n r + 1σ2 HPH H

we have:

E I (n t ,n r ) (σ2; P ) = det( R )−n r det( TP )−n t

mi =1 (M −i)! m

i =1 (m −i)! Λ > 00F 0 −R −1 , Λ , T −1

×m

i =1λM −m

i

m

i<j

(λ i −λ j )2m

i =1log2 1 +

λi

σ2n tdΛ

with m = min( n r , n t ), M = max( n r , n t ), Λ = diag( λ1 , . . . , λ m ), 0F 0 dened in Equation (2.2) and the integral is taken over the set of m-dimensional vectors with positive entries.

Page 328: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 328/562

304 13. Performance of multiple antenna systems

Moreover, assuming R and T diagonal without loss of generality, the matrix P that maximizes E[I (n t ,n r ) (σ2 ; P )] under constraint 1

n ttr P ≤ P is the diagonal

matrix P = diag( p1 , . . . , p n t ) such that, for all k

∈ 1, . . . , n t

E (I n t + H H HP )−1H H H kk = µ, if pk > 0E (I n t + H H HP )−1H H H kk < µ, if pk = 0

for some µ set to satisfy the power constraint 1n t

tr P ≤ P .

These results, although exact, are difficult to use in practice. We will see inthe next section that large dimensional random matrix analysis brings severalinteresting features to the study of MIMO Kronecker channels.

• It rst allows us to provide a deterministic equivalent for the quasi-static mutual information of typical channels, assuming channel independenttransmit data precoding. The informal “typical” adjective here suggests thatsuch channels belong to a high probability subset of all possible channelrealizations. Indeed, we have already discussed the fact that, for non-deterministic channel models, the capacity is ill-dened and is actually zeroin the current Kronecker scenario. Instead, we will say that we providea deterministic equivalent for the achievable rate of all highly probabledeterministic channels, with deterministic transmit precoding. The reason why

transmit precoding cannot be optimized will be made clear.• It then allows us to obtain a deterministic equivalent of the ergodic capacityof correlated MIMO communications. As we will see, the transmit covariancematrix which maximizes the deterministic equivalent can be derived. Themutual information evaluated at this precoder can further be proved to beasymptotically close to the exact ergodic capacity. The main advantage of this approach is that, compared to the results of Theorem 13.1, deterministicequivalents provide much simpler and more elegant solutions to the ergodiccapacity characterization.

Nonetheless, before going further, we address some key limitations of theKronecker channel model. The major shortcoming of the Kronecker model lies inthe assumption of a rich scattered environment. To ensure that a large numberof antennas can be used on either communication side, the number of suchscatterers must be of an order larger than the product between the numberof transmit and received antennas, so as to generate diverse propagation pathsfor all transmit–receive antenna pairs. This assumption is typically not metin an outdoor environment. Also, no line-of-sight component is allowed for inthe Kronecker model, nor does there exist a correlation between transmit andreceive antennas. This is also restrictive in short range communications with fewpropagation paths.

To provide a deterministic equivalent for the capacity of the Kronecker channelmodel, we will use a simplied version of Corollary 6.1 of Theorem 6.4 and

Page 329: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 329/562

13.3. Correlated frequency at fading channels 305

Theorem 6.8. Both deterministic equivalents will obviously be consistent andin fact equal, although the theorems rely on different underlying assumptions.Thanks to these assumptions, different conclusions will be drawn in terms of theapplicability to specic channel conditions. Note that these results were initiallyderived in [Tulino et al., 2003] using Girko’s result, Theorem 3.14, and were alsoderived later using tools borrowed from physics in [Sengupta and Mitra, 2006] .

13.3.1 Communication in strongly correlated channels

We rst recall Corollary 6.1 in the case of a single transmitter, i.e. K = 1.Assume, as above, a Gaussian channel with channel matrix H and additive

noise variance σ2 . Let H = R12 XT

12 ∈C n r ×n t , with X ∈C n r ×n t composed of

Gaussian i.i.d. entries of zero mean and variance 1 /n t , R12

∈C n r

×n r

andT12 ∈C n t ×n t be Hermitian non-negative denite, such that F R and F T form

a tight sequence, as the dimensions nr , nr grow large, and the sequence nt /n r

is uniformly bounded from below, away from a > 0, and from above, away fromb < ∞. Also assume that there exists α > 0 and a sequence sn r , such that, forall nr

max( ts n r +1 , r s n r +1 ) ≤ α

where r i and t i denote the ith ordered eigenvalue of R and T , respectively, and,denoting bn r an upper-bound on the spectral norm of T and R and β some real,such that β > (b/a )(1 + √ a)2 , assume that an r = b2

n r β satises

sn r log2(1 + an r σ−2) = o(n r ). (13.7)

Then, for large nr , nt , the Shannon transform of HH H , given by:

VHH H (σ−2) = 1n r

log2 det( I n r + 1σ2 HH H )

satises

VHH H (σ−2)

−Vn r (σ−2) a.s.

−→ 0

where Vn r (σ−2) satises

Vn r (σ−2) = 1n r

n r

k =1

log2 1 + δr k + 1n r

n t

k=1

log2 1 + nr

n tδtk −σ2 log2(e)δ δ

with δ and δ the unique positive solutions to the xed-point equations

δ = 1σ2n r

n r

k=1

r k

1 + δr k

δ = 1σ2n t

n t

k =1

tk

1 + n rn t

δtk.

Page 330: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 330/562

306 13. Performance of multiple antenna systems

This result provides a deterministic equivalent for the Shannon transform of HH H , which is, in our information-theoretic context, a deterministic equivalentfor the per-receive antenna mutual information between the multiple antennatransmitter and the multiple antenna receiver when uniform power allocation isused across transmit antennas. We need now to understand the extent of theapplicability of the previous result, which consists in understanding exactly theunderlying assumptions made on T and R . Indeed, it is of particular importanceto be able to study the capacity of MIMO systems when the correlation matricesT and R have very ill-conditioned proles. It may seem in particular that thedeterministic equivalents may not be valid if T and R are composed of a few verylarge eigenvalues and all remaining eigenvalues are close to zero. This intuitionturns out not to be correct.

Remember that tightness, which is commonly dened as a probability theorynotion, qualies a sequence of distribution functions F 1 , F 2 , . . . (let us assumeF (x) = 0 for x < 0), such that, for all ε > 0, there exists M > 0, such that

F k (M ) > 1 −ε

for all k. This is often thought of as the probability theory equivalent toboundedness. Indeed, a sequence x1 , x2 , . . . of real positive scalars is boundedif there exists M such that xk < M for all k. Here, the parameter ε allows forsome event leakage towards innity, although the probability of such events isincreasingly small. Therefore, no positive mass can leak to innity. In our setting,F T and F R do not really form tight probability distributions in the classicalprobabilistic meaning of boundedness as they are deterministic distributionfunctions rather than random distribution functions. This does not however affectthe mathematical derivations to come.

Since T and R are correlation matrices, for a proper denition of the signal-to-noise ratio 1 /σ 2 , we assume that they are constrained by 1

n ttr T = 1 and

1n r

tr R = 1. That is, we do not allow the power transmitted or received to growas nt , nr grow. The trace constraint is set to one for obvious convenience. Notethat it is classical to let all diagonal entries of T and R equal one, as we generally

assume that every individual antenna on either communication side has the samephysical properties. This assumption would no longer be valid under a channelwith variance prole, for which the fading link hij between transmit antenna iand receive antenna j has different variances for different ( i, j ) pairs.

Observe that, because of the constraint 1n r

tr R = 1, the sequence F R (for growing nr ) is necessarily tight. Indeed, given ε > 0, take M = 2/ε ;n r [1−F R (M )] is the number of eigenvalues in R larger than 2 /ε , which isnecessarily smaller than or equal to nr ε/ 2 from the trace constraint, leadingto 1 −F R (M ) ≤ ε/ 2 and then F R (M ) ≥ 1 −ε/ 2 > 1 −ε. The same naturallyholds for matrix T . Now the condition regarding the smallest eigenvalues of Rand T (those less than α) requires a stronger assumption on the correlationmatrices. Under the trace constraint, this requires that there exists α > 0, suchthat the number of eigenvalues in R greater than α is of order o(n r / log n r ). This

Page 331: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 331/562

Page 332: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 332/562

308 13. Performance of multiple antenna systems

TP are less than M , hence 1 −F TP (M ) < ε and F TP is tight. Once again,the condition on the smallest eigenvalues can be satised for a vast majority of T

12 PT

12 matrices from the same argument.

The analysis above makes it possible to provide deterministic equivalents forthe per-antenna mutual information of even strongly correlated antenna patterns.Assuming a quasi-static channel with imposed deterministic power allocationpolicy P at the transmitter (assume, e.g. short time data transmission over atypical quasi-static channel such that the transmitter does not have the luxury toestimate the propagation environment and chooses P in a deterministic manner),the mutual information I (n t ,n r ) (σ2 ; P ) for this precoder satises

1n r

I (n t ,n r ) (σ2 ; P ) − 1n r

log2 det I n t + nr

n tδ T

12 PT

12

+ 1n r

n r

k=1

log2 1 + δr k −σ2 log2(e)δ δ a.s.

−→ 0

with δ and δ the unique positive solutions to

δ = 1σ2n r

n r

k =1

r k

1 + δr k,

δ = 1σ2n t

tr T12 PT

12 I n t +

nr

n tδ T

12 PT

12

and this is valid for all (but a restricted set of) choices of T , P , and R matrices.This property of letting T and R have eigenvalues of order O(n r ) is crucial to

model the communication properties of very correlated antenna arrays, althoughextreme care must be taken in the degenerated case when both T and R havevery few large eigenvalues and the remainder of their eigenvalues are close tozero. In such a scenario, extensive simulations suggest that the convergence of thedeterministic equivalent is extremely slow, to the point that even the per-antennacapacity of a 1000 ×1000 MIMO channel is not well approximated by thedeterministic equivalent. Also, due to strong correlation, it may turn out that the

true per-antenna capacity decreases rather fast to zero, with growing nr , n t , whilethe difference between true per-antenna capacity and deterministic equivalent isonly slowly decreasing. This may lead to the very unpleasant consequence thatthe relative difference grows to innity while the effective difference goes to zero,but at a slow rate. On the other hand, extensive simulations also suggest thatwhen only one of T and R is very ill-conditioned, the other having not toofew large eigenvalues, deterministic equivalents are very accurate even for smalldimensions.

When it comes to determining the optimal transmit data precoding matrix,we seek for the optimal matrix P , H being fully known at the transmitter, suchthat the quasi-static mutual information is maximized. We might think thatdetermining the precoding matrix which maximizes the deterministic equivalentof the quasi-static mutual information can provide at least an insight into the

Page 333: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 333/562

13.3. Correlated frequency at fading channels 309

quasi-static capacity. This is however an incorrect reasoning, as the optimalprecoding matrix, and therefore the system capacity, depend explicitly on theentries of X . But then, the assumptions of Theorem 6.4 clearly state that thecorrelation matrices are deterministic or, as could be shown, are random butat least independent of the entries of X . The deterministic equivalent of themutual information can therefore not be extended to a deterministic equivalentof the quasi-static capacity. As an intuitive example, consider an i.i.d. Gaussianchannel H . The eigenvalue distribution of HH H is, with high probability, close tothe Marcenko–Pastur for sufficiently large dimensions. From our nite dimensionanalysis of the multiple antenna quasi-static capacity, water-lling over theeigenvalues of the Marcenko–Pastur law must be applied to maximize thecapacity, so that strong communication modes receive much more power than

strongly faded modes. However, it is clear, by symmetry, that the deterministicequivalent of the quasi-static mutual information is maximized under equal powerallocation, which leads to a smaller rate.

13.3.2 Ergodic capacity in strongly correlated channels

Assuming T and R satisfy the conditions of Theorem 6.4, the space of matricesX over which the deterministic equivalent of the per-antenna quasi-static mutualinformation is an asymptotically accurate estimator has probability one. As aconsequence, by integrating the per-antenna capacity over the space of Rayleighmatrices X and applying straightforwardly the dominated convergence theorem,Theorem 6.3, for all deterministic precoders P , we have that the per-antennaergodic mutual information is well approximated by the deterministic equivalentof the per-antenna quasi-static mutual information for this precoder.

It is now of particular interest to provide a deterministic equivalent for theper-antenna ergodic capacity , i.e. for unconstrained choice of a deterministictransmit data precoding matrix. Contrary to the quasi-static scenario, as thematrix X is unknown to the transmitter, the optimal precoding matrix is nowchosen independently of X . As a consequence, it seems possible to provide a

deterministic equivalent for1

n rC (n t ,n r )

ergodic (σ2) supP

1n t

tr P =1

1n r

E[I (n t ,n r ) (σ2 ; P )]

with

I (n t ,n r ) (σ2; P ) log2 det I n r + 1σ2 HPH H .

This is however not so obvious as P is allowed here to span over all matriceswith constrained trace, which, as we saw, is a larger set than the set of matricesthat satisfy the constraint ( 13.7) on their smallest eigenvalues. If we consider inthe argument of the supremum only precoding matrices satisfying the assumption

Page 334: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 334/562

310 13. Performance of multiple antenna systems

(13.7), the per-antenna unconstrained ergodic capacity reads:

1n

r

C (n t ,n r )ergodic (σ2) = sup

P1

n ttr P =1

qs n r +1 ≤α

1n

r

E[I (n t ,n r ) (σ2 ; P )]

for some constant α > 0, with q 1 , . . . , q n t the eigenvalues of T12 PT

12 and

s1 , s 2 , . . . any sequence which satises ( 13.7) (with T replaced by T12 PT

12 ).

We will assume in what follows that ( 13.7) holds for all precoding matricesconsidered. It is then possible to provide a deterministic equivalent for

1n r

C (n t ,n r )ergodic (σ2) as it is possible to provide a deterministic equivalent for the right-

hand side term for every deterministic P . In particular, assume P is a precoderthat achieves 1

n rC (n t ,n r )

ergodic and P is a precoder that maximizes its deterministic

equivalent. We will denote 1n r I (n t ,n r )(σ2 ; P ) the value of the deterministicequivalent of the mutual information for precoder P . In particular

1n r

I (n t ,n r )(σ2 ; P ) = 1n r

n r

k=1

log2 1 + δ (P )r k + 1n r

log2 det I n t + δ (P )T12 PT

12

−σ2 log2(e)δ (P )δ (P )

with δ (P ) and δ (P ) the unique positive solutions to the equations in ( δ, δ )

δ = 1

σ2n r

n r

k=1

r k

1 + δr k,

δ = 1σ2n t

tr T12 PT

12 I n t +

nr

n tδ T

12 PT

12 . (13.8)

We then have1

n rC (n t ,n r )

ergodic (σ2) − 1n r

I (n t ,n r )(σ2 ; P )

= 1n r

E[I (n t ,n r ) (σ2 ; P )] − 1n r

I (n t ,n r )(σ2 ; P )

= 1n r E[I (n t ,n r ) (σ2; P )] −I (n t ,n r )(σ2 , P )

+ 1n r

I (n t ,n r )(σ2 ; P ) −I (n t ,n r )(σ2 ; P )

= 1n r

E[I (n t ,n r ) (σ2; P )] −E[I (n t ,n r ) (σ2 , P )]

+ 1n r

E[I (n t ,n r ) (σ2 , P )] −I (n t ,n r )(σ2 ; P ) .

In the second equality, as nt , nr grow large, the rst term goes to zero,while the second is clearly negative by denition of P , so that asymptotically

1n r (C (n t ,n r )ergodic (σ2) −I (n t ,n r )(σ2 ; P )) is negative. In the third equality, the rstterm is clearly positive by denition of P while the second term goes tozero, so that asymptotically 1

n r(C (n t ,n r )

ergodic (σ2) −I (n t ,n r )(σ2; P )) is positive.

Page 335: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 335/562

13.3. Correlated frequency at fading channels 311

Therefore 1n r

(C (n t ,n r )ergodic (σ2) −I (n t ,n r )(σ2 ; P )) a.s.

−→ 0, and the maximum value of the deterministic equivalent provides a deterministic equivalent for the ergodiccapacity. This however does not say yet whether P is close to the optimalprecoder P itself. To verify this fact, we merely need to see that both the ergodiccapacity and its deterministic equivalent, seen as functions of the precoder P , arestrictly concave functions, so that they both have a unique maximum; this canbe proved without difficulty and therefore P coincides with P asymptotically.

The analysis above has the strong advantage to be valid for all practicalchannel conditions that follow the Kronecker model, even strongly correlatedchannels. However, it also comes along with some limitations. The strongestlimitation is that the rate of convergence of the deterministic equivalent of theper-antenna capacity is only ensured to be of order o(1). This indicates that an

approximation of the capacity falls within o(n t ), which might be good enough(although not very satisfying) if the capacity scales as O(n t ). In the particularcase where an increase in the number of antennas on both communication endsincreases correlation in some sense, the capacity no longer scales linearly withn t , and the deterministic equivalent is of no practical use. The main two reasonswhy only o(1) convergence could be proved is that the proof of Theorem 6.1 is(i) not restricted to Gaussian entries for X , and (ii) assumes tight F T , F R sequences. These two assumptions are shown, through truncation steps in theproof, to be equivalent to assuming that T , R , and X have entries bounded bylog(n t ), therefore growing to innity, but not too fast.

In the following, the same result is discussed but under tighter assumptions onthe T , R , and X matrices. It will then be shown that a much faster convergencerate of the deterministic equivalent for the per-antenna capacity can be derived.

13.3.3 Ergodic capacity in weakly correlated channels

We now turn to the analysis provided in [Dupuy and Loubaton , 2009, 2010;Moustakas and Simon, 2007 ], leading in particular to Theorem 6.8. Obviously,the deterministic equivalent for the case when K = 1 is the same as the one from

Theorem 6.4 for K = 1. However, the underlying conditions are now slightlydifferent. In particular, large eigenvalues in R and T are no longer allowed aswe now assume that both R and T have uniformly bounded spectral norm.Also, the proof derived in Theorem 6.8 explicitly takes into account the fact thatX has Gaussian entries. In this scenario, it is shown that the ergodic capacityitself E[I (n t ,n r ) ], and not simply the per-antenna ergodic mutual information, iswell approximated by the above deterministic equivalent. Constraining uniformpower allocation across the transmit antennas, we have:

E[I (n t ,n r ) (σ2 ; I n t )]

− n r

k =1

log2 1 + δr k +n t

k =1

log2 1 + nr

n tδtk −n r σ2 log2(e)δ δ = O(1/n r ).

Page 336: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 336/562

312 13. Performance of multiple antenna systems

The capacity maximizing transmit precoder P is shown similarly as beforeto coincide asymptotically with the transmit precoder P that maximizes thedeterministic equivalent. We need however to constrain the matrices P tolie within a set of ( n t ×n t ) Hermitian non-negative matrices with uniformlybounded spectral norm. Nonetheless, in their thorough analysis of the Ricianmodel [Dumont et al., 2010; Hachem et al., 2007, 2008b], which we will discussin the following section, Hachem et al. go much further and show explicitly that,under the assumption of uniformly bounded spectral norms for T and R , theassumption that P lies within a set of matrices with uniformly bounded spectralnorm is justied. The proof of this result is however somewhat cumbersome andis not further discussed.

In the following section, we turn to the proper evaluation of the capacity

maximizing transmit precoder.

13.3.4 Capacity maximizing precoder

As was presented in the above sections, the capacity maximizing precoder P isclose to P , the precoder that maximizes the deterministic equivalent of the truecapacity, for large system dimensions. We will therefore determine P instead of P . Note nonetheless that [Hanlen and Grant , 2003] proves that the Gaussianinput distribution is still optimal in this correlated setting and provides aniterative water-lling algorithm to obtain P , although the formulas involvedare highly non-trivial. Since we do not deal with asymptotic considerations here,there is no need to restrict the denition domain of the non-negative HermitianP matrices.

By denition

P

= argmaxP

n r

k=1

log2 1 + δr k + log 2 det I n t + nr

n tδ T

12 PT

12 −n r σ2 log2(e)δ δ

where δ and δ are here function of P , dened as the unique positive solutions to(13.8).

In order to simplify the differentiation of the deterministic equivalent along P ,which is made difficult due to the interconnection between δ , δ , and P , we willuse the differentiation chain rule. For this, we rst denote V the function

V : (∆ , ∆ , P ) →n r

k =1

log2 1 + ∆ r k + log 2 det I n t + nr

n t∆ T

12 PT

12

−n r σ2 log2(e)∆ ∆ .

That is, we dene V as a function of the independent dummy parameters ∆,∆, and P . The function V is therefore a deterministic equivalent of the capacityonly for restricted choices of ∆, ∆, and P , i.e. ∆ = δ , ∆ = δ that satisfy the

Page 337: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 337/562

13.3. Correlated frequency at fading channels 313

implicit Equations ( 13.8). We have that

∂V

∂ ∆(∆ , ∆ , P ) = log 2(e)

n r

n ttr T

12 PT

12 I n t +

nr

n t∆ T

12 PT

12

−1

−n r σ2 ∆

∂V ∂ ∆

(∆ , ∆ , P ) = log 2(e) n r

k =1

r k

1 + ∆ r k −n r σ2∆ .

Observe now that, for δ and δ the solutions of ( 13.8) (for any given P ), wehave also by denition that

n r

n ttr T

12 PT

12 I n t +

nr

n tδ T

12 PT

12

−1

−n r σ2 δ = 0

n r

k=1r k

1 + δr k −n r σ2δ = 0.

As a consequence

∂V ∂ ∆

(δ, δ, P ) = 0

∂V ∂ ∆

(δ, δ, P ) = 0

and then:

∂V ∂ ∆ (δ, δ, P )

∂ ∆∂ P (δ, δ, P ) +

∂V ∂ ∆ (δ, δ, P )

∂ ∆∂ P (δ, δ, P ) +

∂V ∂ P (δ, δ, P ) =

∂V ∂ P (δ, δ, P )

as all rst terms vanish. But, from the differentiation chain rule, this expressioncoincides with the derivative of the deterministic equivalent of the mutualinformation along P . For P = P , this derivative is zero by denition of P .Therefore, setting the derivative along P of the deterministic equivalent of themutual information to zero is equivalent to writing

∂V ∂ P

(δ (P ), δ (P ), P ) = 0

for (δ (P ), δ (P )) the solution of ( 13.8) with P = P . This is equivalent to∂

∂ P log2 det I n t + δ (P )T

12 PT

12 = 0

which reduces to a water-lling problem for given δ (P ). Denoting T = UΛU H

the spectral decomposition of T for some unitary matrix U , P is dened asP = UQ U H , with Q diagonal with ith diagonal entry q i given by:

q i = µ − 1

δ (P )t i

+

µ being set so that 1n t

n tk=1 q k = P .

Obviously, the question is now to determine δ (P ). For this, the iterativewater-lling algorithm of Table 13.2 is proposed.

Page 338: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 338/562

314 13. Performance of multiple antenna systems

Dene η > 0 the convergence threshold and l ≥ 0 the iteration step.At step l = 0, for k ∈ 1, . . . , n t , set q 0k = P while maxk

|q lk

−q l−1

k

| > η do

Dene (δ l+1 , δ l+1 ) as the unique pair of positive solutions to ( 13.8)for P = UQ l U H , Q l = diag( q l1 , . . . , q ln t )for i ∈ 1 . . . , n t do

Set q l+1i = µ − 1

ce l +1 t i

+, with µ such that 1

n ttr Q l = P

end forassign l ← l + 1

end while

Table 13.2. Iterative water-lling algorithm for the Kronecker channel model.

In [Dumont et al., 2010], it is shown that, if the iterative water-lling algorithmdoes converge, then the iterated Q l matrices of the algorithm in Table 13.2necessarily converge towards Q . To this day, though, no proof of the absoluteor conditional convergence of this water-lling algorithm has been provided.

To conclude this section on correlated MIMO transmissions, we presentsimulation results for R and T modeled as a Jakes’ correlation matrix, i.e. the(i, j )th entry of R or T equals J 0(2πd ij /λ ), with λ the transmission wavelengthand dij the distance between antenna i and antenna j (on either of the twocommunication sides). We assume the antennas are distributed along a horizontallinear array and numbered in order (say, from left to right), so that T and Rare simply Toeplitz matrices based on the vector (1 , J 0(2πd/λ ), . . . , J 0(2π(n t −1)d/λ ), with d the distance between neighboring antennas. The eigenvalues of Rand T for n t = 4 are provided in Table 13.3 for different ratios d/λ . Jakes’ modelarises from the assumption that the antenna array under study transmits orreceives waveforms of wavelength λ isotropically in the three-dimensional space,which is a satisfying assumption under no additional channel constraint.

In Figure 13.3, we depict simulation results of the MIMO mutual information

with equal power allocation at the transmitter as well as the MIMO capacity (i.e.with optimal power allocation), and compare these results to the deterministicequivalents derived in this section. We assume 4 ×4 MIMO communication withinter-antenna spacing d = 0 .1λ and d = λ. It turns out that, even for stronglycorrelated channels, the mutual information with uniform power allocation isvery well approximated for all SNR values, while a slight mismatch appears forthe optimal power allocation policy. Although this is not appearing in this graph,we mention that the mismatch gets more acute for higher SNR values. This isconsistent with the results derived in this section.

From an information-theoretic point of view, observe that the gains achievedby optimal power allocation in weakly correlated channels are very marginal,as the water-lling algorithm distributes power almost evenly on the channeleigenmodes, while the gains brought about by optimal power allocation in

Page 339: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 339/562

13.3. Correlated frequency at fading channels 315

Correlation factor Eigenvalues of T , Rd = 0 .1λ 0.0 0.0 0.1 3.9

d = λ 0.3 1.0 1.4 1.5d = 10λ 0.8 1.0 1.1 1.1

Table 13.3. Eigenvalues of correlation matrices for n t = n r = 4 , under differentcorrelations.

−15

−10

−5 0 5 10 15 20

0

5

10

15

20

25

SNR [dB]

A c h i e v a b l e r a t e

dλ = 0 .1, uni., sim.dλ = 0 .1, uni., det.dλ = 0 .1, opt., sim.dλ = 0 .1, opt., det.dλ = 1, uni., sim.dλ = 1, uni., det.dλ = 1, opt., sim.dλ = 1, opt., det.

Figure 13.3 Ergodic capacity from simulation (sim.) and deterministic equivalent(det.) of the Jakes’ correlated 4 ×4 MIMO, for SNR varying from −15 dB to 20 dB,for different values of d

λ , time-varying channels, uniform (uni.) and optimal (opt.)power allocation.

strongly correlated channels are much more relevant, as the water-lling

algorithm manages to pour much of the power onto the stronger eigenmodes.Also, the difference between the mutual information for uniform and optimalpower allocations reduces as the SNR grows. This is due to the fact that, forhigh SNR, the contribution to the capacity of every channel mode becomesensibly the same, and therefore, from concavity arguments, it is optimal toevenly distribute the available transmit power along these modes. In contrast,for low SNR, log(1 + |hk |2σ−2) is close to |hk |2σ−2 , where |hk |2 denotes the ktheigenvalue of HH H , and therefore it makes sense to pour more power on thelarger |hk |2 eigenvalues. Also notice, as described in [Goldsmith et al., 2003]that, for low SNR, statistical correlation is in fact benecial, as the capacity islarger than in the weak correlation case. This is here due to the fact that strongcorrelation comes along with high power modes (see Table 13.3), in which all theavailable power can be poured to increase the transmission bit rate.

Page 340: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 340/562

316 13. Performance of multiple antenna systems

13.4 Rician at fading channels

Note that in the previous section, although we used Theorem 6.4 and Theorem6.8 that provide in their general form deterministic equivalents that are functionsof the eigenvectors of the underlying random matrices, all results derived so farare only functions of the eigenvalues of the correlation matrices T and R . Thatis, we could have considered T and R diagonal. It is therefore equivalent forthe point-to-point MIMO analysis to consider doubly correlated transmissionsor uncorrelated transmissions with weighted powers across the antennas. Assuch, the large system analysis via deterministic equivalents of doubly correlatedMIMO communications is the same as that of the MIMO communication overthe channel H with independent Gaussian entries of zero mean and variance

prole σ2ij /n t , where σij = √ t i r j with t i the ith diagonal entry (or eigenvalue)of T and r j the j th diagonal entry of R . When the variance prole can be written

under this product form of the variance of rows and columns, we say that H hasa separable variance prole.

As a consequence, the study of point-to-point MIMO channels with a general(non-necessarily separable) variance prole and entries of non-necessary zeromean completely generalizes the previous study. Such channels are known asRician channels. The asymptotic analysis of these Rician channels is the objectiveof this section, which relies completely on the three important contributionsof Hachem, Loubaton, Dumont, and Najim that are [Dumont et al., 2010;Hachem et al., 2007, 2008b]. Note that early studies assuming channel matriceswith entries of non-zero mean and unit variance were already provided in, e.g.,[Moustakas and Simon, 2003 ] in the multiple input single output (MISO) caseand [Cottatellucci and Debbah , 2004a,b] in the MIMO case.

Before moving to the study of the ergodic capacity, i.e. the study of theresults from [Hachem et al., 2007], it is important to remind that Rician channelmodels do not generalize Kronecker channel models. What we stated before isexactly that, for the capacity analysis of point-to-point MIMO communications ,Kronecker channels can be substituted by Rician channels with zero mean and

separable variance proles. However, we will see in Chapter 14 that, for multi-user communications, this remark does not hold in general, unless all users haveidentical or at least co-diagonalizable channel correlation matrices. The cominganalysis of Rician channel models is therefore not sufficient to treat the mostgeneral multi-user communications, which will be addressed in Chapter 14.

13.4.1 Quasi-static mutual information and ergodic capacity

Consider the point-to-point communication between an nt -antenna transmitterand an nr -antenna receiver. The communication channel is denoted by therandom matrix H ∈C n r ×n t which is modeled as Rician, i.e. the entries of Hare Gaussian, independent and the ( i, j )th entry hij has mean aij and variance

Page 341: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 341/562

13.4. Rician at fading channels 317

σ2ij /n t , for 1 ≤ i ≤ n r and 1 ≤ j ≤ n t . We further denote A ∈C n r ×n t the matrix

with ( i, j )th entry aij , Σ j ∈R n r ×n r the diagonal matrix of ith entry σij , andΣ i

∈R n t ×n t the diagonal matrix of j th entry σij . Assume, as in the previous

sections, that the ltered transmit signal is corrupted by additive white Gaussiannoise of variance σ2 on all receive antennas. Here we assume that the σij areuniformly bounded with respect to the system dimensions nt and nr .

Assuming equal power allocation at the transmitter, the mutual informationfor the per-antenna quasi-static case I (n t ,n r ) (σ2 ; I n t ) reads:

I (n t ,n r ) (σ2 ; I n t ) = log 2 det I n r + 1σ2 HH H

= log 2 det I n r + 1σ2 R

12 XT

12 + A R

12 XT

12 + A

H

.

From Theorem 6.14, we have that

1n t

I (n t ,n r ) (σ2 ; I n t ) − 1n r

log2 det 1σ2 Ψ −1 + A ΨA T +

1n r

log2 det 1σ2 Ψ −1

−log2(e)σ2

n t n r i,j

σ2ij vi vj

a .s.

−→ 0 (13.9)

where we dened Ψ ∈C n r ×n r the diagonal matrix with ith entry ψi , Ψ ∈C n r ×n r

the diagonal matrix with j th entry ψj , with ψi and ψj , 1

≤ i

≤ n r , 1

≤ j

≤ n t ,

the unique solutions of

ψi = 1σ2 1 +

1n r

tr Σ 2i Ψ −1 + σ2A T ΨA −1 −1

ψj = 1σ2 1 +

1n r

tr Σ 2j Ψ −1 + σ2A ΨA T −1 −1

which are Stieltjes transforms of distribution functions, while vi and vj aredened as the ith diagonal entry of Ψ −1 + σ2A T ΨA −1 and the j th diagonalentry of Ψ −1 + σ2A ΨA T −1 , respectively.

Also, due to the Gaussian assumption (the general statement is based onsome moment assumptions), [Hachem et al., 2008b] ensures that the ergodicmutual information I (n t ,n r ) (σ2 ; I n t ) under uniform transmit power allocationhas deterministic equivalent

E[I (n t ,n r ) (σ2 ; I n t )]

− log2 det 1σ2 Ψ −1 + A ΨA T + log 2 det

1σ2 Ψ −1 −

σ2 log2(e)n t i,j

σ2ij vi vj

= O(1/n t ). (13.10)As we have already mentioned in the beginning of this section, Equations ( 13.9)

and ( 13.10) can be veried to completely generalize the uncorrelated channel

Page 342: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 342/562

318 13. Performance of multiple antenna systems

case, the Kronecker channel case, and the case of a channel with separablevariance prole.

13.4.2 Capacity maximizing power allocation

In [Dumont et al., 2010], the authors assume the above Rician channel model,however restricted to a separable variance prole, i.e. for all pairs ( i, j ), σij canbe written as a product r i t j . We then slightly alter the previous model settingby considering the channel model

H = R12 XT

12 P

12 + AP

12

with R ∈R n r ×n r diagonal with ith diagonal entry r i , T ∈R n t ×n t diagonal with

j th diagonal entry tj , and P Hermitian non-negative, the transmit precodingmatrix. Note, as previously mentioned, that taking R and T to be diagonal doesnot restrict generality. Indeed, the mutual information for this channel modelreads:

I (n t ,n r ) (σ2 ; P )

= log 2 det I n r + 1σ2 R

12 XT

12 P

12 + AP

12 R

12 XT

12 P

12 + AP

12

H

= log 2 det I n r + 1σ2 UR

12 U H XVT

12 V H VP

12 V H + UAV H VP

12 V H

× UR12 U H XVT

12 V H VP

12 V H + UAV H VP

12 V H

H

for any unitary matrices U ∈C n t ×n t and V ∈C n r ×n r . The matrix X beingGaussian, its distribution is not altered by the left- and right-unitary productsby U H and V , respectively. Also, when addressing the power allocation problem,optimization can be carried out equally on P or VPV H . Therefore, instead of thediagonal R and T covariance matrices, we could have considered the Kroneckerchannel with non-necessarily diagonal correlation matrices URU H at the receiverand VTV H at the transmitter and the line-of-sight component matrix UAV H .

The ergodic capacity optimizing power allocation matrix, under powerconstraint 1

n ttr P ≤ P , is then shown in [Dumont et al., 2010] to be determined

as the solution of an iterative water-lling algorithm. Specically, call P theoptimal power allocation policy for the deterministic equivalent of the ergodiccapacity. Using a similar approach as for the Kronecker channel case, it is shownthat maximizing the deterministic equivalent of I (n t ,n r ) (σ2 ; P ) is equivalent tomaximizing the function

V : (δ, δ, P ) → log2 det I n t + PG (δ, δ ) + log 2 det I n r + δ R −log2(e)n t σ2δ δ

over P such that 1n t tr P = P , where G is the deterministic matrix

G = δ T + 1σ2 A T I n r + δ R −1 A (13.11)

Page 343: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 343/562

13.4. Rician at fading channels 319

Dene η > 0 the convergence threshold and l ≥ 0 the iteration step.At step l = 0, for k ∈ 1, . . . , n t , set q 0k = P . At step l ≥ 1,while maxk

|q lk

−q l−1

k

| > η do

Dene (δ l+1 , δ l+1 ) as the unique pair of positive solutions to(13.12) for P = WQ l W H , Q l = diag( q l1 , . . . , q ln t ) and W the matrixsuch that G , given in (13.11) with δ = δ l , δ = δ l has spectraldecomposition G = WΛW H , Λ = diag( λ1 , . . . , λ n t )for i ∈ 1 . . . , n t do

Set q l+1i = µ − 1

λ i

+, with µ such that 1

n ttr Q l = P

end forassign l ← l + 1

end while

Table 13.4. Iterative water-lling algorithm for the Rician channel model with separablevariance prole.

in the particular case when δ = δ (P ) and δ = δ (P ), the latter two scalars beingthe unique solutions to the xed-point equation

δ = 1σ2n t

tr R (I n r + δ R ) + 1σ2 AP 1

2 (I n t + δ P 12 TP 1

2 )−1P 12 A T −1

δ = 1σ2n t

tr P 12 TP 1

2 (I n t + δ P 12 TP 1

2 ) + 1σ2 P 1

2 A T (I n r + δ R )−1AP 12 −1

(13.12)

which are Stieltjes transforms of distribution functions.Similar to the Kronecker case, the iterative water-lling algorithm of Table

13.4 solves the power-optimization problem.It is common to consider a parameter κ, the Rician factor, that accounts for

the relative importance of the line-of-sight component A relative to the varyingrandom part R

12 XT

12 . To incorporate the Rician factor in the present model,

we assume that 1n t

tr AA T = κκ +1 and that R is constrained to satisfy 1

n rtr R =

1κ +1 . Therefore, the larger the parameter κ the more important the line-of-sightcontribution. In Figure 13.4, we consider the 4 ×4 MIMO Rician channel whereA has entries equal to 1

n t

κκ +1 , and T , R are modeled as Jakes’ correlation

matrices for a linear antenna array with distance d between consecutive antennas.The transmission wavelength is denoted λ. We wish to study the relative impactof the line-of-sight component A on the ergodic capacity, and on the powerallocation policy. Therefore, we consider rst mildly correlated antennas at bothcommunication ends with inter-antenna distance d = λ. We give the channelperformance both for κ = 1 and for κ = 100.

We observe again, already for the 4 ×4 MIMO case, a very accurate matchbetween the approximated and simulated ergodic capacity expressions. We rst

Page 344: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 344/562

320 13. Performance of multiple antenna systems

−15 −10 −5 0 5 10 15 200

5

10

15

20

SNR [dB]

A c h i e v a b l e r a t e

κ = 1, uni., sim.κ = 1, uni., det.κ = 1, opt., sim.κ = 1, opt., det.κ = 100, uni., sim.κ = 100, uni., det.κ = 100, opt., sim.κ = 100, opt., det.

Figure 13.4 Ergodic capacity from simulation (sim.) and deterministic equivalent(det.) of the Jakes’ correlated 4 ×4 MIMO model with line-of-sight component, linearantenna arrays with d

λ = 1 are considered at both communication ends. The SNRvaries from −15 dB to 20 dB, and the Rician factor κ is chosen to be κ = 1 andκ = 100. Uniform (uni.) and optimal (opt.) power allocations are considered.

see that the capacity for κ small is larger than the capacity for κ large at highSNR, which is due to the limited multiplexing gain offered by the strongly line-of-sight channel. Note also that, for κ = 1, there is little room for high capacitygain by proceeding to optimal power allocation, while important gains can beachieved when κ = 100. For asymptotically large κ, the optimal power allocationpolicy requires that all power be poured on the unique non-zero eigenmode of thechannel matrix. From the trace constraint on the precoding matrix, this entailsa SNR gain of up to log 10 (n t ) 6 dB (already observed in the zero dB-10 dBregion). For low SNR regimes, it is therefore again preferable to seek correlation,

embodied here by the line-of-sight component. In contrast, for medium to highSNR regimes, correlation is better avoided to fully benet from the channelmultiplexing gain. The latter is always equal to min( n t , n r ) for all nite κ,although extremely large SNR conditions might be necessary for this gain tobe observable.

13.4.3 Outage mutual information

We complete this section by the results from [Hachem et al., 2008b] on a centrallimit theorem for the capacity of the Rician channel H , in the particular casewhen the channel coefficients hij have zero mean, i.e. when A = 0. Note that thesimple one-sided correlation case was treated earlier in [Debbah and R. M¨uller,

Page 345: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 345/562

13.4. Rician at fading channels 321

2003], based on a direct application of the central limit Theorem 3.17. Further,assume either of the following situations:

• the variance prole σ2ij /n t is separable in the sense that σij = √ r i t j andthe transmitter precodes its signals with matrix P . In that case, we denote

T = diag( t1 , . . . , t n t ) and σ 2ij /n t the variance prole with σij = i r i t j with

t1 , . . . , t n t the eigenvalues of T12 PT

12 .

• both T and the precoding matrix P = diag( p1 , . . . , p n t ) are diagonal. In thiscase, σij = r i t j , where tj = σ2

i,j pj .

The result is summarized in Theorem 6.21, which states under the currentchannel conditions that, for a deterministic precoder P satisfying one of theabove conditions, the mutual information I (n t ,n r ) (σ2 ; P ) asymptotically variesaround its mean E[ I (n t ,n r ) (σ2 ; P )] as a zero mean Gaussian random variablewith, for each nr , a variance close to θ2

n r , dened as

θ2n r = −logdet( I n t −J n t )

where J n t is the matrix with ( i, j )th entry J n tij , dened as

J n tij =

1n t

1n t

n rk =1 σ 2

ki σ 2kj t2

k

1 + 1n t

n rk=1 σki

2 tk2

with t1 , . . . , t n r , dened as the unique solutions of

t i = σ2 + 1n t

n t

j =1

σ 2ij

1 + 1n t

n rl=1 σ 2

lj t l

−1

which are Stieltjes transforms of distribution functions when seen as functionsof the variable −σ2 .

We therefore deduce a theoretical approximation of the outage mutualinformation I (n t ,n r )

outage (σ2 ; P ; q ), for large nt and nr and deterministic precoderP , dened by

I (n t ,n r )outage (σ2 ; P ; q ) = sup

R ≥0P I (n t ,n r ) (σ2 ; P ) > R ≤ q

with I (n t ,n r ) (σ2 ; P ) the quasi-static mutual information for the deterministicprecoder P . This is:

I (n t ,n r )outage (σ2 ; P ; q ) Q−1 q

θn r

with Q(x) the Gaussian Q-function, where the inverse is taken with respect tothe conjugation.

In Figure 13.5, we provide the curves of the theoretical distribution functionof the mutual information for the precoder P = I n t . The assumptions taken inFigure 13.5 are those of a Kronecker channel with transmit correlation matrices

Page 346: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 346/562

322 13. Performance of multiple antenna systems

0 2 4 6 8 10 12 140

0.2

0.4

0.6

0.8

1

Achievable rate

D i s t r i b u t i o n f u n c t i o n

dλ = 0 .1dλ = 1dλ = 10

Figure 13.5 Deterministic equivalent of the outage capacity for the Jakes’ correlated4 ×4 MIMO channel model, linear antenna arrays with d

λ = 0 .1, dλ = 1 and d

λ = 10.The SNR is set to 10 dB. Uniform power allocation is considered.

R ∈C n r ×n r and T ∈C n t ×n t modeled along Jakes’ correlation model for a linearantenna array with inter-antenna distance d (both at the transmit and receive

ends) and transmission wavelength λ. We depict the outage performance of theMIMO Rician channel for different values of d/λ .

This concludes this section on the performance of point-to-pointcommunications over MIMO Rician channels. We recall that the most generalRician setup generalizes all previously mentioned models. Nonetheless, thesubject has not been fully covered as neither the optimal power allocationpolicy for a Rician channel with non-separable variance prole, nor the outageperformance of a Rician channel with line-of-sight component have beenaddressed.

In what follows, we generalize multiple antenna point-to-point communicationstowards another direction, by introducing frequency selectivity. This will providea rst example of a channel model for which the MIMO Rician study developed inthis section is not sufficient to provide a theoretical analysis of the communicationperformance.

13.5 Frequency selective channels

Due to the additional data processing effort required by multiple antennacommunications compared to single antenna transmissions and to the non-negligible correlation arising in multiple antenna systems, MIMO technologies aremostly used as a solution to further increase the achievable data rate of existing

Page 347: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 347/562

13.5. Frequency selective channels 323

wireless broadband communication networks, but are not used as a substitute forlarge bandwidth communications. Therefore, practical MIMO communicationnetworks usually come along with large transmission bandwidths. It is thereforeoften unrealistic to assume narrowband MIMO transmission as we have doneup to now. This section is dedicated to frequency selective transmissions, forwhich the channel coherence bandwidth is assumed smaller than the typicaltransmission bandwidth, or equivalently for which strong multi-path componentsconvey signal energy.

Consider an L-path MIMO channel between an nt -antenna transmitter andan nr -antenna receiver, modeled as a sequence of L matrices H 1 , . . . , H L ,H l ∈C n r ×n t . That is, each link from transmit antenna i and receive antenna j is amulti-path scalar channel with ordered path gains given by ( h1( j, i ), . . . , h L ( j, i )).

In the frequency domain, the channel transfer matrix is modeled as a randommatrix process H (f ) ∈C n r ×n t , for every frequency f ∈ [−W/ 2, W/ 2], dened as

H (f ) =L

k=1

H k e−2πik f W

with W the two-sided baseband communication bandwidth. The resulting(frequency normalized) quasi-static capacity C (n t ,n r ) of the frequency selectivechannel H (f ), for a communication in additive white Gaussian noise and underpower constraint P , reads:

C (n t ,n r ) (σ2) = supP (f )

1W

W/ 2

−W/ 2log2 det I n r + 1

σ2 H (f )P (f )H H (f ) df

with P (f ) ∈C n t ×n t , f ∈ [−W/ 2, W/ 2], a matrix-valued function modeling theprecoding matrix to be applied at all frequencies, with maximum mean power P (per Hertz) and therefore submitted to the power constraint

1n t W f

tr P (f )df ≤ P. (13.13)

According to the denition of H (f ), the capacity C (n t ,n r ) (σ2) also reads:

C (n t ,n r ) (σ2) = supP ( f )

1W W/ 2

−W/ 2log2 det I n r +

1σ2

L

k=1

H k e−2πik f W P (f )

L

k=1

H H

k e2πik f W df

under the trace constraint ( 13.13) on P (f ).We then assume that every channel H k , k ∈ 1, . . . , L , is modeled as a

Kronecker channel, with non-negative denite left-correlation matrix R k ∈C n r ×n r and non-negative denite right-correlation matrix T

k ∈C n t ×n t . That

is, H k can be expressed as

H k = R12k X k T

12k

Page 348: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 348/562

324 13. Performance of multiple antenna systems

where X k ∈C n r ×n t has i.i.d. Gaussian entries of zero mean and variance 1 /n t .We also impose that the matrices T k and R k , for all k, have uniformly boundedspectral norms and that the X k are independent.

Although this is not strictly proven in the literature (Theorem 6.8), we inferthat, for all xed f and all deterministic choices of P (f ) (with possiblysome mild assumptions on the extreme eigenvalues), it is possible to providea deterministic equivalent of the random quantity

1n r

log2 det I n r + 1σ2

L

k =1

H k e−2πik f W P (f )

L

k=1

H H

k e2πik f W

using the implicit equations of Theorem 6.8. The latter indeed only providesthe convergence of such a deterministic equivalent in the mean. Integrating this

deterministic equivalent over f ∈ [−W/ 2, W/ 2] (and possibly averaging over W )would then lead to a straightforward deterministic equivalent for the per-receive-antenna quasi-static capacity (or its per-frequency version). Note that Theorem6.4 and Theorem 6.8 are very similar in nature, so that the latter must beextensible to the quasi-static case, using tools from the proof of the former.Similar to previous sections, it will however not be possible to derive the matrixprocess P (f ) which maximizes the capacity, as was performed for instance in[Tse and Hanly, 1998] in the single antenna (multi-user) case. We mention that[Scaglione, 2002] provides an explicit expression of the characteristic function of the above mutual information in the small dimensional setting.

13.5.1 Ergodic capacity

For the technical reasons explained above and also because this is a more tellingmeasure of performance, we only consider the ergodic capacity of the frequencyselective MIMO channel. Note that this frequency selective ergodic capacityC (n t ,n r )

ergodic (σ2) reads:

C (n t ,n r )ergodic (σ2)

= supP (f )

1W

W/ 2

−W/ 2E log2 det I n r + 1

σ2 L

k=1

H k P (f ) L

k=1

H Hk df

= supP (0)

E log2 det I n r + 1σ2

L

k=1

H k P (0) L

k=1

H H

k

where in the second equality we discarded the terms e−2πik f W since it is equivalent

to take the expectation over X k or over X k e−2πik f W , for all f ∈ [−W/ 2, W/ 2]

(since both matrices have the same joint Gaussian entry distribution). Therefore,on average, all frequencies are alike and the current problem reduces to ndinga deterministic equivalent for the single frequency case. Also, it is obvious fromconvexity arguments that there is no reason to distribute the power P unevenlyalong the different frequencies. Therefore, the power optimization can be simply

Page 349: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 349/562

13.5. Frequency selective channels 325

operated over a single frequency and the supremum can be taken over the singleprecoding matrix P (0). The new power constraint is therefore:

1n t tr P (0) ≤ P.

For ease of read, from now on, we denote P P (0).For all deterministic choices of precoding matrices P , the ergodic mutual

information E I (n t ,n r ) (σ2 ; P ) has a deterministic equivalent, given by Theorem6.8, such that

E[I (n t ,n r ) (σ2 ; P )]− log2 det I n r +L

k =1

δ k R k + log 2 det I n t +L

k=1

δ k T12k PT

12k

−n r log2(e)σ2L

k =1

δ k δ k = O 1N

where δ k and δ k , k ∈ 1, . . . , L , are dened as the unique positive solutions of

δ i = 1n r σ2 tr R i I n r +

L

k=1

δ k R k

δ i = 1n r σ2 tr T

12i PT

12i I n t +

L

k=1

δ k T12k PT

12k . (13.14)

13.5.2 Capacity maximizing power allocation

Based on the standard methods evoked so far, the authors in [Dupuy andLoubaton, 2010] prove that the optimal power allocation strategy is to performa standard water-lling procedure on the matrix

L

k =1

δ k (P )T k

where we dene P as the precoding matrix that maximizes the deterministicequivalent of the ergodic mutual information, and we denote δ k (P ) the (uniquepositive) solution of the system of Equations ( 13.14) when P = P .

Denote UΛU H the spectral decomposition of Lk =1 δ k (P )T k , with U unitary

and Λ a diagonal matrix with diagonal entries λ1 , . . . , λ n t . We have that P asymptotically well approximates the ergodic mutual information maximizingprecoder, and is given by:

P = UQ U H

where Q is diagonal with diagonal entries q 1 , . . . , q n t dened by

q k = µ − 1λk

+

Page 350: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 350/562

Page 351: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 351/562

13.5. Frequency selective channels 327

−15 −10 −5 0 5 10 15 200

5

10

15

20

SNR [dB]

A c h i e v a b l e r a t e

∆ = π6 , sim., uni.

∆ = π6 , det., uni.

∆ = π6 , sim., opt.

∆ = π6 , det., opt.∆ = 2 π , sim., uni.∆ = 2 π , det., uni.∆ = 2 π , sim., opt.∆ = 2 π , det., opt.

Figure 13.6 Ergodic capacity of the frequency selective 4 ×4 MIMO channel. Linearantenna arrays on each side, with correlation matrices modeled according to thegeneralized Jakes’ model. Angle spreads in the horizontal direction set to ∆ = 2 π or∆ = π/ 6. Comparison between simulations (sim.) and deterministic equivalent (det.),for uniform power allocation (uni.) and optimal power allocation (opt.).

1) πL . We also assume an inter-antenna distance of dR = λ at the receiver side and

dT = 0 .1λ at the transmitter side. We take successively ∆ = 2 π and ∆ = π6 . We

observe that the achievable rate is heavily impacted by the choice of a restrictedangle of aperture at both transmission sides. This is because transmit and receive

correlations increase with smaller antenna aperture, an effect which it is thereforeessential to take into account.We complete this chapter on single-user multiple antenna communications

with more applied considerations on suboptimal transmitter and receiver design.From a practical point of view, be it for CDMA or MIMO technologies, achievingchannel capacity requires in general heavy computational methods, the cost of which may be prohibitive for small communication devices. In place for optimalprecoders and decoders, we have already seen several instances of linear precodersand decoders, such as the matched-lter and the MMSE lter. The subject of the next section is to study intermediary precoders, performing better than thematched-lter, less than the MMSE lter, but with adjustable complexity thatcan be made simple and efficient thanks to large dimensional random matrixresults.

Page 352: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 352/562

328 13. Performance of multiple antenna systems

13.6 Transceiver design

In this last section, we depart from the previous capacity analysis introduced inthis chapter to move to a very practical application of large dimensional randommatrix results. The application we deal with targets the complexity reductionof some linear precoder or decoder designs. Precisely, we will propose successiveapproximations of MMSE lters with low complexity. We recall indeed thatMMSE lters demand the inversion of a potential large dimensional matrix,the latter depending on the possibly fast changing communication channel. Forinstance, we introduced in the previous chapter linear MMSE CDMA decoderswhich are designed based on both the spreading code (usually constant over thewhole communication) and the multi-path channel gains (possibly varying fast

with time). If both the number of users and the number of chips per code arelarge, inverting the decoding matrix every channel coherence time imposes a largecomputational burden on the decoder, which might be intolerable for practicalpurposes. This problem would usually and unfortunately be solved by turningto less complex decoders, such as the classical matched-lter. Thanks to largedimensional random matrix theory, we will realize that most of the complexityinvolved in large matrix inverses can be fairly reduced by writing the matrixinverse as a nite weighted sum of matrices and by approximating the weightsin this sum (which carry most of the computational burden) by deterministicequivalents.

We will address in this section the question of optimal low complex MMSEdecoder design. The results mentioned below are initially due to M¨ uller andVerd u in [Muller and Verd´u, 2001] and are further developed in the work of Cottatellucci et al. in, e.g., [Cottatellucci and M¨ uller, 2002, 2005 ; Cottatellucciet al. , 2004], Loubaton et al. [Loubaton and Hachem , 2003], and Hoydis et al.[Hoydis et al., 2011c].

Consider the following communication channel

y = Hx + σw

with x = [x1 , . . . , x n ]T

∈C n some transmitted vectorial data of dimension n,

assumed to have zero mean and covariance matrix In , w ∈C N is an additiveGaussian noise vector of zero mean and covariance IN , y = [y1 , . . . , y N ]T

∈C N

is the received signal, and H ∈C N ×n is the multi-dimensional communicationchannel.

Under these model assumptions, irrespective of the communication scenariounder study (e.g. MIMO, CDMA), the minimum mean square error decoderoutput x for the vector x reads:

x = H H H + σ2 I n −1 H H y (13.15)

= H H HH H + σ2 I N −1y .

Page 353: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 353/562

13.6. Transceiver design 329

For practical applications, recovering x therefore requires the inversion of the potentially large HH H + σ2I N −1 matrix. This inverted matrix has to beevaluated every time the channel matrix H changes. It unfolds that a highcomputational effort is required at the receiver to numerically evaluate suchmatrices. In some situations, where the computational burden at the receiveris an important constraint (e.g. impacting directly the battery consumption incellular phones), this effort might be unbearable and we may have to resort tolow complexity and less efficient detectors, such as the matched-lter H H .

Now, from the Cayley-Hamilton theorem, any matrix is a (matrix-valued)root of its characteristic polynomial. That is, denoting P (x) the characteristicpolynomial of H H H + σ2 I n , i.e.

P (x) = det H H H + σ2 I n

−xI n

it is clear that P (H H H + σ2I n ) = 0. Since the determinant above can be writtenas a polynomial of x of maximum degree n, P (x) expresses as

P (x) =n

i =0a i x i (13.16)

for some coefficients a0 , . . . , a n to determine.From ( 13.16), we then have

0 = P (HH

H + σ2I n ) =

n

i =0a i H

H

H + σ2I n

i

from which

−a0 =n

i=1

a i H H H + σ2 I ni.

Multiplying both sides by ( H H H + σ2 I n )−1 , this becomes

H H H + σ2 I n −1= −

n

i =1

a i

a0H H H + σ2 I n

i−1

= −n

i =1

a i

a0

i−1

j =0

i −1 j

σ2( i−j −1) H H Hi−1

and therefore ( 13.15) can be rewritten under the form

x =n −1

i =0bi H H H

iH H y

with bi

−1 =

−a ia 0

i−1j =0

i−1j σ2( i−j −1) .

Obviously, the effort required to compute ( H H H + σ2 I n )−1 is equivalent tothe effort required to compute the above sum. Nonetheless, it will appear thatthe bi above can be expressed as a function of the trace of successive powers of

Page 354: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 354/562

Page 355: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 355/562

13.6. Transceiver design 331

where Φ ∈C m ×m and φ∈

C (m ) depend only on the trace of the successive powersof H H H . Denoting Φ ij the ( i, j )th entry of Φ and φi the ith entry of φ , weexplicitly have

Φij = 1n

tr H H Hi + j

+ σ2 1n

tr H H Hi+ j −1

φi = 1n

tr H H Hi.

But then, from all limiting results on large dimensional random matricesintroduced in Part I, either under the analytical Stieltjes transform approachor under the free probability approach, it is possible to approximate the entriesof Φ and φ and to obtain deterministic equivalents for b(m )

0 , . . . , b(m )m −1 , for a large

set of random matrix models for H .

13.6.1 Channel matrix model with i.i.d. entries

Typically, for H with independent entries of zero mean, variance 1 /N , and nite2 + ε order moment, for some positive ε, from Theorem 2.14

1n

tr H H Hi a.s.

−→ 1i

i−1

k =0

ik

ik + 1

ck

as N, n → ∞ with n/N → c.

In place for the optimal MSE minimizing truncated polynomial decoders, wemay then use the order m detectorm −1

i =0

b(m )i H H H

iH H

where b (m ) = ( b(m )0 , . . . , b(m )

m −1) is dened as

b (m ) = Φ −1 φ

where the entries of Φ and φ are the almost sure limit of Φ and φ , respectively,

as N, n grow to innity with limiting ratio n/N → c.These suboptimal weights are provided, up to a scaling factor over all b(m )i , in

Table 13.6.Obviously, the partial MMSE detectors derived from asymptotic results differ

from the exact partial MMSE detectors. They signicantly differ if N, n arenot large, therefore impacting the decoding performance. In Figure 13.7 andFigure 13.8, we depict the simulated bit error rate performance of partial MMSEdetectors, using the weights b (m ) dened in this section, along with the bit errorrate performance of the suboptimal detectors with weights b (m ) . Comparisonis made between both approaches, when N = n = 4 or N = n = 64 and H hasindependent Gaussian entries of zero mean and variance 1 /N . Observe that, forthese small values of N and n, the large dimensional approximations b (m ) of b (m )

are far from accurate. Note in particular that the approximated MMSE detector

Page 356: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 356/562

332 13. Performance of multiple antenna systems

N = 1 b(1)0 = 1

N = 2 b(2)0 = σ2 −2(1 + c)

˜b

(2)

1 = 1

N = 3b(3)

0 = 3(1 + c2) −3σ2(1 + c) + σ4 + 4 cb(3)

1 = σ2 −3(1 + c)b(3)

2 = 1

N = 4

b(4)0 = 6σ2(1 + c2) + 9 σ2c −4(1 + c3) −4σ4(1 + c) + σ6 −6c(1 + c)

b(4)1 = −6(1 + c2) + 4 σ2(1 + c) −σ4 −9c

b(4)2 = σ2 −4(1 + c)

b(4)3 = 1

Table 13.6. Deterministic equivalents of the weights for the (MSE optimal) partial MMSElters.

−5 0 5 10 15 20 25 30−35

−30

−25

−20

−15

SNR [dB]

b i t e r r o r r a t e

[ d B ]

b 1

˜b

1

b 2

b 2

b 3

b 3

b 4

b 4

Figure 13.7 Bit error rate performance of partial MMSE lters, for exact weights b( m )

and approximated weights b ( m ) , H ∈C N ×n has i.i.d. Gaussian entries, N = n = 4.

(m = 4) is extremely badly approximated for these small values of N and n. Forhigher N and n, the decoders based on the approximated b (m ) perform veryaccurately for small m, as observed in Figure 13.8.

13.6.2 Channel matrix model with generalized variance prole

In the scenario when H ∈C N ×n can be written under the very general formH = [h 1 , . . . , h n ], with h i = R

12i x i , with R i ∈C N ×N and x i ∈C N , with i.i.d.

entries of zero mean and variance 1 /n , we have from Theorem 6.13, for every m,

Page 357: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 357/562

13.6. Transceiver design 333

−5 0 5 10 15 20 25 30

−50

−40

−30

−20

SNR [dB]

b i t e r r o r r a t e

[ d B ]

b 2

b 2

b 4

b 4

b 8

b 8

b 64

Figure 13.8 Bit error rate performance of partial MMSE lters, for exact weights b ( m )

and approximated weights b ( m ) , H ∈C N ×n has i.i.d. Gaussian entries, N = n = 64.

that b (m ) = Φ −1 φ , where

Φi,j = M i + j + σ2M i + j −1

φi = M i

where the M i are dened recursively in Theorem 6.13.Other polynomial detector models for, e.g. downlink CDMA frequency selected

channels, have been studied in [Hachem, 2004 ]. This concludes this chapter onpoint-to-point, or single-user, MIMO communications. In the following chapter,we extend some of the previous results to the scenario of multiple users possiblycommunicating with multiple antennas.

Page 358: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 358/562

Page 359: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 359/562

14 Rate performance in multiple

access and broadcast channels

In this chapter, we consider both multiple access channels (MAC), which assumea certain number of users competing for the access to (i.e. to transmit data to) asingle resource, and broadcast channels (BC), which assume the opposite scenariowhere a single transmitter multicasts data to multiple receivers.

The performance of multi-user communications can no longer be assessed froma single capacity parameter, as was the case for point-to-point communications.In a K -user MAC, we must evaluate what vectors ( R1 , . . . , R K ) of rates, Ri beingthe data rate transmitted by user i, are achievable, in the sense that simultaneousreliable decoding of all data streams is possible at the receiver. Now, similarto single-user communications, where all rates R less than the capacity C areachievable and therefore dene a rate set R = R, R ≤ C , for the multiple accesschannel, we dene the multi-dimensional MAC rate region as the set R MAC of

all vectors ( R1 , . . . , R K ) such that reliable decoding is possible at the receiverif users 1, . . . , K transmit, respectively, at rate R1 , . . . , R K . Similarly, for thebroadcast channel, we dene the BC rate region R BC as the (closed) set of allvectors ( R1 , . . . , R K ), Ri being now the information data rate received by useri, such that every user can reliably decode its data. We further dene the rate region boundary , either in the MAC or BC case, as the topological boundary of the rate region. These rate regions can be dened either in the quasi-static or inthe ergodic sense. That is, the rate regions may assume perfect or only statisticalchannel state information at the transmitters. This is particularly convenientin the MAC, where in general perfect channel state information of all users’channel links is hardly accessible to each transmitter and where imperfect channelstate information does not dramatically impact the achievable rates. In contrast,imperfect channel state information at the transmitters in the BC results insuboptimal beamforming strategies and thus high interference at all receivers,therefore reducing the rates achievable under perfect channel knowledge.

In the MAC, either with perfect or imperfect channel information at thetransmitter, it is known that the boundary of the rate region can be achieved if the receiver performs MMSE decoding and successive interference cancellation(MMSE-SIC) of the input data streams. That is, the receiver decodes the

strongest signal rst using MMSE decoding, removes the signal contributionfrom the input data, then decodes the second to strongest signal, etc. untildecoding the weakest signal. As for the BC with perfect channel information

Page 360: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 360/562

336 14. Rate performance in multiple access and broadcast channels

at the transmitter, it took researchers a long time to gure out a precoding strategy which achieves the boundary of the BC rate region; note here thatthe major processing effort is shifted to the transmitter. The main results werefound almost simultaneously in the following articles [Caire and Shamai , 2003;Viswanath et al., 2003; Viswanathan and Venkatesan, 2003; Weingarten et al.,2006; Yu and Cioffi, 2004 ]. One of these boundary achieving codes (and forthat matter, the only one we know of so far) is the so-called dirty-paper coding(DPC) algorithm [Costa, 1983]. The strategy of the DPC algorithm is to encodethe data sequentially at the transmission in such a way that the interferencecreated at every receiver and treated as noise by the latter allows for reliabledata decoding. The approach is sometimes referred to as successive encoding ,in duality reference with the successive decoding approach for the MAC. The

DPC precoder is therefore non-linear and is to this day too complex to beimplemented in practical communication systems. However, it has been shownin the information theory literature, see, e.g., [Caire and Shamai, 2003; Peelet al. , 2005; Wiesel et al., 2008; Yoo and Goldsmith , 2006] that suboptimal linear precoders can achieve a large portion of the BC rate region while featuring lowcomputational complexity. Thus, much research has recently focused on linearprecoding strategies.

It is often not convenient though to derive complete achievable rate regions,especially for communications with a large number of users. Instead, sum rate capacity is often considered as a relevant performance metric, which corresponds

to the maximally achievable sum R1 + . . . + RK , with ( R1 , . . . , R K ) elements of the achievable MAC or BC rate regions. Other metrics are sometimes used inplace of the sum rate capacity, which allow for more user fairness. In particular,maximizing the minimum rate is a strategy that avoids leaving users with badchannel conditions with zero rate, therefore improving fairness among users. Inthis chapter, we will however only discuss sum rate maximization.

The next section is dedicated to the evaluation, through deterministicequivalents or large dimensional system limits, of the rate region or sum rateof MAC, BC, linearly precoded BC, etc. We assume a multi-user communication

wireless network composed of K users, either transmitters (in the MAC) orreceivers (in the BC), communicating with a multiple antenna access point orbase station. User k, k ∈ 1, . . . , K is equipped with nk antennas, while theaccess point is equipped with N antennas. Since users and base stations areeither transmitting or receiving data, we no longer use the notations nt and nr

to avoid confusion.We rst consider the case of linearly precoded broadcast channels.

14.1 Broadcast channels with linear precoders

In this section, we consider the downlink communication of the N -antenna basestation towards K single antenna receivers, i.e. nk = 1 for all k ∈ 1, . . . , K ,

Page 361: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 361/562

14.1. Broadcast channels with linear precoders 337

which is a common assumption in current broadcast channels, although studiesregarding multiple antenna receivers have also been addressed, see, e.g.,[Christensen et al., 2008]. We further assume N

≥ K . In some situations, we may

need to further restrict this condition to cN N/K > 1 + ε for some ε > 0 (as wewill need to grow N and K large with specic rates). We denote by h H

k ∈C 1×N

the MISO channel from the base station to user k. At time t, denoting s( t )k the

signal intended to user k of zero mean and unit variance, σw( t )k an additive white

Gaussian noise with zero mean and variance σ2 , and y( t )k the signal received by

user k, the transmission model reads:

y( t )k = h H

k

K

j =1

g j s ( t )j + σw( t )

k

where g j ∈C N denotes the linear vector precoder, also referred to asbeamforming vector or beamformer , of user j . Gathering the transmit data intoa vector s( t ) = ( s ( t )

1 , . . . , s ( t )K )

T

∈C K , the additive thermal noise into a vector

w ( t ) = ( w( t )1 , . . . , w ( t )

K )T

∈C K , the data received at the antenna array into a

vector y ( t ) = ( y( t )1 , . . . , y ( t )

K )T

∈C K , the beamforming vectors into a matrix G =

[g 1 , . . . , g K ] ∈C N ×K , and the channel vectors into a matrix H = [h 1 , . . . , h K ]H

∈C K ×N , we have the compact transmission model

y ( t ) = HGs ( t ) + σw ( t )

where G must satisfy the power constraint

tr(E[ Gs ( t ) s ( t )H G H ]) = tr( GG H ) ≤ P (14.1)

assuming E[ s ( t ) s ( t ) H ] = I K , for some available transmit power P at the basestation.

When necessary, we will denote z( t )∈

C N the vector

z ( t ) Gs ( t )

of data transmitted from the antenna array of the base station at time t.Due to its practical and analytical simplicity, this linear precoding model isvery attractive. Most research in linear precoders has focused to this day bothon analyzing the performance of some ad-hoc precoders and on determining theoptimal linear precoders. Optimality is often taken with respect to sum ratemaximization (or sometimes, for fairness reasons, with respect to maximizationof the minimum user rate). In general, though, the rate maximizing linearprecoder has no explicit form. Several iterative algorithms have been proposedin [Christensen et al., 2008; Shi et al., 2008] to come up with a sum rate optimalprecoder, but no global convergence has yet been proved. Still, these iterativealgorithms have a high computational complexity which motivates the use of further suboptimal linear transmit lters, by imposing more structure into thelter design.

Page 362: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 362/562

338 14. Rate performance in multiple access and broadcast channels

In order to maximize the achievable sum rate, a rst straightforward techniqueis to precode the transmit data by the inverse, or Moore–Penrose pseudo-inverse,of the channel matrix H . This scheme is usually referred to as channel inversion(CI) or zero-forcing (ZF) precoding [Caire and Shamai, 2003 ]. That is, the ZFprecoder G zf reads:

G zf = ξ √ N

H H (HH H )−1

where ξ is set to fulll some transmit power constraint ( 14.1).The authors in [Hochwald and Vishwanath, 2002 ; Viswanathan and

Venkatesan, 2003 ] carry out a large system analysis assuming that the number of transmit antennas N at the base station as well as the number of users K grow

large while their ratio cN = N/K remains bounded. It is shown in [Hochwaldand Vishwanath, 2002] that, for cN > 1 + ε, uniformly on N , the achievablesum rate for ZF precoding has a multiplexing gain of K , which is identical tothe optimal DPC-achieving multiplexing gain. The work in [Peel et al., 2005]extends the analysis in [Hochwald and Vishwanath, 2002] to the case K = N and shows that the sum rate of ZF saturates as K grows large. The authors in[Peel et al., 2005] counter this problem by introducing a regularization term αinto the inverse of the channel matrix. The precoder obtained is referred to asregularized zero-forcing (RZF) or regularized channel inversion, is denoted G rzf ,and is given explicitly by

G rzf = ξ √ N

H H (HH H + α I N )−1 (14.2)

with ξ dened again to satisfy some transmit power constraint.Under the assumption of large K and for any unitarily invariant channel

distribution, [Peel et al., 2005] derives the regularization term α that maximizesthe signal-to-interference plus noise ratio. It has been observed that the optimalRZF precoder proposed in [Peel et al., 2005] is very similar to the transmit ltersderived under the minimum mean square error (MMSE) criterion at every user,

i.e. with α = σ2 [Joham et al., 2002], and become identical in the large K limit.Based on the tools developed in Part I, we provide in this section deterministic

approximations of the achievable sum rate under ZF and RZF precoding, anddetermine the optimal α parameter in terms of the sum rate as well as otherinteresting optimization measures discussed below. We also give an overviewof the potential applications of random matrix theory for linearly precodedbroadcast channels, under the most general channel conditions. In particular,consider that:

• the signal transmitted at the base station antenna array is correlated. Thatis, for every user k, k ∈ 1, . . . , K , h k can be written under the form

h k = T12k x k

Page 363: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 363/562

14.1. Broadcast channels with linear precoders 339

where x k ∈C N has i.i.d. Gaussian entries of zero mean and variance 1 /N , andT

12k is a non-negative denite square root of the transmit correlation matrix

T k

∈C N ×N with respect to user k;

• the different users are assumed to have individual long-term path-lossesr 1 , . . . , r K . This allows us to further model h k as

h k = √ r k T12k x k .

The letter ‘ r ’ is chosen to indicate the channel fading seen at the receiver ;

• the channel state information (CSI) at the transmitter side is assumedimperfect. That is, H is not completely known at the transmitter. Onlyan estimate H is supposed to be available at the transmitter. We modelH = [h 1 , . . . , h K ]H with

h j = 1 −τ 2j h j + τ j h j

with τ j some parameter roughly indicating the accuracy of the j th channelestimate, and H = [h 1 , . . . , h K ]

H as being the random matrix of channel errorswith properties to be dened later. Note that a similar imperfect channelstate analysis framework for the single-user MIMO case has been introducedin [Hassibi and Hochwald , 2003].

These channel conditions are rather realistic and include as particular cases theinitial results found in the aforementioned literature contributions and others,e.g., [Ding et al., 2007; Hochwald and Vishwanath, 2002 ; Jindal, 2006; Peelet al. , 2005]. An illustration of the linearly precoded MISO (or vector) broadcastchannel is provided in Figure 14.1. The following development is heavily basedon the work [Wagner et al., 2011].

14.1.1 System model

Consider the transmission model described above. For simplicity here, we will

assume that T 1 = . . . = T K T , i.e. the correlation at the base station of allusers’ channel vectors is identical. Field measurements [Kaltenberger et al., 2009]suggest that this assumption is too strong and not fully realistic. As will befurther discussed in subsequent sections, signal correlation at the transmitterdoes not only arise from close antenna spacing, but also from the differentsolid angles of signal departure. It could be argued though that the scenariowhere all users experience equal transmit covariance matrices represents a worstcase scenario, as it reduces multi-user diversity. If not fully realistic, the currentassumption on T is therefore still an interesting hypothesis.

Similar to [Chuah et al., 2002; Shi et al., 2008; Tulino and Verd´ u, 2005], wetherefore denote

H = R12 XT

12

Page 364: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 364/562

Page 365: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 365/562

14.1. Broadcast channels with linear precoders 341

with imperfect short-term statistics X = [x 1 , . . . , x K ]H given by:

x k =

1

−τ 2k x k + τ k q k

where we dened Q = [q 1 , . . . , q K ]H

∈C K ×N the matrix of channel estimation

errors containing i.i.d. entries of zero mean and variance 1 /N , and τ k ∈ [0, 1]the distortion in the channel estimate h k of user k. We assume that the τ k areperfectly known at the transmitter. However, as shown in [Dabbagh and Love,2008], an approximated knowledge of τ k will not lead to a severe performancedegradation of the system. Furthermore, we suppose that X and Q are mutuallyindependent as well as independent of the symbol vector s( t ) and noise vectorw ( t ) . A similar model for the case of imperfect channel state information at the

transmitter has been used in, e.g., [Dabbagh and Love , 2008; Hutter et al., 2000;Yoo and Goldsmith, 2006] .We then dene the average SNR ρ as ρ P/σ 2 . The received symbol y( t )

k of user k at time t is given by:

y( t )k = h H

k g k s ( t )k +

1≤i≤K i = k

h H

k g i s( t )i + σw ( t )

k

where we recall that h Hk ∈C N denotes the kth row of H .

The SINR γ k of user k reads:

γ k = |hHk g k |2N

j =1j = k |h H

k g j |2 + σ 2

N

. (14.6)

The system sum rate Rsum is dened as

R sum =K

k=1

log2 (1 + γ k )

evaluated in bits/s/Hz.The objective of the next section is to provide a deterministic equivalent for

the γ k under RZF precoding.

14.1.2 Deterministic equivalent of the SINR

A deterministic equivalent for the SINR under RZF precoding is given as follows.

Theorem 14.1. Let γ rzf ,k be the SINR of user k under RZF precoding, i.e. γ rzf ,k

is given by (14.6), with G given by (14.6) and ξ such that the power constraint (14.1) is fullled. Then

γ rzf ,k −γ rzf ,ka .s.

−→ 0

Page 366: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 366/562

342 14. Rate performance in multiple access and broadcast channels

as N, K → ∞, such that 1 ≤ N/K ≤ C for some C > 1, and where γ rzf ,k is given by:

γ rzf ,k = r2

k(1

−τ 2

k)m2

r k Υ (1 −τ 2k [1−(1 + r k m)2]) + Ψρ (1 + r k m)2

with

m = 1N

tr T (α I N + φT )−1 , (14.7)

Ψ =1N tr R (I K + mR )−2 1

N tr T (α I N + φT )−2

1 − 1N tr R 2 (I K + mR )−2 1

N tr T 2 (α I N + φT )−2 (14.8)

Υ =1N tr R (I K + mR )−2 1

N tr T 2 (α I N + φT )−2

1 − 1N tr R

2

(I K + mR )−2 1

N tr T2

(α I N + φT )−2 (14.9)

where φ is the unique real positive solution of

φ = 1N

tr R I K + R 1N

tr T (α I N + φT )−1 −1

. (14.10)

Moreover, dene φ0 = 1/α , and for k ≥ 1

φk = 1N

tr R I K + R 1K

tr T (α I N + φk−1T )−1 −1

.

Then φ = lim k→∞φk .

We subsequently prove the above result by providing deterministic equivalentsfor ξ , then for the power of the signal of interest (numerator of the SINR) andnally for the interference power (denominator of the SINR). Throughout thisproof, we will mainly use Theorem 6.12.

For convenience, we will constantly use the notation mB N ,Q N (z) for a randommatrix B N ∈C N ×N and a deterministic matrix Q N ∈C N ×N to represent

mB N ,Q N (z) 1N

tr Q N (B N −zI N )−1 .

Also, we will denote mB N (z) mB N , I N (z). The character m is chosen herebecause all mB N ,Q N (z) considered in the following will turn out to be Stieltjestransforms of nite measures on R + .

From the sum power constraint ( 14.1), we obtain

ξ 2 = P

1N tr H H H H H H + α I N

−2 = P

m H H H (−α) −αmH H H (−α)

P Ψ

where the last equation follows from the decomposition

H H H (H H H + α I M )−2 = ( H H H + α I N )−1 −α(H H H + α I N )−2

Page 367: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 367/562

14.1. Broadcast channels with linear precoders 343

and we dene

Ψ m H H H (−α) −αm H H H (−α).

The received symbol y( t )k of user k at time t is given by:

y( t )k = ξ h H

k W h k s( t )

k + ξ K

i=1i= k

h Hk

W h i s( t )i + σw ( t )

k

where W (H H H + α I N )−1 and h Hk is the kth row of H . The SINR γ rzf ,k of

user k can be written under the form

γ rzf ,k = |h H

k W h k

|2

h Hk W H H

(k ) H (k ) Wh k + 1ρ Ψ

where H H

(k ) = [h 1 , . . . , h k−1 , h k+1 , . . . , h K ] ∈C N ×(K −1) . Note that this SINRexpression implicitly assumes that the receiver is perfectly aware of both thevector channel h k and the phase of h H

k W h k which rotates the transmitted signal.This requires to assume the existence of a dedicated training sequence for thereceivers.

We will proceed by successively deriving deterministic equivalent expressionsfor Ψ, for the signal power |h

Hk W h k |2 and for the power of the interference

hH

WˆH

H

(k )ˆH (k )

ˆWh k .We rst consider the power regularization term Ψ. From Theorem 6.12,

m H H H (−α) is close to m H H H (−α) given as:

m H H H (−α) = 1N

tr ( α I N + φT )−1 (14.11)

where φ is dened in (14.10).Remark now, and for further purposes, that ¯ m in (14.7) is a deterministic

equivalent of the Stieltjes transform mA (z) of

A ˆX

H

RˆX + αT −

1

evaluated at z = 0, i.e.

mA (0) − m a .s.

−→ 0. (14.12)

Note that it is uncommon to evaluate Stieltjes transforms at z = 0. This isvalid here, since we assumed in ( 14.3) that 1 /t N > 1/a + and then the smallesteigenvalue of A is strictly greater than 1 / (2a+ ) > 0, uniformly on N . Therefore,mA (0) is well dened.

Since the deterministic equivalent of the Stieltjes transform of H H H is itself theStieltjes transform of a probability distribution, it is an analytic function outsidethe real positive half-line. The dominated convergence theorem, Theorem 6.3,ensures that the derivative ¯ m H H H (z) of m H H H (z) is a deterministic equivalent of

Page 368: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 368/562

344 14. Rate performance in multiple access and broadcast channels

m H H H (z), i.e.

m H H H (z)

− m H H H (z) a.s.

−→ 0.

After differentiation of ( 14.11) and standard algebraic manipulations, weobtain

Ψ − Ψ a.s.

−→ 0

as N, K → ∞, where

Ψ m H H H (−α) −αm H H H (−α)

which is explicitly given by (14.8).We now derive a deterministic equivalent for the signal power.Applying Lemma 6.2 to h H

k W = h Hk (H H

(k ) H (k ) + α I N + h k h Hk )−1 , we have:

h H

k Wh k =

h Hk H H

(k ) H (k ) + α I N −1

h k

1 + h Hk H H

(k ) H (k ) + α I N −1

h k

.

Together with h Hk = √ r k

1 −τ 2k x H

k + τ k q Hk T

12 , we obtain

h H

k Wh k = 1 −τ 2k r k x H

k A −1(k ) x k

1 + r k x Hk A −1

(k ) x k+

τ k r k q Hk A −1

(k ) x k

1 + r k x Hk A −1

(k ) x k

with A (k ) = X H

(k ) R (k ) X (k ) + αT −1 for X H

(k ) = [x 1 , . . . , x k−1 , x k+1 , . . . , x K ], x n

being the nth row of X , and R (k ) = diag( r 1 , . . . , r k−1 , r k +1 . . . r K ). Since bothx k and x k have i.i.d. entries of variance 1 /N and are independent of A (k ) , whileA (k ) has uniformly bounded spectral norm since t1 > a −, we invoke the tracelemma, Theorem 3.4, and obtain

x H

k A −1(k ) x k − 1

N tr A −1

(k )a .s.

−→ 0

x H

k A −1(k ) x k −

1N

tr A −1(k )

a .s.

−→ 0.

Similarly, as q k and xk are independent, from Theorem 3.7

q Hk A −1

(k ) x ka .s.

−→ 0.

Consequently, since (1 + r k x Hk A −1

(k ) x k ) is bounded away from zero, we obtain

h Hk

Wh k − 1 −τ 2kr k

1N tr A −1

(k )

1 + r k1N tr A −1

(k )

a .s.

−→ 0. (14.13)

Page 369: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 369/562

Page 370: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 370/562

346 14. Rate performance in multiple access and broadcast channels

This implies in particular that, if A N is uniformly bounded

(x H

N Ax N ) p

− 1

N tr A N

pa .s.

−→ 0.

We nally address the more involved question of the interference power.With W = T −1

2 A −1T −12 , the interference power can be written as

h Hk

W H H

(k ) H (k ) Wh k = rk x H

k A −1 X H

(k ) R (k ) X (k ) A −1x k . (14.16)

Denote c0 = (1 −τ 2k )r k , c1 = τ 2k r k and c2 = τ k 1 −τ 2k r k , then:

A = A (k ) + c0x k x H

k + c1q k q H

k + c2x k q H

k + c2q k x H

k .

In order to eliminate the dependence between x k and A in (14.16), we rewrite(14.16) as

h H

k W H H

(k ) H (k ) Wh k

= rk x H

k A −1(k ) X H

(k ) R (k ) X (k ) A −1x k + r k x H

k A −1 −A −1(k )

X H

(k ) R (k ) X (k ) A −1x k .

(14.17)

Applying the resolvent identity, Lemma 6.1, to the term in brackets in ( 14.17)and together with A −1

(k ) X H

(k ) R (k ) X (k ) = I N −αA −1(k ) T −1 , (14.17) takes the form

h Hk W H H

(k ) H (k ) Wh k = rk x Hk A −1x k −αr k x H

k A −1(k ) T −1A −1x k

−c0r k x H

k A −1x k x H

k A −1x k −αx H

k A −1(k ) T −1A −1x k

−c1r k x Hk A −1q k q H

k A −1x k −αq Hk A −1

(k ) T −1A −1x k

−c2r k x H

k A −1x k q H

k A −1x k −αq H

k A −1(k ) T −1A −1x k

−c2r k x Hk A −1q k x H

k A −1x k −αx Hk A −1

(k ) T −1A −1x k .

(14.18)

To nd a deterministic equivalent for all of all the terms in ( 14.18), we needthe following lemma, which is an extension of Theorem 3.4.

Lemma 14.1. Let U N , V N ∈C N ×N be invertible and of uniformly bounded spectral norm. Let xN , y N ∈C N have i.i.d. complex entries of zero mean,variance 1/N , and nite eighth order moment and be mutually independent as well as independent of U N , V N . Dene c0 , c1 , c2 > 0 such that c0c1 −c2

2 ≥ 0 and let u 1

N tr V −1N and u 1

N tr U N V −1N . Then we have, as N → ∞

xH

N U N V N + c0x N xH

N + c1y N yH

N + c2x N yH

N + c2y N xH

N −1

x N

− u (1 + c1u)

(c0c1 −c22)u2 + ( c0 + c1)u + 1

a .s.

−→ 0.

Page 371: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 371/562

Page 372: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 372/562

Page 373: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 373/562

14.1. Broadcast channels with linear precoders 349

Note that, for τ > 0 and at asymptotically high SNR, the above regularizationterm α satises

limρ→∞α = τ

2

1 −τ 2 K N .

Thus, for asymptotically high SNR, RZF-O does not converge to the ZFprecoding, in the sense that ¯α does not converge to zero. This is to be opposedto the case when perfect channel state information at the transmitter is assumedwhere α tends to zero.

Under this scenario and with α = α , the SINR γ rzf ,k is now independent of the index k and takes now the simplied form

γ rzf ,k = ω2 ρ

N K −1 +

χ2 −

12

where ω and χ are given by:

ω = 1−τ 2

1 + τ 2ρ

χ = N K −1

2

ω2ρ2 + 2N K

+ 1 ωρ + 1 .

Further comments on the above are provided in [Wagner et al., 2011]. Wenow move to ZF precoding, whose performance can be studied by having theregularization parameter of RZF tend to zero.

14.1.4 Zero-forcing precoding

For α = 0, the RZF precoding matrix reduces to the ZF precoding matrix G zf ,which we recall is dened as

G zf = ξ

√ N ˆH

H ˆH

ˆH

H −1

where ξ is a scaling factor to fulll the power constraint ( 14.1).To derive a deterministic equivalent of the SINR of ZF precoding, we cannot

assume N = K and apply the same techniques as for RZF, since by removinga row of H , the matrix H H H becomes singular. Therefore, we adopt a differentstrategy and derive a deterministic equivalent ¯ γ zf ,k for the SINR of ZF for user kunder the additional assumption that, uniformly on K, N , N

K > 1 + ε, for someε > 0. The approximated SINR ¯γ zf ,k is then given by:

γ zf ,k = limα →0

γ rzf ,k .

The result is summarized in the following theorem.

Page 374: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 374/562

350 14. Rate performance in multiple access and broadcast channels

Theorem 14.3. Let N/K > 1 + ε for some ε > 0 and γ zf ,k be the SINR of user k for the ZF precoder. Then

γ zf ,k −γ zf ,ka .s.

−→ 0as N, K → ∞, with uniformly bounded ratio, where γ zf ,k is given by:

γ zf ,k = 1−τ 2kr k τ 2k Υ + Ψ

ρ

(14.21)

with

Ψ = 1φ

1N

tr R −1

Υ = ψN K φ2 −ψ

1K

tr R −1

ψ = 1N

tr T 2 I N + K N

T−2

(14.22)

where φ is the unique solution of

φ = 1N

tr T I N + K N

T−1

. (14.23)

Moreover, φ = lim k→∞φk , where φ0 = 1 and, for k ≥ 1

φk = 1N

tr T I N + K N

1φk−1

T−1

.

Note, that by Jensen’s inequality ψ ≥ φ2 with equality if T = I N . In thesimpler case when T = I N , R = I K and τ 1 = . . . = τ K τ , we have the followingdeterministic equivalent for the SINR under ZF precoding.

γ zf ,k = 1−τ 2

τ 2 + 1

ρ

N K −1 .

We hereafter prove Theorem 14.3.Recall the terms in the SINR of RZF that depend on α, i.e. mA , Ψ, and Υ

mA = 1N

tr TF

Ψ = 1N

tr F −αF 2

Υ = 1N

tr T F −αF 2

where we introduced the notation

F = H H H + α I N −1

.

Page 375: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 375/562

14.1. Broadcast channels with linear precoders 351

In order to take the limit α → 0 of Ψ and Υ, we apply the resolvent lemma,Lemma 6.1, to F and we obtain

F −αF 2 = HH

H HH

+ α I N −2

H .

Since H H H is non-singular with probability one for all large N, K , we can takethe limit α → 0 of Ψ and Υ in the RZF case, for all such H H H and large enoughN, K . This redenes Ψ and Υ as

Ψ = 1N

tr H H H −1

Υ = 1N

tr HT H H H H H −2.

Note that it is necessary to assume that N/K > 1 + ε uniformly for some ε > 0to ensure that the maximum eigenvalue of matrix ( H H H )−1 is uniformly boundedfor all large N . Since mA grows with α decreasing to zero as O(1/α ), we have:

γ zf ,k = limα →0

γ rzf ,k = 1−τ 2kr k τ 2k Υ + Ψ

ρ. (14.24)

Now we derive deterministic equivalents Ψ and Υ for Ψ and Υ, respectively.Theorem 6.12 can be directly applied to nd Ψ as

¯Ψ =

1N

1φ tr R −

1

where φ is dened in (14.23).To determine Υ, note that we can diagonalize T as T = U diag( t1 , . . . , t N )U H ,

where U is a unitary matrix, and still have i.i.d. elements in the kth column x kof XU . Denoting C = H H H , C (k ) = H [k ] H

H

[k ] −tk R12 x k x H

k R12 and applying the

usual matrix inversion lemma twice, we can write

Υ = 1N

N

k=1

t2k

x Hk R

12 C −2

(k ) R12 x k

(1 + tk x Hk R

12 C −1

(k ) R12 x k )2

.

Notice here that C −1(k ) does not have uniformly bounded spectral norm. The

trace lemma, Theorem 3.4, can therefore not be applied straightforwardly here.However, since the ratio N/K is taken uniformly greater than 1 + ε for someε > 0, C −1

(k ) has almost surely bounded spectral norm for all large N (this unfoldsfrom Theorem 7.1 and from standard matrix norm inequalities, reminding us that0 < a < t k , r k < b < ∞). This is in fact sufficient for a similar trace lemma tohold.

Lemma 14.2. Let A 1 , A 2 , . . . , with A N

∈C N ×N , be a series of random matrices

generated by the probability space (Ω, F , P ) such that, for ω ∈ A ⊂ Ω, with P (A) = 1 , A N (ω) < K (ω) < ∞, uniformly on N . Let x 1 , x 2 , . . . be random vectors of i.i.d. entries such that the entries of xN ∈C N have zero mean,

Page 376: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 376/562

352 14. Rate performance in multiple access and broadcast channels

variance 1/N , and eighth order moment of order O(1/N 4), independent of A N .Then

x HN A N x N − 1N tr A N a .s.−→ 0

as N → ∞.

Proof. The proof unfolds from a direct application of the Tonelli theorem,Theorem 3.16. Denoting ( X, X , P X ) the probability space that generates theseries x 1 , x 2 , . . . , we have that, for every ω ∈ A (i.e. for every realizationA 1(ω), A 2(ω), . . . with ω ∈ A), the trace lemma, Theorem 3.4, holds true. Now,from Theorem 3.16, the space B of couples (x, ω) ∈ X ×Ω for which the trace

lemma holds satises

X ×Ω1B (x, ω)dP X ×Ω(x, ω) = Ω X

1B (x, ω)dP X (x)dP (ω).

If ω ∈ A, then 1 B (x, ω) = 1 on a subset of X of probability one. The inner integraltherefore equals one whenever ω ∈ A. As for the outer integral, since P (A) = 1,it also equals one, and the result is proved.

Moreover, the rank-1 perturbation lemma, Theorem 3.9, no longer ensuresthat the normalized trace of C −1

(k ) and C −1 are asymptotically equal. In fact,following the same line of argument as above, we also have a generalized rank-1perturbation lemma, which now holds only almost surely.

Lemma 14.3. Let A 1 , A 2 , . . . , with A N ∈C N ×N , be deterministic with uniformly bounded spectral norm and B 1 , B 2 , . . . , with B N ∈C N ×N , be random Hermitian, with eigenvalues λB N

1 ≤ . . . ≤ λB N N such that, with probability one,

there exist ε > 0 for which λB N 1 > ε for all large N . Then for v ∈C N

1N

tr A N B −1N − 1

N tr A N (B N + vv H )−1 a.s.−→ 0

as N → ∞, where B −1N and (B N + vv H )−1 exist with probability one.

Proof. The proof unfolds similarly as above, with some particular care to betaken. Call B the set of probability one in question and take ω ∈ B . The smallesteigenvalue of B N (ω) is greater than ε(ω) for all large N . Therefore, for such anN , rst of all B N (ω) and B N (ω) + vv H are invertible and, taking z = −ε(ω)/ 2,we can write

1N

tr A N B N (ω)−1 = 1N

tr A N B N (ω) − ε(ω)

2 IN +

ε(ω)2

IN −1

Page 377: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 377/562

14.1. Broadcast channels with linear precoders 353

and1N

tr A N (B N (ω) + vv H )−1

= 1N

tr A N B N (ω) + vv H

− ε(ω)

2 IN +

ε(ω)2

IN −1

.

With these notations, B N (ω) − ε(ω)2 IN and B N (ω) + vv H

− ε(ω)2 IN are still non-

negative denite for all N . Therefore, the rank-1 perturbation lemma, Theorem3.9, can be applied for this ω. But then, from the Tonelli theorem, in the spacethat generates ( B 1 , B 2 , . . . ), the subspace where the rank-1 perturbation lemmaapplies has probability one, which is what needed to be proved.

Applying the above trace lemma and rank-1 perturbation lemma, we obtain

Υ − 1N

tr RC −2 1N

N

k=1

t2k

(1 + tk1N tr RC −1)2

a .s.

−→ 0.

To determine a deterministic equivalent ˆ mC ,Λ (0) for mC ,Λ (0) = 1K tr RC −1 ,

we apply Theorem 6.12 again (noticing that there is once more no continuityissue in point z = 0). For 1

K tr RC −2 , we have:

1K

tr RC −2 = mC 2 ,R (z) = mC ,R (0) .

The derivative of mC ,R (0) being a deterministic equivalent of mC ,R (0), wehave Υ, given as:

Υ = ψ

N K φ2 −ψ

1K

tr R −1

where ψ and φ are dened in ( 14.22) and ( 14.23), respectively. Finally, we obtain(14.21) by substituting Ψ and Υ in ( 14.24) by their respective deterministicequivalents, which completes the proof.

The above results allow for interesting characterizations of the linearlyprecoded broadcast channels. Some of these are provided below.

14.1.5 Applications

An interesting characterization of the performance of RZF-O derived abovefor imperfect channel state information at the transmitter is to evaluate thedifference ∆ R between the sum rate achieved under RZF-O and the sumrate achieved when perfect channel information is assumed. For N = K , fromthe deterministic equivalents obtained above, this is close to ∆ R which, in

homogeneous channel conditions, takes the following convenient form

∆ R = log 2 1 + √ 1 + 4 ρ1 + √ 1 + 4 ρω

Page 378: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 378/562

354 14. Rate performance in multiple access and broadcast channels

with

ω = 1−τ 2

1 + τ 2ρ.

An interesting question is to determine how much feedback is required in theuplink for the sum rate between the perfect CSIT and imperfect CSIT case to beno larger than log 2 b. That is, how much distortion τ 2 is maximally allowed toensure a rate gap of log 2 b. Under the above simplied channel conditions, withN = K

τ 2 = 1 + 4ρ − ω2

b2

3 + ω2

b2

.

In particular, τ 2 b2

−1 in the large ρ regime. This further allows us to evaluate

the optimal number of training bits required to maintain a given rate loss. Then,taking into account the additional rate loss incurred by the process of channeltraining, the optimal sum rate under the channel training constraint can be madeexplicit. This is further discussed in [Wagner et al., 2011].

A second question of interest lies in the optimal ratio N/K between the numberof transmit antennas and the number of users that maximizes the sum rate pertransmit antenna. This allows us to determine a gure of merit for the efficiencyof every single antenna. In the uncorrelated channel conditions above, from thedeterministic equivalents, we nd the optimal ratio c to be given by

c = 1 −τ 2 + 1

ρ

1 −τ 21 +

1

W 1e

1−τ 2τ 2 + 1

ρ −1

with W (x) the Lamber-W function, dened as the solution to W (x) = xeW (x ) ,unique for x ∈ [−1

e , ∞).In Figure 14.2, the sum rate performance of RZF and ZF under imperfect

channel state information at the transmitter are depicted and compared. Weassume here that N = K , that the users are uniformly distributed around the

transmitter, and that there exists no channel correlation at the transmitter.The channel estimation distortion equals τ 2 = 0 .1. Observe that a signicantperformance gain is achieved when the imperfect channel estimation parameteris taken into account, while, as expected, the performance of both RZF assumingperfect channel information and ZF converge asymptotically to the same limit.

In Figure 14.3, the performance of ZF precoding under different channelassumptions are compared against their respective deterministic equivalents.The exact channel conditions assumed here follow from those introduced in[Wagner et al., 2011] and are omitted here. We only mention that homogeneouschannel conditions, i.e. with T = I N and R = I K , are compared against thecase when correlation at the transmitter emerges from a compact array of antennas on a limited volume, with inter-antenna spacing equal to half thetransmit wavelength, and the case when users have different path losses, following

Page 379: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 379/562

14.2. Rate region of MIMO multiple access channels 355

0 5 10 15 20 25 300

10

20

30

ρ [dB]

s u m r a t e

[ b i t s / s / H z ]

RZF-ORZF-CDUZF

Figure 14.2 Ergodic sum rate of regularized zero-forcing with optimal α (RZF-O), αtaken as if τ = 0, i.e. for channel distortion unaware transmitter (RZF-CDU) andzero-forcing (ZF), T = I N , R = I K , N = K , τ 2 = 0 .1.

the COST231 Hata urban propagation model. We also take τ 1 = . . . = τ K τ ,with τ = 0 or τ 2 = 0 .1, alternatively. The number of users is K = 16 and thenumber of transmit antennas N = 32. The lines drawn are the deterministicequivalents, while the dots and error bars are the averaged sum rate evaluatedfrom simulations and the standard deviation, respectively. Observe that thedeterministic equivalents, already for these not too large system dimensions,fall within one standard deviation of the simulated sum rates and are in mostsituations very close approximations of the mean sum rates.

Similar considerations of optimal training time in large networks, but in thecontext of multi-cell uplink models are also provided in [Hoydis et al., 2011d],

also relying on deterministic equivalent techniques. This concludes this section onlinearly precoded broadcast channels. In the subsequent section, we address theinformation-theoretic, although less practical, question of the characterization of the overall rate region of the dual multiple access channel.

14.2 Rate region of MIMO multiple access channels

We consider the generic model of an N -antenna access point (or base station)communicating with K users. User k, k ∈ 1, . . . , K , is equipped with nk

antennas. Contrary to the study developed in Section 14.1, we do notrestrict receiver k to be equipped with a single antenna. To establish large

Page 380: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 380/562

356 14. Rate performance in multiple access and broadcast channels

0 5 10 15 20 25 300

50

100

150

ρ [dB]

s u m r a t e

[ b i t s / s / H z ]

T = I N , R = I K , τ = 0T = I N , R = I K , τ = 0T = I N , R = I K , τ = 0

T = I N , R = I K , τ 2 = 0 .1T = I N , R = I K , τ 2 = 0 .1

Figure 14.3 Sum rate of ZF precoding, with N = 32, K = 16, under different channelconditions. Uncorrelated transmit antennas ( T = I N ) or volume limited transmitdevice with inter-antenna spacing of half the wavelength ( T = I N ), equal path losses(R = I K ) or path losses based on modied COST231 Hata urban model ( T = I K ),τ = 0 or τ 2 = 0 .1. Simulation results are indicated by circle marks with error barsindicating one standard deviation in each direction.

dimensional matrix results, we will consider here that N and n K i=1 nk are

commensurable. That is, we will consider a large system analysis where:

• either a large number K of users, each equipped with few antennas,communicate with an access point, equipped with a large number N of antennas;

• either a small number K of users, each equipped with a large number of antennas, communicate with an access point, equipped also with a largenumber of antennas.

The channel between the base station and user k is modeled by the matrixH k ∈C n k ×N . In the previous section, where nk = 1 for all k, H k was denoted byh H

k , the Hermitian sign being chosen for readability to avoid working with rowvectors. With similar notations as in Section 14.1, the downlink channel modelfor user k at time t reads:

y ( t )k = H k z ( t ) + σw ( t )

k

where z( t )

C N is the transmit data vector and the receive symbols y ( t )k

∈C n k

and additive Gaussian noise w ( t )k ∈C n k are now vector valued. Contrary to thelinear precoding approach, the relation between the effectively transmitted z( t )

and the intended symbols s ( t ) = ( s( t )1 , . . . , s ( t )

K )T is not necessarily linear. We also

Page 381: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 381/562

14.2. Rate region of MIMO multiple access channels 357

Figure 14.4 Multiple access MIMO channel, composed of K users and a base station.User k is equipped with nk antennas, and the base station with N antennas. Thechannel between user k and the base station is H H

k = T12k X H

k R Hk .

denote P the transmit covariance matrix P = E[z ( t ) z ( t ) H ], assumed independentof t, which satises the power constraint 1

N tr P = P .Denoting equivalently z( t )

k ∈C n k the signal transmitted in the uplink (MAC)by user k, such that E[ z ( t )

k z( t )H

k ] = P k , 1n k

tr P k ≤ P k , y ( t ) and w ( t ) the signaland noise received by the base station, respectively, and assuming perfect channelreciprocity in the downlink and the uplink, we have the uplink transmission

model

y ( t ) =K

k=1

H Hk z ( t )

k + w ( t ) . (14.25)

Similar to the linearly precoded case, we assume that H k , k ∈ 1, . . . , K , ismodeled as Kronecker, i.e.

H k R12k X k T

12k (14.26)

where X k ∈C n k ×N has i.i.d. Gaussian entries of zero mean and variance 1 /n k ,

T k ∈C N

×N

is the Hermitian non-negative denite channel correlation matrixat the base station with respect to user k, and R k ∈C n k ×n k is the Hermitiannon-negative denite channel correlation matrix at user k.

In the following, we study the MAC rate regions for quasi-static channels. Wewill then consider the ergodic rate region for time-varying MAC. An illustrativerepresentation of a cellular uplink MAC channel as introduced above is providedin Figure 14.4.

14.2.1 MAC rate region in quasi-static channels

We start by assuming that the channels H 1 , . . . , H K are random realizations of the Kronecker channel model ( 14.26), considered constant over the observationperiod. The MIMO MAC rate region C MAC (P 1 , . . . , P K ; H H ) for the quasi-static

Page 382: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 382/562

358 14. Rate performance in multiple access and broadcast channels

model (14.25), under respective transmit power constraints P 1 , . . . , P K for users1 to K and channel H H [H H

1 . . . H HK ], reads [Tse and Viswanath , 2005]

C MAC (P 1 , . . . , P K ; HH

)

=1

n itr( P i )≤P iP i ≥0

i =1 ,...,K

r ;i∈

S

r i ≤ log2 det I N + 1σ2

i∈S

H H

i P i H i ,∀S⊂ 1, . . . , K

(14.27)

where r = ( r 1 , . . . , r K ) ∈ [0, ∞)K and P i ≥ 0 stands for P i non-negative denite.That is, the set of achievable rate vectors ( r 1 , . . . , r K ) is such that the sum of the rates of any subset S =

i1 , . . . , i

|S

| is less than a classical log determinant

expression for all possible precoders P i 1 , . . . , P i |S |.Consider such a subset S = i1 , . . . , i |S | of 1, . . . , K and a set P i 1 , . . . , P i |S |

of deterministic precoders, i.e. precoders chosen independently of the particularrealizations of the H 1 , . . . , H K matrices (although possibly taken as a functionof the R k and T k correlation matrices).

At this point, depending on the underlying model assumptions, it is possibleto apply either Corollary 6.1 of Theorem 6.4 or Remark 6.5 of Theorem 6.12.Although we mentioned in Remark 6.5 that it is highly probable that boththeorems hold for the most general model hypotheses, we presently state which

result can be applied to which situation based on the mathematical resultsavailable in the literature.

• If the R k and T k matrices are only constrained to have uniformly boundednormalized trace and the number of users K is small compared to the numberof antennas nk per user and the number of antennas at the transmitter, thenCorollary 6.1 of Theorem 6.4 states (under some mild additional assumptionsrecalled in Chapter 13) that the per-antenna normalized log determinantexpressions of ( 14.27) can be given a deterministic equivalent. This case allowsus to assume very correlated antenna arrays, which is rather convenient for

practical purposes.• If the R k and T k matrices have uniformly bounded spectral norm as their

dimensions grow, i.e. in practice if the largest eigenvalues of R k or T k aremuch smaller than the matrix size, and the total number of user antennas n

K k=1 nk is of the same order as the number N of antennas at the base station,

then the hypotheses of Theorem 6.12 hold and a deterministic equivalent forthe total capacity can be provided.

The rst setting is more general in the sense that more general antennacorrelation proles can be used, while the second case is more general in thesense that the users’ antennas can be distributed in a more heterogeneous way.Since we wish rst to determine the rate region of MAC, though, we need, forall subset S

⊂ 1, . . . , K , that N and i∈S n i grow large simultaneously. This

Page 383: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 383/562

14.2. Rate region of MIMO multiple access channels 359

imposes in particular that all n i grow large simultaneously with N . Later we willdetermine the achievable sum rate, for which only the subset S = 1, . . . , K willbe considered. For the latter, Theorem 6.12 will be more interesting, as it canassume a large number of users with few antennas, which is a far more realisticassumption in practice.

For rate region analysis, consider then that all ni are of the same order of dimension as N and that K is small in comparison. From Theorem 6.4, we haveimmediately that

1N

log2 det I N + 1σ2

i∈S

H Hi P i H i −

1N

log2 det I N +k∈

S

ek T k

− log2(e)σ2

k∈S

ek ek + 1

N k∈S

log2 det I n k + ck ek R12k P k R

12k

a .s.

−→ 0

where ck N/n k and ei 1 , . . . , e i |S |, ei 1 , . . . , ei |S | are the only positive solutions to

ei = 1σ2N

tr T i I N +k∈

S

ek T k

−1

ei = 1σ2n i

tr R12i P i R

12i I n i + ci ei R

12i P i R

12i

−1. (14.28)

This therefore provides a deterministic equivalent for the points in the rate

region corresponding to deterministic power allocation strategies , i.e. powerallocation strategies that do not depend on the X k matrices. That is, not allpoints in the rate region can be associated with a deterministic equivalent(especially not the points on the rate region boundary) but only those pointsfor which a deterministic power allocation is assumed.

Note now that we can similarly provide a deterministic equivalent to everypoint in the rate region of the quasi-static broadcast channel corresponding toa deterministic power allocation policy. As recalled earlier, the boundaries of the rate region C BC (P ; H ) of the broadcast channel have been recently shown[Weingarten et al., 2006] to be achieved by dirty paper coding (DPC). For atransmit power constraint P over the compound channel H , it is shown by MAC-BC duality that [Viswanath et al., 2003]

C BC (P ; H ) =P 1 ,...,P K

Kk =1 P k ≤P

C MAC (P 1 , . . . , P K ; H H ).

Therefore, from the deterministic equivalent formula above, we can alsodetermine a portion of the BC rate region: that portion corresponding to thedeterministic precoders. However, note that this last result has a rather limitedinterest. Indeed, channel-independent precoders in quasi-static BC inherentlyperform poorly compared to precoders adapted to the propagation channel, suchas the optimal DPC precoder or the linear ZF and RZF precoders. This is becauseBC communications come along with potentially strong inter-user interference,

Page 384: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 384/562

360 14. Rate performance in multiple access and broadcast channels

which is only mitigated through adequate beamforming strategies. Deterministicprecoders are incapable of providing efficient inter-user interference reductionand are therefore rarely considered in the literature.

Simulation results are provided in Figure 14.5 in which we assume a two-userMAC scenario. Each user is equipped with n1 = n2 antennas, where n1 = 8 orn1 = 16, while the base station is equipped with N = n1 = n2 antennas. Theantenna array is linear with inter-antenna distance dR /λ set to 0.5 or 0.1 at theusers, and dT /λ = 10 at the base station. We further assume that the effectivelytransmitted energy propagates from a solid angle of π/ 6 on either communicationside, with different propagation directions, and therefore consider the generalizedJakes’ model for the R k and T k matrices. Specically, we assume that user 2sees the signal arriving at angle zero rad, and user 1 sees the signal arriving at

angle π rad. We further assume uniform power allocation at the transmission.From the gure, we observe that the deterministic equivalent plot is centeredsomewhat around the mean value of the rates achieved for different channelrealizations. As such, it provides a rather rough estimate of the instantaneousmultiple access mutual information. It is nonetheless necessary to have at least 16antennas on either side for the deterministic equivalent to be effectively useful. Interms of information-theoretical observations, note that a large proportion of theachievable rates is lost by increasing the antenna correlation. Also, as alreadyobserved in the single-user MIMO case, increasing the number of antennas instrongly correlated channels reduces the efficiency of every individual antenna.

As largely discussed above, it is in fact of limited interest to study theperformance of quasi-static MAC and BC channels through large dimensionalanalysis, in a similar way to the single-user case, in the sense that optimal powerallocation cannot be performed and the deterministic equivalent only provides arough estimate of the effective rates achieved with high probability for a smallnumber of antennas. When it comes to ergodic mutual information, though,similar to the point-to-point MIMO scenario, large system analysis can provideoptimal power allocation policies and very accurate capacity approximations forsmall system dimensions.

14.2.2 Ergodic MAC rate region

Consider now the situation where the K channels are changing too fast for theusers to be able to adapt adequately their transmit powers, while having constantKronecker-type statistics. In this case, the MAC rate region is dened as

C (ergodic)MAC (P 1 , . . . , P K ; H H )

=

P i r ,

i∈S

r i ≤ E log2 det I N + 1σ2

i∈S

H Hi P i H i ,∀

S⊂ 1, . . . , K

Page 385: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 385/562

14.2. Rate region of MIMO multiple access channels 361

0 0.5 1 1.5 2 2.50

1

2

3

4

Per-antenna rate of user 1 [bits/s/Hz]

P e r - a n t e n n a r a t e o f u s e r 2 [ b i t s / s / H z ]

0 0.5 1 1.5 2 2.50

1

2

3

4

Per-antenna rate of user 1 [bits/s/Hz]

P e r - a n t e n n a r a t e o

f u s e r 2 [ b i t s / s / H z ]

Figure 14.5 (Per-antenna) rate of two-user at fading MAC, equal power allocation,for N = 8 (top), N = 16 (bottom) antennas at the base station, n1 = n 2 = N antennas at the transmitters, uniform linear antenna arrays, antenna spacingd R

λ = 0 .5 (dashed) and d R

λ = 0 .1 (solid) at the transmitters, d T

λ = 10 at the basestation, SNR = 20 dB. Deterministic equivalents are given in thick dark lines.

where the union is taken over all P i non-negative denite such that

1n i

tr( P i ) ≤ P i .

Page 386: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 386/562

362 14. Rate performance in multiple access and broadcast channels

We can again recall Theorem 6.4 to derive a deterministic equivalent for theergodic mutual information for all deterministic P i 1 , . . . , P i |S | precoders, as

E 1N

log2 det I N + 1σ2

i∈S

H H

i P i H i − 1N

log2 det I N +k∈

S

ek T k

+ 1N

k∈S

log2 det I n k + ck ek R12k P k R

12k −log2(e)σ2

k∈S

ek ek → 0

for growing N , ni 1 , . . . , n i |S |. Now it is of interest to determine the optimaldeterministic equivalent precoders. That is, for every subset S = i1 , . . . , i |S |,we wish to determine the precoding vector P S

i 1 , . . . , P Si |S |

which maximizes thedeterministic equivalent. This study is performed in [Couillet et al., 2011a].

Similar to the single-user MIMO case, it suffices to notice that maximizingthe deterministic equivalent over P i 1 , . . . , P i |S | is equivalent to maximizing theexpression

k∈S

log2 det I n k + ck eSk R

12k P k R

12k

over P i 1 , . . . , P i |S |, where (eSi 1 , . . . , e S

i |S |, eS

i 1 , . . . , eSi |S |

) are xed, equal to the uniquesolution with positive entries of ( 14.28) when P i = P S

i for all i ∈S. To observethis, we essentially need to observe that the derivative of the function

V : (P i 1 , . . . , P i |S |, ∆ i 1 , . . . , ∆ i |S |, ∆ i 1 , . . . , ∆ i |S |) →

1N

log2 det I N +k∈

S

∆ k T k

+ 1N

k∈S

log2 det I n k + ck ∆ k R12k P k R

12k −log2(e)σ2

k∈S

∆ k ∆ k

along any ∆ k or ∆ k is zero when ∆ i = ei and ∆ i = ei . This unfolds from

∂V ∂ ∆ k

(P i 1 , . . . , P i |S |, ei 1 , . . . , e i |S |, ei 1 , . . . , ei |S |)

= log 2(e)1N

tr I N +i∈

S

ei T i−1

T k −σ2ek

∂V ∂ ∆ k

(P i 1 , . . . , P i |S |, ei 1 , . . . , e i |S |, ei 1 , . . . , ei |S |)

= log 2(e)ck

N tr I + ck ek R

12i P i R

12i

−1R

12k P k R

12k −σ2 ek

both being null according to ( 14.28).By the differentiation chain rule, the maximization of the log determinants

over every P i is therefore equivalent to the maximization of every term

log2 det I n k + ck eSk R

12k P k R

12k

Page 387: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 387/562

14.2. Rate region of MIMO multiple access channels 363

Dene η > 0 the convergence threshold and l ≥ 0 the iteration step.At step l = 0, for k ∈S, i ∈ 1, . . . , n k, set q 0k,i = P k . At step l ≥ 1,while maxk,i

|q l

k,i −q l−1

k,i | > η do

For k ∈S, dene (el+1k , el+1

k ) as the unique pair of positivesolutions to (14.28) with, for all j ∈S, P j = U j Q l

j U Hj , Q l

j =diag( q lj, 1 , . . . , q lj,n j ) and U j the matrix such that R j has spectraldecomposition U j Λ j U H

j , Λ j = diag( r j, 1 , . . . , r j,n j )for i ∈ 1 . . . , n k do

Set q l+1k,i = µk − 1

ck e l +1k r k,i

+, with µk such that 1

n ktr Q l

k = P kend forassign l ← l + 1

end while

Table 14.1. Iterative water-lling algorithm for the determination of the MIMO MACergodic rate region boundary.

(remember that the power constraints over the P i are independent). Themaximum of the deterministic equivalent for the MAC ergodic mutualinformation is then found to be V evaluated at eS

k , eSk and P S

k for all k ∈S. Itunfolds that the capacity maximizing precoding matrices are given by a water-lling solution, as

P Sk = U k Q S

k U H

k

where U k ∈C n k ×n k is the eigenvector matrix of the spectral decomposition of R k as R k = U k diag( r k, 1 , . . . , r k,n k )U H

k , and Q Sk is a diagonal matrix with ith

diagonal entry q Ski given by:

q Ski = µk − 1

ck eSk r k,i

+

µk being set so that 1n k

n k

i =1 q Ski = P k , the maximum power allowed for userk. As usual, this can be determined from an iterative water-lling algorithm,

provided that the latter converges. This is given in Table 14.1.The performance of uniform and optimal power allocation strategies in the

uplink ergodic MAC channel is provided in Figure 14.6 and Figure 14.7. As inthe quasi-static case, the system comprises two users with n1 = n2 antennas,identical distance 0 .5λ between consecutive antennas placed in linear arrays,and angle spread of energy arrival of π/ 2. User 1 sees the signal from an angle of zero rad, while user 1 sees the signal from an angle of π rad. In Figure 14.6, weobserve, as was true already for the point-to-point MIMO case, that deterministicequivalents approximate very well the actual ergodic mutual information, fordimensions greater than or equal to four. It is then observed in Figure 14.7 thatmuch data throughput can be gained by using optimal precoders at the user

Page 388: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 388/562

364 14. Rate performance in multiple access and broadcast channels

0 2 4 6 8 10 12 14 160

5

10

15

20

25

Per antenna rate of user 1 [bits/s/Hz]

P e r - a n t e n n a r a t e o f u s e r 2 [ b i t s / s / H z ] N = 2, simulation

N = 2, det. eq.N = 4, simulation

N = 4, det. eq.N = 8, simulationN = 8, det. eq.

Figure 14.6 Ergodic rate region of two-user MAC, uniform power allocation, forN = 2, N = 4, and N = 8, n1 = n 2 = N , uniform linear array model, antenna spacingat the users d R

λ = 0 .5, at the base station d T

λ = 10. Comparison between simulationsand deterministic equivalents (det. eq.).

terminals, especially on the rates of strongly correlated users. Notice also that

in all previous performance plots, depending on the direction of energy arrival,a large difference in throughput can be achieved. This is more acute than inthe single-user case where the resulting capacity is observed to be only slightlyreduced by different propagation angles. Here, it seems that some users can eitherbenet or suffer greatly from the conditions met by other users.

We now turn to the specic study of the achievable ergodic sum rate.

14.2.3 Multi-user uplink sum rate capacity

As recalled earlier, when it comes to sum rate capacity, we only need to providea deterministic equivalent for the log determinant expression

E log2 det I N + 1σ2

K

i =1

H H

i P i H i

for all deterministic P 1 , . . . , P K . Obviously, this problem is even easier totreat than the previous case, as only one subset S of 1, . . . , K , namelyS = 1, . . . , K , has to be considered. As a consequence, the large dimensionalconstraint is just that both n =

K i =1 n i and N are large and of similar

dimension. This does no longer restrict any individual ni to be large and of similar amplitude as N . Therefore, Theorem 6.12 will be used instead of Theorem6.4.

Page 389: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 389/562

14.2. Rate region of MIMO multiple access channels 365

0 5 10 15 200

10

20

30

Per antenna rate of user 1 [bits/s/Hz]

P e r - a n t e n n a r a t e o f u s e r 2 [ b i t s / s / H z ] N = 2, uniform

N = 2, optimalN = 4, uniformN = 4, optimalN = 8, uniformN = 8, optimal

Figure 14.7 Deterministic equivalents for the ergodic rate region of two-user MAC,uniform power allocation against optimal power allocation, for N = 2, N = 4, andN = 8, n1 = n 2 = N , uniform linear array model, antenna spacing at the usersd R

λ = 0 .5, at the base station d T

λ = 10.

We will treat a somewhat different problem in this section, which assumes that,

instead of a per-user power constraint, all users can share a total energy budgetto be used in the uplink. We also assume that the conditions of Theorem 6.12now hold and we can consider, as in Section 14.1, that a large number of mono-antenna users share the channel to access a unique base station equipped witha large number of antennas. In this case, the deterministic equivalent developedin the previous section still holds true. Under transmit sum-power budget P ,the power optimization problem does no longer take the form of a few optimalmatrices but of a large number of scalars to be appropriately shared amongthe users. Recalling the notations of Section 14.1 and assuming perfect channelreciprocity, the ergodic MAC sum rate capacity C ergodic (σ2) reads:

C ergodic (σ2) = max p1 ,...,p KKi =1 pi = P

E log2 det I N + 1σ2

K

i =1 pi h i h H

i (14.29)

with h i = T12i x i ∈C N , x i ∈C N having i.i.d. Gaussian entries of zero mean and

variance 1 /N . Remember that T i stems for both the correlation at the basestation and the path-loss component. In the MAC channel, T i is now a receive correlation matrix.

The right-hand side of ( 14.29) is asymptotically maximized for pk = pk such

that pk − pk → 0 as the system dimensions grow large, where pk is given by:

pk = µ − K N

1ek

+

Page 390: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 390/562

366 14. Rate performance in multiple access and broadcast channels

for all k ∈ 1, . . . , K , where µ is set to satisfy K i =1 pi = P , and where ek is such

that ( e1 , . . . , e K ) is the only vector of positive entries solution of the implicitequations in ( e1 , . . . , e K )

ei = 1N

tr T i 1N

K

k=1

pk1 + cK pk ek

T k + σ2 I N

−1

with cK N/K . Again, the pk can be determined by an iterative water-llingalgorithm.

We nally have that the deterministic equivalent for the capacity C ergodic (σ2)is given by:

1

N C ergodic (σ2)

− 1

N log

2 det σ2 1

N

K

k=1

1

1 + cK ekT k + I N

+ 1N

K

k =1

log2 (1 + cK ek pk ) −log2(e) 1K

K

k=1

pk ek1 + cK ek pk → 0.

In Figure 14.8, the performance of the optimal power allocation scheme isdepicted for correlation conditions similar to the previous scenarios, i.e. withcorrelation patterns at the base station accounting for inter-antenna distanceand propagation direction, and different path loss exponents for the differentusers. It turns out numerically that, as in the single-user MIMO case, for low

SNR, it is benecial to reduce the number of transmitting users and allocatemost of the available power to a few users, while this tendency fades away forhigher SNR.

This closes this chapter on multi-user communications in both single antennaand multiple antenna regimes. In the next chapter, we extend the single-cellmultiple antenna and multi-user framework to multi-cellular communicationsand relay communications.

Page 391: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 391/562

14.2. Rate region of MIMO multiple access channels 367

−15 −10 −5 0 5 10 15 200

2

4

6

8

10

12

SNR [dB]

A c h i e v a b l e r a t e

Sim., uni.Det. eq., uni.Sim., opt.Det. eq., opt.

Figure 14.8 Ergodic MAC sum rate for an N = 4-antenna receiver and K = 4mono-antenna transmitters under sum power constraint. Every user transmit signalhas different correlation patterns at the receiver, and different path losses.Deterministic equivalents (det. eq.) against simulation (sim.), with uniform (uni.) oroptimal (opt.) power allocation.

Page 392: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 392/562

Page 393: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 393/562

15 Performance of multi-cellular and

relay networks

In this chapter, we move from single-cell considerations, with a central entity,e.g. access point or base station, to the wider multi-user multi-cell networkpoint of view or the multi-hop relay point of view. For the former, we nowconsider that communications are performed in a cellular environment over agiven shared communication resource (e.g. same frequency band, overlaying cellcoverage), with possible inter-cell interference. This is a more realistic assumptionto model practical communication networks than the isolated single-cell case withadditive white Gaussian noise. Here, not only AWGN is affecting the differentactors in the network but also cross-talk between adjacent base stations or cell-edge users. For the latter, we do not assume any cellular planning but merelyconsider multi-hop communications between relays which is of reduced use incommercial communication standards but of high interest to, e.g., ad-hoc military

applications.We start with the multi-cell viewpoint.

15.1 Performance of multi-cell networks

We consider the model of a multi-cellular network with overlaying coverageregions. In every cell, each user has a dedicated communication resource,supposed to be orthogonal to any other user’s resource. For instance, we may

assume that orthogonal frequency division multiple access (OFDMA) is used inevery cell, so that users are orthogonally separated both in time and frequency.It is well-known that a major drawback of such networks is their being stronglyinterference limited since users located at cell edges may experience muchinterference from signals transmitted in adjacent cells. To mitigate or cancelthis problem, a few solutions are considered, such as:

• ban the usage of certain frequencies in individual cells. This technique isreferred to as spatial frequency reuse and consists precisely in making surethat two adjacent cells never use the same frequency band. This has thestrong advantage that cross-talk in the same communication bandwidth canonly come from remote cells. Nonetheless, this is a strong sacrice in terms of available communication resources;

Page 394: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 394/562

370 15. Performance of multi-cellular and relay networks

• create directional transmit signal beams. That is, at least in the downlinkscenario, base stations use additional antennas to precode the transmit datain such a way that the information dedicated to the in-cell users is in themain lobe of a propagation beam, while the out-cell users are outside theselobes. This solution does not cancel adjacent cell interference but mitigatesit strongly, although this requires additional antennas at the transmitter andmore involved data processing.

An obvious solution to totally discard interference is to allow the basestations to cooperate. Precisely, if all network base stations are connected viaa high rate backbone and are able to process simultaneously all uplink anddownlink transmissions, then it is possible to remove the inter-cell interference

completely by considering a single large cell instead, which is composed of multiple cooperative base stations. In this case, the network can be simplyconsidered as a regular multi-access or broadcast channel, as those studied inChapter 14. Nonetheless, while this idea of cooperative base stations has nowmade its way to standardization even for large range communications, e.g. inthe 3GPP-LTE Advanced standard [Sesia et al., 2009], this approach is bothdifficult to establish and of actually limited performance gain. Concerning thedifficulty to put the system in place, we rst mention the need to have a highrate backbone common to all base stations and a central processing unit so fastthat joint decoding of all users can be performed under the network delay-limitedconstraints. As for the performance gain limitation, this mainly arises from thefact that, as the effective cell size increases, the central processing unit must beat all times aware of all communication channels and all intended data of allusers. This imposes a large amount of synchronization and channel estimationinformation to be fed back to the central unit as the network size grows large.This therefore assumes that much time is spent by the network learning from itsown structure. In fast mobile communication networks, where communicationchannels are changing fast, this is intolerable as too little time is then spent oneffective communications.

This section is dedicated to the study of the intrinsic potential gains broughtby cooperative cellular networks. As cooperative networks become large, largedimensional matrix theory can be used with high accuracy and allows us toobtain deterministic approximations of the system performance from whichoptimization can be conducted, such as determining the capacity maximizingfeedback time, the capacity maximizing number of cooperative cells, etc. Forsimplicity, though, we assume perfect channel knowledge in the following in orderto derive the potential gains obtained by unrealistic genie-aided multi-cellularcooperation.

Let us consider a K -cell network sharing a narrowband frequency resource,used by exactly one user per cell. Therefore, we can assume here that each cellis composed of a single-user. We call user k the user of cell k, for k ∈ 1, . . . , K .We can consider two interference limited scenarios, namely:

Page 395: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 395/562

15.1. Performance of multi-cell networks 371

• the downlink scenario, where multiple base stations concurrently use theresource of all users, generating inter-cell interference to all users;

• the uplink scenario, where every base station suffers from inter-cell interferencegenerated by the data transmitted by users in other cells.

We will consider here only the uplink scenario. We assume that base stationk is equipped with N k transmit antennas, and that user k is equipped withnk receive antennas. Similar to previous chapters, we denote n K

k =1 nk

and N K k=1 N k . We also denote H k,i ∈C N i ×n k the uplink multiple antenna

channel between user k and base station i. We consider that H k,i is modeled asa Kronecker channel, under the following form

H k,i = R12k,i X k,i T

12k,i

for non-negative R k,i with eigenvalues rk, 1 , . . . , r k,N , non-negative T k,i witheigenvalues tk,i, 1 , . . . , t k,i,n k , and X k,i with i.i.d. Gaussian entries of zero meanand variance 1 /n k . As usual, we can assume the T k,i diagonal without loss of generality.

Assume that a given subset C⊂ 1, . . . , K of cells cooperates and is interfered

with by the remaining users from the complementary set Cc = 1, . . . , K \C.

Denote N C = k∈

C N k , N C c = k∈

C c N k , and H k, C ∈CN C

×n k

the channel[H Hk,i 1

, . . . , H Hk,i |C |

]H , with i1 , . . . , i |C | = C. The channel H k, C is the joint channelbetween user k and the cooperative large base station composed of the |C|subordinate base stations indexed by elements of C. Assuming uniform powerallocation across the antennas of the transmit users, the ergodic sum rateE[I (C; σ2)] of the multiple access channel between the |C| users and thecooperative base stations, considered as a unique receiver, under AWGN of variance σ2 reads:

E[I (C; σ2

)]

= E log2 det I N C +k∈

C

H k, C H Hk, C σ2I N C +

k∈C c

H k, C H Hk, C

−1

= E log2 det I N C + 1σ2

K

k =1

H k, C H H

k, C

−E log2 det I N C + 1σ2

k∈C c

H k, C H Hk, C .

Under this second form, it unfolds that a deterministic equivalent of E[ I (C; σ2)]can be found as the difference of two deterministic equivalents, similar to Chapter

Page 396: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 396/562

372 15. Performance of multi-cellular and relay networks

14. Indeed, the j th column h k, C ,j ∈C N C of H k, C can be written under the form

h k, C ,j =

h k,i 1 ,j

...h k,i |C |,j

=

√ tk,i 1 ,j R12k,i 1 · · · 0

... . . . ...0 · · · tk,i |C |,j

R12k,i |C |

x k,i 1 ,j

...x k,i |C |,j

with x k,i,j the j th column of X k,i , which has independent Gaussian entries of zero mean and variance 1 /n k . We will denote R k, C ,j the block diagonal matrix

R k, C ,j tk,i 1 ,j R k,i 1 · · · 0

... . . .

...0 · · · tk,i |C |,j

R k,i |C |

.

With these notations, the random matrix H k, C H Hk, C therefore follows the

model of Theorem 6.12 with correlation matrices R k, C ,j for all j ∈ 1, . . . , n kand the normalized ergodic sum rate

1N C

E[I (C; σ2)] = E1

N Clog2 det I N C +

1σ2

K

k=1

n k

j k =1

h k, C ,j k h Hk, C ,j k

−E1

N Clog2 det I N C +

1σ2

k∈C c

n k

j k =1

h k, C ,j k h Hk, C ,j k

has a deterministic equivalent that can be split into the difference of the followingexpression (i)

1N C

log2 det σ2 1n

K

k=1

n k

j k =1

11 + cN eK

k,j k

R k, C ,j k + I

+ 1N C

K

k=1

n k

j k =1

log2 1 + cN eKk,j k −

1n

K

k=1

log2(e)n k

j k =1

eKk,j k

1 + cN ek,j k

with cN N/n and eKk,j k , K 1, . . . , K , dened as the only all positivesolutions of

eKk,j k =

1N C

tr R k, C ,j k

1n

K

k =1

n k

j k =1

11 + cN eK

k ,j k

R k ,C ,j k + σ2 I−1

(15.1)

and of this second term (ii)

1N C

log2 det σ2 1n

k

C c

n k

j k =1

11 + cN eC c

k,j k

R k, C ,j k + I

+ 1N C k∈

C c

n k

j k =1

log2 1 + cN eC c

k,j k − 1n

k∈C c

n k

j k =1

log2(e)eC c

k,j k

1 + cN eC c

k,j k

Page 397: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 397/562

15.1. Performance of multi-cell networks 373

with eC c

k,j kthe only all positive solutions of

eCc

k,j k = 1N C tr R k, C ,j k

1n

k∈C c

n k

j k =1

11 + cN eC c

k ,j k

R k ,C ,j k + σ2 I

−1

.

Although these expressions are all the more general under our modelassumptions, simplications are required to obtain tractable expressions. Wewill consider two situations. First, that of a two-cell network with differentinterference powers, and second that of a large network with mono-antennadevices. The former allows us to discuss the interest of cooperation in a simpletwo-cell network as a function of the different channel conditions affectingcapacity: overall interference level, number of available antennas, transmit andreceive correlation, etc. The latter allows us to determine how many base stationsare required to be connected in order for the overall system capacity not tobe dramatically impaired. On top of these discussions, the problem of optimalchannel state information feedback time can also be considered when the channelcoherence time varies. It is then possible, for practical networks, to have a roughoverview on the optimal number of base stations to inter-connect under mobilityconstraints. This last point is not presented here.

15.1.1 Two-cell network

In this rst example, we consider a two-cell network for which we provide thetheoretical sum rate capacities achieved when:

• both cells cooperate in the sense that they proceed to joint decoding of theusers operating at the same frequency;

• cells do not cooperate and are therefore interference limited.

When base stations do not cooperate, it is a classical assumption that thechannels from user k to base station k, k ∈ 1, 2, are not shared among the

adjacent cells. An optimal power allocation policy can therefore not be clearlydened. However, in the cooperative case, power allocation can be clearlyperformed, under either sum power constraint or individual power constraintfor each user. We will consider the latter, more realistic, choice. Also, we makethe strong necessary assumption that, at the transmitter side, the effective energyis transmitted isotropically, i.e. there is no privileged direction of propagation.This assumption is in fact less restrictive when the mobile transmitter is locatedin a very scattered environment, such as in the house or in the street while thereceive base station is located far away. This assumption would however not holdfor base stations, which are typically located in a poorly scattered environment.Following Jakes’ model for the transmit correlation T k,i matrices, we willtherefore assume that T k, 1 = T k, 2 T k , k ∈ 1, 2. We are now in the situationof the previous study for K = 1, 2. The subset C

⊂K of simultaneously decoded

Page 398: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 398/562

374 15. Performance of multi-cellular and relay networks

users is alternatively 1 and 2 in the non-cooperative case, and 1, 2 in thecooperative case.

The real difficulty lies in determining the optimal power allocation policy forthe cooperative case. Denoting P k ∈C n k ×n k this optimal power allocation policyfor user k in the cooperative case, we have in fact trivially that P k = U k Q k U H

kwith U k the eigenvector basis of T k , and Q k = diag( q k, 1 , . . . , q k,n k

), with q k,igiven by:

q k,i = µk − 1

1 + cN eK k,j k

+

where µk is set to satisfy 1n kn ki =1 q k,i = P k , P k being the power allowed for

transmission to user k, and eK k,j kis the solution to the xed-point Equation

(15.1) in eKk,j k

, with eKk,j k

replaced by eKk,j k

pk,j (present now in the expressionof R k, K ,j ). The only difference compared to the multi-user mono-antenna MACmodel is that multiple power constraints are imposed. This is nonetheless nothard to solve as the maximization is equivalent to the maximization of

k∈

K

n k

j k =1

log2 1 + cN eKk,j k pk,j .

As usual, the entries of P k can be determined from an appropriate iterativewater-lling algorithm. In the following, we provide performance results forthe two-cell network. In Figure 15.1 and Figure 15.2, we consider the caseof two users, equipped with four antennas each, and two base stations, alsoequipped with four antennas each. In Figure 15.1, both transmit and receivesides are loosely correlated, i.e. the inter-antenna spacing is large and isotropictransmissions are assumed. In Figure 15.2, the same scenario is considered,

although, at the transmitting users, strong signal correlation is present (thedistance between successive antennas is taken to be half the wavelength). Weobserve that strong correlation at the transmitter side reduces the achievableoptimal sum rate, although in this case optimal power allocation policy helpsto recovere part of the lost capacity, notably for low SNR regimes. However, forstrongly correlated transmitters, it turns out that the performance of single-userdecoding in every cell is not strongly affected and is even benecial for highSNR. This is mostly due to the uniform power allocation policy adopted in theinterference limited case, which could be greatly improved.

In the following, we address the problem of the uplink communication in a largedimensional network, somewhat following the Wyner model [Wyner , 1994] withone user per cell and only one transmitting antenna on either communicationdevice.

Page 399: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 399/562

15.1. Performance of multi-cell networks 375

−15 −10 −5 0 5 10 15 200

10

20

30

40

50

SNR [dB]

A c h i e v a b l e r a t e

Coop., uni.Coop., opt.Non-coop.

Figure 15.1 Sum rate capacity of two-cell network with two users per cell. Comparisonbetween cooperative MAC scenario (coop.) for uniform (uni.) and optimal (opt.)power allocation, and interference limited scenario (non-coop.). Loosely correlatedsignals at both communication ends. Interference power 0 .5.

−15 −10 −5 0 5 10 15 200

10

20

30

40

SNR [dB]

A c h i e v a b l e r a t e

Coop., uni.Coop., opt.Non-coop.

Figure 15.2 Sum rate capacity of two-cell network with two users per cell. Comparisonbetween cooperative MAC scenario (coop.) for uniform (uni.) and optimal (opt.)power allocation, and interference limited scenario (non-coop.). Loosely correlatedsignals at the receiver end, strongly correlated signals at the transmission end.Interference power 0 .5.

Page 400: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 400/562

376 15. Performance of multi-cellular and relay networks

Figure 15.3 Wyner model of a three-cell network, with cooperation via a centralbackbone. Every cell site contains a single-user per spectral resource.

15.1.2 Wyner model

This section follows the ideas of, e.g., [Hoydis et al., 2011d; Levy and Shamai,2009; Somekh et al., 2004, 2007]. We consider a K -cell network with one userper cell (again, this describes the case when a single resource is analyzed) andone antenna per user. We further assume that the base stations of all cells areconnected via an innite capacity backbone, so that inter-cell communicationsare possible at a high transmission rate. The link hik between user k and basestation i is assumed to be random Gaussian with zero mean and variancea ik /K , with aik the long-term path loss exponent. This scenario is depictedin Figure 15.3. Denoting A

∈C K

×K the matrix with ( i, k )th entry a

ik and

H = [h 1 , . . . , h K ] ∈C K ×K the matrix with ( i, k )th entry hik , the achievable sumrate per cell site C under perfect channel state information at the connectedbase stations, unit transmit power for all users, additive white Gaussian noise of variance σ2 , is

C (σ2) = 1K

log2 det I K + 1σ2

K

k=1

h k h H

k

= 1K

log2 det I K + 1σ2 HH H .

This is a classical expression with H having a variance prole, for which adeterministic equivalent is obtained from Theorem 6.12 for instance.

C (σ2) −2K

K

k =1

log2 (1 + ek ) −log2(e) 1

σ2K 2

K

i,k =1

ak,i

(1 + ei )(1 + ek )a .s.

−→ 0

where e1 , . . . , e K are the only all positive solutions to

ei = 1σ2

1K

K

k =1

ak,i

1 + ek. (15.2)

Page 401: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 401/562

15.1. Performance of multi-cell networks 377

We consider the (admittedly unrealistic) situation when the cells are uniformlydistributed on a linear array, following the Wyner model [Wyner , 1994]. The cellsare numbered in order, following the linear structure. The path loss from user kto cell n has variance α |n −k |, for some xed parameter α. As such, A is explicitlygiven by:

A =

1 α α 2 · · · αK −1

α 1 α · · · αK −2

... . . . . . . . . .

...α K −2 · · · α 1 αα K −1 · · · α2 α 1

.

In the case where α is much smaller than one, we can approximate the Toeplitzmatrix A by a matrix with all terms α n replaced by zeros, for all n greater thansome integer L, of order O(K ) but strictly less than K/ 2. In this case, theresulting Toeplitz matrix is a band matrix, which in turn can be approximatedfor large dimensions by an equivalent circulant matrix A , whose entries are thesame as A in the 2L main diagonals, and which is made circulant by lling theupper right and lower left matrix corners accordingly. This must nonethelessbe performed with extreme care, following the conditions of Szeg¨ o’s theorem,Theorem 12.1. The matrix A , in addition to being circulant, has the property tohave all sums in rows and columns equal. This matrix is referred to as a doubly

regular variance prole matrix. In this particular case, it is shown in [Tulino andVerd u, 2005] that the asymptotic eigenvalue distribution of the matrix

11K

K k=1 aki

HH H

with aki the entry ( k, i ) of A , is the Marcenko–Pastur law. To observe this fact,it suffices to notice that the system of Equations ( 15.2), with a ij replaced by a ij ,is solved for e1 = . . . = eK = f , with f dened as the only Stieltjes transformsolution to

f = 1σ2

1K

K

k=1

ak,i

1 + f

for all i. This is easily veried as K k=1 ak,i is constant irrespective of the choice

of the row i. Now, f is also the solution of ( 15.2) if all aki were taken to beidentical, all equal to 1

K K k=1 aki for any i, in which case H would be a matrix

with i.i.d. entries of zero mean and variance 1K

K k=1 aki . Hence the result.

We therefore nally have, analogously to the uncorrelated antenna case in

Chapter 13, that C can be approximated explicitly as

C (σ2) − log2 1 + δ + sσ2 + log 2(e)

σ2

s δ −1 a .s.

−→ 0

Page 402: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 402/562

378 15. Performance of multi-cellular and relay networks

−5 0 5 10 15 20 25 300

1

2

3

4

SNR [dB]

A c h i e v a b l e r a t e

α = 0 .5α = 0 .2α = 0 .1

Figure 15.4 Multi-cell site capacity performance for K = 64 cell sites distributed on alinear array. Path loss decay parameter α = 0 .5, α = 0 .2, α = 0 .1.

where δ is given by:

δ = 12

1 +

4sσ2 −1

and where we denoted s 1K

K k=1 ak1 .

In Figure 15.4, we provide the deterministic equivalents for the capacity C fordifferent values of α , and for K = 64. The observed gain in capacity achieved byinter-cell cooperation when cells are rather close encourages a shift towards inter-cell cooperation, although in this particular example we did not consider at all theinherent problems of such communications: base station synchronization, cost of simultaneous data decoding, and more importantly feedback cost necessary forthe joint base station to learn the various (possibly fast varying) communication

channels.In the next section, we turn to the analysis of multi-hop multiple antennacommunications assuming the number of antennas per relay is large.

15.2 Multi-hop communications

Relay communications are often seen as a practical solution to extend thecoverage of multi-cellular networks to remote areas. In this context, the maininterest is focused on two-hop relaying, where the relays are forwarding directlythe information from the source to the receiver. In the simple scenario where therelay only amplies and forwards the data of the source, the receiver capturesidentical signals, with possibly some delay, so that the overall system can be

Page 403: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 403/562

15.2. Multi-hop communications 379

roughly modeled as a multiple antenna channel. However, information theoryrequires that more intelligent processing be done at the relay than enhancing andforwarding both received signal and noise. In particular, decoding, re-encoding,and forwarding the information at the relay is a better approach in termsof achievable rate, but is much more difficult to analyze. This becomes evenmore difficult as the number of simultaneously transmitting relays increases.Simple models are therefore often called for when studying these scenarios. Largedimensional random matrix theory is particularly helpful here. Increasing thenumber of relays in the multi-hop model has different applications, especially forad-hoc networks, e.g. intended for military eld operations.

We will study in this section the scenario of multiple hops in a relay networkassuming a linear array composed of a source, K −1 relays, and a destination.

Each hop between relay pairs or source to rst relay will be assumed noise-free forsimplicity, while the last hop will experience additive noise. Each communicationentity will be assumed to be equipped with a large number of antennas and toreceive signals only from the last backward relay or source; we therefore assumethe simple scenario where distant relays do not interfere with each other. Ateach hop, the communication strategy is to re-encode the receive data by alinear precoder in a hybrid manner between the amplify and forward and thedecode and forward strategies. The study of the achievable rates in this setup willtherefore naturally call for the analysis of successive independent channel matrixproducts. Therefore, tools such as the S -transform, and in particular Theorem

4.7, will be extremely useful here.The following relies heavily on the work of M¨uller [Muller, 2002] on the capacity

of large dimensional product channels and on the work of Fawaz et al. [Fawazet al. , 2011].

15.2.1 Multi-hop model

Consider a multi-hop relaying system with a source, K −1 relays, and adestination. The source is equipped with N 0 antennas, the destination with

N K antennas and the kth relay level with N k antennas. We assume that thenoise power is negligible at all relays, while at the destination, at all times l, anadditive Gaussian noise vector w ( l)

∈C N K with zero mean and covariance σ2 I N K

is received. In effect, the simplifying assumption of noise-free relays is made tohave a white aggregate noise at the destination and consequently more tractablederivations. Note that several works have implicitly used a similar noise-free relayassumption by assuming that the noise at the destination of a multiple antennamulti-hop relay network is white. In [Yang and Belore, 2008 ], the authors provethat in an amplify and forward multi-hop relay system the resulting colorednoise at the destination can be well approximated by white noise in the highSNR regime. In terms of practical relevance, the mutual information expressionderived in the case of noise-free relays can be seen as an upper-bound for thecase of noisy relays.

Page 404: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 404/562

Page 405: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 405/562

15.2. Multi-hop communications 381

Figure 15.5 Linear relay network.

and Medard , 2010; Maric and Yates, 2010 ]. On the contrary, in the high SNRregime, linear precoding techniques such as amplify and forward perform well[Borade et al., 2007; Maric et al., 2010]. Finally, from a practical point of view, limited channel knowledge and simple linear precoding techniques atrelays are particularly relevant for systems where relays have limited processingcapabilities.

The signal received at the destination at time l is given by:

y ( l)K = H K P K −1H K −1P K −2 . . . H 2P 1H 1P 0y ( l−K )

0 + w ( l)

= G K y( l−K )0 + w ( l)

where the end-to-end equivalent channel G K is given by:

G K H K P K −1H K −1P K −2 . . . H 2P 1H 1P 0

= R12K X K T

12K P K −1R

12K −1X K −1T

12K −1P K −2 . . . R

122 X 2T

122 P 1R

121 X 1T

121 P 0 .

For clarity in what follows, let us introduce the notations

M 0 = T121 P 0

M k = T12k+1 P k R

12k , k ∈ 1, . . . , K −1

M K = R12K .

Then G K can be rewritten as

G K = M K X K M K −1X K −1 . . . M 2X 2M 1X 1M 0 .

In what follows, we will always assume that the destination knows perfectlythe channel G K at all times in order to decode the source data. We mayadditionally assume that the source and the relays have statistical channel state

Page 406: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 406/562

382 15. Performance of multi-cellular and relay networks

information about the backward and forward channels, i.e. relay k knows thereceive correlation matrix R k and the transmit correlation matrix T k+1 .

15.2.2 Mutual information

Consider the channel realization G K in one channel coherence block. Under theassumption that the destination knows G K perfectly, the instantaneous end-to-end mutual information between the channel input y 0 and channel output(y K , G K ) in this channel coherence block is the same as the mutual informationof a multiple antenna channel given by Telatar [Telatar, 1999] as in Chapter 13by

log det I N K

+ 1

σ2G K G

H

K .

The end-to-end mutual information averaged over multiple channel realizationsis in turn given by:

E(X k ) log det I N K + 1σ2 G K G

HK

where the expectation is taken over the joint realization of the K −1 randommatrices X k , k ∈ 1, . . . , K −1.

As in Chapter 13, we will not be able to optimize the instantaneous mutualinformation, but will rather focus on optimizing the average mutual information,when the different relays have at least statistical information about the channelsH k . Under adequate assumptions on the various channel state information knownat the relays, we will therefore try to nd the precoders P k that maximize theend-to-end mutual information subject to power constraints ( 15.3). This will giveus the end-to-end average mutual information

C (σ2) supP k 1

N ktr(E[ s k s H

k ])≤P k

E(X k ) log det I N K + 1σ2 G K G

HK .

Note that the average mutual information above does not necessarily representthe channel capacity in the Shannon sense here. In the next section, we will derivea limiting result for the mutual information using tools from free probabilitytheory.

15.2.3 Large dimensional analysis

In this section, we consider the instantaneous mutual information per sourceantenna between the source and the destination and derive its asymptotic valueas N 0 , N 1 , . . . , N K grow large at a similar rate. We obtain the following result.

Theorem 15.1. For the system described above, and under the assumption that the destination knows G K at all times, that N k

N K → ck , 0 < ck < ∞, and M Hk M k

Page 407: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 407/562

15.2. Multi-hop communications 383

has an almost sure l.s.d. F k and has uniformly bounded spectral norm along growing N k , for all k, then the normalized (per source antenna) instantaneous mutual information

I 1N 0

log det I N K + 1σ2 G K G

H

K

converges almost surely to

I ∞ =K

k=0

ci

c0 log 1 + 1σ2

ak +1

ckhK

k t dF k (t) −K σ2

c0

K

k=0

hk

where aK +1 = 1 by convention and h0 , h1 , . . . , h K are the solutions of

K

j =0

h j = ak+1 hK k t

1 + 1σ 2

a k +1ck

hK k t

dF k (t).

We give hereafter the main steps of the proof.

Proof. The proof of Theorem 15.1 consists of the following steps:

• Step 1. We rst obtain an expression for the limiting S -transform S G (z)of G K G

HK using the fact that the matrix G K is composed of products of

asymptotically almost everywhere free matrices.

• Step 2. We then use S G (z) to determine the limiting ψ-transform ψG (z) of G K G

HK , which we recall is closely related to the Stieltjes transform of G K G

HK ,

see Denition 3.6.

• Step 3. We nally use the relation between the ψ-transform and the Shannontransform to complete the derivation.

Step 1.We show rst the following result. As all N k → ∞ with the same rate, the S -transform of G K G

HK converges almost surely to S G (z), given by:

S G (z) = S F K (z)K

k=1

ck−1

ak

1(z + ck−1)

S F k −1

zck−1

. (15.4)

The proof is done by induction. First, we prove the case K = 1. Note that

G 1G H

1 = M 1X 1M 0M H

0 X H

1 M H

1

and therefore, denoting systematically S ∞Z (z) the limiting almost sure S -transform of the random Hermitian matrix Z as the dimensions grow to innity

S G (z) = S ∞X 1 M 0 MH0 X

H1 M

H1 M 1 (z)

thanks to the S -transform matrix exchange identity, Lemma 4.3. Then, from theasymptotic freeness almost everywhere of Wishart matrices and deterministic

Page 408: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 408/562

384 15. Performance of multi-cellular and relay networks

matrices, Theorem 4.5, and the S -transform product relation, Theorem 4.7, wefurther have

S G (z) = S ∞X 1 M 0 M H0 X H

1 (z)S F 1 (z).

Using again Lemma 4.3, we exchange the order in matrix X 1M 0M H0 X H

1 to obtain

S G (z) = z + 1z + N 0

N 1

S ∞M 0 M H0 X H

1 X 1z

N 1N 0

S F 1 (z).

A second application of Theorem 4.7 gives

S G (z) = z + 1z + N 0

N 1 S F 0 zN 1N 0 S ∞X H1 X 1 z

N 1N 0 S F 1 (z)

where we recognize the S -transform of (a scaled version of) the Marcenko–Pasturlaw. Applying both Lemma 4.2 and Theorem 4.8, we obtain

S G (z) = z + 1z + N 0

N 1

S F 0 zN 1N 0

1a1

1z N 1

N 0 + N 1N 0

S F 1 (z)

and nally

S G (z) = S F 1 (z) c0

a1

1z + c0

S F 0 zc0

which proves the case K = 1.Now, we need to prove that, if the result holds for K = q , it also holds for

K = q + 1. Note that

G q+1 G H

q+1 = M q+1 X q+1 M qX q . . . M 1X 1M 0M H

0 X H

1 M H

1 . . . X H

q M H

q X H

q+1 M H

q+1 .

Therefore

S G q +1 G Hq +1

(z) = S X q +1 M q ... M Hq X H

q +1 M Hq +1 M q +1 (z).

We use once more the same approach and theorems as above to obtainsuccessively, with K = q + 1

S G (z) = S ∞X q +1 ... X Hq +1

(z)S M Hq +1 M q +1 (z)

= z + 1z + N q

N q +1

S ∞M q ... M Hq X H

q +1 X q +1z

N q+1

N qS F q +1 (z)

= z + 1z + N q

N q +1

S F q zN q+1

N qS ∞X H

q +1 X q +1z

N q+1

N qS F q +1 (z).

Page 409: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 409/562

15.2. Multi-hop communications 385

As above, developing the expression of the S -transform of the Marcenko–Pastur law, we then obtain

S G (z) = z + 1z + N q

N q +1

S F q z N q+1

N qq

i =1

1a i

N i−1

N q1

z N q +1N q + N i −1

N q

S F i −1 z N q+1

N i−1

× 1aq+1

1N q +1

N q + z N q +1N q

S F q +1 (z)

which further simplies as

S G (z) = z + 1z + N q

N q +1

S F q +1 (z) 1aq+1

N qN q+1

1z + 1

S F q zN q+1

N q

×q

i =1

N i −1N q +1

a i

1z + N i −1

N q +1

S F i −1 zN q+1

N i−1

= S F q +1 (z)q+1

i =1

1a i

N i−1

N q+1

1z + N i −1

N q +1

S F i −1 zN q+1

N i−1

= S F q +1 (z)q+1

i =1

ci−1

a i

1(z + ci−1)

S F i −1

zci−1

which is the intended result.

Step 2.We now prove the following. Denoting aK +1 = 1, for s ∈C \ R + , we have:

s(ψG (s))K =K

i=0

ci

a i +1ψ−1

F iψG (s)

ci. (15.5)

This unfolds rst from ( 15.4), by multiplying each side by z/ (z + 1)

zz + 1

S G (z) = zz + 1

S F K (z)K

k=1

ck−1

ak

1(z + ck−1)

S F k −1 zck−1

where the last right-hand side term can be rewritten

S F k −1

zck−1

=1 + z

ck −1z

ck −1

zck −1

1 + zck −1

S F k −1

zck−1

.

Using the free probability denition of the S -transform in relation to the ψ-transform, Theorem 4.3, we obtain

ψ−1G (z) =

1zK ψ−1

F K (z)K

i =1

ci−1

a iψ−1

F i −1

zci−1

Page 410: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 410/562

386 15. Performance of multi-cellular and relay networks

or equivalently

ψ−1G (z) =

1

zK

K

i =0

ci

a i +1ψ−1

F iz

ci.

Substituting z = ψG (s) gives the result.

Step 3.We subsequently show that, as N 0 , N 1 , . . . , N K go to innity, the derivative of the almost surely limiting instantaneous mutual information I ∞ is given by:

dI ∞d(σ−2)

= 1c0

K

i=0h i

where h0 , h1 , . . . , h K are the solutions to the followingK

j =0

h j = ci hK i t

cia i +1 + 1

σ 2 hK i t

dF i (t).

First, we note that

I = 1N 0

log det I N K + 1σ2 G K G

H

K a .s.

−→ 1c0 log 1 +

1σ2 t dG(t)

the convergence result being valid here because the largest eigenvalue of G K GH

K is almost surely bounded as the system dimensions grow large. This is due to thefact that the deterministic matrices M k M H

k are uniformly bounded in spectralnorm and that the largest eigenvalue of X k X H

k is almost surely bounded forall N k by Theorem 7.1; therefore, the largest eigenvalue of G K G

HK , which is

smaller than or equal to the product of all these largest eigenvalues is almostsurely uniformly bounded. The dominated convergence theorem, Theorem 6.3,then ensures the convergence.

Now, we also have the relation

dI ∞d(σ−2) =

1c0

t1 + 1

σ 2 t dF G (t)

= −σ2

c0ψG (−σ−2). (15.6)

Let us denote

τ = ψG (−σ−2)

gi = ψ−1F i

tci

.

From ( 15.6), we have:

τ = −c0

σ2dI ∞

d(σ−2).

Page 411: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 411/562

15.2. Multi-hop communications 387

Substituting s = −σ−2 in (15.5) and using τ and gi above, it follows that

−τ K

σ2 =

K

i =0

ci

a i +1 gi .

Using the denition of the ψ-transform, this is nally

τ = ci gi t1 −gi t

dF i (t).

These last three equations together give

− 1σ2

K +1

c0dI

∞d(σ−2)

K

=

K

i=0

cia i +1

gi

and

− 1σ2 c0

dI ∞d(σ−2)

= ci gi t1 −gi t

dF i (t).

Dening now

h i cia i +1

1K

−gi σ21

K

we have

c0dI ∞

d(σ−2) =

K

i =0h i . (15.7)

Using again these last three equations, we obtain

− 1σ2

K

j =0

h j = ci − 1σ 2 hK i a i +1c i t1 −(− 1

σ 2 )hK i

a i +1c i

tdF i (t)

or equivalently

N

j =0

h j = ci hK i t

cia i +1

+ 1σ 2 hK

i tdF i (t).

This, along with Equation ( 15.7), gives the result.We nally need to prove that the nal result is indeed a pre-derivative of

dI ∞/dσ −2 that also veries lim σ 2 →∞I ∞ = 0, i.e. the mutual information is zerofor null SNR. This unfolds, as usual, by differentiating the nal result. In

Page 412: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 412/562

388 15. Performance of multi-cellular and relay networks

particular, we successively obtain

K

i =0ci

t hK

i + K σ 2 hK −1

i hic i

a i +1(1 + a i +1

σ 2 cihK

i t) dF i (t) −K

K

i =0h i −

K σ2

K

i =0h i

j = ih j

=K

i =0ci thK

i dF i (t)ci

a i +1+ 1

σ 2 hK i t

+ K σ2

K

i =0

h i

h ici thK

i dF i (t)ci

a i +1+ 1

σ 2 hK i t −K

K

i =0h i

− K σ2

K

i =0

h i

h i

K

j =0

h j

=K

i =0

K

j =0h j +

K

σ2

K

i=0

h i

h i

K

j =0h j

−K

K

i=0h i

− K

σ2

K

i =0

h i

h i

K

j =0h j

= ( K + 1)K

j =0

h j −K K

j =0

h j

=K

j =0

h j

where hi dh id(σ −2 ) . This completes the proof.

Theorem 15.1 holds as usual for any arbitrary set of precoding matrices P k ,k ∈ 0, . . . , K −1 such that M H

k M k has uniformly bounded spectral norm.

15.2.4 Optimal transmission strategy

In this section, we analyze the optimal linear precoding strategies P k , k ∈0, . . . , K −1, at the source and the relays that allow us to maximize the averagemutual information. We characterize the optimal transmit directions determinedby the singular vectors of the precoding matrices at source and relays, for asystem with nite dimensions N 0 , . . . , N K before considering large dimensionallimits.

The main result of this section is given by the following theorem.

Theorem 15.2. For each i ∈ 1, . . . , K , denote T i = U t,i Λ t,i U Ht,i and R i =

U r,i Λ r,i U Hr,i the spectral decompositions of the correlation matrices T i and

R i , where U t,i and U r,i are unitary and Λ t,i and Λ r,i are diagonal, with their respective eigenvalues ordered in non-increasing order. Then, assuming the destination knows G K at all times and that the source and intermediary relays have local statistical information about the backward and forward channels, the optimal linear precoding matrices that maximize the average mutual information

Page 413: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 413/562

15.2. Multi-hop communications 389

under power constraints (15.3) can be written as

P 0 = U t, 1Λ P 0

P i = U t,i +1 Λ P i U Hr,i , i ∈ 1, . . . , K −1where ΛP i are diagonal matrices with non-negative real diagonal elements.

We do not provide here the proof of Theorem 15.2 that can be found in [Fawazet al. , 2011, Appendix C]. Theorem 15.2 indicates that the power maximizationcan then be divided into two phases: the alignment of the eigenvectors on theone hand and the search for the optimal eigenvalues (entries of Λ t,i and Λ r,i ) onthe other hand.

We now apply the above result to two specic multi-hop communicationscenarios. In these scenarios, a multi-hop multiple antenna system as aboveis considered and the asymptotic mutual information is developed in theuncorrelated and exponential correlation cases, respectively.

15.2.4.1 Uncorrelated multi-hop MIMOIn this example, we consider an uncorrelated multi-hop MIMO system, i.e. allcorrelation matrices are equal to the identity matrix.

Before analyzing this scenario, we mention the following nite dimensional

result, which can be proved rather easily by induction on i.

tr(E[ s i sHi ]) = a i tr( P i R i P

Hi )

i−1

k=0

ak

N ktr( T k+1 P k P H

k ). (15.8)

By Theorem 15.2, in the uncorrelated case ( R k and T k taken identity) theoptimal precoding matrices should be diagonal. Assuming equal power allocationat source and relays, the precoding matrices are of the form P k = αk I N k ,where αk is real positive and chosen to satisfy the power constraints. Usingthe expression (15.8), it can be shown by induction on k that the coefficients α k

in the uncorrelated case are necessarily given by:

α 0 = P 0

α i = P ia i P i−1

, i ∈ 1, . . . , K −1α K = 1 . (15.9)

Then the asymptotic mutual information for the uncorrelated multi-hopMIMO system with equal power allocation is given by:

I ∞ =K

i =0

ci

c0log 1 +

hK i ai+1 α 2

i

σ2ci −K σ2

c0

K

i =0

h i (15.10)

Page 414: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 414/562

390 15. Performance of multi-cellular and relay networks

where h0 , h 1 , . . . , h K are the solutions of the system of K + 1 multivariatepolynomial equations

K

j =0h j = h

K i α

2i a i +1

1 + h Ki a i +1 α 2

iσ 2 c i

.

15.2.4.2 Exponentially correlated multi-hop MIMOIn this second example, the asymptotic mutual information is developed in thecase of exponential correlation matrices and precoding matrices with singularvectors as in Theorem 15.2. Similar to Jakes’ model, exponential correlationmatrices are a common model of correlation, particularly for uniform linearantenna array, see, e.g., [Loyka, 2001; Martin and Ottersten, 2004; Oestges et al.,

2008].We assume that the relay at level i is equipped with a uniform linear antennaarray of length Li , characterized by its antenna spacing li = Li /N i and itscharacteristic distances ∆ t,i and ∆ r,i proportional to transmit and receive spatialcoherences, respectively. Then the receive and transmit correlation matrices atrelaying level i can, respectively, be modeled by the following Hermitian Toeplitzmatrices

R_i = \begin{pmatrix}
1 & r_i & r_i^2 & \cdots & r_i^{N_i-1}\\
r_i & 1 & \ddots & \ddots & \vdots\\
r_i^2 & \ddots & \ddots & \ddots & r_i^2\\
\vdots & \ddots & \ddots & 1 & r_i\\
r_i^{N_i-1} & \cdots & r_i^2 & r_i & 1
\end{pmatrix}

and

T_{i+1} = \begin{pmatrix}
1 & t_{i+1} & t_{i+1}^2 & \cdots & t_{i+1}^{N_i-1}\\
t_{i+1} & 1 & \ddots & \ddots & \vdots\\
t_{i+1}^2 & \ddots & \ddots & \ddots & t_{i+1}^2\\
\vdots & \ddots & \ddots & 1 & t_{i+1}\\
t_{i+1}^{N_i-1} & \cdots & t_{i+1}^2 & t_{i+1} & 1
\end{pmatrix}

where the antenna correlations at the receive and transmit sides read r_i = e^{-l_i/\Delta_{r,i}} and t_{i+1} = e^{-l_i/\Delta_{t,i}}, respectively. It can be verified that these Toeplitz matrices are of Wiener class thanks to Szegő's theorem, Theorem 12.1. We therefore also know from Theorem 12.1 the limiting eigenvalue distribution of those deterministic matrices as their dimensions grow large. We further assume here equal power allocation over the optimal directions, i.e. the singular values of P_i are chosen to be all equal: Λ_{P_i} = α_i I_{N_i}, where α_i is real positive and chosen to satisfy the power constraint (15.3). Equal power allocation may not be the optimal power allocation scheme, but it is considered in this example for simplicity.
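As a quick illustration of this correlation model, the following Python sketch (the array dimensions and coherence distances are hypothetical values, not taken from the book; NumPy is assumed available) builds the exponential correlation matrices R_i and T_{i+1} from the antenna spacing and coherence distances, and computes the quantities (1−r)/(1+r) that will appear in the asymptotic mutual information later in this section.

```python
import numpy as np

def exp_corr_matrix(rho_coef, n):
    """Exponential (Hermitian Toeplitz) correlation matrix with entries rho_coef**|i-j|."""
    idx = np.arange(n)
    return rho_coef ** np.abs(idx[:, None] - idx[None, :])

# Hypothetical relay parameters: N_i antennas on a linear array of length L_i.
N_i, L_i = 16, 4.0
l_i = L_i / N_i                      # antenna spacing
Delta_r, Delta_t = 2.0, 1.0          # receive/transmit coherence distances (assumed)

r_i = np.exp(-l_i / Delta_r)         # receive correlation coefficient
t_ip1 = np.exp(-l_i / Delta_t)       # transmit correlation coefficient

R_i = exp_corr_matrix(r_i, N_i)      # receive correlation matrix at level i
T_ip1 = exp_corr_matrix(t_ip1, N_i)  # transmit correlation matrix at level i

# Quantities (1 - r)/(1 + r) used in the asymptotic expressions below
rho_r = (1 - r_i) / (1 + r_i)
rho_t = (1 - t_ip1) / (1 + t_ip1)
print(rho_r, rho_t, np.linalg.eigvalsh(R_i).min())
```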


Using the power constraint expression for general correlation models (15.8), and considering precoding matrices P_i = U_{r,i}^H(α_i I_{N_i})U_{t,i+1}, with U_{r,i} unitary such that R_i = U_{r,i}Λ_{r,i}U_{r,i}^H with Λ_{r,i} diagonal, and U_{t,i} such that T_i = U_{t,i}Λ_{t,i}U_{t,i}^H with Λ_{t,i} diagonal, following the conditions of Theorem 15.2 with equal singular values α_i, we can show by induction on i that the coefficients α_i respecting the power constraints for any correlation model are now given by:

\alpha_0 = \sqrt{P_0}
\alpha_i = \sqrt{\frac{P_i}{a_i P_{i-1}}\,\frac{\operatorname{tr}(\Lambda_{r,i-1})}{\operatorname{tr}(\Lambda_{r,i})}\,\frac{N_i}{\operatorname{tr}(\Lambda_{t,i}\Lambda_{r,i-1})}},\qquad i \in \{1,\dots,K-1\}
\alpha_K = 1.

Applying the exponential correlation model to the above relations and making the dimensions of the system grow large, it can be shown that, in the asymptotic regime, the α_i respecting the power constraint for the exponentially correlated system converge to the same value, given in (15.9), as for the uncorrelated system. It can then be shown that the asymptotic mutual information in this scenario is given by:

I_\infty = \sum_{i=0}^{K}\frac{c_i}{c_0\pi^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(t,u;\mathbf{h})\,dt\,du - \frac{K\sigma^2}{c_0}\prod_{i=0}^{K} h_i

where, denoting h = (h_0, ..., h_K),

g(t,u;\mathbf{h}) = \frac{1}{1+t^2}\,\frac{1}{1+u^2}\,\log\left(1 + \rho_{r,i}\rho_{t,i+1}\,\frac{h_i^K a_{i+1}\alpha_i^2}{\sigma^2 c_i}\,\frac{(1+t^2)(1+u^2)}{(\rho_{r,i}^2+t^2)(\rho_{t,i+1}^2+u^2)}\right)

(the integrand implicitly depends on the summation index i)

and h_0, h_1, ..., h_K are the solutions of the system

\prod_{j=0}^{K} h_j = \frac{2}{\pi}\;\frac{\dfrac{h_i^K a_{i+1}\alpha_i^2}{\sigma^2 c_i}}{\sqrt{\left(\rho_{r,i}\rho_{t,i+1} + \dfrac{h_i^K a_{i+1}\alpha_i^2}{\sigma^2 c_i}\right)\left(\dfrac{1}{\rho_{r,i}\rho_{t,i+1}} + \dfrac{h_i^K a_{i+1}\alpha_i^2}{\sigma^2 c_i}\right)}}\; F\!\left(\frac{\pi}{2},\sqrt{m_i}\right)

with F(θ, x) the incomplete elliptic integral of the first kind, given by:

F(\theta, x) = \int_0^{\theta}\frac{1}{\sqrt{1 - x^2\sin^2(t)}}\,dt
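For numerical evaluation, note that SciPy's `ellipkinc` uses the parameter m = x² rather than the modulus x used in the definition above; a small sketch (the value of m_i is purely illustrative) of how F(θ, x), and in particular F(π/2, √m_i), can be computed is given below.

```python
import numpy as np
from scipy.special import ellipkinc, ellipk

def F(theta, x):
    """Incomplete elliptic integral of the first kind as defined in the text:
    F(theta, x) = int_0^theta dt / sqrt(1 - x^2 sin^2 t).
    SciPy's ellipkinc(phi, m) uses the *parameter* m = x^2, not the modulus x."""
    return ellipkinc(theta, x ** 2)

# Example with a hypothetical value of m_i in (0, 1)
m_i = 0.3
val_incomplete = F(np.pi / 2, np.sqrt(m_i))  # F(pi/2, sqrt(m_i)) as used in the system above
val_complete = ellipk(m_i)                   # complete integral K(m_i), the same quantity
print(val_incomplete, val_complete)          # the two agree
```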

and, for all i ∈ {0, ..., K},

\rho_{r,i} = \frac{1-r_i}{1+r_i},\qquad \rho_{t,i+1} = \frac{1-t_{i+1}}{1+t_{i+1}}

m_i = 1 - \frac{\left(\dfrac{\rho_{t,i+1}}{\rho_{r,i}} + \dfrac{h_i^K a_{i+1}\alpha_i^2}{\sigma^2 c_i}\right)\left(\dfrac{\rho_{r,i}}{\rho_{t,i+1}} + \dfrac{h_i^K a_{i+1}\alpha_i^2}{\sigma^2 c_i}\right)}{\left(\dfrac{1}{\rho_{r,i}\rho_{t,i+1}} + \dfrac{h_i^K a_{i+1}\alpha_i^2}{\sigma^2 c_i}\right)\left(\rho_{r,i}\rho_{t,i+1} + \dfrac{h_i^K a_{i+1}\alpha_i^2}{\sigma^2 c_i}\right)}


with the convention r_0 = t_{K+1} = 0. The details of these results can be found in [Fawaz et al., 2011].

This concludes this chapter on the performance of multi-cellular and relay communication systems. In the subsequent chapters, the signal processing problems of source detection, separation, and statistical inference for large dimensional systems are addressed.


16 Detection

In this chapter, we now address a quite different problem from the performance evaluation of data transmissions in large dimensional communication channel models. The present chapter, along with Chapter 17, deals with practical signal processing techniques to solve problems involving (possibly large dimensional) random matrix models. Specifically, in this chapter we will first address the question of signal sensing using multi-dimensional sensor arrays.

16.1 Cognitive radios and sensor networks

A renewed motivation for large dimensional signal sensing has been recently triggered by the cognitive radio incentive, which, according to some, may be thought of as the next information-theoretic revolution after the original work of Shannon [Shannon, 1948] and the introduction of multiple antenna systems by Foschini [Foschini and Gans, 1998] and Telatar [Telatar, 1999]. In addition to the theoretical expression of the point-to-point noisy channel capacity in [Shannon, 1948], Shannon made us realize that, in order to achieve high rates of information transfer, increasing the transmission bandwidth is largely preferred over increasing the transmission power. Therefore, to ensure high rate communications with a finite power budget, we have to consider frequency multiplexing. This constituted the first and most important revolution in modern telecommunications and most notably wireless communications, which has led today to an almost complete saturation of all possible transmission frequencies. By "all possible," we mean those frequencies that can efficiently carry information (high frequencies tend to be rapidly attenuated when propagating in the atmosphere) and be adequately processed by analog and digital devices (again, high frequency carriers require expensive and sometimes even physically infeasible radio front-ends). Foschini and Telatar brought forward the idea of multiplexing the information, not only through orthogonal frequency bands, but also in the space domain, by using spatially orthogonal propagation paths in multi-dimensional channels. As we saw in the previous chapter, though, this assumed orthogonality only holds for fairly unrealistic communication channels (very scattered propagation environments filled with objects the size of the transmission wavelength, very distant transmit and receive antennas, etc.). Also,


the multiple antenna multiplexing gain is only apparent at high signal-to-noise ratios, which is inconsistent with most contemporary interference limited cellular networks. We also discussed in Chapter 15 the impracticality of large cooperative networks, which require a huge load of channel state information feedback. The cognitive radio incentive, initiated with the work of Mitola [Mitola III and Maguire Jr, 1999], follows the same idea of communication resource harvesting. That is, cognitive radios intend to communicate not by exploiting the over-used frequency domain, or by exploiting the over-used space domain, but by exploiting so-called spectrum holes, jointly in time, space, and frequency. The basic idea is that, while the time, frequency, and spatial domains are over-used in the sense that telecommunication service providers have already bought all frequencies and have already placed base stations, access points, and relays to cover most areas, the effectively delivered communication service is largely discontinuous. That is, the telecommunication networks do not operate constantly in all frequency bands, at all times, and in all places at the maximum of their deliverable capacities. Multiple situations typically arise.

• A licensed network is under-used over a given period of time. This is typically the case during night-time, when little long range telecommunication service is provided.

• The frequency band exploited by a network is left free of use or the delivered content is not of interest to potential receivers. This is the case of broadcast television, whose frequency multiplexed channels are not all used simultaneously at any given space location.

• A licensed network is not used locally. This arises whenever no close user is found in a given space area where a licensed network is operating.

• A licensed network is used simultaneously on all resource dimensions, but the users' service request induces transmission rates below the channel capacity. This arises when, for instance, a wideband CDMA network is used for a single-user voice call. The single user is clearly capable, in the downlink, of decoding the CDMA stream with few errors, even if it were slightly interfered with by overlaying communications, since the CDMA code redundancy induces resistance against interfering data streams.

The concept of cognitive radios covers a very large framework, not clearly unified to this day, though, which intends to reuse spectrum left-overs (or holes), four examples of which were given above. A cognitive radio network can be described as an autonomous network overlaying one or many existing legacy networks. While the established networks have dedicated bandwidths and spatial planning to operate, cognitive radios are not using any licensed resource, be it in space, time, or frequency. Cognitive radios are however free to use any licensed spectrum, as long as by doing so they do not dramatically interfere with the licensed networks. That is, they are able to reuse the spectrum left unused by so-called primary networks whenever possible, while generating minimum harm to


the on-going communications. Considering the four examples above, a cognitive radio network, also called a secondary network, could operate, respectively:

• on a given time–frequency–space resource when no communication is found to take place in the licensed frequencies for a certain amount of time;

• if the delivered data content is not locally used, by overlaying the (now interfering) network transmissions;

• if the delivered data content is intended for some user but the cognitive radio is aware that this user is sufficiently far away, by again overlaying the locally unused spectrum;

• by intentionally interfering with the established network but using a sufficiently low transmit power that still allows the licensed user to decode its own data.

Now, what makes the secondary network cognitive is that all the aforementioned ways of action require constant awareness of the operations taking place in the licensed networks. Indeed, as it is an absolute necessity not to interfere with the licensed users, some sort of dynamic monitoring, or information feedback, is required for the secondary network to abide by the rules. Since secondary networks are assumed to minimally impact on the networks in place, it is a conventional assumption to consider that the licensed networks do not pro-actively deliver network information to the cognitive radio. It is even conventional to assume that the licensed networks are completely oblivious of the existence of potential interferers. Therefore, legacy telecommunication networks need not be restructured in order to face the interference of the new secondary networks. As a consequence, all the burden is placed on the cognitive radio to learn about its own environment. This is relatively easy when dealing with surrounding base stations and other fixed transmitters, as much data can be exploited in the long-term, but this is not so for mobile users. Service providers sometimes do not transmit at all (apart from pilot data), in which case secondary networks can detect a spectrum hole and exploit it. However, the real gain of cognitive radios does not come solely from benefiting from completely unused access points, but rather from benefiting from overlaying on-going communications while not affecting the licensed users. A classical example is that of a mobile phone network, which can cover an area as large as a few kilometers. In day-time, it is uncommon for a given base station never to be in use (for CDMA transmissions, remember that this means that the whole spectrum is then used at once), but it is also uncommon that the users communicating with this base station are always located close to a secondary network. A cognitive radio can always overlay the data transmitted by an operating base station if the user, located somewhere in a large area, is not found to be anywhere close to the cognitive network. For in-house cognitive radios, such as femto-cells in closed access (see, e.g., [Calin et al., 2010; Chandrasekhar et al., 2009; Claussen et al., 2008]), it can even be assumed that overlaying communication can take place almost continuously, as long as no


user inside the house or in neighboring houses establishes a communication with this network.

The question of whether an active user is to be found in the vicinity of a cognitive radio is therefore of primary importance to establish reliable overlaying communications in cognitive radios. For that, a cognitive radio needs to be able to sense neighboring users in active transmission. This can be performed by simple energy detection, as in the original work from Urkowitz [Urkowitz, 1967]. However, energy detection is meant for single antenna transmitters and receivers under additive white Gaussian noise conditions and therefore does not take into account the possibility of joint processing at the sensor network level in MIMO fading channel conditions. In this chapter, we will investigate the various approaches brought by random matrix theory to perform signal detection as reliably as possible. We will first investigate the generalization of the Urkowitz approach to multiple sources and multiple receivers under a small dimensional random matrix approach. The rather involved result we will present will then motivate large dimensional random matrix analysis. Most notably, approaches that require minimum a priori knowledge of the environment will be studied from a large dimensional perspective. Indeed, it must be assumed that the cognitive radio is completely unaware even of the expected received signal-to-noise ratio in a given frequency band, as it exactly intends to decide whether only noise or informative signals are received within this band.

Before getting into random matrix applications, let us model the signal sensing problem.

16.2 System model

We consider a communication network composed of K transmitting sources, e.g. this can be either a K-antenna transmitter or K single antenna (not necessarily uncorrelated) information sources, and a receiver composed of N sensors, be they the uncorrelated antennas of a single terminal or a mesh of scattered sensors, similar to the system model exploited in, e.g., [Cabric et al., 2006; Ghasemi and Sousa, 2005, 2007; Mishra et al., 2006; Sun et al., 2007a,b; Wang et al., 2010; Zhang and Letaief, 2008]. To enhance the multiple antenna (MIMO) model analogy, the set of sources and the set of sensors will be collectively referred to as the transmitter and the receiver, respectively. The communication channel between the transmitter and the receiver is modeled by the matrix H ∈ C^{N×K}, with (i,j)th entry h_{ij}. If at time l the transmitter emits data, these are denoted by the K-dimensional vector x^{(l)} = (x_1^{(l)}, ..., x_K^{(l)})^T ∈ C^K and are assumed independent across time. The additive white Gaussian noise at the receiver is modeled, at time l, by the vector σw^{(l)} = σ(w_1^{(l)}, ..., w_N^{(l)})^T ∈ C^N, where σ² denotes the variance of the noise vector entries, again assumed independent across time. Without loss of generality, we consider in the following zero


Figure 16.1 A cognitive radio network under hypothesis H_0, i.e. no close user is transmitting during the exploration period.

mean and unit variance of the entries of both w^{(l)} and x^{(l)}, i.e. E[|w_i^{(l)}|²] = 1, E[|x_i^{(l)}|²] = 1 for all i. We then denote y^{(l)} = (y_1^{(l)}, ..., y_N^{(l)})^T the N-dimensional data received at time l. Assuming the channel coherence time is at least as long as M sampling periods, we finally denote Y = [y^{(1)}, ..., y^{(M)}] ∈ C^{N×M} the matrix of the concatenated received i.i.d. vectors.

Depending on whether the transmitter emits informative signals, we consider the following hypotheses.

• H_0. Only background noise is received.

• H_1. Informative signals plus background noise are received.

Both scenarios of cognitive radio networks under hypotheses H_0 and H_1 are depicted in Figure 16.1 and Figure 16.2, respectively. Figure 16.1 illustrates the case when users neighboring the secondary network are not transmitting, while Figure 16.2 illustrates the opposite situation when a neighboring user is found to transmit in the frequency resource under exploration.

Therefore, under condition H_0, we have the model

Y = \sigma W

with W = [w^{(1)}, ..., w^{(M)}] ∈ C^{N×M}, and under condition H_1

Y = \begin{bmatrix} H & \sigma I_N \end{bmatrix}\begin{bmatrix} X \\ W \end{bmatrix}   (16.1)

i.e. Y = HX + σW, with X = [x^{(1)}, ..., x^{(M)}] ∈ C^{K×M}.

Under this hypothesis, we further denote Σ the covariance matrix of y^{(1)}

\Sigma = \mathrm{E}[y^{(1)}y^{(1)H}] = HH^H + \sigma^2 I_N = UGU^H


Figure 16.2 A cognitive radio network under hypothesis H_1, i.e. at least one close user is transmitting during the exploration period.

where G = diag(ν_1 + σ², ..., ν_N + σ²) ∈ R^{N×N}, with ν_1, ..., ν_N the eigenvalues of HH^H and U ∈ C^{N×N} a certain unitary matrix.
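To fix ideas, the following short Python sketch (the dimensions and SNR are illustrative choices, not taken from the book) draws one observation matrix Y under each hypothesis according to the Gaussian model above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 4, 1, 8          # sensors, sources, samples (illustrative values)
sigma2 = 10 ** (3 / 10)    # noise variance for an assumed SNR of -3 dB with E = 1

def crandn(*shape):
    """Standard complex Gaussian entries (zero mean, unit variance)."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

H = crandn(N, K) / np.sqrt(K)       # i.i.d. channel, E[|h_ij|^2] = 1/K
X = crandn(K, M)                    # transmitted signals, unit variance
W = crandn(N, M)                    # noise, unit variance

Y_H0 = np.sqrt(sigma2) * W          # hypothesis H0: noise only
Y_H1 = H @ X + np.sqrt(sigma2) * W  # hypothesis H1: signal plus noise

# Empirical eigenvalues of YY^H, on which all detectors of this chapter are based
lam_H0 = np.linalg.eigvalsh(Y_H0 @ Y_H0.conj().T)
lam_H1 = np.linalg.eigvalsh(Y_H1 @ Y_H1.conj().T)
print(lam_H0, lam_H1)
```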

The receiver is entitled to decide whether the primary users are transmitting informative signals or not. That is, the receiver is required to test the hypothesis H_0 against the hypothesis H_1. The receiver is however considered to have very limited information about the transmission channel and is in particular not necessarily aware of the exact number K of sources and of the signal-to-noise ratio. For this reason, following the maximum entropy principle [Jaynes, 1957a,b], we seek a probabilistic model for the unknown variables, which is both (i) consistent with the little accessible prior information available to the sensor network and (ii) of maximal entropy over the set of densities that validate (i). Maximum entropy considerations, which we do not develop here, are further discussed in Chapter 18, as they are at the core of the channel models developed in this chapter. We therefore admit for the time being that the entropy maximizing probability distribution of a random vector, the knowledge about which is limited to its population covariance matrix, is a multivariate Gaussian distribution with zero mean and covariance matrix the known population covariance matrix. If the population covariance matrix is unknown but is known to be of unit trace, then the entropy maximizing distribution is now multivariate independent Gaussian with zero mean and normalized identity covariance matrix. Therefore, if the channel matrix H is only known to satisfy, as is often the case in the short-term, E[(1/N) tr HH^H] = E, with E the total power carried through the channel, the maximum entropy principle states that the entries h_{ij} should be modeled as independent and all Gaussian distributed with zero mean and variance E/K. For the same reason, both noise w_i^{(l)} and signal x_i^{(l)} entries are taken independent Gaussian with zero mean and


variance E[|w_i^{(l)}|²] = 1, E[|x_i^{(l)}|²] = 1. Obviously, the above scalings depend on the definition of the signal-to-noise ratio.

Now that the model is properly defined, we turn to the question of testing hypothesis H_0 against hypothesis H_1. The idea is to decide, based on the available prior information and upon observation of Y, whether H_0 is more likely than H_1. Instead of exploiting structural features of the signal, such as cyclostationarity as in, e.g., [Enserink and Cochran, 1994; Gardner, 1991; Kim and Shin, 2008], we consider here the optimal Neyman–Pearson decision test. This is what we study first in the following (Section 16.3) under different prior information on all relevant system parameters. We will realize that the optimal Neyman–Pearson test, be it explicitly derivable for the model under study, leads nonetheless to very involved formulations, which cannot flexibly be extended to more involved system models. We will therefore turn to simpler suboptimal tests, whose behavior can be controlled based on large dimensional analysis. This is dealt with in Section 16.4.

16.3 Neyman–Pearson criterion

The Neyman–Pearson criterion [Poor, 1994; Van Trees, 1968] for the receiver to establish whether an informative signal was transmitted is based on the ratio

C(Y) = \frac{P_{H_1|Y}(Y)}{P_{H_0|Y}(Y)}   (16.2)

where, following the conventions of Chapter 2, P_{H_i|Y}(Y) is the probability of the event H_i conditioned on the observation Y. For a given receive space–time matrix Y, if C(Y) > 1, then the odds are that an informative signal was transmitted, while if C(Y) < 1, it is more likely that no informative signal was transmitted and therefore only background noise was captured. To ensure a low probability of false alarm (or false positives), i.e. the probability of declaring a pure noise sample to carry an informative signal, a certain threshold ξ is generally set such that, when C(Y) > ξ, the receiver declares an informative signal was sent, while when C(Y) < ξ, the receiver declares that no informative signal was sent. The question of what ratio ξ should be set to ensure a given maximally acceptable false alarm rate will not be treated in the following. We will however provide an explicit expression of (16.2) for the aforementioned model, and will compare its performance to that achieved by classical detectors. The results provided in this section are borrowed from [Couillet and Debbah, 2010a].

Thanks to Bayes' rule, (16.2) becomes

C(Y) = \frac{P_{H_1}\cdot P_{Y|H_1}(Y)}{P_{H_0}\cdot P_{Y|H_0}(Y)}

with P_{H_i} the a priori probability for hypothesis H_i to hold. We suppose that no side information allows the receiver to consider that H_1 is more or less probable


than H_0, and therefore set P_{H_1} = P_{H_0} = 1/2, so that

C(Y) = \frac{P_{Y|H_1}(Y)}{P_{Y|H_0}(Y)}   (16.3)

reduces to a maximum likelihood ratio.

In the next section, we derive closed-form expressions for C(Y) under the hypotheses that the values of K and the SNR, which we define as 1/σ², are either perfectly or only partially known at the receiver.

16.3.1 Known signal and noise variances

16.3.1.1 Derivation of P_{Y|H_i} in the SIMO case
We first analyze the situation where the noise power σ² and the number K of signal sources are known to the receiver. We also assume in this first scenario that K = 1. Since it is a common assumption that the number of available samples at the receiver is larger than the number of sensors, we further consider that M > N and N ≥ 2 (the case N = 1 is already known to be solved by the classical energy detector [Kostylev, 2002]).

Likelihood under H_0.

In this first scenario, the noise entries w_i^{(l)} are Gaussian and independent. The probability density of Y, that can be seen as a random vector with NM entries, is then an NM multivariate uncorrelated complex Gaussian with covariance matrix σ²I_{NM}

P_{Y|H_0}(Y) = \frac{1}{(\pi\sigma^2)^{NM}}\,e^{-\frac{1}{\sigma^2}\operatorname{tr} YY^H}.   (16.4)

Denoting λ = (λ_1, ..., λ_N)^T the eigenvalues of YY^H, (16.4) only depends on \sum_{i=1}^{N}\lambda_i as follows.

P_{Y|H_0}(Y) = \frac{1}{(\pi\sigma^2)^{NM}}\,e^{-\frac{1}{\sigma^2}\sum_{i=1}^{N}\lambda_i}.

Likelihood under H_1.
Under the information plus noise hypothesis H_1, the problem is more involved. The entries of the channel matrix H were previously modeled as jointly uncorrelated Gaussian, with E[|h_{ij}|²] = E/K. From now on, for simplicity, we take E = 1 without loss of generality. Therefore, since here K = 1, H ∈ C^{N×1} and Σ = HH^H + σ²I_N has N−1 eigenvalues g_2 = ... = g_N equal to σ² and another distinct eigenvalue g_1 = ν_1 + σ² = (\sum_{i=1}^{N}|h_{i1}|²) + σ². Since each |h_{i1}|² is the sum of the squares of two independent Gaussian variables of zero mean and variance 1/2 (the real and imaginary parts of h_{i1}), 2(g_1 − σ²) is χ²_{2N} distributed. Hence,


the unordered eigenvalue distribution of Σ, defined on [σ², ∞)^N, reads:

P_G(G) = \frac{1}{N}\,\frac{(g_1 - \sigma^2)^{N-1}\,e^{-(g_1-\sigma^2)}}{(N-1)!}\prod_{i=2}^{N}\delta(g_i - \sigma^2).

From the model H_1, Y is distributed as correlated Gaussian, as follows.

P_{Y|\Sigma,I_1}(Y,\Sigma) = \frac{1}{\pi^{MN}\det(G)^M}\,e^{-\operatorname{tr}(YY^H UG^{-1}U^H)}

where I_k denotes the prior information at the receiver "H_1 and K = k." This additional notation is very conventional for Bayesian probabilists, as it helps remind us that all derived probability expressions are the outcomes of a so-called plausible reasoning based on the prior information available to the system modeler.

Since the channel H is unknown, we need to integrate out all possible channels for the transmission model under H_1 over the probability space of N×K matrices with Gaussian i.i.d. distribution. From the unitary invariance of Gaussian i.i.d. random matrices, this is equivalent to integrating out all possible covariance matrices Σ over the space of such non-negative definite Hermitian matrices, as follows.

P_{Y|H_1}(Y) = \int_{\Sigma} P_{Y|\Sigma,H_1}(Y,\Sigma)\,P_{\Sigma}(\Sigma)\,d\Sigma.

Eventually, after the complete integration calculus given in the proof below, the Neyman–Pearson decision ratio (16.2) for the single input multiple output channel takes an explicit expression, given by the following theorem.

Theorem 16.1. The Neyman–Pearson test ratio C_Y(Y) for the presence of an informative signal when the receiver knows K = 1, the signal power E = 1, and the noise power σ², reads:

C_Y(Y) = \frac{1}{N}\sum_{l=1}^{N}\frac{\sigma^{2(N+M-1)}\,e^{\sigma^2 + \frac{\lambda_l}{\sigma^2}}}{\prod_{\substack{i=1\\ i\neq l}}^{N}(\lambda_l - \lambda_i)}\,J_{N-M-1}(\sigma^2,\lambda_l),   (16.5)

with λ_1, ..., λ_N the eigenvalues of YY^H and where

J_k(x,y) \triangleq \int_x^{\infty} t^k\, e^{-t-\frac{y}{t}}\,dt = 2y^{\frac{k+1}{2}}K_{-k-1}(2\sqrt{y}) - \int_0^x t^k\, e^{-t-\frac{y}{t}}\,dt.   (16.6)
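As an illustration, the following Python sketch evaluates (16.5) directly from the eigenvalues of YY^H, computing J_k by numerical quadrature; it is a naive transcription intended for small N and M (as in the simulations later in this section) and is not meant to be numerically robust for large dimensions.

```python
import numpy as np
from scipy.integrate import quad

def J(k, x, y):
    """J_k(x, y) = int_x^inf t^k exp(-t - y/t) dt, evaluated by quadrature."""
    val, _ = quad(lambda t: t ** k * np.exp(-t - y / t), x, np.inf)
    return val

def neyman_pearson_ratio(Y, sigma2):
    """Ratio C_Y(Y) of Theorem 16.1 (K = 1, E = 1, noise power sigma2 known)."""
    N, M = Y.shape
    lam = np.linalg.eigvalsh(Y @ Y.conj().T)      # eigenvalues of YY^H
    total = 0.0
    for l in range(N):
        denom = np.prod([lam[l] - lam[i] for i in range(N) if i != l])
        total += (sigma2 ** (N + M - 1)
                  * np.exp(sigma2 + lam[l] / sigma2)
                  / denom
                  * J(N - M - 1, sigma2, lam[l]))
    return total / N

# Usage with the Y_H0 / Y_H1 samples of the earlier sketch (hypothetical variables):
# C0 = neyman_pearson_ratio(Y_H0, sigma2); C1 = neyman_pearson_ratio(Y_H1, sigma2)
# One then decides H1 whenever the ratio exceeds a chosen threshold xi.
```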

The proof of Theorem 16.1 is provided below. Among the interesting features of (16.5), note that the Neyman–Pearson test depends only on the eigenvalues of YY^H. This suggests that the eigenvectors of YY^H do not provide any information regarding the presence of an informative signal. The essential reason is that, both under H_0 and H_1, the eigenvectors of Y are isotropically distributed on the unit N-dimensional complex sphere due to the Gaussian assumptions made here. As such, a given realization of the eigenvectors of Y does indeed not


carry any relevant information for the hypothesis test. The Gaussian assumption for H brought by the maximum entropy principle, or as a matter of fact any unitarily invariant distribution assumption for H, is therefore essential here. Note however that (16.5) does not reduce to a function of the sum \sum_i λ_i of the eigenvalues alone, as suggested by the classical energy detector.

On the practical side, note that the integral J_k(x,y) does not take a closed-form expression, except for x = 0, see, e.g., p. 561 of [Gradshteyn and Ryzhik, 2000]. This is rather inconvenient for practical purposes, since J_k(x,y) must either be evaluated every time, or be tabulated. It is also difficult to get any insight on the performance of such a detector for different values of σ², N, and K. We provide hereafter a proof of Theorem 16.1, in which classical multi-dimensional integration techniques are required. In particular, the tools introduced in Section 2.1, such as the important Harish–Chandra formula, Theorem 2.4, will be shown to be key ingredients of the derivation.

Proof. We start by writing the probability P_{Y|I_1}(Y) as the marginal probability of P_{Y,\Sigma,I_1} after integration along all possible Σ. This is:

P_{Y|I_1}(Y) = \int_{S(\sigma^2)} P_{Y|\Sigma,I_1}(Y,\Sigma)\,P_{\Sigma}(\Sigma)\,d\Sigma

with S(σ²) ⊂ C^{N×N} the cone of positive definite complex matrices with smallest N−1 eigenvalues equal to σ².

We now consider the one-to-one mapping

B : (U(N)/T)\times(\sigma^2,\infty) \to S(\sigma^2),\qquad (U, g_1) \mapsto \Sigma = UGU^H

where

G = \begin{pmatrix} g_1 & 0\\ 0 & \sigma^2 I_{N-1}\end{pmatrix}

and where U(N)/T is the space of N×N unitary matrices with first column composed of real positive entries. More information on this mapping is provided in [Hiai and Petz, 2006], which is reused later in Chapter 18. From variable change calculus, see, e.g., [Billingsley, 1995], we have

P_{Y|I_1}(Y) = \int_{(U(N)/T)\times(\sigma^2,\infty)} P_{Y|B(U,g_1)}(Y,U,g_1)\,P_{B(U,g_1)}(U,g_1)\,\det(J(B))\,dU\,dg_1

with J(B) the Jacobian matrix of B.

Notice now that Σ − σ²I_N = HH^H is a Wishart matrix. The density of its entries is therefore invariant by left- and right-products by unitary matrices. The eigenvectors of Σ are as a consequence uniformly distributed over U(N), the space of complex unitary matrices of size N×N. Moreover, the eigenvalue distribution of Σ is independent of the matrix U. From these observations, we


conclude that the joint density

P_{(U,g_1)}(U,g_1) = P_{B(U,g_1)}(U,g_1)\,\det(J(B))

can be written under the product form

P_{(U,g_1)}(U,g_1) = P_U(U)\,P_{g_1}(g_1).

As in Chapter 2, we assume that dU is the Haar measure with density P_U(U) = 1. We can therefore write

P_{Y|I_1}(Y) = \int_{U(N)\times(\sigma^2,\infty)} P_{Y|\Sigma,H_1}(Y,\Sigma)\,P_{g_1}(g_1)\,dU\,dg_1.

The latter can further be equated to

P_{Y|I_1}(Y) = \int_{U(N)\times(\sigma^2,\infty)} \frac{e^{-\operatorname{tr}(YY^H UG^{-1}U^H)}}{\pi^{NM}\det(G)^M}\,\frac{(g_1-\sigma^2)^{N-1}e^{-(g_1-\sigma^2)}}{N!}\,dU\,dg_1.

The next step is to use the Harish–Chandra identity provided in Theorem 2.4. Denoting ∆(Z) the Vandermonde determinant of a matrix Z ∈ C^{N×N} with eigenvalues z_1 ≤ ... ≤ z_N

\Delta(Z) \triangleq \prod_{i>j}(z_i - z_j)   (16.7)

the likelihood P_{Y|I_1}(Y) further develops as

P_{Y|I_1}(Y) = \lim_{g_2,\dots,g_N\to\sigma^2} \frac{e^{\sigma^2}(-1)^{\frac{N(N-1)}{2}}\prod_{j=1}^{N-1}j!}{\pi^{MN}\sigma^{2M(N-1)}N!}
\times \int_{\sigma^2}^{+\infty}(g_1-\sigma^2)^{N-1}e^{-g_1}\frac{1}{g_1^M}\,\frac{\det\left(e^{-\frac{\lambda_i}{g_j}}\right)}{\Delta(YY^H)\Delta(G^{-1})}\,dg_1.

Now, noticing that \Delta(G^{-1}) = (-1)^{\frac{N(N+3)}{2}}\frac{\Delta(G)}{\det(G)^{N-1}}, this is also

P_{Y|I_1}(Y) = \lim_{g_2,\dots,g_N\to\sigma^2} \frac{\pi^{-MN}e^{\sigma^2}\prod_{j=1}^{N-1}j!}{N!\,\sigma^{2(N-1)(M-N+1)}}\int_{\sigma^2}^{+\infty}\frac{(g_1-\sigma^2)^{N-1}e^{-g_1}}{g_1^{M-N+1}}\,\frac{\det\left(e^{-\frac{\lambda_i}{g_j}}\right)}{\Delta(YY^H)\Delta(G)}\,dg_1

in which we remind that λ_1, ..., λ_N are the eigenvalues of YY^H. Note the trick of replacing the known values of g_2, ..., g_N by limits of scalars converging to these known values, which dodges the problem of improper ratios. To derive the explicit limits, we then proceed as follows.

Denoting (γ_1, ..., γ_{N-1}, γ_N) = (g_2, ..., g_N, g_1) and defining the functions

f(x_i,\gamma_j) \triangleq e^{-\frac{x_i}{\gamma_j}},\qquad f_i(\gamma_j) \triangleq f(x_i,\gamma_j)


we then have from Theorem 2.9

\lim_{g_2,\dots,g_N\to\sigma^2}\frac{\det\left(e^{-\frac{\lambda_i}{g_j}}\right)_{\substack{1\le i\le N\\ 1\le j\le N}}}{\Delta(YY^H)\Delta(G)}
= \lim_{\substack{\gamma_1,\dots,\gamma_{N-1}\to\sigma^2\\ \gamma_N\to g_1}}\frac{(-1)^{N-1}\det\left(f_i(\gamma_j)\right)_{i,j}}{\Delta(YY^H)\Delta(G)}
= (-1)^{N-1}\frac{\det\left[f_i(\sigma^2),\, f_i^{(1)}(\sigma^2),\dots,f_i^{(N-2)}(\sigma^2),\, f_i(g_1)\right]}{\prod_{i<j}(\lambda_i-\lambda_j)\,(g_1-\sigma^2)^{N-1}\prod_{j=1}^{N-2}j!}.

The change of variables led to a switch of one column and explains the (-1)^{N-1} factor appearing when computing the resulting determinant. The partial derivatives of f along the second variable are, for k ≥ 1,

\frac{\partial^k}{\partial\gamma^k}f(a,b) = \left[\sum_{m=1}^{k}\frac{(-1)^{k+m}}{b^{m+k}}\binom{k}{m}\frac{(k-1)!}{(m-1)!}\,a^m\right]e^{-\frac{a}{b}} \triangleq \kappa_k(a,b)\,e^{-\frac{a}{b}}.

Back to the full expression of P_{Y|H_1}(Y), we then have

P_{Y|I_1}(Y) = \frac{e^{\sigma^2}\,\sigma^{2(N-1)(N-M-1)}}{N\pi^{MN}}
\int_{\sigma^2}^{+\infty}(-1)^{N-1}g_1^{N-M-1}e^{-g_1}\,\frac{\det\left[f_i(\sigma^2),\, f_i^{(1)}(\sigma^2),\dots,f_i^{(N-2)}(\sigma^2),\, f_i(g_1)\right]}{\prod_{i<j}(\lambda_i-\lambda_j)}\,dg_1

= \frac{e^{\sigma^2}\,\sigma^{2(N-1)(N-M-1)}}{N\pi^{MN}\prod_{i<j}(\lambda_i-\lambda_j)}
\int_{\sigma^2}^{+\infty}(-1)^{N-1}g_1^{N-M-1}e^{-g_1}\det\begin{pmatrix}
e^{-\frac{x_1}{\sigma^2}} & & e^{-\frac{\lambda_1}{g_1}}\\
\vdots & \left(\kappa_j(\lambda_i,\sigma^2)\,e^{-\frac{\lambda_i}{\sigma^2}}\right)_{\substack{1\le i\le N\\ 1\le j\le N-2}} & \vdots\\
e^{-\frac{x_N}{\sigma^2}} & & e^{-\frac{\lambda_N}{g_1}}
\end{pmatrix}dg_1.

Before going further, we need the following result, often required in the calculus of marginal eigenvalue distributions for Gaussian matrices.

Lemma 16.1. For any family a_1, ..., a_N ∈ R, N ≥ 2, and for any b ∈ R*

\det\begin{pmatrix}
1 & & \\
\vdots & \left(\kappa_j(a_i,b)\right)_{\substack{1\le i\le N\\ 1\le j\le N-1}} & \\
1 & &
\end{pmatrix} = \frac{1}{b^{N(N-1)}}\prod_{i<j}(a_j - a_i).

This identity follows from the observation that column k of the matrix above is a polynomial of order k−1 in the a_i. Since summations of linear combinations of the columns do not affect the determinant, each polynomial can be replaced


by its monomial of highest order, i.e. b^{-2(k-1)}a_i^{k-1} in row i. Extracting the product 1·b^{-2}···b^{-2(N-1)} = b^{-(N-1)N} from the determinant, what remains is the determinant of a Vandermonde matrix based on the vector a_1, ..., a_N.

By factorizing every row of the matrix by e^{-\frac{\lambda_i}{\sigma^2}} and developing the determinant along the last column, we obtain

P_{Y|I_1}(Y)
= \frac{e^{\sigma^2}\,\sigma^{2(N-1)(N-M-1)}}{N\pi^{MN}\prod_{i<j}(\lambda_i-\lambda_j)}
\int_{\sigma^2}^{+\infty} g_1^{N-M-1}\,e^{-g_1-\sum_{i=1}^{N}\frac{\lambda_i}{\sigma^2}}
\sum_{l=1}^{N}(-1)^{2N+l-1}\,e^{-\lambda_l\left(\frac{1}{g_1}-\frac{1}{\sigma^2}\right)}\,
\frac{\prod_{\substack{i<j\\ i\neq l,\, j\neq l}}(\lambda_i-\lambda_j)}{\sigma^{2(N-1)(N-2)}}\,dg_1

= \frac{e^{\sigma^2-\frac{1}{\sigma^2}\sum_{i=1}^{N}\lambda_i}}{N\pi^{MN}\sigma^{2(N-1)(M-1)}}
\sum_{l=1}^{N}(-1)^{l-1}\,
\frac{\int_{\sigma^2}^{+\infty} g_1^{N-M-1}\,e^{-g_1}\,e^{-\lambda_l\left(\frac{1}{g_1}-\frac{1}{\sigma^2}\right)}\,dg_1}{\prod_{i<l}(\lambda_i-\lambda_l)\prod_{i>l}(\lambda_l-\lambda_i)}

= \frac{e^{\sigma^2-\frac{1}{\sigma^2}\sum_{i=1}^{N}\lambda_i}}{N\pi^{MN}\sigma^{2(N-1)(M-1)}}
\sum_{l=1}^{N}\frac{e^{\frac{\lambda_l}{\sigma^2}}}{\prod_{\substack{i=1\\ i\neq l}}^{N}(\lambda_l-\lambda_i)}
\int_{\sigma^2}^{+\infty} g_1^{N-M-1}\,e^{-\left(g_1+\frac{\lambda_l}{g_1}\right)}\,dg_1

which finally gives

P_{Y|I_1}(Y) = \frac{e^{\sigma^2-\frac{1}{\sigma^2}\sum_{i=1}^{N}\lambda_i}}{N\pi^{MN}\sigma^{2(N-1)(M-1)}}
\sum_{l=1}^{N}\frac{e^{\frac{\lambda_l}{\sigma^2}}}{\prod_{\substack{i=1\\ i\neq l}}^{N}(\lambda_l-\lambda_i)}\,J_{N-M-1}(\sigma^2,\lambda_l)

where

J_k(x,y) = \int_x^{+\infty} t^k\,e^{-t-\frac{y}{t}}\,dt = 2y^{\frac{k+1}{2}}K_{-k-1}(2\sqrt{y}) - \int_0^x t^k\,e^{-t-\frac{y}{t}}\,dt

and K_n denotes the modified Bessel function of the second kind.

We now turn to the more general case where K ≥ 1, which unfolds similarly.

16.3.1.2 Multi-source case
In the generalized multi-source configuration, where K ≥ 1, the likelihood P_{Y|H_0} remains unchanged and therefore the previous expression for K = 1 is still correct. For the subsequent derivations, we only treat the situation where K ≤ N, but the case K > N is a rather similar extension.

In this scenario, H ∈ C^{N×K} is now a random matrix (instead of a vector) with i.i.d. zero mean Gaussian entries. The variance of every row is E[\sum_{j=1}^{K}|h_{ij}|²] = 1. Therefore K·HH^H is distributed as a null Wishart matrix. Hence, observing that the eigenvalues of Σ − σ²I_N are those of HH^H

\Sigma = U\cdot\operatorname{diag}(\nu_1+\sigma^2,\dots,\nu_K+\sigma^2,\sigma^2,\dots,\sigma^2)\cdot U^H   (16.8)


for some unitary matrix U ∈ C^{N×N} and with ν_1, ..., ν_K the eigenvalues of H^H H, the unordered eigenvalue density of G unfolds from Theorem 2.3

P_G(G) = \frac{(N-K)!\,K^{KN}}{N!}\,e^{-K\sum_{i=1}^{K}(g_i-\sigma^2)}\prod_{i=1}^{K}\frac{(g_i-\sigma^2)^{N-K}}{(K-i)!\,(N-i)!}\prod_{i<j}^{K}(g_i-g_j)^2.   (16.9)

From Equations (16.8) and (16.9) above, it is possible to extend Theorem 16.1 to the multi-source scenario, using similar techniques as for the proof of Theorem 16.1, which we do not further develop here, but which can be found in [Couillet and Debbah, 2010a]. This extended result is provided below.

Theorem 16.2. The Neyman–Pearson test ratio C_Y(Y) for the presence of informative signals when the receiver perfectly knows the number K (K ≤ N) of signal sources, the source power E = 1, and the noise power σ², reads:

C_Y(Y) = \frac{\sigma^{2K(N+M-K)}(N-K)!\,e^{K^2\sigma^2}}{N!\,K^{(K-1-2M)K/2}\prod_{j=1}^{K-1}j!}
\sum_{\mathbf{a}\subset[1,N]}\frac{e^{\sum_{i=1}^{K}\frac{\lambda_{a_i}}{\sigma^2}}}{\prod_{i=1}^{K}\prod_{j\neq a_1,\dots,a_K}(\lambda_{a_i}-\lambda_j)}
\times\sum_{\mathbf{b}\in\mathcal{P}(K)}(-1)^{\operatorname{sgn}(\mathbf{b})+K}\prod_{l=1}^{K}J_{N-M-2+b_l}(K\sigma^2,K\lambda_{a_l})

with P(K) the ensemble of permutations of {1, ..., K}, b = (b_1, ..., b_K), and sgn(b) the signature of the permutation b. The function J_k is defined as in Theorem 16.1.

Observe again that C_Y(Y) is a function of the empirical eigenvalues λ_1, ..., λ_N of YY^H only. In the following, we extend the current signal detector to the more realistic situations where K, E, and σ² are not a priori known to the receiver.

16.3.2 Unknown signal and noise variances

Efficient signal detection when the noise variance is unknown is highly desirable [Tandra and Sahai, 2005]. Indeed, as recalled earlier, for the noise and signal variances to be exactly known, some prior noise estimation mechanism would be required. This difficulty is handily avoided thanks to ad-hoc methods that are asymptotically independent of the noise variance, as in, e.g., [Cardoso et al., 2008; Zeng and Liang, 2009], or more theoretical, although suboptimal, approaches as in [Bianchi et al., 2011], which will be discussed when dealing with large dimensional random matrix considerations.

In the following, we consider the general case where the knowledge about the signal and noise variances can range from a total absence of information to perfect knowledge, and we will represent this knowledge under the form of a prior


probability distribution, as per classical Bayesian derivations. It might happen in particular that the receiver has no knowledge whatsoever of the values of the noise power and the expected signal power, but obviously knows that these powers are positive values. When such a situation arises, the unknown parameter must be assigned a so-called uninformative prior, such as the widespread Jeffreys prior [Jeffreys, 1946]. Assigning uninformative priors to variables defined in a continuum is however, still to this day, a controversial issue of the maximum entropy principle [Caticha, 2001]. The classical uninformative priors considered in the literature are (i) the uniform prior, i.e. every two positive values for the signal/noise power are equi-probable, which experiences problems of scaling invariance thoroughly discussed in [Jaynes, 2003], and (ii) the aforementioned Jeffreys prior [Jeffreys, 1946], i.e. the prior distribution for the variance parameter σ² takes the form σ^{-2β} for any deterministic choice of positive β, which is invariant under scaling but is not fully attractive as it requires a subjective choice of β.

In the case where the signal power E is known to be contained between E_− and E_+ (for the time being, we had considered E = 1), and the noise power σ² is known at least to be bounded by σ²_− and σ²_+, we will consider the "desirable" assumption of uniform priors

P_E(E) = \frac{1}{E_+ - E_-},\qquad P_{\sigma^2}(\sigma^2) = \frac{1}{\sigma^2_+ - \sigma^2_-}.

Denoting I_k the event "H_1, K = k, E_− ≤ E ≤ E_+ and σ²_− ≤ σ² ≤ σ²_+," this leads to the updated decisions of the form

C_Y(Y) = \frac{\dfrac{1}{E_+-E_-}\displaystyle\int_{\sigma^2_-}^{\sigma^2_+}\int_{E_-}^{E_+} P_{Y|\sigma^2,I_K}(Y,\sigma^2,E)\,d\sigma^2\,dE}{\displaystyle\int_{\sigma^2_-}^{\sigma^2_+} P_{Y|\sigma^2,H_0}(Y,\sigma^2)\,d\sigma^2}   (16.10)

where P_{Y|\sigma^2,I_K}(Y,\sigma^2,E) is obtained as previously by assuming a transmit power E instead of 1. Precisely, it suffices to consider the density of E·Y with σ² changed into σ²/E.

The computational difficulty raised by the integrals J_k(x,y) does not allow for any satisfying closed-form expression for (16.10), so that only numerical integrations can be performed at this point.
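When only bounds on σ² are available, the ratio can nonetheless be approximated by numerical integration; a rough sketch is given below, reusing the hypothetical functions of the previous sketches and a plain trapezoidal rule over a grid of noise powers (K = 1 and E = 1 are assumed known here for brevity, so only σ² is marginalized).

```python
import numpy as np

def loglik_H0(lam, sigma2, M):
    """log P_{Y|sigma^2, H0}(Y) from the eigenvalues lam of YY^H, as in (16.4)."""
    N = len(lam)
    return -N * M * np.log(np.pi * sigma2) - lam.sum() / sigma2

def marginalized_ratio(Y, sigma2_min, sigma2_max, n_grid=50):
    """Bayes ratio with sigma^2 marginalized under a uniform prior on [sigma2_min, sigma2_max].
    neyman_pearson_ratio is the (hypothetical) function of the previous sketch, so that
    P_{Y|sigma^2, I_1} = C_Y(Y; sigma^2) * P_{Y|sigma^2, H0}."""
    N, M = Y.shape
    lam = np.linalg.eigvalsh(Y @ Y.conj().T)
    grid = np.linspace(sigma2_min, sigma2_max, n_grid)
    # likelihoods normalized by a common constant to avoid numerical underflow
    ref = loglik_H0(lam, grid[n_grid // 2], M)
    p0 = np.array([np.exp(loglik_H0(lam, s, M) - ref) for s in grid])
    p1 = np.array([neyman_pearson_ratio(Y, s) for s in grid]) * p0
    return np.trapz(p1, grid) / np.trapz(p0, grid)
```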

16.3.3 Unknown number of sources

In practical cases, the number of transmitting sources is only known to be finite and discrete. If only an upper bound K_max on K is known, a uniform prior is assigned to K. The probability distribution of Y under hypothesis I_0 "σ²


known, 1 ≤ K ≤ K_max unknown," reads:

P_{Y|I_0}(Y) = \sum_{i=1}^{K_{\max}} P_{Y|"K=i",I_0}(Y)\cdot P_{"K=i"|I_0} = \frac{1}{K_{\max}}\sum_{i=1}^{K_{\max}} P_{Y|"K=i",I_0}(Y)

which does not meet any additional computational difficulty.

Assuming again equal probability for the hypotheses H_0 and H_1, this leads to the decision ratio

C_Y(Y) = \frac{1}{K_{\max}}\sum_{i=1}^{K_{\max}}\frac{P_{Y|"K=i",I_0}(Y)}{P_{Y|H_0}(Y)}.

Note now that it is possible to make a decision test on the number of sources itself, in a rather straightforward extension of the previous formula. Indeed, given a space–time matrix realization Y, the probability for the number of transmit antennas to be i is, from Bayes' rule,

P_{"K=i"|Y}(Y) = \frac{P_{Y|"K=i"}(Y)\,P_{"K=i"}}{\sum_{j=0}^{K_{\max}}P_{Y|"K=j"}(Y)\,P_{"K=j"}}
= \begin{cases}
P_{Y|H_0}(Y)\left(P_{Y|H_0}(Y) + \dfrac{1}{K_{\max}}\displaystyle\sum_{j=1}^{K_{\max}}P_{Y|"K=j"}(Y)\right)^{-1}, & i = 0\\[2mm]
\dfrac{1}{K_{\max}}P_{Y|"K=i"}(Y)\left(P_{Y|H_0}(Y) + \dfrac{1}{K_{\max}}\displaystyle\sum_{j=1}^{K_{\max}}P_{Y|"K=j"}(Y)\right)^{-1}, & i \ge 1
\end{cases}

where all the quantities of interest here were derived in previous sections. The multiple hypothesis test on K is then based on a comparison of the odds O("K = i") for the events "K = i," for all i ∈ {0, ..., K_max}. Under Bayesian terminology, we remind that the odds for the event "K = i" are defined as

O("K=i") = \frac{P_{"K=i"|Y}(Y)}{\sum_{\substack{j=0\\ j\neq i}}^{K_{\max}}P_{"K=j"|Y}(Y)}.

In the current scenario, these odds express as

O("K=i") = \begin{cases}
P_{Y|H_0}(Y)\left(\dfrac{1}{K_{\max}}\displaystyle\sum_{j=1}^{K_{\max}}P_{Y|"K=j"}(Y)\right)^{-1}, & i = 0\\[2mm]
\dfrac{1}{K_{\max}}P_{Y|"K=i"}(Y)\left(P_{Y|H_0}(Y) + \dfrac{1}{K_{\max}}\displaystyle\sum_{j\neq i}P_{Y|"K=j"}(Y)\right)^{-1}, & i \ge 1.
\end{cases}
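These posterior probabilities and odds are straightforward to compute once the likelihood values P_{Y|H_0}(Y) and P_{Y|"K=i"}(Y) are available (e.g. from Theorem 16.2); the short sketch below takes such a vector of likelihood values as input, the numbers used being purely illustrative.

```python
import numpy as np

def posterior_and_odds(lik_H0, lik_K):
    """Posterior P("K=i"|Y) and odds O("K=i") for i = 0, ..., K_max,
    given lik_H0 = P_{Y|H0}(Y) and lik_K[i-1] = P_{Y|"K=i"}(Y)."""
    K_max = len(lik_K)
    lik_K = np.asarray(lik_K, dtype=float)
    # unnormalized posterior weights: equal priors on H0/H1, uniform prior 1/K_max on K >= 1
    weights = np.concatenate(([lik_H0], lik_K / K_max))
    posterior = weights / weights.sum()
    odds = posterior / (1.0 - posterior)
    return posterior, odds

# Purely illustrative likelihood values (not computed from data)
post, odds = posterior_and_odds(lik_H0=2.0e-3, lik_K=[1.5e-3, 0.8e-3, 0.1e-3])
print(post, odds)
```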

We now provide a few simulation results that confirm the optimality of the Neyman–Pearson test for the channel model under study, i.e. with i.i.d. Gaussian channel, signal, and noise. We also provide simulation results when these assumptions are not met, in particular when a line-of-sight component is present in the channel and when the signal samples are drawn from a quadrature phase shift-keying (QPSK) constellation.


First, we provide in Figure 16.3 the simulated plots of the false alarm and correct detection rates obtained for the Neyman–Pearson test derived in Theorem 16.1 when K = 1, with respect to the decision threshold above which correct detection is claimed. To avoid trivial scenarios, we consider a rather low SNR of −3 dB, and N = 4 receivers capturing only M = 8 signal instances. The channel conditions are assumed to match the conditions required by the maximum entropy model, i.e. channel, signal, and noise are all i.i.d. Gaussian. Note that such conditions are desirable when fast decisions are demanded. In a cognitive radio setup, secondary networks are expected to be capable of very fast and reliable signal detection, in order to be able to optimally exploit spectrum opportunities [Hoyhtya et al., 2007]. This is often referred to as the exploration versus exploitation trade-off, which balances the time spent exploring for available resources with high enough detection reliability and the time spent exploiting the available spectrum resources. Observe in Figure 16.3 that the false alarm rate curve shows a steep drop around C_Y(Y) = 1 (or zero dB). This however comes along with a drop, although not so steep, of the correct detection rate. A classical way to assess the performance of various detection tests is to evaluate how much correct detection rate is achieved for a given fixed tolerable false alarm rate. Comparison of correct detection rates for given false alarm rates is obtained in the so-called receiver operating characteristic (ROC) curve. The ROC curve of the Neyman–Pearson test against that of the energy detector is provided in Figure 16.4 under the channel model conditions, for N = 4, M = 8, and σ² = −3 dBm as above, with the transmit power E equal to zero dBm, this last information being either perfectly known or only known to belong to [−10 dBm, 10 dBm]. We only focus on a section of the curve which corresponds to low false alarm rates (FAR), which is a classical assumption. We recall that the energy detector consists in summing up λ_1 to λ_N, the eigenvalues of YY^H (or equivalently taking the trace of YY^H), and comparing this sum against some deterministic threshold. The larger the sum, the more we expect the presence of an informative signal in the received signal. Observe that the Neyman–Pearson test is effectively superior in correct detection rate to the legacy energy detector, with up to 10% detection gain for low false alarm rates.

with up to 10% detection gain for low false alarm rates.We then test the robustness of the Neyman–Pearson test by altering theeffective transmit channel model. We specically consider that a line-of-sightcomponent of amplitude one fourth of the mean channel energy is present. Thisis modeled by letting the effective channel matrix H be H = √ 1 −α 2Z + αA ,where Z ∈C N ×K has i.i.d. Gaussian entries of variance 1 /K and A ∈C N ×K

has all entries equal to 1 /K . This is depicted in Figure 16.5 with α2 = 14 . We

observe once more that the Neyman–Pearson test performs better than thepower detector, especially at low SNR. It therefore appears to be quite robust toalterations in the system model such as the existence of a line-of-sight component,

although this was obviously not a design purpose.In Figure 16.6, we now vary the SNR range, and evaluate the correct detection

rates under different false alarm rate constraints, for the Gaussian i.i.d. signal


Figure 16.3 Neyman–Pearson test performance in the single-source scenario: correct detection rates and false alarm rates, as functions of the threshold C_Y(Y) [dB], for K = 1, N = 4, M = 8, SNR = −3 dB.

and channel model. This graph confirms the previous observation that the stronger the false alarm constraint, the more efficient the Neyman–Pearson test compared with the energy detection approach. Note in particular that as much as 10% of correct detection can again be gained in the low SNR regime and for a tolerable FAR of 10^{-3}.

Finally, in Figure 16.7, we provide the ROC curve performance for the multi-source scheme, when K ranges from K = 1 to K = 3, still under the Gaussiani.i.d. system model. We observe notably that, as the number of sources increases,the energy detector closes in the performance gap observed in the single sourcecase. This arises both from a performance decrease of the Neyman–Pearson test,which can be interpreted from the fact that the more the unknown variables(there are more unknown channel links) the less reliable the noise-versus-

information comparative test, and from a performance increase of the powerdetector, which can be interpreted as a channel hardening effect (the more thechannel links the less the received signal variance).

A more interesting problem though is to assume that the noise variance σ2 isnot a priori known at the receiver end since the receiver is entitled to determinewhether noise or informative signals are received without knowing the noisestatistics in the rst place. We have already seen that the Neyman–Pearson testapproach leads to a multi-dimensional integral form, which is difficult to furthersimplify. Practical systems however call for low complex implementation [Cabricet al. , 2004]. We therefore turn to alternative approaches bearing ideas in thelarge dimensional random matrix eld to cover this particularly interesting case.It will turn out that very simple tests can be determined for the scenario wherethe noise variance is not known to the receiver, and theoretical derivations of the


Figure 16.4 ROC curve (correct detection rate versus false alarm rate) for single-source detection, comparing the Neyman–Pearson test (E known or unknown) and the energy detector; K = 1, N = 4, M = 8, SNR = −3 dB, FAR range of practical interest, with signal power E = 0 dBm, either known or unknown at the receiver.

Figure 16.5 ROC curve for single-source detection, comparing the Neyman–Pearson test and the energy detector; K = 1, N = 4, M = 8, SNR = −3 dB, FAR range of practical interest, under a Rician channel with line-of-sight component of amplitude 1/4 and QPSK modulated input signals.

correct detection rate against the false alarm rate can be performed. This is the subject of the subsequent section.


Figure 16.6 Correct detection rates of the Neyman–Pearson test and of the energy detector under different FAR constraints (10^{-1}, 10^{-2}, 10^{-3}) and for different SNR levels; K = 1, N = 4, M = 8.

Figure 16.7 ROC curve for MIMO transmission, comparing the Neyman–Pearson test and the energy detector for K = 1 to K = 3; N = 4, M = 8, SNR = −3 dB, FAR range of practical interest.

16.4 Alternative signal sensing approaches

The major results of interest in the large dimensional random matrix field for signal detection are those regarding the position of the extreme eigenvalues of a sample covariance matrix. The first idea we will discuss, namely the condition number test, arises from the simple observation that, under hypothesis H_0,


not only should the empirical eigenvalue distribution of YY^H be close to the Marčenko–Pastur law, but the largest eigenvalue of YY^H should also be close to the rightmost end of the Marčenko–Pastur law support, as both the number of sensors and the number of available time samples grow large. If an informative signal is present in the observation Y, we expect instead the largest eigenvalue of YY^H to be found sufficiently far away from the Marčenko–Pastur law support.

The methods proposed below therefore heavily rely on Bai and Silverstein's Theorem 7.1 and its various extensions and corollaries, e.g. Theorem 9.8.

16.4.1 Condition number method

The first method we introduce is an ad-hoc approach based on the observation that, in the large dimensional regime, as both N and M grow large, the ratio between the largest and the smallest eigenvalues of (1/M)YY^H, often referred to as the condition number of (1/M)YY^H, converges almost surely to a deterministic value. Ordering the eigenvalues λ_1, ..., λ_N of YY^H as λ_1 ≥ ... ≥ λ_N, under hypothesis H_0, this convergence reads:

\frac{\lambda_1}{\lambda_N} \xrightarrow{\text{a.s.}} \frac{\sigma^2(1+\sqrt{c})^2}{\sigma^2(1-\sqrt{c})^2} = \frac{(1+\sqrt{c})^2}{(1-\sqrt{c})^2}

with c defined as the limiting ratio c \triangleq \lim_{N\to\infty} N/M. This is an immediate consequence of Theorem 7.1 and Theorem 9.8. This ratio is seen to no longer depend on the specific value of the noise variance σ². Under hypothesis H_1, notice that the model (16.1) is related to a spiked model, as the population covariance matrix of (1/M)YY^H is formed of N−K eigenvalues equal to σ² and K other eigenvalues strictly superior to σ², all different with probability one. In the particular case when K = 1, all eigenvalues of E[y^{(1)}y^{(1)H}] equal σ² but the largest, which equals σ² + \sum_{i=1}^{N}|h_{i1}|². As previously, call g_1 \triangleq σ² + \sum_{i=1}^{N}|h_{i1}|².

Similar to the previous section, let us consider the K = 1 scenario. We still assume that M > N, i.e. that more time samples are collected than there are sensors. From Theorem 9.1, we then have that, as M and N grow large with limiting ratio c \triangleq \lim N/M and such that g_1/σ² − 1 → ρ, if ρ > \sqrt{c}

\frac{\lambda_1}{M} \xrightarrow{\text{a.s.}} \sigma^2(1+\rho)\left(1+\frac{c}{\rho}\right) \triangleq \lambda_{\mathrm{sp}}

and

\frac{\lambda_N}{M} \xrightarrow{\text{a.s.}} \sigma^2\left(1-\sqrt{c}\right)^2

while if ρ < \sqrt{c}

\frac{\lambda_1}{M} \xrightarrow{\text{a.s.}} \sigma^2\left(1+\sqrt{c}\right)^2


and

\frac{\lambda_N}{M} \xrightarrow{\text{a.s.}} \sigma^2\left(1-\sqrt{c}\right)^2.

Thus, under the condition that M is large enough to ensure that g_1 > σ²(1 + \sqrt{c}), it is asymptotically possible to detect the presence of informative signals without explicit knowledge of σ². To this end, we may compare the ratio λ_1/λ_N to the value

\left(\frac{1+\sqrt{c}}{1-\sqrt{c}}\right)^2

corresponding to the asymptotically expected ratio under H_0. This defines a new, rather empirical test, which consists in considering a threshold around the ratio \left(\frac{1+\sqrt{c}}{1-\sqrt{c}}\right)^2 and deciding for hypothesis H_1 whenever λ_1/λ_N exceeds this value, or H_0 otherwise.
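A minimal sketch of this condition number test (reusing the simulated Y matrices of the earlier sketch; the margin factor is an arbitrary illustrative choice) could read:

```python
import numpy as np

def condition_number_test(Y, margin=1.0):
    """Decide H1 if lambda_1/lambda_N exceeds margin * ((1+sqrt(c))/(1-sqrt(c)))^2,
    with c = N/M; note that the noise variance sigma^2 is not needed."""
    N, M = Y.shape
    c = N / M
    lam = np.linalg.eigvalsh(Y @ Y.conj().T)
    ratio = lam[-1] / lam[0]                      # largest over smallest eigenvalue
    threshold = margin * ((1 + np.sqrt(c)) / (1 - np.sqrt(c))) ** 2
    return ratio > threshold, ratio, threshold

# Example with the earlier Y_H0 / Y_H1 samples (hypothetical variables):
# print(condition_number_test(Y_H0), condition_number_test(Y_H1))
```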

The condition number approach is interesting, although it is totally empirical. In the following section, we will derive the generalized likelihood ratio test (GLRT). Although suboptimal from a Bayesian point of view, this test will be shown through simulations to perform much more accurately than the present condition number test, and in fact very close to the optimal Bayesian test. It will in particular appear that the intuitive choice of λ_1/λ_N as a decision variable was not so appropriate and that the appropriate choice (at least, the choice that is appropriate in the GLRT approach) is in fact λ_1/((1/N)\operatorname{tr}(YY^H)).

16.4.2 Generalized likelihood ratio test

As we concluded in Section 16.3, it is rather difficult to exploit the final formula obtained in Theorem 16.1, let alone its generalized form of Theorem 16.2. This is the reason why a different approach is taken in this section. Instead of considering the optimal Neyman–Pearson test, which is nothing more than a likelihood ratio test when P_{H_0} = P_{H_1}, we consider the suboptimal generalized likelihood ratio test, which is based on the calculus of the ratio C_{GLRT}(Y) below

C_{\mathrm{GLRT}}(Y) = \frac{\sup_{H,\sigma^2} P_{Y|H,\sigma^2,H_1}(Y)}{\sup_{\sigma^2} P_{Y|\sigma^2,H_0}(Y)}.

This test differs from the likelihood ratio test (or Neyman–Pearson test) by the introduction of the \sup_{H,\sigma^2} in the numerator and the \sup_{\sigma^2} in the denominator. That is, among all possible H and σ² that are tested against the observation Y, we consider only the most probable (H, σ²) pair in the calculus of the numerator and the most probable σ² in the calculus of the denominator. This is a rather appropriate approach whenever Y carries much information about the possible H and σ², but a rather hazardous one when a large set of (H, σ²) pairs can account for the observation Y, most of these being discarded by taking the supremum.

Page 439: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 439/562

Page 440: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 440/562

416 16. Detection

16.4.3 Test power and error exponents

As was mentioned earlier, the explicit computation of a decision threshold and

of error exponents for the optimal Neyman–Pearson test is prohibitive due tothe expression taken by C Y , not to mention the expression when the signal-to-noise ratio is a priori unknown. However, we will presently show that, forthe GLRT approach it is possible to derive both the optimal threshold andthe error exponent for increasingly large system dimensions N and M (growingsimultaneously large). For the condition number approach, the optimal thresholdcannot be derived due to an eigenvalue independence assumption, which is notproved to this day, although it is possible to derive an expression for the errorexponent for this threshold. In fact, it will turn out that the error exponents areindependent of the choice of the detection threshold.

We rst consider the GLRT test. The optimal threshold γ M for a xed falsealarm rate α can be expressed as follows.

Theorem 16.4. For xed false alarm rate α ∈ (0, 1), the power of the generalized likelihood ratio test is maximum for the threshold

γ M = 1 + N M

2

+ bM

M 23

ζ M

for some sequence ζ M such that ζ M converges to ( ¯F

+

)−1

(α ), with ¯F

+

the complementary Tracy–Widom law dened as F + (x) = 1 −F + (x) and where

bM 1 + N M

43

N M

−16 .

Moreover the false alarm rate of the GLRT with threshold

γ M = 1 + N M

2

+ bM

M 23

( F + )−1(α )

converges to α, and more generally

∞γ

P T M |H 0 (x) − F +M 23 (γ −(1 +

N M )2)

bM → 0.

Let us assume σ = 1 without loss of generality. To prove this result, we rstneed to remember from Theorem 9.5 that the largest eigenvalue λ1 of YY H isrelated to the Tracy–Widom distribution as follows.

M 23

λ 1M − 1 + N

M 2

bM ⇒ X +

as N, M grow large, where X + is distributed according to the Tracy–Widomdistribution F + . At the same time, 1

MN tr YY H a.s.

−→ 1. Therefore, from the

Page 441: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 441/562

16.4. Alternative signal sensing approaches 417

Slutsky theorem, Theorem 8.12, we have that

T M M 2

3

T M

−1 + N

M 2

bM ⇒ X +

.

Denoting F M (x) ∞x P T M |H 0(x)dx, the distribution function of T M under H 0 ,

it can in fact be shown that the convergence of F M towards F + is in fact uniformover R . By denition of α, we have that

1 −F M M 23

γ M − 1 + N M

2

bM = α.

From the convergence of F M to F + , we then have that

F + M 23

γ M − 1 + N M

2

bM → α

from which, taking the (continuous) inverse of F + , we have the rst identity. Theremaining expressions are merely due to the fact that F M (T M ) −F + (T M ) → 0.

Deriving a similar threshold for the condition number test requires to prove the

asymptotic independence of λ1 and λN . This has been shown for the Gaussianunitary ensemble, Theorem 9.6, i.e. for the extreme eigenvalues of the semi-circlelaw, but to this day not for the extreme eigenvalues of the Marcenko–Pastur law.

After setting the decision threshold, it is of interest to evaluate thecorresponding theoretical power test, i.e. the probability of correct detection. Tothis end, instead of an explicit expression of the theoretical test power, we assumelarge system dimensions and use tools from the theory of large deviations. Thosetools come along with mathematical requirements, though, which are outsidethe scope of this book. We will therefore only state the main conclusions. Fordetails, the reader is referred to [Bianchi et al., 2011]. As stated in [Bianchi et al.,2011], as the system dimensions grow large, it is sensible for us to reduce theacceptable false alarm rate. Instead of a mere expression of the test power fora given false alarm, we may be interested in the error exponent curve , whichis the set of points ( a, b), such that there exists sequences α1 , α 2 , . . . such thatlimM − 1

M logα M = a and I ∞(α) = b. We rst address the case of the GLRT.

Theorem 16.5. Assume, under H 1 , that N k=1 |h k |2σ 2 converges to ρ as N, M

grow large, and that N/M → c. Then, for any α ∈ (0, 1), I ∞(α) is well dened and reads:

I ∞(α ) = I ρ (1 + √ c)2 , if ρ > √ c0 , otherwise

Page 442: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 442/562

418 16. Detection

where I ρ (x) is dened as

I ρ (x) = x −λ sp

1 + ρ −(1

−c)log

x

λ sp

−c(V + (x) −V + (λ sp )) + ∆( x|[(1 + √ c)2 , ∞))

with ∆( x|A) the function equal to zero is x ∈ A or to innity otherwise, and V + (x) the function dened by

V + (x) = log( x) + 1c log(1 + cm(x)) + log(1 + m(x)) + xm (x)m(x)

with m(x) = cm(x) − 1−cx and m(x) the Stieltjes transform of the Marcenko–

Pastur law with ratio c

m(z) = (1 −z −c) + (1 −z −c)2 −4cz

2cz

(the branch of the square-root being such that m is a Stieltjes transform).Moreover, the error exponent curve is described by the set of points

(I 0(x), I ρ (x)) , x ∈(1 + √ c)2 , λ sp

with I 0 dened as

I 0(x) = x

−(1 + √ c)2

−(1

−c)log

x

(1 + √ c)2

−c[V (x) −V ((1 + √ c)2)] + ∆( x|[(1 + √ c)2 , ∞)) .

A similar result is obtained for the condition number test, namely:

Theorem 16.6. For any false alarm rate α ∈ (0, 1) and for each limiting ρ, the error exponent of the condition number test coincides with the error exponent for the GLRT. Moreover, the error exponent curve is given by the set of points

(J 0(x), J ρ (x)) , x ∈(1 + √ c)2

(1 −√ c)2 , λsp

(1 −√ c)2

with

J ρ (x) = inf I ρ (x1) + I −(x2), x1

x2= x

J 0(x) = inf I 0(x1) + I −(x2), x1

x2= x

where I − is dened as

I −(x) = x −(1 −√ c)2 −(1 −c)log x(1 −√ c)2

−2c[V −(x) −V − (1 −√ c)2 ] + ∆( x|(0, (1 −√ c)2])

Page 443: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 443/562

16.4. Alternative signal sensing approaches 419

and V − is given by:

V −(x) = log( x) + 1c log(1 + cm(x)) + log( −1 −m(x)) + xm (x)m(x).

The ROC curve performance of the optimal Neyman–Pearson test withunknown SNR is compared against the condition number test and the GLRTin Figure 16.8. For the Neyman–Pearson test, we remind that the unknown SNRparameter must be integrated out, which assumes the need for a prior probabilitydistribution for σ2 . We provide in Figure 16.8 two classical approaches, namelyuniform distribution and Jeffreys prior with coefficient β = 1, i.e. P σ 2 (σ2) = 1

σ 2 .The simulation conditions are as before with K = 1 transmit source, N = 4receive sensors, M = 8 samples. The SNR is now set to zero dB in order to havenon-trivial correct detection values. Observe that, as expected, the Neyman–Pearson test outperforms both the GLRT and condition number tests, eitherfor uniform or Jeffreys prior. More surprising is the fact that the generalizedlikelihood ratio test largely outperforms the condition number test and performsrather close to the optimal Neyman–Pearson test. Therefore, the choice of theratio between the largest eigenvalue and the normalized trace of the samplecovariance matrix as a test comparison criterion is much more appropriate thanthe ratio between the largest eigenvalue and the smallest eigenvalue. Giventhe numerical complexity involved by the explicit computation of the Neyman–Pearson test, the GLRT can be considered as an interesting suboptimal substitute

for this test when the signal and noise powers are a priori unknown.It is also mentioned in [Bianchi et al., 2011] that the error exponent curve of

the GLRT dominates that of the condition number test in the sense that, foreach (a, b) in the error exponent curve of the condition number test, there existsb > b such that ( a, b ) is in the error exponent curve of the GLRT. Therefore,from the above theorems, at least asymptotically, the GLRT always outperformsthe condition number test. The practical simulations conrm this observation.

In this section, we mainly used Theorem 7.1 and Theorem 9.1 that basicallystate that, for large dimensional sample covariance matrices with populationcovariance eigenvalues converging to a single mass in 1, the largest eigenvalueis asymptotically found at the edge of the support of the Marcenko–Pasturlaw or outside this support, depending on whether a spike is found amongthe population eigenvalues. This allowed us to proceed to hypothesis testsdiscriminating both models with or without a spiked population eigenvalue. Inthe next section, we go further by considering more involved random matrixmodels for which not only hypothesis testing is performed but also statisticalinference. That is, we will proceed to the estimation of system parameters usingeigen-inference methods. This chapter will therefore require the tools developedin Chapter 7 and Chapter 8 of Part I.

Page 444: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 444/562

420 16. Detection

1 ·10−3 5 ·10−3 1 ·10−2 2 ·10−20

0.1

0.2

0.3

0.4

0.5

0.6

0.7

False alarm rate

C o r r e c t

d e t e c t i o n r a t e

N-P, JeffreysN-P, uniformCondition numberGLRT

Figure 16.8 ROC curve for a priori unknown σ2 of the Neyman–Pearson test (N-P),condition number method and GLRT, K = 1, N = 4, M = 8, SNR = 0 dB. For theNeyman–Pearson test, both uniform and Jeffreys prior, with exponent β = 1, areprovided.

Page 445: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 445/562

17 Estimation

In this chapter, we consider the consistent estimation of system parametersinvolving random matrices with large dimensions. When it comes to estimationor statistical inference in signal processing, there often exists a large number of different methods proposed in the literature, most of which are usually based ona reference, simple, and robust method which has various limitations such as theUrkowitz’s power detector [Urkowitz, 1967] that only assumes the additive whiteGaussian noise (AWGN) model, or the multiple signal classication (MUSIC)algorithm [Schmidt , 1986] of Schmidt that suffers from undecidability issueswhen the signal to noise ratio reaches a critically low value. When performingstatistical inference based on a limited number of large dimensional vectorinputs, the main limitation is due to the fact that those legacy estimators areusually built under the assumption that the number of available observations is

extremely large compared to the number of system parameters to identify. Inmodern signal processing applications, especially for large sensor networks, theestimators receive as inputs the M stacked N -dimensional observation vectorsY = [y (1) , . . . , y (M ) ] ∈C N ×M of some observation vectors y (m )

∈C N at time m,

M and N being of similar size, or even sometimes M being much smaller thanN . Novel estimators that can cope with this large population size limitationare therefore required in place of the historical estimators. In this chapter,we introduce such ( N, M )-consistent estimators, which we recall are estimatorswhich are asymptotically unbiased when both N and M grow large at a similarrate.

Since the signicant advances in this eld of research are rather new, only twomain examples will be treated here. The rst example is that of the consistentestimation of direction of arrivals (DoA) in linear sensor arrays (such as radars)[Kay, 1993; Scharf, 1991] when the number of sensors is of similar dimensionas the number of available observations. The major works in this directionare [Mestre and Lagunas , 2008] and [Vallet et al., 2010] for almost identicalsituations, involving nonetheless different system models. We will then moveto the question of blind user sensing and power estimation. Specically, we willconsider a sensor array receiving simultaneous transmissions from multiple signal

emitters, the objective being to estimate both the number of transmitters andthe power used by each one of those. The latter is mainly based on [Couilletet al. , 2011c].

Page 446: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 446/562

422 17. Estimation

Figure 17.1 Two-user line-of-sight transmissions with different angles of arrival, θ1 andθ2 .

17.1 Directions of arrival

In this section, we consider the problem of a sensor array impinged by multiplesignals, each one of which comes from a given direction. This is depicted in

Figure 17.1, where two signals transmitted simultaneously by two terminal users(positioned far away from the receiving end) are received with angles θ1 and θ2

at the sensor array. The objective here is to detect both the number of signalsources and the direction of arrival from each of these signals. This has naturalapplications in radar detection for instance, where multiple targets need to belocalized. In general, thanks to the diversity offered by the sensor array, and thephase shifts in the signals impacting every antenna, it is possible to determinethe angle of signal arrival from basic geometrical optics. In the following, wewill recall the classical so-called multiple signal classication estimator (MUSIC)[Schmidt , 1986], which is suited for large streams of data and small dimensionalsensor array as it can be proved to be a consistent estimator in this setting.However, it can be proved that the MUSIC technique is not consistent withincreasing dimensions of both the number of sensors and the number of samples.To cope with this problem, a G-estimator is proposed, essentially based onTheorem 8.7. This recent estimator, developed in [Mestre and Lagunas, 2008 ] byMestre and Lagunas, is based on the concept of G-estimation and is referred toas G-MUSIC. We rst introduce the system model under consideration.

17.1.1 System modelWe consider the communication setup between K signal sources (that wouldbe, in the radar context, the reected waveforms from detected targets) and N

Page 447: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 447/562

17.1. Directions of arrival 423

receive sensors, N > K . Denote x( t )k the signal issued by source k at time t. The

received signals at time t, corrupted by the additive white Gaussian noise vectorσw ( t )

C N with E[ w ( t ) w ( t ) ] = δ t,t I N , are gathered into the vector y ( t )

C N .We assume that the channel between the sources and the sensors creates onlyphase rotations, that essentially depend on the antenna array geometry. Otherparameters such as known scattering effects might be taken into account as well.To be all the more general, we assume that the channel steering effect on signalx( t )

k for sensor i is modeled through the time invariant function s i (θ) for θ = θk .As such, we characterize the transmission model at time t as

y ( t ) =K

k=1

s(θk )x( t )k + σw ( t ) (17.1)

where s(θk ) = [s1(θk ), . . . , s N (θk )]T . For simplicity, we assume that the vectorss(θk ) have unit Euclidean norm.

Suppose for the time being that x ( t ) = [x( t )1 , . . . , x ( t )

K ]T

∈C K are i.i.d. along

the time domain t and have zero mean and covariance matrix P ∈C K ×K . Thisassumption, which is not necessarily natural, will be discarded in Section 17.1.4.The vectors y ( t ) are sampled M times, with M of the same order of magnitudeas N , and are gathered into the matrix Y = [y (1) , . . . , y (M ) ] ∈C N ×M . From theassumptions above, the columns of Y have zero mean and covariance R , givenby:

R = S (Θ) PS (Θ) H + σ2I N

where S(Θ) = [ s(θ1), . . . , s (θK )] ∈C N ×K .The DoA detection question amounts to estimating θ1 , . . . , θ K based on Y ,

knowing the steering vector function s(θ) = [s1(θ), . . . , s N (θ)]T for all θ. To thisend, not only eigenvalues of 1

M YY H but also eigenvectors are necessary. This iswhy we will resort to the G-estimators introduced in Section 17.1.3. Before that,we discuss the classical subspace methods and the MUSIC approach.

17.1.2 The MUSIC approach

We denote λ1 ≤ . . . ≤ λN the eigenvalues of R and e1 , . . . , e N theircorresponding eigenvectors. Similarly, we denote λ1 ≤ . . . ≤ λN the eigenvaluesof R N 1

M YY H , with respective eigenvectors e 1 , . . . , e N . If some eigenvalue hasmultiplicity greater than one, the set of corresponding eigenvectors is taken to beany orthonormal basis of the associated eigenspace. From the assumption thatthe number of sensors N is greater than the number of transmit sources K , thelast N

−K eigenvalues of R equal σ2 and we can represent R under the form

R = E W E S σ2 I N −K 0

0 Λ S

E HW

E HS

Page 448: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 448/562

424 17. Estimation

with ΛS = diag( λN −K +1 , . . . , λ N ), E S = [e N −K +1 , . . . , e N ] the so-called signal space and E W = [e 1 , . . . , e N −K ] the so-called noise space .

The basic idea of the subspace approach, which is at the core of the MUSICmethod, is to observe that any vector lying in the signal space is orthogonal tothe noise space. This leads in particular to

E H

W s(θk ) = 0

for all k ∈ 1, . . . , K , which is equivalent to

η(θk ) s(θk )E W EHW s(θk ) = 0 .

The idea behind the MUSIC approach is simple in that it suggests, accordingto the large M -dimension approach, that the covariance matrix R is well

approximated by R N as M grows to innity. Therefore, denoting E W =[e 1 , . . . , e N −K ] the eigenvector space corresponding to the smallest eigenvalues of R N , the MUSIC estimator consists in retrieving the arguments θ which minimizethe function

η(θ) s (θ)H E W EHW s(θ).

Notice that it may not be possible for ˆ η(θ) to be zero for any θ, so that bylooking for minima in η(θ), we are not necessarily looking for roots. This approachis originally due to Schmidt in [Schmidt, 1986] . However, the nite number of available samples strongly affects the efficiency of the MUSIC algorithm. In orderto come up with more efficient approaches, the subspace approach was furtherrened by taking into account the fact that, in addition to be orthogonal tothe noise space, s(θk ) is aligned to the signal space S(Θ) PS (Θ) H . One of theknown examples is the so-called SSMUSIC approach due to McCloud and Scharf [McCloud and Scharf, 2002] . The approach considered in the SSMUSIC methodis now to determine the local minima of the function

ηSS (θ) s(θ)H E W E H

W s(θ)

s (θ)H E S Λ S − σ2 I K −1

E HS s(θ)

where the denominator comes from S (Θ) PS (Θ) H −1 = E S Λ S −σ2I K −1 E HS

(when S(Θ) PS (Θ) H is not invertible, the same remark holds with theinverse sign replaced by the Moore–Penrose pseudo-inverse sign), with Λ S =diag( λN −K +1 , . . . , λN ) and σ2 = 1

N −K N −K k=1 λk . The SSMUSIC technique was

proved to outperform the MUSIC approach for nite M , as it has a higherresolution power to distinguish close angles of arrival [McCloud and Scharf, 2002] .

However, even though it is proved to be better, this last approach is still not(N, M )-consistent. This fact, which we do not prove here, is the point madeby Mestre in [Mestre , 2008a] and [Mestre and Lagunas, 2008 ]. Instead of thisclassical large dimensional M approach, we will assume that both N and M grow large at similar pace while K is kept constant, so that we can use theresults from Chapter 8.

Page 449: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 449/562

17.1. Directions of arrival 425

17.1.3 Large dimensional eigen-inference

The improved MUSIC estimator unfolds from a trivial application of Theorem

8.7. The cost function u introduced in Theorem 8.7 is simply replaced by thesubspace cost function η(θ), dened by

η(θk ) = s (θk )E W EHW s (θk ).

We therefore have the following improved MUSIC estimator, called by theauthors in [Mestre and Lagunas, 2008] the G-MUSIC estimator.

Theorem 17.1 ([Mestre and Lagunas, 2008] ). Under the above conditions, we have:

η(θ) − η(θ) a .s.−→ 0

as N, M grow large with limiting ratio satisfying 0 < lim N/M < ∞, where

η(θ) = s(θ)H N

n =1φ(n)e n e H

n s(θ)

with φ(n) dened as

φ(n) =1 + N

k= N −K +1 λ k

λ n −λ k − µ k

λ n −µ k , n ≤ N −K

− N −K k=1 λ kλ n −λ k − µ k

λ n −µ k , n > N −K

and with µ1 ≤ . . . ≤ µN the eigenvalues of diag( λ ) − 1M λ λ

T

, where we denoted λ = ( λ1 , . . . , λN )T .

This derives naturally from Theorem 8.7 by noticing that the noise space E W

is the space of the smallest eigenvalue of R with multiplicity N −K , which ismapped to the space of the smallest N −K eigenvalues of the empirical R N toderive the consistent estimate.

It is also possible to derive an ( N, M )-consistent estimate for the improvedSSMUSIC method. The latter will be referred to as G-SSMUSIC This unfoldsfrom a similar application of Theorem 8.7 and is given in [Mestre and Lagunas,2008] under the following form.

Theorem 17.2 ([Mestre and Lagunas, 2008] ). Under the above conditions

ηSS (θ) − ηSS (θ) a .s.

−→ 0

as N, M grow large with ratio uniformly bounded away from zero and innity,

where

ηSS (θ) = η(θ)

εη(θ) + χ (θ)

Page 450: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 450/562

426 17. Estimation

for any ε ≥ 0 that guarantees that the denominator does not vanish for all θ,where η(θ) is given in Theorem 17.1 and χ (θ) is given by:

χ (θ) = s (θ)H

N

n =1ψ(n)e n e H

n s (θ)

with ψ(n) dened as

ψ(n) =1

σ 2N k= N −K +1

λ k

λ n −λ k − ν kλ n −ν k

, n ≤ N −K 1

σ 2N −K k=0

ν kλ n −ν k − N −K

k=1λ k

λ n −λ k , n > N −K

with µ1 ≤ . . . ≤ µN the eigenvalues of diag( λ ) − 1M

λ

λ

T

, ν 0 ≤ . . . ≤ ν N the solutions of the equation in ν

ν = σ2 1 − 1M

N

k =1

λk

λk −ν

and σ2 is an (N, M )-consistent estimator for σ2 , given here by

σ2 = M N −K

N −K

k=1

λk − µk .

We hereafter provide one-shot realizations of the cost functions ¯ η(θ) andηSS (θ) for the different DoA estimation methods proposed above. We take theassumptions that K = 3 signal sources are emitting and that an array of N = 20sensors is used to perform the statistical inference, that samples M = 150 timesthe incoming waveform. The angles of arrival are 10 , 35, and 37, while theSNR is set to 10 dB. This situation is particularly interesting as two incomingwaveforms are found with very close DoA. From the discussion above, wetherefore hope that SSMUSIC would better resolve the two close angles. Infact, we will see that the G-estimators that are G-MUSIC and G-SSMUSICare even more capable of discriminating between close angles. Figure 17.2 and

Figure 17.3 provide the comparative performance plots of the MUSIC againstG-MUSIC approaches, for θ ranging from −45 to 45 in Figure 17.2 and forθ varying from −33 to −38 in Figure 17.3. Observe that, while the MUSICapproach is not able to resolve the two close DoA, the G-MUSIC techniqueclearly isolates two minima of η(θ) around 35 and 37. Apart from that, bothperformance plots look alike. Similarly, the SSMUSIC and G-SSMUSIC costfunctions for the same random realization as in Figure 17.2 and Figure 17.3are provided in Figure 17.4 and Figure 17.5. Observe here that the SSMUSICestimator is able to resolve both angles, although it is clearly not as efficient asthe G-SSMUSIC estimator. Performance gures in terms of mean square errorare found in [Mestre and Lagunas, 2008 ]. It is observed in particular by theauthors that the improved estimators still do not solve the inherent problemof both MUSIC and SS-MUSIC estimators, which is that both perform very

Page 451: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 451/562

17.1. Directions of arrival 427

-10 35 37−30

−25

−20

−15

−10

−5

0

5

10

angle [deg]

C o s t

f u n c t i o n

[ d B ]

MUSICG-MUSIC

Figure 17.2 MUSIC against G-MUSIC for DoA detection of K = 3 signal sources,N = 20 sensors, M = 150 samples, SNR of 10 dB. Angles of arrival of 10 , 35, and37.

35 37−30

−28

−26

−24

−22

−20

−18

−16

angle [deg]

C o s t

f u n c t i o n

[ d B ]

MUSIC

G-MUSIC

Figure 17.3 MUSIC against G-MUSIC for DoA detection of K = 3 signal sources,N = 20 sensors, M = 150 samples, SNR of 10 dB. Angles of arrival of 10 , 35, and37.

badly in the low SNR regime. Nevertheless, the improved G-estimators manageto repel to a lower level the SNR limit for which performance decays signicantly.The same performance behavior will also be observed in Section 17.2, where theperformance of blind multi-source power estimators is discussed.

Page 452: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 452/562

428 17. Estimation

-10 35 37−30

−20

−10

0

10

20

30

angle [deg]

C o s t

f u n c t i o n

[ d B ]

SSMUSICG-SSMUSIC

Figure 17.4 SSMUSIC against G-SSMUSIC for DoA detection of K = 3 signal sources,N = 20 sensors, M = 150 samples, SNR of 10 dB. Angles of arrival of 10 , 35, and37.

35 37−28

−26

−24

−22

−20

−18

−16

angle [deg]

C o s t

f u n c t i o n

[ d B ]

SSMUSIC

G-SSMUSIC

Figure 17.5 SSMUSIC against G-SSMUSIC for DoA detection of K = 3 signal sources,N = 20 sensors, M = 150 samples, SNR of 10 dB. Angles of arrival of 10 , 35, and37.

Further work has been done on the DoA topic, especially in the case where,instead of i.i.d. samples, the sensors receive correlated data. These data can beassumed not to be known to the sensors, so that no specic random model canbe applied. This is discussed in the next section.

Page 453: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 453/562

17.1. Directions of arrival 429

17.1.4 The correlated signal case

We recall that in the previous section, we explicitly assumed that the vector of

transmit signals are independent for successive samples, have zero mean, andhave the same covariance matrix. The present section is merely an extension of Section 17.1.3 to the even more restrictive case when the transmit data structureis unknown to the receiving sensors. In this case, the sample covariance matrixmodel assumed in Section 17.1.3, which allowed us to use Theorem 8.7, is nolonger available. This section mainly recalls the results of [Vallet et al., 2010].

Remark 17.1. It must be noted that the authors of [Vallet et al., 2010], insteadof mentioning that the signal source is random with unknown distribution, statethat the source is deterministic but unknown to the sensors. We prefer to say that

the source, being unknown to the receiver, is therefore random from the point of view of the receiver , and not deterministic. However, to be able to use thetools hereafter, we will need to assume that, although unknown to the receiver,the particular realization of the random incoming data satises some importantboundedness assumptions, known to the receiver.

The model ( 17.1) is still valid, i.e. the receive data vector y ( t ) at time instantt reads:

y ( t ) =K

k=1

s(θk )x( t )k + σw ( t )

where the vector x ( t ) = [x( t )1 , . . . , x ( t )

K ]T

∈C K is no longer i.i.d. along the time

index t. We therefore collect the M vector samples into the random matrixX = [x (1) , . . . , x (M ) ] ∈C K ×M and we obtain the receive model

Y = SX + σW

with W = [w (1) , . . . , w (M ) ] ∈C N ×M a Gaussian matrix and S ∈C N ×K thematrix with columns the K steering vectors, dened as previously. In Section17.1.3, X was of the form P

12 Z with Z

∈C K ×M lled with i.i.d. entries of

zero mean and unit variance, and W was naturally lled with i.i.d. entries of zero mean and unit variance, so that Y took the form of a sample covariancematrix. This is no longer valid in this new scenario. The matrix Y can insteadbe considered as an information plus noise matrix, if we take the additionalassumption that we can ensure SXX H S uniformly bounded for all matrixsizes. Denoting E W the noise subspace of R 1

M SXX H S , our objective is nowto estimate the cost function

η(θ) = s(θ)H E W EHW s (θ).

Again, the traditional MUSIC approach replaces the noise subspace E W bythe empirical subspace E W composed of the eigenvectors corresponding to theN −K smallest eigenvalues of 1

M YY H . The resulting cost function therefore

Page 454: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 454/562

430 17. Estimation

reads:

η(θ) = s(θ)H E W EH

W s (θ)

and the MUSIC algorithm consists once more in nding the K deepest minimaof η(θ). Assuming this time that the noise variance σ2 is known, as per theassumptions of [Vallet et al., 2010], the improved MUSIC approach, call it oncemore the G-MUSIC technique, now for the information plus noise model, derivesdirectly from Theorem 8.10.

Theorem 17.3 ([Vallet et al., 2010]). As N, M grow to innity with limiting ratio satisfying 0 < lim N/M < ∞

η(θ)

− η(θ) a.s.

−→ 0

for all θ, where

η(θ) = s (θ)H N

k =1

φ(k)e k e H

k s(θ)

with φ(k) dened as

φ(k) =

1 + σ2

M N i = N −K +1

1λ i −λ k

+ 2σ 2

M N i = N −K +1

λ k

λ i −λ k

+ σ 2 (M −N )M

N i = N −K +1

1λ i −λ k − 1

µ i −λ k , k ≤ N −K

σ 2

M

N

−K

i =11

λ i −λ k − 2σ 2

M

N

−K

i =1λ k

λ i −λ k

+ σ 2 (M −N )M

N −K i =1

1λ i −λ k − 1

µ i −λ k , k > N −K

and with µ1 ≤ . . . ≤ µN the N roots of the equation in µ

1 + σ2

M

N

k=1

1λk −µ

.

The performance of the G-MUSIC approach which assumes the informationplus noise model against the previous G-MUSIC technique is compared for the

one-shot random transmission in Figure 17.6. We observe that both estimatorsdetect very accurately both directions of arrival. Incidentally, the informationplus noise G-MUSIC shows slightly deeper minima than the sample covariancematrix G-MUSIC. This is only an outcome of the one-shot observation at handand does not affect the average performance of both approaches, as shown moreprecisely in [Vallet et al., 2010].

Note that Theorem 17.3 only proves the asymptotic consistency for thefunction η(θ). The consistency of the angle estimator itself, which is the resultof interest for practical applications, is derived in [Vallet et al., 2011a]. Theuctuations of the estimator are then provided in [Mestre et al., 2011]. We alsomention that the same authors also proposed a G-MUSIC alternative relying onan additive spike model approach. Both limit and uctuations are derived alsofor this technique, which turns out to perform worse than the approach presented

Page 455: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 455/562

17.1. Directions of arrival 431

35 37−32

−30

−28

−26

−24

−22

−20

−18

angle [deg]

C o s t

f u n c t i o n

[ d B ]

i.i.d. G-MUSICGeneral G-MUSIC

Figure 17.6 G-MUSIC tailored to i.i.d. samples (i.i.d. G-MUSIC) against unconditionalG-MUSIC (General G-MUSIC) for DoA detection of K = 3 signal sources, N = 20sensors, M = 150 samples, SNR of 10 dB. Angles of arrival of 10 , 35, and 37 .

in this section. The initial results can be found in [Hachem et al., 2011; Valletet al. , 2011b].

This completes this section on DoA localization. We take the opportunity of the information plus noise study above to mention that similar G-estimators havebeen derived by the same authors for system models involving the informationplus noise scenario. In particular, in [Vallet and Loubaton , 2009], a consistent

estimator for the capacity of a MIMO channels H under additive white noiseof variance σ2 , i.e. log det( I N + σ−2HH H ), is derived based on successiveobservations y ( t ) = Hs ( t ) + σw ( t ) with known pilot sequence s(1) , s (2) , . . . andwith additive standard Gaussian noise vector w ( t ) .

In the following, we consider a similar inference problem, relative to thelocalization of multiple transmit sources, not from an angular point of view, butrather from a distance point of view. The main idea now is to consider a statisticalmodel where hidden eigenvalues must be recovered that give information on thepower transmitted by distinct signal sources. Similar to the DoA estimation,the problem of resolving transmissions of close power will be raised for which acomplete analysis of the conditions of source resolution is performed. As such, wewill treat the following section in more detail than the current DoA estimationsection.

Page 456: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 456/562

Page 457: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 457/562

17.2. Blind multi-source localization 433

• Assuming the sensed data are due to primary uplink transmissions by mobileusers to a xed network, the primary uplink frequency band will be reused insuch a way that no primary user emitting with power P is interfered with byany transmission from the secondary network;

• if P is above a certain threshold, the cognitive radio will decide thatneighboring primary cell sites are in use by primary users. Therefore, alsodownlink transmissions are not to be interfered with, so that the downlinkspectrum is not considered a spectrum hole.

If the secondary network is able to do more than just overall power estimation,namely if it is capable of estimating both the number of concurrent simultaneoustransmissions in a given spectral resource, call this number K , and the powerof each individual source, call them P 1 , . . . , P K for source 1 to K , respectively,with P 1 ≤ . . . ≤ P K , then the secondary network can adapt its coverage area ina more accurate way as follows.

• Since the strongest transmitter has power P K , the secondary cell coveragearea can be set such that the primary user with power P K is not interferedwith. This will automatically induce that the other primary users are notinterfered with (if it is further assumed that no power control is performedby the primary users). As an immediate consequence, the primary uplinktransmission will be stated as reusable if P K is not too large. Also, if P K

is so little that no primary user is expected to use primary downlink datasent by neighboring cells, also the downlink spectrum will be reused. In thecase where multiple transmissions happen simultaneously, this strategy willturn out to be much more efficient than the estimation of the overall transmitpower P K

k=1 P k .

• Also, by measuring the transmit powers of multiple primary users withinmultiple distant secondary networks, information can be shared (via low speedlinks) among these networks so as to eventually pinpoint the precise locationof the users. This brings even more information about the occupancy (andtherefore the spectrum reusability) of each primary cell site. Moreover, it will

turn out that most methods presented below show a strong limitation whenit comes to resolving different users transmitting with almost equal power.Quite often, it is difficult to discriminate between the scenario of a single-usertransmitting with power P or multiple transmitters with similar transmitpowers, the sum of which being equal to P . Communications between distantsecondary networks can therefore bring more information on the number of users with almost equal power. This eventually leads to the same performancegain as given in the previous point when it comes for the cognitive networkto decide on the maximally acceptable coverage area.

• We also mentioned the need for estimating the number of transmit antennasper user. In fact, in the eigen-inference techniques presented below, the abilityto estimate the number of antennas per user is an aftermath of the estimationalgorithms. The interest of estimating the number of eigenvalues is linked

Page 458: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 458/562

434 17. Estimation

to the previous point, concerning the difficulty to differentiate between oneuser transmitting with strong power or many users transmitting with almostequal powers. If it is observed that power P is used by a device equippedwith many more antennas than the communication protocol allows, then thisshould indicate to the sensors the presence of multiple transmitters with closetransmit powers instead of a unique transmitter. This again allows for moreprecise inference on the system parameters.

Note from the discussion above that estimating P K is in fact more importantto the secondary network than estimating P 1 , as P K can by itself already providea major piece of information concerning the largest coverage radius for secondarytransmissions. When the additive noise variance is large, or when the numberof available sensors is too small, inferring the smallest transmit powers is ratherdifficult. This is one of the reasons why eigen-inference methods that are capableof estimating a particular P k are preferred over methods that jointly estimatethe power distribution with masses in P 1 , . . . , P K .

We hereafter introduce the general communication model discussed in the restof this section. We will then derive eigen-inference techniques based on eitherexact small dimensional approaches, asymptotic free deconvolution approaches(as presented in Section 8.2), or the more involved but more efficient Stieltjestransform methods, relying on the theorems derived in Section 8.1.2.

17.2.1 System model

Consider a wireless (primary) network in which K entities are transmittingdata simultaneously on the same frequency resource. Transmitter k ∈ 1, . . . , K has transmission power P k and is equipped with nk antennas. We denoten K

k=1 nk the total number of transmit antennas within the primary network.Consider also a secondary network composed of a total of N , N > n , sensingdevices (either N single antenna devices or multiple devices equipped with atotal of N antennas); we refer to the N sensors collectively as the receiver . This

scenario relates in particular to the conguration depicted in Figure 17.7.To ensure that every sensor in the secondary network, e.g. in a closed-accessfemto-cell [Claussen et al., 2008], roughly captures the same amount of energyfrom a given transmitter, we need to assume that all distances between a giventransmitter and the individual sensors are alike. This is a realistic assumption forinstance for an in-house femto-cell network, where all sensors lie in a restrictedspace and transmitters are found far away from the sensors. Denote H k ∈C N ×n k

the channel matrix between transmitter k and the receiver. We assume thatthe entries of √ N H k are i.i.d. with zero mean, unit variance, and nite fourthorder moment. At time instant m, transmitter k emits the signal x (m )

k

∈C n k ,

with entries assumed to be independent, independent along m, k , identicallydistributed along m, and all have zero mean, unit variance, and nite fourthorder moment (the x (m )

k need not be identically distributed along k). Assume

Page 459: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 459/562

17.2. Blind multi-source localization 435

Figure 17.7 A cognitive radio network.

further that at time instant m the receive signal is impaired by additive whitenoise with entries of zero mean, variance σ2 , and nite fourth order moment onevery sensor; we denote σw (m )

∈C N the receive noise vector where the entries

of w (m )k have unit variance. At time m, the receiver therefore senses the signal

y (m )∈

C N dened as

y (m ) =K

k=1

P k H k x (m )

k + σw (m ) .

Assuming the channel fading coefficients are constant over at least M consecutive sampling periods, by concatenating M successive signal realizationsinto Y = [y (1) , . . . , y (M ) ] ∈C N ×M , we have:

Y =K

k=1 P k H k X k + σW

where X k = [x (1)k , . . . , x (M )

k ] ∈C n k ×M and W = [w (1) , . . . , w (M ) ] ∈C N ×M . Thiscan be further rewritten as

Y = HP12 X + σW (17.2)

where P ∈R n ×n is diagonal with rst n1 entries P 1 , subsequent n2 entries P 2 , etc.and last nK entries P K , H = [H 1 , . . . , H K ] ∈C N ×n and X = [X T

1 , . . . , X TK ]

T

∈C n ×M . By convention, we assume P 1 ≤ . . . ≤ P K .

Remark 17.2. The statement that √ N H , X and W have independent entries of nite fourth order moment is meant to provide as loose assumptions as possibleon the channel, signal, and noise properties. In the simulations carried out laterin this section, the entries of H , W are taken Gaussian. Nonetheless, accordingto our assumptions, the entries of X need not be identically distributed, butmay originate from a maximum of K distinct distributions. This translatesthe realistic assumption that different data sources may use different symbol

Page 460: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 460/562

436 17. Estimation

constellations (e.g. M -QAM, M -PSK); the nite fourth moment assumptionis obviously veried for nite constellations. These assumptions though aresufficient requirements for the analysis performed later in Section 17.2.5. Section17.2.2 and Section 17.2.4, in contrast, will require much stronger assumptions onthe system model, namely that the random matrices √ N H , X , and W underconsideration are Gaussian with i.i.d. entries. The reason is these sections arebased methods that require the involved matrices to be unitarily invariant.

Our objective is to infer the values of the powers P 1 , . . . , P K from therealization of a single random matrix Y . This is successively performedfrom different approaches in the following sections. We will start with smalldimensional optimal maximum-likelihood and minimum mean square errorestimators, using similar derivations as in Chapter 16. Due to the computationalcomplexity of the method, we then consider large dimensional approaches. Thest of those is the conventional approach that assumes n small, N much largerthan n, and M much larger than N . This will lead to a simple althoughlargely biased estimation algorithm when tested in practical small dimensionalscenarios. This algorithm will be corrected rst by using moment approachesand specically free deconvolution approaches, although this approach requiresstrong system assumptions and will be proved not to be very efficient, both withrespect to performance and to computational effort. Finally, the latter will befurther improved using Stieltjes transform approaches in the same spirit as in

Section 17.1.

17.2.2 Small dimensional inference

This rst approach consists in evaluating the exact distribution of the powersP 1 , . . . , P K given the observations Y , modeled in ( 17.2), when H , X , and W areassumed Gaussian. Noticing that we can write Y under the unitarily invariantform

Y = HP12 σI N

XW

(17.3)

the derivation unfolds similarly as that proposed in Chapter 16 with thenoticeable exception that the matrix N HPH H has now a correlated Wishartdistribution instead of an uncorrelated Wishart distribution. This makes thecalculus somewhat more involved. We do not provide the successive steps of thefull derivations that mimic those of Chapter 16 and that can be found in detailin [Couillet and Guillaud, 2011] . The nal result is given as follows.

Theorem 17.4 ([Couillet and Guillaud , 2011]). Assume P 1 , . . . , P K are all different and have multiplicity n1 = . . . = nK = 1 , hence n = K . Then, denoting

Page 461: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 461/562

17.2. Blind multi-source localization 437

λ = ( λ1 , . . . , λ N ) the eigenvalues of 1M YY H , P Y |P 1 ,...,P K (Y ) reads:

P Y|P

1,...,P

K(Y ) =

C (−1)Nn +1 eNσ 2 ni =1

1P i

σ2( N −n )( M −n ) ni =1 P M −n +1i ∆( P ) a∈

S N n

(

−1)|a |sgn(a )e−M

σ 2 |λ [a ]|

× ∆(diag( λ [a ]))

∆(diag( λ ))b∈

S n

sgn(b )n

i =1

J N −M −1Nσ 2

P bi

, N Mλ a i

P bi

where SN n is the set of all permutations of n-subsets of 1, . . . , N , Sn = Sn

n ,

|x | = i x i , x is the complementary of the set x , x[a ] is the restriction of x tothe indexes stored in the vector a , ∆( X ) is the Vandermonde determinant of the matrix X , the constant C is given by:

C = 1πNM n! N n (M

−n −1

2 )

M n (N −n +12 )

and J k (x, y ) is the integral form dened in (16.6) by

J k (x, y ) = 2 yk +1

2 K −k−1(2√ y) − x

0uk e−u −y

u du.

The generalization of Theorem 17.4 to powers P 1 , . . . , P K of multiplicitiesgreater than one is obtained by exploiting Theorem 2.9. The nal result howevertakes a more involved form which we do not provide here.

From Theorem 17.4, we can derive the maximum likelihood (ML) estimatorP

(ML)= P (ML)

1 , . . . , P (ML)K of the joint ( P 1 , . . . , P K ) vector as

P (ML)

= arg maxP 1 ,...,P K

P Y |P 1 ,...,P K (Y )

or the minimum mean square error (MMSE) estimator P (MMSE)

=P (MMSE)

1 , . . . , P (MMSE)K as

P (MMSE)

=

[0,∞) K

(P 1 , . . . , P K )P P 1 ,...,P K |Y (P 1 , . . . , P K )dP 1 . . . dP K

with P P 1 ,...,P K |Y (Y )P Y (Y ) = P Y |P 1 ,...,P K (Y )P P 1 ,...,P K (P 1 , . . . , P K ). Underuniform a priori distribution of the powers, this is simply

P (MMSE)

= [0,∞)K (P 1 , . . . , P K )P Y |P 1 ,...,P K (P 1 , . . . , P K )dP 1 . . . dP K

[0,∞)K P Y |P 1 ,...,P K (P 1 , . . . , P K )dP 1 . . . dP K .

However, both approaches are computationally complex, the complexityscaling exponentially with N in particular, and require multi-dimensional linesearches on ne grids for proper evaluation, which also do not scale nicely withgrowing K . We will therefore no longer consider this optimal approach, whichis only useful in providing lower bounds on the inference performance for smallsystem dimensions. We now turn directly to alternative estimators using largedimensional random matrix theory.

Page 462: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 462/562

438 17. Estimation

17.2.3 Conventional large dimensional approach

The classical large dimensional approach assumes numerous sensors in order to

have much diversity in the observation vectors, as well as an even larger numberof observations so as to create an averaging effect on the incoming random data.In this situation, let us consider the system model ( 17.3) and now denote λ1 ≤. . . ≤ λN the ordered eigenvalues of 1

M YY H (the non-zero eigenvalues of whichare almost surely different).

Appending Y ∈C N ×M into the larger matrix Y ∈C (N + n )×M

Y = HP12 σI N

0 0XW

we recognize that, conditional on H , 1

M YY H is a sample covariance matrix , for

which the population covariance matrix is

THPH H + σ2I N 0

0 0

and the random matrix

XW

has independent (non-necessarily identically distributed) entries of zero mean

and unit variance. The population covariance matrix T , whose upper left entriesalso form a matrix unitarily equivalent to a sample covariance matrix, clearly hasan almost sure l.s.d. as N grows large for xed or slowly growing n. ExtendingTheorem 3.13 and Theorem 9.1 to c = 0 and applying them twice (once for thepopulation covariance matrix T and once for 1

M YY H ), we nally have that, asM,N,n → ∞ with M/N → ∞ and N/n → ∞, the distribution of the largest neigenvalues of 1

M YY H is asymptotically almost surely composed of a mass σ2 +P 1 of weight lim n1 /n , a mass σ2 + P 2 of weight lim n2 /n , etc. and a mass σ2 +P K of weight lim nK /n . As for the distribution of the smallest N −n eigenvaluesof 1

M YY H , it converges to a single mass in σ2 .

If σ2 is a priori known, a rather trivial estimator of P k is then given by:

1nk i∈

N k

(λ i −σ2)

where we denoted N k = N − K j = k nj + 1 , . . . , N − K

j = k +1 n j and we recallthat λ1 ≤ . . . ≤ λN are the ordered eigenvalues of 1

M YY H .This means in practice that P K is asymptotically well approximated by

the averaged value of the nK largest eigenvalues of 1M YY H , P K −1 is well

approximated by the averaged value of the nK

−1 eigenvalues before that, etc.

This also assumes that σ2 is perfectly known at the receiver. If it were not,observe that the averaged value of the N −n smallest eigenvalues of 1

M YY H isa consistent estimate for σ2 . This therefore leads to the second estimator P ∞k for

Page 463: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 463/562

17.2. Blind multi-source localization 439

P k , that will constitute our reference estimator

P ∞k = 1nk i∈

N kλ i − σ

2

where

σ2 = 1

N −n

N −n

i =1λ i .

Note that the estimation of P k only relies on nk contiguous eigenvaluesof 1M YY H , which suggests that the other eigenvalues are asymptotically

uncorrelated from these. It will turn out that the improved ( n,N,M )-consistentestimator does take into account all eigenvalues for each k, in a certain manner.

As a reference example, we assume the scenario of three simultaneoustransmissions with transmit powers P 1 , P 2 , and P 3 equal to 1 / 16, 1/ 4, and1, respectively. We assume that each user possesses four transmit antennas, i.e.K = 3 and n1 = n2 = n3 = 4. The receiver is an array of N = 24 sensors, thatsamples as many as 128 independent (and identically distributed) observations.The SNR is set to 20 dB. In this reference scenario, we assume that K , n1 , n2 ,n3 are known. The question of estimating these values will be discussed later inSection 17.2.6. In Figure 17.8 and Figure 17.9, the performance of the estimatorP ∞k for k ranging from one to three is evaluated, for 1000 random realizationsof Gaussian channels H , Gaussian additive noise W and QPSK modulated usertransmissions X . This is gathered in Figure 17.8 under the form of an histogramof the estimated P ∞k in linear scale and in Figure 17.9 under the form of thedistribution function of the marginal distribution of the P ∞k in logarithmic scale.While our analysis ensures consistency of the P ∞k estimates for extremely largeM and very large N , we observe that, for not-too-large system dimensions, the

P ∞k are very biased estimates of the true P k powers. In particular here, bothP 1 and P 2 are largely underestimated overall, while P 3 is clearly overestimated.Since the system dimensions under study are rather realistic in practical cognitive(secondary) networks, i.e. the number of sensors is not assumed extremely largeand the number of observation samples is such that the exploration phase isshort, this means that the estimator P ∞k is inappropriate to applications incognitive radios. These performance gures naturally call for improved estimates.In particular, it will turn out that estimates that take into account the facts thatM is not much larger than N and that N is not signicantly larger than n willprovide unbiased estimates in the large dimensional setting, which will be seenthrough simulations to be very accurate even for small system dimensions.

We start with a moment approach, which recalls the free probability andmoment-based eigen-inference methods detailed in Section 8.2 of Chapter 8.

Page 464: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 464/562

440 17. Estimation

116

14

10

5

10

15

20

25

Estimated P ∞k

D e n s i t y

Figure 17.8 Histogram of the P ∞k for k ∈ 1, 2, 3, P 1 = 1 / 16, P 2 = 1 / 4, P 3 = 1,n 1 = n 2 = n 3 = 4 antennas per user, N = 24 sensors, M = 128 samples andSNR = 20 dB.

116

14

10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

P ∞k

D i s t r i b u t i o n f u n c t i o n

Figure 17.9 Distribution function of the estimator P ∞k for k ∈ 1, 2, 3, P 1 = 1 / 16,P 2 = 1 / 4, P 3 = 1, n1 = n 2 = n 3 = 4 antennas per user, N = 24 sensors, M = 128samples and SNR = 20 dB. Optimum estimator shown in dashed lines.

17.2.4 Free deconvolution approach

To be able to proceed with free deconvolution, similar to Section 17.2.2, we willneed to take some further assumptions on the system model at hand. Precisely,we will require the random matrices H , W , and X to be lled with Gaussian

Page 465: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 465/562

17.2. Blind multi-source localization 441

i.i.d. entries. That is, we no longer allow for arbitrary modulated transmissionssuch as QPSK or QAM (at least from a theoretical point of view). This is akey assumption to ensure asymptotic freeness of products and sums of randommatrices. For the sake of diversity in the methods developed in this chapter, weno longer consider the compact model ( 17.3) for Y but instead we will see Yas an information plus noise matrix whose information matrix is random buthas an almost surely l.s.d. We also assume that, as the system dimensions growlarge, we have M/N → c, for k ∈ 1, . . . , K , N/n k → ck and N/n → c0 , with0 < c, c 0 , c1 , . . . , c K < ∞.

We can write

B N 1M

YY H = 1M

HP12 X + σW HP

12 X + σW

H

which is such that, as M , N , and n grow to innity with positive limiting ratios,the e.s.d. of HP

12 XX H P

12 H H converges weakly and almost surely to a limiting

distribution function. This is ensured by iterating twice Theorem 3.13: a rsttime on the sample covariance matrix P

12 H H HP

12 with population covariance

matrix P (which we assume converges towards a l.s.d. composed of K massesin P 1 , . . . , P K ) and a second time on the (conditional) sample covariance matrixHP

12 XX H P

12 H H with population covariance matrix HPH H , that was proved in

the rst step to have an almost sure l.s.d.In what follows, we will denote for readability µ∞Z the probability distribution

associated with the l.s.d. of the Hermitian random matrix Z . We can now referto Theorem 4.9, which ensures under the system model above that the limitingdistribution µ∞B N

of B N (in the free probability terminology) can be expressed asa function of the limiting distribution µ∞1

M HP12 XX H P

12 H H

of 1M HP

12 XX H P

12 H H .

Using the free probability operators, this reads:

µ∞1M HP

12 XX H P

12 H H

= µ∞B N µ 1

cδ σ 2 µ 1

c

where µ 1c

is the probability distribution of a random variable with distributionfunction the Marcenko–Pastur law with ratio 1

c and δ σ 2 the probability

distribution of a single mass in σ2. Remember that the above formula (throughthe convolution operators) translates by denition the fact that all moments of

the left-hand side can be computed iteratively from the moments of the termsin the right-hand side. Since all eigenvalue distributions under study satisfyCarleman’s condition, Theorem 5.1, this is equivalent to saying that the l.s.d. of 1

M HP12 XX H P

12 H H is entirely dened through the l.s.d. of µ∞B N

, a fact which isobvious from Theorem 3.13.

Instead of describing step by step the link between the moments of thel.s.d. of B N and the moments of the l.s.d. of the deconvolved matrices, weperform deconvolution remembering that automated algorithms can provide useffortlessly with the nal relations between the moments of the l.s.d. of B N andthe moments of the l.s.d. of P , i.e. all the sums 1

nK k=1 nk P mk , for all integers

m.

Page 466: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 466/562

442 17. Estimation

Before we move to the second deconvolution step, we rewrite1

M HP12 XX H P

12 H H under the form of a product of a scaled zero Wishart

matrix with another matrix. This is:

µ∞1M P

12 H H HP

12 XX H

= c0µ∞1M HP

12 XX H P

12 H H

+ (1 −c0) δ 0

with µ∞1M P

12 H H HP

12 XX H

the l.s.d. of 1M P

12 H H HP

12 XX H .

In terms of moments, this introduces a scaling factor c0 to all successivemoments of the limiting distribution. Under this form, we can proceed to thesecond deconvolution step, which writes

µ∞P

12 H H HP

12

= µ∞1M P

12 H H HP

12 XX H

µ 1cc 0

,

with µ∞P

12 H H HP

12

the l.s.d. of P12 H H HP

12 .

Note that the scaling factor 1M disappeared due to the fact that X has entries

of unit variance. With the same line of reasoning as before, we then write theresulting matrix under the form of a matrix product containing a Wishart matrixas a second factor. The step is rather immediate here as no additional mass inzero is required to be added

µ∞PH H H = µ∞P

12 H H HP

12

with µ∞PH H H the l.s.d. of PHH

H .The nal deconvolution step consists in removing the effect of the scaledWishart matrix H H H . Incidentally, since H H has N columns and has entriesof variance 1 /N , we nally have the simple expression

µ∞P = µ∞PH H H µ 1c 0

where µ∞P is the probability distribution associated with the l.s.d. of P , i.e. µ∞Pis the probability distribution of a random variable with K masses in P 1 , . . . , P K

with respective weights cc1

, . . . , ccK

. This completes the free deconvolution steps.

It is therefore possible, going algebraically or numerically through the successivedeconvolution steps, to express all moments of the l.s.d. of P as a function of themoments of the almost sure l.s.d. of B N .

Remember now from Section 5.3 that it is possible to generalize further theconcept of free deconvolution to nite dimensional random matrices, if we replaceµ∞B N

, µ∞P and the intermediate probability distributions introduced so far by theprobability distributions of the averaged e.s.d. , instead of the l.s.d. That is, forany random matrix X ∈C N ×N with compactly supported eigenvalue distributionfor all N , similar to Chapter 4, we dene µN

X as the probability distributionwith mth order moment E

xm dF X (x) . Substituting the µ∞X by the µN

X

in the derivations above and changing the denitions of the free convolutionoperators accordingly, see, e.g., [Masucci et al., 2011], we can nally derive thecombinatorial expressions that link the moments of the eigenvalue distribution

Page 467: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 467/562

17.2. Blind multi-source localization 443

of B N (seen here as a random matrix and not as a particular realization) to1K

K k=1 P mk , for all integer m.

We will not go into excruciating details as to how this expresses theoreticallyand will merely state the rst three moment relations, whose output wasgenerated automatically by a computer software, see, e.g., [Koev; Ryan, 2009a,b] .Denote

pm K

k=1

nk

n P mk

and

bm 1N

E [tr B mN ]

where the expectation is taken over the joint realization of the random matricesH , X , and W . In the case where n1 = . . . = nK , the rst moments pm and bm

relate together as

b1 = N −1np1 + 1b2 = N −2M −1n + N −1n p2 + N −2n2 + N −1M −1n2 p2

1

+ 2N −1n + 2 M −1n p1 + 1 + NM −1

b3 = 3N −3M −2n + N −3n + 6 N −2M −1n + N −1M −2n + N −1n p3

+ 6N −3M −1n2 + 6 N −2M −2n2 + 3 N −2n2 + 3 N −1M −1n2 p2 p1

+ N −3M −2n3 + N −3n3 + 3 N −2M −1n3 + N −1M −2n3 p31

+ 6N −2M −1n + 6 N −1M −2n + 3 N −1n + 3 M −1n p2

+ 3N −2M −2n2 + 3 N −2n2 + 9 N −1M −1n2 + 3 M −2n2 p21

+ 3N −1M −2n + 3 N −1n + 9 M −1n + 3 NM −2n p1 . (17.4)

As a consequence, if L instances of the random matrices Y (ω1), . . . , Y (ωL ) areavailable, and L grows large, then asymptotically the averaged moments of thee.s.d. of the 1

M Y (ωi )Y (ωi )H converge to the moments b1 , b2 , . . . . This howeverrequires that multiple realizations of Y matrices are indeed available, and that

changes the conditions of the problem in the rst place. Nonetheless, if effectivelyonly one such matrix Y is available, it is possible to handily use this multi-instance approach by breaking down Y into several parts of smaller columndimension. That is, Y can be rewritten under the form Y = [Y 1 , . . . , Y L ],where Y i ∈C N ×(M/L ) , for some L which divides M . Note importantly that thisapproach is totally empirical and not equivalent to L independent realizationsof the random Y matrix for all Y i , since the channel matrix H is kept identicalfor all realizations.

If large dimensions are assumed, then the terms that go to zero in theabove relations must be discarded. Two different approaches can then be takento use the moment approach. Either we assume large dimensions and keepsthe realization of Y as is, or we may rewrite Y under the form of multiplesubmatrices and use the approximated averaged relations. In either case, the

Page 468: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 468/562

444 17. Estimation

relations between consecutive moments must be dealt with carefully. We develophereafter two estimation methods, already introduced in Chapter 8, namely thefast but inaccurate Newton–Girard approach and a computationally expensivemethod, which we will abusively call the maximum-likelihood approach.

17.2.4.1 Newton–Girard methodLet us work in the small system dimension regime and consider the scenariowhere the realization of Y is divided into L submatrices Y 1 , . . . , Y L . TheNewton–Girard approach consists in taking, for m ∈ 1, . . . , K , the estimate

bm = 1NL

L

l=1

tr( Y l YHl )

m

of bm for m = 1, 2, . . . . Remember that those estimates are not L-consistent,since the random realization of H is the same for all Y l (unless the timebetween successive observations of Y i is long enough for the independence of the successive H matrices to hold). From bm , we may then successively takeestimates of p1 , . . . , p m by simply recursively inverting the formulas ( 17.4), withbm replaced by bm . These estimates are denoted ˆ p1 , . . . , ˆ pK .

Newton–Girard formulas, see Section 5.2, allow us to recover estimatesP (mom)

1 , . . . , P (mom)K of the transmit powers P 1 , . . . , P K by inverting the relations

K

k=1

nkn P (mom)k

m

= pm

for m ∈ 1, . . . , K . For instance, in the case of our example where K = 3,n1 = n2 = n3 , we have that P (mom)

1 , P (mom)2 , and P (mom)

3 are the roots of thepolynomial in X

−92

ˆ p31 +

92

ˆ p1 ˆ p2 − ˆ p3 X 2 +92

ˆ p1 − 32

ˆ p2 X + 1 = 0 .

This method is simple and does not require a lot of computational resources,

but fails in accuracy for several reasons discussed in Section 5.2 and whichwe presently recall. First, the free deconvolution approach aims at providingconsistent estimates of the successive moments of the e.s.d. of P , in order toobtain a good estimate on the e.s.d. of P , while our current objective is rather toestimate some or all entries of P instead. Second, the Newton–Girard approachdoes not take into account the fact that the estimated random moments bm , andconsequently the moments ˆ pm , do not all have the same variance around theirmeans. Inverting the moment relations linking the bm to the pm by replacingthe moments by their estimates assumes implicitly that all estimated momentsequally well approximate the true values, which is in fact far from being correct.Finally, the roots of the polynomial that lead to the estimates P (mom)k are notensured to be non-negative and worse not ensured to be real. Post-processingis then required to deal with such estimates. In the simulations below, we will

Page 469: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 469/562

17.2. Blind multi-source localization 445

116

14

10

5

10

15

20

Estimated P (mom)k

D e n s i t y

Figure 17.10 Histogram of the P (mom)k for k ∈ 1, 2, 3, P 1 = 1 / 16, P 2 = 1 / 4, P 3 = 1,

n 1 = n 2 = n 3 = 4 antennas per user, N = 24 sensors, M = 128 samples, andSNR = 20 dB.

simply discard the realizations leading to purely complex or real negative values,altering therefore the nal result.

Figure 17.10 and Figure 17.11 provide the performance of the freedeconvolution approach with Newton–Girard inversion for the same systemmodel as before, i.e. three simultaneous transmissions, each user being equippedwith four antennas, N = 24 sensors, M = 128 samples, and the SNR is 20 dB.We consider the case where L = 1, which is observed to perform overall similarlyas the case where L is taken larger (with M scaled accordingly), with someminor differences. Notice that, although we now moved from a nested M - andN -consistent estimator to an ( N,n,M )-consistent estimator of P 1 , . . . , P K , thecontestable Newton–Girard inversion has very awkward side effects, both interms of bias for small dimensions, especially for the smallest powers, and interms of variance of the estimate, which is also no match for the variance of theprevious estimator when it comes to estimating very low powers. In terms of absolute mean square error, the conventional approach is therefore still betterhere. More elaborate post-processing methods are thus demanded to cope withthe issues of the Newton–Girard inversion. This is what the subsequent sectionis devoted to.

17.2.4.2 ML and MMSE methodsThe idea is now to consider the distribution of the bm moment estimates andto take the estimated powers ( P (mom ,ML)1 , . . . , P (mom ,ML)K ) as the K -tuple thatmaximizes some reward function of the joint variable b1 , . . . , bT , for some integerT . A classical approach is to consider as a reward function the maximum

Page 470: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 470/562

446 17. Estimation

116

14

10

0.1

0.2

0.3

0.4

0.5

0.6

0.70.8

0.9

1

P (mom)k

D i s t r i b u t i o n f u n c t i o n

Figure 17.11 Distribution function of the estimator P (mom)k for k ∈ 1, 2, 3, P 1 = 1 / 16,

P 2 = 1 / 4, P 3 = 1, n1 = n 2 = n 3 = 4 antennas per user, N = 24 sensors, M = 128,samples and SNR = 20 dB. Optimum estimator shown in dashed lines.

likelihood of b1 , . . . , bT given ( P (mom ,ML)1 , . . . , P (mom ,ML)

K ) or the minimum meansquare error in the estimation of b

1, . . . , b

T . This however implies that the

joint probability distribution of b1 , . . . , bT is known. Reminding that B N canbe written under the form of a covariance matrix with population covariancematrix whose e.s.d. converges almost surely to a limit distribution function,we are tempted to use Theorem 3.17 and to state that, for all nite T , thevector N (b1 −E[b1], . . . , bT −E[bT ]) converges in distribution and almost surelytowards a T -variate Gaussian random variable. However, the assumptions of Theorem 3.17 do not let the sample covariance matrix be random, so that thecentral limit of the vector N (b1 −E[b1], . . . , bT −E[bT ]) cannot be stated butonly conjectured. For the rest of the coming derivation, we will assume that the

result does hold, as it may likely be the case. We therefore need to computethe covariance matrix of the vector N (b1 −E[b1], . . . , bT −E[bT ]), i.e. we need tocompute all cross-moments

C N ij (P 1 , . . . , P K ) N 2E bi −E[bi ] bj −E[bj ]

for 1 ≤ i, j ≤ T , for some integer T , where the exponent N reminds thesystem dimension. Call C N (P 1 , . . . , P K ) ∈C T ×T the matrix with entriesC N

ij (P 1 , . . . , P K ). As N grows large, C N (P 1 , . . . , P K ) converges point-wise tosome matrix, which we denote C ∞(P 1 , . . . , P K ). The central limit theorem,be it valid in our situation, would therefore state that, asymptotically, thevector N (b1 −E[b1], . . . , bT −E[bT ]) is jointly Gaussian with zero mean andcovariance matrix C ∞(P 1 , . . . , P K ). We therefore determine the estimate vector

Page 471: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 471/562

17.2. Blind multi-source localization 447

P (mom ,ML) ( P (mom ,ML)1 , . . . , P (mom ,ML)

K ) as

P (mom ,ML) = arg inf ( ˜P 1 ,...,

˜P K )P i ≥0

(b

−E[b ])C N ( P 1 , . . . , P K )−1(b

−E[b ])T

where b = ( b1 , . . . , bT )T and the expectations E[ b ] are conditioned with respectto ( P 1 , . . . , P K ) and can therefore be computed explicitly as described in theprevious sections.

Similar to the maximum likelihood vector estimator P (mom ,ML) , we can denethe vector estimator P (mom ,MMSE) ( P (mom ,MMSE)

1 , . . . , P (mom ,MMSE)K ), which

realizes the minimum mean square error of P 1 , . . . , P K as

P (mom ,MMSE)

= 1Z P d P =( P 1 ,..., P K ) T

P m ≥0

P det C N ( P 1 , . . . , P K )

e−( b −E[ b ])C N ( P 1 ,..., P K )−1 ( b −E[ b ])T

with Z a normalization factor.Practically speaking, the problem is two-fold. First, either in the maximum-

likelihood or in the minimum mean square error approach, a multi-dimensionalline search is required. This is obviously extremely expensive compared to theNewton–Girard method. Simplication methods, such as iterative algorithms canbe thought of, although they still require a lot of computations and are rarelyensured to converge to the global extremum sought for. The second problemis that the on-line or off-line computation of C N (P 1 , . . . , P K ) is also extremelytedious. Note that the matrix C N (P 1 , . . . , P K ) depends on the parameters K , M ,N , n1 , . . . , n K , P 1 , . . . , P K and σ2 , and is of size T ×T . Based on combinatorialapproaches, recent work has led to the possibility of a partly on-line, partlyoff-line computation of such matrices. Typically, it is intolerable that tables of C N (P 1 , . . . , P K ) be kept in memory for multiple values of K , M , N , n1 , . . . , n K ,σ2 and P 1 , . . . , P K . It is nonetheless acceptable that reference matrices be kept inmemory so to fast compute on-line C N (P 1 , . . . , P K ) for all K , M , N , n1 , . . . , n K ,

σ2

and P 1 , . . . , P K .

17.2.5 Analytic method

We nally introduce in this section the method based on G-estimation, whichwill be seen to have numerous advantages compared to the previous methods,although it is slightly handicapped by a cluster separability condition. Theapproach relies heavily on the recent techniques from Mestre, established in[Mestre , 2008b] and discussed at length in Chapter 8. This demands much morework than the combinatorial and rather automatic moment free deconvolutionapproach. Nevertheless, it appears that this approach can somewhat bereproduced for different models, as long as exact separation theorems, such asTheorem 7.2 or Theorem 7.8, are available.

Page 472: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 472/562

448 17. Estimation

The main strategy is the following:

• We rst need to study the asymptotic spectrum of B N 1M YY H , as all

system dimensions ( N , n, M ) grow large (remember that K is xed). Forthis, we will proceed by– determining the almost sure l.s.d. of B N as all system dimensions grow

large with nite limiting ratio. Practically, this will allow us to connect theasymptotic spectrum of B N to the spectrum of P ;

– studying the exact separation of the eigenvalues of B N in clusters of eigenvalues. This is necessary rst to determine whether the coming stepof complex integration is possible and second to determine a well-chosenintegration contour for the estimation of every P k .

• Then, we will write P k under the form of the complex integral of a functionalof the spectrum of P over this well-chosen contour. Since the spectrum of Pcan be linked to that of B N (at least asymptotically) through the previousstep, a change of variable will allow us to rewrite P k under the form of anintegral of some functional of the l.s.d. of B N . This point is the key stepin our derivation, where P k is now connected to the observation matrix Y(although only in an asymptotic way).

• Finally, the estimate P k of P k will be computed from the previous step byreplacing the l.s.d. of B N by its e.s.d., i.e. by the truly observed eigenvaluesof 1

M YY H in the expression relating P k to the l.s.d. of B N .

We therefore divide this section into three subsections that analyze successivelythe almost sure l.s.d. of B N , then the conditions for cluster separation, and nallythe actual calculus of the power estimator.

17.2.5.1 Limiting spectrum of B N In this section, we prove the following result.

Theorem 17.5. Let B N = 1M YY H , with Y dened as in (17.2). Then, for M ,

N , n growing large with limit ratios M/N → c, N/n k → ck , 0 < c, c 1 , . . . , c K <

∞, the e.s.d. F B N

of B N converges almost surely to the distribution function F ,whose Stieltjes transform mF (z) satises, for z ∈C +

mF (z) = cmF (z) + ( c −1)1z

(17.5)

where mF (z) is the unique solution with positive imaginary part of the implicit equation in mF

1mF

= −σ2 + 1f −

K

k=1

1ck

P k1 + P k f

(17.6)

in which we denoted f the value

f = (1 −c)mF −czm 2F .

Page 473: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 473/562

17.2. Blind multi-source localization 449

The rest of this section is dedicated to the proof of Theorem 17.5. Firstremember that the matrix Y in (17.2) can be extended into the larger samplecovariance matrix Y

∈C (N + n )×M (conditionally on H )

Y =HP

12 σI N

0 0XW

.

From Theorem 3.13, since H has independent entries with nite fourth ordermoment, we have that the e.s.d. of HPH H converges weakly and almost surely toa limit distribution G as N, n 1 , . . . , n K → ∞ with N/n k → ck > 0. For z ∈C + ,the Stieltjes transform mG (z) of G is the unique solution with positive imaginarypart of the equation in mG

z = − 1mG

+K

k =1

1ck

P k1 + P k mG

. (17.7)

The almost sure convergence of the e.s.d. of HPH H ensures the almost sureconvergence of the e.s.d. of HPH H + σ 2 I N 0

0 0 . Since mG (z) evaluated at z ∈C + isthe Stieltjes transform of the l.s.d. of HPH H + σ2 I N evaluated at z + σ2 , addingn zero eigenvalues, we nally have that the e.s.d. of HPH H + σ 2 I N 0

0 0 tends almostsurely to a distribution H whose Stieltjes transform mH (z) satises

mH (z) = c0

1 + c0mG (z −σ2) − 1

1 + c0

1z

(17.8)

for z ∈C + and c0 the limit of N/n , i.e. c0 = ( c−11 + . . . + c−1

K )−1 .As a consequence, the sample covariance matrix 1

M YY H has a populationcovariance matrix which is not deterministic but whose e.s.d. has an almost surelimit H for increasing dimensions. Since X and W have entries with nite fourthorder moment, we can again apply Theorem 3.13 and we have that the e.s.d. of B N

1M Y

H Y converges almost surely to the limit F whose Stieltjes transform

mF (z) is the unique solution in C +

of the equation in mF

z = − 1mF

+ 1c

1 + 1c0 t

1 + tm F dH (t)

= − 1mF

+1 + 1

c0

cmF 1 −

1mF

mH − 1mF

(17.9)

for all z ∈C + .For z ∈C + , mF (z) ∈C + . Therefore −1/m F (z) ∈C + and we can evaluate

(17.8) at

−1/m F (z). Combining ( 17.8) and ( 17.9), we then have

z = −1c

1mF (z)2 mG −

1mF (z) −σ2 +

1c −1

1mF (z)

(17.10)

Page 474: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 474/562

450 17. Estimation

where, according to ( 17.7), mG (−1/m F (z) −σ2) satises

1

mF (z) =

−σ2 +

1

mG (− 1m F (z ) −σ2) −

K

k=1

1

ck

P k1 + P k mG (− 1m F (z ) −σ2)

.

(17.11)

Together with ( 17.10), denoting f (z) = mG (− 1m F (z ) −σ2) = (1 −c)mF (z) −

czm F (z)2 , this is exactly ( 17.6).Since the eigenvalues of the matrices B N and B N only differ by M −N zeros,

we also have that the Stieltjes transform mF (z) of the l.s.d. of B N satises

mF (z) = cmF (z) + ( c −1)1z

. (17.12)

This completes the proof of Theorem 17.5. For further usage, notice here that(17.12) provides a simplied expression for mG (−1/m F (z) −σ2). Indeed, wehave:

mG (−1/m F (z) −σ2) = −zm F (z)mF (z). (17.13)

Therefore, the support of the (almost sure) l.s.d. F of B N can be evaluated asfollows: for any z ∈C + , mF (z) is given by (17.5), in which mF (z) solves (17.6);the inverse Stieltjes transform formula ( 3.2) then allows us to evaluate F frommF (z), for values of z spanning over the set

z = x + iy, x > 0

and y small. This

is depicted in Figure 17.12, where P has three distinct values P 1 = 1, P 2 = 3,P 3 = 10 and n1 = n2 = n3 , N/n = 10, M/N = 10, σ2 = 0 .1, as well as in Figure17.13 for the same setup but with P 3 = 5.

Two remarks on Figure 17.12 and Figure 17.13 are of fundamental importanceto the following. Similar to the study carried out in Chapter 7, it appears thatthe asymptotic l.s.d. F of B N is compactly supported and divided into up toK + 1 disjoint compact intervals, which we further refer to as clusters . Eachcluster can be mapped onto one or many values in the set σ2 , P 1 , . . . , P K . Forinstance, in Figure 17.13, the rst cluster is mapped to σ2 , the second cluster to

P 1 , and the third cluster to the set P 2 , P 3. Depending on the ratios c and c0and on the particular values taken by P 1 , . . . , P K and σ2 , these clusters are eitherdisjoint compact intervals, as in Figure 17.12, or they may overlap to generatelarger compact intervals, as in Figure 17.13. As is in fact required by the lawof large numbers, for increasing c and c0 , the asymptotic spectrum tends tobe divided into thinner and thinner clusters. The inference technique proposedhereafter relies on the separability of the clusters associated with each P i andto σ2 . Precisely, to be able to derive a consistent estimate of the transmittedpower P k , the cluster associated with P k in F , number it cluster kF , must bedistinct from the neighboring clusters ( k

−1)F and ( k + 1) F , associated with

P k−1 and P k+1 , respectively (when they exist), and also distinct from cluster 1in F associated with σ2 . As such, in the scenario of Figure 17.13, our methodwill be able to provide a consistent estimate for P 1 , but (so far) will not succeed

Page 475: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 475/562

17.2. Blind multi-source localization 451

0.1 1 3 100

0.025

0.05

0.075

0.1

Estimated powers

D e n s i t y

Asymptotic spectrumEmpirical eigenvalues

Figure 17.12 Empirical and asymptotic eigenvalue distribution of 1M YY H when P has

three distinct entries P 1 = 1, P 2 = 3, P 3 = 10, n1 = n 2 = n 3 , c0 = 10, c = 10,σ2 = 0 .1. Empirical test: n = 60.

0.1 1 3 50

0.025

0.05

0.075

0.1

Estimated powers

D e n s i t y

Asymptotic spectrum

Empirical eigenvalues

Figure 17.13 Empirical and asymptotic eigenvalue distribution of 1M YY H when P has

three distinct entries P 1 = 1, P 2 = 3, P 3 = 5, n1 = n 2 = n 3 , c0 = 10, c = 10, σ2 = 0 .1.Empirical test: n = 60.

in providing a consistent estimate for either P 2 or P 3 , since 2F = 3 F . We will seethat a consistent estimate for ( P 2 + P 3)/ 2 is accessible though. Secondly, noticethat the empirical eigenvalues of B N are all inside the asymptotic clusters and,most importantly, in the case where cluster kF is distinct from either cluster 1,

Page 476: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 476/562

452 17. Estimation

(k −1)F or (k + 1) F , observe that the number of eigenvalues in cluster kF isexactly nk . This is what we referred to as exact separation in Chapter 7. Theexact separation for the current model originates from a direct application of the exact separation for the sample covariance matrix of Theorem 7.2 and isprovided below in Theorem 17.7 for more generic model assumptions than inTheorem 7.2. This is further discussed in the subsequent sections.

17.2.5.2 Condition for separabilityIn the following, we are interested in estimating consistently the power P k fora given xed k ∈ 1, . . . , K . We recall that consistency means here that, asall system dimensions grow large with nite asymptotic ratios, the differenceP k −P k between the estimate P k of P k and P k itself converges to zero with

probability one. As previously mentioned, we will show by construction in thesubsequent section that such an estimate is only achievable if the cluster mappedto P k in F is disjoint from all other clusters. The purpose of the present sectionis to provide sufficient conditions for cluster separability. To ensure that clusterkF (associated with P k in F ) is distinct from cluster 1 (associated with σ2) andclusters iF , i = k (associated with all other P i ), we assume now and for the restof this section that the following conditions are fullled:

(i) k satises Assumption 17.1, given as follows.

Assumption 17.1.

K

r =1

1cr

(P r mG,k )2

(1 + P r mG,k )2 < 1, (17.14)

K

r =1

1cr

(P r mG,k +1 )2

(1 + P r mG,k +1 )2 < 1 (17.15)

with mG, 1 , . . . , m G,K the K real solutions to the equation in mG

K

r =1

1cr

(P r mG )3

(1 + P r mG )3 = 1 (17.16)

with the convention mG,K +1 = 0, and

(ii) k satises Assumption 17.2 as follows.

Assumption 17.2. Denoting, for j ∈ 1, . . . , K jG # i ≤ j | i satises Assumption 17.1 (17.17)

we have the two conditions

1 −c0

c0

(σ2mF ,k G )2

(1 + σ2mF ,k G )2 +k G −1

r =1

1cr

(x+G,r + σ2)2m2

F ,k G

(1 + ( x+G,r + σ2)mF ,k G )2

+K G

r = k G

1cr

(x−G,r + σ2)2m2F ,k G

(1 + ( x−G,r + σ2)mF ,k G )2 < c

Page 477: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 477/562

17.2. Blind multi-source localization 453

and

1 −c0

c0

(σ2mF ,k G +1 )2

(1 + σ2

mF ,k G +1 )2 +

kG

r =1

1

cr

(x+G,r + σ2)2m2

F ,k G +1

(1 + ( x+G,r + σ

2)mF ,k G +1 )

2

+K G

r = k G +1

1cr

(x−G,r + σ2)2m2F ,k G +1

(1 + ( x−G,r + σ2)mF ,k G +1 )2 < c

where x−G,i , x+G,i , i ∈ 1, . . . , K G , are dened by

x−G,i = − 1m−G,i

+K

r =1

1cr

P r1 + P r m−G,i

(17.18)

x+G,i = −

1m+

G,i +

K

r =1

1cr

P r1 + P r m+

G,i , (17.19)

with m−G, 1 , m +G, 1 , . . . , m −G,K G

, m +G,K G

the 2K G real roots of (17.14) and mF ,j , j ∈ 1, . . . , K G + 1, the j th real root (in increasing order) of the equation inmF

1 −c0

c0

(σ2mF )3

(1 + σ2mF )3 +j −1

r =1

1cr

(x+G,r + σ2)3m3

F

(1 + ( x+G,r + σ2)mF )3

+K G

r = j

1

cr

(x−G,r + σ2)3m3F

(1 + ( x−G,r + σ2)mF )3 = c.

Although difficult to fathom at this point, the above assumptions will beclaried later. We give here a short intuitive explanation of the role of everycondition. Assumption 17.1 is a necessary and sufficient condition for clusterkG , that we dene as the cluster associated with P k in G (the l.s.d. of HPH H ),to be distinct from the clusters ( k −1)G and ( k + 1) G , associated with P k−1

and P k+1 in G, respectively. Note that we implicitly assume a unique mappingbetween the P i and clusters in G; this statement will be made more rigorousin subsequent sections. Assumption 17.1 only deals with the inner HPH H

covariance matrix properties and ensures specically that the powers to beestimated differ sufficiently from one another for our method to be able to resolvethem.

Assumption 17.2 deals with the complete B N matrix model. It is however anon-necessary but sufficient condition for cluster kF , associated with P k in F ,to be distinct from clusters ( k −1)F , (k + 1) F , and 1 (cluster 1 being associatedwith σ2). The exact necessary and sufficient condition will be stated further in thenext sections; however, the latter is not exploitable in practice and Assumption17.2 will be shown to be an appropriate substitute. Assumption 17.2 is concerned

with the value of c necessary to avoid:(i) cluster kG (associated with P k in G) to further overlap the clusters kG −1

and kG + 1 associated with P k−1 and P k +1 ,

Page 478: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 478/562

454 17. Estimation

−15 −10 −5 0 5 10 15 200

20

40

60

80

100

cluster separability region

c = 10

SNR 1σ 2 [dB]

c

P 3 separability limitP 2 separability limitP 1 separability limit

Figure 17.14 Limiting ratio c as a function of σ2 to ensure consistent estimation of P 1 = 1, P 2 = 3 and P 3 = 10, c0 = 10, c1 = c2 = c3 .

(ii) cluster 1 associated with σ2 in F to merge with cluster kF .

As will become evident in the next sections, when σ2 is large, the tendencyis for the cluster associated with σ2 to become large and overlap the clustersassociated with P 1 , then P 2 , etc. To counter this effect, we must increase c, i.e.take more signal samples. Figure 17.14 depicts the critical ratio c that satisesAssumption 17.2 as a function of σ2 , in the case K = 3, ( P 1 , P 2 , P 3) = (1 , 3, 10),c0 = 10, c1 = c2 = c3 . Notice that, in the case c = 10, below σ2 1, it is possibleto separate all clusters, which is compliant with Figure 17.12 where σ2 = 0 .1.

As a consequence, under the assumption (partly proved later) that ourproposed method can only perform consistent power estimation when the clusterseparability conditions are met, we have two rst conclusions:

• if we want to increase the sensitivity of the estimator, i.e. to be able to separatetwo sources of close transmit powers, we need to increase the number of sensors(by increasing c0);

• if we want to detect and reliably estimate power sources in a noise-limitedenvironment, we need to increase the number of sensed samples (by increasingc).

In the subsequent section, we study the properties of the asymptotic spectrumof HPH H and B N in more detail. These properties will lead to an explanationfor Assumptions 17.1 and 17.2. Under those assumptions, we will then derive theStieltjes transform-based power estimator.

Page 479: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 479/562

17.2. Blind multi-source localization 455

17.2.5.3 Multi-source power inferenceIn the following, we nally prove the main result of this section, which providesthe G-estimator P 1 , . . . , P K of the transmit powers P 1 , . . . , P K .

Theorem 17.6. Let B N ∈C N ×N be dened as B N = 1M YY H with Y dened

as in (17.2), and λ = ( λ1 , . . . , λ N ), λ1 ≤ . . . ≤ λN , be the vector of the ordered eigenvalues of B N . Further, assume that the limiting ratios c0 , c1 , . . . , c K , c and P are such that Assumptions 17.1 and 17.2 are fullled for some k ∈ 1, . . . , K .Then, as N , n, M grow large, we have:

P k −P ka .s.

−→ 0

where the estimate P k is given by:

• if M = N

P k = NM

nk (M −N )i∈

N k

(ηi −µi )

• if M = N

P k = N

nk (N −n)i∈

N k

N

j =1

ηi

(λ j −ηi )2

−1

in which N k = N −K i= k ni + 1 , . . . , N −

K i = k+1 n i, η1 ≤ . . . ≤ ηN are the

ordered eigenvalues of the matrix diag(λ ) − 1N √ λ √ λ T

and µ1 ≤ . . . ≤ µN are the ordered eigenvalues of the matrix diag(λ ) − 1

M √ λ √ λ T

.

Remark 17.3. We immediately notice that, if N < n , the powers P 1 , . . . , P l , with lthe largest integer such that N − K

i = l n i < 0, cannot be estimated since clustersmay be empty. The case N ≤ n turns out to be of no practical interest as clustersalways merge and no consistent estimate of either P i can be described.

The approach pursued to prove Theorem 17.6 relies strongly on the originalidea of [Mestre , 2008a] which was detailed for the case of sample covariancematrices in Section 8.1.2 of Chapter 8. From Cauchy’s integration formula,Theorem 8.5

P k = ck1

2πi C k

1ck

ωP k −ω

= ck1

2πi C k

K

r =1

1cr

ωP r −ω

dω (17.20)

for any negatively oriented contour Ck

⊂C , such that P k is contained in the

surface described by the contour, while for every i = k, P i is outside this surface.The strategy unfolds as follows: we rst propose a convenient integration contourCk which is parametrized by a function of the Stieltjes transform mF (z) of the

Page 480: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 480/562

456 17. Estimation

−1 −13 − 1

100

1

3

10

m −G, 1

m +G, 1

m G, 1

m G

x G

( m G

)

x G (m G )Support of G

Figure 17.15 xG (m G ) for mG real, P diagonal composed of three evenly weightedmasses in 1, 3 and 10. Local extrema are marked in circles, inexion points aremarked in squares.

l.s.d. of B N ; this is the technical part of the proof. We then proceed to a variablechange in ( 17.20) to express P k as a function of mF (z). We evaluate the complexintegral resulting from replacing the limiting mF (z) in (17.20) by its empirical

counterpart mB N (z) = 1N tr( B N −zI N )−1 . This new integral, whose value wename P k , is shown to be almost surely equal to P k in the large N limit. It thensuffices to evaluate P k , which is just a matter of residue calculus.

We start by determining the integration contour Ck . For this, we rst need tostudy the distributions G and F in more detail, following the study carried outin Chapter 7.

Properties of G and F .First consider the matrix HPH H , and let the function xG (mG ) be dened, forscalars m

G ∈R

\ 0,

−1/P

1, . . . ,

−1/P

K , by

xG (mG ) = − 1mG

+K

r =1

1cr

P r1 + P r mG

. (17.21)

The function xG (mG ) is depicted in Figure 17.15 and Figure 17.16 for thecases where c0 = 10, c1 = c2 = c3 and (P 1 , P 2 , P 3) equal (1 , 3, 10) and (1 , 3, 5),respectively. As expected by Theorem 7.4, xG (mG ) is increasing for mG such thatxG (mG ) is outside the support of G. Note now that the function xG presentsasymptotes in the positions −1/P 1 , . . . , −1/P K

limm G ↓(−1/P i ) xG (mG ) = ∞lim

m G ↑(−1/P i )xG (mG ) = −∞

Page 481: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 481/562

17.2. Blind multi-source localization 457

−1 −13 −1

5zero

1

3

5

m G

x G

( m G

)

x G (m G )Support of G

Figure 17.16 xG (m G ) for mG real, P diagonal composed of three evenly weightedmasses in 1, 3 and 5. Local extrema are marked in circles, inexion points are markedin squares.

−1 −13 − 1

10zero

0.11

3

10

m F

x H

( m F

)

x H (m F )

Support of F Support of −1/H

Figure 17.17 xF (m F ) for mF real, σ2 = 0 .1, c = c0 = 10, P diagonal composed of three evenly weighted masses in 1, 3 and 10. The support of F is read on the rightvertical axis.

and that xG (mG ) → 0+ as mG → −∞. Note also that, on its restriction to theset where it is non-decreasing, xG is increasing. To prove this, let mG and mG betwo distinct points such that xG (mG ) > 0 and xG (mG ) > 0, and mG < m G < 0.

Page 482: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 482/562

458 17. Estimation

We indeed have 1

xG (mG )

−xG (mG ) =

mG −mG

mG mG

1

K

r =1

1

cr

P 2r

(P r + 1m G )(P r +

1m G )

. (17.22)

Noticing that, for P i > 0

P iP i + 1

m G−

P iP i + 1

m G

2

= P 2i

(P i + 1m G

)2 + P 2i

(P i + 1m G

)2 −2 P 2i

(P i + 1m G

)(P i + 1m G

)

> 0

we have, after taking the opposite and the sum over i ∈ 1, . . . , K and adding2 on both sides

1 −K

r =1

1cr

P 2r(P r + 1

m G)2 + 1 −

K

r =1

1cr

P 2r(P r + 1

m G)2

< 2 −2K

r =1

1cr

P 2r(P r + 1

m G)(P r + 1

m G)

.

Since we also have

xG (mG ) = 1m2

G1 −

K

r =1

1cr

P 2r(P r + 1

m G)2 ≥ 0

xG (mG ) = 1

(mG )2 1 −K

r =1

1cr

P 2r(P r + 1

m G)2 ≥ 0

we conclude that the term in brackets in ( 17.22) is positive and then thatxG (mG ) −xG (mG ) > 0. Hence xG is increasing on its restriction to the set whereit is non-decreasing.

Notice also that xG , both in Figure 17.15 and Figure 17.16, has exactlyone inexion point on each open set ( −1/P i−1 , −1/P i ), for i ∈ 1, . . . , K , withconvention P 0 = 0 + . This is proved by noticing that xG (mG ) = 0 is equivalent

toK

r =1

1cr

P 3r m3G

(1 + P r mG )3 −1 = 0 . (17.23)

Now, the left-hand side of ( 17.23) has derivative along mG

3K

r =1

1cr

P 3r m2G

(1 + P r mG )4 (17.24)

which is always positive. Notice that the left-hand side of ( 17.23) has asymptotes

for mG = −1/P i for all i ∈ 1, . . . , K and has limits 0 as mG → 0 and 1/c 0 −1

1 This proof is borrowed from the proof of [Mestre , 2008b] , with different notations.

Page 483: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 483/562

17.2. Blind multi-source localization 459

as mG → −∞. If c0 > 1, Equation ( 17.23) (and then xG (mG ) = 0) thereforehas a unique solution in ( −1/P i−1 , −1/P i ) for all i ∈ 1, . . . , K . When xG isincreasing somewhere on (

−1/P i−1 ,

−1/P i ), the inexion point, i.e. the solution

of xG (mG ) = 0 in ( −1/P i−1 , −1/P i ), is necessarily found in the region where xG

increases. If c0 ≤ 1, the leftmost inexion point may not exist.From the discussion above and from Theorem 7.4 (and its corollaries discussed

in Section 7.1.3), it is clear that the support of G is divided into K G ≤ K compactsubsets [x−G,i , x+

G,i ], i ∈ 1, . . . , K G . Also, if c0 > 1, G has an additional mass inzero of probability G(0) −G(0−) = ( c0 −1)/c 0 ; this mass will not be counted asa cluster in G. Observe that every P i can be uniquely mapped to a correspondingsubset [x−G,j , x+

G,j ] in the following fashion. The power P 1 is mapped onto the rstcluster in G; we then have 1 G = 1. Then the power P 2 is either mapped onto

the second cluster in G if xG increases in the subset ( −1/P 1 , −1/P 2), which isequivalent to saying that xG (mG, 2) > 0 for mG, 2 the only solution to xG (mG ) = 0in (−1/P 1 , −1/P 2); in this case, we have 2 G = 2 and the clusters associated withP 1 and P 2 in G are distinct. Otherwise, if xG (mG, 2) ≤ 0, P 2 is mapped onto therst cluster in F ; in this case, 2 G = 1. The latter scenario visually correspondsto the case when P 1 and P 2 engender “overlapping clusters.” More generally, P j , j ∈ 1, . . . , K , is uniquely mapped onto the cluster jG such that

jG = # i ≤ j | min[xG (mG,i ), xG (mG,i +1 )] > 0with convention mG,K +1 = 0, which is exactly

jG = # i ≤ j | i satises Assumption 17.1when c0 > 1. If c0 ≤ 1, mG, 1 , the zero of xG in (−∞, −1/P 1) may not exist. If c0 < 1, we claim that P 1 cannot be evaluated (as was already observed in Remark17.3). The special case when c0 = 1 would require a restatement of Assumption17.1 to handle the special case of P 1 ; this will however not be done, as it willturn out that Assumption 17.2 is violated for P 1 if σ2 > 0, which we assume.

In the particular case of the power P k of interest in Theorem 17.6, because of Assumption 17.1, xG (mG,k ) > 0. Therefore the index kG of the cluster associated

with P k in G satises kG = ( k −1)G + 1 (with convention 0 G = 0). Also, fromAssumption 17.1, xG (mG,k +1 ) > 0. Therefore ( k + 1) G = kG + 1. In that case,we have that P k is the only power mapped to cluster kG in G, and then we havethe required cluster separability condition.

We now proceed to the study of F , the almost sure limit spectrum distributionof B N . In the same way as previously, we have that the support of F is fullydetermined by the function xF (mF ), dened for mF real, such that −1/m F liesoutside the support of H , by

xF (mF ) =

− 1

mF +

1 + c0

cc0

t

1 + tm F dH (t).

Figure 17.17 depicts the function xF in the system conditions already used inFigure 17.12, i.e. K = 3, P 1 = 1 , P 2 = 3, P 3 = 10, c1 = c2 = c3 , c0 = 10, c = 10,

Page 484: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 484/562

460 17. Estimation

σ2 = 0 .1. Figure 17.17 has the peculiar behavior that it does not have asymptotesas in Figure 17.15 where the population eigenvalue distribution was discrete. Asa consequence, our previous derivations cannot be straightforwardly adapted toderive the spectrum separability condition. If c0 > 1, note also, although it isnot appearing in the abscissa range of Figure 17.17, that there exist asymptotesin the position mF = −1/σ 2 . This is due to the fact that G(0) −G(0−) > 0, andtherefore H (σ2) −H ((σ2)−) > 0. We assume c0 > 1 until further notice.

Applying a second time Theorem 7.4, the support of F is complementary tothe set of real non-negative x such that x = xF (mF ) and xF (mF ) > 0 for acertain real mF , with xF (mF ) given by:

xF (mF ) = 1m2

F −

1 + c0

cc0

t2

(1 + tm F )2 dH (t).

Reminding that H (t) = c0c0 +1 G(t −σ2) + 1

1+ c0δ (t), this can be rewritten

xF (mF ) = 1m2

F − 1c t2

(1 + tm F )2 dG(t −σ2). (17.25)

It is still true that xF (mF ), restricted to the set of mF where xF (mF ) ≥ 0, isincreasing. As a consequence, it is still true also that each cluster of H can bemapped to a unique cluster of F . It is then possible to iteratively map the powerP k onto cluster kG in G, as previously described, and to further map cluster kG inG (which is also cluster kG in H ) onto a unique cluster kF in F (or equivalentlyin F ).

Therefore, a necessary and sufficient condition for the separability of the clusterassociated with P k in F reads:

Assumption 17.3. There exist two distinct real values m( l)F ,k G

< m ( r )F ,k G

such that:

1. xF (m ( l)F ,k G

) > 0, xF (m ( r )F ,k G

) > 02. there exist m( l)

G,k , m ( r )G,k ∈R such that

xG (m ( l)G,k ) = − 1

m ( l)F ,k G

−σ2

xG (m (r )G,k ) = −

1m ( r )

F ,k G−σ2

that satisfy:a. xG (m ( l)

G,k ) > 0, xG (m ( r )G,k ) > 0

b. and

P k−1 < − 1

m ( l)G,k < P k < −

1m (r )

G,k < P k+1 (17.26)

with the convention P 0 = 0 + , P K +1 = ∞.

Page 485: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 485/562

17.2. Blind multi-source localization 461

Assumption 17.3 states rst that cluster kG in G is distinct from clusters ( k −1)G and ( k + 1) G (Item 2b), which is equivalent to Assumption 17.1, and secondthat m ( l)

F ,k G

−1/ (xG (m ( l)

G,k G) + σ2) and m (r )

F ,k G

−1/ (xG (m ( r )

G,k G) + σ2) (which

lie on either side of cluster kG in H ) have respective images x( l)k F

xF (m ( l)F ,k G

)and x(r )

kF xF (m (r )

F ,k G) by xF , such that xF (m ( l)

F ,k G) > 0 and xF (m ( r )

F ,k G) > 0, i.e.

x( l)k F

and x(r )k F

lie outside the support of F , on either side of cluster kF .However, Assumption 17.3, be it a necessary and sufficient condition for the

separability of cluster kF , is difficult to exploit in practice. Indeed, it is notsatisfactory to require the verication of the existence of such m ( l)

F ,k Gand m (r )

F ,k G.

More importantly, the computation of xF requires to know H , which is only fullyaccessible through the non-convenient inverse Stieltjes transform formula

H (x) = 1π limy→0 x

−∞mH (t + iy)dt. (17.27)

Instead of Assumption 17.3, we derive here a sufficient condition for clusterseparability in F , which can be explicitly veried without resorting to involvedStieltjes transform inversion formulas. Note from the clustering of G into K Gclusters plus a mass at zero that ( 17.25) becomes

xF (mF ) = 1m2

F − 1c

K G

r =1 x+G,r

x −G,r

t2

(1 + tm F )2 dG(t −σ2) − c0 −1

cc0

σ4

(1 + σ2mF )2

(17.28)

where we remind that [ x−G,i , x+G,i ] is the support of cluster i in G, i.e.

x−G, 1 , x+G, 1 , . . . , x −G,K G

, x+G,K G

are the images by xG of the 2K G real solutionsto xG (mG ) = 0.

Observe now that the function −t2 / (1 + tm F )2 , found in the integrals of (17.28), has derivative along t

− t2

(1 + tm F )2 = − 2t

(1 + tm F )4 (1 + tm F )

and is therefore strictly increasing when mF < −1/t and strictly decreasingwhen mF > −1/t . For mF ∈ (−1/ (x+G,i + σ2), −1/ (x−G,i +1 + σ2)), we then have

the inequality

xF (mF ) ≥ 1m2

F − 1c

i

r =1

(x+G,r + σ2)2

(1 + ( x+G,r + σ2)mF )2

+K G

r = i +1

(x−G,r + σ2)2

(1 + ( x−G,r + σ2)mF )2 + c0 −1

c0

σ4

(1 + σ2mF )2 . (17.29)

Denote f i (mF ) the right-hand side of ( 17.29). Through the inequality ( 17.29),we then fall back on a nite sum expression as in the previous study of thesupport of G. In that case, we can exhibit a sufficient condition to ensure theseparability of cluster kF from the neighboring clusters. Specically, we only need

Page 486: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 486/562

462 17. Estimation

to verify that f k G −1(mF ,k G ) > 0, with mF ,k G the single solution to f kG −1(mF ) =0 in the set ( −1/ (x+

G,k G −1 + σ2), −1/ (x−G,k G+ σ2)), and f k G (mF ,k G +1 ) > 0,

with mF ,k G +1 the unique solution to f kG(mF ) = 0 in the set (

−1/ (x+

G,k G+

σ2), −1/ (x−G,k G +1 + σ2)). This is exactly what Assumption 17.2 states.Remember now that we assumed in this section c0 > 1. If c0 ≤ 1, then zero is

in the support of H , and therefore the leftmost cluster in F , i.e. that attached toσ2 , is necessarily merged with that of P 1 . This already discards the possibility of spectrum separation for P 1 and therefore P 1 cannot be estimated. It is thereforenot necessary to update Assumption 17.1 for the particular case of P 1 whenc0 = 1.

Finally, Assumptions 17.1 and 17.2 ensure that ( k −1)F < k F < (k + 1) F ,kF = 1, and there exists a constructive way to derive the mapping k → kF . We

are now in position to determine the contour C

k .

Determination of Ck .From Assumption 17.2 and Theorem 7.4, there exist x( l)

k F and x( r )

k F outside the

support of F , on either side of cluster kF , such that mF (z) has limits m( l)F ,k G

mF (x( l)k F

) and m(r )F ,k G

mF (x( r )k F

), as z → x( l)kF

and z → x(r )kF

, respectively, withmF the analytic extension of mF in the points x( l)

kF ∈R and x(r )

k F ∈R . These

limits m ( l)F ,k G

and m ( r )F ,k G

are on either side of cluster kG in the support of −1/H ,and therefore

−1/m ( l)

F ,k G −σ2 and

−1/m ( l)

F ,k G −σ2 are on either side of cluster

kG in the support of G.Consider any continuously differentiable complex path Γ F,k with endpoints

x( l)k F

and x( r )k F

, and interior points of positive imaginary part. We dene thecontour CF,k as the union of Γ F,k oriented from x( l)

kF to x( r )

k F and its complex

conjugate Γ∗F,k oriented backwards from x( r )k F

to x( l)kF

. The contour CF,k is clearlycontinuous and piecewise continuously differentiable. Also, the support of clusterkF in F is completely inside CF,k , while the supports of the neighboring clustersare away from CF,k . The support of cluster kG in H is then inside −1/m F (CF,k ),2

and therefore the support of cluster kG in G is inside CG,k

−1/m F (CF,k )

−σ2 .

Since mF is continuously differentiable on C \ R (it is in fact holomorphicthere [Silverstein and Choi, 1995 ]) and has limits in x( l)

k F and x(r )

k F , CG,k is also

continuous and piecewise continuously differentiable. Going one more step inthis process, we nally have that P k is inside the contour Ck −1/m G (CG,k ),while P i , for all i = k, is outside Ck . Since mG is also holomorphic on C \ R andhas limits in −1/m F (x( l)

k F ) −σ2 and −1/m F (x( r )

k F ) −σ2 , Ck is a continuous and

piecewise continuously differentiable complex path, which is sufficient to performcomplex integration [Rudin, 1986].

2 We slightly abuse notations here and should instead say that the support of cluster kG in H is inside the contour described by the image by −1/m F of the restriction to C + and C − of CF,k , continuously extended to R in the points −1/m ( l )

F ,k Gand −1/m ( r )

F ,k G.

Page 487: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 487/562

17.2. Blind multi-source localization 463

Figure 17.18 depicts the contours C1 , C2 , C3 originating from circularintegration contours CF,k of diameter [x( l)

k F , x ( r )

k F ], k ∈ 1, 2, 3, for the case

of Figure 17.12. The points x( l)k F

and x( r )k F

for kF

∈ 1, 2, 3

are taken to be

x( l)k F

= xF (mF ,k G ), x(r )kF

= xF (mF ,k G +1 ), with mF ,i the real root of f i (mF )in (−1/ (x+

G,i −1 + σ2), −1/ (x−G,i + σ2)) when i ∈ 1, 2, 3, and we take theconvention mG, 4 = −1/ (15 + σ2).

Recall now that P k was dened as

P k = ck1

2πi C k

K

r =1

1cr

ωP r −ω

dω.

With the variable change ω = −1/m G (t), this becomes

P k = ck

2πi C G,k

K

r =1

1cr

−11 + P r mG (t)

mG (t)mG (t)2 dt

= ck

2πi C G,k

mG (t)K

r =1

1cr

P r1 + P r mG (t) −

K

r =1

1cr

mG (t)mG (t)2 dt

= ck

2πi C G,k

mG (t) − 1

mG (t) +

K

r =1

1cr

P r1 + P r mG (t)

+ c0 −1

c0

mG (t)mG (t)2 dt.

From Equation ( 17.7), this simplies into

P k = ck

c0

12πi C G,k

(c0 tm G (t) + c0 −1) mG (t)mG (t)2 dt. (17.30)

Using (17.10) and proceeding to the change of variable t = −1/m F (z) −σ2 ,(17.30) becomes

P k = ck

2πi C F,k

1 + σ2mF (z) − 1

zm F (z) −mF (z)mF (z)2 −

mF (z)mF (z)mF (z)

dz.

(17.31)

This whole process of variable changes allows us to describe P k as a function of mF (z), the Stieltjes transform of the almost sure limiting spectral distributionof B N , as N → ∞. It then remains to exhibit a relation between P k and thee.s.d. of B N for nite N . This is to what the subsequent section is dedicated.

17.2.5.4 Evaluation of P kLet us now dene mB N (z) and mB N (z) as the Stieltjes transforms of theempirical eigenvalue distributions of B N and B N , respectively, i.e.

mB N (z) = 1N

N

i =1

1λ i −z

(17.32)

Page 488: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 488/562

464 17. Estimation

1 3 10−6

−4

−2

0

2

4

6

Real part of Ck , k = 1 , 2, 3

I m a g i n a r y p a r t o f

C k

Figure 17.18 (Negatively oriented) integration contours C1 , C2 and C3 , for c = 10,c0 = 10, P 1 = 1, P 2 = 3, P 3 = 10.

and

mB N (z) = N M mB N (z) −

M

−N

M 1z .

Instead of going further with ( 17.31), dene P k , the “empirical counterpart”of P k , as

P k = nnk

12πi C F,k

N n

1 + σ2mB N (z)

× − 1

zm B N (z) −mB N

(z)mB N (z)2 −

mB N (z)

mB N (z)mB N (z)dz. (17.33)

The integrand can then be expanded into several terms, for which residuecalculus can easily be performed. Denote rst η1 , . . . , η N the N real roots of mB N (z) = 0 and µ1 , . . . , µ N the N real roots of mB N (z) = 0. We identify threesets of possible poles for the aforementioned terms: (i) the set λ1 , . . . , λ N ∩[x( l)

kF , x ( r )

k F ], (ii) the set η1 , . . . , η N ∩[x ( l)

kF , x ( r )

k F ], and (iii) the set µ1 , . . . , µ N ∩

[x( l)kF

, x ( r )k F

]. For M = N , the full calculus leads to

P k = NM nk (M −N )

1≤i≤N x ( l )

k F ≤η i ≤x ( r )k F

ηi −1≤i≤N

x ( l )k F ≤µ i ≤x ( r )

k F

µi

Page 489: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 489/562

17.2. Blind multi-source localization 465

+ N nk

1≤i≤N x ( l )k F ≤η i ≤x ( r )

k F

σ2 −1≤i≤N x ( l )

k F ≤λ i ≤x ( r )k F

σ2 +

1≤i≤N x ( l )k F ≤µ i ≤x ( r )

k F

σ2 −1≤i≤N x ( l )

k F ≤λ i ≤x ( r )k F

σ2 .

(17.34)

We know from Theorem 17.5 that mB N (z) a .s.

−→ mF (z) and mB N (z) a.s.

−→ mF (z)as N → ∞. Observing that the integrand in ( 17.33) is uniformly bounded onthe compact CF,k , the dominated convergence theorem, Theorem 6.3, ensuresP k

a .s.

−→ P k .To go further, we now need to determine which of λ1 , . . . , λ N , η1 , . . . , η N and

µ1 , . . . , µ N lie inside CF,k . This requires a result of eigenvalue exact separation

that extends Theorem 7.1 [Bai and Silverstein, 1998] and Theorem 7.2 [Bai andSilverstein, 1999] , as follows.

Theorem 17.7. Let B n = 1n T

12n X n X H

n T12n ∈C p× p , where we assume the

following conditions:

1. X n ∈C p×n has entries xij , 1 ≤ i ≤ p, 1 ≤ j ≤ n, extracted from a doubly innite array x ij of independent variables, with zero mean and unit variance.

2. There exist K and a random variable X with nite fourth order moment such

that, for any x > 01

n1n2 i≤n 1 ,j ≤n 2

P (|x ij | > x ) ≤ KP (|X | > x ) (17.35)

for any n1 , n 2 .3. There is a positive function ψ(x) ↑ ∞ as x → ∞, and M > 0, such that

maxij

E[|x2ij |ψ(|x ij |)] ≤ M. (17.36)

4. p = p(n) with cn = p/n → c > 0 as n → ∞.5. For each n, T n ∈

C p

× p

is Hermitian non-negative denite, independent of x ij , satisfying H n F T n

⇒ H , H a non-random probability distribution function, almost surely. T

12n is any Hermitian square root of T n .

6. The spectral norm T n of T n is uniformly bounded in n almost surely.7. Let a,b > 0, non-random, be such that, with probability one, [a, b] lies in an

open interval outside the support of F cn ,H n for all large n, with F y,G dened to be the almost sure l.s.d. of 1

n X Hn T n X n when H = G and c = y.

Denote λY1 ≥ . . . ≥ λY

p the ordered eigenvalues of the Hermitian matrix Y ∈C p× p . Then, we have that:

1. P (no eigenvalues of B n appear in [a, b] for all large n) = 1 .2. If c(1 −H (0)) > 1, then x0 , the smallest value in the support of F c,H , is

positive, and with probability one, λB nn → x0 as n → ∞.

Page 490: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 490/562

466 17. Estimation

3. If c(1 −H (0)) ≤ 1, or c(1 −H (0)) > 1 but [a, b] is not contained in [0, x0],then mF c,H (a) < m F c,H (b) < 0. Almost surely, there exists, for all n large, an index in

≥ 0 such that λT n

i n>

−1/m F c,H (b) and λT n

i n +1 >

−1/m F c,H (a) and we

have:

P (λB ni n

> b and λB ni n +1 < a for all large n) = 1 .

Theorem 17.7 is proved in [Couillet et al., 2011c]. This result is more generalthan Theorem 7.2, but the assumptions are so involved that we preferred to stateTheorem 7.2 in Chapter 7 in its original form with i.i.d. entries in matrix X n .

To apply Theorem 17.7 to B N in our scenario, we need to ensure allassumptions are met. Only Items 2–6 need particular attention. In our scenario,

the matrix X n of Theorem 17.7 is ( XW ), while T n is T HPHH

+ σ2

I N 00 0 . Thelatter has been proved to have almost sure l.s.d. H , so that Item 5 is veried.Also, from Theorem 7.1 upon which Theorem 17.7 is based, there exists a subsetof probability one in the probability space that engenders the T over which, for nlarge enough, T has no eigenvalues in any closed set strictly outside the supportof H ; this ensures Item 6. Now, by construction, X and W have independententries of zero mean, unit variance, fourth order moment and are composedof at most K + 1 distinct distributions, irrespective of M . Denote X 1 , . . . , X d ,d ≤ K + 1, d random variables distributed as those distinct distributions. LettingX =

|X 1

|+ . . . +

|X d

|, we have that

1n1n2 i≤n 1 ,j ≤n 2

P (|zij | > x ) ≤ P d

i =1|X i | > x

= P (|X | > x )

where zij is the (i, j )th entry of ( XW ). Since all X i have nite order four moments,

so does X , and Item 2 is veried. From the same argument, Item 3 follows withφ(x) = x2 . Theorem 17.7 can then be applied to B N .

The corollary of Theorem 17.7 applied to B N is that, with probabilityone, for N sufficiently large, there will be no eigenvalue of B N (or B N )outside the support of F , and the number of eigenvalues inside cluster kF

is exactly nk . Since CF,k encloses cluster kF and is away from the otherclusters, λ1 , . . . , λ N ∩[x( l)

k F , x ( r )

k F ] = λ i , i ∈N k almost surely, for all large N .

Also, for any i ∈ 1, . . . , N , it is easy to see from ( 17.32) that mB N (z) → ∞when z ↑ λ i and mB N (z) → −∞ when z ↓ λ i . Therefore mB N (z) has at leastone root in each interval ( λ i−1 , λ i ), with λ0 = 0, hence µ1 < λ 1 < µ 2 < . . . <µN < λ N . This implies that, if k0 is the index such that CF,k contains exactlyλk 0 , . . . , λ k 0 +( n k

−1) , then CF,k also contains

µk 0 +1 , . . . , µ k 0 +( n k

−1)

. The same

result holds for ηk 0 +1 , . . . , η k 0 +( n k −1) . When the indexes exist, due to clusterseparability, ηk 0 −1 and µk0 −1 belong, for N large, to cluster kF −1. We are thenleft with determining whether µk 0 and ηk 0 are asymptotically found inside CF,k .

Page 491: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 491/562

17.2. Blind multi-source localization 467

For this, we use the same approach as in Chapter 8 by noticing that, sincezero is not included in Ck , we have:

12πi C k

1ωdω = 0 .

Performing the same changes of variables as previously, we have:

C F,k

−mF (z)mF (z) −zm F (z)mF (z) −zm F (z)mF (z)z2mF (z)2mF (z)2 dz = 0 . (17.37)

For N large, the dominated convergence theorem, Theorem 6.3, ensures againthat the left-hand side of the ( 17.37) is close to

C F,k

−mB N (z)mB N (z) −zm B N (z)mB N (z) −zm B N (z)mB N

(z)

z2

mB N (z)2

mB N (z)2 dz. (17.38)

The residue calculus of ( 17.38) then leads to

1≤i≤N λ i∈[x

( l )k F

,x ( r )k F

]

2 −1≤i≤N

η i∈[x( l )k F

,x ( r )k F

]

1 −1≤i≤N

µ i∈[x( l )k F

,x ( r )k F

]

1 a.s.

−→ 0. (17.39)

Since the cardinalities of i, η i ∈ [x( l)kF

, x ( r )k F

] and i, µ i ∈ [x( l)kF

, x ( r )k F

] are atmost nk , (17.39) is satised only if both cardinalities equal nk in the limit. As aconsequence, µk 0 ∈ [x

( l)kF

, x ( r )k F

] and ηk 0 ∈ [x( l)kF

, x (r )kF

] for all N large, almost surely.

For N large, N = M , this allows us to simplify ( 17.34) intoP k =

NM nk (M −N )

1≤i≤N λ i∈

N k

(ηi −µi ) (17.40)

with probability one. The same reasoning holds for M = N . This is our nalrelation. It now remains to show that the ηi and the µi are the eigenvalues of diag( λ ) − 1

N √ λ √ λ T

and diag( λ ) − 1M

√ λ √ λ T

, respectively. But this is merelyan application of Lemma 8.1.

This concludes the elaborate proof of Theorem 17.6. We now turn to the proper

evaluation of this last power inference method, for the two system models studiedso far. The rst system model, Scenario (a), has the following characteristics:K = 3 sources, P 1 = 1, P 2 = 3, and P 3 = 10, N = 60 sensors, M = 600 samples,and n1 = n2 = n3 = 2 antennas per transmit source, while for the second systemmodel, Scenario (b): K = 3 sources, P 1 = 1/ 16, P 2 = 1 / 4, N = 24 sensors, M =128 samples, and n1 = n2 = n3 = 4 antennas per transmit source. The histogramand distribution function of the estimated powers for Scenario (b) are depictedin Figure 17.19 and Figure 17.20. Observe that this last estimator seems ratherunbiased and very precise for all three powers under study.

In all previous approaches to the problem of power inference, we have assumedto this point that the number of simultaneous transmissions is known and thatthe number of antennas used by every transmitter is known. In the momentdeconvolution approach, this has to be assumed either when inverting the

Page 492: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 492/562

Page 493: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 493/562

17.2. Blind multi-source localization 469

therefore of prior importance to be rst able to detect the number of simultaneoustransmissions and the number of antennas per user. In the following, we will seethat this is possible using ad-hoc tricks, although in most practical cases, moretheoretical methods are required that are yet to be investigated.

17.2.6 Joint estimation of number of users, antennas and powers

It is obvious that the less is a priori known to the estimator, the less reliableestimation of the system parameters is possible. We discuss hereafter theproblems linked to the absence of knowledge of some system parameters as wellas what this entails from a cognitive radio point of view. Some further commentson the way to use the above estimators are also made.

• If both the number of transmit sources and the number of antennas per sourceare known prior to signal sensing, then all aforementioned methods will givemore or less accurate estimates of the transmit powers. The accuracy dependsin that case on whether transmit sources are sufficiently distinct from oneanother (depending on the cluster separability condition for Theorem 17.6)and on the efficiency of the algorithm used. From a cognitive radio viewpoint,that would mean that the secondary network is aware of the number of usersexploiting a resource and of the number of antennas per user. It is in fact notnecessary to know exactly how many users are currently transmitting, but onlythe maximum number of such users, as the sensing array would then alwaysdetect the maximum amount of users, some transmitting with null power. Theassumption that the cognitive radio is aware of this maximal number of usersper resource is therefore tenable. The assumption that the number of transmitantennas is known also makes sense if the primary communication protocols isknown not to allow multiple antenna transmissions for instance. Note howeverthat the overall performance in that case is rather degraded by the fact thatsingle antenna transmissions do not provide much channel diversity. If this isso, it is reasonable for the sensing array to acquire more samples for different

realizations of channel H , which would take more time, or to be composed of numerous sensors, which might not be a realistic assumption.

• If the number of users is unknown, as discussed in the previous point, thismight not be a dramatic issue on practical grounds if we can at least assumea maximal number of simultaneous transmissions. Typically, though, in awideband CDMA network, a large number of users may simultaneously occupya given frequency resource. If a cognitive radio is to operate on this frequencyresource, it must then cope with the fact that a very large number of usertransmit powers may need be estimated. Nonetheless, and rather fortunately,it is fairly untypical that all transmit users are found at the same location,close to the secondary network. The most remote users would in that case behidden by thermal noise and the secondary network would then only need todeal with the closest users. Anyhow, if ever a large number of users is to be

Page 494: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 494/562

470 17. Estimation

found in the neighborhood of a cognitive radio, it is very unlikely that thefrequency resource be reusable at all.

• If now the number of antennas per user is unknown, then more elaboratemethods are demanded since this parameter is essential to all algorithms.Indeed, for both classical and Stieltjes transform approaches, we need to beable to distribute the empirical eigenvalues of 1

M YY H in several clusters, onefor each source, the size of each cluster matching the number of antennas usedby the transmitter. The same holds true for the exact inference method orthe moment approach that both assume known power multiplicities. Amongthe methods to cope with this issue, we present below an ad-hoc suboptimalapproach. We rst assume for readability that we know the number K of transmit sources (taken large enough to cover all possible hypotheses), some

having possibly a null number of transmit antenna. The approach consists inthe following steps:

1. we rst identify a set of plausible hypotheses for n1 , . . . , n K . This can beperformed by inferring clusters based on the spacing between consecutiveeigenvalues: if the distance between neighboring eigenvalues is more than athreshold, then we add an entry for a possible cluster separation in the listof all possible positions of cluster separation. From this list, we create allpossible K -dimensional vectors of eigenvalue clusters. Obviously, the choiceof the threshold is critical to reduce the number of hypotheses to be tested;

2. for each K -dimensional vector with assumed numbers of antennasn1 , . . . , nK , we use Theorem 17.6 in order to obtain estimates of theP 1 , . . . , P K (some being possibly null);

3. based on these estimates, we compare the e.s.d. F B N of B N to F denedas the l.s.d. of the matrix model Y = H PX + W with P the diagonalmatrix composed of ˆn1 entries equal to P 1 , n2 entries equal to P 2 , etc.up to nK entries equal to P K . The l.s.d. F is obtained from Theorem17.5. The comparison can be performed based on different metrics. In thesimulations carried hereafter, we consider as a metric the mean absolutedifference between the Stieltjes transform of F B N and of F on the segment[−1, −0.1].

A more elaborate approach would consist in analyzing the second orderstatistics of F B N , and therefore determining decision rules, such as hypothesistests for every possible set ( K, n 1 , . . . , n K ).

Note that, when the number of antennas per user is unknown to the receiverand clusters can be clearly identied, another problem still occurs. Indeed, evenif the clusters are perfectly disjoint, to this point in our study, the receiver hasno choice but to assume that the cluster separability condition is always met andtherefore that exactly as many users as visible clusters are indeed transmitting.If the condition is in fact not met, say the empirical eigenvalues corresponding tothe p power values P i , . . . , P i +( p−1) are merged into a single cluster, i.e. with the

Page 495: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 495/562

Page 496: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 496/562

472 17. Estimation

−15 −10 −5 0 5 10 15 20

−20

−10

0

10

SNR [dB]

N o r m a l i z e d m e a n s q u a r e e r r o r

[ d B ] P 1

P 2

P 3P ∞1P ∞2P ∞3

Figure 17.21 Normalized mean square error of individual powers P 1 , P 2 , P 3 ,P 1 = 1 , P 2 = 3 , P 3 = 10, n1 /n = n 2 /n = n 3 /n = 1 / 3 ,n/N = N/M = 1 / 10, n = 6.Comparison between the conventional and Stieltjes transform approaches.

slightly smaller variance shows a large bias as was anticipated. As for the momentmethod, it shows rather accurate performance for the stronger estimated power,

but proves very inaccurate for smaller powers. This follows from the inherentshortcomings of the moment method. The performance of the estimator P k willbe commented on later.

We then focus on the estimate of the larger power P 3 and take now the SNRto range from −15 to 30 dB under the same conditions as previously and for thesame estimators. The NMSE for the estimators of P 3 is depicted in Figure 17.23.The curve marked with squares will be commented on in the next section. Asalready observed in Figure 17.22, in the high SNR regime, the Stieltjes transformestimator outperforms both alternative methods. We also notice the SNR gainachieved by the Stieltjes transform approach with respect to the conventionalmethod in the low SNR regime, as already observed in Figure 17.21. However,it now turns out that in this low SNR regime, the moment method is gainingground and outperforms both cluster-based methods. This is due to the clusterseparability condition, which is not a requirement for the moment approach.This indicates that much can be gained by the Stieltjes transform method in thelow SNR regime if a more precise treatment of overlapping clusters is taken intoaccount.

17.2.7.2 Joint estimation of K , nk , P kSo far, we have assumed that the number of users K and the numbers of antennasper user n1 , . . . , n K were perfectly known. As discussed previously, this maynot be a strong assumption if it is known in advance how many antennas are

Page 497: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 497/562

17.2. Blind multi-source localization 473

116

14

10

0.1

0.2

0.3

0.4

0.5

0.6

0.70.8

0.9

1

Powers

D i s t r i b u t i o n f u n c t i o n

P kP ∞kP (mom)

k

P kPerfect estimate

Figure 17.22 Distribution function of the estimators P ∞k , P k , P k and P (mom)k for

k ∈ 1, 2, 3, P 1 = 1 / 16, P 2 = 1 / 4, P 3 = 1, n1 = n 2 = n 3 = 4 antennas per user,N = 24 sensors, M = 128 samples and SNR = 20 dB. Optimum estimator shown indotted line.

systematically used by every source or if another mechanism, such as in [Chunget al. , 2007], can provide this information. Nonetheless, these are in generalstrong assumptions. Based on the ad-hoc method described above, we thereforeprovide the performance of our novel Stieltjes transform method in the highSNR regime when only n is known; this assumption is less stringent since inthe medium to high SNR regime we can easily decide which eigenvalues of B N belong to the cluster associated with σ2 and which eigenvalues do not.We denote P k the estimator of P k when K and n1 , . . . , n K are unknown. Weassume for this estimator that all possible combinations of 1 to 3 clusters canbe generated from the n = 6 observed eigenvalues in Scenario (a) and that all

possible combinations of 1 to 3 clusters with even cluster size can be generatedfrom the n = 12 eigenvalues of B N in Scenario (b). For Scenario (a), the NMSEperformance of the estimators P k and P k is proposed in Figure 17.24 for theSNR ranging from 5 dB to 30 dB. For Scenario (b), the distribution functionof the inferred P k is depicted in Figure 17.22, while the NMSE performance forthe inference of P 3 is proposed in Figure 17.23; these are both compared againstthe conventional, moment, and Stieltjes transform estimators. We also indicatein Table 17.1 the percentage of correct estimations of the triplet ( n1 , n 2 , n 3)for both Scenario (a) and Scenario (b). In Scenario (a), this amounts to 12such triplets that satisfy nk

≥ 0, n1 + n2 + n3 = 6, while, in Scenario (b), this

corresponds to 16 triplets that satisfy nk ∈ 2N , n1 + n2 + n3 = 12. Observe thatthe noise variance, assumed to be known a priori in this case, plays an importantrole with respect to the statistical inference of the nk . In Scenario (a), for a

Page 498: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 498/562

474 17. Estimation

−5 0 5 10 15 20 25 30−20

−15

−10

−5

0

SNR [dB]

N o r m a l i z e d m e a n s q u a r e e r r o r

[ d B ] P 3

P ∞3P (mom)3

P 3

Figure 17.23 Normalized mean square error of largest estimated power P 3 ,P 1 = 1 / 16, P 2 = 1 / 4, P 3 = 1, n1 = n 2 = n 3 = 4 , N = 24, M = 128. Comparisonbetween conventional, moment, and Stieltjes transform approaches.

SNR RCI (a) RCI (b)

5 dB 0.8473 0.133910 dB 0.9026 0.479815 dB 0.9872 0.481920 dB 0.9910 0.512225 dB 0.9892 0.545530 dB 0.9923 0.5490

Table 17.1. Rate of correct inference (RCI) of the triplet (n 1 , n 2 , n 3 ) for scenarios (a) and(b).

SNR greater than 15 dB, the correct hypothesis for the nk is almost alwaystaken and the performance of the estimator is similar to that of the optimalestimator. In Scenario (b), the detection of the exact cluster separation is lessaccurate and the performance for the inference of P 3 saturates at high SNR to

−16 dB of NMSE, against −19 dB when the exact cluster separation is known.It therefore seems that, in the high SNR regime, the performance of the Stieltjestransform detector is loosely affected by the absence of knowledge about thecluster separation. This statement is also conrmed by the distribution functionof P k in Figure 17.22, which still outperforms the conventional and momentmethods. We underline again here that this is merely the result of an ad-hoc approach; this performance could be greatly improved if, e.g. more was knownabout the second order statistics of F B N .

Page 499: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 499/562

Page 500: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 500/562

476 17. Estimation

Pearson test developed in Chapter 16 unfolds in particular from this Bayesianframework.

Page 501: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 501/562

18 System modeling

Channel modeling is a fundamental eld of research, as all communicationmodels that were used so far in this book as well as in the whole mobilecommunication literature are based on such models. These models are bydenition meant to provide an adequate representation of real practicalcommunication environments. As such, i.i.d. Gaussian channel models are meantto represent the most scattered environment possible where multiple waveformsreecting on a large number of scattering objects add up non-coherently andindependently on the receive antennas. From this point of view, the Gaussianassumption is due to a loose application of the law of large numbers. Due tothe mobility of the communication devices in the propagation environment, thestatistical disposition of scatterers is rather uniform, hence the i.i.d. property.This is basically the physical arguments for using Gaussian i.i.d. models,

along with conrmation by eld measurements. An alternative explanation forGaussian channel models will be provided in this chapter, which accounts fora priori information available at the system modeler, rather than for physicalinterpretations.

However, the complexity of propagation environments call for more involvedmodels. This is why the Kronecker model comes into play to adequately modelcorrelations arising at either communication end, supposedly independent fromone another. This is also why the Rician model is of common use, as it can takeinto account possible line-of-sight components in the channel, which can oftenbe assumed of constant fading value for a rather long time period, comparedto the fast varying scattering part of the channel. Those are channels usuallyconsidered in the scientic literature for their overall simplicity, but whoseaccuracy is somewhat disputed [Ozcelik et al., 2003]. Nonetheless, the literatureis also full of alternative models, mostly based on eld tests. After fty yearsof modern telecommunications, it is still unclear which models to value, whichmodels are relevant and for what reasons. This chapter intends rst to propose a joint information-theoretic framework for telecommunications that encompassesmany topics such as channel modeling, source sensing, parameter estimation,etc. from a common probability-theoretic basis. This enables channel modeling

to be seen, no longer as a remote empirical eld of research, but rather as acomponent of a larger information-theoretic framework. More details about thecurrent literature on channel modeling can be found in [Almers et al., 2007]. The

Page 502: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 502/562

478 18. System modeling

role of random matrix theory in this eld will become clear when we address thequestion of nding the maximum entropy distribution of random matrix models.

Before we proceed with the channel modeling problem, explored throughseveral articles by Debbah, M¨ uller, and Guillaud mainly [de Lacerda Neto et al.,2006; Debbah and M¨uller , 2005; Guillaud et al., 2006], we need to quickly recallthe theoretical foundation of Bayesian probability theory and the maximumentropy principle.

18.1 Introduction to Bayesian channel modeling

The ultimate objective of the eld of channel modeling is to provide uswith probabilistic models that are consistent with the randomness observed in actual communication channels . It therefore apparently makes sense to proberealistic channels in order to infer a general model by “sampling” the observedrandomness. It also makes sense to come up with simple mathematical modelsthat (i) select only the essential features of the channels, (ii) build whateversimple probability distribution around them, and (iii) can be compared toactual channel statistics to see if the unknown random part of the effectivechannel is well approximated by the model. In our opinion, though, these widelyspread approaches suffer from a severe misconception of the word randomness .From a frequentist probability point of view, randomness reects the unseizablecharacter that makes particular events different every time they are observed. Forinstance, we may think of a coin toss as being a physical action ruled by chance, asif it were obeying no law of nature. This is however conceivably absurd when weconsider the same toss coin played in slow motion in such a way that the observercan actually compute from the observed initial rotation speed, height, and otherparameters, such as air friction, the exact outcome of the experiment in advance.Playing a toss coin in slow motion removes its random part. A more strikingexample, that does not require to modify the conditions of the experiment, is

that of consecutive withdrawals of balls concealed in identical boxes with sameinitial content. Assume that we know prior to the rst withdrawal from therst box that some balls are red, the others being blue. From the frequentistconception of randomness, the probability of drawing a red ball at rst, second,or third withdrawal (the nth withdrawal is done from the nth box on the line) isidentical since the boxes are all identical in content and the successive physical events are the same. But this is clearly untrue. Without any need for advancedmathematics, if we are told that the rst box contains red and blue balls (but notthe exact number), we must infer in total honesty that the probability of gettinga blue ball is somewhere around one half. However, if after a hundred successivewithdrawals, all selected balls turned out to be red, we would reasonably thinkthat the probability for the hundred-rst ball withdrawn to be blue is much lowerthan rst anticipated.

Page 503: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 503/562

18.1. Introduction to Bayesian channel modeling 479

From the above examples, it is more sensitive to see randomness , not as theresult of unruled chance, but rather as the result of a lack of knowledge. Themore we know about an event, the more condence we have about the outcomeof this event. This is the basis of Bayesian probability theory. The condence factor here is what we wish to call the probability of the event. This conceptionof probabilities is completely anchored in our everyday life where our decisionsare based on our knowledge and appreciation of the probability of possible events.In the same way, communication channels, be they often called the environment ,as if we did not have any control over them, can be reduced to the knowledgewe have about them. If we know the communication channels at all times, i.e.channels for which we often coin the phrase “perfect channel state information,”then the channel is no longer conceived as random. If we do not have perfect

channel state information at all times, though, then the channel is random inthe sense that it is one realization of all possible channels consistent with thereduced knowledge we have on this channel.

Modeling channels under imperfect state information therefore consists inproviding a probability distribution for all such channels consistent with thisrestricted information. From a condence point of view, this means we mustassign to each potential channel realization a degree of condence in such arealization. This degree of condence must be computed by taking into accountonly the information available, and by discarding as much as possible allsupplementary unwanted hypotheses. It has been proved, successively by Cox

[Cox, 1946], Shannon [Shannon , 1948], Jaynes [Jaynes , 1957a,b], and Shore andJohnson [Shore and Johnson , 1980] that, under a reasonable axiomatic denitionof the information-theoretic key notion of ignorance (preferably referred to asinformation content by Shannon), the most non-committal way to attributedegrees of condence of possible realizations of an event is to assign to itthe probability distribution that has maximal entropy, among all probabilitydistributions consistent with the prior knowledge. That is, the process of assigning degrees of condence to a parameter x, given some information I ,consists in the following recipe:

• Among all probability distributions for x, discard those that are inconsistentwith I . For instance, if I contains information about the statistical mean of x,then all probability distributions that have different means must be discarded;

• among the remaining set SI of such probability distributions, select the onewhich maximizes the entropy, i.e. calling p this probability distribution, wehave:

p arg max p∈

S I p(x)log( p(x))dx.

This principle is referred to as the maximum entropy principle [Jaynes , 2003],which is widely spread in the signal processing community [Bretthorst, 1987] , ineconometrics [Zellner, 1971 ], engineering [Kapur, 1989] , and in general science[Brillouin, 1962] in spite of several century-old philosophical divisions inside the

Page 504: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 504/562

480 18. System modeling

community. Since such philosophical debates are nowhere near the target of thissection, let alone the subject of this book, we claim from now on and withoutfurther justication that the maximum entropy principle is the most reliable toolto build statistical models for parameters regarding which limited information isavailable.

In the next section, we address the channel modeling question through aBayesian and maximum entropy point of view.

18.2 Channel modeling under environmental uncertainty

Fast varying communication channels are systems for which a full parametricaldescription is lacking to the observer. Since these channels are changing too fastin time to be adequately tracked by communication devices without incurringtoo much information feedback, only limited inexpensive information is usuallycollected. For instance, the most trivial quantities that we can collect withouteffort on the successive channels is their empirical mean and their empiricalvariance (or covariance matrix in the case of multiple antenna communications).Assuming channel stationarity (at least in the wide sense), it is then possibleto propose a consistent non-committal model for the channel at hand, byfollowing the steps of the maximum entropy principle. In this particular case

where mean and variance are known, it can in fact be shown that the channelmodel under consideration is exactly the Gaussian matrix channel. Some classicalchannel models such as the correlated Gaussian model and the Kronecker modelwere recovered in [Debbah and M¨uller , 2005] and [Guillaud et al., 2006], whilesome new channel maximum entropy-consistent models were also derived. Inthe following, we detail the methodology used to come up with these modelswhen statistical knowledge is available to the system modeler, similar to theideas developed in [Franceschetti et al., 2003]. In [Debbah and M uller, 2005],the maximum entropy principle is used also when deterministic knowledge isavailable at the system modeler. Both problems lead to mathematically differentapproaches, problems of the former type being rather easy to treat, whileproblems of latter type are usually not tractable. We only deal in this chapterwith problems when statistical information about the channel is available.

Let us consider the multiple antenna wireless channel with nt transmit andn r receive antennas. We assume narrowband transmission so that the MIMOcommunication channels to be modeled are non-frequency selective. Let thecomplex scalar coefficient hi,j denote the channel attenuation between thetransmit antenna j and the receive antenna i, j ∈ 1, . . . , n t , i ∈ 1, . . . , n r .Let H ( t ) denote the nr ×n t channel matrix at time instant t. We recall the

general model for a time-varying at-fading channel with additive noise

y ( t ) = H ( t ) x ( t ) + w ( t ) (18.1)

Page 505: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 505/562

18.2. Channel modeling under environmental uncertainty 481

where the noise vector w ( t )∈

C n r at time t is modeled as a complex circularlysymmetric Gaussian random variable with i.i.d. coefficients (in compliance withmaximum entropy requirements) and x ( t )

C n t denotes the transmit data vectorat time t. In the following, we focus on the derivation of the fading characteristicsof H ( t ) . When we are not concerned with the time-related properties of H ( t ) , wewill drop the time index t, and refer to the channel realization H or equivalentlyto its vectorized notation h vec(H ) = ( h1,1 , . . . , h n r ,1 , h1,2 , . . . , h n r ,n t )T . Letus also denote N n r n t and map the antenna indices into 1 . . . , N ; that is,h = ( h1 , . . . , h N )T .

18.2.1 Channel energy constraints

18.2.1.1 Average channel energy constraintIn this section, we recall the results of [Debbah and M uller, 2005 ] where anentropy-maximizing probability distribution is derived for the case where theaverage energy carried through a MIMO channel is known deterministically.This probability distribution is obtained by maximizing the entropy

C N −log(P H (H ))P H (H )dH

under the only assumption that the channel has a nite average energy NE 0and the normalization constraint associated with the denition of a probability

density, i.e.

C N H 2

F P H (H )dH = N E 0 (18.2)

with H F the matrix Frobenius norm and

C N P H (H )dH = 1 .

This is achieved through the method of Lagrange multipliers, by writing

L(P H ) = − C N log(P H (

H))P H (

H)d

H+ β 1 −

C N P H (H

)dH

+ γ NE 0 − C N ||H ||2F P H (H )dH

where we introduce the scalar Lagrange coefficients β and γ , and by taking thefunctional derivative [Fomin and Gelfand, 2000 ] with respect to P H equal to zero

δL(P H )δP H

= 0 .

This functional derivative takes the form of an integral over H , which is in

particular identically null if the integrand is null for all H . We pick this onesolution and therefore write

−log(P H (H )) −1 −β −γ H 2F = 0

Page 506: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 506/562

482 18. System modeling

for all H .The latter equation yields

P H (H ) = exp −(β + 1) −γ H2F

and the normalization of this distribution according to ( 18.2) nally allows usto compute the coefficients β and γ . Observing in particular that β = −1 andγ = 1

E 0 are consistent with the initial constraints, the nal distribution is givenby:

P H |E 0 (H ) = 1

(πE 0)N exp −N

i =1

|h i |2E 0

. (18.3)

Interestingly, the distribution dened by ( 18.3) corresponds to a complexGaussian random variable with independent fading coefficients, although neitherGaussianity nor independence were among the initial constraints. Via themaximum entropy principle, these properties are the consequence of theignorance of the modeler regarding any constraint other than the total averageenergy NE 0 .

18.2.1.2 Probabilistic average channel energy constraintLet us now introduce a new model for situations where the channel model denedin the previous section applies locally in time but where E 0 cannot be expected

to be constant, e.g. due to short-term shadowing. Therefore, let us replace E 0 in(18.3) by the random quantity E known only through its p.d.f. P E (E ). In thiscase, the p.d.f. of the channel H can be obtained by marginalizing over E , asfollows.

P H (H ) = ∞0

P H ,E (H , E )dE = ∞0

P H |E (H )P E (E )dE. (18.4)

In order to determine the probability distribution P E , let us nd the maximumentropy distribution under the constraints:

• 0

≤ E

≤ E max , where E max represents an absolute constraint on the power

carried through the channel;• the mean E 0 E max

0 EP E (E )dE is known.

Applying once more the Lagrange multiplier method, we introduce the scalarunknowns β and γ , and maximize the functional

L(P E ) = − E max

0log(P E (E ))P E (E )dE + β E max

0EP E (E )dE −E 0

+ γ E max

0P E (E )dE −1 .

Equating the derivative to zero∂L (P E )

∂P E = 0

Page 507: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 507/562

18.2. Channel modeling under environmental uncertainty 483

and picking the solution corresponding to taking all integrand terms of theresulting integral to be identically zero yields

P E (E ) = exp ( βE −1 + γ )and the Lagrange multipliers are nally eliminated by solving the normalizationequations

E max

0E exp( βE −1 + γ ) dE = E 0

E max

0exp( βE −1 + γ ) dE = 1.

The Lagrangian multiplier β < 0 is then the solution to the implicit equation

E max eβE max − 1β

+ E 0 eβE max −1 = 0 (18.5)

and nally P E is obtained as the truncated exponential law

P E (E ) = β

exp( βE max ) −1eβE

for 0 ≤ E ≤ E max and P E (E ) = 0 elsewhere. Note that taking E max = ∞ in(18.5) yields β = − 1

E 0 and the classical exponential law

P E (E ) = E 0e− EE 0 .

The nal maximum entropy model for P H is then:

P H (H ) = E max

0

1(πE )N

β exp( βE )exp( βE max ) −1

exp −N

i =1

|h i |2E

dE.

18.2.1.3 Application to the single antenna channelIn order to illustrate the difference between the two situations presented so far,let us investigate the single input single output (SISO) case nt = n r = 1 wherethe channel is represented by a single complex scalar h. Furthermore, sincethe distribution is circularly symmetric, it is more convenient to consider thedistribution of r |h|. After the change of variables h r (cos θ + i sin θ), andmarginalization over θ, (18.3) becomes

P r (r ) = 2rE 0

e− r 2E 0 (18.6)

whereas ( 18.4) yields

P r (r ) = E max

0

β eβE max −1

2rE

eβE −r 2E dE. (18.7)

Note that the integral always exists since β < 0. Figure 18.1 depicts thep.d.f. of r under known energy constraint (( 18.6), with E 0 = 1) and the knownenergy distribution constraint (( 18.7) is computed numerically, for E max = 2 and

Page 508: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 508/562

484 18. System modeling

0 1 2 3 40

0.2

0.4

0.6

0.8

1

Channel gain

D e n s i t y

E knownE max = 2E max =

Figure 18.1 Amplitude distribution of the maximum entropy SISO channel models, forE 0 = 1, E max ∈ 1, 2, ∞.

E max = ∞, taking E 0 = 1). Figure 18.2 depicts the distribution function of thecorresponding instantaneous mutual information C SISO (r ) log2(1 + 1

σ 2 r 2) fora signal-to-noise ratio 1

σ 2 of 15 dB. The lowest range of the d.f. is of particular

interest for wireless communications since it indicates the probability of a channeloutage for a given transmission rate. The curves clearly show that the modelscorresponding to the unknown energy have a lower outage probability than theGaussian channel model.

We now consider the more involved scenario of channels with known spatialcorrelation at either communication end. We will provide the proofs in theircomplete versions as they are instrumental to the general manipulation of Jacobian determinants, marginal eigenvalue distribution for small dimensionalmatrices, etc. and as they provide a few interesting tools to deal with matrixmodels with unitary invariance.

18.2.2 Spatial correlation models

In this section, we will incorporate several states of knowledge about the spatialcorrelation characteristics of the channel in the framework of maximum entropymodeling. We rst study the case where the correlation matrix is deterministicand subsequently extend the result to an unknown covariance matrix.

18.2.2.1 Deterministic knowledge of the correlation matrixIn this section, we establish the maximum entropy distribution of H underthe assumption that the covariance matrix Q C N hh H P H |Q (H )dH is known,where Q is a N ×N complex Hermitian matrix. Each component of the

Page 509: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 509/562

18.2. Channel modeling under environmental uncertainty 485

0 1 2 3 4 5 6 70

0.2

0.4

0.6

0.8

1

C SISO (r ) [bits/s/Hz]

D i s t r i b u t i o n f u n c t i o n

E knownE max = 2E max =

Figure 18.2 Mutual information distribution for maximum entropy SISO channelmodels, when E 0 = 1, E max ∈ 1, 2, ∞, SNR of 15 dB.

covariance constraint represents an independent linear constraint of the form

C N

ha h∗bP H |Q (H )dH = q a,b

for (a, b) ∈ 1, . . . , N 2 . Note that this constraint makes any previous energyconstraint redundant since C N H 2

F P H |Q (H )dH = tr Q . Proceeding along thelines of the method exposed previously, we introduce N 2 Lagrange coefficientsα a,b , and maximize

L(P H |Q ) = C N −log(P H |Q (H ))P H |Q (H )dH + β 1 − C N P H |Q (H )dH

+a

∈1,...,N

b∈1,...,N

α a,b C N ha h∗bP H |Q (H )dH −q a,b .

Denoting A ∈C N ×N the matrix with ( a, b) entry αa,b and equating thederivative of the Lagrangian to zero, one solution satises

−log(P H |Q (H )) −1 −β −h T Ah ∗ = 0 . (18.8)

Therefore we take

P H |Q (H ) = exp −(β + 1) −h T Ah ∗

which leads, after elimination of the Lagrange coefficients through proper

normalization, to

P H |Q (H , Q ) = 1

det( πQ ) exp −h H Q −1h . (18.9)

Page 510: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 510/562

486 18. System modeling

Again, the maximum entropy principle yields a Gaussian distribution,although of course its components are not independent anymore.

18.2.2.2 Knowledge of the existence of a correlation matrixIt was shown in the previous sections that in the absence of information onspace correlation the maximum entropy modeling yields i.i.d. coefficients for thechannel matrix and therefore an identity covariance matrix. We now consider thecase where the covariance is known to be a parameter of interest but is not knowndeterministically. Again, we will proceed in two steps, rst seeking a probabilitydistribution function for the covariance matrix Q , and then marginalizing thechannel distribution over Q .

Density of the correlation matrix.We rst establish the distribution of Q , under the energy constraint

tr( Q )P Q (Q )dQ = N E 0 , by maximizing the functional

L(P Q ) = S −log(P Q (Q ))P Q (Q )dQ + β SP Q (Q )dQ −1

+ γ Str( Q )P Q (Q )dQ −NE 0 . (18.10)

Due to their structure, covariance matrices are restricted to the space S of N ×N non-negative denite complex Hermitian matrices. Therefore, let usperform the variable change to the eigenvalues and eigenvectors space as wasperformed in Chapter 2 and in more detail in Chapter 16. Specically, denoteΛ diag( λ1 . . . λ N ) the diagonal matrix containing the eigenvalues λ1 , . . . , λ N

of Q and let U be the unitary matrix containing the associated eigenvectors,such that Q = UΛU H .

We use the mapping between the space of complex N ×N self-adjoint matrices(of which S is a subspace) and U(N )/T ×R N

≤ , where U(N )/T denotes the spaceof unitary N ×N matrices with rst row composed of real non-negative entries,

and RN

≤ is the space of real N -tuples with non-decreasing components (seeLemma 4.4.6 of [Hiai and Petz , 2006]). The positive semidenite property of the covariance matrices further restricts the components of Λ to non-negativevalues, and therefore S maps into U(N )/T ×R +

≤N .

Let us now dene the function F over U(N )/T ×R +≤

N as

F (U , Λ ) P Q (UΛU H )

where U ∈U(N )/T and the ordered vector of diagonal entries of Λ lies in R +≤

N .According to this mapping, (18.10) becomes

L(F ) = U (N ) /T ×R + N

≤−log(F (U , Λ ))F (U , Λ )K (Λ )dU dΛ

Page 511: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 511/562

18.2. Channel modeling under environmental uncertainty 487

+ β

U (N ) /T ×

R + N

F (U , Λ )K (Λ )dU dΛ

−1

+ γ U (N ) /T ×R + N

N

i=1λ i F (U , Λ )K (Λ )dU dΛ −NE 0 (18.11)

where we introduced the corresponding Jacobian

K (Λ ) πN (N −1) / 2

N j =1 j ! i<j

(λ i −λ j )2

and used tr Q = tr Λ =

N i =1 λ i . Maximizing the entropy of the distribution P Q

by taking ∂L (F )

∂F = 0 and equating all entries in the integrand to zero yields

−K (Λ ) −K (Λ )log(F (U , Λ )) + βK (Λ ) + γ N

i=1

λ i K (Λ ) = 0 .

Since K (Λ ) = 0 except on a set of measure zero, this is equivalent to

F (U , Λ ) = eβ −1+ γ N i =1 λ i . (18.12)

Note that the distribution F (U , Λ )K (Λ ) does not explicitly depend on U . Thisindicates that U is uniformly distributed, with constant density P U = (2 π)N

over U(N )/T . Therefore, the joint density can be factored under the formF (U , Λ )K (Λ ) = P U P Λ (Λ ) where the distribution of the eigenvalues over R +

≤N

is

P Λ (Λ ) = eβ −1

P Ueγ i =1 ...N λ i πN (N −1) / 2

N j =1 j ! i<j

(λ i −λ j )2 . (18.13)

At this point, notice that the form of ( 18.13) indicates that the order of theeigenvalues is immaterial. Therefore, for the sake of simplicity, we will now workwith the p.d.f. P Λ (Λ ) of the joint distribution of the unordered eigenvalues,dened over R + N . Note that its restriction to the set of the ordered eigenvaluesis proportional to P Λ (Λ ). More precisely

P Λ (Λ ) = Ceγ i =1 ...N λ i

i<j

(λ i −λ j )2 (18.14)

where the value

C = eβ −1

P UπN (N −1) / 2

N ! N j =1 j !

can be determined by solving the normalization equation for the probability

distribution P Λ

R + N P Λ (Λ )dΛ = 1

Page 512: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 512/562

488 18. System modeling

where we used the change of variables xi = −γλ i and the Selberg integral (see(17.6.5) of [Mehta, 2004]). Furthermore, d log( C )

d(−γ ) = N 2

−γ = N E 0 , and we nallyobtain the nal expression of the eigenvalue distribution

P Λ (Λ ) =N E 0

N 2 N

n =1

1n!(n −1)!

e− N E 0 i =1 ...N λ i

i<j

(λ i −λ j )2 . (18.15)

In order to obtain the nal distribution of Q , rst note that since theorder of the eigenvalues is immaterial, the restriction of U to U(N )/T isnot necessary, and Q is distributed as UΛU H where the distribution of Λ

is given by (18.15) and U is a Haar matrix. Furthermore, note that ( 18.15)is a particular case of the density of the eigenvalues of a complex Wishartmatrix, described in Chapter 2. We recall that the complex N ×N Wishartmatrix with K degrees of freedom and covariance Σ , denoted CW N (K, Σ ), isthe matrix A = BB H where B is a N ×K matrix whose columns are complexindependent Gaussian vectors with covariance Σ . Indeed, ( 18.15) describes theunordered eigenvalue density of a CW N (N, E 0

N IN ) matrix. Taking into accountthe isotropic property of the distribution of U , we can conclude that Q itself isalso a CW N (N, E 0

N IN ) Wishart matrix. A similar result with a slightly differentconstraint was obtained by Adhikari in [Adhikari , 2006] where it is shown thatthe entropy-maximizing distribution of a positive denite matrix with knownmean G follows a Wishart distribution with N + 1 degrees of freedom, moreprecisely the CW N (N + 1 , G

N +1 ) distribution.The isotropic property of the obtained Wishart distribution is a consequence

of the fact that no spatial constraints were imposed on the correlation. Theenergy constraint imposed through the trace only affects the distribution of theeigenvalues of Q .

We highlight the fact that the result is directly applicable to the case where thechannel correlation is known to be separable between transmitter and receiver.

In this case, the full correlation matrix Q is known to be the Kronecker productof the transmit Q t and receive Q r correlation matrices. This channel modelis therefore the channel with separable variance prole, or equivalently theKronecker model in the MIMO case. The stochastic nature of Q t and Q r is barelymentioned in the literature, since the correlation matrices are usually assumedto be measurable quantities associated with a particular antenna array shapeand propagation environment. However, in situations where these are not known(for instance, if the array shape is not known at the time of the channel codedesign, or if the properties of the scattering environment cannot be determined),but the Kronecker model is assumed to hold, the above analysis suggests thatthe maximum entropy choice for the distribution of Q t and Q r is independent,complex Wishart distributions with, respectively, nt and nr degrees of freedom.A Kronecker channel representation is provided in Figure 18.3.

Page 513: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 513/562

18.2. Channel modeling under environmental uncertainty 489

Figure 18.3 MIMO Kronecker channel representation, with Q t ∈C n t ×n t . the transmit

covariance matrix, Q r ∈C n r

×n r

the receive correlation matrix and X∈

C n r

×n t

thei.i.d. Gaussian scattering matrix.

Marginalization over Q .The complete distribution of the correlated channel can be obtained bymarginalizing out Q , using its distribution as established in the previousparagraph. The distribution of H is obtained through

P H (H ) = SP H |Q (H , Q )P Q (Q )dQ = U (N )×R + N

P H |Q (H , U , Λ )P Λ (Λ )dU dΛ .

(18.16)Let us rewrite the conditional probability density of ( 18.9) as

P H |Q (h , U , Λ ) = 1

πN det( Λ )e−h H UΛ −1 U H h =

1πN det( Λ )

e−tr( hh H UΛ −1 U H ) .

(18.17)Using this expression in ( 18.16), we obtain

P H (H ) = 1πN R + N U (N )

e−tr( hh H UΛ −1 U H ) dU det( Λ )−1P Λ (Λ )dΛ . (18.18)

Now, similar to the proof of Theorem 16.1, let det( f (i, j )) denote thedeterminant of a matrix with the ( i, j )th element given by an arbitrary functionf (i, j ). Let A be a Hermitian matrix which has its N th eigenvalue AN equalto h H h , and the others A1 , . . . , A N −1 are arbitrary, positive values that willeventually be set to zero. Letting

I (H , A1 , . . . , A N −1) = 1πN R + N U (N )

e−tr( AUΛ −1 U H ) P U dU det( Λ )−1P Λ (Λ )dΛ

the probability P H (H ) can be determined as the limit distribution when the rstN −1 eigenvalues of A go to zero

P H (H ) = limA 1 ,...,A N −1 →0

I (H , A1 , . . . , A N −1).

Page 514: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 514/562

490 18. System modeling

Applying the Harish–Chandra integral of Theorem 2.4 for the now non-singularmatrix A to integrate over U yields

I (H , A1 , . . . , A N −1)

= (−1)

N ( N −1)2

πN

N −1

n =1n! R + N

det e−A iλ j

∆( A )∆( Λ−1) det( Λ )−1P Λ (Λ )dΛ

= 1πN

N −1

n =1n! R + N

det e−A iλ j det( Λ )N −2

∆( A )∆( Λ ) P Λ (Λ )dΛ

= C πN

N −1

n =1n! R + N

det e−A iλ j det( Λ )N −2∆( Λ )

∆( A ) e− N

E 0tr( Λ )

where we used the identity ∆( Λ−1) = det( 1λ i

j −1) = ( −1)N (N +3) / 2 ∆( Λ )det( Λ )N −1 .

Then, we decompose the determinant product using the classical expansionformula. That is, for an arbitrary N ×N matrix X = ( X ij )

det( X ) =a∈

S N

sgn(a )N

n =1X n,a n =

1N !

a ,b∈S N

sgn(a )sgn( b )N

n =1X a n ,bn

where a = ( a1 , . . . , a N ), SN denotes the set of all permutations of 1, . . . , N ,and sgn( a ) is the signature of the permutation a . Using the rst form of thedeterminant expansion, we obtain

∆( Λ )det e−A iλ j = det( λ j −1

i )det( e−A jλ i ) (18.19)

=a ,b∈

S 2N

sgn(a )sgn( b )N

n =1λa n −1

n e−A b nλ n .

Note that in ( 18.19) we used the invariance of the second determinant bytransposition in order to simplify subsequent derivations. Therefore

I (H , A1 , . . . , A N −1)

= C

πN ∆( A )

N −1

n =1n! a ,b∈

S N

sgn(a )sgn( b )N

n =1 R +λN + a n −3

n e−A b nλ n e−

N E 0

λ n dλn

= CN !

πN

N −1

n =1n!

det( f i (Aj ))∆( A )

where we let

f i (x) = R +tN + i−3e−x

t e− N E 0

t dt

Page 515: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 515/562

18.2. Channel modeling under environmental uncertainty 491

and we recognize the second form of the determinant expansion. In order toobtain the limit as A1 , . . . A N −1 go to zero, similar to the proof of Theorem 16.1,we apply Theorem 2.9 with p = 1, N 1 = N

−1 and y1 = 0 since A has only one

non-zero eigenvalue. This yields

P H (H ) = limA 1 ,A 2 ,...,A N −1 →0

I (H , A1 , . . . , A N −1)

= (−γ )N 2

πN xN −1N

N −1

n =1[n!(n −1)!]−1 det f i (0) , f i (0) , . . . , f (N −2)

i (0), f i (xN ) .

(18.20)

At this point, it becomes obvious from ( 18.20) that the probability of H

depends only on its norm (recall that xN = hH

h by denition of A ). Thedistribution of h is isotropic, and is completely determined by the p.d.f. P h H h (x)of having h such that h H h = x.

Thus, for given x, h is uniformly distributed over S N −1(x) = h , h H h = x , thecomplex hypersphere of radius √ x centered on zero. Its volume is V N (x) = π N x N

N ! ,and its surface is S N (x) = dV N (x )

dx = π N x N −1

(N −1)! . Therefore, we can write the p.d.f.of x as

P h H h (x) =

S N

−1

(x )

P H (h )dh

= (−γ )N 2

(N −1)!

N −1

n =1[n!(n −1)!]−1 det f i (0) , f i (0) , . . . , f (N −2)

i (0), f i (x) .

In order to simplify the expression of the successive derivatives of f i , itis useful to identify the Bessel K -function, Section 8.432 of [Gradshteyn andRyzhik , 2000], and to replace it by its innite sum expansion, Section 8.446 of [Gradshteyn and Ryzhik , 2000].

f i (x) = 2 x−γ

i + N

−2

K i+ N −2(2√ −γx )

= ( −γ )−i−N +2i + N −3

k=0

(−1)k (i + N −3 −k)!k!

(−γx )k + ( −1)i + N −1

×+ ∞

k=0

(−γx )i + N −2+ k

k!(i + N −2 + k)! (log(−γx ) −ψ(k + 1) −ψ(i + N −1 + k)) .

Note that there is only one term in the sum with a non-zero pth derivative at

zero. Therefore, the pth derivative of f i at zero is simply (for 0 ≤ p ≤ N −2)

f ( p)i (0) = ( −1)−i−N γ p−i−N +2 (i + N −3 − p)!. (18.21)

Page 516: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 516/562

492 18. System modeling

Let us bring the last column to become the rst, and expand the resultingdeterminant along its rst column

det f (0)i (0), . . . , f

(N

−2)

i (0), f i (x)= ( −1)N −1 det f i (x), f (0)

i (0), . . . , f (N −2)i (0)

= ( −1)N −1N

n =1(−1)1+ n f n (x)det f (0)

i,n (0) , . . . , f (N −2)i,n (0)

where f ( p)i,n (0) is the N −1 dimensional column obtained by removing the nth

element from f ( p)i (0). Factorizing the ( −1) pγ p−i−N +2 in the expression of f ( p)

i (0)out of the determinant yields

det f (0)i,n (0) , . . . , f (N −2)

i,n (0) = (−1)n −N (N +1) / 2γ n −N 2 + N −1 det( G (n ) )

where the N −1 dimensional matrix G (n ) has ( l, k) entry

G (n )l,k = Γ( q (n )

l + N −k −1)

where Γ( i) = ( i −1)! for i positive integer, and

q (n )l = l , l ≤ n −1,

l + 1 , l ≥ n.

Using the fact that Γ( q (n )l + i) = q (n )

l Γ(q (n )l + i −1) + ( i −1)Γ( q (n )

l + i −1),note that the kth column of G (n ) is

G (n )l,k = q (n )

l Γ(q (n )l + N −k −2) + ( N −k −2)G (n )

l,k +1 .

Since the second term is proportional to the ( k + 1)th column, it can beomitted without changing the value of the determinant. Applying this propertyto the rst N −2 pairs of consecutive columns and repeating this process againto the rst N −2, . . . , 1 pairs of columns, we obtain

det( G(n )

) = det Γ(q (n )

l + N −2), . . . , Γ(q (n )

l + 2) , Γ(q (n )

l + 1) , Γ(q (n )

l )= det q (n )

l Γ(q (n )l + N −3), . . . , q (n )

l Γ(q (n )l + 1) , q (n )

l Γ(q (n )l ), Γ(q (n )

l )

= det q (n )l

N −1−kΓ(q (n )

l )

=N i =1 Γ(i)Γ(n)

det q (n )l

N −1−k

=N i =1 Γ(i)Γ(n)

(−1)12 (N −1)( N −2) det q (n )

lk−1

where the last two equalities are obtained, respectively, by factoring out theΓ(q (n )

l ) factors (common to all terms on the lth row) and inverting the orderof the columns in order to get a proper Vandermonde structure. Finally, the

Page 517: Random Matrix Methods for Wireless Communications

8/16/2019 Random Matrix Methods for Wireless Communications

http://slidepdf.com/reader/full/random-matrix-methods-for-wireless-communications 517/562

18.2. Channel modeling under environmental uncertainty 493

determinant can be computed as

det q (n )l

k−1 =

1≤j<i ≤N −1

q (n )i

−q (n )

j

=n −2

i =1

i!N −1

i= n

i!(i −n + 1)!

N −1

i = n +1

(i −n)!

=N −1i =1 i!

(n −1)!(N −n)!.

Wrapping up the above derivations, we obtain successively

\det(G^{(n)}) = \left(\prod_{i=1}^{N-1} i!\right)^{2}\,(-1)^{(N-1)(N-2)/2}\,\frac{1}{[(n-1)!]^2\,(N-n)!}

then:

\det\left[f_{i,n}^{(0)}(0), \ldots, f_{i,n}^{(N-2)}(0)\right] = \left(\prod_{i=1}^{N-1} i!\right)^{2}\,\frac{(-1)^{n+1}\,\gamma^{\,n-N^2+N-1}}{[(n-1)!]^2\,(N-n)!}

which gives

\det\left[f_i^{(0)}(0), \ldots, f_i^{(N-2)}(0), f_i(x)\right] = \sum_{n=1}^{N}(-1)^{1-N} f_n(x)\,\left(\prod_{i=1}^{N-1} i!\right)^{2}\,\frac{\gamma^{\,n-N^2+N-1}}{[(n-1)!]^2\,(N-n)!}.

Finally, we have:

P_{h^H h}(x) = -\sum_{n=1}^{N} f_n(x)\,\frac{\gamma^{\,N+n-1}}{[(n-1)!]^2\,(N-n)!}

where \gamma = -N/E_0. This leads to the maximum entropy distribution for H, given by:

P_H(H) = -\frac{1}{\pi^N\,(h^H h)^{N-1}}\sum_{n=1}^{N} f_n(h^H h)\,\frac{(N-1)!\,\gamma^{\,N+n-1}}{[(n-1)!]^2\,(N-n)!}.
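To give a concrete feel for this density, the following Python sketch evaluates the closed form of P_{h^H h} above and cross-checks it against a Monte Carlo simulation. The Wishart construction Q = (E_0/N) A A^H with A an N × N standard complex Gaussian matrix is one convenient way to draw from an exponential matrix prior proportional to e^{-(N/E_0) tr Q}, and is used here only for the purpose of this illustration; the values of N and E_0 are arbitrary.

```python
import numpy as np
from scipy.special import kv, gammaln
from scipy.integrate import quad

rng = np.random.default_rng(0)
N, E0 = 4, 1.0               # illustrative channel dimension and per-entry energy
gamma = -N / E0

def f(n, x):
    nu = n + N - 2
    return 2.0 * (x / (-gamma)) ** (nu / 2.0) * kv(nu, 2.0 * np.sqrt(-gamma * x))

def p_x(x):
    # Closed-form density of h^H h given in the text.
    out = 0.0
    for n in range(1, N + 1):
        log_c = 2 * gammaln(n) + gammaln(N - n + 1)
        out += f(n, x) * gamma ** (N + n - 1) * np.exp(-log_c)
    return -out

# Checks on the closed form: total mass ~ 1, mean ~ N * E0.
mass, _ = quad(p_x, 1e-10, np.inf, limit=200)
mean, _ = quad(lambda x: x * p_x(x), 1e-10, np.inf, limit=200)

# Monte Carlo: Q = (E0/N) A A^H, then h | Q complex Gaussian with covariance Q.
samples = []
for _ in range(20000):
    A = (rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))) / np.sqrt(2)
    Q = (E0 / N) * A @ A.conj().T
    w = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
    h = np.linalg.cholesky(Q) @ w
    samples.append(np.vdot(h, h).real)
samples = np.array(samples)

print("mass :", mass)                                   # close to 1
print("mean :", mean, "vs MC", samples.mean())          # both close to N * E0
lo, hi = 2.0, 6.0
frac_cf, _ = quad(p_x, lo, hi)
frac_mc = np.mean((samples > lo) & (samples < hi))
print("P(2 < x < 6):", frac_cf, "vs MC", frac_mc)       # should be close
```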

The corresponding p.d.f. is shown in Figure 18.4, as well as the p.d.f. of the instantaneous power of a Gaussian i.i.d. channel of the same size and mean power. As expected, the energy distribution of the proposed model is more spread out than the energy of a Gaussian i.i.d. channel.

Figure 18.5 shows the d.f. curves of the instantaneous mutual information achieved over the channel described in (18.1) by these two channel models. The proposed model differs in particular in the tails of the distribution: for instance, the 1% outage capacity is reduced from 8 to 7 bits/s/Hz with respect to the Gaussian i.i.d. model.


Figure 18.4 Amplitude distribution of the maximum entropy 4 × 4 MIMO channel models, with known identity correlation (Gaussian i.i.d.) or unknown correlation.

Figure 18.5 Mutual information distribution of the maximum entropy 4 × 4 MIMO channel models, with known identity correlation (Gaussian i.i.d.) or unknown correlation, SNR of 10 dB.

18.2.2.3 Limited-rank covariance matrix

In this section, we address the situation where the modeler takes into account the existence of a covariance matrix of rank L < N (we assume that L is known). Such a situation arises in particular when the communication channel is a priori known not to offer numerous degrees of freedom, or when the MIMO antennas on either communication side are known to be close enough for correlation to arise. Figure 18.6 depicts a Kronecker channel environment with limited diversity.

Figure 18.6 MIMO Kronecker channel representation with limited number of scatterers in the propagation environment, with Q_t ∈ C^{n_t × n_t} the (possibly rank-limited) transmit covariance matrix, Q_r ∈ C^{n_r × n_r} the (possibly rank-limited) receive correlation matrix, and X ∈ C^{n_r × n_t} the i.i.d. Gaussian scattering matrix.

As in the full-rank case, we will use the spectral decomposition Q = UΛU^H of the covariance matrix, with Λ = diag(λ_1, . . . , λ_L, 0, . . . , 0). Let us denote Λ_L = diag(λ_1, . . . , λ_L). The maximum entropy probability density of Q with the extra rank constraint is unsurprisingly similar to that derived previously, with the difference that all the energy is carried by the first L eigenvalues, i.e. U is uniformly distributed over U(N), while

P_{\Lambda_L}(\Lambda_L) = \left(\frac{L^2}{NE_0}\right)^{L^2} \prod_{n=1}^{L}\frac{1}{n!\,(n-1)!}\; e^{-\frac{L^2}{NE_0}\sum_{i=1}^{L}\lambda_i}\, \prod_{i<j\le L}\left(\lambda_i - \lambda_j\right)^2. \qquad (18.22)

However, the definition of the conditional probability density P_{H|Q}(h, U, Λ) in (18.9) does not hold when Q is not full rank. The channel vector h becomes a degenerate Gaussian random variable. Its projection onto the L-dimensional subspace associated with the non-zero eigenvalues of Q follows a Gaussian law, whereas the probability of h being outside this subspace is zero. The conditional probability in (18.17) must therefore be rewritten as

P_{H|Q}(h, U, \Lambda_L) = \mathbf{1}_{\mathrm{span}(U_{[L]})}(h)\;\frac{1}{\pi^L \prod_{i=1}^{L}\lambda_i}\; e^{-h^H U_{[L]}\Lambda_L^{-1} U_{[L]}^H h} \qquad (18.23)

where U_{[L]} denotes the N × L matrix obtained by truncating the last N − L columns of U. The indicator function ensures that P_{H|Q}(h, U, Λ) is zero for h outside of the column span of U_{[L]}.

We need now to marginalize U and Λ in order to obtain the p.d.f. of h.

P_H(h) = \int_{\mathrm{U}(N)\times \mathbb{R}_+^{L}} P_{H|Q}(h, U, \Lambda_L)\,P_{\Lambda_L}(\Lambda_L)\, dU\, d\Lambda_L.


However, the expression of P_{H|Q}(h, U, Λ_L) does not lend itself directly to the marginalization described in the previous sections, since the zero eigenvalues of Q complicate the analysis. This can be avoided by performing the marginalization of the covariance in an L-dimensional subspace. In order to see this, consider an L × L unitary matrix B_L and note that the N × N block matrix

B = \begin{pmatrix} B_L & 0 \\ 0 & I_{N-L} \end{pmatrix}

is unitary as well. Since the uniform distribution over U(N) is unitarily invariant, UB is uniformly distributed over U(N) and for any B_L ∈ U(L) we have:

P_H(h) = \int_{\mathrm{U}(N)\times \mathbb{R}_+^{L}} P_{H|Q}(h, UB, \Lambda_L)\,P_{\Lambda_L}(\Lambda_L)\, dU\, d\Lambda_L.

Furthermore, since \int_{\mathrm{U}(L)} dB_L = 1,

P_H(h) = \int_{\mathrm{U}(L)}\int_{\mathrm{U}(N)\times\mathbb{R}_+^{L}} P_{H|Q}(h, UB, \Lambda_L)\,P_{\Lambda_L}(\Lambda_L)\, dU\, d\Lambda_L\, dB_L
       = \int_{U\in \mathrm{U}(N)} \mathbf{1}_{\mathrm{span}(U_{[L]})}(h)\, P_{k}\!\left(U_{[L]}^H h\right) dU \qquad (18.24)

where (18.24) is obtained by letting k = U_{[L]}^H h and

P_{k}(k) = \int_{\mathrm{U}(L)\times\mathbb{R}_+^{L}} \frac{1}{\pi^L \prod_{i=1}^{L}\lambda_i}\; e^{-k^H B_L \Lambda_L^{-1} B_L^H k}\; P_{\Lambda_L}(\Lambda_L)\, dB_L\, d\Lambda_L. \qquad (18.25)

We can then exploit the similarity of (18.25) and (18.18) and, by the same reasoning as in previous sections, conclude directly that k is isotropically distributed in C^L and that its p.d.f. depends only on its Frobenius norm, following

P_{k}(k) = \frac{1}{S_L(k^H k)}\, P_x^{(L)}(k^H k)

where

P_x^{(L)}(x) = \frac{2}{x}\sum_{i=1}^{L}\left(-L\sqrt{\frac{x}{NE_0}}\right)^{L+i} K_{i+L-2}\!\left(2L\sqrt{\frac{x}{NE_0}}\right)\frac{1}{[(i-1)!]^2\,(L-i)!}.

Finally, note that h^H h = k^H k, and that the marginalization over the random rotation that transforms k into h in (18.24) preserves the isotropic property of the distribution. Therefore

P_H(h) = \frac{1}{S_N(h^H h)}\, P_x^{(L)}(h^H h).
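In the same spirit as the full-rank check earlier, the short Python sketch below evaluates P_x^{(L)} for several ranks L and verifies that each candidate density integrates to approximately one; this is essentially the computation behind curves such as those of Figure 18.7. The values N = 16 (a 4 × 4 channel vectorized) and E_0 = 1 are illustrative only.

```python
import numpy as np
from scipy.special import kv, gammaln
from scipy.integrate import quad

N, E0 = 16, 1.0      # illustrative values (a 4 x 4 channel vectorized into N = 16)

def p_x_L(x, L):
    # P_x^{(L)}(x) as given above, for a rank-L covariance constraint.
    a = L * np.sqrt(x / (N * E0))
    out = 0.0
    for i in range(1, L + 1):
        log_c = 2 * gammaln(i) + gammaln(L - i + 1)
        out += (-a) ** (L + i) * kv(i + L - 2, 2 * a) * np.exp(-log_c)
    return 2.0 / x * out

for L in (1, 2, 4, 8, 12, 16):
    # Upper integration limit 100 * N * E0: the tail beyond it is negligible.
    mass, _ = quad(lambda x: p_x_L(x, L), 1e-10, 100 * N * E0, limit=300)
    print(L, mass)   # each value should be close to 1
```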

Examples of the corresponding p.d.f. for L ∈ {1, 2, 4, 8, 12, 16} are represented in Figure 18.7 for a 4 × 4 channel, together with the p.d.f. of the instantaneous power of a Gaussian i.i.d. channel of the same size and mean power. As expected, the energy distribution of the proposed maximum entropy model is more spread out than the energy of a Gaussian i.i.d. channel.


Figure 18.7 Amplitude distribution of the maximum entropy 4 × 4 MIMO channel models, for E_0 = 1, and limited degrees of freedom L ∈ {1, 2, 4, 8, 12, 16}.

The d.f. of the mutual information achieved over the limited-rank (L < 16) and full-rank (L = 16) covariance maximum entropy channel at a signal-to-noise ratio of 10 dB is depicted in Figure 18.8 for various ranks L, together with the Gaussian i.i.d. channel. As already mentioned, the proposed model differs especially in the tails of the distribution. In particular, the outage capacity for low outage probability is greatly reduced with respect to the Gaussian i.i.d. channel model.

18.2.2.4 Discussion

It is important to understand the reason why maximum entropy channels are designed. It is of interest to characterize ergodic and outage capacities when very limited information is known about the channel as this can provide a figure of what mean or minimum transmission rates can be expected in a channel that is known to have limited degrees of freedom. Typically, MIMO communication channels with small devices embedded with multiple antennas tend to have strong correlation. Measuring correlation by a simple scalar number is however rather difficult and can be done through many approaches. Measuring correlation through the number of degrees of freedom left in the channel is one of those. The study above therefore helps us anticipate the outage performance of multiple antenna communications in more or less scattered environments.

Figure 18.8 Mutual information distribution of the maximum entropy 4 × 4 MIMO channel models, with known identity correlation (Gaussian i.i.d.) or unknown correlation, SNR of 15 dB.

Another very interesting feature of maximum entropy channel models is that they can be plugged into problems such as source sensing when the channel environment is known to enjoy some specific features. For instance, remember that in Chapter 16 we derived Neyman–Pearson tests based on the maximum entropy principle, in the sense that we assumed the communication channel was only known to have signal-to-noise ratio 1/σ², in which case we considered a Gaussian i.i.d. channel model (the choice of which is now confirmed by the analysis above). We then assumed the SNR was imperfectly known, so that we obtained an integral form over possible σ² of the Neyman–Pearson test. Alternatively, we could have assumed from the beginning that the channel variance was imperfectly known and used the expressions derived above of the distribution of the channel variance. Consistency of the maximum entropy principle, detailed in [Jaynes, 2003], would then ensure identical results at the end. Now, in the case when further information is known, such as the channel degrees of freedom are limited (for instance when a sensor network with multiple close antennas scans a low frequency resource), adaptive sensing strategies can be put in place that account for the expected channel correlation. In such scenarios, Neyman–Pearson tests can be more adequately designed than when assuming Gaussian i.i.d. propagation channels.

We believe that the maximum entropy approach, often used in signal processing questions, while less explored in wireless communications, can provide interesting solutions to problems dealing with too many unknown variables. Instead of relying on various ad-hoc approaches, the maximum entropy principle manages to provide an information-theoretic optimum solution to a large range of problems. It was in particular noticed in [Couillet et al., 2010] that conventional minimum mean square error channel estimators enter the framework of maximum entropy channel estimation, when only the number of propagation paths in the frequency selective channel is a priori known. Then, for unknown channel delay spread, extensions of the classical minimum mean square error approach can be designed, whose increased complexity can then be further reduced based on suboptimal algorithms. This is the basic approach of maximum entropy solutions, which seek for optimal solutions prior to providing suboptimal implementations, instead of using simplified suboptimal models in the first place. Similarly, a maximum entropy optimal data-aided coarse frequency offset estimator for orthogonal frequency division multiplexing protocols is provided in [Couillet and Debbah, 2010b], along with a suboptimal iterative algorithm.

This completes this chapter on maximum entropy channel modeling.


• More importantly, deterministic equivalents provide an approximation of the performance of such systems for all finite N, and not as N tends to infinity. Based on the previous example, we can imagine the case of a cellular MISO broadcast channel with users being successively connected to or disconnected from the base station. In this scenario, the analyses based on l.s.d. or deterministic equivalents differ as follows.
  – with l.s.d. considerations, the sum rate for all finite N can be approximated by a single value corresponding to some functional of the l.s.d. of the sample covariance matrix when the population covariance matrix models the scenario of an increasingly high user density. The approximation here lies therefore in the fact that the reality does not fit the asymptotic model;
  – with deterministic equivalents, it is possible to derive an approximation of the system performance for every N, whatever the position of the active users. Therefore, even if the large N asymptotic performance (when all users are connected and their number grows to infinity) leads to a unique expression, the performances for all configurations of N users lead to various results. The inaccuracy here does not lie in the system model but rather in the inexactness of the finite N approximation, which is more convenient.
• Remember finally that, for more involved system models, limiting spectral distributions may not exist at all, and therefore the l.s.d. approach cannot be used any longer. This led quite a few authors to assume very unrealistic system models in order for a l.s.d. to exist, so to obtain exploitable results. From the theory of deterministic equivalents developed in Part I, this is unnecessary.

We wanted to insist on the considerations above a second time since random matrix theory applications to wireless communications are suffering from the false impression that the models designed assume an infinite number of antennas, an infinite number of users, etc. and that these models are so unrealistic that the results obtained are worthless. We hope that the reader now has a clear understanding that such criticism, totally acceptable ten years ago, is no longer justified. In Chapters 13–14, we derived approximated expressions of the capacity of multiple antenna systems (single-user MIMO, MIMO MAC, MIMO BC) and observed that the theoretical curves are indiscernible from the simulated curves, sometimes for N as small as 4. In this book, when dealing with practical applications, we systematically and purposely replaced most asymptotic results found in the literature by deterministic equivalents and replaced any mention of the term asymptotic or the phrase infinite size matrices by phrases such as for all finite N or accurate as N grows large.

We believe that much more is to be done regarding deterministic equivalents for more involved system models than those presented in this book. Such models are in particular demanded for the understanding of future cognitive radio networks as well as small-cell networks which will present more elaborate system conditions, such as cooperating base stations, short-range communications with numerous propagation paths, involved interference patterns, intense control data exchange, limited real-time channel state information, etc. With all these parameters taken into account, it is virtually impossible to assume large dimensional scenarios of converging matrix models. Deterministic equivalents can instead provide very precise system performance characterizations. It is important also to realize that, while the system models under study in the application chapters, Chapters 12–15, were sometimes very intricate, questions such as capacity optimization often resulted in very elegant and compact forms and come along with simple iterative algorithms, often with ensured convergence. It is therefore to be believed that even more complex models can still be provided with simple optimizations. It is important again to recall at this point that accurate mathematical derivations are fundamental to ensure in particular that the capacity maximizing algorithms do converge surely. We also mentioned that second order statistics of the performance of such models can also be well approximated in the large dimensional regime, with simple forms involving Jacobian matrices of the fundamental equations appearing systematically. Second order statistics provide further information on the outage capacity of these systems but also on the overall reliability of the deterministic equivalents. In heterogeneous primary-secondary networks where low probability of interference is a key problem to be considered, simple expressions of the second order statistics are of dramatic importance. As a consequence, in the near future, considerable effort will still need to be cast on deterministic equivalents and central limit theorems. Given the manpower demanded to treat the vastness of the recent small cell incentive, a systematic simplification of classical random matrix methods presented in this book will become a research priority.

From a more technical point of view, we also insist on the fact that the existence of a trace lemma for some matrix models is often sufficient to establish deterministic equivalents for involved system models. This was recently exemplified by the case of Haar matrices for which the trace lemma, Theorem 6.15, allows us to determine the approximated capacity of multi-cellular orthogonal CDMA setups with multi-path channels based on the Stieltjes and Shannon transforms provided in Theorem 6.17. Future multi-station short-range communication models with strong line-of-sight components may surely demand more exotic channel models than the simple models based on i.i.d. random matrices. We think in particular of Euclidean matrices [Bordenave, 2008] that can be used to model random grids of access points. If trace lemmas can be found for these models, it is likely that results similar to the i.i.d. case will be derived, for which systematic optimization methods and statistical analysis will have to be generated.

The methods for deterministic equivalents presented in this book therefore only pave the way for much more complex system model characterizations. Nonetheless, we also noticed that many random matrix models, more structured, are still beyond analytical reach, although combinatoric moment approaches are filling the gap. In particular, the characterization of the limiting spectrum of some random Vandermonde matrix models has known an increasing interest since


of limiting distributions and asymptotic independence of extreme eigenvalues. This topic, which originates from rather old works on the inner symmetry of unitarily invariant random matrices, e.g., [James, 1964], is still being thoroughly investigated these days. The tools required to study such models are very different from those proposed here and call for deeper mathematical considerations. A systematic simplification of these methods which should also generalize to more challenging random matrix models is also the key for future usage of these rather difficult tools in signal processing and wireless communications.

As we mention the democratization of some mathematical tools for random matrices, this book being a strong effort in this direction, we discuss briefly hereafter the method of statistical mechanics known as the replica trick or replica method which has not been mentioned so far but which has been generating lately an important wave of results in terms of l.s.d. and deterministic equivalents. Instead of introducing the specifics of this tool, which has the major drawback of relying on non-mathematically rigorous methods, we discuss its interaction with the methods used in this book.

19.2 The replica method

In addition to the approaches treated in Part I to determine deterministic equivalents, another approach, known as the replica method, is gaining ground in the wireless communication community. In a similar way as deterministic equivalent derivations based on the 'guessing' part of the Bai and Silverstein approach, this technique provides in general a first rapid hint on the expected solutions. The replica method does however not come along with appropriate mathematical tools to prove the accuracy of the derived solutions. More precisely, the replica derivations assume several mathematical properties of continuity and limit-integral interchange, which are assumed valid at first (in order to obtain the hypothetical solutions) but which are very challenging to prove. This tool therefore has to be used with extreme care. For a short introduction to the replica method, see, e.g., the appendix of [Müller, 2003].

The replica method is an approach borrowed from physics and especially from the field of statistical mechanics, see, e.g., [Mezard et al., 1987; Nishimori, 2001]. It was then extensively used in the field of wireless communications, starting with the work of Tanaka [Tanaka, 2002] on maximum-likelihood CDMA detectors. The asymptotic behavior of numerous classical detectors was then derived using the replica method, e.g., [Guo and Verdú, 2002]. The replica method is used in statistical physics to evaluate thermodynamical entropies and free energy. In a wireless communication context, those are closely linked to the mutual information, i.e. the difference of receive signal and source signal entropies [Shannon, 1948], in the sense that free energy and mutual information only differ by an additive constant and a scalar factor. Replica methods have in particular proved to be very useful when determining central limit theorems for the e.s.d. of involved random matrix models. While classical deterministic equivalents and Stieltjes transform approaches used to fail to derive nice closed-form formulas for the asymptotic covariance matrices of functionals of the e.s.d., the replica method often conjectured that these covariance matrices take the form of Jacobian matrices. So far, all conjectures of the replica method going in this Jacobian direction turned out to be exact. For instance, the proof of Theorem 6.21 relies on martingale theory, similar to the proof of Theorem 3.18. With these tools alone, the limiting variance of the central limit often takes an unpleasant form, which is not obvious to relate to a Jacobian, although it effectively is asymptotically equivalent to a Jacobian. In this respect, replica methods have turned out to be extremely useful. However, some examples of calculus where the replica method fails have also been identified. Today, mathematicians are progressively trying to raise necessary and sufficient conditions for the replica method to be valid. That is, situations where the critical points of the replica derivations are valid or not are progressively being identified.

So far, however, the validity conditions are not sufficient for limiting laws and deterministic equivalents to be accurately proved using this method. We therefore see the replica method as a good opportunity for engineers and researchers to easily come up with would-be results that can then be accurately proved using classical random matrix techniques. We do not develop further this technique, though, as it requires the introduction of several additional tools, and we leave it to the reader to refer to alternative introductory articles.

We complete this chapter with the introduction of some ideas on the generalization of random matrix theory to continuous time random matrix theory and possible applications to wireless communications.

19.3 Towards time-varying random matrices

In addition to characterizing the capacity of wireless channels, it has always been of interest in wireless communications to study their time evolution. We have to this point gone successively through the following large dimensional network characterizations:

• the capacity, sum rate, or rate regions, that allow us to anticipate either the averaged achievable rate of quasi-static channels or the exact achievable rate of long coded sequences over very fast varying channels;
• the outage capacity, sum rate, or rate regions, which constitute a quality of service parameter relative to the rates achievable with high probability.

Now, since communication channels vary with time, starting from a given deterministic channel realization, it is possible to anticipate the rate evolution of this channel. Indeed, similar to the averaging effect arising when the matrix dimensions grow large, that turn random eigenvalue distributions into asymptotically deterministic quantities, solutions to implicit equations, the behavior of the time-varying random eigenvalues of a deterministic matrix affected by a Wiener process (better known as Brownian motion) can be deterministically characterized as the solution of differential equations. Although no publication related to these time-varying aspects for wireless communications has been produced so far (mostly because the tools are not mature enough), it is to be believed that random matrix theory for wireless communications may move on a more or less long-term basis towards random matrix process theory for wireless communications. Nonetheless, these random matrix processes are nothing new and have been the interest of several generations of mathematicians.

We hereafter introduce briefly the fundamental ideas, borrowed from a tutorial by Guionnet [Guionnet, 2006]. The initial interest of Guionnet is to derive the limiting spectral distribution of a non-central Wigner matrix with Gaussian entries, based on stochastic calculus. The result we will present here provides, under the form of the solution of a differential equation, the limiting eigenvalue distribution of such a random matrix affected by Brownian noise at all times t > 0.

We briefly introduce the notion of a Wigner matrix-valued Wiener process. A Wiener process W_t is defined as a stochastic process with the following properties:

• W_0 = 0
• W_t is a random variable, almost surely continuous over t
• for t > s ≥ 0, W_t − W_s is Gaussian with zero mean and variance t − s
• for s_1 ≤ t_1 < s_2 ≤ t_2, W_{t_1} − W_{s_1} is independent of W_{t_2} − W_{s_2}.

This definition allows for the generation of random processes with independent increments. That is, if W_t is seen as the trajectory of a moving particle, the Wiener process assumptions ensure that the increment of the trajectory between two time instants is independent of the increments observed between any two instants in the past. This will be suitable in wireless communications to model the evolution of an unpredictable time-varying process such as the evolution of a time-varying channel matrix from a deterministically known matrix and an additional random time-varying innovation term; the latter being conventionally modeled as Gaussian at time t = 1.

Instead of considering channel matrices, though, we restrict this introduction to Wigner matrices. We define the Wigner matrix-valued Wiener process as the time-varying matrix X_N(t) ∈ C^{N×N} with (m, n) entry X_{N,mn}(t) given by:

X_{N,mn}(t) = \begin{cases} \frac{1}{\sqrt{2N}}\left(W_{m,n}(t) + i\,W'_{m,n}(t)\right), & m < n \\ \frac{1}{\sqrt{N}}\,W_{m,m}(t), & m = n \end{cases}

where W_{m,n}(t) and W'_{m,n}(t) are independent Wiener processes. As such, from the above definition, X_N(1) is a Gaussian Wigner matrix. We then define Y_N(t) ∈ C^{N×N} as

Y_N(t) = Y_N(0) + X_N(t)

for some deterministic Hermitian matrix Y_N(0) ∈ C^{N×N}.

We recognize that at time t = 1, Y_N(1) = Y_N(0) + X_N(1) is a Gaussian Wigner matrix with Gaussian independent entries of mean given by the entries of Y_N(0) and variance 1/N. The current question though is to analyze the time evolution of the eigenvalue distribution of Y_N(t).

Denote λ_N(t) = (λ^N_1(t), . . . , λ^N_N(t)) the set of random eigenvalues of Y_N(t) and F^{Y_N(t)} the e.s.d. of Y_N at time t. The following result, due to Dyson [Dyson, 1962b], characterizes the time-varying e.s.d. F^{Y_N(t)} as the solution of a stochastic differential equation.

Theorem 19.1. Let λ_N(0) be such that λ^N_1(0) < . . . < λ^N_N(0). Then F^{Y_N(t)} is the unique (weak) solution of the stochastic differential system

d\lambda^N_i(t) = \frac{1}{\sqrt{N}}\, dV^i_t + \frac{1}{N}\sum_{j\neq i}\frac{1}{\lambda^N_i(t) - \lambda^N_j(t)}\, dt

with initial condition λ_N(0), such that λ^N_1(t) < . . . < λ^N_N(t), where (V^1_t, . . . , V^N_t) is an N-dimensional Wiener process.

This characterizes the distribution of λ_N(t) for all finite N. A large dimensional limit for such processes is then characterized by the following result, [Guionnet, 2006, Lemma 12.5].

Theorem 19.2. Let λ_N(0) ∈ R^N such that F^{Y_N(0)} converges weakly towards F_0 as N tends to infinity. Further, assume that

\sup_N \int \log(1 + x^2)\, dF^{Y_N(0)}(x) < \infty.

Then, for all T > 0, the measure-valued process (F^{Y_N(t)}, t ∈ [0, T]) converges almost surely in the set of distribution function-valued continuous functions defined on [0, T] towards (F_t, t ∈ [0, T]), such that, for all z ∈ C \ R

m_{F_t}(z) = m_{F_0}(z) + \int_0^t m_{F_s}(z)\, m'_{F_s}(z)\, ds

where the derivative of the Stieltjes transform is taken along z.

This result generalizes the free additive convolution to time continuous processes. What this exactly states is that, as N grows large, F^{Y_N(t)} converges almost surely to some d.f. F_t, which is continuous along the time variable t. This indicates that, for large N, the eigenvalues of the time-evolving random matrix Y_N(t) follow a trajectory, whose Stieltjes transform satisfies the above differential equation.
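To make these statements concrete, the following Python sketch simulates one realization of Y_N(t) = Y_N(0) + X_N(t) and compares the empirical Stieltjes transform of Y_N(t) at a test point z with the fixed-point equation m_t(z) = m_0(z + t m_t(z)), which is one way of integrating the differential equation above along characteristics when F_0 is taken as the spectrum of Y_N(0). The matrix size, the time horizon, and the choice of Y_N(0) (half +1, half −1 eigenvalues) are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
N, t = 500, 1.0                       # illustrative matrix size and time horizon

# Initial condition: a deterministic Hermitian matrix, here diag(+1, ..., -1, ...).
Y0 = np.diag(np.concatenate([np.ones(N // 2), -np.ones(N // 2)]))

# One realization of X_N(t): a Hermitian Gaussian (Wigner) matrix whose entries
# have variance t/N, consistent with the matrix-valued Wiener process definition.
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
X = np.sqrt(t / N) * (A + A.conj().T) / 2
Yt = Y0 + X

# Empirical Stieltjes transform m(z) = (1/N) tr (Y - z I)^{-1}, z off the real axis.
def m_emp(Y, z):
    return np.mean(1.0 / (np.linalg.eigvalsh(Y) - z))

z = 0.5 + 1.0j
eig0 = np.linalg.eigvalsh(Y0)
def m0(w):
    return np.mean(1.0 / (eig0 - w))

# Solve m_t(z) = m_0(z + t m_t(z)) by fixed-point iteration.
m = m0(z)
for _ in range(200):
    m = m0(z + t * m)

print("empirical m at time t :", m_emp(Yt, z))
print("fixed-point prediction:", m)    # the two values should be close for large N
```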

It is to be believed that such time-varying considerations, along with the recent growing interest in mean field and mean field game theories [Bordenave et al., 2005; Buchegger and Le Boudec, 2005; Le Boudec et al., 2007; Sarajanovic and Le Boudec, 2005; Sharma et al., 2006], may lead to the opening up of a new field of research in wireless communications. Indeed, mean field theory is dealing with the characterization of large dimensional time-varying systems for which the asymptotic behavior is found to be the solution of stochastic differential equations. Mean field theory is in particular used in game-theoretic settings where numerous players, whose space distribution enjoys some symmetric structure, compete under some cost function constraint. Such characterizations are suitable for the study of the medium access control for future large dimensional networks, where the adjective large qualifies the number of users in the network. The time-varying aspects developed above may well turn in the end into a characterization of the time evolution of the physical layer for future large dimensional networks, where the adjective large now characterizes, e.g. the number of transmit antennas at the base station in a cellular broadcast channel or the number of users in a CDMA cell.

The possibility to study the time evolution of large dimensional networks, be it from a physical layer, medium access control layer, or network layer point of view, provides much more information than discrete time analysis in the sense that:

• as already mentioned, the classical static analysis brought by random matrix theory in wireless communications only allows us to anticipate the average performance and outage performance of a given communication system. That is, irrespective of the time instant t, quality of service figures such as averaged or minimally ensured data delivery rate can be derived. Starting from a data rate R_0 at time t_0, it is however not possible to anticipate the averaged rate or minimally ensured rate R_t at time t > t_0 (unless t is large enough for the knowledge of R_0 to become irrelevant). This can be performed by continuous time analysis though;

• dynamic system analysis also allows us to anticipate probabilities of chaotic situations such as system failure after a time t > t_0, with t_0 some initial time when the system is under control. We mention specifically the scenario of automatized systems with control, such as recent smart energy distribution networks in which information feedback is necessary to set the system as a whole in equilibrium. However, in wireless communications, as in any other communication system, feeding information back comes at a price and is preferably limited. This is why being able to anticipate the (possibly erratic) evolution of a system in free motion is necessary so to be able to decide when control has to take place;

• along the same line of thought as in the previous point, it is also important for system designers to take into account and anticipate the consequences of mobility within a large dimensional network. Indeed, in a network where users' mobility is governed by some time-varying stochastic process, the achievable transmission data rates depend strongly on channel state information exchanges within the network. In the discrete time random matrix


20 Conclusion

Throughout this book, we tried to propose an up-to-date vision of the fundamental applications of random matrix theory to wireless communications. "Up-to-date" refers to the time when these lines were written. At the pace at which the random matrix field evolves these days, the current book will be largely outdated when published. This is one of the two fundamental reasons why we thoroughly introduced the methods used to derive most of the results known to this day, as these technical approaches will take more time to be replaced by more powerful tools. The other, more important, reason why such an emphasis was made on these techniques, is that the wireless communication community is moving fast and is in perpetual need for new random matrix models for which mathematical research has no answer yet. Quite often, such an answer does not exist because it is either of no apparent interest to mathematicians or simply because too many of these problems are listed that cannot all be solved in a reasonable amount of time. But very often also, these problems can be directly addressed by non-mathematical experts. We desired this book to be both accessible, in the sense that fast solutions to classical problems can be derived by wireless communication engineers, and rigorous in some of the proofs, so that precise proof techniques be known to whomever desires to derive mathematically sound information-theoretic results.

An important outcome of the current acceleration of the breakthroughs made in random matrix theory for wireless communications is the generalization of non-mathematically accurate methods, such as the replica method, introduced briefly in Chapter 19. From our point of view, thanks to the techniques developed in this book, it is also fairly simple to derive deterministic equivalents for the very same problems addressed by the replica approach. Nonetheless, the replica method is a very handy tool for fast results that may take time to obtain using conventional methods. This is confirmed for instance by the work of Moustakas et al. [Moustakas and Simon, 2007] on frequency selective MIMO channels, later proved accurate by Dupuy and Loubaton in an unpublished work, or by the work of Taricco [Taricco, 2008] on MIMO Rician models, largely generalized by the work of Hachem et al. [Hachem et al., 2008b], or again by the work of Simon et al. [Simon et al., 2006] generalized and accurately proved by Couillet et al. [Couillet et al., 2011a] and further extended in [Wagner et al., 2011] by Wagner et al. As reminded in Chapter 19, replica methods also provide results that are not at all immediate using conventional tools, and constitute, as such, an important tool to be further developed.

However, we intend this book to reinforce the idea to the reader that the difficult Stieltjes transform and Gaussian method tools of yesterday are now clearly understood and have moved to a simple and very accessible framework, no longer exclusive to a few mathematicians among the research community. We recall in particular that the deterministic equivalent method we referred to as Bai and Silverstein's approach in this book is rather simple and only requires practice. Once deterministic equivalents are inferred via the "guess-work" technique, accurate proofs can then be performed, which are usually based on very classical techniques. Moreover, as most results end up being solutions of implicit equations, it is important for practical purposes to be able to prove solution uniqueness and if possible sure convergence of some fixed-point algorithm to the solution. One of the reasons comes as follows: in the situation where we have to estimate some key system parameters (such as the optimal transmit covariance matrix in MIMO communications) and that these parameters are one of the solutions to an implicit equation, if sure convergence of some iterative algorithm towards this specific solution is proved, then the stability of the system under consideration is ensured.
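As an illustration of what such an implicit equation and its fixed-point algorithm can look like in the simplest case, the Python sketch below solves the Marchenko–Pastur equation m(z) = 1/(1 − c − z − c z m(z)) for the Stieltjes transform of a sample covariance matrix (1/n) X X^H with i.i.d. entries, written here for the convention m(z) = ∫ (λ − z)^{-1} dF(λ); the dimensions and the test point z are arbitrary values chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 400, 800                  # illustrative dimensions, ratio c = N/n
c = N / n
z = 1.0 + 1.0j                   # a point of the upper half complex plane

# Implicit (Marchenko-Pastur) equation for the Stieltjes transform of (1/n) X X^H:
#   m(z) = 1 / (1 - c - z - c z m(z)),
# solved by simple fixed-point iteration starting from m = -1/z.
m = -1.0 / z
for _ in range(1000):
    m = 1.0 / (1.0 - c - z - c * z * m)

# Empirical Stieltjes transform from one random realization.
X = rng.standard_normal((N, n))
eig = np.linalg.eigvalsh(X @ X.T / n)
m_emp = np.mean(1.0 / (eig - z))

print("fixed-point solution  :", m)
print("empirical (one sample):", m_emp)   # the two values should be close
```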

Another source of debate within the random matrix community is the question of whether free probabilistic tools or more conventional random matrix tools must be used to solve problems dealing with large dimensional random matrices. Some time ago, problems related to Haar matrices were all approached using free probability tools since the R- and S-transforms are rather convenient for dealing with sums or products of these types of matrices. Nonetheless, as an equivalent trace lemma for Haar random matrices, Theorem 6.15, exists, it is also possible to treat models involving Haar matrices with the same tools as those used for i.i.d. matrices, see, e.g., Theorem 6.17. Moreover, free probability relies on stringent assumptions on the matrices involved in sums and products, starting with the eigenvalue boundedness assumption that can be somewhat extended using more conventional random matrix techniques. There is therefore no fundamental reason to prefer the exclusive usage of free probability theory, as the same derivations and much more are accessible through classical random matrix theory. Nonetheless, it is usually simpler, when possible, to exploit directly free probability theorems for involved sums and products of asymptotically free matrix families than to resort to complete random matrix derivations and convergence proofs, see, e.g., Section 15.2 on multi-hop communications. Since both fields are not orthogonal, it is therefore possible to use results from both of them to come up fast with results on more involved matrix models.

Regarding the latest contributions on signal sensing and parameter estimation, the studies provided in this book showed that, while important limitations (linked to spectrum clustering) restrict the use of recent Stieltjes transform-based techniques, these tools perform outstandingly better than any other moment-based approach and obviously better than original algorithms that only assume one large system dimension. Obtaining Stieltjes transform estimators requires work, but is not so difficult once the relation between the l.s.d. of the observed matrix model and the e.s.d. of the hidden parameter matrix is found. Proving that the estimator is indeed correct requires to ensure that exact separation of the eigenvalues in the observed matrix holds true. Nonetheless, as recalled many times, proving exact separation, already for the simple information plus noise model, is a very involved problem. Obtaining exact separation for more involved models is therefore an open question, still under investigation and still rather exclusive to pure mathematicians. To this day, for these complex models, intellectual honesty requires to step back to less precise combinatoric moment-based methods, which are also much easier to derive, and, as we have seen, can be often automatically obtained by computer software.

Moment approaches are also the only access we have to even more involved random matrix models, such as random Vandermonde or random Toeplitz matrices. Parameter estimation can then be performed for models involving Vandermonde matrices using the inaccurate moment approach, as no alternative technique is available yet. Moment approaches also have the property to be able to provide the exact moments of some functionals of small dimensional random matrices on average, as well as exact covariance matrices of the successive moments. Nonetheless, we also saw that functionals of random matrices as large as 4 × 4 matrices are often very accurately deterministically approximated using methods that assume large dimensions. The interest of tools that assume small dimensions is therefore often very limited.

Small dimensional random matrix theory need not be discarded, though, as exemplified in Chapter 16 and Chapter 18, where important results on multi-source detection and more generally optimum statistical inference through the maximum entropy principle were introduced. Even if such approaches often lead to very involved expressions, from which sometimes not much can be said, they always provide upper-bounds on alternative approaches which are fundamental to assess the performance of such alternative suboptimal methods. However, small dimensional techniques are very often restricted to simple problems that are very symmetrical in the sense that the matrices involved need to have pleasant invariance properties. The increasing complexity of large dimensional systems comes however in contradiction with this simplicity requirement. It is therefore believed that small dimensional random matrix theory will leave more and more room for the much better performing large dimensional random matrix theory for all applications.

Finally, we mention that the most recent field of study, for which new results appear at an increasing pace, is that of the limiting distribution and the large deviations of smallest and largest eigenvalues for different types of models, asymptotic independence within the spectrum of large dimensional matrices, etc. These new ideas, stated in this book under the form of a series of important results, rely on powerful tools, which necessitate a lengthy mathematical introduction to Fredholm determinants or operator theory, which we briefly provided in Chapter 19. It is believed that the time will come when these tools will be made simpler and more accessible so that non-specialists can also benefit from these important results in the medium to long-term.

To conclude, we wish to insist once more that random matrix theory, which was ten years ago still in its infancy with techniques only exploitable by mathematicians of the field, has now become more popular, is better understood, and provides wireless telecommunication researchers with a large pool of useful and accessible tools. We now enter an era where the initial results on system performance evaluation are commonplace and where thrilling results have now to do with statistical inference in large dimensional inverse problems involving possibly time-varying random matrix processes.


References

T. B. Abdallah and M. Debbah. Downlink CDMA: to cell or not to cell. In 12th European Signal Processing Conference (EUSIPCO'04), pages 197–200, Vienna, Austria, September 2004.
S. Adhikari. A non-parametric approach for uncertainty quantification in elastodynamics. In Proceedings of the 47th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics and Materials Conference, 2006.
D. Aktas, M. N. Bacha, J. S. Evans, and S. V. Hanly. Scaling results on the sum capacity of cellular networks with MIMO links. IEEE Transactions on Information Theory, 52(7):3264–3274, 2006.
I. F. Akyildiz, W. Y. Lee, M. C. Vuran, and S. Mohanty. Next generation/dynamic spectrum access/cognitive radio wireless networks: a survey. Computer Networks Journal, 50(13):2127–2159, 2006.
P. Almers, E. Bonek, A. Burr, N. Czink, M. Debbah, V. Degli-Esposti, H. Hofstetter, P. Kyosti, D. Laurenson, and G. Matz et al. Survey of channel and radio propagation models for wireless MIMO systems. EURASIP Journal on Wireless Communications and Networking, 2007.
G. W. Anderson, A. Guionnet, and O. Zeitouni. Lecture notes on random matrices, 2006. www.mathematik.uni-muenchen.de/~lerdos/SS09/Random/randommatrix.pdf. SAMSI, Lecture Notes.
G. W. Anderson, A. Guionnet, and O. Zeitouni. An introduction to random matrices. Cambridge University Press, 2010. ISBN 0521194520.
T. W. Anderson. The non-central Wishart distribution and certain problems of multivariate statistics. The Annals of Mathematical Statistics, 17(4):409–431, 1946.
T. W. Anderson. Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34(1):122–148, March 1963.
L. Arnold. On the asymptotic distribution of the eigenvalues of random matrices. Journal of Mathematics and Analytic Applications, 20:262–268, 1967.
L. Arnold. On Wigner's semi-circle law for the eigenvalues of random matrices. Probability Theory and Related Fields, 19(3):191–198, September 1971.
L. Arnold, V. M. Gundlach, and L. Demetrius. Evolutionary formalism for products of positive random matrices. The Annals of Applied Probability, 4(3):859–901, 1994.
Z. D. Bai. Circular law. The Annals of Probability, 25(1):494–529, 1997.


Z. D. Bai and J. W. Silverstein. No eigenvalues outside the support of the limiting spectral distribution of large dimensional sample covariance matrices. The Annals of Probability, 26(1):316–345, January 1998.
Z. D. Bai and J. W. Silverstein. Exact separation of eigenvalues of large dimensional sample covariance matrices. The Annals of Probability, 27(3):1536–1555, 1999.
Z. D. Bai and J. W. Silverstein. CLT of linear spectral statistics of large dimensional sample covariance matrices. The Annals of Probability, 32(1A):553–605, 2004.
Z. D. Bai and J. W. Silverstein. On the signal-to-interference-ratio of CDMA systems in wireless communications. Annals of Applied Probability, 17(1):81–101, 2007.
Z. D. Bai and J. W. Silverstein. Spectral analysis of large dimensional random matrices. Springer Series in Statistics, New York, NY, USA, second edition, 2009.
Z. D. Bai and J. F. Yao. Central limit theorems for eigenvalues in a spiked population model. Annales de l'Institut Henri Poincaré – Probabilités et Statistiques, 44(3):447–474, 2008a.
Z. D. Bai and J. F. Yao. Limit theorems for sample eigenvalues in a generalized spiked population model. 2008b. http://arxiv.org/abs/0806.1141.
J. Baik and J. W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408, 2006.
J. Baik, G. Ben Arous, and S. Péché. Phase transition of the largest eigenvalue for non-null complex sample covariance matrices. The Annals of Probability, 33(5):1643–1697, 2005.
F. Benaych-Georges. Rectangular random matrices, related free entropy and free Fisher's information. Journal of Operator Theory, 62(2):371–419, 2009.
F. Benaych-Georges and R. Rao. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics, 227(1):494–521, 2011. ISSN 0001-8708.
F. Benaych-Georges, A. Guionnet, and M. Maida. Fluctuations of the extreme eigenvalues of finite rank deformations of random matrices. 2010. http://arxiv.org/abs/1009.0145.
H. Bercovici and V. Pata. The law of large numbers for free identically distributed random variables. The Annals of Probability, 24(1):453–465, 1996.
P. Bianchi, M. Debbah, and J. Najim. Asymptotic independence in the spectrum of the Gaussian Unitary Ensemble. Electronic Communications in Probability, 15:376–395, 2010. http://arxiv.org/abs/0811.0979.
P. Bianchi, J. Najim, M. Maida, and M. Debbah. Performance of some eigen-based hypothesis tests for collaborative sensing. IEEE Transactions on Information Theory, 57(4):2400–2419, 2011.
P. Biane. Free probability for probabilists. Quantum Probability Communications, 11:55–71, 2003.


E. Biglieri, G. Caire, and G. Taricco. CDMA system design through asymptotic analysis. IEEE Transactions on Communications, 48:1882–1896, November 2000.
E. Biglieri, G. Caire, G. Taricco, and E. Viterbo. How fading affects CDMA: an asymptotic analysis with linear receivers. IEEE Journal on Selected Areas in Communications (Wireless Series), 19(2):191–201, 2001.
P. Billingsley. Convergence of Probability Measures. John Wiley and Sons, Inc., Hoboken, NJ, 1968.
P. Billingsley. Probability and Measure. John Wiley and Sons, Inc., Hoboken, NJ, third edition, 1995.
N. Bonneau, E. Altman, M. Debbah, and G. Caire. When to synchronize in uplink CDMA. In Proceedings of IEEE International Symposium on Information Theory (ISIT'05), pages 337–341, 2005.
N. Bonneau, M. Debbah, and E. Altman. Wardrop equilibrium in CDMA networks. In Workshop on Resource Allocation in Wireless Networks, Limassol, Cyprus, 2007.
N. Bonneau, M. Debbah, E. Altman, and A. Hjørungnes. Non-atomic games for multi-user systems. IEEE Journal on Selected Areas in Communications, 26(7):1047–1058, 2008.
S. Borade, L. Zheng, and R. Gallager. Amplify-and-forward in wireless relay networks: rate, diversity, and network size. IEEE Transactions on Information Theory, 53(10):3302–3318, 2007.
C. Bordenave. Eigenvalues of euclidean random matrices. Random Structures and Algorithms, 33(4):515–532, 2008.
C. Bordenave, D. McDonald, and A. Proutiere. Random multi-access algorithms, a mean field analysis. In Proceedings of IEEE Annual Allerton Conference on Communication, Control, and Computing (Allerton'05), Allerton, IL, USA, 2005.
G. L. Bretthorst. Bayesian spectrum analysis and parameter estimation. PhD thesis, Washington University, St. Louis, 1987.
L. Brillouin. Science and Information Theory. Academic Press, New York, second edition, 1962.
S. Buchegger and J. Y. Le Boudec. Self-policing mobile ad-hoc networks by reputation systems. IEEE Communication Magazine, 43(7):101–107, 2005.
D. Cabric, S. M. Mishra, and R. W. Brodersen. Implementation issues in spectrum sensing for cognitive radios. In Proceedings of IEEE Conference Record of the Asilomar Conference on Signals, Systems, and Computers (ASILOMAR'04), pages 772–776, Pacific Grove, CA, USA, 2004.
D. Cabric, A. Tkachenko, and R. W. Brodersen. Spectrum sensing measurements of pilot, energy and collaborative detection. In IEEE Military Communications Conference, October 2006.
G. Caire and S. Shamai. On the achievable throughput of a multiantenna Gaussian broadcast channel. IEEE Transactions on Information Theory, 49(7):1691–1706, 2003.


D. Calin, H. Claussen, and H. Uzunalioglu. On femto deployment architecturesand macrocell offloading benets in joint macro-femto deployments. IEEE Transactions on Communications , 48(1):26–32, January 2010.

L. S. Cardoso, M. Debbah, P. Bianchi, and J. Najim. Cooperative spectrumsensing using random matrix theory. In IEEE Pervasive Computing (ISWPC’08) , pages 334–338, Santorini, Greece, May 2008.

A. Caticha. Maximum entropy, uctuations and priors. Maximum Entropy and Bayesian Methods in Science and Engineering , 568:94–106, 2001.

H. Chandra. Differential operators on a semi-simple Lie algebra. American Journal of Mathematics , 79:87–120, 1957.

V. Chandrasekhar, M. Kountouris, and J. G. Andrews. Coverage in multi-antenna two-tier networks. IEEE Transactions on Wireless Communications ,

8(10):5314–5327, 2009.J. M. Chaufray, W. Hachem, and P. Loubaton. Asymptotic analysis of optimumand sub-optimum CDMA downlink MMSE receivers. IEEE Transactions on Information Theory , 50(11):2620–2638, 2004.

S. S. Christensen, R. Agarwal, E. D. Carvalho, and J. M. Cioffi. Weighted sum-rate maximization using weighted MMSE for MIMO-BC beamforming design,Part I. IEEE Transactions on Wireless Communications , 7(12):4792–4799,2008.

C. N. Chuah, D. N. C. Tse, J. M. Kahn, and R. A. Valenzuela. Capacity scalingin MIMO wireless systems under correlated fading. IEEE Transactions on

Information Theory , 48(3):637–650, March 2002.P. Chung, J. B¨ ohme, C. Mecklenbra¨uker, and A. Hero. Detection of the number

of signals using the Benjamini-Hochberg procedure. IEEE Transactions on Signal Processing , 55(6):2497–2508, 2007.

H. Claussen, L. T. Ho, and L. G. Samuel. An overview of the femtocell concept.Bell Labs Technical Journal , 13(1):221–245, May 2008.

M. H. M. Costa. Writing on dirty paper. IEEE Transactions on Information Theory , 29(3):439–441, 1983.

L. Cottatellucci and M. Debbah. The effect of line of sight on the asymptotic

capacity of MIMO systems. In Proceedings of IEEE International Symposium on Information Theory (ISIT’04) , page 542, Chicago, USA, July 2004a.L. Cottatellucci and M. Debbah. On the capacity of MIMO Rice channels. In

Proceedings of IEEE Annual Allerton Conference on Communication, Control,and Computing (Allerton’04) , Allerton, IL, USA, October 2004b.

L. Cottatellucci and R. M¨ uller. Asymptotic design and analysis of multistagedetectors with unequal powers. In Proceedings of IEEE Information Theory Workshop (ITW’02) , pages 167–170, Bangalore, India, 2002.

L. Cottatellucci and R. Müller. A systematic approach to multistage detectors in multipath fading channels. IEEE Transactions on Information Theory, 51(9):3146–3158, 2005.

L. Cottatellucci, R. Müller, and M. Debbah. Asymptotic design and analysis of linear detectors for asynchronous CDMA systems. In Proceedings of IEEE International Symposium on Information Theory (ISIT'04), Chicago, IL, USA, 2004.

L. Cottatellucci, R. Müller, and M. Debbah. Asynchronous CDMA systems with random spreading – Part I: Fundamental limits. IEEE Transactions on Information Theory, 56(4):1477–1497, 2010a.

L. Cottatellucci, R. Müller, and M. Debbah. Asynchronous CDMA systems with random spreading – Part II: Design criteria. IEEE Transactions on Information Theory, 56(4):1498–1520, 2010b.

R. Couillet and M. Debbah. Free deconvolution for OFDM multicell SNR detection. In Proceedings of IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC'08), Cannes, France, 2008.

R. Couillet and M. Debbah. Uplink capacity of self-organizing clustered orthogonal CDMA networks in flat fading channels. In Proceedings of IEEE Information Theory Workshop, Taormina, Sicily, 2009.

R. Couillet and M. Debbah. A Bayesian framework for collaborative multi-source signal detection. IEEE Transactions on Signal Processing, 58(10):5186–5195, October 2010a.

R. Couillet and M. Debbah. Information theoretic approach to synchronization: the OFDM carrier frequency offset example. In Sixth Advanced International Conference on Telecommunications (AICT), Barcelona, Spain, 2010b.

R. Couillet and M. Guillaud. Performance of statistical inference methods for the energy estimation of multiple sources. In Proceedings of IEEE Workshop on Statistical Signal Processing (SSP'11), Nice, France, 2011. To appear.

R. Couillet and W. Hachem. Local failure detection and diagnosis in large sensor networks. IEEE Transactions on Information Theory, 2011. Submitted for publication.

R. Couillet, S. Wagner, M. Debbah, and A. Silva. The space frontier: physical limits of multiple antenna information transfer. In Workshop on Interdisciplinary Systems Approach in Performance Evaluation and Design of Computer and Communication Systems (Inter-Perf'08), Athens, Greece, 2008.

R. Couillet, A. Ancora, and M. Debbah. Bayesian foundations of channel estimation for smart radios. Advances in Electronics and Telecommunications, 1(1):41–49, 2010.

R. Couillet, M. Debbah, and J. W. Silverstein. A deterministic equivalent for the analysis of correlated MIMO multiple access channels. IEEE Transactions on Information Theory, 57(6):3493–3514, June 2011a.

R. Couillet, J. Hoydis, and M. Debbah. Deterministic equivalents for the analysis of unitary precoded systems. IEEE Transactions on Information Theory, 2011b. http://arxiv.org/abs/1011.3717. Submitted for publication.

R. Couillet, J. W. Silverstein, Z. D. Bai, and M. Debbah. Eigen-inference for energy estimation of multiple sources. IEEE Transactions on Information Theory, 57(4):2420–2439, 2011c.


R. T. Cox. Probability, frequency and reasonable expectation. American Journal of Physics , 14(1):1–13, 1946.

A. D. Dabbagh and D. J. Love. Multiple antenna MMSE based downlink precoding with quantized feedback or channel mismatch. IEEE Transactions on Communications, 56(11):1859–1868, 2008.

R. de Lacerda Neto, A. Menouni Hayar, M. Debbah, and B. H. Fleury. A maximum entropy approach to ultra-wideband channel modelling. Proceedings of the 31st IEEE International Conference on Acoustics, Speech, and Signal Processing, 2006.

M. Debbah and R. Müller. Impact of the power of the steering directions on the asymptotic capacity of MIMO channels. In Proceedings of IEEE International Symposium on Signal Processing and Information Technology (ISSPIT'03), Darmstadt, Germany, December 2003.

M. Debbah and R. Müller. MIMO channel modelling and the principle of maximum entropy. IEEE Transactions on Information Theory, 51(5):1667–1690, 2005.

M. Debbah and R. Müller. Capacity complying MIMO channel models. In Proceedings of IEEE Conference Record of the Asilomar Conference on Signals, Systems, and Computers (ASILOMAR'03), Pacific Grove, CA, USA, November 2003.

M. Debbah, W. Hachem, P. Loubaton, and M. de Courville. MMSE analysis of certain large isometric random precoded systems. IEEE Transactions on Information Theory, 49(5):1293–1311, May 2003a.

M. Debbah, P. Loubaton, and M. de Courville. The spectral efficiency of linear precoders. In Proceedings of IEEE Information Theory Workshop (ITW'03), pages 90–93, Paris, France, March 2003b.

P. Deift. Orthogonal Polynomials and Random Matrices: a Riemann-Hilbert Approach. New York University Courant Institute of Mathematical Sciences, New York, NY, USA, 2000.

A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Springer Verlag, 2009.

P. Ding, D. J. Love, and M. D. Zoltowski. Multiple antenna broadcast channels with shape feedback and limited feedback, Part I. IEEE Transactions on Signal Processing, 55(7):3417–3428, 2007.

B. Dozier and J. W. Silverstein. On the empirical distribution of eigenvalues of large dimensional information plus noise-type matrices. Journal of Multivariate Analysis, 98(4):678–694, 2007a.

B. Dozier and J. W. Silverstein. Analysis of the limiting spectral distribution of large dimensional information-plus-noise type matrices. Journal of Multivariate Analysis, 98(6):1099–1122, 2007b.

J. Dumont, W. Hachem, S. Lasaulce, P. Loubaton, and J. Najim. On the capacity achieving covariance matrix for Rician MIMO channels: an asymptotic approach. IEEE Transactions on Information Theory, 56(3):1048–1069, 2010.


F. Dupuy and P. Loubaton. Mutual information of frequency selective MIMO systems: an asymptotic approach, 2009. http://www-syscom.univ-mlv.fr/~fdupuy/publications.php.

F. Dupuy and P. Loubaton. On the capacity achieving covariance matrix for frequency selective MIMO channels using the asymptotic approach. IEEE Transactions on Information Theory, 2010. http://arxiv.org/abs/1001.3102. To appear.

F. J. Dyson. Statistical theory of the energy levels of complex systems, Part II. Journal of Mathematical Physics, 3:157–165, January 1962a.

F. J. Dyson. A Brownian-motion model for the eigenvalues of a random matrix. Journal of Mathematical Physics, 3:1191–1198, 1962b.

S. Enserink and D. Cochran. A cyclostationary feature detector. In Proceedings of IEEE Conference Record of the Asilomar Conference on Signals, Systems, and Computers (ASILOMAR'94), pages 806–810, Pacific Grove, CA, USA, 1994.

J. Evans and D. N. C. Tse. Large system performance of linear multiuser receivers in multipath fading channels. IEEE Transactions on Information Theory, 46(6):2059–2078, 2000.

K. Fan. Maximum properties and inequalities for the eigenvalues of completely continuous operators. Proceedings of the National Academy of Sciences of the United States of America, 37(11):760–766, 1951.

J. Faraut. Random matrices and orthogonal polynomials. Lecture Notes, CIMPA School of Merida, 2006. www.math.jussieu.fr/~faraut/Merida.Notes.pdf.

N. Fawaz and M. Medard. On the non-coherent wideband multipath fading relay channel. In Proceedings of IEEE International Symposium on Information Theory (ISIT'10), pages 679–683, Austin, Texas, USA, 2010.

N. Fawaz, K. Zari, M. Debbah, and D. Gesbert. Asymptotic capacity and optimal precoding in MIMO multi-hop relay networks. IEEE Transactions on Information Theory, 57(4):2050–2069, 2011. ISSN 0018-9448.

O. N. Feldheim and S. Sodin. A universality result for the smallest eigenvalues of certain sample covariance matrices. Geometric And Functional Analysis, 20(1):88–123, 2010. ISSN 1016-443X.

R. A. Fisher. The sampling distribution of some statistics obtained from non-linear equations. The Annals of Eugenics, 9:238–249, 1939.

S. V. Fomin and I. M. Gelfand. Calculus of Variations. Prentice Hall, 2000.

G. J. Foschini and M. J. Gans. On limits of wireless communications in a fading environment when using multiple antennas. Wireless Personal Communications, 6(3):311–335, March 1998.

M. Franceschetti, S. Marano, and F. Palmieri. The role of entropy in wave propagation. In Proceedings of IEEE International Symposium on Information Theory (ISIT'03), Yokohama, Japan, July 2003.

Y. V. Fyodorov. Introduction to the random matrix theory: Gaussian unitary ensemble and beyond. Recent Perspectives in Random Matrix Theory and Number Theory, 322:31–78, 2005.


W. A. Gardner. Exploitation of spectral redundancy in cyclostationary signals. IEEE Signal Processing Magazine, 8(2):14–36, 1991.

S. Geman. A limit theorem for the norm of random matrices. The Annals of Probability, 8(2):252–261, 1980.

A. Ghasemi and E. S. Sousa. Collaborative spectrum sensing for opportunistic access in fading environments. In IEEE Proceedings of the International Symposium on Dynamic Spectrum Access Networks, pages 131–136, 2005.

A. Ghasemi and E. S. Sousa. Spectrum sensing in cognitive radio networks: the cooperation-processing tradeoff. Wireless Communications and Mobile Computing, 7(9):1049–1060, 2007.

V. L. Girko. Ten years of general statistical analysis. www.general-statistical-analysis.girko.freewebspace.com/chapter14.pdf.

V. L. Girko. Theory of Random Determinants. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1990.

M. A. Girshick. On the sampling theory of roots of determinantal equations. The Annals of Math. Statistics, 10:203–204, 1939.

J. Glimm and A. Jaffe. Quantum Physics. Springer, New York, NY, USA, 1981.

A. Goldsmith, S. A. Jafar, N. Jindal, and S. Vishwanath. Capacity limits of MIMO channels. IEEE Journal on Selected Areas in Communications, 21(5):684–702, 2003.

I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series and Products. Academic Press, sixth edition, 2000.

A. J. Grant and P. D. Alexander. Random sequence multisets for synchronous code-division multiple-access channels. IEEE Transactions on Information Theory, 44(7):2832–2836, November 1998.

R. M. Gray. Toeplitz and circulant matrices: a review. Foundations and Trends in Communications and Information Theory , 2(3), 2006.

D. Gregoratti and X. Mestre. Random DS/CDMA for the amplify and forwardrelay channel. IEEE Transactions on Wireless Communications , 8(2):1017–1027, 2009.

D. Gregoratti, W. Hachem, and X. Mestre. Randomized isometric linear-dispersion space-time block coding for the DF relay channel. IEEE Transactions on Signal Processing , 2010. Submitted for publication.

M. Guillaud, M. Debbah, and A. L. Moustakas. Maximum entropy MIMO wireless channel models, 2006. http://arxiv.org/abs/cs.IT/0612101.

M. Guillaud, M. Debbah, and A. L. Moustakas. Modeling the multiple-antenna wireless channel using maximum entropy methods. In International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt'07), pages 435–442, Saratoga Springs, NY, November 2007.

A. Guionnet. Large random matrices: lectures on macroscopic asymptotics. Ecole d'Ete de Probabilites de Saint-Flour XXXVI-2006, 2006. www.umpa.ens-lyon.fr/~aguionne/cours.pdf.


D. Guo and S. Verdú. Multiuser detection and statistical mechanics. Communications, Information and Network Security, pages 229–277, 2002.

D. Guo, S. Verdú, and L. K. Rasmussen. Asymptotic normality of linear multiuser receiver outputs. IEEE Transactions on Information Theory, 48(12):3080–3095, December 2002.

U. Haagerup, H. Schultz, and S. Thorbjørnsen. A random matrix approach to the lack of projections in Cred*(F2). Advances in Mathematics, 204(1):1–83, 2006.

W. Hachem. Simple polynomial MMSE receivers for CDMA transmissions on frequency selective channels. IEEE Transactions on Information Theory, pages 164–172, January 2004.

W. Hachem. An expression for ∫ log(t/σ² + 1) µ(dt), 2008. Unpublished.

W. Hachem, P. Loubaton, and J. Najim. Deterministic equivalents for certain functionals of large random matrices. Annals of Applied Probability, 17(3):875–930, 2007.

W. Hachem, O. Khorunzhy, P. Loubaton, J. Najim, and L. A. Pastur. A new approach for capacity analysis of large dimensional multi-antenna channels. IEEE Transactions on Information Theory, 54(9), 2008a.

W. Hachem, P. Loubaton, and J. Najim. A CLT for information theoretic statistics of Gram random matrices with a given variance profile. The Annals of Probability, 18(6):2071–2130, December 2008b.

W. Hachem, P. Loubaton, X. Mestre, J. Najim, and P. Vallet. A subspace estimator of finite rank perturbations of large random matrices. Journal on Multivariate Analysis, 2011. Submitted for publication.

L. Hanlen and A. Grant. Capacity analysis of correlated MIMO channels. In Proceedings of IEEE International Symposium on Information Theory (ISIT'03), Yokohama, Japan, July 2003.

S. V. Hanly and D. N. C. Tse. Resource pooling and effective bandwidths in CDMA networks with multiuser receivers and spatial diversity. IEEE Transactions on Information Theory, pages 1328–1351, May 2001.

B. Hassibi and B. M. Hochwald. How much training is needed in multiple-antenna wireless links. IEEE Transactions on Information Theory, 49(4):951–963, April 2003.

A. Haurie and P. Marcotte. On the relationship between Nash–Cournot and Wardrop equilibria. Networks, 15(1):295–308, 1985.

F. Hiai and D. Petz. The Semicircle Law, Free Random Variables and Entropy – Mathematical Surveys and Monographs No. 77. American Mathematical Society, Providence, RI, USA, 2006.

B. Hochwald and S. Vishwanath. Space-time multiple access: Linear growth in the sum rate. Proceedings of IEEE Annual Allerton Conference on Communication, Control, and Computing (Allerton'02), 2002.

B. M. Hochwald, T. L. Marzetta, and V. Tarokh. Multiple-antenna channel hardening and its implications for rate feedback and scheduling. IEEE Transactions on Information Theory, 50(9):1893–1909, 2004.


M. L. Honig and W. Xiao. Performance of reduced-rank linear interference suppression. IEEE Transactions on Information Theory, 47(5):1928–1946, May 2001.

R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.

J. Hoydis, M. Kobayashi, and M. Debbah. Asymptotic performance of linear receivers in network MIMO. In Proceedings of IEEE Conference Record of the Asilomar Conference on Signals, Systems, and Computers (ASILOMAR'10), Pacific Grove, CA, USA, November 2010.

J. Hoydis, R. Couillet, and M. Debbah. Random beamforming over correlated fading channels. IEEE Transactions on Information Theory, 2011a. Submitted for publication.

J. Hoydis, R. Couillet, and M. Debbah. Asymptotic analysis of double-scattering channels. In Proceedings of IEEE Conference Record of the Asilomar Conference on Signals, Systems, and Computers (ASILOMAR'11), Pacific Grove, CA, USA, 2011b.

J. Hoydis, M. Debbah, and M. Kobayashi. Asymptotic moments for interference mitigation in correlated fading channels. In Proceedings of IEEE International Symposium on Information Theory (ISIT'11), Saint Petersburg, Russia, August 2011c. http://arxiv.org/abs/1104.4911.

J. Hoydis, M. Kobayashi, and M. Debbah. Optimal channel training in uplink network MIMO systems. IEEE Transactions on Signal Processing, 59(6), June 2011d.

M. Hoyhtya, A. Hekkala, and A. Mammela. Spectrum awareness: techniques and challenges for active spectrum sensing. Springer Cognitive Networks, 3:353–372, April 2007.

P. L. Hsu. On the distribution of roots of certain determinantal equations. The Annals of Eugenics , 9:250–258, 1939.

H. Huh, G. Caire, S. H. Moon, and I. Lee. Multi-cell MIMO downlink with fairness criteria: the large-system limit. In Proceedings of IEEE International Symposium on Information Theory (ISIT'10), pages 2058–2062, June 2010.

Y. Hur, J. Park, W. Woo, K. Lim, C. H. Lee, H. S. Kim, and J. Laskar. A wideband analog multi-resolution spectrum sensing (MRSS) technique for cognitive radio (CR) systems. In IEEE International Symposium on Circuits and Systems (ISCAS'06), page 4, Island of Kos, Greece, 2006.

A. A. Hutter, E. Carvalho, and J. M. Cioffi. On the impact of channel estimation for multiple antenna diversity reception in mobile OFDM systems. Proceedings of IEEE Conference Record of the Asilomar Conference on Signals, Systems, and Computers (ASILOMAR'00), 2, 2000.

C. Hwang. A brief survey on the spectral radius and the spectral distribution of large dimensional random matrices with i.i.d. entries. Random Matrices and Their Applications , 50:145–152, 1986.


C. Hwang. Eigenvalue distribution of correlation matrix in asynchronous CDMA with infinite observation window width. In Proceedings of IEEE International Symposium on Information Theory (ISIT'07), Nice, France, June 2007.

C. Itzykson and J. B. Zuber. Quantum Field Theory. Dover Publications, 2006.

A. T. James. Distributions of matrix variates and latent roots derived from normal samples. The Annals of Mathematical Statistics, 35(2):475–501, 1964.

E. T. Jaynes. Information theory and statistical mechanics, Part I. Physical Review, 106(2):620–630, 1957a.

E. T. Jaynes. Information theory and statistical mechanics, Part II. Physical Review, 108(2):171–190, 1957b.

E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.

H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 186(1007):453–461, 1946.

S. Jin, M. R. McKay, X. Gao, and I. B. Collings. MIMO multichannel beamforming: SER and outage using new eigenvalue distributions of complex noncentral Wishart matrices. IEEE Transactions on Communications, 56(3):424–434, 2008. http://arxiv.org/abs/0611007.

N. Jindal. MIMO broadcast channels with finite-rate feedback. IEEE Transactions on Information Theory, 52(11):5045–5060, 2006.

M. Joham, K. Kusume, M. H. Gzara, W. Utschick, and J. A. Nossek. Transmit Wiener filter for the downlink of TDD DS-CDMA systems. Proceedings of ISSSTA 2002, 1:9–13, 2002.

K. Johansson. Shape fluctuations and random matrices. Communications of Mathematical Physics, 209:437–476, 2000.

I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29(2):295–327, 2001.

I. M. Johnstone. High dimensional statistical inference and random matrices. In International Congress of Mathematicians I, pages 307–333, Zurich, Germany, 2006. European Mathematical Society.

F. Kaltenberger, M. Kountouris, D. Gesbert, and R. Knopp. On the trade-off between feedback and capacity in measured MU-MIMO channels. IEEE Transactions on Wireless Communications , 8(9):4866–4875, 2009.

M. A. Kamath, B. L. Hughes, and Y. Xinying. Gaussian approximations for the capacity of MIMO Rayleigh fading channels. In Proceedings of IEEE Conference Record of the Asilomar Conference on Signals, Systems, and Computers (ASILOMAR'02), pages 614–618, Pacific Grove, CA, USA, 2002.

A. Kammoun, R. Couillet, J. Najim, and M. Debbah. Performance of capacity inference methods under colored interference, 2011. Submitted for publication.

J. N. Kapur. Maximum Entropy Models in Science and Engineering. John Wiley and Sons, Inc., New York, 1989.

N. El Karoui. Tracy-Widom limit for the largest eigenvalue of a large class of complex sample covariance matrices. The Annals of Probability, 35(2):663–714, 2007.

N. El Karoui. Spectrum estimation for large dimensional covariance matrices using random matrix theory. Annals of Statistics, 36(6):2757–2790, December 2008.

S. M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice-Hall, Englewood Cliffs, NJ, USA, 1993.

A. M. Khorunzhy, B. A. Khoruzhenko, and L. A. Pastur. On asymptotic properties of large random matrices with independent entries. Journal of Mathematical Physics, 37(10):5033–5061, 1996.

H. Kim and K. G. Shin. In-band spectrum sensing in cognitive radio networks: energy detection or feature detection? In ACM International Conference on Mobile Computing and Networking, pages 14–25, San Francisco, CA, USA, September 2008.

P. Koev. Random matrix statistics toolbox. http://math.mit.edu/~plamen/software/rmsref.html.

V. I. Kostylev. Energy detection of a signal with random amplitude. In Proceedings of IEEE International Conference on Communications (ICC'02), pages 1606–1610, New York, NY, USA, 2002.

L. Laloux, P. Cizeau, M. Potters, and J. P. Bouchaud. Random matrix theory and financial correlations. International Journal of Theoretical and Applied Finance, 3(3):391–397, July 2000.

J. Y. Le Boudec, D. McDonald, and J. Mundinger. A generic mean field convergence result for systems of interacting objects. In International Conference on the Quantitative Evaluation of Systems (QEST'07), Budapest, Hungary, 2007.

O. Leveque and I. E. Telatar. Information-theoretic upper bounds on the capacity of large extended ad hoc wireless networks. IEEE Transactions on Information Theory, 51(3):858–865, March 2005.

N. Levy and S. Shamai. Clustered local decoding for Wyner-type cellular models. In Proceedings of IEEE Information Theory and Applications Workshop (ITA'09), pages 318–322, San Diego, CA, USA, 2009.

L. Li, A. M. Tulino, and S. Verdú. Design of reduced-rank MMSE multiuser detectors using random matrix methods. IEEE Transactions on Information Theory, 50(6):986–1008, June 2004.

P. Loubaton and W. Hachem. Asymptotic analysis of reduced rank Wiener filters. In Proceedings of IEEE Information Theory Workshop (ITW'03), pages 328–331, Paris, France, 2003.

S. Loyka. Channel capacity of MIMO architecture using the exponential correlation matrix. IEEE Communication Letters, 5(9):1350–1359, September 2001.

A. Lytova and L. A. Pastur. Central Limit Theorem for linear eigenvalue statistics of random matrices with independent entries. The Annals of Probability, 37(5):1778–1840, 2009.


U. Madhow and M. L. Honig. MMSE interference suppression for direct-sequence spread-spectrum CDMA. IEEE Transactions on Communications, 42(12):3178–3188, December 1994.

A. Mantravadi and V. V. Veeravalli. MMSE detection in asynchronous CDMA systems: an equivalence result. IEEE Transactions on Information Theory, 48(12):3128–3137, December 2002.

I. Maric and R. D. Yates. Bandwidth and power allocation for cooperative strategies in Gaussian relay networks. IEEE Transactions on Information Theory, 56(4):1880–1889, 2010.

I. Maric, A. Goldsmith, and M. Medard. Analog network coding in the high-SNR regime. In IEEE Wireless Network Coding Conference (WiNC'10), pages 1–6, Boston, MA, USA, 2010.

C. Martin and B. Ottersten. Asymptotic eigenvalue distributions and capacity for MIMO channels under correlated fading. IEEE Transactions on Wireless Communications, 3(4):1350–1359, July 2004.

V. A. Marcenko and L. A. Pastur. Distributions of eigenvalues for some sets of random matrices. Math USSR-Sbornik, 1(4):457–483, April 1967.

A. Masucci, Ø. Ryan, S. Yang, and M. Debbah. Finite dimensional statistical inference. IEEE Transactions on Information Theory, 57(4):2457–2473, 2011. ISSN 0018-9448.

M. L. McCloud and L. L. Scharf. A new subspace identification algorithm for high-resolution DOA estimation. IEEE Transactions on Antennas and Propagation, 50(10):1382–1390, 2002.

M. L. Mehta. Random Matrices. Elsevier, San Diego, CA, USA, first edition, 2004.

F. Meshkati, H. V. Poor, S. C. Schwartz, and N. B. Mandayam. An energy-efficient approach to power control and receiver design in wireless data networks. IEEE Transactions on Communications, 53(11):1885–1894, November 2005.

X. Mestre. On the asymptotic behavior of the sample estimates of eigenvalues and eigenvectors of covariance matrices. IEEE Transactions on Signal Processing, 56(11):5353–5368, November 2008a.

X. Mestre. Improved estimation of eigenvalues of covariance matrices and their associated subspaces using their sample estimates. IEEE Transactions on Information Theory, 54(11):5113–5129, November 2008b.

X. Mestre and M. Lagunas. Modified subspace algorithms for DoA estimation with large arrays. IEEE Transactions on Signal Processing, 56(2):598–614, February 2008.

X. Mestre, J. R. Fonollosa, and A. Pages-Zamora. Capacity of MIMO channels: asymptotic evaluation under correlated fading. IEEE Journal on Selected Areas in Communications, 21(5):829–838, June 2003.

X. Mestre, P. Vallet, W. Hachem, and P. Loubaton. Asymptotic analysis of a consistent subspace estimator for observations of increasing dimension. In Proceedings of IEEE Workshop on Statistical Signal Processing (SSP'11), Nice, France, 2011.

M. Mezard, G. Parisi, and M. Virasoro. Spin Glass Theory and Beyond. World Scientific, Singapore, 1987. ISBN 9971501155.

S. M. Mishra, A. Sahai, and R. Brodersen. Cooperative sensing among cognitive radios. In Proceedings of IEEE International Conference on Communications (ICC'06), pages 1658–1663, Istanbul, Turkey, 2006.

J. Mitola III and G. Q. Maguire Jr. Cognitive radio: making software radios more personal. IEEE Personal Communication Magazine, 6(4):13–18, 1999.

S. Moshavi, E. G. Kanterakis, and D. L. Schilling. Multistage linear receivers for DS-CDMA systems. International Journal of Wireless Information Networks, 3(1):1–17, 1996.

A. L. Moustakas and S. H. Simon. Optimizing multiple-input single output (MISO) communication with general Gaussian channels: nontrivial covariance and non-zero mean. IEEE Transactions on Information Theory, 49(10):2770–2780, October 2003.

A. L. Moustakas and S. H. Simon. Random matrix theory of multi-antenna communications: the Rician channel. Journal of Physics A: Mathematical and General, 38(49):10859–10872, November 2005.

A. L. Moustakas and S. H. Simon. On the outage capacity of correlated multiple-path MIMO channels. IEEE Transactions on Information Theory, 53(11):3887–3903, 2007.

A. L. Moustakas, H. U. Baranger, L. Balents, A. M. Sengupta, and S. H. Simon. Communication through a diffusive medium: Coherence and capacity. Science, 287:287–290, 2000.

R. J. Muirhead. Aspects of Multivariate Statistical Theory. Wiley Online Library, 1982.

R. Müller. Multiuser receivers for randomly spread signals: fundamental limits with and without decision-feedback. IEEE Transactions on Information Theory, 47(1):268–283, January 2001.

R. Müller. A random matrix model of communication via antenna arrays. IEEE Transactions on Information Theory, 48(9):2495–2506, September 2002.

R. Müller. On the asymptotic eigenvalue distribution of concatenated vector-valued fading channels. IEEE Transactions on Information Theory, 48(7):2086–2091, July 2002.

R. Müller. Channel capacity and minimum probability of error in large dual antenna array systems with binary modulation. IEEE Transactions on Signal Processing, 51(11):2821–2828, 2003.

R. Müller and S. Verdú. Design and analysis of low-complexity interference mitigation on vector channels. IEEE Journal on Selected Areas in Communications, 19(8):1429–1441, August 2001.

A. Nica and R. Speicher. On the multiplication of free N-tuples of noncommutative random variables. American Journal of Mathematics, 118:799–837, 1996.


H. Nishimori. Statistical Physics of Spin Glasses and Information Processing: An Introduction. Clarendon Press, Gloucestershire, UK, July 2001.

C. Oestges, B. Clerckx, M. Guillaud, and M. Debbah. Dual-polarized wireless communications: from propagation models to system performance evaluation. IEEE Transactions on Wireless Communications, 7(10):4019–4031, October 2008.

H. Ozcelik, M. Herdin, W. Weichselberger, J. Wallace, and E. Bonek. Deficiencies of Kronecker MIMO radio channel model. Electronics Letters, 39(16):1209–1210, 2003.

L. A. Pastur. A simple approach to global regime of random matrix theory. In Mathematical Results in Statistical Mechanics, pages 429–454. World Scientific Publishing, 1999.

D. Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 17(4):1617, 2007.

M. J. M. Peacock, I. B. Collings, and M. L. Honig. Eigenvalue distributions of sums and products of large random matrices via incremental matrix expansions. IEEE Transactions on Information Theory, 54(5):2123–2138, 2008.

C. B. Peel, B. M. Hochwald, and A. L. Swindlehurst. A vector-perturbation technique for near-capacity multiantenna multiuser communication, Part I: channel inversion and regularization. IEEE Transactions on Communications, 53(1):195–202, 2005.

D. Petz and J. Reffy. On asymptotics of large Haar distributed unitary matrices. Periodica Mathematica Hungarica, 49(1):103–117, September 2004.

V. Plerous, P. Gopikrishnan, B. Rosenow, L. Amaral, T. Guhr, and H. Stanley. Random matrix approach to cross correlations in financial data. Phys. Rev. E, 65(6), June 2002.

T. S. Pollock, T. D. Abhayapala, and R. A. Kennedy. Antenna saturation effects on dense array MIMO capacity. In Proceedings of IEEE International Conference on Communications (ICC'03), pages 2301–2305, Anchorage, Alaska, 2003.

H. V. Poor. An Introduction to Signal Detection and Estimation. Springer, 1994.

H. V. Poor and S. Verdú. Probability of error in MMSE multiuser detection. IEEE Transactions on Information Theory, 43(3):858–871, 1997.

N. R. Rao and A. Edelman. The polynomial method for random matrices. Foundations of Computational Mathematics, 8(6):649–702, December 2008.

N. R. Rao, J. A. Mingo, R. Speicher, and A. Edelman. Statistical eigen-inference from large Wishart matrices. Annals of Statistics, 36(6):2850–2885, December 2008.

P. Rapajic and D. Popescu. Information capacity of random signature multiple-input multiple output channel. IEEE Transactions on Communications, 48(8):1245–1248, August 2000.

T. Ratnarajah and R. Vaillancourt. Complex singular Wishart matrices and applications. Computers and Mathematics with Applications, 50(3–4):399–411, 2005.

T. Ratnarajah, R. Vaillancourt, and M. Alvo. Eigenvalues and condition numbers of complex random matrices. SIAM Journal on Matrix Analysis and Applications, 26(2):441–456, 2005a.

T. Ratnarajah, R. Vaillancourt, and M. Alvo. Complex random matrices and Rician channel capacity. Problems of Information Transmission, 41(1):1–22, 2005b.

S. N. Roy. p-statistics or some generalizations in the analysis of variance appropriate to multi-variate problems. Sankhya: The Indian Journal of Statistics, 4:381–396, 1939.

W. Rudin. Real and Complex Analysis. McGraw-Hill Series in Higher Mathematics, third edition, May 1986.

Ø. Ryan. Tools for convolution with finite Gaussian matrices, 2009a. http://folk.uio.no/oyvindry/finitegaussian/.

Ø. Ryan. Documentation for the random matrix library, 2009b. http://folk.uio.no/oyvindry/rmt/doc.pdf.

Ø. Ryan and M. Debbah. Free deconvolution for signal processing applications. In Proceedings of IEEE International Symposium on Information Theory (ISIT'07), pages 1846–1850, Nice, France, June 2007a.

Ø. Ryan and M. Debbah. Multiplicative free convolution and information plus noise type matrices, 2007b. http://arxiv.org/abs/math/0702342.

Ø. Ryan and M. Debbah. Asymptotic behavior of random Vandermonde matrices with entries on the unit circle. IEEE Transactions on Information Theory, 55(7):3115–3148, July 2009.

Ø. Ryan and M. Debbah. Convolution operations arising from Vandermonde matrices. IEEE Transactions on Information Theory, 2011. http://arxiv.org/abs/0910.4624. To appear.

S. Sarafijanovic and J. Y. Le Boudec. An artificial immune system approach with secondary response for misbehavior detection in mobile ad-hoc networks. IEEE Transactions on Neural Networks, Special Issue on Adaptive Learning Systems in Communication Networks, 16(5):1076–1087, 2005.

A. Scaglione. Statistical analysis of the capacity of MIMO frequency selective Rayleigh fading channels with arbitrary number of inputs and outputs. In Proceedings of IEEE International Symposium on Information Theory (ISIT'02), page 278, Lausanne, Switzerland, July 2002.

L. Scharf. Statistical Signal Processing: Detection, Estimation and Time-Series Analysis . Addison-Wesley, Boston, MA, USA, 1991.

R. Schmidt. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation , 34(3):276–280, 1986.

P. Schramm and R. Müller. Spectral efficiency of CDMA systems with linear MMSE interference suppression. IEEE Transactions on Communications, 47(5):722–731, May 1999.

E. Seneta. Non-negative Matrices and Markov Chains. Springer Verlag, New York, second edition, 1981.


A. M. Sengupta and P. P. Mitra. Capacity of multivariate channels with multiplicative noise: random matrix techniques and large-N expansions for full transfer matrices. Journal of Statistical Physics, 125(5-6):1223–1242, December 2006.

R. Seroul. Programming for Mathematicians. Springer Universitext, New York, NY, USA, February 2000.

S. Sesia, I. Toufik, and M. Baker. LTE, The UMTS Long Term Evolution: From Theory to Practice. John Wiley and Sons, Inc., 2009.

S. Shamai and S. Verdú. The impact of frequency-flat fading on the spectral efficiency of CDMA. IEEE Transactions on Information Theory, 47(4):1302–1327, 2001.

C. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 1948.

G. Sharma, A. Ganesh, and P. Key. Performance analysis of random access scheduling schemes. In Proceedings of IEEE International Conference on Computer Communications (INFOCOM'06), Barcelona, Spain, April 2006.

S. Shi, M. Schubert, and H. Boche. Rate optimization for multiuser MIMO systems with linear processing, Part II. IEEE Transactions on Signal Processing, 56(8):4020–4030, 2008.

J. Shore and R. Johnson. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory, 26(1):26–37, 1980.

J. W. Silverstein. On the randomness of eigenvectors generated from networks with random topologies. SIAM Journal on Applied Mathematics, 37(2):235–245, 1979.

J. W. Silverstein. Describing the behavior of eigenvectors of random matrices using sequences of measures on orthogonal groups. SIAM Journal on Mathematical Analysis, 12(2):274–281, 1981.

J. W. Silverstein. Some limit theorems on the eigenvectors of large dimensional sample covariance matrices. Journal of Multivariate Analysis, 15(3):295–324, 1984.

J. W. Silverstein. Eigenvalues and eigenvectors of large dimensional sample covariance matrices. Random Matrices and their Applications, pages 153–159, 1986.

J. W. Silverstein and Z. D. Bai. On the empirical distribution of eigenvalues of a class of large dimensional random matrices. Journal of Multivariate Analysis, 54(2):175–192, 1995.

J. W. Silverstein and S. Choi. Analysis of the limiting spectral distribution of large dimensional random matrices. Journal of Multivariate Analysis, 54(2):295–309, 1995.

J. W. Silverstein, Z. D. Bai, and Y. Q. Yin. A note on the largest eigenvalue of a large dimensional sample covariance matrix. Journal of Multivariate Analysis, 26(2):166–168, 1988.


M. K. Simon, F. F. Digham, and M. S. Alouini. On the energy detection of unknown signals over fading channels. In Proceedings of IEEE International Conference on Communications (ICC’03) , Anchorage, Alaska, 2003.

S. H. Simon, A. L. Moustakas, and L. Marinelli. Capacity and character expansions: Moment generating function and other exact results for MIMO correlated channels. IEEE Transactions on Information Theory, 52(12):5336–5351, 2006.

O. Somekh, B. J. Zaidel, and S. Shamai. Spectral efficiency of joint multiple cell-sites processors for randomly spread DS-CDMA. In Proceedings of IEEE International Symposium on Information Theory (ISIT'04), Chicago, IL, USA, July 2004.

O. Somekh, B. M. Zaidel, and S. Shamai. Sum rate characterization of joint multiple cell-site processing. IEEE Transactions on Information Theory, 53(12):4473–4497, 2007.

R. Speicher. Combinatorial theory of the free product with amalgamation and operator-valued free probability theory. Memoirs of the American Mathematical Society, 627:1–88, 1998.

C. Sun, W. Zhang, and K. B. Letaief. Cooperative spectrum sensing for cognitive radios under bandwidth constraints. In Proceedings of IEEE Wireless Communications & Networking Conference (WCNC'07), pages 1–5, Hong Kong, 2007a.

C. Sun, W. Zhang, and K. B. Letaief. Cluster-based cooperative spectrum sensing in cognitive radio systems. In Proceedings of IEEE International Conference on Communications (ICC'07), pages 2511–2515, Glasgow, Scotland, 2007b.

T. Tanaka. A statistical-mechanics approach to large-system analysis of CDMA multiuser detectors. IEEE Transactions on Information Theory, 48(11):2888–2910, 2002.

R. Tandra and A. Sahai. Fundamental limits on detection in low SNR under noise uncertainty. In International Conference on Wireless Networks, Communications and Mobile Computing, pages 464–469, 2005.

R. Tandra, M. Mishra, and A. Sahai. What is a spectrum hole and what does it take to recognize one? Proceedings of the IEEE, 97(5):824–848, 2009.

G. Taricco. Asymptotic mutual information statistics of separately correlated Rician fading MIMO channels. IEEE Transactions on Information Theory, 54(8):3490–3504, 2008.

I. E. Telatar. Capacity of multi-antenna Gaussian channels. Bell Labs, Technical Memorandum , pages 585–595, 1995.

I. E. Telatar. Capacity of multi-antenna Gaussian channels. European Transactions on Telecommunications , 10(6):585–595, February 1999.

Z. Tian and G. B. Giannakis. A wavelet approach to wideband spectrum sensing for cognitive radios. In International Conference on Cognitive Radio Oriented Wireless Networks and Communications (CROWCOM'06), pages 1–5, Mykonos Island, Greece, 2006.


E. C. Titchmarsh. The Theory of Functions. Oxford University Press, New York, NY, USA, 1939.

C. A. Tracy and H. Widom. On orthogonal and symplectic matrix ensembles. Communications in Mathematical Physics, 177(3):727–754, 1996.

D. N. C. Tse and S. V. Hanly. Multiaccess fading channels. I. Polymatroid structure, optimal resource allocation and throughput capacities. IEEE Transactions on Information Theory, 44(7):2796–2815, 1998.

D. N. C. Tse and S. V. Hanly. Linear multiuser receivers: effective interference, effective bandwidth and user capacity. IEEE Transactions on Information Theory, 45(2):641–657, February 1999.

D. N. C. Tse and S. Verdú. Optimum asymptotic multiuser efficiency of randomly spread CDMA. IEEE Transactions on Information Theory, 46(7):2718–2722, July 2000.

D. N. C. Tse and P. Viswanath. Fundamentals of Wireless Communication. Cambridge University Press, Cambridge, UK, 2005.

D. N. C. Tse and O. Zeitouni. Linear multiuser receivers in random environments. IEEE Transactions on Information Theory, 46(1):171–188, January 2000.

D. N. C. Tse and L. Zheng. Diversity and multiplexing: a fundamental tradeoff in multiple-antenna channels. IEEE Transactions on Information Theory , 49(5):1073–1096, 2003.

G. H. Tucci. A Note on Averages over Random Matrix Ensembles. IEEE Transactions on Information Theory, 2010. http://arxiv.org/abs/0910.0575. Submitted for publication.

G. H. Tucci and P. A. Whiting. Eigenvalue results for large scale random Vandermonde matrices with unit complex entries. IEEE Transactions on Information Theory, 2010. To appear.

A. M. Tulino and S. Verdú. Random matrix theory and wireless communications. Foundations and Trends in Communications and Information Theory, 1(1), 2004.

A. M. Tulino and S. Verdú. Impact of antenna correlation on the capacity of multiantenna channels. IEEE Transactions on Information Theory, 51(7):2491–2509, 2005.

A. M. Tulino, S. Verdú, and A. Lozano. Capacity of antenna arrays with space, polarization and pattern diversity. In Proceedings of IEEE Information Theory Workshop (ITW'03), pages 324–327, Paris, France, 2003.

A. M. Tulino, L. Li, and S. Verdú. Spectral efficiency of multicarrier CDMA. IEEE Transactions on Information Theory, 51(2):479–505, 2005.

H. Urkowitz. Energy detection of unknown deterministic signals. Proceedings of the IEEE , 55(4):523–531, 1967.

P. Vallet and P. Loubaton. A G-estimator of the MIMO channel ergodic capacity. In Proceedings of IEEE International Symposium on Information Theory (ISIT'09), pages 774–778, Seoul, Korea, June 2009.

P. Vallet, P. Loubaton, and X. Mestre. Improved subspace estimation for multivariate observations of high dimension: the deterministic signals case. IEEE Transactions on Information Theory, 2010. http://arxiv.org/abs/1002.3234. Submitted for publication.

P. Vallet, W. Hachem, P. Loubaton, X. Mestre, and J. Najim. On the consistency of the G-MUSIC DoA estimator. In Proceedings of IEEE Workshop on Statistical Signal Processing (SSP'11), Nice, France, 2011a.

P. Vallet, W. Hachem, P. Loubaton, X. Mestre, and J. Najim. An improved MUSIC algorithm based on low-rank perturbation of large random matrices. In Proceedings of IEEE Workshop on Statistical Signal Processing (SSP'11), Nice, France, 2011b.

A. W. Van der Vaart. Asymptotic Statistics. Cambridge University Press, New York, 2000.

H. L. Van Trees. Detection, Estimation and Modulation Theory. Wiley and Sons, 1968.

S. Verdú and S. Shamai. Spectral efficiency of CDMA with random spreading. IEEE Transactions on Information Theory, 45(2):622–640, February 1999.

S. Viswanath, N. Jindal, and A. Goldsmith. Duality, achievable rates, and sum-rate capacity of Gaussian MIMO broadcast channels. IEEE Transactions on Information Theory, 49(10):2658–2668, 2003.

H. Viswanathan and S. Venkatesan. Asymptotics of sum rate for dirty paper coding and beamforming in multiple-antenna broadcast channels. Proceedings of IEEE Annual Allerton Conference on Communication, Control, and Computing (Allerton'03), 41(2):1064–1073, 2003.

D. Voiculescu. Addition of certain non-commuting random variables. Journal of Functional Analysis, 66(3):323–346, 1986.

D. Voiculescu. Multiplication of certain non-commuting random variables. J. Operator Theory, 18:223–235, 1987.

D. Voiculescu. Limit laws for random matrices and free products. Inventiones Mathematicae , 104(1):201–220, December 1991.

D. Voiculescu, K. J. Dykema, and A. Nica. Free random variables. American Mathematical Society , 1992.

S. Wagner, R. Couillet, M. Debbah, and D. T. M. Slock. Large system analysis of linear precoding in MISO broadcast channels with limited feedback. IEEE Transactions on Information Theory, 2011. http://arxiv.org/abs/0906.3682. Submitted for publication.

B. Wang, K. J. Liu, and T. Clancy. Evolutionary cooperative spectrum sensing game: how to collaborate? IEEE Transactions on Communications, 58(3):890–900, March 2010. ISSN 0090-6778.

J. G. Wardrop. Road paper: some theoretical aspects of road traffic research. ICE Proceedings, Engineering Divisions, 1(3):325–362, 1952.

M. Wax and T. Kailath. Detection of signals by information theoretic criteria. IEEE Transactions on Acoustics, Speech and Signal Processing, 33(2):387–392, 1985.

W. Weichselberger, M. Herdin, H. Ozcelik, and E. Bonek. A stochastic MIMO channel model with joint correlation of both link ends. IEEE Transactions on Wireless Communications, 5(1):90–100, 2006. ISSN 1536-1276.

H. Weingarten, Y. Steinberg, and S. Shamai. The capacity region of the Gaussian multiple-input multiple-output broadcast channel. IEEE Transactions on Information Theory, 52(9):3936–3964, 2006.

A. Wiesel, Y. C. Eldar, and S. Shamai. Zero-forcing precoding and generalized inverses. IEEE Transactions on Signal Processing, 56(9):4409–4418, 2008.

E. Wigner. Characteristic vectors of bordered matrices with infinite dimensions. The Annals of Mathematics, 62(3):548–564, November 1955.

E. Wigner. On the distribution of roots of certain symmetric matrices. The Annals of Mathematics, 67(2):325–327, March 1958.

J. Wishart. The generalized product moment distribution in samples from a normal multivariate population. Biometrika, 20(1-2):32–52, December 1928.

A. D. Wyner. Shannon-theoretic approach to a Gaussian cellular multiple access channel. IEEE Transactions on Information Theory, 40(6):1713–1727, 1994.

S. Yang and J. C. Belfiore. Diversity of MIMO multihop relay channels. IEEE Transactions on Information Theory, 2008. http://arxiv.org/abs/0708.0386. Submitted for publication.

J. Yao, R. Couillet, J. Najim, E. Moulines, and M. Debbah. CLT for eigen-inference methods in cognitive radios. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'11), Prague, Czech Republic, 2011. To appear.

R. D. Yates. A framework for uplink power control in cellular radio systems. IEEE Journal on Selected Areas in Communications, 13(7):1341–1347, 1995.

Y. Q. Yin, Z. D. Bai, and P. R. Krishnaiah. On the limit of the largest eigenvalue of the large dimensional sample covariance matrix. Probability Theory and Related Fields, 78(4):509–521, 1988.

T. Yoo and A. Goldsmith. On the optimality of multiantenna broadcast scheduling using zero-forcing beamforming. IEEE Journal on Selected Areas in Communications, 24(3):528–541, 2006.

W. Yu and J. M. Cioffi. Sum capacity of Gaussian vector broadcast channels. IEEE Transactions on Information Theory, 50(9):1875–1892, 2004.

B. M. Zaidel, S. Shamai, and S. Verdú. Multicell uplink spectral efficiency of coded DS-CDMA with random signatures. IEEE Journal on Selected Areas in Communications, 19(8):1556–1569, August 2001.

A. Zellner. An Introduction to Bayesian Inference in Econometrics. John Wiley and Sons, Inc., New York, second edition, 1971.

Y. Zeng and Y. C. Liang. Eigenvalue based spectrum sensing algorithms for cognitive radio. IEEE Transactions on Communications, 57(6):1784–1793, 2009.

L. Zhang. Spectral analysis of large dimensional random matrices. PhD thesis, National University of Singapore, 2006.

W. Zhang and K. B. Letaief. Cooperative spectrum sensing with transmit and relay diversity in cognitive networks. IEEE Transactions on Wireless Communications, 7(12):4761–4766, December 2008.


Index

almost sure convergence, 19
distribution function, 19

arcsinus law, 87

asymptotic freeness, 78

Bai and Silverstein method, 115
Bayesian probability theory, 478
Bell number, 101
Borel–Cantelli lemma, 46
broadcast channel, 335
linear precoders, 336
Brownian motion, 507
Dyson, 508
capacity maximizing precoder, 5
frequency selective channels, 325
Rayleigh model, 296
Rice model, 318
Carleman's condition, 95
Catalan number, 102
Cauchy integral formula, 202
CDMA
orthogonal, 284
random, 264
central limit theorem, 63, 213
martingale difference, 69
variance profile, 175
channel modeling, 477
correlated channel, 484
rank-limited channel, 494
circular law, 31
CLT, see central limit theorem
cognitive radio, 393
complex analysis
pole, 206
residue, 206
residue calculus, 207
complex zonal polynomial, 24
conditional probability, 66
consistent estimator, 2
convergence in probability, 19
correlated channel, 484
correlation profile, 149
cumulant
classical cumulant, 101
free cumulant, 99
moment to cumulant, 100

cumulative distribution function, see distribution function

decoder design, 328
delta method, 216
detection, 393
condition number criterion, 413
error exponent, 416
GLRT criterion, 414
Neyman–Pearson criterion, 399
test power, 416
deterministic equivalent, 114
Haar matrix, 153
information plus noise, 152
variance profile, 145
distribution function, 18
dominated convergence theorem, 135
Dyson Brownian motion, 508
eigen-inference, see G-estimation
eigenvector
central limit theorem, 238
limiting distribution, 238
elementary symmetric polynomials, 100
empirical spectral distribution, 29
ergodic capacity, 296
frequency selective channels, 324
Rayleigh model, 295
Rice model, 316
e.s.d., 29
estimation, 421
DoA, 422
G-MUSIC, 425
MUSIC, 423
G-MUSIC, 429
power estimation, 432
free probability, 440
G-estimation, 447
η-transform, 41
exact separation, 184, 193


femto-cell, 11
fixed-point algorithm, 117
Fredholm determinant, 233
free family, 73
free moments, 98
free probability theory, 72
additive convolution, 75
additive deconvolution, 75
additive free convolution, 100
asymptotic freeness, 78
sample covariance matrix, 189
linear precoders, 336
l.s.d., 30
MAC
ergodic capacity, 360
quasi-static channel, 357
rate region, 355
Marcenko–Pastur law, 4, 32