


SUBMITTED TO IEEE WIRELESS COMMUNICATIONS LETTERS, VOL. XX, NO. X, JUNE 2021 1

Opening the Black Box of Deep Neural Networks in Physical Layer Communication

Jun Liu, Kai Mei, Dongtang Ma, Senior Member, IEEE and Jibo Wei, Member, IEEE

Abstract—Deep Neural Network (DNN)-based physical layer techniques are attracting considerable interest due to their potential to enhance communication systems. However, most studies in the physical layer have tended to focus on applying DNN models to wireless communication problems rather than on theoretically understanding how a DNN works in a communication system. In this letter, we aim to quantitatively analyse why DNNs can achieve performance in the physical layer comparable with that of traditional techniques, and what this costs in terms of computational complexity. We further investigate, and experimentally validate, how information flows in a DNN-based communication system under information-theoretic concepts.

Index Terms—Deep neural network (DNN), physical layer communication, information theory.

I. INTRODUCTION

DEEP neural networks (DNNs) have recently drawn a lot of attention as a powerful tool for science and engineering problems that are virtually impossible to formulate explicitly, such as protein structure prediction, image recognition, speech recognition and natural language processing. Although the mathematical theory of communication systems has developed dramatically since Claude Elwood Shannon's monograph "A mathematical theory of communication" [1] provided the foundation of digital communication, the wireless channel-related gap between theory and practice motivates researchers to implement DNNs in existing physical layer communication. To mitigate this gap, a natural thought is to let a DNN jointly optimize a transmitter and a receiver for a given channel model, without being limited to component-wise optimization. In [2], a purely data-driven end-to-end communication system is proposed to jointly optimize transmitter and receiver components. The authors then model the linear and nonlinear steps of processing the received signal as a radio transformer network (RTN), which can be integrated into the end-to-end training process. The ideas of end-to-end learning of communication systems and RTNs through DNNs are extended to orthogonal frequency division multiplexing (OFDM) in [3]. Another natural idea is to recover channel state information (CSI) and estimate the channel as accurately as possible by implementing a DNN, so that the effects of fading could be

Manuscript received June 2, 2021; revised X X, 2021; accepted X X, 2021. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants 61931020, 61372099 and 61601480. (Corresponding author: Jun Liu.)

Jun Liu, Kai Mei, Dongtang Ma, and Jibo Wei are with the College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China (E-mail: {liujun15, meikai11, dongtangma, wjbhw}@nudt.edu.cn).

reduced. The authors of [4] propose an end-to-end DNN-based CSI compression feedback and recovery mechanism, which is further extended with long short-term memory (LSTM) in [5]. In [6], a residual learning based DNN designed for OFDM channel estimation is introduced. Furthermore, in order to mitigate disturbances beyond Gaussian noise, such as channel fading and nonlinear distortion, [7] proposes an online fully complex extreme learning machine-based symbol detection scheme.

Compared with traditional physical layer communication systems, the above-mentioned DNN-based techniques show competitive performance. However, what has been missing is an understanding of the dynamics behind the DNN in physical layer communication.

In this paper, we first attempt to give a mathematical explanation that reveals the mechanism of end-to-end DNN-based communication systems. Then, we try to unveil the role of the DNNs in the tasks of CSI recovery, channel estimation and symbol detection. We believe that we have developed a concise way to open, as well as understand, the "black box" of DNNs in physical layer communication. To summarize, the main contributions of this paper are twofold:

• Instead of proposing a scheme combining a DNN with a typical communication system, we analyse the behaviour of a DNN-based communication system from the perspectives of the whole DNN (communication system), the encoder (transmitter) and the decoder (receiver). Our simulation results verify that the constellations produced by autoencoders are equivalent to the (locally) optimum constellations obtained by the gradient-search algorithm, which minimize the asymptotic probability of error in Gaussian noise under an average power constraint.

• We consider the tasks of CSI recovery, channel estimation and symbol detection as a typical inference problem. The information flow in the DNNs performing these tasks is estimated by using the matrix-based functional of Renyi's $\alpha$-entropy to approximate Shannon's entropy.

The remainder of this paper is organized as follows. In Section II, we give the system model and formulate the problem. Next, simulation results are presented in Section III. Finally, conclusions are drawn in Section IV.

Notations: The notations adopted in this paper are as follows. We use boldface lowercase x, capital letters X and calligraphic letters X to denote column vectors, matrices and sets, respectively. In addition, $\odot$ and $\mathbb{E}\{\cdot\}$ denote the Hadamard product and the expectation operation, respectively.

arXiv:2106.01124v2 [eess.SP] 6 Jun 2021



Fig. 1. Schematic diagram of a general communication system and its corresponding autoencoder representation.

II. SYSTEM MODEL AND PROBLEM FORMULATION

In this section, we first describe the considered system model and then provide a detailed explanation of the problem formulation from three different perspectives.

A. System Model

As shown in the upper part of Fig. 1, let us consider the process of message transmission from the perspectives of a typical communication system and an autoencoder, respectively. We assume that an information source generates a sequence of $k$-bit message symbols $s \in \{1, 2, \cdots, M\}$ to be communicated to the destination, where $M = 2^k$. The modulation modules inside the transmitter then map each symbol $s$ to a signal $\mathbf{x} \in \mathbb{R}^N$, where $N$ denotes the dimension of the signal space. The signal alphabet is denoted by $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_M$. During channel transmission, the $N$-dimensional signal $\mathbf{x}$ is corrupted to $\mathbf{y} \in \mathbb{R}^N$ with conditional probability density function (PDF) $p(\mathbf{y}|\mathbf{x}) = \prod_{n=1}^{N} p(y_n|x_n)$. In this paper, we use $N/2$ bandpass channels, each with separately modulated in-phase and quadrature components, to transmit the $N$-dimensional signal [8]. Finally, the received signal is mapped by the demodulation module inside the receiver to $\hat{s}$, an estimate of the transmitted symbol $s$. The procedures mentioned above have been exhaustively presented by Shannon.

From the point of view of filtering or signal inference, the idea of an autoencoder-based communication system matches Norbert Wiener's perspective [9]. As shown in the lower part of Fig. 1, the autoencoder consists of an encoder and a decoder, each of which is a feedforward neural network (NN) with parameters (weights) $\mathbf{\Theta}_f$ and $\mathbf{\Theta}_g$, respectively. Note that each symbol $s$ from the information source usually needs to be encoded as a one-hot vector $\mathbf{s} \in \mathbb{R}^M$ before being fed into the encoder. Under a given constraint (e.g., an average signal power constraint), the PDF of a wireless channel and a loss function that minimizes the symbol error probability, the encoder and decoder are able to learn, respectively, to appropriately represent $\mathbf{s}$ as $\mathbf{z} = f_{\mathbf{\Theta}_f}(\mathbf{s})$ and to map the corrupted signal $\mathbf{v}$ to an estimate of the transmitted symbol $\hat{\mathbf{s}} = g_{\mathbf{\Theta}_g}(\mathbf{v})$, where $\mathbf{z}, \mathbf{v} \in \mathbb{R}^N$. Here, we use $\mathbf{z}_1, \mathbf{z}_2, \cdots, \mathbf{z}_M$ to denote the transmitted signals from the encoder, in order to distinguish them from the transmitted signals from the transmitter.
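The pipeline above can be sketched numerically. The following is a minimal illustration, not the paper's trained network: the encoder is a fixed random linear map standing in for the learned $f_{\mathbf{\Theta}_f}$, the channel adds Gaussian noise with covariance $(2N_0/N)\mathbf{I}$, and the decoder is a minimum-distance rule; `W_enc`, `n0` and the function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 16, 2          # alphabet size M = 2^k and signal-space dimension N

def one_hot(s, M):
    """Encode a symbol index s in {0, ..., M-1} as a one-hot vector."""
    v = np.zeros(M)
    v[s] = 1.0
    return v

# Placeholder encoder: a fixed random linear map R^M -> R^N standing in
# for the trained f_{Theta_f}; in the paper this mapping is learned.
W_enc = rng.standard_normal((N, M))

def encode(s):
    z = W_enc @ one_hot(s, M)
    # Scale each point so that E[||z||^2] = P_av = 1/M.
    return z / np.linalg.norm(z) * np.sqrt(1.0 / M)

def channel(z, n0=0.01):
    """AWGN channel: v = z + noise with noise ~ N(0, (2 N_0 / N) I)."""
    return z + rng.normal(scale=np.sqrt(2 * n0 / N), size=z.shape)

def decode(v):
    """Placeholder minimum-distance decoder over the signal set."""
    Z = np.stack([encode(s) for s in range(M)])
    return int(np.argmin(np.linalg.norm(Z - v, axis=1)))

s = 5
v = channel(encode(s))
s_hat = decode(v)
```

With zero noise the minimum-distance rule recovers the transmitted symbol exactly; the learned encoder in the paper replaces the random `W_enc` with a constellation optimized through training.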

B. Understanding Autoencoders on Message Transmission

From the perspective of the whole autoencoder (communication system), it aims to transmit information to the destination with a low error probability. The symbol error probability, i.e., the probability that the wireless channel has shifted a signal point into another signal's decision region, is

$$P_e = \frac{1}{M} \sum_{m=1}^{M} \Pr\left(\hat{\mathbf{s}} \neq \mathbf{s}_m \mid \mathbf{s}_m \text{ transmitted}\right). \qquad (1)$$

The autoencoder can use the cross-entropy loss function

$$\mathcal{L}_{\log}\left(\mathbf{s}, \hat{\mathbf{s}}; \mathbf{\Theta}_f, \mathbf{\Theta}_g\right) = -\frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{M} \mathbf{s}^{(b)}[i] \log\left(\hat{\mathbf{s}}^{(b)}[i]\right) = -\frac{1}{B} \sum_{b=1}^{B} \log\left(\hat{\mathbf{s}}^{(b)}[s]\right) \qquad (2)$$

to represent the price paid for inaccuracy of prediction, where $\mathbf{s}^{(b)}[i]$ denotes the $i$th element of the $b$th symbol in a training set with $B$ symbols. In order to train the autoencoder to minimize the symbol error probability, the optimal parameters can be found by optimizing the loss function

$$\left(\mathbf{\Theta}_f^*, \mathbf{\Theta}_g^*\right) = \underset{\left(\mathbf{\Theta}_f, \mathbf{\Theta}_g\right)}{\arg\min} \left[\mathcal{L}_{\log}\left(\mathbf{s}, \hat{\mathbf{s}}; \mathbf{\Theta}_f, \mathbf{\Theta}_g\right)\right] \quad \text{subject to } \mathbb{E}\left[\|\mathbf{z}\|_2^2\right] \leq P_{\text{av}} \qquad (3)$$

where $P_{\text{av}}$ denotes the average power. In this paper, we set $P_{\text{av}} = 1/M$. Naturally, we are curious about what the mappings $\mathbf{z} = f_{\mathbf{\Theta}_f}(\mathbf{s})$ look like after training is done.
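To make the loss (2) and the constraint in (3) concrete, here is a small sketch: `cross_entropy_loss` evaluates the one-hot form of (2) from softmax outputs, and `project_power` is one simple (assumed, not the paper's) way to enforce the average-power constraint by rescaling; the sample `probs` array is invented for illustration.

```python
import numpy as np

def cross_entropy_loss(probs, labels):
    """Empirical form of (2): -(1/B) * sum_b log(p_hat^(b)[s_b]),
    where probs[b] is the decoder's softmax output for sample b and
    labels[b] is the index of the transmitted symbol."""
    B = len(labels)
    return -np.mean(np.log(probs[np.arange(B), labels]))

def project_power(Z, p_av):
    """Enforce E[||z||^2] <= P_av, mirroring the constraint in (3),
    by rescaling the whole constellation when it exceeds the budget."""
    power = np.mean(np.sum(Z**2, axis=1))
    if power > p_av:
        Z = Z * np.sqrt(p_av / power)
    return Z

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
loss = cross_entropy_loss(probs, labels)  # -(log 0.7 + log 0.8)/2
```

In practice the power constraint is often realized as a normalization layer at the encoder output, so that (3) can be optimized by ordinary stochastic gradient descent.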

C. Encoder: Finding a Good Representation

Let us pay attention to the encoder (transmitter). In the domain of communication, an encoder needs to learn a robust representation $\mathbf{z} = f_{\mathbf{\Theta}_f}(\mathbf{s})$ to transmit $\mathbf{s}$ against channel disturbances such as thermal noise, channel fading, nonlinear distortion and phase jitter. This is equivalent to finding a coded (or uncoded) modulation scheme with a signal set of size $M$ that maps a symbol $\mathbf{s}$ to a point $\mathbf{z}$ for a given transmit power and maximizes the minimum distance between any two constellation points. Usually, the problem of finding good signal constellations for a Gaussian channel¹ is associated with the search for lattices with high packing density, an old and well-studied problem in the mathematical literature [11].

We use the algorithm proposed in [12] to obtain the optimum constellations. Consider a zero-mean stationary additive white Gaussian noise (AWGN) channel with one-sided spectral density $2N_0$. For large signal-to-noise ratio (SNR), the asymptotic approximation of (1) can be written as

$$P_e \sim \exp\left(-\frac{1}{8N_0} \min_{i \neq j} \|\mathbf{z}_i - \mathbf{z}_j\|_2^2\right). \qquad (4)$$

¹The problem of constellation optimization is usually considered under the condition of a Gaussian channel. Although the problem under the condition of a Rayleigh fading channel has been studied in [10], its prerequisite is that the side information is perfectly known.



To minimize $P_e$, the problem can be formulated as

$$\{\mathbf{z}_m^*\}_{m=1}^{M} = \underset{\{\mathbf{z}_m\}_{m=1}^{M}}{\arg\min}\, (P_e) \quad \text{subject to } \mathbb{E}\left[\|\mathbf{z}\|_2^2\right] \leq P_{\text{av}} \qquad (5)$$

where $\{\mathbf{z}_m^*\}_{m=1}^{M}$ denotes the optimal signal set. This optimization problem can be solved by using a constrained gradient-search algorithm. We arrange $\{\mathbf{z}_m\}_{m=1}^{M}$ as an $M \times N$ matrix

$$\mathbf{Z} = [\mathbf{z}_1, \mathbf{z}_2, \cdots, \mathbf{z}_M]^T. \qquad (6)$$

Then, the $s$th step of the constrained gradient-search algorithm can be described by

$$\mathbf{Z}'_{s+1} = \mathbf{Z}_s - \eta_s \nabla P_e(\mathbf{Z}_s) \qquad (7a)$$

$$\mathbf{Z}_{s+1} = \frac{\mathbf{Z}'_{s+1}}{\sqrt{\sum_i \sum_j \left(\mathbf{Z}'_{s+1}[i,j]\right)^2}} \qquad (7b)$$

where $\eta_s$ denotes the step size and $\nabla P_e(\mathbf{Z}_s) \in \mathbb{R}^{M \times N}$ denotes the gradient of $P_e$ with respect to the current constellation points. It can be written as

$$\nabla P_e(\mathbf{Z}_s) = [\mathbf{g}_1, \mathbf{g}_2, \cdots, \mathbf{g}_M]^T \qquad (8)$$

where

$$\mathbf{g}_m \sim -\sum_{i \neq m} \exp\left(-\frac{\|\mathbf{z}_m - \mathbf{z}_i\|_2^2}{8N_0}\right) \left(\frac{1}{\|\mathbf{z}_m - \mathbf{z}_i\|_2^2} + \frac{1}{4N_0}\right) \mathbf{1}_{\mathbf{z}_m - \mathbf{z}_i}. \qquad (9)$$

The vector $\mathbf{1}_{\mathbf{z}_m - \mathbf{z}_i}$ denotes the $N$-dimensional unit vector in the direction of $\mathbf{z}_m - \mathbf{z}_i$.
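The constrained gradient search (7a)-(7b) with gradient (8)-(9) can be sketched directly in NumPy. This is a plain reading of the equations, with an assumed random initialization and the renormalization in (7b) interpreted as projecting the constellation back onto the total-power shell; the defaults for `M`, `N`, `n0`, `eta` and `steps` are illustrative.

```python
import numpy as np

def pe_gradient(Z, n0):
    """Gradient (8)-(9): g_m ~ -sum_{i != m} exp(-d^2/(8 N_0))
    * (1/d^2 + 1/(4 N_0)) * (z_m - z_i)/d, with d = ||z_m - z_i||_2."""
    M = Z.shape[0]
    G = np.zeros_like(Z)
    for m in range(M):
        for i in range(M):
            if i == m:
                continue
            diff = Z[m] - Z[i]
            d = np.linalg.norm(diff)
            G[m] -= np.exp(-d**2 / (8 * n0)) * (1 / d**2 + 1 / (4 * n0)) * diff / d
    return G

def gradient_search(M=8, N=2, n0=0.01, eta=2e-4, steps=1000, seed=0):
    """Constrained gradient search: descend P_e via (7a), then
    renormalize via (7b) so the power constraint stays satisfied."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((M, N))
    Z /= np.sqrt(np.sum(Z**2))            # total power 1, i.e. P_av = 1/M
    for _ in range(steps):
        Z = Z - eta * pe_gradient(Z, n0)  # (7a): gradient step
        Z /= np.sqrt(np.sum(Z**2))        # (7b): project onto power shell
    return Z

Z = gradient_search()
```

The descent step moves each point away from its nearest neighbours (the exponential weight makes the closest pair dominate), which is exactly what minimizing (4) demands.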

Comparing (3) with (5), the mechanism of the encoder in an autoencoder-based communication system is unveiled. The mapping function of the encoder can be represented as

$$\left\{f_{\mathbf{\Theta}_f^*}(\mathbf{s}_m)\right\}_{m=1}^{M} \to \{\mathbf{z}_m^*\}_{m=1}^{M} \qquad (10)$$

when the PDF used for generating training samples is a multivariate zero-mean normal distribution $\mathbf{v} - \mathbf{z} \sim \mathcal{N}_N(\mathbf{0}, \mathbf{\Sigma})$, where $\mathbf{0}$ denotes the $N$-dimensional zero vector and $\mathbf{\Sigma} = (2N_0/N)\mathbf{I}$ is an $N \times N$ diagonal matrix.

D. Decoder: Inference

Finally, it is the time to zoom in the lower right cornerof the Fig. 1 to investigate what happens inside the decoder(receiver). As Fig. 2(a) shown, for the tasks of DNN-based CSIrecovery, channel estimate and symbol detection, the problemcan be formulated as an inference model. For the sake ofconvenience, we can denote the target output of the decoder asz instead of s because we can assume z = ๐‘“๐šฏ ๐‘“

(s) is bijection.If the decoder is symmetric, the decoder also can be seen as asub autoencoder which consists of a sub encoder and decoder.Its bottleneck (or middlemost) layer codes is denoted as u.Here we use z to denote CSI or transmitted symbol which wedesire to predict. The decoder infers a prediction z = ๐‘”๐šฏ๐‘”

(v)according to its corresponding measurable variable v.

Fig. 2. (a) An inference model with a DNN decoder of $(2S-1)$ hidden layers for learning. (b) The graph representation of the decoder with $(S-1)$ hidden layers in both the sub-encoder and sub-decoder. The solid arrows denote the direction of input feedforward propagation and the dashed arrows denote the direction of information flow in the error back-propagation phase.

If the joint distribution $p(\mathbf{v}, \mathbf{z})$ is known, the expected (population) risk $C_{p(\mathbf{v},\mathbf{z})}\left(g_{\mathbf{\Theta}_g}, \mathcal{L}_{\log}\right)$ can be written as

$$\begin{aligned} \mathbb{E}\left[\mathcal{L}_{\log}\left(\mathbf{z}, \hat{\mathbf{z}}; \mathbf{\Theta}_g\right)\right] &= \sum_{\mathbf{v} \in \mathcal{V}, \mathbf{z} \in \mathcal{Z}} p(\mathbf{v}, \mathbf{z}) \log\left(\frac{1}{Q(\mathbf{z}|\mathbf{v})}\right) \\ &= \sum_{\mathbf{v} \in \mathcal{V}, \mathbf{z} \in \mathcal{Z}} p(\mathbf{v}, \mathbf{z}) \log\left(\frac{1}{p(\mathbf{z}|\mathbf{v})}\right) + \sum_{\mathbf{v} \in \mathcal{V}, \mathbf{z} \in \mathcal{Z}} p(\mathbf{v}, \mathbf{z}) \log\left(\frac{p(\mathbf{z}|\mathbf{v})}{Q(\mathbf{z}|\mathbf{v})}\right) \\ &= H(\mathbf{z}|\mathbf{v}) + D_{\text{KL}}\left(p(\mathbf{z}|\mathbf{v}) \,\|\, Q(\mathbf{z}|\mathbf{v})\right) \\ &\geq H(\mathbf{z}|\mathbf{v}) \end{aligned} \qquad (11)$$

where $Q(\cdot|\mathbf{v}) = g_{\mathbf{\Theta}_g}(\mathbf{v}) \in P(\mathcal{Z})$ and $D_{\text{KL}}\left(p(\mathbf{z}|\mathbf{v}) \,\|\, Q(\mathbf{z}|\mathbf{v})\right)$ denotes the Kullback-Leibler divergence between $p(\mathbf{z}|\mathbf{v})$ and $Q(\mathbf{z}|\mathbf{v})$ [13]². If and only if the decoder is given by the conditional posterior $g_{\mathbf{\Theta}_g}(\mathbf{v}) = p(\mathbf{z}|\mathbf{v})$ does the expected (population) risk reach its minimum $\min_{g_{\mathbf{\Theta}_g}} C_{p(\mathbf{v},\mathbf{z})}\left(g_{\mathbf{\Theta}_g}, \mathcal{L}_{\log}\right) = H(\mathbf{z}|\mathbf{v})$.

In physical layer communication, instead of perfectly knowing the channel-related joint distribution $p(\mathbf{v}, \mathbf{z})$, we only have a set of $B$ i.i.d. samples $\mathcal{D}_B := \left\{\left(\mathbf{v}^{(b)}, \mathbf{z}^{(b)}\right)\right\}_{b=1}^{B}$ from $p(\mathbf{v}, \mathbf{z})$. In this case, the empirical risk is defined as

$$C_{p(\mathbf{v},\mathbf{z})}\left(g_{\mathbf{\Theta}_g}, \mathcal{L}, \mathcal{D}_B\right) = \frac{1}{B} \sum_{b=1}^{B} \mathcal{L}\left[\mathbf{z}_b, g_{\mathbf{\Theta}_g}(\mathbf{v}_b)\right]. \qquad (12)$$

Practically, the set $\mathcal{D}_B$ from $p(\mathbf{v}, \mathbf{z})$ is usually finite. This leads to a difference between the empirical and expected (population) risks, which can be defined as

$$\text{gen}_{p(\mathbf{v},\mathbf{z})}\left(g_{\mathbf{\Theta}_g}, \mathcal{L}, \mathcal{D}_B\right) = C_{p(\mathbf{v},\mathbf{z})}\left(g_{\mathbf{\Theta}_g}, \mathcal{L}_{\log}\right) - C_{p(\mathbf{v},\mathbf{z})}\left(g_{\mathbf{\Theta}_g}, \mathcal{L}, \mathcal{D}_B\right). \qquad (13)$$
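A toy example may help separate the empirical risk (12) from the expected risk: here the "decoder" is the identity map on a scalar channel, with a squared-error loss standing in for a generic $\mathcal{L}$; all of these choices are assumptions for illustration only.

```python
import numpy as np

def empirical_risk(decoder, samples, loss):
    """Empirical risk (12): average loss of the decoder g over B
    i.i.d. samples (v_b, z_b) drawn from p(v, z)."""
    return np.mean([loss(z, decoder(v)) for v, z in samples])

# Toy channel: v = z + noise with noise std 0.1, decoder g(v) = v.
rng = np.random.default_rng(1)
z = rng.choice([-1.0, 1.0], size=1000)
v = z + rng.normal(scale=0.1, size=1000)
risk = empirical_risk(lambda v: v, list(zip(v, z)),
                      lambda z, zh: (z - zh) ** 2)
# risk approximates the expected risk E[(z - g(v))^2] = 0.01; the
# residual difference is the generalization gap (13), shrinking as B grows.
```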

We can now preliminarily conclude that the DNN-based receiver is an estimator with minimum empirical risk for a given set $\mathcal{D}_B$, whereas its performance is inferior to the optimal estimator with minimum expected (population) risk under a given joint distribution $p(\mathbf{v}, \mathbf{z})$.

²If $X$ is a continuous random variable, the sum becomes an integral when its PDF exists.

Fig. 3. Comparisons of (a) optimum constellations obtained by the gradient-search technique and (b) constellations produced by autoencoders, for $(M=8, N=2)$, $(M=16, N=2)$ and $(M=16, N=3)$.

Furthermore, it is necessary to quantitatively understand how information flows inside the decoder. Fig. 2(b) shows the graph representation of the decoder, where $\mathbf{t}_i$ and $\mathbf{t}'_i$ ($1 \leq i \leq S$) denote the $i$th hidden layer representations starting from the input layer and the output layer, respectively. Here, we use the method proposed in [14] to illustrate layer-wise mutual information through three kinds of information planes (IPs), where Shannon's entropy is estimated by the matrix-based functional of Renyi's $\alpha$-entropy [15]. The details are given in the Appendix.

III. SIMULATION RESULTS

In this section, we provide simulation results to illustrate the behaviour of DNNs in physical layer communication.

A. Constellation Comparison

Fig. 3(a) shows the optimum constellations obtained by the gradient-search technique proposed in [12]. When $N = 2$ and $3$, the algorithm was run for 1000 and 3000 steps, respectively, with step size $\eta = 2 \times 10^{-4}$. Fig. 3(b) shows the constellations produced by autoencoders that have the same network structures and hyperparameters as the autoencoders in [2]. The autoencoders were trained for $10^6$ epochs, each of which contains $M$ different symbols.

When $N = 2$, the two-dimensional constellations produced by autoencoders have a similar pattern to the optimum constellations, which form an (almost) hexagonal lattice. Specifically, in the case of $(M = 8, N = 2)$, one of the constellations found by the autoencoder can be obtained by rotating the optimum constellation found by the gradient-search technique. In the case of $(M = 16, N = 2)$, the constellation found by the autoencoder differs from the optimum constellation but still forms a lattice of (almost) equilateral triangles. In the case of $(M = 16, N = 3)$, one signal point of the optimum constellation lies almost at the origin while the other 15 signal points lie almost on the surface of a sphere with radius $P_{\text{av}}$ and centre $\mathbf{0}$. This pattern is similar to the surface of a truncated icosahedron, which is composed of pentagonal and hexagonal faces. However, the three-dimensional constellation produced by an autoencoder is a local optimum formed by 16 signal points lying almost in a plane.

From the perspective of computational complexity, the cost of training an autoencoder is significantly higher than that of the traditional algorithm. Specifically, an autoencoder which has 4 dense layers with $M$, $N$, $M$ and $M$ neurons, respectively, needs to train $(2M + 1)(M + N) + 2M$ parameters for $10^6$ epochs, whereas the gradient-search algorithm only needs $2M$ trainable parameters for $10^3$ steps.
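The parameter count quoted above can be checked directly: summing weights plus biases layer by layer for a one-hot input of dimension $M$ reduces to the closed form $(2M+1)(M+N)+2M$ stated in the text.

```python
def autoencoder_params(M, N):
    """Trainable parameters of the 4-dense-layer autoencoder with
    M, N, M and M neurons fed by a one-hot input of dimension M:
    (M->M) + (M->N) + (N->M) + (M->M), weights plus biases."""
    return (M * M + M) + (M * N + N) + (N * M + M) + (M * M + M)

def gradient_search_params(M, N):
    """The gradient-search algorithm only updates the M constellation
    points themselves: M * N real parameters, i.e. 2M when N = 2."""
    return M * N

count = autoencoder_params(16, 2)  # matches (2*16 + 1)*(16 + 2) + 2*16
```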

B. Information Flow

We consider a common channel estimation problem for an OFDM system with $N$ subcarriers. Let $\mathbf{z} \triangleq [H[0], H[1], \cdots, H[N-1]]^T$ denote the frequency response vector of a channel. For the sake of convenience, we denote the measurable variable as $\mathbf{v} \triangleq \mathbf{z}_{\text{LS}}$, where $\mathbf{z}_{\text{LS}}$ represents the least-squares (LS) estimate of $\mathbf{z}$. Usually, it can be obtained by using training symbol-based channel estimation. In this paper, we use linear interpolation and the number of pilots $N_p = N/4 = 16$.

According to (11), the minimum logarithmic expected (population) risk for this inference problem is $H(\mathbf{z}|\mathbf{z}_{\text{LS}})$, which can be estimated by the Renyi's $\alpha$-entropy $S_\alpha(\mathbf{z}|\mathbf{z}_{\text{LS}}) = S_\alpha(\mathbf{z}, \mathbf{z}_{\text{LS}}) - S_\alpha(\mathbf{z}_{\text{LS}})$ with $\alpha = 1.01$. Fig. 4 illustrates the entropy $S_\alpha(\mathbf{z}|\mathbf{z}_{\text{LS}})$ with respect to different values of SNR and $N$. As can be seen, $S_\alpha(\mathbf{z}|\mathbf{z}_{\text{LS}})$ monotonically decreases as the size of the training set increases. When $B \to \infty$, $S_\alpha(\mathbf{z}|\mathbf{z}_{\text{LS}})$ decreases slowly, because the joint distribution $p(\mathbf{z}, \mathbf{z}_{\text{LS}})$ can be learned perfectly and therefore the empirical risk approaches the expected risk. Interestingly, when $B > 580$, the lower the SNR or the larger the input dimension $N$, the smaller the $B$ needed to obtain the same value of $S_\alpha(\mathbf{z}|\mathbf{z}_{\text{LS}})$.

Fig. 4. The entropy $S_\alpha(\mathbf{z}|\mathbf{z}_{\text{LS}})$ with respect to different values of SNR and $N$.
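The LS-plus-interpolation front end described above can be sketched as follows. The toy channel (a short random impulse response), the all-ones pilot symbols and the noise level are assumptions for illustration; the pilot spacing follows $N_p = N/4 = 16$, and real and imaginary parts are interpolated separately.

```python
import numpy as np

rng = np.random.default_rng(2)
N, Np = 64, 16                        # subcarriers and pilots (N_p = N/4)

# Toy frequency response z = [H[0], ..., H[N-1]]^T from a short random
# impulse response (illustrative, not the paper's channel model).
h = rng.standard_normal(4) + 1j * rng.standard_normal(4)
z = np.fft.fft(h, N)

pilot_idx = np.arange(0, N, N // Np)  # evenly spaced pilot subcarriers
x_p = np.ones(Np)                     # known pilot symbols (assumed)
noise = (rng.standard_normal(Np) + 1j * rng.standard_normal(Np)) * 0.05
y_p = z[pilot_idx] * x_p + noise      # received pilot observations

# LS estimate at the pilots, then linear interpolation to all N
# subcarriers; np.interp is real-valued, so interpolate parts separately.
z_ls_p = y_p / x_p
z_ls = np.interp(np.arange(N), pilot_idx, z_ls_p.real) \
     + 1j * np.interp(np.arange(N), pilot_idx, z_ls_p.imag)
```

`z_ls` plays the role of the measurable variable $\mathbf{v}$ fed to the DNN estimator, while `z` is the target it tries to infer.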

Fig. 5(a), (b) and (c) illustrate the behaviour of IP-I, IP-II and IP-III in a DNN-based OFDM channel estimator with topology "128−64−32−16−8−16−32−64−128", where a linear activation function is considered and each training sample is constructed by concatenating the real and imaginary parts of the complex channel vectors. The batch size is 100 and the learning rate $\eta = 0.001$. We use $V$ and $V'$ to denote the input and output of the decoder, respectively. The number of iterations is illustrated through a colour bar. From IP-I, it can be seen that the final value of the mutual information $I(T; V')$ in each layer tends to equal the final value of $I(T; V)$, which means that the information from $V$ has been learnt and transferred to $V'$ by each layer. In IP-II, $I(T'; V') < I(T; V)$ in each layer, which implies that none of the layers is overfitting. The tendency of $I(T; V)$ to approach the value of $I(T'; V)$ can be observed in IP-III. Finally, from all the IPs, it is easy to notice that the mutual information does not change significantly once the number of iterations exceeds 200. Meanwhile, according to Fig. 5(d), the MSE reaches a very low value and likewise stops changing sharply. This means that 200 iterations are enough for the task of 64-subcarrier channel estimation using a DNN with the above-mentioned topology.

Fig. 5. The three IPs and the loss curve in a DNN-based channel estimator.

IV. CONCLUSION

In this paper, we propose a framework to understand the behaviour of DNNs in physical layer communication. We find that a DNN-based transmitter essentially tries to produce a good representation of the information source. We then quantitatively analyse the information flow in a DNN-based communication system. We believe that this framework has potential for the design of DNN-based physical layer communication.

APPENDIX A
MATRIX-BASED FUNCTIONAL OF RENYI'S $\alpha$-ENTROPY

For a random variable $X$ in a finite set $\mathcal{X}$, its Renyi's entropy of order $\alpha$ is defined as

$$H_\alpha(X) = \frac{1}{1-\alpha} \log \int_{\mathcal{X}} f^\alpha(x)\, dx \qquad (14)$$

where $f(x)$ is the PDF of the random variable $X$. Let $\{x^{(b)}\}_{b=1}^{B}$ be an i.i.d. sample of $B$ realizations from the random variable $X$ with PDF $f(x)$. The Gram matrix $\mathbf{K}$ can be defined as $\mathbf{K}[i,j] = \kappa(x_i, x_j)$, where $\kappa: \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ is a real-valued positive definite and infinitely divisible kernel. Then, a matrix-based analogue of Renyi's $\alpha$-entropy for a normalized positive definite matrix $\mathbf{A}$ of size $B \times B$ with trace 1 can be given by the functional

$$S_\alpha(\mathbf{A}) = \frac{1}{1-\alpha} \log_2 \left[\sum_{b=1}^{B} \lambda_b(\mathbf{A})^\alpha\right] \qquad (15)$$

where $\lambda_b(\mathbf{A})$ denotes the $b$th eigenvalue of $\mathbf{A}$, a normalized version of $\mathbf{K}$:

$$\mathbf{A}[i,j] = \frac{1}{B} \frac{\mathbf{K}[i,j]}{\sqrt{\mathbf{K}[i,i]\,\mathbf{K}[j,j]}}. \qquad (16)$$

Now, the joint entropy can be defined as

$$S_\alpha(\mathbf{A}, \mathbf{B}) = S_\alpha\left[\frac{\mathbf{A} \odot \mathbf{B}}{\text{tr}(\mathbf{A} \odot \mathbf{B})}\right]. \qquad (17)$$

Finally, the matrix notion of Renyi's mutual information can be defined as

$$I_\alpha(\mathbf{A}; \mathbf{B}) = S_\alpha(\mathbf{A}) + S_\alpha(\mathbf{B}) - S_\alpha(\mathbf{A}, \mathbf{B}). \qquad (18)$$

REFERENCES

[1] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.

[2] T. O'Shea and J. Hoydis, "An introduction to deep learning for the physical layer," IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, 2017.

[3] A. Felix, S. Cammerer, S. Dörner, J. Hoydis, and S. ten Brink, "OFDM-autoencoder for end-to-end learning of communications systems," in 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC). IEEE, 2018, pp. 1–5.

[4] C.-K. Wen, W.-T. Shih, and S. Jin, "Deep learning for massive MIMO CSI feedback," IEEE Wireless Communications Letters, vol. 7, no. 5, pp. 748–751, 2018.

[5] T. Wang, C.-K. Wen, S. Jin, and G. Y. Li, "Deep learning-based CSI feedback approach for time-varying massive MIMO channels," IEEE Wireless Communications Letters, vol. 8, no. 2, pp. 416–419, 2018.

[6] L. Li, H. Chen, H.-H. Chang, and L. Liu, "Deep residual learning meets OFDM channel estimation," IEEE Wireless Communications Letters, vol. 9, no. 5, pp. 615–618, 2019.

[7] J. Liu, K. Mei, X. Zhang, D. Ma, and J. Wei, "Online extreme learning machine-based channel estimation and equalization for OFDM systems," IEEE Communications Letters, vol. 23, no. 7, pp. 1276–1279, 2019.

[8] B. Sklar and P. K. Ray, Digital Communications: Fundamentals and Applications. Pearson Education, 2014.

[9] S. Yu, M. Emigh, E. Santana, and J. C. Príncipe, "Autoencoders trained with relevant information: blending Shannon and Wiener's perspectives," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 6115–6119.

[10] J. Boutros, E. Viterbo, C. Rastello, and J.-C. Belfiore, "Good lattice constellations for both Rayleigh fading and Gaussian channels," IEEE Transactions on Information Theory, vol. 42, no. 2, pp. 502–518, 1996.

[11] G. C. Jorge, A. A. de Andrade, S. I. Costa, and J. E. Strapasson, "Algebraic constructions of densest lattices," Journal of Algebra, vol. 429, pp. 218–235, 2015.

[12] G. Foschini, R. Gitlin, and S. Weinstein, "Optimization of two-dimensional signal constellations in the presence of Gaussian noise," IEEE Transactions on Communications, vol. 22, no. 1, pp. 28–38, 1974.

[13] A. Zaidi, I. Estella-Aguerri et al., "On the information bottleneck problems: models, connections, applications and information theoretic views," Entropy, vol. 22, no. 2, p. 151, 2020.

[14] S. Yu and J. C. Principe, "Understanding autoencoders with information theoretic concepts," Neural Networks, vol. 117, pp. 104–123, 2019.

[15] L. G. S. Giraldo, M. Rao, and J. C. Principe, "Measures of entropy from data using infinitely divisible kernels," IEEE Transactions on Information Theory, vol. 61, no. 1, pp. 535–548, 2014.