TRANSCRIPT
Using Neural Networks in Communication Problems
– Theory and Examples
Lizhong Zheng
MIT
Globecom, December 10, 2019
Collaborators
Shao-lun Huang, Xiangxiang Xu, Anuran Makur, David Qiu, Mohamed AlHajri,
Greg Wornell, Lingjia Liu, Zhou Zhou, Jing Liang
Introduction
The Success of Neural Networks, and us
Computer Vision & NLP: complex problems with no clear model.
Introduction
Use NN in Communication Problems?
The next revolution in communication networks.
Difficulties
Uses too many resources, both computation and samples;
Hard to incorporate domain knowledge or produce reusable solutions;
No guarantees of optimality or robustness;
Some problems are particularly hard for SGD;
“It is human nature to prefer simplicity”
Introduction
In This Talk:
An information theoretic interpretation of learning in neural networks
An example to use NN for a physical layer communication problem
Simplify and Specialize
Information Theory for NN
Neural Networks as Information Processing
Processing result: features $S(x)$ of the input and corresponding features $v(y)$ of the output, giving the model
$$Q_{Y|X}(y \mid x) \propto \exp\left[S^T(x) \cdot v(y)\right]$$
Not sufficient to represent the true model.
Information Theory for NN
A Little Machinery: Local Geometry
Variation of distribution: prior $P_0$ → posterior $Q$.
Can write in vector form:
$$Q - P_0 = [\,Q(x) - P_0(x),\ x \in \mathcal{X}\,]$$
$$\mathrm{LLR}\!\left(\frac{Q}{P_0}\right) = \left[\log\frac{Q(x)}{P_0(x)},\ x \in \mathcal{X}\right] \approx \left[\frac{Q(x) - P_0(x)}{P_0(x)},\ x \in \mathcal{X}\right]$$
Information Vector, with reference $P_0$:
$$\phi(Q) = \left[\frac{Q(x) - P_0(x)}{\sqrt{P_0(x)}},\ x \in \mathcal{X}\right]$$
Local Geometry
Information Vector for Feature Functions
A feature function $f : \mathcal{X} \to \mathbb{R}$; w.l.o.g. require $E_{P_0}[f(X)] = 0$.
Recall:
$$\mathrm{LLR}\!\left(\frac{Q}{P_0}\right) = \left[\log\frac{Q(x)}{P_0(x)},\ x \in \mathcal{X}\right] \approx \left[\frac{Q(x) - P_0(x)}{P_0(x)},\ x \in \mathcal{X}\right]$$
Evaluating any feature function is equivalent to computing an LLR, making a binary decision, or estimating a scalar parameter.
Information vector for a feature function:
$$\phi(f) = \left[\sqrt{P_0(x)} \cdot f(x),\ x \in \mathcal{X}\right]$$
Local Geometry
Turning Things Euclidean
In functional space:
Euclidean norm: $\|\phi(f)\|^2 = \mathrm{var}_{P_0}[f(X)]$
Inner product: $\langle \phi(f_1), \phi(f_2) \rangle = E_{P_0}[f_1(X) f_2(X)]$
Orthogonal features are uncorrelated (no redundant information).
In distribution space:
K-L divergence: $D(P \| Q) \approx \frac{1}{2}\|\phi(P) - \phi(Q)\|^2$
The length of an information vector measures the amount of information.
Inner product ↔ Fisher information
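To make these identities concrete, here is a minimal numpy sketch (not from the talk; the alphabet, perturbation, and features are arbitrary choices) checking the quadratic approximation of the K-L divergence and the norm and inner-product identities:

```python
import numpy as np

# Reference distribution P0 and a small perturbation Q on a toy alphabet
P0 = np.array([0.4, 0.3, 0.2, 0.1])
eps = 0.01 * np.array([1.0, -1.0, 0.5, -0.5])   # perturbation, sums to zero
Q = P0 + eps

# Information vector of Q with reference P0
phi_Q = (Q - P0) / np.sqrt(P0)

# K-L divergence vs. its local quadratic approximation
kl = np.sum(Q * np.log(Q / P0))
print(kl, 0.5 * np.sum(phi_Q**2))               # nearly equal for small eps

# Zero-mean feature functions and their information vectors
f1 = np.array([1.0, -1.0, 2.0, 0.0]); f1 -= P0 @ f1
f2 = np.array([0.5, 1.5, -1.0, 0.0]); f2 -= P0 @ f2
phi1, phi2 = np.sqrt(P0) * f1, np.sqrt(P0) * f2

# <phi(f1), phi(f2)> = E_P0[f1 f2]  and  ||phi(f)||^2 = var_P0[f]
print(phi1 @ phi2, P0 @ (f1 * f2))
print(phi1 @ phi1, P0 @ (f1**2))
```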
Local Geometry
Empirical Average
$X_1, \ldots, X_n$ i.i.d. from $P_0$
Empirical distribution: $\hat P \leftrightarrow \phi$
For a feature function: $f \leftrightarrow \nu$
The empirical average:
$$\frac{1}{n}\sum_{i=1}^{n} f(x_i) = E_{\hat P}[f(X)] = \sum_x \hat P(x) \cdot f(x) = \sum_x (\hat P(x) - P_0(x)) \cdot f(x) = \sum_x \frac{\hat P(x) - P_0(x)}{\sqrt{P_0(x)}} \cdot \sqrt{P_0(x)}\, f(x) = \langle \phi, \nu \rangle$$
Taking an empirical average of a feature is projecting the information vector onto $\nu$.
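A small sketch (illustrative toy $P_0$ and feature $f$, not from the talk) confirming that the empirical average is exactly the projection $\langle \phi, \nu \rangle$:

```python
import numpy as np

rng = np.random.default_rng(1)

P0 = np.array([0.4, 0.3, 0.2, 0.1])
f = np.array([1.0, -1.0, 2.0, 0.0]); f -= P0 @ f   # zero mean under P0
nu = np.sqrt(P0) * f                               # information vector of f

n = 10_000
x = rng.choice(4, size=n, p=P0)                    # i.i.d. samples from P0
P_hat = np.bincount(x, minlength=4) / n            # empirical distribution
phi = (P_hat - P0) / np.sqrt(P0)                   # information vector of P_hat

print(f[x].mean(), phi @ nu)                       # identical up to float error
```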
Local Geometry
Joint Distribution
For a joint distribution $P_{XY}$:
Use $P_X P_Y$ as the reference; assume weak dependence.
Canonical Dependence Matrix (CDM): $B \in \mathbb{R}^{|\mathcal{Y}| \times |\mathcal{X}|}$,
$$B(x, y) = \frac{P_{XY}(x, y) - P_X(x) P_Y(y)}{\sqrt{P_X(x) P_Y(y)}}, \quad x \in \mathcal{X},\ y \in \mathcal{Y}$$
Mutual information:
$$I(X; Y) = D(P_{XY} \,\|\, P_X P_Y) \propto \|B\|^2$$
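As a sanity check, a short sketch (toy joint distribution of my choosing; the code's $B$ has rows indexed by $x$) builds the CDM and compares $I(X;Y)$ with $\frac{1}{2}\|B\|_F^2$, the local approximation behind the proportionality above:

```python
import numpy as np

# A weakly dependent joint distribution on {0,1,2} x {0,1}
P_XY = np.array([[0.21, 0.19],
                 [0.14, 0.16],
                 [0.16, 0.14]])
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)

# Canonical Dependence Matrix (rows index x, columns index y)
B = (P_XY - np.outer(P_X, P_Y)) / np.sqrt(np.outer(P_X, P_Y))

# I(X;Y) is close to (1/2)||B||_F^2 under weak dependence
I = np.sum(P_XY * np.log(P_XY / np.outer(P_X, P_Y)))
print(I, 0.5 * np.linalg.norm(B) ** 2)
```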
Local Geometry
Decomposition of Mutual Information
SVD: $B = \sum_i \sigma_i \cdot \psi_i \phi_i^T$
Recall $I(X; Y) \propto \|B\|^2 = \sum_i \sigma_i^2$
The dependence between $X, Y$ decomposes into a number of modes.
A similar statement holds for common information.
HGR maximal correlation: given $X, Y \sim P_{XY}$,
$$\max_{f, g} \mathrm{corr}[f(X), g(Y)] = \sigma_1, \qquad f^*(x) = \frac{\phi_1(x)}{\sqrt{P_X(x)}}, \quad g^*(y) = \frac{\psi_1(y)}{\sqrt{P_Y(y)}}$$
Related: CCA, correspondence analysis.
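The HGR-optimal features then drop out of a plain SVD of the CDM; a sketch with the same toy distribution (not from the talk):

```python
import numpy as np

P_XY = np.array([[0.21, 0.19],
                 [0.14, 0.16],
                 [0.16, 0.14]])
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)
B = (P_XY - np.outer(P_X, P_Y)) / np.sqrt(np.outer(P_X, P_Y))

# SVD of the CDM: sigma_i are the strengths of the dependence modes
U, s, Vt = np.linalg.svd(B)
f_star = U[:, 0] / np.sqrt(P_X)    # optimal feature of X
g_star = Vt[0] / np.sqrt(P_Y)      # optimal feature of Y

# sigma_1 equals the correlation of f*(X) and g*(Y)
Ef, Eg = P_X @ f_star, P_Y @ g_star
cov = np.sum(P_XY * np.outer(f_star - Ef, g_star - Eg))
std = np.sqrt(P_X @ (f_star - Ef) ** 2) * np.sqrt(P_Y @ (g_star - Eg) ** 2)
print(s[0], cov / std)             # the two values agree
```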
Local Geometry
Neural Network with Discrete Inputs
True $P_{XY}$ and empirical $\hat P_{XY}$; try to fit
$$\min_Q D(\hat P_{XY} \,\|\, Q_{XY})$$
Ideal network: softmax regression,
$$Q^{(k)}_{XY}(x, y) \propto P_X(x) P_Y(y) \exp\left[\sum_{i=1}^{k} s_i(x) v_i(y)\right]$$
Local Geometry
Guess What Would an Ideal NN Do?
$$Q^{(k)}_{XY}(x, y) \propto P_X(x) P_Y(y) \exp\left[\sum_{i=1}^{k} s_i(x) v_i(y)\right]$$
$$\hat B^{(k)} = \sum_{i=1}^{k} \underbrace{\sqrt{P_X(x)}\, s_i(x)}_{\phi_i} \cdot \underbrace{\sqrt{P_Y(y)}\, v_i(y)}_{\psi_i}$$
Low-rank approximation: $\min \|\hat B^{(k)} - B\|^2$, solved by
$$s_i^*(x) = \frac{\phi_i(x)}{\sqrt{P_X(x)}}, \qquad v_i^*(y) = \frac{\psi_i(y)}{\sqrt{P_Y(y)}}$$
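A sketch of this low-rank step on a random toy distribution (illustrative; the $\sigma_i$ scaling is absorbed into $s_i$ here, since the exponential model only constrains the products $s_i(x) v_i(y)$):

```python
import numpy as np

rng = np.random.default_rng(2)
P_XY = rng.random((6, 5)); P_XY /= P_XY.sum()      # random toy joint pmf
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)
B = (P_XY - np.outer(P_X, P_Y)) / np.sqrt(np.outer(P_X, P_Y))

k = 2
U, s, Vt = np.linalg.svd(B)
S = U[:, :k] * s[:k] / np.sqrt(P_X)[:, None]       # s_i*(x), sigma_i absorbed
V = Vt[:k].T / np.sqrt(P_Y)[:, None]               # v_i*(y)

# The induced B-hat is the best rank-k approximation (Eckart-Young):
B_k = (np.sqrt(P_X)[:, None] * S) @ (np.sqrt(P_Y)[:, None] * V).T
print(np.linalg.norm(B - B_k), np.sqrt(np.sum(s[k:] ** 2)))  # equal residuals
```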
Local Geometry
NN as SVD Solver
Capture the most significant $k$ modes of dependence, without further bias:
$$S(x) = \frac{\phi^*(x)}{\sqrt{P_X(x)}}, \qquad v(y) = \frac{\psi^*(y)}{\sqrt{P_Y(y)}}$$
Who did the SVD? Backprop ↔ ACE (alternating conditional expectations) ↔ the power method:
$$v(y) = E[S(X) \mid Y = y] \quad \leftrightarrow \quad \psi = B \cdot \phi$$
$$S(x) = E[v(Y) \mid X = x] \quad \leftrightarrow \quad \phi = B^T \cdot \psi$$
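A minimal power-method / ACE iteration (illustrative; the code's $B$ has rows indexed by $x$, the transpose of the slide's convention, which leaves the singular values unchanged):

```python
import numpy as np

rng = np.random.default_rng(3)
P_XY = rng.random((6, 5)); P_XY /= P_XY.sum()
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)

S = rng.standard_normal(6)                     # random initial feature S(x)
for _ in range(200):
    S -= P_X @ S                               # center under P_X
    v = (P_XY * S[:, None]).sum(axis=0) / P_Y  # v(y) = E[S(X) | Y = y]
    v -= P_Y @ v                               # center under P_Y
    v /= np.sqrt(P_Y @ v**2)                   # normalize to unit variance
    S = (P_XY * v[None, :]).sum(axis=1) / P_X  # S(x) = E[v(Y) | X = x]

# Compare the limit with sigma_1 from an explicit SVD of the CDM
B = (P_XY - np.outer(P_X, P_Y)) / np.sqrt(np.outer(P_X, P_Y))
sigma1 = np.linalg.svd(B, compute_uv=False)[0]
print(np.sqrt(P_X @ S**2), sigma1)             # std of S(X) converges to sigma_1
```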
Local Geometry
Connections to Information Theory
Decomposition of mutual information, common randomness, ...
HGR maximal correlation;
Low rank matrix completion for models;
Universal feature: maximize average relevance to unknown queries.
The most “learn-able” partial model.
Local Geometry
More Realistic Neural Networks
Network structure puts a limit on what feature functions can be generated;
Approximate the ideal feature function with Euclidean errors;
Every quantity has a name;
Many equivalent representations; need a canonical basis (whitening).
Detection Problem
Use This in Communication Problems
Where to use?
When there is no clear model and no known optimal solution;
Non-linear, non-Gaussian.
Physical layer vs. higher layer.
Focus the learning power
Combine with classical processing.
Detection Problem
Symbol Detection over Interference Channel
Y = h · X + W
X ∈ QAM/PAM; CSIR: h is known at the receiver;
W is non-Gaussian, with a fixed but unknown PDF p_W.
Standard solution: linear MMSE with minimum-distance detection.
NN solution:
How do we use the knowledge of h and the QAM structure?
Can we train with one channel realization / SNR and use the result for another?
Detection Problem
First Simple Step: Regularity of PAM
Y = h · X + W, X ∈ {−3, −1, +1, +3}
Observation: p(y | X = +1) = p(y + 2h | X = +3), ... (checked numerically in the sketch after this list).
Reuse binary decision modules.
CNN over amplitude.
Reuse of training samples.
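A quick numerical check of this shift property (the Gaussian-mixture noise below is an arbitrary stand-in for p_W, not the distribution used in the talk): since Y = h · X + W, every conditional density p(y | X = x) equals p_W(y − hx), so one binary decision module can serve all amplitude pairs.

```python
import numpy as np

rng = np.random.default_rng(4)
h = 0.8

def sample_W(n):
    """Arbitrary non-Gaussian noise: a two-component Gaussian mixture."""
    comp = rng.random(n) < 0.7
    return np.where(comp, rng.normal(-0.5, 0.3, n), rng.normal(1.2, 0.6, n))

# p(y | X = x) = p_W(y - h*x), so p(y | X = +1) = p(y + 2h | X = +3)
n = 200_000
y1 = h * 1.0 + sample_W(n)                            # Y given X = +1
y3 = h * 3.0 + sample_W(n)                            # Y given X = +3
bins = np.linspace(-4.0, 6.0, 81)
p1, _ = np.histogram(y1, bins, density=True)
p3, _ = np.histogram(y3 - 2 * h, bins, density=True)  # shift back by 2h
print(np.max(np.abs(p1 - p3)))                        # small: same density after shift
```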
Detection Problem
Transferable Knowledge
Y = h_s · X + W, X ∈ {−1, +1}
Suppose we trained the network for h_s; can we use it for a different h_t?
Where is the knowledge of the pdf p_W stored, and how can it be used?
Concrete example: W is interference from a PAM user, but the receiver does not know this (a possible data generator is sketched below).
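A possible data generator for this example (the interferer gain h_i and the noise level sigma are illustrative assumptions, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(5)

def gen_batch(n, h, h_i=0.6, sigma=0.2):
    """Y = h*X + W with binary X; W is interference from an unseen
    4-PAM user plus Gaussian noise (illustrative parameters)."""
    X = rng.choice([-1.0, 1.0], size=n)
    X_i = rng.choice([-3.0, -1.0, 1.0, 3.0], size=n)  # interfering user
    W = h_i * X_i + sigma * rng.standard_normal(n)
    return h * X + W, X

Y_src, X_src = gen_batch(50_000, h=1.0)   # train on the source fading h_s
Y_tgt, X_tgt = gen_batch(50_000, h=0.4)   # deploy on a new fading h_t
```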
Detection Problem
Source Problem: Linear Approach
Y = h_s · X + W, X ∈ {−1, +1}
Detection Problem
Ground Truth for the Source Problem
Y = h_s · X + W, X ∈ {−1, +1}
Detection Problem
What Does the NN Learn?
Y = h_s · X + W, X ∈ {−1, +1}
Detection Problem
Make NN Work Harder
Lower the SNR:
Detection Problem
Make NN Work Harder
Throw in some “sand”:
Detection Problem
Make NN Work Harder
Throw in some “sand”:
Knowledge Region (membership test sketched below):
$$(y \pm h_s)^T K_Z^{-1} (y \pm h_s) < \gamma$$
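One reading of the region, sketched below with placeholder values for K_Z, gamma, and y: points within Mahalanobis distance gamma of either signal point ±h_s, where the training data concentrated.

```python
import numpy as np

def in_knowledge_region(y, h_s, K_Z, gamma):
    """Test whether y lies within Mahalanobis distance gamma of +h_s
    or -h_s under noise covariance K_Z (gamma is a design choice)."""
    K_inv = np.linalg.inv(K_Z)
    d_plus = (y - h_s) @ K_inv @ (y - h_s)
    d_minus = (y + h_s) @ K_inv @ (y + h_s)
    return min(d_plus, d_minus) < gamma

# Placeholder values, for illustration only
h_s = np.array([1.0, 0.5])
K_Z = np.array([[0.3, 0.1],
                [0.1, 0.2]])
print(in_knowledge_region(np.array([0.9, 0.6]), h_s, K_Z, gamma=2.0))
```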
Detection Problem
Transfer of Knowledge
Trained for Y = h_s · X + Z;
Target problem Y = h_t · X + Z:
same interference structure;
different fading h_t.
Detection Problem
Receiver Structure
Conclusion
Concluding Remarks
Focus the NN's learning power on the non-linear, non-Gaussian, non-ideal part of the problem, and fill in the rest;
Theoretic understanding:
Processing at the input, output, or in the middle of NNs;
Choose features by the “relevance” metric;
Provable guarantees;
A spectrum of methods, from more “COMM” ones to more “NN” ones.