TRANSCRIPT
Using Neural Networks in Communication Problems
– Theory and Examples
Lizhong Zheng
MIT
Globecom, December 10, 2019
Collaborators
Shao-lun Huang, Xiangxiang Xu, Anuran Makur, David Qiu, Mohamed AlHajri,
Greg Wornell, Lingjia Liu, Zhou Zhou, Jing Liang
Introduction
The Success of Neural Networks, and us
Computer Vision & NLP: complex problems with no clear model.
Introduction
Use NN in Communication Problems?
The next revolution in communication networks.
Difficulties
Uses too many resources, both computation and samples;
Hard to incorporate domain knowledge or produce reusable solutions;
No guarantees of optimality or robustness;
Some problems are particularly hard for SGD;
“It is human nature to prefer simplicity”
Introduction
In This Talk:
An information theoretic interpretation of learning in neural networks
An example to use NN for a physical layer communication problem
Simplify and Specialize
Information Theory for NN
Neural Networks as Information Processing
Processing result: features $S(x)$ of the input and corresponding features $v(y)$ of the output, giving the model
$$Q_{Y|X}(y \mid x) \propto \exp\left[S^T(x) \cdot v(y)\right]$$
Not sufficient to represent the true model.
Information Theory for NN
A Little Machinery: Local Geometry
Variation of distribution: prior $P_0$ → posterior $Q$.
Can write in vector form:
$$Q - P_0 = [\,Q(x) - P_0(x),\ x \in \mathcal{X}\,]$$
$$\mathrm{LLR}\!\left(\frac{Q}{P_0}\right) = \left[\log\frac{Q(x)}{P_0(x)},\ x \in \mathcal{X}\right] \approx \left[\frac{Q(x) - P_0(x)}{P_0(x)},\ x \in \mathcal{X}\right]$$
Information Vector, with reference $P_0$:
$$\phi(Q) = \left[\frac{Q(x) - P_0(x)}{\sqrt{P_0(x)}},\ x \in \mathcal{X}\right]$$
Local Geometry
Information Vector for Feature Functions
A feature function $f : \mathcal{X} \to \mathbb{R}$; w.l.o.g. require $E_{P_0}[f(X)] = 0$.
Recall:
$$\mathrm{LLR}\!\left(\frac{Q}{P_0}\right) = \left[\log\frac{Q(x)}{P_0(x)},\ x \in \mathcal{X}\right] \approx \left[\frac{Q(x) - P_0(x)}{P_0(x)},\ x \in \mathcal{X}\right]$$
Evaluating any feature function is equivalent to computing an LLR, making a binary decision, or estimating a scalar parameter.
Information vector for a feature function:
$$\phi(f) = \left[\sqrt{P_0(x)} \cdot f(x),\ x \in \mathcal{X}\right]$$
Local Geometry
Turning Things Euclidean
In functional space:
Euclidean norm: $\|\phi(f)\|^2 = \mathrm{var}_{P_0}[f(X)]$
Inner product: $\langle \phi(f_1), \phi(f_2) \rangle = E_{P_0}[f_1(X) f_2(X)]$
Orthogonal features are uncorrelated (no redundant information).
In distribution space:
K-L divergence: $D(P \| Q) \approx \frac{1}{2}\|\phi(P) - \phi(Q)\|^2$
The length of an information vector measures the amount of information.
Inner product ↔ Fisher information
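To make these identities concrete, here is a minimal numpy sketch (not from the talk; the alphabet, perturbation, and features are arbitrary choices) checking the quadratic approximation of the K-L divergence and the norm and inner-product identities:

```python
import numpy as np

# Reference distribution P0 and a small perturbation Q on a toy alphabet
P0 = np.array([0.4, 0.3, 0.2, 0.1])
eps = 0.01 * np.array([1.0, -1.0, 0.5, -0.5])   # perturbation, sums to zero
Q = P0 + eps

# Information vector of Q with reference P0
phi_Q = (Q - P0) / np.sqrt(P0)

# K-L divergence vs. its local quadratic approximation
kl = np.sum(Q * np.log(Q / P0))
print(kl, 0.5 * np.sum(phi_Q**2))               # nearly equal for small eps

# Zero-mean feature functions and their information vectors
f1 = np.array([1.0, -1.0, 2.0, 0.0]); f1 -= P0 @ f1
f2 = np.array([0.5, 1.5, -1.0, 0.0]); f2 -= P0 @ f2
phi1, phi2 = np.sqrt(P0) * f1, np.sqrt(P0) * f2

# <phi(f1), phi(f2)> = E_P0[f1 f2]  and  ||phi(f)||^2 = var_P0[f]
print(phi1 @ phi2, P0 @ (f1 * f2))
print(phi1 @ phi1, P0 @ (f1**2))
```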
Local Geometry
Empirical Average
$X_1, \ldots, X_n$ i.i.d. from $P_0$
Empirical distribution: $\hat P \leftrightarrow \phi$
For a feature function: $f \leftrightarrow \nu$
The empirical average:
$$\frac{1}{n}\sum_{i=1}^{n} f(x_i) = E_{\hat P}[f(X)] = \sum_x \hat P(x) \cdot f(x) = \sum_x (\hat P(x) - P_0(x)) \cdot f(x) = \sum_x \frac{\hat P(x) - P_0(x)}{\sqrt{P_0(x)}} \cdot \sqrt{P_0(x)}\, f(x) = \langle \phi, \nu \rangle$$
Taking an empirical average of a feature is projecting the information vector onto $\nu$.
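A small sketch (illustrative toy $P_0$ and feature $f$, not from the talk) confirming that the empirical average is exactly the projection $\langle \phi, \nu \rangle$:

```python
import numpy as np

rng = np.random.default_rng(1)

P0 = np.array([0.4, 0.3, 0.2, 0.1])
f = np.array([1.0, -1.0, 2.0, 0.0]); f -= P0 @ f   # zero mean under P0
nu = np.sqrt(P0) * f                               # information vector of f

n = 10_000
x = rng.choice(4, size=n, p=P0)                    # i.i.d. samples from P0
P_hat = np.bincount(x, minlength=4) / n            # empirical distribution
phi = (P_hat - P0) / np.sqrt(P0)                   # information vector of P_hat

print(f[x].mean(), phi @ nu)                       # identical up to float error
```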
Local Geometry
Joint Distribution
For a joint distribution $P_{XY}$:
Use $P_X P_Y$ as the reference; assume weak dependence.
Canonical Dependence Matrix (CDM): $B \in \mathbb{R}^{|\mathcal{Y}| \times |\mathcal{X}|}$,
$$B(x, y) = \frac{P_{XY}(x, y) - P_X(x) P_Y(y)}{\sqrt{P_X(x) P_Y(y)}}, \quad x \in \mathcal{X},\ y \in \mathcal{Y}$$
Mutual information:
$$I(X; Y) = D(P_{XY} \,\|\, P_X P_Y) \propto \|B\|^2$$
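As a sanity check, a short sketch (toy joint distribution of my choosing; the code's $B$ has rows indexed by $x$) builds the CDM and compares $I(X;Y)$ with $\frac{1}{2}\|B\|_F^2$, the local approximation behind the proportionality above:

```python
import numpy as np

# A weakly dependent joint distribution on {0,1,2} x {0,1}
P_XY = np.array([[0.21, 0.19],
                 [0.14, 0.16],
                 [0.16, 0.14]])
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)

# Canonical Dependence Matrix (rows index x, columns index y)
B = (P_XY - np.outer(P_X, P_Y)) / np.sqrt(np.outer(P_X, P_Y))

# I(X;Y) is close to (1/2)||B||_F^2 under weak dependence
I = np.sum(P_XY * np.log(P_XY / np.outer(P_X, P_Y)))
print(I, 0.5 * np.linalg.norm(B) ** 2)
```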
Local Geometry
Decomposition of Mutual Information
SVD: $B = \sum_i \sigma_i \cdot \psi_i \phi_i^T$
Recall $I(X; Y) \propto \|B\|^2 = \sum_i \sigma_i^2$
The dependence between $X, Y$ decomposes into a number of modes.
A similar statement holds for common information.
HGR maximal correlation: given $X, Y \sim P_{XY}$,
$$\max_{f, g} \mathrm{corr}[f(X), g(Y)] = \sigma_1, \qquad f^*(x) = \frac{\phi_1(x)}{\sqrt{P_X(x)}}, \quad g^*(y) = \frac{\psi_1(y)}{\sqrt{P_Y(y)}}$$
Related: CCA, correspondence analysis.
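The HGR-optimal features then drop out of a plain SVD of the CDM; a sketch with the same toy distribution (not from the talk):

```python
import numpy as np

P_XY = np.array([[0.21, 0.19],
                 [0.14, 0.16],
                 [0.16, 0.14]])
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)
B = (P_XY - np.outer(P_X, P_Y)) / np.sqrt(np.outer(P_X, P_Y))

# SVD of the CDM: sigma_i are the strengths of the dependence modes
U, s, Vt = np.linalg.svd(B)
f_star = U[:, 0] / np.sqrt(P_X)    # optimal feature of X
g_star = Vt[0] / np.sqrt(P_Y)      # optimal feature of Y

# sigma_1 equals the correlation of f*(X) and g*(Y)
Ef, Eg = P_X @ f_star, P_Y @ g_star
cov = np.sum(P_XY * np.outer(f_star - Ef, g_star - Eg))
std = np.sqrt(P_X @ (f_star - Ef) ** 2) * np.sqrt(P_Y @ (g_star - Eg) ** 2)
print(s[0], cov / std)             # the two values agree
```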
Local Geometry
Neural Network with Discrete Inputs
True $P_{XY}$ and empirical $\hat P_{XY}$; try to fit
$$\min_Q D(\hat P_{XY} \,\|\, Q_{XY})$$
Ideal network: softmax regression,
$$Q^{(k)}_{XY}(x, y) \propto P_X(x) P_Y(y) \exp\left[\sum_{i=1}^{k} s_i(x) v_i(y)\right]$$
Local Geometry
Guess What Would an Ideal NN Do?
$$Q^{(k)}_{XY}(x, y) \propto P_X(x) P_Y(y) \exp\left[\sum_{i=1}^{k} s_i(x) v_i(y)\right]$$
$$\hat B^{(k)} = \sum_{i=1}^{k} \underbrace{\sqrt{P_X(x)}\, s_i(x)}_{\phi_i} \cdot \underbrace{\sqrt{P_Y(y)}\, v_i(y)}_{\psi_i}$$
Low-rank approximation: $\min \|\hat B^{(k)} - B\|^2$, solved by
$$s_i^*(x) = \frac{\phi_i(x)}{\sqrt{P_X(x)}}, \qquad v_i^*(y) = \frac{\psi_i(y)}{\sqrt{P_Y(y)}}$$
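A sketch of this low-rank step on a random toy distribution (illustrative; the $\sigma_i$ scaling is absorbed into $s_i$ here, since the exponential model only constrains the products $s_i(x) v_i(y)$):

```python
import numpy as np

rng = np.random.default_rng(2)
P_XY = rng.random((6, 5)); P_XY /= P_XY.sum()      # random toy joint pmf
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)
B = (P_XY - np.outer(P_X, P_Y)) / np.sqrt(np.outer(P_X, P_Y))

k = 2
U, s, Vt = np.linalg.svd(B)
S = U[:, :k] * s[:k] / np.sqrt(P_X)[:, None]       # s_i*(x), sigma_i absorbed
V = Vt[:k].T / np.sqrt(P_Y)[:, None]               # v_i*(y)

# The induced B-hat is the best rank-k approximation (Eckart-Young):
B_k = (np.sqrt(P_X)[:, None] * S) @ (np.sqrt(P_Y)[:, None] * V).T
print(np.linalg.norm(B - B_k), np.sqrt(np.sum(s[k:] ** 2)))  # equal residuals
```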
Local Geometry
NN as SVD Solver
Capture the most significant $k$ modes of dependence, without further bias:
$$S(x) = \frac{\phi^*(x)}{\sqrt{P_X(x)}}, \qquad v(y) = \frac{\psi^*(y)}{\sqrt{P_Y(y)}}$$
Who did the SVD? Backprop ↔ ACE (alternating conditional expectations) ↔ the power method:
$$v(y) = E[S(X) \mid Y = y] \quad \leftrightarrow \quad \psi = B \cdot \phi$$
$$S(x) = E[v(Y) \mid X = x] \quad \leftrightarrow \quad \phi = B^T \cdot \psi$$
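A minimal power-method / ACE iteration (illustrative; the code's $B$ has rows indexed by $x$, the transpose of the slide's convention, which leaves the singular values unchanged):

```python
import numpy as np

rng = np.random.default_rng(3)
P_XY = rng.random((6, 5)); P_XY /= P_XY.sum()
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)

S = rng.standard_normal(6)                     # random initial feature S(x)
for _ in range(200):
    S -= P_X @ S                               # center under P_X
    v = (P_XY * S[:, None]).sum(axis=0) / P_Y  # v(y) = E[S(X) | Y = y]
    v -= P_Y @ v                               # center under P_Y
    v /= np.sqrt(P_Y @ v**2)                   # normalize to unit variance
    S = (P_XY * v[None, :]).sum(axis=1) / P_X  # S(x) = E[v(Y) | X = x]

# Compare the limit with sigma_1 from an explicit SVD of the CDM
B = (P_XY - np.outer(P_X, P_Y)) / np.sqrt(np.outer(P_X, P_Y))
sigma1 = np.linalg.svd(B, compute_uv=False)[0]
print(np.sqrt(P_X @ S**2), sigma1)             # std of S(X) converges to sigma_1
```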
Local Geometry
Connections to Information Theory
Decomposition of mutual information, common randomness, ...
HGR maximal correlation;
Low rank matrix completion for models;
Universal feature: maximize average relevance to unknown queries.
The most “learn-able” partial model.
Local Geometry
More Realistic Neural Networks
Network structure puts a limit on what feature functions can be generated;
Approximate the ideal feature function with Euclidean errors;
Every quantity has a name;
Many equivalent representations; need a canonical basis (whitening).
Detection Problem
Use This in Communication Problems
Where to use?
When there is no clear model and no known optimal solution;
Non-linear, non-Gaussian.
Physical layer vs. higher layer.
Focus the learning power
Combine with classical processing.
Detection Problem
Symbol Detection over Interference Channel
Y = h · X + W
X ∈ QAM/PAM; CSIR: h is known at the receiver;
W is non-Gaussian, with a fixed but unknown PDF p_W.
Standard solution: linear MMSE with minimum-distance detection.
NN solution:
How do we use the knowledge of h and the QAM structure?
Can we train with one channel realization / SNR and use the result for another?
Detection Problem
First Simple Step: Regularity of PAM
Y = h · X + W, X ∈ {−3, −1, +1, +3}
Observation: p(y | X = +1) = p(y + 2h | X = +3), ... (checked numerically in the sketch after this list).
Reuse binary decision modules.
CNN over amplitude.
Reuse of training samples.
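A quick numerical check of this shift property (the Gaussian-mixture noise below is an arbitrary stand-in for p_W, not the distribution used in the talk): since Y = h · X + W, every conditional density p(y | X = x) equals p_W(y − hx), so one binary decision module can serve all amplitude pairs.

```python
import numpy as np

rng = np.random.default_rng(4)
h = 0.8

def sample_W(n):
    """Arbitrary non-Gaussian noise: a two-component Gaussian mixture."""
    comp = rng.random(n) < 0.7
    return np.where(comp, rng.normal(-0.5, 0.3, n), rng.normal(1.2, 0.6, n))

# p(y | X = x) = p_W(y - h*x), so p(y | X = +1) = p(y + 2h | X = +3)
n = 200_000
y1 = h * 1.0 + sample_W(n)                            # Y given X = +1
y3 = h * 3.0 + sample_W(n)                            # Y given X = +3
bins = np.linspace(-4.0, 6.0, 81)
p1, _ = np.histogram(y1, bins, density=True)
p3, _ = np.histogram(y3 - 2 * h, bins, density=True)  # shift back by 2h
print(np.max(np.abs(p1 - p3)))                        # small: same density after shift
```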
Detection Problem
Transferable Knowledge
Y = h_s · X + W, X ∈ {−1, +1}
Suppose we trained the network for h_s; can we use it for a different h_t?
Where is the knowledge of the pdf p_W stored, and how can it be used?
Concrete example: W is interference from a PAM user, but the receiver does not know this (a possible data generator is sketched below).
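A possible data generator for this example (the interferer gain h_i and the noise level sigma are illustrative assumptions, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(5)

def gen_batch(n, h, h_i=0.6, sigma=0.2):
    """Y = h*X + W with binary X; W is interference from an unseen
    4-PAM user plus Gaussian noise (illustrative parameters)."""
    X = rng.choice([-1.0, 1.0], size=n)
    X_i = rng.choice([-3.0, -1.0, 1.0, 3.0], size=n)  # interfering user
    W = h_i * X_i + sigma * rng.standard_normal(n)
    return h * X + W, X

Y_src, X_src = gen_batch(50_000, h=1.0)   # train on the source fading h_s
Y_tgt, X_tgt = gen_batch(50_000, h=0.4)   # deploy on a new fading h_t
```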
Detection Problem
Source Problem: Linear Approach
Y = h_s · X + W, X ∈ {−1, +1}
Detection Problem
Ground Truth for the Source Problem
Y = h_s · X + W, X ∈ {−1, +1}
Detection Problem
What Does the NN Learn?
Y = h_s · X + W, X ∈ {−1, +1}
Detection Problem
Make NN Work Harder
Lower the SNR:
Detection Problem
Make NN Work Harder
Throw in some “sand”:
Detection Problem
Make NN Work Harder
Throw in some “sand”:
Knowledge Region (membership test sketched below):
$$(y \pm h_s)^T K_Z^{-1} (y \pm h_s) < \gamma$$
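One reading of the region, sketched below with placeholder values for K_Z, gamma, and y: points within Mahalanobis distance gamma of either signal point ±h_s, where the training data concentrated.

```python
import numpy as np

def in_knowledge_region(y, h_s, K_Z, gamma):
    """Test whether y lies within Mahalanobis distance gamma of +h_s
    or -h_s under noise covariance K_Z (gamma is a design choice)."""
    K_inv = np.linalg.inv(K_Z)
    d_plus = (y - h_s) @ K_inv @ (y - h_s)
    d_minus = (y + h_s) @ K_inv @ (y + h_s)
    return min(d_plus, d_minus) < gamma

# Placeholder values, for illustration only
h_s = np.array([1.0, 0.5])
K_Z = np.array([[0.3, 0.1],
                [0.1, 0.2]])
print(in_knowledge_region(np.array([0.9, 0.6]), h_s, K_Z, gamma=2.0))
```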
Detection Problem
Transfer of Knowledge
Trained for Y = h_s · X + Z;
Target problem Y = h_t · X + Z:
same interference structure;
different fading h_t.
Detection Problem
Receiver Structure
Conclusion
Concluding Remarks
Focus the NN's learning power on the non-linear, non-Gaussian, non-ideal part of the problem, and fill in the rest;
Theoretic understanding:
Processing at the input, output, or in the middle of NNs;
Choose features by the “relevance” metric;
Provable guarantees;
A spectrum of methods, from more “COMM” ones to more “NN” ones.