A Few Thoughts on How We May Want to Further Study DNN
Eric Xing
Carnegie Mellon University
Deep Learning is Amazing!!!
What makes it work? Why?
An MLer’s View of the World
Loss functions (likelihood, reconstruction, margin, …)
Constraints (normality, sparsity, label, prior, KL, sum, …)
Algorithms: MC (MCMC, Importance), Opt (gradient, IP), …
Stopping criteria: change in objective, change in update, …
Structures (graphical, group, chain, tree, iid, …)
Empirical performance?
| | DL | ML (e.g., GM) |
| --- | --- | --- |
| Empirical goal | e.g., classification, feature learning | e.g., transfer learning, latent variable inference |
| Structure | Graphical | Graphical |
| Objective | Something aggregated from local functions | Something aggregated from local functions |
| Vocabulary | Neuron, activation/gate function, … | Variables, potential function, … |
| Algorithm | A single, unchallenged inference algorithm: BP | A major focus of open research; many algorithms, and more to come |
| Evaluation | On a black-box score: end performance | On almost every intermediate quantity |
| Implementation | Many untold tricks | More or less standardized |
| Experiments | Massive, real data (ground truth unknown) | Modest, often simulated data (ground truth known) |
< = ? >
A slippery slope to mythology?
• How can we conclusively determine where an improvement in performance comes from?
  – A better model (architecture, activation, loss, size)?
  – A better algorithm (more accurate, faster convergence)?
  – Better training data?
• Current research in DL tends to mix all of the above by evaluating on a black-box "performance score" that does not directly reflect:
  – correctness of inference
  – achievability/usefulness of the model
  – variance due to stochasticity
An Example
Although a single dimension (number of layers) is compared, many other dimensions may also change, to name a few:
• Per-training-iteration time
• Tolerance to inaccurate inference
• Identifiability
• …
Inference quality
• Training error is an old concept from classifiers with no hidden states: no inference is involved, so inference accuracy is not an issue.
• But a DNN is not just a classifier; some DNNs are not even fully supervised, and there are MANY hidden states. Why is their inference quality not taken seriously?
• In DNNs, "inference accuracy" is reduced to visualizing features:
  – study of inference accuracy is badly discouraged
  – the loss/accuracy of inference is not monitored (a minimal monitoring sketch follows)
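As one concrete possibility (not something proposed in the talk), here is a sketch of monitoring an intermediate quantity rather than only the end score: a least-squares probe of how well each hidden layer's activations still encode the input. The function and arrays are illustrative assumptions.

```python
# Hypothetical per-layer probe: linear reconstruction error of the input x
# from hidden activations h (lower error = more input information retained).
import numpy as np

def layer_reconstruction_error(x, h):
    W, *_ = np.linalg.lstsq(h, x, rcond=None)   # solve h @ W ~= x
    return float(np.mean((h @ W - x) ** 2))
```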
Inference/Learning Algorithms, and Their Evaluation
Learning a GM with Hidden Variables – the thought process
• In fully observed iid settings, the log likelihood decomposes into a sum of local terms (at least for directed models).
• With latent variables, all the parameters become coupled together via marginalization:
$$\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$$

$$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$$
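To make the coupling concrete, here is a minimal numeric sketch of the two quantities above for a two-component, unit-variance 1-D Gaussian mixture; all values are illustrative assumptions.

```python
# Complete vs. incomplete (marginal) log likelihood for a toy mixture.
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.7])    # p(z | theta_z) (illustrative)
mu = np.array([-2.0, 2.0])   # component parameters theta_x (illustrative)
x = 1.5                      # one observation

# If z were observed (say z = 1), the log likelihood decomposes into local terms:
z = 1
l_c = np.log(pi[z]) + norm.logpdf(x, mu[z], 1.0)

# Marginalizing z couples all parameters inside a single log-sum:
l = np.log(np.sum(pi * norm.pdf(x, mu, 1.0)))
```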
Gradient Learning for mixture models
• We can learn mixture densities using gradient descent on the log likelihood. The gradients are quite interesting:

$$\ell(\theta) = \log p(x \mid \theta) = \log \sum_k \pi_k\, p_k(x \mid \theta_k)$$

$$\frac{\partial \ell}{\partial \theta} = \frac{1}{p(x \mid \theta)} \sum_k \pi_k \frac{\partial p_k(x \mid \theta_k)}{\partial \theta} = \sum_k \pi_k \frac{p_k(x \mid \theta_k)}{p(x \mid \theta)} \frac{\partial \log p_k(x \mid \theta_k)}{\partial \theta} = \sum_k r_k \frac{\partial \ell_k}{\partial \theta}$$

where $r_k = \pi_k\, p_k(x \mid \theta_k) / p(x \mid \theta)$.

• In other words, the gradient is aggregated from many intermediate states (the responsibilities $r_k$)
  – Implication: costly iterations, heavy coupling between parameters
• Other issues: imposing constraints, identifiability, …
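A minimal numeric sketch of this responsibility-weighted gradient, for a two-component, unit-variance 1-D Gaussian mixture with gradients taken with respect to the means only; all values are illustrative assumptions.

```python
# Responsibility-weighted gradient of a mixture log likelihood (toy example).
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.7])       # mixing weights (illustrative)
mu = np.array([-2.0, 2.0])      # component means (illustrative)
x = 1.5                         # one observation

p_k = norm.pdf(x, mu, 1.0)           # p_k(x | theta_k)
r = pi * p_k / np.sum(pi * p_k)      # responsibilities r_k
dlogp_dmu = x - mu                   # d log p_k / d mu_k for unit variance
grad_mu = r * dlogp_dmu              # gradient aggregated over components
```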
Then Alternative Approaches Were Proposed
• The EM algorithm (a minimal sketch follows this list)
  – M-step: a convex problem
  – E-step: approximate constrained optimization, e.g.:
    • Mean field
    • BP/LBP
    • Marginal polytope
• Spectral algorithms:
  – redefine intermediate states, convexify the original problem
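For contrast with the gradient route, a minimal EM sketch for the same kind of mixture; the E-step here is exact (the mean-field/BP/marginal-polytope variants above would replace it with approximate inference), and the data and initialization are illustrative assumptions.

```python
# Minimal EM for a two-component, unit-variance 1-D Gaussian mixture.
import numpy as np
from scipy.stats import norm

x = np.array([-2.1, -1.9, 0.2, 1.8, 2.2])              # illustrative data
pi, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])   # illustrative init

for _ in range(50):
    # E-step: posterior responsibilities over the hidden assignments z
    p = pi * norm.pdf(x[:, None], mu, 1.0)      # shape (N, K)
    r = p / p.sum(axis=1, keepdims=True)
    # M-step: a convex problem given the responsibilities
    pi = r.mean(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
```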
Learning a DNN
• In a nutshell: sequentially and recursively apply the layer-by-layer chain-rule update of back-propagation (see the sketch below)
• Things can get hairy when locally defined losses are introduced, e.g., auto-encoders, which break a loss-driven global optimization formulation
• Depending on the starting point, BP converges or diverges with probability 1
  – A serious problem in large-scale DNNs
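As a concrete instance of "sequentially and recursively apply", here is a minimal two-layer backprop sketch under a squared loss; the architecture, loss, and data are illustrative assumptions.

```python
# Forward pass, then reverse-order (recursive) gradient propagation.
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 3)), rng.normal(size=(4, 2))   # illustrative data
W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(5, 2))

# Forward: sequentially apply each layer
h = np.tanh(x @ W1)
y_hat = h @ W2

# Backward: recursively push the loss gradient through each layer
g = 2 * (y_hat - y) / len(x)     # d(squared loss)/d(y_hat)
dW2 = h.T @ g
g = (g @ W2.T) * (1 - h**2)      # chain rule through the tanh nonlinearity
dW1 = x.T @ g
```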
DL
Utility of the network:
• A vehicle to conceptually synthesize complex decision hypotheses – stage-wise projection and aggregation
• A vehicle for organizing computing operations – stage-wise update of latent states
• A vehicle for designing processing steps/computing modules – layer-wise parallelization
• No obvious utility in evaluating DL algorithms
Utility of the loss function:
• Global loss? It is non-convex anyway, so why bother?

GM
Utility of the network:
• A vehicle for synthesizing a global loss function from local structure – potential functions, feature functions
• A vehicle for designing sound and efficient inference algorithms – sum-product, mean field
• A vehicle to inspire approximation and penalization – structured MF, tree approximations
Utility of the loss function:
• A vehicle for monitoring theoretical and empirical behavior and accuracy of inference
• A major measure of the quality of algorithm and model
An Old Study of DL as GM Learning
A sigmoid belief network as a GM, and mean-field partitions
[Figure: results for GMFr, GMFb, and BP]
Study focused only on inference/learning accuracy, speed, and partitions
[Xing, Russell, Jordan, UAI 2003]
Now we can ask: with a correctly learned DN, is it doing well on the desired task?
Why a Graphical Model Formulation of DL Might Be Fruitful
• Modular design: easy to incorporate knowledge and interpret, easy to integrate feature learning with high-level tasks, easy to build on existing (partial) solutions
• Defines an explicit and natural objective
• Guides strategies for systematic study of inference, parallelization, evaluation, and theoretical analysis
• A clear path to further upgrades:
  – structured prediction
  – integration of multiple data modalities
  – modeling complex data: time series, missing data, online data, …
• Big DL on distributed architectures, where things can get messy everywhere due to incorrect parallel computations
Easy to incorporate knowledge and interpret
[Figure: a speech-production graphical model with layers for articulation, targets, distortion-free acoustics, distorted acoustics, and distortion factors feeding back to articulation]
Slides courtesy: Li Deng
Easy to integrate feature learning with high-level tasks
• Hidden Markov Model + Gaussian Mixture Model: jointly trained, but shallow
• Hidden Markov Model + Deep Neural Network: deep, but separately trained
• Hidden Markov Model + Deep Graphical Models: jointly trained and deep
Distributed DL
Mathematics 101 for ML

$$\vec{\theta} \;\triangleq\; \arg\max_{\vec{\theta}}\; L\big(\{x_i, y_i\}_{i=1}^{N};\, \vec{\theta}\big) \;+\; \Omega(\vec{\theta})$$

$$\vec{\theta}^{(t+1)} = \vec{\theta}^{(t)} + \Delta f_{\vec{\theta}}(D)$$

Here $L$ is the model's loss, $\vec{\theta}$ the parameters, and $\{x_i, y_i\}$ the data $D$.

This computation needs to be parallelized!
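A minimal sketch of this generic loop, instantiating $\Delta f_{\vec{\theta}}(D)$ as a gradient step on a least-squares loss; the data, loss, and step size are illustrative assumptions.

```python
# The generic iterative update theta_{t+1} = theta_t + Delta f_theta(D).
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)   # illustrative data D
theta = np.zeros(5)

for t in range(100):
    delta = -0.01 * X.T @ (X @ theta - y) / len(y)   # Delta f_theta(D)
    theta = theta + delta                            # the additive update
```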
Toward Big ML
Data Parallel:
$$D \triangleq \{D_1, D_2, \ldots, D_n\}; \quad \text{compute } \Delta_{\vec{\theta}}(D_1),\, \Delta_{\vec{\theta}}(D_2),\, \ldots,\, \Delta_{\vec{\theta}}(D_n)$$

Model Parallel:
$$\vec{\theta} \triangleq [\vec{\theta}_1^{\,T}, \vec{\theta}_2^{\,T}, \ldots, \vec{\theta}_k^{\,T}]^{T}; \quad \text{compute } \Delta_{\vec{\theta}_1}(D),\, \Delta_{\vec{\theta}_2}(D),\, \ldots,\, \Delta_{\vec{\theta}_k}(D)$$

Task Parallel:
$$f \triangleq \{f_1, f_2, \ldots, f_m\}; \quad \text{compute } \Delta f_{1,\vec{\theta}}(D),\, \Delta f_{2,\vec{\theta}}(D),\, \ldots,\, \Delta f_{m,\vec{\theta}}(D)$$

All instantiate the same update $\vec{\theta}^{(t+1)} = \vec{\theta}^{(t)} + \Delta f_{\vec{\theta}}(D)$.
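A minimal data-parallel sketch of the first column, reusing the least-squares update from the previous sketch: partition $D$ into shards, compute a per-shard update, and aggregate. All names and sizes are illustrative assumptions.

```python
# Data-parallel updates: D = {D_1, ..., D_n}, aggregate Delta_theta(D_i).
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
theta = np.zeros(5)

def delta(theta, Xi, yi):
    # Per-shard update Delta_theta(D_i): a gradient step on shard i
    return -0.01 * Xi.T @ (Xi @ theta - yi) / len(yi)

shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))  # n = 4 shards

for t in range(100):
    updates = [delta(theta, Xi, yi) for Xi, yi in shards]  # ideally in parallel
    theta = theta + sum(updates)   # aggregate into one global step
```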
Data-Parallel DNN using Petuum Parameter Server
• Just put the global parameters in an SSPTable:
  – DNN (SGD): the weight table
  – Topic modeling (MCMC): the topic-word table
  – Matrix factorization (SGD): the factor matrices L, R
  – Lasso regression (CD): the coefficients β
• SSPTable supports generic classes of algorithms, with these models as examples (a usage sketch follows below)
[Figure: an SSPTable holding the factor matrices L and R, topic-word columns (Topic 1–4), and the Lasso coefficients β]
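To make the SSPTable usage pattern concrete, here is a hypothetical client interface. This is not Petuum's actual API; the class, method names, and training loop are invented here purely to illustrate the stale-synchronous read/increment/clock pattern.

```python
# Hypothetical SSP parameter-server client (names and signatures invented).
class SSPTableClient:
    def __init__(self, staleness):
        self.staleness = staleness   # max clock gap tolerated between workers

    def get(self, row):
        """Read a row of global parameters; may be up to `staleness` clocks stale."""
        ...

    def inc(self, row, delta):
        """Apply an additive update (e.g., a gradient) to a row."""
        ...

    def clock(self):
        """Signal that this worker finished one iteration."""
        ...

# Per-worker DNN training loop under SSP (sketch):
#   w = table.get("weights")
#   g = backprop(w, minibatch)        # hypothetical helper
#   table.inc("weights", -lr * g)
#   table.clock()
```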
Theorem (multilayer convergence of SSP-based distributed DNNs to optima):

• If the undistributed BP updates of a multi-layer DNN lead to weights $w_t$, and the distributed BP updates under SSP lead to weights $\tilde{w}_t$, then $\tilde{w}_t$ converges in probability to $w_t$, i.e., $\tilde{w}_t \xrightarrow{P} w_t$. Consequently, $\tilde{w}_t^{*} \xrightarrow{P} w^{*}$.
Model-Parallel DNN using Petuum Scheduler
[Figure: the two partitioning schemes – neuron partition and weight partition]
Theorem (multilayer convergence of model-distributed DNNs to optima):

• If the undistributed BP updates of a multi-layer DNN lead to weights $w_t$, and the distributed BP updates in the model-distributed setting lead to weights $\tilde{w}_t$, then $\tilde{w}_t$ converges in probability to $w_t$, i.e., $\tilde{w}_t \xrightarrow{P} w_t$. Consequently, $\tilde{w}_t^{*} \xrightarrow{P} w^{*}$.

• In the model-distributed DNN, we divide the network vertically, so that a single layer is distributed across processors.
Distributed DNN: (preliminary)
• Application: phoneme classification in speech recognition
• Dataset: TIMIT, with 1M samples
• Network configuration: input layer with 440 units, output layer with 1993 units, six hidden layers with 2048 units each
| Methods | PER (phone error rate) |
| --- | --- |
| Conditional Random Field [1] | 34.8% |
| Large-Margin GMM [2] | 33% |
| CD-HMM [3] | 27.3% |
| Recurrent Neural Nets [4] | 26.1% |
| Deep Belief Network [5] | 23.0% |
| Petuum DNN (Data Partition) | 24.95% |
| Petuum DNN (Model Partition) | 25.12% |
[Figure: speedup vs. number of cores (1–8) for Petuum DNN, plotted against ideal linear speedup]
Conclusion
• In GM: lots of effort is directed at improving inference accuracy and convergence speed
  – An advanced tutorial would survey dozens of inference algorithms/theories, but few use cases on empirical tasks
• In DL: most effort is directed at comparing different architectures and gate functions (based on empirical performance on a downstream task)
  – An advanced tutorial typically consists of a list of all designs of nets and many use cases, but a single algorithm name: back-propagation with SGD
• The two fields were similar at the beginning (energy, structure, etc.), but soon diverged into their own signature pipelines
• A convergence might be necessary and fruitful