TRANSCRIPT
Beating the Perils of Non-Convexity:
Machine Learning using Tensor Methods
Anima Anandkumar
Joint work with Majid Janzamin and Hanie Sedghi.
U.C. Irvine
Learning with Big Data

Learning is finding a needle in a haystack.
High-dimensional regime: as data grows, so does the number of variables!
Useful information: low-dimensional structures.
Learning with big data: an ill-posed problem.
Learning with big data: statistically and computationally challenging!
Optimization for Learning
Most learning problems can be cast as optimization.
Unsupervised Learning
Clustering: k-means, hierarchical, . . .
Maximum likelihood estimation: probabilistic latent variable models.
Supervised Learning
Optimizing a neural network with respect to a loss function.
[Figure: a neural network mapping an input through neurons to an output.]
Convex vs. Non-convex Optimization
Progress is only the tip of the iceberg. The real world is mostly non-convex!
Images taken from https://www.facebook.com/nonconvex
Convex vs. Nonconvex Optimization
Convex: unique optimum (global = local). Non-convex: multiple local optima.
In high dimensions, possibly exponentially many local optima.
How to deal with non-convexity?
Outline
1 Introduction
2 Guaranteed Training of Neural Networks
3 Overview of Other Results on Tensors
4 Conclusion
Training Neural Networks
Tremendous practical impact with deep learning.
Algorithm: backpropagation.
Highly non-convex optimization
Toy Example: Failure of Backpropagation

[Figure: labeled input samples in the (x1, x2) plane with classes y = 1 and y = −1. Goal: binary classification.]
[Figure: a small network with two sigmoid units σ(·), inputs x1, x2, weights w1, w2, and output y.]

Our method: guaranteed risk bounds for training neural networks.
Backpropagation vs. Our Method

Weights w2 randomly drawn and fixed.
[Figure: backprop (quadratic) loss surface as a function of w1(1) and w1(2).]
[Figure: loss surface for our method as a function of w1(1) and w1(2).]
Overcoming Hardness of Training
In general, training a neural network is NP-hard.
How does knowledge of input distribution help?
Generative vs. Discriminative Models
[Figure: joint density p(x, y) over input data x, for class y = 1 and class y = 0.]
[Figure: conditional distribution p(y|x) over input data x, for class y = 1 and class y = 0.]
Generative models: Encode domain knowledge.
Discriminative: good classification performance.
A neural network is a discriminative model.
Do generative models help in discriminative tasks?
Feature Transformation for Training Neural Networks

Feature learning: learn φ(·) from input data.
How to use φ(·) to train neural networks?
[Figure: x → φ(x) → y]

Multivariate moments: many possibilities, . . .
E[x ⊗ y], E[x ⊗ x ⊗ y], E[φ(x) ⊗ y], . . .
Tensor Notation for Higher Order Moments
Multi-variate higher order moments form tensors.
Are there spectral operations on tensors akin to PCA on matrices?
Matrix
E[x ⊗ y] ∈ R^{d×d} is a second-order tensor.
E[x ⊗ y]_{i1,i2} = E[x_{i1} y_{i2}].
For matrices: E[x ⊗ y] = E[x y^⊤].

Tensor
E[x ⊗ x ⊗ y] ∈ R^{d×d×d} is a third-order tensor.
E[x ⊗ x ⊗ y]_{i1,i2,i3} = E[x_{i1} x_{i2} y_{i3}].
In general, E[φ(x) ⊗ y] is a tensor.
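As a concrete illustration (not from the talk), these empirical moments can be formed directly from samples. A minimal NumPy sketch, with synthetic data standing in for (x, y):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10000, 5

# Synthetic paired samples: any joint distribution over (x, y) would do.
x = rng.normal(size=(n, d))
y = x @ rng.normal(size=(d, d)) + 0.1 * rng.normal(size=(n, d))

# Empirical second-order moment E[x ⊗ y]: a d x d matrix.
M2 = np.einsum('ni,nj->ij', x, y) / n          # same as (x.T @ y) / n

# Empirical third-order moment E[x ⊗ x ⊗ y]: a d x d x d tensor.
M3 = np.einsum('ni,nj,nk->ijk', x, x, y) / n

print(M2.shape, M3.shape)                       # (5, 5) (5, 5, 5)
print(np.isclose(M3[1, 2, 3], np.mean(x[:, 1] * x[:, 2] * y[:, 3])))  # True
```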
What class of φ(·) is useful for training neural networks?
Score Function Transformations

Score function for x ∈ R^d with pdf p(·):
S_1(x) := −∇_x log p(x)

mth-order score function:
S_m(x) := (−1)^m ∇^(m) p(x) / p(x)

[Figure: input x ∈ R^d mapped to score features S_1(x) ∈ R^d, S_2(x) ∈ R^{d×d}, S_3(x) ∈ R^{d×d×d}.]
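For a standard Gaussian input, the score functions above have closed forms (the multivariate Hermite polynomials). A minimal sketch under that assumption; for other input densities, S_m must be derived from, or estimated for, the actual pdf p(·):

```python
import numpy as np

def gaussian_scores(x):
    """Score functions S1, S2, S3 for a standard Gaussian input x ~ N(0, I_d).

    S1(x) = x
    S2(x) = x ⊗ x − I
    S3(x) = x ⊗ x ⊗ x − (x ⊗ I + permutations)
    These are the Hermite forms of S_m(x) = (−1)^m ∇^(m) p(x) / p(x).
    """
    d = x.shape[0]
    I = np.eye(d)
    S1 = x
    S2 = np.outer(x, x) - I
    S3 = (np.einsum('i,j,k->ijk', x, x, x)
          - np.einsum('i,jk->ijk', x, I)
          - np.einsum('j,ik->ijk', x, I)
          - np.einsum('k,ij->ijk', x, I))
    return S1, S2, S3

S1, S2, S3 = gaussian_scores(np.random.default_rng(0).normal(size=4))
print(S1.shape, S2.shape, S3.shape)   # (4,) (4, 4) (4, 4, 4)
```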
Moments of a Neural Network

E[y|x] = f(x) = a_2^⊤ σ(A_1^⊤ x + b_1) + b_2
[Figure: a two-layer network with inputs x_1, . . . , x_d, k hidden sigmoid units σ(·), first-layer weights A_1, second-layer weights a_2, and output y.]

Given labeled examples {(x_i, y_i)}:
E[y · S_m(x)] = E[∇^(m) f(x)]
⇓
M_1 = E[y · S_1(x)] = Σ_{j∈[k]} λ_{1,j} · (A_1)_j
M_2 = E[y · S_2(x)] = Σ_{j∈[k]} λ_{2,j} · (A_1)_j ⊗ (A_1)_j
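The first-order case of the identity E[y · S_m(x)] = E[∇^(m) f(x)] can be checked numerically. A small illustrative sketch (not the authors' code), assuming a standard Gaussian input so that S_1(x) = x, a tanh nonlinearity, and additive label noise independent of x:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 6, 3, 200000

A1 = rng.normal(size=(d, k))          # first-layer weights (columns to be recovered)
b1 = rng.normal(size=k)
a2 = rng.normal(size=k)               # second-layer weights
b2 = 0.5

x = rng.normal(size=(n, d))           # standard Gaussian input => S1(x) = x
z = x @ A1 + b1
f = np.tanh(z) @ a2 + b2              # E[y|x]
y = f + 0.1 * rng.normal(size=n)      # noisy labels

# Left side: M1 = E[y * S1(x)] estimated from samples.
M1_hat = (y[:, None] * x).mean(axis=0)

# Right side: E[grad f(x)], with grad f(x) = A1 (a2 ⊙ tanh'(z)).
grad_f = (a2 * (1.0 - np.tanh(z) ** 2)) @ A1.T
M1_true = grad_f.mean(axis=0)

print(np.round(M1_hat, 3))
print(np.round(M1_true, 3))           # the two agree up to sampling error
```

The same check goes through for S_2 and S_3, which is what makes the moment tensors M_2 and M_3 estimable from labeled data.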
Similarly, at third order:
M_3 = E[y · S_3(x)] = Σ_{j∈[k]} λ_{3,j} · (A_1)_j ⊗ (A_1)_j ⊗ (A_1)_j

Why are tensors required?
Matrix decomposition recovers only the subspace, not the actual weights.
Tensor decomposition uniquely recovers the weights under non-degeneracy conditions.
Guaranteed learning of weights of first layer via tensor decomposition.
Learning the other parameters via a Fourier technique.
NN-LiFT: Neural Network LearnIng using Feature Tensors

Input: x ∈ R^d, with score feature S_3(x) ∈ R^{d×d×d}.

Estimating M_3 using labeled data {(x_i, y_i)} via the cross-moment
(1/n) Σ_{i=1}^{n} y_i ⊗ S_3(x_i).

CP tensor decomposition: the rank-1 components are the estimates of the columns of A_1.

Fourier technique ⇒ a_2, b_1, b_2.
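A rough end-to-end sketch of the first two NN-LiFT steps, under simplifying assumptions added here for illustration: a standard Gaussian input (so S_3 has the closed form used below), a tanh network generating the labels, and tensorly's parafac as an off-the-shelf CP decomposition in place of the guaranteed decomposition procedures cited below. Recovered components match the columns of A_1 only up to permutation and sign/scale:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(1)
d, k, n = 8, 3, 100000

# Ground-truth two-layer network generating the labels.
A1 = np.linalg.qr(rng.normal(size=(d, k)))[0]   # full-column-rank first layer
b1, a2, b2 = rng.normal(size=k), rng.normal(size=k), 0.3
x = rng.normal(size=(n, d))                     # Gaussian input => closed-form S3
y = np.tanh(x @ A1 + b1) @ a2 + b2 + 0.05 * rng.normal(size=n)

# Step 1: cross-moment M3_hat = (1/n) sum_i y_i * S3(x_i), using
# S3(x) = x⊗x⊗x − (x ⊗ I + permutations) for standard Gaussian x.
I = np.eye(d)
yx = (y[:, None] * x).mean(axis=0)
M3 = np.einsum('n,ni,nj,nk->ijk', y, x, x, x) / n
M3 -= (np.einsum('i,jk->ijk', yx, I)
       + np.einsum('j,ik->ijk', yx, I)
       + np.einsum('k,ij->ijk', yx, I))

# Step 2: CP decomposition; rank-1 components estimate the columns of A1.
weights, factors = parafac(tl.tensor(M3), rank=k, normalize_factors=True)
A1_hat = np.asarray(factors[0])

# Compare columns up to sign and permutation.
print(np.round(np.abs(A1_hat.T @ A1), 2))   # approximately a permutation matrix
```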
Estimation error bound

Guaranteed learning of weights of the first layer via tensor decomposition:
M_3 = E[y ⊗ S_3(x)] = Σ_{j∈[k]} λ_{3,j} · (A_1)_j ⊗ (A_1)_j ⊗ (A_1)_j
Full column rank assumption on the weight matrix A_1.
Guaranteed tensor decomposition (AGHKT'14, AGJ'14).

Learning the other parameters via a Fourier technique.

Theorem (JSA'14)
For a number of samples n = poly(d, k), we have w.h.p.
|f̂(x) − f(x)|² ≤ O(1/n).

"Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods" by M. Janzamin, H. Sedghi and A. Anandkumar, June 2015.
Our Main Result: Risk Bounds

Approximating an arbitrary function f(x) with bounded
C_f := ∫_{R^d} ‖ω‖_2 · |F(ω)| dω,
where F(ω) is the Fourier transform of f.
n samples, d input dimension, k number of neurons.

Theorem (JSA'14)
Assume C_f is small. Then
E[|f̂(x) − f(x)|²] ≤ O(C_f²/k) + O(1/n).

Polynomial sample complexity n in terms of the dimensions d, k.
Computational complexity same as SGD with enough parallel processors.

"Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods" by M. Janzamin, H. Sedghi and A. Anandkumar, June 2015.
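Read as a bias–variance split, the bound separates an approximation term, which shrinks as the number of neurons k grows (for targets with bounded C_f, in the spirit of Barron-type approximation results), from an estimation term that vanishes with the sample size n. The labels below are an editorial gloss on the theorem, not an additional result:

```latex
\mathbb{E}\bigl[\,|\hat f(x) - f(x)|^{2}\bigr]
\;\le\;
\underbrace{O\!\left(\tfrac{C_f^{2}}{k}\right)}_{\text{approximation ($k$ neurons)}}
\;+\;
\underbrace{O\!\left(\tfrac{1}{n}\right)}_{\text{estimation ($n$ samples)}},
\qquad
C_f := \int_{\mathbb{R}^{d}} \lVert\omega\rVert_{2}\,\lvert F(\omega)\rvert \, d\omega .
```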
Outline
1 Introduction
2 Guaranteed Training of Neural Networks
3 Overview of Other Results on Tensors
4 Conclusion
Tractable Learning for LVMs
[Figure: latent variable models amenable to tensor methods: GMM, HMM (hidden states h_1, h_2, h_3 emitting observations x_1, x_2, x_3), ICA (sources h_1, . . . , h_k mixed into observations x_1, . . . , x_d), and multiview and topic models.]
At Scale Tensor Computations

Randomized Tensor Sketches
Naive computation scales exponentially in the order of the tensor.
Propose randomized FFT sketches.
Computational complexity independent of the tensor order.
Linear scaling in the input dimension and the number of samples.

Tensor Contractions with Extended BLAS Kernels on CPU and GPU
BLAS: Basic Linear Algebra Subprograms, highly optimized libraries.
Use extended BLAS to minimize data permutation and I/O calls.

(1) "Fast and Guaranteed Tensor Decomposition via Sketching" by Yining Wang, Hsiao-Yu Tung, Alex Smola, A. Anandkumar, NIPS 2015.
(2) "Tensor Contractions with Extended BLAS Kernels on CPU and GPU" by Y. Shi, UN Niranjan, C. Cecka, A. Mowli, A. Anandkumar.
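To give a flavor of the FFT-based sketches mentioned above, here is a toy count-sketch of a rank-1 tensor in the spirit of the tensor-sketch construction (an illustrative simplification, not the implementation from reference (1)). Sketching a rank-1 term costs one FFT per mode, independent of the tensor order, and inner products between sketches approximate inner products between the underlying tensors:

```python
import numpy as np

def count_sketch(v, h, s, b):
    """Count sketch of a vector v into b buckets using hash h and signs s."""
    out = np.zeros(b)
    np.add.at(out, h, s * v)
    return out

def sketch_rank1(vectors, hashes, signs, b):
    """Sketch of v1 ⊗ v2 ⊗ ... : multiply the FFTs of the per-mode count
    sketches elementwise (a circular convolution), so the cost is one FFT
    per mode regardless of the tensor order."""
    f = np.ones(b, dtype=complex)
    for v, h, s in zip(vectors, hashes, signs):
        f *= np.fft.fft(count_sketch(v, h, s, b))
    return np.real(np.fft.ifft(f))

rng = np.random.default_rng(0)
d, b = 50, 4096
hashes = [rng.integers(0, b, size=d) for _ in range(3)]
signs = [rng.choice([-1.0, 1.0], size=d) for _ in range(3)]

u = rng.normal(size=d)
v = u + 0.1 * rng.normal(size=d)
su = sketch_rank1([u, u, u], hashes, signs, b)
sv = sketch_rank1([v, v, v], hashes, signs, b)

# Inner products of sketches approximate inner products of the tensors:
# <u⊗u⊗u, v⊗v⊗v> = (u·v)^3.  Agreement improves as b grows.
print(np.dot(su, sv), np.dot(u, v) ** 3)
```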
Preliminary Results on Spark

In-memory processing of Spark: ideal for iterative tensor methods.

Alternating Least Squares for Tensor Decomposition:
min_{λ,A,B,C} ‖ T − Σ_{i=1}^{k} λ_i A(:, i) ⊗ B(:, i) ⊗ C(:, i) ‖²_F

[Figure: rows of the factors are updated independently from tensor slices, distributed across workers 1, . . . , k.]

Results on the NYTimes corpus (3 × 10^5 documents, 10^8 words):
Spark: 26 mins vs. Map-Reduce: 4 hrs.

"Topic Modeling at Lightning Speeds via Tensor Factorization on Spark" by F. Huang, A. Anandkumar, under preparation.
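A single-machine NumPy sketch of the alternating least squares updates for the CP objective above (a simplified illustration; the Spark implementation distributes the independent row updates across workers as in the figure). Each factor is solved in closed form via the Khatri–Rao normal equations while the other two are held fixed:

```python
import numpy as np

def cp_als(T, k, n_iters=100, seed=0):
    """Rank-k CP decomposition of a 3rd-order tensor T by alternating least squares.

    Minimizes || T - sum_i lambda_i A(:,i) ⊗ B(:,i) ⊗ C(:,i) ||_F^2 by solving
    for one factor at a time in closed form (MTTKRP followed by the
    Hadamard-product normal equations).
    """
    rng = np.random.default_rng(seed)
    A, B, C = (rng.normal(size=(dim, k)) for dim in T.shape)
    for _ in range(n_iters):
        # Update each factor with the other two fixed.
        A = np.einsum('ijk,jr,kr->ir', T, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', T, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', T, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    # Pull the scales out as lambda and return unit-norm factors.
    lam = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0) * np.linalg.norm(C, axis=0)
    A, B, C = (M / np.linalg.norm(M, axis=0) for M in (A, B, C))
    return lam, A, B, C

# Toy test: build a random rank-3 tensor and check the residual after ALS.
rng = np.random.default_rng(1)
U, V, W = rng.normal(size=(10, 3)), rng.normal(size=(11, 3)), rng.normal(size=(12, 3))
T = np.einsum('ir,jr,kr->ijk', U, V, W)
lam, A, B, C = cp_als(T, k=3)
T_hat = np.einsum('r,ir,jr,kr->ijk', lam, A, B, C)
print(np.linalg.norm(T - T_hat) / np.linalg.norm(T))   # typically a small residual
```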
Convolutional Tensor Decomposition

Convolutional dictionary model: x = Σ_i f*_i ∗ w*_i, reformulated as x = Σ_i F*_i w*_i.
[Figure: (a) convolutional dictionary model, (b) reformulated model.]

Cumulant = λ_1 (F*_1)^⊗3 + λ_2 (F*_2)^⊗3 + . . .

Efficient methods for tensor decomposition with circulant constraints.

"Convolutional Dictionary Learning through Tensor Factorization" by F. Huang, A. Anandkumar, June 2015.
Reinforcement Learning (RL) of POMDPs
Partially observable Markov decision processes.
Proposed Method
Consider memoryless policies. Episodic learning: indirect exploration.
Tensor methods: careful conditioning required for learning.
First RL method for POMDPs with logarithmic regret bounds.
[Figure: graphical model of a POMDP with hidden states x_i, observations y_i, rewards r_i, and actions a_i.]
[Figure: average reward vs. number of trials for SM-UCRL-POMDP, UCRL-MDP, Q-learning, and a random policy.]
"Logarithmic Regret Bounds for POMDPs using Spectral Methods" by K. Azzizade, A. Lazaric, A. Anandkumar, under preparation.
Outline
1 Introduction
2 Guaranteed Training of Neural Networks
3 Overview of Other Results on Tensors
4 Conclusion
Summary and Outlook

Summary
Tensor methods: a powerful paradigm for guaranteed large-scale machine learning.
First methods to provide provable bounds for training neural networks, many latent variable models (e.g. HMM, LDA), and POMDPs!

Outlook
Training multi-layer neural networks, models with invariances, reinforcement learning using neural networks, . . .
A unified framework for tractable non-convex methods with guaranteed convergence to global optima?
My Research Group and Resources
Podcast, lectures, papers, and software available at http://newport.eecs.uci.edu/anandkumar/