TRANSCRIPT
Beating the Perils of Non-Convexity:
Machine Learning using Tensor Methods
Anima Anandkumar
Joint work with Majid Janzamin and Hanie Sedghi.
U.C. Irvine
Learning with Big Data

Learning is finding a needle in a haystack.
High-dimensional regime: as data grows, so does the number of variables!
Useful information: low-dimensional structures.
Learning with big data: an ill-posed problem.
Learning with big data: statistically and computationally challenging!
Optimization for Learning
Most learning problems can be cast as optimization.
Unsupervised Learning
Clustering: k-means, hierarchical, . . .
Maximum likelihood estimation: probabilistic latent variable models.
Supervised Learning
Optimizing a neural network with respect to a loss function.
[Figure: a neural network mapping an input through neurons to an output.]
Convex vs. Non-convex Optimization
Progress is only the tip of the iceberg. The real world is mostly non-convex!
Images taken from https://www.facebook.com/nonconvex
Convex vs. Nonconvex Optimization
Convex: unique optimum (global = local). Non-convex: multiple local optima.
In high dimensions, possibly exponentially many local optima.
How to deal with non-convexity?
Outline
1 Introduction
2 Guaranteed Training of Neural Networks
3 Overview of Other Results on Tensors
4 Conclusion
Training Neural Networks
Tremendous practical impact with deep learning.
Algorithm: backpropagation.
Highly non-convex optimization
Toy Example: Failure of Backpropagation

[Figure: labeled input samples in the (x1, x2) plane with classes y = 1 and y = −1. Goal: binary classification.]
[Figure: a small network with two sigmoid units σ(·), inputs x1, x2, weights w1, w2, and output y.]

Our method: guaranteed risk bounds for training neural networks.
Backpropagation vs. Our Method

Weights w2 randomly drawn and fixed.
[Figure: backprop (quadratic) loss surface as a function of w1(1) and w1(2).]
[Figure: loss surface for our method as a function of w1(1) and w1(2).]
Overcoming Hardness of Training
In general, training a neural network is NP-hard.
How does knowledge of input distribution help?
Generative vs. Discriminative Models
[Figure: joint density p(x, y) over input data x, for class y = 1 and class y = 0.]
[Figure: conditional distribution p(y|x) over input data x, for class y = 1 and class y = 0.]
Generative models: Encode domain knowledge.
Discriminative: good classification performance.
A neural network is a discriminative model.
Do generative models help in discriminative tasks?
Feature Transformation for Training Neural Networks

Feature learning: learn φ(·) from input data.
How to use φ(·) to train neural networks?
[Figure: x → φ(x) → y]

Multivariate moments: many possibilities, . . .
E[x ⊗ y], E[x ⊗ x ⊗ y], E[φ(x) ⊗ y], . . .
Tensor Notation for Higher Order Moments
Multi-variate higher order moments form tensors.
Are there spectral operations on tensors akin to PCA on matrices?
Matrix
E[x ⊗ y] ∈ R^{d×d} is a second-order tensor.
E[x ⊗ y]_{i1,i2} = E[x_{i1} y_{i2}].
For matrices: E[x ⊗ y] = E[x y^⊤].

Tensor
E[x ⊗ x ⊗ y] ∈ R^{d×d×d} is a third-order tensor.
E[x ⊗ x ⊗ y]_{i1,i2,i3} = E[x_{i1} x_{i2} y_{i3}].
In general, E[φ(x) ⊗ y] is a tensor.
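As a concrete illustration (not from the talk), these empirical moments can be formed directly from samples. A minimal NumPy sketch, with synthetic data standing in for (x, y):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10000, 5

# Synthetic paired samples: any joint distribution over (x, y) would do.
x = rng.normal(size=(n, d))
y = x @ rng.normal(size=(d, d)) + 0.1 * rng.normal(size=(n, d))

# Empirical second-order moment E[x ⊗ y]: a d x d matrix.
M2 = np.einsum('ni,nj->ij', x, y) / n          # same as (x.T @ y) / n

# Empirical third-order moment E[x ⊗ x ⊗ y]: a d x d x d tensor.
M3 = np.einsum('ni,nj,nk->ijk', x, x, y) / n

print(M2.shape, M3.shape)                       # (5, 5) (5, 5, 5)
print(np.isclose(M3[1, 2, 3], np.mean(x[:, 1] * x[:, 2] * y[:, 3])))  # True
```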
What class of φ(·) is useful for training neural networks?
Score Function Transformations

Score function for x ∈ R^d with pdf p(·):
S_1(x) := −∇_x log p(x)

mth-order score function:
S_m(x) := (−1)^m ∇^(m) p(x) / p(x)

[Figure: input x ∈ R^d mapped to score features S_1(x) ∈ R^d, S_2(x) ∈ R^{d×d}, S_3(x) ∈ R^{d×d×d}.]
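For a standard Gaussian input, the score functions above have closed forms (the multivariate Hermite polynomials). A minimal sketch under that assumption; for other input densities, S_m must be derived from, or estimated for, the actual pdf p(·):

```python
import numpy as np

def gaussian_scores(x):
    """Score functions S1, S2, S3 for a standard Gaussian input x ~ N(0, I_d).

    S1(x) = x
    S2(x) = x ⊗ x − I
    S3(x) = x ⊗ x ⊗ x − (x ⊗ I + permutations)
    These are the Hermite forms of S_m(x) = (−1)^m ∇^(m) p(x) / p(x).
    """
    d = x.shape[0]
    I = np.eye(d)
    S1 = x
    S2 = np.outer(x, x) - I
    S3 = (np.einsum('i,j,k->ijk', x, x, x)
          - np.einsum('i,jk->ijk', x, I)
          - np.einsum('j,ik->ijk', x, I)
          - np.einsum('k,ij->ijk', x, I))
    return S1, S2, S3

S1, S2, S3 = gaussian_scores(np.random.default_rng(0).normal(size=4))
print(S1.shape, S2.shape, S3.shape)   # (4,) (4, 4) (4, 4, 4)
```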
Moments of a Neural Network

E[y|x] = f(x) = a_2^⊤ σ(A_1^⊤ x + b_1) + b_2
[Figure: a two-layer network with inputs x_1, . . . , x_d, k hidden sigmoid units σ(·), first-layer weights A_1, second-layer weights a_2, and output y.]

Given labeled examples {(x_i, y_i)}:
E[y · S_m(x)] = E[∇^(m) f(x)]
⇓
M_1 = E[y · S_1(x)] = Σ_{j∈[k]} λ_{1,j} · (A_1)_j
M_2 = E[y · S_2(x)] = Σ_{j∈[k]} λ_{2,j} · (A_1)_j ⊗ (A_1)_j
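The first-order case of the identity E[y · S_m(x)] = E[∇^(m) f(x)] can be checked numerically. A small illustrative sketch (not the authors' code), assuming a standard Gaussian input so that S_1(x) = x, a tanh nonlinearity, and additive label noise independent of x:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 6, 3, 200000

A1 = rng.normal(size=(d, k))          # first-layer weights (columns to be recovered)
b1 = rng.normal(size=k)
a2 = rng.normal(size=k)               # second-layer weights
b2 = 0.5

x = rng.normal(size=(n, d))           # standard Gaussian input => S1(x) = x
z = x @ A1 + b1
f = np.tanh(z) @ a2 + b2              # E[y|x]
y = f + 0.1 * rng.normal(size=n)      # noisy labels

# Left side: M1 = E[y * S1(x)] estimated from samples.
M1_hat = (y[:, None] * x).mean(axis=0)

# Right side: E[grad f(x)], with grad f(x) = A1 (a2 ⊙ tanh'(z)).
grad_f = (a2 * (1.0 - np.tanh(z) ** 2)) @ A1.T
M1_true = grad_f.mean(axis=0)

print(np.round(M1_hat, 3))
print(np.round(M1_true, 3))           # the two agree up to sampling error
```

The same check goes through for S_2 and S_3, which is what makes the moment tensors M_2 and M_3 estimable from labeled data.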
Similarly, at third order:
M_3 = E[y · S_3(x)] = Σ_{j∈[k]} λ_{3,j} · (A_1)_j ⊗ (A_1)_j ⊗ (A_1)_j

Why are tensors required?
Matrix decomposition recovers only the subspace, not the actual weights.
Tensor decomposition uniquely recovers the weights under non-degeneracy conditions.
Guaranteed learning of weights of first layer via tensor decomposition.
Learning the other parameters via a Fourier technique.
NN-LiFT: Neural Network LearnIng using Feature Tensors

Input: x ∈ R^d, with score feature S_3(x) ∈ R^{d×d×d}.

Estimating M_3 using labeled data {(x_i, y_i)} via the cross-moment
(1/n) Σ_{i=1}^{n} y_i ⊗ S_3(x_i).

CP tensor decomposition: the rank-1 components are the estimates of the columns of A_1.

Fourier technique ⇒ a_2, b_1, b_2.
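A rough end-to-end sketch of the first two NN-LiFT steps, under simplifying assumptions added here for illustration: a standard Gaussian input (so S_3 has the closed form used below), a tanh network generating the labels, and tensorly's parafac as an off-the-shelf CP decomposition in place of the guaranteed decomposition procedures cited below. Recovered components match the columns of A_1 only up to permutation and sign/scale:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(1)
d, k, n = 8, 3, 100000

# Ground-truth two-layer network generating the labels.
A1 = np.linalg.qr(rng.normal(size=(d, k)))[0]   # full-column-rank first layer
b1, a2, b2 = rng.normal(size=k), rng.normal(size=k), 0.3
x = rng.normal(size=(n, d))                     # Gaussian input => closed-form S3
y = np.tanh(x @ A1 + b1) @ a2 + b2 + 0.05 * rng.normal(size=n)

# Step 1: cross-moment M3_hat = (1/n) sum_i y_i * S3(x_i), using
# S3(x) = x⊗x⊗x − (x ⊗ I + permutations) for standard Gaussian x.
I = np.eye(d)
yx = (y[:, None] * x).mean(axis=0)
M3 = np.einsum('n,ni,nj,nk->ijk', y, x, x, x) / n
M3 -= (np.einsum('i,jk->ijk', yx, I)
       + np.einsum('j,ik->ijk', yx, I)
       + np.einsum('k,ij->ijk', yx, I))

# Step 2: CP decomposition; rank-1 components estimate the columns of A1.
weights, factors = parafac(tl.tensor(M3), rank=k, normalize_factors=True)
A1_hat = np.asarray(factors[0])

# Compare columns up to sign and permutation.
print(np.round(np.abs(A1_hat.T @ A1), 2))   # approximately a permutation matrix
```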
Estimation error bound

Guaranteed learning of weights of the first layer via tensor decomposition:
M_3 = E[y ⊗ S_3(x)] = Σ_{j∈[k]} λ_{3,j} · (A_1)_j ⊗ (A_1)_j ⊗ (A_1)_j
Full column rank assumption on the weight matrix A_1.
Guaranteed tensor decomposition (AGHKT'14, AGJ'14).

Learning the other parameters via a Fourier technique.

Theorem (JSA'14)
For a number of samples n = poly(d, k), we have w.h.p.
|f̂(x) − f(x)|² ≤ O(1/n).

"Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods" by M. Janzamin, H. Sedghi and A. Anandkumar, June 2015.
Our Main Result: Risk Bounds

Approximating an arbitrary function f(x) with bounded
C_f := ∫_{R^d} ‖ω‖_2 · |F(ω)| dω,
where F(ω) is the Fourier transform of f.
n samples, d input dimension, k number of neurons.

Theorem (JSA'14)
Assume C_f is small. Then
E[|f̂(x) − f(x)|²] ≤ O(C_f²/k) + O(1/n).

Polynomial sample complexity n in terms of the dimensions d, k.
Computational complexity same as SGD with enough parallel processors.

"Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods" by M. Janzamin, H. Sedghi and A. Anandkumar, June 2015.
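Read as a bias–variance split, the bound separates an approximation term, which shrinks as the number of neurons k grows (for targets with bounded C_f, in the spirit of Barron-type approximation results), from an estimation term that vanishes with the sample size n. The labels below are an editorial gloss on the theorem, not an additional result:

```latex
\mathbb{E}\bigl[\,|\hat f(x) - f(x)|^{2}\bigr]
\;\le\;
\underbrace{O\!\left(\tfrac{C_f^{2}}{k}\right)}_{\text{approximation ($k$ neurons)}}
\;+\;
\underbrace{O\!\left(\tfrac{1}{n}\right)}_{\text{estimation ($n$ samples)}},
\qquad
C_f := \int_{\mathbb{R}^{d}} \lVert\omega\rVert_{2}\,\lvert F(\omega)\rvert \, d\omega .
```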
Outline
1 Introduction
2 Guaranteed Training of Neural Networks
3 Overview of Other Results on Tensors
4 Conclusion
Tractable Learning for LVMs
[Figure: latent variable models amenable to tensor methods: GMM, HMM (hidden states h_1, h_2, h_3 emitting observations x_1, x_2, x_3), ICA (sources h_1, . . . , h_k mixed into observations x_1, . . . , x_d), and multiview and topic models.]
At Scale Tensor Computations

Randomized Tensor Sketches
Naive computation scales exponentially in the order of the tensor.
Propose randomized FFT sketches.
Computational complexity independent of the tensor order.
Linear scaling in the input dimension and the number of samples.

Tensor Contractions with Extended BLAS Kernels on CPU and GPU
BLAS: Basic Linear Algebra Subprograms, highly optimized libraries.
Use extended BLAS to minimize data permutation and I/O calls.

(1) "Fast and Guaranteed Tensor Decomposition via Sketching" by Yining Wang, Hsiao-Yu Tung, Alex Smola, A. Anandkumar, NIPS 2015.
(2) "Tensor Contractions with Extended BLAS Kernels on CPU and GPU" by Y. Shi, UN Niranjan, C. Cecka, A. Mowli, A. Anandkumar.
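To give a flavor of the FFT-based sketches mentioned above, here is a toy count-sketch of a rank-1 tensor in the spirit of the tensor-sketch construction (an illustrative simplification, not the implementation from reference (1)). Sketching a rank-1 term costs one FFT per mode, independent of the tensor order, and inner products between sketches approximate inner products between the underlying tensors:

```python
import numpy as np

def count_sketch(v, h, s, b):
    """Count sketch of a vector v into b buckets using hash h and signs s."""
    out = np.zeros(b)
    np.add.at(out, h, s * v)
    return out

def sketch_rank1(vectors, hashes, signs, b):
    """Sketch of v1 ⊗ v2 ⊗ ... : multiply the FFTs of the per-mode count
    sketches elementwise (a circular convolution), so the cost is one FFT
    per mode regardless of the tensor order."""
    f = np.ones(b, dtype=complex)
    for v, h, s in zip(vectors, hashes, signs):
        f *= np.fft.fft(count_sketch(v, h, s, b))
    return np.real(np.fft.ifft(f))

rng = np.random.default_rng(0)
d, b = 50, 4096
hashes = [rng.integers(0, b, size=d) for _ in range(3)]
signs = [rng.choice([-1.0, 1.0], size=d) for _ in range(3)]

u = rng.normal(size=d)
v = u + 0.1 * rng.normal(size=d)
su = sketch_rank1([u, u, u], hashes, signs, b)
sv = sketch_rank1([v, v, v], hashes, signs, b)

# Inner products of sketches approximate inner products of the tensors:
# <u⊗u⊗u, v⊗v⊗v> = (u·v)^3.  Agreement improves as b grows.
print(np.dot(su, sv), np.dot(u, v) ** 3)
```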
Preliminary Results on Spark

In-memory processing of Spark: ideal for iterative tensor methods.

Alternating Least Squares for Tensor Decomposition:
min_{λ,A,B,C} ‖ T − Σ_{i=1}^{k} λ_i A(:, i) ⊗ B(:, i) ⊗ C(:, i) ‖²_F

[Figure: rows of the factors are updated independently from tensor slices, distributed across workers 1, . . . , k.]

Results on the NYTimes corpus (3 × 10^5 documents, 10^8 words):
Spark: 26 mins vs. Map-Reduce: 4 hrs.

"Topic Modeling at Lightning Speeds via Tensor Factorization on Spark" by F. Huang, A. Anandkumar, under preparation.
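A single-machine NumPy sketch of the alternating least squares updates for the CP objective above (a simplified illustration; the Spark implementation distributes the independent row updates across workers as in the figure). Each factor is solved in closed form via the Khatri–Rao normal equations while the other two are held fixed:

```python
import numpy as np

def cp_als(T, k, n_iters=100, seed=0):
    """Rank-k CP decomposition of a 3rd-order tensor T by alternating least squares.

    Minimizes || T - sum_i lambda_i A(:,i) ⊗ B(:,i) ⊗ C(:,i) ||_F^2 by solving
    for one factor at a time in closed form (MTTKRP followed by the
    Hadamard-product normal equations).
    """
    rng = np.random.default_rng(seed)
    A, B, C = (rng.normal(size=(dim, k)) for dim in T.shape)
    for _ in range(n_iters):
        # Update each factor with the other two fixed.
        A = np.einsum('ijk,jr,kr->ir', T, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', T, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', T, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    # Pull the scales out as lambda and return unit-norm factors.
    lam = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0) * np.linalg.norm(C, axis=0)
    A, B, C = (M / np.linalg.norm(M, axis=0) for M in (A, B, C))
    return lam, A, B, C

# Toy test: build a random rank-3 tensor and check the residual after ALS.
rng = np.random.default_rng(1)
U, V, W = rng.normal(size=(10, 3)), rng.normal(size=(11, 3)), rng.normal(size=(12, 3))
T = np.einsum('ir,jr,kr->ijk', U, V, W)
lam, A, B, C = cp_als(T, k=3)
T_hat = np.einsum('r,ir,jr,kr->ijk', lam, A, B, C)
print(np.linalg.norm(T - T_hat) / np.linalg.norm(T))   # typically a small residual
```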
Convolutional Tensor Decomposition

Convolutional dictionary model: x = Σ_i f*_i ∗ w*_i, reformulated as x = Σ_i F*_i w*_i.
[Figure: (a) convolutional dictionary model, (b) reformulated model.]

Cumulant = λ_1 (F*_1)^⊗3 + λ_2 (F*_2)^⊗3 + . . .

Efficient methods for tensor decomposition with circulant constraints.

"Convolutional Dictionary Learning through Tensor Factorization" by F. Huang, A. Anandkumar, June 2015.
Reinforcement Learning (RL) of POMDPs
Partially observable Markov decision processes.
Proposed Method
Consider memoryless policies. Episodic learning: indirect exploration.
Tensor methods: careful conditioning required for learning.
First RL method for POMDPs with logarithmic regret bounds.
[Figure: graphical model of a POMDP with hidden states x_i, observations y_i, rewards r_i, and actions a_i.]
[Figure: average reward vs. number of trials for SM-UCRL-POMDP, UCRL-MDP, Q-learning, and a random policy.]
"Logarithmic Regret Bounds for POMDPs using Spectral Methods" by K. Azzizade, A. Lazaric, A. Anandkumar, under preparation.
Outline
1 Introduction
2 Guaranteed Training of Neural Networks
3 Overview of Other Results on Tensors
4 Conclusion
Summary and Outlook

Summary
Tensor methods: a powerful paradigm for guaranteed large-scale machine learning.
First methods to provide provable bounds for training neural networks, many latent variable models (e.g. HMM, LDA), and POMDPs!

Outlook
Training multi-layer neural networks, models with invariances, reinforcement learning using neural networks, . . .
A unified framework for tractable non-convex methods with guaranteed convergence to global optima?
My Research Group and Resources
Podcast, lectures, papers, and software available at http://newport.eecs.uci.edu/anandkumar/