
Page 1:

Beating the Perils of Non-Convexity:

Machine Learning using Tensor Methods

Anima Anandkumar


Joint work with Majid Janzamin and Hanie Sedghi.

U.C. Irvine


Page 4:

Learning with Big Data

Learning is finding a needle in a haystack.

High-dimensional regime: as data grows, the number of variables grows too!

Useful information: low-dimensional structures.

Learning with big data: ill-posed problem.

Learning with big data: statistically and computationally challenging!

Page 5:

Optimization for Learning

Most learning problems can be cast as optimization.

Unsupervised Learning

Clustering: k-means, hierarchical, . . .

Maximum likelihood estimator: probabilistic latent variable models

Supervised Learning

Optimizing a neural network with respect to a loss function

[Diagram: input → neuron → output]


Page 7:

Convex vs. Non-convex Optimization

Progress is only the tip of the iceberg. The real world is mostly non-convex!

Images taken from https://www.facebook.com/nonconvex


Page 10:

Convex vs. Nonconvex Optimization

Convex: unique optimum, global = local. Non-convex: multiple local optima.

In high dimensions, possibly exponentially many local optima.
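As a concrete illustration of this blow-up (an added example, not from the slides): a separable non-convex function such as f(w) = ∑ᵢ (wᵢ² − 1)² already has 2^d local minima, one per sign pattern of w. A minimal numpy sketch that finds many of them by gradient descent from random starts:

```python
import numpy as np

# f(w) = sum_i (w_i^2 - 1)^2 is separable: each coordinate has two local
# minima (w_i = +1 or w_i = -1), so the full function has 2^d of them.
def grad(w):
    return 4.0 * w * (w**2 - 1.0)

d = 8
rng = np.random.default_rng(0)
minima = set()
for _ in range(200):
    w = rng.normal(size=d)            # random initialization
    for _ in range(500):              # plain gradient descent
        w -= 0.01 * grad(w)
    minima.add(tuple(np.sign(np.round(w)).astype(int)))

print(f"distinct local minima reached: {len(minima)} of 2^{d} = {2**d}")
```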

How to deal with non-convexity?

Page 11:

Outline

1 Introduction

2 Guaranteed Training of Neural Networks

3 Overview of Other Results on Tensors

4 Conclusion

Page 12:

Training Neural Networks

Tremendous practical impact with deep learning.

Algorithm: backpropagation.

Highly non-convex optimization.

Page 13:

Toy Example: Failure of Backpropagation

[Figure: labeled input samples in the (x₁, x₂) plane, classes y = 1 and y = −1. Goal: binary classification. Network: inputs x₁, x₂ feed two sigmoid units σ(·) with weights w₁, w₂, producing output y]

Our method: guaranteed risk bounds for training neural networks


Page 17:

Backpropagation vs. Our Method

Weights w2 randomly drawn and fixed

Backprop (quadratic) loss surface

[3D surface plot: backprop quadratic loss over (w₁(1), w₁(2)) ∈ [−4, 4]²; loss values range ≈ 200–650]

Loss surface for our method

[3D surface plot: loss for our method over the same (w₁(1), w₁(2)) grid; loss values range ≈ 0–200]
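Surfaces of this kind can be reproduced in spirit with a short script. The sketch below is a reconstruction, not the authors' code: it assumes the toy network sums two sigmoid units, σ(w₁ᵀx) + σ(w₂ᵀx), and that labels come from a planted network — both guesses at the exact setup in the figure. It sweeps w₁ over a grid with w₂ held fixed and evaluates the empirical quadratic loss:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                    # input samples
w1_true = np.array([2.0, -1.5])                  # planted first-unit weights
w2 = rng.normal(size=2)                          # w2: randomly drawn and fixed
y = np.sign(sigmoid(X @ w1_true) + sigmoid(X @ w2) - 1.0)   # +/-1 labels

def quad_loss(w1):
    # Empirical backprop (quadratic) loss with w2 held fixed.
    f = sigmoid(X @ w1) + sigmoid(X @ w2)
    return np.mean((y - f) ** 2)

grid = np.linspace(-4, 4, 81)
Z = np.array([[quad_loss(np.array([a, b])) for b in grid] for a in grid])
print(Z.min(), Z.max())   # loss surface over the w1 grid
```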

Page 18:

Overcoming Hardness of Training

In general, training a neural network is NP-hard.

How does knowledge of input distribution help?


Page 20:

Generative vs. Discriminative Models

[Plots: a generative model p(x, y) and a discriminative model p(y|x) over input data x, for classes y = 0 and y = 1]

Generative models: Encode domain knowledge.

Discriminative: good classification performance.

A neural network is a discriminative model.

Do generative models help in discriminative tasks?


Page 22:

Feature Transformation for Training Neural Networks

Feature learning: Learn φ(·) from input data.

How to use φ(·) to train neural networks?

[Diagram: x → φ(x) → y]

Multivariate Moments: Many possibilities, . . .

E[x ⊗ y], E[x ⊗ x ⊗ y], E[φ(x) ⊗ y], . . .

Page 23:

Tensor Notation for Higher Order Moments

Multivariate higher-order moments form tensors.

Are there spectral operations on tensors akin to PCA on matrices?

Matrix

E[x ⊗ y] ∈ ℝ^{d×d} is a second-order tensor.

E[x ⊗ y]_{i₁,i₂} = E[x_{i₁} y_{i₂}].

For matrices: E[x ⊗ y] = E[xy^⊤].

Tensor

E[x ⊗ x ⊗ y] ∈ ℝ^{d×d×d} is a third-order tensor.

E[x ⊗ x ⊗ y]_{i₁,i₂,i₃} = E[x_{i₁} x_{i₂} y_{i₃}].

In general, E[φ(x) ⊗ y] is a tensor.
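For concreteness, the empirical versions of these cross-moments are one einsum each (a minimal numpy sketch; the variable names are illustrative, and y is taken vector-valued to match E[x ⊗ y] ∈ ℝ^{d×d}):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
x = rng.normal(size=(n, d))          # inputs
y = rng.normal(size=(n, d))          # vector-valued outputs

# Empirical second-order cross-moment E[x (x) y]: a d x d matrix.
M2 = np.einsum('ni,nj->ij', x, y) / n

# Empirical third-order cross-moment E[x (x) x (x) y]: a d x d x d tensor.
M3 = np.einsum('ni,nj,nk->ijk', x, x, y) / n

print(M2.shape, M3.shape)   # (5, 5) (5, 5, 5)
```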

What class of φ(·) is useful for training neural networks?

Page 24:

Score Function Transformations

Score function for x ∈ ℝ^d with pdf p(·):

S₁(x) := −∇ₓ log p(x)

mth-order score function:

S_m(x) := (−1)^m ∇^{(m)} p(x) / p(x)

Input: x ∈ ℝ^d ↦ S₁(x) ∈ ℝ^d, S₂(x) ∈ ℝ^{d×d}, S₃(x) ∈ ℝ^{d×d×d}
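As a concrete special case (an added example): for a standard Gaussian input x ~ N(0, I), these score functions reduce to the multivariate Hermite polynomials, so S₁, S₂, S₃ have simple closed forms. A minimal numpy sketch:

```python
import numpy as np

def gaussian_scores(x):
    """Score functions S1, S2, S3 for x ~ N(0, I_d).

    S_m(x) = (-1)^m grad^(m) p(x) / p(x); for a standard Gaussian these
    are the multivariate Hermite polynomial tensors.
    """
    d = x.shape[0]
    I = np.eye(d)
    S1 = x                                   # -grad log p(x) = x
    S2 = np.outer(x, x) - I                  # x x^T - I
    xxx = np.einsum('i,j,k->ijk', x, x, x)
    sym = (np.einsum('i,jk->ijk', x, I)
           + np.einsum('j,ik->ijk', x, I)
           + np.einsum('k,ij->ijk', x, I))
    S3 = xxx - sym                           # 3rd-order Hermite tensor
    return S1, S2, S3

S1, S2, S3 = gaussian_scores(np.random.default_rng(0).normal(size=4))
print(S1.shape, S2.shape, S3.shape)          # (4,) (4, 4) (4, 4, 4)
```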

Page 31:

Moments of a Neural Network

E[y | x] = f(x) = a₂^⊤ σ(A₁^⊤ x + b₁) + b₂

[Diagram: one-hidden-layer network with k sigmoid units σ(·), inputs x₁, …, x_d, first-layer weights A₁, output weights a₂]

Given labeled examples {(xᵢ, yᵢ)}:

E[y · S_m(x)] = E[∇^{(m)} f(x)]

M₁ = E[y · S₁(x)] = ∑_{j∈[k]} λ_{1,j} · (A₁)_j

M₂ = E[y · S₂(x)] = ∑_{j∈[k]} λ_{2,j} · (A₁)_j ⊗ (A₁)_j

M₃ = E[y · S₃(x)] = ∑_{j∈[k]} λ_{3,j} · (A₁)_j ⊗ (A₁)_j ⊗ (A₁)_j

Why are tensors required?

Matrix decomposition recovers only the subspace spanned by the columns of A₁, not the actual weights.

Tensor decomposition uniquely recovers the weights under non-degeneracy conditions.

Guaranteed learning of the weights of the first layer via tensor decomposition.

Learning the other parameters via a Fourier technique.
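The identity E[y · S_m(x)] = E[∇^{(m)} f(x)] can be sanity-checked numerically for m = 1 and Gaussian input, where S₁(x) = x and the identity is Stein's lemma. A small Monte Carlo sketch (my construction, with arbitrary network weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 4, 3, 500_000
A1, b1 = rng.normal(size=(d, k)), rng.normal(size=k)
a2, b2 = rng.normal(size=k), 0.5

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=(n, d))              # x ~ N(0, I), so S1(x) = x
h = sigmoid(x @ A1 + b1)
y = h @ a2 + b2                          # noiseless labels: y = f(x)

lhs = (x * y[:, None]).mean(axis=0)      # E[y * S1(x)]
grads = ((h * (1 - h)) * a2) @ A1.T      # per-sample grad f(x) = A1 diag(sigma'(.)) a2
rhs = grads.mean(axis=0)                 # E[grad f(x)]
print(np.max(np.abs(lhs - rhs)))         # -> 0 up to Monte Carlo error
```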

Page 41:

NN-LiFT: Neural Network LearnIng using Feature Tensors

Input: x ∈ ℝ^d ↦ S₃(x) ∈ ℝ^{d×d×d}

Estimating M₃ from labeled data {(xᵢ, yᵢ)} via the cross-moment

M̂₃ = (1/n) ∑_{i=1}^{n} yᵢ ⊗ S₃(xᵢ)

CP tensor decomposition of M̂₃: the rank-1 components are the estimates of the columns of A₁.

Fourier technique ⇒ a₂, b₁, b₂
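A minimal sketch of the CP decomposition step (mine, not the NN-LiFT implementation): the tensor power method with deflation, which provably recovers the components when they are (near-)orthogonal, e.g. after a whitening step — whitening is omitted here:

```python
import numpy as np

def tensor_power_method(T, n_components, n_iter=200, rng=None):
    """Extract rank-1 components of a symmetric 3rd-order tensor by
    power iteration + deflation. Assumes (near-)orthogonal components."""
    rng = rng or np.random.default_rng(0)
    d = T.shape[0]
    lams, us = [], []
    for _ in range(n_components):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)
        for _ in range(n_iter):
            u = np.einsum('ijk,j,k->i', T, u, u)       # T(I, u, u)
            u /= np.linalg.norm(u)
        lam = np.einsum('ijk,i,j,k->', T, u, u, u)     # T(u, u, u)
        T = T - lam * np.einsum('i,j,k->ijk', u, u, u) # deflate
        lams.append(lam); us.append(u)
    return np.array(lams), np.array(us)

# Example: recover planted orthogonal components.
d = 6
U = np.linalg.qr(np.random.default_rng(1).normal(size=(d, d)))[0][:, :3]
T = sum(w * np.einsum('i,j,k->ijk', U[:, j], U[:, j], U[:, j])
        for j, w in enumerate([3.0, 2.0, 1.0]))
lams, _ = tensor_power_method(T, 3)
print(np.round(np.sort(lams)[::-1], 3))   # ~ [3.0, 2.0, 1.0], up to ordering
```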


Page 47:

Estimation error bound

Guaranteed learning of weights of first layer via tensor decomposition.

M₃ = E[y ⊗ S₃(x)] = ∑_{j∈[k]} λ_{3,j} · (A₁)_j ⊗ (A₁)_j ⊗ (A₁)_j

Full column rank assumption on weight matrix A1

Guaranteed tensor decomposition (AGHKT’14, AGJ’14)

Learning the other parameters via a Fourier technique.

Theorem (JSA'14)

For a number of samples n = poly(d, k), we have w.h.p.

|f̂(x) − f(x)|² ≤ O(1/n).

"Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods" by M. Janzamin, H. Sedghi and A., June 2015.


Page 49:

Our Main Result: Risk Bounds

Approximating an arbitrary function f(x) with bounded

C_f := ∫_{ℝ^d} ‖ω‖₂ · |F(ω)| dω, where F(ω) is the Fourier transform of f.

n samples, d input dimension, k number of neurons.

Theorem (JSA'14)

Assume C_f is small. Then

E[|f̂(x) − f(x)|²] ≤ O(C_f²/k) + O(1/n).

Polynomial sample complexity n in terms of dimensions d, k.

Computational complexity same as SGD with enough parallel processors.

"Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods" by M. Janzamin, H. Sedghi and A., June 2015.

Page 50:

Outline

1 Introduction

2 Guaranteed Training of Neural Networks

3 Overview of Other Results on Tensors

4 Conclusion

Page 51:

Tractable Learning for LVMs

[Diagrams: GMM; HMM with hidden states h₁, h₂, h₃ emitting x₁, x₂, x₃; ICA with sources h₁, …, h_k mixing into x₁, …, x_d; multiview and topic models]


Page 53:

At Scale Tensor Computations

Randomized Tensor Sketches

Naive computation scales exponentially in the order of the tensor.

Propose randomized FFT sketches.

Computational complexity independent of tensor order.

Linear scaling in input dimension and number of samples.

Tensor Contractions with Extended BLAS Kernels on CPU and GPU

BLAS: Basic Linear Algebra Subprograms, highly optimized libraries.

Use extended BLAS to minimize data permutations and I/O calls.

(1) Fast and Guaranteed Tensor Decomposition via Sketching by Yining Wang, Hsiao-Yu Tung, Alex Smola, A., NIPS 2015.

(2) Tensor Contractions with Extended BLAS Kernels on CPU and GPU by Y. Shi, U.N. Niranjan, C. Cecka, A. Mowli, A.
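The count-sketch idea behind such FFT sketches, in bare-bones form (my sketch, not the paper's implementation): the sketch of a rank-1 tensor u ⊗ v ⊗ w is the circular convolution of per-mode vector sketches, computable in O(b log b) with FFTs:

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 32, 64                           # tensor dimension, sketch length
h = rng.integers(0, b, size=(3, d))     # one hash function per mode
s = rng.choice([-1, 1], size=(3, d))    # one sign function per mode

def vector_sketch(u, mode):
    """Count sketch of a vector: signed values accumulated into b buckets."""
    out = np.zeros(b)
    np.add.at(out, h[mode], s[mode] * u)
    return out

# Sketch of the rank-1 tensor u (x) v (x) w: circular convolution of the
# three per-mode sketches, computed via FFT.
u, v, w = (rng.normal(size=d) for _ in range(3))
sk = np.fft.irfft(np.fft.rfft(vector_sketch(u, 0))
                  * np.fft.rfft(vector_sketch(v, 1))
                  * np.fft.rfft(vector_sketch(w, 2)), n=b)

# Unbiased (but noisy) estimate of one entry T[i, j, k] = u_i v_j w_k;
# in practice one averages over several independent sketches.
i, j, k = 3, 7, 11
est = s[0, i] * s[1, j] * s[2, k] * sk[(h[0, i] + h[1, j] + h[2, k]) % b]
print(est, u[i] * v[j] * w[k])
```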

Page 54:

Preliminary Results on Spark

In-memory processing in Spark: ideal for iterative tensor methods.

Alternating Least Squares for Tensor Decomposition.

min_{λ,A,B,C} ‖T − ∑_{i=1}^{k} λᵢ A(:, i) ⊗ B(:, i) ⊗ C(:, i)‖²_F

[Diagram: tensor slices distributed across workers 1, 2, …, k; each worker updates its rows independently given factors B and C]

Results on the NYTimes corpus (3×10⁵ documents, 10⁸ words):

Spark: 26 mins    Map-Reduce: 4 hrs

Topic Modeling at Lightning Speeds via Tensor Factorization on Spark by F. Huang, A., under preparation.
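For reference, a serial (non-distributed) ALS sketch: each factor is updated in closed form via the Khatri-Rao product of the other two, with the weights λᵢ absorbed into the factors:

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (B kr C)[:, i] = kron(B[:, i], C[:, i])."""
    return (B[:, None, :] * C[None, :, :]).reshape(-1, B.shape[1])

def als_cp(T, k, n_iter=100, rng=None):
    """Rank-k CP decomposition of a 3rd-order tensor by alternating least squares."""
    rng = rng or np.random.default_rng(0)
    d1, d2, d3 = T.shape
    A, B, C = (rng.normal(size=(d, k)) for d in (d1, d2, d3))
    for _ in range(n_iter):
        # Solve for each factor with the other two fixed (mode unfoldings).
        A = np.linalg.lstsq(khatri_rao(B, C), T.reshape(d1, -1).T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C), T.transpose(1, 0, 2).reshape(d2, -1).T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B), T.transpose(2, 0, 1).reshape(d3, -1).T, rcond=None)[0].T
    return A, B, C

# Example: recover a planted rank-3 tensor.
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.normal(size=(8, 3)) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = als_cp(T, k=3)
print(np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', A, B, C)))  # ~ 0
```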

Page 55:

Convolutional Tensor Decomposition

[Diagram: (a) convolutional dictionary model x = ∑ᵢ fᵢ* ∗ wᵢ*; (b) reformulated model x = F* w*]

Cumulant = λ₁ (F₁*)^{⊗3} + λ₂ (F₂*)^{⊗3} + · · ·

Efficient methods for tensor decomposition with circulant constraints.

Convolutional Dictionary Learning through Tensor Factorization by F. Huang, A., June 2015.

Page 56:

Reinforcement Learning (RL) of POMDPs

Partially observable Markov decision processes.

Proposed Method

Consider memoryless policies. Episodic learning: indirect exploration.

Tensor methods: careful conditioning required for learning.

First RL method for POMDPs with logarithmic regret bounds.

[Diagram: POMDP with hidden states xᵢ, observations yᵢ, rewards rᵢ, and actions aᵢ]

[Plot: average reward vs. number of trials for SM-UCRL-POMDP, UCRL-MDP, Q-learning, and a random policy]

Logarithmic Regret Bounds for POMDPs using Spectral Methods by K. Azizzadenesheli, A. Lazaric, A., under preparation.

Page 57:

Outline

1 Introduction

2 Guaranteed Training of Neural Networks

3 Overview of Other Results on Tensors

4 Conclusion


Page 59:

Summary and Outlook

Summary

Tensor methods: a powerful paradigm for guaranteed large-scale machine learning.

First methods to provide provable bounds for training neural networks, many latent variable models (e.g., HMM, LDA), and POMDPs!

Outlook

Training multi-layer neural networks, models with invariances, reinforcement learning using neural networks, . . .

A unified framework for tractable non-convex methods with guaranteed convergence to global optima?

Page 60:

My Research Group and Resources

Podcast/lectures/papers/software available at http://newport.eecs.uci.edu/anandkumar/