
Geometry and Regularization in Nonconvex Quadratic Inverse Problems

Yuejie Chi

Department of Electrical and Computer Engineering

Princeton Workshop, May 2018

Acknowledgements

Thanks to my collaborators:

Yuxin Chen (Princeton), Cong Ma (Princeton), Kaizheng Wang (Princeton), Yuanxin Li (CMU/OSU)

This research is supported by NSF, ONR and AFOSR.

1

Nonconvex optimization in theory

There may be exponentially many local optima

e.g. a single neuron model (Auer, Herbster, Warmuth ’96)

2


Nonconvex optimization in practice

Using simple algorithms such as gradient descent, e.g., “backpropagation” for training deep neural networks...

3

Statistical context is important

Data/measurements follow certain statistical models and hence are not worst-case instances.

$$\underset{\boldsymbol{x}}{\text{minimize}}\quad f(\boldsymbol{x}) = \frac{1}{m}\sum_{i=1}^{m}\ell(y_i;\boldsymbol{x}) \;\overset{m\to\infty}{\Longrightarrow}\; \mathbb{E}\big[\ell(y;\boldsymbol{x})\big]$$

empirical risk ≈ population risk (often nice!)

[Figure: contour plots of the empirical and population risks over $(\theta_1,\theta_2)$, marking $\theta_0 = [1, 0]$ and $\hat{\theta}_n = [0.816, -0.268]$.]

Figure credit: Mei, Bai and Montanari

4


Putting together...

global convergence

statistical models

benign landscape

5

Computational efficiency?

global convergence

statistical models

benign landscape

But how fast?

6

Overview and question we aim to address

[Diagram: statistical efficiency (sample complexity) vs. computational efficiency (critical points, smoothness; inefficiency from saddle points and nonsmoothness), comparing regularized and unregularized methods, with the unregularized case marked "?".]

Can we simultaneously achieve statistical and computational efficiency using unregularized methods?

7


Quadratic inverse problem

[Figure: a design matrix $\boldsymbol{A}$ applied to a signal, producing quadratic measurements $y_i = \|\boldsymbol{a}_i^\top \boldsymbol{X}\|_2^2$.]

Recover $\boldsymbol{X}^\natural \in \mathbb{R}^{n\times r}$ from $m$ "random" quadratic measurements

$$y_i = \big\|\boldsymbol{a}_i^\top \boldsymbol{X}^\natural\big\|_2^2, \qquad i = 1,\ldots,m$$

The rank-1 case is the famous phase retrieval problem.

8
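As a concrete illustration of this measurement model, here is a minimal NumPy sketch (not from the talk) that generates quadratic measurements under an i.i.d. Gaussian design; the sizes n, r, m and all variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 50, 3, 2000                               # illustrative sizes

X_true = rng.standard_normal((n, r)) / np.sqrt(n)   # planted signal X^natural
A = rng.standard_normal((m, n))                     # Gaussian design: rows a_i ~ N(0, I_n)

# quadratic measurements y_i = || a_i^T X^natural ||_2^2
y = np.sum((A @ X_true) ** 2, axis=1)

# rank-1 special case (phase retrieval): y_i = (a_i^T x^natural)^2
x_true = X_true[:, :1]
y_pr = ((A @ x_true).ravel()) ** 2
```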

Geometric interpretation

If $\boldsymbol{X}^\natural$ is orthonormal...

[Figure: sampling vectors $\boldsymbol{a}_i, \boldsymbol{a}_j$ and the subspace spanned by $\boldsymbol{X}^\natural$, annotated with $\|\boldsymbol{a}_i^\top\boldsymbol{X}^\natural\|_2$ and $\|\boldsymbol{a}_j^\top\boldsymbol{X}^\natural\|_2$.]

We lose a generalized notion of "phase" in $\mathbb{R}^r$:

$$\frac{\boldsymbol{a}_i^\top \boldsymbol{X}^\natural}{\big\|\boldsymbol{a}_i^\top \boldsymbol{X}^\natural\big\|_2}$$

9


Shallow neural network

[Figure: a one-hidden-layer network with input $\boldsymbol{a}$, hidden-layer weights $\boldsymbol{X}^\natural = [\boldsymbol{x}_1,\ldots,\boldsymbol{x}_r]$, and scalar output $y$.]

Set $\boldsymbol{X}^\natural = [\boldsymbol{x}_1,\boldsymbol{x}_2,\ldots,\boldsymbol{x}_r]$; then

$$y = \sum_{i=1}^{r} \sigma(\boldsymbol{a}^\top \boldsymbol{x}_i).$$

10

Shallow neural network with quadratic activation

[Figure: the same one-hidden-layer network, now with quadratic activation $\sigma(z) = z^2$.]

Set $\boldsymbol{X}^\natural = [\boldsymbol{x}_1,\boldsymbol{x}_2,\ldots,\boldsymbol{x}_r]$; then

$$y = \sum_{i=1}^{r} \sigma(\boldsymbol{a}^\top \boldsymbol{x}_i) \;\overset{\sigma(z)=z^2}{=}\; \sum_{i=1}^{r} \big(\boldsymbol{a}^\top \boldsymbol{x}_i\big)^2 = \big\|\boldsymbol{a}^\top \boldsymbol{X}^\natural\big\|_2^2.$$

11

A natural least squares formulation

given: $y_k = \big\|\boldsymbol{a}_k^\top \boldsymbol{X}^\natural\big\|_2^2$, $\quad 1 \le k \le m$

$$\underset{\boldsymbol{X}\in\mathbb{R}^{n\times r}}{\text{minimize}}\quad f(\boldsymbol{X}) = \frac{1}{4m}\sum_{k=1}^{m}\Big(\big\|\boldsymbol{a}_k^\top \boldsymbol{X}\big\|_2^2 - y_k\Big)^2$$

• Use r = 1 as a running example.

12


Wirtinger flow (Candes, Li, Soltanolkotabi ’14)

Empirical risk minimization

$$\underset{\boldsymbol{x}}{\text{minimize}}\quad f(\boldsymbol{x}) = \frac{1}{4m}\sum_{k=1}^{m}\Big[\big(\boldsymbol{a}_k^\top \boldsymbol{x}\big)^2 - y_k\Big]^2$$

• Initialization by spectral method

• Gradient iterations: for $t = 0, 1, \ldots$

$$\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \eta\,\nabla f(\boldsymbol{x}_t)$$

13
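The two-step procedure above can be sketched in a few lines of NumPy. This is an illustrative implementation of the rank-1 case, not the authors' reference code; the spectral-initialization scaling and the default step size and iteration count are common heuristic choices.

```python
import numpy as np

def wirtinger_flow(A, y, eta=0.1, n_iters=500):
    """Spectral initialization + vanilla gradient descent for y_k = (a_k^T x)^2.

    A : (m, n) design matrix with rows a_k;  y : (m,) measurements.
    Step size and iteration count are illustrative defaults.
    """
    m, n = A.shape
    # Spectral initialization: leading eigenvector of (1/m) sum_k y_k a_k a_k^T,
    # rescaled so that ||x_0|| ~ ||x^natural|| (using E[(a^T x)^2] = ||x||^2).
    Y = (A * y[:, None]).T @ A / m
    x = np.linalg.eigh(Y)[1][:, -1] * np.sqrt(np.mean(y))
    # Gradient iterations on f(x) = (1/4m) sum_k [(a_k^T x)^2 - y_k]^2.
    for _ in range(n_iters):
        Ax = A @ x
        x = x - eta * (A.T @ ((Ax ** 2 - y) * Ax)) / m
    return x
```

Run on the synthetic (A, y_pr) pair generated earlier, the output matches x_true up to a global sign, which is unidentifiable from quadratic measurements.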


Geometry of loss surface

[Figure: the loss surface of the phase retrieval objective, marking $\boldsymbol{x}^\natural$, the saddle point, and the spectral initialization.]

14

Gradient descent theory

f is said to be α-strongly convex and β-smooth if

$$\boldsymbol{0} \prec \alpha \boldsymbol{I} \preceq \nabla^2 f(\boldsymbol{x}) \preceq \beta \boldsymbol{I}, \qquad \forall \boldsymbol{x}$$

$\ell_2$ error contraction: GD with $\eta = 1/\beta$ obeys

$$\|\boldsymbol{x}_{t+1} - \boldsymbol{x}^\natural\|_2 \le \Big(1 - \frac{\alpha}{\beta}\Big)\, \|\boldsymbol{x}_t - \boldsymbol{x}^\natural\|_2$$

• Condition number $\beta/\alpha$ determines the rate of convergence

• Attains $\varepsilon$-accuracy within $O\big(\frac{\beta}{\alpha}\log\frac{1}{\varepsilon}\big)$ iterations

15


What does this optimization theory say about WF?

Gaussian designs: $\boldsymbol{a}_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_n)$, $\quad 1 \le k \le m$

Population level (infinite samples):

$$\mathbb{E}\big[\nabla^2 f(\boldsymbol{x})\big] = 3\big(\|\boldsymbol{x}\|_2^2\, \boldsymbol{I} + 2\boldsymbol{x}\boldsymbol{x}^\top\big) - \big(\|\boldsymbol{x}^\natural\|_2^2\, \boldsymbol{I} + 2\boldsymbol{x}^\natural \boldsymbol{x}^{\natural\top}\big)$$

which is locally positive definite and well-conditioned:

$$\boldsymbol{I}_n \preceq \mathbb{E}\big[\nabla^2 f(\boldsymbol{x})\big] \preceq 10\, \boldsymbol{I}_n$$

Consequence: WF converges within $O\big(\log\frac{1}{\varepsilon}\big)$ iterations if $m \to \infty$.

Finite-sample level ($m \asymp n\log n$): $\nabla^2 f(\boldsymbol{x})$ is ill-conditioned even locally (condition number $\asymp n$):

$$\tfrac{1}{2}\, \boldsymbol{I}_n \preceq \nabla^2 f(\boldsymbol{x}) \preceq O(n)\, \boldsymbol{I}_n$$

Consequence (Candes et al '14): WF attains $\varepsilon$-accuracy within $O\big(n\log\frac{1}{\varepsilon}\big)$ iterations with $\eta \asymp \frac{1}{n}$ if $m \asymp n\log n$.

Regularization helps, e.g. TWF (Chen and Candes '15), but is WF really that bad?

16


Numerical efficiency with $\eta_t = 0.1$

[Figure: relative estimation error vs. iteration count (0 to 500), shown on a semilog scale from $10^{-15}$ to $10^{0}$.]

Vanilla GD (WF) can proceed much more aggressively!

17

Numerical efficiency with $\eta_t = 0.1$

[Figure: the same convergence plot as on the previous slide.]

Generic optimization theory is too pessimistic!

18
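An experiment of this flavor can be reproduced with the sketches above; the snippet below reruns the gradient loop while recording the relative error up to the global sign (step size 0.1 as in the caption, everything else illustrative).

```python
import numpy as np

# Reuse A, y_pr, x_true from the earlier sketches (rank-1 case).
eta, n_iters = 0.1, 500
m, n = A.shape
xt = x_true.ravel()

# Spectral initialization, as in wirtinger_flow above.
Y = (A * y_pr[:, None]).T @ A / m
x = np.linalg.eigh(Y)[1][:, -1] * np.sqrt(np.mean(y_pr))

errors = []
for _ in range(n_iters):
    Ax = A @ x
    x = x - eta * (A.T @ ((Ax ** 2 - y_pr) * Ax)) / m
    # distance to the truth up to the unrecoverable global sign
    dist = min(np.linalg.norm(x - xt), np.linalg.norm(x + xt))
    errors.append(dist / np.linalg.norm(xt))
# Plotting `errors` on a semilog scale gives a curve of the kind shown above.
```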

A second look at gradient descent theory

Which region enjoys both strong convexity and smoothness?

$$\nabla^2 f(\boldsymbol{x}) = \frac{1}{m}\sum_{k=1}^{m}\Big[3\big(\boldsymbol{a}_k^\top \boldsymbol{x}\big)^2 - \big(\boldsymbol{a}_k^\top \boldsymbol{x}^\natural\big)^2\Big]\, \boldsymbol{a}_k \boldsymbol{a}_k^\top$$

• Not smooth if $\boldsymbol{x}$ and $\boldsymbol{a}_k$ are too close (coherent)

Within the region where

• $\boldsymbol{x}$ is not far away from $\boldsymbol{x}^\natural$

• $\boldsymbol{x}$ is incoherent w.r.t. the sampling vectors (incoherence region)

one has

$$\tfrac{1}{2}\cdot \boldsymbol{I}_n \preceq \nabla^2 f(\boldsymbol{x}) \preceq O(\log n)\cdot \boldsymbol{I}_n$$

19
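To make the dichotomy concrete, the following sketch (an illustration under the Gaussian model with arbitrary sizes and perturbation radius, not an experiment from the talk) compares the largest eigenvalue of the empirical Hessian at a generic perturbation of the truth with that at a perturbation tilted toward a single sampling vector.

```python
import numpy as np

def empirical_hessian(A, x, x_star):
    """(1/m) * sum_k [3 (a_k^T x)^2 - (a_k^T x_star)^2] a_k a_k^T."""
    m = A.shape[0]
    w = 3 * (A @ x) ** 2 - (A @ x_star) ** 2
    return (A * w[:, None]).T @ A / m

rng2 = np.random.default_rng(1)
n2 = 500
m2 = int(3 * n2 * np.log(n2))
x_star = rng2.standard_normal(n2)
x_star /= np.linalg.norm(x_star)
A2 = rng2.standard_normal((m2, n2))

delta = 0.5                                             # perturbation radius (arbitrary)
g = rng2.standard_normal(n2)
x_incoh = x_star + delta * g / np.linalg.norm(g)        # generic, incoherent direction
x_coh = x_star + delta * A2[0] / np.linalg.norm(A2[0])  # tilted toward sampling vector a_1

lam_incoh = np.linalg.eigvalsh(empirical_hessian(A2, x_incoh, x_star))[-1]
lam_coh = np.linalg.eigvalsh(empirical_hessian(A2, x_coh, x_star))[-1]
# Expect lam_coh to be noticeably larger than lam_incoh, with the gap growing in n,
# mirroring the O(log n) vs. O(n) smoothness bounds on the two regions.
```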


A second look at gradient descent theory

region of local strong convexity + smoothness

• Generic optimization theory only ensures that iterates remain in the $\ell_2$ ball, but not in the incoherence region

• Existing algorithms enforce regularization, or apply sample splitting, to promote incoherence

20


Our findings: GD is implicitly regularized

region of local strong convexity + smoothness

GD implicitly forces iterates to remain incoherent even without regularization

21


Theoretical guarantees

Theorem (Phase retrieval)

Under i.i.d. Gaussian design, WF achieves

• $\max_k \big|\boldsymbol{a}_k^\top(\boldsymbol{x}_t - \boldsymbol{x}^\natural)\big| \lesssim \sqrt{\log n}\, \|\boldsymbol{x}^\natural\|_2$ (incoherence)

• $\|\boldsymbol{x}_t - \boldsymbol{x}^\natural\|_2 \lesssim \big(1 - \frac{\eta}{2}\big)^t \|\boldsymbol{x}^\natural\|_2$ (near-linear convergence)

provided that the step size $\eta \asymp \frac{1}{\log n}$ and the sample size $m \gtrsim n\log n$.

Big computational saving: WF attains $\varepsilon$-accuracy within $O\big(\log n \log\frac{1}{\varepsilon}\big)$ iterations with $\eta \asymp 1/\log n$ if $m \asymp n\log n$.

22
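A quick empirical check of the incoherence claim, reusing the synthetic instance from the earlier sketches and the same gradient loop as the error-tracking sketch above (the step size 0.1 follows the numerical section; the theorem's prescription is η ≍ 1/log n):

```python
import numpy as np

# Reuse A, y_pr, x_true from the earlier sketches.
m, n = A.shape
eta = 0.1
xt = x_true.ravel()

Y = (A * y_pr[:, None]).T @ A / m
x = np.linalg.eigh(Y)[1][:, -1] * np.sqrt(np.mean(y_pr))

ratios = []
for _ in range(200):
    Ax = A @ x
    x = x - eta * (A.T @ ((Ax ** 2 - y_pr) * Ax)) / m
    x_al = x if np.linalg.norm(x - xt) < np.linalg.norm(x + xt) else -x
    # incoherence metric: max_k |a_k^T (x_t - x^natural)| / (sqrt(log n) * ||x^natural||)
    ratios.append(np.max(np.abs(A @ (x_al - xt))) / (np.sqrt(np.log(n)) * np.linalg.norm(xt)))
# The theorem predicts this ratio stays bounded by a constant along the whole trajectory.
```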


Theoretical guarantees for the low-rank case

Theorem (Quadratic sampling)

Under i.i.d. Gaussian design, GD achieves

• $\max_l \big\|\boldsymbol{a}_l^\top(\boldsymbol{X}_t\boldsymbol{Q}_t - \boldsymbol{X}^\natural)\big\|_2 \lesssim \sqrt{\log n}\; \frac{\sigma_r^2(\boldsymbol{X}^\natural)}{\|\boldsymbol{X}^\natural\|_{\mathrm{F}}}$ (incoherence)

• $\|\boldsymbol{X}_t\boldsymbol{Q}_t - \boldsymbol{X}^\natural\|_{\mathrm{F}} \lesssim \Big(1 - \frac{\sigma_r^2(\boldsymbol{X}^\natural)\,\eta}{2}\Big)^t \|\boldsymbol{X}^\natural\|_{\mathrm{F}}$ (linear convergence)

provided that $\eta \asymp \frac{1}{(\log n \vee r)^2\, \sigma_r^2(\boldsymbol{X}^\natural)}$ and $m \gtrsim n r^4 \log n$. Here $\boldsymbol{Q}_t$ denotes the rotation best aligning $\boldsymbol{X}_t$ with $\boldsymbol{X}^\natural$.

Big computational saving: GD attains $\varepsilon$-accuracy within $O\big((\log n \vee r)^2 \log\frac{1}{\varepsilon}\big)$ iterations if $m \asymp n r^4 \log n$.

Prior result (Sanghavi, Ward, White '15): $O\big(n^4 r^2 \log^4 n\, \log\frac{1}{\varepsilon}\big)$ iterations if $m \asymp n r^6 \log^2 n$.

23


Key ingredient: leave-one-out analysis

How to establish $\big|\boldsymbol{a}_l^\top(\boldsymbol{x}_t - \boldsymbol{x}^\natural)\big| \lesssim \sqrt{\log n}\,\|\boldsymbol{x}^\natural\|_2$?

Technical difficulty: $\boldsymbol{x}_t$ is statistically dependent on $\boldsymbol{a}_l$.

Leave-one-out trick: for each $1 \le l \le m$, introduce leave-one-out iterates $\boldsymbol{x}^{t,(l)}$ by dropping the $l$-th sample.

[Figure: the design matrix $\boldsymbol{A}$ and measurements $\boldsymbol{A}\boldsymbol{x}$ with the $l$-th row removed.]

24


Key ingredient: leave-one-out analysis

[Figure: the incoherence region w.r.t. $\boldsymbol{a}_1$, containing both the true iterate $\boldsymbol{x}_t$ and the leave-one-out iterate $\boldsymbol{x}^{t,(1)}$.]

• Leave-one-out iterates $\boldsymbol{x}^{t,(l)}$ are independent of $\boldsymbol{a}_l$, and are hence incoherent w.r.t. $\boldsymbol{a}_l$ with high probability.

• Leave-one-out iterates $\boldsymbol{x}^{t,(l)} \approx$ true iterates $\boldsymbol{x}_t$.

• Finish by the triangle inequality:

$$\big|\boldsymbol{a}_l^\top(\boldsymbol{x}_t - \boldsymbol{x}^\natural)\big| \le \big|\boldsymbol{a}_l^\top(\boldsymbol{x}^{t,(l)} - \boldsymbol{x}^\natural)\big| + \big|\boldsymbol{a}_l^\top(\boldsymbol{x}_t - \boldsymbol{x}^{t,(l)})\big|.$$

25
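The construction can be mimicked numerically. The sketch below (illustrative, reusing wirtinger_flow and the synthetic data from the earlier sketches) runs the same algorithm with and without the l-th sample and evaluates the two terms of the triangle inequality.

```python
import numpy as np

# Reuse A, y_pr, x_true and wirtinger_flow from the earlier sketches.
m, n = A.shape
l = 0                                    # index of the left-out sample (illustrative)
mask = np.arange(m) != l
xt = x_true.ravel()

x_full = wirtinger_flow(A, y_pr)                 # iterate computed from all m samples
x_loo = wirtinger_flow(A[mask], y_pr[mask])      # leave-one-out iterate: never sees a_l

# Resolve the global sign ambiguity before comparing.
x_full = x_full if x_full @ xt > 0 else -x_full
x_loo = x_loo if x_loo @ x_full > 0 else -x_loo

# x_loo is independent of a_l, hence incoherent w.r.t. a_l with high probability,
# and it stays very close to x_full; summing the two terms below bounds
# |a_l^T (x_full - x^natural)| via the triangle inequality.
term_incoherent = np.abs(A[l] @ (x_loo - xt))
term_proximity = np.abs(A[l] @ (x_full - x_loo))
```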


Incoherence region in high dimensions

[Figure: the incoherence region in a 2-dimensional picture vs. in high dimensions, where the incoherence region is vanishingly small.]

26

This recipe is quite general

Low-rank matrix completion

[Figure: a partially observed matrix with revealed entries marked X and missing entries marked ?.]

Fig. credit: Candes

Given partial samples of a low-rank matrix $\boldsymbol{M}$ in an index set $\Omega$, fill in missing entries.

Applications: recommendation systems, ...

28

Incoherence

$$\underbrace{\begin{bmatrix} 1 & 0 & \cdots & 0\\ 0 & 0 & \cdots & 0\\ \vdots & \vdots & & \vdots\\ 0 & 0 & \cdots & 0 \end{bmatrix}}_{\text{hard: } \mu = n} \qquad \text{vs.} \qquad \underbrace{\begin{bmatrix} 1 & 1 & \cdots & 1\\ 1 & 1 & \cdots & 1\\ \vdots & \vdots & & \vdots\\ 1 & 1 & \cdots & 1 \end{bmatrix}}_{\text{easy: } \mu = 1}$$

Definition (Incoherence for matrix completion)

A rank-$r$ matrix $\boldsymbol{M}^\natural$ with eigendecomposition $\boldsymbol{M}^\natural = \boldsymbol{U}^\natural \boldsymbol{\Sigma}^\natural \boldsymbol{U}^{\natural\top}$ is said to be $\mu$-incoherent if

$$\big\|\boldsymbol{U}^\natural\big\|_{2,\infty} \le \sqrt{\frac{\mu}{n}}\, \big\|\boldsymbol{U}^\natural\big\|_{\mathrm{F}} = \sqrt{\frac{\mu r}{n}}.$$

29
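For concreteness, here is a small helper (illustrative, not from the talk) that computes the incoherence parameter μ from this definition and evaluates it on the two extreme examples above.

```python
import numpy as np

def incoherence_mu(M, r):
    """mu such that ||U||_{2,inf} = sqrt(mu * r / n) for the top-r eigenspace U of M."""
    n = M.shape[0]
    eigvals, eigvecs = np.linalg.eigh(M)
    U = eigvecs[:, np.argsort(-np.abs(eigvals))[:r]]   # top-r eigenvectors
    row_norm_max = np.max(np.linalg.norm(U, axis=1))   # ||U||_{2, infinity}
    return n * row_norm_max ** 2 / r

n, r = 100, 2
rng = np.random.default_rng(2)

# Dense random low-rank matrix: small mu (at most logarithmic in n).
B = rng.standard_normal((n, r))
print(incoherence_mu(B @ B.T, r))

# Spiky matrix e_1 e_1^T: mu = n (the hard case above).
e1 = np.zeros(n); e1[0] = 1.0
print(incoherence_mu(np.outer(e1, e1), 1))
```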

Prior theory

$$\underset{\boldsymbol{X}\in\mathbb{R}^{n\times r}}{\text{minimize}}\quad f(\boldsymbol{X}) = \sum_{(j,k)\in\Omega}\Big(\boldsymbol{e}_j^\top \boldsymbol{X}\boldsymbol{X}^\top \boldsymbol{e}_k - M_{j,k}\Big)^2$$

Existing theory promotes incoherence explicitly:

• regularized loss (solve $\min_{\boldsymbol{X}} f(\boldsymbol{X}) + R(\boldsymbol{X})$ instead)
  e.g. Keshavan, Montanari, Oh '10; Sun, Luo '14; Ge, Lee, Ma '16

• projection onto the set of incoherent matrices
  e.g. Chen, Wainwright '15; Zheng, Lafferty '16

Our theory provides guarantees on vanilla / unregularized gradient descent.

30
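For reference, a minimal sketch of the unregularized loss and its gradient descent loop (an illustration, not the authors' implementation; the analyses cited here pair this loop with a spectral initialization, while the sketch uses a small random one, and the step size and iteration count are arbitrary):

```python
import numpy as np

def vanilla_gd_mc(M_obs, mask, r, eta=1e-3, n_iters=500, seed=0):
    """Unregularized GD on f(X) = sum_{(j,k) in Omega} ((X X^T)_{jk} - M_{jk})^2.

    M_obs : (n, n) array with observed entries filled in (anything elsewhere);
    mask  : boolean (n, n) indicator of the observed set Omega.
    Step size, iteration count, and the small random initialization are
    illustrative; the cited analyses pair this loop with a spectral initialization.
    """
    n = M_obs.shape[0]
    rng = np.random.default_rng(seed)
    X = 0.1 * rng.standard_normal((n, r))
    for _ in range(n_iters):
        residual = mask * (X @ X.T - M_obs)        # zero outside Omega
        grad = 2 * (residual + residual.T) @ X     # gradient of the observed-entry loss
        X = X - eta * grad
    return X
```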


Conclusions

optimization theory + statistical model: vanilla gradient descent is "implicitly regularized" and runs fast!

Computational: near dimension-free iteration complexity

Statistical: near-optimal sample complexity

• works for a few other problems such as low-rank matrix completion and blind deconvolution;

• "leave-one-out" arguments are useful for decoupling weak dependency to allow finer characterization of GD trajectories.

31

References

1. Implicit Regularization for Nonconvex Statistical Estimation, C. Ma, K. Wang, Y. Chi and Y. Chen, arXiv:1711.10467.

2. Nonconvex Matrix Factorization from Rank-One Measurements, Y. Li, C. Ma, Y. Chen, and Y. Chi, arXiv:1802.06286.

3. Gradient Descent with Random Initialization: Fast Global Convergence for Nonconvex Phase Retrieval, Y. Chen, Y. Chi, J. Fan and C. Ma, arXiv:1803.07726.

Thank you!

32
