
Geometry and Regularization in Nonconvex Quadratic Inverse Problems

Yuejie Chi

Department of Electrical and Computer Engineering

Princeton Workshop, May 2018

Acknowledgements

Thanks to my collaborators:

Yuxin Chen (Princeton), Cong Ma (Princeton), Kaizheng Wang (Princeton), Yuanxin Li (CMU/OSU)

This research is supported by NSF, ONR and AFOSR.

1

Nonconvex optimization in theory

There may be exponentially many local optima

e.g. a single neuron model (Auer, Herbster, Warmuth ’96)

2


Nonconvex optimization in practice

Using simple algorithms such as gradient descent, e.g., “backpropagation” for training deep neural networks...

3

Statistical context is important

Data/measurements follow certain statistical models and hence are not worst-case instances.

$$\underset{\boldsymbol{x}}{\text{minimize}}\quad f(\boldsymbol{x}) = \frac{1}{m}\sum_{i=1}^{m}\ell(y_i;\boldsymbol{x}) \;\overset{m\to\infty}{\Longrightarrow}\; \mathbb{E}\big[\ell(y;\boldsymbol{x})\big]$$

empirical risk ≈ population risk (often nice!)

[Figure: contour plots of the empirical and population risks over $(\theta_1,\theta_2)$, marking $\theta_0 = [1, 0]$ and $\hat{\theta}_n = [0.816, -0.268]$.]

Figure credit: Mei, Bai and Montanari

4


Putting together...

global convergence

statistical models

benign landscape

5

Computational efficiency?

global convergence

statistical models

benign landscape

But how fast?

6

Overview and question we aim to address

[Diagram: statistical efficiency (sample complexity) vs. computational efficiency (critical points, smoothness; inefficiency from saddle points and nonsmoothness), comparing regularized and unregularized methods, with the unregularized case marked "?".]

Can we simultaneously achieve statistical and computational efficiency using unregularized methods?

7


Quadratic inverse problem

[Figure: a design matrix $\boldsymbol{A}$ applied to a signal, producing quadratic measurements $y_i = \|\boldsymbol{a}_i^\top \boldsymbol{X}\|_2^2$.]

Recover $\boldsymbol{X}^\natural \in \mathbb{R}^{n\times r}$ from $m$ "random" quadratic measurements

$$y_i = \big\|\boldsymbol{a}_i^\top \boldsymbol{X}^\natural\big\|_2^2, \qquad i = 1,\ldots,m$$

The rank-1 case is the famous phase retrieval problem.

8
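As a concrete illustration of this measurement model, here is a minimal NumPy sketch (not from the talk) that generates quadratic measurements under an i.i.d. Gaussian design; the sizes n, r, m and all variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 50, 3, 2000                               # illustrative sizes

X_true = rng.standard_normal((n, r)) / np.sqrt(n)   # planted signal X^natural
A = rng.standard_normal((m, n))                     # Gaussian design: rows a_i ~ N(0, I_n)

# quadratic measurements y_i = || a_i^T X^natural ||_2^2
y = np.sum((A @ X_true) ** 2, axis=1)

# rank-1 special case (phase retrieval): y_i = (a_i^T x^natural)^2
x_true = X_true[:, :1]
y_pr = ((A @ x_true).ravel()) ** 2
```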

Geometric interpretation

If $\boldsymbol{X}^\natural$ is orthonormal...

[Figure: sampling vectors $\boldsymbol{a}_i, \boldsymbol{a}_j$ and the subspace spanned by $\boldsymbol{X}^\natural$, annotated with $\|\boldsymbol{a}_i^\top\boldsymbol{X}^\natural\|_2$ and $\|\boldsymbol{a}_j^\top\boldsymbol{X}^\natural\|_2$.]

We lose a generalized notion of "phase" in $\mathbb{R}^r$:

$$\frac{\boldsymbol{a}_i^\top \boldsymbol{X}^\natural}{\big\|\boldsymbol{a}_i^\top \boldsymbol{X}^\natural\big\|_2}$$

9


Shallow neural network

[Figure: a one-hidden-layer network with input $\boldsymbol{a}$, hidden-layer weights $\boldsymbol{X}^\natural = [\boldsymbol{x}_1,\ldots,\boldsymbol{x}_r]$, and scalar output $y$.]

Set $\boldsymbol{X}^\natural = [\boldsymbol{x}_1,\boldsymbol{x}_2,\ldots,\boldsymbol{x}_r]$; then

$$y = \sum_{i=1}^{r} \sigma(\boldsymbol{a}^\top \boldsymbol{x}_i).$$

10

Shallow neural network with quadratic activation

[Figure: the same one-hidden-layer network, now with quadratic activation $\sigma(z) = z^2$.]

Set $\boldsymbol{X}^\natural = [\boldsymbol{x}_1,\boldsymbol{x}_2,\ldots,\boldsymbol{x}_r]$; then

$$y = \sum_{i=1}^{r} \sigma(\boldsymbol{a}^\top \boldsymbol{x}_i) \;\overset{\sigma(z)=z^2}{=}\; \sum_{i=1}^{r} \big(\boldsymbol{a}^\top \boldsymbol{x}_i\big)^2 = \big\|\boldsymbol{a}^\top \boldsymbol{X}^\natural\big\|_2^2.$$

11

A natural least squares formulation

given: $y_k = \big\|\boldsymbol{a}_k^\top \boldsymbol{X}^\natural\big\|_2^2$, $\quad 1 \le k \le m$

$$\underset{\boldsymbol{X}\in\mathbb{R}^{n\times r}}{\text{minimize}}\quad f(\boldsymbol{X}) = \frac{1}{4m}\sum_{k=1}^{m}\Big(\big\|\boldsymbol{a}_k^\top \boldsymbol{X}\big\|_2^2 - y_k\Big)^2$$

• Use r = 1 as a running example.

12


Wirtinger flow (Candes, Li, Soltanolkotabi ’14)

Empirical risk minimization

$$\underset{\boldsymbol{x}}{\text{minimize}}\quad f(\boldsymbol{x}) = \frac{1}{4m}\sum_{k=1}^{m}\Big[\big(\boldsymbol{a}_k^\top \boldsymbol{x}\big)^2 - y_k\Big]^2$$

• Initialization by spectral method

• Gradient iterations: for $t = 0, 1, \ldots$

$$\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \eta\,\nabla f(\boldsymbol{x}_t)$$

13
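The two-step procedure above can be sketched in a few lines of NumPy. This is an illustrative implementation of the rank-1 case, not the authors' reference code; the spectral-initialization scaling and the default step size and iteration count are common heuristic choices.

```python
import numpy as np

def wirtinger_flow(A, y, eta=0.1, n_iters=500):
    """Spectral initialization + vanilla gradient descent for y_k = (a_k^T x)^2.

    A : (m, n) design matrix with rows a_k;  y : (m,) measurements.
    Step size and iteration count are illustrative defaults.
    """
    m, n = A.shape
    # Spectral initialization: leading eigenvector of (1/m) sum_k y_k a_k a_k^T,
    # rescaled so that ||x_0|| ~ ||x^natural|| (using E[(a^T x)^2] = ||x||^2).
    Y = (A * y[:, None]).T @ A / m
    x = np.linalg.eigh(Y)[1][:, -1] * np.sqrt(np.mean(y))
    # Gradient iterations on f(x) = (1/4m) sum_k [(a_k^T x)^2 - y_k]^2.
    for _ in range(n_iters):
        Ax = A @ x
        x = x - eta * (A.T @ ((Ax ** 2 - y) * Ax)) / m
    return x
```

Run on the synthetic (A, y_pr) pair generated earlier, the output matches x_true up to a global sign, which is unidentifiable from quadratic measurements.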


Geometry of loss surface

[Figure: the loss surface of the phase retrieval objective, marking $\boldsymbol{x}^\natural$, the saddle point, and the spectral initialization.]

14

Gradient descent theory

f is said to be α-strongly convex and β-smooth if

$$\boldsymbol{0} \prec \alpha \boldsymbol{I} \preceq \nabla^2 f(\boldsymbol{x}) \preceq \beta \boldsymbol{I}, \qquad \forall \boldsymbol{x}$$

$\ell_2$ error contraction: GD with $\eta = 1/\beta$ obeys

$$\|\boldsymbol{x}_{t+1} - \boldsymbol{x}^\natural\|_2 \le \Big(1 - \frac{\alpha}{\beta}\Big)\, \|\boldsymbol{x}_t - \boldsymbol{x}^\natural\|_2$$

• Condition number $\beta/\alpha$ determines the rate of convergence

• Attains $\varepsilon$-accuracy within $O\big(\frac{\beta}{\alpha}\log\frac{1}{\varepsilon}\big)$ iterations

15


What does this optimization theory say about WF?

Gaussian designs: $\boldsymbol{a}_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_n)$, $\quad 1 \le k \le m$

Population level (infinite samples):

$$\mathbb{E}\big[\nabla^2 f(\boldsymbol{x})\big] = 3\big(\|\boldsymbol{x}\|_2^2\, \boldsymbol{I} + 2\boldsymbol{x}\boldsymbol{x}^\top\big) - \big(\|\boldsymbol{x}^\natural\|_2^2\, \boldsymbol{I} + 2\boldsymbol{x}^\natural \boldsymbol{x}^{\natural\top}\big)$$

which is locally positive definite and well-conditioned:

$$\boldsymbol{I}_n \preceq \mathbb{E}\big[\nabla^2 f(\boldsymbol{x})\big] \preceq 10\, \boldsymbol{I}_n$$

Consequence: WF converges within $O\big(\log\frac{1}{\varepsilon}\big)$ iterations if $m \to \infty$.

Finite-sample level ($m \asymp n\log n$): $\nabla^2 f(\boldsymbol{x})$ is ill-conditioned even locally (condition number $\asymp n$):

$$\tfrac{1}{2}\, \boldsymbol{I}_n \preceq \nabla^2 f(\boldsymbol{x}) \preceq O(n)\, \boldsymbol{I}_n$$

Consequence (Candes et al '14): WF attains $\varepsilon$-accuracy within $O\big(n\log\frac{1}{\varepsilon}\big)$ iterations with $\eta \asymp \frac{1}{n}$ if $m \asymp n\log n$.

Regularization helps, e.g. TWF (Chen and Candes '15), but is WF really that bad?

16


Numerical efficiency with $\eta_t = 0.1$

[Figure: relative estimation error vs. iteration count (0 to 500), shown on a semilog scale from $10^{-15}$ to $10^{0}$.]

Vanilla GD (WF) can proceed much more aggressively!

17

Numerical efficiency with $\eta_t = 0.1$

[Figure: the same convergence plot as on the previous slide.]

Generic optimization theory is too pessimistic!

18
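An experiment of this flavor can be reproduced with the sketches above; the snippet below reruns the gradient loop while recording the relative error up to the global sign (step size 0.1 as in the caption, everything else illustrative).

```python
import numpy as np

# Reuse A, y_pr, x_true from the earlier sketches (rank-1 case).
eta, n_iters = 0.1, 500
m, n = A.shape
xt = x_true.ravel()

# Spectral initialization, as in wirtinger_flow above.
Y = (A * y_pr[:, None]).T @ A / m
x = np.linalg.eigh(Y)[1][:, -1] * np.sqrt(np.mean(y_pr))

errors = []
for _ in range(n_iters):
    Ax = A @ x
    x = x - eta * (A.T @ ((Ax ** 2 - y_pr) * Ax)) / m
    # distance to the truth up to the unrecoverable global sign
    dist = min(np.linalg.norm(x - xt), np.linalg.norm(x + xt))
    errors.append(dist / np.linalg.norm(xt))
# Plotting `errors` on a semilog scale gives a curve of the kind shown above.
```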

A second look at gradient descent theory

Which region enjoys both strong convexity and smoothness?

$$\nabla^2 f(\boldsymbol{x}) = \frac{1}{m}\sum_{k=1}^{m}\Big[3\big(\boldsymbol{a}_k^\top \boldsymbol{x}\big)^2 - \big(\boldsymbol{a}_k^\top \boldsymbol{x}^\natural\big)^2\Big]\, \boldsymbol{a}_k \boldsymbol{a}_k^\top$$

• Not smooth if $\boldsymbol{x}$ and $\boldsymbol{a}_k$ are too close (coherent)

Within the region where

• $\boldsymbol{x}$ is not far away from $\boldsymbol{x}^\natural$

• $\boldsymbol{x}$ is incoherent w.r.t. the sampling vectors (incoherence region)

one has

$$\tfrac{1}{2}\cdot \boldsymbol{I}_n \preceq \nabla^2 f(\boldsymbol{x}) \preceq O(\log n)\cdot \boldsymbol{I}_n$$

19
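To make the dichotomy concrete, the following sketch (an illustration under the Gaussian model with arbitrary sizes and perturbation radius, not an experiment from the talk) compares the largest eigenvalue of the empirical Hessian at a generic perturbation of the truth with that at a perturbation tilted toward a single sampling vector.

```python
import numpy as np

def empirical_hessian(A, x, x_star):
    """(1/m) * sum_k [3 (a_k^T x)^2 - (a_k^T x_star)^2] a_k a_k^T."""
    m = A.shape[0]
    w = 3 * (A @ x) ** 2 - (A @ x_star) ** 2
    return (A * w[:, None]).T @ A / m

rng2 = np.random.default_rng(1)
n2 = 500
m2 = int(3 * n2 * np.log(n2))
x_star = rng2.standard_normal(n2)
x_star /= np.linalg.norm(x_star)
A2 = rng2.standard_normal((m2, n2))

delta = 0.5                                             # perturbation radius (arbitrary)
g = rng2.standard_normal(n2)
x_incoh = x_star + delta * g / np.linalg.norm(g)        # generic, incoherent direction
x_coh = x_star + delta * A2[0] / np.linalg.norm(A2[0])  # tilted toward sampling vector a_1

lam_incoh = np.linalg.eigvalsh(empirical_hessian(A2, x_incoh, x_star))[-1]
lam_coh = np.linalg.eigvalsh(empirical_hessian(A2, x_coh, x_star))[-1]
# Expect lam_coh to be noticeably larger than lam_incoh, with the gap growing in n,
# mirroring the O(log n) vs. O(n) smoothness bounds on the two regions.
```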


A second look at gradient descent theory

region of local strong convexity + smoothness

• Generic optimization theory only ensures that iterates remain in the $\ell_2$ ball, but not in the incoherence region

• Existing algorithms enforce regularization, or apply sample splitting, to promote incoherence

20


Our findings: GD is implicitly regularized

region of local strong convexity + smoothness

GD implicitly forces iterates to remain incoherent even without regularization

21


Theoretical guarantees

Theorem (Phase retrieval)

Under i.i.d. Gaussian design, WF achieves

• $\max_k \big|\boldsymbol{a}_k^\top(\boldsymbol{x}_t - \boldsymbol{x}^\natural)\big| \lesssim \sqrt{\log n}\, \|\boldsymbol{x}^\natural\|_2$ (incoherence)

• $\|\boldsymbol{x}_t - \boldsymbol{x}^\natural\|_2 \lesssim \big(1 - \frac{\eta}{2}\big)^t \|\boldsymbol{x}^\natural\|_2$ (near-linear convergence)

provided that the step size $\eta \asymp \frac{1}{\log n}$ and the sample size $m \gtrsim n\log n$.

Big computational saving: WF attains $\varepsilon$-accuracy within $O\big(\log n \log\frac{1}{\varepsilon}\big)$ iterations with $\eta \asymp 1/\log n$ if $m \asymp n\log n$.

22
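A quick empirical check of the incoherence claim, reusing the synthetic instance from the earlier sketches and the same gradient loop as the error-tracking sketch above (the step size 0.1 follows the numerical section; the theorem's prescription is η ≍ 1/log n):

```python
import numpy as np

# Reuse A, y_pr, x_true from the earlier sketches.
m, n = A.shape
eta = 0.1
xt = x_true.ravel()

Y = (A * y_pr[:, None]).T @ A / m
x = np.linalg.eigh(Y)[1][:, -1] * np.sqrt(np.mean(y_pr))

ratios = []
for _ in range(200):
    Ax = A @ x
    x = x - eta * (A.T @ ((Ax ** 2 - y_pr) * Ax)) / m
    x_al = x if np.linalg.norm(x - xt) < np.linalg.norm(x + xt) else -x
    # incoherence metric: max_k |a_k^T (x_t - x^natural)| / (sqrt(log n) * ||x^natural||)
    ratios.append(np.max(np.abs(A @ (x_al - xt))) / (np.sqrt(np.log(n)) * np.linalg.norm(xt)))
# The theorem predicts this ratio stays bounded by a constant along the whole trajectory.
```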


Theoretical guarantees for the low-rank case

Theorem (Quadratic sampling)

Under i.i.d. Gaussian design, GD achieves

• $\max_l \big\|\boldsymbol{a}_l^\top(\boldsymbol{X}_t\boldsymbol{Q}_t - \boldsymbol{X}^\natural)\big\|_2 \lesssim \sqrt{\log n}\; \frac{\sigma_r^2(\boldsymbol{X}^\natural)}{\|\boldsymbol{X}^\natural\|_{\mathrm{F}}}$ (incoherence)

• $\|\boldsymbol{X}_t\boldsymbol{Q}_t - \boldsymbol{X}^\natural\|_{\mathrm{F}} \lesssim \Big(1 - \frac{\sigma_r^2(\boldsymbol{X}^\natural)\,\eta}{2}\Big)^t \|\boldsymbol{X}^\natural\|_{\mathrm{F}}$ (linear convergence)

provided that $\eta \asymp \frac{1}{(\log n \vee r)^2\, \sigma_r^2(\boldsymbol{X}^\natural)}$ and $m \gtrsim n r^4 \log n$. Here $\boldsymbol{Q}_t$ denotes the rotation best aligning $\boldsymbol{X}_t$ with $\boldsymbol{X}^\natural$.

Big computational saving: GD attains $\varepsilon$-accuracy within $O\big((\log n \vee r)^2 \log\frac{1}{\varepsilon}\big)$ iterations if $m \asymp n r^4 \log n$.

Prior result (Sanghavi, Ward, White '15): $O\big(n^4 r^2 \log^4 n\, \log\frac{1}{\varepsilon}\big)$ iterations if $m \asymp n r^6 \log^2 n$.

23


Key ingredient: leave-one-out analysis

How to establish $\big|\boldsymbol{a}_l^\top(\boldsymbol{x}_t - \boldsymbol{x}^\natural)\big| \lesssim \sqrt{\log n}\,\|\boldsymbol{x}^\natural\|_2$?

Technical difficulty: $\boldsymbol{x}_t$ is statistically dependent on $\boldsymbol{a}_l$.

Leave-one-out trick: for each $1 \le l \le m$, introduce leave-one-out iterates $\boldsymbol{x}^{t,(l)}$ by dropping the $l$-th sample.

[Figure: the design matrix $\boldsymbol{A}$ and measurements $\boldsymbol{A}\boldsymbol{x}$ with the $l$-th row removed.]

24


Key ingredient: leave-one-out analysis

[Figure: the incoherence region w.r.t. $\boldsymbol{a}_1$, containing both the true iterate $\boldsymbol{x}_t$ and the leave-one-out iterate $\boldsymbol{x}^{t,(1)}$.]

• Leave-one-out iterates $\boldsymbol{x}^{t,(l)}$ are independent of $\boldsymbol{a}_l$, and are hence incoherent w.r.t. $\boldsymbol{a}_l$ with high probability.

• Leave-one-out iterates $\boldsymbol{x}^{t,(l)} \approx$ true iterates $\boldsymbol{x}_t$.

• Finish by the triangle inequality:

$$\big|\boldsymbol{a}_l^\top(\boldsymbol{x}_t - \boldsymbol{x}^\natural)\big| \le \big|\boldsymbol{a}_l^\top(\boldsymbol{x}^{t,(l)} - \boldsymbol{x}^\natural)\big| + \big|\boldsymbol{a}_l^\top(\boldsymbol{x}_t - \boldsymbol{x}^{t,(l)})\big|.$$

25
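The construction can be mimicked numerically. The sketch below (illustrative, reusing wirtinger_flow and the synthetic data from the earlier sketches) runs the same algorithm with and without the l-th sample and evaluates the two terms of the triangle inequality.

```python
import numpy as np

# Reuse A, y_pr, x_true and wirtinger_flow from the earlier sketches.
m, n = A.shape
l = 0                                    # index of the left-out sample (illustrative)
mask = np.arange(m) != l
xt = x_true.ravel()

x_full = wirtinger_flow(A, y_pr)                 # iterate computed from all m samples
x_loo = wirtinger_flow(A[mask], y_pr[mask])      # leave-one-out iterate: never sees a_l

# Resolve the global sign ambiguity before comparing.
x_full = x_full if x_full @ xt > 0 else -x_full
x_loo = x_loo if x_loo @ x_full > 0 else -x_loo

# x_loo is independent of a_l, hence incoherent w.r.t. a_l with high probability,
# and it stays very close to x_full; summing the two terms below bounds
# |a_l^T (x_full - x^natural)| via the triangle inequality.
term_incoherent = np.abs(A[l] @ (x_loo - xt))
term_proximity = np.abs(A[l] @ (x_full - x_loo))
```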


Incoherence region in high dimensions

[Figure: the incoherence region in a 2-dimensional picture vs. in high dimensions, where the incoherence region is vanishingly small.]

26

This recipe is quite general

Low-rank matrix completion

[Figure: a partially observed matrix with revealed entries marked X and missing entries marked ?.]

Fig. credit: Candes

Given partial samples of a low-rank matrix $\boldsymbol{M}$ in an index set $\Omega$, fill in missing entries.

Applications: recommendation systems, ...

28

Incoherence

$$\underbrace{\begin{bmatrix} 1 & 0 & \cdots & 0\\ 0 & 0 & \cdots & 0\\ \vdots & \vdots & & \vdots\\ 0 & 0 & \cdots & 0 \end{bmatrix}}_{\text{hard: } \mu = n} \qquad \text{vs.} \qquad \underbrace{\begin{bmatrix} 1 & 1 & \cdots & 1\\ 1 & 1 & \cdots & 1\\ \vdots & \vdots & & \vdots\\ 1 & 1 & \cdots & 1 \end{bmatrix}}_{\text{easy: } \mu = 1}$$

Definition (Incoherence for matrix completion)

A rank-$r$ matrix $\boldsymbol{M}^\natural$ with eigendecomposition $\boldsymbol{M}^\natural = \boldsymbol{U}^\natural \boldsymbol{\Sigma}^\natural \boldsymbol{U}^{\natural\top}$ is said to be $\mu$-incoherent if

$$\big\|\boldsymbol{U}^\natural\big\|_{2,\infty} \le \sqrt{\frac{\mu}{n}}\, \big\|\boldsymbol{U}^\natural\big\|_{\mathrm{F}} = \sqrt{\frac{\mu r}{n}}.$$

29
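For concreteness, here is a small helper (illustrative, not from the talk) that computes the incoherence parameter μ from this definition and evaluates it on the two extreme examples above.

```python
import numpy as np

def incoherence_mu(M, r):
    """mu such that ||U||_{2,inf} = sqrt(mu * r / n) for the top-r eigenspace U of M."""
    n = M.shape[0]
    eigvals, eigvecs = np.linalg.eigh(M)
    U = eigvecs[:, np.argsort(-np.abs(eigvals))[:r]]   # top-r eigenvectors
    row_norm_max = np.max(np.linalg.norm(U, axis=1))   # ||U||_{2, infinity}
    return n * row_norm_max ** 2 / r

n, r = 100, 2
rng = np.random.default_rng(2)

# Dense random low-rank matrix: small mu (at most logarithmic in n).
B = rng.standard_normal((n, r))
print(incoherence_mu(B @ B.T, r))

# Spiky matrix e_1 e_1^T: mu = n (the hard case above).
e1 = np.zeros(n); e1[0] = 1.0
print(incoherence_mu(np.outer(e1, e1), 1))
```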

Prior theory

$$\underset{\boldsymbol{X}\in\mathbb{R}^{n\times r}}{\text{minimize}}\quad f(\boldsymbol{X}) = \sum_{(j,k)\in\Omega}\Big(\boldsymbol{e}_j^\top \boldsymbol{X}\boldsymbol{X}^\top \boldsymbol{e}_k - M_{j,k}\Big)^2$$

Existing theory promotes incoherence explicitly:

• regularized loss (solve $\min_{\boldsymbol{X}} f(\boldsymbol{X}) + R(\boldsymbol{X})$ instead)
  e.g. Keshavan, Montanari, Oh '10; Sun, Luo '14; Ge, Lee, Ma '16

• projection onto the set of incoherent matrices
  e.g. Chen, Wainwright '15; Zheng, Lafferty '16

Our theory provides guarantees on vanilla / unregularized gradient descent.

30
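For reference, a minimal sketch of the unregularized loss and its gradient descent loop (an illustration, not the authors' implementation; the analyses cited here pair this loop with a spectral initialization, while the sketch uses a small random one, and the step size and iteration count are arbitrary):

```python
import numpy as np

def vanilla_gd_mc(M_obs, mask, r, eta=1e-3, n_iters=500, seed=0):
    """Unregularized GD on f(X) = sum_{(j,k) in Omega} ((X X^T)_{jk} - M_{jk})^2.

    M_obs : (n, n) array with observed entries filled in (anything elsewhere);
    mask  : boolean (n, n) indicator of the observed set Omega.
    Step size, iteration count, and the small random initialization are
    illustrative; the cited analyses pair this loop with a spectral initialization.
    """
    n = M_obs.shape[0]
    rng = np.random.default_rng(seed)
    X = 0.1 * rng.standard_normal((n, r))
    for _ in range(n_iters):
        residual = mask * (X @ X.T - M_obs)        # zero outside Omega
        grad = 2 * (residual + residual.T) @ X     # gradient of the observed-entry loss
        X = X - eta * grad
    return X
```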


Conclusions

optimization theory + statistical model: vanilla gradient descent is "implicitly regularized" and runs fast!

Computational: near dimension-free iteration complexity

Statistical: near-optimal sample complexity

• works for a few other problems such as low-rank matrix completion and blind deconvolution;

• "leave-one-out" arguments are useful for decoupling weak dependency to allow finer characterization of GD trajectories.

31

References

1. Implicit Regularization for Nonconvex Statistical Estimation, C. Ma, K. Wang, Y. Chi and Y. Chen, arXiv:1711.10467.

2. Nonconvex Matrix Factorization from Rank-One Measurements, Y. Li, C. Ma, Y. Chen, and Y. Chi, arXiv:1802.06286.

3. Gradient Descent with Random Initialization: Fast Global Convergence for Nonconvex Phase Retrieval, Y. Chen, Y. Chi, J. Fan and C. Ma, arXiv:1803.07726.

Thank you!

32
