

Copyright

by

Kai Zhong

2018


The Dissertation Committee for Kai Zhong certifies that this is the approved version of the following dissertation:

Provable Non-convex Optimization

for Learning Parametric Models

Committee:

Inderjit S. Dhillon, Supervisor

Prateek Jain, Co-Supervisor

Chandrajit Bajaj

George Biros

Rachel Ward


Provable Non-convex Optimization

for Learning Parametric Models

by

Kai Zhong,

DISSERTATION

Presented to the Faculty of the Graduate School of

The University of Texas at Austin

in Partial Fulfillment

of the Requirements

for the Degree of

DOCTOR OF PHILOSOPHY

THE UNIVERSITY OF TEXAS AT AUSTIN

August 2018


Acknowledgments

This thesis would not have been possible without the help of many people. First, I would

like to thank my PhD supervisors Inderjit Dhillon and Prateek Jain. They have

given me continuous support in many ways. I have benefited greatly from their

creative ideas, their insightful discussions, their encouraging comments and their

passion for academic research throughout my PhD study. This thesis is a direct result

of their guidance.

I would also like to thank Chandrajit Bajaj, George Biros, and Rachel Ward

for serving on my dissertation committee and for their constructive advice on my dis-

sertation.

I also greatly appreciate the collaborations with my co-authors. An incom-

plete list is Ian En-Hsu Yen, Zhao Song, Pradeep Ravikumar, Bowei Yan, Xiangru

Huang, Qi Lei, Cho-Jui Hsieh, Arnaud Vandaele, Meng Li, Jimmy Lin, Sanjiv

Kumar, Ruiqi Guo, David Simcha, Ashish Kapoor, and Peter Bartlett. I would especially like to thank Ian for collaborating with me on many projects. I will remember the numerous afternoons we spent discussing research, which helped me a lot during my early years of PhD study.

I would like to thank my other lab-mates for fruitful discussions and count-

less help in my research and life: Si Si, Hsiang-Fu Yu, Nagarajan Natarajan, Kai-

Yang Chiang, David Inouye, Jiong Zhang, Nikhil Rao, Donghyuk Shin and Joyce


Whang.

I am also very lucky to have my best friends, Bangguo Xiong, Lingyuan

Gao, Kecheng Xu and Xiaoqing Xu, who brought happiness and passion to my life.

Finally, I would like to thank my parents for their constant support and love.


Provable Non-convex Optimization

for Learning Parametric Models

Publication No.

Kai Zhong, Ph.D.

The University of Texas at Austin, 2018

Supervisors: Inderjit S. Dhillon and Prateek Jain

Non-convex optimization plays an important role in recent advances of ma-

chine learning. A large number of machine learning tasks are performed by solving

a non-convex optimization problem, which is generally NP-hard. Heuristics, such

as stochastic gradient descent, are employed to solve non-convex problems and

work decently well in practice despite the lack of general theoretical guarantees. In

this thesis, we study a series of non-convex optimization strategies and prove that

they lead to the global optimal solution for several machine learning problems, in-

cluding mixed linear regression, one-hidden-layer (convolutional) neural networks,

non-linear inductive matrix completion, and low-rank matrix sensing. At a high

level, we show that the non-convex objectives formulated in the above problems

have a large basin of attraction around the global optima when the data has benign

statistical properties. Therefore, local search heuristics, such as gradient descent or


alternating minimization, are guaranteed to converge to the global optima if initial-

ized properly. Furthermore, we show that spectral methods can efficiently initialize

the parameters such that they fall into the basin of attraction. Experiments on syn-

thetic datasets and real applications are carried out to justify our theoretical analyses

and illustrate the superiority of our proposed methods.


Table of Contents

Acknowledgments

Abstract

List of Tables

List of Figures

Chapter 1. Introduction

Chapter 2. Preliminaries
    2.1 Notations
    2.2 Definitions
    2.3 Linear Algebra
    2.4 Concentration Bounds
    2.5 Convex Analyses
    2.6 Tensor Decomposition

Chapter 3. Mixed Linear Regression
    3.1 Introduction to Mixed Linear Regression
    3.2 Problem Formulation
    3.3 Local Strong Convexity
    3.4 Initialization via Tensor method
    3.5 Recovery Guarantee
    3.6 Numerical Experiments
    3.7 Related Work

Chapter 4. One-hidden-layer Fully-connected Neural Networks
    4.1 Introduction to One-hidden-layer Neural Networks
    4.2 Problem Formulation
    4.3 Local Strong Convexity
    4.4 Initialization via Tensor Methods
    4.5 Recovery Guarantee
    4.6 Numerical Experiments
    4.7 Related Work

Chapter 5. One-hidden-layer Convolutional Neural Networks
    5.1 Introduction to One-hidden-layer Convolutional Neural Networks
    5.2 Problem Formulation
    5.3 Local Strong Convexity
    5.4 Initialization via Tensor Method
    5.5 Recovery Guarantee
    5.6 Numerical Experiments
    5.7 Related Work

Chapter 6. Non-linear Inductive Matrix Completion
    6.1 Introduction to Inductive Matrix Completion
    6.2 Problem Formulation
    6.3 Local Strong Convexity
    6.4 Initialization and Recovery Guarantee
    6.5 Experiments on Synthetic and Real-world Data
    6.6 Related Work

Chapter 7. Low-rank Matrix Sensing
    7.1 Introduction to Low-rank Matrix Sensing
    7.2 Problem Formulation – Two Settings
    7.3 Rank-one Matrix Sensing via Alternating Minimization
    7.4 Numerical Experiments
    7.5 Related Work

Chapter 8. Discussion
    8.1 Over-specified/Over-parameterized Neural Networks
        8.1.1 Learning a ReLU using Over-specified Neural Networks
        8.1.2 Numerical Experiments with Multiple Hidden Units
    8.2 Initialization Methods
    8.3 Stochastic Gradient Descent (SGD) and Other Fast Algorithms

Appendices

Appendix A. Mixed Linear Regression
    A.1 Proofs of Local Convergence
        A.1.1 Some Lemmata
        A.1.2 Proof of Theorem 3.3.1
        A.1.3 Proof of Theorem 3.3.2
        A.1.4 Proof of the Lemmata
            A.1.4.1 Proof of Lemma A.1.1
            A.1.4.2 Proof of Lemma A.1.2
            A.1.4.3 Proof of Lemma A.1.3
            A.1.4.4 Proof of Lemma A.1.4
            A.1.4.5 Proof of Lemma A.1.5
        A.1.5 Proof of Lemma A.1.6
    A.2 Proofs of Tensor Method for Initialization
        A.2.1 Some Lemmata
        A.2.2 Proof of Theorem 3.4.2
        A.2.3 Proof of Theorem 3.5.1
        A.2.4 Proofs of Some Lemmata
            A.2.4.1 Proof of Lemma A.2.1
            A.2.4.2 Proof of Lemma A.2.2
            A.2.4.3 Proof of Lemma A.2.3
            A.2.4.4 Proof of Lemma A.2.4
            A.2.4.5 Proof of Lemma A.2.5

Appendix B. One-hidden-layer Fully-connected Neural Networks
    B.1 Matrix Bernstein Inequality for Unbounded Case
    B.2 Properties of Activation Functions
    B.3 Local Positive Definiteness of Hessian
        B.3.1 Main Results for Positive Definiteness of Hessian
        B.3.2 Positive Definiteness of Population Hessian at the Ground Truth
        B.3.3 Error Bound of Hessians near the Ground Truth for Smooth Activations
        B.3.4 Error Bound of Hessians near the Ground Truth for Non-smooth Activations
        B.3.5 Positive Definiteness for a Small Region
    B.4 Tensor Methods
        B.4.1 Tensor Initialization Algorithm
        B.4.2 Main Result for Tensor Methods
        B.4.3 Error Bound for the Subspace Spanned by the Weight Matrix
        B.4.4 Error Bound for the Reduced Third-order Moment
        B.4.5 Error Bound for the Magnitude and Sign of the Weight Vectors

Appendix C. One-hidden-layer Convolutional Neural Networks
    C.1 Proof Overview
        C.1.1 Orthogonal weight matrices for the population case
        C.1.2 Non-orthogonal weight matrices for the population case
    C.2 Properties of Activation Functions
    C.3 Positive Definiteness of Hessian near the Ground Truth
        C.3.1 Bounding the eigenvalues of Hessian
        C.3.2 Error bound of Hessians near the ground truth for smooth activations
        C.3.3 Error bound of Hessians near the ground truth for non-smooth activations
        C.3.4 Proofs for Main results

Appendix D. Non-linear Inductive Matrix Completion
    D.1 Proof Overview
        D.1.1 Positive definiteness of the population Hessian
        D.1.2 Warm up: orthogonal case
        D.1.3 Error bound for the empirical Hessian near the ground truth
    D.2 Positive Definiteness of Population Hessian
        D.2.1 Orthogonal case
        D.2.2 Non-orthogonal Case
    D.3 Positive Definiteness of the Empirical Hessian
        D.3.1 Local Linear Convergence

Appendix E. Low-rank Matrix Sensing
    E.1 Proof of Theorem 7.3.1
    E.2 Proofs for Matrix Sensing using Rank-one Independent Gaussian Measurements
        E.2.1 Proof of Claim 7.3.1
        E.2.2 Proof of Theorem 7.3.2
    E.3 Proof of Matrix Sensing using Rank-one Dependent Gaussian Measurements
        E.3.1 Proof of Lemma 7.3.1
        E.3.2 Proof of Theorem 7.3.3

Appendix F. Over-specified Neural Network
    F.1 Preliminaries
    F.2 Two Learning Phases
        F.2.1 Phase I – Learning the Directions
        F.2.2 Phase II – Learning the Magnitudes
    F.3 Proof of Main Theorems
        F.3.1 Population Case
        F.3.2 Finite-Sample Case
    F.4 Proofs of Phase I
        F.4.1 Proof of Phase I-a
        F.4.2 Proofs of Phase I-b
    F.5 Proofs for Phase II
        F.5.1 Proof for Lemma F.2.3

Bibliography

Vita


List of Tables

6.1  The error rate in semi-supervised clustering using NIMC and IMC.

6.2  Test RMSE for recommending new users with movies on the Movielens dataset.

7.1  Comparison of sample complexity and computational complexity for different approaches and different measurements.

8.1  The average objective function value when gradient descent gets stuck at a non-global local minimum over 100 random trials. Note that the average function value here does not take globally converged function values into account.

B.1  ρ(σ) values for different activation functions. Note that we can calculate the exact values for ReLU, Leaky ReLU, squared ReLU and erf. We cannot find a closed-form value for sigmoid or tanh, but we calculate the numerical values of ρ(σ) for σ = 0.1, 1, 10. Here ρ_erf(σ) = min{(4σ² + 1)^{−1/2} − (2σ² + 1)^{−1}, (4σ² + 1)^{−3/2} − (2σ² + 1)^{−3}, (2σ² + 1)^{−2}}.


List of Figures

3.1  Empirical Performance of MLR.

3.2  Comparison with EM in terms of time and iterations. Our method with random initialization is significantly better than EM with random initialization. Performance of the two methods is comparable when initialized with the tensor method.

4.1  Numerical Experiments.

5.1  (a) (left) Minimal eigenvalue of the Hessian at the ground truth for different activations against the sample size; (b) (right) convergence of gradient descent with different random initializations.

6.1  The rate of success of GD over synthetic data. Left: sigmoid; right: ReLU. White blocks denote 100% success rate.

6.2  NIMC vs. IMC on the gene-disease association prediction task.

7.1  Comparison of computational complexity and measurement complexity for different approaches and different operators.

7.2  Recovery rate for different matrix dimensions d (x-axis) and different numbers of measurements m (y-axis). The color reflects the recovery rate scaled from 0 to 1. White indicates perfect recovery, while black denotes failure in all the experiments.


Chapter 1

Introduction

The goal of most machine learning tasks is to find a statistical model that

fits the data well. One popular procedure to find a good model is as follows: 1)

construct a model class that is the best guess of the underlying data-generating

model; 2) form a proper optimization problem from the model using the existing

training data; 3) solve the optimization problem and obtain the solution. All three

steps are critical in order to obtain a good model.

Choosing a suitable model class is the first key step for the success of ma-

chine learning tasks. There are mainly two types of models: parametric models,

which have a finite number of parameters, and non-parametric models, which, by

contrast, have an infinite number of parameters. Learning non-parametric mod-

els may not scale well for large datasets. For example, naive kernel methods can

scale quadratically with the size of the training data. On the other hand, simple

parametric models, such as linear models, are easy to train but may have limited

model capacity which cannot capture the underlying model. Therefore, recently

large-capacity parametric models, such as neural networks, have become increas-

ingly attractive. To fit a particular dataset, it is also important to use a well-designed

and structured model class. For example, convolutional neural networks have a spe-


cially constructed neural network architecture for computer vision tasks. Low-rank

models impose low-rank structure for the matrix variables. This thesis focuses on

parametric models that have particular structures. The models we consider include

mixed linear regression models, one-hidden-layer (convolutional) neural networks,

non-linear inductive matrix completion, and low-rank matrix sensing models.

The second step is to form an optimization problem. An optimization prob-

lem is typically formulated via maximum (log-)likelihood estimation, which amounts to min-

imizing an objective function subject to certain constraints. To find a good model in

a model class, we can form either convex optimization problems or non-convex

optimization problems. For example, low-rank models can either directly form a

non-convex problem, where the low-rank constraint is enforced by representing the

matrix variable as a product of two thin matrices, X = UV^⊤, or form a convex opti-

mization problem where the low-rank constraint is relaxed to a nuclear-norm constraint,

‖X‖_∗ ≤ C. For convex problems, there exist polynomial-time algorithms that are

guaranteed to converge to the global optima. Non-convex problems are in general

NP-hard to solve, therefore, there are no general polynomial algorithms that can

solve non-convex problems. Popular heuristics such as gradient descent might get

stuck at some bad local minima. Furthermore, complex models such as deep neural

networks are more naturally formulated as non-convex optimization problems

due to the models’ high nonlinearity.

The third step now comes to the choice of algorithms that solve the opti-

mization problem. When we consider the optimization process, we need to con-

sider its efficiency. For the low-rank model example, its corresponding non-convex


problem can often be solved more efficiently than its convex counterpart, whose

nuclear-norm constraint is expensive to handle unless implemented very carefully. Al-

though there are no general theoretical guarantees, simple optimization heuristics

can still solve non-convex problems to a sufficiently good solution. For example,

SGD is widely applied to solve highly non-convex objectives of neural networks

and alternating minimization is often employed to solve non-convex problems in-

volving low-rank models. Both heuristics work quite well in practice. Considering

the efficiency yet the lack of theoretical guarantees of non-convex approaches, in

this thesis, we focus on providing theoretical guarantees for non-convex optimiza-

tion approaches.

Since non-convex problems are generally NP-hard, we need additional as-

sumptions to develop theoretical guarantees for these problems. In this thesis, we

assume data comes from benign distributions. In particular, a) we assume the input

data of the model follows a Gaussian distribution; b) we consider realizable set-

tings, i.e. there exists a ground truth parametric model that maps the input to the

target without noise. As a result, we provide recovery guarantees for learning many

non-convex problems, including mixed linear regression, one-hidden-layer (con-

volutional) neural networks, non-linear inductive matrix completion, and low-rank

matrix sensing.

Specifically, our main theoretical discoveries are two fold. 1) There is a

large basin of attraction around the global optima of the non-convex objective func-

tions of the above mentioned problems. 2) The parameters can be efficiently ini-

tialized into the basin of attraction by spectral methods. At a high level, we apply


the following optimization strategy. First spectral methods are used to initialize the

parameters, and then local search heuristics, such as gradient descent or alternat-

ing minimization, are used to achieve the global optima which recovers the ground

truth parameters in the realizable setting. Moreover, we conduct experiments on

synthetic data to justify our theoretical analyses and provide some experimental

results on real-world datasets to demonstrate the superior performance of our pro-

posed models.

Thesis Overview

The roadmap of this thesis is as follows. We first introduce some prelim-

inaries for this thesis, including notations, definitions and other preliminaries in

Chapter 2. Then, we present our main results for five non-convex problems: mixed

linear regression models, one-hidden-layer (convolutional) neural networks, non-

linear inductive matrix completion, and low-rank matrix sensing models in Chap-

ters 3–7, respectively. In each of them, we introduce the specific problem, formulate

it and present the theoretical results for its local landscape and the initialization

methods. Finally, in Chapter 8, we discuss several open problems.


Chapter 2

Preliminaries

In this chapter, we introduce some notations, definitions and preliminaries

required in this thesis.

2.1 Notations

For any positive integer n, we use [n] to denote the set {1, 2, · · · , n}. For a random variable X, let E[X] denote the expectation of X (if this quantity exists). For an integer k, we use D_k to denote N(0, I_k), the standard Gaussian distribution with dimension k. We use 1_f to denote the indicator function, which is 1 if f holds and 0 otherwise. We define (z)_+ := max{0, z}. For any vector x ∈ R^n, we use ‖x‖ to denote its ℓ_2 norm.

For any function f, we define Õ(f) to be f · log^{O(1)}(f). For two functions f, g, we use the shorthand f ≲ g (resp. ≳) to indicate that f ≤ Cg (resp. ≥) for an absolute constant C. We use f ≍ g to mean cf ≤ g ≤ Cf for constants c, C. We use poly(f) to denote f^{O(1)}.

We provide several definitions related to a matrix A. Let det(A) denote the determinant of a square matrix A. Let A^⊤ denote the transpose of A. Let A^† denote the Moore-Penrose pseudoinverse of A. Let A^{−1} denote the inverse of a full-rank square matrix. Let ‖A‖_F denote the Frobenius norm of matrix A. Let ‖A‖ denote the spectral norm of matrix A. Let σ_i(A) denote the i-th largest singular value of A. We often use a capital letter to denote the matrix obtained by stacking the corresponding lower-case vectors, e.g., W = [w_1 w_2 · · · w_k] ∈ R^{d×k}, where w_i ∈ R^d. For two same-size matrices A, B ∈ R^{d_1×d_2}, we use A ◦ B ∈ R^{d_1×d_2} to denote the element-wise multiplication of these two matrices.

Now we provide several definitions related to a tensor T. We use ⊗ to denote the outer product and · to denote the dot product. Given two column vectors u, v ∈ R^n, u ⊗ v ∈ R^{n×n} with (u ⊗ v)_{i,j} = u_i · v_j, and u^⊤v = ∑_{i=1}^{n} u_i v_i ∈ R. Given three column vectors u, v, w ∈ R^n, u ⊗ v ⊗ w ∈ R^{n×n×n} is a third-order tensor with (u ⊗ v ⊗ w)_{i,j,k} = u_i · v_j · w_k. We use u^{⊗r} ∈ R^{n^r} to denote the outer product of the vector u with itself r − 1 times.

We say a tensor T ∈ R^{n×n×n} is symmetric if and only if for any i, j, k, T_{i,j,k} = T_{i,k,j} = T_{j,i,k} = T_{j,k,i} = T_{k,i,j} = T_{k,j,i}. Given a third-order tensor T ∈ R^{n_1×n_2×n_3} and three matrices A ∈ R^{n_1×d_1}, B ∈ R^{n_2×d_2}, C ∈ R^{n_3×d_3}, we use T(A, B, C) to denote the d_1 × d_2 × d_3 tensor whose (i, j, k)-th entry is

∑_{i'=1}^{n_1} ∑_{j'=1}^{n_2} ∑_{k'=1}^{n_3} T_{i',j',k'} A_{i',i} B_{j',j} C_{k',k}.

We use ‖T‖ or ‖T‖_op to denote the operator norm of the tensor T, i.e.,

‖T‖_op = ‖T‖ = max_{‖a‖=1} |T(a, a, a)|.

For a tensor T ∈ R^{n_1×n_2×n_3}, we use the matrix T_{(1)} ∈ R^{n_1×n_2 n_3} to denote the flattening of T along the first dimension, i.e., [T_{(1)}]_{i,(j−1)n_3+k} = T_{i,j,k} for all i ∈ [n_1], j ∈ [n_2], k ∈ [n_3]. The matrices T_{(2)} ∈ R^{n_2×n_3 n_1} and T_{(3)} ∈ R^{n_3×n_1 n_2} are defined similarly.

2.2 Definitions

First we define some concepts about convexity and smoothness.

Definition 2.2.1 (m-Strongly Convex Function). A differentiable function f is called m-strongly convex with parameter m > 0 if the following inequality holds for all points x, y in its domain:

⟨∇f(x) − ∇f(y), x − y⟩ ≥ m‖x − y‖².

If further f is twice continuously differentiable, then f is strongly convex with parameter m if and only if ∇²f(x) ⪰ mI for all x in the domain.

Definition 2.2.2 ((A, δ, m)-Locally Strongly Convex Function). A function f is called (A, δ, m)-locally strongly convex with a convex subset A of the domain and parameters δ ∈ [0, 1] and m > 0 if, given any point x ∈ A, with probability at least 1 − δ, ∇²f(x) exists and the following inequality holds:

∇²f(x) ⪰ mI.

Definition 2.2.3 (L-Lipschitz Function). A differentiable function f is called L-Lipschitz with parameter L > 0 if the following inequality holds for all points x in its domain:

‖∇f(x)‖ ≤ L.

Definition 2.2.4 (M-Smooth Function). A differentiable function f is called M-smooth with parameter M > 0 if the following inequality holds for all points x, y in its domain:

‖∇f(x) − ∇f(y)‖ ≤ M‖x − y‖.

If further f is twice continuously differentiable, then f is M-smooth with parameter M if and only if ∇²f(x) ⪯ MI for all x in the domain.

Definition 2.2.5 ((A, δ, M)-Locally Smooth Function). A function f is called (A, δ, M)-locally smooth with a convex subset A of the domain and parameters δ ∈ [0, 1] and M > 0 if, given any point x ∈ A, with probability at least 1 − δ, ∇²f(x) exists and the following inequality holds:

∇²f(x) ⪯ MI.

2.3 Linear Algebra

This section includes some facts about linear algebra that will be used in

this thesis.

Fact 2.3.1. Given a full column-rank matrix W = [w_1, w_2, · · · , w_k] ∈ R^{d×k}, let W̄ = [w_1/‖w_1‖, w_2/‖w_2‖, · · · , w_k/‖w_k‖]. Then we have: (I) for any i ∈ [k], σ_k(W) ≤ ‖w_i‖ ≤ σ_1(W); (II) 1/κ(W) ≤ σ_k(W̄) ≤ σ_1(W̄) ≤ √k.

Proof. Part (I). We have

σ_k(W) ≤ ‖W e_i‖ = ‖w_i‖ ≤ σ_1(W).

Part (II). We first show how to lower bound σ_k(W̄):

σ_k(W̄) = min_{‖s‖=1} ‖W̄ s‖
 = min_{‖s‖=1} ‖ ∑_{i=1}^{k} (s_i/‖w_i‖) w_i ‖          (by the definition of W̄)
 ≥ min_{‖s‖=1} σ_k(W) ( ∑_{i=1}^{k} (s_i/‖w_i‖)² )^{1/2}          (by ‖w_i‖ ≥ σ_k(W))
 ≥ min_{‖s‖=1} σ_k(W) ( ∑_{i=1}^{k} (s_i / max_{j∈[k]}‖w_j‖)² )^{1/2}          (by max_{j∈[k]}‖w_j‖ ≥ ‖w_i‖)
 = σ_k(W) / max_{j∈[k]}‖w_j‖          (by ‖s‖ = 1)
 ≥ σ_k(W)/σ_1(W)          (by max_{j∈[k]}‖w_j‖ ≤ σ_1(W))
 = 1/κ(W).

It remains to upper bound σ_1(W̄):

σ_1(W̄) ≤ ( ∑_{i=1}^{k} σ_i²(W̄) )^{1/2} = ‖W̄‖_F ≤ √k.

Fact 2.3.2. Let U ∈ R^{d×k} and V ∈ R^{d×k} (k ≤ d) denote two orthogonal matrices. Then ‖UU^⊤ − VV^⊤‖ = ‖(I − UU^⊤)V‖ = ‖(I − VV^⊤)U‖ = √(1 − σ_k²(U^⊤V)).

Proof. Let U_⊥ ∈ R^{d×(d−k)} and V_⊥ ∈ R^{d×(d−k)} be the orthogonal complementary matrices of U, V ∈ R^{d×k} respectively. Then

‖UU^⊤ − VV^⊤‖ = ‖(I − VV^⊤)UU^⊤ − VV^⊤(I − UU^⊤)‖
 = ‖V_⊥V_⊥^⊤ UU^⊤ − VV^⊤ U_⊥U_⊥^⊤‖
 = ‖ [V_⊥  V] [ V_⊥^⊤U   0 ; 0   −V^⊤U_⊥ ] [ U^⊤ ; U_⊥^⊤ ] ‖
 = max(‖V_⊥^⊤ U‖, ‖V^⊤ U_⊥‖).

We show how to simplify ‖V_⊥^⊤ U‖:

‖V_⊥^⊤ U‖ = ‖(I − VV^⊤)U‖ = √(‖U^⊤(I − VV^⊤)U‖) = max_{‖a‖=1} √(1 − ‖V^⊤Ua‖²) = √(1 − σ_k²(V^⊤U)).

Similarly we can simplify ‖U_⊥^⊤ V‖:

‖U_⊥^⊤ V‖ = √(1 − σ_k²(U^⊤V)) = √(1 − σ_k²(V^⊤U)).

Fact 2.3.3. Let C ∈ R^{d_1×d_2} and B ∈ R^{d_2×d_3} be two matrices. Then ‖CB‖_F ≤ ‖C‖‖B‖_F and ‖CB‖_F ≥ σ_min(C)‖B‖_F.

Proof. For each i ∈ [d_3], let b_i denote the i-th column of B. We can upper bound ‖CB‖_F:

‖CB‖_F = ( ∑_{i=1}^{d_3} ‖Cb_i‖² )^{1/2} ≤ ( ∑_{i=1}^{d_3} ‖C‖²‖b_i‖² )^{1/2} = ‖C‖‖B‖_F.

We show how to lower bound ‖CB‖_F:

‖CB‖_F = ( ∑_{i=1}^{d_3} ‖Cb_i‖² )^{1/2} ≥ ( ∑_{i=1}^{d_3} σ_min²(C)‖b_i‖² )^{1/2} = σ_min(C)‖B‖_F.
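The following sketch (assuming Python with NumPy; a numerical sanity check, not part of the thesis) verifies Facts 2.3.2 and 2.3.3 on random matrices.

import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 3

# Fact 2.3.2: two matrices with orthonormal columns.
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
V, _ = np.linalg.qr(rng.standard_normal((d, k)))
lhs = np.linalg.norm(U @ U.T - V @ V.T, 2)
sigma_k = np.linalg.svd(U.T @ V, compute_uv=False)[-1]
print(np.isclose(lhs, np.sqrt(1 - sigma_k ** 2)))        # True

# Fact 2.3.3: sigma_min(C) * ||B||_F <= ||C B||_F <= ||C|| * ||B||_F.
C = rng.standard_normal((5, 4))
B = rng.standard_normal((4, 6))
cb = np.linalg.norm(C @ B, 'fro')
svals = np.linalg.svd(C, compute_uv=False)
print(svals[-1] * np.linalg.norm(B, 'fro') <= cb <= svals[0] * np.linalg.norm(B, 'fro'))   # True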


2.4 Concentration Bounds

This thesis assumes the data come from benign distributions. Specifically, we assume the input data follow a Gaussian distribution. When a random vector is Gaussian, several concentration properties hold. We present them in this section; they will be used as lemmata or tools for the analyses in this thesis.

Lemma 2.4.1 (Proposition 1.1 in [68]). Let x ∼ N(0, I_d) and let Σ ∈ R^{d×d} be a fixed positive semi-definite matrix. For all t > 0, with probability 1 − e^{−t}, we have

x^⊤Σx ≤ tr(Σ) + 2√(tr(Σ²) t) + 2‖Σ‖t.

By taking t = 2 log(d) + log(n) for some n ≥ d, we have

Corollary 2.4.1. Let x ∼ N(0, I_d) and let Σ ∈ R^{d×d} be a fixed positive semi-definite matrix. Then, with probability 1 − 1/(nd²),

x^⊤Σx ≤ tr(Σ) + 2√(tr(Σ²)(2 log(d) + log(n))) + 2‖Σ‖(2 log(d) + log(n)) ≤ 12 tr(Σ) log(n).

Corollary 2.4.2. Let z denote a fixed d-dimensional vector. Then for any C ≥ 1 and n ≥ 1, we have

Pr_{x∼N(0,I_d)}[ |⟨x, z⟩|² ≤ 5C‖z‖² log n ] ≥ 1 − 1/(nd^C).

Proof. This follows from Proposition 1.1 in [68].

Corollary 2.4.3. For any C ≥ 1 and n ≥ 1, we have

Pr_{x∼N(0,I_d)}[ ‖x‖² ≤ 5Cd log n ] ≥ 1 − 1/(nd^C).

Proof. This follows from Proposition 1.1 in [68].

Setting Σ = ββ^⊤ in Corollary 2.4.1, we have

Corollary 2.4.4. If x ∼ N(0, I_d), then for any fixed β ∈ R^d, with probability 1 − 1/(nd²), we have

(β^⊤x)² ≤ 12‖β‖² log n.

Setting Σ = I and t = 2 log(d) + log(n) for some n ≥ d in Lemma 2.4.1, we have

Corollary 2.4.5. If x ∼ N(0, I_d), then with probability 1 − 1/(nd²), we have

‖x‖² ≤ d + 2√(3d log n) + 6 log(n) ≤ 4d log n.
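As a quick Monte Carlo check of Corollary 2.4.5 (a sketch assuming Python with NumPy; not part of the thesis), one can estimate the failure probability Pr[‖x‖² > 4d log n] from samples and compare it with the bound 1/(nd²).

import numpy as np

rng = np.random.default_rng(2)
d, n = 50, 100           # the corollary requires n >= d
trials = 200_000

x = rng.standard_normal((trials, d))
sq_norms = np.sum(x ** 2, axis=1)
emp_fail = np.mean(sq_norms > 4 * d * np.log(n))

print("empirical failure rate:", emp_fail)      # essentially 0 here
print("bound 1/(n d^2):       ", 1.0 / (n * d ** 2))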

Lemma 2.4.2. Let a, b, c ≥ 0 denote three constants, let u, v, w ∈ R^d denote three vectors, and let D_d denote the Gaussian distribution N(0, I_d). Then

E_{x∼D_d}[ |u^⊤x|^a |v^⊤x|^b |w^⊤x|^c ] ≍ ‖u‖^a ‖v‖^b ‖w‖^c.

Proof. We have

E_{x∼D_d}[ |u^⊤x|^a |v^⊤x|^b |w^⊤x|^c ]
 ≤ ( E_{x∼D_d}[ |u^⊤x|^{2a} ] )^{1/2} · ( E_{x∼D_d}[ |v^⊤x|^{4b} ] )^{1/4} · ( E_{x∼D_d}[ |w^⊤x|^{4c} ] )^{1/4}
 ≲ ‖u‖^a ‖v‖^b ‖w‖^c,

where the first step follows from Hölder's inequality, i.e., E[|XYZ|] ≤ (E[|X|²])^{1/2} · (E[|Y|⁴])^{1/4} · (E[|Z|⁴])^{1/4}, and the second step follows by calculating the expectations, since a, b, c are constants.

Since all three factors |u^⊤x|, |v^⊤x|, |w^⊤x| are nonnegative and depend on a common random vector x, we can also show a lower bound,

E_{x∼D_d}[ |u^⊤x|^a |v^⊤x|^b |w^⊤x|^c ] ≳ ‖u‖^a ‖v‖^b ‖w‖^c.

The matrix Bernstein inequality is a key tool that will be used throughout this thesis. We state it below.

Lemma 2.4.3 (Matrix Bernstein Inequality for the bounded case, Theorem 6.1 in [127]). Consider a finite sequence {X_k} of independent, random, self-adjoint matrices with dimension d. Assume that E[X_k] = 0 and λ_max(X_k) ≤ R almost surely. Let σ² := ‖∑_k E[X_k²]‖ denote the norm of the total variance. Then the following chain of inequalities holds for all t ≥ 0:

Pr[ λ_max(∑_k X_k) ≥ t ] ≤ d · exp( −(σ²/R²) · h(Rt/σ²) )
 ≤ d · exp( −(t²/2) / (σ² + Rt/3) )
 ≤ d · exp(−3t²/(8σ²))   for t ≤ σ²/R;
 ≤ d · exp(−3t/(8R))     for t ≥ σ²/R.

The function h(u) := (1 + u) log(1 + u) − u for u ≥ 0.

2.5 Convex Analyses

When an objective function is strongly convex and smooth, gradient descent with a properly chosen step size converges linearly to the global optimum. This is stated formally below; our local convergence guarantees for non-convex problems build on it.

Theorem 2.5.1 (Convergence of Gradient Descent on Strongly Convex and Smooth Functions). Let f : R^d → R be an m-strongly convex and M-smooth function, and let x* = argmin_x f(x) be its minimizer. Let x^{(t+1)} = x^{(t)} − η∇f(x^{(t)}) be the gradient descent update with step size η. Then if η = 1/M,

‖x^{(t+1)} − x*‖² ≤ (1 − m/M) ‖x^{(t)} − x*‖².

Proof. We have

‖x^{(t+1)} − x*‖² = ‖x^{(t)} − η∇f(x^{(t)}) − x*‖²
 = ‖x^{(t)} − x*‖² − 2η⟨∇f(x^{(t)}), x^{(t)} − x*⟩ + η²‖∇f(x^{(t)})‖².

We can rewrite the gradient as

∇f(x^{(t)}) = ( ∫_0^1 ∇²f(x* + γ(x^{(t)} − x*)) dγ ) (x^{(t)} − x*).

Define the matrix H := ∫_0^1 ∇²f(x* + γ(x^{(t)} − x*)) dγ, so that ∇f(x^{(t)}) = H(x^{(t)} − x*). According to the smoothness and strong convexity of f,

mI ⪯ H ⪯ MI.    (2.1)

We can upper bound ‖∇f(x^{(t)})‖²:

‖∇f(x^{(t)})‖² = ⟨H(x^{(t)} − x*), H(x^{(t)} − x*)⟩ ≤ M⟨x^{(t)} − x*, H(x^{(t)} − x*)⟩.

Therefore,

‖x^{(t+1)} − x*‖² ≤ ‖x^{(t)} − x*‖² − (2η − η²M)⟨x^{(t)} − x*, H(x^{(t)} − x*)⟩
 ≤ ‖x^{(t)} − x*‖² − (2η − η²M) m ‖x^{(t)} − x*‖²
 = ‖x^{(t)} − x*‖² − (m/M)‖x^{(t)} − x*‖²
 = (1 − m/M)‖x^{(t)} − x*‖²,

where the third step holds by setting η = 1/M.
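To illustrate Theorem 2.5.1, the sketch below (assuming Python with NumPy; not from the thesis) runs gradient descent with η = 1/M on a strongly convex quadratic f(x) = ½x^⊤Ax − b^⊤x and checks that the squared distance to the minimizer contracts by at least the factor 1 − m/M per iteration.

import numpy as np

rng = np.random.default_rng(3)
d = 10
# Build a quadratic with prescribed strong convexity m and smoothness M.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigs = np.linspace(1.0, 20.0, d)          # m = 1, M = 20
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(d)
m, M = eigs[0], eigs[-1]

x_star = np.linalg.solve(A, b)            # minimizer of f(x) = 0.5 x'Ax - b'x

def grad(x):
    return A @ x - b

x = rng.standard_normal(d)
eta = 1.0 / M
for t in range(50):
    d_old = np.linalg.norm(x - x_star) ** 2
    x = x - eta * grad(x)
    d_new = np.linalg.norm(x - x_star) ** 2
    assert d_new <= (1 - m / M) * d_old + 1e-12   # contraction from Theorem 2.5.1
print("final error:", np.linalg.norm(x - x_star))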

2.6 Tensor Decomposition

We will use tensor decomposition for initialization for most problems dis-

cussed in this thesis. This section introduces some preliminaries for tensors and

tensor decomposition guarantees.

Lemma 2.6.1 (Some properties of third-order tensors). If T ∈ R^{d×d×d} is a supersymmetric tensor, i.e., T_{ijk} is the same for any permutation of the indices, then its operator norm is defined as

‖T‖_op := sup_{‖a‖=1} |T(a, a, a)|,

and we have the following properties of ‖T‖_op.

Property 1. ‖T‖_op = sup_{‖a‖=‖b‖=‖c‖=1} |T(a, b, c)|.

Property 2. ‖T‖_op ≤ ‖T_{(1)}‖ ≤ √K ‖T‖_op.

Property 3. If T is a rank-one tensor, then ‖T_{(1)}‖ = ‖T‖_op.

Property 4. For any matrix W ∈ R^{d×d'}, ‖T(W, W, W)‖_op ≤ ‖T‖_op ‖W‖³.

Proof. Property 1. See the proof of Lemma 21 in [67].

Property 2.

‖T_{(1)}‖ = max_{‖a‖=1} ‖T(a, I, I)‖_F ≤ √K max_{‖a‖=1} ‖T(a, I, I)‖ = √K max_{‖a‖=‖b‖=1} |T(a, b, b)| = √K ‖T‖_op.

Obviously, max_{‖a‖=1} ‖T(a, I, I)‖_F ≥ ‖T‖_op.

Property 3. Let T = v ⊗ v ⊗ v. Then

‖T_{(1)}‖ = max_{‖a‖=1} ‖T(a, I, I)‖_F = max_{‖a‖=1} |v^⊤a| ‖v‖² = ‖v‖³ = max_{‖a‖=1} |(v^⊤a)³| = ‖T‖_op.

Property 4. There exists a u ∈ R^{d'} with ‖u‖ = 1 such that

‖T(W, W, W)‖_op = |T(Wu, Wu, Wu)| ≤ ‖T‖_op ‖Wu‖³ ≤ ‖T‖_op ‖W‖³.

Theorem 2.6.1 (Non-orthogonal tensor decomposition, Theorem 3 in [78]). Let T = ∑_{i=1}^{k} π_i u_i^{⊗3} ∈ R^{d×d×d}, where the u_i's are unit vectors. Let T̂ = T + εR, where ε > 0 and R ∈ R^{d×d×d} with ‖R‖_op = 1. Let w_1, · · · , w_L be i.i.d. random Gaussian vectors, w_l ∼ N(0, I_d) for all l ∈ [L], and let the matrices M_l ∈ R^{d×d} be constructed via projection of T̂ along w_1, · · · , w_L, i.e., M_l = T̂(I, I, w_l). Assume incoherence µ on (u_i): u_i^⊤u_j ≤ µ. Let L ≥ (50/(1 − µ²)) log²(15d(k − 1)/δ). Let {û_i}_{i∈[L]} be the set of output vectors of Algorithm 1 in [78]. Then, with probability at least 1 − δ, for every u_i there exists a û_i such that

‖û_i − u_i‖ ≤ O( (√(‖π‖_1 π_max) / π_min²) · (1/((1 − µ²) σ_k(U))) · (1 + C(δ)) ) ε + o(ε),

where U = [u_1 u_2 · · · u_k] ∈ R^{d×k}, π_min = min_i π_i, π_max = max_i π_i, and C(δ) := O( log(kd/δ) / √(dL) ).
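The sketch below (assuming Python with NumPy; a toy illustration rather than Algorithm 1 of [78]) forms a low-rank symmetric tensor T = ∑_i π_i u_i^{⊗3} and checks that a random projection M = T(I, I, w) equals ∑_i π_i (u_i^⊤w) u_i u_i^⊤, which is the matrix whose eigenstructure the non-orthogonal decomposition method exploits.

import numpy as np

rng = np.random.default_rng(4)
d, k = 6, 3
pi = rng.uniform(0.5, 1.5, size=k)
U = rng.standard_normal((d, k))
U /= np.linalg.norm(U, axis=0)            # unit-norm components u_i

# T = sum_i pi_i * u_i (x) u_i (x) u_i
T = np.einsum('i,ai,bi,ci->abc', pi, U, U, U)

w = rng.standard_normal(d)                # random Gaussian projection direction
M = np.einsum('abc,c->ab', T, w)          # M = T(I, I, w)

# Closed form: sum_i pi_i (u_i' w) u_i u_i'
M_expected = (U * (pi * (U.T @ w))) @ U.T
print(np.allclose(M, M_expected))         # True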


Chapter 3

Mixed Linear Regression¹

In this chapter, we study the mixed linear regression (MLR) problem, where

the goal is to recover multiple underlying linear models from their unlabeled linear

measurements. We propose a non-convex objective function which we show is

locally strongly convex in the neighborhood of the ground truth. We use a tensor

method for initialization so that the initial models are in the local strong convexity

region. We then employ general convex optimization algorithms to minimize the

objective function. To the best of our knowledge, our approach provides the first exact

recovery guarantees for the MLR problem with K ≥ 2 components. Moreover,

our method has near-optimal computational complexity O(Nd) as well as near-

optimal sample complexity O(d) for constant K. Furthermore, our empirical results

indicate that even with random initialization, our approach converges to the global

optima in linear time, providing speed-up of up to two orders of magnitude.

¹The content of this chapter is published as "Mixed linear regression with multiple components," Kai Zhong, Prateek Jain, and Inderjit S. Dhillon, in Advances in Neural Information Processing Systems, 2016. The dissertator's contribution includes deriving the detailed theoretical analysis, conducting the numerical experiments and writing most parts of the paper.


3.1 Introduction to Mixed Linear Regression

The mixed linear regression (MLR) [26, 28, 137] models each observation

as being generated from one of the K unknown linear models; the identity of the

generating model for each data point is also unknown. MLR is a popular technique

for capturing non-linear measurements while still keeping the models simple and

computationally efficient. Several widely-used variants of linear regression, such

as piecewise linear regression [43, 130] and locally linear regression [27], can be

viewed as special cases of MLR. MLR has also been applied in time-series analysis

[25], trajectory clustering [45], health care analysis [37] and phase retrieval [13].

See [129] for more applications.

In general, MLR is NP-hard [137] with the hardness arising due to lack of

information about the model labels (model from which a point is generated) as well

as the model parameters. However, under certain statistical assumptions, several

recent works have provided poly-time algorithms for solving MLR [4, 13, 28, 137].

But most of the existing recovery guarantees are restricted either to mixtures with

K = 2 components [13, 28, 137] or require poly(1/ε) samples/time to achieve ε-

approximate solution [26, 108] (analysis of [137] for two components can obtain ε

approximate solution in log(1/ε) samples). Hence, solving the MLR problem with

K ≥ 2 mixtures while using near-optimal number of samples and computational

time is still an open question.

In this section, we resolve the above question under standard statistical as-

sumptions for a constant number of mixture components K. To this end, we propose the


following smooth objective function as a surrogate to solve MLR:

f(w_1, w_2, · · · , w_K) := ∑_{i=1}^{N} Π_{k=1}^{K} (y_i − x_i^⊤ w_k)²,    (3.1)

where {(x_i, y_i) ∈ R^{d+1}}_{i=1,2,··· ,N} are the data points and {w_k}_{k=1,2,··· ,K} are the

model parameters. The intuition for this objective is that the objective value is zero

when wkk=1,2,··· ,K is the global optima and y’s do not contain any noise. Fur-

thermore, the objective function is smooth and hence less prone to getting stuck in

arbitrary saddle points or oscillating between two points. The standard EM [137]

algorithm instead makes a “sharp” selection of mixture component and hence the

algorithm is more likely to oscillate or get stuck. This intuition is reflected in Fig-

ure 3.1 (d) which shows that with random initialization, EM algorithm routinely

gets stuck at poor solutions, while our proposed method based on the above objec-

tive still converges to the global optima.

Unfortunately, the above objective function is non-convex and is in general

prone to poor saddle points, local minima. However under certain standard as-

sumptions, we show that the objective is locally strongly convex (Theorem 3.3.1)

in a small basin of attraction near the optimal solution. Moreover, the objective

function is smooth. Hence, we can use gradient descent method to achieve linear

rate of convergence to the global optima. But, we will need to initialize the op-

timization algorithm with an iterate which lies in a small ball around the optima.

To this end, we modify the tensor method in [4, 26] to obtain a “good” initializa-

tion point. Typically, tensor methods require computation of third and higher order

moments which leads to significantly worse sample complexity in terms of data


dimensionality d. However, for the special case of MLR, we provide a small modi-

fication of the standard tensor method that achieves nearly optimal sample and time

complexity bounds for constant K (see Theorem 3.4.2). More concretely, our ap-

proach requires O(d(K log d)^K) samples and O(Nd) computational

time; note the exponential dependence on K. Also for constant K, the method has

nearly optimal sample and time complexity.

Although EM with power method [13] shares the same computational com-

plexity as ours, there is no convergence guarantee for EM to the best of our knowl-

edge. In contrast, we provide local convergence guarantee for our method. That

is, if N = O(rKK) and if data satisfies certain standard assumptions, then starting

from an initial point Ukk=1,··· ,K that lies in a small ball of constant radius around

the globally optimal solution, our method converges super-linearly to the globally

optimal solution. Unfortunately, our existing analyses do not provide global con-

vergence guarantee and we leave it as a topic for future work. Interestingly, our

empirical results indicate that even with randomly initialized {U_k}_{k=1,··· ,K}, our

method is able to recover the true subspace exactly using nearly O(rK) samples.

We summarize our contributions below. We propose a non-convex contin-

uous objective function for solving the mixed linear regression problem. To the

best of our knowledge, our algorithm is the first work that can handle K ≥ 2 com-

ponents with global convergence guarantee in the noiseless case (Theorem 3.5.1).

Our algorithm has near-optimal linear (in d) sample complexity and near-optimal

computational complexity; however, our sample complexity dependence on K is

exponential.


3.2 Problem Formulation

In this section, we assume the dataset {(x_i, y_i) ∈ R^{d+1}}_{i=1,2,··· ,N} is generated by

z_i ∼ multinomial(p),   x_i ∼ N(0, I_d),   y_i = x_i^⊤ w*_{z_i},    (3.2)

where p is the vector of proportions of the different components, satisfying p^⊤1 = 1, and {w*_k ∈ R^d}_{k=1,2,··· ,K} are the ground-truth parameters. The goal is to recover {w*_k}_{k=1,2,··· ,K} from the dataset. Our analysis is for the noiseless case, but we also illustrate the empirical performance of our algorithm in the noisy case, where y_i = x_i^⊤ w*_{z_i} + e_i for some noise e_i (see Figure 3.1).
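A small sketch of the generative model (3.2) (assuming Python with NumPy; sample_mlr is a hypothetical helper, not the thesis's code):

import numpy as np

def sample_mlr(N, d, W_star, p, noise_std=0.0, rng=None):
    """Draw {(x_i, y_i)} from model (3.2); noise_std > 0 gives the noisy variant."""
    rng = rng or np.random.default_rng()
    K = W_star.shape[0]
    z = rng.choice(K, size=N, p=p)                 # z_i ~ multinomial(p)
    X = rng.standard_normal((N, d))                # x_i ~ N(0, I_d)
    y = np.einsum('nd,nd->n', X, W_star[z])        # y_i = x_i^T w*_{z_i}
    y += noise_std * rng.standard_normal(N)        # optional noise e_i
    return X, y, z

rng = np.random.default_rng(6)
K, d = 3, 10
W_star = rng.standard_normal((K, d))
X, y, z = sample_mlr(5000, d, W_star, p=[1/3, 1/3, 1/3], rng=rng)
print(X.shape, y.shape, np.bincount(z))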

Notation for this section. We use [N] to denote the set {1, 2, · · · , N} and S_k ⊂ [N] to denote the index set of the samples that come from the k-th component. Define p_min := min_{k∈[K]} p_k and p_max := max_{k∈[K]} p_k. Define ∆w_j := w_j − w*_j and ∆w*_{kj} := w*_k − w*_j. Define ∆_min := min_{j≠k} ‖∆w*_{jk}‖ and ∆_max := max_{j≠k} ‖∆w*_{jk}‖. We assume ∆_min/∆_max is independent of the dimension d. Define w := [w_1; w_2; · · · ; w_K] ∈ R^{Kd}. We denote by w^{(t)} the parameters at the t-th iteration and by w^{(0)} the initial parameters. For simplicity, we assume there are p_k N samples from the k-th model in any random subset of N samples. We assume K is a constant in general; however, if a quantity depends on K^K, we present this dependence explicitly in the big-O notation. For simplicity, we keep only the higher-order terms in K and ignore lower-order terms, e.g., O((2K)^{2K}) may be replaced by O(K^K).


3.3 Local Strong Convexity

In this section, we analyze the Hessian of objective (3.1).

Theorem 3.3.1 (Local Strong Convexity). Let {x_i, y_i}_{i=1,2,··· ,N} be sampled from the MLR model (3.2). Let {w_k}_{k=1,2,··· ,K} be independent of the samples and lie in the neighborhood of the optimal solution, i.e.,

‖∆w_k‖ := ‖w_k − w*_k‖ ≤ c_m ∆_min,  ∀k ∈ [K],    (3.3)

where c_m = O(p_min (3K)^{−K} (∆_min/∆_max)^{2K−2}), ∆_min = min_{j≠k} ‖w*_j − w*_k‖ and ∆_max = max_{j≠k} ‖w*_j − w*_k‖. Let P ≥ 1 be a constant. Then if N ≥ O((PK)^K d log^{K+2}(d)), w.p. 1 − O(Kd^{−P}), we have

(1/8) p_min N ∆_min^{2K−2} I ⪯ ∇²f(w + δw) ⪯ 10 N (3K)^K ∆_max^{2K−2} I,    (3.4)

for any δw := [δw_1; δw_2; · · · ; δw_K] satisfying ‖δw_k‖ ≤ c_f ∆_min, where c_f = O(p_min (3K)^{−K} d^{−K+1} (∆_min/∆_max)^{2K−2}).

The above theorem shows that the Hessians in a small neighborhood around a fixed {w_k}_{k=1,2,··· ,K}, which is close enough to the optimum, are positive definite (PD). The conditions on {w_k}_{k=1,··· ,K} and {δw_k}_{k=1,··· ,K} are different: {w_k}_{k=1,··· ,K} are required to be independent of the samples and to lie in a ball of radius c_m ∆_min centered at the optimal solution, whereas {δw_k}_{k=1,2,··· ,K} may depend on the samples but are required to lie in a smaller ball of radius c_f ∆_min. The conditions are natural: if ∆_min is very small, then distinguishing between w*_k and w*_{k'} is not possible, and hence the Hessian will not be PD with respect to both components.


To prove the theorem, we decompose the Hessian of Eq. (3.1) into multiple blocks, (∇²f)_{jl} = ∂²f/(∂w_j ∂w_l) ∈ R^{d×d}. When w_k → w*_k for all k ∈ [K], the diagonal blocks of the Hessian become strictly positive definite, while the off-diagonal blocks become close to zero. The blocks are approximated by the samples using the matrix Bernstein inequality. The detailed proof can be found in Appendix A.1.2.
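The sketch below (assuming Python with NumPy; a finite-difference illustration, not the proof technique of Appendix A.1.2) evaluates the Hessian of objective (3.1) at the ground truth for a small instance and checks that it is positive definite, with the off-diagonal d×d blocks nearly zero compared with the diagonal blocks.

import numpy as np

rng = np.random.default_rng(7)
d, K, N = 5, 2, 4000
W_star = rng.standard_normal((K, d))
z = rng.integers(K, size=N)
X = rng.standard_normal((N, d))
y = np.einsum('nd,nd->n', X, W_star[z])

def f(w_flat):
    W = w_flat.reshape(K, d)
    res = y[:, None] - X @ W.T
    return np.sum(np.prod(res ** 2, axis=1))

def numerical_hessian(func, w, eps=1e-4):
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = eps, eps
            H[i, j] = (func(w + e_i + e_j) - func(w + e_i - e_j)
                       - func(w - e_i + e_j) + func(w - e_i - e_j)) / (4 * eps ** 2)
    return H

H = numerical_hessian(f, W_star.flatten())
print("min eigenvalue:", np.linalg.eigvalsh(H).min())          # > 0
print("off-diagonal block norm:", np.linalg.norm(H[:d, d:]))   # small compared with H[:d, :d]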

Traditional analyses of optimization methods on strongly convex functions,

such as gradient descent, require the Hessian to be PD at all parameter values. The-

orem 3.3.1 implies that when wk = w∗k for all k = 1, 2, · · · , K, a small basin of

attraction around the optimum is strongly convex as formally stated in the following

corollary.

Corollary 3.3.1 (Local Strong Convexity for all the Parameters). Let {x_i, y_i}_{i=1,2,··· ,N} be sampled from the MLR model (3.2). Let {w_k}_{k=1,2,··· ,K} lie in the neighborhood of the optimal solution, i.e.,

‖w_k − w*_k‖ ≤ c_f ∆_min,  ∀k ∈ [K],    (3.5)

where c_f = O(p_min (3K)^{−K} d^{−K+1} (∆_min/∆_max)^{2K−2}). Then, for any constant P ≥ 1, if N ≥ O((PK)^K d log^{K+2}(d)), w.p. 1 − O(Kd^{−P}), the objective function f(w_1, w_2, · · · , w_K) in Eq. (3.1) is strongly convex. In particular, w.p. 1 − O(Kd^{−P}), for all w satisfying Eq. (3.5),

(1/8) p_min N ∆_min^{2K−2} I ⪯ ∇²f(w) ⪯ 10 N (3K)^K ∆_max^{2K−2} I.    (3.6)

The strong convexity of Corollary 3.3.1 only holds in the basin of attraction

near the optimum, which has a diameter on the order of O(d^{−K+1}); this is too small


to be achieved by our initialization method (in Sec. 3.4) using O(d) samples. Next,

we show, by a simple construction, that linear convergence of gradient descent (GD)

with resampling is still guaranteed when the solution is initialized in a much larger

neighborhood.

Theorem 3.3.2 (Convergence of Gradient Descent). Let {x_i, y_i}_{i=1,2,··· ,N} be sampled from the MLR model (3.2). Let {w_k}_{k=1,2,··· ,K} be independent of the samples and lie in the neighborhood of the optimal solution defined in Eq. (3.3). One iteration of gradient descent can be written as w⁺ = w − η∇f(w), where η = 1/(10N(3K)^K ∆_max^{2K−2}). Then, if N ≥ O(K^K d log^{K+2}(d)), w.p. 1 − O(Kd^{−2}),

‖w⁺ − w*‖² ≤ ( 1 − p_min ∆_min^{2K−2} / (80 (3K)^K ∆_max^{2K−2}) ) ‖w − w*‖².    (3.7)

Remark. The linear convergence Eq. (3.7) requires the resampling of the

data points for each iteration. In Sec. 3.5, we combine Corollary 3.3.1, which

doesn’t require resampling when the iterate is sufficiently close to the optimum,

to show that there exists an algorithm using a finite number of samples to achieve

any solution precision.

To prove Theorem 3.3.2, we prove the PD properties on a line between a

current iterate and the optimum by constructing a set of anchor points and then ap-

ply traditional analysis for the linear convergence of gradient descent. The detailed

proof can be found in Appendix A.1.3.


3.4 Initialization via Tensor method

In this section, we propose a tensor method to initialize the parameters.

We define the second-order moment M_2 := E[y²(x ⊗ x − I)] and the third-order moment

M_3 := E[y³ x ⊗ x ⊗ x] − ∑_{j∈[d]} E[y³ (e_j ⊗ x ⊗ e_j + e_j ⊗ e_j ⊗ x + x ⊗ e_j ⊗ e_j)].

According to Lemma 6 in [108], M_2 = ∑_{k∈[K]} 2p_k w*_k ⊗ w*_k and M_3 = ∑_{k∈[K]} 6p_k w*_k ⊗ w*_k ⊗ w*_k. Therefore, by computing the eigendecomposition of the estimated moments, we are able to recover the parameters to any precision provided enough samples. Theorem 8 of [108] needs O(d³) sample complexity to obtain the model parameters to a certain precision. Such high sample complexity comes from the tensor concentration bound. However, we find that the tensor eigendecomposition problem in MLR can be reduced to the R^{K×K×K} space, so that the sample complexity and computational complexity are O(poly(K)). Our method is similar to the whitening process in [26, 67]. However, [26] needs O(d⁶) sample complexity due to the nuclear-norm minimization problem, while ours requires only O(d). For this sample complexity, we need to assume the following.

Assumption 3.4.1. The following quantities, σ_K(M_2), ‖M_2‖, ‖M_3‖_op^{2/3}, ∑_{k∈[K]} p_k ‖w*_k‖², and (∑_{k∈[K]} p_k ‖w*_k‖³)^{2/3}, have the same order in d, i.e., the ratios between any two of them are independent of d.

The above assumption holds when {w*_k}_{k=1,2,··· ,K} are orthonormal to each other.
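The second-order moment can be estimated directly from samples; the sketch below (assuming Python with NumPy, with data generated as in (3.2); not the thesis's code) checks that the empirical M̂_2 = (1/N) ∑_i y_i²(x_i x_i^⊤ − I) approaches ∑_k 2p_k w*_k (w*_k)^⊤.

import numpy as np

rng = np.random.default_rng(8)
d, K, N = 10, 3, 200_000
p = np.array([0.5, 0.3, 0.2])
W_star = rng.standard_normal((K, d))

z = rng.choice(K, size=N, p=p)
X = rng.standard_normal((N, d))
y = np.einsum('nd,nd->n', X, W_star[z])

# Empirical second-order moment M2_hat = (1/N) sum_i y_i^2 (x_i x_i^T - I)
M2_hat = (X.T * (y ** 2)) @ X / N - np.mean(y ** 2) * np.eye(d)

# Population value M2 = sum_k 2 p_k w*_k w*_k^T  (Lemma 6 in [108])
M2 = 2 * (W_star.T * p) @ W_star

print(np.linalg.norm(M2_hat - M2) / np.linalg.norm(M2))   # small, shrinks with N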


We formally present the tensor method in Algorithm 3.4.1 and its theoretical

guarantee in Theorem 3.4.2.

Theorem 3.4.2. Under Assumption 3.4.1, if |Ω| ≥ O(d log²(d) + log⁴(d)), then w.p. 1 − O(d^{−2}), Algorithm 3.4.1 outputs {w^{(0)}_k}_{k=1}^{K} satisfying

‖w^{(0)}_k − w*_k‖ ≤ c_m ∆_min,  ∀k ∈ [K],

which falls in the locally PD region, Eq. (3.3), of Theorem 3.3.1.

Algorithm 3.4.1 Initialization for MLR via Tensor Method

Input: {x_i, y_i}_{i∈Ω}. Output: {w^{(0)}_k}_{k=1}^{K}.

1: Partition the dataset Ω into Ω = Ω_{M_2} ∪ Ω_2 ∪ Ω_3 with |Ω_{M_2}| = O(d log²(d)), |Ω_2| = O(d log²(d)) and |Ω_3| = O(log⁴(d)).
2: Compute the approximate top-K eigenvectors, Y ∈ R^{d×K}, of the second-order moment M̂_2 := (1/|Ω_{M_2}|) ∑_{i∈Ω_{M_2}} y_i²(x_i ⊗ x_i − I), by the power method.
3: Compute R_2 = (1/(2|Ω_2|)) ∑_{i∈Ω_2} y_i²(Y^⊤x_i ⊗ Y^⊤x_i − I).
4: Compute the whitening matrix W̃ ∈ R^{K×K} of R_2, i.e., W̃ = U_2 Λ_2^{−1/2} U_2^⊤, where R_2 = U_2 Λ_2 U_2^⊤ is the eigendecomposition of R_2.
5: Compute R_3 = (1/(6|Ω_3|)) ∑_{i∈Ω_3} y_i³ (r_i ⊗ r_i ⊗ r_i − ∑_{j∈[K]} e_j ⊗ r_i ⊗ e_j − ∑_{j∈[K]} e_j ⊗ e_j ⊗ r_i − ∑_{j∈[K]} r_i ⊗ e_j ⊗ e_j), where r_i = Y^⊤x_i for all i ∈ Ω_3.
6: Compute the eigenvalues {a_k}_{k=1}^{K} and the eigenvectors {v_k}_{k=1}^{K} of the whitened tensor R_3(W̃, W̃, W̃) ∈ R^{K×K×K} using the robust tensor power method [4].
7: Return the estimates of the models, w^{(0)}_k = Y (W̃^⊤)^† (a_k v_k).

The proof can be found in Appendix A.2.2. Forming M̂_2 explicitly would cost O(Nd²) time, which is expensive when d is large. Instead, we can compute each step of the power method without explicitly forming M̂_2. In particular, we alternately compute Y^{(t+1)} = ∑_{i∈Ω_{M_2}} y_i² (x_i (x_i^⊤ Y^{(t)}) − Y^{(t)}) and set Y^{(t+1)} = QR(Y^{(t+1)}). Each power method iteration then takes only O(KNd) time. Furthermore, the number of iterations needed is a constant, since the power method has a linear convergence rate and we do not need a very accurate solution; for the proof of this claim, we refer to the proof of Lemma A.2.3 in Appendix A.2. Next, we compute R_2 in O(KNd) time and W̃ in O(K³) time. Computing R_3 takes O(KNd + K³N) time. The robust tensor power method takes O(poly(K) polylog(d)) time. In summary, the computational complexity of the initialization is O(KdN + K³N + poly(K) polylog(d)) = O(dN).
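A sketch of the implicit power iteration described above (assuming Python with NumPy; variable names are illustrative, not the thesis's implementation): each iteration applies M̂_2 using only matrix–vector products with the data, followed by QR re-orthonormalization.

import numpy as np

def top_k_eigvecs_implicit(X, y, K, iters=30, rng=None):
    """Approximate top-K eigenvectors of M2_hat = sum_i y_i^2 (x_i x_i^T - I)
    without forming the d x d matrix (O(K N d) work per iteration)."""
    rng = rng or np.random.default_rng()
    N, d = X.shape
    Y, _ = np.linalg.qr(rng.standard_normal((d, K)))
    w = y ** 2                                           # per-sample weights y_i^2
    for _ in range(iters):
        Z = X.T @ (w[:, None] * (X @ Y)) - np.sum(w) * Y   # M2_hat @ Y
        Y, _ = np.linalg.qr(Z)                           # re-orthonormalize
    return Y

# Tiny usage example (data generated as in (3.2)):
rng = np.random.default_rng(9)
d, K, N = 50, 3, 20_000
W_star = rng.standard_normal((K, d))
z = rng.integers(K, size=N)
X = rng.standard_normal((N, d))
y = np.einsum('nd,nd->n', X, W_star[z])

Y = top_k_eigvecs_implicit(X, y, K, rng=rng)
# The span of Y should approximate span{w*_1, ..., w*_K}.
proj = Y @ (Y.T @ W_star.T)
print(np.linalg.norm(proj - W_star.T) / np.linalg.norm(W_star))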

3.5 Recovery Guarantee

We are now ready to present the complete algorithm, Algorithm 3.5.1, which has a global convergence guarantee. We use f_Ω(w) to denote the objective function of Eq. (3.1) restricted to a subset Ω of the dataset, i.e., f_Ω(w) = ∑_{i∈Ω} Π_{k=1}^{K} (y_i − x_i^⊤ w_k)².

Theorem 3.5.1 (Global Convergence Guarantee). Let {x_i, y_i}_{i=1,2,··· ,N} be sampled from the MLR model (3.2) with N ≥ O(d (K log(d))^{2K+3}). Let the step size η be smaller than a positive constant. Then, for any precision ε > 0, after T = O(log(d/ε)) iterations, w.p. 1 − O(Kd^{−2} log(d)), the output of Algorithm 3.5.1 satisfies

‖w^{(T)} − w*‖ ≤ ε ∆_min.

The detailed proof is in Appendix A.2.3. The computational complexity

required by our algorithm is near-optimal: (a) tensor method (Algorithm 3.4.1) is


carefully employed such that only O(dN) computation is needed; (b) gradient de-

scent with resampling is conducted in log(d) iterations to push the iterate to the next

phase; (c) gradient descent without resampling is finally executed to achieve any

precision with log(1/ε) iterations. Therefore the total computational complexity is

O(dN log(d/ε)). As shown in the theorem, our algorithm can achieve any precision

ε > 0 without any sample complexity dependency on ε. This follows from Corol-

lary 3.3.1 that shows local strong convexity of objective (3.1) with a fixed set of

samples. By contrast, tensor methods [26, 108] require O(1/ε²) samples and the EM

algorithm requires O(log(1/ε)) samples [13, 137].

Algorithm 3.5.1 Gradient Descent for MLR

Input: {x_i, y_i}_{i=1,2,··· ,N}, step size η. Output: w.

1: Partition the dataset into {Ω^{(t)}}_{t=0,1,··· ,T_0+1}.
2: Initialize w^{(0)} by Algorithm 3.4.1 using Ω^{(0)}.
3: for t = 1, 2, · · · , T_0 do
4:     w^{(t)} = w^{(t−1)} − η ∇f_{Ω^{(t)}}(w^{(t−1)})
5: for t = T_0 + 1, T_0 + 2, · · · , T_0 + T_1 do
6:     w^{(t)} = w^{(t−1)} − η ∇f_{Ω^{(T_0+1)}}(w^{(t−1)})
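A compact illustration of the gradient descent updates used in Algorithm 3.5.1 (a sketch assuming Python with NumPy; the resampling across iterations and the tensor initialization are omitted, and the step size is a hand-picked small constant rather than the theoretical value):

import numpy as np

def mlr_gradient(W, X, y):
    """Gradient of objective (3.1) with respect to each w_k (W has shape (K, d))."""
    K = W.shape[0]
    res = y[:, None] - X @ W.T                # r_{ik} = y_i - x_i^T w_k
    sq = res ** 2
    G = np.empty_like(W)
    for k in range(K):
        loo = np.prod(np.delete(sq, k, axis=1), axis=1)   # prod_{l != k} r_{il}^2
        G[k] = -2.0 * X.T @ (loo * res[:, k])
    return G

def mlr_gd(W0, X, y, eta, iters):
    W = W0.copy()
    for _ in range(iters):
        W -= eta * mlr_gradient(W, X, y)
    return W

# Usage (W0 would come from Algorithm 3.4.1; here we start inside the basin):
rng = np.random.default_rng(10)
d, K, N = 10, 2, 5000
W_star = rng.standard_normal((K, d))
z = rng.integers(K, size=N)
X = rng.standard_normal((N, d))
y = np.einsum('nd,nd->n', X, W_star[z])

W0 = W_star + 0.1 * rng.standard_normal((K, d))
W = mlr_gd(W0, X, y, eta=1e-7, iters=3000)
print(np.max(np.linalg.norm(W - W_star, axis=1)))   # small after convergence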

3.6 Numerical Experiments

In this section, we use synthetic data to show the properties of our algo-

rithm that minimizes Eq. (3.1), which we call LOSCO (LOcally Strongly Convex

Objective). We generate data points and parameters from standard normal distri-

bution. We set K = 3 and p_k = 1/3 for all k ∈ [K]. The error is defined as ε^{(t)} = min_{π∈Perm([K])} max_{k∈[K]} ‖w^{(t)}_{π(k)} − w*_k‖ / ‖w*_k‖, where Perm([K]) is the set of all permutation functions on the set [K]. The errors reported in this section are averaged over 10 trials. In our experiments, we find that it makes no difference whether we resample or not; hence, for simplicity, we use the original dataset throughout. We set both parameters of the robust tensor power method (denoted N and L in Algorithm 1 of [4]) to 100. The experiments are conducted in Matlab. After the initialization, we use alternating minimization (i.e., block coordinate descent) to exactly minimize the objective over w_k for k = 1, 2, · · · , K cyclically.
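The permutation-invariant error ε^{(t)} above can be computed as in the following sketch (assuming Python with NumPy; for the small K used here, enumerating all K! permutations is cheap).

import numpy as np
from itertools import permutations

def mlr_error(W, W_star):
    """epsilon = min over permutations pi of max_k ||w_{pi(k)} - w*_k|| / ||w*_k||."""
    K = W_star.shape[0]
    best = np.inf
    for perm in permutations(range(K)):
        err = max(np.linalg.norm(W[perm[k]] - W_star[k]) / np.linalg.norm(W_star[k])
                  for k in range(K))
        best = min(best, err)
    return best

rng = np.random.default_rng(11)
W_star = rng.standard_normal((3, 10))
W = W_star[[2, 0, 1]] + 1e-8 * rng.standard_normal((3, 10))  # shuffled rows, tiny noise
print(mlr_error(W, W_star))   # ~1e-8: the label permutation is resolved by the metric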

Fig. 3.1(a) shows the recovery rate for different dimensions and different numbers of samples. We say a trial is recovered if ε^{(t)} < 10^{−6} for some t < 100. The recovery rate is the proportion of recovered trials out of 10. As shown in the figure, the sample complexity for exact recovery is nearly linear in d. Fig. 3.1(b) shows the behavior of our algorithm in the noisy case. The noise is drawn i.i.d. from e_i ∼ N(0, σ²), and d is fixed at 100. As we can see from the figure, the solution error is almost proportional to the noise standard deviation. Comparing different N's, the solution error decreases as N increases, so the estimator appears consistent in the presence of unbiased noise. We also illustrate the performance of our tensor initialization method in Fig. 3.1(c), which shows that, to achieve an initial error ε^{(0)} = c for some constant c < 1, our tensor method only requires N proportional to d. Note that the naive initialization methods, random initialization (using a normal distribution) and all-zero initialization, lead to ε^{(0)} ≈ 1.4 and ε^{(0)} = 1 respectively.

We next compare with the EM algorithm, where we alternately assign labels to points and exactly solve for each model parameter according to the labels. EM has been shown to be very sensitive to the initialization [137]. The grid search initialization method proposed in [137] is not feasible here, because it only handles two components with the same magnitude. Therefore, we use random initialization and tensor initialization for EM. We compare our method with EM on convergence speed under different dimensions and different initialization methods. We use exact alternating minimization (LOSCO-ALT) to optimize our objective (3.1), which has similar computational complexity to EM. Fig. 3.2(a)(b) shows that, when both methods are initialized with the tensor method, our method converges slightly faster than EM in terms of time, and when initialized with random vectors, our method converges much faster than EM. In case (b), EM with random initialization does not converge to the optimum, while our method still converges. Fig. 3.2(c)(d) further shows that, in terms of iterations, our method converges to the optimum even faster than EM.

200 400 600 800 1000d

0.5

1

1.5

2

N

×104

0

0.2

0.4

0.6

0.8

1

Rec

over

y R

ate

-10 -5 0log(σ)

-15

-10

-5

0

log(ϵ)

N=6000N=60000N=600000

200 400 600 800 1000d

2

4

6

8

10

N

×104

-2

-1.5

-1

-0.5

0

0.5

1

log(ϵ)

(a) Sample complexity (b) Noisy case (c) The initialization error

Figure 3.1: Empirical Performance of MLR.

3.7 Related Work

EM algorithm without careful initialization is only guaranteed to have local

31

Page 48: Copyright by Kai Zhong 2018

0 0.5 1 1.5time(s)

-30

-20

-10

0log(err)

LOSCO-ALT-tensorEM-tensorLOSCO-ALT-randomEM-random

0 100 200 300 400time(s)

-30

-20

-10

0

log(err)

LOSCO-ALT-tensorEM-tensorLOSCO-ALT-randomEM-random

(a) d = 100, N = 6k (b) d = 1k, N = 60k

20 40 60 80 100iterations

-30

-20

-10

0

log(err)

LOSCO-ALT-tensorEM-tensorLOSCO-ALT-randomEM-random

20 40 60 80 100iterations

-30

-20

-10

0

log(err)

LOSCO-ALT-tensorEM-tensorLOSCO-ALT-randomEM-random

(c) d = 100, N = 6k (d) d = 1k, N = 60k

Figure 3.2: Comparison with EM in terms of time and iterations. Our methodwith random initialization is significantly better than EM with random initialization.Performance of the two methods is comparable when initialized with tensor method.

convergence [13, 76, 137]. [137] proposed a grid search method for initialization.

However, it is limited to the two-component case and seems non-trivial to extend

to multiple components. It is known that exact minimization for each step of EM

is not scalable due to the O(d2N + d3) complexity. Alternatively, we can use EM

with gradient update, whose local convergence is also guaranteed by [13] but only

in the two-symmetric-component case, i.e., when w2 = −w1.

32

Page 49: Copyright by Kai Zhong 2018

Tensor Methods for MLR were studied by [26, 108]. [108] approximated

the third-order moment directly from samples with Gaussian distribution, while

[26] learned the third-order moment from a low-rank linear regression problem.

Tensor methods can obtain the model parameters to any precision ε but requires

1/ε2 time/samples. Also, tensor methods can handle multiple components but suffer

from high sample complexity and high computational complexity. For example,

the sample complexity required by [26] and [108] is O(d6) and O(d3) respectively.

On the other hand, the computational burden mainly comes from the operation on

tensor, which costs at least O(d3) for a very simple tensor evaluation. [26] also

suffers from the slow nuclear norm minimization when estimating the second and

third order moments. In contrast, we use tensor method only for initialization, i.e.,

we require ε to be a certain constant. Moreover, with a simple trick, we can ensure

that the sample and time complexity of our initialization step is only linear in d and

N .

Convex Formulation. Another approach to guarantee the recovery of the pa-

rameters is to relax the non-convex problem to convex problem. [28] proposed

a convex formulation of MLR with two components. The authors provide up-

per bounds on the recovery errors in the noisy case and show their algorithm is

information-theoretically optimal. However, the convex formulation needs to solve

a nuclear norm function under linear constraints, which leads to high computational

cost. The extension from two components to multiple components for this formu-

lation is also not straightforward.

33

Page 50: Copyright by Kai Zhong 2018

Chapter 4

One-hidden-layer Fully-connected Neural Networks1

In this chapter, we consider regression problems with one-hidden-layer

neural networks (1NNs). We distill some properties of activation functions that lead

to local strong convexity in the neighborhood of the ground-truth parameters for the

1NN squared-loss objective and most popular nonlinear activation functions sat-

isfy the distilled properties, including rectified linear units (ReLUs), leaky ReLUs,

squared ReLUs and sigmoids. For activation functions that are also smooth, we

show local linear convergence guarantees of gradient descent under a resampling

rule. For homogeneous activations, we show tensor methods are able to initial-

ize the parameters to fall into the local strong convexity region. As a result, ten-

sor initialization followed by gradient descent is guaranteed to recover the ground

truth with sample complexity d · log(1/ε) ·poly(k, λ) and computational complexity

n · d · poly(k, λ) for smooth homogeneous activations with high probability, where

d is the dimension of the input, k (k ≤ d) is the number of hidden nodes, λ is a

conditioning property of the ground-truth parameter matrix between the input layer

1The content of this chapter is published as Recovery guarantees for one-hidden-layer neuralnetworks, Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon, in Interna-tional Conference for Machine Learning, 2017. The dissertator’s contribution includes deriving thedetailed theoretical analysis, conducting the numerical experiments and writing most parts of thepaper.

34

Page 51: Copyright by Kai Zhong 2018

and the hidden layer, ε is the targeted precision and n is the number of samples. To

the best of our knowledge, this is the first work that provides recovery guarantees

for 1NNs with both sample complexity and computational complexity linear in the

input dimension and logarithmic in the precision.

4.1 Introduction to One-hidden-layer Neural Networks

Neural Networks (NNs) have achieved great practical success recently. Many

theoretical contributions have been made very recently to understand the extraor-

dinary performance of NNs. The remarkable results of NNs on complex tasks in

computer vision and natural language processing inspired works on the expres-

sive power of NNs [31, 32, 34, 91, 99, 101, 124]. Indeed, several works found

NNs are very powerful and the deeper the more powerful. However, due to the

high non-convexity of NNs, knowing the expressivity of NNs doesn’t guarantee

that the targeted functions will be learned. Therefore, several other works fo-

cused on the achievability of global optima. Many of them considered the over-

parameterized setting, where the global optima or local minima close to the global

optima will be achieved when the number of parameters is large enough, including

[35, 44, 57, 59, 89, 105]. This, however, leads to overfitting easily and can’t provide

any generalization guarantees, which are actually the essential goal in most tasks.

A few works have considered generalization performance. For example,

[135] provide generalization bound under the Rademacher generalization analy-

sis framework. Recently [142] describe some experiments showing that NNs are

complex enough that they actually memorize the training data but still generalize

35

Page 52: Copyright by Kai Zhong 2018

well. As they claim, this cannot be explained by applying generalization analysis

techniques, like VC dimension and Rademacher complexity, to classification loss

(although it does not rule out a margins analysis—see, for example, [17]; their ex-

periments involve the unbounded cross-entropy loss).

In this work, we don’t develop a new generalization analysis. Instead we

focus on parameter recovery setting, where we assume there are underlying ground-

truth parameters and we provide recovery guarantees for the ground-truth parame-

ters up to equivalent permutations. Since the parameters are exactly recovered, the

generalization performance will also be guaranteed.

Several other techniques are also provided to recover the parameters or

to guarantee generalization performance, such as tensor methods [72] and kernel

methods [9]. These methods require sample complexity O(d3) or computational

complexity O(n2), which can be intractable in practice.

Recently [110] show that neither specific assumptions on the niceness of the

input distribution or niceness of the target function alone is sufficient to guarantee

learnability using gradient-based methods. In this work, we assume data points

are sampled from Gaussian distribution and the parameters of hidden neurons are

linearly independent.

Our main contributions are as follows,

1. We distill some properties for activation functions, which are satis-

fied by a wide range of activations, including ReLU, squared ReLU, sigmoid and

tanh. With these properties we show positive definiteness (PD) of the Hessian

36

Page 53: Copyright by Kai Zhong 2018

in the neighborhood of the ground-truth parameters given enough samples (The-

orem 4.3.1). Further, for activations that are also smooth, we show local linear

convergence is guaranteed using gradient descent.

2. We propose a tensor method to initialize the parameters such that the

initialized parameters fall into the local positive definiteness area. Our contribution

is that we reduce the sample/computational complexity from cubic dependency on

dimension to linear dependency (Theorem 5.4.1).

3. Combining the above two results, we provide a globally converging algo-

rithm (Algorithm 8.1.1) for smooth homogeneous activations satisfying the distilled

properties. The whole procedure requires sample/computational complexity linear

in dimension and logarithmic in precision (Theorem 5.5.1).

Roadmap. This section is organized as follows. In Section 4.2, we present our

problem setting and show three key properties of activations required for our guar-

antees. In Section 4.3, we introduce the formal theorem of local strong convexity

and show local linear convergence for smooth activations. Section 4.4 presents a

tensor method to initialize the parameters so that they fall into the basin of the local

strong convexity region.

4.2 Problem Formulation

We consider the following regression problem. Given a set of n samples

S = (x1, y1), (x2, y2), · · · (xn, yn) ⊂ Rd × R,

37

Page 54: Copyright by Kai Zhong 2018

let D denote a underlying distribution over Rd × R with parameters

w∗1, w

∗2, · · ·w∗

k ⊂ Rd, and v∗1, v∗2, · · · , v∗k ⊂ R

such that each sample (x, y) ∈ S is sampled i.i.d. from this distribution, with

D : x ∼ N(0, I), y =k∑

i=1

v∗i · φ(w∗>i x), (4.1)

where φ(z) is the activation function, k is the number of nodes in the hidden layer.

The main question we want to answer is: How many samples are sufficient to re-

cover the underlying parameters?

It is well-known that, training one hidden layer neural network is NP-complete

[20]. Thus, without making any assumptions, learning deep neural network is in-

tractable. Throughout the section, we assume x follows a standard normal distri-

bution; the data is noiseless; the dimension of input data is at least the number of

hidden nodes; and activation function φ(z) satisfies some reasonable properties.

Actually our results can be easily extended to multivariate Gaussian distri-

bution with positive definite covariance and zero mean since we can estimate the

covariance first and then transform the input to a standard normal distribution but

with some loss of accuracy. Although this work focuses on the regression problem,

we can transform classification problems to regression problems if a good teacher

is provided as described in [126]. Our analysis requires k to be no greater than d,

since the first-layer parameters will be linearly dependent otherwise.

For activation function φ(z), we assume it is continuous and if it is non-

smooth let its first derivative be left derivative. Furthermore, we assume it satisfies

38

Page 55: Copyright by Kai Zhong 2018

Property 1, 2, and 3. These properties are critical for the later analyses. We also

observe that most activation functions actually satisfy these three properties.

Property 1. The first derivative φ′(z) is nonnegative and homogeneously bounded,

i.e., 0 ≤ φ′(z) ≤ L1|z|p for some constants L1 > 0 and p ≥ 0.

Property 2. Let αq(σ) = Ez∼N(0,1)[φ′(σ · z)zq],∀q ∈ 0, 1, 2, and βq(σ) =

Ez∼N(0,1)[φ′2(σ·z)zq],∀q ∈ 0, 2. Let ρ(σ) denote minβ0(σ)−α2

0(σ)−α21(σ), β2(σ)−

α21(σ)− α2

2(σ), α0(σ) · α2(σ)− α21(σ) The first derivative φ′(z) satisfies that, for

all σ > 0, we have ρ(σ) > 0.

Property 3. The second derivative φ′′(z) is either (a) globally bounded |φ′′(z)| ≤

L2 for some constant L2, i.e., φ(z) is L2-smooth, or (b) φ′′(z) = 0 except for e (e is

a finite constant) points.

Remark 4.2.1. The first two properties are related to the first derivative φ′(z) and

the last one is about the second derivative φ′′(z). At high level, Property 1 re-

quires φ to be non-decreasing with homogeneously bounded derivative; Property 2

requires φ to be highly non-linear; Property 3 requires φ to be either smooth or

piece-wise linear.

Theorem 4.2.1. ReLU φ(z) = maxz, 0, leaky ReLU φ(z) = maxz, 0.01z,

squared ReLU φ(z) = maxz, 02 and any non-linear non-decreasing smooth func-

tions with bounded symmetric φ′(z), like the sigmoid function φ(z) = 1/(1 + e−z),

the tanh function and the erf function φ(z) =∫ z

0e−t2dt, satisfy Property 1,2,3. The

linear function, φ(z) = z, doesn’t satisfy Property 2 and the quadratic function,

φ(z) = z2, doesn’t satisfy Property 1 and 2.

39

Page 56: Copyright by Kai Zhong 2018

The proof can be found in Appendix C.2.

4.3 Local Strong Convexity

In this section, we study the Hessian of empirical risk near the ground truth.

We consider the case when v∗ is already known. Note that for homogeneous acti-

vations, we can assume v∗i ∈ −1, 1 since vφ(z) = v|v|φ(|v|

1/pz), where p is the

degree of homogeneity. As v∗i only takes discrete values for homogeneous activa-

tions, in the next section, we show we can exactly recover v∗ using tensor methods

with finite samples.

For a set of samples S, we define the Empirical Risk,

fS(W ) =1

2|S|∑

(x,y)∈S

(k∑

i=1

v∗i φ(w>i x)− y

)2

. (4.2)

For a distribution D, we define the Expected Risk,

fD(W ) =1

2E

(x,y)∼D

( k∑i=1

v∗i φ(w>i x)− y

)2. (4.3)

Let’s calculate the gradient and the Hessian of fS(W ) and fD(W ). For each j ∈ [k],

the partial gradient of fD(W ) with respect to wj can be represented as

∂fD(W )

∂wj

= E(x,y)∼D

[(k∑

i=1

v∗i φ(w>i x)− y

)v∗jφ

′(w>j x)x

].

For each j, l ∈ [k] and j 6= l, the second partial derivative of fD(W ) for the (j, l)-th

off-diagonal block is,

∂2fD(W )

∂wj∂wl

= E(x,y)∼D

[v∗j v

∗l φ

′(w>j x)φ

′(w>l x)xx

>],40

Page 57: Copyright by Kai Zhong 2018

and for each j ∈ [k], the second partial derivative of fD(W ) for the j-th diagonal

block is

∂2fD(W )

∂w2j

= E(x,y)∼D

[(k∑

i=1

v∗i φ(w>i x)− y

)v∗jφ

′′(w>j x)xx

> + (v∗jφ′(w>

j x))2xx>

].

If φ(z) is non-smooth, we use the Dirac function and its derivatives to represent

φ′′(z). Replacing the expectation E(x,y)∼D by the average over the samples |S|−1∑

(x,y)∈S ,

we obtain the Hessian of the empirical risk.

Considering the case when W = W ∗ ∈ Rd×k, for all j, l ∈ [k], we have,

∂2fD(W∗)

∂wj∂wl

= E(x,y)∼D

[v∗j v

∗l φ

′(w∗>j x)φ′(w∗>

l x)xx>].If Property 3(b) is satisfied, φ′′(z) = 0 almost surely. So in this case the diagonal

blocks of the empirical Hessian can be written as,

∂2fS(W )

∂w2j

=1

|S|∑

(x,y)∈S

(v∗jφ′(w>

j x))2xx>.

Now we show the Hessian of the objective near the global optimum is positive

definite.

Definition 4.3.1. Given the ground truth matrix W ∗ ∈ Rd×k, let σi(W∗) denote

the i-th singular value of W ∗, often abbreviated as σi. Let κ = σ1/σk, λ =

(∏k

i=1 σi)/σkk . Let vmax denote maxi∈[k] |v∗i | and vmin denote mini∈[k] |v∗i | . Let

ν = vmax/vmin. Let ρ denote ρ(σk). Let τ = (3σ1/2)4p/minσ∈[σk/2,3σ1/2]ρ2(σ).

Theorem 4.3.1 (Informal version of Theorem B.3.1). For any W ∈ Rd×k with

‖W −W ∗‖ ≤ poly(1/k, 1/λ, 1/ν, ρ/σ2p1 ) · ‖W ∗‖, let S denote a set of i.i.d. sam-

ples from distribution D (defined in (4.1)) and let the activation function satisfy

41

Page 58: Copyright by Kai Zhong 2018

Property 1,2,3. Then for any t ≥ 1, if |S| ≥ d · poly(log d, t, k, ν, τ, λ, σ2p1 /ρ), we

have with probability at least 1− d−Ω(t),

Ω(v2minρ(σk)/(κ2λ))I ∇2fS(W ) O(kv2maxσ

2p1 )I.

Remark 4.3.1. As we can see from Theorem 4.3.1, ρ(σk) from Property 2 plays an

important role for positive definite (PD) property. Interestingly, many popular acti-

vations, like ReLU, sigmoid and tanh, have ρ(σk) > 0, while some simple functions

like linear (φ(z) = z) and square (φ(z) = z2) functions have ρ(σk) = 0 and their

Hessians are rank-deficient. Another important numbers are κ and λ, two different

condition numbers of the weight matrix, which directly influences the positive defi-

niteness. If W ∗ is rank deficient, λ→∞, κ→∞ and we don’t have PD property.

In the best case when W ∗ is orthogonal, λ = κ = 1. In the worse case, λ can be

exponential in k. Also W should be close enough to W ∗. In the next section, we

provide tensor methods to initialize w∗i and v∗i such that they satisfy the conditions

in Theorem 4.3.1.

For the PD property to hold, we need the samples to be independent of the

current parameters. Therefore, we need to do resampling at each iteration to guar-

antee the convergence in iterative algorithms like gradient descent. The following

theorem provides the linear convergence guarantee of gradient descent for smooth

activations.

Theorem 4.3.2 (Linear convergence of gradient descent, informal version of Theo-

rem B.3.2). Let W be the current iterate satisfying

‖W −W ∗‖ ≤ poly(1/ν, 1/k, 1/λ, ρ/σ2p1 )‖W ∗‖.

42

Page 59: Copyright by Kai Zhong 2018

Let S denote a set of i.i.d. samples from distribution D (defined in (4.1)) with |S| ≥

d · poly(log d, t, k, ν, τ, λ, σ2p1 /ρ) and let the activation function satisfy Property 1,2

and 3(a). Define m0 := Θ(v2minρ(σk)/(κ2λ)) and M0 := Θ(kv2maxσ

2p1 ). If we

perform gradient descent with step size 1/M0 on fS(W ) and obtain the next iterate,

W = W − 1

M0

∇fS(W ),

then with probability at least 1− d−Ω(t),

‖W −W ∗‖2F ≤ (1− m0

M0

)‖W −W ∗‖2F .

We provide the proofs in the Appendix B.3.1

4.4 Initialization via Tensor Methods

In this section, we show that Tensor methods can recover the parameters

W ∗ to some precision and exactly recover v∗ for homogeneous activations.

It is known that most tensor problems are NP-hard [63, 64] or even hard

to approximate [117]. However, by making some assumptions, tensor decompo-

sition method becomes efficient [4, 115, 132, 133]. Here we utilize the noiseless

assumption and Gaussian inputs assumption to show a provable and efficient tensor

methods.

Preliminary

Let’s define a special outer product ⊗ for simplification of the notation.

If v ∈ Rd is a vector and I is the identity matrix, then v⊗I =∑d

j=1[v ⊗ ej ⊗

43

Page 60: Copyright by Kai Zhong 2018

ej + ej ⊗ v ⊗ ej + ej ⊗ ej ⊗ v]. If M is a symmetric rank-r matrix factorized as

M =∑r

i=1 siviv>i and I is the identity matrix, then

M⊗I =r∑

i=1

si

d∑j=1

6∑l=1

Al,i,j,

where A1,i,j = vi⊗ vi⊗ ej⊗ ej , A2,i,j = vi⊗ ej⊗ vi⊗ ej , A3,i,j = ej⊗ vi⊗ vi⊗ ej ,

A4,i,j = vi ⊗ ej ⊗ ej ⊗ vi, A5,i,j = ej ⊗ vi ⊗ ej ⊗ vi and A6,i,j = ej ⊗ ej ⊗ vi ⊗ vi.

Denote w = w/‖w‖. Now let’s calculate some moments.

Definition 4.4.1. We define M1,M2,M3,M4 and m1,i,m2,i,m3,i,m4,i as follows :

M1 = E(x,y)∼D[y · x].

M2 = E(x,y)∼D[y · (x⊗ x− I)].

M3 = E(x,y)∼D[y · (x⊗3 − x⊗I)].

M4 = E(x,y)∼D[y · (x⊗4 − (x⊗ x)⊗I + I⊗I)].

γj(σ) = Ez∼N(0,1)[φ(σ · z)zj], ∀j = 0, 1, 2, 3, 4.

m1,i = γ1(‖w∗i ‖).

m2,i = γ2(‖w∗i ‖)− γ0(‖w∗

i ‖).

m3,i = γ3(‖w∗i ‖)− 3γ1(‖w∗

i ‖).

m4,i = γ4(‖w∗i ‖) + 3γ0(‖w∗

i ‖)− 6γ2(‖w∗i ‖).

According to Definition 4.4.1, we have the following results,

Claim 4.4.1. For each j ∈ [4], Mj =∑k

i=1 v∗imj,iw

∗⊗ji .

Note that some mj,i’s will be zero for specific activations. For example, for

activations with symmetric first derivatives, i.e., φ′(z) = φ′(−z), like sigmoid and

44

Page 61: Copyright by Kai Zhong 2018

erf, we have φ(z) + φ(−z) being a constant and M2 = 0 since γ0(σ) = γ2(σ).

Another example is ReLU. ReLU functions have vanishing M3, i.e., M3 = 0, as

γ3(σ) = 3γ1(σ). To make tensor methods work, we make the following assumption.

Assumption 4.4.1. Assume the activation function φ(z) satisfies the following con-

ditions:

1. If Mj 6= 0, then mj,i 6= 0 for all i ∈ [k].

2. At least one of M3 and M4 is non-zero.

3. If M1 = M3 = 0, then φ(z) is an even function, i.e., φ(z) = φ(−z).

4. If M2 = M4 = 0, then φ(z) is an odd function, i.e., φ(z) = −φ(−z).

If φ(z) is an odd function then φ(z) = −φ(−z) and vφ(w>x) = −vφ(−w>x).

Hence we can always assume v > 0. If φ(z) is an even function, then vφ(w>x) =

vφ(−w>x). So if w recovers w∗ then −w also recovers w∗. Note that ReLU, leaky

ReLU and squared ReLU satisfy Assumption 4.4.1. We further define the following

non-zero moments.

Definition 4.4.2. Let α ∈ Rd denote a randomly picked vector. We define P2 and

P3 as follows: P2 = Mj2(I, I, α, · · · , α) , where j2 = minj ≥ 2|Mj 6= 0 and

P3 = Mj3(I, I, I, α, · · · , α), where j3 = minj ≥ 3|Mj 6= 0.

According to Definition 4.4.1 and 4.4.2, we have,

Claim 4.4.2. P2 =∑k

i=1 v∗imj2,i(α

>w∗i )

j2−2w∗⊗2i and

P3 =k∑

i=1

v∗imj3,i(α>w∗

i )j3−3w∗⊗3

i .

45

Page 62: Copyright by Kai Zhong 2018

In other words for the above definition, P2 is equal to the first non-zero

matrix in the ordered sequence M2,M3(I, I, α),M4(I, I, α, α). P3 is equal to

the first non-zero tensor in the ordered sequence M3,M4(I, I, I, α). Since α is

randomly picked up, w∗>i α 6= 0 and we view this number as a constant throughout

this work. So by construction and Assumption 4.4.1, both P2 and P3 are rank-k.

Also, let P2 ∈ Rd×d and P3 ∈ Rd×d×d denote the corresponding empirical moments

of P2 ∈ Rd×d and P3 ∈ Rd×d×d respectively.

Algorithm

Now we briefly introduce how to use a set of samples with size linear in

dimension to recover the ground truth parameters to some precision. As shown in

the previous section, we have a rank-k 3rd-order moment P3 that has tensor decom-

position formed by w∗1, w

∗2, · · · , w∗

k. Therefore, we can use the non-orthogonal

decomposition method [78] to decompose the corresponding estimated tensor P3

and obtain an approximation of the parameters. The precision of the obtained pa-

rameters depends on the estimation error of P3, which requires Ω(d3/ε2) samples to

achieve ε error. Also, the time complexity for tensor decomposition on a d× d× d

tensor is Ω(d3).

In this work, we reduce the cubic dependency of sample/computational

complexity in dimension [72] to linear dependency. Our idea follows the tech-

niques used in [151], where they first used a 2nd-order moment P2 to approximate

the subspace spanned by w∗1, w

∗2, · · · , w∗

k, denoted as V , then use V to reduce a

higher-dimensional third-order tensor P3 ∈ Rd×d×d to a lower-dimensional tensor

46

Page 63: Copyright by Kai Zhong 2018

Algorithm 4.4.1 Initialization via Tensor Method1: procedure INITIALIZATION(S) . Theorem 5.4.12: S2, S3, S4 ← PARTITION(S, 3)

3: P2 ← ES2 [P2]

4: V ← POWERMETHOD(P2, k)

5: R3 ← ES3 [P3(V, V, V )]

6: uii∈[k] ← KCL(R3)

7: w(0)i , v

(0)i i∈[k] ← RECMAGSIGN(V, uii∈[k], S4)

8: Return w(0)i , v

(0)i i∈[k]

R3 := P3(V, V, V ) ∈ Rk×k×k. Since the tensor decomposition and the tensor esti-

mation are conducted on a lower-dimensional Rk×k×k space, the sample complexity

and computational complexity are reduced.

The detailed algorithm is shown in Algorithm 4.4.1. First, we randomly

partition the dataset into three subsets each with size O(d). Then apply the power

method on P2, which is the estimation of P2 from S2, to estimate V . After that, the

non-orthogonal tensor decomposition (KCL)[78] on R3 outputs ui which estimates

siV>w∗

i for i ∈ [k] with unknown sign si ∈ −1, 1. Hence w∗i can be estimated

by siV ui. Finally we estimate the magnitude of w∗i and the signs si, v

∗i in the

RECMAGSIGN function for homogeneous activations. We discuss the details of

each procedure and provide POWERMETHOD and RECMAGSIGN algorithms in

Appendix B.4.

Theoretical Analysis

We formally present our theorem for Algorithm 4.4.1, and provide the proof

in the Appendix B.4.2.

47

Page 64: Copyright by Kai Zhong 2018

Algorithm 4.4.2 Globally Converging Algorithm1: procedure LEARNING1NN(S, d, k, ε) . Theorem 5.5.12: T ← log(1/ε) · poly(k, ν, λ, σ2p

1 /ρ).3: η ← 1/(kv2maxσ

2p1 ).

4: S0, S1, · · · , Sq ← PARTITION(S, q + 1).5: W (0), v(0) ← INITIALIZATION(S0).6: Set v∗i ← v

(0)i in Eq. (4.2) for all fSq(W ), q ∈ [T ]

7: for q = 0, 1, 2, · · · , T − 1 do8: W (q+1) = W (q) − η∇fSq+1(W

(q))

9: Return w(T )i , v

(0)i i∈[k]

Theorem 4.4.2. Let the activation function be homogeneous satisfying Assump-

tion 4.4.1. For any 0 < ε < 1 and t ≥ 1, if |S| ≥ ε−2 · d · poly(t, k, κ, log d), then

there exists an algorithm (Algorithm 4.4.1) that takes |S|k · O(d) time and outputs

a matrix W (0) ∈ Rd×k and a vector v(0) ∈ Rk such that, with probability at least

1− d−Ω(t),

‖W (0) −W ∗‖F ≤ ε · poly(k, κ)‖W ∗‖F , and v(0)i = v∗i .

4.5 Recovery Guarantee

Combining the positive definiteness of the Hessian near the global optimal

in Section 4.3 and the tensor initialization methods in Section 4.4, we come up

with the overall globally converging algorithm Algorithm 8.1.1 and its guarantee

Theorem 5.5.1.

Theorem 4.5.1 (Global convergence guarantees). Let S denote a set of i.i.d. sam-

ples from distribution D (defined in (4.1)) and let the activation function be homo-

geneous satisfying Property 1, 2, 3(a) and Assumption 4.4.1. Then for any t ≥ 1 and

48

Page 65: Copyright by Kai Zhong 2018

20 40 60 80 100d

2000

4000

6000

8000

10000N

0

0.2

0.4

0.6

0.8

1

Rec

over

y R

ate

20 40 60 80 100d

2000

4000

6000

8000

10000

N

0.5

1

1.5

Tens

or in

itial

izat

ion

erro

r

(a) Sample complexity for recovery (b) Tensor initialization error

0 200 400 600 800 1000iteration

-10

-5

0

5

10

15

20

log(

obj)

Initialize v,W with TensorRandomly initialize both v,WFix v=v*, randomly initialize W

(c) Objective v.s. iterations

Figure 4.1: Numerical Experiments

any ε > 0, if |S| ≥ d log(1/ε)·poly(log d, t, k, λ), T ≥ log(1/ε)·poly(k, ν, λ, σ2p1 /ρ)

and 0 < η ≤ 1/(kv2maxσ2p1 ), then there is an Algorithm (procedure LEARNING1NN

in Algorithm 8.1.1) taking |S| · d · poly(log d, k, λ) time and outputting a matrix

W (T ) ∈ Rd×k and a vector v(0) ∈ Rk satisfying

‖W (T ) −W ∗‖F ≤ ε‖W ∗‖F , and v(0)i = v∗i .

with probability at least 1− d−Ω(t).

This follows by combining Theorem 4.3.2 and Theorem 5.4.1.

49

Page 66: Copyright by Kai Zhong 2018

4.6 Numerical Experiments

In this section we use synthetic data to verify our theoretical results. We

generate data points xi, yii=1,2,··· ,n from Distribution D(defined in Eq. (4.1)). We

set W ∗ = UΣV >, where U ∈ Rd×k and V ∈ Rk×k are orthogonal matrices gener-

ated from QR decomposition of Gaussian matrices, Σ is a diagonal matrix whose

diagonal elements are 1, 1+ κ−1k−1

, 1+ 2(κ−1)k−1

, · · · , κ. In this experiment, we set κ = 2

and k = 5. We set v∗i to be randomly picked from −1, 1 with equal chance. We

use squared ReLU φ(z) = maxz, 02, which is a smooth homogeneous function.

For non-orthogonal tensor methods, we directly use the code provided by [78] with

the number of random projections fixed as L = 100. We pick the stepsize η = 0.02

for gradient descent. In the experiments, we don’t do the resampling since the al-

gorithm still works well without resampling.

First we show the number of samples required to recover the parameters for

different dimensions. We fix k = 5, change d for d = 10, 20, · · · , 100 and n for

n = 1000, 2000, · · · , 10000. For each pair of d and n, we run 10 trials. We say a

trial successfully recovers the parameters if there exists a permutation π : [k]→ [k],

such that the returned parameters W and v satisfy

maxj∈[k]‖w∗

j − wπ(j)‖/‖w∗j‖ ≤ 0.01 and vπ(j) = v∗j .

We record the recovery rates and represent them as grey scale in Fig. 4.1(a). As

we can see from Fig. 4.1(a), the least number of samples required to have 100%

recovery rate is about proportional to the dimension.

Next we test the tensor initialization. We show the error between the output

50

Page 67: Copyright by Kai Zhong 2018

of the tensor method and the ground truth parameters against the number of sam-

ples under different dimensions in Fig 4.1(b). The pure dark blocks indicate, in at

least one of the 10 trials,∑k

i=1 v(0)i 6=

∑ki=1 v

∗i , which means v

(0)i is not correctly

initialized. Let Π(k) denote the set of all possible permutations π : [k] → [k]. The

grey scale represents the averaged error,

minπ∈Π(k)

maxj∈[k]‖w∗

j − w(0)π(j)‖/‖w

∗j‖,

over 10 trials. As we can see, with a fixed dimension, the more samples we have the

better initialization we obtain. We can also see that to achieve the same initialization

error, the sample complexity required is about proportional to the dimension.

We also compare different initialization methods for gradient descent in

Fig. 4.1(c). We fix d = 10, k = 5, n = 10000 and compare three different ini-

tialization approaches, (I) Let both v and W be initialized from tensor methods,

and then do gradient descent for W while v is fixed; (II) Let both v and W be

initialized from random Gaussian, and then do gradient descent for both W and v;

(III) Let v = v∗ and W be initialized from random Gaussian, and then do gradient

descent for W while v is fixed. As we can see from Fig 4.1(c), Approach (I) is

the fastest and Approach (II) doesn’t converge even if more iterations are allowed.

Both Approach (I) and (III) have linear convergence rate when the objective value

is small enough, which verifies our local linear convergence claim.

51

Page 68: Copyright by Kai Zhong 2018

4.7 Related Work

The recent empirical success of NNs has boosted their theoretical analyses

[5, 9, 15, 16, 42, 52, 107]. We classify them into three main directions.

Expressive Power

Expressive power is studied to understand the remarkable performance of

neural networks on complex tasks. Although one-hidden-layer neural networks

with sufficiently many hidden nodes can approximate any continuous function [65],

shallow networks can’t achieve the same performance in practice as deep networks.

Theoretically, several recent works show the depth of NNs plays an essential role in

the expressive power of neural networks [34]. As shown in [31, 32, 124], functions

that can be implemented by a deep network of polynomial size require exponential

size in order to be implemented by a shallow network. [9, 91, 99, 101] design some

measures of expressivity that display an exponential dependence on the depth of

the network. However, the increasing of the expressivity of NNs or its depth also

increases the difficulty of the learning process to achieve a good enough model. In

this work, we focus on 1NNs and provide recovery guarantees using a finite number

of samples.

Achievability of Global Optima

The global convergence is in general not guaranteed for NNs due to their

non-convexity. It is widely believed that training deep models using gradient-based

methods works so well because the error surface either has no local minima, or if

52

Page 69: Copyright by Kai Zhong 2018

they exist they need to be close in value to the global minima. [123] present exam-

ples showing that for this to be true additional assumptions on the data, initializa-

tion schemes and/or the model classes have to be made. Indeed the achievability of

global optima has been shown under many different types of assumptions.

In particular, [30] analyze the loss surface of a special random neural net-

work through spin-glass theory and show that it has exponentially many local op-

tima, whose loss is small and close to that of a global optimum. Later on, [74]

eliminate some assumptions made by [30] but still require the independence of ac-

tivations as [30], which is unrealistic. [105] study the geometric structure of the

neural network objective function. They have shown that with high probability

random initialization will fall into a basin with a small objective value when the

network is over-parameterized. [89] consider polynomial networks where the acti-

vations are square functions, which are typically not used in practice. [57] show that

when a local minimum has zero parameters related to a hidden node, a global opti-

mum is achieved. [44] study the landscape of 1NN in terms of topology and geom-

etry, and show that the level set becomes connected as the network is increasingly

over-parameterized. [59] show that products of matrices don’t have spurious local

minima and that deep residual networks can represent any function on a sample, as

long as the number of parameters is larger than the sample size. [121] consider over-

specified NNs, where the number of samples is smaller than the number of weights.

[35] propose a new approach to second-order optimization that identifies and at-

tacks the saddle point problem in high-dimensional non-convex optimization. They

apply the approach to recurrent neural networks and show practical performance.

53

Page 70: Copyright by Kai Zhong 2018

[9] use results from tropical geometry to show global optimality of an algorithm,

but it requires (2n)kpoly(n) computational complexity.

Almost all of these results require the number of parameters is larger than

the number of points, which probably overfits the model and no generalization per-

formance will be guaranteed. In this work, we propose an efficient and provable

algorithm for 1NNs that can achieve the underlying ground-truth parameters.

Generalization Bound / Recovery Guarantees

The achievability of global optima of the objective from the training data

doesn’t guarantee the learned model to be able to generalize well on unseen testing

data. In the literature, we find three main approaches to generalization guarantees.

1) Use generalization analysis frameworks, including VC dimension/Rademacher

complexity, to bound the generalization error. A few works have studied the gener-

alization performance for NNs. [135] follow [121] but additionally provide gener-

alization bounds using Rademacher complexity. They assume the obtained parame-

ters are in a regularization set so that the generalization performance is guaranteed,

but this assumption can’t be justified theoretically. [61] apply stability analysis to

the generalization analysis of SGD for convex and non-convex problems, arguing

early stopping is important for generalization performance.

2) Assume an underlying model and try to recover this model. This direction

is popular for many non-convex problems including matrix completion/sensing [14,

58, 71, 122], mixed linear regression [151], subspace recovery [40] and other latent

models [4].

54

Page 71: Copyright by Kai Zhong 2018

Without making any assumptions, those non-convex problems are intractable

[8, 10, 49, 50, 60, 102, 116, 120, 137]. Recovery guarantees for NNs also need as-

sumptions. Several different approaches under different assumptions are provided

to have recovery guarantees on different NN settings.

Tensor methods [4, 115, 132, 133] are a general tool for recovering models

with latent factors by assuming the data distribution is known. Some existing recov-

ery guarantees for NNs are provided by tensor methods [72, 109]. However, [109]

only provide guarantees to recover the subspace spanned by the weight matrix and

no sample complexity is given, while [72] require O(d3/ε2) sample complexity. In

this work, we use tensor methods as an initialization step so that we don’t need very

accurate estimation of the moments, which enables us to reduce the total sample

complexity from 1/ε2 to log(1/ε).

[7] provide polynomial sample complexity and computational complexity

bounds for learning deep representations in unsupervised setting, and they need to

assume the weights are sparse and randomly distributed in [−1, 1].

[126] analyze 1NN by assuming Gaussian inputs in a supervised setting, in

particular, regression and classification with a teacher. This work also considers

this setting. However, there are some key differences. a) [126] require the second-

layer parameters are all ones, while we can learn these parameters. b) In [126],

the ground-truth first-layer weight vectors are required to be orthogonal, while we

only require linear independence. c) [126] require a good initialization but doesn’t

provide initialization methods, while we show the parameters can be efficiently

initialized by tensor methods. d) In [126], only the population case (infinite sample

55

Page 72: Copyright by Kai Zhong 2018

size) is considered, so there is no sample complexity analysis, while we show finite

sample complexity.

Recovery guarantees for convolution neural network with Gaussian inputs

are provided in [21], where they show a globally converging guarantee of gradient

descent on a one-hidden-layer no-overlap convolution neural network. However,

they consider population case, so no sample complexity is provided. Also their

analysis depends on ReLU activations and the no-overlap case is very unlikely to be

used in practice. In this work, we consider a large range of activation functions, but

for one-hidden-layer fully-connected NNs.

3) Improper Learning. In the improper learning setting for NNs, the learn-

ing algorithm is not restricted to output a NN, but only should output a prediction

function whose error is not much larger than the error of the best NN among all the

NNs considered. [147, 149] propose kernel methods to learn the prediction func-

tion which is guaranteed to have generalization performance close to that of the

NN. However, the sample complexity and computational complexity are exponen-

tial. [11] transform NNs to convex semi-definite programming. The works by [12]

and [19] are also in this direction. However, these methods are actually not learning

the original NNs. Another work by [148] uses random initializations to achieve

arbitrary small excess risk. However, their algorithm has exponential running time

in 1/ε.

56

Page 73: Copyright by Kai Zhong 2018

Chapter 5

One-hidden-layer Convolutional Neural Networks

In this chapter, we consider model recovery for non-overlapping convolu-

tional neural networks (CNNs) with multiple kernels. We show that when the inputs

follow Gaussian distribution and the sample size is sufficiently large, the squared

loss of such CNNs is locally strongly convex in a basin of attraction near the global

optima for most popular activation functions, like ReLU, Leaky ReLU, Squared

ReLU, Sigmoid and Tanh. The required sample complexity is proportional to the

dimension of the input and polynomial in the number of kernels and a condition

number of the parameters. We also show that tensor methods are able to initialize

the parameters to the local strong convex region. Hence, for most smooth activa-

tions, gradient descent following tensor initialization is guaranteed to converge to

the global optimal with time that is linear in input dimension, logarithmic in pre-

cision and polynomial in other factors. To the best of our knowledge, this is the

first work that provides recovery guarantees for CNNs with multiple kernels using

polynomial number of samples in polynomial running time.

57

Page 74: Copyright by Kai Zhong 2018

5.1 Introduction to One-hidden-layer Convolutional Neural Net-works

Convolutional Neural Networks (CNNs) have been very successful in many

machine learning areas, including image classification [77], face recognition [80],

machine translation [48] and game playing [113]. Comparing with fully-connected

neural networks (FCNN), CNNs leverage three key ideas that improve their perfor-

mance in machine learning tasks, namely, sparse weights, parameter sharing and

equivariance to translation [56]. These ideas allow CNNs to capture common pat-

terns in portions of original inputs.

Despite the empirical success of neural networks, the mechanism behind

them is still not fully understood. Recently there are several theoretical works on

analyzing FCNNs, including the expressive power of FCNNs [31, 32, 34, 99, 101,

124], the achievability of global optima [35, 57, 59, 89, 97, 105] and the recov-

ery/generalization guarantees [72, 85, 109, 125, 135, 153].

With the increasing number of papers analyzing FCNNs, theoretical litera-

ture about CNNs is also rising. Recent theoretical CNN research focuses on gen-

eralization and recovery guarantees. In particular, generalization guarantees for

two-layer CNNs are provided by [149], where they convexify CNNs by relaxing the

class of CNN filters to a reproducing kernel Hilbert space (RKHS). However, to pair

with RKHS, only several uncommonly used activations are acceptable. There have

been much progress in the parameter recovery setting, including [21, 38, 39, 53].

However, existing results can only handle one kernel. It is still unknown

if multi-kernel CNNs will have recovery guarantees. In this section, we consider

58

Page 75: Copyright by Kai Zhong 2018

multiple kernels instead of just one kernel as in [21, 38, 39, 53].

We use a popular proof framework for non-convex problem such as one-

hidden-layer neural network [153]. The framework is as follows. First show local

strong convexity near the global optimal and then use a proper initialization method

to initialize the parameters to fall in the the local strong convexity region. In par-

ticular, we first show that the population Hessian of the squared loss of CNN at

the ground truth is positive definite (PD) as long as the activation satisfies some

properties in Section 5.3. Note that the Hessian of the squared loss at the ground

truth can be trivially proved to be positive semidefinite (PSD), but only PSD-ness

at the ground truth can’t guarantee convergence of most optimization algorithms

like gradient descent. The proof for the PD-ness of Hessian at the ground truth

is non-trivial. Actually we will give examples in Section 5.3 where the distilled

properties are not satisfied and their Hessians are only PSD but not PD. Then given

the PD-ness of population Hessian at the ground truth, we are able to show that

the empirical Hessian at any fixed point that is close enough to the ground truth is

also PD with high probability by using matrix Bernstein inequality and the distilled

properties of activations and we show gradient descent converges to the global op-

timal given an initialization that falls into the PD region. In Section 5.4, we provide

existing guarantees for the initialization using tensor methods. Finally, we present

some experimental results to verify our theory.

Although we use the same proof framework, our proof details are different

from [153] and resolve some new technical challenges. To mention a few, 1) CNNs

have more complex structures, so we need a different property for the activation

59

Page 76: Copyright by Kai Zhong 2018

function (see Property 4) , fortunately this new condition is still satisfied by com-

mon activation functions; 2) The patches introduce additional dependency, which

requires different proof techniques to disentangle when we attempt to show the PD-

ness of population Hessian; 3) The patches also introduce difficulty when applying

matrix Bernstein inequality for error bound. Part of the proof requires to bound the

error between the empirical sum of non-symmetric random matrices and its corre-

sponding population version. However, such non-symmetric random matrices are

not studied in [153] and cannot be applied to classic matrix Bernstein inequality.

Therefore, we exploit the structure of this type of random matrices so that we can

still bound the error.

In summary, our contributions are,

1. We show that the Hessian of the squared loss at a given point that is suf-

ficiently close to the ground truth is positive definite with high probabil-

ity(w.h.p.) when a sufficiently large number of samples are provided and

the activation function satisfies some properties.

2. Given an initialization point that is sufficiently close to the ground truth,

which can be obtained by tensor methods, we show that for smooth activa-

tion functions that satisfy the distilled properties, gradient descent converges

to the ground truth parameters within ε precision using O(log(1/ε)) samples

w.h.p.. To the best of our knowledge, this is the first time that recovery guar-

antees for non-overlapping CNNs with multiple kernels are provided.

60

Page 77: Copyright by Kai Zhong 2018

5.2 Problem Formulation

We consider the CNN setting with one hidden layer, r non-overlapping

patches and t different kernels. Let (x, y) ∈ Rd × R be a pair of an input and its

corresponding final output, k = d/r be the kernel size (or the size of each patch),

wj ∈ Rk be the parameters of j-th kernel (j = 1, 2, · · · , t), and Pi · x ∈ Rk be the

i-th patch (i = 1, 2, · · · , r) of input x, where r matrices P1, P2, · · · , Pr ∈ Rk×d are

defined in the following sense.

P1 =[Ik 0 · · · 0

], · · · , Pr =

[0 0 · · · Ik

].

By construction of Pii∈[r], Pi · x and Pi′ · x (i 6= i′) don’t have any overlap on

the features of x. Throughout this section, we assume the number of kernels t is no

more than the size of each patch, i.e., t ≤ k. So by definition of d, d ≥ maxk, r, t.

We assume each sample (x, y) ∈ Rd×R is sampled from the following un-

derlying distribution with parameters W ∗ = [w∗1 w

∗2 · · · w∗

t ] ∈ Rk×t and activation

function φ(·),

D : x ∼ N(0, Id), y =t∑

j=1

r∑i=1

φ(w∗>j · Pi · x). (5.1)

Given a distribution D, we define the Expected Risk,

fD(W ) =1

2E

(x,y)∼D

t∑j=1

r∑i=1

φ(w>j · Pi · x)− y

2. (5.2)

Given a set of n samples

S = (x1, y1), (x2, y2), · · · , (xn, yn) ⊂ Rd × R,

61

Page 78: Copyright by Kai Zhong 2018

we define the Empirical Risk,

fS(W ) =1

2|S|∑

(x,y)∈S

t∑j=1

r∑i=1

φ(w>j · Pi · x)− y

2

. (5.3)

We calculate the gradient and the Hessian of fD(W ). The gradient and theHessian of fS(W ) are similar. For each j ∈ [t], the partial gradient of fD(W ) withrespect to wj can be represented as

∂fD(W )

∂wj

= E(x,y)∼D

[(t∑

l=1

r∑i=1

φ(w>l Pix)− y

)(r∑

i=1

φ′(w>j Pix)Pix

)].

For each j ∈ [t], the second partial derivative of fD(W ) with respect to wj

can be represented as

∂2fD(W )

∂w2j

= E(x,y)∼D

( r∑i=1

φ′(w>j Pix)Pix

)(r∑

i=1

φ′(w>j Pix)Pix

)>

+

(t∑

l=1

r∑i=1

φ(w>l Pix)− y

)(r∑

i=1

φ′′(w>j Pix)Pix(Pix)

>

)].

When W = W ∗, we have

∂2fD(W∗)

∂w2j

= E(x,y)∼D

( r∑i=1

φ′(w∗>j Pix)Pix

)(r∑

i=1

φ′(w∗>j Pix)Pix

)>.

For each j, l ∈ [t] and j 6= l, the second partial derivative of fD(W ) with respect to

62

Page 79: Copyright by Kai Zhong 2018

wj and wl can be represented as

∂2fD(W )

∂wj∂wl

= E(x,y)∼D

( r∑i=1

φ′(w>j Pix)Pix

)(r∑

i=1

φ′(w>l Pix)Pix

)>.

For activation function φ(z), we define the following new property for CNN.

Property 4. Let αq(σ) = Ez∼N(0,1)[φ′(σ · z)zq],∀q ∈ 0, 1, 2, and βq(σ) =

Ez∼N(0,1)[φ′2(σ · z)zq],∀q ∈ 0, 2. Let ρ(σ) denote

min β0(σ)− α20(σ)− α2

1(σ),

β2(σ)− α21(σ)− α2

2(σ),

α0(σ) · α2(σ)− α21(σ), α

20(σ).

The first derivative φ′(z) satisfies that, for all σ > 0, we have ρ(σ) > 0.

Note that Property 4 is different from Property 2 for one-hidden-layer fully-

connected NNs. We can show that most commonly used activations satisfy these

properties, such as ReLU (φ(z) = maxz, 0, ρ(σ) = 0.091), leaky ReLU (φ(z) =

maxz, 0.01z, ρ(σ) = 0.089), squared ReLU (φ(z) = maxz, 02, ρ(σ) = 0.27σ2)

and sigmoid (φ(z) = 1/(1 + e−z)2, ρ(1) = 0.049). Also note that when Prop-

erty 3(b) is satisfied, i.e., the activation function is non-smooth, but piecewise lin-

ear, i.e., φ′′(z) = 0 almost surely. Then the empirical Hessian exists almost surely

for a finite number of samples.

63

Page 80: Copyright by Kai Zhong 2018

5.3 Local Strong Convexity

In this section, we first show the eigenvalues of the Hessian at any fixed

point that is close to the ground truth are lower bounded and upper bounded by

two positives respectively w.h.p.. Then in the subsequent sections, we present the

main idea of the proofs step-by-step from special cases to general cases. Since we

assume t ≤ k, the following definition is well defined.

Definition 5.3.1. Given the ground truth matrix W ∗ ∈ Rk×t, let σi(W∗) denote the

i-th singular value of W ∗, often abbreviated as σi.

Let κ = σ1/σt, λ = (∏t

i=1 σi)/σtt . Let τ = (3σ1/2)

4p/minσ∈[σt/2,3σ1/2]ρ2(σ).

Theorem 5.3.1 (Lower and upper bound for the Hessian around the ground truth,informal version of Theorem C.3.2). For any W ∈ Rk×t with

‖W −W ∗‖ ≤ poly(1/r, 1/t, 1/κ, 1/λ, 1/ν, ρ/σ2p1 ) · ‖W ∗‖,

let S denote a set of i.i.d. samples from distribution D (defined in (5.1)) and

let the activation function satisfy Property 1,4,3. Then for any s ≥ 1, if |S| ≥

dpoly(s, t, r, ν, τ, κ, λ, σ2p1 /ρ, log d), we have with probability at least 1− d−Ω(s),

Ω(rρ(σt)/(κ2λ))I ∇2fS(W ) O(tr2σ2p

1 )I. (5.4)

Note that κ is the traditional condition number of W ∗, while λ is a more

involved condition number of W ∗. Both of them are 1 if W ∗ has orthonormal

columns. ρ(σ) is a number that is related to the activation function as defined in

Property 4. Property 4 requires ρ(σt) > 0, which is important for the PD-ness of

the Hessian.

64

Page 81: Copyright by Kai Zhong 2018

Here we show a special case when Property 4 is not satisfied and population

Hessian is only positive semi-definite. We consider quadratic activation function,

φ(z) = z2, i.e., W ∗ = Ik. Let A =[a1 a2 · · · ak

]∈ Rk×k. Then the smallest

eigenvalue of∇2f(W ∗) can be written as follows,

min‖A‖F=1

Ex∼Dd

( k∑j=1

r∑i=1

a>j xi · 2xij)

)2 = 4 · min

‖A‖F=1E

x∼Dd

(〈A, r∑i=1

xix>i 〉

)2.

Then as long as we set A such that A = −A>, we have 〈A,∑r

i=1 xix>i 〉 = 0 for

any x. Therefore, the smallest eigenvalue of the population Hessian at the ground

truth for the quadratic activation function is zero. That is to say, the Hessian is

only PSD but not PD. Also note that ρ(σ) = 0 for the quadratic activation function.

Therefore, Property 4 is important for the PD-ness of the Hessian.

Locally Linear Convergence of Gradient Descent

A caveat of Theorem 5.3.1 is that the lower and upper bounds of the Hessian

only hold for a fixed W given a set of samples. That is to say, given a set of samples,

Eq (5.4) doesn’t hold for all the W ’s that are close enough to the ground truth

w.h.p. at the same time. So we want to point out that this theorem doesn’t indicate

the classical local strong convexity, since the classical strong convexity requires all

the Hessians at any point at a local area to be PD almost surely. Fortunately, our

goal is to show the convergence of optimization methods and we can still show

gradient descent converges to the global optimal linearly given a sufficiently good

initialization.

65

Page 82: Copyright by Kai Zhong 2018

Theorem 5.3.2 (Linear convergence of gradient descent, informal version of Theo-

rem C.3.3). Let W be the current iterate satisfying

‖W −W ∗‖ ≤ poly(1/t, 1/r, 1/λ, 1/κ, ρ/σ2p1 )‖W ∗‖.

Let S denote a set of i.i.d. samples from distribution D (defined in (5.1)). Let the

activation function satisfy Property 1,4 and 3(a). Define

m0 = Θ(rρ(σt)/(κ2λ)) and M0 = Θ(tr2σ2p

1 ).

For any s ≥ 1, if we choose

|S| ≥ d · poly(s, t, log d, τ, κ, λ, σ2p1 /ρ),

and perform gradient descent with step size 1/M0 on fS(W ) and obtain the next

iterate,

W = W − 1

M0

∇fS(W ),

then with probability at least 1− d−Ω(s),

‖W −W ∗‖2F ≤ (1− m0

M0

)‖W −W ∗‖2F .

To show the linear convergence of gradient descent for one iteration, we

need to show that all the Hessians along the line between the current point to the

optimal point are PD, which can’t be satisfied by simple union bound, since there

are infinite number of Hessians. Our solution is to set a finite number of anchor

points that are equally distributed along the line, whose Hessians can be shown to

66

Page 83: Copyright by Kai Zhong 2018

be PD w.h.p. using union bound. Then we show all the points between two adjacent

anchor points have PD Hessians, since these points are much closer to the anchor

points than to the ground truth. The proofs are postponed to Appendix C.3.4.

Note that this theorem holds only for one iteration. For multiple iterations,

we need to do resampling at each iteration. However, since the number of iterations

required to achieve ε precision is O(log(1/ε)), the number of samples required is

also proportional to log(1/ε).

5.4 Initialization via Tensor Method

It is known that most tensor problems are NP-hard [63, 64] or even hard to

approximate [118]. Tensor decomposition method becomes efficient [4, 115, 132,

133] under some assumptions. We consider realizable setting and Gaussian inputs

assumption to show a provable and efficient tensor methods.

In this section, we discuss how to use tensor method to initialize the pa-

rameters to the local strong convexity region. Let’s define the following quan-

tities: γj(σ) = Ez∼N(0,1)[φ(σ · z)zj], ∀j = 0, 1, 2, 3. Let v ∈ Rd be a vector

and I be the identity matrix, define a special outer product ⊗ as follows, v⊗I :=∑dj=1[v ⊗ ej ⊗ ej + ej ⊗ v ⊗ ej + ej ⊗ ej ⊗ v].

We denote w = w/‖w‖ and xi = Pi · x. For each i ∈ [r], we can calculate

67

Page 84: Copyright by Kai Zhong 2018

the second-order and third-order moments,

Mi,2 = E(x,y)∼D[y · (xi ⊗ xi − I)] =t∑

j=1

(γ2(‖w∗j‖)− γ0(‖w∗

j‖))w∗⊗2j . (5.5)

Mi,3 = E(x,y)∼D[y · (x⊗3i − xi⊗I)] =

t∑j=1

(γ3(‖w∗j‖)− 3γ1(‖w∗

j‖))w∗⊗3j . (5.6)

For simplicity, we assume γ2(‖w∗j‖) 6= γ0(‖w∗

j‖) and γ3(‖w∗j‖) 6= 3γ1(‖w∗

j‖) for

any j ∈ [t], then Mi,2 6= 0 and Mi,3 6= 0. Note that when this assumption doesn’t

hold, we can seek for higher-order moments and then degrade them to second-order

moments or third-order moments. Now we can use non-orthogonal tensor decom-

position [78] to decompose the empirical version of Mi,3 and obtain the estimation

of w∗j for j ∈ [t]. According to [153], from the empirical version of Mi,2 and Mi,3,

we are able to estimate W ∗ to some precision.

Theorem 5.4.1. For any 0 < ε < 1 and s ≥ 1, if

|S| ≥ ε−2 · k · poly(s, t, κ, log d),

then there exists an algorithm (based on non-orthogonal tensor decomposition [78])

that takes O(tk|S|) time and outputs a matrix W (0) ∈ Rk×t such that, with proba-

bility at least 1− d−Ω(s),

‖W (0) −W ∗‖F ≤ ε · poly(t, κ)‖W ∗‖F .

Therefore, setting ε = ρ(σt)2/poly(t, κ, λ), W (0) will satisfy the initializa-

tion condition in Theorem B.3.2.

68

Page 85: Copyright by Kai Zhong 2018

Algorithm 5.5.1 Globally Converging Algorithm1: procedure LEARNING1CNN(S, T) . Theorem 5.5.12: η ← 1/(tr2σ2p

1 ).3: S0, S1, · · · , ST ← PARTITION(S, T + 1).4: W (0) ← TENSOR INITIALIZATION(S0).5: for q = 0, 1, 2, · · · , T − 1 do6: W (q+1) = W (q) − η∇fSq+1(W

(q))

7: Return W (T )

5.5 Recovery Guarantee

In this section, we can show the global convergence of gradient descent ini-

tialized by tensor method (Algorithm 8.1.1) by combining the local convergence

of gradient descent Theorem B.3.2 and the tensor initialization guarantee Theo-

rem 5.4.1.

Theorem 5.5.1 (Global convergence guarantees). Let S denote a set of i.i.d. sam-

ples from distribution D (defined in (5.1)) and let the activation function satisfying

Property 1, 4, 3(a). Then for any s ≥ 1 and any ε > 0, if

|S| ≥ d log(1/ε) · poly(log d, s, t, λ, r),

T ≥ log(1/ε) · poly(t, r, λ, σ2p1 /ρ),

η ∈ (0, 1/(tr2σ2p1 )],

then there is an algorithm (procedure LEARNING1CNN in Algorithm 8.1.1) taking

|S| · d · poly(log d, t, r, λ)

time and outputting a matrix W (T ) ∈ Rk×t satisfying

‖W (T ) −W ∗‖F ≤ ε‖W ∗‖F ,

69

Page 86: Copyright by Kai Zhong 2018

0 20 40 60 80 100number of samples

-50

-40

-30

-20

-10

0

10lo

g of

the

min

imal

eig

enva

lue

of H

essi

an

squared ReLUReLUsigmoidquadratic

0 500 1000 1500 2000iteration

-15

-10

-5

0

5

10

15

log(obj)

Figure 5.1: (a) (left) Minimal eigenvalue of Hessian at the ground truth for differentactivations against the sample size (b) (right) Convergence of gradient descent withdifferent random initializations.

with probability at least 1− d−Ω(s).

5.6 Numerical Experiments

In this section, we present experiments on synthetic data to verify our analysis. We set $W^* = U\Sigma V^\top$, where $U \in \mathbb{R}^{k\times t}$ and $V \in \mathbb{R}^{t\times t}$ are orthogonal matrices generated from the QR decomposition of Gaussian matrices, and $\Sigma$ is a diagonal matrix whose entries are $1,\ 1 + \frac{\kappa-1}{t-1},\ 1 + \frac{2(\kappa-1)}{t-1},\ \cdots,\ \kappa$, so that $\kappa$ is the condition number of $W^*$. The data points $\{x_i, y_i\}_{i=1,2,\cdots,n}$ are then generated from the distribution $\mathcal{D}$ (defined in Eq. (5.1)) with $W^*$. In this experiment, we set $\kappa = 2$, $d = 10$, $k = 5$, $r = 2$ and $t = 2$.
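As a small illustration (not part of the original experiments' code), the construction of such a $W^*$ with a prescribed condition number could look as follows in NumPy.

import numpy as np

def make_w_star(k, t, kappa, rng=np.random.default_rng(0)):
    """Build W* = U Sigma V^T with condition number kappa (sketch, assumes t >= 2)."""
    U, _ = np.linalg.qr(rng.standard_normal((k, t)))      # k x t, orthonormal columns
    V, _ = np.linalg.qr(rng.standard_normal((t, t)))      # t x t orthogonal
    sigma = 1.0 + (kappa - 1.0) * np.arange(t) / (t - 1)  # 1, ..., kappa
    return U @ np.diag(sigma) @ V.T

W_star = make_w_star(k=5, t=2, kappa=2.0)
print(np.linalg.cond(W_star))   # approximately 2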

In our first experiment, we examine the minimal eigenvalue of the Hessian at the ground truth for different numbers of samples and different activation functions. As we can see from Fig. 5.1(a), the minimal eigenvalues using ReLU, squared ReLU and sigmoid activations are positive, while the minimal eigenvalue of the Hessian using the quadratic activation is zero. Note that we use a log scale for the y-axis. Also, we can see that as the sample size increases, the minimal eigenvalues converge to the minimal eigenvalue of the population Hessian.

In the second experiment, we demonstrate how gradient descent converges. We use squared ReLU as an example, pick step size $\eta = 0.01$ for gradient descent and set $n = 1000$. In this experiment, we do not resample at each iteration, since the algorithm still works well without resampling. The results are shown in Fig. 5.1(b), where different lines correspond to different initializations sampled from the normal distribution. All lines share two properties: 1) they converge to the global optimum; 2) they exhibit a linear convergence rate once the objective value is close to zero, which verifies Theorem B.3.2.

5.7 Related Work

With the great success of neural networks, there is an increasing amount

of literature that provides theoretical analysis and guarantees for NNs. Some of

them measure the expressive power of NNs [31, 32, 34, 99, 101, 124] in order to

explain the remarkable performance of NNs on complex tasks. Many other works

try to handle the non-convexity of NNs by showing that the global optima or local

minima close to the global optima will be achieved when the number of parameters

is large enough [35, 57, 59, 89, 105]. However, such over-parameterization may also easily overfit the training data and limit generalization.

In this work, we consider parameter recovery guarantees, where the typical


setting is to assume an underlying model and then try to recover the model. Once the

parameters of the underlying model are recovered, generalization performance will

also be guaranteed. Many non-convex problems, such as matrix completion/sensing

[71] and mixed linear regression [151], have nice recovery guarantees. Recovery

guarantees for FCNNs have been studied in several works by different approaches.

One of the approaches is the tensor method [72, 109]. In particular, [109] guarantee to recover the subspace spanned by the weight matrix but give no sample complexity, while [72] provide the recovery of the parameters and require $O(d^3/\epsilon^2)$ sample complexity. [125, 126, 153] consider the recovery of one-hidden-layer FCNNs

using algorithms based on gradient descent. [125, 126] provide recovery guarantees

for one-hidden-layer FCNNs with orthogonal weight matrix and ReLU activations

given infinite number of samples sampled from Gaussian distribution. [153] show

the local strong convexity of the squared loss for one-hidden-layer FCNNs and use

tensor method to initialize the parameters to the local strong convexity region fol-

lowed by gradient descent that finally converges to the ground truth parameters. In

this work, we consider the recovery guarantees for non-overlapping CNNs.

There is a growing body of theoretical literature on CNNs. [32] consider CNNs as a generalized tensor decomposition and show the expressive power and depth efficiency of CNNs. [96] studied the loss surface of CNNs. [21] provide a global convergence guarantee of gradient descent on one-hidden-layer CNNs. [38]

eliminate the Gaussian input assumption and only require a weaker assumption on

the inputs. However, 1) their analysis depends on ReLU activations, 2) they only

consider one kernel. [39] shows that with random initialization gradient descent


with weight normalization converges to the ground truth parameters with constant

probability. In this chapter, we provide recovery guarantees for CNNs with multiple kernels and give a sample complexity analysis. Moreover, our analysis applies to a large range of activations, including the most commonly used ones. Another approach for CNNs that is worth mentioning is convex relaxation [149], where the class of CNN filters is relaxed to a reproducing kernel Hilbert space (RKHS). They show a generalization error bound for this relaxation. However, to pair with the RKHS, only a few uncommonly used activations are compatible with their analysis. Also, the function learned by convex relaxation is no longer the original CNN. Recently, [53] applied isotonic regression to learning CNNs with overlapping patches. It uses a milder assumption on the data input and does not need a special initialization. However, it cannot handle multiple kernels either.


Chapter 6

Non-linear Inductive Matrix Completion

This chapter considers non-linear inductive matrix completion that has ap-

plications in recommendation systems. The goal of a recommendation system is to

predict the interest of a user in a given item by exploiting the existing set of ratings

as well as certain user/item features. A standard approach to modeling this problem

is Inductive Matrix Completion where the predicted rating is modeled as an inner

product of the user and the item features projected onto a latent space. In order to

learn the parameters effectively from a small number of observed ratings, the latent

space is constrained to be low-dimensional which implies that the parameter matrix

is constrained to be low-rank. However, such bilinear modeling of the ratings can

be limiting in practice and non-linear prediction functions can lead to significant

improvements. A natural approach to introducing non-linearity in the prediction

function is to apply a non-linear activation function on top of the projected user/item

features. Imposition of non-linearities further complicates an already challenging

problem that has two sources of non-convexity: a) low-rank structure of the pa-

rameter matrix, and b) non-linear activation function. We show that one can still

solve the non-linear Inductive Matrix Completion problem using gradient descent

type methods as long as the solution is initialized well. That is, close to the optima,

the optimization function is strongly convex and hence admits standard optimiza-


tion techniques, at least for certain activation functions, such as sigmoid and tanh. We also highlight the importance of the activation function and show how ReLU can behave significantly differently than, say, a sigmoid function. Finally, we apply

our proposed technique to recommendation systems and semi-supervised cluster-

ing, and show that our method can lead to much better performance than standard

linear Inductive Matrix Completion methods.

6.1 Introduction to Inductive Matrix Completion

Matrix Completion (MC) or Collaborative filtering [24, 55] is by now a

standard technique to model recommendation systems problems where a few user-

item ratings are available and the goal is to predict ratings for any user-item pair.

However, standard collaborative filtering suffers from two drawbacks: 1) Cold-start problem: MC cannot give predictions for new users or items; 2) Missing side-information: MC cannot leverage side-information that is typically present in recommendation systems, such as features for users/items. Consequently, several methods [2, 69, 104, 136] have been proposed to leverage the side information together with the ratings. Inductive matrix completion (IMC) [2, 69] is one of the most popular methods in this class.

IMC models the ratings as the inner product between certain linear mappings of the user/item features, i.e., $A(x, y) = \langle U^\top x, V^\top y\rangle$, where $A(x, y)$ is the predicted rating of user $x$ for item $y$, and $x \in \mathbb{R}^{d_1}$, $y \in \mathbb{R}^{d_2}$ are the feature vectors. The parameters $U \in \mathbb{R}^{d_1\times k}$, $V \in \mathbb{R}^{d_2\times k}$ ($k \leq d_1$, $k \leq d_2$) can typically be learned using a small number of observed ratings [69].


However, the bilinear structure of IMC is fairly simplistic and limiting in

practice and might lead to fairly poor accuracy on real-world recommendation prob-

lems. For example, consider the Youtube recommendation system [33] that requires

predictions over videos. Naturally, a linear function over the pixels of videos will

lead to fairly inaccurate predictions and hence one needs to model the videos using

non-linear networks. The survey paper by [145] presents many more such exam-

ples, where we need to design a non-linear ratings prediction function for the input

features, including [82] for image recommendation, [131] for music recommenda-

tion and [143] for recommendation systems with multiple types of inputs.

We can introduce non-linearity in the prediction function using several stan-

dard techniques, however, if our parameterization admits too many free parameters

then learning them might be challenging as the number of available user-item rat-

ings tend to be fairly small. Instead, we use a simple non-linear extension of IMC

that can control the number of parameters to be estimated. Note that IMC based pre-

diction function can be viewed as an inner product between certain latent user-item

features where the latent features are a linear map of the raw user-item features.

To introduce non-linearity, we can use a non-linear mapping of the raw user-item

features rather than the linear mapping used by IMC. This leads to the following

general framework that we call non-linear inductive matrix completion (NIMC),

A(x, y) = 〈U(x),V(y)〉, (6.1)

where x ∈ X, y ∈ Y are the feature vectors, A(x, y) is their rating and U : X →

S,V : Y → S are non-linear mappings from the raw feature space to the latent

space.


The above general framework reduces to standard inductive matrix comple-

tion when U,V are linear mappings and further reduces to matrix completion when

xi, yj are unit vectors ei, ej for i-th item and j-th user respectively. When [xi, ei]

is used as the feature vector and U is restricted to be a two-block (one for xi and

the other for ei) diagonal matrix, then the above framework reduces to the dirtyIMC

model [29]. Similarly, U/V can also be neural networks (NNs), such as feedforward

NNs [33, 112], convolutional NNs for images and recurrent NNs for speech/text.

In this chapter, we focus on a simple nonlinear activation based mapping for the user-item features. That is, we set $U(x) = \phi(U^{*\top}x)$ and $V(y) = \phi(V^{*\top}y)$, where $\phi$ is a nonlinear activation function. Note that if $\phi$ is ReLU, then the latent space is guaranteed to lie in the non-negative orthant, which in itself can be a desirable property for certain recommendation problems.

Note that parameter estimation in both IMC and NIMC models is hard due to the non-convexity of the corresponding optimization problem. However, for "nice" data, several strong results are known for the linear models, such as [24, 46, 71] for MC and [29, 69, 136] for IMC. In contrast, the non-linearity in NIMC models adds to the complexity of an already challenging problem and has not been studied extensively, despite its popularity in practice.

In this chapter, we study a simple one-layer neural network style NIMC

model mentioned above. In particular, we formulate a squared-loss based optimiza-

tion problem for estimating parameters U∗ and V ∗. We show that under a realizable

model and Gaussian input assumption, the objective function is locally strongly

convex within a ”reasonably large” neighborhood of the ground truth. Moreover,


we show that the above strong convexity claim holds even if the number of ob-

served ratings is nearly-linear in dimension and polynomial in the conditioning of

the weight matrices. In particular, for well-conditioned matrices, we can recover

the underlying parameters using only poly log(d1 + d2) user-item ratings, which is

critical for practical recommendation systems as they tend to have very few ratings

available per user. Our analysis covers popular activation functions, e.g., sigmoid and ReLU, and discusses various subtleties that arise due to the activation function. Finally, we discuss how we can leverage standard tensor decomposition techniques to initialize our parameters well. We would like to stress that practitioners typically use random initialization itself, and hence results studying random initialization for the NIMC model would be of significant interest.

As mentioned above, due to the non-linearity of the activation function along with the non-convexity of the parameter space, the existing proof techniques do not apply directly to the problem. Moreover, we have to carefully argue about both the optimization landscape and the sample complexity of the algorithm, which has not been carefully studied for neural networks. Our proof establishes some new techniques that might be of independent interest, e.g., how to handle the redundancy in the parameters for the ReLU activation. To the best of our knowledge, this is one of the first theoretically rigorous studies of neural-network based recommendation systems and will hopefully be a stepping stone for a similar analysis of "deeper" neural-network based recommendation systems. We would also like to highlight that our

model can be viewed as a strict generalization of a one-hidden layer neural network,

hence our result represents one of the few rigorous guarantees for models that are


more powerful than one-hidden layer neural networks [22, 85, 153].

Finally, we apply our model on synthetic datasets and verify our theoretical

analysis. Further, we compare our NIMC model with standard linear IMC on sev-

eral real-world recommendation-type problems, including user-movie rating pre-

diction, gene-disease association prediction and semi-supervised clustering. NIMC

demonstrates significantly superior performance over IMC.

Roadmap. We first present the formal model and the corresponding opti-

mization problem in Section 6.2. We then present the local strong convexity and

local linear convergence results in Section 6.3. Finally, we demonstrate the empiri-

cal superiority of NIMC when compared to linear IMC (Section 6.5).

6.2 Problem Formulation

Consider a user-item recommender system, where we have $n_1$ users with feature vectors $X := \{x_i\}_{i\in[n_1]} \subseteq \mathbb{R}^{d_1}$, $n_2$ items with feature vectors $Y := \{y_j\}_{j\in[n_2]} \subseteq \mathbb{R}^{d_2}$, and a collection of partially-observed user-item ratings, $A_{\mathrm{obs}} = \{A(x, y)\mid(x, y) \in \Omega \subseteq X \times Y\}$. That is, $A(x_i, y_j)$ is the rating that user $x_i$ gave for item $y_j$. For simplicity, we assume the $x_i$'s and $y_j$'s are sampled i.i.d. from distributions $X$ and $Y$, respectively. Each element of the index set $\Omega$ is also sampled independently and uniformly with replacement from $S := X \times Y$.

In this chapter, our goal is to predict the rating for any user-item pair with feature vectors $x$ and $y$, respectively. We model the user-item ratings as:
$$A(x, y) = \phi(U^{*\top}x)^\top\phi(V^{*\top}y), \qquad (6.2)$$
where $U^* \in \mathbb{R}^{d_1\times k}$, $V^* \in \mathbb{R}^{d_2\times k}$ and $\phi$ is a non-linear activation function. Under this realizable model, our goal is to recover $U^*, V^*$ from a collection of observed entries, $\{A(x, y)\mid(x, y) \in \Omega\}$. Without loss of generality, we set $d_1 = d_2$. We also treat $k$ as a constant throughout this chapter. Our analysis requires $U^*, V^*$ to be full column rank, so we require $k \leq d$. W.l.o.g., we assume $\sigma_k(U^*) = \sigma_k(V^*) = 1$, i.e., the smallest singular value of both $U^*$ and $V^*$ is 1.

Note that this model is similar to the one-hidden-layer feed-forward network popular in standard classification/regression tasks. However, as there is an inner product between the outputs of two non-linear layers, $\phi(U^{*\top}x)$ and $\phi(V^{*\top}y)$, it cannot be modeled by a single-hidden-layer neural network (with the same number of nodes). Also, for a linear activation function, the problem reduces to inductive matrix completion [2, 69].

Now, to solve for $U^*, V^*$, we optimize a simple squared-loss based optimization problem, i.e.,
$$\min_{U\in\mathbb{R}^{d_1\times k},\,V\in\mathbb{R}^{d_2\times k}} f_\Omega(U, V) = \sum_{(x,y)\in\Omega}\big(\phi(U^\top x)^\top\phi(V^\top y) - A(x, y)\big)^2. \qquad (6.3)$$
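As a purely illustrative aid (not the authors' implementation), the objective (6.3) and one gradient-descent step can be written in a few lines of NumPy; the sigmoid activation and the random-initialization choice here are assumptions made for the sketch.

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

def nimc_loss_and_grads(U, V, samples):
    """Squared loss of Eq. (6.3) and its gradients over observed (x, y, A(x, y)) triples."""
    gU, gV, loss = np.zeros_like(U), np.zeros_like(V), 0.0
    for x, y, a in samples:
        p, q = sigmoid(U.T @ x), sigmoid(V.T @ y)      # latent user/item features
        r = p @ q - a                                  # residual
        loss += r ** 2
        gU += 2.0 * r * np.outer(x, q * dsigmoid(U.T @ x))
        gV += 2.0 * r * np.outer(y, p * dsigmoid(V.T @ y))
    return loss, gU, gV

# one gradient-descent step from a random initialization (illustration only)
rng = np.random.default_rng(0)
d1, d2, k = 10, 10, 3
U, V = rng.standard_normal((d1, k)), rng.standard_normal((d2, k))
samples = [(rng.standard_normal(d1), rng.standard_normal(d2), 0.5)]
loss, gU, gV = nimc_loss_and_grads(U, V, samples)
U, V = U - 0.01 * gU, V - 0.01 * gV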

Naturally, the above problem is a challenging non-convex optimization problem that is strictly harder than two non-convex optimization problems which are challenging in their own right: a) linear inductive matrix completion, where non-convexity arises due to the bilinearity of $UV^\top$, and b) the standard one-hidden-layer neural network (NN). In fact, a lot of recent research has focused on understanding various properties of both the linear inductive matrix completion problem [46, 69] as well as the one-hidden-layer NN [47, 153].

In this chapter, we show that despite the non-convexity of Problem (6.3), it

behaves as a convex optimization problem close to the optima if the data is sampled

stochastically from a Gaussian distribution. This result combined with standard

tensor decomposition based initialization [72, 78, 153] leads to a polynomial time

algorithm for solving (6.3) optimally if the data satisfies certain sampling assump-

tions in Theorem 6.2.1. Moreover, we also discuss the effect of various activation functions, especially the difference between a sigmoid activation function and a ReLU activation (see Theorem 6.3.1 and Theorem 6.3.3).

Informally, our recovery guarantee can be stated as follows.
Theorem 6.2.1 (Informal Recovery Guarantee). Consider a recommender system with the realizable model of Eq. (6.2) with sigmoid activation. Assume the features $\{x_i\}_{i\in[n_1]}$ and $\{y_j\}_{j\in[n_2]}$ are sampled i.i.d. from the normal distribution, and the observed pairs $\Omega$ are sampled i.i.d. from $\{x_i\}_{i\in[n_1]} \times \{y_j\}_{j\in[n_2]}$ uniformly at random. Then there exists an algorithm such that $U^*, V^*$ can be recovered to any precision $\epsilon$ with time complexity and sample complexity (referring to $n_1$, $n_2$, $|\Omega|$) polynomial in the dimension and the condition number of $U^*, V^*$, and logarithmic in $1/\epsilon$.

6.3 Local Strong Convexity

Our main result shows that when initialized properly, gradient-based algo-

rithms will be guaranteed to converge to the ground truth. We first study the Hes-

sian of empirical risk for different activation functions, then based on the positive-

definiteness of the Hessian for smooth activations, we show local linear conver-

gence of gradient descent. The proof sketch is provided in Appendix D.1.

The positive definiteness of the Hessian does not hold for several activation functions. Here we provide some examples. Counter Example 1) The Hessian at the ground truth for the linear activation is not positive definite, because for any full-rank matrix $R \in \mathbb{R}^{k\times k}$, $(U^*R,\ V^*R^{-\top})$ is also a global optimum. Counter Example 2) The Hessian at the ground truth for the ReLU activation is not positive definite, because for any diagonal matrix $D \in \mathbb{R}^{k\times k}$ with positive diagonal elements, $(U^*D,\ V^*D^{-1})$ is also a global optimum. These counter examples have a common property: there is redundancy in the parameters. Surprisingly, for sigmoid and tanh, the Hessian around the ground truth is positive definite. More surprisingly, we will later show that for ReLU, if the parameter space is constrained properly, its Hessian at a given point near the ground truth can also be proved to be positive definite with high probability.
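A quick numerical check (not from the original text) illustrates the parameter redundancy behind Counter Example 1: for the linear activation, reparameterizing by any invertible $R$ leaves every prediction unchanged.

import numpy as np

rng = np.random.default_rng(1)
d, k = 6, 3
U, V = rng.standard_normal((d, k)), rng.standard_normal((d, k))
R = rng.standard_normal((k, k))                 # any invertible matrix (w.h.p.)
U2, V2 = U @ R, V @ np.linalg.inv(R).T          # the reparameterized pair

x, y = rng.standard_normal(d), rng.standard_normal(d)
pred1 = (U.T @ x) @ (V.T @ y)                   # linear activation: phi = identity
pred2 = (U2.T @ x) @ (V2.T @ y)
print(np.isclose(pred1, pred2))                 # True: predictions coincide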

Local Geometry and Local Linear Convergence for Sigmoid and Tanh

We define two natural condition numbers that capture the "hardness" of the problem:
Definition 6.3.1. Define $\lambda := \max\{\lambda(U^*), \lambda(V^*)\}$ and $\kappa := \max\{\kappa(U^*), \kappa(V^*)\}$, where $\lambda(U) = \sigma_1^k(U)/\big(\prod_{i=1}^k\sigma_i(U)\big)$, $\kappa(U) = \sigma_1(U)/\sigma_k(U)$, and $\sigma_i(U)$ denotes the $i$-th singular value of $U$ with the ordering $\sigma_i \geq \sigma_{i+1}$.

First we show the result for sigmoid and tanh activations.
Theorem 6.3.1 (Positive Definiteness of Hessian for Sigmoid and Tanh). Let the activation function $\phi$ in the NIMC model (6.2) be sigmoid or tanh and let $\kappa, \lambda$ be as defined in Definition 6.3.1. Then for any $t > 1$ and any given $U, V$, if
$$n_1 \gtrsim t\lambda^4\kappa^2 d\log^2 d, \quad n_2 \gtrsim t\lambda^4\kappa^2 d\log^2 d, \quad |\Omega| \gtrsim t\lambda^4\kappa^2 d\log^2 d,$$
and $\|U - U^*\| + \|V - V^*\| \lesssim 1/(\lambda^2\kappa)$, then with probability at least $1 - d^{-t}$, the smallest eigenvalue of the Hessian of Eq. (6.3) is lower bounded by:
$$\lambda_{\min}(\nabla^2 f_\Omega(U, V)) \gtrsim 1/(\lambda^2\kappa).$$

Remark. Theorem 6.3.1 shows that, given a sufficiently large number of user-item ratings and a sufficiently large number of users/items themselves, the Hessian at a point close enough to the true parameters $U^*, V^*$ is positive definite with high probability. The sample complexity, including $n_1$, $n_2$ and $|\Omega|$, has a near-linear dependency on the dimension, which matches the linear IMC analysis [69]. The strong convexity parameter as well as the sample complexity depend on the condition numbers of $U^*, V^*$ as defined in Definition 6.3.1. Although we do not explicitly show the dependence on $k$, both the sample complexity and the minimal eigenvalue scale polynomially in $k$. The proofs can be found in Appendix D.1.

The above theorem shows that the Hessian is positive definite w.h.p. for a given $U, V$ that is close to the optima. This result, along with the smoothness of the activation function, implies linear convergence of gradient descent that samples a fresh batch of samples in each iteration, as shown in the following theorem, whose proof is postponed to Appendix D.3.1.
Theorem 6.3.2. Let $[U^c, V^c]$ be the parameters in the $c$-th iteration. Assume $\|U^c - U^*\| + \|V^c - V^*\| \lesssim 1/(\lambda^2\kappa)$. Then, given a fresh sample set $\Omega$ that is independent of $[U^c, V^c]$ and satisfies the conditions in Theorem 6.3.1, the next iterate using one step of gradient descent, i.e., $[U^{c+1}, V^{c+1}] = [U^c, V^c] - \eta\nabla f_\Omega(U^c, V^c)$, satisfies
$$\|U^{c+1} - U^*\|_F^2 + \|V^{c+1} - V^*\|_F^2 \leq (1 - M_l/M_u)\big(\|U^c - U^*\|_F^2 + \|V^c - V^*\|_F^2\big)$$
with probability $1 - d^{-t}$, where $\eta = \Theta(1/M_u)$ is the step size, $M_l \gtrsim 1/(\lambda^2\kappa)$ is the lower bound on the eigenvalues of the Hessian, and $M_u \lesssim 1$ is the upper bound on the eigenvalues of the Hessian.

Remark. The linear convergence requires a fresh set of samples at each iteration. However, since the iterates converge linearly to the ground truth, we only need $\log(1/\epsilon)$ iterations, and therefore the sample complexity is only logarithmic in $1/\epsilon$. This dependency is better than directly using the tensor decomposition method [72], which requires $O(1/\epsilon^2)$ samples. Note that we only use tensor decomposition to initialize the parameters; therefore the sample complexity required by our tensor initialization does not depend on $\epsilon$.

Empirical Hessian around the Ground Truth for ReLU

We now present our result for the ReLU activation. As we saw in Counter Example 2, without any further modification, the objective for ReLU is not locally strongly convex due to the redundancy in the parameters. Therefore, we reduce the parameter space by fixing one parameter for each $(u_i, v_i)$ pair, $i \in [k]$. In particular, we fix $u_{1,i} = u^*_{1,i}$ for all $i \in [k]$ when minimizing the objective function, Eq. (6.3), where $u_{1,i}$ is the $i$-th element in the first row of $U$. Note that as long as $u^*_{1,i} \neq 0$, $u_{1,i}$ can be fixed to any other non-zero value; we set $u_{1,i} = u^*_{1,i}$ just for simplicity of the proof. The new objective function can be represented as
$$f^{\mathrm{ReLU}}_\Omega(W, V) = \frac{1}{2|\Omega|}\sum_{(x,y)\in\Omega}\big(\phi(W^\top x_{2:d} + x_1 (u^{*(1)})^\top)^\top\phi(V^\top y) - A(x, y)\big)^2, \qquad (6.4)$$
where $u^{*(1)}$ is the first row of $U^*$ and $W \in \mathbb{R}^{(d-1)\times k}$.

Surprisingly, after fixing one parameter for each $(u_i, v_i)$ pair, the Hessian using ReLU is also positive definite w.h.p. for a given $(U, V)$ around the ground truth.
Theorem 6.3.3 (Positive Definiteness of Hessian for ReLU). Define $u_0 := \min_{i\in[k]}\{|u^*_{1,i}|\}$. For any $t > 1$ and any given $U, V$, if
$$n_1 \gtrsim u_0^{-4} t\lambda^4\kappa^{12} d\log^2 d, \quad n_2 \gtrsim u_0^{-4} t\lambda^4\kappa^{12} d\log^2 d, \quad |\Omega| \gtrsim u_0^{-4} t\lambda^4\kappa^{12} d\log^2 d,$$
$$\|W - W^*\| + \|V - V^*\| \lesssim u_0^4/(\lambda^4\kappa^{12}),$$
then with probability $1 - d^{-t}$, the minimal eigenvalue of the Hessian of the objective for the ReLU activation function, Eq. (6.4), is lower bounded:
$$\lambda_{\min}(\nabla^2 f^{\mathrm{ReLU}}_\Omega(W, V)) \gtrsim u_0^2/(\lambda^2\kappa^4).$$

Remark. Similar to the sigmoid/tanh case, the sample complexity for the ReLU case also has a linear dependency on the dimension. However, here we have a worse dependency on the condition number of the weight matrices. The scale of $u_0$ can also be important, and in practice one needs to set it carefully. Note that although the activation function is not smooth, the Hessian at a given point still exists with probability 1, since ReLU is smooth almost everywhere and there are only a finite number of samples. However, owing to the non-smoothness, a proof of convergence of the gradient descent method for ReLU is still an open problem.

6.4 Initialization and Recovery Guarantee

To recover the ground truth, our algorithm needs a good initialization method that can place the parameters in the neighborhood of the ground truth. Here we show that this is possible by using the tensor method under the Gaussian assumption.
In the following, we consider estimating $U^*$; estimating $V^*$ is similar. Define $M_3 := \mathbb{E}[A(x, y)\cdot(x^{\otimes 3} - x\,\widetilde{\otimes}\,I)]$, where $x\,\widetilde{\otimes}\,I := \sum_{j=1}^d[x\otimes e_j\otimes e_j + e_j\otimes x\otimes e_j + e_j\otimes e_j\otimes x]$. Define $\gamma_j(\sigma) := \mathbb{E}[\phi(\sigma\cdot z)z^j]$ for $j = 0, 1, 2, 3$. Then $M_3 = \sum_{i=1}^k \alpha_i\,\bar{u}_i^{*\otimes 3}$, where $\bar{u}_i^* = u_i^*/\|u_i^*\|$ and $\alpha_i = \gamma_0(\|v_i^*\|)\big(\gamma_3(\|u_i^*\|) - 3\gamma_1(\|u_i^*\|)\big)$. When $\alpha_i \neq 0$, we can approximately recover $\alpha_i$ and $\bar{u}_i^*$ from the empirical version of $M_3$ using non-orthogonal tensor decomposition [78]. When $\phi$ is the sigmoid, $\gamma_0(\|v_i^*\|) = 0.5$. Given $\alpha_i$, we can estimate $\|u_i^*\|$, since $\alpha_i$ is a monotonic function of $\|u_i^*\|$. Applying Lemma B.7 in [153], we can bound the approximation error between the empirical $M_3$ and the population $M_3$ using a polynomial number of samples. By [78], we can bound the estimation errors of $\|u_i^*\|$ and $\bar{u}_i^*$. Finally, combining with Theorem 6.3.1, we are able to show the recovery guarantee for the sigmoid activation, i.e., Theorem 6.2.1.

Figure 6.1: The rate of success of GD over synthetic data, plotted against the number of samples $n$ and the normalized number of observations $m/(2kd)$. Left: sigmoid; Right: ReLU. White blocks denote 100% success rate.

Although tensor initialization has nice theoretical guarantees and sample complexity, it heavily depends on the Gaussian assumption and the realizable model assumption. In contrast, practitioners typically use random initialization.

6.5 Experiments on Synthetic and Real-world Data

In this section, we show experimental results on both synthetic data and real-

world data. Our experiments on synthetic data are intended to verify our theoretical

analysis, while the real-world data shows the superior performance of NIMC over

IMC. We apply gradient descent with random initialization to both NIMC and IMC.

Synthetic Data

We first generate synthetic datasets to verify the sample complexity and the convergence of gradient descent using random initialization. We fix $k = 5$, $d = 10$. For sigmoid, we set the number of samples $n_1 = n_2 = n \in \{10\cdot i\}_{i=1,2,\cdots,10}$


Table 6.1: The error rate in semi-supervised clustering using NIMC and IMC.

Dataset     n      d    k   |Ω|   NIMC     IMC      NIMC-RFF   IMC-RFF
mushroom    8124   112  2   5n    0        0.0049   0          0
                            20n   0        0.0010   0          0
segment     2310   19   7   5n    0.0543   0.0694   0.0197     0.0257
                            20n   0.0655   0.0768   0.0092     0.0183
covtype     1708   54   7   5n    0.1671   0.1733   0.1548     0.1529
                            20n   0.1555   0.1600   0.1200     0.1307
letter      15000  16   26  5n    0.0590   0.0704   0.0422     0.0430
                            20n   0.0664   0.0760   0.0321     0.0356
yalefaces   2452   100  38  5n    0.0315   0.0329   0.0266     0.0273
                            20n   0.0212   0.0277   0.0064     0.0142
usps        7291   256  10  5n    0.0211   0.0361   0.0301     0.0185
                            20n   0.0184   0.0320   0.0199     0.0152

and the number of observations $|\Omega| = m \in \{2kd\cdot i\}_{i=1,2,\cdots,10}$. For ReLU, we set $n \in \{20\cdot i\}_{i=1,2,\cdots,10}$ and $m \in \{4kd\cdot i\}_{i=1,2,\cdots,10}$. The sampling rule follows our previous assumptions. For each $(n, m)$ pair, we run 5 trials and report the average number of successful recoveries. We say a solution $(U, V)$ successfully recovers the ground truth parameters when the solution achieves $0.001$ relative testing error, i.e., $\|\phi(X_tU)\phi(X_tV)^\top - \phi(X_tU^*)\phi(X_tV^*)^\top\|_F \leq 0.001\cdot\|\phi(X_tU^*)\phi(X_tV^*)^\top\|_F$, where $X_t \in \mathbb{R}^{n\times d}$ is a newly sampled testing dataset. For both ReLU and sigmoid, we minimize the original objective function (6.3). We illustrate the recovery rate in Figure 6.1. As we can see, ReLU requires more samples/observations than sigmoid for exact recovery (note that the scales of $n$ and $m/(2kd)$ are different in the two figures). This is consistent with our theoretical results: comparing Theorem 6.3.1 and Theorem 6.3.3, we can see that the sample complexity for ReLU has a worse dependency on the conditioning of $U^*, V^*$ than that for sigmoid. We can also see that when $n$ is sufficiently large, the number of observed ratings required remains the same for both methods. This is also consistent with the theorems, where $|\Omega|$ is near-linear in $d$ and is independent of $n$.

Semi-supervised Clustering

We apply NIMC to semi-supervised clustering and follow the experimental setting in GIMC [112]. In this problem, we are given a set of items with their features, $X \in \mathbb{R}^{n\times d}$, where $n$ is the number of items and $d$ is the feature dimension, and an incomplete similarity matrix $A$, where $A_{i,j} = 1$ if the $i$-th item and the $j$-th item are similar and $A_{i,j} = 0$ if they are dissimilar. The goal is to cluster using both the existing features and the partially observed similarity matrix. We build the dataset from a classification dataset where the label of each item is known and is used as the ground-truth cluster. We first compute the similarity matrix from the labels and sample $|\Omega|$ entries uniformly as the observed entries. Since there is only one set of features, we set $y_j = x_j$ in the objective function Eq. (6.3).

We initialize $U$ and $V$ to be the same Gaussian random matrix and then apply gradient descent; this guarantees that $U$ and $V$ stay identical during the optimization process. Once $U$ converges, we take the top $k$ left singular vectors of $\phi(XU)$ to do $k$-means clustering. Following [112], we define the clustering error as
$$\mathrm{error} = \frac{2}{n(n-1)}\Big(\sum_{(i,j):\pi^*_i = \pi^*_j}\mathbf{1}_{\pi_i\neq\pi_j} + \sum_{(i,j):\pi^*_i\neq\pi^*_j}\mathbf{1}_{\pi_i=\pi_j}\Big),$$
where $\pi^*$ is the ground-truth clustering and $\pi$ is the predicted clustering.
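For reference, this pairwise clustering error can be computed with a few lines of NumPy; the sketch below is an illustration only and assumes the sums in the definition range over unordered pairs $i < j$.

import numpy as np

def clustering_error(pi_pred, pi_true):
    """Fraction of item pairs on which the predicted and true clusterings disagree."""
    pi_pred, pi_true = np.asarray(pi_pred), np.asarray(pi_true)
    n = len(pi_true)
    same_true = pi_true[:, None] == pi_true[None, :]
    same_pred = pi_pred[:, None] == pi_pred[None, :]
    disagree = np.triu(same_true != same_pred, k=1)   # unordered pairs i < j
    return 2.0 * disagree.sum() / (n * (n - 1))

print(clustering_error([0, 0, 1, 1], [1, 1, 0, 0]))   # 0.0 (labelings are equivalent)
print(clustering_error([0, 1, 0, 1], [0, 0, 1, 1]))   # 2/3 of pairs disagree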

We compare NIMC with a ReLU activation function against IMC on six datasets, using both raw features and random Fourier features (RFF). The random Fourier feature map is $r(x) = \frac{1}{\sqrt{q}}\big[\sin(Qx)^\top\ \cos(Qx)^\top\big]^\top \in \mathbb{R}^{2q}$, where each entry of $Q \in \mathbb{R}^{q\times d}$ is i.i.d. sampled from $N(0, \sigma)$. We use random Fourier features in order to see how increasing the depth of the neural network changes the performance. However, our analysis only works for one-layer neural networks; therefore, we use random Fourier features, which can be viewed as a two-layer neural network with the first-layer parameters fixed.
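A minimal sketch of this feature map (illustration only; the NumPy implementation and the fixed random seed are assumptions):

import numpy as np

def random_fourier_features(X, q=100, sigma=1.0, seed=0):
    """Map rows of X (n x d) to r(x) = (1/sqrt(q)) [sin(Qx); cos(Qx)] in R^{2q}."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Q = sigma * rng.standard_normal((q, d))   # entries i.i.d. Gaussian with scale sigma (N(0, sigma) in the text)
    Z = X @ Q.T                               # (n, q) projections Qx
    return np.hstack([np.sin(Z), np.cos(Z)]) / np.sqrt(q)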

$\sigma$ is chosen such that a linear classifier using these random features achieves the best classification accuracy, and $q$ is set to 100 for all datasets. The datasets mushroom, segment, letter, usps and covtype are downloaded from the LIBSVM website. We subsample the covtype dataset to balance the samples from different classes, and we preprocess the yalefaces dataset as described in [79]. As shown in Table 6.1, when using raw features, NIMC achieves better clustering results than IMC in all cases. This is also true for most cases when using random Fourier features.

Recommendation Systems

Recommender systems are used in many real-world situations. Here we consider two tasks.
Movie recommendation for users. We use the MovieLens [1] dataset, which has not only the ratings users give movies but also the users' demographic information and the movies' genre information. Our goal is to predict the ratings that new users will give to the existing movies.


Table 6.2: Test RMSE for recommending new users with movies on the Movielens dataset.

Dataset   #movies  #users  #ratings    #movie feat.  #user feat.  RMSE (NIMC)  RMSE (IMC)
ml-100k   1682     943     100,000     39            29           1.034        1.321
ml-1m     3883     6040    1,000,000   38            29           1.021        1.320

Figure 6.2: NIMC vs. IMC on the gene-disease association prediction task (panels (a)-(d); panel (d) shows precision (%) vs. recall (%)).

We randomly split the users into existing users (training data) and new users (testing data) with a ratio of 4:1. The user features include 21 types of occupations, 7 different age ranges and a gender indicator; the movie features include 18-19 genre features (18 for ml-1m and 19 for ml-100k) and 20 features from the top 20 right singular vectors of the training rating matrix (which has size #training users-by-#movies). In our experiments, we set $k$ to 50, and for NIMC we use the ReLU activation. As shown in Table 6.2, NIMC achieves much smaller RMSE than IMC on both the ml-100k and ml-1m datasets.

Gene-Disease association prediction. We use the dataset collected by [92], which has 300 gene features and 200 disease features. Our goal is to predict associated genes for a new disease given its features. Since the dataset only contains positive labels, this is a problem known as positive-unlabeled learning [66] or one-class matrix factorization [139]. We adapt our objective function to the following objective,
$$f(U, V) = \frac{1}{2}\Big[\sum_{(i,j)\in\Omega}\big(\phi(U^\top x_i)^\top\phi(V^\top y_j) - A_{ij}\big)^2 + \beta\sum_{(i,j)\in\Omega^c}\big(\phi(U^\top x_i)^\top\phi(V^\top y_j)\big)^2\Big], \qquad (6.5)$$

where $A$ is the association matrix, $\Omega$ is the set of indices of observed associations, $\Omega^c$ is the complement of $\Omega$, and $\beta$ is the penalty weight for unobserved associations. There are in total 12,331 genes and 3,209 diseases in the dataset. We randomly split the diseases into training diseases and testing diseases with a ratio of 4:1. The results are presented in Fig. 6.2. Following [92], we use the cumulative distribution of the ranks as a measure for comparing the performance of different methods, i.e., the probability that any ground-truth associated gene of a disease appears in the retrieved top-$r$ genes for this disease.

In Fig. 6.2(a), we show how $k$ changes the performance of NIMC and IMC. In general, the higher $k$, the better the performance. The performance of IMC becomes stable when $k$ is larger than 100, while the performance of NIMC keeps increasing. Although IMC performs better than NIMC when $k$ is small, the performance of NIMC improves much faster than that of IMC as $k$ increases. $\beta$ is fixed at 0.01 and $r = 100$ in the experiment for Fig. 6.2(a). In Fig. 6.2(b), we show how $\beta$ in Eq. (6.5) affects the performance. We tried $\beta \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$ to check how the value of $\beta$ changes the performance; as we can see, $\beta = 10^{-3}$ and $10^{-2}$ give the best results. Fig. 6.2(c) shows the probability that any ground-truth associated gene of a disease appears in the retrieved top-$r$ genes for this disease for different values of $r$; here we fix $k = 200$ and $\beta = 0.01$. Fig. 6.2(d) shows the precision-recall curves for different methods with $k = 200$ and $\beta = 0.01$.

6.6 Related Work

Collaborative filtering: Our model is a non-linear version of the standard

inductive matrix completion model [69]. Practically, IMC has been applied to gene-disease prediction [92], matrix sensing [150], multi-label classification [140], blog recommender systems [111], link prediction [29] and semi-supervised clustering [29, 112]. However, IMC restricts the latent space of users/items to be a linear transformation of the user/item feature space. [112] extended the model to a three-layer neural network and showed significantly better empirical performance for multi-label/multi-class classification problems and semi-supervised problems.

Although standard IMC uses linear mappings, it is still a non-convex problem due to the bilinearity $UV^\top$. To deal with this non-convex problem, [58, 69] provided recovery guarantees using alternating minimization with sample complexity linear in the dimension. [136] relaxed this problem to a nuclear-norm problem and also provided recovery guarantees. More general norms have been studied in [102, 117-119], e.g., the weighted Frobenius norm and the entry-wise $\ell_1$ norm. More recently, [146] uses gradient-based non-convex optimization and proves a better sample complexity. [29] studied dirtyIMC models and showed that the sample complexity can be improved if the features are informative when compared to matrix completion. Several low-rank matrix sensing problems [46, 150] are also closely related to IMC models, where the observations are sampled only from the diagonal elements of the rating matrix. [87, 104] introduced and studied an alternate framework for ratings prediction with side-information, but the prediction function is linear in their case as well.

Neural networks: Nonlinear activation functions play an important role in

neural networks. Recently, several powerful results have been discovered for learning one-hidden-layer feedforward neural networks [22, 52, 72, 85, 125, 153] and convolutional neural networks [21, 38, 39, 53, 152]. However, our problem is a strict generalization of the one-hidden-layer neural network and is not covered by the above mentioned results.


Chapter 7

Low-rank Matrix Sensing1

In this chapter, we study the problem of low-rank matrix sensing where

the goal is to reconstruct a matrix exactly using a small number of linear measure-

ments. Existing methods for the problem either rely on measurement operators such

as random element-wise sampling which cannot recover arbitrary low-rank matri-

ces or require the measurement operator to satisfy the Restricted Isometry Property

(RIP). However, RIP based linear operators are generally full rank and require large

computation/storage cost for both measurement (encoding) as well as reconstruc-

tion (decoding). We propose simple rank-one Gaussian measurement operators for

matrix sensing that are significantly less expensive in terms of memory and compu-

tation for both encoding and decoding. Moreover, we show that the matrix can be

reconstructed exactly using a simple alternating minimization method. Finally, we

demonstrate the effectiveness of the measurement scheme vis-a-vis existing meth-

ods.

1 The content of this chapter is published as "Efficient matrix sensing using rank-1 Gaussian measurements", Kai Zhong, Prateek Jain, and Inderjit S. Dhillon, in International Conference on Algorithmic Learning Theory, 2015. The dissertator's contribution includes deriving part of the theoretical analysis, conducting the numerical experiments and writing part of the paper.


7.1 Introduction to Low-rank Matrix Sensing

We consider the matrix sensing problem, where the goal is to recover a low-

rank matrix using a small number of linear measurements. The matrix sensing pro-

cess contains two phases: a) compression phase (encoding), and b) reconstruction

phase (decoding).

In the compression phase, a sketch/measurement of the given low-rank matrix is obtained by applying a linear operator $\mathcal{A} : \mathbb{R}^{d_1\times d_2} \to \mathbb{R}^m$. That is, given a rank-$k$ matrix $W_* \in \mathbb{R}^{d_1\times d_2}$, its linear measurements are computed by:
$$b = \mathcal{A}(W_*) = [\mathrm{tr}(A_1^\top W_*)\ \ \mathrm{tr}(A_2^\top W_*)\ \ \ldots\ \ \mathrm{tr}(A_m^\top W_*)]^\top, \qquad (7.1)$$
where $\{A_l \in \mathbb{R}^{d_1\times d_2}\}_{l=1,2,\ldots,m}$ parameterize the linear operator $\mathcal{A}$ and $\mathrm{tr}$ denotes the trace operator. Then, in the reconstruction phase, the underlying low-rank matrix is reconstructed using the given measurements $b$. That is, $W_*$ is obtained by solving the following optimization problem:
$$\min_W\ \mathrm{rank}(W)\quad \text{s.t.}\quad \mathcal{A}(W) = b. \qquad (7.2)$$

The matrix sensing problem is a matrix generalization of the popular compressive

sensing problem and has several real-world applications in the areas of system iden-

tification and control, Euclidean embedding, and computer vision (see [103] for a

detailed list of references).

Naturally, the design of the measurement operator $\mathcal{A}$ is critical for the success of matrix sensing, as it dictates the cost of both the compression and the reconstruction phase. The most popular operators for this task come from a family of operators that satisfy a certain Restricted Isometry Property (RIP). However, these operators require each $A_l$ that parameterizes $\mathcal{A}$ to be a full-rank matrix. That is, the cost of compression as well as the storage of $\mathcal{A}$ is $O(md_1d_2)$, which is infeasible for large matrices. Reconstruction of the low-rank matrix $W_*$ is also expensive, requiring $O(md_1d_2 + d_1^2d_2)$ computation steps. Moreover, $m$ is typically at least $O(k\cdot d\log(d))$ where $d = d_1 + d_2$. On the other hand, these operators are universal, i.e., every rank-$k$ $W_*$ can be compressed and recovered using such RIP based operators.
Here, we seek to reduce the computational/storage cost of such operators, but at the cost of the universality property. That is, we propose to use simple rank-one operators, i.e., where each $A_l$ is a rank-one matrix. We show that using a similar number of measurements as the RIP operators, i.e., $m = O(k\cdot d\log(d))$, we can recover a fixed rank-$k$ $W_*$ exactly.

In particular, we propose two measurement schemes: a) rank-one independent measurements, and b) rank-one dependent measurements. In the rank-one independent measurement scheme, we use $A_l = x_ly_l^\top$, where $x_l \in \mathbb{R}^{d_1}$, $y_l \in \mathbb{R}^{d_2}$ are both sampled from zero-mean sub-Gaussian product distributions, i.e., each element of $x_l$ and $y_l$ is sampled from a fixed zero-mean univariate sub-Gaussian distribution. Rank-one dependent measurements combine the above rank-one measurements with element-wise sampling, i.e., $A_l = x_{i_l}y_{j_l}^\top$, where $x_{i_l}, y_{j_l}$ are sampled as above and $(i_l, j_l) \in [n_1]\times[n_2]$ is a randomly sampled index, with $n_1 \geq d_1$, $n_2 \geq d_2$. These measurements can also be viewed as the inductive version of the matrix completion problem (see Section 7.2), where $x_i$ represents the features of the $i$-th user (row) and $y_j$ represents the features of the $j$-th movie (column).


Table 7.1: Comparison of sample complexity and computational complexity for different approaches and different measurements.

Method   Measurement     Sample complexity        Computational complexity
ALS      Rank-1 Indep.   O(k^4 β^2 d log^2 d)     O(kdm)
ALS      Rank-1 Dep.     O(k^4 β^2 d log d)       O(dm + knd)
ALS      RIP             O(k^4 d log d)           O(d^2 m)

Next, we provide recovery algorithms for both of the above measurement operators. Note that, in general, the recovery problem (7.2) is NP-hard. However, for RIP based operators, both alternating minimization and nuclear norm minimization are known to solve the problem exactly in polynomial time [71]. Note that the existing analysis of both methods crucially uses RIP and hence does not extend to the proposed operators. We show that if $m = O(k^4\cdot\beta^2\cdot(d_1+d_2)\log^2(d_1+d_2))$, where $\beta$ is the condition number of $W_*$, then alternating minimization for the rank-one independent measurements recovers $W_*$ in time $O(kdm)$, where $d = d_1 + d_2$.

We summarize the sample complexity and computational complexity for

different approaches and different measurements in Table 7.1. In the table, ALS

refers to alternating least squares, i.e., alternating minimization. For the symbols,

d = d1 + d2, n = n1 + n2.

We summarize related work in Section 7.5. In Section 7.2 we formally

introduce the matrix sensing problem and our proposed rank-one measurement op-

erators. In Section 7.3, we present the alternating minimization method for ma-

trix reconstruction. We then present a generic analysis for alternating minimization


when applied to the proposed rank-one measurement operators. Finally, we provide

empirical validation of our methods in Section 7.4.

7.2 Problem Formulation – Two Settings

The goal of matrix sensing is to design a linear operator $\mathcal{A} : \mathbb{R}^{d_1\times d_2} \to \mathbb{R}^m$ and a recovery algorithm so that a low-rank matrix $W_* \in \mathbb{R}^{d_1\times d_2}$ can be recovered exactly using $\mathcal{A}(W_*)$. In this work, we focus on rank-one measurement operators, $A_l = x_ly_l^\top$, and call such problems Low-Rank matrix estimation using Rank One Measurements (LRROM): recover the rank-$k$ matrix $W_* \in \mathbb{R}^{d_1\times d_2}$ by using rank-1 measurements of the form:
$$b = [x_1^\top W_*y_1\ \ x_2^\top W_*y_2\ \ \ldots\ \ x_m^\top W_*y_m]^\top, \qquad (7.3)$$
where $x_l, y_l$ are "feature" vectors and are provided along with the measurements $b$.

We propose two different kinds of rank-one measurement operators based on the Gaussian distribution.
1) Rank-one Independent Gaussian Operator. Our first measurement operator is a simple rank-one Gaussian operator, $\mathcal{A}_{GI} = \{A_1, \ldots, A_m\}$, where $A_l = x_ly_l^\top$, $l = 1, 2, \ldots, m$, and $x_l, y_l$ are sampled i.i.d. from the spherical Gaussian distribution.
2) Rank-one Dependent Gaussian Operator. Our second operator introduces certain "dependencies" in the measurements and has in fact interesting connections to the matrix completion problem; we provide the operator as well as the connection to matrix completion in this sub-section. To generate the rank-one dependent Gaussian operator, we first sample two Gaussian matrices $X \in \mathbb{R}^{n_1\times d_1}$ and $Y \in \mathbb{R}^{n_2\times d_2}$, where each entry of both $X, Y$ is sampled independently from the Gaussian distribution and $n_1 \geq Cd_1$, $n_2 \geq Cd_2$ for a global constant $C \geq 1$. Then, the Gaussian dependent operator is $\mathcal{A}_{GD} = \{A_1, \ldots, A_m\}$, where $A_l = x_{i_l}y_{j_l}^\top$, $(i_l, j_l) \in \Omega$. Here $x_i^\top$ is the $i$-th row of $X$ and $y_j^\top$ is the $j$-th row of $Y$, and $\Omega$ is a uniformly random subset of $[n_1]\times[n_2]$ such that $\mathbb{E}[|\Omega|] = m$. For simplicity, we assume that each entry $(i_l, j_l) \in [n_1]\times[n_2]$ is sampled i.i.d. with probability $p = m/(n_1\times n_2)$. Therefore, the measurements using the above operator are given by: $b_l = x_{i_l}^\top W_* y_{j_l}$, $(i_l, j_l) \in \Omega$.
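For illustration (not the authors' code), the two operators can be simulated as follows in NumPy; the function names, the fixed random generator, and the simplified index sampling (uniform with replacement) are assumptions of the sketch.

import numpy as np

def rank1_independent_measurements(W, m, rng=np.random.default_rng(0)):
    """b_l = x_l^T W y_l with x_l, y_l i.i.d. spherical Gaussian (the A_GI operator)."""
    d1, d2 = W.shape
    X = rng.standard_normal((m, d1))
    Y = rng.standard_normal((m, d2))
    b = np.einsum('li,ij,lj->l', X, W, Y)
    return X, Y, b

def rank1_dependent_measurements(W, n1, n2, m, rng=np.random.default_rng(0)):
    """b_l = x_{i_l}^T W y_{j_l} with rows of Gaussian X, Y and random indices (the A_GD operator)."""
    d1, d2 = W.shape
    X = rng.standard_normal((n1, d1))
    Y = rng.standard_normal((n2, d2))
    rows = rng.integers(0, n1, size=m)   # indices sampled uniformly with replacement
    cols = rng.integers(0, n2, size=m)   # (a simplification of the sampling scheme in the text)
    b = np.einsum('li,ij,lj->l', X[rows], W, Y[cols])
    return X, Y, rows, cols, b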

Connections to Inductive Matrix Completion (IMC): Note that the above measurements are inspired by a matrix completion style sampling operator. However, here we first multiply $W_*$ with $X$, $Y$ and then sample the obtained matrix $XW_*Y^\top$. In the domain of recommendation systems (say, a user-movie system), the corresponding reconstruction problem can also be thought of as the inductive matrix completion problem. That is, let there be $n_1$ users and $n_2$ movies, let $X$ represent the user features, and let $Y$ represent the movie features. Then, the true ratings matrix is given by $R = XW_*Y^\top \in \mathbb{R}^{n_1\times n_2}$.
That is, given the user/movie feature vectors $x_i \in \mathbb{R}^{d_1}$ for $i = 1, 2, \ldots, n_1$ and $y_j \in \mathbb{R}^{d_2}$ for $j = 1, 2, \ldots, n_2$, our goal is to recover a rank-$k$ matrix $W_*$ of size $d_1\times d_2$ from a few observed entries $R_{ij} = x_i^\top W_*y_j$, for $(i, j) \in \Omega \subset [n_1]\times[n_2]$. Because of the equivalence between the dependent rank-one measurements and the entries of the rating matrix, in the rest of this section we will use $\{R_{ij}\}_{(i,j)\in\Omega}$ as the dependent rank-one measurements.

Now, if we can reconstruct $W_*$ from the above measurements, we can predict ratings inductively for new users/movies, provided their feature vectors are given.

Algorithm 7.3.1 AltMin-LRROM: Alternating Minimization for LRROM
1: Input: Measurements: b_all, Measurement matrices: A_all, Number of iterations: H
2: Divide (A_all, b_all) into 2H + 1 sets (each of size m), with the h-th set being A^h = {A^h_1, A^h_2, ..., A^h_m} and b^h = [b^h_1 b^h_2 ... b^h_m]^T
3: Initialization: U_0 = top-k left singular vectors of (1/m) Σ_{l=1}^m b^0_l A^0_l
4: for h = 0 to H − 1 do
5:     b ← b^{2h+1}, A ← A^{2h+1}
6:     V_{h+1} ← argmin_{V ∈ R^{d2×k}} Σ_l (b_l − x_l^T U_h V^T y_l)^2
7:     V_{h+1} = QR(V_{h+1})    // orthonormalization of V_{h+1}
8:     b ← b^{2h+2}, A ← A^{2h+2}
9:     U_{h+1} ← argmin_{U ∈ R^{d1×k}} Σ_l (b_l − x_l^T U V_{h+1}^T y_l)^2
10:    U_{h+1} = QR(U_{h+1})    // orthonormalization of U_{h+1}
11: Output: W_H = U_H (V_H)^T
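The following is a minimal NumPy sketch of the procedure above (an illustration, not the authors' implementation); it uses dense least-squares solves, so it is only meant for small problem sizes, and it returns the last un-orthonormalized factor so that the output retains the scale of W.

import numpy as np

def altmin_lrrom(X, Y, b, k, H):
    """AltMin-LRROM sketch (cf. Algorithm 7.3.1): recover W from b_l = x_l^T W y_l."""
    d1, d2 = X.shape[1], Y.shape[1]
    blocks = np.array_split(np.arange(len(b)), 2 * H + 1)   # fresh samples per step

    def ls_step(P, F, b_blk, d):
        # Solve min over M in R^{d x k} of sum_l (b_l - P_l^T M^T F_l)^2,
        # where P_l in R^k is the projected side and F_l in R^d is the free side.
        rows = np.einsum('la,lc->lac', P, F).reshape(len(b_blk), -1)
        sol, *_ = np.linalg.lstsq(rows, b_blk, rcond=None)
        return sol.reshape(k, d).T                            # M, shape (d, k)

    i0 = blocks[0]
    M0 = (X[i0] * b[i0, None]).T @ Y[i0] / len(i0)            # (1/m) sum_l b_l x_l y_l^T
    U = np.linalg.svd(M0)[0][:, :k]                           # top-k left singular vectors

    for h in range(H):
        i1, i2 = blocks[2 * h + 1], blocks[2 * h + 2]
        V_hat = ls_step(X[i1] @ U, Y[i1], b[i1], d2)          # update V with U fixed
        V = np.linalg.qr(V_hat)[0]
        U_hat = ls_step(Y[i2] @ V, X[i2], b[i2], d1)          # update U with V fixed
        U = np.linalg.qr(U_hat)[0]

    return U_hat @ V.T   # un-orthonormalized U_hat keeps the scale of W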

Hence, our reconstruction procedure also solves the IMC problem. However, there is a key difference: in matrix sensing, we can select $X$, $Y$ according to our convenience, while in IMC, $X$ and $Y$ are provided a priori. But for general $X, Y$ one cannot solve the problem, because if, say, $R = XW_*Y^\top$ is a 1-sparse matrix, then $W_*$ cannot be reconstructed even with a large number of samples.

7.3 Rank-one Matrix Sensing via Alternating Minimization

We now present the alternating minimization approach for solving the reconstruction problem (7.2) with rank-one measurements (7.3). Since the $W$ to be recovered is restricted to have rank at most $k$, (7.2) can be reformulated as the following non-convex optimization problem:
$$\min_{U\in\mathbb{R}^{d_1\times k},\,V\in\mathbb{R}^{d_2\times k}}\ \sum_{l=1}^m\big(b_l - x_l^\top UV^\top y_l\big)^2. \qquad (7.4)$$
Alternating minimization is an iterative procedure that alternately optimizes for $U$ and $V$ while keeping the other factor fixed. Since optimizing for $U$ (or $V$) involves solving just a least squares problem, each individual iteration of the algorithm is linear in the matrix dimensions. For the rank-one measurement operators, we use a particular initialization method to initialize $U$ (see line 3 of Algorithm 7.3.1). See Algorithm 7.3.1 for pseudo-code of the algorithm.

General Theoretical Guarantee for Alternating Minimization

As mentioned above, (7.4) is non-convex in $U, V$ and hence standard analysis would only ensure convergence to a local minimum. However, [71] recently showed that the alternating minimization method in fact converges to the global minimum of two low-rank estimation problems: matrix sensing with RIP matrices and matrix completion.
The rank-one operator given above does not satisfy RIP (see Definition 7.3.1), even when the vectors $x_l, y_l$ are sampled from the normal distribution (see Claim 7.3.1). Furthermore, each measurement need not reveal exactly one entry of $W_*$ as in the case of matrix completion. Hence, the proof of [71] does not apply directly. However, inspired by the proof of [71], we distill out three key properties that the operator should satisfy so that alternating minimization converges to the global optimum.

Theorem 7.3.1. Let $W_* = U_*\Sigma_*V_*^\top \in \mathbb{R}^{d_1\times d_2}$ be a rank-$k$ matrix with $k$ singular values $\sigma_1^* \geq \sigma_2^* \geq \cdots \geq \sigma_k^*$. Also, let $\mathcal{A} : \mathbb{R}^{d_1\times d_2} \to \mathbb{R}^m$ be a linear measurement operator parameterized by $m$ matrices, i.e., $\mathcal{A} = \{A_1, A_2, \ldots, A_m\}$ where $A_l = x_ly_l^\top$. Let $\mathcal{A}(W)$ be as given by (7.1).
Now, let $\mathcal{A}$ satisfy the following properties with parameter $\delta = \frac{1}{100\,k^{3/2}\beta}$ ($\beta = \sigma_1^*/\sigma_k^*$):
1. Initialization: $\|\frac{1}{m}\sum_l b_lA_l - W_*\|_2 \leq \|W_*\|_2\cdot\delta$.
2. Concentration of operators $B_x$, $B_y$: Let $B_x = \frac{1}{m}\sum_{l=1}^m(y_l^\top v)^2x_lx_l^\top$ and $B_y = \frac{1}{m}\sum_{l=1}^m(x_l^\top u)^2y_ly_l^\top$, where $u \in \mathbb{R}^{d_1}$, $v \in \mathbb{R}^{d_2}$ are two unit vectors that are independent of the randomness in $x_l, y_l$, $\forall l$. Then the following holds: $\|B_x - I\|_2 \leq \delta$ and $\|B_y - I\|_2 \leq \delta$.
3. Concentration of operators $G_x$, $G_y$: Let $G_x = \frac{1}{m}\sum_l(y_l^\top v)(v_\perp^\top y_l)x_lx_l^\top$ and $G_y = \frac{1}{m}\sum_l(x_l^\top u)(u_\perp^\top x_l)y_ly_l^\top$, where $u, u_\perp \in \mathbb{R}^{d_1}$, $v, v_\perp \in \mathbb{R}^{d_2}$ are unit vectors such that $u^\top u_\perp = 0$ and $v^\top v_\perp = 0$. Furthermore, let $u, u_\perp, v, v_\perp$ be independent of the randomness in $x_l, y_l$, $\forall l$. Then, $\|G_x\|_2 \leq \delta$ and $\|G_y\|_2 \leq \delta$.
Then, after $H$ iterations of the alternating minimization method (Algorithm 7.3.1), we obtain $W_H = U_HV_H^\top$ s.t. $\|W_H - W_*\|_2 \leq \epsilon$, where $H \leq 100\log(\|W_*\|_F/\epsilon)$.

See Appendix E.1 for a detailed proof. Note that we require the intermediate vectors $u, v, u_\perp, v_\perp$ to be independent of the randomness in the $A_l$'s. Hence, we partition $\mathcal{A}_{\mathrm{all}}$ into $2H + 1$ partitions, and at each step $(\mathcal{A}^h, b^h)$ and $(\mathcal{A}^{h+1}, b^{h+1})$ are supplied to the algorithm. This implies that the measurement complexity of the algorithm is given by $m\cdot H = m\log(\|W_*\|_F/\epsilon)$. That is, given $O(m\log((d_1 + d_2)\|W_*\|_F))$ samples, we can estimate a matrix $W_H$ s.t. $\|W_H - W_*\|_2 \leq \frac{1}{(d_1+d_2)^c}$, where $c > 0$ is any constant.

Independent Gaussian Measurements

In this section, we consider the rank-one independent measurement operator $\mathcal{A}_{GI}$ specified in Section 7.2. For this operator $\mathcal{A}_{GI}$, we show that if $m = O(k^4\beta^2\cdot(d_1+d_2)\cdot\log^2(d_1+d_2))$, then w.p. $\geq 1 - 1/(d_1+d_2)^{100}$, any fixed rank-$k$ matrix $W_*$ can be recovered by AltMin-LRROM (Algorithm 7.3.1). Here $\beta = \sigma_1^*/\sigma_k^*$ is the condition number of $W_*$. That is, using a nearly linear (in $d_1, d_2$) number of measurements, one can exactly recover the $d_1\times d_2$ rank-$k$ matrix $W_*$.

As mentioned in the previous section, existing matrix sensing results typically assume that the measurement operator $\mathcal{A}$ satisfies the Restricted Isometry Property (RIP) defined below:
Definition 7.3.1. A linear operator $\mathcal{A} : \mathbb{R}^{d_1\times d_2} \to \mathbb{R}^m$ satisfies RIP iff, for all $W$ s.t. $\mathrm{rank}(W) \leq k$, the following holds:
$$(1 - \delta_k)\|W\|_F^2 \leq \|\mathcal{A}(W)\|_F^2 \leq (1 + \delta_k)\|W\|_F^2,$$
where $\delta_k > 0$ is a constant dependent only on $k$.

Naturally, this begs the question of whether our rank-1 measurement operator $\mathcal{A}_{GI}$ satisfies RIP, so that the existing analysis for RIP based low-rank matrix sensing can be used [71]. We answer this question in the negative, i.e., for $m = O((d_1+d_2)\log(d_1+d_2))$, $\mathcal{A}_{GI}$ does not satisfy RIP even for rank-1 matrices (with high probability):
Claim 7.3.1. Let $\mathcal{A}_{GI} = \{A_1, A_2, \ldots, A_m\}$ be a measurement operator with each $A_l = x_ly_l^\top$, where $x_l \in \mathbb{R}^{d_1} \sim N(0, I)$, $y_l \in \mathbb{R}^{d_2} \sim N(0, I)$, $1 \leq l \leq m$. Let $m = O((d_1+d_2)\log^c(d_1+d_2))$, for any constant $c > 0$. Then, with probability at least $1 - 1/m^{10}$, $\mathcal{A}_{GI}$ does not satisfy RIP for rank-1 matrices with a constant $\delta$.

See Appendix E.2 for a detailed proof of the above claim. Now, even though $\mathcal{A}_{GI}$ does not satisfy RIP, we can still show that $\mathcal{A}_{GI}$ satisfies the three properties mentioned in Theorem 7.3.1, and hence we can use Theorem 7.3.1 to obtain the exact recovery result.
Theorem 7.3.2 (Rank-One Independent Gaussian Measurements using ALS). Let $\mathcal{A}_{GI} = \{A_1, A_2, \ldots, A_m\}$ be a measurement operator with each $A_l = x_ly_l^\top$, where $x_l \in \mathbb{R}^{d_1} \sim N(0, I)$, $y_l \in \mathbb{R}^{d_2} \sim N(0, I)$, $1 \leq l \leq m$. Let $m = O(k^4\beta^2(d_1+d_2)\log^2(d_1+d_2))$. Then, Properties 1, 2, 3 required by Theorem 7.3.1 are satisfied with probability at least $1 - 1/(d_1+d_2)^{100}$.

Proof. Here, we provide a brief proof sketch. See Appendix E.2 for a detailed

proof.

Initialization: Note that
$$\frac{1}{m}\sum_{l=1}^m b_lx_ly_l^\top = \frac{1}{m}\sum_{l=1}^m x_lx_l^\top U_*\Sigma_*V_*^\top y_ly_l^\top = \frac{1}{m}\sum_{l=1}^m Z_l,$$
where $Z_l = x_lx_l^\top U_*\Sigma_*V_*^\top y_ly_l^\top$. Note that $\mathbb{E}[Z_l] = U_*\Sigma_*V_*^\top$. Hence, to prove the initialization result, we need a tail bound for sums of random matrices. To this end, we use the matrix Bernstein inequality (Lemma 2.4.3). However, the matrix Bernstein inequality requires a bounded random variable, while $Z_l$ is unbounded. We handle this issue by clipping $Z_l$ to ensure that its spectral norm is always bounded. Furthermore, by using properties of the normal distribution, we can ensure that w.p. $\geq 1 - 1/m^3$, the $Z_l$'s do not require clipping, and the new "clipped" variables converge to nearly the same quantity as the original "non-clipped" $Z_l$'s. See Appendix E.2 for more details.

Concentration of $B_x$, $B_y$, $G_x$, $G_y$: Consider $G_x = \frac{1}{m}\sum_{l=1}^m (y_l^\top v)(v_\perp^\top y_l)\,x_lx_l^\top$. Since $v, v_\perp$ are unit-norm vectors, $y_l^\top v \sim N(0, 1)$ and $v_\perp^\top y_l \sim N(0, 1)$. Also, since $v$ and $v_\perp$ are orthogonal, $y_l^\top v$ and $v_\perp^\top y_l$ are independent variables. Hence, $G_x = \frac{1}{m}\sum_{l=1}^m Z_l$ where $\mathbb{E}[Z_l] = 0$. Here again, we apply the matrix Bernstein inequality (Lemma 2.4.3) after using a clipping argument. We can obtain the required bounds for $B_x$, $B_y$, $G_y$ in a similar manner.
Note that the clipping procedure only ensures that the $Z_l$'s do not need to be clipped with probability $\geq 1 - 1/m^3$. That is, we cannot apply the union bound to ensure that the concentration result holds for all $v, v_\perp$. Hence, we need a fresh set of measurements after each iteration to ensure concentration.

Global optimality of the rate of convergence of the Alternating Minimiza-

tion procedure for this problem now follows directly by using Theorem 7.3.1. We

would like to note that while the above result shows that the AGI operator is al-

most as powerful as the RIP based operators for matrix sensing, there is one critical

drawback: while RIP based operators are universal that is they can be used to re-

cover any rank-k W ∗, AGI needs to be resampled for each W ∗. We believe that

the two operators are at two extreme ends of randomness vs universality trade-off


and that intermediate operators with higher success probability but using a larger number of random bits should be possible.

Dependent Gaussian Measurements

For the dependent Gaussian measurements, the alternating minimization

formulation is given by:

$$\min_{U\in\mathbb{R}^{d_1\times k},\, V\in\mathbb{R}^{d_2\times k}}\ \sum_{(i,j)\in\Omega}\big(x_i^\top U V^\top y_j - R_{ij}\big)^2. \tag{7.5}$$

Here again, we can solve the problem by alternately optimizing over $U$ and $V$; a minimal sketch of this alternating scheme is given after this paragraph. Later in this section, we show that using such dependent measurements leads to a faster recovery algorithm when compared to the recovery algorithm for independent measurements.
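For concreteness, here is a minimal sketch of that alternating scheme for objective (7.5), written with a plain dense least-squares solve over the observed entries. It is only an illustration under the assumption that enough entries are observed; the function name `als_imc` and its arguments are hypothetical, and the thesis implementation instead solves each step with conjugate gradient, as discussed later in this section.

```python
import numpy as np

def als_imc(R_obs, omega, X, Y, k, iters=20, seed=0):
    """Minimal sketch of alternating least squares for objective (7.5):
    min_{U,V} sum_{(i,j) in omega} (x_i^T U V^T y_j - R_ij)^2.
    X (n1 x d1), Y (n2 x d2) are feature matrices; omega is a list of (i, j) pairs
    and R_obs the corresponding observed values."""
    rng = np.random.default_rng(seed)
    d1, d2 = X.shape[1], Y.shape[1]
    V = rng.standard_normal((d2, k))
    for _ in range(iters):
        # update U with V fixed: each observation is linear in the entries of U
        A = np.stack([np.outer(X[i], Y[j] @ V).ravel() for i, j in omega])
        U = np.linalg.lstsq(A, R_obs, rcond=None)[0].reshape(d1, k)
        # update V with U fixed
        A = np.stack([np.outer(Y[j], X[i] @ U).ravel() for i, j in omega])
        V = np.linalg.lstsq(A, R_obs, rcond=None)[0].reshape(d2, k)
    return U, V
```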

Note that both the measurement matrices X,Y can be thought of as or-

thonormal matrices. The reason being, $XW^*Y^\top = U_X\Sigma_X V_X^\top\, W^*\, V_Y\Sigma_Y U_Y^\top$, where $X = U_X\Sigma_X V_X^\top$ and $Y = U_Y\Sigma_Y V_Y^\top$ are the SVDs of $X, Y$ respectively. Hence, $R = XW^*Y^\top = U_X(\Sigma_X V_X^\top W^* V_Y\Sigma_Y)U_Y^\top$. Now $U_X, U_Y$ can be treated as the true "X", "Y" matrices and $W^* \leftarrow (\Sigma_X V_X^\top W^* V_Y\Sigma_Y)$ can be thought of as $W^*$. Then the "true" $W^*$ can be recovered using the obtained $W_H$ as: $W_H \leftarrow V_X\Sigma_X^{-1} W_H \Sigma_Y^{-1} V_Y^\top$. We also note that such a transformation implies that the condition number of $R$ and that of $W^* \leftarrow (\Sigma_X V_X^\top W^* V_Y\Sigma_Y)$ are exactly the same.
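As a sanity check on this change of variables, the following small sketch (assuming NumPy; all names are illustrative) forms the rotated target $\Sigma_X V_X^\top W^* V_Y \Sigma_Y$ and then undoes the transformation exactly as described above.

```python
import numpy as np

# Sketch of the reduction described above (names are illustrative):
# replace X0, Y0 by their orthonormal factors, pretend W_H was recovered in the
# rotated space, then map it back to the original coordinates.
rng = np.random.default_rng(1)
n1, n2, d1, d2 = 200, 150, 12, 10
X0 = rng.standard_normal((n1, d1))
Y0 = rng.standard_normal((n2, d2))

UX, sX, VXt = np.linalg.svd(X0, full_matrices=False)   # X0 = UX diag(sX) VX^T
UY, sY, VYt = np.linalg.svd(Y0, full_matrices=False)   # Y0 = UY diag(sY) VY^T

W_star = rng.standard_normal((d1, d2))
R = X0 @ W_star @ Y0.T
W_eff = np.diag(sX) @ VXt @ W_star @ VYt.T @ np.diag(sY)  # target in rotated space
assert np.allclose(R, UX @ W_eff @ UY.T)

W_H = W_eff.copy()                                      # suppose ALS recovered it exactly
W_rec = VXt.T @ np.diag(1.0 / sX) @ W_H @ np.diag(1.0 / sY) @ VYt
assert np.allclose(W_rec, W_star)                       # transformation undone
```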

Similar to the previous section, we utilize our general theorem for optimal-

ity of the LRROM problem to provide a convergence analysis of rank-one Gaussian

dependent operators AGD. We prove that if X and Y are random orthogonal matrices,


defined in [24], the above mentioned dependent measurement operator AGD gener-

ated from X,Y also satisfies Properties 1, 2, 3 in Theorem 7.3.1. Hence, AltMin-

LRROM (Algorithm 7.3.1) converges to the global optimum in O(log(‖W ∗‖F/ε))

iterations.

Theorem 7.3.3 (Rank-One Dependent Gaussian Measurements using ALS). Let $X_0 \in \mathbb{R}^{n_1\times d_1}$ and $Y_0 \in \mathbb{R}^{n_2\times d_2}$ be Gaussian matrices, i.e., every entry is sampled i.i.d. from $N(0,1)$. Let $X_0 = X\Sigma_X V_X^\top$ and $Y_0 = Y\Sigma_Y V_Y^\top$ be the thin SVDs of $X_0$ and $Y_0$ respectively. Then the rank-one dependent operator $\mathcal{A}_{GD}$ formed by $X, Y$ with $m \ge O(k^4\beta^2(d_1+d_2)\log(d_1+d_2))$ satisfies Properties 1, 2, 3 required by Theorem 7.3.1 with high probability.

See Appendix E.3 for a detailed proof. Interestingly, our proof does not

require X , Y to be Gaussian. It instead utilizes only two key properties about X,Y

which are given by:

1. Incoherence: For some constants $\mu, c$,
$$\max_{i\in[n]}\ \|x_i\|_2^2 \le \frac{\mu\,\tilde d}{n}, \tag{7.6}$$
where $\tilde d = \max(d, \log n)$.

2. Averaging Property: For $H$ different orthogonal matrices $U_h \in \mathbb{R}^{d\times k}$, $h = 1, 2, \ldots, H$, the following holds for these $U_h$'s (a small numerical check of both properties is sketched after this list),
$$\max_{i\in[n]}\ \|U_h^\top x_i\|_2^2 \le \frac{\mu_0\,\tilde k}{n}, \tag{7.7}$$
where $\mu_0, c$ are some constants and $\tilde k = \max(k, \log n)$.
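The numerical check referenced in the Averaging Property above is sketched here: it draws a random matrix with orthonormal columns and empirically estimates the constants in (7.6) and (7.7). The dimensions and the number of subspaces $H$ are hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical numerical check of the two properties for a random orthogonal X.
rng = np.random.default_rng(2)
n, d, k, H = 2000, 40, 5, 3
# random n x d matrix with orthonormal columns
X, _ = np.linalg.qr(rng.standard_normal((n, d)))

d_tilde = max(d, int(np.log(n)))
row_norms = np.sum(X**2, axis=1)
mu = row_norms.max() * n / d_tilde            # empirical incoherence constant, Eq. (7.6)

k_tilde = max(k, int(np.log(n)))
mu0 = 0.0
for _ in range(H):                            # averaging property, Eq. (7.7), over H subspaces
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))
    mu0 = max(mu0, np.max(np.sum((X @ U)**2, axis=1)) * n / k_tilde)

print(f"empirical mu ~ {mu:.2f}, mu0 ~ {mu0:.2f}")  # both should be modest constants
```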


Hence, the above theorem can be easily generalized to solve the inductive matrix completion problem (IMC), i.e., to solve (7.5) for arbitrary $X, Y$. Moreover, the sample complexity of the analysis would be nearly linear in $(d_1 + d_2)$, instead of the $(n_1 + n_2)$ samples required by standard matrix completion methods.

The following lemma shows that the above two properties hold w.h.p. for

random orthogonal matrices.

Lemma 7.3.1. If $X \in \mathbb{R}^{n\times d}$ is a random orthogonal matrix, then both the Incoherence and Averaging properties are satisfied with probability $\ge 1 - (c/n^3)\log n$, where $c$ is a constant.

The proof of Lemma 7.3.1 can be found in Appendix E.3.

Computational Complexity for Alternating Minimization

In this section, we briefly discuss the computational complexity for Algo-

rithm 7.3.1. For simplicity, we set d = d1 + d2 and n = n1 + n2, and in practical

implementation, we don’t divide the measurements and use the whole measurement

operator A for every iteration. The most time-consuming part of Algorithm 7.3.1 is

the step for solving the least square problem. Given U = Uh, V can be obtained by

solving the following linear system,

$$\sum_{l=1}^m \big\langle V,\, A_l^\top U_h\big\rangle\, A_l^\top U_h \;=\; \sum_{l=1}^m b_l\, A_l^\top U_h. \tag{7.8}$$

The dimension of this linear system is $kd$, which could be large; thus we use the conjugate gradient (CG) method to solve it. In each CG iteration, different measurement


operators have different computational complexity. For RIP-based full-rank operators, the computational complexity for each CG step is $O(d^2 m)$, while it is $O(kdm)$ for rank-one independent operators. However, for rank-one dependent operators, using techniques introduced in [140], we can reduce the per-iteration complexity to

be $O(kdn + md)$. Furthermore, if $n = d$, the computational complexity of dependent operators is only $O(kd^2 + md)$, which is better than the complexity of rank-one independent operators by a factor of $k$.
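To illustrate how the CG step exploits the rank-one structure, the following sketch solves the normal equations of the $V$-update for rank-one independent measurements without ever forming the $kd \times kd$ system. It is a hedged illustration rather than the thesis code; the name `cg_solve_V` and the stopping constants are hypothetical.

```python
import numpy as np

def cg_solve_V(X, Y, b, Uh, iters=50, tol=1e-10):
    """Sketch of the CG step above for rank-one measurements A_l = x_l y_l^T:
    solve the normal equations of min_V sum_l (x_l^T Uh V^T y_l - b_l)^2
    by applying the operator to a d2 x k matrix, never forming the kd x kd system."""
    U = X @ Uh                                    # rows u_l^T = x_l^T Uh, shape (m, k)

    def apply_op(V):                              # V -> sum_l (y_l^T V u_l) y_l u_l^T
        pred = np.einsum('lj,jk,lk->l', Y, V, U)  # pred_l = y_l^T V u_l
        return Y.T @ (pred[:, None] * U)

    rhs = Y.T @ (b[:, None] * U)                  # sum_l b_l y_l u_l^T
    V = np.zeros_like(rhs)
    r = rhs - apply_op(V)
    p = r.copy()
    rs = np.sum(r * r)
    for _ in range(iters):
        Ap = apply_op(p)
        alpha = rs / np.sum(p * Ap)
        V += alpha * p
        r -= alpha * Ap
        rs_new = np.sum(r * r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return V
```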

7.4 Numerical Experiments

In this section, we demonstrate empirically that our Gaussian rank-one lin-

ear operators are significantly more efficient for matrix sensing than the existing

RIP based measurement operators. In particular, we apply alternating minimization

(ALS) to the measurements obtained using three different operators: rank-one inde-

pendent (Rank1 Indep), rank-one dependent (Rank1 Dep), and a RIP based operator

generated using random Gaussian matrices (RIP).

The experiments are conducted in Matlab. We first generate a random rank-5 signal $W^* \in \mathbb{R}^{50\times 50}$ and compute $m = 1000$ measurements using different measurement operators. Figure 7.1(a) plots the relative error in recovery, $\mathrm{err} = \|W - W^*\|_F^2/\|W^*\|_F^2$, against the computation time required by each method. Clearly, recovery using rank-one measurements requires significantly less time than the RIP-based operator.

Next, we compare the measurement complexity (m) of each method. Here

again, we first generate a random rank-5 signal W ∗ ∈ R50×50 and its measurements


[Figure 7.1: Comparison of computational complexity and measurement complexity for different approaches and different operators. (a) Relative error in recovery, $\|W - W^*\|_F^2/\|W^*\|_F^2$, vs. computation time (s), for ALS Rank1 Dep, ALS Rank1 Indep, and ALS RIP. (b) Recovery rate vs. number of measurements, for the same three methods.]

using different operators. We then measure the recovery error of each method and declare success if the relative error satisfies err ≤ 0.05. We repeat the experiment 10 times to obtain the recovery rate (number of successes/10) for each value of m (number of measurements). Figure 7.1(b) plots the recovery rate of different approaches

for different m. Clearly, the rank-one based measurements have similar recovery

rate and measurement complexity as the RIP based operators. However, our rank-

one operator based methods are much faster than the corresponding methods for the

RIP-based measurement scheme.

Finally, in Figure 7.2, we validate our theoretical analysis on the measure-

ment complexity by showing the recovery rate for different d and m. We fix the

rank $k = 5$ and set $d = d_1 = d_2$, with $n_1 = d_1$, $n_2 = d_2$ for dependent operators. As shown in Figure 7.2, both independent and dependent operators using alternating minimization require a number


[Figure 7.2: Recovery rate for different matrix dimension d (x-axis) and different number of measurements m (y-axis), for Indep. ALS (left) and Dep. ALS (right). The color reflects the recovery rate scaled from 0 to 1. The white color indicates perfect recovery, while the black color denotes failure in all the experiments.]

of measurements proportional to the dimension $d$. We also see that dependent operators require a slightly larger number of measurements than independent ones.

7.5 Related Work

Matrix Sensing: Matrix sensing [103][70][81] is a generalization of the popular compressive sensing problem for sparse vectors and has applications in several domains such as control and vision. [103] introduced measurement operators that satisfy RIP and showed that using only O(kd log d) measurements, a rank-k

W∗ ∈ Rd1×d2 can be recovered. Recently, a set of universal Pauli measurements,

used in quantum state tomography, have been shown to satisfy the RIP condition


[88]. These measurement operators are Kronecker products of 2× 2 matrices, thus,

they have appealing computation and memory efficiency.

Matrix Completion and Inductive Matrix Completion: Matrix completion [24][75][71] is a special case of the rank-one matrix sensing problem in which the operator takes a subset of the entries. However, to guarantee exact recovery, the target matrix has to satisfy the incoherence condition. Using our rank-one Gaussian operators, we do not require any condition on the target matrix. For inductive matrix completion (IMC), which is a generalization of matrix completion utilizing movies' and users' features, the authors of [136] provided a theoretical recovery guarantee for nuclear-norm minimization. In this work, we show that IMC is equivalent to matrix sensing with dependent rank-one measurements and focus on a non-convex approach, alternating minimization.

Alternating Minimization: Although nuclear-norm minimization enjoys nice

recovery guarantees, it usually doesn’t scale well. In practice, alternating minimiza-

tion is employed to solve problem (7.2) by assuming the rank is known. Alternating

minimization solves two least square problems alternatively in each iteration, thus

is very computationally efficient [138]. Although widely used in practice, its theoretical guarantees are relatively less understood due to non-convexity. [71] first showed optimality of alternating minimization in the matrix sensing/low-rank estimation setting under the RIP assumption. Subsequently, several other papers have

also shown global convergence guarantees for alternating minimization, e.g. matrix

completion [62][58], robust PCA [93] and dictionary learning [3]. In this work,

we provide a generic analysis for alternating minimization applied to the proposed


rank-one measurement operators. Our results distill out certain key problem spe-

cific properties that would imply global optimality of alternating minimization. We

then show that the rank-one Gaussian measurements satisfy those properties.


Chapter 8

Discussion

8.1 Over-specified/Over-parameterized Neural Networks

Over-parameterization/over-specification endows greater expressive power

to modern neural networks and is also believed to be one of the primary reasons

why gradient-based local search algorithms have benign behavior.

It is conjectured in [89, 106] that over-parameterization is the primary rea-

son why local search heuristics can successfully find the global optima, based on

the intuition that over-parameterization mitigates the local minima issue and makes

the landscape more suitable for gradient descent to succeed. However, even if the

global optimum of the training loss is achieved, the generalization error of the so-

lution is typically not guaranteed due to the possibility of over-fitting.

On the generalization side, abundant empirical evidence has shown that

over-parameterized neural networks generalize well even when the number of pa-

rameters is much larger than the sample size, which seems to contradict classical

statistical learning theory. Somewhat more surprisingly, with some architecture de-

sign, vanilla gradient descent without any explicit regularization, such as dropout

or weight decay, converges to a solution that could generalize well. Some recent

generalization error bounds [18, 54, 94, 95] are derived which depend on the norm


of the weight matrices, but there is no guarantee that the learned weight matrices

have small norms. Some recent work [144] regularizes the weight matrices so that the norm is bounded; however, it is theoretically unknown whether the training loss will also be affected.

To analyze the convergence properties of NNs, many assume realizabil-

ity [22, 72, 125, 153], where an underlying ground-truth NN is assumed to exist

and the goal is to recover this target NN. In realizable settings, once the underly-

ing model is recovered, we have generalization performance guaranteed. Here we

would like to make the distinction between two different assumptions within real-

izability. One class assumes the number of hidden nodes is known and the learning

architecture is fixed to be the same as the underlying data generative model, such

as those in [72, 125, 153]. The other class (such as in [22]) assumes the underlying model can be represented as a NN with fewer hidden nodes than actually specified; we call these over-specified models. Over-specification is more suitable for real

problems as it is usually difficult to know the true architecture of the underlying

model a priori. In other areas such as low-rank matrix sensing [84] and low-rank

semi-definite programs [23], over-specifying the rank is shown to achieve better

performance and reduce the number of local minima in non-convex optimization.

In some literature, researchers also use the term over-parameterization to refer to

the problem where the number of observations is smaller than the number of param-

eters. We point out that when the sample size is sufficiently large and the underlying

data can be modeled by a low-complexity NN, over-parameterization often implies

over-specification.


Over-specified models have attracted lots of attention in recent research.

Many of them target smoothed activation functions, such as sigmoid activation [100,

141], linear activation [74], quadratic activation [114] and general smoothed acti-

vations [86, 97]. Brutzkus et al. [22] recently study over-parameterization of Leaky ReLU NNs for classification problems with separable data. In this section,

we consider over-specified NNs with ReLU activation for regression problems. To

the best of our knowledge, this section is the first one that directly analyzes the

over-specified ReLU network under square loss.

In this section, we analyze a simple case: recovering a ReLU function by

applying gradient descent on the squared-loss function with an over-specified one-

hidden-layer ReLU network. Our analysis shows that when starting from a small

initialization, gradient descent converges to the target model. Under mild condi-

tions, our sample complexity and time complexity are independent of the over-

specification degree up to a logarithmic factor. Furthermore, we present some ex-

perimental results with multiple hidden ReLU units in the ground truth network.

8.1.1 Learning a ReLU using Over-specified Neural Networks

We consider the regression problem with input data (x, y) where x ∈ Rd

and y ∈ R. The data is generated as follows.

$$\mathcal{D}:\quad x \sim N(0, I_d); \qquad y = \phi(w^{*\top} x). \tag{8.1}$$
In Eq. (8.1), $\phi(\cdot)$ is the ReLU activation function $\phi(z) = \max\{z, 0\}$, and $w^*$ is the underlying ground-truth parameter vector. Without loss of generality, we assume


$\|w^*\| = 1$. This model can be viewed as one of the simplest NNs, which has one hidden

layer with one hidden unit. Our convergence guarantee is based on this simple true

model. As empirically studied by [106] and also presented in our experimental

section Sec. 8.1.2, for learning a ground-truth NN with multiple hidden units, it

is possible for gradient descent (GD) to get stuck in bad local minima even if the

learning model is mildly over-specified.

We use the natural loss for the regression problem, namely the squared loss, and we learn an over-specified NN model with multiple ReLU units. Let $f(W)$ denote the population loss,
$$f(W) = \mathbb{E}_{(x,y)}\left[\left(\sum_{i=1}^{k}\phi(w_i^\top x) - y\right)^2\right], \tag{8.2}$$
where $k \ge 1$ and the parameter to be learned is $W := [w_1, w_2, \cdots, w_k] \in \mathbb{R}^{d\times k}$.

More practically, when given a finite set of samples $S$, we use $f_S(W)$ to denote the empirical loss,
$$f_S(W) = \frac{1}{|S|}\sum_{(x,y)\in S}\left(\sum_{i=1}^{k}\phi(w_i^\top x) - y\right)^2. \tag{8.3}$$

We use gradient descent to solve the optimization problem $\min_W f(W)$, with the update step
$$W^{(t+1)} \leftarrow W^{(t)} - \eta_t\,\nabla f(W^{(t)}),$$
where the $i$-th column of the gradient of $f(W)$ can be written in the following form,
$$[\nabla f(W)]_i = \frac{\partial f(W)}{\partial w_i} = \mathbb{E}_{(x,y)}\left[\left(\sum_{i'=1}^{k}\phi(w_{i'}^\top x) - y\right)\phi'(w_i^\top x)\, x\right]. \tag{8.4}$$


When a finite number of samples are available, the expectations above are replaced

by empirical averages. Under such a setting, we introduce our main result in the

following section.
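For reference, here is a minimal NumPy sketch of the empirical loss (8.3) and the corresponding gradient (8.4) for the over-specified one-hidden-layer ReLU model. It is an illustration under the model assumptions above, not the thesis code; the constant factor 2, which Eq. (8.4) absorbs into the step size, is kept explicit here.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def loss_and_grad(W, X, y):
    """Empirical loss (8.3) and its gradient for the over-specified one-hidden-layer
    ReLU model; W has shape (d, k), X has shape (n, d), y has shape (n,)."""
    H = X @ W                                   # pre-activations w_i^T x, shape (n, k)
    residual = relu(H).sum(axis=1) - y          # sum_i phi(w_i^T x) - y
    loss = np.mean(residual ** 2)
    # column i of the gradient: (2/n) * sum_x (sum_i phi(w_i^T x) - y) phi'(w_i^T x) x,
    # i.e. Eq. (8.4) up to the explicit constant 2
    grad = X.T @ (residual[:, None] * (H > 0)) * (2.0 / len(y))
    return loss, grad

# quick check on data generated from model (8.1)
rng = np.random.default_rng(0)
d, n, k = 8, 2000, 3
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = relu(X @ w_star)
print(loss_and_grad(rng.standard_normal((d, k)) * 0.01, X, y)[0])
```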

Now let’s first present our algorithms and theoretical results on the con-

vergence of gradient descent for both population and finite-sample settings. It is

well-known that learning neural networks requires some tactics in selecting proper

initialization and step size. We show that, with a small initialization magnitude, the

optimization path of gradient descent can be divided into the following two phases,

which will be elaborated in Appendix F.2.1 and F.2.2.

1. In the first phase, the angle between $w_i$ and $w^*$ decreases to a small value, while the magnitudes of $\{w_i\}_{i\in[k]}$ remain small. At the end of this phase, all $w_i$'s are closely aligned with $w^*$. Therefore, this phase can be viewed as compressing the over-specified model class to a minimal model class that can be fit to the target model.

2. In the second phase, the magnitude of $\sum_i w_i$ grows toward the magnitude of $w^*$, while the angles between $w_i$ and $w^*$ remain small. This phase can be viewed as fitting the compressed model to the target model.

The optimization procedure is presented in Algorithm 8.1.1. Starting from

random initialization with small scale, Alg. 8.1.1 first performs gradient updates with a small step size ($O(\epsilon^{1/2})$); in the second phase, it uses a constant step size for linear convergence to the target function. The reason for

choosing such step sizes will become clear in the analysis. Our main result shows

that Alg. 8.1.1 converges to the target function with time complexity O(ε−1/2). The


Algorithm 8.1.1 Gradient Descent for Minimizing Eq. (8.2)
1: procedure LEARNINGOVERSPECIFIED1NN($k$, $\epsilon$, $\sigma$)
2:    Set $\eta_1 = O(\epsilon^{1/2}k^{-1}(\log(\epsilon^{-1}))^{-1})$, $T_1 = O(\epsilon^{-1/2})$, and $\eta_2 = O(k^{-1})$, $T_2 = O(\log(\epsilon^{-1}))$.
3:    Initialize $w_i^{(0)}$ i.i.d. from the uniform distribution on a sphere with radius $\sigma$.
4:    for $t = 0, 1, 2, \cdots, T_1 - 1$ do
5:        $W^{(t+1)} = W^{(t)} - \eta_1 \nabla f(W^{(t)})$
6:    for $t = T_1, T_1 + 1, \cdots, T_1 + T_2 - 1$ do
7:        $W^{(t+1)} = W^{(t)} - \eta_2 \nabla f(W^{(t)})$
8:    Return $W^{(T_1 + T_2)}$

proof details are postponed to Appendix F.3.1.

Theorem 8.1.1 (Convergence Guarantee of Population Gradient Descent). For any

ε > 0, with small enough σ, after T = O(ε−1/2) iterations, Algorithm 8.1.1 outputs

a solution $W^{(T)}$ that satisfies
$$f(W^{(T)}) \lesssim \epsilon, \quad \text{almost surely}.$$
Furthermore, for the weight vector of each hidden unit, we have
$$\angle(w_i^{(T)}, w^*) \lesssim \sqrt{\epsilon}, \qquad \Big\|\sum_j w_j^{(T)} - w^*\Big\| \lesssim \epsilon.$$

Remark 8.1.1. For any target error $\epsilon$, the total time complexity of Alg. 8.1.1 is $O(\epsilon^{-1/2})$. It can be seen from Theorem 8.1.1 that this complexity is independent of the number of hidden units in the over-specified model. The over-specification only changes the magnitude of the initialization, $\sigma = O(\frac{1}{k})$, and the step size $\eta_1 = O(\frac{1}{k})$.
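The two-phase schedule of Alg. 8.1.1 can be sketched as follows on a fixed finite sample (so it is closer in spirit to Alg. 8.1.2 below, but without fresh batches). All constants, step sizes, and iteration counts here are illustrative choices rather than the ones prescribed by the theory.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def two_phase_gd(X, y, k, eps=1e-2, seed=0):
    """Illustrative two-phase gradient descent in the spirit of Alg. 8.1.1,
    run on a fixed sample; the constants below are ad hoc, not the theoretical ones."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    sigma = 0.01 / k                               # small initialization scale, O(1/k)
    W = rng.standard_normal((d, k))
    W = sigma * W / np.linalg.norm(W, axis=0)      # each column on a radius-sigma sphere
    eta1, T1 = np.sqrt(eps) / (k * np.log(1.0 / eps)), int(1.0 / np.sqrt(eps))
    eta2, T2 = 1.0 / (4.0 * k), int(10 * np.log(1.0 / eps))
    for t in range(T1 + T2):
        eta = eta1 if t < T1 else eta2             # phase 1 (small step), then phase 2
        H = X @ W
        residual = relu(H).sum(axis=1) - y
        grad = X.T @ (residual[:, None] * (H > 0)) * (2.0 / n)
        W -= eta * grad
    return W

# toy run on data generated from model (8.1)
rng = np.random.default_rng(1)
d, n, k = 10, 5000, 4
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((n, d))
y = relu(X @ w_star)
W = two_phase_gd(X, y, k)
print(np.linalg.norm(W.sum(axis=1) - w_star))      # distance of the aggregated weights to w*
```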

In a more practical setting, one only has access to a finite, yet large, number

of observations. Built on the algorithm with population gradient, we establish the


Algorithm 8.1.2 Gradient Descent with Finite Samples
1: procedure LEARNINGOVERSPECIFIED1NNEMP($W^{(0)}$, $k$, $\epsilon$, $\delta$, $S$)
2:    Set $\eta_1 = O(\epsilon^{1/2}k^{-1}(\log(\epsilon^{-1}))^{-1})$, $T_1 = O(\epsilon^{-1/2})$, and $\eta_2 = O(k^{-1})$, $T_2 = O(\log(\epsilon^{-1}))$.
3:    Divide the dataset $S$ into $T_1 + T_2$ batches, $\{S^{(t)}\}_{t=0,1,2,\cdots,T_1+T_2-1}$, which satisfy $|S^{(t)}| \gtrsim \epsilon^{-2}\log(1/\epsilon)\cdot d\log k\log(1/\delta)$.
4:    for $t = 0, 1, 2, \cdots, T_1 - 1$ do
5:        $W^{(t+1)} = W^{(t)} - \eta_1 \nabla f_{S^{(t)}}(W^{(t)})$
6:    for $t = T_1, T_1 + 1, \cdots, T_1 + T_2 - 1$ do
7:        $W^{(t+1)} = W^{(t)} - \eta_2 \nabla f_{S^{(t)}}(W^{(t)})$
8:    Return $W^{(T_1 + T_2)}$

empirical version of the algorithm in Alg. 8.1.2. Alg. 8.1.2 is presented in an online

setting, where new independent samples are drawn in each iteration. It can be

viewed as stochastic gradient descent without reusing the samples. We show that

with mild condition on the initialization, the function returned by Algorithm 8.1.2

converges to the target function. The following theorem characterizes the time and

sample complexity for Alg. 8.1.2.

Theorem 8.1.2 (Convergence Guarantee of Empirical Gradient Descent). Let the target error $\epsilon > 0$, the tail probability $\delta > 0$, and the initialization $\{w_i^{(0)}\}_{i\in[k]}$ be given as in Algorithm 8.1.2. Denote $\alpha_i^{(0)} := \angle(w_i^{(0)}, w^*)$, let $|S^{(t)}|$ be the batch size in the $t$-th iteration, and let $W^{(T)}$ be the output of Algorithm 8.1.2 after $T$ iterations. Assume the initialization satisfies $\min_i(\pi - \alpha_i^{(0)})^3 \gtrsim \epsilon$ and $\|w_i^{(0)}\| \lesssim \epsilon^{1/2}\cdot k^{-1}\cdot(\log(\epsilon^{-1}))^{-2}\cdot \min_i(\pi - \alpha_i^{(0)})^3$. If $|S^{(t)}| \gtrsim \epsilon^{-2}\log(1/\epsilon)\cdot d\log k\log(1/\delta)$ and $T = O(\epsilon^{-1/2})$, then with probability at least $1 - \delta$,
$$f(W^{(T)}) \lesssim \epsilon.$$
Furthermore, the weights for each hidden unit satisfy
$$\angle(w_i^{(T)}, w^*) \lesssim \sqrt{\epsilon}, \qquad \Big\|\sum_j w_j^{(T)} - w^*\Big\| \lesssim \epsilon.$$

Remark 8.1.2. In Algorithm 8.1.2, the total number of samples required is $O(\epsilon^{-5/2}\cdot d)$ and the time complexity is $O(\epsilon^{-1/2})$. The initialization condition essentially requires that the initialization vectors are not aligned with the opposite direction of $w^*$ and that their magnitude is not too large. If one initializes $W^{(0)}$ uniformly from a sphere, this condition is harder to satisfy for larger $k$.

8.1.2 Numerical Experiments with Multiple Hidden Units

In this section, we show some empirical results for the case where the un-

derlying model has more than one hidden unit. Specifically, we formulate the model

as
$$x \sim N(0, I_d); \qquad y = \sum_{j=1}^{k_0}\phi(w_j^{*\top} x), \tag{8.5}$$
where $\{w_j^*\}_{j=1,2,\cdots,k_0}$ are the ground-truth parameters. We set $d = k_0$, $W^* := [w_1^*, w_2^*, \cdots, w_{k_0}^*] = I_d$ and $k \ge k_0$. We then perform gradient descent with random initialization on the population risk, Eq. (8.2). Similar experiments are also conducted in [106], where they show that bad local minima commonly exist for Eq. (8.2) for some choices of $d$, $k$ and $k_0$. Here we want to show that when $k$ becomes larger,

GD becomes less likely to get stuck in local minima and moreover, even if it gets

stuck, the converged local minimum is smaller.


Table 8.1: The average objective function value when gradient descent gets stuck at a non-global local minimum over 100 random trials. Note that the average function value here does not take globally converged function values into account.

            k − k0 = 0   1        2        4        8        16       32       64
  k0 = 4    0            0        0        0        0        0        0        0
  k0 = 8    0.2703       0        0        0        0        0        0        0
  k0 = 16   0.3088       0.2915   0        0        0        0        0        0
  k0 = 32   0.6017       0.4295   0.4079   0.6358   0.3276   0        0        0
  k0 = 64   0.9060       0.6463   0.6126   0.5432   0.4491   0.2724   0.1143   0

As we can see from Table 8.1, gradient descent gets stuck in bad local minima

more frequently with large k0 and d. However, increasing k could reduce the av-

erage training error and help the solution to land closer to global optima. When

k is as large as 2k0, GD converges to the global optima for all the cases we tried.

These results show over-specification plays an important role in reshaping the land-

scape of the objective function to make the optimization easier. More interestingly,

even when the algorithm gets stuck at local minima, the average local minimum

value is smaller for a larger k, as shown in Table 8.1. Note that our experiments

are performed on the population risk, and gradient is evaluated exactly given stan-

dard Gaussian input; hence the neural network is not over-parameterized. The phenomenon that over-specified models mitigate the local minima issue arises not only when the number of parameters is larger than the number of samples, but also in the population case.


8.2 Initialization Methods

Initialization methods can significantly affect the performance of non-convex

optimization. Here we discuss some initialization methods used for deep learning.

There are mainly three initialization methods for deep learning in theoretical com-

munities and practical communities.

Tensor Methods. In this thesis, we used tensor methods to initialize the

parameters and obtain theoretical guarantees. In our empirical experience, however, tensor methods are very sensitive to the data distribution: they often require knowing the underlying input distribution. Moreover, they involve a third-order tensor decomposition, whose computational complexity can be cubic in the dimension if not implemented carefully. Therefore, tensor methods are not commonly used as an initialization method in practice. Nevertheless, when the data has a nice distribution, tensor methods can be guaranteed to obtain a good initialization and are preferable to random initialization, as shown in Fig. 3.2 for mixed linear regression and Fig. 4.1 for one-hidden-layer neural networks.

Random Initialization. In practice, random Gaussian numbers or random uniform numbers are often employed to initialize the parameters. Although different random initialization schemes can significantly affect the solution the algorithm converges to, they can perform well if carefully handled. For example, Xavier initialization [51], which is one of the most popular choices in practice, uses a random Gaussian distribution with variance inversely proportional to the weight dimension. In this thesis, we also used random initialization for real-world applications, such as recommender systems and


semi-supervised clustering in Chapter 6. On the other hand, random initialization

does not always perform well on synthetic data. For example, it might not converge, or it might converge very slowly, as shown in Fig. 3.2 for mixed linear regression and

Fig. 4.1 for one-hidden-layer neural networks. Even if the neural network is a little

over-specified as shown in Table 8.1, gradient descent with random initialization

might still lead to a local minimum.
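As a small illustration of the scaling mentioned above, here is a sketch of a Xavier-style Gaussian initializer; the $2/(\text{fan-in}+\text{fan-out})$ variance is one common convention (the original derivation in [51] targets linear/tanh units), and the layer sizes are hypothetical.

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    """Gaussian initialization with variance inversely proportional to the layer
    widths, in the spirit of Xavier [51]; this exact scaling is one common choice."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.standard_normal((fan_in, fan_out)) * std

rng = np.random.default_rng(0)
W1 = xavier_init(784, 256, rng)
W2 = xavier_init(256, 10, rng)
print(W1.std(), W2.std())   # roughly sqrt(2 / (fan_in + fan_out)) for each layer
```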

For theoretical guarantees, random initialization can be applied to non-

convex problems together with local search heuristics when the objective function

does not have bad local minima since saddle points can be escaped by perturbed

gradient descent [73]. Although naive square-loss based objectives can have bad local minima, as shown in Table 8.1, landscape designs [47] have been proposed that use a different objective function which eliminates all the bad local minima.

It is also shown that random initialization can converge to global minima with some probability for simple convolutional neural networks [39], or can converge for residual networks when the initialization magnitude is small [85].

Orthogonal Initialization. Orthogonal initialization uses orthogonal ma-

trices as the weight matrices for initialization. It has been shown that orthogonal

weight matrices can avoid the vanishing/exploding gradient problem in deep neural networks or recurrent neural networks [90, 144], so orthogonal weight initialization is often employed and can lead to better solutions in practice. For example, [134] uses orthogonal initialization for 10,000-layer CNNs and achieves results comparable to residual networks.
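A common way to generate such matrices is via the QR decomposition of a Gaussian matrix, as in the following sketch (assuming NumPy; the sign correction makes the draw uniform over orthogonal matrices, and the `gain` parameter is an illustrative extra).

```python
import numpy as np

def orthogonal_init(rows, cols, rng, gain=1.0):
    """Draw a random weight matrix with orthonormal columns (or rows) via QR,
    as commonly used for orthogonal weight initialization."""
    A = rng.standard_normal((max(rows, cols), min(rows, cols)))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))          # sign fix so the draw is uniform (Haar)
    if rows < cols:
        Q = Q.T
    return gain * Q[:rows, :cols]

rng = np.random.default_rng(0)
W = orthogonal_init(512, 512, rng)
print(np.allclose(W.T @ W, np.eye(512), atol=1e-8))   # True: columns are orthonormal
```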


8.3 Stochastic Gradient Descent (SGD) and Other Fast Algorithms

Our convergence rate for gradient descent requires resampling in each iteration. This procedure can also be viewed as stochastic gradient descent with a sufficiently large batch size (proportional to the input dimension) and without reusing the data. Therefore, if the input dimension is not too large, common empirical batch sizes (128-512) suffice to provide the guarantees. Note that we have already shown that the population objective has positive definite Hessians everywhere in a neighborhood of the ground truth. The caveat of using small-batch SGD is that an SGD update might move the iterate outside of the local strong convexity region, which would invalidate the following steps. Therefore, more statistical analysis needs to be done to bound the probability that SGD with a small batch size moves out of the local strong convexity region during the optimization process. Other SGD-based algorithms, such as ADAM or AdaGrad, may encounter similar problems.
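The resampling scheme can be written compactly as SGD that draws a fresh batch at every step, as in the following sketch; the helper names (`sample_batch`, `grad_fn`) and the toy least-squares usage are illustrative assumptions, not part of the thesis.

```python
import numpy as np

def resampled_sgd(sample_batch, grad_fn, w0, eta, T):
    """Sketch of the resampling scheme discussed above: each update uses a fresh,
    independent batch, i.e. SGD with a large batch and no data reuse."""
    w = w0.copy()
    for _ in range(T):
        X, y = sample_batch()        # fresh batch for this iteration
        w = w - eta * grad_fn(w, X, y)
    return w

# toy usage: least squares with fresh batches
rng = np.random.default_rng(0)
w_star = rng.standard_normal(5)

def sample_batch(b=256):
    X = rng.standard_normal((b, 5))
    return X, X @ w_star

def grad_fn(w, X, y):
    return 2.0 * X.T @ (X @ w - y) / len(y)

w_hat = resampled_sgd(sample_batch, grad_fn, np.zeros(5), eta=0.1, T=200)
print(np.linalg.norm(w_hat - w_star))
```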

Second-order optimization algorithms are also very popular especially for

smooth and strongly convex problems. For example, quasi-Newton methods, such

as LBFGS, could be employed to solve smooth non-convex objective functions. For traditional strongly convex functions, quasi-Newton methods can achieve a super-linear convergence rate. For the non-convex problems we considered, we can easily

show that quasi-Newton method can achieve super-linear local convergence rate

in the population case if initialized properly. For the empirical case, more careful

analyses need to be conducted.


Appendices


Appendix A

Mixed Linear Regression

A.1 Proofs of Local Convergence
A.1.1 Some Lemmata

We first introduce some facts and lemmata, whose proofs can be found in

Sec. A.1.4.

Fact A.1.1. We define a sequence of constants, $\{C_J\}_{J=0,1,\cdots}$, that satisfy
$$C_0 = 1,\quad C_1 = 3,\quad\text{and}\quad C_J = C_{J-1} + (4J^2 + 2J)C_{J-2}\ \text{ for } J \ge 2. \tag{A.1}$$
By construction, we can upper bound $C_J$:
$$C_J \le C_{J-2} + (4J^2+2J)C_{J-2} + (4(J-1)^2 + 2J - 2)C_{J-3} \le C_{J-2} + (4J^2+2J)C_{J-2} + (4(J-1)^2 + 2J - 2)C_{J-2} \le 8J^2 C_{J-2} \le (3J)^J. \tag{A.2}$$

Lemma A.1.1 (Stein-type Lemma). Let $x \sim N(0, I_d)$ and let $f(x)$ be a function of $x$ whose second derivative exists. Then
$$\mathbb{E}\big[f(x)\, xx^\top\big] = \mathbb{E}[f(x)]\, I + \mathbb{E}\big[\nabla^2 f(x)\big].$$


Lemma A.1.2. Let $x \sim N(0, I_d)$ and $A_k \succeq 0$ for all $k = 1, 2, \cdots, K$. Then
$$\Pi_{k=1}^K\operatorname{tr}(A_k)\, I \;\preceq\; \mathbb{E}\big[\Pi_{k=1}^K(x^\top A_k x)\, xx^\top\big] \;\preceq\; C_K\,\Pi_{k=1}^K\operatorname{tr}(A_k)\, I, \tag{A.3}$$
where $C_K$ is a constant depending only on $K$, which is defined in Eq. (A.1).

Lemma A.1.3. Let $x_i \sim N(0, I_d)$ i.i.d. for all $i \in [n]$, and $A_k \succeq 0$ for all $k = 1, 2, \cdots, K$. Let $B := \mathbb{E}\big[\Pi_{k=1}^K (x^\top A_k x)\, xx^\top\big]$, $B_i := \Pi_{k=1}^K (x_i^\top A_k x_i)\, x_i x_i^\top$ and $\bar{B} = \frac{1}{n}\sum_{i=1}^n B_i$.

If $n \ge O\big(\frac{1}{\delta^2}\log^K(\frac{1}{\delta})(PK)^K d\log^{K+1} d\big)$ and $\delta > \frac{\sqrt{4K C_{2K+1}}}{\sqrt{n d^P}}$ for some $0 < \delta \le 1$ and $P \ge 1$, then w.p. $1 - O(K d^{-P})$, we have
$$\|\bar{B} - B\| \le \delta\,\|B\|. \tag{A.4}$$

Lemma A.1.4. Let $x \sim N(0, I_d)$. Then given $\beta, \gamma \in \mathbb{R}^d$ and $A_k \succeq 0$ for all $k = 1, 2, \cdots, K$, we have
$$\|\beta\|\|\gamma\|\,\Pi_{k=1}^K\operatorname{tr}(A_k) \;\le\; \big\|\mathbb{E}\big[(\beta^\top x)(\gamma^\top x)\,\Pi_{k=1}^K(x^\top A_k x)\, xx^\top\big]\big\| \;\le\; \sqrt{3C_{2K+1}}\,\|\beta\|\|\gamma\|\,\Pi_{k=1}^K\operatorname{tr}(A_k). \tag{A.5}$$

Lemma A.1.5. Let $x_i \sim N(0, I_d)$ i.i.d. for all $i \in [n]$, $\beta, \gamma \in \mathbb{R}^d$, and $A_k \succeq 0$ for all $k = 1, 2, \cdots, K$. Let $B := \mathbb{E}\big[(\beta^\top x)(\gamma^\top x)\,\Pi_{k=1}^K (x^\top A_k x)\, xx^\top\big]$, $B_i := (\beta^\top x_i)(\gamma^\top x_i)\,\Pi_{k=1}^K (x_i^\top A_k x_i)\, x_i x_i^\top$ and $\bar{B} = \frac{1}{n}\sum_{i=1}^n B_i$.

If $n \ge O\big(\frac{1}{\delta^2}\log^{K+1}(1/\delta)(PK)^K d\log^{K+2}(d)\big)$ and $\delta > \frac{\sqrt{8K C_{2K+3}}}{\sqrt{n d^P}}$ for some $0 < \delta \le 1$ and $P \ge 1$, then w.p. $1 - O(K d^{-P})$, we have
$$\|\bar{B} - B\| \le \delta\,\|B\|. \tag{A.6}$$

Lemma A.1.6. If n ≥ c logK+1(c)K4Kd logK+2(d), where c is a constant, then

n ≥ cd log d logK+1(n).


A.1.2 Proof of Theorem 3.3.1

Proof. Denote the Hessian of Eq. (3.1), H ∈ RKd×Kd. Let H =∑

i Hi, where

Hi :=

H11

i H12i · · · H1K

i

H21i H22

i · · · H2Ki

. . .HK1

i HK2i · · · HKK

i

(A.7)

For diagonal blocks,

Hjji := 2

(Πk 6=j(yi − (wk + δwk)

Txi)2)xix

Ti (A.8)

For off-diagonal blocks,

Hjli := 4(yi−(wj+δwj)

Txi)(yi−(wl+δwl)Txi)

(Πk 6=j,k 6=l(yi − (wk + δwk)

Txi)2)xix

Ti

(A.9)

In the following we will show that when wk is close to the optimal solution w∗k and

δwk is small enough for all k, then H will be positive definite w.h.p..

The main idea is to upper bound the off-diagonal blocks and lower bound

the diagonal blocks because,

σmin(H) = min∑Kj=1 ‖aj‖2=1

K∑j=1

aTj H

jjaj +∑j 6=l

2aTj H

jlal

≥ min∑Kj=1 ‖aj‖2=1

K∑j=1

σmin(Hjj)‖aj‖2 −

∑j 6=l

‖Hjl‖‖aj‖‖al‖

≥ minjσmin(H

jj) −maxj 6=l‖Hjl‖(K − 1)(

∑j

‖aj‖)

≥ minjσmin(H

jj) − (K − 1)maxj 6=l‖Hjl‖.

(A.10)


First consider the diagonal blocks. The idea is to decompose the diagonal

blocks into two parts. The first one only contains w and doesn’t contain δw, so for

this fixed w we apply Lemma A.1.3 to bound this term. The second one depends

on δw. We find an upper bound for this term which only depends on the magnitude

of δw. Therefore, the bound will hold for any qualified δw. Let’s first define

k1, k2, · · · , kK−1 = [K]\j.

Hjj ∑i∈Sj

Hjji

=∑i∈Sj

2(ΠK−1

s=1 (yi −wTksxi − δwT

ksxi)2)xix

Ti

∑i∈Sj

2((yi −wT

k1xi)

2 − 2|yi −wTk1xi|‖δwk1‖‖xi‖

)(ΠK−1

s=2 (yi −wTksxi − δwT

ksxi)2)xix

Ti

∑i∈Sj

2(yi −wTk1xi)

2(ΠK−1

s=2 (yi −wTksxi − δwT

ksxi)2)xix

Ti︸ ︷︷ ︸

F1

−∑i∈Sj

4‖∆w∗jk1−∆wk1‖‖δwk1‖‖xi‖2

(ΠK−1

s=2 (yi −wTksxi − δwT

ksxi)2)xix

Ti︸ ︷︷ ︸

E1

(A.11)


F1 ∑i∈Sj

2(yi −wTk1xi)

2(yi −wTk2xi)

2(ΠK−1

s=3 (yi −wTksxi − δwT

ksxi)2)xix

Ti

−∑i∈Sj

4(yi −wTk1xi)

2‖∆w∗jk2 −∆wk2‖‖δwk2‖‖xi‖2

(ΠK−1

s=3 (yi −wTksxi − δwT

ksxi)2)xix

Ti

∑i∈Sj

2(yi −wTk1xi)

2(yi −wTk2xi)

2(ΠK−1

s=3 (yi −wTksxi − δwT

ksxi)2)xix

Ti︸ ︷︷ ︸

F2

−∑i∈Sj

4‖∆w∗jk1 −∆wk1‖2‖∆w∗

jk2 −∆wk2‖‖δwk2‖‖xi‖4(ΠK−1

s=3 (yi −wTksxi − δwT

ksxi)2)xix

Ti︸ ︷︷ ︸

E2

(A.12)

Similarly, we decompose Fn = Fn+1 − En+1, for n = 1, 2, · · · , K − 1. Then,

recursively, we have

Hjj F1 − E1 F2 − E2 − E1 · · · FK−1 − EK−1 − EK−2 − · · · − E1

(A.13)

So Hjj is decomposed into FK−1, which contains only w, and E1, E2, · · · , EK−1,

each of which contains a separate term of ‖δw‖.

By Lemma A.1.2 and Lemma A.1.3,

E1 4∑i∈Sj

‖∆w∗jk1−∆wk1‖‖δwk1‖(ΠK−1

s=2 ‖∆w∗jks −∆wks − δwks‖2)‖xi‖2(K−1)xix

Ti

4cf (1 + cm + cf )2K−3Πk:k 6=j‖∆w∗

jk‖2∑i∈Sj

‖xi‖2(K−1)xixTi

6cf (1 + cm + cf )2K−3Πk:k 6=j‖∆w∗

jk‖2pjNCK−1dK−1I

(A.14)

and similarly, for all r = 1, 2, · · · , K − 1,

Er 6cf (1 + cm + cf )2K−3Πk:k 6=j‖∆w∗

jk‖2pjNCK−1dK−1I. (A.15)


For FK−1, we have

FK−1 =∑i∈Sj

2(Πk 6=j(yi −wT

k xi)2)xix

Ti

ξ1pjNΠk 6=j‖∆w∗

jk −∆wk‖2I

pjNΠk 6=j(‖∆w∗jk‖ − ‖∆wk‖)2I

pjN(1− cm)2(K−1)Πk 6=j‖∆w∗

jk‖2I

(A.16)

where ξ1 is because of Lemma A.1.2 and Lemma A.1.3 by setting Ak = (∆w∗jk −

∆wk)(∆w∗jk −∆wk)

T and δ = 1/(2CK−1).

Now combining Eq. (A.16), Eq. (A.13) and Eq. (A.15), we can lower bound

the eigenvalues of Hjj ,

Hjj ((1− cm)

2(K−1) − 6cf (K − 1)(1 + cm + cf )2K−3CK−1d

K−1)pjNΠk 6=j‖∆w∗

jk‖2I

(A.17)

Next consider the off-diagonal blocks for j 6= l,


∑i∈Sq

Hjli

=∑i∈Sq

4(yi − (wj + δwj)Txi)(yi − (wl + δwl)

Txi)(Πk 6=j,k 6=l(yi − (wk + δwk)

Txi)2)xix

Ti

∑i∈Sq

4(yi −wTj xi)(yi − (wl + δwl)

Txi)(Πk 6=j,k 6=l(yi − (wk + δwk)

Txi)2)xix

Ti

+∑i∈Sq

4‖δwTj xi‖|yi − (wl + δwl)

Txi|(Πk 6=j,k 6=l(yi − (wk + δwk)

Txi)2)xix

Ti

∑i∈Sq

4(yi −wTj xi)(yi −wT

l xi)(Πk 6=j,k 6=l(yi − (wk + δwk)

Txi)2)xix

Ti

+∑i∈Sq

4|yi −wTj xi|‖δwT

l xi‖(Πk 6=j,k 6=l(yi − (wk + δwk)

Txi)2)xix

Ti

+∑i∈Sq

4‖δwj‖‖w∗q −wl − δwl)‖

(Πk 6=j,k 6=l‖w∗

q −wk + δwk‖2)‖xi‖2(K−1)xix

Ti

...

∑i∈Sq

4(yi −wTj xi)(yi −wT

l xi)(Πk 6=j,k 6=l(yi −wT

k xi)2)xix

Ti

+ 8(K − 1)cf (1 + cm + cf )2K−3∆2K−2

max

∑i∈Sq

‖xi‖2(K−1)xixTi

∑i∈Sq

4(yi −wTj xi)(yi −wT

l xi)(Πk 6=j,k 6=l(yi −wT

k xi)2)xix

Ti

+ 12(K − 1)cf (1 + cm + cf )2K−3∆2K−2

max pqNCK−1dK−1I

(A.18)


For the first term above,

‖∑i∈Sq

4(w∗q −wj)

Txi(w∗q −wl)

Txi

(Πk 6=j,k 6=l((w

∗q −wk)

Txi)2)xix

Ti ‖

ξ1≤6pqN‖E

q(w∗

q −wj)Txi(w

∗q −wl)

Txi

(Πk 6=j,k 6=l((w

∗q −wk)

Txi)2)xix

Ti

y‖

ξ2≤6pqN

√3C2K−3‖w∗

q −wj‖‖w∗q −wl‖

(Πk 6=j,k 6=l‖w∗

q −wk‖2)

≤6pqN√

3C2K−3‖∆w∗qj −∆wj‖‖∆w∗

ql −∆wl‖(Πk 6=j,k 6=l‖∆w∗

qk −∆wk‖2),

(A.19)

where ξ1 is because of Lemma A.1.5 and ξ2 is because of Lemma A.1.4.

We consider three cases: q 6= j, q 6= l, q = j and q = l. When q 6= j and

q 6= l,

‖∆w∗qj −∆wj‖‖∆w∗

ql −∆wl‖(Πk 6=j,k 6=l‖∆w∗

qk −∆wk‖2)

≤(1 + cm)2K−2c2m‖∆w∗

qj‖‖∆w∗ql‖(Πk 6=j,k 6=l‖∆w∗

qk‖2) (A.20)

When q = j,

‖∆w∗qj −∆wj‖‖∆w∗

ql −∆wl‖(Πk 6=j,k 6=l‖∆w∗

qk −∆wk‖2)

≤(1 + cm)2K−1cm‖∆w∗

qj‖‖∆w∗ql‖(Πk 6=j,k 6=l‖∆w∗

qk‖2) (A.21)

For q = l, we have similar results. Therefore,

‖Hjl‖ ≤K∑q=1

‖∑i∈Sq

Hjli ‖

≤∑q

(1 + cm)2K−1cm6pqN

√3C2K−3∆

2K−2max

+∑q

12(K − 1)cf (1 + cm + cf )2K−3pqNCK−1d

K−1∆2K−2max

≤(1 + cm)2K−1cm6N

√3C2K−3∆

2K−2max

+ 12(K − 1)cf (1 + cm + cf )2K−3NCK−1d

K−1∆2K−2max

(A.22)


Now we obtain the lower bound for the minimal eigenvalue of the Hessian.

When cm ≤ pmin∆2K−2min

500K√

C2K−3∆2K−2max

and cf ≤ pmin∆2K−2min

1000(K−1)2CK−1dK−1∆2K−2max

, we have (1 −

cm)2K−2 ≥ (1− 1

2K)2K−2 ≥ 1

4, (1 + cm + cf )

2K−2 ≤ 3. Hence,

‖Hjl‖ ≤ 1

16(K − 1)pminN∆2K−2

min , (A.23)

Combining Eq.(A.10), Eq.(A.17) and Eq.(A.23), we have

σmin(H) ≥ 1

8pminN∆2K−2

min , (A.24)

which is a positive constant.

In the following we upper bound the maximal eigenvalue of the Hessian.

σmax(H) = max∑Kj=1 ‖aj‖2=1

K∑j=1

aTj H

jjaj +∑j 6=l

2aTj H

jlal

≤ max∑Kj=1 ‖aj‖2=1

K∑j=1

‖(Hjj)‖‖aj‖2 +∑j 6=l

‖Hjl‖‖aj‖‖al‖

≤ maxj‖Hjj‖+max

j 6=l‖Hjl‖(K − 1)(

∑j

‖aj‖)

≤ maxj‖Hjj‖+ (K − 1)max

j 6=l‖Hjl‖.

(A.25)

Consider the diagonal blocks and define k1, k2, · · · , kK−1 = [K]\j.

Hjji = 2

(ΠK−1

s=1 (yi −wTksxi − δwT

ksxi)2)xix

Ti

2((yi −wT

k1xi)

2 + 2|yi −wTk1xi||δwT

k1xi|+ (δwT

k1xi)

2)(ΠK−1

s=2 (yi −wTksxi − δwT

ksxi)2)xix

Ti

2(yi −wTk1xi)

2(ΠK−1

s=2 (yi −wTksxi − δwT

ksxi)2)xix

Ti︸ ︷︷ ︸

F1

+ 2(2‖∆w∗jk1−∆wk1‖+ ‖δwk1‖)‖δwk1‖‖xi‖2

(ΠK−1

s=2 (yi −wTksxi − δwT

ksxi)2)xix

Ti︸ ︷︷ ︸

E1

(A.26)


For E1,

E1 4cf (1 + cm + cf )2K−3∆2K−2

max ‖xi‖2K−2xixTi (A.27)

For F1,

F1

2(yi −wTk1xi)

2(yi −wTk2xi)

2(ΠK−1

s=3 (yi −wTksxi − δwT

ksxi)2)xix

Ti︸ ︷︷ ︸

F2

+ 2(yi −wTk1xi)

2|(2∆w∗jk2−2∆wk2−δwk2)

Txi||δwTk2xi|(ΠK−1

s=3 (yi −wTksxi − δwT

ksxi)2)xix

Ti︸ ︷︷ ︸

E2

(A.28)

We also have for E2

E2 4cf (1 + cm + cf )2K−3∆2K−2

max ‖xi‖2K−2xixTi (A.29)

Therefore, recursively, we have

Hjji 2ΠK−1

s=1 (yi −wTksxi)

2xixTi︸ ︷︷ ︸

FK−1

+ 4Kcf (1 + cm + cf )2K−3∆2K−2

max ‖xi‖2K−2xixTi

(A.30)


Now applying Lemma A.1.2 and Lemma A.1.3,

Hjj =∑q

∑i∈Sq

Hjji

6cfK(1 + cm + cf )2K−3NCK−1d

K−1∆2K−2max I +

∑q

∑i∈Sq

2(Πk 6=q((∆w∗

jk −∆wk)Txi)

2)xix

Ti

6cfK(1 + cm + cf )2K−3NCK−1d

K−1∆2K−2max I + 3

∑q

pqNCK−1

(Πk 6=q‖∆w∗

jk −∆wk‖2)

=6cfK(1 + cm + cf )2K−3NCK−1d

K−1∆2K−2max I

+ 3pjNCK−1(1 + cm)2K−2

(Πk 6=j‖∆w∗

jk‖2)

+ 3∑q:q 6=j

pqNCK−1c2m(1 + cm)

2K−4(Πk:k 6=j‖∆w∗

jk‖2)

9NCK−1∆2K−2max I

(A.31)

Combining the off-diagonal blocks bound in Eq. (A.23), applying union bound on

the probabilities of the lemmata and Eq. (A.2) complete the proof.

A.1.3 Proof of Theorem 3.3.2

We first introduce a corollary of Theorem 3.3.1, which shows the strong

convexity on a line between a current iterate and the optimum.

Corollary A.1.1 (Positive Definiteness on the Line between w and w∗). Let xi, yii=1,2,··· ,N

be sampled from the MLR model (3.2). Let wkk=1,2,··· ,K be independent of the

samples and lie in the neighborhood of the optimal solution, defined in Eq. (3.3).

Then, if N ≥ O(KKd logK+2(d)), w.p. 1−O(Kd−2), for all λ ∈ [0, 1],

1

8pminN∆2K−2

min I ∇2f(λw∗ + (1− λ)w) 10N(3K)K∆2K−2max I. (A.32)

Proof. We set dK−1 anchor points equally along the line λw∗ + (1 − λ)w for λ ∈


[0, 1]. Then based on these anchors, according to Theorem 3.3.1, by setting P =

K + 1, we complete the proof.

Now we show the proof of Theorem 3.3.2.

Proof. Let α := 18pminN∆2K−2

min and β := 10N(3K)K∆2K−2max .

‖w+ −w∗‖2 =‖w − η∇f(w)−w∗‖2

=‖w −w∗‖2 − 2η∇f(w)T (w −w∗) + η2‖∇f(w)‖2(A.33)

∇f(w) =

(∫ 1

0

∇2f(w∗ + γ(w −w∗))dγ

)(w −w∗)

=: H(w −w∗)

(A.34)

According to Corollary A.1.1,

αI H βI. (A.35)

‖∇f(w)‖2 = (w −w∗)T H2(w −w∗) ≤ β(w −w∗)T H(w −w∗) (A.36)

Therefore,

‖w+ −w∗‖2 ≤‖w −w∗‖2 − (−η2β + 2η)(w −w∗)T H(w −w∗)

≤‖w −w∗‖2 − (−η2β + 2η)α‖w −w∗‖2

=‖w −w∗‖2 − α

β‖w −w∗‖2

≤(1− α

β)‖w −w∗‖2

(A.37)

where the third equality holds by setting η = 1β

.


A.1.4 Proof of the lemmata

A.1.4.1 Proof of Lemma A.1.1

Proof. Let g(x) = 1(2π)d/2

e−‖x‖2/2 and we have xg(x)dx = −dg(x).

Eqf(x)xxT

y=

∫f(x)xxTg(x)dx

= −∫

f(x)(dg(x))xT

=

∫∇f(x)xTg(x))dx+

∫f(x)g(x)Idx

= −∫∇f(x)(dg(x))T + EJf(x)KI

= Eq∇2f(x)

y+ EJf(x)KI

(A.38)

A.1.4.2 Proof of Lemma A.1.2

Proof. Let GK := EqΠK

k=1(xTAkx)xx

Ty

. First we show the lower bound.

σmin(GK) = min‖a‖=1

EqΠK

k=1(xTAkx)(x

Ta)2y

≥ ΠKk=1E

q(xTAkx)

ymin‖a‖=1

Eq(xTa)2

y

= ΠKk=1 tr(Ak)

(A.39)

Next, we show the upper bound. As we know, when K = 1, G1 = tr(A1)I+

2A1 and for any K > 1, GK should have an explicit closed-form. However, it is too

complicated to derive and formulate it for general K. Fortunately we only need the

property of Eq. (A.3) in our proofs. We prove it by induction. First, it is obvious

that Eq. (A.3) holds for K = 1 and C1 = 3. We assume that, for any J < K, there


exists a constant CJ depending only on J , such that

GJ CJΠJk=1 tr(Ak)I (A.40)

Then by Stein-type lemma, Lemma A.1.1,

GK =EqΠK

k=1(xTAkx)xx

Ty

=EqΠK

k=1(xTAkx)

yI + 2

K∑j=1

Eq(ΠK

k 6=j(xTAkx))Aj

y

+ 4∑j,l:j 6=l

AjEq(Πk:k 6=j,k 6=l(x

TAkx))xxTyAl

CK−1ΠKk=1 tr(Ak)I + 2

K∑j=1

CK−2(Πk 6=j tr(Ak))Aj

+ 4∑j,l:j 6=l

CK−2‖Aj‖‖Al‖Πk:k 6=j,k 6=l tr(Ak)I

(CK−1 + (2K + 4K2)CK−2

)ΠK

k=1 tr(Ak)I

(A.41)

So CK = CK−1 + (4K2 + 2K)CK−2. Note that C0 = 1.

A.1.4.3 Proof of Lemma A.1.3

Proof. Proof Sketch: We use matrix Bernstein inequality to prove this lemma. How-

ever, the spectral norm of the random matrix Bi is not uniformly bounded, which is

required by matrix Bernstein inequality. So we define a new random matrix,

Mi := 1(Ei)ΠKk=1(x

Ti Akxi)xix

Ti ,

where Ei is an event when ‖Bi‖ is bounded, which will hold with high probability

and 1() is the indicate function of value 1 and 0, i.e., 1(E) = 1 if E holds and

1(E) = 0 otherwise. Then

‖B −B‖ ≤ ‖B − M‖+ ‖M −M‖+ ‖M −B‖,


where M = EJMiK and M = 1n

∑ni=1 Mi. We show that

1. M = B w.h.p. by the union bound

2. ‖M −M‖ is bounded by matrix Bernstein inequality

3. ‖M −B‖ is bounded because EJ1(Ec)K is small.

Proof Details:

Step 1. First we show that ‖Bi‖ is bounded w.h.p.. First,

‖Bi‖ = ΠKk=1(x

Ti Akxi)‖xi‖2

Since x ∼ N(0, Id), by Corollary 2.4.5, we have PJ‖x‖2 ≥ (4P + 5)d log nK ≤

n−1d−P . By Corollary 2.4.1, PqxTAkx > (4P + 5) tr(Ak) log n

y≤ n−1d−P .

Therefore w.p. 1− (K + 1)n−1d−P ,

‖Bi‖ ≤ (4P + 5)K+1 × (ΠKk=1 tr(Ak))d log

K+1(n).

Define

m := (4P + 5)K+1(ΠKk=1 tr(Ak))d log

K+1(n). (A.42)

and the event

Ei = ‖Bi‖ ≤ m,

Let Ec be the complementary set of E, thus PJEciK ≤ (K + 1)n−1d−P . By union

bound, w.p. 1− (K + 1)d−P , ‖Bi‖ ≤ m for all i ∈ [n] and M = B.

Step 2. Now we bound ‖M −M‖ by Matrix Bernstein’s inequality[127].

Set Zi := Mi −M . Thus EJZiK = 0 and ‖Zi‖ ≤ 2m. And

‖EqZ2

i

y‖ = ‖E

qM2

i

y−M2‖ ≤ ‖E

qM2

i

y‖+ ‖M2‖


Since M is PSD, ‖EJM2i K‖ ≤ m‖M‖. Now by matrix Bernstein’s inequality, for

any δ > 0,

P

t1

n‖

n∑i=1

Zi‖ ≥ δ‖M‖

|

≤ 2d exp(− δ2n2‖M‖2/2mn‖M‖+ 2mnδ‖M‖/3

) = 2d exp(− δ2n‖M‖/2m+ 2mδ/3

)

(A.43)

Setting

n ≥ (P + 1)(4

3δ+

2

δ2)m‖M‖−1 log d, (A.44)

we have w.p. at least 1− 2d−P ,

‖ 1n

∑Mi −M‖ ≤ δ‖M‖ (A.45)

Step 3. Now we bound ‖M − B‖. For simplicity, we replace xi by x and

Ei by E.

‖M −B‖

=‖EJBi1(Eci)K‖

= max‖a‖=1

Eq(aTx)2ΠK

k=1(xTAkx)1(E

c)y

ζ1≤ max

‖a‖=1E

q(aTx)4ΠK

k=1(xTAkx)

2y1/2

EJ1(Ec)K1/2

= max‖a‖=1

〈aaT ,Eq(xTaaTx)ΠK

k=1(xTAkx)

2xxTy〉1/2EJ1(Ec)K1/2

ζ2≤ max

‖a‖=1〈aaT , C2K+1Π

Kk=1 tr(Ak)

2I〉1/2EJ1(Ec)K1/2

ζ3≤√

(K + 1)C2K+1√ndP

ΠKk=1 tr(Ak)

(A.46)

where ζ1 is from Holder’s inequality, ζ2 is because of Lemma A.1.2 and ζ3 is be-

cause EJ1(Ec)K = PJEcK. Assume n ≥ 4(K + 1)C2K+1/dP , we have ‖M −B‖ ≤


12‖B‖ and 3

2‖B‖ ≥ ‖M‖ ≥ 1

2‖B‖. So combining this result with Eq. (A.42),

Eq. (A.44), and Eq. (A.45), if

n ≥ max4(K + 1)C2K+1/dP , c1

1

δ2(4P + 5)K+2d logK+1(n) log d, (A.47)

we obtain

‖ 1n

∑Mi −M‖ ≤ 1

3δ‖M‖ ≤ 1

2δ‖B‖. (A.48)

According to Lemma A.1.6, n ≥ O( 1δ2logK+1(1

δ)(PK)Kd logK+2 d) will imply

Eq. (A.47). By further setting δ >√

4(K+1)C2K+1√ndP

, we have ‖M − B‖ ≤ 12δ‖B‖,

completing the proof.

A.1.4.4 Proof of Lemma A.1.4

Proof.‖E

q(βTx)(γTx)ΠK

k=1(xTAkx)xx

Ty‖

≥ Eq(βTx)2(γTx)2ΠK

k=1(xTAkx)

y/(‖β‖‖γ‖)

≥ ‖β‖‖γ‖ΠKk=1 tr(Ak).

(A.49)

‖Eq(βTx)(γTx)ΠK

k=1(xTAkx)xx

Ty‖

= maxa,b

Eq(βTx)(γTx)(aTx)(bTx)ΠK

k=1(xTAkx)

y/(‖a‖‖b‖)

≤ Eq(aTx)2(bTx)2

y1/2E

q(βTx)2(γTx)2ΠK

k=1(xTAkx)

2y1/2

/(‖a‖‖b‖)

≤√

3C2K+1‖β‖‖γ‖ΠKk=1 tr(Ak)

(A.50)


A.1.4.5 Proof of Lemma A.1.5

Proof. Note that the matrix Bi is probably not PSD. Thus we can’t apply Lemma A.1.3

directly. But the proof is similar to that for Lemma A.1.3.

Define

m := (4P + 5)K+2‖β‖‖γ‖(ΠKk=1 tr(Ak))d log

K+1(n), (A.51)

and the event, Ei := ‖Bi‖ ≤ m. Then by Corollary 2.4.4,

PJEiK ≥ 1− 2Kn−1d−P .

Define a new random matrix Mi := 1(Ei)Bi, its expectation M := EJMiK and its

empirical average M = 1n

∑ni=1Mi.

Step 1. By union bound, we have w.p. 1 − 2Kd−P , Mi = Bi for all i, i.e.,

M = B.

Step 2. We now bound ‖M − B‖, For simplicity, we replace xi by x and


Ei by E.

‖M −B‖

=‖EJBi1(Eci)K‖

= max‖a‖=‖b‖=1

Eq(aTx)(bTx)(βTx)(γTx)ΠK

k=1(xTAkx)1(E

c)y

ζ1≤ max

‖a‖=‖b‖=1E

q(aTx)2(bTx)2(βTx)2(γTx)2ΠK

k=1(xTAkx)

2y1/2

EJ1(Ec)K1/2

= max‖a‖=‖b‖=1

〈aaT ,Eq(bTx)2(βTx)2(γTx)2ΠK

k=1(xTAkx)

2xxTy〉1/2EJ1(Ec)K1/2

ζ2≤ max

‖a‖=‖b‖=1〈aaT , C2K+3‖β‖2‖γ‖2ΠK

k=1 tr(Ak)2I〉1/2EJ1(Ec)K1/2

ζ3≤√2KC2K+3√

ndP‖β‖‖γ‖ΠK

k=1 tr(Ak)

(A.52)

where ζ1 is from Holder’s inequality, ζ2 is because of Lemma A.1.2 and ζ3 is be-

cause EJ1(Ec)K = PJEcK.

According to Eq. (A.52) and Lemma A.1.4, if√

2KC2K+3√ndP

≤ δ/2, then

‖M −B‖ ≤ 1

2δ‖β‖‖γ‖ΠK

k=1 tr(Ak) ≤1

2δ‖B‖ (A.53)

Since δ ≤ 1, we also have ‖M −B‖ ≤ 12‖B‖, so by Lemma A.1.4,

3

2‖B‖ ≥ ‖M‖ ≥ 1

2‖B‖ ≥ 1

2‖β‖‖γ‖ΠK

k=1 tr(Ak) (A.54)

Step 3. Now we bound ‖M − M‖. ‖M‖ ≤ m automatically holds. Since

M is probably not PSD, we don’t have ‖EJM2i K‖ ≤ m‖M‖. However, we can still


show that EJM2i K ≤ O(m)‖M‖.

‖EqM2

i

y‖

≤ ‖EqB2

i

y‖

= ‖Eq(βTx)2(γTx)2ΠK

k=1(xTAkx)

2‖x‖2xxTy‖

≤ C2K+3d× (‖β‖‖γ‖ΠKk=1 tr(Ak))

2

≤ 2C2K+3

(4P + 5)K+2m‖M‖

(A.55)

We can use matrix Bernstein inequality now. Let Zi := Mi − M . ‖Zi‖ ≤ 2m.

‖EJZ2i K‖ ≤ ( 2C2K+3

(4P+5)K+2 + 1)m‖M‖. Define CK := 2C2K+3

(4P+5)K+2 + 1, then

P

t1

n‖

n∑i=1

Zi‖ ≥ δ‖M‖

|

≤ 2d exp(− δ2n2‖M‖2/2CKmn‖M‖+ 2mnδ‖M‖/3

) ≤ 2d exp(− δ2n‖M‖/2CKm+ 2mδ/3

)

(A.56)

Thus, when n ≥ (P + 1)( CK

δ2+ 2

3δ)m/‖M‖ log d, we have w.p., 1− c2d

−P ,

‖M −M‖ ≤ 1

3δ‖M‖ ≤ 1

2δ‖B‖.

By Eq. (A.51) and Eq. (A.54),

(P + 1)(CK

δ2+

2

3δ)m/‖M‖ log d ≤ c1

CK

δ2× (4P + 5)K+3d logK+1(n) log(d)

≤ c1δ2(2C2K+3(P + 1) + (4P + 5)K+2)d logK+1(n) log(d)

Applying the fact, ‖B−B‖ ≤ ‖B−M‖+‖M−M‖+‖M−B‖, and Lemma A.1.6

completes the proof.


A.1.5 Proof of Lemma A.1.6

Proof. Assume we require n ≥ cd log(d) logK+1(n) and we have n ≥ bcd log(d) logA(d),

where b, A depends only on K. First let’s assume we haveb = K4K logK+1(c)

A = K + 1

Therefore, b ≥ K4K

b ≥ 4K+1 logK+1(c)

A ≥ K + 1

b ≥ (4(A+ 1))K+1

Taking a log on both sides, we obtain,log b ≥ (K + 1) log(4 log(b))

log b ≥ (K + 1) log(4 log(c))

log b+ A log log(d) ≥ (K + 1) log(4 log(d))

log b+ A log log(d) ≥ (K + 1) log(4(A+ 1) log log(d))

Combining the above four inequalities, we have,

log b+ A log log(d) ≥ (K + 1) log(4maxlog(b), log(c), log(d), (A+ 1) log log(d))

Replacing max by addition, we have,

log b+ A log log(d) ≥ (K + 1) log(log(b) + log(c) + log(d) + (A+ 1) log log(d))

Taking exp on both sides,

b logA(d) ≥ logK+1(bcd log(d) logA(d))


Reformulating:

bcd log(d) logA(d)

logK+1(bcd log(d) logA(d))≥ cd log d

n

logK+1(n)≥ cd log d

Finally, we obtain

n ≥ cd log d logK+1(n)

A.2 Proofs of Tensor Method for Initialization
A.2.1 Some Lemmata

We will use the following lemma to guarantee the robust tensor power

method. The proofs of these lemmata will be found in Sec. A.2.4.

Lemma A.2.1 (Some properties of third-order tensors). If $T \in \mathbb{R}^{d\times d\times d}$ is a super-symmetric tensor, i.e., $T_{ijk}$ is equivalent for any permutation of the indices, then the operator norm, defined as $\|T\|_{op} := \sup_{\|a\|=1}|T(a,a,a)|$, satisfies the following properties.

Property 1. ‖T‖op = sup‖a‖=‖b‖=‖c‖=1 |T (a, b, c)|

Property 2. ‖T‖op ≤ ‖T(1)‖ ≤√K‖T‖op

Property 3. If T is a rank-one tensor, then ‖T(1)‖ = ‖T‖op

Property 4. For any matrix W ∈ Rd×d′ , ‖T (W,W,W )‖op ≤ ‖T‖op‖W‖3


Lemma A.2.2 (Approximation error for the second moment). Let xi, yii∈[n] be

generated from the mixed linear regression model (3.2). Define M2 :=∑

k=[K] 2pkw∗k⊗

w∗k and M2 :=

1n

∑i∈[n] y

2i (xi⊗xi− I). Then with n ≥ c1

1pminδ22

d log2(d), we have

w.p. 1− c2Kd−2,

‖M2 −M2‖ ≤ δ2∑k

pk‖w∗k‖2 (A.57)

where c1, c2 are universal constants.

And for any fixed orthogonal matrix Y ∈ Rd×K , with the same condition,

we have

‖Y T (M2 −M2)Y ‖ ≤ δ2∑k

pk‖w∗k‖2 (A.58)

Lemma A.2.3 (Subspace Estimation). Let M2,M3 be

M2 =∑k=[K]

2pkw∗k ⊗w∗

k, and M3 =∑k=[K]

6pkw∗k ⊗w∗

k ⊗w∗k, (A.59)

and M2 be an estimate of M2. Assume ‖M2 −M2‖ ≤ δσK(M2) and δ ≤ 16. Let Y

be the returned matrix of the power method after O(log(1/δ)) steps. Define R2 =

Y TM2Y and R3 = M3(Y, Y, Y ). Then ‖R2‖ ≤ ‖M2‖ and ‖R3‖op ≤ ‖M3‖op. We

also have

‖Y Y Tw∗k −w∗

k‖ ≤ 3δ‖w∗k‖,∀k (A.60)

and

σK(R2) ≥3

4σK(M2)


Lemma A.2.4 (Approximation error for the third moment). Let xi, yii∈[n] be

drawn from the mixed linear regression model (3.2). Let Y ∈ Rd×K be any fixed

orthogonal matrix that satisfies, ‖Y Y Tw∗k − w∗

k‖ ≤ 12‖w∗

k‖,∀k, and ri = Y Txi,

for all i ∈ [n]. Let

R3 =1

n

∑i∈[n]

y3i (ri⊗ri⊗ri−∑j∈[K]

ej⊗ri⊗ej−∑j∈[K]

ej⊗ej⊗ri−∑j∈[K]

ri⊗ej⊗ej)

and

R3 =∑k=[K]

6pk(YTw∗

k)⊗ (Y Tw∗k)⊗ (Y Tw∗

k)

Then if n ≥ c31

pminδ23K3 log4(d) and 3

√C5n

−1/2d−1 ≤ δ34

, we have w.p. 1−c4Kd−2

‖R3 −R3‖op ≤ δ3∑k∈[K]

pk‖w∗k‖3,

where c3 and c4 are universal constant.

Lemma A.2.5 (Robust Tensor Power Method. Similar to Lemma 4 in [26]). Let

R2 =∑K

k=1 pkuk ⊗ uk and R3 =∑K

k=1 pkuk ⊗ uk ⊗ uk, where uk ∈ RK can be

any fixed vector. Define σK := σK(R2). Assume the estimations of R2 and R3, R2

and R3 respectively, satisfy ‖R2 − R2‖op ≤ ε2 and ‖R3 − R3‖op ≤ ε3 with

ε2 ≤ σK/3, 8‖R3‖opσ−5/2K ε2 + 2

√2σ

−3/2K ε3 ≤ cT

1

K√pmax

, (A.61)

for some constant cT . Let the whitening matrix W = U2Λ−1/22 UT

2 , where R2 =

U2Λ2UT2 is the eigendecomposition of R2. Then w.p. 1−η, the eigenvalues akKk=1

and the eigenvectors vkKk=1 computed from the whitened tensor R3(W , W , W ) ∈

RK×K×K by using the robust tensor power method [4] will satisfy

‖(W T )†(akvk)− uk‖ ≤ κ2ε2 + κ3ε3


where κ2 = 3‖R2‖1/2σ−1K + 200‖R2‖1/2‖R3‖opσ−5/2

K , κ3 = 75‖R2‖1/2σ−3/2K and η

is related to the computational time by O(log(1/η)).

Remark: This lemma differs from Lemma 4 of [26] in the requirement on

ε2, ε3. Lemma 4 in [26] treats ε2, ε3 in the same order (that are bounded by the same

value), however, they should have different order because one is for second-order

moments and the other is for third-order moments.

A.2.2 Proof of Theorem 3.4.2

Proof Details. We state the proof outline here,

1. ‖M2 −M2‖ ≤ εM2 by Matrix Bernstein’s inequality.

2. ‖Y Y Tw∗k −w∗

k‖ ≤ εY ‖w∗k‖ for all k ∈ [K] by Davis-Kahan’s theorem [36].

3. ‖R2 −R2‖ ≤ ε2 by Matrix Bernstein’s inequality.

4. ‖R3 −R3‖op ≤ ε3 by Matrix Bernstein’s inequality after matricizing tensor.

5. Let uk = (W T )†(akvk). Then ‖uk − Y Tw∗k‖ ≤ εu by the robust tensor

power method.

6. Finally, ‖w(0)k −w∗

k‖ ≤ c6∆min by combining the results of Step 2 and Step

5.

The lemmata in Appendix A.2.1 provide the bound for the above steps: Lemma A.2.2

for Step 1, Lemma A.2.3 for Step 2 and Step 3, Lemma A.2.4 for Step 4, and

Lemma A.2.5 for Step 5. Now we show the details. Define

κ2 := 4‖M2‖1/2σ−1K (M2) + 412‖M2‖1/2‖M3‖opσ−5/2

K (M2)


and

κ3 := 116‖M2‖1/2σ−3/2K (M2).

By Lemma A.2.3, we have κ3 ≥ κ3 and κ2 ≥ κ2 for any orthogonal matrix Y .

‖w(0)k −w∗

k‖ξ1≤ ‖Y uk − Y Y Tw∗

k‖+ ‖Y Y Tw∗k −w∗

k‖ξ2≤ κ2‖R2 −R2‖+ κ3‖R3 −R3‖op +

2

3δM2‖w∗

k‖σ−1K (M2)

∑k

pk‖w∗k‖2

ξ3≤ κ2δ2

∑k

pk‖w∗k‖2 + κ3δ3

∑k

pk‖w∗k‖3 +

2

3δM2σ

−1K (M2)(max

k‖w∗

k‖)∑k

pk‖w∗k‖2

(A.62)

where ξ1 is due to triangle inequality, ξ2 is due to Lemma A.2.5, Lemma A.2.3 and

Lemma A.2.2, and ξ3 is due to Lemma A.2.2 and Lemma A.2.4. Therefore, we can

set

δ2 ≤c6∆min

3κ2

∑k pk‖w∗

k‖2,

δ3 ≤c6∆min

3κ3

∑k∈[K] pk‖w∗

k‖3

and

δM2 ≤c6∆min

2σ−1K (M2)(maxk ‖w∗

k‖)∑

k pk‖w∗k‖2

,

such that ‖w(0)k −w∗

k‖ ≤ c6∆min. Note that Lemma A.2.5 also requires Eq. (A.61),

which can be satisfied if

‖R2 −R2‖ ≤ minσK(M2)

4,

cTσK(M2)5/2

34‖M3‖opK√pmax

and

‖R3 −R3‖op ≤cTσK(M2)

3/2

6K√pmax

.


Therefore, we require

δ2 ≤ δ∗2 :=1∑

k pk‖w∗k‖2

minσK(M2)

4,

cTσK(M2)5/2

34‖M3‖opK√pmax

,c6∆min

3κ2

δ3 ≤ δ∗3 :=1∑

k∈[K] pk‖w∗k‖3

minc6∆min

3κ3

,cTσK(M2)

3/2

6K√pmax

δM2 ≤ δ∗M2:=

c6∆min

2σ−1K (M2)(maxk ‖w∗

k‖)∑

k pk‖w∗k‖2

,

Now we analyze the sample complexity. δ∗M2, δ∗2, δ

∗3 correspond to the sam-

ple sets, ΩM2 , Ω2 and Ω3 respectively. By Lemma A.2.2, Lemma A.2.4, we require

|ΩM2| ≥ cM2

1

pminδ∗2M2

d log2(d)

|Ω2| ≥ c21

pminδ∗22d log2(d)

|Ω3| ≥ c31

pminδ∗23K3 log11/2(d),

and 3√C5n

−1/2d−1 ≤ δ34

. For the probability, we can set η = d−2 in Lemma A.2.5

by sacrificing a little more computational time, which is on the order of O(log(d)).

Therefore, the final probability is at least 1−O(Kd−2).

A.2.3 Proof of Theorem 3.5.1

According to Theorem 3.3.2, after $T_0 = O(\log d)$ iterations, we arrive at the local convexity region in Corollary 3.3.1. Then we just need one more set of samples, but still need $O(\log(1/\epsilon))$ iterations to achieve $\epsilon$ precision. By Theorem 3.3.1, Corollary 3.3.1, Theorem 3.3.2 and Theorem 3.4.2, we can partition the dataset into $|\Omega^{(t)}| = O(d(K\log(d))^{2K+2})$ for all $t = 0, 1, 2, \cdots, T_0 + 1$ to satisfy their sample complexity requirements. This completes the proof.


A.2.4 Proofs of Some Lemmata

A.2.4.1 Proof of Lemma A.2.1

Proof. Property 1. See the proof in Lemma 21 of [67].

Property 2.

‖T(1)‖ = max‖a‖=1

‖T (a, I, I)‖F ≤ max‖a‖=1

√K‖T (a, I, I)‖ = max

‖a‖=‖b‖=1

√K|T (a, b, b)| = ‖T‖op.

Obviously, max‖a‖=1 ‖T (a, I, I)‖F ≥ ‖T‖op.

Property 3. Let T = v ⊗ v ⊗ v.

‖T(1)‖ = max‖a‖=1

‖T (a, I, I)‖F = max‖a‖=1

‖v‖2(vTa)2 = ‖v‖3 = max‖a‖=1

|(vTa)3| = ‖T‖op.

Property 4. There exists a u ∈ Rd′ with ‖u‖ = 1 such that

‖T (W,W,W )‖op = |T (Wu,Wu,Wu)| ≤ ‖T‖op‖Wu‖3 ≤ ‖T‖op‖W‖3

A.2.4.2 Proof of Lemma A.2.2

Proof. Define M(k)2 := 2w∗

kw∗Tk and M

(k)2 = 1

|Sk|∑

i∈Sky2i (xi ⊗ xi − I), where

Sk ⊂ [n] is the index set for samples from the k-th model. Since we assume |Sk| =

pkn, M2 =∑

k∈[K] pkM(k)2 . We first bound ‖M (k)

2 −M(k)2 ‖. By Lemma A.1.3 with

K = 1, A1 = w∗kw

∗Tk , then if |Sk| ≥ c1

1δ2d log2(d), we have w.p., 1− c2d

−2,

‖ 1

|Sk|∑i∈Sk

y2ixixTi − ‖w∗

k‖2I − 2w∗kw

∗k‖ ≤ δ‖w∗

k‖2.

By Lemma A.1.3 with K = 0, we have w.p. at least 1− d−2,

‖ 1

|Sk|∑i∈Sk

xixTi − I‖ ≤ δ


Then

‖ 1

|Sk|∑i∈Sk

(xTi w

∗k)

2 − ‖w∗k‖2‖ ≤ ‖

1

|Sk|∑i∈Sk

xixTi − I‖‖w∗

k‖2 ≤ δ‖w∗k‖2.

Thus

‖ 1

|Sk|∑i∈Sk

y2i (xixTi − I)− 2w∗

kw∗k‖ ≤ 2δ‖w∗

k‖2.

And w.p. 1−O(Kd−2),

‖M2 −M2‖ ≤ 2δ∑k

pk‖w∗k‖2.

A.2.4.3 Proof of Lemma A.2.3

Proof. ‖R2‖ ≤ ‖Y ‖2‖M2‖ = ‖M2‖. By Property 4 in Lemma A.2.1, ‖R3‖op ≤

‖Y ‖3‖M3‖op = ‖M3‖op. Let U be the top-K eigenvectors of M2. Then U =

span(w∗1,w

∗2, · · · ,w∗

K). Let Y ∈ Rd×K be the top-K eigenvectors of M2. By

Lemma 9 in [67] (Davis-Kahan’s theorem [36] can also prove it),

‖(I − Y Y T )UUT‖ ≤ 3

2δ.

According to Theorem 7.2 in [6], after t steps of the power method, we have

‖Y Y T − Y (t)Y (t)T‖ ≤ (σK+1(M2)

σK(M2))t‖Y Y T − Y (0)Y (0)T‖.

When δ ≤ 1/3, by Weyl’s inequality, we have σK+1(M2) ≤ 13σK(M2) and σK(M2) ≥

23σK(M2). Therefore, after t = log(2/(3δ)) steps of the power method, we have

‖Y Y T − Y (t)Y (t)T‖ ≤ 3

156

Page 173: Copyright by Kai Zhong 2018

Let Y = Y (t). We have

‖Y Y T − UUT‖ ≤ ‖Y Y T − Y Y T‖+ ‖UUT − Y Y T‖ ≤ 3δ

and

‖Y Y Tw∗k −w∗

k‖ ≤ ‖Y Y T − UUT‖‖w∗k‖ ≤ 3δ‖w∗

k‖

Now we consider σK(R2). The proof is similar to that for Property 3 in Lemma 9

in [67].

σK(R2) ≥ σK(M2)σ2K(Y

TU)

Note that ‖Y T⊥ U‖ = ‖Y Y T − UUT‖, where Y⊥ is the subspace orthogonal to Y .

For any normalized vector v,

‖Y TUv‖2 = ‖Uv‖2 − ‖Y T⊥ Uv‖2 ≥ 1− (3δ)2 ≥ 3

4

Therefore, we have σK(R2) ≥ 34σK(M2).

A.2.4.4 Proof of Lemma A.2.4

Proof. We prove it by matricizing the tensor. Define

Gi = y3i (ri⊗ ri⊗ ri−∑j∈[K]

ej ⊗ ri⊗ ej −∑j∈[K]

ej ⊗ ej ⊗ ri−∑j∈[K]

ri⊗ ej ⊗ ej).

Like in Lemma A.2.2, we first bound ‖R(k)3 −R

(k)3 ‖op, where R(k)

3 = 1|Sk|∑

i∈SkGi,

and R(k)3 = 6(Y Tw∗

k)⊗ (Y Tw∗k)⊗ (Y Tw∗

k).

‖R(k)3 ‖op = 6‖Y Tw∗

k‖3.

By Lemma A.2.3, 12‖w∗

k‖ ≤ ‖Y Tw∗k‖ ≤ 3

2‖w∗

k‖. Thus

3

4‖w∗

k‖3 ≤ ‖R(k)3 ‖op ≤

81

4‖w∗

k‖3. (A.63)

157

Page 174: Copyright by Kai Zhong 2018

Then

‖Gi‖op ≤ 4|xTi w

∗k|3‖ri‖3, (A.64)

By Corollary 2.4.5, we have w.p., 1 − n−1d−2, ‖ri‖2 ≤ 4K log n. Thus, w.p.

1− 4n−1d−2,

‖Gi‖op ≤ 4× 123/2‖w∗k‖3 log3(n)(4K)3/2

Define m := c6‖w∗k‖3K3/2 log3(n) for constant c6 = 4× (48)3/2, and the event

Ei := ‖Gi‖op ≤ m

Then PJEciK ≤ 4n−1d−2. Define a new tensor Bi = 1(Ei)Gi, its expectation

B = EJBiK (the expectation is over all samples from the k-th components) and

its empirical average B = 1|Sk|∑

i∈[Sk]Bi.

Step 1. So we have Bi = Gi for all i ∈ Sk w.p. 1− 4d−2, i.e.,

R(k)3 = B (A.65)

Step 2. We bound ‖B −R(k)3 ‖op

‖B −R(k)3 ‖op = ‖EJ1(Ec

i)GiK‖op

= max‖a‖=1

|EJ1(Eci)Gi(a,a,a)K|

≤ EJ1(Eci)K

1/2 max‖a‖=1

|EqGi(a,a,a)

2y|1/2

≤ 2n−1/2d−1 max‖a‖=1

|Eq(y3i ((r

Ti a)

3 − 3rTi a))

2y|1/2

≤ 2n−1/2d−1 max‖a‖=1

|Eq(w∗T

k xi)6((xT

i Y a)6 + 9(xTi Y a)2)

y|1/2

≤ 2n−1/2d−1√2C5‖w∗

k‖3

ξ

≤3√

C5n−1/2d−1‖R(k)

3 ‖op,(A.66)

158

Page 175: Copyright by Kai Zhong 2018

where ξ is due to Eq. (A.63). Therefore, if 3√C5n

−1/2d−1 ≤ δ34

, we have

‖B −R(k)3 ‖op ≤

3δ38‖w∗

k‖3 ≤δ32‖R(k)

3 ‖op (A.67)

And further if δ3 ≤ 1, combining Eq. (A.63),

3

8‖w∗

k‖3 ≤1

2‖R(k)

3 ‖op ≤ ‖B‖op ≤3

2‖R(k)

3 ‖op ≤ 32‖w∗k‖3

Step 3. We bound ‖B −B‖op. Let Zi = (Bi −B)(1).

‖B(1)‖ ≤ max‖a‖=1

‖B(1)a‖

= max‖a‖=1

‖B(a, I, I)‖F

≤ max‖a‖=1

K‖B(a, I, I)‖

≤ max‖a‖=1

max‖b‖=1

√K|B(a, b, b)|

ξ=√K‖B‖op

≤ 32√K‖w∗

k‖3

(A.68)

where ξ is due to Lemma A.2.1.

‖Zi‖ ≤ ‖Bi(1)‖+ ‖B(1)‖ ≤√K(‖Bi‖op + ‖B‖op) ≤ 2

√Km

Now consider ‖EqZiZ

Ti

y‖ and ‖E

qZT

i Zi

y‖.

EqZiZ

Ti

y= E

q(Bi(1) −B(1))(Bi(1) −B(1))

Ty= E

qBi(1)B

Ti(1)

y−B(1)B

T(1)

159

Page 176: Copyright by Kai Zhong 2018

‖EqBi(1)B

Ti(1)

y‖ ≤ ‖E

qGi(1)G

Ti(1)

y‖

≤ ‖Eq(w∗T

k x)6(‖r‖4rrT + 2‖r‖2I + (K + 6)rrT − 6‖r‖2rrT )y‖

≤ ‖Eq(w∗T

k x)6(‖Y Tx‖4Y TxxTY + 2‖Y Tx‖2I + (K + 6)Y TxxTY )y‖

≤ 2C5K2‖w∗

k‖6,(A.69)

where the last inequality is due to Lemma A.1.2. Thus

‖EqZiZ

Ti

y‖ ≤ 3C5K

2‖w∗k‖6

SimilarlyEqZT

i Zi

y= E

rBT

i(1)Bi(1)

z−BT

(1)B(1) and ‖BT(1)B(1)‖ ≤ ‖B(1)‖2.

‖EqBT

i(1)Bi(1)

y‖

≤ ‖EqGT

i(1)Gi(1)

y‖

≤ max‖A‖F=1,A sym.

Eqy6i ‖rTArr − (2Ar + tr(A)r)‖2

y

≤ max‖A‖F=1,A sym.

Eq(w∗T

k x)6((rTAr)2‖r‖2 + 4rTA2r + tr2(A)‖r‖2 + | tr(A)rTAr|(4 + 2‖r‖2))y

≤ max‖A‖F=1,A sym.

Eq(w∗T

k x)6(‖r‖6‖A‖2F + 4‖r‖2 tr(A2) + tr2(A)‖r‖2 + (4 + 2‖r‖2)‖r‖2| tr(A)|‖A‖F )y

≤ Er(w∗T

k x)6(‖r‖6 + 4‖r‖2 +K‖r‖2 +√K(4 + 2‖r‖2)‖r‖2)

z

= Er(w∗T

k x)6(‖Y Tx‖6 + 2√K‖Y Tx‖4 + 4(

√K + 1)‖Y Tx‖2 +K‖Y Tx‖2)

z

≤ 2C5K3‖w∗

k‖6(A.70)

Therefore,

‖EqZT

i Zi

y‖ ≤ 3C5K

3‖w∗k‖6,

160

Page 177: Copyright by Kai Zhong 2018

and

max‖EqZT

i Zi

y‖, ‖E

qZiZ

Ti

y‖ ≤ 3C5K

3‖w∗k‖6 ≤ cm2K

3/2m‖w∗k‖3

Now we are ready to apply matrix Bernstein’s inequality.

P

t1

|Sk|‖∑i∈Sk

Zi‖ ≥ t

|

≤ 2K2 exp(− −|Sk|t2/2cm2K3/2m‖w∗

k‖3 + 2√Kmt/3

) (A.71)

Setting t = δ3‖w∗k‖3, we have when

|Sk| ≥ c31

δ23K3 log3(n) log(d) (A.72)

w.p. 1− d−2,

‖B −B‖op ≤ ‖1

|Sk|∑i∈Sk

Zi‖ ≤ δ3‖w∗k‖3, (A.73)

for some universal constant c3. And there exists some constant c3, such that |Sk| ≥

c31δ23K3 log4(d) will imply (A.72). Step 4. Combing all the K components. With

above three steps for k-th component, i.e., Eq. (A.65), Eq. (A.67) and Eq. (A.73),

w.h.p., we have

‖R(k)3 −R

(k)3 ‖op ≤ δ3‖w∗

k‖3

Now we can complete the proof by combing all the K components, w.p. 1 −

O(Kd−2)

‖R3 −R3‖op ≤∑k∈[K]

pk‖R(k)3 −R

(k)3 ‖op ≤ δ3

∑k∈[K]

pk‖w∗k‖3 (A.74)

161

Page 178: Copyright by Kai Zhong 2018

A.2.4.5 Proof of Lemma A.2.5

Proof. Most part of the proof follows the proof of Lemma 4 in [26]. Let W TR2W =

UΛUT . Define W := WUΛ−1/2UT , then W is the whitening matrix of R2, i.e.,

W TR2W = I . Define the whitened tensor T = R3(W,W,W ), i.e.,

T :=K∑k=1

pkWTuk ⊗W Tuk ⊗W Tuk

=K∑k=1

p−1/2k (p

1/2k W Tuk)⊗ (p

1/2k W Tuk)⊗ (p

1/2k W Tuk)

=K∑k=1

p−1/2k vk ⊗ vk ⊗ vk,

(A.75)

where vk := p1/2k W TukKk=1 are orthogonal basis because

∑Kk=1 vkv

Tk = W TR2W =

IK . In practice, we have T := M3(W , W , W ), an estimation of T . Define εT :=

‖T − T‖op. Similar to the proof of Lemma 4 in [26], we have

εT =‖R3(W,W,W )− R3(W , W , W )‖op

≤‖R3(W,W,W )−R3(W,W, W )‖op + ‖R3(W,W, W )−R3(W, W , W )‖op

+ ‖R3(W, W , W )−R3(W , W , W )‖op + ‖R3(W , W , W )− R3(W , W , W )‖op

=‖R3(W,W,W − W )‖op + ‖R3(W,W − W , W )‖op

+ ‖R3(W − W , W , W )‖op + ‖R3(W , W , W )− R3(W , W , W )‖op

≤‖R3‖op(‖W‖2 + ‖W‖‖W‖+ ‖W‖2)εW + ‖W‖3ε3(A.76)

where εW = ‖W −W‖.

If ε2 ≤ σK/3, we have |σK(R2) − σK | ≤ ε2 ≤ σK/3. Then 23σK ≤

162

Page 179: Copyright by Kai Zhong 2018

σK(R2) ≤ 43σK and ‖W‖ ≤

√2σ

−1/2K .

εW = ‖W −W‖ = ‖W (I − UΛ−1/2UT )‖ ≤ ‖W‖‖I − Λ−1/2‖ (A.77)

Since we have ‖I − Λ‖ = ‖W TR2W − W T R2W‖ ≤ ‖W‖2ε2 = 2σ−1K ε2. Thus

‖I − Λ−1/2‖ ≤ max|1− (1 + 2ε2/σK)−1/2|, |1− (1− 2ε2/σK)

−1/2| ≤ ε2/σK

Therefore,

εW ≤√2ε2σ

−3/2K

(A.78)

Now we have

εT ≤8‖R3‖opσ−5/2K ε2 + 2

√2σ

−3/2K ε3 (A.79)

Thus we can apply Theorem 5.1 [4] to show the guarantees of the robust tensor

power method to recover vkKk=1 and pkKk=1. It can be stated as below, for some

universal constant cT and a small value η (the computational complexity is related

to η by O(log(1/η))), if εT ≤ cT1

K√pmax

, w.p. 1 − η the returned eigenvectors

vkKk=1 and eigenvalues akKk=1 satisfy

‖vk − vk‖ ≤ 8εT√pk ≤ 8εT

√pmax, |ak −

1√pk| ≤ 5εT (A.80)

Let ak = 1√pk

. Now we show

‖(W T )†(akvk)− uk‖ = ‖(W T )†(akvk)−W †akvk‖

≤ ‖(W T )†(akvk)− (W T )†(akvk)‖+ ‖(W T )†(akvk)− (W T )†akvk‖

≤ ‖(W T )†‖(‖akvk − akvk‖+ ‖akvk − akvk‖) + ‖(W T )† − (W T )†‖‖akvk‖

≤ ‖(W T )†‖(ak8εT/ak + 5εT ) + ‖(W T )† − (W T )†‖ak(A.81)

163

Page 180: Copyright by Kai Zhong 2018

If εT ≤ 110

√pmax

, we have ak/ak ≤ 3/2. If ε2 ≤ σK/3,

‖(W T )†‖ = ‖Λ2‖1/2 ≤√2‖R2‖1/2 (A.82)

and‖(W T )† − (W T )†‖ = ‖(W T )†(I − UΛ1/2UT )‖

= ‖(W T )†‖‖I − Λ1/2‖

≤ 2√2‖R2‖1/2ε2/σK

(A.83)

‖(W T )†(akvk)− uk‖ ≤ ‖R2‖1/2(25εT + 3ε2/σK)

≤ (3‖R2‖1/2σ−1K + 200‖R2‖1/2‖R3‖opσ−5/2

K )ε2 + (75‖R2‖1/2σ−3/2K )ε3

(A.84)

164

Page 181: Copyright by Kai Zhong 2018

Appendix B

One-hidden-layer Fully-connected Neural Networks

B.1 Matrix Bernstein Inequality for Unbounded Case

In many proofs we need to bound the difference between some population

matrices/tensors and their empirical versions. Typically, the classic matrix Bern-

stein inequality requires the norm of the random matrix be bounded almost surely

(e.g., Theorem 6.1 in [127]) or the random matrix satisfies subexponential prop-

erty (Theorem 6.2 in [127]) . However, in our cases, most of the random matrices

don’t satisfy these conditions. So we derive the following lemmata that can deal

with random matrices that are not bounded almost surely or follow subexponential

distribution, but are bounded with high probability.

Lemma B.1.1 (Matrix Bernstein for unbounded case (A modified version of bounded

case, Theorem 6.1 in [127])). Let B denote a distribution over Rd1×d2 . Let d = d1+

d2. Let B1, B2, · · ·Bn be i.i.d. random matrices sampled from B. Let B = EB∼B[B]

and B = 1n

∑ni=1 Bi. For parameters m ≥ 0, γ ∈ (0, 1), ν > 0, L > 0, if the distri-

165

Page 182: Copyright by Kai Zhong 2018

bution B satisfies the following four properties,

(I) PrB∼B

[‖B‖ ≤ m] ≥ 1− γ;

(II)∥∥∥ EB∼B

[B]∥∥∥ > 0;

(III) max(∥∥∥ E

B∼B[BB>]

∥∥∥,∥∥∥ EB∼B

[B>B]∥∥∥) ≤ ν;

(IV) max‖a‖=‖b‖=1

(E

B∼B

[(a>Bb

)2])1/2 ≤ L.

Then we have for any 0 < ε < 1 and t ≥ 1, if

n ≥ (18t log d) · (ν + ‖B‖2 +m‖B‖ε)/(ε2‖B‖2) and γ ≤ (ε‖B‖/(2L))2

with probability at least 1− 1/d2t − nγ,

‖B −B‖ ≤ ε‖B‖.

Proof. Define the event

ξi = ‖Bi‖ ≤ m,∀i ∈ [n].

Define Mi = 1‖Bi‖≤mBi. Let M = EB∼B[1‖B‖≤mB] and M = 1n

∑ni=1Mi. By

triangle inequality, we have

‖B −B‖ ≤ ‖B − M‖+ ‖M −M‖+ ‖M −B‖. (B.1)

In the next a few paragraphs, we will upper bound the above three terms.

The first term in Eq. (D.19). Denote ξc as the complementary set of ξ, thus

Pr[ξci ] ≤ γ. By a union bound over i ∈ [n], with probability 1− nγ, ‖Bi‖ ≤ m for

all i ∈ [n]. Thus M = B.

166

Page 183: Copyright by Kai Zhong 2018

The second term in Eq. (D.19). For a matrix B sampled from B, we use ξ

to denote the event that ξ = ‖B‖ ≤ m. Then, we can upper bound ‖M − B‖ in

the following way,

‖M −B‖

=∥∥∥ EB∼B

[1‖B‖≤m ·B]− EB∼B

[B]∥∥∥

=∥∥∥ EB∼B

[B · 1ξc ]∥∥∥

= max‖a‖=‖b‖=1

EB∼B

[a>Bb1ξc ]

≤ max‖a‖=‖b‖=1

EB∼B

[(a>Bb)2]1/2 · EB∼B

[1ξc ]1/2 by Holder’s inequality

≤ L EB∼B

[1ξc ]1/2 by Property (IV)

≤ Lγ1/2, by Pr[ξc] ≤ γ

≤ 1

2ε‖B‖, by γ ≤ (ε‖B‖/(2L))2

which implies

‖M −B‖ ≤ ε

2‖B‖.

Since ε < 1, we also have ‖M −B‖ ≤ 12‖B‖ and 3

2‖B‖ ≥ ‖M‖ ≥ 1

2‖B‖.

The third term in Eq. (D.19). We can bound ‖M −M‖ by Matrix Bern-

stein’s inequality [127].

We define Zi = Mi −M . Thus we have EBi∼B

[Zi] = 0, ‖Zi‖ ≤ 2m, and

∥∥∥∥ EBi∼B

[ZiZ>i ]

∥∥∥∥ =

∥∥∥∥ EBi∼B

[MiM>i ]−M ·M>

∥∥∥∥ ≤ ν + ‖M‖2 ≤ ν + 3‖B‖2.

167

Page 184: Copyright by Kai Zhong 2018

Similarly, we have∥∥∥∥ EBi∼B

[Z>i Zi]

∥∥∥∥ ≤ ν +3‖B‖2. Using matrix Bernstein’s inequal-

ity, for any ε > 0,

PrB1,··· ,Bn∼B

[1

n

∥∥∥∥∥n∑

i=1

Zi

∥∥∥∥∥ ≥ ε‖B‖

]≤ d exp

(− ε2‖B‖2n/2ν + 3‖B‖2 + 2m‖B‖ε/3

).

By choosing

n ≥ (3t log d) · ν + 3‖B‖2 + 2m‖B‖ε/3ε2‖B‖2/2

,

for t ≥ 1, we have with probability at least 1− 1/d2t,∥∥∥∥∥ 1nn∑

i=1

Mi −M

∥∥∥∥∥ ≤ ε

2‖B‖

Putting it all together, we have for 0 < ε < 1, if

n ≥ (18t log d) · (ν + ‖B‖2 +m‖B‖ε)/(ε2‖B‖2) and γ ≤ (ε‖B‖/(2L))2

with probability at least 1− 1/d2t − nγ,∥∥∥∥∥ 1nn∑

i=1

Bi − EB∼B

[B]

∥∥∥∥∥ ≤ ε∥∥∥ EB∼B

[B]∥∥∥.

Corollary B.1.1 (Error bound for symmetric rank-one random matrices). Let x1, x2, · · ·xn

denote n i.i.d. samples drawn from Gaussian distribution N(0, Id). Let h(x) :

Rd → R be a function satisfying the following properties (I), (II) and (III).

(I) Prx∼N(0,Id)

[|h(x)| ≤ m] ≥ 1− γ

(II)

∥∥∥∥ Ex∼N(0,Id)

[h(x)xx>]

∥∥∥∥ > 0;

(III)

(E

x∼N(0,Id)[h4(x)]

)1/4

≤ L.

168

Page 185: Copyright by Kai Zhong 2018

Define function B(x) = h(x)xx> ∈ Rd×d, ∀i ∈ [n]. Let B = Ex∼N(0,Id)

[h(x)xx>].

For any 0 < ε < 1 and t ≥ 1, if

n & (t log d) · (L2d+ ‖B‖2 + (mtd log n)‖B‖ε)/(ε2‖B‖2), and γ + 1/(nd2t) . (ε‖B‖/L)2

then

Prx1,··· ,xn∼N(0,Id)

[∥∥∥∥∥B − 1

n

n∑i=1

B(xi)

∥∥∥∥∥ ≤ ε‖B‖

]≥ 1− 2/(d2t)− nγ.

Proof. We show that the four Properties in Lemma D.3.9 are satisfied. Define func-

tion B(x) = h(x)xx>.

(I) ‖B(x)‖ = ‖h(x)xx>‖ = |h(x)|‖x‖2.

By using Fact 2.4.3, we have

Prx∼N(0,Id)

[‖x‖2 ≤ 10td log n] ≥ 1− 1/(nd2t)

Therefore,

Prx∼N(0,Id)

[‖B(x)‖ ≤ m · 10td log(n)] ≥ 1− γ − 1/(nd2t).

(II)∥∥∥ EB∼B

[B]∥∥∥ =

∥∥∥∥ Ex∼N(0,Id)

[h(x)xx>]

∥∥∥∥ > 0.

(III)

max(∥∥∥ E

B∼B[BB>]

∥∥∥, ∥∥∥ EB∼B

[B>B]∥∥∥)

= max‖a‖=1

Ex∼N(0,Id)

[(h(x))2‖x‖2(a>x)2]

≤(

Ex∼N(0,Id)

[(h(x))4]

)1/2

·(

Ex∼N(0,Id)

[‖x‖8])1/4

· max‖a‖=1

(E

x∼N(0,Id)[(a>x)8]

)1/4

. L2d.

169

Page 186: Copyright by Kai Zhong 2018

(IV)

max‖a‖=‖b‖=1

(E

B∼B[(a>Bb)2]

)1/2= max

‖a‖=1

(E

x∼N(0,Id)[h2(x)(a>x)4]

)1/2

≤(

Ex∼N(0,Id)

[h4(x)]

)1/4

· max‖a‖=1

(E

x∼N(0,Id)[(a>x)8]

)1/4

. L.

Applying Lemma D.3.9, we obtain, for any 0 < ε < 1 and t ≥ 1, if

n & (t log d) · (L2d+ ‖B‖2 + (mtd log n)‖B‖ε)/(ε2‖B‖2), and γ + 1/(nd2t) . (ε‖B‖/L)2

then

Prx1,··· ,xn∼N(0,Id)

[∥∥∥∥∥B − 1

n

n∑i=1

B(xi)

∥∥∥∥∥ ≤ ε‖B‖

]≥ 1− 2/(d2t)− nγ.

B.2 Properties of Activation Functions

Theorem 4.2.1. ReLU φ(z) = maxz, 0, leaky ReLU φ(z) = maxz, 0.01z,

squared ReLU φ(z) = maxz, 02 and any non-linear non-decreasing smooth func-

tions with bounded symmetric φ′(z), like the sigmoid function φ(z) = 1/(1 + e−z),

the tanh function and the erf function φ(z) =∫ z

0e−t2dt, satisfy Property 1,2,3. The

linear function, φ(z) = z, doesn’t satisfy Property 2 and the quadratic function,

φ(z) = z2, doesn’t satisfy Property 1 and 2.

170

Page 187: Copyright by Kai Zhong 2018

Proof. We can easily verify that ReLU , leaky ReLU and squared ReLU satisfy Prop-erty 2 by calculating ρ(σ) in Property 2, which is shown in Table B.1. Property 1 forReLU , leaky ReLU and squared ReLU can be verified since they are non-decreasingwith bounded first derivative. ReLU and leaky ReLU are piece-wise linear, so theysatisfy Property 3(b). Squared ReLU is smooth so it satisfies Property 3(a).

Table B.1: ρ(σ) values for different activation functions. Note that we can calculatethe exact values for ReLU, Leaky ReLU, squared ReLU and erf. We can’t find aclosed-form value for sigmoid or tanh, but we calculate the numerical values ofρ(σ) for σ = 0.1, 1, 10. 1 ρerf(σ) = min(4σ2 + 1)−1/2 − (2σ2 + 1)−1, (4σ2 +1)−3/2 − (2σ2 + 1)−3, (2σ2 + 1)−2

Activations ReLU Leaky ReLU squared ReLU erfsigmoid

(σ = 0.1)sigmoid(σ = 1)

sigmoid(σ = 10)

α0(σ)12

1.012

σ√

1(2σ2+1)1/2

0.99 0.606 0.079α1(σ)

1√2π

0.99√2π

σ 0 0 0 0

α2(σ)12

1.012

2σ√

1(2σ2+1)3/2

0.97 0.24 0.00065β0(σ)

12

1.00012

2σ 1(4σ2+1)1/2

0.98 0.46 0.053β2(σ)

12

1.00012

6σ 1(4σ2+1)3/2

0.94 0.11 0.00017ρ(σ) 0.091 0.089 0.27σ ρerf(σ)

1 1.8E-4 4.9E-2 5.1E-5

Smooth non-decreasing activations with bounded first derivatives automat-

ically satisfy Property 1 and 3. For Property 2, since their first derivatives are

symmetric, we have E[φ′(σ · z)z] = 0. Then by Holder’s inequality and φ′(z) ≥ 0,

we have

Ez∼D1 [φ′2(σ · z)] ≥ (Ez∼D1 [φ

′(σ · z)])2,

Ez∼D1 [φ′2(σ · z)z2] · Ez∼D1 [z

2] ≥(Ez∼D1 [φ

′(σ · z)z2])2,

Ez∼D1 [φ′(σ · z)z2] · Ez∼D1 [φ

′(σ · z)] = Ez∼D1 [(√φ′(σ · z)z)2] · Ez∼D1 [(

√φ′(σ · z))2] ≥ (Ez∼D1 [φ

′(σ · z)z])2.

The equality in the first inequality happens when φ′(σ · z) is a constant a.e.. The

equality in the second inequality happens when |φ′(σ ·z)| is a constant a.e., which is

171

Page 188: Copyright by Kai Zhong 2018

invalidated by the non-linearity and smoothness condition. The equality in the third

inequality holds only when φ′(z) = 0 a.e., which leads to a constant function under

non-decreasing condition. Therefore, ρ(σ) > 0 for any smooth non-decreasing

non-linear activations with bounded symmetric first derivatives. The statements

about linear activations and quadratic activation follow direct calculations.

B.3 Local Positive Definiteness of HessianB.3.1 Main Results for Positive Definiteness of Hessian

Bounding the Spectrum of the Hessian near the Ground Truth

Theorem B.3.1 (Bounding the spectrum of the Hessian near the ground truth).

For any W ∈ Rd×k with ‖W − W ∗‖ . v4minρ2(σk)/(k

2κ5λ2v4maxσ4p1 ) · ‖W ∗‖,

let S denote a set of i.i.d. samples from distribution D (defined in (4.1)) and

let the activation function satisfy Property 1,2,3. Then for any t ≥ 1, if |S| ≥

d · poly(log d, t) · k2v4maxτκ8λ2σ4p

1 /(v4minρ2(σk)), we have with probability at least

1− d−Ω(t),

Ω(v2minρ(σk)/(κ2λ))I ∇2fS(W ) O(kv2maxσ

2p1 )I.

Proof. The main idea of the proof follows the following inequalities,

∇2fD(W∗)− ‖∇2fS(W )−∇2fD(W

∗)‖I ∇2fS(W ) ∇2fD(W∗) + ‖∇2fS(W )−∇2fD(W

∗)‖I

The proof sketch is first to bound the range of the eigenvalues of∇2fD(W∗) (Lemma B.3.1)

and then bound the spectral norm of the remaining error, ‖∇2fS(W )−∇2fD(W∗)‖.

‖∇2fS(W )−∇2fD(W∗)‖ can be further decomposed into two parts, ‖∇2fS(W )−

172

Page 189: Copyright by Kai Zhong 2018

H‖ and ‖H − ∇2fD(W∗)‖, where H is ∇2fD(W ) if φ is smooth, otherwise H is

a specially designed matrix . We can upper bound them when W is close enough

to W ∗ and there are enough samples. In particular, if the activation satisfies Prop-

erty 3(a), see Lemma B.3.6 for bounding ‖H −∇2fD(W∗)‖ and Lemma B.3.7 for

bounding ‖H−∇2fS(W )‖. If the activation satisfies Property 3(b), see Lemma B.3.8.

Finally we can complete the proof by setting δ = O(v2minρ(σ1)/(kv2maxκ

2λσ2p1 ))

in Lemma B.3.7 and Lemma B.3.8, setting ‖W −W ∗‖ . v2minρ(σk)/(kκ2λv2maxσ

p1)

in Lemma B.3.6 and setting ‖W − W ∗‖ ≤ v4minρ2(σk)σk/(k

2κ4λ2v4maxσ4p1 ) in

Lemma B.3.8.

Local Linear Convergence of Gradient Descent

Although Theorem B.3.1 gives upper and lower bounds for the spectrum of

the Hessian w.h.p., it only holds when the current set of parameters W are indepen-

dent of samples. When we use iterative methods, like gradient descent, to optimize

this objective, the next iterate calculated from the current set of samples will depend

on this set of samples. Therefore, we need to do resampling at each iteration. Here

we show that for activations that satisfies Properties 1, 2 and 3(a), linear conver-

gence of gradient descent is guaranteed. To the best of our knowledge, there is no

linear convergence guarantees for general non-smooth objective. So the following

proposition also applies to smooth objectives only, which excludes ReLU.

Theorem B.3.2 (Linear convergence of gradient descent, formal version of Theo-

rem 4.3.2). Let W c ∈ Rd×k be the current iterate satisfying

‖W c −W ∗‖ . v4minρ2(σk)/(k

2κ5λ2v4maxσ4p1 )‖W ∗‖.

173

Page 190: Copyright by Kai Zhong 2018

Let S denote a set of i.i.d. samples from distribution D (defined in (4.1)) Let the

activation function satisfy Property 1,2 and 3(a). Define

m0 = Θ(v2minρ(σk)/(κ2λ)) and M0 = Θ(kv2maxσ

2p1 ).

For any t ≥ 1, if we choose

|S| ≥ d · poly(log d, t) · k2v4maxτκ8λ2σ4p

1 /(v4minρ2(σk)) (B.2)

and perform gradient descent with step size 1/M0 on fS(Wc) and obtain the next

iterate,

W = W c − 1

M0

∇fS(W c),

then with probability at least 1− d−Ω(t),

‖W −W ∗‖2F ≤ (1− m0

M0

)‖W c −W ∗‖2F .

Proof. To prove Theorem B.3.2, we need to show the positive definite properties

on the entire line between the current iterate and the optimum by constructing a set

of anchor points, which are independent of the samples. Then we apply traditional

analysis for the linear convergence of gradient descent.

In particular, given a current iterate W c, we set d(p+1)/2 anchor points W aa=1,2,··· ,d(p+1)/2

equally along the line ξW ∗ + (1− ξ)W c for ξ ∈ [0, 1].

According to Theorem B.3.1, by setting t ← t + (p + 1)/2, we have with

probability at least 1− d−(t+(p+1)/2) for each anchor point W a,

m0I ∇2fS(Wa) M0I.

174

Page 191: Copyright by Kai Zhong 2018

Then given an anchor point W a, according to Lemma C.3.11, we have with

probability 1 − 2d−(t+(p+1)/2), for any points W between (W a−1 + W a)/2 and

(W a +W a+1)/2,

m0I ∇2fS(W ) M0I. (B.3)

Finally by applying union bound over these d(p+1)/2 small intervals, we have

with probability at least 1− d−t for any points W on the line between W c and W ∗,

m0I ∇2fS(W ) M0I.

Now we can apply traditional analysis for linear convergence of gradient

descent.

Let η denote the stepsize.

‖W −W ∗‖2F

= ‖W c − η∇fS(W c)−W ∗‖2F

= ‖W c −W ∗‖2F − 2η〈∇fS(W c), (W c −W ∗)〉+ η2‖∇fS(W c)‖2F

We can rewrite fS(Wc),

∇fS(W c) =

(∫ 1

0

∇2fS(W∗ + γ(W c −W ∗))dγ

)vec(W c −W ∗).

We define function HS : Rd×k → Rdk×dk such that

HS(Wc −W ∗) =

(∫ 1

0

∇2fS(W∗ + γ(W c −W ∗))dγ

).

According to Eq. (C.13),

m0I H M0I. (B.4)

175

Page 192: Copyright by Kai Zhong 2018

We can upper bound ‖∇fS(W c)‖2F ,

‖∇fS(W c)‖2F = 〈HS(Wc −W ∗), HS(W

c −W ∗)〉 ≤M0〈W c −W ∗, HS(Wc −W ∗)〉.

Therefore,

‖W −W ∗‖2F

≤ ‖W c −W ∗‖2F − (−η2M0 + 2η)〈W c −W ∗, H(W c −W ∗)〉

≤ ‖W c −W ∗‖2F − (−η2M0 + 2η)m0‖W c −W ∗‖2F

= ‖W c −W ∗‖2F −m0

M0

‖W c −W ∗‖2F

≤ (1− m0

M0

)‖W c −W ∗‖2F

where the third equality holds by setting η = 1M0

.

B.3.2 Positive Definiteness of Population Hessian at the Ground Truth

The goal of this Section is to prove Lemma B.3.1.

Lemma B.3.1 (Positive definiteness of population Hessian at the ground truth). If

φ(z) satisfies Property 1,2 and 3, we have the following property for the second

derivative of function fD(W ) at W ∗,

Ω(v2minρ(σk)/(κ2λ))I ∇2fD(W

∗) O(kv2maxσ2p1 )I.

Proof. The proof directly follows Lemma B.3.3 (Section B.3.2) and Lemma B.3.4(Sec-

tion B.3.2).

Lower Bound on the Eigenvalues of Hessian for the Orthogonal Case

176

Page 193: Copyright by Kai Zhong 2018

Lemma B.3.2. Let D1 denote Gaussian distribution N(0, 1). Let α0 = Ez∼D1 [φ′(z)],

α1 = Ez∼D1 [φ′(z)z], α2 = Ez∼D1 [φ

′(z)z2], β0 = Ez∼D1 [φ′2(z)] ,β2 = Ez∼D1 [φ

′2(z)z2].

Let ρ denote min(β0 − α20 − α2

1), (β2 − α21 − α2

2). Let P =[p1 p2 · · · pk

]∈

Rk×k. Then we have,

Eu∼Dk

( k∑i=1

p>i u · φ′(ui)

)2 ≥ ρ‖P‖2F (B.5)

Proof. The main idea is to explicitly calculate the LHS of Eq (C.4), then reformu-

late the equation and find a lower bound represented by α0, α1, α2, β0, β2.

Eu∼Dk

( k∑i=1

p>i u · φ′(ui)

)2

=k∑

i=1

k∑l=1

Eu∼Dk

[p>i (φ′(ul)φ

′(ui) · uu>)pl]

=k∑

i=1

Eu∼Dk

[p>i (φ′(ui)

2 · uu>)pi]︸ ︷︷ ︸A

+∑i 6=l

Eu∼Dk

[p>i (φ′(ul)φ

′(ui) · uu>)pl]︸ ︷︷ ︸B

177

Page 194: Copyright by Kai Zhong 2018

Further, we can rewrite the diagonal term in the following way,

A =k∑

i=1

Eu∼Dk

[p>i (φ′(ui)

2 · uu>)pi]

=k∑

i=1

Eu∼Dk

[p>i

(φ′(ui)

2 ·

(u2i eie

>i +

∑j 6=i

uiuj(eie>j + eje

>i ) +

∑j 6=i

∑l 6=i

ujuleje>l

))pi

]

=k∑

i=1

Eu∼Dk

[p>i

(φ′(ui)

2 ·

(u2i eie

>i +

∑j 6=i

u2jeje

>j

))pi

]

=k∑

i=1

[p>i

(E

u∼Dk

[φ′(ui)2u2

i ]eie>i +

∑j 6=i

Eu∼Dk

[φ′(ui)2u2

j ]eje>j

)pi

]

=k∑

i=1

[p>i

(β2eie

>i +

∑j 6=i

β0eje>j

)pi

]

=k∑

i=1

p>i ((β2 − β0)eie>i + β0Ik)pi

= (β2 − β0)k∑

i=1

p>i eie>i pi + β0

k∑i=1

p>i pi

= (β2 − β0)‖ diag(P )‖2 + β0‖P‖2F ,

where the second step follows by rewriting uu> =k∑

i=1

k∑j=1

uiujeie>j , the third step

follows by

Eu∼Dk

[φ′(ui)2uiuj] = 0, ∀j 6= i and E

u∼Dk

[φ′(ui)2ujul] = 0, ∀j 6= l, the fourth

step follows by pushing expectation, the fifth step follows by Eu∼Dk

[φ′(ui)2u2

i ] = β2

and Eu∼Dk

[φ′(ui)2u2

j ] = Eu∼Dk

[φ′(ui)2] = β0, and the last step follows by

k∑i=1

p2i,i =

‖ diag(P )‖2 andk∑

i=1

p>i pi =k∑

i=1

‖pi‖2 = ‖P‖2F .

178

Page 195: Copyright by Kai Zhong 2018

We can rewrite the off-diagonal term in the following way,

B =∑i 6=l

Eu∼Dk

[p>i (φ′(ul)φ

′(ui) · uu>)pl]

=∑i 6=l

Eu∼Dk

[p>i

(φ′(ul)φ

′(ui) ·

(u2i eie

>i + u2

l ele>l + uiul(eie

>l + ele

>i ) +

∑j 6=l

uiujeie>j

+∑j 6=i

ujuleje>l +

∑j 6=i,l

∑j′ 6=i,l

ujuj′eje>j′

))pl

]

=∑i 6=l

Eu∼Dk

[p>i

(φ′(ul)φ

′(ui) ·

(u2i eie

>i + u2

l ele>l + uiul(eie

>l + ele

>i ) +

∑j 6=i,l

u2jeje

>j

))pl

]

=∑i 6=l

[p>i

(E

u∼Dk

[φ′(ul)φ′(ui)u

2i ]eie

>i + E

u∼Dk

[φ′(ul)φ′(ui)u

2l ]ele

>l

+ Eu∼Dk

[φ′(ul)φ′(ui)uiul](eie

>l + ele

>i ) +

∑j 6=i,l

Eu∼Dk

[φ′(ul)φ′(ui)u

2j ]eje

>j

)pl

]

=∑i 6=l

[p>i

(α0α2(eie

>i + ele

>l ) + α2

1(eie>l + ele

>i ) +

∑j 6=i,l

α20eje

>j

)pl

]=∑i 6=l

[p>i((α0α2 − α2

0)(eie>i + ele

>l ) + α2

1(eie>l + ele

>i ) + α2

0Ik)pl]

= (α0α2 − α20)∑i 6=l

p>i (eie>i + ele

>l )pl︸ ︷︷ ︸

B1

+α21

∑i 6=l

p>i (eie>l + ele

>i )pl︸ ︷︷ ︸

B2

+α20

∑i 6=l

p>i pl︸ ︷︷ ︸B3

,

where the third step follows by

Eu∼Dk

[φ′(ul)φ′(ui)uiuj] = 0,

and

Eu∼Dk

[φ′(ul)φ′(ui)uj′uj] = 0, ∀j′ 6= j.

179

Page 196: Copyright by Kai Zhong 2018

For the term B1, we have

B1 = (α0α2 − α20)∑i 6=l

p>i (eie>i + ele

>l )pl

= 2(α0α2 − α20)∑i 6=l

p>i eie>i pl

= 2(α0α2 − α20)

k∑i=1

p>i eie>i

(k∑

l=1

pl − pi

)

= 2(α0α2 − α20)

(k∑

i=1

p>i eie>i

k∑l=1

pl −k∑

i=1

p>i eie>i pi

)= 2(α0α2 − α2

0)(diag(P )> · P · 1− ‖ diag(P )‖2)

For the term B2, we have

B2 = α21

∑i 6=l

p>i (eie>l + ele

>i )pl

= α21

(∑i 6=l

p>i eie>l pl +

∑i 6=l

p>i ele>i pl

)

= α21

(k∑

i=1

k∑l=1

p>i eie>l pl −

k∑j=1

p>j eje>j pj +

k∑i=1

k∑l=1

p>i ele>i pl −

k∑j=1

p>j eje>j pj

)= α2

1((diag(P )>1)2 − ‖ diag(P )‖2 + 〈P, P>〉 − ‖ diag(P )‖2)

180

Page 197: Copyright by Kai Zhong 2018

For the term B3, we have

B3 = α20

∑i 6=l

p>i pl

= α20

(k∑

i=1

p>i

k∑l=1

pl −k∑

i=1

p>i pi

)

= α20

∥∥∥∥∥k∑

i=1

pi

∥∥∥∥∥2

−k∑

i=1

‖pi‖2

= α20(‖P · 1‖2 − ‖P‖2F )

Let diag(P ) denote a length k column vector where the i-th entry is the

181

Page 198: Copyright by Kai Zhong 2018

(i, i)-th entry of P ∈ Rk×k. Furthermore, we can show A+B is,

A+B

= A+B1 +B2 +B3

= (β2 − β0)‖ diag(P )‖2 + β0‖P‖2F︸ ︷︷ ︸A

+2(α0α2 − α20)(diag(P )> · P · 1− ‖ diag(P )‖2)︸ ︷︷ ︸

B1

+ α21((diag(P )> · 1)2 − ‖ diag(P )‖2 + 〈P, P>〉 − ‖ diag(P )‖2)︸ ︷︷ ︸

B2

+α20(‖P · 1‖2 − ‖P‖2F )︸ ︷︷ ︸

B3

= ‖α0P · 1+ (α2 − α0) diag(P )‖2︸ ︷︷ ︸C1

+α21(diag(P )> · 1)2︸ ︷︷ ︸

C2

+α21

2‖P + P> − 2 diag(diag(P ))‖2F︸ ︷︷ ︸

C3

+ (β0 − α20 − α2

1)‖P − diag(diag(P ))‖2F︸ ︷︷ ︸C4

+ (β2 − α21 − α2

2)‖ diag(P )‖2︸ ︷︷ ︸C5

≥ (β0 − α20 − α2

1)‖P − diag(diag(P ))‖2F + (β2 − α21 − α2

2)‖ diag(P )‖2

≥ min(β0 − α20 − α2

1), (β2 − α21 − α2

2) · (‖P − diag(diag(P ))‖2F + ‖ diag(P )‖2)

= min(β0 − α20 − α2

1), (β2 − α21 − α2

2) · (‖P − diag(diag(P ))‖2F + ‖ diag(diag(P ))‖2)

≥ min(β0 − α20 − α2

1), (β2 − α21 − α2

2) · ‖P‖2F

= ρ‖P‖2F ,

where the first step follows by B = B1 + B2 + B3, and the second step follows by

the definition of A,B1, B2, B3 the third step follows by A+B1 +B2 +B3 = C1 +

C2+C3+C4+C5, the fourth step follows by C1, C2, C3 ≥ 0, the fifth step follows

a ≥ min(a, b), the sixth step follows by ‖ diag(P )‖2 = ‖ diag(diag(P ))‖2, the

seventh step follows by triangle inequality, and the last step follows the definition

of ρ.

Claim B.3.1. A+B1 +B2 +B3 = C1 + C2 + C3 + C4 + C5.

182

Page 199: Copyright by Kai Zhong 2018

Proof. The key properties we need are, for two vectors a, b, ‖a + b‖2 = ‖a‖2 +

2〈a, b〉 + ‖b‖2; for two matrices A,B, ‖A + B‖2F = ‖A‖2F + 2〈A,B〉 + ‖B‖2F .

183

Page 200: Copyright by Kai Zhong 2018

Then, we have

C1 + C2 + C3 + C4 + C5

= (‖α0P · 1‖)2 + 2(α0α2 − α20)〈P · 1, diag(P )〉+ (α2 − α0)

2‖ diag(P )‖2︸ ︷︷ ︸C1

+α21(diag(P )> · 1)2︸ ︷︷ ︸

C2

+α21

2(2‖P‖2F + 4‖ diag(diag(P ))‖2F + 2〈P, P>〉 − 4〈P, diag(diag(P ))〉 − 4〈P>, diag(diag(P ))〉)︸ ︷︷ ︸

C3

+ (β0 − α20 − α2

1)(‖P‖2F − 2〈P, diag(diag(P ))〉+ ‖ diag(diag(P ))‖2F )︸ ︷︷ ︸C4

+(β2 − α21 − α2

2)‖ diag(P )‖2︸ ︷︷ ︸C5

= α20‖P · 1‖2 + 2(α0α2 − α2

0)〈P · 1, diag(P )〉+ (α2 − α0)2‖ diag(P )‖2︸ ︷︷ ︸

C1

+α21(diag(P )> · 1)2︸ ︷︷ ︸

C2

+α21

2(2‖P‖2F + 4‖ diag(P )‖2 + 2〈P, P>〉 − 8‖ diag(P )‖2)︸ ︷︷ ︸

C3

+ (β0 − α20 − α2

1)(‖P‖2F − 2‖ diag(P )‖2 + ‖ diag(P )‖2)︸ ︷︷ ︸C4

+(β2 − α21 − α2

2)‖ diag(P )‖2︸ ︷︷ ︸C5

= α20‖P · 1‖2 + 2(α0α2 − α2

0) diag(P )> · P · 1+ α21(diag(P )> · 1)2 + α2

1〈P, P>〉

+ (β0 − α20)‖P‖2F + ((α2 − α0)

2 − 2α21 − β0 + α2

0 + α21 + β2 − α2

1 − α22)︸ ︷︷ ︸

β2−β0−2(α2α0−α20+α2

1)

‖ diag(P )‖2

= 0︸︷︷︸part of A

+2(α2α0 − α20) · diag(P )>P · 1︸ ︷︷ ︸part of B1

+α21 · ((diag(P )>1)2 + 〈P, P>〉)︸ ︷︷ ︸

part of B2

+α20 · ‖P · 1‖2︸ ︷︷ ︸

part of B3

+ (β0 − α20) · ‖P‖2F︸ ︷︷ ︸

proportional to ‖P‖2F

+(β2 − β0 − 2(α2α0 − α20 + α2

1)) · ‖ diag(P )‖2︸ ︷︷ ︸proportional to ‖ diag(P )‖2

= (β2 − β0)‖ diag(P )‖2 + β0‖P‖2F︸ ︷︷ ︸A

+2(α0α2 − α20)(diag(P )> · P · 1− ‖ diag(P )‖2)︸ ︷︷ ︸

B1

+ α21((diag(P )> · 1)2 − ‖ diag(P )‖2 + 〈P, P>〉 − ‖ diag(P )‖2)︸ ︷︷ ︸

B2

+α20(‖P · 1‖2 − ‖P‖2F )︸ ︷︷ ︸

B3

=A+B1 +B2 +B3

where the second step follows by 〈P, diag(diag(P ))〉 = ‖ diag(P )‖2 and ‖ diag(diag(P ))‖2F =

184

Page 201: Copyright by Kai Zhong 2018

‖ diag(P )‖2.

Lower Bound on the Eigenvalues of Hessian for Non-orthogonal Case

First we show the lower bound of the eigenvalues. The main idea is to

reduce the problem to a k-by-k problem and then lower bound the eigenvalues using

orthogonal weight matrices.

Lemma B.3.3 (Lower bound). If φ(z) satisfies Property 1,2 and 3, we have the

following property for the second derivative of function fD(W ) at W ∗,

Ω(v2minρ(σk)/(κ2λ))I ∇2fD(W

∗).

Proof. Let a ∈ Rdk denote vector[a>1 a>2 · · · a>k

]>, let b ∈ Rdk denote vec-

tor[b>1 b>2 · · · b>k

]> and let c ∈ Rdk denote vector[c>1 c>2 · · · c>k

]>. The

smallest eigenvalue of the Hessian can be calculated by

∇2f(W ∗) min‖a‖=1

a>∇2f(W ∗)a Idk = min‖a‖=1

Ex∼Dd

( k∑i=1

v∗i a>i x · φ′(w∗>

i x)

)2 Idk

(B.6)

185

Page 202: Copyright by Kai Zhong 2018

Note that

min‖a‖=1

Ex∼Dd

( k∑i=1

(v∗i ai)>x · φ′(w∗>

i x)

)2

= min‖a‖6=0

Ex∼Dd

( k∑i=1

(v∗i ai)>x · φ′(w∗>

i x)

)2/‖a‖2

= min∑i ‖bi/v∗i ‖2 6=0

Ex∼Dd

( k∑i=1

b>i x · φ′(w∗>i x)

)2/( k∑

i=1

‖bi/v∗i ‖2)

by ai = bi/v∗i

= min∑i ‖bi‖2 6=0

Ex∼Dd

( k∑i=1

b>i x · φ′(w∗>i x)

)2/( k∑

i=1

‖bi/v∗i ‖2)

≥ v2min min∑i ‖bi‖2 6=0

Ex∼Dd

( k∑i=1

b>i x · φ′(w∗>i x)

)2/( k∑

i=1

‖bi‖2)

by vmin = mini∈[k]|v∗i |

= v2min min‖a‖=1

Ex∼Dd

( k∑i=1

a>i x · φ′(w∗>i x)

)2 (B.7)

Let U ∈ Rd×k be the orthonormal basis of W ∗ and let V =[v1 v2 · · · vk

]=

U>W ∗ ∈ Rk×k. Also note that V and W ∗ have same singular values and W ∗ =

UV . We use U⊥ ∈ Rd×(d−k) to denote the complement of U . For any vector

aj ∈ Rd, there exist two vectors bj ∈ Rk and cj ∈ Rd−k such that

aj = Ubj + U⊥cj.

We use Dd to denote Gaussian distribution N(0, Id), Dd−k to denote Gaussian dis-

tribution N(0, Id−k), and Dk to denote Gaussian distribution N(0, Ik). Then we can

186

Page 203: Copyright by Kai Zhong 2018

rewrite formulation (C.6) (removing v2min) as

Ex∼Dd

( k∑i=1

a>i x · φ′(w∗>i x)

)2 = E

x∼Dd

( k∑i=1

(b>i U> + c>i U

>⊥ )x · φ′(w∗>

i x)

)2 = A+B + C

where

A = Ex∼Dd

( k∑i=1

b>i U>x · φ′(w∗>

i x)

)2,

B = Ex∼Dd

( k∑i=1

c>i U>⊥x · φ′(w∗>

i x)

)2,

C = Ex∼Dd

[2

(k∑

i=1

b>i U>x · φ′(w∗>

i x)

(k∑

i=1

c>i U>⊥x · φ′(w∗>

i x)

)].

We calculate A,B,C separately. First, we can show

A = Ex∼Dd

( k∑i=1

b>i U>x · φ′(w∗>

i x)

)2 = E

z∼Dk

( k∑i=1

b>i z · φ′(v>i z)

)2.

187

Page 204: Copyright by Kai Zhong 2018

Second, we can show

B = Ex∼Dd

( k∑i=1

c>i U>⊥x · φ′(w∗>

i x)

)2

= Es∼Dd−k,z∼Dk

( k∑i=1

c>i s · φ′(v∗>i z)

)2

= Es∼Dd−k,z∼Dk

[(y>s)2] by defining y =k∑

i=1

φ′(v∗>i z)ci ∈ Rd−k

= Ez∼Dk

[E

s∼Dd−k

[(y>s)2]

]= E

z∼Dk

[E

s∼Dd−k

[d−k∑j=1

s2jy2j

]]by E[sjsj′ ] = 0

= Ez∼Dk

[d−k∑j=1

y2j

]by sj ∼ N(0, 1)

= Ez∼Dk

∥∥∥∥∥k∑

i=1

φ′(v∗>i z)ci

∥∥∥∥∥2 by definition of y

Third, we have C = 0 since U>⊥x is independent of w∗>

i x and U>x. Thus, putting

them all together,

Ex∼Dd

( k∑i=1

a>i x · φ′(w∗>i x)

)2 = E

z∼Dk

( k∑i=1

b>i z · φ′(v>i z)

)2+ E

z∼Dk

∥∥∥∥∥k∑

i=1

φ′(v>i z)ci

∥∥∥∥∥2

188

Page 205: Copyright by Kai Zhong 2018

Let us lower bound A,

A = Ez∼Dk

( k∑i=1

b>i z · φ′(v>i z)

)2

=

∫(2π)−k/2

(k∑

i=1

b>i z · φ′(v>i z)

)2

e−‖z‖2/2dz

=ξ1

∫(2π)−k/2

(k∑

i=1

b>i V†>s · φ′(si)

)2

e−‖V †>s‖2/2 · | det(V †)|ds

≥ξ2

∫(2π)−k/2

(k∑

i=1

b>i V†>s · φ′(si)

)2

e−σ21(V

†)‖s‖2/2 · | det(V †)|ds

=ξ3

∫(2π)−k/2

(k∑

i=1

b>i V†>u/σ1(V

†) · φ′(ui/σ1(V†))

)2

e−‖u‖2/2| det(V †)|/σk1(V

†)du

=

∫(2π)−k/2

(k∑

i=1

p>i u · φ′(σk · ui)

)2

e−‖u‖2/2 1

λdu

=1

λE

u∼Dk

( k∑i=1

p>i u · φ′(σk · ui)

)2

where V † ∈ Rk×k is the inverse of V ∈ Rk×k, i.e., V †V = I , p>i = b>i V†>/σ1(V

†)

and σk = σk(W∗). ξ1 replaces z by z = V †>s, so v∗>i z = si. ξ2 uses the fact

‖V †>s‖ ≤ σ1(V†)‖s‖. ξ3 replaces s by s = u/σ1(V

†). Note that φ′(σk · ui)’s

are independent of each other, so we can simplify the analysis. In particular,

Lemma B.3.2 gives a lower bound in this case in terms of pi. Note that ‖pi‖ ≥

‖bi‖/κ. Therefore,

Ez∼Dk

( k∑i=1

b>i z · φ′(v>i z)

)2 ≥ ρ(σk)

1

κ2λ‖b‖2.

189

Page 206: Copyright by Kai Zhong 2018

For B, similar to the proof of Lemma B.3.2, we have,

B = Ez∼Dk

∥∥∥∥∥k∑

i=1

φ′(v>i z)ci

∥∥∥∥∥2

=

∫(2π)−k/2

∥∥∥∥∥k∑

i=1

φ′(v>i z)ci

∥∥∥∥∥2

e−‖z‖2/2dz

=

∫(2π)−k/2

∥∥∥∥∥k∑

i=1

φ′(σk · ui)ci

∥∥∥∥∥2

e−‖V †>u/σ1(V †)‖2/2 · det(V †/σ1(V†))du

=

∫(2π)−k/2

∥∥∥∥∥k∑

i=1

φ′(σk · ui)ci

∥∥∥∥∥2

e−‖V †>u/σ1(V †)‖2/2 · 1λdu

≥∫

(2π)−k/2

∥∥∥∥∥k∑

i=1

φ′(σk · ui)ci

∥∥∥∥∥2

e−‖u‖2/2 · 1λdu

=1

λE

u∼Dk

∥∥∥∥∥k∑

i=1

φ′(σk · ui)ci

∥∥∥∥∥2

=1

λ

(k∑

i=1

Eu∼Dk

[φ′(σk · ui)φ′(σk · ui)c

>i ci] +

∑i 6=l

Eu∼Dk

[φ′(σk · ui)φ′(σk · ul)c

>i cl]

)

=1

λ

(E

z∼D1

[φ′(σk · ui)2]

k∑i=1

‖ci‖2 +(

Ez∼D1

[φ′(σk · z)])2∑

i 6=l

c>i cl

)

=1

λ

( Ez∼D1

[φ′(σk · z)])2∥∥∥∥∥

k∑i=1

ci

∥∥∥∥∥2

2

+

(E

z∼D1

[φ′(σk · z)2]−(

Ez∼D1

[φ′(σk · z)])2)‖c‖2

≥ 1

λ

(E

z∼D1

[φ′(σk · z)2]−(

Ez∼D1

[φ′(σk · z)])2)‖c‖2

≥ ρ(σk)1

λ‖c‖2,

where the first step follows by definition of Gaussian distribution, the second step

follows by replacing z by z = V †>u/σ1(V†), and then v>i z = ui/σ1(V

†) =

190

Page 207: Copyright by Kai Zhong 2018

uiσk(W∗), the third step follows by ‖u‖2 ≥ ‖ 1

σ1(V †)V †>u‖2 , the fourth step follows

by det(V †/σ1(V†)) = det(V †)/σk

1(V†) = 1/λ, the fifth step follows by definition

of Gaussian distribution, the ninth step follows by x2 ≥ 0 for any x ∈ R, and the

last step follows by Property 2.

Note that 1 = ‖a‖2 = ‖b‖2 + ‖c‖2. Thus, we finish the proof for the lower

bound.

Upper Bound on the Eigenvalues of Hessian for Non-orthogonal Case

Lemma B.3.4 (Upper bound). If φ(z) satisfies Property 1,2 and 3, we have the

following property for the second derivative of function fD(W ) at W ∗,

∇2fD(W∗) O(kv2maxσ

2p1 )I.

191

Page 208: Copyright by Kai Zhong 2018

Proof. Similarly, we can calculate the upper bound of the eigenvalues by

‖∇2f(W ∗)‖

= max‖a‖=1

a>∇2f(W ∗)a

= v2max max‖a‖=1

Ex∼Dk

( k∑i=1

a>i x · φ′(w∗>i x)

)2

= v2max max‖a‖=1

k∑i=1

k∑l=1

Ex∼Dk

[a>i x · φ′(w∗>i x) · a>l x · φ′(w∗>

l x)]

≤ v2max max‖a‖=1

k∑i=1

k∑l=1

(E

x∼Dk

[(a>i x)4] · E

x∼Dk

[(φ′(w∗>i x))4] · E

x∼Dk

[(a>l x)4] · E

x∼Dk

[(φ′(w∗>l x))4]

)1/4

. v2max max‖a‖=1

k∑i=1

k∑l=1

‖ai‖ · ‖al‖ · ‖w∗i ‖p · ‖w∗

l ‖p

≤ v2max max‖a‖=1

k∑i=1

k∑l=1

‖ai‖ · ‖al‖ · σ2p1

≤ kv2maxσ2p1 ,

where the first inequality follows Holder’s inequality, the second inequality follows

by Property 1, the third inequality follows by ‖w∗i ‖ ≤ σ1(W

∗), and the last inequal-

ity by Cauchy-Schwarz inequality.

B.3.3 Error Bound of Hessians near the Ground Truth for Smooth Activa-tions

The goal of this Section is to prove Lemma B.3.5

Lemma B.3.5 (Error bound of Hessians near the ground truth for smooth activa-

192

Page 209: Copyright by Kai Zhong 2018

tions). Let φ(z) satisfy Property 1, Property 2 and Property 3(a). Let W satisfy

‖W − W ∗‖ ≤ σk/2. Let S denote a set of i.i.d. samples from the distribution

defined in (4.1). Then for any t ≥ 1 and 0 < ε < 1/2, if

|S| ≥ ε−2dκ2τ · poly(log d, t)

then we have, with probability at least 1− 1/dΩ(t),

‖∇2fS(W )−∇2fD(W∗)‖ . v2maxkσ

p1(εσ

p1 + ‖W −W ∗‖).

Proof. This follows by combining Lemma B.3.6 and Lemma B.3.7 directly.

Second-order Smoothness near the Ground Truth for Smooth Activa-

tions The goal of this Section is to prove Lemma B.3.6.

Fact B.3.3. Let wi denote the i-th column of W ∈ Rd×k, and w∗i denote the i-th

column of W ∗ ∈ Rd×k. If ‖W −W ∗‖ ≤ σk(W∗)/2, then for all i ∈ [k],

1

2‖w∗

i ‖ ≤ ‖wi‖ ≤3

2‖w∗

i ‖.

Proof. Note that if ‖W − W ∗‖ ≤ σk(W∗)/2, we have σk(W

∗)/2 ≤ σi(W ) ≤32σ1(W

∗) for all i ∈ [k] by Weyl’s inequality. By definition of singular value,

we have σk(W∗) ≤ ‖w∗

i ‖ ≤ σ1(W∗). By definition of spectral norm, we have

‖wi − w∗i ‖ ≤ ‖W −W ∗‖. Thus, we can lower bound ‖wi‖,

‖wi‖ ≤ ‖w∗i ‖+ ‖wi − w∗

i ‖ ≤ ‖w∗i ‖+ ‖W −W ∗‖ ≤ ‖w∗

i ‖+ σk/2 ≤3

2‖w∗

i ‖.

Similarly, we have ‖wi‖ ≥ 12‖w∗

i ‖.

193

Page 210: Copyright by Kai Zhong 2018

Lemma B.3.6 (Second-order smoothness near the ground truth for smooth activa-

tions). If φ(z) satisfies Property 1, Property 2 and Property 3(a), then for any W

with ‖W −W ∗‖ ≤ σk/2, we have

‖∇2fD(W )−∇2fD(W∗)‖ . k2v2maxσ

p1‖W −W ∗‖.

Proof. Let ∆ = ∇2fD(W )−∇2fD(W∗). For each (i, l) ∈ [k]× [k], let ∆i,l ∈ Rd×d

denote the (i, l)-th block of ∆. Then, for i 6= l, we have

∆i,l = Ex∼Dd

[v∗i v

∗l

(φ′(w>

i x)φ′(w>

l x)− φ′(w∗>i x)φ′(w∗>

l x))xx>],

and for i = l, we have

∆i,i

= Ex∼Dd

[(k∑

r=1

v∗rφ(w>r x)− y

)v∗i φ

′′(w>i x)xx

> + v∗2i(φ′2(w>

i x)− φ′2(w∗>i x)

)xx>

]

= Ex∼Dd

[(k∑

r=1

v∗rφ(w>r x)− y

)v∗i φ

′′(w>i x)xx

>

]+ E

x∼Dd

[v∗2i(φ′2(w>

i x)− φ′2(w∗>i x)

)xx>].

(B.8)

In the next a few paragraphs, we first show how to bound the off-diagonal

term, and then show how to bound the diagonal term.

194

Page 211: Copyright by Kai Zhong 2018

First, we consider off-diagonal terms.

‖∆i,l‖

=

∥∥∥∥ Ex∼Dd

[v∗i v

∗l

(φ′(w>

i x)φ′(w>

l x)− φ′(w∗>i x)φ′(w∗>

l x))xx>]∥∥∥∥

≤ v2max max‖a‖=1

Ex∼Dd

[∣∣φ′(w>i x)φ

′(w>l x)− φ′(w∗>

i x)φ′(w∗>l x)

∣∣ · (x>a)2]

≤ v2max max‖a‖=1

Ex∼Dd

[(|φ′(w>

i x)− φ′(w∗>i x)||φ′(w>

l x)|+ |φ′(w∗>i x)||φ′(w>

l x)− φ′(w∗>l x)|

)(x>a)2

]= v2max max

‖a‖=1

(E

x∼Dd

[|φ′(w>

i x)− φ′(w∗>i x)||φ′(w>

l x)|(x>a)2]

+ Ex∼Dd

[|φ′(w∗>

i x)||φ′(w>l x)− φ′(w∗>

l x)|(x>a)2])

≤ v2max max‖a‖=1

(E

x∼Dd

[L2|(wi − w∗

i )>x||φ′(w>

l x)|(x>a)2]+ E

x∼Dd

[|φ′(w∗>

i x)|L2|(wl − w∗l )

>x|(x>a)2])

≤ v2max max‖a‖=1

(E

x∼Dd

[L2|(wi − w∗

i )>x|L1|w>

l x|p(x>a)2]+ E

x∼Dd

[L1|w∗>

i x|pL2|(wl − w∗l )

>x|(x>a)2])

≤ v2maxL1L2 max‖a‖=1

(E

x∼Dd

[|(wi − w∗

i )>x||w>

l x|p(x>a)2]+ E

x∼Dd

[|(wl − w∗

l )>x||w∗>

i x|p(x>a)2])

. v2maxL1L2 max‖a‖=1

(‖wi − w∗i ‖‖wl‖p‖a‖2 + ‖wl − w∗

l ‖‖w∗i ‖p‖a‖2)

. v2maxL1L2σp1(W

∗)‖W −W ∗‖ (B.9)

where the first step follows by definition of ∆i,l, the second step follows by defini-

tion of spectral norm and v∗i v∗l ≤ v2max, the third step follows by triangle inequality,

the fourth step follows by linearity of expectation, the fifth step follows by Prop-

erty 3(a), i.e., |φ′(w>i x) − φ′(w∗>

i x)| ≤ L2|(wi − w∗i )

>x|, the sixth step follows

by Property 1, i.e., φ′(z) ≤ L1|z|p, the seventh step follows by Fact 2.4.2, and the

last step follows by ‖a‖2 = 1, ‖wi − w∗i ‖ ≤ ‖W − W ∗‖, ‖wi‖ ≤ 3

2‖w∗

i ‖, and

‖w∗i ‖ ≤ σ1(W

∗).

Note that the proof for the off-diagonal terms also applies to bounding the

195

Page 212: Copyright by Kai Zhong 2018

second-term in the diagonal block Eq. (B.8). Thus we only need to show how to

bound the first term in the diagonal block Eq. (B.8).∥∥∥∥∥ Ex∼Dd

[(k∑

r=1

v∗rφ(w>r x)− y

)v∗i φ

′′(w>i x)xx

>

]∥∥∥∥∥=

∥∥∥∥∥ Ex∼Dd

[(k∑

r=1

v∗r(φ(w>r x)− φ(w∗>

r x))

)v∗i φ

′′(w>i x)xx

>

]∥∥∥∥∥≤ v2max

k∑r=1

max‖a‖=1

Ex∼Dd

[|φ(w>r x)− φ(w∗>

r x)||φ′′(w>i x)|(x>a)2]

≤ v2max

k∑r=1

max‖a‖=1

Ex∼Dd

[|φ(w>r x)− φ(w∗>

r x)|L2(x>a)2]

≤ v2maxL2

k∑r=1

max‖a‖=1

Ex∼Dd

[max

z∈[w>r x,w∗>

r x]|φ′(z)| · |(wr − w∗

r)>x| · (x>a)2

]

≤ v2maxL2

k∑r=1

max‖a‖=1

Ex∼Dd

[max

z∈[w>r x,w∗>

r x]L1|z|p · |(wr − w∗

r)>x| · (x>a)2

]

≤ v2maxL1L2

k∑r=1

max‖a‖=1

Ex∼Dd

[(|w>r x|p + |w∗>

r x|p) · |(wr − w∗r)

>x| · (x>a)2]

. v2maxL1L2

k∑r=1

[(‖wr‖p + ‖w∗r‖p)‖wr − w∗

r‖]

. kv2maxL1L2σp1(W

∗)‖W −W ∗‖, (B.10)

where the first step follows by y =∑k

r=1 v∗rφ(w

∗>r x), the second step follows by

definition of spectral norm and v∗rv∗i ≤ |vmax|2, the third step follows by Prop-

erty 3(a), i.e., |φ′′(w>i x)| ≤ L2, the fourth step follows by |φ(w>

r x) − φ(w∗>r x) ≤

maxz∈[w>r x,w∗>

r x] |φ′(z)||(wr−w∗r)

>x|, the fifth step follows Property 1, i.e., |φ′(z)| ≤

L1|z|p, the sixth step follows by maxz∈[w>r x,w∗>

r x] |z|p ≤ (|w>r x|p + |w∗>

r x|p), the

seventh step follows by Fact 2.4.2.

196

Page 213: Copyright by Kai Zhong 2018

Putting it all together, we can bound the error by

‖∇2fD(W )−∇2fD(W∗)‖

= max‖a‖=1

a>(∇2fD(W )−∇2fD(W∗))a

= max‖a‖=1

k∑i=1

k∑l=1

a>i ∆i,lal

= max‖a‖=1

(k∑

i=1

a>i ∆i,iai +∑i 6=l

a>i ∆i,lal

)

≤ max‖a‖=1

(k∑

i=1

‖∆i,i‖‖ai‖2 +∑i 6=l

‖∆i,l‖‖ai‖‖al‖

)

≤ max‖a‖=1

(k∑

i=1

C1‖ai‖2 +∑i 6=l

C2‖ai‖‖al‖

)

= max‖a‖=1

C1

k∑i=1

‖ai‖2 + C2

( k∑i=1

‖ai‖

)2

−k∑

i=1

‖ai‖2

≤ max‖a‖=1

(C1

k∑i=1

‖ai‖2 + C2

(k

k∑i=1

‖ai‖2 −k∑

i=1

‖ai‖2))

= max‖a‖=1

(C1 + C2(k − 1))

. k2v2maxL1L2σp1(W

∗)‖W −W ∗‖.

where the first step follows by definition of spectral norm and a denotes a vector

∈ Rdk, the first inequality follows by ‖A‖ = max‖x‖6=0,‖y‖6=0x>Ay‖x‖‖y‖ , the second

inequality follows by ‖∆i,i‖ ≤ C1 and ‖∆i,l‖ ≤ C2, the third inequality follows

by Cauchy-Scharwz inequality, the eighth step follows by∑

i=1 ‖ai‖2 = 1, and the

last step follows by Eq (B.9) and (B.10).

Empirical and Population Difference for Smooth Activations

197

Page 214: Copyright by Kai Zhong 2018

The goal of this Section is to prove Lemma B.3.7. For each i ∈ [k], let σi

denote the i-th largest singular value of W ∗ ∈ Rd×k.

Note that Bernstein inequality requires the spectral norm of each random

matrix to be bounded almost surely. However, since we assume Gaussian distribu-

tion for x, ‖x‖2 is not bounded almost surely. The main idea is to do truncation and

then use Matrix Bernstein inequality. Details can be found in Lemma D.3.9 and

Corollary B.1.1.

Lemma B.3.7 (Empirical and population difference for smooth activations). Let

φ(z) satisfy Property 1,2 and 3(a). Let W satisfy ‖W −W ∗‖ ≤ σk/2. Let S denote

a set of i.i.d. samples from distribution D (defined in (4.1)). Then for any t ≥ 1 and

0 < ε < 1/2, if

|S| ≥ ε−2dκ2τ · poly(log d, t)

then we have, with probability at least 1− d−Ω(t),

‖∇2fS(W )−∇2fD(W )‖ .v2maxkσp1(εσ

p1 + ‖W −W ∗‖).

Proof. Define ∆ = ∇2fD(W )−∇2fS(W ). Let’s first consider the diagonal blocks.

Define

∆i,i = Ex∼Dd

[(k∑

r=1

v∗rφ(w>r x)− y

)v∗i φ

′′(w>i x)xx

> + v∗2i φ′2(w>i x)xx

>

]

(1

n

n∑j=1

(k∑

r=1

v∗rφ(w>r xj)− y

)v∗i φ

′′(w>i xj)xjx

>j + v∗2i φ′2(w>

i xj)xjx>j

).

Let’s further decompose ∆i,i into ∆i,i = ∆(1)i,i +∆

(2)i,i , where

198

Page 215: Copyright by Kai Zhong 2018

∆(1)i,i

= Ex∼Dd

[(k∑

r=1

v∗rφ(w>r x)− y

)v∗i φ

′′(w>i x)xx

>

]− 1

n

n∑j=1

(k∑

r=1

v∗rφ(w>r xj)− y

)v∗i φ

′′(w>i xj)xjx

>j

= v∗i

k∑r=1

(v∗r E

x∼Dd

[(φ(w>

r x)− φ(w∗>r x))φ′′(w>

i x)xx>]

− 1

n

n∑j=1

(φ(w>r xj)− φ(w∗>

r xj))φ′′(w>

i xj)xjx>j

),

and

∆(2)i,i = E

x∼Dd

[v∗2i φ′2(w>i x)xx

>]− 1

n

n∑j=1

[v∗2i φ′2(w>i xj)xjx

>j ]. (B.11)

The off-diagonal block is

∆i,l = v∗i v∗l

(E

x∼Dd

[φ′(w>i x)φ

′(w>l x)xx

>]− 1

n

n∑j=1

φ′(w>i xj)φ

′(w>l xj)xjx

>j

)

Combining Claims. B.3.2, B.3.3 and B.3.4, and taking a union bound over

k2 different ∆i,j , we obtain if n ≥ ε−2κ2(W ∗)τd·poly(t, log d) for any ε ∈ (0, 1/2),

with probability at least 1− 1/dt,

‖∇2fS(W )−∇2f(W )‖ . v2maxkσp1(W

∗)(εσp1(W

∗) + ‖W −W ∗‖)

Claim B.3.2. For each i ∈ [k], if n ≥ dpoly(log d, t)

‖∆(1)i,i ‖ . kv2maxσ

p1(W

∗)‖W −W ∗‖

holds with probability 1− 1/d4t.

199

Page 216: Copyright by Kai Zhong 2018

Proof. For each r ∈ [k], we define function Br : Rd → Rd×d,

Br(x) = L1L2 · (|w>r x|p + |w∗>

r x|p) · |(wr − w∗r)

>x| · xx>.

According to Properties 1,2 and 3(a), we have for each x ∈ S,

−Br(x) (φ(w>r x)− φ(w∗>

r x))φ′′(w>i x)xx

> Br(x)

Therefore,

∆(1)i,i v2max

k∑r=1

(E

x∼Dd

[Br(x)] +1

|S|∑x∈S

Br(x)

).

Let hr(x) = L1L2|w>r x|p · |(wr − w∗

r)>x|. Let Br = Ex∼Dd

[hr(x)xx>].

Define function Br(x) = hr(x)xx>.

(I) Bounding |hr(x)|.

According to Fact 2.4.2, we have for any constant t ≥ 1, with probability

1− 1/(nd4t),

|hr(x)| = L1L2|w>r x|p|(wr − w∗

r)>x| . L1L2‖wr‖p‖wr − w∗

r‖(t log n)(p+1)/2.

(II) Bounding ‖Br‖.

‖Br‖ ≥ Ex∼Dd

[L1L2|w>

r x|p|(wr − w∗r)

>x|((wr − w∗

r)>x

‖wr − w∗r‖

)2]& L1L2‖wr‖p‖wr − w∗

r‖,

where the first step follows by definition of spectral norm, and last step follows by

Fact 2.4.2. Using Fact 2.4.2, we can also prove an upper bound ‖Br‖, ‖Br‖ .

L1L2‖wr‖p‖wr − w∗r‖.

200

Page 217: Copyright by Kai Zhong 2018

(III) Bounding (Ex∼Dd[h4(x)])1/4

Using Fact 2.4.2, we have(E

x∼Dd

[h4(x)]

)1/4

= L1L2

(E

x∼Dd

[(|w>

r x|p|(wr − w∗r)

>x|)4])1/4

. L1L2‖wr‖p‖wr − w∗r‖.

By applying Corollary B.1.1, if n ≥ ε−2dpoly(log d, t), then with probabil-

ity 1− 1/d4t,

∥∥∥∥∥ Ex∼Dd

[|w>

r x|p · |(wr − w∗r)

>x| · xx>]− 1

|S|

(∑x∈S

|w>r xj|p · |(wr − w∗

r)>x| · xx>

)∥∥∥∥∥=

∥∥∥∥∥Br −1

|S|∑x∈S

Br(x)

∥∥∥∥∥≤ ε‖Br‖

. ε‖wr‖p‖wr − w∗r‖. (B.12)

If ε ≤ 1/2, we have

‖∆(1)i,i ‖ .

k∑i=1

v2max‖Br‖ . kv2maxσp1(W

∗)‖W −W ∗‖

Claim B.3.3. For each i ∈ [k], if n ≥ ε−2dτpoly(log d, t) , then

‖∆(2)i,i ‖ . εv2maxσ

2p1

holds with probability 1− 1/d4t.

201

Page 218: Copyright by Kai Zhong 2018

Proof. Recall the definition of ∆(2)i,i .

∆(2)i,i = E

x∼Dd

[v∗2i φ′2(w>i x)xx

>]− 1

|S|∑x∈S

[v∗2i φ′2(w>i x)xx

>]

Let hi(x) = φ′2(w>i x). Let Bi = Ex∼Dd

[hi(x)xx>] Define function Bi(x) =

hi(x)xx>.

(I) Bounding |hi(x)|.

For any constant t ≥ 1, (φ′(w>i x))

2 ≤ L21|w>

i x|2p . L21‖wi‖2ptp logp(n)

with probability 1− 1/(nd4t) according to Fact 2.4.2.

(II) Bounding ‖Bi‖.∥∥∥∥ Ex∼Dd

[φ′2(w>i x)xx

>]

∥∥∥∥ = max‖a‖=1

Ex∼Dd

[φ′2(w>

i x)(x>a)2

]= max

‖a‖=1E

x∼Dd

[φ′2(w>

i x)(αw>

i x+ βx>v)2]

= maxα2+β2=1,‖v‖=1

Ex∼Dd

[φ′2(w>

i x)((αw>

i x)2 + (βx>v)2

)]= max

α2+β2=1

(α2 E

z∼D1

[φ′2(‖wi‖z)z2] + β2 Ez∼D1

[φ′2(‖wi‖z)])

= max

(E

x∼D1

[φ′2(‖wi‖z)z2], Ex∼D1

[φ′2(‖wi‖z)])

where wi = wi/‖wi‖ and v is a unit vector orthogonal to wi such that a = αwi+βv.

Now from Property 2, we have

ρ(‖wi‖) ≤∥∥∥∥ Ex∼Dd

[φ′2(w>i x)xx

>]

∥∥∥∥ . L21‖wi‖2p.

(III) Bounding (Ex∼Dd[h4

i (x)])1/4.

(Ex∼Dd

[h4i (x)]

)1/4=

(E

x∼Dd

[φ′8(w>i x)]

)1/4

. L21‖wi‖2p.

202

Page 219: Copyright by Kai Zhong 2018

By applying Corollary B.1.1, we have, for any 0 < ε < 1, if

n ≥ ε−2d ‖wi‖4pρ2(‖wi‖)poly(log d, t) the following bound holds∥∥∥∥∥Bi −

1

|S|∑x∈S

Bi(x)

∥∥∥∥∥ ≤ ε‖Bi‖,

with probability at least 1− 1/d4t.

Therefore, if n ≥ ε−2dτpoly(log d, t), where τ = (3σ1/2)4p

minσ∈[σk/2,3σ1/2]ρ2(σ)

, we

have with probability 1− 1/d4t

‖∆(2)i,i ‖ . εv2maxσ

2p1

Claim B.3.4. For each i ∈ [k], j ∈ [k], i 6= j, if n ≥ ε−2κ2τdpoly(log d, t), then

‖∆i,j‖ ≤ εv2maxσ2p1 (W ∗)

holds with probability 1− 1/d4t.

Proof. Recall the definition of off-diagonal blocks ∆i,l,

∆i,l = v∗i v∗l

(E

x∼Dd

[φ′(w>i x)φ

′(w>l x)xx

>]− 1

|S|∑x∈S

φ′(w>i x)φ

′(w>l x)xx

>

)(B.13)

Let hi,l(x) = φ′(w>i x)φ

′(w>l x). Define function Bi,l(x) = hi,l(x)xx

>. Let

Bi,l = Ex∼Dd[hi,l(x)xx

>].

(I) Bounding |hi,l(x)|.

203

Page 220: Copyright by Kai Zhong 2018

For any constant t ≥ 1, we have with probability 1− 1/(nd4t)

|hi,l(x)| = |φ′(w>i x)φ

′(w>l x)|

≤ L21‖w>

i x‖p‖w>l x‖p

≤ L21‖wi‖p‖wl‖p(t log n)p

. L21σ

2p1 (t log n)p

where the third step follows by Fact 2.4.2.

(II) Bounding ‖Bi,l‖.

Let U ∈ Rd×2 be the orthogonal basis of spanwi, wl and U⊥ ∈ Rd×(d−2)

be the complementary matrix of U . Let matrix V ∈ R2×2 denote U>[wi wl], then

UV = [wi wl] ∈ Rd×2. Given any vector a ∈ Rd, there exist vectors b ∈ R2 and

c ∈ Rd−2 such that a = Ub+ U⊥c. We can simplify ‖Bi,l‖ in the following way,

‖Bi,l‖ =∥∥∥∥ Ex∼Dd

[φ′(w>i x)φ

′(w>l x)xx

>]

∥∥∥∥= max

‖a‖=1E

x∼Dd

[φ′(w>i x)φ

′(w>l x)(x

>a)2]

= max‖b‖2+‖c‖2=1

Ex∼Dd

[φ′(w>i x)φ

′(w>l x)(b

>U>x+ c>U>⊥x)

2]

= max‖b‖2+‖c‖2=1

Ex∼Dd

[φ′(w>i x)φ

′(w>l x)((b

>U>x)2 + (c>U>⊥x)

2)]

= max‖b‖2+‖c‖2=1

Ez∼D2

[φ′(v>1 z)φ′(v>2 z)(b

>z)2]︸ ︷︷ ︸A1

+ Ez∼D2,s∼Dd−2

[φ′(v>1 z)φ′(v>2 z)(c

>s)2]︸ ︷︷ ︸A2

204

Page 221: Copyright by Kai Zhong 2018

We can lower bound the term A1,

A1 = Ez∼D2

[φ′(v>1 z)φ′(v>2 z)(b

>z)2]

=

∫(2π)−1φ′(v>1 z)φ

′(v>2 z)(b>z)2e−‖z‖2/2dz

=

∫(2π)−1φ′(s1)φ

′(s2)(b>V †>s)2e−‖V †>s‖2/2 · | det(V †)|ds

≥∫

(2π)−1(φ′(s1)φ′(s2))(b

>V †>s)2e−σ21(V

†)‖s‖2/2 · | det(V †)|ds

=

∫(2π)−1(φ′(u1/σ1(V

†))φ′(u2/σ1(V†))) · (b>V †>u/σ1(V

†))2e−‖u‖2/2| det(V †)|/σ21(V

†)du

=σ2(V )

σ1(V )E

u∼D2

[(p>u)2φ′(σ2(V ) · u1)φ

′(σ2(V ) · u2)]

=σ2(V )

σ1(V )E

u∼D2

[((p1u1)

2 + (p2u2)2 + 2p1p2u1u2

)φ′(σ2(V ) · u1)φ

′(σ2(V ) · u2)]

=σ2(V )

σ1(V )

(‖p‖2 E

z∼D1

[φ′(σ2(V ) · z)z2] · Ez∼D1

[φ′(σ2(V ) · z)]

+ ((p>1)2 − ‖p‖2) Ez∼D1

[φ′(σ2(V ) · z)z]2)

where p = V †b · σ2(V ) ∈ R2.

Since we are maximizing over b ∈ R2, we can choose b such that ‖p‖ = ‖b‖.

Then

A1 = Ez∼D2

[φ′(v>1 z)φ′(v>2 z)(b

>z)2]

≥ σ2(V )

σ1(V )‖b‖2

(E

z∼D1

[φ′(σ2(V ) · z)z2] · Ez∼D1

[φ′(σ2(V ) · z)]− Ez∼D1

[φ′(σ2(V ) · z)z]2)

205

Page 222: Copyright by Kai Zhong 2018

For the term A2, similarly we have

A2 = Ez∼D2,s∼Dd−2

[φ′(v>1 z)φ′(v>2 z)(c

>s)2]

= Ez∼D2

[φ′(v>1 z)φ′(v>2 z)] E

s∼Dd−2

[(c>s)2]

= ‖c‖2 Ez∼D2

[φ′(v>1 z)φ′(v>2 z)]

≥ ‖c‖2σ2(V )

σ1(V )

(E

z∼D1

[φ′(σ2(V ) · z)])2

For simplicity, we just set ‖b‖ = 1 and ‖c‖ = 0 to lower bound,∥∥∥∥ Ex∼Dd

[φ′(w>

i x)φ′(w>

l x)xx>]∥∥∥∥

≥ σ2(V )

σ1(V )

(E

z∼D1

[φ′(σ2(V ) · z)z2] · Ez∼D1

[φ′(σ2(V ) · z)]−(

Ez∼D1

[φ′(σ2(V ) · z)z])2)

≥ σ2(V )

σ1(V )ρ(σ2(V ))

≥ 1

κ(W ∗)ρ(σ2(V ))

where the second step is from Property 2 and the fact that σk/2 ≤ σ2(V ) ≤

σ1(V ) ≤ 3σ1/2.

For the upper bound of ‖Ex∼Dd[φ′(w>

i x)φ′(w>

l x)xx>]‖, we have

Ez∼D2

[φ′(v>1 z)φ′(v>2 z)(b

>z)2] ≤ L21 Ez∼D2

[|v>1 z|p · |v>2 z|p · |b>z|2]

. L21‖v1‖p‖v2‖p‖b‖2

. L21σ

2p1

where the first step follows by Property 1, the second step follows by Fact 2.4.2,

and the last step follows by ‖v1‖ ≤ σ1, ‖v2‖ ≤ σ1 and ‖b‖ ≤ 1. Similarly, we can

206

Page 223: Copyright by Kai Zhong 2018

upper bound,

Ez∼D2,s∼Dd−2

[φ′(v>1 z)φ′(v>2 z)(c

>s)2] = ‖c‖2 Ez∼D2

[φ′(v>1 z)φ′(v>2 z)] . L2

1σ2p1

Thus, we have∥∥Ex∼Dd[φ′(w>

i x)φ′(w>

l x)xx>]∥∥ . L2

1σ2p1 . σ2p

1 .

(III) Bounding (Ex∼Dd[h4

i,l(x)])1/4.

(Ex∼Dd

[h4i,l(x)]

)1/4=(Ex∼Dd

[φ′4(w>i x) · φ′4(w>

l x)])1/4

. L21‖wi‖p‖wl‖p . L2

1σ2p1 .

Therefore, applying Corollary B.1.1, we have, if n ≥ ε−2κ2(W ∗)τdpoly(log d, t),

then

‖∆i,j‖ ≤ εv2maxσ2p1 (W ∗).

holds with probability at least 1− 1/d4t.

B.3.4 Error Bound of Hessians near the Ground Truth for Non-smooth Acti-vations

The goal of this Section is to prove Lemma B.3.8,

Lemma B.3.8 (Error bound of Hessians near the ground truth for non-smooth acti-

vations). Let φ(z) satisfy Property 1,2 and 3(b). Let W satisfy ‖W −W ∗‖ ≤ σk/2.

Let S denote a set of i.i.d. samples from the distribution defined in (4.1). Then for

any t ≥ 1 and 0 < ε < 1/2, if

|S| ≥ ε−2κ2τdpoly(log d, t)

207

Page 224: Copyright by Kai Zhong 2018

with probability at least 1− d−Ω(t),

‖∇2fS(W )−∇2fD(W∗)‖ . v2maxkσ

2p1 (ε+ (σ−1

k · ‖W −W ∗‖)1/2).

Proof. As we noted previously, when Property 3(b) holds, the diagonal blocks of

the empirical Hessian can be written as, with probability 1,

∂2fS(W )

∂w2i

=1

|S|∑

(x,y)∈S

(v∗i φ′(w>

i x))2xx>

We construct a matrix H ∈ Rdk×dk with i, l-th block as

Hi,l = v∗i v∗l Ex∼Dd

[φ′(w>

i x)φ′(w>

l x)xx>] ∈ Rd×d, ∀i ∈ [k], l ∈ [k].

Note that H 6= ∇2fD(W ). However we can still bound ‖H − ∇2fS(W )‖

and ‖H − ∇2fD(W∗)‖ when we have enough samples and ‖W − W ∗‖ is small

enough. The proof for ‖H−∇2fS(W )‖ basically follows the proof of Lemma B.3.7

as ∆(2)ii in Eq. (B.11) and ∆il in Eq. (B.13) forms the blocks of H −∇2fS(W ) and

we can bound them without smoothness of φ(·).

Now we focus on H −∇2fD(W∗). We again consider each block.

∆i,l = Ex∼Dd

[v∗i v

∗l (φ

′(w>i x)φ

′(w>l x)− φ′(w∗>

i x)φ′(w∗>l x))xx>].

We used the boundedness of φ′′(z) to prove Lemma B.3.6. Here we can’t use this

208

Page 225: Copyright by Kai Zhong 2018

condition. Without smoothness, we will stop at the following position.

∥∥Ex∼Dd[v∗i v

∗l (φ

′(w>i x)φ

′(w>l x)− φ′(w∗>

i x)φ′(w∗>l x))xx>]

∥∥≤ |v∗i v∗l | max

‖a‖=1Ex∼Dd

[|φ′(w>

i x)φ′(w>

l x)− φ′(w∗>i x)φ′(w∗>

l x)|(x>a)2]

≤ |v∗i v∗l | max‖a‖=1

Ex∼Dd

[|φ′(w>

i x)− φ′(w∗>i x)||φ′(w>

l x)|

+ |φ′(w∗>i x)||φ′(w>

l x)− φ′(w∗>l x)|(x>a)2

]= |v∗i v∗l | max

‖a‖=1

(Ex∼Dd

[|φ′(w>

i x)− φ′(w∗>i x)||φ′(w>

l x)|(x>a)2]

+ Ex∼Dd

[|φ′(w∗>

i x)||φ′(w>l x)− φ′(w∗>

l x)|(x>a)2]). (B.14)

where the first step follows by definition of spectral norm, the second step follows

by triangle inequality, and the last step follows by linearity of expectation.

Without loss of generality, we just bound the first term in the above formu-

lation. Let U be the orthogonal basis of span(wi, w∗i , wl). If wi, w

∗i , wl are inde-

pendent, U is d-by-3. Otherwise it can be d-by-rank(span(wi, w∗i , wl)). Without

loss of generality, we assume U = span(wi, w∗i , wl) is d-by-3. Let [vi v∗i vl] =

U>[wi w∗i wl] ∈ R3×3, and [ui u∗

i ul] = U>⊥ [wi w∗

i wl] ∈ R(d−3)×3. Let

209

Page 226: Copyright by Kai Zhong 2018

a = Ub+ U⊥c, where U⊥ ∈ Rd×(d−3) is the complementary matrix of U .

Ex∼Dd

[|φ′(w>

i x)− φ′(w∗>i x)||φ′(w>

l x)|(x>a)2]

= Ex∼Dd

[|φ′(w>

i x)− φ′(w∗>i x)||φ′(w>

l x)|(x>(Ub+ U⊥c))2]

. Ex∼Dd

[|φ′(w>

i x)− φ′(w∗>i x)||φ′(w>

l x)|((x>Ub)2 + (x>U⊥c)

2)]

= Ex∼Dd

[|φ′(w>

i x)− φ′(w∗>i x)||φ′(w>

l x)|(x>Ub)2]

+ Ex∼Dd

[|φ′(w>

i x)− φ′(w∗>i x)||φ′(w>

l x)|(x>U⊥c)2]

= Ez∼D3

[|φ′(v>i z)− φ′(v∗>i z)||φ′(v>l z)|(z>b)2

]+ Ey∼Dd−3

[|φ′(u>

i y)− φ′(u∗>i y)||φ′(u>

l y)|(y>c)2]

(B.15)

where the first step follows by a = Ub + U⊥c, the last step follows by (a + b)2 ≤

2a2 + 2b2.

By Property 3(b), we have e exceptional points which have φ′′(z) 6= 0. Let

these e points be p1, p2, · · · , pe. Note that if v>i z and v∗>i z are not separated by

any of these exceptional points, i.e., there exists no j ∈ [e] such that v>i z ≤ pj ≤

v∗>i z or v∗>i z ≤ pj ≤ v>i z, then we have φ′(v>i z) = φ′(v∗>i z) since φ′′(s) are

zeros except for pjj=1,2,··· ,e. So we consider the probability that v>i z, v∗>i z are

separated by any exception point. We use ξj to denote the event that v>i z, v∗>i z

are separated by an exceptional point pj . By union bound, 1 −∑e

j=1 Pr[ξj] is the

probability that v>i z, v∗>i z are not separated by any exceptional point. The first term

210

Page 227: Copyright by Kai Zhong 2018

of Equation (D.23) can be bounded as,

Ez∼D3

[|φ′(v>i z)− φ′(v∗>i z)||φ′(v>l z)|(z>b)2

]= Ez∼D3

[1∪e

j=1ξj|φ′(v>i z) + φ′(v∗>i z)||φ′(v>l z)|(z>b)2

]≤(Ez∼D3

[1∪e

j=1ξj

])1/2(Ez∼D3

[(φ′(v>i z) + φ′(v∗>i z))2φ′(v>l z)

2(z>b)4])1/2

(e∑

j=1

Prz∼D3

[ξj]

)1/2(Ez∼D3

[(φ′(v>i z) + φ′(v∗>i z))2φ′(v>l z)

2(z>b)4])1/2

.

(e∑

j=1

Prz∼D3

[ξj]

)1/2

(‖vi‖p + ‖v∗i ‖p)‖vl‖p‖b‖2

where the first step follows by if v>i z, v∗>i z are not separated by any exceptional

point then φ′(v>i z) = φ′(v∗>i z) and the last step follows by Holder’s inequality and

Property 1.

It remains to upper bound Prz∼D3 [ξj]. First note that if v>i z, v∗>i z are sepa-

rated by an exceptional point, pj , then |v∗>i z− pj| ≤ |v>i z− v∗>i z| ≤ ‖vi− v∗i ‖‖z‖.

Therefore,

Prz∼D3

[ξj] ≤ Prz∼D3

[|v>i z − pj|‖z‖

≤ ‖vi − v∗i ‖].

Note that ( v∗>i z

‖z‖‖v∗i ‖+ 1)/2 follows Beta(1,1) distribution which is uniform

distribution on [0, 1].

Prz∼D3

[|v∗>i z − pj|‖z‖‖v∗i ‖

≤ ‖vi − v∗i ‖‖v∗i ‖

]≤ Pr

z∼D3

[|v∗>i z|‖z‖‖v∗i ‖

≤ ‖vi − v∗i ‖‖v∗i ‖

].‖vi − v∗i ‖‖v∗i ‖

.‖W −W ∗‖σk(W ∗)

,

where the first step is because we can view v∗>i z

‖z‖ and pj‖z‖ as two independent ran-

dom variables: the former is about the direction of z and the later is related to the

211

Page 228: Copyright by Kai Zhong 2018

magnitude of z. Thus, we have

Ez∈D3 [|φ′(v>i z)− φ′(v∗>i z)||φ′(v>l z)|(z>b)2] . (e‖W −W ∗‖/σk(W∗))1/2σ2p

1 (W ∗)‖b‖2.(B.16)

Similarly we have

Ey∈Dd−3[|φ′(u>

i y)− φ′(u∗>i y)||φ′(u>

l y)|(y>c)2] . (e‖W −W ∗‖/σk(W∗))1/2σ2p

1 (W ∗)‖c‖2.(B.17)

Finally combining Eq. (C.8), Eq. (D.24) and Eq. (D.25), we have

‖H −∇2fD(W∗)‖ . kv2max(e‖W −W ∗‖/σk(W

∗))1/2σ2p1 (W ∗),

which completes the proof.

B.3.5 Positive Definiteness for a Small Region

Here we introduce a lemma which shows that the Hessian of any W , which

may be dependent on the samples but is very close to an anchor point, is close to

the Hessian of this anchor point.

Lemma B.3.9. Let S denote a set of samples from Distribution D defined in Eq. (4.1).

Let W a ∈ Rd×k be a point (respect to function fS(W )), which is independent of the

samples S, satisfying ‖W a −W ∗‖ ≤ σk/2. Assume φ satisfies Property 1, 2 and

3(a). Then for any t ≥ 1, if

|S| ≥ dpoly(log d, t),

212

Page 229: Copyright by Kai Zhong 2018

with probability at least 1 − d−t, for any W (which is not necessarily to be inde-

pendent of samples) satisfying ‖W a −W‖ ≤ σk/4, we have

‖∇2fS(W )−∇2fS(Wa)‖ ≤ kv2maxσ

p1(‖W a −W ∗‖+ ‖W −W a‖d(p+1)/2).

Proof. Let ∆ = ∇2fS(W )−∇2fS(Wa) ∈ Rdk×dk, then ∆ can be thought of as k2

blocks, and each block has size d× d. The off-diagonal blocks are,

∆i,l = v∗i v∗l

1

|S|∑x∈S

(φ′(w>

i x)φ′(w>

l x)− φ′(wa>i x)φ′(wa>

l x))xx>

For diagonal blocks,

∆i,i =1

|S|∑x∈S

((k∑

q=1

v∗qφ(w>q x)− y

)v∗i φ

′′(w>i x)xx

> + v∗2i φ′2(w>i x)xx

>

)

− 1

|S|∑x∈S

((k∑

q=1

v∗qφ(wa>q x)− y

)v∗i φ

′′(wa>i x)xx> + v∗2i φ′2(wa>

i x)xx>

)

We further decompose ∆i,i into ∆i,i = ∆(1)i,i +∆

(2)i,i , where

∆(1)i,i = v∗i

1

|S|∑x∈S

((k∑

q=1

v∗qφ(w>q x)− y

)φ′′(w>

i x)−

(k∑

q=1

v∗qφ(wa>q x)− y

)φ′′(wa>

i x)

)xx>,

and

∆(2)i,i =v∗2i

1

|S|∑

(x,y)∈S

φ′2(w>i x)xx

> − φ′2(wa>i x)xx>. (B.18)

213

Page 230: Copyright by Kai Zhong 2018

We can further decompose ∆(1)i,i into ∆

(1,1)i,i and ∆

(1,2)i,i ,

∆(1)i,i = v∗i

1

|S|∑x∈S

(k∑

q=1

v∗qφ(w>q x)− y

)φ′′(w>

i x)−

(k∑

q=1

v∗qφ(wa>q x)− y

)φ′′(wa>

i x)xx>

= v∗i1

|S|∑x∈S

(k∑

q=1

v∗qφ(w>q x)−

k∑q=1

v∗qφ(wa>q x)

)φ′′(w>

i x)xx>

+ v∗i1

|S|∑x∈S

(k∑

q=1

v∗qφ(wa>q x)− y

)(φ′′(w>

i x)− φ′′(wa>i x))xx>

= v∗i1

|S|∑x∈S

k∑q=1

v∗q (φ(w>q x)− φ(wa>

q x))φ′′(w>i x)xx

>

+ v∗i1

|S|∑x∈S

k∑q=1

v∗q (φ(wa>q x)− φ(w∗>

q x))(φ′′(w>i x)− φ′′(wa>

i x))xx>

= ∆(1,1)i,i +∆

(1,2)i,i .

Combining Claim B.3.5 and Claim B.3.6 , we have if

n ≥ dpoly(log d, t)

with probability at least 1− 1/d4t,

‖∆(1)i,i ‖ . kv2maxσ

p1(‖W a −W ∗‖+ ‖W a −W‖d(p+1)/2). (B.19)

Therefore, combining Eq. (B.19), Claim B.3.7 and Claim B.3.8, we com-

plete the proof.

Claim B.3.5. For each i ∈ [k], if n ≥ dpoly(log d, t), then

‖∆(1,1)i,i ‖ . kv2maxσ

p1‖W a −W‖d(p+1)/2

214

Page 231: Copyright by Kai Zhong 2018

Proof. Define function h1(x) = ‖x‖p+1 and h2(x) = |w∗>q x|p|(w∗

q − waq )

>x|. Note

that h1 and h2 don’t contain W which maybe depend on the samples. Therefore, we

can use the modified matrix Bernstein inequality (Corollary B.1.1) to bound ∆(1)i,i .

(I) Bounding |h1(x)|.

By Fact 2.4.3, we have h1(x) . (td log n)(p+1)/2 with probability at least

1− 1/(nd4t).

(II) Bounding ‖Ex∼Dd[‖x‖p+1xx>]‖.

Let g(x) = (2π)−d/2e−‖x‖2/2. Note that xg(x)dx = −dg(x).

Ex∼Dd

[‖x‖p+1xx>] = ∫

‖x‖p+1g(x)xx>dx

= −∫‖x‖p+1d(g(x))x>

= −∫‖x‖p+1d(g(x)x>) +

∫‖x‖p+1g(x)Iddx

=

∫d(‖x‖p+1)g(x)x> +

∫‖x‖p+1g(x)Iddx

=

∫(p+ 1)‖x‖p−1g(x)xx>dx+

∫‖x‖p+1g(x)Iddx

∫‖x‖p+1g(x)Iddx

= Ex∼Dd[‖x‖p+1]Id.

Since ‖x‖2 follows χ2 distribution with degree d, Ex∼Dd[‖x‖q] = 2q/2 Γ((d+q)/2)

Γ(d/2)

for any q ≥ 0. So, dq/2 . Ex∼Dd[‖x‖q] . dq/2. Hence, ‖Ex∼Dd

[h1(x)xx>]‖ &

215

Page 232: Copyright by Kai Zhong 2018

d(p+1)/2. Also∥∥Ex∼Dd

[h1(x)xx

>]∥∥ ≤ max‖a‖=1

Ex∼Dd

[h1(x)(x

>a)2]

≤ max‖a‖=1

(Ex∼Dd

[h21(x)

])1/2(Ex∼Dd

[(x>a)4

])1/2. d(p+1)/2.

(III) Bounding (Ex∼Dd[h4

1(x)])1/4.

(Ex∼Dd

[h41(x)]

)1/4. d(p+1)/2.

Define function B(x) = h(x)xx> ∈ Rd×d, ∀i ∈ [n]. Let B = Ex∼Dd[h(x)xx>].

Therefore by applying Corollary B.1.1, we obtain for any 0 < ε < 1, if

n ≥ ε−2dpoly(log d, t)

with probability at least 1− 1/dt,∥∥∥∥∥ 1

|S|∑x∈S

‖x‖p+1xx> − Ex∼Dd

[‖x‖p+1xx>]∥∥∥∥∥ . δd(p+1)/2.

Therefore we have with probability at least 1− 1/dt,∥∥∥∥∥ 1

|S|∑x∈S

‖x‖p+1xx>

∥∥∥∥∥ . d(p+1)/2. (B.20)

Claim B.3.6. For each i ∈ [k], if n ≥ dpoly(log d, t), then

‖∆(1,2)i,i ‖ . kσp

1‖W a −W ∗‖,

holds with probability at least 1− 1/d4t.

216

Page 233: Copyright by Kai Zhong 2018

Proof. Recall the definition of ∆(1,2)i,i ,

∆(1,2)i,i = v∗i

1

|S|∑x∈S

k∑q=1

v∗q (φ(wa>q x)− φ(w∗>

q x))(φ′′(w>i x)− φ′′(wa>

i x))xx>

In order to upper bound the ‖∆(1,2)i,i ‖, it suffices to upper bound the spectral norm

of this quantity,

1

|S|∑x∼S

(φ(wa>q x)− φ(w∗>

q x))(φ′′(w>i x)− φ′′(wa>

i x))xx>.

By Property 1, we have

|φ(wa>q x)− φ(w∗>

q x)| . L1(|wa>q x|p + |w∗>

q x|)|(w∗q − wa

q )>x|.

By Property 3, we have |φ′′(w>i x)− φ′′(wa>

i x)| ≤ 2L2.

For the second part |w∗>q x|p|(w∗

q − waq )

>x|xx>, according to Eq. (C.7), we

have with probability 1− d−t, if n ≥ dpoly(log d, t),∥∥∥∥∥Ex∼Dd

[|w∗>

q x|p|(w∗q − wa

q )>x|xx>]− 1

|S|∑x∈S

|w∗>q x|p|(w∗

q − waq )

>x|xx>

∥∥∥∥∥ . δ‖w∗q‖p‖w∗

q − waq‖.

Also, note that

∥∥Ex∼Dd

[|w∗>

q x|p|(w∗q − wa

q )>x|xx>]∥∥ . ‖w∗

q‖p‖w∗q − wa

q‖.

Thus, we obtain∥∥∥∥∥ 1

|S|∑x∈S

|w∗>q x|p|(w∗

q − waq )

>x|xx>

∥∥∥∥∥ . ‖W a −W ∗‖σp1. (B.21)

217

Page 234: Copyright by Kai Zhong 2018

Claim B.3.7. For each i ∈ [k], if n ≥ dpoly(log d, t), then

‖∆(2)i,i ‖ . kv2maxσ

p1‖W −W a‖d(p+1)/2

holds with probability 1− 1/d4t.

Proof. We have

‖∆(2)i,i ‖ ≤ v∗2i

∥∥∥∥∥ 1

|S|∑x∈S

((φ′(w>

i xj)− φ′(wa>i x)) · (φ′(w>

i x) + φ′(wa>i x))

)xx>

∥∥∥∥∥≤ v∗2i

∥∥∥∥∥ 1

|S|∑x∈S

(L2|(wi − wa

i )>x| · L1(|w>

i x|p + |wa>i x|p)

)xx>

∥∥∥∥∥≤ v∗2i ‖W −W a‖

∥∥∥∥∥ 1

|S|∑x∈S

(L2‖x‖ · L1(‖wi‖p‖x‖p + |wa>

i x|p))xx>

∥∥∥∥∥.Applying Corollary B.1.1 finishes the proof.

Claim B.3.8. For each i ∈ [k], j ∈ [k], i 6= j, if n ≥ dpoly(log d, t), then

‖∆i,l‖ . v2maxσp1‖W a −W‖

holds with probability 1− d4t.

218

Page 235: Copyright by Kai Zhong 2018

Proof. Recall the definition of ∆i,l,

∆i,l = v∗i v∗l

1

|S|∑x∈S

(φ′(w>

i x)φ′(w>

l x)− φ′(w>i x)φ

′(wa>l x)

+φ′(w>i x)φ

′(wa>l x)− φ′(wa>

i x)φ′(wa>l x)

)xx>

= v∗i v∗l

1

|S|∑x∈S

(φ′(w>

i x)φ′(w>

l x)− φ′(w>i x)φ

′(wa>l x)

)+ v∗i v

∗l

1

|S|∑x∈S

(φ′(w>

i x)φ′(wa>

l x)− φ′(wa>i x)φ′(wa>

l x))xx>

|v∗i v∗l |1

|S|∑x∈S

(L1‖wi‖pL2‖wl − wa

l ‖‖x‖p+1 + L2‖wi − wai ‖‖x‖L1‖wa>

l x‖p)xx>

Applying Corollary B.1.1 completes the proof.

B.4 Tensor MethodsB.4.1 Tensor Initialization Algorithm

We describe the details of each procedure in Algorithm 4.4.1 in this section.

a) Compute the subspace estimation from P2 (Algorithm B.4.1). Note

that the eigenvalues of P2 and P2 are not necessarily nonnegative. However, only

k of the eigenvalues will have large magnitude. So we can first compute the top-

k eigenvectors/eigenvalues of both C · I + P2 and C · I − P2, where C is large

enough such that C ≥ 2‖P2‖. Then from the 2k eigen-pairs, we pick the top-k

eigenvectors with the largest eigenvalues in magnitude, which is executed in TOPK

in Algorithm B.4.1. For the outputs of TOPK, k1, k2 are the numbers of picked

largest eigenvalues from C · I + P2 and C · I − P2 respectively and π1(i) returns

the original index of i-th largest eigenvalue from C · I + P2 and similarly π2 is for

C · I − P2. Finally orthogonalizing the picked eigenvectors leads to an estimation

219

Page 236: Copyright by Kai Zhong 2018

of the subspace spanned by w∗1 w∗

2 · · · w∗k. Also note that forming P2 takes

O(n · d2) time and each step of the power method doing a multiplication between

a d × d matrix and a d × k matrix takes k · d2 time by a naive implementation.

Here we reduce this complexity from O((k + n)d2) to O(knd). The idea is to

compute each step of the power method without explicitly forming P2. We take

P2 = M2 as an example; other cases are similar. In Algorithm B.4.1, let the step

P2V be calculated as P2V = 1|S|∑

(x,y)∈S y(x(x>V ) − V ). Now each iteration

only needs O(knd) time. Furthermore, the number of iterations required will be

a small number, since the power method has a linear convergence rate and as an

initialization method we don’t need a very accurate solution. The detailed algorithm

is shown in Algorithm B.4.1. The approximation error bound of P2 to P2 is provided

in Lemma B.4.4. Lemma B.4.5 provides the theoretical bound for Algorithm B.4.1.

b) Form and decompose the 3rd-moment R3 (Algorithm 1 in [78]). We

apply the non-orthogonal tensor factorization algorithm, Algorithm 1 in [78], to

decompose R3. According to Theorem 3 in [78], when R3 is close enough to R3,

the output of the algorithm, ui will close enough to siV>w∗

i , where si is an unknown

sign. Lemma B.4.8 provides the error bound for ‖R3 −R3‖.

c) Recover the magnitude of w∗i and the signs si, v

∗i (Algorithm B.4.2).

For Algorithm B.4.2, we only consider homogeneous functions. Hence we can

assume v∗i ∈ −1, 1 and there exist some universal constants cj such that mj,i =

cj‖w∗i ‖p+1 for j = 1, 2, 3, 4, where p + 1 is the degree of homogeneity. Note

that different activations have different situations even under Assumption 4.4.1. In

particular, if M4 = M2 = 0, φ(·) is an odd function and we only need to know siv∗i .

220

Page 237: Copyright by Kai Zhong 2018

If M3 = M1 = 0, φ(·) is an even function, so we don’t care about what si is.

Let’s describe the details for Algorithm B.4.2. First define two quantities

Q1 and Q2,

Q1 = Ml1(I, α, · · · , α︸ ︷︷ ︸(l1−1) α’s

) =k∑

i=1

v∗i cl1‖w∗i ‖p+1(α>w∗

i )l1−1w∗

i , (B.22)

Q2 = Ml2(V, V, α, · · · , α︸ ︷︷ ︸(l2−2) α’s

) =k∑

i=1

v∗i cl2‖w∗i ‖p+1(α>w∗

i )l2−2(V >w∗

i )(V>w∗

i )>,

(B.23)

where l1 ≥ 1 such that Ml1 6= 0 and l2 ≥ 2 such that Ml2 6= 0. There are possibly

multiple choices for l1 and l2. We will discuss later on how to choose them. Now

we solve two linear systems.

z∗ = argminz∈Rk

∥∥∥∥∥k∑

i=1

zisiw∗i −Q1

∥∥∥∥∥, and r∗ = argminr∈Rk

∥∥∥∥∥k∑

i=1

riV>w∗

i (V>w∗

i )> −Q2

∥∥∥∥∥F

.

(B.24)

The solutions of the above linear systems are

z∗i = v∗i sl1i cl1‖w∗

i ‖p+1(α>siw∗i )

l1−1, and r∗i = v∗i sl2i cl2‖w∗

i ‖p+1(α>siw∗i )

l2−2.

We can approximate siw∗i by V ui and approximate Q1 and Q2 by their empirical

versions Q1 and Q2 respectively. Hence, in practice, we solve

z = argminz∈Rk

∥∥∥∥∥k∑

i=1

ziV ui − Q1

∥∥∥∥∥, and r = argminr∈Rk

∥∥∥∥∥k∑

i=1

riuiu>i − Q2

∥∥∥∥∥F

(B.25)

So we have the following approximations,

zi ≈ v∗i sl1i cl1‖w∗

i ‖p+1(α>V ui)l1−1, and ri ≈ v∗i s

l2i cl2‖w∗

i ‖p+1(α>V ui)l2−2,∀i ∈ [k].

221

Page 238: Copyright by Kai Zhong 2018

In Lemma B.4.11 and Lemma B.4.12, we provide robustness of the above two linear

systems, i.e., the solution errors, ‖z−z∗‖ and ‖r−r∗‖, are bounded under small per-

turbations of w∗i , Q1 and Q2. Recall that the final goal is to approximate ‖w∗

i ‖ and

the signs v∗i , si. Now we can approximate ‖w∗i ‖ by (|zi/(cl1(α>V ui)

l1−1)|)1/(p+1).

To recover v∗i , si, we need to note that if l1 and l2 are both odd or both even, we

can’t recover both v∗i and si. So we consider the following situations,

1. If M1 = M3 = 0, we choose l1 = l2 = minj ∈ 2, 4|Mj 6= 0. Return

v(0)i = sign(ricl2) and s

(0)i being −1 or 1.

2. If M2 = M4 = 0, we choose l1 = minj ∈ 1, 3|Mj 6= 0, l2 = 3. Return

v(0)i being −1 or 1 and s

(0)i = sign(v

(0)i zicl1).

3. Otherwise, we choose l1 = minj ∈ 1, 3|Mj 6= 0, l2 = minj ∈

2, 4|Mj 6= 0. Return v(0)i = sign(ricl2) and s

(0)i = sign(v

(0)i zicl1).

The 1st situation corresponds to part 3 of Assumption 4.4.1,where si doesn’t matter,

and the 2nd situation corresponds to part 4 of Assumption 4.4.1, where only siv∗i

matters. So we recover ‖w∗i ‖ to some precision and v∗i , si exactly provided enough

samples. The recovery of w∗i and v∗i then follows.

Sample complexity: We use matrix Bernstein inequality to bound the error

between P2 and P2, which requires Ω(d) samples (Lemma B.4.4). To bound the

estimation error between R3 and R3, we flatten the tensor to a matrix and then

use matrix Bernstein inequality to bound the error, which requires Ω(k3) samples

(Lemma B.4.8). In Algorithm B.4.2, we also need to approximate a Rd vector and

a Rk×k matrix, which also requires Ω(d). Thus, taking O(d) + O(k3) samples is

sufficient.

222

Page 239: Copyright by Kai Zhong 2018

Algorithm B.4.1 Power Method via Implicit Matrix-Vector Multiplication

1: procedure POWERMETHOD(P2, k)2: C ← 3‖P2‖, T ← a large enough constant.3: Initial guess V (0)

1 ∈ Rd×k, V(0)2 ∈ Rd×k

4: for t = 1→ T do5: V

(t)1 ← QR(CV

(t−1)1 + P2V

(t−1)1 ) . P2V

(t−1)1 is not calculated directly,

see Sec. B.4.1(a)6: V

(t)2 ← QR(CV

(t−1)2 − P2V

(t−1)2 )

7: for j = 1, 2 do8: V

(T )j ←

[vj,1 vj,2 · · · vj,k

]9: for i = 1→ k do

10: λj,i ← |v>j,iP2vj,i| . Calculate the absolute of eigenvalues

11: π1, π2, k1, k2 ← TOPK(λ, k) . πj : [kj]→ [k] and k1 + k2 = k, seeSec. B.4.1(a)

12: for j = 1, 2 do13: Vj ←

[vj,πj(1) v1,πj(2) · · · vj,πj(kj)

]14: V2 ← QR((I − V1V

>1 )V2)

15: V ← [V1, V2]16: return V

Time complexity: In Part a), by using a specially designed power method,

we only need O(knd) time to compute the subspace estimation V . Part b) needs

O(knd) to form R3 and the tensor factorization needs O(k3) time. Part c) requires

calculation of d × k and k2 × k linear systems in Eq. (B.25), which takes at most

O(knd) running time. Hence, the total time complexity is O(knd).

B.4.2 Main Result for Tensor Methods

The goal of this Section is to prove Theorem 5.4.1.

Theorem 5.4.1. Let the activation function be homogeneous satisfying Assump-

223

Page 240: Copyright by Kai Zhong 2018

Algorithm B.4.2 Recovery of the Ground Truth Parameters of the Neural Network,i.e., w∗

i and v∗i1: procedure RECMAGSIGN(V, uii∈[k], S)2: if M1 = M3 = 0 then3: l1 ← l2 ← minj ∈ 2, 4|Mj 6= 04: else if M2 = M4 = 0 then5: l1 ← minj ∈ 1, 3|Mj 6= 0, l2 ← 36: else7: l1 ← minj ∈ 1, 3|Mj 6= 0, l2 ← minj ∈ 2, 4|Mj 6= 0.8: S1, S2 ← PARTITION(S, 2) . |S1|, |S2| = Ω(d)9: Choose α to be a random unit vector

10: Q1 ← ES1 [Q1] . Q1 is the empirical version of Q1(defined in Eq.(B.22))11: Q2 ← ES2 [Q2] . Q2 is the empirical version of Q2(defined in Eq.(B.23))12: z ← argminz

∥∥∥∑ki=1 ziV ui − Q1

∥∥∥13: r ← argminr

∥∥∥∑ki=1 riuiu

>i − Q2

∥∥∥F

14: v(0)i ← sign(ricl2)

15: s(0)i ← sign(v

(0)i zicl1)

16: w(0)i ← s

(0)i (|zi/(cl1(α>V ui)

l1−1)|)1/(p+1)V ui

17: return v(0)i ,w(0)

i

tion 4.4.1. For any 0 < ε < 1 and t ≥ 1, if |S| ≥ ε−2 · d · poly(t, k, κ, log d), then

there exists an algorithm (Algorithm 4.4.1) that takes |S|k · O(d) time and outputs

a matrix W (0) ∈ Rd×k and a vector v(0) ∈ Rk such that, with probability at least

1− d−Ω(t),

‖W (0) −W ∗‖F ≤ ε · poly(k, κ)‖W ∗‖F , and v(0)i = v∗i .

Proof. The success of Algorithm 4.4.1 depends on two approximations. The first

is the estimation of the normalized w∗i up to some unknown sign flip, i.e., the error

‖w∗i−siV ui‖ for some si ∈ −1, 1. The second is the estimation of the magnitude

224

Page 241: Copyright by Kai Zhong 2018

of w∗i and the signs v∗i , si which is conducted in Algorithm B.4.2.

For the first one,

‖w∗i − siV ui‖ ≤ ‖V V >w∗

i − w∗i ‖+ ‖V V >w∗

i − V siui‖

= ‖V V >w∗i − w∗

i ‖+ ‖V >w∗i − siui‖, (B.26)

where the first step follows by triangle inequality, the second step follows by V >V =

I .

We can upper bound ‖V V >w∗i − w∗

i ‖,

‖V V >w∗i − w∗

i ‖ ≤ (‖P2 − P2‖/σk(P2) + ε)

≤ (poly(k, κ)‖P2 − P2‖+ ε)

≤ poly(k, κ)ε, (B.27)

where the first step follows by Lemma B.4.5, the second step follows by

σk(P2) ≥ 1/poly(k, κ), and the last step follows by ‖P2 − P2‖ ≤ εpoly(k, κ) if the

number of samples is proportional to O(d/ε2) as shown in Lemma B.4.4.

We can upper bound ‖V >w∗i − siui‖,

‖V >w∗i − siui‖ ≤ poly(k, κ)‖R3 −R3‖ ≤ εpoly(k, κ), (B.28)

where the first step follows by Theorem 3 in [78], and the last step follows by

‖R3 − R3‖ ≤ εpoly(k, κ) if the number of samples is proportional to O(k2/ε2) as

shown in Lemma B.4.8.

Combining Eq. (B.26), (B.27) and (B.28) together,

‖w∗i − siV ui‖ ≤ εpoly(k, κ).

225

Page 242: Copyright by Kai Zhong 2018

For the second one, we can bound the error of the estimation of moments,

Q1 and Q2, using number of samples proportional to O(d) by Lemma B.4.10 and

Lemma B.4.4 respectively. The error of the solutions of the linear systems Eq.(B.25)

can be bounded by ‖Q1− Q1‖, ‖Q2− Q2‖, ‖ui− V >w∗i ‖ and ‖(I − V V >)w∗

i ‖ ac-

cording to Lemma B.4.11 and Lemma B.4.12. Then we can bound the error of the

output of Algorithm B.4.2. Furthermore, since v∗i ’s are discrete values, they can be

exactly recovered. All the sample complexities mentioned in the above lemmata

are linear in dimension and polynomial in other factors to achieve a constant error.

So accumulating all these errors we complete our proof.

Remark B.4.1. The proofs of these lemmata for Theorem 4.4.1 can be found in the

following sections. Note that these lemmata also hold for any activations satisfying

Property 1 and Assumption 4.4.1. However, since we are unclear how to implement

the last step of Algorithm 4.4.1 (Algorithm B.4.2) for general non-homogeneous

activations, we restrict our theorem to homogeneous activations only.

B.4.3 Error Bound for the Subspace Spanned by the Weight Matrix

Error Bound for the Second-order Moment in Different Cases

Lemma B.4.1. Let M2 be defined as in Definition 4.4.1. Let M2 be the empirical

version of M2, i.e.,

M2 =1

|S|∑

(x,y)∈S

y · (x⊗ x− Id),

226

Page 243: Copyright by Kai Zhong 2018

where S denote a set of samples from Distribution D defined in Eq. (4.1). Assume

M2 6= 0, i.e., m(2)i 6= 0 for any i. Then for any 0 < ε < 1, t ≥ 1, if

|S| ≥ maxi∈[k]

(‖w∗i ‖p+1/|m2,i|+ 1) · ε−2dpoly(log d, t)

with probability at least 1− d−t,

‖M2 − M2‖ ≤ εk∑

i=1

|v∗im2,i|.

Proof. Recall that, for each sample (x, y), y =∑k

i=1 v∗i φ(w

∗>i x). We consider each

component i ∈ [k]. Define function Bi(x) : Rd → Rd×d such that

Bi(x) = φ(w∗>i x) · (x⊗ x− Id).

Define g(z) = φ(z)− φ(0), then |g(z)| = |∫ z

0φ′(s)ds| ≤ L1/(p+ 1)|z|p+1, which

follows Property 1.

(I) Bounding ‖Bi(x)‖.

‖Bi(x)‖ . (L1

p+ 1|w∗>

i x|p+1 + |φ(0)|)(‖xj‖2 + 1)

. (L1

p+ 1‖w∗

i ‖p+1 + |φ(0)|)dpoly(log d, t)

where the last step follows by Fact 2.4.2 and Fact 2.4.3.

(II) Bounding ‖Ex∼Dd[Bi(x)]‖.

Note that Ex∼Dd[Bi(x)] = m2,iw

∗iw

∗>i . Therefore, ‖Ex∼Dd

[Bi(x)]‖ = |m2,i|.

(III) Bounding max(Ex∼Dd‖Bi(x)Bi(x)

>‖,Ex∼Dd‖Bi(x)

>Bi(x)‖).

227

Page 244: Copyright by Kai Zhong 2018

Note that Bi(x) is a symmetric matrix, thus it suffices to only bound one of

them.∥∥Ex∼Dd[B2

i (x)]∥∥ .

(Ex∼Dd

[φ(w∗>i x)4]

)1/2(Ex∼Dd[‖x‖4]

)1/2. (

L1

p+ 1‖w∗

i ‖p+1 + |φ(0)|)2d.

(IV) Bounding max‖a‖=‖b‖=1(Ex∼Dd[(a>Bi(x)b)

2]).

Note that Bi(x) is a symmetric matrix, thus it suffices to consider the case

where a = b.

max‖a‖=1

(Ex∼Dd

[(a>Bi(x)a)

2])1/2

.(Ex∼Dd

[φ4(w∗>i x)]

)1/4.

L1

p+ 1‖w∗

i ‖p+1 + |φ(0)|.

Define L = ‖w∗i ‖p+1 + |φ(0)|. Then we have for any 0 < ε < 1, if

n &L2d+ |m2,i|2 + L|m2,i|d · poly(log d, t)ε

ε2|m2,i|2t log d

with probability at least 1− 1/dt,∥∥∥∥∥Ex∼Dd[Bi(x)]−

1

|S|∑x∈S

Bi(x)

∥∥∥∥∥ ≤ ε|m2,i|.

Lemma B.4.2. Let M3 be defined as in Definition 4.4.1. Let M3 be the empirical

version of M3, i.e.,

M3 =1

|S|∑

(x,y)∈S

y · (x⊗3 − x⊗I),

where S denote a set of samples (each sample is i.i.d. sampled from Distribution D

defined in Eq. (4.1)). Assume M3 6= 0, i.e., m3,i 6= 0 for any i. Let α be a fixed unit

vector. Then for any 0 < ε < 1, t ≥ 1, if

|S| ≥ maxi∈[k]

(‖w∗i ‖p+1/|m3,i(w

∗>i α)|2 + 1) · ε−2dpoly(log d, t)

228

Page 245: Copyright by Kai Zhong 2018

with probability at least 1− 1/dt,

‖M3(I, I, α)− M3(I, I, α)‖ ≤ ε

k∑i=1

|v∗im3,i(w∗>i α)|.

Proof. Since y =∑k

i=1 v∗i φ(w

∗>i x). We consider each component i ∈ [k].

Define function Bi(x) : Rd → Rd×d such that

Bi(x) = [φ(w∗>i x) · (x⊗3 − x⊗I)](I, I, α) = φ(w∗>

i x) · ((x>α)x⊗2 − α>xI − αx> − xα>).

Define g(z) = φ(z) − φ(0), then |g(z)| = |∫ z

0φ′(s)ds| ≤ L1

p+1|z|p+1 . |z|p+1,

which follows Property 1. In order to apply Lemma D.3.9, we need to bound the

following four quantities,

(I) Bounding ‖Bi(x)‖.

‖Bi(x)‖ = ‖φ(w∗>i x) · ((x>α)x⊗2 − α>xId − αx> − xα>)‖

≤ |φ(w∗>i x)| · ‖(x>α)x⊗2 − α>xI − αx> − xα>‖

. (|w∗>i x|p+1 + |φ(0)|)‖(x>α)x⊗2 − α>xI − αx> − xα>‖

. (|w∗>i x|p+1 + |φ(0)|)(|x>α|‖x‖2 + 3|α>x|),

where the third step follows by definition of g(z), and last step follows by definition

of spectral norm and triangle inequality.

Using Fact 2.4.2 and Fact 2.4.3, we have for any constant t ≥ 1, with prob-

ability 1− 1/(nd4t),

‖Bi(x)‖ . (‖w∗i ‖p+1 + |φ(0)|)dpoly(log d, t).

229

Page 246: Copyright by Kai Zhong 2018

(II) Bounding ‖Ex∼Dd[Bi(x)]‖.

Note that Ex∼Dd[Bi(x)] = m3,i(w

∗>i α)w∗

iw∗>i . Therefore, ‖Ex∼Dd

[Bi(x)]‖ =

|m3,i(w∗>i α)|.

(III) Bounding max(‖Ex∼Dd[Bi(x)Bi(x)

>]‖, ‖Ex∼Dd[Bi(x)

>Bi(x)]‖).

Because matrix Bi(x) is symmetric, thus it suffices to bound one of them,∥∥Ex∼Dd[B2

i (x)]∥∥ .

(Ex∼Dd

[φ(w∗>

i x)4])1/2(Ex∼Dd

[(x>α)4

])1/2(Ex∼Dd[‖x‖4]

)1/2. (‖w∗

i ‖p+1 + |φ(0)|)2d.

(IV) Bounding max‖a‖=‖b‖=1(Ex∼Dd[(a>Bi(x)b)

2])1/2.

max‖a‖=1

(Ex∼Dd

[(a>Bi(x)a)

2])1/2

.(Ex∼Dd

[φ4(w∗>

i x)])1/4

. ‖w∗i ‖p+1 + |φ(0)|.

Define L = ‖w∗i ‖p+1 + |φ(0)|. Then we have for any 0 < ε < 1, if

|S| & L2d+ |m3,i(w∗>i α)|2 + L|m3,i(w

∗>i α)|d · poly(log d, t)ε

ε2|m3,i(w∗>i α)|2

· t log d

with probability at least 1− d−t,∥∥∥∥∥Ex∼Dd[Bi(x)]−

1

|S|∑x∈S

Bi(x)

∥∥∥∥∥ ≤ ε|m3,i(w∗>i α)|.

Lemma B.4.3. Let M4 be defined as in Definition 4.4.1. Let M4 be the empirical

version of M4, i.e.,

M4 =1

|S|∑

(x,y)∈S

y · (x⊗4 − (x⊗ x)⊗I + I⊗I),

230

Page 247: Copyright by Kai Zhong 2018

where S denote a set of samples (where each sample is i.i.d. sampled are sampled

from Distribution D defined in Eq. (4.1)). Assume M4 6= 0, i.e., m4,i 6= 0 for any i.

Let α be a fixed unit vector. Then for any 0 < ε < 1, t ≥ 1, if

|S| ≥ maxi∈[k]

(‖w∗i ‖p+1/|m4,i|(w∗>

i α)2 + 1)2 · ε−2 · dpoly(log d, t)

with probability at least 1− 1/dt,

‖M4(I, I, α, α)− M4(I, I, α, α)‖ ≤ εk∑

i=1

|v∗im4,i|(w∗>i α)2.

Proof. Since y =∑k

i=1 v∗i φ(w

∗>i x). We consider each component i ∈ [k].

Define function Bi(x) : Rd → Rd×d such that

Bi(x) = [φ(w∗>i x) · (x⊗4 − (x⊗ x)⊗I + I⊗I)](I, I, α, α)

= φ(w∗>i x) · ((x>α)2x⊗2 − (α>x)2I − 2(α>x)(xα> + αx>)− xx> + 2αα> + I).

Define g(z) = φ(z)−φ(0), then |g(z)| = |∫ z

0φ′(s)ds| ≤ L1/(p+1)|z|p+1 . |z|p+1,

which follows Property 1.

(I) Bounding ‖Bi(x)‖.

‖Bi(x)‖ .|φ(w∗>i x)| · ((x>α)2‖x‖2 + 1 + ‖x‖2 + (α>x)2)

.(|w∗>i x|p+1 + |φ(0)|) · ((x>α)2‖x‖2 + 1 + ‖x‖2 + (α>x)2)

Using Fact 2.4.2 and Fact 2.4.3, we have for any constant t ≥ 1, with probability

1− 1/(nd4t),

‖Bi(x)‖ . (‖w∗i ‖p+1 + |φ(0)|)dpoly(log d, t).

231

Page 248: Copyright by Kai Zhong 2018

(II) Bounding ‖Ex∼Dd[Bi(x)]‖.

Note that Ex∼Dd[Bi(x)] = m4,i(w

∗>i α)2w∗

iw∗>i . Therefore, ‖Ex∼Dd

[Bi(x)]‖ =

|m4,i|(w∗>i α)2.

(III) Bounding max(Ex∼Dd‖Bi(x)Bi(x)

>‖,Ex∼Dd‖Bi(x)

>Bi(x)‖).

∥∥Ex∼Dd[Bi(x)

2]∥∥ .

(Ex∼Dd

[φ(w∗>i x)4]

)1/2(Ex∼Dd[(x>α)8]

)1/2(Ex∼Dd[‖x‖4]

)1/2. (‖w∗

i ‖p+1 + |φ(0)|)2d.

(IV) Bounding max‖a‖=‖b‖=1(Ex∼Dd[(a>Bi(x)b)

2])1/2.

max‖a‖=1

(Ex∼Dd

[(a>Bi(x)a)

2])1/2

.(Ex∼Dd

[φ4(w∗>

i x)])1/4

. ‖w∗i ‖p+1 + |φ(0)|.

Define L = ‖w∗i ‖p+1 + |φ(0)|. Then we have for any 0 < ε < 1, if

n &L2d+ |m4,i|2(w∗>

i α)4 + L|m4,i|(w∗>i α)2dpoly(log d, t)ε

ε2|m4,i|2(w∗>i α)4

· t log d

with probability at least 1− d−t,∥∥∥∥∥Ex∼Dd[Bi(x)]−

1

|S|∑x∈S

Bi(x)

∥∥∥∥∥ ≤ ε|m4,i|(w∗>i α)2.

Error Bound for the Second-order Moment

The goal of this section is to prove Lemma B.4.4, which shows we can

approximate the second order moments up to some precision by using linear sample

complexity in d.

232

Page 249: Copyright by Kai Zhong 2018

Lemma B.4.4 (Estimation of the second order moment). Let P2 and j2 be defined

in Definition 4.4.2. Let S denote a set of i.i.d. samples generated from distribution

D(defined in (4.1)). Let P2 be the empirical version of P2 using dataset S, i.e., P2 =

ES[P2]. Assume the activation function satisfies Property 1 and Assumption 4.4.1.

Then for any 0 < ε < 1 and t ≥ 1, and m0 = mini∈[k]|mj2,i|2(w∗>i α)2(j2−2), if

|S| & σ2p+21 · d · poly(t, log d)/(ε2m0),

then with probability at least 1− d−Ω(t),

‖P2 − P2‖ ≤ εk∑

i=1

|v∗imj2,i(w∗>i α)j2−2|.

Proof. We have shown the bound for j2 = 2, 3, 4 in Lemma B.4.1, Lemma B.4.2

and Lemma B.4.3 respectively. To summarize, for any 0 < ε < 1 we have if

|S| ≥ maxi∈[k]

(‖w∗

i ‖p+1 + |φ(0)|+ |mj2,i(w∗>i α)(j2−2)|)2

|mj2,i|2(w∗>i α)2(j2−2)

· ε−2dpoly(log d, t)

with probability at least 1− d−t,

‖P2 − P2‖ ≤ εk∑

i=1

|v∗imj2,i(w∗>i α)j2−2|.

Subspace Estimation Using Power Method

Lemma B.4.5 shows a small number of power iterations can estimate the

subspace of w∗i i∈[k] to some precision, which provides guarantees for Algorithm B.4.1.

233

Page 250: Copyright by Kai Zhong 2018

Lemma B.4.5 (Bound on subspace estimation). Let P2 be defined as in Defini-

tion. 4.4.2 and P2 be its empirical version. Let U ∈ Rd×k be the orthogonal column

span of W ∗ ∈ Rd×k. Assume ‖P2 − P2‖ ≤ σk(P2)/10. Let C be a large enough

positive number such that C > 2‖P2‖. Then after T = O(log(1/ε)) iterations, the

output of Algorithm B.4.1, V ∈ Rd×k, will satisfy

‖UU> − V V >‖ . ‖P2 − P2‖/σk(P2) + ε,

which implies

‖(I − V V >)w∗i ‖ . (‖P2 − P2‖/σk(P2) + ε)‖w∗

i ‖.

Proof. According to Weyl’s inequality, we are able to pick up the correct numbers

of positive eigenvalues and negative eigenvalues in Algorithm B.4.1 as long as P2

and P2 are close enough.

Let U = [U1 U2] ∈ Rd×k be the eigenspace of spanw∗1 w∗

2 · · · w∗k,

where U1 ∈ Rd×k1 corresponds to positive eigenvalues of P2 and U2 ∈ Rd×k2 is for

negatives.

Let V 1 be the top-k1 eigenvectors of CI + P2. Let V 2 be the top-k2 eigen-

vectors of CI − P2. Let V = [V 1 V 2] ∈ Rd×k.

According to Lemma 9 in [67], we have ‖(I−U1U>1 )V 1‖ . ‖P2−P2‖/σk(P2),‖(I−

U2U>2 )V 2‖ . ‖P2 − P2‖/σk(P2) . Using Fact 2.3.2, we have ‖(I − UU>)V ‖ =

‖UU> − V V >‖.

Let ε be the precision we want to achieve using power method. Let V1 be

the top-k1 eigenvectors returned after O(log(1/ε)) iterations of power methods on

234

Page 251: Copyright by Kai Zhong 2018

CI + P2 and V2 ∈ Rd×k2 for CI − P2 similarly.

According to Theorem 7.2 in [6], we have ‖V1V>1 − V1V

>1 ‖ ≤ ε and

‖V2V>2 − V2V

>2 ‖ ≤ ε.

Let U⊥ be the complementary matrix of U . Then we have,

‖(I − U1U>1 )V 1‖ = max

‖a‖=1‖(I − U1U

>1 )V 1a‖

= max‖a‖=1

‖(U⊥U>⊥ + U2U

>2 )V 1a‖

= max‖a‖=1

√‖U⊥U>

⊥V 1a‖2 + ‖U2U>2 V 1a‖2

≥ max‖a‖=1

‖U2U>2 V 1a‖

= ‖U2U>2 V 1‖, (B.29)

where the first step follows by definition of spectral norm, the second step follows

by I = U1U>1 +U2U

>2 +U>

⊥U>⊥ , the third step follows by U>

2 U⊥ = 0, and last step

follows by definition of spectral norm.

We can upper bound ‖(I − UU>)V ‖,

‖(I − UU>)V ‖

≤ (‖(I − U1U>1 )V 1‖+ ‖(I − U2U

>2 )V 2‖+ ‖U2U

>2 V 1‖+ ‖U1U

>1 V 2‖)

≤ 2(‖(I − U1U>1 )V 1‖+ ‖(I − U2U

>2 )V 2‖)

. ‖P2 − P2‖/σk(P2), (B.30)

where the first step follows by triangle inequality, the second step follows by Eq. (B.29),

and the last step follows by Lemma 9 in [67].

235

Page 252: Copyright by Kai Zhong 2018

We define matrix R such that V 2R = (I − V1V>1 )V2 is the QR decomposi-

tion of (I − V1V>1 )V2, then we have

‖(I − V 2V>2 )V 2‖

= ‖(I − V 2V>2 )(I − V1V

>1 )V2R

−1‖

= ‖(I − V 2V>2 )(I − V 1V

>1 + V 1V

>1 − V1V

>1 )V2R

−1‖

≤ ‖(I − V 2V>2 )(I − V 1V

>1 )V2R

−1‖︸ ︷︷ ︸α

+ ‖(I − V 2V>2 )‖‖R−1‖‖V 1V

>1 − V1V

>1 ‖︸ ︷︷ ︸

β

,

where the first step follows by V 2 = (I − V1V>1 )V2R

−1, and the last step follows

by triangle inequality.

Furthermore, we have,

α + β

= ‖(I − V 2V>2 − V 1V

>1 )V2R

−1‖+ ‖(I − V 2V>2 )‖‖R−1‖‖V 1V

>1 − V1V

>1 ‖

≤ ‖(I − V 2V>2 )V2R

−1‖+ ‖V 1V>1 V2R

−1‖+ ‖(I − V 2V>2 )‖‖R−1‖‖V 1V

>1 − V1V

>1 ‖

≤ ‖V 2V>2 − V2V

>2 ‖‖R−1‖+ ‖V 1V

>1 V2‖‖R−1‖+ ‖(I − V 2V

>2 )‖‖R−1‖‖V 1V

>1 − V1V

>1 ‖

= ‖V 2V>2 − V2V

>2 ‖‖R−1‖+ ‖V 1V

>1 V2‖‖R−1‖+ ‖R−1‖‖V 1V

>1 − V1V

>1 ‖

≤ ‖V 2V>2 − V2V

>2 ‖‖R−1‖+ ‖(I − V 2V

>2 )V2‖‖R−1‖+ ‖R−1‖‖V 1V

>1 − V1V

>1 ‖

≤ (2‖V 2V>2 − V2V

>2 ‖+ ‖V 1V

>1 − V1V

>1 ‖)‖R−1‖

≤ 3ε‖R−1‖

≤ 6ε,

where the first step follows by definition of α, β, the second step follows by triangle

inequality, the third step follows by ‖AB‖ ≤ ‖A‖‖B‖, the fourth step follows by

236

Page 253: Copyright by Kai Zhong 2018

‖(I − V 2V>2 )‖ = 1, the fifth step follows by Eq. (B.29), the sixth step follows by

Fact 2.3.2, the seventh step follows by ‖V 1V>1 −V1V

>1 ‖ ≤ ε and ‖V 2V

>2 −V2V

>2 ‖ ≤

ε, and the last step follows by ‖R−1‖ ≤ 2 (Claim B.4.1).

Finally,

‖UU> − V V >‖ ≤ ‖UU> − V V>‖+ ‖V V

> − V V >‖

= ‖(I − UU>)V ‖+ ‖V V> − V V >‖

≤ ‖P2 − P2‖/σk(P2) + ‖V V> − V V >‖

≤ ‖P2 − P2‖/σk(P2) + ‖V 1V>1 − V1V

>1 ‖+ ‖V 2V

>2 − V2V

>2 ‖

≤ ‖P2 − P2‖/σk(P2) + 2ε,

where the first step follows by triangle inequality, the second step follows by Fact 2.3.2,

the third step follows by Eq. (B.30), the fourth step follows by triangle inequality,

and the last step follows by ‖V 1V>1 − V1V

>1 ‖ ≤ ε and ‖V 2V

>2 − V2V

>2 ‖ ≤ ε.

Therefore we finish the proof.

It remains to prove Claim B.4.1.

Claim B.4.1. σk(R) ≥ 1/2.

Proof. First, we can rewrite R>R in the follow way,

R>R = V >2 (I − V1V

>1 )V2 = I − V >

2 V1V>1 V2

237

Page 254: Copyright by Kai Zhong 2018

Second, we can upper bound ‖V >2 V1‖ by 1/4,

‖V >2 V1‖ = ‖V2V

>2 V1‖

≤ ‖(V2V>2 − V 2V

>2 )V1‖+ ‖V 2V

>2 V1‖

≤ ‖(V2V>2 − V 2V

>2 )V1‖+ ‖V

>2 (V1V

>1 − V 1V

>1 )‖+ ‖V

>2 V 1V

>1 ‖

= ‖(V2V>2 − V 2V

>2 )V1‖+ ‖V

>2 (V1V

>1 − V 1V

>1 )‖

≤ ‖V2V>2 − V 2V

>2 ‖ · ‖V1‖+ ‖V

>2 ‖ · ‖V1V

>1 − V 1V

>1 ‖

≤ ε+ ε

≤ 1/4,

where the first step follows by V >2 V2 = I , the second step follows by triangle

inequality, the third step follows by triangle inequality, the fourth step follows by

‖V >2 V 1V

>1 ‖ = 0, the fifth step follows by ‖AB‖ ≤ ‖A‖ · ‖B‖, and the last step

follows by ‖V 1V>1 − V1V

>1 ‖ ≤ ε, ‖V1‖ = 1, ‖V 2V

>2 − V2V

>2 ‖ ≤ ε and ‖V >

2 ‖ = 1,

and the last step follows by ε < 1/8.

Thus, we can lower bound σ2k(R),

σ2k(R) = λmin(R

>R)

= min‖a‖=1

a>R>Ra

= min‖a‖=1

a>Ia− ‖V >2 V1a‖2

= 1− max‖a‖=1

‖V >2 V1a‖2

= 1− ‖V >2 V1‖2

≥ 3/4

which implies σk(R) ≥ 1/2.

238

Page 255: Copyright by Kai Zhong 2018

B.4.4 Error Bound for the Reduced Third-order Moment

Error Bound for the Reduced Third-order Moment in Different Cases

Lemma B.4.6. Let M3 be defined as in Definition 4.4.1. Let M3 be the empirical

version of M3, i.e.,

M3 =1

|S|∑

(x,y)∈S

y · (x⊗3 − x⊗I),

where S denote a set of samples (where each sample is i.i.d. sampled from Dis-

tribution D defined in Eq. (4.1)). Assume M3 6= 0, i.e., m3,i 6= 0 for any i. Let

V ∈ Rd×k be an orthogonal matrix satisfying ‖UU> − V V >‖ ≤ 1/4, where U is

the orthogonal basis of spanw∗1, w

∗2, · · · , w∗

k. Then for any 0 < ε < 1, t ≥ 1, if

|S| ≥ maxi∈[k]

(‖w∗i ‖p+1/|m3,i|2 + 1) · ε−2 · k2poly(log d, t)

with probability at least 1− 1/dt,

‖M3(V, V, V )− M3(V, V, V )‖ ≤ εk∑

i=1

|v∗im3,i|.

Proof. Since y =∑k

i=1 v∗i φ(w

∗>i x). We consider each component i ∈ [k]. We

define function Ti(x) : Rd → Rk×k×k such that,

Ti(x) = φ(w∗>i x) · ((V >x)⊗3 − (V >x)⊗I).

We flatten tensor Ti(x) along the first dimension into matrix Bi(x) ∈ Rk×k2 . Define

g(z) = φ(z)−φ(0), then |g(z)| = |∫ z

0φ′(s)ds| ≤ L1/(p+1)|z|p+1, which follows

Property 1.

239

Page 256: Copyright by Kai Zhong 2018

(I) Bounding ‖Bi(x)‖.

‖Bi(x)‖ ≤ |φ(w∗>i x)| · (‖V >x‖3 + 3k‖V >x‖)

. (|w∗>i x|p+1 + |φ(0)|) · (‖V >x‖3 + 3k‖V >xj‖)

Note that V >x ∼ N(0, Ik). According to Fact 2.4.2 and Fact 2.4.3, we have for any

constant t ≥ 1, with probability 1− 1/(ndt),

‖Bi(x)‖ . (‖w∗i ‖p+1 + |φ(0)|)k3/2poly(log d, t)

(II) Bounding ‖Ex∼Dd[Bi(x)]‖.

Note that Ex∼Dd[Bi(x)] = m3,i(V

>w∗i )vec((V >w∗

i )(V>w∗

i )>)>. There-

fore, ‖Ex∼Dd[Bi(x)]‖ = |m3,i|‖V >w∗

i ‖3. Since ‖V V >−UU>‖ ≤ 1/4, ‖V V >w∗i−

w∗i ‖ ≤ 1/4 and 3/4 ≤ ‖V >w∗

i ‖ ≤ 5/4. So 14|m3,i| ≤ ‖B‖ ≤ 2|m3,i|.

(III) Bounding max(Ex∼Dd‖Bi(x)Bi(x)

>‖,Ex∼Dd‖Bi(x)

>Bi(x)‖).

∥∥Ex∼Dd[Bi(x)Bi(x)

>]∥∥ .

(Ex∼Dd

[φ(w∗>i x)4]

)1/2(Ex∼Dd[‖V >x‖6]

)1/2. (‖w∗

i ‖p+1 + |φ(0)|)2k3/2.

∥∥Ex∼Dd[Bi(x)

>Bi(x)]∥∥

.(Ex∼Dd

[φ(w∗>i x)4]

)1/2(Ex∼Dd[‖V >x‖4]

)1/2max

‖A‖F=1

(Ex∼Dd

[〈A, (V >x)(V >x)>〉4])1/2

. (‖w∗i ‖p+1 + |φ(0)|)2k2.

240

Page 257: Copyright by Kai Zhong 2018

(IV) Bounding max‖a‖=‖b‖=1(Ex∼Dd[(a>Bi(x)b)

2]).

max‖a‖=‖b‖=1

(Ex∼Dd

[(a>Bi(x)b)2])1/2

.(Ex∼Dd

[(φ(w∗>i x))4]

)1/4max‖a‖=1

(Ex∼Dd

[(a>V >x)4])1/2

max‖A‖F=1

(Ex∼Dd

[〈A, (V >x)(V >x)>〉4])1/2

. (‖w∗i ‖p+1 + |φ(0)|)k

Define L = ‖w∗i ‖p+1 + |φ(0)|. Then we have for any 0 < ε < 1, if

|S| & L2k2 + |m3,i|2 + k3/2poly(log d, t)|m3,i|εε2|m3,i|2

t log(k)

with probability at least 1− k−t,∥∥∥∥∥Ex∼Dd[Bi(x)]−

1

|S|∑x∈S

Bi(x)

∥∥∥∥∥ ≤ ε|m3,i|.

We can set t = T log(d)/ log(k), then if

|S| ≥ ε−2(1 + 1/|m3,i|2)poly(T, log d)

with probability at least 1− d−T ,∥∥∥∥∥Ex∼Dd[Bi(x)]−

1

|S|∑x∈S

Bi(x)

∥∥∥∥∥ ≤ ε|m3,i|.

Also note that for any symmetric 3rd-order tensor R, the operator norm of R,

‖R‖ = max‖a‖=1

|R(a, a, a)| ≤ max‖a‖=1

‖R(a, I, I)‖F = ‖R(1)‖.

241

Page 258: Copyright by Kai Zhong 2018

Lemma B.4.7. Let M4 be defined as in Definition 4.4.1. Let M4 be the empirical

version of M4, i.e.,

M4 =1

|S|∑

(x,y)∈S

y ·(x⊗4 − (x⊗ x)⊗I + I⊗I

),

where S is a set of samples (where each sample is i.i.d. sampled from Distribution D

defined in Eq. (4.1)). Assume M4 6= 0, i.e., m4,i 6= 0 for any i. Let α be a fixed unit

vector. Let V ∈ Rd×k be an orthogonal matrix satisfying ‖UU> − V V >‖ ≤ 1/4,

where U is the orthogonal basis of spanw∗1, w

∗2, · · · , w∗

k. Then for any 0 < ε <

1, t ≥ 1, if

|S| ≥ maxi∈[k]

(1 + ‖w∗i ‖p+1/|m4,i(α

>w∗i )|2) · ε−2 · k2poly(log d, t)

with probability at least 1− d−t,

‖M4(V, V, V, α)− M4(V, V, V, α)‖ ≤ εk∑

i=1

|v∗im4,i(α>w∗

i )|.

Proof. Recall that for each (x, y) ∈ S, we have y =∑k

i=1 v∗i φ(w

∗>i x). We consider

each component i ∈ [k]. We define function r(x) : Rd → Rk such that

r(x) = V >x.

Define function Ti(x) : Rd → Rk×k×k such that

Ti(x) =φ(w∗>i x)

(x>α · r(x)⊗ r(x)⊗ r(x)− (V >α)⊗(r(x)⊗ r(x))

−α>x · r(x)⊗I + (V >α)⊗I).

242

Page 259: Copyright by Kai Zhong 2018

We flatten Ti(x) : Rd → Rk×k×k along the first dimension to obtain function

Bi(x) : Rd → Rk×k2 . Define g(z) = φ(z) − φ(0), then |g(z)| = |∫ z

0φ′(s)ds| ≤

L1/(p+ 1)|z|p+1, which follows Property 1.

(I) Bounding ‖Bi(x)‖.

‖Bi(x)‖ .(|w∗>i x|p+1 + |φ(0)|) · (|(x>α)|‖V >x‖3 + 3‖V >α‖‖V >x‖2

+ 3|(x>α)|‖V >xj‖√k + 3‖V >α‖

√k)

Note that V >x ∼ N(0, Ik). According to Fact 2.4.2 and Fact 2.4.3, we have for any

constant t ≥ 1, with probability 1− 1/(ndt),

‖Bi(x)‖ . (‖w∗i ‖p+1 + |φ(0)|)k3/2poly(log d, t)

(II) Bounding ‖Ex∼Dd[Bi(x)]‖.

Note that Ex∼Dd[Bi(x)] = m4,i(α

>w∗i )(V

>w∗i )vec((V >w∗

i )(V>w∗

i )>)>. There-

fore,

‖Ex∼Dd[Bi(x)]‖ = |m4,i(α

>w∗i )|‖V >w∗

i ‖3.

Since ‖V V > − UU>‖ ≤ 1/4, ‖V V >w∗i − w∗

i ‖ ≤ 1/4 and 3/4 ≤ ‖V >w∗i ‖ ≤ 5/4.

So 14|m4,i(α

>w∗i )| ≤ ‖Ex∼Dd

[Bi(x)]‖ ≤ 2|m4,i(α>w∗

i )|.

(III) Bounding max(Ex∼Dd[Bi(x)Bi(x)

>],Ex∼Dd[Bi(x)

>Bi(x)]).

∥∥Ex∼Dd[Bi(x)Bi(x)

>]∥∥ .

(Ex∼Dd

[φ(w∗>

i x)4])1/2(Ex∼Dd

[(α>x)4

])1/2(Ex∼Dd

[‖V >x‖6

])1/2. (‖w∗

i ‖p+1 + |φ(0)|)2k3/2.

243

Page 260: Copyright by Kai Zhong 2018

∥∥Ex∼Dd[Bi(x)

>Bi(x)]∥∥

.(Ex∼Dd

[φ(w∗>i x)4]

)1/2(Ex∼Dd[(α>x)4]

)1/2(Ex∼Dd[‖V >x‖4]

)1/2·(

max‖A‖F=1

Ex∼Dd

[〈A, (V >x)(V >x)>〉4

])1/2

. (‖w∗i ‖p+1 + |φ(0)|)2k2.

(IV) Bounding max‖a‖=‖b‖=1(Ex∼Dd

[(a>Bi(x)b)

2])1/2.

max‖a‖=‖b‖=1

(Ex∼Dd

[(a>Bi(x)b)

2])1/2

.(Ex∼Dd

[φ4(w∗>i x)]

)1/4(Ex∼Dd

[(α>x)4

])1/4max‖a‖=1

(Ex∼Dd

[(a>V >x)4

])1/2· max‖A‖F=1

(Ex∼Dd

[〈A, (V >x)(V >x)>〉4

])1/2. (‖w∗

i ‖p+1 + |φ(0)|)k.

Define L = ‖w∗i ‖p+1 + |φ(0)|. Then we have for any 0 < ε < 1, if

|S| ≥ L2k2 + |m4,i(α>w∗

i )|2 + k3/2poly(t, log d)|m4,i(α>w∗

i )|εε2(m4,i(α>w∗

i ))2

· t log k

with probability at least 1− k−t,∥∥∥∥∥Ex∼Dd[Bi(x)]−

1

|S|

n∑x∈S

Bi(x)

∥∥∥∥∥ ≤ ε|m4,i(α>w∗

i )|. (B.31)

We can set t = T log(d)/ log(k), then if

|S| ≥ (L+ |m4,i(α>w∗

i )|)2k2poly(T, log d)ε2|m4,i(α>w∗

i )|2· T log2 d

with probability at least 1−d−T , Eq. (B.31) holds. Also note that for any symmetric

3rd-order tensor R, the operator norm of R,

‖R‖ = max‖a‖=1

|R(a, a, a)| ≤ max‖a‖=1

‖R(a, I, I)‖F = ‖R(1)‖.

244

Page 261: Copyright by Kai Zhong 2018

Final Error Bound for the Reduced Third-order Moment

Lemma B.4.8 shows R3 can approximate R3 to some small precision with

poly(k) samples.

Lemma B.4.8 (Estimation of the reduced third order moment). Let U ∈ Rd×k de-

note the orthogonal column span of W ∗. Let α be a fixed unit vector and V ∈ Rd×k

denote an orthogonal matrix satisfying ‖V V > − UU>‖ ≤ 1/4. Define R3 :=

P3(V, V, V ), where P3 is defined as in Definition 4.4.2 using α. Let R3 be the

empirical version of R3 using dataset S, where each sample of S is i.i.d. sam-

pled from distribution D(defined in (4.1)). Assume the activation function satisfies

Property 1 and Assumption 4.4.1. Then for any 0 < ε < 1 and t ≥ 1, define

j3 = minj ≥ 3|Mj 6= 0 and m0 = mini∈[k](m(j3)i (α>w∗

i )j3−3)2, if

|S| ≥ σ2p+21 · k2 · poly(log d, t)/(ε2m0)

then we have ,

‖R3 − R3‖ ≤ εk∑

i=1

|v∗imj3,i(w∗>i α)j3−3|,

holds with probability at least 1− d−Ω(t).

Proof. The main idea is to use matrix Bernstein bound after matricizing the third-

order tensor. Similar to the proof of Lemma B.4.4, we consider each node compo-

nent individually and then sum up the errors and apply union bound.

We have shown the bound for j3 = 3, 4 in Lemma B.4.6 and Lemma B.4.7

respectively. To summarize, for any 0 < ε < 1 we have if

|S| ≥ maxi∈[k]

(1 + ‖w∗

i ‖p+1/|mj3,i(w∗>i α)(j3−3)|2

)· ε−2 · k2poly(log d, t)

245

Page 262: Copyright by Kai Zhong 2018

with probability at least 1− d−t,

‖R3 − R3‖ ≤ ε

k∑i=1

|v∗imj3,i(w∗>i α)j3−3|.

B.4.5 Error Bound for the Magnitude and Sign of the Weight Vectors

The lemmata in this section together with Lemma B.4.4 provide guaran-

tees for Algorithm B.4.2. In particular, Lemma B.4.10 shows with linear sam-

ple complexity in d, we can approximate the 1st-order moment to some precision.

And Lemma B.4.11 and Lemma B.4.12 provide the error bounds of linear systems

Eq. (B.25) under some perturbations.

Robustness for Solving Linear Systems

Lemma B.4.9 (Robustness of linear system). Given two matrices A, A ∈ Rd×k, and

two vectors b, b ∈ Rd. Let z∗ = argminz∈Rk ‖Az − b‖ and z = argminz∈Rk ‖(A +

A)z − (b+ b)‖. If ‖A‖ ≤ 14κσk(A) and ‖b‖ ≤ 1

4‖b‖, then, we have

‖z∗ − z‖ .(σ−4k (A)σ2

1(A) + σ−2k (A))‖b‖‖A‖+ σ−2

k (A)σ1(A)‖b‖.

Proof. By definition of z and z, we can rewrite z and z,

z = A†b = (A>A)−1A>b

z = (A+ A)†(b+ b) = ((A+ A)>(A+ A))−1(A+ A)>(b+ b).

As ‖A‖ ≤ 14κσk(A), we have ‖A>A + A>A‖‖(A>A)−1‖ ≤ 1/4. Together with

246

Page 263: Copyright by Kai Zhong 2018

‖b‖ ≤ 14‖b‖, we can ignore the high-order errors. So we have

‖z − z∗‖

. ‖(A>A)−1(A>b+ A>b) + (A>A)−1(A>A+ A>A)(A>A)−1A>b‖

. ‖(A>A)−1‖(‖A‖‖b‖+ ‖A‖‖b‖) + ‖(A>A)−2‖ · ‖A‖‖A‖‖A‖‖b‖

. σ−2k (A)(‖A‖‖b‖+ σ1(A)‖b‖) + σ−4

k (A) · σ21(A)‖A‖‖b‖.

Error Bound for the First-order Moment

Lemma B.4.10 (Error bound for the first-order moment). Let Q1 be defined as in

Eq. (B.22) and Q1 be the empirical version of Q1 using dataset S, where each

sample of S is i.i.d. sampled from distribution D(defined in (4.1)). Assume the acti-

vation function satisfies Property 1 and Assumption 4.4.1. Then for any 0 < ε < 1

and t ≥ 1, define j1 = minj ≥ 1|Mj 6= 0 and m0 = mini∈[k](mj1,i(w∗>i α)j1−1)2

if

|S| ≥ σ2p+21 dpoly(t, log d)/(ε2m0)

we have with probability at least 1− d−Ω(t),

‖Q1 − Q1‖ ≤ εk∑

i=1

|v∗imj1,i(w∗>i α)j1−1|.

Proof. We consider the case when l1 = 3, i.e.,

Q1 = M3(I, α, α) =k∑

i=1

v∗im3,i(α>w∗

i )3w∗

i .

And other cases are similar.

247

Page 264: Copyright by Kai Zhong 2018

Since y =∑k

i=1 v∗i φ(w

∗>i x). We consider each component i ∈ [k].

Define function Bi(x) : Rd → Rd such that

Bi(x) = [φ(w∗>i x) · (x⊗3 − x⊗I)](I, α, α) = φ(w∗>

i x) · ((x>α)2x− 2(x>α)α− x).

Define g(z) = φ(z)−φ(0), then |g(z)| = |∫ z

0φ′(s)ds| ≤ L1/(p+1)|z|p+1,

which follows Property 1.

(I) Bounding ‖Bi(x)‖.

‖Bi(x)‖ ≤ |φ(w∗>i x)| · ‖((x>α)2x− 2α>xα− x)‖

≤ (|w∗>i x|p+1 + |φ(0)|)(((x>α)2 + 1)‖x‖+ 2|α>x|)

According to Fact 2.4.2 and Fact 2.4.3, we have for any constant t ≥ 1, with prob-

ability 1− 1/(ndt),

‖Bi(x)‖ . (‖w∗i ‖p+1 + |φ(0)|)

√dpoly(log d, t)

(II) Bounding ‖Ex∼Dd[Bi(x)]‖.

Note that Ex∼Dd[Bi(x)] = m3,i(w

∗>i α)2w∗

i . Therefore, ‖Ex∼Dd[Bi(x)]‖ =

|m3,i(w∗>i α)2|.

(III) Bounding max(Ex∼Dd‖Bi(x)Bi(x)

>‖,Ex∼Dd‖Bi(x)

>Bi(x)‖).

∥∥Ex∼Dd

[Bi(x)

>Bi(x)]∥∥ .

(Ex∼Dd

[φ(w∗>

i x)4])1/2(Ex∼Dd

[(x>α)8

])1/2(Ex∼Dd

[‖x‖4

])1/2. (‖w∗

i ‖p+1 + |φ(0)|)2d.

248

Page 265: Copyright by Kai Zhong 2018

∥∥Ex∼Dd

[Bi(x)Bi(x)

>]∥∥ .(Ex∼Dd

[φ(w∗>

i x)4])1/2(Ex∼Dd

[(x>α)8

])1/2(max‖a‖=1

Ex∼Dd

[(x>a)4

])1/2

. (‖w∗i ‖p+1 + |φ(0)|)2.

(IV) Bounding max‖a‖=‖b‖=1(Ex∼Dd[(a>Bi(x)b)

2]).

max‖a‖=1

(Ex∼Dd

[(a>Bi(x)a)

2])1/2

.(Ex∼Dd

[φ4(w∗>

i x)])1/4

. ‖w∗i ‖p+1 + |φ(0)|.

Define L = ‖w∗i ‖p+1 + |φ(0)|. Then we have for any 0 < ε < 1, if

|S| & L2d+ |m3,i(w∗>i α)2|2 + L|m3,i(w

∗>i α)2|

√dpoly(log d, t)ε

ε2|m3,i(w∗>i α)2|2

· t log d

with probability at least 1− 1/dt,∥∥∥∥∥Ex∼Dd[Bi(x)]−

1

|S|

n∑x∼S

Bi(x)

∥∥∥∥∥ ≤ ε|m3,i(w∗>i α)2|.

Summing up all k components, we obtain if

|S| ≥ maxi∈[k]

(‖w∗

i ‖p+1 + |φ(0)|+ |m3,i(w∗>i α)2|)2

|m3,i(w∗>i α)2|2

· ε−2dpoly(log d, t)

with probability at least 1− 1/dt,

‖M3(I, α, α)− M3(I, α, α)‖ ≤ ε

k∑i=1

|v∗im3,i(w∗>i α)2|.

Other cases (j1 = 1, 2, 4) are similar, so we complete the proof.

Linear System for the First-order MomentThe following lemma provides

estimation error bound for the first linear system in Eq. (B.25).

249

Page 266: Copyright by Kai Zhong 2018

Lemma B.4.11 (Solution of linear system for the first order moment). Let U ∈

Rd×k be the orthogonal column span of W ∗. Let V ∈ Rd×k denote an orthogonal

matrix satisfying that ‖V V > − UU>‖ ≤ δ2 . 1/(κ2√k). For each i ∈ [k], let ui

denote the vector satisfying ‖ui − V >w∗i ‖ ≤ δ3 . 1/(κ2

√k). Let Q1 be defined

as in Eq. (B.22) and Q1 be the empirical version of Q1 such that ‖Q1 − Q1‖ ≤

δ4‖Q1‖ ≤ 14‖Q1‖. Let z∗ ∈ Rk and z ∈ Rk be defined as in Eq. (B.24) and

Eq. (B.25). Then

|zi − z∗i | ≤ (κ4k3/2(δ2 + δ3) + κ2k1/2δ4)‖z∗‖1.

Proof. Let A ∈ Rk×k denote the matrix where the i-th column is siw∗i . Let A ∈

Rk×k denote the matrix where the i-th column is V ui. Let b ∈ Rk denote the vector

Q1, let b denote the vector Q1 −Q1. Then we have

‖A‖ ≤√k.

Using Fact 2.3.1, we can lower bound σk(A),

σk(A) ≥ 1/κ.

We can upper bound ‖A‖ in the following way,

‖A‖ ≤√kmax

i∈[k]‖V ui − siw

∗i ‖

≤√kmax

i∈[k]‖V ui − siV V >w∗

i + siV V >w∗i − siUU>w∗

i ‖

≤√k(δ3 + δ2).

250

Page 267: Copyright by Kai Zhong 2018

We can upper bound ‖b‖ and ‖b‖,

‖b‖ = ‖Q1‖ ≤ k

k∑i=1

|z∗i |, and ‖b‖ ≤ δ4‖Q1‖.

To apply Lemma B.4.9, we need δ4 ≤ 1/4 and δ2 . 1/(√kκ2), δ3 . 1/(

√kκ2).

So we have,

‖zi − z∗i ‖ ≤ (κ4k3/2(δ2 + δ3) + κ2k1/2δ4)‖Q1‖

≤ (κ4k3/2(δ2 + δ3) + κ2k1/2δ4)k∑

i=1

|z∗i |.

Linear System for the Second-order Moment

The following lemma provides estimation error bound for the second linear

system in Eq. (B.25).

Lemma B.4.12 (Solution of linear system for the second order moment). Let U ∈

Rd×k be the orthogonal column span of W ∗ denote an orthogonal matrix satisfying

that ‖V V > − UU>‖ ≤ δ2 . 1/(κ√k). For each i ∈ [k], let ui denote the vector

satisfying ‖ui − V >w∗i ‖ ≤ δ3 . 1/(

√kκ3). Let Q2 be defined as in Eq. (B.23)

and Q2 be the estimation of Q2 such that ‖Q2 − Q2‖F ≤ δ4‖Q2‖F ≤ 14‖Q2‖F . Let

r∗ ∈ Rk and r ∈ Rk be defined as in Eq. (B.24) and Eq. (B.25). Then

|ri − r∗i | ≤ (k3κ8δ3 + κ2k2δ4)‖r∗‖.

Proof. For each i ∈ [k], let ui = V >w∗i . Let A ∈ Rk2×k denote the matrix where

the i-th column is vec(uiu>i ). Let A ∈ Rk2×k denote the matrix where the i-th

251

Page 268: Copyright by Kai Zhong 2018

column is vec(uiu>i − uiu

>i ). Let b ∈ Rk2 denote the vector vex(Q2), let b ∈ Rk2

denote the vector vec(Q2 − Q2).

Let be the element-wise matrix product (a.k.a. Hadamard product), W =

[w∗1 w∗

2 · · · w∗k] and U = [u1 u2 · · · uk] = V >W . We can upper bound ‖A‖ and

‖A‖ as follows,

‖A‖ = max‖x‖=1

∥∥∥∥∥k∑

i=1

xivec(uiu>i )

∥∥∥∥∥= max

‖x‖=1‖U diag(x)U>‖F

≤ ‖U‖2

≤ σ21(V

>W ),

and

‖A‖ =√kmax

i∈[k]‖Ai‖

≤√kmax

i∈[k]‖uiu

>i − uiu

>i ‖F

≤√kmax

i∈[k]2‖ui − ui‖2

≤ 2√kδ3.

We can lower bound σk(A),

σk(A) =√

σk(A>A)

=√

σk((U>U) (U>U))

= min‖x‖=1

√x>((U>U) (U>U))x

= min‖x‖=1

‖(U>U)1/2 diag(x)(U>U)1/2‖F

≥ σ2k(V

>W )

252

Page 269: Copyright by Kai Zhong 2018

where fourth step follows Schur product theorem, the last step follows by the fact

that ‖CB‖F ≥ σmin(C)‖B‖F and is the element-wise multiplication of two ma-

trices.

We can upper bound ‖b‖ and ‖b‖,

‖b‖ ≤‖Q2‖F ≤ ‖r∗‖,

‖b‖ =‖Q2 − Q2‖F ≤ δ4‖r∗‖.

Since ‖V V >W −W‖ ≤√kδ2, we have for any x ∈ Rk,

‖V V >Wx‖ ≥ ‖Wx‖ − ‖(V V >W −W )x‖

≥ σk(W )‖x‖ − δ2√k‖x‖

Note that according to Fact 2.3.1, σk(W ) ≥ 1/κ. Therefore, if δ2 ≤

1/(2κ√k), we will have σk(V

>W ) ≥ 1/(2κ). Similarly, we have σ1(V>W ) ≤

‖V ‖‖W‖ ≤√k. Then applying Lemma B.4.9 and setting δ2 . 1√

kκ3 , we complete

the proof.

253

Page 270: Copyright by Kai Zhong 2018

Appendix C

One-hidden-layer Convolutional Neural Networks

C.1 Proof Overview

In this section, we briefly give the proof sketch for the local strong con-

vexity. The main idea is first to bound the range of the eigenvalues of the popula-

tion Hessian ∇2fD(W∗) and then bound the spectral norm of the remaining error,

‖∇2fS(W ) − ∇2fD(W∗)‖. The later can be bounded by mainly applying matrix

Bernstein inequality and Property 1, 3 carefully. In Sec. C.1.1, we show that when

Property 4 is satisfied, ∇2fD(W∗) for orthogonal W ∗ with k = t can be lower

bounded. Sec. C.1.2 shows how to reduce the case of a non-orthogonal W ∗ with

k ≥ t to the orthogonal case with k = t. The upper bound is relatively easier, so

we leave those proofs in Appendix C.3.

C.1.1 Orthogonal weight matrices for the population case

In this section, we consider a special case when t = k and W ∗ is orthogonal

to illustrate how we prove PD-ness of Hessian. Without loss of generality, we set

W ∗ = Ik. Let[x>1 x>

2 · · · x>r

]> denote vector x ∈ Rd, where xi = Pix ∈ Rk,

for each i ∈ [r]. Let xij denote the j-th entry of xi. Thus, we can rewrite the second

254

Page 271: Copyright by Kai Zhong 2018

partial derivative of fD(W ∗) with respect to wj and wl as,

∂2fD(W∗)

∂wj∂wl

= E(x,y)∼D

( r∑i=1

φ′(xij)xi

)(r∑

i=1

φ′(xil)xi

)>.

Let a ∈ Rk2 denote vector[a>1 a>2 · · · a>k

]> for ai ∈ Rk, i ∈ [r]. The Hessiancan be lower bounded by

λmin(∇2f(W ∗))

≥ min‖a‖=1

a>∇2f(W ∗)a

= min‖a‖=1

Ex∼Dd

k∑j=1

r∑i=1

a>j xi · φ′(xij)

2 (C.1)

≥ r · min‖a‖=1

Eu∼Dk

k∑j=1

a>j(uφ′(uj)− Eu∼Dk

[uφ′(uj)])2. (C.2)

The last formulation Eq. (C.2) has a unit independent element uj in φ′(·), thus can

be calculated explicitly by defining some quantities. In particular, we can obtain

the following lower bounded for Eq. (C.2).

Lemma C.1.1 (Informal version of Lemma C.3.2). Let D1 denote Gaussian distri-

bution N(0, 1). Let

α0 = Ez∼D1 [φ′(z)],

α1 = Ez∼D1 [φ′(z)z],

α2 = Ez∼D1 [φ′(z)z2],

β0 = Ez∼D1 [φ′2(z)],

β2 = Ez∼D1 [φ′2(z)z2].

255

Page 272: Copyright by Kai Zhong 2018

Let ρ denote

min(β0 − α20 − α2

1), (β2 − α21 − α2

2).

For any positive integer k, let A =[a1 a2 · · · ak

]∈ Rk×k. Then we have,

Eu∼Dk

k∑j=1

a>j(u · φ′(uj)− Eu∼Dk

[uφ′(uj)])2 ≥ ρ‖A‖2F . (C.3)

Note that the definition of ρ contains two elements of the definition of ρ(1)

in Property 4. Therefore, if ρ(1) > 0, we also have ρ > 0. More detailed proofs for

the orthogonal case can be found in Appendix C.3.1.

C.1.2 Non-orthogonal weight matrices for the population case

In this section, we show how to reduce the minimal eigenvalue problem with

a non-orthogonal weight matrix into a problem with an orthogonal weight matrix,

so that we can use the results in Sec. C.1.1 to lower bound the eigenvalues.

Let U ∈ Rk×t be the orthonormal basis of W ∗ ∈ Rk×t and let V = U>W ∗ ∈

Rt×t. We use U⊥ ∈ Rk×(k−t) to denote the complement of U . For any vector

aj ∈ Rk, there exist two vectors bj ∈ Rt and cj ∈ Rk−t such that

aj︸︷︷︸k×1

= U︸︷︷︸k×t

bj︸︷︷︸t×1

+ U⊥︸︷︷︸k×(k−t)

cj︸︷︷︸(k−t)×1

.

Let b ∈ Rt2 denote vector[b>1 b>2 · · · b>t

]> and let c ∈ R(k−t)t denote vector[c>1 c>2 · · · c>t

]>. Define g(w∗i ) = E

x∼Dk

[xφ′(w∗>

i x)].

256

Page 273: Copyright by Kai Zhong 2018

Similar to the steps in Eq. (C.1) and Eq. (C.2), we have

∇2fD(W∗)

r · min‖a‖=1

Ex∼Dk

( t∑i=1

a>i (xφ′(w∗>

i x)− g(w∗i )

)2Ikt

= r · min‖b‖=1,‖c‖=1

Ex∼Dk

[(

t∑i=1

(b>i U> + c>i U

>⊥ )·

(xφ′(w∗>i x)− g(w∗

i )))2]Ikt

r · (C1 + C2 + C3)Ikt,

where

C1 = min‖b‖=1

Ex∼Dk

( t∑i=1

b>i U> · (xφ′(w∗>

i x)− g(w∗i ))

)2,

C2 = min‖c‖=1

Ex∼Dk

( t∑i=1

c>i U>⊥ · (xφ′(w∗>

i x)− g(w∗i ))

)2,

C3 = min‖b‖=‖c‖=1

Ex∼Dk

[2

(t∑

i=1

b>i U> · (xφ′(w∗>

i x)− g(w∗i ))

)(

t∑i=1

c>i U>⊥ (xφ′(w∗>

i x)− g(w∗i ))

)].

Since g(w∗i ) ∝ w∗

i and U>⊥x is independent of φ′(w∗>

i x), we have C3 = 0. C1 can

be lower bounded by the orthogonal case with a loss of a condition number of W ∗,

257

Page 274: Copyright by Kai Zhong 2018

λ, as follows.

C1 ≥1

λEu∼Dt

[(

t∑i=1

σt · b>i V †>(uφ′(σt · ui)−

V >σ1(V†)g(w∗

i )))2]

≥ 1

λEu∼Dt

[(

t∑i=1

σt · b>i V †>(uφ′(σt · ui)−

Eu∼Dt [uφ′(σt · ui)]))

2].

The last formulation is the orthogonal weight case in Eq. (C.2) in Sec. C.1.1. So we

can lower bound it by Lemma C.1.1. The intermediate steps for the derivation of

the above inequalities and the lower bound for C2 can be found in Appendix C.3.1.

C.2 Properties of Activation Functions

Definition C.2.1. Let αq(σ) = Ez∼N(0,1)[φ′(σ · z)zq],∀q ∈ 0, 1, 2, and βq(σ) =

Ez∼N(0,1)[φ′2(σ · z)zq],∀q ∈ 0, 2. Let γq(σ) = Ez∼N(0,1)[φ(σ · z)zq], ∀q ∈

0, 1, 2, 3, 4.

Proposition C.2.1. ReLU φ(z) = maxz, 0, leaky ReLU φ(z) = maxz, 0.01z,

squared ReLU φ(z) = maxz, 02 and any non-linear non-decreasing smooth func-

tions with bounded symmetric φ′(z), like the sigmoid function φ(z) = 1/(1 + e−z),

the tanh function and the erf function φ(z) =∫ z

0e−t2dt, satisfy Property 1,4,3.

Proof. The difference between Property 4 and Property 2 is in the definition of

ρ(σ), for which Property 4 has an additional term α20. From Table B.1, we know

that ReLU, leaky ReLU, squared ReLU satisfy the condition that α0 > 0. For non-

linear non-decreasing smooth functions, α0 = 0 if and only if φ′(z) = 0 almost

258

Page 275: Copyright by Kai Zhong 2018

surely, since φ′(z) ≥ 0. Therefore, ρ(σ) > 0 for any smooth non-decreasing non-

linear activations with bounded symmetric first derivatives.

C.3 Positive Definiteness of Hessian near the Ground TruthC.3.1 Bounding the eigenvalues of Hessian

The goal of this section is to prove Lemma C.3.1.

Lemma C.3.1 (Positive Definiteness of Population Hessian at the Ground Truth).

If φ(z) satisfies Property 1,4 and 3, we have the following property for the second

derivative of function fD(W ) at W ∗ ∈ Rk×t,

Ω(rρ(σt)/(κ2λ))I ∇2fD(W

∗) O(tr2σ2p1 )I.

Proof. This follows by combining Lemma C.3.3 and Lemma C.3.4.

Lower bound for the orthogonal case

Lemma C.3.2 (Formal version of Lemma C.1.1). Let D1 denote Gaussian distri-

bution N(0, 1). Let α0 = Ez∼D1 [φ′(z)], α1 = Ez∼D1 [φ

′(z)z], α2 = Ez∼D1 [φ′(z)z2],

β0 = Ez∼D1 [φ′2(z)] ,β2 = Ez∼D1 [φ

′2(z)z2]. Let ρ denote min(β0−α20−α2

1), (β2−

α21 − α2

2). Let P =[p1 p2 · · · pk

]∈ Rk×k. Then we have,

Eu∼Dk

( k∑i=1

p>i (u · φ′(ui)− Eu∼Dk[uφ′(ui)])

)2 ≥ ρ‖P‖2F (C.4)

259

Page 276: Copyright by Kai Zhong 2018

Proof.

Eu∼Dk

( k∑i=1

p>i (u · φ′(ui)− Eu∼Dk[uφ′(ui)])

)2

= Eu∼Dk

( k∑i=1

p>i u · φ′(ui)

)2−( E

u∼Dk

[(k∑

i=1

p>i u · φ′(ui)

)])2

=k∑

i=1

k∑l=1

Eu∼Dk

[p>i (φ′(ul)φ

′(ui) · uu>)pl]−

(Eu∼Dk

[k∑

i=1

p>i eiuiφ′(ui)

])2

=k∑

i=1

Eu∼Dk

[p>i (φ

′(ui)2 · uu>)pi

]︸ ︷︷ ︸

A

+∑i 6=l

Eu∼Dk

[p>i (φ′(ul)φ

′(ui) · uu>)pl]︸ ︷︷ ︸B

(Eu∼Dk

[k∑

i=1

p>i eiuiφ′(ui)

])2

︸ ︷︷ ︸C

First, we can rewrite the term C in the following way,

C =

(Eu∼Dk

[k∑

i=1

p>i eiuiφ′(ui)

])2

=

(k∑

i=1

p>i eiEz∼D1 [φ′(z)z]

)2

= α21

(k∑

i=1

p>i ei

)2

= α21(diag(P )>1)2.

260

Page 277: Copyright by Kai Zhong 2018

Further, we can rewrite the diagonal term in the following way,

A =k∑

i=1

Eu∼Dk

[p>i (φ′(ui)

2 · uu>)pi]

=k∑

i=1

Eu∼Dk

[p>i

(φ′(ui)

2 ·

(u2i eie

>i +

∑j 6=i

uiuj(eie>j + eje

>i ) +

∑j 6=i

∑l 6=i

ujuleje>l

))pi

]

=k∑

i=1

Eu∼Dk

[p>i

(φ′(ui)

2 ·

(u2i eie

>i +

∑j 6=i

u2jeje

>j

))pi

]

=k∑

i=1

[p>i

(E

u∼Dk

[φ′(ui)2u2

i ]eie>i +

∑j 6=i

Eu∼Dk

[φ′(ui)2u2

j ]eje>j

)pi

]

=k∑

i=1

[p>i

(β2eie

>i +

∑j 6=i

β0eje>j

)pi

]

=k∑

i=1

p>i ((β2 − β0)eie>i + β0Ik)pi

= (β2 − β0)k∑

i=1

p>i eie>i pi + β0

k∑i=1

p>i pi

= (β2 − β0)‖ diag(P )‖2 + β0‖P‖2F ,

where the second step follows by rewriting uu> =k∑

i=1

k∑j=1

uiujeie>j , the third step

follows by

Eu∼Dk

[φ′(ui)2uiuj] = 0, ∀j 6= i and E

u∼Dk

[φ′(ui)2ujul] = 0, ∀j 6= l, the fourth

step follows by pushing expectation, the fifth step follows by Eu∼Dk

[φ′(ui)2u2

i ] = β2

and Eu∼Dk

[φ′(ui)2u2

j ] = Eu∼Dk

[φ′(ui)2] = β0, and the last step follows by

k∑i=1

p2i,i =

‖ diag(P )‖2 andk∑

i=1

p>i pi =k∑

i=1

‖pi‖2 = ‖P‖2F .

261

Page 278: Copyright by Kai Zhong 2018

We can rewrite the off-diagonal term in the following way,

B =∑i 6=l

Eu∼Dk

[p>i (φ′(ul)φ

′(ui) · uu>)pl]

=∑i 6=l

Eu∼Dk

[p>i

(φ′(ul)φ

′(ui) ·

(u2i eie

>i + u2

l ele>l + uiul(eie

>l + ele

>i ) +

∑j 6=l

uiujeie>j

+∑j 6=i

ujuleje>l +

∑j 6=i,l

∑j′ 6=i,l

ujuj′eje>j′

))pl

]

=∑i 6=l

Eu∼Dk

[p>i

(φ′(ul)φ

′(ui) ·

(u2i eie

>i + u2

l ele>l + uiul(eie

>l + ele

>i ) +

∑j 6=i,l

u2jeje

>j

))pl

]

=∑i 6=l

[p>i

(E

u∼Dk

[φ′(ul)φ′(ui)u

2i ]eie

>i + E

u∼Dk

[φ′(ul)φ′(ui)u

2l ]ele

>l

+ Eu∼Dk

[φ′(ul)φ′(ui)uiul](eie

>l + ele

>i ) +

∑j 6=i,l

Eu∼Dk

[φ′(ul)φ′(ui)u

2j ]eje

>j

)pl

]

=∑i 6=l

[p>i

(α0α2(eie

>i + ele

>l ) + α2

1(eie>l + ele

>i ) +

∑j 6=i,l

α20eje

>j

)pl

]=∑i 6=l

[p>i((α0α2 − α2

0)(eie>i + ele

>l ) + α2

1(eie>l + ele

>i ) + α2

0Ik)pl]

= (α0α2 − α20)∑i 6=l

p>i (eie>i + ele

>l )pl︸ ︷︷ ︸

B1

+α21

∑i 6=l

p>i (eie>l + ele

>i )pl︸ ︷︷ ︸

B2

+α20

∑i 6=l

p>i pl︸ ︷︷ ︸B3

,

where the third step follows by

Eu∼Dk

[φ′(ul)φ′(ui)uiuj] = 0,

and

Eu∼Dk

[φ′(ul)φ′(ui)uj′uj] = 0, ∀j′ 6= j.

262

Page 279: Copyright by Kai Zhong 2018

For the term B1, we have

B1 = (α0α2 − α20)∑i 6=l

p>i (eie>i + ele

>l )pl

= 2(α0α2 − α20)∑i 6=l

p>i eie>i pl

= 2(α0α2 − α20)

k∑i=1

p>i eie>i

(k∑

l=1

pl − pi

)

= 2(α0α2 − α20)

(k∑

i=1

p>i eie>i

k∑l=1

pl −k∑

i=1

p>i eie>i pi

)= 2(α0α2 − α2

0)(diag(P )> · P · 1− ‖ diag(P )‖2)

For the term B2, we have

B2 = α21

∑i 6=l

p>i (eie>l + ele

>i )pl

= α21

(∑i 6=l

p>i eie>l pl +

∑i 6=l

p>i ele>i pl

)

= α21

(k∑

i=1

k∑l=1

p>i eie>l pl −

k∑j=1

p>j eje>j pj +

k∑i=1

k∑l=1

p>i ele>i pl −

k∑j=1

p>j eje>j pj

)= α2

1((diag(P )>1)2 − ‖ diag(P )‖2 + 〈P, P>〉 − ‖ diag(P )‖2)

263

Page 280: Copyright by Kai Zhong 2018

For the term B3, we have

B3 = α20

∑i 6=l

p>i pl

= α20

(k∑

i=1

p>i

k∑l=1

pl −k∑

i=1

p>i pi

)

= α20

∥∥∥∥∥k∑

i=1

pi

∥∥∥∥∥2

−k∑

i=1

‖pi‖2

= α20(‖P · 1‖2 − ‖P‖2F )

Let diag(P ) denote a length k column vector where the i-th entry is the

264

Page 281: Copyright by Kai Zhong 2018

(i, i)-th entry of P ∈ Rk×k. Furthermore, we can show A+B − C is,

A+B − C

= A+B1 +B2 +B3 − C

= (β2 − β0)‖ diag(P )‖2 + β0‖P‖2F︸ ︷︷ ︸A

+2(α0α2 − α20)(diag(P )> · P · 1− ‖ diag(P )‖2)︸ ︷︷ ︸

B1

+ α21((diag(P )> · 1)2 − ‖ diag(P )‖2 + 〈P, P>〉 − ‖ diag(P )‖2)︸ ︷︷ ︸

B2

+α20(‖P · 1‖2 − ‖P‖2F )︸ ︷︷ ︸

B3

− α21(diag(P )> · 1)2︸ ︷︷ ︸

C

= ‖α0P · 1+ (α2 − α0) diag(P )‖2︸ ︷︷ ︸C1

+α21

2‖P + P> − 2 diag(diag(P ))‖2F︸ ︷︷ ︸

C2

+ (β0 − α20 − α2

1)‖P − diag(diag(P ))‖2F︸ ︷︷ ︸C3

+ (β2 − α21 − α2

2)‖ diag(P )‖2︸ ︷︷ ︸C4

≥ (β0 − α20 − α2

1)‖P − diag(diag(P ))‖2F + (β2 − α21 − α2

2)‖ diag(P )‖2

≥ min(β0 − α20 − α2

1), (β2 − α21 − α2

2) · (‖P − diag(diag(P ))‖2F + ‖ diag(P )‖2)

= min(β0 − α20 − α2

1), (β2 − α21 − α2

2) · (‖P − diag(diag(P ))‖2F + ‖ diag(diag(P ))‖2)

≥ min(β0 − α20 − α2

1), (β2 − α21 − α2

2) · ‖P‖2F

= ρ‖P‖2F ,

where the first step follows by B = B1 + B2 + B3, and the second step follows by

the definition of A,B1, B2, B3, C the third step follows by A+B1+B2+B3−C =

C1 + C2 + C3 + C4, the fourth step follows by C1, C2 ≥ 0, the fifth step follows

a ≥ min(a, b), the sixth step follows by ‖ diag(P )‖2 = ‖ diag(diag(P ))‖2F , the

seventh step follows by triangle inequality, and the last step follows the definition

of ρ.

265

Page 282: Copyright by Kai Zhong 2018

Claim C.3.1. A+B1 +B2 +B3 − C = C1 + C2 + C3 + C4.

Proof. The key properties we need are, for two vectors a, b, ‖a + b‖2 = ‖a‖2 +2〈a, b〉 + ‖b‖2; for two matrices A,B, ‖A + B‖2F = ‖A‖2F + 2〈A,B〉 + ‖B‖2F .Then, we have

C1 + C + C3 + C4 + C5

= (‖α0P · 1‖)2 + 2(α0α2 − α20)〈P · 1, diag(P )〉+ (α2 − α0)

2‖ diag(P )‖2︸ ︷︷ ︸C1

+α21(diag(P )> · 1)2︸ ︷︷ ︸

C

+α21

2(2‖P‖2F + 4‖ diag(diag(P ))‖2F + 2〈P, P>〉 − 4〈P, diag(diag(P ))〉 − 4〈P>, diag(diag(P ))〉)︸ ︷︷ ︸

C2

+ (β0 − α20 − α2

1)(‖P‖2F − 2〈P, diag(diag(P ))〉+ ‖ diag(diag(P ))‖2F )︸ ︷︷ ︸C3

+(β2 − α21 − α2

2)‖diag(P )‖2︸ ︷︷ ︸C4

= α20‖P · 1‖2 + 2(α0α2 − α2

0)〈P · 1,diag(P )〉+ (α2 − α0)2‖diag(P )‖2︸ ︷︷ ︸

C1

+α21(diag(P )> · 1)2︸ ︷︷ ︸

C

+α21

2(2‖P‖2F + 4‖ diag(P )‖2 + 2〈P, P>〉 − 8‖ diag(P )‖2)︸ ︷︷ ︸

C2

+ (β0 − α20 − α2

1)(‖P‖2F − 2‖diag(P )‖2 + ‖ diag(P )‖2)︸ ︷︷ ︸C3

+(β2 − α21 − α2

2)‖ diag(P )‖2︸ ︷︷ ︸C4

= α20‖P · 1‖2 + 2(α0α2 − α2

0) diag(P )> · P · 1+ α21(diag(P )> · 1)2 + α2

1〈P, P>〉+ (β0 − α2

0)‖P‖2F + ((α2 − α0)2 − 2α2

1 − β0 + α20 + α2

1 + β2 − α21 − α2

2)︸ ︷︷ ︸β2−β0−2(α2α0−α2

0+α21)

‖diag(P )‖2

= 0︸︷︷︸part of A

+2(α2α0 − α20) · diag(P )>P · 1︸ ︷︷ ︸part of B1

+α21 · ((diag(P )>1)2 + 〈P, P>〉)︸ ︷︷ ︸

part of B2

+α20 · ‖P · 1‖2︸ ︷︷ ︸

part of B3

+ (β0 − α20) · ‖P‖2F︸ ︷︷ ︸

proportional to ‖P‖2F

+(β2 − β0 − 2(α2α0 − α20 + α2

1)) · ‖ diag(P )‖2︸ ︷︷ ︸proportional to ‖ diag(P )‖2

= (β2 − β0)‖ diag(P )‖2 + β0‖P‖2F︸ ︷︷ ︸A

+2(α0α2 − α20)(diag(P )> · P · 1− ‖ diag(P )‖2)︸ ︷︷ ︸

B1

+ α21((diag(P )> · 1)2 − ‖ diag(P )‖2 + 〈P, P>〉 − ‖ diag(P )‖2)︸ ︷︷ ︸

B2

+α20(‖P · 1‖2 − ‖P‖2F )︸ ︷︷ ︸

B3

=A+B1 +B2 +B3

where the second step follows by 〈P, diag(diag(P ))〉 = ‖ diag(P )‖2 and ‖ diag(diag(P ))‖2F =

266

Page 283: Copyright by Kai Zhong 2018

‖ diag(P )‖2.

Lower bound on the eigenvalues of the population Hessian at the ground

truth

Lemma C.3.3. If φ(z) satisfies Property 1, 4, 3 we have

∇2fD(W∗) Ω(rρ(σt)/(κ

2λ)).

Proof. Let x ∈ Rd denote vector[x>1 x>

2 · · · x>r

]> where xi = Pix ∈ Rk, for

each i ∈ [r]. Thus, we can rewrite the partial gradient. For each j ∈ [t], the second

partial derivative of fD(W ) is

∂2fD(W∗)

∂w2j

= E(x,y)∼D

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>j xi)xi

)>

For each j, l,∈ [t] and j 6= l, the second partial derivative of fD(W ) with

respect to wj and wl can be represented as

∂2fD(W )

∂wj∂wl

= E(x,y)∼D

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>l xi)xi

)>

First we show the lower bound of the eigenvalues. The main idea is to

reduce the problem to a k-by-k problem and then lower bound the eigenvalues using

orthogonal weight matrices.

267

Page 284: Copyright by Kai Zhong 2018

Let a ∈ Rkt denote vector[a>1 a>2 · · · a>t

]>. The smallest eigenvalue of

the Hessian can be calculated by

∇2f(W ∗) min‖a‖=1

a>∇2f(W ∗)a Ikt

= min‖a‖=1

Ex∼Dd

( t∑j=1

r∑i=1

a>j xi · φ′(w∗>j xi)

)2 Ikt (C.5)

For each i ∈ [r], we define function hi(y) : Rk → R such that

hi(y) =t∑

j=1

a>j y · φ′(w∗>j y).

Then, we can analyze the smallest eigenvalue of the Hessian in the following way,

min‖a‖=1

Ex∼Dd

( t∑j=1

r∑i=1

a>j xi · φ′(w∗>j xi)

)2

= min‖a‖=1

Ex∼Dd

( r∑i=1

t∑j=1

a>j xi · φ′(w∗>j xi)

)2

= min‖a‖=1

Ex∼Dd

( r∑i=1

hi(xi)

)2

= min‖a‖=1

r∑i=1

Ex∼Dd

[h2i (xi)] +

r∑j 6=l

Ex∼Dd

[hj(xj)] Ex∼Dd

[hl(xl)]

= min‖a‖=1

r∑i=1

(Ex∼Dd

[h2i (xi)]− (Ex∼Dd

[hi(xi)])2)+( r∑

l=1

Ex∼Dd

[hl(xl)]

)2

≥ min‖a‖=1

r∑i=1

(Ex∼Dd

[h2i (xi)]− (Ex∼Dd

[hi(xi)])2)

= min‖a‖=1

r∑i=1

Ex∼Dd

[(hi(xi)− Ex∼Dd

[hi(xi)])2]

268

Page 285: Copyright by Kai Zhong 2018

Since min‖a‖=1

∑ri=1 fi(a) ≥

∑ri=1min‖a‖=1 fi(a). Thus, we only need to

consider one i ∈ [r],

min‖a‖=1

Ex∼Dd

[(hi(xi)− Ex∼Dd

[hi(xi)])2]

= min‖a‖=1

Ey∼Dk

[(hi(y)− Ey∼Dk

[hi(y)])2]

= min‖a‖=1

Ey∼Dk

( t∑j=1

a>j y · φ′(w∗>j y)− Ey∼Dk

[t∑

j=1

a>j y · φ′(w∗>j y)

])2

= min‖a‖=1

Ey∼Dk

( t∑j=1

a>j(yφ′(w∗>

j y)− Ey∼Dk[yφ′(w∗>

j y)]))2

≥ min

‖a‖=1Ey∼Dk

( t∑j=1

a>j(yφ′(w∗>

j y)− Ey∼Dk[yφ′(w∗>

j y)]))2

where the second step follows by definition of function hi(y),

We define function g(w) : Rk → Rk such that

g(w) = Ey∼Dk[φ′(w>y)y].

Then we have

min‖a‖=1

Ex∼Dd

[(hi(xi)− Ex∼Dd

[hi(xi)])2] ≥ min

‖a‖=1Ex∼Dk

( t∑j=1

a>j(xφ′(w∗>

j x)− g(w∗j )))2

.(C.6)

Let U ∈ Rk×t be the orthonormal basis of W ∗ ∈ Rk×t and let V =[v1 v2 · · · vt

]= U>W ∗ ∈ Rt×t. Also note that V and W ∗ have same sin-

gular values and W ∗ = UV . We use U⊥ ∈ Rk×(k−t) to denote the complement of

269

Page 286: Copyright by Kai Zhong 2018

U . For any vector aj ∈ Rk, there exist two vectors bj ∈ Rt and cj ∈ Rk−t such that

aj︸︷︷︸k×1

= U︸︷︷︸k×t

bj︸︷︷︸t×1

+ U⊥︸︷︷︸k×(k−t)

cj︸︷︷︸(k−t)×1

.

Let b ∈ Rt2 denote vector[b>1 b>2 · · · b>t

]> and let c ∈ R(k−t)t denote vector[c>1 c>2 · · · c>t

]>.

Let U>g(w∗i ) = g(v∗i ) ∈ Rt, then g(v∗i ) = Ez∼Dt [φ

′(v∗i z)z]. Then we can

rewrite formulation (C.6) as

Ex∼Dk

( t∑i=1

a>i (xφ′(w∗>

i x)− g(w∗i )

)2

= Ex∼Dk

( t∑i=1

(b>i U> + c>i U

>⊥ ) · (xφ′(w∗>

i x)− g(w∗i ))

)2

= A+B + C

where

A = Ex∼Dk

( t∑i=1

b>i U> · (xφ′(w∗>

i x)− g(w∗i ))

)2,

B = Ex∼Dk

( t∑i=1

c>i U>⊥ · (xφ′(w∗>

i x)− g(w∗i ))

)2,

C = Ex∼Dk

[2

(t∑

i=1

b>i U> · (xφ′(w∗>

i x)− g(w∗i ))

(t∑

i=1

c>i U>⊥ · (xφ′(w∗>

i x)− g(w∗i ))

)].

270

Page 287: Copyright by Kai Zhong 2018

We calculate A,B,C separately. First, we can show

A = Ex∼Dk

( t∑i=1

b>i U> ·(xφ′(w∗>

i x)− g(w∗i )))2

= E

z∼Dt

( t∑i=1

b>i · (zφ′(v∗>i z)− g(v∗i ))

)2.

where the first step follows by definition of A and the last step follows by U>g(w∗i ) =

g(v∗i ).

Second, we can show

B = Ex∼Dk

( t∑i=1

c>i U>⊥ · (xφ′(w∗>

i x)− g(w∗i ))

)2

= Ex∼Dk

( t∑i=1

c>i U>⊥ · (xφ′(w∗>

i x))

)2 by U>

⊥ g(w∗i ) = 0

= Es∼Dk−t,z∼Dt

( t∑i=1

c>i s · φ′(v∗>i z)

)2

= Es∼Dk−t,z∼Dt

[(y>s)2] by defining y =t∑

i=1

φ′(v∗>i z)ci ∈ Rk−t

= Ez∼Dt

[E

s∼Dk−t

[(y>s)2]

]= E

z∼Dt

[E

s∼Dk−t

[k−t∑j=1

s2jy2j

]]by E[sjsj′ ] = 0

= Ez∼Dt

[k−t∑j=1

y2j

]by sj ∼ N(0, 1)

= Ez∼Dt

∥∥∥∥∥t∑

i=1

φ′(v∗>i z)ci

∥∥∥∥∥2 by definition of y

271

Page 288: Copyright by Kai Zhong 2018

Third, we have C = 0 since U>⊥x is independent of w∗>

i x and U>x, and g(w∗) ∝

w∗, then U>⊥ g(w

∗) = 0.

Thus, putting them all together,

Ex∼Dk

( k∑i=1

a>i (xφ′(w∗>

i x)− g(w∗i ))

)2

= Ez∼Dt

( t∑i=1

b>i (zφ′(v∗>i z)− g(v∗i ))

)2

︸ ︷︷ ︸A

+ Ez∼Dt

∥∥∥∥∥t∑

i=1

φ′(v∗>i z)ci

∥∥∥∥∥2

︸ ︷︷ ︸B

272

Page 289: Copyright by Kai Zhong 2018

Let us lower bound A,

A = Ez∼Dt

( t∑i=1

b>i · (zφ′(v∗>i z)− g(w∗i ))

)2

=

∫(2π)−t/2

(t∑

i=1

b>i (zφ′(v∗>i z)− g(w∗

i ))

)2

e−‖z‖2/2dz

=

∫(2π)−t/2

(t∑

i=1

b>i (V†>s · φ′(si)− g(w∗

i ))

)2

e−‖V †>s‖2/2 · | det(V †)|ds

≥∫(2π)−t/2

(t∑

i=1

b>i (V†>s · φ′(si)− g(w∗

i ))

)2

e−σ21(V

†)‖s‖2/2 · | det(V †)|ds

=

∫(2π)−t/2

(t∑

i=1

b>i (V†>u/σ1(V

†) · φ′(ui/σ1(V†))− g(w∗

i ))

)2

e−‖u‖2/2| det(V †)|/σt1(V

†)du

=

∫(2π)−t/2

(t∑

i=1

p>i (u · φ′(σt · ui)− V >σ1(V†)g(w∗

i ))

)2

e−‖u‖2/2 1

λdu

=1

λEu∼Dt

( t∑i=1

p>i (uφ′(σt · ui)− V >σ1(V

†)g(w∗i ))

)2

≥ 1

λEu∼Dt

( t∑i=1

p>i (uφ′(σt · ui)− Eu∼Dt [uφ

′(σt · ui)])

)2

where the first step follows by definition of A, the second step follows by high-

dimensional Gaussian distribution, the third step follows by replacing z by V †>s, so

v∗>i z = si, the fourth step follows by the fact ‖V †>s‖ ≤ σ1(V†)‖s‖, and fifth step

follows by replacing s by u/σ1(V†), the sixth step follows by p>i = b>i V

†>/σ1(V†),

the seventh step follows by definition of high-dimensional Gaussian distribution,

and the last step follows by E[(X − C)2] ≥ E[(X − E[X])2].

Note that φ′(σt · ui)’s are independent of each other, so we can simplify the

273

Page 290: Copyright by Kai Zhong 2018

analysis.

In particular, Lemma C.3.2 gives a lower bound in this case in terms of pi.

Note that ‖pi‖ ≥ ‖bi‖/κ. Therefore,

Ez∼Dt

( t∑i=1

b>i z · φ′(v>i z)

)2 ≥ ρ(σt)

1

κ2λ‖b‖2.

274

Page 291: Copyright by Kai Zhong 2018

For B, similar to the proof of Lemma C.1.1, we have,

B = Ez∼Dt

∥∥∥∥∥t∑

i=1

φ′(v>i z)ci

∥∥∥∥∥2

=

∫(2π)−t/2

∥∥∥∥∥t∑

i=1

φ′(v>i z)ci

∥∥∥∥∥2

e−‖z‖2/2dz

=

∫(2π)−t/2

∥∥∥∥∥t∑

i=1

φ′(σt · ui)ci

∥∥∥∥∥2

e−‖V †>u/σ1(V †)‖2/2 · det(V †/σ1(V†))du

=

∫(2π)−t/2

∥∥∥∥∥t∑

i=1

φ′(σt · ui)ci

∥∥∥∥∥2

e−‖V †>u/σ1(V †)‖2/2 · 1λdu

≥∫

(2π)−t/2

∥∥∥∥∥t∑

i=1

φ′(σt · ui)ci

∥∥∥∥∥2

e−‖u‖2/2 · 1λdu

=1

λE

u∼Dt

∥∥∥∥∥t∑

i=1

φ′(σt · ui)ci

∥∥∥∥∥2

=1

λ

(t∑

i=1

Eu∼Dk

[φ′(σt · ui)φ′(σk · ui)c

>i ci] +

∑i 6=l

Eu∼Dt

[φ′(σt · ui)φ′(σt · ul)c

>i cl]

)

=1

λ

(E

z∼D1

[φ′(σt · ui)2]

t∑i=1

‖ci‖2 +(

Ez∼D1

[φ′(σt · z)])2∑

i 6=l

c>i cl

)

=1

λ

( Ez∼D1

[φ′(σt · z)])2∥∥∥∥∥

t∑i=1

ci

∥∥∥∥∥2

2

+

(E

z∼D1

[φ′(σt · z)2]−(

Ez∼D1

[φ′(σt · z)])2)‖c‖2

≥ 1

λ

(E

z∼D1

[φ′(σt · z)2]−(

Ez∼D1

[φ′(σt · z)])2)‖c‖2

≥ ρ(σt)1

λ‖c‖2,

where the first step follows by definition of Gaussian distribution, the second step

follows by replacing z by z = V †>u/σ1(V†), and then v>i z = ui/σ1(V

†) =

275

Page 292: Copyright by Kai Zhong 2018

uiσt(W∗), the third step follows by ‖u‖2 ≥ ‖ 1

σ1(V †)V †>u‖2 , the fourth step follows

by det(V †/σ1(V†)) = det(V †)/σt

1(V†) = 1/λ, the fifth step follows by definition

of Gaussian distribution, the ninth step follows by x2 ≥ 0 for any x ∈ R, and the

last step follows by Property 4.

Note that 1 = ‖a‖2 = ‖b‖2 + ‖c‖2. Thus, we finish the proof for the lower

bound.

Upper bound on the eigenvalues of the population Hessian at the ground

truth

Lemma C.3.4. If φ(z) satisfies Property 1, 4, 3, then

∇2fD(W∗) O(tr2σ2p

1 )

Proof. Similarly to the proof in previous section, we can calculate the upper bound

276

Page 293: Copyright by Kai Zhong 2018

of the eigenvalues by

‖∇2fD(W∗)‖

= max‖a‖=1

a>∇2fD(W∗)a

= max‖a‖=1

Ex∼Dd

( t∑j=1

r∑i=1

a>j xi · φ′(w∗>j xi)

)2

≤ max‖a‖=1

Ex∼Dd

( t∑j=1

r∑i=1

|a>j xi| · |φ′(w∗>j xi)|

)2

= max‖a‖=1

Ex∼Dd

[t∑

j=1

r∑i=1

t∑j′=1

r∑i′=1

|a>j xi| · |φ′(w∗>j xi)| · |a>j′xi′| · |φ′(w∗>

j′ xi′)|

]

= max‖a‖=1

t∑j=1

r∑i=1

t∑j′=1

r∑i′=1

Ex∼Dd

[|a>j xi| · |φ′(w∗>

j xi)| · |a>j′xi′ | · |φ′(w∗>j′ xi′)|

]︸ ︷︷ ︸Aj,i,j′,i′

.

It remains to bound Aj,i,j′,i′ . We have

Aj,i,j′,i′ = Ex∼Dd

[|a>j xi| · |φ′(w∗>

j xi)| · |a>j′xi′ | · |φ′(w∗>j′ xi′)|

]≤(Ex∼Dk

[|a>j x|4] · Ex∼Dk[|φ′(w∗>

j x)|4] · Ex∼Dk[|a>j′x|4] · Ex∼Dk

[|φ′(w∗>j′ x)|4]

)1/4. ‖aj‖ · ‖aj′‖ · ‖w∗

j‖p · ‖w∗j′‖p.

Thus, we have

‖∇2fD(W∗)‖ ≤ tr2σ2p

1 ,

which completes the proof.

C.3.2 Error bound of Hessians near the ground truth for smooth activations

The goal of this Section is to prove Lemma C.3.5

277

Page 294: Copyright by Kai Zhong 2018

Lemma C.3.5 (Error Bound of Hessians near the Ground Truth for Smooth Activa-

tions). Let φ(z) satisfy Property 1 (with p = 0, 1), Property 4 and Property 3(a).

Let W ∈ Rk×t satisfy ‖W −W ∗‖ ≤ σt/2. Let S denote a set of i.i.d. samples from

the distribution defined in (5.1). Then for any s ≥ 1 and 0 < ε < 1/2, if

|S| ≥ ε−2kκ2τ · poly(log d, s)

then we have, with probability at least 1− 1/dΩ(s),

‖∇2fS(W )−∇2fD(W∗)‖ . r2t2σp

1(εσp1 + ‖W −W ∗‖).

Proof. This follows by combining Lemma C.3.6 and Lemma C.3.7 directly.

Second-order smoothness near the ground truth for smooth activations

The goal of this Section is to prove Lemma C.3.6.

Fact C.3.1. Let wi denote the i-th column of W ∈ Rk×t, and w∗i denote the i-th

column of W ∗ ∈ Rk×t. If ‖W −W ∗‖ ≤ σt(W∗)/2, then for all i ∈ [t],

1

2‖w∗

i ‖ ≤ ‖wi‖ ≤3

2‖w∗

i ‖.

Proof. Note that if ‖W − W ∗‖ ≤ σt(W∗)/2, we have σt(W

∗)/2 ≤ σi(W ) ≤32σ1(W

∗) for all i ∈ [t] by Weyl’s inequality. By definition of singular value, we

have σt(W∗) ≤ ‖w∗

i ‖ ≤ σ1(W∗). By definition of spectral norm, we have ‖wi −

w∗i ‖ ≤ ‖W −W ∗‖. Thus, we can lower bound ‖wi‖,

‖wi‖ ≤ ‖w∗i ‖+ ‖wi − w∗

i ‖ ≤ ‖w∗i ‖+ ‖W −W ∗‖ ≤ ‖w∗

i ‖+ σt/2 ≤3

2‖w∗

i ‖.

Similarly, we have ‖wi‖ ≥ 12‖w∗

i ‖.

278

Page 295: Copyright by Kai Zhong 2018

Lemma C.3.6 (Second-order Smoothness near the Ground Truth for Smooth Ac-

tivations). If φ(z) satisfies Property 1 (with p = 0, 1), Property 4 and Prop-

erty 3(a), then for any W ∈ Rk×t with ‖W −W ∗‖ ≤ σt/2, we have

‖∇2fD(W )−∇2fD(W∗)‖ . r2t2σp

1‖W −W ∗‖.

Proof. Recall that x ∈ Rd denotes a vector[x>1 x>

2 · · · x>r

]>, where xi =

Pix ∈ Rk, ∀i ∈ [r] and d = rk. Recall that for each (x, y) ∼ D or (x, y) ∈ S,

y =∑t

j=1

∑ri=1 φ(w

∗>j xi).

Let ∆ = ∇2fD(W )−∇2fD(W∗). For each (j, l) ∈ [t]×[t], let ∆j,l ∈ Rk×k.

Then for any j 6= l, we have

∆j,l = Ex∼Dd

( r∑i=1

φ′(w>j xi)xi

)(r∑

i=1

φ′(w>l xi)xi

)>

(r∑

i=1

φ′(w∗>j xi)xi

)(r∑

i=1

φ′(w∗>l xi)xi

)>

=r∑

i=1

Ex∼Dk

[(φ′(w>

j x)φ′(w>

l x)− φ′(w∗>j x)φ′(w∗>

l x))xx>]+∑i 6=i′

(Ey∼Dk,z∼Dk

[φ′(w>

j y)yφ′(w>

l z)z> − φ′(w∗>

j y)yφ′(w∗>l z)z>

])= ∆

(1)j,l +∆

(2)j,l .

Using Claim C.3.2 and Claim C.3.3, we can bound ∆(1)j,l and ∆

(2)j,l .

279

Page 296: Copyright by Kai Zhong 2018

For any j ∈ [t], we have

∆j,j = Ex∼Dd

[(t∑

l=1

r∑i=1

φ(w>l xi)− y

(r∑

i=1

φ′′(w>j xi)xix

>i

)]

+ Ex∼Dd

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>j xi)xi

)>

− Ex∼Dd

( r∑i=1

φ′(w∗>j xi)xi

(r∑

i=1

φ′(w∗>j xi)xi

)>

= Ex∼Dd

[(t∑

l=1

r∑i=1

(φ(w>l xi)− φ(w∗>

l xi))

(r∑

i=1

φ′′(w>j xi)xix

>i

)]

+ Ex∼Dd

[r∑

i=1

r∑i′=1

(φ′(w>

j xi)φ′(w>

j xi′)− φ′(w∗>j xi)φ

′(w∗>j xi′)

)xix

>i′

]= ∆

(1)j,j +∆

(2)j,j ,

where the first step follows by∇2fD(W )−∇2fD(W∗), the second step follows by

the definition of y.

Using Claim C.3.4, we can bound ∆(1)j,j . Using Claim C.3.5, we can bound

∆(2)j,j .

280

Page 297: Copyright by Kai Zhong 2018

Putting it all together, we can bound the error by

‖∇2fD(W )−∇2fD(W∗)‖

= max‖a‖=1

a>(∇2fD(W )−∇2fD(W∗))a

= max‖a‖=1

t∑j=1

t∑l=1

a>j ∆j,lal

= max‖a‖=1

(t∑

j=1

a>j ∆j,jaj +∑j 6=l

a>j ∆i,lal

)

≤ max‖a‖=1

(t∑

j=1

‖∆j,j‖‖aj‖2 +∑j 6=l

‖∆j,l‖‖aj‖‖al‖

)

≤ max‖a‖=1

(t∑

j=1

C1‖aj‖2 +∑j 6=l

C2‖aj‖‖al‖

)

= max‖a‖=1

C1

t∑j=1

‖ai‖2 + C2

( t∑j=1

‖aj‖

)2

−t∑

j=1

‖aj‖2

≤ max‖a‖=1

(C1

t∑j=1

‖aj‖2 + C2

(t

t∑j=1

‖aj‖2 −t∑

j=1

‖aj‖2))

= max‖a‖=1

(C1 + C2(t− 1))

. r2t2L1L2σp1(W

∗)‖W −W ∗‖.

where the first step follows by definition of spectral norm and a denotes a vector

∈ Rdk, the first inequality follows by ‖A‖ = max‖x‖6=0,‖y‖6=0x>Ay‖x‖‖y‖ , the second

inequality follows by ‖∆i,i‖ ≤ C1 and ‖∆i,l‖ ≤ C2, the third inequality follows by

Cauchy-Scharwz inequality, the eighth step follows by∑

i=1 ‖ai‖2 = 1, where the

last step follows by Claim C.3.2, C.3.3 and C.3.4.

Thus, we complete the proof.

281

Page 298: Copyright by Kai Zhong 2018

Claim C.3.2. For each (j, l) ∈ [t]× [t] and j 6= l, ‖∆(1)j,l ‖ . r2L1L2σ

p1(W

∗)‖W −

W ∗‖.

Proof. Recall the definition of ∆(1)j,l ,

r∑i=1

Ex∼Dk

[(φ′(w>

j x)φ′(w>

l x)− φ′(w∗>j x)φ′(w∗>

l x))xx>]In order to upper bound ‖∆(1)

j,l ‖, it suffices to upper bound the spectral norm of this

quantity,

Ex∼Dk

[(φ′(w>

j x)φ′(w>

l x)− φ′(w∗>j x)φ′(w∗>

l x))xx>].We have ∥∥Ex∼Dk

[(φ′(w>

j x)φ′(w>

l x)− φ′(w∗>j x)φ′(w∗>

l x))xx>]∥∥= max

‖a‖=1Ex∼Dk

[|φ′(w>

j x)φ′(w>

l x)− φ′(w∗>j x)φ′(w∗>

l x)|(x>a)2]

≤ max‖a‖=1

(Ex∼Dk

[|φ′(w>

j x)φ′(w>

l x)− φ′(w>j x)φ

′(w∗>l x)|(x>a)2

]+ Ex∼Dk

[|φ′(w>

j x)φ′(w∗>

l x)− φ′(w∗>j x)φ′(w∗>

l x)|(x>a)2])

= max‖a‖=1

(Ex∼Dk

[|φ′(w>

j x)| · |φ′(w>l x)− φ′(w∗>

l x)|(x>a)2]

+ Ex∼Dk

[|φ′(w∗>

l x| · |φ′(w>j x)− φ′(w∗>

j x))|(x>a)2])

We can upper bound the first term of above Equation in the following way,

max‖a‖=1

Ex∼Dk

[|φ′(w>

j x)| · |φ′(w>l x)− φ′(w∗>

l x)|(x>a)2]

≤ 2L1L2Ex∼Dk[|w>

j x|p · |(wl − w∗l )

>x| · |x>a|2]

. L1L2σp1‖W −W ∗‖.

282

Page 299: Copyright by Kai Zhong 2018

Similarly, we can upper bound the second term. By summing over O(r2) terms, we

complete the proof.

Claim C.3.3. For each (j, l) ∈ [t]× [t] and j 6= l, ‖∆(2)j,l ‖ . r2L1L2σ

p1(W

∗)‖W −

W ∗‖.

Proof. Note that

Ey∼Dk,z∼Dk

[φ′(w>

j y)yφ′(w>

l z)z> − φ′(w∗>

j y)yφ′(w∗>l z)z>

]=Ey∼Dk,z∼Dk

[φ′(w>

j y)yφ′(w>

l z)z> − φ′(w>

j y)yφ′(w∗>

l z)z>]

+ Ey∼Dk,z∼Dk

[φ′(w>

j y)yφ′(w∗>

l z)z> − φ′(w∗>j y)yφ′(w∗>

l z)z>]

We consider the first term as follows. The second term is similar.

∥∥Ey∼Dk,z∼Dk[φ′(w>

j y)yφ′(w>

l z)z> − φ′(w>

j y)yφ′(w∗>

l z)z>]∥∥

=∥∥Ey∼Dk,z∼Dk

[φ′(w>j y)(φ

′(w>l z)− φ′(w∗>

l z))yz>]∥∥

≤ max‖a‖=‖b‖=1

Ey,z∼Dk[|φ′(w>

j y)| · |φ′(w>l z)− φ′(w∗>

l z)| · |a>y| · |b>z|]

. L1L2σp1(W

∗)‖W −W ∗‖.

By summing over O(r2) terms, we complete the proof.

Claim C.3.4. For each j ∈ [t], ‖∆(1)j,j ‖ . r2tL1L2σ

p1(W

∗)‖W −W ∗‖.

Proof. Recall the definition of ∆(1)j,j ,

∆(1)j,j = Ex∼Dd

[(t∑

l=1

r∑i=1

(φ(w>l xi)− φ(w∗>

l xi))

(r∑

i=1

φ′′(w>j xi)xix

>i

)]

283

Page 300: Copyright by Kai Zhong 2018

In order to upper bound ‖∆(1)j,j ‖, it suffices to upper bound the spectral norm of this

quantity,

Ex∼Dd

[(φ(w>

l xi)− φ(w∗>l xi)) · φ′′(w>

j xi′)xi′x>i′

]= Ey,z∼Dk

[(φ(w>

l y)− φ(w∗>l y)) · φ′′(w>

j z)zz>]

Thus, we have

∥∥Ey,z∼Dk

[(φ(w>

l y)− φ(w∗>l y)) · φ′′(w>

j z)zz>]∥∥

≤ max‖a‖=1

Ey,z∼Dk|(φ(w>

l y)− φ(w∗>l y)| · |φ′′(w>

j z)| · (z>a)2

≤ max‖a‖=1

Ey,z∼Dk[|φ(w>

l x)− φ(w∗>l y)|L2(z

>a)2]

≤ L2 max‖a‖=1

Ey,z∼Dk

[max

u∈[w>l y,w∗>

l y]|φ′(u)| · |(wl − w∗

l )>y| · (z>a)2

]≤ L2 max

‖a‖=1Ey,z∼Dk

[max

u∈[w>l y,w∗>

l y]L1|u|p · |(wl − w∗

l )>y| · (z>a)2

]≤ L1L2 max

‖a‖=1Ey,z∼Dk

[(|w>l y|p + |w∗>

l y|p) · |(wl − w∗l )

>y| · (z>a)2]

. L1L2(‖wl‖p + ‖w∗l ‖p)‖wl − w∗

l ‖

By summing over all the O(tr2) terms and using triangle inequality, we finish the

proof.

Claim C.3.5. For each j ∈ [t], ‖∆(2)j,j ‖ . r2tL1L2σ

p1(W

∗)‖W −W ∗‖.

Proof. Recall the definition of ∆(2)j,j ,

Ex∼Dd

[r∑

i=1

r∑i′=1

(φ′(w>

j xi)φ′(w>

j xi′)− φ′(w∗>j xi)φ

′(w∗>j xi′)

)xix

>i′

]

284

Page 301: Copyright by Kai Zhong 2018

In order to upper bound ‖∆(2)j,j ‖, it suffices to upper bound the spectral norm of these

two quantities, the diagonal term

Ey∼Dk

[(φ′2(w>

j y)− φ′2(w∗>j y))yy>

]and the off-diagonal term,

Ey,z∼Dk

[(φ′(w>

j y)φ′(w>

j z)− φ′(w∗>j y)φ′(w∗>

j z))yz>]

These two terms can be bounded by using the proof similar to the other Claims of

this Section.

Empirical and population difference for smooth activations

The goal of this Section is to prove Lemma C.3.7. For each i ∈ [k], let σi

denote the i-th largest singular value of W ∗ ∈ Rd×k.

Note that Bernstein inequality requires the spectral norm of each random

matrix to be bounded almost surely. However, since we assume Gaussian distribu-

tion for x, ‖x‖2 is not bounded almost surely. The main idea is to do truncation and

then use Matrix Bernstein inequality. Details can be found in Lemma D.3.9 and

Corollary B.1.1.

Lemma C.3.7 (Empirical and Population Difference for Smooth Activations). Let

φ(z) satisfy Property 1,4 and 3(a). Let W ∈ Rk×t satisfy ‖W −W ∗‖ ≤ σt/2. Let

S denote a set of i.i.d. samples from distribution D (defined in (5.1)). Then for any

s ≥ 1 and 0 < ε < 1/2, if

|S| ≥ ε−2kκ2τ · poly(log d, s)

285

Page 302: Copyright by Kai Zhong 2018

then we have, with probability at least 1− 1/dΩ(s),

‖∇2fS(W )−∇2fD(W )‖ .r2t2σp1(εσ

p1 + ‖W −W ∗‖).

Proof. Recall that x ∈ Rd denotes a vector[x>1 x>

2 · · · x>r

]>, where xi =

Pix ∈ Rk, ∀i ∈ [r] and d = rk. Recall that for each (x, y) ∼ D or (x, y) ∈ S,

y =∑t

j=1

∑ri=1 φ(w

∗>j xi).

Define ∆ = ∇2fD(W ) − ∇2fS(W ). Let us first consider the diagonal

blocks. Define

∆j,j = E(x,y)∼D

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>j xi)xi

)>

+

(t∑

l=1

r∑i=1

φ(w>l xi)− y

(r∑

i=1

φ′′(w>j xi)xix

>i

)]

− 1

|S|∑

(x,y)∈S

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>j xi)xi

)>

+

(t∑

l=1

r∑i=1

φ(w>l xi)− y

(r∑

i=1

φ′′(w>j xi)xix

>i

)]

286

Page 303: Copyright by Kai Zhong 2018

Further, we can decompose ∆j,j into ∆j,j = ∆(1)j,j +∆

(2)j,j , where

∆(1)j,j = E(x,y)∼D

[(t∑

l=1

r∑i=1

φ(w>l xi)− y

(r∑

i=1

φ′′(w>j xi)xix

>i

)]

− 1

|S|∑

(x,y)∈S

[(t∑

l=1

r∑i=1

φ(w>l xi)− y

(r∑

i=1

φ′′(w>j xi)xix

>i

)]

= E(x,y)∼D

[(t∑

l=1

r∑i=1

(φ(w>l xi)− φ(w∗>

l xi))

(r∑

i=1

φ′′(w>j xi)xix

>i

)]

− 1

|S|∑

(x,y)∈S

[(t∑

l=1

r∑i=1

(φ(w>l xi)− φ(w∗>

l xi))

(r∑

i=1

φ′′(w>j xi)xix

>i

)]

=r∑

l=1

r∑i=1

r∑i′=1

(Ex∼Dd

[(φ(w>

l xi)− φ(w∗>l xi))φ

′′(w>j xi′)xi′x

>i′

]− 1

|S|∑x∈S

[(φ(w>

l xi)− φ(w∗>l xi))φ

′′(w>j xi′)xi′x

>i′

])

and

∆(2)j,j = E(x,y)∈D

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>j xi)xi

)>

− 1

|S|∑

(x,y)∈S

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>j xi)xi

)>

=r∑

i=1

r∑i′=1

(Ex∼Dd

[φ′(w>j xi)xiφ

′(w>j xi′)x

>i′ ]−

1

|S|∑x∈S

[φ′(w>j xi)xiφ

′(w>j xi′)x

>i′ ]

)

287

Page 304: Copyright by Kai Zhong 2018

The off-diagonal block is

∆j,l = E(x,y)∼D

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>l xi)xi

)>

− 1

|S|∑x∈S

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>l xi)xi

)>

=r∑

i=1

r∑i′=1

(Ex∼Dd

[φ′(w>j xi)xiφ

′(w>l xi′)x

>i′ ]−

1

|S|∑x∈S

[φ′(w>j xi)xiφ

′(w>l xi′)x

>i′ ]

)

Note that ∆(2)j,j is a special case of ∆j,l so we just bound ∆j,l. Combining Claims C.3.6

C.3.7, and taking a union bound over t2 different ∆j,l, we obtain if n ≥ ε−2kτκ2poly(log d, s),

with probability at least 1− 1/d4s,

‖∇2fS(W )−∇2f(W )‖ . t2r2σp1(W

∗) · (εσp1(W

∗) + ‖W −W ∗‖).

Therefore, we complete the proof.

Claim C.3.6. For each j ∈ [t], if |S| ≥ kpoly(log d, s)

‖∆(1)j,j ‖ . r2tσp

1(W∗)‖W −W ∗‖

holds with probability 1− 1/d4s.

Proof. Define B∗i,i′,j,l to be

B∗i,i′,j,l =Ex∼Dd

[(φ(w>l xi)− φ(w∗>

l xi))φ′′(w>

j xi′)xi′x>i′ )]

− 1

|S|∑x∈S

[(φ(w>l xi)− φ(w∗>

l xi))φ′′(w>

j xi′)xi′x>i′ )]

288

Page 305: Copyright by Kai Zhong 2018

For each l ∈ [t], we define function Al(x, x′) : R2k → Rk×k,

Al(x, x′) = L1L2 · (|w>

l x|p + |w∗>l x|p) · |(wl − w∗

l )>x| · x′x′>.

Using Properties 1,4 and 3(a), we have for each x ∈ S, for each (i, i′) ∈

[r]× [r],

−Al(xi, xi′) (φ(w>l xi)− φ(w∗>

l xi)) · φ′′(w>j xi′)xi′x

>i′ Al(xi, xi′).

Therefore,

∆(1)j,j

r∑i=1

r∑i′=1

t∑l=1

(Ex∼Dd

[Al(xi, xi′)] +1

|S|∑x∈S

Al(xi, xi′)

).

Let hl(x) = L1L2|w>l x|p · |(wl−w∗

l )>x|. Let Dk denote Gaussian distribu-

tion N(0, Ik). Let Bl = Ex,x′∼D2k[hl(x)x

′x′>].

We define function Bl(x, x′) : R2k → Rk×k such that

Bl(x, x′) = hl(x)x

′x′>.

(I) Bounding |hl(x)|.

According to Lemma 2.4.2, we have for any constant s ≥ 1, with probability

1− 1/(nd8s),

|hl(x)| = L1L2|w>r x|p|(wl − w∗

l )>x| ≤ ‖wl‖p‖wl − w∗

l ‖poly(s, log n).

(II) Bounding ‖Bl‖.

289

Page 306: Copyright by Kai Zhong 2018

‖Bl‖ ≥ Ex∼Dk

[L1L2|w>

l x|p|(wl − w∗l )

>x|]· Ex′∼Dk

[((wl − w∗

l )>x′

‖wl − w∗l ‖

)2]& ‖wl‖p‖wl − w∗

l ‖,

where the first step follows by definition of spectral norm, and last step follows

by Lemma 2.4.2. Using Lemma 2.4.2, we can also prove an upper bound ‖Bl‖,

‖Bl‖ . L1L2‖wl‖p‖wl − w∗l ‖.

(III) Bounding (Ex∼Dk[h4(x)])1/4

Using Lemma 2.4.2, we have(E

x∼Dk

[h4(x)]

)1/4

= L1L2

(E

x∼Dk

[(|w>

l x|p|(wl − w∗l )

>x|)4])1/4

. ‖wl‖p‖wl − w∗l ‖.

By applying Corollary B.1.1, for each (i, i′) ∈ [r]×[r] if n ≥ ε−2kpoly(log d, s),

then with probability 1− 1/d8s,

∥∥∥∥∥ Ex∼Dd

[|w>

l xi|p · |(wl − w∗l )

>xi| · xi′x>i′

]− 1

|S|∑x∈S

|w>l xi|p · |(wl − w∗

l )>xi| · xi′x

>i′

∥∥∥∥∥=

∥∥∥∥∥Bl −1

|S|∑x∈S

Bl(xi, xi′)

∥∥∥∥∥≤ ε‖Bl‖

. ε‖wl‖p‖wl − w∗l ‖. (C.7)

If ε ≤ 1/2, we have

‖∆(1)i,i ‖ .

r∑i=1

r∑i′=1

t∑l=1

‖Bl‖ . r2tσp1(W

∗)‖W −W ∗‖

290

Page 307: Copyright by Kai Zhong 2018

Claim C.3.7. For each (j, l) ∈ [t]× [t], j 6= l, if |S| ≥ ε−2kτκ2poly(log d, s)

‖∆j,l‖ . εr2σ2p1 (W ∗)

holds with probability 1− 1/d4s.

Proof. Recall

∆j,l = E(x,y)∼D

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>l xi)xi

)>

− 1

|S|∑x∈S

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>l xi)xi

)>

Recall that x = [x>1 x>

2 · · ·x>r ]

>, xi ∈ Rk,∀i ∈ [r] and d = rk. We define

X = [x1 x2 · · ·xr] ∈ Rk×r. Let φ′(X>wj) ∈ Rr denote the vector

[φ′(x>1 wj) φ

′(x>2 wj) · · ·φ′(x>

r wj)]> ∈ Rr.

We define function B(x) : Rd → Rk×k such that

B(x) = X︸︷︷︸k×r

φ′(X>wj)︸ ︷︷ ︸r×1

φ′(X>wl)>︸ ︷︷ ︸

1×r

X>︸︷︷︸r×k

.

Therefore,

∆j,l = E(x,y)∼D[B(x)]− 1

|S|∑x∈S

[B(x)]

To apply Lemma D.3.9, we show the following.

(I)

291

Page 308: Copyright by Kai Zhong 2018

‖B(x)‖ .

(r∑

i=1

|w>j xi|p‖xi‖

(r∑

i=1

|w>l xi|p‖xi‖

).

By using Lemma 2.4.2,2.4.3, we have with probability 1− 1/nd4s,

‖B(x)‖ ≤ r2k‖wj‖p‖wl‖p log n

(II)

Ex∼Dd

[B(x)]

=r∑

i=1

Ex∼Dd[φ′(w>

j xi)xiφ′(w>

l xi)x>i ] +

∑i 6=i′

Ex∼Dd[φ′(w>

j xi)xiφ′(w>

l xi′)x>i′ ]

=r∑

i=1

Ex∼Dd[φ′(w>

j xi)xiφ′(w>

l xi)x>i ] +

∑i 6=i′

Exi∼Dk[φ′(w>

j xi)xi]Exi′∼Dk[φ′(w>

l xi′)x>i′ ]

=B1 +B2

Let’s first consider B1. Let U ∈ Rk×2 be the orthogonal basis of spanwj, wl

and U⊥ ∈ Rk×(k−2) be the complementary matrix of U . Let matrix V := [v1 v2] ∈

R2×2 denote U>[wj wl], then UV = [wj wl] ∈ Rd×2. Given any vector a ∈ Rk,

there exist vectors b ∈ R2 and c ∈ Rk−2 such that a = Ub+ U⊥c. We can simplify

292

Page 309: Copyright by Kai Zhong 2018

‖B1‖ in the following way,

‖B1‖ =∥∥∥∥ Ex∼Dk

[φ′(w>j x)φ

′(w>l x)xx

>]

∥∥∥∥= max

‖a‖=1E

x∼Dk

[φ′(w>j x)φ

′(w>l x)(x

>a)2]

= max‖b‖2+‖c‖2=1

Ex∼Dk

[φ′(w>j x)φ

′(w>l x)(b

>U>x+ c>U>⊥x)

2]

= max‖b‖2+‖c‖2=1

Ex∼Dk

[φ′(w>j x)φ

′(w>l x)((b

>U>x)2 + (c>U>⊥x)

2)]

= max‖b‖2+‖c‖2=1

Ez∼D2

[φ′(v>1 z)φ′(v>2 z)(b

>z)2]︸ ︷︷ ︸A1

+ Ez∼D2,s∼Dk−2

[φ′(v>1 z)φ′(v>2 z)(c

>s)2]︸ ︷︷ ︸A2

Obviously, A1 ≥ 0. For the term A2, we have

A2 = Ez∼D2,s∼Dk−2

[φ′(v>1 z)φ′(v>2 z)(c

>s)2]

= Ez∼D2

[φ′(v>1 z)φ′(v>2 z)] E

s∼Dk−2

[(c>s)2]

= ‖c‖2 Ez∼D2

[φ′(v>1 z)φ′(v>2 z)]

≥ ‖c‖2σ2(V )

σ1(V )

(E

z∼D1

[φ′(σ2(V ) · z)])2

& ‖c‖2 1

κ(W ∗)ρ(σ2(V ))

Then if we set b = 0, we have

‖Ex∼Dd[B(x)]‖ ≥ max

‖a‖=1

∣∣a>Ex∼Dd[B(x)]a

∣∣ ≥ max‖a‖=1

∣∣a>B1a∣∣ ≥ r

κ(W ∗)ρ(σ2(V )).

The second inequality follows by the fact that Exi∼Dk[φ′(w>

j xi)xi] ∝ wj and a ∈

span(U⊥). The upper bound can be obtained following [153] as

‖Ex∼Dd[B(x)]‖ . r2L2

1σ2p1 .

293

Page 310: Copyright by Kai Zhong 2018

(III)

max

(∥∥∥∥ Ex∼Dd

[B(x)B(x)>]

∥∥∥∥, ∥∥∥∥ Ex∼Dd

[B(x)>B(x)]

∥∥∥∥)= max

‖a‖=1E

x∼Dd

[∣∣a>Xφ′(X>wj)φ′(X>wl)

>X>Xφ′(X>wl)φ′(X>wj)

>X>a∣∣]

. r4L41σ

4p1 k.

(IV)

max‖a‖=‖b‖=1

(E

B∼B

[(a>Bb)2

])1/2= max

‖a‖=1,‖b‖=1

(E

x∼N(0,Id)

[a>Xφ′(X>wj)φ

′(X>wl)>X>b

])1/2

. r2L21σ

2p1 .

Therefore, applying Lemma D.3.9, if |S| ≥ ε−2κ2τkpoly(log d, s) we have

‖∆j,l‖ ≤ εr2σ2p1

holds with probability at least 1− 1/dΩ(s).

Claim C.3.8. For each j ∈ [t], if |S| ≥ ε−2kτκ2poly(log d, s)

‖∆(2)j,j ‖ . εr2tσ2p

1 (W ∗)

holds with probability 1− 1/d4s.

Proof. The proof is identical to Claim C.3.7.

294

Page 311: Copyright by Kai Zhong 2018

C.3.3 Error bound of Hessians near the ground truth for non-smooth activa-tions

The goal of this Section is to prove Lemma C.3.8,

Lemma C.3.8 (Error Bound of Hessians near the Ground Truth for Non-smooth

Activations). Let φ(z) satisfy Property 1,4 and 3(b). Let W ∈ Rk×t satisfy ‖W −

W ∗‖ ≤ σt/2. Let S denote a set of i.i.d. samples from the distribution defined

in (5.1). Then for any t ≥ 1 and 0 < ε < 1/2, if

|S| ≥ ε−2kκ2τpoly(log d, s)

with probability at least 1− 1/dΩ(s),

‖∇2fS(W )−∇2fD(W∗)‖ . r2t2σ2p

1 (ε+ (‖W −W ∗‖/σt)1/2).

Proof. Recall that x ∈ Rd denotes a vector[x>1 x>

2 · · · x>r

]>, where xi =

Pix ∈ Rk, ∀i ∈ [r] and d = rk.

As we noted previously, when Property 3(b) holds, the diagonal blocks of

the empirical Hessian can be written as, with probability 1, for all j ∈ [t],

∂2fS(W )

∂w2j

=1

|S|∑x∈S

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>j xi)xi

)>.

We also know that, for each (j, l) ∈ [t]× [t] and j 6= l,

∂2fS(W )

∂wj∂wl

=1

|S|∑x∈S

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>l xi)xi

)>.

295

Page 312: Copyright by Kai Zhong 2018

We define HD(W ) ∈ Rtk×tk such that for each j ∈ [t], the diagonal block HD(W )j,j ∈

Rk×k is

HD(W )j,j = Ex∈Dd

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>j xi)xi

)>.

and for each (j, l) ∈ [t]× [t], the off-diagonal block HD(W )j,l ∈ Rk×k is

HD(W )j,l = Ex∈Dd

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>l xi)xi

)>.

Recall the definition of ∇2fD(W∗), for each j ∈ [t], the diagonal block is

∂2fD(W∗)

∂w2j

= E(x,y)∼D

( r∑i=1

φ′(w∗>j xi)xi

(r∑

i=1

φ′(w∗>j xi)xi

)>.

For each j, l,∈ [t] and j 6= l, the off-diagonal block is

∂2fD(W∗)

∂wj∂wl

= E(x,y)∼D

( r∑i=1

φ′(w∗>j xi)xi

(r∑

i=1

φ′(w∗>l xi)xi

)>.

Thus, we can show

‖∇2fS(W )−∇2fD(W∗)‖ = ‖∇2fS(W )−HD(W ) +HD(W )−∇2fD(W

∗)‖

≤ ‖∇2fS(W )−HD(W )‖+ ‖HD(W )−∇2fD(W∗)‖

. εr2t2σ2p1 + r2t2σ2p

1 (‖W −W ∗‖/σt)1/2,

where the second step follows by triangle inequality, the third step follows by

Lemma C.3.9 and Lemma C.3.10.

Lemma C.3.9. If |S| ≥ ε−2kτκ2poly(log d, s), then we have

‖HD(W )−∇2fS(W )‖ . εr2t2σp1(W

∗)

296

Page 313: Copyright by Kai Zhong 2018

Proof. Using Claim C.3.7, we can bound the spectral norm of all the off-diagonal

blocks, and using Claim C.3.8, we can bound the spectral norm of all the diagonal

blocks.

Lemma C.3.10. Let φ(z) satisfy Property 1,4 and 3(b). For any W ∈ Rk×t, if

‖W −W ∗‖ ≤ σt/2, then we have

‖HD(W )−∇2fD(W∗)‖ . r2t2σ2p

1 (W ∗) · (‖W −W ∗‖/σt(W∗))1/2.

Proof. This follows by using the similar technique from [153]. Let ∆ = HD(W )−

∇2fD(W∗). For each j ∈ [t], the diagonal block is,

∆j,j = Ex∼Dd

[r∑

i=1

r∑i′=1

(φ′(w>j xi)φ

′(w>j xi′)− φ′(w∗>

j xi)φ′(w∗>

j xi′))xix>i′

]

= Ex∼Dd

[r∑

i=1

(φ′2(w>j xi)− φ′2(w∗>

j xi))xix>i

]

+ Ex∼Dd

[∑i 6=i′

(φ′(w>j xi)φ

′(w>j xi′)− φ′(w∗>

j xi)φ′(w∗>

j xi′))xix>i′

]= ∆

(1)j,j +∆

(2)j,j .

For each (j, l) ∈ [t]× [t] and j 6= l, the off-diagonal block is,

∆j,l = Ex∼Dd

[r∑

i=1

r∑i′=1

(φ′(w>j xi)φ

′(w>l xi′)− φ′(w∗>

j xi)φ′(w∗>

l xi′))xix>i′

]

= Ex∼Dd

[r∑

i=1

(φ′(w>j xi)φ

′(w>l xi)− φ′(w∗>

j xi)φ′(w∗>

l xi))xix>i

]

+ Ex∼Dd

[∑i 6=i′

(φ′(w>j xi)φ

′(w>l xi′)− φ′(w∗>

j xi)φ′(w∗>

l xi′))xix>i′

]= ∆

(1)j,l +∆

(2)j,l

297

Page 314: Copyright by Kai Zhong 2018

Applying Claim C.3.9 and C.3.10 completes the proof.

Claim C.3.9. Let φ(z) satisfy Property 1,4 and 3(b). For any W ∈ Rk×t, if ‖W −

W ∗‖ ≤ σt/2, then we have

max(‖∆(1)j,j ‖, ‖∆

(1)j,l ‖) . rσ2p

1 (W ∗) · (‖W −W ∗‖/σt(W∗))1/2.

Proof. We want to bound the spectral norm of

Ex∼Dk

[(φ′(w>

j x)φ′(w>

l x)− φ′(w∗>j x)φ′(w∗>

l x))xx>].We first show that,

∥∥Ex∼Dk[(φ′(w>

j x)φ′(w>

l x)− φ′(w∗>j x)φ′(w∗>

l x))xx>]∥∥

≤ max‖a‖=1

Ex∼Dk

[|φ′(w>

j x)φ′(w>

l x)− φ′(w∗>j x)φ′(w∗>

l x)|(x>a)2]

≤ max‖a‖=1

Ex∼Dk

[|φ′(w>

j x)− φ′(w∗>j x)||φ′(w>

l x)|+ |φ′(w∗>j x)||φ′(w>

l x)− φ′(w∗>l x)|(x>a)2

]= max

‖a‖=1

(Ex∼Dk

[|φ′(w>

j x)− φ′(w∗>j x)||φ′(w>

l x)|(x>a)2]

+ Ex∼Dk

[|φ′(w∗>

j x)||φ′(w>l x)− φ′(w∗>

l x)|(x>a)2]). (C.8)

where the first step follows by definition of spectral norm, the second step follows

by triangle inequality, and the last step follows by linearity of expectation.

Without loss of generality, we just bound the first term in the above formu-

lation. Let U be the orthogonal basis of span(wj, w∗j , wl). If wj, w

∗j , wl are inde-

pendent, U is k-by-3. Otherwise it can be d-by-rank(span(wj, w∗j , wl)). Without

loss of generality, we assume U = span(wj, w∗j , wl) is k-by-3. Let [vj v∗j vl] =

298

Page 315: Copyright by Kai Zhong 2018

U>[wj w∗j wl] ∈ R3×3, and [uj u∗

j ul] = U>⊥ [wj w∗

j wl] ∈ R(k−3)×3 Let a =

Ub+ U⊥c, where U⊥ ∈ Rd×(k−3) is the complementary matrix of U .

Ex∼Dk

[|φ′(w>

j x)− φ′(w∗>j x)||φ′(w>

l x)|(x>a)2]

= Ex∼Dk

[|φ′(w>

j x)− φ′(w∗>j x)||φ′(w>

l x)|(x>(Ub+ U⊥c))2]

. Ex∼Dd

[|φ′(w>

j x)− φ′(w∗>j x)||φ′(w>

l x)|((x>Ub)2 + (x>U⊥c)

2)]

= Ex∼Dk

[|φ′(w>

j x)− φ′(w∗>j x)||φ′(w>

l x)|(x>Ub)2]

+ Ex∼Dk

[|φ′(w>

j x)− φ′(w∗>j x)||φ′(w>

l x)|(x>U⊥c)2]

= Ez∼D3

[|φ′(v>j z)− φ′(v∗>j z)||φ′(v>l z)|(z>b)2

]+ Ey∼Dk−3

[|φ′(u>

j y)− φ′(u∗>j y)||φ′(u>

l y)|(y>c)2]

(C.9)

where the first step follows by a = Ub + U⊥c, the last step follows by (a + b)2 ≤

2a2 + 2b2. Let’s consider the first term. The second term is similar.

By Property 3(b), we have e exceptional points which have φ′′(z) 6= 0. Let

these e points be p1, p2, · · · , pe. Note that if v>j z and v∗>j z are not separated by

any of these exceptional points, i.e., there exists no j ∈ [e] such that v>i z ≤ pj ≤

v∗>j z or v∗>j z ≤ pj ≤ v>j z, then we have φ′(v>j z) = φ′(v∗>j z) since φ′′(s) are

zeros except for pjj=1,2,··· ,e. So we consider the probability that v>j z, v∗>j z are

separated by any exception point. We use ξj to denote the event that v>j z, v∗>j z

are separated by an exceptional point pj . By union bound, 1 −∑e

j=1 Pr ξj is the

probability that v>j z, v∗>j z are not separated by any exceptional point. The first term

299

Page 316: Copyright by Kai Zhong 2018

of Equation (D.23) can be bounded as,

Ez∼D3

[|φ′(v>j z)− φ′(v∗>j z)||φ′(v>l z)|(z>b)2

]= Ez∼D3

[1∪e

i=1ξi|φ′(v>j z) + φ′(v∗>j z)||φ′(v>l z)|(z>b)2

]≤(Ez∼D3

[1∪e

i=1ξi

])1/2(Ez∼D3

[(φ′(v>j z) + φ′(v∗>j z))2φ′(v>l z)

2(z>b)4])1/2

(e∑

j=1

Prz∼D3

[ξj]

)1/2(Ez∼D3

[(φ′(v>j z) + φ′(v∗>j z))2φ′(v>l z)

2(z>b)4])1/2

.

(e∑

j=1

Prz∼D3

[ξj]

)1/2

(‖vj‖p + ‖v∗j‖p)‖vl‖p‖b‖2

where the first step follows by if v>j z, v∗>j z are not separated by any exceptional

point then φ′(v>j z) = φ′(v∗>j z) and the last step follows by Holder’s inequality and

Property 1.

It remains to upper bound Prz∼D3 [ξj]. First note that if v>j z, v∗>j z are sepa-

rated by an exceptional point, pj , then |v∗>j z−pj| ≤ |v>j z− v∗>j z| ≤ ‖vj− v∗j‖‖z‖.

Therefore,

Prz∼D3

[ξj] ≤ Prz∼D3

[|v>j z − pj|‖z‖

≤ ‖vj − v∗j‖

].

Note that (v∗>j z

‖z‖‖v∗j ‖+ 1)/2 follows Beta(1,1) distribution which is uniform

distribution on [0, 1].

Prz∼D3

[|v∗>j z − pj|‖z‖‖v∗j‖

≤‖vj − v∗j‖‖v∗j‖

]≤ Pr

z∼D3

[|v∗>j z|‖z‖‖v∗j‖

≤‖vj − v∗j‖‖v∗j‖

]

.‖vj − v∗j‖‖v∗j‖

.‖W −W ∗‖σt(W ∗)

,

300

Page 317: Copyright by Kai Zhong 2018

where the first step is because we can viewv∗>j z

‖z‖ and pj‖z‖ as two independent ran-

dom variables: the former is about the direction of z and the later is related to the

magnitude of z. Thus, we have

Ez∈D3 [|φ′(v>j z)− φ′(v∗>j z)||φ′(v>l z)|(z>b)2] . (e‖W −W ∗‖/σt(W∗))1/2σ2p

1 (W ∗)‖b‖2.(C.10)

Similarly we have

Ey∈Dk−3[|φ′(u>

i y)− φ′(u∗>i y)||φ′(u>

l y)|(y>c)2] . (e‖W −W ∗‖/σt(W∗))1/2σ2p

1 (W ∗)‖c‖2.(C.11)

Thus, we complete the proof.

Claim C.3.10. Let φ(z) satisfy Property 1,4 and 3(b). For any W ∈ Rk×t, if

‖W −W ∗‖ ≤ σt/2, then we have

max(‖∆(2)j,j ‖, ‖∆

(2)j,l ‖) . r2σ2p

1 (W ∗) · (‖W −W ∗‖/σt(W∗))1/2.

Proof. We bound ‖∆(2)j,l ‖. ‖∆

(2)j,j ‖ is a special case of ‖∆(2)

j,l ‖.

∆(2)j,l =Ex∼Dd

[∑i 6=i′

(φ′(w>j xi)φ

′(w>l xi′)− φ′(w∗>

j xi)φ′(w∗>

l xi′))xix>i′

]=∑i 6=i′

(Exi∼Dk

[φ′(w>j xi)xi]Exi′∼Dk

[φ′(w>l xi′)x

>i′ ]

−Exi∼Dk[φ′(w∗>

j xi)xi]Exi′∼Dk[φ′(w∗>

l xi′)x>i′ ]).

301

Page 318: Copyright by Kai Zhong 2018

Define α1(σ) = Ez∼D1 [φ′(σz)z]. Then

‖∆(2)j,l ‖ ≤ r(r − 1)

∥∥∥∥α1(‖wj‖)α1(‖wl‖)wjw>l − α1(‖w∗

j‖)α1(‖w∗l ‖)w∗

jw∗>l

∥∥∥∥≤ r(r − 1)

(∥∥∥α1(‖wj‖)α1(‖wl‖)wjw>l − α1(‖wj‖)α1(‖w∗

l ‖)wjw∗>l

∥∥∥+∥∥∥α1(‖wj‖)α1(‖w∗

l ‖)wjw∗>l − α1(‖w∗

j‖)α1(‖w∗l ‖)w∗

jw∗>l

∥∥∥). r2σ2p

1 (W ∗) · (‖W −W ∗‖/σt(W∗))1/2.

where the last inequality uses the same analysis in Claim C.3.9.

C.3.4 Proofs for Main results

Bounding the spectrum of the Hessian near the ground truth The goal

of this Section is to prove Theorem C.3.2

Theorem C.3.2 (Bounding the Spectrum of the Hessian near the Ground Truth, for-

mal version of Theorem 5.3.1). For any W ∈ Rd×k with ‖W−W ∗‖ . ρ2(σt)/(r2t2κ5λ2σ4p

1 )·

‖W ∗‖, let S denote a set of i.i.d. samples from distribution D (defined in (5.1)) and

let the activation function satisfy Property 1,4,3. For any t ≥ 1, if

|S| ≥ dr3 · poly(log d, s) · τκ8λ2σ4p1 /(ρ2(σt)),

then with probability at least 1− d−Ω(s),

Ω(rρ(σt)/(κ2λ))I ∇2fS(W ) O(tr2σ2p

1 )I.

Proof. The main idea of the proof follows the following inequalities,

∇2fD(W∗)− ‖∇2fS(W )−∇2fD(W

∗)‖I ∇2fS(W ) ∇2fD(W∗) + ‖∇2fS(W )−∇2fD(W

∗)‖I

302

Page 319: Copyright by Kai Zhong 2018

We first provide lower bound and upper bound for the range of the eigenvalues of

∇2fD(W∗) by using Lemma C.3.1. Then we show how to bound the spectral norm

of the remaining error, ‖∇2fS(W )−∇2fD(W∗)‖. ‖∇2fS(W )−∇2fD(W

∗)‖ can

be further decomposed into two parts, ‖∇2fS(W ) − HD(W )‖ and ‖HD(W ) −

∇2fD(W∗)‖, where HD(W ) is ∇2fD(W ) if φ is smooth, otherwise HD(W ) is a

specially designed matrix . We can upper bound them when W is close enough

to W ∗ and there are enough samples. In particular, if the activation satisfies Prop-

erty 3(a), we use Lemma C.3.6 to bound ‖HD(W )−∇2fD(W∗)‖ and Lemma C.3.7

to bound ‖HD(W ) − ∇2fS(W )‖. If the activation satisfies Property 3(b), we

use Lemma C.3.10 to bound ‖HD(W ) − ∇2fD(W∗)‖ and Lemma C.3.9 to bound

‖HD(W )−∇2fS(W )‖.

Finally we can complete the proof by setting ε = O(ρ(σ1)/(r2t2κ2λσ2p

1 )) in

Lemma C.3.5 and Lemma C.3.8.

If the activation satisfies Property 3(a), we set ‖W−W ∗‖ . ρ(σt)/(rtκ2λσp

1)

in Lemma C.3.5.

If the activation satisfies Property 3(b), we set ‖W−W ∗‖ . ρ2(σt)σt/(r2t2κ4λ2σ4p

1 )

in Lemma C.3.8.

Linear convergence of gradient descent The goal of this Section is to

prove Theorem C.3.3.

Theorem C.3.3 (Linear convergence of gradient descent, formal version of Theo-

rem B.3.2). Let W ∈ Rt×k be the current iterate satisfying

‖W −W ∗‖ . ρ2(σt)/(r2t2κ5λ2σ4p

1 )‖W ∗‖.

303

Page 320: Copyright by Kai Zhong 2018

Let S denote a set of i.i.d. samples from distribution D (defined in (5.1)). Let the

activation function satisfy Property 1,4 and 3(a). Define

m0 = Θ(rρ(σk)/(κ2λ)) and M0 = Θ(tr2σ2p

1 ).

For any s ≥ 1, if we choose

|S| ≥ d · poly(s, log d) · r2t2τκ8λ2σ4p1 /(ρ2(σt)) (C.12)

and perform gradient descent with step size 1/M0 on fS(W ) and obtain the next

iterate,

W † = W − 1

M0

∇fS(W ),

then with probability at least 1− d−Ω(s),

‖W † −W ∗‖2F ≤ (1− m0

M0

)‖W −W ∗‖2F .

Proof. Given a current iterate W , we set k(p+1)/2 anchor points W aa=1,2,··· ,k(p+1)/2

equally along the line ξW ∗ + (1 − ξ)W for ξ ∈ [0, 1]. Using Theorem C.3.2,

and applying a union bound over all the events, we have with probability at least

1−d−Ω(s) for all anchor points W aa=1,2,··· ,k(p+1)/2 , if |S| satisfies Equation (C.12),

then

m0I ∇2fS(Wa) M0I.

Then based on these anchors, using Lemma C.3.11 we have with probability

1− d−Ω(s), for any points W on the line between W and W ∗,

m0I ∇2fS(W ) M0I. (C.13)

304

Page 321: Copyright by Kai Zhong 2018

Let η be the stepsize.

‖W † −W ∗‖2F

= ‖W − η∇fS(W )−W ∗‖2F

= ‖W −W ∗‖2F − 2η〈∇fS(W ), (W −W ∗)〉+ η2‖∇fS(W )‖2F

We can rewrite fS(W ),

∇fS(W ) =

(∫ 1

0

∇2fS(W∗ + γ(W −W ∗))dγ

)vec(W −W ∗).

We define function HS(W ) : Rk×t → Rtk×tk such that

HS(W −W ∗) =

(∫ 1

0

∇2fS(W∗ + γ(W −W ∗))dγ

).

According to Eq. (C.13),

m0I HS(W −W ∗) M0I. (C.14)

‖∇fS(W )‖2F = 〈HS(W −W ∗), HS(W −W ∗)〉 ≤M0〈W −W ∗, HS(W −W ∗)〉

Therefore,

‖W −W ∗‖2F

≤ ‖W −W ∗‖2F − (−η2M0 + 2η)〈W −W ∗, H(W −W ∗)〉

≤ ‖W −W ∗‖2F − (−η2M0 + 2η)m0‖W −W ∗‖2F

= ‖W −W ∗‖2F −m0

M0

‖W −W ∗‖2F

≤ (1− m0

M0

)‖W −W ∗‖2F

where the third equality holds by setting η = 1/M0.

305

Page 322: Copyright by Kai Zhong 2018

Bounding the spectrum of the Hessian near the fixed point The goal of

this Section is to prove Lemma C.3.11.

Lemma C.3.11. Let S denote a set of samples from Distribution D defined in

Eq. (5.1). Let W a ∈ Rt×k be a point (respect to function fS(W )), which is in-

dependent of the samples S, satisfying ‖W a − W ∗‖ ≤ σt/2. Assume φ satisfies

Property 1, 4 and 3(a). Then for any s ≥ 1, if

|S| ≥ kpoly(log d, s),

with probability at least 1−d−Ω(s), for all W ∈ Rk×t 1 satisfying ‖W a−W‖ ≤ σt/4,

we have

‖∇2fS(W )−∇2fS(Wa)‖ ≤ r3t2σp

1(‖W a −W ∗‖+ ‖W −W a‖k(p+1)/2).

Proof. Let ∆ = ∇2fS(W ) −∇2fS(Wa) ∈ Rkt×kt, then ∆ can be thought of as t2

blocks, and each block has size k × k.

For each j, l ∈ [t] and j 6= l, we use ∆j,l to denote the off-diagonal block,

∆j,l =1

|S|∑x∈S

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>l xi)xi

)>

− 1

|S|∑x∈S

( r∑i=1

φ′(wa>j xi)xi

(r∑

i=1

φ′(wa>l xi)xi

)>

=1

|S|∑x∈S

r∑i=1

r∑i′=1

(φ′(w>

j xi)φ′(w>

l xi′)− φ′(wa>j xi)φ

′(wa>l xi′)

)xix

>i′

1which is not necessarily to be independent of samples

306

Page 323: Copyright by Kai Zhong 2018

For each j ∈ [t], we use ∆j,j to denote the diagonal block,

∆j,j =1

|S|∑

(x,y)∈S

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>j xi)xi

)>

+

(t∑

l=1

r∑i=1

φ(w>l xi)− y

(r∑

i=1

φ′′(w>j xi)xix

>i

)]

− 1

|S|∑

(x,y)∈S

( r∑i=1

φ′(wa>j xi)xi

(r∑

i=1

φ′(wa>j xi)xi

)>

+

(t∑

l=1

r∑i=1

φ(wa>l xi)− y

(r∑

i=1

φ′′(wa>j xi)xix

>i

)]

307

Page 324: Copyright by Kai Zhong 2018

We further decompose ∆j,j into ∆j,j = ∆(1)j,j +∆

(2)j,j , where

∆(1)j,j =

1

|S|∑

(x,y)∈S

[(t∑

l=1

r∑i=1

φ(w>l xi)− y

(r∑

i=1

φ′′(w>j xi)xix

>i

)]

− 1

|S|∑

(x,y)∈S

[(t∑

l=1

r∑i=1

φ(wa>l xi)− y

(r∑

i=1

φ′′(wa>j xi)xix

>i

)]

=1

|S|∑

(x,y)∈S

[(t∑

l=1

r∑i=1

(φ(w>

l xi)− φ(w∗>l xi)

))·

(r∑

i=1

φ′′(w>j xi)xix

>i

)]

− 1

|S|∑

(x,y)∈S

[(t∑

l=1

r∑i=1

(φ(wa>

l xi)− φ(w∗>l xi)

))·

(r∑

i=1

φ′′(wa>j xi)xix

>i

)]

=1

|S|∑x∈S

t∑l=1

r∑i=1

r∑i′=1

((φ(w>

l xi)− φ(w∗>l xi))φ

′′(w>j xi′)

− (φ(wa>l xi)− φ(w∗>

l xi))φ′′(wa>

j xi′)

)xi′x

>i′

=1

|S|∑x∈S

t∑l=1

r∑i=1

r∑i′=1

((φ(w>

l xi)− φ(wa>l xi))φ

′′(w>j xi′)

)xi′x

>i′

+1

|S|∑x∈S

t∑l=1

r∑i=1

r∑i′=1

((φ(wa>

l xi)− φ(w∗>l xi))(φ

′′(wa>j xi′) + φ′′(w>

j xi′))

)xi′x

>i′

= ∆(1,1)j,j +∆

(1,2)j,j ,

and

∆(2)j,j =

1

|S|∑x∈S

( r∑i=1

φ′(w>j xi)xi

(r∑

i=1

φ′(w>j xi)xi

)>

− 1

|S|∑x∈S

( r∑i=1

φ′(wa>j xi)xi

(r∑

i=1

φ′(wa>j xi)xi

)>

=1

|S|∑x∈S

r∑i=1

r∑i′=1

(φ′(w>

j xi)φ′(w>

j xi′)− φ′(wa>j xi)φ

′(wa>j xi′)

)xix

>i′

308

Page 325: Copyright by Kai Zhong 2018

Combining Claims C.3.11, C.3.12, C.3.14 C.3.13 and taking a union bound over

O(t2) events, we have

‖∇2fS(W )−∇2fS(Wa)‖ ≤

t∑j=1

‖∆(1)j,j ‖+ ‖∆

(2)j,j ‖+

∑j 6=l

‖∆j,l‖

. r3t2σp1(‖W a −W ∗‖+ ‖W −W a‖k(p+1)/2),

holds with probability at least 1− d−Ω(s).

Claim C.3.11. For each j ∈ [t], if |S| ≥ kpoly(log d, s), then

‖∆(1,1)j,j ‖ . tr2σp

1‖W a −W‖k(p+1)/2

holds with probability 1− d−Ω(s).

Proof. Recall the definition ∆(1,1)j,j ,

1

|S|∑x∈S

t∑l=1

r∑i=1

r∑i′=1

((φ(w>

l xi)− φ(wa>l xi))φ

′′(w>j xi′)

)xi′x

>i′ .

In order to upper bound ‖∆(1,1)j,j ‖, it suffices to upper bound the spectral norm of

1

|S|∑x∈S

((φ(w>

l xi)− φ(wa>l xi))φ

′′(w>j xi′)

)xi′x

>i′ .

We focus on the case for i = i′. The case for i 6= i′ is similar. Note that

−2L2L1(‖wl‖p + ‖wl‖p)‖xi‖p+1xix>i

((φ(w>

l xi)− φ(wa>l xi))φ

′′(w>j xi)

)xix

>i

2L2L1(‖wl‖p + ‖wl‖p)‖xi‖p+1xix>i

Define function h1(x) : Rk → R

h1(x) = ‖x‖p+1

309

Page 326: Copyright by Kai Zhong 2018

(I) Bounding |h(x)|.

By Lemma 2.4.3, we have h(x) . (sk log dn)(p+1)/2 with probability at

least 1− 1/(nd4s).

(II) Bounding ‖Ex∼Dk[‖x‖p+1xx>]‖.

Let g(x) = (2π)−k/2e−‖x‖2/2. Note that xg(x)dx = −dg(x).

Ex∼Dk

[‖x‖p+1xx>] = ∫

‖x‖p+1g(x)xx>dx

= −∫‖x‖p+1d(g(x))x>

= −∫‖x‖p+1d(g(x)x>) +

∫‖x‖p+1g(x)Ikdx

=

∫d(‖x‖p+1)g(x)x> +

∫‖x‖p+1g(x)Ikdx

=

∫(p+ 1)‖x‖p−1g(x)xx>dx+

∫‖x‖p+1g(x)Ikdx

∫‖x‖p+1g(x)Ikdx

= Ex∼Dk[‖x‖p+1]Ik.

Since ‖x‖2 follows χ2 distribution with degree k, Ex∼Dk[‖x‖q] = 2q/2 Γ((k+q)/2)

Γ(k/2)for

any q ≥ 0. So, kq/2 . Ex∼Dk[‖x‖q] . kq/2. Hence, ‖Ex∼Dk

[h(x)xx>]‖ & k(p+1)/2.

Also

∥∥Ex∼Dk

[h(x)xx>]∥∥ ≤ max

‖a‖=1Ex∼Dk

[h(x)(x>a)2

]≤ max

‖a‖=1

(Ex∼Dk

[h2(x)

])1/2(Ex∼Dk

[(x>a)4

])1/2. k(p+1)/2.

(III) Bounding (Ex∼Dk[h4(x)])1/4.

310

Page 327: Copyright by Kai Zhong 2018

(Ex∼Dk

[h4(x)])1/4

. k(p+1)/2.

Define function B(x) = h(x)xx> ∈ Rk×k, ∀i ∈ [n]. Let B = Ex∼Dd[h(x)xx>].

Therefore by applying Corollary B.1.1, we obtain for any 0 < ε < 1, if

|S| ≥ ε−2kpoly(log d, s)

with probability at least 1− 1/dΩ(s),∥∥∥∥∥ 1

|S|∑x∈S

‖x‖p+1xx> − Ex∼Dk

[‖x‖p+1xx>]∥∥∥∥∥ . εk(p+1)/2.

Therefore we have with probability at least 1− 1/dΩ(s),∥∥∥∥∥ 1

|S|∑x∼S

‖x‖p+1xx>

∥∥∥∥∥ . k(p+1)/2. (C.15)

Claim C.3.12. For each j ∈ [t], if |S| ≥ kpoly(log d, s), then

‖∆(1,2)j,j ‖ . tr2σp

1‖W a −W ∗‖

holds with probability 1− d−Ω(s).

Proof. Recall the definition of ∆(1,2)j,l ,

1

|S|∑x∈S

t∑l=1

r∑i=1

r∑i′=1

((φ(wa>

l xi)− φ(w∗>l xi))(φ

′′(wa>j xi′) + φ′′(w>

j xi′))

)xi′x

>i′ .

311

Page 328: Copyright by Kai Zhong 2018

In order to upper bound ‖∆(1,2)j,l ‖, it suffices to upper bound the spectral norm of

this quantity,

1

|S|∑x∈S

((φ(wa>

l xi)− φ(w∗>l xi))(φ

′′(wa>j xi′) + φ′′(w>

j xi′))

)xi′x

>i′ ,

where ∀l ∈ [t], i ∈ [r], i′ ∈ [r]. We define function h(y, z) : R2k → R such that

h(y, z) = |φ(wa>l y)− φ(w∗>

l y)| · (|φ′′(wa>

j z)|+ |φ′′(w>j z)|).

We define function B(y, z) : R2k → Rk×k such that

B(y, z) = |φ(wa>l y)− φ(w∗>

l y)| · (|φ′′(wa>

j z)|+ |φ′′(w>j z)|) · zz> = h(y, z)zz>.

Using Property 1, we can show

|φ(wa>l y)− φ(w∗>

l y)| ≤ |(wal − w∗

l )>y| · (|φ′(wa>

l y)|+ |φ′(w∗>l y)|)

≤ |(wal − w∗

l )>y| · L1 · (|wa>

l y|p + |w∗>l y|p)

Using Property 3, we have (|φ′′(wa>j z)| + |φ′′(w>

j z)|) ≤ 2L2. Thus, h(y, z) ≤

2L1L2|(wal − w∗

l )>y| · (|wa>

l y|p + |w∗>l y|p).

Using Lemma 2.4.2, matrix Bernstein inequality Corollary B.1.1, we have,

if |S| ≥ kpoly(log d, s),∥∥∥∥∥∥Ey,z∼Dk[B(y, z)]− 1

|S|∑

(y,z)∈S

B(y, z)

∥∥∥∥∥∥ . ‖Ey,z∼Dk[B(y, z)]‖

. ‖w∗l ‖p‖w∗

l − wal ‖

where S denote a set of samples from distribution D2k. Thus, we obtain∥∥∥∥∥∥ 1

|S|∑

(y,z)∈S

B(y, z)

∥∥∥∥∥∥ . ‖W a −W ∗‖σp1.

312

Page 329: Copyright by Kai Zhong 2018

Taking the union bound over O(tr2) events, summing up those O(tr2) terms com-

pletes the proof.

Claim C.3.13. For each (j, l) ∈ [t]× [t] and j 6= l, if |S| ≥ kpoly(log d, s), then

‖∆j,l‖ . r2σp1‖W a −W‖k(p+1)/2

holds with probability 1− d−Ω(s).

Proof. Recall

∆j,l :=1

|S|∑x∈S

r∑i=1

r∑i′=1

(φ′(w>

j xi)φ′(w>

l xi′)− φ′(wa>j xi)φ

′(wa>l xi′)

)xix

>i′

=1

|S|∑x∈S

r∑i=1

r∑i′=1

(φ′(w>

j xi)φ′(w>

l xi′)− φ′(w>j xi)φ

′(wa>l xi′)

+ φ′(w>j xi)φ

′(wa>l xi′)− φ′(wa>

j xi)φ′(wa>

l xi′)

)xix

>i′

We just need to consider

1

|S|∑x∈S

r∑i=1

r∑i′=1

(φ′(w>

j xi)(φ′(w>

l xi′)− φ′(wa>l xi′))

)xix

>i′

Recall that x = [x>1 x>

2 · · ·x>r ]

>, xi ∈ Rk,∀i ∈ [r] and d = rk. We define

X = [x1 x2 · · · xr] ∈ Rk×r. Let φ′(X>wj) ∈ Rr denote the vector

[φ′(x>

1 wj) φ′(x>2 wj) · · · φ′(x>

r wj)]> ∈ Rr.

Let pl(X) denote the vector

[φ′(w>

l x1)− φ′(wa>l x1) φ′(w>

l x2)− φ′(wa>l x2) · · · φ′(w>

l xr)− φ′(wa>l xr)

]> ∈ Rr.

313

Page 330: Copyright by Kai Zhong 2018

We define function B(x) : Rd → Rk×k such that

B(x) = X︸︷︷︸k×r

φ′(X>wj)︸ ︷︷ ︸r×1

pl(X)>︸ ︷︷ ︸1×r

X>︸︷︷︸r×k

.

Note that

‖φ′(X>wj)‖‖pl(X)‖ ≤ L1L2‖wj‖p‖wl − wal ‖

(r∑

i=1

‖xi‖p)·

(r∑

i=1

‖xi‖

).

We define function B(x) : Rd → Rk×k such that

B(x) = L1L2‖wj‖p‖wl − wal ‖

(r∑

i=1

‖xi‖p)·

(r∑

i=1

‖xi‖

)XX>

Also note that[0 B(x)

B>(x) 0

]=

[X 00 X

][0 φ′(X>wj)pl(X)>

pl(X)φ′(X>wj)> 0

][X> 00 X>

]We can lower and upper bound the above term by

−[B(x) 00 B(x)

]

[0 B(x)

B>(x) 0

][B(x) 00 B(x)

]

Therefore,

‖∆j,l‖ =

∥∥∥∥∥ 1

|S|∑x∈S

B(x)

∥∥∥∥∥ .

∥∥∥∥∥ 1

|S|∑x∈S

B(x)

∥∥∥∥∥Define

F (x) :=

(r∑

i=1

‖xi‖p)·

(r∑

i=1

‖xi‖

)XX>.

To bound ‖Ex∼DdF (x) − 1

|S|∑

x∈S F (x)‖, we apply Lemma D.3.9. The

following proof discuss the four properties in Lemma D.3.9.

314

Page 331: Copyright by Kai Zhong 2018

(I)

‖F (x)‖ ≤

(r∑

i=1

‖xi‖p)·

(r∑

i=1

‖xi‖

)3

By using Lemma 2.4.3, we have with probability 1− 1/nd4s,

‖F (x)‖ . r4k3/2+p/2 log3/2+p/2 n

(II)

∥∥∥∥ Ex∼Dd

[F (x)]

∥∥∥∥=

∥∥∥∥∥r · Ex∼Dd

[(r∑

i=1

‖xi‖p)·

(r∑

i=1

‖xi‖

)xjx

>j

]∥∥∥∥∥& r3kp/2+1/2

The upper bound can be obtained similarly,∥∥∥∥ Ex∼Dd

[F (x)]

∥∥∥∥ . r3kp/2+1/2

(III)

max

(∥∥∥∥ Ex∼Dd

[F (x)F (x)>]

∥∥∥∥,∥∥∥∥ Ex∼Dd

[F (x)>F (x)]

∥∥∥∥)

= max‖a‖=1

Ex∼Dd

( r∑i=1

‖xi‖p)2

·

(r∑

i=1

‖xi‖

)2

‖X‖2‖X>a‖2

. r7kp+2.

315

Page 332: Copyright by Kai Zhong 2018

(IV)

max‖a‖=‖b‖=1

(E

B∼B

[(a>F (x)b)2

])1/2= max

‖a‖=1,‖b‖=1

Ex∼N(0,Id)

(a>( r∑i=1

‖xi‖p)·

(r∑

i=1

‖xi‖

)XX>b

)21/2

. r5/2kp/2+1/2.

Using Lemma 2.4.2 and matrix Bernstein inequality Lemma D.3.9, we have,

if |S| ≥ rkpoly(log d, s), with probability at least 1− 1/dΩ(s),∥∥∥∥∥Ex∼Dd[B(x)]− 1

|S|∑x∈S

B(x)

∥∥∥∥∥ . ‖Ex∼Dd[B(x)]‖

. ‖wj‖p‖wl − wal ‖r3k(p+1)/2

Thus, we obtain ∥∥∥∥∥ 1

|S|∑x∈S

B(x)

∥∥∥∥∥ . ‖W a −W‖σp1r

3k(p+1)/2.

We complete the proof.

Claim C.3.14. For each j ∈ [t], if |S| ≥ kpoly(log d, s), then

‖∆(2)j,j ‖ . tr2σp

1‖W a −W‖k(p+1)/2

holds with probability 1− d−Ω(s).

Proof. ∆(2)j,j is a special case of ∆j,l, so we refer readers to the proofs in Claim C.3.13.

316

Page 333: Copyright by Kai Zhong 2018

Appendix D

Non-linear Inductive Matrix Completion

D.1 Proof Overview

At high level the proofs for Theorem 6.3.1 and Theorem 6.3.3 include the

following steps. 1) Show that the population Hessian at the ground truth is positive

definite. 2) Show that population Hessians near the ground truth are also positive

definite. 3) Employ matrix Bernstein inequality to bound the population Hessian

and the empirical Hessian.

We now formulate the Hessian. The Hessian of Eq. (6.3), ∇2fΩ(U, V ) ∈

R(2kd)×(2kd), can be decomposed into two types of blocks, (i ∈ [k], j ∈ [k]),

∂2fΩ(U, V )

∂ui∂vj,∂2fΩ(U, V )

∂ui∂uj

,

where ui(vj , resp.) is the i-th column of U (j-th column of V , resp.). Note that each

of the above second-order derivatives is a d× d matrix.

The first type of blocks are given by:

∂2fΩ(U, V )

∂ui∂vj= E

Ω

[φ′(u>

i x)φ′(v>j y)xy

>φ(v>i y)φ(u>j x)]+ δijE

Ω

[hx,y(U, V )φ′(u>

i x)φ′(v>i y)xy

>],where EΩ[·] = 1

|Ω|∑

(x,y)∈Ω[·], δij = 1i=j , and

hx,y(U, V ) = φ(U>x)>φ(V >y)− φ(U∗>x)>φ(V ∗>y).

317

Page 334: Copyright by Kai Zhong 2018

For sigmoid/tanh activation function, the second type of blocks are given

by:

∂2fΩ(U, V )

∂ui∂uj

= EΩ

[φ′(u>

i x)φ′(u>

j x)xx>φ(v>i y)φ(v

>j y)]+ δijE

Ω

[hx,y(U, V )φ′′(u>

i x)φ(v>i y)xx

>].(D.1)

For ReLU/leaky ReLU activation function, the second type of blocks are given by:

∂2fΩ(U, V )

∂ui∂uj

= EΩ

[φ′(u>

i x)φ′(u>

j x)xx>φ(v>i y)φ(v

>j y)].

Note that the second term of Eq. (D.1) is missing here as (U, V ) are fixed, the

number of samples is finite and φ′′(z) = 0 almost everywhere.

In this section, we will discuss important lemmas/theorems for Step 1 in

Appendix D.1.1 and Step 2,3 in Appendix D.1.3.

D.1.1 Positive definiteness of the population hessian

The corresponding population risk for Eq. (6.3) is given by:

fD(U, V ) =1

2E(x,y)∼D[(φ(U

>x)>φ(V >y)− A(x, y))2], (D.2)

where D := X×Y. For simplicity, we also assume X and Y are normal distributions.

Now we study the Hessian of the population risk at the ground truth. Let the

Hessian of fD(U, V ) at the ground-truth (U, V ) = (U∗, V ∗) be H∗ ∈ R(2dk)×(2dk),

which can be decomposed into the following two types of blocks (i ∈ [k], j ∈ [k]),

∂2fD(U∗, V ∗)

∂ui∂uj

=Ex,y

[φ′(u∗>

i x)φ′(u∗>j x)xx>φ(v∗>i y)φ(v

∗>j y)

],

∂2fD(U∗, V ∗)

∂ui∂vj=Ex,y

[φ′(u∗>

i x)φ′(v∗>j y)xy>φ(v∗>i y)φ(u∗>j x)

].

318

Page 335: Copyright by Kai Zhong 2018

To study the positive definiteness of H∗, we characterize the minimal eigen-

value of H∗ by a constrained optimization problem,

λmin(H∗) = min

(a,b)∈BEx,y

( k∑i=1

φ′(u∗>i x)φ(v∗>i y)x>ai + φ′(v∗>i y)φ(u∗>

i x)y>bi

)2,

(D.3)

where (a, b) ∈ B denotes that∑k

i=1 ‖ai‖2 + ‖bi‖2 = 1. Obviously, λmin(H∗) ≥ 0

due to the squared loss and the realizable assumption. However, this is not sufficient

for the local convexity around the ground truth, which requires the positive (semi-

)definiteness for the neighborhood around the ground truth. In other words, we

need to show that λmin(H∗) is strictly greater than 0, so that we can characterize

an area in which the Hessian still preserves positive definiteness (PD) despite the

deviation from the ground truth.

Challenges. As we mentioned previously there are activation functions that

lead to redundancy in parameters. Hence one challenge is to distill properties of

the activation functions that preserve the PD. Another challenge is the correlation

introduced by U∗ when it is non-orthogonal. So we first study the minimal eigen-

value for orthogonal U∗ and orthogonal V ∗ and then link the non-orthogonal case

to the orthogonal case.

D.1.2 Warm up: orthogonal case

In this section, we consider the case when U∗, V ∗ are unitary matrices, i.e.,

U∗>U∗ = U∗U∗> = Id. (d = k). This case is easier to analyze because the

319

Page 336: Copyright by Kai Zhong 2018

dependency between different elements of x or y can be disentangled. And we are

able to provide lower bound for the Hessian. Before we introduce the lower bound,

let’s first define the following quantities for an activation function φ.

αi,j :=Ez∼N(0,1)[(φ(z))izj],

βi,j :=Ez∼N(0,1)[(φ′(z))izj],

γ :=Ez∼N(0,1)[φ(z)φ′(z)z],

ρ :=min(α2,0β2,0 − α21,0β

21,0 − β2

1,0α21,1), (α2,0β2,2 − α2

1,0β21,2 − γ2).

(D.4)

We now present a lower bound for general activation functions including

sigmoid and tanh.

Lemma D.1.1. Let (a, b) ∈ B denote that∑k

i=1 ‖ai‖2 + ‖bi‖2 = 1. Assume d = k

and U∗, V ∗ are unitary matrices, i.e., U∗>U∗ = U∗U∗> = V ∗V ∗> = V ∗>V ∗ = Id,

then the minimal eigenvalue of the population Hessian in Eq. (D.3) can be simplified

as,

min(a,b)∈B

Ex,y

( k∑i=1

φ′(xi)φ(yi)x>ai + φ′(yi)φ(xi)y

>bi

)2.

Let β, ρ be defined as in Eq. (D.4). If the activation function φ satisfies β1,1 = 0,

then λmin(H∗) ≥ ρ.

Since sigmoid and tanh have symmetric derivatives w.r.t. 0, they satisfy

β1,1 = 0. Specifically, we have ρ ≈ 0.000658 for sigmoid and ρ ≈ 0.0095 for

tanh. Also for ReLU, β1,1 = 1/2, so ReLU does not fit in this lemma. The full

proof of Lemma D.1.1, the lower bound of the population Hessian for ReLU and

the extension to non-orthogonal cases can be found in Appendix D.2.

320

Page 337: Copyright by Kai Zhong 2018

D.1.3 Error bound for the empirical Hessian near the ground truth

In the previous section, we have shown PD for the population Hessian at

the ground truth for the orthogonal cases. Based on that, we can characterize the

landscape around the ground truth for the empirical risk. In particular, we bound the

difference between the empirical Hessian near the ground truth and the population

Hessian at the ground truth. The theorem below provides the error bound w.r.t. the

number of samples (n1, n2) and the number of observations |Ω| for both sigmoid

and ReLU activation functions.

Theorem D.1.1. For any ε > 0, if

n1 & ε−2td log2 d, n2 & ε−2td log2 d, |Ω| & ε−2td log2 d,

then with probability at least 1− d−t, for sigmoid/tanh,

‖∇2fΩ(U, V )−∇2fD(U∗, V ∗)‖ . ε+ ‖U − U∗‖+ ‖V − V ∗‖;

for ReLU,

‖∇2fΩ(U, V )−∇2fD(U∗, V ∗)‖ .

(‖V − V ∗‖1/2 + ‖U − U∗‖1/2 + ε

)(‖U∗‖+ ‖V ∗‖)2.

The key idea to prove this theorem is to use the population Hessian at (U, V )

as a bridge.

On one side, we bound the population Hessian at the ground truth and the

population Hessian at (U, V ). This would be easy if the second derivative of the

activation function is Lipschitz, which is the case of sigmoid and tanh. But ReLU

321

Page 338: Copyright by Kai Zhong 2018

doesn’t have this property. However, we can utilize the condition that the param-

eters are close enough to the ground truth and the piece-wise linearity of ReLU to

bound this term.

On the other side, we bound the empirical Hessian and the population Hes-

sian. A natural idea is to apply matrix Bernstein inequality. However, there are

two obstacles. First the Gaussian variables are not uniformly bounded. Therefore,

we instead use Lemma B.7 in [153], which is a loosely-bounded version of matrix

Bernstein inequality. The second obstacle is that each individual Hessian calculated

from one observation (x, y) is not independent from another observation (x′, y′),

since they may share the same feature x or y. The analyses for vanilla IMC and MC

assume all the items(users) are given and the observed entries are independently

sampled from the whole matrix. However, our observations are sampled from the

joint distribution of X and Y.

To handle the dependency, our model assumes the following two-stage sam-

pling rule. First, the items/users are sampled from their distributions independently,

then given the items and users, the observations Ω are sampled uniformly with re-

placement. The key question here is how to combine the error bounds from these

two stages. Fortunately, we found special structures in the blocks of Hessian which

enables us to separate x, y for each block, and bound the errors in stage separately.

See Appendix D.3 for details.

D.2 Positive Definiteness of Population Hessian

We state some useful facts in this section.

322

Page 339: Copyright by Kai Zhong 2018

Fact D.2.1. Let A =[a1 a2 · · · ak

]. Let diag(A) ∈ Rk denote the vector where

the i-th entry is Ai,i, ∀i ∈ [k]. Let 1 ∈ Rk denote the vector that the i-th entry is 1,

∀i ∈ [k]. We have the following properties,

(I)k∑

i=1

(a>i ei)2 = ‖ diag(A)‖22,

(II)k∑

i=1

(a>i ai)2 = ‖A‖2F ,

(III)k∑

i=1

k∑j=1

(a>i aj) = ‖A · 1‖22,

(IV)∑i 6=j

a>i aj = ‖A · 1‖22 − ‖A‖2F .

Proof. Using the definition, it is easy to see that (I), (II) and (III) are holding.

Proof of (IV), we have

∑i 6=j

a>i aj =∑i,j

a>i aj −k∑

i=1

a>i ai = ‖A · 1‖22 − ‖A‖2F .

where the last step follows by (II) and (III).

Fact D.2.2. Let A =[a1 a2 · · · ak

]. Let diag(A) ∈ Rk denote the vector where

the i-th entry is Ai,i, ∀i ∈ [k]. Let 1 ∈ Rk denote the vector that the i-th entry is 1,

323

Page 340: Copyright by Kai Zhong 2018

∀i ∈ [k]. We have the following properties,

(I)∑i 6=j

a>i eie>i aj = (diag(A)> · (A · 1))− ‖ diag(A)‖22,

(II)∑i 6=j

a>i eje>j aj = (diag(A)> · (A · 1))− ‖ diag(A)‖22,

(III)∑i 6=j

a>i eia>j ej = (diag(A)> · 1)2 − ‖ diag(A)‖22,

(IV)∑i 6=j

a>i eja>j ei = 〈A>, A〉 − ‖ diag(A)‖22.

Proof. Proof of (I). We have

∑i 6=j

a>i eie>i aj =

∑i,j

a>i eie>i aj −

k∑i=1

a>i eie>i ai

=∑i,j

ai,ie>i aj − ‖ diag(A)‖22

=k∑

i=1

ai,ie>i

k∑j=1

aj − ‖ diag(A)‖22

= (diag(A)> · (A · 1))− ‖ diag(A)‖22

Proof of (II). It is similar to (I).

Proof of (III). We have∑i 6=j

a>i eia>j ej =

∑i,j

a>i eia>j ej −

∑i=1

a>i eia>i ei

=k∑

i=1

a>i ei ·k∑

j=1

a>j ej −k∑

i=1

a>i eia>i ei

=k∑

i=1

ai,i ·k∑

j=1

aj,j −k∑

i=1

ai,iai,i

= (diag(A)> · 1)2 − ‖ diag(A)‖22

324

Page 341: Copyright by Kai Zhong 2018

Proof of (IV). We have

∑i 6=j

a>i eja>j ei =

∑i 6=j

tr(a>i eja

>j ei)

=∑i 6=j

tr(eja

>j eia

>i

)=∑i 6=j

〈eja>j , aie>i 〉

=∑i,j

〈eja>j , aie>i 〉 −k∑

i=1

〈eia>i , aie>i 〉

= 〈A>, A〉 − ‖ diag(A)‖22.

where the second step follows by tr(ABCD) = tr(BCDA), the third step follows

by tr(AB) = 〈A,B>〉.

D.2.1 Orthogonal case

We first study the orthogonal case, where d = k and U∗, V ∗ are unitary

matrices, i.e., U∗>U∗ = U∗U∗> = V ∗V ∗> = V ∗>V ∗ = Id.

Lower bound on minimum eigenvalue

Lemma D.2.1 (Restatement of Lemma D.1.1). Let (a, b) ∈ B denote that∑k

i=1 ‖ai‖2+

‖bi‖2 = 1. Assume d = k and U∗, V ∗ are unitary matrices, i.e., U∗>U∗ =

U∗U∗> = V ∗V ∗> = V ∗>V ∗ = Id, then the minimal eigenvalue of the popula-

tion Hessian in Eq. (D.3) can be simplified as,

λmin(H∗) = min

(a,b)∈BEx,y

( k∑i=1

φ′(xi)φ(yi)x>ai + φ′(yi)φ(xi)y

>bi

)2. (D.5)

325

Page 342: Copyright by Kai Zhong 2018

Let β, ρ be defined as in Eq. (D.4). If the activation function φ satisfies β1,1 = 0,

then λmin(H∗) ≥ ρ.

Proof. In the orthogonal case, we can easily transform Eq. (D.3) to Eq. (D.5) since

x, y are normal distribution. Now we can decompose Eq. (D.5) into the following

three terms.

Ex,y

( k∑i=1

φ′(xi)φ(yi)x>ai + φ′(yi)φ(xi)y

>bi

)2

= Ex,y

( k∑i=1

φ′(xi)φ(yi)x>ai

)2

︸ ︷︷ ︸C

+Ex,y

( k∑i=1

φ′(yi)φ(xi)y>bi

)2

+ 2Ex,y

[∑i,j

φ′(xi)φ(yi)x>aiφ

′(yj)φ(xj)y>bj

]︸ ︷︷ ︸

D

.

Note that the first term is similar to the second term, so we just lower bound the first

term and the third term. Define A = [a1, a2, · · · , ak], B = [b1, b2, · · · , bk]. Let Ao

be the off-diagonal part of A and Ad be the diagonal part of A, i.e., Ao + Ad = A.

And let gA = diag(A) be the vector of the diagonal elements of A. We will bound

C and D in the following.

326

Page 343: Copyright by Kai Zhong 2018

For C, we have

Ex,y

( k∑i=1

φ′(xi)φ(yi)x>ai

)2

=k∑

i=1

Ex,y

[(φ′(xi)φ(yi)x

>ai)2]

+∑i 6=j

Ex,y

[φ′(xi)φ(yi)x

>ai · φ′(xj)φ(yj)x>aj]

=k∑

i=1

α2,0

[(a>i ei)

2(β2,2 − β2,0) + β2,0‖ai‖2]

+∑i 6=j

α21,0

[β21,0a

>i aj + (β1,2β1,0 − β2

1,0)(a>i eie

>i aj + a>i eja

>j ej) + β2

1,1(a>i eia

>j ej + a>i eja

>j ei)

]= C1 + C2.

where the last step follows by

C1 =k∑

i=1

α2,0

[(a>i ei)

2(β2,2 − β2,0) + β2,0‖ai‖2]

C2 =∑i 6=j

α21,0

[β21,0a

>i aj + (β1,2β1,0 − β2

1,0)(a>i eie

>i aj + a>i eja

>j ej) + β2

1,1(a>i eia

>j ej + a>i eja

>j ei)

]First we can simplify C1 in the following sense,

C1 = α2,0(β2,2 − β2,0)k∑

i=1

(a>i ei)2 + α2,0β2,0

k∑i=1

‖ai‖22

= α2,0(β2,2 − β2,0)‖ diag(A)‖22 + α2,0β2,0‖A‖2F ,

where the last step follows by Fact D.2.1.

We can rewrite C2 in the following sense

C2 = α21,0(β

21,0C2,1 + (β1,2β1,0 − β2

1,0) · (C2,2 + C2,3) + β21,1(C2,4 + C2,5)).

327

Page 344: Copyright by Kai Zhong 2018

where

C2,1 =∑i 6=j

a>i aj

C2,2 =∑i 6=j

a>i eie>i aj

C2,3 =∑i 6=j

a>i eje>j aj

C2,4 =∑i 6=j

a>i eia>j ej

C2,5 =∑i 6=j

a>i eja>j ei

Using Fact D.2.1, we have

C2,1 = ‖A · 1‖22 − ‖A‖2F .

Using Fact D.2.2, we have

C2,2 = (diag(A)> · (A · 1))− ‖ diag(A)‖22,

C2,3 = (diag(A)> · (A · 1))− ‖ diag(A)‖22,

C2,4 = (diag(A)> · 1)2 − ‖ diag(A)‖22,

C2,5 = 〈A>, A〉 − ‖ diag(A)‖22.

Thus,

C2 = α21,0(β

21,0(‖A · 1‖22 − ‖A‖2F )

+ (β1,2β1,0 − β21,0)2 · (diag(A)> · (A · 1)− ‖ diag(A)‖22)

+ β21,1((diag(A)

> · 1)2 + 〈A>, A〉 − 2‖ diag(A)‖22)).

328

Page 345: Copyright by Kai Zhong 2018

We consider C1+C2 by focusing different terms, for the ‖A‖2F (from C1 and

C2), we have

(α2,0β2,0 − α21,0β

21,0)‖A‖2F .

For the term 〈A,A>〉 (from C2,5), we have

α21,0β

21,1〈A,A>〉.

For the term ‖ diag(A)‖22 (from C1 and C2), we have

(α2,0(β2,2 − β2,0)− 2α21,0(β1,2β1,0 − β2

1,0)− 2α1,0β21,1)‖ diag(A)‖22

For the term ‖A · 1‖22 (from C2,1), we have

α21,0β

21,0‖A · 1‖22.

For the term diag(A)> · A · 1 (from C2,2 and C2,3), we have

2α21,0(β1,2β1,0 − β2

1,0) diag(A)> · A · 1.

For the term (diag(A)> · 1)2 (from C2,4), we have

α21,0β

21,1(diag(A)

> · 1)2.

329

Page 346: Copyright by Kai Zhong 2018

Putting it all together, we have

C1 + C2

= (α2,0β2,0 − α21,0β

21,0)‖A‖2F + α2

1,0β21,1〈A,A>〉

+ (α2,0(β2,2 − β2,0)− 2α21,0(β1,2β1,0 − β2

1,0)− 2α21,0β

21,1) · ‖ diag(A)‖2

+ α21,0β

21,0‖A · 1‖2 + 2α2

1,0(β1,2β1,0 − β21,0)(diag(A)

> · A · 1) + α21,0β

21,1(diag(A)

> · 1)2

= (α2,0β2,0 − α21,0β

21,0)(‖Ao‖2F + ‖gA‖2) + α2

1,0β21,1(〈Ao, A

>o 〉+ ‖gA‖2)

+ (α2,0β2,2 − α2,0β2,0 − 2α21,0β1,2β1,0 + 2α2

1,0β21,0 − 2α2

1,0β21,1) · ‖gA‖2

+ α21,0β

21,0(‖gA‖2 + ‖Ao · 1‖2 + 2g>A · Ao · 1)

+ 2α21,0(β1,2β1,0 − β2

1,0)(g>A · Ao · 1+ ‖gA‖2) + α2

1,0β21,1(g

>A · 1)2

= (α2,0β2,0 − α21,0β

21,0)‖Ao‖2F + α2

1,0β21,1〈Ao, A

>o 〉+ (α2,0β2,2 − α2

1,0β21,1) · ‖gA‖2

+ α21,0β

21,0(‖Ao · 1‖2) + 2α2

1,0β1,2β1,0(g>A · Ao · 1) + α2

1,0β21,1(g

>A · 1)2.

By doing a series of equivalent transformations, we have removed the expectation

and the formula C becomes a form of A and the moments of φ. These equivalent

transforms are mainly based on the fact that xi, xj, yi, yj for any i 6= j are indepen-

dent on each other.

330

Page 347: Copyright by Kai Zhong 2018

Similarly we can reformulate D,

Ex,y

[∑i,j

φ′(xi)φ(yi)x>aiφ

′(yj)φ(xj)y>bj

]=∑i

Ex,y

[φ′(xi)φ(yi)x

>aiφ′(yji)φ(xi)y

>bi]+∑i 6=j

Ex,y

[φ′(xi)φ(yi)x

>aiφ′(yj)φ(xj)y

>bj]

=∑i

γ2a>i eib>i ei +

∑i 6=j

α21,1a

>i ejb

>j ei + α1,1β1,1(a

>i ejb

>j ej + a>i eib

>j ei) + β2

1,1a>i eib

>j ej

= (γ2 − β21,0α

21,1 − 2α1,0α1,1β1,0β1,1 − α2

1,0β21,1)g

>AgB

+ β21,0α

21,1〈A,B>〉+ α2

1,0β21,1(g

>A1)(g

>B1)

+ α1,0α1,1β1,0β1,1[(A1)>gB + (B1)>gA]

= (γ2 − α21,0β

21,1)g

>AgB + β2

1,0α21,1〈Ao, B

>o 〉+ α2

1,0β21,1(g

>A1)(g

>B1)

+ α1,0α1,1β1,0β1,1[(Ao1)>gB + (Bo1)

>gA].

Combining the above results, we have

λmin(H∗) = min

‖A‖2F+‖B‖2F=1

(β21,0α

21,1‖Ao +B>

o ‖2F

+ ‖α1,0β1,0Ao1 + α1,0β1,2gA + α1,1β1,1gB‖2

+ ‖α1,0β1,0Bo1 + α1,0β1,2gB + α1,1β1,1gA‖2

+ (α2,0β2,0 − α21,0β

21,0 − β2

1,0α21,1 − α2

1,0β21,1)(‖Ao‖2F + ‖Bo‖2F )

+ 1/2 · α21,0β

21,1(‖Ao + A>

o ‖2F + ‖Bo +B>o ‖2F )

+ [α2,0β2,2 − α21,0β

21,1 − α2

1,0β21,2 − α2

1,1β21,1] · (‖gA‖2 + ‖gB‖2)

+ 2(γ2 − α21,0β

21,1 − 2α1,0α1,1β1,1β1,2)g

>AgB

+ α21,0β

21,1(g

>A1 + g>B1)

2

).

(D.6)

331

Page 348: Copyright by Kai Zhong 2018

The final output of the above formula has a clear form: most non-negative

terms are extracted. A,B are separated into the off-diagonal elements and off-

diagonal elements and these two terms can be dealt with independently. Now we

consider the activation functions that satisfy β1,1 = 0, which further simplifies the

equation. Note that Sigmoid and tanh satisfy this condition.

Finally, for β1,1 = 0, we obtain

λmin(H∗) = min∑k

i=1 ‖ai‖2+‖bi‖2=1Ex,y

( k∑i=1

φ′(xi)φ(yi)x>ai + φ′(yi)φ(xi)y

>bi

)2

= min‖A‖2F+‖B‖2F=1

(α2,0β2,0 − α21,0β

21,0 − β2

1,0α21,1)(‖Ao‖2F + ‖Bo‖2F )

+ (α2,0β2,2 − α21,0β

21,2 − γ2)(‖gA‖2 + ‖gB‖2)

+ β21,0α

21,1‖Ao +B>

o ‖2F + γ2‖gA + gB‖2

+ α21,0(‖β1,0gA + β1,2Ao1‖2 + α2

1,0‖β1,0gA + β1,2Bo1‖2)

≥ min(α2,0β2,0 − α21,0β

21,0 − β2

1,0α21,1), (α2,0β2,2 − α2

1,0β21,2 − γ2)︸ ︷︷ ︸

:=ρ

.

For sigmoid, we have ρ = 0.000658; for tanh, we have ρ = 0.0095.

The following lemma will be used when transforming non-orthogonal cases

to orthogonal cases.

Lemma D.2.2. For any A = [a1, a2, · · · , ak] ∈ Rd×k, we have,

Ex,y∼Dk

∥∥∥∥∥k∑

i=1

φ′(xi)φ(yi)ai

∥∥∥∥∥2 ≥ (α2,0β2,0 − α2

1,0β21,0)‖A‖2F .

332

Page 349: Copyright by Kai Zhong 2018

Proof. Recall 1 ∈ Rd denote the all ones vector.

Ex,y∼Dk

∥∥∥∥∥k∑

i=1

φ′(xi)φ(yi)ai

∥∥∥∥∥2

= Ex,y∼Dk

[k∑

i=1

(φ′(xi)φ(yi))2‖ai‖2

]+ Ex,y∼Dk

[∑i 6=j

φ′(xi)φ(yi)φ′(xj)φ(yj)a

>i aj

]= (α2,0β2,0 − α2

1,0β21,0)‖A‖2F + α2

1,0β21,0‖A · 1‖2

≥ (α2,0β2,0 − α21,0β

21,0)‖A‖2F .

Thus, we complete the proof.

Now let’s show the PD of the population Hessian of Eq. (6.4) for the ReLU

case. where u∗(1) is the first row of U∗ and W ∈ R(d−1)×k.

Lemma D.2.3. Consider the activation function to be ReLU. Assume k = d, U∗, V ∗

are unitary matrices and u∗1,i 6= 0,∀i ∈ [k]. Then the minimal eigenvalue of the

corresponding population Hessian of Eq. (6.4) is lower bounded,

λmin(∇2fReLUD (W ∗, V ∗)) & min

i∈[k]u∗2

1,i,

where W ∗ = U∗2:d,: is the last d− 1 rows of U∗ and

fReLUD (W,V ) := Ex,y

[(φ(W>x2:d + x1(u

∗(1))>)>φ(V >y)− A(x, y))2], (D.7)

Proof. By fixing ui,1 = u∗i,1,∀i ∈ [k], we can rewrite the minimal eigenvalue of the

Hessian as follows. For simplicity, we denote λmin(H) := λmin(∇2fReLUD (W ∗, V ∗)).

First we observe that

λmin(H) = min∑ki=1 ‖ai‖2+‖bi‖2=1ai,1=0,∀i∈[k]

Ex,y

( k∑i=1

φ′(u∗>i x)φ(v∗>i y)x>ai + φ′(v∗>i y)φ(u∗>

i x)y>bi

)2.

(D.8)

333

Page 350: Copyright by Kai Zhong 2018

Without loss of generality, we assume V ∗ = I . Set x = U∗s, then we have

λmin(H) = min∑ki=1 ‖ai‖2+‖bi‖2=1ai,1=0,∀i∈[k]

Ex,y

( k∑i=1

φ′(si)φ(yi)s>U∗>ai + φ′(yi)φ(xi)y

>bi

)2

= min∑ki=1 ‖ai‖2+‖bi‖2=1

u∗(1)ai=0,∀i∈[k]

Ex,y

( k∑i=1

φ′(si)φ(yi)s>ai + φ′(yi)φ(xi)y

>bi

)2,

where u∗(1) is the first row of U∗ and the second equality is because we replace

U∗>ai by ai. In the ReLU case, we have

α1,0 = α1,1 = α2,0 = β1,0 = β1,1 = β1,1 = β2,0 = β2,2 = γ = 1/2.

According to Eq. (D.6), we have

λmin(H) ≥ min‖A‖2F+‖B‖2F=1,u∗(1)A=0

C0(‖Ao‖2F + ‖Bo‖2F + ‖Ao + A>o ‖2F/2 + ‖Bo +B>

o ‖2F/2

+ ‖Ao +B>o ‖2F + ‖gA + gB‖2

+ ‖Ao1 + gA + gB‖2 + ‖Bo1 + gA + gB‖2 + (g>A1 + g>B1)2),

where C0 is a universal constant. Now we show that there exists a positive number

c0 such that λmin(H) ≥ c0. If there is no such number, i.e., λmin(H) = 0, then

we have Ao = Bo = 0, gA = −gB. By the assumption that u∗1,i 6= 0 and the

condition u∗(1)A = 0, we have gA = gB = 0, which violates ‖A‖2F + ‖B‖2F = 1.

So λmin(H) > 0. An exact value for c0 is postponed to Theorem D.2.5, which gives

the lower bound for the non-orthogonal case.

D.2.2 Non-orthogonal Case

The restriction of orthogonality on U, V is too strong. We need to consider

general non-orthogonal cases. With Gaussian assumption, the non-orthogonal case

334

Page 351: Copyright by Kai Zhong 2018

can be transformed to the orthogonal case according to the following relationship.

Lemma D.2.3. Let U ∈ Rd×k be a full-column rank matrix. Let g : Rk → [0,∞).

Define λ(U) = σk1(U)/(

∏ki=1 σi(U)). Let D denote the normal distribution. Then

Ex∼Dd

[g(U>x)

]≥ 1

λ(U)Ez∼Dk

[g(σk(U)z)]. (D.9)

Remark This lemma transforms U>x, where the elements of x are mixed,

to σk(U)z, where all the elements are independently fed into g with the sacrifices

of a condition number of U . Using Lemma D.2.3, we are able to show the PD for

non-orthogonal U∗, V ∗.

Proof. Let P ∈ Rd×k be the orthonormal basis of U , and let W = [w1, w2, · · · , wk] =

P>U ∈ Rk×k.

Ex∼Dd[g(U>x)]

= Ez∼Dk[g(W>z)]

=

∫(2π)−k/2g(W>z)e−‖z‖2/2dz

=

∫(2π)−k/2g(s)e−‖W †>s‖2/2| det(W †)|ds

≥∫

(2π)−k/2g(s)e−σ21(W

†)‖s‖2/2| det(W †)|ds

=

∫(2π)−k/2g

(1

σ1(W †)t

)e−‖t‖2/2| det(W †)|/σk

1(W†)dt

=1

λ(W )

∫(2π)−k/2g(σk(W )t)e−‖t‖2/2dt

=1

λ(U)Ez∼Dk

[g(σk(U)z)],

335

Page 352: Copyright by Kai Zhong 2018

where the third step follows by replacing z by z = W †>s, the fourth step follows

by the fact that ‖W †>s‖ ≤ σ1(W†)‖s‖, and the fifth step follows replacing s by

s = 1σ1(W †)

t.

Using Lemma D.2.3, we are able to provide the lower bound for the minimal

eigenvalue for sigmoid and tanh.

Theorem D.2.4. Assume σk(U∗) = σk(V

∗) = 1. Assume β1,1 defined in Eq. (D.4)

is 0. Then the minimal eigenvalue of Hessian defined in Eq. (D.3) can be lower

bounded by,

λmin(H∗) ≥ ρ

λ(U∗)λ(V ∗)maxκ(U∗), κ(V ∗)

where

λ(U) = σk1(U)/(Πk

i=1σi(U)), κ(U) = σ1(U)/σk(U).

Proof. Let P ∈ Rd×k, Q ∈ Rd×k be the orthonormal basis of U∗, V ∗ respectively.

Let R ∈ Rk×k, S ∈ Rk×k satisfy that U∗ = P · R and V ∗ = Q · S. Let P⊥ ∈

Rd×(d−k), Q⊥ ∈ Rd×(d−k) be the orthogonal complement of P,Q respectively. Set

ai = P · si +P⊥ · ti and bi = Q · pi +Q⊥ · qi. Then we can decompose the minimal

eigenvalue problem into three terms.

336

Page 353: Copyright by Kai Zhong 2018

Ex,y

( k∑i=1

φ′(u∗>i x)φ(v∗>i y)x>ai + φ′(v∗>i y)φ(u∗>

i x)y>bi

)2

=Ex,y

( k∑i=1

φ′(u∗>i x)φ(v∗>i y)x>(Psi + P⊥ti) + φ′(v∗>i y)φ(u∗>

i x)y>(Qpi +Q⊥qi)

)2

=Ex,y

( k∑i=1

φ′(u∗>i x)φ(v∗>i y)x>Psi + φ′(v∗>i y)φ(u∗>

i x)y>Qpi

)2

︸ ︷︷ ︸C1

+ Ex,y

( k∑i=1

φ′(u∗>i x)φ(v∗>i y)x>P⊥ti

)2

︸ ︷︷ ︸C2

+Ex,y

( k∑i=1

φ′(v∗>i y)φ(u∗>i x)y>Q⊥qi

)2,

where we omit the terms containing a single independent Gaussian variable, whose

expectation is zero. Using Lemma D.2.3, we can lower bound the term C1 as fol-

lows,

C1 = Ex,y

( k∑i=1

φ′(u∗>i x)φ(v∗>i y)x>U∗R−1si + φ′(v∗>i y)φ(u∗>

i x)y>V ∗S−1pi

)2

≥ 1

λ(U∗)λ(V ∗)· Ex,y∼Dk

[(k∑

i=1

φ′(σk(U∗)xi))φ(yi)x

>R−1siσk(U∗)

+φ′(σk(V∗)yi)φ(σk(U

∗)xi)y>S−1piσk(V

∗))2]

.

And

C2 ≥Ex,y

∥∥∥∥∥k∑

i=1

φ′(u∗>i x)φ(v∗>i y)ti

∥∥∥∥∥2

≥ 1

λ(U∗)λ(V ∗)Ex,y∼Dk

∥∥∥∥∥k∑

i=1

φ′(σk(U∗)xi)φ(σk(V

∗)yi)ti

∥∥∥∥∥2.

337

Page 354: Copyright by Kai Zhong 2018

Without loss of generality, we assume σk(U∗) = σk(V

∗) = 1. Then accord-

ing to Lemma D.2.1 and Lemma D.2.2, we have

λmin(H) ≥ 1

λ(U∗)λ(V ∗)maxκ(U∗), κ(V ∗)·min(α2,0β2,0 − α2

1,0β21,0 − β2

1,0α21,1), (α2,0β2,2 − α2

1,0β21,2 − γ2).

Considering the definition of ρ in Eq. (D.4), we complete the proof.

For the ReLU case, we lower bound the minimal eigenvalue of the Hessian

for non-orthogonal cases.

Theorem D.2.5. Consider the activation to be ReLU. Assume U∗, V ∗ are full-

column-rank matrices and u∗1,i 6= 0,∀i ∈ [k]. Then the minimal eigenvalue of

the Hessian of Eq. (D.7) is lower bounded,

λmin(∇2fReLUD (W ∗, V ∗)) &

1

λ(U∗)λ(V ∗)

(mini∈[k]|u∗

1,i|(1 + ‖u∗(1)‖)max‖U∗‖, ‖V ∗‖

)2

,

where u∗(1) is the first row of U∗.

Proof. Let P ∈ Rd×k, Q ∈ Rd×k be the orthonormal basis of U∗, V ∗ respectively.

Let R ∈ Rk×k, S ∈ Rk×k satisfy that U∗ = P · R and V ∗ = Q · S. Let P⊥ ∈

Rd×(d−k), Q⊥ ∈ Rd×(d−k) be the orthogonal complement of P,Q respectively. Set

ai = P ·si+P⊥ · ti and bi = Q ·pi+Q⊥ ·qi. Similar to the proof of Theorem D.2.4,

Lemma D.2.2 and Lemma D.2.3, we have the following.

338

Page 355: Copyright by Kai Zhong 2018

Ex,y

( k∑i=1

φ′(u∗>i x)φ(v∗>i y)x>ai + φ′(v∗>i y)φ(u∗>

i x)y>bi

)2

≥ 1

λ(U∗)λ(V ∗)Ex,y∼Dk

[(k∑

i=1

φ′(σk(U∗)xi))φ(yi)x

>R−1siσk(U∗)

+φ′(σk(V∗)yi)φ(σk(U

∗)xi)y>S−1piσk(V

∗))2]

+1

λ(U∗)λ(V ∗)Ex,y∼Dk

∥∥∥∥∥k∑

i=1

φ′(σk(U∗)xi)φ(σk(V

∗)yi)ti

∥∥∥∥∥2

+1

λ(U∗)λ(V ∗)Ex,y∼Dk

∥∥∥∥∥k∑

i=1

φ′(σk(U∗)xi)φ(σk(V

∗)yi)qi

∥∥∥∥∥2

≥ 1

16λ(U∗)λ(V ∗)(‖Ao‖2F + ‖Bo‖2F + ‖gA + gB‖

2 + 3(‖T‖2F + ‖Q‖2F )),

where A = [R−1s1, R−1s2, · · · , R−1sk], B = [S−1p1, S

−1p2, · · · , S−1pk],

T = [t1, t2, · · · , tk], Q = [q1, q2, · · · , qk].

Similar to Eq. (D.8), we can find the minimal eigenvalue of the Hessian by

the following constrained minimization problem.

λmin(H) = min∑ki=1 ‖ai‖2+‖bi‖2=1ai,1=0,∀i∈[k]

Ex,y

( k∑i=1

φ′(u∗>i x)φ(v∗>i y)x>ai + φ′(v∗>i y)φ(u∗>

i x)y>bi

)2,

which is lower bounded by the following formula.

minA,B,T ,P

1

16λ(U∗)λ(V ∗)(‖Ao‖2F + ‖Bo‖2F + ‖gA + gB‖

2 + 3(‖T‖2F + ‖Q‖2F ))

s.t. ‖RA‖2F + ‖SB‖2F + ‖T‖2F + ‖Q‖2F = 1

e>1 PRA+ e>1 P⊥T = 0(D.10)

339

Page 356: Copyright by Kai Zhong 2018

If we assume the minimum of the above formula is c1. We show that c1 > 0 by

contradiction. If c1 = 0, then T = Q = 0, Ao = Bo = 0, gA = −gB. Since T = 0,

we have e>1 PRA = e>1 U∗A = 0. Assuming (e>1 U

∗)i 6= 0, ∀i, we have gA = gB =

0. This violates the condition that ‖RA‖2F + ‖SB‖2F + ‖T‖2F + ‖Q‖2F = 1.

Now we give a lower bound for c1. First we note,

‖RA‖2F + ‖SB‖2F + ‖T‖2F + ‖Q‖2F ≤ ‖R‖2‖A‖2F + ‖S‖2‖B‖2F + ‖T‖2F + ‖Q‖2F .

Therefore,

‖A‖2F + ‖B‖2F + ‖T‖2F + ‖Q‖2F ≥1

max‖U∗‖2, ‖V ∗‖2.

Also, as e>1 U∗Ao+(e>1 U

∗)g>A+e>1 P⊥T = 0, where is the element-wise

product, we have

‖gA‖2 ≤ (

1

min|u∗1,i|

(‖u∗(1)‖‖Ao‖+ ‖T‖)2

≤(1 + ‖u∗(1)‖min|u∗

1,i|

)2

2(‖Ao‖2F + ‖T‖2F ).

Note that ‖gA‖2 + ‖gA + gB‖2 ≥ 12‖gB‖2. Now let’s return to the main part

340

Page 357: Copyright by Kai Zhong 2018

of objective function Eq. (D.10).

‖Ao‖2F + ‖Bo‖2F + ‖gA + gB‖2 + 3(‖T‖2F + ‖Q‖2F )

≥ 2

3(‖Ao‖2F + ‖T‖2F ) +

1

3‖Ao‖2F + ‖Bo‖2F + ‖gA + gB‖

2 + ‖T‖2F + ‖Q‖2F

≥ 1

3

(min|u∗

1,i|1 + ‖u∗(1)‖

)2

‖gA‖2 +

1

3‖Ao‖2F + ‖Bo‖2F + ‖gA + gB‖

2 + ‖T‖2F + ‖Q‖2F

≥ 1

12

(min|u∗

1,i|1 + ‖u∗(1)‖

)2

(‖gA‖2 + ‖gB‖

2) +1

3‖Ao‖2F + ‖Bo‖2F + ‖T‖2F + ‖Q‖2F

≥ 1

12

(min|u∗

1,i|1 + ‖u∗(1)‖

)2(‖gA‖

2 + ‖gB‖2 + ‖Ao‖2F + ‖Bo‖2F + ‖T‖2F + ‖Q‖2F

)≥ 1

12

(min|u∗

1,i|(1 + ‖u∗(1)‖)max‖U∗‖, ‖V ∗‖

)2

.

Therefore,

c1 ≥1

200λ(U∗)λ(V ∗)

(min|u∗

1,i|(1 + ‖u∗(1)‖)max‖U∗‖, ‖V ∗‖

)2

.

D.3 Positive Definiteness of the Empirical Hessian

For any (U, V ), the population Hessian can be decomposed into the follow-

ing 2k × 2k blocks (i ∈ [k], j ∈ [k]),

∂2fD(U, V )

∂ui∂uj

= Ex,y

[φ′(u>

i x)φ′(u>

j x)xx>φ(v>i y)φ(v

>j y)]

+ δijEx,y

[(φ(U>x)>φ(V >y)− φ(U∗>x)>φ(V ∗>y)

)φ′′(u>

i x)φ(v>i y)xx

>]∂2fD(U, V )

∂ui∂vj= Ex,y

[φ′(u>

i x)φ′(v>j y)xy

>φ(v>i y)φ(u>j x)]

+ δijEx,y

[(φ(U>x)>φ(V >y)− φ(U∗>x)>φ(V ∗>y)

)φ′(u>

i x)φ′(v>i y)xy

>],(D.11)

341

Page 358: Copyright by Kai Zhong 2018

where δij = 1 if i = j, otherwise δij = 0. Similarly we can write the formula for∂2fD(U,V )∂vi∂vj

and ∂2fD(U,V )∂vi∂uj

.

Replacing Ex,y by 1|Ω|∑

(x,y)∈Ω in the above formula, we can obtain the for-

mula for the corresponding empirical Hessian, ∇2fΩ(U, V ).

We now bound the difference between ∇2fΩ(U, V ) and ∇2fD(U∗, V ∗).

Theorem D.3.1 (Restatement of Theorem D.1.1). For any ε > 0, if

n1 & ε−2td log2 d, n2 & ε−2t log d, |Ω| & ε−2td log2 d,

then with probability 1− d−t, for sigmoid/tanh,

‖∇2fΩ(U, V )−∇2fD(U∗, V ∗)‖ . ε+ ‖U − U∗‖+ ‖V − V ∗‖,

for ReLU,

‖∇2fΩ(U, V )−∇2fD(U∗, V ∗)‖ .

((‖V − V ∗‖σk(V ∗)

)1/2

+

(‖U − U∗‖σk(U∗)

)1/2

+ ε

)(‖U∗‖+‖V ∗‖)2.

Proof. Define H(U, V ) ∈ R(2kd)×(2kd) as a symmetric matrix, whose blocks are

represented as

Hui,uj= Ex,y

[φ′(u>

i x)φ′(u>

j x)xx>φ(v>i y)φ(v

>j y)],

Hui,vj = Ex,y

[φ′(u>

i x)φ′(v>j y)xy

>φ(v>i y)φ(u>j x)].

(D.12)

where Hui,uj∈ Rd×d, Hui,vj ∈ Rd×d correspond to ∂2fD(U,V )

∂ui∂uj, ∂

2fD(U,V )∂ui∂vj

respec-

tively.

We decompose the difference into

‖∇2fΩ(U, V )−∇2fD(U∗, V ∗)‖ ≤ ‖∇2fΩ(U, V )−H(U, V )‖+ ‖H(U, V )−∇2fD(U

∗, V ∗)‖.

Combining Lemma D.3.1, D.3.13, we complete the proof.

342

Page 359: Copyright by Kai Zhong 2018

Lemma D.3.1. For any ε > 0, if

n1 & ε−2td log2 d, n2 & ε−2t log d, |Ω| & ε−2td log2 d,

then with probability 1− d−t, for sigmoid/tanh,

‖∇2fΩ(U, V )−H(U, V )‖ . ε+ ‖U − U∗‖+ ‖V − V ∗‖,

for ReLU,

‖∇2fΩ(U, V )−H(U, V )‖ . ε‖U∗‖‖V ∗‖.

Proof. We can bound ‖∇2fΩ(U, V )−H(U, V )‖ if we bound each block.

We can show that if

n1 & ε−2td log2 d, n2 & ε−2t log d, |Ω| & ε−2td log2 d,

then with probability 1− d−t,∥∥∥∥∥∥Ex,y −

1

|Ω|∑

(x,y)∈Ω

[φ′(u>i x)φ′(u>j x)xx

>φ(v>i y)φ(v>j y)

]∥∥∥∥∥∥. ε‖U∗‖p‖V ∗‖p Lemma D.3.2∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[(φ(U>x)>φ(V >y)− φ(U∗>x)>φ(V ∗>y)

)φ′′(u>i x)φ(v

>i y)xx

>]∥∥∥∥∥∥

. ‖U − U∗‖+ ‖V − V ∗‖ Lemma D.3.5∥∥∥∥∥∥Ex,y −

1

|Ω|∑

(x,y)∈Ω

[φ′(u>i x)φ′(v>j y)xy

>φ(v>i y)φ(u>j x)

]∥∥∥∥∥∥. ε‖U∗‖p‖V ∗‖p Lemma D.3.6∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[(φ(U>x)>φ(V >y)− φ(U∗>x)>φ(V ∗>y)

)φ′(u>i x)φ

′(v>i y)xy>]∥∥∥∥∥∥

. ‖U − U∗‖+ ‖V − V ∗‖, Lemma D.3.8

343

Page 360: Copyright by Kai Zhong 2018

where p = 1 if φ is ReLU, p = 0 if φ is sigmoid/tanh.

Note that for ReLU activation, for any given U, V , the second term is 0

because φ′′(z) = 0 almost everywhere.

Lemma D.3.2. If

n1 & ε−2td log2 d, n2 & ε−2t log d, |Ω| & ε−2td log2 d,

then with probability at least 1− d−t,∥∥∥∥∥∥Ex,y −

1

|Ω|∑

(x,y)∈Ω

[φ′(u>i x)φ

′(u>j x)xx

>φ(v>i y)φ(v>j y)]∥∥∥∥∥∥ ≤ ε‖vi‖p‖vj‖p

where p = 1 if φ is ReLU, p = 0 if φ is sigmoid/tanh.

Proof. Let B(x, y) = φ′(u>i x)φ

′(u>j x)xx

>φ(v>i y)φ(v>j y). By applying Lemma D.3.10

and Property (I)− (III), (VI) in Lemma D.3.3 and Lemma D.3.4, we have for any

ε > 0 if

n1 & ε−2td log2 d, n2 & ε−2t log d,

then with probability at least 1− d−2t,

∥∥∥∥∥∥Ex,y[B(x, y)]− 1

|S|∑

(x,y)∈S

B(x, y)

∥∥∥∥∥∥ ≤ ε‖vi‖p‖vj‖p. (D.13)

By applying Lemma D.3.11 and Property (I), (III) − (V) in Lemma D.3.3

and Lemma D.3.4, we have for any ε > 0 if

n1 & ε−1td log2 d, n2 & ε−2t log d,

344

Page 361: Copyright by Kai Zhong 2018

then ∥∥∥∥∥∥ 1

n1

∑l∈[n1]

(φ′(u>i xl)φ

′(u>j xl))

2‖xl‖2xlx>l

∥∥∥∥∥∥ . d,

and ∥∥∥∥∥∥ 1

n2

∑l∈[n2]

(φ(v>i yl)φ(v>j yl))

2

∥∥∥∥∥∥ . ‖vi‖2p‖vj‖2p.

Therefore,

max

∥∥∥∥∥∥ 1

|S|∑

(x,y)∈S

B(x, y)B(x, y)>

∥∥∥∥∥∥,∥∥∥∥∥∥ 1

|S|∑

(x,y)∈S

B(x, y)>B(x, y)

∥∥∥∥∥∥ . εd‖vi‖2p‖vj‖2p.

(D.14)

We can apply Lemma D.3.12 and use Eq. (D.14) and Property (I) in Lemma D.3.3

and Lemma D.3.4 to obtain the following result. If

|Ω| & ε−2td log2 d,

then with probability at least 1− d−2t,∥∥∥∥∥∥ 1

|S|∑

(x,y)∈S

B(x, y)− 1

|Ω|∑

(x,y)∈Ω

B(x, y)

∥∥∥∥∥∥ . ε‖vi‖p‖vj‖p. (D.15)

Combining Eq. (D.13) and (D.15), we finish the proof.

Lemma D.3.3. Define T (z) = φ′(u>i z)φ

′(u>j z)zz

>. If z ∼ Z, Z = N(0, Id) and φ

345

Page 362: Copyright by Kai Zhong 2018

is ReLU or sigmoid/tanh, the following holds for T (z) and any t > 1,

(I) Prz∼Z

[‖T (z)‖ ≤ 5td log n] ≥ 1− n−1d−t;

(II) max‖a‖=‖b‖=1

(E

z∼Z

[(a>T (z)b

)2])1/2. 1;

(III) max(∥∥∥ E

z∼Z[T (z)T (z)>]

∥∥∥,∥∥∥ Ez∼Z

[T (z)>T (z)]∥∥∥) . d;

(IV) max‖a‖=1

(E

z∼Z

[(a>T (z)T (z)>a

)2])1/2. d;

(V)∥∥∥ Ez∼Z

[T (z)T (z)>T (z)T (z)>]∥∥∥ . d3;

(VI)∥∥∥ Ez∼Z

[T (z)]∥∥∥ . 1.

Proof. Note that 0 ≤ φ′(z) ≤ 1, therefore (I) can be proved by Proposition 1 of

[68]. (II)− (VI) can be proved by Holder’s inequality.

Lemma D.3.4. Define T (z) = φ(v>i z)φ(v>j z). If z ∼ Z, Z = N(0, Id) and φ is

ReLU or sigmoid/tanh, the following holds for T (z) and any t > 1,

(I) Prz∼Z

[‖T (z)‖ ≤ 5t‖vi‖p‖vj‖p log n] ≥ 1− n−1d−t;

(II) max‖a‖=‖b‖=1

(E

z∼Z

[(a>T (z)b

)2])1/2. ‖vi‖p‖vj‖p;

(III) max(∥∥∥ E

z∼Z[T (z)T (z)>]

∥∥∥,∥∥∥ Ez∼Z

[T (z)>T (z)]∥∥∥) . ‖vi‖2p‖vj‖2p;

(IV) max‖a‖=1

(E

z∼Z

[(a>T (z)T (z)>a

)2])1/2. ‖vi‖2p‖vj‖2p;

(V)∥∥∥ Ez∼Z

[T (z)T (z)>T (z)T (z)>]∥∥∥ . ‖vi‖4p‖vj‖4p;

(VI)∥∥∥ Ez∼Z

[T (z)]∥∥∥ . ‖vj‖p‖vi‖p.

where p = 1 if φ is ReLU, p = 0 if φ is sigmoid/tanh.

346

Page 363: Copyright by Kai Zhong 2018

Proof. Note that |φ(z)| ≤ |z|p, therefore (I) can be proved by Proposition 1 of [68].

(II)− (VI) can be proved by Holder’s inequality

Lemma D.3.5. If

n1 & ε−2td log2 d, n2 & ε−2t log d, |Ω| & ε−2td log2 d,

then with probability at least 1− d−t,∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[(φ(U>x)>φ(V >y)− φ(U∗>x)>φ(V ∗>y)

)φ′′(u>

i x)φ(v>i y)xx

>]∥∥∥∥∥∥. (‖U − U∗‖+ ‖V − V ∗‖).

Proof. We consider the following formula first,∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[((φ(u>

j x)− φ(u∗>j x))φ(v∗>j y)

)φ′′(u>

i x)φ(v>i y)xx

>]∥∥∥∥∥∥≤

∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[∣∣(uj − u∗j)

>x∣∣xx>φ(v∗>j y)φ(v>i y)

]∥∥∥∥∥∥.Similar to Lemma D.3.2, we are able to show∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[∣∣(uj − u∗j)

>x∣∣xx>φ(v∗>j y)φ(v>i y)

]− E(x,y)

[∣∣(uj − u∗j)

>x∣∣xx>φ(v∗>j y)φ(v>i y)

]∥∥∥∥∥∥. ‖U − U∗‖.

Note that by Holder’s inequality, we have,

∥∥E(x,y)

[∣∣(uj − u∗j)

>x∣∣xx>φ(v∗>j y)φ(v>i y)

]∥∥ . ‖U − U∗‖.

So we complete the proof.

347

Page 364: Copyright by Kai Zhong 2018

Lemma D.3.6. If

n1 & ε−2td log2 d, n2 & ε−2t log d, |Ω| & ε−2td log2 d,

then with probability at least 1− d−t,∥∥∥∥∥∥Ex,y −

1

|Ω|∑

(x,y)∈Ω

[φ′(u>i x)φ

′(v>j y)xy>φ(v>i y)φ(u

>j x)]∥∥∥∥∥∥ . ε‖vi‖p‖uj‖p.

Proof. Let B(x, y) = M(x)N(y), where M(x) = φ′(u>i x)φ(u

>j x)x and N(y) =

φ′(v>j y)φ(v>i y)y

>. By applying Lemma D.3.10 and Property (I) − (III), (VI) in

Lemma D.3.7 , we have for any ε > 0 if

n1 & ε−2td log2 d, n2 & ε−2td log2 d,

then with probability at least 1− d−2t,

∥∥∥∥∥∥Ex,yB(x, y)− 1

|S|∑

(x,y)∈S

B(x, y)

∥∥∥∥∥∥ . ε‖uj‖p‖vi‖p. (D.16)

By applying Lemma D.3.11 and Property (I), (IV)− (VI) in Lemma D.3.7,

we have for any ε > 0 if

n1 & ε−2td log2 d, n2 & ε−2td log2 d,

then∥∥∥∥∥∥ 1

n1

∑l∈[n1]

M(xl)M(xl)>

∥∥∥∥∥∥ . ‖uj‖2p,

∥∥∥∥∥∥ 1

n2

∑l∈[n2]

N(yl)>N(yl)

∥∥∥∥∥∥ . ‖vi‖2p.

348

Page 365: Copyright by Kai Zhong 2018

By applying Lemma D.3.11 and Property (I), (IV), (VII), (VIII) in Lemma D.3.7,

we have for any ε > 0 if

n1 & ε−2td log2 d, n2 & ε−2td log2 d,

then∥∥∥∥∥∥ 1

n1

∑l∈[n1]

M(xl)>M(xl)

∥∥∥∥∥∥ . d‖uj‖2p,

∥∥∥∥∥∥ 1

n2

∑l∈[n2]

N(yl)N(yl)>

∥∥∥∥∥∥ . d‖vi‖2p.

Therefore,

max

∥∥∥∥∥∥ 1

|S|∑

(x,y)∈S

B(x, y)B(x, y)>

∥∥∥∥∥∥,∥∥∥∥∥∥ 1

|S|∑

(x,y)∈S

B(x, y)>B(x, y)

∥∥∥∥∥∥ . εd‖vi‖2p‖uj‖2p

(D.17)

We can apply Lemma D.3.12 and Eq. (D.17) and Property (I) in Lemma D.3.7 to

obtain the following result. If

|Ω| & ε−2td log2 d,

then with probability at least 1− d−2t,∥∥∥∥∥∥ 1

|S|∑

(x,y)∈S

B(x, y)− 1

|Ω|∑

(x,y)∈Ω

B(x, y)

∥∥∥∥∥∥ ≤ ε‖vi‖p‖uj‖p. (D.18)

Combining Eq. (D.16) and (D.18), we finish the proof.

Lemma D.3.7. Define T (z) = φ′(u>i z)φ(u

>j z)z. If z ∼ Z, Z = N(0, Id) and φ is

349

Page 366: Copyright by Kai Zhong 2018

ReLU or sigmoid/tanh, the following holds for T (z) and any t > 1,

(I) Prz∼Z

[‖T (z)‖ ≤ 5td1/2‖uj‖p log n

]≥ 1− n−1d−t;

(II)∥∥∥ Ez∼Z

[T (z)]∥∥∥ . ‖uj‖p;

(III) max‖a‖=‖b‖=1

(E

z∼Z

[(a>T (z)b

)2])1/2. ‖uj‖p;

(IV) max∥∥∥ E

z∼Z[T (z)T (z)>]

∥∥∥, ∥∥∥ Ez∼Z

[T (z)>T (z)]∥∥∥ . d‖uj‖2p;

(V) max‖a‖=1

(E

z∼Z

[(a>T (z)T (z)>a

)2])1/2. ‖uj‖2p;

(VI)∥∥∥ Ez∼Z

[T (z)T (z)>T (z)T (z)>]∥∥∥ . d‖uj‖4p;

(VII) max‖a‖=1

(E

z∼Z

[(a>T (z)>T (z)a

)2])1/2. d‖uj‖2p;

(VIII)∥∥∥ Ez∼Z

[T (z)>T (z)T (z)>T (z)]∥∥∥ . d2‖uj‖4p.

Proof. Note that 0 ≤ φ′(z) ≤ 1,|φ(z)| ≤ |z|p, therefore (I) can be proved by

Proposition 1 of [68]. (II)− (VIII) can be proved by Holder’s inequality.

Lemma D.3.8. If

n1 & td log2 d, n2 & t log d, |Ω| & td log2 d,

then with probability at least 1− d−t,∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[(φ(U>x)>φ(V >y)− φ(U∗>x)>φ(V ∗>y)

)φ′(u>

i x)φ′(v>i y)xy

>]∥∥∥∥∥∥. ‖U − U∗‖+ ‖V − V ∗‖.

350

Page 367: Copyright by Kai Zhong 2018

Proof. We consider the following formula first,∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[((φ(u>

j x)− φ(u∗>j x))φ(v∗>j y)

)φ′(u>

i x)φ′(v>i y)xy

>]∥∥∥∥∥∥

Set M(x) = (φ(u>j x)−φ(u∗>

j x))φ′(u>i x)x and N(y) = φ(v∗>j y)φ′(v>i y)y

>

and follow the proof for Lemma D.3.6. Also note that φ is Lipschitz, i.e., |φ(u>j x)−

φ(u∗>j x)| ≤ |u>

j x− u∗>j x|. We can show the following. If

n1 & td log2 d, n2 & t log d, |Ω| & td log2 d,

then with probability at least 1− d−t,∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

−Ex,y

[M(x)N(y)]

∥∥∥∥∥∥ . ‖uj − u∗j‖.

Note that by Holder’s inequality, we have,

‖Ex,y[M(x)N(y)]‖ . ‖uj − u∗j‖.

So we complete the proof.

We provide a variation of Lemma B.7 in [153]. Note that the Lemma B.7

[153] requires four properties, we simplify it into three properties.

Lemma D.3.9 (Matrix Bernstein for unbounded case (A modified version of bounded

case, Theorem 6.1 in [127], A variation of Lemma B.7 in [153])). Let B denote a

351

Page 368: Copyright by Kai Zhong 2018

distribution over Rd1×d2 . Let d = d1 + d2. Let B1, B2, · · ·Bn be i.i.d. random

matrices sampled from B. Let B = EB∼B[B] and B = 1n

∑ni=1Bi. For parameters

m ≥ 0, γ ∈ (0, 1), ν > 0, L > 0, if the distribution B satisfies the following four

properties,

(I) PrB∼B

[‖B‖ ≤ m] ≥ 1− γ;

(II) max(∥∥∥ E

B∼B[BB>]

∥∥∥,∥∥∥ EB∼B

[B>B]∥∥∥) ≤ ν;

(III) max‖a‖=‖b‖=1

(E

B∼B

[(a>Bb

)2])1/2 ≤ L.

Then we have for any ε > 0 and t ≥ 1, if

n ≥ (18t log d) · ((ε+ ‖B‖)2 +mε+ ν)/ε2 and γ ≤ (ε/(2L))2

with probability at least 1− d−2t − nγ,∥∥∥∥∥ 1nn∑

i=1

Bi − EB∼B

[B]

∥∥∥∥∥ ≤ ε.

Proof. Define the event

ξi = ‖Bi‖ ≤ m,∀i ∈ [n].

Define Mi = 1‖Bi‖≤mBi. Let M = EB∼B[1‖B‖≤mB] and M = 1n

∑ni=1Mi. By

triangle inequality, we have

‖B −B‖ ≤ ‖B − M‖+ ‖M −M‖+ ‖M −B‖. (D.19)

In the next a few paragraphs, we will upper bound the above three terms.

352

Page 369: Copyright by Kai Zhong 2018

The first term in Eq. (D.19). For each i, let ξi denote the complementary

set of ξi, i.e. ξi = [n]\ξi. Thus Pr[ξi] ≤ γ. By a union bound over i ∈ [n], with

probability 1− nγ, ‖Bi‖ ≤ m for all i ∈ [n]. Thus M = B.

The second term in Eq. (D.19). For a matrix B sampled from B, we use ξ

to denote the event that ξ = ‖B‖ ≤ m. Then, we can upper bound ‖M − B‖ in

the following way,

‖M −B‖

=∥∥∥ EB∼B

[1‖B‖≤m ·B]− EB∼B

[B]∥∥∥

=∥∥∥ EB∼B

[B · 1ξ

]∥∥∥= max

‖a‖=‖b‖=1E

B∼B

[a>Bb1ξ

]≤ max

‖a‖=‖b‖=1E

B∼B[(a>Bb)2]1/2 · E

B∼B

[1ξ

]1/2 by Holder’s inequality

≤ L EB∼B

[1ξ

]1/2 by Property (IV)

≤ Lγ1/2, by Pr[ξ] ≤ γ

≤ 1

2ε, by γ ≤ (ε/(2L))2,

which is

‖M −B‖ ≤ ε

2.

Therefore, ‖M‖ ≤ ‖B‖+ ε2.

The third term in Eq. (D.19). We can bound ‖M −M‖ by Matrix Bern-

stein’s inequality [127].

353

Page 370: Copyright by Kai Zhong 2018

We define Zi = Mi −M . Thus we have EBi∼B

[Zi] = 0, ‖Zi‖ ≤ 2m, and

∥∥∥∥ EBi∼B

[ZiZ>i ]

∥∥∥∥ =

∥∥∥∥ EBi∼B

[MiM>i ]−M ·M>

∥∥∥∥ ≤ ν + ‖M‖2 ≤ ν + ‖B‖2 + ε2 + ε‖B‖.

Similarly, we have∥∥∥∥ EBi∼B

[Z>i Zi]

∥∥∥∥ ≤ ν+‖B‖2+ε2+ε‖B‖. Using matrix Bernstein’s

inequality, for any ε > 0,

PrB1,··· ,Bn∼B

[1

n

∥∥∥∥∥n∑

i=1

Zi

∥∥∥∥∥ ≥ ε

]≤ d exp

(− ε2n/2

ν + ‖B‖2 + ε2 + ε‖B‖+ 2mε/3

).

By choosing

n ≥ (3t log d) · ν + ‖B‖2 + ε2 + ε‖B‖+ 2mε/3

ε2/2,

for t ≥ 1, we have with probability at least 1− d−2t,∥∥∥∥∥ 1nn∑

i=1

Mi −M

∥∥∥∥∥ ≤ ε

2.

Putting it all together, we have for ε > 0, if

n ≥ (18t log d) · ((ε+ ‖B‖)2 +mε+ ν)/(ε2) and γ ≤ (ε/(2L))2

with probability at least 1− d−2t − nγ,∥∥∥∥∥ 1nn∑

i=1

Bi − EB∼B

[B]

∥∥∥∥∥ ≤ ε.

Lemma D.3.10 (Tail Bound for fully-observed rating matrix). Let xii∈[n1] be in-

dependent samples from distribution X and yjj∈[n2] be independent samples from

354

Page 371: Copyright by Kai Zhong 2018

distribution Y. Denote S := (xi, yj)i∈[n1],j∈[n2] as the collection of all the (xi, yj)

pairs. Let B(x, y) be a random matrix of x, y, which can be represented as the prod-

uct of two matrices M(x), N(y), i.e., B(x, y) = M(x)N(y). Let M = ExM(x)

and N = EyN(y). Let dx be the sum of the two dimensions of M(x) and dy be

the sum of the two dimensions of N(y). Suppose both M(x) and N(y) satisfy the

following properties (z is a representative for x, y, and T (z) is a representative for

M(x), N(y)),

(I) Prz∼Z

[‖T (z)‖ ≤ mz] ≥ 1− γz;

(II) max‖a‖=‖b‖=1

(E

z∼Z

[(a>T (z)b

)2])1/2 ≤ Lz;

(III) max(∥∥∥ E

z∼Z[T (z)T (z)>]

∥∥∥,∥∥∥ Ez∼Z

[T (z)>T (z)]∥∥∥) ≤ νz.

Then for any ε1 > 0, ε2 > 0 if

n1 ≥ (18t log dx) · (νx + (‖M‖+ ε1)2 +mxε1)/ε

21 and γx ≤ (ε1/(2Lx))

2

n2 ≥ (18t log dy) · (νy + (ε2 + ‖N‖)2 +myε2)/ε22 and γy ≤ (ε2/(2Ly))

2

with probability at least 1− d−2tx − d−2t

y − n1γx − n2γy,

∥∥∥∥∥∥Ex,yB(x, y)− 1

|S|∑

(x,y)∈S

B(x, y)

∥∥∥∥∥∥ ≤ ε2‖M‖+ ε1‖N‖+ ε1ε2. (D.20)

355

Page 372: Copyright by Kai Zhong 2018

Proof. First we note that,

1

|S|∑

(x,y)∈S

B(x, y) =1

n1n2

∑i∈[n1]

M(xi)

·∑

j∈[n2]

N(yj)

,

and

Ex,y[B(x, y)] = (Ex[M(x)])(Ey[N(y)]).

Therefore, if we can bound ‖Ex[M(x)]− 1n1

∑i∈[n1]

M(xi)‖ and the corresponding

term for y, we are able to prove this lemma.

By the conditions of M(x), the three conditions in Lemma D.3.9 are satis-

fied, which completes the proof.

Lemma D.3.11 (Upper bound for the second-order moment). Let zii∈[n] be inde-

pendent samples from distribution Z. Let T (z) be a matrix of z. Let d be the sum of

the two dimensions of T (z) and T := Ez∼Z

[T (z)T (z)>]. Suppose T (z) satisfies the

following properties.

(I) Prz∼Z

[‖T (z)‖ ≤ mz] ≥ 1− γz;

(II) max‖a‖=1

(E

z∼Z

[(a>T (z)T (z)>a

)2])1/2 ≤ Lz;

(III)∥∥∥ Ez∼Z

[T (z)T (z)>T (z)T (z)>]∥∥∥ ≤ νz,

Then for any t > 1, if

n ≥ (18t log d) · (νz + (‖T‖+ ε)2 +m2z)/ε

2 and γz ≤ (ε/(2Lz))2,

356

Page 373: Copyright by Kai Zhong 2018

we have with probability at least 1− d−2t − nγz,∥∥∥∥∥∥ 1n∑i∈[n]

T (zi)T (zi)>

∥∥∥∥∥∥ ≤∥∥∥ Ez∼Z

[T (z)T (z)>]∥∥∥+ ε.

Proof. The proof directly follows by applying Lemma D.3.9.

Lemma D.3.12 (Tail Bound for partially-observed rating matrix). Given xii∈[n1]

and yjj∈[n2], let’s denote S := (xi, yj)i∈[n1],j∈[n2] as the collection of all the

(xi, yj) pairs. Let Ω also be a collection of (xi, yj) pairs, where each pair is sampled

from S independently and uniformly. Let B(x, y) be a matrix of x, y. Let dB be the

sum of the two dimensions of B(x, y). Define BS = 1|S|∑

(x,y)∈S B(x, y). Assume

the following,

(I) ‖B(x, y)‖ ≤ mB,∀(x, y) ∈ S,

(II) max

∥∥∥∥∥∥ 1

|S|∑

(x,y)∈S

B(x, y)B(x, y)>

∥∥∥∥∥∥,∥∥∥∥∥∥ 1

|S|∑

(x,y)∈S

B(x, y)>B(x, y)

∥∥∥∥∥∥ ≤ νB.

Then we have for any ε > 0 and t ≥ 1, if

|Ω| ≥ (18t log dB) · (νB + ‖BS‖2 +mBε)/ε2,

with probability at least 1− d−2tB ,∥∥∥∥∥∥BS −

1

|Ω|∑

(x,y)∈Ω

B(x, y)

∥∥∥∥∥∥ ≤ ε.

Proof. Since each entry in Ω is sampled from S uniformly and independently, we

have

1

|Ω|∑

(x,y)∈Ω

B(x, y)

=1

|S|∑

(x,y)∈S

B(x, y).

357

Page 374: Copyright by Kai Zhong 2018

Applying the matrix Bernstein inequality Theorem 6.1 in [127], we prove this

lemma.

Lemma D.3.13. For sigmoid/tanh activation function,

‖H(U, V )−∇2fD(U∗, V ∗)‖ . (‖V − V ∗‖+ ‖U − U∗‖),

where H(U, V ) is defined as in Eq. (D.12).

For ReLU activation function,

‖H(U, V )−∇2fD(U∗, V ∗)‖ .

((‖V − V ∗‖σk(V ∗)

)1/2

‖U∗‖+(‖U − U∗‖σk(U∗)

)1/2

‖V ∗‖

)(‖U∗‖+ ‖V ∗‖).

Proof. We can bound each block, i.e.,

Ex,y

[φ′(u>

i x)φ′(u>

j x)xx>φ(v>i y)φ(v

>j y)− φ′(u∗>

i x)φ′(u∗>j x)xx>φ(v∗>i y)φ(v∗>j y)

].

(D.21)

Ex,y

[φ′(u>

i x)φ′(v>j y)xy

>φ(v>i y)φ(u>j x)− φ′(u∗>

i x)φ′(v∗>j y)xy>φ(v∗>i y)φ(u∗>j x)

].

(D.22)

For smooth activations, the bound for Eq. (D.21) follows by combining

Lemma D.3.14 and Lemma D.3.15 and the bound for Eq. (D.22) follows Lemma D.3.17

and Lemma D.3.19. For ReLU activation, the bound for Eq. (D.21) follows by

combining Lemma D.3.14, Lemma D.3.16 and the bound for Eq. (D.22) follows

Lemma D.3.17 and Lemma D.3.18.

Lemma D.3.14.

∥∥Ey∼Dd

[(φ(v>i y)− φ(v∗>i y))φ(v>j y)

]∥∥ . ‖V ∗‖p‖V − V ∗‖.

358

Page 375: Copyright by Kai Zhong 2018

Proof. The proof follows the property of the activation function (φ(z) ≤ |z|p) and

Holder’s inequality.

Lemma D.3.15. When the activation function is smooth, we have

∥∥Ex∼Dd

[(φ′(u>

i x)− φ′(u∗>i x))φ′(u>

l x)xx>]∥∥ . ‖U − U∗‖.

Proof. The proof directly follows Eq. (12) in Lemma D.10 in [153].

Lemma D.3.16. When the activation function is piece-wise linear with e turning

points, we have

∥∥Ex∼Dd

[(φ′(u>

i x)− φ′(u∗>i x))φ′(u>

l x)xx>]∥∥ . (e‖U − U∗‖/σk(U

∗))1/2.

Proof.

∥∥Ex,y

[(φ′(u>

i x)− φ′(u∗>i x))φ′(u>

l x)xx>]∥∥ ≤ max

‖a‖=1

(Ex∼Dd

[|φ′(u>

i x)− φ′(u∗>i x)|φ′(u>

l x)(x>a)2

]).

Let P be the orthogonal basis of span(ui, u∗i , ul). Without loss of general-

ity, we assume ui, u∗i , ul are independent, so P = span(ui, u

∗i , ul) is d-by-3. Let

[qi q∗i ql] = P>[ui u

∗i ul] ∈ R3×3. Let a = Pb + P⊥c, where P⊥ ∈ Rd×(d−3) is the

359

Page 376: Copyright by Kai Zhong 2018

complementary matrix of P .

Ex∼Dd

[|φ′(u>

i x)− φ′(u∗>i x)||φ′(u>

l x)|(x>a)2]

= Ex∼Dd

[|φ′(u>

i x)− φ′(u∗>i x)||φ′(u>

l x)|(x>(Pb+ P⊥c))2]

. Ex∼Dd

[|φ′(u>

i x)− φ′(u∗>i x)||φ′(u>

l x)|((x>Pb)2 + (x>P⊥c)

2)]

= Ex∼Dd

[|φ′(u>

i x)− φ′(u∗>i x)||φ′(u>

l x)|(x>Pb)2]

+ Ex∼Dd

[|φ′(u>

i x)− φ′(u∗>i x)||φ′(u>

l x)|(x>P⊥c)2]

= Ez∼D3

[|φ′(q>i z)− φ′(q∗>i z)||φ′(q>l z)|(z>b)2

]+ Ez∼D3,y∼Dd−3

[|φ′(q>i z)− φ′(q∗>i z)||φ′(q>l z)|(y>c)2

], (D.23)

where the first step follows by a = Pb + P⊥c, the last step follows by (a + b)2 ≤

2a2 + 2b2.

We have e exceptional points which have φ′′(z) 6= 0. Let these e points

be p1, p2, · · · , pe. Note that if q>i z and q∗>i z are not separated by any of these

exceptional points, i.e., there exists no j ∈ [e] such that q>i z ≤ pj ≤ q∗>i z or

q∗>i z ≤ pj ≤ q>i z, then we have φ′(q>i z) = φ′(q∗>i z) since φ′′(s) are zeros except

for pjj=1,2,··· ,e. So we consider the probability that q>i z, q∗>i z are separated by

any exception point. We use ξj to denote the event that q>i z, q∗>i z are separated

by an exceptional point pj . By union bound, 1 −∑e

j=1 Pr[ξj] is the probability

that q>i z, q∗>i z are not separated by any exceptional point. The first term of Equa-

360

Page 377: Copyright by Kai Zhong 2018

tion (D.23) can be bounded as,

Ez∼D3

[|φ′(q>i z)− φ′(q∗>i z)||φ′(q>l z)|(z>b)2

]= Ez∼D3

[1∪e

j=1ξj|φ′(q>i z) + φ′(q∗>i z)||φ′(q>l z)|(z>b)2

]≤(Ez∼D3

[1∪e

j=1ξj

])1/2(Ez∼D3

[(φ′(q>i z) + φ′(q∗>i z))2φ′(q>l z)

2(z>b)4])1/2

(e∑

j=1

Prz∼D3

[ξj]

)1/2(Ez∼D3

[(φ′(q>i z) + φ′(q∗>i z))2φ′(q>l z)

2(z>b)4])1/2

.

(e∑

j=1

Prz∼D3

[ξj]

)1/2

‖b‖2,

where the first step follows by if q>i z, q∗>i z are not separated by any exceptional

point then φ′(q>i z) = φ′(q∗>i z) and the last step follows by Holder’s inequality.

It remains to upper bound Prz∼D3 [ξj]. First note that if q>i z, q∗>i z are sepa-

rated by an exceptional point, pj , then |q∗>i z− pj| ≤ |q>i z− q∗>i z| ≤ ‖qi− q∗i ‖‖z‖.

Therefore,

Prz∼D3

[ξj] ≤ Prz∼D3

[|q>i z − pj|‖z‖

≤ ‖qi − q∗i ‖].

Note that ( q∗>i z

‖z‖‖q∗i ‖+ 1)/2 follows Beta(1,1) distribution which is uniform

distribution on [0, 1].

Prz∼D3

[|q∗>i z − pj|‖z‖‖q∗i ‖

≤ ‖qi − q∗i ‖‖q∗i ‖

]≤ Pr

z∼D3

[|q∗>i z|‖z‖‖q∗i ‖

≤ ‖qi − q∗i ‖‖q∗i ‖

].‖qi − q∗i ‖‖q∗i ‖

.‖U − U∗‖σk(U∗)

,

where the first step is because we can view q∗>i z

‖z‖ and pj‖z‖ as two independent ran-

dom variables: the former is about the direction of z and the later is related to the

361

Page 378: Copyright by Kai Zhong 2018

magnitude of z. Thus, we have

Ez∈D3 [|φ′(q>i z)− φ′(q∗>i z)||φ′(q>l z)|(z>b)2] . (e‖U − U∗‖/σk(U∗))1/2‖b‖2.

(D.24)

Similarly we have

Ez∼D3,y∼Dd−3

[|φ′(q>i z)− φ′(q∗>i z)||φ′(q>l z)|(y>c)2

]. (e‖U − U∗‖/σk(U

∗))1/2‖c‖2.(D.25)

Finally combining Eq. (D.24) and Eq. (D.25) completes the proof.

Lemma D.3.17.

∥∥Ex∼Dd

[(φ(u>

j x)− φ(u∗>j x))φ′(u>

i x)x]∥∥ . ‖U − U∗‖.

Proof. First, we can use the Lipschitz continuity of the activation function,

∥∥Ex∼Dd

[φ(u>

j x)− φ(u∗>j x)φ′(u>

i x)x]∥∥ ≤ max

‖a‖=1

∥∥Ex∼Dd

[|φ(u>

j x)− φ(u∗>j x)|φ′(u>

i x)|x>a|]∥∥

≤ max‖a‖=1

∥∥Ex∼Dd

[|u>

j x− u∗>j x|φ′(u>

i x)|x>a|]∥∥,

where Lφ ≤ 1 is the Lipschitz constant of φ. Then the proof follows Holder’s

inequality.

Lemma D.3.18. When the activation function is ReLU,

∥∥Ex∼Dd

[φ(u∗>

j x)(φ′(u>i x)− φ′(u∗>

i x))x]∥∥ . (‖U − U∗‖/σk(U

∗))1/2‖uj‖.

Proof.

∥∥Ex∼Dd

[φ(u∗>

j x)(φ′(u>i x)− φ′(u∗>

i x))x]∥∥ ≤ max

‖a‖=1Ex∼Dd

[|φ(u∗>

j x)(φ′(u>i x)− φ′(u∗>

i x))x>a|].

362

Page 379: Copyright by Kai Zhong 2018

Similar to Lemma D.3.16, we can show that

max‖a‖=1

Ex∼Dd

[|φ(u∗>

j x)(φ′(u>i x)− φ′(u∗>

i x))x>a|]. (‖U − U∗‖/σk(U

∗))1/2‖uj‖.

Lemma D.3.19. When the activation function is sigmoid/tanh,

∥∥Ex∼Dd

[φ(u∗>

j x)(φ′(u>i x)− φ′(u∗>

i x))x]∥∥ . ‖U − U∗‖.

Proof.

∥∥Ex∼Dd

[φ(u∗>

j x)(φ′(u>i x)− φ′(u∗>

i x))x]∥∥

≤ max‖a‖=1

Ex∼Dd

[|φ(u∗>

j x)(φ′(u>i x)− φ′(u∗>

i x))x>a|]

. max‖a‖=1

Ex∼Dd

[|(u>

i x− u∗>i x)x>a|

]. ‖U − U∗‖.

D.3.1 Local Linear Convergence

Given Theorem 6.3.1, we are now able to show local linear convergence of

gradient descent for sigmoid and tanh activation function.

Theorem D.3.2 (Restatement of Theorem 6.3.2). Let [U c, V c] be the parameters in

the c-th iteration. Assuming ‖U c−U∗‖+‖V c−V ∗‖ . 1/(λ2κ), then given a fresh

sample set, Ω, that is independent of [U c, V c] and satisfies the conditions in Theo-

rem 6.3.1, the next iterate using one step of gradient descent, i.e., [U c+1, V c+1] =

363

Page 380: Copyright by Kai Zhong 2018

[U c, V c]− η∇fΩ(U c, V c), satisfies

‖U c+1 − U∗‖2F + ‖V c+1 − V ∗‖2F ≤ (1−Ml/Mu)(‖U c − U∗‖2F + ‖V c − V ∗‖2F )

with probability 1− d−t, where η = Θ(1/Mu) is the step size and Ml & 1/(λ2κ) is

the lower bound on the eigenvalues of the Hessian and Mu . 1 is the upper bound

on the eigenvalues of the Hessian.

Proof. In order to show the linear convergence of gradient descent, we first show

that the Hessian along the line between [U c, V c] and [U∗, V ∗] are positive definite

w.h.p..

The idea is essentially building a d−1/2λ−2κ−1-net for the line between the

current iterate and the optimal. In particular, we set d1/2 points [Ua, V a]a=1,2,··· ,d1/2

that are equally distributed between [U c, V c] and [U∗, V ∗]. Therefore, ‖Ua+1 −

Ua‖+ ‖V a+1 − V a‖ . d−1/2λ−2κ−1

Using Lemma D.3.20, we can show that for any [U, V ], if there exists a value

of a such that ‖U − Ua‖+ ‖V − V a‖ . d−1/2λ−2κ−1, then

‖∇2fΩ(U, V )−∇2fΩ(Ua, V a)‖ . λ−2κ−1.

Therefore, for every point [U, V ] in the line between [U c, V c] and [U∗, V ∗], we can

find a fixed point in [Ua, V a]a=1,2,··· ,d1/2 , such that ‖U − Ua‖ + ‖V − V a‖ .

d−1/2λ−2κ−1. Now applying union bound for all a, we have that w.p. 1 − d−t, for

every point [U, V ] in the line between [U c, V c] and [U∗, V ∗],

MlI ∇2fΩ(U, V ) Mu,

364

Page 381: Copyright by Kai Zhong 2018

where Ml = Ω(λ−2κ−1) and Mu = O(1). Note that the upper bound of the Hessian

is due to the fact that φ and φ′ are bounded.

Given the positive definiteness of the Hessian along the line between the

current iterate and the optimal, we are ready to show the linear convergence. First

we set the stepsize for the gradient descent update as η = 1/Mu and use notation

W := [U, V ] to simplify the writing.

‖W c+1 −W ∗‖2F

= ‖W c − η∇fΩ(W c)−W ∗‖2F

= ‖W c −W ∗‖2F − 2η〈∇fΩ(W c), (W c −W ∗)〉+ η2‖∇fΩ(W c)‖2F

Note that

∇fΩ(W c) =

(∫ 1

0

∇2fΩ(W∗ + ξ(W c −W ∗))dξ

)vec(W c −W ∗).

Define H ∈ R(2kd)×(2kd),

H =

(∫ 1

0

∇2fΩ(W∗ + ξ(W c −W ∗))dξ

).

By the result provided above, we have

MlI H MuI. (D.26)

Now we upper bound the norm of the gradient,

‖∇fΩ(W c)‖2F = 〈Hvec(W c −W ∗), Hvec(W c −W ∗)〉 ≤Mu〈vec(W c −W ∗), Hvec(W c −W ∗)〉.

365

Page 382: Copyright by Kai Zhong 2018

Therefore,

‖W c+1 −W ∗‖2F

≤ ‖W c −W ∗‖2F − (−η2Mu + 2η)〈vec(W c −W ∗), Hvec(W c −W ∗)〉

≤ ‖W c −W ∗‖2F − (−η2Mu + 2η)Ml‖W c −W ∗‖2F

= ‖W c −W ∗‖2F −Ml

Mu

‖W c −W ∗‖2F

≤ (1− Mu

Ml

)‖W c −W ∗‖2F

Lemma D.3.20. Let the activation function be tan/sigmoid. For given Ua, V a and

r > 0, if

n1 & ε−2td log2 d, n2 & ε−2t log d, |Ω| & ε−2td log2 d,

then with probability 1− d−t,

sup‖U−Ua‖+‖V−V a‖≤r

‖∇2fΩ(U, V )−∇2fΩ(Ua, V a)‖ . d1/2 · r

Proof. We consider each block of the Hessian as defined in Eq (D.11). In particular,

we show that if

n1 & ε−2td log2 d, n2 & ε−2t log d, |Ω| & ε−2td log2 d,

366

Page 383: Copyright by Kai Zhong 2018

then with probability 1− d−t,∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[(φ′(u>

i x)φ′(u>

j x)φ(v>i y)φ(v

>j y)− φ′(ua>

i x)φ′(ua>j x)φ(va>i y)φ(va>j y))xx>]∥∥∥∥

. (‖ui − uai ‖+ ‖uj − ua

j‖+ ‖vi − vai ‖+ ‖vj − vaj ‖)d1/2;

by Lemma D.3.21∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[(φ(U>x)>φ(V >y)− φ(U∗>x)>φ(V ∗>y)

)φ′′(u>

i x)φ(v>i y)xx

>

−(φ(Ua>x)>φ(V a>y)− φ(U∗>x)>φ(V ∗>y)

)φ′′(ua>

i x)φ(va>i y)xx>]∥∥∥∥. (‖U − Ua‖+ ‖V − V a‖)d1/2

by Lemma D.3.22∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[(φ′(u>

i x)φ′(v>j y)φ(v

>i y)φ(u

>j x)− φ′(ua>

i x)φ′(va>j y)φ(va>i y)φ(ua>j x)

)xy>

]∥∥∥∥. (‖ui − ua

i ‖+ ‖uj − uaj‖+ ‖vi − vai ‖+ ‖vj − vaj ‖)d1/2

by Lemma D.3.23∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[(φ(U>x)>φ(V >y)− φ(U∗>x)>φ(V ∗>y)

)φ′(u>

i x)φ′(v>i y)xy

>

−(φ(Ua>x)>φ(V a>y)− φ(U∗>x)>φ(V ∗>y)

)φ′(ua>

i x)φ′(va>i y)xy>]

∥∥∥∥. (‖U − Ua‖+ ‖V − V a‖)d1/2

by Lemma D.3.24

Lemma D.3.21. If

n1 & ε−2td log2 d, n2 & ε−2t log d, |Ω| & ε−2td log2 d,

367

Page 384: Copyright by Kai Zhong 2018

then with probability at least 1− d−t,∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[(φ′(u>

i x)φ′(u>

j x)φ(v>i y)φ(v

>j y)− φ′(ua>

i x)φ′(ua>j x)φ(va>i y)φ(va>j y))xx>]∥∥∥∥∥∥

. (‖ui − uai ‖+ ‖uj − ua

j‖+ ‖vi − vai ‖+ ‖vj − vaj ‖)d1/2

Proof. Note that

φ′(u>i x)φ

′(u>j x)φ(v

>i y)φ(v

>j y)− φ′(ua>

i x)φ′(ua>j x)φ(va>i y)φ(va>j y)

=φ′(u>i x)φ

′(u>j x)φ(v

>i y)φ(v

>j y)− φ′(ua>

i x)φ′(u>j x)φ(v

>i y)φ(v

>j y)

+ φ′(ua>i x)φ′(u>

j x)φ(v>i y)φ(v

>j y)− φ′(ua>

i x)φ′(ua>j x)φ(v>i y)φ(v

>j y)

+ φ′(ua>i x)φ′(ua>

j x)φ(v>i y)φ(v>j y)− φ′(ua>

i x)φ′(ua>j x)φ(va>i y)φ(v>j y)

+ φ′(ua>i x)φ′(ua>

j x)φ(va>i y)φ(v>j y)− φ′(ua>i x)φ′(ua>

j x)φ(va>i y)φ(va>j y)

(D.27)

Let’s consider the first term in the above formula. The other terms are similar.∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[(φ′(u>

i x)− φ′(ua>i x))φ′(u>

j x)φ(v>i y)φ(v

>j y)xx

>]∥∥∥∥∥∥≤

∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[‖ui − ua

i ‖‖x‖xx>]∥∥∥∥∥∥

which is because both φ′(·) and φ(·) are bounded and Lipschitz continuous. Apply-

ing the unbounded matrix Bernstein Inequality Lemma D.3.9, we can bound∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[‖ui − ua

i ‖‖x‖xx>]∥∥∥∥∥∥ . ‖ui − uai ‖d1/2

368

Page 385: Copyright by Kai Zhong 2018

Since both φ′(·) and φ(·) are bounded and Lipschitz continuous, we can easily ex-

tend the above inequality to other cases and finish the proof.

Lemma D.3.22. If

n1 & ε−2td log2 d, n2 & ε−2t log d, |Ω| & ε−2td log2 d,

then with probability at least 1− d−t,∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[(φ(U>x)>φ(V >y)− φ(U∗>x)>φ(V ∗>y)

)φ′′(u>

i x)φ(v>i y)xx

>

−(φ(Ua>x)>φ(V a>y)− φ(U∗>x)>φ(V ∗>y)

)φ′′(ua>

i x)φ(va>i y)xx>]∥∥. (‖U − Ua‖+ ‖V − V a‖)d1/2

Proof. Since for sigmoid/tanh, φ, φ′, φ′′ are all Lipschitz continuous and bounded,

the proof of this lemma resembles the proof for Lemma D.3.21.

Lemma D.3.23. If

n1 & ε−2td log2 d, n2 & ε−2t log d, |Ω| & ε−2td log2 d,

then with probability at least 1− d−t,∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[(φ′(u>

i x)φ′(v>j y)φ(v

>i y)φ(u

>j x)− φ′(ua>

i x)φ′(va>j y)φ(va>i y)φ(ua>j x)

)xy>

]∥∥∥∥∥∥. (‖ui − ua

i ‖+ ‖uj − uaj‖+ ‖vi − vai ‖+ ‖vj − vaj ‖)d1/2

Proof. Do the similar splits as Eq. (D.27) and let’s consider the following case,∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[(φ′(u>

i x)− φ′(ua>i x)

)φ′(v>j y)φ(v

>i y)φ(u

>j x)xy

>]∥∥∥∥∥∥.369

Page 386: Copyright by Kai Zhong 2018

Setting M(x) =(φ′(u>

i x)− φ′(ua>i x)

)φ(u>

j x)x, N(y) = φ′(v>j y)φ(v>i y)y

> and

using the fact that ‖φ′(u>i x)−φ′(ua>

i x)‖ ≤ ‖ui−uai ‖‖x‖, we can follow the proof

of Lemma D.3.6 to show if

n1 & ε−2td log2 d, n2 & ε−2t log d, |Ω| & ε−2td log2 d,

then with probability at least 1− d−t,∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[(φ′(u>

i x)− φ′(ua>i x)

)φ′(v>j y)φ(v

>i y)φ(u

>j x)xy

>]∥∥∥∥∥∥ ≤ ‖ui − uai ‖d1/2

Lemma D.3.24. If

n1 & ε−2td log2 d, n2 & ε−2t log d, |Ω| & ε−2td log2 d,

then with probability at least 1− d−t,∥∥∥∥∥∥ 1

|Ω|∑

(x,y)∈Ω

[(φ(U>x)>φ(V >y)− φ(U∗>x)>φ(V ∗>y)

)φ′(u>

i x)φ′(v>i y)xy

>

−(φ(Ua>x)>φ(V a>y)− φ(U∗>x)>φ(V ∗>y)

)φ′(ua>

i x)φ′(va>i y)xy>]∥∥

. (‖U − Ua‖+ ‖V − V a‖)d1/2

Proof. Since for sigmoid/tanh, φ, φ′, φ′′ are all Lipschitz continuous and bounded,

the proof of this lemma resembles the proof for Lemma D.3.23.

370

Page 387: Copyright by Kai Zhong 2018

Appendix E

Low-rank Matrix Sensing

E.1 Proof of Theorem 7.3.1

Proof. We explain the key ideas of the proof by first presenting the proof for the

special case of rank-1 W ∗ = σ∗u∗v>∗ . We later extend the proof to general rank-k

case.

Similar to [71], we first characterize the update for h + 1-th step iterates

vh+1 of Algorithm 7.3.1 and its normalized form vh+1 = vh+1/‖vh+1‖2.

Now, by gradient of (7.4) w.r.t. v to be zero while keeping uh to be fixed.

That is,

m∑i=1

(bi − x>i uhv

>h+1yi)(x

>i uh)yi = 0,

i.e.,m∑i=1

(uh>xi)yi(σ∗y

>i v∗u

>∗ xi − y>

i vh+1uh>xi) = 0,

i.e.,

(m∑i=1

(x>i uhu

>hxi)yiy

>i

)vh+1 = σ∗

(m∑i=1

(x>i uhu

>∗ xi)yiy

>i

)v∗,

i.e., vh+1 = σ∗(u>∗ uh)v∗ − σ∗B

−1((u>∗ uh)B − B)v∗, (E.1)

where,

B =1

m

m∑i=1

(x>i uhu

>hxi)yiy

>i , B =

1

m

m∑i=1

(x>i uhu

>∗ xi)yiy

>i .

371

Page 388: Copyright by Kai Zhong 2018

Note that (E.1) shows that vh+1 is a perturbation of v∗ and the goal now is

to bound the spectral norm of the perturbation term:

‖G‖2 = ‖B−1(u>∗ uhB − B)v∗‖2 ≤ ‖B−1‖2‖u>

∗ uhB − B‖2‖v∗‖2. (E.2)

Now,, using Property 2 mentioned in the theorem, we get:

‖B− I‖2 ≤ 1/100, i.e., σmin(B) ≥ 1− 1/100, i.e., ‖B−1‖2 ≤ 1/(1− 1/100).

(E.3)

Now,

(u>∗ uh)B − B =

1

m

m∑i=1

yiy>i x

>i ((u

>∗ uh)uhu

>h − u∗u

>h )xi,

=1

m

m∑i=1

yiy>i x

>i (uhu

>h − I)u∗u

>hxi,

ζ1≤ 1

100‖(uhu

>h − I)u∗‖2‖u>

h ‖2 =1

100

√1− (u>

hu∗)2, (E.4)

where ζ1 follows by observing that (uhu>h − I)u∗ and uh are orthogonal set of

vectors and then using Property 3 given in the Theorem 7.3.1. Hence, using (E.3),

(E.4), and ‖v∗‖2 = 1 along with (E.2), we get:

‖G‖2 ≤1

99

√1− (u>

hu∗)2. (E.5)

We are now ready to lower bound the component of vh along the correct

direction v∗ and the component of vh that is perpendicular to the optimal direction

v∗.

Now, by left-multiplying (E.1) by v∗ and using (E.3) we obtain:

v>∗ vh+1 = σ∗(u

>hu∗)− σ∗v

>∗ G ≥ σ∗(u

>hu∗)−

σ∗

99

√1− (u>

hu∗)2. (E.6)

372

Page 389: Copyright by Kai Zhong 2018

Similarly, by multiplying (E.1) by v∗⊥, where v∗

⊥ is a unit norm vector that is

orthogonal to v∗, we get:

〈v∗⊥, vh+1〉 ≤

σ∗

99

√1− (u>

hu∗)2. (E.7)

Using (E.6), (E.7), and ‖vh+1‖22 = (v>∗ vh+1)

2 + ((v∗⊥)>vh+1)

2, we get:

1− (v>h+1v∗)

2 =〈v∗

⊥vh+1〉2

〈v∗vh+1〉2 + 〈v∗⊥vh+1〉2,

≤ 1

99 · 99 · (u>hu∗ − 1

99

√1− (u>

hu∗)2)2 + 1(1− (uhu∗)

2).

(E.8)

Also, using Property 1 of Theorem 7.3.1, for S = 1m

∑mi=1 biAi, we get: ‖S‖2 ≥

99σ∗100

. Moreover, by multiplying S −W ∗ by u0 on left and v0 on the right and

using the fact that (u0,v0) are the largest singular vectors of S, we get: ‖S‖2 −

σ∗v>0 v∗u

>0 u∗ ≤ σ∗/100. Hence, u>

0 u∗ ≥ 9/10.

Using the (E.8) along with the above given observation and by the “induc-

tive” assumption u>hu∗ ≥ u>

0 u∗ ≥ 9/10 (proof of the inductive step follows di-

rectly from the below equation) , we get:

1− (v>h+1v∗)

2 ≤ 1

2(1− (u>

hu∗)2). (E.9)

Similarly, we can show that 1 − (u>h+1u∗)

2 ≤ 12(1 − (v>

h+1v∗)2). Hence, after

H = O(log(σ∗/ε)) iterations, we obtain WH = uH v>H , s.t., ‖WH−W ∗‖2 ≤ ε.

We now generalize our above given proof to the rank-k case. In the case

of rank-1 matrix recovery, we used 1− (v>h+1u∗)

2 as the error or distance function

and show at each step that the error decreases by at least a constant factor. For

373

Page 390: Copyright by Kai Zhong 2018

general rank-k case, we need to generalize the distance function to be a distance

over subspaces of dimension-k. To this end, we use the standard principle angle

based subspace distance. That is,

Definition E.1.1. Let U1, U2 ∈ Rd×k be k-dimensional subspaces. Then the princi-

ple angle based distance dist(U1, U2) between U1, U2 is given by:

dist(U1, U2) = ‖U>⊥U2‖2,

where U⊥ is the subspace orthogonal to U1.

Proof of Theorem 7.3.1: General Rank-k Case. For simplicity of notation, we de-

note Uh by U , Vh+1 by V , and Vh+1 by V .

Similar to the above given proof, we first present the update equation for

V(t+1). Recall that V(t+1) = argminV ∈Rd2×k

∑i(x

>i W ∗yi − x>

i U V >yi)2. Hence,

by setting gradient of this objective function to 0, using the above given notation

and by simplifications, we get:

V = W ∗>U − F, (E.10)

where F = [F1F2 . . . Fk] is the “error” matrix.

Before specifying F , we first introduce block matrices B,C,D, S ∈ Rkd2×kd2

374

Page 391: Copyright by Kai Zhong 2018

with (p, q)-th block Bpq, Cpq, Spq, Dpq given by:

Bpq =∑i

yiy>i (x

>i up)(x

>i uq), (E.11)

Cpq =∑i

yiy>i (x

>i up)(x

>i u∗q), (E.12)

Dpq = u>p u∗qI, (E.13)

Spq = σ∗pI if p = q, and 0 if p 6= q. (E.14)

where σ∗p = Σ∗(p, p), i.e., the p-th singular value of W ∗ and u∗q is the q-th column

of U∗.

Then, using the definitions given above, we get:F1...Fk

= B−1(BD − C)S ·m(V∗). (E.15)

Now, recall that in the t+1-th iteration of Algorithm 7.3.1, Vt+1 is obtained

by QR decomposition of Vt+1. Using notation mentioned above, V = V R where

R denotes the lower triangular matrix Rt+1 obtained by the QR decomposition of

Vt+1.

Now, using (E.10), V = V R−1 = (W ∗>U − F )R−1. Multiplying both the

sides by V∗p, where V∗p is a fixed orthonormal basis of the subspace orthogonal to

span(V∗), we get:

(V∗p)>V = −(V∗p)

>FR−1 ⇒ dist(V∗, Vt+1) = ‖(V∗p)>V ‖2 ≤ ‖F‖2‖R−1‖2.

(E.16)

375

Page 392: Copyright by Kai Zhong 2018

Also, note that using the initialization property (1) mentioned in Theorem 7.3.1,

we get ‖S −W ∗‖2 ≤ σ∗k

100. Now, using the standard sin theta theorem for singular

vector perturbation[83], we get: dist(U0, U∗) ≤ 1100

.

Theorem 7.3.1 now follows by using Lemma E.1.1, Lemma E.1.2 along

with the above mentioned bound on dist(U0, U∗).

Lemma E.1.1. Let A be a rank-one measurement operator where Ai = xiy>i .

Also, let A satisfy Property 1, 2, 3 mentioned in Theorem 7.3.1 and let σ∗1 ≥ σ∗

2 ≥

· · · ≥ σ∗k be the singular values of W ∗. Then,

‖F‖2 ≤σ∗

k

100dist(Ut, U∗).

Lemma E.1.2. Let A be a rank-one measurement operator where Ai = xiy>i .

Also, let A satisfy Property 1, 2, 3 mentioned in Theorem 7.3.1. Then,

‖R−1‖2 ≤1

σ∗k ·√

1− dist2(Ut, U∗)− ‖F‖2.

Proof of Lemma E.1.1. Recall that m(F ) = B−1(BD − C)S ·m(V∗). Hence,

‖F‖2 ≤ ‖F‖F ≤ ‖B−1‖2‖BD−C‖2‖S‖2‖m(V∗)‖2 = σ∗1√k‖B−1‖2‖BD−C‖2.

(E.17)

Now, we first bound ‖B−1‖2 = 1/(σmin(B)). Also, let Z = [z1z2 . . . zk] and let

z = m(Z). Then,

σmin(B) = minz,‖z‖2=1

z>Bz = minz,‖z‖2=1

∑1≤p≤k,1≤q≤k

z>p Bpqzq

= minz,‖z‖2=1

∑p

z>p Bppzp +

∑pq,p 6=q

z>p Bpqzq. (E.18)

376

Page 393: Copyright by Kai Zhong 2018

Recall that, Bpp =1m

∑mi=1 yiy

>i (x

>i up)

2 and up is independent of ξ,yi,∀i. Hence,

using Property 2 given in Theorem 7.3.1, we get:

σmin(Bpp) ≥ 1− δ, (E.19)

where,

δ =1

k3/2 · β · 100,

and β = σ∗1/σ∗

k is the condition number of W ∗.

Similarly, using Property (3), we get:

‖Bpq‖2 ≤ δ. (E.20)

Hence, using (E.18), (E.19), (E.20), we get:

σmin(B) ≥ minz,‖z‖2=1

(1−δ)∑p

‖zp‖22−δ∑

pq,p 6=q

‖zp‖2‖zq‖2 = minz,‖z‖2=1

1−δ∑pq

‖zp‖2‖zq‖2 ≥ 1−kδ.

(E.21)

Now, consider BD − C:

‖BD − C‖2 = maxz,‖z‖2=1

|z>(BD − C)z|,

= maxz,‖z‖2=1

∣∣∣∣∣ ∑1≤p≤k,1≤q≤k

z>p yiy

>i zqx

>i

( ∑1≤`≤k

〈u`u∗q〉upu>` − upu

>∗q

)xi

∣∣∣∣∣,= max

z,‖z‖2=1

∣∣∣∣∣ ∑1≤p≤k,1≤q≤k

z>p yiy

>i zqx

>i upu

>∗q(UU> − I)xi

∣∣∣∣∣,ζ1≤ δ max

z,‖z‖2=1

∑1≤p≤k,1≤q≤k

‖(UU> − I)u∗q‖2‖zp‖2‖zq‖2 ≤ k · δ · dist(U,U∗),

(E.22)

where ζ1 follows by observing that u>∗q(UU> − I)up = 0 and then by applying

Property (3) mentioned in Theorem 7.3.1.

Lemma now follows by using (E.22) along with (E.17) and (E.21).

377

Page 394: Copyright by Kai Zhong 2018

Proof of Lemma E.1.2. The lemma is exactly the same as Lemma 4.7 of [71]. We

reproduce their proof here for completeness.

Let σmin(R) be the smallest singular value of R, then:

σmin(R) = minz,‖z‖2=1

‖Rz‖2 = minz,‖z‖2=1

‖V Rz‖2 = minz,‖z‖2=1

‖V∗Σ∗U>∗ Uz − Fz‖2,

≥ minz,‖z‖2=1

‖V∗Σ∗U>∗ Uz‖2 − ‖Fz‖2 ≥ σ∗

kσmin(U>U∗)− ‖F‖2,

≥ σ∗k

√1−

∥∥U>U∗⊥∥∥2

2− ‖F‖2 = σ∗

k√

1− dist(U∗, U)2 − ‖F‖2.(E.23)

Lemma now follows by using the above inequality along with the fact that ‖R−1‖2 ≤

1/σmin(R).

E.2 Proofs for Matrix Sensing using Rank-one Independent Gaus-sian Measurements

E.2.1 Proof of Claim 7.3.1

Proof. The main idea behind our proof is to show that there exists two rank-1 ma-

trices ZU , ZL s.t. ‖AGI(ZU)‖22 is large while ‖AGI(ZL)‖22 is much smaller than

‖AGI(ZU)‖22.

In particular, let ZU = x1y>1 and let ZL = uv> where u,v are sampled

from normal distribution independent of X,Y . Now,

‖AGI(ZU)‖22 =m∑i=1

‖x1‖42‖y1‖42 +m∑i=2

(x>1 xi)

2(y>1 yi)

2.

Now, as xi,yi,∀i are multi-variate normal random variables, ‖x1‖42‖y1‖42 ≥ 0.5d21d22

w.p. ≥ 1− 2 exp(−d1 − d2).

‖AGI(ZU)‖22 ≥ .5d21d22. (E.24)

378

Page 395: Copyright by Kai Zhong 2018

Moreover, ‖ZU‖2F ≤ 2d1d2 w.p. ≥ 1− 2 exp(−d1 − d2).

Now, consider

‖AGI(ZL)‖22 =m∑i=2

(u>xi)2(v>yi)

2,

where ZL = uv> and u,v are sampled from standard normal distribution, inde-

pendent of xi,yi,∀i. Since, u,v are independent of x>1 xi ∼ N(0, ‖x1‖2) and

y>1 yi ∼ N(0, ‖y1‖2). Hence, w.p. ≥ 1 − 1/m3, |u>xi| ≤ log(m)‖u‖2, |v>yi| ≤

log(m)‖v‖2,∀i ≥ 2. That is, w.p. 1− 1/m3:

‖AGI(ZL)‖22 ≤ 4m log4md1d2. (E.25)

Furthermore, ‖ZL‖2F ≤ 2d1d2 w.p. ≥ 1− 2 exp(−d1 − d2).

Using (E.24), (E.25), we get that w.p. ≥ 1− 2/m3 − 10 exp(−d1 − d2):

40m log4m ≤ ‖AGI(Z/‖Z‖F )‖2 ≤ .05d1d2.

Now, for RIP to be satisfied with a constant δ, the lower and upper bound should

be at most a constant factor apart. However, the above equation clearly shows

that the upper and lower bound can match only when m = Ω(d1d2/ log(5d1d2)).

Hence, for m that at most linear in both d1, d2 cannot be satisfied with probability

≥ 1− 1/(d1 + d2)3.

E.2.2 Proof of Thoerem 7.3.2

Proof. We divide the proof into three parts where each part proves a property men-

tioned in Theorem 7.3.1.

379

Page 396: Copyright by Kai Zhong 2018

Proof of Property 1. Now,
$$S = \frac1m\sum_{i=1}^m b_i x_i y_i^\top = \frac1m\sum_{i=1}^m x_i x_i^\top U_*\Sigma_*V_*^\top y_i y_i^\top = \frac1m\sum_{i=1}^m Z_i,$$
where $Z_i = x_i x_i^\top U_*\Sigma_*V_*^\top y_i y_i^\top$. Note that $E[Z_i] = U_*\Sigma_*V_*^\top$. Also, both $x_i$ and $y_i$ are spherical Gaussian variables and hence are rotationally invariant. Therefore, wlog, we can assume that $U_* = [e_1\, e_2\, \dots\, e_k]$ and $V_* = [e_1\, e_2\, \dots\, e_k]$, where $e_i$ is the $i$-th canonical basis vector.

As $S$ is a sum of $m$ random matrices, the goal is to apply Lemma 2.4.3 to show that $S$ is close to $E[S] = W^* = U_*\Sigma_*V_*^\top$ for large enough $m$. However, Lemma 2.4.3 requires bounded random variables, while $Z_i$ is unbounded. We handle this issue by clipping $Z_i$ to ensure that its spectral norm is always bounded. In particular, consider the following random variable:
$$\tilde x_{ij} = \begin{cases} x_{ij}, & |x_{ij}| \le C\sqrt{\log(m(d_1+d_2))},\\ 0, & \text{otherwise},\end{cases} \qquad \text{(E.26)}$$
where $x_{ij}$ is the $j$-th coordinate of $x_i$. Similarly, define:
$$\tilde y_{ij} = \begin{cases} y_{ij}, & |y_{ij}| \le C\sqrt{\log(m(d_1+d_2))},\\ 0, & \text{otherwise}.\end{cases} \qquad \text{(E.27)}$$
Note that $P(\tilde x_{ij} = x_{ij}) \ge 1 - \frac{1}{(m(d_1+d_2))^{C}}$ and $P(\tilde y_{ij} = y_{ij}) \ge 1 - \frac{1}{(m(d_1+d_2))^{C}}$. Also, $\tilde x_{ij}, \tilde y_{ij}$ are still symmetric and independent random variables, i.e., $E[\tilde x_{ij}] = E[\tilde y_{ij}] = 0$ for all $i, j$. Hence, $E[\tilde x_{ij}\tilde x_{i\ell}] = 0$ for all $j \ne \ell$. Furthermore, for all $j$,
$$E[\tilde x_{ij}^2] = E[x_{ij}^2] - \frac{2}{\sqrt{2\pi}}\int_{C\sqrt{\log(m(d_1+d_2))}}^{\infty} x^2\exp(-x^2/2)\,dx
= 1 - \frac{2}{\sqrt{2\pi}}\cdot\frac{C\sqrt{\log(m(d_1+d_2))}}{(m(d_1+d_2))^{C^2/2}} - \frac{2}{\sqrt{2\pi}}\int_{C\sqrt{\log(m(d_1+d_2))}}^{\infty}\exp(-x^2/2)\,dx
\ge 1 - \frac{2C\sqrt{\log(m(d_1+d_2))}}{(m(d_1+d_2))^{C^2/2}}. \qquad \text{(E.28)}$$
Similarly,
$$E[\tilde y_{ij}^2] \ge 1 - \frac{2C\sqrt{\log(m(d_1+d_2))}}{(m(d_1+d_2))^{C^2/2}}. \qquad \text{(E.29)}$$

Now, consider the random variable $\tilde Z_i = \tilde x_i\tilde x_i^\top U_*\Sigma_*V_*^\top\tilde y_i\tilde y_i^\top$. Note that
$$\|\tilde Z_i\|_2 \le C^4\sqrt{d_1d_2}\,k\log^2(m(d_1+d_2))\,\sigma_1^*$$
and $\|E[\tilde Z_i]\|_2 \le \sigma_1^*$. Also,
$$\|E[\tilde Z_i\tilde Z_i^\top]\|_2 = \big\|E\big[\|\tilde y_i\|_2^2\,\tilde x_i\tilde x_i^\top U_*\Sigma_*V_*^\top\tilde y_i\tilde y_i^\top V_*\Sigma_*U_*^\top\tilde x_i\tilde x_i^\top\big]\big\|_2
\le C^2 d_2\log(m(d_1+d_2))\,\big\|E[\tilde x_i\tilde x_i^\top U_*\Sigma_*^2U_*^\top\tilde x_i\tilde x_i^\top]\big\|_2
\le C^2 d_2\log(m(d_1+d_2))(\sigma_1^*)^2\big\|E[\|U_*^\top\tilde x_i\|_2^2\,\tilde x_i\tilde x_i^\top]\big\|_2
\le C^4 k d_2\log^2(m(d_1+d_2))(\sigma_1^*)^2. \qquad \text{(E.30)}$$
Similarly,
$$\|E[\tilde Z_i]E[\tilde Z_i^\top]\|_2 \le (\sigma_1^*)^2. \qquad \text{(E.31)}$$
Similarly, we can obtain bounds for $\|E[\tilde Z_i^\top\tilde Z_i]\|_2$ and $\|E[\tilde Z_i]^\top E[\tilde Z_i]\|_2$.

Finally, by selecting $m = C_1\frac{k(d_1+d_2)\log^2(d_1+d_2)}{\delta^2}$ and applying Lemma 2.4.3, we get (w.p. $1 - \frac{1}{(d_1+d_2)^{10}}$),
$$\Big\|\frac1m\sum_{i=1}^m\tilde Z_i - E[\tilde Z_i]\Big\|_2 \le \delta. \qquad \text{(E.32)}$$
Note that $E[\tilde Z_i] = E[\tilde x_{i1}^2]E[\tilde y_{i1}^2]\,U_*\Sigma_*V_*^\top$. Hence, by using (E.32), (E.28), (E.29),
$$\Big\|\frac1m\sum_{i=1}^m\tilde Z_i - U_*\Sigma_*V_*^\top\Big\|_2 \le \delta + \frac{\sigma_1^*}{(d_1+d_2)^{100}}.$$

Finally, by selecting $C$ to be large enough in the definition of $\tilde x_i, \tilde y_i$ (see (E.26), (E.27)), we get $P(\|\tilde Z_i - Z_i\|_2 = 0) \ge 1 - \frac{1}{(d_1+d_2)^5}$. Hence, by assuming $\delta$ to be a constant w.r.t. $d_1, d_2$ and by a union bound, w.p. $\ge 1 - \frac{2}{(d_1+d_2)^5}$,
$$\Big\|\frac1m\sum_{i=1}^m Z_i - W^*\Big\|_2 \le 5\delta\|W^*\|_2.$$
Now, the theorem follows directly by setting $\delta = \frac{1}{100\,k^{3/2}\beta}$.
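The initialization property above can be illustrated numerically. The following NumPy sketch is an illustration only (the dimensions, rank and sample sizes are arbitrary choices, and it is not the code used for the experiments in this thesis): it forms $S = \frac1m\sum_i b_i x_i y_i^\top$ from rank-one Gaussian measurements and reports how its spectral-norm distance to $W^*$ and the alignment of its top-$k$ left singular space improve with $m$.

```python
import numpy as np

# Illustration only (arbitrary sizes, not the thesis code): the matrix
# S = (1/m) sum_i b_i x_i y_i^T with b_i = x_i^T W* y_i and Gaussian x_i, y_i
# concentrates around W* = U* Sigma* V*^T, so its top-k SVD gives a good
# initialization for the column space of W*.
rng = np.random.default_rng(0)
d1, d2, k = 40, 30, 3
U_star, _ = np.linalg.qr(rng.standard_normal((d1, k)))
V_star, _ = np.linalg.qr(rng.standard_normal((d2, k)))
W_star = U_star @ np.diag([3.0, 2.0, 1.0]) @ V_star.T

for m in (1000, 10000, 100000):
    X = rng.standard_normal((m, d1))
    Y = rng.standard_normal((m, d2))
    b = np.einsum('ij,jk,ik->i', X, W_star, Y)       # b_i = x_i^T W* y_i
    S = (X * b[:, None]).T @ Y / m                   # (1/m) sum_i b_i x_i y_i^T
    err = np.linalg.norm(S - W_star, 2) / np.linalg.norm(W_star, 2)
    U_S, _, _ = np.linalg.svd(S)
    align = np.linalg.svd(U_S[:, :k].T @ U_star, compute_uv=False).min()
    print(f"m={m:6d}  ||S - W*||/||W*|| = {err:.3f}   subspace alignment = {align:.3f}")
```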

Proof of Property 2. Here again, the goal is to show that the random matrix $B_x$ concentrates around its mean, which is given by $I$. Now, as $x_i, y_i$ are rotationally invariant random variables, wlog, we can assume $u_h = e_1$. That is, $(x_i^\top u_h u_h^\top x_i) = x_{i1}^2$, where $x_{i1}$ is the first coordinate of $x_i$. Furthermore, similar to (E.26), (E.27), we define clipped random variables $\tilde x_{i1}$ and $\tilde y_i$ below:
$$\tilde x_{i1} = \begin{cases} x_{i1}, & |x_{i1}| \le C\sqrt{\log(m)},\\ 0, & \text{otherwise}.\end{cases} \qquad \text{(E.33)}$$
$$\tilde y_i = \begin{cases} y_i, & \|y_i\|_2^2 \le 2(d_1+d_2),\\ 0, & \text{otherwise}.\end{cases} \qquad \text{(E.34)}$$
Now, consider $\tilde B = \frac1m\sum_{i=1}^m\tilde Z_i$, where $\tilde Z_i = \tilde x_{i1}^2\tilde y_i\tilde y_i^\top$. Note that $\|\tilde Z_i\|_2 \le 2C^2(d_1+d_2)\log(m)$. Similarly, $\|E[\sum_i\tilde Z_i\tilde Z_i^\top]\|_2 \le 2mC^4(d_1+d_2)\log^2(m)$. Hence, using Lemma 2.4.3, we get:
$$P\Big(\Big\|\frac1m\sum_i\tilde Z_i - E[\tilde Z_i]\Big\|_2 \ge \gamma\Big) \le \exp\Big(-\frac{m\gamma^2}{2C^4(d_1+d_2)\log^2(m)(1+\gamma/3)}\Big). \qquad \text{(E.35)}$$
Now, using an argument similar to (E.28), we get $\|E[\tilde Z_i] - I\|_2 \le \frac{2C\log(m)}{m^{C^2/2}}$. Furthermore, $P(\tilde Z_i = Z_i) \ge 1 - \frac{1}{m^3}$. Hence, for $m = \Omega(k(d_1+d_2)\log(d_1+d_2)/\delta^2)$, w.p. $\ge 1 - \frac{2}{m^3}$,
$$\|B_x - I\|_2 \le \delta. \qquad \text{(E.36)}$$
Similarly, we can prove the bound for $B_y$ using exactly the same set of arguments.

Proof of Property 3. Let $C = \frac1m\sum_{i=1}^m y_iy_i^\top\, x_i^\top u u_\perp^\top x_i$, where $u, u_\perp$ are fixed orthogonal unit vectors. Now $x_i^\top u \sim N(0,1)$ and $u_\perp^\top x_i \sim N(0,1)$ are both normal variables. Also, note that $u$ and $u_\perp$ are orthogonal; hence $x_i^\top u$ and $u_\perp^\top x_i$ are independent variables.

Hence, $E[x_i^\top u\, u_\perp^\top x_i] = 0$, i.e., $E[C] = 0$. Now, let $m = \Omega(k(d_1+d_2)\log(d_1+d_2)/\delta^2)$. Then, using the clipping argument (used in the previous proof) with Lemma 2.4.3, Property 3 is satisfied w.p. $\ge 1 - \frac{2}{m^3}$. That is, $\|C_y\|_2 \le \delta$. Moreover, $\|C_x\|_2 \le \delta$ can be proved analogously.
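Properties 2 and 3 can be checked in the same spirit. The sketch below is again an illustration with arbitrary parameters (not part of the proof): it verifies empirically that $\frac1m\sum_i(u^\top x_i)^2 y_iy_i^\top$ concentrates around $I$, while $\frac1m\sum_i(u^\top x_i)(u_\perp^\top x_i)\,y_iy_i^\top$ concentrates around $0$.

```python
import numpy as np

# Illustration only (arbitrary sizes): with fixed orthonormal u, u_perp and Gaussian
# x_i, y_i, the matrix (1/m) sum_i (u^T x_i)^2 y_i y_i^T concentrates around I
# (Property 2), while (1/m) sum_i (u^T x_i)(u_perp^T x_i) y_i y_i^T concentrates
# around 0 (Property 3).
rng = np.random.default_rng(1)
d1, d2, m = 40, 30, 50000
u = np.zeros(d1); u[0] = 1.0
u_perp = np.zeros(d1); u_perp[1] = 1.0
X = rng.standard_normal((m, d1))
Y = rng.standard_normal((m, d2))

a, c = X @ u, X @ u_perp
B = (Y * (a ** 2)[:, None]).T @ Y / m
C = (Y * (a * c)[:, None]).T @ Y / m
print("||B - I||_2 =", np.linalg.norm(B - np.eye(d2), 2))
print("||C||_2     =", np.linalg.norm(C, 2))
```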

E.3 Proof of Matrix Sensing using Rank-one Dependent Gaussian Measurements

E.3.1 Proof of Lemma 7.3.1

Proof. Incoherence: The incoherence property directly follows from Lemma 2.2 in [24].

Averaging Property: Given any orthogonal matrix $U \in \mathbb{R}^{d\times k}$ ($d \ge k$), let $Q = [U, U_\perp]$, where $U_\perp$ is a complementary matrix of $U$. Define $S = XQ = (XQX^\top)X$. The matrix $XQX^\top$ can be viewed as a rotation matrix constrained to the column space of $X$. Thus, $S$ is a constrained rotation of $X$, which implies that $S$ is also a random orthogonal matrix, and so are the first $k$ columns of $S$. We use $T \in \mathbb{R}^{n\times k}$ to denote the first $k$ columns of $S$. Then
$$\max_i\|U^\top z_i\| = \max_i\|U^\top Q s_i\| = \max_i\|t_i\|,$$
where $t_i$ is the transpose of the $i$-th row of $T$. Now this property follows from Lemma 2.2 in [24].

E.3.2 Proof of Theorem 7.3.3

Proof. Similar to the proof of Theorem 7.3.2, we divide the proof into three parts, where each part proves a property mentioned in Theorem 7.3.1. In this proof, we set $d = d_1 + d_2$ and $n = n_1 + n_2$.

Proof of Property 1. As mentioned in the proof sketch, wlog, we can assume that both $X, Y$ are orthonormal matrices and that the condition number of $R$ is the same as the condition number of $W^*$.

We first recall the definition of $S$:
$$S = \frac{n_1 n_2}{m}\sum_{(i,j)\in\Omega} x_i x_i^\top U_*\Sigma_*V_*^\top y_j y_j^\top = \frac{n_1 n_2}{m}\sum_{(i,j)\in\Omega} Z_{ij},$$
where $Z_{ij} = x_i x_i^\top U_*\Sigma_*V_*^\top y_j y_j^\top = X e_i e_i^\top X^\top U_*\Sigma_*V_*^\top Y e_j e_j^\top Y^\top$, and $e_i, e_j$ denote the $i$-th and $j$-th canonical basis vectors, respectively.

Also, $(i,j)$ is sampled uniformly at random from $[n_1]\times[n_2]$. Hence, $E_i[e_ie_i^\top] = \frac{1}{n_1}I$ and $E_j[e_je_j^\top] = \frac{1}{n_2}I$. That is,
$$E_{ij}[Z_{ij}] = \frac{1}{n_1 n_2}XX^\top U_*\Sigma_*V_*^\top YY^\top = \frac{1}{n_1 n_2}U_*\Sigma_*V_*^\top = W^*/(n_1\cdot n_2),$$
where $XX^\top = I$, $YY^\top = I$ follows by the orthonormality of both $X$ and $Y$.

We now use the matrix concentration bound of Lemma 2.4.3 to bound $\|S - W^*\|_2$. To apply the bound of Lemma 2.4.3, we first need to bound the following two quantities:

• Bound on $\max_{ij}\|Z_{ij}\|_2$: Now,
$$\|Z_{ij}\|_2 = \|x_ix_i^\top U_*\Sigma_*V_*^\top y_jy_j^\top\|_2 \le \sigma_1^*\|U_*^\top x_i\|_2\|V_*^\top y_j\|_2\|x_i\|_2\|y_j\|_2 \le \sigma_1^*\mu\mu_0\frac{\sqrt{d_1d_2}\,k}{n_1n_2},$$
where the last inequality follows from the two properties of the random orthogonal matrices $X, Y$.

• Bound on $\|\sum_{(i,j)\in\Omega}E[Z_{ij}Z_{ij}^\top]\|_2$ and $\|\sum_{(i,j)\in\Omega}E[Z_{ij}^\top Z_{ij}]\|_2$: We first consider $\|\sum_{(i,j)\in\Omega}E[Z_{ij}Z_{ij}^\top]\|_2$:
$$\Big\|\sum_{(i,j)\in\Omega}E[Z_{ij}Z_{ij}^\top]\Big\|_2 = \Big\|\sum_{(i,j)\in\Omega}E[x_ix_i^\top W^* y_jy_j^\top y_jy_j^\top W^{*\top}x_ix_i^\top]\Big\|_2
\overset{\zeta_1}{\le} \frac{\mu d_2}{n_2}\Big\|\sum_{(i,j)\in\Omega}E[x_ix_i^\top W^* y_jy_j^\top W^{*\top}x_ix_i^\top]\Big\|_2
\overset{\zeta_2}{=} \frac{\mu^2 d_2}{n_2^2}\Big\|\sum_{(i,j)\in\Omega}E[x_ix_i^\top W^*W^{*\top}x_ix_i^\top]\Big\|_2
\overset{\zeta_3}{\le} \frac{(\sigma_1^*)^2\mu\mu_0 k d_2}{n_1 n_2^2}\Big\|\sum_{(i,j)\in\Omega}E[x_ix_i^\top]\Big\|_2
\overset{\zeta_4}{=} \frac{(\sigma_1^*)^2\mu\mu_0 k d_2}{n_1^2 n_2^2}\cdot m, \qquad \text{(E.37)}$$
where $\zeta_1, \zeta_3$ follow by using the two properties of $X, Y$ and $\|W^*\|_2 \le \sigma_1^*$, and $\zeta_2, \zeta_4$ follow by using $E_i[e_ie_i^\top] = \frac{1}{n_1}I$ and $E_j[e_je_j^\top] = \frac{1}{n_2}I$.

Now, the bound for $\|\sum_{(i,j)\in\Omega}E[Z_{ij}^\top Z_{ij}]\|_2$ turns out to be exactly the same and can be computed using exactly the same arguments as above.

Now, by applying Lemma 2.4.3 and using the above bounds, we get:
$$\Pr\big(\|S - W^*\|_2 \ge \sigma_1^*\gamma\big) \le d\exp\Big(-\frac{m\gamma^2}{\mu\mu_0 k d(1+\gamma/3)}\Big). \qquad \text{(E.38)}$$
That is, w.p. $\ge 1-\gamma$:
$$\|S - W^*\|_2 \le \frac{\sigma_1^*\sqrt{2\mu\mu_0 k d\log(d/\gamma)}}{\sqrt m}. \qquad \text{(E.39)}$$
Because the properties of random orthogonal matrices fail with probability $cn^{-3}\log n$, we assume $\gamma$ is at least of the same magnitude as this failure probability in order to simplify the result, i.e., $\gamma \ge cn^{-3}\log n$. Hence, by selecting $m = O(\mu\mu_0 k^4\cdot\beta^2\cdot d\log(d/\gamma))$, where $\beta = \sigma_1^*/\sigma_k^*$, the following holds w.p. $\ge 1-\gamma$:
$$\|S - W^*\|_2 \le \|W^*\|_2\cdot\delta,$$
where $\delta = 1/(k^{3/2}\cdot\beta\cdot 100)$.
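The estimator analyzed above can be written down directly. The following sketch is an illustration only (feature dimensions, rank, and the number of observed entries are arbitrary, and the incoherence constants are not tracked; it is not the thesis code): it builds $S = \frac{n_1n_2}{m}\sum_{(i,j)\in\Omega} b_{ij}\,x_iy_j^\top$ from uniformly sampled entries with orthonormal side-information matrices $X, Y$ and reports its relative spectral-norm error.

```python
import numpy as np

# Illustration only (arbitrary sizes, incoherence constants not tracked): for
# rank-one dependent measurements b_ij = x_i^T W* y_j with orthonormal feature
# matrices X, Y and entries (i, j) sampled uniformly at random, the rescaled sum
# S = (n1*n2/m) * sum_{(i,j) in Omega} b_ij x_i y_j^T concentrates around W*.
rng = np.random.default_rng(0)
n1, n2, d1, d2, k = 2000, 1500, 40, 30, 3
X, _ = np.linalg.qr(rng.standard_normal((n1, d1)))        # orthonormal columns
Y, _ = np.linalg.qr(rng.standard_normal((n2, d2)))
U_star, _ = np.linalg.qr(rng.standard_normal((d1, k)))
V_star, _ = np.linalg.qr(rng.standard_normal((d2, k)))
W_star = U_star @ np.diag([3.0, 2.0, 1.0]) @ V_star.T

m = 30000
rows = rng.integers(0, n1, size=m)
cols = rng.integers(0, n2, size=m)
b = np.einsum('ij,jk,ik->i', X[rows], W_star, Y[cols])    # b_ij = x_i^T W* y_j
S = (n1 * n2 / m) * (X[rows] * b[:, None]).T @ Y[cols]
print("||S - W*||_2 / ||W*||_2 =",
      np.linalg.norm(S - W_star, 2) / np.linalg.norm(W_star, 2))
```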

Proof of Property 2. We prove the property for $B_y$; the proof for $B_x$ follows analogously. Now, let $B_y = \frac{n_1n_2}{m}\sum_{(i,j)\in\Omega}Z_{ij}$, where $Z_{ij} = x_i^\top u u^\top x_i\, y_jy_j^\top$. Then,
$$E[B_y] = \frac{n_1n_2}{m}\sum_{(i,j)\in\Omega}E_{(i,j)}\big[x_i^\top u u^\top x_i\, y_jy_j^\top\big] = I. \qquad \text{(E.40)}$$
Here again, we apply Lemma 2.4.3 to bound $\|B_y - I\|_2$. To this end, we need to bound the following quantities:

• Bound on $\max_{ij}\|Z_{ij}\|_2$: Now,
$$\|Z_{ij}\|_2 = \|x_i^\top uu^\top x_i\, y_jy_j^\top\|_2 \le \|y_j\|_2^2\,|u^\top x_i|^2 \le \frac{\mu\mu_0 d_2\log n_2}{n_1n_2}.$$
The log factor comes from the second property of random orthogonal matrices.

• Bound on $\|\sum_{(i,j)\in\Omega}E[Z_{ij}Z_{ij}^\top]\|_2$ and $\|\sum_{(i,j)\in\Omega}E[Z_{ij}^\top Z_{ij}]\|_2$: We first consider $\|\sum_{(i,j)\in\Omega}E[Z_{ij}Z_{ij}^\top]\|_2$:
$$\Big\|\sum_{(i,j)\in\Omega}E[Z_{ij}Z_{ij}^\top]\Big\|_2 = \Big\|\sum_{(i,j)\in\Omega}E\big[(x_i^\top uu^\top x_i)^2\|y_j\|_2^2\,y_jy_j^\top\big]\Big\|_2
\overset{\zeta_1}{\le} \frac{\mu d_2}{n_2}\Big\|\sum_{(i,j)\in\Omega}E\big[(x_i^\top uu^\top x_i)^2 y_jy_j^\top\big]\Big\|_2
\overset{\zeta_2}{=} \frac{\mu d_2}{n_2^2}\Big\|\sum_{(i,j)\in\Omega}E\big[(x_i^\top uu^\top x_i)^2\big]\Big\|_2
\overset{\zeta_3}{\le} \frac{\mu\mu_0 d_2\log n_2}{n_1 n_2^2}\Big\|\sum_{(i,j)\in\Omega}E\big[(x_i^\top u)^2\big]\Big\|_2
\overset{\zeta_4}{=} \frac{\mu\mu_0 d_2\log n_2}{n_1^2 n_2^2}\cdot m. \qquad \text{(E.41)}$$

Note that if we assume $k \ge \log n$, the above bounds are smaller than the corresponding bounds obtained in the proof of the Initialization Property. Hence, by applying Lemma 2.4.3 in a similar manner, and selecting $m = O(\mu\mu_0 k^4\cdot\beta^2\cdot d\log(d/\gamma))$ and $\delta = 1/(k^{3/2}\cdot\beta\cdot 100)$, we get w.p. $\ge 1-\gamma$:
$$\|B_y - I\|_2 \le \delta.$$
Hence proved. $\|B_x - I\|_2 \le \delta$ can be proved similarly.

Proof of Property 3. Note that $E[C_y] = E[\sum_{(i,j)\in\Omega}Z_{ij}] = 0$. Furthermore, both $\|Z_{ij}\|_2$ and $\|E[\sum_{(i,j)\in\Omega}Z_{ij}Z_{ij}^\top]\|_2$ have exactly the same bounds as those given in the proof of Property 2 above. Hence, we obtain similar bounds. That is, if $m = O(\mu\mu_0 k^4\cdot\beta^2\cdot d\log(d/\gamma))$ and $\delta = 1/(k^{3/2}\cdot\beta\cdot 100)$, we get w.p. $\ge 1-\gamma$:
$$\|C_y\|_2 \le \delta.$$
Hence proved. $\|C_x\|_2$ can also be bounded analogously.

Appendix F

Over-specified Neural Network

F.1 Preliminaries

Without loss of generality, we assume that $w_i^*$, $i\in[k_0]$, are orthonormal. Define $\bar w_j = w_j/\|w_j\|$, $\theta_{ij} = \angle(w_i, w_j)$, $\alpha_{ij} = \angle(w_i, w_j^*)$. In the special case where $k_0 = 1$, we denote the ground-truth weight vector as $w^*$, and $\alpha_i = \angle(w_i, w^*)$.

The following lemmas characterize the behavior of the $M$ matrix when the input is normally distributed.

Lemma F.1.1 (Lemma 6 [41]). If $x \sim N(0, I)$, then given any two unit vectors $w, u$ (i.e., $\|w\| = \|u\| = 1$),
$$E[\mathbf{1}_{w^\top x\ge0}\mathbf{1}_{u^\top x\ge0}\,xx^\top] = \Big(\frac12 - \frac{\theta}{2\pi}\Big)I + \frac{(wu^\top + uw^\top) - w^\top u\,(ww^\top + uu^\top)}{2\pi\sin\theta}, \qquad \text{(F.1)}$$
where $\theta = \arccos(w^\top u)$.

By multiplying both sides of Eq. (F.1) by $u$, we immediately get the following lemma.

Lemma F.1.2. If $x \sim N(0, I)$, then given any two unit vectors $w, u$ (i.e., $\|w\| = \|u\| = 1$),
$$E[\mathbf{1}_{w^\top x\ge0}\mathbf{1}_{u^\top x\ge0}\,xx^\top]\,u = \Big(\frac12 - \frac{\theta}{2\pi}\Big)u + \frac{\sin\theta}{2\pi}w, \qquad \text{(F.2)}$$
where $\theta = \arccos(w^\top u)$.
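Since Lemmas F.1.1 and F.1.2 drive most of the calculations in this appendix, they are easy to sanity-check by Monte Carlo. The snippet below is an illustration only (the dimension, sample size and angle are arbitrary choices): it compares the empirical average, which is exactly the matrix $\hat M$ of Lemma F.1.7 applied to $u$, against the closed form (F.2).

```python
import numpy as np

# Quick Monte Carlo check (illustration only, not the thesis code) of the closed
# form in Lemma F.1.2: for unit vectors w, u at angle theta and x ~ N(0, I_d),
# E[1{w^T x >= 0} 1{u^T x >= 0} x x^T] u = (1/2 - theta/(2*pi)) u + sin(theta)/(2*pi) w.
rng = np.random.default_rng(0)
d, n, theta = 10, 500000, 1.0
w = np.zeros(d); w[0] = 1.0
u = np.zeros(d); u[0], u[1] = np.cos(theta), np.sin(theta)

X = rng.standard_normal((n, d))
coef = ((X @ w >= 0) & (X @ u >= 0)) * (X @ u)      # 1{.} 1{.} * (x^T u)
empirical = (X * coef[:, None]).mean(axis=0)        # empirical M(w, u) u
closed_form = (0.5 - theta / (2 * np.pi)) * u + np.sin(theta) / (2 * np.pi) * w
print("max deviation:", np.max(np.abs(empirical - closed_form)))   # O(1/sqrt(n))
```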

When two vectors have a close enough angle, $M_{u,v}v$ can be well-approximated by $\frac12 v$, where the approximation error decreases quadratically w.r.t. the angle between $u, v$. The following lemma characterizes this property.

Lemma F.1.3. For any $u, v \in \mathbb{S}^{d-1}$, there exists a constant $C_2$ such that
$$\Big\|M_{u,v}v - \frac12 v\Big\| \le C_2\theta_{u,v}^2.$$

Proof. For Gaussian input, by Lemma F.1.2, we have
$$\Big\|M_{u,v}v - \frac12 v\Big\| = \frac{1}{2\pi}\|\sin\theta_{u,v}\,u - \theta_{u,v}\,v\| = \frac{1}{2\pi}\big(\sin^2\theta_{u,v} + \theta_{u,v}^2 - 2\theta_{u,v}\sin\theta_{u,v}\cos\theta_{u,v}\big)^{\frac12}. \qquad \text{(F.3)}$$
Expanding with the Maclaurin expansion, we have
$$\Big\|M_{u,v}v - \frac12 v\Big\| = \frac{1}{2\pi}\Big(\frac53\theta_{u,v}^4 + O(\theta_{u,v}^6)\Big)^{\frac12} \le C_2\theta_{u,v}^2.$$

Lemma F.1.4. If $x \sim N(0, I_d)$, then there exists a constant $C_3$ such that for all $u, v \in \mathbb{S}^{d-1}$,
$$\frac12 - v^\top M_{u,v}v \ge C_3\theta_{u,v}^3.$$

Proof. When $x \sim N(0, I_d)$, by Eq. (F.3),
$$\frac12 - v^\top M_{u,v}v = \frac{1}{2\pi}v^\top(\theta_{u,v}v - \sin\theta_{u,v}u) = \frac{1}{2\pi}(\theta_{u,v} - \sin\theta_{u,v}\cos\theta_{u,v}) = \frac{1}{3\pi}\theta_{u,v}^3 + O(\theta_{u,v}^5) \ge C_3\theta_{u,v}^3.$$

Lemma F.1.5. For any unit vectors $u_1, u_2, w$, we have
$$\|M_{u_1,w}w - M_{u_2,w}w\| \le \frac{3}{2\pi}\theta_{u_1,u_2}.$$

Proof. By Lemma F.1.2,
$$M_{u_1,w}w - M_{u_2,w}w = \Big(\frac12 - \frac{\theta_{u_1,w}}{2\pi}\Big)w + \frac{\sin\theta_{u_1,w}}{2\pi}u_1 - \Big(\frac12 - \frac{\theta_{u_2,w}}{2\pi}\Big)w - \frac{\sin\theta_{u_2,w}}{2\pi}u_2 \qquad \text{(F.4)}$$
$$= \frac{\theta_{u_2,w} - \theta_{u_1,w}}{2\pi}w + \frac{\sin\theta_{u_1,w}\,u_1 - \sin\theta_{u_2,w}\,u_2}{2\pi}. \qquad \text{(F.5)}$$
By the triangle inequality under the arccos metric, we have $|\theta_{u_2,w} - \theta_{u_1,w}| \le \theta_{u_1,u_2}$. For the second term,
$$\|\sin\theta_{u_1,w}u_1 - \sin\theta_{u_2,w}u_2\| \le \|(\sin\theta_{u_1,w} - \sin\theta_{u_2,w})u_1\| + \|\sin\theta_{u_2,w}(u_1 - u_2)\|
\le |\theta_{u_2,w} - \theta_{u_1,w}| + \|u_1 - u_2\|
\le \theta_{u_1,u_2} + 2\sin\frac{\theta_{u_1,u_2}}{2}
\le 2\theta_{u_1,u_2}.$$
Plugging back into Eq. (F.5), we have
$$\|M_{u_1,w}w - M_{u_2,w}w\| \le \frac{3}{2\pi}\theta_{u_1,u_2} \le \frac12\theta_{u_1,u_2}.$$

We next bound the difference between the population and empirical $M$ matrices. The proof of that bound relies on a covariance matrix concentration result for sub-gaussian random vectors in [128]. For completeness we include it below with explicit dependence on the tail probability.

Lemma F.1.6 (Proposition 2.1 [128]). Consider independent random vectors $X_1, \dots, X_n \in \mathbb{R}^d$, $n \ge d$, which have sub-gaussian distributions with sub-gaussian norm upper bounded by $L$. Then for every $\delta > 0$, with probability at least $1-\delta$, one has, for some absolute constant $C > 0$,
$$\Big\|E[XX^\top] - \frac1n\sum_{i=1}^n X_iX_i^\top\Big\| \le C\sqrt{\log\Big(\frac2\delta\Big)\frac dn}.$$

The following lemma shows the concentration of the $M$ matrix.

Lemma F.1.7. Let $x \sim N(0, I_d)$, and let $x_1, \dots, x_n$ be $n$ i.i.d. samples generated from this distribution. Let $u, v$ be two unit vectors in $\mathbb{R}^d$. Define $M(u,v) = E[\phi'(u^\top x)\phi'(v^\top x)xx^\top]$ and $\hat M(u,v) = \frac1n\sum_{s=1}^n\phi'(u^\top x_s)\phi'(v^\top x_s)x_sx_s^\top$. Then, with probability at least $1-\delta$, we have
$$\|M(u,v) - \hat M(u,v)\| \le C\sqrt{\log\Big(\frac2\delta\Big)\frac dn}.$$

Proof. First notice that the $x_s$ are independent sub-gaussian random vectors, and $\phi'(u^\top x_s) \in \{0, 1\}$. Then, for given $u, v$, the vectors $\phi'(u^\top x_s)\phi'(v^\top x_s)x_s$ are independent sub-gaussian random vectors with sub-gaussian norm upper bounded by $L$. Applying Lemma F.1.6, we have with probability at least $1-\delta$,
$$\Big\|E_x[\phi'(u^\top x)^2\phi'(v^\top x)^2xx^\top] - \frac1n\sum_{s=1}^n\phi'(u^\top x_s)^2\phi'(v^\top x_s)^2x_sx_s^\top\Big\| \le C\sqrt{\log\Big(\frac2\delta\Big)\frac dn}.$$

F.2 Two Learning Phases

Below we break down the convergence path into two phases and briefly discuss the proof ideas for each phase.

F.2.1 Phase I – Learning the Directions

In Phase I, we show that with a small initialization and a small step size, the magnitude of $w_i$ remains small but the direction of $w_i$ moves towards the direction of $w^*$. In order to achieve an overall complexity of $O(\epsilon^{-1})$, we carefully design the analysis for Phase I. In particular, we divide it into two sub-phases. In Phase I-a, we have a suboptimal convergence rate $\epsilon^{-2}$. Yet, in just a small number of iterations $T_{1a} = O((\log(1/\epsilon))^2)$, GD is able to achieve $1/\log(1/\epsilon)$ angle precision, i.e., $\alpha^{(t)} \lesssim 1/\log(1/\epsilon)$. Then, starting from slightly aligned $w_j$'s, we enter Phase I-b, where we have a faster rate of $O(\epsilon^{-1})$.

We first present a lemma to characterize the rate of convergence in Phase I-a. In order to make the lemma compatible with both the population and empirical scenarios, we assume the gradient update is noisy, with $t$-iteration noise $r_i^{(t)}$, where in the population case we have $\|r_i^{(t)}\| = 0$. The noisy gradient update is as follows:
$$w_i^{(t+1)} = w_i^{(t)} - \eta\nabla_{w_i^{(t)}}f(W) + \eta r_i^{(t)}. \qquad \text{(F.6)}$$
For the empirical scenario, which uses a sample set $S$ to perform the gradient update, $r_i^{(t)} := \nabla_{w_i^{(t)}}f(W) - \nabla_{w_i^{(t)}}f_S(W)$.

Lemma F.2.1 (Phase I-a). Let the initialization $w_i^{(0)}$ be randomly generated from a uniform distribution on a $d$-dimensional sphere of radius $\sigma$. Given any constant $\epsilon_0$, if $\sigma \lesssim \min_i\eta(\pi-\alpha_i^{(0)})^3\epsilon_0^2$, $\eta = O(\frac{\epsilon_0^4}{k})$, $\|r_i^{(0)}\| \lesssim (\pi-\alpha_i^{(0)})^3\cdot\epsilon_0^2$ and $\|r_i^{(t)}\| \lesssim \epsilon_0^2$ for $t\ge1$, then after $T_{1a} = O(\epsilon_0^{-2})$ iterations of the noisy gradient update (F.6), we have
$$\alpha^{(T_{1a})} \lesssim \epsilon_0, \qquad \eta(T_{1a}/3 - 1) \le \|w_i^{(T_{1a})}\| \le \eta(T_{1a}/2 + 1), \quad \forall i\in[k]. \qquad \text{(F.7)}$$

Our convergence analysis for Phase I-a requires the initialization scale to be smaller than $\eta\cdot\epsilon_0^2\cdot\min_i(\pi-\alpha_i^{(0)})^3$, which implicitly requires $\alpha_i^{(0)} < \pi$ for all $i$. To understand this requirement, consider a special initialization case where there exists $w_i^{(0)}$ such that $\alpha_i^{(0)} = \pi$. Then $\nabla_{w_i^{(0)}}f_S(W) = 0$ for any sample set $S$; therefore, $w_i^{(0)}$ will never move. For a finite $k$, if $w_i^{(0)}$ is randomly initialized, $\alpha_i^{(0)} < \pi$ almost surely.

To grasp some intuition about the proof, we introduce a decomposition of the gradient. To start with, we define the following matrix for any given pair of unit vectors $u, v\in\mathbb{S}^{d-1}$:
$$M_{u,v} := E_{x\sim\mathcal{X}}[\phi'(u^\top x)\phi'(v^\top x)xx^\top].$$
Denote $M_{ij}^{(t)} := M_{w_i^{(t)},w_j^{(t)}}$ and $R_i^{(t)} := M_{w_i^{(t)},w^*}$. The gradient update can be formulated as
$$w_i^{(t+1)} = w_i^{(t)} + \underbrace{\eta R_i^{(t)}w^*}_{(A)} - \underbrace{\eta\sum_j M_{ij}^{(t)}w_j^{(t)}}_{(B)} + \eta r_i^{(t)}. \qquad \text{(F.8)}$$
Term (A) in Eq. (F.8) only involves $w_i$ and $w^*$, while term (B) in Eq. (F.8) involves the interaction of $w_i$ with all the other $w_j$. On a high level, term (A) always pushes $w_i$ towards the direction of $w^*$, and term (B) always tries to move the $\{w_i\}_{i\in[k]}$ away from each other. The update procedure is hence a combination of the two forces. This observation is also discussed in [98], where they show that the gradient descent update is equivalent to proton-electron electrodynamics: term (A) represents the attractive force between the protons and electrons, while term (B) is the repulsive force from the remaining electrons. In the initial steps, since the magnitudes of the $w_i$ are small, term (A) plays the dominating role, which pushes $w_i$ towards $w^*$, as sketched in the simulation below.
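This decomposition can be simulated directly in the population setting, since Lemma F.1.2 gives $M_{u,v}v$ in closed form for Gaussian inputs. The sketch below is a toy illustration (the dimension, number of neurons, initialization scale and step size are arbitrary choices, not the constants required by the lemmas, and it is not the thesis code): it runs the noiseless update (F.8) and reports the angles $\alpha_i$ and the residual $\|\sum_j w_j - w^*\|$ after training.

```python
import numpy as np

# Toy illustration (arbitrary constants, not the thesis code) of the population
# dynamics (F.8) for k0 = 1: k over-specified ReLU neurons w_1,...,w_k are updated
# with the closed form of Lemma F.1.2, M_{u,v} v = (1/2 - theta/(2 pi)) v
# + sin(theta)/(2 pi) u for unit u, v, extended by homogeneity to general v.
rng = np.random.default_rng(0)
d, k = 20, 5
w_star = np.zeros(d); w_star[0] = 1.0              # unit-norm ground truth
sigma, eta, T = 1e-4, 1e-2, 5000                   # small init scale, small step size
W = rng.standard_normal((k, d))
W = sigma * W / np.linalg.norm(W, axis=1, keepdims=True)

def M_times(u, v):
    """Return M_{u,v} v; M depends only on directions, so rescale by ||v||."""
    u_hat = u / np.linalg.norm(u)
    v_norm = np.linalg.norm(v)
    v_hat = v / v_norm
    theta = np.arccos(np.clip(u_hat @ v_hat, -1.0, 1.0))
    return v_norm * ((0.5 - theta / (2 * np.pi)) * v_hat
                     + np.sin(theta) / (2 * np.pi) * u_hat)

for t in range(T):
    grads = []
    for i in range(k):
        attract = M_times(W[i], w_star)                     # term (A): pull towards w*
        repel = sum(M_times(W[i], W[j]) for j in range(k))  # term (B): interactions
        grads.append(attract - repel)
    W = W + eta * np.stack(grads)                           # noiseless update (F.8)

angles = np.degrees([np.arccos(np.clip(W[i] @ w_star / np.linalg.norm(W[i]), -1, 1))
                     for i in range(k)])
print("angles to w* (degrees):", np.round(angles, 2))
print("||sum_j w_j - w*|| =", np.linalg.norm(W.sum(axis=0) - w_star))
```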

After Phase I-a pushes all $w_i$ to be slightly aligned with $w^*$, we enter Phase I-b and obtain a faster convergence rate.

Lemma F.2.2 (Phase I-b). Assume $\sum_j\|w_j^{(T_{1a})}\| \lesssim \xi/3$ and $\alpha^{(T_{1a})} \lesssim \xi/3$ at the end of Phase I-a. For any $\epsilon_1 > 0$, if $\eta \lesssim \xi\epsilon_1^{1+3\xi}/k$ and $\|r_i^{(t)}\| \lesssim \xi\tan(\alpha^{(t)})$ for $t \ge T_{1a}$, then after $T_{1b} = O(\epsilon_1^{-(1+3\xi)})$ iterations of the empirical gradient descent update (F.6), letting $T_1 = T_{1a} + T_{1b}$, we have
$$\alpha_i^{(T_1)} \lesssim \epsilon_1, \qquad \|w_i^{(T_1)}\| = \Theta(\xi/k), \quad \forall i\in[k].$$
We achieve an $O(\epsilon_1^{-1})$ rate when setting $\xi = O(1/\log(1/\epsilon_1))$.

F.2.2 Phase II – Learning the Magnitudes

After Phase I, all $w_i$'s are approximately aligned with $w^*$, but their magnitudes are still far from optimal, hence the training error can be further reduced. Fortunately, with all $w_i$'s having a small angle with $w^*$, the non-convex problem reduces to a noisy linear regression problem, which has a linear convergence rate under GD. So we mainly need to ensure that the directions of the $w_i$ do not move away from $w^*$ too much during the optimization process. Formally, the convergence guarantee is presented in Lemma F.2.3.

Lemma F.2.3 (Phase II). For any $\epsilon_2 > 0$, assume at the end of Phase I-b we have $\alpha_i^{(T_1)} \lesssim \min\{\sqrt{\epsilon_2},\, 1/\log(1/\epsilon_2)\}$. Now, following Phase I-b, take an additional $T_2 = O(\log(1/\epsilon_2))$ iterations with step size $\eta = \Theta(1/k)$. If for all $t\in[T_1, T_1+T_2]$ we have $\|r_i^{(t)}\| \lesssim \epsilon_2$, then
$$f(W^{(T_1+T_2)}) \le \epsilon_2; \quad\text{and}\quad \alpha_i^{(T_1+T_2)} \lesssim \sqrt{\epsilon_2}, \qquad \Big\|\sum_j w_j^{(T_1+T_2)} - w^*\Big\| \lesssim \epsilon_2, \quad \forall i\in[k].$$

F.3 Proof of Main Theorems

We provide proofs for the theorems in the main result section. The proof idea is mainly to combine the convergence bounds of the previous lemmas for Phases I-a, I-b and II, so that the later phases inherit sufficiently good parameters from the previous phases.

F.3.1 Population Case

Proof for Theorem 8.1.1. The proof follows by combining the guarantees in each phase, i.e., Lemma F.2.1 and Lemma F.2.2 for Phase I and Lemma F.2.3 for Phase II, where we set the noise $r_i^{(t)} = 0$ for the population case. Now we set the parameters in the lemmata in reverse order. According to Lemma F.2.3, if after Phase I-b the following holds, $\alpha_i^{(T_1)} \lesssim \sqrt\epsilon$, then after $T_2 \ge \log(1/\epsilon)$ iterations with $\eta_2 = O(1/k)$, the generalization error can be bounded by $\epsilon$. By Phase I-b, if we set $\xi_1 = \xi_2 = O(1/\log(1/\epsilon))$ and $\eta_{1b} = O(\epsilon^{1/2}/\log(1/\epsilon)/k)$, then after $T_{1b} = O(\epsilon^{-1/2})$ iterations the requirement for the initialization of Phase II is satisfied. In Phase I-a, since we have set $\xi_1 = \xi_2 = O(1/\log(1/\epsilon))$, we need $\epsilon_0 = O(1/\log(1/\epsilon))$. Therefore, we need $\eta_{1a} = O((\log(1/\epsilon))^{-4}/k)$ and $T_{1a} = O(\log^2(1/\epsilon))$. For simplicity, we combine Phase I-a and Phase I-b and set $\eta_1 = \min\{\eta_{1a}, \eta_{1b}\}$. The total number of iterations is $O(\epsilon^{-1/2})$.

F.3.2 Finite-Sample Case

Proof for Theorem 8.1.2. We restate the conditions on $\|r_i^{(t)}\|$ for Phases I-a, I-b and II from Lemmas F.2.1, F.2.2 and F.2.3, respectively:

• $\|r_i^{(t)}\| \lesssim \min_i(\pi - \alpha_i^{(0)})^3\cdot\epsilon_0^2$;

• $\|r_i^{(t)}\| \lesssim \epsilon_1/\log(1/\epsilon_1)$;

• $\|r_i^{(t)}\| \lesssim \epsilon_2$.

According to the proof of Theorem 8.1.1, to achieve a final precision of $\epsilon$ for the objective value, we need to set $\epsilon_2 = \epsilon$, $\epsilon_1 = \sqrt\epsilon$ and $\epsilon_0 = 1/\log(1/\epsilon)$. Therefore, if
$$\|r_i^{(t)}\| \lesssim \min\Big\{\epsilon,\ \min_i(\pi - \alpha_i^{(0)})^3\cdot\log^{-2}(\epsilon^{-1})\Big\}$$
for all $t \le T_1 + T_2$, we have $E[f] \lesssim \epsilon$. Assuming $(\pi - \alpha_i^{(0)})^3 \ge \epsilon$, we just need $\|r_i^{(t)}\| \le \epsilon$.

Note that
$$r_i^{(t)} := -\Big[\sum_j\big(\hat M_{ij}^{(t)} - M_{ij}^{(t)}\big)w_j^{(t)} - \big(\hat R_i^{(t)} - R_i^{(t)}\big)w^*\Big].$$
$\|r_i^{(t)}\|$ can be bounded by the sampling errors of the matrices $R_i^{(t)}$ and $M_{ij}^{(t)}$:
$$\big\|r_i^{(t)}\big\| \le \big\|\hat R_i^{(t)} - R_i^{(t)}\big\| + \max_j\big\|\hat M_{ij}^{(t)} - M_{ij}^{(t)}\big\|.$$
By Lemma F.1.7, at iteration $t$, if the sample size satisfies
$$|S^{(t)}| \gtrsim \epsilon^{-2}d\log k\log(1/\delta^{(t)}),$$
then $\|r_i^{(t)}\| \le \epsilon$ w.p. $1 - \delta^{(t)}$. After $T = O(\epsilon^{-1/2})$ iterations, we obtain: if the sample size in each iteration satisfies
$$|S^{(t)}| \gtrsim \epsilon^{-2}\log(1/\epsilon)\cdot d\log k\log(1/\delta),$$
then w.p. $1-\delta$,
$$E\big[f(W^{(T)})\big] \le \epsilon.$$

F.4 Proofs of Phase I

In Phase I-a, we start from a random initialization and proceed with gradient descent for some initial steps. The main proof idea for this phase is to show that, at the very beginning of the optimization process, GD pushes all $w_i$'s towards $w^*$ while the interactions among the $w_i$'s are still small. However, the convergence rate is slow at the beginning. Therefore, we later introduce Phase I-b, which has a faster convergence rate.

Lemma F.4.1 shows that if the updates consist only of term (A) in Eq. (F.8), the noiseless interaction term between $w_i$ and $w^*$, then the angle between $w_i$ and $w^*$ converges with rate $\epsilon^{-2}$.

Later, in Lemma F.4.2, we add noise to term (A) and find conditions on the noise under which the angles still converge.

Lemma F.4.1. Let $\theta_{u^{(0)},v} \le \pi - \epsilon$ for some $\epsilon > 0$. Let $M_{u^{(0)},v}$ be defined from $u^{(0)}$ and let $u^{(1)} = M_{u^{(0)},v}v$. Assume the following procedure:
$$u^{(t+1)} = u^{(t)} + M_{u^{(t)},v}v, \quad \text{for } t\ge1. \qquad \text{(F.9)}$$
Then, given any $\epsilon_0 > 0$, after $T \gtrsim 1/\epsilon_0^2$ iterations, $\theta_{u^{(T)},v} \le \epsilon_0$ and $T/3 \le \|u^{(T)}\| \le (1+\epsilon_0)T/2$.

Proof. Denote $\alpha^{(t)} := \theta_{u^{(t)},v}$. Then
$$u^{(1)} = \Big(\frac12 - \frac{\alpha^{(0)}}{2\pi}\Big)v + \frac{\sin\alpha^{(0)}}{2\pi}\bar u^{(0)}.$$
We calculate the norm of $u^{(1)}$ as
$$\|u^{(1)}\|^2 = \Big(\frac12 - \frac{\alpha^{(0)}}{2\pi}\Big)^2 + \Big(\frac{\sin\alpha^{(0)}}{2\pi}\Big)^2 + 2\cos\alpha^{(0)}\Big(\frac12 - \frac{\alpha^{(0)}}{2\pi}\Big)\Big(\frac{\sin\alpha^{(0)}}{2\pi}\Big) \le 1.$$
For iteration 2, we rewrite the update step as
$$u^{(2)} = u^{(1)} + \Big(\frac12 - \frac{\alpha^{(1)}}{2\pi}\Big)v + \frac{\sin\alpha^{(1)}}{2\pi}\bar u^{(1)},$$
where $\bar u^{(1)} = u^{(1)}/\|u^{(1)}\|$, so that
$$\|u^{(2)}\| = \bigg(\Big(\|u^{(1)}\| + \frac{\sin\alpha^{(1)}}{2\pi}\Big)^2 + \Big(\frac12 - \frac{\alpha^{(1)}}{2\pi}\Big)^2 + 2\cos\alpha^{(1)}\Big(\frac12 - \frac{\alpha^{(1)}}{2\pi}\Big)\Big(\|u^{(1)}\| + \frac{\sin\alpha^{(1)}}{2\pi}\Big)\bigg)^{\frac12}$$
and
$$\cos\alpha^{(2)} = \frac{\big(\|u^{(1)}\| + \frac{\sin\alpha^{(1)}}{2\pi}\big)\cos\alpha^{(1)} + \big(\frac12 - \frac{\alpha^{(1)}}{2\pi}\big)}{\|u^{(2)}\|}.$$
Notice that $\cos\alpha^{(2)}$ and $\|u^{(2)}\|$ are dominated by a univariate function of $\alpha^{(0)}$; it hence can be computed that, when $\alpha^{(0)} \le \pi - \epsilon$ for any $\epsilon\in(0,\pi]$,
$$\alpha^{(2)} < \pi/5, \qquad 0.28 < \|u^{(2)}\| \le 1. \qquad \text{(F.10)}$$
Next, we consider the iterations for $t > 2$. Let $\gamma^{(t)} = v^\top u^{(t)}$. Then
$$\gamma^{(t+1)} = v^\top u^{(t+1)} = v^\top\big(u^{(t)} + M_{u^{(t)},v}v\big) = \gamma^{(t)} + \Big(\frac12 - \frac{\alpha^{(t)}}{2\pi}\Big) + \frac{1}{2\pi}\sin\alpha^{(t)}\cos\alpha^{(t)}.$$
If $\alpha^{(t)} \le \pi/5$, we have $1/3 \le \gamma^{(t+1)} - \gamma^{(t)} \le 1/2$. Hence, $t/3 + c_0 \le \gamma^{(t)} \le t/2 + c_0$ for $t\ge2$, where $c_0$ is a constant. Moreover,
$$\tan\alpha^{(t+1)} = \frac{\|(I - vv^\top)u^{(t+1)}\|}{\gamma^{(t+1)}} = \frac{\gamma^{(t)} + \frac{1}{2\pi}\sin\alpha^{(t)}\cos\alpha^{(t)}}{\gamma^{(t)} + \frac12 + \frac{1}{2\pi}\big(\sin\alpha^{(t)}\cos\alpha^{(t)} - \alpha^{(t)}\big)}\tan\alpha^{(t)}.$$
When $\alpha^{(t)} \le 2\pi/5$, we have $\frac{1}{2\pi}\sin\alpha^{(t)}\cos\alpha^{(t)} \le 1/20$, so
$$\tan\alpha^{(t+1)} \le \frac{\gamma^{(t)} + 1/20}{\gamma^{(t)} + 1/3}\tan\alpha^{(t)} \le \frac{t/2 + c_0 + 1/20}{t/2 + c_0 + 1/3}\tan\alpha^{(t)}.$$
Using the fact that
$$\lim_{n\to\infty}\frac{\Gamma(n+z)}{\Gamma(n)n^z} = 1,$$
where $\Gamma$ is the gamma function, we have
$$\tan\alpha^{(T)} \lesssim T^{-1/2}\tan\alpha^{(2)} \lesssim \epsilon_0,$$
where the last inequality follows from the condition $T \ge 1/\epsilon_0^2$.
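The $\epsilon_0^{-2}$ rate of Lemma F.4.1 is easy to visualize: the iterates stay in the plane spanned by $v$ and $u^{(0)}$, so the dynamics reduce to two dimensions. The snippet below is an illustration only (the initial angle is an arbitrary choice): it iterates the update (F.9) using the closed form of Lemma F.1.2 and prints $\tan\alpha^{(t)}$, which decays roughly like $t^{-1/2}$, together with $\|u^{(t)}\|/t$, which stays between $1/3$ and $1/2$.

```python
import numpy as np

# Illustration only (not the thesis code): the dynamics u^{(t+1)} = u^{(t)} + M_{u^{(t)},v} v
# of Lemma F.4.1 in the 2-D plane spanned by v and u^{(0)}, using the closed form of
# Lemma F.1.2. Reaching angle precision eps_0 takes on the order of 1/eps_0^2 iterations.
v = np.array([1.0, 0.0])

def M_times_v(u):
    u_hat = u / np.linalg.norm(u)
    alpha = np.arccos(np.clip(u_hat @ v, -1.0, 1.0))
    return (0.5 - alpha / (2 * np.pi)) * v + np.sin(alpha) / (2 * np.pi) * u_hat

alpha0 = np.pi - 0.3                                  # theta_{u^(0),v} <= pi - eps
u0 = np.array([np.cos(alpha0), np.sin(alpha0)])
u = M_times_v(u0)                                     # u^{(1)} = M_{u^{(0)},v} v
for t in range(1, 10001):
    u = u + M_times_v(u)                              # update (F.9)
    if t in (10, 100, 1000, 10000):
        alpha = np.arccos(np.clip(u @ v / np.linalg.norm(u), -1.0, 1.0))
        print(f"t={t:5d}  tan(alpha)={np.tan(alpha):.4f}  "
              f"sqrt(t)*tan(alpha)={np.sqrt(t) * np.tan(alpha):.3f}  "
              f"||u||/t={np.linalg.norm(u) / t:.3f}")
```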

Lemma F.4.2. Given a starting vector $w^{(0)}$, consider the following two procedures:
$$u^{(t+1)} = u^{(t)} + M_{u^{(t)},v}v \quad \text{for any } t\ge1, \text{ where } u^{(1)} = M_{w^{(0)},v}v,$$
and
$$w^{(t+1)} = w^{(t)} + M_{w^{(t)},v}v + \delta^{(t)} \quad \text{for any } t\ge0.$$
If $\|\delta^{(0)}\| + \|w^{(0)}\| \lesssim (\pi - \theta_{w^{(0)},v})^3\cdot\epsilon_0^2$ and $\|\delta^{(t)}\| \lesssim \epsilon_0^2$ for all $t\ge1$, then
$$\|\Delta^{(t)}\| := \|w^{(t)} - u^{(t)}\| \lesssim t\epsilon_0^2, \quad \forall t\ge1.$$
Further, after $T = 1/\epsilon_0^2$ iterations,
$$\theta_{w^{(T)},u^{(T)}} \le \epsilon_0^2, \qquad \|w^{(T)} - u^{(T)}\| \lesssim 1.$$

Proof. Let $\Delta^{(t)} := w^{(t)} - u^{(t)}$. For $t\ge1$,
$$w^{(t+1)} = w^{(t)} + M_{w^{(t)},v}v + \delta^{(t)} = u^{(t)} + \Delta^{(t)} + M_{u^{(t)},v}v + (M_{w^{(t)},v} - M_{u^{(t)},v})v + \delta^{(t)}.$$
Therefore,
$$\|\Delta^{(t+1)}\| \le \|\Delta^{(t)}\| + \|(M_{w^{(t)},v} - M_{u^{(t)},v})v\| + \|\delta^{(t)}\| \le \|\Delta^{(t)}\| + C_4\theta_{w^{(t)},u^{(t)}} + \|\delta^{(t)}\|.$$
The second inequality is due to Lemma F.1.5. Note that for $\theta_{w^{(t)},u^{(t)}} \le \pi/2$,
$$\theta_{w^{(t)},u^{(t)}} \le 2\sin(\theta_{w^{(t)},u^{(t)}}) \le \frac{2\|w^{(t)} - u^{(t)}\|}{\|u^{(t)}\|}.$$
Also,
$$\|\Delta^{(1)}\| = \|w^{(1)} - u^{(1)}\| \le \|\delta^{(0)}\| + \|w^{(0)}\| \le 2\epsilon_0^3.$$
Hence we have
$$\theta_{w^{(1)},u^{(1)}} \le \frac{2\|w^{(1)} - u^{(1)}\|}{\|u^{(1)}\|} \le \frac{2\|\delta^{(0)}\| + 2\|w^{(0)}\|}{\|u^{(1)}\|}.$$
Note that $\|u^{(1)}\| = \|M_{w^{(0)},v}v\| \ge C_3(\pi - \theta_{w^{(0)},v})^3$ for some $C_3$, according to Lemma F.1.4. Therefore, if
$$\|\delta^{(0)}\| + \|w^{(0)}\| \le \frac{C_3}{2}(\pi - \theta_{w^{(0)},v})^3\cdot C_4^{-1}\cdot\epsilon_0^2,$$
we have $\|\Delta^{(2)}\| \le 4\epsilon_0^2$.

And for any $t\ge2$, $\|u^{(t)}\| \ge t/3 + c_0$, and for some constant $t_1$, if $t\ge t_1$, then $\|u^{(t)}\| \ge 2t/5 + c_1$. So, if $\|\delta^{(t)}\| \lesssim \epsilon_0^2$ and $\|\Delta^{(t)}\| \le 2t\epsilon_0^2$, then
$$\|\Delta^{(t+1)}\| \le \|\Delta^{(t)}\| + C_4\|\Delta^{(t)}\|/\|u^{(t)}\| + \|\delta^{(t)}\| \le (2t+2)\epsilon_0^2,$$
where we used $C_4 = \frac{3}{2\pi}$. So, by induction, after $T = 1/\epsilon_0^2$ iterations,
$$\theta_{w^{(T)},u^{(T)}} \lesssim \epsilon_0^2, \qquad \|w^{(T)} - u^{(T)}\| = \|\Delta^{(T)}\| \lesssim T\epsilon_0^2 \lesssim 1.$$

F.4.1 Proof of Phase I-a

Proof of Lemma F.2.1. Let us first check the first iteration:
$$w_i^{(1)} = w_i^{(0)} - \eta\Big[\sum_j M_{ij}^{(0)}w_j^{(0)} - R_i^{(0)}w^*\Big] + \eta r_i^{(0)}.$$
Define
$$\delta_i^{(0)} := w_i^{(0)} - \eta\Big[\sum_j M_{ij}^{(0)}w_j^{(0)}\Big] + \eta r_i^{(0)}. \qquad \text{(F.11)}$$
So, if we have $\sigma \lesssim \min_i\frac{\eta(\pi - \alpha_i^{(0)})^3\epsilon_0^3}{\eta k + 1}$ and $\|r_i^{(0)}\| \lesssim (\pi - \alpha_i^{(0)})^3\cdot\epsilon_0^2$, then
$$\|\delta_i^{(0)}\| + \|w_i^{(0)}\| \lesssim \eta(\pi - \alpha_i^{(0)})^3\cdot\epsilon_0^2.$$
Now consider the case $t\ge1$. Let $u_i^{(1)} = \eta M_{w_i^{(0)},w^*}w^*$ and, for $t\ge1$,
$$u_i^{(t+1)} = u_i^{(t)} + \eta M_{u_i^{(t)},w^*}w^*, \qquad \Delta_i^{(t)} := w_i^{(t)} - u_i^{(t)}.$$
The update is
$$w_i^{(t+1)} = w_i^{(t)} - \eta\Big[\sum_j M_{ij}^{(t)}w_j^{(t)} - R_i^{(t)}w^*\Big] + \eta r_i^{(t)}.$$
Define
$$\delta_i^{(t)} = -\eta\sum_j M_{ij}^{(t)}w_j^{(t)} + \eta r_i^{(t)}.$$
If we have $\eta k T \le \epsilon_0^2$ and $\|r_i^{(t)}\| \lesssim \epsilon_0^2$ for $t\le T$, we can guarantee that
$$\|\delta_i^{(t)}\| \le \eta\sum_j\|w_j^{(t)}\| + \eta\|r_i^{(t)}\| \lesssim \eta^2 kt + \eta\epsilon_0^2 \lesssim \eta\epsilon_0^2$$
holds for all $t\le T$.

Now, applying Lemma F.4.2, after $T = 1/\epsilon_0^2$ iterations,
$$\theta_{w_i^{(T)},u_i^{(T)}} \le \epsilon_0^2, \qquad \|w_i^{(T)} - u_i^{(T)}\| \le \eta.$$
Combining with Lemma F.4.1 and using the fact that $\theta_{w_i^{(T)},w^*} \le \theta_{w_i^{(T)},u_i^{(T)}} + \theta_{u_i^{(T)},w^*}$, we conclude the proof.

F.4.2 Proofs of Phase I-b

Proof of Lemma F.2.2. Define $w_i^{(t)} = \gamma_i^{(t)}w^* + b_i^{(t)}$, where $b_i^{(t)\top}w^* = 0$. By the gradient update, we derive $\gamma_i^{(t+1)}$ and $b_i^{(t+1)}$ in terms of $\{w_i^{(t)}\}_{i=1,2,\dots,k}$:
$$\gamma_i^{(t+1)} = w^{*\top}w_i^{(t+1)} = w^{*\top}\Big(w_i^{(t)} - \eta\Big[\sum_j\|w_j^{(t)}\|M_{ij}^{(t)}\bar w_j^{(t)} - R_i^{(t)}w^*\Big] + \eta r_i^{(t)}\Big)$$
$$= w^{*\top}\Big(w_i^{(t)} + \frac\eta2 w^* - \eta\Big[\sum_j\|w_j^{(t)}\|M_{ij}^{(t)}\bar w_j^{(t)} - (R_i^{(t)} - I/2)w^*\Big] + \eta r_i^{(t)}\Big)$$
$$= \gamma_i^{(t)} + \frac\eta2 - \eta w^{*\top}\Big[\sum_j\|w_j^{(t)}\|M_{ij}^{(t)}\bar w_j^{(t)} - (R_i^{(t)} - I/2)w^*\Big] + \eta w^{*\top}r_i^{(t)}.$$
Therefore,
$$\big|\gamma_i^{(t+1)} - \gamma_i^{(t)} - \eta/2\big| \le \eta\Big(\sum_j\|w_j^{(t)}\|/2 + C_2(\alpha_i^{(t)})^2 + \|r_i^{(t)}\|\Big).$$
If $\sum_j\|w_j^{(t)}\|/2 \le \xi/3$, $C_2(\alpha_i^{(t)})^2 \le \xi/3$, and $\|r_i^{(t)}\| \le \xi/3$, we can upper and lower bound $\gamma_i^{(t+1)}$:
$$\Big(\frac12 - \xi\Big)\eta \le \gamma_i^{(t+1)} - \gamma_i^{(t)} \le \Big(\frac12 + \xi\Big)\eta. \qquad \text{(F.12)}$$

For the part orthogonal to $w^*$:
$$\|b_i^{(t+1)}\| = \|(I - w^*w^{*\top})w_i^{(t+1)}\| = \Big\|(I - w^*w^{*\top})\Big(w_i^{(t)} - \eta\Big[\sum_j\|w_j^{(t)}\|M_{ij}^{(t)}\bar w_j^{(t)} - R_i^{(t)}w^*\Big] + \eta r_i^{(t)}\Big)\Big\|$$
$$= \Big\|(I - w^*w^{*\top})\Big(w_i^{(t)} - \frac\eta2\Big(\sum_j w_j^{(t)} - w^*\Big) - \eta\Big[\sum_j\|w_j^{(t)}\|(M_{ij}^{(t)} - I/2)\bar w_j^{(t)} - (R_i^{(t)} - I/2)w^*\Big] + \eta r_i^{(t)}\Big)\Big\|$$
$$= \Big\|b_i^{(t)} - \frac\eta2\sum_j b_j^{(t)} - \eta(I - w^*w^{*\top})\Big[\sum_j\|w_j^{(t)}\|(M_{ij}^{(t)} - I/2)\bar w_j^{(t)} - (R_i^{(t)} - I/2)w^* - r_i^{(t)}\Big]\Big\|$$
$$\le \|b_i^{(t)}\| + \frac\eta2\sum_j\|b_j^{(t)}\| + \eta\Big(2\sum_j\|w_j^{(t)}\| + 1\Big)C_2(\alpha^{(t)})^2 + \eta\|r_i^{(t)}\|$$
$$\le \|b_i^{(t)}\| + \frac\eta2\sum_j\gamma_j^{(t)}\tan(\alpha^{(t)}) + \eta\Big(2\sum_j\|w_j^{(t)}\| + 1\Big)C_2(\alpha^{(t)})^2 + \eta\|r_i^{(t)}\|,$$
where $\alpha^{(t)}$ satisfies $\alpha_i^{(t)} \le \alpha^{(t)}$ for all $i$.

If we have $\sum_j\gamma_j^{(t)} \le \xi/3$, $C_2(\alpha^{(t)})^2 \le \xi/3$, and $\|r_i^{(t)}\| \le (\xi/3)\tan(\alpha^{(t)})$, we obtain
$$\tan\alpha_i^{(t+1)} = \frac{\|b_i^{(t+1)}\|}{\gamma_i^{(t+1)}} \le \frac{\gamma_i^{(t)} + \xi\eta}{\gamma_i^{(t)} + (\frac12 - \xi)\eta}\tan\alpha^{(t)}.$$
According to Lemma F.2.1 (Phase I-a) and Eq. (F.12), $\gamma_i^{(T_{1a})} + (t - T_{1a})\eta/3 \le \gamma_i^{(t)} \le \gamma_i^{(T_{1a})} + (t - T_{1a})\eta/2$. Therefore,
$$\frac{\gamma_i^{(t)} + (\xi_1 + \xi_2)\eta}{\gamma_i^{(t)} + \big(\frac12 - (\xi_1 + \xi_2)\big)\eta} \le \frac{t/2 + c_0 + (\xi_1 + \xi_2)}{t/2 + c_0 + 1/2 - (\xi_1 + \xi_2)}.$$
By the fact that
$$\lim_{n\to\infty}\frac{\Gamma(n+z)}{\Gamma(n)n^z} = 1,$$
where $\Gamma$ is the gamma function, we have
$$\tan\alpha_i^{(T_1)} \lesssim T_{1b}^{-(1+3\xi)}\tan\alpha_i^{(T_{1a})}.$$
After $T_{1b} = O(\epsilon_1^{-(1+3\xi)})$ iterations,
$$\alpha_i^{(T_1)} \lesssim \epsilon_1.$$
Note that we require $\sum_j\gamma_j^{(t)} \le \xi/3$, so we can set $T_{1b}\eta k/2 \le \xi/3$, i.e., $\eta \lesssim \xi_1\epsilon_1^{1+3\xi}/k$.

F.5 Proofs for Phase II

In this phase, the angles are already small, and the optimization process can be approximated by linear regression. The following proof of Lemma F.2.3 mainly ensures that the error does not expand too much during this phase.

F.5.1 Proof for Lemma F.2.3

Proof. Recall the empirical gradient descent update:
$$w_i^{(t+1)} = w_i^{(t)} - \eta\Big[\sum_j\|w_j^{(t)}\|M_{ij}^{(t)}\bar w_j^{(t)} - R_i^{(t)}w^*\Big] + \eta r_i^{(t)}
= w_i^{(t)} - \frac\eta2\Big(\sum_j w_j^{(t)} - w^*\Big) - \eta\Big[\sum_j\|w_j^{(t)}\|(M_{ij}^{(t)} - I/2)\bar w_j^{(t)} - (R_i^{(t)} - I/2)w^*\Big] + \eta r_i^{(t)}.$$
Hence,
$$\Big\|w_i^{(t+1)} - w_i^{(t)} + \frac\eta2\Big(\sum_j w_j^{(t)} - w^*\Big)\Big\| \le \eta\Big(4\sum_j\|w_j^{(t)}\| + 1\Big)(\alpha^{(t)})^2 + \eta\|r_i^{(t)}\|, \qquad \text{(F.13)}$$
where $\alpha_i^{(t)} \le \alpha_0$ for any $i\in[k]$. If for every $t$ we consider we have $\|r_i^{(t)}\| \lesssim \alpha_0^2$, then
$$w_i^{(t+1)} = w_i^{(t)} - \frac\eta2\Big(\sum_j w_j^{(t)} - w^*\Big) + O(\eta\alpha_0^2)\,u_i^{(t)},$$
for some unit vector $u_i^{(t)}$. Summing over all $i\in[k]$, we arrive at
$$\Big\|\Big(\sum_j w_j^{(t+1)} - w^*\Big) - (1 - \eta k/2)\Big(\sum_j w_j^{(t)} - w^*\Big)\Big\| \lesssim \eta k\alpha_0^2.$$

After $T$ iterations following Phase I-b,
$$\Big\|\Big(\sum_j w_j^{(T+T_1)} - w^*\Big) - (1 - \eta k/2)^T\Big(\sum_j w_j^{(T_1)} - w^*\Big)\Big\| \lesssim \sum_{t=1}^T(1 - \eta k/2)^{T-t}\eta k\alpha_0^2.$$
Now we prove by induction. Assume $\alpha^{(T_1+t-1)} \le \alpha_0$ for all $t = 1, 2, \dots, T$ and some $\alpha_0 > 0$. Then
$$\Big\|\Big(\sum_j w_j^{(T+T_1)} - w^*\Big) - (1 - \eta k/2)^T\Big(\sum_j w_j^{(T_1)} - w^*\Big)\Big\| \lesssim \alpha_0^2. \qquad \text{(F.14)}$$
This implies
$$\Big\|\sum_{t=1}^T\Big(\sum_j w_j^{(t+T_1-1)} - w^*\Big) - \sum_{t=1}^T(1 - \eta k/2)^t\Big(\sum_j w_j^{(T_1)} - w^*\Big)\Big\| \lesssim T\alpha_0^2.$$
Also note that
$$\Big\|w_i^{(T+T_1)} - w_i^{(T_1)} + \frac\eta2\sum_{t=1}^T\Big(\sum_j w_j^{(t+T_1-1)} - w^*\Big)\Big\| \lesssim \eta\sum_{t=1}^T\big(\alpha^{(t+T_1-1)}\big)^2.$$

Therefore,
$$\Big\|w_i^{(T+T_1)} - w_i^{(T_1)} + \frac1k\big(1 - (1 - \eta k/2)^T\big)\Big(\sum_j w_j^{(T_1)} - w^*\Big)\Big\| \lesssim \eta T\alpha_0^2.$$
The final step of the induction is the following:
$$\sin\alpha_i^{(T_1+T)} = \sin\angle(w_i^{(T+T_1)}, w^*) = \frac{\|w_i^{(T+T_1)} - w^*\|\sin\angle(w_i^{(T+T_1)} - w^*, w^*)}{\|w_i^{(T_1+T)}\|} = \frac{\|(I - w^*w^{*\top})(w_i^{(T+T_1)} - w^*)\|}{\|w_i^{(T_1+T)}\|}.$$
For the numerator,
$$\big\|(I - w^*w^{*\top})(w_i^{(T+T_1)} - w^*)\big\| = \big\|(I - w^*w^{*\top})(w_i^{(T+T_1)} - w_i^{(T_1)} + w_i^{(T_1)})\big\|
\le \big\|(I - w^*w^{*\top})(w_i^{(T+T_1)} - w_i^{(T_1)})\big\| + \big\|(I - w^*w^{*\top})w_i^{(T_1)}\big\|
\le \frac1k\big(1 - (1 - \eta k/2)^T\big)\Big\|(I - w^*w^{*\top})\Big(\sum_j w_j^{(T_1)} - w^*\Big)\Big\| + c\eta T\alpha_0^2 + \|w_i^{(T_1)}\|\alpha_i^{(T_1)}
\le \frac1k\big(1 - (1 - \eta k/2)^T\big)\Big\|(I - w^*w^{*\top})\sum_j w_j^{(T_1)}\Big\| + c\eta T\alpha_0^2 + \|w_i^{(T_1)}\|\alpha_i^{(T_1)}
\le \frac1k\big(1 - (1 - \eta k/2)^T\big)\Big\|\sum_j w_j^{(T_1)}\Big\|\,\angle\Big(\sum_j w_j^{(T_1)}, w^*\Big) + c\eta T\alpha_0^2 + \|w_i^{(T_1)}\|\alpha_i^{(T_1)}
\le \frac{1}{16\pi k}\alpha_0 + c\eta T\alpha_0^2,$$
where $c$ is an absolute constant.

For the denominator,
$$\|w_i^{(T_1+T)}\| \ge \Big\|\frac1k\big(1 - (1 - \eta k/2)^T\big)\Big(\sum_j w_j^{(T_1)} - w^*\Big)\Big\| - \|w_i^{(T_1)}\| - \eta T\alpha_0^2 \ge \frac{1}{2k}\big(1 - (1 - \eta k/2)^T\big) - c\eta T\alpha_0^2.$$
Now, by setting $c\eta T\alpha_0 \le \frac{1}{32\pi k}$ and $\eta k = 1$, we obtain $\alpha_i^{(T_1+T)} \le \alpha_0$ for any $T$.

Finally, using Eq. (F.14), we have
$$\Big\|\sum_j w_j^{(T+T_1)} - w^*\Big\| \lesssim 2^{-T} + \alpha_0^2 \lesssim \epsilon_2.$$
For the last part,
$$E[f(W)] = \sum_{ij}w_i^\top M_{w_i,w_j}w_j - 2\sum_i w_i^\top M_{w_i,w^*}w^* + w^{*\top}M_{w^*,w^*}w^*
= \sum_{ij}w_i^\top(M_{w_i,w_j} - I/2)w_j - 2\sum_i w_i^\top(M_{w_i,w^*} - I/2)w^* + \frac12\Big\|\sum_j w_j - w^*\Big\|^2
\lesssim \sum_{ij}\|w_i\|\cdot\|w_j\|\,\theta_{w_i,w_j}^2 + 2\sum_i\|w_i\|\cdot\|w^*\|\,\alpha_i^2 + \Big\|\sum_j w_j - w^*\Big\|^2 \lesssim \epsilon_2.$$

Bibliography

[1] MovieLens dataset. Public dataset.

[2] Jacob Abernethy, Francis Bach, Theodoros Evgeniou, and Jean-Philippe

Vert. Low-rank matrix factorization with attributes. arXiv preprint

cs/0611124, 2006.

[3] Alekh Agarwal, Animashree Anandkumar, Prateek Jain, Praneeth Netrapalli,

and Rashish Tandon. Learning sparsely used overcomplete dictionaries via

alternating minimization. COLT, 2014.

[4] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Ma-

tus Telgarsky. Tensor decompositions for learning latent variable models.

JMLR, 15:2773–2832, 2014.

[5] Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning

polynomials with neural networks. In Proceedings of the 31st International

Conference on Machine Learning (ICML), pages 1908–1916, 2014.

[6] Peter Arbenz, Daniel Kressner, and D-MATH ETH Zurich. Lecture notes on

solving large scale eigenvalue problems. http://people.inf.ethz.

ch/arbenz/ewp/Lnotes/lsevp2010.pdf.

[7] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable

bounds for learning some deep representations. In Proceedings of the 31st


International Conference on Machine Learning (ICML), pages 584–592.

https://arxiv.org/pdf/1310.6343.pdf, 2014.

[8] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Com-

puting a nonnegative matrix factorization–provably. In Proceedings of the

forty-fourth annual ACM symposium on Theory of computing (STOC), pages

145–162. ACM, 2012.

[9] Sanjeev Arora, Rong Ge, Tengyu Ma, and Andrej Risteski. Provable learning

of noisy-or networks. In Proceedings of the 49th Annual Symposium on

the Theory of Computing (STOC). https://arxiv.org/pdf/1612.

08795.pdf, 2017.

[10] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models–going

beyond svd. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd

Annual Symposium on, pages 1–10. IEEE, 2012.

[11] Ozlem Aslan, Xinhua Zhang, and Dale Schuurmans. Convex deep learn-

ing via normalized kernels. In Advances in Neural Information Processing

Systems (NIPS), pages 3275–3283, 2014.

[12] Francis Bach. Breaking the curse of dimensionality with convex neural net-

works. arXiv preprint arXiv:1412.8690, 2014.

[13] Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guar-

antees for the em algorithm: From population to sample-based analysis. An-

nals of Statistics, 2015.


[14] Maria-Florina Balcan, Yingyu Liang, David P. Woodruff, and Hongyang

Zhang. Optimal sample complexity for matrix completion and related prob-

lems via `2-regularization. arXiv preprint arXiv:1704.08683, 2017.

[15] David Balduzzi. Deep online convex optimization with gated games. arXiv

preprint arXiv:1604.01952, 2016.

[16] David Balduzzi, Brian McWilliams, and Tony Butler-Yeoman. Neural taylor

approximations: Convergence and exploration in rectifier networks. arXiv

preprint arXiv:1611.02345, 2016.

[17] Peter L. Bartlett. The sample complexity of pattern classification with neu-

ral networks: the size of the weights is more important than the size of the

network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.

[18] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-

normalized margin bounds for neural networks. In Advances in Neural In-

formation Processing Systems, pages 6241–6250, 2017.

[19] Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and

Patrice Marcotte. Convex neural networks. In Advances in Neural Infor-

mation Processing Systems (NIPS), pages 123–130, 2005.

[20] Avrim Blum and Ronald L Rivest. Training a 3-node neural network is np-

complete. In Proceedings of the 1st International Conference on Neural

Information Processing Systems (NIPS), pages 494–501. MIT Press, 1988.


[21] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a

convnet with gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.

[22] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz.

SGD learns over-parameterized networks that provably generalize on linearly

separable data. In ICLR. https://arxiv.org/pdf/1710.10174,

2018.

[23] Samuel Burer and Renato DC Monteiro. Local minima and conver-

gence in low-rank semidefinite programming. Mathematical Programming,

103(3):427–444, 2005.

[24] Emmanuel J. Candes and Benjamin Recht. Exact matrix completion

via convex optimization. Foundations of Computational Mathematics,

9(6):717–772, December 2009.

[25] Alexandre X Carvalho and Martin A Tanner. Mixtures-of-experts of autore-

gressive time series: asymptotic normality and model specification. Neural

Networks, 16(1):39–56, 2005.

[26] Arun T. Chaganty and Percy Liang. Spectral experts for estimating mixtures

of linear regressions. In ICML, pages 1040–1048, 2013.

[27] Xiujuan Chai, Shiguang Shan, Xilin Chen, and Wen Gao. Locally lin-

ear regression for pose-invariant face recognition. Image Processing,

16(7):1716–1725, 2007.


[28] Yudong Chen, Xinyang Yi, and Constantine Caramanis. A convex formu-

lation for mixed regression with two components: Minimax optimal rates.

COLT, 2014.

[29] Kai-Yang Chiang, Cho-Jui Hsieh, and Inderjit S Dhillon. Matrix completion

with noisy side information. In Advances in Neural Information Processing

Systems, pages 3447–3455, 2015.

[30] Anna Choromanska, MIkael Henaff, Michael Mathieu, Gerard Ben Arous,

and Yann LeCun. The loss surfaces of multilayer networks. In Proceed-

ings of the Eighteenth International Conference on Artificial Intelligence and

Statistics (AISTATS), pages 192–204, 2015.

[31] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of

deep learning: A tensor analysis. In 29th Annual Conference on Learning

Theory (COLT), pages 698–728, 2016.

[32] Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as gen-

eralized tensor decompositions. In International Conference on Machine

Learning (ICML), 2016.

[33] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for

YouTube recommendations. In Proceedings of the 10th ACM Conference on

Recommender Systems, pages 191–198. ACM, 2016.

[34] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding

of neural networks: The power of initialization and a dual view on expres-


sivity. In Advances in neural information processing systems (NIPS), pages

2253–2261, 2016.

[35] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya

Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point

problem in high-dimensional non-convex optimization. In Advances in neu-

ral information processing systems (NIPS), pages 2933–2941, 2014.

[36] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by

a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.

[37] P. Deb and A. M. Holmes. Estimates of use and costs of behavioural health

care: a comparison of standard and finite mixture models. Health economics,

9(6):475–489, 2000.

[38] Simon S Du, Jason D Lee, and Yuandong Tian. When is a convolutional

filter easy to learn? arXiv preprint arXiv:1709.06129, 2017.

[39] Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti

Singh. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of

spurious local minima. arXiv preprint arXiv:1712.00779, 2017.

[40] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering. In CVPR,

pages 2790–2797, 2009.

[41] Soheil Feizi, Hamid Javadi, Jesse Zhang, and David Tse. Porcupine

neural networks:(almost) all local optima are global. arXiv preprint

arXiv:1710.02196, 2017.


[42] Jiashi Feng, Tom Zahavy, Bingyi Kang, Huan Xu, and Shie Man-

nor. Ensemble robustness of deep learning algorithms. arXiv preprint

arXiv:1602.02389, 2016.

[43] Giancarlo Ferrari-Trecate and Marco Muselli. A new learning method for

piecewise linear regression. In Artificial Neural Networks – ICANN 2002,

pages 444–449. Springer, 2002.

[44] C Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified

network optimization. In arXiv preprint. https://arxiv.org/pdf/

1611.01540.pdf, 2016.

[45] Scott Gaffney and Padhraic Smyth. Trajectory clustering with mixtures of

regression models. In KDD. ACM, 1999.

[46] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low

rank problems: A unified geometric analysis. In International Conference

on Machine Learning, pages 1233–1242. https://arxiv.org/pdf/

1704.00708, 2017.

[47] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural

networks with landscape design. In ICLR. https://arxiv.org/pdf/

1711.00501, 2018.

[48] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N

Dauphin. Convolutional Sequence to Sequence Learning. In ArXiv

preprint:1705.03122, 2017.


[49] Nicolas Gillis and Francois Glineur. Low-rank matrix approximation with

weights or missing data is np-hard. SIAM Journal on Matrix Analysis and

Applications, 32(4):1149–1165, 2011.

[50] Nicolas Gillis and Stephen A Vavasis. On the complexity of robust pca and

`1-norm low-rank matrix approximation. arXiv preprint arXiv:1509.09236,

2015.

[51] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training

deep feedforward neural networks. In Proceedings of the thirteenth inter-

national conference on artificial intelligence and statistics, pages 249–256,

2010.

[52] Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably

learning the relu in polynomial time. In 30th Annual Conference on Learn-

ing Theory (COLT). https://arxiv.org/pdf/1611.10258.pdf,

2017.

[53] Surbhi Goel, Adam Klivans, and Reghu Meka. Learning one convolutional

layer with overlapping patches. In arXiv preprint. https://arxiv.

org/abs/1802.02547.pdf, 2018.

[54] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent

sample complexity of neural networks. arXiv preprint arXiv:1712.06541,

2017.


[55] Carlos A Gomez-Uribe and Neil Hunt. The Netflix recommender system:

Algorithms, business value, and innovation. ACM Transactions on Manage-

ment Information Systems (TMIS), 6(4):13, 2016.

[56] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT

Press, 2016. http://www.deeplearningbook.org.

[57] Benjamin D Haeffele and Rene Vidal. Global optimality in tensor factoriza-

tion, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.

[58] Moritz Hardt. Understanding alternating minimization for matrix comple-

tion. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual

Symposium on, pages 651–660. IEEE, 2014.

[59] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. ICLR, 2017.

[60] Moritz Hardt and Ankur Moitra. Algorithms and hardness for robust sub-

space recovery. In COLT, volume 30, pages 354–375, 2013.

[61] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better:

Stability of stochastic gradient descent. In ICML, pages 1225–1234, 2016.

[62] Moritz Hardt and Mary Wootters. Fast matrix completion without the con-

dition number. In Proceedings of The 27th Conference on Learning Theory,

pages 638–678, 2014.

[63] Johan Hastad. Tensor rank is np-complete. Journal of Algorithms,

11(4):644–654, 1990.


[64] Christopher J Hillar and Lek-Heng Lim. Most tensor problems are np-hard.

In Journal of the ACM (JACM), volume 60(6), page 45. https://arxiv.

org/pdf/0911.1393.pdf, 2013.

[65] Kurt Hornik. Approximation capabilities of multilayer feedforward net-

works. Neural networks, 4(2):251–257, 1991.

[66] Cho-Jui Hsieh, Nagarajan Natarajan, and Inderjit Dhillon. Pu learning for

matrix completion. In International Conference on Machine Learning, pages

2445–2453, 2015.

[67] Daniel Hsu and Sham M Kakade. Learning mixtures of spherical gaussians:

moment methods and spectral decompositions. In ITCS, pages 11–20. ACM,

2013.

[68] Daniel Hsu, Sham M Kakade, and Tong Zhang. A tail inequality for

quadratic forms of subgaussian random vectors. Electron. Commun. Probab,

17(52):1–6, 2012.

[69] Prateek Jain and Inderjit S Dhillon. Provable inductive matrix completion.

arXiv preprint arXiv:1306.0626, 2013.

[70] Prateek Jain, Raghu Meka, and Inderjit S. Dhillon. Guaranteed rank mini-

mization via singular value projection. In NIPS, pages 937–945, 2010.

[71] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix

completion using alternating minimization. In Proceedings of the forty-fifth

annual ACM symposium on Theory of computing (STOC), 2013.


[72] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the per-

ils of non-convexity: Guaranteed training of neural networks using tensor

methods. arXiv preprint arXiv:1506.08473, 2015.

[73] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jor-

dan. How to escape saddle points efficiently. In International Conference on

Machine Learning, pages 1724–1732, 2017.

[74] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in

Neural Information Processing Systems, pages 586–594, 2016.

[75] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix

completion from a few entries. IEEE Transactions on Information Theory,

56(6):2980–2998, 2010.

[76] Abbas Khalili and Jiahua Chen. Variable selection in finite mixture of re-

gression models. Journal of the american Statistical association, 2012.

[77] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classifi-

cation with deep convolutional neural networks. In NIPS, pages 1097–1105,

2012.

[78] Volodymyr Kuleshov, Arun Chaganty, and Percy Liang. Tensor factorization

via matrix factorization. In Proceedings of the Eighteenth International Con-

ference on Artificial Intelligence and Statistics (AISTATS), pages 507–516,

2015.


[79] Matt Kusner, Stephen Tyree, Kilian Q Weinberger, and Kunal Agrawal.

Stochastic neighbor compression. In Proceedings of the 31st International

Conference on Machine Learning (ICML-14), pages 622–630, 2014.

[80] Steve Lawrence, C Lee Giles, Ah Chung Tsoi, and Andrew D Back. Face

recognition: A convolutional neural-network approach. IEEE transactions

on neural networks, 8(1):98–113, 1997.

[81] K. Lee and Y. Bresler. Guaranteed minimum rank approximation from linear

observations by nuclear norm minimization with an ellipsoidal constraint.

arXiv preprint arXiv:0903.4742, 2009.

[82] Chenyi Lei, Dong Liu, Weiping Li, Zheng-Jun Zha, and Houqiang Li. Com-

parative deep learning of hybrid representations for image recommendations.

In Proceedings of the IEEE Conference on Computer Vision and Pattern

Recognition, pages 2545–2553, 2016.

[83] Ren-Cang Li. On perturbations of matrix pencils with real spectra. Math.

Comp., 62:231–265, 1994.

[84] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization

in over-parameterized matrix recovery. arXiv preprint arXiv:1712.09203,

2017.

[85] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural net-

works with relu activation. In Advances in Neural Information Processing

Systems, pages 597–607, 2017.


[86] Shiyu Liang, Ruoyu Sun, Yixuan Li, and R Srikant. Understanding the loss

surface of single-layered neural networks for binary classification. 2018.

[87] Ming Lin and Jieping Ye. A non-convex one-pass framework for generalized

factorization machine and rank-one matrix sensing. In Advances in Neural

Information Processing Systems, pages 1633–1641, 2016.

[88] Yi-Kai Liu. Universal low-rank matrix recovery from pauli measurements.

In Advances in Neural Information Processing Systems, pages 1638–1646,

2011.

[89] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational

efficiency of training neural networks. In Advances in neural information

processing systems (NIPS), pages 855–863, 2014.

[90] Zakaria Mhammedi, Andrew Hellicar, Ashfaqur Rahman, and James Bai-

ley. Efficient orthogonal parametrisation of recurrent neural networks using

householder reflections. In International Conference on Machine Learning,

pages 2401–2409, 2017.

[91] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio.

On the number of linear regions of deep neural networks. In Advances in

neural information processing systems (NIPS), pages 2924–2932, 2014.

[92] Nagarajan Natarajan and Inderjit S Dhillon. Inductive matrix completion for

predicting gene–disease associations. Bioinformatics, 30(12):i60–i68, 2014.


[93] Praneeth Netrapalli, UN Niranjan, Sujay Sanghavi, Animashree Anandku-

mar, and Prateek Jain. Non-convex robust PCA. In Advances in Neural

Information Processing Systems, pages 1107–1115, 2014.

[94] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan

Srebro. A pac-bayesian approach to spectrally-normalized margin bounds

for neural networks. arXiv preprint arXiv:1707.09564, 2017.

[95] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based ca-

pacity control in neural networks. In Conference on Learning Theory, pages

1376–1401, 2015.

[96] Quynh Nguyen and Matthias Hein. The loss surface and expressivity of deep

convolutional neural networks. arXiv preprint arXiv:1710.10928, 2017.

[97] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide

neural networks. In International Conference on Machine Learning, pages

2603–2612, 2017.

[98] Rina Panigrahy, Ali Rahimi, Sushant Sachdeva, and Qiuyi Zhang. Conver-

gence results for neural networks via electrodynamics. In LIPIcs-Leibniz

International Proceedings in Informatics, volume 94. Schloss Dagstuhl-

Leibniz-Zentrum fuer Informatik, 2018.

[99] Ben Poole, Subhaneil Lahiri, Maithreyi Raghu, Jascha Sohl-Dickstein, and

Surya Ganguli. Exponential expressivity in deep neural networks through


transient chaos. In Advances In Neural Information Processing Systems

(NIPS), pages 3360–3368, 2016.

[100] Timothy Poston, C-N Lee, Y Choie, and Yonghoon Kwon. Local minima

and back propagation. In Neural Networks, 1991., IJCNN-91-Seattle Inter-

national Joint Conference on, volume 2, pages 173–176. IEEE, 1991.

[101] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-

Dickstein. On the expressive power of deep neural networks. arXiv preprint

arXiv:1606.05336, 2016.

[102] Ilya P Razenshteyn, Zhao Song, and David P. Woodruff. Weighted low rank

approximations with provable guarantees. In Proceedings of the 48th Annual

Symposium on the Theory of Computing (STOC), pages 250–263, 2016.

[103] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed

minimum-rank solutions of linear matrix equations via nuclear norm min-

imization. SIAM Review, 52(3):471–501, 2010.

[104] Steffen Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE

10th International Conference on, pages 995–1000. IEEE, 2010.

[105] Itay Safran and Ohad Shamir. On the quality of the initial basin in overspec-

ified neural networks. In International Conference on Machine Learning

(ICML), 2016.

[106] Itay Safran and Ohad Shamir. Spurious local minima are common in two-

layer relu neural networks. arXiv preprint arXiv:1712.08968, 2017.


[107] Levent Sagun, Leon Bottou, and Yann LeCun. Singularity of the Hessian in

deep learning. arXiv preprint arXiv:1611.07476, 2016.

[108] Hanie Sedghi and Anima Anandkumar. Provable tensor methods for learn-

ing mixtures of generalized linear models. arXiv preprint arXiv:1412.3046,

2014.

[109] Hanie Sedghi and Anima Anandkumar. Provable methods for training neural

networks with sparse connectivity. In International Conference on Learning

Representation (ICLR), 2015.

[110] Ohad Shamir. Distribution-specific hardness of learning neural networks.

arXiv preprint arXiv:1609.01037, 2016.

[111] Donghyuk Shin, Suleyman Cetintas, Kuang-Chih Lee, and Inderjit S

Dhillon. Tumblr blog recommendation with boosted inductive matrix com-

pletion. In Proceedings of the 24th ACM International on Conference on

Information and Knowledge Management, pages 203–212. ACM, 2015.

[112] Si Si, Kai-Yang Chiang, Cho-Jui Hsieh, Nikhil Rao, and Inderjit S Dhillon.

Goal-directed inductive matrix completion. In Proceedings of the 22nd ACM

SIGKDD International Conference on Knowledge Discovery and Data Min-

ing, pages 1165–1174. ACM, 2016.

[113] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre,

George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda


Panneershelvam, et al. Mastering the game of Go with deep neural networks

and tree search. Nature, 529(7587):484–489, 2016.

[114] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical in-

sights into the optimization landscape of over-parameterized shallow neural

networks. arXiv preprint arXiv:1707.04926, 2017.

[115] Zhao Song, David P. Woodruff, and Huan Zhang. Sublinear time orthog-

onal tensor decomposition. In Advances in Neural Information Processing

Systems(NIPS), pages 793–801, 2016.

[116] Zhao Song, David P. Woodruff, and Peilin Zhong. Low rank approximation

with entrywise `1-norm error. In Proceedings of the 49th Annual Sympo-

sium on the Theory of Computing (STOC). ACM, https://arxiv.org/

pdf/1611.00898.pdf, 2017.

[117] Zhao Song, David P. Woodruff, and Peilin Zhong. Relative error tensor

low rank approximation. In arXiv preprint. https://arxiv.org/pdf/

1704.08246.pdf, 2017.

[118] Zhao Song, David P. Woodruff, and Peilin Zhong. Relative error tensor

low rank approximation. In arXiv preprint. https://arxiv.org/pdf/

1704.08246.pdf, 2017.

[119] Zhao Song, David P Woodruff, and Peilin Zhong. Towards a zero-one law

for entrywise low rank approximation. 2018.


[120] David Sontag and Dan Roy. Complexity of inference in latent dirichlet

allocation. In Advances in neural information processing systems, pages

1008–1016, 2011.

[121] Daniel Soudry and Yair Carmon. No bad local minima: Data indepen-

dent training error guarantees for multilayer neural networks. arXiv preprint

arXiv:1605.08361, 2016.

[122] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-

convex factorization. In IEEE Symposium on Foundations of Computer Sci-

ence (FOCS), pages 270–289. IEEE, 2015.

[123] Grzegorz Swirszcz, Wojciech Marian Czarnecki, and Razvan Pascanu. Lo-

cal minima in training of deep networks. arXiv preprint arXiv:1611.06310,

2016.

[124] Matus Telgarsky. Benefits of depth in neural networks. In 29th Annual

Conference on Learning Theory (COLT), pages 1517–1539, 2016.

[125] Yuandong Tian. An analytical formula of population gradient for two-layered

relu network and its applications in convergence and critical point analysis.

In ICML, 2017.

[126] Yuandong Tian. Symmetry-breaking convergence analysis of certain two-

layered neural networks with ReLU nonlinearity. In Workshop at Interna-

tional Conference on Learning Representation, 2017.


[127] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foun-

dations of Computational Mathematics, 12(4):389–434, 2012.

[128] Roman Vershynin. How close is the sample covariance matrix to the ac-

tual covariance matrix? Journal of Theoretical Probability, 25(3):655–686,

2012.

[129] Kert Viele and Barbara Tong. Modeling with mixtures of linear regressions.

Statistics and Computing, 12(4):315–330, 2002.

[130] Elisabeth Vieth. Fitting piecewise linear regression functions to biological

responses. Journal of applied physiology, 67(1):390–396, 1989.

[131] Xinxi Wang and Ye Wang. Improving content-based and hybrid music rec-

ommendation using deep learning. In Proceedings of the 22nd ACM inter-

national conference on Multimedia, pages 627–636. ACM, 2014.

[132] Yining Wang and Anima Anandkumar. Online and differentially-private ten-

sor decomposition. In Advances in Neural Information Processing Systems

(NIPS), pages 3531–3539, 2016.

[133] Yining Wang, Hsiao-Yu Tung, Alexander J Smola, and Anima Anandkumar.

Fast and guaranteed tensor decomposition via sketching. In Advances in

Neural Information Processing Systems (NIPS), pages 991–999, 2015.

[134] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S Schoenholz,

and Jeffrey Pennington. Dynamical isometry and a mean field theory of cnns:


How to train 10,000-layer vanilla convolutional neural networks. In ICML,

2018.

[135] Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in

neural networks. In International Conference on Artificial Intelligence and

Statistics (AISTATS), 2017.

[136] Miao Xu, Rong Jin, and Zhi-Hua Zhou. Speedup matrix completion with side

information: Application to multi-label learning. In NIPS, pages 2301–2309,

2013.

[137] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating mini-

mization for mixed linear regression. In ICML, pages 613–621, 2014.

[138] Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit S Dhillon. Scalable coor-

dinate descent approaches to parallel matrix factorization for recommender

systems. In ICDM, pages 765–774, 2012.

[139] Hsiang-Fu Yu, Hsin-Yuan Huang, Inderjit S Dhillon, and Chih-Jen Lin. A

unified algorithm for one-class structured matrix factorization with side in-

formation. In AAAI, 2017.

[140] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-

scale multi-label learning with missing labels. In ICML, pages 593–601,

2014.

430

Page 447: Copyright by Kai Zhong 2018

[141] Xiao-Hu Yu and Guo-An Chen. On the local minima free condition

of backpropagation learning. IEEE Transactions on Neural Networks,

6(5):1300–1303, 1995.

[142] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol

Vinyals. Understanding deep learning requires rethinking generalization. In

ICLR, 2017.

[143] Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying

Ma. Collaborative knowledge base embedding for recommender systems. In

Proceedings of the 22nd ACM SIGKDD international conference on knowl-

edge discovery and data mining, pages 353–362. ACM, 2016.

[144] Jiong Zhang, Qi Lei, and Inderjit S Dhillon. Stabilizing gradients for deep

neural networks via efficient svd parameterization. In ICML, 2018.

[145] Shuai Zhang, Lina Yao, and Aixin Sun. Deep learning based recommender

system: A survey and new perspectives. arXiv preprint arXiv:1707.07435,

2017.

[146] Xiao Zhang, Simon S Du, and Quanquan Gu. Fast and sample efficient

inductive matrix completion via multi-phase procrustes flow. In ICML.

https://arxiv.org/pdf/1803.01233, 2018.

[147] Yuchen Zhang, Jason D Lee, and Michael I Jordan. L1-regularized neu-

ral networks are improperly learnable in polynomial time. In Proceedings


of The 33rd International Conference on Machine Learning (ICML), pages

993–1001, 2016.

[148] Yuchen Zhang, Jason D. Lee, Martin J. Wainwright, and Michael I. Jordan.

On the learnability of fully-connected neural networks. In International Con-

ference on Artificial Intelligence and Statistics, 2017.

[149] Yuchen Zhang, Percy Liang, and Martin Wainwright. Convexified convolu-

tional neural networks. In ICML, 2017.

[150] Kai Zhong, Prateek Jain, and Inderjit S. Dhillon. Efficient matrix sensing

using rank-1 Gaussian measurements. In International Conference on Algo-

rithmic Learning Theory, pages 3–18. Springer, 2015.

[151] Kai Zhong, Prateek Jain, and Inderjit S Dhillon. Mixed linear regression with

multiple components. In Advances in neural information processing systems

(NIPS), pages 2190–2198, 2016.

[152] Kai Zhong, Zhao Song, and Inderjit S Dhillon. Learning non-overlapping

convolutional neural networks with multiple kernels. arXiv preprint

arXiv:1711.03440, 2017.

[153] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S

Dhillon. Recovery guarantees for one-hidden-layer neural networks. In

ICML. https://arxiv.org/pdf/1706.03175.pdf, 2017.


Vita

Kai Zhong was born in Hangzhou, China. He graduated from Zhenhai Middle School in Ningbo, China in 2008. He obtained his B.S. degree in physics from Peking University in July 2012. Soon after that, he started his PhD study in the Institute for Computational Engineering and Sciences at the University of Texas at Austin. Currently he is a member of the Center for Big Data Analytics at the University of Texas at Austin.

Permanent address: [email protected]

This dissertation was typeset with LaTeX† by the author.

†LaTeX is a document preparation system developed by Leslie Lamport as a special version of Donald Knuth's TeX program.
