Copyright
by
Kai Zhong
2018
The Dissertation Committee for Kai Zhong certifies that this is the approved version of the following dissertation:
Provable Non-convex Optimization
for Learning Parametric Models
Committee:
Inderjit S. Dhillon, Supervisor
Prateek Jain, Co-Supervisor
Chandrajit Bajaj
George Biros
Rachel Ward
Provable Non-convex Optimization
for Learning Parametric Models
by
Kai Zhong,
DISSERTATION
Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
DOCTOR OF PHILOSOPHY
THE UNIVERSITY OF TEXAS AT AUSTIN
August 2018
Acknowledgments
This thesis would not have been possible without the help of many people. First, I
would like to thank my PhD supervisors, Inderjit Dhillon and Prateek Jain, who have
given me continuous support in many ways. Throughout my PhD study, I have benefited
greatly from their creative ideas, their insightful discussions, their encouraging
comments and their passion for academic research. This thesis is a direct result
of their guidance.
I would also like to thank Chandrajit Bajaj, George Biros, and Rachel Ward
for serving on my dissertation committee and for their constructive advice on my
dissertation.
I also greatly appreciate the collaborations with my co-authors. An incom-
plete list is Ian En-Hsu Yen, Zhao Song, Pradeep Ravikumar, Bowei Yan, Xiangru
Huang, Qi Lei, Cho-Jui Hsieh, Arnaud Vandaele, Meng Li, Jimmy Lin, Sanjiv
Kumar, Ruiqi Guo, David Simcha, Ashish Kapoor, and Peter Bartlett. I would
especially like to thank Ian for collaborating with me on many projects. I will
remember the numerous afternoons we spent discussing research, which helped me
greatly during the early years of my PhD study.
I would like to thank my other lab-mates for fruitful discussions and countless
help in my research and life: Si Si, Hsiang-Fu Yu, Nagarajan Natarajan, Kai-Yang
Chiang, David Inouye, Jiong Zhang, Nikhil Rao, Donghyuk Shin and Joyce Whang.
I am also very lucky to have my best friends, Bangguo Xiong, Lingyuan
Gao, Kecheng Xu and Xiaoqing Xu, who brought happiness and passion to my life.
Finally, I would like to thank my parents for their constant support and love.
Provable Non-convex Optimization
for Learning Parametric Models
Publication No.
Kai Zhong, Ph.D.
The University of Texas at Austin, 2018
Supervisors: Inderjit S. Dhillon, Prateek Jain
Non-convex optimization plays an important role in recent advances of ma-
chine learning. A large number of machine learning tasks are performed by solving
a non-convex optimization problem, which is generally NP-hard. Heuristics, such
as stochastic gradient descent, are employed to solve non-convex problems and
work decently well in practice despite the lack of general theoretical guarantees. In
this thesis, we study a series of non-convex optimization strategies and prove that
they lead to the global optimal solution for several machine learning problems, in-
cluding mixed linear regression, one-hidden-layer (convolutional) neural networks,
non-linear inductive matrix completion, and low-rank matrix sensing. At a high
level, we show that the non-convex objectives formulated in the above problems
have a large basin of attraction around the global optima when the data has benign
statistical properties. Therefore, local search heuristics, such as gradient descent or
alternating minimization, are guaranteed to converge to the global optima if initial-
ized properly. Furthermore, we show that spectral methods can efficiently initialize
the parameters such that they fall into the basin of attraction. Experiments on syn-
thetic datasets and real applications are carried out to justify our theoretical analyses
and illustrate the superiority of our proposed methods.
Table of Contents

Acknowledgments
Abstract
List of Tables
List of Figures

Chapter 1. Introduction

Chapter 2. Preliminaries
2.1 Notations
2.2 Definitions
2.3 Linear Algebra
2.4 Concentration Bounds
2.5 Convex Analyses
2.6 Tensor Decomposition

Chapter 3. Mixed Linear Regression
3.1 Introduction to Mixed Linear Regression
3.2 Problem Formulation
3.3 Local Strong Convexity
3.4 Initialization via Tensor Method
3.5 Recovery Guarantee
3.6 Numerical Experiments
3.7 Related Work

Chapter 4. One-hidden-layer Fully-connected Neural Networks
4.1 Introduction to One-hidden-layer Neural Networks
4.2 Problem Formulation
4.3 Local Strong Convexity
4.4 Initialization via Tensor Methods
4.5 Recovery Guarantee
4.6 Numerical Experiments
4.7 Related Work

Chapter 5. One-hidden-layer Convolutional Neural Networks
5.1 Introduction to One-hidden-layer Convolutional Neural Networks
5.2 Problem Formulation
5.3 Local Strong Convexity
5.4 Initialization via Tensor Method
5.5 Recovery Guarantee
5.6 Numerical Experiments
5.7 Related Work

Chapter 6. Non-linear Inductive Matrix Completion
6.1 Introduction to Inductive Matrix Completion
6.2 Problem Formulation
6.3 Local Strong Convexity
6.4 Initialization and Recovery Guarantee
6.5 Experiments on Synthetic and Real-world Data
6.6 Related Work

Chapter 7. Low-rank Matrix Sensing
7.1 Introduction to Low-rank Matrix Sensing
7.2 Problem Formulation – Two Settings
7.3 Rank-one Matrix Sensing via Alternating Minimization
7.4 Numerical Experiments
7.5 Related Work

Chapter 8. Discussion
8.1 Over-specified/Over-parameterized Neural Networks
8.1.1 Learning a ReLU using Over-specified Neural Networks
8.1.2 Numerical Experiments with Multiple Hidden Units
8.2 Initialization Methods
8.3 Stochastic Gradient Descent (SGD) and Other Fast Algorithms

Appendices

Appendix A. Mixed Linear Regression
A.1 Proofs of Local Convergence
A.1.1 Some Lemmata
A.1.2 Proof of Theorem 3.3.1
A.1.3 Proof of Theorem 3.3.2
A.1.4 Proof of the Lemmata
A.1.4.1 Proof of Lemma A.1.1
A.1.4.2 Proof of Lemma A.1.2
A.1.4.3 Proof of Lemma A.1.3
A.1.4.4 Proof of Lemma A.1.4
A.1.4.5 Proof of Lemma A.1.5
A.1.5 Proof of Lemma A.1.6
A.2 Proofs of Tensor Method for Initialization
A.2.1 Some Lemmata
A.2.2 Proof of Theorem 3.4.2
A.2.3 Proof of Theorem 3.5.1
A.2.4 Proofs of Some Lemmata
A.2.4.1 Proof of Lemma A.2.1
A.2.4.2 Proof of Lemma A.2.2
A.2.4.3 Proof of Lemma A.2.3
A.2.4.4 Proof of Lemma A.2.4
A.2.4.5 Proof of Lemma A.2.5

Appendix B. One-hidden-layer Fully-connected Neural Networks
B.1 Matrix Bernstein Inequality for Unbounded Case
B.2 Properties of Activation Functions
B.3 Local Positive Definiteness of Hessian
B.3.1 Main Results for Positive Definiteness of Hessian
B.3.2 Positive Definiteness of Population Hessian at the Ground Truth
B.3.3 Error Bound of Hessians near the Ground Truth for Smooth Activations
B.3.4 Error Bound of Hessians near the Ground Truth for Non-smooth Activations
B.3.5 Positive Definiteness for a Small Region
B.4 Tensor Methods
B.4.1 Tensor Initialization Algorithm
B.4.2 Main Result for Tensor Methods
B.4.3 Error Bound for the Subspace Spanned by the Weight Matrix
B.4.4 Error Bound for the Reduced Third-order Moment
B.4.5 Error Bound for the Magnitude and Sign of the Weight Vectors

Appendix C. One-hidden-layer Convolutional Neural Networks
C.1 Proof Overview
C.1.1 Orthogonal weight matrices for the population case
C.1.2 Non-orthogonal weight matrices for the population case
C.2 Properties of Activation Functions
C.3 Positive Definiteness of Hessian near the Ground Truth
C.3.1 Bounding the eigenvalues of Hessian
C.3.2 Error bound of Hessians near the ground truth for smooth activations
C.3.3 Error bound of Hessians near the ground truth for non-smooth activations
C.3.4 Proofs for Main results

Appendix D. Non-linear Inductive Matrix Completion
D.1 Proof Overview
D.1.1 Positive definiteness of the population Hessian
D.1.2 Warm up: orthogonal case
D.1.3 Error bound for the empirical Hessian near the ground truth
D.2 Positive Definiteness of Population Hessian
D.2.1 Orthogonal case
D.2.2 Non-orthogonal Case
D.3 Positive Definiteness of the Empirical Hessian
D.3.1 Local Linear Convergence

Appendix E. Low-rank Matrix Sensing
E.1 Proof of Theorem 7.3.1
E.2 Proofs for Matrix Sensing using Rank-one Independent Gaussian Measurements
E.2.1 Proof of Claim 7.3.1
E.2.2 Proof of Theorem 7.3.2
E.3 Proof of Matrix Sensing using Rank-one Dependent Gaussian Measurements
E.3.1 Proof of Lemma 7.3.1
E.3.2 Proof of Theorem 7.3.3

Appendix F. Over-specified Neural Network
F.1 Preliminaries
F.2 Two Learning Phases
F.2.1 Phase I – Learning the Directions
F.2.2 Phase II – Learning the Magnitudes
F.3 Proof of Main Theorems
F.3.1 Population Case
F.3.2 Finite-Sample Case
F.4 Proofs of Phase I
F.4.1 Proof of Phase I-a
F.4.2 Proofs of Phase I-b
F.5 Proofs for Phase II
F.5.1 Proof for Lemma F.2.3

Bibliography

Vita
List of Tables
6.1 The error rate in semi-supervised clustering using NIMC and IMC.

6.2 Test RMSE for recommending new users with movies on the Movielens dataset.

7.1 Comparison of sample complexity and computational complexity for different approaches and different measurements.

8.1 The average objective function value when gradient descent gets stuck at a non-global local minimum over 100 random trials. Note that the average function value here does not take globally converged function values into account.
B.1 ρ(σ) values for different activation functions. Note that we can calculate the exact values for ReLU, Leaky ReLU, squared ReLU and erf. We can't find a closed-form value for sigmoid or tanh, but we calculate the numerical values of ρ(σ) for σ = 0.1, 1, 10. Here $\rho_{\mathrm{erf}}(\sigma) = \min\{(4\sigma^2+1)^{-1/2} - (2\sigma^2+1)^{-1},\ (4\sigma^2+1)^{-3/2} - (2\sigma^2+1)^{-3},\ (2\sigma^2+1)^{-2}\}$.
List of Figures
3.1 Empirical Performance of MLR.

3.2 Comparison with EM in terms of time and iterations. Our method with random initialization is significantly better than EM with random initialization. Performance of the two methods is comparable when initialized with the tensor method.

4.1 Numerical Experiments.

5.1 (a) (left) Minimal eigenvalue of the Hessian at the ground truth for different activations against the sample size; (b) (right) convergence of gradient descent with different random initializations.

6.1 The rate of success of GD over synthetic data (Left: sigmoid, Right: ReLU). White blocks denote 100% success rate.

6.2 NIMC vs. IMC on the gene-disease association prediction task.

7.1 Comparison of computational complexity and measurement complexity for different approaches and different operators.

7.2 Recovery rate for different matrix dimensions d (x-axis) and different numbers of measurements m (y-axis). The color reflects the recovery rate scaled from 0 to 1. White indicates perfect recovery, while black denotes failure in all the experiments.
Chapter 1
Introduction
The goal of most machine learning tasks is to find a statistical model that
fits the data well. One popular procedure to find a good model is as follows: 1)
construct a model class that is the best guess of the underlying data-generating
model; 2) form a proper optimization problem from the model using the existing
training data; 3) solve the optimization problem and obtain the solution. All three
steps are critical in order to obtain a good model.
Choosing a suitable model class is the first key step for the success of ma-
chine learning tasks. There are mainly two types of models: parametric models,
which have a finite number of parameters, and non-parametric models, which, by
contrast, have an infinite number of parameters. Learning non-parametric mod-
els may not scale well for large datasets. For example, naive kernel methods can
scale quadratically with the size of the training data. On the other hand, simple
parametric models, such as linear models, are easy to train but may have limited
model capacity that fails to capture the underlying model. Therefore, large-capacity
parametric models, such as neural networks, have recently become increasingly
attractive. To fit a particular dataset, it is also important to use a well-designed
and structured model class. For example, convolutional neural networks have a
specially constructed architecture for computer vision tasks. Low-rank
models impose low-rank structure for the matrix variables. This thesis focuses on
parametric models that have particular structures. The models we consider include
mixed linear regression models, one-hidden-layer (convolutional) neural networks,
non-linear inductive matrix completion, and low-rank matrix sensing models.
The second step is to form an optimization problem. An optimization problem
is typically formulated via a maximum (log-)likelihood estimator, i.e., by
minimizing an objective function under certain constraints. To find a good model in
a model class, we can form either convex optimization problems or non-convex
optimization problems. For example, low-rank models can either directly form a
non-convex problem, where the low-rank constraint is enforced by representing the
matrix variable as a product of two thin matrices, X = UV >, or form a convex opti-
mization problem where the low-rank constraint is relaxed to a constrained nuclear
norm, ‖X‖∗ ≤ C. For convex problems, there exist polynomial algorithms that are
guaranteed to converge to the global optima. Non-convex problems are in general
NP-hard to solve, therefore, there are no general polynomial algorithms that can
solve non-convex problems. Popular heuristics such as gradient descent might get
stuck at some bad local minima. Furthermore, complex models such as deep neural
networks are more natural to be formulated as non-convex optimization problems
due to the models’ high nonlinearity.
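As a concrete numerical illustration of the two low-rank formulations above (a sketch of my own, not taken from the thesis), the snippet below compares the non-convex factorization X = UV⊤ with its convex nuclear-norm surrogate; the dimensions and random data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
d, r = 8, 2

# Non-convex formulation: represent the matrix variable as a product
# of two thin matrices, X = U V^T, which enforces rank(X) <= r.
U = rng.normal(size=(d, r))
V = rng.normal(size=(d, r))
X = U @ V.T

# Convex surrogate: the nuclear norm ||X||_* (sum of singular values),
# which is what the constraint ||X||_* <= C relaxes the rank to.
nuc = np.linalg.svd(X, compute_uv=False).sum()

# Classical identity: ||X||_* = min over all factorizations X = U V^T of
# (||U||_F^2 + ||V||_F^2) / 2, so any particular factorization upper-bounds it.
bound = 0.5 * (np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2)
assert nuc <= bound + 1e-9
```

The variational identity in the last comment is the standard link between the Burer-Monteiro-style factorization and the nuclear-norm relaxation.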
The third step is the choice of algorithm for solving the optimization problem.
Here, the efficiency of the optimization process is a key consideration. For
the low-rank model example, the corresponding non-convex problem can often be
solved more efficiently than its convex counterpart, since the nuclear norm
constraint makes the convex formulation expensive unless it is implemented very
carefully. Although there are no general theoretical guarantees, simple
optimization heuristics can still find sufficiently good solutions to non-convex
problems. For example,
SGD is widely applied to solve highly non-convex objectives of neural networks
and alternating minimization is often employed to solve non-convex problems in-
volving low-rank models. Both heuristics work quite well in practice. Considering
the efficiency yet the lack of theoretical guarantees of non-convex approaches, in
this thesis, we focus on providing theoretical guarantees for non-convex optimiza-
tion approaches.
Since non-convex problems are generally NP-hard, we need additional as-
sumptions to develop theoretical guarantees for these problems. In this thesis, we
assume data comes from benign distributions. In particular, a) we assume the input
data of the model follows a Gaussian distribution; b) we consider realizable set-
tings, i.e. there exists a ground truth parametric model that maps the input to the
target without noise. As a result, we provide recovery guarantees for learning many
non-convex problems, including mixed linear regression, one-hidden-layer (con-
volutional) neural networks, non-linear inductive matrix completion, and low-rank
matrix sensing.
Specifically, our main theoretical discoveries are twofold. 1) There is a
large basin of attraction around the global optima of the non-convex objective func-
tions of the above mentioned problems. 2) The parameters can be efficiently ini-
tialized into the basin of attraction by spectral methods. At a high level, we apply
the following optimization strategy. First, spectral methods are used to initialize the
parameters; then local search heuristics, such as gradient descent or alternating
minimization, are used to reach the global optimum, which recovers the ground
truth parameters in the realizable setting. Moreover, we conduct experiments on
synthetic data to justify our theoretical analyses and provide some experimental
results on real-world datasets to demonstrate the superior performance of our pro-
posed models.
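The two-stage strategy above (spectral initialization followed by local search) can be illustrated on a toy rank-one matrix sensing instance under the thesis's Gaussian, noiseless assumptions. This is only a simplified sketch of the general recipe, not the exact algorithms analyzed in later chapters; all constants and step sizes below are ad hoc choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 20, 2000                       # dimension, number of measurements

# Realizable, noiseless rank-one matrix sensing: observe
# y_i = a_i^T X* a_i with X* = w* w*^T and Gaussian sensing vectors a_i.
w_star = rng.normal(size=d)
A = rng.normal(size=(m, d))
y = (A @ w_star) ** 2

# Stage 1: spectral initialization. Since E[y_i a_i a_i^T] = 2 X* + tr(X*) I,
# the top eigenvector of the empirical moment matrix is aligned with w*.
M = (A * y[:, None]).T @ A / m
vals, vecs = np.linalg.eigh(M)
w = np.sqrt(max((vals[-1] - y.mean()) / 2, 1e-12)) * vecs[:, -1]

# Stage 2: local search -- gradient descent on the non-convex objective
# f(w) = (1/m) * sum_i ((a_i^T w)^2 - y_i)^2, started inside the basin.
eta = 0.02 / y.mean()                 # conservative step size
for _ in range(1000):
    p = A @ w
    w -= eta * 4 * (A.T @ ((p ** 2 - y) * p)) / m

# The parameter is identifiable only up to a global sign flip.
err = min(np.linalg.norm(w - w_star), np.linalg.norm(w + w_star))
```

With enough Gaussian measurements, the spectral estimate lands in the basin of attraction and gradient descent then converges to the ground truth, mirroring the high-level recovery guarantees stated above.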
Thesis Overview
The roadmap of this thesis is as follows. We first introduce preliminaries,
including notations, definitions and other supporting results, in
Chapter 2. Then, we present our main results for five non-convex problems: mixed
linear regression models, one-hidden-layer (convolutional) neural networks, non-
linear inductive matrix completion, and low-rank matrix sensing models in
Chapters 3-7, respectively. In each of them, we introduce the specific problem, formulate
it and present the theoretical results for its local landscape and the initialization
methods. Finally, in Chapter 8, we discuss several open problems.
Chapter 2
Preliminaries
In this chapter, we introduce the notations, definitions and preliminaries
required in this thesis.
2.1 Notations
For any positive integer n, we use [n] to denote the set {1, 2, · · · , n}. For
random variable X , let E[X] denote the expectation of X (if this quantity exists).
For integer k, we use Dk to denote N(0, Ik), the standard Gaussian distribution with
dimension k. We use 1f to denote the indicator function, which is 1 if f holds and
0 otherwise. We define (z)+ := max{0, z}. For any vector x ∈ Rn, we use ‖x‖ to
denote its `2 norm.
For any function f , we define $\widetilde O(f)$ to be $f \cdot \log^{O(1)}(f)$. For two functions
f, g, we use the shorthand f ≲ g (resp. ≳) to indicate that f ≤ Cg (resp. ≥) for an
absolute constant C. We use f ≍ g to mean cf ≤ g ≤ Cf for constants c, C. We
use poly(f) to denote $f^{O(1)}$.
We provide several definitions related to matrix A. Let det(A) denote the
determinant of a square matrix A. Let A> denote the transpose of A. Let A† denote
the Moore-Penrose pseudoinverse of A. Let A−1 denote the inverse of a full rank
square matrix. Let ‖A‖F denote the Frobenius norm of matrix A. Let ‖A‖ denote
the spectral norm of matrix A. Let σi(A) denote the i-th largest singular value
of A. We often use a capital letter to denote the matrix stacking the corresponding
lowercase vectors, e.g., W = [w1 w2 · · · wk] ∈ Rd×k, where wi ∈ Rd. For two
same-size matrices A,B ∈ Rd1×d2 , we use A ◦ B ∈ Rd1×d2 to denote the element-wise
(Hadamard) product of these two matrices.
Now we provide several definitions related to tensor T . We use ⊗ to denote
outer product and · to denote dot product. Given two column vectors u, v ∈ Rn,
we have u ⊗ v ∈ Rn×n with (u ⊗ v)i,j = ui · vj , and $u^\top v = \sum_{i=1}^n u_i v_i \in \mathbb{R}$. Given
three column vectors u, v, w ∈ Rn, then u⊗v⊗w ∈ Rn×n×n is a third-order tensor
with (u ⊗ v ⊗ w)i,j,k = ui · vj · wk. We use $u^{\otimes r} \in \mathbb{R}^{n^r}$ to denote the r-fold outer
product of u with itself, i.e., u outer-multiplied with itself r − 1 times.
We say a tensor T ∈ Rn×n×n is symmetric if and only if for any i, j, k,
Ti,j,k = Ti,k,j = Tj,i,k = Tj,k,i = Tk,i,j = Tk,j,i. Given a third-order tensor T ∈
Rn1×n2×n3 and three matrices A ∈ Rn1×d1 , B ∈ Rn2×d2 , C ∈ Rn3×d3 , we use
T (A,B,C) to denote the d1 × d2 × d3 tensor whose (i, j, k)-th entry is

$$\sum_{i'=1}^{n_1} \sum_{j'=1}^{n_2} \sum_{k'=1}^{n_3} T_{i',j',k'}\, A_{i',i}\, B_{j',j}\, C_{k',k}.$$
We use ‖T‖ or ‖T‖op to denote the operator norm of the tensor T , i.e.,

$$\|T\|_{\mathrm{op}} = \|T\| = \max_{\|a\|=1} |T(a, a, a)|.$$
For a tensor T ∈ Rn1×n2×n3 , we use the matrix T (1) ∈ Rn1×n2n3 to denote the
flattening of T along the first dimension, i.e., [T (1)]i,(j−1)n3+k = Ti,j,k, ∀i ∈
[n1], j ∈ [n2], k ∈ [n3]. The matrices T (2) ∈ Rn2×n3n1 and T (3) ∈ Rn3×n1n2 are
defined similarly.
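The tensor notation above maps directly onto array operations. The following sketch (with arbitrary small dimensions of my choosing) checks the flattening identity and computes the multilinear form T(A, B, C) via an einsum contraction.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, n3 = 3, 4, 5
T = rng.normal(size=(n1, n2, n3))

# Flattening along the first dimension: T^(1) in R^{n1 x (n2*n3)}.
# With 0-indexed arrays, [T^(1)]_{i, j*n3 + k} = T_{i,j,k}, matching the
# 1-indexed identity [T^(1)]_{i,(j-1)n3+k} = T_{i,j,k} in the text.
T1 = T.reshape(n1, n2 * n3)
i, j, k = 1, 2, 3
assert np.isclose(T1[i, j * n3 + k], T[i, j, k])

# Multilinear form T(A, B, C): contract each mode of T with a matrix,
# producing a d1 x d2 x d3 tensor.
A = rng.normal(size=(n1, 2))
B = rng.normal(size=(n2, 3))
C = rng.normal(size=(n3, 4))
TABC = np.einsum('pqr,pi,qj,rk->ijk', T, A, B, C)
assert TABC.shape == (2, 3, 4)
```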
2.2 Definitions
First we define some concepts about convexity and smoothness.
Definition 2.2.1 (m-Strongly Convex Function). A differentiable function f is called
m-strongly convex with parameter m > 0 if the following inequality holds for all
points x, y in its domain:

$$\langle \nabla f(x) - \nabla f(y),\ x - y \rangle \ge m \|x - y\|^2.$$

If f is further twice continuously differentiable, then f is strongly convex with
parameter m if and only if $\nabla^2 f(x) \succeq mI$ for all x in the domain.
Definition 2.2.2 ((A, δ, m)-Locally Strongly Convex Function). A function f is
called (A, δ, m)-locally strongly convex, for a convex subset A of the domain and
parameters δ ∈ [0, 1] and m > 0, if for any point x ∈ A, with probability at least
1 − δ, ∇2f(x) exists and the following inequality holds:

$$\nabla^2 f(x) \succeq mI.$$
Definition 2.2.3 (L-Lipschitz Function). A differentiable function f is called
L-Lipschitz with parameter L > 0 if the following inequality holds for all points x in
its domain:

$$\|\nabla f(x)\| \le L.$$
Definition 2.2.4 (M-Smooth Function). A differentiable function f is called
M-smooth with parameter M > 0 if the following inequality holds for all points x, y
in its domain:

$$\|\nabla f(x) - \nabla f(y)\| \le M \|x - y\|.$$

If f is further twice continuously differentiable, then f is M-smooth with parameter
M if and only if $\nabla^2 f(x) \preceq MI$ for all x in the domain.
Definition 2.2.5 ((A, δ, M)-Locally Smooth Function). A function f is called
(A, δ, M)-locally smooth, for a convex subset A of the domain and parameters
δ ∈ [0, 1] and M > 0, if for any point x ∈ A, with probability at least 1 − δ, ∇2f(x)
exists and the following inequality holds:

$$\nabla^2 f(x) \preceq MI.$$
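For a quadratic f(x) = ½ x⊤Ax + b⊤x with positive definite A, the strong-convexity and smoothness parameters are simply the extreme eigenvalues of A. The sketch below (my own illustration, with arbitrary random data) verifies the inequalities of Definitions 2.2.1 and 2.2.4 numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
Q = rng.normal(size=(d, d))
A = Q @ Q.T + np.eye(d)            # positive definite Hessian of f
b = rng.normal(size=d)

def grad(x):
    # gradient of f(x) = 0.5 * x^T A x + b^T x
    return A @ x + b

m = np.linalg.eigvalsh(A).min()    # strong-convexity parameter (Def. 2.2.1)
M = np.linalg.eigvalsh(A).max()    # smoothness parameter (Def. 2.2.4)

x, y = rng.normal(size=d), rng.normal(size=d)
inner = (grad(x) - grad(y)) @ (x - y)
gap2 = np.linalg.norm(x - y) ** 2

assert inner >= m * gap2 - 1e-9                                  # Def. 2.2.1
assert np.linalg.norm(grad(x) - grad(y)) <= M * np.linalg.norm(x - y) + 1e-9
```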
2.3 Linear Algebra
This section includes some facts about linear algebra that will be used in
this thesis.
Fact 2.3.1. Given a full column-rank matrix $W = [w_1, w_2, \cdots, w_k] \in \mathbb{R}^{d \times k}$, let
$\bar W = \big[\frac{w_1}{\|w_1\|}, \frac{w_2}{\|w_2\|}, \cdots, \frac{w_k}{\|w_k\|}\big]$. Then we have: (I) for any $i \in [k]$, $\sigma_k(W) \le \|w_i\| \le \sigma_1(W)$; (II) $1/\kappa(W) \le \sigma_k(\bar W) \le \sigma_1(\bar W) \le \sqrt{k}$.
Proof. Part (I). We have

$$\sigma_k(W) \le \|W e_i\| = \|w_i\| \le \sigma_1(W).$$

Part (II). We first show how to lower bound $\sigma_k(\bar W)$:

$$
\begin{aligned}
\sigma_k(\bar W) &= \min_{\|s\|=1} \|\bar W s\|
= \min_{\|s\|=1} \Big\| \sum_{i=1}^{k} \frac{s_i}{\|w_i\|}\, w_i \Big\| && \text{by definition of } \bar W \\
&\ge \min_{\|s\|=1} \sigma_k(W) \Big( \sum_{i=1}^{k} \Big( \frac{s_i}{\|w_i\|} \Big)^2 \Big)^{1/2} && \text{since } \|W t\| \ge \sigma_k(W)\|t\| \\
&\ge \min_{\|s\|=1} \sigma_k(W) \Big( \sum_{i=1}^{k} \Big( \frac{s_i}{\max_{j \in [k]} \|w_j\|} \Big)^2 \Big)^{1/2} && \text{by } \max_{j \in [k]} \|w_j\| \ge \|w_i\| \\
&= \sigma_k(W) / \max_{j \in [k]} \|w_j\| && \text{by } \|s\| = 1 \\
&\ge \sigma_k(W) / \sigma_1(W) = 1/\kappa(W). && \text{by } \max_{j \in [k]} \|w_j\| \le \sigma_1(W)
\end{aligned}
$$

It remains to upper bound $\sigma_1(\bar W)$:

$$\sigma_1(\bar W) \le \Big( \sum_{i=1}^{k} \sigma_i^2(\bar W) \Big)^{1/2} = \|\bar W\|_F = \sqrt{k}. \qquad\square$$
Fact 2.3.2. Let $U \in \mathbb{R}^{d \times k}$ and $V \in \mathbb{R}^{d \times k}$ ($k \le d$) denote two orthogonal matrices.
Then $\|UU^\top - VV^\top\| = \|(I - UU^\top)V\| = \|(I - VV^\top)U\| = \sqrt{1 - \sigma_k^2(U^\top V)}$.
Proof. Let $U_\perp \in \mathbb{R}^{d \times (d-k)}$ and $V_\perp \in \mathbb{R}^{d \times (d-k)}$ be the orthogonal complements
of $U, V \in \mathbb{R}^{d \times k}$ respectively. Then

$$
\begin{aligned}
\|UU^\top - VV^\top\| &= \|(I - VV^\top)UU^\top - VV^\top(I - UU^\top)\| \\
&= \|V_\perp V_\perp^\top U U^\top - V V^\top U_\perp U_\perp^\top\| \\
&= \left\| \begin{bmatrix} V_\perp & V \end{bmatrix} \begin{bmatrix} V_\perp^\top U & 0 \\ 0 & -V^\top U_\perp \end{bmatrix} \begin{bmatrix} U^\top \\ U_\perp^\top \end{bmatrix} \right\|
= \max\big(\|V_\perp^\top U\|, \|V^\top U_\perp\|\big).
\end{aligned}
$$

We show how to simplify $\|V_\perp^\top U\|$:

$$\|V_\perp^\top U\| = \|(I - VV^\top)U\| = \sqrt{\|U^\top (I - VV^\top) U\|} = \max_{\|a\|=1} \sqrt{1 - \|V^\top U a\|^2} = \sqrt{1 - \sigma_k^2(V^\top U)}.$$

Similarly, we can simplify $\|U_\perp^\top V\|$:

$$\|U_\perp^\top V\| = \sqrt{1 - \sigma_k^2(U^\top V)} = \sqrt{1 - \sigma_k^2(V^\top U)}. \qquad\square$$
Fact 2.3.3. Let $C \in \mathbb{R}^{d_1 \times d_2}$ and $B \in \mathbb{R}^{d_2 \times d_3}$ be two matrices. Then $\|CB\|_F \le \|C\| \|B\|_F$
and $\|CB\|_F \ge \sigma_{\min}(C) \|B\|_F$.
Proof. For each $i \in [d_3]$, let $b_i$ denote the $i$-th column of B. We can upper bound $\|CB\|_F$:

$$\|CB\|_F = \Big( \sum_{i=1}^{d_3} \|C b_i\|^2 \Big)^{1/2} \le \Big( \sum_{i=1}^{d_3} \|C\|^2 \|b_i\|^2 \Big)^{1/2} = \|C\| \|B\|_F.$$

For the lower bound,

$$\|CB\|_F = \Big( \sum_{i=1}^{d_3} \|C b_i\|^2 \Big)^{1/2} \ge \Big( \sum_{i=1}^{d_3} \sigma_{\min}^2(C) \|b_i\|^2 \Big)^{1/2} = \sigma_{\min}(C) \|B\|_F. \qquad\square$$
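Both inequalities in Fact 2.3.3 are easy to sanity-check numerically; the sketch below (with arbitrary random matrices of my choosing) does so.

```python
import numpy as np

rng = np.random.default_rng(3)
d1, d2, d3 = 6, 4, 5
C = rng.normal(size=(d1, d2))
B = rng.normal(size=(d2, d3))

fro_CB = np.linalg.norm(C @ B)                    # ||CB||_F (Frobenius)
fro_B = np.linalg.norm(B)                         # ||B||_F
spec_C = np.linalg.norm(C, 2)                     # ||C|| (spectral norm)
smin_C = np.linalg.svd(C, compute_uv=False).min() # sigma_min(C)

# Fact 2.3.3: sigma_min(C) ||B||_F <= ||CB||_F <= ||C|| ||B||_F
assert fro_CB <= spec_C * fro_B + 1e-9
assert fro_CB >= smin_C * fro_B - 1e-9
```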
2.4 Concentration Bounds
This thesis assumes the data has benign distributions; specifically, we assume
the input data follows a Gaussian distribution. Gaussian random vectors satisfy
several concentration properties, which we collect in this section. They will
be used as lemmata and tools in the analyses throughout this thesis.
Lemma 2.4.1 (Proposition 1.1 in [68]). Let $x \sim N(0, I_d)$ and let $\Sigma \in \mathbb{R}^{d \times d}$ be a fixed
positive semi-definite matrix. Then for all $t > 0$, with probability at least $1 - e^{-t}$,

$$x^\top \Sigma x \le \operatorname{tr}(\Sigma) + 2\sqrt{\operatorname{tr}(\Sigma^2)\, t} + 2\|\Sigma\| t.$$
By taking t = 2 log(d) + log(n) for some n ≥ d, we have

Corollary 2.4.1. Let $x \sim N(0, I_d)$ and let $\Sigma \in \mathbb{R}^{d \times d}$ be a fixed positive semi-definite
matrix. Then with probability at least $1 - \frac{1}{nd^2}$,

$$x^\top \Sigma x \le \operatorname{tr}(\Sigma) + 2\sqrt{\operatorname{tr}(\Sigma^2)(2\log(d) + \log(n))} + 2\|\Sigma\|(2\log(d) + \log(n)) \le 12 \operatorname{tr}(\Sigma) \log(n).$$
Corollary 2.4.2. Let z denote a fixed d-dimensional vector. Then for any $C \ge 1$ and
$n \ge 1$, we have

$$\Pr_{x \sim N(0, I_d)}\big[\, |\langle x, z\rangle|^2 \le 5C \|z\|^2 \log n \,\big] \ge 1 - 1/(n d^C).$$
Proof. This follows by Proposition 1.1 in [68].
Corollary 2.4.3. For any $C \ge 1$ and $n \ge 1$, we have

$$\Pr_{x \sim N(0, I_d)}\big[\, \|x\|^2 \le 5Cd \log n \,\big] \ge 1 - 1/(n d^C).$$
Proof. This follows by Proposition 1.1 in [68].
Setting $\Sigma = \beta\beta^\top$ in Corollary 2.4.1, we have

Corollary 2.4.4. If $x \sim N(0, I_d)$, then for any fixed $\beta \in \mathbb{R}^d$, with probability at least
$1 - \frac{1}{nd^2}$, we have

$$(\beta^\top x)^2 \le 12 \|\beta\|^2 \log n.$$
Setting $\Sigma = I$ and $t = 2\log(d) + \log n$ for some $n \ge d$ in Lemma 2.4.1, we have

Corollary 2.4.5. If $x \sim N(0, I_d)$, then with probability at least $1 - \frac{1}{nd^2}$, we have

$$\|x\|^2 \le d + 2\sqrt{3 d \log n} + 6 \log(n) \le 4 d \log n.$$
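Bounds like Corollary 2.4.5 can be sanity-checked by Monte Carlo simulation; the sketch below estimates the empirical failure probability of the event ‖x‖² ≤ 4d log n over many Gaussian draws (the trial count and dimensions are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, trials = 50, 100, 20000

# Draw many x ~ N(0, I_d) and check how often ||x||^2 exceeds the
# 4 d log n bound of Corollary 2.4.5.  The corollary promises the
# failure probability is at most 1/(n d^2).
x = rng.normal(size=(trials, d))
norms2 = (x ** 2).sum(axis=1)
bound = 4 * d * np.log(n)
fail_rate = (norms2 > bound).mean()

assert fail_rate <= 1 / (n * d ** 2) + 1e-3
```

In practice the empirical failure rate here is zero: for d = 50, the bound 4d log n ≈ 921 is far in the tail of the chi-squared distribution of ‖x‖².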
Lemma 2.4.2. Let $a, b, c \ge 0$ denote three constants, let $u, v, w \in \mathbb{R}^d$ denote three
vectors, and let $\mathcal{D}_d$ denote the Gaussian distribution $N(0, I_d)$. Then

$$\mathbb{E}_{x \sim \mathcal{D}_d}\big[\, |u^\top x|^a |v^\top x|^b |w^\top x|^c \,\big] \asymp \|u\|^a \|v\|^b \|w\|^c.$$

Proof. For the upper bound,

$$\mathbb{E}_{x \sim \mathcal{D}_d}\big[ |u^\top x|^a |v^\top x|^b |w^\top x|^c \big] \le \big(\mathbb{E}_{x \sim \mathcal{D}_d}[|u^\top x|^{2a}]\big)^{1/2} \cdot \big(\mathbb{E}_{x \sim \mathcal{D}_d}[|v^\top x|^{4b}]\big)^{1/4} \cdot \big(\mathbb{E}_{x \sim \mathcal{D}_d}[|w^\top x|^{4c}]\big)^{1/4} \lesssim \|u\|^a \|v\|^b \|w\|^c,$$

where the first step follows from Hölder's inequality, i.e., $\mathbb{E}[|XYZ|] \le (\mathbb{E}[|X|^2])^{1/2} \cdot (\mathbb{E}[|Y|^4])^{1/4} \cdot (\mathbb{E}[|Z|^4])^{1/4}$, and the second step follows by evaluating the Gaussian moments, using that $a, b, c$ are constants.

Since all three factors $|u^\top x|, |v^\top x|, |w^\top x|$ are non-negative and depend on a common
random vector $x$, we can also show a matching lower bound,

$$\mathbb{E}_{x \sim \mathcal{D}_d}\big[ |u^\top x|^a |v^\top x|^b |w^\top x|^c \big] \gtrsim \|u\|^a \|v\|^b \|w\|^c. \qquad\square$$
The matrix Bernstein inequality is a key tool that will be used throughout
this thesis. We state it below.
Lemma 2.4.3 (Matrix Bernstein Inequality for the bounded case, Theorem 6.1 in [127]).
Consider a finite sequence $\{X_k\}$ of independent, random, self-adjoint matrices with
dimension $d$. Assume that $\mathbb{E}[X_k] = 0$ and $\lambda_{\max}(X_k) \le R$ almost surely. Compute
the norm of the total variance, $\sigma^2 := \big\| \sum_k \mathbb{E}(X_k^2) \big\|$. Then the following chain of
inequalities holds for all $t \ge 0$:

$$
\Pr\Big[ \lambda_{\max}\Big( \sum_k X_k \Big) \ge t \Big]
\le d \cdot \exp\Big( -\frac{\sigma^2}{R^2} \cdot h\Big(\frac{Rt}{\sigma^2}\Big) \Big)
\le d \cdot \exp\Big( \frac{-t^2/2}{\sigma^2 + Rt/3} \Big)
\le
\begin{cases}
d \cdot \exp\big(-3t^2/(8\sigma^2)\big) & \text{for } t \le \sigma^2/R; \\
d \cdot \exp\big(-3t/(8R)\big) & \text{for } t \ge \sigma^2/R.
\end{cases}
$$

The function $h(u) := (1 + u)\log(1 + u) - u$ for $u \ge 0$.
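As an illustration of the quantities in Lemma 2.4.3, the sketch below forms a sum of independent, centered, bounded self-adjoint matrices and checks that its spectral norm is small relative to the number of summands, as the inequality predicts. The particular construction (Gaussian vectors normalized to the sphere of radius √d) is my own choice, made only so the boundedness assumption λmax(Xk) ≤ R holds surely.

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 10, 5000

# Summands X_k = x_k x_k^T - I with x_k uniform on the sphere of radius
# sqrt(d), so E[X_k] = 0 and lambda_max(X_k) = d - 1 =: R holds surely.
x = rng.normal(size=(m, d))
x *= np.sqrt(d) / np.linalg.norm(x, axis=1, keepdims=True)
S = x.T @ x - m * np.eye(d)        # S = sum_k X_k

# Bernstein predicts lambda_max(S) = O(sqrt(sigma^2 log d) + R log d),
# which is much smaller than m; equivalently ||S||/m is small.
rel_dev = np.linalg.norm(S, 2) / m
assert rel_dev < 0.2
```

Here ‖S‖/m is the deviation of the empirical covariance from the identity, which concentrates at roughly √(d/m) scale, consistent with the lemma.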
2.5 Convex Analyses
When an objective function is strongly convex and smooth, gradient descent
with a properly chosen step size converges linearly to the global optimum. This
is stated formally below. Our local convergence guarantees for non-convex
problems are based on this result.
Theorem 2.5.1 (Convergence of Gradient Descent on Strongly Convex and Smooth
Functions). Let $f : \mathbb{R}^d \to \mathbb{R}$ be an m-strongly convex and M-smooth function, and let
$x^* = \operatorname{argmin}_x f(x)$ be its minimizer. Assume $x^{(t+1)} = x^{(t)} - \eta \nabla f(x^{(t)})$ is the update
of gradient descent with step size $\eta$. Then if $\eta = 1/M$,

$$\|x^{(t+1)} - x^*\|^2 \le \Big(1 - \frac{m}{M}\Big) \|x^{(t)} - x^*\|^2.$$
Proof.
‖x^{(t+1)} − x*‖² = ‖x^{(t)} − η∇f(x^{(t)}) − x*‖²
= ‖x^{(t)} − x*‖² − 2η⟨∇f(x^{(t)}), x^{(t)} − x*⟩ + η²‖∇f(x^{(t)})‖².
Since ∇f(x*) = 0, by the fundamental theorem of calculus we can rewrite the gradient as
∇f(x^{(t)}) = (∫₀¹ ∇²f(x* + γ(x^{(t)} − x*)) dγ)(x^{(t)} − x*).
We define the matrix
H := ∫₀¹ ∇²f(x* + γ(x^{(t)} − x*)) dγ.
According to the smoothness and strong convexity of f,
mI ⪯ H ⪯ MI. (2.1)
We can upper bound ‖∇f(x^{(t)})‖²,
‖∇f(x^{(t)})‖² = ⟨H(x^{(t)} − x*), H(x^{(t)} − x*)⟩ ≤ M⟨x^{(t)} − x*, H(x^{(t)} − x*)⟩.
Therefore,
‖x^{(t+1)} − x*‖² ≤ ‖x^{(t)} − x*‖² − (2η − η²M)⟨x^{(t)} − x*, H(x^{(t)} − x*)⟩
≤ ‖x^{(t)} − x*‖² − (2η − η²M) m ‖x^{(t)} − x*‖²
= ‖x^{(t)} − x*‖² − (m/M) ‖x^{(t)} − x*‖²
= (1 − m/M) ‖x^{(t)} − x*‖²,
where the last two equalities hold by setting η = 1/M. ∎
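As a quick numerical illustration of the theorem (a sketch with arbitrary made-up constants), consider a quadratic f(x) = ½ x⊤Ax with the spectrum of A inside [m, M]; one gradient step with η = 1/M contracts the squared distance to x* = 0 by at least the factor (1 − m/M):

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic f(x) = 0.5 * x^T A x, with spectrum(A) in [m, M] = [1, 10].
d, m, M = 5, 1.0, 10.0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(np.linspace(m, M, d)) @ Q.T   # m-strongly convex, M-smooth
x_star = np.zeros(d)                          # argmin of f

x = rng.standard_normal(d)
eta = 1.0 / M
x_next = x - eta * (A @ x)                    # gradient step: grad f(x) = A x

lhs = np.linalg.norm(x_next - x_star) ** 2
rhs = (1 - m / M) * np.linalg.norm(x - x_star) ** 2
assert lhs <= rhs + 1e-12                     # contraction from Theorem 2.5.1
```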
2.6 Tensor Decomposition
We will use tensor decomposition for initialization in most problems discussed in this thesis. This section introduces some preliminaries on tensors and tensor decomposition guarantees.
Lemma 2.6.1 (Some properties of third-order tensors). If T ∈ R^{d×d×d} is a supersymmetric tensor, i.e., T_{ijk} is invariant under any permutation of the indices, then its operator norm is defined as
‖T‖_op := sup_{‖a‖=1} |T(a, a, a)|,
and we have the following properties for ‖T‖_op, where T_{(1)} denotes the mode-1 unfolding of T and K its rank.
Property 1. ‖T‖_op = sup_{‖a‖=‖b‖=‖c‖=1} |T(a, b, c)|.
Property 2. ‖T‖_op ≤ ‖T_{(1)}‖ ≤ √K ‖T‖_op.
Property 3. If T is a rank-one tensor, then ‖T_{(1)}‖ = ‖T‖_op.
Property 4. For any matrix W ∈ R^{d×d′}, ‖T(W, W, W)‖_op ≤ ‖T‖_op ‖W‖³.
Proof. Property 1. See the proof of Lemma 21 in [67].
Property 2.
‖T_{(1)}‖ = max_{‖a‖=1} ‖T(a, I, I)‖_F ≤ max_{‖a‖=1} √K ‖T(a, I, I)‖ = max_{‖a‖=‖b‖=1} √K |T(a, b, b)| = √K ‖T‖_op.
For the lower bound, clearly max_{‖a‖=1} ‖T(a, I, I)‖_F ≥ max_{‖a‖=1} ‖T(a, I, I)‖ ≥ ‖T‖_op.
Property 3. Let T = v ⊗ v ⊗ v. Then
‖T_{(1)}‖ = max_{‖a‖=1} ‖T(a, I, I)‖_F = max_{‖a‖=1} |v⊤a| ‖v‖² = ‖v‖³ = max_{‖a‖=1} |(v⊤a)³| = ‖T‖_op.
Property 4. There exists a u ∈ R^{d′} with ‖u‖ = 1 such that
‖T(W, W, W)‖_op = |T(Wu, Wu, Wu)| ≤ ‖T‖_op ‖Wu‖³ ≤ ‖T‖_op ‖W‖³. ∎
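These identities are easy to sanity-check numerically. The sketch below (illustration only, with made-up sizes) builds a rank-one tensor T = v ⊗ v ⊗ v and verifies Property 3, i.e., that the spectral norm of its mode-1 unfolding equals ‖v‖³:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
v = rng.standard_normal(d)

# Rank-one supersymmetric tensor T = v (x) v (x) v.
T = np.einsum("i,j,k->ijk", v, v, v)

# Mode-1 unfolding: a d x d^2 matrix whose spectral norm is ||T_(1)||.
T1 = T.reshape(d, d * d)
spec = np.linalg.norm(T1, 2)

# Property 3: ||T_(1)|| = ||v||^3 = ||T||_op for a rank-one tensor.
assert np.isclose(spec, np.linalg.norm(v) ** 3)
```

The unfolding here is rank one (it equals v · vec(vv⊤)⊤), which is why its spectral norm factors as ‖v‖ · ‖vv⊤‖_F = ‖v‖³.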
Theorem 2.6.1 (Non-orthogonal tensor decomposition, Theorem 3 in [78]). Let T = Σ_{i=1}^k π_i u_i^{⊗3} ∈ R^{d×d×d}, where the u_i's are unit vectors. Let T̂ = T + εR, where ε > 0 and R ∈ R^{d×d×d} with ‖R‖_op = 1. Let w_1, ⋯, w_L be i.i.d. random Gaussian vectors, w_l ∼ N(0, I_d) for all l ∈ [L], and let the matrices M_l ∈ R^{d×d} be constructed via projection of T̂ along w_1, ⋯, w_L, i.e., M_l = T̂(I, I, w_l). Assume incoherence μ on (u_i): u_i⊤u_j ≤ μ for i ≠ j. Let L ≥ (50/(1 − μ²)) log(15d(k − 1)/δ)². Let {û_i}_{i∈[L]} be the set of output vectors of Algorithm 1 in [78]. Then, with probability at least 1 − δ, for every u_i there exists a û_i such that
‖û_i − u_i‖ ≤ O( (√(‖π‖₁ π_max)/π_min²) · (1/((1 − μ²) σ_k(U))) · (1 + C(δ)) ) ε + o(ε),
where U = [u_1 u_2 ⋯ u_k] ∈ R^{d×k}, π_min = min_i π_i, π_max = max_i π_i, and C(δ) := O(log(kd/δ)/√(dL)).
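Algorithm 1 of [78] handles non-orthogonal components; a minimal sketch of the simpler primitive it builds on, the tensor power method of [4] for an orthogonally decomposable tensor, is given below. All problem sizes, restart counts, and tolerances here are made up for illustration:

```python
import numpy as np

def tensor_power_method(T, k, n_iters=60, n_restarts=10, rng=None):
    """Recover {(pi_i, u_i)} from T = sum_i pi_i u_i^{(x)3} with orthonormal u_i.

    Uses the fixed-point iteration a <- T(I, a, a)/||T(I, a, a)|| plus
    deflation; for orthogonal components this converges to a component.
    """
    rng = rng or np.random.default_rng(0)
    d = T.shape[0]
    T = T.copy()
    comps, weights = [], []
    for _ in range(k):
        best_a, best_lam = None, -np.inf
        for _ in range(n_restarts):
            a = rng.standard_normal(d)
            a /= np.linalg.norm(a)
            for _ in range(n_iters):
                a = np.einsum("ijk,j,k->i", T, a, a)   # T(I, a, a)
                a /= np.linalg.norm(a)
            lam = np.einsum("ijk,i,j,k->", T, a, a, a)  # T(a, a, a)
            if lam > best_lam:
                best_a, best_lam = a, lam
        comps.append(best_a)
        weights.append(best_lam)
        # Deflate the recovered rank-one component.
        T -= best_lam * np.einsum("i,j,k->ijk", best_a, best_a, best_a)
    return np.array(weights), np.array(comps)

# Test tensor with k = 3 orthonormal components in R^5 and weights 3, 2, 1.
rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.standard_normal((5, 3)))
pis = np.array([3.0, 2.0, 1.0])
T = sum(p * np.einsum("i,j,k->ijk", u, u, u) for p, u in zip(pis, U.T))

w_hat, U_hat = tensor_power_method(T, 3)
# Each true component is recovered (up to ordering and sign).
for u in U.T:
    assert min(np.linalg.norm(U_hat - u, axis=1).min(),
               np.linalg.norm(U_hat + u, axis=1).min()) < 1e-6
```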
Chapter 3
Mixed Linear Regression1
In this chapter, we study the mixed linear regression (MLR) problem, where
the goal is to recover multiple underlying linear models from their unlabeled linear
measurements. We propose a non-convex objective function which we show is
locally strongly convex in the neighborhood of the ground truth. We use a tensor
method for initialization so that the initial models are in the local strong convexity
region. We then employ general convex optimization algorithms to minimize the
objective function. To the best of our knowledge, our approach provides the first exact recovery guarantees for the MLR problem with K ≥ 2 components. Moreover, our method has near-optimal computational complexity O(Nd) as well as near-optimal sample complexity O(d) for constant K. Furthermore, our empirical results indicate that even with random initialization, our approach converges to the global optimum in linear time, providing a speed-up of up to two orders of magnitude.
1 The content of this chapter is published as Mixed linear regression with multiple components, Kai Zhong, Prateek Jain, and Inderjit S. Dhillon, in Advances in Neural Information Processing Systems, 2016. The dissertator's contribution includes deriving the detailed theoretical analysis, conducting the numerical experiments and writing most parts of the paper.
3.1 Introduction to Mixed Linear Regression
The mixed linear regression (MLR) [26, 28, 137] models each observation
as being generated from one of the K unknown linear models; the identity of the
generating model for each data point is also unknown. MLR is a popular technique
for capturing non-linear measurements while still keeping the models simple and
computationally efficient. Several widely-used variants of linear regression, such
as piecewise linear regression [43, 130] and locally linear regression [27], can be
viewed as special cases of MLR. MLR has also been applied in time-series analysis
[25], trajectory clustering [45], health care analysis [37] and phase retrieval [13].
See [129] for more applications.
In general, MLR is NP-hard [137] with the hardness arising due to lack of
information about the model labels (model from which a point is generated) as well
as the model parameters. However, under certain statistical assumptions, several
recent works have provided poly-time algorithms for solving MLR [4, 13, 28, 137].
But most of the existing recovery guarantees are restricted either to mixtures with K = 2 components [13, 28, 137] or require poly(1/ε) samples/time to achieve an ε-approximate solution [26, 108] (the analysis of [137] for two components can obtain an ε-approximate solution with log(1/ε) samples). Hence, solving the MLR problem with K ≥ 2 mixtures while using a near-optimal number of samples and near-optimal computational time is still an open question.
In this section, we resolve the above question under standard statistical assumptions for a constant number of mixture components K. To this end, we propose the following smooth objective function as a surrogate to solve MLR:
f(w_1, w_2, ⋯, w_K) := Σ_{i=1}^N Π_{k=1}^K (y_i − x_i⊤ w_k)², (3.1)
where {(x_i, y_i) ∈ R^{d+1}}_{i=1,2,⋯,N} are the data points and {w_k}_{k=1,2,⋯,K} are the model parameters. The intuition for this objective is that the objective value is zero when {w_k}_{k=1,2,⋯,K} is the global optimum and the y's do not contain any noise. Furthermore, the objective function is smooth and hence less prone to getting stuck at arbitrary saddle points or oscillating between two points. The standard EM algorithm [137] instead makes a "sharp" selection of the mixture component, and hence is more likely to oscillate or get stuck. This intuition is reflected in Figure 3.1(d), which shows that with random initialization the EM algorithm routinely gets stuck at poor solutions, while our proposed method based on the above objective still converges to the global optimum.
Unfortunately, the above objective function is non-convex and is in general prone to poor saddle points and local minima. However, under certain standard assumptions, we show that the objective is locally strongly convex (Theorem 3.3.1) in a small basin of attraction near the optimal solution. Moreover, the objective function is smooth. Hence, we can use the gradient descent method to achieve a linear rate of convergence to the global optimum. But we will need to initialize the optimization algorithm with an iterate that lies in a small ball around the optimum. To this end, we modify the tensor method in [4, 26] to obtain a "good" initialization point. Typically, tensor methods require computation of third and higher order moments, which leads to significantly worse sample complexity in terms of the data dimensionality d. However, for the special case of MLR, we provide a small modification of the standard tensor method that achieves nearly optimal sample and time complexity bounds for constant K (see Theorem 3.4.2). More concretely, our approach requires O(d(K log d)^K) samples and O(Nd) computational time; note the exponential dependence on K. Also, for constant K, the method has nearly optimal sample and time complexity.
Although EM with the power method [13] shares the same computational complexity as ours, there is, to the best of our knowledge, no convergence guarantee for EM. In contrast, we provide a local convergence guarantee for our method. That is, if N = O(rK^K) and the data satisfies certain standard assumptions, then starting from an initial point {U_k}_{k=1,⋯,K} that lies in a small ball of constant radius around the globally optimal solution, our method converges super-linearly to the globally optimal solution. Unfortunately, our existing analyses do not provide a global convergence guarantee, and we leave it as a topic for future work. Interestingly, our empirical results indicate that even with randomly initialized {U_k}_{k=1,⋯,K}, our method is able to recover the true subspace exactly using nearly O(rK) samples.
We summarize our contributions below. We propose a non-convex continuous objective function for solving the mixed linear regression problem. To the best of our knowledge, our algorithm is the first that can handle K ≥ 2 components with a global convergence guarantee in the noiseless case (Theorem 3.5.1). Our algorithm has near-optimal linear (in d) sample complexity and near-optimal computational complexity; however, the dependence of our sample complexity on K is exponential.
3.2 Problem Formulation
In this section, we assume the dataset {(x_i, y_i) ∈ R^{d+1}}_{i=1,2,⋯,N} is generated by
z_i ∼ multinomial(p), x_i ∼ N(0, I_d), y_i = x_i⊤ w*_{z_i}, (3.2)
where p is the vector of proportions of the different components, satisfying p⊤1 = 1, and {w*_k ∈ R^d}_{k=1,2,⋯,K} are the ground-truth parameters. The goal is to recover {w*_k}_{k=1,2,⋯,K} from the dataset. Our analysis is based on the noiseless case, but we also illustrate the empirical performance of our algorithm in the noisy case, where y_i = x_i⊤ w*_{z_i} + e_i for some noise e_i (see Figure 3.1).
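Model (3.2) is straightforward to simulate. The sketch below (illustrative, with made-up sizes) draws a component label for each point, then a Gaussian covariate, and produces a noiseless response:

```python
import numpy as np

def generate_mlr(N, d, W_star, p, rng):
    """Sample {(x_i, y_i)} from model (3.2): z ~ multinomial(p),
    x ~ N(0, I_d), y = x^T w*_z (noiseless)."""
    K = W_star.shape[0]
    z = rng.choice(K, size=N, p=p)           # hidden component labels
    X = rng.standard_normal((N, d))          # Gaussian covariates
    y = np.einsum("ij,ij->i", X, W_star[z])  # y_i = x_i^T w*_{z_i}
    return X, y, z

rng = np.random.default_rng(3)
K, d, N = 3, 10, 1000
W_star = rng.standard_normal((K, d))
p = np.full(K, 1.0 / K)

X, y, z = generate_mlr(N, d, W_star, p, rng)
# Sanity: each response matches its generating model exactly.
assert np.allclose(y, np.sum(X * W_star[z], axis=1))
```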
Notation for this section. We use [N] to denote the set {1, 2, ⋯, N} and S_k ⊂ [N] to denote the index set of the samples that come from the k-th component. Define p_min := min_{k∈[K]} p_k and p_max := max_{k∈[K]} p_k. Define Δw_j := w_j − w*_j and Δw*_{jk} := w*_j − w*_k. Define Δ_min := min_{j≠k} ‖Δw*_{jk}‖ and Δ_max := max_{j≠k} ‖Δw*_{jk}‖. We assume Δ_min/Δ_max is independent of the dimension d. Define w := [w_1; w_2; ⋯; w_K] ∈ R^{Kd}. We denote by w^{(t)} the parameters at the t-th iteration and by w^{(0)} the initial parameters. For simplicity, we assume there are p_k N samples from the k-th model in any random subset of N samples. We assume K is a constant in general. However, if some quantity depends on K^K, we will present it explicitly in the big-O notation. For simplicity, we keep only the higher-order terms of K and ignore lower-order terms; e.g., O((2K)^{2K}) may be replaced by O(K^K).
3.3 Local Strong Convexity
In this section, we analyze the Hessian of objective (3.1).
Theorem 3.3.1 (Local Strong Convexity). Let {x_i, y_i}_{i=1,2,⋯,N} be sampled from the MLR model (3.2). Let {w_k}_{k=1,2,⋯,K} be independent of the samples and lie in the neighborhood of the optimal solution, i.e.,
‖Δw_k‖ := ‖w_k − w*_k‖ ≤ c_m Δ_min, ∀k ∈ [K], (3.3)
where c_m = O(p_min (3K)^{−K} (Δ_min/Δ_max)^{2K−2}), Δ_min = min_{j≠k} ‖w*_j − w*_k‖ and Δ_max = max_{j≠k} ‖w*_j − w*_k‖. Let P ≥ 1 be a constant. Then if N ≥ O((PK)^K d log^{K+2}(d)), w.p. 1 − O(Kd^{−P}), we have
(1/8) p_min N Δ_min^{2K−2} I ⪯ ∇²f(w + δw) ⪯ 10 N (3K)^K Δ_max^{2K−2} I, (3.4)
for any δw := [δw_1; δw_2; ⋯; δw_K] satisfying ‖δw_k‖ ≤ c_f Δ_min, where c_f = O(p_min (3K)^{−K} d^{−K+1} (Δ_min/Δ_max)^{2K−2}).
The above theorem shows that the Hessians in a small neighborhood around a fixed {w_k}_{k=1,2,⋯,K}, which is close enough to the optimum, are positive definite (PD). The conditions on {w_k}_{k=1,⋯,K} and {δw_k}_{k=1,⋯,K} are different: {w_k}_{k=1,⋯,K} are required to be independent of the samples and to lie in a ball of radius c_m Δ_min centered at the optimal solution. On the other hand, {δw_k}_{k=1,2,⋯,K} may depend on the samples but are required to lie in a smaller ball of radius c_f Δ_min. The conditions are natural: if Δ_min is very small, then distinguishing between w*_k and w*_{k′} is not possible, and hence the Hessian will not be PD w.r.t. both components.
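For small instances the claim is easy to probe numerically. The sketch below (illustration only, with made-up sizes) builds the K = 2 Hessian of objective (3.1) in closed form at the ground truth and checks that it is positive definite; note that the cross block vanishes there because one of the two residuals is zero at every sample:

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 5, 2000
W = rng.standard_normal((2, d))               # ground-truth w*_1, w*_2
z = rng.integers(0, 2, size=N)
X = rng.standard_normal((N, d))
y = np.sum(X * W[z], axis=1)                  # noiseless MLR responses

r1 = y - X @ W[0]                             # zero on points with z == 0
r2 = y - X @ W[1]                             # zero on points with z == 1

# Hessian blocks of f(w1, w2) = sum_i r1_i^2 r2_i^2 at the ground truth.
H11 = 2 * (X * (r2**2)[:, None]).T @ X
H22 = 2 * (X * (r1**2)[:, None]).T @ X
H12 = 4 * (X * (r1 * r2)[:, None]).T @ X      # exactly zero at the truth
H = np.block([[H11, H12], [H12, H22]])

assert np.linalg.eigvalsh(H).min() > 0        # the Hessian is PD here
```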
To prove the theorem, we decompose the Hessian of Eq. (3.1) into multiple blocks, (∇²f)_{jl} = ∂²f/(∂w_j ∂w_l) ∈ R^{d×d}. When w_k → w*_k for all k ∈ [K], the diagonal blocks of the Hessian become strictly positive definite, while the off-diagonal blocks become close to zero. The blocks are approximated from the samples using the matrix Bernstein inequality. The detailed proof can be found in Appendix A.1.2.
Traditional analysis of optimization methods on strongly convex functions, such as gradient descent, requires that the Hessian be PD at all iterates. Theorem 3.3.1 implies that when w_k = w*_k for all k = 1, 2, ⋯, K, a small basin of attraction around the optimum is strongly convex, as formally stated in the following corollary.
Corollary 3.3.1 (Local Strong Convexity for all the Parameters). Let {x_i, y_i}_{i=1,2,⋯,N} be sampled from the MLR model (3.2). Let {w_k}_{k=1,2,⋯,K} lie in the neighborhood of the optimal solution, i.e.,
‖w_k − w*_k‖ ≤ c_f Δ_min, ∀k ∈ [K], (3.5)
where c_f = O(p_min (3K)^{−K} d^{−K+1} (Δ_min/Δ_max)^{2K−2}). Then, for any constant P ≥ 1, if N ≥ O((PK)^K d log^{K+2}(d)), w.p. 1 − O(Kd^{−P}), the objective function f(w_1, w_2, ⋯, w_K) in Eq. (3.1) is strongly convex. In particular, w.p. 1 − O(Kd^{−P}), for all w satisfying Eq. (3.5),
(1/8) p_min N Δ_min^{2K−2} I ⪯ ∇²f(w) ⪯ 10 N (3K)^K Δ_max^{2K−2} I. (3.6)
The strong convexity of Corollary 3.3.1 only holds in a basin of attraction near the optimum whose diameter is of order O(d^{−K+1}), which is too small to be reached by our initialization method (Sec. 3.4) using O(d) samples. Next, we show by a simple construction that the linear convergence of gradient descent (GD) with resampling is still guaranteed when the solution is initialized in a much larger neighborhood.
Theorem 3.3.2 (Convergence of Gradient Descent). Let {x_i, y_i}_{i=1,2,⋯,N} be sampled from the MLR model (3.2). Let {w_k}_{k=1,2,⋯,K} be independent of the samples and lie in the neighborhood of the optimal solution defined in Eq. (3.3). One iteration of gradient descent can be described as w⁺ = w − η∇f(w), where η = 1/(10N(3K)^K Δ_max^{2K−2}). Then, if N ≥ O(K^K d log^{K+2}(d)), w.p. 1 − O(Kd^{−2}),
‖w⁺ − w*‖² ≤ (1 − p_min Δ_min^{2K−2} / (80 (3K)^K Δ_max^{2K−2})) ‖w − w*‖². (3.7)
Remark. The linear convergence in Eq. (3.7) requires resampling of the data points at each iteration. In Sec. 3.5, we combine it with Corollary 3.3.1, which does not require resampling once the iterate is sufficiently close to the optimum, to show that there exists an algorithm using a finite number of samples that achieves any solution precision.
To prove Theorem 3.3.2, we establish the PD property on the line between the current iterate and the optimum by constructing a set of anchor points, and then apply the traditional analysis for the linear convergence of gradient descent. The detailed proof can be found in Appendix A.1.3.
3.4 Initialization via Tensor Method
In this section, we propose a tensor method to initialize the parameters. We define the second-order moment M_2 := E[y²(x ⊗ x − I)] and the third-order moment
M_3 := E[y³ x ⊗ x ⊗ x] − Σ_{j∈[d]} E[y³ (e_j ⊗ x ⊗ e_j + e_j ⊗ e_j ⊗ x + x ⊗ e_j ⊗ e_j)].
According to Lemma 6 in [108], M_2 = Σ_{k∈[K]} 2 p_k w*_k ⊗ w*_k and M_3 = Σ_{k∈[K]} 6 p_k w*_k ⊗ w*_k ⊗ w*_k. Therefore, by computing the eigendecomposition of the estimated moments, we are able to recover the parameters to any precision provided enough samples. Theorem 8 of [108] needs O(d³) samples to obtain the model parameters to a certain precision; this high sample complexity comes from the tensor concentration bound. However, we find that the tensor eigendecomposition problem in MLR can be reduced to the space R^{K×K×K}, so that the associated sample and computational complexity are O(poly(K)). Our method is similar to the whitening procedure in [26, 67]. However, [26] needs O(d⁶) samples due to its nuclear-norm minimization problem, while ours requires only O(d). For this sample complexity, we need to assume the following.
Assumption 3.4.1. The quantities σ_K(M_2), ‖M_2‖, ‖M_3‖_op^{2/3}, Σ_{k∈[K]} p_k ‖w*_k‖², and (Σ_{k∈[K]} p_k ‖w*_k‖³)^{2/3} have the same order in d, i.e., the ratio between any two of them is independent of d.
The above assumption holds, for example, when {w*_k}_{k=1,2,⋯,K} are orthonormal to each other.
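The moment identity M_2 = Σ_{k∈[K]} 2 p_k w*_k ⊗ w*_k used above can be sanity-checked by Monte Carlo simulation. The sketch below (illustrative sizes and tolerances) compares the empirical moment against the closed form:

```python
import numpy as np

rng = np.random.default_rng(5)
K, d, N = 2, 4, 400_000
W = rng.standard_normal((K, d))
p = np.array([0.4, 0.6])

z = rng.choice(K, size=N, p=p)
X = rng.standard_normal((N, d))
y = np.sum(X * W[z], axis=1)             # noiseless MLR responses

# Empirical M2_hat = (1/N) sum_i y_i^2 (x_i x_i^T - I).
M2_hat = (X * (y**2)[:, None]).T @ X / N - np.mean(y**2) * np.eye(d)
# Population M2 = sum_k 2 p_k w*_k w*_k^T  (Lemma 6 in [108]).
M2 = sum(2 * pk * np.outer(w, w) for pk, w in zip(p, W))

rel_err = np.linalg.norm(M2_hat - M2) / np.linalg.norm(M2)
assert rel_err < 0.1                     # Monte Carlo agreement
```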
We formally present the tensor method in Algorithm 3.4.1 and its theoretical guarantee in Theorem 3.4.2.
Theorem 3.4.2. Under Assumption 3.4.1, if |Ω| ≥ O(d log²(d) + log⁴(d)), then w.p. 1 − O(d^{−2}), Algorithm 3.4.1 outputs {w_k^{(0)}}_{k=1}^K satisfying
‖w_k^{(0)} − w*_k‖ ≤ c_m Δ_min, ∀k ∈ [K],
which falls in the locally PD region, Eq. (3.3), of Theorem 3.3.1.
Algorithm 3.4.1 Initialization for MLR via Tensor Method
Input: {x_i, y_i}_{i∈Ω}. Output: {w_k^{(0)}}_{k=1}^K.
1: Partition the dataset Ω into Ω = Ω_{M2} ∪ Ω_2 ∪ Ω_3 with |Ω_{M2}| = O(d log²(d)), |Ω_2| = O(d log²(d)) and |Ω_3| = O(log⁴(d)).
2: Compute the approximate top-K eigenvectors, Y ∈ R^{d×K}, of the second-order moment M_2 := (1/|Ω_{M2}|) Σ_{i∈Ω_{M2}} y_i² (x_i ⊗ x_i − I), by the power method.
3: Compute R_2 = (1/(2|Ω_2|)) Σ_{i∈Ω_2} y_i² (Y⊤x_i ⊗ Y⊤x_i − I).
4: Compute the whitening matrix W ∈ R^{K×K} of R_2, i.e., W = U_2 Λ_2^{−1/2} U_2⊤, where R_2 = U_2 Λ_2 U_2⊤ is the eigendecomposition of R_2.
5: Compute R_3 = (1/(6|Ω_3|)) Σ_{i∈Ω_3} y_i³ (r_i ⊗ r_i ⊗ r_i − Σ_{j∈[K]} e_j ⊗ r_i ⊗ e_j − Σ_{j∈[K]} e_j ⊗ e_j ⊗ r_i − Σ_{j∈[K]} r_i ⊗ e_j ⊗ e_j), where r_i = Y⊤x_i for all i ∈ Ω_3.
6: Compute the eigenvalues {a_k}_{k=1}^K and the eigenvectors {v_k}_{k=1}^K of the whitened tensor R_3(W, W, W) ∈ R^{K×K×K} by using the robust tensor power method [4].
7: Return the estimates of the models, w_k^{(0)} = Y (W⊤)† (a_k v_k).
The proof can be found in Appendix A.2.2. Forming M_2 explicitly would cost O(Nd²) time, which is expensive when d is large. Instead, we can compute each step of the power method without explicitly forming M_2. In particular, we alternately compute
Y^{(t+1)} = Σ_{i∈Ω_{M2}} y_i² (x_i (x_i⊤ Y^{(t)}) − Y^{(t)})
and let Y^{(t+1)} = QR(Y^{(t+1)}). Now each power method iteration needs only O(KNd) time. Furthermore, the number of iterations needed is a constant, since the power method has a linear convergence rate and we do not need a very accurate solution; for the proof of this claim, we refer to the proof of Lemma A.2.3 in Appendix A.2. Next we compute R_2 in O(KNd) time and W in O(K³) time. Computing R_3 takes O(KNd + K³N) time, and the robust tensor power method takes O(poly(K) polylog(d)) time. In summary, the computational complexity of the initialization is O(KdN + K³N + poly(K) polylog(d)) = O(dN).
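A sketch of this implicit power iteration is below (illustrative sizes; the sample set here plays the role of ΩM2). It touches M_2 only through matrix-vector products, so each iteration costs O(KNd) rather than the O(Nd²) needed to form M_2:

```python
import numpy as np

def implicit_power_method(X, y, K, n_iters=150, rng=None):
    """Top-K eigenvectors of M2_hat = (1/N) sum_i y_i^2 (x_i x_i^T - I),
    computed without ever forming the d x d matrix M2_hat."""
    rng = rng or np.random.default_rng(0)
    N, d = X.shape
    Y = np.linalg.qr(rng.standard_normal((d, K)))[0]
    w2 = y ** 2
    for _ in range(n_iters):
        # M2_hat @ Y = (1/N) * (X^T (w2 * (X Y)) - sum(w2) * Y)
        Z = (X.T @ (w2[:, None] * (X @ Y)) - w2.sum() * Y) / N
        Y, _ = np.linalg.qr(Z)               # re-orthonormalize
    return Y

rng = np.random.default_rng(6)
K, d, N = 3, 10, 200_000
Q, _ = np.linalg.qr(rng.standard_normal((d, K)))
W = (Q @ np.diag([16.0, 12.0, 8.0])).T       # well-separated ground truth
z = rng.integers(0, K, size=N)
X = rng.standard_normal((N, d))
y = np.sum(X * W[z], axis=1)

Y = implicit_power_method(X, y, K, rng=rng)

# Check against the top-K eigenvectors of the explicitly formed matrix.
M2_hat = (X * (y**2)[:, None]).T @ X / N - np.mean(y**2) * np.eye(d)
vals, vecs = np.linalg.eigh(M2_hat)
V = vecs[:, np.argsort(vals)[-K:]]
assert np.linalg.norm(Y @ Y.T - V @ V.T) < 1e-6   # same K-dim subspace
```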
3.5 Recovery Guarantee
We are now ready to present the complete algorithm, Algorithm 3.5.1, which has a global convergence guarantee. We use f_Ω(w) to denote the objective function of Eq. (3.1) generated from a subset Ω of the dataset, i.e., f_Ω(w) = Σ_{i∈Ω} Π_{k=1}^K (y_i − x_i⊤ w_k)².
Theorem 3.5.1 (Global Convergence Guarantee). Let {x_i, y_i}_{i=1,2,⋯,N} be sampled from the MLR model (3.2) with N ≥ O(d(K log(d))^{2K+3}). Let the step size η be smaller than a positive constant. Then given any precision ε > 0, after T = O(log(d/ε)) iterations, w.p. 1 − O(Kd^{−2} log(d)), the output of Algorithm 3.5.1 satisfies
‖w^{(T)} − w*‖ ≤ ε Δ_min.
The detailed proof is in Appendix A.2.3. The computational complexity required by our algorithm is near-optimal: (a) the tensor method (Algorithm 3.4.1) is carefully employed so that only O(dN) computation is needed; (b) gradient descent with resampling is run for log(d) iterations to push the iterate into the next phase; (c) gradient descent without resampling is finally executed to achieve any precision in log(1/ε) iterations. Therefore the total computational complexity is O(dN log(d/ε)). As shown in the theorem, our algorithm can achieve any precision ε > 0 without any dependence of the sample complexity on ε. This follows from Corollary 3.3.1, which shows local strong convexity of objective (3.1) with a fixed set of samples. By contrast, the tensor method [26, 108] requires O(1/ε²) samples and the EM algorithm requires O(log(1/ε)) samples [13, 137].
Algorithm 3.5.1 Gradient Descent for MLR
Input: {x_i, y_i}_{i=1,2,⋯,N}, step size η. Output: w.
1: Partition the dataset into {Ω^{(t)}}_{t=0,1,⋯,T_0+1}
2: Initialize w^{(0)} by Algorithm 3.4.1 with Ω^{(0)}
3: for t = 1, 2, ⋯, T_0 do
4:   w^{(t)} = w^{(t−1)} − η∇f_{Ω^{(t)}}(w^{(t−1)})
5: for t = T_0 + 1, T_0 + 2, ⋯, T_0 + T_1 do
6:   w^{(t)} = w^{(t−1)} − η∇f_{Ω^{(T_0+1)}}(w^{(t−1)})
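The gradient steps in lines 4 and 6 are cheap to implement: for objective (3.1), ∇_{w_k} f = −2 Σ_i [Π_{j≠k} (y_i − x_i⊤w_j)²] (y_i − x_i⊤w_k) x_i. The sketch below is an illustration only; it uses made-up sizes and a backtracking line search in place of the theorem's fixed step size, and it checks only that the objective decreases:

```python
import numpy as np

def mlr_objective(W, X, y):
    R = y[:, None] - X @ W.T            # residuals r_{ik} = y_i - x_i^T w_k
    return np.prod(R**2, axis=1).sum()

def mlr_gradient(W, X, y):
    K = W.shape[0]
    R = y[:, None] - X @ W.T
    R2 = R**2
    G = np.empty_like(W)
    for k in range(K):
        others = np.prod(np.delete(R2, k, axis=1), axis=1)
        G[k] = -2 * X.T @ (others * R[:, k])
    return G

rng = np.random.default_rng(7)
K, d, N = 2, 10, 2000
W_star = rng.standard_normal((K, d))
z = rng.integers(0, K, size=N)
X = rng.standard_normal((N, d))
y = np.sum(X * W_star[z], axis=1)

W = rng.standard_normal((K, d))         # random initialization
f0 = mlr_objective(W, X, y)
f = f0
for _ in range(200):
    G = mlr_gradient(W, X, y)
    eta = 1e-3
    W_new = W - eta * G
    while mlr_objective(W_new, X, y) > f:   # backtracking line search
        eta *= 0.5
        W_new = W - eta * G
    W = W_new
    f = mlr_objective(W, X, y)

assert f < f0                           # monotone decrease of the objective
```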
3.6 Numerical Experiments
In this section, we use synthetic data to illustrate the properties of our algorithm that minimizes Eq. (3.1), which we call LOSCO (LOcally Strongly Convex Objective). We generate data points and parameters from the standard normal distribution. We set K = 3 and p_k = 1/3 for all k ∈ [K]. The error is defined as
ε^{(t)} = min_{π∈Perm([K])} max_{k∈[K]} ‖w^{(t)}_{π(k)} − w*_k‖ / ‖w*_k‖,
where Perm([K]) is the set of all permutation functions on the set [K]. The errors reported in this section are averaged over 10 trials. In our experiments, we find no difference between resampling and not resampling; hence, for simplicity, we use the original dataset for all stages. We set both parameters of the robust tensor power method (denoted N and L in Algorithm 1 of [4]) to 100. The experiments are conducted in Matlab. After initialization, we use alternating minimization (i.e., block coordinate descent) to exactly minimize the objective over w_k for k = 1, 2, ⋯, K cyclically.
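The error metric ε^{(t)} can be computed by brute force over all K! permutations, which is fine for the small K used here. A sketch (illustration only):

```python
import itertools
import numpy as np

def mlr_error(W_hat, W_star):
    """Permutation-invariant relative error:
    min over permutations pi of max_k ||W_hat[pi(k)] - W_star[k]|| / ||W_star[k]||."""
    K = W_star.shape[0]
    norms = np.linalg.norm(W_star, axis=1)
    best = np.inf
    for pi in itertools.permutations(range(K)):
        err = max(np.linalg.norm(W_hat[pi[k]] - W_star[k]) / norms[k]
                  for k in range(K))
        best = min(best, err)
    return best

rng = np.random.default_rng(8)
W_star = rng.standard_normal((3, 5))
W_hat = W_star[[2, 0, 1]] + 1e-8      # permuted copy of the ground truth
assert mlr_error(W_hat, W_star) < 1e-6
```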
Fig. 3.1(a) shows the recovery rate for different dimensions and sample sizes. We say a trial is recovered if ε^{(t)} < 10^{−6} for some t < 100. The recovery rate is the proportion of recovered trials out of 10. As shown in the figure, the sample complexity for exact recovery is nearly linear in d. Fig. 3.1(b) shows the behavior of our algorithm in the noisy case. The noise is drawn i.i.d. from e_i ∼ N(0, σ²), and d is fixed at 100. As we can see from the figure, the solution error is almost proportional to the noise deviation. Comparing different N's, the solution error decreases as N increases, so the estimator appears consistent in the presence of unbiased noise. We also illustrate the performance of our tensor initialization method in Fig. 3.1(c), which shows that to achieve an initial error ε^{(0)} = c for some constant c < 1, our tensor method only requires N proportional to d. Note that the naive initialization methods, random initialization (using the normal distribution) and all-zero initialization, lead to ε^{(0)} ≈ 1.4 and ε^{(0)} = 1, respectively.
has been shown to be very sensitive to the initialization [137]. The grid search
initialization method proposed in [137] is not feasible here, because it only handles
two components with a same magnitude. Therefore, we use random initialization
and tensor initialization for EM. We compare our method with EM on convergence
speed under different dimensions and different initialization methods. We use exact
alternating minimization (LOSCO-ALT) to optimize our objective (3.1), which has
similar computational complexity as EM. Fig. 3.2(a)(b) shows when both methods
are initialized from tensor method, our method converges slightly faster than EM
in terms of time, and when initialized from random vectors, our method converges
much faster than EM. In the case of (b), EM with random initialization doesn’t
converge to the optima, while our method still converges. In Fig. 3.2(c)(d), we also
show in terms of iterations, our method converges to the optima even more faster
than EM.
[Figure 3.1: Empirical performance of MLR. (a) Sample complexity: recovery rate as a function of d (200 to 1000) and N (up to 2×10⁴). (b) Noisy case: log(ε) versus log(σ) for N = 6000, 60000, 600000. (c) The initialization error: log(ε) as a function of d and N.]
3.7 Related Work
[Figure 3.2: Comparison with EM in terms of time ((a) d = 100, N = 6k; (b) d = 1k, N = 60k) and in terms of iterations ((c) d = 100, N = 6k; (d) d = 1k, N = 60k), for LOSCO-ALT and EM under tensor and random initialization. Our method with random initialization is significantly better than EM with random initialization; performance of the two methods is comparable when initialized with the tensor method.]
The EM algorithm without careful initialization is only guaranteed to have local
convergence [13, 76, 137]. [137] proposed a grid search method for initialization.
However, it is limited to the two-component case and seems non-trivial to extend
to multiple components. It is known that exact minimization for each step of EM
is not scalable due to the O(d2N + d3) complexity. Alternatively, we can use EM
with gradient update, whose local convergence is also guaranteed by [13] but only
in the two-symmetric-component case, i.e., when w2 = −w1.
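For reference, one EM-style iteration of the kind discussed above (hard label assignment followed by per-component least squares) can be sketched as follows; this is an illustration with made-up sizes, initialized near the truth since only local convergence is guaranteed:

```python
import numpy as np

def em_step(W, X, y):
    """One hard-EM iteration: assign each point to the component with the
    smallest residual, then re-fit each component by least squares."""
    R = (y[:, None] - X @ W.T) ** 2
    labels = np.argmin(R, axis=1)
    W_new = W.copy()
    for k in range(W.shape[0]):
        mask = labels == k
        if mask.sum() >= X.shape[1]:          # enough points to re-fit
            W_new[k] = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    return W_new

rng = np.random.default_rng(9)
K, d, N = 2, 10, 2000
W_star = rng.standard_normal((K, d))
z = rng.integers(0, K, size=N)
X = rng.standard_normal((N, d))
y = np.sum(X * W_star[z], axis=1)

W = W_star + 0.05 * rng.standard_normal((K, d))   # init near the truth
err0 = np.linalg.norm(W - W_star)
for _ in range(20):
    W = em_step(W, X, y)
err = np.linalg.norm(W - W_star)
assert err < err0                                  # local improvement
```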
Tensor Methods for MLR were studied by [26, 108]. [108] approximated
the third-order moment directly from samples with Gaussian distribution, while
[26] learned the third-order moment from a low-rank linear regression problem.
Tensor methods can obtain the model parameters to any precision ε but require O(1/ε²) samples/time. Also, tensor methods can handle multiple components but suffer from high sample complexity and high computational complexity. For example, the sample complexities required by [26] and [108] are O(d⁶) and O(d³), respectively. On the other hand, the computational burden mainly comes from operations on tensors: even a very simple tensor evaluation costs at least O(d³). [26] also suffers from the slow nuclear-norm minimization when estimating the second- and third-order moments. In contrast, we use the tensor method only for initialization, i.e., we require ε to be a certain constant. Moreover, with a simple trick, we can ensure that the sample and time complexity of our initialization step is only linear in d and N.
Convex Formulation. Another approach to guaranteeing recovery of the parameters is to relax the non-convex problem to a convex one. [28] proposed a convex formulation of MLR with two components. The authors provide upper bounds on the recovery error in the noisy case and show their algorithm is information-theoretically optimal. However, the convex formulation needs to solve a nuclear-norm objective under linear constraints, which leads to high computational cost. Extending this formulation from two components to multiple components is also not straightforward.
Chapter 4
One-hidden-layer Fully-connected Neural Networks1
In this chapter, we consider regression problems with one-hidden-layer neural networks (1NNs). We distill some properties of activation functions that lead to local strong convexity in the neighborhood of the ground-truth parameters for the 1NN squared-loss objective, and most popular nonlinear activation functions satisfy the distilled properties, including rectified linear units (ReLUs), leaky ReLUs, squared ReLUs and sigmoids. For activation functions that are also smooth, we show local linear convergence guarantees of gradient descent under a resampling rule. For homogeneous activations, we show tensor methods are able to initialize the parameters to fall into the local strong convexity region. As a result, tensor initialization followed by gradient descent is guaranteed to recover the ground truth with sample complexity d · log(1/ε) · poly(k, λ) and computational complexity n · d · poly(k, λ) for smooth homogeneous activations with high probability, where d is the dimension of the input, k (k ≤ d) is the number of hidden nodes, λ is a conditioning property of the ground-truth parameter matrix between the input layer and the hidden layer, ε is the targeted precision and n is the number of samples. To the best of our knowledge, this is the first work that provides recovery guarantees for 1NNs with both sample complexity and computational complexity linear in the input dimension and logarithmic in the precision.
1 The content of this chapter is published as Recovery guarantees for one-hidden-layer neural networks, Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon, in International Conference on Machine Learning, 2017. The dissertator's contribution includes deriving the detailed theoretical analysis, conducting the numerical experiments and writing most parts of the paper.
4.1 Introduction to One-hidden-layer Neural Networks
Neural networks (NNs) have achieved great practical success recently, and many theoretical contributions have been made to understand their extraordinary performance. The remarkable results of NNs on complex tasks in computer vision and natural language processing inspired works on the expressive power of NNs [31, 32, 34, 91, 99, 101, 124]. Indeed, several works found that NNs are very powerful, and that deeper networks are more powerful. However, due to the high non-convexity of NNs, knowing the expressivity of NNs does not guarantee that the targeted functions will be learned. Therefore, several other works focused on the achievability of global optima. Many of them considered the over-parameterized setting, where a global optimum, or a local minimum close to the global optimum, is achieved when the number of parameters is large enough, including [35, 44, 57, 59, 89, 105]. This, however, easily leads to overfitting and cannot provide generalization guarantees, which are the essential goal in most tasks.
A few works have considered generalization performance. For example, [135] provide a generalization bound under the Rademacher generalization analysis framework. Recently, [142] describe experiments showing that NNs are complex enough to actually memorize the training data and yet still generalize well. As they claim, this cannot be explained by applying generalization analysis techniques, such as VC dimension and Rademacher complexity, to the classification loss (although it does not rule out a margins analysis; see, for example, [17]; their experiments involve the unbounded cross-entropy loss).
In this work, we don’t develop a new generalization analysis. Instead we
focus on parameter recovery setting, where we assume there are underlying ground-
truth parameters and we provide recovery guarantees for the ground-truth parame-
ters up to equivalent permutations. Since the parameters are exactly recovered, the
generalization performance will also be guaranteed.
Several other techniques are also provided to recover the parameters or
to guarantee generalization performance, such as tensor methods [72] and kernel
methods [9]. These methods require sample complexity O(d3) or computational
complexity O(n2), which can be intractable in practice.
Recently [110] show that neither specific assumptions on the niceness of the
input distribution or niceness of the target function alone is sufficient to guarantee
learnability using gradient-based methods. In this work, we assume data points
are sampled from Gaussian distribution and the parameters of hidden neurons are
linearly independent.
Our main contributions are as follows.
1. We distill some properties of activation functions, which are satisfied by a wide range of activations, including ReLU, squared ReLU, sigmoid and tanh. With these properties we show positive definiteness (PD) of the Hessian in the neighborhood of the ground-truth parameters given enough samples (Theorem 4.3.1). Further, for activations that are also smooth, we show that local linear convergence is guaranteed using gradient descent.
2. We propose a tensor method to initialize the parameters such that the initialized parameters fall into the local positive definiteness area. Our contribution is that we reduce the sample/computational complexity from cubic dependence on the dimension to linear dependence (Theorem 5.4.1).
3. Combining the above two results, we provide a globally converging algorithm (Algorithm 8.1.1) for smooth homogeneous activations satisfying the distilled properties. The whole procedure requires sample/computational complexity linear in the dimension and logarithmic in the precision (Theorem 5.5.1).
Roadmap. This section is organized as follows. In Section 4.2, we present our problem setting and show three key properties of activations required for our guarantees. In Section 4.3, we introduce the formal theorem of local strong convexity and show local linear convergence for smooth activations. Section 4.4 presents a tensor method to initialize the parameters so that they fall into the basin of the local strong convexity region.
4.2 Problem Formulation
We consider the following regression problem. Given a set of n samples
S = {(x_1, y_1), (x_2, y_2), ⋯, (x_n, y_n)} ⊂ R^d × R,
let D denote an underlying distribution over R^d × R with parameters
{w*_1, w*_2, ⋯, w*_k} ⊂ R^d and {v*_1, v*_2, ⋯, v*_k} ⊂ R,
such that each sample (x, y) ∈ S is drawn i.i.d. from this distribution, with
D : x ∼ N(0, I), y = Σ_{i=1}^k v*_i · φ(w*_i⊤ x), (4.1)
where φ(z) is the activation function and k is the number of nodes in the hidden layer.
The main question we want to answer is: How many samples are sufficient to re-
cover the underlying parameters?
It is well known that training a one-hidden-layer neural network is NP-complete [20]. Thus, without making any assumptions, learning deep neural networks is intractable. Throughout this section, we assume x follows a standard normal distribution; the data is noiseless; the dimension of the input data is at least the number of hidden nodes; and the activation function φ(z) satisfies some reasonable properties.

Our results can be easily extended to a multivariate Gaussian distribution with positive definite covariance and zero mean, since we can first estimate the covariance and then transform the input to a standard normal distribution, at some loss of accuracy. Although this work focuses on the regression problem, classification problems can be transformed to regression problems if a good teacher is provided, as described in [126]. Our analysis requires k to be no greater than d, since the first-layer parameters would otherwise be linearly dependent.
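To make the setting concrete, the generative model in Eq. (4.1) is easy to simulate. The following sketch uses our own choices of activation (squared ReLU), sizes and seed; nothing here beyond Eq. (4.1) itself is fixed by the text:

```python
import numpy as np

def sample_from_D(n, W_star, v_star, phi, rng):
    """Draw n i.i.d. samples (x, y) with x ~ N(0, I_d) and
    y = sum_i v*_i phi(w*_i^T x), as in Eq. (4.1)."""
    d = W_star.shape[0]
    X = rng.standard_normal((n, d))     # each row is one input x
    Y = phi(X @ W_star) @ v_star        # noiseless labels
    return X, Y

rng = np.random.default_rng(0)
d, k, n = 10, 3, 1000
W_star = rng.standard_normal((d, k))         # columns w*_i, linearly independent a.s.
v_star = rng.choice([-1.0, 1.0], size=k)     # homogeneous case: v*_i in {-1, 1}
squared_relu = lambda z: np.maximum(z, 0.0) ** 2
X, Y = sample_from_D(n, W_star, v_star, squared_relu, rng)
```

Gaussian columns are linearly independent with probability one, matching the assumption on the hidden-neuron parameters.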
For the activation function φ(z), we assume it is continuous, and if it is non-smooth we take its first derivative to be the left derivative. Furthermore, we assume it satisfies Property 1, 2, and 3. These properties are critical for the later analyses. We also observe that most activation functions actually satisfy these three properties.
Property 1. The first derivative φ′(z) is nonnegative and homogeneously bounded, i.e., 0 ≤ φ′(z) ≤ L_1 |z|^p for some constants L_1 > 0 and p ≥ 0.
Property 2. Let α_q(σ) = E_{z∼N(0,1)}[φ′(σ·z) z^q] for q ∈ {0, 1, 2}, and β_q(σ) = E_{z∼N(0,1)}[φ′(σ·z)^2 z^q] for q ∈ {0, 2}. Let ρ(σ) denote

min{ β_0(σ) − α_0^2(σ) − α_1^2(σ), β_2(σ) − α_1^2(σ) − α_2^2(σ), α_0(σ)·α_2(σ) − α_1^2(σ) }.

The first derivative φ′(z) satisfies that, for all σ > 0, we have ρ(σ) > 0.
Property 3. The second derivative φ′′(z) is either (a) globally bounded, |φ′′(z)| ≤ L_2 for some constant L_2, i.e., φ(z) is L_2-smooth, or (b) φ′′(z) = 0 except at e points, where e is a finite constant.
Remark 4.2.1. The first two properties concern the first derivative φ′(z), and the last one concerns the second derivative φ′′(z). At a high level, Property 1 requires φ to be non-decreasing with a homogeneously bounded derivative; Property 2 requires φ to be highly non-linear; Property 3 requires φ to be either smooth or piece-wise linear.
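Property 2 is easy to probe numerically. The sketch below Monte Carlo-estimates ρ(σ) for the sigmoid, where the theory predicts ρ > 0, and for the linear activation, where ρ = 0; the sample size and seed are our choices:

```python
import numpy as np

def rho(phi_prime, sigma, n=2_000_000, seed=0):
    """Monte Carlo estimate of rho(sigma) from Property 2."""
    z = np.random.default_rng(seed).standard_normal(n)
    g = phi_prime(sigma * z)
    a0, a1, a2 = g.mean(), (g * z).mean(), (g * z**2).mean()
    b0, b2 = (g**2).mean(), (g**2 * z**2).mean()
    return min(b0 - a0**2 - a1**2, b2 - a1**2 - a2**2, a0 * a2 - a1**2)

sigmoid_prime = lambda z: np.exp(-z) / (1.0 + np.exp(-z)) ** 2
linear_prime = lambda z: np.ones_like(z)

rho_sigmoid = rho(sigmoid_prime, 1.0)   # strictly positive
rho_linear = rho(linear_prime, 1.0)     # ~ 0 up to sampling noise: Property 2 fails for phi(z) = z
```

For φ(z) = z one gets α_0 = β_0 = α_2 = β_2 = 1 and α_1 = 0, so the first two terms of the min vanish exactly, consistent with the rank-deficient Hessian discussed later.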
Theorem 4.2.1. ReLU φ(z) = max{z, 0}, leaky ReLU φ(z) = max{z, 0.01z}, squared ReLU φ(z) = max{z, 0}^2, and any non-linear, non-decreasing smooth function with bounded symmetric φ′(z), such as the sigmoid function φ(z) = 1/(1 + e^{−z}), the tanh function and the erf function φ(z) = ∫_0^z e^{−t^2} dt, satisfy Properties 1, 2 and 3. The linear function φ(z) = z doesn't satisfy Property 2, and the quadratic function φ(z) = z^2 doesn't satisfy Properties 1 and 2.
The proof can be found in Appendix C.2.
4.3 Local Strong Convexity
In this section, we study the Hessian of the empirical risk near the ground truth. We consider the case when v^* is already known. Note that for homogeneous activations, we can assume v_i^* ∈ {−1, 1}, since v·φ(z) = (v/|v|)·φ(|v|^{1/p} z), where p is the degree of homogeneity. As v_i^* only takes discrete values for homogeneous activations, in the next section we show that we can exactly recover v^* using tensor methods with finitely many samples.
For a set of samples S, we define the Empirical Risk,

f_S(W) = (1/(2|S|)) ∑_{(x,y)∈S} ( ∑_{i=1}^k v_i^* φ(w_i^⊤ x) − y )^2.   (4.2)

For a distribution D, we define the Expected Risk,

f_D(W) = (1/2) E_{(x,y)∼D} ( ∑_{i=1}^k v_i^* φ(w_i^⊤ x) − y )^2.   (4.3)
Let's calculate the gradient and the Hessian of f_S(W) and f_D(W). For each j ∈ [k], the partial gradient of f_D(W) with respect to w_j can be represented as

∂f_D(W)/∂w_j = E_{(x,y)∼D}[ ( ∑_{i=1}^k v_i^* φ(w_i^⊤ x) − y ) v_j^* φ′(w_j^⊤ x) x ].
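The empirical risk (4.2) and the empirical analog of this partial gradient (expectation replaced by the sample average) are straightforward to implement; the finite-difference comparison at the end is a standard sanity check. All concrete parameter choices below are ours:

```python
import numpy as np

def empirical_risk(W, v, X, Y, phi):
    """f_S(W) from Eq. (4.2); columns of W are the w_j, rows of X are the x."""
    resid = phi(X @ W) @ v - Y
    return 0.5 * np.mean(resid ** 2)

def empirical_gradient(W, v, X, Y, phi, phi_prime):
    """Average over S of (sum_i v_i phi(w_i.x) - y) * v_j phi'(w_j.x) x, per column j."""
    resid = phi(X @ W) @ v - Y                            # shape (n,)
    return X.T @ (resid[:, None] * phi_prime(X @ W) * v) / X.shape[0]

rng = np.random.default_rng(1)
d, k, n = 6, 2, 500
phi = lambda z: np.maximum(z, 0.0) ** 2                   # squared ReLU (smooth)
phi_p = lambda z: 2.0 * np.maximum(z, 0.0)
W_star = rng.standard_normal((d, k))
v = np.array([1.0, -1.0])
X = rng.standard_normal((n, d))
Y = phi(X @ W_star) @ v
W = W_star + 0.1 * rng.standard_normal((d, k))

G = empirical_gradient(W, v, X, Y, phi, phi_p)
eps = 1e-6
E = np.zeros((d, k)); E[0, 0] = eps
fd = (empirical_risk(W + E, v, X, Y, phi)
      - empirical_risk(W - E, v, X, Y, phi)) / (2 * eps)  # central difference
```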
For each j, l ∈ [k] with j ≠ l, the second partial derivative of f_D(W) for the (j, l)-th off-diagonal block is

∂²f_D(W)/(∂w_j ∂w_l) = E_{(x,y)∼D}[ v_j^* v_l^* φ′(w_j^⊤x) φ′(w_l^⊤x) xx^⊤ ],

and for each j ∈ [k], the second partial derivative of f_D(W) for the j-th diagonal block is

∂²f_D(W)/∂w_j² = E_{(x,y)∼D}[ ( ∑_{i=1}^k v_i^* φ(w_i^⊤x) − y ) v_j^* φ′′(w_j^⊤x) xx^⊤ + ( v_j^* φ′(w_j^⊤x) )^2 xx^⊤ ].
If φ(z) is non-smooth, we use the Dirac delta function and its derivatives to represent φ′′(z). Replacing the expectation E_{(x,y)∼D} by the average over the samples, |S|^{−1} ∑_{(x,y)∈S}, we obtain the Hessian of the empirical risk.
Considering the case when W = W^* ∈ R^{d×k}, for all j, l ∈ [k] we have

∂²f_D(W^*)/(∂w_j ∂w_l) = E_{(x,y)∼D}[ v_j^* v_l^* φ′(w_j^{*⊤}x) φ′(w_l^{*⊤}x) xx^⊤ ].

If Property 3(b) is satisfied, φ′′(z) = 0 almost surely. So in this case the diagonal blocks of the empirical Hessian can be written as

∂²f_S(W)/∂w_j² = (1/|S|) ∑_{(x,y)∈S} ( v_j^* φ′(w_j^⊤x) )^2 xx^⊤.
Now we show that the Hessian of the objective near the global optimum is positive definite.
Definition 4.3.1. Given the ground-truth matrix W^* ∈ R^{d×k}, let σ_i(W^*) denote the i-th singular value of W^*, often abbreviated as σ_i. Let κ = σ_1/σ_k and λ = (∏_{i=1}^k σ_i)/σ_k^k. Let v_max denote max_{i∈[k]} |v_i^*| and v_min denote min_{i∈[k]} |v_i^*|, and let ν = v_max/v_min. Let ρ denote ρ(σ_k), and let τ = (3σ_1/2)^{4p} / min_{σ∈[σ_k/2, 3σ_1/2]} ρ^2(σ).
Theorem 4.3.1 (Informal version of Theorem B.3.1). For any W ∈ R^{d×k} with ‖W − W^*‖ ≤ poly(1/k, 1/λ, 1/ν, ρ/σ_1^{2p}) · ‖W^*‖, let S denote a set of i.i.d. samples from the distribution D (defined in (4.1)) and let the activation function satisfy Properties 1, 2 and 3. Then for any t ≥ 1, if |S| ≥ d · poly(log d, t, k, ν, τ, λ, σ_1^{2p}/ρ), we have with probability at least 1 − d^{−Ω(t)},

Ω(v_min^2 ρ(σ_k)/(κ^2 λ)) I ⪯ ∇²f_S(W) ⪯ O(k v_max^2 σ_1^{2p}) I.
Remark 4.3.1. As we can see from Theorem 4.3.1, ρ(σ_k) from Property 2 plays an important role in the positive definiteness (PD) property. Interestingly, many popular activations, like ReLU, sigmoid and tanh, have ρ(σ_k) > 0, while some simple functions like the linear (φ(z) = z) and square (φ(z) = z^2) functions have ρ(σ_k) = 0 and their Hessians are rank-deficient. Other important quantities are κ and λ, two different condition numbers of the weight matrix, which directly influence the positive definiteness. If W^* is rank-deficient, λ → ∞ and κ → ∞, and we don't have the PD property. In the best case, when W^* is orthogonal, λ = κ = 1. In the worst case, λ can be exponential in k. Also, W should be close enough to W^*. In the next section, we provide tensor methods to initialize w_i^* and v_i^* such that they satisfy the conditions in Theorem 4.3.1.
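The positive definiteness at W^* is also easy to check numerically. The sketch below assembles the empirical Hessian at the ground truth for squared ReLU (where the residual terms vanish, leaving only the Gram-type blocks) and inspects its smallest eigenvalue; sizes, seed and the orthonormal choice of W^* are ours:

```python
import numpy as np

def empirical_hessian_at_Wstar(W, v, X, phi_prime):
    """Hessian of f_S at the ground truth (residual terms vanish there):
    block (j, l) = (1/|S|) sum_s v_j v_l phi'(w_j.x_s) phi'(w_l.x_s) x_s x_s^T."""
    n, d = X.shape
    k = W.shape[1]
    D = phi_prime(X @ W) * v                 # (n, k): entry (s, j) = v_j phi'(w_j . x_s)
    H = np.zeros((d * k, d * k))
    for j in range(k):
        for l in range(k):
            B = (X * (D[:, j] * D[:, l])[:, None]).T @ X / n
            H[j*d:(j+1)*d, l*d:(l+1)*d] = B
    return H

rng = np.random.default_rng(2)
d, k, n = 8, 3, 20_000
W_star = np.linalg.qr(rng.standard_normal((d, k)))[0]   # orthonormal: kappa = lambda = 1
v_star = rng.choice([-1.0, 1.0], size=k)
X = rng.standard_normal((n, d))
phi_p = lambda z: 2.0 * np.maximum(z, 0.0)              # squared ReLU derivative
H = empirical_hessian_at_Wstar(W_star, v_star, X, phi_p)
eig_min = np.linalg.eigvalsh(H).min()                   # strictly positive w.h.p.
```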
For the PD property to hold, we need the samples to be independent of the
current parameters. Therefore, we need to do resampling at each iteration to guar-
antee the convergence in iterative algorithms like gradient descent. The following
theorem provides the linear convergence guarantee of gradient descent for smooth
activations.
Theorem 4.3.2 (Linear convergence of gradient descent, informal version of Theorem B.3.2). Let W be the current iterate satisfying

‖W − W^*‖ ≤ poly(1/ν, 1/k, 1/λ, ρ/σ_1^{2p}) ‖W^*‖.

Let S denote a set of i.i.d. samples from the distribution D (defined in (4.1)) with |S| ≥ d · poly(log d, t, k, ν, τ, λ, σ_1^{2p}/ρ), and let the activation function satisfy Properties 1, 2 and 3(a). Define m_0 := Θ(v_min^2 ρ(σ_k)/(κ^2 λ)) and M_0 := Θ(k v_max^2 σ_1^{2p}). If we perform gradient descent with step size 1/M_0 on f_S(W) and obtain the next iterate,

W̃ = W − (1/M_0) ∇f_S(W),

then with probability at least 1 − d^{−Ω(t)},

‖W̃ − W^*‖_F^2 ≤ (1 − m_0/M_0) ‖W − W^*‖_F^2.
We provide the proofs in Appendix B.3.1.
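To illustrate Theorem 4.3.2, the sketch below runs plain gradient descent on the empirical risk for squared ReLU, starting from a small perturbation of W^*. It omits resampling and hand-picks the step size, so it is only illustrative, not the theorem's exact setting:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, n = 10, 3, 5000
phi = lambda z: np.maximum(z, 0.0) ** 2
phi_p = lambda z: 2.0 * np.maximum(z, 0.0)
W_star = np.linalg.qr(rng.standard_normal((d, k)))[0]    # well-conditioned W*
v = rng.choice([-1.0, 1.0], size=k)
X = rng.standard_normal((n, d))
Y = phi(X @ W_star) @ v                                  # noiseless labels

W = W_star + 0.02 * rng.standard_normal((d, k))          # local initialization
eta = 0.05                                               # step size (our choice)
err0 = np.linalg.norm(W - W_star)
for _ in range(500):
    resid = phi(X @ W) @ v - Y
    W -= eta * X.T @ (resid[:, None] * phi_p(X @ W) * v) / n
err_final = np.linalg.norm(W - W_star)                   # shrinks geometrically
```

The error contracts by a roughly constant factor per step once inside the locally strongly convex region, which is the linear convergence the theorem asserts.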
4.4 Initialization via Tensor Methods
In this section, we show that tensor methods can recover the parameters W^* to some precision and exactly recover v^* for homogeneous activations.
It is known that most tensor problems are NP-hard [63, 64] or even hard to approximate [117]. However, under certain assumptions, tensor decomposition becomes efficient [4, 115, 132, 133]. Here we utilize the noiseless assumption and the Gaussian-input assumption to give a provable and efficient tensor method.
Preliminary
Let's define a special outer product ⊗ to simplify the notation. If v ∈ R^d is a vector and I is the identity matrix, then

v ⊗ I = ∑_{j=1}^d [ v ⊗ e_j ⊗ e_j + e_j ⊗ v ⊗ e_j + e_j ⊗ e_j ⊗ v ].

If M is a symmetric rank-r matrix factorized as M = ∑_{i=1}^r s_i v_i v_i^⊤ and I is the identity matrix, then

M ⊗ I = ∑_{i=1}^r s_i ∑_{j=1}^d ∑_{l=1}^6 A_{l,i,j},

where A_{1,i,j} = v_i ⊗ v_i ⊗ e_j ⊗ e_j, A_{2,i,j} = v_i ⊗ e_j ⊗ v_i ⊗ e_j, A_{3,i,j} = e_j ⊗ v_i ⊗ v_i ⊗ e_j, A_{4,i,j} = v_i ⊗ e_j ⊗ e_j ⊗ v_i, A_{5,i,j} = e_j ⊗ v_i ⊗ e_j ⊗ v_i and A_{6,i,j} = e_j ⊗ e_j ⊗ v_i ⊗ v_i.
Denote w̄ = w/‖w‖. Now let's calculate some moments.
Definition 4.4.1. We define M_1, M_2, M_3, M_4 and m_{1,i}, m_{2,i}, m_{3,i}, m_{4,i} as follows:

M_1 = E_{(x,y)∼D}[y · x],
M_2 = E_{(x,y)∼D}[y · (x ⊗ x − I)],
M_3 = E_{(x,y)∼D}[y · (x^{⊗3} − x ⊗ I)],
M_4 = E_{(x,y)∼D}[y · (x^{⊗4} − (x ⊗ x) ⊗ I + I ⊗ I)],
γ_j(σ) = E_{z∼N(0,1)}[φ(σ·z) z^j], ∀j ∈ {0, 1, 2, 3, 4},
m_{1,i} = γ_1(‖w_i^*‖),
m_{2,i} = γ_2(‖w_i^*‖) − γ_0(‖w_i^*‖),
m_{3,i} = γ_3(‖w_i^*‖) − 3γ_1(‖w_i^*‖),
m_{4,i} = γ_4(‖w_i^*‖) + 3γ_0(‖w_i^*‖) − 6γ_2(‖w_i^*‖).
According to Definition 4.4.1, we have the following result.

Claim 4.4.1. For each j ∈ [4], M_j = ∑_{i=1}^k v_i^* m_{j,i} w̄_i^{*⊗j}.
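Claim 4.4.1 can be checked by Monte Carlo. For squared ReLU, γ_1(σ) = E[φ(σz)z] = σ^2 √(2/π) in closed form, so the j = 1 case, M_1 = ∑_i v_i^* m_{1,i} w̄_i^*, is easy to compare against the empirical E[y·x]. The sizes, seed and unit-norm kernels below are our choices:

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, n = 5, 2, 1_000_000
phi = lambda z: np.maximum(z, 0.0) ** 2
W = rng.standard_normal((d, k))
W /= np.linalg.norm(W, axis=0)                 # unit-norm kernels for a cleaner check
v = np.array([1.0, -1.0])
X = rng.standard_normal((n, d))
Y = phi(X @ W) @ v

M1_empirical = X.T @ Y / n                     # Monte Carlo estimate of E[y x]

norms = np.linalg.norm(W, axis=0)
gamma1 = norms ** 2 * np.sqrt(2.0 / np.pi)     # gamma_1(||w*_i||) for squared ReLU
M1_claim = (W / norms) @ (v * gamma1)          # sum_i v*_i m_{1,i} wbar*_i
```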
Note that some m_{j,i}'s will be zero for specific activations. For example, for activations with symmetric first derivatives, i.e., φ′(z) = φ′(−z), like sigmoid and erf, φ(z) + φ(−z) is a constant and M_2 = 0 since γ_0(σ) = γ_2(σ). Another example is ReLU, which has vanishing M_3, i.e., M_3 = 0, as γ_3(σ) = 3γ_1(σ). To make tensor methods work, we make the following assumption.
Assumption 4.4.1. Assume the activation function φ(z) satisfies the following conditions:
1. If M_j ≠ 0, then m_{j,i} ≠ 0 for all i ∈ [k].
2. At least one of M_3 and M_4 is non-zero.
3. If M_1 = M_3 = 0, then φ(z) is an even function, i.e., φ(z) = φ(−z).
4. If M_2 = M_4 = 0, then φ(z) is an odd function, i.e., φ(z) = −φ(−z).
If φ(z) is an odd function, then vφ(w^⊤x) = −vφ(−w^⊤x), so we can always assume v > 0. If φ(z) is an even function, then vφ(w^⊤x) = vφ(−w^⊤x), so if w recovers w^* then −w also recovers w^*. Note that ReLU, leaky ReLU and squared ReLU satisfy Assumption 4.4.1. We further define the following non-zero moments.
Definition 4.4.2. Let α ∈ R^d denote a randomly picked vector. We define P_2 and P_3 as follows: P_2 = M_{j_2}(I, I, α, · · · , α), where j_2 = min{j ≥ 2 : M_j ≠ 0}, and P_3 = M_{j_3}(I, I, I, α, · · · , α), where j_3 = min{j ≥ 3 : M_j ≠ 0}.
According to Definitions 4.4.1 and 4.4.2, we have the following result.

Claim 4.4.2. P_2 = ∑_{i=1}^k v_i^* m_{j_2,i} (α^⊤ w̄_i^*)^{j_2−2} w̄_i^{*⊗2} and

P_3 = ∑_{i=1}^k v_i^* m_{j_3,i} (α^⊤ w̄_i^*)^{j_3−3} w̄_i^{*⊗3}.
In other words, P_2 equals the first non-zero matrix in the ordered sequence M_2, M_3(I, I, α), M_4(I, I, α, α), and P_3 equals the first non-zero tensor in the ordered sequence M_3, M_4(I, I, I, α). Since α is randomly picked, w̄_i^{*⊤}α ≠ 0 almost surely, and we view this number as a constant throughout this work. So by construction and Assumption 4.4.1, both P_2 and P_3 have rank k. Also, let P̂_2 ∈ R^{d×d} and P̂_3 ∈ R^{d×d×d} denote the corresponding empirical moments of P_2 ∈ R^{d×d} and P_3 ∈ R^{d×d×d}, respectively.
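For squared ReLU, m_{2,i} ≠ 0, so j_2 = 2 and P_2 = M_2 = E[y(xx^⊤ − I)]. The sketch below forms P̂_2 from samples and checks that its top-k eigenspace (by eigenvalue magnitude, since the v_i^* signs make P_2 indefinite) captures span{w̄_i^*}, which is what the initialization relies on. All sizes here are our choices:

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, n = 8, 3, 500_000
phi = lambda z: np.maximum(z, 0.0) ** 2
W = np.linalg.qr(rng.standard_normal((d, k)))[0]   # orthonormal w*_i for simplicity
v = np.array([1.0, 1.0, -1.0])
X = rng.standard_normal((n, d))
Y = phi(X @ W) @ v

P2_hat = (X * Y[:, None]).T @ X / n - Y.mean() * np.eye(d)   # empirical E[y(xx^T - I)]

eigvals, eigvecs = np.linalg.eigh(P2_hat)
order = np.argsort(-np.abs(eigvals))
V = eigvecs[:, order[:k]]                          # top-k eigenspace by magnitude
subspace_err = np.linalg.norm(W - V @ (V.T @ W))   # distance of the w*_i from span(V)
```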
Algorithm
Now we briefly introduce how to use a set of samples of size linear in the dimension to recover the ground-truth parameters to some precision. As shown in the previous section, we have a rank-k 3rd-order moment P_3 whose tensor decomposition is formed by w̄_1^*, w̄_2^*, · · · , w̄_k^*. Therefore, we can use the non-orthogonal decomposition method [78] to decompose the corresponding estimated tensor P̂_3 and obtain an approximation of the parameters. The precision of the obtained parameters depends on the estimation error of P̂_3, which requires Ω(d^3/ε^2) samples to achieve ε error. Also, the time complexity for tensor decomposition on a d × d × d tensor is Ω(d^3).
In this work, we reduce the cubic dependency on the dimension of the sample/computational complexity [72] to linear dependency. Our idea follows the techniques used in [151], which first use a 2nd-order moment P_2 to approximate the subspace V spanned by {w̄_1^*, w̄_2^*, · · · , w̄_k^*}, and then use V to reduce the higher-dimensional 3rd-order tensor P_3 ∈ R^{d×d×d} to a lower-dimensional tensor
Algorithm 4.4.1 Initialization via Tensor Method
1: procedure INITIALIZATION(S)            ▷ Theorem 4.4.2
2:   S_2, S_3, S_4 ← PARTITION(S, 3)
3:   P̂_2 ← Ê_{S_2}[P_2]
4:   V ← POWERMETHOD(P̂_2, k)
5:   R̂_3 ← Ê_{S_3}[P_3(V, V, V)]
6:   {u_i}_{i∈[k]} ← KCL(R̂_3)
7:   {w_i^{(0)}, v_i^{(0)}}_{i∈[k]} ← RECMAGSIGN(V, {u_i}_{i∈[k]}, S_4)
8:   Return {w_i^{(0)}, v_i^{(0)}}_{i∈[k]}
R_3 := P_3(V, V, V) ∈ R^{k×k×k}. Since the tensor estimation and decomposition are conducted in the lower-dimensional space R^{k×k×k}, the sample complexity and computational complexity are both reduced.
The detailed algorithm is shown in Algorithm 4.4.1. First, we randomly partition the dataset into three subsets, each of size O(d). Then we apply the power method to P̂_2, the estimate of P_2 from S_2, to estimate V. After that, the non-orthogonal tensor decomposition (KCL) [78] on R̂_3 outputs {u_i}_{i∈[k]}, where u_i estimates s_i V^⊤ w̄_i^* for i ∈ [k] with an unknown sign s_i ∈ {−1, 1}. Hence w̄_i^* can be estimated by s_i V u_i. Finally, we estimate the magnitudes of w_i^* and the signs s_i, v_i^* in the RECMAGSIGN function for homogeneous activations. We discuss the details of each procedure and provide the POWERMETHOD and RECMAGSIGN algorithms in Appendix B.4.
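The exact POWERMETHOD procedure is given in Appendix B.4; as an illustration, a generic subspace-iteration version of that step looks like the following. We iterate with P̂_2 applied twice per step so that eigenvalues of either sign are retained (the v_i^* ∈ {−1, 1} make P_2 indefinite):

```python
import numpy as np

def power_method(P2_hat, k, iters=100, seed=0):
    """Return an orthonormal V in R^{d x k} approximating the top-k
    eigenspace (by eigenvalue magnitude) of the symmetric matrix P2_hat.
    A generic sketch, not the exact procedure from Appendix B.4."""
    d = P2_hat.shape[0]
    V = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, k)))[0]
    for _ in range(iters):
        V = np.linalg.qr(P2_hat @ (P2_hat @ V))[0]   # iterate with P2_hat^2
    return V

# sanity check on a synthetic rank-3 P2 with +/-1 signs plus small noise
rng = np.random.default_rng(6)
d, k = 10, 3
U = np.linalg.qr(rng.standard_normal((d, k)))[0]
noise = 0.005 * rng.standard_normal((d, d))
P2 = U @ np.diag([1.0, 1.0, -1.0]) @ U.T + (noise + noise.T) / 2
V = power_method(P2, k)
err = np.linalg.norm(U - V @ (V.T @ U))   # how far the true factors are from span(V)
```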
Theoretical Analysis
We formally present our theorem for Algorithm 4.4.1 and provide the proof in Appendix B.4.2.
Algorithm 4.4.2 Globally Converging Algorithm
1: procedure LEARNING1NN(S, d, k, ε)       ▷ Theorem 4.5.1
2:   T ← log(1/ε) · poly(k, ν, λ, σ_1^{2p}/ρ)
3:   η ← 1/(k v_max^2 σ_1^{2p})
4:   S_0, S_1, · · · , S_T ← PARTITION(S, T + 1)
5:   W^{(0)}, v^{(0)} ← INITIALIZATION(S_0)
6:   Set v_i^* ← v_i^{(0)} in Eq. (4.2) for all f_{S_q}(W), q ∈ [T]
7:   for q = 0, 1, 2, · · · , T − 1 do
8:     W^{(q+1)} = W^{(q)} − η ∇f_{S_{q+1}}(W^{(q)})
9:   Return {w_i^{(T)}, v_i^{(0)}}_{i∈[k]}
Theorem 4.4.2. Let the activation function be homogeneous and satisfy Assumption 4.4.1. For any 0 < ε < 1 and t ≥ 1, if |S| ≥ ε^{−2} · d · poly(t, k, κ, log d), then there exists an algorithm (Algorithm 4.4.1) that takes |S| · k · O(d) time and outputs a matrix W^{(0)} ∈ R^{d×k} and a vector v^{(0)} ∈ R^k such that, with probability at least 1 − d^{−Ω(t)},

‖W^{(0)} − W^*‖_F ≤ ε · poly(k, κ) ‖W^*‖_F, and v_i^{(0)} = v_i^*.
4.5 Recovery Guarantee
Combining the positive definiteness of the Hessian near the global optimum from Section 4.3 with the tensor initialization method from Section 4.4, we arrive at the overall globally converging algorithm, Algorithm 4.4.2, and its guarantee, Theorem 4.5.1.
Theorem 4.5.1 (Global convergence guarantees). Let S denote a set of i.i.d. samples from the distribution D (defined in (4.1)) and let the activation function be homogeneous, satisfying Property 1, 2, 3(a) and Assumption 4.4.1. Then for any t ≥ 1 and any ε > 0, if |S| ≥ d log(1/ε) · poly(log d, t, k, λ), T ≥ log(1/ε) · poly(k, ν, λ, σ_1^{2p}/ρ) and 0 < η ≤ 1/(k v_max^2 σ_1^{2p}), then there is an algorithm (procedure LEARNING1NN in Algorithm 4.4.2) taking |S| · d · poly(log d, k, λ) time and outputting a matrix W^{(T)} ∈ R^{d×k} and a vector v^{(0)} ∈ R^k satisfying

‖W^{(T)} − W^*‖_F ≤ ε ‖W^*‖_F, and v_i^{(0)} = v_i^*,

with probability at least 1 − d^{−Ω(t)}.

Figure 4.1: Numerical Experiments. (a) Sample complexity for recovery; (b) Tensor initialization error; (c) Objective vs. iterations.
This follows by combining Theorem 4.3.2 and Theorem 4.4.2.
4.6 Numerical Experiments
In this section, we use synthetic data to verify our theoretical results. We generate data points {x_i, y_i}_{i=1,2,··· ,n} from the distribution D (defined in Eq. (4.1)). We set W^* = UΣV^⊤, where U ∈ R^{d×k} and V ∈ R^{k×k} are orthogonal matrices generated from the QR decomposition of Gaussian matrices, and Σ is a diagonal matrix whose diagonal elements are 1, 1 + (κ−1)/(k−1), 1 + 2(κ−1)/(k−1), · · · , κ. In this experiment, we set κ = 2 and k = 5. We set each v_i^* to be randomly picked from {−1, 1} with equal probability. We use the squared ReLU φ(z) = max{z, 0}^2, which is a smooth homogeneous function. For the non-orthogonal tensor method, we directly use the code provided by [78] with the number of random projections fixed at L = 100. We pick the step size η = 0.02 for gradient descent. In the experiments, we don't do resampling, since the algorithm still works well without it.
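The construction of W^* with a prescribed condition number can be written out directly (the dimensions and seed below are our choices):

```python
import numpy as np

def make_W_star(d, k, kappa, rng):
    """W* = U diag(sigma) V^T with singular values equally spaced from 1 to kappa."""
    U = np.linalg.qr(rng.standard_normal((d, k)))[0]
    V = np.linalg.qr(rng.standard_normal((k, k)))[0]
    sigma = 1.0 + (kappa - 1.0) * np.arange(k) / (k - 1)
    return U @ np.diag(sigma) @ V.T

rng = np.random.default_rng(7)
W_star = make_W_star(d=10, k=5, kappa=2.0, rng=rng)
s = np.linalg.svd(W_star, compute_uv=False)
cond = s.max() / s.min()                     # equals kappa = 2 by construction
```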
First we show the number of samples required to recover the parameters for different dimensions. We fix k = 5 and vary d over 10, 20, · · · , 100 and n over 1000, 2000, · · · , 10000. For each pair of d and n, we run 10 trials. We say a trial successfully recovers the parameters if there exists a permutation π : [k] → [k] such that the returned parameters W and v satisfy

max_{j∈[k]} ‖w_j^* − w_{π(j)}‖/‖w_j^*‖ ≤ 0.01 and v_{π(j)} = v_j^*.

We record the recovery rates and represent them as grey scale in Fig. 4.1(a). As we can see from Fig. 4.1(a), the minimal number of samples required for a 100% recovery rate is roughly proportional to the dimension.
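The success criterion above can be coded directly; a brute-force search over permutations is fine for k = 5. The helper below is our own illustration of the check:

```python
import numpy as np
from itertools import permutations

def recovered(W_hat, v_hat, W_star, v_star, tol=0.01):
    """True iff some permutation pi matches every kernel within relative
    error tol and with the correct second-layer sign."""
    k = W_star.shape[1]
    for pi in permutations(range(k)):
        if all(np.linalg.norm(W_star[:, j] - W_hat[:, pi[j]])
               <= tol * np.linalg.norm(W_star[:, j])
               and v_hat[pi[j]] == v_star[j] for j in range(k)):
            return True
    return False

rng = np.random.default_rng(8)
W = rng.standard_normal((6, 3))
v = np.array([1.0, -1.0, -1.0])
perm = [2, 0, 1]
ok = recovered(W[:, perm], v[perm], W, v)    # permuted copy of the truth: success
bad = recovered(W[:, perm], v[::-1], W, v)   # right weights, wrong signs: failure
```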
Next we test the tensor initialization. In Fig. 4.1(b), we show the error between the output of the tensor method and the ground-truth parameters against the number of samples under different dimensions. The pure dark blocks indicate that, in at least one of the 10 trials, ∑_{i=1}^k v_i^{(0)} ≠ ∑_{i=1}^k v_i^*, which means v^{(0)} is not correctly initialized. Let Π(k) denote the set of all possible permutations π : [k] → [k]. The grey scale represents the error,

min_{π∈Π(k)} max_{j∈[k]} ‖w_j^* − w_{π(j)}^{(0)}‖/‖w_j^*‖,

averaged over 10 trials. As we can see, for a fixed dimension, the more samples we have, the better the initialization we obtain. We can also see that, to achieve the same initialization error, the number of samples required is roughly proportional to the dimension.
We also compare different initialization methods for gradient descent in Fig. 4.1(c). We fix d = 10, k = 5, n = 10000 and compare three initialization approaches: (I) initialize both v and W with the tensor method, then run gradient descent on W while v is fixed; (II) initialize both v and W randomly from a Gaussian, then run gradient descent on both W and v; (III) set v = v^* and initialize W randomly from a Gaussian, then run gradient descent on W while v is fixed. As we can see from Fig. 4.1(c), Approach (I) is the fastest, and Approach (II) doesn't converge even when more iterations are allowed. Both Approaches (I) and (III) exhibit a linear convergence rate once the objective value is small enough, which verifies our local linear convergence claim.
4.7 Related Work
The recent empirical success of NNs has boosted their theoretical analyses
[5, 9, 15, 16, 42, 52, 107]. We classify them into three main directions.
Expressive Power
Expressive power is studied to understand the remarkable performance of
neural networks on complex tasks. Although one-hidden-layer neural networks
with sufficiently many hidden nodes can approximate any continuous function [65],
shallow networks can’t achieve the same performance in practice as deep networks.
Theoretically, several recent works show that the depth of NNs plays an essential role in their expressive power [34]. As shown in [31, 32, 124], functions that can be implemented by a deep network of polynomial size require exponential size to be implemented by a shallow network. [9, 91, 99, 101] design measures of expressivity that display an exponential dependence on the depth of the network. However, increasing the expressivity or depth of NNs also increases the difficulty of learning a good enough model. In this work, we focus on 1NNs and provide recovery guarantees using a finite number of samples.
Achievability of Global Optima
Global convergence is in general not guaranteed for NNs due to their non-convexity. It is widely believed that training deep models using gradient-based methods works so well because the error surface either has no local minima, or, if they exist, they are close in value to the global minima. [123] present examples showing that for this to be true, additional assumptions on the data, initialization schemes and/or the model classes have to be made. Indeed, the achievability of global optima has been shown under many different types of assumptions.
In particular, [30] analyze the loss surface of a special random neural network through spin-glass theory and show that it has exponentially many local optima whose loss is small and close to that of a global optimum. Later on, [74] eliminate some assumptions made by [30] but still require the independence of activations, as in [30], which is unrealistic. [105] study the geometric structure of the
neural network objective function. They have shown that with high probability
random initialization will fall into a basin with a small objective value when the
network is over-parameterized. [89] consider polynomial networks where the acti-
vations are square functions, which are typically not used in practice. [57] show that
when a local minimum has zero parameters related to a hidden node, a global opti-
mum is achieved. [44] study the landscape of 1NN in terms of topology and geom-
etry, and show that the level set becomes connected as the network is increasingly
over-parameterized. [59] show that products of matrices don’t have spurious local
minima and that deep residual networks can represent any function on a sample, as
long as the number of parameters is larger than the sample size. [121] consider over-
specified NNs, where the number of samples is smaller than the number of weights.
[35] propose a new approach to second-order optimization that identifies and at-
tacks the saddle point problem in high-dimensional non-convex optimization. They
apply the approach to recurrent neural networks and show practical performance.
[9] use results from tropical geometry to show global optimality of an algorithm,
but it requires (2n)kpoly(n) computational complexity.
Almost all of these results require the number of parameters to be larger than the number of data points, which likely overfits the model, so no generalization performance is guaranteed. In this work, we propose an efficient and provable algorithm for 1NNs that recovers the underlying ground-truth parameters.
Generalization Bound / Recovery Guarantees
Achieving a global optimum of the objective on the training data doesn't guarantee that the learned model generalizes well to unseen test data. In the literature, we find three main approaches to generalization guarantees.
1) Use generalization analysis frameworks, including VC dimension/Rademacher
complexity, to bound the generalization error. A few works have studied the gener-
alization performance for NNs. [135] follow [121] but additionally provide gener-
alization bounds using Rademacher complexity. They assume the obtained parame-
ters are in a regularization set so that the generalization performance is guaranteed,
but this assumption can’t be justified theoretically. [61] apply stability analysis to
the generalization analysis of SGD for convex and non-convex problems, arguing
early stopping is important for generalization performance.
2) Assume an underlying model and try to recover this model. This direction
is popular for many non-convex problems including matrix completion/sensing [14,
58, 71, 122], mixed linear regression [151], subspace recovery [40] and other latent
models [4].
54
Without making any assumptions, those non-convex problems are intractable [8, 10, 49, 50, 60, 102, 116, 120, 137]. Recovery guarantees for NNs also need assumptions. Several approaches, under different assumptions, provide recovery guarantees for different NN settings.
Tensor methods [4, 115, 132, 133] are a general tool for recovering models with latent factors by assuming the data distribution is known. Some existing recovery guarantees for NNs are provided by tensor methods [72, 109]. However, [109] only provide guarantees for recovering the subspace spanned by the weight matrix, and no sample complexity is given, while [72] require O(d^3/ε^2) sample complexity. In this work, we use tensor methods as an initialization step, so we don't need a very accurate estimation of the moments, which enables us to reduce the total sample complexity from 1/ε^2 to log(1/ε).
[7] provide polynomial sample-complexity and computational-complexity bounds for learning deep representations in an unsupervised setting, but they need to assume the weights are sparse and randomly distributed in [−1, 1].
[126] analyze 1NN by assuming Gaussian inputs in a supervised setting, in
particular, regression and classification with a teacher. This work also considers
this setting. However, there are some key differences. a) [126] require the second-layer parameters to be all ones, while we can learn these parameters. b) In [126], the ground-truth first-layer weight vectors are required to be orthogonal, while we only require linear independence. c) [126] require a good initialization but don't provide an initialization method, while we show the parameters can be efficiently initialized by tensor methods. d) In [126], only the population case (infinite sample size) is considered, so there is no sample complexity analysis, while we give a finite sample complexity.
Recovery guarantees for convolutional neural networks with Gaussian inputs are provided in [21], where a globally converging guarantee of gradient descent is shown for a one-hidden-layer no-overlap convolutional neural network. However, they consider the population case, so no sample complexity is provided. Also, their analysis depends on ReLU activations, and the no-overlap case is very unlikely to be used in practice. In this work, we consider a large range of activation functions, but for one-hidden-layer fully-connected NNs.
3) Improper learning. In the improper learning setting for NNs, the learning algorithm is not restricted to output a NN; it only needs to output a prediction function whose error is not much larger than that of the best NN among all the NNs considered. [147, 149] propose kernel methods to learn a prediction function guaranteed to have generalization performance close to that of the NN. However, the sample and computational complexities are exponential. [11] transform NNs into convex semi-definite programming. The works of [12] and [19] are also in this direction. However, these methods do not actually learn the original NNs. Another work, [148], uses random initializations to achieve arbitrarily small excess risk. However, their algorithm has running time exponential in 1/ε.
Chapter 5
One-hidden-layer Convolutional Neural Networks
In this chapter, we consider model recovery for non-overlapping convolu-
tional neural networks (CNNs) with multiple kernels. We show that when the inputs
follow Gaussian distribution and the sample size is sufficiently large, the squared
loss of such CNNs is locally strongly convex in a basin of attraction near the global
optima for most popular activation functions, like ReLU, Leaky ReLU, Squared
ReLU, Sigmoid and Tanh. The required sample complexity is proportional to the
dimension of the input and polynomial in the number of kernels and a condition
number of the parameters. We also show that tensor methods are able to initialize
the parameters into the locally strongly convex region. Hence, for most smooth activations, gradient descent following tensor initialization is guaranteed to converge to the global optimum in time linear in the input dimension, logarithmic in the precision and polynomial in other factors. To the best of our knowledge, this is the first work that provides recovery guarantees for CNNs with multiple kernels using a polynomial number of samples in polynomial running time.
5.1 Introduction to One-hidden-layer Convolutional Neural Networks
Convolutional Neural Networks (CNNs) have been very successful in many
machine learning areas, including image classification [77], face recognition [80],
machine translation [48] and game playing [113]. Compared with fully-connected neural networks (FCNNs), CNNs leverage three key ideas that improve their performance in machine learning tasks, namely sparse weights, parameter sharing and equivariance to translation [56]. These ideas allow CNNs to capture common patterns in portions of the original inputs.
Despite the empirical success of neural networks, the mechanism behind
them is still not fully understood. Recently there are several theoretical works on
analyzing FCNNs, including the expressive power of FCNNs [31, 32, 34, 99, 101,
124], the achievability of global optima [35, 57, 59, 89, 97, 105] and the recov-
ery/generalization guarantees [72, 85, 109, 125, 135, 153].
With the increasing number of papers analyzing FCNNs, theoretical literature about CNNs is also growing. Recent theoretical CNN research focuses on generalization and recovery guarantees. In particular, generalization guarantees for two-layer CNNs are provided by [149], which convexifies CNNs by relaxing the class of CNN filters to a reproducing kernel Hilbert space (RKHS). However, to pair with an RKHS, only several uncommonly used activations are acceptable. There has been much progress in the parameter recovery setting, including [21, 38, 39, 53].
However, existing results can only handle one kernel. It is still unknown whether multi-kernel CNNs enjoy recovery guarantees. In this section, we consider multiple kernels instead of just one kernel as in [21, 38, 39, 53].
We use a popular proof framework for non-convex problems such as the one-hidden-layer neural network [153]. The framework is as follows: first show local strong convexity near the global optimum, and then use a proper initialization method so that the initial parameters fall into the local strong convexity region. In particular, we first show that the population Hessian of the squared loss of the CNN at the ground truth is positive definite (PD) as long as the activation satisfies some properties, in Section 5.3. Note that the Hessian of the squared loss at the ground truth can trivially be shown to be positive semidefinite (PSD), but PSD-ness at the ground truth alone can't guarantee the convergence of most optimization algorithms, such as gradient descent. The proof of the PD-ness of the Hessian at the ground truth is non-trivial; indeed, we give examples in Section 5.3 where the distilled properties are not satisfied and the Hessians are only PSD but not PD. Then, given the PD-ness of the population Hessian at the ground truth, we show that the empirical Hessian at any fixed point close enough to the ground truth is also PD with high probability, using the matrix Bernstein inequality and the distilled properties of the activations, and we show that gradient descent converges to the global optimum given an initialization that falls into the PD region. In Section 5.4, we provide existing guarantees for the initialization using tensor methods. Finally, we present some experimental results to verify our theory.
Although we use the same proof framework, our proof details are different
from [153] and resolve some new technical challenges. To mention a few, 1) CNNs
have more complex structures, so we need a different property for the activation
59
function (see Property 4) , fortunately this new condition is still satisfied by com-
mon activation functions; 2) The patches introduce additional dependency, which
requires different proof techniques to disentangle when we attempt to show the PD-
ness of population Hessian; 3) The patches also introduce difficulty when applying
matrix Bernstein inequality for error bound. Part of the proof requires to bound the
error between the empirical sum of non-symmetric random matrices and its corresponding population version. However, such non-symmetric random matrices were not studied in [153], and the classic matrix Bernstein inequality cannot be applied to them directly.
Therefore, we exploit the structure of this type of random matrices so that we can
still bound the error.
In summary, our contributions are:
1. We show that the Hessian of the squared loss at a given point that is sufficiently close to the ground truth is positive definite with high probability (w.h.p.) when a sufficiently large number of samples are provided and the activation function satisfies some properties.
2. Given an initialization point that is sufficiently close to the ground truth,
which can be obtained by tensor methods, we show that for smooth activa-
tion functions that satisfy the distilled properties, gradient descent converges
to the ground truth parameters within ε precision using O(log(1/ε)) samples w.h.p. To the best of our knowledge, this is the first time that recovery guarantees for non-overlapping CNNs with multiple kernels are provided.
5.2 Problem Formulation
We consider the CNN setting with one hidden layer, r non-overlapping
patches and t different kernels. Let (x, y) ∈ R^d × R be a pair consisting of an input and its corresponding final output, k = d/r be the kernel size (or the size of each patch), w_j ∈ R^k be the parameters of the j-th kernel (j = 1, 2, · · · , t), and P_i · x ∈ R^k be the i-th patch (i = 1, 2, · · · , r) of input x, where the r matrices P_1, P_2, · · · , P_r ∈ R^{k×d} are defined as follows:

P_1 = [I_k  0  · · ·  0],  · · · ,  P_r = [0  0  · · ·  I_k].
By construction of {P_i}_{i∈[r]}, P_i · x and P_{i′} · x (i ≠ i′) don't have any overlap on the features of x. Throughout this section, we assume the number of kernels t is no more than the size of each patch, i.e., t ≤ k. So by definition of d, d ≥ max{k, r, t}.
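Since the patches are non-overlapping, applying P_1, · · · , P_r to x is just a reshape of x into r chunks of length k. A minimal NumPy sketch (the helper name is ours):

```python
import numpy as np

def patches(x, r):
    """Split x in R^d into the r non-overlapping patches P_1 x, ..., P_r x in R^k."""
    k = x.shape[0] // r
    return x.reshape(r, k)  # row i equals P_i @ x

x = np.arange(6.0)          # d = 6, r = 2 patches of size k = 3
P = patches(x, 2)
print(P[0], P[1])           # [0. 1. 2.] [3. 4. 5.]
```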
We assume each sample (x, y) ∈ R^d × R is sampled from the following underlying distribution with parameters W^* = [w_1^*  w_2^*  · · ·  w_t^*] ∈ R^{k×t} and activation function φ(·),

D : x ∼ N(0, I_d),  y = Σ_{j=1}^t Σ_{i=1}^r φ(w_j^{*⊤} · P_i · x). (5.1)
Given a distribution D, we define the Expected Risk,

f_D(W) = (1/2) E_{(x,y)∼D} [ ( Σ_{j=1}^t Σ_{i=1}^r φ(w_j^⊤ · P_i · x) − y )² ]. (5.2)
Given a set of n samples

S = {(x_1, y_1), (x_2, y_2), · · · , (x_n, y_n)} ⊂ R^d × R,

we define the Empirical Risk,

f_S(W) = (1/(2|S|)) Σ_{(x,y)∈S} ( Σ_{j=1}^t Σ_{i=1}^r φ(w_j^⊤ · P_i · x) − y )². (5.3)
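The generative model (5.1) and the empirical risk (5.3) are straightforward to simulate. The following sketch (our own naming; ReLU as the example activation) draws samples from D and evaluates f_S, which is exactly zero at the ground truth in this noiseless realizable setting:

```python
import numpy as np

def sample_data(Wstar, r, n, phi, rng):
    """Draw n samples from D: x ~ N(0, I_d), y = sum_j sum_i phi(w_j*^T P_i x)."""
    k, t = Wstar.shape
    x = rng.normal(size=(n, r * k))
    y = phi(x.reshape(n, r, k) @ Wstar).sum(axis=(1, 2))
    return x, y

def empirical_risk(W, x, y, phi):
    """f_S(W) = (1/(2|S|)) sum over S of (sum_j sum_i phi(w_j^T P_i x) - y)^2."""
    n, d = x.shape
    r = d // W.shape[0]
    pred = phi(x.reshape(n, r, -1) @ W).sum(axis=(1, 2))
    return 0.5 * np.mean((pred - y) ** 2)

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)
Wstar = rng.normal(size=(3, 2))                       # k = 3, t = 2, r = 2
x, y = sample_data(Wstar, r=2, n=1000, phi=relu, rng=rng)
print(empirical_risk(Wstar, x, y, relu))              # 0.0: the model is realizable
```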
We calculate the gradient and the Hessian of f_D(W); the gradient and the Hessian of f_S(W) are similar. For each j ∈ [t], the partial gradient of f_D(W) with respect to w_j can be represented as
∂f_D(W)/∂w_j = E_{(x,y)∼D} [ ( Σ_{l=1}^t Σ_{i=1}^r φ(w_l^⊤ P_i x) − y ) ( Σ_{i=1}^r φ′(w_j^⊤ P_i x) P_i x ) ].
For each j ∈ [t], the second partial derivative of f_D(W) with respect to w_j can be represented as

∂²f_D(W)/∂w_j² = E_{(x,y)∼D} [ ( Σ_{i=1}^r φ′(w_j^⊤ P_i x) P_i x ) ( Σ_{i=1}^r φ′(w_j^⊤ P_i x) P_i x )^⊤
+ ( Σ_{l=1}^t Σ_{i=1}^r φ(w_l^⊤ P_i x) − y ) ( Σ_{i=1}^r φ′′(w_j^⊤ P_i x) P_i x (P_i x)^⊤ ) ].
When W = W^*, we have

∂²f_D(W^*)/∂w_j² = E_{(x,y)∼D} [ ( Σ_{i=1}^r φ′(w_j^{*⊤} P_i x) P_i x ) ( Σ_{i=1}^r φ′(w_j^{*⊤} P_i x) P_i x )^⊤ ].
For each j, l ∈ [t] with j ≠ l, the second partial derivative of f_D(W) with respect to w_j and w_l can be represented as

∂²f_D(W)/∂w_j ∂w_l = E_{(x,y)∼D} [ ( Σ_{i=1}^r φ′(w_j^⊤ P_i x) P_i x ) ( Σ_{i=1}^r φ′(w_l^⊤ P_i x) P_i x )^⊤ ].
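These formulas can be sanity-checked numerically. The sketch below (our own setup, with the smooth activation tanh so that all derivatives exist) compares the analytic gradient of f_S against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
k, r, t, n = 3, 2, 2, 50
phi, dphi = np.tanh, lambda z: 1 - np.tanh(z) ** 2    # smooth activation and its derivative

Wstar = rng.normal(size=(k, t))
x = rng.normal(size=(n, r, k))                        # x[s, i] = P_i x_s (patch i of sample s)
y = phi(x @ Wstar).sum(axis=(1, 2))

def f(W):
    res = phi(x @ W).sum(axis=(1, 2)) - y
    return 0.5 * np.mean(res ** 2)

def grad(W):
    """Analytic gradient: column j is (1/n) sum_s res_s sum_i phi'(w_j^T x_{s,i}) x_{s,i}."""
    res = phi(x @ W).sum(axis=(1, 2)) - y
    return np.einsum('n,nrt,nrk->kt', res, dphi(x @ W), x) / n

W0 = Wstar + 0.1 * rng.normal(size=(k, t))
eps, num = 1e-6, np.zeros((k, t))
for a in range(k):
    for b in range(t):
        E = np.zeros((k, t)); E[a, b] = eps
        num[a, b] = (f(W0 + E) - f(W0 - E)) / (2 * eps)
print(np.max(np.abs(num - grad(W0))))                 # agreement up to finite-difference error
```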
For the activation function φ(z), we define the following new property for CNNs.

Property 4. Let α_q(σ) = E_{z∼N(0,1)}[φ′(σ · z) z^q] for q ∈ {0, 1, 2}, and β_q(σ) = E_{z∼N(0,1)}[φ′(σ · z)² z^q] for q ∈ {0, 2}. Let ρ(σ) denote

min{ β_0(σ) − α_0²(σ) − α_1²(σ),  β_2(σ) − α_1²(σ) − α_2²(σ),  α_0(σ) · α_2(σ) − α_1²(σ),  α_0²(σ) }.

The first derivative φ′(z) satisfies that, for all σ > 0, we have ρ(σ) > 0.
Note that Property 4 is different from Property 2 for one-hidden-layer fully-connected NNs. We can show that most commonly used activations satisfy these properties, such as ReLU (φ(z) = max{z, 0}, ρ(σ) = 0.091), leaky ReLU (φ(z) = max{z, 0.01z}, ρ(σ) = 0.089), squared ReLU (φ(z) = max{z, 0}², ρ(σ) = 0.27σ²) and sigmoid (φ(z) = 1/(1 + e^{−z}), ρ(1) = 0.049). Also note that when Property 3(b) is satisfied, the activation function is non-smooth but piecewise linear, i.e., φ′′(z) = 0 almost surely; the empirical Hessian then exists almost surely for a finite number of samples.
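The constants quoted above are easy to reproduce numerically. A sketch (our own helper; simple Riemann-sum quadrature against the standard normal density), checked here for ReLU:

```python
import numpy as np

def rho(dphi, sigma, num=200_001):
    """rho(sigma) from Property 4, with alpha_q, beta_q computed by quadrature over N(0,1)."""
    z = np.linspace(-10, 10, num)
    w = np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi) * (z[1] - z[0])  # Gaussian weights
    d = dphi(sigma * z)
    a = [np.sum(d * z ** q * w) for q in (0, 1, 2)]               # alpha_0, alpha_1, alpha_2
    b = {q: np.sum(d ** 2 * z ** q * w) for q in (0, 2)}          # beta_0, beta_2
    return min(b[0] - a[0] ** 2 - a[1] ** 2,
               b[2] - a[1] ** 2 - a[2] ** 2,
               a[0] * a[2] - a[1] ** 2,
               a[0] ** 2)

relu_grad = lambda z: (z > 0).astype(float)   # phi'(z) for ReLU
print(round(rho(relu_grad, 1.0), 3))          # 0.091, matching the constant quoted above
```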
5.3 Local Strong Convexity
In this section, we first show that the eigenvalues of the Hessian at any fixed point sufficiently close to the ground truth are bounded below and above by two positive quantities w.h.p. Then, in the subsequent sections, we present the main idea of the proofs step by step, from special cases to general cases. Since we assume t ≤ k, the following definition is well defined.
Definition 5.3.1. Given the ground truth matrix W^* ∈ R^{k×t}, let σ_i(W^*) denote the i-th singular value of W^*, often abbreviated as σ_i. Let κ = σ_1/σ_t, λ = (Π_{i=1}^t σ_i)/σ_t^t, and τ = (3σ_1/2)^{4p} / min_{σ∈[σ_t/2, 3σ_1/2]} ρ²(σ).
Theorem 5.3.1 (Lower and upper bound for the Hessian around the ground truth, informal version of Theorem C.3.2). For any W ∈ R^{k×t} with

‖W − W^*‖ ≤ poly(1/r, 1/t, 1/κ, 1/λ, 1/ν, ρ/σ_1^{2p}) · ‖W^*‖,

let S denote a set of i.i.d. samples from the distribution D (defined in (5.1)) and let the activation function satisfy Properties 1, 4 and 3. Then for any s ≥ 1, if |S| ≥ d · poly(s, t, r, ν, τ, κ, λ, σ_1^{2p}/ρ, log d), we have with probability at least 1 − d^{−Ω(s)},

Ω(rρ(σ_t)/(κ²λ)) I ⪯ ∇²f_S(W) ⪯ O(tr²σ_1^{2p}) I. (5.4)
Note that κ is the traditional condition number of W ∗, while λ is a more
involved condition number of W ∗. Both of them are 1 if W ∗ has orthonormal
columns. ρ(σ) is a number that is related to the activation function as defined in
Property 4. Property 4 requires ρ(σt) > 0, which is important for the PD-ness of
the Hessian.
Here we show a special case in which Property 4 is not satisfied and the population Hessian is only positive semi-definite. We consider the quadratic activation function, φ(z) = z², with W^* = I_k. Let A = [a_1  a_2  · · ·  a_k] ∈ R^{k×k}. Then the smallest eigenvalue of ∇²f(W^*) can be written as follows,

min_{‖A‖_F=1} E_x [ ( Σ_{j=1}^k Σ_{i=1}^r a_j^⊤ x_i · 2x_{i,j} )² ] = 4 · min_{‖A‖_F=1} E_x [ ⟨A, Σ_{i=1}^r x_i x_i^⊤⟩² ].
Then, as long as we set A such that A = −A^⊤, we have ⟨A, Σ_{i=1}^r x_i x_i^⊤⟩ = 0 for any x. Therefore, the smallest eigenvalue of the population Hessian at the ground truth for the quadratic activation function is zero; that is, the Hessian is only PSD but not PD. Also note that ρ(σ) = 0 for the quadratic activation function. Therefore, Property 4 is important for the PD-ness of the Hessian.
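This degeneracy is easy to verify numerically: for any skew-symmetric A, the inner product with the symmetric matrix Σ_{i=1}^r x_i x_i^⊤ vanishes for every input, not just in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
k, r = 4, 3
A = rng.normal(size=(k, k))
A = A - A.T                        # skew-symmetric direction (norm can be rescaled freely)

for _ in range(5):
    x = rng.normal(size=(r, k))    # rows are the patches x_1, ..., x_r of a random input
    S = x.T @ x                    # sum_i x_i x_i^T, symmetric
    print(abs(np.sum(A * S)))      # 0 up to floating-point round-off
```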
Local Linear Convergence of Gradient Descent
A caveat of Theorem 5.3.1 is that the lower and upper bounds of the Hessian
only hold for a fixed W given a set of samples. That is to say, given a set of samples,
Eq (5.4) doesn’t hold for all the W ’s that are close enough to the ground truth
w.h.p. at the same time. So we want to point out that this theorem does not imply classical local strong convexity, since classical strong convexity requires the Hessian at every point in a local region to be PD almost surely. Fortunately, our goal is to show the convergence of optimization methods, and we can still show that gradient descent converges to the global optimum linearly given a sufficiently good initialization.
Theorem 5.3.2 (Linear convergence of gradient descent, informal version of Theorem C.3.3). Let W be the current iterate satisfying

‖W − W^*‖ ≤ poly(1/t, 1/r, 1/λ, 1/κ, ρ/σ_1^{2p}) · ‖W^*‖.

Let S denote a set of i.i.d. samples from the distribution D (defined in (5.1)), and let the activation function satisfy Properties 1, 4 and 3(a). Define

m_0 = Θ(rρ(σ_t)/(κ²λ))  and  M_0 = Θ(tr²σ_1^{2p}).

For any s ≥ 1, if we choose

|S| ≥ d · poly(s, t, log d, τ, κ, λ, σ_1^{2p}/ρ),

and perform gradient descent with step size 1/M_0 on f_S(W) to obtain the next iterate,

W̃ = W − (1/M_0) ∇f_S(W),

then with probability at least 1 − d^{−Ω(s)},

‖W̃ − W^*‖_F² ≤ (1 − m_0/M_0) ‖W − W^*‖_F².
To show the linear convergence of gradient descent for one iteration, we need to show that all the Hessians along the line between the current point and the optimal point are PD. This cannot be obtained from a simple union bound, since there are infinitely many Hessians. Our solution is to set a finite number of anchor points that are equally spaced along the line, whose Hessians can be shown to be PD w.h.p. using a union bound. Then we show that all the points between two adjacent anchor points have PD Hessians, since these points are much closer to the anchor points than to the ground truth. The proofs are postponed to Appendix C.3.4.
Note that this theorem holds only for one iteration. For multiple iterations,
we need to do resampling at each iteration. However, since the number of iterations
required to achieve ε precision is O(log(1/ε)), the number of samples required is
also proportional to log(1/ε).
5.4 Initialization via Tensor Method
It is known that most tensor problems are NP-hard [63, 64] or even hard to
approximate [118]. Tensor decomposition methods become efficient [4, 115, 132, 133] under some assumptions. We consider the realizable setting and the Gaussian input assumption to obtain a provable and efficient tensor method.
In this section, we discuss how to use the tensor method to initialize the parameters into the local strong convexity region. Define the following quantities: γ_j(σ) = E_{z∼N(0,1)}[φ(σ · z) z^j] for j = 0, 1, 2, 3. For a vector v ∈ R^d and the identity matrix I, define a special outer product ⊗̃ as follows: v ⊗̃ I := Σ_{j=1}^d [v ⊗ e_j ⊗ e_j + e_j ⊗ v ⊗ e_j + e_j ⊗ e_j ⊗ v]. We denote w̄ = w/‖w‖ and x_i = P_i · x. For each i ∈ [r], we can calculate
the second-order and third-order moments,

M_{i,2} = E_{(x,y)∼D}[y · (x_i ⊗ x_i − I)] = Σ_{j=1}^t (γ_2(‖w_j^*‖) − γ_0(‖w_j^*‖)) w̄_j^{*⊗2}, (5.5)

M_{i,3} = E_{(x,y)∼D}[y · (x_i^{⊗3} − x_i ⊗̃ I)] = Σ_{j=1}^t (γ_3(‖w_j^*‖) − 3γ_1(‖w_j^*‖)) w̄_j^{*⊗3}. (5.6)
For simplicity, we assume γ_2(‖w_j^*‖) ≠ γ_0(‖w_j^*‖) and γ_3(‖w_j^*‖) ≠ 3γ_1(‖w_j^*‖) for every j ∈ [t], so that M_{i,2} ≠ 0 and M_{i,3} ≠ 0. When this assumption does not hold, we can turn to higher-order moments and reduce them to second-order or third-order moments. Now we can use non-orthogonal tensor decomposition [78] to decompose the empirical version of M_{i,3} and obtain estimates of w_j^* for j ∈ [t]. According to [153], from the empirical versions of M_{i,2} and M_{i,3}, we are able to estimate W^* to some precision.
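As a sanity check, the identity (5.5) can be verified by Monte Carlo simulation. The sketch below (our own setup: a single unit-norm ReLU kernel, for which γ_2(1) − γ_0(1) = 1/√(2π)) compares the empirical moment of the first patch with the predicted rank-one form:

```python
import numpy as np

rng = np.random.default_rng(0)
k, r, t = 3, 2, 1                      # single unit-norm ReLU kernel; d = k * r
W = rng.normal(size=(k, t))
W /= np.linalg.norm(W, axis=0)

n = 400_000
x = rng.normal(size=(n, r, k))         # x[s, i] = P_i x_s, the i-th patch of sample s
y = np.maximum(x @ W, 0).sum(axis=(1, 2))

# Empirical M_{1,2} = E[y (x_1 x_1^T - I)]
x1 = x[:, 0, :]
M2 = np.einsum('n,ni,nj->ij', y, x1, x1) / n - y.mean() * np.eye(k)

# Population prediction (5.5): (gamma_2 - gamma_0) w* w*^T, with
# gamma_2(1) - gamma_0(1) = 1/sqrt(2*pi) for ReLU and a unit-norm kernel
pred = (1 / np.sqrt(2 * np.pi)) * (W @ W.T)
print(np.max(np.abs(M2 - pred)))       # small, and shrinking as n grows
```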
Theorem 5.4.1. For any 0 < ε < 1 and s ≥ 1, if

|S| ≥ ε^{−2} · k · poly(s, t, κ, log d),

then there exists an algorithm (based on non-orthogonal tensor decomposition [78]) that takes O(tk|S|) time and outputs a matrix W^{(0)} ∈ R^{k×t} such that, with probability at least 1 − d^{−Ω(s)},

‖W^{(0)} − W^*‖_F ≤ ε · poly(t, κ) ‖W^*‖_F.
Therefore, setting ε = ρ(σ_t)²/poly(t, κ, λ), W^{(0)} will satisfy the initialization condition in Theorem 5.3.2.
Algorithm 5.5.1 Globally Converging Algorithm
1: procedure LEARNING1CNN(S, T)                ▷ Theorem 5.5.1
2:   η ← 1/(tr²σ_1^{2p})
3:   S_0, S_1, · · · , S_T ← PARTITION(S, T + 1)
4:   W^{(0)} ← TENSORINITIALIZATION(S_0)
5:   for q = 0, 1, 2, · · · , T − 1 do
6:     W^{(q+1)} ← W^{(q)} − η ∇f_{S_{q+1}}(W^{(q)})
7:   return W^{(T)}
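The procedure can be sketched in code as follows. For illustration only, the tensor initialization is replaced by a small perturbation of the ground truth, and the step size is an ad-hoc stand-in (the experiments in Section 5.6 suggest random initialization also works in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
k, r, t, T = 3, 2, 2, 2000
phi = lambda z: np.maximum(z, 0) ** 2              # squared ReLU (smooth, Property 3(a))
dphi = lambda z: 2 * np.maximum(z, 0)

Wstar = np.linalg.qr(rng.normal(size=(k, t)))[0]   # ground truth with orthonormal columns

def batch(n):
    x = rng.normal(size=(n, r, k))                 # fresh samples: x[s, i] = P_i x_s
    return x, phi(x @ Wstar).sum(axis=(1, 2))

def grad(W, x, y):
    res = phi(x @ W).sum(axis=(1, 2)) - y
    return np.einsum('n,nrt,nrk->kt', res, dphi(x @ W), x) / len(y)

W = Wstar + 0.1 * rng.normal(size=(k, t))          # stand-in for TENSORINITIALIZATION
eta = 0.05                                         # stand-in for 1/(t r^2 sigma_1^{2p})
for _ in range(T):                                 # resampling: a fresh batch S_q per step
    x, y = batch(256)
    W -= eta * grad(W, x, y)

print(np.linalg.norm(W - Wstar))                   # should end far below the initial perturbation
```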
5.5 Recovery Guarantee
In this section, we show the global convergence of gradient descent initialized by the tensor method (Algorithm 5.5.1), by combining the local convergence of gradient descent (Theorem 5.3.2) and the tensor initialization guarantee (Theorem 5.4.1).
Theorem 5.5.1 (Global convergence guarantees). Let S denote a set of i.i.d. samples from the distribution D (defined in (5.1)) and let the activation function satisfy Properties 1, 4 and 3(a). Then for any s ≥ 1 and any ε > 0, if

|S| ≥ d log(1/ε) · poly(log d, s, t, λ, r),
T ≥ log(1/ε) · poly(t, r, λ, σ_1^{2p}/ρ),
η ∈ (0, 1/(tr²σ_1^{2p})],

then there is an algorithm (procedure LEARNING1CNN in Algorithm 5.5.1) taking

|S| · d · poly(log d, t, r, λ)

time and outputting a matrix W^{(T)} ∈ R^{k×t} satisfying

‖W^{(T)} − W^*‖_F ≤ ε ‖W^*‖_F,
with probability at least 1 − d^{−Ω(s)}.

Figure 5.1: (a) (left) Minimal eigenvalue of the Hessian at the ground truth for different activations, against the sample size. (b) (right) Convergence of gradient descent with different random initializations.
5.6 Numerical Experiments
In this section, we run experiments on synthetic data to verify our analysis. We set W^* = UΣV^⊤, where U ∈ R^{k×t} and V ∈ R^{t×t} have orthonormal columns generated from the QR decomposition of Gaussian matrices, and Σ is a diagonal matrix whose entries are 1, 1 + (κ−1)/(t−1), 1 + 2(κ−1)/(t−1), · · · , κ, so that κ is the condition number of W^*. Data points {(x_i, y_i)}_{i=1,2,··· ,n} are then generated from the distribution D (defined in Eq. (5.1)) with W^*. In this experiment, we set κ = 2, d = 10, k = 5, r = 2 and t = 2.
In our first experiment, we examine the minimal eigenvalue of the Hessian at the ground truth for different numbers of samples and different activation functions. As we can see from Fig. 5.1(a), the minimal eigenvalues for the ReLU, squared ReLU and sigmoid activations are positive, while the minimal eigenvalue of the Hessian for the quadratic activation is zero. Note that we use a log scale for the y-axis. Also, we can see that as the sample size increases, the minimal eigenvalue converges to the minimal eigenvalue of the population Hessian.
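The first experiment is easy to reproduce. The sketch below (our own small-scale setup) forms the empirical Hessian at the ground truth, where the residual term vanishes so each sample contributes a rank-one term, and compares its smallest eigenvalue under the quadratic and ReLU activations:

```python
import numpy as np

rng = np.random.default_rng(1)
k, r, t, n = 4, 2, 2, 20_000
Wstar = np.linalg.qr(rng.normal(size=(k, t)))[0]   # ground truth with orthonormal columns

def hessian_min_eig(dphi):
    x = rng.normal(size=(n, r, k))
    # Per-sample factor for kernel j: sum_i phi'(w_j*^T x_i) x_i (residual term is 0 at W*)
    g = np.einsum('nrt,nrk->ntk', dphi(x @ Wstar), x).reshape(n, t * k)
    H = g.T @ g / n                                # empirical Hessian at the ground truth
    return np.linalg.eigvalsh(H)[0]

print(hessian_min_eig(lambda z: 2 * z))                   # quadratic: 0 (flat skew direction)
print(hessian_min_eig(lambda z: (z > 0).astype(float)))   # ReLU: bounded away from 0
```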
In the second experiment, we demonstrate how gradient descent converges. We use squared ReLU as an example, pick step size η = 0.01 for gradient descent and set n = 1000. In the experiments, we do not resample at each iteration, since the algorithm still works well without resampling. The results are shown in Fig. 5.1(b), where different lines use different initializations sampled from a normal distribution. All the lines share two properties: 1) they converge to the global optimum; 2) they exhibit a linear convergence rate when the objective value is close to zero, which verifies Theorem 5.3.2.
5.7 Related Work
With the great success of neural networks, there is an increasing amount
of literature that provides theoretical analysis and guarantees for NNs. Some of
them measure the expressive power of NNs [31, 32, 34, 99, 101, 124] in order to
explain the remarkable performance of NNs on complex tasks. Many other works
try to handle the non-convexity of NNs by showing that the global optima or local
minima close to the global optima will be achieved when the number of parameters
is large enough [35, 57, 59, 89, 105]. However, such over-parameterization can easily overfit the training data and limit generalization.
In this work, we consider parameter recovery guarantees, where the typical
setting is to assume an underlying model and then try to recover the model. Once the
parameters of the underlying model are recovered, generalization performance will
also be guaranteed. Many non-convex problems, such as matrix completion/sensing
[71] and mixed linear regression [151], have nice recovery guarantees. Recovery
guarantees for FCNNs have been studied in several works by different approaches.
One of these approaches is the tensor method [72, 109]. In particular, [109] guarantee recovery of the subspace spanned by the weight matrix but give no sample complexity analysis, while [72] provide recovery of the parameters and require O(d³/ε²) sample complexity. [125, 126, 153] consider the recovery of one-hidden-layer FCNNs
using gradient-descent based algorithms. [125, 126] provide recovery guarantees for one-hidden-layer FCNNs with an orthogonal weight matrix and ReLU activations, given an infinite number of samples drawn from a Gaussian distribution. [153] show
the local strong convexity of the squared loss for one-hidden-layer FCNNs and use the tensor method to initialize the parameters into the local strong convexity region, followed by gradient descent that converges to the ground truth parameters. In this work, we consider recovery guarantees for non-overlapping CNNs.
There is also a growing body of theoretical literature on CNNs. [32] consider CNNs as generalized tensor decompositions and show the expressive power
and depth efficiency of CNNs. [96] study the loss surface of CNNs. [21] provide a
globally converging guarantee of gradient descent on one-hidden-layer CNNs. [38]
eliminate the Gaussian input assumption and only require a weaker assumption on
the inputs. However, 1) their analysis depends on ReLU activations, 2) they only
consider one kernel. [39] show that, with random initialization, gradient descent with weight normalization converges to the ground truth parameters with constant probability. In this chapter, we provide recovery guarantees for CNNs with multiple kernels and give a sample complexity analysis. Moreover, our analysis applies to a large range of activations, including the most commonly used ones. Another
approach for CNNs that is worth mentioning is convex relaxation [149], where
the class of CNN filters is relaxed to a reproducing kernel Hilbert space (RKHS).
They show a generalization error bound for this relaxation. However, to pair with an RKHS, their analysis works only for several uncommonly used activations. Also,
the learned function produced by the convex relaxation is no longer the original CNN. Recently, [53] apply isotonic regression to learning CNNs with overlapping patches. Their method uses a milder assumption on the data input and does not need special initialization; however, it cannot handle multiple kernels either.
Chapter 6
Non-linear Inductive Matrix Completion
This chapter considers non-linear inductive matrix completion that has ap-
plications in recommendation systems. The goal of a recommendation system is to
predict the interest of a user in a given item by exploiting the existing set of ratings
as well as certain user/item features. A standard approach to modeling this problem
is Inductive Matrix Completion where the predicted rating is modeled as an inner
product of the user and the item features projected onto a latent space. In order to
learn the parameters effectively from a small number of observed ratings, the latent
space is constrained to be low-dimensional which implies that the parameter matrix
is constrained to be low-rank. However, such bilinear modeling of the ratings can
be limiting in practice and non-linear prediction functions can lead to significant
improvements. A natural approach to introducing non-linearity in the prediction
function is to apply a non-linear activation function on top of the projected user/item
features. Imposition of non-linearities further complicates an already challenging
problem that has two sources of non-convexity: a) low-rank structure of the pa-
rameter matrix, and b) non-linear activation function. We show that one can still
solve the non-linear Inductive Matrix Completion problem using gradient descent
type methods as long as the solution is initialized well. That is, close to the optima,
the objective function is strongly convex and hence admits standard optimization techniques, at least for certain activation functions, such as sigmoid and tanh.
We also highlight the importance of the activation function and show how ReLU can behave significantly differently from, say, a sigmoid function. Finally, we apply
our proposed technique to recommendation systems and semi-supervised cluster-
ing, and show that our method can lead to much better performance than standard
linear Inductive Matrix Completion methods.
6.1 Introduction to Inductive Matrix Completion
Matrix Completion (MC), or collaborative filtering [24, 55], is by now a standard technique for modeling recommendation system problems, where a few user-item ratings are available and the goal is to predict the rating for any user-item pair. However, standard collaborative filtering suffers from two drawbacks: 1) the cold-start problem: MC cannot make predictions for new users or items; 2) missing side-information: MC cannot leverage side-information that is typically present in recommendation systems, such as features for users/items. Consequently, several methods [2, 69, 104, 136] have been proposed to leverage the side information together with the ratings. Inductive matrix completion (IMC) [2, 69] is one of the most
with the ratings. Inductive matrix completion (IMC) [2, 69] is one of the most
popular methods in this class.
IMC models the ratings as the inner product between certain linear mappings of the user/item features, i.e., A(x, y) = ⟨U^⊤x, V^⊤y⟩, where A(x, y) is the predicted rating of user x for item y, and x ∈ R^{d_1}, y ∈ R^{d_2} are the feature vectors. The parameters U ∈ R^{d_1×k}, V ∈ R^{d_2×k} (k ≤ d_1, k ≤ d_2) can typically be learned using a small number of observed ratings [69].
However, the bilinear structure of IMC is fairly simplistic and limiting in practice, and might lead to poor accuracy on real-world recommendation problems. For example, consider the YouTube recommendation system [33], which requires predictions over videos. Naturally, a linear function over the pixels of videos will
lead to fairly inaccurate predictions and hence one needs to model the videos using
non-linear networks. The survey paper [145] presents many more such examples where one needs to design a non-linear rating prediction function over the input features, including [82] for image recommendation, [131] for music recommendation and [143] for recommendation systems with multiple types of inputs.
We can introduce non-linearity in the prediction function using several standard techniques; however, if our parameterization admits too many free parameters, then learning them might be challenging, as the number of available user-item ratings tends to be fairly small. Instead, we use a simple non-linear extension of IMC
that can control the number of parameters to be estimated. Note that IMC based pre-
diction function can be viewed as an inner product between certain latent user-item
features where the latent features are a linear map of the raw user-item features.
To introduce non-linearity, we can use a non-linear mapping of the raw user-item
features rather than the linear mapping used by IMC. This leads to the following
general framework that we call non-linear inductive matrix completion (NIMC),
A(x, y) = 〈U(x),V(y)〉, (6.1)
where x ∈ X, y ∈ Y are the feature vectors, A(x, y) is their rating and U : X →
S,V : Y → S are non-linear mappings from the raw feature space to the latent
space.
The above general framework reduces to standard inductive matrix completion when U, V are linear mappings, and further reduces to matrix completion when x_i, y_j are the unit vectors e_i, e_j for the i-th user and j-th item, respectively. When [x_i, e_i] is used as the feature vector and U is restricted to be a two-block diagonal matrix (one block for x_i and the other for e_i), the above framework reduces to the dirtyIMC model [29]. Similarly, U/V can also be neural networks (NNs), such as feedforward
NNs [33, 112], convolutional NNs for images and recurrent NNs for speech/text.
In this chapter, we focus on a simple nonlinear activation based mapping for the user-item features. That is, we set U(x) = φ(U^{*⊤}x) and V(y) = φ(V^{*⊤}y), where φ is a nonlinear activation function. Note that if φ is ReLU, then the latent space is guaranteed to lie in the non-negative orthant, which in itself can be a desirable property for certain recommendation problems.
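A minimal sketch of this prediction function (with stand-in parameter matrices; the helper name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k = 5, 4, 3
U = rng.normal(size=(d1, k))           # stand-ins for the learned parameters U*, V*
V = rng.normal(size=(d2, k))
relu = lambda z: np.maximum(z, 0)

def rating(x, y, phi):
    """NIMC prediction A(x, y) = phi(U^T x)^T phi(V^T y)."""
    return phi(U.T @ x) @ phi(V.T @ y)

x, y = rng.normal(size=d1), rng.normal(size=d2)
print(rating(x, y, relu))              # ReLU keeps the latent features non-negative
print(rating(x, y, lambda z: z))       # identity activation recovers linear IMC: x^T U V^T y
```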
Note that parameter estimation in both IMC and NIMC models is hard due
to the non-convexity of the corresponding optimization problem. However, for "nice" data, several strong results are known for the linear models, such as [24, 46, 71] for MC and [29, 69, 136] for IMC. In contrast, the non-linearity in NIMC models adds to the complexity of an already challenging problem and has not been studied extensively, despite its popularity in practice.
In this chapter, we study a simple one-layer neural network style NIMC
model mentioned above. In particular, we formulate a squared-loss based optimiza-
tion problem for estimating parameters U∗ and V ∗. We show that under a realizable
model and Gaussian input assumption, the objective function is locally strongly convex within a "reasonably large" neighborhood of the ground truth. Moreover,
we show that the above strong convexity claim holds even if the number of ob-
served ratings is nearly-linear in dimension and polynomial in the conditioning of
the weight matrices. In particular, for well-conditioned matrices, we can recover
the underlying parameters using only poly log(d1 + d2) user-item ratings, which is
critical for practical recommendation systems as they tend to have very few ratings
available per user. Our analysis covers popular activation functions, e.g., sigmoid and ReLU, and discusses various subtleties that arise due to the activation function.
Finally we discuss how we can leverage standard tensor decomposition techniques
to initialize our parameters well. We would like to stress that practitioners typically
use random initialization itself, and hence results studying random initialization for
NIMC model would be of significant interest.
As mentioned above, due to non-linearity of activation function along with
non-convexity of the parameter space, the existing proof techniques do not apply
directly to the problem. Moreover, we have to carefully argue about both the op-
timization landscape as well as the sample complexity of the algorithm which is
not carefully studied for neural networks. Our proof establishes some new tech-
niques that might be of independent interest, e.g., how to handle the redundancy in
the parameters for ReLU activation. To the best of our knowledge, this is one of
the first theoretically rigorous studies of neural-network based recommendation systems and will hopefully be a stepping stone for similar analyses of "deeper" neural-network based recommendation systems. We would also like to highlight that our
model can be viewed as a strict generalization of a one-hidden layer neural network,
hence our result represents one of the few rigorous guarantees for models that are
more powerful than one-hidden layer neural networks [22, 85, 153].
Finally, we apply our model on synthetic datasets and verify our theoretical
analysis. Further, we compare our NIMC model with standard linear IMC on sev-
eral real-world recommendation-type problems, including user-movie rating pre-
diction, gene-disease association prediction and semi-supervised clustering. NIMC
demonstrates significantly superior performance over IMC.
Roadmap. We first present the formal model and the corresponding opti-
mization problem in Section 6.2. We then present the local strong convexity and
local linear convergence results in Section 6.3. Finally, we demonstrate the empiri-
cal superiority of NIMC when compared to linear IMC (Section 6.5).
6.2 Problem Formulation
Consider a user-item recommender system, where we have n_1 users with feature vectors X := {x_i}_{i∈[n_1]} ⊆ R^{d_1}, n_2 items with feature vectors Y := {y_j}_{j∈[n_2]} ⊆ R^{d_2}, and a collection of partially-observed user-item ratings, A_obs = {A(x, y) | (x, y) ∈ Ω ⊆ X × Y}. That is, A(x_i, y_j) is the rating that user x_i gave for item y_j. For simplicity, we assume the x_i's and y_j's are sampled i.i.d. from distributions X and Y, respectively. Each element of the index set Ω is also sampled independently and uniformly with replacement from S := X × Y.
In this chapter, our goal is to predict the rating for any user-item pair with
feature vectors x and y, respectively. We model the user-item ratings as:
A(x, y) = φ(U^{*⊤}x)^⊤ φ(V^{*⊤}y), (6.2)
where U^* ∈ R^{d_1×k}, V^* ∈ R^{d_2×k}, and φ is a non-linear activation function. Under this realizable model, our goal is to recover U^*, V^* from a collection of observed entries, {A(x, y) | (x, y) ∈ Ω}. Without loss of generality, we set d_1 = d_2 = d. We also treat k as a constant throughout the chapter. Our analysis requires U^*, V^* to be full column rank, so we require k ≤ d. W.l.o.g., we assume σ_k(U^*) = σ_k(V^*) = 1, i.e., the smallest singular value of both U^* and V^* is 1.
Note that this model is similar to the one-hidden-layer feed-forward network popular in standard classification/regression tasks. However, as there is an inner product between the outputs of two non-linear layers, φ(U^{*⊤}x) and φ(V^{*⊤}y), it cannot be modeled by a single hidden-layer neural network (with the same number of nodes). Also, for a linear activation function, the problem reduces to inductive matrix completion [2, 69].
Now, to solve for U^*, V^*, we optimize a simple squared-loss based objective, i.e.,

min_{U∈R^{d_1×k}, V∈R^{d_2×k}}  f_Ω(U, V) = Σ_{(x,y)∈Ω} ( φ(U^⊤x)^⊤ φ(V^⊤y) − A(x, y) )². (6.3)
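The objective (6.3) is simple to set up under the realizable model (6.2). A sketch (our own small synthetic instance, with sigmoid activation); the loss is zero at the ground truth and positive elsewhere:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n1, n2 = 6, 2, 30, 40
Ustar, Vstar = rng.normal(size=(d, k)), rng.normal(size=(d, k))
sigmoid = lambda z: 1 / (1 + np.exp(-z))

X, Y = rng.normal(size=(n1, d)), rng.normal(size=(n2, d))
A = sigmoid(X @ Ustar) @ sigmoid(Y @ Vstar).T          # full rating matrix under (6.2)
Omega = [(rng.integers(n1), rng.integers(n2)) for _ in range(200)]

def f(U, V):
    """Squared loss (6.3) over the observed index set Omega."""
    return sum((sigmoid(X[i] @ U) @ sigmoid(Y[j] @ V) - A[i, j]) ** 2
               for i, j in Omega)

print(f(Ustar, Vstar))                                 # 0 at the ground truth
print(f(Ustar + 0.1, Vstar))                           # > 0 away from it
```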
Naturally, the above problem is a challenging non-convex optimization problem, strictly harder than two non-convex problems that are challenging in their own right: a) linear inductive matrix completion, where non-convexity arises from the bilinearity of UV^⊤, and b) the standard one-hidden-layer neural network (NN). In fact, a lot of recent research has focused on understanding various properties of both the linear inductive matrix completion problem [46, 69] and the one-hidden-layer NN [47, 153].
In this chapter, we show that despite the non-convexity of Problem (6.3), it
behaves as a convex optimization problem close to the optima if the data is sampled
stochastically from a Gaussian distribution. This result combined with standard
tensor decomposition based initialization [72, 78, 153] leads to a polynomial time
algorithm for solving (6.3) optimally if the data satisfies certain sampling assumptions, stated in Theorem 6.2.1. Moreover, we also discuss the effect of various activation functions, especially the difference between sigmoid and ReLU activations (see Theorem 6.3.1 and Theorem 6.3.3).
Informally, our recovery guarantee can be stated as follows,
Theorem 6.2.1 (Informal Recovery Guarantee). Consider a recommender system with the realizable model of Eq. (6.2) with sigmoid activation. Assume the features {x_i}_{i∈[n_1]} and {y_j}_{j∈[n_2]} are sampled i.i.d. from the normal distribution and the observed pairs Ω are sampled i.i.d. from {x_i}_{i∈[n_1]} × {y_j}_{j∈[n_2]} uniformly at random. Then there exists an algorithm such that U^*, V^* can be recovered to any precision ε, with time complexity and sample complexity (i.e., n_1, n_2, |Ω|) polynomial in the dimension and the condition number of U^*, V^*, and logarithmic in 1/ε.
6.3 Local Strong Convexity
Our main result shows that, when initialized properly, gradient-based algorithms are guaranteed to converge to the ground truth. We first study the Hessian of the empirical risk for different activation functions; then, based on the positive definiteness of the Hessian for smooth activations, we show local linear convergence of gradient descent. The proof sketch is provided in Appendix D.1.
The positive definiteness of the Hessian does not hold for several activation functions; here we provide some examples. Counter Example 1) The Hessian at the ground truth for the linear activation is not positive definite, because for any full-rank matrix R ∈ R^{k×k}, (U^*R, V^*R^{−⊤}) is also a global optimum. Counter Example 2) The Hessian at the ground truth for the ReLU activation is not positive definite, because for any diagonal matrix D ∈ R^{k×k} with positive diagonal entries, (U^*D, V^*D^{−1}) is also a global optimum. These counter examples have a common property: there is redundancy in the parameters. Surprisingly, for sigmoid and tanh, the Hessian around the ground truth is positive definite. More surprisingly, we will later show that for ReLU, if the parameter space is constrained properly, the Hessian at a given point near the ground truth can also be proved to be positive definite with high probability.
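The ReLU counterexample can be checked directly: ReLU is positively homogeneous, φ(cz) = cφ(z) for c > 0, so rescaling the columns by (D, D^{−1}) leaves every rating unchanged. A small sketch (stand-in parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3
U, V = rng.normal(size=(d, k)), rng.normal(size=(d, k))
D = np.diag(rng.uniform(0.5, 2.0, size=k))             # positive diagonal rescaling
relu = lambda z: np.maximum(z, 0)

x, y = rng.normal(size=d), rng.normal(size=d)
a1 = relu(U.T @ x) @ relu(V.T @ y)
# Positive homogeneity: relu(D z) = D relu(z), so the D and D^{-1} factors cancel
# in the inner product and (U D, V D^{-1}) yields an identical rating.
a2 = relu((U @ D).T @ x) @ relu((V @ np.linalg.inv(D)).T @ y)
print(abs(a1 - a2))                                    # 0 up to round-off
```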
Local Geometry and Local Linear Convergence for Sigmoid and Tanh
We define two natural condition numbers for the problem that capture its "hardness":

Definition 6.3.1. Define λ := max{λ(U^*), λ(V^*)} and κ := max{κ(U^*), κ(V^*)}, where λ(U) = σ_1^k(U)/(Π_{i=1}^k σ_i(U)), κ(U) = σ_1(U)/σ_k(U), and σ_i(U) denotes the i-th singular value of U, with the ordering σ_i ≥ σ_{i+1}.
First we show the result for sigmoid and tanh activations.
Theorem 6.3.1 (Positive Definiteness of the Hessian for Sigmoid and Tanh). Let the activation function φ in the NIMC model (6.2) be sigmoid or tanh, and let κ, λ be as defined in Definition 6.3.1. Then for any t > 1 and any given U, V, if

n_1 ≳ tλ⁴κ²d log²d,  n_2 ≳ tλ⁴κ²d log²d,  |Ω| ≳ tλ⁴κ²d log²d,

and ‖U − U^*‖ + ‖V − V^*‖ ≲ 1/(λ²κ),

then with probability at least 1 − d^{−t}, the smallest eigenvalue of the Hessian of Eq. (6.3) is lower bounded by:

λ_min(∇²f_Ω(U, V)) ≳ 1/(λ²κ).
Remark. Theorem 6.3.1 shows that, given a sufficiently large number of user–item ratings and a sufficiently large number of users/items themselves, the Hessian at a point close enough to the true parameters $U^*, V^*$ is positive definite with high probability. The sample complexity, including $n_1$, $n_2$ and $|\Omega|$, has a near-linear dependency on the dimension, which matches the linear IMC analysis [69]. The strong convexity parameter, as well as the sample complexity, depends on the condition numbers of $U^*, V^*$ as defined in Definition 6.3.1. Although we don't explicitly show the dependence on $k$, both the sample complexity and the minimal eigenvalue scale as a polynomial of $k$. The proofs can be found in Appendix D.1.
The above theorem shows that the Hessian is positive definite w.h.p. for a given $(U, V)$ that is close to the optimum. This result, along with the smoothness of the activation function, implies linear convergence of gradient descent that samples a fresh batch of samples in each iteration, as shown in the following theorem, whose proof is postponed to Appendix D.3.1.
Theorem 6.3.2. Let $[U^c, V^c]$ be the parameters in the $c$-th iteration. Assume $\|U^c - U^*\| + \|V^c - V^*\| \lesssim 1/(\lambda^2\kappa)$. Then, given a fresh sample set $\Omega$ that is independent of $[U^c, V^c]$ and satisfies the conditions in Theorem 6.3.1, the next iterate of one step of gradient descent, i.e., $[U^{c+1}, V^{c+1}] = [U^c, V^c] - \eta\nabla f_\Omega(U^c, V^c)$, satisfies

$$\|U^{c+1} - U^*\|_F^2 + \|V^{c+1} - V^*\|_F^2 \le (1 - M_l/M_u)\big(\|U^c - U^*\|_F^2 + \|V^c - V^*\|_F^2\big)$$

with probability $1 - d^{-t}$, where $\eta = \Theta(1/M_u)$ is the step size, $M_l \gtrsim 1/(\lambda^2\kappa)$ is the lower bound on the eigenvalues of the Hessian, and $M_u \lesssim 1$ is the upper bound on the eigenvalues of the Hessian.
Remark. The linear convergence requires a fresh set of samples in each iteration. However, since the iterates converge linearly to the ground truth, we only need $\log(1/\epsilon)$ iterations, so the sample complexity is only logarithmic in $1/\epsilon$. This dependency is better than directly using the tensor decomposition method [72], which requires $O(1/\epsilon^2)$ samples. Note that we only use tensor decomposition to initialize the parameters; therefore the sample complexity required by our tensor initialization does not depend on $\epsilon$.
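To make the resampled scheme concrete, here is a minimal numerical sketch of gradient descent with one fresh batch per iteration for the sigmoid NIMC objective; the sizes, step size, and initialization radius are our own illustrative choices, not constants from the analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d, k, m = 5, 2, 2000                         # toy sizes, not from the theory
U_star, V_star = rng.normal(size=(d, k)), rng.normal(size=(d, k))

def fresh_batch():
    """Fresh Gaussian users/items with their exact ratings A(x, y)."""
    X, Y = rng.normal(size=(m, d)), rng.normal(size=(m, d))
    return X, Y, np.sum(sigmoid(X @ U_star) * sigmoid(Y @ V_star), axis=1)

def loss_grads(U, V, X, Y, A):
    P, Q = sigmoid(X @ U), sigmoid(Y @ V)    # m x k hidden activations
    r = np.sum(P * Q, axis=1) - A            # residuals
    gU = X.T @ (r[:, None] * P * (1 - P) * Q) / m   # sigmoid'(z) = s(z)(1 - s(z))
    gV = Y.T @ (r[:, None] * Q * (1 - Q) * P) / m
    return 0.5 * np.mean(r ** 2), gU, gV

# start inside a small ball around (U*, V*), as the local analysis requires
U = U_star + 0.05 * rng.normal(size=(d, k))
V = V_star + 0.05 * rng.normal(size=(d, k))
loss0 = loss_grads(U, V, *fresh_batch())[0]
eta = 0.5                                    # constant step size
for _ in range(500):                         # one fresh batch per iteration
    _, gU, gV = loss_grads(U, V, *fresh_batch())
    U, V = U - eta * gU, V - eta * gV
loss1 = loss_grads(U, V, *fresh_batch())[0]
```

On such a toy instance the final empirical risk should fall well below the initial one, in line with the geometric decay of Theorem 6.3.2.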
Empirical Hessian around the Ground Truth for ReLU
We now present our result for the ReLU activation. As we saw in Counter Example 2, without further modification the Hessian for ReLU is not locally strongly convex, due to the redundancy in the parameters. Therefore, we reduce the parameter space by fixing one parameter for each $(u_i, v_i)$ pair, $i \in [k]$. In particular, we fix $u_{1,i} = u^*_{1,i}$, $\forall i \in [k]$, when minimizing the objective function, Eq. (6.3), where $u_{1,i}$ is the $i$-th element of the first row of $U$. Note that as long as $u^*_{1,i} \neq 0$, $u_{1,i}$ can be fixed to any other non-zero value; we set $u_{1,i} = u^*_{1,i}$ just for simplicity of the proof. The new objective function can be represented as
$$f^{\mathrm{ReLU}}_\Omega(W, V) = \frac{1}{2|\Omega|}\sum_{(x,y)\in\Omega}\Big(\phi\big(W^\top x_{2:d} + x_1 (u^{*(1)})^\top\big)^\top \phi(V^\top y) - A(x, y)\Big)^2, \qquad (6.4)$$

where $u^{*(1)}$ is the first row of $U^*$ and $W \in \mathbb{R}^{(d-1)\times k}$.
Surprisingly, after fixing one parameter for each $(u_i, v_i)$ pair, the Hessian for ReLU is also positive definite w.h.p. for a given $(U, V)$ around the ground truth.
Theorem 6.3.3 (Positive Definiteness of Hessian for ReLU). Define $u_0 := \min_{i\in[k]} |u^*_{1,i}|$. For any $t > 1$ and any given $U, V$, if

$$n_1 \gtrsim u_0^{-4} t\lambda^4\kappa^{12} d\log^2 d, \quad n_2 \gtrsim u_0^{-4} t\lambda^4\kappa^{12} d\log^2 d, \quad |\Omega| \gtrsim u_0^{-4} t\lambda^4\kappa^{12} d\log^2 d,$$

$$\|W - W^*\| + \|V - V^*\| \lesssim u_0^4/(\lambda^4\kappa^{12}),$$

then with probability $1 - d^{-t}$, the minimal eigenvalue of the Hessian of the objective for the ReLU activation function, Eq. (6.4), is lower bounded:

$$\lambda_{\min}\big(\nabla^2 f^{\mathrm{ReLU}}_\Omega(W, V)\big) \gtrsim u_0^2/(\lambda^2\kappa^4).$$
Remark. Similar to the sigmoid/tanh case, the sample complexity for the ReLU case also has a near-linear dependency on the dimension. However, here we have a worse dependency on the condition number of the weight matrices. The scale of $u_0$ can also be important, and in practice one needs to set it carefully. Note that although the activation function is not smooth, the Hessian at a given point still exists with probability 1, since ReLU is smooth almost everywhere and there are only a finite number of samples. However, owing to the non-smoothness, a proof of convergence of the gradient descent method for ReLU is still an open problem.
6.4 Initialization and Recovery Guarantee
To recover the ground truth, our algorithm needs a good initialization method that can place the parameters in the neighborhood of the ground truth. Here we show that this is possible using the tensor method under the Gaussian assumption.
In the following, we consider estimating $U^*$; estimating $V^*$ is similar. Define

$$M_3 := \mathbb{E}\big[A(x, y)\cdot(x^{\otimes 3} - x\otimes I)\big], \quad \text{where } x\otimes I := \sum_{j=1}^d \big[x\otimes e_j\otimes e_j + e_j\otimes x\otimes e_j + e_j\otimes e_j\otimes x\big].$$

Define $\gamma_j(\sigma) := \mathbb{E}[\phi(\sigma\cdot z)z^j]$, $\forall j = 0, 1, 2, 3$. Then $M_3 = \sum_{i=1}^k \alpha_i\, \bar{u}_i^{*\,\otimes 3}$, where $\bar{u}_i^* = u_i^*/\|u_i^*\|$ and $\alpha_i = \gamma_0(\|v_i^*\|)\big(\gamma_3(\|u_i^*\|) - 3\gamma_1(\|u_i^*\|)\big)$. When $\alpha_i \neq 0$, we can approximately recover $\alpha_i$ and $\bar{u}_i^*$ from the empirical version of $M_3$ using non-orthogonal tensor decomposition [78]. When $\phi$ is sigmoid, $\gamma_0(\|v_i^*\|) = 0.5$. Given $\alpha_i$, we can estimate $\|u_i^*\|$, since $\alpha_i$ is a monotonic function of $\|u_i^*\|$. Applying Lemma B.7 in [153], we can bound the approximation error between the empirical $M_3$ and the population $M_3$ using a polynomial number of samples. By [78], we can bound the estimation error of $\|u_i^*\|$ and $\bar{u}_i^*$. Finally, combining with Theorem 6.3.1, we are able to show the recovery guarantee for the sigmoid activation, i.e., Theorem 6.2.1.
Figure 6.1: The rate of success of GD over synthetic data, plotted as $m/(2kd)$ versus $n$. Left: sigmoid; right: ReLU. White blocks denote 100% success rate.
Although tensor initialization has nice theoretical guarantees and sample
complexity, it heavily depends on Gaussian assumption and realizable model as-
sumption. In contrast, practitioners typically use random initialization.
6.5 Experiments on Synthetic and Real-world Data
In this section, we present experimental results on both synthetic data and real-world data. Our experiments on synthetic data are intended to verify our theoretical analysis, while the experiments on real-world data show the superior performance of NIMC over IMC. We apply gradient descent with random initialization to both NIMC and IMC.
Synthetic Data
We first generate synthetic datasets to verify the sample complexity and the convergence of gradient descent with random initialization. We fix $k = 5$, $d = 10$. For sigmoid, we set the number of samples $n_1 = n_2 = n = 10i$, $i = 1, 2, \ldots, 10$,
Table 6.1: The error rate in semi-supervised clustering using NIMC and IMC.

Dataset    n      d    k   |Ω|   NIMC    IMC     NIMC-RFF  IMC-RFF
mushroom   8124   112  2   5n    0       0.0049  0         0
                           20n   0       0.0010  0         0
segment    2310   19   7   5n    0.0543  0.0694  0.0197    0.0257
                           20n   0.0655  0.0768  0.0092    0.0183
covtype    1708   54   7   5n    0.1671  0.1733  0.1548    0.1529
                           20n   0.1555  0.1600  0.1200    0.1307
letter     15000  16   26  5n    0.0590  0.0704  0.0422    0.0430
                           20n   0.0664  0.0760  0.0321    0.0356
yalefaces  2452   100  38  5n    0.0315  0.0329  0.0266    0.0273
                           20n   0.0212  0.0277  0.0064    0.0142
usps       7291   256  10  5n    0.0211  0.0361  0.0301    0.0185
                           20n   0.0184  0.0320  0.0199    0.0152
and the number of observations $|\Omega| = m = 2kd\cdot i$, $i = 1, 2, \ldots, 10$. For ReLU, we set $n = 20i$ and $m = 4kd\cdot i$, $i = 1, 2, \ldots, 10$. The sampling rule follows our previous assumptions. For each $(n, m)$ pair, we run 5 trials and report the fraction of successful recoveries. We say a solution $(U, V)$ successfully recovers the ground-truth parameters when it achieves a relative test error of at most $0.001$, i.e.,

$$\|\phi(X_t U)\phi(X_t U)^\top - \phi(X_t U^*)\phi(X_t U^*)^\top\|_F \le 0.001\cdot\|\phi(X_t U^*)\phi(X_t U^*)^\top\|_F,$$

where $X_t \in \mathbb{R}^{n\times d}$ is a newly sampled testing dataset. For both ReLU and sigmoid, we minimize the original objective function (6.3). We illustrate the recovery rate in Figure 6.1. As we can see, ReLU requires more samples/observations than sigmoid for exact recovery (note that the scales of $n$ and $m/(2kd)$ differ between the two figures). This is consistent with our theoretical results: comparing Theorem 6.3.1 and Theorem 6.3.3, the sample complexity for ReLU has a worse dependency on the conditioning of $U^*, V^*$ than for sigmoid. We can also see that when $n$ is sufficiently large, the number of observed ratings required remains the same for both methods. This is also consistent with the theorems, where $|\Omega|$ is near-linear in $d$ and independent of $n$.
Semi-supervised Clustering
We apply NIMC to semi-supervised clustering and follow the experimental
setting in GIMC [112]. In this problem, we are given a set of items with their
features, X ∈ Rn×d, where n is the number of items and d is the feature dimension,
and an incomplete similarity matrix $A$, where $A_{i,j} = 1$ if the $i$-th and $j$-th items are similar and $A_{i,j} = 0$ if they are dissimilar. The goal is to do clustering using both the existing features and the partially observed similarity matrix.
We build the dataset from a classification dataset where the label of each item is
known and will be used as the ground truth cluster. We first compute the similarity
matrix from the labels and sample |Ω| entries uniformly as the observed entries.
Since there is only one set of features, we set $y_j = x_j$ in the objective function Eq. (6.3).
We initialize $U$ and $V$ to be the same Gaussian random matrix and then apply gradient descent; this guarantees that $U$ and $V$ remain identical throughout the optimization process. Once $U$ converges, we take the top-$k$ left singular vectors of $\phi(XU)$ and run $k$-means clustering. Following [112], we define the clustering error as
$$\text{error} = \frac{2}{n(n-1)}\Bigg(\sum_{(i,j):\,\pi^*_i = \pi^*_j} \mathbf{1}_{\pi_i \neq \pi_j} + \sum_{(i,j):\,\pi^*_i \neq \pi^*_j} \mathbf{1}_{\pi_i = \pi_j}\Bigg),$$
where $\pi^*$ is the ground-truth clustering and $\pi$ is the predicted clustering. We compare NIMC with a ReLU activation function against IMC on six datasets, using
raw features and random Fourier features (RFF). The random Fourier feature map is $r(x) = \frac{1}{\sqrt{q}}\big[\sin(Qx)^\top\ \cos(Qx)^\top\big]^\top \in \mathbb{R}^{2q}$, where each entry of $Q \in \mathbb{R}^{q\times d}$ is sampled i.i.d. from $N(0, \sigma)$. We use random Fourier features in order to see how increasing the depth of the neural network changes the performance. Our analysis only covers one-layer neural networks; random Fourier features can be viewed as a two-layer neural network whose first-layer parameters are fixed.
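As a concrete sketch, the feature map above takes only a few lines; here $\sigma$ is read as the standard deviation of the entries of $Q$ (one interpretation of $N(0, \sigma)$), and the function name is ours:

```python
import numpy as np

def random_fourier_features(X, q=100, sigma=1.0, rng=None):
    """r(x) = (1/sqrt(q)) * [sin(Qx); cos(Qx)], with Q in R^{q x d} and
    i.i.d. Gaussian entries (sigma taken as the standard deviation here)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    Q = rng.normal(0.0, sigma, size=(q, X.shape[1]))
    Z = X @ Q.T                                   # n x q random projections
    return np.hstack([np.sin(Z), np.cos(Z)]) / np.sqrt(q)

# each mapped feature vector lies on the unit sphere in R^{2q},
# since sin^2 + cos^2 = 1 for every projection
F = random_fourier_features(np.random.default_rng(1).normal(size=(5, 16)), q=100)
```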
$\sigma$ is chosen such that a linear classifier using these random features achieves the best classification accuracy, and $q$ is set to 100 for all datasets. The datasets mushroom, segment, letter, usps and covtype are downloaded from the libsvm website. We subsample the covtype dataset to balance the samples from different classes, and we preprocess the yalefaces dataset as described in [79]. As shown in Table 6.1, when using raw features, NIMC achieves better clustering results than IMC in all cases. This is also true for most cases when using random Fourier features.
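For reference, the pairwise clustering error used above can be computed directly from the two label vectors; the helper below is our own illustration, not code from the experiments:

```python
import numpy as np

def clustering_error(pi_star, pi):
    """Fraction of ordered pairs (i, j), i != j, on which the predicted
    clustering pi disagrees with the ground truth pi_star: pairs that
    should be together but are split, plus pairs wrongly merged."""
    pi_star, pi = np.asarray(pi_star), np.asarray(pi)
    n = len(pi_star)
    same_true = pi_star[:, None] == pi_star[None, :]
    same_pred = pi[:, None] == pi[None, :]
    disagree = same_true != same_pred
    np.fill_diagonal(disagree, False)              # exclude i == j
    return disagree.sum() / (n * (n - 1))          # matches the 2/(n(n-1)) formula
```

Because only pair co-membership is compared, the error is invariant to relabeling the clusters.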
Recommendation Systems
Recommender systems are used in many real situations. Here we consider
two tasks.
Movie recommendation for users. We use the Movielens [1] dataset, which contains not only the ratings users give movies but also the users' demographic information and the movies' genre information. Our goal is to predict ratings that new users
Table 6.2: Test RMSE for recommending movies to new users on the Movielens datasets.

Dataset   #movies  #users  #ratings   #movie feat.  #user feat.  RMSE (NIMC)  RMSE (IMC)
ml-100k   1682     943     100,000    39            29           1.034        1.321
ml-1m     3883     6040    1,000,000  38            29           1.021        1.320
Figure 6.2: NIMC vs. IMC on the gene-disease association prediction task, panels (a)-(d); panel (d) plots precision (%) against recall (%) for IMC and NIMC.
will give the existing movies. We randomly split the users into existing users (training data) and new users (testing data) with ratio 4:1. The user features include 21 types of occupations, 7 age ranges and one gender feature; the movie features include 18 or 19 genre features (18 for ml-1m, 19 for ml-100k) and 20 features from the top 20 right singular vectors of the training rating matrix (which has size #training-users by #movies). In our experiments, we set $k$ to 50; for NIMC, we use the ReLU activation. As shown in Table 6.2, NIMC achieves much smaller RMSE than IMC on both the ml-100k and ml-1m datasets.
Gene-Disease association prediction. We use the dataset collected by
[92], which has 300 gene features and 200 disease features. Our goal is to predict
associated genes for a new disease given its features. Since the dataset only contains
positive labels, this is a problem known as positive-unlabeled learning [66] or one-class matrix factorization [139]. We adapt our objective function to the following:
$$f(U, V) = \frac{1}{2}\sum_{(i,j)\in\Omega}\big(\phi(U^\top x_i)^\top\phi(V^\top y_j) - A_{ij}\big)^2 + \beta\sum_{(i,j)\in\Omega^c}\big(\phi(U^\top x_i)^\top\phi(V^\top y_j)\big)^2, \qquad (6.5)$$
where $A$ is the association matrix, $\Omega$ is the set of indices of the observed associations, $\Omega^c$ is the complement of $\Omega$, and $\beta$ is the penalty weight for unobserved associations. There are 12331 genes and 3209 diseases in the dataset in total. We randomly split the diseases into training diseases and testing diseases with ratio 4:1. The results are presented in Fig. 6.2. We follow [92] and use the cumulative distribution of the ranks as a measure for comparing the performance of different methods, i.e., the probability that any ground-truth associated gene of a disease appears in the retrieved top-$r$ genes for that disease.
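One plausible reading of this rank-based measure, sketched as a hypothetical helper (the names and the exact aggregation over associations are our assumptions, not taken from [92]):

```python
import numpy as np

def topr_recall(scores, truth_pairs, r=100):
    """Fraction of ground-truth (disease, gene) associations whose gene is
    ranked among the top-r genes for that disease.
    `scores` is a (#diseases x #genes) matrix of predicted association scores."""
    order = np.argsort(-scores, axis=1)      # gene indices per disease, best first
    topr = order[:, :r]
    hits = sum(g in topr[d] for d, g in truth_pairs)
    return hits / len(truth_pairs)
```

Sweeping `r` traces out the cumulative distribution of the ranks used in Fig. 6.2(c).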
In Fig. 6.2(a), we show how $k$ changes the performance of NIMC and IMC. In general, the larger $k$, the better the performance. The performance of IMC becomes stable once $k$ exceeds 100, while the performance of NIMC is still increasing. Although IMC performs better than NIMC when $k$ is small, the performance of NIMC increases much faster with $k$ than that of IMC. In the experiment for Fig. 6.2(a), $\beta$ is fixed at 0.01 and $r = 100$. In Fig. 6.2(b), we show how $\beta$ in Eq. (6.5) affects the performance: we sweep $\beta$ over $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$, and $\beta = 10^{-3}$ and $10^{-2}$ give the best results. Fig. 6.2(c) shows the probability that any ground-truth associated gene of a disease appears in the retrieved top-$r$ genes, for different values of $r$; here we fix $k = 200$ and $\beta = 0.01$. Fig. 6.2(d) shows the precision-recall curves of the different methods for $k = 200$ and $\beta = 0.01$.
6.6 Related Work
Collaborative filtering: Our model is a non-linear version of the standard
inductive matrix completion model [69]. Practically, IMC has been applied to gene-disease prediction [92], matrix sensing [150], multi-label classification [140], blog recommender systems [111], link prediction [29] and semi-supervised clustering [29, 112]. However, IMC restricts the latent space of users/items to be a linear transformation of the user/item feature space. [112] extended the model to a three-layer neural network and showed significantly better empirical performance on multi-label/multi-class classification problems and semi-supervised problems.
Although standard IMC uses linear mappings, it is still a non-convex problem due to the bilinear term $UV^\top$. To deal with this non-convexity, [58, 69] provided recovery guarantees for alternating minimization with sample complexity linear in the dimension. [136] relaxed the problem to a nuclear-norm problem and also provided recovery guarantees. More general norms have been studied [102, 117-119], e.g., the weighted Frobenius norm and the entry-wise $\ell_1$ norm. More recently, [146] used gradient-based non-convex optimization and proved a better sample complexity. [29] studied dirtyIMC models and showed that the sample complexity can be improved when the features are informative compared to plain matrix completion. Several low-rank matrix sensing problems [46, 150] are also closely related to IMC models, where the observations are sampled only from the diagonal elements of the rating matrix. [87, 104] introduced and studied an alternative framework for rating prediction with side information, but the prediction function is linear in their case as well.
Neural networks: Nonlinear activation functions play an important role in neural networks. Recently, several powerful results have been established for learning one-hidden-layer feedforward neural networks [22, 52, 72, 85, 125, 153] and convolutional neural networks [21, 38, 39, 53, 152]. However, our problem is a strict generalization of the one-hidden-layer neural network and is not covered by the above-mentioned results.
Chapter 7
Low-rank Matrix Sensing1
In this chapter, we study the problem of low-rank matrix sensing where
the goal is to reconstruct a matrix exactly using a small number of linear measure-
ments. Existing methods for the problem either rely on measurement operators such
as random element-wise sampling which cannot recover arbitrary low-rank matri-
ces or require the measurement operator to satisfy the Restricted Isometry Property
(RIP). However, RIP based linear operators are generally full rank and require large
computation/storage cost for both measurement (encoding) as well as reconstruc-
tion (decoding). We propose simple rank-one Gaussian measurement operators for
matrix sensing that are significantly less expensive in terms of memory and compu-
tation for both encoding and decoding. Moreover, we show that the matrix can be
reconstructed exactly using a simple alternating minimization method. Finally, we
demonstrate the effectiveness of the measurement scheme vis-a-vis existing meth-
ods.
1 The content of this chapter is published as "Efficient matrix sensing using rank-1 Gaussian measurements", Kai Zhong, Prateek Jain, and Inderjit S. Dhillon, in International Conference on Algorithmic Learning Theory, 2015. The dissertator's contribution includes deriving part of the theoretical analysis, conducting the numerical experiments and writing part of the paper.
7.1 Introduction to Low-rank Matrix Sensing
We consider the matrix sensing problem, where the goal is to recover a low-rank matrix using a small number of linear measurements. The matrix sensing process consists of two phases: a) a compression phase (encoding), and b) a reconstruction phase (decoding).
In the compression phase, a sketch/measurement of the given low-rank matrix is obtained by applying a linear operator $\mathcal{A}: \mathbb{R}^{d_1\times d_2} \to \mathbb{R}^m$. That is, given a rank-$k$ matrix $W_* \in \mathbb{R}^{d_1\times d_2}$, its linear measurements are computed by

$$b = \mathcal{A}(W_*) = [\operatorname{tr}(A_1^\top W_*)\ \ \operatorname{tr}(A_2^\top W_*)\ \ \ldots\ \ \operatorname{tr}(A_m^\top W_*)]^\top, \qquad (7.1)$$

where the matrices $\{A_l \in \mathbb{R}^{d_1\times d_2}\}_{l=1,2,\ldots,m}$ parameterize the linear operator $\mathcal{A}$ and $\operatorname{tr}$ denotes the trace operator. Then, in the reconstruction phase, the underlying low-rank matrix is reconstructed from the given measurements $b$. That is, $W_*$ is obtained by solving the following optimization problem:

$$\min_W\ \operatorname{rank}(W) \quad \text{s.t.}\quad \mathcal{A}(W) = b. \qquad (7.2)$$
The matrix sensing problem is a matrix generalization of the popular compressive
sensing problem and has several real-world applications in the areas of system iden-
tification and control, Euclidean embedding, and computer vision (see [103] for a
detailed list of references).
Naturally, the design of the measurement operator $\mathcal{A}$ is critical for the success of matrix sensing, as it dictates the cost of both the compression and the reconstruction phases. The most popular operators for this task come from a family of
operators that satisfy a certain Restricted Isometry Property (RIP). However, these operators require each $A_l$ that parameterizes $\mathcal{A}$ to be a full-rank matrix. That is, the cost of compression, as well as the storage of $\mathcal{A}$, is $O(md_1d_2)$, which is infeasible for large matrices. Reconstruction of the low-rank matrix $W_*$ is also expensive, requiring $O(md_1d_2 + d_1^2 d_2)$ computation steps. Moreover, $m$ is typically at least $O(k\cdot d\log d)$, where $d = d_1 + d_2$. On the other hand, these operators are universal, i.e., every rank-$k$ matrix $W_*$ can be compressed and recovered using such RIP-based operators.
Here, we seek to reduce the computational/storage cost of such operators, at the price of the universality property. That is, we propose to use simple rank-one operators, where each $A_l$ is a rank-one matrix. We show that using a similar number of measurements as the RIP operators, i.e., $m = O(k\cdot d\log d)$, we can recover a fixed rank-$k$ matrix $W_*$ exactly.
In particular, we propose two measurement schemes: a) rank-one independent measurements and b) rank-one dependent measurements. In the rank-one independent scheme, we use $A_l = x_l y_l^\top$, where $x_l \in \mathbb{R}^{d_1}$, $y_l \in \mathbb{R}^{d_2}$ are both sampled from zero-mean sub-Gaussian product distributions, i.e., each element of $x_l$ and $y_l$ is sampled from a fixed zero-mean univariate sub-Gaussian distribution. Rank-one dependent measurements combine the above rank-one measurements with element-wise sampling, i.e., $A_l = x_{i_l} y_{j_l}^\top$, where $x_{i_l}, y_{j_l}$ are sampled as above and $(i_l, j_l) \in [n_1]\times[n_2]$ is a randomly sampled index, with $n_1 \ge d_1$, $n_2 \ge d_2$. These measurements can also be viewed as an inductive version of the matrix completion problem (see Section 7.2), where $x_i$ represents the features of the $i$-th user (row) and $y_j$ the features of the $j$-th movie (column).
Table 7.1: Comparison of sample complexity and computational complexity for different approaches and different measurements.

Method               Sample complexity        Computational complexity
ALS, rank-1 indep.   O(k^4 β^2 d log^2 d)     O(kdm)
ALS, rank-1 dep.     O(k^4 β^2 d log d)       O(dm + knd)
ALS, RIP             O(k^4 d log d)           O(d^2 m)
Next, we provide recovery algorithms for both of the above measurement operators. Note that, in general, the recovery problem (7.2) is NP-hard. However, for RIP-based operators, both alternating minimization and nuclear-norm minimization are known to solve the problem exactly in polynomial time [71]. The existing analyses of both methods crucially use RIP and hence do not extend to the proposed operators. We show that if $m = O(k^4\cdot\beta^2\cdot(d_1+d_2)\log^2(d_1+d_2))$, where $\beta$ is the condition number of $W_*$, then alternating minimization with rank-one independent measurements recovers $W_*$ in time $O(kdm)$, where $d = d_1 + d_2$.
We summarize the sample complexity and computational complexity of the different approaches and measurements in Table 7.1. In the table, ALS refers to alternating least squares, i.e., alternating minimization, and $d = d_1 + d_2$, $n = n_1 + n_2$.
We summarize related work in Section 7.5. In Section 7.2 we formally
introduce the matrix sensing problem and our proposed rank-one measurement op-
erators. In Section 7.3, we present the alternating minimization method for ma-
trix reconstruction. We then present a generic analysis for alternating minimization
when applied to the proposed rank-one measurement operators. Finally, we provide
empirical validation of our methods in Section 7.4.
7.2 Problem Formulation – Two Settings
The goal of matrix sensing is to design a linear operator $\mathcal{A}: \mathbb{R}^{d_1\times d_2} \to \mathbb{R}^m$ and a recovery algorithm so that a low-rank matrix $W_* \in \mathbb{R}^{d_1\times d_2}$ can be recovered exactly from $\mathcal{A}(W_*)$. In this work, we focus on rank-one measurement operators, $A_l = x_l y_l^\top$, and call such problems Low-Rank matrix estimation using Rank-One Measurements (LRROM): recover the rank-$k$ matrix $W_* \in \mathbb{R}^{d_1\times d_2}$ from rank-one measurements of the form

$$b = [x_1^\top W_* y_1\ \ x_2^\top W_* y_2\ \ \ldots\ \ x_m^\top W_* y_m]^\top, \qquad (7.3)$$

where the $x_l, y_l$ are "feature" vectors and are provided along with the measurements $b$.
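Forming such measurements is cheap: the operator is stored as $m$ pairs $(x_l, y_l)$, i.e., $m(d_1+d_2)$ numbers rather than $m\,d_1 d_2$ for generic full-rank $A_l$. A small sketch with arbitrary illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k, m = 20, 15, 3, 200
W_star = rng.normal(size=(d1, k)) @ rng.normal(size=(k, d2))  # rank-k target

X = rng.normal(size=(m, d1))     # rows are the x_l
Y = rng.normal(size=(m, d2))     # rows are the y_l
b = np.einsum('li,ij,lj->l', X, W_star, Y)   # b_l = x_l^T W* y_l, Eq. (7.3)
```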
We propose two different kinds of rank-one measurement operators based on the Gaussian distribution.
1) Rank-one Independent Gaussian Operator. Our first measurement operator is a simple rank-one Gaussian operator, $\mathcal{A}_{GI} = \{A_1, \ldots, A_m\}$, where $A_l = x_l y_l^\top$, $l = 1, 2, \ldots, m$, and $x_l$, $y_l$ are sampled i.i.d. from the spherical Gaussian distribution.
2) Rank-one Dependent Gaussian Operator. Our second operator introduces certain "dependencies" into the measurements and has interesting connections to the matrix completion problem, which we describe in this subsection. To generate the rank-one dependent Gaussian operator, we first sample two Gaussian matrices $X \in \mathbb{R}^{n_1\times d_1}$ and $Y \in \mathbb{R}^{n_2\times d_2}$, where each entry of $X$ and $Y$ is sampled independently from the Gaussian distribution and $n_1 \ge Cd_1$, $n_2 \ge Cd_2$ for a global constant $C \ge 1$. The Gaussian dependent operator is then $\mathcal{A}_{GD} = \{A_1, \ldots, A_m\}$, where $A_l = x_{i_l} y_{j_l}^\top$, $(i_l, j_l) \in \Omega$. Here $x_i^\top$ is the $i$-th row of $X$, $y_j^\top$ is the $j$-th row of $Y$, and $\Omega$ is a uniformly random subset of $[n_1]\times[n_2]$ such that $\mathbb{E}[|\Omega|] = m$. For simplicity, we assume that each entry $(i_l, j_l) \in [n_1]\times[n_2]$ is sampled i.i.d. with probability $p = m/(n_1 n_2)$. The measurements obtained from this operator are therefore $b_l = x_{i_l}^\top W y_{j_l}$, $(i_l, j_l) \in \Omega$.
Connections to Inductive Matrix Completion (IMC): Note that the above measurements are inspired by the matrix completion style sampling operator. However, here we first multiply $W$ by $X$ and $Y$ and then sample entries of the resulting matrix $XWY^\top$. In the domain of recommender systems (say a user-movie system), the corresponding reconstruction problem can also be thought of as the inductive matrix completion problem. That is, let there be $n_1$ users and $n_2$ movies, let $X$ represent the user features, and let $Y$ represent the movie features. Then the true ratings matrix is given by $R = XWY^\top \in \mathbb{R}^{n_1\times n_2}$. That is, given the user/movie feature vectors $x_i \in \mathbb{R}^{d_1}$, $i = 1, 2, \ldots, n_1$, and $y_j \in \mathbb{R}^{d_2}$, $j = 1, 2, \ldots, n_2$, our goal is to recover a rank-$k$ matrix $W_*$ of size $d_1\times d_2$ from a few observed entries $R_{ij} = x_i^\top W_* y_j$, $(i, j) \in \Omega \subset [n_1]\times[n_2]$. Because of the equivalence between the dependent rank-one measurements and the entries of the rating matrix, in the rest of this section we will use $\{R_{ij}\}_{(i,j)\in\Omega}$ as the dependent rank-one measurements.
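A small sketch of the dependent scheme viewed as IMC-style sampling (the sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k = 10, 8, 2
n1, n2 = 50, 40                          # n1 >= C*d1, n2 >= C*d2
W_star = rng.normal(size=(d1, k)) @ rng.normal(size=(k, d2))

X = rng.normal(size=(n1, d1))            # row i is the user feature x_i^T
Y = rng.normal(size=(n2, d2))            # row j is the movie feature y_j^T
R = X @ W_star @ Y.T                     # rating matrix, R_ij = x_i^T W* y_j

m = 800                                  # expected number of observations
mask = rng.random((n1, n2)) < m / (n1 * n2)   # each (i,j) kept i.i.d. w.p. p
obs = R[mask]                            # the dependent rank-one measurements
```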
Now, if we can reconstruct $W_*$ from the above measurements, we can predict ratings inductively for new users/movies, provided their feature vectors are given.

Algorithm 7.3.1 AltMin-LRROM: Alternating Minimization for LRROM
1: Input: measurements $b^{all}$, measurement matrices $\mathcal{A}^{all}$, number of iterations $H$
2: Divide $(\mathcal{A}^{all}, b^{all})$ into $2H+1$ sets (each of size $m$), with the $h$-th set being $\mathcal{A}^h = \{A_1^h, A_2^h, \ldots, A_m^h\}$ and $b^h = [b_1^h\ b_2^h\ \ldots\ b_m^h]^\top$
3: Initialization: $U_0 \leftarrow$ top-$k$ left singular vectors of $\frac{1}{m}\sum_{l=1}^m b_l^0 A_l^0$
4: for $h = 0$ to $H-1$ do
5:   $b \leftarrow b^{2h+1}$, $\mathcal{A} \leftarrow \mathcal{A}^{2h+1}$
6:   $V_{h+1} \leftarrow \arg\min_{V \in \mathbb{R}^{d_2\times k}} \sum_l (b_l - x_l^\top U_h V^\top y_l)^2$
7:   $V_{h+1} \leftarrow QR(V_{h+1})$  // orthonormalization of $V_{h+1}$
8:   $b \leftarrow b^{2h+2}$, $\mathcal{A} \leftarrow \mathcal{A}^{2h+2}$
9:   $U_{h+1} \leftarrow \arg\min_{U \in \mathbb{R}^{d_1\times k}} \sum_l (b_l - x_l^\top U V_{h+1}^\top y_l)^2$
10:  $U_{h+1} \leftarrow QR(U_{h+1})$  // orthonormalization of $U_{h+1}$
11: Output: $W_H = U_H (V_H)^\top$
Hence, our reconstruction procedure also solves the IMC problem. However, there is a key difference: in matrix sensing, we can select $X$, $Y$ at our convenience, while in IMC, $X$ and $Y$ are provided a priori. For general $X, Y$, one cannot solve the problem: if, say, $R = XW_*Y^\top$ is a 1-sparse matrix, then $W_*$ cannot be reconstructed even with a large number of samples.
7.3 Rank-one Matrix Sensing via Alternating Minimization
We now present the alternating minimization approach for solving the reconstruction problem (7.2) with rank-one measurements (7.3). Since the $W$ to be recovered is restricted to have rank at most $k$, (7.2) can be reformulated as the following non-convex optimization problem:

$$\min_{U\in\mathbb{R}^{d_1\times k},\ V\in\mathbb{R}^{d_2\times k}}\ \sum_{l=1}^m \big(b_l - x_l^\top U V^\top y_l\big)^2. \qquad (7.4)$$
Alternating minimization is an iterative procedure that alternately optimizes over $U$ and $V$ while keeping the other factor fixed. As optimizing over $U$ (or $V$) only involves solving a least-squares problem, each individual iteration of the algorithm is linear in the matrix dimensions. For the rank-one measurement operators, we use a particular initialization method for $U$ (see line 3 of Algorithm 7.3.1); see Algorithm 7.3.1 for the full pseudo-code.
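The least-squares subproblems in lines 6 and 9 are ordinary linear regressions in $\mathrm{vec}(V)$ and $\mathrm{vec}(U)$, since $b_l = \langle V, y_l (U_h^\top x_l)^\top\rangle$ for fixed $U_h$. Below is a compact NumPy sketch of the procedure; for simplicity it reuses a single sample set across iterations (rather than the $2H+1$ fresh partitions the analysis requires) and ends with one unnormalized least-squares solve to recover the scale:

```python
import numpy as np

def altmin_lrrom(X, Y, b, k, iters=50):
    """Recover a rank-k W from b_l = x_l^T W y_l (rows of X, Y are x_l, y_l)."""
    m, d1 = X.shape
    d2 = Y.shape[1]
    # init: top-k left singular vectors of (1/m) * sum_l b_l x_l y_l^T
    U = np.linalg.svd((X * b[:, None]).T @ Y / m)[0][:, :k]
    for _ in range(iters):
        P = X @ U                                            # m x k
        D = (Y[:, :, None] * P[:, None, :]).reshape(m, -1)   # rows = vec(y_l p_l^T)
        V = np.linalg.lstsq(D, b, rcond=None)[0].reshape(d2, k)
        V = np.linalg.qr(V)[0]                               # orthonormalize
        Q = Y @ V
        D = (X[:, :, None] * Q[:, None, :]).reshape(m, -1)   # rows = vec(x_l q_l^T)
        U = np.linalg.lstsq(D, b, rcond=None)[0].reshape(d1, k)
        U = np.linalg.qr(U)[0]                               # orthonormalize
    # one final solve for V against the orthonormal U fixes the scale
    P = X @ U
    D = (Y[:, :, None] * P[:, None, :]).reshape(m, -1)
    V = np.linalg.lstsq(D, b, rcond=None)[0].reshape(d2, k)
    return U @ V.T

rng = np.random.default_rng(0)
d1, d2, k, m = 12, 10, 2, 1500
W_star = rng.normal(size=(d1, k)) @ rng.normal(size=(k, d2))
X, Y = rng.normal(size=(m, d1)), rng.normal(size=(m, d2))
b = np.einsum('li,ij,lj->l', X, W_star, Y)
W_hat = altmin_lrrom(X, Y, b, k)
```

With this oversampled noiseless instance the iterates contract quickly and `W_hat` matches the ground truth to high precision.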
General Theoretical Guarantee for Alternating Minimization
As mentioned above, (7.4) is non-convex in $U, V$, and hence standard analysis would only ensure convergence to a local minimum. However, [71] recently showed that the alternating minimization method in fact converges to the global minimum of two low-rank estimation problems: matrix sensing with RIP matrices, and matrix completion.

The rank-one operator given above does not satisfy RIP (see Definition 7.3.1), even when the vectors $x_l, y_l$ are sampled from the normal distribution (see Claim 7.3.1). Furthermore, each measurement need not reveal exactly one entry of $W_*$, as in the case of matrix completion. Hence, the proof of [71] does not apply directly. However, inspired by the proof of [71], we distill out three key properties that the operator should satisfy so that alternating minimization converges to the global optimum.
Theorem 7.3.1. Let $W_* = U_*\Sigma_* V_*^\top \in \mathbb{R}^{d_1\times d_2}$ be a rank-$k$ matrix with singular values $\sigma_1^* \ge \sigma_2^* \ge \cdots \ge \sigma_k^*$. Also, let $\mathcal{A}: \mathbb{R}^{d_1\times d_2}\to\mathbb{R}^m$ be a linear measurement operator parameterized by $m$ matrices, i.e., $\mathcal{A} = \{A_1, A_2, \ldots, A_m\}$, where $A_l = x_l y_l^\top$, and let $\mathcal{A}(W)$ be as given by (7.1).

Now, let $\mathcal{A}$ satisfy the following properties with parameter $\delta = \frac{1}{k^{3/2}\,\beta\cdot 100}$ (where $\beta = \sigma_1^*/\sigma_k^*$):

1. Initialization: $\big\|\frac{1}{m}\sum_l b_l A_l - W_*\big\|_2 \le \|W_*\|_2\cdot\delta$.

2. Concentration of operators $B_x, B_y$: let $B_x = \frac{1}{m}\sum_{l=1}^m (y_l^\top v)^2\, x_l x_l^\top$ and $B_y = \frac{1}{m}\sum_{l=1}^m (x_l^\top u)^2\, y_l y_l^\top$, where $u \in \mathbb{R}^{d_1}$, $v \in \mathbb{R}^{d_2}$ are two unit vectors that are independent of the randomness in $x_l, y_l$, $\forall l$. Then the following holds: $\|B_x - I\|_2 \le \delta$ and $\|B_y - I\|_2 \le \delta$.

3. Concentration of operators $G_x, G_y$: let $G_x = \frac{1}{m}\sum_l (y_l^\top v)(y_l^\top v_\perp)\, x_l x_l^\top$ and $G_y = \frac{1}{m}\sum_l (x_l^\top u)(u_\perp^\top x_l)\, y_l y_l^\top$, where $u, u_\perp \in \mathbb{R}^{d_1}$ and $v, v_\perp \in \mathbb{R}^{d_2}$ are unit vectors with $u^\top u_\perp = 0$ and $v^\top v_\perp = 0$. Furthermore, let $u, u_\perp, v, v_\perp$ be independent of the randomness in $x_l, y_l$, $\forall l$. Then $\|G_x\|_2 \le \delta$ and $\|G_y\|_2 \le \delta$.

Then, after $H$ iterations of the alternating minimization method (Algorithm 7.3.1), we obtain $W_H = U_H V_H^\top$ s.t. $\|W_H - W_*\|_2 \le \epsilon$, where $H \le 100\log(\|W_*\|_F/\epsilon)$.
See Appendix E.1 for a detailed proof. Note that we require the intermediate vectors $u, v, u_\perp, v_\perp$ to be independent of the randomness in the $A_l$'s. Hence, we partition $\mathcal{A}^{all}$ into $2H+1$ partitions, and at each step $(\mathcal{A}^h, b^h)$ and $(\mathcal{A}^{h+1}, b^{h+1})$ are supplied to the algorithm. This implies that the measurement complexity of the algorithm is $m\cdot H = m\log(\|W_*\|_F/\epsilon)$. That is, given $O(m\log((d_1+d_2)\|W_*\|_F))$ samples, we can estimate a matrix $W_H$ s.t. $\|W_H - W_*\|_2 \le \frac{1}{(d_1+d_2)^c}$, where $c > 0$ is any constant.
Independent Gaussian Measurements
In this section, we consider the rank-one independent measurement operator $\mathcal{A}_{GI}$ specified in Section 7.2. For this operator, we show that if $m = O(k^4\beta^2\cdot(d_1+d_2)\cdot\log^2(d_1+d_2))$, then w.p. $\ge 1 - 1/(d_1+d_2)^{100}$, any fixed rank-$k$ matrix $W_*$ can be recovered by AltMin-LRROM (Algorithm 7.3.1). Here $\beta = \sigma_1^*/\sigma_k^*$ is the condition number of $W_*$. That is, using a number of measurements nearly linear in $d_1, d_2$, one can exactly recover the $d_1\times d_2$ rank-$k$ matrix $W_*$.
As mentioned in the previous section, the existing matrix sensing results
typically assume that the measurement operator A satisfies the Restricted Isometry
Property (RIP) defined below:
Definition 7.3.1. A linear operator $\mathcal{A}: \mathbb{R}^{d_1\times d_2}\to\mathbb{R}^m$ satisfies RIP iff, for all $W$ s.t. $\operatorname{rank}(W) \le k$, the following holds:

$$(1 - \delta_k)\|W\|_F^2 \le \|\mathcal{A}(W)\|_2^2 \le (1 + \delta_k)\|W\|_F^2,$$

where $\delta_k > 0$ is a constant dependent only on $k$.
Naturally, this begs the question whether we can show that our rank-1 mea-
surement operator AGI satisfies RIP, so that the existing analysis for RIP based
low-rank matrix sensing can be used [71]. We answer this question in the negative,
i.e., for m = O((d1 + d2) log(d1 + d2)), AGI does not satisfy RIP even for rank-1
matrices (with high probability):
104
Claim 7.3.1. Let A_GI = {A_1, A_2, . . . , A_m} be a measurement operator with
each A_l = x_l y_l^T, where x_l ∈ R^{d1} ∼ N(0, I), y_l ∈ R^{d2} ∼ N(0, I), 1 ≤ l ≤ m. Let
m = O((d1 + d2) log^c(d1 + d2)), for any constant c > 0. Then, with probability at
least 1 − 1/m^10, A_GI does not satisfy RIP for rank-1 matrices with a constant δ.
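The failure mode behind the claim is easy to see numerically: a rank-1 matrix aligned with one of the measurement matrices receives a disproportionately large measurement, since that single term contributes ‖x_1‖^2 ‖y_1‖^2 / m ≈ d1·d2/m, which dwarfs ‖W‖_F^2 = 1 when m is near-linear in d1 + d2. A small NumPy sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d1 = d2 = 100
m = int((d1 + d2) * np.log(d1 + d2))  # near-linear number of measurements

X = rng.standard_normal((m, d1))
Y = rng.standard_normal((m, d2))

def A(W):
    # Normalized operator: A(W)_l = <x_l y_l^T, W> / sqrt(m).
    return np.einsum('li,ij,lj->l', X, W, Y) / np.sqrt(m)

# A typical unit-Frobenius-norm rank-1 matrix: ||A(W)||^2 concentrates near 1.
u, v = rng.standard_normal(d1), rng.standard_normal(d2)
W_typ = np.outer(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
r_typ = np.linalg.norm(A(W_typ)) ** 2

# Adversarial rank-1 matrix aligned with the first measurement: RIP fails.
W_bad = np.outer(X[0], Y[0]) / (np.linalg.norm(X[0]) * np.linalg.norm(Y[0]))
r_bad = np.linalg.norm(A(W_bad)) ** 2

print(r_typ, r_bad)
```

For a fixed rank-1 matrix the squared measurement norm stays close to 1, while the aligned one blows up by roughly d1·d2/m, so no constant δ can work uniformly.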
See Appendix E.2 for a detailed proof of the above claim. Now, even though
A_GI does not satisfy RIP, we can still show that A_GI satisfies the three properties
mentioned in Theorem 7.3.1, and hence we can use Theorem 7.3.1 to obtain the
exact recovery result.
Theorem 7.3.2 (Rank-One Independent Gaussian Measurements using ALS). Let
A_GI = {A_1, A_2, . . . , A_m} be a measurement operator with each A_l = x_l y_l^T, where
x_l ∈ R^{d1} ∼ N(0, I), y_l ∈ R^{d2} ∼ N(0, I), 1 ≤ l ≤ m. Let m = O(k^4 β^2 (d1 +
d2) log^2(d1 + d2)). Then, Properties 1, 2, 3 required by Theorem 7.3.1 are satisfied
with probability at least 1 − 1/(d1 + d2)^100.
Proof. Here, we provide a brief proof sketch. See Appendix E.2 for a detailed
proof.
Initialization: Note that,

(1/m) Σ_{l=1}^m b_l x_l y_l^T = (1/m) Σ_{l=1}^m x_l x_l^T U* Σ* V*^T y_l y_l^T = (1/m) Σ_{l=1}^m Z_l,

where Z_l = x_l x_l^T U* Σ* V*^T y_l y_l^T. Note that E[Z_l] = U* Σ* V*^T. Hence, to prove the
initialization result, we need a tail bound for sums of random matrices. To this end,
we use the matrix Bernstein inequality (Lemma 2.4.3). However, the matrix Bernstein
inequality requires a bounded random variable, while Z_l is an unbounded variable. We
handle this issue by clipping Z_l to ensure that its spectral norm is always bounded.
Furthermore, by using properties of the normal distribution, we can ensure that w.p.
≥ 1 − 1/m^3, the Z_l's do not require clipping and the new "clipped" variables converge
to nearly the same quantity as the original "non-clipped" Z_l's. See Appendix E.2
for more details.
Concentration of B_x, B_y, G_x, G_y: Consider G_x = (1/m) Σ_{l=1}^m (y_l^T v)(v_⊥^T y_l) x_l x_l^T. As
v, v_⊥ are unit-norm vectors, y_l^T v ∼ N(0, 1) and v_⊥^T y_l ∼ N(0, 1). Also, since v
and v_⊥ are orthogonal, y_l^T v and v_⊥^T y_l are independent variables. Hence, G_x =
(1/m) Σ_{l=1}^m Z_l where E[Z_l] = 0. Here again, we apply the matrix Bernstein inequality
(Lemma 2.4.3) after using a clipping argument. We can obtain the required bounds
for B_x, B_y, G_y in a similar manner.
Note that the clipping procedure ensures that the Z_l's do not need to be clipped
with probability ≥ 1 − 1/m^3 only. That is, we cannot apply a union bound to
ensure that the concentration result holds for all v, v_⊥. Hence, we need a fresh set
of measurements after each iteration to ensure concentration.
Global optimality of the rate of convergence of the alternating minimization
procedure for this problem now follows directly from Theorem 7.3.1. We
would like to note that while the above result shows that the A_GI operator is almost
as powerful as the RIP based operators for matrix sensing, there is one critical
drawback: while RIP based operators are universal, that is, they can be used to
recover any rank-k W*, A_GI needs to be resampled for each W*. We believe that
the two operators sit at the two extreme ends of a randomness vs. universality trade-off,
and that intermediate operators with higher success probability but using a larger number
of random bits should be possible.
Dependent Gaussian Measurements
For the dependent Gaussian measurements, the alternating minimization
formulation is given by:

min_{U ∈ R^{d1×k}, V ∈ R^{d2×k}} Σ_{(i,j)∈Ω} (x_i^T U V^T y_j − R_ij)^2. (7.5)
Here again, we can solve the problem by alternately optimizing over U and V.
Later in Section 7.3, we show that using such dependent measurements leads to a
faster recovery algorithm than the recovery algorithm for independent
measurements.
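A compact NumPy sketch of this alternating scheme for (7.5), with illustrative sizes and a simple spectral-style initialization (averaging r_p x_{i_p} y_{j_p}^T over the observed entries, which is approximately W* in expectation); each half-step is an exact least squares solve:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, d1, d2, k = 60, 60, 12, 10, 2
X = rng.standard_normal((n1, d1))
Y = rng.standard_normal((n2, d2))
W_star = rng.standard_normal((d1, k)) @ rng.standard_normal((k, d2))
R = X @ W_star @ Y.T

# Observe a random subset Omega of the entries of R.
mask = rng.random((n1, n2)) < 0.3
I, J = np.nonzero(mask)
r = R[I, J]

# Spectral-style initialization: (1/|Omega|) sum_p r_p x_{i_p} y_{j_p}^T ~ W*.
W0 = (X[I].T * r) @ Y[J] / len(r)
U = np.linalg.svd(W0)[0][:, :k]

for _ in range(100):
    # Fix U, solve for V: r_p = (x_{i_p}^T U) V^T y_{j_p} is linear in V.
    F = np.einsum('pk,pj->pkj', X[I] @ U, Y[J]).reshape(len(r), -1)
    V = np.linalg.lstsq(F, r, rcond=None)[0].reshape(k, d2).T
    # Fix V, solve for U.
    G = np.einsum('pk,pi->pki', Y[J] @ V, X[I]).reshape(len(r), -1)
    U = np.linalg.lstsq(G, r, rcond=None)[0].reshape(k, d1).T

err = np.linalg.norm(U @ V.T - W_star) / np.linalg.norm(W_star)
print(err)
```

Note that the unknown here is the small d1 × d2 parameter matrix, not the n1 × n2 matrix R, which is why the sample complexity scales with d1 + d2 rather than n1 + n2.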
Note that both measurement matrices X, Y can be assumed to be orthonormal.
The reason is that X W* Y^T = U_X Σ_X V_X^T W* V_Y Σ_Y U_Y^T, where
X = U_X Σ_X V_X^T and Y = U_Y Σ_Y V_Y^T are the SVDs of X and Y respectively. Hence,
R = X W* Y^T = U_X (Σ_X V_X^T W* V_Y Σ_Y) U_Y^T. Now U_X, U_Y can be treated as
the true "X", "Y" matrices and W* ← (Σ_X V_X^T W* V_Y Σ_Y) can be thought of as
W*. Then the "true" W* can be recovered from the obtained W_H as: W_H ←
V_X Σ_X^{-1} W_H Σ_Y^{-1} V_Y^T. We also note that such a transformation implies that the condi-
tion number of R and that of W* ← (Σ_X V_X^T W* V_Y Σ_Y) are exactly the same.
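This reduction and the back-substitution can be checked numerically in a few lines (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, d1, d2 = 50, 40, 8, 6
X0 = rng.standard_normal((n1, d1))
Y0 = rng.standard_normal((n2, d2))
W_star = rng.standard_normal((d1, d2))

# Thin SVDs: X0 = U_X S_X V_X^T, Y0 = U_Y S_Y V_Y^T.
U_X, s_X, VXt = np.linalg.svd(X0, full_matrices=False)
U_Y, s_Y, VYt = np.linalg.svd(Y0, full_matrices=False)

# R = X0 W* Y0^T = U_X (S_X V_X^T W* V_Y S_Y) U_Y^T, so the orthonormal
# factors U_X, U_Y play the role of X, Y with transformed target W_tilde.
W_tilde = np.diag(s_X) @ VXt @ W_star @ VYt.T @ np.diag(s_Y)
gap = np.linalg.norm(X0 @ W_star @ Y0.T - U_X @ W_tilde @ U_Y.T)

# Undo the transformation to recover the original W*.
W_rec = VXt.T @ np.diag(1 / s_X) @ W_tilde @ np.diag(1 / s_Y) @ VYt
back = np.linalg.norm(W_rec - W_star)
print(gap, back)
```

Both residuals are at numerical precision, confirming that recovering the transformed target under orthonormal X, Y is equivalent to recovering W* itself.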
Similar to the previous section, we utilize our general theorem for optimality
of the LRROM problem to provide a convergence analysis of the rank-one Gaussian
dependent operators A_GD. We prove that if X and Y are random orthogonal matrices,
as defined in [24], the above-mentioned dependent measurement operator A_GD generated
from X, Y also satisfies Properties 1, 2, 3 in Theorem 7.3.1. Hence, AltMin-
LRROM (Algorithm 7.3.1) converges to the global optimum in O(log(‖W*‖_F/ε))
iterations.
Theorem 7.3.3 (Rank-One Dependent Gaussian Measurements using ALS). Let
X0 ∈ R^{n1×d1} and Y0 ∈ R^{n2×d2} be Gaussian matrices, i.e., every entry is sampled
i.i.d. from N(0, 1). Let X0 = X Σ_X V_X^T and Y0 = Y Σ_Y V_Y^T be the thin SVDs of
X0 and Y0 respectively. Then the rank-one dependent operator A_GD formed by
X, Y with m ≥ O(k^4 β^2 (d1 + d2) log(d1 + d2)) satisfies Properties 1, 2, 3 required by
Theorem 7.3.1 with high probability.
See Appendix E.3 for a detailed proof. Interestingly, our proof does not
require X, Y to be Gaussian. It instead utilizes only two key properties of X, Y,
which are given by:
1. Incoherence: For some constants µ, c,

   max_{i∈[n]} ‖x_i‖_2^2 ≤ µ d̄ / n, (7.6)

   where d̄ = max(d, log n).

2. Averaging Property: For H different orthogonal matrices U_h ∈ R^{d×k}, h =
   1, 2, . . . , H, the following holds for these U_h's:

   max_{i∈[n]} ‖U_h^T x_i‖_2^2 ≤ µ_0 k̄ / n, (7.7)

   where µ_0, c are some constants and k̄ = max(k, log n).
Hence, the above theorem can be easily generalized to solve the inductive matrix
completion (IMC) problem, i.e., to solve (7.5) for arbitrary X, Y. Moreover, the sample
complexity of the analysis would be nearly linear in (d1 + d2), instead of the (n1 + n2)
samples required by standard matrix completion methods.
The following lemma shows that the above two properties hold w.h.p. for
random orthogonal matrices.

Lemma 7.3.1. If X ∈ R^{n×d} is a random orthogonal matrix, then both the Incoherence
and Averaging properties are satisfied with probability ≥ 1 − (c/n^3) log n, where c
is a constant.
The proof of Lemma 7.3.1 can be found in Appendix E.3.
Computational Complexity for Alternating Minimization
In this section, we briefly discuss the computational complexity of Algorithm 7.3.1.
For simplicity, we set d = d1 + d2 and n = n1 + n2; in a practical
implementation, we do not divide the measurements and instead use the whole measurement
operator A in every iteration. The most time-consuming part of Algorithm 7.3.1 is
the step that solves the least squares problem. Given U = U_h, V can be obtained by
solving the following linear system,

Σ_{l=1}^m ⟨V, A_l^T U_h⟩ A_l^T U_h = Σ_{l=1}^m b_l A_l^T U_h. (7.8)

The dimension of this linear system is kd, which could be large, thus we use the
conjugate gradient (CG) method to solve it. In each CG iteration, different measurement
operators have different computational complexities. For RIP-based full-rank operators,
the computational complexity of each CG step is O(d^2 m), while it is O(kdm)
for rank-one independent operators. However, for rank-one dependent operators,
using techniques introduced in [140], we can reduce the per-iteration complexity to
O(kdn + md). Furthermore, if n = d, the computational complexity of dependent
operators is only O(kd^2 + md), which is better than the complexity of rank-one
independent operators by a factor of k.
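The O(kdm) per-step cost for rank-one independent operators comes from never forming the (kd2) × (kd2) system explicitly: with A_l = x_l y_l^T, each term A_l^T U_h = y_l (U_h^T x_l)^T, so applying the normal operator only needs inner products. A matrix-free CG sketch for (7.8) (illustrative sizes, plain CG without preconditioning):

```python
import numpy as np

rng = np.random.default_rng(4)
d1, d2, k, m = 30, 25, 3, 1500
U = np.linalg.qr(rng.standard_normal((d1, k)))[0]   # the fixed factor U_h
W_star = rng.standard_normal((d1, k)) @ rng.standard_normal((k, d2))
X = rng.standard_normal((m, d1))
Y = rng.standard_normal((m, d2))
b = np.einsum('li,ij,lj->l', X, W_star, Y)          # b_l = x_l^T W* y_l

P = X @ U                                           # rows p_l = U^T x_l

def apply_normal(V):
    # V -> sum_l <V, A_l^T U_h> A_l^T U_h, using A_l^T U_h = y_l p_l^T.
    # Each call costs O(m k d2) instead of O((k d2)^2) with a dense system.
    s = np.einsum('lk,lj,jk->l', P, Y, V)
    return np.einsum('l,lj,lk->jk', s, Y, P)

rhs = np.einsum('l,lj,lk->jk', b, Y, P)             # sum_l b_l A_l^T U_h

# Plain conjugate gradient on the positive semidefinite normal operator.
V = np.zeros((d2, k))
r = rhs - apply_normal(V)
p = r.copy()
rs = np.vdot(r, r)
tol = 1e-10 * np.linalg.norm(rhs)
for _ in range(500):
    Ap = apply_normal(p)
    alpha = rs / np.vdot(p, Ap)
    V += alpha * p
    r -= alpha * Ap
    rs_new = np.vdot(r, r)
    if np.sqrt(rs_new) < tol:
        break
    p = r + (rs_new / rs) * p
    rs = rs_new

rel = np.linalg.norm(apply_normal(V) - rhs) / np.linalg.norm(rhs)
print(rel)
```

The same structure, applied to the dependent operator with the grouping tricks of [140], is what brings the per-iteration cost down to O(kdn + md).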
7.4 Numerical Experiments
In this section, we demonstrate empirically that our Gaussian rank-one linear
operators are significantly more efficient for matrix sensing than the existing
RIP based measurement operators. In particular, we apply alternating minimization
(ALS) to the measurements obtained using three different operators: rank-one independent
(Rank1 Indep), rank-one dependent (Rank1 Dep), and a RIP based operator
generated using random Gaussian matrices (RIP).
The experiments are conducted in MATLAB. We first generate a random rank-5
signal W* ∈ R^{50×50} and compute m = 1000 measurements using the different
measurement operators. Figure 7.1(a) plots the relative error in recovery, err = ‖W −
W*‖_F^2/‖W*‖_F^2, against the computational time required by each method. Clearly,
recovery using rank-one measurements requires significantly less time than
the RIP based operator.
Next, we compare the measurement complexity (m) of each method. Here
again, we first generate a random rank-5 signal W* ∈ R^{50×50} and its measurements
Figure 7.1: Comparison of computational complexity and measurement complexity for different approaches and different operators. (a) Relative error in recovery ‖W − W*‖_F^2/‖W*‖_F^2 vs. computation time. (b) Recovery rate vs. number of measurements. Curves: ALS Rank1 Dep, ALS Rank1 Indep, ALS RIP.
using different operators. We then measure the recovery error of each method
and declare success if the relative error err ≤ 0.05. We repeat the experiment 10
times to obtain the recovery rate (number of successes/10) for each value of m (number
of measurements). Figure 7.1(b) plots the recovery rate of the different approaches
for different m. Clearly, the rank-one based measurements have similar recovery
rate and measurement complexity as the RIP based operators. However, our rank-one
operator based methods are much faster than the corresponding methods for the
RIP-based measurement scheme.
Finally, in Figure 7.2, we validate our theoretical analysis of the measurement
complexity by showing the recovery rate for different d and m. We fix the
rank k = 5, set d = d1 = d2 and n1 = d1, n2 = d2 for dependent operators. Figure
7.2 plots the recovery rate for various d and m. As shown in Figure 7.2, both independent
and dependent operators using alternating minimization require a number
Figure 7.2: Recovery rate for different matrix dimension d (x-axis) and different number of measurements m (y-axis), for Indep. ALS (left panel) and Dep. ALS (right panel). The color reflects the recovery rate scaled from 0 to 1. The white color indicates perfect recovery, while the black color denotes failure in all the experiments.
of measurements proportional to the dimension d. We also see that dependent
operators require a slightly larger number of measurements than independent
ones.
7.5 Related Work
Matrix Sensing: Matrix sensing [103, 70, 81] is a generalization of the popular
compressive sensing problem for sparse vectors and has applications in several
domains such as control and vision. [103] introduced measurement operators
that satisfy RIP and showed that using only O(kd log d) measurements, a rank-k
W* ∈ R^{d1×d2} can be recovered. Recently, a set of universal Pauli measurements,
used in quantum state tomography, has been shown to satisfy the RIP condition
[88]. These measurement operators are Kronecker products of 2 × 2 matrices; thus,
they have appealing computation and memory efficiency.
Matrix Completion and Inductive Matrix Completion: Matrix completion
[24, 75, 71] is a special case of the rank-one matrix sensing problem in which the operator
takes a subset of the entries. However, to guarantee exact recovery, the target
matrix has to satisfy an incoherence condition. Using our rank-one Gaussian operators,
we do not require any condition on the target matrix. For inductive matrix
completion (IMC), which is a generalization of matrix completion utilizing movies'
and users' features, the authors of [136] provided a theoretical recovery guarantee
for nuclear-norm minimization. In this work, we show that IMC is equivalent to the
matrix sensing problem with dependent rank-one measurements and focus on the
non-convex approach, alternating minimization.
Alternating Minimization: Although nuclear-norm minimization enjoys nice
recovery guarantees, it usually does not scale well. In practice, alternating minimization
is employed to solve problem (7.2) by assuming the rank is known. Alternating
minimization solves two least squares problems alternately in each iteration, and
is thus computationally very efficient [138]. Although widely used in practice, its
theoretical guarantees are relatively less understood due to non-convexity. [71] first
showed optimality of alternating minimization in the matrix sensing/low-rank estimation
setting under RIP. Subsequently, several other papers have
also shown global convergence guarantees for alternating minimization, e.g., matrix
completion [62, 58], robust PCA [93] and dictionary learning [3]. In this work,
we provide a generic analysis of alternating minimization applied to the proposed
rank-one measurement operators. Our results distill out certain key problem-specific
properties that imply global optimality of alternating minimization. We
then show that the rank-one Gaussian measurements satisfy those properties.
Chapter 8
Discussion
8.1 Over-specified/Over-parameterized Neural Networks
Over-parameterization/over-specification endows greater expressive power
to modern neural networks and is also believed to be one of the primary reasons
why gradient-based local search algorithms have benign behavior.
It is conjectured in [89, 106] that over-parameterization is the primary reason
why local search heuristics can successfully find the global optima, based on
the intuition that over-parameterization mitigates the local minima issue and makes
the landscape more amenable to gradient descent. However, even if the
global optimum of the training loss is achieved, the generalization error of the
solution is typically not guaranteed, due to the possibility of over-fitting.
On the generalization side, abundant empirical evidence has shown that
over-parameterized neural networks generalize well even when the number of parameters
is much larger than the sample size, which seems to contradict classical
statistical learning theory. Somewhat more surprisingly, with some architecture designs,
vanilla gradient descent without any explicit regularization, such as dropout
or weight decay, converges to a solution that generalizes well. Some recent
generalization error bounds [18, 54, 94, 95] have been derived that depend on the norm
of the weight matrices, but there is no guarantee that the learned weight matrices
have small norms. Some recent work [144] regularizes the weight matrices so that
the norm is bounded; however, it is theoretically unknown whether the training loss will
also be affected.
To analyze the convergence properties of NNs, many works assume realizability
[22, 72, 125, 153], where an underlying ground-truth NN is assumed to exist
and the goal is to recover this target NN. In realizable settings, once the underlying
model is recovered, generalization performance is guaranteed. Here we
would like to distinguish between two different assumptions within realizability.
One class assumes the number of hidden nodes is known and the learning
architecture is fixed to be the same as the underlying data generative model, such
as in [72, 125, 153]. The other class (such as [22]) assumes the underlying
model can be represented as a NN with fewer hidden nodes than actually specified;
we call these over-specified models. Over-specification is more suitable for real
problems as it is usually difficult to know the true architecture of the underlying
model a priori. In other areas such as low-rank matrix sensing [84] and low-rank
semi-definite programs [23], over-specifying the rank has been shown to achieve better
performance and reduce the number of local minima in non-convex optimization.
In some literature, researchers also use the term over-parameterization to refer to
the problem where the number of observations is smaller than the number of parameters.
We point out that when the sample size is sufficiently large and the underlying
data can be modeled by a low-complexity NN, over-parameterization often implies
over-specification.
Over-specified models have attracted much attention in recent research.
Many works target smooth activation functions, such as sigmoid activation [100,
141], linear activation [74], quadratic activation [114] and general smooth activations
[86, 97]. Brutzkus et al. [22] recently studied over-parameterization of
Leaky ReLU NNs for classification problems with separable data. In this section,
we consider over-specified NNs with ReLU activation for regression problems. To
the best of our knowledge, this section is the first to directly analyze the
over-specified ReLU network under the square loss.

In this section, we analyze a simple case: recovering a ReLU function by
applying gradient descent on the squared-loss function with an over-specified one-hidden-layer
ReLU network. Our analysis shows that when starting from a small
initialization, gradient descent converges to the target model. Under mild conditions,
our sample complexity and time complexity are independent of the over-specification
degree up to a logarithmic factor. Furthermore, we present some experimental
results with multiple hidden ReLU units in the ground-truth network.
8.1.1 Learning a ReLU using Over-specified Neural Networks
We consider the regression problem with input data (x, y), where x ∈ R^d
and y ∈ R. The data is generated as follows:

D : x ∼ N(0, I_d); y = φ(w*^T x). (8.1)

In Eq. (8.1), φ(·) is the ReLU activation function φ(z) = max{z, 0}, and w* is the
underlying ground-truth parameter vector. Without loss of generality, we assume ‖w*‖ =
1. This model can be viewed as one of the simplest NNs, with one hidden
layer containing one hidden unit. Our convergence guarantee is based on this simple true
model. As empirically studied by [106] and also shown in our experimental
section, Sec. 8.1.2, for learning a ground-truth NN with multiple hidden units, it
is possible for gradient descent (GD) to get stuck in bad local minima even if the
learning model is mildly over-specified.
We use the natural loss for the regression problem, that is, squared loss in the
objective function, and we learn an over-specified NN model with multiple
ReLU units. Let f(W) denote the population loss,

f(W) = E_{(x,y)} [(Σ_{i=1}^k φ(w_i^T x) − y)^2], (8.2)

where k ≥ 1 and the parameter to be learned is W := [w_1, w_2, · · · , w_k] ∈ R^{d×k}.
More practically, when given a finite set of samples S, we use f_S(W) to denote the
empirical loss,

f_S(W) = (1/|S|) Σ_{(x,y)∈S} (Σ_{i=1}^k φ(w_i^T x) − y)^2. (8.3)

We use gradient descent to solve the optimization problem min_W f(W), with the
update step

W^{(t+1)} ← W^{(t)} − η_t ∇f(W^{(t)}),

where the i-th column of the gradient of f(W) can be written in the following form:

[∇f(W)]_i = ∂f(W)/∂w_i = E_{(x,y)} [(Σ_{j=1}^k φ(w_j^T x) − y) φ'(w_i^T x) x]. (8.4)
When a finite number of samples is available, the expectations above are replaced
by empirical averages. Under this setting, we introduce our main result in the
following section.
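As a sanity check on Eq. (8.3) and the gradient formula, here is a NumPy sketch that compares the analytic gradient (written with the factor 2 that comes from differentiating the square) against a central finite difference on the empirical loss; all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, n = 6, 3, 512
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
Xs = rng.standard_normal((n, d))
ys = np.maximum(Xs @ w_star, 0.0)      # single-ReLU target, Eq. (8.1)

def loss(W):
    # Empirical squared loss (8.3) for the over-specified one-hidden-layer net.
    pred = np.maximum(Xs @ W, 0.0).sum(axis=1)
    return np.mean((pred - ys) ** 2)

def grad(W):
    # Column i: (2/n) sum_x (sum_j phi(w_j^T x) - y) phi'(w_i^T x) x.
    Z = Xs @ W
    resid = np.maximum(Z, 0.0).sum(axis=1) - ys
    return 2.0 * Xs.T @ (resid[:, None] * (Z > 0)) / n

W = 0.1 * rng.standard_normal((d, k))
g = grad(W)

# Finite-difference check of one coordinate of the gradient.
eps = 1e-6
E = np.zeros((d, k)); E[0, 0] = eps
fd = (loss(W + E) - loss(W - E)) / (2 * eps)
print(abs(fd - g[0, 0]))
```

The two values agree to several digits; at a random W no sample sits exactly on a ReLU kink, so the piecewise gradient is valid almost surely.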
Now let’s first present our algorithms and theoretical results on the con-
vergence of gradient descent for both population and finite-sample settings. It is
well-known that learning neural networks requires some tactics in selecting proper
initialization and step size. We show that, with a small initialization magnitude, the
optimization path of gradient descent can be divided into the following two phases,
which will be elaborated in Appendix F.2.1 and F.2.2.
1. In the first phase, the angle between w_i and w* decreases to a small value,
while the magnitudes of {w_i}_{i∈[k]} remain small. At the end of this phase, all
w_i's are closely aligned with w*. Therefore, this phase can be viewed as
compressing the over-specified model class to a minimal model class that can
be fit to the target model.

2. In the second phase, the magnitude of Σ_i w_i grows toward the magnitude
of w*, while the angles between w_i and w* remain small. This phase can be
viewed as fitting the compressed model to the target model.
The optimization procedure is presented in Algorithm 8.1.1. Starting from
a random initialization with small scale, Alg. 8.1.1 first performs gradient updates with
a small step size (O(ε^{1/2})). In the second phase, it uses a constant step size to
obtain linear convergence to the target function. The reason for
choosing these step sizes will become clear in the analysis. Our main result shows
that Alg. 8.1.1 converges to the target function with time complexity O(ε^{-1/2}). The
Algorithm 8.1.1 Gradient Descent for Minimizing Eq. (8.2)
1: procedure LEARNINGOVERSPECIFIED1NN(k, ε, σ)
2:   Set η_1 = O(ε^{1/2} k^{-1} (log(ε^{-1}))^{-1}), T_1 = O(ε^{-1/2}) and η_2 = O(k^{-1}), T_2 = O(log(ε^{-1})).
3:   Initialize w_i^{(0)} i.i.d. from the uniform distribution on a sphere with radius σ.
4:   for t = 0, 1, 2, · · · , T_1 − 1 do
5:     W^{(t+1)} = W^{(t)} − η_1 ∇f(W^{(t)})
6:   for t = T_1, T_1 + 1, · · · , T_1 + T_2 − 1 do
7:     W^{(t+1)} = W^{(t)} − η_2 ∇f(W^{(t)})
8:   Return W^{(T_1+T_2)}
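A runnable toy version of this two-phase schedule, with a large-sample empirical gradient standing in for the population gradient (all constants, sizes, and step sizes below are illustrative picks, not the theorem's values):

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, n = 4, 2, 2048
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((n, d))
y = np.maximum(X @ w_star, 0.0)        # single-ReLU target, Eq. (8.1)

def grad(W):
    # Empirical version of Eq. (8.4), keeping the factor 2 from the square.
    Z = X @ W
    resid = np.maximum(Z, 0.0).sum(axis=1) - y
    return 2.0 * X.T @ (resid[:, None] * (Z > 0)) / n

# Small random initialization: columns on a sphere of radius sigma.
sigma = 1e-3
W = rng.standard_normal((d, k))
W *= sigma / np.linalg.norm(W, axis=0)

eta1, T1 = 0.01, 500      # phase 1: directions align with w*
eta2, T2 = 0.2, 2000      # phase 2: the summed magnitude grows to match w*
for _ in range(T1):
    W -= eta1 * grad(W)
for _ in range(T2):
    W -= eta2 * grad(W)

err = np.linalg.norm(W.sum(axis=1) - w_star)
loss = np.mean((np.maximum(X @ W, 0.0).sum(axis=1) - y) ** 2)
print(err, loss)
```

With this small initialization the columns first rotate toward w* while staying short, and only then does Σ_i w_i grow to match w*, mirroring the two phases described above.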
proof details are postponed to Appendix F.3.1.

Theorem 8.1.1 (Convergence Guarantee of Population Gradient Descent). For any
ε > 0, with small enough σ, after T = O(ε^{-1/2}) iterations, Algorithm 8.1.1 outputs
a solution W^{(T)} that satisfies,

f(W^{(T)}) ≲ ε, almost surely.

Furthermore, for the weight vector of each hidden unit, we have

∠(w_i^{(T)}, w*) ≲ √ε, ‖Σ_j w_j^{(T)} − w*‖ ≲ ε.

Remark 8.1.1. For any target error ε, the total time complexity of Alg. 8.1.1 is
O(ε^{-1/2}). It can be seen from Theorem 8.1.1 that this complexity is independent of
the number of hidden units in the over-specified model. The over-specification only
changes the magnitude of the initialization, σ = O(1/k), and the step size, η_1 = O(1/k).
In a more practical setting, one only has access to a finite, yet large, number
of observations. Building on the algorithm with the population gradient, we establish the
Algorithm 8.1.2 Gradient Descent with Finite Samples
1: procedure LEARNINGOVERSPECIFIED1NNEMP(W^{(0)}, k, ε, δ, S)
2:   Set η_1 = O(ε^{1/2} k^{-1} (log(ε^{-1}))^{-1}), T_1 = O(ε^{-1/2}) and η_2 = O(k^{-1}), T_2 = O(log(ε^{-1})).
3:   Divide the dataset S into T_1 + T_2 batches, {S^{(t)}}_{t=0,1,2,··· ,T_1+T_2−1}, which satisfy |S^{(t)}| ≳ ε^{-2} log(1/ε) · d log k log(1/δ).
4:   for t = 0, 1, 2, · · · , T_1 − 1 do
5:     W^{(t+1)} = W^{(t)} − η_1 ∇f_{S^{(t)}}(W^{(t)})
6:   for t = T_1, T_1 + 1, · · · , T_1 + T_2 − 1 do
7:     W^{(t+1)} = W^{(t)} − η_2 ∇f_{S^{(t)}}(W^{(t)})
8:   Return W^{(T_1+T_2)}
empirical version of the algorithm in Alg. 8.1.2. Alg. 8.1.2 is presented in an online
setting, where new independent samples are drawn in each iteration. It can be
viewed as stochastic gradient descent without reusing the samples. We show that
with a mild condition on the initialization, the function returned by Algorithm 8.1.2
converges to the target function. The following theorem characterizes the time and
sample complexity of Alg. 8.1.2.
Theorem 8.1.2 (Convergence Guarantee of Empirical Gradient Descent). Let the
target error ε > 0, tail probability δ > 0, and initialization {w_i^{(0)}}_{i∈[k]} be given in
Algorithm 8.1.2. Denote α_i^{(0)} := ∠(w_i^{(0)}, w*), let |S^{(t)}| be the batch size in the t-th
iteration, and let W^{(T)} be the output of Algorithm 8.1.2 after T iterations. Assume the
initialization satisfies: min_i (π − α_i^{(0)})^3 ≳ ε, and ‖w_i^{(0)}‖ ≲ ε^{1/2} · k^{-1} · (log(ε^{-1}))^{-2} ·
min_i (π − α_i^{(0)})^3. If |S^{(t)}| ≳ ε^{-2} log(1/ε) · d log k log(1/δ), and T = O(ε^{-1/2}), then
with probability at least 1 − δ,

f(W^{(T)}) ≲ ε.

Furthermore, the weights for each hidden unit satisfy

∠(w_i^{(T)}, w*) ≲ √ε, ‖Σ_j w_j^{(T)} − w*‖ ≲ ε.
Remark 8.1.2. In Algorithm 8.1.2, the total number of samples required is O(ε^{-5/2} ·
d) and the time complexity is O(ε^{-1/2}). The initialization condition essentially
requires that the initialization vectors are not aligned with the opposite direction of w*
and that their magnitude is not too large. If one initializes W^{(0)} uniformly from a sphere,
this is harder to satisfy for larger k.
8.1.2 Numerical Experiments with Multiple Hidden Units
In this section, we show some empirical results for the case where the underlying
model has more than one hidden unit. Specifically, we formulate the model
as

x ∼ N(0, I_d); y = Σ_{j=1}^{k_0} φ(w*_j^T x), (8.5)

where {w*_j}_{j=1,2,··· ,k_0} are the ground-truth parameters. We set d = k_0, W* :=
[w*_1, w*_2, · · · , w*_{k_0}] = I_d and k ≥ k_0. We then perform gradient descent with
random initialization on the population risk, Eq. (8.2). Similar experiments are
also conducted in [106], where they show that bad local minima commonly exist for
Eq. (8.2) for some d, k and k_0. Here we want to show that when k becomes larger,
GD becomes less likely to get stuck in local minima; moreover, even if it gets
stuck, the converged local minimum is smaller.
Table 8.1: The average objective function value when gradient descent gets stuck at a non-global local minimum over 100 random trials. Note that the average function value here does not take globally converged function values into account.

k − k0:    0       1       2       4       8       16      32      64
k0 = 4:    0       0       0       0       0       0       0       0
k0 = 8:    0.2703  0       0       0       0       0       0       0
k0 = 16:   0.3088  0.2915  0       0       0       0       0       0
k0 = 32:   0.6017  0.4295  0.4079  0.6358  0.3276  0       0       0
k0 = 64:   0.9060  0.6463  0.6126  0.5432  0.4491  0.2724  0.1143  0
As we can see from Table 8.1, gradient descent gets stuck in bad local minima
more frequently for large k_0 and d. However, increasing k reduces the average
training error and helps the solution land closer to the global optima. When
k is as large as 2k_0, GD converges to the global optimum in all the cases we tried.
These results show that over-specification plays an important role in reshaping the
landscape of the objective function to make the optimization easier. More interestingly,
even when the algorithm gets stuck at a local minimum, the average local minimum
value is smaller for larger k, as shown in Table 8.1. Note that our experiments
are performed on the population risk, and the gradient is evaluated exactly given standard
Gaussian input; hence the neural network is not over-parameterized. The phenomenon
that over-specified models mitigate the local minima issue arises not
only when the number of parameters is larger than the number of samples, but also
in the population case.
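A scaled-down version of this experiment can be reproduced with a large-sample empirical risk as a stand-in for the exact population gradient used in the original table (k_0, the sample size, step size, and iteration count below are all illustrative; Table 8.1 indicates that for k_0 = 4 gradient descent reached the global optimum in every trial, even without over-specification):

```python
import numpy as np

rng = np.random.default_rng(7)
k0 = 4
d = k0
n = 8192
X = rng.standard_normal((n, d))
y = np.maximum(X, 0.0).sum(axis=1)       # target W* = I_d, as in Eq. (8.5)

def run_gd(k, steps=4000, eta=0.05):
    # Gradient descent on the empirical squared loss with k hidden units.
    W = rng.standard_normal((d, k)) / np.sqrt(d)
    for _ in range(steps):
        Z = X @ W
        resid = np.maximum(Z, 0.0).sum(axis=1) - y
        W -= eta * 2.0 * X.T @ (resid[:, None] * (Z > 0)) / n
    return np.mean((np.maximum(X @ W, 0.0).sum(axis=1) - y) ** 2)

loss_exact = run_gd(k=k0)       # exactly specified model
loss_over = run_gd(k=2 * k0)    # over-specified model, k = 2 k0
print(loss_exact, loss_over)
```

The initial loss is about k_0/2, so a final loss near zero indicates convergence to the global optimum; rerunning over many seeds and larger k_0 reproduces the trend in Table 8.1.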
8.2 Initialization Methods
Initialization methods can significantly affect the performance of non-convex
optimization. Here we discuss some initialization methods used for deep learning.
There are mainly three initialization methods for deep learning in the theoretical
and practical communities.

Tensor Methods. In this thesis, we used tensor methods to initialize the
parameters and obtain the theoretical guarantees. However, in our empirical experience,
tensor methods are very sensitive to the data distribution; they often require
knowing the underlying input data distribution. Also, tensor methods involve
a third-order tensor decomposition, whose computational complexity can be cubic
in the dimension if not implemented carefully. Therefore, tensor methods are not
commonly used as an initialization method in practice. Nevertheless, when the data
has a nice distribution, tensor methods can be guaranteed to obtain a good initialization
and are preferable to random initialization, as shown in Fig. 3.2 for mixed
linear regression and Fig. 4.1 for one-hidden-layer neural networks.
Random Initialization. In practice, random Gaussian or random uniform
numbers are often employed to initialize the parameters. Although different random
initialization methods can affect the converged solution significantly, they can
perform well if carefully handled. For example, Xavier initialization [51], which is one
of the most popular choices in practice, uses a random Gaussian distribution with
variance inversely proportional to the weight dimension. In this thesis, we also used
random initialization for real-world applications, such as recommender systems
and semi-supervised clustering in Chapter 6. On the other hand, random initialization
does not always perform well on synthetic data. For example, it might not converge
or converge very slowly, as shown in Fig. 3.2 for mixed linear regression and
Fig. 4.1 for one-hidden-layer neural networks. Even if the neural network is a little
over-specified, as shown in Table 8.1, gradient descent with random initialization
might still get stuck in a local minimum.
For theoretical guarantees, random initialization can be applied to non-convex
problems together with local search heuristics when the objective function
does not have bad local minima, since saddle points can be escaped by perturbed
gradient descent [73]. Although naive square-loss based objectives can have bad
local minima, as shown in Table 8.1, landscape design [47] has been proposed to use a
different objective function that eliminates all the bad local minima.

It has also been shown that random initialization can converge to global minima
with some probability for simple convolutional neural networks [39], or converge
with a small initialization magnitude for residual networks [85].
Orthogonal Initialization. Orthogonal initialization uses orthogonal matrices
as the initial weight matrices. It has been shown that orthogonal weight matrices
can avoid the vanishing/exploding gradient problem in deep neural networks
and recurrent neural networks [90, 144], and can thus lead to better solutions in practice.
For example, [134] uses orthogonal initialization for 10,000-layer CNNs and
achieves results comparable to residual networks.
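The two practical schemes above can be sketched in a few lines of NumPy (Xavier's scaling here uses the common fan-in/fan-out average; treat the exact constants as illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)

def xavier_init(fan_in, fan_out):
    # Gaussian with variance inversely proportional to the layer widths.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return std * rng.standard_normal((fan_in, fan_out))

def orthogonal_init(fan_in, fan_out):
    # Orthonormal weights have all singular values equal to 1, which helps
    # avoid vanishing/exploding gradients in very deep networks.
    a = rng.standard_normal((fan_in, fan_out))
    q, r = np.linalg.qr(a)
    return q * np.sign(np.diag(r))      # fix the sign ambiguity of QR

W1 = xavier_init(256, 256)
W2 = orthogonal_init(256, 256)
s1 = np.linalg.svd(W1, compute_uv=False)
s2 = np.linalg.svd(W2, compute_uv=False)
print(s1.max(), s2.max(), s2.min())
```

The Xavier matrix has singular values spread over an interval (so products of many layers can still shrink or amplify signals), whereas the orthogonal matrix has all singular values exactly 1.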
8.3 Stochastic Gradient Descent (SGD) and Other Fast Algorithms
Our convergence rate for gradient descent requires resampling in each iteration.
This procedure can also be viewed as stochastic gradient descent with a sufficiently
large batch size (proportional to the input dimension) and without reusing
the data. Therefore, if the input dimension is not too large, typical empirical batch sizes
(128-512) suffice to provide the guarantees. Note that we already show that the
population objective has positive definite Hessians everywhere in a neighborhood of
the ground truth. The caveat of using small-batch SGD is that an SGD update might
move the iterate outside of the local strong convexity region, which would invalidate
the following steps. Therefore, more statistical analysis needs to be done to
bound the probability that SGD with a small batch size moves out of the
local strong convexity region during the optimization process. Other SGD-based
algorithms, such as Adam or AdaGrad, may encounter similar problems.
Second-order optimization algorithms are also very popular, especially for
smooth and strongly convex problems. For example, quasi-Newton methods, such
as LBFGS, could be employed to minimize smooth non-convex objective functions.
For traditional strongly convex functions, quasi-Newton methods can achieve a super-linear
convergence rate. For the non-convex problems we considered, we can easily
show that quasi-Newton methods achieve a super-linear local convergence rate
in the population case if initialized properly. For the empirical case, more careful
analyses need to be conducted.
Appendices
Appendix A
Mixed Linear Regression
A.1 Proofs of Local Convergence

A.1.1 Some Lemmata
We first introduce some facts and lemmata, whose proofs can be found in
Sec. A.1.4.
Fact A.1.1. We define a sequence of constants, {C_J}_{J=0,1,···}, that satisfy

C_0 = 1, C_1 = 3, and C_J = C_{J−1} + (4J^2 + 2J)C_{J−2} for J ≥ 2. (A.1)

By construction, we can upper bound C_J:

C_J ≤ C_{J−2} + (4J^2 + 2J)C_{J−2} + (4(J − 1)^2 + 2J − 2)C_{J−3}
    ≤ C_{J−2} + (4J^2 + 2J)C_{J−2} + (4(J − 1)^2 + 2J − 2)C_{J−2}
    ≤ 8J^2 C_{J−2}
    ≤ (3J)^J. (A.2)
Lemma A.1.1 (Stein-type Lemma). Let x ∼ N(0, I_d) and let f(x) be a function of x
whose second derivative exists. Then

E[f(x) xx^T] = E[f(x)] I + E[∇^2 f(x)].
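Lemma A.1.1 is easy to sanity-check by Monte Carlo with a quadratic test function: for f(x) = x^T A x with symmetric A, E[f] = tr(A) and ∇^2 f = 2A, both constant, so the identity predicts E[f(x) xx^T] = tr(A) I + 2A:

```python
import numpy as np

rng = np.random.default_rng(9)
d, n = 4, 1_000_000
A = rng.standard_normal((d, d))
A = A @ A.T                                    # symmetric test matrix

X = rng.standard_normal((n, d))
f = np.einsum('ni,ij,nj->n', X, A, X)          # f(x) = x^T A x per sample
lhs = np.einsum('n,ni,nj->ij', f, X, X) / n    # Monte Carlo E[f(x) x x^T]
rhs = np.trace(A) * np.eye(d) + 2 * A          # E[f] I + E[hess f]
err = np.linalg.norm(lhs - rhs) / np.linalg.norm(rhs)
print(err)
```

The relative error shrinks at the usual O(1/√n) Monte Carlo rate, matching the lemma.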
Lemma A.1.2. Let x ∼ N(0, I_d) and A_k ⪰ 0 for all k = 1, 2, · · · , K. Then

Π_{k=1}^K tr(A_k) I ⪯ E[Π_{k=1}^K (x^T A_k x) xx^T] ⪯ C_K Π_{k=1}^K tr(A_k) I, (A.3)

where C_K is a constant depending only on K, which is defined in Eq. (A.1).
Lemma A.1.3. Let x_i ∼ N(0, I_d) i.i.d. for all i ∈ [n] and A_k ⪰ 0 for all k =
1, 2, · · · , K. Let B := E[Π_{k=1}^K (x^T A_k x) xx^T], B_i := Π_{k=1}^K (x_i^T A_k x_i) x_i x_i^T and
B̂ = (1/n) Σ_{i=1}^n B_i.

If n ≥ O((1/δ^2) log^K(1/δ) (PK)^K d log^{K+1} d) and δ > √(4K) C_{2K+1}/√(n d^P) for some 0 <
δ ≤ 1 and P ≥ 1, then w.p. 1 − O(K d^{-P}), we have

‖B̂ − B‖ ≤ δ‖B‖. (A.4)
Lemma A.1.4. Let $x \sim N(0, I_d)$. Then given $\beta, \gamma \in \mathbb{R}^d$ and $A_k \succeq 0$ for all $k = 1, 2, \cdots, K$, we have
$$\|\beta\|\|\gamma\|\prod_{k=1}^{K}\operatorname{tr}(A_k) \;\le\; \Big\|\mathbb{E}\Big[(\beta^\top x)(\gamma^\top x)\prod_{k=1}^{K}(x^\top A_k x)\, xx^\top\Big]\Big\| \;\le\; \sqrt{3C_{2K+1}}\,\|\beta\|\|\gamma\|\prod_{k=1}^{K}\operatorname{tr}(A_k). \qquad (A.5)$$
Lemma A.1.5. Let $x_i \sim N(0, I_d)$ i.i.d. for all $i \in [n]$, $\beta, \gamma \in \mathbb{R}^d$ and $A_k \succeq 0$ for all $k = 1, 2, \cdots, K$. Let $B := \mathbb{E}\big[(\beta^\top x)(\gamma^\top x)\prod_{k=1}^{K}(x^\top A_k x)\, xx^\top\big]$, $B_i := (\beta^\top x_i)(\gamma^\top x_i)\prod_{k=1}^{K}(x_i^\top A_k x_i)\, x_i x_i^\top$ and $\bar{B} = \frac{1}{n}\sum_{i=1}^{n} B_i$.
If $n \ge O\big(\frac{1}{\delta^2}\log^{K+1}(1/\delta)(PK)^K d\log^{K+2}(d)\big)$ and $\delta > \frac{\sqrt{8K\,C_{2K+3}}}{\sqrt{nd^P}}$ for some $0 < \delta \le 1$ and $P \ge 1$, then w.p. $1 - O(Kd^{-P})$, we have
$$\|\bar{B} - B\| \le \delta\|B\|. \qquad (A.6)$$
Lemma A.1.6. If $n \ge c\,K4^K\log^{K+1}(c)\, d\log^{K+2}(d)$, where $c$ is a constant, then $n \ge c\,d\log d\,\log^{K+1}(n)$.
A.1.2 Proof of Theorem 3.3.1
Proof. Denote the Hessian of Eq. (3.1) by $H \in \mathbb{R}^{Kd \times Kd}$. Let $H = \sum_i H_i$, where
$$H_i := \begin{bmatrix} H_i^{11} & H_i^{12} & \cdots & H_i^{1K} \\ H_i^{21} & H_i^{22} & \cdots & H_i^{2K} \\ \vdots & \vdots & \ddots & \vdots \\ H_i^{K1} & H_i^{K2} & \cdots & H_i^{KK} \end{bmatrix}. \qquad (A.7)$$
For diagonal blocks,
$$H_i^{jj} := 2\Big(\prod_{k \ne j}\big(y_i - (w_k + \delta w_k)^\top x_i\big)^2\Big)x_ix_i^\top. \qquad (A.8)$$
For off-diagonal blocks,
$$H_i^{jl} := 4\big(y_i - (w_j + \delta w_j)^\top x_i\big)\big(y_i - (w_l + \delta w_l)^\top x_i\big)\Big(\prod_{k \ne j, k \ne l}\big(y_i - (w_k + \delta w_k)^\top x_i\big)^2\Big)x_ix_i^\top. \qquad (A.9)$$
In the following, we show that when $w_k$ is close to the optimal solution $w_k^*$ and $\delta w_k$ is small enough for all $k$, then $H$ is positive definite w.h.p.
The main idea is to upper bound the off-diagonal blocks and lower bound the diagonal blocks, because
$$\begin{aligned}
\sigma_{\min}(H) &= \min_{\sum_{j=1}^{K}\|a_j\|^2 = 1}\Big\{\sum_{j=1}^{K} a_j^\top H^{jj}a_j + \sum_{j \ne l} a_j^\top H^{jl}a_l\Big\}\\
&\ge \min_{\sum_{j=1}^{K}\|a_j\|^2 = 1}\Big\{\sum_{j=1}^{K}\sigma_{\min}(H^{jj})\|a_j\|^2 - \sum_{j \ne l}\|H^{jl}\|\|a_j\|\|a_l\|\Big\}\\
&\ge \min_j\sigma_{\min}(H^{jj}) - \max_{j \ne l}\|H^{jl}\|\sum_{j \ne l}\|a_j\|\|a_l\|\\
&\ge \min_j\sigma_{\min}(H^{jj}) - (K-1)\max_{j \ne l}\|H^{jl}\|,
\end{aligned} \qquad (A.10)$$
where the last step uses $\sum_{j \ne l}\|a_j\|\|a_l\| \le (K-1)\sum_j\|a_j\|^2 = K-1$ by the Cauchy-Schwarz inequality.
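The block lower bound in (A.10) can be sanity-checked numerically. The sketch below is illustrative only (block sizes and perturbation magnitudes are chosen here): it builds a symmetric block matrix with SPD diagonal blocks and small off-diagonal blocks, and confirms that $\min_j\sigma_{\min}(H^{jj}) - (K-1)\max_{j \ne l}\|H^{jl}\|$ indeed lower-bounds $\sigma_{\min}(H)$.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 3, 5

# SPD diagonal blocks, small off-diagonal blocks (transpose-paired for symmetry).
blocks = [[None] * K for _ in range(K)]
for j in range(K):
    S = rng.standard_normal((d, d))
    blocks[j][j] = S @ S.T + 2.0 * np.eye(d)
for j in range(K):
    for l in range(j + 1, K):
        E = 0.05 * rng.standard_normal((d, d))
        blocks[j][l], blocks[l][j] = E, E.T

H = np.block(blocks)
sig_min = np.linalg.eigvalsh(H).min()
diag_min = min(np.linalg.eigvalsh(blocks[j][j]).min() for j in range(K))
off_max = max(np.linalg.norm(blocks[j][l], 2)
              for j in range(K) for l in range(K) if j != l)
lower = diag_min - (K - 1) * off_max
assert lower <= sig_min      # the bound in Eq. (A.10)
```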
First consider the diagonal blocks. The idea is to decompose each diagonal block into two parts. The first part contains only $w$ and not $\delta w$, so for this fixed $w$ we can apply Lemma A.1.3 to bound it. The second part depends on $\delta w$; we find an upper bound for it that depends only on the magnitude of $\delta w$, so the bound holds for any qualified $\delta w$. Let's first define $\{k_1, k_2, \cdots, k_{K-1}\} = [K]\setminus\{j\}$. Then
$$\begin{aligned}
H^{jj} &\succeq \sum_{i \in S_j} H_i^{jj}\\
&= \sum_{i \in S_j} 2\Big(\prod_{s=1}^{K-1}\big(y_i - w_{k_s}^\top x_i - \delta w_{k_s}^\top x_i\big)^2\Big)x_ix_i^\top\\
&\succeq \sum_{i \in S_j} 2\Big(\big(y_i - w_{k_1}^\top x_i\big)^2 - 2\big|y_i - w_{k_1}^\top x_i\big|\|\delta w_{k_1}\|\|x_i\|\Big)\Big(\prod_{s=2}^{K-1}\big(y_i - w_{k_s}^\top x_i - \delta w_{k_s}^\top x_i\big)^2\Big)x_ix_i^\top\\
&\succeq \underbrace{\sum_{i \in S_j} 2\big(y_i - w_{k_1}^\top x_i\big)^2\Big(\prod_{s=2}^{K-1}\big(y_i - w_{k_s}^\top x_i - \delta w_{k_s}^\top x_i\big)^2\Big)x_ix_i^\top}_{F_1}\\
&\quad - \underbrace{\sum_{i \in S_j} 4\|\Delta w_{jk_1}^* - \Delta w_{k_1}\|\|\delta w_{k_1}\|\|x_i\|^2\Big(\prod_{s=2}^{K-1}\big(y_i - w_{k_s}^\top x_i - \delta w_{k_s}^\top x_i\big)^2\Big)x_ix_i^\top}_{E_1}.
\end{aligned} \qquad (A.11)$$
$$\begin{aligned}
F_1 &\succeq \sum_{i \in S_j} 2\big(y_i - w_{k_1}^\top x_i\big)^2\big(y_i - w_{k_2}^\top x_i\big)^2\Big(\prod_{s=3}^{K-1}\big(y_i - w_{k_s}^\top x_i - \delta w_{k_s}^\top x_i\big)^2\Big)x_ix_i^\top\\
&\quad - \sum_{i \in S_j} 4\big(y_i - w_{k_1}^\top x_i\big)^2\|\Delta w_{jk_2}^* - \Delta w_{k_2}\|\|\delta w_{k_2}\|\|x_i\|^2\Big(\prod_{s=3}^{K-1}\big(y_i - w_{k_s}^\top x_i - \delta w_{k_s}^\top x_i\big)^2\Big)x_ix_i^\top\\
&\succeq \underbrace{\sum_{i \in S_j} 2\big(y_i - w_{k_1}^\top x_i\big)^2\big(y_i - w_{k_2}^\top x_i\big)^2\Big(\prod_{s=3}^{K-1}\big(y_i - w_{k_s}^\top x_i - \delta w_{k_s}^\top x_i\big)^2\Big)x_ix_i^\top}_{F_2}\\
&\quad - \underbrace{\sum_{i \in S_j} 4\|\Delta w_{jk_1}^* - \Delta w_{k_1}\|^2\|\Delta w_{jk_2}^* - \Delta w_{k_2}\|\|\delta w_{k_2}\|\|x_i\|^4\Big(\prod_{s=3}^{K-1}\big(y_i - w_{k_s}^\top x_i - \delta w_{k_s}^\top x_i\big)^2\Big)x_ix_i^\top}_{E_2}.
\end{aligned} \qquad (A.12)$$
Similarly, we decompose $F_n \succeq F_{n+1} - E_{n+1}$ for $n = 1, 2, \cdots, K-2$. Then, recursively, we have
$$H^{jj} \succeq F_1 - E_1 \succeq F_2 - E_2 - E_1 \succeq \cdots \succeq F_{K-1} - E_{K-1} - E_{K-2} - \cdots - E_1. \qquad (A.13)$$
So $H^{jj}$ is decomposed into $F_{K-1}$, which contains only $w$, and $E_1, E_2, \cdots, E_{K-1}$, each of which contains a separate factor of $\|\delta w\|$.
By Lemma A.1.2 and Lemma A.1.3,
$$\begin{aligned}
E_1 &\preceq 4\sum_{i \in S_j}\|\Delta w_{jk_1}^* - \Delta w_{k_1}\|\|\delta w_{k_1}\|\Big(\prod_{s=2}^{K-1}\|\Delta w_{jk_s}^* - \Delta w_{k_s} - \delta w_{k_s}\|^2\Big)\|x_i\|^{2(K-1)}x_ix_i^\top\\
&\preceq 4c_f(1 + c_m + c_f)^{2K-3}\prod_{k: k \ne j}\|\Delta w_{jk}^*\|^2\sum_{i \in S_j}\|x_i\|^{2(K-1)}x_ix_i^\top\\
&\preceq 6c_f(1 + c_m + c_f)^{2K-3}\prod_{k: k \ne j}\|\Delta w_{jk}^*\|^2\, p_jNC_{K-1}d^{K-1}I,
\end{aligned} \qquad (A.14)$$
and similarly, for all $r = 1, 2, \cdots, K-1$,
$$E_r \preceq 6c_f(1 + c_m + c_f)^{2K-3}\prod_{k: k \ne j}\|\Delta w_{jk}^*\|^2\, p_jNC_{K-1}d^{K-1}I. \qquad (A.15)$$
For $F_{K-1}$, we have
$$\begin{aligned}
F_{K-1} &= \sum_{i \in S_j} 2\Big(\prod_{k \ne j}\big(y_i - w_k^\top x_i\big)^2\Big)x_ix_i^\top\\
&\overset{\xi_1}{\succeq} p_jN\prod_{k \ne j}\|\Delta w_{jk}^* - \Delta w_k\|^2\, I\\
&\succeq p_jN\prod_{k \ne j}\big(\|\Delta w_{jk}^*\| - \|\Delta w_k\|\big)^2\, I\\
&\succeq p_jN(1 - c_m)^{2(K-1)}\prod_{k \ne j}\|\Delta w_{jk}^*\|^2\, I,
\end{aligned} \qquad (A.16)$$
where $\xi_1$ follows from Lemma A.1.2 and Lemma A.1.3 by setting $A_k = (\Delta w_{jk}^* - \Delta w_k)(\Delta w_{jk}^* - \Delta w_k)^\top$ and $\delta = 1/(2C_{K-1})$.
Now combining Eq. (A.16), Eq. (A.13) and Eq. (A.15), we can lower bound the eigenvalues of $H^{jj}$:
$$H^{jj} \succeq \Big((1 - c_m)^{2(K-1)} - 6c_f(K-1)(1 + c_m + c_f)^{2K-3}C_{K-1}d^{K-1}\Big)p_jN\prod_{k \ne j}\|\Delta w_{jk}^*\|^2\, I. \qquad (A.17)$$
Next consider the off-diagonal blocks for $j \ne l$:
$$\begin{aligned}
\sum_{i \in S_q} H_i^{jl} &= \sum_{i \in S_q} 4\big(y_i - (w_j + \delta w_j)^\top x_i\big)\big(y_i - (w_l + \delta w_l)^\top x_i\big)\Big(\prod_{k \ne j, k \ne l}\big(y_i - (w_k + \delta w_k)^\top x_i\big)^2\Big)x_ix_i^\top\\
&\preceq \sum_{i \in S_q} 4\big(y_i - w_j^\top x_i\big)\big(y_i - (w_l + \delta w_l)^\top x_i\big)\Big(\prod_{k \ne j, k \ne l}\big(y_i - (w_k + \delta w_k)^\top x_i\big)^2\Big)x_ix_i^\top\\
&\quad + \sum_{i \in S_q} 4\big|\delta w_j^\top x_i\big|\big|y_i - (w_l + \delta w_l)^\top x_i\big|\Big(\prod_{k \ne j, k \ne l}\big(y_i - (w_k + \delta w_k)^\top x_i\big)^2\Big)x_ix_i^\top\\
&\preceq \sum_{i \in S_q} 4\big(y_i - w_j^\top x_i\big)\big(y_i - w_l^\top x_i\big)\Big(\prod_{k \ne j, k \ne l}\big(y_i - (w_k + \delta w_k)^\top x_i\big)^2\Big)x_ix_i^\top\\
&\quad + \sum_{i \in S_q} 4\big|y_i - w_j^\top x_i\big|\big|\delta w_l^\top x_i\big|\Big(\prod_{k \ne j, k \ne l}\big(y_i - (w_k + \delta w_k)^\top x_i\big)^2\Big)x_ix_i^\top\\
&\quad + \sum_{i \in S_q} 4\|\delta w_j\|\|w_q^* - w_l - \delta w_l\|\Big(\prod_{k \ne j, k \ne l}\|w_q^* - w_k - \delta w_k\|^2\Big)\|x_i\|^{2(K-1)}x_ix_i^\top\\
&\;\;\vdots\\
&\preceq \sum_{i \in S_q} 4\big(y_i - w_j^\top x_i\big)\big(y_i - w_l^\top x_i\big)\Big(\prod_{k \ne j, k \ne l}\big(y_i - w_k^\top x_i\big)^2\Big)x_ix_i^\top + 8(K-1)c_f(1 + c_m + c_f)^{2K-3}\Delta_{\max}^{2K-2}\sum_{i \in S_q}\|x_i\|^{2(K-1)}x_ix_i^\top\\
&\preceq \sum_{i \in S_q} 4\big(y_i - w_j^\top x_i\big)\big(y_i - w_l^\top x_i\big)\Big(\prod_{k \ne j, k \ne l}\big(y_i - w_k^\top x_i\big)^2\Big)x_ix_i^\top + 12(K-1)c_f(1 + c_m + c_f)^{2K-3}\Delta_{\max}^{2K-2}\, p_qNC_{K-1}d^{K-1}I.
\end{aligned} \qquad (A.18)$$
For the first term above,
$$\begin{aligned}
&\Big\|\sum_{i \in S_q} 4(w_q^* - w_j)^\top x_i\,(w_q^* - w_l)^\top x_i\Big(\prod_{k \ne j, k \ne l}\big((w_q^* - w_k)^\top x_i\big)^2\Big)x_ix_i^\top\Big\|\\
&\overset{\xi_1}{\le} 6p_qN\Big\|\mathbb{E}\Big[(w_q^* - w_j)^\top x\,(w_q^* - w_l)^\top x\Big(\prod_{k \ne j, k \ne l}\big((w_q^* - w_k)^\top x\big)^2\Big)xx^\top\Big]\Big\|\\
&\overset{\xi_2}{\le} 6p_qN\sqrt{3C_{2K-3}}\,\|w_q^* - w_j\|\|w_q^* - w_l\|\Big(\prod_{k \ne j, k \ne l}\|w_q^* - w_k\|^2\Big)\\
&\le 6p_qN\sqrt{3C_{2K-3}}\,\|\Delta w_{qj}^* - \Delta w_j\|\|\Delta w_{ql}^* - \Delta w_l\|\Big(\prod_{k \ne j, k \ne l}\|\Delta w_{qk}^* - \Delta w_k\|^2\Big),
\end{aligned} \qquad (A.19)$$
where $\xi_1$ follows from Lemma A.1.5 and $\xi_2$ from Lemma A.1.4.
We consider three cases: $\{q \ne j, q \ne l\}$, $\{q = j\}$ and $\{q = l\}$. When $q \ne j$ and $q \ne l$,
$$\|\Delta w_{qj}^* - \Delta w_j\|\|\Delta w_{ql}^* - \Delta w_l\|\Big(\prod_{k \ne j, k \ne l}\|\Delta w_{qk}^* - \Delta w_k\|^2\Big) \le (1 + c_m)^{2K-2}c_m^2\,\|\Delta w_{qj}^*\|\|\Delta w_{ql}^*\|\Big(\prod_{k \ne j, k \ne l}\|\Delta w_{qk}^*\|^2\Big). \qquad (A.20)$$
When $q = j$,
$$\|\Delta w_{qj}^* - \Delta w_j\|\|\Delta w_{ql}^* - \Delta w_l\|\Big(\prod_{k \ne j, k \ne l}\|\Delta w_{qk}^* - \Delta w_k\|^2\Big) \le (1 + c_m)^{2K-1}c_m\,\|\Delta w_{qj}^*\|\|\Delta w_{ql}^*\|\Big(\prod_{k \ne j, k \ne l}\|\Delta w_{qk}^*\|^2\Big). \qquad (A.21)$$
For $q = l$, we have similar results. Therefore,
$$\begin{aligned}
\|H^{jl}\| &\le \sum_{q=1}^{K}\Big\|\sum_{i \in S_q} H_i^{jl}\Big\|\\
&\le \sum_q 6(1 + c_m)^{2K-1}c_m\, p_qN\sqrt{3C_{2K-3}}\,\Delta_{\max}^{2K-2} + \sum_q 12(K-1)c_f(1 + c_m + c_f)^{2K-3}p_qNC_{K-1}d^{K-1}\Delta_{\max}^{2K-2}\\
&\le 6(1 + c_m)^{2K-1}c_mN\sqrt{3C_{2K-3}}\,\Delta_{\max}^{2K-2} + 12(K-1)c_f(1 + c_m + c_f)^{2K-3}NC_{K-1}d^{K-1}\Delta_{\max}^{2K-2}.
\end{aligned} \qquad (A.22)$$
Now we obtain the lower bound for the minimal eigenvalue of the Hessian. When
$$c_m \le \frac{p_{\min}\Delta_{\min}^{2K-2}}{500K\sqrt{C_{2K-3}}\,\Delta_{\max}^{2K-2}} \quad \text{and} \quad c_f \le \frac{p_{\min}\Delta_{\min}^{2K-2}}{1000(K-1)^2C_{K-1}d^{K-1}\Delta_{\max}^{2K-2}},$$
we have $(1 - c_m)^{2K-2} \ge (1 - \frac{1}{2K})^{2K-2} \ge \frac{1}{4}$ and $(1 + c_m + c_f)^{2K-2} \le 3$. Hence,
$$\|H^{jl}\| \le \frac{1}{16(K-1)}p_{\min}N\Delta_{\min}^{2K-2}. \qquad (A.23)$$
Combining Eq. (A.10), Eq. (A.17) and Eq. (A.23), we have
$$\sigma_{\min}(H) \ge \frac{1}{8}p_{\min}N\Delta_{\min}^{2K-2}, \qquad (A.24)$$
which is a positive constant.
In the following we upper bound the maximal eigenvalue of the Hessian:
$$\begin{aligned}
\sigma_{\max}(H) &= \max_{\sum_{j=1}^{K}\|a_j\|^2 = 1}\Big\{\sum_{j=1}^{K} a_j^\top H^{jj}a_j + \sum_{j \ne l} a_j^\top H^{jl}a_l\Big\}\\
&\le \max_{\sum_{j=1}^{K}\|a_j\|^2 = 1}\Big\{\sum_{j=1}^{K}\|H^{jj}\|\|a_j\|^2 + \sum_{j \ne l}\|H^{jl}\|\|a_j\|\|a_l\|\Big\}\\
&\le \max_j\|H^{jj}\| + \max_{j \ne l}\|H^{jl}\|\sum_{j \ne l}\|a_j\|\|a_l\|\\
&\le \max_j\|H^{jj}\| + (K-1)\max_{j \ne l}\|H^{jl}\|.
\end{aligned} \qquad (A.25)$$
Consider the diagonal blocks and define $\{k_1, k_2, \cdots, k_{K-1}\} = [K]\setminus\{j\}$:
$$\begin{aligned}
H_i^{jj} &= 2\Big(\prod_{s=1}^{K-1}\big(y_i - w_{k_s}^\top x_i - \delta w_{k_s}^\top x_i\big)^2\Big)x_ix_i^\top\\
&\preceq 2\Big(\big(y_i - w_{k_1}^\top x_i\big)^2 + 2\big|y_i - w_{k_1}^\top x_i\big|\big|\delta w_{k_1}^\top x_i\big| + \big(\delta w_{k_1}^\top x_i\big)^2\Big)\Big(\prod_{s=2}^{K-1}\big(y_i - w_{k_s}^\top x_i - \delta w_{k_s}^\top x_i\big)^2\Big)x_ix_i^\top\\
&\preceq \underbrace{2\big(y_i - w_{k_1}^\top x_i\big)^2\Big(\prod_{s=2}^{K-1}\big(y_i - w_{k_s}^\top x_i - \delta w_{k_s}^\top x_i\big)^2\Big)x_ix_i^\top}_{F_1}\\
&\quad + \underbrace{2\big(2\|\Delta w_{jk_1}^* - \Delta w_{k_1}\| + \|\delta w_{k_1}\|\big)\|\delta w_{k_1}\|\|x_i\|^2\Big(\prod_{s=2}^{K-1}\big(y_i - w_{k_s}^\top x_i - \delta w_{k_s}^\top x_i\big)^2\Big)x_ix_i^\top}_{E_1}.
\end{aligned} \qquad (A.26)$$
For $E_1$,
$$E_1 \preceq 4c_f(1 + c_m + c_f)^{2K-3}\Delta_{\max}^{2K-2}\|x_i\|^{2K-2}x_ix_i^\top. \qquad (A.27)$$
For $F_1$,
$$\begin{aligned}
F_1 &\preceq \underbrace{2\big(y_i - w_{k_1}^\top x_i\big)^2\big(y_i - w_{k_2}^\top x_i\big)^2\Big(\prod_{s=3}^{K-1}\big(y_i - w_{k_s}^\top x_i - \delta w_{k_s}^\top x_i\big)^2\Big)x_ix_i^\top}_{F_2}\\
&\quad + \underbrace{2\big(y_i - w_{k_1}^\top x_i\big)^2\big|\big(2\Delta w_{jk_2}^* - 2\Delta w_{k_2} - \delta w_{k_2}\big)^\top x_i\big|\big|\delta w_{k_2}^\top x_i\big|\Big(\prod_{s=3}^{K-1}\big(y_i - w_{k_s}^\top x_i - \delta w_{k_s}^\top x_i\big)^2\Big)x_ix_i^\top}_{E_2}.
\end{aligned} \qquad (A.28)$$
We also have, for $E_2$,
$$E_2 \preceq 4c_f(1 + c_m + c_f)^{2K-3}\Delta_{\max}^{2K-2}\|x_i\|^{2K-2}x_ix_i^\top. \qquad (A.29)$$
Therefore, recursively, we have
$$H_i^{jj} \preceq \underbrace{2\prod_{s=1}^{K-1}\big(y_i - w_{k_s}^\top x_i\big)^2\,x_ix_i^\top}_{F_{K-1}} + 4Kc_f(1 + c_m + c_f)^{2K-3}\Delta_{\max}^{2K-2}\|x_i\|^{2K-2}x_ix_i^\top. \qquad (A.30)$$
Now applying Lemma A.1.2 and Lemma A.1.3,
$$\begin{aligned}
H^{jj} &= \sum_q\sum_{i \in S_q} H_i^{jj}\\
&\preceq 6c_fK(1 + c_m + c_f)^{2K-3}NC_{K-1}d^{K-1}\Delta_{\max}^{2K-2}I + \sum_q\sum_{i \in S_q} 2\Big(\prod_{k \ne j}\big((\Delta w_{qk}^* - \Delta w_k)^\top x_i\big)^2\Big)x_ix_i^\top\\
&\preceq 6c_fK(1 + c_m + c_f)^{2K-3}NC_{K-1}d^{K-1}\Delta_{\max}^{2K-2}I + 3\sum_q p_qNC_{K-1}\Big(\prod_{k \ne j}\|\Delta w_{qk}^* - \Delta w_k\|^2\Big)I\\
&= 6c_fK(1 + c_m + c_f)^{2K-3}NC_{K-1}d^{K-1}\Delta_{\max}^{2K-2}I + 3p_jNC_{K-1}(1 + c_m)^{2K-2}\Big(\prod_{k \ne j}\|\Delta w_{jk}^*\|^2\Big)I\\
&\quad + 3\sum_{q: q \ne j} p_qNC_{K-1}c_m^2(1 + c_m)^{2K-4}\Big(\prod_{k: k \ne j}\|\Delta w_{qk}^*\|^2\Big)I\\
&\preceq 9NC_{K-1}\Delta_{\max}^{2K-2}I.
\end{aligned} \qquad (A.31)$$
Combining the off-diagonal block bound in Eq. (A.23), applying a union bound over the probabilities of the lemmata, and using Eq. (A.2) completes the proof.
A.1.3 Proof of Theorem 3.3.2
We first introduce a corollary of Theorem 3.3.1, which shows the strong
convexity on a line between a current iterate and the optimum.
Corollary A.1.1 (Positive Definiteness on the Line between $w$ and $w^*$). Let $\{x_i, y_i\}_{i=1,2,\cdots,N}$ be sampled from the MLR model (3.2). Let $\{w_k\}_{k=1,2,\cdots,K}$ be independent of the samples and lie in the neighborhood of the optimal solution defined in Eq. (3.3). Then, if $N \ge O(K^Kd\log^{K+2}(d))$, w.p. $1 - O(Kd^{-2})$, for all $\lambda \in [0, 1]$,
$$\frac{1}{8}p_{\min}N\Delta_{\min}^{2K-2}\,I \preceq \nabla^2 f\big(\lambda w^* + (1-\lambda)w\big) \preceq 10N(3K)^K\Delta_{\max}^{2K-2}\,I. \qquad (A.32)$$
Proof. We set $d^{K-1}$ anchor points equally spaced along the line $\lambda w^* + (1-\lambda)w$ for $\lambda \in [0, 1]$. Then, based on these anchors, applying Theorem 3.3.1 with $P = K + 1$ completes the proof.
Now we show the proof of Theorem 3.3.2.
Proof. Let $\alpha := \frac{1}{8}p_{\min}N\Delta_{\min}^{2K-2}$ and $\beta := 10N(3K)^K\Delta_{\max}^{2K-2}$. Then
$$\|w^+ - w^*\|^2 = \|w - \eta\nabla f(w) - w^*\|^2 = \|w - w^*\|^2 - 2\eta\nabla f(w)^\top(w - w^*) + \eta^2\|\nabla f(w)\|^2, \qquad (A.33)$$
$$\nabla f(w) = \Big(\int_0^1\nabla^2 f\big(w^* + \gamma(w - w^*)\big)\,d\gamma\Big)(w - w^*) =: \widehat{H}(w - w^*). \qquad (A.34)$$
According to Corollary A.1.1,
$$\alpha I \preceq \widehat{H} \preceq \beta I. \qquad (A.35)$$
$$\|\nabla f(w)\|^2 = (w - w^*)^\top\widehat{H}^2(w - w^*) \le \beta(w - w^*)^\top\widehat{H}(w - w^*). \qquad (A.36)$$
Therefore,
$$\begin{aligned}
\|w^+ - w^*\|^2 &\le \|w - w^*\|^2 - (2\eta - \eta^2\beta)(w - w^*)^\top\widehat{H}(w - w^*)\\
&\le \|w - w^*\|^2 - (2\eta - \eta^2\beta)\alpha\|w - w^*\|^2\\
&= \|w - w^*\|^2 - \frac{\alpha}{\beta}\|w - w^*\|^2\\
&= \Big(1 - \frac{\alpha}{\beta}\Big)\|w - w^*\|^2,
\end{aligned} \qquad (A.37)$$
where the third equality holds by setting $\eta = 1/\beta$.
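The contraction (A.37) can be observed directly on any function with $\alpha I \preceq \nabla^2 f \preceq \beta I$; a quadratic is the simplest instance. The sketch below is illustrative only (dimensions and the random Hessian are chosen here): it runs gradient descent with step size $\eta = 1/\beta$ and checks that the per-step squared error never shrinks by less than the factor $1 - \alpha/\beta$.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
Q = rng.standard_normal((d, d))
H = Q @ Q.T + np.eye(d)            # Hessian of f(w) = 0.5 (w - w*)^T H (w - w*)
w_star = rng.standard_normal(d)

evals = np.linalg.eigvalsh(H)
alpha, beta = evals.min(), evals.max()
eta = 1.0 / beta

w = w_star + rng.standard_normal(d)
for _ in range(50):
    err_before = np.linalg.norm(w - w_star) ** 2
    w = w - eta * H @ (w - w_star)          # gradient step
    err_after = np.linalg.norm(w - w_star) ** 2
    assert err_after <= (1 - alpha / beta) * err_before + 1e-12
```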
A.1.4 Proofs of the Lemmata
A.1.4.1 Proof of Lemma A.1.1
Proof. Let $g(x) = \frac{1}{(2\pi)^{d/2}}e^{-\|x\|^2/2}$, so that $x\,g(x) = -\nabla g(x)$. Then
$$\begin{aligned}
\mathbb{E}\big[f(x)xx^\top\big] &= \int f(x)\,xx^\top g(x)\,dx\\
&= -\int f(x)\big(\nabla g(x)\big)x^\top dx\\
&= \int\nabla f(x)\,x^\top g(x)\,dx + \int f(x)g(x)\,I\,dx\\
&= -\int\nabla f(x)\big(\nabla g(x)\big)^\top dx + \mathbb{E}[f(x)]\,I\\
&= \mathbb{E}\big[\nabla^2 f(x)\big] + \mathbb{E}[f(x)]\,I,
\end{aligned} \qquad (A.38)$$
where the third and fifth equalities use integration by parts.
A.1.4.2 Proof of Lemma A.1.2
Proof. Let $G_K := \mathbb{E}\big[\prod_{k=1}^{K}(x^\top A_kx)\,xx^\top\big]$. First we show the lower bound:
$$\sigma_{\min}(G_K) = \min_{\|a\|=1}\mathbb{E}\Big[\prod_{k=1}^{K}(x^\top A_kx)(x^\top a)^2\Big] \ge \prod_{k=1}^{K}\mathbb{E}\big[x^\top A_kx\big]\cdot\min_{\|a\|=1}\mathbb{E}\big[(x^\top a)^2\big] = \prod_{k=1}^{K}\operatorname{tr}(A_k). \qquad (A.39)$$
Next, we show the upper bound. When $K = 1$, $G_1 = \operatorname{tr}(A_1)I + 2A_1$, and for any $K > 1$, $G_K$ has an explicit closed form. However, it is too complicated to derive and formulate for general $K$; fortunately, we only need the property in Eq. (A.3) for our proofs. We prove it by induction. First, Eq. (A.3) obviously holds for $K = 1$ with $C_1 = 3$. Assume that for any $J < K$ there exists a constant $C_J$ depending only on $J$ such that
$$G_J \preceq C_J\prod_{k=1}^{J}\operatorname{tr}(A_k)\,I. \qquad (A.40)$$
Then by the Stein-type lemma, Lemma A.1.1,
$$\begin{aligned}
G_K &= \mathbb{E}\Big[\prod_{k=1}^{K}(x^\top A_kx)\,xx^\top\Big]\\
&= \mathbb{E}\Big[\prod_{k=1}^{K}(x^\top A_kx)\Big]I + 2\sum_{j=1}^{K}\mathbb{E}\Big[\prod_{k \ne j}(x^\top A_kx)\Big]A_j + 4\sum_{j,l: j \ne l} A_j\,\mathbb{E}\Big[\Big(\prod_{k: k \ne j, k \ne l}(x^\top A_kx)\Big)xx^\top\Big]A_l\\
&\preceq C_{K-1}\prod_{k=1}^{K}\operatorname{tr}(A_k)\,I + 2\sum_{j=1}^{K}C_{K-2}\Big(\prod_{k \ne j}\operatorname{tr}(A_k)\Big)A_j + 4\sum_{j,l: j \ne l}C_{K-2}\|A_j\|\|A_l\|\prod_{k: k \ne j, k \ne l}\operatorname{tr}(A_k)\,I\\
&\preceq \big(C_{K-1} + (2K + 4K^2)C_{K-2}\big)\prod_{k=1}^{K}\operatorname{tr}(A_k)\,I.
\end{aligned} \qquad (A.41)$$
So $C_K = C_{K-1} + (4K^2 + 2K)C_{K-2}$. Note that $C_0 = 1$.
A.1.4.3 Proof of Lemma A.1.3
Proof. Proof sketch: We use the matrix Bernstein inequality to prove this lemma. However, the spectral norm of the random matrix $B_i$ is not uniformly bounded, which is required by the matrix Bernstein inequality. So we define a new random matrix,
$$M_i := \mathbb{1}(E_i)\prod_{k=1}^{K}(x_i^\top A_kx_i)\,x_ix_i^\top,$$
where $E_i$ is the event that $\|B_i\|$ is bounded, which holds with high probability, and $\mathbb{1}(\cdot)$ is the indicator function taking values 1 and 0, i.e., $\mathbb{1}(E) = 1$ if $E$ holds and $\mathbb{1}(E) = 0$ otherwise. Then
$$\|\bar{B} - B\| \le \|\bar{B} - \bar{M}\| + \|\bar{M} - M\| + \|M - B\|,$$
where $M = \mathbb{E}[M_i]$ and $\bar{M} = \frac{1}{n}\sum_{i=1}^{n}M_i$. We show that:
1. $\bar{M} = \bar{B}$ w.h.p., by the union bound;
2. $\|\bar{M} - M\|$ is bounded, by the matrix Bernstein inequality;
3. $\|M - B\|$ is bounded, because $\mathbb{E}[\mathbb{1}(E^c)]$ is small.
Proof details:
Step 1. First we show that $\|B_i\|$ is bounded w.h.p. Note
$$\|B_i\| = \prod_{k=1}^{K}(x_i^\top A_kx_i)\,\|x_i\|^2.$$
Since $x \sim N(0, I_d)$, by Corollary 2.4.5 we have $\Pr[\|x\|^2 \ge (4P+5)d\log n] \le n^{-1}d^{-P}$. By Corollary 2.4.1, $\Pr[x^\top A_kx > (4P+5)\operatorname{tr}(A_k)\log n] \le n^{-1}d^{-P}$. Therefore, w.p. $1 - (K+1)n^{-1}d^{-P}$,
$$\|B_i\| \le (4P+5)^{K+1}\Big(\prod_{k=1}^{K}\operatorname{tr}(A_k)\Big)d\log^{K+1}(n).$$
Define
$$m := (4P+5)^{K+1}\Big(\prod_{k=1}^{K}\operatorname{tr}(A_k)\Big)d\log^{K+1}(n) \qquad (A.42)$$
and the event $E_i = \{\|B_i\| \le m\}$. Let $E^c$ be the complement of $E$; thus $\Pr[E_i^c] \le (K+1)n^{-1}d^{-P}$. By the union bound, w.p. $1 - (K+1)d^{-P}$, $\|B_i\| \le m$ for all $i \in [n]$ and $\bar{M} = \bar{B}$.
Step 2. Now we bound $\|\bar{M} - M\|$ by the matrix Bernstein inequality [127]. Set $Z_i := M_i - M$. Thus $\mathbb{E}[Z_i] = 0$, $\|Z_i\| \le 2m$, and
$$\|\mathbb{E}[Z_i^2]\| = \|\mathbb{E}[M_i^2] - M^2\| \le \|\mathbb{E}[M_i^2]\| + \|M^2\|.$$
Since $M_i$ is PSD with $\|M_i\| \le m$, we have $M_i^2 \preceq mM_i$ and hence $\|\mathbb{E}[M_i^2]\| \le m\|M\|$. Now by the matrix Bernstein inequality, for any $\delta > 0$,
$$\Pr\Big[\frac{1}{n}\Big\|\sum_{i=1}^{n}Z_i\Big\| \ge \delta\|M\|\Big] \le 2d\exp\Big(-\frac{\delta^2n^2\|M\|^2/2}{mn\|M\| + 2mn\delta\|M\|/3}\Big) = 2d\exp\Big(-\frac{\delta^2n\|M\|/2}{m + 2m\delta/3}\Big). \qquad (A.43)$$
Setting
$$n \ge (P+1)\Big(\frac{4}{3\delta} + \frac{2}{\delta^2}\Big)m\|M\|^{-1}\log d, \qquad (A.44)$$
we have w.p. at least $1 - 2d^{-P}$,
$$\Big\|\frac{1}{n}\sum_i M_i - M\Big\| \le \delta\|M\|. \qquad (A.45)$$
Step 3. Now we bound $\|M - B\|$. For simplicity, we replace $x_i$ by $x$ and $E_i$ by $E$:
$$\begin{aligned}
\|M - B\| &= \big\|\mathbb{E}\big[B_i\mathbb{1}(E_i^c)\big]\big\|\\
&= \max_{\|a\|=1}\mathbb{E}\Big[(a^\top x)^2\prod_{k=1}^{K}(x^\top A_kx)\mathbb{1}(E^c)\Big]\\
&\overset{\zeta_1}{\le} \max_{\|a\|=1}\mathbb{E}\Big[(a^\top x)^4\prod_{k=1}^{K}(x^\top A_kx)^2\Big]^{1/2}\mathbb{E}\big[\mathbb{1}(E^c)\big]^{1/2}\\
&= \max_{\|a\|=1}\Big\langle aa^\top, \mathbb{E}\Big[(x^\top aa^\top x)\prod_{k=1}^{K}(x^\top A_kx)^2\,xx^\top\Big]\Big\rangle^{1/2}\mathbb{E}\big[\mathbb{1}(E^c)\big]^{1/2}\\
&\overset{\zeta_2}{\le} \max_{\|a\|=1}\Big\langle aa^\top, C_{2K+1}\prod_{k=1}^{K}\operatorname{tr}(A_k)^2\,I\Big\rangle^{1/2}\mathbb{E}\big[\mathbb{1}(E^c)\big]^{1/2}\\
&\overset{\zeta_3}{\le} \frac{\sqrt{(K+1)C_{2K+1}}}{\sqrt{nd^P}}\prod_{k=1}^{K}\operatorname{tr}(A_k),
\end{aligned} \qquad (A.46)$$
where $\zeta_1$ is from Hölder's inequality, $\zeta_2$ is because of Lemma A.1.2 and $\zeta_3$ is because $\mathbb{E}[\mathbb{1}(E^c)] = \Pr[E^c]$. Assuming $n \ge 4(K+1)C_{2K+1}/d^P$, we have $\|M - B\| \le \frac{1}{2}\|B\|$ and $\frac{3}{2}\|B\| \ge \|M\| \ge \frac{1}{2}\|B\|$. So, combining this result with Eq. (A.42), Eq. (A.44), and Eq. (A.45), if
$$n \ge \max\Big\{4(K+1)C_{2K+1}/d^P,\ c_1\frac{1}{\delta^2}(4P+5)^{K+2}d\log^{K+1}(n)\log d\Big\}, \qquad (A.47)$$
we obtain
$$\Big\|\frac{1}{n}\sum_i M_i - M\Big\| \le \frac{1}{3}\delta\|M\| \le \frac{1}{2}\delta\|B\|. \qquad (A.48)$$
According to Lemma A.1.6, $n \ge O\big(\frac{1}{\delta^2}\log^{K+1}(\frac{1}{\delta})(PK)^Kd\log^{K+2}d\big)$ implies Eq. (A.47). By further setting $\delta > \frac{\sqrt{4(K+1)C_{2K+1}}}{\sqrt{nd^P}}$, we have $\|M - B\| \le \frac{1}{2}\delta\|B\|$, completing the proof.
A.1.4.4 Proof of Lemma A.1.4
Proof.
$$\begin{aligned}
\Big\|\mathbb{E}\Big[(\beta^\top x)(\gamma^\top x)\prod_{k=1}^{K}(x^\top A_kx)\,xx^\top\Big]\Big\| &\ge \mathbb{E}\Big[(\beta^\top x)^2(\gamma^\top x)^2\prod_{k=1}^{K}(x^\top A_kx)\Big]\big/\big(\|\beta\|\|\gamma\|\big)\\
&\ge \|\beta\|\|\gamma\|\prod_{k=1}^{K}\operatorname{tr}(A_k).
\end{aligned} \qquad (A.49)$$
$$\begin{aligned}
\Big\|\mathbb{E}\Big[(\beta^\top x)(\gamma^\top x)\prod_{k=1}^{K}(x^\top A_kx)\,xx^\top\Big]\Big\| &= \max_{a,b}\mathbb{E}\Big[(\beta^\top x)(\gamma^\top x)(a^\top x)(b^\top x)\prod_{k=1}^{K}(x^\top A_kx)\Big]\big/\big(\|a\|\|b\|\big)\\
&\le \mathbb{E}\big[(a^\top x)^2(b^\top x)^2\big]^{1/2}\,\mathbb{E}\Big[(\beta^\top x)^2(\gamma^\top x)^2\prod_{k=1}^{K}(x^\top A_kx)^2\Big]^{1/2}\big/\big(\|a\|\|b\|\big)\\
&\le \sqrt{3C_{2K+1}}\,\|\beta\|\|\gamma\|\prod_{k=1}^{K}\operatorname{tr}(A_k).
\end{aligned} \qquad (A.50)$$
A.1.4.5 Proof of Lemma A.1.5
Proof. Note that the matrix $B_i$ is not necessarily PSD, so we cannot apply Lemma A.1.3 directly; however, the proof is similar to that of Lemma A.1.3.
Define
$$m := (4P+5)^{K+2}\|\beta\|\|\gamma\|\Big(\prod_{k=1}^{K}\operatorname{tr}(A_k)\Big)d\log^{K+1}(n) \qquad (A.51)$$
and the event $E_i := \{\|B_i\| \le m\}$. Then by Corollary 2.4.4,
$$\Pr[E_i] \ge 1 - 2Kn^{-1}d^{-P}.$$
Define a new random matrix $M_i := \mathbb{1}(E_i)B_i$, its expectation $M := \mathbb{E}[M_i]$ and its empirical average $\bar{M} = \frac{1}{n}\sum_{i=1}^{n}M_i$.
Step 1. By the union bound, we have w.p. $1 - 2Kd^{-P}$ that $M_i = B_i$ for all $i$, i.e., $\bar{M} = \bar{B}$.
Step 2. We now bound $\|M - B\|$. For simplicity, we replace $x_i$ by $x$ and $E_i$ by $E$:
$$\begin{aligned}
\|M - B\| &= \big\|\mathbb{E}\big[B_i\mathbb{1}(E_i^c)\big]\big\|\\
&= \max_{\|a\|=\|b\|=1}\mathbb{E}\Big[(a^\top x)(b^\top x)(\beta^\top x)(\gamma^\top x)\prod_{k=1}^{K}(x^\top A_kx)\mathbb{1}(E^c)\Big]\\
&\overset{\zeta_1}{\le} \max_{\|a\|=\|b\|=1}\mathbb{E}\Big[(a^\top x)^2(b^\top x)^2(\beta^\top x)^2(\gamma^\top x)^2\prod_{k=1}^{K}(x^\top A_kx)^2\Big]^{1/2}\mathbb{E}\big[\mathbb{1}(E^c)\big]^{1/2}\\
&= \max_{\|a\|=\|b\|=1}\Big\langle aa^\top, \mathbb{E}\Big[(b^\top x)^2(\beta^\top x)^2(\gamma^\top x)^2\prod_{k=1}^{K}(x^\top A_kx)^2\,xx^\top\Big]\Big\rangle^{1/2}\mathbb{E}\big[\mathbb{1}(E^c)\big]^{1/2}\\
&\overset{\zeta_2}{\le} \max_{\|a\|=\|b\|=1}\Big\langle aa^\top, C_{2K+3}\|\beta\|^2\|\gamma\|^2\prod_{k=1}^{K}\operatorname{tr}(A_k)^2\,I\Big\rangle^{1/2}\mathbb{E}\big[\mathbb{1}(E^c)\big]^{1/2}\\
&\overset{\zeta_3}{\le} \frac{\sqrt{2KC_{2K+3}}}{\sqrt{nd^P}}\|\beta\|\|\gamma\|\prod_{k=1}^{K}\operatorname{tr}(A_k),
\end{aligned} \qquad (A.52)$$
where $\zeta_1$ is from Hölder's inequality, $\zeta_2$ is because of Lemma A.1.2 and $\zeta_3$ is because $\mathbb{E}[\mathbb{1}(E^c)] = \Pr[E^c]$.
According to Eq. (A.52) and Lemma A.1.4, if $\frac{\sqrt{2KC_{2K+3}}}{\sqrt{nd^P}} \le \delta/2$, then
$$\|M - B\| \le \frac{1}{2}\delta\|\beta\|\|\gamma\|\prod_{k=1}^{K}\operatorname{tr}(A_k) \le \frac{1}{2}\delta\|B\|. \qquad (A.53)$$
Since $\delta \le 1$, we also have $\|M - B\| \le \frac{1}{2}\|B\|$, so by Lemma A.1.4,
$$\frac{3}{2}\|B\| \ge \|M\| \ge \frac{1}{2}\|B\| \ge \frac{1}{2}\|\beta\|\|\gamma\|\prod_{k=1}^{K}\operatorname{tr}(A_k). \qquad (A.54)$$
Step 3. Now we bound $\|\bar{M} - M\|$. $\|M_i\| \le m$ holds automatically. Since $M_i$ is not necessarily PSD, we do not have $\|\mathbb{E}[M_i^2]\| \le m\|M\|$; however, we can still show that $\|\mathbb{E}[M_i^2]\| \le O(m)\|M\|$:
$$\begin{aligned}
\|\mathbb{E}[M_i^2]\| &\le \|\mathbb{E}[B_i^2]\|\\
&= \Big\|\mathbb{E}\Big[(\beta^\top x)^2(\gamma^\top x)^2\prod_{k=1}^{K}(x^\top A_kx)^2\|x\|^2\,xx^\top\Big]\Big\|\\
&\le C_{2K+3}\,d\,\Big(\|\beta\|\|\gamma\|\prod_{k=1}^{K}\operatorname{tr}(A_k)\Big)^2\\
&\le \frac{2C_{2K+3}}{(4P+5)^{K+2}}m\|M\|.
\end{aligned} \qquad (A.55)$$
We can use the matrix Bernstein inequality now. Let $Z_i := M_i - M$; then $\|Z_i\| \le 2m$ and $\|\mathbb{E}[Z_i^2]\| \le \big(\frac{2C_{2K+3}}{(4P+5)^{K+2}} + 1\big)m\|M\|$. Define $C_K' := \frac{2C_{2K+3}}{(4P+5)^{K+2}} + 1$ (primed to avoid clashing with the constants of Eq. (A.1)). Then
$$\Pr\Big[\frac{1}{n}\Big\|\sum_{i=1}^{n}Z_i\Big\| \ge \delta\|M\|\Big] \le 2d\exp\Big(-\frac{\delta^2n^2\|M\|^2/2}{C_K'mn\|M\| + 2mn\delta\|M\|/3}\Big) \le 2d\exp\Big(-\frac{\delta^2n\|M\|/2}{C_K'm + 2m\delta/3}\Big). \qquad (A.56)$$
Thus, when $n \ge (P+1)\big(\frac{C_K'}{\delta^2} + \frac{2}{3\delta}\big)(m/\|M\|)\log d$, we have w.p. $1 - c_2d^{-P}$,
$$\|\bar{M} - M\| \le \frac{1}{3}\delta\|M\| \le \frac{1}{2}\delta\|B\|.$$
By Eq. (A.51) and Eq. (A.54),
$$(P+1)\Big(\frac{C_K'}{\delta^2} + \frac{2}{3\delta}\Big)\frac{m}{\|M\|}\log d \le c_1\frac{C_K'}{\delta^2}(4P+5)^{K+3}d\log^{K+1}(n)\log(d) \le \frac{c_1}{\delta^2}\big(2C_{2K+3}(P+1) + (4P+5)^{K+2}\big)d\log^{K+1}(n)\log(d).$$
Applying the fact $\|\bar{B} - B\| \le \|\bar{B} - \bar{M}\| + \|\bar{M} - M\| + \|M - B\|$ and Lemma A.1.6 completes the proof.
A.1.5 Proof of Lemma A.1.6
Proof. We want $n \ge cd\log(d)\log^{K+1}(n)$, and we are given $n \ge bcd\log(d)\log^{A}(d)$, where $b$ and $A$ depend only on $K$. Take
$$b = K4^K\log^{K+1}(c), \qquad A = K+1.$$
Therefore,
$$b \ge K4^K, \quad b \ge 4^{K+1}\log^{K+1}(c), \quad A \ge K+1, \quad b \ge \big(4(A+1)\big)^{K+1}.$$
Taking a log on both sides, we obtain
$$\log b \ge (K+1)\log(4\log(b)), \qquad \log b \ge (K+1)\log(4\log(c)),$$
$$\log b + A\log\log(d) \ge (K+1)\log(4\log(d)), \qquad \log b + A\log\log(d) \ge (K+1)\log\big(4(A+1)\log\log(d)\big).$$
Combining the above four inequalities, we have
$$\log b + A\log\log(d) \ge (K+1)\log\Big(4\max\big\{\log(b), \log(c), \log(d), (A+1)\log\log(d)\big\}\Big).$$
Replacing the max by a sum (the factor 4 dominates the four terms), we have
$$\log b + A\log\log(d) \ge (K+1)\log\big(\log(b) + \log(c) + \log(d) + (A+1)\log\log(d)\big).$$
Taking exp on both sides,
$$b\log^{A}(d) \ge \log^{K+1}\big(bcd\log(d)\log^{A}(d)\big).$$
Reformulating,
$$\frac{bcd\log(d)\log^{A}(d)}{\log^{K+1}\big(bcd\log(d)\log^{A}(d)\big)} \ge cd\log d \quad \Longrightarrow \quad \frac{n}{\log^{K+1}(n)} \ge cd\log d.$$
Finally, we obtain
$$n \ge cd\log d\,\log^{K+1}(n).$$
A.2 Proofs of Tensor Method for Initialization
A.2.1 Some Lemmata
We will use the following lemmata to guarantee the robust tensor power method. The proofs of these lemmata can be found in Sec. A.2.4.
Lemma A.2.1 (Some properties of third-order tensors). Let $T \in \mathbb{R}^{d \times d \times d}$ be a supersymmetric tensor, i.e., $T_{ijk}$ is invariant under any permutation of the indices, and define the operator norm
$$\|T\|_{op} := \sup_{\|a\|=1}|T(a, a, a)|.$$
Then:
Property 1. $\|T\|_{op} = \sup_{\|a\|=\|b\|=\|c\|=1}|T(a, b, c)|$.
Property 2. $\|T\|_{op} \le \|T_{(1)}\| \le \sqrt{K}\|T\|_{op}$.
Property 3. If $T$ is a rank-one tensor, then $\|T_{(1)}\| = \|T\|_{op}$.
Property 4. For any matrix $W \in \mathbb{R}^{d \times d'}$, $\|T(W, W, W)\|_{op} \le \|T\|_{op}\|W\|^3$.
Lemma A.2.2 (Approximation error for the second moment). Let $\{x_i, y_i\}_{i \in [n]}$ be generated from the mixed linear regression model (3.2). Define $M_2 := \sum_{k \in [K]} 2p_kw_k^* \otimes w_k^*$ and $\widehat{M}_2 := \frac{1}{n}\sum_{i \in [n]} y_i^2(x_i \otimes x_i - I)$. Then if $n \ge c_1\frac{1}{p_{\min}\delta_2^2}d\log^2(d)$, we have w.p. $1 - c_2Kd^{-2}$,
$$\|\widehat{M}_2 - M_2\| \le \delta_2\sum_k p_k\|w_k^*\|^2, \qquad (A.57)$$
where $c_1, c_2$ are universal constants.
And for any fixed orthogonal matrix $Y \in \mathbb{R}^{d \times K}$, under the same condition, we have
$$\|Y^\top(\widehat{M}_2 - M_2)Y\| \le \delta_2\sum_k p_k\|w_k^*\|^2. \qquad (A.58)$$
Lemma A.2.3 (Subspace Estimation). Let $M_2, M_3$ be
$$M_2 = \sum_{k \in [K]} 2p_kw_k^* \otimes w_k^*, \quad \text{and} \quad M_3 = \sum_{k \in [K]} 6p_kw_k^* \otimes w_k^* \otimes w_k^*, \qquad (A.59)$$
and let $\widehat{M}_2$ be an estimate of $M_2$. Assume $\|\widehat{M}_2 - M_2\| \le \delta\sigma_K(M_2)$ and $\delta \le \frac{1}{6}$. Let $Y$ be the matrix returned by the power method after $O(\log(1/\delta))$ steps. Define $R_2 = Y^\top\widehat{M}_2Y$ and $R_3 = M_3(Y, Y, Y)$. Then $\|R_2\| \le \|\widehat{M}_2\|$ and $\|R_3\|_{op} \le \|M_3\|_{op}$. We also have
$$\|YY^\top w_k^* - w_k^*\| \le 3\delta\|w_k^*\|, \ \forall k, \qquad (A.60)$$
and
$$\sigma_K(R_2) \ge \frac{3}{4}\sigma_K(M_2).$$
Lemma A.2.4 (Approximation error for the third moment). Let $\{x_i, y_i\}_{i \in [n]}$ be drawn from the mixed linear regression model (3.2). Let $Y \in \mathbb{R}^{d \times K}$ be any fixed orthogonal matrix satisfying $\|YY^\top w_k^* - w_k^*\| \le \frac{1}{2}\|w_k^*\|$ for all $k$, and let $r_i = Y^\top x_i$ for all $i \in [n]$. Let
$$\widehat{R}_3 = \frac{1}{n}\sum_{i \in [n]} y_i^3\Big(r_i \otimes r_i \otimes r_i - \sum_{j \in [K]} e_j \otimes r_i \otimes e_j - \sum_{j \in [K]} e_j \otimes e_j \otimes r_i - \sum_{j \in [K]} r_i \otimes e_j \otimes e_j\Big)$$
and
$$R_3 = \sum_{k \in [K]} 6p_k(Y^\top w_k^*) \otimes (Y^\top w_k^*) \otimes (Y^\top w_k^*).$$
Then if $n \ge c_3\frac{1}{p_{\min}\delta_3^2}K^3\log^4(d)$ and $3\sqrt{C_5}\,n^{-1/2}d^{-1} \le \frac{\delta_3}{4}$, we have w.p. $1 - c_4Kd^{-2}$,
$$\|\widehat{R}_3 - R_3\|_{op} \le \delta_3\sum_{k \in [K]} p_k\|w_k^*\|^3,$$
where $c_3$ and $c_4$ are universal constants.
Lemma A.2.5 (Robust Tensor Power Method; similar to Lemma 4 in [26]). Let $R_2 = \sum_{k=1}^{K} p_ku_k \otimes u_k$ and $R_3 = \sum_{k=1}^{K} p_ku_k \otimes u_k \otimes u_k$, where each $u_k \in \mathbb{R}^K$ can be any fixed vector. Define $\sigma_K := \sigma_K(R_2)$. Assume the estimates $\widehat{R}_2$ and $\widehat{R}_3$ of $R_2$ and $R_3$, respectively, satisfy $\|\widehat{R}_2 - R_2\|_{op} \le \varepsilon_2$ and $\|\widehat{R}_3 - R_3\|_{op} \le \varepsilon_3$ with
$$\varepsilon_2 \le \sigma_K/3, \qquad 8\|R_3\|_{op}\sigma_K^{-5/2}\varepsilon_2 + 2\sqrt{2}\sigma_K^{-3/2}\varepsilon_3 \le c_T\frac{1}{K\sqrt{p_{\max}}}, \qquad (A.61)$$
for some constant $c_T$. Let the whitening matrix be $\widehat{W} = \widehat{U}_2\widehat{\Lambda}_2^{-1/2}\widehat{U}_2^\top$, where $\widehat{R}_2 = \widehat{U}_2\widehat{\Lambda}_2\widehat{U}_2^\top$ is the eigendecomposition of $\widehat{R}_2$. Then w.p. $1 - \eta$, the eigenvalues $\{\hat{a}_k\}_{k=1}^{K}$ and eigenvectors $\{\hat{v}_k\}_{k=1}^{K}$ computed from the whitened tensor $\widehat{R}_3(\widehat{W}, \widehat{W}, \widehat{W}) \in \mathbb{R}^{K \times K \times K}$ by the robust tensor power method [4] satisfy
$$\|(\widehat{W}^\top)^\dagger(\hat{a}_k\hat{v}_k) - u_k\| \le \kappa_2\varepsilon_2 + \kappa_3\varepsilon_3,$$
where $\kappa_2 = 3\|R_2\|^{1/2}\sigma_K^{-1} + 200\|R_2\|^{1/2}\|R_3\|_{op}\sigma_K^{-5/2}$, $\kappa_3 = 75\|R_2\|^{1/2}\sigma_K^{-3/2}$, and $\eta$ is related to the computational time by $O(\log(1/\eta))$.
Remark: This lemma differs from Lemma 4 of [26] in the requirements on $\varepsilon_2, \varepsilon_3$. Lemma 4 in [26] treats $\varepsilon_2$ and $\varepsilon_3$ as being of the same order (bounded by the same value); however, they should have different orders, because one controls second-order moments and the other third-order moments.
A.2.2 Proof of Theorem 3.4.2
Proof. We state the proof outline here:
1. $\|\widehat{M}_2 - M_2\| \le \varepsilon_{M_2}$ by the matrix Bernstein inequality.
2. $\|YY^\top w_k^* - w_k^*\| \le \varepsilon_Y\|w_k^*\|$ for all $k \in [K]$ by the Davis-Kahan theorem [36].
3. $\|\widehat{R}_2 - R_2\| \le \varepsilon_2$ by the matrix Bernstein inequality.
4. $\|\widehat{R}_3 - R_3\|_{op} \le \varepsilon_3$ by the matrix Bernstein inequality after matricizing the tensor.
5. Let $\hat{u}_k = (\widehat{W}^\top)^\dagger(\hat{a}_k\hat{v}_k)$. Then $\|\hat{u}_k - Y^\top w_k^*\| \le \varepsilon_u$ by the robust tensor power method.
6. Finally, $\|w_k^{(0)} - w_k^*\| \le c_6\Delta_{\min}$ by combining the results of Step 2 and Step 5.
The lemmata in Appendix A.2.1 provide the bounds for the above steps: Lemma A.2.2 for Step 1, Lemma A.2.3 for Steps 2 and 3, Lemma A.2.4 for Step 4, and Lemma A.2.5 for Step 5. Now we show the details. Define
$$\bar{\kappa}_2 := 4\|M_2\|^{1/2}\sigma_K^{-1}(M_2) + 412\|M_2\|^{1/2}\|M_3\|_{op}\sigma_K^{-5/2}(M_2)$$
and
$$\bar{\kappa}_3 := 116\|M_2\|^{1/2}\sigma_K^{-3/2}(M_2).$$
By Lemma A.2.3, we have $\bar{\kappa}_3 \ge \kappa_3$ and $\bar{\kappa}_2 \ge \kappa_2$ for any orthogonal matrix $Y$. Then
$$\begin{aligned}
\|w_k^{(0)} - w_k^*\| &\overset{\xi_1}{\le} \|Y\hat{u}_k - YY^\top w_k^*\| + \|YY^\top w_k^* - w_k^*\|\\
&\overset{\xi_2}{\le} \bar{\kappa}_2\|\widehat{R}_2 - R_2\| + \bar{\kappa}_3\|\widehat{R}_3 - R_3\|_{op} + \frac{2}{3}\delta_{M_2}\|w_k^*\|\sigma_K^{-1}(M_2)\sum_k p_k\|w_k^*\|^2\\
&\overset{\xi_3}{\le} \bar{\kappa}_2\delta_2\sum_k p_k\|w_k^*\|^2 + \bar{\kappa}_3\delta_3\sum_k p_k\|w_k^*\|^3 + \frac{2}{3}\delta_{M_2}\sigma_K^{-1}(M_2)\Big(\max_k\|w_k^*\|\Big)\sum_k p_k\|w_k^*\|^2,
\end{aligned} \qquad (A.62)$$
where $\xi_1$ is due to the triangle inequality, $\xi_2$ is due to Lemma A.2.5, Lemma A.2.3 and Lemma A.2.2, and $\xi_3$ is due to Lemma A.2.2 and Lemma A.2.4. Therefore, we can set
$$\delta_2 \le \frac{c_6\Delta_{\min}}{3\bar{\kappa}_2\sum_k p_k\|w_k^*\|^2}, \qquad \delta_3 \le \frac{c_6\Delta_{\min}}{3\bar{\kappa}_3\sum_{k \in [K]} p_k\|w_k^*\|^3},$$
and
$$\delta_{M_2} \le \frac{c_6\Delta_{\min}}{2\sigma_K^{-1}(M_2)(\max_k\|w_k^*\|)\sum_k p_k\|w_k^*\|^2},$$
such that $\|w_k^{(0)} - w_k^*\| \le c_6\Delta_{\min}$. Note that Lemma A.2.5 also requires Eq. (A.61), which can be satisfied if
$$\|\widehat{R}_2 - R_2\| \le \min\Big\{\frac{\sigma_K(M_2)}{4}, \frac{c_T\sigma_K(M_2)^{5/2}}{34\|M_3\|_{op}K\sqrt{p_{\max}}}\Big\}$$
and
$$\|\widehat{R}_3 - R_3\|_{op} \le \frac{c_T\sigma_K(M_2)^{3/2}}{6K\sqrt{p_{\max}}}.$$
Therefore, we require
$$\delta_2 \le \delta_2^* := \frac{1}{\sum_k p_k\|w_k^*\|^2}\min\Big\{\frac{\sigma_K(M_2)}{4}, \frac{c_T\sigma_K(M_2)^{5/2}}{34\|M_3\|_{op}K\sqrt{p_{\max}}}, \frac{c_6\Delta_{\min}}{3\bar{\kappa}_2}\Big\},$$
$$\delta_3 \le \delta_3^* := \frac{1}{\sum_{k \in [K]} p_k\|w_k^*\|^3}\min\Big\{\frac{c_6\Delta_{\min}}{3\bar{\kappa}_3}, \frac{c_T\sigma_K(M_2)^{3/2}}{6K\sqrt{p_{\max}}}\Big\},$$
$$\delta_{M_2} \le \delta_{M_2}^* := \frac{c_6\Delta_{\min}}{2\sigma_K^{-1}(M_2)(\max_k\|w_k^*\|)\sum_k p_k\|w_k^*\|^2}.$$
Now we analyze the sample complexity. $\delta_{M_2}^*, \delta_2^*, \delta_3^*$ correspond to the sample sets $\Omega_{M_2}$, $\Omega_2$ and $\Omega_3$, respectively. By Lemma A.2.2 and Lemma A.2.4, we require
$$|\Omega_{M_2}| \ge c_{M_2}\frac{1}{p_{\min}\delta_{M_2}^{*2}}d\log^2(d), \qquad |\Omega_2| \ge c_2\frac{1}{p_{\min}\delta_2^{*2}}d\log^2(d), \qquad |\Omega_3| \ge c_3\frac{1}{p_{\min}\delta_3^{*2}}K^3\log^{11/2}(d),$$
and $3\sqrt{C_5}\,n^{-1/2}d^{-1} \le \frac{\delta_3}{4}$. For the probability, we can set $\eta = d^{-2}$ in Lemma A.2.5 by sacrificing a little more computational time, which is on the order of $O(\log(d))$. Therefore, the final probability is at least $1 - O(Kd^{-2})$.
A.2.3 Proof of Theorem 3.5.1
According to Theorem 3.3.2, after $T_0 = O(\log d)$ iterations, we arrive in the local convexity region of Corollary 3.3.1. Then we need only one more set of samples, but still $O(\log(1/\epsilon))$ iterations, to achieve $\epsilon$ precision. By Theorem 3.3.1, Corollary 3.3.1, Theorem 3.3.2 and Theorem 3.4.2, we can partition the dataset into subsets of size $|\Omega^{(t)}| = O(d(K\log(d))^{2K+2})$ for all $t = 0, 1, 2, \cdots, T_0 + 1$ to satisfy their sample complexity requirements. This completes the proof.
A.2.4 Proofs of Some Lemmata
A.2.4.1 Proof of Lemma A.2.1
Proof. Property 1. See the proof of Lemma 21 in [67].
Property 2.
$$\|T_{(1)}\| = \max_{\|a\|=1}\|T(a, I, I)\|_F \le \sqrt{K}\max_{\|a\|=1}\|T(a, I, I)\| = \sqrt{K}\max_{\|a\|=\|b\|=1}|T(a, b, b)| = \sqrt{K}\|T\|_{op}.$$
Obviously, $\max_{\|a\|=1}\|T(a, I, I)\|_F \ge \|T\|_{op}$.
Property 3. Let $T = v \otimes v \otimes v$. Then
$$\|T_{(1)}\| = \max_{\|a\|=1}\|T(a, I, I)\|_F = \max_{\|a\|=1}|v^\top a|\,\|v\|^2 = \|v\|^3 = \max_{\|a\|=1}|(v^\top a)^3| = \|T\|_{op}.$$
Property 4. There exists $u \in \mathbb{R}^{d'}$ with $\|u\| = 1$ such that
$$\|T(W, W, W)\|_{op} = |T(Wu, Wu, Wu)| \le \|T\|_{op}\|Wu\|^3 \le \|T\|_{op}\|W\|^3.$$
A.2.4.2 Proof of Lemma A.2.2
Proof. Define $M_2^{(k)} := 2w_k^*w_k^{*\top}$ and $\widehat{M}_2^{(k)} := \frac{1}{|S_k|}\sum_{i \in S_k} y_i^2(x_i \otimes x_i - I)$, where $S_k \subset [n]$ is the index set of samples from the $k$-th component. Since we assume $|S_k| = p_kn$, $\widehat{M}_2 = \sum_{k \in [K]} p_k\widehat{M}_2^{(k)}$. We first bound $\|\widehat{M}_2^{(k)} - M_2^{(k)}\|$. By Lemma A.1.3 with $K = 1$ and $A_1 = w_k^*w_k^{*\top}$, if $|S_k| \ge c_1\frac{1}{\delta^2}d\log^2(d)$, we have w.p. $1 - c_2d^{-2}$,
$$\Big\|\frac{1}{|S_k|}\sum_{i \in S_k} y_i^2x_ix_i^\top - \|w_k^*\|^2I - 2w_k^*w_k^{*\top}\Big\| \le \delta\|w_k^*\|^2.$$
By Lemma A.1.3 with $K = 0$, we have w.p. at least $1 - d^{-2}$,
$$\Big\|\frac{1}{|S_k|}\sum_{i \in S_k} x_ix_i^\top - I\Big\| \le \delta.$$
Then
$$\Big|\frac{1}{|S_k|}\sum_{i \in S_k}(x_i^\top w_k^*)^2 - \|w_k^*\|^2\Big| \le \Big\|\frac{1}{|S_k|}\sum_{i \in S_k} x_ix_i^\top - I\Big\|\|w_k^*\|^2 \le \delta\|w_k^*\|^2.$$
Thus
$$\Big\|\frac{1}{|S_k|}\sum_{i \in S_k} y_i^2(x_ix_i^\top - I) - 2w_k^*w_k^{*\top}\Big\| \le 2\delta\|w_k^*\|^2,$$
and w.p. $1 - O(Kd^{-2})$,
$$\|\widehat{M}_2 - M_2\| \le 2\delta\sum_k p_k\|w_k^*\|^2.$$
A.2.4.3 Proof of Lemma A.2.3
Proof. $\|R_2\| \le \|Y\|^2\|\widehat{M}_2\| = \|\widehat{M}_2\|$. By Property 4 in Lemma A.2.1, $\|R_3\|_{op} \le \|Y\|^3\|M_3\|_{op} = \|M_3\|_{op}$. Let $U$ be the top-$K$ eigenvectors of $M_2$; then $\operatorname{span}(U) = \operatorname{span}(w_1^*, w_2^*, \cdots, w_K^*)$. Let $\widehat{Y} \in \mathbb{R}^{d \times K}$ be the top-$K$ eigenvectors of $\widehat{M}_2$. By Lemma 9 in [67] (the Davis-Kahan theorem [36] can also prove it),
$$\|(I - \widehat{Y}\widehat{Y}^\top)UU^\top\| \le \frac{3}{2}\delta.$$
According to Theorem 7.2 in [6], after $t$ steps of the power method, we have
$$\|\widehat{Y}\widehat{Y}^\top - Y^{(t)}Y^{(t)\top}\| \le \Big(\frac{\sigma_{K+1}(\widehat{M}_2)}{\sigma_K(\widehat{M}_2)}\Big)^t\|\widehat{Y}\widehat{Y}^\top - Y^{(0)}Y^{(0)\top}\|.$$
When $\delta \le 1/3$, by Weyl's inequality we have $\sigma_{K+1}(\widehat{M}_2) \le \frac{1}{3}\sigma_K(M_2)$ and $\sigma_K(\widehat{M}_2) \ge \frac{2}{3}\sigma_K(M_2)$. Therefore, after $t = \log(2/(3\delta))$ steps of the power method, we have
$$\|\widehat{Y}\widehat{Y}^\top - Y^{(t)}Y^{(t)\top}\| \le \frac{3}{2}\delta.$$
Let $Y = Y^{(t)}$. We have
$$\|YY^\top - UU^\top\| \le \|YY^\top - \widehat{Y}\widehat{Y}^\top\| + \|UU^\top - \widehat{Y}\widehat{Y}^\top\| \le 3\delta$$
and
$$\|YY^\top w_k^* - w_k^*\| \le \|YY^\top - UU^\top\|\|w_k^*\| \le 3\delta\|w_k^*\|.$$
Now we consider $\sigma_K(R_2)$. The proof is similar to that of Property 3 in Lemma 9 of [67]:
$$\sigma_K(R_2) \ge \sigma_K(M_2)\,\sigma_K^2(Y^\top U).$$
Note that $\|Y_\perp^\top U\| = \|YY^\top - UU^\top\|$, where $Y_\perp$ is the subspace orthogonal to $Y$. For any normalized vector $v$,
$$\|Y^\top Uv\|^2 = \|Uv\|^2 - \|Y_\perp^\top Uv\|^2 \ge 1 - (3\delta)^2 \ge \frac{3}{4}.$$
Therefore, we have $\sigma_K(R_2) \ge \frac{3}{4}\sigma_K(M_2)$.
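The subspace step of the lemma can be illustrated numerically: orthogonal iteration (the block power method) on a perturbed rank-$K$ matrix recovers the top-$K$ eigenspace up to an error governed by the perturbation. The sketch below is an illustration only (dimensions, perturbation size, and iteration count are chosen here) and checks that $\|YY^\top - UU^\top\|$ is small.

```python
import numpy as np

rng = np.random.default_rng(3)
d, K = 30, 3

# Ground-truth rank-K PSD matrix M2 and a small symmetric perturbation.
W = rng.standard_normal((d, K))
M2 = W @ W.T
E = rng.standard_normal((d, d))
M2_hat = M2 + 0.01 * (E + E.T)

U = np.linalg.eigh(M2)[1][:, -K:]          # true top-K eigenspace

# Orthogonal iteration (block power method) on the perturbed matrix.
Y = np.linalg.qr(rng.standard_normal((d, K)))[0]
for _ in range(100):
    Y = np.linalg.qr(M2_hat @ Y)[0]

err = np.linalg.norm(Y @ Y.T - U @ U.T, 2)
assert err < 0.1                            # small subspace error
```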
A.2.4.4 Proof of Lemma A.2.4
Proof. We prove it by matricizing the tensor. Define
$$G_i = y_i^3\Big(r_i \otimes r_i \otimes r_i - \sum_{j \in [K]} e_j \otimes r_i \otimes e_j - \sum_{j \in [K]} e_j \otimes e_j \otimes r_i - \sum_{j \in [K]} r_i \otimes e_j \otimes e_j\Big).$$
As in Lemma A.2.2, we first bound $\|\widehat{R}_3^{(k)} - R_3^{(k)}\|_{op}$, where $\widehat{R}_3^{(k)} = \frac{1}{|S_k|}\sum_{i \in S_k} G_i$ and $R_3^{(k)} = 6(Y^\top w_k^*) \otimes (Y^\top w_k^*) \otimes (Y^\top w_k^*)$. Note
$$\|R_3^{(k)}\|_{op} = 6\|Y^\top w_k^*\|^3.$$
By Lemma A.2.3, $\frac{1}{2}\|w_k^*\| \le \|Y^\top w_k^*\| \le \frac{3}{2}\|w_k^*\|$. Thus
$$\frac{3}{4}\|w_k^*\|^3 \le \|R_3^{(k)}\|_{op} \le \frac{81}{4}\|w_k^*\|^3. \qquad (A.63)$$
Then
$$\|G_i\|_{op} \le 4|x_i^\top w_k^*|^3\|r_i\|^3. \qquad (A.64)$$
By Corollary 2.4.5, we have w.p. $1 - n^{-1}d^{-2}$ that $\|r_i\|^2 \le 4K\log n$. Thus, w.p. $1 - 4n^{-1}d^{-2}$,
$$\|G_i\|_{op} \le 4 \times 12^{3/2}\|w_k^*\|^3\log^3(n)(4K)^{3/2}.$$
Define $m := c_6\|w_k^*\|^3K^{3/2}\log^3(n)$ with constant $c_6 = 4 \times 48^{3/2}$, and the event
$$E_i := \{\|G_i\|_{op} \le m\}.$$
Then $\Pr[E_i^c] \le 4n^{-1}d^{-2}$. Define a new tensor $B_i = \mathbb{1}(E_i)G_i$, its expectation $B = \mathbb{E}[B_i]$ (the expectation is over all samples from the $k$-th component) and its empirical average $\bar{B} = \frac{1}{|S_k|}\sum_{i \in S_k} B_i$.
Step 1. We have $B_i = G_i$ for all $i \in S_k$ w.p. $1 - 4d^{-2}$, i.e.,
$$\widehat{R}_3^{(k)} = \bar{B}. \qquad (A.65)$$
Step 2. We bound $\|B - R_3^{(k)}\|_{op}$:
$$\begin{aligned}
\|B - R_3^{(k)}\|_{op} &= \|\mathbb{E}[\mathbb{1}(E_i^c)G_i]\|_{op}\\
&= \max_{\|a\|=1}\big|\mathbb{E}[\mathbb{1}(E_i^c)G_i(a, a, a)]\big|\\
&\le \mathbb{E}[\mathbb{1}(E_i^c)]^{1/2}\max_{\|a\|=1}\big|\mathbb{E}[G_i(a, a, a)^2]\big|^{1/2}\\
&\le 2n^{-1/2}d^{-1}\max_{\|a\|=1}\Big|\mathbb{E}\Big[\big(y_i^3((r_i^\top a)^3 - 3r_i^\top a)\big)^2\Big]\Big|^{1/2}\\
&\le 2n^{-1/2}d^{-1}\max_{\|a\|=1}\Big|\mathbb{E}\Big[(w_k^{*\top}x_i)^6\big((x_i^\top Ya)^6 + 9(x_i^\top Ya)^2\big)\Big]\Big|^{1/2}\\
&\le 2n^{-1/2}d^{-1}\sqrt{2C_5}\|w_k^*\|^3\\
&\overset{\xi}{\le} 3\sqrt{C_5}\,n^{-1/2}d^{-1}\|R_3^{(k)}\|_{op},
\end{aligned} \qquad (A.66)$$
where $\xi$ is due to Eq. (A.63). Therefore, if $3\sqrt{C_5}\,n^{-1/2}d^{-1} \le \frac{\delta_3}{4}$, we have
$$\|B - R_3^{(k)}\|_{op} \le \frac{3\delta_3}{8}\|w_k^*\|^3 \le \frac{\delta_3}{2}\|R_3^{(k)}\|_{op}. \qquad (A.67)$$
And further, if $\delta_3 \le 1$, combining with Eq. (A.63),
$$\frac{3}{8}\|w_k^*\|^3 \le \frac{1}{2}\|R_3^{(k)}\|_{op} \le \|B\|_{op} \le \frac{3}{2}\|R_3^{(k)}\|_{op} \le 32\|w_k^*\|^3.$$
Step 3. We bound $\|\bar{B} - B\|_{op}$. Let $Z_i = (B_i - B)_{(1)}$. Then
$$\|B_{(1)}\| = \max_{\|a\|=1}\|B(a, I, I)\|_F \le \sqrt{K}\max_{\|a\|=1}\|B(a, I, I)\| = \sqrt{K}\max_{\|a\|=\|b\|=1}|B(a, b, b)| \overset{\xi}{=} \sqrt{K}\|B\|_{op} \le 32\sqrt{K}\|w_k^*\|^3, \qquad (A.68)$$
where $\xi$ is due to Lemma A.2.1. Hence
$$\|Z_i\| \le \|B_{i(1)}\| + \|B_{(1)}\| \le \sqrt{K}\big(\|B_i\|_{op} + \|B\|_{op}\big) \le 2\sqrt{K}m.$$
Now consider $\|\mathbb{E}[Z_iZ_i^\top]\|$ and $\|\mathbb{E}[Z_i^\top Z_i]\|$. We have
$$\mathbb{E}[Z_iZ_i^\top] = \mathbb{E}\big[(B_{i(1)} - B_{(1)})(B_{i(1)} - B_{(1)})^\top\big] = \mathbb{E}\big[B_{i(1)}B_{i(1)}^\top\big] - B_{(1)}B_{(1)}^\top,$$
and
$$\begin{aligned}
\big\|\mathbb{E}\big[B_{i(1)}B_{i(1)}^\top\big]\big\| &\le \big\|\mathbb{E}\big[G_{i(1)}G_{i(1)}^\top\big]\big\|\\
&\le \big\|\mathbb{E}\big[(w_k^{*\top}x)^6\big(\|r\|^4rr^\top + 2\|r\|^2I + (K+6)rr^\top - 6\|r\|^2rr^\top\big)\big]\big\|\\
&\le \big\|\mathbb{E}\big[(w_k^{*\top}x)^6\big(\|Y^\top x\|^4Y^\top xx^\top Y + 2\|Y^\top x\|^2I + (K+6)Y^\top xx^\top Y\big)\big]\big\|\\
&\le 2C_5K^2\|w_k^*\|^6,
\end{aligned} \qquad (A.69)$$
where the last inequality is due to Lemma A.1.2. Thus
$$\|\mathbb{E}[Z_iZ_i^\top]\| \le 3C_5K^2\|w_k^*\|^6.$$
Similarly, $\mathbb{E}[Z_i^\top Z_i] = \mathbb{E}[B_{i(1)}^\top B_{i(1)}] - B_{(1)}^\top B_{(1)}$ and $\|B_{(1)}^\top B_{(1)}\| \le \|B_{(1)}\|^2$. Then
$$\begin{aligned}
\big\|\mathbb{E}\big[B_{i(1)}^\top B_{i(1)}\big]\big\| &\le \big\|\mathbb{E}\big[G_{i(1)}^\top G_{i(1)}\big]\big\|\\
&\le \max_{\|A\|_F = 1,\,A\ \text{sym.}}\mathbb{E}\big[y_i^6\|r^\top Ar\,r - (2Ar + \operatorname{tr}(A)r)\|^2\big]\\
&\le \max_{\|A\|_F = 1,\,A\ \text{sym.}}\mathbb{E}\Big[(w_k^{*\top}x)^6\big((r^\top Ar)^2\|r\|^2 + 4r^\top A^2r + \operatorname{tr}^2(A)\|r\|^2 + |\operatorname{tr}(A)r^\top Ar|(4 + 2\|r\|^2)\big)\Big]\\
&\le \max_{\|A\|_F = 1,\,A\ \text{sym.}}\mathbb{E}\Big[(w_k^{*\top}x)^6\big(\|r\|^6\|A\|_F^2 + 4\|r\|^2\operatorname{tr}(A^2) + \operatorname{tr}^2(A)\|r\|^2 + (4 + 2\|r\|^2)\|r\|^2|\operatorname{tr}(A)|\|A\|_F\big)\Big]\\
&\le \mathbb{E}\Big[(w_k^{*\top}x)^6\big(\|r\|^6 + 4\|r\|^2 + K\|r\|^2 + \sqrt{K}(4 + 2\|r\|^2)\|r\|^2\big)\Big]\\
&= \mathbb{E}\Big[(w_k^{*\top}x)^6\big(\|Y^\top x\|^6 + 2\sqrt{K}\|Y^\top x\|^4 + 4(\sqrt{K}+1)\|Y^\top x\|^2 + K\|Y^\top x\|^2\big)\Big]\\
&\le 2C_5K^3\|w_k^*\|^6.
\end{aligned} \qquad (A.70)$$
Therefore,
$$\|\mathbb{E}[Z_i^\top Z_i]\| \le 3C_5K^3\|w_k^*\|^6,$$
and
$$\max\big\{\|\mathbb{E}[Z_i^\top Z_i]\|, \|\mathbb{E}[Z_iZ_i^\top]\|\big\} \le 3C_5K^3\|w_k^*\|^6 \le c_{m_2}K^{3/2}m\|w_k^*\|^3.$$
Now we are ready to apply the matrix Bernstein inequality:
$$\Pr\Big[\frac{1}{|S_k|}\Big\|\sum_{i \in S_k}Z_i\Big\| \ge t\Big] \le 2K^2\exp\Big(-\frac{|S_k|t^2/2}{c_{m_2}K^{3/2}m\|w_k^*\|^3 + 2\sqrt{K}mt/3}\Big). \qquad (A.71)$$
Setting $t = \delta_3\|w_k^*\|^3$, we have that when
$$|S_k| \ge c_3\frac{1}{\delta_3^2}K^3\log^3(n)\log(d), \qquad (A.72)$$
w.p. $1 - d^{-2}$,
$$\|\bar{B} - B\|_{op} \le \Big\|\frac{1}{|S_k|}\sum_{i \in S_k}Z_i\Big\| \le \delta_3\|w_k^*\|^3, \qquad (A.73)$$
for some universal constant $c_3$. And there exists some constant $c_3'$ such that $|S_k| \ge c_3'\frac{1}{\delta_3^2}K^3\log^4(d)$ implies (A.72).
Step 4. Combining all $K$ components. With the above three steps for the $k$-th component, i.e., Eq. (A.65), Eq. (A.67) and Eq. (A.73), w.h.p. we have
$$\|\widehat{R}_3^{(k)} - R_3^{(k)}\|_{op} \le \delta_3\|w_k^*\|^3.$$
Now we can complete the proof by combining all $K$ components: w.p. $1 - O(Kd^{-2})$,
$$\|\widehat{R}_3 - R_3\|_{op} \le \sum_{k \in [K]}p_k\|\widehat{R}_3^{(k)} - R_3^{(k)}\|_{op} \le \delta_3\sum_{k \in [K]}p_k\|w_k^*\|^3. \qquad (A.74)$$
A.2.4.5 Proof of Lemma A.2.5
Proof. Most of the proof follows the proof of Lemma 4 in [26]. Let $\widehat{W}^\top R_2\widehat{W} = U\Lambda U^\top$. Define $W := \widehat{W}U\Lambda^{-1/2}U^\top$; then $W$ is the whitening matrix of $R_2$, i.e., $W^\top R_2W = I$. Define the whitened tensor $T = R_3(W, W, W)$, i.e.,
$$T := \sum_{k=1}^{K}p_kW^\top u_k \otimes W^\top u_k \otimes W^\top u_k = \sum_{k=1}^{K}p_k^{-1/2}\big(p_k^{1/2}W^\top u_k\big) \otimes \big(p_k^{1/2}W^\top u_k\big) \otimes \big(p_k^{1/2}W^\top u_k\big) = \sum_{k=1}^{K}p_k^{-1/2}v_k \otimes v_k \otimes v_k, \qquad (A.75)$$
where $\{v_k := p_k^{1/2}W^\top u_k\}_{k=1}^{K}$ form an orthonormal basis because $\sum_{k=1}^{K}v_kv_k^\top = W^\top R_2W = I_K$. In practice, we have $\widehat{T} := \widehat{R}_3(\widehat{W}, \widehat{W}, \widehat{W})$, an estimate of $T$. Define $\varepsilon_T := \|T - \widehat{T}\|_{op}$. Similar to the proof of Lemma 4 in [26], we have
$$\begin{aligned}
\varepsilon_T &= \|R_3(W, W, W) - \widehat{R}_3(\widehat{W}, \widehat{W}, \widehat{W})\|_{op}\\
&\le \|R_3(W, W, W) - R_3(W, W, \widehat{W})\|_{op} + \|R_3(W, W, \widehat{W}) - R_3(W, \widehat{W}, \widehat{W})\|_{op}\\
&\quad + \|R_3(W, \widehat{W}, \widehat{W}) - R_3(\widehat{W}, \widehat{W}, \widehat{W})\|_{op} + \|R_3(\widehat{W}, \widehat{W}, \widehat{W}) - \widehat{R}_3(\widehat{W}, \widehat{W}, \widehat{W})\|_{op}\\
&= \|R_3(W, W, W - \widehat{W})\|_{op} + \|R_3(W, W - \widehat{W}, \widehat{W})\|_{op} + \|R_3(W - \widehat{W}, \widehat{W}, \widehat{W})\|_{op} + \|R_3(\widehat{W}, \widehat{W}, \widehat{W}) - \widehat{R}_3(\widehat{W}, \widehat{W}, \widehat{W})\|_{op}\\
&\le \|R_3\|_{op}\big(\|W\|^2 + \|W\|\|\widehat{W}\| + \|\widehat{W}\|^2\big)\varepsilon_W + \|\widehat{W}\|^3\varepsilon_3,
\end{aligned} \qquad (A.76)$$
where $\varepsilon_W = \|W - \widehat{W}\|$.
If $\varepsilon_2 \le \sigma_K/3$, we have $|\sigma_K(\widehat{R}_2) - \sigma_K| \le \varepsilon_2 \le \sigma_K/3$. Then $\frac{2}{3}\sigma_K \le \sigma_K(\widehat{R}_2) \le \frac{4}{3}\sigma_K$ and $\|\widehat{W}\| \le \sqrt{2}\sigma_K^{-1/2}$. Also,
$$\varepsilon_W = \|W - \widehat{W}\| = \|\widehat{W}(U\Lambda^{-1/2}U^\top - I)\| \le \|\widehat{W}\|\|I - \Lambda^{-1/2}\|. \qquad (A.77)$$
Since $\|I - \Lambda\| = \|\widehat{W}^\top R_2\widehat{W} - \widehat{W}^\top\widehat{R}_2\widehat{W}\| \le \|\widehat{W}\|^2\varepsilon_2 = 2\sigma_K^{-1}\varepsilon_2$, we have
$$\|I - \Lambda^{-1/2}\| \le \max\big\{|1 - (1 + 2\varepsilon_2/\sigma_K)^{-1/2}|, |1 - (1 - 2\varepsilon_2/\sigma_K)^{-1/2}|\big\} \le \varepsilon_2/\sigma_K.$$
Therefore,
$$\varepsilon_W \le \sqrt{2}\varepsilon_2\sigma_K^{-3/2}. \qquad (A.78)$$
Now we have
$$\varepsilon_T \le 8\|R_3\|_{op}\sigma_K^{-5/2}\varepsilon_2 + 2\sqrt{2}\sigma_K^{-3/2}\varepsilon_3. \qquad (A.79)$$
Thus we can apply Theorem 5.1 of [4] to obtain the guarantees of the robust tensor power method for recovering $\{v_k\}_{k=1}^{K}$ and $\{p_k\}_{k=1}^{K}$. It can be stated as follows: for some universal constant $c_T$ and a small value $\eta$ (the computational complexity is related to $\eta$ by $O(\log(1/\eta))$), if $\varepsilon_T \le c_T\frac{1}{K\sqrt{p_{\max}}}$, then w.p. $1 - \eta$ the returned eigenvectors $\{\hat{v}_k\}_{k=1}^{K}$ and eigenvalues $\{\hat{a}_k\}_{k=1}^{K}$ satisfy
$$\|\hat{v}_k - v_k\| \le 8\varepsilon_T\sqrt{p_k} \le 8\varepsilon_T\sqrt{p_{\max}}, \qquad \Big|\hat{a}_k - \frac{1}{\sqrt{p_k}}\Big| \le 5\varepsilon_T. \qquad (A.80)$$
Let $a_k = \frac{1}{\sqrt{p_k}}$. Now we show
$$\begin{aligned}
\|(\widehat{W}^\top)^\dagger(\hat{a}_k\hat{v}_k) - u_k\| &= \|(\widehat{W}^\top)^\dagger(\hat{a}_k\hat{v}_k) - (W^\top)^\dagger a_kv_k\|\\
&\le \|(\widehat{W}^\top)^\dagger(\hat{a}_k\hat{v}_k) - (\widehat{W}^\top)^\dagger(a_kv_k)\| + \|(\widehat{W}^\top)^\dagger(a_kv_k) - (W^\top)^\dagger a_kv_k\|\\
&\le \|(\widehat{W}^\top)^\dagger\|\big(a_k\|\hat{v}_k - v_k\| + |\hat{a}_k - a_k|\big) + \|(\widehat{W}^\top)^\dagger - (W^\top)^\dagger\|\,a_k\\
&\le \|(\widehat{W}^\top)^\dagger\|\big(8\varepsilon_Ta_k\sqrt{p_k} + 5\varepsilon_T\big) + \|(\widehat{W}^\top)^\dagger - (W^\top)^\dagger\|\,a_k.
\end{aligned} \qquad (A.81)$$
If $\varepsilon_T \le \frac{1}{10\sqrt{p_{\max}}}$, we have $\hat{a}_k/a_k \le 3/2$. If $\varepsilon_2 \le \sigma_K/3$,
$$\|(\widehat{W}^\top)^\dagger\| = \|\widehat{\Lambda}_2\|^{1/2} \le \sqrt{2}\|R_2\|^{1/2} \qquad (A.82)$$
and
$$\|(\widehat{W}^\top)^\dagger - (W^\top)^\dagger\| = \|(\widehat{W}^\top)^\dagger(I - U\Lambda^{1/2}U^\top)\| = \|(\widehat{W}^\top)^\dagger\|\|I - \Lambda^{1/2}\| \le 2\sqrt{2}\|R_2\|^{1/2}\varepsilon_2/\sigma_K. \qquad (A.83)$$
Therefore,
$$\|(\widehat{W}^\top)^\dagger(\hat{a}_k\hat{v}_k) - u_k\| \le \|R_2\|^{1/2}(25\varepsilon_T + 3\varepsilon_2/\sigma_K) \le \big(3\|R_2\|^{1/2}\sigma_K^{-1} + 200\|R_2\|^{1/2}\|R_3\|_{op}\sigma_K^{-5/2}\big)\varepsilon_2 + \big(75\|R_2\|^{1/2}\sigma_K^{-3/2}\big)\varepsilon_3. \qquad (A.84)$$
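The whitening step used above can be checked numerically: for $R_2 = \sum_k p_ku_ku_k^\top$ (nonsingular), the matrix $W = U_2\Lambda_2^{-1/2}U_2^\top$ from its eigendecomposition satisfies $W^\top R_2W = I$, and the vectors $v_k = \sqrt{p_k}\,W^\top u_k$ are orthonormal. The sketch below is illustrative only (the $u_k$ and $p_k$ are arbitrary choices made here).

```python
import numpy as np

rng = np.random.default_rng(4)
K = 3
U = rng.standard_normal((K, K))      # columns u_k (generically independent)
p = np.array([0.5, 0.3, 0.2])

R2 = (U * p) @ U.T                   # R2 = sum_k p_k u_k u_k^T  (SPD here)
evals, evecs = np.linalg.eigh(R2)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T   # whitening matrix

assert np.allclose(W.T @ R2 @ W, np.eye(K))    # W^T R2 W = I
V = np.sqrt(p) * (W.T @ U)           # columns v_k = sqrt(p_k) W^T u_k
assert np.allclose(V.T @ V, np.eye(K))         # {v_k} orthonormal
```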
Appendix B
One-hidden-layer Fully-connected Neural Networks
B.1 Matrix Bernstein Inequality for Unbounded Case
In many proofs we need to bound the difference between certain population matrices/tensors and their empirical versions. Typically, the classic matrix Bernstein inequality requires the norm of the random matrix to be bounded almost surely (e.g., Theorem 6.1 in [127]) or the random matrix to satisfy a subexponential property (Theorem 6.2 in [127]). However, in our cases, most of the random matrices don't satisfy these conditions. So we derive the following lemmata, which can handle random matrices that are neither bounded almost surely nor subexponential, but are bounded with high probability.
Lemma B.1.1 (Matrix Bernstein for the unbounded case; a modified version of the bounded case, Theorem 6.1 in [127]). Let $\mathcal{B}$ denote a distribution over $\mathbb{R}^{d_1 \times d_2}$. Let $d = d_1 + d_2$. Let $B_1, B_2, \cdots, B_n$ be i.i.d. random matrices sampled from $\mathcal{B}$. Let $\overline{B} = \mathbb{E}_{B \sim \mathcal{B}}[B]$ and $\widehat{B} = \frac{1}{n}\sum_{i=1}^{n}B_i$. For parameters $m \ge 0$, $\gamma \in (0, 1)$, $\nu > 0$, $L > 0$, suppose the distribution $\mathcal{B}$ satisfies the following four properties:
$$\text{(I)}\quad \Pr_{B \sim \mathcal{B}}\big[\|B\| \le m\big] \ge 1 - \gamma;$$
$$\text{(II)}\quad \big\|\mathbb{E}_{B \sim \mathcal{B}}[B]\big\| > 0;$$
$$\text{(III)}\quad \max\Big(\big\|\mathbb{E}_{B \sim \mathcal{B}}[BB^\top]\big\|, \big\|\mathbb{E}_{B \sim \mathcal{B}}[B^\top B]\big\|\Big) \le \nu;$$
$$\text{(IV)}\quad \max_{\|a\|=\|b\|=1}\Big(\mathbb{E}_{B \sim \mathcal{B}}\big[(a^\top Bb)^2\big]\Big)^{1/2} \le L.$$
Then for any $0 < \epsilon < 1$ and $t \ge 1$, if
$$n \ge (18t\log d)\cdot\big(\nu + \|\overline{B}\|^2 + m\|\overline{B}\|\epsilon\big)\big/\big(\epsilon^2\|\overline{B}\|^2\big) \quad \text{and} \quad \gamma \le \big(\epsilon\|\overline{B}\|/(2L)\big)^2,$$
with probability at least $1 - 1/d^{2t} - n\gamma$,
$$\|\widehat{B} - \overline{B}\| \le \epsilon\|\overline{B}\|.$$
Proof. Define the events
\[ \xi_i = \{\|B_i\| \le m\}, \quad \forall i \in [n]. \]
Define $M_i = \mathbb 1_{\|B_i\|\le m}B_i$. Let $\overline M = \mathbb E_{B\sim\mathcal B}[\mathbb 1_{\|B\|\le m}B]$ and $\widehat M = \frac1n\sum_{i=1}^n M_i$. By the triangle inequality, we have
\[ \|\widehat B - \overline B\| \le \|\widehat B - \widehat M\| + \|\widehat M - \overline M\| + \|\overline M - \overline B\|. \tag{B.1} \]
In the next few paragraphs, we upper bound the three terms on the right-hand side.

The first term in Eq. (B.1). Denote $\xi_i^c$ the complement of $\xi_i$, so that $\Pr[\xi_i^c] \le \gamma$. By a union bound over $i \in [n]$, with probability $1 - n\gamma$, $\|B_i\| \le m$ for all $i \in [n]$. Thus $\widehat M = \widehat B$.

The second term in Eq. (B.1). For a matrix $B$ sampled from $\mathcal B$, let $\xi$ denote the event $\{\|B\| \le m\}$. Then we can upper bound $\|\overline M - \overline B\|$ in the following way,
\[ \begin{aligned}
\|\overline M - \overline B\| &= \big\|\mathbb E_{B\sim\mathcal B}[\mathbb 1_{\|B\|\le m}\cdot B] - \mathbb E_{B\sim\mathcal B}[B]\big\| \\
&= \big\|\mathbb E_{B\sim\mathcal B}[B\cdot\mathbb 1_{\xi^c}]\big\| \\
&= \max_{\|a\|=\|b\|=1}\mathbb E_{B\sim\mathcal B}[a^\top Bb\,\mathbb 1_{\xi^c}] \\
&\le \max_{\|a\|=\|b\|=1}\mathbb E_{B\sim\mathcal B}[(a^\top Bb)^2]^{1/2}\cdot\mathbb E_{B\sim\mathcal B}[\mathbb 1_{\xi^c}]^{1/2} && \text{by H\"older's inequality} \\
&\le L\,\mathbb E_{B\sim\mathcal B}[\mathbb 1_{\xi^c}]^{1/2} && \text{by Property (IV)} \\
&\le L\gamma^{1/2} && \text{by } \Pr[\xi^c] \le \gamma \\
&\le \tfrac12\varepsilon\|\overline B\| && \text{by } \gamma \le (\varepsilon\|\overline B\|/(2L))^2,
\end{aligned} \]
which implies $\|\overline M - \overline B\| \le \frac{\varepsilon}{2}\|\overline B\|$. Since $\varepsilon < 1$, we also have $\|\overline M - \overline B\| \le \frac12\|\overline B\|$ and $\frac32\|\overline B\| \ge \|\overline M\| \ge \frac12\|\overline B\|$.

The third term in Eq. (B.1). We bound $\|\widehat M - \overline M\|$ by the matrix Bernstein inequality [127]. Define $Z_i = M_i - \overline M$. Then we have $\mathbb E_{B_i\sim\mathcal B}[Z_i] = 0$, $\|Z_i\| \le 2m$, and
\[ \big\|\mathbb E_{B_i\sim\mathcal B}[Z_iZ_i^\top]\big\| = \big\|\mathbb E_{B_i\sim\mathcal B}[M_iM_i^\top] - \overline M\,\overline M^\top\big\| \le \nu + \|\overline M\|^2 \le \nu + 3\|\overline B\|^2. \]
Similarly, we have $\big\|\mathbb E_{B_i\sim\mathcal B}[Z_i^\top Z_i]\big\| \le \nu + 3\|\overline B\|^2$. Using the matrix Bernstein inequality, for any $\varepsilon > 0$,
\[ \Pr_{B_1,\cdots,B_n\sim\mathcal B}\Big[\frac1n\Big\|\sum_{i=1}^n Z_i\Big\| \ge \varepsilon\|\overline B\|\Big] \le d\exp\Big(-\frac{\varepsilon^2\|\overline B\|^2n/2}{\nu + 3\|\overline B\|^2 + 2m\|\overline B\|\varepsilon/3}\Big). \]
By choosing
\[ n \ge (3t\log d)\cdot\frac{\nu + 3\|\overline B\|^2 + 2m\|\overline B\|\varepsilon/3}{\varepsilon^2\|\overline B\|^2/2}, \]
for $t \ge 1$, we have with probability at least $1 - 1/d^{2t}$,
\[ \Big\|\frac1n\sum_{i=1}^n M_i - \overline M\Big\| \le \frac{\varepsilon}{2}\|\overline B\|. \]
Putting it all together, we have for $0 < \varepsilon < 1$, if
\[ n \ge (18t\log d)\cdot(\nu + \|\overline B\|^2 + m\|\overline B\|\varepsilon)/(\varepsilon^2\|\overline B\|^2) \quad \text{and} \quad \gamma \le (\varepsilon\|\overline B\|/(2L))^2, \]
then with probability at least $1 - 1/d^{2t} - n\gamma$,
\[ \Big\|\frac1n\sum_{i=1}^n B_i - \mathbb E_{B\sim\mathcal B}[B]\Big\| \le \varepsilon\,\big\|\mathbb E_{B\sim\mathcal B}[B]\big\|. \]
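The truncation argument above can be illustrated numerically. The sketch below is an illustration only, not part of the proof: the dimensions, sample size and truncation level $m$ are arbitrary choices. It draws unbounded rank-one matrices $B_i = x_ix_i^\top$ with Gaussian $x_i$ (so $\|B_i\| = \|x_i\|^2$ is bounded by $m$ only with high probability) and checks that the truncated empirical mean concentrates around $\overline B = I_d$ in spectral norm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20000

# B_i = x_i x_i^T with x_i ~ N(0, I_d): unbounded, but ||B_i|| = ||x_i||^2
# concentrates, so the truncation event {||B_i|| <= m} holds w.h.p.
X = rng.standard_normal((n, d))
m = 10 * d * np.log(n)                   # truncation level, cf. Fact 2.4.3
norms = np.einsum('ij,ij->i', X, X)      # ||x_i||^2 = ||B_i||
kept = norms <= m                        # the events xi_i

B_bar = np.eye(d)                        # population mean: E[x x^T] = I_d
B_hat = (X[kept].T @ X[kept]) / n        # truncated empirical mean M-hat

err = np.linalg.norm(B_hat - B_bar, 2) / np.linalg.norm(B_bar, 2)
print(kept.all(), err)
```

With these sizes no sample is actually truncated (the event $\xi_i$ holds for every $i$, mirroring the "first term" step of the proof), and the relative spectral error is small.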
Corollary B.1.1 (Error bound for symmetric rank-one random matrices). Let $x_1, x_2, \cdots, x_n$ denote $n$ i.i.d. samples drawn from the Gaussian distribution $N(0, I_d)$. Let $h(x): \mathbb R^d \to \mathbb R$ be a function satisfying the following properties (I), (II) and (III):
\[ \begin{aligned}
&\text{(I)} && \Pr_{x\sim N(0,I_d)}[|h(x)| \le m] \ge 1 - \gamma; \\
&\text{(II)} && \Big\|\mathbb E_{x\sim N(0,I_d)}[h(x)xx^\top]\Big\| > 0; \\
&\text{(III)} && \big(\mathbb E_{x\sim N(0,I_d)}[h^4(x)]\big)^{1/4} \le L.
\end{aligned} \]
Define the function $B(x) = h(x)xx^\top \in \mathbb R^{d\times d}$. Let $\overline B = \mathbb E_{x\sim N(0,I_d)}[h(x)xx^\top]$. For any $0 < \varepsilon < 1$ and $t \ge 1$, if
\[ n \gtrsim (t\log d)\cdot(L^2d + \|\overline B\|^2 + (mtd\log n)\|\overline B\|\varepsilon)/(\varepsilon^2\|\overline B\|^2) \quad \text{and} \quad \gamma + 1/(nd^{2t}) \lesssim (\varepsilon\|\overline B\|/L)^2, \]
then
\[ \Pr_{x_1,\cdots,x_n\sim N(0,I_d)}\Big[\Big\|\overline B - \frac1n\sum_{i=1}^n B(x_i)\Big\| \le \varepsilon\|\overline B\|\Big] \ge 1 - 2/(d^{2t}) - n\gamma. \]
Proof. We show that the four properties in Lemma B.1.1 are satisfied. Define the function $B(x) = h(x)xx^\top$.

(I) $\|B(x)\| = \|h(x)xx^\top\| = |h(x)|\,\|x\|^2$. By Fact 2.4.3, we have
\[ \Pr_{x\sim N(0,I_d)}[\|x\|^2 \le 10td\log n] \ge 1 - 1/(nd^{2t}). \]
Therefore,
\[ \Pr_{x\sim N(0,I_d)}[\|B(x)\| \le m\cdot 10td\log n] \ge 1 - \gamma - 1/(nd^{2t}). \]

(II) $\big\|\mathbb E_{B\sim\mathcal B}[B]\big\| = \big\|\mathbb E_{x\sim N(0,I_d)}[h(x)xx^\top]\big\| > 0$.

(III)
\[ \begin{aligned}
\max\big(\big\|\mathbb E_{B\sim\mathcal B}[BB^\top]\big\|, \big\|\mathbb E_{B\sim\mathcal B}[B^\top B]\big\|\big) &= \max_{\|a\|=1}\mathbb E_{x\sim N(0,I_d)}[(h(x))^2\|x\|^2(a^\top x)^2] \\
&\le \big(\mathbb E_{x\sim N(0,I_d)}[(h(x))^4]\big)^{1/2}\cdot\big(\mathbb E_{x\sim N(0,I_d)}[\|x\|^8]\big)^{1/4}\cdot\max_{\|a\|=1}\big(\mathbb E_{x\sim N(0,I_d)}[(a^\top x)^8]\big)^{1/4} \\
&\lesssim L^2d.
\end{aligned} \]

(IV)
\[ \begin{aligned}
\max_{\|a\|=\|b\|=1}\big(\mathbb E_{B\sim\mathcal B}[(a^\top Bb)^2]\big)^{1/2} &= \max_{\|a\|=1}\big(\mathbb E_{x\sim N(0,I_d)}[h^2(x)(a^\top x)^4]\big)^{1/2} \\
&\le \big(\mathbb E_{x\sim N(0,I_d)}[h^4(x)]\big)^{1/4}\cdot\max_{\|a\|=1}\big(\mathbb E_{x\sim N(0,I_d)}[(a^\top x)^8]\big)^{1/4} \\
&\lesssim L.
\end{aligned} \]

Applying Lemma B.1.1, we obtain, for any $0 < \varepsilon < 1$ and $t \ge 1$, if
\[ n \gtrsim (t\log d)\cdot(L^2d + \|\overline B\|^2 + (mtd\log n)\|\overline B\|\varepsilon)/(\varepsilon^2\|\overline B\|^2) \quad \text{and} \quad \gamma + 1/(nd^{2t}) \lesssim (\varepsilon\|\overline B\|/L)^2, \]
then
\[ \Pr_{x_1,\cdots,x_n\sim N(0,I_d)}\Big[\Big\|\overline B - \frac1n\sum_{i=1}^n B(x_i)\Big\| \le \varepsilon\|\overline B\|\Big] \ge 1 - 2/(d^{2t}) - n\gamma. \]
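As a concrete instance of Corollary B.1.1, the sketch below takes $h(x) = (w^\top x)^2$ — an illustrative choice, not one used elsewhere in this thesis — for which the population matrix has a closed form by Isserlis' theorem, $\mathbb E[(w^\top x)^2xx^\top] = \|w\|^2I + 2ww^\top$, so the empirical concentration can be checked against an exact target.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 50000
w = rng.standard_normal(d)

X = rng.standard_normal((n, d))
h = (X @ w) ** 2                         # h(x) = (w^T x)^2: unbounded, but with
                                         # polynomially-bounded tails, as required
B_emp = (X * h[:, None]).T @ X / n       # (1/n) sum_i h(x_i) x_i x_i^T

# Population value by Isserlis' theorem: E[(w^T x)^2 x x^T] = ||w||^2 I + 2 w w^T
B_pop = np.dot(w, w) * np.eye(d) + 2 * np.outer(w, w)

rel = np.linalg.norm(B_emp - B_pop, 2) / np.linalg.norm(B_pop, 2)
print(rel)
```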
B.2 Properties of Activation Functions
Theorem 4.2.1. ReLU $\phi(z) = \max\{z, 0\}$, leaky ReLU $\phi(z) = \max\{z, 0.01z\}$, squared ReLU $\phi(z) = \max\{z, 0\}^2$ and any non-linear non-decreasing smooth function with bounded symmetric $\phi'(z)$, such as the sigmoid function $\phi(z) = 1/(1 + e^{-z})$, the tanh function and the erf function $\phi(z) = \int_0^z e^{-t^2}dt$, satisfy Properties 1, 2 and 3. The linear function $\phi(z) = z$ does not satisfy Property 2, and the quadratic function $\phi(z) = z^2$ does not satisfy Properties 1 and 2.
Proof. We can easily verify that ReLU, leaky ReLU and squared ReLU satisfy Property 2 by calculating $\rho(\sigma)$ in Property 2, as shown in Table B.1. Property 1 for ReLU, leaky ReLU and squared ReLU can be verified since they are non-decreasing with bounded first derivative. ReLU and leaky ReLU are piece-wise linear, so they satisfy Property 3(b). Squared ReLU is smooth, so it satisfies Property 3(a).
Table B.1: $\rho(\sigma)$ values for different activation functions. Note that we can calculate the exact values for ReLU, leaky ReLU, squared ReLU and erf. We cannot find a closed-form value for sigmoid or tanh, but we calculate the numerical values of $\rho(\sigma)$ for $\sigma = 0.1, 1, 10$.

| | ReLU | Leaky ReLU | squared ReLU | erf | sigmoid ($\sigma=0.1$) | sigmoid ($\sigma=1$) | sigmoid ($\sigma=10$) |
| $\alpha_0(\sigma)$ | $\frac12$ | $\frac{1.01}{2}$ | $\sqrt{\frac{2}{\pi}}\sigma$ | $\frac{1}{(2\sigma^2+1)^{1/2}}$ | 0.99 | 0.606 | 0.079 |
| $\alpha_1(\sigma)$ | $\frac{1}{\sqrt{2\pi}}$ | $\frac{0.99}{\sqrt{2\pi}}$ | $\sigma$ | 0 | 0 | 0 | 0 |
| $\alpha_2(\sigma)$ | $\frac12$ | $\frac{1.01}{2}$ | $2\sqrt{\frac{2}{\pi}}\sigma$ | $\frac{1}{(2\sigma^2+1)^{3/2}}$ | 0.97 | 0.24 | 0.00065 |
| $\beta_0(\sigma)$ | $\frac12$ | $\frac{1.0001}{2}$ | $2\sigma^2$ | $\frac{1}{(4\sigma^2+1)^{1/2}}$ | 0.98 | 0.46 | 0.053 |
| $\beta_2(\sigma)$ | $\frac12$ | $\frac{1.0001}{2}$ | $6\sigma^2$ | $\frac{1}{(4\sigma^2+1)^{3/2}}$ | 0.94 | 0.11 | 0.00017 |
| $\rho(\sigma)$ | 0.091 | 0.089 | $0.27\sigma^2$ | $\rho_{\mathrm{erf}}(\sigma)^1$ | 1.8E-4 | 4.9E-2 | 5.1E-5 |

$^1$ $\rho_{\mathrm{erf}}(\sigma) = \min\{(4\sigma^2+1)^{-1/2} - (2\sigma^2+1)^{-1},\ (4\sigma^2+1)^{-3/2} - (2\sigma^2+1)^{-3},\ (2\sigma^2+1)^{-2}\}$.
Smooth non-decreasing activations with bounded first derivatives automatically satisfy Properties 1 and 3. For Property 2, since their first derivatives are symmetric, we have $\mathbb E[\phi'(\sigma\cdot z)z] = 0$. Then by H\"older's inequality and $\phi'(z) \ge 0$, we have
\[ \begin{aligned}
\mathbb E_{z\sim\mathcal D_1}[\phi'^2(\sigma\cdot z)] &\ge \big(\mathbb E_{z\sim\mathcal D_1}[\phi'(\sigma\cdot z)]\big)^2, \\
\mathbb E_{z\sim\mathcal D_1}[\phi'^2(\sigma\cdot z)z^2]\cdot\mathbb E_{z\sim\mathcal D_1}[z^2] &\ge \big(\mathbb E_{z\sim\mathcal D_1}[\phi'(\sigma\cdot z)z^2]\big)^2, \\
\mathbb E_{z\sim\mathcal D_1}[\phi'(\sigma\cdot z)z^2]\cdot\mathbb E_{z\sim\mathcal D_1}[\phi'(\sigma\cdot z)] = \mathbb E_{z\sim\mathcal D_1}[(\sqrt{\phi'(\sigma\cdot z)}z)^2]\cdot\mathbb E_{z\sim\mathcal D_1}[(\sqrt{\phi'(\sigma\cdot z)})^2] &\ge \big(\mathbb E_{z\sim\mathcal D_1}[\phi'(\sigma\cdot z)z]\big)^2.
\end{aligned} \]
Equality in the first inequality holds only when $\phi'(\sigma\cdot z)$ is constant a.e. Equality in the second inequality holds only when $|\phi'(\sigma\cdot z)|$ is constant a.e., which is ruled out by the non-linearity and smoothness conditions. Equality in the third inequality holds only when $\phi'(z) = 0$ a.e., which leads to a constant function under the non-decreasing condition. Therefore, $\rho(\sigma) > 0$ for any smooth non-decreasing non-linear activation with bounded symmetric first derivative. The statements about the linear and quadratic activations follow from direct calculation.
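The entries of Table B.1 can be reproduced by Monte Carlo. The sketch below (an illustration with arbitrary sample size) estimates $\alpha_0, \alpha_1, \alpha_2, \beta_0, \beta_2$ for ReLU at $\sigma = 1$ and takes the minimum of the three candidate quantities appearing in the footnote of Table B.1; the result should be close to the 0.091 reported in the ReLU column.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(2_000_000)
sigma = 1.0

dphi = (sigma * z > 0).astype(float)   # ReLU'(sigma * z) = 1{sigma * z > 0}

a0 = dphi.mean()                       # alpha_0 = E[phi'(sigma z)]       ~ 1/2
a1 = (dphi * z).mean()                 # alpha_1 = E[phi'(sigma z) z]     ~ 1/sqrt(2 pi)
a2 = (dphi * z**2).mean()              # alpha_2 = E[phi'(sigma z) z^2]   ~ 1/2
b0 = (dphi**2).mean()                  # beta_0  = E[phi'^2(sigma z)]     ~ 1/2
b2 = (dphi**2 * z**2).mean()           # beta_2  = E[phi'^2(sigma z) z^2] ~ 1/2

rho = min(b0 - a0**2 - a1**2,
          b2 - a1**2 - a2**2,
          a0 * a2 - a1**2)
print(rho)
```

For ReLU all three candidate terms equal $\tfrac14 - \tfrac{1}{2\pi} \approx 0.0908$, matching the tabulated value.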
B.3 Local Positive Definiteness of Hessian

B.3.1 Main Results for Positive Definiteness of Hessian
Bounding the Spectrum of the Hessian near the Ground Truth
Theorem B.3.1 (Bounding the spectrum of the Hessian near the ground truth). For any $W \in \mathbb R^{d\times k}$ with $\|W - W^*\| \lesssim v_{\min}^4\rho^2(\sigma_k)/(k^2\kappa^5\lambda^2v_{\max}^4\sigma_1^{4p})\cdot\|W^*\|$, let $S$ denote a set of i.i.d. samples from the distribution $\mathcal D$ (defined in (4.1)) and let the activation function satisfy Properties 1, 2 and 3. Then for any $t \ge 1$, if $|S| \ge d\cdot\mathrm{poly}(\log d, t)\cdot k^2v_{\max}^4\tau\kappa^8\lambda^2\sigma_1^{4p}/(v_{\min}^4\rho^2(\sigma_k))$, we have with probability at least $1 - d^{-\Omega(t)}$,
\[ \Omega(v_{\min}^2\rho(\sigma_k)/(\kappa^2\lambda))\,I \preceq \nabla^2 f_S(W) \preceq O(kv_{\max}^2\sigma_1^{2p})\,I. \]
Proof. The main idea of the proof follows from the inequalities
\[ \nabla^2 f_{\mathcal D}(W^*) - \|\nabla^2 f_S(W) - \nabla^2 f_{\mathcal D}(W^*)\|I \preceq \nabla^2 f_S(W) \preceq \nabla^2 f_{\mathcal D}(W^*) + \|\nabla^2 f_S(W) - \nabla^2 f_{\mathcal D}(W^*)\|I. \]
The proof sketch is to first bound the range of the eigenvalues of $\nabla^2 f_{\mathcal D}(W^*)$ (Lemma B.3.1) and then bound the spectral norm of the remaining error, $\|\nabla^2 f_S(W) - \nabla^2 f_{\mathcal D}(W^*)\|$. This error can be further decomposed into two parts, $\|\nabla^2 f_S(W) - H\|$ and $\|H - \nabla^2 f_{\mathcal D}(W^*)\|$, where $H$ is $\nabla^2 f_{\mathcal D}(W)$ if $\phi$ is smooth, and otherwise a specially designed matrix. We can upper bound both parts when $W$ is close enough to $W^*$ and there are enough samples. In particular, if the activation satisfies Property 3(a), see Lemma B.3.6 for bounding $\|H - \nabla^2 f_{\mathcal D}(W^*)\|$ and Lemma B.3.7 for bounding $\|H - \nabla^2 f_S(W)\|$. If the activation satisfies Property 3(b), see Lemma B.3.8. Finally we can complete the proof by setting $\delta = O(v_{\min}^2\rho(\sigma_1)/(kv_{\max}^2\kappa^2\lambda\sigma_1^{2p}))$ in Lemma B.3.7 and Lemma B.3.8, setting $\|W - W^*\| \lesssim v_{\min}^2\rho(\sigma_k)/(k\kappa^2\lambda v_{\max}^2\sigma_1^p)$ in Lemma B.3.6, and setting $\|W - W^*\| \le v_{\min}^4\rho^2(\sigma_k)\sigma_k/(k^2\kappa^4\lambda^2v_{\max}^4\sigma_1^{4p})$ in Lemma B.3.8.
Local Linear Convergence of Gradient Descent
Although Theorem B.3.1 gives upper and lower bounds for the spectrum of the Hessian w.h.p., it only holds when the current set of parameters $W$ is independent of the samples. When we use iterative methods, like gradient descent, to optimize this objective, the next iterate calculated from the current set of samples will depend on this set of samples. Therefore, we need to resample at each iteration. Here we show that for activations satisfying Properties 1, 2 and 3(a), linear convergence of gradient descent is guaranteed. To the best of our knowledge, there are no linear convergence guarantees for general non-smooth objectives, so the following theorem applies only to smooth objectives, which excludes ReLU.
Theorem B.3.2 (Linear convergence of gradient descent, formal version of Theorem 4.3.2). Let $W^c \in \mathbb R^{d\times k}$ be the current iterate satisfying
\[ \|W^c - W^*\| \lesssim v_{\min}^4\rho^2(\sigma_k)/(k^2\kappa^5\lambda^2v_{\max}^4\sigma_1^{4p})\,\|W^*\|. \]
Let $S$ denote a set of i.i.d. samples from the distribution $\mathcal D$ (defined in (4.1)). Let the activation function satisfy Properties 1, 2 and 3(a). Define
\[ m_0 = \Theta(v_{\min}^2\rho(\sigma_k)/(\kappa^2\lambda)) \quad \text{and} \quad M_0 = \Theta(kv_{\max}^2\sigma_1^{2p}). \]
For any $t \ge 1$, if we choose
\[ |S| \ge d\cdot\mathrm{poly}(\log d, t)\cdot k^2v_{\max}^4\tau\kappa^8\lambda^2\sigma_1^{4p}/(v_{\min}^4\rho^2(\sigma_k)) \tag{B.2} \]
and perform gradient descent with step size $1/M_0$ on $f_S(W^c)$ to obtain the next iterate,
\[ \widetilde W = W^c - \frac{1}{M_0}\nabla f_S(W^c), \]
then with probability at least $1 - d^{-\Omega(t)}$,
\[ \|\widetilde W - W^*\|_F^2 \le \Big(1 - \frac{m_0}{M_0}\Big)\|W^c - W^*\|_F^2. \]
Proof. To prove Theorem B.3.2, we need to show the positive definiteness property on the entire line segment between the current iterate and the optimum by constructing a set of anchor points that are independent of the samples. Then we apply the traditional analysis for the linear convergence of gradient descent.

In particular, given a current iterate $W^c$, we set $d^{(p+1)/2}$ anchor points $\{W^a\}_{a=1,2,\cdots,d^{(p+1)/2}}$ equally spaced along the line $\xi W^* + (1-\xi)W^c$ for $\xi \in [0,1]$.

According to Theorem B.3.1, by setting $t \leftarrow t + (p+1)/2$, we have with probability at least $1 - d^{-(t+(p+1)/2)}$ for each anchor point $W^a$,
\[ m_0I \preceq \nabla^2 f_S(W^a) \preceq M_0I. \]
Then given an anchor point $W^a$, according to Lemma C.3.11, we have with probability $1 - 2d^{-(t+(p+1)/2)}$, for any point $W$ between $(W^{a-1} + W^a)/2$ and $(W^a + W^{a+1})/2$,
\[ m_0I \preceq \nabla^2 f_S(W) \preceq M_0I. \tag{B.3} \]
Finally, by applying a union bound over these $d^{(p+1)/2}$ small intervals, we have with probability at least $1 - d^{-t}$, for any point $W$ on the line between $W^c$ and $W^*$,
\[ m_0I \preceq \nabla^2 f_S(W) \preceq M_0I. \]
Now we can apply the traditional analysis for linear convergence of gradient descent. Let $\eta$ denote the step size.
\[ \begin{aligned}
\|\widetilde W - W^*\|_F^2 &= \|W^c - \eta\nabla f_S(W^c) - W^*\|_F^2 \\
&= \|W^c - W^*\|_F^2 - 2\eta\langle\nabla f_S(W^c), (W^c - W^*)\rangle + \eta^2\|\nabla f_S(W^c)\|_F^2.
\end{aligned} \]
We can rewrite $\nabla f_S(W^c)$,
\[ \nabla f_S(W^c) = \Big(\int_0^1\nabla^2 f_S(W^* + \gamma(W^c - W^*))\,d\gamma\Big)\mathrm{vec}(W^c - W^*). \]
We define the matrix $H_S \in \mathbb R^{dk\times dk}$ as
\[ H_S = \int_0^1\nabla^2 f_S(W^* + \gamma(W^c - W^*))\,d\gamma. \]
According to Eq. (B.3),
\[ m_0I \preceq H_S \preceq M_0I. \tag{B.4} \]
We can upper bound $\|\nabla f_S(W^c)\|_F^2$,
\[ \|\nabla f_S(W^c)\|_F^2 = \langle H_S(W^c - W^*), H_S(W^c - W^*)\rangle \le M_0\langle W^c - W^*, H_S(W^c - W^*)\rangle. \]
Therefore,
\[ \begin{aligned}
\|\widetilde W - W^*\|_F^2 &\le \|W^c - W^*\|_F^2 - (-\eta^2M_0 + 2\eta)\langle W^c - W^*, H_S(W^c - W^*)\rangle \\
&\le \|W^c - W^*\|_F^2 - (-\eta^2M_0 + 2\eta)m_0\|W^c - W^*\|_F^2 \\
&= \|W^c - W^*\|_F^2 - \frac{m_0}{M_0}\|W^c - W^*\|_F^2 \\
&= \Big(1 - \frac{m_0}{M_0}\Big)\|W^c - W^*\|_F^2,
\end{aligned} \]
where the third step holds by setting $\eta = \frac{1}{M_0}$.
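The resampled gradient-descent scheme analyzed above can be simulated directly. The sketch below is an illustration under stated assumptions: it assumes the squared-loss objective of Chapter 4 with $v^* = \mathbf 1$, uses the squared ReLU (which is smooth and satisfies Property 3(a)), and picks a small fixed step size and problem sizes rather than the $1/M_0$ of the theorem. Fresh samples are drawn at each iteration, mirroring the resampling in the proof.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, n = 8, 3, 4096
phi = lambda z: np.maximum(z, 0.0) ** 2        # squared ReLU: smooth, Property 3(a)
dphi = lambda z: 2.0 * np.maximum(z, 0.0)

W_star = np.linalg.qr(rng.standard_normal((d, k)))[0]  # orthonormal ground truth

def grad(W, X, y):
    # gradient of f_S(W) = (1/(2n)) sum_j (sum_i phi(w_i^T x_j) - y_j)^2, v* = 1
    r = phi(X @ W).sum(axis=1) - y             # residuals
    return X.T @ (r[:, None] * dphi(X @ W)) / len(y)

W = W_star + 0.05 * rng.standard_normal((d, k))  # start near W*
eta = 0.05                                       # fixed step, stands in for 1/M_0
errs = []
for _ in range(300):
    X = rng.standard_normal((n, d))              # fresh samples each iteration
    y = phi(X @ W_star).sum(axis=1)
    W -= eta * grad(W, X, y)
    errs.append(np.linalg.norm(W - W_star, 'fro'))
print(errs[0], errs[-1])
```

Since the labels are noiseless, the gradient noise vanishes as $W \to W^*$, and the error should shrink geometrically toward zero, consistent with the $(1 - m_0/M_0)$ contraction.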
B.3.2 Positive Definiteness of the Population Hessian at the Ground Truth

The goal of this section is to prove Lemma B.3.1.

Lemma B.3.1 (Positive definiteness of the population Hessian at the ground truth). If $\phi(z)$ satisfies Properties 1, 2 and 3, we have the following property for the second derivative of the function $f_{\mathcal D}(W)$ at $W^*$,
\[ \Omega(v_{\min}^2\rho(\sigma_k)/(\kappa^2\lambda))\,I \preceq \nabla^2 f_{\mathcal D}(W^*) \preceq O(kv_{\max}^2\sigma_1^{2p})\,I. \]
Proof. The proof directly follows from Lemma B.3.3 (Section B.3.2) and Lemma B.3.4 (Section B.3.2).
Lower Bound on the Eigenvalues of Hessian for the Orthogonal Case
Lemma B.3.2. Let $\mathcal D_1$ denote the Gaussian distribution $N(0,1)$. Let $\alpha_0 = \mathbb E_{z\sim\mathcal D_1}[\phi'(z)]$, $\alpha_1 = \mathbb E_{z\sim\mathcal D_1}[\phi'(z)z]$, $\alpha_2 = \mathbb E_{z\sim\mathcal D_1}[\phi'(z)z^2]$, $\beta_0 = \mathbb E_{z\sim\mathcal D_1}[\phi'^2(z)]$, $\beta_2 = \mathbb E_{z\sim\mathcal D_1}[\phi'^2(z)z^2]$. Let $\rho$ denote $\min\{(\beta_0 - \alpha_0^2 - \alpha_1^2), (\beta_2 - \alpha_1^2 - \alpha_2^2)\}$. Let $P = [p_1\ p_2\ \cdots\ p_k] \in \mathbb R^{k\times k}$. Then we have
\[ \mathbb E_{u\sim\mathcal D_k}\Big[\Big(\sum_{i=1}^k p_i^\top u\cdot\phi'(u_i)\Big)^2\Big] \ge \rho\|P\|_F^2. \tag{B.5} \]
Proof. The main idea is to explicitly calculate the LHS of Eq. (B.5), then reformulate the expression and find a lower bound in terms of $\alpha_0, \alpha_1, \alpha_2, \beta_0, \beta_2$.
\[ \begin{aligned}
\mathbb E_{u\sim\mathcal D_k}\Big[\Big(\sum_{i=1}^k p_i^\top u\cdot\phi'(u_i)\Big)^2\Big] &= \sum_{i=1}^k\sum_{l=1}^k\mathbb E_{u\sim\mathcal D_k}[p_i^\top(\phi'(u_l)\phi'(u_i)\cdot uu^\top)p_l] \\
&= \underbrace{\sum_{i=1}^k\mathbb E_{u\sim\mathcal D_k}[p_i^\top(\phi'(u_i)^2\cdot uu^\top)p_i]}_{A} + \underbrace{\sum_{i\ne l}\mathbb E_{u\sim\mathcal D_k}[p_i^\top(\phi'(u_l)\phi'(u_i)\cdot uu^\top)p_l]}_{B}.
\end{aligned} \]
Further, we can rewrite the diagonal term in the following way,
\[ \begin{aligned}
A &= \sum_{i=1}^k\mathbb E_{u\sim\mathcal D_k}[p_i^\top(\phi'(u_i)^2\cdot uu^\top)p_i] \\
&= \sum_{i=1}^k\mathbb E_{u\sim\mathcal D_k}\Big[p_i^\top\Big(\phi'(u_i)^2\cdot\Big(u_i^2e_ie_i^\top + \sum_{j\ne i}u_iu_j(e_ie_j^\top + e_je_i^\top) + \sum_{j\ne i}\sum_{l\ne i}u_ju_le_je_l^\top\Big)\Big)p_i\Big] \\
&= \sum_{i=1}^k\mathbb E_{u\sim\mathcal D_k}\Big[p_i^\top\Big(\phi'(u_i)^2\cdot\Big(u_i^2e_ie_i^\top + \sum_{j\ne i}u_j^2e_je_j^\top\Big)\Big)p_i\Big] \\
&= \sum_{i=1}^k\Big[p_i^\top\Big(\mathbb E_{u\sim\mathcal D_k}[\phi'(u_i)^2u_i^2]e_ie_i^\top + \sum_{j\ne i}\mathbb E_{u\sim\mathcal D_k}[\phi'(u_i)^2u_j^2]e_je_j^\top\Big)p_i\Big] \\
&= \sum_{i=1}^k\Big[p_i^\top\Big(\beta_2e_ie_i^\top + \sum_{j\ne i}\beta_0e_je_j^\top\Big)p_i\Big] \\
&= \sum_{i=1}^k p_i^\top((\beta_2 - \beta_0)e_ie_i^\top + \beta_0I_k)p_i \\
&= (\beta_2 - \beta_0)\sum_{i=1}^k p_i^\top e_ie_i^\top p_i + \beta_0\sum_{i=1}^k p_i^\top p_i \\
&= (\beta_2 - \beta_0)\|\operatorname{diag}(P)\|^2 + \beta_0\|P\|_F^2,
\end{aligned} \]
where the second step follows by rewriting $uu^\top = \sum_{i=1}^k\sum_{j=1}^k u_iu_je_ie_j^\top$, the third step follows by $\mathbb E_{u\sim\mathcal D_k}[\phi'(u_i)^2u_iu_j] = 0,\ \forall j\ne i$ and $\mathbb E_{u\sim\mathcal D_k}[\phi'(u_i)^2u_ju_l] = 0,\ \forall j\ne l$, the fourth step follows by pushing the expectation inside, the fifth step follows by $\mathbb E_{u\sim\mathcal D_k}[\phi'(u_i)^2u_i^2] = \beta_2$ and $\mathbb E_{u\sim\mathcal D_k}[\phi'(u_i)^2u_j^2] = \mathbb E_{u\sim\mathcal D_k}[\phi'(u_i)^2] = \beta_0$, and the last step follows by $\sum_{i=1}^k p_{i,i}^2 = \|\operatorname{diag}(P)\|^2$ and $\sum_{i=1}^k p_i^\top p_i = \sum_{i=1}^k\|p_i\|^2 = \|P\|_F^2$.
178
We can rewrite the off-diagonal term in the following way,
\[ \begin{aligned}
B &= \sum_{i\ne l}\mathbb E_{u\sim\mathcal D_k}[p_i^\top(\phi'(u_l)\phi'(u_i)\cdot uu^\top)p_l] \\
&= \sum_{i\ne l}\mathbb E_{u\sim\mathcal D_k}\Big[p_i^\top\Big(\phi'(u_l)\phi'(u_i)\cdot\Big(u_i^2e_ie_i^\top + u_l^2e_le_l^\top + u_iu_l(e_ie_l^\top + e_le_i^\top) + \sum_{j\ne l}u_iu_je_ie_j^\top \\
&\qquad\qquad + \sum_{j\ne i}u_ju_le_je_l^\top + \sum_{j\ne i,l}\sum_{j'\ne i,l}u_ju_{j'}e_je_{j'}^\top\Big)\Big)p_l\Big] \\
&= \sum_{i\ne l}\mathbb E_{u\sim\mathcal D_k}\Big[p_i^\top\Big(\phi'(u_l)\phi'(u_i)\cdot\Big(u_i^2e_ie_i^\top + u_l^2e_le_l^\top + u_iu_l(e_ie_l^\top + e_le_i^\top) + \sum_{j\ne i,l}u_j^2e_je_j^\top\Big)\Big)p_l\Big] \\
&= \sum_{i\ne l}\Big[p_i^\top\Big(\mathbb E_{u\sim\mathcal D_k}[\phi'(u_l)\phi'(u_i)u_i^2]e_ie_i^\top + \mathbb E_{u\sim\mathcal D_k}[\phi'(u_l)\phi'(u_i)u_l^2]e_le_l^\top \\
&\qquad\qquad + \mathbb E_{u\sim\mathcal D_k}[\phi'(u_l)\phi'(u_i)u_iu_l](e_ie_l^\top + e_le_i^\top) + \sum_{j\ne i,l}\mathbb E_{u\sim\mathcal D_k}[\phi'(u_l)\phi'(u_i)u_j^2]e_je_j^\top\Big)p_l\Big] \\
&= \sum_{i\ne l}\Big[p_i^\top\Big(\alpha_0\alpha_2(e_ie_i^\top + e_le_l^\top) + \alpha_1^2(e_ie_l^\top + e_le_i^\top) + \sum_{j\ne i,l}\alpha_0^2e_je_j^\top\Big)p_l\Big] \\
&= \sum_{i\ne l}\Big[p_i^\top\Big((\alpha_0\alpha_2 - \alpha_0^2)(e_ie_i^\top + e_le_l^\top) + \alpha_1^2(e_ie_l^\top + e_le_i^\top) + \alpha_0^2I_k\Big)p_l\Big] \\
&= \underbrace{(\alpha_0\alpha_2 - \alpha_0^2)\sum_{i\ne l}p_i^\top(e_ie_i^\top + e_le_l^\top)p_l}_{B_1} + \underbrace{\alpha_1^2\sum_{i\ne l}p_i^\top(e_ie_l^\top + e_le_i^\top)p_l}_{B_2} + \underbrace{\alpha_0^2\sum_{i\ne l}p_i^\top p_l}_{B_3},
\end{aligned} \]
where the third step follows by
\[ \mathbb E_{u\sim\mathcal D_k}[\phi'(u_l)\phi'(u_i)u_iu_j] = 0 \quad \text{and} \quad \mathbb E_{u\sim\mathcal D_k}[\phi'(u_l)\phi'(u_i)u_{j'}u_j] = 0,\ \forall j'\ne j. \]
For the term $B_1$, we have
\[ \begin{aligned}
B_1 &= (\alpha_0\alpha_2 - \alpha_0^2)\sum_{i\ne l}p_i^\top(e_ie_i^\top + e_le_l^\top)p_l \\
&= 2(\alpha_0\alpha_2 - \alpha_0^2)\sum_{i\ne l}p_i^\top e_ie_i^\top p_l \\
&= 2(\alpha_0\alpha_2 - \alpha_0^2)\sum_{i=1}^k p_i^\top e_ie_i^\top\Big(\sum_{l=1}^k p_l - p_i\Big) \\
&= 2(\alpha_0\alpha_2 - \alpha_0^2)\Big(\sum_{i=1}^k p_i^\top e_ie_i^\top\sum_{l=1}^k p_l - \sum_{i=1}^k p_i^\top e_ie_i^\top p_i\Big) \\
&= 2(\alpha_0\alpha_2 - \alpha_0^2)(\operatorname{diag}(P)^\top\cdot P\cdot\mathbf 1 - \|\operatorname{diag}(P)\|^2).
\end{aligned} \]
For the term $B_2$, we have
\[ \begin{aligned}
B_2 &= \alpha_1^2\sum_{i\ne l}p_i^\top(e_ie_l^\top + e_le_i^\top)p_l \\
&= \alpha_1^2\Big(\sum_{i\ne l}p_i^\top e_ie_l^\top p_l + \sum_{i\ne l}p_i^\top e_le_i^\top p_l\Big) \\
&= \alpha_1^2\Big(\sum_{i=1}^k\sum_{l=1}^k p_i^\top e_ie_l^\top p_l - \sum_{j=1}^k p_j^\top e_je_j^\top p_j + \sum_{i=1}^k\sum_{l=1}^k p_i^\top e_le_i^\top p_l - \sum_{j=1}^k p_j^\top e_je_j^\top p_j\Big) \\
&= \alpha_1^2((\operatorname{diag}(P)^\top\mathbf 1)^2 - \|\operatorname{diag}(P)\|^2 + \langle P, P^\top\rangle - \|\operatorname{diag}(P)\|^2).
\end{aligned} \]
For the term $B_3$, we have
\[ \begin{aligned}
B_3 &= \alpha_0^2\sum_{i\ne l}p_i^\top p_l \\
&= \alpha_0^2\Big(\sum_{i=1}^k p_i^\top\sum_{l=1}^k p_l - \sum_{i=1}^k p_i^\top p_i\Big) \\
&= \alpha_0^2\Big(\Big\|\sum_{i=1}^k p_i\Big\|^2 - \sum_{i=1}^k\|p_i\|^2\Big) \\
&= \alpha_0^2(\|P\cdot\mathbf 1\|^2 - \|P\|_F^2).
\end{aligned} \]
Let $\operatorname{diag}(P)$ denote the length-$k$ column vector whose $i$-th entry is the $(i,i)$-th entry of $P \in \mathbb R^{k\times k}$. Furthermore, we can show $A + B$ is
\[ \begin{aligned}
A + B &= A + B_1 + B_2 + B_3 \\
&= \underbrace{(\beta_2 - \beta_0)\|\operatorname{diag}(P)\|^2 + \beta_0\|P\|_F^2}_{A} + \underbrace{2(\alpha_0\alpha_2 - \alpha_0^2)(\operatorname{diag}(P)^\top\cdot P\cdot\mathbf 1 - \|\operatorname{diag}(P)\|^2)}_{B_1} \\
&\quad + \underbrace{\alpha_1^2((\operatorname{diag}(P)^\top\cdot\mathbf 1)^2 - \|\operatorname{diag}(P)\|^2 + \langle P, P^\top\rangle - \|\operatorname{diag}(P)\|^2)}_{B_2} + \underbrace{\alpha_0^2(\|P\cdot\mathbf 1\|^2 - \|P\|_F^2)}_{B_3} \\
&= \underbrace{\|\alpha_0P\cdot\mathbf 1 + (\alpha_2 - \alpha_0)\operatorname{diag}(P)\|^2}_{C_1} + \underbrace{\alpha_1^2(\operatorname{diag}(P)^\top\cdot\mathbf 1)^2}_{C_2} + \underbrace{\frac{\alpha_1^2}{2}\|P + P^\top - 2\operatorname{diag}(\operatorname{diag}(P))\|_F^2}_{C_3} \\
&\quad + \underbrace{(\beta_0 - \alpha_0^2 - \alpha_1^2)\|P - \operatorname{diag}(\operatorname{diag}(P))\|_F^2}_{C_4} + \underbrace{(\beta_2 - \alpha_1^2 - \alpha_2^2)\|\operatorname{diag}(P)\|^2}_{C_5} \\
&\ge (\beta_0 - \alpha_0^2 - \alpha_1^2)\|P - \operatorname{diag}(\operatorname{diag}(P))\|_F^2 + (\beta_2 - \alpha_1^2 - \alpha_2^2)\|\operatorname{diag}(P)\|^2 \\
&\ge \min\{(\beta_0 - \alpha_0^2 - \alpha_1^2), (\beta_2 - \alpha_1^2 - \alpha_2^2)\}\cdot(\|P - \operatorname{diag}(\operatorname{diag}(P))\|_F^2 + \|\operatorname{diag}(P)\|^2) \\
&= \min\{(\beta_0 - \alpha_0^2 - \alpha_1^2), (\beta_2 - \alpha_1^2 - \alpha_2^2)\}\cdot(\|P - \operatorname{diag}(\operatorname{diag}(P))\|_F^2 + \|\operatorname{diag}(\operatorname{diag}(P))\|_F^2) \\
&\ge \min\{(\beta_0 - \alpha_0^2 - \alpha_1^2), (\beta_2 - \alpha_1^2 - \alpha_2^2)\}\cdot\|P\|_F^2 \\
&= \rho\|P\|_F^2,
\end{aligned} \]
where the first step follows by $B = B_1 + B_2 + B_3$, the second step follows by the definitions of $A, B_1, B_2, B_3$, the third step follows by $A + B_1 + B_2 + B_3 = C_1 + C_2 + C_3 + C_4 + C_5$ (Claim B.3.1), the fourth step follows by $C_1, C_2, C_3 \ge 0$, the fifth step follows by $a \ge \min\{a, b\}$, the sixth step follows by $\|\operatorname{diag}(P)\|^2 = \|\operatorname{diag}(\operatorname{diag}(P))\|_F^2$, the seventh step follows by the triangle inequality, and the last step follows by the definition of $\rho$.
Claim B.3.1. $A + B_1 + B_2 + B_3 = C_1 + C_2 + C_3 + C_4 + C_5$.

Proof. The key properties we need are: for two vectors $a, b$, $\|a + b\|^2 = \|a\|^2 + 2\langle a, b\rangle + \|b\|^2$; for two matrices $A, B$, $\|A + B\|_F^2 = \|A\|_F^2 + 2\langle A, B\rangle + \|B\|_F^2$. Then, we have
\[ \begin{aligned}
&C_1 + C_2 + C_3 + C_4 + C_5 \\
&= \underbrace{\alpha_0^2\|P\cdot\mathbf 1\|^2 + 2(\alpha_0\alpha_2 - \alpha_0^2)\langle P\cdot\mathbf 1, \operatorname{diag}(P)\rangle + (\alpha_2 - \alpha_0)^2\|\operatorname{diag}(P)\|^2}_{C_1} + \underbrace{\alpha_1^2(\operatorname{diag}(P)^\top\cdot\mathbf 1)^2}_{C_2} \\
&\quad + \underbrace{\frac{\alpha_1^2}{2}(2\|P\|_F^2 + 4\|\operatorname{diag}(\operatorname{diag}(P))\|_F^2 + 2\langle P, P^\top\rangle - 4\langle P, \operatorname{diag}(\operatorname{diag}(P))\rangle - 4\langle P^\top, \operatorname{diag}(\operatorname{diag}(P))\rangle)}_{C_3} \\
&\quad + \underbrace{(\beta_0 - \alpha_0^2 - \alpha_1^2)(\|P\|_F^2 - 2\langle P, \operatorname{diag}(\operatorname{diag}(P))\rangle + \|\operatorname{diag}(\operatorname{diag}(P))\|_F^2)}_{C_4} + \underbrace{(\beta_2 - \alpha_1^2 - \alpha_2^2)\|\operatorname{diag}(P)\|^2}_{C_5} \\
&= \alpha_0^2\|P\cdot\mathbf 1\|^2 + 2(\alpha_0\alpha_2 - \alpha_0^2)\langle P\cdot\mathbf 1, \operatorname{diag}(P)\rangle + (\alpha_2 - \alpha_0)^2\|\operatorname{diag}(P)\|^2 + \alpha_1^2(\operatorname{diag}(P)^\top\cdot\mathbf 1)^2 \\
&\quad + \frac{\alpha_1^2}{2}(2\|P\|_F^2 + 4\|\operatorname{diag}(P)\|^2 + 2\langle P, P^\top\rangle - 8\|\operatorname{diag}(P)\|^2) + (\beta_0 - \alpha_0^2 - \alpha_1^2)(\|P\|_F^2 - 2\|\operatorname{diag}(P)\|^2 + \|\operatorname{diag}(P)\|^2) \\
&\quad + (\beta_2 - \alpha_1^2 - \alpha_2^2)\|\operatorname{diag}(P)\|^2 \\
&= \alpha_0^2\|P\cdot\mathbf 1\|^2 + 2(\alpha_0\alpha_2 - \alpha_0^2)\operatorname{diag}(P)^\top\cdot P\cdot\mathbf 1 + \alpha_1^2(\operatorname{diag}(P)^\top\cdot\mathbf 1)^2 + \alpha_1^2\langle P, P^\top\rangle \\
&\quad + (\beta_0 - \alpha_0^2)\|P\|_F^2 + \underbrace{((\alpha_2 - \alpha_0)^2 - 2\alpha_1^2 - \beta_0 + \alpha_0^2 + \alpha_1^2 + \beta_2 - \alpha_1^2 - \alpha_2^2)}_{\beta_2 - \beta_0 - 2(\alpha_2\alpha_0 - \alpha_0^2 + \alpha_1^2)}\|\operatorname{diag}(P)\|^2 \\
&= \underbrace{(\beta_2 - \beta_0)\|\operatorname{diag}(P)\|^2 + \beta_0\|P\|_F^2}_{A} + \underbrace{2(\alpha_0\alpha_2 - \alpha_0^2)(\operatorname{diag}(P)^\top\cdot P\cdot\mathbf 1 - \|\operatorname{diag}(P)\|^2)}_{B_1} \\
&\quad + \underbrace{\alpha_1^2((\operatorname{diag}(P)^\top\cdot\mathbf 1)^2 - \|\operatorname{diag}(P)\|^2 + \langle P, P^\top\rangle - \|\operatorname{diag}(P)\|^2)}_{B_2} + \underbrace{\alpha_0^2(\|P\cdot\mathbf 1\|^2 - \|P\|_F^2)}_{B_3} \\
&= A + B_1 + B_2 + B_3,
\end{aligned} \]
where the second step follows by $\langle P, \operatorname{diag}(\operatorname{diag}(P))\rangle = \|\operatorname{diag}(P)\|^2$ and $\|\operatorname{diag}(\operatorname{diag}(P))\|_F^2 = \|\operatorname{diag}(P)\|^2$.
Lower Bound on the Eigenvalues of Hessian for Non-orthogonal Case
First we show the lower bound of the eigenvalues. The main idea is to
reduce the problem to a k-by-k problem and then lower bound the eigenvalues using
orthogonal weight matrices.
Lemma B.3.3 (Lower bound). If $\phi(z)$ satisfies Properties 1, 2 and 3, we have the following property for the second derivative of the function $f_{\mathcal D}(W)$ at $W^*$,
\[ \Omega(v_{\min}^2\rho(\sigma_k)/(\kappa^2\lambda))\,I \preceq \nabla^2 f_{\mathcal D}(W^*). \]
Proof. Let $a \in \mathbb R^{dk}$ denote the vector $[a_1^\top\ a_2^\top\ \cdots\ a_k^\top]^\top$, let $b \in \mathbb R^{dk}$ denote the vector $[b_1^\top\ b_2^\top\ \cdots\ b_k^\top]^\top$, and let $c \in \mathbb R^{dk}$ denote the vector $[c_1^\top\ c_2^\top\ \cdots\ c_k^\top]^\top$. The smallest eigenvalue of the Hessian can be bounded by
\[ \nabla^2 f(W^*) \succeq \Big(\min_{\|a\|=1}a^\top\nabla^2 f(W^*)a\Big)I_{dk} = \Big(\min_{\|a\|=1}\mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^k v_i^*a_i^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big]\Big)I_{dk}. \tag{B.6} \]
Note that
\[ \begin{aligned}
\min_{\|a\|=1}\mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^k(v_i^*a_i)^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big] &= \min_{\|a\|\ne 0}\mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^k(v_i^*a_i)^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big]\Big/\|a\|^2 \\
&= \min_{\sum_i\|b_i/v_i^*\|^2\ne 0}\mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^kb_i^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big]\Big/\Big(\sum_{i=1}^k\|b_i/v_i^*\|^2\Big) && \text{by } a_i = b_i/v_i^* \\
&= \min_{\sum_i\|b_i\|^2\ne 0}\mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^kb_i^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big]\Big/\Big(\sum_{i=1}^k\|b_i/v_i^*\|^2\Big) \\
&\ge v_{\min}^2\min_{\sum_i\|b_i\|^2\ne 0}\mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^kb_i^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big]\Big/\Big(\sum_{i=1}^k\|b_i\|^2\Big) && \text{by } v_{\min} = \min_{i\in[k]}|v_i^*| \\
&= v_{\min}^2\min_{\|a\|=1}\mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^ka_i^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big]. \tag{B.7}
\end{aligned} \]
Let $U \in \mathbb R^{d\times k}$ be an orthonormal basis of the column span of $W^*$ and let $V = [v_1\ v_2\ \cdots\ v_k] = U^\top W^* \in \mathbb R^{k\times k}$. Note that $V$ and $W^*$ have the same singular values and $W^* = UV$. We use $U_\perp \in \mathbb R^{d\times(d-k)}$ to denote the orthogonal complement of $U$. For any vector $a_j \in \mathbb R^d$, there exist two vectors $b_j \in \mathbb R^k$ and $c_j \in \mathbb R^{d-k}$ such that
\[ a_j = Ub_j + U_\perp c_j. \]
We use $\mathcal D_d$ to denote the Gaussian distribution $N(0, I_d)$, $\mathcal D_{d-k}$ to denote $N(0, I_{d-k})$, and $\mathcal D_k$ to denote $N(0, I_k)$. Then we can rewrite formulation (B.7) (removing $v_{\min}^2$) as
\[ \mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^ka_i^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big] = \mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^k(b_i^\top U^\top + c_i^\top U_\perp^\top)x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big] = A + B + C, \]
where
\[ \begin{aligned}
A &= \mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^kb_i^\top U^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big], \\
B &= \mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^kc_i^\top U_\perp^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big], \\
C &= \mathbb E_{x\sim\mathcal D_d}\Big[2\Big(\sum_{i=1}^kb_i^\top U^\top x\cdot\phi'(w_i^{*\top}x)\Big)\cdot\Big(\sum_{i=1}^kc_i^\top U_\perp^\top x\cdot\phi'(w_i^{*\top}x)\Big)\Big].
\end{aligned} \]
We calculate $A, B, C$ separately. First, we can show
\[ A = \mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^kb_i^\top U^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big] = \mathbb E_{z\sim\mathcal D_k}\Big[\Big(\sum_{i=1}^kb_i^\top z\cdot\phi'(v_i^\top z)\Big)^2\Big]. \]
Second, we can show
\[ \begin{aligned}
B &= \mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^kc_i^\top U_\perp^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big] \\
&= \mathbb E_{s\sim\mathcal D_{d-k},\,z\sim\mathcal D_k}\Big[\Big(\sum_{i=1}^kc_i^\top s\cdot\phi'(v_i^\top z)\Big)^2\Big] \\
&= \mathbb E_{s\sim\mathcal D_{d-k},\,z\sim\mathcal D_k}[(y^\top s)^2] && \text{by defining } y = \sum_{i=1}^k\phi'(v_i^\top z)c_i \in \mathbb R^{d-k} \\
&= \mathbb E_{z\sim\mathcal D_k}\big[\mathbb E_{s\sim\mathcal D_{d-k}}[(y^\top s)^2]\big] \\
&= \mathbb E_{z\sim\mathcal D_k}\Big[\mathbb E_{s\sim\mathcal D_{d-k}}\Big[\sum_{j=1}^{d-k}s_j^2y_j^2\Big]\Big] && \text{by } \mathbb E[s_js_{j'}] = 0 \\
&= \mathbb E_{z\sim\mathcal D_k}\Big[\sum_{j=1}^{d-k}y_j^2\Big] && \text{by } s_j\sim N(0,1) \\
&= \mathbb E_{z\sim\mathcal D_k}\Big[\Big\|\sum_{i=1}^k\phi'(v_i^\top z)c_i\Big\|^2\Big] && \text{by definition of } y.
\end{aligned} \]
Third, we have $C = 0$ since $U_\perp^\top x$ is independent of $w_i^{*\top}x$ and $U^\top x$. Thus, putting them all together,
\[ \mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^ka_i^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big] = \mathbb E_{z\sim\mathcal D_k}\Big[\Big(\sum_{i=1}^kb_i^\top z\cdot\phi'(v_i^\top z)\Big)^2\Big] + \mathbb E_{z\sim\mathcal D_k}\Big[\Big\|\sum_{i=1}^k\phi'(v_i^\top z)c_i\Big\|^2\Big]. \]
Let us lower bound $A$,
\[ \begin{aligned}
A &= \mathbb E_{z\sim\mathcal D_k}\Big[\Big(\sum_{i=1}^kb_i^\top z\cdot\phi'(v_i^\top z)\Big)^2\Big] \\
&= \int(2\pi)^{-k/2}\Big(\sum_{i=1}^kb_i^\top z\cdot\phi'(v_i^\top z)\Big)^2e^{-\|z\|^2/2}dz \\
&\overset{\xi_1}{=} \int(2\pi)^{-k/2}\Big(\sum_{i=1}^kb_i^\top V^{\dagger\top}s\cdot\phi'(s_i)\Big)^2e^{-\|V^{\dagger\top}s\|^2/2}\cdot|\det(V^\dagger)|\,ds \\
&\overset{\xi_2}{\ge} \int(2\pi)^{-k/2}\Big(\sum_{i=1}^kb_i^\top V^{\dagger\top}s\cdot\phi'(s_i)\Big)^2e^{-\sigma_1^2(V^\dagger)\|s\|^2/2}\cdot|\det(V^\dagger)|\,ds \\
&\overset{\xi_3}{=} \int(2\pi)^{-k/2}\Big(\sum_{i=1}^kb_i^\top V^{\dagger\top}u/\sigma_1(V^\dagger)\cdot\phi'(u_i/\sigma_1(V^\dagger))\Big)^2e^{-\|u\|^2/2}\,|\det(V^\dagger)|/\sigma_1^k(V^\dagger)\,du \\
&= \int(2\pi)^{-k/2}\Big(\sum_{i=1}^kp_i^\top u\cdot\phi'(\sigma_k\cdot u_i)\Big)^2e^{-\|u\|^2/2}\,\frac1\lambda\,du \\
&= \frac1\lambda\,\mathbb E_{u\sim\mathcal D_k}\Big[\Big(\sum_{i=1}^kp_i^\top u\cdot\phi'(\sigma_k\cdot u_i)\Big)^2\Big],
\end{aligned} \]
where $V^\dagger \in \mathbb R^{k\times k}$ is the inverse of $V \in \mathbb R^{k\times k}$, i.e., $V^\dagger V = I$, $p_i^\top = b_i^\top V^{\dagger\top}/\sigma_1(V^\dagger)$ and $\sigma_k = \sigma_k(W^*)$. $\xi_1$ replaces $z$ by $z = V^{\dagger\top}s$, so that $v_i^\top z = s_i$. $\xi_2$ uses the fact $\|V^{\dagger\top}s\| \le \sigma_1(V^\dagger)\|s\|$. $\xi_3$ replaces $s$ by $s = u/\sigma_1(V^\dagger)$; note that $1/\sigma_1(V^\dagger) = \sigma_k(V) = \sigma_k$, so $\phi'(u_i/\sigma_1(V^\dagger)) = \phi'(\sigma_k\cdot u_i)$. The $\phi'(\sigma_k\cdot u_i)$'s are independent of each other, which simplifies the analysis. In particular, Lemma B.3.2 gives a lower bound in this case in terms of $p_i$. Note that $\|p_i\| \ge \|b_i\|/\kappa$. Therefore,
\[ \mathbb E_{z\sim\mathcal D_k}\Big[\Big(\sum_{i=1}^kb_i^\top z\cdot\phi'(v_i^\top z)\Big)^2\Big] \ge \rho(\sigma_k)\frac{1}{\kappa^2\lambda}\|b\|^2. \]
For $B$, similar to the proof of Lemma B.3.2, we have
\[ \begin{aligned}
B &= \mathbb E_{z\sim\mathcal D_k}\Big[\Big\|\sum_{i=1}^k\phi'(v_i^\top z)c_i\Big\|^2\Big] \\
&= \int(2\pi)^{-k/2}\Big\|\sum_{i=1}^k\phi'(v_i^\top z)c_i\Big\|^2e^{-\|z\|^2/2}dz \\
&= \int(2\pi)^{-k/2}\Big\|\sum_{i=1}^k\phi'(\sigma_k\cdot u_i)c_i\Big\|^2e^{-\|V^{\dagger\top}u/\sigma_1(V^\dagger)\|^2/2}\cdot\det(V^\dagger/\sigma_1(V^\dagger))\,du \\
&= \int(2\pi)^{-k/2}\Big\|\sum_{i=1}^k\phi'(\sigma_k\cdot u_i)c_i\Big\|^2e^{-\|V^{\dagger\top}u/\sigma_1(V^\dagger)\|^2/2}\cdot\frac1\lambda\,du \\
&\ge \int(2\pi)^{-k/2}\Big\|\sum_{i=1}^k\phi'(\sigma_k\cdot u_i)c_i\Big\|^2e^{-\|u\|^2/2}\cdot\frac1\lambda\,du \\
&= \frac1\lambda\,\mathbb E_{u\sim\mathcal D_k}\Big[\Big\|\sum_{i=1}^k\phi'(\sigma_k\cdot u_i)c_i\Big\|^2\Big] \\
&= \frac1\lambda\Big(\sum_{i=1}^k\mathbb E_{u\sim\mathcal D_k}[\phi'(\sigma_k\cdot u_i)\phi'(\sigma_k\cdot u_i)c_i^\top c_i] + \sum_{i\ne l}\mathbb E_{u\sim\mathcal D_k}[\phi'(\sigma_k\cdot u_i)\phi'(\sigma_k\cdot u_l)c_i^\top c_l]\Big) \\
&= \frac1\lambda\Big(\mathbb E_{z\sim\mathcal D_1}[\phi'(\sigma_k\cdot z)^2]\sum_{i=1}^k\|c_i\|^2 + \big(\mathbb E_{z\sim\mathcal D_1}[\phi'(\sigma_k\cdot z)]\big)^2\sum_{i\ne l}c_i^\top c_l\Big) \\
&= \frac1\lambda\Big(\big(\mathbb E_{z\sim\mathcal D_1}[\phi'(\sigma_k\cdot z)]\big)^2\Big\|\sum_{i=1}^kc_i\Big\|^2 + \Big(\mathbb E_{z\sim\mathcal D_1}[\phi'(\sigma_k\cdot z)^2] - \big(\mathbb E_{z\sim\mathcal D_1}[\phi'(\sigma_k\cdot z)]\big)^2\Big)\|c\|^2\Big) \\
&\ge \frac1\lambda\Big(\mathbb E_{z\sim\mathcal D_1}[\phi'(\sigma_k\cdot z)^2] - \big(\mathbb E_{z\sim\mathcal D_1}[\phi'(\sigma_k\cdot z)]\big)^2\Big)\|c\|^2 \\
&\ge \rho(\sigma_k)\frac1\lambda\|c\|^2,
\end{aligned} \]
where the first step follows by the definition of the Gaussian distribution, the second step follows by replacing $z$ by $z = V^{\dagger\top}u/\sigma_1(V^\dagger)$, so that $v_i^\top z = u_i/\sigma_1(V^\dagger) = u_i\sigma_k(W^*)$, the third step follows by $\det(V^\dagger/\sigma_1(V^\dagger)) = \det(V^\dagger)/\sigma_1^k(V^\dagger) = 1/\lambda$, the fourth step (the inequality) follows by $\|u\|^2 \ge \|V^{\dagger\top}u/\sigma_1(V^\dagger)\|^2$, the fifth step follows by the definition of the Gaussian distribution, the ninth step follows by $x^2 \ge 0$ for any $x \in \mathbb R$, and the last step follows by Property 2.

Note that $1 = \|a\|^2 = \|b\|^2 + \|c\|^2$. Thus, we finish the proof of the lower bound.
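The two-sided spectral bound of Lemma B.3.1 can be checked numerically on a small instance. The sketch below is an illustration under stated assumptions: it takes $v^* = \mathbf 1$ and an orthonormal $W^*$ (so $\kappa = \lambda = 1$), uses ReLU, and estimates the population Hessian at $W^*$, where the residual vanishes and the blocks reduce to $\mathbb E[\phi'(w_i^{*\top}x)\phi'(w_l^{*\top}x)xx^\top]$.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, n = 6, 2, 200000
dphi = lambda z: (z > 0).astype(float)   # ReLU'; v* = 1

W_star = np.linalg.qr(rng.standard_normal((d, k)))[0]  # orthonormal columns

X = rng.standard_normal((n, d))
D = dphi(X @ W_star)                     # n x k matrix of phi'(w_i*^T x)

# At W* the residual is zero, so block (i, l) of the Hessian is
# E[phi'(w_i*^T x) phi'(w_l*^T x) x x^T]; estimate each block by an average.
H = np.zeros((d * k, d * k))
for i in range(k):
    for l in range(k):
        Hil = (X * (D[:, i] * D[:, l])[:, None]).T @ X / n
        H[i*d:(i+1)*d, l*d:(l+1)*d] = Hil

eigs = np.linalg.eigvalsh((H + H.T) / 2)
print(eigs.min(), eigs.max())
```

The smallest eigenvalue should sit near the $\rho(\sigma_k) \approx 0.091$ predicted for ReLU, and the largest well below the crude $O(k)$ upper bound.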
Upper Bound on the Eigenvalues of Hessian for Non-orthogonal Case
Lemma B.3.4 (Upper bound). If $\phi(z)$ satisfies Properties 1, 2 and 3, we have the following property for the second derivative of the function $f_{\mathcal D}(W)$ at $W^*$,
\[ \nabla^2 f_{\mathcal D}(W^*) \preceq O(kv_{\max}^2\sigma_1^{2p})\,I. \]
Proof. Similarly, we can upper bound the eigenvalues by
\[ \begin{aligned}
\|\nabla^2 f(W^*)\| &= \max_{\|a\|=1}a^\top\nabla^2 f(W^*)a \\
&= \max_{\|a\|=1}\mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^kv_i^*a_i^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big] \\
&\le v_{\max}^2\max_{\|a\|=1}\mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{i=1}^ka_i^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big] \\
&= v_{\max}^2\max_{\|a\|=1}\sum_{i=1}^k\sum_{l=1}^k\mathbb E_{x\sim\mathcal D_d}[a_i^\top x\cdot\phi'(w_i^{*\top}x)\cdot a_l^\top x\cdot\phi'(w_l^{*\top}x)] \\
&\le v_{\max}^2\max_{\|a\|=1}\sum_{i=1}^k\sum_{l=1}^k\big(\mathbb E_{x\sim\mathcal D_d}[(a_i^\top x)^4]\cdot\mathbb E_{x\sim\mathcal D_d}[(\phi'(w_i^{*\top}x))^4]\cdot\mathbb E_{x\sim\mathcal D_d}[(a_l^\top x)^4]\cdot\mathbb E_{x\sim\mathcal D_d}[(\phi'(w_l^{*\top}x))^4]\big)^{1/4} \\
&\lesssim v_{\max}^2\max_{\|a\|=1}\sum_{i=1}^k\sum_{l=1}^k\|a_i\|\cdot\|a_l\|\cdot\|w_i^*\|^p\cdot\|w_l^*\|^p \\
&\le v_{\max}^2\max_{\|a\|=1}\sum_{i=1}^k\sum_{l=1}^k\|a_i\|\cdot\|a_l\|\cdot\sigma_1^{2p} \\
&\le kv_{\max}^2\sigma_1^{2p},
\end{aligned} \]
where the first inequality follows by substituting $v_i^*a_i$ and $|v_i^*| \le v_{\max}$, the second inequality follows by H\"older's inequality, the third inequality follows by Property 1, the fourth inequality follows by $\|w_i^*\| \le \sigma_1(W^*)$, and the last inequality follows by the Cauchy–Schwarz inequality ($\sum_{i=1}^k\|a_i\| \le \sqrt k$ when $\sum_i\|a_i\|^2 = 1$).
B.3.3 Error Bound of Hessians near the Ground Truth for Smooth Activations

The goal of this section is to prove Lemma B.3.5.

Lemma B.3.5 (Error bound of Hessians near the ground truth for smooth activations). Let $\phi(z)$ satisfy Property 1, Property 2 and Property 3(a). Let $W$ satisfy $\|W - W^*\| \le \sigma_k/2$. Let $S$ denote a set of i.i.d. samples from the distribution defined in (4.1). Then for any $t \ge 1$ and $0 < \varepsilon < 1/2$, if
\[ |S| \ge \varepsilon^{-2}d\kappa^2\tau\cdot\mathrm{poly}(\log d, t), \]
then we have, with probability at least $1 - 1/d^{\Omega(t)}$,
\[ \|\nabla^2 f_S(W) - \nabla^2 f_{\mathcal D}(W^*)\| \lesssim v_{\max}^2k\sigma_1^p(\varepsilon\sigma_1^p + \|W - W^*\|). \]
Proof. This follows by combining Lemma B.3.6 and Lemma B.3.7 directly.
Second-order Smoothness near the Ground Truth for Smooth Activations

The goal of this section is to prove Lemma B.3.6.

Fact B.3.3. Let $w_i$ denote the $i$-th column of $W \in \mathbb R^{d\times k}$, and $w_i^*$ denote the $i$-th column of $W^* \in \mathbb R^{d\times k}$. If $\|W - W^*\| \le \sigma_k(W^*)/2$, then for all $i \in [k]$,
\[ \frac12\|w_i^*\| \le \|w_i\| \le \frac32\|w_i^*\|. \]
Proof. Note that if $\|W - W^*\| \le \sigma_k(W^*)/2$, we have $\sigma_k(W^*)/2 \le \sigma_i(W) \le \frac32\sigma_1(W^*)$ for all $i \in [k]$ by Weyl's inequality. By the definition of singular values, we have $\sigma_k(W^*) \le \|w_i^*\| \le \sigma_1(W^*)$. By the definition of the spectral norm, we have $\|w_i - w_i^*\| \le \|W - W^*\|$. Thus, we can upper bound $\|w_i\|$,
\[ \|w_i\| \le \|w_i^*\| + \|w_i - w_i^*\| \le \|w_i^*\| + \|W - W^*\| \le \|w_i^*\| + \sigma_k/2 \le \frac32\|w_i^*\|. \]
Similarly, we have $\|w_i\| \ge \frac12\|w_i^*\|$.
Lemma B.3.6 (Second-order smoothness near the ground truth for smooth activations). If $\phi(z)$ satisfies Property 1, Property 2 and Property 3(a), then for any $W$ with $\|W - W^*\| \le \sigma_k/2$, we have
\[ \|\nabla^2 f_{\mathcal D}(W) - \nabla^2 f_{\mathcal D}(W^*)\| \lesssim k^2v_{\max}^2\sigma_1^p\|W - W^*\|. \]
Proof. Let $\Delta = \nabla^2 f_{\mathcal D}(W) - \nabla^2 f_{\mathcal D}(W^*)$. For each $(i, l) \in [k]\times[k]$, let $\Delta_{i,l} \in \mathbb R^{d\times d}$ denote the $(i, l)$-th block of $\Delta$. Then, for $i \ne l$, we have
\[ \Delta_{i,l} = \mathbb E_{x\sim\mathcal D_d}\big[v_i^*v_l^*\big(\phi'(w_i^\top x)\phi'(w_l^\top x) - \phi'(w_i^{*\top}x)\phi'(w_l^{*\top}x)\big)xx^\top\big], \]
and for $i = l$, we have
\[ \begin{aligned}
\Delta_{i,i} &= \mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{r=1}^kv_r^*\phi(w_r^\top x) - y\Big)v_i^*\phi''(w_i^\top x)xx^\top + v_i^{*2}\big(\phi'^2(w_i^\top x) - \phi'^2(w_i^{*\top}x)\big)xx^\top\Big] \\
&= \mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{r=1}^kv_r^*\phi(w_r^\top x) - y\Big)v_i^*\phi''(w_i^\top x)xx^\top\Big] + \mathbb E_{x\sim\mathcal D_d}\big[v_i^{*2}\big(\phi'^2(w_i^\top x) - \phi'^2(w_i^{*\top}x)\big)xx^\top\big]. \tag{B.8}
\end{aligned} \]
In the next few paragraphs, we first show how to bound the off-diagonal terms, and then show how to bound the diagonal terms.
First, we consider the off-diagonal terms.
\[ \begin{aligned}
\|\Delta_{i,l}\| &= \Big\|\mathbb E_{x\sim\mathcal D_d}\big[v_i^*v_l^*\big(\phi'(w_i^\top x)\phi'(w_l^\top x) - \phi'(w_i^{*\top}x)\phi'(w_l^{*\top}x)\big)xx^\top\big]\Big\| \\
&\le v_{\max}^2\max_{\|a\|=1}\mathbb E_{x\sim\mathcal D_d}\big[\big|\phi'(w_i^\top x)\phi'(w_l^\top x) - \phi'(w_i^{*\top}x)\phi'(w_l^{*\top}x)\big|\cdot(x^\top a)^2\big] \\
&\le v_{\max}^2\max_{\|a\|=1}\mathbb E_{x\sim\mathcal D_d}\big[\big(|\phi'(w_i^\top x) - \phi'(w_i^{*\top}x)||\phi'(w_l^\top x)| + |\phi'(w_i^{*\top}x)||\phi'(w_l^\top x) - \phi'(w_l^{*\top}x)|\big)(x^\top a)^2\big] \\
&= v_{\max}^2\max_{\|a\|=1}\Big(\mathbb E_{x\sim\mathcal D_d}\big[|\phi'(w_i^\top x) - \phi'(w_i^{*\top}x)||\phi'(w_l^\top x)|(x^\top a)^2\big] + \mathbb E_{x\sim\mathcal D_d}\big[|\phi'(w_i^{*\top}x)||\phi'(w_l^\top x) - \phi'(w_l^{*\top}x)|(x^\top a)^2\big]\Big) \\
&\le v_{\max}^2\max_{\|a\|=1}\Big(\mathbb E_{x\sim\mathcal D_d}\big[L_2|(w_i - w_i^*)^\top x||\phi'(w_l^\top x)|(x^\top a)^2\big] + \mathbb E_{x\sim\mathcal D_d}\big[|\phi'(w_i^{*\top}x)|L_2|(w_l - w_l^*)^\top x|(x^\top a)^2\big]\Big) \\
&\le v_{\max}^2\max_{\|a\|=1}\Big(\mathbb E_{x\sim\mathcal D_d}\big[L_2|(w_i - w_i^*)^\top x|L_1|w_l^\top x|^p(x^\top a)^2\big] + \mathbb E_{x\sim\mathcal D_d}\big[L_1|w_i^{*\top}x|^pL_2|(w_l - w_l^*)^\top x|(x^\top a)^2\big]\Big) \\
&\le v_{\max}^2L_1L_2\max_{\|a\|=1}\Big(\mathbb E_{x\sim\mathcal D_d}\big[|(w_i - w_i^*)^\top x||w_l^\top x|^p(x^\top a)^2\big] + \mathbb E_{x\sim\mathcal D_d}\big[|(w_l - w_l^*)^\top x||w_i^{*\top}x|^p(x^\top a)^2\big]\Big) \\
&\lesssim v_{\max}^2L_1L_2\max_{\|a\|=1}\big(\|w_i - w_i^*\|\|w_l\|^p\|a\|^2 + \|w_l - w_l^*\|\|w_i^*\|^p\|a\|^2\big) \\
&\lesssim v_{\max}^2L_1L_2\sigma_1^p(W^*)\|W - W^*\|, \tag{B.9}
\end{aligned} \]
where the first step follows by the definition of $\Delta_{i,l}$, the second step follows by the definition of the spectral norm and $v_i^*v_l^* \le v_{\max}^2$, the third step follows by the triangle inequality, the fourth step follows by linearity of expectation, the fifth step follows by Property 3(a), i.e., $|\phi'(w_i^\top x) - \phi'(w_i^{*\top}x)| \le L_2|(w_i - w_i^*)^\top x|$, the sixth step follows by Property 1, i.e., $|\phi'(z)| \le L_1|z|^p$, the seventh step follows by Fact 2.4.2, and the last step follows by $\|a\|^2 = 1$, $\|w_i - w_i^*\| \le \|W - W^*\|$, $\|w_l\| \le \frac32\|w_l^*\|$, and $\|w_i^*\| \le \sigma_1(W^*)$.
Note that the proof for the off-diagonal terms also applies to bounding the second term in the diagonal block Eq. (B.8). Thus we only need to show how to bound the first term in the diagonal block Eq. (B.8).
\[ \begin{aligned}
&\Big\|\mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{r=1}^kv_r^*\phi(w_r^\top x) - y\Big)v_i^*\phi''(w_i^\top x)xx^\top\Big]\Big\| \\
&= \Big\|\mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{r=1}^kv_r^*(\phi(w_r^\top x) - \phi(w_r^{*\top}x))\Big)v_i^*\phi''(w_i^\top x)xx^\top\Big]\Big\| \\
&\le v_{\max}^2\sum_{r=1}^k\max_{\|a\|=1}\mathbb E_{x\sim\mathcal D_d}\big[|\phi(w_r^\top x) - \phi(w_r^{*\top}x)||\phi''(w_i^\top x)|(x^\top a)^2\big] \\
&\le v_{\max}^2\sum_{r=1}^k\max_{\|a\|=1}\mathbb E_{x\sim\mathcal D_d}\big[|\phi(w_r^\top x) - \phi(w_r^{*\top}x)|L_2(x^\top a)^2\big] \\
&\le v_{\max}^2L_2\sum_{r=1}^k\max_{\|a\|=1}\mathbb E_{x\sim\mathcal D_d}\Big[\max_{z\in[w_r^\top x, w_r^{*\top}x]}|\phi'(z)|\cdot|(w_r - w_r^*)^\top x|\cdot(x^\top a)^2\Big] \\
&\le v_{\max}^2L_2\sum_{r=1}^k\max_{\|a\|=1}\mathbb E_{x\sim\mathcal D_d}\Big[\max_{z\in[w_r^\top x, w_r^{*\top}x]}L_1|z|^p\cdot|(w_r - w_r^*)^\top x|\cdot(x^\top a)^2\Big] \\
&\le v_{\max}^2L_1L_2\sum_{r=1}^k\max_{\|a\|=1}\mathbb E_{x\sim\mathcal D_d}\big[(|w_r^\top x|^p + |w_r^{*\top}x|^p)\cdot|(w_r - w_r^*)^\top x|\cdot(x^\top a)^2\big] \\
&\lesssim v_{\max}^2L_1L_2\sum_{r=1}^k\big[(\|w_r\|^p + \|w_r^*\|^p)\|w_r - w_r^*\|\big] \\
&\lesssim kv_{\max}^2L_1L_2\sigma_1^p(W^*)\|W - W^*\|, \tag{B.10}
\end{aligned} \]
where the first step follows by $y = \sum_{r=1}^kv_r^*\phi(w_r^{*\top}x)$, the second step follows by the definition of the spectral norm and $v_r^*v_i^* \le v_{\max}^2$, the third step follows by Property 3(a), i.e., $|\phi''(w_i^\top x)| \le L_2$, the fourth step follows by $|\phi(w_r^\top x) - \phi(w_r^{*\top}x)| \le \max_{z\in[w_r^\top x, w_r^{*\top}x]}|\phi'(z)||(w_r - w_r^*)^\top x|$, the fifth step follows by Property 1, i.e., $|\phi'(z)| \le L_1|z|^p$, the sixth step follows by $\max_{z\in[w_r^\top x, w_r^{*\top}x]}|z|^p \le |w_r^\top x|^p + |w_r^{*\top}x|^p$, and the seventh step follows by Fact 2.4.2.
Putting it all together, we can bound the error by
\[ \begin{aligned}
\|\nabla^2 f_{\mathcal D}(W) - \nabla^2 f_{\mathcal D}(W^*)\| &= \max_{\|a\|=1}a^\top(\nabla^2 f_{\mathcal D}(W) - \nabla^2 f_{\mathcal D}(W^*))a \\
&= \max_{\|a\|=1}\sum_{i=1}^k\sum_{l=1}^ka_i^\top\Delta_{i,l}a_l \\
&= \max_{\|a\|=1}\Big(\sum_{i=1}^ka_i^\top\Delta_{i,i}a_i + \sum_{i\ne l}a_i^\top\Delta_{i,l}a_l\Big) \\
&\le \max_{\|a\|=1}\Big(\sum_{i=1}^k\|\Delta_{i,i}\|\|a_i\|^2 + \sum_{i\ne l}\|\Delta_{i,l}\|\|a_i\|\|a_l\|\Big) \\
&\le \max_{\|a\|=1}\Big(\sum_{i=1}^kC_1\|a_i\|^2 + \sum_{i\ne l}C_2\|a_i\|\|a_l\|\Big) \\
&= \max_{\|a\|=1}\Big(C_1\sum_{i=1}^k\|a_i\|^2 + C_2\Big(\Big(\sum_{i=1}^k\|a_i\|\Big)^2 - \sum_{i=1}^k\|a_i\|^2\Big)\Big) \\
&\le \max_{\|a\|=1}\Big(C_1\sum_{i=1}^k\|a_i\|^2 + C_2\Big(k\sum_{i=1}^k\|a_i\|^2 - \sum_{i=1}^k\|a_i\|^2\Big)\Big) \\
&= C_1 + C_2(k-1) \\
&\lesssim k^2v_{\max}^2L_1L_2\sigma_1^p(W^*)\|W - W^*\|,
\end{aligned} \]
where the first step follows by the definition of the spectral norm with $a \in \mathbb R^{dk}$, the first inequality follows by $\|A\| = \max_{\|x\|\ne 0, \|y\|\ne 0}\frac{x^\top Ay}{\|x\|\|y\|}$, the second inequality follows by $\|\Delta_{i,i}\| \le C_1$ and $\|\Delta_{i,l}\| \le C_2$, the third inequality follows by the Cauchy–Schwarz inequality, the eighth step follows by $\sum_{i=1}^k\|a_i\|^2 = 1$, and the last step follows by Eqs. (B.9) and (B.10).
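Lemma B.3.6's conclusion — that the population Hessian moves at most linearly in $\|W - W^*\|$ — can be probed empirically. The sketch below is an illustration under stated assumptions: $v^* = \mathbf 1$, the squared ReLU, and Monte Carlo Hessian estimates at two perturbation sizes along one fixed random direction. It checks only the qualitative monotone growth of the difference, not the constants.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, n = 4, 2, 400000
phi = lambda z: np.maximum(z, 0.0) ** 2
dphi = lambda z: 2.0 * np.maximum(z, 0.0)
ddphi = lambda z: 2.0 * (z > 0)

W_star = np.linalg.qr(rng.standard_normal((d, k)))[0]
X = rng.standard_normal((n, d))
y = phi(X @ W_star).sum(axis=1)

def hessian(W):
    # Monte Carlo estimate of the Hessian of
    # f(W) = (1/2) E[(sum_i phi(w_i^T x) - y)^2], with v* = 1.
    Z = X @ W
    r = phi(Z).sum(axis=1) - y
    H = np.zeros((d * k, d * k))
    for i in range(k):
        for l in range(k):
            cross = dphi(Z[:, i]) * dphi(Z[:, l])
            if i == l:
                cross = cross + r * ddphi(Z[:, i])   # second-derivative term
            H[i*d:(i+1)*d, l*d:(l+1)*d] = (X * cross[:, None]).T @ X / n
    return H

H_star = hessian(W_star)
U = rng.standard_normal((d, k))
U /= np.linalg.norm(U, 'fro')
d1 = np.linalg.norm(hessian(W_star + 0.05 * U) - H_star, 2)
d2 = np.linalg.norm(hessian(W_star + 0.20 * U) - H_star, 2)
print(d1, d2)
```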
Empirical and Population Difference for Smooth Activations

The goal of this section is to prove Lemma B.3.7. For each $i \in [k]$, let $\sigma_i$ denote the $i$-th largest singular value of $W^* \in \mathbb R^{d\times k}$.
Note that the Bernstein inequality requires the spectral norm of each random matrix to be bounded almost surely. However, since we assume a Gaussian distribution for $x$, $\|x\|^2$ is not bounded almost surely. The main idea is to truncate and then use the matrix Bernstein inequality. Details can be found in Lemma B.1.1 and Corollary B.1.1.
Lemma B.3.7 (Empirical and population difference for smooth activations). Let $\phi(z)$ satisfy Properties 1, 2 and 3(a). Let $W$ satisfy $\|W - W^*\| \le \sigma_k/2$. Let $S$ denote a set of i.i.d. samples from the distribution $\mathcal D$ (defined in (4.1)). Then for any $t \ge 1$ and $0 < \varepsilon < 1/2$, if
\[ |S| \ge \varepsilon^{-2}d\kappa^2\tau\cdot\mathrm{poly}(\log d, t), \]
then we have, with probability at least $1 - d^{-\Omega(t)}$,
\[ \|\nabla^2 f_S(W) - \nabla^2 f_{\mathcal D}(W)\| \lesssim v_{\max}^2k\sigma_1^p(\varepsilon\sigma_1^p + \|W - W^*\|). \]
Proof. Define $\Delta = \nabla^2 f_{\mathcal D}(W) - \nabla^2 f_S(W)$. Let us first consider the diagonal blocks. Define
\[ \begin{aligned}
\Delta_{i,i} &= \mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{r=1}^kv_r^*\phi(w_r^\top x) - y\Big)v_i^*\phi''(w_i^\top x)xx^\top + v_i^{*2}\phi'^2(w_i^\top x)xx^\top\Big] \\
&\quad - \frac1n\sum_{j=1}^n\Big(\Big(\sum_{r=1}^kv_r^*\phi(w_r^\top x_j) - y_j\Big)v_i^*\phi''(w_i^\top x_j)x_jx_j^\top + v_i^{*2}\phi'^2(w_i^\top x_j)x_jx_j^\top\Big).
\end{aligned} \]
Let us further decompose $\Delta_{i,i}$ into $\Delta_{i,i} = \Delta_{i,i}^{(1)} + \Delta_{i,i}^{(2)}$, where
\[ \begin{aligned}
\Delta_{i,i}^{(1)} &= \mathbb E_{x\sim\mathcal D_d}\Big[\Big(\sum_{r=1}^kv_r^*\phi(w_r^\top x) - y\Big)v_i^*\phi''(w_i^\top x)xx^\top\Big] - \frac1n\sum_{j=1}^n\Big(\sum_{r=1}^kv_r^*\phi(w_r^\top x_j) - y_j\Big)v_i^*\phi''(w_i^\top x_j)x_jx_j^\top \\
&= v_i^*\sum_{r=1}^kv_r^*\Big(\mathbb E_{x\sim\mathcal D_d}\big[(\phi(w_r^\top x) - \phi(w_r^{*\top}x))\phi''(w_i^\top x)xx^\top\big] - \frac1n\sum_{j=1}^n(\phi(w_r^\top x_j) - \phi(w_r^{*\top}x_j))\phi''(w_i^\top x_j)x_jx_j^\top\Big),
\end{aligned} \]
and
\[ \Delta_{i,i}^{(2)} = \mathbb E_{x\sim\mathcal D_d}\big[v_i^{*2}\phi'^2(w_i^\top x)xx^\top\big] - \frac1n\sum_{j=1}^n\big[v_i^{*2}\phi'^2(w_i^\top x_j)x_jx_j^\top\big]. \tag{B.11} \]
The off-diagonal blocks are
\[ \Delta_{i,l} = v_i^*v_l^*\Big(\mathbb E_{x\sim\mathcal D_d}\big[\phi'(w_i^\top x)\phi'(w_l^\top x)xx^\top\big] - \frac1n\sum_{j=1}^n\phi'(w_i^\top x_j)\phi'(w_l^\top x_j)x_jx_j^\top\Big). \]
Combining Claims B.3.2, B.3.3 and B.3.4, and taking a union bound over the $k^2$ different $\Delta_{i,l}$, we obtain that if $n \ge \varepsilon^{-2}\kappa^2(W^*)\tau d\cdot\mathrm{poly}(t, \log d)$ for any $\varepsilon \in (0, 1/2)$, then with probability at least $1 - 1/d^t$,
\[ \|\nabla^2 f_S(W) - \nabla^2 f_{\mathcal D}(W)\| \lesssim v_{\max}^2k\sigma_1^p(W^*)(\varepsilon\sigma_1^p(W^*) + \|W - W^*\|). \]
Claim B.3.2. For each $i \in [k]$, if $n \ge d\,\mathrm{poly}(\log d, t)$, then
\[ \|\Delta_{i,i}^{(1)}\| \lesssim kv_{\max}^2\sigma_1^p(W^*)\|W - W^*\| \]
holds with probability at least $1 - 1/d^{4t}$.
Proof. For each $r \in [k]$, define the function $B_r : \mathbb{R}^d \to \mathbb{R}^{d\times d}$,
$$B_r(x) = L_1 L_2 \cdot \left(|w_r^\top x|^p + |w_r^{*\top} x|^p\right)\cdot |(w_r - w_r^*)^\top x| \cdot xx^\top.$$
According to Properties 1, 2 and 3(a), we have for each $x \in S$,
$$-B_r(x) \preceq (\phi(w_r^\top x) - \phi(w_r^{*\top} x))\phi''(w_i^\top x)\, xx^\top \preceq B_r(x).$$
Therefore,
$$\Delta_{i,i}^{(1)} \preceq v_{\max}^2 \sum_{r=1}^k \left(\mathbb{E}_{x\sim\mathcal{D}_d}[B_r(x)] + \frac{1}{|S|}\sum_{x\in S} B_r(x)\right).$$
Let $h_r(x) = L_1 L_2 |w_r^\top x|^p \cdot |(w_r - w_r^*)^\top x|$. Let $B_r = \mathbb{E}_{x\sim\mathcal{D}_d}[h_r(x)\, xx^\top]$. Define the function $B_r(x) = h_r(x)\, xx^\top$.

(I) Bounding $|h_r(x)|$.
According to Fact 2.4.2, we have for any constant $t \geq 1$, with probability $1 - 1/(nd^{4t})$,
$$|h_r(x)| = L_1 L_2 |w_r^\top x|^p |(w_r - w_r^*)^\top x| \lesssim L_1 L_2 \|w_r\|^p \|w_r - w_r^*\| (t\log n)^{(p+1)/2}.$$
(II) Bounding $\|B_r\|$.
$$\|B_r\| \geq \mathbb{E}_{x\sim\mathcal{D}_d}\left[L_1 L_2 |w_r^\top x|^p |(w_r - w_r^*)^\top x| \left(\frac{(w_r - w_r^*)^\top x}{\|w_r - w_r^*\|}\right)^2\right] \gtrsim L_1 L_2 \|w_r\|^p \|w_r - w_r^*\|,$$
where the first step follows by the definition of the spectral norm, and the last step follows by Fact 2.4.2. Using Fact 2.4.2, we can also prove the matching upper bound $\|B_r\| \lesssim L_1 L_2 \|w_r\|^p \|w_r - w_r^*\|$.
(III) Bounding $(\mathbb{E}_{x\sim\mathcal{D}_d}[h_r^4(x)])^{1/4}$.
Using Fact 2.4.2, we have
$$\left(\mathbb{E}_{x\sim\mathcal{D}_d}[h_r^4(x)]\right)^{1/4} = L_1 L_2 \left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[\left(|w_r^\top x|^p |(w_r - w_r^*)^\top x|\right)^4\right]\right)^{1/4} \lesssim L_1 L_2 \|w_r\|^p \|w_r - w_r^*\|.$$
By applying Corollary B.1.1, if $n \geq \varepsilon^{-2} d\,\mathrm{poly}(\log d, t)$, then with probability $1 - 1/d^{4t}$,
$$\left\|\mathbb{E}_{x\sim\mathcal{D}_d}\left[h_r(x)\, xx^\top\right] - \frac{1}{|S|}\sum_{x\in S} h_r(x)\, xx^\top\right\| = \left\|B_r - \frac{1}{|S|}\sum_{x\in S} B_r(x)\right\| \leq \varepsilon \|B_r\| \lesssim \varepsilon \|w_r\|^p \|w_r - w_r^*\|. \tag{B.12}$$
If $\varepsilon \leq 1/2$, we have
$$\|\Delta_{i,i}^{(1)}\| \lesssim \sum_{r=1}^k v_{\max}^2 \|B_r\| \lesssim k v_{\max}^2 \sigma_1^p(W^*)\|W - W^*\|.$$
Claim B.3.3. For each $i \in [k]$, if $n \geq \varepsilon^{-2} d\tau\,\mathrm{poly}(\log d, t)$, then
$$\|\Delta_{i,i}^{(2)}\| \lesssim \varepsilon v_{\max}^2 \sigma_1^{2p}$$
holds with probability $1 - 1/d^{4t}$.
Proof. Recall the definition of $\Delta_{i,i}^{(2)}$:
$$\Delta_{i,i}^{(2)} = \mathbb{E}_{x\sim\mathcal{D}_d}\left[v_i^{*2}\phi'^2(w_i^\top x)\, xx^\top\right] - \frac{1}{|S|}\sum_{x\in S} v_i^{*2}\phi'^2(w_i^\top x)\, xx^\top.$$
Let $h_i(x) = \phi'^2(w_i^\top x)$. Let $B_i = \mathbb{E}_{x\sim\mathcal{D}_d}[h_i(x)\, xx^\top]$. Define the function $B_i(x) = h_i(x)\, xx^\top$.

(I) Bounding $|h_i(x)|$.
For any constant $t \geq 1$, $(\phi'(w_i^\top x))^2 \leq L_1^2 |w_i^\top x|^{2p} \lesssim L_1^2 \|w_i\|^{2p} t^p \log^p(n)$ with probability $1 - 1/(nd^{4t})$, according to Fact 2.4.2.

(II) Bounding $\|B_i\|$.
$$\left\|\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi'^2(w_i^\top x)\, xx^\top\right]\right\| = \max_{\|a\|=1}\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi'^2(w_i^\top x)(x^\top a)^2\right] = \max_{\|a\|=1}\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi'^2(w_i^\top x)\left(\alpha\,\bar w_i^\top x + \beta\, x^\top v\right)^2\right]$$
$$= \max_{\alpha^2+\beta^2=1,\, \|v\|=1}\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi'^2(w_i^\top x)\left((\alpha\,\bar w_i^\top x)^2 + (\beta\, x^\top v)^2\right)\right] = \max_{\alpha^2+\beta^2=1}\left(\alpha^2\,\mathbb{E}_{z\sim\mathcal{D}_1}\left[\phi'^2(\|w_i\| z) z^2\right] + \beta^2\,\mathbb{E}_{z\sim\mathcal{D}_1}\left[\phi'^2(\|w_i\| z)\right]\right)$$
$$= \max\left(\mathbb{E}_{z\sim\mathcal{D}_1}\left[\phi'^2(\|w_i\| z) z^2\right],\ \mathbb{E}_{z\sim\mathcal{D}_1}\left[\phi'^2(\|w_i\| z)\right]\right),$$
where $\bar w_i = w_i/\|w_i\|$ and $v$ is a unit vector orthogonal to $w_i$ such that $a = \alpha \bar w_i + \beta v$. Now from Property 2, we have
$$\rho(\|w_i\|) \leq \left\|\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi'^2(w_i^\top x)\, xx^\top\right]\right\| \lesssim L_1^2 \|w_i\|^{2p}.$$
(III) Bounding $(\mathbb{E}_{x\sim\mathcal{D}_d}[h_i^4(x)])^{1/4}$.
$$\left(\mathbb{E}_{x\sim\mathcal{D}_d}[h_i^4(x)]\right)^{1/4} = \left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi'^8(w_i^\top x)\right]\right)^{1/4} \lesssim L_1^2 \|w_i\|^{2p}.$$
By applying Corollary B.1.1, we have, for any $0 < \varepsilon < 1$: if $n \geq \varepsilon^{-2} d\, \frac{\|w_i\|^{4p}}{\rho^2(\|w_i\|)}\,\mathrm{poly}(\log d, t)$, then the following bound holds with probability at least $1 - 1/d^{4t}$:
$$\left\|B_i - \frac{1}{|S|}\sum_{x\in S} B_i(x)\right\| \leq \varepsilon \|B_i\|.$$
Therefore, if $n \geq \varepsilon^{-2} d\tau\,\mathrm{poly}(\log d, t)$, where
$$\tau = \frac{(3\sigma_1/2)^{4p}}{\min_{\sigma\in[\sigma_k/2,\ 3\sigma_1/2]} \rho^2(\sigma)},$$
we have with probability $1 - 1/d^{4t}$,
$$\|\Delta_{i,i}^{(2)}\| \lesssim \varepsilon v_{\max}^2 \sigma_1^{2p}.$$
Claim B.3.4. For each $i, l \in [k]$ with $i \neq l$, if $n \geq \varepsilon^{-2}\kappa^2\tau d\,\mathrm{poly}(\log d, t)$, then
$$\|\Delta_{i,l}\| \leq \varepsilon v_{\max}^2 \sigma_1^{2p}(W^*)$$
holds with probability $1 - 1/d^{4t}$.
Proof. Recall the definition of the off-diagonal blocks $\Delta_{i,l}$:
$$\Delta_{i,l} = v_i^* v_l^* \left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi'(w_i^\top x)\phi'(w_l^\top x)\, xx^\top\right] - \frac{1}{|S|}\sum_{x\in S}\phi'(w_i^\top x)\phi'(w_l^\top x)\, xx^\top\right). \tag{B.13}$$
Let $h_{i,l}(x) = \phi'(w_i^\top x)\phi'(w_l^\top x)$. Define the function $B_{i,l}(x) = h_{i,l}(x)\, xx^\top$. Let $B_{i,l} = \mathbb{E}_{x\sim\mathcal{D}_d}[h_{i,l}(x)\, xx^\top]$.

(I) Bounding $|h_{i,l}(x)|$.
For any constant $t \geq 1$, we have with probability $1 - 1/(nd^{4t})$,
$$|h_{i,l}(x)| = |\phi'(w_i^\top x)\phi'(w_l^\top x)| \leq L_1^2 |w_i^\top x|^p |w_l^\top x|^p \leq L_1^2 \|w_i\|^p \|w_l\|^p (t\log n)^p \lesssim L_1^2 \sigma_1^{2p}(t\log n)^p,$$
where the third step follows by Fact 2.4.2.
(II) Bounding $\|B_{i,l}\|$.
Let $U \in \mathbb{R}^{d\times 2}$ be an orthogonal basis of $\mathrm{span}\{w_i, w_l\}$ and let $U_\perp \in \mathbb{R}^{d\times(d-2)}$ be the complementary matrix of $U$. Let the matrix $V \in \mathbb{R}^{2\times 2}$ denote $U^\top[w_i\ w_l]$, so that $UV = [w_i\ w_l] \in \mathbb{R}^{d\times 2}$. Given any vector $a \in \mathbb{R}^d$, there exist vectors $b \in \mathbb{R}^2$ and $c \in \mathbb{R}^{d-2}$ such that $a = Ub + U_\perp c$. We can simplify $\|B_{i,l}\|$ in the following way:
$$\|B_{i,l}\| = \left\|\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi'(w_i^\top x)\phi'(w_l^\top x)\, xx^\top\right]\right\| = \max_{\|a\|=1}\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi'(w_i^\top x)\phi'(w_l^\top x)(x^\top a)^2\right]$$
$$= \max_{\|b\|^2+\|c\|^2=1}\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi'(w_i^\top x)\phi'(w_l^\top x)\left(b^\top U^\top x + c^\top U_\perp^\top x\right)^2\right]$$
$$= \max_{\|b\|^2+\|c\|^2=1}\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi'(w_i^\top x)\phi'(w_l^\top x)\left((b^\top U^\top x)^2 + (c^\top U_\perp^\top x)^2\right)\right]$$
$$= \max_{\|b\|^2+\|c\|^2=1}\underbrace{\mathbb{E}_{z\sim\mathcal{D}_2}\left[\phi'(v_1^\top z)\phi'(v_2^\top z)(b^\top z)^2\right]}_{A_1} + \underbrace{\mathbb{E}_{z\sim\mathcal{D}_2,\ s\sim\mathcal{D}_{d-2}}\left[\phi'(v_1^\top z)\phi'(v_2^\top z)(c^\top s)^2\right]}_{A_2},$$
where $z = U^\top x \sim \mathcal{D}_2$ and $s = U_\perp^\top x \sim \mathcal{D}_{d-2}$ are independent, and $v_1, v_2$ denote the columns of $V$, so that $w_i^\top x = v_1^\top z$ and $w_l^\top x = v_2^\top z$.
We can lower bound the term $A_1$:
$$A_1 = \mathbb{E}_{z\sim\mathcal{D}_2}\left[\phi'(v_1^\top z)\phi'(v_2^\top z)(b^\top z)^2\right] = \int (2\pi)^{-1}\phi'(v_1^\top z)\phi'(v_2^\top z)(b^\top z)^2 e^{-\|z\|^2/2}\, dz$$
$$= \int (2\pi)^{-1}\phi'(s_1)\phi'(s_2)\left(b^\top V^{\dagger\top} s\right)^2 e^{-\|V^{\dagger\top} s\|^2/2}\cdot|\det(V^\dagger)|\, ds$$
$$\geq \int (2\pi)^{-1}\phi'(s_1)\phi'(s_2)\left(b^\top V^{\dagger\top} s\right)^2 e^{-\sigma_1^2(V^\dagger)\|s\|^2/2}\cdot|\det(V^\dagger)|\, ds$$
$$= \int (2\pi)^{-1}\phi'(u_1/\sigma_1(V^\dagger))\,\phi'(u_2/\sigma_1(V^\dagger))\cdot\left(b^\top V^{\dagger\top} u/\sigma_1(V^\dagger)\right)^2 e^{-\|u\|^2/2}\,|\det(V^\dagger)|/\sigma_1^2(V^\dagger)\, du$$
$$= \frac{\sigma_2(V)}{\sigma_1(V)}\,\mathbb{E}_{u\sim\mathcal{D}_2}\left[(p^\top u)^2\phi'(\sigma_2(V)\cdot u_1)\,\phi'(\sigma_2(V)\cdot u_2)\right]$$
$$= \frac{\sigma_2(V)}{\sigma_1(V)}\,\mathbb{E}_{u\sim\mathcal{D}_2}\left[\left((p_1 u_1)^2 + (p_2 u_2)^2 + 2p_1 p_2 u_1 u_2\right)\phi'(\sigma_2(V)\cdot u_1)\,\phi'(\sigma_2(V)\cdot u_2)\right]$$
$$= \frac{\sigma_2(V)}{\sigma_1(V)}\left(\|p\|^2\,\mathbb{E}_{z\sim\mathcal{D}_1}\left[\phi'(\sigma_2(V)\cdot z)z^2\right]\cdot\mathbb{E}_{z\sim\mathcal{D}_1}\left[\phi'(\sigma_2(V)\cdot z)\right] + \left((p^\top\mathbf{1})^2 - \|p\|^2\right)\mathbb{E}_{z\sim\mathcal{D}_1}\left[\phi'(\sigma_2(V)\cdot z)z\right]^2\right),$$
where $p = V^\dagger b\cdot\sigma_2(V) \in \mathbb{R}^2$.

Since we are maximizing over $b \in \mathbb{R}^2$, we can choose $b$ such that $\|p\| = \|b\|$. Then
$$A_1 = \mathbb{E}_{z\sim\mathcal{D}_2}\left[\phi'(v_1^\top z)\phi'(v_2^\top z)(b^\top z)^2\right] \geq \frac{\sigma_2(V)}{\sigma_1(V)}\|b\|^2\left(\mathbb{E}_{z\sim\mathcal{D}_1}\left[\phi'(\sigma_2(V)\cdot z)z^2\right]\cdot\mathbb{E}_{z\sim\mathcal{D}_1}\left[\phi'(\sigma_2(V)\cdot z)\right] - \mathbb{E}_{z\sim\mathcal{D}_1}\left[\phi'(\sigma_2(V)\cdot z)z\right]^2\right).$$
For the term $A_2$, similarly we have
$$A_2 = \mathbb{E}_{z\sim\mathcal{D}_2,\ s\sim\mathcal{D}_{d-2}}\left[\phi'(v_1^\top z)\phi'(v_2^\top z)(c^\top s)^2\right] = \mathbb{E}_{z\sim\mathcal{D}_2}\left[\phi'(v_1^\top z)\phi'(v_2^\top z)\right]\,\mathbb{E}_{s\sim\mathcal{D}_{d-2}}\left[(c^\top s)^2\right]$$
$$= \|c\|^2\,\mathbb{E}_{z\sim\mathcal{D}_2}\left[\phi'(v_1^\top z)\phi'(v_2^\top z)\right] \geq \|c\|^2\,\frac{\sigma_2(V)}{\sigma_1(V)}\left(\mathbb{E}_{z\sim\mathcal{D}_1}\left[\phi'(\sigma_2(V)\cdot z)\right]\right)^2.$$
For simplicity, we just set $\|b\| = 1$ and $\|c\| = 0$ to obtain the lower bound
$$\left\|\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi'(w_i^\top x)\phi'(w_l^\top x)\, xx^\top\right]\right\| \geq \frac{\sigma_2(V)}{\sigma_1(V)}\left(\mathbb{E}_{z\sim\mathcal{D}_1}\left[\phi'(\sigma_2(V)\cdot z)z^2\right]\cdot\mathbb{E}_{z\sim\mathcal{D}_1}\left[\phi'(\sigma_2(V)\cdot z)\right] - \left(\mathbb{E}_{z\sim\mathcal{D}_1}\left[\phi'(\sigma_2(V)\cdot z)z\right]\right)^2\right)$$
$$\geq \frac{\sigma_2(V)}{\sigma_1(V)}\,\rho(\sigma_2(V)) \geq \frac{1}{\kappa(W^*)}\,\rho(\sigma_2(V)),$$
where the second step follows from Property 2 and the last step from the fact that $\sigma_k/2 \leq \sigma_2(V) \leq \sigma_1(V) \leq 3\sigma_1/2$.
For the upper bound of $\|\mathbb{E}_{x\sim\mathcal{D}_d}[\phi'(w_i^\top x)\phi'(w_l^\top x)\, xx^\top]\|$, we have
$$\mathbb{E}_{z\sim\mathcal{D}_2}\left[\phi'(v_1^\top z)\phi'(v_2^\top z)(b^\top z)^2\right] \leq L_1^2\,\mathbb{E}_{z\sim\mathcal{D}_2}\left[|v_1^\top z|^p\cdot|v_2^\top z|^p\cdot|b^\top z|^2\right] \lesssim L_1^2\|v_1\|^p\|v_2\|^p\|b\|^2 \lesssim L_1^2\sigma_1^{2p},$$
where the first step follows by Property 1, the second step follows by Fact 2.4.2, and the last step follows by $\|v_1\| \leq \sigma_1$, $\|v_2\| \leq \sigma_1$ and $\|b\| \leq 1$. Similarly, we can upper bound
$$\mathbb{E}_{z\sim\mathcal{D}_2,\ s\sim\mathcal{D}_{d-2}}\left[\phi'(v_1^\top z)\phi'(v_2^\top z)(c^\top s)^2\right] = \|c\|^2\,\mathbb{E}_{z\sim\mathcal{D}_2}\left[\phi'(v_1^\top z)\phi'(v_2^\top z)\right] \lesssim L_1^2\sigma_1^{2p}.$$
Thus, we have
$$\left\|\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi'(w_i^\top x)\phi'(w_l^\top x)\, xx^\top\right]\right\| \lesssim L_1^2\sigma_1^{2p} \lesssim \sigma_1^{2p}.$$
(III) Bounding $(\mathbb{E}_{x\sim\mathcal{D}_d}[h_{i,l}^4(x)])^{1/4}$.
$$\left(\mathbb{E}_{x\sim\mathcal{D}_d}[h_{i,l}^4(x)]\right)^{1/4} = \left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi'^4(w_i^\top x)\cdot\phi'^4(w_l^\top x)\right]\right)^{1/4} \lesssim L_1^2\|w_i\|^p\|w_l\|^p \lesssim L_1^2\sigma_1^{2p}.$$
Therefore, applying Corollary B.1.1, we have: if $n \geq \varepsilon^{-2}\kappa^2(W^*)\tau d\,\mathrm{poly}(\log d, t)$, then
$$\|\Delta_{i,l}\| \leq \varepsilon v_{\max}^2\sigma_1^{2p}(W^*)$$
holds with probability at least $1 - 1/d^{4t}$.
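The concentration statements above assert that the empirical version of $B_{i,l} = \mathbb{E}[\phi'(w_i^\top x)\phi'(w_l^\top x)\,xx^\top]$ is spectrally close to its expectation once $n$ is large enough. A minimal Monte Carlo sketch of this, not from the text: it assumes $\phi$ is the ReLU (so $\phi'(z) = \mathbf{1}\{z>0\}$, satisfying Property 1 with $p = 0$) and $w_i = e_1$, $w_l = e_2$, for which $B_{i,l}$ has the closed form $\tfrac14 I + \tfrac{1}{2\pi}(e_1 e_2^\top + e_2 e_1^\top)$.

```python
import numpy as np

# Monte Carlo check that the empirical version of
#   B_{i,l} = E[phi'(w_i^T x) phi'(w_l^T x) x x^T]
# concentrates around its expectation in spectral norm.
# Assumptions (illustrative, not from the text): phi = ReLU, w_i = e_1, w_l = e_2.
rng = np.random.default_rng(0)
d, n = 5, 200_000

# closed form: diag entries 1/4, and the (1,2)/(2,1) entries equal 1/(2*pi)
B_exact = 0.25 * np.eye(d)
B_exact[0, 1] = B_exact[1, 0] = 1.0 / (2.0 * np.pi)

X = rng.normal(size=(n, d))
mask = (X[:, 0] > 0) & (X[:, 1] > 0)      # phi'(x_1) * phi'(x_2) for ReLU
B_hat = (X[mask].T @ X[mask]) / n         # empirical B_{i,l} (indicator kept inside the average)

err = np.linalg.norm(B_hat - B_exact, 2)  # spectral-norm deviation, shrinks like 1/sqrt(n)
```

Re-running with smaller $n$ makes `err` visibly larger, matching the $n \gtrsim \varepsilon^{-2} d\,\mathrm{poly}(\log d, t)$ scaling in the claims.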
B.3.4 Error Bound of Hessians near the Ground Truth for Non-smooth Activations

The goal of this section is to prove Lemma B.3.8.

Lemma B.3.8 (Error bound of Hessians near the ground truth for non-smooth activations). Let $\phi(z)$ satisfy Properties 1, 2 and 3(b). Let $W$ satisfy $\|W - W^*\| \leq \sigma_k/2$. Let $S$ denote a set of i.i.d. samples from the distribution defined in Eq. (4.1). Then for any $t \geq 1$ and $0 < \varepsilon < 1/2$, if
$$|S| \geq \varepsilon^{-2}\kappa^2\tau d\,\mathrm{poly}(\log d, t),$$
then with probability at least $1 - d^{-\Omega(t)}$,
$$\|\nabla^2 f_S(W) - \nabla^2 f_D(W^*)\| \lesssim v_{\max}^2 k\,\sigma_1^{2p}\left(\varepsilon + \left(\sigma_k^{-1}\cdot\|W - W^*\|\right)^{1/2}\right).$$
Proof. As we noted previously, when Property 3(b) holds, the diagonal blocks of the empirical Hessian can be written, with probability 1, as
$$\frac{\partial^2 f_S(W)}{\partial w_i^2} = \frac{1}{|S|}\sum_{(x,y)\in S}\left(v_i^*\phi'(w_i^\top x)\right)^2 xx^\top.$$
We construct a matrix $H \in \mathbb{R}^{dk\times dk}$ whose $(i,l)$-th block is
$$H_{i,l} = v_i^* v_l^*\,\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi'(w_i^\top x)\phi'(w_l^\top x)\, xx^\top\right] \in \mathbb{R}^{d\times d},\quad \forall i\in[k],\ l\in[k].$$
Note that $H \neq \nabla^2 f_D(W)$. However, we can still bound $\|H - \nabla^2 f_S(W)\|$ and $\|H - \nabla^2 f_D(W^*)\|$ when we have enough samples and $\|W - W^*\|$ is small enough. The proof for $\|H - \nabla^2 f_S(W)\|$ basically follows the proof of Lemma B.3.7, since $\Delta_{i,i}^{(2)}$ in Eq. (B.11) and $\Delta_{i,l}$ in Eq. (B.13) form the blocks of $H - \nabla^2 f_S(W)$, and we can bound them without smoothness of $\phi(\cdot)$.
Now we focus on $H - \nabla^2 f_D(W^*)$. We again consider each block:
$$\Delta_{i,l} = \mathbb{E}_{x\sim\mathcal{D}_d}\left[v_i^* v_l^*\left(\phi'(w_i^\top x)\phi'(w_l^\top x) - \phi'(w_i^{*\top} x)\phi'(w_l^{*\top} x)\right)xx^\top\right].$$
We used the boundedness of $\phi''(z)$ to prove Lemma B.3.6; here we cannot use this condition. Without smoothness, we stop at the following position:
$$\left\|\mathbb{E}_{x\sim\mathcal{D}_d}\left[v_i^* v_l^*\left(\phi'(w_i^\top x)\phi'(w_l^\top x) - \phi'(w_i^{*\top} x)\phi'(w_l^{*\top} x)\right)xx^\top\right]\right\|$$
$$\leq |v_i^* v_l^*|\max_{\|a\|=1}\mathbb{E}_{x\sim\mathcal{D}_d}\left[\left|\phi'(w_i^\top x)\phi'(w_l^\top x) - \phi'(w_i^{*\top} x)\phi'(w_l^{*\top} x)\right|(x^\top a)^2\right]$$
$$\leq |v_i^* v_l^*|\max_{\|a\|=1}\mathbb{E}_{x\sim\mathcal{D}_d}\left[\left(|\phi'(w_i^\top x) - \phi'(w_i^{*\top} x)||\phi'(w_l^\top x)| + |\phi'(w_i^{*\top} x)||\phi'(w_l^\top x) - \phi'(w_l^{*\top} x)|\right)(x^\top a)^2\right]$$
$$= |v_i^* v_l^*|\max_{\|a\|=1}\left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[|\phi'(w_i^\top x) - \phi'(w_i^{*\top} x)||\phi'(w_l^\top x)|(x^\top a)^2\right] + \mathbb{E}_{x\sim\mathcal{D}_d}\left[|\phi'(w_i^{*\top} x)||\phi'(w_l^\top x) - \phi'(w_l^{*\top} x)|(x^\top a)^2\right]\right), \tag{B.14}$$
where the first step follows by the definition of the spectral norm, the second step follows by the triangle inequality, and the last step follows by linearity of expectation.
Without loss of generality, we just bound the first term in the above formulation. Let $U$ be an orthogonal basis of $\mathrm{span}\{w_i, w_i^*, w_l\}$. If $w_i, w_i^*, w_l$ are linearly independent, $U$ is $d$-by-$3$; otherwise it is $d$-by-$\mathrm{rank}(\mathrm{span}\{w_i, w_i^*, w_l\})$. Without loss of generality, we assume $U$ is $d$-by-$3$. Let $[v_i\ v_i^*\ v_l] = U^\top[w_i\ w_i^*\ w_l] \in \mathbb{R}^{3\times 3}$ and $[u_i\ u_i^*\ u_l] = U_\perp^\top[w_i\ w_i^*\ w_l] \in \mathbb{R}^{(d-3)\times 3}$. Let $a = Ub + U_\perp c$, where $U_\perp \in \mathbb{R}^{d\times(d-3)}$ is the complementary matrix of $U$. Then
$$\mathbb{E}_{x\sim\mathcal{D}_d}\left[|\phi'(w_i^\top x) - \phi'(w_i^{*\top} x)||\phi'(w_l^\top x)|(x^\top a)^2\right]$$
$$= \mathbb{E}_{x\sim\mathcal{D}_d}\left[|\phi'(w_i^\top x) - \phi'(w_i^{*\top} x)||\phi'(w_l^\top x)|\left(x^\top(Ub + U_\perp c)\right)^2\right]$$
$$\lesssim \mathbb{E}_{x\sim\mathcal{D}_d}\left[|\phi'(w_i^\top x) - \phi'(w_i^{*\top} x)||\phi'(w_l^\top x)|\left((x^\top Ub)^2 + (x^\top U_\perp c)^2\right)\right]$$
$$= \mathbb{E}_{x\sim\mathcal{D}_d}\left[|\phi'(w_i^\top x) - \phi'(w_i^{*\top} x)||\phi'(w_l^\top x)|(x^\top Ub)^2\right] + \mathbb{E}_{x\sim\mathcal{D}_d}\left[|\phi'(w_i^\top x) - \phi'(w_i^{*\top} x)||\phi'(w_l^\top x)|(x^\top U_\perp c)^2\right]$$
$$= \mathbb{E}_{z\sim\mathcal{D}_3}\left[|\phi'(v_i^\top z) - \phi'(v_i^{*\top} z)||\phi'(v_l^\top z)|(z^\top b)^2\right] + \mathbb{E}_{y\sim\mathcal{D}_{d-3}}\left[|\phi'(u_i^\top y) - \phi'(u_i^{*\top} y)||\phi'(u_l^\top y)|(y^\top c)^2\right], \tag{B.15}$$
where the first step follows by $a = Ub + U_\perp c$, and the second step follows by $(a+b)^2 \leq 2a^2 + 2b^2$.
By Property 3(b), there are $e$ exceptional points at which $\phi''(z) \neq 0$. Let these $e$ points be $p_1, p_2, \cdots, p_e$. Note that if $v_i^\top z$ and $v_i^{*\top} z$ are not separated by any of these exceptional points, i.e., there exists no $j \in [e]$ such that $v_i^\top z \leq p_j \leq v_i^{*\top} z$ or $v_i^{*\top} z \leq p_j \leq v_i^\top z$, then we have $\phi'(v_i^\top z) = \phi'(v_i^{*\top} z)$, since $\phi''(s)$ is zero except at $\{p_j\}_{j=1,2,\cdots,e}$. So we consider the probability that $v_i^\top z$ and $v_i^{*\top} z$ are separated by some exceptional point. We use $\xi_j$ to denote the event that $v_i^\top z$ and $v_i^{*\top} z$ are separated by the exceptional point $p_j$. By a union bound, $1 - \sum_{j=1}^e \Pr[\xi_j]$ lower bounds the probability that $v_i^\top z$ and $v_i^{*\top} z$ are not separated by any exceptional point. The first term of Eq. (B.15) can be bounded as
$$\mathbb{E}_{z\sim\mathcal{D}_3}\left[|\phi'(v_i^\top z) - \phi'(v_i^{*\top} z)||\phi'(v_l^\top z)|(z^\top b)^2\right] = \mathbb{E}_{z\sim\mathcal{D}_3}\left[\mathbf{1}_{\cup_{j=1}^e\xi_j}|\phi'(v_i^\top z) - \phi'(v_i^{*\top} z)||\phi'(v_l^\top z)|(z^\top b)^2\right]$$
$$\leq \left(\mathbb{E}_{z\sim\mathcal{D}_3}\left[\mathbf{1}_{\cup_{j=1}^e\xi_j}\right]\right)^{1/2}\left(\mathbb{E}_{z\sim\mathcal{D}_3}\left[\left(|\phi'(v_i^\top z)| + |\phi'(v_i^{*\top} z)|\right)^2\phi'^2(v_l^\top z)(z^\top b)^4\right]\right)^{1/2}$$
$$\leq \left(\sum_{j=1}^e\Pr_{z\sim\mathcal{D}_3}[\xi_j]\right)^{1/2}\left(\mathbb{E}_{z\sim\mathcal{D}_3}\left[\left(|\phi'(v_i^\top z)| + |\phi'(v_i^{*\top} z)|\right)^2\phi'^2(v_l^\top z)(z^\top b)^4\right]\right)^{1/2}$$
$$\lesssim \left(\sum_{j=1}^e\Pr_{z\sim\mathcal{D}_3}[\xi_j]\right)^{1/2}\left(\|v_i\|^p + \|v_i^*\|^p\right)\|v_l\|^p\|b\|^2,$$
where the first step follows because, if $v_i^\top z$ and $v_i^{*\top} z$ are not separated by any exceptional point, then $\phi'(v_i^\top z) = \phi'(v_i^{*\top} z)$, and the last step follows by Hölder's inequality and Property 1.
It remains to upper bound $\Pr_{z\sim\mathcal{D}_3}[\xi_j]$. First note that if $v_i^\top z$ and $v_i^{*\top} z$ are separated by an exceptional point $p_j$, then $|v_i^{*\top} z - p_j| \leq |v_i^\top z - v_i^{*\top} z| \leq \|v_i - v_i^*\|\|z\|$. Therefore,
$$\Pr_{z\sim\mathcal{D}_3}[\xi_j] \leq \Pr_{z\sim\mathcal{D}_3}\left[\frac{|v_i^{*\top} z - p_j|}{\|z\|} \leq \|v_i - v_i^*\|\right].$$
Note that $\left(\frac{v_i^{*\top} z}{\|z\|\|v_i^*\|} + 1\right)/2$ follows the $\mathrm{Beta}(1,1)$ distribution, which is the uniform distribution on $[0,1]$. Hence
$$\Pr_{z\sim\mathcal{D}_3}\left[\frac{|v_i^{*\top} z - p_j|}{\|z\|\|v_i^*\|} \leq \frac{\|v_i - v_i^*\|}{\|v_i^*\|}\right] \leq \Pr_{z\sim\mathcal{D}_3}\left[\frac{|v_i^{*\top} z|}{\|z\|\|v_i^*\|} \leq \frac{\|v_i - v_i^*\|}{\|v_i^*\|}\right] \lesssim \frac{\|v_i - v_i^*\|}{\|v_i^*\|} \lesssim \frac{\|W - W^*\|}{\sigma_k(W^*)},$$
where the first step holds because we can view $\frac{v_i^{*\top} z}{\|z\|}$ and $\frac{p_j}{\|z\|}$ as two independent random variables: the former depends only on the direction of $z$ while the latter is related only to the magnitude of $z$. Thus, we have
$$\mathbb{E}_{z\sim\mathcal{D}_3}\left[|\phi'(v_i^\top z) - \phi'(v_i^{*\top} z)||\phi'(v_l^\top z)|(z^\top b)^2\right] \lesssim \left(e\|W - W^*\|/\sigma_k(W^*)\right)^{1/2}\sigma_1^{2p}(W^*)\|b\|^2. \tag{B.16}$$
Similarly, we have
$$\mathbb{E}_{y\sim\mathcal{D}_{d-3}}\left[|\phi'(u_i^\top y) - \phi'(u_i^{*\top} y)||\phi'(u_l^\top y)|(y^\top c)^2\right] \lesssim \left(e\|W - W^*\|/\sigma_k(W^*)\right)^{1/2}\sigma_1^{2p}(W^*)\|c\|^2. \tag{B.17}$$
Finally, combining Eq. (B.14), Eq. (B.16) and Eq. (B.17), we have
$$\|H - \nabla^2 f_D(W^*)\| \lesssim k v_{\max}^2\left(e\|W - W^*\|/\sigma_k(W^*)\right)^{1/2}\sigma_1^{2p}(W^*),$$
which completes the proof.
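The key probabilistic fact used above — that for $z\sim\mathcal{N}(0,I_3)$ and a unit vector $v$, the quantity $(v^\top z/\|z\| + 1)/2$ is $\mathrm{Beta}(1,1)$, i.e. uniform on $[0,1]$, so $\Pr[|v^\top z|/\|z\| \leq \delta] = \delta$ — can be checked numerically. The sketch below is illustrative; the specific $v$ and $\delta$ are arbitrary choices, not from the text.

```python
import numpy as np

# Numerical check: in dimension 3, the cosine between a Gaussian vector and a
# fixed direction is uniform on [-1, 1], so u = (cos + 1)/2 ~ Uniform[0, 1]
# and the "separation" probability Pr[|v^T z| / ||z|| <= delta] equals delta.
rng = np.random.default_rng(0)
n = 1_000_000
z = rng.normal(size=(n, 3))
v = np.array([1.0, -2.0, 0.5])
v /= np.linalg.norm(v)

norms = np.linalg.norm(z, axis=1)
u = (z @ v / norms + 1.0) / 2.0          # should be ~ Uniform[0, 1]

delta = 0.3
p_sep = np.mean(np.abs(z @ v) / norms <= delta)   # should be close to delta
```

This is exactly the mechanism that makes $\Pr[\xi_j]$ scale linearly with $\|v_i - v_i^*\|/\|v_i^*\|$ in the proof.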
B.3.5 Positive Definiteness for a Small Region

Here we introduce a lemma which shows that the Hessian at any $W$, which may depend on the samples but is very close to an anchor point, is close to the Hessian at this anchor point.

Lemma B.3.9. Let $S$ denote a set of samples from the distribution $\mathcal{D}$ defined in Eq. (4.1). Let $W^a \in \mathbb{R}^{d\times k}$ be a point (with respect to the function $f_S(W)$) which is independent of the samples $S$ and satisfies $\|W^a - W^*\| \leq \sigma_k/2$. Assume $\phi$ satisfies Properties 1, 2 and 3(a). Then for any $t \geq 1$, if
$$|S| \geq d\,\mathrm{poly}(\log d, t),$$
then with probability at least $1 - d^{-t}$, for any $W$ (not necessarily independent of the samples) satisfying $\|W^a - W\| \leq \sigma_k/4$, we have
$$\|\nabla^2 f_S(W) - \nabla^2 f_S(W^a)\| \lesssim k v_{\max}^2\sigma_1^p\left(\|W^a - W^*\| + \|W - W^a\|\, d^{(p+1)/2}\right).$$
Proof. Let $\Delta = \nabla^2 f_S(W) - \nabla^2 f_S(W^a) \in \mathbb{R}^{dk\times dk}$; then $\Delta$ can be viewed as $k^2$ blocks, each of size $d\times d$. The off-diagonal blocks are
$$\Delta_{i,l} = v_i^* v_l^*\,\frac{1}{|S|}\sum_{x\in S}\left(\phi'(w_i^\top x)\phi'(w_l^\top x) - \phi'(w_i^{a\top} x)\phi'(w_l^{a\top} x)\right)xx^\top.$$
For the diagonal blocks,
$$\Delta_{i,i} = \frac{1}{|S|}\sum_{x\in S}\left(\left(\sum_{q=1}^k v_q^*\phi(w_q^\top x) - y\right)v_i^*\phi''(w_i^\top x)\, xx^\top + v_i^{*2}\phi'^2(w_i^\top x)\, xx^\top\right)$$
$$- \frac{1}{|S|}\sum_{x\in S}\left(\left(\sum_{q=1}^k v_q^*\phi(w_q^{a\top} x) - y\right)v_i^*\phi''(w_i^{a\top} x)\, xx^\top + v_i^{*2}\phi'^2(w_i^{a\top} x)\, xx^\top\right).$$
We further decompose $\Delta_{i,i}$ into $\Delta_{i,i} = \Delta_{i,i}^{(1)} + \Delta_{i,i}^{(2)}$, where
$$\Delta_{i,i}^{(1)} = v_i^*\,\frac{1}{|S|}\sum_{x\in S}\left(\left(\sum_{q=1}^k v_q^*\phi(w_q^\top x) - y\right)\phi''(w_i^\top x) - \left(\sum_{q=1}^k v_q^*\phi(w_q^{a\top} x) - y\right)\phi''(w_i^{a\top} x)\right)xx^\top,$$
and
$$\Delta_{i,i}^{(2)} = v_i^{*2}\,\frac{1}{|S|}\sum_{(x,y)\in S}\left(\phi'^2(w_i^\top x) - \phi'^2(w_i^{a\top} x)\right)xx^\top. \tag{B.18}$$
We can further decompose $\Delta_{i,i}^{(1)}$ into $\Delta_{i,i}^{(1,1)}$ and $\Delta_{i,i}^{(1,2)}$:
$$\Delta_{i,i}^{(1)} = v_i^*\,\frac{1}{|S|}\sum_{x\in S}\left(\left(\sum_{q=1}^k v_q^*\phi(w_q^\top x) - y\right)\phi''(w_i^\top x) - \left(\sum_{q=1}^k v_q^*\phi(w_q^{a\top} x) - y\right)\phi''(w_i^{a\top} x)\right)xx^\top$$
$$= v_i^*\,\frac{1}{|S|}\sum_{x\in S}\left(\sum_{q=1}^k v_q^*\phi(w_q^\top x) - \sum_{q=1}^k v_q^*\phi(w_q^{a\top} x)\right)\phi''(w_i^\top x)\, xx^\top + v_i^*\,\frac{1}{|S|}\sum_{x\in S}\left(\sum_{q=1}^k v_q^*\phi(w_q^{a\top} x) - y\right)\left(\phi''(w_i^\top x) - \phi''(w_i^{a\top} x)\right)xx^\top$$
$$= v_i^*\,\frac{1}{|S|}\sum_{x\in S}\sum_{q=1}^k v_q^*\left(\phi(w_q^\top x) - \phi(w_q^{a\top} x)\right)\phi''(w_i^\top x)\, xx^\top + v_i^*\,\frac{1}{|S|}\sum_{x\in S}\sum_{q=1}^k v_q^*\left(\phi(w_q^{a\top} x) - \phi(w_q^{*\top} x)\right)\left(\phi''(w_i^\top x) - \phi''(w_i^{a\top} x)\right)xx^\top$$
$$= \Delta_{i,i}^{(1,1)} + \Delta_{i,i}^{(1,2)}.$$
Combining Claim B.3.5 and Claim B.3.6, we have: if
$$n \geq d\,\mathrm{poly}(\log d, t),$$
then with probability at least $1 - 1/d^{4t}$,
$$\|\Delta_{i,i}^{(1)}\| \lesssim k v_{\max}^2\sigma_1^p\left(\|W^a - W^*\| + \|W^a - W\|\, d^{(p+1)/2}\right). \tag{B.19}$$
Therefore, combining Eq. (B.19), Claim B.3.7 and Claim B.3.8, we complete the proof.
Claim B.3.5. For each $i \in [k]$, if $n \geq d\,\mathrm{poly}(\log d, t)$, then
$$\|\Delta_{i,i}^{(1,1)}\| \lesssim k v_{\max}^2\sigma_1^p\|W^a - W\|\, d^{(p+1)/2}.$$
Proof. Define the functions $h_1(x) = \|x\|^{p+1}$ and $h_2(x) = |w_q^{*\top} x|^p|(w_q^* - w_q^a)^\top x|$. Note that $h_1$ and $h_2$ do not involve $W$, which may depend on the samples. Therefore, we can use the modified matrix Bernstein inequality (Corollary B.1.1) to bound $\Delta_{i,i}^{(1,1)}$.

(I) Bounding $|h_1(x)|$.
By Fact 2.4.3, we have $h_1(x) \lesssim (td\log n)^{(p+1)/2}$ with probability at least $1 - 1/(nd^{4t})$.

(II) Bounding $\|\mathbb{E}_{x\sim\mathcal{D}_d}[\|x\|^{p+1}xx^\top]\|$.
Let $g(x) = (2\pi)^{-d/2}e^{-\|x\|^2/2}$. Note that $x\, g(x)\, dx = -\mathrm{d}(g(x))$, i.e., $\nabla g(x) = -x\, g(x)$. Then
$$\mathbb{E}_{x\sim\mathcal{D}_d}\left[\|x\|^{p+1}xx^\top\right] = \int \|x\|^{p+1}g(x)\, xx^\top dx = -\int \|x\|^{p+1}\,\mathrm{d}(g(x))\, x^\top$$
$$= -\int \|x\|^{p+1}\,\mathrm{d}\left(g(x)x^\top\right) + \int \|x\|^{p+1}g(x)\, I_d\, dx$$
$$= \int \mathrm{d}\left(\|x\|^{p+1}\right) g(x)\, x^\top + \int \|x\|^{p+1}g(x)\, I_d\, dx$$
$$= \int (p+1)\|x\|^{p-1}g(x)\, xx^\top dx + \int \|x\|^{p+1}g(x)\, I_d\, dx$$
$$\succeq \int \|x\|^{p+1}g(x)\, I_d\, dx = \mathbb{E}_{x\sim\mathcal{D}_d}\left[\|x\|^{p+1}\right] I_d.$$
Since $\|x\|^2$ follows the $\chi^2$ distribution with $d$ degrees of freedom, $\mathbb{E}_{x\sim\mathcal{D}_d}[\|x\|^q] = 2^{q/2}\frac{\Gamma((d+q)/2)}{\Gamma(d/2)}$ for any $q \geq 0$, so $d^{q/2} \lesssim \mathbb{E}_{x\sim\mathcal{D}_d}[\|x\|^q] \lesssim d^{q/2}$. Hence $\|\mathbb{E}_{x\sim\mathcal{D}_d}[h_1(x)\, xx^\top]\| \gtrsim d^{(p+1)/2}$. Also,
$$\left\|\mathbb{E}_{x\sim\mathcal{D}_d}\left[h_1(x)\, xx^\top\right]\right\| \leq \max_{\|a\|=1}\mathbb{E}_{x\sim\mathcal{D}_d}\left[h_1(x)(x^\top a)^2\right] \leq \max_{\|a\|=1}\left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[h_1^2(x)\right]\right)^{1/2}\left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[(x^\top a)^4\right]\right)^{1/2} \lesssim d^{(p+1)/2}.$$
(III) Bounding $(\mathbb{E}_{x\sim\mathcal{D}_d}[h_1^4(x)])^{1/4}$.
$$\left(\mathbb{E}_{x\sim\mathcal{D}_d}[h_1^4(x)]\right)^{1/4} \lesssim d^{(p+1)/2}.$$
Define the function $B(x) = h_1(x)\, xx^\top \in \mathbb{R}^{d\times d}$ and let $B = \mathbb{E}_{x\sim\mathcal{D}_d}[h_1(x)\, xx^\top]$. Therefore, applying Corollary B.1.1, we obtain: for any $0 < \varepsilon < 1$, if
$$n \geq \varepsilon^{-2}d\,\mathrm{poly}(\log d, t),$$
then with probability at least $1 - 1/d^t$,
$$\left\|\frac{1}{|S|}\sum_{x\in S}\|x\|^{p+1}xx^\top - \mathbb{E}_{x\sim\mathcal{D}_d}\left[\|x\|^{p+1}xx^\top\right]\right\| \lesssim \varepsilon\, d^{(p+1)/2}.$$
Therefore, we have with probability at least $1 - 1/d^t$,
$$\left\|\frac{1}{|S|}\sum_{x\in S}\|x\|^{p+1}xx^\top\right\| \lesssim d^{(p+1)/2}. \tag{B.20}$$
Claim B.3.6. For each $i \in [k]$, if $n \geq d\,\mathrm{poly}(\log d, t)$, then
$$\|\Delta_{i,i}^{(1,2)}\| \lesssim k\sigma_1^p\|W^a - W^*\|$$
holds with probability at least $1 - 1/d^{4t}$.
Proof. Recall the definition of $\Delta_{i,i}^{(1,2)}$:
$$\Delta_{i,i}^{(1,2)} = v_i^*\,\frac{1}{|S|}\sum_{x\in S}\sum_{q=1}^k v_q^*\left(\phi(w_q^{a\top} x) - \phi(w_q^{*\top} x)\right)\left(\phi''(w_i^\top x) - \phi''(w_i^{a\top} x)\right)xx^\top.$$
In order to upper bound $\|\Delta_{i,i}^{(1,2)}\|$, it suffices to upper bound the spectral norm of the quantity
$$\frac{1}{|S|}\sum_{x\in S}\left(\phi(w_q^{a\top} x) - \phi(w_q^{*\top} x)\right)\left(\phi''(w_i^\top x) - \phi''(w_i^{a\top} x)\right)xx^\top.$$
By Property 1, we have
$$|\phi(w_q^{a\top} x) - \phi(w_q^{*\top} x)| \lesssim L_1\left(|w_q^{a\top} x|^p + |w_q^{*\top} x|^p\right)|(w_q^* - w_q^a)^\top x|.$$
By Property 3, we have $|\phi''(w_i^\top x) - \phi''(w_i^{a\top} x)| \leq 2L_2$.
For the term $|w_q^{*\top} x|^p|(w_q^* - w_q^a)^\top x|\, xx^\top$, according to Eq. (B.12), we have with probability $1 - d^{-t}$, if $n \geq d\,\mathrm{poly}(\log d, t)$,
$$\left\|\mathbb{E}_{x\sim\mathcal{D}_d}\left[|w_q^{*\top} x|^p|(w_q^* - w_q^a)^\top x|\, xx^\top\right] - \frac{1}{|S|}\sum_{x\in S}|w_q^{*\top} x|^p|(w_q^* - w_q^a)^\top x|\, xx^\top\right\| \lesssim \varepsilon\|w_q^*\|^p\|w_q^* - w_q^a\|.$$
Also, note that
$$\left\|\mathbb{E}_{x\sim\mathcal{D}_d}\left[|w_q^{*\top} x|^p|(w_q^* - w_q^a)^\top x|\, xx^\top\right]\right\| \lesssim \|w_q^*\|^p\|w_q^* - w_q^a\|.$$
Thus, we obtain
$$\left\|\frac{1}{|S|}\sum_{x\in S}|w_q^{*\top} x|^p|(w_q^* - w_q^a)^\top x|\, xx^\top\right\| \lesssim \|W^a - W^*\|\sigma_1^p. \tag{B.21}$$
Claim B.3.7. For each $i \in [k]$, if $n \geq d\,\mathrm{poly}(\log d, t)$, then
$$\|\Delta_{i,i}^{(2)}\| \lesssim k v_{\max}^2\sigma_1^p\|W - W^a\|\, d^{(p+1)/2}$$
holds with probability $1 - 1/d^{4t}$.
Proof. We have
$$\|\Delta_{i,i}^{(2)}\| \leq v_i^{*2}\left\|\frac{1}{|S|}\sum_{x\in S}\left(\phi'(w_i^\top x) - \phi'(w_i^{a\top} x)\right)\left(\phi'(w_i^\top x) + \phi'(w_i^{a\top} x)\right)xx^\top\right\|$$
$$\leq v_i^{*2}\left\|\frac{1}{|S|}\sum_{x\in S}L_2|(w_i - w_i^a)^\top x|\cdot L_1\left(|w_i^\top x|^p + |w_i^{a\top} x|^p\right)xx^\top\right\|$$
$$\leq v_i^{*2}\|W - W^a\|\left\|\frac{1}{|S|}\sum_{x\in S}L_2\|x\|\cdot L_1\left(\|w_i\|^p\|x\|^p + |w_i^{a\top} x|^p\right)xx^\top\right\|.$$
Applying Corollary B.1.1 finishes the proof.
Claim B.3.8. For each $i, l \in [k]$ with $i \neq l$, if $n \geq d\,\mathrm{poly}(\log d, t)$, then
$$\|\Delta_{i,l}\| \lesssim v_{\max}^2\sigma_1^p\|W^a - W\|$$
holds with probability $1 - 1/d^{4t}$.
Proof. Recall the definition of $\Delta_{i,l}$:
$$\Delta_{i,l} = v_i^* v_l^*\,\frac{1}{|S|}\sum_{x\in S}\left(\phi'(w_i^\top x)\phi'(w_l^\top x) - \phi'(w_i^\top x)\phi'(w_l^{a\top} x) + \phi'(w_i^\top x)\phi'(w_l^{a\top} x) - \phi'(w_i^{a\top} x)\phi'(w_l^{a\top} x)\right)xx^\top$$
$$= v_i^* v_l^*\,\frac{1}{|S|}\sum_{x\in S}\phi'(w_i^\top x)\left(\phi'(w_l^\top x) - \phi'(w_l^{a\top} x)\right)xx^\top + v_i^* v_l^*\,\frac{1}{|S|}\sum_{x\in S}\left(\phi'(w_i^\top x) - \phi'(w_i^{a\top} x)\right)\phi'(w_l^{a\top} x)\, xx^\top,$$
hence
$$\|\Delta_{i,l}\| \leq |v_i^* v_l^*|\left\|\frac{1}{|S|}\sum_{x\in S}\left(L_1\|w_i\|^p L_2\|w_l - w_l^a\|\,\|x\|^{p+1} + L_2\|w_i - w_i^a\|\,\|x\|\, L_1|w_l^{a\top} x|^p\right)xx^\top\right\|.$$
Applying Corollary B.1.1 completes the proof.
B.4 Tensor Methods

B.4.1 Tensor Initialization Algorithm

We describe the details of each procedure in Algorithm 4.4.1 in this section.

a) Compute the subspace estimation from $\widehat{P}_2$ (Algorithm B.4.1). Note that the eigenvalues of $P_2$ and $\widehat{P}_2$ are not necessarily nonnegative. However, only $k$ of the eigenvalues will have large magnitude. So we can first compute the top-$k$ eigenvectors/eigenvalues of both $C\cdot I + \widehat{P}_2$ and $C\cdot I - \widehat{P}_2$, where $C$ is large enough such that $C \geq 2\|\widehat{P}_2\|$. Then from these $2k$ eigen-pairs, we pick the top-$k$ eigenvectors with the largest eigenvalues in magnitude; this is executed by TOPK in Algorithm B.4.1. For the outputs of TOPK, $k_1$ and $k_2$ are the numbers of picked largest eigenvalues from $C\cdot I + \widehat{P}_2$ and $C\cdot I - \widehat{P}_2$ respectively, and $\pi_1(i)$ returns the original index of the $i$-th largest eigenvalue from $C\cdot I + \widehat{P}_2$; $\pi_2$ is defined similarly for $C\cdot I - \widehat{P}_2$. Finally, orthogonalizing the picked eigenvectors yields an estimate of the subspace spanned by $\{w_1^*, w_2^*, \cdots, w_k^*\}$. Also note that forming $\widehat{P}_2$ takes $O(n\cdot d^2)$ time, and each step of the power method — a multiplication between a $d\times d$ matrix and a $d\times k$ matrix — takes $k\cdot d^2$ time with a naive implementation. Here we reduce this complexity from $O((k+n)d^2)$ to $O(knd)$. The idea is to compute each step of the power method without explicitly forming $\widehat{P}_2$. We take $P_2 = M_2$ as an example; the other cases are similar. In Algorithm B.4.1, the step $\widehat{P}_2 V$ is calculated as $\widehat{P}_2 V = \frac{1}{|S|}\sum_{(x,y)\in S} y\left(x(x^\top V) - V\right)$. Now each iteration only needs $O(knd)$ time. Furthermore, the number of iterations required is small, since the power method has a linear convergence rate and, as an initialization method, it does not need a very accurate solution. The detailed algorithm is shown in Algorithm B.4.1. The approximation error bound of $\widehat{P}_2$ to $P_2$ is provided in Lemma B.4.4, and Lemma B.4.5 provides the theoretical guarantee for Algorithm B.4.1.
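The implicit multiplication step above can be sketched in a few lines. The example below is illustrative, not the thesis implementation: it assumes the $P_2 = M_2$ case with synthetic data from a small ReLU network with $v_i^* = 1$ (for which the $k$ signal eigenvalues of $\widehat{P}_2$ are positive, so a single shifted branch $C\cdot I + \widehat{P}_2$ suffices), and it checks that the implicit product matches the explicitly formed matrix before running shifted power iterations.

```python
import numpy as np

# Implicit power method for P2_hat = (1/|S|) sum_j y_j (x_j x_j^T - I):
# P2_hat @ V is computed as (1/n) sum_j y_j (x_j (x_j^T V) - V), i.e. O(knd)
# per iteration instead of forming the d x d matrix.
rng = np.random.default_rng(0)
n, d, k = 100_000, 20, 2
W_star = np.linalg.qr(rng.normal(size=(d, k)))[0]   # ground-truth directions (assumed)
X = rng.normal(size=(n, d))
y = np.maximum(X @ W_star, 0.0).sum(axis=1)         # ReLU network, v_i^* = 1

def p2_mult(V):
    """P2_hat @ V without materializing P2_hat."""
    return (X.T @ (y[:, None] * (X @ V))) / n - y.mean() * V

# consistency check against the explicit matrix (affordable only for small d)
P2_hat = (X.T * y) @ X / n - y.mean() * np.eye(d)
V0 = rng.normal(size=(d, k))
assert np.allclose(p2_mult(V0), P2_hat @ V0)

# shifted power iterations with QR re-orthogonalization
C = 3.0 * np.linalg.norm(P2_hat, 2)
V = np.linalg.qr(rng.normal(size=(d, k)))[0]
for _ in range(50):
    V, _ = np.linalg.qr(C * V + p2_mult(V))

# V should approximately span the column space of W_star
subspace_err = np.linalg.norm(W_star - V @ (V.T @ W_star), 2)
```

In the general case one runs both shifted branches $C\cdot I \pm \widehat{P}_2$ and merges the top eigenvectors by magnitude, as in Algorithm B.4.1.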
b) Form and decompose the 3rd-moment $\widehat{R}_3$ (Algorithm 1 in [78]). We apply the non-orthogonal tensor factorization algorithm, Algorithm 1 in [78], to decompose $\widehat{R}_3$. According to Theorem 3 in [78], when $\widehat{R}_3$ is close enough to $R_3$, the output $u_i$ of the algorithm will be close enough to $s_i V^\top \bar w_i^*$, where $s_i$ is an unknown sign. Lemma B.4.8 provides the error bound for $\|\widehat{R}_3 - R_3\|$.
c) Recover the magnitude of $w_i^*$ and the signs $s_i, v_i^*$ (Algorithm B.4.2). For Algorithm B.4.2, we only consider homogeneous functions. Hence we can assume $v_i^* \in \{-1, 1\}$, and there exist universal constants $c_j$ such that $m_{j,i} = c_j\|w_i^*\|^{p+1}$ for $j = 1, 2, 3, 4$, where $p+1$ is the degree of homogeneity. Note that different activations lead to different situations even under Assumption 4.4.1. In particular, if $M_4 = M_2 = 0$, then $\phi(\cdot)$ is an odd function and we only need to know $s_i v_i^*$. If $M_3 = M_1 = 0$, then $\phi(\cdot)$ is an even function, so we do not care about the value of $s_i$.
Let us describe the details of Algorithm B.4.2. First define two quantities $Q_1$ and $Q_2$:
$$Q_1 = M_{l_1}(I, \underbrace{\alpha, \cdots, \alpha}_{(l_1-1)\ \alpha\text{'s}}) = \sum_{i=1}^k v_i^* c_{l_1}\|w_i^*\|^{p+1}(\alpha^\top \bar w_i^*)^{l_1-1}\,\bar w_i^*, \tag{B.22}$$
$$Q_2 = M_{l_2}(V, V, \underbrace{\alpha, \cdots, \alpha}_{(l_2-2)\ \alpha\text{'s}}) = \sum_{i=1}^k v_i^* c_{l_2}\|w_i^*\|^{p+1}(\alpha^\top \bar w_i^*)^{l_2-2}\,(V^\top \bar w_i^*)(V^\top \bar w_i^*)^\top, \tag{B.23}$$
where $l_1 \geq 1$ is such that $M_{l_1} \neq 0$ and $l_2 \geq 2$ is such that $M_{l_2} \neq 0$. There are possibly multiple choices for $l_1$ and $l_2$; we discuss below how to choose them. Now we solve two linear systems:
$$z^* = \arg\min_{z\in\mathbb{R}^k}\left\|\sum_{i=1}^k z_i s_i\bar w_i^* - Q_1\right\|, \quad\text{and}\quad r^* = \arg\min_{r\in\mathbb{R}^k}\left\|\sum_{i=1}^k r_i V^\top\bar w_i^*(V^\top\bar w_i^*)^\top - Q_2\right\|_F. \tag{B.24}$$
The solutions of the above linear systems are
$$z_i^* = v_i^* s_i^{l_1} c_{l_1}\|w_i^*\|^{p+1}(\alpha^\top s_i\bar w_i^*)^{l_1-1}, \quad\text{and}\quad r_i^* = v_i^* s_i^{l_2} c_{l_2}\|w_i^*\|^{p+1}(\alpha^\top s_i\bar w_i^*)^{l_2-2}.$$
We can approximate $s_i\bar w_i^*$ by $V u_i$, and approximate $Q_1$ and $Q_2$ by their empirical versions $\widehat{Q}_1$ and $\widehat{Q}_2$ respectively. Hence, in practice, we solve
$$z = \arg\min_{z\in\mathbb{R}^k}\left\|\sum_{i=1}^k z_i V u_i - \widehat{Q}_1\right\|, \quad\text{and}\quad r = \arg\min_{r\in\mathbb{R}^k}\left\|\sum_{i=1}^k r_i u_i u_i^\top - \widehat{Q}_2\right\|_F. \tag{B.25}$$
So we have the following approximations:
$$z_i \approx v_i^* s_i^{l_1} c_{l_1}\|w_i^*\|^{p+1}(\alpha^\top V u_i)^{l_1-1}, \quad\text{and}\quad r_i \approx v_i^* s_i^{l_2} c_{l_2}\|w_i^*\|^{p+1}(\alpha^\top V u_i)^{l_2-2},\quad \forall i\in[k].$$
In Lemma B.4.11 and Lemma B.4.12, we provide robustness guarantees for the above two linear systems, i.e., the solution errors $\|z - z^*\|$ and $\|r - r^*\|$ are bounded under small perturbations of $\bar w_i^*$, $Q_1$ and $Q_2$. Recall that the final goal is to approximate $\|w_i^*\|$ and the signs $v_i^*, s_i$. We can approximate $\|w_i^*\|$ by $\left(|z_i/(c_{l_1}(\alpha^\top V u_i)^{l_1-1})|\right)^{1/(p+1)}$. To recover $v_i^*$ and $s_i$, note that if $l_1$ and $l_2$ are both odd or both even, we cannot recover both $v_i^*$ and $s_i$. So we consider the following situations.

1. If $M_1 = M_3 = 0$, we choose $l_1 = l_2 = \min\{j \in \{2,4\} \mid M_j \neq 0\}$. Return $v_i^{(0)} = \mathrm{sign}(r_i c_{l_2})$ and $s_i^{(0)}$ being $-1$ or $1$.
2. If $M_2 = M_4 = 0$, we choose $l_1 = \min\{j \in \{1,3\} \mid M_j \neq 0\}$ and $l_2 = 3$. Return $v_i^{(0)}$ being $-1$ or $1$ and $s_i^{(0)} = \mathrm{sign}(v_i^{(0)} z_i c_{l_1})$.
3. Otherwise, we choose $l_1 = \min\{j \in \{1,3\} \mid M_j \neq 0\}$ and $l_2 = \min\{j \in \{2,4\} \mid M_j \neq 0\}$. Return $v_i^{(0)} = \mathrm{sign}(r_i c_{l_2})$ and $s_i^{(0)} = \mathrm{sign}(v_i^{(0)} z_i c_{l_1})$.

The first situation corresponds to part 3 of Assumption 4.4.1, where $s_i$ does not matter, and the second situation corresponds to part 4 of Assumption 4.4.1, where only $s_i v_i^*$ matters. So we recover $\|w_i^*\|$ to some precision, and $v_i^*, s_i$ exactly, provided enough samples. The recovery of $w_i^*$ and $v_i^*$ then follows.
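The two least-squares solves in Eq. (B.25) are ordinary overdetermined linear systems once the sums are written in matrix form. A minimal noiseless sketch, with toy ground truth chosen for illustration (the dimensions, `Wbar`, `z_true`, and `r_true` are arbitrary assumptions, not from the text):

```python
import numpy as np

# Noiseless version of the two linear systems in Eq. (B.25).
rng = np.random.default_rng(1)
d, k = 6, 2
Wbar = np.linalg.qr(rng.normal(size=(d, k)))[0]   # columns play the role of s_i * w_i^* / ||w_i^*||

# first system: min_z || sum_i z_i u_i - Q1 ||, with u_i = Wbar[:, i]
z_true = np.array([1.5, -0.7])
Q1 = Wbar @ z_true
z_hat, *_ = np.linalg.lstsq(Wbar, Q1, rcond=None)
assert np.allclose(z_hat, z_true)

# second system: min_r || sum_i r_i u_i u_i^T - Q2 ||_F,
# vectorized into a (d^2 x k) least-squares problem
r_true = np.array([0.9, 2.0])
Q2 = sum(r_true[i] * np.outer(Wbar[:, i], Wbar[:, i]) for i in range(k))
A = np.stack([np.outer(Wbar[:, i], Wbar[:, i]).ravel() for i in range(k)], axis=1)
r_hat, *_ = np.linalg.lstsq(A, Q2.ravel(), rcond=None)
assert np.allclose(r_hat, r_true)
```

With noisy $\widehat{Q}_1, \widehat{Q}_2$ and perturbed directions $Vu_i$, the recovered $z, r$ are only approximate, which is exactly what Lemmas B.4.11 and B.4.12 quantify.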
Sample complexity: We use the matrix Bernstein inequality to bound the error between $P_2$ and $\widehat{P}_2$, which requires $\Omega(d)$ samples (Lemma B.4.4). To bound the estimation error between $R_3$ and $\widehat{R}_3$, we flatten the tensor into a matrix and then use the matrix Bernstein inequality to bound the error, which requires $\Omega(k^3)$ samples (Lemma B.4.8). In Algorithm B.4.2, we also need to approximate a vector in $\mathbb{R}^d$ and a matrix in $\mathbb{R}^{k\times k}$, which again requires $\Omega(d)$ samples. Thus, taking $O(d) + O(k^3)$ samples is sufficient.
Algorithm B.4.1 Power Method via Implicit Matrix-Vector Multiplication
1: procedure POWERMETHOD($\widehat{P}_2$, $k$)
2:   $C \leftarrow 3\|\widehat{P}_2\|$, $T \leftarrow$ a large enough constant.
3:   Initial guesses $V_1^{(0)} \in \mathbb{R}^{d\times k}$, $V_2^{(0)} \in \mathbb{R}^{d\times k}$
4:   for $t = 1 \to T$ do
5:     $V_1^{(t)} \leftarrow \mathrm{QR}(C V_1^{(t-1)} + \widehat{P}_2 V_1^{(t-1)})$  ▷ $\widehat{P}_2 V_1^{(t-1)}$ is not calculated directly; see Sec. B.4.1(a)
6:     $V_2^{(t)} \leftarrow \mathrm{QR}(C V_2^{(t-1)} - \widehat{P}_2 V_2^{(t-1)})$
7:   for $j = 1, 2$ do
8:     $V_j^{(T)} \leftarrow [v_{j,1}\ v_{j,2}\ \cdots\ v_{j,k}]$
9:     for $i = 1 \to k$ do
10:      $\lambda_{j,i} \leftarrow |v_{j,i}^\top \widehat{P}_2 v_{j,i}|$  ▷ calculate the absolute values of the eigenvalues
11:  $\pi_1, \pi_2, k_1, k_2 \leftarrow \mathrm{TOPK}(\lambda, k)$  ▷ $\pi_j : [k_j] \to [k]$ and $k_1 + k_2 = k$; see Sec. B.4.1(a)
12:  for $j = 1, 2$ do
13:    $V_j \leftarrow [v_{j,\pi_j(1)}\ v_{j,\pi_j(2)}\ \cdots\ v_{j,\pi_j(k_j)}]$
14:  $V_2 \leftarrow \mathrm{QR}((I - V_1 V_1^\top)V_2)$
15:  $V \leftarrow [V_1, V_2]$
16:  return $V$
Time complexity: In part a), by using the specially designed power method, we only need $O(knd)$ time to compute the subspace estimate $V$. Part b) needs $O(knd)$ time to form $\widehat{R}_3$, and the tensor factorization needs $O(k^3)$ time. Part c) requires solving the $d\times k$ and $k^2\times k$ linear systems in Eq. (B.25), which takes at most $O(knd)$ running time. Hence, the total time complexity is $O(knd)$.
B.4.2 Main Result for Tensor Methods

Algorithm B.4.2 Recovery of the Ground Truth Parameters of the Neural Network, i.e., $w_i^*$ and $v_i^*$
1: procedure RECMAGSIGN($V$, $\{u_i\}_{i\in[k]}$, $S$)
2:   if $M_1 = M_3 = 0$ then
3:     $l_1 \leftarrow l_2 \leftarrow \min\{j \in \{2,4\} \mid M_j \neq 0\}$
4:   else if $M_2 = M_4 = 0$ then
5:     $l_1 \leftarrow \min\{j \in \{1,3\} \mid M_j \neq 0\}$, $l_2 \leftarrow 3$
6:   else
7:     $l_1 \leftarrow \min\{j \in \{1,3\} \mid M_j \neq 0\}$, $l_2 \leftarrow \min\{j \in \{2,4\} \mid M_j \neq 0\}$
8:   $S_1, S_2 \leftarrow \mathrm{PARTITION}(S, 2)$  ▷ $|S_1|, |S_2| = \Omega(d)$
9:   Choose $\alpha$ to be a random unit vector
10:  $\widehat{Q}_1 \leftarrow \mathbb{E}_{S_1}[Q_1]$  ▷ $\widehat{Q}_1$ is the empirical version of $Q_1$ (defined in Eq. (B.22))
11:  $\widehat{Q}_2 \leftarrow \mathbb{E}_{S_2}[Q_2]$  ▷ $\widehat{Q}_2$ is the empirical version of $Q_2$ (defined in Eq. (B.23))
12:  $z \leftarrow \arg\min_z \left\|\sum_{i=1}^k z_i V u_i - \widehat{Q}_1\right\|$
13:  $r \leftarrow \arg\min_r \left\|\sum_{i=1}^k r_i u_i u_i^\top - \widehat{Q}_2\right\|_F$
14:  $v_i^{(0)} \leftarrow \mathrm{sign}(r_i c_{l_2})$
15:  $s_i^{(0)} \leftarrow \mathrm{sign}(v_i^{(0)} z_i c_{l_1})$
16:  $w_i^{(0)} \leftarrow s_i^{(0)}\left(|z_i/(c_{l_1}(\alpha^\top V u_i)^{l_1-1})|\right)^{1/(p+1)} V u_i$
17:  return $v_i^{(0)}, w_i^{(0)}$

The goal of this section is to prove Theorem 5.4.1.

Theorem 5.4.1. Let the activation function be homogeneous and satisfy Assumption 4.4.1. For any $0 < \varepsilon < 1$ and $t \geq 1$, if $|S| \geq \varepsilon^{-2}\cdot d\cdot\mathrm{poly}(t, k, \kappa, \log d)$, then there exists an algorithm (Algorithm 4.4.1) that takes $|S|k\cdot O(d)$ time and outputs a matrix $W^{(0)} \in \mathbb{R}^{d\times k}$ and a vector $v^{(0)} \in \mathbb{R}^k$ such that, with probability at least $1 - d^{-\Omega(t)}$,
$$\|W^{(0)} - W^*\|_F \leq \varepsilon\cdot\mathrm{poly}(k,\kappa)\|W^*\|_F, \quad\text{and}\quad v_i^{(0)} = v_i^*.$$
Proof. The success of Algorithm 4.4.1 depends on two approximations. The first is the estimation of the normalized $\bar w_i^*$ up to some unknown sign flip, i.e., the error $\|\bar w_i^* - s_i V u_i\|$ for some $s_i \in \{-1, 1\}$. The second is the estimation of the magnitude of $w_i^*$ and of the signs $v_i^*, s_i$, which is carried out in Algorithm B.4.2.

For the first one,
$$\|\bar w_i^* - s_i V u_i\| \leq \|V V^\top\bar w_i^* - \bar w_i^*\| + \|V V^\top\bar w_i^* - V s_i u_i\| = \|V V^\top\bar w_i^* - \bar w_i^*\| + \|V^\top\bar w_i^* - s_i u_i\|, \tag{B.26}$$
where the first step follows by the triangle inequality and the second step follows by $V^\top V = I$.

We can upper bound $\|V V^\top\bar w_i^* - \bar w_i^*\|$:
$$\|V V^\top\bar w_i^* - \bar w_i^*\| \leq \|\widehat{P}_2 - P_2\|/\sigma_k(P_2) + \varepsilon \leq \mathrm{poly}(k,\kappa)\|\widehat{P}_2 - P_2\| + \varepsilon \leq \mathrm{poly}(k,\kappa)\,\varepsilon, \tag{B.27}$$
where the first step follows by Lemma B.4.5, the second step follows by $\sigma_k(P_2) \geq 1/\mathrm{poly}(k,\kappa)$, and the last step follows by $\|\widehat{P}_2 - P_2\| \leq \varepsilon\,\mathrm{poly}(k,\kappa)$ if the number of samples is proportional to $O(d/\varepsilon^2)$, as shown in Lemma B.4.4.

We can upper bound $\|V^\top\bar w_i^* - s_i u_i\|$:
$$\|V^\top\bar w_i^* - s_i u_i\| \leq \mathrm{poly}(k,\kappa)\|\widehat{R}_3 - R_3\| \leq \varepsilon\,\mathrm{poly}(k,\kappa), \tag{B.28}$$
where the first step follows by Theorem 3 in [78], and the last step follows by $\|\widehat{R}_3 - R_3\| \leq \varepsilon\,\mathrm{poly}(k,\kappa)$ if the number of samples is proportional to $O(k^2/\varepsilon^2)$, as shown in Lemma B.4.8.

Combining Eq. (B.26), (B.27) and (B.28) together,
$$\|\bar w_i^* - s_i V u_i\| \leq \varepsilon\,\mathrm{poly}(k,\kappa).$$
For the second one, we can bound the errors of the moment estimates $Q_1$ and $Q_2$ using a number of samples proportional to $O(d)$, by Lemma B.4.10 and Lemma B.4.4 respectively. The errors of the solutions of the linear systems in Eq. (B.25) can be bounded in terms of $\|Q_1 - \widehat{Q}_1\|$, $\|Q_2 - \widehat{Q}_2\|$, $\|u_i - V^\top\bar w_i^*\|$ and $\|(I - VV^\top)\bar w_i^*\|$, according to Lemma B.4.11 and Lemma B.4.12. Then we can bound the error of the output of Algorithm B.4.2. Furthermore, since the $v_i^*$'s are discrete values, they can be exactly recovered. All the sample complexities mentioned in the above lemmata are linear in the dimension and polynomial in the other factors to achieve a constant error. Accumulating all these errors completes the proof.

Remark B.4.1. The proofs of these lemmata for Theorem 4.4.1 can be found in the following sections. Note that these lemmata also hold for any activation satisfying Property 1 and Assumption 4.4.1. However, since it is unclear how to implement the last step of Algorithm 4.4.1 (Algorithm B.4.2) for general non-homogeneous activations, we restrict our theorem to homogeneous activations only.
B.4.3 Error Bound for the Subspace Spanned by the Weight Matrix

Error Bound for the Second-order Moment in Different Cases

Lemma B.4.1. Let $M_2$ be defined as in Definition 4.4.1. Let $\widehat{M}_2$ be the empirical version of $M_2$, i.e.,
$$\widehat{M}_2 = \frac{1}{|S|}\sum_{(x,y)\in S} y\cdot(x\otimes x - I_d),$$
where $S$ denotes a set of samples from the distribution $\mathcal{D}$ defined in Eq. (4.1). Assume $M_2 \neq 0$, i.e., $m_{2,i} \neq 0$ for every $i$. Then for any $0 < \varepsilon < 1$ and $t \geq 1$, if
$$|S| \geq \max_{i\in[k]}\left(\|w_i^*\|^{p+1}/|m_{2,i}| + 1\right)\cdot\varepsilon^{-2} d\,\mathrm{poly}(\log d, t),$$
then with probability at least $1 - d^{-t}$,
$$\|M_2 - \widehat{M}_2\| \leq \varepsilon\sum_{i=1}^k|v_i^* m_{2,i}|.$$
Proof. Recall that for each sample $(x,y)$, $y = \sum_{i=1}^k v_i^*\phi(w_i^{*\top}x)$. We consider each component $i \in [k]$. Define the function $B_i(x) : \mathbb{R}^d \to \mathbb{R}^{d\times d}$ such that
$$B_i(x) = \phi(w_i^{*\top}x)\cdot(x\otimes x - I_d).$$
Define $g(z) = \phi(z) - \phi(0)$; then $|g(z)| = |\int_0^z\phi'(s)\,ds| \leq \frac{L_1}{p+1}|z|^{p+1}$, which follows from Property 1.

(I) Bounding $\|B_i(x)\|$.
$$\|B_i(x)\| \lesssim \left(\frac{L_1}{p+1}|w_i^{*\top}x|^{p+1} + |\phi(0)|\right)(\|x\|^2 + 1) \lesssim \left(\frac{L_1}{p+1}\|w_i^*\|^{p+1} + |\phi(0)|\right)d\,\mathrm{poly}(\log d, t),$$
where the last step follows by Fact 2.4.2 and Fact 2.4.3.

(II) Bounding $\|\mathbb{E}_{x\sim\mathcal{D}_d}[B_i(x)]\|$.
Note that $\mathbb{E}_{x\sim\mathcal{D}_d}[B_i(x)] = m_{2,i}\,\bar w_i^*\bar w_i^{*\top}$. Therefore, $\|\mathbb{E}_{x\sim\mathcal{D}_d}[B_i(x)]\| = |m_{2,i}|$.

(III) Bounding $\max(\|\mathbb{E}_{x\sim\mathcal{D}_d}[B_i(x)B_i(x)^\top]\|,\ \|\mathbb{E}_{x\sim\mathcal{D}_d}[B_i(x)^\top B_i(x)]\|)$.
Note that $B_i(x)$ is a symmetric matrix, so it suffices to bound only one of them:
$$\left\|\mathbb{E}_{x\sim\mathcal{D}_d}[B_i^2(x)]\right\| \lesssim \left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi(w_i^{*\top}x)^4\right]\right)^{1/2}\left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[\|x\|^4\right]\right)^{1/2} \lesssim \left(\frac{L_1}{p+1}\|w_i^*\|^{p+1} + |\phi(0)|\right)^2 d.$$
(IV) Bounding $\max_{\|a\|=\|b\|=1}(\mathbb{E}_{x\sim\mathcal{D}_d}[(a^\top B_i(x)b)^2])^{1/2}$.
Note that $B_i(x)$ is a symmetric matrix, so it suffices to consider the case $a = b$:
$$\max_{\|a\|=1}\left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[(a^\top B_i(x)a)^2\right]\right)^{1/2} \lesssim \left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi^4(w_i^{*\top}x)\right]\right)^{1/4} \lesssim \frac{L_1}{p+1}\|w_i^*\|^{p+1} + |\phi(0)|.$$
Define $L = \|w_i^*\|^{p+1} + |\phi(0)|$. Then we have, for any $0 < \varepsilon < 1$: if
$$n \gtrsim \frac{L^2 d + |m_{2,i}|^2 + L|m_{2,i}|\, d\,\mathrm{poly}(\log d, t)\,\varepsilon}{\varepsilon^2|m_{2,i}|^2}\; t\log d,$$
then with probability at least $1 - 1/d^t$,
$$\left\|\mathbb{E}_{x\sim\mathcal{D}_d}[B_i(x)] - \frac{1}{|S|}\sum_{x\in S}B_i(x)\right\| \leq \varepsilon|m_{2,i}|.$$
Lemma B.4.2. Let $M_3$ be defined as in Definition 4.4.1. Let $\widehat{M}_3$ be the empirical version of $M_3$, i.e.,
$$\widehat{M}_3 = \frac{1}{|S|}\sum_{(x,y)\in S} y\cdot\left(x^{\otimes 3} - x\,\widetilde{\otimes}\, I\right),$$
where $S$ denotes a set of i.i.d. samples from the distribution $\mathcal{D}$ defined in Eq. (4.1). Assume $M_3 \neq 0$, i.e., $m_{3,i} \neq 0$ for every $i$. Let $\alpha$ be a fixed unit vector. Then for any $0 < \varepsilon < 1$, $t \geq 1$, if
$$|S| \geq \max_{i\in[k]}\left(\|w_i^*\|^{p+1}/|m_{3,i}(\bar w_i^{*\top}\alpha)|^2 + 1\right)\cdot\varepsilon^{-2}d\,\mathrm{poly}(\log d, t),$$
then with probability at least $1 - 1/d^t$,
$$\|M_3(I, I, \alpha) - \widehat{M}_3(I, I, \alpha)\| \leq \varepsilon\sum_{i=1}^k|v_i^* m_{3,i}(\bar w_i^{*\top}\alpha)|.$$
Proof. Since $y = \sum_{i=1}^k v_i^*\phi(w_i^{*\top}x)$, we consider each component $i \in [k]$. Define the function $B_i(x) : \mathbb{R}^d \to \mathbb{R}^{d\times d}$ such that
$$B_i(x) = \left[\phi(w_i^{*\top}x)\cdot\left(x^{\otimes 3} - x\,\widetilde{\otimes}\, I\right)\right](I, I, \alpha) = \phi(w_i^{*\top}x)\cdot\left((x^\top\alpha)\, xx^\top - (\alpha^\top x) I - \alpha x^\top - x\alpha^\top\right).$$
Define $g(z) = \phi(z) - \phi(0)$; then $|g(z)| = |\int_0^z\phi'(s)\,ds| \leq \frac{L_1}{p+1}|z|^{p+1} \lesssim |z|^{p+1}$, which follows from Property 1. In order to apply Lemma D.3.9, we need to bound the following four quantities.

(I) Bounding $\|B_i(x)\|$.
$$\|B_i(x)\| = \left\|\phi(w_i^{*\top}x)\cdot\left((x^\top\alpha)\, xx^\top - (\alpha^\top x)I_d - \alpha x^\top - x\alpha^\top\right)\right\| \leq |\phi(w_i^{*\top}x)|\cdot\left\|(x^\top\alpha)\, xx^\top - (\alpha^\top x)I - \alpha x^\top - x\alpha^\top\right\|$$
$$\lesssim \left(|w_i^{*\top}x|^{p+1} + |\phi(0)|\right)\left\|(x^\top\alpha)\, xx^\top - (\alpha^\top x)I - \alpha x^\top - x\alpha^\top\right\| \lesssim \left(|w_i^{*\top}x|^{p+1} + |\phi(0)|\right)\left(|x^\top\alpha|\|x\|^2 + 3|\alpha^\top x|\right),$$
where the third step follows by the definition of $g(z)$, and the last step follows by the definition of the spectral norm and the triangle inequality. Using Fact 2.4.2 and Fact 2.4.3, we have for any constant $t \geq 1$, with probability $1 - 1/(nd^{4t})$,
$$\|B_i(x)\| \lesssim (\|w_i^*\|^{p+1} + |\phi(0)|)\, d\,\mathrm{poly}(\log d, t).$$
(II) Bounding $\|\mathbb{E}_{x\sim\mathcal{D}_d}[B_i(x)]\|$.
Note that $\mathbb{E}_{x\sim\mathcal{D}_d}[B_i(x)] = m_{3,i}(\bar w_i^{*\top}\alpha)\,\bar w_i^*\bar w_i^{*\top}$. Therefore, $\|\mathbb{E}_{x\sim\mathcal{D}_d}[B_i(x)]\| = |m_{3,i}(\bar w_i^{*\top}\alpha)|$.

(III) Bounding $\max(\|\mathbb{E}_{x\sim\mathcal{D}_d}[B_i(x)B_i(x)^\top]\|,\ \|\mathbb{E}_{x\sim\mathcal{D}_d}[B_i(x)^\top B_i(x)]\|)$.
Because the matrix $B_i(x)$ is symmetric, it suffices to bound one of them:
$$\left\|\mathbb{E}_{x\sim\mathcal{D}_d}[B_i^2(x)]\right\| \lesssim \left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi(w_i^{*\top}x)^4\right]\right)^{1/2}\left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[(x^\top\alpha)^4\right]\right)^{1/2}\left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[\|x\|^4\right]\right)^{1/2} \lesssim (\|w_i^*\|^{p+1} + |\phi(0)|)^2 d.$$
(IV) Bounding $\max_{\|a\|=\|b\|=1}(\mathbb{E}_{x\sim\mathcal{D}_d}[(a^\top B_i(x)b)^2])^{1/2}$.
$$\max_{\|a\|=1}\left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[(a^\top B_i(x)a)^2\right]\right)^{1/2} \lesssim \left(\mathbb{E}_{x\sim\mathcal{D}_d}\left[\phi^4(w_i^{*\top}x)\right]\right)^{1/4} \lesssim \|w_i^*\|^{p+1} + |\phi(0)|.$$
Define $L = \|w_i^*\|^{p+1} + |\phi(0)|$. Then we have, for any $0 < \varepsilon < 1$: if
$$|S| \gtrsim \frac{L^2 d + |m_{3,i}(\bar w_i^{*\top}\alpha)|^2 + L|m_{3,i}(\bar w_i^{*\top}\alpha)|\, d\,\mathrm{poly}(\log d, t)\,\varepsilon}{\varepsilon^2|m_{3,i}(\bar w_i^{*\top}\alpha)|^2}\cdot t\log d,$$
then with probability at least $1 - d^{-t}$,
$$\left\|\mathbb{E}_{x\sim\mathcal{D}_d}[B_i(x)] - \frac{1}{|S|}\sum_{x\in S}B_i(x)\right\| \leq \varepsilon|m_{3,i}(\bar w_i^{*\top}\alpha)|.$$
Lemma B.4.3. Let M4 be defined as in Definition 4.4.1. Let M4 be the empirical
version of M4, i.e.,
M4 =1
|S|∑
(x,y)∈S
y · (x⊗4 − (x⊗ x)⊗I + I⊗I),
230
where S denote a set of samples (where each sample is i.i.d. sampled are sampled
from Distribution D defined in Eq. (4.1)). Assume M4 6= 0, i.e., m4,i 6= 0 for any i.
Let α be a fixed unit vector. Then for any 0 < ε < 1, t ≥ 1, if
|S| ≥ maxi∈[k]
(‖w∗i ‖p+1/|m4,i|(w∗>
i α)2 + 1)2 · ε−2 · dpoly(log d, t)
with probability at least 1− 1/dt,
‖M4(I, I, α, α)− M4(I, I, α, α)‖ ≤ εk∑
i=1
|v∗im4,i|(w∗>i α)2.
Proof. Since y =∑k
i=1 v∗i φ(w
∗>i x). We consider each component i ∈ [k].
Define function Bi(x) : Rd → Rd×d such that
Bi(x) = [φ(w∗>i x) · (x⊗4 − (x⊗ x)⊗I + I⊗I)](I, I, α, α)
= φ(w∗>i x) · ((x>α)2x⊗2 − (α>x)2I − 2(α>x)(xα> + αx>)− xx> + 2αα> + I).
Define g(z) = φ(z)−φ(0), then |g(z)| = |∫ z
0φ′(s)ds| ≤ L1/(p+1)|z|p+1 . |z|p+1,
which follows Property 1.
(I) Bounding ‖Bi(x)‖.
‖Bi(x)‖ .|φ(w∗>i x)| · ((x>α)2‖x‖2 + 1 + ‖x‖2 + (α>x)2)
.(|w∗>i x|p+1 + |φ(0)|) · ((x>α)2‖x‖2 + 1 + ‖x‖2 + (α>x)2)
Using Fact 2.4.2 and Fact 2.4.3, we have for any constant t ≥ 1, with probability
1− 1/(nd4t),
‖Bi(x)‖ . (‖w∗i ‖p+1 + |φ(0)|)dpoly(log d, t).
231
(II) Bounding $\|\mathbb{E}_{x\sim D_d}[B_i(x)]\|$.

Note that $\mathbb{E}_{x\sim D_d}[B_i(x)] = m_{4,i}(w_i^{*\top}\alpha)^2 w_i^* w_i^{*\top}$. Therefore, $\|\mathbb{E}_{x\sim D_d}[B_i(x)]\| = |m_{4,i}|(w_i^{*\top}\alpha)^2$.

(III) Bounding $\max(\|\mathbb{E}_{x\sim D_d}[B_i(x)B_i(x)^\top]\|,\ \|\mathbb{E}_{x\sim D_d}[B_i(x)^\top B_i(x)]\|)$.
$$\|\mathbb{E}_{x\sim D_d}[B_i(x)^2]\| \lesssim \big(\mathbb{E}_{x\sim D_d}[\phi(w_i^{*\top}x)^4]\big)^{1/2}\big(\mathbb{E}_{x\sim D_d}[(x^\top\alpha)^8]\big)^{1/2}\big(\mathbb{E}_{x\sim D_d}[\|x\|^4]\big)^{1/2} \lesssim (\|w_i^*\|^{p+1}+|\phi(0)|)^2 d.$$

(IV) Bounding $\max_{\|a\|=\|b\|=1}(\mathbb{E}_{x\sim D_d}[(a^\top B_i(x)b)^2])^{1/2}$.
$$\max_{\|a\|=1}\big(\mathbb{E}_{x\sim D_d}[(a^\top B_i(x)a)^2]\big)^{1/2} \lesssim \big(\mathbb{E}_{x\sim D_d}[\phi^4(w_i^{*\top}x)]\big)^{1/4} \lesssim \|w_i^*\|^{p+1}+|\phi(0)|.$$

Define $L = \|w_i^*\|^{p+1}+|\phi(0)|$. Then we have for any $0<\epsilon<1$, if
$$n \gtrsim \frac{L^2 d + |m_{4,i}|^2(w_i^{*\top}\alpha)^4 + L|m_{4,i}|(w_i^{*\top}\alpha)^2\, d\,\mathrm{poly}(\log d, t)\,\epsilon}{\epsilon^2|m_{4,i}|^2(w_i^{*\top}\alpha)^4}\cdot t\log d,$$
then with probability at least $1-d^{-t}$,
$$\Big\|\mathbb{E}_{x\sim D_d}[B_i(x)] - \frac{1}{|S|}\sum_{x\in S}B_i(x)\Big\| \le \epsilon|m_{4,i}|(w_i^{*\top}\alpha)^2.$$
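The concentration steps (I)–(IV) above feed a matrix Bernstein bound: averaging i.i.d. random matrices drives the spectral error to zero at roughly a $1/\sqrt{|S|}$ rate. The following is a minimal numerical sketch of that phenomenon; the matrix `B(x)` used here is a simplified stand-in chosen because its expectation has a closed form, not the exact $B_i(x)$ of the lemma.

```python
import numpy as np

def empirical_spectral_error(d, n_samples, rng):
    # Stand-in random matrix: B(x) = (x^T alpha)^2 * x x^T for x ~ N(0, I_d).
    alpha = np.zeros(d)
    alpha[0] = 1.0
    acc = np.zeros((d, d))
    for _ in range(n_samples):
        x = rng.standard_normal(d)
        acc += (x @ alpha) ** 2 * np.outer(x, x)
    emp = acc / n_samples
    # By Gaussian moments, E[(x^T alpha)^2 x x^T] = I + 2 * alpha alpha^T.
    exact = np.eye(d) + 2.0 * np.outer(alpha, alpha)
    # Spectral-norm deviation of the empirical average from its expectation.
    return np.linalg.norm(emp - exact, ord=2)

rng = np.random.default_rng(0)
err_small = empirical_spectral_error(10, 500, rng)     # few samples
err_large = empirical_spectral_error(10, 50000, rng)   # many samples
```

With 100x more samples the spectral error shrinks by roughly a factor of 10, matching the $1/\sqrt{|S|}$ rate the lemmas quantify.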
Error Bound for the Second-order Moment

The goal of this section is to prove Lemma B.4.4, which shows that we can approximate the second-order moments up to some precision using sample complexity linear in $d$.

Lemma B.4.4 (Estimation of the second order moment). Let $P_2$ and $j_2$ be defined as in Definition 4.4.2. Let $S$ denote a set of i.i.d. samples generated from the distribution $D$ defined in (4.1). Let $\widehat P_2$ be the empirical version of $P_2$ using dataset $S$, i.e., $\widehat P_2 = \mathbb{E}_S[P_2]$. Assume the activation function satisfies Property 1 and Assumption 4.4.1. Then for any $0<\epsilon<1$ and $t\ge 1$, with $m_0 = \min_{i\in[k]} |m_{j_2,i}|^2(w_i^{*\top}\alpha)^{2(j_2-2)}$, if
$$|S| \gtrsim \sigma_1^{2p+2}\cdot d\cdot\mathrm{poly}(t,\log d)/(\epsilon^2 m_0),$$
then with probability at least $1-d^{-\Omega(t)}$,
$$\|P_2 - \widehat P_2\| \le \epsilon\sum_{i=1}^k |v_i^* m_{j_2,i}(w_i^{*\top}\alpha)^{j_2-2}|.$$

Proof. We have shown the bound for $j_2=2,3,4$ in Lemma B.4.1, Lemma B.4.2 and Lemma B.4.3, respectively. To summarize, for any $0<\epsilon<1$, if
$$|S| \ge \max_{i\in[k]} \frac{(\|w_i^*\|^{p+1}+|\phi(0)|+|m_{j_2,i}(w_i^{*\top}\alpha)^{j_2-2}|)^2}{|m_{j_2,i}|^2(w_i^{*\top}\alpha)^{2(j_2-2)}}\cdot\epsilon^{-2}\, d\,\mathrm{poly}(\log d,t),$$
then with probability at least $1-d^{-t}$,
$$\|P_2-\widehat P_2\| \le \epsilon\sum_{i=1}^k |v_i^* m_{j_2,i}(w_i^{*\top}\alpha)^{j_2-2}|.$$
Subspace Estimation Using Power Method

Lemma B.4.5 shows that a small number of power iterations can estimate the subspace of $\{w_i^*\}_{i\in[k]}$ to some precision, which provides guarantees for Algorithm B.4.1.
Lemma B.4.5 (Bound on subspace estimation). Let $P_2$ be defined as in Definition 4.4.2 and $\widehat P_2$ be its empirical version. Let $U\in\mathbb{R}^{d\times k}$ be the orthogonal column span of $W^*\in\mathbb{R}^{d\times k}$. Assume $\|P_2-\widehat P_2\|\le\sigma_k(P_2)/10$. Let $C$ be a large enough positive number such that $C>2\|\widehat P_2\|$. Then after $T=O(\log(1/\epsilon))$ iterations, the output of Algorithm B.4.1, $V\in\mathbb{R}^{d\times k}$, will satisfy
$$\|UU^\top - VV^\top\| \lesssim \|P_2-\widehat P_2\|/\sigma_k(P_2)+\epsilon,$$
which implies
$$\|(I-VV^\top)w_i^*\| \lesssim \big(\|P_2-\widehat P_2\|/\sigma_k(P_2)+\epsilon\big)\|w_i^*\|.$$
Proof. According to Weyl's inequality, we are able to pick the correct numbers of positive and negative eigenvalues in Algorithm B.4.1 as long as $P_2$ and $\widehat P_2$ are close enough.

Let $U=[U_1\ U_2]\in\mathbb{R}^{d\times k}$ be the eigenspace of $\mathrm{span}\{w_1^*, w_2^*,\cdots,w_k^*\}$, where $U_1\in\mathbb{R}^{d\times k_1}$ corresponds to positive eigenvalues of $P_2$ and $U_2\in\mathbb{R}^{d\times k_2}$ to negative ones.

Let $\overline V_1$ be the top-$k_1$ eigenvectors of $CI+\widehat P_2$ and $\overline V_2$ the top-$k_2$ eigenvectors of $CI-\widehat P_2$. Let $\overline V=[\overline V_1\ \overline V_2]\in\mathbb{R}^{d\times k}$.

According to Lemma 9 in [67], we have $\|(I-U_1U_1^\top)\overline V_1\| \lesssim \|P_2-\widehat P_2\|/\sigma_k(P_2)$ and $\|(I-U_2U_2^\top)\overline V_2\| \lesssim \|P_2-\widehat P_2\|/\sigma_k(P_2)$. Using Fact 2.3.2, we have $\|(I-UU^\top)\overline V\| = \|UU^\top - \overline V\,\overline V^\top\|$.

Let $\epsilon$ be the precision we want to achieve using the power method. Let $V_1$ be the top-$k_1$ eigenvectors returned after $O(\log(1/\epsilon))$ iterations of the power method on $CI+\widehat P_2$, and similarly $V_2\in\mathbb{R}^{d\times k_2}$ for $CI-\widehat P_2$.

According to Theorem 7.2 in [6], we have $\|\overline V_1\overline V_1^\top - V_1V_1^\top\|\le\epsilon$ and $\|\overline V_2\overline V_2^\top - V_2V_2^\top\|\le\epsilon$.
Let $U_\perp$ be the complementary matrix of $U$. Then we have
$$\|(I-U_1U_1^\top)\overline V_1\| = \max_{\|a\|=1}\|(I-U_1U_1^\top)\overline V_1 a\| = \max_{\|a\|=1}\|(U_\perp U_\perp^\top + U_2U_2^\top)\overline V_1 a\|$$
$$= \max_{\|a\|=1}\sqrt{\|U_\perp U_\perp^\top \overline V_1 a\|^2 + \|U_2U_2^\top \overline V_1 a\|^2} \ge \max_{\|a\|=1}\|U_2U_2^\top \overline V_1 a\| = \|U_2U_2^\top \overline V_1\|, \qquad (B.29)$$
where the first step follows by the definition of the spectral norm, the second step follows by $I = U_1U_1^\top + U_2U_2^\top + U_\perp U_\perp^\top$, the third step follows by $U_2^\top U_\perp = 0$, and the last step follows by the definition of the spectral norm.

We can upper bound $\|(I-UU^\top)\overline V\|$:
$$\|(I-UU^\top)\overline V\| \le \|(I-U_1U_1^\top)\overline V_1\| + \|(I-U_2U_2^\top)\overline V_2\| + \|U_2U_2^\top\overline V_1\| + \|U_1U_1^\top\overline V_2\|$$
$$\le 2\big(\|(I-U_1U_1^\top)\overline V_1\| + \|(I-U_2U_2^\top)\overline V_2\|\big) \lesssim \|P_2-\widehat P_2\|/\sigma_k(P_2), \qquad (B.30)$$
where the first step follows by the triangle inequality, the second step follows by Eq. (B.29), and the last step follows by Lemma 9 in [67].
We define the matrix $R$ such that $\widetilde V_2 R = (I-V_1V_1^\top)V_2$ is the QR decomposition of $(I-V_1V_1^\top)V_2$. Then we have
$$\|(I-\overline V_2\overline V_2^\top)\widetilde V_2\| = \|(I-\overline V_2\overline V_2^\top)(I-V_1V_1^\top)V_2R^{-1}\|$$
$$= \|(I-\overline V_2\overline V_2^\top)(I-\overline V_1\overline V_1^\top + \overline V_1\overline V_1^\top - V_1V_1^\top)V_2R^{-1}\|$$
$$\le \underbrace{\|(I-\overline V_2\overline V_2^\top)(I-\overline V_1\overline V_1^\top)V_2R^{-1}\|}_{\alpha} + \underbrace{\|(I-\overline V_2\overline V_2^\top)\|\|R^{-1}\|\|\overline V_1\overline V_1^\top - V_1V_1^\top\|}_{\beta},$$
where the first step follows by $\widetilde V_2 = (I-V_1V_1^\top)V_2R^{-1}$, and the last step follows by the triangle inequality.

Furthermore, we have
$$\alpha+\beta = \|(I-\overline V_2\overline V_2^\top-\overline V_1\overline V_1^\top)V_2R^{-1}\| + \|(I-\overline V_2\overline V_2^\top)\|\|R^{-1}\|\|\overline V_1\overline V_1^\top-V_1V_1^\top\|$$
$$\le \|(I-\overline V_2\overline V_2^\top)V_2R^{-1}\| + \|\overline V_1\overline V_1^\top V_2R^{-1}\| + \|(I-\overline V_2\overline V_2^\top)\|\|R^{-1}\|\|\overline V_1\overline V_1^\top-V_1V_1^\top\|$$
$$\le \|\overline V_2\overline V_2^\top - V_2V_2^\top\|\|R^{-1}\| + \|\overline V_1\overline V_1^\top V_2\|\|R^{-1}\| + \|(I-\overline V_2\overline V_2^\top)\|\|R^{-1}\|\|\overline V_1\overline V_1^\top - V_1V_1^\top\|$$
$$= \|\overline V_2\overline V_2^\top - V_2V_2^\top\|\|R^{-1}\| + \|\overline V_1\overline V_1^\top V_2\|\|R^{-1}\| + \|R^{-1}\|\|\overline V_1\overline V_1^\top - V_1V_1^\top\|$$
$$\le \|\overline V_2\overline V_2^\top - V_2V_2^\top\|\|R^{-1}\| + \|(I-\overline V_2\overline V_2^\top)V_2\|\|R^{-1}\| + \|R^{-1}\|\|\overline V_1\overline V_1^\top - V_1V_1^\top\|$$
$$\le \big(2\|\overline V_2\overline V_2^\top - V_2V_2^\top\| + \|\overline V_1\overline V_1^\top - V_1V_1^\top\|\big)\|R^{-1}\| \le 3\epsilon\|R^{-1}\| \le 6\epsilon,$$
where the first step follows by the definition of $\alpha$ and $\beta$, the second step follows by the triangle inequality, the third step follows by $\|AB\|\le\|A\|\|B\|$, the fourth step follows by $\|(I-\overline V_2\overline V_2^\top)\|=1$, the fifth step follows by Eq. (B.29), the sixth step follows by Fact 2.3.2, the seventh step follows by $\|\overline V_1\overline V_1^\top-V_1V_1^\top\|\le\epsilon$ and $\|\overline V_2\overline V_2^\top-V_2V_2^\top\|\le\epsilon$, and the last step follows by $\|R^{-1}\|\le 2$ (Claim B.4.1).
Finally,
$$\|UU^\top - VV^\top\| \le \|UU^\top - \overline V\,\overline V^\top\| + \|\overline V\,\overline V^\top - VV^\top\|$$
$$= \|(I-UU^\top)\overline V\| + \|\overline V\,\overline V^\top - VV^\top\|$$
$$\lesssim \|P_2-\widehat P_2\|/\sigma_k(P_2) + \|\overline V\,\overline V^\top - VV^\top\|$$
$$\le \|P_2-\widehat P_2\|/\sigma_k(P_2) + \|\overline V_1\overline V_1^\top - V_1V_1^\top\| + \|\overline V_2\overline V_2^\top - V_2V_2^\top\|$$
$$\le \|P_2-\widehat P_2\|/\sigma_k(P_2) + 2\epsilon,$$
where the first step follows by the triangle inequality, the second step follows by Fact 2.3.2, the third step follows by Eq. (B.30), the fourth step follows by the triangle inequality, and the last step follows by $\|\overline V_1\overline V_1^\top - V_1V_1^\top\|\le\epsilon$ and $\|\overline V_2\overline V_2^\top - V_2V_2^\top\|\le\epsilon$. Therefore we finish the proof.
It remains to prove Claim B.4.1.
Claim B.4.1. $\sigma_k(R)\ge 1/2$.

Proof. First, we can rewrite $R^\top R$ in the following way:
$$R^\top R = V_2^\top(I-V_1V_1^\top)V_2 = I - V_2^\top V_1V_1^\top V_2.$$
Second, we can upper bound $\|V_2^\top V_1\|$ by $1/4$:
$$\|V_2^\top V_1\| = \|V_2V_2^\top V_1\| \le \|(V_2V_2^\top - \overline V_2\overline V_2^\top)V_1\| + \|\overline V_2\overline V_2^\top V_1\|$$
$$\le \|(V_2V_2^\top - \overline V_2\overline V_2^\top)V_1\| + \|\overline V_2^\top(V_1V_1^\top - \overline V_1\overline V_1^\top)\| + \|\overline V_2^\top\overline V_1\overline V_1^\top\|$$
$$= \|(V_2V_2^\top - \overline V_2\overline V_2^\top)V_1\| + \|\overline V_2^\top(V_1V_1^\top - \overline V_1\overline V_1^\top)\|$$
$$\le \|V_2V_2^\top - \overline V_2\overline V_2^\top\|\cdot\|V_1\| + \|\overline V_2^\top\|\cdot\|V_1V_1^\top - \overline V_1\overline V_1^\top\|$$
$$\le \epsilon+\epsilon \le 1/4,$$
where the first step follows by $V_2^\top V_2 = I$, the second and third steps follow by the triangle inequality, the fourth step follows by $\|\overline V_2^\top\overline V_1\overline V_1^\top\| = 0$, the fifth step follows by $\|AB\|\le\|A\|\cdot\|B\|$, the sixth step follows by $\|\overline V_1\overline V_1^\top - V_1V_1^\top\|\le\epsilon$, $\|V_1\|=1$, $\|\overline V_2\overline V_2^\top - V_2V_2^\top\|\le\epsilon$ and $\|\overline V_2^\top\|=1$, and the last step follows by $\epsilon<1/8$.

Thus, we can lower bound $\sigma_k^2(R)$:
$$\sigma_k^2(R) = \lambda_{\min}(R^\top R) = \min_{\|a\|=1} a^\top R^\top Ra = \min_{\|a\|=1}\big(a^\top Ia - \|V_1^\top V_2a\|^2\big)$$
$$= 1 - \max_{\|a\|=1}\|V_1^\top V_2a\|^2 = 1 - \|V_1^\top V_2\|^2 \ge 3/4,$$
which implies $\sigma_k(R)\ge 1/2$.
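Algorithm B.4.1 itself is not reproduced in this appendix; the following is a minimal sketch, under illustrative names and a synthetic low-rank symmetric $P_2$, of the two shifted power iterations the proof analyzes: the top-$k_1$ eigenspace of $CI+\widehat P_2$ and the top-$k_2$ eigenspace of $CI-\widehat P_2$, whose union recovers the column span of $W^*$.

```python
import numpy as np

def power_iteration_topk(M, k, iters, rng):
    # Block power method (orthogonal iteration) for the top-k eigenspace
    # of a positive semidefinite matrix M.
    Q = np.linalg.qr(rng.standard_normal((M.shape[0], k)))[0]
    for _ in range(iters):
        Q = np.linalg.qr(M @ Q)[0]
    return Q

rng = np.random.default_rng(1)
d, k1, k2 = 20, 2, 1
W = np.linalg.qr(rng.standard_normal((d, k1 + k2)))[0]
# Symmetric P2 with k1 positive and k2 negative eigenvalues.
eigs = np.array([3.0, 2.0, -2.5])
P2 = W @ np.diag(eigs) @ W.T

C = 2 * np.abs(eigs).max() + 1.0      # C > 2 * ||P2||, as in Lemma B.4.5
V1 = power_iteration_topk(C * np.eye(d) + P2, k1, 100, rng)  # positive part
V2 = power_iteration_topk(C * np.eye(d) - P2, k2, 100, rng)  # negative part
V = np.hstack([V1, V2])

# Subspace distance between the recovered span and the true span of W*.
err = np.linalg.norm(W @ W.T - V @ V.T, ord=2)
```

The shift by $CI$ makes both matrices PSD so the plain power method applies, and the signed eigenvalues of $P_2$ end up on top of the respective spectra.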
B.4.4 Error Bound for the Reduced Third-order Moment

Error Bound for the Reduced Third-order Moment in Different Cases

Lemma B.4.6. Let $M_3$ be defined as in Definition 4.4.1. Let $\widehat M_3$ be the empirical version of $M_3$, i.e.,
$$\widehat M_3 = \frac{1}{|S|}\sum_{(x,y)\in S} y\cdot(x^{\otimes3} - x\otimes I),$$
where $S$ denotes a set of samples, each i.i.d. sampled from the distribution $D$ defined in Eq. (4.1). Assume $M_3\ne0$, i.e., $m_{3,i}\ne0$ for any $i$. Let $V\in\mathbb{R}^{d\times k}$ be an orthogonal matrix satisfying $\|UU^\top - VV^\top\|\le 1/4$, where $U$ is the orthogonal basis of $\mathrm{span}\{w_1^*, w_2^*,\cdots,w_k^*\}$. Then for any $0<\epsilon<1$, $t\ge1$, if
$$|S| \ge \max_{i\in[k]}\big(\|w_i^*\|^{p+1}/|m_{3,i}|^2+1\big)\cdot\epsilon^{-2}\cdot k^2\,\mathrm{poly}(\log d, t),$$
then with probability at least $1-1/d^t$,
$$\|M_3(V,V,V) - \widehat M_3(V,V,V)\| \le \epsilon\sum_{i=1}^k|v_i^*m_{3,i}|.$$

Proof. Since $y=\sum_{i=1}^k v_i^*\phi(w_i^{*\top}x)$, we consider each component $i\in[k]$. We define the function $T_i(x):\mathbb{R}^d\to\mathbb{R}^{k\times k\times k}$ such that
$$T_i(x) = \phi(w_i^{*\top}x)\cdot\big((V^\top x)^{\otimes3} - (V^\top x)\otimes I\big).$$
We flatten the tensor $T_i(x)$ along the first dimension into a matrix $B_i(x)\in\mathbb{R}^{k\times k^2}$. Define $g(z)=\phi(z)-\phi(0)$; then $|g(z)|=|\int_0^z\phi'(s)\,ds|\le L_1/(p+1)\cdot|z|^{p+1}$, which follows from Property 1.
(I) Bounding $\|B_i(x)\|$.
$$\|B_i(x)\| \le |\phi(w_i^{*\top}x)|\cdot(\|V^\top x\|^3 + 3k\|V^\top x\|) \lesssim (|w_i^{*\top}x|^{p+1}+|\phi(0)|)\cdot(\|V^\top x\|^3+3k\|V^\top x\|).$$
Note that $V^\top x\sim N(0,I_k)$. According to Fact 2.4.2 and Fact 2.4.3, we have for any constant $t\ge1$, with probability $1-1/(nd^t)$,
$$\|B_i(x)\| \lesssim (\|w_i^*\|^{p+1}+|\phi(0)|)k^{3/2}\,\mathrm{poly}(\log d, t).$$

(II) Bounding $\|\mathbb{E}_{x\sim D_d}[B_i(x)]\|$.

Note that $\mathbb{E}_{x\sim D_d}[B_i(x)] = m_{3,i}(V^\top w_i^*)\,\mathrm{vec}\big((V^\top w_i^*)(V^\top w_i^*)^\top\big)^\top$. Therefore, $\|\mathbb{E}_{x\sim D_d}[B_i(x)]\| = |m_{3,i}|\|V^\top w_i^*\|^3$. Since $\|VV^\top - UU^\top\|\le1/4$, we have $\|VV^\top w_i^* - w_i^*\|\le1/4$ and $3/4\le\|V^\top w_i^*\|\le5/4$. So $\frac14|m_{3,i}|\le\|\mathbb{E}_{x\sim D_d}[B_i(x)]\|\le2|m_{3,i}|$.

(III) Bounding $\max(\|\mathbb{E}_{x\sim D_d}[B_i(x)B_i(x)^\top]\|,\ \|\mathbb{E}_{x\sim D_d}[B_i(x)^\top B_i(x)]\|)$.
$$\|\mathbb{E}_{x\sim D_d}[B_i(x)B_i(x)^\top]\| \lesssim \big(\mathbb{E}_{x\sim D_d}[\phi(w_i^{*\top}x)^4]\big)^{1/2}\big(\mathbb{E}_{x\sim D_d}[\|V^\top x\|^6]\big)^{1/2} \lesssim (\|w_i^*\|^{p+1}+|\phi(0)|)^2k^{3/2}.$$
$$\|\mathbb{E}_{x\sim D_d}[B_i(x)^\top B_i(x)]\| \lesssim \big(\mathbb{E}_{x\sim D_d}[\phi(w_i^{*\top}x)^4]\big)^{1/2}\big(\mathbb{E}_{x\sim D_d}[\|V^\top x\|^4]\big)^{1/2}\max_{\|A\|_F=1}\big(\mathbb{E}_{x\sim D_d}[\langle A,(V^\top x)(V^\top x)^\top\rangle^4]\big)^{1/2} \lesssim (\|w_i^*\|^{p+1}+|\phi(0)|)^2k^2.$$
(IV) Bounding $\max_{\|a\|=\|b\|=1}(\mathbb{E}_{x\sim D_d}[(a^\top B_i(x)b)^2])^{1/2}$.
$$\max_{\|a\|=\|b\|=1}\big(\mathbb{E}_{x\sim D_d}[(a^\top B_i(x)b)^2]\big)^{1/2} \lesssim \big(\mathbb{E}_{x\sim D_d}[\phi(w_i^{*\top}x)^4]\big)^{1/4}\max_{\|a\|=1}\big(\mathbb{E}_{x\sim D_d}[(a^\top V^\top x)^4]\big)^{1/2}\max_{\|A\|_F=1}\big(\mathbb{E}_{x\sim D_d}[\langle A,(V^\top x)(V^\top x)^\top\rangle^4]\big)^{1/2} \lesssim (\|w_i^*\|^{p+1}+|\phi(0)|)k.$$

Define $L=\|w_i^*\|^{p+1}+|\phi(0)|$. Then we have for any $0<\epsilon<1$, if
$$|S| \gtrsim \frac{L^2k^2 + |m_{3,i}|^2 + Lk^{3/2}|m_{3,i}|\,\mathrm{poly}(\log d,t)\,\epsilon}{\epsilon^2|m_{3,i}|^2}\cdot t\log k,$$
then with probability at least $1-k^{-t}$,
$$\Big\|\mathbb{E}_{x\sim D_d}[B_i(x)] - \frac{1}{|S|}\sum_{x\in S}B_i(x)\Big\| \le \epsilon|m_{3,i}|.$$
We can set $t = T\log(d)/\log(k)$; then if
$$|S| \ge \epsilon^{-2}(1+1/|m_{3,i}|^2)\,\mathrm{poly}(T,\log d),$$
with probability at least $1-d^{-T}$,
$$\Big\|\mathbb{E}_{x\sim D_d}[B_i(x)] - \frac{1}{|S|}\sum_{x\in S}B_i(x)\Big\| \le \epsilon|m_{3,i}|.$$
Also note that for any symmetric third-order tensor $R$, the operator norm of $R$ satisfies
$$\|R\| = \max_{\|a\|=1}|R(a,a,a)| \le \max_{\|a\|=1}\|R(a,I,I)\|_F = \|R_{(1)}\|.$$
Lemma B.4.7. Let $M_4$ be defined as in Definition 4.4.1. Let $\widehat M_4$ be the empirical version of $M_4$, i.e.,
$$\widehat M_4 = \frac{1}{|S|}\sum_{(x,y)\in S}y\cdot\big(x^{\otimes4}-(x\otimes x)\otimes I + I\otimes I\big),$$
where $S$ is a set of samples, each i.i.d. sampled from the distribution $D$ defined in Eq. (4.1). Assume $M_4\ne0$, i.e., $m_{4,i}\ne0$ for any $i$. Let $\alpha$ be a fixed unit vector. Let $V\in\mathbb{R}^{d\times k}$ be an orthogonal matrix satisfying $\|UU^\top-VV^\top\|\le1/4$, where $U$ is the orthogonal basis of $\mathrm{span}\{w_1^*,w_2^*,\cdots,w_k^*\}$. Then for any $0<\epsilon<1$, $t\ge1$, if
$$|S|\ge\max_{i\in[k]}\big(1+\|w_i^*\|^{p+1}/|m_{4,i}(\alpha^\top w_i^*)|^2\big)\cdot\epsilon^{-2}\cdot k^2\,\mathrm{poly}(\log d,t),$$
then with probability at least $1-d^{-t}$,
$$\|M_4(V,V,V,\alpha)-\widehat M_4(V,V,V,\alpha)\|\le\epsilon\sum_{i=1}^k|v_i^*m_{4,i}(\alpha^\top w_i^*)|.$$

Proof. Recall that for each $(x,y)\in S$, we have $y=\sum_{i=1}^kv_i^*\phi(w_i^{*\top}x)$. We consider each component $i\in[k]$. We define the function $r(x):\mathbb{R}^d\to\mathbb{R}^k$ such that $r(x)=V^\top x$, and the function $T_i(x):\mathbb{R}^d\to\mathbb{R}^{k\times k\times k}$ such that
$$T_i(x)=\phi(w_i^{*\top}x)\big(x^\top\alpha\cdot r(x)\otimes r(x)\otimes r(x) - (V^\top\alpha)\otimes(r(x)\otimes r(x)) - \alpha^\top x\cdot r(x)\otimes I + (V^\top\alpha)\otimes I\big).$$
We flatten $T_i(x)$ along the first dimension to obtain the function $B_i(x):\mathbb{R}^d\to\mathbb{R}^{k\times k^2}$. Define $g(z)=\phi(z)-\phi(0)$; then $|g(z)|=|\int_0^z\phi'(s)\,ds|\le L_1/(p+1)\cdot|z|^{p+1}$, which follows from Property 1.
(I) Bounding $\|B_i(x)\|$.
$$\|B_i(x)\|\lesssim(|w_i^{*\top}x|^{p+1}+|\phi(0)|)\cdot\big(|x^\top\alpha|\|V^\top x\|^3+3\|V^\top\alpha\|\|V^\top x\|^2+3|x^\top\alpha|\|V^\top x\|\sqrt k+3\|V^\top\alpha\|\sqrt k\big).$$
Note that $V^\top x\sim N(0,I_k)$. According to Fact 2.4.2 and Fact 2.4.3, we have for any constant $t\ge1$, with probability $1-1/(nd^t)$,
$$\|B_i(x)\|\lesssim(\|w_i^*\|^{p+1}+|\phi(0)|)k^{3/2}\,\mathrm{poly}(\log d,t).$$

(II) Bounding $\|\mathbb{E}_{x\sim D_d}[B_i(x)]\|$.

Note that $\mathbb{E}_{x\sim D_d}[B_i(x)]=m_{4,i}(\alpha^\top w_i^*)(V^\top w_i^*)\,\mathrm{vec}\big((V^\top w_i^*)(V^\top w_i^*)^\top\big)^\top$. Therefore,
$$\|\mathbb{E}_{x\sim D_d}[B_i(x)]\|=|m_{4,i}(\alpha^\top w_i^*)|\|V^\top w_i^*\|^3.$$
Since $\|VV^\top-UU^\top\|\le1/4$, $\|VV^\top w_i^*-w_i^*\|\le1/4$ and $3/4\le\|V^\top w_i^*\|\le5/4$. So $\frac14|m_{4,i}(\alpha^\top w_i^*)|\le\|\mathbb{E}_{x\sim D_d}[B_i(x)]\|\le2|m_{4,i}(\alpha^\top w_i^*)|$.

(III) Bounding $\max(\|\mathbb{E}_{x\sim D_d}[B_i(x)B_i(x)^\top]\|,\ \|\mathbb{E}_{x\sim D_d}[B_i(x)^\top B_i(x)]\|)$.
$$\|\mathbb{E}_{x\sim D_d}[B_i(x)B_i(x)^\top]\|\lesssim\big(\mathbb{E}_{x\sim D_d}[\phi(w_i^{*\top}x)^4]\big)^{1/2}\big(\mathbb{E}_{x\sim D_d}[(\alpha^\top x)^4]\big)^{1/2}\big(\mathbb{E}_{x\sim D_d}[\|V^\top x\|^6]\big)^{1/2}\lesssim(\|w_i^*\|^{p+1}+|\phi(0)|)^2k^{3/2}.$$
$$\|\mathbb{E}_{x\sim D_d}[B_i(x)^\top B_i(x)]\|\lesssim\big(\mathbb{E}_{x\sim D_d}[\phi(w_i^{*\top}x)^4]\big)^{1/2}\big(\mathbb{E}_{x\sim D_d}[(\alpha^\top x)^4]\big)^{1/2}\big(\mathbb{E}_{x\sim D_d}[\|V^\top x\|^4]\big)^{1/2}\cdot\max_{\|A\|_F=1}\big(\mathbb{E}_{x\sim D_d}[\langle A,(V^\top x)(V^\top x)^\top\rangle^4]\big)^{1/2}\lesssim(\|w_i^*\|^{p+1}+|\phi(0)|)^2k^2.$$

(IV) Bounding $\max_{\|a\|=\|b\|=1}(\mathbb{E}_{x\sim D_d}[(a^\top B_i(x)b)^2])^{1/2}$.
$$\max_{\|a\|=\|b\|=1}\big(\mathbb{E}_{x\sim D_d}[(a^\top B_i(x)b)^2]\big)^{1/2}\lesssim\big(\mathbb{E}_{x\sim D_d}[\phi^4(w_i^{*\top}x)]\big)^{1/4}\big(\mathbb{E}_{x\sim D_d}[(\alpha^\top x)^4]\big)^{1/4}\max_{\|a\|=1}\big(\mathbb{E}_{x\sim D_d}[(a^\top V^\top x)^4]\big)^{1/2}\cdot\max_{\|A\|_F=1}\big(\mathbb{E}_{x\sim D_d}[\langle A,(V^\top x)(V^\top x)^\top\rangle^4]\big)^{1/2}\lesssim(\|w_i^*\|^{p+1}+|\phi(0)|)k.$$

Define $L=\|w_i^*\|^{p+1}+|\phi(0)|$. Then we have for any $0<\epsilon<1$, if
$$|S|\gtrsim\frac{L^2k^2+|m_{4,i}(\alpha^\top w_i^*)|^2+Lk^{3/2}|m_{4,i}(\alpha^\top w_i^*)|\,\mathrm{poly}(t,\log d)\,\epsilon}{\epsilon^2(m_{4,i}(\alpha^\top w_i^*))^2}\cdot t\log k,$$
then with probability at least $1-k^{-t}$,
$$\Big\|\mathbb{E}_{x\sim D_d}[B_i(x)]-\frac{1}{|S|}\sum_{x\in S}B_i(x)\Big\|\le\epsilon|m_{4,i}(\alpha^\top w_i^*)|. \qquad (B.31)$$
We can set $t=T\log(d)/\log(k)$; then if
$$|S|\ge\frac{(L+|m_{4,i}(\alpha^\top w_i^*)|)^2k^2\,\mathrm{poly}(T,\log d)}{\epsilon^2|m_{4,i}(\alpha^\top w_i^*)|^2}\cdot T\log^2 d,$$
then Eq. (B.31) holds with probability at least $1-d^{-T}$. Also note that for any symmetric third-order tensor $R$, the operator norm of $R$ satisfies
$$\|R\|=\max_{\|a\|=1}|R(a,a,a)|\le\max_{\|a\|=1}\|R(a,I,I)\|_F=\|R_{(1)}\|.$$
Final Error Bound for the Reduced Third-order Moment

Lemma B.4.8 shows that $\widehat R_3$ can approximate $R_3$ to some small precision with $\mathrm{poly}(k)$ samples.

Lemma B.4.8 (Estimation of the reduced third order moment). Let $U\in\mathbb{R}^{d\times k}$ denote the orthogonal column span of $W^*$. Let $\alpha$ be a fixed unit vector and $V\in\mathbb{R}^{d\times k}$ denote an orthogonal matrix satisfying $\|VV^\top-UU^\top\|\le1/4$. Define $R_3:=P_3(V,V,V)$, where $P_3$ is defined as in Definition 4.4.2 using $\alpha$. Let $\widehat R_3$ be the empirical version of $R_3$ using dataset $S$, where each sample of $S$ is i.i.d. sampled from the distribution $D$ defined in (4.1). Assume the activation function satisfies Property 1 and Assumption 4.4.1. Then for any $0<\epsilon<1$ and $t\ge1$, define $j_3=\min\{j\ge3\mid M_j\ne0\}$ and $m_0=\min_{i\in[k]}\big(m_{j_3,i}(\alpha^\top w_i^*)^{j_3-3}\big)^2$. If
$$|S|\ge\sigma_1^{2p+2}\cdot k^2\cdot\mathrm{poly}(\log d,t)/(\epsilon^2m_0),$$
then
$$\|R_3-\widehat R_3\|\le\epsilon\sum_{i=1}^k|v_i^*m_{j_3,i}(w_i^{*\top}\alpha)^{j_3-3}|$$
holds with probability at least $1-d^{-\Omega(t)}$.

Proof. The main idea is to use the matrix Bernstein bound after matricizing the third-order tensor. Similar to the proof of Lemma B.4.4, we consider each component individually, then sum up the errors and apply a union bound.

We have shown the bound for $j_3=3,4$ in Lemma B.4.6 and Lemma B.4.7, respectively. To summarize, for any $0<\epsilon<1$, if
$$|S|\ge\max_{i\in[k]}\big(1+\|w_i^*\|^{p+1}/|m_{j_3,i}(w_i^{*\top}\alpha)^{j_3-3}|^2\big)\cdot\epsilon^{-2}\cdot k^2\,\mathrm{poly}(\log d,t),$$
then with probability at least $1-d^{-t}$,
$$\|R_3-\widehat R_3\|\le\epsilon\sum_{i=1}^k|v_i^*m_{j_3,i}(w_i^{*\top}\alpha)^{j_3-3}|.$$
B.4.5 Error Bound for the Magnitude and Sign of the Weight Vectors

The lemmata in this section, together with Lemma B.4.4, provide guarantees for Algorithm B.4.2. In particular, Lemma B.4.10 shows that with sample complexity linear in $d$ we can approximate the first-order moment to some precision, and Lemma B.4.11 and Lemma B.4.12 provide error bounds for the linear systems in Eq. (B.25) under perturbations.

Robustness for Solving Linear Systems

Lemma B.4.9 (Robustness of linear system). Given two matrices $A,\widehat A\in\mathbb{R}^{d\times k}$ and two vectors $b,\widehat b\in\mathbb{R}^d$, let $z^*=\arg\min_{z\in\mathbb{R}^k}\|Az-b\|$ and $\widehat z=\arg\min_{z\in\mathbb{R}^k}\|(A+\widehat A)z-(b+\widehat b)\|$. If $\|\widehat A\|\le\frac{1}{4\kappa}\sigma_k(A)$ and $\|\widehat b\|\le\frac14\|b\|$, then we have
$$\|z^*-\widehat z\|\lesssim\big(\sigma_k^{-4}(A)\sigma_1^2(A)+\sigma_k^{-2}(A)\big)\|b\|\|\widehat A\|+\sigma_k^{-2}(A)\sigma_1(A)\|\widehat b\|.$$

Proof. By the definition of $z^*$ and $\widehat z$, we can write
$$z^* = A^\dagger b = (A^\top A)^{-1}A^\top b,$$
$$\widehat z = (A+\widehat A)^\dagger(b+\widehat b) = \big((A+\widehat A)^\top(A+\widehat A)\big)^{-1}(A+\widehat A)^\top(b+\widehat b).$$
As $\|\widehat A\|\le\frac{1}{4\kappa}\sigma_k(A)$, we have $\|A^\top\widehat A+\widehat A^\top A\|\|(A^\top A)^{-1}\|\le1/4$. Together with $\|\widehat b\|\le\frac14\|b\|$, we can ignore the higher-order error terms. So we have
$$\|\widehat z-z^*\| \lesssim \|(A^\top A)^{-1}(\widehat A^\top b+A^\top\widehat b)+(A^\top A)^{-1}(A^\top\widehat A+\widehat A^\top A)(A^\top A)^{-1}A^\top b\|$$
$$\lesssim \|(A^\top A)^{-1}\|\big(\|\widehat A\|\|b\|+\|A\|\|\widehat b\|\big)+\|(A^\top A)^{-2}\|\cdot\|A\|\|\widehat A\|\|A\|\|b\|$$
$$\lesssim \sigma_k^{-2}(A)\big(\|\widehat A\|\|b\|+\sigma_1(A)\|\widehat b\|\big)+\sigma_k^{-4}(A)\sigma_1^2(A)\|\widehat A\|\|b\|.$$
Error Bound for the First-order Moment

Lemma B.4.10 (Error bound for the first-order moment). Let $Q_1$ be defined as in Eq. (B.22) and $\widehat Q_1$ be the empirical version of $Q_1$ using dataset $S$, where each sample of $S$ is i.i.d. sampled from the distribution $D$ defined in (4.1). Assume the activation function satisfies Property 1 and Assumption 4.4.1. Then for any $0<\epsilon<1$ and $t\ge1$, define $j_1=\min\{j\ge1\mid M_j\ne0\}$ and $m_0=\min_{i\in[k]}\big(m_{j_1,i}(w_i^{*\top}\alpha)^{j_1-1}\big)^2$. If
$$|S|\ge\sigma_1^{2p+2}\,d\,\mathrm{poly}(t,\log d)/(\epsilon^2m_0),$$
then with probability at least $1-d^{-\Omega(t)}$,
$$\|Q_1-\widehat Q_1\|\le\epsilon\sum_{i=1}^k|v_i^*m_{j_1,i}(w_i^{*\top}\alpha)^{j_1-1}|.$$

Proof. We consider the case $j_1=3$, i.e.,
$$Q_1=M_3(I,\alpha,\alpha)=\sum_{i=1}^kv_i^*m_{3,i}(\alpha^\top w_i^*)^2w_i^*.$$
The other cases are similar.

Since $y=\sum_{i=1}^kv_i^*\phi(w_i^{*\top}x)$, we consider each component $i\in[k]$. Define the function $B_i(x):\mathbb{R}^d\to\mathbb{R}^d$ such that
$$B_i(x)=[\phi(w_i^{*\top}x)\cdot(x^{\otimes3}-x\otimes I)](I,\alpha,\alpha)=\phi(w_i^{*\top}x)\cdot\big((x^\top\alpha)^2x-2(x^\top\alpha)\alpha-x\big).$$
Define $g(z)=\phi(z)-\phi(0)$; then $|g(z)|=|\int_0^z\phi'(s)\,ds|\le L_1/(p+1)\cdot|z|^{p+1}$, which follows from Property 1.

(I) Bounding $\|B_i(x)\|$.
$$\|B_i(x)\|\le|\phi(w_i^{*\top}x)|\cdot\|(x^\top\alpha)^2x-2(\alpha^\top x)\alpha-x\|\le(|w_i^{*\top}x|^{p+1}+|\phi(0)|)\big(((x^\top\alpha)^2+1)\|x\|+2|\alpha^\top x|\big).$$
According to Fact 2.4.2 and Fact 2.4.3, we have for any constant $t\ge1$, with probability $1-1/(nd^t)$,
$$\|B_i(x)\|\lesssim(\|w_i^*\|^{p+1}+|\phi(0)|)\sqrt d\,\mathrm{poly}(\log d,t).$$

(II) Bounding $\|\mathbb{E}_{x\sim D_d}[B_i(x)]\|$.

Note that $\mathbb{E}_{x\sim D_d}[B_i(x)]=m_{3,i}(w_i^{*\top}\alpha)^2w_i^*$. Therefore, $\|\mathbb{E}_{x\sim D_d}[B_i(x)]\|=|m_{3,i}(w_i^{*\top}\alpha)^2|$.

(III) Bounding $\max(\|\mathbb{E}_{x\sim D_d}[B_i(x)B_i(x)^\top]\|,\ \|\mathbb{E}_{x\sim D_d}[B_i(x)^\top B_i(x)]\|)$.
$$\|\mathbb{E}_{x\sim D_d}[B_i(x)^\top B_i(x)]\|\lesssim\big(\mathbb{E}_{x\sim D_d}[\phi(w_i^{*\top}x)^4]\big)^{1/2}\big(\mathbb{E}_{x\sim D_d}[(x^\top\alpha)^8]\big)^{1/2}\big(\mathbb{E}_{x\sim D_d}[\|x\|^4]\big)^{1/2}\lesssim(\|w_i^*\|^{p+1}+|\phi(0)|)^2d.$$
$$\|\mathbb{E}_{x\sim D_d}[B_i(x)B_i(x)^\top]\|\lesssim\big(\mathbb{E}_{x\sim D_d}[\phi(w_i^{*\top}x)^4]\big)^{1/2}\big(\mathbb{E}_{x\sim D_d}[(x^\top\alpha)^8]\big)^{1/2}\Big(\max_{\|a\|=1}\mathbb{E}_{x\sim D_d}[(x^\top a)^4]\Big)^{1/2}\lesssim(\|w_i^*\|^{p+1}+|\phi(0)|)^2.$$

(IV) Bounding $\max_{\|a\|=1}(\mathbb{E}_{x\sim D_d}[(a^\top B_i(x))^2])^{1/2}$.
$$\max_{\|a\|=1}\big(\mathbb{E}_{x\sim D_d}[(a^\top B_i(x))^2]\big)^{1/2}\lesssim\big(\mathbb{E}_{x\sim D_d}[\phi^4(w_i^{*\top}x)]\big)^{1/4}\lesssim\|w_i^*\|^{p+1}+|\phi(0)|.$$

Define $L=\|w_i^*\|^{p+1}+|\phi(0)|$. Then we have for any $0<\epsilon<1$, if
$$|S|\gtrsim\frac{L^2d+|m_{3,i}(w_i^{*\top}\alpha)^2|^2+L|m_{3,i}(w_i^{*\top}\alpha)^2|\sqrt d\,\mathrm{poly}(\log d,t)\,\epsilon}{\epsilon^2|m_{3,i}(w_i^{*\top}\alpha)^2|^2}\cdot t\log d,$$
then with probability at least $1-1/d^t$,
$$\Big\|\mathbb{E}_{x\sim D_d}[B_i(x)]-\frac{1}{|S|}\sum_{x\in S}B_i(x)\Big\|\le\epsilon|m_{3,i}(w_i^{*\top}\alpha)^2|.$$
Summing up all $k$ components, we obtain: if
$$|S|\ge\max_{i\in[k]}\frac{(\|w_i^*\|^{p+1}+|\phi(0)|+|m_{3,i}(w_i^{*\top}\alpha)^2|)^2}{|m_{3,i}(w_i^{*\top}\alpha)^2|^2}\cdot\epsilon^{-2}d\,\mathrm{poly}(\log d,t),$$
then with probability at least $1-1/d^t$,
$$\|M_3(I,\alpha,\alpha)-\widehat M_3(I,\alpha,\alpha)\|\le\epsilon\sum_{i=1}^k|v_i^*m_{3,i}(w_i^{*\top}\alpha)^2|.$$
The other cases ($j_1=1,2,4$) are similar, so we complete the proof.
Linear System for the First-order Moment

The following lemma provides an estimation error bound for the first linear system in Eq. (B.25).

Lemma B.4.11 (Solution of linear system for the first order moment). Let $U\in\mathbb{R}^{d\times k}$ be the orthogonal column span of $W^*$. Let $V\in\mathbb{R}^{d\times k}$ denote an orthogonal matrix satisfying $\|VV^\top-UU^\top\|\le\delta_2\lesssim1/(\kappa^2\sqrt k)$. For each $i\in[k]$, let $u_i$ denote a vector satisfying $\|u_i-V^\top w_i^*\|\le\delta_3\lesssim1/(\kappa^2\sqrt k)$. Let $Q_1$ be defined as in Eq. (B.22) and $\widehat Q_1$ be the empirical version of $Q_1$ such that $\|\widehat Q_1-Q_1\|\le\delta_4\|Q_1\|\le\frac14\|Q_1\|$. Let $z^*\in\mathbb{R}^k$ and $\widehat z\in\mathbb{R}^k$ be defined as in Eq. (B.24) and Eq. (B.25). Then
$$|\widehat z_i-z_i^*|\le\big(\kappa^4k^{3/2}(\delta_2+\delta_3)+\kappa^2k^{1/2}\delta_4\big)\|z^*\|_1.$$

Proof. Let $A\in\mathbb{R}^{d\times k}$ denote the matrix whose $i$-th column is $s_iw_i^*$, and let $\widehat A\in\mathbb{R}^{d\times k}$ denote the matrix whose $i$-th column is $Vu_i-s_iw_i^*$. Let $b\in\mathbb{R}^d$ denote the vector $Q_1$, and let $\widehat b$ denote the vector $\widehat Q_1-Q_1$. Then we have
$$\|A\|\le\sqrt k.$$
Using Fact 2.3.1, we can lower bound $\sigma_k(A)$:
$$\sigma_k(A)\ge1/\kappa.$$
We can upper bound $\|\widehat A\|$ in the following way:
$$\|\widehat A\|\le\sqrt k\max_{i\in[k]}\|Vu_i-s_iw_i^*\|\le\sqrt k\max_{i\in[k]}\|Vu_i-s_iVV^\top w_i^*+s_iVV^\top w_i^*-s_iUU^\top w_i^*\|\le\sqrt k(\delta_3+\delta_2).$$
We can upper bound $\|b\|$ and $\|\widehat b\|$:
$$\|b\|=\|Q_1\|\le\sum_{i=1}^k|z_i^*|,\qquad\|\widehat b\|\le\delta_4\|Q_1\|.$$
To apply Lemma B.4.9, we need $\delta_4\le1/4$, $\delta_2\lesssim1/(\sqrt k\kappa^2)$ and $\delta_3\lesssim1/(\sqrt k\kappa^2)$. So we have
$$|\widehat z_i-z_i^*|\le\big(\kappa^4k^{3/2}(\delta_2+\delta_3)+\kappa^2k^{1/2}\delta_4\big)\|Q_1\|\le\big(\kappa^4k^{3/2}(\delta_2+\delta_3)+\kappa^2k^{1/2}\delta_4\big)\sum_{i=1}^k|z_i^*|.$$
Linear System for the Second-order Moment

The following lemma provides an estimation error bound for the second linear system in Eq. (B.25).

Lemma B.4.12 (Solution of linear system for the second order moment). Let $U\in\mathbb{R}^{d\times k}$ be the orthogonal column span of $W^*$, and let $V\in\mathbb{R}^{d\times k}$ denote an orthogonal matrix satisfying $\|VV^\top-UU^\top\|\le\delta_2\lesssim1/(\kappa\sqrt k)$. For each $i\in[k]$, let $u_i$ denote a vector satisfying $\|u_i-V^\top w_i^*\|\le\delta_3\lesssim1/(\sqrt k\kappa^3)$. Let $Q_2$ be defined as in Eq. (B.23) and $\widehat Q_2$ be the estimate of $Q_2$ such that $\|\widehat Q_2-Q_2\|_F\le\delta_4\|Q_2\|_F\le\frac14\|Q_2\|_F$. Let $r^*\in\mathbb{R}^k$ and $\widehat r\in\mathbb{R}^k$ be defined as in Eq. (B.24) and Eq. (B.25). Then
$$|\widehat r_i-r_i^*|\le(k^3\kappa^8\delta_3+\kappa^2k^2\delta_4)\|r^*\|.$$

Proof. For each $i\in[k]$, let $\overline u_i=V^\top w_i^*$. Let $A\in\mathbb{R}^{k^2\times k}$ denote the matrix whose $i$-th column is $\mathrm{vec}(\overline u_i\overline u_i^\top)$, and let $\widehat A\in\mathbb{R}^{k^2\times k}$ denote the matrix whose $i$-th column is $\mathrm{vec}(u_iu_i^\top-\overline u_i\overline u_i^\top)$. Let $b\in\mathbb{R}^{k^2}$ denote the vector $\mathrm{vec}(Q_2)$, and let $\widehat b\in\mathbb{R}^{k^2}$ denote the vector $\mathrm{vec}(\widehat Q_2-Q_2)$.

Let $\odot$ denote the element-wise matrix product (a.k.a. Hadamard product), $W=[w_1^*\ w_2^*\ \cdots\ w_k^*]$ and $\overline U=[\overline u_1\ \overline u_2\ \cdots\ \overline u_k]=V^\top W$. We can upper bound $\|A\|$ and $\|\widehat A\|$ as follows:
$$\|A\|=\max_{\|x\|=1}\Big\|\sum_{i=1}^kx_i\,\mathrm{vec}(\overline u_i\overline u_i^\top)\Big\|=\max_{\|x\|=1}\|\overline U\,\mathrm{diag}(x)\,\overline U^\top\|_F\le\|\overline U\|^2\le\sigma_1^2(V^\top W),$$
and
$$\|\widehat A\|\le\sqrt k\max_{i\in[k]}\|\widehat A_i\|\le\sqrt k\max_{i\in[k]}\|u_iu_i^\top-\overline u_i\overline u_i^\top\|_F\le\sqrt k\max_{i\in[k]}2\|u_i-\overline u_i\|\le2\sqrt k\delta_3.$$
We can lower bound $\sigma_k(A)$:
$$\sigma_k(A)=\sqrt{\sigma_k(A^\top A)}=\sqrt{\sigma_k\big((\overline U^\top\overline U)\odot(\overline U^\top\overline U)\big)}=\min_{\|x\|=1}\sqrt{x^\top\big((\overline U^\top\overline U)\odot(\overline U^\top\overline U)\big)x}$$
$$=\min_{\|x\|=1}\|(\overline U^\top\overline U)^{1/2}\,\mathrm{diag}(x)\,(\overline U^\top\overline U)^{1/2}\|_F\ge\sigma_k^2(V^\top W),$$
where the fourth step follows by the Schur product theorem, and the last step follows by the fact that $\|CB\|_F\ge\sigma_{\min}(C)\|B\|_F$.

We can upper bound $\|b\|$ and $\|\widehat b\|$:
$$\|b\|\le\|Q_2\|_F\le\|r^*\|,\qquad\|\widehat b\|=\|\widehat Q_2-Q_2\|_F\le\delta_4\|r^*\|.$$
Since $\|VV^\top W-W\|\le\sqrt k\delta_2$, we have for any $x\in\mathbb{R}^k$,
$$\|VV^\top Wx\|\ge\|Wx\|-\|(VV^\top W-W)x\|\ge\sigma_k(W)\|x\|-\delta_2\sqrt k\|x\|.$$
Note that according to Fact 2.3.1, $\sigma_k(W)\ge1/\kappa$. Therefore, if $\delta_2\le1/(2\kappa\sqrt k)$, we have $\sigma_k(V^\top W)\ge1/(2\kappa)$. Similarly, $\sigma_1(V^\top W)\le\|V\|\|W\|\le\sqrt k$. Then applying Lemma B.4.9 and setting $\delta_3\lesssim\frac{1}{\sqrt k\kappa^3}$, we complete the proof.
Appendix C
One-hidden-layer Convolutional Neural Networks
C.1 Proof Overview
In this section, we briefly give the proof sketch for the local strong convexity. The main idea is first to bound the range of the eigenvalues of the population Hessian $\nabla^2f_D(W^*)$ and then bound the spectral norm of the remaining error, $\|\nabla^2f_S(W)-\nabla^2f_D(W^*)\|$. The latter can be bounded mainly by applying the matrix Bernstein inequality and Properties 1 and 3 carefully. In Sec. C.1.1, we show that when Property 4 is satisfied, $\nabla^2f_D(W^*)$ for orthogonal $W^*$ with $k=t$ can be lower bounded. Sec. C.1.2 shows how to reduce the case of a non-orthogonal $W^*$ with $k\ge t$ to the orthogonal case with $k=t$. The upper bound is relatively easier, so we leave those proofs to Appendix C.3.
C.1.1 Orthogonal weight matrices for the population case

In this section, we consider the special case when $t=k$ and $W^*$ is orthogonal, to illustrate how we prove positive definiteness of the Hessian. Without loss of generality, we set $W^*=I_k$. Let $[x_1^\top\ x_2^\top\ \cdots\ x_r^\top]^\top$ denote the vector $x\in\mathbb{R}^d$, where $x_i=P_ix\in\mathbb{R}^k$ for each $i\in[r]$. Let $x_{i,j}$ denote the $j$-th entry of $x_i$. Thus, we can rewrite the second partial derivative of $f_D(W^*)$ with respect to $w_j$ and $w_l$ as
$$\frac{\partial^2f_D(W^*)}{\partial w_j\partial w_l}=\mathbb{E}_{(x,y)\sim D}\Big[\Big(\sum_{i=1}^r\phi'(x_{i,j})x_i\Big)\Big(\sum_{i=1}^r\phi'(x_{i,l})x_i\Big)^\top\Big].$$
Let $a\in\mathbb{R}^{k^2}$ denote the vector $[a_1^\top\ a_2^\top\ \cdots\ a_k^\top]^\top$ for $a_i\in\mathbb{R}^k$, $i\in[k]$. The Hessian can be lower bounded by
$$\lambda_{\min}(\nabla^2f(W^*))\ge\min_{\|a\|=1}a^\top\nabla^2f(W^*)a=\min_{\|a\|=1}\mathbb{E}_{x\sim D_d}\Big[\Big(\sum_{j=1}^k\sum_{i=1}^ra_j^\top x_i\cdot\phi'(x_{i,j})\Big)^2\Big]\qquad(C.1)$$
$$\ge r\cdot\min_{\|a\|=1}\mathbb{E}_{u\sim D_k}\Big[\Big(\sum_{j=1}^ka_j^\top\big(u\phi'(u_j)-\mathbb{E}_{u\sim D_k}[u\phi'(u_j)]\big)\Big)^2\Big].\qquad(C.2)$$
The last formulation, Eq. (C.2), has a unit independent element $u_j$ inside $\phi'(\cdot)$ and thus can be calculated explicitly after defining some quantities. In particular, we can obtain the following lower bound for Eq. (C.2).

Lemma C.1.1 (Informal version of Lemma C.3.2). Let $D_1$ denote the Gaussian distribution $N(0,1)$. Let
$$\alpha_0=\mathbb{E}_{z\sim D_1}[\phi'(z)],\quad\alpha_1=\mathbb{E}_{z\sim D_1}[\phi'(z)z],\quad\alpha_2=\mathbb{E}_{z\sim D_1}[\phi'(z)z^2],$$
$$\beta_0=\mathbb{E}_{z\sim D_1}[\phi'^2(z)],\quad\beta_2=\mathbb{E}_{z\sim D_1}[\phi'^2(z)z^2].$$
Let $\rho$ denote $\min\{(\beta_0-\alpha_0^2-\alpha_1^2),\ (\beta_2-\alpha_1^2-\alpha_2^2)\}$. For any positive integer $k$, let $A=[a_1\ a_2\ \cdots\ a_k]\in\mathbb{R}^{k\times k}$. Then we have
$$\mathbb{E}_{u\sim D_k}\Big[\Big(\sum_{j=1}^ka_j^\top\big(u\cdot\phi'(u_j)-\mathbb{E}_{u\sim D_k}[u\phi'(u_j)]\big)\Big)^2\Big]\ge\rho\|A\|_F^2.\qquad(C.3)$$

Note that the definition of $\rho$ contains two of the terms in the definition of $\rho(1)$ in Property 4. Therefore, if $\rho(1)>0$, we also have $\rho>0$. More detailed proofs for the orthogonal case can be found in Appendix C.3.1.
C.1.2 Non-orthogonal weight matrices for the population case

In this section, we show how to reduce the minimum eigenvalue problem with a non-orthogonal weight matrix to a problem with an orthogonal weight matrix, so that we can use the results in Sec. C.1.1 to lower bound the eigenvalues.

Let $U\in\mathbb{R}^{k\times t}$ be the orthonormal basis of $W^*\in\mathbb{R}^{k\times t}$ and let $V=U^\top W^*\in\mathbb{R}^{t\times t}$. We use $U_\perp\in\mathbb{R}^{k\times(k-t)}$ to denote the complement of $U$. For any vector $a_j\in\mathbb{R}^k$, there exist two vectors $b_j\in\mathbb{R}^t$ and $c_j\in\mathbb{R}^{k-t}$ such that
$$\underbrace{a_j}_{k\times1}=\underbrace{U}_{k\times t}\underbrace{b_j}_{t\times1}+\underbrace{U_\perp}_{k\times(k-t)}\underbrace{c_j}_{(k-t)\times1}.$$
Let $b\in\mathbb{R}^{t^2}$ denote the vector $[b_1^\top\ b_2^\top\ \cdots\ b_t^\top]^\top$ and let $c\in\mathbb{R}^{(k-t)t}$ denote the vector $[c_1^\top\ c_2^\top\ \cdots\ c_t^\top]^\top$. Define $g(w_i^*)=\mathbb{E}_{x\sim D_k}[x\phi'(w_i^{*\top}x)]$.

Similar to the steps in Eq. (C.1) and Eq. (C.2), we have
$$\nabla^2f_D(W^*)\succeq r\cdot\min_{\|a\|=1}\mathbb{E}_{x\sim D_k}\Big[\Big(\sum_{i=1}^ta_i^\top\big(x\phi'(w_i^{*\top}x)-g(w_i^*)\big)\Big)^2\Big]I_{kt}$$
$$=r\cdot\min_{\|b\|^2+\|c\|^2=1}\mathbb{E}_{x\sim D_k}\Big[\Big(\sum_{i=1}^t(b_i^\top U^\top+c_i^\top U_\perp^\top)\big(x\phi'(w_i^{*\top}x)-g(w_i^*)\big)\Big)^2\Big]I_{kt}$$
$$\succeq r\cdot(C_1+C_2+C_3)I_{kt},$$
where
$$C_1=\min_{\|b\|=1}\mathbb{E}_{x\sim D_k}\Big[\Big(\sum_{i=1}^tb_i^\top U^\top\big(x\phi'(w_i^{*\top}x)-g(w_i^*)\big)\Big)^2\Big],$$
$$C_2=\min_{\|c\|=1}\mathbb{E}_{x\sim D_k}\Big[\Big(\sum_{i=1}^tc_i^\top U_\perp^\top\big(x\phi'(w_i^{*\top}x)-g(w_i^*)\big)\Big)^2\Big],$$
$$C_3=\min_{\|b\|=\|c\|=1}\mathbb{E}_{x\sim D_k}\Big[2\Big(\sum_{i=1}^tb_i^\top U^\top\big(x\phi'(w_i^{*\top}x)-g(w_i^*)\big)\Big)\Big(\sum_{i=1}^tc_i^\top U_\perp^\top\big(x\phi'(w_i^{*\top}x)-g(w_i^*)\big)\Big)\Big].$$

Since $g(w_i^*)\propto w_i^*$ and $U_\perp^\top x$ is independent of $\phi'(w_i^{*\top}x)$, we have $C_3=0$. $C_1$ can be lower bounded by the orthogonal case with the loss of a condition-number factor $\lambda$ of $W^*$, as follows:
$$C_1\ge\frac1\lambda\,\mathbb{E}_{u\sim D_t}\Big[\Big(\sum_{i=1}^t\sigma_t\cdot b_i^\top V^{\dagger\top}\big(u\phi'(\sigma_t\cdot u_i)-V^\top\sigma_1(V^\dagger)\,g(w_i^*)\big)\Big)^2\Big]$$
$$\ge\frac1\lambda\,\mathbb{E}_{u\sim D_t}\Big[\Big(\sum_{i=1}^t\sigma_t\cdot b_i^\top V^{\dagger\top}\big(u\phi'(\sigma_t\cdot u_i)-\mathbb{E}_{u\sim D_t}[u\phi'(\sigma_t\cdot u_i)]\big)\Big)^2\Big].$$
The last formulation is the orthogonal-weight case of Eq. (C.2) in Sec. C.1.1, so we can lower bound it by Lemma C.1.1. The intermediate steps for the derivation of the above inequalities and the lower bound for $C_2$ can be found in Appendix C.3.1.
C.2 Properties of Activation Functions

Definition C.2.1. Let $\alpha_q(\sigma)=\mathbb{E}_{z\sim N(0,1)}[\phi'(\sigma\cdot z)z^q]$ for $q\in\{0,1,2\}$, $\beta_q(\sigma)=\mathbb{E}_{z\sim N(0,1)}[\phi'^2(\sigma\cdot z)z^q]$ for $q\in\{0,2\}$, and $\gamma_q(\sigma)=\mathbb{E}_{z\sim N(0,1)}[\phi(\sigma\cdot z)z^q]$ for $q\in\{0,1,2,3,4\}$.

Proposition C.2.1. ReLU $\phi(z)=\max\{z,0\}$, leaky ReLU $\phi(z)=\max\{z,0.01z\}$, squared ReLU $\phi(z)=\max\{z,0\}^2$, and any non-linear non-decreasing smooth function with bounded symmetric $\phi'(z)$, like the sigmoid function $\phi(z)=1/(1+e^{-z})$, the tanh function and the erf function $\phi(z)=\int_0^ze^{-t^2}\,dt$, satisfy Properties 1, 3 and 4.

Proof. The difference between Property 4 and Property 2 is in the definition of $\rho(\sigma)$, for which Property 4 has an additional term $\alpha_0^2$. From Table B.1, we know that ReLU, leaky ReLU and squared ReLU satisfy the condition $\alpha_0>0$. For non-linear non-decreasing smooth functions, $\alpha_0=0$ if and only if $\phi'(z)=0$ almost surely, since $\phi'(z)\ge0$. Therefore, $\rho(\sigma)>0$ for any smooth non-decreasing non-linear activation with bounded symmetric first derivative.
C.3 Positive Definiteness of Hessian near the Ground Truth

C.3.1 Bounding the eigenvalues of the Hessian

The goal of this section is to prove Lemma C.3.1.

Lemma C.3.1 (Positive Definiteness of Population Hessian at the Ground Truth). If $\phi(z)$ satisfies Properties 1, 3 and 4, we have the following property for the second derivative of the function $f_D(W)$ at $W^*\in\mathbb{R}^{k\times t}$:
$$\Omega(r\rho(\sigma_t)/(\kappa^2\lambda))\,I\preceq\nabla^2f_D(W^*)\preceq O(tr^2\sigma_1^{2p})\,I.$$

Proof. This follows by combining Lemma C.3.3 and Lemma C.3.4.
Lower bound for the orthogonal case

Lemma C.3.2 (Formal version of Lemma C.1.1). Let $D_1$ denote the Gaussian distribution $N(0,1)$. Let $\alpha_0=\mathbb{E}_{z\sim D_1}[\phi'(z)]$, $\alpha_1=\mathbb{E}_{z\sim D_1}[\phi'(z)z]$, $\alpha_2=\mathbb{E}_{z\sim D_1}[\phi'(z)z^2]$, $\beta_0=\mathbb{E}_{z\sim D_1}[\phi'^2(z)]$, $\beta_2=\mathbb{E}_{z\sim D_1}[\phi'^2(z)z^2]$. Let $\rho$ denote $\min\{(\beta_0-\alpha_0^2-\alpha_1^2),\ (\beta_2-\alpha_1^2-\alpha_2^2)\}$. Let $P=[p_1\ p_2\ \cdots\ p_k]\in\mathbb{R}^{k\times k}$. Then we have
$$\mathbb{E}_{u\sim D_k}\Big[\Big(\sum_{i=1}^kp_i^\top\big(u\cdot\phi'(u_i)-\mathbb{E}_{u\sim D_k}[u\phi'(u_i)]\big)\Big)^2\Big]\ge\rho\|P\|_F^2.\qquad(C.4)$$
Proof.
$$\mathbb{E}_{u\sim D_k}\Big[\Big(\sum_{i=1}^kp_i^\top\big(u\cdot\phi'(u_i)-\mathbb{E}_{u\sim D_k}[u\phi'(u_i)]\big)\Big)^2\Big]$$
$$=\mathbb{E}_{u\sim D_k}\Big[\Big(\sum_{i=1}^kp_i^\top u\cdot\phi'(u_i)\Big)^2\Big]-\Big(\mathbb{E}_{u\sim D_k}\Big[\sum_{i=1}^kp_i^\top u\cdot\phi'(u_i)\Big]\Big)^2$$
$$=\sum_{i=1}^k\sum_{l=1}^k\mathbb{E}_{u\sim D_k}\big[p_i^\top(\phi'(u_l)\phi'(u_i)\cdot uu^\top)p_l\big]-\Big(\mathbb{E}_{u\sim D_k}\Big[\sum_{i=1}^kp_i^\top e_iu_i\phi'(u_i)\Big]\Big)^2$$
$$=\underbrace{\sum_{i=1}^k\mathbb{E}_{u\sim D_k}\big[p_i^\top(\phi'(u_i)^2\cdot uu^\top)p_i\big]}_{A}+\underbrace{\sum_{i\ne l}\mathbb{E}_{u\sim D_k}\big[p_i^\top(\phi'(u_l)\phi'(u_i)\cdot uu^\top)p_l\big]}_{B}-\underbrace{\Big(\mathbb{E}_{u\sim D_k}\Big[\sum_{i=1}^kp_i^\top e_iu_i\phi'(u_i)\Big]\Big)^2}_{C}.$$

First, we can rewrite the term $C$ in the following way:
$$C=\Big(\mathbb{E}_{u\sim D_k}\Big[\sum_{i=1}^kp_i^\top e_iu_i\phi'(u_i)\Big]\Big)^2=\Big(\sum_{i=1}^kp_i^\top e_i\,\mathbb{E}_{z\sim D_1}[\phi'(z)z]\Big)^2=\alpha_1^2\Big(\sum_{i=1}^kp_i^\top e_i\Big)^2=\alpha_1^2(\mathrm{diag}(P)^\top\mathbf{1})^2.$$
Further, we can rewrite the diagonal term $A$ in the following way:
$$A=\sum_{i=1}^k\mathbb{E}_{u\sim D_k}\big[p_i^\top(\phi'(u_i)^2\cdot uu^\top)p_i\big]$$
$$=\sum_{i=1}^k\mathbb{E}_{u\sim D_k}\Big[p_i^\top\Big(\phi'(u_i)^2\cdot\Big(u_i^2e_ie_i^\top+\sum_{j\ne i}u_iu_j(e_ie_j^\top+e_je_i^\top)+\sum_{j\ne i}\sum_{l\ne i}u_ju_le_je_l^\top\Big)\Big)p_i\Big]$$
$$=\sum_{i=1}^k\mathbb{E}_{u\sim D_k}\Big[p_i^\top\Big(\phi'(u_i)^2\cdot\Big(u_i^2e_ie_i^\top+\sum_{j\ne i}u_j^2e_je_j^\top\Big)\Big)p_i\Big]$$
$$=\sum_{i=1}^k\Big[p_i^\top\Big(\mathbb{E}_{u\sim D_k}[\phi'(u_i)^2u_i^2]e_ie_i^\top+\sum_{j\ne i}\mathbb{E}_{u\sim D_k}[\phi'(u_i)^2u_j^2]e_je_j^\top\Big)p_i\Big]$$
$$=\sum_{i=1}^k\Big[p_i^\top\Big(\beta_2e_ie_i^\top+\sum_{j\ne i}\beta_0e_je_j^\top\Big)p_i\Big]=\sum_{i=1}^kp_i^\top\big((\beta_2-\beta_0)e_ie_i^\top+\beta_0I_k\big)p_i$$
$$=(\beta_2-\beta_0)\sum_{i=1}^kp_i^\top e_ie_i^\top p_i+\beta_0\sum_{i=1}^kp_i^\top p_i=(\beta_2-\beta_0)\|\mathrm{diag}(P)\|^2+\beta_0\|P\|_F^2,$$
where the second step follows by rewriting $uu^\top=\sum_{i=1}^k\sum_{j=1}^ku_iu_je_ie_j^\top$, the third step follows by $\mathbb{E}_{u\sim D_k}[\phi'(u_i)^2u_iu_j]=0$ for all $j\ne i$ and $\mathbb{E}_{u\sim D_k}[\phi'(u_i)^2u_ju_l]=0$ for all $j\ne l$, the fourth step follows by pushing the expectation inside, the fifth step follows by $\mathbb{E}_{u\sim D_k}[\phi'(u_i)^2u_i^2]=\beta_2$ and $\mathbb{E}_{u\sim D_k}[\phi'(u_i)^2u_j^2]=\mathbb{E}_{u\sim D_k}[\phi'(u_i)^2]=\beta_0$, and the last step follows by $\sum_{i=1}^kp_{i,i}^2=\|\mathrm{diag}(P)\|^2$ and $\sum_{i=1}^kp_i^\top p_i=\sum_{i=1}^k\|p_i\|^2=\|P\|_F^2$.
We can rewrite the off-diagonal term $B$ in the following way:
$$B=\sum_{i\ne l}\mathbb{E}_{u\sim D_k}\big[p_i^\top(\phi'(u_l)\phi'(u_i)\cdot uu^\top)p_l\big]$$
$$=\sum_{i\ne l}\mathbb{E}_{u\sim D_k}\Big[p_i^\top\Big(\phi'(u_l)\phi'(u_i)\cdot\Big(u_i^2e_ie_i^\top+u_l^2e_le_l^\top+u_iu_l(e_ie_l^\top+e_le_i^\top)+\sum_{j\ne l}u_iu_je_ie_j^\top+\sum_{j\ne i}u_ju_le_je_l^\top+\sum_{j\ne i,l}\sum_{j'\ne i,l}u_ju_{j'}e_je_{j'}^\top\Big)\Big)p_l\Big]$$
$$=\sum_{i\ne l}\mathbb{E}_{u\sim D_k}\Big[p_i^\top\Big(\phi'(u_l)\phi'(u_i)\cdot\Big(u_i^2e_ie_i^\top+u_l^2e_le_l^\top+u_iu_l(e_ie_l^\top+e_le_i^\top)+\sum_{j\ne i,l}u_j^2e_je_j^\top\Big)\Big)p_l\Big]$$
$$=\sum_{i\ne l}\Big[p_i^\top\Big(\mathbb{E}_{u\sim D_k}[\phi'(u_l)\phi'(u_i)u_i^2]e_ie_i^\top+\mathbb{E}_{u\sim D_k}[\phi'(u_l)\phi'(u_i)u_l^2]e_le_l^\top+\mathbb{E}_{u\sim D_k}[\phi'(u_l)\phi'(u_i)u_iu_l](e_ie_l^\top+e_le_i^\top)+\sum_{j\ne i,l}\mathbb{E}_{u\sim D_k}[\phi'(u_l)\phi'(u_i)u_j^2]e_je_j^\top\Big)p_l\Big]$$
$$=\sum_{i\ne l}\Big[p_i^\top\Big(\alpha_0\alpha_2(e_ie_i^\top+e_le_l^\top)+\alpha_1^2(e_ie_l^\top+e_le_i^\top)+\sum_{j\ne i,l}\alpha_0^2e_je_j^\top\Big)p_l\Big]$$
$$=\sum_{i\ne l}\big[p_i^\top\big((\alpha_0\alpha_2-\alpha_0^2)(e_ie_i^\top+e_le_l^\top)+\alpha_1^2(e_ie_l^\top+e_le_i^\top)+\alpha_0^2I_k\big)p_l\big]$$
$$=\underbrace{(\alpha_0\alpha_2-\alpha_0^2)\sum_{i\ne l}p_i^\top(e_ie_i^\top+e_le_l^\top)p_l}_{B_1}+\underbrace{\alpha_1^2\sum_{i\ne l}p_i^\top(e_ie_l^\top+e_le_i^\top)p_l}_{B_2}+\underbrace{\alpha_0^2\sum_{i\ne l}p_i^\top p_l}_{B_3},$$
where the third step follows by $\mathbb{E}_{u\sim D_k}[\phi'(u_l)\phi'(u_i)u_iu_j]=0$ and $\mathbb{E}_{u\sim D_k}[\phi'(u_l)\phi'(u_i)u_{j'}u_j]=0$ for all $j'\ne j$.
For the term $B_1$, we have
$$B_1=(\alpha_0\alpha_2-\alpha_0^2)\sum_{i\ne l}p_i^\top(e_ie_i^\top+e_le_l^\top)p_l=2(\alpha_0\alpha_2-\alpha_0^2)\sum_{i\ne l}p_i^\top e_ie_i^\top p_l$$
$$=2(\alpha_0\alpha_2-\alpha_0^2)\sum_{i=1}^kp_i^\top e_ie_i^\top\Big(\sum_{l=1}^kp_l-p_i\Big)=2(\alpha_0\alpha_2-\alpha_0^2)\Big(\sum_{i=1}^kp_i^\top e_ie_i^\top\sum_{l=1}^kp_l-\sum_{i=1}^kp_i^\top e_ie_i^\top p_i\Big)$$
$$=2(\alpha_0\alpha_2-\alpha_0^2)\big(\mathrm{diag}(P)^\top\cdot P\cdot\mathbf{1}-\|\mathrm{diag}(P)\|^2\big).$$

For the term $B_2$, we have
$$B_2=\alpha_1^2\sum_{i\ne l}p_i^\top(e_ie_l^\top+e_le_i^\top)p_l=\alpha_1^2\Big(\sum_{i\ne l}p_i^\top e_ie_l^\top p_l+\sum_{i\ne l}p_i^\top e_le_i^\top p_l\Big)$$
$$=\alpha_1^2\Big(\sum_{i=1}^k\sum_{l=1}^kp_i^\top e_ie_l^\top p_l-\sum_{j=1}^kp_j^\top e_je_j^\top p_j+\sum_{i=1}^k\sum_{l=1}^kp_i^\top e_le_i^\top p_l-\sum_{j=1}^kp_j^\top e_je_j^\top p_j\Big)$$
$$=\alpha_1^2\big((\mathrm{diag}(P)^\top\mathbf{1})^2-\|\mathrm{diag}(P)\|^2+\langle P,P^\top\rangle-\|\mathrm{diag}(P)\|^2\big).$$
For the term $B_3$, we have
$$B_3=\alpha_0^2\sum_{i\ne l}p_i^\top p_l=\alpha_0^2\Big(\sum_{i=1}^kp_i^\top\sum_{l=1}^kp_l-\sum_{i=1}^kp_i^\top p_i\Big)=\alpha_0^2\Big(\Big\|\sum_{i=1}^kp_i\Big\|^2-\sum_{i=1}^k\|p_i\|^2\Big)=\alpha_0^2\big(\|P\cdot\mathbf{1}\|^2-\|P\|_F^2\big).$$

Let $\mathrm{diag}(P)$ denote the length-$k$ column vector whose $i$-th entry is the $(i,i)$-th entry of $P\in\mathbb{R}^{k\times k}$. Furthermore, we can show that $A+B-C$ is
$$A+B-C=A+B_1+B_2+B_3-C$$
$$=\underbrace{(\beta_2-\beta_0)\|\mathrm{diag}(P)\|^2+\beta_0\|P\|_F^2}_{A}+\underbrace{2(\alpha_0\alpha_2-\alpha_0^2)\big(\mathrm{diag}(P)^\top\cdot P\cdot\mathbf{1}-\|\mathrm{diag}(P)\|^2\big)}_{B_1}$$
$$+\underbrace{\alpha_1^2\big((\mathrm{diag}(P)^\top\mathbf{1})^2-\|\mathrm{diag}(P)\|^2+\langle P,P^\top\rangle-\|\mathrm{diag}(P)\|^2\big)}_{B_2}+\underbrace{\alpha_0^2\big(\|P\cdot\mathbf{1}\|^2-\|P\|_F^2\big)}_{B_3}-\underbrace{\alpha_1^2(\mathrm{diag}(P)^\top\mathbf{1})^2}_{C}$$
$$=\underbrace{\|\alpha_0P\cdot\mathbf{1}+(\alpha_2-\alpha_0)\,\mathrm{diag}(P)\|^2}_{C_1}+\underbrace{\frac{\alpha_1^2}{2}\|P+P^\top-2\,\mathrm{diag}(\mathrm{diag}(P))\|_F^2}_{C_2}$$
$$+\underbrace{(\beta_0-\alpha_0^2-\alpha_1^2)\|P-\mathrm{diag}(\mathrm{diag}(P))\|_F^2}_{C_3}+\underbrace{(\beta_2-\alpha_1^2-\alpha_2^2)\|\mathrm{diag}(P)\|^2}_{C_4}$$
$$\ge(\beta_0-\alpha_0^2-\alpha_1^2)\|P-\mathrm{diag}(\mathrm{diag}(P))\|_F^2+(\beta_2-\alpha_1^2-\alpha_2^2)\|\mathrm{diag}(P)\|^2$$
$$\ge\min\{(\beta_0-\alpha_0^2-\alpha_1^2),(\beta_2-\alpha_1^2-\alpha_2^2)\}\cdot\big(\|P-\mathrm{diag}(\mathrm{diag}(P))\|_F^2+\|\mathrm{diag}(P)\|^2\big)$$
$$=\min\{(\beta_0-\alpha_0^2-\alpha_1^2),(\beta_2-\alpha_1^2-\alpha_2^2)\}\cdot\big(\|P-\mathrm{diag}(\mathrm{diag}(P))\|_F^2+\|\mathrm{diag}(\mathrm{diag}(P))\|_F^2\big)$$
$$\ge\min\{(\beta_0-\alpha_0^2-\alpha_1^2),(\beta_2-\alpha_1^2-\alpha_2^2)\}\cdot\|P\|_F^2=\rho\|P\|_F^2,$$
where the first step follows by $B=B_1+B_2+B_3$, the second step follows by the definitions of $A,B_1,B_2,B_3,C$, the third step follows by $A+B_1+B_2+B_3-C=C_1+C_2+C_3+C_4$, the fourth step follows by $C_1,C_2\ge0$, the fifth step follows by $a\ge\min\{a,b\}$, the sixth step follows by $\|\mathrm{diag}(P)\|^2=\|\mathrm{diag}(\mathrm{diag}(P))\|_F^2$, the seventh step follows by the triangle inequality, and the last step follows by the definition of $\rho$.
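Eq. (C.4) can be sanity-checked numerically: its left-hand side is exactly the variance of $X=\sum_ip_i^\top u\,\phi'(u_i)$ for $u\sim N(0,I_k)$, since $\mathbb{E}[u\phi'(u_i)]=\alpha_1e_i$. A Monte Carlo sketch for ReLU, where $\rho=1/4-1/(2\pi)$:

```python
import numpy as np

# Monte Carlo check of Eq. (C.4) for ReLU, phi'(u) = 1{u > 0}:
# the LHS is the variance of X = sum_i (p_i^T u) * phi'(u_i), which the
# lemma lower-bounds by rho * ||P||_F^2 with rho = 1/4 - 1/(2*pi).
rng = np.random.default_rng(6)
k, n = 4, 400_000
P = rng.standard_normal((k, k))      # columns p_1, ..., p_k
U = rng.standard_normal((n, k))      # rows are samples u ~ N(0, I_k)

# X[s] = sum_i (U[s] @ P[:, i]) * 1{U[s, i] > 0}
X = ((U @ P) * (U > 0)).sum(axis=1)
lhs = X.var()

rho = 0.25 - 1.0 / (2.0 * np.pi)
rhs = rho * np.linalg.norm(P, 'fro') ** 2
```

For a generic random $P$ the variance exceeds $\rho\|P\|_F^2$ with substantial slack; $\rho$ is the worst-case constant over all $P$.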
Claim C.3.1. A+B1 +B2 +B3 − C = C1 + C2 + C3 + C4.
Proof. The key properties we need are, for two vectors a, b, ‖a + b‖2 = ‖a‖2 +2〈a, b〉 + ‖b‖2; for two matrices A,B, ‖A + B‖2F = ‖A‖2F + 2〈A,B〉 + ‖B‖2F .Then, we have
C1 + C2 + C3 + C4 + C
= α0²‖P·1‖² + 2(α0α2 − α0²)⟨P·1, diag(P)⟩ + (α2 − α0)²‖diag(P)‖²        (= C1)
  + α1²(diag(P)^T·1)²                                                     (= C)
  + (α1²/2)(2‖P‖²_F + 4‖diag(diag(P))‖²_F + 2⟨P, P^T⟩ − 4⟨P, diag(diag(P))⟩ − 4⟨P^T, diag(diag(P))⟩)   (= C2)
  + (β0 − α0² − α1²)(‖P‖²_F − 2⟨P, diag(diag(P))⟩ + ‖diag(diag(P))‖²_F)   (= C3)
  + (β2 − α1² − α2²)‖diag(P)‖²                                            (= C4)
= α0²‖P·1‖² + 2(α0α2 − α0²)⟨P·1, diag(P)⟩ + (α2 − α0)²‖diag(P)‖²
  + α1²(diag(P)^T·1)²
  + (α1²/2)(2‖P‖²_F + 4‖diag(P)‖² + 2⟨P, P^T⟩ − 8‖diag(P)‖²)
  + (β0 − α0² − α1²)(‖P‖²_F − 2‖diag(P)‖² + ‖diag(P)‖²)
  + (β2 − α1² − α2²)‖diag(P)‖²
= α0²‖P·1‖² + 2(α0α2 − α0²)·diag(P)^T·P·1 + α1²(diag(P)^T·1)² + α1²⟨P, P^T⟩ + (β0 − α0²)‖P‖²_F
  + (β2 − β0 − 2(α2α0 − α0² + α1²))‖diag(P)‖²
= 2(α2α0 − α0²)·diag(P)^T·P·1                                             (part of B1)
  + α1²((diag(P)^T·1)² + ⟨P, P^T⟩)                                        (part of B2)
  + α0²‖P·1‖²                                                             (part of B3)
  + (β0 − α0²)‖P‖²_F                                                      (proportional to ‖P‖²_F)
  + (β2 − β0 − 2(α2α0 − α0² + α1²))‖diag(P)‖²                             (proportional to ‖diag(P)‖²)
= (β2 − β0)‖diag(P)‖² + β0‖P‖²_F                                          (= A)
  + 2(α0α2 − α0²)(diag(P)^T·P·1 − ‖diag(P)‖²)                             (= B1)
  + α1²((diag(P)^T·1)² − ‖diag(P)‖² + ⟨P, P^T⟩ − ‖diag(P)‖²)              (= B2)
  + α0²(‖P·1‖² − ‖P‖²_F)                                                  (= B3)
= A + B1 + B2 + B3,
where the second step follows by ⟨P, diag(diag(P))⟩ = ‖diag(P)‖² and ‖diag(diag(P))‖²_F = ‖diag(P)‖².
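The identity in Claim C.3.1 is purely algebraic in P and the scalars α0, α1, α2, β0, β2, so it can be sanity-checked numerically. The sketch below is our own illustration (not part of the proof): it draws a random P and random scalars and compares both sides.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5
P = rng.standard_normal((k, k))
a0, a1, a2, b0, b2 = rng.standard_normal(5)

d = np.diag(P)                 # diag(P) as a vector
D = np.diag(d)                 # diag(diag(P)) as a matrix
one = np.ones(k)
s = P @ one                    # P·1

A  = (b2 - b0) * d @ d + b0 * np.sum(P**2)
B1 = 2 * (a0*a2 - a0**2) * (d @ P @ one - d @ d)
B2 = a1**2 * ((d @ one)**2 - d @ d + np.sum(P * P.T) - d @ d)
B3 = a0**2 * (s @ s - np.sum(P**2))
C  = a1**2 * (d @ one)**2

C1 = np.sum((a0*s + (a2 - a0)*d)**2)
C2 = (a1**2 / 2) * np.sum((P + P.T - 2*D)**2)
C3 = (b0 - a0**2 - a1**2) * np.sum((P - D)**2)
C4 = (b2 - a1**2 - a2**2) * d @ d

# Claim C.3.1: A + B1 + B2 + B3 - C = C1 + C2 + C3 + C4
assert np.isclose(A + B1 + B2 + B3 - C, C1 + C2 + C3 + C4)
```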
Lower bound on the eigenvalues of the population Hessian at the ground
truth
Lemma C.3.3. If φ(z) satisfies Properties 1, 3 and 4, we have
∇²f_D(W*) ⪰ Ω(rρ(σ_t)/(κ²λ))·I.
Proof. Let x ∈ R^d denote the vector [x_1^T x_2^T ⋯ x_r^T]^T, where x_i = P_i x ∈ R^k for each i ∈ [r]. Thus, we can rewrite the partial derivatives. For each j ∈ [t], the second partial derivative of f_D at W* is
∂²f_D(W*)/∂w_j² = E_{(x,y)∼D}[(Σ_{i=1}^r φ'(w*_j^T x_i)·x_i)·(Σ_{i=1}^r φ'(w*_j^T x_i)·x_i)^T].
For each j, l ∈ [t] with j ≠ l, the second partial derivative of f_D(W*) with respect to w_j and w_l can be represented as
∂²f_D(W*)/∂w_j∂w_l = E_{(x,y)∼D}[(Σ_{i=1}^r φ'(w*_j^T x_i)·x_i)·(Σ_{i=1}^r φ'(w*_l^T x_i)·x_i)^T].
First we show the lower bound of the eigenvalues. The main idea is to
reduce the problem to a k-by-k problem and then lower bound the eigenvalues using
orthogonal weight matrices.
Let a ∈ R^{kt} denote the vector [a_1^T a_2^T ⋯ a_t^T]^T. The smallest eigenvalue of the Hessian can be calculated by
∇²f(W*) ⪰ (min_{‖a‖=1} a^T ∇²f(W*) a)·I_{kt}
        = (min_{‖a‖=1} E_{x∼D_d}[(Σ_{j=1}^t Σ_{i=1}^r a_j^T x_i·φ'(w*_j^T x_i))²])·I_{kt}.   (C.5)
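As an illustration only (not part of the proof), the population Hessian at W* can be estimated by Monte Carlo as E[g g^T], where g stacks the per-filter gradient contributions Σ_i φ'(w*_j^T x_i) x_i. The sketch below assumes a sigmoid φ' (so p = 0) and small toy dimensions of our choosing, then checks that the empirical spectrum is strictly positive and respects a crude t·r²·k upper bound valid when |φ'| ≤ 1.

```python
import numpy as np

rng = np.random.default_rng(0)
t, r, k = 2, 2, 3                      # filters, patches, patch dimension
Wstar = rng.standard_normal((k, t))    # ground-truth filters w*_j as columns

def phi_prime(z):                      # assume φ' = sigmoid (i.e., φ = softplus)
    return 1.0 / (1.0 + np.exp(-z))

N = 20000
H = np.zeros((t * k, t * k))
for _ in range(N):
    xs = rng.standard_normal((r, k))   # independent Gaussian patches x_i
    # g_j = sum_i φ'(w*_j^T x_i) x_i, stacked over j ∈ [t]
    g = np.concatenate([
        sum(phi_prime(xs[i] @ Wstar[:, j]) * xs[i] for i in range(r))
        for j in range(t)
    ])
    H += np.outer(g, g)
H /= N

eigs = np.linalg.eigvalsh(H)
assert eigs[0] > 0                     # strictly positive smallest eigenvalue
assert eigs[-1] < 2 * t * r * r * k    # crude O(t r^2) upper bound (p = 0 here)
```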
For each i ∈ [r], we define the function h_i : R^k → R such that
h_i(y) = Σ_{j=1}^t a_j^T y·φ'(w*_j^T y).
Then, we can analyze the smallest eigenvalue of the Hessian in the following way,
min_{‖a‖=1} E_{x∼D_d}[(Σ_{j=1}^t Σ_{i=1}^r a_j^T x_i·φ'(w*_j^T x_i))²]
= min_{‖a‖=1} E_{x∼D_d}[(Σ_{i=1}^r Σ_{j=1}^t a_j^T x_i·φ'(w*_j^T x_i))²]
= min_{‖a‖=1} E_{x∼D_d}[(Σ_{i=1}^r h_i(x_i))²]
= min_{‖a‖=1} (Σ_{i=1}^r E_{x∼D_d}[h_i²(x_i)] + Σ_{j≠l} E_{x∼D_d}[h_j(x_j)]·E_{x∼D_d}[h_l(x_l)])
= min_{‖a‖=1} (Σ_{i=1}^r (E_{x∼D_d}[h_i²(x_i)] − (E_{x∼D_d}[h_i(x_i)])²) + (Σ_{l=1}^r E_{x∼D_d}[h_l(x_l)])²)
≥ min_{‖a‖=1} Σ_{i=1}^r (E_{x∼D_d}[h_i²(x_i)] − (E_{x∼D_d}[h_i(x_i)])²)
= min_{‖a‖=1} Σ_{i=1}^r E_{x∼D_d}[(h_i(x_i) − E_{x∼D_d}[h_i(x_i)])²].
Since min_{‖a‖=1} Σ_{i=1}^r f_i(a) ≥ Σ_{i=1}^r min_{‖a‖=1} f_i(a), it suffices to consider a single i ∈ [r]:
min_{‖a‖=1} E_{x∼D_d}[(h_i(x_i) − E_{x∼D_d}[h_i(x_i)])²]
= min_{‖a‖=1} E_{y∼D_k}[(h_i(y) − E_{y∼D_k}[h_i(y)])²]
= min_{‖a‖=1} E_{y∼D_k}[(Σ_{j=1}^t a_j^T y·φ'(w*_j^T y) − E_{y∼D_k}[Σ_{j=1}^t a_j^T y·φ'(w*_j^T y)])²]
= min_{‖a‖=1} E_{y∼D_k}[(Σ_{j=1}^t a_j^T (y·φ'(w*_j^T y) − E_{y∼D_k}[y·φ'(w*_j^T y)]))²],
where the second step follows by the definition of the function h_i(y).
We define the function g : R^k → R^k such that
g(w) = E_{y∼D_k}[φ'(w^T y)·y].
Then we have
min_{‖a‖=1} E_{x∼D_d}[(h_i(x_i) − E_{x∼D_d}[h_i(x_i)])²] ≥ min_{‖a‖=1} E_{x∼D_k}[(Σ_{j=1}^t a_j^T (x·φ'(w*_j^T x) − g(w*_j)))²].   (C.6)
Let U ∈ R^{k×t} be an orthonormal basis of the column span of W* ∈ R^{k×t} and let V = [v_1 v_2 ⋯ v_t] = U^T W* ∈ R^{t×t}. Also note that V and W* have the same singular values and W* = UV. We use U_⊥ ∈ R^{k×(k−t)} to denote the complement of U. For any vector a_j ∈ R^k, there exist two vectors b_j ∈ R^t and c_j ∈ R^{k−t} such that
a_j = U·b_j + U_⊥·c_j,
with dimensions a_j : k×1, U : k×t, b_j : t×1, U_⊥ : k×(k−t), c_j : (k−t)×1. Let b ∈ R^{t²} denote the vector [b_1^T b_2^T ⋯ b_t^T]^T and let c ∈ R^{(k−t)t} denote the vector [c_1^T c_2^T ⋯ c_t^T]^T.
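A minimal sketch of this decomposition (our own illustration; the variable names are ours): U can be taken from a full QR factorization of W*, U_⊥ from the remaining columns, and any vector a then splits uniquely as U b + U_⊥ c.

```python
import numpy as np

rng = np.random.default_rng(0)
k, t = 6, 3
Wstar = rng.standard_normal((k, t))

Q, _ = np.linalg.qr(Wstar, mode='complete')  # full orthonormal basis of R^k
U, U_perp = Q[:, :t], Q[:, t:]               # span(W*) and its orthogonal complement

a = rng.standard_normal(k)
b = U.T @ a                                  # component in span(W*)
c = U_perp.T @ a                             # component in the complement

assert np.allclose(a, U @ b + U_perp @ c)    # a = U b + U_perp c
assert np.allclose(U_perp.T @ Wstar, 0)      # U_perp^T w*_i = 0 for every column
```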
Let U^T g(w*_i) = g(v*_i) ∈ R^t; then g(v*_i) = E_{z∼D_t}[φ'(v*_i^T z)·z]. Then we can rewrite formulation (C.6) as
E_{x∼D_k}[(Σ_{i=1}^t a_i^T (x·φ'(w*_i^T x) − g(w*_i)))²]
= E_{x∼D_k}[(Σ_{i=1}^t (b_i^T U^T + c_i^T U_⊥^T)·(x·φ'(w*_i^T x) − g(w*_i)))²]
= A + B + C,
where
A = E_{x∼D_k}[(Σ_{i=1}^t b_i^T U^T·(x·φ'(w*_i^T x) − g(w*_i)))²],
B = E_{x∼D_k}[(Σ_{i=1}^t c_i^T U_⊥^T·(x·φ'(w*_i^T x) − g(w*_i)))²],
C = E_{x∼D_k}[2·(Σ_{i=1}^t b_i^T U^T·(x·φ'(w*_i^T x) − g(w*_i)))·(Σ_{i=1}^t c_i^T U_⊥^T·(x·φ'(w*_i^T x) − g(w*_i)))].
We calculate A, B, C separately. First, we can show
A = E_{x∼D_k}[(Σ_{i=1}^t b_i^T U^T·(x·φ'(w*_i^T x) − g(w*_i)))²]
  = E_{z∼D_t}[(Σ_{i=1}^t b_i^T·(z·φ'(v*_i^T z) − g(v*_i)))²],
where the first step follows by the definition of A and the last step follows by U^T g(w*_i) = g(v*_i).
Second, we can show
B = E_{x∼D_k}[(Σ_{i=1}^t c_i^T U_⊥^T·(x·φ'(w*_i^T x) − g(w*_i)))²]
  = E_{x∼D_k}[(Σ_{i=1}^t c_i^T U_⊥^T·x·φ'(w*_i^T x))²]                       (by U_⊥^T g(w*_i) = 0)
  = E_{s∼D_{k−t}, z∼D_t}[(Σ_{i=1}^t c_i^T s·φ'(v*_i^T z))²]
  = E_{s∼D_{k−t}, z∼D_t}[(y^T s)²]                                            (defining y = Σ_{i=1}^t φ'(v*_i^T z)·c_i ∈ R^{k−t})
  = E_{z∼D_t}[E_{s∼D_{k−t}}[(y^T s)²]]
  = E_{z∼D_t}[E_{s∼D_{k−t}}[Σ_{j=1}^{k−t} s_j²·y_j²]]                         (by E[s_j s_{j'}] = 0 for j ≠ j')
  = E_{z∼D_t}[Σ_{j=1}^{k−t} y_j²]                                             (by s_j ∼ N(0, 1))
  = E_{z∼D_t}[‖Σ_{i=1}^t φ'(v*_i^T z)·c_i‖²]                                  (by the definition of y)
Third, we have C = 0, since U_⊥^T x is independent of w*_i^T x and U^T x, and since g(w*_i) ∝ w*_i implies U_⊥^T g(w*_i) = 0.
Thus, putting them all together,
E_{x∼D_k}[(Σ_{i=1}^t a_i^T (x·φ'(w*_i^T x) − g(w*_i)))²]
= E_{z∼D_t}[(Σ_{i=1}^t b_i^T (z·φ'(v*_i^T z) − g(v*_i)))²]   (= A)
  + E_{z∼D_t}[‖Σ_{i=1}^t φ'(v*_i^T z)·c_i‖²]                  (= B).
Let us lower bound A,
A = E_{z∼D_t}[(Σ_{i=1}^t b_i^T·(z·φ'(v*_i^T z) − g(v*_i)))²]
= ∫ (2π)^{−t/2} (Σ_{i=1}^t b_i^T (z·φ'(v*_i^T z) − g(v*_i)))² e^{−‖z‖²/2} dz
= ∫ (2π)^{−t/2} (Σ_{i=1}^t b_i^T (V^{†T}s·φ'(s_i) − g(v*_i)))² e^{−‖V^{†T}s‖²/2}·|det(V^†)| ds
≥ ∫ (2π)^{−t/2} (Σ_{i=1}^t b_i^T (V^{†T}s·φ'(s_i) − g(v*_i)))² e^{−σ_1²(V^†)‖s‖²/2}·|det(V^†)| ds
= ∫ (2π)^{−t/2} (Σ_{i=1}^t b_i^T (V^{†T}u/σ_1(V^†)·φ'(u_i/σ_1(V^†)) − g(v*_i)))² e^{−‖u‖²/2}·|det(V^†)|/σ_1^t(V^†) du
= ∫ (2π)^{−t/2} (Σ_{i=1}^t p_i^T (u·φ'(σ_t·u_i) − V^T σ_1(V^†)·g(v*_i)))² e^{−‖u‖²/2}·(1/λ) du
= (1/λ)·E_{u∼D_t}[(Σ_{i=1}^t p_i^T (u·φ'(σ_t·u_i) − V^T σ_1(V^†)·g(v*_i)))²]
≥ (1/λ)·E_{u∼D_t}[(Σ_{i=1}^t p_i^T (u·φ'(σ_t·u_i) − E_{u∼D_t}[u·φ'(σ_t·u_i)]))²],
where the first step follows by the definition of A, the second step follows by the density of the high-dimensional Gaussian distribution, the third step follows by replacing z by V^{†T}s, so that v*_i^T z = s_i, the fourth step follows by the fact ‖V^{†T}s‖ ≤ σ_1(V^†)‖s‖, the fifth step follows by replacing s by u/σ_1(V^†) (note u_i/σ_1(V^†) = σ_t·u_i), the sixth step follows by p_i^T = b_i^T V^{†T}/σ_1(V^†), the seventh step follows by the definition of the high-dimensional Gaussian distribution, and the last step follows by E[(X − C)²] ≥ E[(X − E[X])²].
Note that the φ'(σ_t·u_i)'s are independent of each other, so we can simplify the analysis. In particular, Lemma C.3.2 gives a lower bound in this case in terms of the p_i. Note that ‖p_i‖ ≥ ‖b_i‖/κ. Therefore,
E_{z∼D_t}[(Σ_{i=1}^t b_i^T z·φ'(v_i^T z))²] ≥ ρ(σ_t)·(1/(κ²λ))·‖b‖².
For B, similarly to the proof of Lemma C.1.1, we have
B = E_{z∼D_t}[‖Σ_{i=1}^t φ'(v_i^T z)·c_i‖²]
= ∫ (2π)^{−t/2} ‖Σ_{i=1}^t φ'(v_i^T z)·c_i‖² e^{−‖z‖²/2} dz
= ∫ (2π)^{−t/2} ‖Σ_{i=1}^t φ'(σ_t·u_i)·c_i‖² e^{−‖V^{†T}u/σ_1(V^†)‖²/2}·det(V^†/σ_1(V^†)) du
= ∫ (2π)^{−t/2} ‖Σ_{i=1}^t φ'(σ_t·u_i)·c_i‖² e^{−‖V^{†T}u/σ_1(V^†)‖²/2}·(1/λ) du
≥ ∫ (2π)^{−t/2} ‖Σ_{i=1}^t φ'(σ_t·u_i)·c_i‖² e^{−‖u‖²/2}·(1/λ) du
= (1/λ)·E_{u∼D_t}[‖Σ_{i=1}^t φ'(σ_t·u_i)·c_i‖²]
= (1/λ)·(Σ_{i=1}^t E_{u∼D_t}[φ'(σ_t·u_i)²·c_i^T c_i] + Σ_{i≠l} E_{u∼D_t}[φ'(σ_t·u_i)·φ'(σ_t·u_l)·c_i^T c_l])
= (1/λ)·(E_{z∼D_1}[φ'(σ_t·z)²]·Σ_{i=1}^t ‖c_i‖² + (E_{z∼D_1}[φ'(σ_t·z)])²·Σ_{i≠l} c_i^T c_l)
= (1/λ)·((E_{z∼D_1}[φ'(σ_t·z)])²·‖Σ_{i=1}^t c_i‖² + (E_{z∼D_1}[φ'(σ_t·z)²] − (E_{z∼D_1}[φ'(σ_t·z)])²)·‖c‖²)
≥ (1/λ)·(E_{z∼D_1}[φ'(σ_t·z)²] − (E_{z∼D_1}[φ'(σ_t·z)])²)·‖c‖²
≥ ρ(σ_t)·(1/λ)·‖c‖²,
where the first step follows by the definition of the Gaussian distribution, the second step follows by the substitution z = V^{†T}u/σ_1(V^†), so that v_i^T z = u_i/σ_1(V^†) = u_i·σ_t(W*), the third step follows by det(V^†/σ_1(V^†)) = det(V^†)/σ_1^t(V^†) = 1/λ, the fourth step follows by ‖u‖² ≥ ‖V^{†T}u/σ_1(V^†)‖², the fifth step follows by the definition of the Gaussian distribution, the second-to-last step follows by x² ≥ 0 for any x ∈ R, and the last step follows by Property 4.
Note that 1 = ‖a‖2 = ‖b‖2 + ‖c‖2. Thus, we finish the proof for the lower
bound.
Upper bound on the eigenvalues of the population Hessian at the ground
truth
Lemma C.3.4. If φ(z) satisfies Properties 1, 3 and 4, then
∇²f_D(W*) ⪯ O(tr²σ_1^{2p})·I.
Proof. Similarly to the proof in the previous section, we can upper bound the eigenvalues by
‖∇²f_D(W*)‖
= max_{‖a‖=1} a^T ∇²f_D(W*) a
= max_{‖a‖=1} E_{x∼D_d}[(Σ_{j=1}^t Σ_{i=1}^r a_j^T x_i·φ'(w*_j^T x_i))²]
≤ max_{‖a‖=1} E_{x∼D_d}[(Σ_{j=1}^t Σ_{i=1}^r |a_j^T x_i|·|φ'(w*_j^T x_i)|)²]
= max_{‖a‖=1} E_{x∼D_d}[Σ_{j=1}^t Σ_{i=1}^r Σ_{j'=1}^t Σ_{i'=1}^r |a_j^T x_i|·|φ'(w*_j^T x_i)|·|a_{j'}^T x_{i'}|·|φ'(w*_{j'}^T x_{i'})|]
= max_{‖a‖=1} Σ_{j=1}^t Σ_{i=1}^r Σ_{j'=1}^t Σ_{i'=1}^r E_{x∼D_d}[|a_j^T x_i|·|φ'(w*_j^T x_i)|·|a_{j'}^T x_{i'}|·|φ'(w*_{j'}^T x_{i'})|],
and we denote each summand by A_{j,i,j',i'}. It remains to bound A_{j,i,j',i'}. We have
A_{j,i,j',i'} = E_{x∼D_d}[|a_j^T x_i|·|φ'(w*_j^T x_i)|·|a_{j'}^T x_{i'}|·|φ'(w*_{j'}^T x_{i'})|]
≤ (E_{x∼D_k}[|a_j^T x|⁴]·E_{x∼D_k}[|φ'(w*_j^T x)|⁴]·E_{x∼D_k}[|a_{j'}^T x|⁴]·E_{x∼D_k}[|φ'(w*_{j'}^T x)|⁴])^{1/4}
≲ ‖a_j‖·‖a_{j'}‖·‖w*_j‖^p·‖w*_{j'}‖^p,
where the second step follows by the generalized Hölder inequality. Thus, we have
‖∇²f_D(W*)‖ ≲ tr²σ_1^{2p},
which completes the proof.
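The fourth-moment step above can be sanity-checked numerically (our own illustration with toy quantities, not the dissertation's notation): for Gaussian x, E[(a^T x)⁴] = 3‖a‖⁴, and the generalized Hölder inequality E[|XYZW|] ≤ (E X⁴·E Y⁴·E Z⁴·E W⁴)^{1/4} holds for any four random variables.

```python
import numpy as np

rng = np.random.default_rng(0)
k, N = 4, 400000
a = rng.standard_normal(k)
w = rng.standard_normal(k)
X = rng.standard_normal((N, k))

# E[(a^T x)^4] = 3 ||a||^4 for x ~ N(0, I_k)
m4 = np.mean((X @ a) ** 4)
assert abs(m4 - 3 * np.linalg.norm(a) ** 4) < 0.3 * np.linalg.norm(a) ** 4

# generalized Hölder: E|XYZW| <= (E X^4 E Y^4 E Z^4 E W^4)^{1/4};
# here the four factors are |a^T x|, |a^T x|, |w^T x|, |w^T x|
u, v = np.abs(X @ a), np.abs(X @ w)
lhs = np.mean(u * u * v * v)
rhs = (np.mean(u**4) ** 2 * np.mean(v**4) ** 2) ** 0.25
assert lhs <= rhs
```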
C.3.2 Error bound of Hessians near the ground truth for smooth activations
The goal of this Section is to prove Lemma C.3.5.
Lemma C.3.5 (Error Bound of Hessians near the Ground Truth for Smooth Activations). Let φ(z) satisfy Property 1 (with p = 0, 1), Property 4 and Property 3(a). Let W ∈ R^{k×t} satisfy ‖W − W*‖ ≤ σ_t/2. Let S denote a set of i.i.d. samples from the distribution defined in (5.1). Then for any s ≥ 1 and 0 < ε < 1/2, if
|S| ≥ ε^{−2}kκ²τ·poly(log d, s),
then we have, with probability at least 1 − 1/d^{Ω(s)},
‖∇²f_S(W) − ∇²f_D(W*)‖ ≲ r²t²σ_1^p·(εσ_1^p + ‖W − W*‖).
Proof. This follows by combining Lemma C.3.6 and Lemma C.3.7 directly.
Second-order smoothness near the ground truth for smooth activations
The goal of this Section is to prove Lemma C.3.6.
Fact C.3.1. Let w_i denote the i-th column of W ∈ R^{k×t}, and w*_i denote the i-th column of W* ∈ R^{k×t}. If ‖W − W*‖ ≤ σ_t(W*)/2, then for all i ∈ [t],
(1/2)‖w*_i‖ ≤ ‖w_i‖ ≤ (3/2)‖w*_i‖.
Proof. Note that if ‖W − W*‖ ≤ σ_t(W*)/2, we have σ_t(W*)/2 ≤ σ_i(W) ≤ (3/2)σ_1(W*) for all i ∈ [t] by Weyl's inequality. By the definition of singular values, we have σ_t(W*) ≤ ‖w*_i‖ ≤ σ_1(W*). By the definition of the spectral norm, we have ‖w_i − w*_i‖ ≤ ‖W − W*‖. Thus, we can upper bound ‖w_i‖:
‖w_i‖ ≤ ‖w*_i‖ + ‖w_i − w*_i‖ ≤ ‖w*_i‖ + ‖W − W*‖ ≤ ‖w*_i‖ + σ_t/2 ≤ (3/2)‖w*_i‖.
Similarly, we have ‖w_i‖ ≥ (1/2)‖w*_i‖.
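As a quick numerical illustration of Fact C.3.1 (our own sketch, with toy dimensions): perturb W* by a matrix of spectral norm exactly σ_t(W*)/2 and check the per-column norm bounds.

```python
import numpy as np

rng = np.random.default_rng(0)
k, t = 6, 3
Wstar = rng.standard_normal((k, t))
sigma_t = np.linalg.svd(Wstar, compute_uv=False)[-1]   # smallest singular value

Delta = rng.standard_normal((k, t))
Delta *= (sigma_t / 2) / np.linalg.norm(Delta, 2)      # force ||Delta|| = sigma_t / 2
W = Wstar + Delta

for i in range(t):
    ni, nsi = np.linalg.norm(W[:, i]), np.linalg.norm(Wstar[:, i])
    assert 0.5 * nsi <= ni <= 1.5 * nsi                # Fact C.3.1 column bounds
```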
Lemma C.3.6 (Second-order Smoothness near the Ground Truth for Smooth Activations). If φ(z) satisfies Property 1 (with p = 0, 1), Property 4 and Property 3(a), then for any W ∈ R^{k×t} with ‖W − W*‖ ≤ σ_t/2, we have
‖∇²f_D(W) − ∇²f_D(W*)‖ ≲ r²t²σ_1^p‖W − W*‖.
Proof. Recall that x ∈ R^d denotes the vector [x_1^T x_2^T ⋯ x_r^T]^T, where x_i = P_i x ∈ R^k, ∀i ∈ [r], and d = rk. Recall that for each (x, y) ∼ D or (x, y) ∈ S,
y = Σ_{j=1}^t Σ_{i=1}^r φ(w*_j^T x_i).
Let ∆ = ∇²f_D(W) − ∇²f_D(W*), and for each (j, l) ∈ [t]×[t], let ∆_{j,l} ∈ R^{k×k} denote the corresponding block. Then for any j ≠ l, we have
∆_{j,l} = E_{x∼D_d}[(Σ_{i=1}^r φ'(w_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w_l^T x_i)x_i)^T − (Σ_{i=1}^r φ'(w*_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w*_l^T x_i)x_i)^T]
= Σ_{i=1}^r E_{x∼D_k}[(φ'(w_j^T x)φ'(w_l^T x) − φ'(w*_j^T x)φ'(w*_l^T x))xx^T]
  + Σ_{i≠i'} E_{y∼D_k, z∼D_k}[φ'(w_j^T y)·y·φ'(w_l^T z)·z^T − φ'(w*_j^T y)·y·φ'(w*_l^T z)·z^T]
= ∆^{(1)}_{j,l} + ∆^{(2)}_{j,l}.
Using Claim C.3.2 and Claim C.3.3, we can bound ∆^{(1)}_{j,l} and ∆^{(2)}_{j,l}.
For any j ∈ [t], we have
∆j,j = Ex∼Dd
[(t∑
l=1
r∑i=1
φ(w>l xi)− y
)·
(r∑
i=1
φ′′(w>j xi)xix
>i
)]
+ Ex∼Dd
( r∑i=1
φ′(w>j xi)xi
)·
(r∑
i=1
φ′(w>j xi)xi
)>
− Ex∼Dd
( r∑i=1
φ′(w∗>j xi)xi
)·
(r∑
i=1
φ′(w∗>j xi)xi
)>
= Ex∼Dd
[(t∑
l=1
r∑i=1
(φ(w>l xi)− φ(w∗>
l xi))
)·
(r∑
i=1
φ′′(w>j xi)xix
>i
)]
+ Ex∼Dd
[r∑
i=1
r∑i′=1
(φ′(w>
j xi)φ′(w>
j xi′)− φ′(w∗>j xi)φ
′(w∗>j xi′)
)xix
>i′
]= ∆
(1)j,j +∆
(2)j,j ,
where the first step follows by∇2fD(W )−∇2fD(W∗), the second step follows by
the definition of y.
Using Claim C.3.4, we can bound ∆(1)j,j . Using Claim C.3.5, we can bound
∆(2)j,j .
Putting it all together, we can bound the error by
‖∇²f_D(W) − ∇²f_D(W*)‖
= max_{‖a‖=1} a^T(∇²f_D(W) − ∇²f_D(W*))a
= max_{‖a‖=1} Σ_{j=1}^t Σ_{l=1}^t a_j^T ∆_{j,l} a_l
= max_{‖a‖=1} (Σ_{j=1}^t a_j^T ∆_{j,j} a_j + Σ_{j≠l} a_j^T ∆_{j,l} a_l)
≤ max_{‖a‖=1} (Σ_{j=1}^t ‖∆_{j,j}‖‖a_j‖² + Σ_{j≠l} ‖∆_{j,l}‖‖a_j‖‖a_l‖)
≤ max_{‖a‖=1} (Σ_{j=1}^t C_1‖a_j‖² + Σ_{j≠l} C_2‖a_j‖‖a_l‖)
= max_{‖a‖=1} (C_1 Σ_{j=1}^t ‖a_j‖² + C_2((Σ_{j=1}^t ‖a_j‖)² − Σ_{j=1}^t ‖a_j‖²))
≤ max_{‖a‖=1} (C_1 Σ_{j=1}^t ‖a_j‖² + C_2(t·Σ_{j=1}^t ‖a_j‖² − Σ_{j=1}^t ‖a_j‖²))
= C_1 + C_2(t − 1)
≲ r²t²L_1L_2σ_1^p(W*)‖W − W*‖,
where the first step follows by the definition of the spectral norm with a ∈ R^{tk}, the first inequality follows by ‖A‖ = max_{x≠0, y≠0} x^T A y/(‖x‖‖y‖), the second inequality follows by ‖∆_{j,j}‖ ≤ C_1 and ‖∆_{j,l}‖ ≤ C_2, the third inequality follows by the Cauchy-Schwarz inequality, the following equality uses Σ_{j=1}^t ‖a_j‖² = 1, and the last step follows by Claims C.3.2, C.3.3, C.3.4 and C.3.5.
Thus, we complete the proof.
Claim C.3.2. For each (j, l) ∈ [t]×[t] with j ≠ l, ‖∆^{(1)}_{j,l}‖ ≲ r²L_1L_2σ_1^p(W*)‖W − W*‖.
Proof. Recall the definition of ∆^{(1)}_{j,l},
Σ_{i=1}^r E_{x∼D_k}[(φ'(w_j^T x)φ'(w_l^T x) − φ'(w*_j^T x)φ'(w*_l^T x))xx^T].
In order to upper bound ‖∆^{(1)}_{j,l}‖, it suffices to upper bound the spectral norm of the quantity
E_{x∼D_k}[(φ'(w_j^T x)φ'(w_l^T x) − φ'(w*_j^T x)φ'(w*_l^T x))xx^T].
We have
‖E_{x∼D_k}[(φ'(w_j^T x)φ'(w_l^T x) − φ'(w*_j^T x)φ'(w*_l^T x))xx^T]‖
≤ max_{‖a‖=1} E_{x∼D_k}[|φ'(w_j^T x)φ'(w_l^T x) − φ'(w*_j^T x)φ'(w*_l^T x)|(x^T a)²]
≤ max_{‖a‖=1} (E_{x∼D_k}[|φ'(w_j^T x)φ'(w_l^T x) − φ'(w_j^T x)φ'(w*_l^T x)|(x^T a)²] + E_{x∼D_k}[|φ'(w_j^T x)φ'(w*_l^T x) − φ'(w*_j^T x)φ'(w*_l^T x)|(x^T a)²])
= max_{‖a‖=1} (E_{x∼D_k}[|φ'(w_j^T x)|·|φ'(w_l^T x) − φ'(w*_l^T x)|(x^T a)²] + E_{x∼D_k}[|φ'(w*_l^T x)|·|φ'(w_j^T x) − φ'(w*_j^T x)|(x^T a)²]).
We can upper bound the first term of the above equation in the following way,
max_{‖a‖=1} E_{x∼D_k}[|φ'(w_j^T x)|·|φ'(w_l^T x) − φ'(w*_l^T x)|(x^T a)²]
≤ 2L_1L_2 E_{x∼D_k}[|w_j^T x|^p·|(w_l − w*_l)^T x|·|x^T a|²]
≲ L_1L_2σ_1^p‖W − W*‖.
Similarly, we can upper bound the second term. By summing over the O(r²) terms, we complete the proof.
Claim C.3.3. For each (j, l) ∈ [t]×[t] with j ≠ l, ‖∆^{(2)}_{j,l}‖ ≲ r²L_1L_2σ_1^p(W*)‖W − W*‖.
Proof. Note that
E_{y∼D_k, z∼D_k}[φ'(w_j^T y)·y·φ'(w_l^T z)·z^T − φ'(w*_j^T y)·y·φ'(w*_l^T z)·z^T]
= E_{y∼D_k, z∼D_k}[φ'(w_j^T y)·y·φ'(w_l^T z)·z^T − φ'(w_j^T y)·y·φ'(w*_l^T z)·z^T]
  + E_{y∼D_k, z∼D_k}[φ'(w_j^T y)·y·φ'(w*_l^T z)·z^T − φ'(w*_j^T y)·y·φ'(w*_l^T z)·z^T].
We consider the first term as follows; the second term is similar.
‖E_{y∼D_k, z∼D_k}[φ'(w_j^T y)·y·φ'(w_l^T z)·z^T − φ'(w_j^T y)·y·φ'(w*_l^T z)·z^T]‖
= ‖E_{y∼D_k, z∼D_k}[φ'(w_j^T y)·(φ'(w_l^T z) − φ'(w*_l^T z))·yz^T]‖
≤ max_{‖a‖=‖b‖=1} E_{y,z∼D_k}[|φ'(w_j^T y)|·|φ'(w_l^T z) − φ'(w*_l^T z)|·|a^T y|·|b^T z|]
≲ L_1L_2σ_1^p(W*)‖W − W*‖.
By summing over the O(r²) terms, we complete the proof.
Claim C.3.4. For each j ∈ [t], ‖∆^{(1)}_{j,j}‖ ≲ r²tL_1L_2σ_1^p(W*)‖W − W*‖.
Proof. Recall the definition of ∆^{(1)}_{j,j},
∆^{(1)}_{j,j} = E_{x∼D_d}[(Σ_{l=1}^t Σ_{i=1}^r (φ(w_l^T x_i) − φ(w*_l^T x_i)))·(Σ_{i=1}^r φ''(w_j^T x_i)x_i x_i^T)].
In order to upper bound ‖∆^{(1)}_{j,j}‖, it suffices to upper bound the spectral norm of the quantity
E_{x∼D_d}[(φ(w_l^T x_i) − φ(w*_l^T x_i))·φ''(w_j^T x_{i'})x_{i'}x_{i'}^T] = E_{y,z∼D_k}[(φ(w_l^T y) − φ(w*_l^T y))·φ''(w_j^T z)zz^T].
Thus, we have
‖E_{y,z∼D_k}[(φ(w_l^T y) − φ(w*_l^T y))·φ''(w_j^T z)zz^T]‖
≤ max_{‖a‖=1} E_{y,z∼D_k}[|φ(w_l^T y) − φ(w*_l^T y)|·|φ''(w_j^T z)|·(z^T a)²]
≤ max_{‖a‖=1} E_{y,z∼D_k}[|φ(w_l^T y) − φ(w*_l^T y)|·L_2·(z^T a)²]
≤ L_2 max_{‖a‖=1} E_{y,z∼D_k}[max_{u∈[w_l^T y, w*_l^T y]}|φ'(u)|·|(w_l − w*_l)^T y|·(z^T a)²]
≤ L_2 max_{‖a‖=1} E_{y,z∼D_k}[max_{u∈[w_l^T y, w*_l^T y]}L_1|u|^p·|(w_l − w*_l)^T y|·(z^T a)²]
≤ L_1L_2 max_{‖a‖=1} E_{y,z∼D_k}[(|w_l^T y|^p + |w*_l^T y|^p)·|(w_l − w*_l)^T y|·(z^T a)²]
≲ L_1L_2(‖w_l‖^p + ‖w*_l‖^p)‖w_l − w*_l‖.
By summing over all the O(tr²) terms and using the triangle inequality, we finish the proof.
Claim C.3.5. For each j ∈ [t], ‖∆^{(2)}_{j,j}‖ ≲ r²tL_1L_2σ_1^p(W*)‖W − W*‖.
Proof. Recall the definition of ∆^{(2)}_{j,j},
E_{x∼D_d}[Σ_{i=1}^r Σ_{i'=1}^r (φ'(w_j^T x_i)φ'(w_j^T x_{i'}) − φ'(w*_j^T x_i)φ'(w*_j^T x_{i'}))x_i x_{i'}^T].
In order to upper bound ‖∆^{(2)}_{j,j}‖, it suffices to upper bound the spectral norms of these two quantities: the diagonal term
E_{y∼D_k}[(φ'²(w_j^T y) − φ'²(w*_j^T y))yy^T]
and the off-diagonal term
E_{y,z∼D_k}[(φ'(w_j^T y)φ'(w_j^T z) − φ'(w*_j^T y)φ'(w*_j^T z))yz^T].
These two terms can be bounded using proofs similar to those of the other Claims of this Section.
Empirical and population difference for smooth activations
The goal of this Section is to prove Lemma C.3.7. For each i ∈ [t], let σ_i denote the i-th largest singular value of W* ∈ R^{k×t}.
Note that the matrix Bernstein inequality requires the spectral norm of each random matrix to be bounded almost surely. However, since we assume a Gaussian distribution for x, ‖x‖² is not bounded almost surely. The main idea is to truncate first and then apply the matrix Bernstein inequality. Details can be found in Lemma D.3.9 and Corollary B.1.1.
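The truncation idea can be illustrated as follows (a sketch with our own toy example, not the dissertation's construction): to estimate E[xx^T] = I_k from Gaussian samples, drop the rare samples whose norm exceeds a radius R, so that each remaining summand xx^T has spectral norm at most R² almost surely, as the matrix Bernstein inequality requires; for a large enough R the bias is negligible.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 5, 20000
X = rng.standard_normal((n, k))

# truncation radius R ~ sqrt(k log n): ||x|| > R happens with tiny probability
R = np.sqrt(k * np.log(n))
norms = np.linalg.norm(X, axis=1)
Xt = X[norms <= R]                       # keep only the bounded samples

# each summand x x^T now has spectral norm <= R^2 almost surely,
# so matrix Bernstein applies to the truncated empirical average
emp = Xt.T @ Xt / len(Xt)
err = np.linalg.norm(emp - np.eye(k), 2)
assert err < 0.1                         # close to the population covariance I_k
```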
Lemma C.3.7 (Empirical and Population Difference for Smooth Activations). Let φ(z) satisfy Properties 1, 4 and 3(a). Let W ∈ R^{k×t} satisfy ‖W − W*‖ ≤ σ_t/2. Let S denote a set of i.i.d. samples from distribution D (defined in (5.1)). Then for any s ≥ 1 and 0 < ε < 1/2, if
|S| ≥ ε^{−2}kκ²τ·poly(log d, s),
then we have, with probability at least 1 − 1/d^{Ω(s)},
‖∇²f_S(W) − ∇²f_D(W)‖ ≲ r²t²σ_1^p(εσ_1^p + ‖W − W*‖).
Proof. Recall that x ∈ R^d denotes the vector [x_1^T x_2^T ⋯ x_r^T]^T, where x_i = P_i x ∈ R^k, ∀i ∈ [r], and d = rk. Recall that for each (x, y) ∼ D or (x, y) ∈ S,
y = Σ_{j=1}^t Σ_{i=1}^r φ(w*_j^T x_i).
Define ∆ = ∇²f_D(W) − ∇²f_S(W). Let us first consider the diagonal blocks. Define
∆_{j,j} = E_{(x,y)∼D}[(Σ_{i=1}^r φ'(w_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w_j^T x_i)x_i)^T + (Σ_{l=1}^t Σ_{i=1}^r φ(w_l^T x_i) − y)·(Σ_{i=1}^r φ''(w_j^T x_i)x_i x_i^T)]
− (1/|S|) Σ_{(x,y)∈S} [(Σ_{i=1}^r φ'(w_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w_j^T x_i)x_i)^T + (Σ_{l=1}^t Σ_{i=1}^r φ(w_l^T x_i) − y)·(Σ_{i=1}^r φ''(w_j^T x_i)x_i x_i^T)].
Further, we can decompose ∆_{j,j} into ∆_{j,j} = ∆^{(1)}_{j,j} + ∆^{(2)}_{j,j}, where
∆^{(1)}_{j,j} = E_{(x,y)∼D}[(Σ_{l=1}^t Σ_{i=1}^r φ(w_l^T x_i) − y)·(Σ_{i=1}^r φ''(w_j^T x_i)x_i x_i^T)]
  − (1/|S|) Σ_{(x,y)∈S} [(Σ_{l=1}^t Σ_{i=1}^r φ(w_l^T x_i) − y)·(Σ_{i=1}^r φ''(w_j^T x_i)x_i x_i^T)]
= E_{(x,y)∼D}[(Σ_{l=1}^t Σ_{i=1}^r (φ(w_l^T x_i) − φ(w*_l^T x_i)))·(Σ_{i=1}^r φ''(w_j^T x_i)x_i x_i^T)]
  − (1/|S|) Σ_{(x,y)∈S} [(Σ_{l=1}^t Σ_{i=1}^r (φ(w_l^T x_i) − φ(w*_l^T x_i)))·(Σ_{i=1}^r φ''(w_j^T x_i)x_i x_i^T)]
= Σ_{l=1}^t Σ_{i=1}^r Σ_{i'=1}^r (E_{x∼D_d}[(φ(w_l^T x_i) − φ(w*_l^T x_i))φ''(w_j^T x_{i'})x_{i'}x_{i'}^T] − (1/|S|) Σ_{x∈S}[(φ(w_l^T x_i) − φ(w*_l^T x_i))φ''(w_j^T x_{i'})x_{i'}x_{i'}^T])
and
∆^{(2)}_{j,j} = E_{(x,y)∼D}[(Σ_{i=1}^r φ'(w_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w_j^T x_i)x_i)^T]
  − (1/|S|) Σ_{(x,y)∈S} [(Σ_{i=1}^r φ'(w_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w_j^T x_i)x_i)^T]
= Σ_{i=1}^r Σ_{i'=1}^r (E_{x∼D_d}[φ'(w_j^T x_i)x_i φ'(w_j^T x_{i'})x_{i'}^T] − (1/|S|) Σ_{x∈S}[φ'(w_j^T x_i)x_i φ'(w_j^T x_{i'})x_{i'}^T]).
The off-diagonal block is
∆_{j,l} = E_{(x,y)∼D}[(Σ_{i=1}^r φ'(w_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w_l^T x_i)x_i)^T] − (1/|S|) Σ_{x∈S}[(Σ_{i=1}^r φ'(w_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w_l^T x_i)x_i)^T]
= Σ_{i=1}^r Σ_{i'=1}^r (E_{x∼D_d}[φ'(w_j^T x_i)x_i φ'(w_l^T x_{i'})x_{i'}^T] − (1/|S|) Σ_{x∈S}[φ'(w_j^T x_i)x_i φ'(w_l^T x_{i'})x_{i'}^T]).
Note that ∆^{(2)}_{j,j} is a special case of ∆_{j,l}, so we just bound ∆_{j,l}. Combining Claims C.3.6 and C.3.7, and taking a union bound over the t² different blocks ∆_{j,l}, we obtain that if |S| ≥ ε^{−2}kτκ²·poly(log d, s), then with probability at least 1 − 1/d^{4s},
‖∇²f_S(W) − ∇²f_D(W)‖ ≲ t²r²σ_1^p(W*)·(εσ_1^p(W*) + ‖W − W*‖).
Therefore, we complete the proof.
Claim C.3.6. For each j ∈ [t], if |S| ≥ k·poly(log d, s), then
‖∆^{(1)}_{j,j}‖ ≲ r²tσ_1^p(W*)‖W − W*‖
holds with probability at least 1 − 1/d^{4s}.
Proof. Define B*_{i,i',j,l} to be
B*_{i,i',j,l} = E_{x∼D_d}[(φ(w_l^T x_i) − φ(w*_l^T x_i))φ''(w_j^T x_{i'})x_{i'}x_{i'}^T] − (1/|S|) Σ_{x∈S}[(φ(w_l^T x_i) − φ(w*_l^T x_i))φ''(w_j^T x_{i'})x_{i'}x_{i'}^T].
For each l ∈ [t], we define the function A_l(x, x') : R^{2k} → R^{k×k},
A_l(x, x') = L_1L_2·(|w_l^T x|^p + |w*_l^T x|^p)·|(w_l − w*_l)^T x|·x'x'^T.
Using Properties 1, 4 and 3(a), we have for each x ∈ S and each (i, i') ∈ [r]×[r],
−A_l(x_i, x_{i'}) ⪯ (φ(w_l^T x_i) − φ(w*_l^T x_i))·φ''(w_j^T x_{i'})x_{i'}x_{i'}^T ⪯ A_l(x_i, x_{i'}).
Therefore,
∆^{(1)}_{j,j} ⪯ Σ_{i=1}^r Σ_{i'=1}^r Σ_{l=1}^t (E_{x∼D_d}[A_l(x_i, x_{i'})] + (1/|S|) Σ_{x∈S} A_l(x_i, x_{i'})).
Let h_l(x) = L_1L_2|w_l^T x|^p·|(w_l − w*_l)^T x|. Let D_k denote the Gaussian distribution N(0, I_k). Let B_l = E_{x,x'∼D_k²}[h_l(x)x'x'^T]. We define the function B_l(x, x') : R^{2k} → R^{k×k} such that
B_l(x, x') = h_l(x)·x'x'^T.
(I) Bounding |h_l(x)|.
According to Lemma 2.4.2, we have for any constant s ≥ 1, with probability 1 − 1/(nd^{8s}),
|h_l(x)| = L_1L_2|w_l^T x|^p|(w_l − w*_l)^T x| ≲ ‖w_l‖^p‖w_l − w*_l‖·poly(s, log n).
(II) Bounding ‖B_l‖.
‖B_l‖ ≥ E_{x∼D_k}[L_1L_2|w_l^T x|^p|(w_l − w*_l)^T x|]·E_{x'∼D_k}[((w_l − w*_l)^T x'/‖w_l − w*_l‖)²] ≳ ‖w_l‖^p‖w_l − w*_l‖,
where the first step follows by the definition of the spectral norm, and the last step follows by Lemma 2.4.2. Using Lemma 2.4.2, we can also prove an upper bound for ‖B_l‖:
‖B_l‖ ≲ L_1L_2‖w_l‖^p‖w_l − w*_l‖.
(III) Bounding (E_{x∼D_k}[h_l⁴(x)])^{1/4}.
Using Lemma 2.4.2, we have
(E_{x∼D_k}[h_l⁴(x)])^{1/4} = L_1L_2(E_{x∼D_k}[(|w_l^T x|^p|(w_l − w*_l)^T x|)⁴])^{1/4} ≲ ‖w_l‖^p‖w_l − w*_l‖.
By applying Corollary B.1.1, for each (i, i') ∈ [r]×[r], if n ≥ ε^{−2}k·poly(log d, s), then with probability 1 − 1/d^{8s},
‖E_{x∼D_d}[|w_l^T x_i|^p·|(w_l − w*_l)^T x_i|·x_{i'}x_{i'}^T] − (1/|S|) Σ_{x∈S} |w_l^T x_i|^p·|(w_l − w*_l)^T x_i|·x_{i'}x_{i'}^T‖
= ‖B_l − (1/|S|) Σ_{x∈S} B_l(x_i, x_{i'})‖
≤ ε‖B_l‖
≲ ε‖w_l‖^p‖w_l − w*_l‖.   (C.7)
If ε ≤ 1/2, we have
‖∆^{(1)}_{j,j}‖ ≲ Σ_{i=1}^r Σ_{i'=1}^r Σ_{l=1}^t ‖B_l‖ ≲ r²tσ_1^p(W*)‖W − W*‖.
Claim C.3.7. For each (j, l) ∈ [t]×[t] with j ≠ l, if |S| ≥ ε^{−2}kτκ²·poly(log d, s), then
‖∆_{j,l}‖ ≲ εr²σ_1^{2p}(W*)
holds with probability at least 1 − 1/d^{4s}.
Proof. Recall
∆_{j,l} = E_{(x,y)∼D}[(Σ_{i=1}^r φ'(w_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w_l^T x_i)x_i)^T] − (1/|S|) Σ_{x∈S}[(Σ_{i=1}^r φ'(w_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w_l^T x_i)x_i)^T].
Recall that x = [x_1^T x_2^T ⋯ x_r^T]^T, x_i ∈ R^k, ∀i ∈ [r], and d = rk. We define X = [x_1 x_2 ⋯ x_r] ∈ R^{k×r}. Let φ'(X^T w_j) ∈ R^r denote the vector
[φ'(x_1^T w_j) φ'(x_2^T w_j) ⋯ φ'(x_r^T w_j)]^T ∈ R^r.
We define the function B(x) : R^d → R^{k×k} such that
B(x) = X·φ'(X^T w_j)·φ'(X^T w_l)^T·X^T,
with dimensions k×r, r×1, 1×r and r×k, respectively. Therefore,
∆_{j,l} = E_{(x,y)∼D}[B(x)] − (1/|S|) Σ_{x∈S}[B(x)].
To apply Lemma D.3.9, we show the following.
(I) We have
‖B(x)‖ ≲ (Σ_{i=1}^r |w_j^T x_i|^p‖x_i‖)·(Σ_{i=1}^r |w_l^T x_i|^p‖x_i‖).
By using Lemmas 2.4.2 and 2.4.3, we have with probability 1 − 1/(nd^{4s}),
‖B(x)‖ ≲ r²k‖w_j‖^p‖w_l‖^p log n.
(II) We have
E_{x∼D_d}[B(x)] = Σ_{i=1}^r E_{x∼D_d}[φ'(w_j^T x_i)x_i φ'(w_l^T x_i)x_i^T] + Σ_{i≠i'} E_{x_i∼D_k}[φ'(w_j^T x_i)x_i]·E_{x_{i'}∼D_k}[φ'(w_l^T x_{i'})x_{i'}^T]
= B_1 + B_2.
Let us first consider B_1. Let U ∈ R^{k×2} be an orthogonal basis of span{w_j, w_l} and U_⊥ ∈ R^{k×(k−2)} be the complementary matrix of U. Let the matrix V := [v_1 v_2] ∈ R^{2×2} denote U^T[w_j w_l], so that UV = [w_j w_l] ∈ R^{k×2}. Given any vector a ∈ R^k, there exist vectors b ∈ R² and c ∈ R^{k−2} such that a = Ub + U_⊥c. We can simplify ‖B_1‖ in the following way,
‖B_1‖ = r‖E_{x∼D_k}[φ'(w_j^T x)φ'(w_l^T x)xx^T]‖
= r max_{‖a‖=1} E_{x∼D_k}[φ'(w_j^T x)φ'(w_l^T x)(x^T a)²]
= r max_{‖b‖²+‖c‖²=1} E_{x∼D_k}[φ'(w_j^T x)φ'(w_l^T x)(b^T U^T x + c^T U_⊥^T x)²]
= r max_{‖b‖²+‖c‖²=1} E_{x∼D_k}[φ'(w_j^T x)φ'(w_l^T x)((b^T U^T x)² + (c^T U_⊥^T x)²)]
= r max_{‖b‖²+‖c‖²=1} (E_{z∼D_2}[φ'(v_1^T z)φ'(v_2^T z)(b^T z)²] + E_{z∼D_2, s∼D_{k−2}}[φ'(v_1^T z)φ'(v_2^T z)(c^T s)²]),
where the cross term vanishes because U_⊥^T x is independent of U^T x and has mean zero; we denote the two expectations by A_1 and A_2, respectively. Obviously, A_1 ≥ 0. For the term A_2, we have
A_2 = E_{z∼D_2, s∼D_{k−2}}[φ'(v_1^T z)φ'(v_2^T z)(c^T s)²]
= E_{z∼D_2}[φ'(v_1^T z)φ'(v_2^T z)]·E_{s∼D_{k−2}}[(c^T s)²]
= ‖c‖²·E_{z∼D_2}[φ'(v_1^T z)φ'(v_2^T z)]
≥ ‖c‖²·(σ_2(V)/σ_1(V))·(E_{z∼D_1}[φ'(σ_2(V)·z)])²
≳ ‖c‖²·(1/κ(W*))·ρ(σ_2(V)).
Then, if we set b = 0, we have
‖E_{x∼D_d}[B(x)]‖ ≥ max_{‖a‖=1}|a^T E_{x∼D_d}[B(x)]a| ≥ max_{‖a‖=1}|a^T B_1 a| ≥ (r/κ(W*))·ρ(σ_2(V)).
The second inequality follows by the fact that E_{x_i∼D_k}[φ'(w_j^T x_i)x_i] ∝ w_j and a ∈ span(U_⊥). The upper bound can be obtained following [153] as
‖E_{x∼D_d}[B(x)]‖ ≲ r²L_1²σ_1^{2p}.
(III) We have
max(‖E_{x∼D_d}[B(x)B(x)^T]‖, ‖E_{x∼D_d}[B(x)^T B(x)]‖)
= max_{‖a‖=1} E_{x∼D_d}[|a^T Xφ'(X^T w_j)φ'(X^T w_l)^T X^T Xφ'(X^T w_l)φ'(X^T w_j)^T X^T a|]
≲ r⁴L_1⁴σ_1^{4p}k.
(IV) We have
max_{‖a‖=‖b‖=1}(E_{x∼D_d}[(a^T B(x)b)²])^{1/2} = max_{‖a‖=‖b‖=1}(E_{x∼N(0,I_d)}[(a^T Xφ'(X^T w_j)φ'(X^T w_l)^T X^T b)²])^{1/2} ≲ r²L_1²σ_1^{2p}.
Therefore, applying Lemma D.3.9, if |S| ≥ ε^{−2}κ²τk·poly(log d, s), then
‖∆_{j,l}‖ ≤ εr²σ_1^{2p}
holds with probability at least 1 − 1/d^{Ω(s)}.
Claim C.3.8. For each j ∈ [t], if |S| ≥ ε^{−2}kτκ²·poly(log d, s), then
‖∆^{(2)}_{j,j}‖ ≲ εr²tσ_1^{2p}(W*)
holds with probability at least 1 − 1/d^{4s}.
Proof. The proof is identical to that of Claim C.3.7.
C.3.3 Error bound of Hessians near the ground truth for non-smooth activations
The goal of this Section is to prove Lemma C.3.8.
Lemma C.3.8 (Error Bound of Hessians near the Ground Truth for Non-smooth Activations). Let φ(z) satisfy Properties 1, 4 and 3(b). Let W ∈ R^{k×t} satisfy ‖W − W*‖ ≤ σ_t/2. Let S denote a set of i.i.d. samples from the distribution defined in (5.1). Then for any s ≥ 1 and 0 < ε < 1/2, if
|S| ≥ ε^{−2}kκ²τ·poly(log d, s),
then with probability at least 1 − 1/d^{Ω(s)},
‖∇²f_S(W) − ∇²f_D(W*)‖ ≲ r²t²σ_1^{2p}(ε + (‖W − W*‖/σ_t)^{1/2}).
Proof. Recall that x ∈ R^d denotes the vector [x_1^T x_2^T ⋯ x_r^T]^T, where x_i = P_i x ∈ R^k, ∀i ∈ [r], and d = rk.
As we noted previously, when Property 3(b) holds, the diagonal blocks of the empirical Hessian can be written, with probability 1, for all j ∈ [t], as
∂²f_S(W)/∂w_j² = (1/|S|) Σ_{x∈S}(Σ_{i=1}^r φ'(w_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w_j^T x_i)x_i)^T.
We also know that, for each (j, l) ∈ [t]×[t] with j ≠ l,
∂²f_S(W)/∂w_j∂w_l = (1/|S|) Σ_{x∈S}(Σ_{i=1}^r φ'(w_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w_l^T x_i)x_i)^T.
We define H_D(W) ∈ R^{tk×tk} such that for each j ∈ [t], the diagonal block H_D(W)_{j,j} ∈ R^{k×k} is
H_D(W)_{j,j} = E_{x∼D_d}[(Σ_{i=1}^r φ'(w_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w_j^T x_i)x_i)^T],
and for each (j, l) ∈ [t]×[t] with j ≠ l, the off-diagonal block H_D(W)_{j,l} ∈ R^{k×k} is
H_D(W)_{j,l} = E_{x∼D_d}[(Σ_{i=1}^r φ'(w_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w_l^T x_i)x_i)^T].
Recall the definition of ∇²f_D(W*): for each j ∈ [t], the diagonal block is
∂²f_D(W*)/∂w_j² = E_{(x,y)∼D}[(Σ_{i=1}^r φ'(w*_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w*_j^T x_i)x_i)^T],
and for each j, l ∈ [t] with j ≠ l, the off-diagonal block is
∂²f_D(W*)/∂w_j∂w_l = E_{(x,y)∼D}[(Σ_{i=1}^r φ'(w*_j^T x_i)x_i)·(Σ_{i=1}^r φ'(w*_l^T x_i)x_i)^T].
Thus, we can show
‖∇²f_S(W) − ∇²f_D(W*)‖ = ‖∇²f_S(W) − H_D(W) + H_D(W) − ∇²f_D(W*)‖
≤ ‖∇²f_S(W) − H_D(W)‖ + ‖H_D(W) − ∇²f_D(W*)‖
≲ εr²t²σ_1^{2p} + r²t²σ_1^{2p}(‖W − W*‖/σ_t)^{1/2},
where the first inequality follows by the triangle inequality, and the last step follows by Lemma C.3.9 and Lemma C.3.10.
Lemma C.3.9. If |S| ≥ ε^{−2}kτκ²·poly(log d, s), then we have
‖H_D(W) − ∇²f_S(W)‖ ≲ εr²t²σ_1^{2p}(W*).
Proof. Using Claim C.3.7, we can bound the spectral norm of all the off-diagonal blocks, and using Claim C.3.8, we can bound the spectral norm of all the diagonal blocks.
Lemma C.3.10. Let φ(z) satisfy Properties 1, 4 and 3(b). For any W ∈ R^{k×t}, if ‖W − W*‖ ≤ σ_t/2, then we have
‖H_D(W) − ∇²f_D(W*)‖ ≲ r²t²σ_1^{2p}(W*)·(‖W − W*‖/σ_t(W*))^{1/2}.
Proof. This follows by a technique similar to [153]. Let ∆ = H_D(W) − ∇²f_D(W*). For each j ∈ [t], the diagonal block is
∆_{j,j} = E_{x∼D_d}[Σ_{i=1}^r Σ_{i'=1}^r (φ'(w_j^T x_i)φ'(w_j^T x_{i'}) − φ'(w*_j^T x_i)φ'(w*_j^T x_{i'}))x_i x_{i'}^T]
= E_{x∼D_d}[Σ_{i=1}^r (φ'²(w_j^T x_i) − φ'²(w*_j^T x_i))x_i x_i^T]
  + E_{x∼D_d}[Σ_{i≠i'} (φ'(w_j^T x_i)φ'(w_j^T x_{i'}) − φ'(w*_j^T x_i)φ'(w*_j^T x_{i'}))x_i x_{i'}^T]
= ∆^{(1)}_{j,j} + ∆^{(2)}_{j,j}.
For each (j, l) ∈ [t]×[t] with j ≠ l, the off-diagonal block is
∆_{j,l} = E_{x∼D_d}[Σ_{i=1}^r Σ_{i'=1}^r (φ'(w_j^T x_i)φ'(w_l^T x_{i'}) − φ'(w*_j^T x_i)φ'(w*_l^T x_{i'}))x_i x_{i'}^T]
= E_{x∼D_d}[Σ_{i=1}^r (φ'(w_j^T x_i)φ'(w_l^T x_i) − φ'(w*_j^T x_i)φ'(w*_l^T x_i))x_i x_i^T]
  + E_{x∼D_d}[Σ_{i≠i'} (φ'(w_j^T x_i)φ'(w_l^T x_{i'}) − φ'(w*_j^T x_i)φ'(w*_l^T x_{i'}))x_i x_{i'}^T]
= ∆^{(1)}_{j,l} + ∆^{(2)}_{j,l}.
Applying Claims C.3.9 and C.3.10 completes the proof.
Claim C.3.9. Let φ(z) satisfy Properties 1, 4 and 3(b). For any W ∈ R^{k×t}, if ‖W − W*‖ ≤ σ_t/2, then we have
max(‖∆^{(1)}_{j,j}‖, ‖∆^{(1)}_{j,l}‖) ≲ rσ_1^{2p}(W*)·(‖W − W*‖/σ_t(W*))^{1/2}.
Proof. We want to bound the spectral norm of
E_{x∼D_k}[(φ'(w_j^T x)φ'(w_l^T x) − φ'(w*_j^T x)φ'(w*_l^T x))xx^T].
We first show that
‖E_{x∼D_k}[(φ'(w_j^T x)φ'(w_l^T x) − φ'(w*_j^T x)φ'(w*_l^T x))xx^T]‖
≤ max_{‖a‖=1} E_{x∼D_k}[|φ'(w_j^T x)φ'(w_l^T x) − φ'(w*_j^T x)φ'(w*_l^T x)|(x^T a)²]
≤ max_{‖a‖=1} E_{x∼D_k}[(|φ'(w_j^T x) − φ'(w*_j^T x)|·|φ'(w_l^T x)| + |φ'(w*_j^T x)|·|φ'(w_l^T x) − φ'(w*_l^T x)|)(x^T a)²]
= max_{‖a‖=1} (E_{x∼D_k}[|φ'(w_j^T x) − φ'(w*_j^T x)|·|φ'(w_l^T x)|(x^T a)²] + E_{x∼D_k}[|φ'(w*_j^T x)|·|φ'(w_l^T x) − φ'(w*_l^T x)|(x^T a)²]),   (C.8)
where the first step follows by the definition of the spectral norm, the second step follows by the triangle inequality, and the last step follows by linearity of expectation.
Without loss of generality, we just bound the first term in the above formulation. Let U be an orthogonal basis of span{w_j, w*_j, w_l}. If w_j, w*_j, w_l are independent, U is k-by-3; otherwise it can be k-by-rank(span{w_j, w*_j, w_l}). Without loss of generality, we assume U ∈ R^{k×3}. Let [v_j v*_j v_l] = U^T[w_j w*_j w_l] ∈ R^{3×3}, and [u_j u*_j u_l] = U_⊥^T[w_j w*_j w_l] ∈ R^{(k−3)×3}. Let a = Ub + U_⊥c, where U_⊥ ∈ R^{k×(k−3)} is the complementary matrix of U. Then
E_{x∼D_k}[|φ'(w_j^T x) − φ'(w*_j^T x)|·|φ'(w_l^T x)|(x^T a)²]
= E_{x∼D_k}[|φ'(w_j^T x) − φ'(w*_j^T x)|·|φ'(w_l^T x)|(x^T(Ub + U_⊥c))²]
≲ E_{x∼D_k}[|φ'(w_j^T x) − φ'(w*_j^T x)|·|φ'(w_l^T x)|((x^T Ub)² + (x^T U_⊥c)²)]
= E_{x∼D_k}[|φ'(w_j^T x) − φ'(w*_j^T x)|·|φ'(w_l^T x)|(x^T Ub)²] + E_{x∼D_k}[|φ'(w_j^T x) − φ'(w*_j^T x)|·|φ'(w_l^T x)|(x^T U_⊥c)²]
= E_{z∼D_3}[|φ'(v_j^T z) − φ'(v*_j^T z)|·|φ'(v_l^T z)|(z^T b)²] + E_{y∼D_{k−3}}[|φ'(u_j^T y) − φ'(u*_j^T y)|·|φ'(u_l^T y)|(y^T c)²],   (C.9)
where the first step follows by a = Ub + U_⊥c, and the second step follows by (a + b)² ≤ 2a² + 2b². Let us consider the first term; the second term is similar.
By Property 3(b), there are e exceptional points at which φ''(z) ≠ 0. Let these e points be p_1, p_2, ⋯, p_e. Note that if v_j^T z and v*_j^T z are not separated by any of these exceptional points, i.e., there exists no m ∈ [e] such that v_j^T z ≤ p_m ≤ v*_j^T z or v*_j^T z ≤ p_m ≤ v_j^T z, then we have φ'(v_j^T z) = φ'(v*_j^T z), since φ''(s) = 0 except at {p_m}_{m=1,2,⋯,e}. So we consider the probability that v_j^T z and v*_j^T z are separated by some exceptional point. We use ξ_m to denote the event that v_j^T z and v*_j^T z are separated by the exceptional point p_m. By the union bound, 1 − Σ_{m=1}^e Pr[ξ_m] lower bounds the probability that v_j^T z and v*_j^T z are not separated by any exceptional point. The first term of Equation (C.9) can be bounded as
E_{z∼D_3}[|φ'(v_j^T z) − φ'(v*_j^T z)|·|φ'(v_l^T z)|(z^T b)²]
= E_{z∼D_3}[1_{∪_{m=1}^e ξ_m}·|φ'(v_j^T z) − φ'(v*_j^T z)|·|φ'(v_l^T z)|(z^T b)²]
≤ (E_{z∼D_3}[1_{∪_{m=1}^e ξ_m}])^{1/2}·(E_{z∼D_3}[(φ'(v_j^T z) + φ'(v*_j^T z))²·φ'(v_l^T z)²·(z^T b)⁴])^{1/2}
≤ (Σ_{m=1}^e Pr_{z∼D_3}[ξ_m])^{1/2}·(E_{z∼D_3}[(φ'(v_j^T z) + φ'(v*_j^T z))²·φ'(v_l^T z)²·(z^T b)⁴])^{1/2}
≲ (Σ_{m=1}^e Pr_{z∼D_3}[ξ_m])^{1/2}·(‖v_j‖^p + ‖v*_j‖^p)‖v_l‖^p‖b‖²,
where the first step follows because φ'(v_j^T z) = φ'(v*_j^T z) whenever v_j^T z and v*_j^T z are not separated by any exceptional point, the second step follows by Hölder's inequality, and the last step follows by Property 1.
It remains to upper bound Pr_{z∼D_3}[ξ_m]. First note that if v_j^T z and v*_j^T z are separated by an exceptional point p_m, then |v*_j^T z − p_m| ≤ |v_j^T z − v*_j^T z| ≤ ‖v_j − v*_j‖‖z‖. Therefore,
Pr_{z∼D_3}[ξ_m] ≤ Pr_{z∼D_3}[|v*_j^T z − p_m|/‖z‖ ≤ ‖v_j − v*_j‖].
Note that (v*_j^T z/(‖z‖‖v*_j‖) + 1)/2 follows a Beta(1, 1) distribution, which is the uniform distribution on [0, 1]. Hence
Pr_{z∼D_3}[|v*_j^T z − p_m|/(‖z‖‖v*_j‖) ≤ ‖v_j − v*_j‖/‖v*_j‖]
≤ Pr_{z∼D_3}[|v*_j^T z|/(‖z‖‖v*_j‖) ≤ ‖v_j − v*_j‖/‖v*_j‖]
≲ ‖v_j − v*_j‖/‖v*_j‖
≲ ‖W − W*‖/σ_t(W*),
where the first step holds because we can view v*_j^T z/‖z‖ and p_m/‖z‖ as two independent random variables: the former depends on the direction of z and the latter is related to the magnitude of z. Thus, we have
E_{z∼D_3}[|φ'(v_j^T z) − φ'(v*_j^T z)|·|φ'(v_l^T z)|(z^T b)²] ≲ (e‖W − W*‖/σ_t(W*))^{1/2}σ_1^{2p}(W*)‖b‖².   (C.10)
Similarly, we have
E_{y∼D_{k−3}}[|φ'(u_j^T y) − φ'(u*_j^T y)|·|φ'(u_l^T y)|(y^T c)²] ≲ (e‖W − W*‖/σ_t(W*))^{1/2}σ_1^{2p}(W*)‖c‖².   (C.11)
Thus, we complete the proof.
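For the canonical non-smooth example φ = ReLU (one exceptional point at 0, so e = 1), the separation probability above is simply the chance that v^T z and v*^T z have different signs. A Monte Carlo sketch (our own illustration with hypothetical vectors) confirms it stays below ‖v − v*‖/‖v*‖.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 200000
v_star = rng.standard_normal(d)
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
v = v_star + 0.1 * np.linalg.norm(v_star) * u   # perturbation of relative size 0.1
ratio = np.linalg.norm(v - v_star) / np.linalg.norm(v_star)

Z = rng.standard_normal((N, d))
# For ReLU, φ'(z) = 1{z > 0}: φ'(v^T z) ≠ φ'(v*^T z) iff the signs differ,
# i.e., iff v^T z and v*^T z are separated by the single exceptional point 0
sep = np.mean(np.sign(Z @ v) != np.sign(Z @ v_star))
assert sep <= ratio                              # Pr[ξ] ≲ ‖v − v*‖/‖v*‖
```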
Claim C.3.10. Let φ(z) satisfy Properties 1, 4 and 3(b). For any W ∈ R^{k×t}, if ‖W − W*‖ ≤ σ_t/2, then we have
max(‖∆^{(2)}_{j,j}‖, ‖∆^{(2)}_{j,l}‖) ≲ r²σ_1^{2p}(W*)·(‖W − W*‖/σ_t(W*))^{1/2}.
Proof. We bound ‖∆^{(2)}_{j,l}‖; ‖∆^{(2)}_{j,j}‖ is a special case of ‖∆^{(2)}_{j,l}‖. We have
∆^{(2)}_{j,l} = E_{x∼D_d}[Σ_{i≠i'} (φ'(w_j^T x_i)φ'(w_l^T x_{i'}) − φ'(w*_j^T x_i)φ'(w*_l^T x_{i'}))x_i x_{i'}^T]
= Σ_{i≠i'} (E_{x_i∼D_k}[φ'(w_j^T x_i)x_i]·E_{x_{i'}∼D_k}[φ'(w_l^T x_{i'})x_{i'}^T] − E_{x_i∼D_k}[φ'(w*_j^T x_i)x_i]·E_{x_{i'}∼D_k}[φ'(w*_l^T x_{i'})x_{i'}^T]).
Define α_1(σ) = E_{z∼D_1}[φ'(σz)z]. Then
‖∆^{(2)}_{j,l}‖ ≤ r(r − 1)‖α_1(‖w_j‖)α_1(‖w_l‖)w_j w_l^T − α_1(‖w*_j‖)α_1(‖w*_l‖)w*_j w*_l^T‖
≤ r(r − 1)(‖α_1(‖w_j‖)α_1(‖w_l‖)w_j w_l^T − α_1(‖w_j‖)α_1(‖w*_l‖)w_j w*_l^T‖ + ‖α_1(‖w_j‖)α_1(‖w*_l‖)w_j w*_l^T − α_1(‖w*_j‖)α_1(‖w*_l‖)w*_j w*_l^T‖)
≲ r²σ_1^{2p}(W*)·(‖W − W*‖/σ_t(W*))^{1/2},
where the last inequality uses the same analysis as in Claim C.3.9.
C.3.4 Proofs for Main results
Bounding the spectrum of the Hessian near the ground truth. The goal of this Section is to prove Theorem C.3.2.
Theorem C.3.2 (Bounding the Spectrum of the Hessian near the Ground Truth, formal version of Theorem 5.3.1). For any W ∈ R^{k×t} with ‖W − W*‖ ≲ ρ²(σ_t)/(r²t²κ⁵λ²σ_1^{4p})·‖W*‖, let S denote a set of i.i.d. samples from distribution D (defined in (5.1)), and let the activation function satisfy Properties 1, 3 and 4. For any s ≥ 1, if
|S| ≥ dr³·poly(log d, s)·τκ⁸λ²σ_1^{4p}/ρ²(σ_t),
then with probability at least 1 − d^{−Ω(s)},
Ω(rρ(σ_t)/(κ²λ))·I ⪯ ∇²f_S(W) ⪯ O(tr²σ_1^{2p})·I.
Proof. The main idea of the proof follows from the inequalities

∇²f_D(W*) − ‖∇²f_S(W) − ∇²f_D(W*)‖·I ⪯ ∇²f_S(W) ⪯ ∇²f_D(W*) + ‖∇²f_S(W) − ∇²f_D(W*)‖·I.

We first provide lower and upper bounds on the range of the eigenvalues of ∇²f_D(W*) using Lemma C.3.1. Then we show how to bound the spectral norm of the remaining error, ‖∇²f_S(W) − ∇²f_D(W*)‖, which can be further decomposed into two parts, ‖∇²f_S(W) − H_D(W)‖ and ‖H_D(W) − ∇²f_D(W*)‖, where H_D(W) is ∇²f_D(W) if φ is smooth, and otherwise H_D(W) is a specially designed matrix. We can upper bound both parts when W is close enough to W* and there are enough samples. In particular, if the activation satisfies Property 3(a), we use Lemma C.3.6 to bound ‖H_D(W) − ∇²f_D(W*)‖ and Lemma C.3.7 to bound ‖H_D(W) − ∇²f_S(W)‖. If the activation satisfies Property 3(b), we use Lemma C.3.10 to bound ‖H_D(W) − ∇²f_D(W*)‖ and Lemma C.3.9 to bound ‖H_D(W) − ∇²f_S(W)‖.

Finally, we complete the proof by setting ε = O(ρ(σ_1)/(r²t²κ²λσ_1^{2p})) in Lemma C.3.5 and Lemma C.3.8.

If the activation satisfies Property 3(a), we set ‖W − W*‖ ≲ ρ(σ_t)/(rtκ²λσ_1^p) in Lemma C.3.5.

If the activation satisfies Property 3(b), we set ‖W − W*‖ ≲ ρ²(σ_t)σ_t/(r²t²κ⁴λ²σ_1^{4p}) in Lemma C.3.8.
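The sandwiching step above is a Weyl-type eigenvalue perturbation bound: adding a symmetric error E to a symmetric matrix moves every eigenvalue by at most ‖E‖. A minimal numeric illustration with arbitrary stand-in matrices (these are not the actual Hessians of the theorem):

```python
import numpy as np

# Check of the sandwich H* - ||E|| I <= H_S <= H* + ||E|| I, equivalently
# Weyl's inequality: |lam_i(H_S) - lam_i(H*)| <= ||H_S - H*||.
rng = np.random.default_rng(0)
n = 8
A = rng.standard_normal((n, n)); H_star = (A + A.T) / 2   # "population" Hessian
E = rng.standard_normal((n, n)); E = (E + E.T) / 2 * 0.1  # symmetric perturbation
H_S = H_star + E                                          # "empirical" Hessian

spec_err = np.linalg.norm(E, 2)                 # spectral norm ||H_S - H*||
lam_star = np.sort(np.linalg.eigvalsh(H_star))
lam_S = np.sort(np.linalg.eigvalsh(H_S))
assert np.max(np.abs(lam_S - lam_star)) <= spec_err + 1e-12
print(np.max(np.abs(lam_S - lam_star)), "<=", spec_err)
```

This is exactly why controlling ‖∇²f_S(W) − ∇²f_D(W*)‖ suffices to transfer the eigenvalue bounds from the population Hessian to the empirical one.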
Linear convergence of gradient descent. The goal of this section is to prove Theorem C.3.3.

Theorem C.3.3 (Linear convergence of gradient descent, formal version of Theorem B.3.2). Let W ∈ R^{t×k} be the current iterate satisfying

‖W − W*‖ ≲ ρ²(σ_t)/(r²t²κ⁵λ²σ_1^{4p}) · ‖W*‖.

Let S denote a set of i.i.d. samples from the distribution D (defined in (5.1)), and let the activation function satisfy Properties 1, 4 and 3(a). Define

m_0 = Θ(rρ(σ_k)/(κ²λ)) and M_0 = Θ(tr²σ_1^{2p}).

For any s ≥ 1, if we choose

|S| ≥ d · poly(s, log d) · r²t²τκ⁸λ²σ_1^{4p}/ρ²(σ_t)   (C.12)

and perform gradient descent with step size 1/M_0 on f_S(W) to obtain the next iterate

W† = W − (1/M_0)·∇f_S(W),

then with probability at least 1 − d^{−Ω(s)},

‖W† − W*‖_F² ≤ (1 − m_0/M_0)·‖W − W*‖_F².
Proof. Given the current iterate W, we set k^{(p+1)/2} anchor points {W^a}_{a=1,2,···,k^{(p+1)/2}} equally spaced along the line ξW* + (1−ξ)W for ξ ∈ [0,1]. Using Theorem C.3.2 and applying a union bound over all the events, we have with probability at least 1 − d^{−Ω(s)}, for all anchor points {W^a}_{a=1,2,···,k^{(p+1)/2}}: if |S| satisfies Eq. (C.12), then

m_0 I ⪯ ∇²f_S(W^a) ⪯ M_0 I.

Then, based on these anchors, using Lemma C.3.11 we have with probability 1 − d^{−Ω(s)}, for any point W on the line between W and W*,

m_0 I ⪯ ∇²f_S(W) ⪯ M_0 I.   (C.13)

Let η be the step size. Then

‖W† − W*‖_F² = ‖W − η∇f_S(W) − W*‖_F² = ‖W − W*‖_F² − 2η⟨∇f_S(W), W − W*⟩ + η²‖∇f_S(W)‖_F².

We can rewrite ∇f_S(W) as

∇f_S(W) = ( ∫_0^1 ∇²f_S(W* + γ(W − W*)) dγ ) · vec(W − W*).

We define the operator H_S(W − W*) ∈ R^{tk×tk} by

H_S(W − W*) = ∫_0^1 ∇²f_S(W* + γ(W − W*)) dγ.

According to Eq. (C.13),

m_0 I ⪯ H_S(W − W*) ⪯ M_0 I.   (C.14)

Hence

‖∇f_S(W)‖_F² = ⟨H_S(W−W*)vec(W−W*), H_S(W−W*)vec(W−W*)⟩ ≤ M_0·⟨vec(W−W*), H_S(W−W*)vec(W−W*)⟩.

Therefore,

‖W† − W*‖_F² ≤ ‖W − W*‖_F² − (2η − η²M_0)·⟨vec(W−W*), H_S(W−W*)vec(W−W*)⟩
≤ ‖W − W*‖_F² − (2η − η²M_0)·m_0·‖W − W*‖_F²
= (1 − m_0/M_0)·‖W − W*‖_F²,

where the last equality holds by setting η = 1/M_0.
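The contraction rate (1 − m_0/M_0) with step size 1/M_0 can be observed on any objective whose Hessian is sandwiched as in (C.13). A minimal sketch on a hypothetical quadratic surrogate (not the actual f_S of the theorem):

```python
import numpy as np

# Gradient descent on f(w) = 0.5 (w - w*)^T H (w - w*) with m0 I <= H <= M0 I.
# With step size 1/M0 the squared distance to w* contracts by (1 - m0/M0)
# per step, matching the rate proved above.
rng = np.random.default_rng(1)
H = np.diag(rng.uniform(1.0, 4.0, size=10))   # diagonal Hessian stand-in
m0, M0 = H.diagonal().min(), H.diagonal().max()
w_star = rng.standard_normal(10)
w = w_star + rng.standard_normal(10)

for _ in range(20):
    prev = np.sum((w - w_star) ** 2)
    w = w - (1.0 / M0) * H @ (w - w_star)     # gradient step, eta = 1/M0
    assert np.sum((w - w_star) ** 2) <= (1 - m0 / M0) * prev + 1e-12
print("final squared error:", np.sum((w - w_star) ** 2))
```

Each coordinate contracts by (1 − h_i/M_0)², so the worst-case per-step factor is (1 − m_0/M_0)², which is stronger than the (1 − m_0/M_0) rate asserted here.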
Bounding the spectrum of the Hessian near the fixed point. The goal of this section is to prove Lemma C.3.11.

Lemma C.3.11. Let S denote a set of samples from the distribution D defined in Eq. (5.1). Let W^a ∈ R^{t×k} be a point (with respect to the function f_S(W)) that is independent of the samples S, satisfying ‖W^a − W*‖ ≤ σ_t/2. Assume φ satisfies Properties 1, 4 and 3(a). Then for any s ≥ 1, if

|S| ≥ k·poly(log d, s),

then with probability at least 1 − d^{−Ω(s)}, for all W ∈ R^{k×t} (not necessarily independent of the samples) satisfying ‖W^a − W‖ ≤ σ_t/4, we have

‖∇²f_S(W) − ∇²f_S(W^a)‖ ≤ r³t²σ_1^p (‖W^a − W*‖ + ‖W − W^a‖·k^{(p+1)/2}).
Proof. Let Δ = ∇²f_S(W) − ∇²f_S(W^a) ∈ R^{kt×kt}. Then Δ can be viewed as t² blocks, each of size k × k.

For each j, l ∈ [t] with j ≠ l, we use Δ_{j,l} to denote the off-diagonal block,

Δ_{j,l} = (1/|S|) Σ_{x∈S} ( Σ_{i=1}^r φ′(w_j^⊤x_i)x_i ) · ( Σ_{i=1}^r φ′(w_l^⊤x_i)x_i )^⊤ − (1/|S|) Σ_{x∈S} ( Σ_{i=1}^r φ′(w_j^{a⊤}x_i)x_i ) · ( Σ_{i=1}^r φ′(w_l^{a⊤}x_i)x_i )^⊤
= (1/|S|) Σ_{x∈S} Σ_{i=1}^r Σ_{i′=1}^r ( φ′(w_j^⊤x_i)φ′(w_l^⊤x_{i′}) − φ′(w_j^{a⊤}x_i)φ′(w_l^{a⊤}x_{i′}) ) x_i x_{i′}^⊤.
For each j ∈ [t], we use Δ_{j,j} to denote the diagonal block,

Δ_{j,j} = (1/|S|) Σ_{(x,y)∈S} [ ( Σ_{i=1}^r φ′(w_j^⊤x_i)x_i ) · ( Σ_{i=1}^r φ′(w_j^⊤x_i)x_i )^⊤ + ( Σ_{l=1}^t Σ_{i=1}^r φ(w_l^⊤x_i) − y ) · ( Σ_{i=1}^r φ″(w_j^⊤x_i)x_i x_i^⊤ ) ]
− (1/|S|) Σ_{(x,y)∈S} [ ( Σ_{i=1}^r φ′(w_j^{a⊤}x_i)x_i ) · ( Σ_{i=1}^r φ′(w_j^{a⊤}x_i)x_i )^⊤ + ( Σ_{l=1}^t Σ_{i=1}^r φ(w_l^{a⊤}x_i) − y ) · ( Σ_{i=1}^r φ″(w_j^{a⊤}x_i)x_i x_i^⊤ ) ].
We further decompose Δ_{j,j} as Δ_{j,j} = Δ^{(1)}_{j,j} + Δ^{(2)}_{j,j}, where

Δ^{(1)}_{j,j} = (1/|S|) Σ_{(x,y)∈S} [ ( Σ_{l=1}^t Σ_{i=1}^r φ(w_l^⊤x_i) − y ) · ( Σ_{i=1}^r φ″(w_j^⊤x_i)x_i x_i^⊤ ) ] − (1/|S|) Σ_{(x,y)∈S} [ ( Σ_{l=1}^t Σ_{i=1}^r φ(w_l^{a⊤}x_i) − y ) · ( Σ_{i=1}^r φ″(w_j^{a⊤}x_i)x_i x_i^⊤ ) ]

= (1/|S|) Σ_{(x,y)∈S} [ ( Σ_{l=1}^t Σ_{i=1}^r (φ(w_l^⊤x_i) − φ(w_l^{*⊤}x_i)) ) · ( Σ_{i=1}^r φ″(w_j^⊤x_i)x_i x_i^⊤ ) ] − (1/|S|) Σ_{(x,y)∈S} [ ( Σ_{l=1}^t Σ_{i=1}^r (φ(w_l^{a⊤}x_i) − φ(w_l^{*⊤}x_i)) ) · ( Σ_{i=1}^r φ″(w_j^{a⊤}x_i)x_i x_i^⊤ ) ]

= (1/|S|) Σ_{x∈S} Σ_{l=1}^t Σ_{i=1}^r Σ_{i′=1}^r ( (φ(w_l^⊤x_i) − φ(w_l^{*⊤}x_i))φ″(w_j^⊤x_{i′}) − (φ(w_l^{a⊤}x_i) − φ(w_l^{*⊤}x_i))φ″(w_j^{a⊤}x_{i′}) ) x_{i′} x_{i′}^⊤

= (1/|S|) Σ_{x∈S} Σ_{l=1}^t Σ_{i=1}^r Σ_{i′=1}^r ( (φ(w_l^⊤x_i) − φ(w_l^{a⊤}x_i))φ″(w_j^⊤x_{i′}) ) x_{i′} x_{i′}^⊤
+ (1/|S|) Σ_{x∈S} Σ_{l=1}^t Σ_{i=1}^r Σ_{i′=1}^r ( (φ(w_l^{a⊤}x_i) − φ(w_l^{*⊤}x_i))(φ″(w_j^{a⊤}x_{i′}) + φ″(w_j^⊤x_{i′})) ) x_{i′} x_{i′}^⊤

= Δ^{(1,1)}_{j,j} + Δ^{(1,2)}_{j,j},

and

Δ^{(2)}_{j,j} = (1/|S|) Σ_{x∈S} ( Σ_{i=1}^r φ′(w_j^⊤x_i)x_i ) · ( Σ_{i=1}^r φ′(w_j^⊤x_i)x_i )^⊤ − (1/|S|) Σ_{x∈S} ( Σ_{i=1}^r φ′(w_j^{a⊤}x_i)x_i ) · ( Σ_{i=1}^r φ′(w_j^{a⊤}x_i)x_i )^⊤
= (1/|S|) Σ_{x∈S} Σ_{i=1}^r Σ_{i′=1}^r ( φ′(w_j^⊤x_i)φ′(w_j^⊤x_{i′}) − φ′(w_j^{a⊤}x_i)φ′(w_j^{a⊤}x_{i′}) ) x_i x_{i′}^⊤.
Combining Claims C.3.11, C.3.12, C.3.13 and C.3.14, and taking a union bound over O(t²) events, we have that

‖∇²f_S(W) − ∇²f_S(W^a)‖ ≤ Σ_{j=1}^t ( ‖Δ^{(1)}_{j,j}‖ + ‖Δ^{(2)}_{j,j}‖ ) + Σ_{j≠l} ‖Δ_{j,l}‖ ≲ r³t²σ_1^p (‖W^a − W*‖ + ‖W − W^a‖·k^{(p+1)/2})

holds with probability at least 1 − d^{−Ω(s)}.

Claim C.3.11. For each j ∈ [t], if |S| ≥ k·poly(log d, s), then

‖Δ^{(1,1)}_{j,j}‖ ≲ tr²σ_1^p ‖W^a − W‖·k^{(p+1)/2}

holds with probability 1 − d^{−Ω(s)}.
Proof. Recall the definition of Δ^{(1,1)}_{j,j}:

(1/|S|) Σ_{x∈S} Σ_{l=1}^t Σ_{i=1}^r Σ_{i′=1}^r ( (φ(w_l^⊤x_i) − φ(w_l^{a⊤}x_i))φ″(w_j^⊤x_{i′}) ) x_{i′} x_{i′}^⊤.

In order to upper bound ‖Δ^{(1,1)}_{j,j}‖, it suffices to upper bound the spectral norm of

(1/|S|) Σ_{x∈S} ( (φ(w_l^⊤x_i) − φ(w_l^{a⊤}x_i))φ″(w_j^⊤x_{i′}) ) x_{i′} x_{i′}^⊤.

We focus on the case i = i′; the case i ≠ i′ is similar. Note that

−2L_1L_2‖w_l − w_l^a‖(‖w_l‖^p + ‖w_l^a‖^p)‖x_i‖^{p+1} x_i x_i^⊤ ⪯ ( (φ(w_l^⊤x_i) − φ(w_l^{a⊤}x_i))φ″(w_j^⊤x_i) ) x_i x_i^⊤ ⪯ 2L_1L_2‖w_l − w_l^a‖(‖w_l‖^p + ‖w_l^a‖^p)‖x_i‖^{p+1} x_i x_i^⊤.

Define the function h(x): R^k → R by h(x) = ‖x‖^{p+1}.

(I) Bounding |h(x)|. By Lemma 2.4.3, we have h(x) ≲ (sk log(dn))^{(p+1)/2} with probability at least 1 − 1/(nd^{4s}).

(II) Bounding ‖E_{x∼D_k}[‖x‖^{p+1}xx^⊤]‖. Let g(x) = (2π)^{−k/2}e^{−‖x‖²/2}, and note that xg(x)dx = −dg(x). Then

E_{x∼D_k}[‖x‖^{p+1}xx^⊤] = ∫ ‖x‖^{p+1} g(x) xx^⊤ dx
= −∫ ‖x‖^{p+1} d(g(x)) x^⊤
= −∫ ‖x‖^{p+1} d(g(x)x^⊤) + ∫ ‖x‖^{p+1} g(x) I_k dx
= ∫ d(‖x‖^{p+1}) g(x) x^⊤ + ∫ ‖x‖^{p+1} g(x) I_k dx
= ∫ (p+1)‖x‖^{p−1} g(x) xx^⊤ dx + ∫ ‖x‖^{p+1} g(x) I_k dx
⪰ ∫ ‖x‖^{p+1} g(x) I_k dx = E_{x∼D_k}[‖x‖^{p+1}]·I_k.

Since ‖x‖² follows a χ² distribution with k degrees of freedom, E_{x∼D_k}[‖x‖^q] = 2^{q/2}·Γ((k+q)/2)/Γ(k/2) for any q ≥ 0, so k^{q/2} ≲ E_{x∼D_k}[‖x‖^q] ≲ k^{q/2}. Hence ‖E_{x∼D_k}[h(x)xx^⊤]‖ ≳ k^{(p+1)/2}. Also,

‖E_{x∼D_k}[h(x)xx^⊤]‖ ≤ max_{‖a‖=1} E_{x∼D_k}[h(x)(x^⊤a)²] ≤ max_{‖a‖=1} (E_{x∼D_k}[h²(x)])^{1/2}·(E_{x∼D_k}[(x^⊤a)⁴])^{1/2} ≲ k^{(p+1)/2}.

(III) Bounding (E_{x∼D_k}[h⁴(x)])^{1/4}. We have

(E_{x∼D_k}[h⁴(x)])^{1/4} ≲ k^{(p+1)/2}.

Define the function B(x) = h(x)xx^⊤ ∈ R^{k×k}, and let B = E_{x∼D_k}[h(x)xx^⊤]. Therefore, by applying Corollary B.1.1, we obtain that for any 0 < ε < 1, if

|S| ≥ ε^{−2}·k·poly(log d, s),

then with probability at least 1 − 1/d^{Ω(s)},

‖ (1/|S|) Σ_{x∈S} ‖x‖^{p+1}xx^⊤ − E_{x∼D_k}[‖x‖^{p+1}xx^⊤] ‖ ≲ ε·k^{(p+1)/2}.

Therefore, with probability at least 1 − 1/d^{Ω(s)},

‖ (1/|S|) Σ_{x∈S} ‖x‖^{p+1}xx^⊤ ‖ ≲ k^{(p+1)/2}.   (C.15)
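Two ingredients of the proof above are easy to check numerically: the χ² moment formula E[‖x‖^q] = 2^{q/2}Γ((k+q)/2)/Γ(k/2), and the fact that E[‖x‖^{p+1}xx^⊤] is a multiple of the identity of order k^{(p+1)/2}. A small sketch (toy sizes; Monte Carlo for the second fact):

```python
import math
import numpy as np

def gaussian_norm_moment(k, q):
    """E[||x||^q] for x ~ N(0, I_k), via the chi-square moment formula."""
    return 2 ** (q / 2) * math.gamma((k + q) / 2) / math.gamma(k / 2)

# q = 2 must give E[||x||^2] = k exactly.
for k in (1, 4, 10):
    assert abs(gaussian_norm_moment(k, 2) - k) < 1e-9

# By rotational symmetry, E[||x||^{p+1} x x^T] = (E[||x||^{p+3}]/k) I.
# Monte Carlo check for k = 4, p = 1, where the scalar is
# E[||x||^4]/k = (k^2 + 2k)/k = k + 2 = 6.
rng = np.random.default_rng(2)
k, p, n = 4, 1, 200_000
x = rng.standard_normal((n, k))
w = np.linalg.norm(x, axis=1) ** (p + 1)
M = (x * w[:, None]).T @ x / n
assert np.abs(M - 6 * np.eye(k)).max() < 0.5
print(np.round(M, 2))
```

The ratio (E[‖x‖^{p+3}]/k) / k^{(p+1)/2} stays bounded in k, which is the two-sided k^{(p+1)/2} estimate used in (II).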
Claim C.3.12. For each j ∈ [t], if |S| ≥ k·poly(log d, s), then

‖Δ^{(1,2)}_{j,j}‖ ≲ tr²σ_1^p ‖W^a − W*‖

holds with probability 1 − d^{−Ω(s)}.

Proof. Recall the definition of Δ^{(1,2)}_{j,j}:

(1/|S|) Σ_{x∈S} Σ_{l=1}^t Σ_{i=1}^r Σ_{i′=1}^r ( (φ(w_l^{a⊤}x_i) − φ(w_l^{*⊤}x_i))(φ″(w_j^{a⊤}x_{i′}) + φ″(w_j^⊤x_{i′})) ) x_{i′} x_{i′}^⊤.

In order to upper bound ‖Δ^{(1,2)}_{j,j}‖, it suffices to upper bound the spectral norm of

(1/|S|) Σ_{x∈S} ( (φ(w_l^{a⊤}x_i) − φ(w_l^{*⊤}x_i))(φ″(w_j^{a⊤}x_{i′}) + φ″(w_j^⊤x_{i′})) ) x_{i′} x_{i′}^⊤

for each l ∈ [t], i ∈ [r], i′ ∈ [r]. We define the function h(y,z): R^{2k} → R by

h(y,z) = |φ(w_l^{a⊤}y) − φ(w_l^{*⊤}y)| · (|φ″(w_j^{a⊤}z)| + |φ″(w_j^⊤z)|),

and the function B(y,z): R^{2k} → R^{k×k} by

B(y,z) = |φ(w_l^{a⊤}y) − φ(w_l^{*⊤}y)| · (|φ″(w_j^{a⊤}z)| + |φ″(w_j^⊤z)|) · zz^⊤ = h(y,z)zz^⊤.

Using Property 1, we can show

|φ(w_l^{a⊤}y) − φ(w_l^{*⊤}y)| ≤ |(w_l^a − w_l^*)^⊤y| · (|φ′(w_l^{a⊤}y)| + |φ′(w_l^{*⊤}y)|) ≤ |(w_l^a − w_l^*)^⊤y| · L_1 · (|w_l^{a⊤}y|^p + |w_l^{*⊤}y|^p).

Using Property 3, we have |φ″(w_j^{a⊤}z)| + |φ″(w_j^⊤z)| ≤ 2L_2. Thus,

h(y,z) ≤ 2L_1L_2 |(w_l^a − w_l^*)^⊤y| · (|w_l^{a⊤}y|^p + |w_l^{*⊤}y|^p).

Using Lemma 2.4.2 and the matrix Bernstein inequality (Corollary B.1.1), if |S| ≥ k·poly(log d, s), then

‖ E_{y,z∼D_k}[B(y,z)] − (1/|S|) Σ_{(y,z)∈S} B(y,z) ‖ ≲ ‖E_{y,z∼D_k}[B(y,z)]‖ ≲ ‖w_l^*‖^p ‖w_l^* − w_l^a‖,

where S denotes a set of samples from the distribution D_{2k}. Thus, we obtain

‖ (1/|S|) Σ_{(y,z)∈S} B(y,z) ‖ ≲ ‖W^a − W*‖·σ_1^p.

Taking a union bound over O(tr²) events and summing up those O(tr²) terms completes the proof.
Claim C.3.13. For each (j,l) ∈ [t]×[t] with j ≠ l, if |S| ≥ k·poly(log d, s), then

‖Δ_{j,l}‖ ≲ r²σ_1^p ‖W^a − W‖·k^{(p+1)/2}

holds with probability 1 − d^{−Ω(s)}.

Proof. Recall

Δ_{j,l} = (1/|S|) Σ_{x∈S} Σ_{i=1}^r Σ_{i′=1}^r ( φ′(w_j^⊤x_i)φ′(w_l^⊤x_{i′}) − φ′(w_j^{a⊤}x_i)φ′(w_l^{a⊤}x_{i′}) ) x_i x_{i′}^⊤
= (1/|S|) Σ_{x∈S} Σ_{i=1}^r Σ_{i′=1}^r ( φ′(w_j^⊤x_i)φ′(w_l^⊤x_{i′}) − φ′(w_j^⊤x_i)φ′(w_l^{a⊤}x_{i′}) + φ′(w_j^⊤x_i)φ′(w_l^{a⊤}x_{i′}) − φ′(w_j^{a⊤}x_i)φ′(w_l^{a⊤}x_{i′}) ) x_i x_{i′}^⊤.

It suffices to consider

(1/|S|) Σ_{x∈S} Σ_{i=1}^r Σ_{i′=1}^r ( φ′(w_j^⊤x_i)(φ′(w_l^⊤x_{i′}) − φ′(w_l^{a⊤}x_{i′})) ) x_i x_{i′}^⊤.

Recall that x = [x_1^⊤ x_2^⊤ ··· x_r^⊤]^⊤ with x_i ∈ R^k for all i ∈ [r], and d = rk. We define X = [x_1 x_2 ··· x_r] ∈ R^{k×r}. Let φ′(X^⊤w_j) ∈ R^r denote the vector

[φ′(x_1^⊤w_j)  φ′(x_2^⊤w_j)  ···  φ′(x_r^⊤w_j)]^⊤ ∈ R^r,

and let p_l(X) denote the vector

[φ′(w_l^⊤x_1) − φ′(w_l^{a⊤}x_1)  φ′(w_l^⊤x_2) − φ′(w_l^{a⊤}x_2)  ···  φ′(w_l^⊤x_r) − φ′(w_l^{a⊤}x_r)]^⊤ ∈ R^r.
We define the function B(x): R^d → R^{k×k} by

B(x) = X · φ′(X^⊤w_j) · p_l(X)^⊤ · X^⊤,

where the four factors have sizes k×r, r×1, 1×r and r×k respectively. Note that

‖φ′(X^⊤w_j)‖·‖p_l(X)‖ ≤ L_1L_2‖w_j‖^p‖w_l − w_l^a‖ ( Σ_{i=1}^r ‖x_i‖^p ) · ( Σ_{i=1}^r ‖x_i‖ ).

We define the dominating function B̄(x): R^d → R^{k×k} by

B̄(x) = L_1L_2‖w_j‖^p‖w_l − w_l^a‖ ( Σ_{i=1}^r ‖x_i‖^p ) · ( Σ_{i=1}^r ‖x_i‖ ) XX^⊤.

Also note that

[0 B(x); B(x)^⊤ 0] = [X 0; 0 X] · [0 φ′(X^⊤w_j)p_l(X)^⊤; p_l(X)φ′(X^⊤w_j)^⊤ 0] · [X^⊤ 0; 0 X^⊤].

We can lower and upper bound the above term by

−[B̄(x) 0; 0 B̄(x)] ⪯ [0 B(x); B(x)^⊤ 0] ⪯ [B̄(x) 0; 0 B̄(x)].

Therefore,

‖Δ_{j,l}‖ = ‖ (1/|S|) Σ_{x∈S} B(x) ‖ ≲ ‖ (1/|S|) Σ_{x∈S} B̄(x) ‖.

Define

F(x) := ( Σ_{i=1}^r ‖x_i‖^p ) · ( Σ_{i=1}^r ‖x_i‖ ) XX^⊤.

To bound ‖E_{x∼D_d}[F(x)] − (1/|S|) Σ_{x∈S} F(x)‖, we apply Lemma D.3.9. The following discusses the four properties required by Lemma D.3.9.
(I) We have

‖F(x)‖ ≤ ( Σ_{i=1}^r ‖x_i‖^p ) · ( Σ_{i=1}^r ‖x_i‖ )³.

By Lemma 2.4.3, with probability 1 − 1/(nd^{4s}),

‖F(x)‖ ≲ r⁴ k^{3/2+p/2} log^{3/2+p/2} n.

(II) We have

‖E_{x∼D_d}[F(x)]‖ = ‖ r · E_{x∼D_d}[ ( Σ_{i=1}^r ‖x_i‖^p ) · ( Σ_{i=1}^r ‖x_i‖ ) x_j x_j^⊤ ] ‖ ≳ r³ k^{p/2+1/2}.

The upper bound can be obtained similarly:

‖E_{x∼D_d}[F(x)]‖ ≲ r³ k^{p/2+1/2}.

(III) We have

max( ‖E_{x∼D_d}[F(x)F(x)^⊤]‖, ‖E_{x∼D_d}[F(x)^⊤F(x)]‖ ) ≤ max_{‖a‖=1} E_{x∼D_d}[ ( Σ_{i=1}^r ‖x_i‖^p )² · ( Σ_{i=1}^r ‖x_i‖ )² ‖X‖²‖X^⊤a‖² ] ≲ r⁷ k^{p+2}.

(IV) We have

max_{‖a‖=‖b‖=1} ( E[(a^⊤F(x)b)²] )^{1/2} = max_{‖a‖=‖b‖=1} ( E_{x∼N(0,I_d)}[ ( a^⊤ ( Σ_{i=1}^r ‖x_i‖^p ) · ( Σ_{i=1}^r ‖x_i‖ ) XX^⊤ b )² ] )^{1/2} ≲ r^{5/2} k^{p/2+1/2}.
Using Lemma 2.4.2 and the matrix Bernstein inequality (Lemma D.3.9), if |S| ≥ rk·poly(log d, s), then with probability at least 1 − 1/d^{Ω(s)},

‖ E_{x∼D_d}[B̄(x)] − (1/|S|) Σ_{x∈S} B̄(x) ‖ ≲ ‖E_{x∼D_d}[B̄(x)]‖ ≲ ‖w_j‖^p‖w_l − w_l^a‖ r³ k^{(p+1)/2},

where B̄(x) = L_1L_2‖w_j‖^p‖w_l − w_l^a‖(Σ_{i=1}^r‖x_i‖^p)(Σ_{i=1}^r‖x_i‖)XX^⊤ is the dominating function defined above. Thus, we obtain

‖ (1/|S|) Σ_{x∈S} B̄(x) ‖ ≲ ‖W^a − W‖·σ_1^p·r³·k^{(p+1)/2},

which completes the proof.
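The block anti-diagonal sandwich used above rests on the linear-algebra fact that M = [0 C; C^⊤ 0] has eigenvalues exactly ±σ_i(C), so −‖C‖I ⪯ M ⪯ ‖C‖I. A quick check:

```python
import numpy as np

# Eigenvalues of [[0, C], [C^T, 0]] are +/- the singular values of C
# (padded with zeros), so the block matrix is sandwiched by +/- ||C|| I.
rng = np.random.default_rng(3)
C = rng.standard_normal((4, 6))
M = np.block([[np.zeros((4, 4)), C], [C.T, np.zeros((6, 6))]])
eigs = np.sort(np.linalg.eigvalsh(M))
svals = np.linalg.svd(C, compute_uv=False)       # 4 nonzero singular values
expected = np.sort(np.concatenate([svals, -svals, np.zeros(2)]))
assert np.allclose(eigs, expected, atol=1e-8)
assert np.abs(eigs).max() <= np.linalg.norm(C, 2) + 1e-8
print(np.round(eigs, 3))
```

In the proof, C is the rank-one matrix φ′(X^⊤w_j)p_l(X)^⊤, which is why the product of the two vector norms controls the whole block.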
Claim C.3.14. For each j ∈ [t], if |S| ≥ k·poly(log d, s), then

‖Δ^{(2)}_{j,j}‖ ≲ tr²σ_1^p ‖W^a − W‖·k^{(p+1)/2}

holds with probability 1 − d^{−Ω(s)}.

Proof. Δ^{(2)}_{j,j} is a special case of Δ_{j,l}, so we refer readers to the proof of Claim C.3.13.
Appendix D
Non-linear Inductive Matrix Completion
D.1 Proof Overview
At a high level, the proofs of Theorem 6.3.1 and Theorem 6.3.3 consist of the following steps: 1) show that the population Hessian at the ground truth is positive definite; 2) show that population Hessians near the ground truth are also positive definite; 3) employ the matrix Bernstein inequality to bound the difference between the population Hessian and the empirical Hessian.
We now formulate the Hessian. The Hessian of Eq. (6.3), ∇²f_Ω(U,V) ∈ R^{(2kd)×(2kd)}, can be decomposed into two types of blocks (i ∈ [k], j ∈ [k]),

∂²f_Ω(U,V)/∂u_i∂v_j,  ∂²f_Ω(U,V)/∂u_i∂u_j,

where u_i (resp. v_j) is the i-th column of U (resp. the j-th column of V). Note that each of the above second-order derivatives is a d×d matrix.

The first type of block is given by:

∂²f_Ω(U,V)/∂u_i∂v_j = E_Ω[ φ′(u_i^⊤x)φ′(v_j^⊤y)xy^⊤φ(v_i^⊤y)φ(u_j^⊤x) ] + δ_{ij}·E_Ω[ h_{x,y}(U,V)φ′(u_i^⊤x)φ′(v_i^⊤y)xy^⊤ ],

where E_Ω[·] = (1/|Ω|)Σ_{(x,y)∈Ω}[·], δ_{ij} = 1_{i=j}, and

h_{x,y}(U,V) = φ(U^⊤x)^⊤φ(V^⊤y) − φ(U^{*⊤}x)^⊤φ(V^{*⊤}y).
For the sigmoid/tanh activation function, the second type of block is given by:

∂²f_Ω(U,V)/∂u_i∂u_j = E_Ω[ φ′(u_i^⊤x)φ′(u_j^⊤x)xx^⊤φ(v_i^⊤y)φ(v_j^⊤y) ] + δ_{ij}·E_Ω[ h_{x,y}(U,V)φ″(u_i^⊤x)φ(v_i^⊤y)xx^⊤ ].   (D.1)

For the ReLU/leaky ReLU activation function, the second type of block is given by:

∂²f_Ω(U,V)/∂u_i∂u_j = E_Ω[ φ′(u_i^⊤x)φ′(u_j^⊤x)xx^⊤φ(v_i^⊤y)φ(v_j^⊤y) ].

Note that the second term of Eq. (D.1) is missing here because (U,V) are fixed, the number of samples is finite, and φ″(z) = 0 almost everywhere.

In this section, we discuss the important lemmas/theorems for Step 1 in Appendix D.1.1 and for Steps 2 and 3 in Appendix D.1.3.
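The objective behind these Hessian blocks is the squared loss of Eq. (D.2) restricted to the observed pairs. The following hedged sketch (toy sizes, sigmoid activation, labels A(x,y) planted from a random (U*,V*); variable names are illustrative, not the thesis's notation) writes down f_Ω and checks the analytic gradient ∂f/∂u_i = E_Ω[h_{x,y}·φ′(u_i^⊤x)φ(v_i^⊤y)x] against finite differences:

```python
import numpy as np

# f_Omega(U, V) = (1/2|Omega|) sum_{(x,y)} h_{x,y}(U,V)^2, sigmoid activation.
rng = np.random.default_rng(4)
d, k, n = 5, 3, 40
phi = lambda z: 1.0 / (1.0 + np.exp(-z))
X, Y = rng.standard_normal((n, d)), rng.standard_normal((n, d))
Us, Vs = rng.standard_normal((d, k)), rng.standard_normal((d, k))
A = np.sum(phi(X @ Us) * phi(Y @ Vs), axis=1)          # planted labels A(x,y)
U, V = rng.standard_normal((d, k)), rng.standard_normal((d, k))

def f(U, V):
    h = np.sum(phi(X @ U) * phi(Y @ V), axis=1) - A
    return 0.5 * np.mean(h ** 2)

def grad_U(U, V):
    P, Q = phi(X @ U), phi(Y @ V)
    h = np.sum(P * Q, axis=1) - A
    # column i: (1/n) sum_j h_j * phi'(u_i^T x_j) * phi(v_i^T y_j) * x_j
    return X.T @ (h[:, None] * P * (1 - P) * Q) / n

G = grad_U(U, V)
eps = 1e-6
for (a, b) in [(0, 0), (2, 1), (4, 2)]:                # spot-check a few entries
    E = np.zeros((d, k)); E[a, b] = eps
    num = (f(U + E, V) - f(U - E, V)) / (2 * eps)
    assert abs(num - G[a, b]) < 1e-6
print("gradient check passed")
```

Differentiating grad_U once more in u_j reproduces exactly the two-term structure of the ∂²f/∂u_i∂u_j block above (the δ_{ij} term carries φ″).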
D.1.1 Positive definiteness of the population Hessian

The corresponding population risk for Eq. (6.3) is given by:

f_D(U,V) = (1/2)·E_{(x,y)∼D}[ (φ(U^⊤x)^⊤φ(V^⊤y) − A(x,y))² ],   (D.2)

where D := X × Y. For simplicity, we also assume X and Y are normal distributions.

Now we study the Hessian of the population risk at the ground truth. Let the Hessian of f_D(U,V) at the ground truth (U,V) = (U*,V*) be H* ∈ R^{(2dk)×(2dk)}, which can be decomposed into the following two types of blocks (i ∈ [k], j ∈ [k]):

∂²f_D(U*,V*)/∂u_i∂u_j = E_{x,y}[ φ′(u_i^{*⊤}x)φ′(u_j^{*⊤}x)xx^⊤φ(v_i^{*⊤}y)φ(v_j^{*⊤}y) ],
∂²f_D(U*,V*)/∂u_i∂v_j = E_{x,y}[ φ′(u_i^{*⊤}x)φ′(v_j^{*⊤}y)xy^⊤φ(v_i^{*⊤}y)φ(u_j^{*⊤}x) ].
To study the positive definiteness of H*, we characterize the minimal eigenvalue of H* via a constrained optimization problem,

λ_min(H*) = min_{(a,b)∈B} E_{x,y}[ ( Σ_{i=1}^k φ′(u_i^{*⊤}x)φ(v_i^{*⊤}y)x^⊤a_i + φ′(v_i^{*⊤}y)φ(u_i^{*⊤}x)y^⊤b_i )² ],   (D.3)

where (a,b) ∈ B denotes Σ_{i=1}^k ‖a_i‖² + ‖b_i‖² = 1. Obviously, λ_min(H*) ≥ 0 due to the squared loss and the realizability assumption. However, this is not sufficient for local convexity around the ground truth, which requires positive (semi-)definiteness in a neighborhood of the ground truth. In other words, we need to show that λ_min(H*) is strictly greater than 0, so that we can characterize an area in which the Hessian preserves positive definiteness (PD) despite the deviation from the ground truth.

Challenges. As mentioned previously, there are activation functions that lead to redundancy in the parameters. Hence one challenge is to distill properties of the activation functions that preserve PD. Another challenge is the correlation introduced by U* when it is non-orthogonal. So we first study the minimal eigenvalue for orthogonal U* and orthogonal V*, and then link the non-orthogonal case to the orthogonal case.
D.1.2 Warm up: orthogonal case

In this section, we consider the case when U*, V* are unitary matrices, i.e., U^{*⊤}U* = U*U^{*⊤} = I_d (d = k). This case is easier to analyze because the dependency between different elements of x or y can be disentangled, and we are able to provide a lower bound for the Hessian. Before we introduce the lower bound, let's first define the following quantities for an activation function φ:

α_{i,j} := E_{z∼N(0,1)}[(φ(z))^i z^j],
β_{i,j} := E_{z∼N(0,1)}[(φ′(z))^i z^j],
γ := E_{z∼N(0,1)}[φ(z)φ′(z)z],
ρ := min{ (α_{2,0}β_{2,0} − α_{1,0}²β_{1,0}² − β_{1,0}²α_{1,1}²), (α_{2,0}β_{2,2} − α_{1,0}²β_{1,2}² − γ²) }.   (D.4)

We now present a lower bound for general activation functions including sigmoid and tanh.

Lemma D.1.1. Let (a,b) ∈ B denote Σ_{i=1}^k ‖a_i‖² + ‖b_i‖² = 1. Assume d = k and U*, V* are unitary matrices, i.e., U^{*⊤}U* = U*U^{*⊤} = V*V^{*⊤} = V^{*⊤}V* = I_d. Then the minimal eigenvalue of the population Hessian in Eq. (D.3) can be simplified as

min_{(a,b)∈B} E_{x,y}[ ( Σ_{i=1}^k φ′(x_i)φ(y_i)x^⊤a_i + φ′(y_i)φ(x_i)y^⊤b_i )² ].

Let β, ρ be defined as in Eq. (D.4). If the activation function φ satisfies β_{1,1} = 0, then λ_min(H*) ≥ ρ.

Since sigmoid and tanh have derivatives that are symmetric about 0, they satisfy β_{1,1} = 0. Specifically, we have ρ ≈ 0.000658 for sigmoid and ρ ≈ 0.0095 for tanh. For ReLU, β_{1,1} = E_{z∼N(0,1)}[1_{z>0}·z] = 1/√(2π) ≠ 0, so ReLU does not fit this lemma. The full proof of Lemma D.1.1, the lower bound of the population Hessian for ReLU, and the extension to non-orthogonal cases can be found in Appendix D.2.
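The moments in Eq. (D.4) are one-dimensional Gaussian integrals, so they can be estimated with Gauss-Hermite quadrature; a sketch recomputing ρ for sigmoid and tanh (the target decimals are the values reported above):

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

# Gauss-Hermite estimate of the moments in Eq. (D.4) and the resulting rho.
nodes, weights = hermgauss(80)
z = np.sqrt(2.0) * nodes
w = weights / np.sqrt(np.pi)          # E[g(z)] = sum_i w_i g(z_i), z ~ N(0,1)

def rho(phi, dphi):
    E = lambda g: float(np.sum(w * g))
    a = lambda i, j: E(phi(z) ** i * z ** j)       # alpha_{i,j}
    b = lambda i, j: E(dphi(z) ** i * z ** j)      # beta_{i,j}
    gam = E(phi(z) * dphi(z) * z)                  # gamma
    return min(a(2, 0) * b(2, 0) - a(1, 0) ** 2 * b(1, 0) ** 2
               - b(1, 0) ** 2 * a(1, 1) ** 2,
               a(2, 0) * b(2, 2) - a(1, 0) ** 2 * b(1, 2) ** 2 - gam ** 2)

sig = lambda x: 1.0 / (1.0 + np.exp(-x))
rho_sig = rho(sig, lambda x: sig(x) * (1.0 - sig(x)))
rho_tanh = rho(np.tanh, lambda x: 1.0 - np.tanh(x) ** 2)
print(rho_sig, rho_tanh)              # small but strictly positive
assert abs(rho_sig - 0.000658) < 2e-4
assert 0 < rho_tanh < 0.05
```

The smallness of ρ for sigmoid is the reason the sample-complexity bounds in this chapter carry inverse powers of ρ.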
D.1.3 Error bound for the empirical Hessian near the ground truth

In the previous section, we showed PD of the population Hessian at the ground truth in the orthogonal case. Based on that, we can characterize the landscape around the ground truth for the empirical risk. In particular, we bound the difference between the empirical Hessian near the ground truth and the population Hessian at the ground truth. The theorem below provides the error bound w.r.t. the number of samples (n_1, n_2) and the number of observations |Ω| for both sigmoid and ReLU activation functions.

Theorem D.1.1. For any ε > 0, if

n_1 ≳ ε^{−2}td log²d,  n_2 ≳ ε^{−2}td log²d,  |Ω| ≳ ε^{−2}td log²d,

then with probability at least 1 − d^{−t}, for sigmoid/tanh,

‖∇²f_Ω(U,V) − ∇²f_D(U*,V*)‖ ≲ ε + ‖U − U*‖ + ‖V − V*‖;

for ReLU,

‖∇²f_Ω(U,V) − ∇²f_D(U*,V*)‖ ≲ ( ‖V − V*‖^{1/2} + ‖U − U*‖^{1/2} + ε )·( ‖U*‖ + ‖V*‖ )².
The key idea in proving this theorem is to use the population Hessian at (U,V) as a bridge.

On one side, we bound the difference between the population Hessian at the ground truth and the population Hessian at (U,V). This would be easy if the second derivative of the activation function were Lipschitz, which is the case for sigmoid and tanh. ReLU does not have this property; however, we can utilize the condition that the parameters are close enough to the ground truth, together with the piecewise linearity of ReLU, to bound this term.

On the other side, we bound the difference between the empirical Hessian and the population Hessian. A natural idea is to apply the matrix Bernstein inequality. However, there are two obstacles. First, the Gaussian variables are not uniformly bounded. Therefore, we instead use Lemma B.7 in [153], which is a loosely-bounded version of the matrix Bernstein inequality. The second obstacle is that the individual Hessian calculated from one observation (x,y) is not independent of that from another observation (x′,y′), since they may share the same feature x or y. The analyses for vanilla IMC and MC assume all the items (users) are given and the observed entries are independently sampled from the whole matrix. However, our observations are sampled from the joint distribution of X and Y.

To handle the dependency, our model assumes the following two-stage sampling rule. First, the items/users are sampled from their distributions independently; then, given the items and users, the observations Ω are sampled uniformly with replacement. The key question here is how to combine the error bounds from these two stages. Fortunately, we found special structure in the blocks of the Hessian which enables us to separate x and y in each block, and bound the errors in each stage separately. See Appendix D.3 for details.
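The concentration behavior that the matrix Bernstein inequality captures can be seen on the simplest example, an empirical second-moment matrix of Gaussian features, whose spectral-norm error decays roughly like √(k/n); a small illustration:

```python
import numpy as np

# Spectral-norm error of the empirical covariance of N(0, I_k) samples,
# which should shrink roughly like sqrt(k/n) as the sample size n grows.
rng = np.random.default_rng(5)
k = 10
errs = []
for n in (1_000, 4_000, 16_000):
    x = rng.standard_normal((n, k))
    errs.append(np.linalg.norm(x.T @ x / n - np.eye(k), 2))
print(errs)
assert errs[-1] < errs[0]            # error shrinks as n grows
assert errs[-1] < 0.1                # ~ sqrt(10/16000) up to a constant
```

The Hessian blocks here concentrate in the same way, except that the summands are unbounded and correlated across observations, which is exactly what the two-stage argument above handles.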
D.2 Positive Definiteness of Population Hessian
We state some useful facts in this section.
Fact D.2.1. Let A = [a_1 a_2 ··· a_k]. Let diag(A) ∈ R^k denote the vector whose i-th entry is A_{i,i}, ∀i ∈ [k]. Let 1 ∈ R^k denote the all-ones vector. We have the following properties:

(I) Σ_{i=1}^k (a_i^⊤e_i)² = ‖diag(A)‖_2²,
(II) Σ_{i=1}^k a_i^⊤a_i = ‖A‖_F²,
(III) Σ_{i=1}^k Σ_{j=1}^k a_i^⊤a_j = ‖A·1‖_2²,
(IV) Σ_{i≠j} a_i^⊤a_j = ‖A·1‖_2² − ‖A‖_F².

Proof. Using the definitions, it is easy to see that (I), (II) and (III) hold. For (IV), we have

Σ_{i≠j} a_i^⊤a_j = Σ_{i,j} a_i^⊤a_j − Σ_{i=1}^k a_i^⊤a_i = ‖A·1‖_2² − ‖A‖_F²,

where the last step follows from (II) and (III).
Fact D.2.2. Let A = [a_1 a_2 ··· a_k]. Let diag(A) ∈ R^k denote the vector whose i-th entry is A_{i,i}, ∀i ∈ [k]. Let 1 ∈ R^k denote the all-ones vector. We have the following properties:

(I) Σ_{i≠j} a_i^⊤e_ie_i^⊤a_j = diag(A)^⊤·(A·1) − ‖diag(A)‖_2²,
(II) Σ_{i≠j} a_i^⊤e_je_j^⊤a_j = diag(A)^⊤·(A·1) − ‖diag(A)‖_2²,
(III) Σ_{i≠j} a_i^⊤e_i·a_j^⊤e_j = (diag(A)^⊤·1)² − ‖diag(A)‖_2²,
(IV) Σ_{i≠j} a_i^⊤e_j·a_j^⊤e_i = ⟨A^⊤, A⟩ − ‖diag(A)‖_2².

Proof. Proof of (I). We have

Σ_{i≠j} a_i^⊤e_ie_i^⊤a_j = Σ_{i,j} a_i^⊤e_ie_i^⊤a_j − Σ_{i=1}^k a_i^⊤e_ie_i^⊤a_i
= Σ_{i,j} A_{i,i}e_i^⊤a_j − ‖diag(A)‖_2²
= Σ_{i=1}^k A_{i,i}e_i^⊤ Σ_{j=1}^k a_j − ‖diag(A)‖_2²
= diag(A)^⊤·(A·1) − ‖diag(A)‖_2².

Proof of (II). It is similar to (I).

Proof of (III). We have

Σ_{i≠j} a_i^⊤e_i·a_j^⊤e_j = Σ_{i,j} a_i^⊤e_i·a_j^⊤e_j − Σ_{i=1}^k a_i^⊤e_i·a_i^⊤e_i
= Σ_{i=1}^k a_i^⊤e_i · Σ_{j=1}^k a_j^⊤e_j − Σ_{i=1}^k (a_i^⊤e_i)²
= Σ_{i=1}^k A_{i,i} · Σ_{j=1}^k A_{j,j} − Σ_{i=1}^k A_{i,i}²
= (diag(A)^⊤·1)² − ‖diag(A)‖_2².

Proof of (IV). We have

Σ_{i≠j} a_i^⊤e_j·a_j^⊤e_i = Σ_{i≠j} tr(a_i^⊤e_j a_j^⊤e_i)
= Σ_{i≠j} tr(e_j a_j^⊤ e_i a_i^⊤)
= Σ_{i≠j} ⟨e_j a_j^⊤, a_i e_i^⊤⟩
= Σ_{i,j} ⟨e_j a_j^⊤, a_i e_i^⊤⟩ − Σ_{i=1}^k ⟨e_i a_i^⊤, a_i e_i^⊤⟩
= ⟨A^⊤, A⟩ − ‖diag(A)‖_2²,

where the second step follows from tr(ABCD) = tr(BCDA), and the third step follows from tr(AB) = ⟨A, B^⊤⟩.
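Facts D.2.1 and D.2.2 are finite identities, so they can be verified directly on a random matrix; a quick numeric check of representative items (columns a_i of A, standard basis vectors e_i):

```python
import numpy as np

rng = np.random.default_rng(6)
k = 5
A = rng.standard_normal((k, k))
g, one = np.diag(A), np.ones(k)
off = [(i, j) for i in range(k) for j in range(k) if i != j]

# Fact D.2.1 (IV): sum_{i!=j} a_i^T a_j = ||A 1||^2 - ||A||_F^2
lhs = sum(A[:, i] @ A[:, j] for i, j in off)
assert np.isclose(lhs, np.sum((A @ one) ** 2) - np.sum(A ** 2))

# Fact D.2.2 (I): sum_{i!=j} a_i^T e_i e_i^T a_j = diag(A)^T (A 1) - ||diag(A)||^2
lhs = sum(A[i, i] * A[i, j] for i, j in off)
assert np.isclose(lhs, g @ (A @ one) - g @ g)

# Fact D.2.2 (III): sum_{i!=j} (a_i^T e_i)(a_j^T e_j) = (diag(A)^T 1)^2 - ||diag(A)||^2
lhs = sum(A[i, i] * A[j, j] for i, j in off)
assert np.isclose(lhs, (g @ one) ** 2 - g @ g)

# Fact D.2.2 (IV): sum_{i!=j} (a_i^T e_j)(a_j^T e_i) = <A^T, A> - ||diag(A)||^2
lhs = sum(A[j, i] * A[i, j] for i, j in off)
assert np.isclose(lhs, np.trace(A @ A) - g @ g)
print("all identities verified")
```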
D.2.1 Orthogonal case

We first study the orthogonal case, where d = k and U*, V* are unitary matrices, i.e., U^{*⊤}U* = U*U^{*⊤} = V*V^{*⊤} = V^{*⊤}V* = I_d.

Lower bound on the minimum eigenvalue

Lemma D.2.1 (Restatement of Lemma D.1.1). Let (a,b) ∈ B denote Σ_{i=1}^k ‖a_i‖² + ‖b_i‖² = 1. Assume d = k and U*, V* are unitary matrices, i.e., U^{*⊤}U* = U*U^{*⊤} = V*V^{*⊤} = V^{*⊤}V* = I_d. Then the minimal eigenvalue of the population Hessian in Eq. (D.3) can be simplified as

λ_min(H*) = min_{(a,b)∈B} E_{x,y}[ ( Σ_{i=1}^k φ′(x_i)φ(y_i)x^⊤a_i + φ′(y_i)φ(x_i)y^⊤b_i )² ].   (D.5)

Let β, ρ be defined as in Eq. (D.4). If the activation function φ satisfies β_{1,1} = 0, then λ_min(H*) ≥ ρ.
Proof. In the orthogonal case, we can easily transform Eq. (D.3) into Eq. (D.5) since x, y follow normal distributions. Now we decompose Eq. (D.5) into the following three terms:

E_{x,y}[ ( Σ_{i=1}^k φ′(x_i)φ(y_i)x^⊤a_i + φ′(y_i)φ(x_i)y^⊤b_i )² ]
= E_{x,y}[ ( Σ_{i=1}^k φ′(x_i)φ(y_i)x^⊤a_i )² ]   (=: C)
+ E_{x,y}[ ( Σ_{i=1}^k φ′(y_i)φ(x_i)y^⊤b_i )² ]
+ 2·E_{x,y}[ Σ_{i,j} φ′(x_i)φ(y_i)x^⊤a_i·φ′(y_j)φ(x_j)y^⊤b_j ]   (the expectation being =: D).

Note that the first term is similar to the second, so we only lower bound the first term and the third term. Define A = [a_1, a_2, ···, a_k] and B = [b_1, b_2, ···, b_k]. Let A_o be the off-diagonal part of A and A_d the diagonal part of A, i.e., A_o + A_d = A, and let g_A = diag(A) be the vector of the diagonal elements of A. We bound C and D in the following.
For C, we have

E_{x,y}[ ( Σ_{i=1}^k φ′(x_i)φ(y_i)x^⊤a_i )² ]
= Σ_{i=1}^k E_{x,y}[ (φ′(x_i)φ(y_i)x^⊤a_i)² ] + Σ_{i≠j} E_{x,y}[ φ′(x_i)φ(y_i)x^⊤a_i · φ′(x_j)φ(y_j)x^⊤a_j ]
= Σ_{i=1}^k α_{2,0}[ (a_i^⊤e_i)²(β_{2,2} − β_{2,0}) + β_{2,0}‖a_i‖² ]
+ Σ_{i≠j} α_{1,0}²[ β_{1,0}²a_i^⊤a_j + (β_{1,2}β_{1,0} − β_{1,0}²)(a_i^⊤e_ie_i^⊤a_j + a_i^⊤e_je_j^⊤a_j) + β_{1,1}²(a_i^⊤e_i·a_j^⊤e_j + a_i^⊤e_j·a_j^⊤e_i) ]
= C_1 + C_2,

where the last step follows by setting

C_1 = Σ_{i=1}^k α_{2,0}[ (a_i^⊤e_i)²(β_{2,2} − β_{2,0}) + β_{2,0}‖a_i‖² ],
C_2 = Σ_{i≠j} α_{1,0}²[ β_{1,0}²a_i^⊤a_j + (β_{1,2}β_{1,0} − β_{1,0}²)(a_i^⊤e_ie_i^⊤a_j + a_i^⊤e_je_j^⊤a_j) + β_{1,1}²(a_i^⊤e_i·a_j^⊤e_j + a_i^⊤e_j·a_j^⊤e_i) ].

First we can simplify C_1:

C_1 = α_{2,0}(β_{2,2} − β_{2,0}) Σ_{i=1}^k (a_i^⊤e_i)² + α_{2,0}β_{2,0} Σ_{i=1}^k ‖a_i‖_2²
= α_{2,0}(β_{2,2} − β_{2,0})‖diag(A)‖_2² + α_{2,0}β_{2,0}‖A‖_F²,

where the last step follows from Fact D.2.1.

We can rewrite C_2 as

C_2 = α_{1,0}²( β_{1,0}²C_{2,1} + (β_{1,2}β_{1,0} − β_{1,0}²)·(C_{2,2} + C_{2,3}) + β_{1,1}²(C_{2,4} + C_{2,5}) ),
where

C_{2,1} = Σ_{i≠j} a_i^⊤a_j,
C_{2,2} = Σ_{i≠j} a_i^⊤e_ie_i^⊤a_j,
C_{2,3} = Σ_{i≠j} a_i^⊤e_je_j^⊤a_j,
C_{2,4} = Σ_{i≠j} a_i^⊤e_i·a_j^⊤e_j,
C_{2,5} = Σ_{i≠j} a_i^⊤e_j·a_j^⊤e_i.

Using Fact D.2.1, we have

C_{2,1} = ‖A·1‖_2² − ‖A‖_F².

Using Fact D.2.2, we have

C_{2,2} = diag(A)^⊤·(A·1) − ‖diag(A)‖_2²,
C_{2,3} = diag(A)^⊤·(A·1) − ‖diag(A)‖_2²,
C_{2,4} = (diag(A)^⊤·1)² − ‖diag(A)‖_2²,
C_{2,5} = ⟨A^⊤, A⟩ − ‖diag(A)‖_2².

Thus,

C_2 = α_{1,0}²( β_{1,0}²(‖A·1‖_2² − ‖A‖_F²) + 2(β_{1,2}β_{1,0} − β_{1,0}²)·(diag(A)^⊤·(A·1) − ‖diag(A)‖_2²) + β_{1,1}²((diag(A)^⊤·1)² + ⟨A^⊤,A⟩ − 2‖diag(A)‖_2²) ).
We now consider C_1 + C_2 term by term. For the ‖A‖_F² term (from C_1 and C_2), we have

(α_{2,0}β_{2,0} − α_{1,0}²β_{1,0}²)‖A‖_F².

For the ⟨A,A^⊤⟩ term (from C_{2,5}), we have

α_{1,0}²β_{1,1}²⟨A,A^⊤⟩.

For the ‖diag(A)‖_2² term (from C_1 and C_2), we have

(α_{2,0}(β_{2,2} − β_{2,0}) − 2α_{1,0}²(β_{1,2}β_{1,0} − β_{1,0}²) − 2α_{1,0}²β_{1,1}²)‖diag(A)‖_2².

For the ‖A·1‖_2² term (from C_{2,1}), we have

α_{1,0}²β_{1,0}²‖A·1‖_2².

For the diag(A)^⊤·A·1 term (from C_{2,2} and C_{2,3}), we have

2α_{1,0}²(β_{1,2}β_{1,0} − β_{1,0}²)·diag(A)^⊤·A·1.

For the (diag(A)^⊤·1)² term (from C_{2,4}), we have

α_{1,0}²β_{1,1}²(diag(A)^⊤·1)².
Putting it all together, we have

C_1 + C_2
= (α_{2,0}β_{2,0} − α_{1,0}²β_{1,0}²)‖A‖_F² + α_{1,0}²β_{1,1}²⟨A,A^⊤⟩
+ (α_{2,0}(β_{2,2} − β_{2,0}) − 2α_{1,0}²(β_{1,2}β_{1,0} − β_{1,0}²) − 2α_{1,0}²β_{1,1}²)·‖diag(A)‖²
+ α_{1,0}²β_{1,0}²‖A·1‖² + 2α_{1,0}²(β_{1,2}β_{1,0} − β_{1,0}²)(diag(A)^⊤·A·1) + α_{1,0}²β_{1,1}²(diag(A)^⊤·1)²
= (α_{2,0}β_{2,0} − α_{1,0}²β_{1,0}²)(‖A_o‖_F² + ‖g_A‖²) + α_{1,0}²β_{1,1}²(⟨A_o,A_o^⊤⟩ + ‖g_A‖²)
+ (α_{2,0}β_{2,2} − α_{2,0}β_{2,0} − 2α_{1,0}²β_{1,2}β_{1,0} + 2α_{1,0}²β_{1,0}² − 2α_{1,0}²β_{1,1}²)·‖g_A‖²
+ α_{1,0}²β_{1,0}²(‖g_A‖² + ‖A_o·1‖² + 2g_A^⊤·A_o·1)
+ 2α_{1,0}²(β_{1,2}β_{1,0} − β_{1,0}²)(g_A^⊤·A_o·1 + ‖g_A‖²) + α_{1,0}²β_{1,1}²(g_A^⊤·1)²
= (α_{2,0}β_{2,0} − α_{1,0}²β_{1,0}²)‖A_o‖_F² + α_{1,0}²β_{1,1}²⟨A_o,A_o^⊤⟩ + (α_{2,0}β_{2,2} − α_{1,0}²β_{1,1}²)·‖g_A‖²
+ α_{1,0}²β_{1,0}²‖A_o·1‖² + 2α_{1,0}²β_{1,2}β_{1,0}(g_A^⊤·A_o·1) + α_{1,0}²β_{1,1}²(g_A^⊤·1)².

By a series of equivalent transformations, we have removed the expectation, and the formula C becomes a function of A and the moments of φ. These equivalent transformations are mainly based on the fact that x_i, x_j, y_i, y_j for any i ≠ j are independent of each other.
Similarly, we can reformulate D:

E_{x,y}[ Σ_{i,j} φ′(x_i)φ(y_i)x^⊤a_i·φ′(y_j)φ(x_j)y^⊤b_j ]
= Σ_i E_{x,y}[ φ′(x_i)φ(y_i)x^⊤a_i·φ′(y_i)φ(x_i)y^⊤b_i ] + Σ_{i≠j} E_{x,y}[ φ′(x_i)φ(y_i)x^⊤a_i·φ′(y_j)φ(x_j)y^⊤b_j ]
= Σ_i γ²·a_i^⊤e_i·b_i^⊤e_i + Σ_{i≠j}( β_{1,0}²α_{1,1}²·a_i^⊤e_j·b_j^⊤e_i + α_{1,0}α_{1,1}β_{1,0}β_{1,1}(a_i^⊤e_j·b_j^⊤e_j + a_i^⊤e_i·b_j^⊤e_i) + α_{1,0}²β_{1,1}²·a_i^⊤e_i·b_j^⊤e_j )
= (γ² − β_{1,0}²α_{1,1}² − 2α_{1,0}α_{1,1}β_{1,0}β_{1,1} − α_{1,0}²β_{1,1}²)g_A^⊤g_B
+ β_{1,0}²α_{1,1}²⟨A,B^⊤⟩ + α_{1,0}²β_{1,1}²(g_A^⊤1)(g_B^⊤1) + α_{1,0}α_{1,1}β_{1,0}β_{1,1}[(A1)^⊤g_B + (B1)^⊤g_A]
= (γ² − α_{1,0}²β_{1,1}²)g_A^⊤g_B + β_{1,0}²α_{1,1}²⟨A_o,B_o^⊤⟩ + α_{1,0}²β_{1,1}²(g_A^⊤1)(g_B^⊤1) + α_{1,0}α_{1,1}β_{1,0}β_{1,1}[(A_o1)^⊤g_B + (B_o1)^⊤g_A].
Combining the above results, we have

λ_min(H*) = min_{‖A‖_F²+‖B‖_F²=1}( β_{1,0}²α_{1,1}²‖A_o + B_o^⊤‖_F²
+ ‖α_{1,0}β_{1,0}A_o1 + α_{1,0}β_{1,2}g_A + α_{1,1}β_{1,1}g_B‖²
+ ‖α_{1,0}β_{1,0}B_o1 + α_{1,0}β_{1,2}g_B + α_{1,1}β_{1,1}g_A‖²
+ (α_{2,0}β_{2,0} − α_{1,0}²β_{1,0}² − β_{1,0}²α_{1,1}² − α_{1,0}²β_{1,1}²)(‖A_o‖_F² + ‖B_o‖_F²)
+ (1/2)·α_{1,0}²β_{1,1}²(‖A_o + A_o^⊤‖_F² + ‖B_o + B_o^⊤‖_F²)
+ (α_{2,0}β_{2,2} − α_{1,0}²β_{1,1}² − α_{1,0}²β_{1,2}² − α_{1,1}²β_{1,1}²)·(‖g_A‖² + ‖g_B‖²)
+ 2(γ² − α_{1,0}²β_{1,1}² − 2α_{1,0}α_{1,1}β_{1,1}β_{1,2})g_A^⊤g_B
+ α_{1,0}²β_{1,1}²(g_A^⊤1 + g_B^⊤1)² ).   (D.6)
The final form of the above formula has a clear structure: most non-negative terms have been extracted, and A, B are separated into their off-diagonal and diagonal parts, which can be dealt with independently. Now we consider activation functions that satisfy β_{1,1} = 0, which further simplifies the equation; note that sigmoid and tanh satisfy this condition.

Finally, for β_{1,1} = 0, we obtain

λ_min(H*) = min_{Σ_{i=1}^k ‖a_i‖²+‖b_i‖²=1} E_{x,y}[ ( Σ_{i=1}^k φ′(x_i)φ(y_i)x^⊤a_i + φ′(y_i)φ(x_i)y^⊤b_i )² ]
= min_{‖A‖_F²+‖B‖_F²=1}( (α_{2,0}β_{2,0} − α_{1,0}²β_{1,0}² − β_{1,0}²α_{1,1}²)(‖A_o‖_F² + ‖B_o‖_F²)
+ (α_{2,0}β_{2,2} − α_{1,0}²β_{1,2}² − γ²)(‖g_A‖² + ‖g_B‖²)
+ β_{1,0}²α_{1,1}²‖A_o + B_o^⊤‖_F² + γ²‖g_A + g_B‖²
+ α_{1,0}²(‖β_{1,0}A_o1 + β_{1,2}g_A‖² + ‖β_{1,0}B_o1 + β_{1,2}g_B‖²) )
≥ min{ (α_{2,0}β_{2,0} − α_{1,0}²β_{1,0}² − β_{1,0}²α_{1,1}²), (α_{2,0}β_{2,2} − α_{1,0}²β_{1,2}² − γ²) } =: ρ.

For sigmoid, we have ρ ≈ 0.000658; for tanh, we have ρ ≈ 0.0095.
The following lemma will be used when transforming non-orthogonal cases to orthogonal cases.

Lemma D.2.2. For any A = [a_1, a_2, ···, a_k] ∈ R^{d×k}, we have

E_{x,y∼D_k}[ ‖ Σ_{i=1}^k φ′(x_i)φ(y_i)a_i ‖² ] ≥ (α_{2,0}β_{2,0} − α_{1,0}²β_{1,0}²)‖A‖_F².
Proof. Recall that 1 ∈ R^k denotes the all-ones vector. We have

E_{x,y∼D_k}[ ‖ Σ_{i=1}^k φ′(x_i)φ(y_i)a_i ‖² ]
= E_{x,y∼D_k}[ Σ_{i=1}^k (φ′(x_i)φ(y_i))²‖a_i‖² ] + E_{x,y∼D_k}[ Σ_{i≠j} φ′(x_i)φ(y_i)φ′(x_j)φ(y_j)a_i^⊤a_j ]
= (α_{2,0}β_{2,0} − α_{1,0}²β_{1,0}²)‖A‖_F² + α_{1,0}²β_{1,0}²‖A·1‖²
≥ (α_{2,0}β_{2,0} − α_{1,0}²β_{1,0}²)‖A‖_F².

Thus, we complete the proof.
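The equality established in the proof (the expectation equals (α_{2,0}β_{2,0} − α_{1,0}²β_{1,0}²)‖A‖_F² + α_{1,0}²β_{1,0}²‖A·1‖²) can be checked by Monte Carlo for sigmoid, with the moments computed by quadrature; a sketch:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

# Moments of sigmoid via Gauss-Hermite quadrature.
phi = lambda z: 1.0 / (1.0 + np.exp(-z))
dphi = lambda z: phi(z) * (1.0 - phi(z))
nodes, wts = hermgauss(60)
z, w = np.sqrt(2.0) * nodes, wts / np.sqrt(np.pi)
a20, b20 = np.sum(w * phi(z) ** 2), np.sum(w * dphi(z) ** 2)
a10, b10 = np.sum(w * phi(z)), np.sum(w * dphi(z))

# Monte Carlo estimate of E || sum_i phi'(x_i) phi(y_i) a_i ||^2.
rng = np.random.default_rng(7)
k, n = 2, 200_000
A = np.ones((k, k))                            # columns a_i; any A works
x, y = rng.standard_normal((n, k)), rng.standard_normal((n, k))
v = (dphi(x) * phi(y)) @ A.T                   # row m: sum_i phi'(x_i)phi(y_i)a_i
lhs = np.mean(np.sum(v ** 2, axis=1))

exact = (a20 * b20 - a10**2 * b10**2) * np.sum(A**2) \
        + a10**2 * b10**2 * np.sum(A.sum(axis=1) ** 2)
assert abs(lhs - exact) < 5e-3
assert lhs >= (a20 * b20 - a10**2 * b10**2) * np.sum(A**2)
print(lhs, exact)
```

The dropped term α_{1,0}²β_{1,0}²‖A·1‖² is non-negative, which is all the lemma needs.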
Now let us show the PD of the population Hessian of Eq. (6.4) for the ReLU case, where u^{*(1)} is the first row of U* and W ∈ R^{(d−1)×k}.

Lemma D.2.3. Consider the activation function to be ReLU. Assume k = d, U*, V* are unitary matrices and u*_{1,i} ≠ 0, ∀i ∈ [k]. Then the minimal eigenvalue of the corresponding population Hessian of Eq. (6.4) is lower bounded:

λ_min(∇²f^{ReLU}_D(W*, V*)) ≳ min_{i∈[k]} u*²_{1,i},

where W* = U*_{2:d,:} is the last d−1 rows of U* and

f^{ReLU}_D(W,V) := E_{x,y}[ ( φ(W^⊤x_{2:d} + x_1(u^{*(1)})^⊤)^⊤φ(V^⊤y) − A(x,y) )² ].   (D.7)

Proof. By fixing u_{i,1} = u*_{i,1}, ∀i ∈ [k], we can rewrite the minimal eigenvalue of the Hessian as follows. For simplicity, we denote λ_min(H) := λ_min(∇²f^{ReLU}_D(W*,V*)). First we observe that

λ_min(H) = min_{Σ_{i=1}^k ‖a_i‖²+‖b_i‖²=1, a_{i,1}=0 ∀i∈[k]} E_{x,y}[ ( Σ_{i=1}^k φ′(u_i^{*⊤}x)φ(v_i^{*⊤}y)x^⊤a_i + φ′(v_i^{*⊤}y)φ(u_i^{*⊤}x)y^⊤b_i )² ].   (D.8)
Without loss of generality, we assume V* = I. Set x = U*s; then we have

λ_min(H) = min_{Σ_{i=1}^k ‖a_i‖²+‖b_i‖²=1, a_{i,1}=0 ∀i∈[k]} E_{s,y}[ ( Σ_{i=1}^k φ′(s_i)φ(y_i)s^⊤U^{*⊤}a_i + φ′(y_i)φ(s_i)y^⊤b_i )² ]
= min_{Σ_{i=1}^k ‖a_i‖²+‖b_i‖²=1, u^{*(1)}a_i=0 ∀i∈[k]} E_{s,y}[ ( Σ_{i=1}^k φ′(s_i)φ(y_i)s^⊤a_i + φ′(y_i)φ(s_i)y^⊤b_i )² ],

where u^{*(1)} is the first row of U* and the second equality holds because we replace U^{*⊤}a_i by a_i. In the ReLU case, we have

α_{1,1} = α_{2,0} = β_{1,0} = β_{1,2} = β_{2,0} = β_{2,2} = γ = 1/2 and α_{1,0} = β_{1,1} = 1/√(2π).

According to Eq. (D.6), we have

λ_min(H) ≥ min_{‖A‖_F²+‖B‖_F²=1, u^{*(1)}A=0} C_0( ‖A_o‖_F² + ‖B_o‖_F² + ‖A_o + A_o^⊤‖_F²/2 + ‖B_o + B_o^⊤‖_F²/2
+ ‖A_o + B_o^⊤‖_F² + ‖g_A + g_B‖²
+ ‖A_o1 + g_A + g_B‖² + ‖B_o1 + g_A + g_B‖² + (g_A^⊤1 + g_B^⊤1)² ),

where C_0 is a universal constant. Now we show that there exists a positive number c_0 such that λ_min(H) ≥ c_0. If there were no such number, i.e., λ_min(H) = 0, then we would have A_o = B_o = 0 and g_A = −g_B. By the assumption that u*_{1,i} ≠ 0 and the condition u^{*(1)}A = 0, we then have g_A = g_B = 0, which violates ‖A‖_F² + ‖B‖_F² = 1. So λ_min(H) > 0. An exact value for c_0 is postponed to Theorem D.2.5, which gives the lower bound for the non-orthogonal case.
D.2.2 Non-orthogonal Case

The restriction of orthogonality on U, V is too strong, so we need to consider general non-orthogonal cases. Under the Gaussian assumption, the non-orthogonal case can be transformed to the orthogonal case according to the following relationship.

Lemma D.2.3. Let U ∈ R^{d×k} be a full-column-rank matrix. Let g: R^k → [0,∞). Define λ(U) = σ_1^k(U)/(Π_{i=1}^k σ_i(U)). Let D denote the normal distribution. Then

E_{x∼D_d}[ g(U^⊤x) ] ≥ (1/λ(U))·E_{z∼D_k}[ g(σ_k(U)z) ].   (D.9)

Remark. This lemma transforms U^⊤x, in which the elements of x are mixed, to σ_k(U)z, in which all the elements are fed into g independently, at the sacrifice of a condition-number factor of U. Using Lemma D.2.3, we are able to show PD for non-orthogonal U*, V*.
Proof. Let P ∈ R^{d×k} be an orthonormal basis of the column space of U, and let W = [w_1, w_2, ···, w_k] = P^⊤U ∈ R^{k×k}. Then

E_{x∼D_d}[g(U^⊤x)] = E_{z∼D_k}[g(W^⊤z)]
= ∫ (2π)^{−k/2} g(W^⊤z) e^{−‖z‖²/2} dz
= ∫ (2π)^{−k/2} g(s) e^{−‖W^{†⊤}s‖²/2} |det(W^†)| ds
≥ ∫ (2π)^{−k/2} g(s) e^{−σ_1²(W^†)‖s‖²/2} |det(W^†)| ds
= ∫ (2π)^{−k/2} g(t/σ_1(W^†)) e^{−‖t‖²/2} |det(W^†)|/σ_1^k(W^†) dt
= (1/λ(W)) ∫ (2π)^{−k/2} g(σ_k(W)t) e^{−‖t‖²/2} dt
= (1/λ(U))·E_{z∼D_k}[g(σ_k(U)z)],

where the third step follows by the substitution z = W^{†⊤}s, the fourth step follows from ‖W^{†⊤}s‖ ≤ σ_1(W^†)‖s‖, and the fifth step follows by substituting s = t/σ_1(W^†).
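For the quadratic test function g(s) = ‖s‖², both sides of Eq. (D.9) are available in closed form (E[‖U^⊤x‖²] = ‖U‖_F² and E[‖σ_k z‖²] = σ_k²·k), which gives a deterministic sanity check of the inequality and of λ(U):

```python
import numpy as np

def lam(U):
    """lambda(U) = sigma_1^k / prod_i sigma_i, as defined in Lemma D.2.3."""
    s = np.linalg.svd(U, compute_uv=False)
    return s[0] ** len(s) / np.prod(s)

U = np.diag([3.0, 2.0, 1.0])              # full-rank example, sigma_k = 1
s = np.linalg.svd(U, compute_uv=False)
k = 3
lhs = np.sum(s ** 2)                      # E||U^T x||^2 = ||U||_F^2 = 14
rhs = (1.0 / lam(U)) * s[-1] ** 2 * k     # (1/lambda(U)) * sigma_k^2 * k
assert abs(lam(U) - 4.5) < 1e-9           # 3^3 / (3*2*1)
assert lhs >= rhs
print(lhs, rhs, lam(U))
```

The gap between the two sides shows how lossy the condition-number factor is when the singular values of U are spread out.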
Using Lemma D.2.3, we are able to provide the lower bound on the minimal eigenvalue for sigmoid and tanh.

Theorem D.2.4. Assume σ_k(U*) = σ_k(V*) = 1, and assume β_{1,1} defined in Eq. (D.4) is 0. Then the minimal eigenvalue of the Hessian defined in Eq. (D.3) can be lower bounded by

λ_min(H*) ≥ ρ / ( λ(U*)λ(V*)·max{κ(U*), κ(V*)} ),

where

λ(U) = σ_1^k(U)/(Π_{i=1}^k σ_i(U)),  κ(U) = σ_1(U)/σ_k(U).

Proof. Let P ∈ R^{d×k}, Q ∈ R^{d×k} be orthonormal bases of U*, V* respectively. Let R ∈ R^{k×k}, S ∈ R^{k×k} satisfy U* = P·R and V* = Q·S. Let P_⊥ ∈ R^{d×(d−k)}, Q_⊥ ∈ R^{d×(d−k)} be the orthogonal complements of P, Q respectively. Set a_i = P·s_i + P_⊥·t_i and b_i = Q·p_i + Q_⊥·q_i. Then we can decompose the minimal eigenvalue problem into three terms.
E_{x,y}[ ( Σ_{i=1}^k φ′(u_i^{*⊤}x)φ(v_i^{*⊤}y)x^⊤a_i + φ′(v_i^{*⊤}y)φ(u_i^{*⊤}x)y^⊤b_i )² ]
= E_{x,y}[ ( Σ_{i=1}^k φ′(u_i^{*⊤}x)φ(v_i^{*⊤}y)x^⊤(Ps_i + P_⊥t_i) + φ′(v_i^{*⊤}y)φ(u_i^{*⊤}x)y^⊤(Qp_i + Q_⊥q_i) )² ]
= E_{x,y}[ ( Σ_{i=1}^k φ′(u_i^{*⊤}x)φ(v_i^{*⊤}y)x^⊤Ps_i + φ′(v_i^{*⊤}y)φ(u_i^{*⊤}x)y^⊤Qp_i )² ]   (=: C_1)
+ E_{x,y}[ ( Σ_{i=1}^k φ′(u_i^{*⊤}x)φ(v_i^{*⊤}y)x^⊤P_⊥t_i )² ]   (=: C_2)
+ E_{x,y}[ ( Σ_{i=1}^k φ′(v_i^{*⊤}y)φ(u_i^{*⊤}x)y^⊤Q_⊥q_i )² ],

where we omit the cross terms containing a single independent Gaussian variable, whose expectation is zero. Using Lemma D.2.3, we can lower bound the term C_1 as follows:

C_1 = E_{x,y}[ ( Σ_{i=1}^k φ′(u_i^{*⊤}x)φ(v_i^{*⊤}y)x^⊤U*R^{−1}s_i + φ′(v_i^{*⊤}y)φ(u_i^{*⊤}x)y^⊤V*S^{−1}p_i )² ]
≥ (1/(λ(U*)λ(V*)))·E_{x,y∼D_k}[ ( Σ_{i=1}^k φ′(σ_k(U*)x_i)φ(σ_k(V*)y_i)·σ_k(U*)x^⊤R^{−1}s_i + φ′(σ_k(V*)y_i)φ(σ_k(U*)x_i)·σ_k(V*)y^⊤S^{−1}p_i )² ].

And

C_2 ≥ E_{x,y}[ ‖ Σ_{i=1}^k φ′(u_i^{*⊤}x)φ(v_i^{*⊤}y)t_i ‖² ] ≥ (1/(λ(U*)λ(V*)))·E_{x,y∼D_k}[ ‖ Σ_{i=1}^k φ′(σ_k(U*)x_i)φ(σ_k(V*)y_i)t_i ‖² ].

Without loss of generality, we assume σ_k(U*) = σ_k(V*) = 1. Then, according to Lemma D.2.1 and Lemma D.2.2, we have

λ_min(H) ≥ (1/( λ(U*)λ(V*)·max{κ(U*), κ(V*)} ))·min{ (α_{2,0}β_{2,0} − α_{1,0}²β_{1,0}² − β_{1,0}²α_{1,1}²), (α_{2,0}β_{2,2} − α_{1,0}²β_{1,2}² − γ²) }.

Recalling the definition of ρ in Eq. (D.4) completes the proof.
For the ReLU case, we lower bound the minimal eigenvalue of the Hessian in the non-orthogonal case.
Theorem D.2.5. Consider the ReLU activation. Assume U*, V* are full-column-rank matrices and u*_{1,i} ≠ 0 for all i ∈ [k]. Then the minimal eigenvalue of the Hessian of Eq. (D.7) is lower bounded as
λ_min(∇²f_D^{ReLU}(U*, V*)) ≳ (1/(λ(U*)λ(V*))) · (min_{i∈[k]} |u*_{1,i}| / ((1 + ‖u*^{(1)}‖) max{‖U*‖, ‖V*‖}))²,
where u*^{(1)} is the first row of U*.
Proof. Let P ∈ R^{d×k}, Q ∈ R^{d×k} be orthonormal bases for the column spaces of U*, V* respectively. Let R ∈ R^{k×k}, S ∈ R^{k×k} satisfy U* = P·R and V* = Q·S. Let P⊥ ∈ R^{d×(d−k)}, Q⊥ ∈ R^{d×(d−k)} be the orthogonal complements of P, Q respectively. Set a_i = P·s_i + P⊥·t_i and b_i = Q·p_i + Q⊥·q_i. Following the proof of Theorem D.2.4 and applying Lemma D.2.2 and Lemma D.2.3, we have the following.
E_{x,y}(∑_{i=1}^k φ′(u_i*ᵀx)φ(v_i*ᵀy)xᵀa_i + φ′(v_i*ᵀy)φ(u_i*ᵀx)yᵀb_i)²
≥ (1/(λ(U*)λ(V*))) E_{x,y∼D_k}[(∑_{i=1}^k φ′(σ_k(U*)x_i)φ(σ_k(V*)y_i)xᵀR⁻¹s_i σ_k(U*) + φ′(σ_k(V*)y_i)φ(σ_k(U*)x_i)yᵀS⁻¹p_i σ_k(V*))²]
 + (1/(λ(U*)λ(V*))) E_{x,y∼D_k}‖∑_{i=1}^k φ′(σ_k(U*)x_i)φ(σ_k(V*)y_i)t_i‖²
 + (1/(λ(U*)λ(V*))) E_{x,y∼D_k}‖∑_{i=1}^k φ′(σ_k(U*)x_i)φ(σ_k(V*)y_i)q_i‖²
≥ (1/(16 λ(U*)λ(V*))) (‖A_o‖_F² + ‖B_o‖_F² + ‖g_A + g_B‖² + 3(‖T‖_F² + ‖Q‖_F²)),
where A = [R⁻¹s_1, R⁻¹s_2, ⋯, R⁻¹s_k], B = [S⁻¹p_1, S⁻¹p_2, ⋯, S⁻¹p_k], T = [t_1, t_2, ⋯, t_k], Q = [q_1, q_2, ⋯, q_k]; here A_o denotes the off-diagonal part of A and g_A its diagonal (as a vector), so that ‖A‖_F² = ‖A_o‖_F² + ‖g_A‖², and similarly for B_o, g_B.
Similar to Eq. (D.8), we can find the minimal eigenvalue of the Hessian via the following constrained minimization problem:
λ_min(H) = min_{∑_{i=1}^k(‖a_i‖² + ‖b_i‖²) = 1, a_{i,1} = 0 ∀i∈[k]} E_{x,y}(∑_{i=1}^k φ′(u_i*ᵀx)φ(v_i*ᵀy)xᵀa_i + φ′(v_i*ᵀy)φ(u_i*ᵀx)yᵀb_i)²,
which is lower bounded by the following program:
min_{A,B,T,Q} (1/(16 λ(U*)λ(V*))) (‖A_o‖_F² + ‖B_o‖_F² + ‖g_A + g_B‖² + 3(‖T‖_F² + ‖Q‖_F²))
s.t. ‖RA‖_F² + ‖SB‖_F² + ‖T‖_F² + ‖Q‖_F² = 1,
 e_1ᵀPRA + e_1ᵀP⊥T = 0.   (D.10)
Let c_1 denote the minimum of the above program. We first show c_1 > 0 by contradiction. If c_1 = 0, then T = Q = 0, A_o = B_o = 0, and g_A = −g_B. Since T = 0, we have e_1ᵀPRA = e_1ᵀU*A = 0. Since (e_1ᵀU*)_i ≠ 0 for all i by assumption, this forces g_A = g_B = 0, which violates the constraint ‖RA‖_F² + ‖SB‖_F² + ‖T‖_F² + ‖Q‖_F² = 1.
Now we give a lower bound for c_1. First we note
‖RA‖_F² + ‖SB‖_F² + ‖T‖_F² + ‖Q‖_F² ≤ ‖R‖²‖A‖_F² + ‖S‖²‖B‖_F² + ‖T‖_F² + ‖Q‖_F².
Therefore, since ‖R‖ = ‖U*‖ and ‖S‖ = ‖V*‖,
‖A‖_F² + ‖B‖_F² + ‖T‖_F² + ‖Q‖_F² ≥ 1 / max{‖U*‖², ‖V*‖²}.
Also, since e_1ᵀU*A_o + (e_1ᵀU*) ∘ g_Aᵀ + e_1ᵀP⊥T = 0, where ∘ denotes the element-wise product, we have
‖g_A‖² ≤ (1/min_i|u*_{1,i}|)² (‖u*^{(1)}‖‖A_o‖ + ‖T‖)² ≤ ((1 + ‖u*^{(1)}‖)/min_i|u*_{1,i}|)² · 2(‖A_o‖_F² + ‖T‖_F²).
Note that ‖g_A‖² + ‖g_A + g_B‖² ≥ (1/2)‖g_B‖². Now let us return to the main part of the objective function in Eq. (D.10).
‖A_o‖_F² + ‖B_o‖_F² + ‖g_A + g_B‖² + 3(‖T‖_F² + ‖Q‖_F²)
≥ (2/3)(‖A_o‖_F² + ‖T‖_F²) + (1/3)‖A_o‖_F² + ‖B_o‖_F² + ‖g_A + g_B‖² + ‖T‖_F² + ‖Q‖_F²
≥ (1/3)(min_i|u*_{1,i}|/(1 + ‖u*^{(1)}‖))²‖g_A‖² + (1/3)‖A_o‖_F² + ‖B_o‖_F² + ‖g_A + g_B‖² + ‖T‖_F² + ‖Q‖_F²
≥ (1/12)(min_i|u*_{1,i}|/(1 + ‖u*^{(1)}‖))²(‖g_A‖² + ‖g_B‖²) + (1/3)‖A_o‖_F² + ‖B_o‖_F² + ‖T‖_F² + ‖Q‖_F²
≥ (1/12)(min_i|u*_{1,i}|/(1 + ‖u*^{(1)}‖))²(‖g_A‖² + ‖g_B‖² + ‖A_o‖_F² + ‖B_o‖_F² + ‖T‖_F² + ‖Q‖_F²)
≥ (1/12)(min_i|u*_{1,i}| / ((1 + ‖u*^{(1)}‖) max{‖U*‖, ‖V*‖}))².
Therefore,
c_1 ≥ (1/(200 λ(U*)λ(V*))) (min_i|u*_{1,i}| / ((1 + ‖u*^{(1)}‖) max{‖U*‖, ‖V*‖}))².
D.3 Positive Definiteness of the Empirical Hessian
For any (U, V), the population Hessian can be decomposed into the following 2k × 2k blocks (i ∈ [k], j ∈ [k]):
∂²f_D(U,V)/∂u_i∂u_j = E_{x,y}[φ′(u_iᵀx)φ′(u_jᵀx)xxᵀφ(v_iᵀy)φ(v_jᵀy)] + δ_{ij} E_{x,y}[(φ(Uᵀx)ᵀφ(Vᵀy) − φ(U*ᵀx)ᵀφ(V*ᵀy)) φ″(u_iᵀx)φ(v_iᵀy)xxᵀ],
∂²f_D(U,V)/∂u_i∂v_j = E_{x,y}[φ′(u_iᵀx)φ′(v_jᵀy)xyᵀφ(v_iᵀy)φ(u_jᵀx)] + δ_{ij} E_{x,y}[(φ(Uᵀx)ᵀφ(Vᵀy) − φ(U*ᵀx)ᵀφ(V*ᵀy)) φ′(u_iᵀx)φ′(v_iᵀy)xyᵀ],   (D.11)
where δ_{ij} = 1 if i = j and δ_{ij} = 0 otherwise. The formulas for ∂²f_D(U,V)/∂v_i∂v_j and ∂²f_D(U,V)/∂v_i∂u_j can be written similarly.
Replacing E_{x,y} by (1/|Ω|)∑_{(x,y)∈Ω} in the above formulas, we obtain the corresponding empirical Hessian, ∇²f_Ω(U,V).
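As an illustration of this replacement (not part of the proof), the following sketch forms one (u_i, u_j) block of the empirical Hessian for the sigmoid activation, with the δ_{ij} correction term omitted, and compares it against a larger-sample surrogate of the population block. The dimensions, sample sizes, and random weight vectors below are arbitrary demo choices.

```python
import numpy as np

# One (u_i, u_j) block of the empirical Hessian (delta_ij term omitted):
# H = (1/n) sum phi'(u_i^T x) phi'(u_j^T x) phi(v_i^T y) phi(v_j^T y) x x^T,
# compared with a larger-sample surrogate of its population counterpart.
rng = np.random.default_rng(1)
d, n_small, n_big = 5, 20000, 200000
ui, uj, vi, vj = rng.standard_normal((4, d))

phi = lambda z: 1.0 / (1.0 + np.exp(-z))     # sigmoid
dphi = lambda z: phi(z) * (1.0 - phi(z))     # sigmoid derivative

def block(n):
    x = rng.standard_normal((n, d))
    y = rng.standard_normal((n, d))
    w = dphi(x @ ui) * dphi(x @ uj) * phi(y @ vi) * phi(y @ vj)
    return (w[:, None] * x).T @ x / n        # weighted (1/n) sum of x x^T

H_emp, H_pop = block(n_small), block(n_big)
err = np.linalg.norm(H_emp - H_pop, 2)
assert err < 0.1   # the two estimates agree up to sampling error
```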
We now bound the difference between ∇2fΩ(U, V ) and ∇2fD(U∗, V ∗).
Theorem D.3.1 (Restatement of Theorem D.1.1). For any ε > 0, if
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²t log d, |Ω| ≳ ε⁻²td log²d,
then with probability 1 − d⁻ᵗ, for sigmoid/tanh,
‖∇²f_Ω(U,V) − ∇²f_D(U*,V*)‖ ≲ ε + ‖U − U*‖ + ‖V − V*‖,
and for ReLU,
‖∇²f_Ω(U,V) − ∇²f_D(U*,V*)‖ ≲ ((‖V − V*‖/σ_k(V*))^{1/2} + (‖U − U*‖/σ_k(U*))^{1/2} + ε)(‖U*‖ + ‖V*‖)².
Proof. Define H(U,V) ∈ R^{(2kd)×(2kd)} as the symmetric matrix whose blocks are
H_{u_i,u_j} = E_{x,y}[φ′(u_iᵀx)φ′(u_jᵀx)xxᵀφ(v_iᵀy)φ(v_jᵀy)],
H_{u_i,v_j} = E_{x,y}[φ′(u_iᵀx)φ′(v_jᵀy)xyᵀφ(v_iᵀy)φ(u_jᵀx)],   (D.12)
where H_{u_i,u_j} ∈ R^{d×d} and H_{u_i,v_j} ∈ R^{d×d} correspond to ∂²f_D(U,V)/∂u_i∂u_j and ∂²f_D(U,V)/∂u_i∂v_j respectively.
We decompose the difference as
‖∇²f_Ω(U,V) − ∇²f_D(U*,V*)‖ ≤ ‖∇²f_Ω(U,V) − H(U,V)‖ + ‖H(U,V) − ∇²f_D(U*,V*)‖.
Combining Lemma D.3.1 and Lemma D.3.13, we complete the proof.
Lemma D.3.1. For any ε > 0, if
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²t log d, |Ω| ≳ ε⁻²td log²d,
then with probability 1 − d⁻ᵗ, for sigmoid/tanh,
‖∇²f_Ω(U,V) − H(U,V)‖ ≲ ε + ‖U − U*‖ + ‖V − V*‖,
and for ReLU,
‖∇²f_Ω(U,V) − H(U,V)‖ ≲ ε‖U*‖‖V*‖.
Proof. It suffices to bound each block of ∇²f_Ω(U,V) − H(U,V). We can show that if
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²t log d, |Ω| ≳ ε⁻²td log²d,
then with probability 1 − d⁻ᵗ,
‖(E_{x,y} − (1/|Ω|)∑_{(x,y)∈Ω})[φ′(u_iᵀx)φ′(u_jᵀx)xxᵀφ(v_iᵀy)φ(v_jᵀy)]‖ ≲ ε‖U*‖ᵖ‖V*‖ᵖ,   (Lemma D.3.2)
‖(1/|Ω|)∑_{(x,y)∈Ω}[(φ(Uᵀx)ᵀφ(Vᵀy) − φ(U*ᵀx)ᵀφ(V*ᵀy))φ″(u_iᵀx)φ(v_iᵀy)xxᵀ]‖ ≲ ‖U − U*‖ + ‖V − V*‖,   (Lemma D.3.5)
‖(E_{x,y} − (1/|Ω|)∑_{(x,y)∈Ω})[φ′(u_iᵀx)φ′(v_jᵀy)xyᵀφ(v_iᵀy)φ(u_jᵀx)]‖ ≲ ε‖U*‖ᵖ‖V*‖ᵖ,   (Lemma D.3.6)
‖(1/|Ω|)∑_{(x,y)∈Ω}[(φ(Uᵀx)ᵀφ(Vᵀy) − φ(U*ᵀx)ᵀφ(V*ᵀy))φ′(u_iᵀx)φ′(v_iᵀy)xyᵀ]‖ ≲ ‖U − U*‖ + ‖V − V*‖,   (Lemma D.3.8)
where p = 1 if φ is ReLU and p = 0 if φ is sigmoid/tanh.
Note that for the ReLU activation, for any given U, V, the second term is 0 because φ″(z) = 0 almost everywhere.
Lemma D.3.2. If
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²t log d, |Ω| ≳ ε⁻²td log²d,
then with probability at least 1 − d⁻ᵗ,
‖(E_{x,y} − (1/|Ω|)∑_{(x,y)∈Ω})[φ′(u_iᵀx)φ′(u_jᵀx)xxᵀφ(v_iᵀy)φ(v_jᵀy)]‖ ≤ ε‖v_i‖ᵖ‖v_j‖ᵖ,
where p = 1 if φ is ReLU and p = 0 if φ is sigmoid/tanh.
Proof. Let B(x,y) = φ′(u_iᵀx)φ′(u_jᵀx)xxᵀφ(v_iᵀy)φ(v_jᵀy). By applying Lemma D.3.10 and Properties (I)–(III), (VI) in Lemma D.3.3 and Lemma D.3.4, we have for any ε > 0: if
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²t log d,
then with probability at least 1 − d⁻²ᵗ,
‖E_{x,y}[B(x,y)] − (1/|S|)∑_{(x,y)∈S} B(x,y)‖ ≤ ε‖v_i‖ᵖ‖v_j‖ᵖ.   (D.13)
By applying Lemma D.3.11 and Properties (I), (III)–(V) in Lemma D.3.3 and Lemma D.3.4, we have for any ε > 0: if
n_1 ≳ ε⁻¹td log²d, n_2 ≳ ε⁻²t log d,
then
‖(1/n_1)∑_{l∈[n_1]} (φ′(u_iᵀx_l)φ′(u_jᵀx_l))²‖x_l‖²x_lx_lᵀ‖ ≲ d,
and
‖(1/n_2)∑_{l∈[n_2]} (φ(v_iᵀy_l)φ(v_jᵀy_l))²‖ ≲ ‖v_i‖²ᵖ‖v_j‖²ᵖ.
Therefore,
max{‖(1/|S|)∑_{(x,y)∈S} B(x,y)B(x,y)ᵀ‖, ‖(1/|S|)∑_{(x,y)∈S} B(x,y)ᵀB(x,y)‖} ≲ d‖v_i‖²ᵖ‖v_j‖²ᵖ.   (D.14)
We can apply Lemma D.3.12, using Eq. (D.14) and Property (I) in Lemma D.3.3 and Lemma D.3.4, to obtain the following result. If
|Ω| ≳ ε⁻²td log²d,
then with probability at least 1 − d⁻²ᵗ,
‖(1/|S|)∑_{(x,y)∈S} B(x,y) − (1/|Ω|)∑_{(x,y)∈Ω} B(x,y)‖ ≲ ε‖v_i‖ᵖ‖v_j‖ᵖ.   (D.15)
Combining Eq. (D.13) and (D.15), we finish the proof.
Lemma D.3.3. Define T(z) = φ′(u_iᵀz)φ′(u_jᵀz)zzᵀ. If z ∼ Z with Z = N(0, I_d) and φ is ReLU or sigmoid/tanh, the following hold for T(z) and any t > 1:
(I) Pr_{z∼Z}[‖T(z)‖ ≤ 5td log n] ≥ 1 − n⁻¹d⁻ᵗ;
(II) max_{‖a‖=‖b‖=1}(E_{z∼Z}[(aᵀT(z)b)²])^{1/2} ≲ 1;
(III) max{‖E_{z∼Z}[T(z)T(z)ᵀ]‖, ‖E_{z∼Z}[T(z)ᵀT(z)]‖} ≲ d;
(IV) max_{‖a‖=1}(E_{z∼Z}[(aᵀT(z)T(z)ᵀa)²])^{1/2} ≲ d;
(V) ‖E_{z∼Z}[T(z)T(z)ᵀT(z)T(z)ᵀ]‖ ≲ d³;
(VI) ‖E_{z∼Z}[T(z)]‖ ≲ 1.
Proof. Note that 0 ≤ φ′(z) ≤ 1; therefore (I) can be proved by Proposition 1 of [68]. (II)–(VI) can be proved by Hölder's inequality.
Lemma D.3.4. Define T(z) = φ(v_iᵀz)φ(v_jᵀz). If z ∼ Z with Z = N(0, I_d) and φ is ReLU or sigmoid/tanh, the following hold for T(z) and any t > 1:
(I) Pr_{z∼Z}[‖T(z)‖ ≤ 5t‖v_i‖ᵖ‖v_j‖ᵖ log n] ≥ 1 − n⁻¹d⁻ᵗ;
(II) max_{‖a‖=‖b‖=1}(E_{z∼Z}[(aᵀT(z)b)²])^{1/2} ≲ ‖v_i‖ᵖ‖v_j‖ᵖ;
(III) max{‖E_{z∼Z}[T(z)T(z)ᵀ]‖, ‖E_{z∼Z}[T(z)ᵀT(z)]‖} ≲ ‖v_i‖²ᵖ‖v_j‖²ᵖ;
(IV) max_{‖a‖=1}(E_{z∼Z}[(aᵀT(z)T(z)ᵀa)²])^{1/2} ≲ ‖v_i‖²ᵖ‖v_j‖²ᵖ;
(V) ‖E_{z∼Z}[T(z)T(z)ᵀT(z)T(z)ᵀ]‖ ≲ ‖v_i‖⁴ᵖ‖v_j‖⁴ᵖ;
(VI) ‖E_{z∼Z}[T(z)]‖ ≲ ‖v_i‖ᵖ‖v_j‖ᵖ,
where p = 1 if φ is ReLU and p = 0 if φ is sigmoid/tanh.
Proof. Note that |φ(z)| ≤ |z|ᵖ; therefore (I) can be proved by Proposition 1 of [68]. (II)–(VI) can be proved by Hölder's inequality.
Lemma D.3.5. If
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²t log d, |Ω| ≳ ε⁻²td log²d,
then with probability at least 1 − d⁻ᵗ,
‖(1/|Ω|)∑_{(x,y)∈Ω}[(φ(Uᵀx)ᵀφ(Vᵀy) − φ(U*ᵀx)ᵀφ(V*ᵀy))φ″(u_iᵀx)φ(v_iᵀy)xxᵀ]‖ ≲ ‖U − U*‖ + ‖V − V*‖.
Proof. We first consider the following quantity:
‖(1/|Ω|)∑_{(x,y)∈Ω}[((φ(u_jᵀx) − φ(u_j*ᵀx))φ(v_j*ᵀy))φ″(u_iᵀx)φ(v_iᵀy)xxᵀ]‖
≤ ‖(1/|Ω|)∑_{(x,y)∈Ω}[|(u_j − u_j*)ᵀx| xxᵀ φ(v_j*ᵀy)φ(v_iᵀy)]‖.
Similar to Lemma D.3.2, we are able to show
‖(1/|Ω|)∑_{(x,y)∈Ω}[|(u_j − u_j*)ᵀx| xxᵀ φ(v_j*ᵀy)φ(v_iᵀy)] − E_{(x,y)}[|(u_j − u_j*)ᵀx| xxᵀ φ(v_j*ᵀy)φ(v_iᵀy)]‖ ≲ ‖U − U*‖.
Note that by Hölder's inequality, we have
‖E_{(x,y)}[|(u_j − u_j*)ᵀx| xxᵀ φ(v_j*ᵀy)φ(v_iᵀy)]‖ ≲ ‖U − U*‖.
So we complete the proof.
Lemma D.3.6. If
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²t log d, |Ω| ≳ ε⁻²td log²d,
then with probability at least 1 − d⁻ᵗ,
‖(E_{x,y} − (1/|Ω|)∑_{(x,y)∈Ω})[φ′(u_iᵀx)φ′(v_jᵀy)xyᵀφ(v_iᵀy)φ(u_jᵀx)]‖ ≲ ε‖v_i‖ᵖ‖u_j‖ᵖ.
Proof. Let B(x,y) = M(x)N(y), where M(x) = φ′(u_iᵀx)φ(u_jᵀx)x and N(y) = φ′(v_jᵀy)φ(v_iᵀy)yᵀ. By applying Lemma D.3.10 and Properties (I)–(III), (VI) in Lemma D.3.7, we have for any ε > 0: if
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²td log²d,
then with probability at least 1 − d⁻²ᵗ,
‖E_{x,y}[B(x,y)] − (1/|S|)∑_{(x,y)∈S} B(x,y)‖ ≲ ε‖u_j‖ᵖ‖v_i‖ᵖ.   (D.16)
By applying Lemma D.3.11 and Properties (I), (IV)–(VI) in Lemma D.3.7, we have for any ε > 0: if
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²td log²d,
then
‖(1/n_1)∑_{l∈[n_1]} M(x_l)M(x_l)ᵀ‖ ≲ ‖u_j‖²ᵖ, ‖(1/n_2)∑_{l∈[n_2]} N(y_l)ᵀN(y_l)‖ ≲ ‖v_i‖²ᵖ.
By applying Lemma D.3.11 and Properties (I), (IV), (VII), (VIII) in Lemma D.3.7, we have for any ε > 0: if
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²td log²d,
then
‖(1/n_1)∑_{l∈[n_1]} M(x_l)ᵀM(x_l)‖ ≲ d‖u_j‖²ᵖ, ‖(1/n_2)∑_{l∈[n_2]} N(y_l)N(y_l)ᵀ‖ ≲ d‖v_i‖²ᵖ.
Therefore,
max{‖(1/|S|)∑_{(x,y)∈S} B(x,y)B(x,y)ᵀ‖, ‖(1/|S|)∑_{(x,y)∈S} B(x,y)ᵀB(x,y)‖} ≲ d‖v_i‖²ᵖ‖u_j‖²ᵖ.   (D.17)
We can apply Lemma D.3.12, together with Eq. (D.17) and Property (I) in Lemma D.3.7, to obtain the following result. If
|Ω| ≳ ε⁻²td log²d,
then with probability at least 1 − d⁻²ᵗ,
‖(1/|S|)∑_{(x,y)∈S} B(x,y) − (1/|Ω|)∑_{(x,y)∈Ω} B(x,y)‖ ≤ ε‖v_i‖ᵖ‖u_j‖ᵖ.   (D.18)
Combining Eq. (D.16) and Eq. (D.18), we finish the proof.
Lemma D.3.7. Define T(z) = φ′(u_iᵀz)φ(u_jᵀz)z. If z ∼ Z with Z = N(0, I_d) and φ is ReLU or sigmoid/tanh, the following hold for T(z) and any t > 1:
(I) Pr_{z∼Z}[‖T(z)‖ ≤ 5td^{1/2}‖u_j‖ᵖ log n] ≥ 1 − n⁻¹d⁻ᵗ;
(II) ‖E_{z∼Z}[T(z)]‖ ≲ ‖u_j‖ᵖ;
(III) max_{‖a‖=‖b‖=1}(E_{z∼Z}[(aᵀT(z)b)²])^{1/2} ≲ ‖u_j‖ᵖ;
(IV) max{‖E_{z∼Z}[T(z)T(z)ᵀ]‖, ‖E_{z∼Z}[T(z)ᵀT(z)]‖} ≲ d‖u_j‖²ᵖ;
(V) max_{‖a‖=1}(E_{z∼Z}[(aᵀT(z)T(z)ᵀa)²])^{1/2} ≲ ‖u_j‖²ᵖ;
(VI) ‖E_{z∼Z}[T(z)T(z)ᵀT(z)T(z)ᵀ]‖ ≲ d‖u_j‖⁴ᵖ;
(VII) max_{‖a‖=1}(E_{z∼Z}[(aᵀT(z)ᵀT(z)a)²])^{1/2} ≲ d‖u_j‖²ᵖ;
(VIII) ‖E_{z∼Z}[T(z)ᵀT(z)T(z)ᵀT(z)]‖ ≲ d²‖u_j‖⁴ᵖ.
Proof. Note that 0 ≤ φ′(z) ≤ 1 and |φ(z)| ≤ |z|ᵖ; therefore (I) can be proved by Proposition 1 of [68]. (II)–(VIII) can be proved by Hölder's inequality.
Lemma D.3.8. If
n_1 ≳ td log²d, n_2 ≳ t log d, |Ω| ≳ td log²d,
then with probability at least 1 − d⁻ᵗ,
‖(1/|Ω|)∑_{(x,y)∈Ω}[(φ(Uᵀx)ᵀφ(Vᵀy) − φ(U*ᵀx)ᵀφ(V*ᵀy))φ′(u_iᵀx)φ′(v_iᵀy)xyᵀ]‖ ≲ ‖U − U*‖ + ‖V − V*‖.
Proof. We first consider the following quantity:
‖(1/|Ω|)∑_{(x,y)∈Ω}[((φ(u_jᵀx) − φ(u_j*ᵀx))φ(v_j*ᵀy))φ′(u_iᵀx)φ′(v_iᵀy)xyᵀ]‖.
Set M(x) = (φ(u_jᵀx) − φ(u_j*ᵀx))φ′(u_iᵀx)x and N(y) = φ(v_j*ᵀy)φ′(v_iᵀy)yᵀ, and follow the proof of Lemma D.3.6. Also note that φ is Lipschitz, i.e., |φ(u_jᵀx) − φ(u_j*ᵀx)| ≤ |u_jᵀx − u_j*ᵀx|. We can then show the following: if
n_1 ≳ td log²d, n_2 ≳ t log d, |Ω| ≳ td log²d,
then with probability at least 1 − d⁻ᵗ,
‖((1/|Ω|)∑_{(x,y)∈Ω} − E_{x,y})[M(x)N(y)]‖ ≲ ‖u_j − u_j*‖.
Note that by Hölder's inequality, we have
‖E_{x,y}[M(x)N(y)]‖ ≲ ‖u_j − u_j*‖.
So we complete the proof.
We provide a variation of Lemma B.7 in [153]. Note that Lemma B.7 in [153] requires four properties; we simplify it to three properties.
Lemma D.3.9 (Matrix Bernstein for the unbounded case; a modified version of the bounded case, Theorem 6.1 in [127], and a variation of Lemma B.7 in [153]). Let B denote a distribution over R^{d_1×d_2}. Let d = d_1 + d_2. Let B_1, B_2, ⋯, B_n be i.i.d. random matrices sampled from B. Let B̄ = E_{B∼B}[B] and B̂ = (1/n)∑_{i=1}^n B_i. For parameters m ≥ 0, γ ∈ (0,1), ν > 0, L > 0, suppose the distribution B satisfies the following three properties:
(I) Pr_{B∼B}[‖B‖ ≤ m] ≥ 1 − γ;
(II) max{‖E_{B∼B}[BBᵀ]‖, ‖E_{B∼B}[BᵀB]‖} ≤ ν;
(III) max_{‖a‖=‖b‖=1}(E_{B∼B}[(aᵀBb)²])^{1/2} ≤ L.
Then for any ε > 0 and t ≥ 1, if
n ≥ (18t log d)·((ε + ‖B̄‖)² + mε + ν)/ε² and γ ≤ (ε/(2L))²,
then with probability at least 1 − d⁻²ᵗ − nγ,
‖(1/n)∑_{i=1}^n B_i − E_{B∼B}[B]‖ ≤ ε.
Proof. Define the events ξ_i = {‖B_i‖ ≤ m}, ∀i ∈ [n]. Define M_i = 1_{‖B_i‖≤m}B_i. Let M̄ = E_{B∼B}[1_{‖B‖≤m}B] and M̂ = (1/n)∑_{i=1}^n M_i. By the triangle inequality, we have
‖B̂ − B̄‖ ≤ ‖B̂ − M̂‖ + ‖M̂ − M̄‖ + ‖M̄ − B̄‖.   (D.19)
In the next few paragraphs, we upper bound the above three terms.
The first term in Eq. (D.19). For each i, let ξ̄_i denote the complement of the event ξ_i. Thus Pr[ξ̄_i] ≤ γ. By a union bound over i ∈ [n], with probability 1 − nγ we have ‖B_i‖ ≤ m for all i ∈ [n], and thus M̂ = B̂.
The second term in Eq. (D.19). For a matrix B sampled from B, let ξ denote the event {‖B‖ ≤ m} and ξ̄ its complement. Then we can upper bound ‖M̄ − B̄‖ in the following way:
‖M̄ − B̄‖ = ‖E_{B∼B}[1_{‖B‖≤m}·B] − E_{B∼B}[B]‖
= ‖E_{B∼B}[B·1_{ξ̄}]‖
= max_{‖a‖=‖b‖=1} E_{B∼B}[aᵀBb·1_{ξ̄}]
≤ max_{‖a‖=‖b‖=1} E_{B∼B}[(aᵀBb)²]^{1/2}·E_{B∼B}[1_{ξ̄}]^{1/2}   (by Hölder's inequality)
≤ L·E_{B∼B}[1_{ξ̄}]^{1/2}   (by Property (III))
≤ Lγ^{1/2}   (by Pr[ξ̄] ≤ γ)
≤ ε/2   (by γ ≤ (ε/(2L))²),
that is, ‖M̄ − B̄‖ ≤ ε/2. Therefore, ‖M̄‖ ≤ ‖B̄‖ + ε/2.
The third term in Eq. (D.19). We bound ‖M̂ − M̄‖ by the matrix Bernstein inequality [127]. Define Z_i = M_i − M̄. Then E_{B_i∼B}[Z_i] = 0, ‖Z_i‖ ≤ 2m, and
‖E_{B_i∼B}[Z_iZ_iᵀ]‖ = ‖E_{B_i∼B}[M_iM_iᵀ] − M̄M̄ᵀ‖ ≤ ν + ‖M̄‖² ≤ ν + ‖B̄‖² + ε² + ε‖B̄‖.
Similarly, we have ‖E_{B_i∼B}[Z_iᵀZ_i]‖ ≤ ν + ‖B̄‖² + ε² + ε‖B̄‖. Using the matrix Bernstein inequality, for any ε > 0,
Pr_{B_1,⋯,B_n∼B}[(1/n)‖∑_{i=1}^n Z_i‖ ≥ ε] ≤ d·exp(−(ε²n/2)/(ν + ‖B̄‖² + ε² + ε‖B̄‖ + 2mε/3)).
By choosing
n ≥ (3t log d)·(ν + ‖B̄‖² + ε² + ε‖B̄‖ + 2mε/3)/(ε²/2),
for t ≥ 1, we have with probability at least 1 − d⁻²ᵗ,
‖(1/n)∑_{i=1}^n M_i − M̄‖ ≤ ε/2.
Putting it all together, we have for any ε > 0: if
n ≥ (18t log d)·((ε + ‖B̄‖)² + mε + ν)/ε² and γ ≤ (ε/(2L))²,
then with probability at least 1 − d⁻²ᵗ − nγ,
‖(1/n)∑_{i=1}^n B_i − E_{B∼B}[B]‖ ≤ ε.
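As a quick numerical illustration (not a proof) of the lemma's conclusion, the empirical average of i.i.d. random matrices concentrates around its expectation in spectral norm; the choice B = zzᵀ with z ∼ N(0, I_d), and the sizes below, are arbitrary demo choices.

```python
import numpy as np

# Illustration of matrix concentration: B = z z^T with z ~ N(0, I_d), so
# E[B] = I_d, and the empirical average (1/n) sum_i z_i z_i^T should be
# close to the identity in spectral norm once n >> d log d.
rng = np.random.default_rng(2)
d, n = 10, 50000
Z = rng.standard_normal((n, d))
B_hat = Z.T @ Z / n                  # (1/n) sum_i z_i z_i^T
dev = np.linalg.norm(B_hat - np.eye(d), 2)
assert dev < 0.1                     # deviation is small, roughly sqrt(d/n)
```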
Lemma D.3.10 (Tail bound for the fully-observed rating matrix). Let {x_i}_{i∈[n_1]} be independent samples from distribution X and {y_j}_{j∈[n_2]} be independent samples from distribution Y. Denote by S := {(x_i, y_j)}_{i∈[n_1],j∈[n_2]} the collection of all (x_i, y_j) pairs. Let B(x,y) be a random matrix of x, y that can be represented as the product of two matrices M(x), N(y), i.e., B(x,y) = M(x)N(y). Let M̄ = E_x[M(x)] and N̄ = E_y[N(y)]. Let d_x be the sum of the two dimensions of M(x) and d_y the sum of the two dimensions of N(y). Suppose both M(x) and N(y) satisfy the following properties (z stands for either x or y, and T(z) for M(x) or N(y)):
(I) Pr_{z∼Z}[‖T(z)‖ ≤ m_z] ≥ 1 − γ_z;
(II) max_{‖a‖=‖b‖=1}(E_{z∼Z}[(aᵀT(z)b)²])^{1/2} ≤ L_z;
(III) max{‖E_{z∼Z}[T(z)T(z)ᵀ]‖, ‖E_{z∼Z}[T(z)ᵀT(z)]‖} ≤ ν_z.
Then for any ε_1 > 0, ε_2 > 0, if
n_1 ≥ (18t log d_x)·(ν_x + (‖M̄‖ + ε_1)² + m_xε_1)/ε_1² and γ_x ≤ (ε_1/(2L_x))²,
n_2 ≥ (18t log d_y)·(ν_y + (‖N̄‖ + ε_2)² + m_yε_2)/ε_2² and γ_y ≤ (ε_2/(2L_y))²,
then with probability at least 1 − d_x⁻²ᵗ − d_y⁻²ᵗ − n_1γ_x − n_2γ_y,
‖E_{x,y}[B(x,y)] − (1/|S|)∑_{(x,y)∈S} B(x,y)‖ ≤ ε_2‖M̄‖ + ε_1‖N̄‖ + ε_1ε_2.   (D.20)
Proof. First note that
(1/|S|)∑_{(x,y)∈S} B(x,y) = ((1/n_1)∑_{i∈[n_1]} M(x_i))·((1/n_2)∑_{j∈[n_2]} N(y_j)),
and
E_{x,y}[B(x,y)] = (E_x[M(x)])(E_y[N(y)]).
Therefore, if we can bound ‖E_x[M(x)] − (1/n_1)∑_{i∈[n_1]} M(x_i)‖ and the corresponding term for y, we are able to prove this lemma.
By the assumptions on M(x), the three conditions of Lemma D.3.9 are satisfied, which completes the proof.
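The factorization used in the first display is a purely algebraic identity and can be checked directly; the matrices below are arbitrary stand-ins for M(x_i) and N(y_j).

```python
import numpy as np

# Check that when B(x, y) = M(x) N(y), the average of B over all n1*n2 pairs
# in S equals the product of the two marginal averages.
rng = np.random.default_rng(3)
n1, n2, d = 7, 5, 3
Ms = rng.standard_normal((n1, d, d))   # stand-ins for M(x_i)
Ns = rng.standard_normal((n2, d, d))   # stand-ins for N(y_j)

avg_pairs = sum(Ms[i] @ Ns[j] for i in range(n1) for j in range(n2)) / (n1 * n2)
prod_avgs = Ms.mean(axis=0) @ Ns.mean(axis=0)
assert np.allclose(avg_pairs, prod_avgs)
```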
Lemma D.3.11 (Upper bound for the second-order moment). Let {z_i}_{i∈[n]} be independent samples from distribution Z. Let T(z) be a matrix of z. Let d be the sum of the two dimensions of T(z) and T̄ := E_{z∼Z}[T(z)T(z)ᵀ]. Suppose T(z) satisfies the following properties:
(I) Pr_{z∼Z}[‖T(z)‖ ≤ m_z] ≥ 1 − γ_z;
(II) max_{‖a‖=1}(E_{z∼Z}[(aᵀT(z)T(z)ᵀa)²])^{1/2} ≤ L_z;
(III) ‖E_{z∼Z}[T(z)T(z)ᵀT(z)T(z)ᵀ]‖ ≤ ν_z.
Then for any t > 1, if
n ≥ (18t log d)·(ν_z + (‖T̄‖ + ε)² + m_z²ε)/ε² and γ_z ≤ (ε/(2L_z))²,
we have with probability at least 1 − d⁻²ᵗ − nγ_z,
‖(1/n)∑_{i∈[n]} T(z_i)T(z_i)ᵀ‖ ≤ ‖E_{z∼Z}[T(z)T(z)ᵀ]‖ + ε.
Proof. The proof directly follows by applying Lemma D.3.9.
Lemma D.3.12 (Tail bound for the partially-observed rating matrix). Given {x_i}_{i∈[n_1]} and {y_j}_{j∈[n_2]}, denote by S := {(x_i, y_j)}_{i∈[n_1],j∈[n_2]} the collection of all (x_i, y_j) pairs. Let Ω also be a collection of (x_i, y_j) pairs, where each pair is sampled from S independently and uniformly. Let B(x,y) be a matrix of x, y. Let d_B be the sum of the two dimensions of B(x,y). Define B_S = (1/|S|)∑_{(x,y)∈S} B(x,y). Assume the following:
(I) ‖B(x,y)‖ ≤ m_B for all (x,y) ∈ S;
(II) max{‖(1/|S|)∑_{(x,y)∈S} B(x,y)B(x,y)ᵀ‖, ‖(1/|S|)∑_{(x,y)∈S} B(x,y)ᵀB(x,y)‖} ≤ ν_B.
Then for any ε > 0 and t ≥ 1, if
|Ω| ≥ (18t log d_B)·(ν_B + ‖B_S‖² + m_Bε)/ε²,
with probability at least 1 − d_B⁻²ᵗ,
‖B_S − (1/|Ω|)∑_{(x,y)∈Ω} B(x,y)‖ ≤ ε.
Proof. Since each element of Ω is sampled from S uniformly and independently, we have
E_Ω[(1/|Ω|)∑_{(x,y)∈Ω} B(x,y)] = (1/|S|)∑_{(x,y)∈S} B(x,y).
Applying the matrix Bernstein inequality (Theorem 6.1 in [127]), we prove this lemma.
Lemma D.3.13. For the sigmoid/tanh activation function,
‖H(U,V) − ∇²f_D(U*,V*)‖ ≲ ‖V − V*‖ + ‖U − U*‖,
where H(U,V) is defined as in Eq. (D.12). For the ReLU activation function,
‖H(U,V) − ∇²f_D(U*,V*)‖ ≲ ((‖V − V*‖/σ_k(V*))^{1/2}‖U*‖ + (‖U − U*‖/σ_k(U*))^{1/2}‖V*‖)(‖U*‖ + ‖V*‖).
Proof. It suffices to bound each block, i.e.,
E_{x,y}[φ′(u_iᵀx)φ′(u_jᵀx)xxᵀφ(v_iᵀy)φ(v_jᵀy) − φ′(u_i*ᵀx)φ′(u_j*ᵀx)xxᵀφ(v_i*ᵀy)φ(v_j*ᵀy)],   (D.21)
E_{x,y}[φ′(u_iᵀx)φ′(v_jᵀy)xyᵀφ(v_iᵀy)φ(u_jᵀx) − φ′(u_i*ᵀx)φ′(v_j*ᵀy)xyᵀφ(v_i*ᵀy)φ(u_j*ᵀx)].   (D.22)
For smooth activations, the bound for Eq. (D.21) follows by combining Lemma D.3.14 and Lemma D.3.15, and the bound for Eq. (D.22) follows from Lemma D.3.17 and Lemma D.3.19. For the ReLU activation, the bound for Eq. (D.21) follows by combining Lemma D.3.14 and Lemma D.3.16, and the bound for Eq. (D.22) follows from Lemma D.3.17 and Lemma D.3.18.
Lemma D.3.14.
‖E_{y∼D_d}[(φ(v_iᵀy) − φ(v_i*ᵀy))φ(v_jᵀy)]‖ ≲ ‖V*‖ᵖ‖V − V*‖.
Proof. The proof follows from the property of the activation function (|φ(z)| ≤ |z|ᵖ) and Hölder's inequality.
Lemma D.3.15. When the activation function is smooth, we have
‖E_{x∼D_d}[(φ′(u_iᵀx) − φ′(u_i*ᵀx))φ′(u_lᵀx)xxᵀ]‖ ≲ ‖U − U*‖.
Proof. The proof directly follows Eq. (12) in Lemma D.10 in [153].
Lemma D.3.16. When the activation function is piece-wise linear with e turning points, we have
‖E_{x∼D_d}[(φ′(u_iᵀx) − φ′(u_i*ᵀx))φ′(u_lᵀx)xxᵀ]‖ ≲ (e‖U − U*‖/σ_k(U*))^{1/2}.
Proof.
‖E_{x∼D_d}[(φ′(u_iᵀx) − φ′(u_i*ᵀx))φ′(u_lᵀx)xxᵀ]‖ ≤ max_{‖a‖=1} E_{x∼D_d}[|φ′(u_iᵀx) − φ′(u_i*ᵀx)| φ′(u_lᵀx)(xᵀa)²].
Let P be an orthonormal basis of span(u_i, u_i*, u_l). Without loss of generality, we assume u_i, u_i*, u_l are linearly independent, so P is d-by-3. Let [q_i q_i* q_l] = Pᵀ[u_i u_i* u_l] ∈ R^{3×3}. Let a = Pb + P⊥c, where P⊥ ∈ R^{d×(d−3)} is the complementary basis of P.
E_{x∼D_d}[|φ′(u_iᵀx) − φ′(u_i*ᵀx)||φ′(u_lᵀx)|(xᵀa)²]
= E_{x∼D_d}[|φ′(u_iᵀx) − φ′(u_i*ᵀx)||φ′(u_lᵀx)|(xᵀ(Pb + P⊥c))²]
≲ E_{x∼D_d}[|φ′(u_iᵀx) − φ′(u_i*ᵀx)||φ′(u_lᵀx)|((xᵀPb)² + (xᵀP⊥c)²)]
= E_{x∼D_d}[|φ′(u_iᵀx) − φ′(u_i*ᵀx)||φ′(u_lᵀx)|(xᵀPb)²] + E_{x∼D_d}[|φ′(u_iᵀx) − φ′(u_i*ᵀx)||φ′(u_lᵀx)|(xᵀP⊥c)²]
= E_{z∼D_3}[|φ′(q_iᵀz) − φ′(q_i*ᵀz)||φ′(q_lᵀz)|(zᵀb)²] + E_{z∼D_3, y∼D_{d−3}}[|φ′(q_iᵀz) − φ′(q_i*ᵀz)||φ′(q_lᵀz)|(yᵀc)²],   (D.23)
where the first step follows by a = Pb + P⊥c, the second step follows by (a + b)² ≤ 2a² + 2b², and the last step follows by the rotation invariance of the Gaussian distribution.
The function φ has e exceptional points where φ″(z) ≠ 0; denote them p_1, p_2, ⋯, p_e. Note that if q_iᵀz and q_i*ᵀz are not separated by any of these exceptional points, i.e., there exists no j ∈ [e] such that q_iᵀz ≤ p_j ≤ q_i*ᵀz or q_i*ᵀz ≤ p_j ≤ q_iᵀz, then we have φ′(q_iᵀz) = φ′(q_i*ᵀz), since φ″(s) = 0 except at {p_j}_{j=1,2,⋯,e}. So we consider the probability that q_iᵀz and q_i*ᵀz are separated by some exceptional point. Let ξ_j denote the event that q_iᵀz and q_i*ᵀz are separated by the exceptional point p_j. By the union bound, 1 − ∑_{j=1}^e Pr[ξ_j] lower bounds the probability that q_iᵀz and q_i*ᵀz are not separated by any exceptional point. The first term of Equation (D.23) can be bounded as
E_{z∼D_3}[|φ′(q_iᵀz) − φ′(q_i*ᵀz)||φ′(q_lᵀz)|(zᵀb)²]
= E_{z∼D_3}[1_{∪_{j=1}^e ξ_j}|φ′(q_iᵀz) − φ′(q_i*ᵀz)||φ′(q_lᵀz)|(zᵀb)²]
≤ (E_{z∼D_3}[1_{∪_{j=1}^e ξ_j}])^{1/2}(E_{z∼D_3}[(φ′(q_iᵀz) + φ′(q_i*ᵀz))²φ′(q_lᵀz)²(zᵀb)⁴])^{1/2}
≤ (∑_{j=1}^e Pr_{z∼D_3}[ξ_j])^{1/2}(E_{z∼D_3}[(φ′(q_iᵀz) + φ′(q_i*ᵀz))²φ′(q_lᵀz)²(zᵀb)⁴])^{1/2}
≲ (∑_{j=1}^e Pr_{z∼D_3}[ξ_j])^{1/2}‖b‖²,
where the first step follows because φ′(q_iᵀz) = φ′(q_i*ᵀz) whenever q_iᵀz and q_i*ᵀz are not separated by any exceptional point, and the last step follows by Hölder's inequality.
It remains to upper bound Pr_{z∼D_3}[ξ_j]. First note that if q_iᵀz and q_i*ᵀz are separated by the exceptional point p_j, then |q_i*ᵀz − p_j| ≤ |q_iᵀz − q_i*ᵀz| ≤ ‖q_i − q_i*‖‖z‖. Therefore,
Pr_{z∼D_3}[ξ_j] ≤ Pr_{z∼D_3}[|q_i*ᵀz − p_j|/‖z‖ ≤ ‖q_i − q_i*‖].
Note that (q_i*ᵀz/(‖z‖‖q_i*‖) + 1)/2 follows the Beta(1,1) distribution, which is the uniform distribution on [0,1]. Hence,
Pr_{z∼D_3}[|q_i*ᵀz − p_j|/(‖z‖‖q_i*‖) ≤ ‖q_i − q_i*‖/‖q_i*‖] ≤ Pr_{z∼D_3}[|q_i*ᵀz|/(‖z‖‖q_i*‖) ≤ ‖q_i − q_i*‖/‖q_i*‖] ≲ ‖q_i − q_i*‖/‖q_i*‖ ≲ ‖U − U*‖/σ_k(U*),
where the first step holds because we can view q_i*ᵀz/‖z‖ and p_j/‖z‖ as two independent random variables: the former depends only on the direction of z while the latter is related only to the magnitude of z. Thus, we have
E_{z∼D_3}[|φ′(q_iᵀz) − φ′(q_i*ᵀz)||φ′(q_lᵀz)|(zᵀb)²] ≲ (e‖U − U*‖/σ_k(U*))^{1/2}‖b‖².   (D.24)
Similarly, we have
E_{z∼D_3, y∼D_{d−3}}[|φ′(q_iᵀz) − φ′(q_i*ᵀz)||φ′(q_lᵀz)|(yᵀc)²] ≲ (e‖U − U*‖/σ_k(U*))^{1/2}‖c‖².   (D.25)
Finally, combining Eq. (D.24) and Eq. (D.25) completes the proof.
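The Beta(1,1) fact invoked above is easy to check numerically. The following Monte Carlo sketch (illustrative only; the direction q and sample size are arbitrary choices) confirms that for z ∼ N(0, I_3) the normalized projection qᵀz/‖z‖ is uniform on [−1, 1], which is what drives the bound Pr[ξ_j] ≲ ‖q_i − q_i*‖/‖q_i*‖.

```python
import numpy as np

# Monte Carlo check: for z ~ N(0, I_3) and a fixed unit vector q,
# (q^T z / ||z|| + 1)/2 is uniform on [0, 1], so the cosine q^T z / ||z||
# falls in a band of width 2*delta around 0 with probability ~ delta.
rng = np.random.default_rng(4)
n, delta = 200000, 0.05
z = rng.standard_normal((n, 3))
q = np.array([1.0, 0.0, 0.0])
cos = z @ q / np.linalg.norm(z, axis=1)
u = (cos + 1.0) / 2.0
# empirical CDF of a Uniform[0, 1] variable: P(u <= t) ~= t
for t in (0.25, 0.5, 0.75):
    assert abs(np.mean(u <= t) - t) < 0.01
assert abs(np.mean(np.abs(cos) <= delta) - delta) < 0.01
```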
Lemma D.3.17.
‖E_{x∼D_d}[(φ(u_jᵀx) − φ(u_j*ᵀx))φ′(u_iᵀx)x]‖ ≲ ‖U − U*‖.
Proof. First, we use the Lipschitz continuity of the activation function:
‖E_{x∼D_d}[(φ(u_jᵀx) − φ(u_j*ᵀx))φ′(u_iᵀx)x]‖ ≤ max_{‖a‖=1} E_{x∼D_d}[|φ(u_jᵀx) − φ(u_j*ᵀx)| φ′(u_iᵀx)|xᵀa|]
≤ max_{‖a‖=1} L_φ E_{x∼D_d}[|u_jᵀx − u_j*ᵀx| φ′(u_iᵀx)|xᵀa|],
where L_φ ≤ 1 is the Lipschitz constant of φ. The proof then follows by Hölder's inequality.
Lemma D.3.18. When the activation function is ReLU,
‖E_{x∼D_d}[φ(u_j*ᵀx)(φ′(u_iᵀx) − φ′(u_i*ᵀx))x]‖ ≲ (‖U − U*‖/σ_k(U*))^{1/2}‖u_j‖.
Proof.
‖E_{x∼D_d}[φ(u_j*ᵀx)(φ′(u_iᵀx) − φ′(u_i*ᵀx))x]‖ ≤ max_{‖a‖=1} E_{x∼D_d}[|φ(u_j*ᵀx)(φ′(u_iᵀx) − φ′(u_i*ᵀx))xᵀa|].
Similar to Lemma D.3.16, we can show that
max_{‖a‖=1} E_{x∼D_d}[|φ(u_j*ᵀx)(φ′(u_iᵀx) − φ′(u_i*ᵀx))xᵀa|] ≲ (‖U − U*‖/σ_k(U*))^{1/2}‖u_j‖.
Lemma D.3.19. When the activation function is sigmoid/tanh,
‖E_{x∼D_d}[φ(u_j*ᵀx)(φ′(u_iᵀx) − φ′(u_i*ᵀx))x]‖ ≲ ‖U − U*‖.
Proof.
‖E_{x∼D_d}[φ(u_j*ᵀx)(φ′(u_iᵀx) − φ′(u_i*ᵀx))x]‖
≤ max_{‖a‖=1} E_{x∼D_d}[|φ(u_j*ᵀx)(φ′(u_iᵀx) − φ′(u_i*ᵀx))xᵀa|]
≲ max_{‖a‖=1} E_{x∼D_d}[|(u_iᵀx − u_i*ᵀx)xᵀa|] ≲ ‖U − U*‖.
D.3.1 Local Linear Convergence
Given Theorem 6.3.1, we are now able to show local linear convergence of gradient descent for the sigmoid and tanh activation functions.
Theorem D.3.2 (Restatement of Theorem 6.3.2). Let [U^c, V^c] be the parameters in the c-th iteration. Assume ‖U^c − U*‖ + ‖V^c − V*‖ ≲ 1/(λ²κ). Then, given a fresh sample set Ω that is independent of [U^c, V^c] and satisfies the conditions in Theorem 6.3.1, the next iterate of one step of gradient descent, i.e., [U^{c+1}, V^{c+1}] = [U^c, V^c] − η∇f_Ω(U^c, V^c), satisfies
‖U^{c+1} − U*‖_F² + ‖V^{c+1} − V*‖_F² ≤ (1 − M_l/M_u)(‖U^c − U*‖_F² + ‖V^c − V*‖_F²)
with probability 1 − d⁻ᵗ, where η = Θ(1/M_u) is the step size, M_l ≳ 1/(λ²κ) is the lower bound on the eigenvalues of the Hessian, and M_u ≲ 1 is the upper bound on the eigenvalues of the Hessian.
Proof. To show the linear convergence of gradient descent, we first show that the Hessians along the line segment between [U^c, V^c] and [U*, V*] are positive definite w.h.p.
The idea is essentially to build a d^{−1/2}λ⁻²κ⁻¹-net for the segment between the current iterate and the optimum. In particular, we pick d^{1/2} points {[U^a, V^a]}_{a=1,2,⋯,d^{1/2}} equally spaced between [U^c, V^c] and [U*, V*]. Therefore, ‖U^{a+1} − U^a‖ + ‖V^{a+1} − V^a‖ ≲ d^{−1/2}λ⁻²κ⁻¹.
Using Lemma D.3.20, we can show that for any [U, V], if there exists a value of a such that ‖U − U^a‖ + ‖V − V^a‖ ≲ d^{−1/2}λ⁻²κ⁻¹, then
‖∇²f_Ω(U, V) − ∇²f_Ω(U^a, V^a)‖ ≲ λ⁻²κ⁻¹.
Therefore, for every point [U, V] on the segment between [U^c, V^c] and [U*, V*], we can find a fixed point in {[U^a, V^a]}_{a=1,2,⋯,d^{1/2}} such that ‖U − U^a‖ + ‖V − V^a‖ ≲ d^{−1/2}λ⁻²κ⁻¹. Now, applying a union bound over all a, we have that w.p. 1 − d⁻ᵗ, for every point [U, V] on the segment between [U^c, V^c] and [U*, V*],
M_l I ⪯ ∇²f_Ω(U, V) ⪯ M_u I,
where M_l = Ω(λ⁻²κ⁻¹) and M_u = O(1). Note that the upper bound on the Hessian is due to the fact that φ and φ′ are bounded.
Given the positive definiteness of the Hessian along the segment between the current iterate and the optimum, we are ready to show the linear convergence. First, set the step size for the gradient descent update as η = 1/M_u and write W := [U, V] to simplify notation.
‖W^{c+1} − W*‖_F²
= ‖W^c − η∇f_Ω(W^c) − W*‖_F²
= ‖W^c − W*‖_F² − 2η⟨∇f_Ω(W^c), W^c − W*⟩ + η²‖∇f_Ω(W^c)‖_F².
Note that
∇f_Ω(W^c) = (∫_0^1 ∇²f_Ω(W* + ξ(W^c − W*)) dξ)·vec(W^c − W*).
Define H ∈ R^{(2kd)×(2kd)},
H = ∫_0^1 ∇²f_Ω(W* + ξ(W^c − W*)) dξ.
By the result provided above, we have
M_l I ⪯ H ⪯ M_u I.   (D.26)
Now we upper bound the norm of the gradient:
‖∇f_Ω(W^c)‖_F² = ⟨H vec(W^c − W*), H vec(W^c − W*)⟩ ≤ M_u⟨vec(W^c − W*), H vec(W^c − W*)⟩.
Therefore,
‖W^{c+1} − W*‖_F²
≤ ‖W^c − W*‖_F² − (2η − η²M_u)⟨vec(W^c − W*), H vec(W^c − W*)⟩
≤ ‖W^c − W*‖_F² − (2η − η²M_u)M_l‖W^c − W*‖_F²
= ‖W^c − W*‖_F² − (M_l/M_u)‖W^c − W*‖_F²
= (1 − M_l/M_u)‖W^c − W*‖_F².
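The contraction step at the end of the proof can be illustrated on a toy quadratic, where the Hessian is constant and the bound M_l I ⪯ H ⪯ M_u I holds by construction; the dimensions and eigenvalue range below are arbitrary demo choices, not quantities from the theorem.

```python
import numpy as np

# For f(w) = 0.5 (w - w*)^T H (w - w*) with Ml*I <= H <= Mu*I, one gradient
# step with eta = 1/Mu contracts the squared error by at least (1 - Ml/Mu).
rng = np.random.default_rng(5)
n = 6
Ml, Mu = 0.5, 4.0
eigs = rng.uniform(Ml, Mu, size=n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = Q @ np.diag(eigs) @ Q.T                 # Ml*I <= H <= Mu*I by construction
w_star = rng.standard_normal(n)
w = w_star + rng.standard_normal(n)
eta = 1.0 / Mu
for _ in range(20):
    err2 = np.sum((w - w_star) ** 2)
    w = w - eta * H @ (w - w_star)          # gradient step on the quadratic
    new_err2 = np.sum((w - w_star) ** 2)
    assert new_err2 <= (1.0 - Ml / Mu) * err2 + 1e-12
```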
Lemma D.3.20. Let the activation function be tanh/sigmoid. For given U^a, V^a and r > 0, if
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²t log d, |Ω| ≳ ε⁻²td log²d,
then with probability 1 − d⁻ᵗ,
sup_{‖U−U^a‖+‖V−V^a‖≤r} ‖∇²f_Ω(U, V) − ∇²f_Ω(U^a, V^a)‖ ≲ d^{1/2}·r.
Proof. We consider each block of the Hessian as defined in Eq. (D.11). In particular, we show that if
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²t log d, |Ω| ≳ ε⁻²td log²d,
then with probability 1 − d⁻ᵗ,
‖(1/|Ω|)∑_{(x,y)∈Ω}[(φ′(u_iᵀx)φ′(u_jᵀx)φ(v_iᵀy)φ(v_jᵀy) − φ′(u_i^aᵀx)φ′(u_j^aᵀx)φ(v_i^aᵀy)φ(v_j^aᵀy))xxᵀ]‖ ≲ (‖u_i − u_i^a‖ + ‖u_j − u_j^a‖ + ‖v_i − v_i^a‖ + ‖v_j − v_j^a‖)d^{1/2}   (by Lemma D.3.21);
‖(1/|Ω|)∑_{(x,y)∈Ω}[(φ(Uᵀx)ᵀφ(Vᵀy) − φ(U*ᵀx)ᵀφ(V*ᵀy))φ″(u_iᵀx)φ(v_iᵀy)xxᵀ − (φ(U^aᵀx)ᵀφ(V^aᵀy) − φ(U*ᵀx)ᵀφ(V*ᵀy))φ″(u_i^aᵀx)φ(v_i^aᵀy)xxᵀ]‖ ≲ (‖U − U^a‖ + ‖V − V^a‖)d^{1/2}   (by Lemma D.3.22);
‖(1/|Ω|)∑_{(x,y)∈Ω}[(φ′(u_iᵀx)φ′(v_jᵀy)φ(v_iᵀy)φ(u_jᵀx) − φ′(u_i^aᵀx)φ′(v_j^aᵀy)φ(v_i^aᵀy)φ(u_j^aᵀx))xyᵀ]‖ ≲ (‖u_i − u_i^a‖ + ‖u_j − u_j^a‖ + ‖v_i − v_i^a‖ + ‖v_j − v_j^a‖)d^{1/2}   (by Lemma D.3.23);
‖(1/|Ω|)∑_{(x,y)∈Ω}[(φ(Uᵀx)ᵀφ(Vᵀy) − φ(U*ᵀx)ᵀφ(V*ᵀy))φ′(u_iᵀx)φ′(v_iᵀy)xyᵀ − (φ(U^aᵀx)ᵀφ(V^aᵀy) − φ(U*ᵀx)ᵀφ(V*ᵀy))φ′(u_i^aᵀx)φ′(v_i^aᵀy)xyᵀ]‖ ≲ (‖U − U^a‖ + ‖V − V^a‖)d^{1/2}   (by Lemma D.3.24).
Lemma D.3.21. If
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²t log d, |Ω| ≳ ε⁻²td log²d,
then with probability at least 1 − d⁻ᵗ,
‖(1/|Ω|)∑_{(x,y)∈Ω}[(φ′(u_iᵀx)φ′(u_jᵀx)φ(v_iᵀy)φ(v_jᵀy) − φ′(u_i^aᵀx)φ′(u_j^aᵀx)φ(v_i^aᵀy)φ(v_j^aᵀy))xxᵀ]‖ ≲ (‖u_i − u_i^a‖ + ‖u_j − u_j^a‖ + ‖v_i − v_i^a‖ + ‖v_j − v_j^a‖)d^{1/2}.
Proof. Note that
φ′(u_iᵀx)φ′(u_jᵀx)φ(v_iᵀy)φ(v_jᵀy) − φ′(u_i^aᵀx)φ′(u_j^aᵀx)φ(v_i^aᵀy)φ(v_j^aᵀy)
= (φ′(u_iᵀx) − φ′(u_i^aᵀx))φ′(u_jᵀx)φ(v_iᵀy)φ(v_jᵀy)
 + φ′(u_i^aᵀx)(φ′(u_jᵀx) − φ′(u_j^aᵀx))φ(v_iᵀy)φ(v_jᵀy)
 + φ′(u_i^aᵀx)φ′(u_j^aᵀx)(φ(v_iᵀy) − φ(v_i^aᵀy))φ(v_jᵀy)
 + φ′(u_i^aᵀx)φ′(u_j^aᵀx)φ(v_i^aᵀy)(φ(v_jᵀy) − φ(v_j^aᵀy)).   (D.27)
Let us consider the first term in the above formula; the other terms are similar. We have
‖(1/|Ω|)∑_{(x,y)∈Ω}[(φ′(u_iᵀx) − φ′(u_i^aᵀx))φ′(u_jᵀx)φ(v_iᵀy)φ(v_jᵀy)xxᵀ]‖ ≤ ‖(1/|Ω|)∑_{(x,y)∈Ω}[‖u_i − u_i^a‖‖x‖xxᵀ]‖,
which holds because both φ′(·) and φ(·) are bounded and Lipschitz continuous. Applying the unbounded matrix Bernstein inequality, Lemma D.3.9, we can bound
‖(1/|Ω|)∑_{(x,y)∈Ω}[‖u_i − u_i^a‖‖x‖xxᵀ]‖ ≲ ‖u_i − u_i^a‖d^{1/2}.
Since both φ′(·) and φ(·) are bounded and Lipschitz continuous, we can easily extend the above inequality to the other terms and finish the proof.
Lemma D.3.22. If
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²t log d, |Ω| ≳ ε⁻²td log²d,
then with probability at least 1 − d⁻ᵗ,
‖(1/|Ω|)∑_{(x,y)∈Ω}[(φ(Uᵀx)ᵀφ(Vᵀy) − φ(U*ᵀx)ᵀφ(V*ᵀy))φ″(u_iᵀx)φ(v_iᵀy)xxᵀ − (φ(U^aᵀx)ᵀφ(V^aᵀy) − φ(U*ᵀx)ᵀφ(V*ᵀy))φ″(u_i^aᵀx)φ(v_i^aᵀy)xxᵀ]‖ ≲ (‖U − U^a‖ + ‖V − V^a‖)d^{1/2}.
Proof. Since for sigmoid/tanh the functions φ, φ′, φ″ are all Lipschitz continuous and bounded, the proof of this lemma resembles the proof of Lemma D.3.21.
Lemma D.3.23. If
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²t log d, |Ω| ≳ ε⁻²td log²d,
then with probability at least 1 − d⁻ᵗ,
‖(1/|Ω|)∑_{(x,y)∈Ω}[(φ′(u_iᵀx)φ′(v_jᵀy)φ(v_iᵀy)φ(u_jᵀx) − φ′(u_i^aᵀx)φ′(v_j^aᵀy)φ(v_i^aᵀy)φ(u_j^aᵀx))xyᵀ]‖ ≲ (‖u_i − u_i^a‖ + ‖u_j − u_j^a‖ + ‖v_i − v_i^a‖ + ‖v_j − v_j^a‖)d^{1/2}.
Proof. Perform a split similar to Eq. (D.27) and consider the following representative term:
‖(1/|Ω|)∑_{(x,y)∈Ω}[(φ′(u_iᵀx) − φ′(u_i^aᵀx))φ′(v_jᵀy)φ(v_iᵀy)φ(u_jᵀx)xyᵀ]‖.
Setting M(x) = (φ′(u_iᵀx) − φ′(u_i^aᵀx))φ(u_jᵀx)x and N(y) = φ′(v_jᵀy)φ(v_iᵀy)yᵀ, and using the fact that |φ′(u_iᵀx) − φ′(u_i^aᵀx)| ≤ ‖u_i − u_i^a‖‖x‖, we can follow the proof of Lemma D.3.6 to show that if
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²t log d, |Ω| ≳ ε⁻²td log²d,
then with probability at least 1 − d⁻ᵗ,
‖(1/|Ω|)∑_{(x,y)∈Ω}[(φ′(u_iᵀx) − φ′(u_i^aᵀx))φ′(v_jᵀy)φ(v_iᵀy)φ(u_jᵀx)xyᵀ]‖ ≤ ‖u_i − u_i^a‖d^{1/2}.
Lemma D.3.24. If
n_1 ≳ ε⁻²td log²d, n_2 ≳ ε⁻²t log d, |Ω| ≳ ε⁻²td log²d,
then with probability at least 1 − d⁻ᵗ,
‖(1/|Ω|)∑_{(x,y)∈Ω}[(φ(Uᵀx)ᵀφ(Vᵀy) − φ(U*ᵀx)ᵀφ(V*ᵀy))φ′(u_iᵀx)φ′(v_iᵀy)xyᵀ − (φ(U^aᵀx)ᵀφ(V^aᵀy) − φ(U*ᵀx)ᵀφ(V*ᵀy))φ′(u_i^aᵀx)φ′(v_i^aᵀy)xyᵀ]‖ ≲ (‖U − U^a‖ + ‖V − V^a‖)d^{1/2}.
Proof. Since for sigmoid/tanh the functions φ, φ′, φ″ are all Lipschitz continuous and bounded, the proof of this lemma resembles the proof of Lemma D.3.23.
Appendix E
Low-rank Matrix Sensing
E.1 Proof of Theorem 7.3.1
Proof. We explain the key ideas of the proof by first presenting it for the special case of a rank-1 W* = σ*u*v*ᵀ. We later extend the proof to the general rank-k case.
Similar to [71], we first characterize the update for the (h+1)-th step iterate v̂_{h+1} of Algorithm 7.3.1 and its normalized form v_{h+1} = v̂_{h+1}/‖v̂_{h+1}‖₂.
Now, set the gradient of (7.4) w.r.t. v to zero while keeping u_h fixed. That is,
∑_{i=1}^m (b_i − x_iᵀu_h v̂_{h+1}ᵀy_i)(x_iᵀu_h)y_i = 0,
i.e., ∑_{i=1}^m (u_hᵀx_i)y_i(σ* y_iᵀv* u*ᵀx_i − y_iᵀv̂_{h+1} u_hᵀx_i) = 0,
i.e., (∑_{i=1}^m (x_iᵀu_h u_hᵀx_i)y_iy_iᵀ) v̂_{h+1} = σ*(∑_{i=1}^m (x_iᵀu_h u*ᵀx_i)y_iy_iᵀ) v*,
i.e., v̂_{h+1} = σ*(u*ᵀu_h)v* − σ*B⁻¹((u*ᵀu_h)B − B̃)v*,   (E.1)
where
B = (1/m)∑_{i=1}^m (x_iᵀu_h u_hᵀx_i)y_iy_iᵀ, B̃ = (1/m)∑_{i=1}^m (x_iᵀu_h u*ᵀx_i)y_iy_iᵀ.
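The update (E.1) comes from a closed-form least-squares solve. The following sketch (assumptions: Gaussian sensing vectors, noiseless rank-1 measurements, and demo dimensions; this is not the thesis's code) implements the alternating updates with spectral initialization and checks that the estimate converges.

```python
import numpy as np

# Rank-1 alternating minimization for matrix sensing: b_i = x_i^T W* y_i with
# W* = sigma* u* v*^T. Given u, the minimizer of sum_i (b_i - (x_i^T u)(y_i^T v))^2
# over v is a closed-form least-squares solve, as in (E.1).
rng = np.random.default_rng(6)
d, m = 8, 2000
u_star = rng.standard_normal(d); u_star /= np.linalg.norm(u_star)
v_star = rng.standard_normal(d); v_star /= np.linalg.norm(v_star)
sigma_star = 3.0
X = rng.standard_normal((m, d))
Y = rng.standard_normal((m, d))
b = sigma_star * (X @ u_star) * (Y @ v_star)

# Spectral initialization: top left singular vector of S = (1/m) sum b_i x_i y_i^T.
S = (b[:, None] * X).T @ Y / m
U0, _, _ = np.linalg.svd(S)
u = U0[:, 0]

def ls_update(Z1, Z2, w_fixed):
    # argmin_v sum_i (b_i - (z1_i^T w_fixed)(z2_i^T v))^2, solved in closed form
    w = Z1 @ w_fixed
    A = (w[:, None] * Z2).T @ (w[:, None] * Z2)
    return np.linalg.solve(A, (b * w) @ Z2)

for _ in range(15):
    v = ls_update(X, Y, u); v /= np.linalg.norm(v)   # update v given u
    u = ls_update(Y, X, v); u /= np.linalg.norm(u)   # update u given v

W_est = np.outer(u, ls_update(X, Y, u))              # final v-solve carries sigma*
rel_err = np.linalg.norm(W_est - sigma_star * np.outer(u_star, v_star), 2) / sigma_star
assert rel_err < 1e-4
```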
Note that (E.1) shows that v̂_{h+1} is a perturbation of v*, and the goal now is to bound the spectral norm of the perturbation term G = B⁻¹((u*ᵀu_h)B − B̃)v*:
‖G‖₂ = ‖B⁻¹((u*ᵀu_h)B − B̃)v*‖₂ ≤ ‖B⁻¹‖₂ ‖(u*ᵀu_h)B − B̃‖₂ ‖v*‖₂.   (E.2)
Now, using Property 2 mentioned in the theorem, we get:
‖B − I‖₂ ≤ 1/100, i.e., σ_min(B) ≥ 1 − 1/100, i.e., ‖B⁻¹‖₂ ≤ 1/(1 − 1/100).   (E.3)
Now,
(u*ᵀu_h)B − B̃ = (1/m)∑_{i=1}^m y_iy_iᵀ x_iᵀ((u*ᵀu_h)u_hu_hᵀ − u*u_hᵀ)x_i
= (1/m)∑_{i=1}^m y_iy_iᵀ x_iᵀ(u_hu_hᵀ − I)u*u_hᵀx_i,
and hence
‖(u*ᵀu_h)B − B̃‖₂ ≤_{ζ_1} (1/100)‖(u_hu_hᵀ − I)u*‖₂‖u_h‖₂ = (1/100)√(1 − (u_hᵀu*)²),   (E.4)
where ζ_1 follows by observing that (u_hu_hᵀ − I)u* and u_h are orthogonal vectors and then using Property 3 given in Theorem 7.3.1. Hence, using (E.3), (E.4), and ‖v*‖₂ = 1 along with (E.2), we get:
‖G‖₂ ≤ (1/99)√(1 − (u_hᵀu*)²).   (E.5)
We are now ready to lower bound the component of $\widehat{v}_{h+1}$ along the correct direction $v_*$ and upper bound the component of $\widehat{v}_{h+1}$ that is perpendicular to the optimal direction $v_*$.
Now, by left-multiplying (E.1) by $v_*^\top$ and using (E.5) we obtain:
\[
v_*^\top \widehat{v}_{h+1} = \sigma_*(u_h^\top u_*) - \sigma_* v_*^\top G \ge \sigma_*(u_h^\top u_*) - \frac{\sigma_*}{99}\sqrt{1 - (u_h^\top u_*)^2}. \tag{E.6}
\]
Similarly, by multiplying (E.1) by $v_*^\perp$, where $v_*^\perp$ is a unit vector orthogonal to $v_*$, we get:
\[
\langle v_*^\perp, \widehat{v}_{h+1}\rangle \le \frac{\sigma_*}{99}\sqrt{1 - (u_h^\top u_*)^2}. \tag{E.7}
\]
Using (E.6), (E.7), and $\|\widehat{v}_{h+1}\|_2^2 = (v_*^\top \widehat{v}_{h+1})^2 + ((v_*^\perp)^\top \widehat{v}_{h+1})^2$, we get:
\[
1 - (v_{h+1}^\top v_*)^2 = \frac{\langle v_*^\perp, \widehat{v}_{h+1}\rangle^2}{\langle v_*, \widehat{v}_{h+1}\rangle^2 + \langle v_*^\perp, \widehat{v}_{h+1}\rangle^2}
\le \frac{1}{99\cdot 99\cdot \big(u_h^\top u_* - \frac{1}{99}\sqrt{1-(u_h^\top u_*)^2}\big)^2 + 1}\,\big(1 - (u_h^\top u_*)^2\big). \tag{E.8}
\]
Also, using Property 1 of Theorem 7.3.1, for $S = \frac{1}{m}\sum_{i=1}^m b_i A_i$, we get $\|S\|_2 \ge \frac{99\sigma_*}{100}$. Moreover, by multiplying $S - W^*$ by $u_0^\top$ on the left and $v_0$ on the right, and using the fact that $(u_0, v_0)$ are the top singular vectors of $S$, we get $\|S\|_2 - \sigma_* (v_0^\top v_*)(u_0^\top u_*) \le \sigma_*/100$. Hence, $u_0^\top u_* \ge 9/10$.
Using (E.8) along with the above observation and the "inductive" assumption $u_h^\top u_* \ge u_0^\top u_* \ge 9/10$ (the proof of the inductive step follows directly from the equation below), we get:
\[
1 - (v_{h+1}^\top v_*)^2 \le \frac{1}{2}\big(1 - (u_h^\top u_*)^2\big). \tag{E.9}
\]
Similarly, we can show that $1 - (u_{h+1}^\top u_*)^2 \le \frac{1}{2}\big(1 - (v_{h+1}^\top v_*)^2\big)$. Hence, after $H = O(\log(\sigma_*/\epsilon))$ iterations, we obtain $W_H = u_H \widehat{v}_H^\top$ s.t. $\|W_H - W^*\|_2 \le \epsilon$.
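As an illustrative sanity check (not part of the dissertation), the following Python sketch runs the spectral initialization and the alternating least-squares update (E.1) on synthetic rank-one Gaussian measurements; the dimensions, sample size, and iteration count are arbitrary assumptions chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, m = 20, 20, 2000

# Rank-1 ground truth W* = sigma* u* v*^T
u_star = rng.standard_normal(d1); u_star /= np.linalg.norm(u_star)
v_star = rng.standard_normal(d2); v_star /= np.linalg.norm(v_star)
sigma_star = 3.0
W_star = sigma_star * np.outer(u_star, v_star)

# Rank-one measurements b_i = x_i^T W* y_i
X = rng.standard_normal((m, d1))
Y = rng.standard_normal((m, d2))
b = np.einsum('ij,jk,ik->i', X, W_star, Y)

# Spectral initialization: top left singular vector of S = (1/m) sum_i b_i x_i y_i^T
S = (X * b[:, None]).T @ Y / m
u = np.linalg.svd(S)[0][:, 0]

# Alternating least squares: solve for v_hat given u (the update behind (E.1)),
# normalize, then swap the roles of u and v
for _ in range(30):
    a = X @ u                                     # a_i = x_i^T u
    B = (Y * (a**2)[:, None]).T @ Y / m           # the matrix B of (E.1)
    v_hat = np.linalg.solve(B, (Y * (a * b)[:, None]).sum(axis=0) / m)
    v = v_hat / np.linalg.norm(v_hat)
    c = Y @ v
    B2 = (X * (c**2)[:, None]).T @ X / m
    u_hat = np.linalg.solve(B2, (X * (c * b)[:, None]).sum(axis=0) / m)
    u = u_hat / np.linalg.norm(u_hat)

err = np.linalg.norm(np.outer(u_hat, v) - W_star, 2)
print(err)
```

In the noiseless setting the error decreases geometrically per pass, matching the linear convergence rate $H = O(\log(\sigma_*/\epsilon))$ derived above.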
We now generalize the above proof to the rank-$k$ case. In the rank-1 case, we used $1 - (v_{h+1}^\top v_*)^2$ as the error (distance) function and showed that at each step the error decreases by at least a constant factor. For the general rank-$k$ case, we need to generalize the distance function to a distance over subspaces of dimension $k$. To this end, we use the standard principal-angle-based subspace distance. That is,

Definition E.1.1. Let $U_1, U_2 \in \mathbb{R}^{d\times k}$ be orthonormal bases of two $k$-dimensional subspaces. Then the principal-angle-based distance $\mathrm{dist}(U_1, U_2)$ between $U_1, U_2$ is given by:
\[
\mathrm{dist}(U_1, U_2) = \|U_\perp^\top U_2\|_2,
\]
where $U_\perp$ is an orthonormal basis of the subspace orthogonal to $\mathrm{span}(U_1)$.
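To make Definition E.1.1 concrete, the following short sketch (illustrative only; dimensions are arbitrary) computes $\mathrm{dist}(U_1,U_2)$ and checks the standard fact that it equals the sine of the largest principal angle, read off the smallest singular value of $U_1^\top U_2$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 3

# Orthonormal bases of two random k-dimensional subspaces
U1, _ = np.linalg.qr(rng.standard_normal((d, k)))
U2, _ = np.linalg.qr(rng.standard_normal((d, k)))

# dist(U1, U2) = || U1_perp^T U2 ||_2, with U1_perp a basis of span(U1)^perp
Q, _ = np.linalg.qr(U1, mode='complete')
U1_perp = Q[:, k:]                                  # d x (d-k)
dist = np.linalg.norm(U1_perp.T @ U2, 2)

# Equivalent characterization: sine of the largest principal angle
s_min = np.linalg.svd(U1.T @ U2, compute_uv=False).min()
dist_sine = np.sqrt(max(0.0, 1.0 - s_min**2))
print(dist, dist_sine)
```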
Proof of Theorem 7.3.1: General Rank-$k$ Case. For simplicity of notation, we denote $U_t$ by $U$, $\widehat{V}_{t+1}$ by $\widehat{V}$, and $V_{t+1}$ by $V$.
Similar to the proof above, we first present the update equation for $\widehat{V}_{t+1}$. Recall that $\widehat{V}_{t+1} = \arg\min_{\widehat{V}\in\mathbb{R}^{d_2\times k}} \sum_i (x_i^\top W^* y_i - x_i^\top U \widehat{V}^\top y_i)^2$. Hence, by setting the gradient of this objective function to $0$, using the above notation, and simplifying, we get:
\[
\widehat{V} = W^{*\top} U - F, \tag{E.10}
\]
where $F = [F_1\ F_2\ \dots\ F_k]$ is the "error" matrix.
Before specifying $F$, we first introduce block matrices $B, C, D, S \in \mathbb{R}^{kd_2\times kd_2}$ with $(p,q)$-th blocks $B_{pq}, C_{pq}, D_{pq}, S_{pq}$ given by:
\[
B_{pq} = \frac{1}{m}\sum_i y_i y_i^\top (x_i^\top u_p)(x_i^\top u_q), \tag{E.11}
\]
\[
C_{pq} = \frac{1}{m}\sum_i y_i y_i^\top (x_i^\top u_p)(x_i^\top u_{*q}), \tag{E.12}
\]
\[
D_{pq} = u_p^\top u_{*q}\, I, \tag{E.13}
\]
\[
S_{pq} = \sigma_p^* I \text{ if } p = q, \text{ and } 0 \text{ if } p \ne q, \tag{E.14}
\]
where $\sigma_p^* = \Sigma_*(p,p)$, i.e., the $p$-th singular value of $W^*$, and $u_{*q}$ is the $q$-th column of $U_*$.
Then, using the definitions given above and letting $\mathrm{m}(\cdot)$ denote the vectorization of a matrix (columns stacked), we get:
\[
\begin{bmatrix} F_1 \\ \vdots \\ F_k \end{bmatrix} = \mathrm{m}(F) = B^{-1}(BD - C)\, S \cdot \mathrm{m}(V_*). \tag{E.15}
\]
Now, recall that in the $(t+1)$-th iteration of Algorithm 7.3.1, $V_{t+1}$ is obtained by QR decomposition of $\widehat{V}_{t+1}$. Using the notation above, $\widehat{V} = V R$, where $R$ denotes the triangular matrix $R_{t+1}$ obtained from the QR decomposition of $\widehat{V}_{t+1}$.
Now, using (E.10), $V = \widehat{V} R^{-1} = (W^{*\top} U - F) R^{-1}$. Multiplying both sides by $V_{*\perp}^\top$, where $V_{*\perp}$ is a fixed orthonormal basis of the subspace orthogonal to $\mathrm{span}(V_*)$, we get:
\[
V_{*\perp}^\top V = -V_{*\perp}^\top F R^{-1} \;\Rightarrow\; \mathrm{dist}(V_*, V_{t+1}) = \|V_{*\perp}^\top V\|_2 \le \|F\|_2\, \|R^{-1}\|_2. \tag{E.16}
\]
Also, note that using the initialization property (1) mentioned in Theorem 7.3.1, we get $\|S - W^*\|_2 \le \frac{\sigma_k^*}{100}$. Now, using the standard $\sin\theta$ theorem for singular vector perturbation [83], we get: $\mathrm{dist}(U_0, U_*) \le \frac{1}{100}$.
Theorem 7.3.1 now follows by using Lemma E.1.1 and Lemma E.1.2 along with the above bound on $\mathrm{dist}(U_0, U_*)$.
Lemma E.1.1. Let $\mathcal{A}$ be a rank-one measurement operator with $A_i = x_i y_i^\top$. Also, let $\mathcal{A}$ satisfy Properties 1, 2, 3 mentioned in Theorem 7.3.1, and let $\sigma_1^* \ge \sigma_2^* \ge \dots \ge \sigma_k^*$ be the singular values of $W^*$. Then,
\[
\|F\|_2 \le \frac{\sigma_k^*}{100}\,\mathrm{dist}(U_t, U_*).
\]
Lemma E.1.2. Let $\mathcal{A}$ be a rank-one measurement operator with $A_i = x_i y_i^\top$. Also, let $\mathcal{A}$ satisfy Properties 1, 2, 3 mentioned in Theorem 7.3.1. Then,
\[
\|R^{-1}\|_2 \le \frac{1}{\sigma_k^* \sqrt{1 - \mathrm{dist}^2(U_t, U_*)} - \|F\|_2}.
\]
Proof of Lemma E.1.1. Recall that $\mathrm{m}(F) = B^{-1}(BD - C)\, S \cdot \mathrm{m}(V_*)$. Hence,
\[
\|F\|_2 \le \|F\|_F \le \|B^{-1}\|_2\, \|BD - C\|_2\, \|S\|_2\, \|\mathrm{m}(V_*)\|_2 = \sigma_1^* \sqrt{k}\, \|B^{-1}\|_2\, \|BD - C\|_2. \tag{E.17}
\]
Now, we first bound $\|B^{-1}\|_2 = 1/\sigma_{\min}(B)$. Let $Z = [z_1\ z_2 \dots z_k]$ and let $z = \mathrm{m}(Z)$. Then,
\[
\sigma_{\min}(B) = \min_{z,\|z\|_2=1} z^\top B z = \min_{z,\|z\|_2=1} \sum_{1\le p\le k,\, 1\le q\le k} z_p^\top B_{pq} z_q
= \min_{z,\|z\|_2=1} \sum_p z_p^\top B_{pp} z_p + \sum_{p\ne q} z_p^\top B_{pq} z_q. \tag{E.18}
\]
Recall that $B_{pp} = \frac{1}{m}\sum_{i=1}^m y_i y_i^\top (x_i^\top u_p)^2$ and $u_p$ is independent of $x_i, y_i, \forall i$. Hence, using Property 2 given in Theorem 7.3.1, we get:
\[
\sigma_{\min}(B_{pp}) \ge 1 - \delta, \tag{E.19}
\]
where
\[
\delta = \frac{1}{k^{3/2}\cdot \beta \cdot 100},
\]
and $\beta = \sigma_1^*/\sigma_k^*$ is the condition number of $W^*$.
Similarly, using Property (3), we get:
\[
\|B_{pq}\|_2 \le \delta, \quad p \ne q. \tag{E.20}
\]
Hence, using (E.18), (E.19), (E.20), we get:
\[
\sigma_{\min}(B) \ge \min_{z,\|z\|_2=1} (1-\delta)\sum_p \|z_p\|_2^2 - \delta \sum_{p\ne q} \|z_p\|_2 \|z_q\|_2
= \min_{z,\|z\|_2=1} 1 - \delta \sum_{p,q} \|z_p\|_2 \|z_q\|_2 \ge 1 - k\delta. \tag{E.21}
\]
Now, consider $BD - C$:
\[
\|BD - C\|_2 = \max_{z,\|z\|_2=1} |z^\top (BD - C) z|
= \max_{z,\|z\|_2=1} \Big| \sum_{1\le p\le k,\, 1\le q\le k} \frac{1}{m}\sum_i z_p^\top y_i y_i^\top z_q\; x_i^\top \Big( \sum_{1\le \ell\le k} \langle u_\ell, u_{*q}\rangle\, u_p u_\ell^\top - u_p u_{*q}^\top \Big) x_i \Big|
\]
\[
= \max_{z,\|z\|_2=1} \Big| \sum_{1\le p\le k,\, 1\le q\le k} \frac{1}{m}\sum_i z_p^\top y_i y_i^\top z_q\; x_i^\top u_p\, u_{*q}^\top (U U^\top - I)\, x_i \Big|
\overset{\zeta_1}{\le} \delta \max_{z,\|z\|_2=1} \sum_{1\le p\le k,\, 1\le q\le k} \|(U U^\top - I) u_{*q}\|_2\, \|z_p\|_2\, \|z_q\|_2 \le k\cdot\delta\cdot\mathrm{dist}(U, U_*), \tag{E.22}
\]
where $\zeta_1$ follows by observing that $u_{*q}^\top (U U^\top - I)\, u_p = 0$ and then by applying Property (3) mentioned in Theorem 7.3.1.
The lemma now follows by using (E.22) along with (E.17) and (E.21).
Proof of Lemma E.1.2. The lemma is exactly the same as Lemma 4.7 of [71]. We reproduce the proof here for completeness.
Let $\sigma_{\min}(R)$ be the smallest singular value of $R$. Then:
\[
\sigma_{\min}(R) = \min_{z,\|z\|_2=1} \|R z\|_2 = \min_{z,\|z\|_2=1} \|V R z\|_2 = \min_{z,\|z\|_2=1} \|V_* \Sigma_* U_*^\top U z - F z\|_2
\]
\[
\ge \min_{z,\|z\|_2=1} \|V_* \Sigma_* U_*^\top U z\|_2 - \|F z\|_2 \ge \sigma_k^*\, \sigma_{\min}(U^\top U_*) - \|F\|_2
\ge \sigma_k^* \sqrt{1 - \|U^\top U_{*\perp}\|_2^2} - \|F\|_2 = \sigma_k^* \sqrt{1 - \mathrm{dist}^2(U_*, U)} - \|F\|_2. \tag{E.23}
\]
The lemma now follows by using the above inequality along with the fact that $\|R^{-1}\|_2 \le 1/\sigma_{\min}(R)$.
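The rank-$k$ algorithm analyzed above — least squares for $\widehat{V}$ given $U$, followed by QR orthonormalization, alternating sides — can be sketched as follows. This is an illustrative implementation, not the dissertation's code; the dimensions, rank, sample size, and iteration count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, k, m = 25, 25, 2, 4000

# Rank-k ground truth W* = U* Sigma* V*^T
U_star, _ = np.linalg.qr(rng.standard_normal((d1, k)))
V_star, _ = np.linalg.qr(rng.standard_normal((d2, k)))
W_star = U_star @ np.diag([3.0, 1.5]) @ V_star.T

X = rng.standard_normal((m, d1))
Y = rng.standard_normal((m, d2))
b = np.einsum('ij,jk,ik->i', X, W_star, Y)

def ls_step(Ucur, A, Bmat, obs):
    # Solve min_V sum_i (obs_i - (Ucur^T a_i)^T (V^T b_i))^2 by flattening V
    P = A @ Ucur                                                # m x k
    G = np.einsum('ip,ij->ipj', P, Bmat).reshape(len(obs), -1)  # design matrix
    vec = np.linalg.lstsq(G, obs, rcond=None)[0]
    return vec.reshape(k, -1).T                                 # d x k

# Spectral initialization, then alternate least squares + QR
U = np.linalg.svd((X * b[:, None]).T @ Y / m)[0][:, :k]
for _ in range(25):
    V_hat = ls_step(U, X, Y, b)
    V, _ = np.linalg.qr(V_hat)      # the QR step producing V_{t+1}
    U_hat = ls_step(V, Y, X, b)
    U, _ = np.linalg.qr(U_hat)

err = np.linalg.norm(U_hat @ V.T - W_star, 2)
print(err)
```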
E.2 Proofs for Matrix Sensing using Rank-one Independent Gaussian Measurements
E.2.1 Proof of Claim 7.3.1
Proof. The main idea behind our proof is to show that there exist two rank-1 matrices $Z_U, Z_L$ s.t. $\|\mathcal{A}_{GI}(Z_U)\|_2^2$ is large while $\|\mathcal{A}_{GI}(Z_L)\|_2^2$ is much smaller than $\|\mathcal{A}_{GI}(Z_U)\|_2^2$.
In particular, let $Z_U = x_1 y_1^\top$ and let $Z_L = uv^\top$, where $u, v$ are sampled from the standard normal distribution independently of $X, Y$. Now,
\[
\|\mathcal{A}_{GI}(Z_U)\|_2^2 = \|x_1\|_2^4 \|y_1\|_2^4 + \sum_{i=2}^m (x_1^\top x_i)^2 (y_1^\top y_i)^2.
\]
Now, as $x_i, y_i, \forall i$ are multivariate normal random variables, $\|x_1\|_2^4\|y_1\|_2^4 \ge 0.5\, d_1^2 d_2^2$ w.p. $\ge 1 - 2\exp(-d_1 - d_2)$. Hence,
\[
\|\mathcal{A}_{GI}(Z_U)\|_2^2 \ge 0.5\, d_1^2 d_2^2. \tag{E.24}
\]
Moreover, $\|Z_U\|_F^2 \le 2 d_1 d_2$ w.p. $\ge 1 - 2\exp(-d_1 - d_2)$.
Now, consider
\[
\|\mathcal{A}_{GI}(Z_L)\|_2^2 = \sum_{i=1}^m (u^\top x_i)^2 (v^\top y_i)^2,
\]
where $Z_L = uv^\top$ and $u, v$ are sampled from the standard normal distribution, independently of $x_i, y_i, \forall i$. Since $u, v$ are independent of $x_i, y_i$, we have $u^\top x_i \sim N(0, \|u\|_2^2)$ and $v^\top y_i \sim N(0, \|v\|_2^2)$. Hence, w.p. $\ge 1 - 1/m^3$, $|u^\top x_i| \le \log(m)\|u\|_2$ and $|v^\top y_i| \le \log(m)\|v\|_2, \forall i$. That is, w.p. $1 - 1/m^3$:
\[
\|\mathcal{A}_{GI}(Z_L)\|_2^2 \le 4m \log^4 m\; d_1 d_2. \tag{E.25}
\]
Furthermore, $\|Z_L\|_F^2 \le 2 d_1 d_2$ w.p. $\ge 1 - 2\exp(-d_1 - d_2)$.
Using (E.24), (E.25), we get that w.p. $\ge 1 - 2/m^3 - 10\exp(-d_1 - d_2)$:
\[
\|\mathcal{A}_{GI}(Z_L/\|Z_L\|_F)\|_2^2 \le 40\, m\log^4 m, \qquad \|\mathcal{A}_{GI}(Z_U/\|Z_U\|_F)\|_2^2 \ge 0.05\, d_1 d_2.
\]
Now, for RIP to be satisfied with a constant $\delta$, the lower and upper bounds should be at most a constant factor apart. However, the above bounds can match only when $m = \Omega(d_1 d_2/\log^5(d_1 d_2))$. Hence, for $m$ at most linear in $d_1, d_2$, RIP cannot be satisfied with probability $\ge 1 - 1/(d_1 + d_2)^3$.
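The contrast between the aligned direction $Z_U$ and an independent direction $Z_L$ can be seen numerically. The sketch below is illustrative only (dimensions and $m$ are arbitrary assumptions); it compares the normalized measurement energies $\|\mathcal{A}_{GI}(Z)\|_2^2/\|Z\|_F^2$ for the two choices and shows they differ by far more than a constant factor.

```python
import numpy as np

rng = np.random.default_rng(0)
d1 = d2 = 100
m = 10 * (d1 + d2)          # number of measurements, linear in d1 + d2

X = rng.standard_normal((m, d1))
Y = rng.standard_normal((m, d2))

def ratio(Z):
    # ||A_GI(Z)||_2^2 / ||Z||_F^2 with (A_GI(Z))_i = x_i^T Z y_i
    vals = np.einsum('ij,jk,ik->i', X, Z, Y)
    return np.sum(vals**2) / np.sum(Z**2)

Z_U = np.outer(X[0], Y[0])                 # aligned with the first measurement
Z_L = np.outer(rng.standard_normal(d1),    # independent of all measurements
               rng.standard_normal(d2))

r_U, r_L = ratio(Z_U), ratio(Z_L)
print(r_U, r_L)   # r_U is much larger than r_L, so no constant-delta RIP
```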
E.2.2 Proof of Theorem 7.3.2
Proof. We divide the proof into three parts, where each part proves a property mentioned in Theorem 7.3.1.
Proof of Property 1. Now,
\[
S = \frac{1}{m}\sum_{i=1}^m b_i\, x_i y_i^\top = \frac{1}{m}\sum_{i=1}^m x_i x_i^\top U_* \Sigma_* V_*^\top y_i y_i^\top = \frac{1}{m}\sum_{i=1}^m Z_i,
\]
where $Z_i = x_i x_i^\top U_* \Sigma_* V_*^\top y_i y_i^\top$. Note that $\mathbb{E}[Z_i] = U_* \Sigma_* V_*^\top$. Also, both $x_i$ and $y_i$ are spherical Gaussian variables and hence are rotationally invariant. Therefore, w.l.o.g., we can assume that $U_* = [e_1\ e_2 \dots e_k]$ and $V_* = [e_1\ e_2 \dots e_k]$, where $e_i$ is the $i$-th canonical basis vector.
As $S$ is a sum of $m$ random matrices, the goal is to apply Lemma 2.4.3 to show that $S$ is close to $\mathbb{E}[S] = W^* = U_* \Sigma_* V_*^\top$ for large enough $m$. However, Lemma 2.4.3 requires bounded random variables, while $Z_i$ is unbounded. We handle this issue by clipping $Z_i$ to ensure that its spectral norm is always bounded. In particular, consider the following random variable:
\[
\widetilde{x}_{ij} = \begin{cases} x_{ij}, & |x_{ij}| \le C\sqrt{\log(m(d_1+d_2))}, \\ 0, & \text{otherwise}, \end{cases} \tag{E.26}
\]
where $x_{ij}$ is the $j$-th coordinate of $x_i$. Similarly, define:
\[
\widetilde{y}_{ij} = \begin{cases} y_{ij}, & |y_{ij}| \le C\sqrt{\log(m(d_1+d_2))}, \\ 0, & \text{otherwise}. \end{cases} \tag{E.27}
\]
Note that $\mathbb{P}(\widetilde{x}_{ij} = x_{ij}) \ge 1 - \frac{1}{(m(d_1+d_2))^C}$ and $\mathbb{P}(\widetilde{y}_{ij} = y_{ij}) \ge 1 - \frac{1}{(m(d_1+d_2))^C}$. Also, $\widetilde{x}_{ij}, \widetilde{y}_{ij}$ are still symmetric and independent random variables, i.e., $\mathbb{E}[\widetilde{x}_{ij}] = \mathbb{E}[\widetilde{y}_{ij}] = 0, \forall i,j$. Hence, $\mathbb{E}[\widetilde{x}_{ij}\widetilde{x}_{i\ell}] = 0, \forall j \ne \ell$. Furthermore, $\forall j$,
\[
\mathbb{E}[\widetilde{x}_{ij}^2] = \mathbb{E}[x_{ij}^2] - \frac{2}{\sqrt{2\pi}} \int_{C\sqrt{\log(m(d_1+d_2))}}^{\infty} x^2 \exp(-x^2/2)\,dx
= 1 - \frac{2}{\sqrt{2\pi}}\, \frac{C\sqrt{\log(m(d_1+d_2))}}{(m(d_1+d_2))^{C^2/2}} - \frac{2}{\sqrt{2\pi}} \int_{C\sqrt{\log(m(d_1+d_2))}}^{\infty} \exp(-x^2/2)\,dx
\ge 1 - \frac{2C\sqrt{\log(m(d_1+d_2))}}{(m(d_1+d_2))^{C^2/2}}. \tag{E.28}
\]
Similarly,
\[
\mathbb{E}[\widetilde{y}_{ij}^2] \ge 1 - \frac{2C\sqrt{\log(m(d_1+d_2))}}{(m(d_1+d_2))^{C^2/2}}. \tag{E.29}
\]
Now, consider the random variable $\widetilde{Z}_i = \widetilde{x}_i \widetilde{x}_i^\top U_* \Sigma_* V_*^\top \widetilde{y}_i \widetilde{y}_i^\top$. Note that
\[
\|\widetilde{Z}_i\|_2 \le C^4 \sqrt{d_1 d_2}\; k \log^2(m(d_1+d_2))\; \sigma_1^*
\]
and $\|\mathbb{E}[\widetilde{Z}_i]\|_2 \le \sigma_1^*$. Also,
\[
\|\mathbb{E}[\widetilde{Z}_i \widetilde{Z}_i^\top]\|_2 = \|\mathbb{E}[\|\widetilde{y}_i\|_2^2\, \widetilde{x}_i \widetilde{x}_i^\top U_* \Sigma_* V_*^\top \widetilde{y}_i \widetilde{y}_i^\top V_* \Sigma_* U_*^\top \widetilde{x}_i \widetilde{x}_i^\top]\|_2
\le C^2 d_2 \log(m(d_1+d_2))\, \|\mathbb{E}[\widetilde{x}_i \widetilde{x}_i^\top U_* \Sigma_*^2 U_*^\top \widetilde{x}_i \widetilde{x}_i^\top]\|_2
\]
\[
\le C^2 d_2 \log(m(d_1+d_2))\, (\sigma_1^*)^2\, \|\mathbb{E}[\|U_*^\top \widetilde{x}_i\|_2^2\, \widetilde{x}_i \widetilde{x}_i^\top]\|_2
\le C^4 k d_2 \log^2(m(d_1+d_2))\, (\sigma_1^*)^2. \tag{E.30}
\]
Similarly,
\[
\|\mathbb{E}[\widetilde{Z}_i]\,\mathbb{E}[\widetilde{Z}_i^\top]\|_2 \le (\sigma_1^*)^2. \tag{E.31}
\]
Similarly, we can obtain bounds for $\|\mathbb{E}[\widetilde{Z}_i^\top \widetilde{Z}_i]\|_2$ and $\|\mathbb{E}[\widetilde{Z}_i]^\top \mathbb{E}[\widetilde{Z}_i]\|_2$.
Finally, by selecting $m = C_1 \frac{k(d_1+d_2)\log^2(d_1+d_2)}{\delta^2}$ and applying Lemma 2.4.3, we get (w.p. $1 - \frac{1}{(d_1+d_2)^{10}}$),
\[
\Big\|\frac{1}{m}\sum_{i=1}^m \widetilde{Z}_i - \mathbb{E}[\widetilde{Z}_i]\Big\|_2 \le \delta. \tag{E.32}
\]
Note that $\mathbb{E}[\widetilde{Z}_i] = \mathbb{E}[\widetilde{x}_{i1}^2]\,\mathbb{E}[\widetilde{y}_{i1}^2]\, U_* \Sigma_* V_*^\top$. Hence, by using (E.32), (E.28), (E.29),
\[
\Big\|\frac{1}{m}\sum_{i=1}^m \widetilde{Z}_i - U_* \Sigma_* V_*^\top\Big\|_2 \le \delta + \frac{\sigma_1^*}{(d_1+d_2)^{100}}.
\]
Finally, by observing that $C$ can be selected large enough in the definition of $\widetilde{x}_i, \widetilde{y}_i$ (see (E.26), (E.27)), we get $\mathbb{P}(\|Z_i - \widetilde{Z}_i\|_2 = 0) \ge 1 - \frac{1}{(d_1+d_2)^5}$. Hence, by assuming $\delta$ to be a constant w.r.t. $d_1, d_2$ and by a union bound, w.p. $\ge 1 - \frac{2}{(d_1+d_2)^5}$,
\[
\Big\|\frac{1}{m}\sum_{i=1}^m Z_i - W^*\Big\|_2 \le 5\delta\,\|W^*\|_2.
\]
Now, the theorem follows directly by setting $\delta = \frac{1}{100\, k^{3/2}\beta}$.
Proof of Property 2. Here again, the goal is to show that the random matrix $B_x$ concentrates around its mean, which is $I$. Now, as $x_i, y_i$ are rotationally invariant random variables, w.l.o.g. we can assume $u_h = e_1$. That is, $(x_i^\top u_h u_h^\top x_i) = x_{i1}^2$, where $x_{i1}$ is the first coordinate of $x_i$. Furthermore, similar to (E.26), (E.27), we define clipped random variables $\widetilde{x}_{i1}$ and $\widetilde{y}_i$ below:
\[
\widetilde{x}_{i1} = \begin{cases} x_{i1}, & |x_{i1}| \le C\sqrt{\log m}, \\ 0, & \text{otherwise}, \end{cases} \tag{E.33}
\]
\[
\widetilde{y}_i = \begin{cases} y_i, & \|y_i\|_2^2 \le 2(d_1+d_2), \\ 0, & \text{otherwise}. \end{cases} \tag{E.34}
\]
Now, consider $\widetilde{B} = \frac{1}{m}\sum_{i=1}^m \widetilde{Z}_i$, where $\widetilde{Z}_i = \widetilde{x}_{i1}^2\, \widetilde{y}_i \widetilde{y}_i^\top$. Note that $\|\widetilde{Z}_i\|_2 \le 2C^2(d_1+d_2)\log m$. Similarly, $\|\mathbb{E}[\sum_i \widetilde{Z}_i \widetilde{Z}_i^\top]\|_2 \le 2m C^4 (d_1+d_2) \log^2 m$. Hence, using Lemma 2.4.3, we get:
\[
\mathbb{P}\Big(\Big\|\frac{1}{m}\sum_i \widetilde{Z}_i - \mathbb{E}[\widetilde{Z}_i]\Big\|_2 \ge \gamma\Big) \le \exp\Big(-\frac{m\gamma^2}{2C^4(d_1+d_2)\log^2 m\,(1+\gamma/3)}\Big). \tag{E.35}
\]
Now, using an argument similar to (E.28), we get $\|\mathbb{E}[\widetilde{Z}_i] - I\|_2 \le \frac{2C\log m}{m^{C^2/2}}$. Furthermore, $\mathbb{P}(\widetilde{Z}_i = Z_i) \ge 1 - \frac{1}{m^3}$. Hence, for $m = \Omega(k(d_1+d_2)\log(d_1+d_2)/\delta^2)$, w.p. $\ge 1 - \frac{2}{m^3}$,
\[
\|B_x - I\|_2 \le \delta. \tag{E.36}
\]
Similarly, we can prove the bound for $B_y$ using exactly the same set of arguments.
Proof of Property 3. Let $C = \frac{1}{m}\sum_{i=1}^m y_i y_i^\top\, x_i^\top u\, u_\perp^\top x_i$, where $u, u_\perp$ are fixed orthogonal unit vectors. Now $x_i^\top u \sim N(0,1)$ and $u_\perp^\top x_i \sim N(0,1)$ are both normal variables. Also, since $u$ and $u_\perp$ are orthogonal, $x_i^\top u$ and $u_\perp^\top x_i$ are independent variables.
Hence, $\mathbb{E}[x_i^\top u\, u_\perp^\top x_i] = 0$, i.e., $\mathbb{E}[C] = 0$. Now, let $m = \Omega(k(d_1+d_2)\log(d_1+d_2)/\delta^2)$. Then, using the clipping argument (used in the previous proof) with Lemma 2.4.3, Property 3 is satisfied w.p. $\ge 1 - \frac{2}{m^3}$. That is, $\|C_y\|_2 \le \delta$. Moreover, $\|C_x\|_2 \le \delta$ can also be proved using a proof similar to the one given above.
E.3 Proofs for Matrix Sensing using Rank-one Dependent Gaussian Measurements
E.3.1 Proof of Lemma 7.3.1
Proof. Incoherence: The incoherence property follows directly from Lemma 2.2 in [24].
Averaging Property: Given any orthonormal matrix $U \in \mathbb{R}^{d\times k}$ ($d \ge k$), let $Q = [U, U_\perp]$, where $U_\perp$ is a complementary orthonormal basis for $U$. Define $S = XQ = (XQX^\top)X$. The matrix $XQX^\top$ can be viewed as a rotation matrix constrained to the column space of $X$. Thus, $S$ is a constrained rotation of $X$, which implies $S$ is also a random orthogonal matrix, and so are its first $k$ columns. We use $T \in \mathbb{R}^{n\times k}$ to denote the first $k$ columns of $S$. Then,
\[
\max_i \|U^\top z_i\| = \max_i \|U^\top Q s_i\| = \max_i \|t_i\|,
\]
where $t_i$ is the transpose of the $i$-th row of $T$. Now this property follows from Lemma 2.2 in [24].
E.3.2 Proof of Theorem 7.3.3
Proof. Similar to the proof of Theorem 7.3.2, we divide the proof into three parts, where each part proves a property mentioned in Theorem 7.3.1. In this proof, we set $d = d_1 + d_2$ and $n = n_1 + n_2$.
Proof of Property 1. As mentioned in the proof sketch, w.l.o.g., we can assume that both $X, Y$ are orthonormal matrices and that the condition number of $R$ is the same as the condition number of $W^*$.
We first recall the definition of $S$:
\[
S = \frac{n_1 n_2}{m} \sum_{(i,j)\in\Omega} x_i x_i^\top U_* \Sigma_* V_*^\top y_j y_j^\top = \frac{n_1 n_2}{m} \sum_{(i,j)\in\Omega} Z_{ij},
\]
where $Z_{ij} = x_i x_i^\top U_* \Sigma_* V_*^\top y_j y_j^\top = X e_i e_i^\top X^\top U_* \Sigma_* V_*^\top Y e_j e_j^\top Y^\top$, and $e_i, e_j$ denote the $i$-th, $j$-th canonical basis vectors, respectively.
Also, each $(i,j)$ is sampled uniformly at random from $[n_1]\times[n_2]$. Hence, $\mathbb{E}_i[e_i e_i^\top] = \frac{1}{n_1} I$ and $\mathbb{E}_j[e_j e_j^\top] = \frac{1}{n_2} I$. That is,
\[
\mathbb{E}_{ij}[Z_{ij}] = \frac{1}{n_1 n_2}\, X X^\top U_* \Sigma_* V_*^\top Y Y^\top = \frac{1}{n_1 n_2}\, U_* \Sigma_* V_*^\top = W^*/(n_1 n_2),
\]
where $X X^\top = I$, $Y Y^\top = I$ follow from the orthonormality of both $X$ and $Y$.
We now use the matrix concentration bound of Lemma 2.4.3 to bound $\|S - W^*\|_2$. To apply the bound of Lemma 2.4.3, we first need to bound the following two quantities:
• Bound $\max_{ij}\|Z_{ij}\|_2$: Now,
\[
\|Z_{ij}\|_2 = \|x_i x_i^\top U_* \Sigma_* V_*^\top y_j y_j^\top\|_2 \le \sigma_1^*\, \|U_*^\top x_i\|_2\, \|V_*^\top y_j\|_2\, \|x_i\|_2\, \|y_j\|_2 \le \sigma_1^* \mu\mu_0 \frac{\sqrt{d_1 d_2}\, k}{n_1 n_2},
\]
where the last inequality follows using the two properties of the random orthogonal matrices $X, Y$.
• Bound $\|\sum_{(i,j)\in\Omega}\mathbb{E}[Z_{ij} Z_{ij}^\top]\|_2$ and $\|\sum_{(i,j)\in\Omega}\mathbb{E}[Z_{ij}^\top Z_{ij}]\|_2$: We first consider $\|\sum_{(i,j)\in\Omega}\mathbb{E}[Z_{ij} Z_{ij}^\top]\|_2$:
\[
\Big\|\sum_{(i,j)\in\Omega}\mathbb{E}[Z_{ij} Z_{ij}^\top]\Big\|_2 = \Big\|\sum_{(i,j)\in\Omega}\mathbb{E}[x_i x_i^\top W^* y_j y_j^\top y_j y_j^\top W^{*\top} x_i x_i^\top]\Big\|_2
\overset{\zeta_1}{\le} \frac{\mu d_2}{n_2}\Big\|\sum_{(i,j)\in\Omega}\mathbb{E}[x_i x_i^\top W^* y_j y_j^\top W^{*\top} x_i x_i^\top]\Big\|_2
\]
\[
\overset{\zeta_2}{=} \frac{\mu d_2}{n_2^2}\Big\|\sum_{(i,j)\in\Omega}\mathbb{E}[x_i x_i^\top W^* W^{*\top} x_i x_i^\top]\Big\|_2
\overset{\zeta_3}{\le} \frac{(\sigma_1^*)^2 \mu\mu_0 k d_2}{n_1 n_2^2}\Big\|\sum_{(i,j)\in\Omega}\mathbb{E}[x_i x_i^\top]\Big\|_2
\overset{\zeta_4}{=} \frac{(\sigma_1^*)^2 \mu\mu_0 k d_2}{n_1^2 n_2^2}\cdot m, \tag{E.37}
\]
where $\zeta_1, \zeta_3$ follow by using the two properties of $X, Y$ and $\|W^*\|_2 \le \sigma_1^*$, and $\zeta_2, \zeta_4$ follow by using $\mathbb{E}_i[e_i e_i^\top] = \frac{1}{n_1}I$ and $\mathbb{E}_j[e_j e_j^\top] = \frac{1}{n_2}I$.
Now, the bound for $\|\sum_{(i,j)\in\Omega}\mathbb{E}[Z_{ij}^\top Z_{ij}]\|_2$ turns out to be exactly the same and can be computed using exactly the same arguments as above.
Now, by applying Lemma 2.4.3 and using the above computed bounds, we get:
\[
\Pr\big(\|S - W^*\|_2 \ge \sigma_1^* \gamma\big) \le d\,\exp\Big(-\frac{m\gamma^2}{\mu\mu_0 k d\, (1+\gamma/3)}\Big). \tag{E.38}
\]
That is, w.p. $\ge 1 - \gamma$:
\[
\|S - W^*\|_2 \le \sigma_1^*\, \frac{\sqrt{2\mu\mu_0 k d \log(d/\gamma)}}{\sqrt{m}}. \tag{E.39}
\]
Because the properties of random orthogonal matrices fail with probability $c n^{-3}\log n$, we assume $\gamma$ to be at least of the same magnitude as this failure probability in order to simplify the result, i.e., $\gamma \ge c n^{-3}\log n$. Hence, by selecting $m = O(\mu\mu_0 k^4 \cdot \beta^2 \cdot d \log(d/\gamma))$, where $\beta = \sigma_1^*/\sigma_k^*$, the following holds w.p. $\ge 1 - \gamma$:
\[
\|S - W^*\|_2 \le \|W^*\|_2 \cdot \delta,
\]
where $\delta = 1/(k^{3/2}\cdot\beta\cdot 100)$.
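The $1/\sqrt{m}$ decay in (E.39) can be observed empirically. The sketch below is illustrative only (all sizes are arbitrary assumptions): it draws $x_i, y_j$ as rows of random orthogonal matrices, forms $S$ from uniformly sampled index pairs, and watches $\|S - W^*\|_2$ shrink as $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
n1 = n2 = 80
Qx, _ = np.linalg.qr(rng.standard_normal((n1, n1)))  # rows x_i of a random orthogonal matrix
Qy, _ = np.linalg.qr(rng.standard_normal((n2, n2)))  # rows y_j

W_star = np.outer(rng.standard_normal(n1), rng.standard_normal(n2))
W_star /= np.linalg.norm(W_star, 2)                  # rank-1, spectral norm 1

def est_err(m):
    # S = (n1 n2 / m) * sum over random (i, j) of x_i x_i^T W* y_j y_j^T
    I = rng.integers(0, n1, m)
    J = rng.integers(0, n2, m)
    S = np.zeros((n1, n2))
    for i, j in zip(I, J):
        S += (Qx[i] @ W_star @ Qy[j]) * np.outer(Qx[i], Qy[j])
    return np.linalg.norm(n1 * n2 / m * S - W_star, 2)

errs = [est_err(m) for m in (500, 2000, 8000)]
print(errs)   # shrinks roughly like 1/sqrt(m), matching (E.39)
```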
Proof of Property 2. We prove the property for $B_y$; the proof for $B_x$ follows analogously. Now, let $B_y = \frac{n_1 n_2}{m}\sum_{(i,j)\in\Omega} Z_{ij}$, where $Z_{ij} = x_i^\top u\, u^\top x_i\; y_j y_j^\top$. Then,
\[
\mathbb{E}[B_y] = \frac{n_1 n_2}{m}\sum_{(i,j)\in\Omega}\mathbb{E}[Z_{ij}] = \frac{n_1 n_2}{m}\sum_{(i,j)\in\Omega}\mathbb{E}_{(i,j)}[x_i^\top u\, u^\top x_i\; y_j y_j^\top] = I. \tag{E.40}
\]
Here again, we apply Lemma 2.4.3 to bound $\|B_y - I\|_2$. To this end, we need to bound the following quantities:
• Bound $\max_{ij}\|Z_{ij}\|_2$: Now,
\[
\|Z_{ij}\|_2 = \|x_i^\top u\, u^\top x_i\; y_j y_j^\top\|_2 \le \|y_j\|_2^2\, \|u^\top x_i\|_2^2 \le \frac{\mu\mu_0 d_2 \log n}{n_1 n_2}.
\]
The log factor comes from the second property of random orthogonal matrices.
• Bound $\|\sum_{(i,j)\in\Omega}\mathbb{E}[Z_{ij}Z_{ij}^\top]\|_2$ and $\|\sum_{(i,j)\in\Omega}\mathbb{E}[Z_{ij}^\top Z_{ij}]\|_2$: We first consider $\|\sum_{(i,j)\in\Omega}\mathbb{E}[Z_{ij}Z_{ij}^\top]\|_2$:
\[
\Big\|\sum_{(i,j)\in\Omega}\mathbb{E}[Z_{ij}Z_{ij}^\top]\Big\|_2 = \Big\|\sum_{(i,j)\in\Omega}\mathbb{E}[(x_i^\top u\, u^\top x_i)^2\, \|y_j\|_2^2\, y_j y_j^\top]\Big\|_2
\overset{\zeta_1}{\le} \frac{\mu d_2}{n_2}\Big\|\sum_{(i,j)\in\Omega}\mathbb{E}[(x_i^\top u\, u^\top x_i)^2\, y_j y_j^\top]\Big\|_2
\]
\[
\overset{\zeta_2}{=} \frac{\mu d_2}{n_2^2}\Big\|\sum_{(i,j)\in\Omega}\mathbb{E}[(x_i^\top u\, u^\top x_i)^2]\Big\|_2
\overset{\zeta_3}{\le} \frac{\mu\mu_0 d_2 \log n}{n_1 n_2^2}\Big\|\sum_{(i,j)\in\Omega}\mathbb{E}[(x_i^\top u)^2]\Big\|_2
\overset{\zeta_4}{=} \frac{\mu\mu_0 d_2 \log n}{n_1^2 n_2^2}\cdot m. \tag{E.41}
\]
Note that if we assume $k \ge \log n$, the above bounds are smaller than the corresponding ones obtained in the proof of the initialization property. Hence, by applying Lemma 2.4.3 in a similar manner, and selecting $m = O(\mu\mu_0 k^4\cdot\beta^2\cdot d\log(d/\gamma))$ and $\delta = 1/(k^{3/2}\cdot\beta\cdot 100)$, we get w.p. $\ge 1-\gamma$:
\[
\|B_y - I\|_2 \le \delta.
\]
$\|B_x - I\|_2 \le \delta$ can be proved similarly.
Proof of Property 3. Note that $\mathbb{E}[C_y] = \mathbb{E}[\sum_{(i,j)\in\Omega} Z_{ij}] = 0$. Furthermore, both $\|Z_{ij}\|_2$ and $\|\mathbb{E}[\sum_{(i,j)\in\Omega} Z_{ij}Z_{ij}^\top]\|_2$ obey exactly the same bounds as those given in the proof of Property 2 above. Hence, we obtain similar bounds. That is, if $m = O(\mu\mu_0 k^4\cdot\beta^2\cdot d\log(d/\gamma))$ and $\delta = 1/(k^{3/2}\cdot\beta\cdot 100)$, we get w.p. $\ge 1-\gamma$: $\|C_y\|_2 \le \delta$. $\|C_x\|_2$ can be bounded analogously.
Appendix F
Over-specified Neural Network
F.1 Preliminaries
Without loss of generality, we assume that $w_i^*, i \in [k_0]$, are orthonormal. Define $\bar{w}_j = w_j/\|w_j\|$, $\theta_{ij} = \angle(w_i, w_j)$, $\alpha_{ij} = \angle(w_i, w_j^*)$. In the special case where $k_0 = 1$, we denote the ground-truth weight vector as $w^*$ and $\alpha_i = \angle(w_i, w^*)$.
The following lemmas characterize the behavior of the $M$ matrix when the input is normally distributed.
Lemma F.1.1 (Lemma 6 [41]). If $x \sim N(0, I)$, then given any two unit vectors $w, u$ (i.e., $\|w\| = \|u\| = 1$),
\[
\mathbb{E}[\mathbf{1}_{w^\top x\ge 0}\,\mathbf{1}_{u^\top x\ge 0}\, xx^\top] = \Big(\frac{1}{2} - \frac{\theta}{2\pi}\Big) I + \frac{(wu^\top + uw^\top) - w^\top u\,(ww^\top + uu^\top)}{2\pi \sin\theta}, \tag{F.1}
\]
where $\theta = \arccos(w^\top u)$.
By multiplying $u$ on both sides of Eq. (F.1), we immediately get the following lemma.
Lemma F.1.2. If $x \sim N(0, I)$, then given any two unit vectors $w, u$ (i.e., $\|w\| = \|u\| = 1$),
\[
\mathbb{E}[\mathbf{1}_{w^\top x\ge 0}\,\mathbf{1}_{u^\top x\ge 0}\, xx^\top]\, u = \Big(\frac{1}{2} - \frac{\theta}{2\pi}\Big) u + \frac{\sin\theta}{2\pi}\, w, \tag{F.2}
\]
where $\theta = \arccos(w^\top u)$.
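The closed form (F.2) is easy to verify numerically. The following Monte Carlo sketch (illustrative only; dimension and sample size are arbitrary assumptions) compares the empirical expectation against the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1_000_000

# Two fixed unit vectors with a known angle between them
w = np.zeros(d); w[0] = 1.0
u = rng.standard_normal(d); u /= np.linalg.norm(u)
theta = np.arccos(np.clip(w @ u, -1.0, 1.0))

# Monte Carlo estimate of E[1{w.x >= 0} 1{u.x >= 0} x x^T] u
X = rng.standard_normal((n, d))
mask = (X @ w >= 0) & (X @ u >= 0)
emp = (X[mask].T @ X[mask]) / n @ u

# Closed form from Lemma F.1.2
closed = (0.5 - theta / (2 * np.pi)) * u + np.sin(theta) / (2 * np.pi) * w
err = np.linalg.norm(emp - closed)
print(err)
```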
When the angle between two vectors is small enough, $M_{u,v}v$ can be well approximated by $\frac{1}{2}v$, where the approximation error decreases quadratically w.r.t. the angle between $u, v$. The following lemma characterizes this property.
Lemma F.1.3. For any $u, v \in \mathbb{S}^{d-1}$, there exists a constant $C_2$ such that
\[
\Big\|M_{u,v}v - \frac{1}{2}v\Big\| \le C_2\,\theta_{u,v}^2.
\]
Proof. For Gaussian input, by Lemma F.1.2, we have
\[
\Big\|M_{u,v}v - \frac{1}{2}v\Big\| = \frac{1}{2\pi}\|\sin\theta_{u,v}\, u - \theta_{u,v}\, v\| = \frac{1}{2\pi}\big(\sin^2\theta_{u,v} + \theta_{u,v}^2 - 2\theta_{u,v}\sin\theta_{u,v}\cos\theta_{u,v}\big)^{1/2}. \tag{F.3}
\]
Expanding with a Maclaurin expansion, we have
\[
\Big\|M_{u,v}v - \frac{1}{2}v\Big\| = \frac{1}{2\pi}\Big(\frac{5}{3}\theta_{u,v}^4 + O(\theta_{u,v}^6)\Big)^{1/2} \le C_2\,\theta_{u,v}^2.
\]
Lemma F.1.4. If $x \sim N(0, I_d)$, then there exists a constant $C_3$ such that for all $u, v \in \mathbb{S}^{d-1}$,
\[
\frac{1}{2} - v^\top M_{u,v} v \ge C_3\,\theta_{u,v}^3.
\]
Proof. When $x \sim N(0, I_d)$, by Eq. (F.3),
\[
\frac{1}{2} - v^\top M_{u,v} v = -\frac{1}{2\pi}\, v^\top(\sin\theta_{u,v}\, u - \theta_{u,v}\, v) = \frac{1}{2\pi}\big(\theta_{u,v} - \sin\theta_{u,v}\cos\theta_{u,v}\big) = \frac{1}{3\pi}\theta_{u,v}^3 + O(\theta_{u,v}^5) \ge C_3\,\theta_{u,v}^3.
\]
Lemma F.1.5. For any unit vectors $u_1, u_2, w$, we have
\[
\|M_{u_1,w}w - M_{u_2,w}w\| \le \frac{3}{2\pi}\theta_{u_1,u_2}.
\]
Proof. By Lemma F.1.2,
\[
M_{u_1,w}w - M_{u_2,w}w = \Big(\frac{1}{2} - \frac{\theta_{u_1,w}}{2\pi}\Big)w + \frac{\sin\theta_{u_1,w}}{2\pi}u_1 - \Big(\frac{1}{2} - \frac{\theta_{u_2,w}}{2\pi}\Big)w - \frac{\sin\theta_{u_2,w}}{2\pi}u_2 \tag{F.4}
\]
\[
= \frac{\theta_{u_2,w} - \theta_{u_1,w}}{2\pi}\,w + \frac{\sin\theta_{u_1,w}\,u_1 - \sin\theta_{u_2,w}\,u_2}{2\pi}. \tag{F.5}
\]
By the triangle inequality under the $\arccos$ metric, we have $|\theta_{u_2,w} - \theta_{u_1,w}| \le \theta_{u_1,u_2}$. For the second term,
\[
\|\sin\theta_{u_1,w}\,u_1 - \sin\theta_{u_2,w}\,u_2\| \le \|(\sin\theta_{u_1,w} - \sin\theta_{u_2,w})u_1\| + \|\sin\theta_{u_2,w}(u_1 - u_2)\|
\le |\theta_{u_2,w} - \theta_{u_1,w}| + \|u_1 - u_2\| \le \theta_{u_1,u_2} + 2\sin\frac{\theta_{u_1,u_2}}{2} \le 2\theta_{u_1,u_2}.
\]
Plugging back into Eq. (F.5), we have
\[
\|M_{u_1,w}w - M_{u_2,w}w\| \le \frac{3}{2\pi}\theta_{u_1,u_2} \le \frac{1}{2}\theta_{u_1,u_2}.
\]
We next bound the difference between the population and empirical $M$ matrices. The proof of that bound relies on a covariance matrix concentration result for sub-gaussian random vectors in [128]. For completeness we include it below with explicit dependence on the tail probability.
Lemma F.1.6 (Proposition 2.1 [128]). Consider independent random vectors $X_1, \dots, X_n \in \mathbb{R}^d$, $n \ge d$, which have sub-gaussian distributions with sub-gaussian norm upper bounded by $L$. Then for every $\delta > 0$, with probability at least $1-\delta$ one has, for some absolute constant $C > 0$,
\[
\Big\|\mathbb{E}[XX^\top] - \frac{1}{n}\sum_{i=1}^n X_i X_i^\top\Big\| \le C\sqrt{\log\Big(\frac{2}{\delta}\Big)\frac{d}{n}}.
\]
The following lemma shows the concentration of the $M$ matrix.
Lemma F.1.7. Let $x \sim N(0, I_d)$, and let $x_1, \dots, x_n$ be $n$ i.i.d. samples generated from this distribution. Let $u, v$ be two unit vectors in $\mathbb{R}^d$. Define $M(u,v) = \mathbb{E}[\phi'(u^\top x)\phi'(v^\top x)\, xx^\top]$ and $\widehat{M}(u,v) = \frac{1}{n}\sum_{s=1}^n \phi'(u^\top x_s)\phi'(v^\top x_s)\, x_s x_s^\top$. Then with probability at least $1-\delta$, we have
\[
\|M(u,v) - \widehat{M}(u,v)\| \le C\sqrt{\log\Big(\frac{2}{\delta}\Big)\frac{d}{n}}.
\]
Proof. First notice that the $x_s$ are independent sub-gaussian random vectors, and $\phi'(u^\top x_s) \in \{0,1\}$. Then for given $u, v$, the vectors $\phi'(u^\top x_s)\phi'(v^\top x_s)\, x_s$ are independent sub-gaussian random vectors with sub-gaussian norm upper bounded by $L$. Applying Lemma F.1.6, we have, with probability at least $1-\delta$,
\[
\Big\|\mathbb{E}_x[\phi'(u^\top x)^2\phi'(v^\top x)^2\, xx^\top] - \frac{1}{n}\sum_{s=1}^n \phi'(u^\top x_s)^2\phi'(v^\top x_s)^2\, x_s x_s^\top\Big\| \le C\sqrt{\log\Big(\frac{2}{\delta}\Big)\frac{d}{n}}.
\]
Since $\phi'(\cdot)^2 = \phi'(\cdot)$ for $\phi'(\cdot)\in\{0,1\}$, this is exactly the claimed bound.
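The $\sqrt{d/n}$ rate of Lemma F.1.7 can be observed directly. The sketch below (illustrative only; the choice of $u, v, d, n$ is an arbitrary assumption) compares the empirical matrix $\widehat{M}(u,v)$ for $\phi'(z)=\mathbf{1}_{z\ge 0}$ against the closed-form population matrix from Lemma F.1.1.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
u = np.array([1.0, 0.0, 0.0, 0.0])
v = np.ones(d) / 2.0                 # unit vector; angle(u, v) = pi/3
theta = np.arccos(np.clip(u @ v, -1.0, 1.0))

# Population M(u, v) from the closed form of Lemma F.1.1
M_pop = ((0.5 - theta / (2 * np.pi)) * np.eye(d)
         + (np.outer(u, v) + np.outer(v, u)
            - (u @ v) * (np.outer(u, u) + np.outer(v, v))) / (2 * np.pi * np.sin(theta)))

errs = []
for n in (1000, 100000):
    X = rng.standard_normal((n, d))
    act = ((X @ u >= 0) & (X @ v >= 0)).astype(float)   # phi'(z) = 1{z >= 0}
    M_emp = (X * act[:, None]).T @ X / n
    errs.append(np.linalg.norm(M_pop - M_emp, 2))
print(errs)   # the error decays like sqrt(d/n)
```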
F.2 Two Learning Phases
Below we break the convergence path down into two phases and briefly discuss the proof ideas for each phase.
F.2.1 Phase I – Learning the Directions
In Phase I, we show that with a small initialization and a small step size, the magnitude of $w_i$ remains small but the direction of $w_i$ moves towards the direction of $w^*$. In order to achieve an overall complexity of $O(\epsilon^{-1})$, we carefully design the analysis for Phase I. In particular, we divide it into two sub-phases. In Phase I-a, we have a suboptimal convergence rate of $\epsilon^{-2}$. Yet in just a small number of iterations $T_{1a} = O((\log(1/\epsilon))^2)$, GD is able to achieve $1/\log(1/\epsilon)$ angle precision, i.e., $\alpha^{(t)} \lesssim 1/\log(1/\epsilon)$. Then, starting from slightly aligned $w_j$'s, we enter Phase I-b, where we have a faster rate of $O(\epsilon^{-1})$.
We first present a lemma to characterize the rate of convergence in Phase I-a. In order to make the lemma compatible with both the population and empirical scenarios, we assume the gradient update is noisy, with iteration-$t$ noise $r_i^{(t)}$, where in the population case we have $\|r_i^{(t)}\| = 0$. The noisy gradient update is as follows:
\[
w_i^{(t+1)} = w_i^{(t)} - \eta\,\nabla_{w_i^{(t)}} f(W) + \eta\, r_i^{(t)}. \tag{F.6}
\]
For the empirical scenario, which uses a sample set $S$ for the gradient update, $r_i^{(t)} := \nabla_{w_i^{(t)}} f(W) - \nabla_{w_i^{(t)}} f_S(W)$.
Lemma F.2.1 (Phase I-a). Let the initialization $w_i^{(0)}$ be randomly generated from a uniform distribution on a $d$-dimensional sphere of radius $\sigma$. Given any constant $\epsilon_0$, if $\sigma \lesssim \min_i \eta(\pi - \alpha_i^{(0)})^3\epsilon_0^2$, $\eta = O(\frac{\epsilon_0^4}{k})$, $\|r_i^{(0)}\| \lesssim (\pi - \alpha_i^{(0)})^3\cdot\epsilon_0^2$, and $\|r_i^{(t)}\| \lesssim \epsilon_0^2$ for $t \ge 1$, then after $T_{1a} = O(\epsilon_0^{-2})$ iterations of the noisy gradient update (F.6), we have
\[
\alpha^{(T_{1a})} \lesssim \epsilon_0, \qquad \eta(T_{1a}/3 - 1) \le \|w_i^{(T_{1a})}\| \le \eta(T_{1a}/2 + 1), \quad \forall i \in [k]. \tag{F.7}
\]
Our convergence analysis for Phase I-a requires the initialization scale to be smaller than $\eta\cdot\epsilon_0^2\cdot\min_i(\pi - \alpha_i^{(0)})^3$, which implicitly requires $\alpha_i^{(0)} < \pi$ for all $i$. To understand this requirement, consider a special initialization case where there exists $w_i^{(0)}$ such that $\alpha_i^{(0)} = \pi$. Then $\nabla_{w_i^{(0)}} f_S(W) = 0$ for any sample set $S$; therefore, $w_i^{(0)}$ will never move. For finite $k$, if $w_i^{(0)}$ is randomly initialized, $\alpha_i^{(0)} < \pi$ almost surely.
To grasp some intuition for the proof, we introduce a decomposition of the gradient. To start, we define the following matrix for any given pair of unit vectors $u, v \in \mathbb{S}^{d-1}$:
\[
M_{u,v} := \mathbb{E}_{x\sim\mathcal{X}}[\phi'(u^\top x)\phi'(v^\top x)\, xx^\top].
\]
Denote $M_{ij}^{(t)} := M_{w_i^{(t)}, w_j^{(t)}}$ and $R_i^{(t)} := M_{w_i^{(t)}, w^*}$. The gradient update can be formulated as
\[
w_i^{(t+1)} = w_i^{(t)} + \underbrace{\eta R_i^{(t)} w^*}_{(A)} - \underbrace{\eta\sum_j M_{ij}^{(t)} w_j^{(t)}}_{(B)} + \eta r_i^{(t)}. \tag{F.8}
\]
Term (A) in Eq. (F.8) only involves $w_i$ and $w^*$, while term (B) in Eq. (F.8) involves the interaction of $w_i$ with all the other $w_j$. On a high level, term (A) always pushes $w_i$ towards the direction of $w^*$, and term (B) always tries to move the $\{w_i\}_{i\in[k]}$ away from each other. The update procedure is hence a combination of the two forces. This observation is also discussed in [98], where the authors show that the gradient descent update is analogous to proton-electron electrodynamics: term (A) represents the attractive force between the protons and electrons, while term (B) is the repulsive force from the remaining electrons. In the initial steps, since the magnitudes of the $w_i$ are small, term (A) plays the dominating role, which pushes $w_i$ towards $w^*$.
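The two-force dynamics of (F.8) can be simulated exactly in the population case, since $M_{u,v}$ has the closed form of Lemma F.1.1 for Gaussian inputs. The sketch below is an illustrative simulation only (dimension, width, step size, initialization scale, and iteration count are arbitrary assumptions): starting from a tiny random initialization, all directions align with $w^*$ and the magnitudes then settle so that $\sum_j w_j \approx w^*$.

```python
import numpy as np

def M_apply(u, v, target):
    # Apply M_{u,v} (closed form of Lemma F.1.1) to `target`; u, v are unit vectors
    theta = np.arccos(np.clip(u @ v, -1.0, 1.0))
    s = np.sin(theta)
    out = (0.5 - theta / (2 * np.pi)) * target
    if s > 1e-12:   # the correction term vanishes in the theta -> 0 limit
        A = (np.outer(u, v) + np.outer(v, u)
             - (u @ v) * (np.outer(u, u) + np.outer(v, v)))
        out = out + A @ target / (2 * np.pi * s)
    return out

rng = np.random.default_rng(0)
d, k, eta, sigma = 6, 2, 0.05, 1e-3
w_star = np.zeros(d); w_star[0] = 1.0
W = rng.standard_normal((k, d))
W = sigma * W / np.linalg.norm(W, axis=1, keepdims=True)   # tiny random init

for t in range(4000):
    W_new = W.copy()
    for i in range(k):
        wi_bar = W[i] / np.linalg.norm(W[i])
        grad = -M_apply(wi_bar, w_star, w_star)            # term (A): attraction
        for j in range(k):                                 # term (B): repulsion
            wj_bar = W[j] / np.linalg.norm(W[j])
            grad = grad + M_apply(wi_bar, wj_bar, W[j])
        W_new[i] = W[i] - eta * grad
    W = W_new

angles = [np.arccos(np.clip(W[i] @ w_star / np.linalg.norm(W[i]), -1, 1))
          for i in range(k)]
sum_err = np.linalg.norm(W.sum(axis=0) - w_star)
print(max(angles), sum_err)
```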
After Phase I-a pushes all $w_i$ to be slightly aligned with $w^*$, we enter Phase I-b and obtain a faster convergence rate.
Lemma F.2.2 (Phase I-b). Assume $\sum_j \|w_j^{(T_{1a})}\| \lesssim \xi/3$ and $\alpha^{(T_{1a})} \lesssim \xi/3$ at the end of Phase I-a. For any $\epsilon_1 > 0$, if $\eta \lesssim \xi\epsilon_1^{1+3\xi}/k$ and $\|r_i^{(t)}\| \lesssim \xi\tan(\alpha^{(t)})$ for $t \ge T_{1a}$, then after $T_{1b} = O(\epsilon_1^{-(1+3\xi)})$ iterations of the empirical gradient descent update (F.6), letting $T_1 = T_{1a} + T_{1b}$, we have
\[
\alpha_i^{(T_1)} \lesssim \epsilon_1, \qquad \|w_i^{(T_1)}\| = \Theta(\xi/k), \quad \forall i \in [k].
\]
We achieve an $O(\epsilon_1^{-1})$ rate when setting $\xi = O(1/\log(1/\epsilon_1))$.
F.2.2 Phase II – Learning the Magnitudes
After Phase I, all $w_i$'s are approximately aligned with $w^*$, but their magnitudes are still far from optimal; hence the training error can be further reduced. Fortunately, with all $w_i$'s having a small angle with $w^*$, the non-convex problem reduces to a noisy linear regression problem, which has a linear convergence rate under GD. So we mainly need to ensure that the directions of the $w_i$ do not move away from $w^*$ too much during the optimization process. Formally, the convergence guarantee is presented in Lemma F.2.3.
Lemma F.2.3 (Phase II). For any $\epsilon_2 > 0$, assume that at the end of Phase I-b we have $\alpha_i^{(T_1)} \lesssim \min\{\sqrt{\epsilon_2}, 1/\log(1/\epsilon_2)\}$. Now, following Phase I-b, taking an additional $T_2 = O(\log(1/\epsilon_2))$ iterations with step size $\eta = \Theta(1/k)$, if for all $t \in [T_1, T_1+T_2]$ we have $\|r_i^{(t)}\| \lesssim \epsilon_2$, then
\[
f(W^{(T_1+T_2)}) \le \epsilon_2; \quad \text{and} \quad \alpha_i^{(T_1+T_2)} \lesssim \sqrt{\epsilon_2}, \quad \Big\|\sum_j w_j^{(T_1+T_2)} - w^*\Big\| \lesssim \epsilon_2, \quad \forall i \in [k].
\]
F.3 Proof of Main Theorems
We provide proofs for the theorems in the main result section. The proof idea is mainly to combine the convergence bounds from the previous lemmas for Phases I-a, I-b and II, so that each later phase inherits sufficiently good parameters from the previous ones.
F.3.1 Population Case
Proof of Theorem 8.1.1. The proof follows by combining the guarantees in each phase, i.e., Lemma F.2.1 and Lemma F.2.2 for Phase I and Lemma F.2.3 for Phase II, where we set the noise $r_i^{(t)} = 0$ for the population case. Now we set the parameters in the lemmata in reverse order. According to Lemma F.2.3, if after Phase I-b we have $\alpha_i^{(T_1)} \lesssim \sqrt{\epsilon}$, then after $T_2 \ge \log(1/\epsilon)$ iterations with $\eta_2 = O(1/k)$, the generalization error can be bounded by $\epsilon$. By Phase I-b, if we set $\xi_1 = \xi_2 = O(1/\log(1/\epsilon))$ and $\eta_{1b} = O(\epsilon^{1/2}/\log(1/\epsilon)/k)$, then after $T_{1b} = O(\epsilon^{-1/2})$ iterations the requirement for the initialization of Phase II will be satisfied. In Phase I-a, since we have set $\xi_1 = \xi_2 = O(1/\log(1/\epsilon))$, we need $\epsilon_0 = O(1/\log(1/\epsilon))$. Therefore, we need $\eta_{1a} = O((\log(1/\epsilon))^{-4}/k)$ and $T_{1a} = O(\log^2(1/\epsilon))$. For simplicity, we combine Phase I-a and Phase I-b and set $\eta_1 = \min\{\eta_{1a}, \eta_{1b}\}$. The total number of iterations is $O(\epsilon^{-1/2})$.
F.3.2 Finite-Sample Case
Proof of Theorem 8.1.2. We re-state the conditions on $\|r_i^{(t)}\|$ for Phases I-a, I-b, II from Lemmas F.2.1, F.2.2 and F.2.3, respectively:
• $\|r_i^{(t)}\| \lesssim \min_i(\pi - \alpha_i^{(0)})^3\cdot\epsilon_0^2$
• $\|r_i^{(t)}\| \lesssim \epsilon_1/\log(1/\epsilon_1)$
• $\|r_i^{(t)}\| \lesssim \epsilon_2$
According to the proof of Theorem 8.1.1, to achieve a final precision of $\epsilon$ for the objective value, we need to set $\epsilon_2 = \epsilon$, $\epsilon_1 = \sqrt{\epsilon}$ and $\epsilon_0 = 1/\log(1/\epsilon)$. Therefore, if
\[
\|r_i^{(t)}\| \lesssim \min\Big\{\epsilon,\ \min_i(\pi - \alpha_i^{(0)})^3\cdot\log^{-2}(\epsilon^{-1})\Big\}
\]
for all $t \le T_1 + T_2$, we have $\mathcal{E}(f) \lesssim \epsilon$. Assuming $(\pi - \alpha_i^{(0)})^3 \ge \epsilon$, we just need $\|r_i^{(t)}\| \le \epsilon$.
Note that
\[
r_i^{(t)} := -\Big[\sum_j \big(\widehat{M}_{ij}^{(t)} - M_{ij}^{(t)}\big) w_j^{(t)} - \big(\widehat{R}_i^{(t)} - R_i^{(t)}\big) w^*\Big].
\]
$\|r_i^{(t)}\|$ can be bounded by the sampling errors of the matrices $R_i^{(t)}$ and $M_{ij}^{(t)}$:
\[
\|r_i^{(t)}\| \le \|\widehat{R}_i^{(t)} - R_i^{(t)}\| + \max_j \|\widehat{M}_{ij}^{(t)} - M_{ij}^{(t)}\|.
\]
By Lemma F.1.7, at iteration $t$, if the sample size satisfies
\[
|S^{(t)}| \gtrsim \epsilon^{-2} d \log k \log(1/\delta^{(t)}),
\]
then $\|r_i^{(t)}\| \le \epsilon$ w.p. $1 - \delta^{(t)}$. After $T = O(\epsilon^{-1/2})$ iterations, we have: if the sample size in each iteration satisfies
\[
|S^{(t)}| \gtrsim \epsilon^{-2}\log(1/\epsilon)\cdot d \log k \log(1/\delta),
\]
then w.p. $1-\delta$,
\[
\mathbb{E}\big[f(W^{(T)})\big] \le \epsilon.
\]
F.4 Proofs of Phase I
In Phase I-a, we start from random initialization and proceed with gradient descent for some initial steps. The main proof idea for this phase is to show that at the very beginning of the optimization process, GD pushes all the $w_i$'s towards $w^*$ while the interactions among the $w_i$'s are still small. However, the convergence rate is small at the beginning. Therefore, we later introduce Phase I-b, which has a faster convergence rate.
Lemma F.4.1 shows that if the update consists only of term (A) in Eq. (F.8), the noiseless interaction term between $w_i$ and $w^*$, then the angle between $w_i$ and $w^*$ converges with rate $\epsilon^{-2}$.
Later, in Lemma F.4.2, we add noise to term (A) and find the conditions on the noise under which the angle still converges.
Lemma F.4.1. Let $\theta_{u^{(0)},v} \le \pi - \epsilon$ for some $\epsilon > 0$. Let $M_{u^{(0)},v}$ be defined from $u^{(0)}$, and let $u^{(1)} = M_{u^{(0)},v}\, v$. Assume the following procedure:
\[
u^{(t+1)} = u^{(t)} + M_{u^{(t)},v}\, v, \quad \text{for } t \ge 1. \tag{F.9}
\]
Then given any $\epsilon_0 > 0$, after $T \gtrsim 1/\epsilon_0^2$ iterations, $\theta_{u^{(T)},v} \le \epsilon_0$ and $T/3 \le \|u^{(T)}\| \le (1+\epsilon_0)T/2$.
Proof. Denote $\alpha^{(t)} := \theta_{u^{(t)},v}$ and let $\bar{u}^{(t)} = u^{(t)}/\|u^{(t)}\|$. Then
\[
u^{(1)} = \Big(\frac{1}{2} - \frac{\alpha^{(0)}}{2\pi}\Big)v + \frac{\sin\alpha^{(0)}}{2\pi}\,\bar{u}^{(0)}.
\]
We calculate the norm of $u^{(1)}$ as
\[
\|u^{(1)}\|^2 = \Big(\frac{1}{2} - \frac{\alpha^{(0)}}{2\pi}\Big)^2 + \Big(\frac{\sin\alpha^{(0)}}{2\pi}\Big)^2 + 2\cos\alpha^{(0)}\Big(\frac{1}{2} - \frac{\alpha^{(0)}}{2\pi}\Big)\Big(\frac{\sin\alpha^{(0)}}{2\pi}\Big) \le 1.
\]
For iteration 2, we re-write the update step as
\[
u^{(2)} = u^{(1)} + \Big(\frac{1}{2} - \frac{\alpha^{(1)}}{2\pi}\Big)v + \frac{\sin\alpha^{(1)}}{2\pi}\,\bar{u}^{(1)},
\]
so that
\[
\|u^{(2)}\| = \bigg(\Big(\|u^{(1)}\| + \frac{\sin\alpha^{(1)}}{2\pi}\Big)^2 + \Big(\frac{1}{2} - \frac{\alpha^{(1)}}{2\pi}\Big)^2 + 2\cos\alpha^{(1)}\Big(\frac{1}{2} - \frac{\alpha^{(1)}}{2\pi}\Big)\Big(\|u^{(1)}\| + \frac{\sin\alpha^{(1)}}{2\pi}\Big)\bigg)^{1/2}
\]
and
\[
\cos\alpha^{(2)} = \frac{\big(\|u^{(1)}\| + \frac{\sin\alpha^{(1)}}{2\pi}\big)\cos\alpha^{(1)} + \big(\frac{1}{2} - \frac{\alpha^{(1)}}{2\pi}\big)}{\|u^{(2)}\|}.
\]
Notice that $\cos\alpha^{(2)}$ and $\|u^{(2)}\|$ are dominated by a univariate function of $\alpha^{(0)}$; it can hence be computed that, when $\alpha^{(0)} \le \pi - \epsilon$ for any $\epsilon \in (0, \pi]$,
\[
\alpha^{(2)} < \pi/5, \qquad 0.28 < \|u^{(2)}\| \le 1. \tag{F.10}
\]
Next, we analyze the iterations for $t > 2$. Let $\gamma^{(t)} = v^\top u^{(t)}$. Then
\[
\gamma^{(t+1)} = v^\top u^{(t+1)} = v^\top\big(u^{(t)} + M_{u^{(t)},v}\, v\big) = \gamma^{(t)} + \Big(\frac{1}{2} - \frac{\alpha^{(t)}}{2\pi}\Big) + \frac{1}{2\pi}\sin\alpha^{(t)}\cos\alpha^{(t)}.
\]
If $\alpha^{(t)} \le \pi/5$, we have $1/3 \le \gamma^{(t+1)} - \gamma^{(t)} \le 1/2$. Hence, $t/3 + c_0 \le \gamma^{(t)} \le t/2 + c_0$ for $t \ge 2$, where $c_0$ is a constant. Moreover,
\[
\tan\alpha^{(t+1)} = \frac{\|(I - vv^\top)u^{(t+1)}\|}{\gamma^{(t+1)}} = \frac{\gamma^{(t)} + \frac{1}{2\pi}\sin\alpha^{(t)}\cos\alpha^{(t)}}{\gamma^{(t)} + \frac{1}{2} + \frac{1}{2\pi}\big(\sin\alpha^{(t)}\cos\alpha^{(t)} - \alpha^{(t)}\big)}\,\tan\alpha^{(t)}.
\]
When $\alpha^{(t)} \le 2\pi/5$, we have $\frac{1}{2\pi}\sin\alpha^{(t)}\cos\alpha^{(t)} \le 1/20$. Hence,
\[
\tan\alpha^{(t+1)} \le \frac{\gamma^{(t)} + 1/20}{\gamma^{(t)} + 1/3}\tan\alpha^{(t)} \le \frac{t/2 + c_0 + 1/20}{t/2 + c_0 + 1/3}\tan\alpha^{(t)}.
\]
Using the fact that
\[
\lim_{n\to\infty}\frac{\Gamma(n+z)}{\Gamma(n)\, n^z} = 1,
\]
where $\Gamma$ is the gamma function, we have
\[
\tan\alpha^{(T)} \lesssim T^{-1/2}\tan\alpha^{(2)} \lesssim \epsilon_0,
\]
where the last inequality follows from the condition $T \gtrsim 1/\epsilon_0^2$.
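The dynamics (F.9) live in the two-dimensional span of $v$ and the orthogonal component of $u$, so they can be simulated exactly with two scalars. The sketch below (an illustrative simulation; the starting angle and iteration counts are arbitrary assumptions) checks the two conclusions of Lemma F.4.1: the angle shrinks roughly like $T^{-1/2}$ and the norm grows linearly between $T/3$ and $T/2$.

```python
import numpy as np

def run(alpha0, T):
    # Track u in span{v, v_perp}: gamma = v^T u, beta = orthogonal component
    gamma, beta = np.cos(alpha0), np.sin(alpha0)    # unit-norm starting u
    for _ in range(T):
        alpha = np.arctan2(beta, gamma)
        norm = np.hypot(gamma, beta)
        # u <- u + M_{u,v} v = u + (1/2 - alpha/(2 pi)) v + (sin(alpha)/(2 pi)) u/||u||
        gamma += 0.5 - alpha / (2 * np.pi) + np.sin(alpha) / (2 * np.pi) * gamma / norm
        beta += np.sin(alpha) / (2 * np.pi) * beta / norm
    return np.arctan2(beta, gamma), np.hypot(gamma, beta)

a_100, _ = run(alpha0=2.0, T=100)
a_10k, n_10k = run(alpha0=2.0, T=10000)
print(a_100, a_10k, n_10k / 10000)
```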
Lemma F.4.2. Given a starting vector $w^{(0)}$, consider the following two procedures:
\[
u^{(t+1)} = u^{(t)} + M_{u^{(t)},v}\, v, \quad \text{for any } t \ge 1,
\]
where $u^{(1)} = M_{w^{(0)},v}\, v$, and
\[
w^{(t+1)} = w^{(t)} + M_{w^{(t)},v}\, v + \delta^{(t)}, \quad \text{for any } t \ge 0.
\]
If $\|\delta^{(0)}\| + \|w^{(0)}\| \lesssim (\pi - \theta_{w^{(0)},v})^3\cdot\epsilon_0^2$ and $\|\delta^{(t)}\| \lesssim \epsilon_0^2, \forall t \ge 1$, then
\[
\|\Delta^{(t)}\| := \|w^{(t)} - u^{(t)}\| \lesssim t\epsilon_0^2, \quad \forall t \ge 1.
\]
Further, after $T = 1/\epsilon_0^2$ iterations,
\[
\theta_{w^{(T)},u^{(T)}} \le \epsilon_0^2, \qquad \|w^{(T)} - u^{(T)}\| \lesssim 1.
\]
Proof. Let $\Delta^{(t)} := w^{(t)} - u^{(t)}$. For $t \ge 1$,
\[
w^{(t+1)} = w^{(t)} + M_{w^{(t)},v}\, v + \delta^{(t)} = u^{(t)} + \Delta^{(t)} + M_{u^{(t)},v}\, v + \big(M_{w^{(t)},v} - M_{u^{(t)},v}\big)v + \delta^{(t)}.
\]
Therefore,
\[
\|\Delta^{(t+1)}\| \le \|\Delta^{(t)}\| + \|(M_{w^{(t)},v} - M_{u^{(t)},v})v\| + \|\delta^{(t)}\| \le \|\Delta^{(t)}\| + C_4\,\theta_{w^{(t)},u^{(t)}} + \|\delta^{(t)}\|,
\]
where the second inequality is due to Lemma F.1.5. Note that for $\theta_{w^{(t)},u^{(t)}} \le \pi/2$,
\[
\theta_{w^{(t)},u^{(t)}} \le 2\sin(\theta_{w^{(t)},u^{(t)}}) \le \frac{2\|w^{(t)} - u^{(t)}\|}{\|u^{(t)}\|}.
\]
Also,
\[
\|\Delta^{(1)}\| = \|w^{(1)} - u^{(1)}\| \le \|\delta^{(0)}\| + \|w^{(0)}\| \le 2\epsilon_0^3.
\]
Hence we have
\[
\theta_{w^{(1)},u^{(1)}} \le \frac{2\|w^{(1)} - u^{(1)}\|}{\|u^{(1)}\|} \le \frac{2\|\delta^{(0)}\| + 2\|w^{(0)}\|}{\|u^{(1)}\|}.
\]
Note that $\|u^{(1)}\| = \|M_{w^{(0)},v}\, v\| \ge C_3(\pi - \theta_{w^{(0)},v})^3$ for some constant $C_3$, according to Lemma F.1.4. Therefore, if
\[
\|\delta^{(0)}\| + \|w^{(0)}\| \le \frac{C_3}{2}(\pi - \theta_{w^{(0)},v})^3\cdot C_4^{-1}\cdot\epsilon_0^2,
\]
we have $\|\Delta^{(2)}\| \le 4\epsilon_0^2$.
For any $t \ge 2$, $\|u^{(t)}\| \ge t/3 + c_0$, and for some constant $t_1$, if $t \ge t_1$, $\|u^{(t)}\| \ge 2t/5 + c_1$. So if $\|\delta^{(t)}\| \lesssim \epsilon_0^2$ and $\|\Delta^{(t)}\| \le 2t\epsilon_0^2$, then
\[
\|\Delta^{(t+1)}\| \le \|\Delta^{(t)}\| + 2C_4\|\Delta^{(t)}\|/\|u^{(t)}\| + \|\delta^{(t)}\| \le (2t+2)\epsilon_0^2,
\]
which uses $C_4 = \frac{3}{2\pi}$. So by induction, after $T = 1/\epsilon_0^2$ iterations,
\[
\theta_{w^{(T)},u^{(T)}} \lesssim \epsilon_0^2, \qquad \|w^{(T)} - u^{(T)}\| = \|\Delta^{(T)}\| \lesssim T\epsilon_0^2 \lesssim 1.
\]
F.4.1 Proof of Phase I-a
Proof of Lemma F.2.1. Let’s first check the first iteration.
w(1)i = w
(0)i − η
[∑j
M(0)ij w
(0)j −R
(0)i w∗
]+ ηr
(0)i
Define
δ(0)i := w
(0)i − η
[∑j
M(0)ij w
(0)j
]+ ηr
(0)i (F.11)
So if we have σ . mini
η(π−α
(0)i
)3ε30
ηk+1and ‖r(0)
i ‖ . (π − α(0)i )3 · ε20, we have
‖δ(0)i ‖+ ‖w(0)i ‖ . η(π − α
(0)i )3 · ε20.
Now consider the case for t ≥ 1. Let u(1)i = ηM
w(0)i ,w∗ and for t ≥ 1,
u(t+1)i = u
(t)i + ηM
u(t)i ,w∗w
∗, ∆(t)i := w
(t)i − u
(t)i
403
w(t+1)i = w
(t)i − η
[∑j
M(t)ij w
(t)j −R
(t)i w∗
]+ ηr
(t)i
Define
δ(t)i = −η
∑j
M(t)ij w
(t)j + ηr
(t)i .
If we have ηkT ≤ ε20 and ‖r(t)i ‖ . ε20, for t ≤ T , we can guarantee that
‖δ(t)i ‖ ≤ η∑j
∥∥∥w(t)j
∥∥∥+ η∥∥∥r(t)
i
∥∥∥ . η2kt+ ηε20 . ηε20,
holds for all t ≤ T .
Now applying Lemma F.4.2, after $T = 1/\epsilon_0^2$ iterations,
$$\theta_{w_i^{(T)},u_i^{(T)}} \leq \epsilon_0^2, \qquad \big\|w_i^{(T)} - u_i^{(T)}\big\| \leq \eta.$$
Combining Lemma F.4.1 with the fact that $\theta_{w_i^{(T)},w^*} \leq \theta_{w_i^{(T)},u_i^{(T)}} + \theta_{u_i^{(T)},w^*}$, we conclude the proof.
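The last step uses the triangle inequality for angles, which holds because angular distance is a metric on the unit sphere. A small illustrative check on random Gaussian triples (not part of the original argument):

```python
import math, random

# Check theta(w, w*) <= theta(w, u) + theta(u, w*) on random vector triples.
def angle(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    # Clamp to [-1, 1] to guard against floating-point round-off in acos.
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

random.seed(0)
max_violation = 0.0
for _ in range(1000):
    w, u, ws = ([random.gauss(0, 1) for _ in range(5)] for _ in range(3))
    max_violation = max(max_violation, angle(w, ws) - angle(w, u) - angle(u, ws))
print("max violation over 1000 random triples:", max_violation)
```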
F.4.2 Proofs of Phase I-b
Proof of Lemma F.2.2. Define $w_i^{(t)} = \gamma_i^{(t)} w^* + b_i^{(t)}$, where $b_i^{(t)\top} w^* = 0$. By the gradient update, we derive $\gamma_i^{(t+1)}$ and $b_i^{(t+1)}$ in terms of $\{w_i^{(t)}\}_{i=1,2,\cdots,k}$:
$$\begin{aligned}
\gamma_i^{(t+1)} &= w^{*\top} w_i^{(t+1)} \\
&= w^{*\top}\Big(w_i^{(t)} - \eta\Big[\sum_j \big\|w_j^{(t)}\big\| M_{ij}^{(t)} w_j^{(t)} - R_i^{(t)} w^*\Big] + \eta r_i^{(t)}\Big) \\
&= w^{*\top}\Big(w_i^{(t)} + \frac{\eta}{2} w^* - \eta\Big[\sum_j \big\|w_j^{(t)}\big\| M_{ij}^{(t)} w_j^{(t)} - \big(R_i^{(t)} - I/2\big) w^*\Big] + \eta r_i^{(t)}\Big) \\
&= \gamma_i^{(t)} + \frac{\eta}{2} - \eta\, w^{*\top}\Big[\sum_j \big\|w_j^{(t)}\big\| M_{ij}^{(t)} w_j^{(t)} - \big(R_i^{(t)} - I/2\big) w^*\Big] + \eta\, w^{*\top} r_i^{(t)}.
\end{aligned}$$
Therefore,
$$\big|\gamma_i^{(t+1)} - \gamma_i^{(t)} - \eta/2\big| \leq \eta\Big(\sum_j \big\|w_j^{(t)}\big\|/2 + C_2\big(\alpha_i^{(t)}\big)^2 + \big\|r_i^{(t)}\big\|\Big).$$
If $\sum_j \|w_j^{(t)}\|/2 \leq \xi/3$, $C_2(\alpha_i^{(t)})^2 \leq \xi/3$, and $\|r_i^{(t)}\| \leq \xi/3$, we can upper and lower bound $\gamma_i^{(t+1)}$:
$$\Big(\frac{1}{2} - \xi\Big)\eta \leq \gamma_i^{(t+1)} - \gamma_i^{(t)} \leq \Big(\frac{1}{2} + \xi\Big)\eta. \tag{F.12}$$
For the part orthogonal to $w^*$,
$$\begin{aligned}
\big\|b_i^{(t+1)}\big\| &= \big\|(I - w^* w^{*\top})\,w_i^{(t+1)}\big\| \\
&= \Big\|(I - w^* w^{*\top})\Big(w_i^{(t)} - \eta\Big[\sum_j \big\|w_j^{(t)}\big\| M_{ij}^{(t)} w_j^{(t)} - R_i^{(t)} w^*\Big] + \eta r_i^{(t)}\Big)\Big\| \\
&= \Big\|(I - w^* w^{*\top})\Big(w_i^{(t)} - \frac{\eta}{2}\Big(\sum_j w_j^{(t)} - w^*\Big) - \eta\Big[\sum_j \big\|w_j^{(t)}\big\|\big(M_{ij}^{(t)} - I/2\big) w_j^{(t)} - \big(R_i^{(t)} - I/2\big) w^*\Big] + \eta r_i^{(t)}\Big)\Big\| \\
&= \Big\|b_i^{(t)} - \frac{\eta}{2}\sum_j b_j^{(t)} - (I - w^* w^{*\top})\,\eta\Big(\sum_j \big\|w_j^{(t)}\big\|\big(M_{ij}^{(t)} - I/2\big) w_j^{(t)} - \big(R_i^{(t)} - I/2\big) w^* + r_i^{(t)}\Big)\Big\| \\
&\leq \big\|b_i^{(t)}\big\| + \frac{\eta}{2}\sum_j \big\|b_j^{(t)}\big\| + \eta\Big(\sum_j 2\big\|w_j^{(t)}\big\| + 1\Big) C_2\big(\alpha^{(t)}\big)^2 + \eta\big\|r_i^{(t)}\big\| \\
&\leq \big\|b_i^{(t)}\big\| + \frac{\eta}{2}\sum_j \gamma_j^{(t)}\tan\big(\alpha^{(t)}\big) + \eta\Big(\sum_j 2\big\|w_j^{(t)}\big\| + 1\Big) C_2\big(\alpha^{(t)}\big)^2 + \eta\big\|r_i^{(t)}\big\|,
\end{aligned}$$
where we set $\alpha_i^{(t)} \leq \alpha^{(t)}$.
If we have $\sum_j \gamma_j^{(t)} \leq \xi/3$, $C_2\alpha^{(t)} \leq \xi/3$, and $\|r_i^{(t)}\| \leq (\xi/3)\tan(\alpha^{(t)})$, we can obtain
$$\tan\alpha_i^{(t+1)} = \frac{\big\|b_i^{(t+1)}\big\|}{\gamma_i^{(t+1)}} \leq \frac{\gamma_i^{(t)} + \xi\eta}{\gamma_i^{(t)} + \big(\frac{1}{2} - \xi\big)\eta}\tan\alpha^{(t)}.$$
According to Lemma F.2.1 (Phase I-a) and Eq. (F.12), $\gamma_i^{(T_{1a})} + (t - T_{1a})\eta/3 \leq \gamma_i^{(t)} \leq \gamma_i^{(T_{1a})} + (t - T_{1a})\eta/2$. Therefore,
$$\frac{\gamma_i^{(t)} + (\xi_1 + \xi_2)\eta}{\gamma_i^{(t)} + \big(\frac{1}{2} - (\xi_1 + \xi_2)\big)\eta} \leq \frac{t/2 + c_0 + (\xi_1 + \xi_2)}{t/2 + c_0 + 1/2 - (\xi_1 + \xi_2)}.$$
By the fact that
$$\lim_{n\to\infty}\frac{\Gamma(n+z)}{\Gamma(n)\,n^z} = 1,$$
where $\Gamma$ is the gamma function, we have
$$\tan\alpha_i^{(T_1)} \lesssim T_{1b}^{-(1+3\xi)}\tan\alpha_i^{(T_{1a})}.$$
After $T_{1b} = O\big(\epsilon_1^{-(1+3\xi)}\big)$ iterations, $\alpha_i^{(T_1)} \lesssim \epsilon_1$. Note that we require $\sum_j \gamma_j^{(t)} \leq \xi/3$, so we can set $T_{1b}\,\eta k/2 \leq \xi/3$, i.e., $\eta \lesssim \xi_1\epsilon_1^{1+3\xi_1}/k$.
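The gamma-function limit invoked above can be verified numerically. The sketch below is an illustrative addition: it evaluates $\Gamma(n+z)/(\Gamma(n)\,n^z)$ in log-space (to avoid overflow of $\Gamma$ at large $n$) and shows the ratio approaching $1$:

```python
import math

# Numerical check of the limit Gamma(n + z) / (Gamma(n) * n^z) -> 1 as n -> oo,
# which drives the telescoping product bound on tan(alpha) above.
def gamma_ratio(n, z):
    # lgamma works in log-space, so large n does not overflow.
    return math.exp(math.lgamma(n + z) - math.lgamma(n) - z * math.log(n))

ratios = [gamma_ratio(n, 0.7) for n in (10, 100, 10000)]
print(ratios)  # approaches 1.0 as n grows
```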
F.5 Proofs for Phase II
In this phase, the angles are already small, and the optimization process can be approximated by linear regression. The following proof of Lemma F.2.3 mainly ensures that the error does not grow too much during this phase.
F.5.1 Proof for Lemma F.2.3
Proof. Recall the empirical gradient descent update:
$$\begin{aligned}
w_i^{(t+1)} &= w_i^{(t)} - \eta\Big[\sum_j \big\|w_j^{(t)}\big\| M_{ij}^{(t)} w_j^{(t)} - R_i^{(t)} w^*\Big] + \eta r_i^{(t)} \\
&= w_i^{(t)} - \frac{\eta}{2}\Big(\sum_j w_j^{(t)} - w^*\Big) - \eta\Big[\sum_j \big\|w_j^{(t)}\big\|\big(M_{ij}^{(t)} - I/2\big) w_j^{(t)} - \big(R_i^{(t)} - I/2\big) w^*\Big] + \eta r_i^{(t)}.
\end{aligned}$$
Hence,
$$\Big\|w_i^{(t+1)} - w_i^{(t)} + \frac{\eta}{2}\Big(\sum_j w_j^{(t)} - w^*\Big)\Big\| \leq \eta\Big(4\sum_j \big\|w_j^{(t)}\big\| + 1\Big)\big(\alpha^{(t)}\big)^2 + \eta\big\|r_i^{(t)}\big\|, \tag{F.13}$$
where $\alpha_i^{(t)} \leq \alpha_0$ for any $i \in [k]$. If for every $t$ under consideration we have $\|r_i^{(t)}\| \lesssim \alpha_0^2$, then
$$w_i^{(t+1)} = w_i^{(t)} - \frac{\eta}{2}\Big(\sum_j w_j^{(t)} - w^*\Big) + O\big(\eta\alpha_0^2\big)\,u_i^{(t)},$$
for some unit vector $u_i^{(t)}$.
for some unit vector u(t)i . Summing over all i ∈ [k] can arrive at∥∥∥∥∥
(∑j
w(t+1)j −w∗
)− (1− ηk/2)
(∑j
w(t)j −w∗
)∥∥∥∥∥ . ηkα20
After T iterations following Phase I-b,∥∥∥∥∥(∑j
w(T+T1)j −w∗)− (1− ηk/2)T
(∑j
w(T1)j −w∗
)∥∥∥∥∥ .T∑t=1
(1−ηk/2)T−tηkα20.
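The geometric sum on the right-hand side is bounded by $2\alpha_0^2$ uniformly in $T$, since $\sum_{t=1}^{T}(1-\eta k/2)^{T-t}\,\eta k\alpha_0^2 \leq \eta k\alpha_0^2/(\eta k/2) = 2\alpha_0^2$. A quick numerical check (with assumed toy values of $\eta k$, $\alpha_0$, and $T$):

```python
# The accumulated noise sum_{t=1}^T (1 - eta*k/2)^(T-t) * eta*k * alpha0^2
# stays below 2 * alpha0^2 for any horizon T, by the geometric series bound.
eta_k = 0.5      # eta * k (assumed toy value)
alpha0 = 0.01    # assumed angle bound in Phase II
T = 200
rho = eta_k / 2  # per-step contraction rate 1 - rho
s = sum((1 - rho) ** (T - t) * eta_k * alpha0**2 for t in range(1, T + 1))
print(s, "<=", 2 * alpha0**2)
```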
Now we prove by induction. Assume $\alpha^{(T_1+t-1)} \leq \alpha_0$ for all $t = 1, 2, \cdots, T$ and some $\alpha_0 > 0$. Then
$$\Big\|\Big(\sum_j w_j^{(T+T_1)} - w^*\Big) - \big(1 - \eta k/2\big)^T\Big(\sum_j w_j^{(T_1)} - w^*\Big)\Big\| \lesssim \alpha_0^2. \tag{F.14}$$
This implies
$$\Big\|\sum_{t=1}^T\Big(\sum_j w_j^{(t+T_1-1)} - w^*\Big) - \sum_{t=1}^T\big(1 - \eta k/2\big)^t\Big(\sum_j w_j^{(T_1)} - w^*\Big)\Big\| \lesssim T\alpha_0^2.$$
Also note that
$$\Big\|w_i^{(T+T_1)} - w_i^{(T_1)} + \frac{\eta}{2}\sum_{t=1}^T\Big(\sum_j w_j^{(t+T_1-1)} - w^*\Big)\Big\| \lesssim \eta\sum_{t=1}^T\big(\alpha^{(t+T_1-1)}\big)^2.$$
Therefore,
$$\Big\|w_i^{(T+T_1)} - w_i^{(T_1)} + \frac{1}{k}\Big(1 - \big(1 - \eta k/2\big)^T\Big)\Big(\sum_j w_j^{(T_1)} - w^*\Big)\Big\| \lesssim \eta T\alpha_0^2.$$
The final step of the induction is the following:
$$\sin\alpha_i^{(T_1+T)} = \sin\angle\big(w_i^{(T+T_1)}, w^*\big) = \frac{\big\|w_i^{(T+T_1)} - w^*\big\|\sin\angle\big(w_i^{(T+T_1)} - w^*, w^*\big)}{\big\|w_i^{(T_1+T)}\big\|} = \frac{\big\|(I - w^* w^{*\top})\big(w_i^{(T+T_1)} - w^*\big)\big\|}{\big\|w_i^{(T_1+T)}\big\|}.$$
For the numerator,
$$\begin{aligned}
\big\|(I - w^* w^{*\top})\big(w_i^{(T+T_1)} - w^*\big)\big\|
&= \big\|(I - w^* w^{*\top})\big(w_i^{(T+T_1)} - w_i^{(T_1)} + w_i^{(T_1)}\big)\big\| \\
&\leq \big\|(I - w^* w^{*\top})\big(w_i^{(T+T_1)} - w_i^{(T_1)}\big)\big\| + \big\|(I - w^* w^{*\top})\,w_i^{(T_1)}\big\| \\
&\leq \frac{1}{k}\Big(1 - \big(1 - \eta k/2\big)^T\Big)\Big\|(I - w^* w^{*\top})\Big(\sum_j w_j^{(T_1)} - w^*\Big)\Big\| + c\eta T\alpha_0^2 + \big\|w_i^{(T_1)}\big\|\,\alpha_i^{(T_1)} \\
&\leq \frac{1}{k}\Big(1 - \big(1 - \eta k/2\big)^T\Big)\Big\|(I - w^* w^{*\top})\sum_j w_j^{(T_1)}\Big\| + c\eta T\alpha_0^2 + \big\|w_i^{(T_1)}\big\|\,\alpha_i^{(T_1)} \\
&\leq \frac{1}{k}\Big(1 - \big(1 - \eta k/2\big)^T\Big)\Big\|\sum_j w_j^{(T_1)}\Big\|\,\angle\Big(\sum_j w_j^{(T_1)}, w^*\Big) + c\eta T\alpha_0^2 + \big\|w_i^{(T_1)}\big\|\,\alpha_i^{(T_1)} \\
&\leq \frac{1}{16\pi k}\alpha_0 + c\eta T\alpha_0^2,
\end{aligned}$$
where $c$ is an absolute constant.
For the denominator,
$$\begin{aligned}
\big\|w_i^{(T_1+T)}\big\| &\geq \Big\|\frac{1}{k}\Big(1 - \big(1 - \eta k/2\big)^T\Big)\Big(\sum_j w_j^{(T_1)} - w^*\Big)\Big\| - \big\|w_i^{(T_1)}\big\| - \eta T\alpha_0^2 \\
&\geq \frac{1}{2k}\Big(1 - \big(1 - \eta k/2\big)^T\Big) - c\eta T\alpha_0^2.
\end{aligned}$$
Now by setting $c\eta T\alpha_0 \leq \frac{1}{32\pi k}$ and $\eta k = 1$, we obtain $\alpha_i^{(T_1+T)} \leq \alpha_0$ for any $T$. Finally, using Eq. (F.14), we have
$$\Big\|\sum_j w_j^{(T+T_1)} - w^*\Big\| \lesssim 2^{-T} + \alpha_0^2 \lesssim \epsilon_2.$$
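The contraction behind the $2^{-T}$ term can be seen in a toy simulation of the noiseless dynamics $w_i \leftarrow w_i - \frac{\eta}{2}\big(\sum_j w_j - w^*\big)$ with $\eta k = 1$; the dimensions, step size, and initialization below are assumed purely for illustration:

```python
# Noiseless Phase II dynamics: the residual sum_j w_j - w* contracts by the
# factor (1 - eta*k/2) = 1/2 per step when eta*k = 1, matching the 2^-T term.
k, dim = 3, 4
eta = 1.0 / k                      # chosen so that eta * k = 1, as in the proof
w_star = [1.0, 0.0, 0.0, 0.0]
ws = [[0.3 * (i + 1), 0.1, -0.2, 0.05] for i in range(k)]  # assumed start

def residual(ws):
    return [sum(w[a] for w in ws) - w_star[a] for a in range(dim)]

def norm(v):
    return sum(x * x for x in v) ** 0.5

r0 = norm(residual(ws))
for _ in range(20):
    r = residual(ws)
    ws = [[w[a] - (eta / 2) * r[a] for a in range(dim)] for w in ws]
final_ratio = norm(residual(ws)) / r0
print(final_ratio)  # approximately (1 - eta*k/2)^20 = 2^-20
```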
For the last part,
$$\begin{aligned}
\mathbb{E}[f(W)] &= \sum_{ij} w_i^\top M_{w_i,w_j} w_j - 2\sum_i w_i^\top M_{w_i,w^*} w^* + w^{*\top} M_{w^*,w^*} w^* \\
&= \sum_{ij} w_i^\top \big(M_{w_i,w_j} - I/2\big) w_j - 2\sum_i w_i^\top \big(M_{w_i,w^*} - I/2\big) w^* + w^{*\top} M_{w^*,w^*} w^* + \frac{1}{2}\Big\|\sum_j w_j - w^*\Big\|^2 \\
&\lesssim \sum_{ij} \|w_i\| \cdot \|w_j\|\,\theta_{w_i,w_j}^2 + 2\sum_i \|w_i\| \cdot \|w^*\|\,\alpha_i^2 + \Big\|\sum_j w_j - w^*\Big\|^2 \\
&\lesssim \epsilon_2.
\end{aligned}$$
Vita
Kai Zhong was born in Hangzhou, China. He graduated from Zhenhai Middle School in Ningbo, China, in 2008. He obtained his B.S. degree in physics from Peking University in July 2012. Soon after that, he started his PhD study at the Institute for Computational Engineering and Sciences at The University of Texas at Austin. Currently he is a member of the Center for Big Data Analytics at The University of Texas at Austin.
Permanent address: [email protected]
This dissertation was typeset with LaTeX† by the author.
†LaTeX is a document preparation system developed by Leslie Lamport as a special version of Donald Knuth's TeX program.