
A Spline Theory of Deep Networks

Randall Balestriero 1 Richard G. Baraniuk 1

Abstract

We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. This implies that a DN constructs a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product; we explore the links to the classical theory of optimal classification via matched filters and the effects of data memorization. Going further, we propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal with each other; this leads to significantly improved classification performance and reduced overfitting with no change to the DN architecture. The spline partition of the input signal space opens up a new geometric avenue to study how DNs organize signals in a hierarchical fashion. As an application, we develop and validate a new distance metric for signals that quantifies the difference between their partition encodings.

1. Introduction

Deep learning has significantly advanced our ability to address a wide range of difficult machine learning and signal processing problems. Today's machine learning landscape is dominated by deep (neural) networks (DNs), which are compositions of a large number of simple parameterized linear and nonlinear transforms. An all-too-common story of late is that of plugging a deep network into an application as a black box, training it on copious training data, and then significantly improving performance over classical approaches.

1 ECE Department, Rice University, Houston, TX, USA. Correspondence to: Randall B. <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).


Despite this empirical progress, the precise mechanisms by which deep learning works so well remain relatively poorly understood, adding an air of mystery to the entire field. Ongoing attempts to build a rigorous mathematical framework fall roughly into five camps: (i) probing and measuring DNs to visualize their inner workings (Zeiler & Fergus, 2014); (ii) analyzing their properties such as expressive power (Cohen et al., 2016), loss surface geometry (Lu & Kawaguchi, 2017; Soudry & Hoffer, 2017), nuisance management (Soatto & Chiuso, 2016), sparsification (Papyan et al., 2017), and generalization abilities; (iii) new mathematical frameworks that share some (but not all) common features with DNs (Bruna & Mallat, 2013); (iv) probabilistic generative models from which specific DNs can be derived (Arora et al., 2013; Patel et al., 2016); and (v) information theoretic bounds (Tishby & Zaslavsky, 2015).

In this paper, we build a rigorous bridge between DNs and approximation theory via spline functions and operators. We prove that a large class of DNs — including convolutional neural networks (CNNs) (LeCun, 1998), residual networks (ResNets) (He et al., 2016; Targ et al., 2016), skip connection networks (Srivastava et al., 2015), fully connected networks (Pal & Mitra, 1992), recurrent neural networks (RNNs) (Graves, 2013), and beyond — can be written as spline operators. In particular, when these networks employ current standard-practice piecewise-affine, convex nonlinearities (e.g., ReLU, max-pooling, etc.), they can be written as the composition of max-affine spline operators (MASOs) (Magnani & Boyd, 2009; Hannah & Dunson, 2013). We focus on such nonlinearities here but note that our framework applies also to non-piecewise-affine nonlinearities through a standard approximation argument.

The max-affine spline connection provides a powerful portal through which to view and analyze the inner workings of a DN using tools from approximation theory and functional analysis. Here is a summary of our key contributions:

[C1] We prove that a large class of DNs can be written as a composition of MASOs, from which it follows immediately that, conditioned on the input signal, the output of a DN is a simple affine transformation of the input. We illustrate in Section 4 by deriving a closed-form expression for the input/output mapping of a CNN.

[C2] The affine mapping formula enables us to interpret a MASO DN as constructing a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product. In Section 5 we relate DNs directly to the classical theory of optimal classification via matched filters and provide insights into the effects of data memorization (Zhang et al., 2016).

[C3] We propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal to each other. In Section 6, we show that this leads to significantly improved classification performance and reduced overfitting on standard test data sets like CIFAR100 with no change to the DN architecture.

[C4] The partition of the input space induced by a MASO links DNs to the theory of vector quantization (VQ) and K-means clustering, which opens up a new geometric avenue to study how DNs cluster and organize signals in a hierarchical fashion. Section 7 studies the properties of the MASO partition.

[C5] Leveraging the fact that a DN considers two signals to be similar if they lie in the same MASO partition region, we develop a new signal distance in Section 7.3 that measures the difference between their partition encodings. The distance is easily computed via backpropagation.

A number of appendices in the Supplementary Material (SM) contain the mathematical setup and proofs. A significantly extended account of this theory with numerous new results is available in (Balestriero & Baraniuk, 2018).

2. Background on Deep Networks

A deep network (DN) is an operator f_Θ : R^D → R^C that maps an input signal¹ x ∈ R^D to an output prediction y ∈ R^C. All current DNs can be written as a composition of L intermediate mappings called layers

f_\Theta(x) = \big( f^{(L)}_{\theta^{(L)}} \circ \cdots \circ f^{(1)}_{\theta^{(1)}} \big)(x),    (1)

where Θ = {θ^(1), ..., θ^(L)} is the collection of the network's parameters from each layer. This composition of mappings is nonlinear and non-commutative, in general.

A DN layer at level ℓ is an operator f^(ℓ)_{θ^(ℓ)} that takes as input the vector-valued signal z^(ℓ−1)(x) ∈ R^(D^(ℓ−1)) and produces the vector-valued output z^(ℓ)(x) ∈ R^(D^(ℓ)). We will assume that x and z^(ℓ)(x) are column vectors. We initialize with z^(0)(x) = x and denote z^(L)(x) =: z for convenience.

1 For concreteness, we focus here on processing K-channel images x, such as color digital photographs. But our analysis and techniques apply to signals of any index-dimensionality, including speech and audio signals, video signals, etc.

The signals z^(ℓ)(x) are typically called feature maps; it is easy to see that

z^{(\ell)}(x) = \big( f^{(\ell)}_{\theta^{(\ell)}} \circ \cdots \circ f^{(1)}_{\theta^{(1)}} \big)(x), \quad \ell \in \{1, \ldots, L\}.    (2)

We briefly overview the basic DN operators and layers we consider in this paper; more details and additional layers are provided in (Goodfellow et al., 2016) and (Balestriero & Baraniuk, 2018). A fully connected operator performs an arbitrary affine transformation by multiplying its input by the dense matrix W^(ℓ) ∈ R^(D^(ℓ)×D^(ℓ−1)) and adding the arbitrary bias vector b^(ℓ)_W ∈ R^(D^(ℓ)), as in f^(ℓ)_W(z^(ℓ−1)(x)) := W^(ℓ) z^(ℓ−1)(x) + b^(ℓ)_W. A convolution operator reduces the number of parameters in the affine transformation by replacing the unconstrained W^(ℓ) with a multichannel convolution matrix C^(ℓ), as in f^(ℓ)_C(z^(ℓ−1)(x)) := C^(ℓ) z^(ℓ−1)(x) + b^(ℓ)_C.

An activation operator applies a scalar nonlinear activation function σ independently to each entry of its input, as in [f^(ℓ)_σ(z^(ℓ−1)(x))]_k := σ([z^(ℓ−1)(x)]_k), k = 1, ..., D^(ℓ).

Nonlinearities are crucial to DNs, since otherwise the entire network would collapse to a single global affine transform. Three popular activation functions are the rectified linear unit (ReLU) σ_ReLU(u) := max(u, 0), the leaky ReLU σ_LReLU(u) := max(ηu, u), η > 0, and the absolute value σ_abs(u) := |u|. These three functions are both piecewise affine and convex. Other popular activation functions include the sigmoid σ_sig(u) := 1/(1 + e^(−u)) and hyperbolic tangent σ_tanh(u) := 2σ_sig(2u) − 1. These two functions are neither piecewise affine nor convex.

A pooling operator subsamples its input to reduce its dimensionality according to a sub-sampling policy ρ applied over a collection of input indices {R_k}_{k=1}^{K^(ℓ)} (typically a small patch), e.g., max-pooling [f^(ℓ)_ρ(z^(ℓ−1)(x))]_k := max_{d ∈ R^(ℓ)_k} [z^(ℓ−1)(x)]_d, k = 1, ..., D^(ℓ). See (Balestriero & Baraniuk, 2018) for the definitions of average pooling, channel pooling, skip connections, and recurrent layers.

Definition 1. A DN layer f^(ℓ)_{θ^(ℓ)} comprises a single nonlinear DN operator (non-affine, to be precise) composed with any preceding affine operators lying between it and the preceding nonlinear operator.

This definition yields a single, unique layer decomposition for any DN, and the complete DN is then the composition of its layers per (1). For example, in a standard CNN, there are two different layer types: (i) convolution-activation and (ii) max-pooling.

We form the prediction y by feeding f_Θ(x) through a final nonlinearity g : R^(D^(L)) → R^(D^(L)), as in y = g(f_Θ(x)). In classification, g is typically the softmax nonlinearity, which arises naturally from posing the classification inference as a multinomial logistic regression problem (Bishop, 1995). In regression, typically no g is applied.

We learn the DN parameters Θ for a particular prediction task in a supervised setting using a labeled data set D = {(x_n, y_n)}_{n=1}^N, a loss function, and a learning policy to update the parameters Θ in the predictor f_Θ(x). For classification problems, the loss function is typically the negative cross-entropy L_CE(x, y) (Bishop, 1995). For regression problems, the loss function is typically the squared error. Since the layer-by-layer operations in a DN are differentiable almost everywhere with respect to their parameters and inputs, we can use some flavor of first-order optimization such as gradient descent to optimize the parameters Θ with respect to the loss function. Moreover, the gradients for all internal parameters can be computed efficiently by backpropagation (Hecht-Nielsen, 1992), which follows from the chain rule of calculus.

3. Background on Spline Operators

Approximation theory is the study of how and how well functions can best be approximated using simpler functions (Powell, 1981). A classical example of a simpler function is a spline s : R^D → R (Schmidhuber, 1994). For concreteness, we will work exclusively with affine splines in this paper (aka "linear splines"), but our ideas generalize naturally to higher-order splines.

Multivariate Affine Splines. Consider a partition of a domain R^D into a set of regions Ω = {ω_1, ..., ω_R} and a set of local mappings Φ = {φ_1, ..., φ_R} that map each region in the partition to R via φ_r(x) := ⟨[α]_{r,·}, x⟩ + [β]_r for x ∈ ω_r.² The parameters are: α ∈ R^(R×D), a matrix of hyperplane "slopes," and β ∈ R^R, a vector of hyperplane "offsets" or "biases". We will use the terms offset and bias interchangeably in the sequel. The notation [α]_{r,·} denotes the column vector formed from the r-th row of α.

With this setup, the multivariate affine spline is defined as

s[\alpha, \beta, \Omega](x) = \sum_{r=1}^{R} \big( \langle [\alpha]_{r,\cdot}, x \rangle + [\beta]_r \big)\, \mathbb{1}(x \in \omega_r) =: \langle \alpha[x], x \rangle + \beta[x],    (3)

where 1(x ∈ ω_r) is the indicator function. The second line of (3) introduces the streamlined notation α[x] = [α]_{r,·} when x ∈ ω_r; the definition for β[x] is similar. Such a spline is piecewise affine and hence piecewise convex. However, in general, it is neither globally affine nor globally convex unless R = 1, a case we denote as a degenerate spline, since it corresponds simply to an affine mapping.

2 To make the connection between splines and DNs more immediately obvious, here x is interpreted as a point in R^D, which plays the role of the space of signals in the other sections.

Max-Affine Spline Functions. A major complication of function approximation with splines in general is the need to jointly optimize both the spline parameters α, β and the input domain partition Ω (the "knots" for a 1D spline) (Bennett & Botkin, 1985). However, if a multivariate affine spline is constrained to be globally convex, then it can always be rewritten as a max-affine spline (Magnani & Boyd, 2009; Hannah & Dunson, 2013)

s[\alpha, \beta, \Omega](x) = \max_{r=1,\ldots,R} \langle [\alpha]_{r,\cdot}, x \rangle + [\beta]_r.    (4)

An extremely useful feature of such a spline is that it is completely determined by its parameters α and β without needing to specify the partition Ω. As such, we denote a max-affine spline simply as s[α, β]. Changes in the parameters α, β of a max-affine spline automatically induce changes in the partition Ω, meaning that they are adaptive partitioning splines (Magnani & Boyd, 2009).

Max-Affine Spline Operators. A natural extension of an affine spline function is an affine spline operator (ASO) S[A, B, Ω_S] that produces a multivariate output. It is obtained simply by concatenating K affine spline functions from (3). The details and a more general development are provided in the SM and (Balestriero & Baraniuk, 2018).

We are particularly interested in the max-affine spline operator (MASO) S[A, B] : R^D → R^K formed by concatenating K independent max-affine spline functions from (4). A MASO with slope parameters A ∈ R^(K×R×D) and offset parameters B ∈ R^(K×R) is defined as

S[A, B](x) = \begin{bmatrix} \max_{r=1,\ldots,R} \langle [A]_{1,r,\cdot}, x \rangle + [B]_{1,r} \\ \vdots \\ \max_{r=1,\ldots,R} \langle [A]_{K,r,\cdot}, x \rangle + [B]_{K,r} \end{bmatrix} =: A[x]\, x + B[x].    (5)

The second line of (5) introduces the streamlined notation in terms of the signal-dependent matrix A[x] and signal-dependent vector B[x], where [A[x]]_{k,·} := [A]_{k,r_k(x),·} and [B[x]]_k := [B]_{k,r_k(x)} with r_k(x) = arg max_r ⟨[A]_{k,r,·}, x⟩ + [B]_{k,r}.

Max-affine spline functions and operators are always piecewise affine and globally convex (and hence also continuous) with respect to each output dimension. Conversely, any piecewise affine and globally convex function/operator can be written as a max-affine spline. Moreover, using standard approximation arguments, it is easy to show that a MASO can approximate arbitrarily closely any (nonlinear) operator that is convex in each output dimension.

4. DNs are Compositions of Spline Operators

While a MASO is appropriate only for approximating convex functions/operators, we now show that virtually all of today's DNs can be written as a composition of MASOs, one for each layer. Such a composition is, in general, non-convex and hence can approximate a much larger class of functions/operators. Interestingly, under certain broad conditions, the composition remains a piecewise affine spline operator, which enables a variety of insights into DNs.

4.1. DN Operators are MASOs

We now state our main theoretical results, which are proved in the SM and elaborated in (Balestriero & Baraniuk, 2018).

Proposition 1. An arbitrary fully connected operator f^(ℓ)_W is an affine mapping and hence a degenerate MASO S[A^(ℓ)_W, B^(ℓ)_W] with R = 1, [A^(ℓ)_W]_{k,1,·} = [W^(ℓ)]_{k,·} and [B^(ℓ)_W]_{k,1} = [b^(ℓ)_W]_k, leading to W^(ℓ) z^(ℓ−1)(x) + b^(ℓ)_W = A^(ℓ)_W[x] z^(ℓ−1)(x) + B^(ℓ)_W[x]. The same is true of a convolution operator with W^(ℓ), b^(ℓ)_W replaced by C^(ℓ), b^(ℓ)_C.

Proposition 2. Any activation operator f^(ℓ)_σ using a piecewise affine and convex activation function is a MASO S[A^(ℓ)_σ, B^(ℓ)_σ] with R = 2 and [B^(ℓ)_σ]_{k,1} = [B^(ℓ)_σ]_{k,2} = 0 ∀k. For ReLU, [A^(ℓ)_σ]_{k,1,·} = 0 and [A^(ℓ)_σ]_{k,2,·} = e_k ∀k; for leaky ReLU, [A^(ℓ)_σ]_{k,1,·} = ν e_k and [A^(ℓ)_σ]_{k,2,·} = e_k ∀k, ν > 0; and for absolute value, [A^(ℓ)_σ]_{k,1,·} = −e_k and [A^(ℓ)_σ]_{k,2,·} = e_k ∀k, where e_k represents the k-th canonical basis element of R^(D^(ℓ)).

Proposition 3. Any pooling operator f^(ℓ)_ρ that is piecewise affine and convex is a MASO S[A^(ℓ)_ρ, B^(ℓ)_ρ].³ Max-pooling has R = #R_k (typically a constant over all output dimensions k), [A^(ℓ)_ρ]_{k,r,·} = e_i, i ∈ R_k, and [B^(ℓ)_ρ]_{k,r} = 0 ∀k, r. Average-pooling is a degenerate MASO with R = 1, [A^(ℓ)_ρ]_{k,1,·} = (1/#R_k) ∑_{i ∈ R_k} e_i, and [B^(ℓ)_ρ]_{k,1} = 0 ∀k.

Proposition 4. A DN layer constructed from an arbitrary composition of fully connected/convolution operators followed by one activation or pooling operator is a MASO S[A^(ℓ), B^(ℓ)] such that

f^{(\ell)}\big( z^{(\ell-1)}(x) \big) = A^{(\ell)}[x]\, z^{(\ell-1)}(x) + B^{(\ell)}[x].    (6)
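A minimal numerical check of (6) for a single fully connected + ReLU layer (a toy example with random weights, not the authors' experimental setup): conditioned on the input, the layer is exactly affine, with slope Q W^(ℓ) and offset Q b^(ℓ), where Q is the diagonal 0/1 matrix recording which ReLU piece is active in each output dimension.

```python
import numpy as np

rng = np.random.default_rng(2)
D_in, D_out = 6, 4
W, b = rng.normal(size=(D_out, D_in)), rng.normal(size=D_out)
z = rng.normal(size=D_in)                    # layer input z^(l-1)(x)

pre = W @ z + b                              # affine (fully connected) operator
out = np.maximum(pre, 0.0)                   # ReLU activation operator

Q = np.diag((pre > 0).astype(float))         # active ReLU pieces, input-dependent
A_l, B_l = Q @ W, Q @ b                      # signal-dependent slope and offset
assert np.allclose(out, A_l @ z + B_l)       # the layer is affine in its input, per (6)
```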

Consequently, a large class of DNs boils down to a composition of MASOs. We prove the following in the SM and in (Balestriero & Baraniuk, 2018) for CNNs, ResNets, skip connection nets, fully connected nets, and RNNs.

3 This result is agnostic to the pooling type (spatial or channel).

Theorem 1. A DN constructed from an arbitrary composition of fully connected/convolution, activation, and pooling operators of the types in Propositions 1–3 is a composition of MASOs that is equivalent to a global affine spline operator.

Note carefully that, while the layers of each of the DNs stated in Theorem 1 are MASOs, the composition of several layers is not necessarily a MASO. Indeed, a composition of MASOs remains a MASO if and only if all of its component operators (except the first) are non-decreasing with respect to each of their output dimensions (Boyd & Vandenberghe, 2004). Interestingly, ReLU and max-pooling are both non-decreasing, while leaky ReLU is strictly increasing. The culprits causing non-convexity of the composition of layers are negative entries in the fully connected or convolution operators, which destroy the required non-decreasing property. A DN where these culprits are thwarted is an interesting special case, because it is convex with respect to its input (Amos et al., 2016) and multiconvex (Xu & Yin, 2013) with respect to its parameters.

Theorem 2. A DN whose layers ℓ = 2, ..., L consist of an arbitrary composition of fully connected and convolution operators with nonnegative weights, i.e., W^(ℓ)_{k,j} ≥ 0, C^(ℓ)_{k,j} ≥ 0; non-decreasing, piecewise-affine, and convex activation operators; and non-decreasing, piecewise-affine, and convex pooling operators is globally a MASO and thus also globally convex with respect to each of its output dimensions.

The above results pertain to DNs using convex, piecewise-affine operators. Other popular non-convex DN operators (e.g., the sigmoid and arctan activation functions) can be approximated arbitrarily closely by an affine spline operator but not by a MASO.

DNs are Signal-Dependent Affine Transformations. A common theme of the above results is that, for DNs constructed from the fully connected/convolution, activation, and pooling operators of Propositions 1–3, the operator/layer outputs z^(ℓ)(x) are always a signal-dependent affine function of the input x (recall (5)). The particular affine mapping applied to x depends on the spline partition region of R^D into which x falls. More on this in Section 7 below.

DN Learning and MASO Parameters. Given labeled training data {(x_n, y_n)}_{n=1}^N, learning in a DN that meets the conditions of Theorem 1 (i.e., optimizing its parameters Θ) is equivalent to optimally approximating the mapping from input x to output y = g(z^(L)(x)) using an appropriate cost function (e.g., cross-entropy for classification or squared error for regression) by learning the parameters θ^(ℓ) of the layers. In general, the overall optimization problem is non-convex (it is actually piecewise multi-convex in general (Rister, 2016)).


z^{(L)}_{\mathrm{CNN}}(x) = W^{(L)} \underbrace{\left( \prod_{\ell=L-1}^{1} A^{(\ell)}_{\rho}[x]\, A^{(\ell)}_{\sigma}[x]\, C^{(\ell)} \right)}_{A_{\mathrm{CNN}}[x]} x \;+\; W^{(L)} \underbrace{\sum_{\ell=1}^{L-1} \left( \prod_{j=L-1}^{\ell+1} A^{(j)}_{\rho}[x]\, A^{(j)}_{\sigma}[x]\, C^{(j)} \right) \left( A^{(\ell)}_{\rho}[x]\, A^{(\ell)}_{\sigma}[x]\, b^{(\ell)}_{C} \right)}_{B_{\mathrm{CNN}}[x]} \;+\; b^{(L)}_{W}    (7)

4.2. Application: DN Affine Mapping Formula

Combining Propositions 1–3 and Theorem 1 and substituting (5) into (2), we can write an explicit formula for the output of any layer z^(ℓ)(x) of a DN in terms of the input x for a variety of different architectures. The formula for a standard CNN (using ReLU activation and max-pooling) is given in (7) above; we derive this formula and analogous formulas for ResNets and RNNs in (Balestriero & Baraniuk, 2018). In (7), A^(ℓ)_σ[x] are the signal-dependent matrices corresponding to the ReLU activations, A^(ℓ)_ρ[x] are the signal-dependent matrices corresponding to max-pooling, and the biases b^(L)_W, b^(ℓ)_C arise directly from the fully connected and convolution operators. The absence of B^(ℓ)_σ[x], B^(ℓ)_ρ[x] is due to the absence of bias in the ReLU and max-pooling operators (recall Propositions 2 and 3).

Inspection of (7) reveals the exact form of the signal-dependent, piecewise affine mapping linking x to z^(L)_CNN(x). Moreover, this formula can be collapsed into

z^{(L)}_{\mathrm{CNN}}(x) = W^{(L)} \big( A_{\mathrm{CNN}}[x]\, x + B_{\mathrm{CNN}}[x] \big) + b^{(L)}_{W},    (8)

from which we can recognize

z^{(L-1)}_{\mathrm{CNN}}(x) = A_{\mathrm{CNN}}[x]\, x + B_{\mathrm{CNN}}[x]    (9)

as an explicit, signal-dependent, affine formula for the featurization process that aims to convert x into a set of (hopefully) linearly separable features that are then input to the linear classifier in layer ℓ = L with parameters W^(L) and b^(L)_W. Of course, the final prediction y is formed by running z^(L)_CNN(x) through a softmax nonlinearity g, but this merely rescales its entries to create a probability distribution.
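The sketch below instantiates (8)–(9) for a toy fully connected ReLU network, standing in for the CNN of (7) with the convolutions replaced by dense matrices and pooling omitted for brevity; it collapses the layers into a single signal-dependent affine map and verifies it against the forward pass.

```python
import numpy as np

rng = np.random.default_rng(3)
dims = [8, 6, 5, 3]                                   # D, D^(1), D^(2), C
Ws = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(3)]
bs = [rng.normal(size=dims[i + 1]) for i in range(3)]
x = rng.normal(size=dims[0])

# Forward pass: two ReLU layers followed by the final linear classifier W^(L), b^(L)_W.
z = x
A_net, B_net = np.eye(dims[0]), np.zeros(dims[0])     # running A[x], B[x] of (9)
for W, b in zip(Ws[:-1], bs[:-1]):
    pre = W @ z + b
    Q = np.diag((pre > 0).astype(float))              # active ReLU pieces for this input
    z = np.maximum(pre, 0.0)
    A_net, B_net = Q @ W @ A_net, Q @ (W @ B_net + b) # compose the per-layer affine maps

out = Ws[-1] @ z + bs[-1]                             # z^(L)(x)
assert np.allclose(z, A_net @ x + B_net)              # featurization, per (9)
assert np.allclose(out, Ws[-1] @ (A_net @ x + B_net) + bs[-1])   # collapsed form, per (8)
```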

5. DNs are Template Matching Machines

We now dig deeper into (8) in order to bridge DNs and classical optimal classification theory. While we focus on CNNs and classification for concreteness, our analysis holds for any DN meeting the conditions of Theorem 1.

5.1. Template Matching

An alternate interpretation of (8) is that z^(L)_CNN(x) is the output of a bank of linear matched filters (plus a set of biases). That is, the c-th element of z^(L)(x) equals the inner product between the signal x and the matched filter for the c-th class, which is contained in the c-th row of the matrix W^(L) A[x]. The bias W^(L) B[x] + b^(L)_W can be used to account for the fact that some classes might be more likely than others (i.e., the prior probability over the classes). It is well known that a matched filterbank is the optimal classifier for deterministic signals in additive white Gaussian noise (Rabiner & Gold, 1975). Given an input x, the class decision is simply the index of the largest element of z^(L)(x).⁴
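In this matched-filter reading, the class decision is just an inner-product comparison; a minimal sketch with placeholder templates and biases (which, for an actual DN, would be the rows of W^(L) A[x] and the entries of W^(L) B[x] + b^(L)_W):

```python
import numpy as np

rng = np.random.default_rng(4)
C, D = 3, 10
templates = rng.normal(size=(C, D))     # stand-ins for the rows of W^(L) A[x]
biases = rng.normal(size=C)             # stand-in for W^(L) B[x] + b^(L)_W
x = rng.normal(size=D)

scores = templates @ x + biases         # c-th entry: <template_c, x> + bias_c
decision = int(np.argmax(scores))       # softmax is monotone, so the argmax is unchanged
```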

Yet another interpretation of (8) is that z^(L)(x) is computed not in a single matched filter calculation but hierarchically as the signal propagates through the DN layers. Abstracting (5) to write the per-layer maximization process as z^(ℓ)(x) = max_{r^(ℓ)} A^(ℓ)_{r^(ℓ)} z^(ℓ−1)(x) + B^(ℓ)_{r^(ℓ)} and cascading, we obtain a formula for the end-to-end DN mapping

z^{(L)}(x) = W^{(L)} \max_{r^{(L-1)}} \Big( A^{(L-1)}_{r^{(L-1)}} \cdots \max_{r^{(2)}} \Big( A^{(2)}_{r^{(2)}} \max_{r^{(1)}} \big( A^{(1)}_{r^{(1)}} x + B^{(1)}_{r^{(1)}} \big) + B^{(2)}_{r^{(2)}} \Big) \cdots + B^{(L-1)}_{r^{(L-1)}} \Big) + b^{(L)}_{W}.    (10)

This formula elucidates that a DN performs a hierarchical, greedy template matching on its input, a computationally efficient yet sub-optimal template matching technique. Such a procedure is globally optimal when the DN is globally convex.

Corollary 1. For a DN abiding by the requirements of Theorem 2, the computation (10) collapses to the following globally optimal template matching

z^{(L)}(x) = W^{(L)} \max_{r^{(L-1)}, \ldots, r^{(2)}, r^{(1)}} \Big( A^{(L-1)}_{r^{(L-1)}} \Big( \cdots A^{(2)}_{r^{(2)}} \big( A^{(1)}_{r^{(1)}} x + B^{(1)}_{r^{(1)}} \big) + B^{(2)}_{r^{(2)}} \cdots \Big) + B^{(L-1)}_{r^{(L-1)}} \Big) + b^{(L)}_{W}.    (11)

5.2. Template Visualization Examples

Since the complete DN mapping (up to the final softmax) can be expressed as in (8), given a signal x, we can compute the signal-dependent template for class c via A[x]_{c,·} = d[z^(L)(x)]_c / dx, which can be efficiently computed via backpropagation (Hecht-Nielsen, 1992).⁵ Once the template A[x]_{c,·} has been computed, the bias term b[x]_c can be computed via b[x]_c = [z^(L)(x)]_c − ⟨A[x]_{c,·}, x⟩. Figure 1 plots various signal-dependent templates for two CNNs trained on the MNIST and CIFAR10 datasets.

4 Again, since the softmax merely rescales the entries of z^(L)(x) into a probability distribution, it does not affect the location of its largest element.

5 In fact, we can use the same backpropagation procedure used for computing the gradient with respect to a fully connected or convolution weight but instead with the input x. This procedure is becoming increasingly popular in the study of adversarial examples (Szegedy et al., 2013).


Figure 1. Visualization of the matched filter templates generated by a CNN using ReLU activation and max-pooling. At top, we show the input image x of the digit '0' from the MNIST dataset along with the three signal-dependent templates A_CNN[x]_('0',·), A_CNN[x]_('3',·), and A_CNN[x]_('7',·). The inner products between x and the three templates are 32.7, −27.7, −27.8, respectively (c.f. (12)). Below, we repeat the same experiment with the input image x of a 'truck' from the CIFAR10 dataset along with the three signal-dependent templates corresponding to the classes 'truck,' 'bird,' and 'dog.' The inner products between x and the three templates are 30.6, −12.5, −16.2, respectively. (The final classification will involve not only these inner products but also the relevant biases and softmax transformation.)


5.3. Collinear Templates and Data Set Memorization

Under the matched filterbank interpretation of a DN developed in Section 5.1, the optimal template for an image x of class c is a scaled version of x itself. But what are the optimal templates for the other (incorrect) classes? In an idealized setting, we can answer this question.

Proposition 5. Consider an idealized DN consisting of a composition of MASOs that has sufficient approximation power to span arbitrary MASO matrices A[x_n] from (9) for any input x_n from the training set. Train the DN to classify among C classes using the training data D = {(x_n, y_n)}_{n=1}^N with normalized inputs ‖x_n‖_2 = 1 ∀n and the cross-entropy loss L_CE(y_n, f_Θ(x_n)), with the addition of the regularization constraint that ∑_c ‖A[x_n]_{c,·}‖² < α with α > 0. At the global minimum of this constrained optimization problem, the rows of A*[x_n] (the optimal templates) have the form

[A^{\star}[x_n]]_{c,\cdot} = \begin{cases} +\sqrt{\tfrac{(C-1)\,\alpha}{C}}\; x_n, & c = y_n \\ -\sqrt{\tfrac{\alpha}{C(C-1)}}\; x_n, & c \neq y_n. \end{cases}    (12)

In short, the idealized CNN in the proposition will memorize a set of collinear templates whose bimodal outputs force the softmax output to a Dirac delta function (aka 1-hot representation) that peaks at the correct class.


Figure 2. Empirical study of the implications of Proposition 5. We illustrate the bimodality of ⟨[A*[x_n]]_{c,·}, x_n⟩ for the largeCNN matched filterbank trained on (a) MNIST and (b) CIFAR10. Training used batch normalization and bias units. Results are similar when trained with bias and no batch normalization. The green histogram summarizes the output values for the correct class (top half of (12)), while the red histogram summarizes the output values for the incorrect classes (bottom half of (12)). The easier the classification problem (MNIST), the more bimodal the distribution.

Figure 2 confirms this bimodal behavior on the MNIST and CIFAR10 datasets.

6. New DNs with Orthogonal Templates

While a DN's signal-dependent matched filterbank (8) is optimized for classifying signals immersed in additive white Gaussian noise, such a statistical model is overly simplistic for most machine learning problems of interest. In practice, errors will arise not just from random noise but also from nuisance variations in the inputs such as arbitrary rotations, positions, and modalities of the objects of interest. The effects of these nuisances are only poorly approximated as Gaussian random errors. Limited work has been done on filterbanks for classification in non-Gaussian noise; one promising direction involves using not matched but rather orthogonal templates (Eldar & Oppenheim, 2001).

For a MASO DN’s templates to be orthogonal for all in-puts, it is necessary that the rows of the matrix W (L) in thefinal linear classifier layer be orthogonal. This weak con-straint on the DN still enables the earlier layers to create ahigh-performance, class-agnostic, featurized representation(recall the discussion just below (9)). To create orthogonaltemplates during learning, we simply add to the standard(potentially regularized) cross-entropy loss function LCE

a term that penalizes non-zero off-diagonal entries in thematrix W (L)(W (L))T leading to the new loss with theadditional penalty

LCE + λ∑c1 6=c2

∣∣∣⟨[W (L)]c1,·,[W (L)

]c2,·

⟩∣∣∣2 . (13)

The parameter λ controls the tradeoff between cross-entropy minimization and orthogonality preservation.
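A minimal NumPy sketch of the penalty term in (13), written as the sum of squared off-diagonal entries of the Gram matrix W^(L)(W^(L))^T; the weight matrix and λ below are placeholders, not the trained values used in the experiments.

```python
import numpy as np

def orthogonality_penalty(W_L):
    """Sum of squared inner products between distinct rows of W^(L), per (13)."""
    G = W_L @ W_L.T                                   # Gram matrix of the classifier rows
    off_diag = G - np.diag(np.diag(G))
    return np.sum(off_diag ** 2)

rng = np.random.default_rng(7)
W_L = rng.normal(size=(10, 64))                       # placeholder final-layer weights
lam = 1.0                                             # placeholder value of lambda
# total_loss = cross_entropy + lam * orthogonality_penalty(W_L)   # as in (13)
print(orthogonality_penalty(W_L))
```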


Figure 3. Orthogonal templates significantly boost DN performance with no change to the architecture. (a) Classification performance of the largeCNN trained on CIFAR100 for different values of the orthogonality penalty λ in (13). We plot the average (black dots), standard deviation (gray shade), and maximum (blue dots) of the test set accuracy over 15 runs. (b, top) Training set error. The blue/black curves correspond to λ = 0/1. (b, bottom) Test set accuracy over the course of the learning.

Conveniently, when minimizing (13) via backpropagation, the orthogonal rows of W^(L) induce orthogonal backpropagation updates for the various classes.

We now empirically demonstrate that orthogonal templates lead to significantly improved classification performance. We conducted a range of experiments with three different conventional DN architectures – smallCNN, largeCNN, and ResNet4-4 – trained on three different datasets – SVHN, CIFAR10, and CIFAR100. Each DN employed bias units, ReLU activations, and max-pooling, as well as batch normalization prior to each ReLU. The full experimental details are given in (Balestriero & Baraniuk, 2018). For learning, we used the Adam optimizer with an exponential learning rate decay. All inputs were centered to zero mean and scaled to a maximum value of one. No further pre-processing was performed, such as ZCA whitening (Nam et al., 2014). We assessed how the classification performance of a given DN would change as we varied the orthogonality penalty λ in (13). For each configuration of DN architecture, training dataset, learning rate, and penalty λ, we averaged over 15 runs to estimate the average performance and standard deviation.

We report here on only the CIFAR100 experiments with the largeCNN. (See (Balestriero & Baraniuk, 2018) for detailed results for all three datasets and the other architectures. The trends for all three datasets are similar and are independent of the learning rate.) The results for CIFAR100 in Figure 3 indicate that the benefits of the orthogonality penalty emerge distinctly as soon as λ > 0. In addition to improved final accuracy and generalization performance, we see that template orthogonality reduces the temptation of the DN to overfit. (This is especially visible in the examples in (Balestriero & Baraniuk, 2018).) One explanation is that the orthogonal weights W^(L) positively impact not only the prediction but also the backpropagation via orthogonal gradient updates with respect to each output dimension's partial derivatives.

7. DN's Intrinsic Multiscale Partition

Like any spline, it is the interplay between the (affine) spline mappings and the input space partition that works the magic in a MASO DN. Recall from Section 3 that a MASO has the attractive property that it implicitly partitions its input space as a function of its slope and offset parameters. The induced partition Ω opens up a new geometric avenue to study how a DN clusters and organizes signals in a hierarchical fashion.

7.1. Effect of the DN Operators on the Partition

A DN operator at level ℓ directly influences the partitioning of its input space R^(D^(ℓ−1)) and indirectly influences the partitioning of the overall signal space R^D.

A ReLU activation operator splits each of its input dimensions into two half-planes depending on the sign of the input in each dimension. This partitions R^(D^(ℓ−1)) into a combinatorially large number (up to 2^(D^(ℓ))) of regions. Following a fully connected or convolution operator with a ReLU simply rotates the partition in R^(D^(ℓ−1)). A max-pooling operator also partitions R^(D^(ℓ−1)) into a combinatorially large number (up to (#R)^(D^(ℓ))) of regions, where #R is the size of the pooling region.

This per-MASO partitioning of each layer's input space constructs an overall partitioning of the input signal space R^D. As each MASO is applied, it subdivides the input space R^D into finer and finer partitions. The final partition corresponds to the intersection of all of the intermediate partitions, and hence we can encode the input in terms of the ordered collection of per-layer partition regions into which it falls. This overall process can be interpreted as a hierarchical vector quantization (VQ) of the training input signals x_n. There are thus many potential connections between DNs and optimal quantization, information theory, and clustering that we leave for future research. See (Balestriero & Baraniuk, 2018) for some early results.


Figure 4. Example of the partition-based distance between images in the CIFAR10 (top) and SVHN (bottom) datasets. In each image, we plot the target image x at top-left and its 15 nearest neighbors in the partition-based distance at layer ℓ of a smallCNN trained without batch normalization (see (Balestriero & Baraniuk, 2018) for the details). From left to right, the images display layers ℓ = 1, ..., 6, and we see that the images become further and further in Euclidean distance but closer and closer in featurized distance. In particular, the first-layer neighbors share similar colors and shapes (and thus are closer in Euclidean distance). Later-layer neighbors mainly come from the same class independently of their color or shape.

7.2. Inferring a DN Layer’s Intrinsic Partition

Unfortunately, there is no simple formula for the partition of the signal space. However, one can obtain the set of input signals x_n that fall into the same partition region at each layer of a DN. At layer ℓ, denote the index of the region selected by the input x (recall (6)) by

\big[ t^{(\ell)}(x) \big]_k = \arg\max_{r} \big\langle [A^{(\ell)}]_{k,r,\cdot},\, z^{(\ell-1)}(x) \big\rangle + [B^{(\ell)}]_{k,r}.    (14)

Thus, [t^(ℓ)(x)]_k ∈ {1, ..., R^(ℓ)}, with R^(ℓ) the number of partition regions in the layer's input space. Encoding the partition as an ordered collection of integers designating the active hyperplane parameters from (4), we can now visualize which inputs fall into the same or nearby partitions.
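A small sketch of (14) for the two layer codes used most often below: the 0/1 ReLU sign pattern (R^(ℓ) = 2) and the per-region argmax of max-pooling; the weights, inputs, and non-overlapping regions are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(8)

# ReLU layer code: which of the R = 2 affine pieces (zero or identity) is active per unit.
W, b = rng.normal(size=(6, 8)), rng.normal(size=6)
x = rng.normal(size=8)
t_relu = (W @ x + b > 0).astype(int)          # t^(l)(x) in {0, 1}^6

# Max-pooling layer code: argmax position inside each (non-overlapping) pooling region.
z, pool = rng.normal(size=8), 2
t_pool = np.array([z[k * pool:(k + 1) * pool].argmax() for k in range(8 // pool)])
```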

Due to the very large number of possible regions (up to 2^(D^(ℓ)) for a ReLU at layer ℓ) and the limited amount of training data, in general, many partitions will be empty or contain only a single training data point.

7.3. A New Image Distance based on the DN Partition

To validate the utility of the hierarchical intrinsic clustering induced by a DN, we define a new distance function between the signals x_1 and x_2 that quantifies the similarity of their partition encodings t^(ℓ)(x_1) and t^(ℓ)(x_2) at layer ℓ via

d\big( t^{(\ell)}(x_1), t^{(\ell)}(x_2) \big) = 1 - \frac{1}{D^{(\ell)}} \sum_{k=1}^{D^{(\ell)}} \mathbb{1}\big( [t^{(\ell)}(x_1)]_k = [t^{(\ell)}(x_2)]_k \big).    (15)

For a ReLU MASO, this corresponds simply to counting how many entries of the layer inputs for x_1 and x_2 are positive or negative at the same positions. For a max-pooling MASO, this corresponds to counting how many argmax positions are the same in each patch for x_1 and x_2.
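A minimal sketch of (15), using ReLU sign patterns as the layer codes; partition_distance is an illustrative helper and the weights and inputs are random toy data.

```python
import numpy as np

def partition_distance(t1, t2):
    """d(t^(l)(x1), t^(l)(x2)) per (15): 1 - fraction of matching code entries."""
    t1, t2 = np.asarray(t1), np.asarray(t2)
    return 1.0 - np.mean(t1 == t2)

rng = np.random.default_rng(9)
W, b = rng.normal(size=(16, 8)), rng.normal(size=16)
x1, x2 = rng.normal(size=8), rng.normal(size=8)
t1, t2 = (W @ x1 + b > 0).astype(int), (W @ x2 + b > 0).astype(int)
print(partition_distance(t1, t2))   # 0 when the codes agree everywhere, 1 when they never do
```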

Figure 4 provides a visualization of the nearest neighbors of a test image under this partition-based distance measure. Visual inspection of the figures highlights that, as we progress through the layers of the DN, similar images become closer in the new distance but further in Euclidean distance.

8. Conclusions

We have used the theory of splines to build a rigorous bridge between deep networks (DNs) and approximation theory. Our key finding is that, conditioned on the input signal, the output of a DN can be written as a simple affine transformation of the input. This links DNs directly to the classical theory of optimal classification via matched filters and provides insights into the positive effects of data memorization.

There are many avenues for future work, including a more in-depth analysis of the hierarchical MASO partitioning, particularly from the viewpoint of vector quantization and K-means clustering, which are unsupervised learning techniques, and information theory. The spline viewpoint also could inspire the creation of new DN layers that have certain attractive partitioning or approximation capabilities. We have begun exploring some of these directions in (Balestriero & Baraniuk, 2018).⁶

6 This work was partially supported by ARO grant W911NF-15-1-0316, AFOSR grant FA9550-14-1-0088, ONR grants N00014-17-1-2551 and N00014-18-12571, DARPA grant G001534-7500, and a DOD Vannevar Bush Faculty Fellowship (NSSEFF) grant N00014-18-1-2047.


References

Amos, B., Xu, L., and Kolter, J. Z. Input convex neural networks. arXiv preprint arXiv:1609.07152, 2016.

Arora, S., Bhaskara, A., Ge, R., and Ma, T. Provable bounds for learning some deep representations. arXiv preprint arXiv:1310.6343, 2013.

Balestriero, R. and Baraniuk, R. A spline theory of deep networks. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 383–392, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/balestriero18b.html.

Bennett, J. and Botkin, M. Structural shape optimization with geometric description and adaptive mesh refinement. AIAA Journal, 23(3):458–464, 1985.

Bishop, C. M. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Bruna, J. and Mallat, S. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.

Cohen, N., Sharir, O., and Shashua, A. On the expressive power of deep learning: A tensor analysis. In Feldman, V., Rakhlin, A., and Shamir, O. (eds.), 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 698–728, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.

Eldar, Y. C. and Oppenheim, A. V. Orthogonal matched filter detection. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'01), volume 5, pp. 2837–2840. IEEE, 2001.

Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. Deep Learning, volume 1. MIT Press, Cambridge, 2016.

Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

Hannah, L. A. and Dunson, D. B. Multivariate convex regression with adaptive partitioning. Journal of Machine Learning Research, 14(1):3261–3294, 2013.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hecht-Nielsen, R. Theory of the backpropagation neural network. In Neural Networks for Perception, pp. 65–93. Elsevier, 1992.

LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Lu, H. and Kawaguchi, K. Depth creates no bad local minima. arXiv preprint arXiv:1702.08580, 2017.

Magnani, A. and Boyd, S. P. Convex piecewise-linear fitting. Optimization and Engineering, 10(1):1–17, 2009.

Nam, W., Dollar, P., and Han, J. H. Local decorrelation for improved pedestrian detection. In Advances in Neural Information Processing Systems (NIPS), pp. 424–432, 2014.

Pal, S. K. and Mitra, S. Multilayer perceptron, fuzzy sets, and classification. IEEE Transactions on Neural Networks, 3(5):683–697, 1992.

Papyan, V., Romano, Y., and Elad, M. Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research, 18(83):1–52, 2017.

Patel, A. B., Nguyen, T., and Baraniuk, R. G. A probabilistic framework for deep learning. In Advances in Neural Information Processing Systems (NIPS), 2016.

Powell, M. J. D. Approximation Theory and Methods. Cambridge University Press, 1981.

Rabiner, L. R. and Gold, B. Theory and Application of Digital Signal Processing. Prentice-Hall, 1975.

Rister, B. Piecewise convexity of artificial neural networks. arXiv preprint arXiv:1607.04917, 2016.

Schmidhuber, J. Discovering problem solutions with low Kolmogorov complexity and high generalization capability. In Machine Learning: Proceedings of the Twelfth International Conference. Citeseer, 1994.

Soatto, S. and Chiuso, A. Visual representations: Defining properties and deep approximations. In Proceedings of the International Conference on Learning Representations (ICLR'16), Nov. 2016.

Soudry, D. and Hoffer, E. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

Srivastava, R. K., Greff, K., and Schmidhuber, J. Training very deep networks. In Advances in Neural Information Processing Systems (NIPS), pp. 2377–2385, 2015.


Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Targ, S., Almeida, D., and Lyman, K. Resnet in Resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029, 2016.

Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop (ITW), pp. 1–5. IEEE, 2015.

Xu, Y. and Yin, W. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pp. 818–833, 2014.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.