
The Singular Value Decomposition

We are interested in more than just sym+def matrices. But the eigenvalue decompositions discussed in the last section of notes will play a major role in solving general systems of equations

y = A x,   y ∈ R^M,   A is M × N,   x ∈ R^N.

We have seen that a symmetric positive definite matrix can be decomposed as A = V Λ V^T, where V is an orthogonal matrix (V^T V = V V^T = I) whose columns are the eigenvectors of A, and Λ is a diagonal matrix containing the eigenvalues of A. Because both orthogonal and diagonal matrices are trivial to invert, this eigenvalue decomposition makes it very easy to solve systems of equations y = A x and analyze the stability of these solutions.

The singular value decomposition (SVD) takes apart an arbitrary M × N matrix A in a similar manner. The SVD of a real-valued M × N matrix A with rank¹ R is

A = U Σ V^T

where

1. U is an M × R matrix

U = [ u_1 | u_2 | · · · | u_R ],

whose columns u_m ∈ R^M are orthonormal. Note that while U^T U = I, in general U U^T ≠ I when R < M. The columns of U are an orthobasis for the range space of A.

¹ Recall that the rank of a matrix is the number of linearly independent columns of the matrix (which is always equal to the number of linearly independent rows).


2. V is an N × R matrix

V = [ v_1 | v_2 | · · · | v_R ],

whose columns v_n ∈ R^N are orthonormal. Again, while V^T V = I, in general V V^T ≠ I when R < N. The columns of V are an orthobasis for the range space of A^T (recall that Range(A^T) consists of everything which is orthogonal to the null space of A).

3. Σ is an R × R diagonal matrix with positive entries:

Σ = diag(σ_1, σ_2, . . . , σ_R).

We call the σ_r the singular values of A. By convention, we will order them such that σ_1 ≥ σ_2 ≥ · · · ≥ σ_R.

4. The v_1, . . . , v_R are eigenvectors of the positive semi-definite matrix A^T A. Note that

A^T A = V Σ U^T U Σ V^T = V Σ^2 V^T,

and so the singular values σ_1, . . . , σ_R are the square roots of the non-zero eigenvalues of A^T A.

5. Similarly,

A A^T = U Σ^2 U^T,

and so the u_1, . . . , u_R are eigenvectors of the positive semi-definite matrix A A^T. Since the non-zero eigenvalues of A^T A and A A^T are the same, the σ_r are also the square roots of the eigenvalues of A A^T.

The rank R is the dimension of the space spanned by the columns of A; this is the same as the dimension of the space spanned by the rows. Thus R ≤ min(M, N). We say A is full rank if R = min(M, N).

As before, we will often find it useful to write the SVD as the sum of R rank-1 matrices:

A = U Σ V^T = ∑_{r=1}^{R} σ_r u_r v_r^T.
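These properties are easy to check numerically. The short sketch below is an illustration only (the test matrix is made up, and numpy's economy SVD is used; for a full-rank A the number of returned singular values equals R = min(M, N)):

```python
import numpy as np

# An arbitrary example matrix with M > N (full rank, so R = N here).
rng = np.random.default_rng(0)
M, N = 6, 4
A = rng.standard_normal((M, N))

# Economy SVD: U is M x R, s holds sigma_1 >= ... >= sigma_R, Vt is R x N.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Columns of U and V are orthonormal (U^T U = I), but U U^T != I when R < M.
print(np.allclose(U.T @ U, np.eye(U.shape[1])))
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))

# A is the sum of R rank-1 matrices sigma_r u_r v_r^T.
A_sum = sum(s[r] * np.outer(U[:, r], Vt[r]) for r in range(len(s)))
print(np.allclose(A, A_sum))

# The sigma_r are the square roots of the non-zero eigenvalues of A^T A.
lam = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(np.sqrt(lam[:len(s)]), s))
```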

When A is overdetermined (M > N), the decomposition pairs a tall M × R matrix U with the R × R diagonal block Σ = diag(σ_1, . . . , σ_R) and an R × N matrix V^T.

When A is underdetermined (M < N), the SVD pairs an M × R matrix U and the R × R block Σ with an R × N matrix V^T that is much wider than it is tall.

When A is square and full rank (M = N = R), U, Σ = diag(σ_1, . . . , σ_N), and V^T are all N × N.


Technical Details: Existence of the SVD

In this section we will prove that any M × N matrix A with rank(A) = R can be written as

A = U Σ V^T

where U, Σ, V have the five properties listed at the beginning of the last section.

Since A^T A is symmetric positive semi-definite, we can write

A^T A = ∑_{n=1}^{N} λ_n v_n v_n^T,

where the v_n are orthonormal and the λ_n are real and non-negative. Since rank(A) = R, we also have rank(A^T A) = R, and so λ_1, . . . , λ_R are all strictly positive above, and λ_{R+1} = · · · = λ_N = 0.

Set

u_m = (1/√λ_m) A v_m,   for m = 1, . . . , R,    and    U = [ u_1 · · · u_R ].

Notice that these u_m are orthonormal, as

⟨u_m, u_ℓ⟩ = (1/√(λ_m λ_ℓ)) v_ℓ^T A^T A v_m = (λ_m/√(λ_m λ_ℓ)) v_ℓ^T v_m = 1 when m = ℓ, and 0 when m ≠ ℓ.

These u_m also happen to be eigenvectors of A A^T, as

A A^T u_m = (1/√λ_m) A A^T A v_m = √λ_m A v_m = λ_m u_m.

Now let u_{R+1}, . . . , u_M be an orthobasis for the null space of U^T — concatenating these two sets into u_1, . . . , u_M forms an orthobasis for all of R^M.


Let V = [ v_1 v_2 · · · v_R ]. In addition, let

V_0 = [ v_{R+1} v_{R+2} · · · v_N ],    V_full = [ V  V_0 ],

and

U_0 = [ u_{R+1} u_{R+2} · · · u_M ],    U_full = [ U  U_0 ].

It should be clear that V_full is an N × N orthogonal matrix and U_full is an M × M orthogonal matrix. Consider the M × N matrix U_full^T A V_full — the entry in the mth row and nth column of this matrix is

(U_full^T A V_full)[m, n] = u_m^T A v_n = √λ_n u_m^T u_n for n = 1, . . . , R, and 0 for n = R + 1, . . . , N
                          = √λ_n when m = n = 1, . . . , R, and 0 otherwise.

Thus

U_full^T A V_full = Σ_full,

where

Σ_full[m, n] = √λ_n when m = n = 1, . . . , R, and 0 otherwise.

Since U_full U_full^T = I and V_full V_full^T = I, we have

A = U_full Σ_full V_full^T.

Since Σ_full is non-zero only in the first R locations along its main diagonal, the above reduces to

A = U Σ V^T,    Σ = diag(√λ_1, √λ_2, . . . , √λ_R).
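The construction in this proof translates directly into code. The sketch below is my own example (the rank tolerance 1e-10 is an arbitrary choice): it builds V from the eigenvectors of A^T A, sets u_m = A v_m / √λ_m, and checks that U Σ V^T reproduces A.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, R = 7, 5, 3
A = rng.standard_normal((M, R)) @ rng.standard_normal((R, N))  # a rank-R matrix

# Eigendecomposition of the PSD matrix A^T A; columns of W are the v_n.
lam, W = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]            # sort eigenvalues in decreasing order
lam, W = lam[order], W[:, order]

keep = lam > 1e-10 * lam[0]              # the R strictly positive eigenvalues
V = W[:, keep]                           # N x R
sigma = np.sqrt(lam[keep])               # singular values sqrt(lambda_r)
U = (A @ V) / sigma                      # u_m = A v_m / sqrt(lambda_m)

print(np.allclose(U.T @ U, np.eye(U.shape[1])))     # the u_m are orthonormal
print(np.allclose((U * sigma) @ V.T, A))            # A = U Sigma V^T
print(np.allclose(sigma, np.linalg.svd(A, compute_uv=False)[:sigma.size]))
```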


The Least-Squares Problem

We can use the SVD to “solve” the general system of linear equations

y = Ax

where y ∈ R^M, x ∈ R^N, and A is an M × N matrix.

Given y, we want to find x in such a way that

1. when there is a unique solution, we return it;

2. when there is no solution, we return something reasonable;

3. when there are an infinite number of solutions, we choose one to return in a “smart” way.

The least-squares framework revolves around finding an x that minimizes the length of the residual

r = y −Ax.

That is, we want to solve the optimization problem

minimize_{x ∈ R^N}  ‖y − A x‖_2^2,     (1)

where ‖ · ‖_2 is the standard Euclidean norm. We will see that the SVD of A,

A = U Σ V^T,     (2)

plays a pivotal role in solving this problem.

To start, note that we can write any x ∈ R^N as

x = V α + V_0 α_0.     (3)


Here, V is the N × R matrix appearing in the SVD decomposition (2), and V_0 is an N × (N − R) matrix whose columns are orthogonal to one another and to the columns in V. We have the relations

V^T V = I,    V_0^T V_0 = I,    V^T V_0 = 0.

You can think of V_0 as an orthobasis for the null space of A. Of course, V_0 is not unique, as there are many orthobases for Null(A), but any such set of vectors will serve our purposes here. The decomposition (3) is possible since Range(A^T) and Null(A) partition R^N for any M × N matrix A. Taking

α = V^T x,    α_0 = V_0^T x,

we see that (3) holds since

x = V V^T x + V_0 V_0^T x = (V V^T + V_0 V_0^T) x = x,

where we have made use of the fact that V V^T + V_0 V_0^T = I, because V V^T and V_0 V_0^T are ortho-projectors onto complementary subspaces² of R^N. So we can solve for x ∈ R^N by solving for the pair α ∈ R^R, α_0 ∈ R^{N−R}.

² Subspaces S_1 and S_2 are complementary in R^N if S_1 ⊥ S_2 (everything in S_1 is orthogonal to everything in S_2) and S_1 ⊕ S_2 = R^N. You can think of S_1, S_2 as a partition of R^N into two orthogonal subspaces.

Similarly, we can decompose y as

y = U β + U_0 β_0,     (4)

where U is the M × R matrix from the SVD decomposition, and U_0 is an M × (M − R) complementary orthogonal basis. Again,

U^T U = I,    U_0^T U_0 = I,    U^T U_0 = 0,


and we can think of U_0 as an orthogonal basis for everything in R^M that is not in the range of A. As before, we can calculate the decomposition above using

β = U^T y,    β_0 = U_0^T y.

Using the decompositions (2), (3), and (4) for A, x, and y, we can write the residual r = y − A x as

r = U β + U_0 β_0 − U Σ V^T (V α + V_0 α_0)
  = U β + U_0 β_0 − U Σ α        (since V^T V = I and V^T V_0 = 0)
  = U_0 β_0 + U (β − Σ α).

We want to choose α that minimizes the energy of r:

‖r‖_2^2 = ⟨U_0 β_0 + U (β − Σα), U_0 β_0 + U (β − Σα)⟩
        = ⟨U_0 β_0, U_0 β_0⟩ + 2⟨U_0 β_0, U (β − Σα)⟩ + ⟨U (β − Σα), U (β − Σα)⟩
        = ‖β_0‖_2^2 + ‖β − Σα‖_2^2,

where the last equality comes from the facts that U_0^T U_0 = I, U^T U = I, and U^T U_0 = 0. We have no control over ‖β_0‖_2^2, since it is determined entirely by our observations y. Therefore, our problem has been reduced to finding α that minimizes the second term ‖β − Σα‖_2^2 above, which is non-negative. We can make it zero (i.e. as small as possible) by taking

α = Σ^{-1} β.

Finally, the x which minimizes the residual (solves (1)) is

x = V α = V Σ^{-1} β = V Σ^{-1} U^T y.     (5)


Thus we can calculate the solution to (1) simply by applying the linear operator V Σ^{-1} U^T to the input data y. There are two interesting facts about the solution x in (5):

1. When y ∈ span({u_1, . . . , u_R}), we have β_0 = U_0^T y = 0, and so the residual r = 0. In this case, there is at least one exact solution, and the one we choose satisfies A x = y.

2. Note that if R < N, then the solution is not unique. In this case, V_0 has at least one column, and any part of a vector x in the range of V_0 is not seen by A, since

A V_0 α_0 = U Σ V^T V_0 α_0 = 0    (since V^T V_0 = 0).

As such,

x′ = x + V_0 α_0

for any α_0 ∈ R^{N−R} will have exactly the same residual, since A x′ = A x. In this case, our solution x is the solution with smallest norm, since

‖x′‖_2^2 = ⟨x + V_0 α_0, x + V_0 α_0⟩
         = ⟨x, x⟩ + 2⟨x, V_0 α_0⟩ + ⟨V_0 α_0, V_0 α_0⟩
         = ‖x‖_2^2 + 2⟨V Σ^{-1} U^T y, V_0 α_0⟩ + ‖α_0‖_2^2    (since V_0^T V_0 = I)
         = ‖x‖_2^2 + ‖α_0‖_2^2    (since V^T V_0 = 0),

which is minimized by taking α_0 = 0.

To summarize, x = V Σ^{-1} U^T y has the desired properties stated at the beginning of this module, since

1. when y = Ax has a unique exact solution, it must be x,

2. when an exact solution is not available, x is the solution to (1),


3. when there are an infinite number of minimizers to (1), x is the one with smallest norm.

Because the matrix V Σ^{-1} U^T gives us such an elegant solution to this problem, we give it a special name: the pseudo-inverse.

The Pseudo-Inverse

The pseudo-inverse of a matrix A with singular value decomposition (SVD) A = U Σ V^T is

A† = V Σ^{-1} U^T.     (6)

Other names for A† include natural inverse, Lanczos inverse, and Moore-Penrose inverse.

Given an observation y, taking x = A† y gives us the least-squares solution to y = A x. The pseudo-inverse A† always exists, since every matrix (with rank R) has an SVD decomposition A = U Σ V^T with Σ an R × R diagonal matrix with Σ[r, r] > 0.
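As a numerical sanity check (the rank-deficient example below is made up), the SVD formula for A†, numpy's built-in pinv, and the minimum-norm least-squares solution returned by lstsq all agree:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, R = 8, 6, 4
A = rng.standard_normal((M, R)) @ rng.standard_normal((R, N))  # rank-deficient
y = rng.standard_normal(M)

# A_dag = V Sigma^{-1} U^T, keeping only the non-zero singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
keep = s > 1e-10 * s[0]
A_dag = (Vt[keep].T / s[keep]) @ U[:, keep].T

x = A_dag @ y
print(np.allclose(A_dag, np.linalg.pinv(A)))                 # matches pinv
print(np.allclose(x, np.linalg.lstsq(A, y, rcond=None)[0]))  # least-squares solution

# Adding anything from Null(A) leaves the residual unchanged but increases the norm.
V0 = Vt[~keep].T                                             # orthobasis for Null(A)
x_alt = x + V0 @ rng.standard_normal(V0.shape[1])
print(np.allclose(A @ x, A @ x_alt))
print(np.linalg.norm(x) < np.linalg.norm(x_alt))
```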

When A is full rank (R = min(M, N)), then we can calculate the pseudo-inverse without using the SVD. There are three cases:

• When A is square and invertible (R = M = N), then

A† = A^{-1}.

This is easy to check, as here

A = U Σ V^T   where both U, V are N × N,

and since in this case V V^T = V^T V = I and U U^T = U^T U = I,

A† A = V Σ^{-1} U^T U Σ V^T
     = V Σ^{-1} Σ V^T
     = V V^T
     = I.

Similarly, A A† = I, and so A† is both a left and right inverse of A, and thus A† = A^{-1}.

• When A has more rows than columns and has full column rank (R = N ≤ M), then A^T A is invertible, and

A† = (A^T A)^{-1} A^T.     (7)

This type of A is “tall and skinny,” and its columns are linearly independent. To verify equation (7), recall that

A^T A = V Σ U^T U Σ V^T = V Σ^2 V^T,

and so

(A^T A)^{-1} A^T = V Σ^{-2} V^T V Σ U^T = V Σ^{-1} U^T,

which is exactly the content of (6).


• When A has more columns than rows and has full row rank (R = M ≤ N), then A A^T is invertible, and

A† = A^T (A A^T)^{-1}.     (8)

This occurs when A is “short and fat,” and its rows are linearly independent. To verify equation (8), recall that

A A^T = U Σ V^T V Σ U^T = U Σ^2 U^T,

and so

A^T (A A^T)^{-1} = V Σ U^T U Σ^{-2} U^T = V Σ^{-1} U^T,

which again is exactly (6).
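Both special-case formulas are easy to confirm numerically; here is a small sketch (arbitrary example matrices) comparing them to numpy's pinv:

```python
import numpy as np

rng = np.random.default_rng(3)

A_tall = rng.standard_normal((8, 3))       # full column rank (with probability one)
left = np.linalg.inv(A_tall.T @ A_tall) @ A_tall.T
print(np.allclose(left, np.linalg.pinv(A_tall)))    # (A^T A)^{-1} A^T = A_dag

A_fat = rng.standard_normal((3, 8))        # full row rank
right = A_fat.T @ np.linalg.inv(A_fat @ A_fat.T)
print(np.allclose(right, np.linalg.pinv(A_fat)))    # A^T (A A^T)^{-1} = A_dag
```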

A† is as close to an inverse of A as possible

As discussed above, when A is square and invertible, A† is exactly the inverse of A. When A is not square, we can ask if there is a better right or left inverse. We will argue that there is not.

Left inverse   Given y = A x, we would like A† y = A† A x = x for any x. That is, we would like A† to be a left inverse of A: A† A = I. Of course, this is not always possible, especially when A has more columns than rows, M < N. But we can ask if any other matrix H comes closer to being a left inverse than A†. To find the “best” left inverse, we look for the matrix which minimizes

min_{H ∈ R^{N×M}}  ‖H A − I‖_F^2.     (9)

Here, ‖ · ‖_F is the Frobenius norm, defined for an N × M matrix Q as the sum of the squares of the entries:³

‖Q‖_F^2 = ∑_{m=1}^{N} ∑_{n=1}^{M} |Q[m, n]|^2.

With (9), we are finding H such that H A is as close to the identity as possible in the least-squares sense.

The pseudo-inverse A† minimizes (9). To see this, recognize (see the exercise below) that the solution H to (9) must obey

A A^T H^T = A.     (10)

We can see that this is indeed true for H = A†:

A A^T A†^T = U Σ V^T V Σ U^T U Σ^{-1} V^T = U Σ V^T = A.

So there is no N × M matrix that is closer to being a left inverse than A†.

³ It is also true that ‖Q‖_F^2 is the sum of the squares of the singular values of Q: ‖Q‖_F^2 = σ_1^2 + · · · + σ_p^2. This is something that you will prove on the next homework.


Right inverse   If we re-apply A to our solution x = A† y, we would like it to be as close as possible to our observations y. That is, we would like A A† to be as close to the identity as possible. Again, achieving this goal exactly is not always possible, especially if A has more rows than columns. But we can attempt to find the “best” right inverse, in the least-squares sense, by solving

minimize_{H ∈ R^{N×M}}  ‖A H − I‖_F^2.     (11)

The solution H to (11) (see the exercise below) must obey

A^T A H = A^T.     (12)

Again, we show that A† satisfies (12), and hence is a minimizer to (11):

A^T A A† = V Σ^2 V^T V Σ^{-1} U^T = V Σ U^T = A^T.

Moral: A† = V Σ^{-1} U^T is as close (in the least-squares sense) to an inverse of A as you could possibly have.

Exercise:

1. Show that the minimizer H to (9) must obey (10). Do this by using the fact that the derivative of the functional ‖H A − I‖_F^2 with respect to an entry H[k, ℓ] of H must obey

∂‖H A − I‖_F^2 / ∂H[k, ℓ] = 0,    for all 1 ≤ k ≤ N, 1 ≤ ℓ ≤ M,

for H to be a solution to (9). Do the same for (11) and (12).


Stability Analysis of the Pseudo-Inverse

We have seen that if we make indirect observations y ∈ R^M of an unknown vector x_0 ∈ R^N through an M × N matrix A, y = A x_0, then applying the pseudo-inverse of A gives us the least-squares estimate of x_0:

x_ls = A† y = V Σ^{-1} U^T y,

where A = U Σ V^T is the singular value decomposition (SVD) of A.

We will now discuss what happens if our measurements contain noise — the analysis here will be very similar to when we looked at the stability of solving square sym+def systems, and in fact this is one of the main reasons we introduced the SVD.

Suppose we observe

y = A x_0 + e,

where e ∈ R^M is an unknown perturbation. Say that we again apply the pseudo-inverse to y in an attempt to recover x_0:

x_ls = A† y = A† A x_0 + A† e.

What effect does the presence of the noise vector e have on our estimate of x_0? We answer this question by comparing x_ls to the reconstruction we would obtain if we used standard least-squares on perfectly noise-free observations y_clean = A x_0. This noise-free reconstruction can be written as

x_pinv = A† y_clean = A† A x_0
       = V Σ^{-1} U^T U Σ V^T x_0
       = V V^T x_0
       = ∑_{r=1}^{R} ⟨x_0, v_r⟩ v_r.

The vector x_pinv is the orthogonal projection of x_0 onto the row space (everything orthogonal to the null space) of A. If A has full column rank (R = N), then x_pinv = x_0. If not, then the application of A destroys the part of x_0 that is not in x_pinv, and so we only attempt to recover the “visible” components. In some sense, x_pinv contains all of the components of x_0 that A does not completely remove, and has them preserved perfectly.

The reconstruction error (relative to x_pinv) is

‖x_ls − x_pinv‖_2^2 = ‖A† e‖_2^2 = ‖V Σ^{-1} U^T e‖_2^2.     (13)

Now suppose for a moment that the error has unit norm, ‖e‖_2^2 = 1. Then the worst case for (13) is given by

maximize_{e ∈ R^M}  ‖V Σ^{-1} U^T e‖_2^2   subject to   ‖e‖_2 = 1.

Since the columns of U are orthonormal, ‖U^T e‖_2^2 ≤ ‖e‖_2^2, and the above is equivalent to

max_{β ∈ R^R: ‖β‖_2 = 1}  ‖V Σ^{-1} β‖_2^2.     (14)


Also, for any vector z ∈ R^R, we have

‖V z‖_2^2 = ⟨V z, V z⟩ = ⟨z, V^T V z⟩ = ⟨z, z⟩ = ‖z‖_2^2,

since the columns of V are orthonormal. So we can simplify (14) to

maximize_{β ∈ R^R}  ‖Σ^{-1} β‖_2^2   subject to   ‖β‖_2 = 1.

The worst case β (you should verify this at home) will have a 1 in the entry corresponding to the largest entry in Σ^{-1}, and will be zero everywhere else. Thus

max_{β ∈ R^R: ‖β‖_2 = 1}  ‖Σ^{-1} β‖_2^2 = max_{r=1,...,R} σ_r^{-2} = 1/σ_R^2.

(Recall that by convention, we order the singular values so that σ_1 ≥ σ_2 ≥ · · · ≥ σ_R.)

Returning to the reconstruction error (13), we now see that

‖x_ls − x_pinv‖_2^2 = ‖V Σ^{-1} U^T e‖_2^2 ≤ (1/σ_R^2) ‖e‖_2^2.

Since U is an M × R matrix, it is possible when R < M that the reconstruction error is zero. This happens when e is orthogonal to every column of U, i.e. U^T e = 0. Putting this together with the work above means

0 ≤ (1/σ_1^2) ‖U^T e‖_2^2 ≤ ‖x_ls − x_pinv‖_2^2 ≤ (1/σ_R^2) ‖U^T e‖_2^2 ≤ (1/σ_R^2) ‖e‖_2^2.

Notice that if σ_R is small, the worst case reconstruction error can be very bad.
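The worst case is easy to reproduce numerically. In the sketch below (an artificial A whose singular values I chose by hand), unit-norm noise aligned with u_R is amplified by 1/σ_R, while noise orthogonal to the range of A is ignored entirely:

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 10, 4
U, _ = np.linalg.qr(rng.standard_normal((M, N)))    # M x N, orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((N, N)))    # N x N orthogonal
sigma = np.array([5.0, 2.0, 1.0, 1e-4])             # sigma_R is tiny
A = (U * sigma) @ V.T
A_dag = np.linalg.pinv(A)

e_worst = U[:, -1]                        # unit-norm noise aligned with u_R
e_invisible = rng.standard_normal(M)
e_invisible -= U @ (U.T @ e_invisible)    # project out Range(A)
e_invisible /= np.linalg.norm(e_invisible)

print(np.linalg.norm(A_dag @ e_worst))      # about 1/sigma_R = 1e4
print(np.linalg.norm(A_dag @ e_invisible))  # essentially zero
```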


We can also relate the “average case” error to the singular values. Say that e is additive Gaussian white noise, that is, each entry e[m] is a random variable independent of all the other entries, and distributed

e[m] ∼ Normal(0, ν^2).

Then, as we have argued before, the average measurement error is

E[‖e‖_2^2] = M ν^2,

and the average reconstruction error⁴ is

E[‖A† e‖_2^2] = ν^2 · trace(A†^T A†) = ν^2 · (1/σ_1^2 + 1/σ_2^2 + · · · + 1/σ_R^2)
             = (1/M) · (1/σ_1^2 + 1/σ_2^2 + · · · + 1/σ_R^2) · E[‖e‖_2^2].

Again, if σ_R is tiny, 1/σ_R^2 will dominate the sum above, and the average reconstruction error will be quite large.
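A quick Monte Carlo experiment (sizes, singular values, and noise level are all made up for illustration) matches the formula above:

```python
import numpy as np

rng = np.random.default_rng(5)
M, N = 10, 4
U, _ = np.linalg.qr(rng.standard_normal((M, N)))
V, _ = np.linalg.qr(rng.standard_normal((N, N)))
sigma = np.array([3.0, 1.0, 0.5, 0.05])
A = (U * sigma) @ V.T
A_dag = np.linalg.pinv(A)

nu, trials = 0.1, 20000
E = nu * rng.standard_normal((trials, M))            # each row is one noise draw e
avg_err = np.mean(np.sum((E @ A_dag.T) ** 2, axis=1))

print(avg_err)                            # empirical average of ||A_dag e||_2^2
print(nu**2 * np.sum(1.0 / sigma**2))     # nu^2 (1/sigma_1^2 + ... + 1/sigma_R^2)
```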

Exercise: Let D be a diagonal R × R matrix whose diagonal elements are positive. Show that the maximizer β to

maximize_{β ∈ R^R}  ‖D β‖_2^2   subject to   ‖β‖_2 = 1

has a 1 in the entry corresponding to the largest diagonal element of D, and is 0 elsewhere.

⁴ We are using the fact that if e is a vector of iid Gaussian random variables, e ∼ Normal(0, ν^2 I), then for any matrix M, E[‖M e‖_2^2] = ν^2 trace(M^T M). We will argue this carefully as part of the next homework.


Stable Reconstruction with the Truncated SVD

We have seen that if A has very small singular values and we apply the pseudo-inverse in the presence of noise, the results can be disastrous. But it doesn’t have to be this way. There are several ways to stabilize the pseudo-inverse. We start by discussing the simplest one, where we simply “cut out” the part of the reconstruction which is causing the problems.

As before, we are given noisy indirect observations of a vector x through an M × N matrix A:

y = A x + e.     (15)

The matrix A has SVD A = U Σ V^T, and pseudo-inverse A† = V Σ^{-1} U^T. We can rewrite A as a sum of rank-1 matrices:

A = ∑_{r=1}^{R} σ_r u_r v_r^T,

where R is the rank of A, the σ_r are the singular values, and u_r ∈ R^M and v_r ∈ R^N are columns of U and V, respectively. Similarly, we can write the pseudo-inverse as

A† = ∑_{r=1}^{R} (1/σ_r) v_r u_r^T.

Given y as above, we can write the least-squares estimate of x from the noisy measurements as

x_ls = A† y = ∑_{r=1}^{R} (1/σ_r) ⟨y, u_r⟩ v_r.     (16)


As we can see (and have seen before), if any one of the σ_r is very small, the least-squares reconstruction can be a disaster.

A simple way to avoid this is to simply truncate the sum (16), leaving out the terms where σ_r is too small (1/σ_r is too big). Exactly how many terms to keep depends a great deal on the application, as there are competing interests. On the one hand, we want to ensure that each of the σ_r we include has an inverse of reasonable size; on the other, we want the reconstruction to be accurate (i.e. not to deviate from the noiseless least-squares solution by too much).

We form an approximation A′ to A by taking

A′ = ∑_{r=1}^{R′} σ_r u_r v_r^T,

for some R′ < R. Again, our final answer will depend on which R′ we use, and choosing R′ is oftentimes something of an art. It is clear that the approximation A′ has rank R′. Note that the pseudo-inverse of A′ is also a truncated sum:

A′† = ∑_{r=1}^{R′} (1/σ_r) v_r u_r^T.

Given noisy data y as in (15), we reconstruct x by applying the truncated pseudo-inverse to y:

x_trunc = A′† y = ∑_{r=1}^{R′} (1/σ_r) ⟨y, u_r⟩ v_r.
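Here is a minimal sketch of this truncation (the singular values, noise level, and cutoff R′ below are arbitrary choices made for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
M, N = 20, 8
U, _ = np.linalg.qr(rng.standard_normal((M, N)))
V, _ = np.linalg.qr(rng.standard_normal((N, N)))
sigma = np.array([10.0, 8.0, 5.0, 2.0, 1.0, 0.5, 1e-3, 1e-4])   # two tiny sigma_r
A = (U * sigma) @ V.T

x0 = rng.standard_normal(N)
y = A @ x0 + 0.01 * rng.standard_normal(M)           # noisy measurements

Us, s, Vts = np.linalg.svd(A, full_matrices=False)
Rp = 6                                               # keep sigma_1, ..., sigma_R'
x_ls    = Vts.T @ ((Us.T @ y) / s)                   # full pseudo-inverse: unstable
x_trunc = Vts[:Rp].T @ ((Us[:, :Rp].T @ y) / s[:Rp]) # truncated pseudo-inverse

print(np.linalg.norm(x_ls - x0))     # ruined by the 1/sigma_r of the tiny sigma_r
print(np.linalg.norm(x_trunc - x0))  # loses the components along v_7, v_8 but stays stable
```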


How good is this reconstruction? To answer this question, we will compare it to the noiseless least-squares reconstruction x_pinv = A† y_clean, where y_clean = A x are “noiseless” measurements of x. The difference between the two is the reconstruction error (relative to x_pinv):

x_trunc − x_pinv = A′† y − A† A x
                 = A′† A x + A′† e − A† A x
                 = (A′† − A†) A x + A′† e.

Proceeding further, we can write the matrix A′† − A† as

A′† − A† = − ∑_{r=R′+1}^{R} (1/σ_r) v_r u_r^T,

and so the first term in the reconstruction error can be written as

(A′† − A†) A x = − ∑_{r=R′+1}^{R} (1/σ_r) ⟨A x, u_r⟩ v_r
               = − ∑_{r=R′+1}^{R} (1/σ_r) ⟨ ∑_{j=1}^{R} σ_j ⟨x, v_j⟩ u_j, u_r ⟩ v_r
               = − ∑_{r=R′+1}^{R} (1/σ_r) ∑_{j=1}^{R} σ_j ⟨x, v_j⟩ ⟨u_j, u_r⟩ v_r
               = − ∑_{r=R′+1}^{R} ⟨x, v_r⟩ v_r     (since ⟨u_r, u_j⟩ = 0 unless j = r).

The second term in the reconstruction error can also be expanded against the v_r:

A′† e = ∑_{r=1}^{R′} (1/σ_r) ⟨e, u_r⟩ v_r.


Combining these expressions, the reconstruction error can be written

x_trunc − x_pinv = ∑_{r=1}^{R′} (1/σ_r) ⟨e, u_r⟩ v_r  −  ∑_{r=R′+1}^{R} ⟨x, v_r⟩ v_r
                 = Noise error + Approximation error.

Since the v_r are mutually orthogonal, and the two sums run over disjoint index sets, the noise error and the approximation error will be orthogonal. Also

‖x_trunc − x_pinv‖_2^2 = ‖Noise error‖_2^2 + ‖Approximation error‖_2^2
                       = ∑_{r=1}^{R′} (1/σ_r^2) |⟨e, u_r⟩|^2 + ∑_{r=R′+1}^{R} |⟨x, v_r⟩|^2.

The reconstruction error, then, is signal dependent and will depend on how much of the vector x is concentrated in the subspace spanned by v_{R′+1}, . . . , v_R. We will lose everything in this subspace; if it contains a significant part of x, then there is not much least-squares can do for you.

The worst-case noise error occurs when e is aligned with u_{R′}:

‖Noise error‖_2^2 = ∑_{r=1}^{R′} (1/σ_r^2) |⟨e, u_r⟩|^2 ≤ (1/σ_{R′}^2) · ‖e‖_2^2.

As seen before, if the error e is random, this bound is a bit pessimistic. Specifically, if each entry of e is an independent identically distributed Normal random variable with mean zero and variance ν^2, then the expected noise error in the reconstruction will be

E[‖Noise error‖_2^2] = (1/M) (1/σ_1^2 + 1/σ_2^2 + · · · + 1/σ_{R′}^2) · E[‖e‖_2^2].


Stable Reconstruction using Tikhonov Regularization

Tikhonov⁵ regularization is another way to stabilize the least-squares recovery. It has the nice features that: 1) it can be interpreted using optimization, and 2) it can be computed without direct knowledge of the SVD of A.

Recall that we motivated the pseudo-inverse by showing that x_ls = A† y is a solution to

minimize_{x ∈ R^N}  ‖y − A x‖_2^2.     (17)

When A has full column rank, x_ls is the unique solution; otherwise it is the solution with smallest energy. When A has full column rank but has singular values which are very small, huge variations in x (in directions of the singular vectors v_k corresponding to the tiny σ_k) can have very little effect on the residual ‖y − A x‖_2^2. As such, the solution to (17) can have wildly inaccurate components in the presence of even mild noise.

One way to counteract this problem is to modify (17) with a regularization term that penalizes the size of the solution ‖x‖_2^2 as well as the residual error ‖y − A x‖_2^2:

minimize_{x ∈ R^N}  ‖y − A x‖_2^2 + δ‖x‖_2^2.     (18)

The parameter δ > 0 gives us a trade-off between accuracy and regularization; we want to choose δ small enough so that the residual for the solution of (18) is close to that of (17), and large enough so that the problem is well-conditioned.

⁵ Andrey Tikhonov (1906-1993) was a 20th century Russian mathematician.

Just as with (17), which is solved by applying the pseudo-inverse to y, we can write the solution to (18) in closed form. To see this, recall that we can decompose any x ∈ R^N as

x = V α + V_0 α_0,

where V is the N × R matrix (with orthonormal columns) used in the SVD of A, and V_0 is an N × (N − R) matrix whose columns are an orthogonal basis for the null space of A. This means that the columns of V_0 are orthogonal to each other and to all of the columns of V. Similarly, we can decompose y as

y = U β + U_0 β_0,

where U is the M × R matrix used in the SVD of A, and the columns of U_0 are an orthogonal basis for the left null space of A (everything in R^M that is not in the range of A).

For any x, we can write

y − A x = U β + U_0 β_0 − U Σ V^T (V α + V_0 α_0)
        = U (β − Σ α) + U_0 β_0.

Since the columns of U are orthonormal, U^T U = I, and also U_0^T U_0 = I and U^T U_0 = 0, we have

‖y − A x‖_2^2 = ⟨U (β − Σα) + U_0 β_0, U (β − Σα) + U_0 β_0⟩ = ‖β − Σα‖_2^2 + ‖β_0‖_2^2,

and

‖x‖_2^2 = ‖α‖_2^2 + ‖α_0‖_2^2.


Using these facts, we can write the functional in (18) as

‖y − A x‖_2^2 + δ‖x‖_2^2 = ‖β − Σα‖_2^2 + δ‖α‖_2^2 + δ‖α_0‖_2^2.     (19)

We want to choose α and α_0 that minimize (19). It is clear that, just as in the standard least-squares problem, we need α_0 = 0. The part of the functional that depends on α can be rewritten as

‖β − Σα‖_2^2 + δ‖α‖_2^2 = ∑_{k=1}^{R} (β[k] − σ_k α[k])^2 + δ α[k]^2.     (20)

We can minimize this sum simply by minimizing each term independently. Since

d/dα[k] [ (β[k] − σ_k α[k])^2 + δ α[k]^2 ] = −2 β[k] σ_k + 2 σ_k^2 α[k] + 2 δ α[k],

we need

α[k] = (σ_k / (σ_k^2 + δ)) β[k].

Putting this back in vector form, (20) is minimized by

α_tik = (Σ^2 + δI)^{-1} Σ β,

and so the minimizer to (18) is

x_tik = V α_tik = V (Σ^2 + δI)^{-1} Σ U^T y.     (21)

We can get a better feel for what Tikhonov regularization is doing by comparing it directly to the pseudo-inverse. The least-squares reconstruction x_ls can be written as

x_ls = V Σ^{-1} U^T y = ∑_{r=1}^{R} (1/σ_r) ⟨y, u_r⟩ v_r,

while the Tikhonov reconstruction x_tik derived above is

x_tik = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ)) ⟨y, u_r⟩ v_r.     (22)

Notice that when σ_r is much larger than δ,

σ_r / (σ_r^2 + δ) ≈ 1/σ_r,   σ_r ≫ δ,

but when σ_r is small,

σ_r / (σ_r^2 + δ) ≈ 0,   σ_r ≪ δ.

Thus the Tikhonov reconstruction modifies the important parts (components where the σ_r are large) of the pseudo-inverse very little, while ensuring that the unimportant parts (components where the σ_r are small) affect the solution only by a very small amount. This damping of the singular values is illustrated below.

[Figure: the damped multipliers σ_r/(σ_r^2 + δ) plotted versus σ_r, for σ_r ranging from 0 to 3.]


Above, we see the damped multipliers σ_r/(σ_r^2 + δ) versus σ_r for δ = 0.1 (blue), δ = 0.05 (red), and δ = 0.01 (green). The black dotted line is 1/σ_r, the least-squares multiplier. Notice that for large σ_r (σ_r > 2√δ, say), the damping has almost no effect.

This damping makes the Tikhonov reconstruction exceptionally stable; large multipliers never appear in the reconstruction (22). In fact, it is easy to check that

σ_r / (σ_r^2 + δ) ≤ 1/(2√δ)

no matter the value of σ_r.
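A short sketch of (22) (reusing the kind of artificial example from the truncated-SVD section; δ is chosen by hand) shows the damped multipliers at work and the 1/(2√δ) bound:

```python
import numpy as np

rng = np.random.default_rng(7)
M, N = 20, 8
U, _ = np.linalg.qr(rng.standard_normal((M, N)))
V, _ = np.linalg.qr(rng.standard_normal((N, N)))
sigma = np.array([10.0, 8.0, 5.0, 2.0, 1.0, 0.5, 1e-3, 1e-4])
A = (U * sigma) @ V.T

x0 = rng.standard_normal(N)
y = A @ x0 + 0.01 * rng.standard_normal(M)

Us, s, Vts = np.linalg.svd(A, full_matrices=False)
beta = Us.T @ y
delta = 1e-2

x_ls  = Vts.T @ (beta / s)                     # multipliers 1/sigma_r
x_tik = Vts.T @ (s / (s**2 + delta) * beta)    # damped multipliers sigma_r/(sigma_r^2 + delta)

print(np.max(s / (s**2 + delta)), 1 / (2 * np.sqrt(delta)))   # never exceeds 1/(2 sqrt(delta))
print(np.linalg.norm(x_ls - x0), np.linalg.norm(x_tik - x0))  # Tikhonov is far more stable
```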

Tikhonov Error Analysis

Given noisy observations y = A x_0 + e, how well will Tikhonov regularization work? The answer to this question depends on multiple factors, including the choice of δ, the nature of the perturbation e, and how well x_0 can be approximated using a linear combination of the singular vectors v_r corresponding to the large (relative to δ) singular values. Since a closed-form expression for the solution to (18) exists, we can quantify these trade-offs precisely.

We compare the Tikhonov reconstruction to the reconstruction we would obtain if we used standard least-squares on perfectly noise-free observations y_clean = A x_0. This noise-free reconstruction can be written as

x_pinv = A† y_clean = A† A x_0
       = V Σ^{-1} U^T U Σ V^T x_0
       = V V^T x_0
       = ∑_{r=1}^{R} ⟨x_0, v_r⟩ v_r.

The vector x_pinv is the orthogonal projection of x_0 onto the row space (everything orthogonal to the null space) of A. If A has full column rank, then x_pinv = x_0. If not, then the application of A destroys the part of x_0 that is not in x_pinv, and so we only attempt to recover the “visible” components. In some sense, x_pinv contains all of the components of x_0 that we could ever hope to recover, and has them preserved perfectly.

The Tikhonov regularized solution is given by

x_tik = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ)) ⟨y, u_r⟩ v_r
      = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ)) ⟨e, u_r⟩ v_r + ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ)) ⟨A x_0, u_r⟩ v_r
      = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ)) ⟨e, u_r⟩ v_r + ∑_{r=1}^{R} (σ_r^2 / (σ_r^2 + δ)) ⟨x_0, v_r⟩ v_r,

and so the reconstruction error, relative to the best possible reconstruction x_pinv, is


x_tik − x_pinv = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ)) ⟨e, u_r⟩ v_r + ∑_{r=1}^{R} (σ_r^2/(σ_r^2 + δ) − 1) ⟨x_0, v_r⟩ v_r
               = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ)) ⟨e, u_r⟩ v_r  −  ∑_{r=1}^{R} (δ/(σ_r^2 + δ)) ⟨x_0, v_r⟩ v_r
               = Noise error + Approximation error.

The approximation error is signal dependent, and depends on δ. Since the v_r are orthonormal,

‖Approximation error‖_2^2 = ∑_{r=1}^{R} (δ^2 / (σ_r^2 + δ)^2) |⟨x_0, v_r⟩|^2.

Note that for the components where σ_r^2 is much smaller than δ,

σ_r^2 ≪ δ  ⇒  δ^2 / (σ_r^2 + δ)^2 ≈ 1,

so this portion of the approximation error will be about the same as if we had simply truncated these components.

For large components,

σ_r^2 ≫ δ  ⇒  δ^2 / (σ_r^2 + δ)^2 ≈ δ^2 / σ_r^4,

and so this portion of the approximation error will be very small.


For the noise error energy, we have

‖Noise error‖_2^2 = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ))^2 |⟨e, u_r⟩|^2
                  ≤ (1/(4δ)) ∑_{r=1}^{R} |⟨e, u_r⟩|^2
                  ≤ (1/(4δ)) ‖e‖_2^2.

The worst-case error is more or less determined by the choice of δ. The regularization makes the effective condition number of A about 1/(2√δ); no matter how small the smallest singular value is, the noise energy will not increase by more than a factor of 1/(4δ) during the reconstruction process.

As usual, the average case error is less pessimistic. If each entry of e is an independent identically distributed Normal random variable with mean zero and variance ν^2, then the expected noise error in the reconstruction will be

E[‖Noise error‖_2^2] = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ))^2 E[|⟨e, u_r⟩|^2]
                     = ν^2 · ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ))^2
                     = (1/M) · ( ∑_{r=1}^{R} σ_r^2 / (σ_r^2 + δ)^2 ) · E[‖e‖_2^2].     (23)

Note that σ_r^2/(σ_r^2 + δ)^2 ≤ min(1/σ_r^2, 1/(4δ)), so we can think of the error in (23) as an average of the 1/σ_r^2, with the large values simply replaced by 1/(4δ).


A Closed Form Expression

Tikhonov regularization is in some sense very similar to the truncated SVD, but with one significant advantage: we do not need to explicitly calculate the SVD to solve (18). Indeed, the solution to (18) can be written as

x_tik = (A^T A + δI)^{-1} A^T y.     (24)

To see that the expression above is equivalent to (21), note that we can write A^T A as

A^T A = V Σ^2 V^T = V′ Σ′^2 V′^T,

where V′ is the N × N matrix

V′ = [ V  V_0 ],

and the N × N diagonal matrix Σ′ is simply Σ padded with zeros:

Σ′ = [ Σ  0
       0  0 ].

The verification of (24) is now straightforward:

(A^T A + δI)^{-1} A^T y = (V′ Σ′^2 V′^T + δI)^{-1} V Σ U^T y
                        = V′ (Σ′^2 + δI)^{-1} V′^T V Σ U^T y
                        = V′ (Σ′^2 + δI)^{-1} [ Σ U^T y ; 0 ]
                        = V (Σ^2 + δI)^{-1} Σ U^T y
                        = x_tik.

The expression (24) holds for all M, N, and R. We will leave it as an exercise to show that

(A^T A + δI)^{-1} A^T y = A^T (A A^T + δI)^{-1} y.
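A numerical check (not a proof) that (21), (24), and the identity above all give the same answer, on a made-up example:

```python
import numpy as np

rng = np.random.default_rng(8)
M, N = 12, 7
A = rng.standard_normal((M, N))
y = rng.standard_normal(M)
delta = 0.1

U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd    = Vt.T @ (s / (s**2 + delta) * (U.T @ y))                 # equation (21)
x_primal = np.linalg.solve(A.T @ A + delta * np.eye(N), A.T @ y)   # equation (24)
x_dual   = A.T @ np.linalg.solve(A @ A.T + delta * np.eye(M), y)   # exercise identity

print(np.allclose(x_svd, x_primal), np.allclose(x_primal, x_dual))
```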


The importance of not needing to explicitly compute the SVD is significant when we are solving large problems. When A is large (M, N > 10^5, say) it may be expensive or even impossible to construct the SVD and compute with it explicitly. However, if it has special structure (if it is sparse, for example), then it may take many fewer than MN operations to compute a matrix-vector product A x.

In these situations, a matrix-free iterative algorithm can be used to perform the inverse required in (24). A prominent example of such an algorithm is conjugate gradients, which we will see later in this course.
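As a sketch of the matrix-free idea (the example operator below is just a dense random matrix standing in for a large structured A, and scipy's conjugate-gradient solver is used ahead of the course's own treatment of CG), we can solve (A^T A + δI) x = A^T y using nothing but products with A and A^T:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(9)
M, N = 2000, 1000
A = rng.standard_normal((M, N)) / np.sqrt(M)     # stand-in for a big structured operator
y = rng.standard_normal(M)
delta = 0.1

# Only matvecs with A and A^T are needed; (A^T A + delta I) is never formed.
op = LinearOperator((N, N), matvec=lambda x: A.T @ (A @ x) + delta * x, dtype=np.float64)
x_cg, info = cg(op, A.T @ y)

x_direct = np.linalg.solve(A.T @ A + delta * np.eye(N), A.T @ y)
print(info, np.linalg.norm(x_cg - x_direct) / np.linalg.norm(x_direct))
```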

Weighted Least Squares

Standard least-squares tries to fit a vector x to a set of “measurements” y by solving

minimize_{x ∈ R^N}  ‖y − A x‖_2^2.

Now, what if some of the measurements are more reliable than others? Or, what if the errors are closely correlated between measurements?

There is a systematic way to treat both of these cases using weighted least-squares. Instead of minimizing the energy in the residual

‖r‖_2^2 = ‖y − A x‖_2^2,

we will minimize

‖W r‖_2^2 = ‖W y − W A x‖_2^2,

for some M × M weighting matrix W.

When W is a diagonal matrix,

W = diag(w_11, w_22, . . . , w_MM),

then the error we are minimizing looks like

‖W r‖_2^2 = w_11^2 r[1]^2 + w_22^2 r[2]^2 + · · · + w_MM^2 r[M]^2.

By adjusting the w_mm, we can penalize some of the components of the error more than others.

By adding off-diagonal terms, we can account for correlations in the error (we will explore this further later in these notes).

Solving

minimize_{x ∈ R^N}  ‖W r‖_2^2 = ‖W y − W A x‖_2^2

is simple. We simply use least-squares with W A as the matrix and W y as the observations:

x_wls = (W A)† W y,

where (WA)† is the pseudo-inverse of WA.

For the rest of this section, we will assume that M ≥ N (meaning that there are at least as many observations as unknowns) and that A has full column rank. This allows us to write

x_wls = (A^T W^T W A)^{-1} A^T W^T W y.
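A small sketch (made-up data and weights) checking that (W A)† W y agrees with the closed-form expression above:

```python
import numpy as np

rng = np.random.default_rng(10)
M, N = 9, 3
A = rng.standard_normal((M, N))
y = rng.standard_normal(M)
W = np.diag(rng.uniform(0.5, 2.0, size=M))   # per-measurement weights

x_wls    = np.linalg.pinv(W @ A) @ (W @ y)
x_closed = np.linalg.solve(A.T @ W.T @ W @ A, A.T @ W.T @ W @ y)
print(np.allclose(x_wls, x_closed))
```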


Example: We measure a patient’s pulse 3 times, and record

y[1] = 70, y[2] = 80, y[3] = 120.

In this case, we can take

A = [ 1  1  1 ]^T.

What is the least-squares estimate for the pulse rate x_0?

Now say that we were in a hurry when the third measurement was made, so we would like to weigh it less than the others. What is the weighted least-squares estimate when

W = diag(1, 1, w_33)?

What about the particular case when w33 = 1/2?
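If you want to check your hand calculation, a few lines of numpy will do it (this computes the estimates numerically rather than deriving them, so try the algebra first):

```python
import numpy as np

y = np.array([70.0, 80.0, 120.0])
A = np.ones((3, 1))

x_ls = np.linalg.lstsq(A, y, rcond=None)[0]          # ordinary least-squares estimate

w33 = 0.5
W = np.diag([1.0, 1.0, w33])                         # down-weight the third measurement
x_wls = np.linalg.solve(A.T @ W.T @ W @ A, A.T @ W.T @ W @ y)

print(x_ls, x_wls)
```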


Georgia Tech ECE 6250 Fall 2017; Notes by J. Romberg and M. Davenport. Last updated 23:19, October 25, 2017