IEEE TRANSACTIONS ON IMAGE PROCESSING


A Method for Compact Image Representation using Sparse Matrix and Tensor Projections onto Exemplar Orthonormal Bases

Karthik S. Gurumoorthy, Ajit Rajwade, Arunava Banerjee and Anand Rangarajan

Abstract—We present a new method for compact representation of large image datasets. Our method is based on treating small patches from a 2D image as matrices, as opposed to the conventional vectorial representation, and encoding these patches as sparse projections onto a set of exemplar orthonormal bases, which are learned a priori from a training set. The end result is a low-error, highly compact image/patch representation that has significant theoretical merits and compares favorably with existing techniques (including JPEG) on experiments involving the compression of ORL and Yale face databases, as well as a database of miscellaneous natural images. In the context of learning multiple orthonormal bases, we show the easy tunability of our method to efficiently represent patches of different complexities. Furthermore, we show that our method is extensible in a theoretically sound manner to higher-order matrices (tensors). We demonstrate applications of this theory to compression of well-known color image datasets such as the GaTech and CMU-PIE face databases, and show performance competitive with JPEG. Lastly, we also analyze the effect of image noise on the performance of our compression schemes.

Index Terms—compression, compact representation, sparse projections, singular value decomposition (SVD), higher-order singular value decomposition (HOSVD), greedy algorithm, tensor decompositions.

I. INTRODUCTION

Most conventional techniques of image analysis treat images as elements of a vector space. Lately, there has been a steady growth of literature which regards images as matrices, e.g. [1], [2], [3], [4]. As compared to a vectorial method, the matrix-based representation helps to better exploit the spatial relationships between image pixels, as an image is essentially a two-dimensional function. In the image processing community, this notion of an image has been considered in [5], in the context of image coding by using singular value decomposition (SVD). Given any matrix, say $J \in \mathbb{R}^{M_1 \times M_2}$, the SVD of the matrix expresses it as a product of a diagonal matrix $S \in \mathbb{R}^{M_1 \times M_2}$ of singular values with two orthonormal matrices $U \in \mathbb{R}^{M_1 \times M_1}$ and $V \in \mathbb{R}^{M_2 \times M_2}$. This decomposition is unique up to a sign factor on the vectors of $U$ and $V$ if the singular values are all distinct. For the specific case of natural images, it has been observed that the values along the diagonal of $S$ are rapidly decreasing, which allows for the low-error reconstruction of the image by low-rank approximations. This property of SVD, coupled with the fact that it provides the best possible lower-rank approximation to a matrix, has motivated its application

The authors are with the Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA. E-mail: {ksg,avr,arunava,anand}@cise.ufl.edu.

to image compression in [5] and [6]. In [6], the SVD technique is further combined with vector quantization for compression applications.

To the best of our knowledge, the earliest technique for obtaining a common matrix-based representation for a set of images was developed by Rangarajan in [1]. In [1], a single pair of orthonormal bases $(U, V)$ is learned from a set of some $N$ images, and each image $I_j$, $1 \le j \le N$, is represented by means of a projection $S_j$ onto these bases, i.e. in the form $I_j = U S_j V^T$. Similar ideas are developed by Ye in [2] in terms of optimal lower-rank image approximations. In [3], Yang et al. develop a technique named 2D-PCA, which computes principal components of a column-column covariance matrix of a set of images. In [7], Ding et al. present a generalization of 2D-PCA, named 2D-SVD, and some optimality properties of 2D-SVD are investigated. The work in [7] also unifies the approaches in [3] and [2], and provides a non-iterative approximate solution with bounded error to the problem of finding the single pair of common bases. In [4], He et al. develop a clustering application, in which a single set of orthonormal bases is learned in such a way that projections of a set of images onto these bases are neighborhood-preserving.

It is quite intuitive to observe that, as compared to an entire image, a small image patch is a simpler, more local entity, and hence can be more accurately represented by means of a smaller number of bases. Following this intuition, we choose to regard an image as a set of matrices (one per image patch) instead of using a single matrix for the entire image as in [1], [2]. Furthermore, there usually exists a great deal of similarity between a large number of patches in one image or across several images of a similar kind. We exploit this fact to learn a small number of full-sized orthonormal bases (as opposed to a single set of low-rank bases learned in [1], [2] and [7], or a single set of low-rank projection vectors learned in [3]) to reconstruct a set of patches from a training set by means of sparse projections with the least possible error. As we shall demonstrate later, this change of focus from learning lower-rank bases to learning full-rank bases with sparse projection matrices brings with it several significant theoretical and practical advantages. This is because it is much easier to adjust the sparsity of a projection matrix in order to meet a certain reconstruction error threshold than to adjust its rank [see Section II-A and Section III].

There exists a large body of work in the field of sparse image representation. Coding methods which express images as sparse combinations of bases using techniques such as the


discrete cosine transform (DCT) or Gabor-filter bases have been quite popular for a long period. The DCT is also an essential ingredient of the widely used JPEG image compression standard. These developments have also spurred interest in methods that learn a set of bases from a set of natural images, as well as a sparse set of coefficients to represent images in terms of those bases. Pioneering work in this area has been done in [8], with a special emphasis on learning an over-complete set of bases and their sparse combinations. Some recent contributions in the field of over-complete representations include the work in [9] by Aharon et al., which encodes image patches as a sparse linear combination of a set of dictionary vectors (learned from a training set). An important feature of all such learning-based approaches (as opposed to those that use a fixed set of bases such as DCT) is their tunability to datasets containing a specific type of images. In this paper, we develop such a learning technique, but with the key difference that our technique is matrix-based, unlike the aforementioned vector-based learning algorithms. The main advantage of our technique is as follows: we learn a small number of pairs of orthonormal bases to represent ensembles of image patches. Given any such pair, the computation of a sparse projection for any image patch (matrix) with least possible reconstruction error can be accomplished by means of a very simple and optimal greedy algorithm. On the other hand, the computation of the optimal sparse linear combination of an over-complete set of basis vectors to represent another vector is a well-known NP-hard problem [10]. We demonstrate the applicability of our algorithm to the compression of databases of face images, with favorable results in comparison to existing approaches. Our main algorithm on ensembles of 2D images or image patches was presented earlier by us in [11]. In this paper, we present more detailed comparisons, including with JPEG, and also study the effect of various parameters on our method, besides showing more detailed derivations of the theory.

In the current work, we bring out another significant extension of our previous algorithm, namely its elegant applicability to higher-order matrices (commonly, and usually mistakenly, termed tensors¹), with sound theoretical foundations. In this paper, we represent patches from color images as tensors (third-order matrices). Next, we learn a small number of orthonormal matrices and represent these patches as sparse tensor projections onto these orthonormal matrices. The tensor representation for image datasets, as such, is not new. Shashua and Levin [12] regard an ensemble of gray-scale (face) images, or a video sequence, as a single third-order tensor, and achieve compact representation by means of lower-rank approximations of the tensor. However, quite unlike the computation of the rank of a matrix, the computation of the rank of a tensor² is known to be an NP-complete problem [14]. Moreover, while the SVD is known to yield the best lower-rank approximation of a matrix in the 2D case,

¹A $N_1 \times N_2$ matrix is a tensor only when it is expressly characterized as a multilinear function from $V_1 \times V_2$ to $\mathbb{R}$, where $V_1$ and $V_2$ are $N_1$-dimensional and $N_2$-dimensional vector spaces respectively.

²The rank of a tensor is defined as the smallest number of rank-1 tensors whose linear combination gives back the tensor, with a tensor being of rank 1 if it is equal to the outer product of several vectors [13].

its higher-dimensional analog (known as higher-order SVD, or HOSVD) does not necessarily produce the best lower-rank approximation of a tensor (see the work of Lathauwer in [13] for an extensive development of the theory of HOSVD). Nevertheless, for the sake of simplicity, HOSVD has been used to produce lower-rank approximations of datasets (albeit theoretically sub-optimal ones [13]) which work well in the context of applications ranging from face recognition under changes in pose, illumination and facial expression, as in [15] by Vasilescu and Terzopoulos (though in that particular paper, each image is still represented as a vector), to dynamic texture synthesis from gray-scale and color videos in [16] by Costantini et al. In these applications, the authors demonstrate the efficacy of the multilinear representation in accounting for variability over several different modes (such as pose, illumination and expression in [15], or spatial, chromatic and temporal in [16]), as opposed to simple vector-based image representations.

An iterative scheme for lower-rank tensor approximation is developed in [13], but the corresponding energy function is susceptible to the problem of local minima. In [12], two new lower-rank approximation schemes are designed: a closed-form solution under the assumption that the actual tensor rank equals the number of images (which usually may not be true in practice), or else an iterative approximation in other cases. Although the latter iterative algorithm is proved to be convergent, it is not guaranteed to yield a global minimum. Wang and Ahuja [17] also develop a new iterative rank-$R$ approximation scheme using an alternating least-squares formulation, and also present another iterative method that is specially tuned for the case of third-order tensors. Very recently, Ding et al. [18] have derived upper and lower bounds for the error due to low-rank truncation of the core tensor obtained in HOSVD (which is a closed-form decomposition), and use this theory to find a single common triple of orthonormal bases to represent a database of color images (represented as 3D arrays) with minimal error in the $L_2$ sense. Nevertheless, there is still no method of directly obtaining the optimal lower-rank tensor approximation which is non-iterative in nature. Furthermore, the error bounds in [18] are applicable only when the entire set of images is coded using a single common triple of orthonormal bases. Likewise, the algorithms presented in [17] and [12] also seek to find a single common basis.

The algorithm we present here differs from the aforementioned ones in the following ways: (1) All the aforementioned methods learn a common basis, which may not be sufficient to account for the variability in the images. We do not learn a single common basis-tuple, but a set of some $K$ orthonormal bases to represent some $N$ patches in the database of images, with each image and each patch being represented as a higher-order matrix. Note that $K \ll N$. (2) We do not seek to obtain lower-rank approximations to a tensor. Rather, we represent the tensor as a sparse projection onto a chosen tuple of orthonormal bases. This sparse projection, as we shall show later on, turns out to be optimal and can be obtained by a very simple greedy algorithm. We use our extension for the purpose of compression of a database of color images, with promising results. Note that this sparsity-based approach


has advantages in terms of coding efficiency as compared to methods that look for lower-rank approximations, just as in the 2D case mentioned before.

Our paper is organized as follows. We describe the theory and the main algorithm for 2D datasets in Section II. Section III presents experimental results and comparisons with existing techniques. Section IV presents our extension to higher-order matrices, with experimental results and comparisons in Section V. We conclude in Section VI.

II. THEORY: 2D IMAGES

Consider a set of images, each of size $M_1 \times M_2$. We divide each image into non-overlapping patches of size $m_1 \times m_2$, $m_1 \ll M_1$, $m_2 \ll M_2$, and treat each patch as a separate matrix. Exploiting the similarity inherent in these patches, we effectively represent them by means of sparse projections onto (appropriately created) orthonormal bases, which we call exemplar bases. We learn these exemplars a priori from a set of training image patches. Before describing the learning procedure, we first explain the mathematical structure of the exemplars.

A. Exemplar Bases and Sparse Projections

Let $P \in \mathbb{R}^{m_1 \times m_2}$ be an image patch. Using singular value decomposition (SVD), we can represent $P$ as a combination of orthonormal bases $U \in \mathbb{R}^{m_1 \times m_1}$ and $V \in \mathbb{R}^{m_2 \times m_2}$ in the form $P = U S V^T$, where $S \in \mathbb{R}^{m_1 \times m_2}$ is a diagonal matrix of singular values. However, $P$ can also be represented as a combination of any set of orthonormal bases $U$ and $V$, different from those obtained from the SVD of $P$. In this case, we have $P = U S V^T$ where $S$ turns out to be a non-diagonal matrix³. Contemporary SVD-based compression methods leverage the fact that the SVD provides the best low-rank approximation to a matrix [6], [19]. We choose to depart from this notion, and instead answer the following question: What sparse matrix $W \in \mathbb{R}^{m_1 \times m_2}$ will reconstruct $P$ from a pair of orthonormal bases $U$ and $V$ with the least error $\|P - U W V^T\|^2$? Sparsity is quantified by an upper bound $T$ on the $L_0$ norm of $W$, i.e. on the number of non-zero elements in $W$ (denoted as $\|W\|_0$)⁴. We prove that the optimal $W$ with this sparsity constraint is obtained by nullifying the least (in absolute value) $m_1 m_2 - T$ elements of the estimated projection matrix $S = U^T P V$. Owing to the orthonormality of $U$ and $V$, this simple greedy algorithm turns out to be optimal, as we prove below.

Theorem 1: Given a pair of orthonormal bases $(U, V)$, the optimal sparse projection matrix $W$ with $\|W\|_0 = T$ is obtained by setting to zero the $m_1 m_2 - T$ elements of the matrix $S = U^T P V$ having least absolute value.

Proof: We have $P = U S V^T$. The error in reconstructing a patch $P$ using some other matrix $W$ is $e = \|U(S - W)V^T\|^2 = \|S - W\|^2$, as $U$ and $V$ are orthonormal. Let

³The decomposition $P = U S V^T$ exists for any $P$ even if $U$ and $V$ are not orthonormal. We still follow orthonormality constraints to facilitate optimization and coding. See Sections II-C and III-B.

⁴See Section III for the merits of our sparsity-based approach over the low-rank approach.

$I_1 = \{(i,j) \,|\, W_{ij} = 0\}$ and $I_2 = \{(i,j) \,|\, W_{ij} \neq 0\}$. Then

$$e = \sum_{(i,j) \in I_1} S_{ij}^2 + \sum_{(i,j) \in I_2} (S_{ij} - W_{ij})^2.$$

This error will be minimized when $S_{ij} = W_{ij}$ in all locations where $W_{ij} \neq 0$, and $W_{ij} = 0$ at those indices where the corresponding values in $S$ are as small as possible. Thus, if we want $\|W\|_0 = T$, then $W$ is the matrix obtained by nullifying the $m_1 m_2 - T$ entries of $S$ that have the least absolute value and leaving the remaining elements intact.

Hence, the problem of finding an optimal sparse projection of a matrix (image patch) onto a pair of orthonormal bases is solvable in $O(m_1 m_2 (m_1 + m_2))$ time, as it requires just two matrix multiplications ($S = U^T P V$). On the other hand, the problem of finding the optimal sparse linear combination of an over-complete set of basis vectors to represent another vector (i.e. the vectorized form of an image patch) is a well-known NP-hard problem [10]. In actual applications [9], approximate solutions to these problems are sought by means of pursuit algorithms such as OMP [20]. Unfortunately, the quality of the approximation in OMP is dependent on $T$, with an upper bound on the error that is directly proportional to $\sqrt{1 + 6T}$ [see [21], Theorem (C)] under certain conditions.

Similar problems exist with other pursuit approximation algorithms such as Basis Pursuit (BP) as well [21]. For large values of $T$, there can be difficulties related to convergence when pursuit algorithms are put inside an iterative optimization loop. Our technique avoids any such dependencies.
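As a concrete illustration, the greedy projection of Theorem 1 can be sketched in a few lines of NumPy. The function name and demo data below are ours, not from the paper:

```python
import numpy as np

def sparse_projection(P, U, V, T):
    """Optimal T-sparse projection of patch P onto orthonormal bases (U, V).

    Per Theorem 1: compute S = U^T P V, then keep only the T entries of
    largest absolute value, zeroing the rest."""
    S = U.T @ P @ V                       # full (non-diagonal) projection
    W = np.zeros_like(S)
    keep = np.argsort(np.abs(S).ravel())[-T:]   # indices of T largest |S_ij|
    W.ravel()[keep] = S.ravel()[keep]           # remaining entries stay zero
    return W

# Tiny demo: a random 4x4 patch and random orthonormal bases (via QR).
rng = np.random.default_rng(0)
P = rng.random((4, 4))
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
W = sparse_projection(P, U, V, T=5)
err = np.linalg.norm(P - U @ W @ V.T) ** 2
```

By the orthonormality argument in the proof, `err` equals exactly the sum of squares of the nullified entries of $S$, which is the smallest error any matrix with 5 non-zero entries can achieve here.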

    B. Learning the Bases

The essence of this paper lies in a learning method to produce $K$ exemplar orthonormal bases $\{(U_a, V_a)\}$, $1 \le a \le K$, to encode a training set of $N$ image patches $P_i \in \mathbb{R}^{m_1 \times m_2}$ $(1 \le i \le N)$ with the least possible error (in the sense of the $L_2$ norm of the difference between the original and reconstructed patches). Note that $K \ll N$. In addition, we impose a sparsity constraint: every $S_{ia}$ (the matrix used to reconstruct $P_i$ from $(U_a, V_a)$) has at most $T$ non-zero elements. The main objective function to be minimized is:

$$E(\{U_a, V_a, S_{ia}, M_{ia}\}) = \sum_{i=1}^{N} \sum_{a=1}^{K} M_{ia} \|P_i - U_a S_{ia} V_a^T\|^2 \quad (1)$$

subject to the following constraints: (1) $U_a^T U_a = V_a^T V_a = I, \;\forall a$; (2) $\|S_{ia}\|_0 \le T, \;\forall (i,a)$; (3) $\sum_a M_{ia} = 1, \;\forall i$, and $M_{ia} \in \{0, 1\}, \;\forall (i,a)$. $\quad (2)$

Here $M_{ia}$ is a binary matrix of size $N \times K$ which indicates whether the $i$th patch belongs to the space defined by $(U_a, V_a)$. Notice that we are not trying to use a mixture of orthonormal bases for the projection, but just project onto a single suitable basis pair instead. The optimization of the above energy function is difficult, as $M_{ia}$ is binary. A simple K-means approach will lead to local-minima issues, and so we relax the binary membership constraint so that now $M_{ia} \in (0, 1), \;\forall (i,a)$, subject to $\sum_{a=1}^{K} M_{ia} = 1, \;\forall i$. The naive mean-field theory line of development culminating in [22] (with much of the theory developed already in [23]) leads to the following deterministic


annealing energy function:

$$E(\{U_a, V_a, S_{ia}, M_{ia}\}) = \sum_{i=1}^{N} \sum_{a=1}^{K} M_{ia} \|P_i - U_a S_{ia} V_a^T\|^2 + \frac{1}{\beta} \sum_{ia} M_{ia} \log M_{ia} + \sum_i \lambda_i \Big( \sum_a M_{ia} - 1 \Big) + \sum_a \mathrm{trace}[\Lambda_{1a}(U_a^T U_a - I)] + \sum_a \mathrm{trace}[\Lambda_{2a}(V_a^T V_a - I)]. \quad (3)$$

Note that in the above equation, $\{\lambda_i\}$ are the Lagrange parameters, $\{\Lambda_{1a}, \Lambda_{2a}\}$ are symmetric Lagrange matrices, and $\beta$ is a temperature parameter.

We first initialize $\{U_a\}$ and $\{V_a\}$ to random orthonormal matrices $\forall a$, and $M_{ia} = \frac{1}{K}, \;\forall (i,a)$. As $\{U_a\}$ and $\{V_a\}$ are orthonormal, the projection matrix $S_{ia}$ is computed by the following rule:

$$S_{ia} = U_a^T P_i V_a. \quad (4)$$

Note that this is the solution to the least-squares energy function $\|P_i - U_a S_{ia} V_a^T\|^2$. Then the $m_1 m_2 - T$ elements in $S_{ia}$ with least absolute value are nullified to get the best sparse projection matrix. The updates to $U_a$ and $V_a$ are obtained as follows. Denoting the sum of the terms of the cost function in Eqn. (1) that are independent of $U_a$ as $C$, we rewrite the cost function as follows:

$$E(\{U_a, V_a, S_{ia}, M_{ia}\}) = \sum_{i=1}^{N} \sum_{a=1}^{K} M_{ia} \|P_i - U_a S_{ia} V_a^T\|^2 + \sum_a \mathrm{trace}[\Lambda_{1a}(U_a^T U_a - I)] + C$$
$$= \sum_{i=1}^{N} \sum_{a=1}^{K} M_{ia} \big[\mathrm{trace}\big((P_i - U_a S_{ia} V_a^T)^T (P_i - U_a S_{ia} V_a^T)\big)\big] + \sum_a \mathrm{trace}[\Lambda_{1a}(U_a^T U_a - I)] + C$$
$$= \sum_{i=1}^{N} \sum_{a=1}^{K} M_{ia} \big[\mathrm{trace}(P_i^T P_i) - 2\,\mathrm{trace}(P_i^T U_a S_{ia} V_a^T) + \mathrm{trace}(S_{ia}^T S_{ia})\big] + \sum_a \mathrm{trace}[\Lambda_{1a}(U_a^T U_a - I)] + C. \quad (5)$$

Now setting the derivative w.r.t. $U_a$ to zero, re-arranging the terms, and eliminating the Lagrange matrices, we obtain the following update rule for $U_a$:

$$Z_{1a} = \sum_i M_{ia} P_i V_a S_{ia}^T; \quad U_a = Z_{1a} (Z_{1a}^T Z_{1a})^{-\frac{1}{2}}. \quad (6)$$

An SVD of $Z_{1a}$ will give us $Z_{1a} = \Gamma_{1a} \Delta_{1a} \Theta_{1a}^T$, where $\Gamma_{1a}$ and $\Theta_{1a}$ are orthonormal matrices and $\Delta_{1a}$ is a diagonal matrix. Using this, we can re-write $U_a$ as follows:

$$U_a = (\Gamma_{1a} \Delta_{1a} \Theta_{1a}^T) \big((\Gamma_{1a} \Delta_{1a} \Theta_{1a}^T)^T (\Gamma_{1a} \Delta_{1a} \Theta_{1a}^T)\big)^{-\frac{1}{2}} = \Gamma_{1a} \Theta_{1a}^T. \quad (7)$$

Notice that the steps above followed only owing to the fact that $U_a$, and hence $\Gamma_{1a}$ and $\Theta_{1a}$, were full-rank matrices. This gives us an update rule for $U_a$. A very similar update rule for
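The update of Eqns. (6)-(7) amounts to taking the orthonormal polar factor of $Z_{1a}$, which is conveniently computed from its SVD. A minimal NumPy sketch (the function name and argument layout are ours, not from the paper):

```python
import numpy as np

def update_basis(patches, V, S_list, M):
    """Update one exemplar basis U_a per Eqns. (6)-(7):
    Z = sum_i M_ia * P_i V_a S_ia^T, then U_a = Z (Z^T Z)^{-1/2},
    computed stably as the orthonormal polar factor of Z via its SVD."""
    Z = sum(m * (P @ V @ S.T) for m, P, S in zip(M, patches, S_list))
    G, _, Ht = np.linalg.svd(Z)   # Z = G diag(d) Ht; discard the diagonal
    return G @ Ht                 # orthonormal by construction
```

Since $G$ and $H^T$ are orthonormal, so is their product, so the constraint $U_a^T U_a = I$ is maintained exactly at every iteration.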

$V_a$ follows along the same lines. The membership values are obtained by:

$$M_{ia} = \frac{e^{-\beta \|P_i - U_a S_{ia} V_a^T\|^2}}{\sum_{b=1}^{K} e^{-\beta \|P_i - U_b S_{ib} V_b^T\|^2}}. \quad (8)$$

The matrices $\{S_{ia}, U_a, V_a\}$ and $M$ are then updated sequentially, following one another, for a fixed $\beta$ value until convergence. The value of $\beta$ is then increased and the sequential updates are repeated. The entire process is repeated until an integrality condition is met. Our algorithm could be modified to learn a set of orthonormal bases (single matrices, not pairs) for sparse representation of vectorized images as well. Note that even then, our technique should not be confused with the mixtures-of-PCA approach [24]. The emphasis of the latter is again on low-rank approximations and not on sparsity. Furthermore, in our algorithm we do not compute an actual probabilistic mixture (also see Sections II-C and III-E). However, we have not considered this vector-based variant in our experiments, because it does not exploit the fact that image patches are two-dimensional entities.
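The alternating scheme of this section can be sketched end-to-end as follows. This is an illustrative NumPy sketch under our own assumptions: the annealing schedule (`beta_growth`, `n_anneal`, `n_inner`) and the fixed iteration counts are placeholder choices, not the paper's actual settings or stopping rule:

```python
import numpy as np

def learn_bases(patches, K, T, beta=1.0, beta_growth=1.5,
                n_anneal=10, n_inner=5, seed=0):
    """Deterministic-annealing sketch of Section II-B: alternate the sparse
    projections S_ia (Eqn. 4 + nullification), the basis updates U_a, V_a
    (Eqns. 6-7), and the soft memberships M_ia (Eqn. 8), raising beta."""
    rng = np.random.default_rng(seed)
    N, (m1, m2) = len(patches), patches[0].shape
    U = [np.linalg.qr(rng.standard_normal((m1, m1)))[0] for _ in range(K)]
    V = [np.linalg.qr(rng.standard_normal((m2, m2)))[0] for _ in range(K)]
    M = np.full((N, K), 1.0 / K)          # uniform initial memberships

    def sparsify(S):
        W = np.zeros_like(S)
        keep = np.argsort(np.abs(S).ravel())[-T:]
        W.ravel()[keep] = S.ravel()[keep]
        return W

    for _ in range(n_anneal):
        for _ in range(n_inner):
            # Sparse projections for every (patch, basis) pair.
            S = [[sparsify(U[a].T @ P @ V[a]) for a in range(K)]
                 for P in patches]
            # Basis updates via the orthonormal polar factor.
            for a in range(K):
                Zu = sum(M[i, a] * (patches[i] @ V[a] @ S[i][a].T)
                         for i in range(N))
                G, _, Ht = np.linalg.svd(Zu)
                U[a] = G @ Ht
                Zv = sum(M[i, a] * (patches[i].T @ U[a] @ S[i][a])
                         for i in range(N))
                G, _, Ht = np.linalg.svd(Zv)
                V[a] = G @ Ht
            # Soft memberships, Eqn. (8); shift errors for stability.
            err = np.array([[np.linalg.norm(patches[i]
                                            - U[a] @ S[i][a] @ V[a].T) ** 2
                             for a in range(K)] for i in range(N)])
            E = np.exp(-beta * (err - err.min(axis=1, keepdims=True)))
            M = E / E.sum(axis=1, keepdims=True)
        beta *= beta_growth               # anneal toward binary memberships
    return U, V, M
```

As $\beta$ grows, each row of $M$ concentrates on the best-fitting exemplar, approximating the binary memberships of the original objective.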

C. Application to Compact Image Representation

Our framework is geared towards compact but low-error patch reconstruction. We are not concerned with the discriminating assignment of a specific kind of patches to a specific exemplar, quite unlike in a clustering or classification application. In our method, after the optimization, each training patch $P_i$ $(1 \le i \le N)$ gets represented as a projection onto the one out of the $K$ exemplar orthonormal bases which produces the least reconstruction error, i.e. the $k$th exemplar is chosen if $\|P_i - U_k S_{ik} V_k^T\|^2 \le \|P_i - U_a S_{ia} V_a^T\|^2, \;\forall a \in \{1, 2, \ldots, K\}$, $1 \le k \le K$. For patch $P_i$, we denote the corresponding optimal projection matrix as $S_i^* = S_{ik}$, and the corresponding exemplar as $(U_i^*, V_i^*) = (U_k, V_k)$. Thus the entire training set is approximated by (1) the common set of basis-pairs $\{(U_a, V_a)\}$, $1 \le a \le K$ $(K \ll N)$, and (2) the optimal sparse projection matrices $\{S_i^*\}$ for each patch, with at most $T$ non-zero elements each. The overall storage per image is thus greatly reduced (see also Section III-B). Furthermore, these bases $\{(U_a, V_a)\}$ can now be used to encode patches from a new set of images that are somewhat similar to the ones existing in the training set. However, a practical application demands that the reconstruction meet a specific error threshold on unseen patches, and hence the $L_0$ norm of the projection matrix of the patch is adjusted dynamically in order to meet the error. Experimental results using such a scheme are described in the next section.

III. EXPERIMENTS: 2D IMAGES

In this section, we first describe the overall methodology for the training and testing phases of our algorithm, based on our earlier work in [11]. As we are working on a compression application, we then describe the details of our image coding and quantization scheme. This is followed by a discussion of the comparisons between our technique and other competitive techniques (including JPEG), and an enumeration of the experimental results.


A. Training and Testing Phases

We tested our algorithm for compression of the entire ORL database [25] and the entire Yale database [26]. We divided the images in each database into patches of fixed size ($12 \times 12$), and these sets of patches were segregated into training and test sets. The test set consisted of many more images than the training set, in the case of either database. For the purpose of training, we learned a total of $K = 50$ orthonormal bases using a fixed value of $T$ to control the sparsity. For testing, we projected each patch $P_i$ onto that exemplar $(U_i^*, V_i^*)$ which produced the sparsest projection matrix $S_i^*$ yielding an average per-pixel reconstruction error $\frac{\|P_i - U_i^* S_i^* V_i^{*T}\|^2}{m_1 m_2}$ of no more than some chosen $\epsilon$. Note that different test patches required different $T$ values, depending upon their inherent complexity. Hence, we varied the sparsity of the projection matrix (while keeping its size fixed at $12 \times 12$) by greedily nullifying the smallest elements in the matrix, without letting the reconstruction error go above $\epsilon$. This gave us the flexibility to adjust to patches of different complexities, without altering the rank of the exemplar bases $(U_i^*, V_i^*)$. As any patch $P_i$ is projected onto exemplar orthonormal bases that are different from those produced by its own SVD, the projection matrices turn out to be non-diagonal. Hence, there is no such thing as a hierarchy of singular values as in ordinary SVD. As a result, we cannot resort to restricting the rank of the projection matrix (and thereby the rank of $(U_i^*, V_i^*)$) to adjust for patches of different complexity (unless we learn a separate set of $K$ bases for each different rank $r$, where $1 < r < \min(m_1, m_2)$, which would make the training and even the image coding [see Section III-B] very cumbersome). This highlights an advantage of our approach over algorithms that adjust the rank of the projection matrices.
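Because $U$ and $V$ are orthonormal, the squared error incurred by nullifying entries of $S$ is exactly the sum of their squares, so the sparsest projection meeting a per-pixel threshold $\epsilon$ can be found from a cumulative sum over the sorted entries, with no repeated reconstructions. A hypothetical sketch of this testing-phase rule (function name and return layout are ours):

```python
import numpy as np

def encode_patch(P, bases, eps):
    """Among the exemplar pairs in `bases`, pick the one whose projection
    meets the per-pixel error threshold `eps` with the fewest non-zero
    entries, greedily dropping the smallest-magnitude entries of S."""
    budget = eps * P.size                 # allowed total squared error
    best = None
    for idx, (U, V) in enumerate(bases):
        S = U.T @ P @ V
        order = np.argsort(np.abs(S).ravel())     # smallest |S_ij| first
        cum = np.cumsum(S.ravel()[order] ** 2)    # error if we drop them
        n_drop = int(np.searchsorted(cum, budget, side='right'))
        T = S.size - n_drop                       # nonzeros we must keep
        if best is None or T < best[0]:
            W = np.zeros_like(S)
            keep = order[n_drop:]
            W.ravel()[keep] = S.ravel()[keep]
            best = (T, idx, W)
    return best          # (sparsity, exemplar index, sparse projection)
```

Note that the whole search costs only a few matrix products and one sort per exemplar, which is what makes the dynamic per-patch adjustment of $T$ cheap.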

B. Details of Image Coding and Quantization

We obtain $S_i^*$ by sparsifying $U_i^{*T} P_i V_i^*$. As $U_i^*$ and $V_i^*$ are orthonormal, we can show that the values in $S_i^*$ will always lie in the range $[-m, m]$ for $m \times m$ patches, if the values of $P_i$ lie in $[0, 1]$. This can be proved as follows. We know that $S_i^* = U_i^{*T} P_i V_i^*$. Hence we write the element of $S_i^*$ in the $a$th row and $b$th column as follows:

$$S^*_{i\,ab} = \sum_{k,l} U^{*T}_{i\,ak} P_{i\,kl} V^*_{i\,lb} \le \sum_{k,l} \frac{1}{\sqrt{m}} P_{i\,kl} \frac{1}{\sqrt{m}} \le m. \quad (9)$$

The first step follows because the maximum value of $S^*_{i\,ab}$ is obtained when all values of $U^{*T}_{i\,ak}$ (and likewise of $V^*_{i\,lb}$) are equal to $\frac{1}{\sqrt{m}}$. The second step follows because $S^*_{i\,ab}$ will meet its upper bound when all the $m^2$ values in the patch are equal to 1. We can similarly prove that the lower bound on $S^*_{i\,ab}$ is $-m$. As in our case $m = 12$, the upper and lower bounds are $+12$ and $-12$ respectively.

In our experiment, we Huffman-encoded the integer parts of the values in the $\{S_i^*\}$ matrices over the whole image (giving us an average of some $Q_1$ bits per entry) and quantized the fractional parts with $Q_2$ bits per entry. Thus, we needed to store the following information per test-patch to create the compressed image: (1) the index of the best exemplar, using $a_1$ bits; (2) the location and value of each non-zero element in its $S_i^*$ matrix, using $a_2$ bits per location and $Q_1 + Q_2$ bits for the value; and (3) the number of non-zero elements per patch, encoded using $a_3$ bits. Hence the total number of bits per pixel for the whole image is given by:

$$\mathrm{RPP} = \frac{N(a_1 + a_3) + T_{whole}(a_2 + Q_1 + Q_2)}{M_1 M_2}, \quad (10)$$

where $T_{whole} = \sum_{i=1}^{N} \|S_i^*\|_0$. The values of $a_1$, $a_2$ and $a_3$ were obtained by Huffman encoding. After performing quantization on the values of the coefficients in $S_i^*$, we obtained new quantized projection matrices, denoted as $\bar{S}_i$. Following this, the PSNR for each image was measured as $10 \log_{10} \frac{N m_1 m_2}{\sum_{i=1}^{N} \|P_i - U_i^* \bar{S}_i V_i^{*T}\|^2}$, and then averaged over the entire test set. The average number of bits per pixel (RPP) was calculated as in Eqn. (10) for each image, and then averaged over the whole test set. We repeated this procedure for different $\epsilon$ values from $8 \times 10^{-5}$ to $8 \times 10^{-3}$ (the range of image intensity values being $[0, 1]$) and plotted an ROC curve of average PSNR vs. average RPP.
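Eqn. (10) itself is simple arithmetic; the sketch below spells it out with made-up inputs. The specific bit counts and patch counts are illustrative placeholders, not values reported in the paper:

```python
def rpp(n_patches, t_whole, a1, a2, a3, q1, q2, M1, M2):
    """Bits per pixel per Eqn. (10): each patch stores an exemplar index
    (a1 bits) and a nonzero count (a3 bits); each nonzero coefficient
    stores a location (a2 bits) and a quantized value (Q1 + Q2 bits)."""
    return (n_patches * (a1 + a3) + t_whole * (a2 + q1 + q2)) / (M1 * M2)

# Hypothetical numbers: a 112x92 image coded as 80 patches (12x12 tiling
# would need padding) with 10 nonzeros each, and plausible code lengths.
bits = rpp(n_patches=80, t_whole=800,
           a1=6, a2=8, a3=4, q1=4, q2=5, M1=112, M2=92)
```

The formula makes explicit that the coefficient term $T_{whole}(a_2 + Q_1 + Q_2)$ dominates at low $\epsilon$, which is why the dynamic per-patch sparsity matters for the final bit rate.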

C. Comparison with Other Techniques

We pitted our method against four recent and competitive approaches: (1) KSVD, (2) overcomplete DCT, (3) SSVD, and (4) the JPEG standard. (1) For the KSVD algorithm from [9], we used 441 unit-norm dictionary vectors of size 144. In this method, each patch is vectorized and represented as a sparse linear combination of the dictionary vectors. During training, the value of $T$ is kept fixed, and during testing it is dynamically adjusted depending upon the patch, so as to meet the appropriate set of error thresholds represented by $\epsilon \in [8 \times 10^{-5}, 8 \times 10^{-3}]$. For the purpose of comparison between our method and KSVD, we used the exact same values of $T$ and $\epsilon$ as in our algorithm, to facilitate a fair comparison. The method used to find the sparse projection onto the dictionary was the orthogonal matching pursuit (OMP) algorithm [21]. (2) We created the over-complete DCT dictionary with 441 unit-norm vectors of size 144 by sampling cosine waves of various frequencies, again with the same $T$ value as for KSVD, and using the OMP technique for sparse projection. (3) The SSVD method from [19] is not a learning-based technique. It is based on the creation of sub-sampled versions of the image by traversing the image in a manner different from the usual raster scanning. (4) For the JPEG standard (for which we used the implementation provided in MATLAB®), we measured the RPP from the number of bytes of file storage.

For KSVD and over-complete DCT, there exist bounds on the values of the coefficients that are very similar to those for our technique, as presented in Eqn. (9). For both these methods, the RPP value was computed using the formula in [9], Eqn. (27), with the modification that the integer parts of the coefficients of the linear combination were Huffman-encoded and the fractional parts separately quantized (as this gave better ROC curves for these methods). For the SSVD method, the RPP was calculated as in [19], Section 5.


Fig. 1. ROC curves (PSNR vs. RPP) on (a) the ORL database ($T = 10$) and (b) the Yale database ($T = 3$). Legend: (1) Our Method (red), (2) KSVD (blue), (3) Over-complete DCT (green), (4) SSVD (black), (5) JPEG (magenta). (c) Variance in RPP versus pre-quantization error for the ORL database. These plots are best viewed in color.

Fig. 2. Reconstructed versions of an image from the ORL database using 5 different error values ($\epsilon$ = 0.003, 0.001, 0.0005, 0.0003, 0.0001). The original image is at the top left. Best viewed by zooming into the PDF file.

    D. Results

For the ORL database, we created a training set of patches of size $12 \times 12$ from images of 10 different people, with 10 images per person. Patches from images of the remaining 30 people (10 images per person) were treated as the test set. From the training set, a total of 50 pairs of orthonormal bases were learned using the algorithm described in Section II-B. The $T$ value for the sparsity of the projection matrices was set to 10 during training. As shown in Figure 1(c), we also computed the variance in the RPP values for each different pre-quantization error value, for each image in the ORL database. Note that the variance in RPP decreases as the specified error increases. This is because at very low errors, different images require different numbers of coefficients in order to meet the error.

The same experiment was run on the Yale database with a value of T = 3 on patches of size 12 × 12. The training set consisted of a total of 64 images of one and the same person under different illumination conditions, whereas the test set consisted of 65 images each of the remaining 38 people from the database (i.e. 2470 images in all), under different

Method              Dictionary Size (number of scalars)
Our Method          50 × 2 × 12 × 12 = 14400
KSVD                441 × 12 × 12 = 63504
Overcomplete DCT    441 × 12 × 12 = 63504

TABLE I
COMPARISON OF DICTIONARY SIZE FOR VARIOUS METHODS FOR EXPERIMENTS ON THE ORL AND YALE DATABASES.

[Fig. 3 plots: (a) ORL: PSNR vs. RPP for patch sizes 8 × 8 (T = 5), 12 × 12 (T = 10), 15 × 15 (T = 15) and 20 × 20 (T = 20); (b) ORL: PSNR vs. RPP for T = 5, 10, 20 and 50.]

Fig. 3. ROC curves for the ORL database using our technique with (a) different patch sizes and (b) different values of T for a fixed patch size of 12 × 12. These plots are best viewed in color.

illumination conditions. The ROC curves for our method were superior to those of other methods over a significant range of the error threshold, for the ORL as well as the Yale database, as seen in Figures 1(a) and 1(b). Figure 2 shows reconstructed versions of an image from the ORL database using our method with different error values. For experiments on the ORL database, the number of bits used to code the fractional parts of the coefficients of the projection matrices [i.e. Q2 in Eqn. (10)] was set to 5. For the Yale database, we often obtained pre-quantization errors significantly less than the chosen threshold, and hence using a value of Q2 less than 5 bits often did not raise the post-quantization error above it. Keeping this in mind, we varied the value of Q2 dynamically, for each pre-quantization error value. The same variation was applied to the KSVD and over-complete DCT techniques as well. For the experiments on these two databases, we also summarize the dictionary sizes of all the relevant methods being compared in Table I.

E. Effect of Different Parameters on the Performance of our Method

The different parameters in our method include: (1) the size of the patches for training and testing, (2) the number of pairs of orthonormal bases, i.e. K, and (3) the sparsity of each patch during training, i.e. T. In the following, we describe the effect of varying these parameters. (1) Patch size: If the size of the patch is too large, it becomes more and more unlikely that a fixed number of orthonormal matrices will serve as adequate bases for these patches in terms of a low-error sparse representation, owing to the curse of dimensionality. Furthermore, if the patch size is too large, any algorithm will lose out on the advantages of locality and simplicity. However, if the patch size is too small (say 2 × 2), there is very little a compression algorithm can do in terms of lowering the number of bits required to store these patches. We stress that the patch size as a free parameter is something common to all algorithms


to which we are comparing our technique (including JPEG). Also, the choice of this parameter is mainly empirical. We tested the performance of our algorithm with K = 50 pairs of orthonormal bases on the ORL database, using patches of size 8 × 8, 12 × 12, 15 × 15 and 20 × 20, with appropriately different values of T. As shown in Figure 3(a), the patch size of 12 × 12 using T = 10 yielded the best performance, though other patch sizes also performed quite well. (2) The number of orthonormal bases K: The choice of this number is not critical, and it can be set to as high a value as desired without affecting the accuracy of the results. This parameter should not be confused with the number of mixture components in standard mixture-model based density estimation, because in our method each patch gets projected onto only a single set of orthonormal bases, i.e. we do not compute a combination of projections onto all the K different orthonormal basis pairs. The only downside of a higher value of K is the added computational cost during training. Again, note that this parameter will be part of any learning-based algorithm for compression that uses a set of bases to express image patches. (3) The value of T during training: The value of T is fixed only during training, and is varied for each patch during the testing phase so as to meet the required error threshold. A very high value of T during training can cause the orthonormal basis pairs to overfit to the training data (variance), whereas a very low value could cause a large bias error. This is an instance of the well-known bias-variance tradeoff common in machine learning algorithms. Our choice of T was empirical, though the value of T = 10 performed very well on nearly all the datasets we ran our experiments on. The effect of different values of T on the compression performance for the ORL database is shown in Figure 3(b). We again emphasize that the issue of choosing an optimal T is not an artifact of our algorithm per se. For instance, this issue will also be encountered in pursuit algorithms used to find the approximately optimal linear combination of unit vectors from a dictionary. In the latter case, however, there will also be the problem of poorer and poorer approximation errors as T increases, under certain conditions on the dictionary of overcomplete basis vectors.

F. Performance on Random Collections of Images

We performed additional experiments to study the behavior

of our algorithm when the training and test sets were very different from one another. To this end, we used the database of 155 natural images (in uncompressed format) from the Computer Vision Group at the University of Granada [27]. The database consists of images of faces, houses, natural scenery, inanimate objects and wildlife. We converted all the given images to grayscale. Our training set consisted of patches (of size 12 × 12) from 11 images. Patches from the remaining images formed the test set. Though the images in the training and test sets were very different, our algorithm produced excellent results superior to JPEG up to an RPP value of 3.1 bits, as shown in Figure 4(a). Two training images and reconstructed versions of a test image for error values of 0.0002, 0.0006 and 0.002 respectively, are shown

[Fig. 4 plots: (a) CVG: PSNR vs. RPP, T = 10 (Our Method, JPEG); (b) UCID: PSNR vs. RPP (Our Method with T = 5, JPEG).]

Fig. 4. PSNR vs. RPP curves for our method and JPEG on (a) the CVG Granada database and (b) the UCID database.


Fig. 5. (a) and (b): Training images from the CVG Granada database. (c) to (e): Reconstructed versions of a test image for error values of 0.0002, 0.0006 and 0.002 respectively. Best viewed by ZOOMING in the pdf file.

in Figure 5. To test the effect of minute noise on the performance of our algorithm, a similar experiment was run on the UCID database [28] with a training set of 10 images and the remaining 234 test images from the database. The test images were significantly different from the ones used for training. The images were scaled down by a factor of two (i.e. to 320 × 240 instead of 640 × 480) for faster training and subjected to zero-mean Gaussian noise of variance 0.001 (on a scale from 0 to 1). Our algorithm was yet again competitive with JPEG, as shown in Figure 4(b).

G. Comparison with Wavelet Coding

In this section, we present experiments comparing the

performance of our method with JPEG2000 and also with a wavelet-based compression method involving a very simple quantization scheme that employs Huffman encoding, similar to the quantization scheme we employed for our technique. We would like to emphasize that JPEG2000 is an industry standard that was developed by several researchers over more than a decade. Its strongest point is its superlative quantization and encoding scheme, rather than the mere fact that it uses wavelet bases [29]. On the other hand, a wavelet-coding method employing a much simpler quantization scheme (similar to the one we used in our technique) performs quite poorly in practice (much worse than JPEG), with one of its major problems being that there are no bounds that can be imposed on the values of the wavelet coefficients in terms of the image intensity range or patch size. In our experiments, the typical range of coefficient values turned out to be around 10^4. As opposed to this, the values of the coefficients of the projected matrices in our technique are (provably) bounded between −m and +m for patch sizes of m × m (as shown in Section III-B). In our wavelet coding strategy, we performed a decomposition of up to four levels (level 1 giving us higher RPP and higher PSNR, and level 4 giving us low RPP and low PSNR). For any decomposition, the LL sub-band coefficients (also


called the approximation coefficients) of the final level were never thresholded. The integer parts of these coefficients were Huffman-encoded (just as in our technique) and 4 bits were allocated for the fractional part. Having more bits for the fractional part did not improve the ROC curves in our experiments. The coefficients of the other sub-bands (HH, HL, LH) at every level were thresholded using the commonly employed Birgé-Massart strategy (part of the compression implementation in the MATLAB wavelet toolbox). Different thresholds were used for each sub-band at each level. Again, the thresholded values were coded using Huffman encoding on the integer part, followed by 4 bits for the fractional part. The specific type of wavelet that we used was 'sym4', as it gave better performance than other wavelets that we tried (such as 'db7', biorthogonal wavelets, etc.). We performed experiments comparing the compression performance of the following four techniques: our method, JPEG, JPEG2000 (the specific implementation being the commonly available JasPer package [30]), and wavelet methods using 'sym4' followed by a simple quantizer (as described earlier). In Figure 6, we show results on the ORL and Yale databases. For the ORL database, we also present results with 'db7' and biorthogonal wavelets. As these performed much worse than 'sym4', we do not show more results with them. As seen in Figures 6(a) and (b), our method as well as JPEG (using the DCT) outperformed the simple wavelet-based techniques, though not JPEG2000. Note that there are other reports in the literature [31] that indicate a superior performance of JPEG over wavelet-based methods that did not use sophisticated quantizers, especially at higher bit rates.
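The simple quantization scheme described above (Huffman coding of the integer part plus a fixed number of bits for the fraction) can be sketched as follows; the Huffman stage itself is omitted, and the function names and data are ours, not from the paper. With q2 fractional bits, rounding to the nearest of 2^q2 levels bounds the per-coefficient error by 2^-(q2+1).

```python
import numpy as np

def quantize(c, q2):
    """Split each coefficient into sign, integer part, and a q2-bit fraction."""
    s = np.sign(c)
    a = np.abs(c)
    ip = np.floor(a)                       # integer part (Huffman-coded in the paper)
    fp = np.round((a - ip) * 2 ** q2)      # fraction stored on q2 bits
    return s, ip, fp

def dequantize(s, ip, fp, q2):
    return s * (ip + fp / 2 ** q2)

rng = np.random.default_rng(2)
# projection coefficients; in the paper these are bounded by the patch size
coeffs = rng.uniform(-12.0, 12.0, size=1000)
q2 = 4
rec = dequantize(*quantize(coeffs, q2), q2)
print(np.max(np.abs(coeffs - rec)) <= 2 ** -(q2 + 1))  # True
```

This mirrors only the scalar round-off step; in the paper the integer parts are then entropy-coded, which is what actually determines the RPP.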

Thus we believe that our results will improve further if a more sophisticated quantizer/coder is used in conjunction with bases learned by our method. The improvement of the quantization scheme for our technique is well beyond the scope of the current paper. Nevertheless, we observed that the performance of our technique was competitive with JPEG2000 for low bit rates, when negligible amounts of noise (i.i.d. zero-mean Gaussian noise of variance 8 × 10^-4 or 10^-3) were added to the images. It should be noted that the visual effect of such small amounts of noise is not obvious to the human eye. At lower bit rates, we observed that our method sometimes beat JPEG2000. This is despite the fact that we trained our method on noiseless images and tested on noisy images, and that the test set was several times larger than the training set. The results of our method, JPEG, JPEG2000 and wavelet coding with 'sym4' wavelets are shown in Figures 6(c) to (f), for two levels of noise on the ORL and Yale databases.

IV. THEORY: 3D (OR HIGHER-D) IMAGES

We now consider a set of images represented as third-order

matrices, each of size M1 × M2 × M3. We divide each image into non-overlapping patches of size m1 × m2 × m3, m1 ≤ M1, m2 ≤ M2, m3 ≤ M3, and treat each patch as a separate tensor. Just as before, we start by exploiting the similarity inherent in these patches, and represent them by means of sparse projections onto a triple of exemplar orthonormal bases learned from a set of training image patches. Note that our

[Fig. 6 plots, panels (a) to (f): PSNR vs. RPP for Our Method, JASPER, JPEG and Wavelet SYM4; panel (a) additionally includes Wavelet DB7 and Wavelet BIOR6.8.]

Fig. 6. PSNR vs. RPP curves for our method, JPEG, JPEG2000 (JASPER) and wavelet coding with 'sym4' wavelets on (a) the ORL database and (b) the Yale database (with no noise). (c) and (d): with i.i.d. Gaussian noise of mean zero and variance 8 × 10^-4. (e) and (f): with i.i.d. Gaussian noise of mean zero and variance 10^-3. Images are best viewed by ZOOMING into the pdf file.

extension to higher dimensions is non-trivial, especially given that a key property of the SVD in 2D (namely, that the SVD provides the optimal lower-rank reconstruction by simple nullification of the lowest singular values) does not extend to higher-dimensional analogs such as HOSVD [13]. In the following, we describe the mathematical structure of the exemplars. Note that though the derivations presented in this paper are for 3D matrices, a very similar treatment is applicable to learning n-tuples of orthonormal bases for patches from n-D matrices (where n ≥ 4).

A. Exemplar Bases and Sparse Projections

Let P ∈ R^(m1 × m2 × m3) be an image patch. Using HOSVD,

we can represent P as a combination of orthonormal bases U ∈ R^(m1 × m1), V ∈ R^(m2 × m2) and W ∈ R^(m3 × m3) in the form P = S ×_1 U ×_2 V ×_3 W, where S ∈ R^(m1 × m2 × m3) is termed the core-tensor. The operators ×_i refer to tensor-matrix multiplication over the different axes. The core-tensor has special properties such as all-orthogonality and ordering. See [13] for more details. Now, P can also be represented as a combination of any other triple of orthonormal bases U, V and W, different from the one obtained from the HOSVD of P. In this case, we have P = S ×_1 U ×_2 V ×_3 W, where S is not guaranteed to be an all-orthogonal tensor, nor is it guaranteed to obey the ordering property.
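The mode-n tensor-matrix products ×_1, ×_2, ×_3 used above can be sketched in a few lines of numpy; `mode_n_product` is our name for the helper, and the random sizes are illustrative only. With orthonormal U, V, W, projecting and reconstructing recovers the patch exactly.

```python
import numpy as np

def mode_n_product(T, M, n):
    """Tensor-matrix product T x_n M: contracts axis n of T with the columns of M."""
    # tensordot contracts T's axis n with M's axis 1; the new axis lands last,
    # so move it back to position n.
    out = np.tensordot(T, M, axes=(n, 1))
    return np.moveaxis(out, -1, n)

# A patch P in R^(m1 x m2 x m3) and random orthonormal bases U, V, W (via QR).
rng = np.random.default_rng(0)
m1, m2, m3 = 4, 5, 3
P = rng.standard_normal((m1, m2, m3))
U, _ = np.linalg.qr(rng.standard_normal((m1, m1)))
V, _ = np.linalg.qr(rng.standard_normal((m2, m2)))
W, _ = np.linalg.qr(rng.standard_normal((m3, m3)))

# Projection tensor S = P x_1 U^T x_2 V^T x_3 W^T ...
S = mode_n_product(mode_n_product(mode_n_product(P, U.T, 0), V.T, 1), W.T, 2)
# ... reconstructs P exactly: P = S x_1 U x_2 V x_3 W.
P_hat = mode_n_product(mode_n_product(mode_n_product(S, U, 0), V, 1), W, 2)
print(np.allclose(P, P_hat))  # True
```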

As the HOSVD does not necessarily provide the best low-rank approximation to a tensor [13], we choose to depart from this notion (instead of settling for a sub-optimal approximation), and instead answer the following question: What sparse tensor Q ∈ R^(m1 × m2 × m3) will reconstruct P from a triple of orthonormal bases (U, V, W) with the least error ‖P − Q ×_1 U ×_2 V ×_3 W‖²? Sparsity is again quantified by an upper bound T on the L0 norm of Q (denoted as ‖Q‖_0). We now prove that the optimal Q under this sparsity constraint is obtained by nullifying the least (in absolute value) m1·m2·m3 − T elements of the estimated projection tensor S = P ×_1 U^T ×_2 V^T ×_3 W^T. Due to the orthonormality of U, V and W, this simple greedy algorithm turns out to be optimal (see Theorem 2).

Theorem 2: Given a triple of orthonormal bases (U, V, W), the optimal sparse projection tensor Q with ‖Q‖_0 = T is obtained by setting to zero the m1·m2·m3 − T elements of the tensor S = P ×_1 U^T ×_2 V^T ×_3 W^T having least absolute value.

Proof: We have P = S ×_1 U ×_2 V ×_3 W. The error in reconstructing the patch P using some other tensor Q is e = ‖(S − Q) ×_1 U ×_2 V ×_3 W‖². For any tensor X, we have ‖X‖² = ‖X_(n)‖² (i.e. the Frobenius norms of the tensor and of its nth unfolding are the same [13]). Also, by the matrix representation of HOSVD, we have X_(1) = U S_(1) (V ⊗ W)^T, where ⊗ denotes the Kronecker product (see footnote 5). Hence it follows that e = ‖U (S − Q)_(1) (V ⊗ W)^T‖², which gives us e = ‖S − Q‖². The last step follows because U, V and W, and hence V ⊗ W, are orthonormal matrices. Let I1 = {(i, j, k) | Q_ijk = 0} and I2 = {(i, j, k) | Q_ijk ≠ 0}. Then

e = Σ_{(i,j,k) ∈ I1} S_ijk² + Σ_{(i,j,k) ∈ I2} (S_ijk − Q_ijk)².

This error is minimized when Q_ijk = S_ijk at all locations where Q_ijk ≠ 0, and Q_ijk = 0 at those indices where the corresponding values of S are as small (in absolute value) as possible. Thus, if we want ‖Q‖_0 = T, then Q is the tensor obtained by nullifying the m1·m2·m3 − T entries of S that have the least absolute value and leaving the remaining elements intact.

We wish to re-emphasize that a key feature of our approach is that the same technique used for 2D images scales to higher dimensions. Tensor decompositions such as HOSVD do not share this feature, because the optimal low-rank reconstruction property of SVD does not extend to HOSVD. Furthermore, though the upper and lower error bounds for core-tensor truncation in HOSVD derived in [18] are interesting, they are applicable only when the entire set of images has a common basis (i.e. a common U, V and W matrix), which may not be sufficient to compactly account for the variability in real datasets.
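The greedy rule of Theorem 2 is a one-liner in practice: keep the T largest-magnitude entries of S. A minimal numpy sketch (helper names are ours) also checks the key consequence of orthonormality, namely that the reconstruction error equals the energy of the discarded coefficients.

```python
import numpy as np

def mode_n_product(T, M, n):
    out = np.tensordot(T, M, axes=(n, 1))
    return np.moveaxis(out, -1, n)

def sparsify(S, T_sparsity):
    """Theorem 2: zero out all but the T largest-magnitude entries of S."""
    Q = np.zeros_like(S)
    keep = np.argsort(np.abs(S).ravel())[-T_sparsity:]  # indices of the T largest |S_ijk|
    Q.ravel()[keep] = S.ravel()[keep]
    return Q

rng = np.random.default_rng(1)
m1, m2, m3, T = 6, 6, 3, 10
P = rng.standard_normal((m1, m2, m3))
U, _ = np.linalg.qr(rng.standard_normal((m1, m1)))
V, _ = np.linalg.qr(rng.standard_normal((m2, m2)))
W, _ = np.linalg.qr(rng.standard_normal((m3, m3)))

S = mode_n_product(mode_n_product(mode_n_product(P, U.T, 0), V.T, 1), W.T, 2)
Q = sparsify(S, T)
P_hat = mode_n_product(mode_n_product(mode_n_product(Q, U, 0), V, 1), W, 2)

# Orthonormality makes the reconstruction error exactly ||S - Q||^2,
# i.e. the summed energy of the nullified coefficients.
err = np.sum((P - P_hat) ** 2)
print(np.isclose(err, np.sum((S - Q) ** 2)))  # True
```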

    B. Learning the Bases

We now describe a method to learn K exemplar orthonormal bases {(Ua, Va, Wa)}, 1 ≤ a ≤ K, to encode a training set of N image patches Pi ∈ R^(m1 × m2 × m3) (1 ≤ i ≤ N) with the least possible error (in the L2-norm sense). Note that K ≪ N. In addition, we impose a sparsity constraint that every Sia (the tensor used to reconstruct Pi from (Ua, Va, Wa)) has at

5: Here A ⊗ B refers to the Kronecker product of matrices A ∈ R^(E1 × E2) and B ∈ R^(F1 × F2), which is given as A ⊗ B = (A_{e1 e2} B), 1 ≤ e1 ≤ E1, 1 ≤ e2 ≤ E2.

most T non-zero elements. The main objective function to be minimized is:

E({Ua, Va, Wa, Sia, Mia}) = Σ_{i=1}^{N} Σ_{a=1}^{K} Mia ‖Pi − Sia ×_1 Ua ×_2 Va ×_3 Wa‖²    (11)

subject to the following constraints:
(1) Ua^T Ua = Va^T Va = Wa^T Wa = I, ∀a.
(2) ‖Sia‖_0 ≤ T, ∀(i, a).
(3) Σ_a Mia = 1, ∀i, and Mia ∈ {0, 1}, ∀(i, a).    (12)

Here Mia is a binary matrix of size N × K which indicates whether the ith patch belongs to the space defined by (Ua, Va, Wa). Just as for the 2D case, we relax the binary membership constraint so that now Mia ∈ (0, 1), ∀(i, a), subject to Σ_{a=1}^{K} Mia = 1, ∀i. Using Lagrange parameters {λi}, symmetric Lagrange matrices {Λ1a}, {Λ2a} and {Λ3a}, and a temperature parameter β, we obtain the following deterministic annealing objective function:

E({Ua, Va, Wa, Sia, Mia}) = Σ_{i=1}^{N} Σ_{a=1}^{K} Mia ‖Pi − Sia ×_1 Ua ×_2 Va ×_3 Wa‖²
    + (1/β) Σ_{i,a} Mia log Mia + Σ_i λi (Σ_a Mia − 1)
    + Σ_a trace(Λ1a (Ua^T Ua − I)) + Σ_a trace(Λ2a (Va^T Va − I)) + Σ_a trace(Λ3a (Wa^T Wa − I)).    (13)

We first initialize {Ua}, {Va} and {Wa} to random orthonormal matrices ∀a, and Mia = 1/K, ∀(i, a). Using the fact that {Ua}, {Va} and {Wa} are orthonormal, the projection tensor Sia is computed by the rule:

Sia = Pi ×_1 Ua^T ×_2 Va^T ×_3 Wa^T, ∀(i, a).    (14)

This is the minimum of the energy function ‖Pi − Sia ×_1 Ua ×_2 Va ×_3 Wa‖². Then the m1·m2·m3 − T elements of Sia with least absolute value are nullified. Thereafter, Ua, Va and Wa are updated as follows. Denote the sum of the terms in the previous energy function that are independent of Ua as C, and the nth unfolding of the tensor Pi as Pi_(n) (details in [13]). Using the fact that ‖X‖² = ‖X_(n)‖² for any tensor X, we write the energy function as:

E({Ua, Va, Wa, Sia, Mia}) = Σ_{i,a} Mia ‖Pi − (Sia ×_1 Ua ×_2 Va ×_3 Wa)‖² + Σ_a trace(Λ1a (Ua^T Ua − I)) + C    (15)

  • IEEE TRANSACTIONS ON IMAGE PROCESSING 11

This produces:

E({Ua, Va, Wa, Sia, Mia})
    = Σ_{i,a} Mia ‖Pi_(1) − Ua Sia_(1) (Va ⊗ Wa)^T‖² + Σ_a trace(Λ1a (Ua^T Ua − I)) + C
    = Σ_{i,a} Mia [trace(Pi_(1)^T Pi_(1)) − 2 trace(Pi_(1)^T Ua Sia_(1) (Va ⊗ Wa)^T) + trace(Sia_(1)^T Sia_(1))]
    + Σ_a trace(Λ1a (Ua^T Ua − I)) + C.    (16)

Setting the derivative of E w.r.t. Ua to zero, and eliminating the Lagrange matrices, we obtain an update rule for Ua:

Z_Ua = Σ_i Mia Pi_(1) (Va ⊗ Wa) Sia_(1)^T;
Ua = Z_Ua (Z_Ua^T Z_Ua)^(−1/2) = Γ1a Θ1a^T.    (17)

Here Γ1a and Θ1a are orthonormal matrices obtained from the SVD of Z_Ua. Va and Wa are updated similarly, using the second and third unfoldings of the tensors respectively. The membership values are obtained by:

Mia = exp(−β ‖Pi − Sia ×_1 Ua ×_2 Va ×_3 Wa‖²) / Σ_{b=1}^{K} exp(−β ‖Pi − Sib ×_1 Ub ×_2 Vb ×_3 Wb‖²).    (18)

The core tensors {Sia} and the matrices {Ua, Va, Wa} and M are then updated sequentially, one after another, for a fixed β value, until convergence. The value of β is then increased, and the sequential updates are repeated. The entire process is repeated until an integrality condition is met.
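One pass of the alternating updates (Eqns. 14, 17 and 18) might be sketched as below, on toy data of our own choosing. Only the Ua update is written out; Va and Wa are analogous via the mode-2 and mode-3 unfoldings, and the annealing loop over increasing β is omitted. All function names are ours.

```python
import numpy as np

def mode_n_product(T, M, n):
    out = np.tensordot(T, M, axes=(n, 1))
    return np.moveaxis(out, -1, n)

def project(P, U, V, W):      # Eqn. (14): S = P x_1 U^T x_2 V^T x_3 W^T
    return mode_n_product(mode_n_product(mode_n_product(P, U.T, 0), V.T, 1), W.T, 2)

def reconstruct(S, U, V, W):
    return mode_n_product(mode_n_product(mode_n_product(S, U, 0), V, 1), W, 2)

def sparsify(S, T_sparsity):  # keep the T largest-magnitude entries
    Q = np.zeros_like(S)
    keep = np.argsort(np.abs(S).ravel())[-T_sparsity:]
    Q.ravel()[keep] = S.ravel()[keep]
    return Q

def polar_orthonormal(Z):
    """Z (Z^T Z)^(-1/2), computed stably from the SVD of Z (Eqn. 17)."""
    B, _, Ct = np.linalg.svd(Z, full_matrices=False)
    return B @ Ct

def train_step(patches, bases, M, T_sparsity, beta):
    """One sequential pass of the updates at a fixed temperature beta."""
    N, K = len(patches), len(bases)
    m1, m2, m3 = patches[0].shape
    # (1) sparse projection tensors S_ia for every patch/basis pair
    S = [[sparsify(project(P, *bases[a]), T_sparsity) for a in range(K)]
         for P in patches]
    # (2) U_a update via the mode-1 unfolding (C order: P_i(1) is m1 x m2*m3)
    for a in range(K):
        U, V, W = bases[a]
        Z = sum(M[i, a] * patches[i].reshape(m1, -1)           # P_i(1)
                @ np.kron(V, W) @ S[i][a].reshape(m1, -1).T    # (V (x) W) S_ia(1)^T
                for i in range(N))
        bases[a] = (polar_orthonormal(Z), V, W)
    # (3) soft membership update (Eqn. 18)
    E = np.array([[np.sum((patches[i] - reconstruct(S[i][a], *bases[a])) ** 2)
                   for a in range(K)] for i in range(N)])
    M = np.exp(-beta * E)
    return bases, M / M.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
m1, m2, m3, N, K = 4, 4, 3, 20, 2
patches = [rng.standard_normal((m1, m2, m3)) for _ in range(N)]
bases = [tuple(np.linalg.qr(rng.standard_normal((m, m)))[0] for m in (m1, m2, m3))
         for _ in range(K)]
M = np.full((N, K), 1.0 / K)
bases, M = train_step(patches, bases, M, T_sparsity=8, beta=1.0)
print(np.allclose(bases[0][0].T @ bases[0][0], np.eye(m1)))  # updated U_a stays orthonormal
```

The polar factor Z(Z^T Z)^(-1/2) from the SVD keeps each updated basis exactly orthonormal, which is the role of the Lagrange matrices Λ1a in Eqn. (13).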

C. Application to Compact Image Representation

Quite similar to the 2D case, after the optimization during the training phase, each training patch Pi (1 ≤ i ≤ N) gets represented as a projection onto the one of the K exemplar orthonormal bases that produces the least reconstruction error, i.e. the kth exemplar is chosen if ‖Pi − Sik ×_1 Uk ×_2 Vk ×_3 Wk‖² ≤ ‖Pi − Sia ×_1 Ua ×_2 Va ×_3 Wa‖², ∀a ∈ {1, 2, ..., K}, 1 ≤ k ≤ K. For training patch Pi, we denote the corresponding optimal projection tensor as Si* = Sik, and the corresponding exemplar as (Ui*, Vi*, Wi*) = (Uk, Vk, Wk). Thus the entire set of patches can be closely approximated by (1) the common set of basis triples {(Ua, Va, Wa)}, 1 ≤ a ≤ K (K ≪ N), and (2) the optimal sparse projection tensors {Si*} for each patch, with at most T non-zero elements each. The overall storage per image is thus greatly reduced. Furthermore, these bases {(Ua, Va, Wa)} can now be used to encode patches from a new set of images that are similar to the ones existing in the training set, though, just as in the 2D case, the sparsity of each patch is adjusted dynamically in order to meet a given error threshold. Experimental results for this are provided in the next section.

V. EXPERIMENTS: COLOR IMAGES

In this section, we describe the experiments we performed

    on color images represented in the RGB color scheme. Each

image of size M1 × M2 × 3 was treated as a third-order matrix. Our compression algorithm was tested on the GaTech Face Database [32], which consists of RGB color images, with 15 images each of 50 different people. The images in the database are already cropped to include just the face, but some of them contain small portions of a distinctly cluttered background. The average image size is 150 × 150 pixels, and all the images are in the JPEG format. For our experiments, we divided this database into a training set of one image each of 40 different people, and a test set of the remaining 14 images each of these 40 people, plus all 15 images each of the remaining 10 people. Thus, the size of the test set was 710 images. The patch size we chose for the training and test sets was 12 × 12 × 3, and we experimented with K = 100 different orthonormal bases learned during training. The value of T during training was set to 10. At the time of testing, for our method, each patch was projected onto that triple of orthonormal bases (Ui*, Vi*, Wi*) which gave the sparsest projection tensor Si* such that the per-pixel reconstruction error ‖Pi − Si* ×_1 Ui* ×_2 Vi* ×_3 Wi*‖² / (m1·m2) was no greater than a chosen threshold. Note that, in calculating the per-pixel reconstruction error, we did not divide by the number of channels, i.e. 3, because three values are defined at each pixel. We experimented with reconstruction error values ranging from 8 × 10^-5 to 8 × 10^-3. Following the reconstruction, the PSNR for the entire image was measured and averaged over the entire test set. The total number of bits per pixel, i.e. RPP, was also calculated for each image and averaged over the test set. The details of the training and testing methodology, and also the actual quantization and coding step, are the same as presented previously in Sections III-A and III-B.

A. Comparisons with Other Methods

The results obtained by our method were compared to those obtained by three techniques: (1) KSVD, (2) our 2D algorithm applied to separate channels, and (3) the JPEG standard. (1) For KSVD, we used patches of size 12 × 12 × 3 reshaped to give vectors of size 432, and used these to train a dictionary of 1340 vectors using a value of T = 30. (2) For our algorithm for 2D images from Section II, we used an independent (separate) encoding of each of the three channels. As an independent coding of the R, G and B slices would fail to account for the inherent correlation between the channels (and hence give inferior compression performance), we used principal components analysis (PCA) to find the three principal components of the R, G, B values of each pixel from the training set. The R, G, B pixel values from the test images were then projected onto these principal components to give a transformed image in which the values in the different channels are decorrelated. A similar approach was taken earlier in [33] for compression of color images of faces using vector quantization, where the PCA method is empirically shown to produce channels that are even more decorrelated than those from the Y-Cb-Cr color model. The orthonormal bases were learned on each of the three (decorrelated) channels of the PCA image. This was followed by the quantization and coding step similar to that described in Section III-B. However,


in the color image case, the Huffman encoding step for finding the optimal values of a1, a2 and a3 in Eqn. (10) was performed using projection matrices from all three channels together. This was done to improve the coding efficiency. (3) For the JPEG standard (its MATLAB implementation), we calculated the RPP from the number of bytes of file storage on disk. See Section V-B for more details.
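The PCA decorrelation step used in the separate-channel variant might be sketched as follows, on synthetic correlated "RGB" pixels of our own making (function names are ours; the paper then codes each decorrelated channel with the 2D method):

```python
import numpy as np

def fit_rgb_pca(train_pixels):
    """Principal axes of N x 3 training RGB values."""
    mean = train_pixels.mean(axis=0)
    cov = np.cov(train_pixels - mean, rowvar=False)
    _, vecs = np.linalg.eigh(cov)            # eigh returns ascending eigenvalues
    return mean, vecs[:, ::-1]               # columns in descending-variance order

def decorrelate(image, mean, axes):
    """Map an H x W x 3 image to three (nearly) decorrelated channel values."""
    return (image.reshape(-1, 3) - mean) @ axes

rng = np.random.default_rng(5)
# strongly correlated synthetic training pixels
base = rng.standard_normal((5000, 1))
train = np.hstack([base + 0.1 * rng.standard_normal((5000, 1)) for _ in range(3)])

mean, axes = fit_rgb_pca(train)
img = train.reshape(50, 100, 3)
channels = decorrelate(img, mean, axes)
cov = np.cov(channels, rowvar=False)
# projecting onto the covariance eigenvectors diagonalizes the sample covariance
print(np.allclose(cov - np.diag(np.diag(cov)), 0, atol=1e-6))  # off-diagonals vanish
```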

We would like to mention here that we did not compare our technique with the face image compression technique presented in [33]. The latter requires prior demarcation of the salient feature regions, which are encoded with more bits. Our method does not require any such prior segmentation, as the number of coefficients is tuned automatically to meet the pre-specified reconstruction error.

B. Results

As can be seen from Figure 7(a), all three methods perform well, though the higher-order method produced the best results beyond a bit rate of around 2 per pixel. For the GaTech database, we did not compare our algorithm directly with JPEG, because the images in the database are already in the JPEG format. However, to facilitate comparison with JPEG, we used a subset of 54 images from the CMU-PIE database [34]. The CMU-PIE database contains images of several people against cluttered backgrounds, with a large variation in pose, illumination, facial expression and occlusions created by spectacles. All the images are available in an uncompressed (.ppm) format, and their size is 631 × 467 pixels. We chose 54 images belonging to one and the same person, and used exactly one image for training and all the remaining ones for testing. Experiments with our higher-order method, our method involving separate channels, and also KSVD, revealed performance that is competitive with JPEG. For a bit rate greater than 1.5 per pixel, our methods produced performance superior to that of JPEG in terms of the PSNR for a given RPP, as seen in Figure 9. The parameters for this experiment were K = 100 and T = 10 for training. Sample reconstructions of an image from the CMU-PIE database using our higher-order method for different error values are shown in Figure 8. This is quite interesting, since there is considerable variation between the training image and the test images, as is clear from Figure 8. We would like to emphasize that the experiments were carried out on uncropped images of the full size with the complete cluttered background. Also, the dictionary sizes for the experiments on each of these databases are summarized in Table II. For color-image patches of size m1 × m2 × 3 using K sets of bases, our method in 3D requires a dictionary of size 3·K·m1·m2, whereas our method in 2D requires a dictionary of size 2·K·m1·m2 per channel, i.e. 6·K·m1·m2 in total.

C. Comparisons with JPEG on Noisy Datasets

As mentioned before, we did not directly compare our results to the JPEG technique for the GaTech Face Database, because the images in the database are already in the JPEG format. Instead, we added zero-mean Gaussian noise of variance 8 × 10^-4 (on a color scale of [0, 1]) to the images of the GaTech database and converted them to a raw format.

[Fig. 7 plots: (a) PSNR vs. RPP, training and testing on clean images (Our Method 3D, Our Method 2D, KSVD); (b) PSNR vs. RPP, clean training and noisy testing; (c) PSNR vs. RPP, noisy training and noisy testing (JPEG included in (b) and (c)).]

Fig. 7. ROC curves for the GaTech database: (a) when training and testing were done on the clean original images, (b) when training was done on clean images, but testing was done with additive zero-mean Gaussian noise of variance 8 × 10^-4 added to the test images, and (c) when training and testing were both done with zero-mean Gaussian noise of variance 8 × 10^-4 added to the respective images. The methods tested were our higher-order method, our method in 2D, KSVD and JPEG. Parameters for our methods: K = 100 and T = 10 during training. These plots are best viewed in color.

[Fig. 8 panels: Training Image, Original Test Image, Error 0.003, Error 0.001, Error 0.0005, Error 0.0003]

Fig. 8. Sample reconstructions of an image from the CMU-PIE database with different error values using our higher-order method. The original image is at the top-left. These images are best viewed in color and when zoomed into the pdf file.

[Fig. 9 plot: CMU-PIE: PSNR vs. RPP for Our Method (3D), Our Method (2D), KSVD and JPEG.]

Fig. 9. ROC curves for the CMU-PIE database using our method in 3D, our method in 2D, KSVD and JPEG. Parameters for our methods: K = 100 and T = 10 during training. These plots are best viewed in color.


Method              Dictionary Size (number of scalars)
Our Method (3D)     100 × 3 × 12 × 12 = 43200
Our Method (2D)     50 × 2 × 3 × 12 × 12 = 43200
KSVD                1340 × 12 × 12 × 3 = 578880

TABLE II
COMPARISON OF DICTIONARY SIZE FOR VARIOUS METHODS FOR EXPERIMENTS ON THE GATECH AND CMU-PIE DATABASES.

Following this, we converted these raw images back to JPEG (using MATLAB) and measured the RPP and PSNR. These figures were pitted against those obtained by our higher-order method, our method on separate channels, as well as KSVD, as shown in Figures 7(b) and 7(c). The performance of JPEG was distinctly inferior, despite the fact that the added noise did not have a very high variance. The reason for this is that the algorithms used by JPEG capitalize on the fact that, in most natural images, the lower frequencies strongly dominate. This assumption can fall apart in the case of sensor noise. As a result, the coefficients produced by the DCT step of JPEG on noisy images will have prominently higher values, giving rise to higher bit rates for the same PSNR. For the purposes of comparison, we ran two experiments using our higher-order method, our separate-channel method and KSVD as well. In the first experiment, noise was added only to the test set, while the orthonormal bases or the dictionary were learned on a clean training set (i.e. without noise added to the training images). In the second experiment, noise was added to every image from the training set, and all methods were trained on these noisy images. The testing was performed on (noisy) images from the test set, and the ROC curves were plotted as usual. As can be seen from Figures 7(b) and 7(c), the PSNR values for JPEG begin to plateau off rather quickly. Note that in all these experiments, the PSNR is calculated from the squared difference between the reconstructed image and the noisy image. This is because the original clean image would usually be unavailable in a practical application that required compression of noisy data.

VI. CONCLUSION

We have presented a new technique for sparse representation of image patches, based on the notion of treating an image as a 2D entity. We go beyond the usual singular value decomposition of individual images or image patches to learn a set of common, full-sized orthonormal bases for sparse representation, as opposed to low-rank representation. For the projection of a patch onto these bases, we have provided a very simple and provably optimal greedy algorithm, unlike the approximation algorithms that are part and parcel of projection pursuit techniques, which may require restrictive assumptions on the nature or properties of the dictionary [21]. Based on the developed theory, we have demonstrated a successful application of our technique to the compression of two well-known face databases. Our compression method handles the varying complexity of different patches by dynamically altering the number of coefficients (and hence the bit rate for storage) in the projection matrix of the patch. Furthermore, this paper presents a clean and elegant extension to higher-order matrices, with applications to color-image compression. Unlike decompositions such as HOSVD, which do not retain the optimal low-rank reconstruction property of SVD, our method scales cleanly to higher dimensions. The experimental results show that our technique compares very favorably with other existing approaches from recent literature, including the JPEG standard. We have also empirically examined the performance of our algorithm on noisy color image datasets.
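The sparse projection described above can be sketched as follows: compute the full coefficient matrix U^T P V for a patch P under each exemplar basis pair (U, V), keep only the k largest-magnitude coefficients (optimal in Frobenius norm for orthonormal bases, by Parseval's identity), and greedily select the basis yielding the lowest reconstruction error. Function and variable names below are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def sparse_project(patch, U, V, k):
    """k-sparse projection of `patch` onto the orthonormal pair (U, V).
    Truncating to the k largest-magnitude coefficients is the optimal
    k-sparse approximation when U and V are orthonormal."""
    S = U.T @ patch @ V                   # full coefficient matrix
    flat = np.abs(S).ravel()
    keep = np.argsort(flat)[-k:]          # indices of the k largest entries
    S_sparse = np.zeros_like(S).ravel()
    S_sparse[keep] = S.ravel()[keep]
    S_sparse = S_sparse.reshape(S.shape)
    return U @ S_sparse @ V.T, S_sparse   # reconstruction, sparse coefficients

def best_basis(patch, bases, k):
    """Pick the exemplar basis pair whose k-sparse projection
    minimizes the reconstruction error for this patch."""
    errors = []
    for U, V in bases:
        recon, _ = sparse_project(patch, U, V, k)
        errors.append(np.linalg.norm(patch - recon))
    return int(np.argmin(errors))
```

Increasing k per patch is what allows the bit rate to adapt to patch complexity.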

Directions for future work include: (1) investigation of alternative matrix representations tuned for specific applications (as opposed to using the default rows and columns of the image), (2) application of our technique to the compression of gray-scale and color video (represented as 3D and 4D matrices, respectively), or to denoising, and (3) a theoretical study of the method in the context of natural image statistics.

REFERENCES

[1] A. Rangarajan, "Learning matrix space image representations," in Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR), 2001, vol. 2134 of LNCS, pp. 153–168, Springer Verlag.

[2] J. Ye, "Generalized low rank approximations of matrices," Machine Learning, vol. 61, pp. 167–191, 2005.

[3] J. Yang, D. Zhang, A. F. Frangi, and J.-y. Yang, "Two-dimensional PCA: A new approach to appearance-based face representation and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 1, 2004.

[4] X. He, D. Cai, and P. Niyogi, "Tensor subspace analysis," in Neural Information Processing Systems, 2005, pp. 499–506.

[5] H. C. Andrews and C. L. Patterson, "Singular value decomposition (SVD) image coding," IEEE Transactions on Communications, pp. 425–432, 1976.

[6] J.-F. Yang and C.-L. Lu, "Combined techniques of singular value decomposition and vector quantization for image coding," IEEE Transactions on Image Processing, vol. 4, no. 8, pp. 1141–1146, 1995.

[7] C. Ding and J. Ye, "Two-dimensional singular value decomposition (2DSVD) for 2D maps and images," in SIAM Int'l Conf. Data Mining, 2005, pp. 32–43.

[8] M. Lewicki, T. Sejnowski, and H. Hughes, "Learning overcomplete representations," Neural Computation, vol. 12, pp. 337–365, 2000.

[9] M. Aharon, M. Elad, and A. Bruckstein, "The K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, November 2006.

[10] D. Mallat and M. Avellaneda, "Adaptive greedy approximations," Journal of Constructive Approximations, vol. 13, pp. 57–98, 1997.

[11] K. Gurumoorthy, A. Rajwade, A. Banerjee, and A. Rangarajan, "Beyond SVD: sparse projections onto exemplar orthonormal bases for compact image representation," in International Conference on Pattern Recognition (ICPR), 2008.

[12] A. Shashua and A. Levin, "Linear image coding for regression and classification using the tensor-rank principle," in CVPR (1), 2001, pp. 42–49.

[13] L. de Lathauwer, Signal Processing Based on Multilinear Algebra, PhD thesis, Katholieke Universiteit Leuven, Belgium, 1997.

[14] J. Hastad, "Tensor rank is NP-complete," in ICALP '89: Proceedings of the 16th International Colloquium on Automata, Languages and Programming, 1989, pp. 451–460.

[15] M. A. O. Vasilescu and D. Terzopoulos, "Multilinear analysis of image ensembles: TensorFaces," in International Conference on Pattern Recognition (ICPR), 2002, pp. 511–514.

[16] R. Costantini, L. Sbaiz, and S. Susstrunk, "Higher order SVD analysis for dynamic texture synthesis," IEEE Transactions on Image Processing, vol. 17, no. 1, pp. 42–52, Jan. 2008.

[17] H. Wang and N. Ahuja, "Rank-R approximation of tensors: Using image-as-matrix representation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, vol. 2, pp. 346–353.

[18] C. Ding, H. Huang, and D. Luo, "Tensor reduction error analysis: applications to video compression and classification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.


[19] A. Ranade, S. Mahabalarao, and S. Kale, "A variation on SVD based image compression," Image and Vision Computing, vol. 25, no. 6, pp. 771–777, 2007.

[20] Y. Pati, R. Rezaiifar, and P. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," in 27th Asilomar Conference on Signals, Systems and Computation, 1993, vol. 1, pp. 40–44.

[21] J. Tropp, "Greed is good: algorithmic results for sparse approximation," IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, 2004.

[22] A. Yuille and J. Kosowsky, "Statistical physics algorithms that converge," Neural Computation, vol. 6, no. 3, pp. 341–356, 1994.

[23] R. Hathaway, "Another interpretation of the EM algorithm for mixture distributions," Statistics and Probability Letters, vol. 4, pp. 53–56, 1986.

[24] M. Tipping and C. Bishop, "Mixtures of probabilistic principal component analysers," Neural Computation, vol. 11, no. 2, pp. 443–482, 1999.

[25] The ORL Database of Faces, http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.

[26] The Yale Face Database, http://cvc.yale.edu/projects/yalefaces/yalefaces.html.

[27] The CVG Granada Database, http://decsai.ugr.es/cvg/dbimagenes/index.php.

[28] The Uncompressed Colour Image Database, http://vision.cs.aston.ac.uk/datasets/UCID/ucid.html.

[29] M. Marcellin, M. Lepley, A. Bilgin, T. Flohr, T. Chinen, and J. Kasner, "An overview of quantization in JPEG 2000," Signal Processing: Image Communication, vol. 17, no. 1, pp. 73–84, 2000.

[30] M. Adams and R. Ward, "JasPer: A portable flexible open-source software toolkit for image coding/processing," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2004, pp. 241–244.

[31] M. Hilton, B. Jawerth, and A. Sengupta, "Compressing still and moving images with wavelets," Multimedia Systems, vol. 2, no. 5, pp. 218–227, 1994.

[32] The Georgia Tech Face Database, http://www.anefian.com/face_reco.htm.

[33] J. Huang and Y. Wang, "Compression of color facial images using feature correction two-stage vector quantization," IEEE Transactions on Image Processing, vol. 8, pp. 102–109, 1999.

[34] The CMU Pose, Illumination and Expression (PIE) Database, http://www.ri.cmu.edu/projects/project_418.html.