
Example-Based Learning for Single-Image Super-Resolution

Kwang In Kim¹ and Younghee Kwon²

¹ Max-Planck-Institute für biologische Kybernetik, Spemannstr. 38, D-72076 Tübingen, Germany

² Korea Advanced Institute of Science and Technology, 373-1 Kusong-dong, Yusong-Ku, Taejon, Korea

G. Rigoll (Ed.): DAGM 2008, LNCS 5096, pp. 456–465, 2008. © Springer-Verlag Berlin Heidelberg 2008

Abstract. This paper proposes a regression-based method for single-image super-resolution. Kernel ridge regression (KRR) is used to estimate the high-frequency details of the underlying high-resolution image. A sparse solution of KRR is found by combining the ideas of kernel matching pursuit and gradient descent, which allows the time complexity to be kept at a moderate level. To resolve the problem of ringing artifacts occurring due to the regularization effect, the regression results are post-processed using a prior model of a generic image class. Experimental results demonstrate the effectiveness of the proposed method.

1 Introduction

Single-image super-resolution refers to the task of constructing a high-resolution enlargement of a given low-resolution image. This problem is inherently ill-posed as there are generally multiple high-resolution images that can produce the same low-resolution image. Accordingly, prior information is required to approach this problem. Often, this prior information is available either in the explicit form of an energy functional defined on the image class [9,10], or in the implicit form of example images leading to example-based super-resolution [1,2,3,5].

Previous example-based super-resolution algorithms can be characterized as nearest neighbor (NN)-based estimation [1,2,3]: during the training phase, pairs of low-resolution and corresponding high-resolution image patches (sub-windows of images) are collected. Then, in the super-resolution phase, each patch of the given low-resolution image is compared to the stored low-resolution patches, and the high-resolution patch corresponding to the nearest low-resolution patch is selected as the output. For instance, Freeman et al. [2] posed image super-resolution as the problem of estimating missing high-frequency details after interpolating the input low-resolution image to the desired scale (which results in a blurred image). Then, the super-resolution was performed by NN-based estimation of high-frequency patches based on the corresponding patches of the input low-frequency image.
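For illustration only, the lookup step described above can be sketched as a brute-force nearest-neighbor search over stored patch pairs (a schematic, not the implementation of [1,2,3]):

```python
import numpy as np

def nn_super_resolve_patch(low_patch, stored_low, stored_high):
    """NN-based estimation of a high-resolution patch (illustrative sketch).

    stored_low  : (n, M) training low-resolution (or mid-frequency) patches
    stored_high : (n, N) corresponding high-resolution / high-frequency patches
    The high-resolution patch paired with the nearest stored low-resolution
    patch is returned as the output.
    """
    d2 = ((stored_low - low_patch) ** 2).sum(axis=1)
    return stored_high[np.argmin(d2)]
```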

Although this method (and also other NN-based methods) has already shown an impressive performance, there is still room for improvement if one views image super-resolution as a regression problem, i.e., finding a map f from the space of low-resolution image patches X to the space of target high-resolution patches Y. It is well known in the machine learning community that NN-based estimation suffers from overfitting, where one obtains a function which explains the training data perfectly yet cannot be generalized to unknown data. In super-resolution, this can result in noisy reconstructions in complex image regions (cf. Sect. 3). Accordingly, it is reasonable to expect that NN-based methods can be improved by adopting learning algorithms with a regularization capability to avoid overfitting.

Based on the framework of Freeman et al. [2], Kim et al. posed the problem of estimating the high-frequency details as a regression problem, which is then resolved by support vector regression (SVR) [6]. Meanwhile, Ni and Nguyen utilized SVR in the frequency domain and posed the super-resolution as a kernel learning problem [7]. While SVR produced a significant improvement over existing example-based methods, it has several drawbacks in building a practical system: 1. As a regularization framework, SVR tends to smooth the sharp edges and produce oscillations along the major edges. This might lead to a low reconstruction error on average, but is visually implausible; 2. SVR results in a dense solution, i.e., the regression function is expanded in the whole set of training data points and accordingly is computationally demanding both in training and in testing.¹

¹ In our simulation, the optimum value of ε for the ε-insensitive loss function of SVR was close to zero.

The current work extends the framework of Kim et al. [6]. Kernel ridge regression (KRR) is utilized for the regression. Due to the observed optimality of ε at (nearly) 0 for SVR in our previous study, the only difference between SVR and KRR in the proposed setting is their loss functions (L1- and L2-loss, respectively). The L2-loss adopted by KRR is differentiable and facilitates gradient-based optimization. To reduce the time complexity of KRR, a sparse basis is found by combining the idea of kernel matching pursuit (KMP) [11] and gradient descent such that the time complexity and the quality of super-resolution can be traded off. As the regularizer of KRR is the same as that of SVR, the problem of oscillations along the major edges still remains. This is resolved by exploiting a prior over image structure proposed by Tappen et al. [9].
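For reference, the two loss functions in question can be written in their standard textbook forms (not quoted from [6] or [7]); with ε → 0, the ε-insensitive loss reduces to the absolute (L1) loss, so the remaining difference from KRR is L1 versus L2:

```latex
% epsilon-insensitive loss (SVR) and squared loss (KRR) for a residual r = f(x) - y
\[
  \ell_\varepsilon(r) \;=\; \max\bigl(0,\; |r| - \varepsilon\bigr),
  \qquad
  \ell_2(r) \;=\; \tfrac{1}{2}\, r^2 .
\]
% The squared loss is differentiable everywhere, which is what makes the
% gradient-based sparse optimization described below convenient.
```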

2 Regression-Based Image Super-Resolution

Base System. Adopting the framework of Freeman et al. [2], for the super-resolution of a given image, we estimate the corresponding missing high-frequency details based on its interpolation to the desired scale, which in this work is obtained by bicubic interpolation. Furthermore, based on the conditional independence assumption of high- and low-frequency components given the mid-frequency components of an image [2], the estimation of the high-frequency components (Y) is performed based on the Laplacian of the bicubic interpolation (X). Y is then added to the bicubic interpolation to produce the super-resolved image Z.
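A minimal sketch of this base pipeline, assuming a grayscale input and a trained regressor `predict_high_freq` (a hypothetical placeholder for the patch-wise KRR described below):

```python
import numpy as np
from scipy import ndimage

def super_resolve(low_res, scale, predict_high_freq):
    """Sketch of the base pipeline described above (helper names are assumed).

    low_res           : 2-D grayscale image as a float array
    scale             : magnification factor (2 in the experiments)
    predict_high_freq : regressor mapping the mid-frequency image X to the
                        estimated high-frequency detail Y (e.g. sparse KRR)
    """
    # Interpolate to the desired scale (bicubic, order=3); this gives a blurred image.
    bicubic = ndimage.zoom(low_res, scale, order=3)

    # Mid-frequency components X: Laplacian of the bicubic interpolation.
    X = ndimage.laplace(bicubic)

    # Estimated high-frequency details Y (patch-wise regression in the paper).
    Y = predict_high_freq(X)

    # Super-resolved image Z = bicubic interpolation + estimated details.
    return bicubic + Y
```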

To retain the complexity of the resulting regression problem at a moderate level, a patch-based approach is taken where the estimation of the values of Y at specific locations, N_N(Y(x, y)), is performed based only on the values of X at the corresponding locations, N_M(X(x, y)), where N_G(S(x, y)) represents a G-sized square window (patch) centered at the location (x, y) of the image S.

Then, during the super-resolution, X is scanned with a small window (of size M) to produce a patch-valued regression result (of size N) for each pixel. This results in a set of candidate pixels for each location of Z (as the patches overlap with their neighbors), which are then combined to make the final estimate (details will be provided later). The training images for the regressor are obtained by blurring and subsampling (by bicubic resampling) a set of high-resolution images to constitute a set of low- and high-resolution image pairs. The training image patch pairs are randomly sampled therein. To increase the efficiency of the training set, the data are contrast-normalized [2]: during the construction of the training set, both the input image patch and the corresponding desired patch are normalized by dividing them by the L1-norm of the input patch. For an unseen image patch, the input is again normalized before the regression and the corresponding output is inverse-normalized.
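A small sketch of this contrast normalization, under the assumption that a small epsilon guards against zero-norm patches (a detail not specified in the text):

```python
import numpy as np

def contrast_normalize(x_patch, y_patch=None, eps=1e-8):
    """Contrast normalization as described above (sketch).

    Both the input (mid-frequency) patch and, during training, the desired
    high-frequency patch are divided by the L1-norm of the input patch.
    Returns the normalized patch(es) and the norm needed to undo the scaling.
    """
    norm = np.abs(x_patch).sum() + eps           # L1-norm of the input patch
    x_n = x_patch / norm
    if y_patch is None:                          # test time: normalize input only
        return x_n, norm
    return x_n, y_patch / norm, norm             # training time: normalize both

# At test time the regression output is inverse-normalized:
#   y_hat = regressor(x_n) * norm
```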

For a given set of training data points {(x₁, y₁), …, (x_l, y_l)} ⊂ ℝ^M × ℝ^N, we minimize the following regularized cost functional

\[
O(\{f^1, \ldots, f^N\}) = \sum_{i=1,\ldots,N} \Bigl( \tfrac{1}{2} \sum_{j=1,\ldots,l} \bigl( f^i(\mathbf{x}_j) - y^i_j \bigr)^2 + \tfrac{1}{2} \lambda \| f^i \|^2_{\mathcal{H}} \Bigr), \qquad (1)
\]

where y_j = [y¹_j, …, y^N_j] and H is a reproducing kernel Hilbert space (RKHS). Due to the reproducing property, the minimizer of the above functional is expanded in kernel functions:

\[
f^i(\cdot) = \sum_{j=1,\ldots,l} a^i_j \, k(\mathbf{x}_j, \cdot), \quad \text{for } i = 1, \ldots, N, \qquad (2)
\]

where k is the generating kernel of H, which we choose as a Gaussian kernel, k(x, y) = exp(−‖x − y‖²/σ_k). Equation (1) is the sum of individual convex cost functionals for each scalar-valued regressor and can be minimized separately. However, by tying the regularization parameter λ and the kernel k, we can reduce the time complexity of training and testing down to that of scalar-valued regression, as in this case the kernel matrix can be shared: plugging (2) into (1) and noting the convexity of (1) yields

\[
A = (K + \lambda I)^{-1} Y, \qquad (3)
\]

where Y = [y₁^⊤, …, y_l^⊤]^⊤ and the i-th column of A constitutes the coefficient vector a^i = [a^i_1, …, a^i_l]^⊤ for the i-th regressor.
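A minimal sketch of the closed-form solution (2)-(3) with a shared kernel matrix, using plain numpy; this illustrates the formulas rather than the authors' code:

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma_k):
    """k(x, y) = exp(-||x - y||^2 / sigma_k) for all pairs of rows of X1 and X2."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma_k)

def train_krr(X_train, Y_train, sigma_k, lam):
    """Full KRR, eq. (3): A = (K + lambda*I)^(-1) Y.

    X_train : (l, M) input patches, Y_train : (l, N) desired patches.
    The same kernel matrix K is shared by all N scalar-valued regressors,
    so the coefficients of all outputs come from a single linear solve.
    """
    K = gaussian_kernel(X_train, X_train, sigma_k)
    return np.linalg.solve(K + lam * np.eye(len(X_train)), Y_train)

def predict_krr(X_test, X_train, A, sigma_k):
    """Eq. (2): f^i(x) = sum_j a^i_j k(x_j, x), evaluated for all outputs at once."""
    return gaussian_kernel(X_test, X_train, sigma_k) @ A
```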

Sparse Solution. As evident from (2) and (3), the training and testing time of KRR is O(l³) and O(M × l), respectively, which becomes prohibitive even for a relatively small number of training data points (e.g., l > 10,000). One way of reducing the time complexity is to trade it off against the optimality of the solution by finding the minimizer of (1) only within the span of a basis set {k(b₁, ·), …, k(b_{l_b}, ·)} (l_b ≪ l):

\[
f^i(\cdot) = \sum_{j=1,\ldots,l_b} a^i_j \, k(\mathbf{b}_j, \cdot), \quad \text{for } i = 1, \ldots, N. \qquad (4)
\]

In this case, the solution is obtained by

\[
A = (K_{bx} K_{bx}^{\top} + \lambda K_{bb})^{-1} K_{bx} Y, \qquad (5)
\]

where [K_{bx}]_{(i,j)} = k(b_i, x_j) (an l_b × l matrix) and [K_{bb}]_{(i,j)} = k(b_i, b_j) (an l_b × l_b matrix), and accordingly the testing time complexity reduces to O(M × l_b). For a given fixed set of basis points B = {b₁, …, b_{l_b}}, the time complexity of computing the coefficient matrix A is O(l_b³ + l × l_b × M). In general, the total training time depends on the method of finding B.
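The corresponding sketch for the sparse solution (4)-(5), reusing the `gaussian_kernel` helper from the previous sketch; again an illustration of the formulas, with B assumed to be given:

```python
import numpy as np

def train_sparse_krr(X_train, Y_train, B, sigma_k, lam):
    """Sparse KRR with a fixed basis set B, eq. (5) (sketch, numpy only).

    X_train : (l, M) training inputs, Y_train : (l, N) desired outputs,
    B       : (l_b, M) basis points with l_b << l.
    Returns the (l_b, N) coefficient matrix A.
    """
    K_bx = gaussian_kernel(B, X_train, sigma_k)   # (l_b, l)
    K_bb = gaussian_kernel(B, B, sigma_k)         # (l_b, l_b)
    return np.linalg.solve(K_bx @ K_bx.T + lam * K_bb, K_bx @ Y_train)

def predict_sparse_krr(X_test, B, A, sigma_k):
    """Eq. (4): each output is a kernel expansion over the l_b basis points only."""
    return gaussian_kernel(X_test, B, sigma_k) @ A
```

Only the l_b kernel evaluations against the basis are needed per test patch, which is the source of the O(M × l_b) testing cost mentioned above.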

In KMP [11,4], the basis points are selected from the training data points in an incremental way: given n−1 basis points, the n-th basis is chosen such that the cost functional (1) is minimized when A is optimized accordingly. The exact implementation of KMP costs O(l²) time for each step. Another possibility is to note the differentiability of the cost functional (1) under the sparse expansion (4), which leads to gradient-based optimization to construct B. Assuming that the evaluation of the derivative of k with respect to a basis vector takes O(M) time, the evaluation of the derivative of (1) with respect to B and the corresponding coefficient matrix A takes O(M × l × l_b + l × l_b²) time. Because of the increased flexibility, gradient-based methods can in general lead to a better optimization of the cost functional (1) than selection methods, as already demonstrated in the context of sparse Gaussian process (GP) regression [8]. However, due to the non-convexity of (1) with respect to B, it is susceptible to local minima and accordingly a good heuristic is required to initialize the solution.

In this paper, we use a combination of KMP and gradient descent. The basic idea is to assume that at the n-th step of KMP, the chosen basis point b_n plus the accumulation of basis points obtained until the (n−1)-th step (B_{n−1}) is a good initial point. Then, at each step of KMP, B_n can be subsequently optimized by gradient descent. A naive implementation of this idea is still very expensive. To further reduce the complexity, the following simplifications are adopted: 1. In the KMP step, instead of evaluating the whole training set for choosing b_n, only l_c (l_c ≪ l) points are considered; 2. Gradient descent of B_n and the corresponding A_(1:n,:)² is performed only at every r-th KMP step. Instead, for each KMP step, only b_n and A_(n,:) are optimized. In this case, the gradient can be evaluated in O(M × l) time.³

² With a slight abuse of Matlab notation, A_(m:n,:) stands for the submatrix of A obtained by extracting the rows of A from m to n.

³ Similarly to [4], A_(n,:) can be analytically calculated at O(M × l) cost:

\[
A_{(n,:)} = \frac{K_{nx}\bigl(Y - K_{bx(1:n-1)}^{\top} A_{(1:n-1,:)}\bigr) - \lambda K_{nb} A_{(1:n-1,:)}}{K_{nx} K_{nx}^{\top} + \lambda}, \qquad (6)
\]

where [K_{nx}]_{(1,i)} = k(b_n, x_i) for i = 1, …, l and [K_{nb}]_{(1,i)} = k(b_n, b_i) for i = 1, …, n−1.
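A sketch of this incremental update (6) for the coefficients of a newly added basis point, again reusing `gaussian_kernel`; note that the residual is recomputed here for clarity, whereas caching it across iterations is what gives the O(M × l) cost quoted in the footnote:

```python
import numpy as np

def new_basis_coefficients(b_new, B_prev, A_prev, X_train, Y_train, sigma_k, lam):
    """Eq. (6): coefficients A_(n,:) for the new basis point b_new (sketch).

    b_new  : (M,) candidate basis point
    B_prev : (n-1, M) basis points chosen so far, A_prev : (n-1, N) their coefficients
    The coefficients of the previously chosen basis points are kept fixed.
    """
    K_nx = gaussian_kernel(b_new[None, :], X_train, sigma_k)   # (1, l)
    K_nb = gaussian_kernel(b_new[None, :], B_prev, sigma_k)    # (1, n-1)
    K_bx = gaussian_kernel(B_prev, X_train, sigma_k)           # (n-1, l)
    residual = Y_train - K_bx.T @ A_prev                       # (l, N), cacheable
    numer = K_nx @ residual - lam * (K_nb @ A_prev)            # (1, N)
    denom = float(K_nx @ K_nx.T) + lam                         # scalar
    return numer / denom                                       # A_(n,:), shape (1, N)
```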


At the n-th step, the l_c candidate basis points for KMP are selected based on a rather cheap criterion: we use the difference between the function output obtained at the (n−1)-th step and the estimated desired response of the full KRR for each training data point, where the full KRR is approximated by a localized KRR: for a training data point x_i, its NNs are collected from the training set and a full KRR is trained based on only these NNs. The output of this localized KRR for x_i gives the estimate of the desired response for x_i. It should be noted that these local KRRs cannot be directly applied for regression as they might interpolate poorly on non-training data points. Once computed at the beginning, the estimated desired responses are fixed throughout the whole optimization process.
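A sketch of this localized-KRR estimation of the desired responses, reusing `train_krr` and `predict_krr` from the earlier sketch; the neighborhood size `n_neighbors` is an assumed placeholder, not a value given in the paper:

```python
import numpy as np

def localized_krr_targets(X_train, Y_train, sigma_k, lam, n_neighbors=200):
    """Estimate the desired full-KRR response for each training point by
    training a small KRR on that point's nearest neighbors only (sketch).

    Returns an (l, N) array of estimated responses, computed once and then
    kept fixed while the sparse basis is optimized.
    """
    targets = np.empty_like(Y_train)
    for i in range(len(X_train)):
        # Indices of the n_neighbors training points closest to x_i (including x_i).
        d2 = ((X_train - X_train[i]) ** 2).sum(-1)
        nn = np.argsort(d2)[:n_neighbors]
        # Full KRR on the neighborhood, then evaluate it at x_i itself.
        A_loc = train_krr(X_train[nn], Y_train[nn], sigma_k, lam)
        targets[i] = predict_krr(X_train[i:i + 1], X_train[nn], A_loc, sigma_k)[0]
    return targets
```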

To gain insight into the performance of the different sparse solution methods, a set of preliminary experiments was performed with KMP, gradient descent (with the basis initialized by the k-means algorithm), and the proposed combination of KMP and gradient descent, using 10,000 training data points. Figure 1 summarizes the results. Both gradient-descent-based methods outperform plain KMP, while the combination with KMP provides the best performance. This could be attributed to the better initialization of the solution for the subsequent gradient descent step.

Fig. 1. Performance of the different sparse solution methods (KMP, gradient descent, and KMP + gradient descent), evaluated in terms of the cost functional (1) versus the number of basis points. A fixed set of hyper-parameters was used so that the comparison can be made directly in terms of (1).

Combining Candidates. It is possible to construct a super-resolved image based on only the scalar-valued regression (i.e., N = 1). However, we propose to predict a patch-valued output such that for each pixel, N different candidates are generated. These candidates constitute a 3-D image Z, where the third dimension corresponds to the candidates. This setting is motivated by the observations that 1. by sharing the hyper-parameters, the computational complexity of the resulting patch-valued learning reduces to that of scalar-valued learning; and 2. the candidates contain information from different input image locations, which are actually diverse enough that their combination can boost the performance: in our preliminary experiments, constructing an image by choosing the best and the worst (in terms of the distance to the ground truth) candidates from each 2-D location of Z resulted in an average signal-to-noise ratio (SNR) difference of 8.24 dB.


Certainly, the ground truth is not available at the actual super-resolution stage, and accordingly a way of constructing a single pixel out of the N candidates is required. One straightforward way is to construct the final estimate as a convex combination of candidates based on a certain confidence measure. For instance, by noting that (sparse) KRR corresponds to maximum a posteriori estimation with a (sparse) GP prior [8], one could utilize the predictive variance as a basis for the selection. In the preliminary experiments this resulted in an improvement over the scalar-valued regression. However, a better prediction was obtained when the confidence estimate is based not only on the input patches but also on the context of neighboring reconstructions. For this, a set of linear regressors is trained such that for each location (x, y), they receive a patch of the output image Z(N_L(x, y), :) and produce estimates of the differences ({d_1(x, y), …, d_N(x, y)}) between the unknown desired output and each candidate. The final estimate of the pixel value at an image location (x, y) is then obtained as a convex combination of candidates given in the form of a softmax:

\[
Y(x, y) = \sum_{i=1,\ldots,N} w_i(x, y) \, Z(x, y, i), \qquad (7)
\]

where \( w_i(x, y) = \exp\bigl(-|d_i(x, y)|/\sigma_C\bigr) \big/ \sum_{j=1,\ldots,N} \exp\bigl(-|d_j(x, y)|/\sigma_C\bigr) \).
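A sketch of this softmax-weighted combination (7), assuming the difference estimates have already been produced by the linear regressors; the default σ_C follows the value reported below:

```python
import numpy as np

def combine_candidates(Z, d, sigma_c=0.03):
    """Eq. (7): convex combination of the N candidates at each pixel (sketch).

    Z : (H, W, N) candidate pixel values from the patch-valued regression
    d : (H, W, N) estimated absolute differences to the unknown desired output
    Weights follow a softmax over -|d|/sigma_c, so candidates believed to be
    closer to the ground truth receive larger weights.
    """
    logits = -np.abs(d) / sigma_c
    logits -= logits.max(axis=-1, keepdims=True)      # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return (w * Z).sum(axis=-1)                       # (H, W) combined image
```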

For the experiments in this paper, we set M = 49 (7 × 7), N = 25 (5 × 5), L = 49 (7 × 7), σ_k = 0.025, σ_C = 0.03, and λ = 0.5 · 10⁻⁷. The values were obtained based on a set of separate validation images. The number of basis points for KRR (l_b) was set to 300 as a trade-off between accuracy and time complexity. In the super-resolution experiments, the combination of candidates based on these parameters resulted in an average SNR increase of 0.43 dB over the scalar-valued regression.

Post-processing Based on Image Prior. As demonstrated in Fig. 2.b, the result of the proposed regression-based method is significantly better than the bicubic interpolation. However, detailed visual inspection along the major edges (edges showing rapid and strong changes of pixel values) reveals ringing artifacts (oscillations occurring along the edges). In general, regularization methods (depending on the specific class of regularizer), including KRR and SVR, tend to fit the data with a smooth function. Accordingly, at sharp changes of the function (edges in the case of images), oscillations occur to compensate for the resulting loss of smoothness. While this problem can be indirectly resolved by imposing less regularization in the vicinity of edges, a more direct approach is to rely on prior knowledge of the discontinuities of images. In this work, we use a modification of the natural image prior (NIP) framework proposed by Tappen et al. [9]:

\[
P(\{x\} \,|\, \{y\}) = \frac{1}{C} \prod_{(i,\, j \in N_S(i))} \exp\!\left[ -\left( \frac{|x_i - x_j|}{\sigma_N} \right)^{\!\alpha} \right] \cdot \prod_i \exp\!\left[ -\left( \frac{x_i - y_i}{\sigma_R} \right)^{\!2} \right], \qquad (8)
\]

where {y} represents the observed variables corresponding to the pixel values of Y, {x} represents the latent variables, and N_S(i) stands for the 8-connected neighbors of the pixel location i. While the second product term has the role of preventing the final solution from drifting far away from the input regression result Y, the first product term tends to smooth the image based on the costs |x_i − x_j|. The role of α (< 1) is to re-weight the costs such that the largest difference is stressed more than the others. If the second term is removed, the maximum probability for a pixel i is achieved by assigning it the value of the neighbor with the largest difference among the other neighbors, rather than a certain weighted average of the neighbors, which might have been the case when α > 1. Accordingly, this distribution prefers a single strong edge over a set of small edges and can therefore be used to resolve the problem of smoothing around major edges. The optimization of (8) is performed by belief propagation (BP), similarly to [9]. To facilitate the optimization, we reuse the candidate set generated in the regression step, such that the best candidates are chosen by BP.
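For illustration, the negative log of (8) (up to the constant C) can be evaluated as a per-pixel energy; the sketch below uses four forward offsets so that every 8-connected pair is counted once, with default parameter values taken from those reported below. BP over the candidate set would minimize this energy; here it is only evaluated:

```python
import numpy as np

def nip_energy(x, y, alpha=0.85, sigma_n=200.0, sigma_r=1.0):
    """Negative log of the prior (8), up to the constant C (illustrative sketch).

    x : (H, W) latent image (e.g. one candidate labeling)
    y : (H, W) regression result (observed variables)
    """
    h, w = x.shape
    energy = (((x - y) / sigma_r) ** 2).sum()          # second (data-attachment) term
    for dy, dx in [(0, 1), (1, 0), (1, 1), (1, -1)]:   # first (smoothness) term
        # Pair each pixel with its neighbor at offset (dy, dx), staying in bounds.
        a = x[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
        b = x[max(0, -dy):h + min(0, -dy), max(0, -dx):w + min(0, -dx)]
        energy += ((np.abs(a - b) / sigma_n) ** alpha).sum()
    return energy
```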

Fig. 2. Example of super-resolution: a. bicubic, b. regression result, c. post-processed result of b based on the NIP, d. Laplacian of the bicubic interpolation with major edges displayed as green pixels, and e and f. enlarged portions of a-c from left to right.

Optimizing (8) throughout the whole image region can lead to degraded results as it tends to flatten textured areas, especially when the contrast is low such that the contribution of the second term is small.⁴ This problem is resolved by applying the (modification of the) NIP only in the vicinity of major edges. Based on the observation that the input images are blurred and accordingly very high spatial frequency components are removed, the major edges are found by thresholding each pixel of the Laplacian of the input image using the L2 and L∞ norms of the local patches encompassing it. It should be noted that a major edge is in general different from an object contour. For instance, in Fig. 2.d, the boundary between the chest of the duck and the water is not detected as a major edge as the intensity variations are not significant across the boundary. In this case, no visible oscillation of pixel values is observed in the original regression result.

⁴ In the original work of Tappen et al. [9], this problem does not occur as the candidates are 2 × 2-size image patches rather than individual pixels.
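One possible reading of this edge test is sketched below; the exact patch size and thresholds are not specified in the text, so `patch_size`, `t_l2`, and `t_linf` are assumed placeholders:

```python
import numpy as np
from scipy import ndimage

def major_edge_mask(bicubic, patch_size=7, t_l2=1.0, t_linf=0.2):
    """Mark pixels lying on major edges (sketch; thresholds are placeholders).

    A pixel is kept when both the L2 and the L-infinity norm of the local
    Laplacian patch around it exceed their thresholds, i.e. the edge response
    is both strong overall and locally dominant.
    """
    lap = ndimage.laplace(bicubic)
    size = (patch_size, patch_size)
    # Patch-wise L2 norm (uniform filter over squared values gives the mean) and L-inf norm.
    l2 = np.sqrt(ndimage.uniform_filter(lap ** 2, size) * patch_size ** 2)
    linf = ndimage.maximum_filter(np.abs(lap), size)
    return (l2 > t_l2) & (linf > t_linf)
```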

The parameters α, σ_N, and σ_R were set to 0.85, 200, and 1, respectively. While the improvement in terms of SNR is less significant (on average 0.04 dB over the combined regression result), the improved visual quality at major edges demonstrates the effectiveness of the NIP (Fig. 2).

3 Experiments

The proposed method was evaluated on a set of high- and low-resolution image pairs (Fig. 3) which is disjoint from the training images. The desired resolution is twice that of the input image along each dimension. The number of training data points is 200,000; training the sparse KRR took around a day on a 2.5 GHz PC. For comparison, several different example-based image super-resolution methods were evaluated, including Freeman et al.'s NN-based method [2], Tappen et al.'s NIP [9],⁵ and Kim et al.'s SVR-based method [6] (trained on only 10,000 data points).

Fig. 3. Thumbnails of test images: the images are indexed by numbers arranged in raster order

Figure 4 shows examples of super-resolution results. All the example-based super-resolution methods outperform the bicubic interpolation in terms of visual plausibility. The NN-based method and the original NIP produced sharper images at the expense of introducing noise, which, even with the improved visual quality, leads to lower SNR values than those of the bicubic interpolation. The SVR produced less noisy images; however, it generated smoothed edges and perceptually distracting ringing artifacts, which disappear with the proposed method. Disregarding the post-processing stage, we measured an average 0.69 dB improvement in SNR for the proposed method over the SVR. This could be attributed to the sparsity of the solution, which enabled training on a large data set, and to the effectiveness of the candidate combination scheme.

⁵ The original NIP algorithm was developed for super-resolving the NN-subsampled image (not bicubic resampling, which is used in the experiments with all the other methods). Accordingly, for the experiments with the NIP, the low-resolution images were generated by NN subsampling. The visual qualities of the super-resolution results are not significantly different from the results obtained with bicubic resampling; however, the quantitative results should not be directly compared with those of the other methods.


Fig. 4. Results of different super-resolution algorithms on two images from Fig. 3: a-b. original, c-d. bicubic, e-f. SVR [6], g-h. NN-based method [2], i-j. NIP [9], and k-l. proposed method.

Fig. 5. Performance of the different super-resolution algorithms: increase of SNR over the bicubic interpolation for each test image (image index 1-12), shown for bicubic, SVR, NN, NIP, and the proposed method.

Moreover, in comparison to SVR, the proposed method requires much less processing time: super-resolving a 256 × 256-size image into 512 × 512 takes around 25 seconds for the proposed method and 20 minutes for the SVR-based method. For a quantitative comparison, the SNRs of the different algorithms are plotted in Fig. 5.

4 Conclusion

This paper approached the problem of image super-resolution from a nonlinear regression viewpoint. A combination of KMP and gradient descent was adopted to obtain a sparse KRR solution, which enabled a realistic application of regression-based super-resolution. To resolve the problem of smoothing artifacts that occur due to the regularization, the NIP was adopted to post-process the regression result such that the edges are sharpened while the artifacts are suppressed. Comparison with existing example-based image super-resolution methods demonstrated the effectiveness of the proposed method. Future work should include comparison with and combination of various non-example-based approaches.

Acknowledgment. The contents of this paper have greatly benefited from discussions with G. Bakır and C. Walder, and from comments of the anonymous reviewers. The idea of using localized KRR originated with C. Walder.

References

1. Baker, S., Kanade, T.: Limits on super-resolution and how to break them. IEEE Trans. Pattern Analysis and Machine Intelligence 24(9), 1167–1183 (2002)

2. Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-based super-resolution. IEEE Computer Graphics and Applications 22(2), 56–65 (2002)

3. Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analogies. In: Computer Graphics (Proc. SIGGRAPH 2001), pp. 327–340. ACM Press, New York (2001)

4. Keerthi, S.S., Chu, W.: A matching pursuit approach to sparse Gaussian process regression. In: Advances in Neural Information Processing Systems. MIT Press, Cambridge (2005)

5. Kim, K.I., Franz, M.O., Schölkopf, B.: Iterative kernel principal component analysis for image modeling. IEEE Trans. Pattern Analysis and Machine Intelligence 27(9), 1351–1366 (2005)

6. Kim, K.I., Kim, D.H., Kim, J.H.: Example-based learning for image super-resolution. In: Proc. the Third Tsinghua-KAIST Joint Workshop on Pattern Recognition, pp. 140–148 (2004)

7. Ni, K., Nguyen, T.Q.: Image superresolution using support vector regression. IEEE Trans. Image Processing 16(6), 1596–1610 (2007)

8. Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. In: Advances in Neural Information Processing Systems. MIT Press, Cambridge (2006)

9. Tappen, M.F., Russell, B.C., Freeman, W.T.: Exploiting the sparse derivative prior for super-resolution and image demosaicing. In: Proc. IEEE Workshop on Statistical and Computational Theories of Vision (2003)

10. Tschumperlé, D., Deriche, R.: Vector-valued image regularization with PDEs: a common framework for different applications. IEEE Trans. Pattern Analysis and Machine Intelligence 27(4), 506–517 (2005)

11. Vincent, P., Bengio, Y.: Kernel matching pursuit. Machine Learning 48, 165–187 (2002)