
Real-Time Texture Detection Using the LU-Transform

Alireza Tavakoli Targhi, Marten Bjorkman, Eric Hayman and Jan-Olof Eklundh

Computational Vision and Active Perception Laboratory
School of Computer Science and Communication
Royal Institute of Technology (KTH), SE-100 44, Stockholm, Sweden
{att,celle,hayman,joe}@nada.kth.se

Abstract. This paper introduces a fast texture descriptor, the LU-transform. It is inspired by previous methods, the SVD-transform and Eigen-transform, which yield measures of image roughness by considering the singular values or eigenvalues of matrices formed by copying greyvalues from a square patch around a pixel directly into a matrix of the same size. The SVD and Eigen-transforms therefore capture the degree to which linear dependencies are present in the image patch. In this paper we demonstrate that similar information can be recovered by examining the properties of the LU factorization of the matrix, and in particular the diagonal part of the U matrix. While the LU-transform yields an output qualitatively similar to those of the SVD and Eigen-transforms, it can be computed about an order of magnitude faster. It is a much simpler algorithm and well-suited to implementation on parallel architectures. We capitalise on these properties in an implementation of the algorithm on a Graphics Processing Unit (GPU) which makes it even faster than a CPU implementation, and frees the CPU for other computations.

1 Introduction

Texture is an important cue in many applications of computer vision such as image segmentation [1, 2], the classification of objects [3] or materials [4–7], and texture synthesis for computer graphics [8, 5]. Many of these applications benefit from a fast and simple texture descriptor. Although no formal or mathematical definition exists, texture is frequently considered to be small-scale structure in images, and many different descriptors for texture have been proposed in the literature [9]. Filter banks, for instance, are especially popular, and may be motivated by early processing in biological visual systems. Descriptors which use the greylevels themselves, as opposed to an intermediate filter-based representation, have also regained popularity [2, 8, 5].

Recently, Tavakoli Targhi and coworkers proposed the SVD-transform [10] and Eigen-transform [11]. These texture descriptors are derived from matrix decompositions. The basic idea is to form a matrix from the greyvalues in a small, square window centred at a pixel, compute either the singular values or eigenvalues, and form a descriptor as the average of the smallest singular values or eigenvalues. This yields a one-dimensional descriptor which fires in "rough" areas of the image. The procedure is repeated for all pixels, or on a subsampled regular grid. The suitability of this descriptor for applications in attention (object detection) and image segmentation was demonstrated in [11].

Our work is motivated by similar applications, but with one crucial difference: we require real-time performance, e.g. for use on a robot. The SVD-transform and Eigen-transform are not quite fast enough on existing off-the-shelf hardware. The bottleneck is the computation of singular values or eigenvalues. The main contribution of this paper is therefore the introduction of a new texture descriptor, inspired by the framework of [10, 11]. Rather than calculating singular values or eigenvalues, we perform an LU-decomposition [12, 13], and define the LU-transform as the average of the smallest diagonal values in the resulting upper triangular matrix U. It is well-known that LU decomposition is much faster than finding singular values or eigenvalues. In our experiments we observed a speed-up of an order of magnitude. Moreover, the method lends itself well to implementation on a Graphics Processing Unit (GPU). For small window sizes this proves even faster, and it allows other processing to take place on the CPU.

The output of the descriptor is qualitatively similar to those obtained in [10, 11]. This may be briefly explained as follows (we refer also to Section 2, which reviews [10, 11] in a little more detail, and Section 3, which discusses the LU-transform). The eigenvalues or singular values provide information about the dependence between rows and columns of the matrix of the local patch. In a patch of uniform brightness, all but the largest eigenvalue or singular value are zero. If any two rows or columns are identical, the matrix drops rank, that is, the smallest eigenvalue or singular value becomes zero. If those two rows or columns are similar but not quite identical, the smallest eigenvalues or singular values will be close to, but not exactly equal to, zero. Thus the SVD and Eigen-transforms essentially encode the degree to which rows or columns of the patch approach being linearly dependent by taking a sum over the smallest singular values or eigenvalues respectively. This information about rank is also captured within the LU factorization.

Indeed, while achieving a considerable speed-up, the LU-transform inherits the original properties of [10, 11] for bottom-up processing in real-world applications: (i) it captures small-scale structure in terms of roughness or smoothness of the image patch; (ii) it provides a compact representation and low-dimensional output (usually just a single dimension) which is easy to store and perform calculations on; (iii) few parameters need tuning, the most significant being a notion of scale provided by the size of the local image patch; and (iv) unlike most other texture descriptors, it does not generate spurious responses around brightness edges. This is in contrast to, for instance, filters, which tend to identify a strip around a brightness edge as a separate region.

The remainder of the paper is organized as follows. Section 2 reviews the SVD-transform and Eigen-transform. The LU-transform is introduced and its output compared to [10, 11] in Section 3. Section 4 focuses on computational efficiency and describes our GPU implementation. Finally, conclusions are drawn in Section 5.

2 Review of the SVD and Eigen-transforms

In this section we briefly review the SVD and Eigen-transforms [10, 11], and analyse what these texture descriptors capture.

The general framework of these texture descriptors is that we consider a w × w square neighbourhood centred at a pixel, and copy its greyvalues directly into a w × w real matrix, W. We proceed by computing a matrix decomposition of W which provides a vector of numbers. For the SVD-transform, this vector consists of the singular values of W, while for the Eigen-transform we instead insert the eigenvalues of W into the vector.

Then we take the magnitude of the numbers in this vector, and sort them in decreasing order, ‖α_1‖, ‖α_2‖, ..., ‖α_w‖. At this stage we have a set of w numbers describing each pixel in an image. [10, 11] showed that the largest number, ‖α_1‖, gives a smooth version of the original image, while the small α_i capture the texture. The texture transform may therefore be defined as

\[
\Phi(l, w) = \sum_{k=l}^{w} \| \alpha_k \| , \qquad 1 \le l \le w . \tag{1}
\]

The original papers [10, 11] took the average rather than the sum of these numbers, but this differs only by a constant scale factor. Both l and w are parameters set by the user; w is a scale parameter. For a descriptor that reacts to texture as opposed to brightness, the largest few α_i should be ignored. The transform is fairly insensitive to l; suitable values are in the range [2, w/2].
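The definition in Equation (1) can be sketched in a few lines of Python. This is only an illustration: the function name `texture_transform` and the example coefficients are made up here, and the coefficients would in practice come from a decomposition of one image patch.

```python
def texture_transform(coefficients, l):
    """Equation (1): sum of the magnitudes alpha_l..alpha_w, sorted in decreasing order.

    `coefficients` holds the w numbers obtained from one patch (singular
    values, eigenvalues, ...); `l` is 1-based, as in the text.
    """
    w = len(coefficients)
    if not 1 <= l <= w:
        raise ValueError("l must satisfy 1 <= l <= w")
    # abs() also handles complex eigenvalues.
    magnitudes = sorted((abs(c) for c in coefficients), reverse=True)
    return sum(magnitudes[l - 1:])


# Hypothetical coefficients for one 4 x 4 patch (made-up numbers).
example = [0.5, -9.0, 2.0, 0.25]
phi_all = texture_transform(example, 1)   # includes the dominant, brightness-like term
phi_tail = texture_transform(example, 3)  # keeps only the two smallest magnitudes
```

Choosing a larger l discards the large leading coefficients, which is why the response then reflects texture rather than brightness.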

The results of these transforms are shown in Section 3. To save computation time and because the resulting output was found to change slowly spatially, we do not compute the transform for every pixel. Instead we define a spacing parameter δ, which means that we calculate the transform only every δ pixels in both horizontal and vertical directions.
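The subsampled grid can be sketched as follows. The helper names are hypothetical, and the border handling is an assumption: the text centres each window at a pixel and does not say how image borders are treated, so this sketch indexes windows by their top-left corner and keeps only those that fit entirely inside the image.

```python
def window_origins(height, width, w, delta):
    """Top-left corners of all w x w windows that fit, sampled every delta pixels."""
    rows = range(0, height - w + 1, delta)
    cols = range(0, width - w + 1, delta)
    return [(r, c) for r in rows for c in cols]


def extract_window(image, row, col, w):
    """Copy the w x w block of greyvalues starting at (row, col) into a matrix."""
    return [line[col:col + w] for line in image[row:row + w]]


# The 720 x 480 image size used later in the paper, with w = 8 and delta = 8.
origins = window_origins(480, 720, 8, 8)
```

With these settings the grid is 8 times smaller than the image in each dimension, so the transform is evaluated at far fewer locations than there are pixels.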

It is useful to indicate why this approach yields a successful texture descriptor. Initially we focus on the SVD-transform. Suppose we are given a w × w real matrix A with SVD A = U Σ V^T, where Σ is a diagonal matrix of singular values in decreasing order, and U and V are orthogonal matrices [14, 15, 12, 13]. Then a rank r approximation to A is the matrix A_r = U_r Σ_r V_r^T, where Σ_r is the top-left r × r submatrix of Σ, U_r consists of the first r columns of U, and V_r^T the first r rows of V^T. Indeed, A_r is the optimal rank r approximation to A in the sense of minimizing the Frobenius norm of the residual ‖A − B_r‖_F, where B_r is a rank r matrix. Setting B_r = A_r, this residual can be written as ‖A − A_r‖_F = √(Σ_{i=r+1}^{w} σ_i^2) [14, 15, 12, 13]. Squared, this expression is identical to the definition of the texture transform in Equation (1) with l = r + 1 if α_i = σ_i^2. In [10] α_i was just σ_i (not squared), because at that time we were unaware of this argument. In unreported experiments we have, however, found that these two expressions give qualitatively similar results.
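As a small worked check of the relation between the low-rank residual and Equation (1), consider a hypothetical 3 × 3 diagonal matrix (chosen here only because its singular values can be read off directly) and its best rank-1 approximation:

```latex
A = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix},
\qquad
\sigma_1 = 3,\ \sigma_2 = 2,\ \sigma_3 = 1,
\qquad
A_1 = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix},
\\[1ex]
\| A - A_1 \|_F = \sqrt{\sigma_2^2 + \sigma_3^2} = \sqrt{5},
\qquad
\Phi(2, 3)\big|_{\alpha_i = \sigma_i^2} = 4 + 1 = 5 = \| A - A_1 \|_F^2,
\qquad
\Phi(2, 3)\big|_{\alpha_i = \sigma_i} = 2 + 1 = 3 .
```

Both choices of α_i grow with the amount of structure that a rank-1 (smooth) approximation fails to explain, which is consistent with the two variants behaving similarly.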

Both these expressions based on SVD, and also the Eigen-transform, can be seen to detect rank deficient or near rank deficient matrices formed from image patches. If the matrix A has low rank then some of its singular values and eigenvalues are zero, and the texture transform therefore has a lower response. Low rank means that the rows and columns are linearly dependent, which will occur if the image patch is uniform or partly uniform. Also, if we have little structure then the rows and columns are close to being linearly dependent and consequently the texture transform is still relatively low. On the other hand, if the image patch has complex structure, then rows and columns are much less likely to be dependent and the texture transform will be high. Thus, the SVD and Eigen-transforms fire in image areas of rough texture.

3 The LU-transform

In this section, we first explain how to compute the LU-transform and justify why it should work. Then we compare it in experiments to the SVD and Eigen-transforms.

3.1 Computation of LU-transform

Recall the methodology from Section 2 where we considered matrix decompositions of w × w square image patches. Now, rather than calculating the singular values or eigenvalues of the matrix, we compute the LU-factorization.

The LU factorization [14, 15, 12, 13] of a matrix is a decomposition of general form A = P L U, involving a permutation matrix P, a unit lower triangular matrix L with ones on the diagonal, and an upper triangular matrix U. The LU factorization is used for solving sets of linear equations by Gauss elimination. L contains the row multipliers used during elimination. U contains the pivot values on the diagonal and other information in the off-diagonal part. P records the pivoting operations carried out in Gauss elimination. Here we used partial rather than full pivoting [14, 15, 12, 13].

The LU factorization is generally computed only for nonsingular square matrices, but it exists and is useful even if the matrix is singular or rectangular. When A is a singular matrix, U has zeros among its diagonal elements [16]. The number of zero elements on the diagonal of U gives the dimensionality of the null-space of A. For example, assume that r is the rank of an n × n matrix A; then n − r zeros will appear on the diagonal of U in the LU factorization of A. Therefore we see that the diagonal of U captures the same information as the eigenvalues and singular values. So, analogous to the Eigen and SVD-transforms, we can define the LU-transform Ω(l, w) by inserting the sorted absolute values of the pivots as the coefficients α_i in Equation (1),

\[
\Omega(l, w) = \sum_{k=l}^{w} \| u_{kk} \| , \qquad 1 \le l \le w , \tag{2}
\]

where u_kk are the diagonal elements of U.

Homogeneous image patches induce linear dependence and have a low value of the rank r, therefore Ω has low magnitude. Conversely, image patches with complex structures have a high value of r, so Ω has a large magnitude and consequently the LU-transform has a high response. Empirically we have found that image patches with full rank but small singular values also give rise to low values of Ω. We are currently studying these properties from a theoretical standpoint.
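A minimal pure-Python sketch of the LU-transform follows, assuming a patch given as a list of rows. It is not the LAPACK-based implementation benchmarked in Section 4; the function names and the two example patches are made up for illustration. The pivots are obtained by Gaussian elimination with partial pivoting, and Equation (2) is then applied to their sorted magnitudes.

```python
def lu_pivots(patch):
    """Diagonal of U from Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in patch]  # work on a copy
    n = len(a)
    for k in range(n - 1):
        # Partial pivoting: move the largest |entry| of column k (rows k..n-1) to row k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            continue  # nothing to eliminate in this column; the pivot stays zero
        a[k], a[p] = a[p], a[k]
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= factor * a[k][j]
    return [a[k][k] for k in range(n)]


def lu_transform(patch, l):
    """Equation (2): sum of the sorted pivot magnitudes from position l (1-based) to w."""
    magnitudes = sorted((abs(u) for u in lu_pivots(patch)), reverse=True)
    return sum(magnitudes[l - 1:])


flat = [[7, 7, 7], [7, 7, 7], [7, 7, 7]]   # uniform patch, rank 1
rough = [[2, 0, 1], [0, 3, 0], [1, 0, 4]]  # hypothetical full-rank patch
omega_flat = lu_transform(flat, 2)
omega_rough = lu_transform(rough, 2)
```

For the uniform patch all pivots after the first vanish, so the response is zero, whereas the full-rank patch keeps non-zero pivots and gives a positive response.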

3.2 Examples

To illustrate the output of the LU-transform, we have applied the algorithm to an image (Figure 1a) of a natural scene from the Corel database [17]. The size of the image is 720 × 480. Figure 1b shows the LU-transform for window size w = 8 with the four smallest coefficients (l = 4) and Figure 1c shows the corresponding result with parameters l = 22 and w = 32. We have used the spacing δ = 8 for all of the experiments in this paper. Recall that w is some form of scale parameter, and as in many image processing applications its optimal value depends on the task. We have not found it necessary to fine-tune the w parameter and have found it sufficient to concentrate on just these two values (w = 8, w = 32). For example, for this image (Figure 1a) w = 32 gives a better result if the objective is to segment the cheetah out from the background (Figure 1c). On the other hand, small scales such as w = 8 can be useful for detecting small objects or defects and abnormalities, in which case there is considerable risk that a 32 × 32 image patch contains background as well as foreground.

(a) Original image (b) LU-transform Ω(4, 8) (c) LU-transform Ω(22, 32)

Fig. 1. The LU-transform with varying window size.

(a) Original image (b) LU-transform (c) SVD-transform

(a) LU-transform (b) SVD-transform (c) Eigen-transform

Fig. 2. Comparison of the output of the LU, SVD and Eigen-transforms with the same parameters, applied to the image in Figure 2a. The first row represents the result for w = 8, and the second row the results with w = 32.

Figure 2 shows the SVD, Eigen and LU-transforms of the same image. The results are qualitatively very similar. This is not surprising since the LU, SVD and Eigen-transforms all indicate the rank of the windows. Figure 3 shows the three texture transforms applied to an image containing a mosaic of different textures. First, notice that the three methods give very similar visual results, and second that they have lower values on smooth textures and higher values on rough textures.

The transform captures the roughness and smoothness of small scale structure within an image. To further demonstrate this, we take different materials (Figure 4) with different structure from the KTH-TIPS2 image material database [18]. Table 1 shows the average and standard deviation of the texture transforms for each material from Figure 4. Higher scores are indicative of rough and coarse structure of materials, and lower numbers correspond to smooth and fine material structure. The result gives the impression that the SVD, Eigen and LU coefficients capture the same roughness properties of the scenes. However, the standard deviations of the SVD-transform are very small in comparison to those of the Eigen and LU-transforms. This illustrates that the SVD-transform is more consistent and uniform. This should come as no surprise, as the SVD is more stable and gives better rank information than other matrix decompositions.

(a) Original image (b) LU-transform (c) SVD-transform (d) Eigen-transform

Fig. 3. Comparison of the output of the LU, SVD and Eigen-transforms with the same parameters (w = 32 and l = 22).

1 2 3 4 5 6

Fig. 4. Images of six materials taken from the KTH-TIPS2 database.

Figure 5 gives further illustration of this fact. To illustrate the behaviour of these coefficients, we scan an arbitrary row of an image and collect patches A_j of size 32 × 32 with a spacing of δ = 8. Figure 5b shows the texture transforms for A_j along that row. Furthermore, Figure 5a shows (1/K) Σ_{j=1}^{n} ‖α_i^{(j)}‖ for each α_i, where α_i (see Equation (1)) can be eigenvalues, singular values or the pivots from the LU factorization as in Equation (2). Here we ignored the first coefficient as it is too large. Figure 5a indicates that the LU coefficients form a flat curve in comparison with the other two transforms: the sorted pivots ‖u_kk‖ converge to zero very slowly. Figure 5b shows that the SVD-transform's value is much smoother than that of the LU-transform. This agrees with the result in Table 1 that SVD gives more uniform and stable output.

We have presented different applications of the texture transform in our previous work [10, 11] and compared the results with other methods. In this paper the main focus is on computational efficiency, but here we briefly present results on two applications. Real-time object detection and visual attention are important tasks in robotics and computer vision. Figure 6 presents the result of the LU-transform for visual attention as a texture cue. The first row of Figure 6 shows a breakfast table with different materials from the KTH-TIPS2 database. The second row in Figure 6 shows a table with some objects. Most of the interesting objects with small structure or texture pop out in the LU-transform (w = 8 and l = 4, Figure 6b). Figure 6c shows the result of thresholding the LU-transform, yielding a segmentation. More examples of the LU-transform are shown in Figure 7.

Image No   1          2          3          4          5           6
LU         2.0±0.32   3.4±0.80   7.8±2.86   9.4±3.82   14.2±6.33   33.0±11.62
SVD        0.8±0.04   0.9±0.07   3.8±0.46   4.6±1.20   5.6±1.60    14.7±1.74
Eigen      1.0±0.26   1.1±0.58   4.1±4.04   5.9±5.60   6.1±5.60    17.3±11.18

Table 1. The average and standard deviation of the texture transforms for the images in Figure 4.

(a) (b)

Fig. 5. Results from a scanline from Figure 4.1. (a) compares the coefficients α_i used in the SVD, Eigen and LU-transforms, averaged over the scanline. (b) plots the texture transforms themselves as they vary along the scanline.

4 Computational efficiency

In this section we first evaluate the computational efficiency of the LU-transform relative to the SVD and Eigen-transforms. Then we describe our GPU implementation of the LU-transform, and compare it to the CPU version.

4.1 CPU benchmarks

We implemented the three transforms in both Matlab and C++ and benchmarked them on an AMD Opteron 250 CPU running Linux. The matrix decompositions are the dominant factor in the computational expense. Both Matlab and C++ implementations use LAPACK [19] functions. LAPACK is a library of Fortran 77 routines for solving the most commonly occurring problems in numerical linear algebra. It was designed to be efficient on a wide range of modern high-performance computers. For our C++ implementation we used exactly the same LAPACK routines as those used by Matlab, and even linked to the LAPACK and BLAS libraries shipped with Matlab. This enabled us to evaluate the overheads associated with using Matlab. Furthermore, Matlab provides BLAS libraries highly optimized for each architecture, making use of available SSE routines. Pre-compiled libraries from Linux vendors may not be as highly optimized.

(a) Original image (b) LU-transform Ω(4, 8) (c) Segmentation result by thresholding

Fig. 6. An experiment using the LU-transform as a saliency map (b) for attention for a mobile robot. The objects on the table which have small structures or texture have higher saliency values. Results using a very simple thresholding of the LU-transform are also shown (c).

An important aspect of the texture transforms is that we always deal with square, real matrices and do not need singular vectors or eigenvectors, which saves computation time. For instance, in Matlab d = svd(X) returns only the vector of singular values, which is considerably faster than [u,d,v]=svd(X), which also computes all singular vectors. Different LAPACK functions are used depending on what output is required. All three texture transforms have a complexity of O(w^3) in general, where w × w is the size of the matrices to be transformed. These matrices originate from small window patches in the image, and are typically w = 8, 16 or 32 pixels in size. Sizes larger than w = 32 are uncommon. If there is an interest in collecting statistics from windows with w > 32, one might instead subsample the original image and compute the transform using a smaller window. For benchmarking, we used a greyscale image of size 720 × 480 pixels. The spacing between windows is kept constant and equal to δ = 8 pixels for all experiments in this section. This yields an output transform image 8 times smaller than the original image in each dimension. Table 2 shows the computational cost (in ms) of the LU, SVD and Eigen-transforms for different window sizes. The costs are also illustrated as a graph in Figure 8.

The first property to notice is that the LU-transform is much faster to compute, in fact by roughly an order of magnitude. It is clearly well-suited to real-time applications.

Second, rather surprisingly, the methods do not seem to exhibit O(w^3) behaviour. If we look at the LU-transform in particular (see the algorithm in Figure 9), the cost of the transform is dominated by level-2 rather than level-3 operations. In order to understand this we conducted a deeper analysis of which parts of the algorithm the costs are associated with. On the AMD Opteron 250 used for our experiments all level-3 operations can be executed fully in the 64 KB L1 cache. Each multiply-subtract operation (last line of the algorithm) requires a total of 2 loads, a store, a product and a subtraction. We verified that this can be done in 5.5 cycles per 4-point group of operations, using the SSE instruction set and exploiting full 4-way parallelism. It can further be shown that the total number of such operations is about w^3/3. Thus the total cost of all level-3 operations should be about 35 ms for 720 × 480 pixel images with a matrix size of w = 32 and a spacing of δ = 8 pixels. This is considerably lower than what we see in our experiments, which implies that our runs are dominated by the level-2 operations, which are harder to make L1 cache efficient. This explains the O(w^2)-like behaviour seen in the experiments.

Fig. 7. The first row shows the original images and the second row shows the LU-transform of the original images.
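The estimate of about 35 ms can be reproduced with a few lines of arithmetic. The image size, spacing, window size, operation count and cycle count are taken from the text; the 2.4 GHz clock rate of the Opteron 250 is an assumption added here, since the text does not state it.

```python
# Figures quoted in the text.
image_w, image_h = 720, 480
delta = 8
w = 32
cycles_per_4_ops = 5.5

# Assumption: the Opteron 250 runs at 2.4 GHz (not stated in the text).
clock_hz = 2.4e9

num_matrices = (image_w // delta) * (image_h // delta)  # one matrix per grid point
ops_per_matrix = w ** 3 / 3                             # multiply-subtract operations
total_cycles = num_matrices * ops_per_matrix * cycles_per_4_ops / 4
level3_ms = 1e3 * total_cycles / clock_hz
print(f"{num_matrices} matrices, about {level3_ms:.1f} ms of level-3 work")
```

Under these assumptions the result lands close to the quoted figure, and well below the 101 ms measured for the C++ implementation at w = 32 in Table 2.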

4.2 GPU implementation

Next we describe an implementation of the LU-transform on a Graphics Processing Unit (GPU) and compare the performance to the previously mentioned CPU versions. What makes our attempt different from those of earlier studies [20, 21] is that we have many small matrices to be transformed, instead of a single very large one. This affects the way parallelism can be exploited, which will be explained below.

GPUs are the key components of modern cards for 3D graphics acceleration. Due to the introduction of reprogrammable shading hardware and accompanying languages, GPUs have recently been applied to more general purpose processing. See [22] for more information on general purpose GPU initiatives. Some examples of applications in real-time computer vision exist, like depth matching [23], motion estimation [24] and figure-ground segmentation [25]. Unlike typical CPUs, most GPUs are based on single-instruction-multiple-data (SIMD) architectures, with multiple processing units working in parallel on different parts of an output image. Thus, to avoid unnecessary sacrifices in performance, care has to be taken when porting existing methods to GPU hardware. An understanding of the underlying hardware is critical.

Off-screen rendering is supported by most graphics cards using either pixel-buffers (pbuffers) or frame-buffer objects (FBOs). In both cases images and temporary data are mapped to the texture memory of the graphics card. Advanced filtering operations are possible, as image data are rendered from a set of texture buffers into a new one. With the introduction of programmable fragment shaders in recent GPUs, filtering can be controlled on a per-pixel basis. Each fragment shader is assigned a set of input and output textures, and for each point in the output a program associated with the shader is executed. A number of languages can be used for programming shaders. In our study we use the OpenGL Shading Language (GLSL). However, since the hardware is the same, GLSL has many similarities to alternatives like Cg (NVidia specific) and HLSL (Microsoft specific). FBOs are typically faster than pbuffers, since they avoid repeated context switches when multiple textures are in use, which is typically the case in general purpose processing, where the operations and operands constantly change. Floating-point arithmetic is another recent innovation that has affected the way GPUs can be used for operations that require higher levels of accuracy, such as those used in this study.

         Matlab                    C++
         w = 8   w = 16   w = 32   w = 8   w = 16   w = 32
LU       68      89       182      11      29       101
SVD      145     249      771      83      255      708
Eigen    201     572      2081     143     516      2088

Table 2. The computational costs (in ms) of the LU, SVD and Eigen-transforms.

Fig. 8. The computational costs (in ms) of the LU, SVD and Eigen-transforms for different values of w and a constant spacing of δ = 8.

for k=1:N-1
  Pivot by swapping A(l,k) and A(k,k), where l = argmax_i |A(i,k)|
  for i=k+1:N
    F(i,k) = A(i,k)/A(k,k)
  for j=k+1:N
    Pivot by swapping A(l,j) and A(k,j)
    for i=k+1:N
      A(i,j) = A(i,j)-F(i,k)*A(k,j)

Fig. 9. Gaussian elimination with pivoting

The most critical part of a GPU implementation of the LU-transform is that of mapping image data to texture memory, so that parallelism can be exploited as much as possible. A typical GPU includes multiple fragment shaders and it is important to keep all shaders busy, while using as much of the available texture bandwidth as possible. Unfortunately, this means that the implementation of choice might vary between GPUs. Similar to what was previously described in Section 3, the original image, f(y, x), is divided into a number of overlapping matrices, each w × w points in size. If M is the number of pixels in the original image, then with a δ-pixel spacing between locations at which the LU-transform is computed, there are N = M/δ^2 matrices in total, each requiring O(w^3) operations to factorize. We then parallelize the computations, so that when a point-to-point matrix operation is performed, it is done on all matrices simultaneously during the same GPU call. Parallelism is achieved by initially shuffling the input image into a large texture map that consists of small subsampled and shifted versions of the original image. There are w^2 patches, each patch corresponding to a different point in the matrices. As a result of shuffling, the (j, i)-th such patch is given by

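\[
A(j, i) =
\begin{pmatrix}
f(j, i) & f(j, i + \delta) & \cdots & f(j, i + w - \delta) \\
f(j + \delta, i) & f(j + \delta, i + \delta) & \cdots & f(j + \delta, i + w - \delta) \\
\vdots & \vdots & \ddots & \vdots \\
f(j + h - \delta, i) & f(j + h - \delta, i + \delta) & \cdots & f(j + h - \delta, i + w - \delta)
\end{pmatrix}
\]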
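The shuffling step can be sketched on the CPU as follows. This is an illustration under stated assumptions rather than the GLSL implementation: the function name is hypothetical, the indices (b, a) play the role of (j, i) in the text, and only windows that fit entirely inside the image are kept, which the text does not specify.

```python
def shuffle_patches(image, w, delta):
    """Return patches[b][a]: a subsampled copy of `image` shifted by (b, a).

    Element [y][x] of patches[b][a] is entry (b, a) of the w x w matrix whose
    top-left corner lies at pixel (y * delta, x * delta).
    """
    height, width = len(image), len(image[0])
    ny = (height - w) // delta + 1   # number of windows vertically
    nx = (width - w) // delta + 1    # number of windows horizontally
    return [[[[image[b + y * delta][a + x * delta] for x in range(nx)]
              for y in range(ny)]
             for a in range(w)]
            for b in range(w)]


# Toy 6 x 6 image whose greyvalue encodes its position as 10 * row + column.
toy = [[10 * r + c for c in range(6)] for r in range(6)]
patches = shuffle_patches(toy, w=2, delta=2)

# Reassemble the window with top-left corner (2, 4) from the four patches.
window = [[patches[b][a][1][2] for a in range(2)] for b in range(2)]
```

Each patch collects the same matrix entry from every window, so a single operation applied to a whole patch acts on all matrices at once.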

When this is done, we compute the LU factorization using Gaussian elimination and pivoting (see Figure 9 above). Thus the rest of the process is similar to a typical CPU implementation, with the distinction that computations are performed on patches rather than individual matrix points.
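To make the idea of eliminating on patches concrete, here is a hedged CPU sketch in which every step of Gaussian elimination is applied to all matrices before moving to the next step. The data layout `patches[b][a][m]` (entry (b, a) of matrix m, flattened over the window grid) and the function name are assumptions made for this illustration; on the GPU each inner loop over m would correspond to one rendering pass over a patch.

```python
def batched_lu_pivots(patches):
    """patches[b][a][m] holds entry (b, a) of matrix m; returns the U diagonal per matrix."""
    w = len(patches)
    n = len(patches[0][0])
    p = [[[float(v) for v in patches[b][a]] for a in range(w)] for b in range(w)]
    for k in range(w - 1):
        # The pivot row differs from matrix to matrix, so it is stored per matrix.
        pivot_row = [max(range(k, w), key=lambda i: abs(p[i][k][m])) for m in range(n)]
        for j in range(k, w):
            for m in range(n):  # swap row k with the pivot row, matrix by matrix
                r = pivot_row[m]
                p[k][j][m], p[r][j][m] = p[r][j][m], p[k][j][m]
        for i in range(k + 1, w):
            # Row multipliers; a zero pivot means the column is already eliminated.
            factors = [p[i][k][m] / p[k][k][m] if p[k][k][m] != 0.0 else 0.0
                       for m in range(n)]
            for j in range(k, w):
                for m in range(n):  # the same multiply-subtract for every matrix
                    p[i][j][m] -= factors[m] * p[k][j][m]
    return [[p[k][k][m] for k in range(w)] for m in range(n)]


# Two hypothetical 2 x 2 matrices, stored entry-wise: [[1, 2], [3, 4]] and [[4, 0], [2, 1]].
two = [[[1, 4], [2, 0]],
       [[3, 2], [4, 1]]]
diagonals = batched_lu_pivots(two)
```

The per-matrix `pivot_row` list is the reason the swap has to be tested inside the inner loops instead of being done once per elimination step.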

4.3 Performance evaluation

Unlike in our CPU implementations, on the GPU we cannot exploit 4-way parallelism along rows or columns; instead we exploit parallelism within the matrix point patches mentioned above. We do this in order to fully utilize the fragment shaders of the GPU. Since there are 12 such shaders in the NVidia 6800 GT GPU used in our experiments, the level of parallelism is 12-way at most. Unfortunately, the level of parallelism cannot be measured, since we have no control over which processing cores are active. Theoretically, based on knowledge of the processing cores and memory systems, one might draw some conclusions. However, since NVidia keeps some critical information on their memory systems hidden, these conclusions would hardly be accurate. One might also compare different, but similar, GPUs with different numbers of cores and draw conclusions on efficiency based on their relative performance. However, since we do not know the details of the memory system, it is hard to tell whether these are indeed comparable.

Even if 12-way parallelism is never reached for various reasons, e.g. the limited access to texture memory, we can be certain that the same level of parallelism would never be reached if computations were done on matrices similar to those of the CPU implementations. On the GPU, parallelism is exploited within each individual operation and is highly affected by the size of the polygons and textures involved, and a matrix is typically considerably smaller than a matrix point patch. However, since the l indices in Figure 9 are different from one matrix to the next, they vary within the matrix point patches. Consequently, pivoting has to be tested and rows potentially swapped in the last loop, increasing the level-3 costs. In total we will have three stores and four loads per level-3 operation: one extra store for pivoting, one extra load since the F factors (see Figure 9) have to be stored in a texture map, and an additional load and store due to the fact that the GPU cannot simultaneously store at two different locations in the same texture map.

          w = 8   w = 16   w = 32
Matlab    68      89       182
C++       11      29       101
GPU       3       16       184

Fig. 10. The computational costs (in ms) of the LU-transform for implementations in Matlab, C++ and on a GPU.

Summaries of the computational costs of each implementation for different matrix widths can be seen in Figure 10. The real benefit of a GPU implementation is the low overhead for small matrix sizes. Especially for w = 8, but also for w = 16, the GPU version is considerably faster than the C++ CPU algorithm. Yet even with larger matrices (w = 32) there is still a benefit to be had with the GPU implementation since the CPU is freed for other processing. The excellent performance with low w is due to the way parallelism is used. For the CPU implementations, the possibility of performing operations in parallel along rows or columns is small when matrices are small. For larger matrix sizes more operations can be performed in parallel. Only when a full matrix no longer fits into the L1 cache, which occurs for sizes larger than w = 128, can a large increase in computational cost be noticed. On the GPU, parallelism is implemented in matrix point patches. While the size of each patch is kept constant, the number of patches is w × w, which means that the total amount of accessed texture memory becomes very large for larger matrix sizes. Thus already at w = 32 texture read latencies are increased due to poor caching on the GPU. In conclusion, for GPU implementations parallelism should be exploited in different ways depending on the size of the matrices. The approach we chose is suitable for sizes smaller than w = 32.

An operation that should not be underestimated is that of transferring the original input image from main memory to the texture memory of the GPU. For the graphics card used in our study this bandwidth is about 600 MB/s, which in practice means about 0.6 ms for a grey-level 720 × 480 pixel image. In the opposite direction the bandwidth is even lower. However, once the image is uploaded, transfers to and from texture memory are quick. In our case this bandwidth is about 22 GB/s, which is considerably higher than the memory bandwidth of most CPUs.
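The upload time can be checked with a short calculation. Two readings are assumed here and are not spelled out in the text: the image is taken to be the 720 × 480 benchmark image of Section 4.1 stored with one byte per pixel, and the quoted "600" is read as megabytes per second.

```python
image_bytes = 720 * 480        # assumption: 8-bit grey-level image, one byte per pixel
upload_bytes_per_s = 600e6     # "about 600" from the text, read as megabytes per second
upload_ms = 1e3 * image_bytes / upload_bytes_per_s
print(f"upload takes about {upload_ms:.2f} ms")
```

Under these assumptions the result is consistent with the figure of about 0.6 ms, which is small compared with the transform times in Figure 10 except for w = 8 on the GPU.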

5 Conclusion

In this paper we presented a real-time texture descriptor, the LU-transform, with several useful properties. First, it provides a low-dimensional output which is easy to store and perform calculations on. Second, it is easy to implement, even in parallel, and has few parameters to tune. Although the C++ implementation is fast, the simple structure of the algorithm allowed us to implement it on the GPU and keep the CPU free for other calculations. The efficiency of the GPU implementation depends on the way parallelism can be exploited and cache misses avoided. Our implementation was shown to be particularly successful for matrix sizes smaller than w = 32.

The properties of the method make it very suitable for real-time segmentation and object detection. Currently we are using this method as a texture input in a multiple cue attention system. Still there are several issues which we intend to study in our future work. One is the transform's use as a feature for object recognition. Another concerns the scale parameter w. Different patch sizes capture different properties. In this work we used only a single scale, but in the future we would like to exploit a multiple scale representation. Finally, we need a more formal understanding of what the pivots used in the LU-transform capture.

Acknowledgments

We gratefully acknowledge support from the European Commission within the projects MUSCLE (A. Tavakoli Targhi) and MOBVIS (M. Bjorkman, E. Hayman), and the Swedish Foundation for Strategic Research within the project VISCOS (E. Hayman).

References

1. Malik, J., Belongie, S., Leung, T., Shi, J.: Contour and texture analysis for image segmentation. Intl. Journal of Computer Vision 43 (2001) 7–27

2. Ojala, T., Pietikainen, M.: Unsupervised texture segmentation using feature distributions. Pattern Recognition 32 (1999) 477–486

3. Schiele, B., Crowley, J.: Recognition without correspondence using multidimensional receptive field histograms. Intl. Journal of Computer Vision 36 (2000) 31–50

4. Leung, T., Malik, J.: Representing and recognizing the visual appearance of materials using three-dimensional textons. Intl. Journal of Computer Vision 43 (2001) 29–44

5. Varma, M., Zisserman, A.: Texture classification: are filter banks necessary? In: Proc. Computer Vision and Pattern Recognition. (2003) II:691–698

6. Pietikainen, M., Nurmela, T., Maenpaa, T., Turtinen, M.: View-based recognition of real-world textures. Pattern Recognition 37 (2004) 313–323

7. Hayman, E., Caputo, B., Fritz, M., Eklundh, J.O.: On the significance of real-world conditions for material classification. In: Proc. 8th European Conf. on Computer Vision, Prague. (2004) IV:253–266

8. Efros, A., Leung, T.: Texture synthesis by non-parametric sampling. In: Proc. Int. Conf. on Computer Vision. (1999) 1033–1038

9. Sonka, M., Hlavac, V., Boyle, R.: Image processing, analysis and machine vision, 2nd edn. Thomson Learning Vocational (1999)

10. Tavakoli Targhi, A., Shademan, A.: Clustering of singular value decomposition of image data with applications to texture classification. In: VCIP. (2003) 972–979

11. Tavakoli Targhi, A., Hayman, E., Eklundh, J., Shahshahani, M.: The eigen-transform and applications. In: ACCV (1). (2006) 70–79

12. Golub, G.H., van Loan, C.F.: Matrix Computations. The Johns Hopkins University Press, Baltimore, MD (1989)

13. Press, W., Teukolsky, S., Vetterling, W., Flannery, B.: Numerical Recipes in C, 2nd edition. Cambridge University Press (1992)

14. Demmel, J.W.: Applied Numerical Linear Algebra. Society for Industrial and Applied Mathematics, Philadelphia, PA (1997)

15. Coleman, T., Van Loan, C.F.: Handbook for Matrix Computations. SIAM, Philadelphia (1988)

16. Chan, T.F.: On the existence and computation of LU-factorizations with small pivots. Mathematics of Computation, AMS. 25 (2003) 1075–1088

17. Li, J., Wang, J.Z.: Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Trans. Pattern Anal. Mach. Intell. 25 (2003) 1075–1088

18. Mallikarjuna, P., Fritz, M., Tavakoli Targhi, A., Hayman, E., Caputo, B., Eklundh, J.: The KTH-TIPS2 database (2004-5) Available at www.nada.kth.se/cvap/databases/kth-tips.

19. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide. Third edn. Society for Industrial and Applied Mathematics, Philadelphia, PA (1999)

20. Krueger, J., Westermann, R.: Linear algebra operators for GPU implementation of numerical algorithms. ACM Transactions on Graphics 22 (2003) 908–916

21. Galoppo, N., Govindaraju, N., Henson, M., Manocha, D.: LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In: Proc. ACM/IEEE SC05 Conf., Seattle, WA (2005)

22. Harris, M., Wooley, C.: General-purpose computation using graphics hardware (2006) Available at http://www.gpgpu.org/.

23. Woetzel, J., Koch, R.: Multi-camera real-time depth estimation with discontinuity handling on PC graphics hardware. In: Proc. Int’l Conf. Pattern Recognition (ICPR), Cambridge, United Kingdom (2004) 741–744

24. Strzodka, R., Garbe, C.: Real-time motion estimation and visualization on graphics cards. In: Proc. IEEE Visualization 2004, Austin, Texas (2004) 545–552

25. Griesser, A., Roeck, S.D., Neubeck, A., van Gool, L.: GPU-based foreground-background segmentation using an extended colinearity criterion. In: Proc. Vision, Modeling and Visualization (VMV), Erlangen, Germany (2005) 319–326
