

IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 5, NO. 1, FEBRUARY 2012 89

A Hybrid Approach for Building Extraction From Spaceborne Multi-Angular Optical Imagery

Anish Turlapaty, Balakrishna Gokaraju, Qian Du, Senior Member, IEEE, Nicolas H. Younan, Senior Member, IEEE, and James V. Aanstoos, Senior Member, IEEE

Abstract—The advent of high resolution spaceborne images has led to the development of efficient, high precision detection of complex urban details. This urban land use study is focused on building extraction and height estimation from spaceborne optical imagery. The advantages of such methods include 3D visualization of urban areas, digital urban mapping, and GIS databases for decision makers. In particular, a hybrid approach is proposed for efficient building extraction from optical multi-angular imagery, where a template matching algorithm is formulated for automatic estimation of relative building height, and the relative height estimates are utilized in conjunction with a support vector machine (SVM)-based classifier to separate buildings from non-buildings. This approach is tested on ortho-rectified Level-2a multi-angular images of Rio de Janeiro from the WorldView-2 sensor. Its performance is validated using a 3-fold cross validation strategy. The final results are presented as a building map and an approximate 3D model of buildings. The building detection accuracy of the proposed method improves to 88%, compared to 83% without using multi-angular information.

Index Terms—Building extraction, height estimation, multi-angular optical data.

I. INTRODUCTION

URBAN sprawl plays an important role in the efficient planning and management of a city against various challenges. Real-time and precise information is in high demand by federal, state, and local resource management agencies to make effective decisions on challenges such as environmental monitoring and administration, utility transportation, land-use/land-cover change detection, water distribution, and drainage systems of urban areas [1]. Very high resolution spaceborne sensors are a cost- and time-effective choice for region-based analysis to produce detailed urban land-cover maps. The high resolution capability enables the detection of urban objects such as individual roads and buildings, leading to automation in land mapping applications, cartographic digital mapping, and geographic information systems (GIS). Building extraction has been an active research area for the last three decades [2]–[4].

Manuscript received August 06, 2011; revised October 31, 2011; accepted November 23, 2011. Date of publication January 04, 2012; date of current version February 29, 2012.
A. Turlapaty is with the Department of Electrical and Computer Engineering, Mississippi State University, Mississippi State, MS 39762 USA.
Q. Du is with the Department of Electrical and Computer Engineering, Mississippi State University, and also with the Geosystems Research Institute, Mississippi State University, Mississippi State, MS 39762 USA (corresponding author, e-mail: [email protected]).
N. H. Younan is with the Department of Electrical and Computer Engineering, Mississippi State University, and also with the Geosystems Research Institute, Mississippi State University, Mississippi State, MS 39762 USA.
B. Gokaraju and J. V. Aanstoos are with the Geosystems Research Institute, Mississippi State University, Mississippi State, MS 39762 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSTARS.2011.2179792

Recent approaches for building extraction include: 1) active

contour models (ACM)-based detection on high resolution aerial imagery, with possible utilization of light detection and ranging (LIDAR) data; 2) energy point process methods on digital elevation model (DEM) data; 3) height estimation from synthetic aperture radar (SAR) data; 4) classification-based approaches on optical imagery; and 5) hybrid approaches combining either height estimation or contour methods with classification techniques on optical imagery.

Early research on building extraction was focused on aerial imagery due to its high spatial resolution. Region-based ACM-driven geometric snakes (GS) are well suited to aerial imagery, since they rely on intensity, texture, and statistical features of the pixels. Another advantage of GS is their ability to detect topological changes without noticeable edge data [5]. The existing GS approach for building detection was improved by training the model with gray level values from target regions. The traditional snake-based building detection was improved through the incorporation of digital surface model (DSM) and height data [6]. The DSM was generated from existing LIDAR data and a DEM. Later, approximate contours and initial points of interest were identified, and the snakes were optimized to fit the building contours. The standard approach for DEM-based building extraction depends on functional minimization with regularization [7]. This method was made computationally more efficient by modifying the energy-based objective function in the original formulation [8]. In another study [9], void and elevated areas were extracted from LIDAR data; using normalized difference vegetation index (NDVI) data and edge detection of void areas, building shape extraction was improved.

The limitation of the ACM-based methods described in [6] is that they perform well on isolated buildings, but further research is required to analyze their performance on closely spaced and connected buildings. Additionally, the images in [6], [9] have a nearly uniform background, with no non-building objects exhibiting edges or elevated areas. In contrast, images of commercial urban scenes have complex backgrounds requiring a more sophisticated approach. ACMs seem to perform well when buildings are the dominant category in the image scene [5], [6], [9]. The extraction methods described in [5] are tested on aerial images with only rectangular buildings, not on more complex shapes. There are also many research studies proposing the use of SAR for building extraction [10]–[13]. However, our research is focused on building extraction using optical satellite images.

1939-1404/$31.00 © 2012 IEEE


Regarding height estimation from spaceborne optical images, various approaches, such as template matching, dynamic programming (DP), and a variant of DP, were analyzed in [14]. In the template matching method, an image is searched for the area best correlated with a given template from another image. The DP approach tracks epipolar lines in two images and is formulated as an optimization problem using a dissimilarity function; in the modified DP approach, a mutual information function is used instead of a dissimilarity function. The height precision for each of these methods was close to 1 m. Several classification-based building extraction methods were also developed for spaceborne images. These algorithms usually operate on both panchromatic and multispectral image sets; they include a neural network classification of spectral and structural features [15] and a Bayesian classifier approach [16]. A hybrid approach for building extraction was developed using support vector machine (SVM)-based binary classification [17]. This approach utilizes a combination of the intensity values from an ortho-rectified image, NDVI, and a normalized DSM (nDSM). In [18], it was pointed out that the radiometric resolution of a sensor plays a significant role in shadow detection and removal; it was found that simple algorithms (such as bimodal histogram splitting) provided the best chances for discriminating shadow from non-shadow regions in the imagery. This problem can be addressed more easily using very high radiometric resolution sensor data [19]. The existing building extraction methods from spaceborne

optical imagery also have some limitations. For instance, the method in [14] was limited to panchromatic images with little variation in the relative size, shape, and separation of buildings. Moreover, the images in that study [14] had hardly any non-building objects above ground level. These methods need further investigation on complex image scenes with multiple bands. In the Hough-transform-based approach [17], the authors assumed that buildings were either rectangular or circular; this assumption may not hold for complex building shapes. Dynamic programming has been used successfully in template matching, especially for one-dimensional data such as speech recognition [20]. It was also useful for the analysis of epipolar stereo image pairs with low building height variations [21]. Since the WorldView-2 dataset used in this study contains images from several different view angles and the building heights vary widely, dynamic programming may not be applicable without significant modifications.

In this paper, we propose a hybrid building extraction and relative building height estimation methodology for an urban scene using the multi-angular characteristics of optical imagery from the WorldView-2 sensor. The objective is to improve building extraction performance by appropriately utilizing optical multi-angular information. Our major contributions are: 1) an alternative approach is proposed for shadow detection using the yellow and red bands of multispectral optical imagery from WorldView-2; 2) a relative height estimation methodology based on template matching is developed; and 3) a hybrid approach is proposed to significantly improve building extraction accuracy

using the results from the template matching, followed by classification of the image using multi-band spectral features.

The rest of the paper is organized as follows. Section II presents the proposed hybrid methodology with mathematical details. Section III describes its application to WorldView-2 imagery, followed by classification results and discussion. Finally, Section IV concludes with a summary and proposed future work.

II. METHODOLOGY

A block diagram of the proposed hybrid method is illustrated in Fig. 1. First, pan-sharpening is conducted to enhance the spatial resolution of the multispectral images. Second, the pan-sharpened data is used to detect and remove shadow and vegetation regions, followed by a template matching technique for height estimation and potential building segment detection. Third, statistical features are extracted from the potential building segments, and these are used for SVM-based supervised classification for final building extraction.

A. Pan-Sharpening

Consider a panchromatic image of size N₁ × N₂ pixels with resolution r m and a multispectral (MS) image set from the same satellite with resolution 4r m and B spectral bands. In the pan-sharpening stage, the first step is to resample the original MS imagery to resolution r m using cubic spatial interpolation. The next step is to apply the Gram-Schmidt orthogonalization based pan-sharpening method in the ENVI package [22] to the panchromatic image and the re-sampled MS image. The pan-sharpened (PS) image set is the input to the proposed hybrid approach.
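As a rough sketch of this stage (not the ENVI implementation; the band-mean pan simulation and the covariance-based injection gains are simplifying assumptions), the MS cube can be resampled to the panchromatic grid with cubic interpolation and sharpened by a Gram-Schmidt-style component substitution:

```python
import numpy as np
from scipy.ndimage import zoom

def pan_sharpen_gs(ms, pan):
    """Gram-Schmidt-style component-substitution pan-sharpening (sketch).

    ms  : (B, h, w) multispectral cube at coarse resolution
    pan : (H, W) panchromatic image on a finer grid (H/h = W/w)
    """
    B, h, w = ms.shape
    scale = pan.shape[0] / h
    # Step 1: resample each MS band to the pan grid with cubic interpolation.
    ms_up = np.stack([zoom(band, scale, order=3) for band in ms])
    # Step 2: simulate a low-resolution pan image as the mean of the MS bands.
    pan_sim = ms_up.mean(axis=0)
    # Step 3: inject the pan detail into each band, scaled by a per-band
    # gain g_b = cov(band, pan_sim) / var(pan_sim).
    detail = pan - pan_sim
    out = np.empty_like(ms_up)
    for b in range(B):
        g = np.cov(ms_up[b].ravel(), pan_sim.ravel())[0, 1] / pan_sim.var()
        out[b] = ms_up[b] + g * detail
    return out
```

Here the injected detail is the difference between the true pan band and a pan image simulated from the MS bands; each band receives that detail scaled by its similarity to the simulated pan.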

B. Height Estimation

1) Shadow and Vegetation Removal: Shadows pose a serious impediment in our hybrid approach and can be confused with other class information, so their removal is crucial. Vegetation detection helps avoid the challenges that can be encountered in template matching with objects such as tall trees. In addition, this detection removes a significant percentage of the non-target information load from further processing. Finally, after the masking of shadows and vegetation, the resulting image consists only of buildings and other man-made structures such as roads, parking lots, and pavements.

The NDVI is calculated using the near infrared (NIR) and red bands of the pan-sharpened image as

NDVI = (NIR − Red) / (NIR + Red)    (1)

The vegetation mask is obtained by thresholding the NDVI image. The threshold can be determined with the maximum between-class variance (MBCV) criterion [23], [24]: assuming the histogram of the image has two major modes, the threshold is selected such that the inter-class variance between the two modes is maximized, which maximizes their separability.
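The MBCV criterion is equivalent to Otsu's method. A minimal histogram-based sketch (the bin count is an illustrative choice), usable on an NDVI image for the vegetation mask or on the log band-product used later for shadows:

```python
import numpy as np

def mbcv_threshold(values, bins=256):
    """Maximum between-class variance (Otsu) threshold for a bimodal sample.

    Returns the bin center that maximizes the inter-class variance
    between the two histogram modes.
    """
    hist, edges = np.histogram(values, bins=bins)
    p = hist / hist.sum()                      # bin probabilities
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                          # class-0 weight at each cut
    w1 = 1.0 - w0
    m = np.cumsum(p * centers)                 # cumulative first moment
    mt = m[-1]                                 # global mean
    valid = (w0 > 0) & (w1 > 0)
    # Between-class variance: w0*w1*(mu0 - mu1)^2 = (mt*w0 - m)^2 / (w0*w1).
    sigma_b = np.zeros_like(w0)
    sigma_b[valid] = (mt * w0[valid] - m[valid]) ** 2 / (w0[valid] * w1[valid])
    return centers[np.argmax(sigma_b)]
```

For a well-separated bimodal sample the returned cut falls between the two modes.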


Fig. 1. Block diagram of the proposed building extraction approach, consisting of three major stages: (1) pan-sharpening, (2) relative height estimation, and (3) building extraction.

Previous studies show that the normalized shadow index (NSI) and the normalized spectral shadow (NSS) produce very low index values for shadows but are often confused with surrounding water bodies and vegetation [25]. Similarly, shadow regions have been processed by transforming the histogram for grey-scale compensation to discriminate them from non-shadow regions [26]. Shadow regions have also been detected by transformation to the Hue-Saturation-Intensity (HSI) space followed by removal of NDVI-based vegetation [27]. In the current research, the product of the yellow and red bands in (2) is used for shadow detection:

P(i, j) = Y(i, j) · R(i, j)    (2)

For a given pixel at (i, j), if log P(i, j) < Tₛ, then it is categorized as shadow. The threshold value Tₛ is determined using the MBCV of the log P image [23], [24]; the lowest mode of the histogram represents the shadows in the image. Once the vegetation and shadow regions in the image are screened out, the remaining image is used as the input to the subsequent height estimation.
2) Template Matching for Potential Building Extraction:

Template matching is commonly used in several image processing applications, such as object tracking in forward looking infrared imagery [28], face recognition using normalized gradient matching [29], and human detection based on body shape [30]. The method deals with two stereo images: one image provides templates of target objects, and the goal is to locate these templates in the other image. Usually, the search image is at a slightly different view angle than the template image. In this paper, we propose to use this technique for relative height estimation of a given image scene based on the template displacements.

Consider two b-th band images I₁ᵇ and I₂ᵇ at different view angles θ₁ and θ₂. In the I₁ᵇ image, each template T(i, j) is tracked in the I₂ᵇ image to estimate the relative height, where (i, j) is the center pixel. In this study, the template is a spatial window of size w × w pixels in the I₁ᵇ image. The displacement corresponding to a template T(i, j) in the I₂ᵇ image is assumed to be

d(i, j) = √( (i′ − i)² + (j′ − j)² )    (3)

where W(i′, j′) is the matching area in the second image, centered at pixel (i′, j′). As illustrated in Fig. 2, the height h of a building can be related to this displacement with the cosine rule:

d = h √( tan²θ₁ + tan²θ₂ − 2 tan θ₁ tan θ₂ cos γ )    (4)

where θ₁ and θ₂ are the two incidence angles, and γ is the angle between the ground projections of the sensor lines of sight. Here, the angle γ depends only on the view angles of the satellite [31]. The angles θ₁ and θ₂ depend on the building height; however, compared to the altitude of the satellite, the influence of the building height variation is negligible. Hence, from (4), the building height is linearly related to the displacement d, which can therefore be used as a relative height.

Initially, for a given template, a corresponding search area, called a frame, is selected from the I₂ᵇ image. This frame has to be large enough to contain even the farthest displacement from (i, j), which depends on the height of the building. The frame size depends on the difference between the view angles of the two images; the template size and frame size are not usually equal. When computing the FFT-based correlation, the frame and template are padded with zeros to the nearest larger power of 2.
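Assuming the cosine-rule relation in (4), the displacement accumulated per unit of building height can be checked numerically; the angles below are illustrative, not the exact acquisition geometry:

```python
import math

def displacement_scale(theta1_deg, theta2_deg, gamma_deg):
    """Per-unit-height displacement implied by the cosine rule in (4):
    d = h * sqrt(tan^2 t1 + tan^2 t2 - 2 tan t1 tan t2 cos gamma)."""
    t1 = math.tan(math.radians(theta1_deg))
    t2 = math.tan(math.radians(theta2_deg))
    g = math.radians(gamma_deg)
    return math.sqrt(t1 * t1 + t2 * t2 - 2.0 * t1 * t2 * math.cos(g))

# Sanity check: with gamma = 0 the two lines of sight project onto the
# same ground direction, and the scale reduces to |tan t1 - tan t2|.
```

The linearity in h is what allows the raw template displacement to serve directly as a relative height.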


Fig. 2. An illustration of the relation between template displacement and object height [31].

The 2D Fourier transforms of both the template t and the frame f are computed as [32]

T(u, v) = FFT₂{ t(x, y) }    (5)

F(u, v) = FFT₂{ f(x, y) }    (6)

and their correlation c(x, y) is computed with the 2D inverse Fourier transform:

c(x, y) = IFFT₂{ F(u, v) · T*(u, v) }    (7)

Then, the normalized correlation is determined by

cₙ(x, y) = ( c(x, y) − μₜ S(x, y) ) / ( N σₜ σ_f(x, y) )    (8)

where S(x, y) is the regional sum at each pixel of the frame, σ_f(x, y) is the corresponding regional standard deviation, and μₜ and σₜ are the mean and standard deviation of the template. The regional sum is computed over a window with a size equal to that of the template, and N is the number of pixels in the template. The point (i′, j′) used in (3) and (4) is the location with the largest correlation cₙ(x, y). The maximum correlation value has to be greater than a selected threshold for building discrimination; otherwise, the template is treated as planar [33]–[35]. The matching template is selected based on the maximum correlation, and in the case of multiple matches, the closest match is retained. By repeating (5)–(8) for all the templates in the I₁ᵇ image, the relative height can be estimated for the whole scene. The output of the template matching is a set of segments that primarily correspond to buildings.
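Steps (5)–(8) can be sketched as follows. This is a generic FFT-based normalized cross-correlation, not the authors' exact implementation; the local frame statistics are obtained from summed-area tables:

```python
import numpy as np

def _next_pow2(n):
    return 1 << (n - 1).bit_length()

def match_template(frame, template):
    """Locate `template` in `frame` by FFT-based normalized cross-correlation.
    Returns the (row, col) of the template's top-left corner at the best
    match, together with the peak correlation value."""
    fh, fw = frame.shape
    th, tw = template.shape
    P, Q = _next_pow2(fh + th), _next_pow2(fw + tw)

    # Raw correlation via the FFT: IFFT2{ FFT2(frame) * conj(FFT2(template)) }.
    F = np.fft.fft2(frame, (P, Q))
    T = np.fft.fft2(template, (P, Q))
    raw = np.fft.ifft2(F * np.conj(T)).real[: fh - th + 1, : fw - tw + 1]

    # Local sums of the frame over template-sized windows (summed-area tables).
    sat = np.pad(frame, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    S = sat[th:, tw:] - sat[:-th, tw:] - sat[th:, :-tw] + sat[:-th, :-tw]
    sat2 = np.pad(frame ** 2, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    S2 = sat2[th:, tw:] - sat2[:-th, tw:] - sat2[th:, :-tw] + sat2[:-th, :-tw]

    n = th * tw
    mu_t, sd_t = template.mean(), template.std()
    var_f = np.maximum(S2 / n - (S / n) ** 2, 0.0)
    denom = n * sd_t * np.sqrt(var_f)
    ncc = np.where(denom > 1e-12,
                   (raw - mu_t * S) / np.maximum(denom, 1e-12), 0.0)

    idx = np.unravel_index(np.argmax(ncc), ncc.shape)
    return idx, ncc[idx]
```

Zero-padding both arrays to the next power of 2 (as in the text) avoids circular wrap-around and keeps the FFTs fast; the displacement between the returned position and the template's position in the first image is the relative height measure.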

C. Building Extraction

At this stage, a block classification approach is chosen based on the dimensions of the buildings in the image; for example, some buildings occupy hundreds of pixels in one dimension, and a few occupy tens of pixels in the other dimension. In addition, some disadvantages of pixel-level or very small block-level classification for building extraction are: 1) a higher computational load for the classification algorithm, 2) the possibility of more false positives because of the similarity of blocks or pixels of buildings with those from non-building regions, and 3) a lack of sufficient spatial and spectral variability for efficient classification. Building detection from high resolution satellite images requires a resolution of at least 5 m, which enables the expression of buildings as objects rather than mixed pixels. Hence, for a high resolution image with 0.5 m resolution, a block size of 10 × 10 is equivalent to a 5 m resolution. Finally, an appropriate block size has to be chosen for the buildings in the imagery, because a smaller size would not attain enough spatial variance, whereas a very large size would mix in other classes such as parking lots and roads [36]–[38]. Feature extraction is crucial for block-based image classification, since the features computed for each block represent its spatial and textural properties. Moreover, feature vectors play a key role in developing the classification model, as they constitute the bridge between the high-level object category (abstraction) and raw pixel intensities.

The image is divided into spatial blocks B_k of size M × M, and the total number of blocks is given by K = (N₁/M)(N₂/M), where it is assumed that M divides N₁ and N₂. For each block, statistical features, such as the mean (μ), variance (σ²), skewness (s), kurtosis (κ), and energy (E), are extracted. 2D-wavelet-based features are also computed as follows. Initially, a 2D wavelet decomposition is applied to each spatial block using the following recursive filtering equations at the l-th desired level of decomposition, with initial A₀ = B_k:

A_l = ↓2 ( h ∗ ( h ∗ A_{l−1} ) )
H_l = ↓2 ( h ∗ ( g ∗ A_{l−1} ) )
V_l = ↓2 ( g ∗ ( h ∗ A_{l−1} ) )
D_l = ↓2 ( g ∗ ( g ∗ A_{l−1} ) )    (9)

where A, H, V, and D represent the approximation, horizontal, vertical, and diagonal coefficients, respectively, the inner filter acts along the rows and the outer filter along the columns, ↓2 denotes dyadic downsampling, and g and h denote the highpass and lowpass filters of the selected Haar mother wavelet, respectively. The statistical features in the wavelet domain, such as the energy "E_w", entropy "S_w", maximum coefficient "C_w", and mean "μ_w", are computed. A feature vector for the k-th block in the b-th band is constructed as:

f_k^b = [ μ, σ², s, κ, E, E_w, S_w, C_w, μ_w ]    (10)

where k = 1, …, K. The final feature matrix with all the B bands is F = [ f_k^1, f_k^2, …, f_k^B ]. After the minimum building height is detected, it is used as a

threshold for screening out samples that certainly do not correspond to buildings. As a result of this screening, most of the remaining feature vectors correspond to buildings, and only a few belong to non-buildings (such as roads and parking lots). Now, an SVM is adopted to classify buildings from the non-building classes, where building detection is formulated as a binary classification problem. The primary goal of the SVM is to devise an optimal hyperplane that maximizes the marginal


distance from the support vectors of each class [39]–[43]. The sample data is grouped into training and testing sets using a 3-fold cross validation strategy: the feature set is divided into three folds, and in each validation iteration, two folds (66.66% of the data) are used for training and one fold (33.34% of the data) for testing, so that after three iterations each fold has served for both training and testing. From the testing component of the SVM classification, a building map is produced for the image scene. The areas of the individual segments of this map are computed, and the smaller segments are discarded based on a minimum building area; this threshold can be detected by applying MBCV to the segment areas. Finally, the resulting map consists of buildings only. The building map is overlaid on the relative height map to obtain the height data for individual buildings. Then, by averaging this height data separately for each building, an approximate 3D model can be generated.
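The per-block features of (10) and the 3-fold partition can be sketched as follows. A single-level Haar transform stands in for the multi-level decomposition of (9), and the entropy over normalized absolute coefficients is an assumed variant, not the paper's exact definition:

```python
import numpy as np

def haar_dwt2(block):
    """One-level 2D Haar decomposition of an even-sized block.
    Returns (A, H, V, D): approximation plus horizontal, vertical,
    and diagonal detail coefficients."""
    lo = (block[0::2] + block[1::2]) / 2.0   # lowpass along rows
    hi = (block[0::2] - block[1::2]) / 2.0   # highpass along rows
    A = (lo[:, 0::2] + lo[:, 1::2]) / 2.0
    V = (lo[:, 0::2] - lo[:, 1::2]) / 2.0
    H = (hi[:, 0::2] + hi[:, 1::2]) / 2.0
    D = (hi[:, 0::2] - hi[:, 1::2]) / 2.0
    return A, H, V, D

def block_features(block):
    """Nine features per block in the spirit of (10): spatial mean, variance,
    skewness, kurtosis, and energy, plus wavelet-domain energy, entropy,
    maximum absolute coefficient, and mean."""
    x = block.ravel().astype(float)
    mu, var = x.mean(), x.var()
    sd = np.sqrt(var) + 1e-12
    skew = np.mean(((x - mu) / sd) ** 3)
    kurt = np.mean(((x - mu) / sd) ** 4)
    feats = [mu, var, skew, kurt, np.sum(x ** 2)]
    c = np.concatenate([w.ravel() for w in haar_dwt2(block)])
    p = np.abs(c) / (np.abs(c).sum() + 1e-12)   # coefficient distribution
    entropy = -np.sum(p * np.log2(p + 1e-12))
    feats += [np.sum(c ** 2), entropy, np.abs(c).max(), c.mean()]
    return np.array(feats)

def three_fold_indices(n, seed=0):
    """Shuffled index partition for 3-fold cross validation: each fold is the
    test set once while the other two folds train the classifier."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), 3)
```

In each iteration, the features of two folds would train the SVM and the held-out fold would be classified, cycling through all three folds.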

III. EXPERIMENT AND RESULTS

The dataset used in the experiment comprises five multi-angle ortho-rectified Level-2a panchromatic and multispectral images from the WorldView-2 sensor [44]. The multispectral image consists of eight spectral bands, namely, coastal, blue, green, yellow, red, NIR1, NIR2, and NIR3, ranging from 400 nm to 1040 nm, with a spatial resolution of 2 m. The panchromatic image, covering 450 nm to 800 nm, has a 0.5 m spatial resolution. The image set was acquired over the downtown region of Rio de Janeiro, Brazil, in January 2010 within a three minute time frame. The scene includes a large number of tall buildings, wide roads, parking lots, and a mixture of community parks and plantations. The five incidence angles of the sensor, i.e., 10.5° (view 3), 21.7° (view 2), 26.7° (view 4), 31.6° (view 1), and 36.0° (view 5), were used in this research. For further details on the azimuth angles, please refer to [44].

A. Pan-Sharpening Result

From the original image at view 3, two subscenes of size 525 × 525 and 500 × 500 pixels, as shown in Fig. 3, were studied. The panchromatic image was fused with the re-sampled MS images, resulting in the pan-sharpened images illustrated in Fig. 4. It is clearly observed that the edges of objects are much more distinct in the panchromatic image than in the MS data. In the pan-sharpened output, the spectral signatures of the MS imagery and the spatial features of the panchromatic image are well retained.

B. Template Displacement Estimation

1) Detection of Shadows and Vegetation From Pan-Sharpened Images: The pixel level product of the yellow and red bands of the pan-sharpened image was computed. The histogram of the logarithmic values of this product image presents a bimodal distribution of shadows and non-shadows, as shown in Fig. 5, from which a threshold value of 4.41 was determined. The pixels with a log-product value less than this threshold were considered shadows. By using (1) on the NIR3 and red bands, vegetation pixels (enclosed by the yellow polygons) were identified with very high accuracy, as shown in Fig. 6. This resulted in a reduction of approximately 30% of the non-building pixels passed to further analysis.
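The two masks described above can be sketched as follows, assuming the thresholds (e.g., the 4.41 log-product value for this scene) have already been obtained via MBCV:

```python
import numpy as np

def shadow_vegetation_mask(nir, red, yellow, ndvi_thresh, shadow_thresh):
    """Binary masks per Section II-B: vegetation from NDVI, shadow from the
    log of the yellow-red band product. The small epsilons only guard
    against division by zero and log of zero."""
    ndvi = (nir - red) / (nir + red + 1e-12)
    vegetation = ndvi > ndvi_thresh
    shadow = np.log(yellow * red + 1e-12) < shadow_thresh
    return vegetation, shadow
```

Pixels flagged by either mask are excluded before template matching, which is where the reported ~30% reduction in non-building pixels comes from.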

Fig. 3. Panchromatic images of two scenes in the image at view 3 of Rio de Janeiro from the WorldView-2 sensor. (a) Scene 1 of size 525 × 525. (b) Scene 2 of size 500 × 500.

2) Template Matching: The goal of this step is to screen out most of the remaining non-building features before the subsequent classification process. The first image, at view angle 3, was used for extracting templates, and the pan-sharpened image at view angle 2 is the second image, used for extracting frames. The second image contains an additional 200 rows and 50 columns towards the south and west, respectively; it is used to detect the displacement of large buildings from the first image. The view angle of the second image is further away from nadir in the northeastern direction than that of the first image; because of this, the buildings appear displaced in the southwest direction. Moreover, based on the height of the tallest building in the image, the largest possible displacement is nearly 180 rows and 45 columns. Thus, each frame extends 180 rows towards the south and 45 columns towards the west of the template's location at its northeast corner.

Visual inspection of Fig. 4(a) and (b) shows that planar objects such as roads and other non-elevated structures are located at the same coordinates in both images. For example, the


Fig. 4. Pan-sharpened images using the Gram-Schmidt method. (a) Pan-sharpened image from view angle 3 (scene 1). (b) Pan-sharpened image from view angle 2 (scene 1).

red rectangular template shown in Fig. 4(a) and (b) is such a planar template. On the other hand, we can clearly see that tall structures such as building roofs (yellow boxes in Fig. 4) are displaced considerably in the second image. Thus, the matching is a similarity comparison between two non-planar templates at points (i, j) and (i′, j′). Moreover, the displacement of a non-planar template depends on the building height. Since the two images have different view angles, the apparent areas of the regions representing shadows and vegetation also vary between the images. By applying the template matching method as described in

Section II.B, a relative height map was generated as in Fig. 7(a).In this process, was used for the template block size.Thus, the total number of templates in the first image was 625.The template size 21 21 was found to be optimal because asmaller template may have correlation with other regions in thesearch area that are not necessarily the best matches. In addi-tion, if the template size is much larger, then it may have mixedcategories of planar and non-planar regions. From Fig. 7(a), wecan clearly see that the regions representing buildings showedhigh relative height values. However, a region corresponding to

Fig. 5. Histogram of log product values of the red and yellow band intensitiesfrom the image scene 1 at view 3.

Fig. 6. Pan-sharpened image at scene 1 of view 3 with shadows and vegetationmarked in red and yellow outlines, respectively. Only outlines of the respectivemasks are depicted here for visual appeal. Otherwise, detection is a binary seg-mentation and the masks obtained are binary intensity masks.

one building had several different height values even though theactual height was constant for a given building. This discrep-ancy can be attributed to a high degree of similarity betweendifferent sections of buildings. Most of the templates belongingto non-buildings have a very low height output which is justi-fied since they are planar objects. Based on a trial-and-error ap-proach, the optimal threshold for a minimum correlation valuefor a match is found to be 0.6. In the histogram of the log-rel-ative-height data in Fig. 7(b), the section shown in red repre-sents the non-building regions in the image and the green sec-tion represents the regions that are mostly buildings. From thishistogram, the minimum relative height value was 23 as dis-cussed in Section II.B using the MBCV method. The regions

TURLAPATY et al.: A HYBRID APPROACH FOR BUILDING EXTRACTION FROM SPACEBORNE MULTI-ANGULAR OPTICAL IMAGERY 95

Fig. 7. Template matching output: The template displacement calculated foreach template in view 3 is assigned as the relative height for each 21 21 block.(a) Relative height of scene 1 from view angle 3. The color bar indicates the rel-ative height value (template displacement). (b) Histogram of the relative heightoutput.

that are assigned to a non-building category account to 27.3%of the image.The percentage of image regions screened at different

process steps is given in Table I. Hence, before the classi-fication, a total of 57.3% of the image is pre-screened asnon-buildings. A comparison of Fig. 8 with Fig. 4(a) showsthat most of the building pixels are correctly detected throughthe height estimation method. However, a major issue with thismethod is false positives. One of the reasons for these falsedetections is the occlusion of some regions by large buildingsin the image (Fig. 4(b)). For instance, the region shown inthe red box in Fig. 8 can be clearly seen in Fig. 4(a) but it isnearly occluded by the large neighboring building in Fig. 4(b).However, there are also some regions, such as the yellow boxin Fig. 8, that are not occluded by any large structures butstill falsely represent a tall structure. This property can alsobe attributed to the correlation between different templates ofa road texture. Moreover, the roads in the downtown of RioDe Janeiro consists of flyovers and the terrain itself has variedelevations in the study area.
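As a concrete illustration, the template matching step can be sketched as follows: for each 21 × 21 template in the view-3 image, a search frame extending 180 rows south and 45 columns west is scanned in the view-2 image, the best normalized cross-correlation match is located, and the displacement magnitude is taken as the relative height (or zero if the best correlation falls below the 0.6 threshold). This is a minimal spatial-domain sketch, not the paper's implementation, which uses a fast 2D-FFT-based correlation [33]; the function and variable names here are hypothetical.

```python
import numpy as np

def ncc(template, patch):
    """Normalized cross-correlation between two equal-size blocks."""
    t = template - template.mean()
    p = patch - patch.mean()
    denom = np.sqrt((t * t).sum() * (p * p).sum())
    return 0.0 if denom == 0 else float((t * p).sum() / denom)

def relative_height(img1, img2, r, c, tsize=21, max_dr=180, max_dc=45,
                    min_corr=0.6):
    """Match the template at (r, c) in img1 against a search frame in img2
    extending max_dr rows south and max_dc columns west, and return the
    displacement magnitude (a proxy for relative height), or 0 if no
    match exceeds the correlation threshold."""
    template = img1[r:r + tsize, c:c + tsize]
    best_corr, best_disp = -1.0, 0.0
    for dr in range(max_dr + 1):
        for dc in range(max_dc + 1):
            rr, cc = r + dr, c - dc   # buildings shift southwest in img2
            if rr + tsize > img2.shape[0] or cc < 0:
                continue              # candidate block outside the image
            corr = ncc(template, img2[rr:rr + tsize, cc:cc + tsize])
            if corr > best_corr:
                best_corr, best_disp = corr, float(np.hypot(dr, dc))
    return best_disp if best_corr >= min_corr else 0.0
```

Note that a planar template, which appears at the same coordinates in both images, matches best at zero displacement and therefore receives zero relative height, which is exactly the screening behavior described above.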

Fig. 8. Output at the template matching stage. The image scene, consisting of the remaining 42.7% of pixels, is shown after extracting vegetation, shadows, and most of the planar templates.

TABLE I
PERCENTAGE OF IMAGE SCENES 1 AND 2 OF VIEW ANGLE 3 SCREENED AT EACH PROCESSING STEP; STEP 1 CORRESPONDS TO VEGETATION AND SHADOW DETECTION AND STEP 2 CORRESPONDS TO SCREENING MOST OF THE NON-BUILDINGS BY TEMPLATE MATCHING

C. Building Extraction Result

The image dataset was divided into 10 × 10 blocks. This size was determined based on the typical building size in the image; a smaller size would not capture enough variance, whereas a larger size would mix classes such as parking lots and roads. Hence, the total number of blocks obtained was 2704. The labels for buildings and non-buildings were obtained from the derived reference samples via expert visual analysis, and these reference labels were used in the training for each class. For a given block, the majority voting rule was adopted to assign a class label. Various statistical features were extracted for all the blocks. Approximately 42% (1150/2704) of the feature vectors remained available for the classification of buildings versus non-buildings. In this experimental study, the training time was relatively low because of the masking in the previous stage. The key SVM parameters were the complexity constant, the tolerance, and a polynomial kernel of order one (i.e., linear) [42]. A 3-fold cross validation strategy was used to determine the classification performance [42]. Table II lists the resulting completeness and correctness factors obtained from different view combinations. These metrics were computed from the confusion matrix using the equations given in [5], where completeness is the number of correct detections as a fraction of actual building samples, while correctness is the number of correct detections as a fraction of total detections.

TABLE II
CONFUSION MATRICES FROM FUSION EXPERIMENTS ON SCENES 1 AND 2 FOR COMBINATIONS OF DIFFERENT VIEWS WITH CENTRAL VIEW 3

The fusion of images from view 3 and view 2 gave significant improvement in both completeness and correctness. The combinations of views 1, 4, and 5 with view 3 did not yield statistically significant improvement in completeness (although the correctness factors showed some improvement for these combinations). The major reason is that the image scene has very large buildings, which produce both large shadows and occlusions. Because of the satellite position with respect to the Sun, the shadows and occlusions are on opposite sides of the buildings in views 1, 4, and 5, losing some information related to planar regions (non-buildings). For view 2, however, occlusions have less influence on the template matching output because this view lies on the same side of nadir as view 3.

Table III further gives the performance of fusing views 3 and 2 in terms of the Kappa accuracy (κ), Overall Accuracy (OA), and Error Rate (ER). The OA is the ratio between the total correct detections and the total number of samples, and ER = 1 − OA. The reported results confirm that fusing view 2 with view 3 yielded improvements in all these metrics.

TABLE III
CLASSIFICATION PERFORMANCE METRICS WHEN FUSING VIEW 3 AND VIEW 2

McNemar's test was adopted to verify the statistical significance of the improvement. The terms in Table IV are defined as follows: f11 is the number of samples for which both classifiers predict correctly, f10 is the number for which classifier 1 is correct and classifier 2 is wrong, f01 is the opposite case, and f00 is the number for which both classifiers fail. The chi-square statistic can be easily computed from these values [45]. Here, the chi-square statistic exceeds 3.84, the 95% critical value of the chi-square distribution with one degree of freedom; hence the improvement is significant at the 95% confidence level.

TABLE IV
STATISTICS FROM MCNEMAR'S TEST ON SCENE 1 (COMBINATION OF VIEWS 3 AND 2 VERSUS VIEW 3 ONLY), PERFORMANCE COMPARISON BASED ON CORRECT PROPORTIONS
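The evaluation metrics and significance test described above can be sketched as follows. This is a minimal illustration; the counts used in the usage note below are hypothetical and are not taken from Tables II–IV.

```python
import numpy as np

def completeness_correctness(tp, fn, fp):
    """Completeness: correct detections as a fraction of actual building
    samples; correctness: correct detections as a fraction of all
    detections."""
    return tp / (tp + fn), tp / (tp + fp)

def overall_accuracy(conf):
    """OA is the ratio of total correct detections to total samples;
    the error rate is ER = 1 - OA."""
    conf = np.asarray(conf, dtype=float)
    oa = np.trace(conf) / conf.sum()
    return oa, 1.0 - oa

def mcnemar_chi2(f10, f01):
    """Continuity-corrected McNemar chi-square statistic from the two
    discordant counts (classifier 1 right / classifier 2 wrong, and the
    opposite case); compare against 3.84, the 95% critical value of the
    chi-square distribution with one degree of freedom."""
    return (abs(f10 - f01) - 1) ** 2 / (f10 + f01)
```

For example, with hypothetical discordant counts f10 = 40 and f01 = 15, the statistic is 24²/55 ≈ 10.5 > 3.84, so the improvement would be significant at the 95% confidence level.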


Fig. 9. Output of the hybrid building extraction algorithm: classification maps are shown with building polygon outlines for better visual clarity. These polygon outlines indicate that all pixels within the polygons are classified as buildings. (a) Original building classification map for scene 1 of view 3. (b) Refined building classification map for scene 1 of view 3. (c) Refined building classification map for scene 2 of view 3.

Fig. 9 shows the resulting building maps. The red polygons in Fig. 9(a) are the boundaries of the classification outputs for the detected buildings, with every pixel inside a boundary classified as a building. This rough map included many false positives, such as the small red polygons on roads, and false negatives within the buildings. Most of these false alarms were removed using the minimum building area threshold criterion discussed in Section II.C. The refined building maps for the two scenes are presented in Fig. 9(b) and (c), respectively. A preliminary analysis of feature importance for the classification reveals that the statistical features and the energy of the wavelet approximation sequence are the most significant, whereas features such as wavelet-based entropy and the maximum coefficient and statistics of the decomposition sequences are less relevant. Notably, the combined use of the statistical features and all four types of wavelet features reduces the classification performance. This behavior can be attributed to the similarity in the textural properties of building and non-building areas. In a related study of similar optical imagery, texture features were likewise found to have low relevance for classification [46].

Fig. 10. An approximate 3D model of buildings computed from the relative height data (displacement) for scene 1 of view 3.

A better visualization of the building classification output is also presented in Fig. 10, using the relative building height derived from the template matching. This approximate 3D building model is created using shaded surface plots. It provides a more appealing visualization than the earlier height output in Fig. 7(a), where the height within a building was inconsistent and the building shapes were not well preserved. Averaging the relative heights over the building segments (shown in red in Fig. 9(b)) from the classification output yields smooth heights and shapes for complete buildings (Fig. 10). This can also be used to illustrate the relative height variations of tall buildings in the scene. The tallest building in the 3D model (shown in sky blue) corresponds to the actual tallest building in the scene (based on visual interpretation), followed by the second tallest building (in green), and smaller buildings shown in other colors (e.g., red).
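The smoothing step described above, which assigns each classified building segment the average of its per-block relative heights, can be sketched as follows. This is a minimal illustration under the assumption that segment labels come from connected-component labeling of the refined building map (e.g., via scipy.ndimage.label); the function name is hypothetical.

```python
import numpy as np

def smooth_segment_heights(height_map, labels):
    """Assign every block of a building segment the mean relative height
    of that segment; label 0 denotes non-building background and stays
    at zero height."""
    out = np.zeros_like(height_map, dtype=float)
    for lab in np.unique(labels):
        if lab == 0:
            continue  # skip background blocks
        mask = labels == lab
        out[mask] = height_map[mask].mean()
    return out
```

The smoothed height array can then be rendered as a shaded surface (for example with matplotlib's plot_surface) to obtain a model of the kind shown in Fig. 10.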


IV. SUMMARY AND FUTURE WORK

In this study, a hybrid building extraction methodology is developed and applied to multi-angular imagery from the WorldView-2 sensor. Initially, the Gram-Schmidt orthogonalization based pan-sharpening method is used to fuse the panchromatic and multispectral imagery. Then, vegetation and shadow regions are detected and masked with high accuracy. Next, 2D Fourier transform-based template matching is used to estimate a relative height model of the image scene. In the final stage, features from the pan-sharpened data, along with this height data, are exploited for building extraction, and a relative 3D building model is generated from the relative height data and the building classification map. Both the shape completeness and correctness factors for the two subsets of imagery are above 88%. The performance improvement due to the incorporation of the height data is statistically significant over the baseline performance of 83% obtained with an SVM-based binary classification alone.

Some of the key parameters of the hybrid approach include: 1) the threshold for separating shadow pixels from non-shadow pixels, whose value may depend on the type of pan-sharpening method used; 2) the threshold for vegetation detection from NDVI data, which has to be decided from the histogram of the NDVI values; 3) the block size in the template matching, whose selection influences efficient omission of non-building blocks without losing any building blocks; and 4) the minimum height threshold, which separates buildings from non-buildings. We will study how to automatically optimize these parameters in future work.

Other future work may concern the following issues. In this study, only three combinations of the four multi-angular images were used; it might be possible to improve the height estimation by fusing the results from other combinations, in particular to reduce false height estimates due to occlusions. It is also possible to use segmentation algorithms, such as the watershed transformation, to reduce false positives by identifying building boundaries. By using the metadata for the multi-angular imagery, such as the cross-track and off-nadir view angles [44], we will be able to estimate true building heights. Another issue with this method is that the edges of the detected building shapes are not as smooth as the original shapes. Reducing the pixel block size in the feature extraction process can improve the shapes, although it may incur additional false alarms. Another possible direction for shape improvement is the application of ACM (active contour model) or curve evolution techniques. We will also explore state-of-the-art template matching techniques [47], [48] for reducing the false positives in the screened image.
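Several of the key parameters listed in the summary (the shadow threshold, the NDVI vegetation threshold, and the minimum height threshold) are histogram-driven, so one plausible route to automating them is a maximum between-class-variance (Otsu-style) selection rule; the sketch below illustrates that idea under this assumption, and the paper's own MBCV procedure may differ in its details.

```python
import numpy as np

def otsu_threshold(values, nbins=256):
    """Pick the cut that maximizes the between-class variance of a 1-D
    histogram (e.g., NDVI values, log band products, or relative
    heights)."""
    hist, edges = np.histogram(values, bins=nbins)
    p = hist.astype(float) / hist.sum()          # bin probabilities
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(p)                            # class-0 weight per cut
    mu = np.cumsum(p * centers)                  # cumulative mean
    mu_t = mu[-1]                                # global mean
    w1 = 1.0 - w0
    valid = (w0 > 0) & (w1 > 0)
    bcv = np.zeros(nbins)
    # between-class variance: (mu_t*w0 - mu)^2 / (w0*w1)
    bcv[valid] = (mu_t * w0[valid] - mu[valid]) ** 2 / (w0[valid] * w1[valid])
    return centers[np.argmax(bcv)]
```

Applied to a clearly bimodal histogram, such as NDVI over a scene with both vegetated and built-up areas, the returned cut falls between the two modes, which is the behavior the manual histogram inspection described above relies on.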

ACKNOWLEDGMENT

The multi-angular images were provided by DigitalGlobe through the 2011 Data Fusion Contest, organized by the Data Fusion Technical Committee of the IEEE Geoscience and Remote Sensing Society.

REFERENCES

[1] D. Tuia and G. Camps-Valls, "Urban image classification with semi-supervised multi-scale cluster kernels," IEEE J. Selected Topics in Applied Earth Observations and Remote Sensing, vol. 4, no. 1, pp. 65–74, Mar. 2011.

[2] M. Izadi and P. Saeedi, "Automatic building detection in aerial images using a hierarchical feature based image segmentation," in Proc. 20th Int. Conf. Pattern Recognition, 2010, pp. 472–475.

[3] L. B. Theng, "Automatic building extraction from satellite imagery," Eng. Lett., vol. 13, no. 5, Nov. 2006.

[4] U. Weidner and W. Forstner, "Towards automatic building reconstruction from high resolution digital elevation models," ISPRS J. Photogramm. Remote Sens., vol. 50, no. 4, pp. 38–49, 1995.

[5] S. Ahmadi, M. J. V. Zoej, H. Ebadi, H. A. Moghaddam, and A. Mohammadzadeh, "Automatic urban building boundary extraction from high resolution aerial images using an innovative model of active contours," Int. J. Appl. Earth Obs., vol. 12, pp. 150–157, Feb. 2010.

[6] M. Kabolizade, H. Ebadi, and S. Ahmadi, "An improved snake model for automatic extraction of buildings from urban aerial images and LiDAR data," Computers, Environment and Urban Systems, vol. 34, pp. 435–441, Apr. 2010.

[7] M. Ortner, X. Descombes, and J. Zerubia, "Building outline extraction from digital elevation models using marked point processes," Int. J. Comput. Vision, vol. 72, no. 2, pp. 107–132, 2007.

[8] O. Tournaire, M. Bredif, D. Boldo, and M. Durupt, "An efficient stochastic approach for building footprint extraction from digital elevation models," ISPRS J. Photogramm., vol. 65, pp. 317–327, Mar. 2010.

[9] M. Awrangjeb, M. Ravabaksh, and C. S. Fraser, "Automatic detection of residential buildings using LIDAR data and multispectral imagery," ISPRS J. Photogramm., vol. 65, pp. 457–467, Jul. 2010.

[10] U. Soergel, E. Michaelsen, A. Thiele, E. Cadario, and U. Thoennessen, "Stereo analysis of high-resolution SAR images for building height estimation in cases of orthogonal aspect directions," ISPRS J. Photogramm., vol. 64, pp. 490–500, Feb. 2009.

[11] E. Michaelsen, U. Stilla, U. Soergel, and L. Doktorski, "Extraction of building polygons from SAR images: Grouping and decision-level in the GESTALT system," Pattern Recogn. Lett., vol. 31, pp. 1071–1076, Oct. 2009.

[12] D. Brunner, G. Lemoine, L. Bruzzone, and H. Greidanus, "Building height retrieval from VHR SAR imagery based on an iterative simulation and matching technique," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 3, pp. 1487–1504, Mar. 2010.

[13] R. Guida, A. Iodice, and D. Riccio, "Height retrieval of isolated buildings from single high-resolution SAR images," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 7, pp. 2967–2979, Jul. 2010.

[14] A. Alobeid, K. Jacobsen, and C. Heipke, "Building height estimation in urban areas from very high resolution satellite stereo images," Proc. ISPRS, vol. 38, 2009.

[15] Z. Lari and H. Ebadi, "Automated Building Extraction from High-Resolution Satellite Imagery using Spectral and Structural Information Based on Artificial Neural Networks," University of Hannover, Germany, 2011 [Online]. Available: www.ipi.uni-hannover.de/fileadmin/institut/pdf/Lari_Ebadi.pdf

[16] H. T. Shandiz, S. M. Mirhassani, and B. Yousefi, "Hierarchical method for building extraction in urban areas images using unsharp masking (USM) and Bayesian classifier," in Proc. 15th Int. Conf. Systems, Signals and Image Processing, 2008, pp. 193–196.

[17] D. K. San and M. Turker, "Building extraction from high resolution satellite images using Hough transform," in Int. Archives of Photogrammetry, Remote Sensing and Spatial Information Science, Kyoto, Japan, 2010, vol. XXXVIII, Part 8.

[18] P. M. Dare, "Shadow analysis in high-resolution satellite imagery of urban areas," Photogramm. Eng. Remote Sens., vol. 71, no. 2, pp. 169–177, 2005.

[19] J.-Y. Rau, N.-Y. Chen, and L.-C. Chen, "True orthophoto generation of built-up areas using multi-view images," Photogramm. Eng. Remote Sens., vol. 68, no. 6, pp. 581–588, 2002.

[20] T. F. Furtuna, "Dynamic programming algorithms in speech recognition," Informatica Economica, vol. 2, no. 46, pp. 94–99, 2008.

[21] T. Krauss, P. Reinartz, M. Lehner, M. Schroeder, and U. Stilla, "DEM generation from very high resolution stereo satellite data in urban areas using dynamic programming," in Proc. ISPRS Hannover Workshop on High-Resolution Earth Imaging for Geospatial Information, Hannover, Germany, May 2005, pp. 17–20.

[22] C. A. Laben and B. V. Brower, "Process for enhancing the spatial resolution of multispectral imagery using pan-sharpening," U.S. Patent 6,011,875, Washington, DC, Jan. 4, 2000.

[23] O. Demirkaya, M. H. Asyali, and P. K. Sahoo, Image Processing with MATLAB: Applications in Medicine and Biology. London, U.K.: CRC Press, 2009.


[24] T. Kurita, N. Otsu, and N. Abdelmalek, "Maximum likelihood thresholding based on population mixture models," Pattern Recognit., vol. 25, pp. 1231–1240, 1992.

[25] A. M. Polidorio, F. C. Flores, C. Franco, N. N. Imai, and A. M. G. Tommaselli, "Enhancement of terrestrial surface features on high spatial resolution multispectral aerial images," in Proc. 23rd Conf. Graphics, Patterns and Images, Aug. 2010, pp. 295–300.

[26] Z. Zhang and F. Chen, "A shadow processing method of high spatial resolution remote sensing image," in Proc. 3rd Int. Congr. Image and Signal Processing, Oct. 2010, vol. 2, pp. 816–820.

[27] D. Cai, M. Li, Z. Bao, Z. Chen, W. Wei, and H. Zhang, "Study on shadow detection method on high resolution remote sensing image based on HIS space transformation and NDVI index," in Proc. 18th Int. Conf. Geoinformatics, Jun. 2010, pp. 1–4.

[28] F. Lamberti, A. Sanna, and G. Paravati, "Improving robustness of infrared target tracking algorithms based on template matching," IEEE Trans. Aerosp. Electron. Syst., vol. 47, no. 2, pp. 1467–1480, Apr. 2011.

[29] G. Zhu, S. Zhang, X. Chen, and C. Wang, "Efficient illumination insensitive object tracking by normalized gradient matching," IEEE Signal Process. Lett., vol. 14, no. 12, pp. 944–947, Dec. 2007.

[30] Z. Lin and L. S. Davis, "Shape-based human detection and segmentation via hierarchical part-template matching," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 4, pp. 604–618, Apr. 2010.

[31] H. Hashiba, K. Kameda, S. Tanaka, and T. Sugimura, "Digital roof model (DRM) using high resolution satellite image and its application for 3D mapping of city region," in Proc. IEEE Int. Geoscience and Remote Sensing Symp., Jul. 2003, vol. 3, pp. 1990–1992.

[32] H. V. Sorensen, C. S. Burrus, and M. T. Heideman, Fast Fourier Transform Database. Boston, MA: PWS Publishing, 1995.

[33] K. Briechle and U. Hanebeck, "Template matching using fast normalized cross correlation," in Proc. SPIE 4387, Orlando, FL, Apr. 2001, vol. 95.

[34] D. Aboutajdine and F. Essannouni, "Fast block matching algorithms using frequency domain," in Proc. Int. Conf. Multimedia Computing and Systems, Jul. 2011, pp. 1–6.

[35] D. J. Kroon, "Fast/Robust Template Matching," 2011 [Online]. Available: http://www.mathworks.com/matlabcentral/fileexchange/24925-fastrobust-template-matching

[36] X. Meng, L. Wang, and N. Currit, "Morphology based building detection from airborne lidar data," Photogramm. Eng. Remote Sens., vol. 75, no. 4, pp. 437–442, Apr. 2009.

[37] X. Jin and C. H. Davis, "Automated building extraction from high-resolution satellite imagery in urban areas using structural, contextual, and spectral information," EURASIP J. Appl. Signal Process., vol. 14, pp. 2196–2206, 2005.

[38] J. R. Jensen, "Thematic information extraction: Pattern recognition," in Introductory Digital Image Processing: A Remote Sensing Perspective. New York: Prentice Hall, 2005.

[39] C. W. Hsu and C. J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, Mar. 2002.

[40] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.

[41] C. Cortes and V. Vapnik, "Support-vector network," Mach. Learn., vol. 20, pp. 273–297, 1995.

[42] B. Gokaraju, S. S. Durbha, R. L. King, and N. H. Younan, "A machine learning based spatio-temporal data mining approach for detection of harmful algal blooms in the Gulf of Mexico," IEEE J. Selected Topics in Applied Earth Observations and Remote Sensing, vol. 4, no. 3, pp. 710–720, 2011.

[43] U. H.-G. Kressel, "Pairwise classification and support vector machines," in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1999, pp. 255–268.

[44] J. K. N. Da Costa, "Geometric Quality Testing of the WorldView-2 Image Data Acquired Over the JRC Maussane Test Site using ERDAS LPS, PCI Geomatics and Keystone Digital Photogrammetry Software Packages – Initial Findings," Publications Office of the European Union, 2010 [Online]. Available: http://publications.jrc.ec.europa.eu/repository/handle/111111111/15036

[45] G. M. Foody, "Thematic map comparison: Evaluating the statistical significance of differences in classification accuracy," Photogramm. Eng. Remote Sens., vol. 70, no. 5, pp. 627–633, May 2004.

[46] N. Longbotham, C. Bleiler, C. Chaapel, C. Padwick, W. Emery, and F. Pacifici, "Spatial classification of WorldView-2 multi-angle sequence," in Proc. Joint Urban Remote Sensing Event, Munich, Germany, Apr. 2011, pp. 105–108.

[47] H. Schweitzer, R. Deng, and R. F. Anderson, "A dual-bound algorithm for very fast and exact template matching," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 3, pp. 459–470, Mar. 2011.

[48] F. Essannouni and D. Aboutajdine, "Fast frequency template matching using higher order statistics," IEEE Trans. Image Process., vol. 19, no. 3, pp. 826–830, Mar. 2010.

Anish Turlapaty received the B.Tech. degree in electronics and communication engineering from Nagarjuna University, India, in 2002, the M.Sc. degree in radio astronomy and space sciences from Chalmers University, Sweden, in 2006, and the Ph.D. degree in electrical engineering from Mississippi State University, Mississippi State, MS, in 2010.

He is a postdoctoral associate in the Electrical and Computer Engineering Department of Mississippi State University. He is currently developing pattern recognition and digital signal processing based methods for detection of sub-surface radioactive waste using electromagnetic induction spectroscopy and gamma ray spectrometry. He received graduate student of the year awards for outstanding research in 2010 from both the Mississippi State University Centers and Institutes and the Geosystems Research Institute at Mississippi State University. He has published three peer-reviewed journal papers and presented eight papers and posters at international conferences and symposia. He is also an active reviewer for several international journals related to signal processing.

Balakrishna Gokaraju received the B.Tech. degree in electronics and communication engineering from J.N.T.U. Hyderabad, India, in 2003, and the M.Tech. degree in remote sensing from Birla Institute of Technology, Ranchi, India, in 2006. He received the Ph.D. degree in electrical and computer engineering from Mississippi State University, Mississippi State, MS, in 2010.

He is currently employed as a Postdoctoral Associate in the Geosystems Research Institute at Mississippi State University's High Performance Computation Collaboratory. His research interests include, but are not limited to, pattern recognition, machine learning, spatio-temporal data mining, and adaptive signal processing. His research experience includes the development of algorithms for automatic target detection, classification, and prediction in multispectral, hyperspectral, and radar remote sensing, as well as object-oriented approaches that exploit very high-resolution satellite sensor data for urban remote sensing studies. He has also investigated high performance computing with hybrid CPU-GPU architectures for scientific applications, and has published in many peer-reviewed journals and international conferences.

Dr. Gokaraju is an active reviewer for various journals and conferences, including the Journal of Applied Geography (Elsevier), the International Journal of Image and Data Fusion (Taylor and Francis), and the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS) (IEEE Geoscience and Remote Sensing Society).

Qian Du (S’98–M’00–SM’05) received the Ph.D. degree in electrical engineering from the University of Maryland Baltimore County in 2000.

She was with the Department of Electrical Engineering and Computer Science, Texas A&M University, Kingsville, from 2000 to 2004. She joined the Department of Electrical and Computer Engineering at Mississippi State University in Fall 2004, where she is currently an Associate Professor. Her research interests include hyperspectral remote sensing image analysis, pattern classification, data compression, and neural networks.

Dr. Du currently serves as Co-Chair of the Data Fusion Technical Committee of the IEEE Geoscience and Remote Sensing Society. She also serves as Associate Editor for the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS). She received the 2010 Best Reviewer Award from the IEEE Geoscience and Remote Sensing Society. She is the General Chair of the 4th IEEE Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS). She is a member of SPIE, ASPRS, and ASEE.

Nicolas H. Younan (SM’99) received the B.S. and M.S. degrees from Mississippi State University in 1982 and 1984, respectively, and the Ph.D. degree from Ohio University in 1988.

He is currently the Department Head and James Worth Bagley Chair of Electrical and Computer Engineering at Mississippi State University. His research interests include signal processing and pattern recognition. He has been involved in the development of advanced image processing and pattern recognition algorithms for remote sensing applications, image/data fusion, feature extraction and classification, automatic target recognition/identification, and image information mining.

Dr. Younan has published over 150 papers in refereed journals, conference proceedings, and book chapters. He served as General Chair and Editor for the 4th IASTED International Conference on Signal and Image Processing, Co-Editor for the 3rd International Workshop on the Analysis of Multi-Temporal Remote Sensing Images, Guest Editor for Pattern Recognition Letters, and Associate Editor for JSTARS. He is a senior member of IEEE and a member of the IEEE Geoscience and Remote Sensing Society, serving on two technical committees: Data Fusion, and Data Archive and Distribution.

James V. Aanstoos (SM’99) received the B.S. and M.E.E. degrees in electrical and computer engineering from Rice University, Houston, TX, in 1977 and 1979, respectively. He received the Ph.D. degree in atmospheric science from Purdue University, West Lafayette, IN, in 1996.

He is currently an Associate Research Professor at the Geosystems Research Institute (GRI) at Mississippi State University and an adjunct in the Electrical and Computer Engineering Department. His current responsibilities include conducting and managing research on remotely sensed image and radar analysis, and teaching in the area of remote sensing. Current research activities include the application of synthetic aperture radar and multispectral imagery to levee vulnerability assessment. Other recent research activities have focused on multispectral land cover classification and accuracy assessment, terrain database visualization, and the development of a multispectral sensor simulator. He has published in several peer-reviewed journals and conference proceedings.

Dr. Aanstoos has been an active member of professional societies such as IEEE, Sigma Xi, and the American Geophysical Union. He served as an Executive Committee member and Treasurer for Applied Imagery Pattern Recognition (AIPR) from 1995 to 1999, General Chair of AIPR 2001–2002, Program Chair of AIPR 2000, and Secretary from 2003 to 2011. He served on the technical advisory committee for the North Carolina statewide land cover mapping project in 1995.