spacenet mvoi: a multi-view overhead imagery dataset · shermeyer1, varun kumar3, and hanlin tang3...

SpaceNet MVOI a Multi-View Overhead Imagery Dataset

Nicholas Weir1 David Lindenbaum2 Alexei Bastidas3 Adam Van Etten1 Sean McPherson3 JacobShermeyer1 Varun Kumar3 and Hanlin Tang3

1In-Q-Tel CosmiQ Works [nweir avanetten jshermeyer]iqtorg2Accenture Federal Services davidlindenbaumaccenturefederalcom

3Intel AI Lab [alexeiabastidas seanmcpherson varunvkumar hanlintang]intelcom

Abstract

Detection and segmentation of objects in overheard im-agery is a challenging task The variable density ran-dom orientation small size and instance-to-instance het-erogeneity of objects in overhead imagery calls for ap-proaches distinct from existing models designed for naturalscene datasets Though new overhead imagery datasets arebeing developed they almost universally comprise a singleview taken from directly overhead (ldquoat nadirrdquo) failing toaddress a critical variable look angle By contrast viewsvary in real-world overhead imagery particularly in dy-namic scenarios such as natural disasters where first looksare often over 40 off-nadir This represents an importantchallenge to computer vision methods as changing view an-gle adds distortions alters resolution and changes light-ing At present the impact of these perturbations for algo-rithmic detection and segmentation of objects is untestedTo address this problem we present an open source Multi-View Overhead Imagery dataset termed SpaceNet MVOIwith 27 unique looks from a broad range of viewing angles(minus325 to 540) Each of these images cover the same665 km2 geographic extent and are annotated with 126747building footprint labels enabling direct assessment of theimpact of viewpoint perturbation on model performanceWe benchmark multiple leading segmentation and objectdetection models on (1) building detection (2) general-ization to unseen viewing angles and resolutions and (3)sensitivity of building footprint extraction to changes in res-olution We find that state of the art segmentation and objectdetection models struggle to identify buildings in off-nadirimagery and generalize poorly to unseen views present-ing an important benchmark to explore the broadly relevantchallenge of detecting small heterogeneous target objectsin visually dynamic contexts

1 Introduction

Recent years have seen increasing use of convolutionalneural networks to analyze overhead imagery collectedby aerial vehicles or space-based sensors for applicationsranging from agriculture [18] to surveillance [39 32] toland type classification [3] Segmentation and object de-tection of overhead imagery data requires identifying smallvisually heterogeneous objects (eg cars and buildings)with varying orientation and density in images a task ill-addressed by existing models developed for identification ofcomparatively larger and lower-abundance objects in natu-ral scene images The density and visual appearance of tar-get objects change dramatically as look angle geographiclocation time of day and seasonality vary further compli-cating the problem Addressing these challenges will pro-vide broadly useful insights for the computer vision com-munity as a whole for example how to build segmentationmodels to identify low-information objects in dense con-texts

Though public overhead imagery datasets explore geo-graphic and sensor homogeneity [8 12 22 34 19] theygenerally comprise a single view of the imaged location(s)taken nearly directly overhead (ldquoat nadirrdquo) Nadir imageryis not representative of collections during disaster responseor other urgent situations for example the first public high-resolution cloud-free image of San Juan Puerto Rico fol-lowing Hurricane Maria was taken at 519 ldquooff-nadirrdquo iea 519 angle between the nadir point directly underneaththe satellite and the center of the imaged scene [10] Thedisparity between looks in public training data and rele-vant use cases hinders development of models applicableto real-world problems More generally satellite and droneimages rarely capture identical looks at objects in differentcontexts or even when repeatedly imaging the same geog-raphy Furthermore no existing datasets or metrics permitassessment of model robustness to different looks prohibit-

arX

iv1

903

1223

9v2

[cs

CV

] 1

5 A

ug 2

019

Urban Industrial Dense Residential Sparse Residential7

(NA

DIR

)L

OO

KA

NG

LE

(BIN

)-3

2(O

FF)

52(V

OFF

)

Figure 1 Sample imagery from SpaceNet MVOI Four of the 2222 geographically unique image chips in the dataset areshown (columns) with three of the 27 views of that chip (rows) one from each angle bin Negative look angle correspondsto South-facing views whereas positive look angles correspond to North-facing views (Figure 2) Chips are down-sampledfrom 900times 900 pixel high-resolution images In addition to the RGB images shown the dataset comprises a high-resolutionpan-chromatic (grayscale) band a high-resolution near-infrared band and a lower-resolution 8-band multispectral image foreach geographic locationview combination The dataset is available at httpsspacenetai under a CC-BY SA 40License

ing evaluation of performance These limitations extend totasks outside of the geospatial domain for example con-volutional neural nets perform inconsistently in many nat-ural scene video frame classification tasks despite minimalpixel-level variation [1] and Xiao et al showed that spatialtransformation of images effectively altering view repre-sents an effective adversarial attack against computer visionmodels [36] Addressing generalization across views bothwithin and outside of the geospatial domain requires twoadvancements 1 A large multi-view dataset with diver-sity in land usage population density and views and 2 Ametric to assess model generalization

To address the limitations detailed above we intro-duce the SpaceNet Multi-View Overhead Imagery (MVOI)dataset which includes 62000 overhead images collected

over Atlanta Georgia USA and the surrounding areas Thedataset comprises 27 distinct looks including both North-and South-facing views taken during a single pass of aMaxar WorldView-2 satellite The looks range from al-most directly overhead (78 off-nadir) to up to 54 off-nadir with the same 665 km2 geographic area covered byeach Alongside the imagery we open sourced an atten-dant 126747 building footprints created by expert labelersTo our knowledge this is the first multi-viewpoint datasetfor overhead imagery with dense object annotations Thedataset covers heterogeneous geographies including highlytreed rural areas suburbs industrial areas and high-densityurban environments resulting in heterogeneous buildingsize density context and appearance (Figure 1) At thesame time the dataset abstracts away many other time-

sensitive variables (eg seasonality) enabling careful as-sessment of the impact of look angle on model training andinference The training imagery and labels and public testimages are available at httpsspacenetai under the CC-BYSA 40 International License

Though an ideal overhead imagery dataset would coverall the variables present in overhead imagery ie look an-gle seasonality geography weather condition sensor andlight conditions creating such a dataset is impossible withexisting imagery To our knowledge the 27 unique looksin SpaceNet MVOI represent one of only two such imagerycollections available in the commercial realm even behindimagery acquisition company paywalls We thus chose tofocus SpaceNet MVOI on providing a diverse set of viewswith varying look angle and direction a variable that isnot represented in any existing overhead imagery datasetSpaceNet MVOI could potentially be combined with exist-ing datasets to train models which generalize across morevariables

We benchmark state-of-the art models on three tasks

1 Building segmentation and detection2 Generalization of segmentation and object detection

models to previously unseen angles3 Consequences of changes in resolution for segmenta-

tion and object detection models

Our benchmarking reveals that state-of-the-art detectorsare challenged by SpaceNet MVOI particularly in viewsleft out during model training Segmentation and objectdetection models struggled to account for displacement ofbuilding footprints occlusion shadows and distortion inhighly off-nadir looks (Figure 3) The challenge of address-ing footprint displacement is of particular interest as it re-quires models not only to learn visual features but to ad-just footprint localization dependent upon the view contextAddressing these challenges is relevant to a number of ap-plications outside of overhead imagery analysis eg au-tonomous vehicle vision

To assess model generalization to new looks we devel-oped a generalization metric G which reports the relativeperformance of models when they are applied to previ-ously unseen looks While specialized models designed foroverhead imagery out-perform general baseline models inbuilding footprint detection we found that models devel-oped for natural image computer vision tasks have better Gscores on views absent during training These observationshighlight the challenges associated with developing robustmodels for multi-view object detection and semantic seg-mentation tasks We therefore expect that developments incomputer vision models for multi-view analysis made us-ing SpaceNet MVOI as well as analysis using our metricG will be broadly relevant for many computer vision tasks

The dataset is available at wwwspacenetai

2 Related WorkObject detection and segmentation is a well-studied

problem for natural scene images but those objects aregenerally much larger and suffer minimally from distor-tions exacerbated in overhead imagery Natural sceneresearch is driven by datasets such as MSCOCO [20]and PASCALVOC [13] but those datasets lack multipleviews of each object PASCAL3D [35] autonomous driv-ing datasets such as KITTI [14] CityScapes [7] existingmulti-view datasets [29 30] and tracking datasets such asMOT2017[24] or OBT [33] contains different views but areconfined to a narrow range of angles lack sufficient het-erogeneity to test generalization between views and are re-stricted to natural scene images Multiple viewpoints arefound in 3D model datasets [5 23] but those are not photo-realistic and lack the occlusion and visual distortion proper-ties encountered with real imagery

Previous datasets for overhead imagery focus on clas-sification [6] bounding box object detection [34 19 25]instance-based segmentation [12] and object tracking [26]tasks None of these datasets comprise multiple images ofthe same field of view from substantially different look an-gles making it difficult to assess model robustness to newviews Within segmentation datasets SpaceNet [12] repre-sents the closest work with dense building and road annota-tions created by the same methodology We summarize thekey characteristics of each dataset in Table 1 Our datasetmatches or exceeds existing datasets in terms of imagerysize and annotation density but critically includes varyinglook direction and angle to better reflect the visual hetero-geneity of real-world imagery

The effect of different views on segmentation or objectdetection in natural scenes has not been thoroughly stud-ied as feature characteristics are relatively preserved evenunder rotation of the object in that context Nonethelesspreliminary studies of classification model performance onvideo frames suggests that minimal pixel-level changes canimpact performance [1] By contrast substantial occlusionand distortion occurs in off-nadir overhead imagery com-plicating segmentation and placement of geospatially accu-rate object footprints as shown in Figure 3A-B Further-more due to the comparatively small size of target objects(eg buildings) in overhead imagery changing view sub-stantially alters their appearance (Figure 3C-D) We expectsimilar challenges to occur when detecting objects in natu-ral scene images at a distance or in crowded views Exist-ing solutions to occlusion are often domain specific [37] orrely on attention mechanisms to identify common elements[40] or landmarks [38] The heterogeneity in building ap-pearance in overhead imagery and the absence of landmarkfeatures to identify them makes their detection an ideal re-search task for developing domain-agnostic models that arerobust to occlusion

Dataset Gigapixels Images Resolution (m) Nadir Angles Objects AnnotationSpaceNet [12 8] 103 24586 031 On-Nadir 302701 PolygonsDOTA [34] 449 2806 Google Earth On-Nadir 188282 Oriented Bbox3K Vehicle Detection [21] NA 20 020 Aerial 14235 Oriented BboxUCAS-AOD [41] NA 1510 Google Earth On-Nadir 3651 Oriented BboxNWPU VHR-10 [4] NA 800 Google Earth On-Nadir 3651 BboxMVS [2] 111 50 031-058 [53 433] 0 NoneFMoW [6] 10840 523846 031-160 [022 575] 132716 ClassificationxView [19] 560 1400 031 On-Nadir 1000000 BboxSpaceNet MVOI (Ours) 502 60000 046-167 [-325 +540] 126747 PolygonsPascalVOC [13] - 21503 - - 62199 BboxMSCOCO [20] - 123287 - - 886266 BboxImageNet [9] - 349319 - - 478806 Bbox

Table 1 Comparison with other computer vision and overhead imagery datasets Our dataset has a similar scale asmodern computer vision datasets but to our knowledge is the first multi-view overhead imagery dataset designed for seg-mentation and object detection tasks Google Earth imagery is a mosaic from a variety of aerial and satellite sources andranges from 15 cm to 12 m resolution [15]

Figure 2 Collect views Location of collection points dur-ing the WorldView-2 satellite pass over Atlanta GA USA

3 Dataset Creation

SpaceNet MVOI contains images of Atlanta GAUSA and surrounding geography collected by MaxarrsquosWorldView-2 Satellite on December 22 2009 [22] Thesatellite collected 27 distinct views of the same 665 km2

ground area during a single pass over a 5 minute span Thisproduced 27 views with look angles (angular distance be-tween the nadir point directly underneath the satellite andthe center of the scene) from 78 to 54 off-nadir and witha target azimuth angle (compass direction of image acquisi-tion) of 17 to 1828 from true North (see Figure 2) Seethe Supplementary Material and Tables S1 and S2 for fur-ther details regarding the collections The 27 views in anarrow temporal band provide a dense set of visually dis-tinct perspectives of static objects (buildings roads treesutilities etc) while limiting complicating factors commonto remote sensing datasets such as changes in cloud coversun angle or land-use change The imaged area is geo-

Challenges in off-nadir imagery

Foot

prin

toff

set

and

occl

usio

n

(a) 7 degrees (b) 53 degrees

Shad

ows

(c) 30 degrees (d) -32 degrees

Figure 3 Challenges with off-nadir look angles Thoughgeospatially accurate building footprints (blue) perfectlymatch building roofs at nadir (A) this is not the case off-nadir (B) and many buildings are obscured by skyscrap-ers (C-D) Visibility of some buildings changes at differentlook angles due to variation in reflected sunlight

graphically diverse including urban areas industrial zonesforested suburbs and undeveloped areas (Figure 1)

31 Preprocessing

Multi-view satellite imagery datasets are distinct fromrelated natural image datasets in several interesting waysFirst as look angle increases in satellite imagery the nativeresolution of the image decreases because greater distortion

Figure 4 Dataset statistics Distribution of (A) buildingfootprint areas and (B) number of objects per 450mtimes450mgeographic tile in the dataset

is required to project the image onto a flat grid (Figure 1)Second each view contains images with multiple spectralbands For the purposes our baselines we used 3-channelimages (RGB red green blue) but also examined the con-tributions of the near-infrared (NIR) channel (see Supple-mentary Material) These images were enhanced with a sep-arate higher resolution panchromatic (grayscale) channel todouble the original resolution of the multispectral imagery(ie ldquopan-sharpenedrdquo) The entire dataset was tiled into900pxtimes 900px tiles and resampled to simulate a consistentresolution across all viewing angles of 05mtimes05m groundsample distance The dataset also includes lower-resolution8-band multispectral imagery with additional color chan-nels as well as panchromatic images both of which arecommon overhead imagery data types

The 16-bit pan-sharpened RGB-NIR pixel intensitieswere truncated at 3000 and then rescaled to an 8-bit rangebefore normalizing to [0 1] We also trained models directlyusing Z-score normalized 16 bit images with no appreciabledifference in the results

32 Annotations

We undertook professional labeling to produce high-quality annotations An expert geospatial team exhaustivelylabeled building footprints across the imaged area usingthe most on-nadir image (78 off-nadir) Importantly thebuilding footprint polygons represent geospatially accurateground truth and therefore are shared across all views Forstructures occluded by trees only the visible portion waslabeled Finally one independent validator and one remotesensing expert evaluated the quality of each label

33 Dataset statistics

Our dataset labels comprise a broad distribution of build-ing sizes as shown in Figure 4A Compared to natural im-age datasets our dataset more heavily emphasizes small ob-jects with the majority of objects less than 700 pixels inarea orsim 25 pixels across By contrast objects in the PAS-CALVOC [13] or MSCOCO [20] datasets usually comprise50-300 pixels along the major axis [34]

Task Baseline modelsSemantic Segmentation TernausNet [17] U-NET [27]Instance Segmentation Mask R-CNN [16]Object Detection Mask R-CNN [16] YOLT [11]

Table 2 Benchmark model selections for dataset baselinesTernausNet and YOLT are overhead imagery-specific mod-els whereas Mask R-CNN and U-Net are popular naturalscene analysis models

An additional challenge presented by this dataset con-sistent with many real-world computer vision tasks is theheterogeneity in target object density (Figure 4B) Imagescontained between zero and 300 footprints with substantialcoverage throughout that range This variability presentsa challenge to object detection algorithms which often re-quire estimation of the number of features per image [16]Segmentation and object detection of dense or variable den-sity objects is challenging making this an ideal dataset totest the limits of algorithmsrsquo performance

4 Building Detection Experiments41 Dataset preparation for analysis

We split the training and test sets 8020 by randomly se-lecting geographic locations and including all views for thatlocation in one split ensuring that each type of geographywas represented in both splits We group each angle intoone of three categories Nadir (NADIR) θ le 25 Off-nadir (OFF) 25 lt θ lt 40 and Very off-nadir (VOFF)θ ge 40 In all experiments we trained baselines using allviewing angles (ALL) or one of the three subsets Thesetrained models were then evaluated on the test set of eachof the 27 viewing angles individually

42 Models

We measured several state of the art baselines for se-mantic or instance segmentation and object detection (Table2) Where possible we selected overhead imagery-specificmodels as well as models for natural scenes to compare theirperformance Object detection baselines were trained us-ing rectangular boundaries extracted from the building foot-prints To fairly compare with semantic segmentation stud-ies the resulting bounding boxes were compared against theground truth building polygons for scoring (see Metrics)

43 Segmentation Loss

Due to the class imbalance of the training data ndash only95 of the pixels in the training set correspond to buildingsndash segmentation models trained with binary cross-entropy(BCE) loss failed to identify building pixels a problem ob-served previously for overhead imagery segmentation mod-els [31] For the semantic segmentation models we there-

F1

Task Model NADIR OFF VOFF AvgSeg TernausNet 062 043 022 043Seg U-Net 039 027 008 024Seg Mask R-CNN 047 034 007 029Det Mask R-CNN 040 030 007 025Det YOLT 049 037 020 036

Table 3 Overall task difficulty As a measure of over-all task difficulty the performance (F1 score) is assessedfor the baseline models trained on all angles and tested onthe three different viewing angle bins nadir (NADIR) off-nadir (OFF) and very off-nadir (VOFF) Avg is the linearmean of the three bins Seg segmentation Det object de-tection

fore utilized a hybrid loss function that combines the binarycross entropy loss and intersection over union (IoU) losswith a weight factor α [31]

L = αLBCE + (1minus α)LIoU (1)

The details of model training and evaluation including aug-mentation optimizers and evaluation schemes can be foundin the Supplementary Material

44 Metrics

We measured performance using the building IoU-F1

score defined in Van Etten et al [12] Briefly building foot-print polygons were extracted from segmentation masks (ortaken directly from object detection bounding box outputs)and compared to ground truth polygons Predictions werelabeled True Positive if they had an IoU with a ground truthpolygon above 05 and all other predictions were deemedFalse Positives Using these statistics and the number of un-detected ground truth polygons (False Negatives) we calcu-lated the precision P and recall R of the model predictionsin aggregate We then report the F1 score as

F1 =2times P timesRP +R

(2)

F1 score was calculated within each angle bin (NADIROFF or VOFF) and then averaged for an aggregate score

45 Results

The state-of-the-art segmentation and object detectionmodels we measured were challenged by this task Asshown in Table 3 TernausNet trained on all angles achievesF1 = 062 on the nadir angles which is on par withprevious building segmentation results and competitions[12 8] However performance drops significantly for off-nadir (F1 = 043) and very off-nadir (F1 = 022) imagesOther models display a similar degradation in performanceExample results are shown in Figure 5

Training ResolutionOriginal Equalized

Test Angles (046-167 m) 167 mNADIR 062 059OFF 043 041VOFF 022 022Summary 043 041

Table 4 TernausNet model trained on different resolu-tion imagery Building footprint extraction performancefor a TernausNet model trained on ALL original-resolutionimagery (046 m ground sample distance (GSD) for 78

to 167 m GSD at 54) left compared to the same modeltrained and tested on ALL imagery where every view isdown-sampled to 167 m GSD (right) Rows display per-formance (F1 score) on different angle bins The originalresolution imagery represents the same data as in Table 3Training set imagery resolution had only negligible impacton model performance

Directional asymmetry Figure 6 illustrates perfor-mance per angle for both segmentation and object detectionmodels Note that models trained on positive (north-facing)angles such as Positive OFF (Red) fair particularly poorlywhen tested on negative (south-facing) angles This may bedue to the smaller dataset size but we hypothesize that thevery different lighting conditions and shadows make somedirections intrinsically more difficult (Figure 3C-D) Thisobservation reinforces that developing models and datasetsthat can handle the diversity of conditions seen in overheadimagery in the wild remains an important challenge

Model architectures Interestingly segmentation mod-els designed specifically for overhead imagery (TernausNetand YOLT) significantly outperform general-purpose seg-mentation models for computer vision (U-Net Mask R-CNN) These experiments demonstrate the value of spe-cializing computer vision models to the target domain ofoverhead imagery which has different visual object den-sity size and orientation characteristics

Effects of resolution OFF and VOFF images havelower base resolutions potentially confounding analyses ofeffects due exclusively to look angle To test whether reso-lution might explain the observed performance drop we rana control study with normalized resolution We trained Ter-nausNet on images from all look angles artificially reducedto the same resolution of 167m the lowest base resolutionfrom the dataset This model showed negligible change inperformance versus the model trained on original resolutiondata (original resolution F1 = 043 resolution equalizedF1 = 041) (Table 4) This experiment indicates that view-ing angle-specific effects not resolution drive the declinein segmentation performance as viewing angle changes

Generalization to unseen angles Beyond exploring

Image Mask R-CNN TernausNet YOLT

10(N

AD

IR)

LO

OK

AN

GL

E(B

IN)

-29

(OFF

)53

(VO

FF)

Figure 5 Sample imagery (left) with ground truth building footprints and Mask R-CNN bounding boxes (middle left)TernausNet segmentation masks (middle right) and YOLT bounding boxes (right) Ground truth masks (light blue) areshown under Mask R-CNN and TernausNet predictions (yellow) YOLT bounding boxes shown in blue Sign of the lookangle represents look direction (negative=south-facing positive=north-facing) Predictions from models trained on on allangles (see Table 3)

Figure 6 Performance by look angle for various training subsets TernausNet (left) Mask R-CNN (middle) and YOLT(right) models trained on ALL NADIR OFF or VOFF were evaluated in the building detection task and F1 scores aredisplayed for each evaluation look angle Imagery acquired facing South is represented as a negative number whereaslooks facing North are represented by a positive angle value Additionally TernausNet models trained only on North-facingOFF imagery (positive OFF) and South-facing OFF imagery (negative OFF) were evaluated on each angle to explore theimportance of look direction

performance of models trained with many views we alsoexplored how effectively models could identify buildingfootprints on look angles absent during training We foundthat the TernausNet model trained only on NADIR per-formed worse on evaluation images from OFF (032) than

models trained directly on OFF (044) as shown in Table 5Similar trends are observed for object detection (Figure 6)To measure performance on unseen angles we introduce ageneralization score G which measures the performance ofa model trained on X and tested on Y normalized by the

Training AnglesTest Angles All NADIR OFF VOFFNADIR 062 059 023 013OFF 043 032 044 023VOFF 022 004 013 027Summary 043 032 026 021

Table 5 TernausNet model tested on unseen angles Per-formance (F1 score) of the TernausNet model when trainedon one angle bin (columns) and then tested on each of thethree bins (rows) The model trained on NADIR performsworse on unseen OFF and VOFF views compared to modelstrained directly on imagery from those views

performance of a model trained on Y and tested on Y

GY =1

N

sumX

F1(train = X test = Y )

F1(train = Y test = Y )(3)

This metric measures relative performance across viewingangles normalized by the task difficulty of the test set Wemeasured G for all our modeldataset combinations as re-ported in Table 6 Even though the Mask R-CNN modelhas worse overall performance the model achieved a highergeneralization score (G = 078) compared to TernausNet(G = 042) as its performance did not decline as rapidlywhen look angle increased Overall however generaliza-tion scores to unseen angles were low highlighting the im-portance of future study in this challenging task

46 Effects of geography

We broke down geographic tiles into Industrial SparseResidential Dense Residential and Urban bins and exam-ined how look angle influenced performance in each Weobserved greater effects on residential areas than other types(Table S3) Testing models trained on MVOI with unseencities[12] showed almost no generalization (Table S4) Ad-ditional datasets with more diverse geographies are needed

5 ConclusionWe present a new dataset that is critical for extending ob-

ject detection to real-world applications but also presentschallenges to existing computer vision algorithms Ourbenchmark found that segmenting building footprints fromvery off-nadir views was exceedingly difficult even forstate-of-the-art segmentation and object detection modelstuned specifically for overhead imagery (Table 3) The rel-atively low F1 scores for these tasks (maximum VOFF F1

score of 022) emphasize the amount of improvement thatfurther research could enable in this realm

Furthermore on all benchmark tasks we concluded thatmodel generalization to unseen views represents a signifi-cant challenge We quantify the performance degradationfrom nadir (F1 = 062) to very off-nadir (F1 = 022) and

Generalization Score GTask Model NADIR OFF VOFFSegmentation TernausNet 045 043 037Segmentation U-Net 064 040 037Segmentation Mask R-CNN 060 090 084Detection Mask R-CNN 064 092 076Detection YOLT 057 068 044

Table 6 Generalization scores To measure segmentationmodel performance on unseen views we compute a gen-eralization score G (Equation 3) which quantifies perfor-mance on unseen views normalized by task difficulty Eachcolumn corresponds to a model trained on one angle bin

note an asymmetry between performance on well-lit north-facing imagery and south-facing imagery cloaked in shad-ows (Figure 3C-D and Figure 6) We speculate that distor-tions in objects occlusion and variable lighting in off-nadirimagery (Figure 3) as well as the small size of buildings ingeneral (Figure 4) pose an unusual challenge for segmen-tation and object detection of overhead imagery

The off-nadir imagery has a lower resolution than nadirimagery (due to simple geometry) which theoretically com-plicates building extraction for high off-nadir angles How-ever by experimenting with imagery degraded to the samelow 167m resolution we show that resolution has an in-significant impact on performance (Table 4) Rather vari-ations in illumination and viewing angle are the dominantfactors This runs contrary to recent observations [28]which found that object detection models identify small carsand other vehicles better in super-resolved imagery

The generalization score G is low for the highest-performing overhead imagery-specific models in thesetasks (Table 6) suggesting that these models may be over-fitting to view-specific properties This challenge is not spe-cific to overhead imagery for example accounting for dis-tortion of objects due to imagery perspective is an essen-tial component of 3-dimensional scene modeling or rota-tion prediction tasks [23] Taken together this dataset andthe G metric provide an exciting opportunity for future re-search on algorithmic generalization to unseen views

Our aim for future work is to expose problems of inter-est to the larger computer vision community with the help ofoverhead imagery datasets While only one specific appli-cation advances in enabling analysis of overhead imageryin the wild can concurrently solve broader tasks For ex-ample we had anecdotally observed that image translationand domain transfer models failed to convert off-nadir im-ages to nadir images potentially due to the spatial shiftsin the image Exploring these tasks as well as other novelresearch avenues will enable advancement of a variety ofcurrent computer vision challenges

References[1] Aharon Azulay and Yair Weiss Why do deep convolutional

networks generalize so poorly to small image transforma-tions CoRR abs180512177 2018

[2] Marc Bosch Zachary Kurtz Shea Hagstrom and MyronBrown A multiple view stereo benchmark for satellite im-agery In 2016 IEEE Applied Imagery Pattern RecognitionWorkshop (AIPR) pages 1ndash9 Oct 2016

[3] Yushi Chen Xing Zhao and Xiuping Jia SpectralndashSpatialClassification of Hyperspectral Data Based on Deep Be-lief Network IEEE Journal of Selected Topics in AppliedEarth Observations and Remote Sensing 8(6)2381ndash2392July 2015

[4] Gong Cheng Peicheng Zhou and Junwei Han Learningrotation-invariant convolutional neural networks for objectdetection in vhr optical remote sensing images IEEE Trans-actions on Geoscience and Remote Sensing 547405ndash74152016

[5] Sungjoon Choi Qian-Yi Zhou Stephen Miller and VladlenKoltun A large dataset of object scans CoRRabs160202481 2016

[6] Gordon Christie Neil Fendley James Wilson and RyanMukherjee Functional Map of the World In 2018IEEECVF Conference on Computer Vision and PatternRecognition IEEE Jun 2018

[7] Marius Cordts Mohamed Omran Sebastian Ramos TimoRehfeld Markus Enzweiler Rodrigo Benenson UweFranke Stefan Roth and Bernt Schiele The CityscapesDataset for Semantic Urban Scene Understanding In The2009 IEEE Conference on Computer Vision and PatternRecognition (CVPR) June 2016

[8] Ilke Demir Krzysztof Koperski David Lindenbaum GuanPang Jing Huang Saikat Basu Forest Hughes Devis Tuiaand Ramesh Raskar DeepGlobe 2018 A Challenge to Parsethe Earth Through Satellite Images In The 2018 IEEE Con-ference on Computer Vision and Pattern Recognition (CVPR)Workshops June 2018

[9] Jia Deng Wei Dong Richard Socher Li-Jia Li Kai Li andLi Fei-Fei ImageNet A Large-Scale Hierarchical ImageDatabase In The 2009 IEEE Conference on Computer Visionand Pattern Recognition (CVPR) 2009

[10] DigitalGlobe Digitalglobe search and discovery httpsdiscoverdigitalglobecom Accessed 2019-03-19

[11] Adam Van Etten You only look twice Rapid multi-scaleobject detection in satellite imagery CoRR abs1805095122018

[12] Adam Van Etten Dave Lindenbaum and Todd M BacastowSpaceNet A Remote Sensing Dataset and Challenge SeriesCoRR abs180701232 2018

[13] Marc Everingham Luc Van Gool Christopher K IWilliams John Winn and Andrew Zisserman The pascalvisual object classes (voc) challenge International Journalof Computer Vision 88(2)303ndash338 June 2010

[14] Andreas Geiger Philip Lenz and Raquel Urtasun Are weready for autonomous driving the KITTI vision benchmark

suite In Conference on Computer Vision and Pattern Recog-nition (CVPR) 2012

[15] Google Google maps data help httpssupportgooglecommapsdata Accessed 2019-3-19

[16] Kaiming He Georgia Gkioxari Piotr Dollar and Ross Gir-shick Mask R-CNN In The 2017 IEEE International Con-ference on Computer Vision (ICCV) Oct 2017

[17] Vladimir Iglovikov and Alexey Shvets Ternausnet U-netwith VGG11 encoder pre-trained on imagenet for image seg-mentation CoRR abs180105746 2018

[18] FM Lacar Megan Lewis and Iain Grierson Use of hyper-spectral imagery for mapping grape varieties in the BarossaValley South Australia In IGARSS 2001 Scanning thePresent and Resolving the Future Proceedings IEEE 2001International Geoscience and Remote Sensing Symposium(Cat No01CH37217) pages 2875ndash2877 vol6 2001

[19] Darius Lam Richard Kuzma Kevin McGee Samuel Doo-ley Michael Laielli Matthew Klaric Yaroslav Bulatov andBrendan McCord xView Objects in context in overheadimagery CoRR abs180207856 2018

[20] Tsung-Yi Lin Michael Maire Serge Belongie James HaysPietro Perona Deva Ramanan Piotr Dollr and C LawrenceZitnick Microsoft COCO Common Objects in ContextIn 2014 European Conference on Computer Vision (ECCV)Zurich 2014 Oral

[21] Kang Liu and Gellert Mattyus Fast multiclass vehicle detec-tion on aerial images IEEE Geoscience and Remote SensingLetters 121938ndash1942 2015

[22] Nathan Longbotham Chuck Chaapel Laurence BleilerChris Padwick William J Emery and Fabio Pacifici VeryHigh Resolution Multiangle Urban Classification Analy-sis IEEE Transactions on Geoscience and Remote Sensing50(4)1155ndash1170 April 2012

[23] William Lotter Gabriel Kreiman and David D Cox Unsu-pervised learning of visual structure using predictive genera-tive networks CoRR abs151106380 2015

[24] Anton Milan Laura Leal-Taixe Ian D Reid Stefan Rothand Konrad Schindler MOT16 A benchmark for multi-object tracking CoRR abs160300831 2016

[25] T Nathan Mundhenk Goran Konjevod Wesam A Saklaand Kofi Boakye A large contextual dataset for classifi-cation detection and counting of cars with deep learningECCV abs160904453 2016

[26] Alexandre Robicquet Amir Sadeghian Alexandre Alahiand Silvio Savarese Learning social etiquette Human tra-jectory understanding in crowded scenes In Bastian LeibeJiri Matas Nicu Sebe and Max Welling editors The 2016European Conference on Computer Vision (ECCV) 2016

[27] Olaf Ronneberger Philipp Fischer and Thomas Brox U-Net- Convolutional Networks for Biomedical Image Segmenta-tion MICCAI 9351(Chapter 28)234ndash241 2015

[28] Jacob Shermeyer and Adam Van Etten The effects of super-resolution on object detection performance in satellite im-agery CoRR abs181204098 2018

[29] Tomas Simon Hanbyul Joo Iain Matthews and YaserSheikh Hand keypoint detection in single images using mul-tiview bootstrapping In 2017 IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) Jul 2017

[30] Srinath Sridhar Antti Oulasvirta and Christian Theobalt In-teractive markerless articulated hand motion tracking usingrgb and depth data In 2013 IEEE International Conferenceon Computer Vision (ICCV) pages 2456ndash2463 2013

[31] Tao Sun Zehui Chen Wenxiang Yang and Yin WangStacked u-nets with multi-output for road extraction InThe 2018 IEEE Conference on Computer Vision and PatternRecognition (CVPR) Workshops June 2018

[32] Burak Uzkent Aneesh Rangnekar and MJ Hoffman Aerialvehicle tracking by adaptive fusion of hyperspectral likeli-hood maps In 2017 IEEE Conference on Computer Visionand Pattern Recognition (CVPR) Workshops pages 233ndash242 July 2017

[33] Yi Wu Jongwoo Lim and Ming-Hsuan Yang Online objecttracking A benchmark In 2013 IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) pages 2411ndash2418 06 2013

[34] Gui-Song Xia Xiang Bai Zhen Zhu Jian Ding Serge Be-longie Jiebo Luo Mihai Datcu Marcello Pelillo and Liang-pei Zhang DOTA A Large-scale Dataset for Object Detec-tion in Aerial Images 2017 IEEE Conference on ComputerVision and Pattern Recognition Nov 2017

[35] Yu Xiang Roozbeh Mottaghi and Silvio Savarese Be-yond PASCAL A Benchmark for 3D Object Detection inthe Wild In IEEE Winter Conference on Applications ofComputer Vision (WACV) 2014

[36] Chaowei Xiao Jun-Yan Zhu Bo Li Warren He MingyanLiu and Dawn Song Spatially transformed adversarial ex-amples In International Conference on Learning Represen-tations 2018

[37] Shuo Yang Ping Luo Chen-Change Loy and Xiaoou TangFrom facial parts responses to face detection A deep learn-ing approach 2015 IEEE International Conference on Com-puter Vision (ICCV) Dec 2015

[38] Kevan Yuen and Mohan Manubhai Trivedi An occludedstacked hourglass approach to facial landmark localizationand occlusion estimation IEEE Transactions on IntelligentVehicles 2321ndash331 2017

[39] Peter WT Yuen and Mark A Canton Richardson An in-troduction to hyperspectral imaging and its application forsecurity surveillance and target acquisition The ImagingScience Journal 58(5)241ndash253 2010

[40] Shanshan Zhang Jian Yang and Bernt Schiele Occludedpedestrian detection through guided attention in cnns InThe 2018 IEEE Conference on Computer Vision and PatternRecognition (CVPR) June 2018

[41] Haigang Zhu Xiaogang Chen Weiqun Dai Kun Fu QixiangYe and Jianbin Jiao Orientation robust object detection inaerial images using deep convolutional neural network In2015 IEEE International Conference on Image Processing(ICIP) pages 3735ndash3739 2015

SpaceNet MVOI aMulti-View Overhead

Imagery DatasetSupplementary Material

A DatasetA1 Imagery details

The images from our dataset were obtained from Dig-italGlobe with 27 different viewing angles collected overthe same geographical region of Atlanta GA Each viewingangle is characterized as both an off-nadir angle and a targetazimuth We binned each angle into one of three categories(Nadir Off-Nadir and Very Off-Nadir) based on the angle(see Table 8) Collects were also separated into South- orNorth-facing based on the target azimuth angle

The imagery dataset comprises Panchromatic Multi-Spectral and Pan-Sharpened Red-Green-Blue-near IR(RGB-NIR) images The ground resolution of image var-ied depending on the viewing angle and the type of image(Panchromatic Multi-spectral Pan-sharpened) See Table7 for more details All experiments in this study were per-formed using the Pan-Sharpened RGB-NIR image (with theNIR band removed except for the U-Net model)

The imagery was uploaded into the spacenet-datasetAWS S3 bucket which is publicly readable with no costto download Download instructions can be found atwwwspacenetaioff-nadir-building-detection

A2 Dataset breakdown

The imagery described above was split into three folds50 in a training set 25 in a validation set and 25 ina final test set 900 times 900-pixel geographic tiles were ran-domly placed in one of the three categories with all of thelook angles for a given geography assigned to the same sub-set to avoid geographic leakage The full training set andbuilding footprint labels as well as the validation set im-agery were open sourced and the validation set labels and

Image Resolution at 78 Resolution at 54Panchromatic 046mpx 167mpxMulti-spectral 18mpx 70mpxPan-sharpened 046mpx 167mpx

Table 7 Resolution across different image types for twonadir angles

final test imagery and labels were withheld as scoring setsfor public coding challenges

B Model TrainingB1 TernausNet

The TernausNet model was trained without pre-trainedweights roughly as described previously [17] with modifi-cations Firstly only the Pan-sharpened RGB channels wereused for training and were re-scaled to 8-bit 90 rotationsX and Y flips imagery zooming of up to 25 and linearbrightness adjustments of up to 50 were applied randomlyto training images After augmentations a 512 times 512 cropwas randomly selected from within each 900times900 trainingchip with one crop used per chip per training epoch Sec-ondly as described in the Models section of the main texta combination loss function was used with a weight param-eter α = 08 Secondly a variant of Adam incorporatingNesterov momentum [] with default parameters was usedas the optimizer The model was trained for 25-40 epochsand learning rate was decreased 5-fold when validation lossfailed to improve for 5 epochs Model training was haltedwhen validation loss failed to improve for 10 epochs

B2 U-Net

The original U-Net [27] architecture was trained for 30epochs with Pan-Sharpened RGB+NIR 16-bit imagery ona binary segmentation mask with a combination loss as de-scribed in the main text with α = 05 Dropout and batchnormalization were used at each layer with dropout withp = 033 The same augmentation pipeline was used aswith TernausNet An Adam Optimizer [] was used withlearning rate of 00001 was used for training

B3 YOLT

The You Only Look Twice (YOLT) model was trainedas described previously [11] Bounding box training targetswere generated by converting polygon building footprintsinto the minimal un-oriented bounding box that enclosedeach polygon

B4 Mask R-CNN

The Mask R-CNN model with the ResNet50-C4 back-bone was trained as described previously [16] using thesame augmentation pipeline as TernausNet Boundingboxes were created as described above for YOLT

C Geography-specific performanceC1 Distinct geographies within SpaceNet MVOI

We asked how well the TernausNet model trained onSpaceNet MVOI performed both within and outside ofthe dataset First we broke down the test dataset into

Catalog ID Pan-sharpened Resolution Look Angle Target Azimuth Angle Angle Bin Look Direction1030010003D22F00 048 78 1184 Nadir South10300100023BC100 049 83 784 Nadir North1030010003993E00 049 105 1486 Nadir South1030010003CAF100 048 106 576 Nadir North1030010002B7D800 049 139 162 Nadir South10300100039AB000 049 148 43 Nadir North1030010002649200 052 169 1687 Nadir South1030010003C92000 052 193 351 Nadir North1030010003127500 054 213 1747 Nadir South103001000352C200 054 235 307 Nadir North103001000307D800 057 254 1784 Nadir South1030010003472200 058 274 277 Off-Nadir North1030010003315300 061 291 181 Off-Nadir South10300100036D5200 062 31 255 Off-Nadir North103001000392F600 065 325 1828 Off-Nadir South1030010003697400 068 34 238 Off-Nadir North1030010003895500 074 37 226 Off-Nadir North1030010003832800 08 396 215 Off-Nadir North10300100035D1B00 087 42 207 Very Off-Nadir North1030010003CCD700 095 442 20 Very Off-Nadir North1030010003713C00 103 461 195 Very Off-Nadir North10300100033C5200 113 478 19 Very Off-Nadir North1030010003492700 123 493 185 Very Off-Nadir North10300100039E6200 136 509 18 Very Off-Nadir North1030010003BDDC00 148 522 177 Very Off-Nadir North1030010003193D00 163 534 174 Very Off-Nadir North1030010003CD4300 167 54 174 Very Off-Nadir North

Table 8 DigitalGlobe Catalog IDs and the resolution of each image based upon off-nadir angle and target azimuth angle

the four bins represented in main text Figure 1 Indus-trial Sparse Residential Dense Residential and Urban andscored models within those bins (Table 9) We observedslightly worse performance in Industrials areas than else-where at nadir but markedly stronger drops in performancein residential areas as look angle increased

C2 Generalization to unseen geographies

We also explored how models trained on SpaceNetMVOI performed on building footprint extraction from im-

Type NADIR OFF - NADIR VOFF - NADIRIndustrial 051 minus013 minus028Sparse Res 057 minus019 minus037Dense Res 066 minus021 minus041Urban 064 minus013 minus030

Table 9 F1 score for the model trained on all angles andevaluated evaluated on the nadir bins (NADIR) then therelative decrease in F1 for the off-nadir and very off-nadirbins

agery from other geographies in this case the Las Vegasimagery from SpaceNet [12] After normalizing the Las Ve-gas (LV) imagery for consistent pixel intensities and chan-nel order with SpaceNet MVOI we predicted building foot-prints in LV imagery and scored prediction quality as de-scribed in Metrics We also re-trained TernausNet on the LVimagery and examined building footprint extraction qualityon the SpaceNet MVOI test set Strikingly neither modelwas able to identify building footprints in the unseen ge-ographies highlighting that adding novel looks angles doesnot necessarily enable generalization to new geographic ar-eas

Test SetMVOI 7 SN LV

Training Set MVOI ALL 068 001SN LV 000 062

Table 10 Cross-dataset F1 Models trained on MVOI orSpaceNet Las Vegas [12] were inferenced on held out im-agery from one of those two geographies and building foot-print quality was assessed as described in Metrics












































Urban Industrial Dense Residential Sparse Residential7

(NA

DIR

)L

OO

KA

NG

LE

(BIN

)-3

2(O

FF)

52(V

OFF

)

Figure 1 Sample imagery from SpaceNet MVOI Four of the 2222 geographically unique image chips in the dataset areshown (columns) with three of the 27 views of that chip (rows) one from each angle bin Negative look angle correspondsto South-facing views whereas positive look angles correspond to North-facing views (Figure 2) Chips are down-sampledfrom 900times 900 pixel high-resolution images In addition to the RGB images shown the dataset comprises a high-resolutionpan-chromatic (grayscale) band a high-resolution near-infrared band and a lower-resolution 8-band multispectral image foreach geographic locationview combination The dataset is available at httpsspacenetai under a CC-BY SA 40License

ing evaluation of performance These limitations extend totasks outside of the geospatial domain for example con-volutional neural nets perform inconsistently in many nat-ural scene video frame classification tasks despite minimalpixel-level variation [1] and Xiao et al showed that spatialtransformation of images effectively altering view repre-sents an effective adversarial attack against computer visionmodels [36] Addressing generalization across views bothwithin and outside of the geospatial domain requires twoadvancements 1 A large multi-view dataset with diver-sity in land usage population density and views and 2 Ametric to assess model generalization

To address the limitations detailed above we intro-duce the SpaceNet Multi-View Overhead Imagery (MVOI)dataset which includes 62000 overhead images collected

over Atlanta Georgia USA and the surrounding areas Thedataset comprises 27 distinct looks including both North-and South-facing views taken during a single pass of aMaxar WorldView-2 satellite The looks range from al-most directly overhead (78 off-nadir) to up to 54 off-nadir with the same 665 km2 geographic area covered byeach Alongside the imagery we open sourced an atten-dant 126747 building footprints created by expert labelersTo our knowledge this is the first multi-viewpoint datasetfor overhead imagery with dense object annotations Thedataset covers heterogeneous geographies including highlytreed rural areas suburbs industrial areas and high-densityurban environments resulting in heterogeneous buildingsize density context and appearance (Figure 1) At thesame time the dataset abstracts away many other time-

















3 Dataset Creation




Foot

prin

toff

set

and

occl

usio

n


Shad

ows




31 Preprocessing





32 Annotations









42 Models




F1






44 Metrics




(2)


45 Results











10(N

AD

IR)

LO

OK

AN

GL

E(B

IN)

-29

(OFF

)53

(VO

FF)








GY =1

N

sumX








































































B2 U-Net


B3 YOLT


B4 Mask R-CNN










































































3 Dataset Creation




Foot

prin

toff

set

and

occl

usio

n


Shad

ows




31 Preprocessing





32 Annotations









42 Models




F1






44 Metrics




(2)


45 Results











10(N

AD

IR)

LO

OK

AN

GL

E(B

IN)

-29

(OFF

)53

(VO

FF)








GY =1

N

sumX








































































B2 U-Net


B3 YOLT


B4 Mask R-CNN





























































32 Annotations









42 Models




F1






44 Metrics




(2)


45 Results











10(N

AD

IR)

LO

OK

AN

GL

E(B

IN)

-29

(OFF

)53

(VO

FF)








GY =1

N

sumX








































































B2 U-Net


B3 YOLT


B4 Mask R-CNN


























































F1






44 Metrics




(2)


45 Results











10(N

AD

IR)

LO

OK

AN

GL

E(B

IN)

-29

(OFF

)53

(VO

FF)








GY =1

N

sumX








































































B2 U-Net


B3 YOLT


B4 Mask R-CNN



























































10(N

AD

IR)

LO

OK

AN

GL

E(B

IN)

-29

(OFF

)53

(VO

FF)








GY =1

N

sumX








































































B2 U-Net


B3 YOLT


B4 Mask R-CNN





























































GY =1

N

sumX








































































B2 U-Net


B3 YOLT


B4 Mask R-CNN


















































































































B2 U-Net


B3 YOLT


B4 Mask R-CNN


























































spacenet mvoi: a multi-view overhead imagery dataset · shermeyer1, varun kumar3, and hanlin tang3...

Documents