incorporating spatial context and fine-grained detail from ...eix/multi_resolution.pdf ·...

Incorporating Spatial Context and Fine-grained Detail fromSatellite Imagery to Predict Poverty

Jae Hyun KimStanford University

[email protected]

Michael XieStanford University

[email protected] Jean

Stanford [email protected]

Stefano ErmonStanford University

[email protected]

ABSTRACTThe lack of accurate poverty data in developing countriesand the high cost of obtaining this data pose major chal-lenges for making informed policy decisions and allocatingresources effectively. The rise of inexpensive, high-resolutionremote sensing data such as satellite imagery can provideglobal-scale information, given methods for extracting use-ful insights from this unstructured data. In this paper,we present a machine learning approach to predicting real-valued asset indices in five countries in Africa which usessatellite imagery from multiple resolutions to incorporateboth spatial context and fine-grained detail. Following themodel developed by Xie et al. [17], we train convolutionalneural network models to predict nighttime light intensityfrom daytime imagery, simultaneously learning features thatare useful for asset prediction. We show that for many dif-ferent resolutions, single-resolution models trained using thistransfer learning method can extract features that are goodpredictors of asset measures. Further, we demonstrate thatmodels trained on different resolutions extract semanticallydifferent features that are often conceivable by human sight.Finally, we propose two multi-resolution models that com-bine spatial context and fine-grained details of satellite im-ages to outperform the single-resolution models as well asthe previous state-of-the-art.

1. INTRODUCTIONData scarcity in the developing world poses a major chal-

lenge for making informed policy decisions and allocating re-sources effectively [4, 15]. Countries with advanced economies,such as the United States, have sufficient resources and insti-tutional infrastructure to be able to collect detailed house-hold survey data on a regular basis, but these capabilitiesdo not exist in many of the world’s poorest nations. Ac-cording to data from the World Bank, only a few Africancountries have conducted more than two nationally repre-sentative surveys since the year 2000 [16]. This lack of re-liable data severely hinders our ability to identify areas ofgreatest need and to understand the complex dynamics ofpoverty.

With the proliferation of passively collected data from cellphones and other distributed sensors, a variety of creativesolutions to this data shortage problem have been offered. Ina recent paper, Blumenstock et al. [2] predict the distribu-tion of poverty in Rwanda using proprietary cell phone data.Other approaches based on call record data have been ap-

plied to create poverty maps in developing countries [7, 12].Remote sensing data is also becoming increasingly accu-

rate and inexpensive. Given methods of extracting useful in-sights from unstructured data, remote sensing data such ashigh-resolution satellite imagery can provide valuable infor-mation at a global scale. Xie et al. [17] use publicly availablesatellite images to predict poverty, circumventing the lackof labeled data through a transfer learning approach. Us-ing nighttime light intensity prediction as a data-rich proxytask, they train a convolutional neural network (CNN) tolearn to identify image features such as roads, urban areas,and terrains that are predictive of poverty.

We improve upon this approach by incorporating satel-lite imagery of multiple resolutions. Different resolutionscontain semantically different information, with lower reso-lutions providing coarser information about spatial context(e.g., rural/urban distinctions, major roads, nearby popula-tion centers) while higher resolutions capture more detailedfeatures of the area of interest (e.g., individual buildings,building materials, cars).

In this paper, we train residual CNN networks (ResNet[6]) to predict higher-resolution nightlight intensities fromthe VIIRS dataset using daytime satellite images of mul-tiple resolutions, in comparison to the older VGG archi-tecture trained using images of a single resolution and thelower-resolution DMSP nightlight data used in Xie et al.[17]. Through this transfer learning proxy task, our mod-els learn to extract low- and high-resolution features fromsatellite images that can be generalized to poverty predic-tion. The trained CNNs serve as feature extractors, and theresulting image features are used to predict real-valued assetindices. We assess the performance of two types of models:“in-country” models that are trained on and evaluated usingdata from the same country, and “out-of-country” modelsthat are trained on data from one country and evaluatedusing data from different countries. By using satellite im-agery from multiple resolutions, we are able to capture bothspatial context and high-resolution detail.

We find that for many different resolutions, the single-resolution models trained using the transfer learning ap-proach can extract features that are good predictors of as-set measures. For locations where multiple resolutions areavailable, we also demonstrate that by combining image fea-tures from different resolutions of imagery, we can improvesignificantly on the performance achieved by any individualsingle-resolution model. Our multi-resolution models out-perform the previous state-of-the-art, and often compare fa-

https://www.researchgate.net/publication/286512696_Deep_Residual_Learning_for_Image_Recognition?el=1_x_8&enrichId=rgreq-6a60df86cbe65458a579a1947b4f12fa-XXX&enrichSource=Y292ZXJQYWdlOzMwNzkyNTc4NDtBUzo0MDQxODEzNjcxODEzMTJAMTQ3MzM3NTc1NTUzMg==

https://www.researchgate.net/publication/284766595_Predicting_poverty_and_wealth_from_mobile_phone_metadata?el=1_x_8&enrichId=rgreq-6a60df86cbe65458a579a1947b4f12fa-XXX&enrichSource=Y292ZXJQYWdlOzMwNzkyNTc4NDtBUzo0MDQxODEzNjcxODEzMTJAMTQ3MzM3NTc1NTUzMg==

https://www.researchgate.net/publication/282403466_Transfer_Learning_from_Deep_Features_for_Remote_Sensing_and_Poverty_Mapping?el=1_x_8&enrichId=rgreq-6a60df86cbe65458a579a1947b4f12fa-XXX&enrichSource=Y292ZXJQYWdlOzMwNzkyNTc4NDtBUzo0MDQxODEzNjcxODEzMTJAMTQ3MzM3NTc1NTUzMg==


https://www.researchgate.net/publication/278048526_Virtual_Networks_and_Poverty_Analysis_in_Senegal?el=1_x_8&enrichId=rgreq-6a60df86cbe65458a579a1947b4f12fa-XXX&enrichSource=Y292ZXJQYWdlOzMwNzkyNTc4NDtBUzo0MDQxODEzNjcxODEzMTJAMTQ3MzM3NTc1NTUzMg==

https://www.researchgate.net/publication/264332702_Africa's_Statistical_Tragedy?el=1_x_8&enrichId=rgreq-6a60df86cbe65458a579a1947b4f12fa-XXX&enrichSource=Y292ZXJQYWdlOzMwNzkyNTc4NDtBUzo0MDQxODEzNjcxODEzMTJAMTQ3MzM3NTc1NTUzMg==

https://www.researchgate.net/profile/Stefano_Ermon?el=1_x_100&enrichId=rgreq-6a60df86cbe65458a579a1947b4f12fa-XXX&enrichSource=Y292ZXJQYWdlOzMwNzkyNTc4NDtBUzo0MDQxODEzNjcxODEzMTJAMTQ3MzM3NTc1NTUzMg==

https://www.researchgate.net/profile/Neal_Jean?el=1_x_100&enrichId=rgreq-6a60df86cbe65458a579a1947b4f12fa-XXX&enrichSource=Y292ZXJQYWdlOzMwNzkyNTc4NDtBUzo0MDQxODEzNjcxODEzMTJAMTQ3MzM3NTc1NTUzMg==

https://www.researchgate.net/profile/Jae_Hyun_Kim17?el=1_x_100&enrichId=rgreq-6a60df86cbe65458a579a1947b4f12fa-XXX&enrichSource=Y292ZXJQYWdlOzMwNzkyNTc4NDtBUzo0MDQxODEzNjcxODEzMTJAMTQ3MzM3NTc1NTUzMg==

vorably with predictions made using expensively collectedsurvey data.

2. BACKGROUND

2.1 CNNs and residual learningConvolutional neural networks trained on the ImageNet

image recognition challenge have had remarkable success inobject classification, semantic segmentation, and other fun-damental computer vision tasks [13]. A convolutional neuralnetwork (CNN) is a hierarchical model which includes mul-tiple layers of convolution operations over the input. Theconvolution operator is translation invariant, which is im-portant for identifying features in images [3]. For imagedata, the initial layers typically identify low-level featuressuch as edges, while additional layers identify high-level fea-tures such as objects and textures [18]. The top layer en-codes the input as a feature vector, which is then used asthe input to a final classifier. Therefore, we can view a CNNas a mapping F : X → Rn from the space of input imagesX to their feature vector representations.

The success of CNNs has also launched various attemptsto repurpose the learned feature representations for generictasks. Donahue et al. [5] show that deep features learnedfrom training on the ImageNet dataset directly achieve goodperformance for object recognition, domain adaptation, sub-category recognition, and scene recognition.

We use a residual network (ResNet) developed by He et al.[6], which uses a gradient “highway” to train a much deeperarchitecture without gradient decay problems. This archi-tecture achieved the best performance on the 2015 ImageNetchallenge for object recognition and allows for faster con-vergence during training. In residual learning, we allowthe neural network to approximate the residual functionF(x) = H(x) − x, where H(x) is the underlying mappingto be fit and x is an input. Implicitly, we assume that H(x)can be more easily approximated by F(x) +x than directly.Formally, a residual layer is defined

y = F(x,W ) + Wsx

where W are the residual weights and Ws is a transformationto match the size of Wsx with F . Note that Ws is chosen asthe identity mapping wherever possible and is not optimizedin that case. Each layer with residual learning requires theinput from an earlier layer as part of its calculation, mit-igating the gradient degradation problem. Intuitively, thisensures that adding layers can only improve performance,since we can set the residual weights of the additional lay-ers to zero to learn an identity mapping from the shallowernetwork.

2.2 Transfer learningTransfer learning uses knowledge from a related task to

improve performance on the target task (see Pan and Yang[11]). As in Xie et al. [17], we define a transfer learn-ing problem P = (D, T ) as a domain-task pair. Transferlearning is used to improve the learning of a target pre-dictive function in the target problem PT by using knowl-edge from a source problem PS . The transfer of knowl-edge forms a transfer learning graph G = (V, E), a directedacyclic graph with vertices V = {P1, . . . ,Pv} and edgesE = {(Pi1 ,Pj1), . . . , (Pie ,Pje)}. For each problem Pj ∈ V,

Figure 1: Daytime images of the same location cor-responding (from left to right) to resolutions of zoomlevel 14, 16, and 18 respectively.

we transfer the knowledge gained through solving the parentproblems in the graph ∪(Pi,Pj)∈EPi.

3. NIGHTTIME LIGHT PREDICTION

3.1 ImageNet initializationLet the domain of natural images be DNI and the Ima-

geNet classification task be TIN . The ImageNet challengeproblem (DNI , TIN ) is the first source problem in our trans-fer learning graph pipeline seen in Fig. 2, as we transfer theknowledge learned from the ImageNet problem to the night-time light intensity prediction problem by initializing ourmodels with parameters trained with ImageNet. Althoughthe ImageNet domain consists of natural images that arefundamentally different from the aerial view of satellite im-ages, low-level features such as edges and corners are stillgenerally useful.

3.2 Daytime satellite imageryIn the intermediate transfer learning problems, we learn a

predictive function for the nighttime light intensity predic-tion task TNL from the domain of daytime satellite imagesDSIr of resolution (zoom) level r obtained from the GoogleStatic Maps API. We use zoom levels r ∈ R = {14, 16, 18},which correspond to resolutions of 9.55 m, 2.39 m, and 0.60m per pixel respectively. Therefore, 224-by-224 pixel imagescover areas of 2.14 km by 2.14 km, 0.54 km by 0.54 km, and0.13 km by 0.13 km respectively.

At the highest resolution (zoom level 18), landscape fea-tures such as individual buildings and rooftops can be seenclearly. These details are useful for inferring quantities of in-terest (e.g., consumption expenditures or asset ownership)at a fine-grained level. Below this resolution, we generallycannot see building outlines or cars. At the lowest resolu-tion (zoom level 14), larger-scale features, such as roads andother types of infrastructure and natural terrains, dominatethe variation in the images. Since the lower resolutions areable to cover more land area in each image, they also capturethe surrounding spatial context—a small farm in the middleof a desert likely has a different economic outlook than onethat is located on the outskirts of a large city.

3.3 Nighttime light intensitiesTo obtain nighttime light intensity labels, we use data

from the Visible Infrared Imaging Radiometer Suite (VI-IRS) satellite at a 15 arc-second resolution (∼0.5 km), twicethe resolution of the 30 arc-second Defense MeteorologicalSatellite Program (DMSP) data used in Xie et al. [17]. TheVIIRS data is also more recent than the DMSP data, whichis from 2013, allowing for the daytime satellite images and

https://www.researchgate.net/publication/303748578_ImageNet_Large_Scale_Visual_Recognition_Challenge?el=1_x_8&enrichId=rgreq-6a60df86cbe65458a579a1947b4f12fa-XXX&enrichSource=Y292ZXJQYWdlOzMwNzkyNTc4NDtBUzo0MDQxODEzNjcxODEzMTJAMTQ3MzM3NTc1NTUzMg==




https://www.researchgate.net/publication/258424423_Visualizing_and_Understanding_Convolutional_Neural_Networks?el=1_x_8&enrichId=rgreq-6a60df86cbe65458a579a1947b4f12fa-XXX&enrichSource=Y292ZXJQYWdlOzMwNzkyNTc4NDtBUzo0MDQxODEzNjcxODEzMTJAMTQ3MzM3NTc1NTUzMg==

https://www.researchgate.net/publication/257409887_DeCAF_A_Deep_Convolutional_Activation_Feature_for_Generic_Visual_Recognition?el=1_x_8&enrichId=rgreq-6a60df86cbe65458a579a1947b4f12fa-XXX&enrichSource=Y292ZXJQYWdlOzMwNzkyNTc4NDtBUzo0MDQxODEzNjcxODEzMTJAMTQ3MzM3NTc1NTUzMg==

nighttime light labels to be matched temporally. The ma-jority of our daytime imagery was taken in 2015, so we use2015 VIIRS data. However, while the DMSP data is pro-cessed to filter out noise from ephemeral light sources, theVIIRS data is not. We process the VIIRS data by averag-ing nighttime light intensity values over 2015, weighted bythe number of observations in each month. These annualweighted averages are then used as ground truth nighttimelight values.

As in Xie et al. [17], we discretize the nighttime light inten-sities into classes by observing the distribution of nighttimelights across Africa. We bin the intensities into 3 classes,with the classes ranging from 0 to 8, 8 to 35, and 35 to 200nW/cm2 − sr. The lowest light intensity class includes nat-ural terrains, the middle intensity class includes areas withsome human activity such as rural areas, and the highestintensity class includes most of the urban areas. For eachclass, we sample 50,000 images and augment the data byrandom rotations. We sample an additional 10,000 imagesfrom each class for the test set.

4. POVERTY MEASURE PREDICTIONIn our final target problem, we predict an asset-based

measure of wealth using DSIR = ∪r∈RDSIr , the space ofsatellite imagery data with multiple resolution levels R. Forthe poverty prediction task, TP , we use an asset index forfive countries (Nigeria 2013, Tanzania 2010, Uganda 2011,Malawi 2010, Rwanda 2010) drawn from the Demographicand Health Surveys (DHS). The index is computed as thefirst principal component of survey data about ownership ofassets [8]. While the asset index cannot be used directly toconstruct benchmark measures of poverty, asset-based mea-sures are thought to better capture households’ longer-runeconomic status and are measured with relatively less errorsince they are directly observable to survey administrators[14].

The DHS survey data is collected at the household level,but the locations of the households in the survey are pro-vided only at the “cluster” level, roughly corresponding tovillages in rural areas and wards in urban areas. To preservethe anonymity of the participants, these cluster locationsalso include random noise, up to 2 km in each direction forurban clusters and up to 5 km for rural clusters. Cluster-level wealth measures are computed by averaging the assetindex of all households belonging to each cluster. In thepoverty measure prediction problem, we aim to predict thereal-valued cluster-level asset index given a set of input satel-lite images.

4.1 Extracting featuresA CNN model takes a satellite image as input and out-

puts a feature representation of the image. After trainingeach model on the nighttime light prediction problem, wecan extract image features from new input images by eval-uating the model on the new image. For each DHS surveycluster location, we sample a set of images which are usedto construct a feature vector for the cluster. For urban clus-ters, we extract features from 256 images in a 4 km by 4km bounding box centered on the DHS cluster location. Forrural clusters, we extract from 400 images in a 10 km by 10km bounding box centered on the cluster location. The dif-ference in sampling areas arises from the different amountsof noise added to rural and urban cluster locations. We then

(DSI14 , TNL)

(DNI , TIN ) (DSI16 , TNL) (DSIR , TP )

(DSI18 , TNL)

Figure 2: Transfer learning graph first depicting thetransfer of knowledge from the domain of naturalimages (DNI) and the ImageNet classification task(TIN) to the nighttime light prediction task (TNL)with multiple resolution levels of input satellite im-ages (DSIr for r ∈ R = {14, 16, 18}), then depictingthe transfer of knowledge to a poverty measure pre-diction task (TP ) with a multi-resolution domain ofsatellite images (DSIR = ∪r∈RDSIr).

average the 256 or 400 feature vectors for each cluster intoa final feature vector for the cluster. For locations whereGoogle Maps imagery was incomplete, we use as many im-ages as possible (up to 256 or 400) for computing the clusterimage features.

4.2 Making predictionsThe cluster image features are then used to train ridge

regression models that can predict the average householdasset index for each cluster. In our experiments, we first re-duce the feature dimension to 100 using principal componentanalysis (PCA) to speed up computation while still captur-ing almost all of the variation in the data. Regularization inthe ridge model guards against overfitting, a potential chal-lenge given the relatively small number of training examplesin the survey datasets (< 1000).

We evaluate asset prediction models through in-countryand out-of-country tests. For in-country models, we useDHS data and daytime imagery from one country to traina model that is then evaluated on a test set from the samecountry. An example of this would be to train a model usingimages and survey data from Uganda, then test that modelon test survey data from Uganda. An out-of-country modeluses survey data and daytime imagery from one country totrain a model, then applies that model to predict assets fromdaytime imagery in a different country. An example of thiswould be to take the model that was trained on Uganda anduse it to predict asset-based wealth measures in Tanzania,Nigeria, Malawi, or Rwanda.

5. MODELS

5.1 Single-resolution modelsIn the single-resolution setting, we train ResNet-50 mod-

els (which each have 50 layers) to predict nighttime lightintensity classes using input daytime satellite images of asingle resolution. The locations for the training set are sam-pled randomly from the entire African continent, and themodel is trained on the 3-class VIIRS nighttime light inten-sity labels described in section 3.2. Through this proxy task,


https://www.researchgate.net/publication/2800735_Exploring_Alternative_Measures_of_Welfare_in_the_Absence_of_Expenditure_Data?el=1_x_8&enrichId=rgreq-6a60df86cbe65458a579a1947b4f12fa-XXX&enrichSource=Y292ZXJQYWdlOzMwNzkyNTc4NDtBUzo0MDQxODEzNjcxODEzMTJAMTQ3MzM3NTc1NTUzMg==

the ResNet-50 models learn to map each input image intoa 2048-dimension feature vector representation that is use-ful for predicting nighttime light intensities. We train threeseparate single-resolution models using zoom 14, zoom 16,and zoom 18 images, which we will refer to as the Zoom14,Zoom16, and Zoom18 models respectively.

5.2 End-to-end modelOne way to use information from multiple resolutions is

to combine the single-resolution models from all three res-olutions and concatenate their output feature vectors. Wetrain an end-to-end neural network model that takes as in-put 3 images with different resolutions (zoom 14, 16, 18) foreach location and outputs a 6144-dimensional feature vec-tor. The three ResNet-50 models are initialized with thefinal parameters of the trained single-resolution models andare then trained jointly on the nighttime light intensity pre-diction problem. In the process of training, the three single-resolution models are able to optimize how they identifyfeatures in relation to the other models, potentially allowingmodels of different resolutions to focus on features that areunique to their resolution instead of trying to find featuresthat can predict nightlights using information from only oneresolution.

5.3 Neural network modelAnother way to combine the information from multiple

image resolutions is to train a shallow neural network modelwhich takes in the concatenated features from the multi-ple single-resolution models and outputs a combined featurevector that also has fixed dimensionality. After training thesingle-resolution models, we fix the 6144-dimensional con-catenated features as input and train this shallow networkon the nighttime light prediction problem. This networkhas one 2048-dimensional hidden layer with a ReLU non-linearity, and therefore the learned feature representation is2048-dimensional. The network is implemented using Ten-sorFlow [1] and trained using the ADAM optimizer [10] withinitial learning rate 1e-6. Dropout regularization with adropout probability of 0.5 is employed with weight decayof 0.005. In the training process, the neural network modeloptimizes a nonlinear transformation of the concatenatedmulti-resolution features on the transfer learning task. Onedrawback of this model is that while the regression coeffi-cients of the End-to-end model can easily be interpreted asfeature contributions from different resolutions, the neuralnetwork model is not as interpretable.

6. RESULTS AND DISCUSSION

6.1 Nighttime light predictionThe single-resolution ResNet-50 modls are trained with

Caffe and initialized from the parameters learned from thepre-training process on the ImageNet dataset [9]. Randommirroring is used for data augmentation, and stochastic gra-dient descent (SGD) with momentum is run for 150,000 it-erations using batch sizes of 16 (4 epochs) to fine-tune theparameters on the nightlight prediction task. The learningrate is initially set to 1e-6, and then dropped by a factor of0.5 every 10,000 iterations. We use a momentum parameterof 0.9 and a weight decay of 0.0005, with other hyperparam-eters set to the same values used in the pre-training process.We choose the weights corresponding to the minimum of the

Zoom 14 Image224× 224



ResNet-50 ResNet-50 ResNet-50

x14 x16 x18

[x14 x16 x18]

Multi-resolution step

x14,16,18

Figure 3: General multi-resolution model architec-ture which takes 3 input different resolution imagesper location and outputs a feature vector by con-catenating the 2048-dimensional output of 3 single-resolution ResNet-50 models trained on the night-time light transfer learning problem and using thisconcatenated vector as input to a multi-resolutionstep. In the End-to-end model, we take the multi-resolution step as the identity mapping and trainthe architecture jointly. In the Neural model, wetake the multi-resolution step as a single hidden-layer neural network and train the multi-resolutionstep while keeping the rest of the architecture fixed.

moving average of validation accuracy over 5,000 iterationsas the final model parameters.

On the 3-class nighttime light prediction problem, theZoom14 (9.55 m/pixel) model achieved test accuracy of 0.7590after 62,000 iterations, the Zoom16 (2.39 m/pixel) modelachieved test accuracy of 0.7275 after 82,500 iterations, andthe Zoom18 (0.60 m/pixel) model achieved test accuracy of0.6878 after 54,500 iterations.

The End-to-end model is initialized with the final param-eters of the trained single-resolution models and trained onthe nighttime light prediction problem. All hyperparame-ters, including learning rate and momentum, are the sameas single-resolution models. For the night light predictionproblem, the End-to-end model achieved test accuracy of0.7856 after 40,000 iterations.

We also use the nighttime light prediction problem as atransfer learning proxy for training an additional neural net-work which combines the single-resolution features. Theimage features extracted using the single-resolution mod-els are concatenated and fed as input into a shallow neuralnetwork with one hidden layer, as described in 5.3. Duringthe training stage, we use batch sizes of 1024. As before,we stop training when the moving average of the validationaccuracy is minimized, which occurs after 72,250 iterations(123.3 epochs) with a final test accuracy of 0.8920 on thenighttime light prediction problem.

6.2 Model and feature generalizationWe transfer knowledge from the nighttime light prediction

https://www.researchgate.net/publication/301839500_TensorFlow_Large-Scale_Machine_Learning_on_Heterogeneous_Distributed_Systems?el=1_x_8&enrichId=rgreq-6a60df86cbe65458a579a1947b4f12fa-XXX&enrichSource=Y292ZXJQYWdlOzMwNzkyNTc4NDtBUzo0MDQxODEzNjcxODEzMTJAMTQ3MzM3NTc1NTUzMg==

https://www.researchgate.net/publication/269935079_Adam_A_Method_for_Stochastic_Optimization?el=1_x_8&enrichId=rgreq-6a60df86cbe65458a579a1947b4f12fa-XXX&enrichSource=Y292ZXJQYWdlOzMwNzkyNTc4NDtBUzo0MDQxODEzNjcxODEzMTJAMTQ3MzM3NTc1NTUzMg==

https://www.researchgate.net/publication/264979485_Caffe_Convolutional_Architecture_for_Fast_Feature_Embedding?el=1_x_8&enrichId=rgreq-6a60df86cbe65458a579a1947b4f12fa-XXX&enrichSource=Y292ZXJQYWdlOzMwNzkyNTc4NDtBUzo0MDQxODEzNjcxODEzMTJAMTQ3MzM3NTc1NTUzMg==

Country VGG Zoom14 Zoom16 Zoom18 End-to-end Neural Survey

Nigeria 0.6917 0.7440 0.7684 0.7579 0.7747 0.7841 0.7648Tanzania 0.5746 0.7084 0.6984 0.6974 0.7468 0.7615 0.6866Uganda 0.6614 0.7226 0.7579 0.7610 0.7447 0.7517 0.8252Malawi 0.5470 0.6027 0.6221 0.6039 0.6252 0.6380 0.8218Rwanda 0.7438 0.7701 0.7732 0.7722 0.7786 0.7788 0.5870

Table 1: r2 for in-country models, which are trained and tested within the same country. The VGG modelsare trained following Xie et al. [17]. The Zoom14, Zoom16, Zoom18 models are single-resolution ResNet-50models. The End-to-end and Neural models are multi-resolution models as described in Sections 5.2 and 5.3.The Survey model does not use satellite imagery, and is built using features from the DHS surveys.



Table 2: r2 for out-of-country models, which are trained and tested in different countries.

problem to the asset index prediction problem by using thevarious trained CNN models as fixed feature extractors. Theextracted image features are then used to build ridge regres-sion models that predict asset indices. The metrics used toevaluate performance on the target poverty prediction prob-lem are r2 (where r is Pearson’s correlation coefficient) andmean squared error (MSE). The linear correlation betweenthe true asset indices and the predicted values is measuredby the r2 value, giving a measure of the ability of the im-age features to explain relative differences in asset wealthbetween clusters. The MSE measures how accurately theridge regression model is actually predicting the outputs.

For each model, we evaluate prediction performance whentrained and evaluated within the same country (in-country),and when trained and evaluated in different countries (out-of-country). We note a crucial difference between model gen-eralization and feature generalization, which are capturedby MSE and r2, respectively. Model generalization refersto whether the regression coefficients are similar from train-ing to test sets, such that the model prediction error (MSE)would be low. For example, if a model generalizes well in anout-of-country test, then the regression coefficients are sim-ilar across countries, suggesting that the same combinationof features are important in many countries. Feature gen-eralization refers to the ability of image features to capturevariation in asset wealth via a linear dependence, which cor-responds to high r2. For example, if a set of features general-izes well in an out-of-country test, then these features allowsfor a good linear model of asset wealth to be fit across coun-tries. However, if the true linear coefficients of these featuresare not necessarily the same in these different countries, thenthe predictive accuracy (MSE) across countries is not nec-essarily high. Even if the features give rise to regressionmodels whose outputs are linearly correlated with assets inevery country (high r2), if these models have varying coef-ficients across countries, then they will perform poorly onpredicting asset values (high MSE) in out-of-country tests.Therefore, it is possible for the features to generalize wellbut for the model to generalize poorly, exhibiting both high

r2 and high MSE.

6.3 Asset-based wealth predictionWe now compare the results of the single resolution mod-

els and the two multi-resolution models on asset-based wealthprediction. For in-country models, we employ a nested 10-fold cross validation scheme in which the inner cross vali-dation loop chooses the regularization parameter by averag-ing the best regularization parameters found in the 10 innerfolds and the outer cross validation loop evaluates a ridge re-gression model using the regularization parameter found inthe inner loop. For out-of-country r2 and MSE, we take theaverage of the four values obtained by applying the modeltrained on the four out-of-country tests to the target coun-try (e.g., Tanzania→ Nigeria, Uganda→ Nigeria, Malawi→Nigeria, Rwanda → Nigeria). For all the regression modelsusing satellite imagery, we use PCA to reduce the dimensionof the input feature vectors to 100, removing the possibilitythat any differences in performance come from differences infeature dimension.

As a baseline, we also compare the results to regressionmodels built using data from the same DHS surveys that theasset indices are drawn from. In this “Survey” model, we useother features of the cluster households to try to predict theaverage wealth index of the cluster. The features used forthe survey model are: proportion of roofs that are metallic,average number of rooms per household, whether the clusteris urban or rural, average distance to the nearest populationcenter of at least 100 thousand people, average elevation,and average temperature. The Survey model does not makeuse of satellite imagery.

To compare against the existing state-of-the-art, we alsoevaluate the performance of the VGG-based model describedin Xie et al. [17]. The models are trained according to thetransfer learning approach involving nighttime light predic-tion as the proxy task and are subject to the same in-countryand out-of-country tests.

Tables 1 and 2 show r2 values for in-country and out-of-country tests. Across the single-resolution models, r2 values





Table 3: MSE for in-country models.



Table 4: MSE for out-of-country models.

tend to increase with resolution, especially in out-of-countrymodels. This indicates that features extracted by higherresolution models generalize better than the lower resolu-tion features, suggesting that fine-grained details (e.g., rooftypes, cars) are linearly correlated with asset wealth andcapture more of the variation in wealth than low resolutionfeatures such as infrastructure and terrains. The End-to-end and Neural models have higher r2 and lower MSE val-ues than all single-resolution models and the VGG modelin almost all countries. In particular, the Neural model ex-hibits the best performance in many cases, although theperformance of the multi-resolution models are generallysimilar. Therefore, the features extracted by the multi-resolution models generalize the best both in-country andout-of-country. Additionally, out-of-country r2 and MSEvalues for the multi-resolution models show better improve-ment than in-country r2 values, suggesting that the multi-resolution step significantly improves the cross-border gener-alizability of the features. Finally, in comparison to predic-tion using features from the survey, we find that the multi-resolution models are competitive with the survey model forin-country tests and particularly strong in out-of-countrytests. While survey data provides accurate measurementsat a very sparse, local level, additional data such as censusdata is usually needed to extrapolate the survey data acrossa country. In contrast, our model allows for high-resolutioncoverage of countries even without survey data, as suggestedby the strong out-of-country generalization.

Tables 3 and 4 show MSE for in-country and out-of-countrycases. Across the single-resolution models, the MSE valuestend to increase with increasing resolution, especially forout-of-country models. With the knowledge that higher res-olution features generalize better across countries, this in-dicates that the ridge regression coefficients of higher zoomlevels do not generalize well across countries, suggesting thathigh-resolution features are important for capturing vari-ation, but having same values of high-resolution featuresmay not correspond to equivalent levels of absolute wealthin different countries. On the other hand, this suggests thatlow-resolution models generalize well across countries. Sincelow-resolution images contain more infrastructure and ter-rain information, they can give a better absolute estimateof wealth without having enough variation to explain the

intricate patterns of asset wealth at the cluster level.Therefore, combining the two results, we conclude that

higher resolution models tend to extract features which giverise to models with a high linear correspondence with as-set wealth even across different countries, while lower res-olution models give rise to models with coefficients thatgeneralize better across countries, highlighting features thatare consistently important. By taking advantage of thecomplementary information in the different resolutions, themulti-resolution models immediately improves in both r2

and MSE. We further learn the relationship between the fea-tures from different resolutions through the neural networkmodel, and performance improves in almost every test.

6.4 Visualizing the featuresWe can visualize the features identified by the different

single-resolution models by examining the filter activations.We observe the output of each filter in layer 4f of the ResNet-50 model as described in He et al. [6] on a set of sampleimages. For each filter, we look for images with significantactivation, meaning that the image contains features thatthe filter has learned to look for. By looking at the activationmap, which is the output of the convolution after applyingthe nonlinearity, we can determine where the filter activatesmost strongly on the image.

Fig. 4 shows several examples of filter activation mapsfor the three single-resolution models. Low-resolution fea-tures capture geography and spatial context, such as urbanarea, rivers, and desert-like regions, while high-resolutionfeatures capture fine-grained details such as cars, buildings,and roads. High-resolution features achieved high r2 in out-of-country tests, implying that fine-grained details such ascars and building features are highly correlated with assets.However, the high MSE of the high-resolution models in-dicates that, despite strong linear correlations, we cannotmake accurate predictions with only those features. For ex-ample, suppose that one of the high-resolution features rep-resents the number of cars found in an image. It is likelythat a village with 10 cars in Malawi is wealthier than a vil-lage with 5 cars, and that this relative wealth relationshipwould also hold in Uganda. However, a village with 10 carsin Malawi may not have the same absolute level of wealth asa village with 10 cars in Uganda. High-resolution features


Figure 4: Activations of filters for single-resolution models. Each of the three columns corresponds to zoomlevel 14, 16, and 18, respectively. For each column, the image on the left is the satellite image, and theimage on the right is the activation of the filter. Four representative filters out of 512 filters were chosen foreach zoom level. Zoom 14 model captures geography/spatial contexts such as rivers, seas, desert-like cities,and other urban areas (top to bottom). Zoom 16 model captures infrastructures such as building clusters,farmlands, large buildings, and roads. Zoom 18 model captures fine-grained details such as cars, shadows(implying tall buildings), sparse buildings, and swimming pools.

may fail to capture more macroscopic characteristics thataffect absolute wealth.

On the other hand, low-resolution features such as roads,urbanness, and infrastructure are not as linearly correlatedwith assets due to less variation between regions. However,they predict the absolute level of wealth better than high-resolution features, achieving better out-of-country MSEs.In other words, a region with more cars is wealthier than aregion in the same country with less cars, but to determinethe region’s absolute wealth, and further a nation’s levelof wealth, spatial context and infrastructure are generallymore important. The spatial context tells us the generallevel of wealth of a nation, while the fine-grained detailsprovide us with information needed to compare between re-gions. Therefore, both the spatial context and fine-graineddetails are necessary to have an accurate depiction of thenation’s wealth.

7. CONCLUSIONSFor the asset-based wealth prediction problem in Africa,

we find that single-resolution models for three semanticallydifferent resolutions can each achieve reasonable performance,suggesting that each resolution contains useful information.We show two different ways of building models which incor-porate image features from all three resolutions, the End-

to-end model and the Neural model. These multi-resolutionmodels provide an immediate performance improvement overany single-resolution model, including the state-of-the-art,in both in-country and out-of-country performance.

In analyzing the results, we find that high-resolution fea-tures and low-resolution features have different properties.High-resolution features show higher correlation with assets,but do not have high predictive power on their own becausethey provide little information about the absolute level ofwealth of the given area. On the other hand, low-resolutionfeatures, despite having less correlation with assets due tolower variation, give stronger indications of the overall levelof wealth. From this, it is clear that both low- and high-resolution images give useful information for asset wealthprediction. Further investigation into the information cap-tured by each resolution and into how the low- and high-resolution features interact with each other could enhancethe model’s performance in the future.

Harnessing information from multiple resolutions is cru-cial for making full use of the satellite imagery data that isavailable. Our method of incorporating satellite images ofmultiple resolutions boosts predictive performance, which iscritical for real-world deployment and can benefit a widerange of applications which utilize remote sensing data.

References[1] Martin Abadi, Ashish Agarwal, Paul Barham, Eugene

Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado,Andy Davis, Jeffrey Dean, Matthieu Devin, SanjayGhemawat, Ian Goodfellow, Andrew Harp, GeoffreyIrving, Michael Isard, Yangqing Jia, Rafal Jozefow-icz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg,Dan Mane, Rajat Monga, Sherry Moore, Derek Murray,Chris Olah, Mike Schuster, Jonathon Shlens, BenoitSteiner, Ilya Sutskever, Kunal Talwar, Paul Tucker,Vincent Vanhoucke, Vijay Vasudevan, Fernanda Vie-gas, Oriol Vinyals, Pete Warden, Martin Wattenberg,Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensor-Flow: Large-scale machine learning on heterogeneoussystems, 2015. URL http://tensorflow.org/. Softwareavailable from tensorflow.org.

[2] Joshua Blumenstock, Gabriel Cadamuro, and RobertOn. Predicting poverty and wealth from mobile phonemetadata. Science, 350(6264):1073–1076, 2015.

[3] J. Bouvrie. Notes on convolutional neural networks.2006.

[4] Shantayanan Devarajan. Africa’s statistical tragedy.Review of Income and Wealth, 59(S1):S9–S15, 2013.

[5] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang,E. Tzeng, and T. Darrell. Decaf: A deep convolutionalactivation feature for generic visual recognition. CoRRabs/1310.1531, 2013.

[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residuallearning for image recognition. arXiv:1512.03385, 2015.

[7] Lingzi Hong, Enrique Frias-Martinez, and VanessaFrias-Martinez. Topic models to infer socio-economicmaps. In Thirtieth AAAI Conference on Artificial In-telligence, 2016.

[8] ICF International. Demographic and health surveys(various) [datasets], 2015.

[9] Yangqing Jia, Evan Shelhamer, Jeff Donahue, SergeyKarayev, Jonathan Long, Ross Girshick, SergioGuadarrama, and Trevor Darrell. Caffe: Convolutionalarchitecture for fast feature embedding. arXiv preprintarXiv:1408.5093, 2014.

[10] D. Kingma and J. Ba. Adam: A method for stochas-tic optimization. International Conference on LearningRepresentations, 2015.

[11] Sinno Jialin Pan and Qiang Yang. A survey on trans-fer learning. Knowledge and Data Engineering, IEEETransactions on, 22(10):1345–1359, 2010.

[12] Neeti Pokhriyal, Wen Dong, and Venu Govindaraju.Virtual networks and poverty analysis in senegal. arXivpreprint arXiv:1506.03401, 2015.

[13] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bern-stein, A.C. Berg, and L. Fei-Fei. Imagenet large scalevisual recognition chellange. International Journal ofComputer Vision, pages 1–42, 2014.

[14] David E Sahn and David Stifel. Exploring alternativemeasures of welfare in the absence of expenditure data.Rev. Income Wealth, 49(4):463–489, 1 December 2003.

[15] United Nations. A world that counts: Mobilising thedata revolution for sustainable development. 2014.

[16] World Bank. Povcalnet online poverty analysis tool,http://iresearch.worldbank.org/povcalnet/, 2015.

[17] Michael Xie, Neal Jean, Marshall Burke, David Lobell,and Stefano Ermon. Transfer learning from deep fea-tures for remote sensing and poverty mapping. Proceed-ings of the 30th AAAI Conference on Artificial Intelli-gence, 2016.

[18] Matthew D. Zeiler and Rob Fergus. Visualizingand understanding convolutional networks. CoRR,abs/1311.2901, 2013. URL http://arxiv.org/abs/1311.2901.

View publication statsView publication stats

http://tensorflow.org/

http://arxiv.org/abs/1311.2901

http://arxiv.org/abs/1311.2901























































https://www.researchgate.net/publication/307925784

incorporating spatial context and fine-grained detail from ...eix/multi_resolution.pdf ·...

Documents