BUILDING EXTRACTION FROM REMOTE SENSING DATA USING FULLY CONVOLUTIONAL NETWORKS

Building detection and footprint extraction are in high demand for many remote sensing applications. Although most previous works have shown promising results, the automatic extraction of building footprints remains a nontrivial task, especially in complex urban areas. Recently developed extensions of the CNN framework have made it possible to perform dense pixel-wise classification of input images. Building on these capabilities, we propose a methodology that automatically generates a full-resolution binary building mask from a Digital Surface Model (DSM) using a Fully Convolutional Network (FCN) architecture. The advantage of using depth information is that it provides geometrical silhouettes, allows a better separation of buildings from the background, and is invariant to illumination and color variations. The proposed framework has two main steps. First, the FCN is trained on a large set of patches consisting of normalized DSM (nDSM) as inputs and the available ground truth building mask as target outputs. Second, the predictions generated by the FCN are used as unary terms for a Fully connected Conditional Random Field (FCRF), which enables us to create the final binary building mask. A series of experiments demonstrates that our methodology is able to extract accurate building footprints that closely follow the buildings' original shapes. The quantitative and qualitative analyses show significant improvements over the multi-layer fully connected network from our previous work.


INTRODUCTION

Related work
Building detection and footprint extraction are important remote sensing tasks used in fields such as urban planning and reconstruction, infrastructure development, and three-dimensional (3D) building model generation. Due to the complex nature of urban environments, collecting building footprints from remotely sensed data manually is time consuming and inefficient. Therefore, automatic methods are needed for the efficient collection of building footprints from large urban areas comprising numerous constructions. Recently, various approaches have been developed that perform building extraction on the basis of high-resolution satellite imagery. Depending on the type of data employed, the existing methods can be divided into two main groups: those using aerial or high-resolution satellite imagery and those using three-dimensional (3D) information.
Aerial photos and high-resolution satellite images are extensively used in urban studies. The pioneering approaches extracted edge, line and/or corner information, which are fundamental elements for building extraction (Huertas and Nevatia, 1988, Irvin and McKeown, 1989). Many studies additionally incorporate shadow information into the low-level features (Liow and Pavlidis, 1990, McGlone and Shufelt, 1994, Peng and Liu, 2005). Some methodologies formalize the building extraction problem in terms of graph theory (Kim and Muller, 1999, Krishnamachari and Chellappa, 1996, Sirmacek and Unsalan, 2009). Other researchers implemented more advanced methods to extract the shapes of the detected buildings (Karantzalos and Paragios, 2009, Sirmacek et al., 2010). Further studies, which exploit the advantages of multi-spectral information, solve the detection problem in a classification framework (Lee et al., 2003, Koc-San and Turker, 2014, Sumer and Turker, 2013). However, due to the complexity of shapes and the variety of materials of human-made constructions, image classification in urban areas remains difficult.
To exploit the height information from a DSM, obtained from optical stereo images or light detection and ranging (LIDAR) measurements, several works investigated building footprint extraction from the DSM alone or together with high-resolution imagery. In general, building detection from a DSM is a very challenging task due to scene complexity and imperfections in the steps of depth image generation, such as stereo matching. These imperfections lead to noise in the generated DSM. Although the quality of stereo DSMs is lower than that of LIDAR data, they have become more popular in recent years due to their large coverage and lower cost. (Gerke et al., 2001) detect and generate building outlines from a DSM by separating them from surrounding above-ground objects such as trees using the Normalized Difference Vegetation Index (NDVI). Similar approaches are used in (San and Turker, 2006, Lu et al., 2002). (Krauß et al., 2012) introduced a methodology for DSM-based building mask generation using an Advanced Rule-based Fuzzy Spectral Classification algorithm, which fuses the nDSM with the classified multispectral imagery. Afterwards, height thresholding is applied to separate buildings from other surrounding objects. The approach proposed in (Brédif et al., 2013) extracts rectangular building footprints directly from the DSM using a Marked Point Process (MPP) of rectangles and then refines them into polygonal building footprints.

In spite of the efforts put into developing methodologies for the automatic extraction of building footprints from DSMs, existing methods are still not able to provide satisfactory results. Therefore, our goal is to implement a methodology that automatically extracts buildings from DSMs without any assumptions on their shape and size.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLII-1/W1, 2017. ISPRS Hannover Workshop: HRIGI 17 - CMRT 17 - ISA 17 - EuroCOW 17, 6-9 June 2017, Hannover, Germany. This contribution has been peer-reviewed. doi:10.5194/isprs-archives-XLII-1-W1-481-2017

Deep Neural Networks for building extraction
With the revolutionary appearance of Convolutional Neural Networks (CNNs), which became the state of the art for image recognition problems, it became possible to automatically detect buildings in remote sensing data. In (Yuan, 2016) building footprints are automatically extracted from high-resolution satellite images using a Convolutional Network (ConvNet) framework. The authors of (Maggiori et al., 2017) propose to generate a building mask from RGB satellite imagery using an FCN that is first trained on possibly inaccurate reference data and then refined on a small amount of manually labeled data. One of the first approaches to above-ground object classification from a high-resolution DSM using a deep learning technique, specifically a Multilayer Perceptron model, was demonstrated in (Marmanis et al., 2015). In our previous study (Davydova et al., 2016), we developed a similar approach to create a binary building mask from a DSM using a four-layer fully connected neural network. As a continuation of that work, in this paper we present a deep learning methodology for building footprint extraction from remote sensing data, particularly the nDSM, with a focus on dense residential areas. Besides learning discriminative per-pixel classifiers, we use the FCN output as the unary term of a fully connected Conditional Random Field (CRF) and generate the final building mask.

Fully Convolutional Network
Traditional CNN architectures were designed for image-level classification tasks: they require an input image of a fixed size h × w × ch (h and w are the spatial dimensions, and ch is the feature/channel dimension) and output a vector of class scores cl_i. The fully connected layers of such architectures have fixed dimensions and completely discard spatial information. FCNs, on the other hand, reinterpret the fully connected layers as a large set of 1×1 convolutions, which allows the network to take an input of any size and output maps of class scores cl_i(x, y). However, the per-class score maps generated by an FCN are smaller and have a coarser spatial resolution than the input image because of the pooling layers along the network. The solution is to extend the FCN with deconvolution layers, which up-sample the previous layer. Adding several deconvolution operations at the top of the network up-samples the coarse maps to the input image size and yields class scores for each pixel, so that the network can be trained end to end by backpropagation from a pixel-wise loss. However, the output of such an FCN (known as FCN-32s) does not provide satisfactory object boundaries, because the final deconvolution layer, which has a stride of 32 pixels, restricts the scale of detail. To refine object boundaries, high-frequency information from lower network layers is added with the help of so-called "skip" layers. A "skip" layer combines the final prediction layer with the output of earlier convolutional layers, which carry finer spatial information. In this way, FCN-16s adds a "skip" connection from pool4, and FCN-8s propagates even more detailed information from the pool3 layer in addition to the pool4 layer (see Figure 1).
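The fusion performed by the "skip" layers can be illustrated with a small NumPy sketch. It is not the Caffe implementation used in this work: nearest-neighbour up-sampling stands in for the learned deconvolution layers, the score maps are random placeholders, the function names `upsample` and `fuse_skip` are hypothetical, and a 256×256 input is assumed so that all strides divide the image size evenly.

```python
import numpy as np

def upsample(scores, factor):
    """Nearest-neighbour up-sampling of a (classes, h, w) score map.
    Assumption: stands in for a learned deconvolution layer."""
    return scores.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_skip(coarse, finer):
    """Up-sample the coarse scores by 2 and add the scores of a finer layer."""
    up = upsample(coarse, 2)
    assert up.shape == finer.shape, "score maps must align after up-sampling"
    return up + finer

rng = np.random.default_rng(0)
# Two classes (building / non-building); placeholder scores for a 256x256 input.
score_final = rng.normal(size=(2, 8, 8))    # stride 32 (final prediction layer)
score_pool4 = rng.normal(size=(2, 16, 16))  # stride 16
score_pool3 = rng.normal(size=(2, 32, 32))  # stride 8

fcn32s = upsample(score_final, 32)
fused16 = fuse_skip(score_final, score_pool4)
fcn16s = upsample(fused16, 16)
fcn8s = upsample(fuse_skip(fused16, score_pool3), 8)

print(fcn32s.shape, fcn16s.shape, fcn8s.shape)
```

All three variants return a score map of the input size; they differ only in how much fine-scale information is added before the final up-sampling.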

Problem Formulation
In practice, very few people train an entire CNN from scratch (with random initialization) because of the limited amount of training data. Instead, it is common to take a model pre-trained on a very large dataset and transfer its relevant knowledge as an initialization for a new task. Such models can then be adapted to the new task with relatively little training data. Several networks pre-trained on huge image datasets are available today. The FCN proposed in (Long et al., 2015) was constructed on the basis of the VGG-16 network, for which a model pre-trained on the large public image repository ImageNet (Deng et al., 2009) exists. We take this pre-trained model and fine-tune it for our task, but randomly initialize the last layers, because their channel dimension is changed to 2 in order to predict scores for our binary task.
However, in contrast to ImageNet, which contains RGB images as inputs, our training dataset consists of depth images, which carry completely different information compared to intensities. The main concern is whether a model pre-trained on RGB images is applicable here. It turns out to be suitable, because RGB and depth images share common features such as edges, corners and endpoints at the low and middle levels of the image domain.
The ImageNet database contains images that fit into GPU memory. Remote sensing images are huge and cannot be loaded onto the GPU as a whole. Therefore, in our work the input image I, i.e., the nDSM, and the corresponding target image with labels M, i.e., the building mask, are tiled into patches of size w × h: Patch_i(I)_{w×h} and Patch_i(M)_{w×h}. (Long et al., 2015) performed a cascaded training, starting from the shallower network FCN-32s and gradually adding the "skip" connections to include the high-frequency information from the pool4 layer (FCN-16s) and then from pool3 (FCN-8s). We apply the same procedure to our dataset: first we fine-tune the 32-stride network, then the 16-stride network and finally the 8-stride network, where each network is initialized with the weights of the previous one to speed up the training process. We fine-tune the models by minimizing the negative log likelihood and adjusting the weights and biases along the whole network with the backpropagation algorithm, using stochastic gradient descent (SGD) with a small batch. Mathematically, we solve

$$\min_{W} \sum_i \mathcal{L}\Big(\mathrm{softmax}\big(W_l\, a(W_{l-1}\, a(\cdots\, \mathrm{Patch}_i(I)_{w\times h}))\big),\; y_i\Big),$$

where W_l are the weights of the last layer, W_{l−1} are the weights of the previous layer, a(·) is an activation function, and y_i represents the given true mask patches Patch_i(M)_{w×h}. The softmax function is given by

$$\mathrm{softmax}(z)_c = \frac{e^{z_c}}{\sum_{c'} e^{z_{c'}}},$$

where z is the vector of class scores at a pixel, and the loss function is computed as

$$\mathcal{L}(p, y) = -\sum_k \log p_k,$$

where p_k is the label assignment probability at pixel k, i.e., the probability that the softmax output assigns to the reference label at that pixel.
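As a numerical illustration of the per-pixel softmax and negative log likelihood, the following NumPy sketch evaluates the loss for one patch. It is only a sketch under stated assumptions: the function names are hypothetical, the loss is summed over pixels (whether it is summed or averaged during training is not specified here), and the scores are placeholders rather than outputs of the fine-tuned network.

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the class axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pixelwise_nll(scores, labels):
    """Negative log likelihood of a (2, h, w) score map.

    labels is an (h, w) integer array with 1 = building, 0 = background.
    Assumption: the loss is summed over all pixels of the patch.
    """
    p = softmax(scores, axis=0)
    rows, cols = np.indices(labels.shape)
    p_true = p[labels, rows, cols]  # probability of the reference label per pixel
    return -np.log(p_true).sum()

# Placeholder patch: equal scores give p = 0.5 everywhere.
scores = np.zeros((2, 3, 3))
labels = np.zeros((3, 3), dtype=int)
loss_uniform = pixelwise_nll(scores, labels)

# Raising the score of the correct class lowers the loss.
scores_better = scores.copy()
scores_better[0] += 2.0
loss_better = pixelwise_nll(scores_better, labels)
print(loss_uniform, loss_better)
```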
After training, we take the FCN-8s as the final classifier and perform predictions on a new, unseen dataset. The new data are forwarded through the network as separate patches, and predicted maps of the same size as the patches are obtained. The tiles are then stitched together to generate an image of the same size as the original DSM.

Fully connected Conditional Random Field for object boundaries enhancement
To generate a binary building mask, we need to assign to each image pixel the most suitable label, where 1 corresponds to building and 0 to non-building/background. At the same time, we want to preserve spatial correlations between neighboring pixels and accurately localize segment boundaries.
Modern deep CNN architectures typically produce quite smooth classification results (Chen et al., 2014). Therefore, we are interested in recovering detailed local structures (object boundaries) rather than smoothing the results further. This can be achieved by applying the fully connected Conditional Random Field (CRF) approach proposed by (Koltun, 2011). The fully connected CRF offers an elegant way to combine single-pixel predictions and shared structure through unary and pairwise terms. It differs from a standard CRF by establishing pairwise potentials on all pairs of pixels in the image and not only on neighboring pixels.
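As described above, the predictions are computed by the chosen FCN-8s. These predictions can be seen as pixel-wise unary likelihoods φ_i(x_i) for the fully connected CRF energy function shown in Equation (4):

$$E(x) = \sum_i \varphi_i(x_i) + \sum_{i<j} \psi_{ij}(x_i, x_j), \tag{4}$$

where ψ_ij(x_i, x_j) are the pairwise edge potentials.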
The pairwise edge potentials are defined by a linear combination of Gaussian kernels and have the form

$$\psi_{ij}(x_i, x_j) = \mu(x_i, x_j) \sum_m \omega^{(m)} k^{(m)}(f_i, f_j),$$

where µ is a label compatibility function and k^(m)(f_i, f_j) is a Gaussian kernel, which depends on the features (denoted f) extracted for pixels i and j and is weighted by the parameter ω^(m). The kernels consist of two parts and are contrast-sensitive. They are defined as

$$k(f_i, f_j) = \omega^{(1)} \exp\left(-\frac{|p_i - p_j|^2}{2\theta_\alpha^2} - \frac{|I_i - I_j|^2}{2\theta_\beta^2}\right) + \omega^{(2)} \exp\left(-\frac{|p_i - p_j|^2}{2\theta_\gamma^2}\right),$$

where the first term is called the appearance kernel and depends on both the pixel color intensities (I_i and I_j) and the pixel positions (p_i and p_j). This term encourages assigning similar labels to nearby pixels with similar color. The parameter θ_α controls the degree of nearness and θ_β the degree of similarity. The second term, called the smoothness kernel, is responsible for removing small isolated regions (Shotton et al., 2009).
By applying the described methodology and minimizing the CRF energy E(x), we search for the most probable label assignment for each pixel while taking into account the spatial correlations between pixels. This finally leads to a binary building mask.
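To make the roles of the appearance and smoothness kernels concrete, the sketch below evaluates the pairwise term for a single pair of pixels in NumPy. It is an illustration only, not the FCRF software used in this work: the function names are hypothetical, the default parameter values are the ones reported in the experiments section (θγ = 3, θα = 3, θβ = 11, both weights set to 1), and a Potts model is assumed for the label compatibility function µ, which the text does not specify.

```python
import numpy as np

def pairwise_kernel(p_i, p_j, I_i, I_j,
                    theta_alpha=3.0, theta_beta=11.0, theta_gamma=3.0,
                    w1=1.0, w2=1.0):
    """Appearance kernel plus smoothness kernel for one pair of pixels."""
    d_pos = np.sum((np.asarray(p_i, float) - np.asarray(p_j, float)) ** 2)
    d_int = np.sum((np.asarray(I_i, float) - np.asarray(I_j, float)) ** 2)
    appearance = w1 * np.exp(-d_pos / (2 * theta_alpha ** 2)
                             - d_int / (2 * theta_beta ** 2))
    smoothness = w2 * np.exp(-d_pos / (2 * theta_gamma ** 2))
    return appearance + smoothness

def potts(x_i, x_j):
    """Assumed label compatibility: penalize only differing labels."""
    return float(x_i != x_j)

def pairwise_potential(x_i, x_j, p_i, p_j, I_i, I_j):
    """Pairwise potential: compatibility times the kernel value."""
    return potts(x_i, x_j) * pairwise_kernel(p_i, p_j, I_i, I_j)

# Neighbouring pixels with similar values are penalized strongly for
# receiving different labels; distant pixels hardly interact.
near = pairwise_potential(0, 1, (10, 10), (10, 11), 5.0, 5.5)
far = pairwise_potential(0, 1, (10, 10), (110, 110), 5.0, 5.5)
print(near, far)
```

In the full model this term is summed over all pixel pairs, which is why efficient approximate inference is needed in practice; the sketch deliberately evaluates one pair only.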

STUDY AREA AND EXPERIMENTS
We perform experiments on datasets consisting of a DSM reconstructed from WorldView-2 stereo panchromatic images with a resolution of 0.5 meter per pixel using the semi-global matching methodology proposed by (d'Angelo and Reinartz, 2011). To obtain an nDSM containing only above-ground information, the topography was removed based on the methodology described in (Qin et al., 2016). As ground truth, a building mask provided by the municipality of the city of Munich, Germany, and covering the same region as the DSM is used for learning the parameters of the neural network.
The fine-tuning was done on FCNs implemented in the Caffe deep learning framework. For the learning process we prepared training data consisting of 7161 pairs of patches with a size of 300×300 pixels. To avoid artifacts and object discontinuities at tile boundaries, the patches are generated with an overlap of 100 pixels. We start the fine-tuning of FCN-32s with a learning rate of 0.0001 and decrease it by a factor of 10 for each subsequent stage of the gradual learning. We used a weight decay of 0.0005 and a momentum of 0.99.
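The tiling into overlapping patches and the subsequent stitching of predictions can be sketched as follows. This is a minimal NumPy illustration, not the processing chain used for the experiments: the function names are hypothetical, an overlap of 100 pixels is interpreted as a stride of 200 pixels between 300×300 patches, the last patch along each axis is shifted back to end at the image border, and overlapping predictions are averaged, none of which is specified in the text.

```python
import numpy as np

def tile_origins(length, patch=300, overlap=100):
    """Start offsets along one axis so that overlapping patches cover it."""
    stride = patch - overlap
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # shift the last patch back to the border
    return starts

def tile(image, patch=300, overlap=100):
    """Return a list of ((row, col), patch) pairs covering the image."""
    rows = tile_origins(image.shape[0], patch, overlap)
    cols = tile_origins(image.shape[1], patch, overlap)
    return [((r, c), image[r:r + patch, c:c + patch]) for r in rows for c in cols]

def stitch(tiles, shape):
    """Merge per-patch predictions; overlaps are averaged (assumption)."""
    acc = np.zeros(shape, dtype=float)
    cnt = np.zeros(shape, dtype=float)
    for (r, c), pred in tiles:
        h, w = pred.shape
        acc[r:r + h, c:c + w] += pred
        cnt[r:r + h, c:c + w] += 1
    return acc / cnt

# Placeholder nDSM; the identity "prediction" checks that stitching is lossless.
ndsm = np.arange(700 * 500, dtype=float).reshape(700, 500)
tiles = tile(ndsm)
restored = stitch(tiles, ndsm.shape)
print(len(tiles), np.allclose(restored, ndsm))
```

In an actual run, each patch would be replaced by the network prediction for that patch before calling `stitch`.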
The final binary building mask was obtained using the FCRF software developed in (Koltun, 2011). We chose the smoothness kernel parameter θγ and the appearance kernel parameters θα and θβ by an experimental grid search, varying the spatial and color ranges of these parameters and examining the resulting classification accuracy. We found that θγ = 3, θα = 3 and θβ = 11 work well in practice. The weight parameters ω(1) and ω(2) were set to 1.

RESULTS AND DISCUSSIONS
In this section we present and discuss the results obtained with the methodology described above for binary building mask generation. After the gradual training of the FCNs, we use the final FCN-8s model for our binary classification task. To demonstrate the performance of our model, we present to the network a new test dataset, which was used neither for training nor for validation (see Figure 2).

Qualitative Results
The building mask generated directly from the last layer of the FCN-8s network is presented in Figure 3(b). As can be seen from the results, the FCN model is able to extract only the buildings from the nDSM without any influence of other above-ground objects. However, some noise can be noticed in the form of small regions near building boundaries, together with irregularities of the boundaries themselves. Therefore, the fully connected CRF is applied as a post-processing step to remove this noise and improve the boundaries (see Figure 3(c)). In Figure 3(d) we present the building mask extracted by the four-layer fully connected network from our previous work. The configuration of the network is the same as described in (Davydova et al., 2016), but to allow a comparison we trained it on the same dataset. In comparison to the four-layer network, the FCN manages to extract complex building structures without missing some of their parts (for example, see the red marks in the left and lower parts of Figure 3(c)) and with footprint sizes that are closer to the original ones (see the red mark in the upper part of Figure 3(c)). This can be explained by the fact that the deep neural network learns to take into account the context within a 300×300 pixel window, in contrast to the 32×32 pixel window of the four-layer network. Of course, one can increase the input patch size of the multi-layer network, but this heavily increases the computation time. Besides, the FCN does not over-smooth the object boundaries and can identify more detailed building structures. The absence of some buildings in the resulting mask is due to the low sensitivity of our approach to low-rise buildings that are surrounded by higher buildings. Another possible reason is their small number in the training dataset, which limited the ability of our network to learn this kind of building. Besides, low-rise buildings can be totally covered by trees, which makes their detection impossible.

Quantitative Results
To assess the quality of the proposed methodology on the selected test dataset against the ground truth shown in Figure 3(a), we use metrics that are common in semantic segmentation and suitable for binary classification. Let TP, FP and FN denote the total numbers of true positives, false positives and false negatives, respectively. The metrics are then defined as

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{8}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{9}$$

where Precision is the fraction of predicted positives that are actual positives and Recall is the proportion of actual positives that are predicted positive. The higher these metrics, the better the performance of the model. In addition, we employ the F-measure (see Equation (10)), derived from the precision and recall values in Equations (8)-(9), for the pixel-based evaluation:

$$F_\beta = (1 + \beta^2)\, \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}. \tag{10}$$

For simplicity, we set β = 1. The F-measure reaches its best value at 100% and its worst at 0%.
Another useful metric is Intersection over Union (IoU), which is an average value of the intersection of the prediction and ground truth regions over their union. Here we adapted this metric to the binary case, because in our data many more pixels belong to the background than to the building areas. Therefore, in our case IoU is defined as the number of pixels labeled as building in both the ground truth and the predicted mask, divided by the total number of pixels labeled as building in either of them (Maggiori et al., 2017):

$$\mathrm{IoU} = \frac{|n_{pred} \cap n_{gt}|}{|n_{pred} \cup n_{gt}|},$$

where n_pred denotes the pixels labeled as buildings in the predicted mask and n_gt those in the ground truth. The results are presented in Table 1.
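The pixel-based metrics described above can be computed from two binary masks as in the following NumPy sketch. The function name `building_metrics` and the toy masks are illustrative assumptions; they are not the evaluation code or the data used for Table 1.

```python
import numpy as np

def building_metrics(pred, gt, beta=1.0):
    """Precision, recall, F-measure and binary IoU for the building class."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = int(np.sum(pred & gt))    # building in both masks
    fp = int(np.sum(pred & ~gt))   # predicted building, actually background
    fn = int(np.sum(~pred & gt))   # missed building pixels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    f_measure = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    union = tp + fp + fn           # pixels labeled as building in either mask
    iou = tp / union if union else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "iou": iou}

# Toy 2x3 masks: TP = 2, FP = 1, FN = 1.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
metrics = building_metrics(pred, gt)
print(metrics)
```

Background pixels that are correctly predicted (true negatives) do not enter any of these metrics, which is why they remain informative when background dominates the scene.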
The results show that binary classification of remote sensing data with a deep convolutional network, in our case FCN-8s, outperforms the binary mask generated by the four-layer fully connected network. The FCN followed by a dense fully connected CRF refinement significantly improves the mask quality. This statement is also confirmed from the quantitative point of view (see the second line of Table 1). The Intersection over Union (IoU) was not improved by the post-processing step. This can be explained by the fact that for some buildings the boundaries cannot be clearly seen in the DSM, due to possible overlap with trees or inaccuracies of the DSM itself. As a result, inaccuracies in the building outlines can be observed.

CONCLUSION AND FUTURE WORK
Recent developments in neural network approaches have greatly advanced the performance of visual recognition tasks such as image classification, localization and detection. We proposed to use a fully convolutional network architecture for automatic building footprint extraction from remote sensing data, specifically the nDSM. The main advantage of using DSM data is that they provide the elevation information of the objects, which is crucial for tasks such as building extraction in urban areas. Because satellite images are huge, we tile the nDSM and the available reference building mask into patches. In the first step, the FCNs were trained on the prepared set of patches to extract the building footprints. The predictions generated afterwards are passed as unary terms to the fully connected CRF, and finally the binary building mask is obtained. Experimental results show that deep neural network approaches are suitable for remote sensing applications such as building footprint extraction. The proposed methodology generalizes to various shapes of urban and industrial buildings and is robust to their complexity and orientation. Some buildings remain undetected because they are totally covered by trees or because the DSM itself is noisy in these areas. In our further work, we will refine building outlines directly during the learning process by including additional input data such as panchromatic or RGB images, and we will reorganize the network structure. The extracted building outlines will then be used for 3D model reconstruction.

Figure 1. Fully convolutional network with "skip" connections. The last layers of the network are 2-dimensional feature maps, as we have two classes: building and non-building. A more detailed description is given in Section 2.1.

Figure 3. Results of extraction of building footprints from test region.

Table 1. Binary classification results on the test dataset in comparison to the four-layer network.