BUILDING DETECTION USING AERIAL IMAGES AND DIGITAL SURFACE MODELS

In this paper a method for building detection in aerial images based on variational inference of logistic regression is proposed. It consists of three steps. In order to characterize the appearances of buildings in aerial images, an effective bag-of-Words (BoW) method is applied for feature extraction in the first step. In the second step, a classifier of logistic regression is learned using these local features. The logistic regression can be trained using different methods. In this paper we adopt a fully Bayesian treatment for learning the classifier, which has a number of obvious advantages over other learning methods. Due to the presence of hyper prior in the probabilistic model of logistic regression, approximate inference methods have to be applied for prediction. In order to speed up the inference, a variational inference method based on mean field instead of stochastic approximation such as Markov Chain Monte Carlo is applied. After the prediction, a probabilistic map is obtained. In the third step, a fully connected conditional random field model is formulated and the probabilistic map is used as the data term in the model. A mean field inference is utilized in order to obtain a binary building mask. A benchmark data set consisting of aerial images and digital surfaced model (DSM) released by ISPRS for 2D semantic labeling is used for performance evaluation. The results demonstrate the effectiveness of the proposed method.


INTRODUCTION
Building detection from aerial and satellite images has been a main research issue for decades and is of great interest since it plays a key role in building model generation, map updating, urban planning and reconstruction (Davydova et al., 2016).Various methods have been developed and difference data sources such as aerial images, digital surface/eleviation models, LIDAR data, multi-spectral images, synthetic aperture radar images, have been used for building detection.In this section we briefly review relevant methods in the literature on building detection.Decades ago the initial endeavor for building detection was relying on grouping of low level image features such as edge/line segments and/or corners to form building hypotheses (Ok, 2013).For instance, a generic model of the shapes of building was adopted in (Huertas and Nevatia, 1988) and shadows cast by buildings were used to confirm building hypotheses and to estimate their height.A computational techniques for utilizing the relationship between shadows and man-made structures to aid in the automatic extraction of man-made structures from aerial imagery is described in (Irvin and McKeown, 1989).An approach to perceptual grouping for detecting and describing 3-D objects in complex images was proposed in (Mohan and Nevatia, 1989) and was illustrated by applying it to the task of detecting and describing complex buildings in aerial images.The vertical and horizontal lines identified using image orientation information and vanishing point calculation were used in (McGlone and Shufelt, 1994) to constrain the set of possible building hypotheses, and vertical lines are extracted at corners to estimate structure height and permit the generation of three dimensional building models from monocular views.Due to the neglected performance evaluation in building detection, a comprehensive comparative analysis of four building extraction systems was presented in (Shufelt, 1999) and he concluded that none of the developed systems were capable of handling all of the challenges in building detection.Most of these initial methods rely heavily on the adopted low level features and assumption of a specific type of building hypothesis.However due to the uncertainty in low level features, their performance is not likely to be perfect.
Due to the popularity of space-borne VHR sensors in a wide variety of remote-sensing-related applications, multi-spectral information have motivated new approaches based on machine learning techniques for building detection.For instance, multi-spectral classification and texture filtering were combined in (Zhang, 1999) within a two-level framework to optimize building detection in satellite images.In the first level, a fused image was classified using ISODATA clustering.A filtering method based on a modified co-occurrence matrix was applied in the second level to improve the classification results of the first level.Later on, morphological transformations to build a differential morphological profile was proposed for building detection in (Pesaresi and Benediktsson, 2001) and (Benediktsson et al., 2003).A combined fuzzy pixel-based and object-based approach for classification of urban land cover from high-resolution multi-spectral image data was proposed in (Shackelford and Davis, 2003).This method was demonstrated using pan-sharpened multispectral IKONOS imagery from dense urban areas.A system was developed in ( Ünsalan and Boye, 2005) to detect houses and residential street networks in multispectral satellite images, which produced a highly successful detection rate (94.8%) for house detection.However, because of the assumptions on buildings, it is applicable only to the buildings in North America.Without assumptions on building structure, a generic method for the detection of man-made objects in high resolution optical remote sensing images was developed in (Inglada, 2007) by SVM classification of geometric image features such as geometric invariants and FourierMellin descriptors.Nevertheless, this approach is not designed to detecting building regions instead the patches of a particular size corresponding to building.A novel decision fusion approach to building detection in VHR optical satellite images is proposed in (Senaras et al., 2013).The method combines the detection results of multiple classifiers under a hierarchical architecture, called Fuzzy Stacked Generalization (FSG).However, this method assumes statistical stability of the training and test data.A new approach based on an adaptive fuzzy-genetic algorithm was proposed in (Sumer and Turker, 2013) for building detection using high-resolution satellite imagery.This approach combines a hybrid system of evolutionary techniques with a traditional classification method (Fishers linear discriminant) and an adaptive fuzzy logic component.Nevertheless it is by no means determinist that genetic algorithms can find a global optimum solution.
Methods based on graphical models are quite popular for building detection as well.An MRF model was used in (Krishnamachari and Chellappa, 1996) to group these lines to delineate buildings in aerial images.First straight lines are extracted from images by using an edge detector followed by a line extractor.Then an MRF model is formulated on these extracted lines with a suitable neighborhood.The probabilistic model is chosen to support the properties of the shapes of buildings.In the end, the energy function associated with the MRF is minimized, resulting in the grouping of lines.However, no quantitative results were provided for evaluation.Similarly, a stochastic image interpretation model, which combines both 2-D and 3-D contextual information of the imaged scene, was proposed in (Katartzis and Sahli, 2008) for the identification of building rooftops.However, the approach is only applicable to building rooftops with low inclination.A graphbased approach was developed in (Kim and Muller, 1999) for building detection.The whole process of building detection is divided into four small stages of line extraction, line-relation-graph generation, building hypothesis generation, and building hypothesis verification.This method is yet limited to certain building shapes.Another method for building detection based a graphical method was proposed by (Sirmacek and Unsalan, 2009), where the vertices in the graph are SIFT Keypoints.A multiple subgraph matching method was applied to detection individual building by matching graphs corresponding to a template and a test image.Nevertheless this method is applicable only to urban areas with well-separated buildings.A different method based on graphical modeling of buildings was proposed by in (Cui et al., 2012), where the vertices in the graph are the intersections of line segments.The graph was then adapted to the buildings by filtering the edges by considering the region properties.Then all cycles are search by an algorithm and the most probable cycle were considered as the best building candidate.A novel system was developed by (Izadi and Saeedi, 2012) for automatic detection and height estimation of buildings with polygonal shape roofs in singular satellite images and it employs image primitives such as lines, and line intersections, and examines their relationships with each other using a graph-based search to establish a set of rooftop hypotheses.The height of each rooftop hypothesis is estimated using shadows and acquisition geometry.
Another category of methods for building detection is based on active contour model.For example, a method based snake that combines the regional features of an image with context was proposed in (Peng and Liu, 2005) using the direction of the cast shadows.However, it is not applicable to complex buildings in urban areas.Similarly a modified ChanVese model based level set method was proposed in (Cao and Yang, 2007) for detecting man-made objects in aerial images.A three-stage level set evolution strategy was used to minimize the proposed model energy with a fractal error and texture edge descriptor.Unfortunately the method extracts only the boundaries of the man-made regions instead of the building outlines.A variational framework for building detection was proposed in (Karantzalos and Paragios, 2009) by an integration of multiple competing shape priors that is pose/affine invariant through an explicit estimation of the transformation.However, this method is limited to prior building shape models.A new model, based on level set formulation, is introduced in (Ahmadi et al., 2010) to detect buildings in aerial images using active contour models.All building boundaries are detected by introducing certain points in the buildings vicinity.However, the number of building and background classes must be precisely known a priori to achieve the best results.
There are still a large number of relevant works to this paper.The above introduction is by no means comprehensive.In this paper, we introduce a method for building detection by classifying the pixels comprising building through a logistic regression classifier that is learned by a Bayesian method.In the following sections, the steps comprising the entire method are presented in detail and the method is evaluated using a benchmark data set consisting of aerial images and digital surfaced model (DSM) released by ISPRS for 2D semantic labeling.

Overview of the method
The overview of the proposed method is shown in Figure 1.It consists of three step: Bag-of-Words (BoW) feature extraction, Bayesian learning via variational inference, and pixel labeling via a fully connected CRF model.For the purpose of building characterization, an effective bag-of-Words (BoW) method is applied for feature extraction in the first step.After that, a classifier of logistic regression is learned in this feature space via a Bayesian method.Due to the presence of hyper prior in the probabilistic model of logistic regression, approximate inference methods have to be applied for prediction.In order to be fast, a variational inference method based the mean field theory is applied for inference.In the end, a fully connected CRF model is applied for labeling the building pixels, where the inference is performed by the mean field method.

BoW feature extraction
The Bag-of-Words (BoW) method has a large number of successful applications in various fields.In this paper we follow our previous work (Cui et al., 2015) on BoW method for image classification.The framework of BoW feature extraction is composed of four steps as shown in the second column in Figure 1, namely local feature extraction, dictionary learning, feature coding, and feature pooling.Assume we have a data set of N images Ii, i = 1, ..., N , the first step is to sample a collection of patches from the images in the database.This can be done by dense sampling or sparse detection.The second step is to extract local descriptor vectors x j i ∈ R D , j = 1, ..., M from all patches.The third one is learning a dictionary D = (d1, ..., dK ) ∈ R D×K with K words using all local features.Normally, this is done by a time consuming unsupervised learning method, such as kmeans clustering or a Gaussian mixture model.The elements di in a dictionary are the centers of the clusters.The next step is to find a dictionary-based representation v = [v1, ..., vK ] for each previously extracted local descriptor x.This can be done using hard feature assignment or soft assignment.Hard assignment assigns a single label, i.e., the index of the nearest neighbor in the dictionary, to each local descriptor x.Formally, it is defined as: Thus, the final descriptor representation v = [v1, ..., vK ] has only one non-zero element.The last step is to do the sumpooling 1 of all local descriptors extracted from one image vi = sum(v j i , ..., v j i ).
This method for image classification can be easily extended and applied to pixel classification that is the goal of this paper.To this purpose, we first extract local features as described in our previous work (Cui et al., 2015).Then we calculate a BoW feature representation of a local neighborhood surrounding each pixel.These BoW feature vectors are considered as the a characterization of building pixels and used in the next steps for learning a logistic regression.

Bayesian logistic regression
Logistic regression is a widely used statistical model for the approximation to the underlying functional relation between a feature vector x and a binary response variable y ∈ {−1, 1}.Formally the probabilistic model is defined as where σ(•) is the logistic sigmoid function.w is the model parameters that is going to be inferred by the Bayesian method.Consequently, the probability that y = −1 is p(y = −1|x, w) = 1 − p(y = 1|x, w).Thus, the model can be universally written as p(y|x, w) = σ(yw T x).(3) 1 Sum-pooling is equivalent to computing the histogram in the case of hard feature assignment.
The model is explicitly conditioned on the input data X although they are not random variables.The prior on the model parameters w is assumed to be an isotropic gaussian α is a hyper-parameter in the prior.To be a fully Bayesian treatment, all parameters including unknown quantities of interest and nuisance parameters are considered as random variables and are assumed to be following certain distributions.Therefore we assume a hyper-prior Gamma distribution p(α) = Γ(α|c, d) on α.Thus, the joint distribution of all the parameters {w, α} and the data X = {xi, yi} is that is yet hard to compute because of the model evidence p(y).
Accordingly the predicative distribution can be computed by integrating over the model parameters w, as given in p(y|x * , X, y) = p(y|x * , w) × p(w|X, y)dw (6) Unfortunately, both integral involved in computing the marginal posterior and the predicative distribution are not trackable.Therefore, one has to resort to approximation to the posterior, which can be solved mainly in two ways: deterministic and stochastic approximation.The stochastic methods, mainly referring to various Markov Chanin Monte Carlo (MCMC) method, rely on a large number of samples drawn indecently from the posterior distribution p(w, α|y).In most cases, this is computationally very intensive and it is hard to ensure the independence between samples.Thus, in this paper, we concentrate on deterministic variational approximation based on the mean field theory.

Variational inference
The goal of variational inference (Blei et al., 2016) is to approximate a (posterior) distribution.The key idea is to solve this problem with optimization.A family of distributions over the latent variables, parameterized by free variational parameters, is selected.The optimization finds the member of this family, i.e., the setting of the parameters, that is closest in the Kullback-Leibler divergence to the conditional distribution of interest.The fitted variational distribution then serves as a proxy for the exact conditional distribution.All inferences involved in prediction are computed using the variational distribution instead of the posterior distribution.One widely used family of distributions is the factorized distributions, which leads to the well-known mean field inference.Thus we assume that the posterior distribution can be approximated as q(w, α) = q(w) × q(α).
The variational distribution are found by maximizing the variational lower bound L(q) = q(w, α) ln p(y|X, w) × p(w|α) × p(α) q(w, α) dwdα (7) However, the data log likelihood does not have a conjugate prior in the exponential family and will be approximated by the use of a lower bound of the logistic sigmoid function (Jaakkola and Jordan, 2000), which is with one additional parameter ξ for each observation.Thus, by replacing the sigmoid function in the likelihood in (3) with this lower bound, we can obtain a lower bound in (9) of the data loglikelihood.
ln p(y|X, w) ≥ ln h(w, ξ) Substituting the new lower bound as the data log-likelihood in (9), one can obtain a new lower bound of the variational lower bound in (7).
L(q, ξ) = q(w, α) ln h(w, ξ) × p(w|α) × p(α) q(w, α) dwdα (10) This lower bound is going to be maximized to seek for the variational distributions q(w, α).Nevertheless it contains the parameters ξ.Here we adopt an alternating optimization by maximizing w.r.t either ξ or q(w, α) while fixing the other.While fixing ξ, the variational distributions can be computed by standard variational methods for factorized distributions (Bishop, 2006).The variational distribution for w is given by ln q(w) = ln h(w, ξ) + Eα[ln p(w|α)] + const = ln N (w|µ w , Σw) where Thus, the involved expectations are While fixing the variational distributions q(α) and q(w), the parameters ξ can be obtained by setting the derivative of (10) w.r.t ξ to zero (Bishop, 2006), giving After obtaining the variational distribution q(w), it can be used as a proxy for computing the predictive distribution.Then, the predictive distribution in ( 6) can be approximated as p(y = 1|x * , y) = p(y = 1|x * , w) × p(w|X, y)dw ≈ p(y = 1|x * , w) × q(w)dw It is worth noting that the integrand is quadratic in w.Thus by complete the square in exponential function, the integral can be written as The bound parameter ξ that maximizes ln p(y = 1|x * , y) is given by ξ 2 = x T * ( Σ + μμ T )x T * .Therefore, the predictive density can be computed by iterating over the updates for Σ, μ, ξ until p(y = 1|x * , y) converges (Drugowitsch, 2014).

Dense CRF inference
After completing the variational inference, one obtains a probability map with each pixel having two probabilities p(y = 1|x * , y) and p(y = −1|x * , y).The label of each pixel can be determined by assigning the one with maximum probability.However, there will be a lot of isolated small groups of pixels that are wrongly labeled.Thus, a smoothness constraint should be considered when labeling the pixels based on the probability map.Usually a random filed models is applied.In this paper a fully connected CRF model (Krähenbühl and Koltun, 2011) is employed.Basic CRF models are composed of unary potentials on individual pixels or image patches and pairwise potentials on neighboring pixels or patches.The fully connected CRF uses a different model structure that establishes pairwise potentials on all pairs of pixels in the image, enabling greatly refined segmentation and labeling.
Formally, we consider a random field X defined over a set of variables {X1, X2, ... ,XN }, where each variable takes value from {-1, 1}.In the fully connected pairwise CRF model, the energy function is defined as where x is a particular labeling of X and i, j range from 1 to N .The unary potential φµ(xi) is computed independently for each pixel by the previous variational inference of logistic regression, which is actually a probabilistic distribution over the label assignment.The pairwise potentials in the model is defined as the following line combination of a set of Gaussian similarity functions where pi, pj are the color vectors and Ii, Ij are the position vectors for pixel i and j and w1, w2 are the weights.There are five parameters w1, w2, θα, θ β , θγ that should be set or learned.The maximum a posteriori (MAP) labeling of the random field X is given by x * = argmin E(x).An effective inference method based on the mean field theory using high-dimensional filtering was proposed in (Krähenbühl and Koltun, 2011), which is used in this paper for labelling the building pixels.

Data set and setup
The data that we used for performance evaluation is a benchmark data set consisting of aerial images and digital surfaced model (DSM) released by ISPRS for 2D semantic labeling2 .It has 16 arial images and corresponding DSM in addition to the ground truth of buldings.Example images and DSM in this data are shown in Figure 3.
The BoW features that we used are computed as described in section 2.2.The local features are the vectorized pixel values in a 3 × 3 neighborhood (Cui et al., 2015).The codebook used for feature encoding is a random dictionary, in which the entries are uniform randomly selected from the entire local feature space.The size of the codebook is empirically set to 500.50,000 BoW features are randomly selected from each image for learning the logistic regression model by variational inference.The parameters in hyperprior are set to small values, i.e., 1e-4.The weights in (21) are set to 1.0 and the standard deviations are set to 30.All these parameters are set by try-and-error.The accuracy measures used for performance comparison are precision, recall and F1 score.

Results and discussion
The results for the 16 images are shown in table 1.The detection results using the images shown in Figure 3 are shown in the first row in Figure 4. Compared with the ground truth shown in the second row in Figure 4, the detection results are quite good.
Visually there is no isolated noisy building pixels, which is the

Figure 1 .
Figure 1.The workflow of the proposed method for building detection.It consists of three steps: BoW feature extraction, Bayesian learning, and Dense CRF for building pixel labeling.

Figure 2 .
Figure 2. Graphical model for Bayesian logistic regression.Circles denote random variables and shaded circles represent observations while squares denote deterministic variables.

Figure 3 .
Figure 3. Example images and the corresponding DSMs used for performance evaluation.

Figure 4 .
Figure 4. Example results of building detection: (a)(b)(c) detection results on the images shown in Figure 3; (d)(e)(f) ground truths.

Table 1 .
Performance evaluation in terms of precision and recall.desiredeffect by applying the fully connected CRF model.From the quantitative results shown in table 1, the average precision, recall and F1 score are respectively 75%, 85%,and 79%.In this paper a method for building detection based Bayesian logistic regress and a fully connected CRF model is proposed and demonstrated.It consists of three step: Bag-of-Words (BoW) feature extraction, Bayesian learning via variational inference, and pixel labeling via a fully connected CRF model.Due to the presence of hyper prior in the probabilistic model of logistic regression, a variational inference method based on the mean field theory is applied for learning the model.A benchmark data set released by ISPRS for 2D semantic labeling is used for performance evaluation.The results demonstrate the effectiveness of the proposed method.