EXPLOITING SHADOW EVIDENCE AND ITERATIVE GRAPH-CUTS FOR EFFICIENT DETECTION OF BUILDINGS IN COMPLEX ENVIRONMENTS

This paper presents an automated approach for efficient detection of building regions in complex environments. We use shadow evidence to focus on building regions, and the shadow areas are detected by a recently developed false colour shadow detector. The directional spatial relationship between buildings and their shadows in image space is modelled with the prior knowledge of illumination direction. To do that, an approach based on fuzzy landscapes is presented. Once all landscapes are collected, a pruning process is applied to eliminate the landscapes that may occur due to non-building objects. Thereafter, we use a graph-theoretic approach to accurately detect building regions. We consider the building detection task as a binary partitioning problem where a building region has to be accurately separated from its background. To solve the two-class partitioning, an iterative binary graph-cut optimization is performed. In this paper, we redesign the input requirements of the iterative partitioning from the previously detected landscape regions, so that the approach becomes efficient and fully automated for the detection of buildings. Experiments performed on 10 test images selected from QuickBird (0.6 m) and Geoeye-1 (0.5 m) high-resolution datasets showed that the presented approach accurately localizes and detects buildings with arbitrary shapes and sizes in complex environments. The tests also reveal that even under challenging environmental and illumination conditions (e.g. low solar elevation angles, snow cover), reasonable building detection performance can be achieved by the proposed approach.


INTRODUCTION
Space-borne imaging is a standard way of acquiring information about objects on the Earth's surface. Today, the information obtained is diverse and of high quality owing to the advanced capabilities of satellite imaging, such as the availability of sub-metre resolution optical sensors, broadened spectral sensitivity, and increased data availability. Thus, satellite images are one of the most important data sources for object detection.
Most of the human population lives in urban and suburban environments. Therefore, the detection of man-made features from satellite images is of great practical interest for a number of applications such as urban monitoring, change detection, and estimation of human population. In an early work, Huertas and Nevatia (1988) emphasized the importance of automating the detection and stated the major task: the extraction and description of man-made objects, such as buildings. Since that early paper, researchers from different scientific communities have worked on the same task, and accordingly, a significant number of research studies have been published. Since this paper is devoted to the automated detection of buildings from a single optical image, we very briefly summarize the previous studies that aimed to automatically detect buildings from monocular optical images.
In this paper, we present an automated approach for the detection of building regions from single optical satellite imagery. To focus on building regions, we exploit the cast shadows of buildings, and the shadow areas are detected by a recently proposed false colour shadow detector (Teke et al., 2011). The directional spatial relationship between buildings and their shadows in image space is modelled with the prior knowledge of illumination direction. To do that, an approach based on fuzzy landscapes is presented. Once all landscapes are collected, a pruning process is applied to eliminate the landscapes that may occur due to non-building objects. Thereafter, we use a graph-theoretic approach to accurately detect building regions. We consider the building detection task as a binary partitioning problem where a building region has to be accurately separated from its background. One of our insights is that such a problem can be formulated as a two-class labelling problem (building/non-building) in which the building class in an image corresponds only to the pixels that belong to building regions, whereas the non-building class may involve pixels that do not belong to any building area (e.g., vegetation, shadow, and roads). To solve the two-class partitioning, an iterative binary graph-cut optimization (Rother et al., 2004) is carried out. This optimization is performed in regions of interest (ROIs) generated automatically for each building region, and assigning the input requirements of the iterative partitioning in an automated manner turns the framework into a fully unsupervised approach for the detection of buildings.
The individual stages of our approach are described in the subsequent section. Some of these stages are already well described in Ok et al. (2013), and therefore, these stages are only summarized here. This paper extends our previous work in two respects. First, we aim to improve the pruning step before the detection of building regions. Because water bodies appear dark both in the visible and NIR spectrum, the shadow detector utilized detects water bodies as shadow. To mitigate this problem, we extend the pruning step: we investigate the length of each shadow component in the direction of illumination by enforcing a pre-defined maximum height threshold for buildings. In this way, we eliminate the landscapes generated from large water bodies before the detection of building regions. Second, we improve the way ROIs are generated. In our previous work, the bounding box of each ROI was extracted automatically after dilating the shadow regions. However, we realized that this might cause large ROI regions, particularly where the cast shadows of multiple building objects are observed as a single shadow region. To avoid this problem, in this paper, we generate ROIs from the foreground information extracted from the shadow regions, thereby allowing us to better focus on building regions and their close neighbourhood.
The remainder of this paper is organized as follows. The approach is presented in Section 2. The results of the approach are given and discussed in Section 3. The concluding remarks are provided in Section 4.

Image and Metadata
The approach requires pan-sharpened multi-spectral (B, G, R, and NIR) ortho-images. We assume that the metadata files providing information about the solar angles (azimuth and elevation) at the time of image acquisition are also attached to the images. By definition, the solar azimuth angle (A) in an orthorectified image space is the angle computed from north in a clockwise direction, whereas the solar elevation angle (ϕ) is the angle between the direction of the geometric centre of the sun and the horizon.

The Detection of Vegetation and Shadow Regions
Normalized Difference Vegetation Index (NDVI) is utilized to detect vegetated areas. The index is designed to enhance the image parts where healthy vegetation is observed; larger values produced by the index in image space most likely indicate vegetation cover. We use automatic histogram thresholding based on Otsu's method (Otsu, 1975) to compute a binary vegetation mask, M_V (Fig. 1b, e). A new index is utilized to detect shadow areas (Teke et al., 2011). The index depends on a ratio computed with the saturation and intensity components of the Hue-Saturation-Intensity (HSI) space, and the basis of the HSI space is a false colour composite image (NIR, R, G). To detect the shadow areas, Otsu's method is applied, as in the case of vegetation extraction. Thereafter, the regions belonging to the vegetation cover are subtracted to obtain a binary shadow mask, M_S (Fig. 1c, f).
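As a minimal sketch of the vegetation masking step, the following Python code computes NDVI and thresholds it with a histogram-based Otsu search. The function names, the 256-bin histogram, and the small `eps` guard against division by zero are illustrative assumptions and are not taken from the paper; the shadow index of Teke et al. (2011) is not reproduced here.

```python
import numpy as np


def ndvi(nir, red, eps=1e-9):
    """NDVI = (NIR - R) / (NIR + R); eps is an assumed guard against 0/0."""
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)


def otsu_threshold(values, bins=256):
    """Return the bin centre that maximizes the between-class variance."""
    hist, edges = np.histogram(values.ravel(), bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    weights = hist.astype(np.float64)
    w0 = np.cumsum(weights)                 # class "at or below the bin"
    w1 = weights.sum() - w0                 # class "above the bin"
    cum_mean = np.cumsum(weights * centers)
    mu0 = cum_mean / np.maximum(w0, 1e-12)
    mu1 = (cum_mean[-1] - cum_mean) / np.maximum(w1, 1e-12)
    between = w0 * w1 * (mu0 - mu1) ** 2
    return centers[np.argmax(between)]


def vegetation_mask(nir, red):
    """Binary mask (True = vegetation) from Otsu-thresholded NDVI."""
    index = ndvi(nir, red)
    return index > otsu_threshold(index)
```

A shadow mask could be obtained in the same way by thresholding the shadow index and then clearing the pixels that are set in the vegetation mask.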

The Generation and Pruning of Fuzzy Landscapes
Given a shadow object B (e.g. each 8-connected component in M_S) and a non-flat line-based structuring element $\nu_{\alpha,\sigma,\kappa}$, the landscape $\beta_\alpha(B)$ around the shadow object along the given direction α can be defined as a fuzzy set of membership values in image space (Ok et al., 2013):

$\beta_\alpha(B) = (B^{per} \oplus \nu_{\alpha,\sigma,\kappa}) \cap B^{C}$ (1)

In Eq. 1, $B^{per}$ represents the perimeter pixels of the shadow object B, $B^{C}$ is the complement of the shadow object B, and the operators ⊕ and ∩ denote the morphological dilation and a fuzzy intersection, respectively. The landscape membership values lie in the range between 0 and 1; the membership values of the landscapes generated using Eq. 1 decrease while moving away from the shadow object and are bounded in a region defined by the object's extents and the direction defined by angle α. In Eq. 1, we use a line-based non-flat structuring element generated by combining two different structuring elements with a pixel-wise multiplication (*):

$\nu_{\alpha,\sigma,\kappa} = \nu_{\sigma,\kappa} * \nu_{L,\alpha}$ (2)

In Eq. 2, $\nu_{\sigma,\kappa}$ is an isotropic non-flat structuring element with kernel size κ, and the decrease rate of the membership values within the element is controlled by a single parameter σ as a function of ‖ox‖, the Euclidean distance of a point x to the centre of the structuring element. On the other hand, the flat structuring element $\nu_{L,\alpha}$ is responsible for providing directional information, where L denotes the line segment and α is the angle along which the line is directed. In its definition, the round(·) operator maps the computed membership values to the nearest integer, and θ_α(x, o) denotes the angle difference computed between the unit vector along the direction α and the vector from the kernel centre point (o) to any point x on the kernel. In this paper, we utilized the parameter combination κ = 40 m and σ = 100, which successfully characterizes the neighbourhood of a building region.
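The dilation-and-intersection construction described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the linear decay `1 - d/sigma`, the angular tolerance `half_width_deg` used in place of the line-shaped flat element, the 4-neighbour perimeter definition, and all function names are hypothetical. Distances are in pixels, and `alpha_deg` is an azimuth measured clockwise from north (image rows grow southward).

```python
import numpy as np


def directional_kernel(alpha_deg, radius, sigma, half_width_deg=15.0):
    """Isotropic decay multiplied pixel-wise by a flat directional mask."""
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    dist = np.hypot(xs, ys)
    decay = np.clip(1.0 - dist / sigma, 0.0, 1.0)       # assumed decay law
    azimuth = np.degrees(np.arctan2(xs, -ys))           # clockwise from north
    diff = np.abs((azimuth - alpha_deg + 180.0) % 360.0 - 180.0)
    mask = (diff <= half_width_deg) & (dist <= radius)
    mask[radius, radius] = True                         # keep the kernel centre
    return decay * mask


def fuzzy_landscape(shadow, alpha_deg, radius, sigma):
    """Dilate the shadow perimeter with the kernel, then mask out the shadow.

    radius must be at least 1 (pixels).
    """
    shadow = shadow.astype(bool)
    kernel = directional_kernel(alpha_deg, radius, sigma)
    padded = np.pad(shadow, 1, constant_values=False)
    # Perimeter: shadow pixels with at least one 4-neighbour outside the object.
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = shadow & ~interior
    out = np.zeros((shadow.shape[0] + 2 * radius, shadow.shape[1] + 2 * radius))
    for r, c in zip(*np.nonzero(perimeter)):
        window = out[r:r + 2 * radius + 1, c:c + 2 * radius + 1]
        np.maximum(window, kernel, out=window)          # supremum over stamps
    landscape = out[radius:-radius, radius:-radius]
    # Intersection with the complement of the shadow object.
    return np.where(shadow, 0.0, landscape)
```

Stamping the kernel at every perimeter pixel and keeping the maximum is one way to realize a fuzzy dilation of a binary set; for large kernels a grey-scale morphology routine would likely be faster.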
During the pruning step, we investigate the vegetation evidence within the directional neighbourhood of the shadow regions. At the end of this step, we remove the landscapes that are generated from the cast shadows of vegetation canopies. To do that, we define a search region in the immediate vicinity of each shadow object by applying two thresholds (T_low = 0.7, T_high = 0.9) to the membership values of the fuzzy landscapes generated. Once the region is defined, we search for vegetation evidence within the defined region using the vegetation mask, M_V, and reject a fuzzy landscape region generated from a cast shadow if there is substantial evidence of vegetation (≥ 0.7) within the search region (Fig. 2).
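The vegetation-based rejection test can be sketched in a few lines. The sketch assumes that the "≥ 0.7" criterion refers to the fraction of search-region pixels flagged in the vegetation mask and that both thresholds are inclusive; the function name and these interpretations are assumptions rather than details confirmed by the text.

```python
import numpy as np


def is_vegetation_shadow(landscape, vegetation_mask,
                         t_low=0.7, t_high=0.9, min_ratio=0.7):
    """True if the landscape should be rejected as a vegetation shadow."""
    # Search region: membership values between the two thresholds.
    search = (landscape >= t_low) & (landscape <= t_high)
    if not search.any():
        return False
    # Assumed criterion: share of vegetation pixels inside the search region.
    ratio = vegetation_mask[search].mean()
    return bool(ratio >= min_ratio)
```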
We assess the height difference of the objects compared to the terrain height to separate the landscapes of buildings from those of non-building objects. Based on the assumption that the surfaces on which shadows fall are flat, it is possible to investigate the length of the shadow objects in the direction of illumination to enforce a pre-defined height threshold value. To do that, for a given solar elevation angle (ϕ) and height threshold (T_H), we compute the shadow length (L_H) that should be cast by a building: L_H = T_H / tan(ϕ). Thereafter, we generate a directional flat structuring element whose length is equal to L_H in the direction of illumination. Since the perimeter pixels of the shadow objects are already computed (B^per), for each shadow object, we use the directional flat structuring element to count the perimeter pixels that satisfy the length L_H. In this study, we apply two height thresholds to limit the height of building regions. The lower threshold (3 m) eliminates the fuzzy landscapes that arise due to short non-building objects such as cars and garden walls: if none of the perimeter pixels of a shadow object is found to satisfy the corresponding shadow length, the generated fuzzy landscape is rejected. The upper threshold (50 m) discards the fuzzy landscapes generated from large dark regions such as water bodies, which are incorrectly identified as shadow regions by the shadow detector (Fig. 2f). To do that, we eliminate a landscape if at least one of the perimeter pixels of the shadow object satisfies the shadow length corresponding to the upper threshold.
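The relation L_H = T_H / tan(ϕ) and the two-threshold rule can be illustrated with a short Python sketch. The helper `keep_landscape` and its input, a list of shadow run lengths (in metres) measured from each perimeter pixel along the illumination direction, are hypothetical simplifications of the structuring-element search described above.

```python
import math


def shadow_length(height_m, elevation_deg):
    """Shadow length cast on flat ground by an object of the given height."""
    return height_m / math.tan(math.radians(elevation_deg))


def keep_landscape(run_lengths_m, elevation_deg, t_low_m=3.0, t_high_m=50.0):
    """Keep a landscape only if its longest shadow run lies between the limits."""
    l_low = shadow_length(t_low_m, elevation_deg)
    l_high = shadow_length(t_high_m, elevation_deg)
    longest = max(run_lengths_m, default=0.0)
    # Rejected if no run reaches l_low, or if any run reaches l_high.
    return l_low <= longest < l_high
```

For the lowest solar elevation in the test set (21.54°), a 3 m object casts a shadow of roughly 7.6 m, so the same height threshold corresponds to a much longer shadow than at high sun angles.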

Detection of Building Regions using Iterative Graph-cuts
In this paper, we consider the building detection task as a two-class partitioning problem where a given building region has to be separated from its background accurately (building/non-building). Therefore, the class building in an image corresponds only to the pixels that belong to building regions, whereas the class non-building may involve pixels that do not belong to any building area (e.g. vegetation, shadow, roads etc.). To solve the partitioning, we utilized the GrabCut approach (Rother et al., 2004), in which an iterative binary-label graph-cut optimization is performed.
GrabCut is originally a semi-automated foreground/background partitioning algorithm. Given a group of pixels interactively labelled by the user, it partitions the pixels in an image using a graph-theoretic approach. Given a set of image pixels $z = (z_1, \ldots, z_N)$ in an image space, each pixel has an initial labelling from a trimap $T = \{T_B, T_U, T_F\}$, where T_B and T_F represent the background and foreground label information provided by the user, respectively, and T_U denotes the unlabelled pixels. In addition, each pixel has an initially assigned value $\underline{\alpha} = (\alpha_1, \ldots, \alpha_N)$ corresponding to background or foreground, where $\alpha_n \in \{0, 1\}$ and the underline operator indicates the parameters to be estimated/solved. At the first stage of the algorithm, two GMMs with K components for the foreground (K_F) and the background classes (K_B) are constructed from the pixels manually labelled by the user. Let us define $k = \{k_1, \ldots, k_N\}$ with $k_n \in \{1, \ldots, K\}$ as the vector representing the mixture components for each pixel. Then, the Gibbs energy function for the partitioning can be written as

$E(\underline{\alpha}, k, \underline{\theta}, z) = U(\underline{\alpha}, k, \underline{\theta}, z) + V(\underline{\alpha}, z)$ (5)

where θ denotes the probability density function to be obtained by mixture modelling for each pixel. In Eq. 5, U denotes the fit of the background/foreground mixture models to the data z considering the α values, and is defined as

$U(\underline{\alpha}, k, \underline{\theta}, z) = \sum_{n} D(\alpha_n, k_n, \underline{\theta}, z_n)$

where D(·) favours the label preferences for each pixel z_n based on the observed pixel values. On the other hand, V is the boundary smoothness and is written as

$V(\underline{\alpha}, z) = \gamma_1 \sum_{(m,n) \in C} [\alpha_n \neq \alpha_m] \exp(-\beta \|z_m - z_n\|^2)$

where the term [α_n ≠ α_m] can be considered as an indicator function taking the binary value 1 if α_n ≠ α_m and 0 if α_n = α_m, C is the set of neighbouring pixel pairs computed in the 8-neighbourhood, and β and γ_1 are the constants determining the degree of smoothness. The smoothness term β is computed automatically after evaluating all the pixels in an image, and the other smoothness term γ_1 is fixed to a constant value (50) after investigating a set of images. To complete the partitioning and to estimate the final labels of all pixels in the image, a minimum-cut/max-flow algorithm is utilized. Thus, the whole framework of the GrabCut partitioning algorithm can be summarized as a two-step process (Rother et al., 2004) consisting of an initialization followed by an iterative minimization.
As can be seen from step (i), the initialization of the iterative partitioning requires user interaction. The pixels corresponding to the foreground (T_F) and background (T_B) classes must be labelled by the user, and after that, the rest of the pixels in the image are partitioned. In this part, we integrate the iterative partitioning approach into an automated building detection framework. We denote by T_F the image pixels that are most likely to belong to building areas. On the other hand, T_B corresponds to the pixels of non-building areas. We present a shadow component-wise approach that focuses on the local neighbourhood of the buildings to define T_F. A basic fact common to all images is that the shadows cast by building objects are located next to their boundaries (Fig. 3a). Thus, T_F can be extracted automatically from the directional neighbourhood of each shadow component with the previously generated fuzzy landscapes. To do that, we define the T_F region in the vicinity of each shadow object, whose extents are outlined after applying a double thresholding (η_1 = 0.9, η_2 = 0.4) to the membership values of the fuzzy landscape generated (Fig. 3d). To acquire a fully reliable T_F region, a refinement procedure that involves a single parameter, the shrinking distance (d = 2 m), is also performed (Ok et al., 2013).
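To make the boundary smoothness term of the GrabCut energy concrete, the sketch below evaluates it for a given binary labelling over the 8-neighbourhood with γ_1 = 50. Setting β to the inverse of twice the mean squared difference between neighbouring pixels follows the choice in Rother et al. (2004) and is assumed here, since the text only states that β is computed automatically; the function names are hypothetical, and no distance weighting is applied to diagonal pairs.

```python
import numpy as np


def _pairs(a):
    """Neighbouring slices covering every 8-neighbourhood pair once."""
    return [(a[:, :-1], a[:, 1:]),        # horizontal
            (a[:-1, :], a[1:, :]),        # vertical
            (a[:-1, :-1], a[1:, 1:]),     # diagonal, down-right
            (a[:-1, 1:], a[1:, :-1])]     # diagonal, down-left


def smoothness_energy(image, labels, gamma=50.0):
    """Sum of gamma * [a_n != a_m] * exp(-beta * ||z_m - z_n||^2) over pairs."""
    z = image.astype(np.float64)
    if z.ndim == 2:
        z = z[..., None]                  # treat grey images as one band
    sq = [((p - q) ** 2).sum(axis=-1) for p, q in _pairs(z)]
    mean_sq = np.concatenate([s.ravel() for s in sq]).mean()
    beta = 1.0 / (2.0 * mean_sq) if mean_sq > 0 else 0.0   # assumed choice
    energy = 0.0
    for s, (a, b) in zip(sq, _pairs(labels)):
        energy += np.sum((a != b) * np.exp(-beta * s))
    return gamma * energy
```

A labelling whose boundary follows strong colour edges pays a small penalty, while a boundary crossing homogeneous regions is penalized by nearly γ_1 per pair.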
In this study, we present a region-of-interest (ROI) based iterative partitioning. In Ok et al. (2013), we performed the iterative partitioning locally for each shadow component in a bounding box covering only a specific ROI whose extents were extracted automatically after dilating the shadow region. The dilation was performed with a flat line kernel defined in the opposite direction of illumination, and since the ROI must include all parts of a building to be detected, the size of the building in the direction of illumination was taken into account. During the generation of the ROIs, the size was controlled by a single dilation distance parameter, the ROI size (= 50 m), which was also defined in the opposite direction of illumination. The bounding boxes generated by dilating the shadow components work well in most cases; however, under certain conditions (e.g. acute solar elevation angles, dense environments etc.), this might cause large ROI regions to be produced for multiple building objects (Fig. 4c). To avoid this problem, as a contribution original to this work, we generate ROIs from the foreground information T_F (Fig. 4d). Since the generated T_F regions are separated in such cases, this provides an opportunity to define the ROIs separately (Fig. 4f). Thus, this strategy allows us to better focus on individual building regions and their close neighbourhood, independent of the shadow component utilized.
Once the bounding box of a specific ROI is determined, we automatically set up the pixels corresponding to background information (T_B) within the selected bounding box. To do that, we search for shadow and vegetation evidence within the bounding box and label all those areas as T_B. In addition, we also label the regions outside the ROI region within the bounding box as T_B, since we only aim to detect buildings within the ROI region for a given foreground information.
Finally, we remove the small-sized artefacts that may occur after the detection stage. To do that, a threshold (T_area = 30 m²) is employed to define the minimum area enclosed by a single building region.
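The automatic assignment of foreground and background seeds within a bounding box can be sketched as a trimap construction. The label codes, the function name, and the rule that shadow or vegetation evidence overrides a foreground seed are assumptions made for illustration; all inputs are boolean arrays cropped to the same bounding box.

```python
import numpy as np

# Hypothetical label codes for the three trimap classes.
BACKGROUND, UNKNOWN, FOREGROUND = 0, 1, 2


def build_trimap(foreground, roi, shadow_mask, vegetation_mask):
    """Combine T_F seeds, the ROI, and shadow/vegetation masks into a trimap."""
    evidence = shadow_mask | vegetation_mask
    trimap = np.full(roi.shape, UNKNOWN, dtype=np.uint8)
    trimap[evidence] = BACKGROUND          # shadow and vegetation become T_B
    trimap[~roi] = BACKGROUND              # outside the ROI becomes T_B
    # Assumed priority: background evidence wins over a foreground seed.
    trimap[foreground & roi & ~evidence] = FOREGROUND
    return trimap
```

Pixels left as `UNKNOWN` are the ones whose labels the iterative partitioning would then have to decide.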

RESULTS AND DISCUSSION
The test data involve images acquired from two different satellites (QuickBird and Geoeye-1), which are capable of providing sub-metre resolution imagery, and all images are composed of four multi-spectral bands (R, G, B and NIR) with a radiometric resolution of 11 bits per band. The assessments of the proposed approach are performed over 10 test images which differ in their urban area and building characteristics as well as in their illumination and acquisition conditions. The first three test images (#1-3) belong to a QuickBird image, whereas the rest (#4-10) are selected from different Geoeye-1 images. The solar elevation angles tested range between 21.54° and 78.12°, and the images were acquired with off-nadir angles of at most ≈ 18 degrees. To assess the quality of our results, they are compared to reference data. The precision, recall and F1-score (Aksoy et al., 2012; Ok et al., 2013) performance measures are determined both on a per-pixel and a per-object level. For the object-based evaluation, a building region is considered to be a true positive if 60% of its area is covered by a building region in the reference.
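For reference, the three performance measures can be computed from true-positive, false-positive and false-negative counts as in the sketch below. The function names are hypothetical, and reading the 60% rule as "at least 60% of the detected region's area" is an assumption about the exact criterion.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score from TP, FP and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1


def is_object_true_positive(detected_area, overlap_area, min_overlap=0.6):
    """Per-object rule: enough of the detected region overlaps the reference."""
    return detected_area > 0 and overlap_area / detected_area >= min_overlap
```

The same function serves both evaluation levels: counts are pixels in the per-pixel case and building regions in the per-object case.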
We visualize the detection results in Fig. 5. According to the results presented, the developed approach appears to be robust, and the regions detected are found to be satisfactory. The building regions are well detected despite the complex characteristics of buildings in the test images, e.g. roof colour and texture, shape, size and orientation. The numerical results in Table 1 support these observations. Considering the per-pixel evaluation, the overall mean ratios of precision and recall are computed as 79.1% and 85.5%, respectively. The computed pixel-based F1-score for all test images is around 82%. For the per-object evaluation, the overall mean ratios of precision and recall are computed as 92.8% and 79.9%, respectively. This corresponds to an overall object-based F1-score of approximately 86%. If the complexities of the test images and the imaging conditions involved are jointly taken into consideration, we believe that this is a promising building detection performance. The lowest precision and recall ratios for both per-pixel and per-object assessment are obtained for test image #10. This is not surprising, since that image was acquired in the winter season with a very low solar elevation angle (21.54°) and the region is covered by snow. The snow cover, together with the severe shading effects that the low solar elevation angle causes on building rooftops (especially for buildings having gable roof styles with specific orientations), limits the detection. Besides, it is rather difficult to detect shadow areas in a snow-covered image, because cast shadows of buildings that fall over a bright surface may have a significantly reduced saturation component. As a result, the effectiveness of the index used to detect shadow areas is reduced dramatically, which also has a major influence on the final performance of the proposed approach. The second lowest per-pixel precision is obtained for test image #3, and the main reason is the two large bridges used for vehicular traffic in the upper-right corner of the image. The height threshold works well to eliminate the landscapes generated from non-building objects, since these objects generally have height differences of less than 3 m compared to the terrain height. However, in certain cases such as large bridges, the height of non-building objects exceeds the given threshold. As a result, it is not possible to avoid such cases, and some parts of the road segments might be labelled as building regions. Besides, our approach may over-detect some building boundaries. This is due to two specific reasons. First, some parts of the building boundaries may have a very smooth transition to their surroundings. Second, a building may involve several roof parts that are identical to their surroundings although the main colour of the rooftop is distinguishable from its background. Nevertheless, we think that most of the over-detections can be corrected with further high-level processing.
The results show that the approach presented is generic for different roof colours, textures and types, and has the ability to detect arbitrarily shaped buildings in complex environments. According to the results provided in Table 1, the highest F1-scores are achieved for test image #1, where the buildings are formed in a single-detached style. Besides, for most of the test images, our approach provides quite satisfactory object-based ratios. This is because our approach labels a region as a building only if a valid shadow region is detected. Therefore, we can conclude that the presented approach for building detection is robust from an object-based point of view. Besides the mentioned advantages, the proposed approach is also time-efficient. The images are processed on a PC with an Intel i5 2.6 GHz CPU and 4 GB RAM, and the processing requires less than 30 seconds for each image on average.
Figure 2. (a, d) Geoeye-1 pan-sharpened images (RGB). The fuzzy landscapes generated using the shadow masks provided in Fig. 1c, f are illustrated in (b, e), respectively. The fuzzy landscapes after applying the pruning step are shown in (c, f).
(iii) Initialize the mixture models for the background and foreground classes.
Iterative minimization:
(iv) Assign GMM components to each pixel in the unlabelled region.
(v) Extract GMM parameters from data z.
(vi) Solve the optimization using min-cut/max-flow.
(vii) Repeat steps (iv)-(vi) until convergence.

International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XL-1/W1, ISPRS Hannover Workshop 2013, 21-24 May 2013, Hannover, Germany

Figure 3. (a) Geoeye-1 image (RGB), (b) the detected shadow (blue) and vegetation (green) masks. (c) Fuzzy landscape generated from the shadow object with the proposed line-based non-flat structuring element, (d) the final foreground pixels (T_F) overlaid with the original image.
Figure 4. (a) Geoeye-1 image (RGB), (b) a single shadow component detected, and (c) the large ROI region generated. (d) The foreground information T_F (without refinement) generated from the shadow component in (b). (e) One of the T_F regions and (f) the ROI formed for that region after dilation.

Figure 5. (first column) Test dataset (#1-10), (second column) the results of per-pixel evaluation, and (third column) the results of per-object evaluation. Green, red and blue colours represent true-positive, false-positive and false-negative, respectively.

Table 1. Performance results of the proposed approach.