HIGH-RESOLUTION URBAN MAPPING BY FUSION OF SAR AND OPTICAL DATA

Mapping the exact extent of urban areas is a critical prerequisite in many remote sensing applications, such as hazard evaluation and change detection. The usage of Synthetical Aperture Radar (SAR) data has gained popularity due to the unique characteristics of the backscattered radio signal from human-made targets. The Sentinel-1 (S1) constellation, with a global revisit time of 6-12 days in Interferometric Wide Swath (IW) mode and free and open access to the data, allows the development of new applications to monitor urban sites. However, S1 is rarely considered when fine resolution is required due to the large pixel size and the need for spatial averaging to obtain robust estimators. We propose a method to improve Sentinel-1 urban classification performance by exploiting one Multi-Spectral (MS) image acquired by Sentinel-2 (S2). MS data is used for tracing the precise natural boundaries in a scene through superpixels segmentation. A machine learning approach is then applied to interpret the thematic context of each segment from short temporal stacks of coregistered SAR data. We use a short sensing period (around two months), so rapid changes can be traces. The proposed fusion of S1 and S2 data was tested in the area of Milan (Italy), with a total accuracy of about 90%. The ability to follow high-resolution details in a mixed environment is demonstrated, opening the possibility of efficiently tracing the human footprint.


INTRODUCTION
In recent decades, mapping urban areas and change have been vital in facing numerous environmental and cadastral challenges. Many coarse resolution solutions were developed (Bartholomé andBelward, 2005, Chen et al., 2015), providing insight into large-scale trends. Deriving urban maps with a fine level of detail (< 20m) is usually well performed by high resolution MS remote sensing instruments (Thomas et al., 2003, Malarvizhi et al., 2016. SAR data holds great potential in urban mapping because of the sensed signal's nature, which is very much different between urban and natural areas. While overcast weather limits the possibility of processing continuous time-series using MS data, radio signal penetrates clouds, allowing regular sampling worldwide. The launch of several high-resolution missions made SAR a strong candidate for the task (Esch et al., 2017), especially where cloud coverage prevents regular time sampling by optical means. Sentinel-1, a C-band constellation with nominal 5-by-20 meter resolution (in the standard Interferometric Wide mode), is rarely considered for fine resolution applications because of the coarse pixel size. However, some advancement has been accomplished recently (Chini et al., 2018, Sun et al., 2019, motivated by open data availability and regular worldwide acquisitions. Traditionally, resolution enhancement is obtained by replacing spatial averaging with temporal one, requiring a long time series. While effective, it reduces the sensitivity to fast changes. Fusion of SAR and MS was shown to improve classification performance and increase resolution (Ban et al., 2010, Clerici et al., 2017. The work presented here explores the possibility of combining the unique SAR capabilities of detecting stable targets, i.e., urban structures, with the fine resolution of optical surveys. First, segmentation of an optical image is performed, such that the final product follows the visible borders between land covers. A Machine Learning (ML) model is then trained to identify urban segments by features retrieved from a SAR stack of the same area. The classification products are given as the percent of urban pixels in a segment, and the value is finally thresholded to achieve binary classification. * Corresponding author.
The remainder of this paper is organized as follows. SAR feature extraction is presented in Section 2. MS image segmentation approach is given in Section 3. A description of the method is presented in Section 4. The dataset used for testing and the final results are reported in section 5. Finally, conclusions and further discussions are summarized in Section 6.

SENTINEL-1 DATA FOR URBAN CLASSIFICATION
Given a set of homogeneous SAR pixels, numerous features can be extracted for their characterization. In the field of urban mapping some possible solutions are: backscatter intensity (Strozzi and Wegmuller, 1998), backscatter temporal variability and long term coherence (Bruzzone et al., 2004), entropy of the polarimetric coherence matrix (Cloude and Pottier, 1997) and texture analysis (Nyoungui et al., 2002). This section covers the set of features we use for classification. The features were selected to match the data characteristics of S1, and the possibility for a robust estimation over a short multi-temporal stack.

Differential Entropy
The stability of targets in SAR images over time is a feature of interest when studying the difference between urban areas and natural surfaces, since human-made targets tend to show little change over time. We propose an innovative quantification of temporal stability, in the form of an entropy measure, which is both robust and easy to compute. Differential entropy (Cover, 1999) is an information theory quantity, describing the level of uncertainty for continuous random variables. When a target is very stable, i.e., the pixels are highly correlated over time, the entropy is expected to be low, since the knowledge of one outcome, infers on the others. For a continuous random variable S with density f(x), the differential entropy is defined by: The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2021 XXIV ISPRS Congress (2021 edition) The definition can be extended to a set of random variables. The vector of complex SAR measurements S = [S1, S2, .., SN ] T has a multivariate circular normal distribution with covariance matrix Γ: To observe the phase stability of the series, it is useful to normalize Γ with respect to the amplitude variations, i.e., obtaining the coherence matrix C. For a set of random variables described by the distribution in Equation 2, the differential entropy holds a closed form solution: (3) Figure 1 shows the effectiveness of the differential entropy to highlight urban constructions. We use it for classification purposes, as it captures the coherence deterioration process without model fitting, which is sensitive to noise. The feature is computationally efficient, as it requires merely a matrix inversion.

Coherence Between Co-Pol and Cross-Pol Polarizations
Polarization is a property defining the alignment of the Electro-Magnetic (EM) field in reference to a plane perpendicular to the direction of propagation. The response of a target to an EM signal can be decomposed to its horizontal and vertical polarization components (Freeman et al., 1992): where E R and E T are the received and transmitted signals, respectively, and S is the complex scattering matrix. Urban areas are characterized by a high concentration of Permanent Scatterers (PS) (Ferretti et al., 2001), i.e., strong reflection in the direction of the antenna, as coming from a single point. Such targets were shown to have a unique signature in the different polarization channels (Atwood and Thirion-Lefevre, 2018). The effect can be measured by the coherence between the co-pol and cross-pol channels: where SV V and SV H are the complex SAR signals, for the two polarization channels. E[.] denotes the expected value. γ V V /V H (P ) was shown to be higher with respect to natural targets (Prati et al., 2018), as can be seen in Figure 1. The feature helps to emphasizes dihedral and trihedral structures (PS), and was selected due to the availability of VV and VH polarizations in S1 data. Note that the effect described above depends on the incidence angle, which explaines the distribution of γ V V /V H values in Figure 1.

Backscatter Intensity
PS targets are known to result in a high-intensity response due to the specular reflection. Most resolution cells in a SAR image are not dominated by a stable and strong scatterer (PS), but by a superposition of many independent targets. In this case we are in presence of the so-called Distributed Scatterer (DS). The received complex signal (i.e. a pixel) can be statistically described as a realization of a random variable with complex normal distribution: where σ is the backscatter coefficient. The estimate of backscatter per given area is useful for the characterization of scatterers (Sica et al., 2019). The combination of smooth surfaces and PS targets in urban zones causes the measured backscatter to be high, with respect to agriculture fields or forests. Incidence angle affects backscatter, and may lead to misclassification in mountain regions (Chini et al., 2018). Thus, we use σ 0 (Small, 2011), i.e., the backscatter per area on the ground, calibrated with respect to the local incidence angle: The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII- B3-2021XXIV ISPRS Congress (2021 where N is the number of images in the stack, In is the amplitude in image n, θ i n is the local incidence angle. Values of σ 0 in Figure 1 are shown to be high for urban regions. Note that some areas show significantly high values, possibly due to the orientation of buildings, which causes increased specular response.

GEOMETRY EXTRACTION FROM A SENTINEL-2 IMAGE
A segmentation algorithm aims to divide an image into nonoverlapping regions which exhibit certain properties, such as spectral homogeneity or compacity. The identified groups of pixels are treated as objects, from which characteristics can be extracted. Consequently, the volume of data to be studied is reduced in terms of the number of elements to inspect. Segmentation algorithms enable the so-called object based processing (Blaschke, 2010) (as opposed to the traditional pixels based analysis), and are increasingly used for remote sensing applications (Jacob et al., 2020). In this work we use the SLIC superpixels algorithm (Achanta et al., 2012) for the definition of geometrical features. Pixels belonging to similar land cover types are grouped according to their MS signature. As opposed to other segmentation methods, superpixels do not try to capture the entire object; rather, they almost always result in an over-segmentation of the surfaces in the image. The obtained clusters adhere to the natural borders of the scene, i.e., the shape of edges is well represented in the segmented image. Superpixels fit well this application due to their ability to accurately describe shapes in the image, with a set of segments that are similar by size. The latter guarantees that a sufficient number of pixels are used to compute SAR-based features, avoiding, for example, estimation of coherence of single pixels. The process of grouping pixels into a unique cluster allows for significant dimensionality reduction, without substantial deterioration in resolution.
In Figure 2, a 2.5Km X 1.5Km area around Lainate (Italy) is used to demonstrate the segmentation algorithm's effectiveness. The processing is performed in the S2 native resolution (10m) to maximize the retrieved detail level.

PROPOSED METHOD
SAR features are powerful in highlighting urban areas. However, due to the relatively large estimation window, the level of detail is coarse. This section demonstrates how the features discussed in section 2. can be used efficiently in object-based processing, improving the obtained resolution. The method requires two S1 stacks of at least eight images each, from ascending and descending orbits, and one S2 image with low cloud coverage from the same period. First, we segment the S2 image using SLIC superpixels. The output of the segmentation process is a segment index for each pixel. To use the geometrical knowledge with S1 data, we project the indexes into the SAR coordinate system (i.e., range-azimuth), defined by the master acquisition. This allows analyzing SAR coherence and intensity without prior resampling and multi-looking. SAR multi-temporal features are then computed over the segments since pixels labeled by the same segment are assumed to belong to a similar land cover. Replacing the boxcar filter with a data-driven local window has a twofold benefit: better homogeneity in the estimation process and adhering to the visible borders in the image. The orientation of urban structures is distributed randomly, which affects the characteristics of the returned signal. To enhance accuracy, we use two stacks with different orbits (i.e., ascending and descending). Distortions induced by the geometry of the SAR acquisition appear differently for the two orbits, improving the feasibility of detecting urban structures. Figure 3 shows the result of feature extraction over the segments. Note the large difference in γ V V /V H for the two stacks, due to the difference in the double bounce effect, as it is experienced from different incidence an-

gles.
Despite the best efforts to define a precise segmentation procedure, some segments will not contain only samples from one class, as typical for roads, parks, etc. Using traditional binary classification methods can bias the learning mechanism. We chose a regression approach to overcome this issue. We aim to obtain a model which is able to predict the percent of urban pixels per segment. The Random Forest (RF) regressor (Cutler et al., 2012) is an ensemble learning method, which generates a set of Decicision Trees (DT) from the training data. DTs are constructed such that splits minimizes an impurity measure. The prediction of unseen The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2021 XXIV ISPRS Congress (2021 edition) cases is performed by aggregating results of multiple DTs. We identified RF as a suitable solution, because of the ability to effectively make predictions without a prior assumption of a model. Also, a low number of calibration parameters is required, reducing tuning sensitivity. For each DT i, a training set Di is sampled with replacement (i.e., bootstrap) from the available set. The frequency in which a sample is chosen during training can be controlled via weighting. We use the size of a segment as the weight, since larger estimation windows are known to be more robust.
The set of SAR features are used as input to an RF calibrated model. The prediction is done independently on ascending and descending stacks as depicted in Figure 4. ) are combined to a binary label according to the following rule: where 1 notes the urban class. T1 and T2 were empirically calibrated to maximize accuracy: T 1 = 0.5; T 2 = 0.2. Note that urban pixels surrounded by a large number of not-urban pixels will be misclassified. This is the disadvantage of using object-based techniques in comparison to pixel-based ones. However, those are usually isolated targets or borders that are not of great importance, and the added benefit surpasses the drawbacks. We project the labeled segments back to the georeferenced domain, using the pixels-to-segment map generated during segmentation. The output of the method is a binary map with a 10m pixel size. Figure 5 demonstrates the benefit of using two stacks. While classification results are similar for both stacks, some differences are seen. The combination allows obtaining a map that closely follows the ground truth.

DATASET
The Spring of 2018 (April-June) was considered for the evaluation of the classification procedure. Springtime was chosen to enhance the difference between urban and natural land covers, as vegetation coverage is high. Two stacks of Single Look Complex (SLC) SAR images were collected from ascending and descending orbits. Each stack was coregistered to its unique master. We use a regular sampling period of 6 days, which generally possible due to the regular acquisitions of S1. One S2 image was retrieved from the same period (April 2018). Low cloud coverage is needed to enable precise segmentation. The chosen images showed 2.4% cloud coverage. Training and testing require an accurate ground truth. We used the DUSAF6 land use map provided by the Lombardy region (Italy), updated to 2018. DUSAF6 has a scale of 5m and is based on aerial photogrammetry. Labels are organized into three principal levels of detail, shown in Figure 6. cover map was reclassified, following the DUSAF6 level-I labels. Some adjustments were adopted so to consider the nature of the SAR signal. For example, greenhouses, labeled as Agricultural areas by DUSAF6, are regarded as urban here. The ground truth layer was aligned to the Sentinel-2 image, i.e., resampled and clipped in the same 10mx10m grid. Note that the test site was chosen to have a similar distribution of urban and non-urban targets, so the accuracy is a reliable estimator of performance.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2021 XXIV ISPRS Congress (2021 edition)

RESULTS
After image segmentation, the training of the RF model was performed on 5% of the segments. The evaluation of accuracy considers all pixels which were not used for training. Figure 7 presents the results of the classification process. An overall accuracy of 90.29% was achieved. Note the level of obtained details, such as roads and complex clusters of buildings, which is not feasible by using only S1 data. The benefits of using two stacks are demonstrated in Table 1. The overall agreement with the ground truth is evident from Figure 7. However, some differences are noticeable. We argue that part of the discrepancy can be explained by inconsistensies of Figure 7: Urban classification results. Green: urban pixels detected both by the method and the ground truth, Red: urban pixels detected only by the proposed method, Blue: urban pixels detected only by ground truth.  Figure 8. Pixels marked in red are classified as urban while appear as non-urban in the ground truth, maybe due to lack of update of the DUSAF6 dataset. Pixels marked in blue are the missed detections. It is clear that some areas are labeled urban in DUSAF6 even though they are not covered by an urban structure (i.e., building, road, etc.). Green pixels mark the correctly labeled urban pixels, which closely follow the shapes of the buildings.

DISCUSSION AND CONCLUSION
This article suggests a method to map urban extents with 10m resolution using only openly distributed data, i.e., Sentinel-1 and Sentinel-2. The approach characterizes objects defined in the optical domain, which are less prone to noise and are available in higher resolution, with features from the SAR domain, which are more robust in differentiating urban areas. The ability to observe a high level of detail, which is not possible with standard SAR techniques, is demonstrated. Urban areas, which are usually defined by clusters of buildings over confined territory, can be traced with an enhanced level of accuracy. Efficiency is gained by selecting a set of computationally simple features and the dimension reduction introduced by segmentation. Innovative utilization of the differential entropy allows for the generation of a very powerful feature in terms of separating natural and urban environments, with the low cost of computing a determinant of an NxN matrix (being N the number of SAR images used for the processing).
The methodology was tested with S1 and S2 data over an area in the north of Italy, showing an overall accuracy of about 90%. An RF model was trained to estimate the thematic content of clusters of pixels based on the DUSAF6 land cover map. Errors in classifications of small features may be explained by the nature of the segmentation procedure, limiting the minimal size of a segment. Fine urban features with not-urban surroundings (or vice-versa) are challenging to detect. Also, DUSAF6 is published in 5m resolution, while Sentinel data is provided with a coarser resolution. DUSAF6 is considered a reliable representation of the ground truth. However, in some cases, labels do not coincide with the actual land cover due to possible errors or undocumented changes. The percent of wrongly labeled pixels is assumed to be low enough to not significantly bias the learning procedure, allowing the model to learn the main characteristics of an urban segment. Joint usage of two orbit direction was shown to improve classification performance. The accuracy is relatively high also in the case of one stack, if a second is not available. However, the different looks provide complementary data due to features dependence on incidence angle. Combining the SAR and MS allows to overcome the S1 pixel size limitation and makes it a viable candidate for precise monitoring applications. Dependency on fair weather conditions is also relaxed, as just one S2 image is required, from any period where boundaries between land covers are assumed to be stable. Only two months of data were used for the result reported here, a significant reduction in the required period compared to other known solutions. The short sensing period allows for tracking rapid scene changes; thus, it can be suitable for near real-time applications. The technique can be well suited as a complementary tool for change detection tasks, where high-resolution urban masks are often needed, to discriminate types of changes.