INCORPORATING INTERFEROMETRIC COHERENCE INTO LULC CLASSIFICATION OF AIRBORNE POLSAR-IMAGES USING FULLY CONVOLUTIONAL NETWORKS

Inspired by the application of state-of-the-art Fully Convolutional Networks (FCNs) for the semantic segmentation of high-resolution optical imagery, recent works transfer this methodology successfully to pixel-wise land use and land cover (LULC) classification of PolSAR data. So far, mainly single PolSAR images are included in the FCN-based classification processes. To further increase classification accuracy, this paper presents an approach for integrating interferometric coherence derived from coregistered image pairs into an FCN-based classification framework. A network based on an encoder-decoder structure with two separated encoder branches is presented for this task. It extracts features from polarimetric backscattering intensities on the one hand and interferometric coherence on the other hand. Based on a joint representation of the complementary features, pixel-wise classification is performed. To overcome the scarcity of labelled SAR data for training and testing, annotations are generated automatically by fusing available LULC products. Experimental evaluation is performed on high-resolution airborne SAR data, captured over the German Wadden Sea. The results demonstrate that the proposed model produces smooth and accurate classification maps. A comparison with a single-branch FCN model indicates that the appropriate integration of interferometric coherence improves classification performance.


INTRODUCTION
The automatic analysis of Synthetic Aperture Radar (SAR) images, which can be acquired independently of cloud cover, weather conditions and daylight, allows the generation of up-to-date land use and land cover (LULC) maps. These maps provide an essential prerequisite for the efficient planning and management of urban and agricultural land use as well as for environmental monitoring. The task that underlies the generation of LULC maps is to semantically segment captured SAR images. In the course of strongly increasing data availability, particularly methods from the field of machine learning have proven to be suitable for this purpose. For example, the use of a Random Forest (RF) (van Beijma et al., 2014) or Support Vector Machines (SVMs) (Huang et al., 2002) achieves good results for pixel-based classification tasks.
However, the success of machine learning approaches strongly depends on the design and composition of suitable features. Typically, handcrafted low-level features are used, which often have the disadvantage of being location- and data-specific. Furthermore, such features are usually engineered for a particular task, which limits the ability to generalise to other requirements. These challenges are countered by methods from the field of Deep Learning (DL). DL methods have the ability to learn abstract, hierarchical features from raw data, thus eliminating the need for heuristic feature engineering and increasing generalisation performance. In addition, end-to-end training schemes allow task-specific outputs to be provided without expensive pre- and post-processing of data. For the task of semantic segmentation, particularly Fully Convolutional Networks (FCNs) have become established.
While FCNs have been used very successfully for LULC classification of optical imagery (Kampffmeyer et al., 2016; Fu et al., 2017; Mboga et al., 2019), the potential of this method for application to SAR data has not yet been fully exploited. Due to the intrinsic differences between the imaging mechanisms of SAR and optical images, FCNs that have been pretrained on optical data do not achieve satisfactory results (Yao et al., 2017). In contrast, a complete training of FCNs from scratch with annotated SAR data is promising, because domain-specific low-level and high-level features can be learned. Therefore, this method can be successfully used for LULC classification of polarimetric SAR (PolSAR) images. For instance, Cao et al. (2019) introduced a complex-valued FCN designed for PolSAR image classification that outperforms conventional machine learning tools (e.g. RF and SVM). To distinguish several LULC classes, Li et al. (2018) proposed a sliding window FCN and reduced time and memory consumption by using sparse coding. Despite these encouraging results, further investigations are necessary to exploit the full potential of FCNs for pixel-wise LULC classification. Most previous work only employs information contained in single PolSAR images. In contrast, this work includes interferometric SAR (InSAR) measurements to further enhance classification performance. SAR interferometry has proven to be a valuable technique that allows the measurement of geophysical parameters such as surface topography or ground deformation. The central idea of InSAR is to gain information by comparing the phase of two radar images, which capture the same scene from slightly different positions at the same time (single-pass) or with a time offset (repeat-pass). An important measure relevant to LULC classification is the interferometric coherence, which quantifies the local phase correlation between the two complex images.
As discussed in (Wegmüller, Werner, 1995), interferometric coherence provides complementary information to that contained in backscattered intensities. For example, considering water surfaces, the backscattering coefficient can vary due to water movements caused by wind, while the interferometric coherence (in case of repeat-pass measurement) is consistently low because of temporal change of water surfaces. It is shown in various studies (Wegmüller, Werner, 1997; Abdelfattah, Nicolas, 2006; Mohammadimanesh et al., 2018) that combining backscattered intensities with interferometric coherence has the potential to significantly improve LULC classification. Hence, this paper addresses the questions of how to incorporate the complementary information contained in coherence images into FCN segmentation and to what extent the additional information improves the classification performance.
An existing challenge that still prevents the successful and widespread use of FCNs for SAR data analysis is the limited availability of densely labeled data. In most cases, data is manually labelled by experts in time-consuming processes as described for instance in (Mohammadimanesh et al., 2019). In contrast, this work investigates how automatically generated sparse and potentially noisy annotations based on the fusion of different available LULC products can be used for training. Particular attention is paid to a method which mitigates the negative influence of incorrectly labelled data. This paper is organised as follows: in Section 2, an FCN architecture is introduced that combines backscattering coefficients and interferometric coherence to classify PolSAR images. Subsequently, details concerning the training of this network are described. In Section 3, experiments to evaluate the performance of the proposed FCN are outlined and the outcomes are presented in Section 4. Finally, in Section 5, results are summarised, conclusions are drawn and suggestions for future work are given.

METHODOLOGY
In the following, the FCN-based method for LULC classification using polarimetric SAR images is explained. First the generation of the input data is described followed by the architecture of the network and its training.

Input Image Generation

Pauli decomposition:
To encode measurements of a polarimetric SAR system, the complex polarimetric scattering matrix

$$S = \begin{pmatrix} s_{hh} & s_{hv} \\ s_{vh} & s_{vv} \end{pmatrix}$$

is used, which describes the transformation between transmitted and received wave vectors caused by a scatterer (Lee, Pottier, 2017). The matrix $S$ provides information about scattering processes of an observed object and thus about the object itself.
However, it turns out to be difficult to derive physical properties of a scatterer directly from the matrix $S$. In contrast, the Pauli decomposition of the scattering matrix allows the representation of polarimetric information that corresponds directly to physical scattering mechanisms of coherent targets. Assuming a monostatic system configuration that results in $s_{hv} = s_{vh}$, the decomposition of the scattering matrix based on the Pauli basis is given by:

$$S = a\,S_a + b\,S_b + c\,S_c$$

with

$$S_a = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad S_b = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \quad S_c = \frac{1}{\sqrt{2}} \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

$$a = \frac{s_{hh} + s_{vv}}{\sqrt{2}}, \quad b = \frac{s_{hh} - s_{vv}}{\sqrt{2}}, \quad c = \sqrt{2}\, s_{hv}$$

This decomposition subdivides the scattering matrix $S$ into three components that refer to specific scattering mechanisms. The matrix $S_a$ corresponds to single- or odd-bounce scattering, $S_b$ represents double- or even-bounce scattering and $S_c$ indicates a scattering mechanism characterised by volume scattering. The related complex coefficients $a$, $b$ and $c$ indicate the contribution of the corresponding matrices to the scattering matrix $S$, whereas $|a|^2$, $|b|^2$ and $|c|^2$ express the power scattered by the associated types of target. To represent this information in a single three-channel image, denoted as Pauli-RGB image, the following codification is used:

$$\mathrm{Red} = |b|^2, \quad \mathrm{Green} = |c|^2, \quad \mathrm{Blue} = |a|^2 \tag{6--8}$$

In this work, the Pauli-RGB image is used as one input image of an FCN. Its rich texture and colour features match the visual perception of the captured scene. Thus, spatial features can be extracted by the network that give a good indication of requested LULC classes.
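As an illustration of the Pauli-RGB generation, the following minimal NumPy sketch computes the Pauli coefficients from the three complex channels of a monostatic acquisition and stacks their powers into a three-channel image. The function name `pauli_rgb` and the toy input are hypothetical, and the colour assignment (red for double-bounce, green for volume, blue for single-bounce) is the conventional one, assumed here rather than taken from the authors' code.

```python
import numpy as np

def pauli_rgb(s_hh, s_hv, s_vv):
    """Return an (H, W, 3) Pauli-RGB power image from complex channels."""
    a = (s_hh + s_vv) / np.sqrt(2)  # single- or odd-bounce coefficient
    b = (s_hh - s_vv) / np.sqrt(2)  # double- or even-bounce coefficient
    c = np.sqrt(2) * s_hv           # volume scattering coefficient
    # Assumed conventional coding: R = |b|^2, G = |c|^2, B = |a|^2.
    return np.stack([np.abs(b) ** 2, np.abs(c) ** 2, np.abs(a) ** 2], axis=-1)

# Toy example: an ideal odd-bounce scatterer (s_hh = s_vv, no cross-polarisation).
s_hh = np.ones((2, 2), dtype=complex)
s_vv = np.ones((2, 2), dtype=complex)
s_hv = np.zeros((2, 2), dtype=complex)
rgb = pauli_rgb(s_hh, s_hv, s_vv)
```

For the toy scatterer, all power ends up in the blue channel, which is the expected behaviour for pure surface scattering.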

Interferometric Coherence:
Based on two coregistered complex SAR image values $s_1$ and $s_2$, the interferometric coherence is calculated by:

$$\gamma = \frac{\langle s_1 s_2^* \rangle}{\sqrt{\langle |s_1|^2 \rangle \, \langle |s_2|^2 \rangle}} \tag{9}$$

where $*$ indicates complex conjugation and $\langle x \rangle$ denotes the expected value, which is commonly approximated by averaging adjacent pixels. The resulting correlation coefficient $|\gamma|$ ranges from 0, indicating total decorrelation, to 1, denoting complete conformity, and depends on system and acquisition parameters as well as on structural parameters of the scatterer and temporal scene coherence. In this work, a coherence image is formed using two SAR images that are captured within repeat-pass acquisition; thus, it is predominantly related to random changes of scatterers. The coherence image is used as second input of an FCN to include complementary information to that contained in the Pauli-RGB image.
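The coherence estimation can be sketched with NumPy as follows. This is a simplified illustration, not the processing chain used in the experiments: the expected values are approximated by a plain boxcar average (the experiments use a Gaussian window instead), and the helper names `boxcar_mean` and `coherence` as well as the window size are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def boxcar_mean(x, win):
    """Average x over a win x win neighbourhood (reflect-padded, win odd)."""
    pad = win // 2
    xp = np.pad(x, pad, mode="reflect")
    return sliding_window_view(xp, (win, win)).mean(axis=(-2, -1))

def coherence(s1, s2, win=5):
    """Estimate |gamma| from two coregistered complex images."""
    num = np.abs(boxcar_mean(s1 * np.conj(s2), win))
    den = np.sqrt(boxcar_mean(np.abs(s1) ** 2, win) *
                  boxcar_mean(np.abs(s2) ** 2, win))
    return num / np.maximum(den, 1e-12)  # guard against division by zero

# Toy data: identical images are fully coherent, independent noise is not.
rng = np.random.default_rng(0)
s1 = rng.standard_normal((16, 16)) + 1j * rng.standard_normal((16, 16))
s2 = rng.standard_normal((16, 16)) + 1j * rng.standard_normal((16, 16))
gamma_same = coherence(s1, s1)
gamma_diff = coherence(s1, s2)
```

With a small window the estimate for independent images is biased upwards, which is one reason why larger averaging windows are common in practice.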

Network Architecture
In order to realise pixel-wise LULC classification, we propose an FCN, denoted as Fused U-Net, shown in Figure 1, that relies on the generic encoder-decoder paradigm. Within the encoder stage, high-level features are extracted from the input layer, while the decoder stage is designed to consecutively up-sample the feature maps to the original input resolution.
To effectively combine the information contained in the Pauli-RGB image on the one hand and in the interferometric coherence image on the other hand, a network architecture inspired by the FuseNet structure (Hazirbas et al., 2016) is used. The fusion-based Convolutional Neural Network (CNN) architecture was originally developed to incorporate depth information into the semantic segmentation of RGB images. Following the same fusion approach, the proposed network comprises two encoder branches that are trained to extract and combine features from Pauli-RGB and coherence images. The three-channel Pauli-RGB image is taken as input for the main branch, while the single-channel coherence image is the input of the additional branch. Within both encoder branches, hierarchical feature maps are computed by a stack of convolution, batch normalisation and activation layers. Aggregation of feature maps is performed by max-pooling on five levels. The convolution operators act as image filters with trainable kernel weights and the Rectified Linear Unit (ReLU) activation function enables the learning of non-linear mappings. During batch normalisation, which is applied before each activation, feature maps are first normalised over a mini batch to have zero mean and unit variance. In order to maintain the expressivity of the model, the normalised values are scaled and shifted by two additional parameters that are learned during the training of the network. Batch normalisation enables a faster and more stable training process due to reduced internal covariate shift (Ioffe, Szegedy, 2015) and a smoother optimisation landscape (Santurkar et al., 2018). Furthermore, it provides regularisation effects and thus helps the model to generalise better to unseen examples. The trainable parameters of batch normalisation allow the network to learn internal representations of the Pauli-RGB and coherence images, which complement each other optimally in the following fusion steps.
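The batch normalisation step described above can be written out in a few lines of NumPy. The sketch below covers only the training-time forward pass for a batch of feature maps in (N, H, W, C) layout; the names `batch_norm`, `scale` and `shift` are illustrative, and the running statistics used at inference time are omitted.

```python
import numpy as np

def batch_norm(x, scale, shift, eps=1e-5):
    """Normalise each channel over the mini batch, then scale and shift."""
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return scale * x_hat + shift             # learned per-channel parameters

rng = np.random.default_rng(0)
features = 5.0 * rng.standard_normal((4, 8, 8, 3)) + 2.0
normed = batch_norm(features, scale=np.ones(3), shift=np.zeros(3))
```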
The step-wise fusion is accomplished by adding feature maps extracted from the coherence image to feature maps derived from the Pauli-RGB image using element-wise summation before each max-pooling layer. In this way, feature maps in the main branch are enriched by features from the additional branch and a joint complementary representation is learned. The use of this fusion design, instead of simply stacking Pauli-RGB and coherence images within one input layer, is based on the assumption that the two distinct modalities require different sets of filters for the extraction of significant features. The separated encoder branches of Fused U-Net enable independent learning of features, specialised in the discriminant representation of information from the different data sources.
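To make the data flow of one fusion level concrete, the following NumPy sketch replaces the convolution blocks by a per-pixel linear map followed by ReLU (batch normalisation is omitted) and then applies the element-wise summation and 2 × 2 max-pooling. It is a strongly simplified stand-in under these assumptions, not the Fused U-Net implementation; all function names and the channel count of 16 are hypothetical.

```python
import numpy as np

def conv1x1_relu(x, w):
    """Stand-in for a conv + ReLU block: per-pixel linear map over channels."""
    return np.maximum(np.einsum("hwc,cd->hwd", x, w), 0.0)

def max_pool2x2(x):
    """2 x 2 max-pooling of an (H, W, C) array with even H and W."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def fused_encoder_level(main, extra, w_main, w_extra):
    """One encoder level: each branch extracts features, then they are fused."""
    f_main = conv1x1_relu(main, w_main)
    f_extra = conv1x1_relu(extra, w_extra)
    fused = f_main + f_extra  # element-wise summation before max-pooling
    # The main branch continues with the fused maps, the additional branch
    # with its own features.
    return max_pool2x2(fused), max_pool2x2(f_extra)

rng = np.random.default_rng(0)
pauli = rng.random((8, 8, 3))      # three-channel Pauli-RGB patch
coh = rng.random((8, 8, 1))        # single-channel coherence patch
w_main = rng.standard_normal((3, 16))
w_extra = rng.standard_normal((1, 16))
main_out, extra_out = fused_encoder_level(pauli, coh, w_main, w_extra)
```

Because both branches must produce the same number of feature channels for the summation, each branch keeps its own filters while the fused representation stays the same size as a single-branch encoder output.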
Within the common decoder part of the network, the resulting fused feature maps of low spatial resolution are consecutively up-sampled by transposed convolution operations. In order to reduce the loss of information due to down-sampling in the encoder, five skip connections are used, which pass high-resolution feature maps from the encoder to the decoder stage. This concept was introduced in (Ronneberger et al., 2015) and is widely applied in many FCN approaches for semantic segmentation. By concatenating deep coarse features with shallow fine features, accurate detail information can be preserved. Five up-sampling blocks are followed by a 1 × 1 convolution layer that reduces the depth dimension of feature vectors to the desired number of output classes. To transform the feature vectors, each of which describes one pixel, into probabilities, the softmax function is applied. Final class labels that build up the intended segmented map are determined based on the highest probability values.
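The final mapping from per-pixel feature vectors to class labels can be sketched as follows. The logits are made-up values for two pixels and four classes, and the class indices are zero-based here purely for convenience.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical output of the 1 x 1 convolution for a 1 x 2 image, 4 classes.
logits = np.array([[[2.0, 0.5, 0.1, -1.0],
                    [0.0, 0.0, 3.0, 0.0]]])
probs = softmax(logits)          # per-pixel class probabilities
labels = probs.argmax(axis=-1)   # class with the highest probability
```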

Network Training
The described network can be modelled by a chain of functions with trainable parameters $W$. Given a training data set $D$ of image tuples $(X_{1,i}, X_{2,i})$ with associated labels $Y_i$, an optimal set of parameters $W^*$ is determined during network training. Here $X_1$ and $X_2$, with $X_1 \in \mathbb{R}^{W \times H \times 3}$ and $X_2 \in \mathbb{R}^{W \times H}$, denote two input images (i.e. the Pauli-RGB and the corresponding coherence image); $Y_i \in \mathcal{K}^{W \times H}$, with $\mathcal{K} = \{1, ..., K\}$, denotes the associated ground-truth labelling. To find a parameter set $W^*$, a suitable loss function is minimised that compares predicted class distributions $p(k|x)$ resulting from softmax mapping to corresponding one-hot encoded ground-truth distributions $q(k|x)$. A loss function that is commonly used in conjunction with neural networks that apply softmax activations in the output layer is the categorical cross entropy, defined by:

$$L_{ce} = -\sum_{k=1}^{K} q(k|x) \log p(k|x) \tag{12}$$

However, in this work, the categorical cross entropy in its original form is not suitable, due to characteristics of the employed training data that are described in the following.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B1-2020, 2020 XXIV ISPRS Congress (2020 edition)
To derive required ground-truth LULC class images Yi and form a diverse and sufficiently large training data set D without the necessity of time-consuming manual labeling, PolSAR images are annotated automatically using the approach described in (Schmitz et al., 2020): Information from publicly available LULC products, namely OpenStreetMap, CORINE Land Cover 2018, and Global Water Surface, is extracted and fused with information that can be derived from the PolSAR image itself based on interferometric coherence and polarimetric features.
As part of the automatic annotation process, class assignments are excluded that are not sufficiently reliable due to conflicting information of the various sources of input data. Thus, the resulting training data is not densely but only sparsely labelled. Despite filtering uncertain labels, the training data may contain incorrectly assigned class labels that have a negative impact on the network training. As empirically evaluated in (Wang et al., 2019), the cross entropy loss (Equation (12)) reveals weaknesses in the context of learning on erroneous training data. It is stated that, within the learning process, the network tends to overfit to noisy labels on "easy" classes, while the effect of under-learning occurs for "hard" classes. To overcome these limitations, a suitable loss function proposed in (Wang et al., 2019), based on symmetric learning, is implemented and used for the training of the Fused U-Net. The chosen symmetric cross entropy loss $L_{sce}$, inspired by the symmetric KL-divergence, is defined as the weighted sum of cross entropy and reverse cross entropy loss:

$$L_{sce} = \alpha L_{ce} + \beta L_{rce} \tag{14}$$

where the reverse cross entropy $L_{rce}$ swaps the roles of the predicted and the ground-truth distribution. While the cross entropy term leads to good convergence, the additional reverse cross entropy term is noise tolerant. The hyperparameters $\alpha$ and $\beta$ can be tuned to find a balance between the reduction of overfitting and speed of convergence.
During the training of the Fused U-Net, symmetric cross entropy loss values are calculated on each training pixel $x \in \Omega$. Here, $\Omega$ denotes the set of pixels of the training image tuples $(X_{1,i}, X_{2,i})$ that have a valid label in $Y_i$. Based on the loss values, the network parameters $W$ are updated iteratively using the gradient-based Adam optimisation algorithm (Kingma, Ba, 2014). To stabilise the training, the update frequency is reduced by using mini batch gradient descent. Weight updates are performed based on averaged sample losses over a subset (called mini batch) of $\Omega$.
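A per-pixel version of the symmetric cross entropy can be sketched in NumPy as shown below. The sketch follows the general recipe of Wang et al. (2019), in which the zeros of the one-hot distribution are clipped to a small constant so that the logarithm in the reverse term stays finite. The function name, the clipping constants and the default weights `alpha = beta = 1` are assumptions for illustration and are not the settings used in this work.

```python
import numpy as np

def symmetric_cross_entropy(p, q, alpha=1.0, beta=1.0, eps=1e-7, q_min=1e-4):
    """p: predicted softmax probabilities (N, K); q: one-hot labels (N, K)."""
    p = np.clip(p, eps, 1.0)
    ce = -np.sum(q * np.log(p), axis=-1)                          # cross entropy
    rce = -np.sum(p * np.log(np.clip(q, q_min, 1.0)), axis=-1)    # reverse term
    return alpha * ce + beta * rce

q = np.array([[1.0, 0.0], [1.0, 0.0]])   # both pixels labelled as class 0
p = np.array([[0.9, 0.1], [0.1, 0.9]])   # confident-correct vs. confident-wrong
loss = symmetric_cross_entropy(p, q)
```

In this formulation the reverse term can never exceed `-log(q_min)`, whereas the cross entropy term grows without bound for confidently wrong predictions; this bounded contribution is what makes the reverse term less sensitive to mislabelled pixels.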
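Restricting the loss to the labelled pixel set can be illustrated with a simple mask. In the sketch below, unlabelled pixels are marked with the value 0, which is a hypothetical convention chosen because valid class labels start at 1; the function name is likewise illustrative.

```python
import numpy as np

IGNORE = 0  # hypothetical code for pixels without a valid label

def masked_mean_loss(pixel_loss, labels, ignore=IGNORE):
    """Average per-pixel losses over labelled pixels only."""
    valid = labels != ignore
    if not valid.any():
        return 0.0  # a patch without any labels contributes nothing
    return float(pixel_loss[valid].mean())

pixel_loss = np.array([[1.0, 2.0], [3.0, 4.0]])
labels = np.array([[1, 0], [0, 2]])      # two of four pixels are labelled
batch_loss = masked_mean_loss(pixel_loss, labels)
```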

Study Area
Within the framework of the GeoWAM project, which aims to generate high-resolution geodata for coastal monitoring, fully polarimetric SAR data are captured over the tide-influenced German Wadden Sea. The geographic location of the study area considered in this paper, namely Otzumer Balje, is illustrated in Figure 2. With the objective of creating an accurate model of the watercourse for the study area, the main focus of the classification is on the distinction between water and dry fallen mudflats. The foreshore and land area, which receives less attention in this work, is merely divided into two classes, soil and non-soil. The soil class includes areas with low vegetation, crop land, meadows and roads, while the non-soil class includes human-made objects, built-up areas and forested areas. For data acquisition, the F-SAR system developed at the German Aerospace Center (Deutsches Zentrum für Luft- und Raumfahrt; DLR) (Horn et al., 2009) was used. F-SAR is an airborne SAR system, equipped with multiple antennas that enable capturing fully polarimetric SAR data at different wavelengths. In this work, image data are employed that were recorded by the S-band antenna during a measurement campaign in July 2019. Interferometric measurements, needed for the calculation of coherence images, were performed with repeat-pass baselines in the order of 40 metres. At the time of acquisition, the tidal range was low, thus large areas of dry fallen mudflats are depicted in the SAR images.

Experimental Setup
Three co-registered complex PolSAR image pairs were selected for the training and testing of the Fused U-Net model. The respective Pauli-RGB images were calculated according to Equations (6) to (8) and subsequently projected from slant-range to ground-range geometry. To minimise the influence of varying incidence angles, a gamma-naught calibration was performed. The coherence images were derived from VV-polarised complex image pairs by applying Equation (9) using a Gaussian filter of size 11 × 11 with standard deviation σ = 5.0 to approximate the expected values. The resulting image was geocoded as well. Before the input images were fed into the network, the values within an image were normalised to [0, 1]. Since we expected the network to learn basic SAR-specific image filters, no additional filtering of the input images was carried out. As described in Section 2.3, reference images that contain the class labels (water, mudflats, soil and non-soil) on pixel level were generated automatically. A quantitative evaluation of the accuracy of the labels that were generated in this way for the study area is given in (Schmitz et al., 2020). The data were divided into training and test data in a ratio of 70% to 30%, taking care to use geographically separate areas. While the reference images used for testing were manually post-processed to reduce faulty labels, there was no correction of the reference images used for training. In this way, it can be examined whether the model is robust against faulty training labels. Figure 3 illustrates exemplary sections of the resulting training and testing image triplets (Pauli-RGB, coherence and reference image).

Figure 2. Geolocation of the study area Otzumer Balje, a tidal basin in the German Wadden Sea.

For the training of the Fused U-Net, images were divided into patches of size 512 × 512 pixels with an overlap of 25 %. Image patches that contained less than 5 % annotated pixels were excluded from training. In order to counteract the imbalanced class distribution within the training data, random undersampling was performed to reduce the number of samples from dominant classes. Based on selected training image patches, an optimal set of model parameters was determined by minimising the loss (Equation (14)) using the Keras implementation of the Adam optimiser. The optimisation started with a learning rate of 0.01 that was reduced by a factor of 0.5 every 10 epochs. The model parameters were updated iteratively after the evaluation of a mini batch of 8 patches.
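The patch selection and the step-wise learning rate decay described above can be sketched as follows. The helper names, the convention that unlabelled pixels carry the value 0, and the handling of the image border (the last patch is shifted inwards) are assumptions made for this illustration; the source does not specify these details.

```python
import numpy as np

def patch_origins(size, patch=512, overlap=0.25):
    """Top-left offsets along one axis for patches with fractional overlap."""
    stride = int(patch * (1 - overlap))
    last = max(size - patch, 0)
    origins = list(range(0, last + 1, stride))
    if origins[-1] != last:
        origins.append(last)  # shift the final patch inwards to reach the border
    return origins

def select_patches(labels, patch=512, overlap=0.25, min_frac=0.05, ignore=0):
    """Keep patch origins whose share of annotated pixels is at least min_frac."""
    keep = []
    for r in patch_origins(labels.shape[0], patch, overlap):
        for c in patch_origins(labels.shape[1], patch, overlap):
            tile = labels[r:r + patch, c:c + patch]
            if (tile != ignore).mean() >= min_frac:
                keep.append((r, c))
    return keep

def learning_rate(epoch, base=0.01, factor=0.5, step=10):
    """Step decay: multiply the rate by `factor` every `step` epochs."""
    return base * factor ** (epoch // step)

# Toy reference image: only the upper-left quarter is annotated.
reference = np.zeros((1024, 1024), dtype=np.uint8)
reference[:512, :512] = 1
selected = select_patches(reference)
```

A function such as `learning_rate` could be passed to a scheduler callback of the training framework; the exact mechanism used in the experiments is not stated in the text.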
To examine the extent to which the proposed way of including coherence images affects classification performance, we compared the proposed Fused U-Net model to a single-branch FCN with U-Net architecture. It follows the same basic structure as the presented Fused U-Net, but the additional encoder branch is omitted. This model was trained with two different configurations of the input layer. In the first approach, the input layer contained only the Pauli-RGB image; in the second approach, Pauli-RGB and coherence images were stacked to a 4-channel image. In the following, the two methods are referred to as Pauli U-Net and Pauli-Coh U-Net. The training was performed using the same strategy and data as that used for the Fused U-Net.
After training each network for 100 epochs, the resulting models were applied to the classification of the remaining unseen test data. For this purpose, the corresponding image data was divided into overlapping patches of size 1024 × 1024 pixels that were fed into the neural networks. For each network, the predicted output maps were combined to form one classification image, whereby the overlap was used to eliminate artefacts occurring at the borders of single output patches. The resulting classification images provide the basis for the following performance evaluation.
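The text does not state how the overlap is used when the output patches are merged. One common option, shown here purely as an assumption, is to average the class probabilities of all patches covering a pixel before taking the most probable class. The function name `stitch` and the toy patches are hypothetical.

```python
import numpy as np

def stitch(prob_patches, origins, out_shape, n_classes):
    """Average overlapping probability patches and return a label map."""
    acc = np.zeros(out_shape + (n_classes,))
    cnt = np.zeros(out_shape + (1,))
    for probs, (r, c) in zip(prob_patches, origins):
        h, w = probs.shape[:2]
        acc[r:r + h, c:c + w] += probs   # accumulate probabilities
        cnt[r:r + h, c:c + w] += 1       # count contributing patches
    return (acc / np.maximum(cnt, 1)).argmax(axis=-1)

# Two 4 x 4 patches overlapping by two columns on a 4 x 6 output grid.
patch_a = np.tile([0.9, 0.1], (4, 4, 1))  # confident in class 0
patch_b = np.tile([0.4, 0.6], (4, 4, 1))  # weakly prefers class 1
label_map = stitch([patch_a, patch_b], [(0, 0), (0, 2)], (4, 6), 2)
```

In the overlap the confident prediction of the first patch outweighs the weak prediction of the second, so the seam follows the more reliable patch. Cropping a fixed margin from each patch would be an alternative with a similar purpose.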

RESULTS
For the evaluation and comparison of the classification performance of the different models, several metrics are considered. On pixel-level, the precision and recall rate as well as the F1-Score, which is the harmonic mean of precision and recall rate, are determined for each class. These metrics are defined as follows:

$$F_1 = 2 \cdot \frac{r_{\mathrm{precision}} \cdot r_{\mathrm{recall}}}{r_{\mathrm{precision}} + r_{\mathrm{recall}}} \tag{15}$$

with

$$r_{\mathrm{precision}} = \frac{TP}{TP + FP} \tag{16}$$

$$r_{\mathrm{recall}} = \frac{TP}{TP + FN} \tag{17}$$

where TP denotes the number of true positives, FP the number of false positives and FN the number of false negatives. The macro-average F1-Score is determined for each model by equally weighting all class-specific scores. To assess the performance on region-level, the Intersection over Union (IoU) is used, a similarity measure that determines the degree of overlap between predicted classification and ground-truth masks, defined as:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{18}$$

The mean IoU for each model is calculated by averaging the IoU over all classes. The achieved performance scores, based on the classification results for the test area, are summarised in Tables 1 and 2 for each model. The highest average F1-Score as well as the best average IoU is provided by the Fused U-Net model with 0.9 and 0.82, respectively. The average performance of the Pauli-Coh U-Net is only slightly below, with an average F1-Score of 0.88 and a mean IoU of 0.81. Comparing the class-specific performance, it can be seen that the achieved results for classification of water and mudflat are similarly good for both models. The poorer average performance of the Pauli-Coh U-Net is mainly due to a lower ability to recognise non-soil areas accurately, which is reflected in lower IoU, precision and recall rates. While the classification of soil regions succeeds, the Pauli U-Net model fails to reliably recognise water and mudflat areas. As expected, this suggests that the inclusion of the coherence image has a clear benefit for classification performance, especially in the investigation of tidal-influenced areas, where the distinction between mudflats and water plays a crucial role.
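The class-wise metrics used in this evaluation can all be derived from a confusion matrix. The following NumPy sketch shows one way to do this; the function name and the toy label vectors are made up, class indices are zero-based, and the numbers are unrelated to the results reported above.

```python
import numpy as np

def class_metrics(y_true, y_pred, n_classes):
    """Per-class precision, recall, F1-Score and IoU from integer labels."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true.ravel(), y_pred.ravel()), 1)  # rows: truth, cols: prediction
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    iou = tp / np.maximum(tp + fp + fn, 1)
    return precision, recall, f1, iou

y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])
precision, recall, f1, iou = class_metrics(y_true, y_pred, n_classes=3)
macro_f1 = f1.mean()   # equal weight for every class
mean_iou = iou.mean()
```

For a single class the two summary scores are linked by IoU = F1 / (2 − F1), so both rank classes identically; they differ only in how strongly errors are penalised.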
In order to better interpret the quantitative results, achieved classification results are presented in Figure 4 for a few example regions of the test area. The visual comparison of the predictions leads to the following observations: As already indicated by the performance scores, the Pauli U-Net model, which uses only the Pauli-RGB image for classification, fails to accurately distinguish between water and mudflats. In the tidal basin, a large part of the mudflats is falsely classified as water. Consequently, the watercourse, which is mainly characterised by the course of tidal creeks, seaweeds and narrow water channels, cannot be accurately captured by the Pauli U-Net model. In contrast, the water and mudflat separation obtained by the Fused U-Net model and the Pauli-Coh U-Net model matches the reference image very precisely. This can be explained by the fact that the watercourse is clearly visible in the coherence image. Coherence is low in water-covered areas, while the Wadden areas lead to high coherence values, resulting in a high contrast that can be easily detected by convolution filters and is apparently learned by both models. Considering the classification results of the mainland area, the different degrees of smoothness of the three result images are clearly visible. Large continuous areas such as meadows and salt marshes are well captured by all models without contamination by the speckle noise that is present in the input images. Greater differences between the predictions are evident for urban areas, isolated farmyards and coastal structures. In the reference image used for training and testing, not every single building and object is labelled. Instead, urban regions are combined into one labelled region, which encloses buildings, streets, trees, etc.; farmyards include surrounding meadows and tiny coastal buildings are not labelled at all.
The Fused U-Net and Pauli-Coh U-Net models are able to learn this kind of annotation, as illustrated in the middle example of Figure 4. Thus, smooth and homogeneous predictions are generated that resemble the reference image. However, the effect of over-smoothing occurs. Individual buildings and small artificial objects are filtered out and the boundaries of urban areas and farmyards are not accurately segmented. In contrast, the Pauli U-Net generates less smoothed results, which on the one hand leads to misclassification of single pixels in actually connected areas, but on the other hand allows the segmentation of fine structures.

CONCLUSION
In this paper, Fused U-Net, a two-branch encoder-decoder network, was presented that combines polarimetric backscattering intensity and interferometric coherence and is trained to perform LULC classification based on SAR data. The model was developed and applied to classify water, mudflats, soil and non-soil areas in airborne S-band SAR data acquired over the German Wadden Sea. Instead of manually labelled accurate training data, training labels were automatically generated. In order to deal with the few faulty labels that accompany the automatic label generation, the error-robust symmetric cross entropy loss function was used for the training of the Fused U-Net. The experimental results demonstrate that the model trained in this manner achieves a fine-grained segmentation of the watercourse in the tidal area and smooth predictions in the mainland area. By comparison with similar network architectures, it is shown that the integration of interferometric coherence significantly increases the classification performance, especially for the separation of water and mudflats. This work serves as a first proof of concept for the presented Fused U-Net model and its training on partly incorrect training data.
Future work will focus on the application of this approach for the classification of additional classes that resemble each other more closely. We assume that, for this case, the superiority of the presented method for the fusion of features extracted from different types of input data, as opposed to a stacking of different input data, will be even more evident. Furthermore, the results of this work suggest that a fine-tuning of the trained Fused U-Net model with a