SELF-TRAINING FOR SEMI-SUPERVISED DEEP CONTOUR DETECTION OF SURFACE WATER

ABSTRACT: Contour detection is better suited than water body detection for monitoring dynamic and long-term changes to surface water. For that purpose, we present a semi-automated method for collecting and labeling water contours from Landsat-8 and Sentinel-2 images. Due to the need for human inspection, the method has thus far generated 14K labeled images from more than 1.5M images. Given the cost of data labeling, we propose a deep semi-supervised self-learning system performed in two training stages, known as teacher-student. The teacher is trained on the accurate human-labeled data and then used to pseudo-label the remaining unlabeled data. The student is trained on both human-labeled and machine pseudo-labeled data. For both teacher and student, we use a uniquely designed multiscale UNet classifier that uses fewer parameters and is more accurate than other state-of-the-art classifiers. Random augmentations are used to "noise" the student model and improve its generalization, and normalization schemes are used to blend the human-labeled loss with the machine-labeled loss. Comparisons to existing water body detection classifiers and segmentation classifiers show the superiority of our proposed system in detecting water contours.


INTRODUCTION
Monitoring surface water from remote sensing data is a critical GIS task for risk evaluation, resource management, public policy, emergency response, cartography, and education. Many remote sensing technologies (Huang et al., 2018) are currently available, providing data that vary in cost, temporal resolution, spatial resolution, spectral resolution, and the number of spectral channels.
Surface water monitoring techniques (Gao, 1996; Xu, 2006; Fisher et al., 2016; Feyisa et al., 2014; Wang et al., 2018; Friedl & Brodley, 1997; Mueller et al., 2016; Aung & Tint, 2018; Cordeiro et al., 2021; Isikdogan et al., 2017; Isikdogan et al., 2020) have focused on multispectral detection of water bodies, which is sensitive to the infra-red (IR) channels. In the planar view, contour detection is more effective than water body detection in capturing dynamic and long-term changes to surface water. Additionally, the dependence on the IR channels makes the detectors expensive and requires recalibrating the system to each IR sensing technology (bandwidth, central wavelength, sensitivity, etc.). We propose RGB-based detection that, much like humans, can detect contours without relying on multispectral data.
To aid in this effort, we have started collecting satellite data representing a variation of Landsat and Sentinel waterbody images (lakes, rivers, shores, etc.) from across the globe. We employed rule-based metrics and basic image processing to label the contour data and used visual (human) inspection to isolate and remove inaccurately labeled portions. The process has been extremely slow, thus far yielding only 14K useful images from over 200K candidates, with over 1M images still unchecked.
Given the cost of data labeling, we propose to use a deep semi-supervised self-learning framework in which our unique multiscale UNet-style classifier is trained on a small subset of the labeled data. The trained classifier, also known as a teacher model, is then used to pseudo-label the more extensive set of unlabeled data. Then the classifier is retrained with both human and pseudo-labeled data to achieve a more robust classifier, known as the student model. During the student model training, 50% of each batch is randomly selected from the human-labeled data, and the human-labeled loss is weighted more heavily than the machine-labeled loss. This is done to prevent the pseudo-labeled data from dominating the learning process. The student model batch also undergoes random augmentations of vertical flip, horizontal flip, and rotation to make it "noisier." The training process then repeats, with the student model becoming the new teacher. We found that after three iterations, the performance improvements become negligible.
* Corresponding author: mbsyed@uno.edu
In the following sections, we describe our data collection process (section 2), the architecture of our multiscale classifier (section 3), semi-supervised self-training (section 4), and experimental results (section 5) that demonstrate the superiority of the proposed classifier.

Data Collection
We collected data from both the Landsat-8 and Sentinel-2 satellites. For each satellite we created two datasets: one for fully supervised training and one for semi-supervised self-training.
For the Landsat-8 data collection, a single method was used. The shapefile from DeepWaterMapV2 (Isikdogan et al., 2020) was used to determine potential global water body locations. Using the metadata in the shapefile, locations with less than 1% water were removed. Google Earth Engine (GEE) was used to download the data.
Two methods were used for Sentinel-2 data collection, one for supervised learning and one for semi-supervised learning. For supervised learning, a shapefile from BlueDotWater (1) was used to determine the locations of inland water bodies. These were downloaded using Sentinel's Python API. For semi-supervised learning, data was collected using the same method used in the Landsat-8 data collection.
The satellite images were labeled using NDWI (Gao, 1996) to detect water bodies. The water contour is then labeled by subtracting the NDWI water map from its morphological dilation. The process yields many inaccuracies in the contour, even within the same image. To improve the yield, the satellite images were split into 128 × 128 tiles. Human inspection is then used to separate accurately labeled tiles from inaccurate ones.
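The contour labeling and tiling steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a binary water mask (1 = water) has already been obtained from NDWI, uses a 3 × 3 structuring element for the dilation, and drops partial tiles at the image edges. None of these choices is stated in the text, and the function names are hypothetical.

```python
def dilate(mask):
    """Binary dilation of a 2D list of 0/1 values with a 3x3 window (assumed size)."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if not mask[r][c]:
                continue
            # Spread each water pixel to its 3x3 neighbourhood, clipped at the edges.
            for rr in range(max(0, r - 1), min(height, r + 2)):
                for cc in range(max(0, c - 1), min(width, c + 2)):
                    out[rr][cc] = 1
    return out


def contour_from_mask(mask):
    """Contour = dilated mask minus the original mask (a ring just outside the water)."""
    dilated = dilate(mask)
    return [[d - m for d, m in zip(drow, mrow)] for drow, mrow in zip(dilated, mask)]


def split_into_tiles(image, size=128):
    """Split a 2D list into non-overlapping size x size tiles, dropping partial tiles."""
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            tiles.append([row[left:left + size] for row in image[top:top + size]])
    return tiles
```

With this definition a single isolated water pixel produces a contour of its eight neighbours, which makes the label easy to check by eye during human inspection.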
Currently, we have two sets of data in our repository (2), one for unlabeled and the second for labeled data. There are over 1M unlabeled Landsat tiles and over 1.4M unlabeled Sentinel tiles. To make sure that the data used contains a contour, we eliminate tiles with less than 1% water. This means that only 490,070 Landsat tiles and 400,682 Sentinel tiles from the unlabeled dataset are used for self-training.
(1) www.blue-dot-observatory.com/
(2) https://github.com/mbsyed/Deep-Surface-Water-Contour-Detection
The labeled data was hand-selected from the unlabeled data set. This is an extremely slow process with a minimal return. 200,000 images were visually inspected to create a labeled dataset containing 7,000 tiles for Landsat and 7,174 images for Sentinel. We balanced the dataset to avoid an abundance of water-only or land-only tiles.
Each tile in the datasets is stored as 16-bit raw satellite data with six channels in the following order: blue (b1), green (b2), red (b3), NIR (b4), SWIR1 (b5), SWIR2 (b6). As we will be working with RGB data, we convert the raw data into true-color images (TCI). For Landsat, we recommend subtracting the minimum and dividing by the maximum. For Sentinel, clip the values at 3558 first and then divide by the same number. Each image has metadata, including the satellite source and the water percentage. The data also contains JRC (Pekel et al., 2016) water labels for each Landsat tile for reference.
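The two normalization recipes can be sketched as below. The text does not say whether the Landsat minimum and maximum are taken per channel or per tile, nor whether the maximum is measured before or after the minimum is subtracted. This sketch assumes per-channel statistics with the maximum taken after the shift, so that values land in [0, 1]. Channels are represented as flat lists, and all function names are hypothetical.

```python
SENTINEL_CLIP = 3558  # clip value given in the text


def normalize_landsat(values):
    """Subtract the minimum, then divide by the maximum of the shifted values (assumed)."""
    low = min(values)
    shifted = [v - low for v in values]
    high = max(shifted)
    if high == 0:
        # Constant channel: avoid division by zero.
        return [0.0 for _ in shifted]
    return [v / high for v in shifted]


def normalize_sentinel(values, clip=SENTINEL_CLIP):
    """Clip at `clip`, then divide by the same number."""
    return [min(v, clip) / clip for v in values]


def to_true_color(tile, normalize):
    """Build [red, green, blue] from a six-channel tile ordered b1..b6 (blue, green, red, ...)."""
    blue, green, red = tile[0], tile[1], tile[2]
    return [normalize(red), normalize(green), normalize(blue)]
```

For example, `to_true_color(tile, normalize_sentinel)` would return the three visible channels of a Sentinel tile scaled to [0, 1], ignoring the NIR and SWIR channels.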

Multiscale UNet
Our proposed UNet-based water contour detector can be seen in Figure 1. Our architecture design has encoder/decoder layers that are based on the multiscale convolution block seen in Figure 1(b). Our model uses multiscale 2D filters that are effective in capturing contours. The 1x1 filters are effective in controlling data expansion and help weigh the channels going forward. Each convolution is preceded by a Batch Normalization (BN) layer to limit the effect of outlier data in a batch.
We chose to use a strided convolution instead of max-pooling for the down-sampling process. This adds a few parameters to the architecture, but the overall performance increases. "Skip" connections between corresponding encoder and decoder blocks are a general attribute of UNet systems that have been shown to improve training and provide better localization in the output. A sigmoid output is used to classify each output pixel as contour or non-contour (i.e., 1 or 0).

Semi-Supervised Self-Training
Supervised and unsupervised learning models are distinguished by whether labeled or unlabeled data is used for learning. Semi-supervised learning (SSL) aims to combine labeled and unlabeled data to improve the learning task. SSL consists of a variety of techniques (van Engelen & Hoos, 2019) that can be generalized as one of two scenarios: either the system is a supervised learning model that benefits from unlabeled data (inductive) or an unsupervised learning model that is improved by labeled data (transductive). Self-learning (also known as self-training or wrapper methods) is an inductive SSL approach that first trains a classifier on a small, accurately (human) labeled set of data. The trained classifier is then used to pseudo-label a larger unlabeled data set. The accurately labeled data and the machine pseudo-labeled data are then combined to train a new classifier. The two classifier stages are sometimes referred to as teacher-student models.
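The generic teacher-student cycle can be summarized in a short skeleton. This is an illustrative sketch only: `train_fn` and `predict_fn` are hypothetical callables standing in for any classifier, and the toy one-dimensional threshold classifier exists purely to make the example runnable.

```python
def self_train(train_fn, predict_fn, labeled, unlabeled, iterations=3):
    """Generic self-training loop.

    labeled:   list of (x, y) pairs with trusted labels
    unlabeled: list of x values without labels
    train_fn(labeled, pseudo) -> model, trained from scratch on both sets
    predict_fn(model, x) -> label
    """
    # The first teacher sees only the trusted labels.
    teacher = train_fn(labeled, [])
    for _ in range(iterations):
        # The teacher pseudo-labels the unlabeled pool.
        pseudo = [(x, predict_fn(teacher, x)) for x in unlabeled]
        # A fresh student is trained on trusted plus pseudo labels.
        student = train_fn(labeled, pseudo)
        # The student becomes the teacher for the next round.
        teacher = student
    return teacher


# Toy example: a 1D threshold classifier placed midway between the class means.
def toy_train(labeled, pseudo):
    samples = labeled + pseudo
    zeros = [x for x, y in samples if y == 0]
    ones = [x for x, y in samples if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def toy_predict(threshold, x):
    return 1 if x > threshold else 0


model = self_train(toy_train, toy_predict,
                   labeled=[(0.0, 0), (1.0, 1)],
                   unlabeled=[0.1, 0.2, 0.8, 0.9])
```

The variations surveyed in the literature differ mainly in what `train_fn` does with the pseudo-labeled set (ranking, weighting, or added noise) and in how many times the loop runs.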
A variety of approaches for self-training have been proposed; see (Triguero et al., 2013) for a review. These vary in the number of classifiers used, the type of classifiers, how the pseudo-labeled data is incorporated into retraining, and how many iterations of teacher-student training cycles are used. Implementations such as (Yalniz et al., 2019) use a more powerful teacher model, while others such as (Xie et al., 2020) use a more powerful student model. In (Xie et al., 2020; Zoph et al., 2020), noise is added to the student model data in the form of random augmentations to help improve the system's generalization. In (Sohn et al., 2020; Tang et al., 2021), the pseudo-labeled data is ranked prior to student model training.

Proposed Training Process: Teacher Model
We start with 7,000 Landsat and 7,174 Sentinel true-color RGB images and their corresponding accurately labeled contours. Each dataset is randomly split into training and testing sets. Landsat has 5,000 images for training and 2,000 images for testing. Sentinel has a few more images, with 5,121 for training and 2,053 for testing. The teacher model is our multiscale UNet, trained using an Adam optimizer (learning rate of 0.003, beta1 of 0.9, and beta2 of 0.999). The maximum batch size that our GPU can support is 64. We allowed the model to train for 50 epochs. We used a combination of three loss functions: binary cross-entropy (BCE), Dice, and intersection over union (IoU). BCE captures the pixel-level loss in the image, while the IoU and Dice losses capture contour object-related loss. All three losses are combined equally.
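An equally weighted combination of the three losses can be sketched as follows. The exact formulation is not given in the text, so this sketch assumes the common "soft" Dice and IoU losses computed on predicted probabilities, sums the three terms without further scaling, and omits the pixel weighting described in the next paragraph. Images are flattened to lists for brevity, and `combined_loss` is a hypothetical name.

```python
import math


def combined_loss(pred, target, eps=1e-7):
    """BCE + soft Dice + soft IoU, summed with equal weight (assumed formulation).

    pred:   predicted contour probabilities in [0, 1] (flattened image)
    target: ground-truth labels, 0 or 1 (same length)
    """
    n = len(pred)
    # Binary cross-entropy, averaged over pixels; eps guards against log(0).
    bce = -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for p, t in zip(pred, target)
    ) / n
    # Soft overlap between prediction and ground truth.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    dice = 1 - (2 * intersection + eps) / (total + eps)
    iou = 1 - (intersection + eps) / (total - intersection + eps)
    return bce + dice + iou
```

A perfect prediction drives all three terms to zero, while a prediction that misses every contour pixel is penalized both per pixel (BCE) and at the object level (Dice and IoU).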
Due to the significant imbalance between contour and non-contour pixels, the pixels are first weighted by the ratio of the number of non-water pixels to the number of water pixels. Additionally, an n × n border around the contour makes errors closest to the contour count more heavily than those outside the border. We found a border of 9 × 9 to be optimum.

Proposed Training Process: Student Model
The trained teacher model is used to provide machine pseudo-labels for 490,070 unlabeled Landsat and 400,682 unlabeled Sentinel images. Our multiscale UNet is selected again as the student model and retrained from scratch using both human and machine labels. Because pseudo-labels greatly outnumber human labels, half of each batch (32 images) is randomly sampled from the human-labeled data, while the other half is sampled from the pseudo-labeled data. Additionally, 50% of the entire batch is randomly selected for augmentation to add noise to the system. When an image is selected for augmentation, one of four augmentations is randomly chosen from vertical flip, horizontal flip, and 90° rotations. In each training batch, the human-labeled loss and the pseudo-labeled loss are normalized using their exponential moving averages, computed with a decay rate of 0.9997, before being blended. A weight rate of 3 was found to be optimum. The student model was trained for nine epochs on Landsat data (7.7K iterations) and eight epochs on Sentinel data (6K iterations).
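The batch composition and augmentation described above can be sketched as below. This is an illustration under stated assumptions: the four augmentations are taken to be vertical flip, horizontal flip, and 90° rotations in both directions (the text names four but does not list them all), and the same transform is applied to an image and its label so the contour stays aligned. Images are 2D lists of pixels, and all names are hypothetical.

```python
import random


def vflip(img):
    return img[::-1]


def hflip(img):
    return [row[::-1] for row in img]


def rot90_cw(img):
    return [list(row) for row in zip(*img[::-1])]


def rot90_ccw(img):
    return [list(row) for row in zip(*img)][::-1]


AUGMENTATIONS = (vflip, hflip, rot90_cw, rot90_ccw)  # assumed set of four


def make_batch(human, pseudo, rng, batch_size=64, augment_fraction=0.5):
    """Draw half the batch from human-labeled pairs and half from pseudo-labeled pairs,
    then augment a random fraction of the whole batch."""
    half = batch_size // 2
    batch = rng.sample(human, half) + rng.sample(pseudo, batch_size - half)
    n_augment = int(len(batch) * augment_fraction)
    chosen = set(rng.sample(range(len(batch)), n_augment))
    out = []
    for i, (image, label) in enumerate(batch):
        if i in chosen:
            augment = rng.choice(AUGMENTATIONS)
            # Apply the same transform to image and label.
            image, label = augment(image), augment(label)
        out.append((image, label))
    return out
```

Passing a seeded `random.Random` instance as `rng` keeps the sampling reproducible between runs.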

Proposed Training Process: Iterations 2 and 3
Following the student model training, we convert the student model into a teacher model and use it to pseudo-label the unlabeled data again. The multiscale UNet is reinitialized, and the student training of section 4.2 is repeated. Due to the large number of training batches, we employ a cosine annealing schedule for the learning rate.
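A cosine annealing schedule can be written in a few lines. The text does not give the schedule's endpoints, so this sketch assumes the learning rate starts at the teacher's 0.003 and decays to zero over a single cycle with no warm restarts. The step count of 7,700 in the usage line corresponds to the Landsat student training length quoted earlier, and the function name is hypothetical.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max=0.003, lr_min=0.0):
    """Learning rate at `step`, decaying from lr_max to lr_min along a half cosine."""
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Example: learning rate halfway through a 7,700-iteration run (assumed length).
halfway = cosine_annealing_lr(3850, 7700)
```

The schedule changes slowly at the start and end of training and fastest in the middle, which suits long runs over many batches.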

RESULTS
Tables 1 and 2 compare our base system with existing water body detection and segmentation classifiers. DWM is a DL waterbody detection model that was retrained on our data specifically for contour detection. Waterdetect (Cordeiro et al., 2021) is also a water body detector, but it relies on hierarchical clustering of rule-based metrics. All systems are trained with our RGB data for accurate water contour detection. The results indicate that our base system uses fewer parameters, has a faster training time, and is more accurate at detecting water contours.
Tables 3 and 4 contain the results of self-training for Landsat and Sentinel, respectively. A clear improvement can be seen in the model's performance, where there is a 2% improvement for Sentinel's F-score and a 6% improvement for Landsat. We can also see a clear improvement in the model's output before and after self-training as the F-score for individual images increases.
Although both models were trained for three teacher-student iterations, the F-score for Sentinel data did not improve after the first iteration.
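For reference, the F-score quoted in these comparisons is the harmonic mean of precision and recall. The sketch below computes it with strict pixel-wise matching on binarized contour maps. Whether the reported scores use strict matching or a spatial tolerance around the contour is not stated in the text, so the matching rule here is an assumption, and `f_score` is a hypothetical name.

```python
def f_score(pred, target):
    """Pixel-wise F-score for binary contour maps given as flat lists of 0/1."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because contour pixels are a small fraction of each tile, the F-score is more informative here than plain pixel accuracy, which a model could inflate by predicting no contour at all.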

CONCLUSION
Hand-selecting training data is laborious and time-consuming. We present a self-training technique that enhances our baseline's performance and reduces the need to hand-select data. Because well-labeled data is scarce, we also present a dataset that can be used to train deep learning models. Finally, we present a deep learning model that detects contours accurately, trains faster, and uses fewer parameters than state-of-the-art segmentation and object detection models.