Automated Recognition of Permafrost Disturbances using High-spatial Resolution Satellite Imagery and Deep Learning Models

The accelerated warming of the high Arctic has intensified the extensive thawing of permafrost. Retrogressive thaw slumps (RTSs) are considered the most active landforms in Arctic permafrost, and an increase in RTSs has been observed across the Arctic in recent decades. Continuous monitoring of RTSs is important for understanding climate change-driven disturbances in the region. Manual detection of these landforms is extremely difficult because they occur over exceptionally large areas. Very few studies have explored the utility of very high spatial resolution (VHSR) commercial satellite imagery for the automated mapping of RTSs. We have developed a deep learning (DL) convolutional neural network (CNN) based workflow to automatically detect RTSs from VHSR satellite imagery. This study systematically compared the performance of different DL-CNN model architectures and varying backbones. Our candidate CNN models include DeepLabV3+, UNet, UNet++, Multi-scale Attention Net (MANet), and Pyramid Attention Network (PAN), each with ResNet50, ResNet101, and ResNet152 backbones. The RTS modeling experiment was conducted on Banks Island and Ellesmere Island in Canada. The UNet++ model with the ResNet50 backbone demonstrated the highest accuracy (F1 score of 87%) at the expense of training and inference time. The PAN, DeepLabV3, MANet, and UNet models reported lower F1 scores of 72%, 75%, 80%, and 81%, respectively. Our findings reveal the performance of different DL-CNNs in imagery-enabled RTS mapping and provide useful insights on operationalizing the mapping application across the Arctic.


INTRODUCTION
The Arctic has undergone rapid changes in recent years. Temperatures in the region are rising at two to four times the global average (Screen, 2010). As the Arctic warms, the occurrence of permafrost disturbances such as retrogressive thaw slumps (RTSs) has increased (Lantz 2008). Continuous monitoring of these disturbances is important for evaluating their impact on the Arctic environment. However, monitoring is more difficult in the Arctic than in other parts of the world because of extreme weather, remoteness, and logistical challenges. RTSs are thermokarst features created by the rapid thaw of ice-rich permafrost on slopes. An active thaw slump consists of an exposed headwall that defines the upslope boundary of the RTS. Below the headwall, there is a scar zone consisting of muddy exposed soil. The materials in the scar zone can move downslope, creating a tongue-like shape at the other end of the RTS (Figure 1). RTSs impact infrastructure as well as aquatic and terrestrial ecosystems (Kokelj et al. 2013). Sediment and solutes released by RTSs alter the properties of soils and surface waters. Mass movement of sediments and runoff can change the turbidity of adjacent rivers, lakes, and coastal environments (Segal, 2015). Many attempts have been made to map RTSs in the

Mapping of RTS using satellite images
We used a transfer learning strategy to train the candidate DL-CNNs. Transfer learning here has two stages: the first stage uses a backbone CNN and the second stage uses a classifier network. Figure 2 shows a schematic diagram of this approach. The backbone CNN extracts features from the images. Because the backbones have been pre-trained on the ImageNet dataset, a small number of samples is sufficient to train the CNN. The extracted features are then used to segment the RTSs in satellite images. We tested three convolutional backbone networks in this study: 1) ResNet50, 2) ResNet101, and 3) ResNet152 (He, 2016). We used weights pre-trained on the ImageNet dataset and froze these weight values while training on our custom RTS dataset. Our comparative analysis entailed five semantic segmentation algorithms: DeepLabV3, UNet (Ronneberger, 2015), Pyramid Attention Network (PAN) (Li, 2018), Multi-scale Attention Net (MANet) (Fan, 2020), and UNet++ (Zhou, 2018). Table 1 shows the number of total parameters and the number of trainable parameters in each network.
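The distinction between total and trainable parameters can be illustrated with a small sketch. It assumes PyTorch is available and uses a hypothetical two-layer `ToySegmenter` in place of the actual backbones and segmentation heads; the `freeze` and `count_parameters` helpers are illustrative names, not part of the study's code.

```python
import torch.nn as nn


class ToySegmenter(nn.Module):
    """Hypothetical stand-in for a pre-trained backbone plus a segmentation head."""

    def __init__(self, in_channels: int = 4, features: int = 8, classes: int = 1):
        super().__init__()
        # "Backbone": extracts features from the 4-channel input tile.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, features, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # "Classifier": maps features to a per-pixel RTS score.
        self.decoder = nn.Conv2d(features, classes, kernel_size=1)

    def forward(self, x):
        return self.decoder(self.encoder(x))


def freeze(module: nn.Module) -> None:
    """Exclude every parameter of the module from gradient updates."""
    for param in module.parameters():
        param.requires_grad = False


def count_parameters(model: nn.Module):
    """Return (total, trainable) parameter counts."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable


model = ToySegmenter()
freeze(model.encoder)  # only the decoder remains trainable
total, trainable = count_parameters(model)
print(total, trainable)
```

With a frozen backbone, the trainable count covers only the segmentation head, which is why a comparatively small training set can be sufficient.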

Table 1. Comparison of the size variation of candidate DL-CNN models (columns: model, backbone, number of parameters in millions).

Model Training
The RTS modeling was based on high-resolution satellite imagery from Banks Island and Ellesmere Island in Arctic Canada (Figure 3). We selected 12 WorldView-2 satellite images from Banks Island and 14 WorldView-2 satellite images from Ellesmere Island to generate hand-annotated RTS training data. Image scenes were acquired during July-August at 0.5 m spatial resolution with 4 spectral channels (red, green, blue, and near-infrared). Pansharpened and orthorectified imagery was provided by the Polar Geospatial Center, University of Minnesota.

Figure 3. Selected study areas from Banks Island (left) and Ellesmere Island (right) in Canada.
For model training, 475 image tiles (2048 x 2048 pixels, or ~1 km x 1 km on the ground) were selected from each of the study sites shown in Figure 3. The dataset was split into 80%, 10%, and 10% for training, validation, and testing, respectively.
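A minimal sketch of an 80/10/10 split is given here, assuming tiles are identified by integer IDs and shuffled with a fixed seed. The function name `split_tiles`, the seed, and the choice to give any remainder to the test set are illustrative assumptions; the paper does not describe how the split was drawn.

```python
import random


def split_tiles(tile_ids, train_pct=80, val_pct=10, seed=0):
    """Shuffle tile IDs and split them into train, validation, and test lists.

    Percentages are integers so the split sizes do not depend on
    floating-point rounding; leftover tiles go to the test set.
    """
    ids = list(tile_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle (seed is an assumption)
    n_train = len(ids) * train_pct // 100
    n_val = len(ids) * val_pct // 100
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# Example with 475 tiles, the per-site count reported in the text.
train, val, test = split_tiles(range(475))
print(len(train), len(val), len(test))
```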
We used the Adam optimization algorithm with a learning rate of 10^-4 for the first 25 epochs and 10^-5 for the remaining epochs. We used dice loss to calculate the training and validation loss. All models were trained for 100 epochs. We applied three augmentations (horizontal flip, vertical flip, and random 90-degree rotation) to the dataset with 50% probability in each epoch. Figure 4 shows the F1 scores for MANet. All backbone networks achieved 97% accuracy at the end of epoch 50. Figure 5 shows the F1 scores for the DeepLabV3 network. Here, all three backbones reported 96% accuracy at the end of training.
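The learning-rate schedule and the dice loss described above can be sketched in plain Python. This is an illustration under stated assumptions: epochs are counted from zero, masks are flattened into lists of values in [0, 1], and a small smoothing term `eps` is added to avoid division by zero. The exact dice formulation used in the study is not given in the text.

```python
def learning_rate(epoch, switch_epoch=25, initial=1e-4, later=1e-5):
    """Step schedule: `initial` for the first `switch_epoch` epochs, then `later`.

    `epoch` is zero-based (an assumption), so epochs 0..24 use 1e-4.
    """
    return initial if epoch < switch_epoch else later


def dice_loss(predicted, target, eps=1e-7):
    """Soft dice loss for one flattened mask pair.

    predicted: per-pixel RTS probabilities in [0, 1]
    target:    per-pixel ground truth (0 or 1)
    Returns 1 - dice coefficient, so a perfect overlap gives a loss near 0.
    """
    intersection = sum(p * t for p, t in zip(predicted, target))
    denominator = sum(predicted) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


# Two predicted RTS pixels, one of which matches the single true RTS pixel.
print(dice_loss([1, 1, 0, 0], [1, 0, 0, 0]))
print([learning_rate(e) for e in (0, 24, 25, 99)])
```

Dice loss depends only on the overlap between predicted and reference RTS pixels, so the large number of background pixels in each tile does not dominate the loss value.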
Training accuracy for the UNet model is shown in Figure 6. At the end of training, all three backbones achieved 97% accuracy. Figure 7 shows the training F1 scores for the PAN network. All three backbone networks scored 96% accuracy. As seen in Figure 8, UNet++ with ResNet50 showed elevated F1 scores (around 98% at epoch 50) compared to the other two backbones.
Based on the training accuracy (Figures 4-8), we selected the UNet++ model with the ResNet50 backbone as our best-performing model to detect RTSs in the study area. Automated detection of RTSs using high-resolution imagery is a challenging task. A typical 0.5 m resolution image scene is about 20 km x 20 km in size and contains about 1.6 billion pixels. A full image scene does not fit in GPU memory, so it must be split into small tiles. As shown in Figure 9, we first partitioned the image into 2000 x 2000 pixel tiles. We then fed these tiles into the trained DL-CNN model for predictions.
We used an NVIDIA A100 GPU with 40 GB of memory to run our DL-CNN models. The different models were executed using the Segmentation Models PyTorch library (Yakubovskiy 2019). We further utilized other libraries such as OpenCV for image processing, GDAL for accessing satellite images, and Albumentations for image augmentation.
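The partitioning of a scene into 2000 x 2000 pixel tiles can be sketched as a generator of pixel windows. This sketch assumes non-overlapping tiles and smaller windows at the scene edges; the text does not say how partial tiles or tile overlap were handled, and `tile_windows` is an illustrative name.

```python
from typing import Iterator, Tuple

Window = Tuple[int, int, int, int]  # (row_offset, col_offset, height, width)


def tile_windows(scene_height: int, scene_width: int,
                 tile_size: int = 2000) -> Iterator[Window]:
    """Yield non-overlapping windows that together cover the whole scene.

    Windows on the bottom and right edges are clipped to the scene bounds.
    """
    for row in range(0, scene_height, tile_size):
        for col in range(0, scene_width, tile_size):
            height = min(tile_size, scene_height - row)
            width = min(tile_size, scene_width - col)
            yield row, col, height, width


# A 20 km x 20 km scene at 0.5 m resolution is 40,000 pixels on a side.
side = int(20_000 / 0.5)
windows = list(tile_windows(side, side))
print(side * side)   # total pixels in the scene
print(len(windows))  # number of 2000 x 2000 tiles
```

Each window can then be read from the scene, passed to the trained model, and the resulting mask written back at the same offsets.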

Model Comparison
The ResNet50 backbone network consistently performed better in the training stage according to Figures 4-8. Figure 10 shows CNN model performance on the test dataset. Here we chose ResNet50, the best-performing backbone, for the subsequent CNN model comparison. Accuracy scores from the comparative model analysis (Figure 10) identified the UNet++ model as the best contender. The lowest F1 scores were reported by PAN and DeepLabV3. UNet and MANet demonstrated the second and third best performances, respectively. Among the different CNN model-encoder combinations, the UNet++ model with the ResNet50 backbone demonstrated the highest accuracy (F1 score of 87%) at the expense of training and inference time. The PAN, DeepLabV3, MANet, and UNet models reported lower F1 scores of 72%, 75%, 80%, and 81%, respectively.
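For reference, the F1 score used in this comparison can be computed from true positives (TP), false positives (FP), and false negatives (FN) as F1 = 2TP / (2TP + FP + FN). The sketch below assumes a pixel-wise evaluation of binary masks; the text does not state whether the reported scores are pixel-wise or object-wise, and the return value for two empty masks is an arbitrary choice.

```python
def f1_score(predicted, reference):
    """Pixel-wise F1 score for two flattened binary masks (1 = RTS, 0 = background)."""
    tp = fp = fn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1   # RTS pixel correctly detected
        elif p and not r:
            fp += 1   # background pixel labeled as RTS
        elif r and not p:
            fn += 1   # RTS pixel missed
    denominator = 2 * tp + fp + fn
    # No RTS pixels in either mask: defined as 0.0 here (an assumption).
    return 2 * tp / denominator if denominator else 0.0


# One correct detection, one false alarm, one miss.
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```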

RTS Prediction
We applied the trained UNet++ model with the ResNet50 backbone to satellite imagery from Banks Island and Ellesmere Island. Figure 14(a) shows example detections on Banks Island. Over 90% of the RTSs were correctly detected by UNet++ with the ResNet50 backbone. Figure 14(b) shows zoomed-in views of example areas. The trained model was able to detect the RTSs on Banks Island accurately. Figure 15(a) shows example detections on Ellesmere Island. Similar to Banks Island, over 90% of the RTSs were correctly detected by UNet++ with the ResNet50 backbone. Figure 15(b) shows zoomed-in views of example detections.
In all cases the RTS headwall was correctly detected. In some cases, only the scar zone portion of the RTS (see the anatomy of an RTS in Figure 1) was detected. In other instances the debris tongue was also included; however, this was not consistent across all predictions. Figure 16 demonstrates the potential of multi-temporal RTS detection.

CONCLUSION
The central goal of this study was to understand the performance of different deep learning CNN algorithms for the automated recognition of retrogressive thaw slumps from very high spatial resolution commercial satellite imagery. Our comparative analysis entailed five DL-CNN architectures with three encoder (backbone) types.
Our findings reveal the performance of different DL-CNNs in imagery-enabled RTS mapping and provide useful insights on operationalizing the mapping application over large areas. We also demonstrated that our method can be used to find temporal changes in RTSs. RTS headwalls were detected in all predictions, but the detection of scar zone and debris tongue boundaries was not consistent throughout the region. One possible reason is that there is no clear definition for annotating the debris tongue and scar zone. When we closely inspected RTS annotations from other studies, it was evident that the annotation process lacks formality. Important questions that arise in the annotation process include: Should the annotation include the debris flow? The deposition area? If so, how far from the headwall should it extend? In some instances, the debris flow is far more extensive than the RTS itself. A consistent annotation protocol should therefore be agreed upon to enable consistent detection of RTSs using deep learning models.
The UNet++ model performed well at our candidate study sites. However, to employ RTS detection in a circumpolar mapping context, the selected model has to be tested in other areas of the Arctic. This requires a systematic model transferability analysis.
Our study area is among the more challenging for DL-CNN models because there is no visible vegetation; where vegetation cover is present, RTSs stand out from their surroundings. Thus, we think that including a substantial amount of training data representing the heterogeneity of multiple permafrost landscapes would improve the interoperability of the UNet++ model.