DETECTION OF CLOUDS IN MEDIUM-RESOLUTION SATELLITE IMAGERY USING DEEP CONVOLUTIONAL NEURAL NETS

Cloud detection is an integral pre-processing step in remote sensing image analysis workflows. Most traditional rule-based and machine-learning-based algorithms utilize low-level features of clouds and classify individual pixels based on their spectral signatures. Cloud detection using such approaches can be challenging due to a multitude of factors, including harsh lighting conditions, the presence of thin clouds, the context of surrounding pixels, and complex spatial patterns. In recent studies, deep convolutional neural networks (CNNs) have shown outstanding results in the computer vision domain, where they are used to better capture the texture, shape, and context of images. In this study, we propose a deep learning CNN approach to detect cloud pixels in medium-resolution satellite imagery. The proposed CNN accounts for both low-level features, such as color and texture information, and high-level features extracted from successive convolutions of the input image. We prepared a cloud-pixel dataset of 7273 randomly sampled image patches of 320 by 320 pixels taken from a total of 121 Landsat-8 (30m) and Sentinel-2 (20m) image scenes. These satellite images come with cloud masks. From the available data channels, only the blue, green, red, and NIR bands are fed into the model. The CNN model was trained on 5300 image patches and validated on 1973 independent image patches. As the final output from our model, we extract a binary mask of cloud and non-cloud pixels. The results are benchmarked against established cloud detection methods using standard accuracy metrics.


INTRODUCTION
Cloud detection in satellite imagery is an important step in many remote sensing applications (Pugazhenthi & Kumar, 2020). Because clouds are complex in shape, they are difficult to detect in satellite imagery (Shrivastava, 2013). Researchers have developed several methods to detect cloud pixels, including the rule-based Fmask algorithm (Zhu & Woodcock, 2012) and machine learning approaches such as bag-of-words with SVM (Yuan & Hu, 2015) and SVM classification (Bai et al., 2016; Ishida et al., 2018). However, current methods primarily rely on per-pixel classification algorithms and thus focus mainly on the spectral characteristics or the statistics of pixel values. This leads to misclassification of pixels with similar spectral signatures, for example highly reflective man-made structures, desert sand, and snow/ice. Spatial patterns are often ignored, or used only in a simple post-processing step, mainly due to the lack of efficient methods for including them in the analysis (Jeppesen et al., 2019).
Owing to their superior performance in computer vision tasks such as everyday image understanding and medical image analysis, deep learning (DL) algorithms have been widely adopted in remote sensing image analysis. Several DL-based approaches, such as convolutional neural nets (CNNs) (Li et al., 2019; Mateo-Garcia et al., 2017; F. Xie et al., 2017; Zhan et al., 2017), have gained wider attention in recent years; however, the use of sophisticated DL architectures in operational contexts is still at an exploratory phase. A plethora of DL CNN (DLCNN) architectures have been developed and tested in automated image analysis tasks, including classification (VGG16 (Simonyan & Zisserman, 2015), InceptionV3, ResNet50, Xception (Chollet, 2017), InceptionResNetV2 (Szegedy et al., 2017), ResNeXt50 (S. )), detection (R-CNN (Girshick et al., 2014), R-FCN (Dai et al., 2016), SSD (Liu et al., 2016)), semantic segmentation (ParseNet (Liu et al., 2015), U-Net (Ronneberger et al., 2015), PSPNet), and semantic instance segmentation (SOLOv2 (Wang et al., 2020), Mask R-CNN, UPSNet (Xiong et al., 2019), DeepLabv2 (Chen et al., 2017)). Typically, each DLCNN model has its own pros and cons with respect to performance and computational needs. In most instances, these algorithms are application dependent and thus require adaptation strategies such as re-training on a new set of training samples, tuning of hyperparameters, modification of the architecture, and inclusion of additional data inputs. Among other contenders, the U-Net architecture (Ronneberger et al., 2015) is one of the most widely used DLCNN-based image segmentation algorithms. It is a comparatively simple DL architecture and has been reported to outperform, both computationally and in accuracy, other state-of-the-art image segmentation algorithms (Soni et al., 2020).
In addition to spectral properties, clouds have distinct characteristics (e.g., shape attributes, background separation, shadow, density attributes) that can be exploited in an automated classification process (Mahajan & Fataniya, 2019). Clouds can be visually differentiated as bright features in standard RGB imagery, provided that the cloud is not very thin. In addition to the RGB channels, near-infrared (NIR), visible-infrared (VIR), and thermal infrared (T-IR) bands exhibit significant responses to cloudy regions (Jan et al., 2019). The overarching goal of our study is to explore the possibility of modifying the generic U-Net architecture to classify cloud pixels in moderate-resolution satellite images. Through these modifications, we aim to reduce the number of trainable parameters by decreasing the number of convolutional layers, while keeping the architecture capable of extracting contextual information from images.
Depending on sensor characteristics, satellite imagery is acquired at multiple spatial resolutions and spectral specifications. Moderate-resolution satellite sensors such as Landsat-8 and Sentinel-2 record imagery at 30m and 10m resolutions, respectively, whereas very high spatial resolution commercial satellite sensors such as WorldView-2 acquire imagery at 0.5m resolution. A limited number of sensors have dedicated spectral bands (e.g., band 9 (cirrus) of Landsat-8) whose wavelengths are sensitive to clouds. This luxury is not available for the majority of sensors, which have limited spectral ranges; in most cases, spectral coverage is confined to the visible and NIR range. Thus, in our model development process, we purposely focused only on the blue, green, red, and NIR channels of Landsat-8 and Sentinel-2. By doing this, we aimed to understand how feasible and transferable a DLCNN model is when classifying cloud pixels based only on visible and NIR channels. In our systematic experiment, we utilized candidate scenes acquired by the Landsat-8 and Sentinel-2 sensors to train the modified version of the U-Net model. We evaluated the model performance based on standard accuracy metrics.

Study Area
We centred the analysis on satellite image scenes acquired by the Landsat-8 and Sentinel-2 sensors. Both sensors provide cloud masks. Image scenes were chosen randomly to represent different biomes (Figure 1). We downloaded a total of 121 satellite images, 60 from Landsat-8 (30m) and 61 from Sentinel-2 (20m), from the USGS EarthExplorer. We did not use 10m resolution images from Sentinel-2 due to the absence of 10m cloud masks. Distribution of the selected image scenes is shown in Figure 2. Of the available multispectral channels, we selected only the red, green, blue, and NIR bands for model development.
We rely on only four bands for two reasons. First, these bands are available in almost any multispectral satellite imagery (including commercial satellite imagery), so our model can also be utilized with satellites other than Landsat-8 and Sentinel-2. Second, these bands show a significant response for cloud pixels (Yao et al., 2022). The downloaded image scenes were tiled for training and prediction purposes. Figure 2 shows the general workflow.
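To illustrate how the four-band restriction keeps the input sensor-agnostic, the following minimal sketch maps sensor-specific band identifiers onto one fixed channel order. The band identifiers follow the standard Landsat-8 OLI and Sentinel-2 MSI numbering; the use of the narrow NIR band B8A for the 20m Sentinel-2 data, the `scene` dictionary layout, and the function name `select_bands` are assumptions made for illustration and are not taken from our processing code.

```python
# Band identifiers for the four channels used as model input.
# Assumption: standard Landsat-8 OLI and Sentinel-2 MSI band numbering;
# B8A is assumed to be the NIR band of the 20 m Sentinel-2 data.
BAND_MAP = {
    "landsat8": {"blue": "B2", "green": "B3", "red": "B4", "nir": "B5"},
    "sentinel2_20m": {"blue": "B02", "green": "B03", "red": "B04", "nir": "B8A"},
}
CHANNEL_ORDER = ("blue", "green", "red", "nir")


def select_bands(scene, sensor):
    """Return the four input bands of `scene` in a fixed channel order.

    `scene` is a hypothetical dict mapping band identifiers to 2-D grids
    (plain nested lists here; an array library would be used in practice).
    """
    mapping = BAND_MAP[sensor]
    missing = [mapping[c] for c in CHANNEL_ORDER if mapping[c] not in scene]
    if missing:
        raise KeyError(f"scene lacks required bands: {missing}")
    # Bands outside the four selected channels (e.g. cirrus) are ignored.
    return [scene[mapping[c]] for c in CHANNEL_ORDER]
```

Supporting another sensor with visible and NIR bands would then only require one more entry in `BAND_MAP`.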

Data Processing
The array size of the Landsat-8 scenes is approximately 7000 by 7000 pixels, whereas that of the Sentinel-2 scenes is around 5000 by 5000 pixels. Due to memory limitations, smaller tile sizes are preferred for deep learning models. We tiled the large image scenes into a total of 7273 image tiles of 320 by 320 pixels (Figure 3). We randomly selected 5300 image tiles for training the DL model and kept the remaining 1973 tiles for validation and testing. We used the 30m QA_PIXEL band of the Landsat-8 scenes and the 20m SCL band of the Sentinel-2 scenes to create cloud masks. Table 1 summarizes the image scenes and tiles from different sensors.
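The data preparation steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not our exact implementation: it assumes that bit 3 of the Landsat Collection 2 QA_PIXEL band flags cloud, that Sentinel-2 SCL classes 8, 9, and 10 (cloud medium probability, cloud high probability, thin cirrus) are treated as cloud, and that tiles are drawn at random positions inside each scene. All function names, the per-scene tile count, and the seeds are hypothetical.

```python
import random

LANDSAT_CLOUD_BIT = 3                 # assumption: QA_PIXEL bit 3 = cloud
SENTINEL_CLOUD_CLASSES = {8, 9, 10}   # assumption: SCL classes treated as cloud


def landsat_cloud_mask(qa_pixel):
    """Binary mask (1 = cloud) from a 2-D grid of QA_PIXEL integers."""
    return [[(value >> LANDSAT_CLOUD_BIT) & 1 for value in row] for row in qa_pixel]


def sentinel_cloud_mask(scl):
    """Binary mask (1 = cloud) from a 2-D grid of SCL class codes."""
    return [[1 if value in SENTINEL_CLOUD_CLASSES else 0 for value in row]
            for row in scl]


def sample_tile_origins(height, width, tile=320, n=60, seed=0):
    """Draw n random top-left corners so every tile lies inside the scene."""
    rng = random.Random(seed)
    return [(rng.randint(0, height - tile), rng.randint(0, width - tile))
            for _ in range(n)]


def split_tiles(tile_ids, n_train=5300, seed=0):
    """Shuffle tile identifiers and split them into training and validation sets."""
    ids = list(tile_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:]


# Example: 7273 tiles split into 5300 training and 1973 validation tiles.
train_ids, val_ids = split_tiles(range(7273))
```

Fixing the random seed keeps the training and validation split reproducible between runs.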

Model Preparation
We developed and trained a modified version of the U-Net model to detect cloud pixels in medium-resolution satellite imagery. The U-Net architecture is an encoder-decoder deep learning model that was originally developed for biomedical image segmentation (Ronneberger et al., 2015). U-Net concatenates the encoder feature maps (blue blocks in Figure 4) with the up-sampled feature maps of the decoder (red blocks in Figure 4) at every stage to form a ladder-like structure. A simplified block diagram of the modified U-Net architecture is shown in Figure 4. The modified U-Net model consists of 10 smaller blocks. The blocks on the left side reduce the image dimensions in the x and y directions, collect information, and stack the extracted features in the z dimension. Once the feature maps are reduced to 20 by 20 pixels, the red blocks on the right side increase the x and y dimensions until they equal those of the input. In the red blocks, moving from the bottom to the top layers, the size of the z dimension decreases, and the final output layer has the same height and width as the input image. Here we have two classes: one for the cloud object and the other for the background, which is a default class in almost any image segmentation network.
The modified U-Net model that we propose consists of 19 convolutional or deconvolutional layers, whereas the original U-Net model has 23 layers. The modified U-Net uses padded convolutions, whereas the original U-Net uses unpadded convolutions, so its spatial size shrinks with each successive convolutional layer. Padding improves performance by preserving information at the borders (Islam et al., 2021). Overall, the modified U-Net model has fewer parameters than the original version and therefore takes less time to train and to run inference.
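The effect of padding on the feature-map size can be traced with a short calculation. The sketch below assumes 3 by 3 convolutions, two convolutions per encoder level, and 2 by 2 pooling, as in the original U-Net (Ronneberger et al., 2015); four pooling steps are assumed for the modified model because they take a 320-pixel input down to the 20-pixel bottleneck mentioned above. The function name and its parameters are illustrative only, and the number of convolutions per level in the modified model is not implied by this sketch.

```python
def encoder_sizes(size, levels=4, convs_per_level=2, kernel=3, padded=True):
    """Trace the feature-map side length (pixels) along an encoder path.

    Each level applies `convs_per_level` convolutions followed by 2x2 pooling.
    The returned list holds the size at the end of each level, with the
    bottleneck (after its convolutions) as the last entry.
    """
    # An unpadded convolution trims (kernel - 1) pixels; a padded one trims none.
    shrink = 0 if padded else kernel - 1
    sizes = []
    for _ in range(levels):
        size -= convs_per_level * shrink
        sizes.append(size)
        size //= 2  # 2x2 pooling halves the side length
    size -= convs_per_level * shrink
    sizes.append(size)
    return sizes


# Padded convolutions, 320-pixel tiles (modified model, assumed 4 pooling steps).
print(encoder_sizes(320, padded=True))    # [320, 160, 80, 40, 20]
# Unpadded convolutions, 572-pixel input (original U-Net configuration).
print(encoder_sizes(572, padded=False))   # [568, 280, 136, 64, 28]
```

With padding, the decoder can restore exactly the input height and width, so the predicted mask aligns pixel for pixel with the 320 by 320 input tile without cropping.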

Model Training
We trained the proposed U-Net model for up to 300 epochs on a local machine with an Intel(R) Core processor. As seen in the loss graph (Figure 5), the validation loss decreases up to about epoch 150 and then fluctuates around the same values. Figure 6 shows that the validation accuracy for training on the combined data reaches a plateau between epochs 150 and 200.

Accuracy Assessment
We conducted a multi-step accuracy assessment of the outputs, which take the form of class names and binary masks. Output pixels having the same value as the validation cloud mask were considered correctly predicted. Figure 7 shows the confusion matrix and defines the terms true positive, true negative, false positive, and false negative, which are used in the model evaluation metrics. We calculated accuracy, precision, recall, and F1-score for each of the images as well as for the whole validation dataset using equations 1, 2, 3, and 4.
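For reference, the four metrics in their standard form are given below, with TP, TN, FP, and FN denoting the numbers of true positive, true negative, false positive, and false negative pixels. The numbering is assumed to follow the order in which the metrics are named above.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
\text{F1-score}  &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```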

RESULTS AND DISCUSSION
We evaluated the trained model based on the validation dataset which consists of 1973 image tiles randomly selected from 121 Landsat-8 and Sentinel-2 image scenes.
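The per-tile and whole-dataset metrics from the accuracy assessment can be computed from pairs of predicted and reference masks. The following is a minimal sketch using the standard definitions of accuracy, precision, recall, and F1-score; the function names are hypothetical, and returning 0.0 when a metric is undefined (for example, precision on a tile without any predicted cloud pixel) is an assumed convention.

```python
def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN for two equally sized binary masks (1 = cloud)."""
    tp = tn = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif not p and not t:
                tn += 1
            elif p:
                fp += 1
            else:
                fn += 1
    return tp, tn, fp, fn


def scores(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0   # assumed 0.0 if undefined
    recall = tp / (tp + fn) if tp + fn else 0.0      # assumed 0.0 if undefined
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def evaluate(pairs):
    """Return per-tile scores and scores pooled over all (pred, truth) pairs."""
    per_tile = []
    pooled = [0, 0, 0, 0]
    for pred, truth in pairs:
        counts = confusion_counts(pred, truth)
        per_tile.append(scores(*counts))
        pooled = [a + b for a, b in zip(pooled, counts)]
    return per_tile, scores(*pooled)
```

Note that the mean of the per-tile accuracies and the accuracy pooled over all pixels generally differ when cloud cover varies between tiles, so it is worth stating which of the two is reported.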

Evaluation
After model training was completed, we calculated accuracy, precision, recall, and F1-score for each combination of training dataset and evaluation dataset. Table 1 shows that the mean accuracy is high when the model is trained and evaluated on the Landsat-8 dataset only.
However, when evaluating on the combined dataset, the highest accuracy of 87.27% was achieved when the model was trained on the combined dataset. A model trained on data from only one satellite performs poorly on data from the other satellite (Tables 3, 4, 5, 6). All the evaluation metrics (mean accuracy, precision, recall, and F1-score) have higher values when the model is trained and validated on similar types of datasets (Tables 3, 4, 5, 6). The results reported by other researchers (Li et al., 2019) on different datasets (Table 7) are similar to our results. Sample results along with the original cloud masks are shown in Figure 8, where yellow pixels indicate cloudy pixels. Visually, these results look promising, and the cloud pixels appear to be labelled correctly. We also randomly selected some samples with lower accuracy values; Figure 9 shows sample predictions where the accuracy is lower than average. Visual inspection suggests that there are issues with some of the original cloud masks, and most of these cases come from the Sentinel-2 cloud masks.
Figure 9. Some randomly selected predicted cloud masks with lower accuracy.

Challenges
Based on visual inspection, the predicted cloud masks appear consistent with the provided cloud masks. However, the provided cloud masks may contain some incorrect labels. As Figure 10 shows, roads are marked as clouds in some of the Sentinel-2 cloud masks, and rivers are sometimes marked as clouds as well. Such issues in the training samples might cause poor model performance on some image tiles.

CONCLUSION
In this study, we proposed a new approach for cloud detection in medium-resolution satellite imagery using a deep learning-based image segmentation algorithm, U-Net. The core principle is to utilize contextual information in the image rather than relying on traditional cloud detection algorithms based on pixel values alone. Our results showed that the proposed pipeline performed well on the combined Landsat-8 and Sentinel-2 dataset. The proposed method is applicable in a variety of use cases and can be repurposed for different types of satellite imagery. A shortcoming of the method is its reliance on cloud masks provided by different sources. Our future research will address this issue and reduce the dependency on training samples by means of image augmentation of manually inspected training samples.