APPLICATION OF PRE-TRAINED REAL-WORLD SUPER RESOLUTION MODELS TO OPTICAL SATELLITE IMAGE

The single image super-resolution (SISR) technique refers to improving the resolution beyond that of the original image. In recent years, deep learning-based convolutional neural networks have been used to improve the spatial resolution of images. To train such deep learning models, training samples consisting of the original HR images and LR images obtained by bicubic downsampling are commonly used. However, training data created by downsampling has a negative impact when the trained model is applied to real images. This is because the degradation that occurs in real images is unknown, and the synthetically created LR images do not represent the resolution degradation that realistically occurs. Therefore, SISR methods that use realistic degradation, called real-world super-resolution (RWSR), have been proposed. In this paper, we investigate how such RWSR methods using realistic degradation affect the SISR performance on satellite images. The results of applying the trained models to optical satellite images show how effectively the RWSR methods handle optical satellite images compared with deep learning methods that do not model the degradation. In particular, we show that the effect of RWSR, which performs not only upsampling but also noise and blur removal, is significant for the visibility of optical satellite images. Moreover, pre-trained RWSR models can aid in visually deciphering ground objects.


INTRODUCTION
The Single Image Super-Resolution (SISR) is the process of applying an algorithm to a low-resolution (LR) image to derive a higher-resolution (HR) image. The SISR approach does this using a single LR image and is considered a classic problem in computer vision (Dong et al., 2015). In parallel with many other computer vision problems, SR approaches employing deep convolutional neural networks (SRCNNs) have outperformed other techniques in the last few years. Generalizing the SISR task, we can define the task as restoring the high-resolution (HR) image from a low-resolution (LR) image. In this case, the LR image is generated by applying a degradation function f to the HR image as below:

I_LR = f(I_HR)

In other words, the definition of the degradation function f is important for training a SISR model. In general, when training a naive SR model, the LR image is created by bicubic downsampling as the degradation function. However, since the degradation of an actual image and the degradation produced by bicubic downsampling are far apart, applying the trained model to other real images causes problems. Therefore, the real-world super-resolution (RWSR) method (Figure 1) was proposed. RWSR is a super-resolution approach whose degradation function models the degradation that can occur in real situations in more detail. As a degradation function, RWSR methods not only reduce the resolution of the image but also add blur and noise. By training to restore the original HR image from LR images created by such a degradation function, prediction performance is improved for arbitrary input LR images.
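The degradation relation above can be sketched in code. The following is a minimal numpy illustration of our own (not from the paper), in which simple average pooling stands in for the bicubic kernel; the function name `degrade` is hypothetical.

```python
import numpy as np

def degrade(hr: np.ndarray, scale: int = 4) -> np.ndarray:
    """Toy degradation function f: average pooling stands in for
    bicubic downsampling (a simplification for illustration)."""
    h, w = hr.shape
    assert h % scale == 0 and w % scale == 0
    return hr.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

# A synthetic 256x256 "HR" image yields a 64x64 "LR" image.
hr = np.random.rand(256, 256)
lr = degrade(hr, scale=4)
print(lr.shape)  # (64, 64)
```

Training then amounts to learning the inverse mapping from such LR images back to their HR counterparts.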
The single image super-resolution (SISR) technique using deep learning has been applied to remote sensing (Benecki et al., 2018, Lu et al., 2019, Lanaras et al., 2018). There is a wide range of applications of SISR technology for satellite imagery, such as further increasing the resolution of sub-meter imagery like WorldView-2/3, or increasing the resolution of free imagery such as Landsat and Sentinel. The same degradation-function problem occurs when applying the SISR technique to such optical satellite images. Consider, for example, the case where an optical satellite image is upsampled to a resolution equivalent to that of an aerial photo. Since optical satellite images have similar observation conditions to each other, it is possible to directly pair an LR satellite image with an HR satellite image and train the SR task (Pouliot et al., 2018, Shin et al., 2020, Salgueiro Romero et al., 2020). However, optical satellite images and aerial photographs do not correspond pixel by pixel due to misalignment and distortion, and thus cannot be directly paired for SISR training. Moreover, optical satellite images suffer from atmospheric effects and sensor noise in addition to resolution degradation. Therefore, a SISR model trained by simply pairing the downsampled image with the original image is difficult to apply to optical satellite images. Thus, we decided to apply the real-world super-resolution (RWSR) method (Figure 1), which not only improves the resolution of the image but also removes blur and noise. In this paper, to confirm the effectiveness of applying the RWSR method to satellite images, we apply deep learning models trained not on satellite images but on natural images.

SUPER-RESOLUTION
The single image super-resolution (SISR) task is the problem of improving the spatial resolution of an image by considering only the information available from the input image itself and knowledge obtained in the past in the form of algorithms or trained models. Since the publication of SRCNN (Dong et al., 2015), there has been a lot of research on the SISR technique using deep learning. SRCNN is an approach that obtains HR images from LR images by extracting the relationship between LR and HR images using a Convolutional Neural Network (CNN). By incorporating deep learning, it is able to estimate HR images with higher accuracy than conventional methods. In addition, EnhanceNet (Sajjadi et al., 2017) proposed a new SISR method using adversarial training together with loss functions that consider the perceptibility and texture consistency of images. Moreover, another SISR method using adversarial training was proposed (Ledig et al., 2017), which enables us to obtain clearer images than other SISR methods such as SRCNN.
To train a CNN for SISR, we need image pairs that represent the same scene at different resolutions. The LR image is the input to the network, while the HR image is the desired output for the corresponding input. The training process therefore adjusts the weights of the CNN to minimize the error between the image produced by the network and the corresponding ground-truth HR image. When creating training data, it is common practice to obtain image pairs by artificially downsampling the target HR image to obtain the corresponding LR source image. The downsampling ratio depends on the magnitude of the expected SISR effect and is usually between two and four. The appropriate choice of loss function is an important issue when training CNNs; typically, the mean square error (MSE) or mean absolute error (MAE) between the predicted HR image and the ground-truth HR image is used. Generative Adversarial Networks (GANs) have also received a lot of attention from researchers and can be understood as a different training method closely related to the loss function. SRGAN (Ledig et al., 2017) is an example of GAN training for SISR, where a discriminator network that classifies between predicted HR images and ground-truth HR images is trained at the same time. This adversarial approach yields good visual results, although these are not directly reflected in numerical metrics of pixel-wise similarity. In addition, ESRGAN (Wang et al., 2018), based on SRGAN, was proposed. ESRGAN improved the network architecture, adversarial loss, and perceptual loss of SRGAN: it uses a deeper generator built from Residual-in-Residual Dense Blocks (RRDB) without batch normalization layers, and a relativistic discriminator.
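The pixel-wise losses mentioned above are straightforward to write down. A small numpy sketch in our own notation (not the papers' code):

```python
import numpy as np

def mse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean square error between predicted and ground-truth HR images."""
    return float(np.mean((pred - gt) ** 2))

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between predicted and ground-truth HR images."""
    return float(np.mean(np.abs(pred - gt)))

gt = np.array([[0.0, 1.0], [1.0, 0.0]])
pred = np.array([[0.5, 1.0], [1.0, 0.5]])
print(mse(pred, gt))  # 0.125
print(mae(pred, gt))  # 0.25
```

In GAN training, such a pixel-wise loss is combined with the adversarial loss from the discriminator rather than replaced by it.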
The ultimate goal of SISR is real-world application (Chen et al., 2021). However, the methods above do not work well on real images because they are primarily designed for bicubic downsampling. This stems from the fact that they consider the inverse mapping of an ideal HR image to an LR image with a fixed degradation function applied. In general, the blur kernel plays an important role in the success of SISR methods, and the bicubic kernel is too simple as a degradation function (Zhou and Süsstrunk, 2019). To solve this problem, some researchers use real-world degradation functions with blur kernels and noise models (Zhang et al., 2021, Wang et al., 2021a). These studies are called real-world super-resolution (RWSR). Since both LR and HR images may be noisy or blurred, it is not necessary to keep the conventional order of blurring, downsampling, and noise addition when generating LR images. Since the blur kernel space of the conventional degradation model should vary with scale, it is difficult to determine in practice for a very large scale factor. Bicubic degradation is hardly suitable for real LR images, but it can be used for data augmentation and for super-resolution of clean, sharp images. To solve this problem, Zhang et al. proposed a practical degradation model, BSRGAN, for RWSR and showed its effectiveness on real images. Subsequently, further RWSR methods such as Real ESRGAN (Wang et al., 2021a) and SWIN IR have been proposed. It is now possible to construct realistic degradation functions by randomly recombining blur, downsampling, and noise, or by applying each several times. The trained model can then be applied to real images without a drop in performance. In this study, we confirm the effectiveness of this RWSR technique by applying it to optical satellite images.
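Since the blur kernel is central to the degradation model, the following sketch builds an isotropic Gaussian kernel and applies it by naive 2-D convolution in numpy. This is an illustration under our own simplifications, not the actual RWSR implementations:

```python
import numpy as np

def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Isotropic Gaussian blur kernel, normalized to sum to 1."""
    ax = np.arange(size) - size // 2
    k = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    k /= k.sum()
    return np.outer(k, k)

def blur(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 2-D convolution with edge padding (slow, for clarity only)."""
    pad = kernel.shape[0] // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            window = padded[i:i + kernel.shape[0], j:j + kernel.shape[1]]
            out[i, j] = np.sum(window * kernel)
    return out

img = np.random.rand(32, 32)
blurred = blur(img, gaussian_kernel(5, 1.0))
print(blurred.shape)  # (32, 32)
```

Real RWSR degradation models draw kernel shapes and widths at random; a single fixed Gaussian is shown here only to make the mechanism concrete.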

PRETRAINED REAL-WORLD SR MODELS
In this study, we use four pre-trained super-resolution models and apply them to optical satellite images (Table 1).

ESRGAN(baseline)
3.1.1 Deterioration data  One of the well-known non-RWSR methods is ESRGAN (Wang et al., 2018). We used a pretrained ESRGAN as the baseline method. In ESRGAN, the input LR image is created by downsampling the HR image to 1/4 scale using bicubic interpolation. Therefore, only resolution degradation is considered as the degradation of the image.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2022 XXIV ISPRS Congress (2022 edition), 6-11 June 2022, Nice, France

BSRGAN

3.2.1 Deterioration data  In BSRGAN, blur, downsampling, and noise were identified as the important elements of the degradation function. A direct way to improve the utility of the degradation function is to make the space that these three important elements can cover as large and realistic as possible. By adopting a random shuffling strategy for the three critical elements, the degradation function space is further enlarged. In other words, an LR image could be a "noise → downsample → blur" version of an HR image or a "downsample → noise → blur" version of an HR image.
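The random-shuffling idea can be sketched as follows. Blur, downsampling, and noise here are toy stand-ins (box blur, average pooling, Gaussian noise) rather than the actual BSRGAN operators:

```python
import random
import numpy as np

rng = np.random.default_rng(0)

def box_blur(img: np.ndarray) -> np.ndarray:
    """3x3 box blur via shifted slices of an edge-padded image."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def downsample(img: np.ndarray, scale: int = 2) -> np.ndarray:
    """2x average-pool downsampling."""
    h, w = img.shape
    return img.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

def add_noise(img: np.ndarray, sigma: float = 0.05) -> np.ndarray:
    """Additive Gaussian noise."""
    return img + rng.normal(0.0, sigma, img.shape)

def random_degrade(hr: np.ndarray) -> np.ndarray:
    """Apply blur / downsample / noise in a randomly shuffled order."""
    ops = [box_blur, downsample, add_noise]
    random.shuffle(ops)
    img = hr
    for op in ops:
        img = op(img)
    return img

hr = np.random.rand(64, 64)
lr = random_degrade(hr)
print(lr.shape)  # (32, 32) regardless of the sampled order
```

Because each operator is applied exactly once, the output size is fixed while the degradation characteristics vary from sample to sample.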

Network
The network design used the same architecture as ESRGAN (Wang et al., 2018). Experiments on synthetic and real image datasets show that the trained model performs favorably on images corrupted by a variety of degradations.

Real ESRGAN
3.3.1 Deterioration  In Real ESRGAN (Wang et al., 2021a), the authors claimed that the classical first-order degradation function proposed in BSRGAN could not model real situations well. Real ESRGAN therefore proposed a new second-order degradation model. A second-order model is one that repeats the degradation process n times, where each pass adopts the classical degradation model with the same procedure but different hyperparameters. The term "second-order" here mainly refers to the number of times the same operation is applied, as opposed to its use for mathematical functions. In Real ESRGAN, the degradation function was defined by iterative application of blurring, downsampling, and noise.
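A second-order model of this kind can be sketched as repeating one simplified degradation pass with different hyperparameters. This is our toy stand-in, not the Real ESRGAN implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

def degrade_once(img: np.ndarray, scale: int, noise_sigma: float) -> np.ndarray:
    """One simplified classical pass: average-pool downsampling, then noise."""
    h, w = img.shape
    img = img.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))
    return img + rng.normal(0.0, noise_sigma, img.shape)

def second_order(hr: np.ndarray) -> np.ndarray:
    """Repeat the classical pass twice with different hyperparameters."""
    x = degrade_once(hr, scale=2, noise_sigma=0.02)  # first pass
    x = degrade_once(x, scale=2, noise_sigma=0.05)   # second pass
    return x

hr = np.random.rand(128, 128)
lr = second_order(hr)
print(lr.shape)  # (32, 32): two 2x passes give an overall 4x degradation
```

The point of the second pass is that its hyperparameters differ from the first, so the compound degradation covers cases a single pass cannot.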

Network
The generator uses the same network architecture as ESRGAN (Wang et al., 2018), but a few changes are proposed in Real ESRGAN. First, they changed the discriminator from a simple CNN to UNet (Schonfeld et al., 2020) to improve the discriminator capability. In addition, they used spectral normalization in the discriminator to avoid large gradients and stabilize the training.
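Spectral normalization rescales a weight matrix by its largest singular value, usually estimated by power iteration. A minimal numpy sketch of that estimate (an illustration of the technique, not the Real ESRGAN training code):

```python
import numpy as np

def spectral_normalize(W: np.ndarray, n_iter: int = 50) -> np.ndarray:
    """Divide W by its largest singular value, estimated by power
    iteration, so the spectral norm of the result is approximately 1."""
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # estimated largest singular value
    return W / sigma

W = np.random.default_rng(1).normal(size=(16, 8))
W_sn = spectral_normalize(W)
print(np.linalg.norm(W_sn, 2))  # approximately 1.0
```

Bounding the spectral norm of each layer limits how sharply the discriminator's output can change, which is what stabilizes training.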

SWIN IR

Deterioration

The data preparation method of BSRGAN is used to train SWIN IR (Wang et al., 2021b).

Network

The network is composed of residual blocks, each consisting of Swin Transformer (Liu et al., 2021) layers, a convolutional layer, and a residual connection.

EXPERIMENTAL RESULTS
In order to verify the effectiveness of the method on optical satellite images, we conducted a validation experiment using actual optical satellite images. In this section, we describe the dataset, the experimental setup, and the super-resolution results of satellite images.

Dataset
The WorldView-2 (MAXAR) panchromatic image shown in Table 2 was prepared as the input for the experimental evaluation. The ground sample distance (GSD) was 30 cm, and the images were taken over a typical area in Japan. Since the original satellite images cannot be input into the deep learning model at their original size, we split them into patches of 256 × 256 pixels. These patches were used as the input images for the experimental evaluation.
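The patch-extraction step can be sketched as follows; the array shapes are our own illustrative assumption, not the paper's preprocessing code:

```python
import numpy as np

def to_patches(img: np.ndarray, patch: int = 256) -> list:
    """Split an image into non-overlapping patch x patch tiles,
    discarding partial tiles at the right and bottom edges."""
    h, w = img.shape[:2]
    return [img[i:i + patch, j:j + patch]
            for i in range(0, h - patch + 1, patch)
            for j in range(0, w - patch + 1, patch)]

scene = np.zeros((1000, 770), dtype=np.uint8)  # toy stand-in for a satellite tile
patches = to_patches(scene, 256)
print(len(patches), patches[0].shape)  # 9 (256, 256)
```

How the edge remainders are handled (discarded, padded, or overlapped) is an implementation choice; discarding is the simplest option shown here.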

Experimental Setup
For the experiments, only the trained model was changed, while the inference code and data preprocessing were kept common in order to compare super-resolution results on an equal footing. The implementation used in this experiment was KAIR. We used the ABCI of the National Institute of Advanced Industrial Science and Technology (AIST) as the computing environment.

Comparison of pre-trained models
We qualitatively evaluated the characteristics of each pretrained SISR model. Super-resolution was performed on the evaluation images using each pretrained model with a scale factor of 4. Since no ground-truth HR image was available, we evaluated the results visually. The super-resolved results are shown in Figure 2.
Since the GSD of the WorldView-2 input image is 30 cm, the output of the trained model with a scale factor of 4 has a ground resolution equivalent to 7.5 cm. As a comparison, Figure 2 shows the results of super-resolution of the input image using five different methods: bilinear upsampling, ESRGAN as an example of a naive SISR method, and the RWSR methods BSRGAN, Real ESRGAN, and SWIN IR. The overall trend was that both bilinear upsampling and ESRGAN perform super-resolution while retaining the blur of the input image. In other words, they simply improve the resolution of the input image without departing from the original image. However, the real-world super-resolution methods BSRGAN, Real ESRGAN, and SWIN IR improved the resolution while removing the blur in the input image, making it easier to visually read the edges of buildings. This improvement in visibility can be attributed to the fact that the training data incorporate not only resolution degradation but also the various blur and noise degradations that can occur in reality. However, the distortions of the original satellite image are still emphasized, so care must be taken when visually reading the image.
We compared the results of the RWSR methods BSRGAN, Real ESRGAN, and SWIN IR. Figure 3 shows the results for typical ground objects. First, we discuss the results for buildings: all methods improved the visibility of building edges. Next, consider the white lines of the parking lot. Their visibility was improved by all methods, but Real ESRGAN in particular produced an image with enhanced edges. This is because a more realistic method of creating the degradation function, not used in BSRGAN and SWIN IR, was incorporated when creating the dataset. In general, the results of the pretrained RWSR models are expected to assist in visual interpretation.
However, for some ground objects unique to satellite images, sufficient results could not be obtained because the RWSR models were trained only on natural images. For example, artifacts were observed along the white lines of the road. This is thought to reflect the distortion introduced when satellite images are converted to orthoimages. In the super-resolution results for vegetation, the images tended to look flat. This may be because the leaves of trees were treated as high-frequency noise and were suppressed rather than emphasized during the super-resolution process.
4.3.2 Effect of scale factor  Next, we investigated the effect of the scale factor of RWSR. We applied a pretrained Real ESRGAN with scale factor 2 twice (Real ESRGAN x2x2 in Figure 4) and a pretrained Real ESRGAN with scale factor 4 once (Real ESRGAN x4 in Figure 4).
A visual comparison of these results showed no significant differences. This is because the training process of Real ESRGAN removes image degradations (noise, blur, etc.) that may occur in the real world, and thus also removes the artifacts generated by the second 2x super-resolution pass.
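The x2x2 versus x4 comparison concerns only the repeated application of the model. With a trivial stand-in upsampler (nearest neighbour instead of Real ESRGAN), the two paths produce the same output size:

```python
import numpy as np

def upsample(img: np.ndarray, scale: int) -> np.ndarray:
    """Nearest-neighbour upsampling as a stand-in for a pretrained SR model."""
    return np.repeat(np.repeat(img, scale, axis=0), scale, axis=1)

lr = np.random.rand(64, 64)
x4 = upsample(lr, 4)                 # one 4x pass
x2x2 = upsample(upsample(lr, 2), 2)  # two 2x passes
print(x4.shape, x2x2.shape)  # (256, 256) (256, 256)
```

With a learned model the two paths differ in content (the second pass sees the first pass's output, artifacts included), which is exactly what the comparison in Figure 4 probes.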

Poll-based evaluation
In addition, the results of the visual evaluation are shown in Table 3. The evaluation was based on the number of votes selecting the most natural super-resolution result, in the form of a questionnaire over the ten images in Figure 2. The poll covered the RWSR methods (BSRGAN, Real ESRGAN, SWIN IR) in ten different situations; the Real ESRGAN x2x2 images from the previous section were also included. For this evaluation, we focused on textures, edges such as building roofs, and natural objects such as vegetation. The overall tendency was that the SWIN IR and BSRGAN methods were selected. Thus, the RWSR models trained on natural images were applied to optical satellite images for super-resolution, and good results were obtained in the visual evaluation. The reason why Real ESRGAN received few votes is its stronger degradation function compared to BSRGAN and SWIN IR: Real ESRGAN defines a degradation function that can handle a wide variety of real-world degradations, and this is thought to have eliminated some of the texture of the satellite image.
The effectiveness of the pretrained RWSR models can be attributed to the presence of features such as building edges, road edges, and roof textures that appear in satellite images within the domain of the natural images used for training. These results suggest that deep learning models trained on natural images can be effective for various tasks on satellite imagery.

CONCLUSION
In this paper, single-image super-resolution (SISR) of satellite images was achieved by applying trained models of deep learning methods aimed at real-world super-resolution. SISR of satellite images is a task that not only restores resolution but also removes degradations caused by various factors such as atmospheric effects and sensor effects. Therefore, a SISR model trained on pairs of pseudo-LR images (e.g., bicubic downsampling) and base images cannot cope with the degradation of satellite images that occurs in reality. We therefore applied the real-world super-resolution (RWSR) method, whose degradation function corresponds to the real world, to satellite images. To verify the effectiveness of such RWSR deep learning models on satellite images, we used the trained models of BSRGAN, Real ESRGAN, and SWIN IR, which were trained on natural images. By applying these trained RWSR models, we were able to achieve super-resolution of the satellite images and improve their visibility by removing noise. This means that, even for a model trained on natural images, the trained model has already acquired a universal super-resolution capability, and therefore the edge enhancement and noise removal required for super-resolution of satellite images can be achieved. This suggests the effectiveness of applying models trained on natural images to satellite images as well.
However, since the RWSR models used in this paper were trained on natural images, they may not provide sufficient super-resolution results due to the domain gap with satellite images. This problem can be addressed by applying the degradation function not only to natural images but also to satellite images and aerial photos while training RWSR.

REFERENCES

Dong, C., Loy, C. C., He, K., Tang, X., 2015. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2), 295-307.