ANALYSIS OF FOUR GENERATOR ARCHITECTURES OF C-GAN, LOSS FUNCTION, AND ANNOTATION METHOD FOR EPIPHYTE IDENTIFICATION

Deep learning (DL) models require timely updates to maintain their reliability and robustness in prediction, classification, and segmentation tasks. When a deep learning model is tested with a limited test set, its drawbacks are not revealed. Every deep learning baseline model needs timely updates through additional data, changes in architecture, and hyperparameter tuning. This work focuses on updating the Conditional Generative Adversarial Network (C-GAN) based epiphyte identification deep learning model by incorporating four different generator architectures and two different loss functions. The four generator architectures used in this task are Resnet-6, Resnet-9, Resnet-50, and Resnet-101. A new annotation method, called background-removed annotation, was tested to analyse the improvement in the epiphyte identification protocol. All results obtained by changing the above parameters are reported using two common evaluation metrics. Based on the parameter tuning experiment, Resnet-6 and Resnet-9 with binary cross-entropy (BCE) as the loss function attained higher scores, and Resnet-6 with mean squared error (MSE) as the loss function also performed well. The new annotation, which removes the background, had minimal effect on identifying the epiphytes.


INTRODUCTION
Neural network (NN) algorithms are used in many kinds of digital data analysis (Tefas et al., 2013). Advancements in computational hardware, storage, and software are fueling progress in digital data analysis. Deep learning-based data analysis is a part of NN algorithms and is robust for applications with data generated by numerous sources (Jia et al., 2017; Najafabadi et al., 2015). These deep learning (DL) algorithms learn patterns from data through experiential learning, derive features from the input data, and generate learned models (Harshvardhan et al., 2020). The performance of DL algorithms is highly dependent on the quantity and quality of the data used for learning and on their mathematical modelling.
DL algorithms are used for prediction and classification tasks in several disciplines (Rory et al., 2019; Iqbal et al., 2019). DL algorithms consist of deep neural network components, and the organisation of these components is collectively referred to as the architecture. Several state-of-the-art DL architectures are used for image classification, object detection, and image segmentation tasks (Zhao et al., 2019; Nida et al., 2015).
The performance of DL or any NN algorithm varies over time as requirements change. Changes to the structure, parameters, and mathematical modelling of a DL architecture are necessary to improve its performance. Several DL architectures, such as VGG16, GoogleNet, and AlexNet (Chen et al., 2018; Szegedy et al., 2015; Krizhevsky et al., 2012), are used for various image processing applications. DL algorithms are not self-adaptive to new requirements, and they are sometimes computationally intensive. Hence, updated concepts for DL and other NN algorithms are necessary for better learning and effective utilisation of computational resources. One example is the introduction of different types of convolutions in convolutional NN algorithms (Ding et al., 2018) to improve feature extraction at reduced computational cost. DL architectures such as U-Net (Ronneberger et al., 2015) and generative adversarial networks (GAN) (Goodfellow et al., 2014) use innovative architectural components that suit image-to-image translation. Updates to the components of deep learning algorithms aim to produce better output, improved performance, and effective utilisation of computational resources. Shashank et al. (2020) used the Conditional GAN (C-GAN) algorithm introduced by Philip et al. (2017) to identify epiphytes (Werauhia kupperiana) in drone-acquired images. That study modelled the target identification task as an image-to-image translation problem and applied adversarial concepts. C-GAN identified 80% of the epiphytes in the test set. However, that study was not able to produce good output labels in many scenarios. In addition, the algorithm did not perform well when the target plant varied in imaging distance or lighting conditions.
This motivated the current study to explore how the C-GAN algorithm behaves when its hyperparameters and architecture components are changed, with the aim of improving its performance.
This study builds on the work completed by Shashank et al. (2020). First, a new annotation method was used for generating the images needed to train the algorithm. Next, the performance of four generator architectures was evaluated with two different loss functions. The objective of this study is to update the DL-based epiphyte identification model used by Shashank et al. (2020). The proposed model underwent an architectural change to the existing C-GAN (Philip et al., 2017). Results from this study will provide valuable information for developing more robust DL algorithms to identify epiphytes in digital images.

MATERIALS AND METHODS
The experiments were organized in two stages. In the first stage, binary cross-entropy (Shie et al., 2005) was set as the loss function with four different residual networks. In the second stage, all residual networks were coupled with MSE (Zahra et al., 2014) as the loss function. Each experiment trained the C-GAN model with the epiphyte dataset. The Python program trained the model for 200 epochs and saved the final model to the local system. Testing was done separately with 12 images that were not seen by the network during the training phase.
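The two-stage design amounts to a grid of eight training runs. The following is a minimal sketch of that grid; the identifier strings and the dictionary layout are hypothetical and are not taken from the original program.

```python
from itertools import product

# Hypothetical identifiers; the original program's names are not given.
GENERATORS = ["resnet6", "resnet9", "resnet50", "resnet101"]
LOSSES = ["bce", "mse"]  # stage 1 uses BCE, stage 2 uses MSE
EPOCHS = 200             # number of epochs stated in the text


def build_runs():
    """Return one configuration dict per (loss, generator) pair."""
    return [
        {"stage": stage, "loss": loss, "generator": gen, "epochs": EPOCHS}
        for (stage, loss), gen in product(enumerate(LOSSES, start=1), GENERATORS)
    ]


runs = build_runs()
for run in runs:
    print(run)
```

Listing the runs explicitly makes it easy to confirm that every generator is paired with both loss functions before any training starts.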

Epiphyte dataset and annotation
The epiphyte dataset used in this study was acquired in Costa Rica (Sajithvariyar et al., 2019). The dataset consisted of 115 drone-acquired images, of which 98 were used for training, 12 for testing, and the remaining 5 for validation. The validation set was used during training to tune the parameters, and the test set was used to assess the trained network's ability to identify the target plant.
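The split described above could be produced as in the sketch below. The text does not state how images were assigned to each subset, so the seeded random shuffle and the file names are assumptions made for illustration only.

```python
import random


def split_dataset(paths, n_train=98, n_test=12, n_val=5, seed=0):
    """Shuffle the image paths and cut them into train/test/validation lists."""
    if len(paths) != n_train + n_test + n_val:
        raise ValueError("split sizes must add up to the dataset size")
    shuffled = list(paths)
    # Assumption: a seeded random assignment; the paper does not say how
    # images were allocated to the three subsets.
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]
    return train, test, val


# Hypothetical file names standing in for the 115 drone images.
images = [f"epiphyte_{i:03d}.jpg" for i in range(115)]
train, test, val = split_dataset(images)
```

Fixing the seed keeps the 12 test images separate from the training data across repeated runs.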
The C-GAN architecture was implemented in Python and trained on a machine with an i7 processor, 8 GB RAM, and an NVIDIA Quadro P5000 GPU. All input images and labels had dimensions of 256 x 256 x 3. In the earlier study, the target plant was marked with red pixels (Figure 1). The new annotation method removed the background by setting it to black pixels and kept the target plant in true colour. The annotation images generated by Shashank et al. (2020) were used as masks to remove the background pixels and retain the original, true-colour image of the target plant (Figure 1). Retaining the RGB values of the target plant is expected to help C-GAN concentrate on the foreground pixels and generate better output labels.
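The background-removed annotation can be illustrated with a small sketch. It assumes that the earlier annotation marks the target in red on a black background and that a simple threshold decides whether a mask pixel is red. The function names and the threshold are hypothetical, and plain Python lists are used to keep the example free of dependencies.

```python
def is_target(pixel, threshold=128):
    """Treat a mask pixel as target when it is predominantly red (assumed rule)."""
    r, g, b = pixel
    return r >= threshold and g < threshold and b < threshold


def background_removed_annotation(image, red_mask):
    """Keep the original RGB value where the mask is red; set the rest to black."""
    return [
        [img_px if is_target(mask_px) else (0, 0, 0)
         for img_px, mask_px in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, red_mask)
    ]


# Tiny 2 x 2 example: only the top-left pixel is marked as epiphyte.
image = [[(34, 120, 40), (90, 90, 90)],
         [(10, 60, 200), (200, 180, 20)]]
red_mask = [[(255, 0, 0), (0, 0, 0)],
            [(0, 0, 0), (0, 0, 0)]]
annotation = background_removed_annotation(image, red_mask)
```

In the example, the top-left pixel keeps its original colour and the other three pixels become black, which mirrors the two-class label images described in this study.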

C-GAN Generator and Discriminator
The C-GAN deep learning architecture used for epiphyte identification consists of two competing networks, called the generator and the discriminator, with two loss functions. The previous study used the U-Net encoder-decoder architecture (Ronneberger et al., 2015). The encoder maps the input data to a lower dimension and the decoder maps this back to the original dimension. The dimensionality-reduced data are intended to be scale invariant and translation invariant, which is very important for object identification tasks. C-GAN is a variant of GAN that forces the network to derive features from the input data based on a condition, which here is the annotated image. This conditioning ensures that the algorithm focuses on the target region in the input image by referring to the annotation images.
In the present study, we evaluated four different types of deep convolutional neural architecture for the generator networks. The following subsection gives the details of the deep neural networks used for constructing the generator network.

Residual Generator Networks
The generator network architecture was replaced with a deep convolutional neural architecture called residual networks (He et al., 2016). Residual networks were designed by a group of researchers at Microsoft in 2015. The major contribution of the network is to mitigate the vanishing/exploding gradient problem in deep networks. Residual networks are implemented with shortcut connections between layers and train efficiently by learning residual functions, as shown in Figure 2.
Figure 2. The residual network learning residual functions with reference to the layer inputs (He et al., 2016).
Residual networks are available in various layer depths and are named so that the number in the name indicates the layer depth. In this study, we tested four variants of the Resnet architecture as replacements for the C-GAN generator: a) Resnet-6, b) Resnet-9, c) Resnet-50, and d) Resnet-101.
The generator network plays an important role in C-GAN algorithms. It is responsible for generating the fake sample by looking at the conditional parameter, that is, the annotated image. The generator performs well when it can produce fake samples that are close to the original images. At this stage, the discriminator network will fail to differentiate the original from the fake samples. Since this study is mainly focused on the generator networks, we retained the original discriminator network that was used in the previous study. The discriminator network is a PatchGAN architecture that uses the patch size found experimentally to be the best window size by Isola et al. (2017).
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIV-M-3-2021 ASPRS 2021 Annual Conference, 29 March-2 April 2021, virtual
Most studies of deep convolutional networks state that "the deeper the better" (Bekele et al., 2019). Hence, we attempted to improve the performance of deep CNNs by adding more layers. On the other hand, making the networks deeper increases the computational cost and run time. The major issue in a deep neural network is the vanishing gradient problem. When the network is too deep, the gradients of the loss function shrink towards zero as they are propagated back through the layers. As a result, the weight matrix of the early layers receives almost no updates and the network stops learning from the input data. To overcome this hurdle, we need an architecture that is deeper but also free from the vanishing gradient problem.
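The effect of depth on the gradient can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that each layer scales the back-propagated gradient by at most 0.25, which is the maximum derivative of the sigmoid function. Real networks with ReLU activations and normalisation behave differently, so the numbers only indicate the trend across the four depths considered here.

```python
def gradient_scale(depth, per_layer_factor=0.25):
    """Illustrative bound on the gradient scale after `depth` layers.

    Assumption: every layer multiplies the gradient by at most
    `per_layer_factor` (0.25 is the largest slope of the sigmoid).
    """
    return per_layer_factor ** depth


for depth in (6, 9, 50, 101):
    print(f"{depth:>3} layers: {gradient_scale(depth):.3e}")
```

Under this assumption the bound shrinks geometrically with depth, which is why plain networks with 50 or 101 layers are difficult to train without shortcut connections.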
Residual networks enforce efficient learning by mapping residual functions between layers and thereby improve the training process. Rather than learning the desired mapping directly, each block learns a residual function that is added to its input through the shortcut connection. This helps deeper networks to learn without degradation of the training process.
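A residual block can be sketched as follows. This is a simplified stand-in that uses small dense layers instead of the convolutions and normalisation layers of an actual Resnet generator, and the helper names are hypothetical. It shows only the shortcut connection, in which the block output is F(x) + x followed by an activation.

```python
def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vector]


def linear(weights, vector):
    """Multiply a square weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def residual_block(x, w1, w2):
    """Return relu(F(x) + x), where F(x) = w2 . relu(w1 . x)."""
    fx = linear(w2, relu(linear(w1, x)))
    # Shortcut connection: the input is added back to the learned residual.
    return relu([f + xi for f, xi in zip(fx, x)])


# With all-zero weights F(x) = 0, so the block passes its input through.
x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity_out = residual_block(x, zeros, zeros)
```

Because the block reduces to the identity when its weights are zero, adding more blocks does not by itself prevent the signal from reaching deeper layers.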

The C-GAN loss functions
Loss functions are vital in any neural network training to keep track of how well the model is learning its weights. The proposed method involves a binary classification in which the C-GAN must separate background and target pixels. The generator loss depends on whether the discriminator classifies fake samples as real. The discriminator loss penalizes the discriminator for misclassifying a real instance as fake and vice versa. In this study, we used two loss functions for the generator network: a) mean squared error (MSE) and b) binary cross-entropy (BCE). The experiments were conducted for all four generator networks with both loss functions. The outputs generated by the four generator networks with the two loss functions are reported using the structural similarity index (SSIM) and the intersection over union (IoU) score.
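The two generator losses can be written out as in the sketch below. It assumes that the discriminator outputs a probability per sample (or patch) and that the generator is trained with an all-ones target, so that it is rewarded when fake samples are judged real. How these terms were weighted or combined with other terms in the original program is not stated, so this shows only the loss definitions themselves.

```python
import math


def bce_loss(predictions, targets, eps=1e-7):
    """Mean binary cross-entropy; predictions are probabilities in (0, 1)."""
    total = 0.0
    for p, t in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(predictions)


def mse_loss(predictions, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


# Discriminator scores for three generated samples; the generator wants them
# to be judged real, so its targets are all 1 (assumed training convention).
scores = [0.9, 0.5, 0.1]
real_targets = [1, 1, 1]
generator_bce = bce_loss(scores, real_targets)
generator_mse = mse_loss(scores, real_targets)
```

BCE grows without bound as a score approaches the wrong extreme, whereas MSE is bounded by 1 for scores in the range 0 to 1, so the two losses weight confident mistakes differently.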

RESULTS AND DISCUSSION
The SSIM and IoU scores were computed for the predicted label and ground truth label from the different models trained in this study. SSIM measures the structural similarity, and IoU measures the overlap between the predicted labels and the ground truth labels. A Python script was developed to iteratively compute the IoU and SSIM scores for all the test images and their averages.
The new annotation method proposed in this study mainly helped the analyst in examining the predicted labels. It did not improve identification compared with the annotation used in the previous study. This indicates that a masked annotation with false colour is sufficient for the epiphyte identification task.
Analysis of the output images generated by the various models trained with the new annotation method revealed no major difference in the generated output labels. The major advantage of the new annotation for the analyst is that, after the labels are predicted, it is easy to see which portion of the epiphyte the model failed to predict. Effects of the loss function on the predicted labels, such as blurring, are evident with this annotation. This also helped to show that predictions at epiphyte leaf edges and on overlapping leaves are more blurred.
Replacing the generator network with residual networks and using the two generator loss functions, MSE and BCE, generated different output labels. Table 1 summarizes the results obtained from the four residual networks with the BCE and MSE loss functions. Generator networks based on Resnet-6 and Resnet-9 with BCE as the generator loss function achieved the highest IoU and SSIM scores. Table 1 also shows that, with MSE as the loss function, Resnet-6 generated output labels with high SSIM and IoU scores. Resnet-50 and Resnet-101 underperformed because of the small number of training samples: the deeper the network, the more data are required for training. Resnet-50 consists of 50 layers and Resnet-101 of 101 layers, and when this many layers are trained on a small number of samples, training stops improving. Performance degrades and the network generates a poor model. This resulted in poor output labels with low SSIM and IoU scores relative to the ground truth. With MSE as the loss function, the SSIM score was higher for Resnet-6 and was the same for the remaining networks (Table 1). This indicates a limitation of SSIM for evaluating the output labels. SSIM measures the similarity between the predicted and ground truth labels. In this study, all label images consisted of two classes: 1) non-target pixels, which belong to the background (black), and 2) target pixels in their original colour space. Also, in many test images the target plant occupied a small area of the frame compared with the background. Under these circumstances, even when the model fails to predict the target plant correctly, the similarity of the background pixels leads to high SSIM scores (Figure 3).
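The background effect on SSIM can be reproduced with a toy example. The sketch below uses a simplified block-wise SSIM on greyscale values with non-overlapping uniform windows. This is an assumption made to keep the example short: it is neither the standard Gaussian sliding-window formulation nor necessarily what the script used in this study computed.

```python
C1 = (0.01 * 255) ** 2  # stabilising constants for 8-bit images
C2 = (0.03 * 255) ** 2


def ssim_window(a, b):
    """SSIM of two equally sized flat lists of grey values in [0, 255]."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((v - mu_a) ** 2 for v in a) / n
    var_b = sum((v - mu_b) ** 2 for v in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    return ((2 * mu_a * mu_b + C1) * (2 * cov + C2)) / (
        (mu_a ** 2 + mu_b ** 2 + C1) * (var_a + var_b + C2))


def block_ssim(img_a, img_b, block=4):
    """Mean SSIM over non-overlapping block x block windows (simplified)."""
    scores = []
    for r in range(0, len(img_a), block):
        for c in range(0, len(img_a[0]), block):
            rows = range(r, r + block)
            cols = range(c, c + block)
            win_a = [img_a[i][j] for i in rows for j in cols]
            win_b = [img_b[i][j] for i in rows for j in cols]
            scores.append(ssim_window(win_a, win_b))
    return sum(scores) / len(scores)


# 16 x 16 ground truth with a 2 x 2 target; the prediction misses it entirely.
truth = [[0] * 16 for _ in range(16)]
for i in (0, 1):
    for j in (0, 1):
        truth[i][j] = 255
prediction = [[0] * 16 for _ in range(16)]
score = block_ssim(truth, prediction)
```

In this example 15 of the 16 windows contain only background and match exactly, so the mean score stays above 0.9 even though not a single target pixel was predicted.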
The IoU score for Resnet-6 with the MSE loss function was 0.56, and it was 0 for the remaining networks (Table 1). IoU quantifies the overlap between the predicted label and the ground truth label. It is the number of pixels common to the ground truth and predicted labels divided by the total number of pixels present across both labels. The IoU score ranges between 0 and 1, where a value close to 1 indicates that the predicted label is close to the ground truth and 0 indicates no overlap.
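The IoU definition above translates directly into code. The sketch below assumes that the predicted and ground truth labels have already been converted to binary masks (1 for target, 0 for background). The function names and the convention for two empty masks are illustrative choices.

```python
def iou(pred_mask, truth_mask):
    """Intersection over union of two binary masks (lists of rows of 0/1)."""
    intersection = union = 0
    for pred_row, truth_row in zip(pred_mask, truth_mask):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) pairs, e.g. the test images."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


truth = [[1, 1, 0],
         [1, 1, 0]]
pred = [[0, 1, 1],
        [0, 1, 1]]
example = iou(pred, truth)  # 2 shared pixels out of 6 covered pixels
```

Unlike SSIM, this score ignores pixels that are background in both masks, so a prediction that misses the target entirely receives 0 regardless of how much background the image contains.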
The limitations of SSIM can be overcome by also computing the IoU score. The SSIM score is helpful when the predicted label contains some epiphyte pixels and its structural similarity needs to be compared. Using the SSIM score together with IoU therefore gives better clarity on the predicted output labels.
Results obtained from the various models evaluated in this study show that the predicted output labels are more blurred when BCE is the loss function (Table 2). The model trained with Resnet-6 and MSE loss produces sharp images with less blurring. The objective of the BCE loss function is to reduce the error to zero, which results in a blurring effect. Unlike BCE, MSE computes the error based on the squared distance, which results in less blurring than the output obtained using the BCE loss function. Table 2 gives some sample output labels with high IoU and SSIM scores predicted by Resnet-6 and Resnet-9 with BCE and by Resnet-6 with MSE as the loss function.
The results of the experiments also reveal that the SSIM and IoU scores decline as the residual network becomes deeper, from 6 to 101 layers. The deeper the network, the greater the number of images required for training. This also shows the potential for improving the results with more training data.

CONCLUSIONS AND RECOMMENDATIONS
Resnet-6 and Resnet-9 with the BCE loss function, and Resnet-6 with the MSE loss function, generated output labels with higher SSIM and IoU scores. SSIM scores can be misleadingly high if the target plant occupies a small area in the images used for testing. Resnet-50 and Resnet-101 yielded output labels with lower SSIM and IoU scores because of the small number of training images.
The output labels were more blurred for the BCE than for the MSE loss function. Selecting an appropriate loss function is therefore important for reducing blur in the output labels. The current DL architecture calls for further changes to the loss function.
The new annotation, which removes the background, produced no significant improvement in label prediction.
Incorporation of hybrid models might be necessary to improve the epiphyte identification model. This work also highlights the opportunities for further improvement by changing hyperparameters such as the loss function, in addition to incorporating new architectures.