LAND USE LAND COVER MAPPING USING UAS IMAGERY: SCENE CLASSIFICATION AND SEMANTIC SEGMENTATION

ABSTRACT: Land use and land cover (LULC) classification plays a vital role in understanding the changes happening on the surface of the Earth. Vegetation classification can be performed with deep learning models based on the Convolutional Neural Network (CNN) approach. The primary purpose of this research is to test the performance of pre-existing deep CNN models for vegetation classification in the Jupiter Inlet Lighthouse Outstanding Natural Area (JILONA). Specifically, this study examines the capacity of scene and pixel classification techniques using selected 3-band and 5-band combinations. Eight well-known scene-based deep convnets, namely VGG19, ResNet152V2, InceptionV3, EfficientNetB5, Xception, InceptionResNetV2, MobileNetV2 and DenseNet201, and one pixel classification model, UNet, were used for vegetation mapping in the North and South parts of JILONA. Among the scene classification models, EfficientNetB5 and DenseNet201 outperformed all others with an accuracy of over 97%. Xception and ResNet152V2 exhibited about 95% accuracy, whereas the accuracy of the remaining models ranged from 85% to 92%. The classification map produced by the UNet pixel-based method reached an accuracy of about 87% for the North region and 91% for the South region when all 5 bands (Blue, Green, Red, RedEdge, Near-infrared) were used for training.


INTRODUCTION
Jupiter Inlet Lighthouse Outstanding Natural Area (JILONA) is a region covered by various kinds of vegetation, wildlife and cultural components that rely on adequate land cover. Vegetation acts as a purifier and combats climate change; hence it plays a critical role in the biosphere (Biello, 2008). With the growing importance of Land Use Land Cover (LULC) classification, many effective methods are available for vegetation mapping, including satellite-based, aircraft-based and terrestrial-based approaches. Recently, deep learning, a subset of machine learning, has come into the spotlight in the remote sensing field, because CNNs have achieved a series of breakthroughs in remote sensing applications such as classification, object detection and segmentation.
The objective of this study is to demonstrate the capability of various deep learning models to perform LULC classification using UAV imagery and to determine the best CNN model for such vegetation classification. This is important when a particular site needs to be examined for improvement and preservation. Specifically, this research aims to examine the power of deep CNNs for LULC classification based on UAV imagery, to investigate the generalization capacity of existing CNN models, to compare the efficiency of well-known deep CNN scene-based models (VGG19, ResNet152V2, InceptionV3, InceptionResNetV2, MobileNetV2, DenseNet201, Xception and EfficientNetB5), and to study UNet, which is known to be an efficacious pixel-based classification model. The study thus maps the vegetation cover through a comprehensive analysis of unmanned aerial remote sensing data acquired during 2018 and 2020.

MATERIALS AND METHODS
Although various techniques exist for vegetation mapping using satellite images of different resolutions (Xie et al., 2008), the traditional classification approach is not sufficient for community-level mapping. Using high-resolution multispectral sensors can improve the image classification process (Cingolani et al., 2004). This research uses UAS imagery collected with a Micasense RedEdge camera to classify land cover classes. The overall workflow is shown in Figure 1. Deep learning, a subset of machine learning, covers algorithms built on neural networks, which were inspired by the structure and function of the human brain. As with other data science techniques, the performance of deep learning algorithms depends on the amount of data: accuracy generally improves as the quantity of training data increases. Deep learning is often referred to as "feature learning" because it extracts features from raw data automatically. It tends to perform best when the inputs and outputs are analog data, such as image pixels, text and audio, because the model gains knowledge from the input data through a general-purpose learning procedure, without the intervention of human engineers. 'Keras applications', part of Keras, a high-level API for TensorFlow, provides various models with pre-trained weights that can be used for tasks including feature extraction and prediction. Other Python modules such as pandas, scikit-learn, matplotlib, seaborn and pickle were also used in this study. The efficiency of several Keras deep learning models for vegetation classification in the JILONA region is studied using a scene-based classification approach. In addition, a pixel-based classification approach is applied to produce a classification map.
This approach showcases and compares different combinations of the bands available from the Micasense RedEdge camera. From this band information, three major combinations were identified: (1) Red, Green, Blue; (2) Green, Red, Near Infrared; (3) Blue, Green, Red, RedEdge, Near Infrared. The accuracy of the models trained on each combination was investigated.
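The three band combinations can be illustrated with a short sketch. It assumes the orthomosaic is held as a NumPy array of shape (height, width, 5) with bands stacked in the order Blue, Green, Red, RedEdge, Near Infrared; that ordering, the `BAND_INDEX` table and the `select_bands` helper are hypothetical names introduced here for illustration and are not taken from the study's code.

```python
import numpy as np

# Assumed band order of the stacked orthomosaic (hypothetical layout).
BAND_INDEX = {"blue": 0, "green": 1, "red": 2, "rededge": 3, "nir": 4}

# The three combinations compared in this study.
COMBINATIONS = {
    "RGB": ["red", "green", "blue"],
    "GRN": ["green", "red", "nir"],
    "ALL5": ["blue", "green", "red", "rededge", "nir"],
}


def select_bands(image, combination):
    """Return the channels of `image` (H, W, 5) named by `combination`."""
    idx = [BAND_INDEX[name] for name in COMBINATIONS[combination]]
    return image[:, :, idx]


# Synthetic 4 x 4 patch in which every pixel value equals its band index.
patch = np.broadcast_to(np.arange(5), (4, 4, 5))
rgb = select_bands(patch, "RGB")
print(rgb.shape)  # (4, 4, 3)
```

Keeping the combinations in one lookup table makes it straightforward to train the same model on each subset without changing the rest of the pipeline.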

Figure 1: Research Methodology / Workflow
The raw data from the Micasense RedEdge camera was processed using Pix4D Mapper to generate orthomosaic imagery. Training samples were then collected separately for the scene-based and pixel-based classification methods, according to the requirements of each model. The collected samples were used to train the deep learning models so that their accuracy could be compared. During training, augmentation was applied to increase the number of training samples. The Adam optimizer, the SoftMax activation function and the categorical cross-entropy loss function were examined during model training. For both scene-based and pixel-based classification, 80% of the total dataset was used as the training set and the remaining 20% was divided equally between the validation and testing sets. The acquired ground truth data was also used to test the accuracy of the developed pixel-based classification model. Evaluation metrics such as accuracy, intersection over union (IoU) and Dice coefficient were analysed while testing the model. The output of the scene-based classification model is the name of the class to which the input image most probably belongs, whereas the output of the pixel-based classification model is a class code for each pixel. Hence, the pixel-based method can produce a classification map of the whole region directly.
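For reference, the IoU and Dice metrics named above can be computed per class as in the following minimal sketch. It works on flat sequences of class codes; the function name `iou_and_dice` and the toy labels are illustrative assumptions, not the evaluation code used in the study.

```python
def iou_and_dice(y_true, y_pred, class_code):
    """IoU and Dice for one class, given flat sequences of class codes."""
    true_mask = [t == class_code for t in y_true]
    pred_mask = [p == class_code for p in y_pred]
    intersection = sum(t and p for t, p in zip(true_mask, pred_mask))
    true_count = sum(true_mask)
    pred_count = sum(pred_mask)
    union = true_count + pred_count - intersection
    if union == 0:
        # Class absent from both reference and prediction: undefined.
        return float("nan"), float("nan")
    iou = intersection / union
    dice = 2 * intersection / (true_count + pred_count)
    return iou, dice


reference = [0, 0, 1, 1, 1, 2]
predicted = [0, 1, 1, 1, 2, 2]
iou, dice = iou_and_dice(reference, predicted, class_code=1)
print(iou, dice)
```

The two metrics are monotonically related (Dice = 2 IoU / (1 + IoU)), so they rank models identically for a single class, but Dice weights the overlap more heavily.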

Scene Based Classification
A scene classifier is a deep learning technique in which scenes from the collected imagery are assigned to categories. Whereas an object-based method classifies prominent objects in the foreground, scene classification uses the layout of the objects within the image together with the ambient context. It is often referred to as scene analysis or scene recognition.

VGG19
VGG19 is one of the variants of the Visual Geometry Group (VGG) model. It comprises 19 weight layers (16 convolutional layers and 3 fully connected layers), together with 5 MaxPool layers and 1 SoftMax layer. Similar variants include VGG11 and VGG16. The VGG family shows more accurate results as deeper stacks of convolutional layers are used. The model uses spatial padding, convolution and max pooling, with each convolution followed by the Rectified Linear Unit (ReLU) activation function, which introduces non-linearity so that the model can classify effectively. The output of the fully connected layers is fed into the SoftMax function, which is the final layer of the model.
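The role of the final SoftMax layer, and of the categorical cross-entropy loss it is usually paired with, can be shown with a small standard-library sketch. The logits and the three-class one-hot label below are made-up values for illustration only.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss for one sample: negative log-probability of the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


probs = softmax([2.0, 1.0, 0.1])
loss = categorical_cross_entropy([1, 0, 0], probs)
print(probs, loss)
```

The scene label reported by a classifier is simply the class with the highest SoftMax probability, while the loss shrinks as that probability approaches 1 for the correct class.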

ResNet152V2
ResNet stands for Residual Neural Network, an architecture widely used for image classification tasks. It supports various architectural configurations that help attain a satisfactory balance between speed and quality. It introduces a structure called the residual learning unit, which alleviates the degradation problem of deep neural networks. The unit is a feedforward block with a shortcut connection that adds the block's input to its output, which improves classification accuracy without adding to the model's complexity. Among the variants of the ResNet family, ResNet152V2 is said to have the best accuracy.
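The idea behind the residual learning unit can be sketched in a few lines: the unit outputs F(x) + x, so the stacked layers only have to learn the residual F. The sketch below uses plain Python lists and a hypothetical `transform` callable standing in for the convolutional branch; it is a conceptual illustration, not the Keras implementation.

```python
def residual_unit(x, transform):
    """Return F(x) + x, where `transform` plays the role of the branch F."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must match the shape of x for an identity shortcut")
    return [f + xi for f, xi in zip(fx, x)]


def zero_branch(x):
    """A branch that has learned nothing: F(x) = 0 for every element."""
    return [0.0 for _ in x]


def halve_branch(x):
    """A toy branch that returns half of its input."""
    return [0.5 * v for v in x]


print(residual_unit([1.0, 2.0, 3.0], zero_branch))   # [1.0, 2.0, 3.0]
print(residual_unit([1.0, 2.0, 3.0], halve_branch))  # [1.5, 3.0, 4.5]
```

When the branch contributes nothing, the unit reduces to the identity, which is why adding more such units does not have to degrade accuracy.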

InceptionV3
InceptionV3 is an extension of the popular GoogLeNet network. Combined with transfer learning, it has achieved strong performance in several applications. Following GoogLeNet, InceptionV3 proposed a new inception module that concatenates convolution filters of multiple sizes into a new filter. This reduces the number of parameters that need to be trained and thereby decreases the computational complexity.

InceptionResNetV2
Combining the Inception structure with residual connections resulted in the model called InceptionResNetV2. In this model, convolution filters of multiple sizes are fused through residual connections to form an Inception-ResNet block. These residual connections avoid the degradation problems caused by deep structures and thereby reduce the training time.

DenseNet201
In a Dense Convolutional Network (DenseNet), each layer receives the feature maps of all preceding layers as input and passes its own feature maps on to all subsequent layers. Through this concatenation, each layer in the model receives "collective knowledge". Hence, the network possesses strong gradient flow, parameter and computational efficiency, and diverse features, including low-complexity features from early layers. Because the classifier uses features of all complexity levels, it tends to produce smoother decision boundaries.
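The effect of concatenation on layer inputs can be made concrete by counting channels inside one dense block. In the sketch below, each layer is assumed to add a fixed number of feature maps (the "growth rate"); the values 64, 32 and 6 are illustrative assumptions rather than figures reported in this study.

```python
def dense_block_input_channels(initial_channels, growth_rate, num_layers):
    """Channels seen by each layer: the block input plus all earlier outputs."""
    return [initial_channels + i * growth_rate for i in range(num_layers)]


def dense_block_output_channels(initial_channels, growth_rate, num_layers):
    """Channels leaving the block after every layer's maps are concatenated."""
    return initial_channels + num_layers * growth_rate


inputs = dense_block_input_channels(64, 32, 6)
output = dense_block_output_channels(64, 32, 6)
print(inputs, output)  # [64, 96, 128, 160, 192, 224] 256
```

Each layer produces only a small number of new maps yet sees everything computed before it, which is the source of the parameter efficiency described above.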

Xception
Xception, which is an extension of Inception, replaces the standard Inception modules with depthwise separable convolutions. It is a convolutional neural network that is 71 layers deep.
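The saving offered by a depthwise separable convolution can be seen by counting weights. The sketch below compares a standard convolution with its separable counterpart (a per-channel spatial filter followed by a 1 x 1 pointwise convolution), ignoring biases; the kernel size and channel counts are arbitrary example values.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """Weights in a standard convolution (bias terms ignored)."""
    return kernel * kernel * in_channels * out_channels


def separable_conv_params(kernel, in_channels, out_channels):
    """Weights in a depthwise (k x k per channel) plus pointwise (1 x 1) pair."""
    depthwise = kernel * kernel * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


standard = standard_conv_params(3, 128, 256)
separable = separable_conv_params(3, 128, 256)
print(standard, separable)  # 294912 33920
```

For this example the separable form needs roughly one ninth of the weights, which is the kind of reduction that makes such architectures attractive.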

MobileNetV2
MobileNetV2 uses inverted residual blocks with bottleneck features, and it has fewer parameters than the original MobileNet. It supports any input size greater than 32 x 32, and its accuracy depends on the size of the input image: the larger the image, the better the performance. Two types of blocks exist in MobileNetV2, and the major difference between them is the stride value. Each block has three layers: a convolution layer with ReLU, a depthwise convolution layer, and a convolution layer without non-linearity. This design makes the model effective at extracting features for object detection and semantic segmentation. It is a type of convolutional neural network designed to perform well on mobile devices.

EfficientNetB5
EfficientNet is regarded as a state-of-the-art network with a comparatively small number of parameters. It has variants ranging from B0 to B7. The input resolution of EfficientNetB5 is 456 x 456 pixels, and training the model is relatively fast. All the variants of EfficientNet are scaled up from EfficientNet-B0 using different compound coefficients. Compared with earlier networks of similar accuracy, this reduces the number of parameters and FLOPS by up to an order of magnitude.
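Compound scaling can be sketched as follows. A single coefficient phi scales depth, width and input resolution together as alpha^phi, beta^phi and gamma^phi. The constants below (1.2, 1.1, 1.15) are the values reported in the original EfficientNet publication, not values from this study, and the released B1 to B7 configurations are not guaranteed to follow the formula exactly, so the output should be read as indicative only.

```python
# Constants from the original EfficientNet publication (assumption here).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution, FLOPS) multipliers relative to B0."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPS grow linearly with depth and quadratically with width and resolution.
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return depth, width, resolution, flops


for phi in range(4):
    depth, width, resolution, flops = compound_scale(phi)
    print(phi, round(depth, 2), round(width, 2), round(resolution, 2), round(flops, 2))
```

Because alpha x beta^2 x gamma^2 is close to 2, each unit increase of phi roughly doubles the computational cost.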

Pixel Based Classification
In this method, each individual pixel is used in the classification task to define a land cover class, since the pixels of the imagery differ from one another. For example, floristic plant communities in a tropical savanna were classified at a large spatial scale by integrating SPOT 5 and Landsat 5 TM images with auxiliary data (Donna et al., 2012). Each pixel, or group of pixels, has different brightness values, texture, etc., in different bands, and these are the main features that allow the model to predict the land cover class it belongs to. These features and the related statistics derived from each pixel allow the trained model to differentiate the classes.

UNet
UNet, one of the popular fully convolutional networks (FCN), comprises a contraction path and an expansion path. The contraction path extracts more advanced features and reduces the size of the feature maps, whereas the expansion path recovers the size of the segmentation map. UNet is a semantic segmentation approach, that is, it associates each pixel of an image with a corresponding class label.
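How the two paths change the feature map size can be traced with a small sketch. It assumes size-preserving ("same") convolutions, 2 x 2 pooling in the contraction path and 2x upsampling in the expansion path; the depth of 4 is an illustrative choice, not the configuration of the CustomUNet or SatelliteUNet models.

```python
def unet_feature_map_sizes(input_size, depth):
    """Spatial size at each level of the contraction and expansion paths."""
    if input_size % (2 ** depth) != 0:
        raise ValueError("input_size must be divisible by 2 ** depth")
    # Each pooling step halves the height and width.
    contraction = [input_size // (2 ** level) for level in range(depth + 1)]
    # Each upsampling step doubles them again, back to the input size.
    expansion = contraction[-2::-1]
    return contraction, expansion


down, up = unet_feature_map_sizes(256, 4)
print(down)  # [256, 128, 64, 32, 16]
print(up)    # [32, 64, 128, 256]
```

Because the expansion path ends at the input size, the network can emit one class label per input pixel.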

STUDY AREA AND DATASET
The geographic region of interest for this study is the Jupiter Inlet Lighthouse Outstanding Natural Area (JILONA), shown in Figure 2. It is located in northern Palm Beach County along the Atlantic coast of South Florida. It is part of the Bureau of Land Management's 27-million-acre National Conservation Lands and is the only complete unit east of the Mississippi River. It adjoins the Loxahatchee River and the Indian River Lagoon and is only half a mile from the Atlantic Ocean. It is situated 14 miles north of West Palm Beach, approximately a 2-hour drive north of Miami via Interstate 95. The total land cover area is about 120 acres of open space.
The data for this research was collected as very high-resolution drone imagery. The Unmanned Aircraft System (UAS) data was collected using drones carrying the Micasense multispectral sensor package in two different years. For the scene classification approach, the three-band (RGB) data collected during 2018 was used, whereas for pixel classification, the five-band (Blue, Green, Red, RedEdge, Near-infrared) data collected during September 2020 was used. UAS has several advantages, such as low device cost, ultra-high spatial resolution, enhanced temporal resolution, shorter survey time and the ability to study inaccessible areas. The Micasense RedEdge sensor is commonly preferred for its calibrated narrow bands, durability, ruggedness, and flexibility to integrate with several types of drones. The Micasense RedEdge weighs about 230 g, including the DLS 2 light sensor and cables. Its calibration and global shutter imagers enable accurate, high-resolution measurements of the blue, green, red, red edge and near-infrared spectral bands. The sensor captures the blue band at a 475 nm center wavelength with 32 nm bandwidth, the green band at 560 nm with 27 nm bandwidth, the red band at 668 nm with 14 nm bandwidth, the red edge band at 717 nm with 12 nm bandwidth, and the near-infrared band at 842 nm with 57 nm bandwidth. The Micasense RedEdge is among the most effective sensors available today because of its ground sample distance (GSD) of 8 cm per pixel (per band) at 120 m (~400 ft) AGL, its DLS 2 light sensor, its global shutter, which helps avoid image distortion even at a capture rate of 1 per second (all bands), and its ability to generate plant health indexes and RGB color images that are aligned with all bands.
Apart from being compact and lightweight, the sensor has a wide 47.2-degree HFOV and various interfaces, including serial, 10/100/1000 Ethernet, removable Wi-Fi, external trigger, GDS and SDHC, which add to its standing as one of the best sensors of recent times. It is also convenient to use with a variety of UAS because of its triggering options: timer mode, overlap mode, external trigger mode, and manual capture mode (Micasense RedEdge, 2022).
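The sensor figures quoted above can be turned into simple flight-planning estimates. The sketch below assumes a nadir-pointing camera over flat terrain, in which case GSD scales linearly with flying height and the ground swath follows from the horizontal field of view; the function names are hypothetical and the calculation is a geometric approximation rather than a manufacturer formula.

```python
import math

REFERENCE_GSD_CM = 8.0        # GSD quoted at the reference altitude
REFERENCE_ALTITUDE_M = 120.0  # reference altitude above ground level
HFOV_DEG = 47.2               # horizontal field of view


def gsd_cm(altitude_m):
    """Ground sample distance in cm, scaled linearly from the reference."""
    return REFERENCE_GSD_CM * altitude_m / REFERENCE_ALTITUDE_M


def swath_width_m(altitude_m):
    """Ground width covered by one image for a nadir view over flat terrain."""
    return 2.0 * altitude_m * math.tan(math.radians(HFOV_DEG / 2.0))


print(gsd_cm(60.0))                    # 4.0
print(round(swath_width_m(120.0), 1))  # about 105 m
```

Halving the flying height halves the GSD but also halves the swath, so finer resolution comes at the cost of more flight lines.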

Training
The drone data collected in 2018, which contains only three bands (Red, Green, and Blue), was used for the scene classification method. The North and South JILONA imagery was merged to perform scene-based classification, and training samples were collected from the mosaicked imagery of both regions. Nine classes were chosen: Cabbage palm, Grassland, Ground, Mangrove, Parking Lot, Road, Sand pine, Scrub oak and Sea grape. All the collected training samples were then subjected to augmentation techniques such as rescaling, horizontal flipping, vertical flipping, rotation, brightness adjustment, shear, and zoom. Image augmentation is a technique in which existing data is altered so that more data samples are generated for model training. Some results of the augmentation process are shown in Figure 3. Finally, the whole dataset was split so that 80% was allocated to training, 10% to validation and the remaining 10% to testing. Each model was trained for about 50 epochs.
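The 80/10/10 split described above can be sketched with the standard library. The `split_dataset` helper, the fixed seed and the use of integer sample IDs are assumptions made for illustration; the study does not state how the split was implemented.

```python
import random


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle `items` reproducibly and split into train, validation and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to testing
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

Shuffling before splitting keeps all nine classes represented in each subset on average; a stratified split would guarantee it for small classes.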

Validation and Testing
Once a model is trained using the extracted dataset, it needs to be validated and tested. For both training and testing, an image patch size of 256 x 256 pixels was used. Figure 4 shows the time taken to train each scene classification model on a Tesla T4 GPU with 170 GB RAM under a 64-bit Windows OS. The average accuracy across all the models was nearly 93%.
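Cutting 256 x 256 patches from a larger mosaic can be sketched as below. The sketch assumes non-overlapping tiles and simply discards incomplete tiles at the right and bottom edges; whether the study used overlap or padding is not stated, so this is one possible scheme only.

```python
import numpy as np


def tile_patches(mosaic, patch_size=256):
    """Cut a (H, W, bands) array into non-overlapping square patches."""
    height, width = mosaic.shape[:2]
    patches = []
    for row in range(0, height - patch_size + 1, patch_size):
        for col in range(0, width - patch_size + 1, patch_size):
            patches.append(mosaic[row:row + patch_size, col:col + patch_size])
    return patches


# Synthetic 600 x 520 three-band mosaic standing in for real imagery.
mosaic = np.zeros((600, 520, 3), dtype=np.uint8)
patches = tile_patches(mosaic)
print(len(patches), patches[0].shape)  # 4 (256, 256, 3)
```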

Evaluation Metrics
To evaluate the developed scene classification models, several evaluation metrics were used, which are discussed in this section. Such metrics measure the quality of a deep learning model or statistical analysis and include classification accuracy, the confusion matrix, logarithmic loss and Cohen's kappa coefficient. Classification accuracy is the ratio of the number of correct predictions to the total number of input samples, whereas logarithmic loss (log loss) penalizes false classifications according to the probability the model assigned to them. The confusion matrix describes the complete performance of the model, class by class. Evaluation metrics help determine whether the model is operating correctly and optimally.
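The relationship between the confusion matrix, classification accuracy and Cohen's kappa can be illustrated with a short sketch. The labels below are toy data with three classes; the helper names are hypothetical, and in practice a library such as scikit-learn would normally be used.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are reference classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy_and_kappa(matrix):
    """Overall accuracy and Cohen's kappa from a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Agreement expected by chance, from row and column totals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / (total * total)
    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined in this case
    return observed, (observed - expected) / (1.0 - expected)


y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 2, 0]
matrix = confusion_matrix(y_true, y_pred, num_classes=3)
accuracy, kappa = accuracy_and_kappa(matrix)
print(matrix, accuracy, round(kappa, 3))
```

Kappa is lower than accuracy here because it discounts the agreement that would be expected from the class frequencies alone.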

Accuracy and Loss Charts
The training accuracy, validation accuracy, training loss, and validation loss for all the explored models are shown in Figure 5. From these charts, one can determine whether a model is learning properly or has issues such as overfitting (memorizing the training samples) or underfitting, so that training can be adjusted to minimize the loss and thereby maximize the accuracy.

Confusion Matrix
Figures 6 and 7 show the confusion matrices and the accuracy and Cohen's kappa of all the investigated models, respectively. EfficientNetB5 and DenseNet201 proved to be the best scene classification models for the JILONA region, with an accuracy of nearly 98%. In contrast, VGG19 scored the lowest accuracy, 85.5%. MobileNetV2, ResNet152V2, InceptionV3, InceptionResNetV2 and Xception exhibited accuracies of about 90.6%, 95.1%, 91%, 91.8% and 95.4%, respectively.
Figure 7: Model accuracy and kappa co-efficient
Figures 8 and 9 depict the training data for the pixel classification approach. Nearly 30% of the training data was collected separately for the North and South regions of JILONA through ArcGIS Pro from the existing classification output of a Support Vector Machine (SVM). This was done because a classification map is needed as an input for the UNet methodology. The SVM classifier is a supervised learning method that is memory efficient and effective in high-dimensional spaces (Kesavan, 2021). The existing datasets were then artificially expanded through image augmentation to train the deep learning model. Seven classes, namely Oakscrub, Sandpine, Grass, Ground, Palm, Shadow and Water, were considered for the South region of JILONA, whereas eight classes were considered for the North region, which includes Mangrove trees in addition to the seven classes of the South. With the help of the existing SVM results, the UNet model was developed and trained for various numbers of epochs using different band combinations. Quality Assurance and Quality Check (QA-QC) was performed manually on both the North and South training datasets before they were used as input for the model. QA-QC helped remove many misclassified images, which improved the prediction accuracy.
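For pixel classification, augmentation has to transform each image and its label mask identically, otherwise pixels and class codes no longer line up. The sketch below shows this for flips and 90-degree rotations with NumPy; the `augment_pair` helper and the synthetic arrays are assumptions for illustration and do not reproduce the augmentation settings of the study.

```python
import numpy as np


def augment_pair(image, mask, flip_horizontal=False, flip_vertical=False,
                 quarter_turns=0):
    """Apply the same geometric transform to an (H, W, B) image and (H, W) mask."""
    if flip_horizontal:
        image, mask = image[:, ::-1], mask[:, ::-1]
    if flip_vertical:
        image, mask = image[::-1], mask[::-1]
    image = np.rot90(image, k=quarter_turns, axes=(0, 1))
    mask = np.rot90(mask, k=quarter_turns, axes=(0, 1))
    return image.copy(), mask.copy()


# Synthetic 4 x 4 five-band image; the mask stores each pixel's index so
# that alignment between image and mask can be checked after augmentation.
image = np.arange(4 * 4 * 5).reshape(4, 4, 5)
mask = image[:, :, 0] // 5
aug_image, aug_mask = augment_pair(image, mask, flip_horizontal=True,
                                   quarter_turns=1)
print(np.array_equal(aug_image[:, :, 0] // 5, aug_mask))  # True
```

Photometric changes such as brightness adjustment would be applied to the image only, since they do not move pixels.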

Experiments
Two types of UNet models, CustomUNet and SatelliteUNet, were examined in this study with three different band combinations. Other hyperparameters such as the number of layers, learning rate, batch size and optimizer were also investigated in detail. In addition, the models were found to perform better when augmented images were used and the North and South regions of JILONA were trained separately. Training was carried out for several numbers of epochs to track the improvement of each model. All the assessments made during this study, together with the best-performing model, are shown in Table 1.
ASPRS 2022 Annual Conference, 6-8 February & 21-25 March 2022

Validation and Testing
Once training was completed for all the developed models, testing was performed on randomly chosen points with ground-verified classes. The individual accuracy of each class for the best model is shown in Figure 10.

Evaluation Metrics
The overall accuracy for the three major results of this study is given in Table 2. The Custom UNet model performs best when using all five bands, with an accuracy of 87.02% for the North imagery and 91% for the South imagery of the JILONA region. The overall accuracy was calculated from the individual accuracies explained in section 5.3. Overall accuracy is the ratio of the number of correctly classified pixels to the total number of pixels.
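The link between per-class and overall accuracy can be shown with a small sketch based on ground-verified points. The class names are taken from this study, but the reference and predicted labels below are invented toy values, and the helper name is hypothetical.

```python
from collections import Counter


def class_and_overall_accuracy(reference, predicted):
    """Per-class accuracy and overall accuracy from paired label sequences."""
    totals = Counter(reference)
    correct = Counter(r for r, p in zip(reference, predicted) if r == p)
    per_class = {label: correct[label] / totals[label] for label in totals}
    overall = sum(correct.values()) / len(reference)
    return per_class, overall


reference = ["Grass", "Grass", "Palm", "Palm", "Water", "Water", "Water", "Sandpine"]
predicted = ["Grass", "Ground", "Palm", "Palm", "Water", "Water", "Shadow", "Sandpine"]
per_class, overall = class_and_overall_accuracy(reference, predicted)
print(per_class, overall)
```

Overall accuracy equals the mean of the per-class accuracies only when each class is weighted by its number of reference points; an unweighted mean gives a different figure when class sizes differ.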

Classification map
The most accurate model was the one that uses all five bands for classification. The prediction results for the North and South regions are shown in Figures 11 and 12.

CONCLUSIONS
In this study, scene-based and pixel-based classifications were performed for the JILONA region. The performance of eight well-known scene-based models, namely VGG19, ResNet152V2, InceptionV3, EfficientNetB5, Xception, InceptionResNetV2, MobileNetV2 and DenseNet201, was compared. In addition, various combinations of hyperparameters were tuned to find the best-performing model for pixel-based classification. Finally, an LULC map was generated from the best UNet segmentation model. This model was developed using UAS imagery, which is cost-effective because it relies on a low-cost sensor. Among all the evaluated models, EfficientNetB5 performed best in scene classification, with nearly 98% prediction accuracy. Similarly, CustomUNet with the 5-band combination, trained using the Adam optimizer, a batch size of 64, 5 layers and a learning rate of 0.01, outperformed all other investigated models for pixel-based classification (semantic segmentation). It is also notable that the accuracy of the UNet model increased after quality inspection of the training images.
The current study is restricted to a specific study area, which can be extended in the future. The comparison of classification methods can also be carried out for other band combinations of the same or different sensors. Further, an in-depth analysis can be performed to improve the prediction accuracy of the UNet model, and hyperparameters can be tuned so that both the scene and pixel classification models improve. Moreover, there are some patch issues in the classification map, visible in Figure 13, that need to be resolved in future work.

ACKNOWLEDGEMENTS
The author would like to thank the Bureau of Land Management for providing all the necessary support. A special thanks to the Department of Civil, Environmental and Geomatics Engineering at Florida Atlantic University for permitting the author to carry out this research. In addition, the author wishes to express sincere gratitude to all colleagues for the required academic assistance.