MAGO APPROACH FOR SEMANTIC SEGMENTATION: THE CASE STUDY OF UAVID BENCHMARK DATASET

The present work focuses on a semantic segmentation strategy implemented in the workflow of the tool MAGO (standing for “Adaptive Mesh for Orthophoto Generation”), considering the contribution of the 3D geometry and the colour information, both deriving from the point cloud of the scene. Moreover, the 2D source imagery, previously used to obtain the photogrammetric point cloud, is also employed to enhance the procedure with the recognition of moving objects, by comparing consecutive epochs. The analysed context is an urban scene from the UAVid dataset proposed for the ISPRS benchmark. In particular, the so-called “seq18”, a set of high-resolution oblique images taken by UAV (Unmanned Aerial Vehicle), has been used to test the semantic segmentation. The workflow includes the production of two Digital Surface Models (DSMs), containing the geometric and radiometric information, respectively, and their processing by means of the Harris corner detector, which maps the image variability. Then, starting from the source geometry and colour information and combining them with their variability mapping, a preliminary classification is performed. Further criteria allow the segmentation of the humans and cars present in the scene. In particular, static objects are identified according to the content of the neighbouring pixels within a kernel, while the evolution in time of moving elements is recognised by comparing the projected images belonging to the different epochs. The presented preliminary achievements show some critical issues that require further attention and improvement. In particular, the strategy could be enriched by extracting more information from the source 2D images, which at the moment are directly used only for the comparison of consecutive epochs.


INTRODUCTION
Semantic segmentation is a Computer Vision technique (Förstner and Wrobel, 2016) that aims at recognising and understanding the content of an image at the pixel level. This approach is widely used in remote sensing applications, especially in the analysis of urban scenarios (Ajmar et al., 2019; Huang et al., 2019; Schmitz et al., 2019; Zhou et al., 2019) or in the delineation of forest trees (Chen et al., 2021; Sothe et al., 2020; Kempf et al., 2019). The segmentation approach can be based on imagery (Marmanis et al., 2018) or three-dimensional models (Ao et al., 2019), as well as on the combination of both 2D and 3D information (Ding et al., 2019). Typically, machine learning and deep learning methods are applied to such procedures, including, to cite some examples, Conditional Random Fields (CRF; Pan et al., 2020; Lafferty et al., 2001), Markov Random Fields (MRF; Zoltan and Josiane, 2012), Spatial Pyramid Pooling (SPP; Zhengyu and Joohee, 2020), and Convolutional Neural Networks (Cresson, 2020; Martinez-Soltero et al., 2020; Ouyang and Li, 2021). The present work describes the preliminary approach conceived by the authors to obtain the segmentation. Geometric and radiometric information are combined; moreover, the 3D point cloud of the scene is used as the source for the identification of static objects, while the 2D imagery allows evaluating the evolution in time of moving objects by comparing the projected images at different epochs. The analysed case study is the UAVid dataset (Lyu et al., 2020), composed of high-resolution videos and imagery focusing on urban scenes, whose segmentation is based on eight object categories: buildings, roads, static cars, trees, low vegetation, humans, moving cars, and background clutter. The proposed strategy consists of a machine-learning procedure, whose inputs are the images and the photogrammetric point cloud obtained from their post-processing.
Since the UAVid imagery is not specifically designed for photogrammetric applications, the sequence employed as case study has been chosen so that the image overlap was sufficient to allow the 3D point cloud reconstruction by means of the Structure From Motion (SFM; Ullman, 1979) technique. The implemented functions have been introduced as a new module in the software MAGO (Mesh Adattiva per la Generazione di Ortofoto, literally Adaptive Mesh for Orthophoto Generation; Gagliolo, 2019; Gagliolo et al., 2019a and 2019b), developed within the Geomatics Laboratory of the University of Genoa. This tool, written in C++, was originally conceived for the automatic reconstruction of high-resolution orthophotos of adjacent walls, automatically recognising their rotation, and it has been enriched with the semantic segmentation function presented here. The MAGO procedure already included a module for the automatic check of the geometry homogeneity, carried out by evaluating the trend of the Z coordinate, where Z is the direction normal to the representative surface. In the original purpose of the software, this module was used to evaluate and apply the transformation required to bring the point cloud into a service reference system with the X and Y axes identifying the orthophoto plane and the Z direction along its normal vector.
In the present work, taking a cue from the described process, a new module for the detection of discontinuities is presented. The aim is to adopt this function for both geometric and radiometric segmentations, and then to join the results of the two operations in a single classification that takes both aspects into account, autonomously assigning a category to each pixel. The paper is organized as follows: in section 2, the case study is presented; in section 3, the strategy for the semantic segmentation is described; in section 4, the results of the application of the conceived approach on the testing dataset are shown; finally, conclusions and future perspectives of the work are reported.

THE UAVID DATASET
The UAVid collection is a high-resolution Unmanned Aerial Vehicle (UAV) semantic segmentation dataset focused on new challenges, including large-scale variation, moving object recognition and temporal consistency preservation. The proposed scenarios include urban and street scenes. The dataset consists of 42 video sequences (from "seq1" to "seq42"), captured in 4K resolution from an oblique point of view. The authors of the benchmark provided ten images extracted from each sequence, labelling the 420 resulting images with eight classes. Moreover, the sequences have been divided into three groups, i.e., training, test and validation sequences (Lyu et al., 2020). Since the proposed approach requires the 3D point cloud of the scene, the so-called "seq18" has been chosen as case study, given that the overlap of the provided frames was sufficient to obtain the photogrammetric reconstruction. In Figure 1, the first epoch of the distributed "seq18" is shown. The 3D point cloud has been produced by means of the SFM software Agisoft Metashape© (Agisoft© LLC, 2019). In this regard, the choice of the input for the photogrammetric processing represented the first issue. In fact, on the one hand, the extracted images provided for each sequence are few compared to usual photogrammetric blocks, and the resulting point cloud is affected by outliers due to the scarcity of correspondences, as shown in Figure 2. On the other hand, the use of a higher number of frames (91) directly extracted from the provided videos has been attempted, leading to a visibly distorted point cloud (Figure 3). This behaviour could be due to the fact that the drone path followed almost a straight line, as in a single-strip dataset.
Thus, the presence of a single strip, without any Ground Control Point to stabilise the block, has been poorly handled by the software under the set conditions, causing an unstable reconstruction. The following parameters have been set to carry out the workflow: the ten provided frames have been given as input, then the aerotriangulation has been performed at Medium quality, meaning that the images have been downscaled by a factor of 4 (2 per side). Finally, the dense cloud reconstruction has been launched at Ultra High quality, which corresponds to processing the images at their original resolution, with Mild depth filtering. Since no information about the reference system was available, an arbitrary one has been assigned so that the building façades are vertical and the object proportions are coherent with their standard measurements. The resulting point cloud has been filtered using the Statistical Outlier Removal (SOR) and the noise filter available in the open-source software CloudCompare (CloudCompare Development Team, 2021), applying the suggested default parameters; moreover, it has been subsampled with a minimum spacing between points of 0.1 m.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2021 XXIV ISPRS Congress (2021 edition)

SEMANTIC SEGMENTATION APPROACH
As mentioned above, the proposed semantic segmentation approach combines the geometric and the radiometric criteria in order to obtain a single classification. Starting from the 3D point cloud resulting from the 2D images provided by the ISPRS benchmark, two raster maps are produced: the former is the Digital Surface Model (DSM) of the scene, while the latter is the corresponding nadir greyscale map. Both raster images contain Not a Number (NaN) values where the cell could not be filled with any source information from the 3D point cloud.
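The production of the two raster maps can be illustrated with a minimal sketch, which is not the MAGO implementation. It assumes a hypothetical `Point3D` type carrying a greyscale value, a regular grid defined by a hypothetical origin and cell size, and the median as the aggregation rule for both maps (the median is stated in the text only for the Z coordinate). Cells that receive no point keep the NaN value.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical point type: position plus a greyscale intensity.
struct Point3D {
    double x, y, z;
    double grey;
};

// Two co-registered raster maps stored row-major; empty cells hold NaN.
struct RasterMaps {
    std::size_t rows, cols;
    std::vector<double> height;  // geometric DSM (median Z per cell)
    std::vector<double> grey;    // radiometric map (median grey per cell)
};

// Median of a non-empty vector (upper median for even sizes).
static double median(std::vector<double> v) {
    auto mid = v.begin() + v.size() / 2;
    std::nth_element(v.begin(), mid, v.end());
    return *mid;
}

// Bins the points into square cells of side `cell`, starting at (x0, y0).
RasterMaps rasterise(const std::vector<Point3D>& cloud, double x0, double y0,
                     double cell, std::size_t rows, std::size_t cols) {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    RasterMaps maps{rows, cols, std::vector<double>(rows * cols, nan),
                    std::vector<double>(rows * cols, nan)};
    std::vector<std::vector<double>> zs(rows * cols), gs(rows * cols);
    for (const Point3D& p : cloud) {
        const double c = std::floor((p.x - x0) / cell);
        const double r = std::floor((p.y - y0) / cell);
        if (c < 0 || r < 0 || c >= cols || r >= rows) continue;  // outside the grid
        const std::size_t idx =
            static_cast<std::size_t>(r) * cols + static_cast<std::size_t>(c);
        zs[idx].push_back(p.z);
        gs[idx].push_back(p.grey);
    }
    for (std::size_t i = 0; i < rows * cols; ++i) {
        if (zs[i].empty()) continue;  // no source information: the cell stays NaN
        maps.height[i] = median(zs[i]);
        maps.grey[i] = median(gs[i]);
    }
    return maps;
}
```

Keeping the two maps on the same grid means that a cell index addresses the same ground location in both, which is what allows the geometric and radiometric responses to be combined cell by cell later on.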
Both raster maps are processed using the Harris corner detector (Harris and Stephens, 1988), by means of the corresponding function implemented in the OpenCV (OpenCV Development Team, 2019) open-source library, available for the C++ language. This technique rates each image pixel with a mark R, according to the presence of a large variation in intensity with respect to the neighbours. In particular, the pixels are assigned to the following groups on the basis of the obtained mark R:
- R > 0: corner, i.e., significant change in all directions;
- R < 0: edge, i.e., no change along the edge direction;
- |R| small: flat region, i.e., no change in any direction.
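The three groups listed above can be expressed as a small decision function. This is only a sketch: the names `HarrisRegion` and `classifyResponse` are hypothetical, the limit defining "|R| small" is left as a caller-supplied parameter, and the separate `NoData` outcome for NaN cells is an assumption made here so that empty raster cells are not mistaken for edges.

```cpp
#include <cmath>

// Possible outcomes for one cell of a Harris response map.
enum class HarrisRegion { NoData, Flat, Edge, Corner };

// Maps a response R to its group; `flatLimit` is the bound for "|R| small".
HarrisRegion classifyResponse(double r, double flatLimit) {
    if (std::isnan(r)) return HarrisRegion::NoData;            // empty raster cell
    if (std::fabs(r) < flatLimit) return HarrisRegion::Flat;   // |R| small
    return r > 0.0 ? HarrisRegion::Corner : HarrisRegion::Edge;
}
```

The flat test is evaluated before the sign test, so that a response close to zero is never reported as a corner or an edge purely because of its sign.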
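The value R is obtained through the following steps. A greyscale 2D image, denoted as I, and a window W(x, y) of the image, shifted each time by the quantity (u, v), are assumed as input. The sum of squared differences (SSD) between these two patches, denoted as E, is given by:

$$E(u, v) = \sum_{(x, y) \in W} \left[ I(x + u, y + v) - I(x, y) \right]^2$$

Approximating the quantity I(x + u, y + v) by means of a first-order Taylor expansion,

$$I(x + u, y + v) \approx I(x, y) + I_x(x, y)\,u + I_y(x, y)\,v$$

where I_x and I_y are the partial derivatives of I, the function E can be written as:

$$E(u, v) \approx \sum_{(x, y) \in W} \left[ I_x(x, y)\,u + I_y(x, y)\,v \right]^2$$

Thus:

$$E(u, v) \approx \sum_{(x, y) \in W} \left( I_x^2\,u^2 + 2\,I_x I_y\,u v + I_y^2\,v^2 \right)$$

The quadratic approximation can be written in matrix form as

$$E(u, v) \approx \begin{bmatrix} u & v \end{bmatrix} M \begin{bmatrix} u \\ v \end{bmatrix}$$

where M is a second moment matrix computed from the image derivatives:

$$M = \sum_{(x, y) \in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$

Each horizontal section of the function E(u, v) is the equation of an ellipse. The diagonalisation of the matrix M yields the lengths of the ellipse axes, through the eigenvalues λ1 and λ2, and their orientation, through the corresponding eigenvectors.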
As previously described, the method implemented in OpenCV takes into account the corner response measure, using the value R, which is calculated as:

$$R = \det(M) - \alpha \left( \operatorname{trace}(M) \right)^2 = \lambda_1 \lambda_2 - \alpha \left( \lambda_1 + \lambda_2 \right)^2$$

where α is an empirically determined constant ranging between 0.04 and 0.06. In the present work, the value 0.04 has been adopted. The input parameters for the cv::cornerHarris function are the source image, the destination image, the kernel size, the aperture parameter for the Sobel operator (Duda and Hart, 1973), and the constant α.
Once both the geometric and the radiometric input maps have been processed with this technique, each cell has been classified according to the obtained R value. In particular, the service images containing the R marks obtained from the radiometric and the geometric contribution are called Rcolour and Rgeom, respectively. Moreover, a synthesis of the two contributions is stored in the matrix Rclass, whose label is assigned according to a criterion that combines the two marks; the obtainable Rclass values are summarised in a table. In summary, the labelling operation identifies the level of variability associated with each pixel, considering the changes in the X and Y directions and in the geometric and radiometric information. The following step is the processing of the classification map by grouping neighbouring pixels with the same assigned value into homogeneous regions (Nikhil and Sankar, 1993). The set of regions that derives from this segmentation process is called a partition. The phases needed to obtain the partition are detailed in the following. First, the labelling results from the R-based classification determine for each pixel the so-called event in which it is involved, tagging whether or not the pixel pertains to an area. Then, the grouping operation joins the pixels with the same label into a cluster. This operation generates regions, i.e., sets of neighbouring pixels, called connected components. The label assigned to each pixel is an integer that identifies the region to which the pixel belongs. Two pixels p and q with the same value v belong to the same connected component C if there is a sequence of pixels (p_0, p_1, …, p_n) of value v belonging to C, where p_0 = p, p_n = q, and p_i is a neighbour of p_{i-1} for i = 1, …, n. The value of each pixel is replaced by the smallest label value of its neighbours belonging to the same connected component; the pixel above and the one on the left are considered, obtaining a 4-connected grouping.
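The grouping into connected components can be sketched as an iterative label propagation. The sketch below is an illustration under stated assumptions rather than the MAGO code: every pixel is assumed to start with a unique integer label, a forward raster scan (upper and left neighbours) alternates with a backward scan (lower and right neighbours), and the two scans are repeated until no label changes. The function name `labelRegions` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Groups 4-connected cells sharing the same class value.
// `classes` is a row-major map; the result holds one region label per cell,
// equal to the smallest initial label found in its connected component.
std::vector<int> labelRegions(const std::vector<int>& classes,
                              std::size_t rows, std::size_t cols) {
    std::vector<int> label(rows * cols);
    for (std::size_t i = 0; i < label.size(); ++i) {
        label[i] = static_cast<int>(i) + 1;  // unique starting label (assumption)
    }

    // Takes the smaller label of neighbour j when it carries the same class value.
    auto relax = [&](std::size_t i, std::size_t j) {
        if (classes[i] == classes[j] && label[j] < label[i]) {
            label[i] = label[j];
            return true;
        }
        return false;
    };

    bool changed = true;
    while (changed) {
        changed = false;
        // Forward scan: left to right, top to bottom, upper and left neighbours.
        for (std::size_t r = 0; r < rows; ++r) {
            for (std::size_t c = 0; c < cols; ++c) {
                const std::size_t i = r * cols + c;
                if (r > 0) changed |= relax(i, i - cols);
                if (c > 0) changed |= relax(i, i - 1);
            }
        }
        // Backward scan: bottom-up, lower and right neighbours.
        for (std::size_t r = rows; r-- > 0;) {
            for (std::size_t c = cols; c-- > 0;) {
                const std::size_t i = r * cols + c;
                if (r + 1 < rows) changed |= relax(i, i + cols);
                if (c + 1 < cols) changed |= relax(i, i + 1);
            }
        }
    }
    return label;
}
```

Alternating the two scan directions lets a small label travel both down-right and up-left within one iteration, so regions with concave shapes converge in few passes.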
This operation is carried out recursively from left to right and from top to bottom. Then, a bottom-up row scan follows, considering the 4-connected neighbourhood made by the lower pixel and the one on the right. The replacement of the values is iterated until no more label changes occur. So far, greyscale maps have been used to apply the Harris corner detector for the detection of the discontinuities. Nevertheless, the greyscale colour space does not easily allow recognising the hue of the analysed pixel, and neither does the well-known RGB (Red Green Blue) space. Thus, the authors decided to convert the original colour map, obtained from the coloured point cloud, to the HSV (Hue Saturation Value) colour space. Starting from the HSV values associated with each cell, several masks are arranged in order to identify the pixel membership. In particular, the threshold criteria have been chosen according to the authors' interpretation and not using a superimposed classification. Because OpenCV stores each channel in one byte, H is expressed within 0 and 180 instead of 0° and 360°, while S and V are within 0 and 255 instead of between 0 and 1. In the regions having a resulting Rclass value equal to 0 or 1, i.e., with no significant radiometric variation and null or low geometric variation, respectively, the HSV masking is applied homogeneously, on the basis of the most recurrent value in the area. In the first step of the segmentation, the criteria coming from Rclass or directly the height information resulting from the DSM are combined with the colour inferred from the HSV masking, as listed in Table 3.

Table 3. Applied criteria for the preliminary labelling.
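The 8-bit HSV convention recalled above (H halved to fit in one byte, S and V scaled to 0-255) can be illustrated with a stand-alone conversion and a range mask. This is a sketch that mirrors the convention without calling OpenCV; the type and function names are hypothetical, and the mask bounds in the usage note are invented for illustration only, since the thresholds adopted by the authors are not reported here.

```cpp
#include <algorithm>
#include <cmath>

// HSV triple in the 8-bit convention: H in [0, 180), S and V in [0, 255].
struct Hsv8 {
    int h, s, v;
};

// Inclusive bounds of one colour mask in the same convention.
struct HsvRange {
    int hMin, hMax, sMin, sMax, vMin, vMax;
};

// Converts 8-bit RGB components to the 8-bit HSV convention.
Hsv8 rgbToHsv8(int r, int g, int b) {
    const int maxC = std::max({r, g, b});
    const int minC = std::min({r, g, b});
    const double delta = maxC - minC;
    double hueDeg = 0.0;  // hue in degrees, [0, 360)
    if (delta > 0.0) {
        if (maxC == r) {
            hueDeg = 60.0 * (g - b) / delta;
        } else if (maxC == g) {
            hueDeg = 120.0 + 60.0 * (b - r) / delta;
        } else {
            hueDeg = 240.0 + 60.0 * (r - g) / delta;
        }
        if (hueDeg < 0.0) hueDeg += 360.0;
    }
    const double sat = maxC > 0 ? 255.0 * delta / maxC : 0.0;
    const int h = static_cast<int>(std::lround(hueDeg / 2.0)) % 180;  // halve to fit a byte
    const int s = static_cast<int>(std::lround(sat));
    return {h, s, maxC};
}

// True when every channel of `p` falls inside the mask bounds.
bool inMask(const Hsv8& p, const HsvRange& m) {
    return p.h >= m.hMin && p.h <= m.hMax &&
           p.s >= m.sMin && p.s <= m.sMax &&
           p.v >= m.vMin && p.v <= m.vMax;
}
```

For instance, with an illustrative mask `HsvRange{35, 85, 50, 255, 50, 255}`, a pure green cell (H = 60 in this convention) falls inside the mask while a pure red cell (H = 0) does not.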
In this phase, the static scene is divided into five macro-areas, including buildings, roads, trees and low vegetation, as well as the remaining background. Further criteria need to be implemented in order to also identify the three remaining categories, i.e., static cars, humans and moving cars, isolating them from the generic background.
In this regard, the current capabilities of the algorithm are not suitable to discern humans from cars. Thus, the labels static cars (class 3) and moving cars (class 7) are changed to static and moving objects, respectively, while the category humans (class 6) is suppressed. Static objects are extracted from the generic background by checking, by means of a kernel, the presence of at least a certain number of cells labelled as road (class 2) or static object (class 3) in the neighbourhood of the analysed pixel. In particular, the road and static object cells need to be more than half of the cells filled with categories other than background and empty (classes 8 and 0, respectively). If at least one of the neighbouring cells is road, a further check is performed on the difference between the height of the analysed pixel and the mean height of the surrounding road cells: if the Z coordinate of the analysed pixel is less than three meters above the average road height, the match with class 3 is confirmed. The last step is the recognition of the moving objects, which is achieved through the comparison of the 10 epochs provided in the UAVid source imagery. The frames, acquired by the camera as central projections, are orthogonally projected using the tool MAGO first on a plane with an orientation similar to the original image attitude, then on the XY plane. The intermediate phase, which takes into account a service plane at approximately the same inclination as the original image, allows MAGO to optimise the search for the matching points that compose the adaptive mesh (Gagliolo et al., 2019b). Once the projections are performed, the resulting greyscale maps are subtracted, in order to highlight the differences between consecutive epochs. The pixels that are not visible in at least one of the analysed views are excluded from the comparison.
Moreover, a threshold of 30 in the range of greyscale tones is applied to exclude changes barely perceivable by the human eye.
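The two rules of this section, the kernel test for static objects and the thresholded difference between epochs, can be sketched as follows. This is an interpretation, not the MAGO code: the function names are hypothetical, a cell with no road neighbour is assumed to pass without the height check, invisible cells are assumed to be stored as NaN, and a change is assumed to count only when it strictly exceeds the threshold.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Class codes quoted in the text.
constexpr int kEmpty = 0, kRoad = 2, kStaticObject = 3, kBackground = 8;

// Static-object test for the cell at (r, c) over a (2*half+1) x (2*half+1) kernel.
bool isStaticObject(const std::vector<int>& cls, const std::vector<double>& dsm,
                    std::size_t rows, std::size_t cols,
                    std::size_t r, std::size_t c,
                    std::size_t half = 2, double maxRise = 3.0) {
    std::size_t filled = 0, roadOrObject = 0, roadCells = 0;
    double roadHeightSum = 0.0;
    const std::size_t r0 = r > half ? r - half : 0;
    const std::size_t c0 = c > half ? c - half : 0;
    const std::size_t r1 = std::min(rows - 1, r + half);
    const std::size_t c1 = std::min(cols - 1, c + half);
    for (std::size_t y = r0; y <= r1; ++y) {
        for (std::size_t x = c0; x <= c1; ++x) {
            if (y == r && x == c) continue;  // skip the analysed cell itself
            const std::size_t i = y * cols + x;
            if (cls[i] == kEmpty || cls[i] == kBackground) continue;
            ++filled;  // neighbour carrying a meaningful category
            if (cls[i] == kRoad || cls[i] == kStaticObject) ++roadOrObject;
            if (cls[i] == kRoad && !std::isnan(dsm[i])) {
                ++roadCells;
                roadHeightSum += dsm[i];
            }
        }
    }
    // Road and static-object cells must be more than half of the filled cells.
    if (2 * roadOrObject <= filled) return false;
    if (roadCells == 0) return true;  // no road nearby: height check skipped (assumption)
    const double meanRoad = roadHeightSum / static_cast<double>(roadCells);
    return dsm[r * cols + c] - meanRoad < maxRise;
}

// Flags the cells whose greyscale value differs by more than `threshold`
// between two projected epochs; NaN marks cells not visible in an epoch.
std::vector<bool> changedCells(const std::vector<double>& prev,
                               const std::vector<double>& next,
                               double threshold = 30.0) {
    std::vector<bool> changed(prev.size(), false);
    for (std::size_t i = 0; i < prev.size() && i < next.size(); ++i) {
        if (std::isnan(prev[i]) || std::isnan(next[i])) continue;  // excluded
        changed[i] = std::fabs(next[i] - prev[i]) > threshold;
    }
    return changed;
}
```

Counting only the filled neighbours in the majority test keeps sparsely populated kernels from rejecting a cell merely because most of its surroundings are empty or background.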

RESULTS AND DISCUSSION
The described procedure has been applied to the case study of "seq18", belonging to the UAVid dataset proposed for the ISPRS benchmark. First, the 3D point cloud built from the provided frame sequence has been processed to obtain two DSMs, the former containing the geometric information in terms of the median Z coordinate (Figure 4), the latter containing the radiometric information converted to greyscale values (Figure 5). Figure 6 and Figure 7 focus on a portion of interest, outlined in red in Figure 4 and Figure 5. This box, located where the quality of the obtained point cloud is sufficient, has been chosen to carry out the test. The poor quality of the obtained point cloud for the analysed dataset is due to the fact that the acquisition of the source images was not planned to obtain a 3D survey.
Then, the Rclass map has been computed, as described in section 3; Figure 8 shows the portion within the established boundaries, indicated in Figure 4. Most of the areas are denoted with value 4, interpretable as high variability in both geometry and radiometry and in both the X and Y directions, or -2, identifying high variability in colour but none in geometry. The starting Rcolour and Rgeom data are obtained using the cv::cornerHarris function, setting the kernel size to 7 (i.e., 3.5 m, comparable with the dimension of tree crowns and cars, considering that each pixel is 0.5 m), the aperture parameter for the Sobel operator to 3, and the constant α to 0.04. The Rcolour and Rgeom values considered as limits for the flat region (|R| small) are 10^-5 and 10^-9, respectively. The preliminary segmentation, resulting from the analysis of the static scenario alone, is depicted in Figure 9. It includes all the categories except for the moving objects (class 7). The kernel used for the static object search has size 5×5. It is worth noting that the result for the class of trees is satisfactory, as are those for buildings, roads and low vegetation. Regarding static objects, the cars waiting at the crosswalk or parked on the roadside are recognisable. Moving objects are isolated by subtracting the resulting projections of consecutive epochs, as shown in Figure 10. The points depicted in red represent the variations between two consecutive epochs. Nevertheless, their quantity is excessive with respect to the real presence of moving objects. A further strategy for outlier removal will need to be developed in future. In fact, some mismatches could be associated with discontinuities, which do not perfectly overlap even if they pertain to static elements.
The shown results indicate that the processing outcomes are strongly affected by the quality of the input data, i.e., the 3D point cloud and the derived geometric and radiometric DSMs. The upstream 3D reconstruction is negatively affected by the presence of many moving objects, which would require a considerable amount of time to be individually masked in each source frame in the Agisoft Metashape© interface. Thus, it could be worthwhile to bring forward the moving object detection, recognising moving objects directly on the 2D images, so that they could be automatically excluded from the point cloud reconstruction.

CONCLUSIONS AND FUTURE PERSPECTIVES
The present work shows the first experience of the authors with a segmentation strategy, based on both 2D and 3D source data and considering both the geometric and the radiometric aspects.
The produced maps illustrate the steps needed to reach the final segmentation. First, the 3D point cloud has been obtained from the provided frames of the UAVid "seq18", considered a better choice than the extraction of input images directly from the UAV video, which led to a substantial distortion. The point cloud so obtained has been used as input for the production of the geometric and radiometric DSMs. Then, these maps have been processed using the Harris corner detector in order to highlight the image variability, for both the geometric and the radiometric components and analysing both the X and Y directions. The results of this phase are summarised in an Rclass map, whose labels are associated with the possible combinations of corner, edge and flat regions deriving from the Harris corner detector applied to obtain the two maps Rcolour and Rgeom. Starting from the DSM heights, the Rclass values and the HSV image masking, the criteria for the static scenario classification, except for humans and cars, are applied. Further rules are established to isolate static objects that could pertain to humans or cars. Finally, the moving objects are isolated by subtracting the images referred to consecutive epochs and applying a threshold in order to exclude imperceptible changes. Nevertheless, a strategy to exclude outliers, i.e., points along discontinuities that do not perfectly overlap, has not yet been conceived. A stereo camera acquisition and permanent Ground Control Points may substantially improve the 3D model and, consequently, the classification by means of the proposed strategy.
The preliminary achievements shown in the present work are useful for a critical analysis of the proposed workflow, providing new starting points for the further enhancement of the procedure. In particular, the method could be improved to extract more information from the primary source, represented by the acquired video and images. Moreover, the moving objects need to be treated so that they do not represent an obstacle to the processing but an opportunity to improve it.