NOVEL SINGLE TREE DETECTION BY TRANSFORMERS USING UAV-BASED MULTISPECTRAL IMAGERY

Single tree detection has been a major research topic concerning automatic forest inventory using remote sensing data. Recently, deep learning-based approaches in remote sensing forestry have gained attention because of the prospect of improved accuracy. In this study, we present a novel tree detection method based on the detection transformer (DETR), which applies a transformer in combination with a pre-trained convolutional neural network to detect individual trees using high-resolution multispectral imagery. The test site (Kranzberg Forest Roof Experiment, KROOF) is located in Bavaria, north of Munich, and is characterised by a mixed forest consisting of large groups of European beeches (Fagus sylvatica) surrounded by Norway spruces (Picea abies). The image data were acquired with a MicaSense RedEdge-MX Dual camera mounted on a UAV. Two flight missions were conducted at an altitude of around 85 m with a flight speed of 5 m/s, resulting in a ground resolution of about 5 cm. For testing, 125 trees were surveyed by tacheometric means in the field, and 1390 trees were labelled by visual interpretation of the multispectral imagery for training and validation. The novel tree detection method based on DETR shows promising results and outperforms the standard, well-known object detection method YOLOv4 in the mixed and deciduous test plots. In detail, DETR achieved F1-scores of 83% in the coniferous plot, 86% in the mixed plot and 71% in the deciduous plot. The corresponding figures for YOLOv4 are 87% (coniferous), 65% (mixed) and 67% (deciduous). In terms of accuracy, DETR is inferior by 6 percentage points in the coniferous plot, but superior by 28 and 5 percentage points in the mixed and deciduous plots, respectively. Compared to YOLOv4, we found that DETR sometimes failed to detect small coniferous trees. Moreover, both deep learning-based methods tend to over-detect single trees in deciduous test areas. In sum, transformer-based tree detection shows great potential to improve single tree detection.


INTRODUCTION
Forests are an essential part of our environment, providing critical ecosystem services such as carbon storage, nutrient cycling, drinking water supply and air purification. Moreover, they offer recreational opportunities and host a large proportion of Earth's biodiversity. Forest loss, global change and unsustainable management are threatening forest ecosystems in an unprecedented manner. A better knowledge of the condition of the forests is a prerequisite for sound management, for which forest inventories form an important basis. Here, remote sensing methods come into play, as they can acquire this information over large areas at a much lower cost than conventional methods (Krzystek et al., 2020). Forest inventories, as part of sustainable forest management, are usually conducted on small sample plots (less than 1% of the area) with intensive terrestrial measurements, which survey individual tree attributes and derive statistical indicators for the surveyed areas. An area-wide collection of forest structure parameters down to single tree information can only be achieved by using remote sensing methods and offers added value for the areal monitoring of forest structures (Latifi et al., 2015). Using high-resolution remote sensing data and innovative AI methods, this information can be collected over large areas at a much lower cost (Latifi and Heurich, 2019).

RELATED WORK
In recent years, the use of deep neural networks (DNNs), such as segmentation and classification algorithms, has attracted a great deal of interest as they outperform standard machine learning approaches in various tasks (Voulodimos et al., 2018). The main advantage of many DNNs is representation learning, that is, automatic feature extraction as part of the training process (LeCun et al., 2004). However, single tree detection and segmentation via deep learning are more challenging, and only a few approaches apply instance segmentation methods that embed two-stage object detectors to delineate single trees using lidar data (Windrim and Bryson, 2020) or multispectral imagery (G. Braga et al., 2020). In another study, a tree detection method based on the single-stage detector RetinaNet (Lin et al., 2017) using RGB imagery is presented (Weinstein et al., 2019). The model is initially trained on tree segments provided by a lidar-based segmentation and is fine-tuned using manually labeled segments. When applied in an open forest area, the approach outperforms two baseline methods (Silva et al., 2016; Li et al., 2012). The novel use of transformers, which are deep learning building blocks based on the mechanism of self-attention (Vaswani et al., 2017), is promising (Parmar et al., 2018). In this work, we aim to detect single trees in high-resolution RGB true orthophotos (TDOPs) using a novel transformer approach, the detection transformer (DETR) (Carion et al., 2020). A similar procedure was applied in the field of bioinformatics (Prangemeier et al., 2020), where it was successfully shown that cells in microstructures can be detected and classified in microscope imagery with the help of transformers. In remote sensing, transformers were applied to change detection in residential areas, with reported accuracy improvements over baseline architectures such as U-Net.
Recently, a study presented a new deep learning model called density transformer (DENT) for automatic tree counting from aerial images (Chen and Shang, 2022). The architecture is similar to DETR in (1) using a convolutional neural network for the extraction of visual features and (2) providing contextual image information with the help of a conventional transformer encoder with a multi-head attention mechanism. The encoder provides the input for two separate feed-forward networks: one that generates a tree density map and another that counts trees. DENT outperforms most of the other deep learning-based methods such as Faster R-CNN and YOLOv3 (Redmon and Farhadi, 2018). To the authors' best knowledge, no experiments have so far been carried out using DETR, this new and far-reaching deep learning-based object detection method, to detect single trees in a high-resolution TDOP in the context of forest inventory. In order to demonstrate the potential of the transformer-based method, the results were compared with a well-known one-stage object detection method called You Only Look Once v4 (YOLOv4) (Bochkovskiy et al., 2020).

Study area
Our experiments were conducted close to the Kranzberg Forest Roof Experiment (KROOF) research site, located at 11°39'42" E, 48°25'12" N, approximately 35 km northeast of Munich. The forest around the KROOF research site is under the administration of the Bayerische Staatsforsten. Most of the mixed forest is characterised by large groups of beeches surrounded by spruces. Tree heights vary between 19 m and 36 m with a stem density of around 200-300 trees/ha. For the evaluation, field measurements were conducted to generate reference data. For trees with a breast height diameter (BHD) greater than 15 cm, the tree positions were measured by tacheometric means with an accuracy of less than 2 cm. The BHD was conventionally determined using a caliper. The first plot (Figure 1, Plot #1) is characterised by dominant coniferous trees together with some understory trees. The second plot (Figure 1, Plot #2) is more diverse, composed of 60% coniferous and 40% deciduous trees. The third plot (Figure 1, Plot #3) is dominated by deciduous trees, which make up 76% of the area. The plots also vary in the size and age of the trees. Table 1 shows the plot characteristics. Since a method based on 2D data is used, only dominant trees and trees recognisable in the TDOP were used for the accuracy assessment. Figure 1 shows test plots #1 and #2 superimposed on the TDOP of the August 2020 flight. Test plot #3 is shown on the data set flown in July 2021.

Data acquisition and preparation
3.2.1 Aerial multispectral data
In August 2020 and July 2021, multispectral images were collected using a RedEdge-MX Dual camera (MicaSense, 2022) attached to a remotely piloted hexacopter (DJI M 600 Pro). The camera system captures ten channels (spectral range 475-842 nm) with a horizontal field of view (HFOV) of 47.2°, which corresponds to a focal length of 5.5 mm. A downwelling light sensor provided accurate ambient light calibration. Images of a calibration panel were taken for radiometric calibration. The ground speed was 5 m/s. The end lap and side lap of the image block were 90% and 60%, respectively. For the two missions, the flight heights were 90 m and 80 m, resulting in ground sample distances (GSDs) of 5.93 cm and 5.3 cm. For postprocessing of the imagery, structure-from-motion (SFM) software was used to generate TDOPs (MetaShape, 2022). The processing steps consisted of (1) radiometric calibration of the imagery, (2) bundle adjustment, (3) point cloud generation and (4) generation of an orthomosaic. The exported TDOPs had a cell size of 5 cm and contained the ten channels captured by the camera.
Table 2

Field survey
The goal of the field campaign was to measure tree positions as precisely as possible in order to generate accurate test data (see 3.2.3). Because shading effects were expected when using global navigation satellite system (GNSS) receivers in dense forest areas, a tacheometric survey campaign was conducted in April 2021. First, a traverse was measured in the area of plots #1, #2 and #3. The traverse included seven stations and was georeferenced using three geodetic points. A Trimble R12i GNSS system and a Leica TCRP1203+ total station were used as instruments. Afterwards, tree positions were surveyed from the traverse stations by tacheometric means. For each tree with a BHD greater than 15 cm, the BHD was measured using a caliper and the tree group was documented. In total, 55 trees, 36 trees and 34 trees were surveyed in plots #1, #2 and #3, respectively (see also Table 1). The estimated accuracy of the tree positions was better than 10 cm.

Labeling of tree crowns
The training and test reference data are provided in the form of enclosing bounding boxes. For this purpose, the TDOP is used for visualisation. For the labeling of the training data, tree segments were defined in the TDOP. The tree segments of the test data are also determined using the TDOP and additionally linked with tree positions derived from the field measurements. Figure 2 shows an example of labeled trees with corresponding bounding boxes and tree positions.

DETR
The deep learning-based method DETR treats object detection as a direct set prediction problem. The approach processes global image information by means of the transformer mechanism (Vaswani et al., 2017) and eliminates the need for several subtasks that require prior knowledge about the problem, such as anchor generation. Predictions are determined directly and in parallel using a small set of learned object queries. The relationship between objects and the global image context is a key factor in this process. The overall DETR architecture with its key elements is illustrated in Figure 3. First, features are extracted using ResNet-50. Afterwards, positional encodings are added element-wise to the CNN features. Finally, the result is passed to a transformer encoder followed by a transformer decoder, which processes N object queries in parallel. The last step predicts bounding boxes and classes using a feed-forward network (Carion et al., 2020). Besides the architecture, the assignment between ground truth and predicted boxes plays a crucial role. The set prediction loss requires a unique matching between the ground truth boxes and a larger set of predicted boxes, which is determined efficiently by bipartite matching using the Hungarian algorithm (Kuhn, 1955). The total loss is a combination of the matching loss and the Hungarian loss, which includes a linear combination of the generalised intersection over union (IoU) loss (Rezatofighi et al., 2019) and the L1 loss. This architecture offers several advantages: no prior information about anchors is needed, and global information can be processed thanks to the transformer mechanism.
On the other hand, this workflow has more difficulty detecting small objects than Faster R-CNN and converges more slowly than comparable object detection methods. The new design based on transformers and bipartite matching, together with the good extensibility of the workflow, offers the possibility of adaptations in different fields. For example, the authors of Deformable DETR (Zhu et al., 2020) extended the existing workflow so that Deformable DETR detects smaller objects better and requires ten times fewer training epochs. The authors of Dynamic DETR (Dai et al., 2021) significantly reduced the number of training epochs and achieved improved performance by introducing a dynamic encoder that reduces the quadratic computational complexity of the self-attention module in transformer encoders.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2022 XXIV ISPRS Congress (2022 edition), 6-11 June 2022, Nice, France
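The bipartite matching step can be illustrated with a small sketch. It is not the DETR implementation: the class term of the matching cost is omitted (only one class, tree, is assumed), the weights `w_l1` and `w_giou` are illustrative assumptions, boxes are assumed to be normalised `(x_min, y_min, x_max, y_max)` tuples, and a brute-force search over permutations stands in for the Hungarian algorithm, which finds the same optimum in polynomial time.

```python
from itertools import permutations


def box_area(b):
    """Area of an axis-aligned box (x_min, y_min, x_max, y_max)."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def giou(a, b):
    """Generalised IoU of two boxes with positive area, in the range (-1, 1]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = box_area(a) + box_area(b) - inter
    # Area of the smallest box enclosing both a and b.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def pair_cost(pred, truth, w_l1=5.0, w_giou=2.0):
    """Box cost: weighted L1 distance plus weighted generalised IoU loss.

    The weights are assumptions for illustration only.
    """
    l1 = sum(abs(p - t) for p, t in zip(pred, truth))
    return w_l1 * l1 + w_giou * (1.0 - giou(pred, truth))


def match(preds, truths):
    """Return (total_cost, assignment) of the cheapest one-to-one matching.

    assignment[i] is the index of the prediction assigned to ground truth i.
    Assumes len(preds) >= len(truths), as with N object queries per tile.
    Brute force: only suitable for a handful of boxes.
    """
    best = None
    for perm in permutations(range(len(preds)), len(truths)):
        cost = sum(pair_cost(preds[j], truths[i]) for i, j in enumerate(perm))
        if best is None or cost < best[0]:
            best = (cost, perm)
    return best


# Hypothetical example: three predicted boxes, two ground truth boxes.
preds = [(0.5, 0.5, 0.9, 0.9), (0.1, 0.1, 0.4, 0.4), (0.0, 0.6, 0.2, 0.8)]
truths = [(0.1, 0.1, 0.45, 0.4), (0.55, 0.5, 0.9, 0.95)]
total_cost, assignment = match(preds, truths)
```

Predictions that remain unassigned after the matching are treated as "no object" in the loss, which is why the number of queries can safely exceed the number of trees in a tile.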

YOLOv4
YOLOv4 is the fourth evolutionary step of the original You Only Look Once (YOLO) (Redmon et al., 2015), released in a flexible research framework called darknet. The original version was the first neural network approach that could generate all bounding boxes and class labels in parallel in one inference step using an end-to-end network. Since then, YOLO has undergone several improvement iterations with YOLOv2 (Redmon and Farhadi, 2016), YOLOv3 (Redmon and Farhadi, 2018) and YOLOv4 (Bochkovskiy et al., 2020). YOLOv4 achieves state-of-the-art detection accuracy in near real time. The architecture illustrated in Figure 4 shows the essential components of the workflow. First, features are extracted from images using the feature extractor CSPdarknet53, in which cross-stage partial connections are added to darknet53 from YOLOv3. As feature aggregator, spatial pyramid pooling is utilised, as it increases the receptive field and separates out the most important features. Then, instead of the feature pyramid network used in YOLOv3, the path aggregation network is utilised. The original YOLOv3 network is used as head to generate bounding boxes and class labels. Besides the architecture, two strategies have been introduced. The first, called bag of freebies, does not require any additional computing power at inference time and includes data augmentation such as mosaic or cutmix. The second, bag of specials, contains improvement modules for inference (e.g. mish activation) (Misra, 2019).

Experimental setup
Due to the architecture of DETR and YOLOv4, we preprocessed the training, validation and test image data into 50% overlapping tiles of 512 x 512 pixels. For the training process, the training data were split into 80% training and 20% validation. Tables 1 and 3 show the number of images used for training, validation and testing. Three models each were trained for DETR and YOLOv4 with varying random number generator seeds to check the reproducibility of the results. In the case of DETR, no remarkable variation of the results was found. In contrast, YOLOv4 exhibited a wider deviation of the results. We therefore computed mean values of the statistical parameters accuracy, precision, recall and F1-score (see Section 5.4) in the respective plots #1, #2 and #3. As a result of the 50% overlap in the images, overlapping bounding boxes were predicted during testing. At the image edges, small tree fragments sometimes occurred. Therefore, post-processing was necessary to filter out the small bounding box fragments using an area threshold (e.g. 20 m²). Subsequently, non-maximum suppression was applied using an IoU threshold of 0.3. A workstation equipped with 256 GB RAM, an NVIDIA RTX 8000 GPU and an AMD Ryzen Threadripper 3970X processor was used.
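The tiling and post-processing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the extra tile placed flush with the raster edge is an assumed way of handling remainders, and the detections are assumed to have already been transformed from tile coordinates to map coordinates in metres so that the area threshold can be applied in m².

```python
def tile_origins(width, height, tile=512, overlap=0.5):
    """Top-left pixel corners of square tiles covering a width x height raster."""
    stride = int(tile * (1.0 - overlap))

    def starts(size):
        last = max(size - tile, 0)
        positions = list(range(0, last + 1, stride))
        if positions[-1] != last:
            positions.append(last)  # assumed: extra tile flush with the far edge
        return positions

    return [(x, y) for y in starts(height) for x in starts(width)]


def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(detections, min_area=20.0, iou_thr=0.3):
    """Drop small fragments, then apply greedy non-maximum suppression.

    detections: list of (box, score) with boxes in map coordinates (metres).
    """
    large = [d for d in detections
             if (d[0][2] - d[0][0]) * (d[0][3] - d[0][1]) >= min_area]
    large.sort(key=lambda d: d[1], reverse=True)  # highest confidence first
    kept = []
    for box, score in large:
        # Keep a box only if it does not overlap a better box too strongly.
        if all(iou(box, k[0]) <= iou_thr for k in kept):
            kept.append((box, score))
    return kept


# Hypothetical detections: two overlapping crowns, one fragment, one isolated crown.
detections = [((0, 0, 6, 6), 0.90), ((1, 1, 7, 7), 0.80),
              ((10, 10, 12, 12), 0.95), ((20, 20, 26, 26), 0.70)]
kept = postprocess(detections)
origins = tile_origins(1024, 512)
```

With a 50% overlap, each interior tree appears in several tiles, so the suppression step is what reduces the duplicated predictions to one box per crown.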

Configuration of DETR
The DETR Python implementation of Hugging Face (Wolf et al., 2020) was used and the default configuration of DETR was applied with an adjusted learning rate of 1e-7, a backbone learning rate of 1e-6, a weight decay of 1e-8 and a batch size of 12.
A model pre-trained on the common objects in context (COCO) detection dataset (Lin et al., 2014) was used because of the limited amount of training data. This required the number of object queries, a constant parameter optimised for the COCO dataset, to be fixed at 100. Within a period of 500 epochs, we applied early stopping to train and validate the model, thereby mitigating overfitting effects. To this end, the model with the highest validation mean average precision at IoU = 0.50 (mAP) was selected first. It was then checked whether the validation loss of this model lay in the range of its minimum. Due to the 50% overlap of the test images and the object queries parameter, bounding box fragments with high confidence scores at the edges of the test images were eliminated in an intermediate step.
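The model selection rule described above can be sketched in a few lines. The function name, the layout of the training history and the tolerance `loss_tolerance` are hypothetical; the text only states that the validation loss was checked to lie near its minimum, so the 5% margin used here is an assumption for illustration.

```python
def select_checkpoint(history, loss_tolerance=0.05):
    """Pick the epoch with the highest validation mAP and check its loss.

    history: list of dicts with keys 'epoch', 'val_map' and 'val_loss'.
    Returns (best_entry, loss_ok), where loss_ok is True if the validation
    loss of the selected epoch is within loss_tolerance (relative) of the
    minimum validation loss over all epochs.
    """
    best = max(history, key=lambda h: h["val_map"])
    min_loss = min(h["val_loss"] for h in history)
    loss_ok = best["val_loss"] <= min_loss * (1.0 + loss_tolerance)
    return best, loss_ok


# Hypothetical training history, not measured values.
history = [
    {"epoch": 1, "val_map": 0.40, "val_loss": 1.20},
    {"epoch": 2, "val_map": 0.55, "val_loss": 0.90},
    {"epoch": 3, "val_map": 0.61, "val_loss": 0.82},
    {"epoch": 4, "val_map": 0.60, "val_loss": 0.80},
    {"epoch": 5, "val_map": 0.58, "val_loss": 0.95},
]
best, loss_ok = select_checkpoint(history)
```

If `loss_ok` were False, the selected epoch would have a high mAP but a validation loss well above the minimum, which would be a reason to inspect the training curves before accepting the model.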

Configuration of YOLOv4
In this work, the YOLOv4 implementation for Windows computers was used (Bochkovskiy et al., 2020). The configuration was adapted to this dataset by setting the number of training steps to 7200 and using a batch size of 64. Data augmentation (crop, rotation, flip, hue, saturation, exposure, aspect, cutmix, mixup, mosaic and blur) was also applied. A model pre-trained on the COCO detection dataset (Lin et al., 2014) was selected for transfer learning. Within the 7200 training steps, the model with the highest mean average precision (Everingham et al., 2010) was selected.

Accuracy assessment
In order to determine the quality of the tree detection, accuracy, precision, recall and F1-score were used as metrics. Detected trees that could be assigned to a reference bounding box with an IoU of at least 50% were counted as successfully detected trees (true positives). Detected trees for which no assignment to a reference bounding box was found were categorised as false positives. Furthermore, reference trees that could not be matched to any detected tree were marked as false negatives. The IoU describes the quality of the overlap and is defined as the ratio of the common area to the combined area of the bounding boxes A and B. Equations 1, 2, 3, 4 and 5 show how the described parameters are calculated, where TP, FP, FN and F1 denote true positives, false positives, false negatives and F1-score.
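The assessment can be illustrated with a short sketch. Several points are assumptions: the greedy one-to-one assignment is only one possible way to match detections to reference boxes, the function names are hypothetical, and accuracy is taken as TP / (TP + FP + FN), a definition that is consistent with the accuracy and F1 values reported in the results. Precision, recall and F1-score follow their standard definitions.

```python
def iou(a, b):
    """Ratio of the common area to the combined area of boxes a and b."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(detections, references, iou_thr=0.5):
    """Greedy one-to-one assignment (an assumption). Returns (TP, FP, FN)."""
    unmatched = list(range(len(references)))
    tp = 0
    for det in detections:
        best_iou, best_ref = 0.0, None
        for r in unmatched:
            value = iou(det, references[r])
            if value > best_iou:
                best_iou, best_ref = value, r
        if best_ref is not None and best_iou >= iou_thr:
            unmatched.remove(best_ref)  # each reference tree is matched once
            tp += 1
    return tp, len(detections) - tp, len(unmatched)


def ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def scores(tp, fp, fn):
    """Accuracy (assumed definition), precision, recall and F1-score."""
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp, tp + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical boxes in metres: two correct detections, one false positive,
# and one reference tree that was missed.
detections = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 55, 55)]
references = [(1, 1, 10, 10), (20, 20, 32, 32), (80, 80, 90, 90)]
tp, fp, fn = count_matches(detections, references)

# Hypothetical counts, chosen only to show the arithmetic.
example = scores(17, 3, 4)
```

Because true negatives are undefined in object detection, this form of accuracy penalises both false positives and false negatives and is always lower than or equal to the F1-score.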

RESULTS AND DISCUSSION
Tree detection results using the DETR and YOLOv4 methods are summarised in Table 4. DETR clearly outperforms YOLOv4 in mixed plot #2 and deciduous plot #3. Interestingly, mixed plot #2 shows a significant difference of more than 20% in terms of F1-score. In deciduous plot #3, the difference in F1-score is smaller at 4%. In contrast to plots #2 and #3, DETR is worse by 4% F1-score in coniferous plot #1. Note that the accuracy values in Table 4 fully confirm the trend of the F1-score. Across all three test plots, both methods have problems with over-segmentation. The effect is particularly pronounced in mixed plot #2. This is reflected in a 23% lower recall for YOLOv4. Figure 6 shows a deciduous crown with a diameter of 15 m that is completely detected by DETR and split up into two boxes by YOLOv4. Our explanation for this is that the training data contain mainly medium-sized trees. Therefore, to reduce the over-segmentation effect, more large tree crowns should be included in the training. Furthermore, we notice that DETR detects smaller trees worse than YOLOv4. This is especially noticeable in plot #1, where a total of seven small trees with a crown diameter of less than 5 m are located. In detail, DETR detects only one of these trees, whereas YOLOv4 successfully finds four. To clarify this, Figure 7 shows a sample subarea of plot #1. In Figure 7a, we notice that DETR detects two small trees. However, these are false positives because their IoU values are too low. In contrast, YOLOv4 successfully detects two trees in this subarea (see Figure 7b).
In conclusion, DETR apparently has problems detecting smaller trees, which is also reflected in the poorer results in plot #1. This can be explained mainly by the fact that DETR generally detects smaller objects worse than object detectors such as YOLOv4 or Faster R-CNN, which normally use higher-resolution feature maps. This disadvantage of DETR was recently compensated by an extension called Deformable DETR, which achieves significantly better detection results for small objects (Zhu et al., 2020).
Table 4. Detection results in %. For each plot, the first value is the accuracy and the second the F1-score; the remaining two values are precision and recall.
Method    Plot #1 (coniferous)    Plot #2 (mixed)    Plot #3 (deciduous)
DETR      71  83  81  85          76  86  84  89     55  71  71  71
YOLOv4    77  87  84  89          48  65  57  75     50  67  63  71
Finally, we compare our results with the study (Weinstein et al., 2019) [...] pine (Pinus sabiniana). The study reports a detection accuracy of 69% recall and 61% precision. Comparison with our study is difficult due to differences in the characteristics of the forest area.

CONCLUSIONS AND OUTLOOK
In this study, the successful detection of individual trees using a novel transformer-based object detection method called DETR was demonstrated. When comparing DETR with the baseline method YOLOv4, we observed a significant improvement in detection accuracy in the mixed and deciduous plots. In the mixed plot, DETR achieved an improvement of more than 20% in terms of F1-score compared to YOLOv4. In the deciduous plot, a moderate increase of 4% in F1-score was observed. In the coniferous plot, however, DETR was inferior to YOLOv4 by 4% F1-score. Our experiments suggest that small trees are detected worse because of the known weakness of DETR in localising small objects. Future experiments will focus on (i) usage of multispectral channels (e.g. NIR, NDVI, NDRE), (ii) usage of lidar-based metrics generated from a lidar flight mission conducted in the same area (e.g. DSM, lidar intensity, penetration rate), and (iii) extension of the detection method with a segmentation step enabling tree crown delineation.