FUSION OF LIDAR AND HYPERSPECTRAL DATA FOR SEMANTIC SEGMENTATION OF FOREST TREE SPECIES

Hyperspectral images (HI) and Light Detection and Ranging (LiDAR) provide high resolution radiometric and geometric information for monitoring forests at individual tree crown (ITC) level. It has many important applications for sustainable forest management, biodiversity assessment and healthy ecosystem preservation. However, the integration of different remote sensing modalities is a challenging task for tree species classification due to different artifacts such as the lighting variability, the topographic effects and the atmospheric conditions of the data acquisition. The characterization of ITC can benefit from the extraction and selection of robust feature descriptors that solve these issues. This paper aims to investigate the integration of feature descriptors from HI and LiDAR by using the intra-set and inter-set feature importance for the semantic segmentation of forest tree species. A fusion methodology is proposed between high-density LiDAR data (20 pulses m−2) and VNIR HI (160 bands and 0.80m spatial resolution) acquired on French temperate forests along an altitude gradient. The proposed scheme has three inputs: the field inventory information, the HI and the LiDAR data. Our approach can be described in nine stages: polygon projection, non-overlapping pixel selection, vegetation and shadow removal, LiDAR feature extraction, height mask, robust PCA (rPCA), feature reduction and classification. The overall accuracy of tree species classification at pixel-level was 68.9% by using random forest (RF) classifier. Our approach showed that 74.0% of trees were correctly assigned overall, by having conifer species such as Norway Spruce (Picea abies) with a producer’s accuracy of 97.4%.


INTRODUCTION
Spatial distribution of forest tree species has important benefits for sustainable forest management (Ghosh et al., 2014). For instance, scientists are able to study the functioning of forested ecosystems by understanding variables associated to stress, disease patterns, invasive species spread and deforestation (Ghosh et al., 2014, Lee et al., 2016. This information is very relevant to establish useful exploitation policies of forests (Dalponte et al., 2012). Conventional ecological survey methods of tree species mapping require an exhaustive work that thrives in difficult scenarios, relies on small plot-level datasets and implies a significant amount of time, manpower and economic resources (Ghosh et al., 2014, Lee et al., 2015. Several studies have highlighted the potential of spectral and spatial resolution in remote sensing based tools for monitoring forest ecosystems. For instance, hyperspectral sensors extract radiance information from objects or scenes lying on the Earth surface (Bioucas-Dias et al., 2013) by covering the visible (V), nearinfrared (NIR), and shortwave infrared spectral bands in the range from 300 to 2500 nm (Bioucas-Dias et al., 2012). LiDAR data describe the scene using a point cloud that has explicit 3D coordinates (x, y, z), intensity of the returns, return number, number of returns, point classification, among other attributes (Vincent et al., 2017). The advances in the integration of LiDAR and hyperspectral sensors make possible to identify tree species at pixellevel with a high accuracy. * Corresponding author. Data fusion is implemented at three levels: observation-level, feature-level and decision-level (Tusa et al., 2020). Considering that tree species identification is a remote sensing topic posed as a supervised approach (Dalponte et al., 2019), recent studies pursue to integrate HI and LiDAR data to improve the classification (decision-level fusion). In (Matsuki et al., 2015), features are extracted from the HI by applying principal component analysis (PCA) for reducing the redundancy within the bands (Liao et al., 2017). In (Lee et al., 2016), the algorithm rPCA is applied for filtering the noise and for selecting relevant features. This study demonstrated that rPCA improved the classification of six tree species over PCA, with an overall accuracy of 61.0% at tree-level. The algorithms support vector machines (SVM) (Lee et al., 2016) and RF (Maschler et al., 2018, Anderson, 2018 have been used for tree species classification with HI. Although SVM performs superior for high dimensional features, it requires a proper parameter-setting that can be time-consuming for large datasets. Alternatively, RF can be a good choice for feature selection (feature-level fusion) and classification (Tusa et al., 2014).
The aim of this study is to integrate HI and LiDAR data for the classification of four forest tree species. The matrix decomposition of rPCA increases the feature representation, which can be reduced by considering the intra-set and inter-set feature importance. This work is divided as follows: section 2. describes the study site and the data acquisition specifications, section 3. presents the contributions based on PCA and rPCA, feature selection and classification based on RF, section 4. discusses the results of feature fusion for the classification and finally, section 5. corresponds to the conclusions and the work perspectives.

STUDY SITE
The study area corresponds to the site of Chamrousse, located in Belledonne massif, Northern Alps, France. Trees with a diameter at breast height (DBH) over 7.5 cm were inventoried on seven field plots: four of size of 50 × 50 m 2 , two circular plots of 15 m radius, and a plot of 80 × 100 m 2 , which is part of a long-term inventory dataset (Fuhr et al., 2017). The forest is dominated by Norway spruce (Picea abies; 37.5%) and other conifers: silver fir (Abies alba; 31.6%) and mountain pine (Pinus uncinata; 10.6%); broadleaves species are mainly represented by European beech (Fagus sylvatica; 15.2%).
For tree positions, slope, distance and azimut relatively to reference poles are recorded. Tree crown extensions were measured during the summer 2018, with ruban tapes in the north, south, east and west directions as the horizontal distance between the trunk center and the vertical projection of the furthest live branch along that direction. Tree positions in the Lambert 93 projected coordinate system are then computed from the angular and distance measurements. As GNSS precision is around a few meters under forest cover, trees positions are displayed over the LiDAR canopy height model. In case of a significant shift, position of the reference pole is adjusted by using the co-registration method described in (Monnet and Mermin, 2014). The field data is summarized in Table 1 by considering seven plots with 893 tree crowns.
The remote sensing data was collected between 21 st and 23 rd of June 2018. LiDAR acquisition was carried by using a RIEGL LMS Q780 sensor. The 3D point cloud has a mean pulse density of 20 pulses m −2 . The HI was collected by using a Hyspex VNIR 1600 sensor. It has 160 bands with a spatial resolution of 0.80 m and a spectral resolution of 4.5 nm in a spectral range between 400 and 1000 nm.

METHODOLOGY
The semantic segmentation workflow for forest tree species classification is presented in Figure 1. The proposed scheme has three inputs: the field inventory information, the HI and the LiDAR data. This approach comprises the following steps: polygon projection, non-overlapping pixel selection, spatial filtering, vegetation and shadow mask, LiDAR features, height mask, robust PCA, feature reduction and classification.

Polygon projection
Each tree crown from the inventory is projected into the xy-plane by fitting an ellipsoidal crown shape. Then, a HI pixel is associated with a crown by verifying if the pixel center coordinate is inside the crown polygon. The pixel center coordinate, (xc, yc), is computed through equations (1) where (x0, y0) is the xy coordinates for the upper-left pixel in the image, s f is the half of the spatial resolution of the image, and (ri, ci) are the coordinates for the row and the column pixel positions in the image (ri ∈ {0, ..., m} and ci ∈ {0, ..., n} with an image size m × n.

Spatial filtering
According to (Laybros et al., 2019), a strategy for reducing local noise and improving class separability can be implemented through the spatial filtering. Conventionally, a mean filtering is computed by moving a windows of 3 × 3 over each band. In (Dechesne et al., 2017), three circular neighborhoods of radius r f ∈ {1.0, 3.0, 5.0} m are applied to compute the mean value of every pixel in each band. Since the pixel center coordinate (xc, yc) is known, the pixels whose centers are inside of the circular neighborhood of each pixel, are selected for averaging the reflectance in each band. Finally, the three mean values of all pixel associated to each r f are averaged.

Non-overlapping pixel selection
The crown pixel correspondence becomes challenging when some crown regions overlap among each other. To overcome this issue, those overlapping pixels that are associated with crown trees of different species, are not considered in our ground truth. The purpose is to select a representative set of pure pixels for each species. The overlapping pixels for those trees of the same species, are labeled to the tree with the greatest height and crown area.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2020, 2020 XXIV ISPRS Congress (2020 edition) This pixel -tree association is relevant when the training and testing sets are defined for the classification.

Vegetation and shadow mask
The selection of sunlit pixels is advised because shadowed pixels are affected from a low signal-to-noise ratio (SNR) (Torabzadeh et al., 2019). Several strategies have been proposed for shadow removal. In our approach, the criterion explained by (Dalponte et al., 2014), estimated the average of the blue portion of the spectrum in the range of [450,550] nm. Then, Otsu threshold is calculated over this range. Since the Otsu threshold value removed approximately 50.0% of the pixels, the minimum threshold is computed to preserve around 65.0% of our ground truth.
The normalized difference vegetation index (NDVI) is used to separate forest from non-vegetation . In (Dalponte et al., 2018), authors selected a NDVI value greater than 0.5 to consider well-lit, leafy vegetation pixels. In (Alonzo et al., 2016), they used a threshold of 0.6 based on the average NDVI for all crowns. For our approach, a NDVI value of 0.55 is applied for all forest plots.

LiDAR features
From the LiDAR 3D point cloud, 24 LiDAR features are computed at point-level according to (Dechesne et al., 2017): • Two vegetation density features: the cumulative sum of local height maxima retrieved from each radius r f maximum filter within three cylindrical neighborhoods of radius rc ∈ {1.0, 3.0, 5.0} m. The second feature is computed from the number of points classified as ground points over the total number of points within each cylindrical neighborhood of radius rc. The three outputs of the neighborhoods for every point are averaged.
• Two shape features: the scatter and the planarity are computed from the eigenvalues of the covariance matrix within the three cylindrical neighborhoods of radius rc.
• 20 statistical features from each LiDAR point using the cylindrical neighborhoods given in rc. The height and intensity information from the LiDAR data are used to derive statistical features: minimum; maximum; mean; median; standard deviation; median absolute deviation from median; mean absolute deviation from median; mean absolute deviation from mean; skewness; kurtosis; 10 th , 20 th , 30 th , 40 th , 50 th , 60 th , 70 th , 80 th , 90 th and 95 th percentiles. All the statistical functions are used for the height, while the mean is used for the intensity only.
After computing the LiDAR features at point-level, these features are rasterized at the resolution of the HI, sc. The feature values of the points inside the pixel that is centered at the coordinate x = (xc, yc), are weighted according to the function W (x) where W h and Wv are the weighting functions of the horizontal and vertical information, respectively. The point x i represents the xy coordinate of each 3D point inside the cylinder, γ is selected to be 0.5 (normal distribution), w h is an horizontal bandwidth equal to √ 2sc, Zi represents the z coordinate of each 3D point inside the pixel and wz = Zmax − Zmin, corresponds to the difference between the maximum and minimum height 3D point values in the vertical weighting function, Wv. These weighting functions have been used for finding the maximum value of the 3D point cloud distributions for 3D segmentation purposes (Xiao et al., 2019). In this way, the information of the highest points close to the pixel center are given more importance because these are associated to the most illuminated regions in the image.

Height mask
The purpose of the height mask is to avoid negative effects due to shrubs, grassland and low height objects. A threshold value of hmin = 1.5 m (Kandare et al., 2016) is applied over the digital surface model (DSM) for preserving most of the forest regions. At this point, by merging all the criteria of specie purity, pixel illumination and height information, a set of n lp labeled pixels associated with nHI = 160 hyperspectral bands and nLi = 24 rasterized LiDAR metrics is processed in the next stage.

Robust PCA
From a given data matrix of size m × n, PCA can generate a low-rank matrix L that minimizes the error S = D − L: where r min(m, n) is the target rank of D and || · ||F is the Frobenius norm. Although PCA is widely used as dimensionality reduction technique, it can be affected by large-amplitude noise. rPCA (Lee et al., 2016) is a method proposed to improve the robustness of the PCA for dealing with outliers. This algorithm pursues to recover a low-rank matrix L from highly noisy measurements D = L + S, by separating a sparse matrix S. Given the matrix D, the matrices L and S are computed by satisfying the following condition: where || · ||0 is the l0-norm, λ is a regularization parameter for balancing the importance between the ranking operator and the sparsity regularization. Equation (4) is reformulated as a convex optimization approach: where ||·|| * is the nuclear norm, which is the sum of singular values of L; and || · ||1 is the l1-norm. In (Brunton and Kutz, 2019), authors described the programming implementation of the rPCA algorithm. In this stage, the matrices L and S are obtained from the HI, D HI , and the rasterized LiDAR features, D Li , separately.

Feature reduction
Considering the findings of (Luan et al., 2014), the sparse matrix S contains detailed information to establish inter-class and intra-class differences. This is a motivation to focus on the effect of stacking features according to the feature importance given by the RF classifier (Friedman, 2001). Let L HI and S HI be the low-rank and sparse matrices from HI, D HI ; the principal components (PC) are computed to obtain: PC D HI , PC L HI and PC S HI . The same procedure is repeated with the LiDAR information to have six matrices: D Li , L Li and S Li ; and the corresponding PC: PC D Li , PC L Li and PC S Li .
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2020, 2020 XXIV ISPRS Congress (2020 edition) Each subset is evaluated in the RF classifier and the subset with the highest F1-score corresponds to the selected feature set. This procedure is applied for every set of features: D HI , L HI , S HI , PC D HI , PC L HI , PC S HI , D Li , L Li , S Li , PC D Li , PC L Li and PC S Li .
The 12 sets of selected features are ordered based on the F1-score as it is described in Table 4. The procedure to stack features from a different set to another is explained as follows. The set with the highest F1-score is our reference set to which new features will be added to increase the overall F1-score. Let A = [a4, a3, a1] be our reference set with the highest F1-score, the feature set B = [b1, b2, b3, b4] contains the features to be added to the set A.
The estimation of the inter-set feature importance is achieved by stacking every feature to the reference set: The subset of features that increased the F1-score of the reference set are stacked to form a new reference set. This procedure is repeated with the remaining sets.

Classification
For the tree species classification, tree crowns were selected according to the number of pixels per crown. Thus, those trees that had more than 4 pixels per crown, were considered in our dataset. By applying this criterion, an average of 19.6 pixels per crown was obtained which is close to the average number of pixels per crown selected by (Baldeck and Asner, 2014) (24.9 pixels per crown). After performing the tree crown selection, the number of trees per specie were distributed according to Table 2.
For the RF classifier, the number of decision trees to grow was set to 1000. The training and test sets were defined by randomly separating trees and by associating the pixels with their crowns. The tree crowns were divided into six sets. The training set was formed by five of these six sets, resulting in a test set of 1550 pixels, which is a similar number of samples selected by (Lee et al., 2016) (1537 pixels).
Our dataset showed a problem of class imbalance, which means the number of trees and pixels per class are not evenly distributed (Anderson, 2018). To address this problem, two strategies were implemented. First, those tree species that had less than six trees or had less than 1% of total number of pixels were grouped into a class called "other" species. Second, RF automatically weighted the classes by considering that these are inversely proportional to the frequency that the class has in the data (Albon, 2018). The pixel distribution is described in   Table 3: Abbreviation, number and percentage of trees and pixels for four identified species and one "other" species class

RESULTS AND DISCUSSION
The results of feature stacking based on feature importance and F1-score are presented for the HI and the LiDAR data for the fusion of these two modalities. The ground truth of the seven plots is illustrated in Figure 2. Four classes of tree species were discriminated at pixel-level. The tree species PIAB is present in all forest plots, while the class PIUN is in plot 4 only. The classification at tree-level is obtained by applying a majority voting rule to define the species for each crown (Lee et al., 2016).
The ranking of the feature sets based on the F1-score obtained by pixel classification is described in Table 4. Due to the difference between the amount of features from HI and LiDAR data, the spectral descriptors performed better than those features extracted from the LiDAR point cloud. However, the rasterized LiDAR features were not filtered spatially as the HI bands, which clearly considers the information from the neighborhood of each pixel. Feature selection based on RF importance reduced the amount of features and increased the classification performance. According to Table 4, the reference set for our approach is formed by 10 features from D HI . The complete HI set is represented in Figure 3(a) by computing the mean spectral signature of each tree species considering the ground truth from Figure 2.   Table 5 shows the F1-scores of pixel classification for each tree species by adding the amount of HI features marked in the second row. Since D HI is the reference set with F1-score of 54.5%, the selection of 20 PC from PC D HI increased 5.2 points of percentage. The effect is clear in the most abundant classes, conifers, such as PIAB and ABAL. The minority classes such as FASY, PIUN and other species; decreased their performance at the limit that the PIUN class is not detected by the model. The addition of 25 features from S HI , the sparse information from HI, improved for the species ABAL, PIAB and PIUN. Although the 18 PC from HI improved the performance for ABAL, FASY and PIAB species, the 8 PC from the low-rank representation, PC L HI , improved the F1-scores for all species. The contribution of 2 features from L HI increased in 0.01 points of percentage. Table 6 shows the F1-scores of pixel classification by adding the amount of LiDAR features marked in the second row. The computation of these LiDAR features was explained in the subsection 3.5 and the mean normalized range of each feature by species can be visualized in Figure 3(b). The results are presented for the datasets S Li , PC D Li and D Li . The datasets associated to the lowrank representation L Li and the PC: PC L Li and PC S Li , decreased the overall F1-score. Hence, these features were not considered for the feature fusion. Although the LiDAR information of these three datasets contributed with 1.2 points of percentage, the performance of the PIUN tree species increased 7.4 points of percentage with respect to the F1-score in the HI datasets. LiDAR data can be very useful for describing minority classes, but our feature fusion strategy gave more importance to the overall result than the individual species performance. In total, 4 features from LiDAR information are integrated to the selected features from HI to obtain a new set of 87 features that are described in Figure  3(c). The overall F1-score for the fusion of the selected features from HI and LiDAR data was 68.9%.   Table 6: F1-scores of feature fusion from three LiDAR sets by stacking a number of features (No.) to the reference set described in Table 5 Tables 7 and 8 present the species classification at the pixel-level and tree-level, respectively. The overall accuracy at tree-level is 74.0%, which is greater to the accuracy at pixel-level, 68.9%. The species PIAB, the most abundant class, obtained the highest producer's accuracy among all species, but the user's accuracy is affected by the wrong classification of the minority species as it is illustrated in Figure 4. Two trees labeled as ABAL (Figure 4(b)) and PIUN (Figure 4(h)) species, are wrongly detected as PIAB. In Figure 4(d), the tree is correctly detected as FASY, but the amount of pixels predicted as PIAB is notorious over the region of interest. The second class after the PIAB tree species with good performance is ABAL, which belongs to the conifers family. At tree-level, ABAL obtained a good producer's accuracy (70.6%) and the second highest user's accuracy (85.7%). The FASY tree species has less number of crowns than the class ABAL and the second amount of pixels in our dataset. Broadleaves, such as FASY tree species, are characterized by having the biggest tree crowns with the greatest average amount of pixels per crown (Table 3). FASY performed the highest user's accuracy at pixel-and tree-level. The minority classes such as PIUN and other species are challenging to be detected because the model predicted them as the majority class PIAB.
The approach in (Lee et al., 2016) based rPCA and SVM provided an overall accuracy for the classification of six tree species of 91.7% and 61.1% at the pixel-and tree-level, respectively. In the test set, they used 1537 pixels associated with 677 crown trees, which means every crown had approximately 2.3 pixels. There is a negative difference of 30.6 points of percentage between the pixel-and the tree-level overall accuracy by having an estimated increase of pixels from 1 to 2.3. Our test set contains 73 tree crowns with 1550 pixels (21.2 pixels per crown). The resulting overall accuracy for the classification of four tree species was 68.9% and 74.0% at the pixel-and tree-level, respectively. The estimated increase of pixels from 1 to 21.2, generated an improvement in the overall accuracy of 5.1 points of percentage between the pixel-and the tree-level performance.

CONCLUSION
In this work, we presented a data fusion scheme that integrated features extracted from the HI and the LiDAR data for semantic segmentation. The potential of rPCA and PCA are applied to expand the feature representation through the low-rank and sparse information, and their principal components. The sparse features from the HI contributed to increase the F1-score of the classifier performance by adding 49.4% of the sparse features. The feature selection implements two steps: first, the intra-set feature importance was described by the RF scores; then, the inter-set feature importance was estimated by the overall F1-score for stacking different set of features. This method offers the flexibility of incorporating additional features such as vegetation indexes, which increases the interpretability of the model. The feature reduction process decreased the dimensionality from 1104 to 87 features. The overall accuracy of tree species classification at pixellevel was 68.9%. Our approach showed that 74.0% of trees were correctly assigned overall, which represented an improvement of 12.9 points of percentage with respect to (Lee et al., 2016). Although these two remote sensing modalities provided valuable information, the most abundant classes, the conifers such as Abies alba and Picea abies, benefited mainly from the HI information. However, it became a challenging task for dealing with imbalanced classes of our dataset, such as Pinus uncinata. Future work is going to be focused in the implementation of tree species detectors by selecting the most relevant features from HI and LiDAR data for each specific tree species.