A COMPARISON OF RANDOM FOREST AND LIGHT GRADIENT BOOSTING MACHINE FOR FOREST ABOVE-GROUND BIOMASS ESTIMATION USING A COMBINATION OF LANDSAT, ALOS PALSAR, AND AIRBORNE LIDAR DATA

Sustainable forest management is a critical topic that contributes to the ecological, economic, and socio-cultural aspects of the environment. Providing accurate above-ground biomass (AGB) maps is of paramount importance for sustainable forest management, carbon accounting, and climate change monitoring. The main goal of this study was to leverage the potential of two machine learning algorithms for predicting AGB using optical and synthetic aperture radar (SAR) datasets. To achieve this goal, random forest (RF) and light gradient boosting machine (LightGBM) models were deployed to predict AGB values in Huntington Wildlife Forest (HWF) in Essex County, NY, using continuous forest inventory (CFI) plots. Both models were trained and evaluated on airborne light detection and ranging (LiDAR) data, Landsat imagery, advanced land observing satellite (ALOS) phased array type L-band synthetic aperture radar (PALSAR) data, and their combination. The integration of airborne LiDAR, optical, and SAR datasets provided the best results in terms of root mean square error (RMSE) and mean bias error (MBE). The RF model outperformed LightGBM in all scenarios (LiDAR, Landsat 5, ALOS PALSAR, and their combination). For the combination of LiDAR, optical, and SAR data, the RF model predicted AGB values with an RMSE of 51.90 Mg/ha and an MBE of -0.189 Mg/ha, while LightGBM estimated AGB values with an RMSE of 52.78 Mg/ha and an MBE of -0.253 Mg/ha. LightGBM is more sensitive to noise, and it has many hyperparameters that need to be tuned and that strongly affect its performance.


INTRODUCTION
In today's world, global deforestation is expanding and accelerating, resulting in the release of a quarter of carbon into the atmosphere (S. Li, Quackenbush, and Im 2019). Forest monitoring and more accurate estimates of forest above-ground biomass (AGB) are important for clarifying the contribution of forests to global climate change. Therefore, some countries have developed a campaign known as Reducing Emissions from Deforestation and Degradation (REDD) to mitigate the effects of climate change (Bellassen and Gitz 2008). A key question for the REDD effort is how much AGB is available at national and global scales. Under the requirements of this effort, participating countries are supposed to report verified national-level estimates of AGB, which is a key indicator of carbon pools in forest systems (Chen et al. 2018). One major problem in accurate carbon estimation is finding an efficient method for determining forest AGB. Although field measurement techniques can estimate AGB accurately, they are inherently destructive, labor-intensive, costly, time-consuming, and practical only at local scales (M. Li, Im, and Beier 2013). The increasing availability of remote sensing data paves the way for cost-effective, large-scale AGB estimation.
Recently, light detection and ranging (LiDAR), optical, and synthetic aperture radar (SAR) data have been extensively used as non-destructive and effective sources for forest monitoring (S. Li, Quackenbush, and Im 2019). LiDAR has provided an efficient tool for accurate AGB estimation by capturing three-dimensional forest structure (Bolton et al. 2020). However, the cost and volume of LiDAR data limit its application for large-scale AGB estimation. Optical and SAR data are other valuable sources for AGB estimation over large areas, at lower cost than airborne LiDAR flights. Spectral bands, vegetation indices, and texture features derived from optical imagery are highly correlated with vegetation density, biomass, and chlorophyll content, among other properties (Zhou et al. 2016). However, weather conditions such as cloud cover, rain, and snow can greatly affect the quality of optical imagery, especially in tropical regions and northern climates. In addition, these images suffer from saturation, which occurs when the pixels' spectral reflectance values no longer reflect the true reflectance in high-biomass regions (Urbazaev et al. 2018; Zhou et al. 2016). SAR sensors acquire data independent of weather and illumination conditions. SAR signals are sensitive to the geometric structure of trees and can penetrate the forest canopy depending on the wavelength (Tamiminia et al. 2017). Urbazaev et al. (2018) indicated that SAR data are also limited by saturation, depending on the wavelength and biomass density. Thus, in this study the combination of airborne LiDAR, optical, and SAR data is used to overcome the limitations of single-source approaches and to enhance AGB estimation.
A comparison between decision tree-based techniques is needed. Y. Li et al. compared the results of linear regression, RF, and extreme gradient boosting (XGBoost) for AGB estimation in a subtropical forest of southern China using spectral bands, vegetation indices, and texture measures derived from Landsat 8 imagery. They reported that XGBoost performs better than linear regression and RF. In 2020, Y. Li et al. further investigated XGBoost tuning parameters for AGB estimation in southern China using the combination of Landsat 8 and Sentinel-1 data. Moreover, Pham et al. (2020) reported that the combination of XGBoost and a genetic algorithm feature selection technique provides better results than CatBoost (CB), RF, gradient boosted regression tree (GBRT), and support vector regression (SVR) algorithms for mangrove AGB estimation using the integration of SAR and optical data in Vietnam. Since these studies were conducted in tropical and subtropical forests, in this study we explore the potential of decision tree-based models for a temperate forest.
The main goal of this study is to compare two ensemble machine learning models, random forest (RF) and light gradient boosting machine (LightGBM), for AGB estimation using the integration of remote sensing data. To achieve this goal, the following research objectives are addressed:


To assess the potential of the synergy of airborne LiDAR, optical (i.e., Landsat 5 Thematic Mapper (TM)), and SAR (i.e., advanced land observing satellite (ALOS) phased array type L-band synthetic aperture radar (PALSAR)) data for AGB estimation. The hypothesis is that combining remote sensing data improves the performance of AGB modelling in comparison to single-source methods.


To compare RF (a bagging technique) and LightGBM (a boosting technique) models. Our hypothesis is that a properly tuned LightGBM model can perform better than an RF model.

Study Area
The study area is located in Huntington Wildlife Forest (HWF), in the central Adirondack Park, northern New York State (Figure 1). HWF covers an area of approximately 6,000 ha (latitude 44° 00′ N, longitude 74° 13′ W). It has mountainous topography, and the elevation ranges from 473 m to 908 m above mean sea level. The mean annual temperature and precipitation are 4.4 °C and 1010 mm, respectively.

Figure 1. Location of the study area (Essex County, NY) for forest AGB estimation using ensemble machine learning models. White circles depict sample plots located in Huntington Wildlife Forest.

Field Measurements
Continuous Forest Inventory (CFI) plots were used as the reference dataset in this study. This dataset was collected by the State University of New York, College of Environmental Science and Forestry (ESF) during July and August of 2011. Plots cover northern hardwood species including sugar maple, red maple, yellow birch, beech, white ash, red oak, white pine, hemlock, red spruce, and pine/softwood plantations of various species (Breitmeyer et al. 2019). The CFI data over HWF contain 288 sample plots (Figure 1). For trees with a diameter at breast height (DBH) of 11.7 cm or greater, information such as species, DBH, and location relative to the center of the plot was recorded (S. Li, Quackenbush, and Im 2019). Then, AGB at tree level was calculated using species-specific DBH Component Ratio Method (CRM) allometric equations (Kennedy et al. 2018). Finally, plot-level AGB was calculated as the AGB per unit area within each sample plot. In other words, the plot-level AGB in megagrams per hectare (Mg/ha) was calculated by dividing the total tree-level AGB in the plot by the plot area.
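The plot-level calculation can be sketched as follows. This is a minimal illustration rather than the processing code used in the study: the function name, the tree AGB values, and the 0.04 ha plot area are hypothetical, and tree-level AGB is assumed to be already available in kilograms from the allometric equations.

```python
def plot_agb_mg_per_ha(tree_agb_kg, plot_area_ha):
    """Sum tree-level AGB (kg) within a plot and express it in Mg/ha."""
    if plot_area_ha <= 0:
        raise ValueError("plot area must be positive")
    total_mg = sum(tree_agb_kg) / 1000.0  # 1 Mg = 1000 kg
    return total_mg / plot_area_ha


# Hypothetical plot: three trees on a 0.04 ha plot (assumed values).
example_agb = plot_agb_mg_per_ha([250.0, 410.0, 140.0], plot_area_ha=0.04)
print(f"Plot-level AGB: {example_agb:.1f} Mg/ha")
```

Only trees that meet the DBH threshold would be passed to such a function.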

Airborne LiDAR:
Discrete-return airborne LiDAR data were acquired over HWF in May 2015 using the Leica Airborne Laser Scanner (ALS70) at a maximum flying height of 3500 above ground level (AGL), to support a 2.5 ppm² LiDAR point cloud. The first step in LiDAR data processing was to convert the raw point clouds into height-normalized point clouds.
A k-nearest neighbour imputation algorithm (k = 5) was used to impute a digital elevation model (DEM), which was subtracted from all returns in the point cloud (Hawbaker et al. 2009; Huang et al. 2019). Then, predictors were computed from the height-normalized LiDAR data at 30 m grid cells for modelling. Finally, 29 height predictors (e.g., height percentiles and coefficient of variation of height) and intensity predictors (e.g., percentage of ground intensity and percentage of feature intensity) were fed as inputs into the machine learning models. Since the reference data were collected in 2011 and the airborne LiDAR data were acquired in 2015, the working assumption is that HWF did not change from 2011 to 2015.
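The height normalization and a few of the height predictors can be sketched for a single grid cell as below. This is an illustrative sketch only: the function names, the choice of percentiles, the clamping of negative heights to zero, and the sample elevations are assumptions, and the ground elevation under each return is taken as given rather than imputed.

```python
import statistics


def normalize_heights(z_values, ground_z):
    """Subtract the DEM elevation under each return.

    Small negative heights are clamped to 0 (an assumption of this sketch).
    """
    return [max(z - g, 0.0) for z, g in zip(z_values, ground_z)]


def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in 0..100) of a sorted list."""
    if not sorted_vals:
        raise ValueError("no values")
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def height_metrics(heights, percentiles=(25, 50, 75, 95)):
    """Height percentiles and coefficient of variation for one grid cell."""
    hs = sorted(heights)
    mean_h = statistics.fmean(hs)
    metrics = {f"p{p}": percentile(hs, p) for p in percentiles}
    metrics["cv"] = statistics.pstdev(hs) / mean_h if mean_h > 0 else 0.0
    return metrics


# Hypothetical returns (elevation in m) falling in one 30 m cell.
cell_heights = normalize_heights(
    [310.0, 315.0, 322.0, 330.0, 305.0], [300.0] * 5
)
cell_metrics = height_metrics(cell_heights)
print(cell_metrics)
```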

Landsat 5 TM Imagery:
Landsat 5 TM surface reflectance (SR) imagery with 30 m spatial resolution was preprocessed and downloaded through the Google Earth Engine (GEE) platform. First, the Landsat 5 images covering HWF were selected for July and August of 2011 with less than 5 percent cloud cover. A cloud masking function based on the pixel quality assessment (pixel-qa) band of the Landsat SR data was applied to mask out clouds. Then, six spectral bands were selected: blue, green, red, near infrared (NIR), shortwave infrared-1 (SWIR1), and shortwave infrared-2 (SWIR2). Finally, vegetation indices such as the normalized difference vegetation index (NDVI), soil adjusted vegetation index (SAVI), ratio vegetation index (RVI), normalized burn ratio (NBR), and normalized difference moisture index (NDMI) were calculated from the spectral bands.
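The listed vegetation indices can be sketched per pixel as follows, using their commonly cited definitions. The function name, the soil adjustment factor of 0.5 for SAVI, the NIR/red form of RVI, and the sample reflectances are assumptions of this sketch; reflectance is assumed to be scaled to the range 0 to 1, which matters for SAVI.

```python
def _safe_div(num, den):
    """Return num/den, or NaN when the denominator is zero."""
    return num / den if den != 0 else float("nan")


def vegetation_indices(red, nir, swir1, swir2, soil_factor=0.5):
    """Compute common vegetation indices from surface reflectance (0..1)."""
    return {
        "NDVI": _safe_div(nir - red, nir + red),
        # SAVI with soil adjustment factor L (0.5 is a commonly used default).
        "SAVI": _safe_div((1 + soil_factor) * (nir - red),
                          nir + red + soil_factor),
        # RVI taken here as the simple ratio NIR / red (assumed form).
        "RVI": _safe_div(nir, red),
        "NBR": _safe_div(nir - swir2, nir + swir2),
        "NDMI": _safe_div(nir - swir1, nir + swir1),
    }


# Hypothetical forest pixel (assumed reflectance values).
pixel_indices = vegetation_indices(red=0.05, nir=0.45, swir1=0.20, swir2=0.10)
for name, value in pixel_indices.items():
    print(f"{name}: {value:.3f}")
```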

ALOS PALSAR Data:
For SAR data, the global PALSAR/PALSAR-2 yearly mosaic with 25 m resolution at L-band was utilized. This dataset is freely available on the GEE platform (Tamiminia et al. 2020). It should be noted that the strips with less response to surface moisture were selected for this procedure. Then, the imagery was ortho-rectified and slope corrected using the 90 m shuttle radar topography mission (SRTM) digital elevation model (DEM). In this study, the dual-polarization (horizontal transmit/horizontal receive (HH) and horizontal transmit/vertical receive (HV)) yearly mosaic for the year 2011 over HWF was used. Then, a smoothing speckle filter with a radius of 30 m was applied to the channels to remove speckle noise. Span and ratio were also calculated to add more features for training the models (Equations 1 and 2). The images were resampled to 30 m resolution to match the resolution of the LiDAR and optical datasets.
where HH = horizontal transmit/horizontal receive channel and HV = horizontal transmit/vertical receive channel.
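Equations 1 and 2 are not reproduced in this text, so the sketch below uses one common form of the two features, assumed here rather than taken from the study: span as the total backscattered power (HH + HV) and ratio as HH / HV, both computed on linear-power backscatter values. The function names and sample values are hypothetical.

```python
import math


def span_and_ratio(hh_power, hv_power):
    """Span (total power) and co-/cross-polarization ratio.

    Assumed definitions: span = HH + HV, ratio = HH / HV, with both
    channels given as linear-power backscatter (not dB).
    """
    span = hh_power + hv_power
    ratio = hh_power / hv_power if hv_power > 0 else float("nan")
    return span, ratio


def to_db(power):
    """Convert a positive linear-power value to decibels."""
    return 10.0 * math.log10(power)


# Hypothetical pixel (assumed linear-power backscatter values).
pixel_span, pixel_ratio = span_and_ratio(hh_power=0.20, hv_power=0.05)
print(f"span = {pixel_span:.3f}, ratio = {pixel_ratio:.2f} "
      f"({to_db(pixel_ratio):.2f} dB)")
```

If the channels are stored in dB, the ratio becomes a difference (HH minus HV in dB), so the scale of the inputs should be checked before applying either form.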

METHODS
This section describes the two ensemble machine learning models used and compared in this study. Two well-known ensemble techniques are bagging and boosting (Liaw and Wiener 2002). RF and LightGBM are decision tree-based models that use bagging and boosting, respectively. The dataset was split so that 70% of the data were used for tuning and training the models, while the remaining 30% were used for evaluating the final models. Hyperparameters were tuned using a grid search approach.
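The 70/30 split and the grid search can be sketched with the standard library as below. This is a generic scaffold under stated assumptions, not the tuning code used in the study: the function names, the random seed, the parameter names (n_trees, depth), and the toy scoring function are all hypothetical. In practice the scoring function would train a model on the training portion only and return, for example, a cross-validated RMSE.

```python
import itertools
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and hold out a fraction for final evaluation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def grid_search(param_grid, score_fn):
    """Try every parameter combination; a lower score is better."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# 288 plot identifiers, matching the number of CFI plots in the text.
train_ids, test_ids = train_test_split(range(288))


def toy_score(params):
    """Stand-in for a validation RMSE (purely illustrative)."""
    return abs(params["n_trees"] - 200) + abs(params["depth"] - 5)


best_params, best_score = grid_search(
    {"n_trees": [100, 200, 500], "depth": [3, 5, 7]}, toy_score
)
print(len(train_ids), len(test_ids), best_params, best_score)
```

Keeping the held-out 30% out of the scoring function is what makes the final evaluation independent of the tuning.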

Random Forest (RF)
The first idea of RF was proposed by Ho in 1995; an extension of RF for classification and regression was then developed by Breiman (2001). RF is a machine learning algorithm that uses bagging to train its trees independently and in parallel. The forest is a combination of trees in which each tree is trained separately, without any dependency on the other trees. Owing to these characteristics, training is fast, the results are less sensitive to tuning parameters, and few parameters need to be tuned. Moreover, RF builds many trees on subsets of the data, using both bagged observations and subsets of variables. This is done to increase the diversity among the trees in order to improve predictive power, leading to a robust algorithm that can reduce over-fitting issues.
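The two sources of randomness described above, bagged observations and random subsets of variables, can be illustrated with a deliberately small sketch. It is not Breiman's full algorithm: each "tree" is a one-split stump on a single randomly chosen feature, and all names, the seed, and the toy data are assumptions made for illustration.

```python
import random


def fit_stump(X, y, feature):
    """Best single-threshold split on one feature (least squared error)."""
    best = None
    values = sorted(set(row[feature] for row in X))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [t for row, t in zip(X, y) if row[feature] <= thr]
        right = [t for row, t in zip(X, y) if row[feature] > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - lm) ** 2 for t in left)
               + sum((t - rm) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    if best is None:  # feature is constant in this sample: predict the mean
        mean = sum(y) / len(y)
        return (feature, float("inf"), mean, mean)
    return (feature, best[1], best[2], best[3])


def predict_stump(stump, row):
    feature, thr, left_mean, right_mean = stump
    return left_mean if row[feature] <= thr else right_mean


def fit_forest(X, y, n_trees=50, seed=0):
    """Train each stump independently on a bootstrap sample."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bagged observations
        feature = rng.randrange(n_features)         # random variable subset
        forest.append(
            fit_stump([X[i] for i in idx], [y[i] for i in idx], feature)
        )
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


# Toy data: the first feature drives the target, the second is noise.
X_toy = [[float(i), float((i * 7) % 5)] for i in range(20)]
y_toy = [10.0 * row[0] for row in X_toy]
forest = fit_forest(X_toy, y_toy)
low = predict_forest(forest, [2.0, 1.0])
high = predict_forest(forest, [18.0, 1.0])
print(f"low = {low:.1f}, high = {high:.1f}")
```

Because no stump depends on another, the loop in fit_forest could run in parallel, which is the property the text attributes to RF.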

Light Gradient Boosting Machine (LightGBM)
Gradient boosting machine (GBM) is another ensemble of decision trees; however, unlike the trees in an RF model, the trees in a GBM cannot be built in parallel. In other words, in the GBM algorithm the second tree depends on the first tree, the third depends on the first two, and so on. The boosting technique used in GBM builds new sets sequentially, and the observations are weighted so that some of them are more likely to take part in the new sets (Ke et al. 2017). The main objective of GBM is to reduce the model's residual along the gradient direction by decreasing the previous residuals (Pham et al. 2020). GBMs have very low interpretability because the second tree no longer predicts the same target as the first: the subsequent trees seek to predict how far the earlier predictions were from the truth, using the residuals from the prior trees. In this way, each subsequent tree slowly reduces the overall error of the previous trees. This gives gradient boosting models high predictive power but low interpretability (Ke et al. 2017). The main advantages of GBM are its robustness and its ability to handle mixed data types, which is useful for remote sensing data. However, it is computationally expensive. In this paper, a LightGBM model, which uses leaf-wise splits and runs faster, was implemented using the lightgbm package in R.
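The sequential residual fitting described above can be sketched as plain gradient boosting with squared loss and one-split stumps. This is not LightGBM and does not use the lightgbm package: it omits leaf-wise growth, histogram binning, and the other optimizations of that library. All names, the learning rate, and the toy data are assumptions of this sketch.

```python
import math


def fit_stump(X, targets):
    """Best single split over all features (least squared error)."""
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [t for row, t in zip(X, targets) if row[f] <= thr]
            right = [t for row, t in zip(X, targets) if row[f] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((t - lm) ** 2 for t in left)
                   + sum((t - rm) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    if best is None:
        raise ValueError("all features are constant")
    return best[1:]


def predict_stump(stump, row):
    f, thr, left_mean, right_mean = stump
    return left_mean if row[f] <= thr else right_mean


def fit_boosting(X, y, n_rounds=100, learning_rate=0.1):
    """Each new stump is fitted to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Residuals are the negative gradient of the squared loss.
        residuals = [t - p for t, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row)
                 for p, row in zip(preds, X)]
    return base, stumps, learning_rate


def predict_boosting(model, row):
    base, stumps, lr = model
    return base + lr * sum(predict_stump(s, row) for s in stumps)


def rmse(y_true, y_pred):
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# Toy data (assumed): training error shrinks as rounds are added.
X_toy = [[float(i)] for i in range(20)]
y_toy = [10.0 * row[0] for row in X_toy]
baseline_rmse = rmse(y_toy, [sum(y_toy) / len(y_toy)] * len(y_toy))
model_1 = fit_boosting(X_toy, y_toy, n_rounds=1)
model_100 = fit_boosting(X_toy, y_toy, n_rounds=100)
rmse_1 = rmse(y_toy, [predict_boosting(model_1, r) for r in X_toy])
rmse_100 = rmse(y_toy, [predict_boosting(model_100, r) for r in X_toy])
print(baseline_rmse, rmse_1, rmse_100)
```

The dependence of each round on the previous predictions is why these trees cannot be trained in parallel, and why a noisy target can propagate through later rounds.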

RESULTS AND DISCUSSION
This section describes the results of the regression models applied for AGB estimation in four scenarios: airborne LiDAR, Landsat 5, ALOS PALSAR, and their combination. RF and LightGBM regression models were applied to each scenario separately. The performance of the models was evaluated using root mean square error (RMSE) and mean bias error (MBE). MBE was calculated as the difference between the mean of all observed values and the mean of all predicted values (Ji et al. 2015). Therefore, a positive MBE is a sign of under-prediction and a negative MBE indicates over-prediction. [...] (Table 3).
Table 3. Statistical characteristics of AGB of CFI plots in HWF used for AGB estimation using RF and LightGBM models
[...] .26 Mg/ha. In this case, both models over-predict the AGB; however, the average over-estimation of the RF model is nearly 5 Mg/ha lower than that of the LightGBM model. Since RF is a parallel ensemble of decision trees that trains each tree separately, it reduces variance. This independent training process decreases the sensitivity of the RF model to noise in the predictors. On the other hand, LightGBM trains the trees sequentially and aims to reduce the bias. Although LightGBM can overcome over-fitting issues in decision tree algorithms, noisy input variables may decrease the accuracy of each subsequent tree. In this case, owing to the inherent characteristics of remote sensing data, noise cannot be ignored. Though over-fitting is still a concern for modelling AGB values, the RF model, which is less sensitive to outliers, seems to be more effective for AGB prediction in HWF. In the airborne LiDAR-only scenario, both intensity and height variables of the LiDAR data were used as predictors for AGB estimation. It is worth mentioning that LiDAR predictors can be noisy because of low vertical accuracy relative to horizontal sample distance.
Furthermore, spectral bands and vegetation indices derived from Landsat imagery contain mixed pixels that do not correspond entirely to forested areas, which increases the chance of noise in the variables. Therefore, given the input predictors, the RF model outperformed LightGBM in AGB estimation. In addition, LightGBM is sensitive to hyperparameter tuning. Figure 3 shows the scatter plots of predicted versus actual AGB values for both RF and LightGBM models using the combination of LiDAR, Landsat 5, and ALOS PALSAR data. As shown in Figure 4, the residuals for the combination scenario are the lowest, followed by airborne LiDAR, Landsat 5, and ALOS PALSAR, respectively. ALOS PALSAR has the highest residuals, especially for plots with near-zero and high biomass. Thus, the integration of the vertical structural information of the forest provided by LiDAR, the spectral information of Landsat, and the ALOS PALSAR L-band backscatter information can significantly improve AGB mapping.
Figure 4. Residuals of predicted AGB values for airborne LiDAR, Landsat 5, ALOS PALSAR, and their combination using the RF regression model
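The two error metrics used throughout this section can be sketched as follows, with MBE following the sign convention stated in the text (mean of observed values minus mean of predicted values, so a negative value indicates over-prediction). The function names and the sample values are hypothetical and are not results from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def mbe(observed, predicted):
    """Mean bias error as mean(observed) - mean(predicted).

    Positive: under-prediction. Negative: over-prediction.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(observed) / len(observed) - sum(predicted) / len(predicted)


# Hypothetical plot-level AGB values in Mg/ha (assumed).
obs = [100.0, 150.0, 200.0]
pred = [110.0, 160.0, 190.0]
example_rmse = rmse(obs, pred)
example_mbe = mbe(obs, pred)
print(f"RMSE = {example_rmse:.2f} Mg/ha, MBE = {example_mbe:.2f} Mg/ha")
```

Because positive and negative errors cancel in MBE but not in RMSE, a model can show a bias close to zero while still having a large RMSE, which is why the two metrics are reported together.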

CONCLUSION
The main goal of this study was to compare two ensemble machine learning models for AGB estimation using airborne LiDAR, Landsat 5, ALOS PALSAR, and the synergy of these datasets in a temperate forest in NY. In all scenarios, the RF regression model outperformed LightGBM. Both RF and LightGBM are capable of handling over-fitting issues. However, LightGBM is more susceptible to noise and to hyperparameter tuning, which decreases its performance. The combination of airborne LiDAR, optical, and SAR data provided the most accurate AGB estimation with the lowest RMSE and MBE, followed by LiDAR, Landsat 5, and ALOS PALSAR, respectively. It can be concluded that combining vertical structure with spectral and backscatter information of trees enhances the AGB estimation results and reduces saturation effects.
Hyperparameter tuning plays an important role in the performance of machine learning models, especially LightGBM. Thus, Bayesian optimization or genetic algorithms might be better options for hyperparameter tuning in the future.