ESTIMATING DBH OF TREES EMPLOYING MULTIPLE LINEAR REGRESSION OF THE BEST LIDAR-DERIVED PARAMETER COMBINATION AUTOMATED IN PYTHON IN A NATURAL BROADLEAF FOREST IN THE PHILIPPINES

Diameter-at-Breast-Height Estimation is a prerequisite in various allometric equations estimating important forestry indices like stem volume, basal area, biomass and carbon stock. LiDAR Technology has a means of directly obtaining different forest parameters, except DBH, from the behavior and characteristics of point cloud unique in different forest classes. Extensive tree inventory was done on a two-hectare established sample plot in Mt. Makiling, Laguna for a natural growth forest. Coordinates, height, and canopy cover were measured and types of species were identified to compare to LiDAR derivatives. Multiple linear regression was used to get LiDAR-derived DBH by integrating field-derived DBH and 27 LiDAR-derived parameters at 20m, 10m, and 5m grid resolutions. To know the best combination of parameters in DBH Estimation, all possible combinations of parameters were generated and automated using python scripts and additional regression related libraries such as Numpy, Scipy, and Scikit learn were used. The combination that yields the highest r-squared or coefficient of determination and lowest AIC (Akaike’s Information Criterion) and BIC (Bayesian Information Criterion) was determined to be the best equation. The equation is at its best using 11 parameters at 10mgrid size and at of 0.604 r-squared, 154.04 AIC and 175.08 BIC. Combination of parameters may differ among forest classes for further studies. Additional statistical tests can be supplemented to help determine the correlation among parameters such as KaiserMeyer-Olkin (KMO) Coefficient and the Barlett’s Test for Spherecity (BTS).


INTRODUCTION
Proper forest resources assessment and management highly depends on accurate data gathering of tree inventories in an authentic ground survey to get detailed and up-to-date forest characteristics and attributes.The needed forest information are often based on species name, diameter-at-breast-height (DBH), basal area, merchantable height, total height and, in some cases, age.The most significant of these characteristics, and also the focus of this paper, is the DBH since other significant attributes like biomass and carbon stock are both derived by using allometric equations with DBH.DBH is obtained by measuring the diameter of the trunk at 1.3 meters above the ground.This method is too intensive and costly to conduct when applied in large areas.(Kankare,2015) LiDAR is a breakthrough in forestry application.It has the capacity for direct calculation and estimation of important forest characteristics such as canopy height, canopy cover, stand volume, basal area and above ground biomass.However, Airborne LiDAR Scanning (ALS) is capable of directly measuring forest characteristics based on height but not DBH.

LiDAR Technology in Forestry
LiDAR Technology is a remote sensing method that uses light in the form of pulsed lasers that measure the roundtrip time for a pulse of laser energy to travel between a sensor and a target.These light pulses are recorded by the airborne system.When combined with other recorded data, it generates precise, threedimensional information about the shape of the earth and its surface characteristics.(NOAA, 2015) The pulse of energy strikes the canopy and the ground and goes back to the sensor.Canopy height, one of the more straightforward parameters, is calculated by subtracting the elevations of the first and last returns from the LiDAR signal.Since DBH is not height-quantifiable, it cannot be derived directly.However, vertical distribution of points from captured surfaces provides basis for estimating other important canopy descriptors such as DBH.(Dubayah,  A related study has also undertaken research that focuses on derivation of information from LiDAR data.In the study by Argamosa (2015) the average diameter at breast height of a broadleaf forest was estimated using linear and log-linear regression.A field validation site was established at 20m, 10m and 5m grids and tree inventories were also collected.Using 27 LiDAR parameters, estimation was made for both linear and log-linear regression at different grid sizes.All three grids showed promising results in estimating DBH.Linear regression analysis showed an r-squared of 0.72, 0.083, 0.70 for 20x20m, 10x10m and 5x5m grids, respectively.For the log-linear regression, 10x10m showed the highest r-squared at 0.67 followed by 20x20m at 0.56 and 5x5m grid at 0.05.The estimation was proved best at the Linear Regression using the 10x10m grid.
This paper would focus on using the 10x10m grid and linear regression as standards for DBH estimation.It aims to identify if all the 27 parameters contribute to the accuracy of estimation.
The best parameter combination would be identified using combination of matrices generated by the python script.

Linear Regression and Python Programming
Regression falls under the category of supervised machine learning.Machine Learning is a study in computer science that involves pattern recognition and algorithm construction from basic raw dataset to create a model that makes predictions on other data, while supervised learning bases its algorithm construction by providing the computer the values of both dependent and independent variables.
Under this category are the two most prominent model generation methods, Regression and Classification.Regression estimates the outcome of the dependent variable based on the relationships, while classification identifies group membership of each sample through pattern recognition and user defined limits of each class.
Linear Regression is used to identify the effect of each parameter on the response variable or the dependent variable.It assumes all the parameters or the independent variables have a linear relationship to the dependent variable.The main objective is to determine the magnitude of effect of a feature in the final computation of the response.
On the other hand, python is a high-level programming language that is commonly used in scientific computations like machine learning.High-level language is easier to read and write.It is portable and can be run on different computers with less or no modifications.

Study Area
The study area is located at Mt. Makiling, Laguna at 14° The study area represents a mid-mountain Dipterocarp forest.At its present stage of forest development, it is vigorously regenerating secondary Dipterocarp forest stemming from years of logging during the Spanish period.Figure 1 shows the location of the plot with respect to Laguna province and its neighboring provinces.

LiDAR Acquisition
The LiDAR data was obtained using Optech Airborne Laser Terrain Mapper (ALTM) Pegasus acquired on March 12-18, 2014.The flying height for this flight is 1,100m above mean sea level and with a pulse density of 1.19 pulse/m and a planned point density of 2 pulses per square meter.

Field Measurements
The field survey was conducted on April 2015 at the Molawin-Dampalit Forest Plot of Mt.Makiling to create an inventory that would be useful in comparing LiDAR derived parameters with that of ground truth data.This validation aims to determine the accuracy of LiDAR data in determining the structural characteristics and provide calibration parameters for a broadleaf forest classification in assessing its structural attributes.
The inventory includes the establishment of a two-hectare plot subdivided into 150 grids each with a dimension of 10x10m.All trees within the plot with DBH greater than 10 cm are geotagged.The height and DBH of these selected trees were also recorded.Figure 2 shows the LAS File (LiDAR Point Cloud) over the subdivided grids and exact location of geotagged trees.
Figure 2: Geotagged trees within the 10x10m grid plots A total of 1,094 trees were surveyed within the two hectares plot.From the 150 grids, 50 random samples were selected to serve as data for calibration for the DBH estimation and also to be comparable to the study of Argamosa (2015) who also used 50 samples.

LiDAR Data Processing
Using Lastools, the corresponding LiDAR data of the plot was normalized to adjust the elevation values before extracting the 27 parameters needed for the estimation.In this situation, the height of the trees with respect to each other would be comparable since they were all treated to be in the same elevation.Figure 3 shows the original and normalized point cloud of the plot.The generated estimates for DBH are calculated as the average DBH of trees at a 10x10m grid plot.

Statistical Tests
Aside from the coefficient of determination, r-squared, two other statistical tests, namely AIC (Akaike's Information Criterion) and BIC (Bayesian Information criterion) were used to assess the output of regression analysis.R-squared was used as the threshold of accuracy while AIC and BIC was used to determine which among the parameter combination within the threshold r-squared gives the most accurate estimates.
Computing r-squared can be done by squaring the correlation coefficient as shown below: where x= x-coordinate (LiDAR-derived DBH Estimates) y= y-coordinate (Field DBH Estimates) n= number of samples of grids) R-squared shows how many data points fall within the results of the line formed by the regression equation.The higher the value means higher the percentages of points are closer and fitted to the regression line.
On the other hand, it is claimed that using AIC and BIC can obtain useful information for model selection together, especially when models needed should be selected by both criteria.The equations used for AIC and BIC are listed below. The And where n= number of data points Rss= residual sum of squares K= number of parameters AIC and BIC do not give the quality of the accuracy like rsquared.They only the determine the best combination of parameters among the pre-selected combinations on the given threshold of r-squaredAIC and BIC are based on the maximum likelihood estimates of the model parameter.

Python Script
Itertools one of the standard library packages available was used to get the best combination of the script.Numpy, Pandas and Skicit learn are freely downloaded python libraries were also added in the script to facilitate the mathematical computation.Numpy was used to make the arrays.Pandas was used to create dataframe.And lastly, the Scikit learn was used to perform the machine learning algorithms that is Linear Regression.Through its metric modules, easier computations of different statistical values are done to determine the robustness of a model.

Results and Discussion
From the field data, the number of trees that fell within each of 50 grids was recorded.The mean DBH per grid were also obtained.The average DBH of each grid ranges from 13cm to 40 cm as shown by Figure 4. Other combinations with r-squared greater than 0.71 were also selected.The first run stopped using 24 parameters with an rsquared of 0.71, AIC of 164.26 and BIC of 210.15.Upon plotting the parameters not used in each combination, four (4) significant parameters appeared not to be in used.The Q. Average, 50 th percentile, 10 th bincentile and 20 th bincentile appeared to be insignificant parameters when obtaining DBH estimates with r 2 >0.71 as shown in figure 5.The mentioned parameters were then deleted for the next run of least squares.
For the next run of least squares, the input matrices are the field DBH of the 50 grids and the 23 remaining parameters of each grid from the first run of least squares.Upon setting the value of r-squared to 0.6, three (3) additional parameters appeared to be not needed for the estimation and they were the minimum, skewness and 40 th bincentile.Figure 6 shows the next batch of parameters not being used in the estimation.The final combination using only 11 parameters gave a promising r-squared of 0.604 and has the lowest of AICs and BICs with values 154.04 and 175.08, respectively.There has been a significant changed in AIC and BIC when compared to an r-squared of 0.712, 170.12 AIC and 221.25 BIC when all parameters are used.
The canopy-related metrics, cover and density, were both removed from the list of parameters.From the general statistics parameters, only the maximum, average and kurtosis were found to provide positive effect on the estimation.The highest and the lowest of percentiles proved to be not in used, too.On the other hand, the upper limit of the bincentile are highly used in the estimation.
The remaining parameters used in the final estimation is shown in Table 4

CONCLUSION
Estimating DBH of trees has found to be obtainable when parameters directly derived from LiDAR are used.However, not all these parameters contribute to the accuracy of the estimation.Using all this parameter may just cause over parameterization in the process.
The removals of unnecessary parameters have been found systematically.Almost same parameters appeared to be not in used when the estimation showed a good result.This study shows that even when using 11 parameters out of the 27 parameters that can be directly derived, a promising estimation can still be generated.
Since the distribution of point clouds vary depending on the structure of canopy, different parameter combinations may be good for other forest classification.

Figure 3 :
Figure 3: Raw point cloud of the plot (upper) and the normalized point-cloud (lower) The list of canopy metrics to be derived from LiDAR data are listed in Table2.
python script was created to compute for 27 LiDAR parameter coefficients.The computation of coefficients of the features is done through Least Squares computation.The equation used is shown below: D = a 0 + a 1 P 1 + a 2 P 2 + ... + a n P n D = a0 + a 1 P 1 field-measured diameter at breast height (DBH) n= number of parameters The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLI-B8, 2016 XXIII ISPRS Congress, 12-19 July 2016, Prague, Czech Republic This contribution has been peer-reviewed.doi:10.5194/isprsarchives-XLI-B8-657-2016P= LiDAR derived parameters a= coeffieients to be determined by the regression

Figure 4 :
Figure 4: Average DBH in cm of the 50 sample grids The corresponding 27 LiDAR parameters of the 50 grids were then derived using Lastools.Using the automated least squares analysis, with the 27 parameters and measured DBH of each grid as input, an estimated DBH and coefficient of parameters for estimation were obtained.Using all the parameters, the generated DBH estimates have an r-squared of 0.71, AIC of 170.13 and BIC of 221.75.

Figure 6 :
Figure 6: Second run of least squares using 23 parameters as inputFor the third run, 7 parameters were already deleted from the original 27 parameters.The run was continued until no more combination could give an r-squared greater than 0.6.The determination of the best combination underwent 6 (six) run until it reached its threshold.The parameters being removed for each run are summarized in Table3.

Figure 7
Figure7shows the field derived DBH against the estimated DBH using the 11-parameter combination.The estimated DBH ranges from 14 to 36 centimeters as opposed to the field data that ranges from 12 cm to 40 centimeters.

Figure 7 :
Figure 7: Estimated mean DBH and measured mean DBH per grid

Table 1 :
(Kankare, 2015)al Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLI-B8, 2016 XXIII ISPRS Congress, 12-19 July 2016, Prague, Czech Republic DBH of a particular area.Multiple linear regression models are developed using the plot characteristics as predictor variables to get the DBH of other area of interest.Field-measured diameter distributions are linked to the areas of interest based on the similarity of point-cloud-derived metrics measurements of individual trees and their attributes directly from the ALS point cloud.(Kankare,2015) Potential Contributions of LiDAR Remote Sensing in Forestry Application (Dubayah, no date) Parametric techniques, using field measurements that are height-quantified as input, have been employed to predict the

Table 3 :
List of Parameters Removed from LiDAR Combinations

Table 4 :
below: Parameters Used in the Best DBH Estimation