WATER MAPPING USING MULTISPECTRAL AIRBORNE LIDAR DATA

This study investigates the use of the world’s first multispectral airborne LiDAR sensor, Optech Titan, manufactured by Teledyne Optech to serve the purpose of automatic land-water classification with a particular focus on near shore region and river environment. Although there exist recent studies utilizing airborne LiDAR data for shoreline detection and water surface mapping, the majority of them only perform experimental testing on clipped data subset or rely on data fusion with aerial/satellite image. In addition, most of the existing approaches require manual intervention or existing tidal/datum data for sample collection of training data. To tackle the drawbacks of previous approaches, we propose and develop an automatic data processing workflow for land-water classification using multispectral airborne LiDAR data. Depending on the nature of the study scene, two methods are proposed for automatic training data selection. The first method utilizes the elevation/intensity histogram fitted with Gaussian mixture model (GMM) to preliminarily split the land and water bodies. The second method mainly relies on the use of a newly developed scan line elevation intensity ratio (SLIER) to estimate the water surface data points. Regardless of the training methods being used, feature spaces can be constructed using the multispectral LiDAR intensity, elevation and other features derived from these parameters. The comprehensive workflow was tested with two datasets collected for different near shore region and river environment, where the overall accuracy yielded better than 96%.


INTRODUCTION
Water has critical implications for ecosystem function and biosphere atmospheric interactions, and is one of the treasurable resources of which to support a variety of human and economic activities (Gleick, 1996).Since the amount of water appearance is subject to climatic variation, tidal influence and seasonal phenology, precise delineation of water region and its associated boundary pose considerable challenges, particularly for large spatial extent.With the emergence of satellite remote sensing, mapping and monitoring different types of water bodies, such as inland channel, delta, coastal zone and lake, turns to be practically feasible through such a top-down image capturing approach.Optical satellite remote sensing has been demonstrated as an efficient solution to delineate open water region (McFeeters, 1996), monitor evapotranspiration of water resources (Anderson et al., 2012), estimate water quality (Brando and Dekker, 2003), assess water erosion (Vrieling, 2006), monitor lake shrinkage / expansion (Jepsen et al., 2013) through reaping the benefits of the multispectral image dataset and its extensive coverage.
The widely adopted technique to detect water bodies within the remote sensing image scene mainly utilizes the normalized difference water index (NDWI) derived from the green and near infrared image bands (Gao, 1996) or the near infrared and shortwave infrared image bands (McFeeters, 1996).Such an image ratioing approach can aid in enhancing the water regions being isolated from the surrounding vegetation and land features.Despite the successful use of optical remote sensing image for water region mapping, there still exist certain limitations and drawbacks that require further research effort.First, optical satellite remote sensing can only provide a two dimensional image scene, unless stereoscopic data acquisition is adopted (Shaker et al., 2010).Even if stereo satellite imagery is available, it is mostly impossible to unveil the depth of water columns and further derive information such as lake volume, channel cross section, seabed properties, etc.Second, though optical remote sensing can offer a wide coverage of spatial extent, the need of having a fine scale data product, such as shoreline, river boundary, etc. has been sought (Malinowski et al., 2017).Lastly, the drawback of shadowing effect and relief displacement in optical imagery cause serious analytical burden (Yan et al., 2015), particularly when dealing with those water bodies partially covered by forest canopies or situated nearby elevated features.
To tackle these limitations, the research community has explored the use of Light Detection and Ranging (LiDAR) techniques to cope with these issues during the last decade.An airborne LiDAR system is capable of emitting eye-safe, narrow-beam laser pulses from the aircraft and records the laser signal strength backscattered from the Earth's topography.Through computing the time of travel for each of the laser pulses, the instantaneous distance between the LiDAR system and the backscattered surface can be determined, which can subsequently help to derive the 3D coordinates of each backscattered laser pulse.As a result, airborne LiDAR can provide a dense high resolution 3D data point cloud that can capture both upper-and understory of tree canopies as well as the water surface and its bottom, depending on the laser wavelength being used.Applications of LiDAR toward waterrelated studies first come along with airborne LiDAR bathymetry systems, such as Optech's SHOALS, that mainly operates with green laser channel (532 nm) with a Nd:YAG near-infrared (NIR) laser channel (1064 nm) and a "Raman" receiver that is capable of collecting the wavelength of 640 nm arising from the 532 nm green beam (Guenther et al., 1996;Irish and Lillycrop, 1999;Wang and Philpot, 2007).Depending on the water depth and the instantaneous water condition, the bathymetric LiDAR system is capable of providing multiple returns along the water column, i.e. from water surface to sea bottom.By measuring the returns from these two water layers, the water depth can be estimated after implementing certain data correction mechanism to readjust the bending effect of laser beams and retrieve the sea bottom reflectance (Wang and Philpot, 2007).
In 2014, Teledyne Optech launched the world's first multispectral airborne LiDAR system, named Optech Titan, that was equipped with three laser channels operating at 532 nm, 1064 nm and 1550 nm wavelength.Such a breakthrough in LiDAR system development opens new door in many remote sensing applications, improves the derived product accuracy and potentially enhances the automation of data processing (Fernandez-Diaz et al., 2016).In this study, we explore the use of such latest multispectral airborne LiDAR technology for water region mapping.Although there exist a number of recent attempts exploring the use of monochromatic airborne LiDAR system to delineate land and water regions Brzank et al. (2008); Yuan and Sarma (2011); Yousef et al. (2014), the majority of these existing studies have certain constraints and limitations in terms of data requirement or assumptions towards the water environment.Therefore, it is desired to develop a robust method that can handle most of the land-water environments with minimum manual intervention.The rest of the paper is organized as follows.In section 2, we present the overall workflow of the automatic land-water classification using multispectral LiDAR data.In section 3, we present two testing datasets collected by the Optech Titan and the results are presented in section 4. Finally, the concluding remarks together with the further work are drawn in section 5.

METHODOLOGY
Fig. 1 shows the overall workflow of the proposed automatic land-water classification algorithm.The classification process first starts with reading the n channels of LiDAR data files, usually in las or ASCII format, before implementation.Those data fields required to be read includes x, y, z, intensity, number of returns, total number of returns, scanning direction and edge of flight line.All these parameters can be stored in a data array so as to serve the subsequent classification purpose.A core channel should be identified so that the corresponding values from the rest of the two other channels can be assigned accordingly.Once the data import is accomplished, user can select either one of the two data training approaches (Gaussian mixture model (GMM) or Scan line intensity elevation ratio (SLIER) approach) for training site selection.Once the training areas for land and water are identified, feature vectors such as normalized difference feature indices (NDFIs), elevation variation, intensity variation, can be computed using the three data channels so that they can be used to serve the log-likelihood computation.As a result, each LiDAR data point can be assigned with a posterior probability, where the data point can be labelled as either land or water by taking the maximum probability.Post-classification enhancement can be implemented to help improving the classification result.

GMM Approach
With the LiDAR data being imported, the first step is to plot the histogram for the data of the core channel.This method works well if the study area has a significant difference of elevation/intensity between the land and water features.Depending on the study area and the dataset, either the "intensity" or "z-coordinate" field is used to generate the intensity or elevation histogram.The use of last return works well in certain dataset; however, all the data points are recommended to be used to generate the histogram in general.GMM can be run to perform fitting for the generated histogram.The number of Gaussian components being fitted can be determined using the model selection methods such as Bayesian information criterion (BIC), Akaike information criterion (AIC), or manual input.Finally, the first two Gaussian components, which should in most cases represent the water surface and the land surface, can be separated by computing the intersection of the two adjacent Gaussian components.After the intersection point between two adjacent Gaussian components being located (either for an intensity value or an elevation value), all the LiDAR data points in the core channel are split based on this value.For those data points with value lower than this identified threshold, they are preliminarily identified as water bodies; else, they are identified as ground surface.

SLIER Approach
The second method makes use of the characteristics of water bodies (especially in coastal area with water turbulence) which has high variance of intensity values in the LiDAR data channel along each scan line, comparing to those of the land features.The process first reads the LiDAR data file scan line by scan line (sl).Based on the use of the field "scan direction" and/or "edge of flight line", all the data points in each scan line are stored.Although the "edge flight line" supposes to provide a hint for the turning direction of scanning, the proposed method mainly relies on reading the scanning direction in order to look for a complete scan line.Subsequently, standard deviation is computed individually for the elevation value and intensity value, named σE, σI respectively, along each scan line.By dividing the σE over σI , a new index, named it as scan line elevation-intensity ratio (SLIER), is being developed.
where θ is the scan angle, N l is the maximum number of points along all the scan lines and n l is the number of points in the current scan line.With SLIER being computed along each scan line, it is obvious that water bodies always have very high SLIER value comparing to the land features, since water bodies usually have very high variance of intensity value with a relatively flat surface.In this regard, there exist a high separability between land and water features in terms of the SLIER value.Since the water bodies usually have high variance of intensity value, which may degrade the use of intensity value for classification.Therefore, right before using the SLIER to preliminarily split the land and water features, data correction should be implemented to remove those high peaks of intensity values found on the water bodies due to the appearance of turbulence.
These high peak values, or deemed as outliers in the data, can be identified by computing the Mahalanobis distance through using the intensity value and the SLIER value.The equation of the Mahalanobis distance can be found in the below equation: where xj is a 2 by 1 matrix storing the intensity and SLIER value of LiDAR data point j, µ is a 2 by 1 matrix having the mean value of intensity and SLIER computed for all the LiDAR data points, where Vj is a 2 by 2 covariance matrix computed using the intensity and SLIER values.With two variables having a confidence level of 95%, the chi-square value is 5.99, and thus the square root of it is 2.45.With any Mahalanobis distance less than 2.45, they are treated as outlier (high intensity peak value on water bodies), and they can be assigned with either an extreme low value (say 1) or remove it from the subsequent calculation.
After removing those high peak intensity values found on the water bodies, user can either manually define a threshold percentage in SLIER (for instance top 20% of SLIER), or similar to Method 1, use GMM fitted (SLIER) histogram to look for an optimal threshold to preliminarily separate the land and water features.With all the potential water data points being identified using SLIER, the standard deviation (σ E ) and mean (µ E ) elevation values of the potential water data points are being computed.Then a preliminary classification can be conducted using the elevation value, regardless of data training method 1 or 2. For all data points d in L, (3)

Land-Water Classification
After all the LiDAR data points are preliminarily split into either land or water class, the process carries on by preparing and creating feature spaces.A number of features are generated, depending on the study area, as an input to serve the subsequent supervised classification.Those features include elevation, elevation variation, intensity, intensity variation, number of returns and NDFIs.For those features, including elevation, intensity and number of returns, they are already embedded in the data array through reading the las files during the data import stage.The rest of the parameters, including elevation variation, intensity variation and NDFIs, requires further computational process.Since the computation of elevation variation and intensity variation of each data point first requires obtaining the (elevation or intensity) information from the surrounding data points, thus, a kd-tree data structure is first built for the core channel in order to speed up the data searching.The searching radius for building the kd-tree is set to be twice of the mean point spacing.Once the kd-tree is built, the intensity variation and elevation variation can be computed by searching the corresponding intensity or elevation values with respect to the surrounding K points.The equation for computing the elevation variation can be found as follow:- where e refers to the elevation, σ 2 e and µe refers to the elevation variation and mean elevation value, respectively, with respect to the surrounding K points.Similarly, the intensity variation can be computed using the above equation by replacing the elevation value by the intensity value I, as follow: Regarding the NDFIs, the intensity values of the rest of the two channels, aside from the core channel, should be used.Similar to the abovementioned process, kd-tree should be first constructed for the rest of the n-1 channels with the recommended searching radius.Then, the intensity value of the nearest data point from n-1 channel can be assigned to each of the data point the core channel.For instance, if the multispectral airborne LiDAR data has three channels (channel 1 = 1550 nm, channel 2 = 1064 nm and channel 3 = 532 nm) and channel 2 is selected as the core channel, the three NDFIs can be computed as follows: As a result of the abovementioned process, each LiDAR data point in the core channel has the following feature:-elevation, elevation variation, intensity, intensity variation, number of returns, and three NDFIs, where these feature sets are normalized based on Z-score prior to classification.User can choose all of the features or selected features for the supervised classification.
After user selects their own features being used for the subsequent classification, a maximum likelihood classification is implemented to estimate the posterior probability of the land and water data points as preliminarily separated in the abovementioned steps.The computation of the posterior probability can be found as follow:- where P (X|ωi) refers to the posterior probability of data point X (arranged in an array based on the selected features) belonging to the class ωi, and Vi and Mi are the covariance matrix and the mean vector of class ωi, respectively, computed using the features selected as aforementioned.The posterior probability should be computed for the two classes, i.e. water (ωw) and land (ω l ).Assign data point X to land if P (X|ω l ) > P (X|ωw); else, assign data point X to water if P (X|ωw) > P (X|ω l ).
An enhanced version of the classification can be implemented using Bootstrap aggregating (Bagging) by splitting the training data (both land and water class) into n portions evenly.For instance, if there exist 3 million points for the LiDAR data, where the land and water are preliminarily split into 2 million points and 1 million points based on the GMM fitting.In this case, bagging can be implemented by dividing the 2 million points into n portions for the land with each 200K points (if n = 10) and 1 million points into 100K points for the water class.These individual portion of land and water class are trained separately using the maximum likelihood.As a result, each data point is given with ten classification results, and a majority voting mechanism is used to label the final class for each data point within these ten classification results.Regardless of using a single classifier or a bagging classifier, each data point must be labelled with either land or water.
To further improve the classification accuracy, two procedures are proposed in the final stage in order to fix or erase those misclassified points.

Post-classification enhancement
Two post-classification enhancement procedures are presented, where user can either select running both procedures or either one of them.Regardless of the post-classification enhancement procedures being used, a 3D majority filter is first implemented on the classification result.A kd-tree data structure is first built on the LiDAR data points in order to speed up the neighborhood searching.In each of the classified data points, if the majority of the neighborhood are classified as a specific class where the current data point is classified as an opposite class, such data point should be labelled in a reverse way.
The second post-classification enhancement mechanism is based on the use of the elevation features.With the classification results of land and water classes, the mean (µE) and standard deviation (σE) of the elevation are computed individually for the two classes.Then, we define two thresholds based on these four values using the following equations:- With any data points being classified as land and its elevation values is lower than Tw, then such point should be labelled as water.
With any data points being classified as water and its elevation value is higher than T l , then such point should be labelled as water.The number of value a is recommended to be set from 0.5 to 2. Regardless of the enhancement method, user can have an option to select either running the post-classification enhancement procedure or not, depending on the classification result found.

EXPERIMENTAL WORK
The aforementioned data processing workflow was tested with two LiDAR datasets collected by Optech Titan covering different land-water scenarios, including shore and inland river environment during 2015 to 2016.The first dataset was collected nearby Bowmanville, Ontario, Canada which is close to Lake Ontario, where the second dataset was collected for the Rouge river located south west of Pickering (see Fig. 2).Both of the datasets were collected with low flying altitude (< 500 m), high pulse repetition frequency (100 kHz to 225 kHz) and scan frequency (35 to 50 Hz), resulting in a high mean point density for the two datasets (> 10 points /m 2 ).One should note that all these datasets are acquired by two different versions of the Optech Titan sensor, where only two LiDAR data channels (channel 2 and 3) are available for the first dataset (Bowmanville) and three data channels are available for the Rouge river (Pickering) dataset.
As shown in Fig. 2, the first dataset located nearby Bowmanville was collected parallel to the shore.Although channel 1 (1550 nm) is unavailable in this dataset, the intensity value of both channel 2 (1064 nm) and channel 3 (532 nm) ranges from 0 to 4096 covering different types of land features including built-up features, tree canopies, grass cover and road.The elevation varies from 30 m to 82 m in channel 2, while the lowest elevation of channel 3 yields down to 28 m due to the laser penetration to the seabed.
The number of points stored in channel 2 and 3 are 12.6 million and 21.3 million, respectively, with an extent approximately covering 0.2 km by 6 km.
The second dataset was collected along the Rouge river situated nearby the Pickering region.Similar to the first dataset, all the three LiDAR data channels include a 12-bit intensity data.Nevertheless, channel 1 (1550 nm) and channel 2 (1064 nm) do not have sufficient data returns along the river, leaving void data hole in these two channels.The elevation varies from 29 m to 81 m in channel 3 covering the tree canopies, water surface and the seabed.The number of points stored in channel 1 to 3 are 8.8 million, 10.4 million and 10.8 million, respectively.The spatial extent of the dataset is approximately 0.27 km by 2.1 km.
Both LiDAR datasets were processed with the aforementioned workflow.The first dataset was processed with the SLIER method due to its complete coverage on part of the Lake region, while the second dataset was processed with the GMM method based on the use of intensity histogram generated from channel 3 due to its partial coverage on the Rouge river.Reference aerial photos were used to digitized the boundary of land and water in order to provide a ground truth to perform accuracy assessment.The experimental work conducted in both case studies compare the use of multispectral LiDAR intensity data in order to demonstrate the capability of adding additional laser channels to aid in improving the classification accuracy.

RESULTS AND DISCUSSION
In the first dataset (Bowmanville), the result derived from channel 3 as core channel (i.e.assigning the intensity channel 2's value to channel 3) yielded to an overall accuracy over 97.1%, where the feature sets include the intensity values from the two channels, NDFI, elevation, intensity variation, elevation variation and number of returns.The precision and recall of both land are respectively ranging from 95.7% and 96.4%, while the respective values are 97.9% and 97.4% for the water bodies.As a benchmark for comparison, we also implemented the classification using the information acquired from purely channel 2 or 3 (i.e., intensity, elevation, intensity variation, elevation variation and number of returns), the overall accuracy was 96.6% and 96.9%.One should note that there exists a significant elevation difference between the land and water region in this study area.As a result, the use of monochromatic LiDAR data to serve the land-water classification can yield a high accuracy, though combining the dual channels can slightly improve the result (by 0.2% to 0.5%).
In the second dataset (Rouge river), we selected channel 3 as core channel to perform the classification due to insufficient data point falling along the Rouge river in channel 1 and 2. In this experimental trial, we compared the results generated from using 1) all return and last return, 2) other two channels' intensity information, and 3) post-classification enhancement mechanism as suggested in Eqs. 10 and 11.If all the LiDAR returns were used, the classification accuracy yielded 59.3% with all the feature sets being used.An improvement of 14.1% can be achieved when the post-classification enhancement mechanism was implemented.However, if only the last returns were used, the overall accuracy yielded 93.9% when the post-classification enhancement mechanism was off and it improved up to 96.5% when the post-classification enhancement was implemented.If only the channel 3's intensity and elevation were used for classification, the overall accuracy was 92.9% and 96%, respectively, with or without running the post-classification enhancement.This case study demonstrates that the use of last return can provide good separability.Also, adding the multispectral LiDAR intensity data can improve the result by 0.5% to 1%.In addition, the proposed post-classification enhancement mechanism can significantly enhance the classification accuracy regardless of the type of data points or feature sets being used.

CONCLUSIONS
A comprehensive land-water classification workflow is presented which makes best use of both radiometric and geometric LiDAR data features to support the classification capability.Two automatic data training mechanisms, i.e.GMM approach and SLIER approach, are proposed to deal with either river or shore environment.In addition, optional post-classification enhancement step can be implemented to improve the classification accuracy.In our experimental trials, both case studies yielded an overall accuracy better than 96%.We also examined the use of different feature sets derived from the multispectral LiDAR intensity and the post-classification enhancement mechanism, which can help to improve the classification accuracy.Depending on the study environment, optional settings, such as selection of all/last return or implementation of post-classification enhancement mechanism, are provided to the end-users to improve the land-water classification result.The development of multispectral LiDAR certainly enhances the data collection capability under different environment.In addition, the fruitful spectral information provided not only increases the classification accuracy but also the robustness of object extraction and surface classification.In near future, the proposed workflow will be tested with other challenging shore area, such as delta wetland, rocky shore and shore with land depression, etc., in order to further demonstrate the merit of using multispectral LiDAR data.

Figure 1 .
Figure 1.Overall workflow for automatic land-water classification.

Figure 3 .
Figure 3. First dataset -Bowmanville (from top to bottom): LiDAR elevation displayed of channel 2 and 3, LiDAR intensity of channel 2 and 3 and classification result (green = land, blue = water).

Figure 4 .
Figure 4. Second dataset -Rouge river (from top to bottom): LiDAR elevation displayed of channel 1 to 3, LiDAR intensity of channel 1 to 3 and classification result (green = land, blue = water).