FOREST COVER CLASSIFICATION USING GEOSPATIAL MULTIMODAL DATA

To address climate change, accurate and automated forest cover monitoring is crucial. In this study, we propose a Convolutional Neural Network (CNN) that mimics the manual techniques of professional interpreters. Using simultaneously acquired airborne images and LiDAR data, we attempt to reproduce the 3D knowledge of tree shape that interpreters potentially draw on. Geospatial features that support interpretation are also used as inputs to the CNN. Inspired by the interpreters' techniques, we propose a unified approach that integrates these datasets in a shallow layer of the CNN. We show that the proposed multi-modal CNN works robustly, achieving more than 80% user's accuracy. We also show that the 3D multi-modal approach is especially suited to deciduous trees thanks to its ability to capture 3D shapes.


INTRODUCTION
The Paris Agreement, adopted at COP21 in 2015, set out a global action plan to reduce greenhouse-gas emissions, which not only puts the world on track to avoid dangerous climate change but also accelerates the Carbon Disclosure Project (CDP). CDP requests companies and cities to disclose the status of their environmental actions against climate change. Under these circumstances, remote sensing, which enables us to observe the planetary surface, is expected to monitor forest owners' efforts such as sustainable forest management (e.g. organized logging, planting and thinning). To meet the purpose of carbon disclosure, the monitoring must also be frequent and low-cost. Since both requirements would be difficult to meet through manual work, an automated forest monitoring method is urgently needed.
Automated forest monitoring systems already exist. Global Forest Watch (World Resources Institute, 2014), a dynamic online forest monitoring and alert system, automatically produces annualized global tree cover change data based on Landsat satellite imagery. Global Forest/Nonforest Maps (Shimada et al., 2014) also show the forest cover with certain thresholds. Because they rely on medium-resolution images and a limited number of classes, neither of the two systems is suitable for monitoring forest management. A number of methods aimed at specific targets have been developed for different types of forests using various remote sensing data, but forest cover classification using high-resolution data remains challenging.
To tackle accurate forest cover classification, we propose a CNN approach inspired by professional interpreters. Professional interpreters produce official forest maps by interpreting forests from airborne or satellite images. Interpretation requires knowledge of forestry and, in some cases, geospatial features in addition to RGB images. Inspired by this, we employ geospatial features in the proposed CNN. Following the interpreters' techniques, in which they consider not only the surface of a forest but also recall its interior, we propose to feed 3D voxel data derived from LiDAR (Light Detection and Ranging) to the CNN. To combine the geospatial input data, we propose a multi-modal CNN in which the inputs are integrated in a shallow layer, that is, a layer closer to the input than to the output.

RELATED WORK
In the following, we review recent advances in remote sensing tasks with CNNs. Driven by powerful deep neural networks (Krizhevsky et al., 2012), remote sensing tasks, especially land cover classification, have started to make great progress. Using the UC Merced Land Use Dataset (UCM) introduced by (Yang et al., 2010), which provides 21 land cover classes with 100 images each, (Penatti et al., 2015) shows that their CNN obtained 99.5% class accuracy. (Nogueira et al., 2017) compares popular CNN algorithms and points out that features of fine-tuned networks tend to perform well on UCM. CNNs are thus reported to perform quite well in the remote sensing field.
Regarding forest cover classification, there is no public benchmark dataset yet. Researchers therefore explore algorithms on their own datasets. (Lu et al., 2017) proposes a spatial-temporal-spectral data fusion framework over publicly available low- to medium-resolution images, reaching around 80% classification accuracy on a seven-class task using a support vector machine (SVM). (Kussul et al., 2017) reports around 85% class accuracy over 11 land cover and crop type classes using Landsat-8 and Sentinel-1A images. These figures, lower than those reported on UCM, suggest that forest cover classification is more challenging than land cover classification because of the similarity among classes.
Deep learning architectures have also been developed for LiDAR datasets, since they avoid the feature engineering phase, common in traditional classification algorithms, in which discriminative features are designed by hand. Point cloud classification algorithms such as VoxNet (Maturana et al., 2015) and PointNet (Qi et al., 2017) have mainly been discussed for everyday scenes. Although point clouds from airborne LiDAR differ from everyday-scene point clouds in that they contain a very large number of objects and are not finely sampled, several algorithms improve the results. (Yang et al., 2017) utilizes CNNs to transform points into images and improves urban object classification on the ISPRS 3D semantic labeling contest (Niemeyer et al., 2014). (Boulch et al., 2017) utilizes 2D CNNs to classify urban point clouds from semantic3d.net (Hackel et al., 2016) and shows an efficient labelling algorithm. However, most of these algorithms target urban object classification rather than tree species classification.
As such, most previous studies feed the classifier with the images themselves or with standard indices such as NDVI (Normalized Difference Vegetation Index). However, LiDAR data can capture characteristics inside the forest, whereas remote sensing images depict only its surface. As (Görgens et al., 2016) shows, some studies utilize airborne LiDAR data to classify forest cover, yet images and LiDAR are not fed to the classifier simultaneously to extract information on both the surface and the interior of forests.

METHODOLOGY
The basic principle of our methodology follows the traditional interpreters' techniques. As mentioned by (Ng, 2012), the CNN itself is biologically inspired by the human brain: to interpret what we see, the brain's network of neurons extracts edges from pixels, primitive shapes from the edges, and object models from combinations of shapes. Accordingly, as (Russakovsky et al., 2015) shows, state-of-the-art CNN algorithms perform well on general images such as photographs taken at close range. Interpreting forests from remote sensing images, on the other hand, requires not only picturing the shape of objects from a bird's-eye view but also collecting geospatial information and capturing 3D shapes. A CNN fed with remote sensing images alone may thus be insufficient to reproduce the quality of professionals' forest cover interpretation. Based on this assumption, we analyse below how interpreters identify tree species and describe how the proposed CNN incorporates their techniques.

Knowledge to Interpret Forestry
Just as we recognize objects in everyday sight, forest interpreters use image characteristics such as shape and colour as keys to classify forests. At the same time, they recall how a forest should appear from above and identify the tree species based on their knowledge of forestry. For instance, the difference between Hinoki (Chamaecyparis obtusa) and Sugi (Cryptomeria japonica) in remote sensing images is not always obvious, depending on the season or location. However, interpreters can differentiate them relatively easily based on their knowledge. They infer that Hinoki and Sugi are likely to have differently shaped crowns given that their tree shapes differ, as shown in Figure 1. On top of that, although interpreters have no information about the inside of the forest, they know that Sugi avoids touching other trees while Hinoki grows intermingled with others, so that Sugi crowns tend to appear distinct and Hinoki crowns vague. Thus, interpreters use their knowledge of forestry to compensate for the lack of information (i.e. information about the inside of the forest) when classifying the forest.
In an attempt to mimic the professional interpreters' strategy, we propose to feed LiDAR data as 3D information, alongside remote sensing images, to our CNN. LiDAR transmits a light pulse and records the time at which the pulse returns, which yields a 3D point cloud of the targets. Since typical convolutional neural networks require regularly shaped input data such as a collection of images, deep learning architectures operating on 3D voxels have been explored (Maturana et al., 2015). Although (Qi et al., 2017) shows that raw point clouds perform better than voxel-based architectures on classification and segmentation by selecting informative points through the network, that work assumes point clouds taken at close range, such as CAD models, Kinect data, and structure from motion with close-range photography. The point density of airborne LiDAR over forest areas, on the other hand, is generally around 4 points/m² in Japan, so the shapes derived from the point clouds are not necessarily clear. We thus use a basic voxel format as input.
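As an illustration of the basic voxel format, the following minimal sketch converts a point cloud into a binary occupancy grid. It is not the implementation used in this study: the `VoxelGrid` structure, the function name, the vertical cell size, and the toy points are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical grid specification; origin, cell size and shape are illustrative only.
VoxelGrid = namedtuple("VoxelGrid", "origin cell_size shape")


def voxelize(points, grid):
    """Convert (x, y, z) points into a binary occupancy grid indexed [z][y][x]."""
    nx, ny, nz = grid.shape
    ox, oy, oz = grid.origin
    occupancy = [[[0] * nx for _ in range(ny)] for _ in range(nz)]
    for x, y, z in points:
        # Integer cell index along each axis.
        ix = int((x - ox) // grid.cell_size)
        iy = int((y - oy) // grid.cell_size)
        iz = int((z - oz) // grid.cell_size)
        # Points that fall outside the grid are ignored.
        if 0 <= ix < nx and 0 <= iy < ny and 0 <= iz < nz:
            occupancy[iz][iy][ix] = 1
    return occupancy


# Toy points in metres, made up for illustration; the last one lies outside the grid.
toy_points = [(0.1, 0.1, 0.05), (0.1, 0.1, 0.45), (0.5, 0.3, 0.25), (9.0, 9.0, 9.0)]
toy_grid = VoxelGrid(origin=(0.0, 0.0, 0.0), cell_size=0.2, shape=(3, 3, 3))
toy_voxels = voxelize(toy_points, toy_grid)
occupied_count = sum(v for layer in toy_voxels for row in layer for v in row)
```

A binary occupancy value is the simplest choice; per-voxel point counts or return intensities would be alternatives, at the cost of an extra normalization step.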

Geospatial Features
Since it is not practical for interpreters to check raw LiDAR data while examining images, geospatial features extracted from LiDAR have been developed to make LiDAR data easier to use. As interpreters identify tree species by relating what they observe in the geospatial features to their knowledge, we feed the geospatial features listed below to our CNN.
Topographic openness: topographic openness (Yokoyama et al., 1999) is normally calculated from a digital terrain model (DTM); its value indicates the dominance or enclosure of a location, and underground openness similarly indicates how far the space below the surface extends. We compute both from a digital surface model (DSM), which emphasizes the shape of tree crowns and helps differentiate Sugi from Hinoki, as shown in Figure 2.
Slope: the slope of the ground helps classify trees that prefer steep slopes. For instance, Sugi tends to grow in valleys, whereas Hinoki mainly grows on ridges.
Aspect: direction of a downhill slope can imply how forests grow.
Tree height: tree height can imply the age of trees, and the appearance of trees varies according to the age.
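To make the slope, aspect, and tree height features concrete, the sketch below derives them from small elevation grids. It rests on assumptions not stated in the text: slope and aspect are taken from central differences on the DTM, aspect is measured clockwise from north with row 0 as the northern edge, and tree height is taken as DSM minus DTM. The function names and toy grids are hypothetical.

```python
import math


def slope_aspect(dtm, row, col, cell_size):
    """Slope (degrees) and downhill aspect (degrees clockwise from north) at an interior cell.

    Assumes row 0 is the northern edge and columns increase eastward.
    """
    # Central differences: elevation change per metre toward east and toward north.
    dz_east = (dtm[row][col + 1] - dtm[row][col - 1]) / (2 * cell_size)
    dz_north = (dtm[row - 1][col] - dtm[row + 1][col]) / (2 * cell_size)
    slope = math.degrees(math.atan(math.hypot(dz_east, dz_north)))
    if dz_east == 0 and dz_north == 0:
        return slope, None  # flat cell: aspect is undefined
    # The downhill direction is opposite to the gradient.
    aspect = math.degrees(math.atan2(-dz_east, -dz_north)) % 360
    return slope, aspect


def tree_height(dsm, dtm):
    """Canopy height as surface elevation minus ground elevation, clipped at zero."""
    return [[max(s - g, 0.0) for s, g in zip(dsm_row, dtm_row)]
            for dsm_row, dtm_row in zip(dsm, dtm)]


# Toy terrain rising 1 m per 1 m cell toward the east, made up for illustration.
toy_dtm = [[0.0, 1.0, 2.0],
           [0.0, 1.0, 2.0],
           [0.0, 1.0, 2.0]]
toy_slope, toy_aspect = slope_aspect(toy_dtm, 1, 1, cell_size=1.0)
toy_height = tree_height([[12.0, 3.0]], [[2.0, 3.5]])
```

For this toy terrain the centre cell has a 45 degree slope facing west (270 degrees), and the clipping in `tree_height` keeps small DSM/DTM inconsistencies from producing negative heights.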

Multimodal Learning for Geospatial Datasets
The analysis above indicates that interpreters empirically make use of features extracted from images, 3D knowledge of the forest (for which we regard our LiDAR-based 3D reconstruction as an equivalent), and geospatial features. Accordingly, our classification algorithm, which mimics the interpreters' strategy, needs to handle different types of data (i.e. airborne images and LiDAR). A CNN can handle different types of input data through what is called multi-modal learning. Just as human beings process information from the five senses and unify it to understand their circumstances, multi-modal learning handles different modalities for a given task, such as video scene understanding using visual and audio signals. The main concept of multi-modal learning is to extract abstract, common information from each modality in a shared representation domain and to unify it for processing (Li et al., 2016). When handling geospatial datasets, on the other hand, interpreters associate the data by overlaying them in a GIS, since all datasets share the same location. We therefore fuse the data in a shallow layer of the CNN architecture, which is close to the input and where spatial information has not yet been lost.
Figure 3 illustrates the base architecture of our CNN, where we feed a patch of images around a target LiDAR point and obtain the class predicted by the CNN. The main architecture is inspired by AlexNet (Krizhevsky et al., 2012). AlexNet, a basic deep learning structure with a small number of layers, suits this case because our input data size is small, which makes it difficult to apply deeper networks such as ResNet (He et al., 2015). For voxels, a 3D-CNN is additionally applied (i.e. 3-dimensional convolution over height, width and band) to extract features along the z-axis, which also reduces the dimension along the z-axis.
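One simple way to realize fusion close to the input is to cut a patch around the target point from every co-registered layer and stack the patches channel-wise before the first convolution. The sketch below shows only this stacking step, under the assumption that all layers have already been resampled to the same grid; the function names and toy rasters are hypothetical, and the exact fusion point of the architecture in Figure 3 may differ.

```python
def extract_patch(raster, row, col, half):
    """Return the (2*half+1)-square window of a 2D raster centred on (row, col).

    Assumes the window lies fully inside the raster.
    """
    return [r[col - half:col + half + 1] for r in raster[row - half:row + half + 1]]


def fuse_early(layers, row, col, half):
    """Stack patches from co-registered rasters channel-wise, in the given order."""
    return [extract_patch(raster, row, col, half) for raster in layers]


# Toy 5 x 5 rasters, made up for illustration; real layers would be the RGB bands
# and the geospatial features, all on the same grid.
size = 5
red = [[r * size + c for c in range(size)] for r in range(size)]
slope = [[0.5] * size for _ in range(size)]
height = [[float(r) for _ in range(size)] for r in range(size)]

fused = fuse_early([red, slope, height], row=2, col=2, half=1)
```

Because every channel of `fused` covers the same ground window, a convolution applied to it sees all modalities at each pixel, which mirrors the GIS overlay that interpreters rely on.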

EXPERIMENTS
We evaluate our proposed method in this section on a dataset of Japanese planted forest from two perspectives: (1) the contribution of each modality (i.e. airborne images, LiDAR voxels, and geospatial features) to forest cover classification, and (2) the effect of 3D information (i.e. 3D-CNN over voxel grids) derived from LiDAR.

Dataset
Since there are no publicly available datasets for forest cover classification with high-resolution images, we first create a labelled dataset.

Data Acquisition
The dataset was acquired over the forest in the Tenryu area, a traditionally famous planted forest located in Shizuoka Prefecture in central Japan, as outlined by the black lines in Figure 4. As in a typical Japanese planted forest, there is a cycle of planting, growing, thinning, and logging, so monitoring is crucial to evaluate forest management. Given an area of around 2.5 km², we adopt airborne measurements to acquire RGB images and LiDAR data as listed in Table 1.

Data Transformation
As CNNs require highly regular input data formats (e.g. images with 3 bands whose pixels correspond across the bands), the dataset acquired above needs to be transformed while keeping its geospatial information. The geospatial features are therefore resampled to a 20 cm ground resolution, which matches that of the imagery, the highest-resolution data.
Voxel grids are also transformed to 20 cm × 20 cm grids. The values of all features are normalized to the range from 0 to 1.
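The normalization to the range from 0 to 1 can be sketched as a linear rescaling per feature layer. The text does not specify the scheme, so min-max scaling over a single layer is an assumption here, as are the function name and the toy values.

```python
def min_max_normalize(raster):
    """Rescale a 2D raster linearly so that its values span [0, 1]."""
    flat = [v for row in raster for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Constant layer: map everything to 0 to avoid division by zero.
        return [[0.0 for _ in row] for row in raster]
    return [[(v - lo) / (hi - lo) for v in row] for row in raster]


# Toy slope values in degrees, made up for illustration.
toy_slope_layer = [[10.0, 20.0],
                   [30.0, 50.0]]
toy_normalized = min_max_normalize(toy_slope_layer)
```

In practice the minimum and maximum would more likely be fixed over the whole training area, so that patches from different locations share the same scale.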

Implementation Detail
Our implementation is based on the public platform Chainer (Tokui et al., 2015). We use Adam optimization with a base learning rate of 0.001 as the basic method; it adaptively adjusts the learning rate and is known to converge relatively fast. The number of epochs is set to 200 at maximum, and we check convergence on the test data in every experiment. The minibatch size is 126.
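For readers unfamiliar with Adam, the sketch below writes out its update rule for a single scalar parameter. Only the learning rate of 0.001 comes from the text; the remaining hyperparameters are the commonly used defaults and are assumed here, and the toy objective exists only to exercise the update. The experiments themselves rely on the optimizer provided by Chainer, not on this code.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy objective f(x) = (x - 3)^2, chosen only to exercise the update.
x, m, v = 0.0, 0.0, 0.0
trajectory = []
for t in range(1, 101):
    grad = 2 * (x - 3)
    x, m, v = adam_step(x, grad, m, v, t)
    trajectory.append(x)
```

Dividing by the running root-mean-square of the gradient makes each step roughly `lr` in size regardless of the gradient scale, which is the sense in which the learning rate is adapted.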

Results
To evaluate our methodology, we conduct experiments with several settings as listed in Table 2, covering single-modal and multi-modal inputs, and 2D and 3D convolution for the multi-modal method.
For evaluation, class accuracy is used.
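Per-class accuracy is usually read from a confusion matrix. The sketch below computes user's accuracy (the share of samples mapped as a class that truly belong to it) and producer's accuracy (the share of reference samples of a class that are mapped correctly). The function names and the toy labels are made up for illustration and are not results from this study.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """counts[t][p] = number of samples with reference class t mapped as class p."""
    counts = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    return counts


def users_accuracy(counts, cls):
    """Fraction of samples predicted as cls whose reference class is cls."""
    predicted_total = sum(counts[t][cls] for t in counts)
    return counts[cls][cls] / predicted_total if predicted_total else float("nan")


def producers_accuracy(counts, cls):
    """Fraction of reference samples of cls that were predicted as cls."""
    reference_total = sum(counts[cls].values())
    return counts[cls][cls] / reference_total if reference_total else float("nan")


# Toy labels, made up for illustration only.
classes = ["Sugi", "Hinoki", "Deciduous"]
toy_true = ["Sugi", "Sugi", "Sugi", "Hinoki", "Hinoki", "Deciduous", "Deciduous", "Deciduous"]
toy_pred = ["Sugi", "Sugi", "Hinoki", "Hinoki", "Hinoki", "Deciduous", "Sugi", "Deciduous"]
toy_counts = confusion_matrix(toy_true, toy_pred, classes)
toy_users = {c: users_accuracy(toy_counts, c) for c in classes}
toy_producers = {c: producers_accuracy(toy_counts, c) for c in classes}
```

The two measures can differ for the same class: in the toy data every Hinoki reference sample is found (producer's accuracy 1.0), yet one of the three samples mapped as Hinoki is actually Sugi (user's accuracy 2/3).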
As for the contribution of each modality, the results with the single-modal CNN show that images contribute strongly to Sugi and Hinoki classification. Geospatial features work well for deciduous trees, which indicates that images alone cannot identify bare deciduous trees. As for the multi-modal CNN, the performance on Sugi improves compared with each single modality, and the accuracy on Hinoki remains as good as that obtained by the single-modal CNN with images. The multi-modal CNN works robustly because the modalities, each carrying a different kind of information, can potentially compensate for one another. As for the 3D multi-modal CNN, the results show slightly decreased accuracy for Sugi and Hinoki, which implies that 3D convolution over potentially different resolutions might weaken the performance. On the other hand, the performance on deciduous trees improves, which indicates that the 3D multi-modal CNN can learn the shape of branches captured in the LiDAR data.

CONCLUSION AND FUTURE WORK
In this study, we proposed a CNN that mimics professional interpreters' manual techniques. Using simultaneously acquired airborne images and LiDAR data, we fed the 3D knowledge of tree shape (i.e. voxels) and geospatial features as well as RGB images to the proposed CNN. Inspired by the interpreters' techniques, our network provides a unified approach that integrates these datasets in a shallow layer. The results show that the multi-modal CNN works robustly and that the 3D multi-modal approach is especially suited to deciduous trees.
The results of this study suggest that the 3D multi-modal learning over voxels is a promising approach for forest cover classification tasks, especially those involving a forest with a complex 3D structure.
As future work, we plan to improve the performance and robustness of the 3D multi-modal CNN by means of (1) optimizing the weights used to integrate the modalities and (2) ensemble learning to combine the effective models appropriately. Aside from that, we intend to use different cropping sizes and resolutions as input data to investigate the effect of the amount of information. To verify the 3D feature extraction ability of the proposed method, it is also worth considering measures of 3D shape complexity, such as the TIN surface area derived from point clouds. Finally, experiments on a different area are essential to show the robustness of the proposed approach.

Figure 1. Knowledge of forestry for interpretation

Figure 3. Base architecture of multi-modal CNN

Figure 4. Acquired dataset for the experiments