THE EXTRACTION OF POST-EARTHQUAKE BUILDING DAMAGE INFORMATION BASED ON CONVOLUTIONAL NEURAL NETWORK

Seismic damage information on buildings extracted from remote sensing (RS) imagery supports relief work and helps reduce the losses caused by earthquakes. Both traditional pixel-based and object-oriented methods have shortcomings in extracting object information. Pixel-based methods cannot make full use of the contextual information of objects. Object-oriented methods suffer from imperfect image segmentation and from the difficulty of choosing a feature space. In this paper, a new strategy is proposed that combines a Convolutional Neural Network (CNN) with imagery segmentation to extract building damage information from remote sensing imagery. The method has two key steps: first, a CNN predicts the class probability of each pixel; then the probabilities are integrated within each segmentation spot. The method is tested by extracting collapsed and uncollapsed buildings from an aerial image acquired in Longtoushan Town after the Ms 6.5 earthquake in Ludian County, Yunnan Province. The results show that the proposed method is effective in extracting building damage information after an earthquake.


INTRODUCTION
China, surrounded by the world's two major seismic zones (the Eurasian seismic belt and the circum-Pacific seismic belt), suffers many serious earthquake disasters. With the development of remote sensing technology, the acquisition capacity and quality of remote sensing images have greatly improved, which provides favourable conditions for extracting seismic damage information of buildings from such images. Much previous work has been done on analysing remote sensing imagery. According to the basic unit of classification, methods for extracting earthquake damage information can be grouped into pixel-based and object-based methods. Pixel-based classification cannot make full use of the spectral, shape, texture and contextual information of the image, so the accuracy of building extraction and classification is relatively low. Baatz and Schäpe (1999) first proposed the object-oriented method to deal with high-resolution remote sensing images. Its key technology is multi-scale segmentation based on the principle of minimum heterogeneity (which produces the objects), and its classification system is based on fuzzy logic and fuzzy mathematics (which extracts the information) (Metzler, et al. 2007). With the object-oriented method, however, it is usually difficult to choose the features used for image classification.
Recently, deep learning has become the state-of-the-art solution for visual recognition (Nogueira, et al. 2017). Given its success, deep learning has been used intensively in several distinct tasks in different domains (Ian Goodfellow, et al. 2016; Bengio and Yoshua, 2009). In the remote sensing field, compared with traditional pixel-based or object-based classification methods, artificial neural network classification algorithms have a strong capacity for self-learning and fault tolerance (Haykin, 2008). Several CNN architectures have been proposed for the analysis of remote sensing imagery. Saito, et al. (2016) introduced a CNN-based framework to extract buildings and roads. Basaeed, et al. (2016) proposed a region segmentation technique for remote sensing images using a boosted committee of CNNs coupled with inter-band and intra-band fusion. Their method is a fusion framework consisting of a set of thirty boosted networks that derive individual probability maps of region boundary locations from the different multi-spectral bands and combine them into one using an averaging inter-band fusion scheme. Because thirty boosted networks are used, the time cost increases. Längkvist, et al. (2016) showed how a CNN can be applied to multispectral ortho imagery and a digital surface model (DSM) of a small city for full, fast and accurate per-pixel classification. In an earthquake disaster region, however, DSM data usually cannot be acquired in time. Saito and Aoki (2015) used a CNN to learn a mapping from raw pixel values in aerial imagery to three object labels (buildings, roads, and others). Their method applies a patch-based approach, and the boundaries of the labelled objects are very sharp.
To overcome the shortcomings of pixel-based and object-oriented methods, a new strategy that combines the object-oriented approach with a CNN is proposed to extract information on damaged buildings from remote sensing images.
This article first presents the method, the technical route and the basic architecture of the CNN. The experiment and results are then described, followed by the discussion and conclusion.

Basic Idea of the Methods
The basic idea of extracting building damage from RS imagery by combining a CNN with imagery segmentation is shown in Figure 1. The main workflow is divided into two steps. The first step has two parts: one is the segmentation of the remote sensing imagery; the other is to use a well-trained CNN to predict the probability that each pixel belongs to a certain category and then generate probability patches. The second step combines the segmentation spots with the probability map to determine the category of every segmentation spot. A multi-scale segmentation method is used to segment the remote sensing imagery into segmentation spots, which are the units over which the probability is integrated. During segmentation, three key parameters are set: the scale parameter, the image layer weights and the composition of the homogeneity criterion.
The CNN is one of the important methods in this research. A fixed-size window scans the entire remote sensing imagery with a stride s. At each location, it produces an image patch N of the same size as the window, which is the input of the CNN. The CNN produces a probability patch m of a fixed size. The patch m holds the probabilities of the pixels inside a region of that size at the centre of image patch N. All probability patches are then assembled to form the probability map of the entire remote sensing imagery. Several probability patches may give a probability for the same pixel of the imagery. In this situation, the maximum value among these probability patches is chosen as the probability value of that pixel. Finally, the values of all pixels within a segmentation spot are averaged to obtain the probability of that segmentation spot.
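The patch-wise prediction and maximum-value merging described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: `predict_patch` is a hypothetical stand-in for the trained CNN, and `window`, `out_size` and the function name are illustrative, since the text does not give the patch sizes. Pixels that no probability patch covers keep the value 0.

```python
import numpy as np

def assemble_probability_map(image, predict_patch, window, out_size, stride):
    """Slide a window over `image` and merge the predicted probability patches.

    image:         array of shape (H, W, bands)
    predict_patch: callable mapping a (window, window, bands) patch to an
                   (out_size, out_size, n_classes) probability patch
                   (hypothetical stand-in for the trained CNN)
    Overlapping predictions are merged with an element-wise maximum.
    """
    height, width = image.shape[:2]
    # The probability patch is assumed to sit at the centre of the window.
    offset = (window - out_size) // 2
    prob_map = None
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            patch = image[top:top + window, left:left + window]
            probs = predict_patch(patch)
            if prob_map is None:
                prob_map = np.zeros((height, width, probs.shape[-1]))
            r0, c0 = top + offset, left + offset
            target = prob_map[r0:r0 + out_size, c0:c0 + out_size]
            # Keep the larger value wherever probability patches overlap.
            np.maximum(target, probs, out=target)
    return prob_map
```

With a stride smaller than `out_size`, neighbouring probability patches overlap and the maximum rule decides the final value of each pixel.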
The goal of this article is to extract building damage information in three categories: collapsed building, un-collapsed building and background. The core operation is to use a well-trained CNN to predict a multi-channel label image from an input image patch N. The label image has one channel for each category to be extracted. The main work is to train the CNN with input image patches N and the corresponding ground truth maps, which label the category of every pixel. The well-trained CNN model can then be used to predict the category of the raw pixels.

Basic Architecture of CNN
The basic architecture of a CNN usually consists of stacked convolutional layers, fully connected layers and a prediction layer. A convolutional layer usually includes the operations of convolution, non-linear transformation and pooling.
The convolutional layer takes the input imagery or feature maps as its input. The convolution operation is explained as follows. Assume that an image or feature map of size $w \times h$ with $c$ channels and $K$ two-dimensional filter kernels of size $f \times f$ are taken as inputs of the convolutional layer. The output maps will have the size $(w - f + 1) \times (h - f + 1)$ with $K$ channels. Each channel of the output image is called a filter site (Alshehhi, et al. 2017). When the filter does not slide 1 pixel at a time, the stride $r$, which affects the size of the filter site, is required (Nogueira, et al. 2016). If $r > 1$, the size of an output map is decreased to $(\lfloor (w - f)/r \rfloor + 1) \times (\lfloor (h - f)/r \rfloor + 1)$. The convolution process is defined as follows:

$$u_{ij}^{(k)} = \sum_{l=1}^{c} \sum_{p=0}^{f-1} \sum_{q=0}^{f-1} x_{ri+p,\, rj+q}^{(l)} \, \omega_{pq}^{(k,l)} + b^{(k)} \qquad (1)$$

where $x_{ij}^{(l)}$ = pixel value at $(i, j)$ in the $l$-th channel of an input image or of a feature map
$u_{ij}^{(k)}$ = pixel value at $(i, j)$ of the $k$-th filter site
$\omega_{pq}^{(k,l)}$ = weight value at $(p, q)$ on the $k$-th filter for the $l$-th input channel
$b^{(k)}$ = a bias parameter of the $k$-th filter that is shared among all locations $(i, j)$

The second operation of the convolutional layer is the non-linear transformation, which is also called the activation function. Many activation functions can be used, such as sigmoids, hyperbolic tangents and rectified linear units (ReLUs). The rectified function is currently the most widely used because ReLUs are known to offer some practical advantages in the convergence of the training procedure. In this paper, we use the ReLU function (Nair and Hinton 2010), as follows:

$$a_{ij}^{(k)} = \max\left(0,\, u_{ij}^{(k)}\right) \qquad (2)$$

After the activation function, the next step is pooling, which takes the filter sites produced by the non-linear transformation and subsamples them by taking the maximum or average value in a pooling window. The pooling window slides with a stride $t$. In this article, max-pooling is used. Let $a_{ij}^{(k)}$ be an output of the previous activation operator. By applying max-pooling, the output $y_{ij}^{(k)}$ is expressed as follows:

$$y_{ij}^{(k)} = \max_{(p, q) \in P_{ij}} a_{pq}^{(k)} \qquad (3)$$

where $P_{ij}$ = the set of positions covered by the pooling window at output location $(i, j)$

When a suitable number of convolutional layers have been stacked, the next several layers are usually fully connected layers, which make comprehensive use of all the features of the image patch. However, fully connected layers greatly increase the number of parameters to be trained and the computational cost, so a dropout strategy is adopted. CNN architectures often have more than one fully connected layer; the CNN architecture in this article has two.
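The three operations of a convolutional layer described above (convolution with a stride, ReLU activation and max-pooling) can be illustrated with a short NumPy sketch. The array layout, the function names and the sizes in the usage lines are assumptions made for illustration only; they are not the configuration of the network used in this article.

```python
import numpy as np

def conv2d(x, weights, bias, stride=1):
    """Valid convolution of x (h, w, c) with weights (K, f, f, c) and bias (K,)."""
    h, w, _ = x.shape
    n_filters, f = weights.shape[0], weights.shape[1]
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    out = np.empty((out_h, out_w, n_filters))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + f, j * stride:j * stride + f, :]
            # Sum over the window and all input channels for every filter.
            out[i, j, :] = np.tensordot(weights, window, axes=3) + bias
    return out

def relu(u):
    """Rectified linear unit: negative values are set to zero."""
    return np.maximum(0.0, u)

def max_pool(a, size, stride):
    """Max-pooling over size x size windows moved with the given stride."""
    h, w, k = a.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w, k))
    for i in range(out_h):
        for j in range(out_w):
            window = a[i * stride:i * stride + size, j * stride:j * stride + size, :]
            out[i, j, :] = window.max(axis=(0, 1))
    return out

# Illustrative sizes only: a 64 x 64 patch with 3 bands and 8 filters of 5 x 5.
rng = np.random.default_rng(0)
patch = rng.standard_normal((64, 64, 3))
filters = rng.normal(0.0, 0.01, size=(8, 5, 5, 3))
features = max_pool(relu(conv2d(patch, filters, np.full(8, 0.1))), size=2, stride=2)
print(features.shape)  # (30, 30, 8)
```

The explicit loops keep the correspondence with the index notation visible; a practical implementation would rely on a deep learning framework instead.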
A fully connected layer is usually followed by a classifier layer. The operation of the classifier is described as follows. Let $x = (x_1, x_2, \ldots, x_n)$ denote the output of the last fully connected layer. The softmax function is applied for each output pixel $\tau$ to convert $x$ into a probability vector, and the vectors are reshaped into a probability patch with the values $m_{\tau,k}$ ($k = 1, 2, \cdots, C$), where $C$ is the number of categories. The function is expressed as follows:

$$m_{\tau,k} = \frac{\exp\left(\omega_{\tau,k}^{\top} x\right)}{\sum_{j=1}^{C} \exp\left(\omega_{\tau,j}^{\top} x\right)} \qquad (4)$$

where $\omega_{\tau,k}$ = the weight vector that connects to the $k$-th probability value of the $\tau$-th pixel output unit

In this article, we adopt the CNN architecture shown in Figure 2. The CNN is trained by minimizing the negative log likelihood using mini-batch stochastic gradient descent with momentum (Bengio and Yoshua, 2012).

Figure 2. The CNN architecture (Saito and Aoki 2015)

After training, the network is applied to the test imagery to obtain the probability map. The map, with $C$ channels, expresses the probability that each pixel belongs to a certain category. If the label map is produced from the probability map by thresholding, a lot of salt-and-pepper noise appears, as with the pixel-based method. This phenomenon can be avoided by combining the probability map with the segmentation of the imagery. In the probability map, we integrate the values of the pixels within the boundary of a certain segmentation spot. The function is expressed as follows:

$$p_k = \frac{1}{n} \sum_{i=1}^{n} m_{i,k} \qquad (5)$$

where $m_{i,k}$ is the probability of the $i$-th pixel in the $k$-th channel within a certain segmentation spot
$n$ is the total number of pixels within a certain segmentation spot
$p_k$ is the probability that the segmentation spot belongs to the $k$-th category
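The integration of pixel probabilities within each segmentation spot, function (5), can be sketched in NumPy as below. The sketch assumes that the segmentation is available as an integer raster `spot_ids` with labels 0 to S-1, and it assigns each spot the category with the highest average probability. Both the raster representation and the arg-max rule are assumptions made for illustration; the text does not state how the final category is chosen.

```python
import numpy as np

def integrate_spots(prob_map, spot_ids):
    """Average the per-pixel probabilities inside every segmentation spot.

    prob_map: (H, W, C) array of per-pixel category probabilities
    spot_ids: (H, W) integer array, spot label of every pixel (0 .. S-1)
    Returns (spot_probs, spot_labels) with shapes (S, C) and (S,).
    """
    n_classes = prob_map.shape[-1]
    flat_ids = spot_ids.ravel()
    n_spots = int(flat_ids.max()) + 1
    # Number of pixels in every spot (n in function (5)).
    counts = np.bincount(flat_ids, minlength=n_spots)
    spot_probs = np.zeros((n_spots, n_classes))
    for k in range(n_classes):
        sums = np.bincount(flat_ids, weights=prob_map[..., k].ravel(),
                           minlength=n_spots)
        # Guard against unused labels that contain no pixels.
        spot_probs[:, k] = sums / np.maximum(counts, 1)
    # Assumed decision rule: the category with the highest average probability.
    spot_labels = spot_probs.argmax(axis=1)
    return spot_probs, spot_labels
```

A per-pixel label map follows from `spot_labels[spot_ids]`, so every pixel of a spot receives the same category and isolated misclassified pixels disappear.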

Dataset
A destructive Ms 6.5 earthquake occurred at 16:30 on August 3, 2014 in Ludian County (27.1° N, 103.3° E), Yunnan Province. The post-earthquake aerial image, covering an area of about 12 km² at a resolution of 0.3 m, was acquired in Longtoushan Town of Ludian. Both collapsed and uncollapsed buildings are found in the imagery. This imagery is used for the experiment.
The imagery needs to be preprocessed to train the CNN. Firstly, the ground truth map (Figure 3) is obtained by using an object-based classification method and then manually correcting the category of each object spot. Secondly, we scan and crop the imagery and the ground truth map into patches of 64 × 64 pixels. These patches are divided into three sets: training (9000 patches), validation (2000 patches) and testing (500 patches). Every image patch and its ground truth map are rotated randomly to increase the amount of training data.
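The cropping and rotation steps can be sketched as follows. This is an illustration under assumptions rather than the preprocessing actually used: the text gives neither the scanning stride nor the rotation angles, so the sketch uses non-overlapping 64 × 64 patches and random multiples of 90 degrees, and the function names are hypothetical.

```python
import numpy as np

def crop_patches(image, truth, size=64, stride=64):
    """Cut the imagery and its ground truth map into aligned square patches."""
    patches, labels = [], []
    height, width = truth.shape[:2]
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            patches.append(image[top:top + size, left:left + size])
            labels.append(truth[top:top + size, left:left + size])
    return np.stack(patches), np.stack(labels)

def random_rotate(patch, label, rng):
    """Rotate a patch and its ground truth by the same random angle."""
    # Assumption: only multiples of 90 degrees, which need no interpolation.
    turns = int(rng.integers(0, 4))
    return (np.rot90(patch, turns, axes=(0, 1)),
            np.rot90(label, turns, axes=(0, 1)))
```

Rotating the image patch and the ground truth map with the same number of turns keeps every pixel aligned with its label.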

Experiment
The imagery is segmented with the following parameters: scale parameter = 50, image layer weights = 1, composition of homogeneity criterion = 0.5. The resulting boundary of each segmentation spot is shown in Figure 4. Before training the CNN, we first set the hyperparameters of the network. Fine-tuning was done to reduce the number of training iterations. In this article, the hyperparameters are set as follows (Alshehhi, et al. 2017): a mini-batch size of 128 with a momentum of 0.9. The training was regularized by weight decay set to 0.0005 and by dropout for all fully connected layers with the dropout ratio set to 0.5. We initialized the weights in each layer with random numbers drawn from a zero-mean Gaussian distribution with standard deviation 0.01. The learning rate starts at 0.0005, and the initial biases are set to the constant 0.1.
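The training settings listed above can be collected in a small configuration, together with the initialization and one parameter update. The sketch below is an assumption-laden illustration: the dictionary keys and function names are hypothetical, and the update rule is one common form of stochastic gradient descent with momentum and weight decay, which the text does not spell out.

```python
import numpy as np

# Values taken from the text; the key names are illustrative.
HYPER = {
    "batch_size": 128,
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "dropout_ratio": 0.5,
    "init_std": 0.01,
    "init_bias": 0.1,
    "learning_rate": 0.0005,
}

def init_layer(n_in, n_out, rng):
    """Zero-mean Gaussian weights and constant biases."""
    weights = rng.normal(0.0, HYPER["init_std"], size=(n_in, n_out))
    biases = np.full(n_out, HYPER["init_bias"])
    return weights, biases

def sgd_momentum_step(weights, grad, velocity,
                      lr=HYPER["learning_rate"],
                      momentum=HYPER["momentum"],
                      weight_decay=HYPER["weight_decay"]):
    """One update with momentum and L2 weight decay (assumed update rule)."""
    velocity = momentum * velocity - lr * (grad + weight_decay * weights)
    return weights + velocity, velocity
```

In each iteration, `grad` would be the gradient of the negative log likelihood averaged over a mini-batch of 128 patches.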
The well-trained model is then used to predict the probability that every pixel belongs to a certain category; the probability map of un-collapsed buildings, for example, is shown in Figure 5. The combination of the segmentation spots and the probability map (shown in Figure 6) is used to integrate the probability that every segmentation spot belongs to a certain category by function (5).

Result
Through the above processing, the seismic damage information of the buildings is extracted from the aerial image of Longtoushan Town (Figure 7). The result is shown in Figure 8.

CONCLUSION
The experiment shows that combining a CNN with segmentation gives good results in extracting collapsed and uncollapsed buildings. However, comparing the extracted building damage map with the ground truth map still reveals some defects. Firstly, some bare soil or un-collapsed buildings are classified as collapsed buildings, and vice versa. Secondly, a small number of collapsed buildings are not extracted. The reasons can be summarized as follows. Firstly, the amount of training data is too small, and the ground truth map may contain errors. Secondly, collapsed buildings and bare land are easily confused because they have similar spectral signatures. Thirdly, the number of training samples differs between categories: more training patches belong to the background (other objects), which may lead to errors. In future work, we will take manually defined features such as topological relations into account to improve seismic damage extraction.