SALIENCY-GUIDED CHANGE DETECTION OF REMOTELY SENSED IMAGES USING RANDOM FOREST

Studies based on object-based image analysis (OBIA), which represents a paradigm shift in change detection (CD), have achieved remarkable progress in the last decade. Their aim is to develop more intelligent interpretation and analysis methods. The prediction performance and stability of random forest (RF), a machine learning algorithm, are better than those of many single predictors and integrated forecasting methods. In this paper, we present a novel CD approach for high-resolution remote sensing images that incorporates visual saliency and RF. First, highly homogeneous and compact image super-pixels are generated using super-pixel segmentation, and the optimal segmentation result is obtained through image superimposition and principal component analysis (PCA). Second, saliency detection is used to guide the search for regions of interest in the initial difference image obtained via the improved robust change vector analysis (RCVA) algorithm. The salient regions within the difference image that correspond to the binarized saliency map are extracted and subjected to fuzzy c-means (FCM) clustering to obtain the pixel-level pre-classification result, which serves as a prerequisite for super-pixel-based analysis. Third, on the basis of the optimal segmentation and pixel-level pre-classification results, the change possibility of each super-pixel is calculated, and the changed and unchanged super-pixels that serve as training samples are automatically selected. The spectral features and Gabor features of each super-pixel are extracted. Finally, super-pixel-based CD is implemented by applying RF to these samples. Experimental results on Ziyuan 3 (ZY3) multi-spectral images show that the proposed method outperforms the compared methods in CD accuracy and confirm the feasibility and effectiveness of the proposed approach.


INTRODUCTION
Change detection (CD) is an important research topic that uses quantitative analysis of multi-temporal remotely sensed images to determine the process of land cover change, especially in the monitoring of building land, urban development and disaster assessment (Hazel 2001; Hussain et al. 2013). With the rapid development of remotely sensed image acquisition and the gradual shortening of the acquisition cycle, the scope of its applications is becoming increasingly widespread and the application demand is expanding. This places higher requirements and challenges on CD technology.
With the improvement in resolution, the internal spectral difference within features of the same type gradually increases. Automatic CD technology based on pixel spectral statistics cannot meet the requirements of change information extraction and has become the main obstacle to the widespread application of high-resolution remotely sensed images. The emergence of object-oriented technology for high-resolution remote sensing image analysis provides a new way of thinking, and the basic unit of CD has also shifted from the pixel to the object (Hazel 2001). Since the object-based change detection (OBCD) approach has more advantages than the pixel-based change detection (PBCD) approach, it has received extensive attention and has been developed in recent years (Wang, Zhao and Zhu 2007; Emary et al. 2010; Wang and Xu et al. 2013; Hao and Shi et al. 2016; Xiao and Zhang et al. 2016; Xiao and Yuan et al. 2017). An object is defined as a single homogeneous region with shape and spectral properties. Each object has features such as spectrum, shape, texture and context. Therefore, in the process of CD, we can take full advantage of spectral features and combine them with other features to improve the CD accuracy. In this field, the most commonly used methods are object-based change vector analysis (OCVA), object-based correlation coefficient (OCC) and object-based chi-square transformation (OCST) (Wang, Yan and Wang 2014). These methods take advantage of the various features of the object and incorporate them into the later stages of analysis. Compared with methods that use only a single feature, they can significantly improve the accuracy of CD. Their performance, however, relies heavily on the quality of feature selection, the allocation of feature weights and the determination of the change threshold. Moreover, because the segmentation scale is difficult to determine, uncertainty is likely to be introduced into the CD process, which reduces the reliability of the detection results. In order to obtain a better result, the segmentation scale, feature extraction, the change threshold and many other factors need to be taken into consideration.
The future trend of CD is the automation and intellectualization of the analysis process. Although a large number of CD methods and theoretical models have been proposed from the object-based or pixel-based perspective, or for different application purposes, many uncertainties remain. Combining object-based and pixel-based CD approaches helps to reduce the uncertainty (Aguirre et al. 2011; Lu and Li et al. 2015; Xiao and Zhang et al. 2016; Feng and Sui et al. 2017). The combined approach is used in two ways: in the first, the two methods are executed in parallel and their results are then integrated; in the second, the two methods are performed consecutively, so that the result of one method serves as the premise for executing the other. Both strategies aim to achieve better image analysis results.
This paper adopts the strategy of performing the pixel-based and object-based approaches consecutively. In the pixel-level analysis, the robust change vector analysis (RCVA) method is used to obtain the difference image, and the visual attention mechanism is introduced to find the regions that are most likely to have changed. Saliency, which is closely related to human visual perception, helps people understand the image (Zheng and Jiao et al. 2016; Wang and Yang et al. 2016; Hou and Wang et al. 2016). Saliency detection improves the efficiency of computer processing of image information and yields better processing results, which broadly benefits tasks such as image segmentation, object recognition and detection (Zheng and Jiao et al. 2016; Wang and Yang et al. 2016; Li and Xu et al. 2017). Motivated by these advantages, we explore the saliency cue for CD from remotely sensed images, based on the assumption that changed regions have higher saliency than nearby unchanged regions in the local context. The strong visual contrast of local areas makes saliency suitable for guiding the sample selection for object-level CD. This paper makes full use of the advantages of pixel-based and object-based analysis methods and combines them with the RF model to analyse the influence of saliency, sample selection and feature extraction on the performance of the final classifier. The entire workflow of the proposed method is shown in Figure 1.

Figure 1. Flowchart of the proposed approach
The rest of this paper is organized as follows. Section 2 describes the proposed method. Section 3 presents the experimental results and discussion. Finally, we conclude this paper in Section 4.

Optimal Super-pixel Segmentation
We use the entropy-rate segmentation algorithm (Liu, Tuzel, Ramalingam and Chellappa 2011) to segment the image into many super-pixel regions for subsequent information extraction. Further details of the algorithm are given by Liu et al. (2011). The purpose of super-pixel segmentation is to group features of the same type so as to obtain a series of homogeneous, compact regions with strong regional consistency. In the segmentation process, the number of super-pixels plays an important role and is the key to improving the segmentation quality.
For multi-level super-pixel segmentation in object-oriented remote sensing information extraction, the optimal number of super-pixels is defined such that a target feature can be expressed by one or several super-pixels. This requires that the super-pixel size should be close to that of the target feature, and its geometry (polygon) should not be too fragmented; the boundaries of the super-pixel should be clear; the heterogeneity within a super-pixel should be as small as possible; the heterogeneity among different types of super-pixels should be as large as possible; and a super-pixel should express the basic characteristics of a certain object. The homogeneity within super-pixels guarantees purity, while the heterogeneity between super-pixels ensures separability. In our study, the weighted variance of the super-pixels is used to express the internal homogeneity, as given in equation (1), and Moran's I index (Espindola, Camara, Reis and Bins 2006) is used to represent the heterogeneity between super-pixels, as given in equation (2).
In equation (1), $H$ is the homogeneity index, $a_k$ is the area of super-pixel $k$, expressed as the number of pixels inside it, $v_k$ is the standard deviation of super-pixel $k$, and $n$ is the total number of super-pixels in the segmentation. The equation weights each super-pixel by its area; the weights reduce the instability induced by small super-pixels. The higher the value of $H$, the higher the homogeneity of the super-pixels. In equation (2), $MI$ is the heterogeneity index.
This paper adopts equation (3) as the evaluation index for the optimal number of super-pixels (Espindola, Camara, Reis and Bins 2006). It uses the homogeneity index and the heterogeneity index of the super-pixels to construct a function that measures the quality of the segmentation results. In equation (3), $F(H)$ denotes the homogeneity evaluation index, $F(MI)$ denotes the heterogeneity evaluation index, $\rho \in [0, 1]$ denotes the heterogeneity weight, and the homogeneity weight is $1 - \rho$. In our study, the value of $\rho$ is 0.5. The homogeneity index and the heterogeneity index should be normalized before being used to evaluate the optimal number of super-pixels. On this basis, an optimal model for selecting the number of super-pixels is obtained by cubic spline interpolation, as expressed in equation (6). When the function $s_3(x)$ takes its maximum value in the super-pixel interval $[x_{\min}, x_{\max}]$, the corresponding super-pixel number $x$ is the optimal super-pixel number.
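The evaluation index described above can be illustrated with a minimal sketch. Because the displayed equations are not reproduced in this text, the forms below are assumptions: the homogeneity term is taken as the area-weighted mean of the per-super-pixel standard deviations, Moran's I is written in its standard form, and normalization is a min-max scaling in which the lowest raw value scores 1, following the convention of Espindola et al. (2006). The function names are hypothetical.

```python
def weighted_std(areas, stds):
    """Area-weighted mean of per-super-pixel standard deviations (assumed form of H)."""
    return sum(a * v for a, v in zip(areas, stds)) / sum(areas)


def morans_i(means, adjacency, global_mean):
    """Moran's I over super-pixel mean grey values.

    adjacency[i][j] is 1 when super-pixels i and j are adjacent, else 0;
    global_mean is the average grey value of the entire image.
    """
    n = len(means)
    dev = [y - global_mean for y in means]
    w_sum = sum(sum(row) for row in adjacency)
    cross = sum(adjacency[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    var = sum(d * d for d in dev)
    return n * cross / (var * w_sum)


def normalise(values):
    """Min-max scaling (assumption): lowest raw value -> 1, highest -> 0."""
    lo, hi = min(values), max(values)
    return [(hi - v) / (hi - lo) for v in values]


def combined_score(h_values, mi_values, rho=0.5):
    """F(H, MI) = (1 - rho) * F(H) + rho * F(MI), one score per candidate segmentation."""
    f_h, f_mi = normalise(h_values), normalise(mi_values)
    return [(1 - rho) * a + rho * b for a, b in zip(f_h, f_mi)]
```

Each entry of `h_values` and `mi_values` corresponds to one candidate number of super-pixels; the candidate with the highest combined score is the one the evaluation index prefers.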

Initial Difference Image Generation:
The traditional robust change vector analysis (RCVA) algorithm, proposed by Thonfeld et al. (2016), considers only the spectral difference in a single band and computes the least difference for each pixel. It does not incorporate the whole spectral difference of the corresponding pixels. To solve this problem, we propose an improved RCVA algorithm that reduces the influence of registration errors.
The improved RCVA algorithm analyses not only the pixels $x_1(i, j)$ in the T1 phase image and $x_2(i, j)$ in the T2 phase image, but also the pixels in the adjacent neighborhood, including diagonal neighbors. It is based on the assumption that the pixel $x_2(i \pm w, j \pm w)$ showing the least spectral variance to $x_1(i, j)$ is the pixel containing most of the corresponding ground information of $x_1(i, j)$. That is, if the bi-temporal images contain geometric registration errors, and the difference between a certain pixel in the T1 phase image and another pixel in the T2 phase image is the smallest, then the two pixels are called corresponding image points. In this way, the influence of the registration error can be effectively reduced. The improved RCVA algorithm uses a moving window of size $2w + 1$ to incorporate the spectral difference of adjacent pixels, where $w$ is the neighborhood radius expressed as the number of pixels from the pixel of concern. In our study, we use $w = 2$, resulting in a $5 \times 5$ moving window. The calculation is divided into two steps. In the first step, difference images are calculated, taking the pixel neighborhood into account, by subtracting T1 from T2 and vice versa, as given in equations (7) and (8).
The second step is to obtain the change intensity image, which incorporates the neighborhood information, using equation (9). After the change intensity image is acquired, it is important to distinguish the real change regions from pseudo-change regions accurately.
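The neighborhood search at the heart of this step can be sketched as follows. This is an illustrative simplification, not the exact improved RCVA: equations (7) to (9) are not reproduced in this text, so the sketch assumes that the spectral difference is the Euclidean distance over all bands and that the two directional differences are combined by taking the larger one. The function names are hypothetical; images are nested lists of per-pixel band tuples.

```python
import math


def directional_difference(ref, other, w):
    """For every pixel of `ref`, the smallest spectral distance to any pixel of
    `other` inside the (2w+1) x (2w+1) window centred on the same position."""
    rows, cols = len(ref), len(ref[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            best = math.inf
            for p in range(max(0, i - w), min(rows, i + w + 1)):
                for q in range(max(0, j - w), min(cols, j + w + 1)):
                    # Euclidean distance over the whole spectral vector (assumption).
                    best = min(best, math.dist(ref[i][j], other[p][q]))
            out[i][j] = best
    return out


def change_intensity(t1, t2, w=2):
    """Combine both search directions; taking the maximum is an assumption."""
    d12 = directional_difference(t1, t2, w)
    d21 = directional_difference(t2, t1, w)
    return [[max(a, b) for a, b in zip(r1, r2)] for r1, r2 in zip(d12, d21)]
```

With `w = 2` the search covers the 5 x 5 window used in the study. A feature that is displaced by no more than `w` pixels between the two dates finds its counterpart inside the window, so its change intensity stays at zero.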

From Saliency to Pixel-level Pre-classification:
In the difference image, the user is interested in only some parts. The salient region of the difference image, which attracts the user's attention, is also the region that best conveys the image content. Saliency is used to extract regions that are distinct from their local and global surroundings. After the change intensity image is acquired, the problem of CD can be seen as finding a region that is strongly distinct from other regions. In the change intensity image, the changed region has larger pixel values, whereas the pixel values of the unchanged region are close to zero. Furthermore, the area of the changed regions is usually smaller than that of the unchanged region. Therefore, the changed region can be further highlighted as a high-contrast target by the graph-based visual saliency (GBVS) model (Schölkopf, Platt and Hofmann 2007). From this point of view, the CD and saliency detection problems coincide in nature. In terms of visual appearance, the changed areas also correspond to the salient regions of the change intensity image.
With a thresholding method, pixels whose saliency values are larger than a given threshold $T_{saliency}$ are preserved in the extracted areas, and the remaining pixels are discarded. The thresholding function is as follows:
$$S_{Map}(i, j) = \begin{cases} 1, & S(i, j) > T_{saliency} \\ 0, & \text{otherwise} \end{cases} \quad (10)$$
where $S_{Map}$ is the thresholding map, in which "1" indicates that the corresponding pixel is preserved in the extracted areas and "0" indicates that it is discarded. By thresholding the saliency map $S$, the areas of interest with discriminative information are well preserved, and the false changes generated by noise are largely suppressed. In our study, the regions within the change intensity image that correspond to the thresholding map are extracted and subjected to fuzzy c-means (FCM) clustering to obtain the pixel-level pre-classification result.
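The thresholding and clustering steps can be sketched together. This is a minimal illustration under stated assumptions: FCM is run on scalar change intensities with two clusters and fuzzifier m = 2, the cluster with the larger centre is read as "changed", and pixels outside the salient mask are labelled unchanged. None of these settings is given in the text, and the function names are hypothetical.

```python
def threshold_map(saliency, t_saliency):
    """Binary map: 1 where saliency exceeds the threshold, 0 otherwise."""
    return [[1 if s > t_saliency else 0 for s in row] for row in saliency]


def fcm_memberships(values, centers, m):
    """Fuzzy membership of every value to every cluster centre."""
    power = 2.0 / (m - 1.0)
    table = []
    for x in values:
        dists = [abs(x - c) for c in centers]
        if min(dists) == 0.0:
            # The value sits exactly on a centre: full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            table.append([h / sum(hits) for h in hits])
        else:
            inv = [d ** -power for d in dists]
            table.append([v / sum(inv) for v in inv])
    return table


def fcm_two_clusters(values, m=2.0, iters=100, tol=1e-6):
    """Two-cluster fuzzy c-means on scalars; returns (centres, memberships)."""
    centers = [min(values), max(values)]
    for _ in range(iters):
        u = fcm_memberships(values, centers, m)
        new = []
        for k in range(2):
            weights = [row[k] ** m for row in u]
            total = sum(weights)
            new.append(sum(w * x for w, x in zip(weights, values)) / total
                       if total else centers[k])
        shift = max(abs(a - b) for a, b in zip(new, centers))
        centers = new
        if shift < tol:
            break
    return centers, fcm_memberships(values, centers, m)


def pre_classify(intensity, saliency, t_saliency):
    """Pixel-level pre-classification: 1 = changed, 0 = unchanged."""
    mask = threshold_map(saliency, t_saliency)
    coords = [(i, j) for i, row in enumerate(mask)
              for j, keep in enumerate(row) if keep]
    labels = [[0] * len(row) for row in mask]
    if not coords:
        return labels
    values = [intensity[i][j] for i, j in coords]
    centers, u = fcm_two_clusters(values)
    high = 0 if centers[0] > centers[1] else 1
    for (i, j), row in zip(coords, u):
        labels[i][j] = 1 if row[high] > 0.5 else 0
    return labels
```

Restricting the clustering to the salient mask means that noise outside the mask never enters the FCM statistics, which is the role the text assigns to saliency guidance.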

Different Super-pixel Change Possibilities:
Sample selection is based on the results of the optimal segmentation and the pixel-level pre-classification. For the $i$-th super-pixel $R_i$, its uncertainty index $T$ is calculated from the pixel-level pre-classification result, as shown in equation (11),
where $n_c$, $n_u$ and $n$ are the numbers of changed, unchanged and total pixels in the super-pixel $R_i$, respectively. After setting the threshold $T_m$, we use equation (12) to determine the attribute of the super-pixel $R_i$. The changed and unchanged super-pixels are selected as training samples for RF, and the uncertain super-pixels are classified further. To select the threshold $T_m$, we test candidate values in the interval [0.5, 1) with a step size of 0.05; for each candidate, samples are randomly selected 20 times and the average accuracy is calculated. The threshold $T_m$ that yields the best CD accuracy is selected.
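The labelling rule can be sketched as follows. Equations (11) and (12) are not reproduced in this text, so the signed-ratio form below is an assumption chosen to be consistent with the stated threshold range [0.5, 1): the index is $n_c / n$ when changed pixels are at least as numerous as unchanged ones and $-n_u / n$ otherwise. The function names and the default threshold are illustrative.

```python
def uncertainty_index(n_changed, n_unchanged):
    """Signed majority ratio of a super-pixel (assumed form of equation (11))."""
    n = n_changed + n_unchanged
    if n == 0:
        raise ValueError("super-pixel contains no pixels")
    if n_changed >= n_unchanged:
        return n_changed / n
    return -n_unchanged / n


def superpixel_attribute(t, t_m):
    """Three-way decision (assumed form of equation (12))."""
    if t > t_m:
        return "change"
    if t < -t_m:
        return "non-change"
    return "uncertain"


def label_superpixels(segments, pre_class, t_m=0.75):
    """segments: 2-D grid of super-pixel ids; pre_class: 2-D grid of 0/1 labels."""
    counts = {}
    for seg_row, cls_row in zip(segments, pre_class):
        for seg_id, cls in zip(seg_row, cls_row):
            pair = counts.setdefault(seg_id, [0, 0])  # [unchanged, changed]
            pair[1 if cls else 0] += 1
    return {seg_id: superpixel_attribute(uncertainty_index(c[1], c[0]), t_m)
            for seg_id, c in counts.items()}
```

Super-pixels labelled "change" or "non-change" become training samples, and those labelled "uncertain" are left for the classifier.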

Classification based on RF:
The key step after sample selection is change feature extraction and analysis. In our study, after segmentation, the spectral and Gabor features of the super-pixels at the same position are extracted from the T1 and T2 phase images, respectively. The spectral features (SF) mainly include the mean value, standard deviation, ratio, maximum value and minimum value, as shown in equations (13) and (14). The spectral features and Gabor features reflect different object type information from different angles and often complement each other. After extracting these two types of features, we combine them as the characteristic input data for RF model training. The category of the uncertain super-pixels is then predicted by the trained RF model, and the voting result of all the decision trees is taken as the final classification result (Breiman 2001; Wessels and Bergh et al. 2016).
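The bagging and majority-vote mechanism of RF can be illustrated with a small pure-Python sketch. It is not the configuration used in the experiments: the number of trees, the depth limit, the Gini criterion and the square-root feature subset size are all assumptions, and in practice an established library implementation would normally be used. Class labels are assumed to be strings or integers.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Lowest weighted-Gini split among the candidate features, or None."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def grow_tree(rows, labels, rng, max_depth=5):
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority
    n_feat = len(rows[0])
    k = max(1, int(n_feat ** 0.5))  # random feature subset per split
    split = best_split(rows, labels, rng.sample(range(n_feat), k))
    if split is None:
        return majority
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], rng, max_depth - 1),
            grow_tree([r for r, _ in right], [y for _, y in right], rng, max_depth - 1))


def predict_tree(node, row):
    while isinstance(node, tuple):  # internal nodes are (feature, threshold, left, right)
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    """Final class = majority vote over all trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]
```

Here each row would hold the combined spectral and Gabor features of one super-pixel; the forest is trained on the automatically selected changed and unchanged super-pixels and then applied to the uncertain ones.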

Dataset Description and Evaluation Metrics
The experimental dataset consists of ZY3 multi-spectral images acquired in 2014 and 2015 covering the city of Wuhan, China. ZY3 is China's first civilian high-resolution stereo mapping satellite and was launched in January 2012. The orbit height of the ZY3 satellite is 505.984 km, and the orbit inclination is 97.421°. The bi-temporal images were produced by the national geographic census of China and have been orthorectified. The panchromatic and multi-spectral images are fused by the Pan-sharp algorithm (Amro et al. 2011), and the spatial resolution of the fused image is 2.1 m. The bi-temporal images used in the experiments include the three bands R, G and B, and the image size is 1564 × 1424. The main land cover types are vegetation, water, road, building and bare land. The main changes are transformations between water and bare land and between building and bare land.
Accuracy assessment is important for understanding CD results and for final decision making. Four indexes are used to evaluate the accuracy of the final results (Hou, Wang and Liu 2016). The criteria are as follows: the false alarms (FA), which is the number of unchanged pixels that are incorrectly detected as changed, i.e., $N_{FA}$, with the false alarm rate in percentage given by $P_{FA} = N_{FA} / N_0 \times 100\%$, where $N_0$ is the total number of unchanged pixels; the missed alarms (MA), which is the number of changed pixels that are incorrectly detected as unchanged, i.e., $N_{MA}$, with the missed alarm rate in percentage given by $P_{MA} = N_{MA} / N_1 \times 100\%$, where $N_1$ is the total number of changed pixels; the overall error (OE), which is the total number of errors caused by FA and MA, with the overall alarm rate in percentage given by $P_{OE} = (N_{FA} + N_{MA}) / (N_0 + N_1) \times 100\%$; and the kappa index, which is a statistical measure of accuracy or agreement and reflects the consistency between the experimental results and the ground truth.
The kappa index is expressed as $Kappa = (p_0 - p_c) / (1 - p_c)$, where $p_0$ indicates the real consistency and $p_c$ indicates the theoretical consistency.
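The four indexes can be computed from a binary change map and its reference with a short sketch such as the one below. It assumes that both inputs are flat sequences of 0 (unchanged) and 1 (changed), that both classes occur in the reference, and that $p_c$ is the chance agreement derived from the row and column totals of the confusion matrix. The function and key names are hypothetical.

```python
def cd_metrics(predicted, reference):
    """False alarm, missed alarm and overall error rates (percent) plus kappa."""
    tp = fp = fn = tn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1
        elif p and not r:
            fp += 1          # false alarm: unchanged detected as changed
        elif not p and r:
            fn += 1          # missed alarm: changed detected as unchanged
        else:
            tn += 1
    n0, n1 = fp + tn, tp + fn            # reference unchanged / changed totals
    n = n0 + n1
    p0 = (tp + tn) / n                   # observed agreement
    pc = ((tp + fp) * n1 + (fn + tn) * n0) / (n * n)   # chance agreement
    return {
        "FA_rate": 100.0 * fp / n0,
        "MA_rate": 100.0 * fn / n1,
        "OE_rate": 100.0 * (fp + fn) / n,
        "kappa": (p0 - pc) / (1.0 - pc),
    }
```

For example, with 10 changed and 90 unchanged reference pixels, 2 missed changes and 9 false alarms give rates of 10 % (FA), 20 % (MA) and 11 % (OE).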

Parameter Setting:
The ideal result of remote sensing image segmentation is that the interior of each obtained object is significantly homogeneous, whereas adjacent objects are significantly heterogeneous. To obtain the best super-pixel segmentation results, this paper uses the evaluation index to determine the final optimal number of super-pixels. In the segmentation experiment, we use the entropy-rate segmentation algorithm to segment the first principal component image, changing the number of super-pixels to obtain multi-level super-pixel regions of different sizes.
The number of super-pixels is varied from 2000 to 6000 with a step size of 200. The value of the evaluation index F(H,MI) at each scale is calculated after segmentation, and cubic spline interpolation is then performed. Figure 3 shows that F(H,MI) is highest when the number of super-pixels is 3745. That is, at this number the interior of each super-pixel is homogeneous and different super-pixels are heterogeneous. This paper therefore selects 3745 as the optimal number of super-pixels for segmentation.
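The selection of the optimal number from the sampled scores can be sketched with a cubic spline written out in plain Python. The boundary conditions of the spline are not stated in the text, so a natural spline (zero second derivative at both ends) is assumed here, and the maximum is located by evaluating the spline at every integer count in the sampled range. The function names are hypothetical.

```python
from bisect import bisect_right


def natural_spline_second_derivatives(xs, ys):
    """Second derivatives at the knots of a natural cubic spline."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)
    if n < 2:
        return m
    # Tridiagonal system for the interior knots, solved by the Thomas algorithm.
    diag = [2.0 * (h[k] + h[k + 1]) for k in range(n - 1)]
    rhs = [6.0 * ((ys[k + 2] - ys[k + 1]) / h[k + 1] - (ys[k + 1] - ys[k]) / h[k])
           for k in range(n - 1)]
    for k in range(1, n - 1):
        factor = h[k] / diag[k - 1]
        diag[k] -= factor * h[k]
        rhs[k] -= factor * rhs[k - 1]
    sol = [0.0] * (n - 1)
    sol[-1] = rhs[-1] / diag[-1]
    for k in range(n - 3, -1, -1):
        sol[k] = (rhs[k] - h[k + 1] * sol[k + 1]) / diag[k]
    m[1:n] = sol
    return m


def spline_value(xs, ys, m, x):
    """Evaluate the spline at x (xs must be sorted in increasing order)."""
    i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
    h = xs[i + 1] - xs[i]
    a, b = xs[i + 1] - x, x - xs[i]
    return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h)
            + (ys[i] / h - m[i] * h / 6.0) * a
            + (ys[i + 1] / h - m[i + 1] * h / 6.0) * b)


def optimal_superpixel_number(counts, scores):
    """Integer count in [counts[0], counts[-1]] that maximises the spline."""
    m = natural_spline_second_derivatives(counts, scores)
    candidates = range(counts[0], counts[-1] + 1)
    return max(candidates, key=lambda x: spline_value(counts, scores, m, x))
```

In the setting described above, `counts` would be `list(range(2000, 6001, 200))` and `scores` the F(H,MI) value measured at each of those counts; the interpolation allows the optimum to fall between sampled counts.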
Figure 3. The evaluation index of the optimal super-pixel number
Since the saliency map highlights the regions with strong local visual contrast, these regions attract the most attention. There is a theoretical and visual commonality between saliency and CD. By setting a threshold on the saliency map, the regions within the change intensity image that correspond to the binarized saliency map are extracted. These regions are then subjected to the FCM algorithm to obtain the pixel-level pre-classification result, which serves as a prerequisite for object-oriented CD analysis. The pixel-level CD accuracy is shown in Figure 4. When the threshold $T_{saliency} = 90$, the kappa index is the largest. In the interval (45, 90], both the false alarm rate and the overall alarm rate decrease, while the missed alarm rate increases slowly. When the threshold exceeds 90, the false alarm rate and the overall alarm rate become flat, whereas the missed alarm rate continues to rise. Therefore, the best threshold value is 90. Figure 5 shows the change intensity image and saliency map of the experimental dataset, as well as the salient and non-salient regions extracted at the optimal saliency threshold.

Results and Analysis:
In order to verify the feasibility and effectiveness of the proposed approach, we apply it to the high-resolution remote sensing image CD problem on ZY3 multi-spectral images. Typical corrections, such as co-registration and relative radiometric correction, are applied to the bi-temporal images before the proposed approach is run. ... and also for the DRLSE model. We select the optimal super-pixel segmentation result to carry out the OCVA method. The Otsu algorithm (Otsu 1979) is adopted for automatic threshold segmentation.
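For reference, Otsu's automatic threshold can be sketched in a few lines. The sketch assumes integer grey levels in the range 0 to `levels - 1` and returns the level that maximises the between-class variance; the exact image it is applied to in the comparison experiments is not recoverable from the text, so the input here is simply a flat list of values. The function name is hypothetical.

```python
def otsu_threshold(values, levels=256):
    """Grey level t maximising the between-class variance; class 1 is v > t."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(levels):
        w0 += hist[t]              # pixels at or below t
        sum0 += t * hist[t]
        w1 = total - w0            # pixels above t
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t
```

Pixels or super-pixels whose change magnitude exceeds the returned level would be labelled as changed.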
The results are presented in two ways: the final CD results as figures and the evaluation criteria as a table. The reference images are generated manually by detailed visual interpretation. The black areas indicate the unchanged regions, and the white areas indicate the changed regions. Figure 7 shows the results obtained on the experimental dataset. Table 1 lists the values of the evaluation metrics. In the table, the results of the proposed method are written in bold. As shown in Table 1, the CD accuracy of traditional pixel-level methods such as MRF-ICM and PCA-kMeans is comparatively low. This is mainly because these algorithms use only the spectral features of the bi-temporal images. With regard to time efficiency, MRF-ICM requires iterative optimization and is therefore slower than PCA-kMeans. The DRLSE algorithm can incorporate neighborhood information between pixels in the difference image. Compared with MRF-ICM and PCA-kMeans, DRLSE further improves the accuracy of the CD result. However, it takes the longest time, because the adjacency restriction between regions is taken into account in the evolutionary segmentation process. The result of OCVA is not markedly better than those of the traditional pixel-level CD methods. This is because OCVA depends too heavily on the selection of the threshold value and on the grey mean information of the super-pixel, and it fails to take advantage of grey distribution information, which leads to a poor CD result. The proposed approach obtains the best result and achieves higher accuracy than the comparison methods.
Overall, in CD applications, the pixel-based and object-based image analysis approaches can be combined according to different purposes by using visual saliency and RF to obtain the final object-level CD result. The changed objects are more regular and correspond to actual geographical features. Therefore, the combination not only absorbs the advantages of both pixel-based and object-based approaches, but also obtains the best accuracy.

CONCLUSION
A novel CD approach based on visual saliency and RF for multi-temporal high-resolution remotely sensed images is proposed in this paper. Saliency detection is used to guide the search for regions of interest in the initial difference image obtained via the improved RCVA algorithm, which reduces the effect of noise to some extent. On the basis of the optimal segmentation and pixel-level pre-classification results, object-level CD is implemented by applying the RF model. The bi-temporal ZY3 multi-spectral images are used to verify the effectiveness and show the superiority of the proposed approach.
Some work remains to further improve the performance of the proposed method. In high-resolution remote sensing image CD, the phenomenon that different objects share the same spectrum and the same objects show different spectra is more serious, so using only spectral and Gabor features is still inadequate.
In future research, edge features, geometric features and elevation can be introduced to further improve the accuracy of CD.
In equation (2), $MI$ is the heterogeneity index, and $w_{ij}$ indicates whether super-pixel $i$ and super-pixel $j$ are adjacent: if $w_{ij} = 1$, they are adjacent; if $w_{ij} = 0$, they are not adjacent. The value of $y_i$ represents the average grey value of super-pixel $i$, the value of $y_j$ represents the average grey value of super-pixel $j$, and $\bar{y}$ represents the average grey value of the entire image. The smaller the value of $MI$, the lower the correlation between the super-pixels and the clearer the boundary between them.
In equation (12), the attributes of the super-pixel $R_i$ are non-change, uncertain and change, respectively, and the range of the threshold is $T_m \in [0.5, 1)$.
The spectral feature vector in equation (14) consists of the components (Mean, Std, Ratio, Minvalue, Maxvalue).
The Gabor wavelet function is a Gaussian function modulated by a sinusoid, and it can extract correlated image features at different scales and directions (Daugman 1988; Li, Shi, Zhang and Hao 2017). In our study, the Gabor wavelet is employed to extract the texture features. The two-dimensional Gabor function is applied to the image by convolution; the result is the feature image extracted from the original image $I(x, y)$ after Gabor filtering, where $\otimes$ denotes the two-dimensional convolution operation. The mean value and standard deviation of each super-pixel in the Gabor feature image are extracted at the different bands as the texture features.
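The texture extraction can be sketched as follows. The paper's Gabor function and its parameter values are not reproduced in this text, so the kernel below uses a common real-valued parameterization (Gaussian envelope with aspect ratio `gamma`, cosine carrier with the given `wavelength` and phase `psi`); all parameter names and defaults are assumptions, and the kernel size must be odd. Filtering uses a plain zero-padded convolution.

```python
import math


def gabor_kernel(size, sigma, theta, wavelength, gamma=0.5, psi=0.0):
    """Real part of a 2-D Gabor function sampled on a size x size grid."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength + psi))
        kernel.append(row)
    return kernel


def convolve_same(image, kernel):
    """2-D convolution with zero padding; output has the size of `image`."""
    rows, cols = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    p, q = i - di, j - dj
                    if 0 <= p < rows and 0 <= q < cols:
                        acc += kernel[di + half][dj + half] * image[p][q]
            out[i][j] = acc
    return out


def superpixel_texture(feature_image, segments):
    """Mean and standard deviation of the filtered image inside each super-pixel."""
    groups = {}
    for f_row, s_row in zip(feature_image, segments):
        for value, seg_id in zip(f_row, s_row):
            groups.setdefault(seg_id, []).append(value)
    stats = {}
    for seg_id, vals in groups.items():
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        stats[seg_id] = (mean, std)
    return stats
```

Running the filter at several scales and orientations on each band, and collecting the per-super-pixel mean and standard deviation each time, yields the texture part of the feature vector that is combined with the spectral features.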
Figure 4. The influence of saliency on the experimental dataset. Influence of saliency (a) on FA, MA and OE; (b) on kappa index.
On the basis of the optimal segmentation and pixel-level pre-classification results, the changed and unchanged super-pixels that serve as the training samples are automatically selected, and the spectral and Gabor features of the super-pixels are extracted as the characteristic input data for RF model training. As shown in Figure 6, for the experimental dataset, the kappa index is the largest when the threshold $T_m = 0.75$. When the threshold is in the interval [0.5, 0.75], the missed alarm rate increases slowly while the false alarm rate and the overall alarm rate decline; when the threshold $T_m$ exceeds 0.75, the missed alarm rate grows dramatically, while the false alarm rate and the overall alarm rate become flat. Thus the best threshold is 0.75. Therefore, based on the above discussion, the thresholds for the experimental dataset are set to $T_{saliency} = 90$ and $T_m = 0.75$.
Figure 5. Intermediate results of saliency. (a) The change intensity image; (b) the saliency map; (c) the salient region; (d) the non-salient region.
Figure 6. The influence of uncertainty index on the experimental dataset. Influence of uncertainty index (a) on FA, MA and OE; (b) on kappa index.
Three pixel-based CD methods, namely the Markov random field model optimized with iterated conditional modes (MRF-ICM), PCA-kMeans (Celik 2009) and edge-based distance regularized level set evolution (DRLSE) (Li and Xu et al. 2010; Lei, Shi and Wu 2017), together with object-based change vector analysis (OCVA), are selected as the comparison methods. These methods are tuned to obtain their best CD maps and are used to demonstrate the superiority of the proposed approach based on the combination of visual saliency and RF. In MRF-ICM, the k-means algorithm is used to obtain the initial clustering value and the class number ...
Figure 7. The results of different methods. (a) MRF-ICM; (b) PCA-kMeans; (c) DRLSE; (d) OCVA; (e) Proposed; (f) Reference image.

Table 1. Performance comparisons against different approaches.