RECURSIVE HIERARCHICAL CLUSTERING FOR HYPERSPECTRAL IMAGES

Partition-based clustering techniques are widely used in data mining and also to analyze hyperspectral images. Unsupervised clustering depends only on the data, without any external knowledge. It creates a complete partition of the image with many classes. Sparse labeled samples may then be used to label each cluster, which simplifies the supervised step. Each clustering algorithm has its own advantages and drawbacks (initialization, training complexity). We propose in this paper to use a recursive hierarchical clustering based on standard clustering strategies such as K-Means or Fuzzy C-Means. The recursive hierarchical approach reduces the algorithm complexity, so that a large number of input pixels can be processed and a clustering with a high number of clusters can be produced. Moreover, in hyperspectral images, a classical question concerns the high dimensionality and the distance that should be used. Classical clustering algorithms usually use the Euclidean distance to compute the distance between samples and centroids. We propose to implement the spectral angle distance instead and to evaluate its performance. It better fits the pixel spectra and is less sensitive to illumination change or spectral variability inside a semantic class. Different scenes are processed with this method in order to demonstrate its potential.


The problem of labeling
Hyperspectral images give us access to a wide range of information contained in the different spectral bands. The classification process suffers from several drawbacks: it reduces the wealth of the information (only a limited number of classes, for example) and also requires a full set of learning samples (availability of ground truth). As can be seen in many scientific papers, precise and validated ground truth is not always easy to obtain (Lange, 2018).
Clustering methods are able to exploit the whole amount of data in the hyperspectral cube. As they are unsupervised, they may extract some new classes from the data. The unsupervised clustering produces a list of clusters or centroids, a classification map, and a matrix of distances from each pixel to these centroids.
In some learning tasks, it is difficult or costly to build a ground truth, especially for classification activities. Furthermore, creating a dedicated map for one or a few classes is feasible, but creating a full ground truth over a geographical area is much more difficult. A clustering approach may be used to learn the structure of the data, and it should then be possible to classify elements using only a few labels.

The problem of high dimension
In this paper, we propose to compare several clustering techniques such as K-Means, Fuzzy C-Means, and Self Organizing Map (Kohonen map). They all use the Euclidean distance to compute the position updates of the centroids. One of our contributions consists in replacing the classical Euclidean distance, or l2 norm, by a spectral angle and evaluating its impact. Indeed, in many analyses of high-dimensional hyperspectral vectors, the spectral angle is more robust to illumination change.
Principal Component Analysis is a widely used method to reduce the data dimensionality. It may lead to the loss of interesting information. For the main semantic classes, its use is recommended as it reduces the execution time.

About clustering algorithms
Among the different clustering techniques, K-Means is widely used since it is fast and quite robust (MacQueen, 1967). At each iteration, every sample is assigned to its nearest cluster. Then the new means (centroids) are computed from the assigned samples. The algorithm ends when the assignment no longer changes, so that the centroids do not move any more. This algorithm is quite simple but is not guaranteed to find the optimal clusters. Given a set of observations $(X_1, X_2, X_3, ..., X_N)$, the $C$ sets of points are $S = \{S_1, S_2, ..., S_C\}$. The distance between observations and clusters is computed with the l2 norm, which is:

$D_{i,j} = \|X_i - S_j\|_2$ (1)

The position of each centroid is computed with the formula $S_j = \frac{1}{n_j} \sum_{k \in S_j} X_k$, where $n_j$ is the number of samples assigned to cluster $j$. The complexity is $O(N \cdot D \cdot C \cdot I)$, with $N$ the number of samples, $D$ the number of dimensions of the data, $C$ the number of clusters and $I$ the number of iterations.
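The iteration described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the implementation used in the paper: the function names `kmeans` and `l2_distance`, the optional `init` argument and the toy data are all assumptions made for the example, and plain Python lists stand in for an array library.

```python
import math
import random


def l2_distance(x, s):
    """Euclidean (l2) distance between a sample x and a centroid s."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, s)))


def kmeans(samples, n_clusters, init=None, max_iter=100, seed=0):
    """Plain K-Means on a list of D-dimensional samples.

    Returns (centroids, labels). `init` is a hypothetical convenience
    argument: when omitted, C samples are drawn at random as start points.
    """
    if init is None:
        init = random.Random(seed).sample(samples, n_clusters)
    centroids = [list(s) for s in init]
    labels = None
    for _ in range(max_iter):
        # Assignment step: every sample goes to its nearest centroid.
        new_labels = [
            min(range(n_clusters), key=lambda j: l2_distance(x, centroids[j]))
            for x in samples
        ]
        if new_labels == labels:
            break  # assignments no longer change, so the centroids stay put
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned samples.
        for j in range(n_clusters):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Toy data: two well-separated groups of 2-band "spectra".
toy = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
centroids, labels = kmeans(toy, n_clusters=2, init=[toy[0], toy[3]])
```

Each pass costs one distance per sample and per centroid in D dimensions, which is where the O(N.D.C.I) complexity comes from.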
Fuzzy C-Means (FCM) is a clustering method which allows one pixel sample to belong to several clusters (Dunn, 1973) (Bezdek, 1980). It is based on the minimization of an objective function (2). The new position of the updated centroid is computed with formula (3).
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2020, 2020 XXIV ISPRS Congress (2020 edition)
The Self Organizing Map differs from the K-Means algorithm (Kohonen, 1982). For each pixel, we assign a winning node with the distance $D_{i,j} = \|X_i - S_j\|$ (4). Considering an external topology describing the spatial organization of the nodes, the spatial neighbors of the winning node are also updated. Consequently, nodes that are spatial neighbors in the topology become spectral neighbors. The updated centroid is computed with formula (5), in which the parameter $\beta_t$ is the learning rate and $w_{dist}$ is a function of the distance in the topological representation of the network (it may simply be equal to 1 if the neuron is updated, or 0 if not).
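For reference, the textbook forms of the Fuzzy C-Means objective and updates, and of the Self Organizing Map node update, can be written as follows. The membership weights $u_{ij}$, the fuzzifier $m > 1$ and the winning-node index $j^*$ are notation assumed here rather than taken from the paper, whose own formulas may differ in detail.

```latex
\begin{align*}
J_m &= \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{\,m} \, \|X_i - S_j\|^2
  && \text{(FCM objective)} \\
u_{ij} &= \left[ \sum_{k=1}^{C} \left( \frac{\|X_i - S_j\|}{\|X_i - S_k\|} \right)^{\frac{2}{m-1}} \right]^{-1}
  && \text{(FCM membership update)} \\
S_j &= \frac{\sum_{i=1}^{N} u_{ij}^{\,m} X_i}{\sum_{i=1}^{N} u_{ij}^{\,m}}
  && \text{(FCM centroid update)} \\
S_j(t+1) &= S_j(t) + \beta_t \, w_{dist}(j, j^*) \, \bigl(X_i - S_j(t)\bigr)
  && \text{(SOM node update)}
\end{align*}
```

In these forms the distance only enters through the norm terms, which is the part a different distance function would replace.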
There are many other interesting developments of clustering algorithms around K-Means, for example (Xu, 2016) or (Nasser, 2006). But our study is limited for the moment to the three algorithms presented in this section.

About hierarchical or recursive approaches
To solve the problem of large datasets, many attempts have been made with hierarchical or recursive analysis (Gowda, 2017) (De Silva, 2018) (Cardot, 2011). A divide and conquer strategy is applied: learn a K-Means clustering once, apply it to the whole dataset, and divide the dataset into several sub-groups. Then apply a new K-Means recursively on each sub-group until the partitions meet a predefined criterion: last level reached, minimum number of samples in the last group, statistical homogeneity of vectors, etc.

PROPOSED METHOD
First of all, considering the 3 clustering algorithms described previously, we propose to replace the Euclidean distance function by a spectral angle distance. It is defined as:

$SA(X_i, S_j) = \arccos\left(\frac{X_i \cdot S_j}{\|X_i\| \, \|S_j\|}\right)$

The spectral angle (SA) distance replaces the l2 norm in formulas (1), (2) and (4). The Spectral Angle Mapper algorithm is widely used in remote sensing, for classification purposes or to match a database spectrum with pixel spectra (May, 2013). The aim is to analyse its impact on the results.
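A small sketch may help to see why the spectral angle is insensitive to illumination: scaling a spectrum by a constant changes its l2 distance to the reference but not its angle. The function names, the zero-spectrum convention and the example spectra below are assumptions made for illustration only.

```python
import math


def spectral_angle(x, s):
    """Angle in radians between two spectra (0 means identical spectral shape)."""
    dot = sum(a * b for a, b in zip(x, s))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_s = math.sqrt(sum(b * b for b in s))
    if norm_x == 0.0 or norm_s == 0.0:
        return math.pi / 2  # convention assumed here for an all-zero spectrum
    # Clamp to [-1, 1] to absorb floating-point round-off before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_s)))
    return math.acos(cosine)


def l2_distance(x, s):
    """Euclidean (l2) distance, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, s)))


reference = [0.2, 0.4, 0.6, 0.3]
brighter = [2.0 * v for v in reference]  # same spectral shape, doubled illumination
other = [0.6, 0.3, 0.2, 0.4]             # different spectral shape

angle_same = spectral_angle(reference, brighter)   # close to zero
angle_other = spectral_angle(reference, other)     # clearly non-zero
l2_same = l2_distance(reference, brighter)         # large despite the same shape
```

Swapping `l2_distance` for `spectral_angle` in the assignment step of a clustering loop is the kind of substitution the method describes.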
The proposed hierarchical clustering is a top-down approach. It is defined by a number of levels L and a number of clusters C for each clustering. Many methods try to find an optimal partition of the input data. We consider here instead that the total number of clusters is not an issue: the final number of clusters will be large relative to the number of labeled classes that we are looking for.
A first clustering is applied on random samples (N pixels) inside the image. It produces C clusters, with C groups of image pixels. Then, at the second (hierarchical) stage, for each group of image pixels, we apply (recursively) the same clustering algorithm on random samples (N pixels), producing C clusters and C groups of image pixels. At this level, we obtain $C^2$ clusters and $C^2$ groups of pixels. The same algorithm is applied recursively to each group of pixels, producing each time C clusters and C groups of pixels. At the final level L, we obtain $C^L$ clusters.
Each individual clustering algorithm may be a K-Means, Fuzzy C-Means, Kohonen map or any other clustering approach.
The algorithm steps are listed below:
1. Start at the first hierarchical level (l = 1)
2. Extract N samples Xn from the group of pixels
3. Learn the C clusters $S_k$, ∀k ∈ [1; C]
4. For all pixels of the image, compute the distance to the nearest cluster, and assign the corresponding cluster id
5. Next hierarchical level (l + 1): loop on C
6. For a given cluster id (let's consider i0), extract N samples Xn from the data
7. Learn the C clusters $S_{i0,k}$, ∀k ∈ [1; C]
8. For all pixels of the image assigned to i0, compute the distance to the nearest cluster, and assign the corresponding cluster ids
9. Stop condition (example: maximum number of hierarchical levels L reached)
10. Else, recursive call to 5)
At each stage, the initial number of pixels is divided into C groups of pixels, so the complexity of the clustering is reduced for the lower levels. Each clustering adds C nodes to a hierarchical tree of clusters.
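The steps above can be sketched as a short recursive function. This is an illustrative, dependency-free sketch and not the authors' code: the names `rhc`, `kmeans` and `nearest`, the extra stop condition for groups smaller than C, and the synthetic pixels are all assumptions, and the Euclidean distance is used for simplicity.

```python
import random


def nearest(x, centroids):
    """Index of the centroid closest to x (squared l2 distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))


def kmeans(samples, c, rng, max_iter=50):
    """Small K-Means used at every node; returns C centroids."""
    centroids = [list(s) for s in rng.sample(samples, c)]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(x, centroids) for x in samples]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(c):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def rhc(pixels, indices, c, levels, n_samples, rng, leaves, assignment):
    """Recursively split the pixel group `indices` into C sub-groups per level."""
    # Steps 2 and 6: draw at most N random samples from the current group.
    drawn = rng.sample(indices, min(n_samples, len(indices)))
    # Steps 3 and 7: learn the C clusters of this node.
    centroids = kmeans([pixels[i] for i in drawn], c, rng)
    # Steps 4 and 8: assign every pixel of the group to its nearest cluster.
    groups = [[] for _ in range(c)]
    for i in indices:
        groups[nearest(pixels[i], centroids)].append(i)
    for j, group in enumerate(groups):
        # Step 9: stop at the last level, or when a group is too small to split
        # (the second condition is an assumption of this sketch).
        if levels == 1 or len(group) < c:
            leaf_id = len(leaves)
            leaves.append(centroids[j])
            for i in group:
                assignment[i] = leaf_id
        else:
            # Step 10: recursive call on the child group.
            rhc(pixels, group, c, levels - 1, n_samples, rng, leaves, assignment)


rng = random.Random(0)
# Synthetic "image": 400 pixels with 5 bands, drawn around 4 hypothetical spectra.
bases = [[0.1] * 5, [0.4] * 5, [0.7] * 5, [1.0] * 5]
pixels = [[v + rng.gauss(0.0, 0.02) for v in bases[k % 4]] for k in range(400)]
leaves, assignment = [], [None] * len(pixels)
rhc(pixels, list(range(len(pixels))), c=2, levels=3, n_samples=100,
    rng=rng, leaves=leaves, assignment=assignment)
```

Each call only touches the pixels of its own group, so sibling calls could run in parallel; `leaves` holds the final centroids and `assignment` the leaf id of every pixel.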
For a target number of clusters $C'$, we define a pair of values (C, L) as parameters of the recursive hierarchical clustering, to build $C^L$ clusters. Examples:
C = 2, L = 8 ⇒ $C^L$ = 256
C = 10, L = 3 ⇒ $C^L$ = 1000
Each learning and clustering step at a given level is done on its own group of pixels, so it is completely independent from the other parallel learning steps.
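The two examples can be expanded into a short worked count. As a sketch, assuming every node is split into exactly $C$ children down to level $L$ (no early stop), the number of leaf clusters, the number of clustering runs, and the number of distances evaluated to route one pixel to its leaf are:

```latex
\begin{align*}
\text{leaf clusters} &= C^L \\
\text{clustering runs} &= \sum_{l=1}^{L} C^{\,l-1} = \frac{C^L - 1}{C - 1} \\
\text{distances per pixel} &= C \cdot L \quad \text{(versus } C^L \text{ for a flat clustering)} \\[4pt]
C = 2,\ L = 8 &: \quad 256 \text{ leaves},\ 255 \text{ runs},\ 16 \text{ distances per pixel} \\
C = 10,\ L = 3 &: \quad 1000 \text{ leaves},\ 111 \text{ runs},\ 30 \text{ distances per pixel}
\end{align*}
```

Under this assumption, the saving during assignment comes from the per-pixel count, and each run only sees the pixels of its own group.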
Finally, each cluster must be linked to a labeled class. In this work, we consider that the available labeled samples are sparse. Each cluster is labeled according to its nearest class sample. If a class label has no representative cluster, it is linked to its nearest cluster.
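The two labeling rules can be illustrated as follows. This is one possible reading of the rule, written as a sketch: the function name `label_clusters`, the choice to let an unrepresented class take over the cluster nearest to its samples, and the example centroids and class names are assumptions.

```python
def sq_dist(x, s):
    """Squared l2 distance between two spectra."""
    return sum((a - b) ** 2 for a, b in zip(x, s))


def label_clusters(centroids, labeled_samples):
    """Return one class name per centroid.

    labeled_samples is a list of (spectrum, class_name) pairs taken from the
    sparse labeled patches.
    """
    # Rule 1: each cluster takes the class of its nearest labeled sample.
    cluster_class = [
        min(labeled_samples, key=lambda item: sq_dist(c, item[0]))[1]
        for c in centroids
    ]
    # Rule 2: a class without any cluster is linked to the cluster nearest to
    # one of its samples (here that cluster is relabeled, an assumption).
    for name in sorted({cls for _, cls in labeled_samples}):
        if name not in cluster_class:
            own = [spec for spec, cls in labeled_samples if cls == name]
            best = min(range(len(centroids)),
                       key=lambda j: min(sq_dist(centroids[j], s) for s in own))
            cluster_class[best] = name
    return cluster_class


centroids = [[0.1, 0.1], [0.2, 0.2], [0.9, 0.9]]
labeled = [([0.12, 0.1], "water"), ([0.85, 0.9], "tiles"), ([0.5, 0.5], "asphalt")]
cluster_class = label_clusters(centroids, labeled)
```

In this toy case the first rule alone would leave "asphalt" without a cluster, so the second rule hands it the middle centroid, which is the one closest to its sample.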
The main advantages of this method are:
1. Scalability: as only a reduced number of nodes is learned at each level, there is no need to take a very high number of samples. The method is able to manage large amounts of data
2. It produces a high number of clusters
3. Easier convergence: the number of clusters (C) is limited at each step, so the convergence is easier
4. Recursive algorithm: the same method is applied to the child clusters
5. Parallelism: each pixel subgroup may be processed independently
The main drawback is the loss of a homogeneous distribution of clusters. Two clusters in two separate branches of the hierarchical tree may be spectrally close. The number of clusters is deliberately chosen high, so that in the end the distance between the spectra in the image and the clusters should be reduced.

Pavia University
The ROSIS sensor acquired a scene over the University of Pavia during a flight campaign. The image contains 103 spectral bands. It is a 610*610 pixels image. The geometric resolution is 1.3 meters. The image ground truth has 9 classes: gravel, painted metal sheet, trees, asphalt, self-blocking bricks, bitumen, shadows, meadows, bare soil.
A very limited set of manual labels has been created from the input image and the ground truth. 30 small squares or rectangles with a size of around 3x3 pixels have been edited. It means that each class is represented by ∼3 patches. They are used for supervised classification (for comparison) and for the cluster labeling. Note that in the theoretical ground truth, some areas labeled as meadows contain both meadows and bare soil. Also, a large area with bare soil contains several classes.
The clustering algorithms K-Means, Fuzzy C-Means and Self Organizing Map are trained with 25 classes and 50k samples inside the image. For each method, an evaluation is done without Principal Component Analysis (PCA) and with a PCA keeping the first 10 components. The Recursive Hierarchical Clustering is computed on the input image, with 2 clusters per node (C=2) and a number of levels L varying from 3 to 8 (producing between 8 and 256 clusters). At each node, a K-Means clustering method is used with a maximum of 5000 samples to compute the C clusters. Comparison with classical clustering such as K-Means and Fuzzy C-Means, and with the different distances, is useful to evaluate which base clustering should be used in the Recursive Hierarchical Clustering.
For each cluster id, a maximum of 5000 samples is used to compute the C child clusters. Between levels, all pixels of the image are assigned to the corresponding cluster ids. At each level, and for each cluster id, a new set with a maximum of 5000 samples is used to compute the recursive clustering. Consequently, with this method, if the number of levels is high, all pixels of the image are taken into account to compute the clusters. As we do not try in this paper to use a large number of labeled samples, the performance is lower than in other scientific papers; our focus is on classification with a low number of labeled samples. Random Forest is used as a reference method for supervised classification, for the configurations with and without PCA.

The clustered image reveals the real classes in the image, with their heterogeneity. As seen in table 1, the comparison of spectral angle and Euclidean distance gives an advantage to the Euclidean distance. The legend of the classes is given in figure 3. From a qualitative point of view, both clusterings with spectral angle are able to separate the very green meadows. In the final classification map, K-Means (figure 1) and Fuzzy C-Means (figure 2) are able to separate several classes in the two large areas at the bottom right of the image. On the road, as there are several types of asphalt, the method would require more samples to label the clusters. It is important to note that the saturation of the overall accuracy at 0.65 is due to the problem of cluster labeling. Some labels are badly represented by the clusters due to the combined heterogeneity of the classes in this image and the low number of patches used. No study has been done with a higher number of labeled samples.

Figure 3. Legend of Pavia University classes

Pavia Centre
The ROSIS sensor acquired a scene over Pavia centre, in northern Italy, during a flight campaign. The number of spectral bands is 102. It is a 1096*1096 pixels image. The geometric resolution is 1.3 meters. The image ground truth differentiates 9 classes: water, trees, asphalt, self-blocking bricks, bitumen, tiles, shadows, meadows, bare soil. Pavia scenes were provided by Professor Paolo Gamba from the Telecommunications and Remote Sensing Laboratory, Pavia University (Italy).
For the evaluation of the method, 22 small squares or rectangles with a size of around 3x3 pixels have been edited. It means that each class is represented by ∼2 or 3 patches. They are used for supervised classification (for comparison) and for the cluster labeling.
The Recursive Hierarchical Clustering is computed with the same varying parameters as for the previous scenario: 2 clusters per node (C=2) and a number of levels L varying from 3 to 8 (producing between 8 and 256 clusters). At each node, a K-Means clustering method is used with a maximum of 5000 samples to compute the C clusters. Several conclusions can be drawn from the results in table 2. For this scenario, the use of PCA is an advantage for the clustering process. The Random Forest reference is better with PCA, which is due to the small number of labeled samples. K-Means in this case directly produces interesting results. It is able to produce comparable or better results than Fuzzy C-Means and Self Organizing Map. Performance and low computational complexity (with a low number of clusters) explain why K-Means is proposed as the base clustering algorithm for the Recursive Hierarchical Clustering method. The difference between Euclidean distance and spectral angle is small, with a slight advantage to the Euclidean distance. When the number of clusters increases for the Recursive Hierarchical Clustering method, the results become very close to the Random Forest reference. Beyond 32 clusters, the Overall Accuracy is very stable. Because each local clustering produces only C=2 clusters with N=5000 samples, the method is quite stable. The proposed method to label each cluster is quite simple (class label of the nearest samples), but seems to be robust if the number of clusters is high.

Figure 4 has been labeled with a very low number of samples. Some errors remain between water and shadows or dark roads. Errors may be reduced by simply adding new samples.
As expected, the computation of RHC is faster than that of K-Means for a high number of clusters. Furthermore, a high number of clusters can easily be computed without initialization or convergence problems thanks to the recursive approach. For the last scenario with 256 clusters on Pavia, the K-Means overall accuracy is equal to 0.91, which is very near the reference value obtained with the supervised classifier. But the use of K-Means with such a high number of clusters is not always possible when we consider large images and no dimension reduction.

CONCLUSIONS AND PERSPECTIVES
This paper presents the current work related to unsupervised clustering algorithms applied to hyperspectral images. The clustered images directly outline some areas of interest. Quantitative results with the spectral angle are lower than with the Euclidean distance. Nevertheless, the classification map obtained with the spectral angle method better reveals some particular regions.
The proposed Recursive Hierarchical Clustering method is able to outperform the standard clustering algorithms thanks to the high number of clusters, with a reduced processing time. Unsupervised clustering techniques are promising for classification purposes but also for anomaly detection.