DATA IMBALANCE IN LANDSLIDE SUSCEPTIBILITY ZONATION: UNDER-SAMPLING FOR CLASS-IMBALANCE LEARNING

ABSTRACT: Machine learning methods such as artificial neural networks and support vector machines require a large amount of training data; however, the number of landslide occurrences in a study area is limited. The limited number of landslides leads to a small number of positive-class pixels in the training data. On the contrary, the number of non-landslide pixels (negative-class pixels) is enormous. This under-represented data and severe class-distribution skew create a data imbalance for learning algorithms and yield suboptimal models, which are biased towards the majority class (non-landslide pixels) and perform poorly on the minority class (landslide pixels). In this work, we have used two algorithms, namely EasyEnsemble and BalanceCascade, for balancing the data. The balanced data is used with feature selection methods such as fisher discriminant analysis (FDA), logistic regression (LR) and artificial neural network (ANN) to generate LSZ maps. The results of the study show that ANN with balanced data shows major improvements in the preparation of susceptibility maps over imbalanced data, whereas the LR method is ill-affected by the data balancing algorithms. The FDA does not show significant changes between balanced and imbalanced data.


INTRODUCTION
Landslides are amongst the most devastating natural disasters, causing billions of dollars in property damage and thousands of deaths every year worldwide. India has more than 15% of its land area prone to landslides; hence mapping of these areas for the presence/absence of landslides is of utmost importance. Numerous studies have contributed to reducing the damage from landslides through modelling and production of susceptibility maps (Roodposhti et al., 2019, Jhunjhunwalla et al., 2019, Gupta et al., 2018, Shukla et al., 2016). Susceptibility mapping can be a crucial tool for a wide range of end-users, from both private and public sectors, aimed at hazard mitigation at both local and international levels. Landslide susceptibility zonation (LSZ) maps give approximate information about the occurrence of landslides. Susceptibility mapping requires data on the various factors responsible for slope instability. In this work we have considered seven causative factors: aspect, elevation, plan curvature, profile curvature, slope, tangential curvature and topographic wetness index.
In recent years, there has been an increasing application of machine learning techniques to complex real-world problems. The applications range from daily-life problems to national security, from processing of information to decision-making support systems, and from micro-scale analysis of data to macro-scale discovery of knowledge (Stumpf et al., 2012). Most standard machine learning algorithms presume or expect balanced class distributions or equal misclassification costs (He, Garcia, 2009) and suffer from data imbalance (Stumpf et al., 2014). Data imbalance refers to a scenario where majority classes dominate or overpower minority classes; in simple words, there is a disproportionate distribution of observations in each class. It leads to the classifier being more biased towards the dominating class (Pradhan et al., 2014). Generally, the data is imbalanced when the class ratio is of the order of 1:100, 1:1000 or 1:10000, i.e. the number of points in one class is 100, 1000 or 10000 times less than that in the other class (Liu et al., 2009). The imbalance level can be as high as 10^6. In this research, the class ratio is 1:300, i.e. for each landslide pixel we have more than 300 non-landslide pixels. When various machine learning approaches are used for the generation of LSZ maps, the algorithms do not classify the landslide pixels correctly. Therefore, it is necessary to reduce the imbalance in the susceptibility mapping data. There are two major data balancing techniques: oversampling of the minority class and under-sampling of the majority class (He, Garcia, 2009). Minority oversampling cannot be applied here, as it would create false landslide pixels. We under-sample the majority class (i.e., non-landslide pixels) using the BalanceCascade and EasyEnsemble methods.
Some of the techniques used by various authors to overcome data imbalance are random oversampling and under sampling, informed under sampling, Synthetic sampling with data generation etc. (Haixiang et al., 2017, Stumpf et al., 2014, Stumpf et al., 2012, Galar et al., 2012, Chawla, 2010, Liu et al., 2009, He, Garcia, 2009).
This work aims, first, at balancing the data using two different data balancing techniques, i.e. EasyEnsemble and BalanceCascade. The balanced data is then used for computing the weights using methods such as fisher discriminant analysis (FDA), logistic regression (LR) and artificial neural network (ANN) to generate LSZ maps. Furthermore, visual analysis, statistical quantities, Heidke Skill Score (HSS) and Recall are used to assess the quality of the susceptibility maps. Based on these observations, we can make assertions as to how accuracy is affected when balancing techniques are applied to our data, with respect to the imbalanced data (Jhunjhunwalla et al., 2019).

STUDY AREA & DATA RESOURCES
A small part of the Mandakini river basin of the Garhwal Himalaya in Uttarakhand has been considered for the study, as shown in Figure-1. The Mandakini river originates from the Chorabari Glacier near Kedarnath in Uttarakhand, India. The study area covers about 275.60 sq. km and lies between 30°19'00"N to 30°49'00"N latitude and 78°49'00"E to 79°21'13"E longitude. It falls in the Survey of India toposheet nos. 53J and 53N. The region has highly rugged topography, deep gorges and high peaks, where the higher areas are mostly snow-covered, forming the U-shaped wide valleys of a glacial landscape. The study area is highly prone to landslides, as many landslides occur here every year, especially during the monsoon season. In this research we have used the 30 m shuttle radar topography mission (SRTM) DEM, which was downloaded from EarthExplorer (www.earthexplorer.usgs.gov). The landslide causative factors can be either categorical, i.e. classifiable into a finite number of groups/classes (e.g. soil types in an area), or continuous (e.g. slope or elevation of the mountain). In this study, we have used only continuous data. Seven causative factors/layers, i.e. aspect, slope, topographic wetness index, elevation, profile curvature, plan curvature and tangential curvature, are considered for the preparation of the LSZ maps. These layers have been prepared from the 30 m spatial resolution SRTM elevation model using ArcGIS and QGIS software. The size of all the layers is 1028 × 801 pixels. The layers are converted into ASCII format for training of the models. Each layer is transformed to a column vector of size 823428 × 1, and the seven vectors are stacked to generate a matrix of size 823428 × 7. In this work, we have used three algorithms, i.e. FDA, LR and ANN, for finding the weights of the various factors, which are further used for finding the susceptibility index using weighted linear combinations.
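The layer-stacking step above can be sketched as follows; the factor names and the use of NumPy are illustrative assumptions, since the paper only states the layer size (1028 × 801) and the ASCII format:

```python
import numpy as np

# Hypothetical factor layers; each grid is assumed to be 1028 x 801,
# matching the layer size stated in the text.
factor_names = ["aspect", "slope", "twi", "elevation",
                "profile_curv", "plan_curv", "tangential_curv"]

def stack_layers(layers):
    """Flatten each 1028 x 801 layer to a column vector and stack the
    columns into a (1028*801) x n_factors feature matrix."""
    return np.hstack([layer.reshape(-1, 1) for layer in layers])

# Example with synthetic layers of the stated size:
layers = [np.random.rand(1028, 801) for _ in factor_names]
X = stack_layers(layers)
print(X.shape)  # (823428, 7)
```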

Data Balancing Algorithms
The following data balancing algorithms have been applied to the initial experimental data set to obtain a balanced data set.
3.1.1 EasyEnsemble This method works on samples of the majority class. It reduces the number of observations from the majority class to make the data set balanced. It is best used when the data set is huge and reducing the number of training samples helps to improve run time and storage requirements. EasyEnsemble is an example of informed under-sampling: it is an unsupervised approach that explores subsets of the majority class by independent random sampling with replacement (He, Garcia, 2009, Liu et al., 2009).
Algorithm - EasyEnsemble (Liu et al., 2009): for i = 1, ..., T, a subset Ni of the majority class is sampled randomly such that |Ni| equals the size of the minority class, and an AdaBoost ensemble Hi is learned from the minority class together with Ni, as given in Eq-1:

Hi(x) = sgn( Σ(j=1..Ii) αi,j Ci,j(x) − βi )    (1)

where Ii is the number of iterations, Ci,j is the weak classifier/learner, αi,j is the weight of Ci,j, and βi is the ensemble's threshold.

This uses the boosting algorithm, a repetitive technique that tunes the weight of each observation based on the previous classification, which is useful in reducing the bias error of machine learning models. Then, instead of voting among the decisions of all the classifiers to select one, boosting is used to combine the results of all T classifiers and make an improved final decision (Liu et al., 2009). The output ensemble is obtained as Eq-2:

H(x) = sgn( Σ(i=1..T) Σ(j=1..Ii) αi,j Ci,j(x) − Σ(i=1..T) βi )    (2)
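A minimal sketch of this procedure, assuming scikit-learn's AdaBoostClassifier as the per-subset boosted learner (the paper does not name an implementation, and the function names and parameters here are illustrative):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def easy_ensemble(X, y, n_subsets=10, random_state=0):
    """EasyEnsemble sketch: independently sample several balanced
    subsets of the majority class and train one AdaBoost learner
    (the Hi of Eq-1) on each balanced subset."""
    rng = np.random.default_rng(random_state)
    pos = np.flatnonzero(y == 1)   # minority (landslide) pixels
    neg = np.flatnonzero(y == 0)   # majority (non-landslide) pixels
    ensembles = []
    for _ in range(n_subsets):
        sample = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, sample])
        clf = AdaBoostClassifier(n_estimators=10, random_state=random_state)
        ensembles.append(clf.fit(X[idx], y[idx]))
    return ensembles

def predict_ensemble(ensembles, X):
    """Combine all learners by summing their decision scores (Eq-2)."""
    score = sum(clf.decision_function(X) for clf in ensembles)
    return (score > 0).astype(int)
```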
3.1.2 BalanceCascade In this method we create several balanced subsets of the data, and a weak classifier is trained on each subset. The method shrinks the majority-class training set at every step by removing all the examples that are correctly classified. It differs from EasyEnsemble in two ways. First, the weights are adjusted based on a false positive rate that the classifier has to achieve. Second, the majority samples that are correctly classified are removed. This sequential dependence mainly focuses on reducing the redundant information in the majority class.
In BalanceCascade, the training can finally be stopped when the size of the majority class (|M|) becomes less than the size of the minority class (|N|), as the majority class shrinks at every iteration. The main advantage of BalanceCascade is that it generates a restricted sample space from which to extract as much useful information as possible (Liu et al., 2009).
Algorithm - BalanceCascade (Liu et al., 2009): in round i, a balanced subset of the remaining majority class is sampled and an AdaBoost ensemble Hi is trained, as given in Eq-3:

Hi(x) = sgn( Σ(j=1..Ii) αi,j Ci,j(x) − βi )    (3)

where Ii is the number of iterations, Ci,j is the weak classifier/learner, αi,j is the weight of Ci,j, and βi is the ensemble's threshold. The threshold βi is adjusted such that Hi's false positive rate is f, and the majority examples correctly classified by Hi are then removed from the majority class. Similar to the EasyEnsemble method, the output ensemble is obtained as Eq-4:

H(x) = sgn( Σ(i=1..T) Σ(j=1..Ii) αi,j Ci,j(x) − Σ(i=1..T) βi )    (4)
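The cascade can be sketched as follows; the choice of logistic regression as the per-round learner and the quantile-based threshold adjustment are simplifying assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def balance_cascade(X, y, false_pos_rate=0.5, max_iter=10, random_state=0):
    """BalanceCascade sketch: after each round, majority examples that
    the current classifier already classifies correctly are removed;
    the threshold is set so that a fraction `false_pos_rate` of the
    remaining majority pool is still misclassified as positive."""
    rng = np.random.default_rng(random_state)
    pos = np.flatnonzero(y == 1)
    neg = list(np.flatnonzero(y == 0))
    classifiers = []
    for _ in range(max_iter):
        if len(neg) <= len(pos):   # stop once |majority| <= |minority|
            break
        sample = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, sample])
        clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        classifiers.append(clf)
        scores = clf.predict_proba(X[neg])[:, 1]
        thresh = np.quantile(scores, 1.0 - false_pos_rate)
        # Keep only the "hard" majority examples (still scored as positive).
        neg = [n for n, s in zip(neg, scores) if s >= thresh]
    return classifiers
```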

Methods of Weights Assignment
The following filter and wrapper methods have been used for finding the weights of the factors.

Fisher Discriminant Analysis (FDA):
This method is used in pattern recognition, statistics and machine learning for finding the linear combination of features that distinguishes two or more classes of events or objects (Jhunjhunwalla et al., 2019). In this method, all the factors/layers are projected onto one dimension corresponding to landslide occurrences. A multiplicative factor is required for the projection of the data; this multiplicative factor is used for assigning weights to all the thematic layers (Gupta et al., 2018).
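A hand-rolled sketch of the Fisher weight computation (the projection direction w = Sw⁻¹(μ1 − μ0)); the final normalisation step is an assumption made here so the components can serve as layer weights:

```python
import numpy as np

def fisher_weights(X, y):
    """Fisher discriminant direction w = Sw^-1 (mu1 - mu0): the 1-D
    projection that best separates landslide (1) from non-landslide
    (0) pixels; its components are used as factor weights."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) \
       + np.cov(X1, rowvar=False) * (len(X1) - 1)
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.abs(w).sum()   # normalise magnitudes to sum to 1
```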

Logistic Regression (LR):
It is a generalized linear model that predicts the probability of the occurrence of an event by using the logit function (Gupta et al., 2018). In this method, the probability of presence or absence of a binary outcome (1 = landslide and 0 = no landslide) is modelled based on the values of predictor variables (Shukla et al., 2016). The independent variables can be interval or categorical, while the dependent variable can be multinomial or binary. The LR coefficients are used for assigning weights to all the factors.
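A minimal sketch of deriving factor weights from the LR coefficients, assuming a scikit-learn implementation (the solver settings are illustrative, not from the paper):

```python
from sklearn.linear_model import LogisticRegression

def lr_weights(X, y):
    """Fit P(landslide) = sigmoid(w.x + b) and return the fitted
    coefficients w as factor weights."""
    model = LogisticRegression(max_iter=1000).fit(X, y)
    return model.coef_.ravel()
```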

Artificial Neural Network (ANN):
ANN is a computational system inspired by the human brain. The network consists of a set of neurons arranged in an input layer, an output layer and a few hidden layers. The layers are interconnected, and the input data is passed to the output layer by means of the hidden layers. There can be one or more hidden layers depending on the complexity of the data. ANN is useful when a complex relationship exists between the data and the responses, such as between landslide factors and landslide occurrences (Jhunjhunwalla et al., 2019).
ANN requires several architectural and training parameters to be selected prior to the analysis. The optimal number of hidden layers and the number of neurons per hidden layer are not known a priori. These parameters are empirically determined through rigorous experimentation and examination of different parameter settings (Blackard, Dean, 1999).
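The empirical architecture search described above might be sketched as follows; the candidate layouts and the use of cross-validated accuracy are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def select_architecture(X, y, candidates=((4,), (8,), (8, 4))):
    """Pick a hidden-layer layout empirically by cross-validated
    accuracy, since the optimal architecture is not known a priori."""
    best, best_score = None, -np.inf
    for layout in candidates:
        clf = MLPClassifier(hidden_layer_sizes=layout, max_iter=500,
                            random_state=0)
        score = cross_val_score(clf, X, y, cv=3).mean()
        if score > best_score:
            best, best_score = layout, score
    return best
```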

Landslide Susceptibility Index Computation and Susceptibility Mapping
The weights obtained in the previous section are used for the computation of the landslide susceptibility index (LSI). LSI is calculated using the weighted linear combination method (Jhunjhunwalla et al., 2019, Gupta et al., 2018, Michael, Samanta, 2016), as given in Eq-5:

LSI = Σ (attribute × weight)    (5)

The LSI is classified into five different zones, i.e. the LSZ (from very high susceptibility to very low susceptibility), based on natural breaks in the data. Results obtained with/without data balancing are compared using statistical quantities and visual quality analysis. The methodology used in the current study is shown in Figure-2.
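Eq-5 and the zonation step can be sketched as below; equal-interval class edges stand in here for the natural-breaks classification, which would require a Jenks implementation:

```python
import numpy as np

def compute_lsi(factors, weights):
    """Eq-5: LSI = sum over layers of (attribute * weight), then
    min-max normalised to [0, 1]."""
    lsi = factors @ weights
    return (lsi - lsi.min()) / (lsi.max() - lsi.min())

def classify_zones(lsi, n_zones=5):
    """Split normalised LSI into five susceptibility zones (very low
    .. very high); equal intervals stand in for natural breaks."""
    edges = np.linspace(0.0, 1.0, n_zones + 1)[1:-1]
    return np.digitize(lsi, edges)   # zone labels 0..4
```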
The EasyEnsemble and BalanceCascade methods were applied on a total of 30 subsets, which were randomly drawn from the non-landslide pixels by the algorithm in Python. After the data balancing, the FDA, LR and ANN methods were applied on these 30 subsets of data, which resulted in 30 LSI images for each of the methods. The mean and median of these 30 LSI images were then taken to generate a mean image and a median image for each method.
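The aggregation of the 30 LSI images into a mean image and a median image can be sketched as:

```python
import numpy as np

def aggregate_lsi(lsi_stack):
    """Per-pixel mean and median over a stack of LSI images
    (shape: n_images x rows x cols)."""
    stack = np.asarray(lsi_stack)
    return stack.mean(axis=0), np.median(stack, axis=0)
```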
Accuracy Assessment
3.4.1 Heidke Skill Score: The accuracy of the LSZ maps is measured using the Heidke Skill Score (HSS). It is a measure of the skill of prediction and lies between 0 and 1. HSS is defined as given in Eq-6 (NDFD Verification Score Definitions, 2017, Hyvärinen, 2014):

HSS = (NC − E) / (T − E)    (6)
where NC is the number of correct predictions, i.e. the number of times the prediction and observation match, E is the number of predictions expected to be correct by chance, and T is the total number of observations.
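A sketch of Eq-6, computing the chance-expected number of correct predictions E from the marginal totals of the 2 × 2 contingency table:

```python
import numpy as np

def heidke_skill_score(pred, obs):
    """HSS = (NC - E) / (T - E): NC = correct predictions, T = total
    observations, E = number of correct predictions expected by
    chance, from the marginal totals of the contingency table."""
    pred, obs = np.asarray(pred), np.asarray(obs)
    t = pred.size
    nc = np.sum(pred == obs)
    e = sum(np.sum(pred == c) * np.sum(obs == c) for c in (0, 1)) / t
    return (nc - e) / (t - e)
```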
3.4.2 Recall: Recall is defined as the ratio of the relevant instances (landslides) predicted correctly by the model to the actual number of relevant instances. It is also known as "sensitivity". It gives a measure of how accurately the model predicts with respect to the actual instances of that class. Recall is computed using the expression in Eq-7:

Recall = TruePositive / (TruePositive + FalseNegative)    (7)

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLII-3/W11, 2020 PECORA 21/ISRSE 38 Joint Meeting, 6-11 October 2019, Baltimore, Maryland, USA

With Imbalanced Data
The weights and ranking of the factors are important for the preparation of landslide susceptibility zonation maps. The weights and rankings obtained by FDA, LR and ANN for all seven causative factors without using the data balancing algorithms are given in Table-1. The weights in Table-1 are multiplied with the corresponding causative factor layers to obtain the LSI values. The LSI values are normalized between 0 and 1 and classified into five different zones, i.e. from very high susceptibility to very low susceptibility (as shown in Figures 3, 4(a) and 5(a)), based on natural breaks in the data. The mean and median of the LSI values for all the methods (except ANN) are observed to be near 0.55 (refer Table-2), so 0.55 is set as the threshold for classification.

Figure 3. LSZ map obtained using weights from LR, without data balancing

The landslides with LSI greater than 0.55 are considered to be correctly classified, and those below 0.55 are considered to be falsely classified. As can be seen from Table-2, ANN has mean/median values significantly less than 0.55, and hence the susceptibility maps generated by ANN have the maximum area lying in the low or very low susceptibility zones. Hence, after data balancing, the mean/median value of the LSI for ANN should increase significantly.
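The recall of Eq-7, evaluated at the 0.55 LSI classification threshold described above, can be sketched as:

```python
import numpy as np

def recall_at_threshold(lsi, landslide_mask, threshold=0.55):
    """Recall = TP / (TP + FN): the fraction of actual landslide
    pixels whose normalised LSI reaches the classification threshold."""
    pred = np.asarray(lsi) >= threshold
    actual = np.asarray(landslide_mask).astype(bool)
    tp = np.sum(pred & actual)
    fn = np.sum(~pred & actual)
    return tp / (tp + fn)
```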

With Balanced Data
The imbalanced data is provided to the EasyEnsemble and BalanceCascade algorithms, and the data is balanced to match the size of the minority-class pixels. The feature selection algorithms are applied to the balanced data and the weights are obtained. These weights are multiplied with the different layers, and the susceptibility index maps are generated. The statistical quantities for the LSI using all three methods are given in Table-3. The mean and median for ANN have increased significantly using balanced data; however, the values for LR have been reduced, and the mean/median is very small compared to that without balancing the data. The mean/median obtained for the LSI values using FDA does not show significant changes. The susceptibility maps obtained using the balanced data are shown in Figures 4(b) and 5(b).
The results of the accuracy assessment using HSS and Recall are given in Table-. The LR method does not show good results with balanced data, which agrees with earlier findings (Crone, Finlay, 2012, King, Zeng, 2001). The FDA method may or may not show major changes in the results with/without data balancing, which also agrees with the results of a few studies (Xue, Hall, 2015, Xue, Titterington, 2008).

CONCLUSIONS
Data balancing methods improve accuracy for machine learning based methods such as ANN, support vector machine etc. The EasyEnsemble method coupled with the mean of weights seems to overpredict the high susceptibility zones, whereas the BalanceCascade method with the median of weights generated by ANN gives the most appropriate LSZ map based on visual analysis. Using statistical quantities, the LSI generated from both data balancing methods shows mean and median values greater than or very near to 0.55, which also justifies the importance of data balancing before using ANN. The HSS and recall values show the superiority of the LSZ maps prepared using the mean of weights in ANN with the data balanced using the EasyEnsemble method. The balanced data does not show good results with logistic regression, as the LR method is not able to model the underlying probability distribution. Balanced data does not affect the FDA method, which has approximately the same accuracy as before. The landslide data is highly imbalanced in nature; hence, balancing algorithms must be applied before the preparation of LSZ maps using machine learning methods. However, as seen from the results, methods such as FDA do not need balancing.