A GENERIC MACHINE LEARNING-BASED FRAMEWORK FOR PREDICTIVE MODELING OF LAND SURFACE TEMPERATURE

In the realm of data analytics, machine learning (ML) is one of the most successful techniques for making predictions. The ability of ML algorithms has also been studied in various aspects of land surface temperature (LST) besides its retrieval. The few investigations on LST retrieval using ML algorithms suggest that they can potentially obtain LST values by incorporating relevant land surface parameters; however, the variables and ML models used differ, and so do their accuracies. The accuracy of a model is affected by each variable's contribution, its quality and quantity, and the fulfilment of each technique's assumptions. Hence, this study provides a wide range of LST indicators to be employed for LST retrieval using a widely used ML algorithm, random forest. The ML framework for LST prediction is illustrated with significant spectral indices and terrain parameters across a highly industrialised district of Jharkhand, India. With the exception of one variable (aspect), the analysis shows that all 20 variables included as independent factors were significant and contributed equally to the model. The model built with all the variables, including the aspect of the terrain, obtained an RMSE of 1.13 °C and an R² of 0.48. However, after the removal of aspect, the model obtained an R² of 0.89 and an RMSE of 0.74 °C. The performance of the model on consecutive removal of less significant variables is evaluated, and the study makes clear how crucial it is to consider several environmental or land-use factors that could be pertinent to LST.


INTRODUCTION
Urbanisation and industrialisation drive rapid transformation of land use and land cover patterns (Kedia et al., 2021; Maheng et al., 2021). Although such development improves living standards in certain ways and causes social changes, it also impacts the environment and ultimately contributes to climate change. Human-induced changes in the land use/land cover (LULC) type transform pervious surfaces into impervious ones, which reduces the outgoing longwave terrestrial emissions while absorbing shortwave solar radiation, increasing the temperature of the atmosphere and the land surface. The local climate and rainfall patterns are altered by such temperature fluctuations (Eleftheriou et al., 2018; Ranagalage et al., 2018; Salwan et al., 2021). Several studies have found that LULC type significantly impacts the relative rise in land surface temperature (LST), particularly in urban areas. The world's population is expected to be exposed to human-induced climate change in metropolitan areas, with a projected threefold increase in urban land use accompanied by heat stress (Seto et al., 2012; IPCC, 2014; Argüeso et al., 2015). LST contributes substantially to land surface processes, which affect the Earth's environmental and climatic conditions (Anderson et al., 2008; Kustas and Anderson, 2009; Karnieli et al., 2010). Given the significant heterogeneity of land use and land cover patterns, understanding and keeping track of the dynamics of LST is crucial.
LST is typically obtained via remote sensors that collect data from one or more channels in the thermal infrared window of the electromagnetic spectrum (Dar et al., 2019; Ermida et al., 2020). Numerous LST retrieval algorithms have been established for the Landsat satellite series (Sattari and Hashim, 2014; Duan et al., 2020). Although earlier studies demonstrated a relationship between various spectral indices (especially LULC indices) and LST, these relationships tend to be non-linear. Several techniques have been put forth to retrieve LST and quantify its relationship with associated variables. One such example is geographically weighted regression (GWR), which uses linear models; however, these are insufficient to simulate the non-linear relations between LST and its indicators. Furthermore, multicollinearity has a significant negative impact on GWR and can result in inaccurate predictions when the associated surface parameters are highly correlated (Li et al., 2019; Zhao et al., 2018; Jia et al., 2021). Hence, researchers have implemented machine learning (ML) algorithms to address the shortcomings of traditional LST retrieval.
In the realm of data analytics, ML is one of the most successful techniques for making predictions using models and algorithms (Angra and Ahuja, 2017; Dhall et al., 2020). Although there is a paucity of research employing ML algorithms to retrieve LST, the technique has been used in other aspects of LST studies, such as spatial downscaling, simulation, addressing meteorological conditions, and similar tasks (Li et al., 2019; Buo et al., 2021; Maithani et al., 2022; Xu et al., 2021). Wang et al. (2022) employed land cover categories and Landsat bands to estimate the LST across the Tibetan plateau using random forest (RF). The LST from the RF-trained model was the most accurate, with the lowest root mean square error (RMSE) (1.89 Kelvin), according to Wang et al., who also obtained and examined the LST from the single channel technique, the linear regression model (2.77 Kelvin), and the moderate resolution imaging spectroradiometer product (MOD11A1) (3.62 Kelvin). To acquire the LST over Dehradun using an artificial neural network, Maithani et al. (2022) employed built-up densities, with mean absolute errors of 1.5 °C and 0.9 °C, while Rana and Suryanarayana (2022) employed four ML techniques (K-nearest neighbour, neural network, regression tree, and support vector machine) incorporating three indices, resulting in RMSEs of 0.54 °C, 0.59 °C, 0.89 °C, and 0.61 °C, respectively. Mohammad et al. (2022) predicted the LST over Ahmedabad with an RMSE of 0.03 °C using an XGB regressor.
Although earlier studies demonstrated the potential of ML algorithms to retrieve the LST by incorporating related variables such as land surface parameters, the variables and the ML models utilised vary, and so do their accuracies. Each variable's contribution, its quality and quantity, and the fulfilment of each technique's assumptions all have an impact on the model's accuracy (Zhong et al., 2019). For example, it is generally known that ML models suffer when there are few observations available, and this issue could worsen as more predictors are added. Herein, a robust ML algorithm, RF, is employed to extract LST in the study area. RF, first described in Breiman (2001), has proven efficient for generating spatial predictions, and several studies (Hengl et al., 2018; Meyer et al., 2019; Sekuli et al., 2020; Pouyan et al., 2022; Wang et al., 2022) have verified that it is a promising technique. Since LST retrieval is vital for sustainable monitoring and planning in every rapidly growing region, our prime goals are:
• to retrieve LST over the mineral-rich East Singhbhum district of Jharkhand, India,
• to quantify the relationship between the spatially varying LST and surface parameters,
• and to determine the effectiveness of the variables in the ML model to predict the LST.

Study site description
East Singhbhum, also called Purbi Singhbhum (Figure 1), one of the districts with the highest mining potential, is considered for this study. It is situated at the extreme southeast of Jharkhand, with a longitudinal and latitudinal extent of 86°04' to 86°54' E and 22°12' to 23°01' N, respectively. In terms of industrial development, and mining and quarrying, the district leads the state of Jharkhand. Approximately 53% of the district's total area is made up of residual mountains and hills formed of granite, gneiss, and schist. It is situated on the Chhotanagpur plateau, surrounded by lush forest from west to east and the Dalma Range on the northern edge. The Subernarekha river runs across the district from the west to the south-east. The annual rainfall ranges from 1200 to 1400 mm, and the climate is temperate. The minimum recorded winter temperature is 8 degrees Celsius, while summer temperatures reach 40 to 45 degrees Celsius. Minerals are abundant in the district. The primary minerals include iron ore, copper, uranium, gold, and kyanite (https://jamshedpur.nic.in/about-district/). The region is a heavily mineralised zone that has undergone substantial mineral extraction, which could result in collateral environmental damage (Singh et al., 2018). It is essential to monitor the growth and associated climate variables because Jharkhand's urbanisation has increased across all districts, with East Singhbhum experiencing a considerable growth rate (Kumar and Reshmi, 2018).
For the generation of the various LULC indices, we utilised the atmospherically corrected surface reflectance (SR) data from the Landsat 8 Collection 2, Tier 1 (highest available data quality) datasets, which have been processed using a sophisticated data processing method and algorithm.
The statistical mono-window (SMW) algorithm is based on the empirical link between the top-of-atmosphere (TOA) brightness temperatures (BT) in a single thermal infrared channel and LST via a linear regression (Sun et al., 2004; Jiménez-Muñoz and Sobrino, 2003). Landsat's TOA-BT and SR collections have a cloud mask applied to them. The two nearest total column water vapour (TCWV) NCEP analysis times for each TOA-BT image are chosen and linearly interpolated to the Landsat observation time. The normalised difference vegetation index (NDVI) is computed using the SR data and is then used to estimate the fractional vegetation cover (FVC). The equivalent Landsat emissivity is then calculated using the ASTER emissivity values.
$$\varepsilon_b = FVC \cdot \varepsilon_{b,veg} + (1 - FVC) \cdot \varepsilon_{b,bare}$$
where, for a specific spectral band b, $\varepsilon_{b,veg}$ and $\varepsilon_{b,bare}$ represent the emissivity of dense vegetation and bare soil, respectively. Because the emissivity of vegetated surfaces typically exhibits only minor fluctuations in the TIR region, the emissivity of vegetation is taken as 0.99, as recommended (Peres and DaCamara, 2005). The bare-ground surface emissivity is derived from the ASTER GEDv3 and is therefore modified to correspond to Landsat's thermal bands using the coefficients provided by Malakar et al. (2018). The TOA BT of the Landsat TIR band is then subjected to the SMW algorithm. The TCWV from NCEP and the designated classes are then used to overlay the algorithm's coefficients over the Landsat image.
where Tb is the TIR channel's TOA BT, and ε is the channel's surface emissivity. The coefficients Ai, Bi, and Ci were computed from linear regressions of radiative transfer simulations following Martin et al. (2016), using a dataset of air temperature, water vapour, and ozone profiles generated by Borbas et al. (2011).
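The emissivity and SMW steps above can be sketched in a few lines. This is a minimal illustration rather than the study's implementation: it assumes the commonly used squared-ratio form of FVC, FVC = ((NDVI - NDVIb) / (NDVIv - NDVIb))², and the usual SMW relation LST = Ai·Tb/ε + Bi/ε + Ci. The function names and the coefficient values in the usage line are hypothetical placeholders; real coefficients depend on the water vapour class and are not given in the text.

```python
def ndvi(nir, red):
    """Normalised difference vegetation index from surface reflectance."""
    return (nir - red) / (nir + red)


def fractional_vegetation_cover(ndvi_value, ndvi_bare=0.2, ndvi_veg=0.9):
    """FVC from NDVI (assumed squared-ratio form), clamped to [0, 1].

    The default thresholds are the NDVIb and NDVIv values quoted in the text.
    """
    ratio = (ndvi_value - ndvi_bare) / (ndvi_veg - ndvi_bare)
    ratio = min(max(ratio, 0.0), 1.0)
    return ratio ** 2


def band_emissivity(fvc, eps_bare, eps_veg=0.99):
    """Mix vegetation and bare-ground emissivity by vegetation fraction."""
    return fvc * eps_veg + (1.0 - fvc) * eps_bare


def smw_lst(tb, emissivity, a, b, c):
    """Assumed SMW form: LST = a * Tb / eps + b / eps + c (Tb in Kelvin)."""
    return a * tb / emissivity + b / emissivity + c


# Usage with made-up inputs; a, b, c are placeholders, not published values.
fvc = fractional_vegetation_cover(ndvi(nir=0.45, red=0.10))
eps = band_emissivity(fvc, eps_bare=0.95)
lst_kelvin = smw_lst(tb=300.0, emissivity=eps, a=0.98, b=5.0, c=-3.0)
```

Clamping the NDVI ratio keeps FVC within [0, 1] for pixels whose NDVI falls outside the two thresholds.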

Machine learning-based LST retrieval:
Random forest is a non-parametric, non-linear ML technique that creates decision trees on several samples and predicts continuous or categorical data by averaging the values or taking the majority vote, respectively (Breiman, 2001). The creation of RF models entails modelling the target variable, sometimes referred to as the dependent variable, using a dataset of predictor or independent variables. Each tree is built using a different bootstrapped dataset, and predictors are randomly selected from the entire array to act as candidates for each split. The approach de-correlates the trees because they do not rely on the same factors, reducing the variance of the predictions (Zhong et al., 2021). Here, we created an RF regression model with LST as the target variable to investigate the spatial variation of the LST over the study area using spectral indices as predictor variables.
The values of all the parameters were then extracted at the corresponding coordinates of the 429 total points that were generated at equal intervals over the entire study region, ensuring local heterogeneity. The necessary packages for the random forest algorithms were installed and loaded using the RStudio integrated development environment of R to train the model (R Core Team, 2020). To ensure that the variables generated are relevant, they were passed to the Boruta algorithm, which uses a wrapper approach built around a random forest (Kursa and Rudnicki, 2010). The selection was cross-checked against the random forest variable importance ranking of the ranger package. The RF model was implemented using the computationally faster ranger package, which contains several parameters that can be fine-tuned. The most important parameters were considered when developing the model (Probst and Boulesteix, 2017; Probst et al., 2019):
• Mtry: number of variables to be drawn at each split in a tree.
• Node size: minimum number of observations in a terminal node.
• Ntree: number of trees in the forest.
• Sample size: number of observations drawn for each tree.
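An exhaustive grid search over these four parameters can be sketched as follows. This is an illustration under stated assumptions, not the authors' R code: the candidate values in `GRID` are hypothetical, and `placeholder_score` is a stand-in that is zero at the configuration reported in the Results (mtry 3, node size 10, sample fraction 0.7, 500 trees) purely so the example has a known answer. A real run would train a forest for each combination and return a validation error such as RMSE.

```python
from itertools import product

# Hypothetical candidate values for each hyperparameter.
GRID = {
    "mtry": [2, 3, 4, 5],
    "min_node_size": [5, 10, 15],
    "sample_fraction": [0.6, 0.7, 0.8],
    "num_trees": [300, 500, 700],
}


def grid_search(grid, score):
    """Return (best_params, best_score) with the lowest score over the grid."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score


def placeholder_score(params):
    """Stand-in for a validation error; NOT a trained model."""
    target = {"mtry": 3, "min_node_size": 10,
              "sample_fraction": 0.7, "num_trees": 500}
    return sum(abs(params[k] - target[k]) / target[k] for k in target)


best_params, best_score = grid_search(GRID, placeholder_score)
```

The search visits every combination (4 × 3 × 3 × 3 = 108 here), which is affordable for a small grid but grows multiplicatively with each added parameter.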
To determine whether the trained model's predictive ability generalises to an unseen dataset, we randomly split the sample points (429 points) into 70 and 30 per cent portions and used the latter for validation. The predictor variables' images were then used to predict the LST for the entire study region with the trained and validated model that had reached its maximum accuracy and stability based on root mean square error (RMSE) and R-squared (R²). To assess the effectiveness of the variables generated, the model was run repeatedly with fine-tuned hyperparameters, starting with all the variables and consecutively removing the less significant ones.
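The split and the two accuracy measures can be sketched as below. This is a minimal illustration with hypothetical function names: the seed is arbitrary, and R² is computed here as the coefficient of determination (1 - SS_res/SS_tot), which is an assumption since the text does not state which definition was used.

```python
import math
import random


def train_test_split(points, train_fraction=0.7, seed=42):
    """Shuffle the points reproducibly and split them into train/test lists."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# 429 sample points, as in the text; the point IDs stand in for real samples.
train_ids, test_ids = train_test_split(range(429))
```

With 429 points and a 0.7 fraction, this yields 300 training and 129 validation points.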

RESULTS AND DISCUSSION
To demonstrate the spatial distribution of surface temperature in response to the surface parameters of the study area, an LST map for the year 2021 was generated using the SMW algorithm.
Table 1. Spectral indices of the surface parameters
Spectral indices (References): BSI (Abdullah et al., 2022); EBBI (Ramaiah et al., 2020)

Figure 4. Correlation matrix of the LST and the LULC indices
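Each cell of such a correlation matrix is a pairwise correlation coefficient between two variables. The sketch below assumes the Pearson coefficient, which the text does not specify, and uses made-up NDVI and LST samples purely for illustration; they are not values from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: denser vegetation paired with cooler surfaces.
ndvi_samples = [0.15, 0.30, 0.45, 0.60, 0.75]
lst_samples = [38.0, 35.5, 33.0, 30.0, 27.5]
r_ndvi_lst = pearson_r(ndvi_samples, lst_samples)
```

A coefficient close to -1 for such a pair corresponds to the negative vegetation-LST relation described in the text, while built-up indices would be expected to give positive values.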
The variables that are highly correlated with LST, whether positively or negatively, are shown to be important for the prediction of the LST by the Boruta algorithm (Figure 5). With the exception of aspect, whose insignificance is due to the planar study region (70% of the study area has nearly level or very gentle slopes), all variables were determined to be significant for the LST prediction. To predict the LST, the RF model was built using the variables selected by the feature selection algorithm. The optimal mtry, minimum node size, sample size, and ntree values obtained via the hyperparameter grid search were 3 (Figure 6), 10, 0.7 (70%), and 500 (Figure 7), respectively. Less correlated trees are produced by using smaller sample sizes, larger node size values, and smaller mtry values (Probst et al., 2019). These trees are more distinct from one another; thus, they produce more distinctive predictions. The model predicted the LST on the testing dataset with an R² of 0.89 and an RMSE of 0.74 degree Celsius. The LST from the SMW algorithm ranges from 22.69 to 42.23 degree Celsius, while the RF-predicted LST ranges from 24.75 to 41.09 degree Celsius. The difference in the temperature ranges is due to the sparse coverage of temperatures above 41 degrees Celsius (0.01 square kilometre) and below 24 degrees Celsius (0.06 square kilometre), which left the model with too few training samples at these extremes. It was not possible to infer that expanding the dataset would increase the model's accuracy, as proposed by Wang et al. (2022), because when we increased the training points in our case, the majority of the region was predicted with hardly any spatial variance of the LST.
To evaluate the effectiveness of the variables, the model was run repeatedly, starting with all the variables and gradually removing one less significant variable at a time. The model built with all the variables, including aspect, obtained an RMSE of 1.13 and an R² of 0.48; removing aspect improved the model, which then obtained an R² of 0.89 and an RMSE of 0.74 using all the other variables. The performance of the model on consecutive removal of less significant variables is displayed in Figure 8.
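The consecutive-removal procedure can be expressed as a simple loop. The sketch below is an assumption-laden illustration, not the study's code: `ranked_variables` is a hypothetical importance ordering (most to least important), and `placeholder_evaluate` stands in for retraining the forest and returning its validation RMSE and R²; it returns no performance figures.

```python
def backward_elimination(ranked_variables, evaluate, min_variables=1):
    """Evaluate the model repeatedly, dropping the least important variable.

    ranked_variables is ordered from most to least important; evaluate is any
    callable that takes the current variable list and returns a score object.
    """
    history = []
    current = list(ranked_variables)
    while len(current) >= min_variables:
        history.append((tuple(current), evaluate(current)))
        current = current[:-1]  # remove the least important variable
    return history


def placeholder_evaluate(variables):
    """Stand-in for retraining and scoring a model; records only the count."""
    return {"n_variables": len(variables)}


# Hypothetical ranking of three predictors, for illustration only.
ranked_variables = ["NDBI", "NDVI", "slope"]
history = backward_elimination(ranked_variables, placeholder_evaluate)
```

Each entry of `history` pairs a variable subset with its score, which is the information needed to plot performance against the number of retained variables.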

CONCLUSION
Given the likelihood that global temperatures will continue to rise, it is crucial to detect and monitor LST frequently. The LST variation in response to surface parameters was determined using spectral indices in East Singhbhum, a mineral-abundant and industrially developed district of Jharkhand.
The study has also quantified the relationship between the spatially varying LST and surface parameters. The analysis showed substantial variations in the temperatures of each type of land use and cover. The ML algorithm predicted the LST by incorporating terrain parameters and significant spectral indices of vegetation, urban areas, soil or bare land, and water, thus demonstrating the potential of the algorithm to predict the LST with known errors. Provided that the variables used by the ML algorithm are not redundant, other surface parameters may be investigated. Furthermore, to prevent environmental extremes, the LST must be closely monitored due to the anticipated increase in population and rising urbanisation and industrialisation. The study's findings may also be helpful in developing site-specific adaptation plans and strategies to mitigate environmental effects and enhance urban residents' quality of life. City planners and policymakers may take specific steps to make the city less sensitive to climate change by assessing the parameter (LST) that is influenced by compact land use patterns.

Figure 1. Location map of the study site
where NDVIb and NDVIv represent, respectively, the NDVI values of soil or bare land and densely vegetated pixels. The threshold values are set to NDVIb = 0.2 and NDVIv = 0.9 in accordance with Jiménez-Muñoz et al. (2009), since the highest NDVI value in our study area is 0.88, and pixels with values below 0.2 are non-vegetated surfaces. We used the mean ASTER GEDv3 NDVI and the NDVI calculated from Landsat to account for vegetation.
green, red, near infrared, short wavelength infrared 1, and short wavelength infrared 2 bands of Landsat 8.

Figure 3. LST variation in response to LULC: (a) industrial and settlement area, (b) quarrying/mining area, (c) agricultural and settlement area, (d) waterbody, (e) vegetation, (f) bare ground
The mean LST for the year 2021 retrieved by the SMW algorithm ranged from 22.69 to 42.23 degree Celsius. The distinct thermal signatures exhibited in the LST maps are due to the different attributes of the various land use/cover types. Throughout the period, temperatures were high in places with factories and industries, mining and quarrying locations, urban settlements, and agricultural lands. On the other hand, waterbodies and vegetation had the lowest temperatures throughout the area. The graph shown in Figure 4 depicts a positive correlation between the urban spectral indices (BSI, EBBI, MNDWI, NBAI, NBI, NDBaI, NDBI, NDBSI, NMDI, and UI) and the LST. In the case of vegetation and waterbodies, the opposite is expected to be true. Here, the vegetation indices are negatively correlated with LST because wherever there is vegetation cover, the surface temperature is low (Mishra et al., 2021). Similarly, the NDWI-LST relation is low as water temperature varies, implying a negative correlation between them.

Figure 8. Comparison of the model performance based on the variables' importance
Since the statistical mono-window (SMW) algorithm developed by the Climate Monitoring Satellite Application Facility (CM-SAF) is a reliable technique to provide LST with high accuracy, it was applied to derive LST in this study using Google Earth Engine (Gorelick et al., 2017) (Figure 2).