PASIG RIVER WATER QUALITY ESTIMATION USING AN EMPIRICAL ORDINARY LEAST SQUARES REGRESSION MODEL OF SENTINEL-2 SATELLITE IMAGES

This study generates empirical ordinary least squares (OLS) regression models that estimate water quality (WQ) parameters of the Pasig River in the Philippines from remote sensing data. The models are built on measurements of the primary WQ parameters defined in Department of Environment and Natural Resources Administrative Order 2016-08, as recorded in the Pasig River Unified Monitoring Stations (PRUMS) report from January to June 2019. Sentinel-2 images are used to estimate Biological Oxygen Demand (BOD), Chloride, Color, Dissolved Oxygen (DO), Fecal Coliform, Nitrate, pH, Phosphate, Temperature, and Total Suspended Solids (TSS). Feature generation involved calculating different band reflectance combinations from the satellite images. Exhaustive feature selection based on a Pearson correlation threshold was applied to limit the number of independent variables. The Box-Cox transformations of the WQ parameters (except Temperature) were used as dependent variables, and the selected features were used as independent variables for the OLS regression models. The root mean square error (RMSE) values, computed with k-fold cross-validation, showed outliers, especially for the TSS model (>547000 mg/L), which made the magnitude of its average negative RMSE very large. Tests for multicollinearity, autocorrelation, and homoscedasticity indicated problems in the models. However, the normality of the residuals suggests that the models can roughly estimate water quality for the river as a whole, with remote sensing offering a better perspective on its spatial distribution.


INTRODUCTION
The Pasig River connects Laguna de Bay to Manila Bay. The river stretches up to 27 kilometers with an average depth of 50 meters. Through the years, it has served as an important means of transport. Today, however, it suffers from high levels of water pollution (Meijer et al., 2021). According to the Pasig River Rehabilitation Program Case Study (2004), the pollution dates back to the period after World War II, which saw massive population growth, extensive infrastructure construction, and a dispersal of economic activities. It was observed during the 1930s that there was a significant increase in pollution, diminishing fish migration from Laguna de Bay, and a decrease in ferry transport. Foul smells were noted in the 1970s and the 1980s. During the 1990s, its water quality failed to meet Class C standards, a classification for fishery water suited to the propagation and growth of fish and other aquatic resources. It was also then declared biologically dead by the Pasig River Rehabilitation Commission (PRRC).
A more recent study by Gorme et al. (2010) stated that the Pasig River was very polluted and failed to meet the Department of Environment and Natural Resources (DENR) standards for dissolved oxygen (DO) and BOD. Water quality in the river improved after the PRRC was established in 1999 but then continued to deteriorate over the years. According to the American Association for the Advancement of Science (2021), the Pasig River is considered the world's most polluting river in terms of plastic waste: the 27-kilometer river, which runs through Metro Manila, accounts for 63,000 tons of plastic entering the oceans from rivers per year.
The problem this paper addresses is the large gap in the retrieval of water quality using remote sensing methods. Remote sensing provides a faster, more cost-efficient complement to conventional water quality monitoring (i.e., sampling and laboratory analysis) for a more comprehensive assessment of water bodies, but it is still limited to the retrieval of water clarity, turbidity, water color, and the concentrations of optically active constituents (Wisconsin DNR, n.d.). A study by Márquez et al. (2018) generated empirical models to estimate Temperature, PO4, Total Suspended Solids (TSS), Turbidity, pH, and Electrical Conductivity (EC) from Landsat 8 images. It tested a multitude of independent variables derived from the reflectance values, such as individual bands, band combinations, square roots, reciprocals, squares, cubes, powers, sums, differences, logarithms, and band ratios, in its linear regression models. The present study aims to enhance environmental monitoring of the Pasig River using remote sensing methods.
This study uses different Sentinel-2 image band combinations to generate empirical ordinary least squares (OLS) models that estimate the water quality (WQ) parameters established by the Department of Environment and Natural Resources Administrative Order (DAO-2016-08), namely Biological Oxygen Demand (BOD), Chloride, Color, Dissolved Oxygen (DO), Fecal Coliform, Nitrate, pH, Phosphate, Temperature, and Total Suspended Solids (TSS). It aims to analyze the water quality of the river through time based on the Pasig River Unified Monitoring Stations (PRUMS) data from January to June 2019. It also aims to estimate the WQ parameters using Sentinel-2 satellite images from the same dates. Figure 1 describes the general methodology.

Figure 1. Summary of datasets used including source, type and date
The PRUMS report is manually converted into CSV format per station per WQ parameter. The shapefile of the PRUMS WQ stations is used in ArcMap to extract multi-values, which are the band reflectances of each calibrated Sentinel-2 Level-2A image. The band reflectance per station is then joined to the PRUMS report by corresponding date. Cloud-obstructed stations are removed from each image, and the PRUMS data is filtered by z-score, removing entries beyond a threshold of 3 standard deviations as outliers. Exploratory data analysis is then carried out to decide whether to apply a Box-Cox transformation, or any other necessary transformation, which brings a non-normal dependent variable closer to a normal shape. In the next step, different band combinations are calculated (Figure 2), and an exhaustive feature selection using different Pearson correlation thresholds is performed to arrive at the final empirical OLS model per WQ parameter. The thresholds are set on the magnitude of each feature's Pearson correlation with the WQ parameter and are chosen to limit the number of features per model to between 14 and 16. This allows the thousands of computed features (Figure 2) to be filtered quickly. The selected features must balance high positive and high negative correlations with the WQ parameter for better model performance.
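The filtering and selection steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the use of band ratios as the only band combination, and the reflectance values are assumptions made for the example, and a real workflow would likely rely on a library such as pandas or SciPy.

```python
import math
from itertools import permutations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def zscore_keep(values, limit=3.0):
    """Indices of entries within `limit` standard deviations of the mean."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [i for i, v in enumerate(values) if abs(v - mean) <= limit * std]

def band_features(bands):
    """Single bands plus every ordered band ratio (one of many possible combinations)."""
    feats = dict(bands)
    for a, b in permutations(bands, 2):
        feats[f"{a}/{b}"] = [p / q for p, q in zip(bands[a], bands[b])]
    return feats

def select_by_threshold(feats, target, pos_thr, neg_thr):
    """Keep features whose correlation r with the target is >= pos_thr or <= neg_thr."""
    chosen = {}
    for name, vals in feats.items():
        r = pearson(vals, target)
        if r >= pos_thr or r <= neg_thr:
            chosen[name] = r
    return chosen

# Hypothetical reflectances at five stations and a made-up WQ target.
bands = {
    "B3": [0.05, 0.06, 0.07, 0.08, 0.09],
    "B4": [0.09, 0.08, 0.07, 0.06, 0.05],
    "B8": [0.03, 0.05, 0.02, 0.05, 0.03],
}
target = [1.0, 2.0, 3.0, 4.0, 5.0]
selected = select_by_threshold(band_features(bands), target, pos_thr=0.9, neg_thr=-0.9)

# Hypothetical measurements with one extreme entry for the z-score filter.
readings = [10.0] * 20 + [1000.0]
kept = zscore_keep(readings)
```

In this sketch the positive and negative thresholds would be tightened or relaxed until the number of retained features falls in the desired range (14 to 16 in the study).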
OLS regression estimates the relationship between one or more independent variables (Sentinel-2 bands, 2019 Q1-Q2) and a dependent variable (WQ parameters in the PRUMS report, 2019 Q1-Q2), as described in Equation 1:

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n + \varepsilon \qquad (1)$$

where:
$y$ = Dependent variable (WQ parameter)
$\beta_0$ = Constant
$\beta_1, \dots, \beta_n$ = Coefficients
$x_1, \dots, x_n$ = Independent variables (Sentinel-2 bands)
$\varepsilon$ = Error

It is a statistical method that fits a straight line by minimizing the sum of the squared differences between the observed and predicted values of the dependent variable. With one model per WQ parameter, there are ten models in total using data from 2019 Q1 to Q2. These models are tested by calculating their RMSE and by evaluating the normality of their residuals.
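As an illustration of the fitting and scoring described above, the sketch below fits an OLS model through the normal equations and reports the RMSE on each held-out fold. It is written in plain Python under stated assumptions: the helper names, the interleaved fold assignment, and the synthetic data are hypothetical, and in practice a library such as NumPy, statsmodels, or scikit-learn would normally be used.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(X, y):
    """Return [b0, b1, ..., bn] minimizing the sum of squared residuals."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

def predict(beta, X):
    """Predicted values for each row of X given fitted coefficients."""
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in X]

def kfold_rmse(X, y, k=5):
    """RMSE on each held-out fold; fold i holds every k-th row starting at row i."""
    scores = []
    for i in range(k):
        test = set(range(i, len(y), k))
        X_train = [r for j, r in enumerate(X) if j not in test]
        y_train = [t for j, t in enumerate(y) if j not in test]
        X_test = [X[j] for j in sorted(test)]
        y_test = [y[j] for j in sorted(test)]
        pred = predict(fit_ols(X_train, y_train), X_test)
        mse = sum((p - t) ** 2 for p, t in zip(pred, y_test)) / len(y_test)
        scores.append(math.sqrt(mse))
    return scores

# Synthetic, noise-free data: y = 2 + 3*x1 - x2 (values are made up).
X = [[float(i), float((i * i) % 7)] for i in range(20)]
y = [2.0 + 3.0 * a - b for a, b in X]
beta = fit_ols(X, y)
fold_scores = kfold_rmse(X, y, k=5)
```

Because the synthetic target is an exact linear function of the features, the fitted coefficients recover the generating values and the fold RMSEs are close to zero; real reflectance data would of course leave a nonzero error.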

RESULTS AND DISCUSSION
Exploratory data analysis (EDA) produced the distribution plots (distplots) in Figure 3 (i) to (x). A distplot combines a histogram with a kernel density estimate: it shows the distribution of the data as bars and allows comparison with a normal distribution. The Box-Cox transformation was applied to BOD, Fecal Coliform, Chloride, TSS, Nitrate, Phosphate, Color, and pH using all 106 observations. Temperature was not transformed because its distribution is already near normal.
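For reference, the Box-Cox transformation maps a strictly positive value x to (x^λ - 1)/λ when λ ≠ 0 and to ln x when λ = 0. The sketch below implements the transform and picks λ by a simple grid search over the profile log-likelihood. The grid search, the function names, and the sample values are assumptions for illustration; the source does not state how λ was chosen, and library routines such as SciPy's `boxcox` use a numerical optimizer instead.

```python
import math

def boxcox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def boxcox_loglik(values, lam):
    """Profile log-likelihood of lambda, assuming the transformed data are normal."""
    n = len(values)
    t = boxcox(values, lam)
    mean = sum(t) / n
    var = sum((v - mean) ** 2 for v in t) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)

def best_lambda(values, grid=None):
    """Grid search for the lambda that maximizes the profile log-likelihood."""
    grid = grid or [i / 10 for i in range(-20, 21)]
    return max(grid, key=lambda lam: boxcox_loglik(values, lam))

# Made-up right-skewed sample whose logarithms are symmetric about zero,
# so the log transform (lambda = 0) is the natural choice.
sample = [math.exp(z) for z in [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]]
lam = best_lambda(sample)
transformed = boxcox(sample, lam)
```

Predictions from a model fitted on transformed values have to be mapped back through the inverse transform before they can be compared with measurements in the original units.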
Figure 3. Distribution plots of the WQ parameters, panels (i) to (x), including (i) BOD and (ii) DO.

The positive and negative thresholds that limit the number of features to between 14 and 16 are shown in Figure 4.

The RMSE values for the models, computed with k-fold cross-validation, are shown in Figure 5. There are outliers, especially for the TSS model (>547000 mg/L), which made the magnitude of its average negative RMSE very large. This indicates underfitting in these models.

Table 2 indicates that all models except DO, Chloride, and Temperature have normally distributed residuals. The residuals for DO and Temperature are skewed to the right, while the residuals for Chloride are bimodal. A potential cause is that some of the predictors are significantly non-normal, which might affect the confidence intervals of these three models.

Table 3. Summarized test results for water quality models

Based on the tests, the models with the most linear relationship between target and features are Temperature (linear), Fecal Coliform (slightly biased), and Chloride (slightly biased); all other models may therefore be underfitting. All models failed the multicollinearity test, which is a problem because multicollinearity inflates the standard errors of the model estimates. Signs of positive autocorrelation appear in the Fecal Coliform, Chloride, and Temperature models, which can affect the model estimates. The homoscedasticity test failed only for DO, Temperature, and pH.
Even though most of the models exhibit normally distributed residuals, they did not perform well on the tests. They can roughly estimate relative values of the WQ parameters; however, it cannot be concluded that the estimates are accurate enough to give precise values at each point of the river. Possible reasons include the following. First, the feature selection methodology described in Section 2 does not check the correlation among the features themselves during the process. Second, the dataset is very limited (106 observations), with discrepancies between satellite pass dates and sampling dates, and between the exact location of each observation point and the sample point picked in the image.
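The source does not name the specific diagnostic tests that were run. As a hedged illustration, two common choices are sketched below in plain Python: the Durbin-Watson statistic for first-order autocorrelation of residuals, and the variance inflation factor (VIF) for multicollinearity, computed here as the diagonal of the inverse of the feature correlation matrix. The function names and data are hypothetical, and the rule of thumb that a VIF above about 10 signals a problem is a convention rather than a result from this study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def durbin_watson(residuals):
    """Near 2: no first-order autocorrelation; toward 0: positive; toward 4: negative."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den

def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting.
    Fails with a division by zero if the matrix is singular (perfect collinearity)."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        d = m[col][col]
        m[col] = [v / d for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]

def vif(columns):
    """VIF per feature: the diagonal of the inverse feature correlation matrix."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inv = invert(corr)
    return [inv[i][i] for i in range(len(columns))]

# Made-up residuals that alternate in sign (negative autocorrelation).
dw = durbin_watson([1.0, -1.0, 1.0, -1.0])  # 12 / 4 = 3.0

# Made-up feature columns: an uncorrelated pair and a strongly correlated pair.
vif_independent = vif([[1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0]])
vif_collinear = vif([[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0]])
```

With uncorrelated features each VIF equals 1, while the nearly collinear pair yields values well above 10, which is the kind of outcome that a failed multicollinearity test reflects.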

CONCLUSION AND RECOMMENDATION
This study generated and tested empirical OLS models for estimating WQ parameters, namely BOD, DO, Fecal Coliform, Chloride, TSS, Nitrate, Phosphate, Color, Temperature, and pH, on the Pasig River using Sentinel-2 satellite images. Although the exhaustive feature selection process did not produce excellent models based on the test results, the models can still be used to roughly estimate the river's water quality, given the normal distribution of their residuals, and such estimates can be produced more quickly and at lower cost than with conventional monitoring. The recommendation for the methodology is to add steps that prevent multicollinearity among the features while they are being filtered. This could improve model performance significantly, since multicollinearity is the test that all models failed.
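One possible way to act on this recommendation is a greedy filter that ranks candidate features by the magnitude of their correlation with the target and skips any candidate that is too strongly correlated with a feature already kept. The sketch below is only an assumption about how such a step could look, not a method from the study; the function names, both cut-off values, and the data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_decorrelated(feats, target, min_target_r=0.5, max_pair_r=0.95):
    """Greedy selection: strongest target correlation first, skipping features
    whose absolute correlation with any already selected feature exceeds max_pair_r."""
    strength = {name: abs(pearson(vals, target)) for name, vals in feats.items()}
    ranked = sorted(feats, key=lambda name: strength[name], reverse=True)
    chosen = []
    for name in ranked:
        if strength[name] < min_target_r:
            break  # remaining candidates are even weaker
        if all(abs(pearson(feats[name], feats[c])) <= max_pair_r for c in chosen):
            chosen.append(name)
    return chosen

# Made-up candidate features: f2 is nearly a rescaled copy of f1,
# f3 carries partly independent signal, f4 is weakly related to the target.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feats = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.5],
    "f2": [2.0, 4.0, 6.0, 8.0, 10.0, 13.2],
    "f3": [1.0, 3.0, 2.0, 5.0, 4.0, 6.0],
    "f4": [3.0, 1.0, 3.0, 1.0, 3.0, 1.0],
}
kept = select_decorrelated(feats, target)
```

Here `f2` is dropped because it duplicates `f1`, and `f4` is dropped because it is only weakly correlated with the target. Pairwise checks of this kind do not catch collinearity that involves three or more features jointly, so a VIF check on the final feature set would still be advisable.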