HYBRID MODELING: FUSION OF A DEEP LEARNING APPROACH AND A PHYSICS-BASED MODEL FOR GLOBAL HYDROLOGICAL MODELING

Process-based models of complex environmental systems incorporate expert knowledge which is often incomplete and uncertain. With the growing amount of Earth observation data and advances in machine learning, a new paradigm is promising to synergize the advantages of deep learning in terms of data adaptiveness and performance for poorly understood processes with the advantages of process-based modeling in terms of interpretability and theoretical foundations: hybrid modeling. Here, we present such an end-toend hybrid modeling approach that learns and predicts spatial-temporal variations of observed and unobserved (latent) hydrological variables globally. The model combines a dynamic neural network and a conceptual water balance model, constrained by the water cycle observational products of evapotranspiration, runoff, snow-water equivalent, and terrestrial water storage variations. We show that the model reproduces observed water cycle variations very well and that the emergent relations of runoff-generating processes are qualitatively consistent with our understanding. The presented model is—to our knowledge—the first of its kind and may contribute new insights about the dynamics of the global hydrological system.


INTRODUCTION
Process-based models of the Earth and its subsystems have been key to diagnose, predict, and understand environmental processes and change for decades. Such models are based on conceptualizations and abstractions of many individual processes according to expert understanding. They are forced, evaluated, and occasionally tuned using environmental observations. The rapidly growing amount of Earth observation data, however, does not necessarily translate into better process models, as process representations are predefined rather than learned from data. Due to advances in machine learning, complex patterns and relationships in multivariate datasets can now be recognized with high accuracy and further exploited. These models typically need large amounts of training data, while they are agnostic to the physical meaning and consistency among variables. It is, thus, promising to explore a synergistic combination of machine learning and process-based approaches for modeling in Earth system sciences . The hybrid approach is still in its infancy and we are aware of one application on Earth observation data only: de Bézenac et al. (2019) predicted future sea-surface temperature fields by using a convolutional encoder-decoder network to learn a motion field that was fed into a physical model of advection and diffusion.
We present an end-to-end global hybrid hydrological model that couples long short-term memory (LSTM, Hochreiter and Schmidhuber, 1997) networks with a traditional conceptual water balance model that is trained jointly on a set of water cycle observations: total water storage (TWS), runoff (Q), evapotranspiration (ET), and snow water equivalent (SWE). The model is forced by the meteorological variables precipitation, air temperature, and net radiation. From a deep learning perspective, the hybrid approach can be seen as a regularization of the neural * Corresponding author network, constraining the solution space to physically plausible results. Furthermore, the hydrological states (pools) and fluxes (inflows and outflows) of the conceptual water balance model remain interpretable and are still largely data-driven, as they are informed by the neural network.
In this study, we provide a proof-of-concept and test the applicability of hybrid modeling to learn a representation of the global water cycle from data. We explore the robustness of the approach based on independent cross-validations which include the full training set-up.

Total Water Storage Anomalies (TWS)
The Gravity Recovery & Climate Experiment (GRACE) Mascon Equivalent Water Height RL06 with Coastal Resolution Improvement (CRI) v1 (Watkins et al., 2015;Wiese et al., 2016;Wiese et al., 2018) represents variations in global water storages, i.e., groundwater, soil moisture, surface water, snow, and ice for land pixels. The product has a native spatial resolution of 3 • but is delivered at 0.5 • . For this study, all time series datasets were aggregated to 1 • resolution, but still, the TWS data may not represent local grid-scale variabilities properly. The TWS data is available from April 2002 to June 2016 covering irregular, roughly monthly periods. As we observed some outliers in the dataset, observations −500 > tws > 500 were removed.

Evapotranspiration (ET)
Monthly ET data was retrieved from the global FLUXCOM-RS product Tramontana et al., 2016), which is based on upscaling of FLUXNET (Baldocchi et al., 2001) eddy covariance data. The upscaling is achieved using an ensemble of machine learning models, each learning a mapping from remote sensing (RS) observations to the site-level fluxes, which can then be upscaled to global scale. The ET was derived from the latent energy estimates, assuming a constant latent heat of vaporization of 2.45 MJ mm −1 .

Total Runoff (Q)
GRUN v1 is a global gridded dataset providing estimates of monthly total runoff with a native spatial resolution of 0.5 • (Ghiggi et al., 2019). The authors used random forests to model local discharge observations from small catchments as a function of climate data and generalized the learned relationships to retrieve global estimates.

Snow Water Equivalent (SWE)
Daily SWE was retrieved from GlobSnow v2 (Luojus et al., 2014;Takala et al., 2011) and aggregated from 0.25 • to 1 • spatial resolution. The product only covers the Northern Hemisphere and pixel time-steps with no snow are encoded as missing values. As the absence snow is important information that we do not want to discard, the SWE product was enriched using 8 d MODIS snow cover fractions (SCF) disaggregated to daily using nearest neighbor (Hall and Riggs, 2016). SWE with missing data were set to 0 if: a) more than 24 consecutive days were missing for SWE and b) the mean SCF over ±12 days was below 10 %. This gap-filling mainly assigned zero SWE to previously missing values in the Southern Hemisphere and Northern Summer. Note that some mountainous regions were masked out in the GlobSnow product. The SWE signal is known to saturate at 100-150mm (Larue et al., 2017).

Meteorological Forcing
As time-varying model inputs, we used three meteorological forcing datasets, each on daily resolution: Net radiation is obtained from the SYN1deg Ed3A product (Doelling, 2017) of the Clouds and the Earth's Radiant Energy Systems (CERES) program (Wielicki et al., 1996). The precipitation data was retrieved from the Global Precipitation Climatology Project daily 1 • dataset (GPCP-1DD) v1.2 (Huffman et al., 2012). Air temperature was obtained from the CRUNCEP v8 dataset, a combined product of the observation-based Climate Research Unit (CRU) and the National Center for Environmental Prediction (NCEP) reanalysis data (Harris et al., 2014;Viovy, 2018).

Static Datasets
A number of static datasets were used to represent the spatial variability of surface and subsurface environmental conditions. To represent topography, we used the digital elevation model from GTOPO30 (DOI/USGS/EROS, 1997). Furthermore, we used variables from the soilgrids dataset (Hengl et al., 2017): absolute depth to bedrock and the average across all soil layers of bulk density, coarse fragments, clay, silt, and sand content. Land cover fractions were derived from the Globland30 dataset (Chen et al., 2015) for the classes water bodies, wetlands, artificial surfaces, tundra, permanent snow and ice, grasslands, barren, cultivated land, shrublands, and forests. In addition, a wetland dataset was used that contains fractions of groundwater-driven wetlands, regularly flooded wetlands, and the intersection of the them (Tootchi et al., 2019).
These 22 variables were aggregated from their mostly finer native spatial resolution to 1 30 • to keep information on the spatial variability inside a 1 • model pixel. To reduce the size of the stacks (30 (lat. pixels) × 30 (lon. pixels) × 22 (variables) = 19 800 values per model cell) and ultimately the number of parameters in the model, we reduced the dimensionality of the static variables in a pre-processing step. A simple convolutional autoencoder was used for this, consisting of an encoder network, a bottleneck layer, and a decoder network. The encoder layers extract features from the input stack with consecutively smaller capacity. The final representation is the bottleneck layer, with a vector of size 30. The decoder, which has the reverse structure of the encoder network, maps the bottleneck layer back to the input stack. By minimizing the reconstruction loss, the model is forced to find a low-dimensional representation of the stack. To retrieve valid land pixels with a clear signal of TWS, ET, and Q, cells with more than 50 % water bodies, 10 % permanent snow or ice, 10 % artificial surfaces, and 10 % bare land were removed. Further, regions with strong anthropogenic groundwater withdrawal were discarded, as the model does not account for these effects. After applying these criteria, the dataset consisted of 11 026 spatial samples. Note that some grid-cells were masked out further due to missing values in the SWE dataset, e.g., some mountainous areas. The excluded cells are shown in Figure 1 along with five bioclimatic regions used in the model evaluation.

The Hybrid Hydrological Model
The hybrid model represents the major states and fluxes of the hydrological cycle (see Box 1). The model learns a mapping from the meteorological features (X) to the target variables (y).
To predict yt at time t, it has access to the present and past observations X ≤t and a set of static variables.

Self-Paced Multi-Task Learing
To combine the four loss terms corresponding to the target variables, we used self-paced task uncertainty weighing (Kendall et al., 2018), as done in state-of-the-art multi-task learning (e.g. Liebel and Körner, 2018). By optimizing an uncertainty term σ for each task (Equation 1), the different uncertainties inherent to the target variables are compensated dynamically.
Box 1: The end-to-end hybrid hydrological model The meteorological time series (Section 2.5), encoded static variables (Section 2.6) and physically interpretable states groundwater (GW, g) a and cumulative water deficit (CWD, c) are fed into the LSTM.
B The LSTM layer The LSTM updates the hidden states h t LSTM and c t LSTM at each time-step. LSTM , X t ) C Multi-task layer The multi-task layer, comprising of independent feedforward layers (NN), yields interpretable variables: evapotranspiration (ET, e), snow water equivalent (SWE, s), and fractions (α) defining how the liquid water input (w inp ) is partitioned into the fluxes of fast runoff (w inp · αq f → q f ), soil recharge (w inp · αc → rc), and groundwater recharge (w inp · αg → rg). The current w inp is the precipitation (p) minus snow accumulation or plus snow melt (∆s). I addition, a fraction αe determines the source pool from which e is taken from. If αe=1, e is taken from the soil, if αe=0, e is taken from the groundwater.
The hydrological block implements water balance equations. The physical state variables g and c are updated at each time-step using a combination of the above latent variables and variables derived here. When c = 0, the soil is fully water-saturated, negative values indicate a water deficit. If c > 0, the soil capacity is exceeded and overflow occurs (c overflow ). Note that for the model evaluation, c is transformed such that a deficit is denoted by positive values. The base runoff (Q b , q b ) is defined as g times a learned global fraction αq b . The total runoff (Q, q) is the sum of q b and q f . The total water storage (TWS, w) anomalies are calculated as the sum of s, g, and c, minus the mean of w to get the variation around 0. The units are mm for state variables and mm d −1 for fluxes.
w t = s t + g t + c t a (acronym, math. symbol) The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2020, 2020 XXIV ISPRS Congress (2020 edition) where wi is a weight for the task i of n total tasks, reciprocal to the task uncertainty σi and ri is a regularization term to prevent the uncertainty from converging to infinity. In practice, the uncertainty is encoded as s := log(σ 2 ) to assert numerical stability and to have an unbound parameter s. Hence, w = 0.5 · exp(−s) and r = 0.5 · s.
We added a further constraints (Cg) to penalize negative values for groundwater (GW). In preliminary experiments, we observed that the model can easily reach a loss Cg = 0, and, thus, s converged to minus infinity. To prevent this, a constant of 0.1 was added: Cg = mean(− min(g, 0)) + 0.1, where g is a simulated groundwater time series.

Model Selection & Training
The model was trained end-to-end and simultaneously on global observation-based products of TWS, SWE, ET, and Q using the backpropagation algorithm (Goodfellow et al., 2016). We used the root mean square error (RMSE) as the objective function. The model was implemented in PyTorch v1.4 (Paszke et al., 2017).
The time series were split into two periods, 2002-01 to 2008-12 for training and 2009-01 to 2014-12 for validation and testing. The feature time series were extended by selecting ten random years from the features of the respective periods for model spinup to obtain steady physical model states (GW and soil cumulative water deficit (CWD)), before the actual evaluation period. Furthermore, a warmup period of one year was added to both time-ranges to have some temporal context even for the start of the periods. In addition, the samples were split into mutually exclusive regular grids for the hyperparameter (HP) optimization and the cross-validation (Fig. 2). These measures were taken to reduce overfitting due to spatial and temporal autocorrelation (Roberts et al., 2017).
For the model selection, we used the Bayesian optimization hyper-band (BOHB) algorithm (Falkner et al., 2018) from the Ray.tune framework (Liaw et al., 2018). BOHB is a state-ofthe-art method for HP optimization that combines an early stopping mechanism (dropping non-promising runs) and a Bayesian surrogate model that suggests new HPs. Here, we used samples from one of the four spatial grids. To match the cross-validation scheme, the samples were split into five folds, of which three were used for training and one for validation. The final HPs are reported in Table 1. The remaining three grids were used to perform three independent cross-validations: in each, one fold was withheld for testing (5 % of the grid-cells) and the remaining four folds (20 % of the grid-cells) were iterated such that each fold was used for validation once. The test set predictions used for the model evaluation are referred to as cv i,f , where i ∈ {1, 2, 3} is the cross-validation and f ∈ {1, 2, 3, 4} is the fold index.

Model Evaluation
First, the model fit was quantified regarding the temporal patterns aggregated by the bioclimatic regions (Figure 1) using the Pearson correlation coefficient (r) and the Nash-Sutcliffe model efficiency coefficient (NSE). The NSE ranges from −∞ to 1, a Model architecture layer num. layers hidden size dropout static encoding 2 (1, 2) 100 (50, 100) 0.2 (0.0, 0.5) LSTM 1 (−) 100 (50, 200) − task branches 1 (1, 3) 100 (50, 200) 0.2 (0.0, 0.5) Optimizer parameters learning rate 10 −2 (10 −2 , 10 −4 ) task weight learning rate 10 −2 (10 −2 , 10 −4 ) weight decay 10 −5 (10 −2 , 10 −5 ) grad. clipping 0.6 (0.1, 1) Table 1. Model architecture and optimizer hyperparameters with range limits searched in brackets (lower, upper). The static encoding layer extracts features of the static input which are fed into the LSTM together with the meteorological forcing time series. The single-layer LSTM is followed by multiple task branches. The learning rate defines the step size of the optimizer (with an independent learning rate for the task weights), weight decay adds L2 regularization (preventing large parameter values) and gradient clipping counteracts exploding gradients. Figure 2. Regional example of the data splitting for the hyperparameter tuning and cross-validation. The grid-cells are split into four mutually exclusive, regular grids (colored). The grid-cells of each set are separated by a buffer to reduce the spatial autocorrelation between the samples. The samples of each grid were then split randomly into 5 sets of which one was used for testing and the remaining four were iterated such that each set was used as validation set once. One of the four grids was used for hyperparameter optimization. Following this scheme, three separate cross-validations (cv i∈{1,2,3} ) are performed, each yielding four predictions on the test set. Note that some grid-cells are masked out (grey), see Section 2.7 for more details.
negative NSE indicates that the model fit is worse than just taking the observed mean as prediction, 1 is a perfect fit (Nash and Sutcliffe, 1970). The evaluation was performed based on the test sets which have not been used for HP optimization or model training. From the three cross-validations, only one of the four runs was used and combined into one unified dataset, i.e., cv i∈{1,2,3},f =1 . Then, we aggregated the time series per bioclimatic regions using the mean of all respective grid-cells. We then calculated r and NSE for each bioclimatic region.
Then, the robustness of the simulated latent variables was assessed. As the proposed hybrid model has a high degree of freedom compared to conceptual models, it is crucial to check if repeated runs lead to similar results. Robust model predictions increase the trust in the latent variable estimates. The robustness of the model was assessed using the simulations from the cross-validation. In addition, we assess the plausibility of the non-observed (latent) estimates based on our process understanding. For the evaluation of the latent variables, we cannot rely on a ground-truth. Rather, the patterns are confronted with domain knowledge. Exemplarily, we take a closer look at the liquid water input (w inp ) partitioning through fast runoff fraction (αq f ), soil recharge fraction (αc), and groundwater recharge fraction (αg). These fractions are known to depend strongly on the water status of the soil (CWD) with, e.g., more fractional runoff under wet conditions. As the fractions are learned from data and no constraints were imposed, we evaluated their relationship with CWD qualitatively and quantitatively using the Spearman's rank correlation coefficient (rs).

Model Performance by Bioclimatic Regions
The observed and simulated time series and the model performance per bioclimatic region are shown in Figure 3. The hybrid model has learned the temporal patterns of the target variables. The seasonality was represented well with varying performance among bioclimatic regions and variables. Remember that ET and Q are upscaled from point measurements and products of machine learning algorithms themselves. The ET product, for example, is known to be affected by systematic biases due to biases in the underlying site measurements and an incomplete spatial sampling (Jung et al., 2020). For that reason, the trust in these variables, especially the interannual variability (IAV), is limited. Similarly, the SWE product is affected by biases due to a signal saturation above 100-150mm (Larue et al., 2017). Therefore, and also because TWS explicitly depends on all the other target variables, we use the observation-based TWS as the main reference for assessing the model performance.
The response of TWS to precipitation can be strongly delayed due to buffering effects of snow mass, soil moisture, or groundwater. This expresses in a lag between the seasonality of precipitation and TWS, but also single precipitation events cause a delayed response in the TWS (Humphrey et al., 2016). The model fit the seasonal patterns of TWS well, especially in the Tropics, Subtropics, and the Northern Hemisphere (NSE >0.8).
In the temperate and more clearly in the cold Northern Hemisphere, the predictions exhibited a phase-shift compared to the observations. This means that the model struggled to discharge the input of water at an adequate pace. Similar phase-shifts can be observed in conceptual models (e.g. Schellekens et al. (2017) and Trautmann et al. (2018)) and the phenomenon is still under investigation. A reason for this mismatch could be a missing implementation of lateral fluxes between grid-cells or buffering effects of surface water storages like wetlands. In Figure 3, we also show the interannual variability (IAV) of TWS, calculated as the deviation from the mean seasonality. The IAV signal reflects how the model can deal with anomalous conditions, like strong precipitation events or droughts. The model was able to predict the timing and strength of the major TWS anomalies.  The relationships is quantified using the mean Spearman's rank correlation coefficient (rs) over all folds, the standard deviation is shown in brackets. For the density plot, on single fold (cv i=1,f =1 ) was used. The lines represent the binned median, i.e., the median of the fractions over a range of CWD values, of the individual cross-validation test sets. The lines are colored by cross-validation run index, i.e., lines with the same color come from one cross-validation run and represent the same grid-cells.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2020, 2020 XXIV ISPRS Congress (2020 edition)

Model Robustness & Latent Variables
A challenge in hybrid modeling is to find the right balance between constraining the model sufficiently to avoid equifinalities and to allow it enough flexibility to adapt to the data. This act of balance requires domain knowledge and a careful evaluation of the results. Based on a set of repeated model runs from the cross-validations, we assess the robustness of the simulations. While the RMSE varied only marginally (1.42 ± 0.03) and the target variables predictions were robust, the stability of the latent variable simulations was lower among cross-validation folds (Figure 4).
The robustness of the latent variable simulations varied among the bioclimatic regions. This indicates that the optimization problem was underconstrained under certain conditions and different pathways lead to a similar solution in terms of target variables. We take a closer look at the SH regions and note that the mean CWD varied substantially among the model runs. Note that, here, the snow mass is neglectable and thus, TWS is partitioned between GW and CWD. TWS, however, reflects the anomalies of the total water column and thus, the absolute values of GW and CWD are not constrained through this relationship. Thus, further constraints were added to the model: through the base runoff (Q b ) being a constant fraction of GW and the ET partitioning, the solution space is reduced. Similarly, the absolute values of CWD are constrained by the CWD overflow and the ET partitioning. Under certain conditions, however, these constraints are not sufficient: in a hydrological regime where soil moisture and groundwater are not limited, for example, the model fails to learn from which pool the ET is extracted. Likewise, if the soil is never or only rarely watersaturated and CWD overflow (CWD overflow ) does not occur, the mean CWD is not constrained.
In other regions, the simulations were more stable. In the Tropics and Subtropics, GW, CWD, and the ET partitioning fraction (αe) were estimated more robustly, even if we see some outliers.
In the TemperateNH and ColdNH regions, the GW simulations were rather stable, but we see a varying offset of CWD. Here, the model struggled again to yield robust estimates of αe with even opposite seasonal patterns. This suggests overall that potential groundwater access by plants via ET is not well constrained in the current set-up.
The relationship between w inp partitioning fractions and CWD and its robustness is shown in Figure 5. These patterns follow, to a certain degree, simple hydrological laws: if the soil is wet, for example, we expect to see a decrease in soil recharge fraction (αc), an increase in groundwater recharge fraction (αg), and a larger fast runoff fraction (αq f ). Insofar, the patterns align with our prior knowledge. However, the fractions were not estimated robustly, which also reflects in rather large variations in rs, especially in the cold Northern Hemisphere. There, the relationship was less pronounced, which could be caused by snowmelt dynamics adding complexity.

Limitations
The cross-validation scheme was designed to have global coverage and reduce spatial and temporal autocorrelation between samples of the training, validation and test set. Due to a limited amount of samples, we made a compromise between data limitations and autocorrelation requirements (Roberts et al., 2017). Similarly, aggregating the daily predictions to match the monthly target variables may introduce leakage, as the target variables can influence the feature time series (e.g. ET → precipitation). Further, we noted that some cross-validation runs did not converge ideally. Thus, the assessment of the robustness does not only reflect the model robustness, but also the robustness of the training process.

CONCLUSION
We presented a global end-to-end hybrid hydrological model that combines artificial neural networks and a conceptual model. To our knowledge, the presented approach is the first application of the hybrid approach to model global environmental systems. The approach opens doors to novel data-driven simulations, attribution, and diagnostic assessments of water cycle variations globally and is applicable to other fields. Our experiments have shown that a major challenge remains to sufficiently constrain the model to retrieve interpretable simulations of non-observed (latent) variables. Under certain conditions, the simulations are unstable but we can infer general patterns of the water cycle using this data-driven approach. Thus, further refinement of the model is required. This iterative process of model improvement, evaluation, and discussion is part of the scientific process that leads ultimately to a better understanding of the subject of investigation.