A Refining Method of Non-linear Regional Tm Model Based on Random Forest

Weighted mean temperature (Tm) is a critical parameter for retrieving precipitable water vapor (PWV) with GNSS technology. High-precision Tm provides an important reference data source for monitoring regional strong convective weather and large-scale climate anomalies. In most areas, high-precision Tm can be obtained from the BEVIS model and the surface temperature (Ts). The eastern coastal areas of China, however, are affected by the monsoon climate, so the applicability of this method there needs to be improved. Research shows that Tm calculated by Fourier series analysis (the FTm model) is more applicable in this region than the BEVIS model. However, that method relies on a single modeling factor, and its precision improvement is not obvious in some areas. This paper uses the observation data of 13 radiosonde stations in the eastern coastal areas of China from 2010 to 2015, with Tm calculated by numerical integration taken as the reference for the true value. Four observed quantities are selected as feature values for the random forest (RF) method: pressure, surface temperature, water vapor pressure, and specific humidity are used as input factors. The random forest predicts the deviation of the FTm model, and the predicted corrections are added to it, giving a new Tm model for the eastern coast of China called RFF Tm. Taking the observation data from 2010 to 2014 as the training database, the research area is divided into three areas from south to north according to latitude. The prediction results on different time scales are studied by the clamping criterion, and the correction effect of the random forest prediction and its adaptability in the eastern coastal areas of China are then discussed. The results show that: (1) The RFF Tm refinement method based on random forest adapts well to the eastern coastal areas of China, and its applicability in the first area is more stable across prediction time scales than that of the FTm model. (2) On the time scale with a forecast period of one year, MAE and RMS are 4.7 and 4.6 in the third area, 3.2 and 3.8 in the second area, and 2.6 and 2.5 in the first area. (3) The improvement from the random forest in the eastern coastal areas of China gradually increases as the prediction period becomes shorter. The predicted deviation values reach a steady state when the period is one month, and the corrected deviations are within 1.5 K. The correction in the third area is better than in the second and first areas, which makes up for the low precision of the FTm model in that region. The method can be used as a new multifactor prediction and correction Tm model for GNSS remote sensing of water vapor in the eastern coastal areas of China.


INTRODUCTION
Atmospheric water vapor is mainly distributed at the bottom of the troposphere and accounts for only 0.1% to 0.3% of the composition of the atmosphere, yet it is both the most active component of the atmosphere and one of the important factors affecting its vertical stability [1][2]. Because water vapor content has a significant positive correlation with precipitable water vapor (PWV), atmospheric water vapor content has always been an important research topic in weather forecasting and meteorology [3]. At present, the commonly used methods for obtaining atmospheric precipitable water vapor include radiosonde, satellite detection, and ground-based GNSS. Radiosondes are costly and provide a limited number of observations; satellite detection is affected by the weather and has many limiting factors; ground-based GNSS offers high precision, high spatial and temporal resolution, all-weather operation, and low cost [4], and is therefore widely used.
In the process of retrieving PWV using GNSS, the weighted mean temperature is one of the important parameters. At present, the internationally used method for calculating Tm is the BEVIS model proposed by Bevis in 1992 [5], which uses radiosonde stations between 27° and 65° north latitude to establish a linear model of Tm and Ts. However, its applicable region is limited, and its applicability in China is poor. Refs. [6][7], combining multi-factor analysis, found that Tm is periodically negatively correlated with latitude, elevation, and pressure (Ps), and positively correlated with ground temperature (Ts) and water vapor pressure (es), and established a multi-factor regression model for the Chinese region on that basis. With the development of GNSS meteorology, the accuracy requirements for regional Tm have gradually increased, and various regional Tm and Ts models have been established in Refs. [8][9][10]. However, these regional models adopt a linear relationship, and their accuracy in some regions still cannot meet application requirements. Ref. [11], based on a mathematical statistics model, demonstrated the nonlinear relationship between Tm and Ts, which provides a new direction for the study of Tm. Traditional machine learning methods, such as the support vector machine [12], BP neural network [13], and Kalman filter model [14], depend heavily on the distribution of the training samples and are therefore prone to over-fitting and insufficient robustness. As a newer machine learning model, random forest can process high-dimensional data samples without dimension reduction, requires little parameter tuning, and is highly versatile; it effectively avoids over-fitting and is robust, so it has been widely used in economics, medicine, and exploration [15].
The eastern part of China is affected by the monsoon climate and is prone to strong convective weather, which causes significant nonlinear changes in the atmospheric weighted mean temperature. The Fourier series model fits the variation characteristics of Tm in this region well. Therefore, the nonlinear F-Tm model for eastern China is first established based on the Fourier series, and the deviation of the model is then predicted and corrected by the random forest method. Finally, the nonlinear RFF-Tm model based on random forest is obtained, and its spatio-temporal adaptability is analyzed.

Research Area
Thirteen radiosonde stations in eastern China were selected as research objects. Because Tm has a large correlation coefficient with latitude, this paper divides eastern China into three research areas along 25°N and 35°N. Figure 1 shows the geographical distribution of the 13 radiosonde stations. The specific research area information is shown in Table 1.

Tm calculation method
At present, the commonly used methods for calculating the Tm reference value mainly include the numerical integration method, the constant method, the BEVIS formula method, and the approximate integral method. Among them, the numerical integration method has the advantages of high precision and easy implementation, and its result is generally taken as the true value of Tm. The calculation formula is as follows:

$$T_m = \frac{\sum_{i=1}^{n} \frac{P_{vi}}{T_i}\,\Delta h_i}{\sum_{i=1}^{n} \frac{P_{vi}}{T_i^{2}}\,\Delta h_i} \qquad (1)$$

In formula (1), $T_i$ is the mean temperature of the i-th atmospheric layer (in K), $\Delta h_i$ is the thickness of the i-th layer (in m), and $P_{vi}$ is the mean water vapor pressure of the i-th layer (in hPa). $P_{vi}$ is an indirect observation and is generally calculated by the saturated water vapor pressure formula recommended by the World Meteorological Organization (WMO):

$$P_{vi} = 6.112 \exp\left(\frac{17.502\,t}{240.97 + t}\right) \qquad (2)$$

In formula (2), t is a temperature (in °C). In this paper, the measured temperature data of each radiosonde station are extracted, and the Tm obtained by the numerical integration method is taken as the reference value.
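The layer-wise summation in formula (1) and the vapor pressure expression in formula (2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' processing code: the function names, the three-layer profile, and the temperatures passed to formula (2) are all hypothetical values chosen only to show the arithmetic.

```python
import math


def vapor_pressure_hpa(t_celsius):
    """Formula (2): water vapor pressure in hPa for a temperature t in degrees Celsius."""
    return 6.112 * math.exp(17.502 * t_celsius / (240.97 + t_celsius))


def weighted_mean_temperature(temps_k, vapor_hpa, thickness_m):
    """Formula (1): discrete Tm from per-layer mean temperature, vapor pressure, thickness."""
    if not (len(temps_k) == len(vapor_hpa) == len(thickness_m)) or not temps_k:
        raise ValueError("layer lists must be non-empty and of equal length")
    layers = list(zip(temps_k, vapor_hpa, thickness_m))
    numerator = sum(pv / t * dh for t, pv, dh in layers)
    denominator = sum(pv / (t * t) * dh for t, pv, dh in layers)
    if denominator == 0.0:
        raise ValueError("profile contains no water vapor")
    return numerator / denominator


# Hypothetical three-layer profile (made-up values, not radiosonde data).
layer_temps_k = [288.0, 281.0, 273.0]
formula2_inputs_c = [12.0, 5.0, -4.0]  # assumed temperatures passed to formula (2)
layer_vapor_hpa = [vapor_pressure_hpa(t) for t in formula2_inputs_c]
layer_thickness_m = [500.0, 1000.0, 1500.0]

tm = weighted_mean_temperature(layer_temps_k, layer_vapor_hpa, layer_thickness_m)
print(f"Tm = {tm:.2f} K")
```

Because formula (1) is a vapor-weighted harmonic mean of the layer temperatures, the result always lies between the coldest and warmest layer, which is a convenient sanity check on real profiles.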

Fourier series
The Fourier series is a form of harmonic analysis that decomposes a function f(x) into a sum of sine and cosine functions. For a function f(x) with period 2L that satisfies the conditions of the convergence theorem, the series expansion is:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\left(a_n \cos\frac{n\pi x}{L} + b_n \sin\frac{n\pi x}{L}\right)$$

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLII-3/W10, 2020 International Conference on Geomatics in the Big Data Era (ICGBD), 15-17 November 2019, Guilin, Guangxi, China
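As an illustration of how a truncated Fourier series with period 2L can be fitted to a Tm time series, the sketch below estimates the coefficients by discrete sums. It assumes evenly spaced samples covering exactly one period and a truncation order well below half the sample count; the paper does not state how the F-Tm coefficients were estimated, so the function names, the annual period, and the synthetic series are illustrative assumptions only.

```python
import math


def fourier_coefficients(samples, order):
    """Estimate a0, a_k, b_k (k = 1..order) from evenly spaced samples over one period."""
    n = len(samples)
    a0 = 2.0 / n * sum(samples)
    a, b = [], []
    for k in range(1, order + 1):
        a.append(2.0 / n * sum(y * math.cos(2.0 * math.pi * k * j / n)
                               for j, y in enumerate(samples)))
        b.append(2.0 / n * sum(y * math.sin(2.0 * math.pi * k * j / n)
                               for j, y in enumerate(samples)))
    return a0, a, b


def evaluate_series(a0, a, b, x, half_period):
    """Evaluate the truncated series a0/2 + sum(a_k cos(k pi x / L) + b_k sin(k pi x / L))."""
    total = a0 / 2.0
    for k, (ak, bk) in enumerate(zip(a, b), start=1):
        angle = k * math.pi * x / half_period
        total += ak * math.cos(angle) + bk * math.sin(angle)
    return total


# Synthetic annual Tm cycle (hypothetical numbers): mean 280 K plus one harmonic.
half_period = 182.5  # L, so the period 2L is 365 days
synthetic_tm = [
    280.0
    + 8.0 * math.cos(math.pi * day / half_period)
    + 3.0 * math.sin(math.pi * day / half_period)
    for day in range(365)
]

a0, a, b = fourier_coefficients(synthetic_tm, order=2)
print(f"mean = {a0 / 2.0:.3f} K, a1 = {a[0]:.3f}, b1 = {b[0]:.3f}")
```

For irregularly sampled or gappy radiosonde records, a least-squares fit of the same basis functions would be the more appropriate choice.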

Accuracy index
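In this paper, the bias (BIAS), the mean absolute error (MAE), and the root mean square error (RMS) are selected as the accuracy evaluation indices. The expressions are as follows:

$$\mathrm{BIAS} = \frac{1}{N}\sum_{i=1}^{N}\left(X_m^{i} - X_r^{i}\right)$$

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|X_m^{i} - X_r^{i}\right|$$

$$\mathrm{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_m^{i} - X_r^{i}\right)^{2}}$$

where N is the number of data samples, $X_m^{i}$ is the model-calculated value of the i-th datum, and $X_r^{i}$ is the reference value of the i-th datum.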
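The three indices can be computed directly from paired model and reference values. The sketch below assumes the usual convention that the difference is taken as model minus reference; the function names and the sample values are hypothetical.

```python
import math


def _differences(model, reference):
    """Pairwise model-minus-reference differences, with basic input checks."""
    if len(model) != len(reference) or not model:
        raise ValueError("inputs must be non-empty and of equal length")
    return [m - r for m, r in zip(model, reference)]


def bias(model, reference):
    """Mean signed difference between model and reference values."""
    diffs = _differences(model, reference)
    return sum(diffs) / len(diffs)


def mae(model, reference):
    """Mean absolute difference between model and reference values."""
    diffs = _differences(model, reference)
    return sum(abs(d) for d in diffs) / len(diffs)


def rms(model, reference):
    """Root mean square of the differences between model and reference values."""
    diffs = _differences(model, reference)
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical Tm values in kelvin.
model_tm = [281.0, 279.0, 283.0]
reference_tm = [280.0, 280.0, 280.0]
print(bias(model_tm, reference_tm), mae(model_tm, reference_tm), rms(model_tm, reference_tm))
```

BIAS can be close to zero even when individual errors are large, because positive and negative differences cancel; MAE and RMS do not have this property, which is why the three are reported together.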

F-Tm model accuracy analysis
According to the statistics in Figure 2, over the five years the F-Tm model has a BIAS of 0.04 K and an RMS improvement of 14%; compared with the BEVIS model, it adapts better to eastern China. The F-Tm model improves accuracy at 9 radiosonde stations (Hong Kong, Zhangqiu, Sheyang, Taizhou, Fuzhou, Xiamen, Shantou, Nanning, and Haikou), but at Dalian, Qingdao, Shanghai, and Taipei the accuracy has not improved significantly. In a small number of areas, the accuracy of the model has even decreased because of their special geographical location and industrial pollution. Therefore, based on the F-Tm model, the deviation is predicted by the random forest method, and the prediction correction is added to construct the RFF-Tm model:

$$T_m^{RFF} = T_m^{F} + \Delta t$$

In the formula, $T_m^{RFF}$ and $T_m^{F}$ denote the Tm values given by the RFF-Tm and F-Tm models, and $\Delta t$ is the prediction correction based on random forest.

Random Forest (RF)
Random Forest was proposed by Breiman and Cutler in 2001 and belongs to the bagging family of ensemble learning algorithms. The data are divided into the original training sample N and the prediction sample Z. Using the bootstrap resampling technique, k sample sets are randomly drawn from N with replacement to generate new training sample sets; then, according to the selected feature values, the bootstrap sample sets generate k classification trees, which together form the random forest.
The two important parameters that affect the model prediction result are the number of decision trees (ntree) and the number of candidate variables (mtry). The value of ntree is generally taken as 1/3 of the number of samples. The result for the prediction sample Z is determined by the classification trees. Compared with common machine learning methods such as neural networks and support vector machines, which are prone to over-fitting and have complex structures, the random forest algorithm has obvious advantages. It has therefore been widely used in remote sensing image monitoring, ocean subsurface structure prediction, and other fields.
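The roles of bootstrap resampling, ntree, and mtry can be seen in a deliberately small sketch. It is a toy, not a full random forest: each "tree" is a single-split regression stump rather than a fully grown tree, the trees' outputs are averaged, and all names and sample values are hypothetical. It only illustrates the bagging mechanism described above.

```python
import random


def fit_stump(rows, targets, feature_ids):
    """Find the single split (feature, threshold) with the lowest squared error."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if not left or not right:
                continue  # a split must leave samples on both sides
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    if best is None:  # no valid split: fall back to a constant prediction
        mean = sum(targets) / len(targets)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    """Return the leaf mean on the side of the split that the row falls on."""
    f, threshold, left_mean, right_mean = stump
    return left_mean if row[f] <= threshold else right_mean


def fit_forest(rows, targets, ntree, mtry, seed=0):
    """Grow ntree stumps, each on a bootstrap sample with mtry random candidate features."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]  # draw n samples with replacement
        candidates = rng.sample(range(p), mtry)     # random candidate variables
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], candidates))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all stumps."""
    return sum(predict_stump(stump, row) for stump in forest) / len(forest)


# Hypothetical samples: (P [hPa], Ts [K], es [hPa], s [g/kg]) -> F-Tm deviation [K].
features = [
    (1012.0, 285.0, 9.0, 5.5), (1008.0, 291.0, 14.0, 8.7), (1005.0, 297.0, 22.0, 13.6),
    (1010.0, 288.0, 11.0, 6.8), (1003.0, 300.0, 27.0, 16.8), (1015.0, 279.0, 6.0, 3.7),
]
deviations = [-1.2, 0.4, 1.9, -0.5, 2.6, -2.0]

forest = fit_forest(features, deviations, ntree=50, mtry=2, seed=1)
prediction = predict_forest(forest, (1006.0, 295.0, 19.0, 11.7))
print(f"predicted deviation: {prediction:.2f} K")
```

In practice, an established implementation with fully grown trees would be used instead; the stump version is kept only so the resampling and averaging steps stay visible.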

Establishment of RFF-Tm model
In this paper, the random forest method is used, and the deviation values of the F-Tm model in 2010-2015 are selected as the data set. The Tm deviations from 2010-2014 form the original training sample, and the deviations in 2015 form the prediction sample. Four parameters related to Tm are selected as feature values (pressure P, surface temperature Ts, water vapor pressure es, and specific humidity s).

As can be seen from the figure, the improvement of RFF-Tm at the three radiosonde stations in Area 1 is particularly significant. According to the statistics, the MAE of the F-Tm model is improved by 81%, 76%, and 77%, and the RMS by 78%, 72%, and 75%, respectively, which compensates for the tropospheric disturbance that occurs in this area because of its low latitude. In Area 2 and Area 3, the original accuracy of the F-Tm model is already high, and the deviation values used for prediction are small. Long-term machine learning distorts the prediction signal in these areas and reduces the accuracy. Therefore, the degree of improvement is not obvious, and the applicability of the RFF-Tm model in these two areas needs further study.
Without changing the feature parameters or the ntree and mtry values, the adaptability of the RFF-Tm model in Area 2 and Area 3 is analyzed by adjusting the prediction duration of the random forest. Random forest predictions are made for the radiosonde stations of Area 2 and Area 3 on six time scales: one year, six months, one quarter, two months, one month, and 15 days. The results of the accuracy verification are shown in the figure below. It can be seen from the figure that, on the time scale with a forecast period of one year, the RFF-Tm model is less applicable in Area 3 than in Area 2: MAE and RMS reach maxima of 4.7 and 4.6 in Area 3, and 3.2 and 3.8 in Area 2. As the prediction time scale decreases, the MAE and RMS of the two areas gradually decrease and tend to be stable, and both reach a stable value on the one-month time scale. On the 15-day time scale, the prediction accuracy shows no significant improvement in Area 3 and declines in Area 2. Therefore, in Area 2 and Area 3, the RFF-Tm model achieves its best prediction state with a prediction period of one month, and the improvement in Area 3 is slightly better and more stable than that in Area 2. Both areas show good adaptability in MAE and RMS on the one-month time scale, and the model can be used as a high-precision Tm model for GNSS remote sensing of water vapor in eastern China.
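One way to picture the time-scale experiment is as a rolling evaluation: the prediction year is cut into consecutive windows of a given length, corrections are predicted for each window from the deviations known before it, the corrections are added to F-Tm, and MAE is accumulated. The sketch below follows that reading, which is an assumption; the paper does not spell out the procedure. The `mean_of_recent` predictor is a hypothetical stand-in for the random forest, the deviation is taken as reference minus F-Tm so that adding it corrects the model, and all data are synthetic.

```python
def split_windows(n_days, window_days):
    """Consecutive (start, stop) prediction windows covering n_days."""
    return [(start, min(start + window_days, n_days))
            for start in range(0, n_days, window_days)]


def mean_of_recent(history, horizon, recent=30):
    """Hypothetical stand-in predictor: repeat the mean of the most recent deviations."""
    if not history:
        raise ValueError("history must not be empty")
    tail = history[-recent:]
    mean = sum(tail) / len(tail)
    return [mean] * horizon


def rolling_corrected_mae(train_dev, test_ftm, test_ref, window_days, fit_predict):
    """MAE of F-Tm plus predicted correction, predicting one window at a time."""
    history = list(train_dev)  # known deviations (reference minus F-Tm)
    abs_errors = []
    for start, stop in split_windows(len(test_ftm), window_days):
        corrections = fit_predict(history, stop - start)
        for offset, correction in enumerate(corrections):
            i = start + offset
            corrected_tm = test_ftm[i] + correction  # RFF-Tm = F-Tm + correction
            abs_errors.append(abs(corrected_tm - test_ref[i]))
        # Once a window has passed, its observed deviations join the training data.
        history.extend(test_ref[i] - test_ftm[i] for i in range(start, stop))
    return sum(abs_errors) / len(abs_errors)


# Synthetic 60-day example: the true deviation jumps from 1 K to 3 K after day 30.
train_dev = [1.0] * 60
test_ftm = [280.0] * 60
test_ref = [281.0] * 30 + [283.0] * 30

for window in (60, 30, 15):
    score = rolling_corrected_mae(train_dev, test_ftm, test_ref, window, mean_of_recent)
    print(f"window = {window:2d} days, MAE = {score:.2f} K")
```

In this synthetic case, only the shortest window lets the predictor see part of the shifted deviations before predicting the remainder, which mirrors the general observation that shorter prediction periods give smaller errors.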

CONCLUSION
This paper uses the Tm and Ts data of 13 radiosonde stations in eastern China from 2010 to 2014 and constructs the F-Tm model by Fourier series analysis. The results are better than those of the BEVIS model, but there is still room for improvement in the accuracy in some areas. Therefore, based on the F-Tm model, four feature values (P, Ts, es, s) are selected, and the random forest method is used to predict the deviation and obtain the RFF-Tm model. For the spatio-temporal adaptability analysis of RFF-Tm, the study region is divided into three areas by latitude band, and six time scales are used for random forest prediction. The results show that: (1) The RFF-Tm model adapts well to eastern China, and its improvement over the F-Tm model is obvious. (2) In Area 1, at low latitude, the RFF-Tm model with a time scale of 1 year has good adaptability and can be applied to long-term sequence analysis.
(3) In Area 2 and Area 3, the RFF-Tm model gradually stabilizes as the time scale decreases, and the time series prediction is best at 1 month. The correction effect in Area 3 is slightly better and more stable than in Area 2, and both areas are suited to shorter time series analysis.