CASTLE: A CONTEXT-AWARE SPATIAL-TEMPORAL LOCATION EMBEDDING PRE-TRAINING MODEL FOR NEXT LOCATION PREDICTION

Next location prediction is helpful for service recommendation, public safety, intelligent transportation, and other location-based applications. Existing location prediction methods usually use sparse check-in trajectories and require massive historical data to capture complex spatial-temporal correlations. Trajectories with high spatial-temporal resolution carry rich information. However, obtaining personal trajectories with long time series and high spatiotemporal resolution usually proves challenging. This paper proposes a two-stage Context-Aware Spatial-Temporal Location Embedding (CASTLE) model, a multi-modal pre-training model for sequence-to-sequence prediction tasks. The method is built in two steps. First, large-scale location datasets, which are sparse but easier to acquire (i.e., check-in and anonymous navigation data), are used for pre-training location embedding to capture the multi-functional properties of locations under different contexts. After that, the learned contextual embedding is used for downstream location prediction in small-scale but higher spatiotemporal resolution trajectory datasets. Specifically, the CASTLE model combines Bidirectional and Auto-Regressive Transformers to generate contextual embedding vectors rather than a fixed vector for each location. Furthermore, we introduce a location and time-aware encoder to reflect the spatial distances between locations and visit times. Experiments are conducted on two real trajectory datasets. The results show that the CASTLE model can pre-train beneficial location embedding and outperforms the model without pre-training by 4.6-7.1%. The proposed method is expected to improve the next location prediction accuracy without massive historical data, which will greatly drive the use of trajectory data.


INTRODUCTION
Next location prediction has attracted intensive study in recent years owing to the growth of location-based services. The large volume of historical data makes it possible to understand individuals' preferences for their next movements (Wan et al., 2021), as trajectory data reveals individuals' travel patterns and preferences. Meanwhile, predicting the next location is of great significance for service recommendation, public safety, intelligent transportation, and other location-based applications (Luo et al., 2021).
Various models have been proposed over the past two decades to predict the next location based on historical trajectories. In general, location prediction methods can be categorized as pattern-based, probability distribution-based, statistical learning-based, and representation learning-based. The pattern-based methods extract spatiotemporal patterns from historical trajectories for location prediction. Commonly used patterns include sequential, frequent, periodic, and clustering patterns. For example, the commonly used sequence mining model T-pattern tree records the behavior and the visit time of each location and calculates the transition probabilities between locations to dynamically predict the next location by finding the optimal matching path (Monreale et al., 2009). Based on historical trajectory data, some studies mine the frequently visited locations of individuals through clustering and establish a path network to predict the next location that individuals will visit (Yuan et al., 2014). However, it is not easy to extract a movement pattern that remains effective and meaningful in the long term, and fixed patterns limit the diversity of model prediction results.

The core idea of the probability distribution-based method is to fit a probability calculation model to evaluate the user's visiting preferences for different locations and then predict the next visited location. This method does not require parameter learning and training but only needs to fit the parameters of a predetermined probability model based on historical trajectories. Specifically, this method establishes a probability calculation model from different aspects (e.g., geographic location, time, sequence characteristics, location category) based on an existing model (Zhang et al., 2015). One study uses the Gaussian Mixture Model to fit the two-dimensional spatial distribution of the user's historically visited points of interest (Zhang and Chow, 2014). In general, the method based on probability distribution has good interpretability. However, it relies on prior knowledge, and fitting with different statistical models may yield different results.
The method based on statistical learning obtains the optimal parameter combination by training on historical data, mainly including matrix factorization, topic models, and other classification-based machine learning models. The basic idea of the matrix factorization models, e.g., RCH (Wang et al., 2015) and GeoMF (Lian et al., 2018), is to decompose the user-location matrix into two low-rank matrices representing the user and the location, respectively. In subsequent research, the matrix is expanded into a user-time-location tensor, and tensor factorization is used to analyze the temporal patterns of users' travel behavior (Bhargava et al., 2015). However, this kind of method is not suitable for cold-start problems, because new users and new locations require retraining the model. Meanwhile, it ignores the sequential correlations in the trajectory.
With the advancement of deep learning technology, representation learning has become widespread in next location prediction research. The core idea is to represent each location with a vector and train the model on a specific task to obtain a latent embedding vector. Similar to natural language, a trajectory is a sequence in which each node strongly correlates with its context. Thus, word embedding models from natural language processing have been widely used in the representation learning of trajectories. For example, the DeepMove model (Feng et al., 2018) uses the Skip-gram model to extract contextual information; that is, it predicts the surrounding context from the central node. The Tale model (Wan et al., 2019) is based on the CBOW model to capture the temporal dependencies in the trajectory; that is, it predicts the central node from the surrounding context. Although these methods can generate beneficial embedding vectors, they produce a single fixed embedding vector for the same location under different contexts. Unlike the commonly used check-in data, the visited location is uncertain in real GPS or mobile phone trajectory datasets (Figure 1) due to the low spatial accuracy of the positioning terminal and place ambiguity (e.g., multiple shops or cinemas in a shopping mall). This means that people may visit the same location for different purposes; that is, a location may be multi-functional in the real world. Thus, a model that can dynamically generate contextual embedding vectors is urgently needed. The recently proposed BERT-based (Devlin et al., 2019) CTLE model (Lin et al., 2021) uses a bidirectional encoder to generate the embedding for a location based on its spatial-temporal context. It shows that dynamic location embedding significantly improves the downstream task's performance. However, it ignores the spatial proximity of the locations.
Some problems remain in the existing methods. First, most existing studies use sparse check-in trajectory data, which is easy to acquire. Trajectory data obtained by mobile phones or GPS terminals with high spatial-temporal resolution carries richer information, but obtaining personal trajectories with long time series and high spatiotemporal resolution usually proves challenging. Furthermore, training an effective model is difficult without massive historical data. Second, the trajectory contains rich spatial-temporal information, and the existing methods fail to capture spatiotemporal associations between visited locations effectively.
To address the above problems, we propose a two-stage context-aware spatial-temporal location embedding pre-training model for next location prediction. The contributions can be summarized as follows:
(1) A two-stage framework is proposed to address the difficulty of obtaining large-scale trajectories with high spatial-temporal resolution. Thus, our model can predict visited places accurately using small-scale fine-grained trajectory data.
(2) We propose an encoding layer that incorporates the spatial position and the temporal information. Therefore, preferences for travel distance and visit time can be reflected in the model.
(3) The bidirectional encoder and the autoregressive decoder are combined to dynamically capture long-term sequential dependence, which is more suitable for the uncertain visited places of real GPS or mobile phone trajectories.

Proposed Framework for Next Location Prediction
Obtaining personal trajectories with long time series and high spatiotemporal resolution usually proves challenging. Thus, we propose a two-stage framework (Figure 2) for next location prediction, consisting of pre-training contextual embedding vectors for locations and fine-tuning for next location prediction. In this paper, a visit $v = (l, t, g)$ indicates that an individual visits a location $l$ at time $t$, where $g$ denotes the geospatial position of the location $l$. Given the trajectory $s = \{(l_1, t_1, g_1), (l_2, t_2, g_2), \ldots, (l_n, t_n, g_n)\}$, the goal of next location prediction is to predict the output $l_{n+1}$. The goal of the pre-training step is to learn a parameterized mapping function $f$ that generates the latent contextual embedding vector $V(v_i)$ from a visit record $v_i = (l_i, t_i, g_i)$ and its context $C(v_i)$.

The Proposed CASTLE Model
Our proposed CASTLE model (Figure 3) consists of 1) a multi-modal encoding module that inputs the visited location, time, and geospatial position into the model; 2) a bidirectional encoder that learns the embedding of locations by taking other relevant visited locations within the sequence into account; and 3) an autoregressive decoder that predicts locations auto-regressively.
(1) Location encoding layer: The location encoding is implemented using a fully connected embedding layer, which can be represented as an embedding matrix $E_l \in \mathbb{R}^{N_l \times d_{model}}$, where $N_l$ is the total number of locations and $d_{model}$ is the preset dimension of the location embedding vector. A matrix multiplication $z_{l_i} = o_{l_i}^{T} E_l$ generates the embedding $z_{l_i}$ of location $l_i$ from its one-hot vector $o_{l_i}$.
(3) Geospatial position embedding layer: The geospatial position of a visit is usually characterized by latitude and longitude. Nevertheless, this representation suffers from the sparsity issue. Thus, we adopt a hierarchical map gridding method to represent the geospatial position (Lian et al., 2020). Because the grid is divided like a quadtree, each grid cell is represented as a base-4 number with a certain length (e.g., the length of a quadtree key is 16 at the 16th level of detail). In this way, the spatial distances between different locations are reflected in their quadtree keys. In order to model the spatial positional relationship of visited locations, this study uses the N-gram method and a self-attention network to construct a geospatial embedding layer from the tiled quadtree index of trajectory points (Lian et al., 2020). N-gram is a widely used method of segmenting sequences according to a certain length. An N-gram sequence consists of the substrings obtained by sliding a window of length N over the string one character at a time. Taking the quadtree index "13101113" as an example, the corresponding trigram sequence (N = 3) is {131, 310, 101, 011, 111, 113}. Since the character set of the quadtree index string only includes the four characters {0, 1, 2, 3}, single characters are not enough to characterize the whole area. With N-grams, the vocabulary size of the corresponding embedding layer is $4^N$. In order to obtain the contextual information of sequences in N-grams, the N-gram embedding sequences are represented by a self-attention encoder after adding positional encoding. The self-attention encoder used here is consistent with the encoder in the transformer. Finally, the embedding of the geospatial position is generated by the average pooling of the N-gram sequence.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-4/W2-2022. GeoSpatial Conference 2022 - Joint 6th SMPR and 4th GIResearch Conferences, 19-22 February 2023, Tehran, Iran (virtual)
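The N-gram tokenization of quadtree keys can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, the bit-interleaving in `tile_to_quadkey` follows the common web-map tile convention (an assumption, since the text does not state which gridding scheme is used), and reading each N-gram as a base-4 integer is just one convenient way to index a vocabulary of size 4^N.

```python
from typing import List


def tile_to_quadkey(tile_x: int, tile_y: int, level: int) -> str:
    """Interleave the bits of a tile's x/y index into a base-4 quadtree key.

    Assumes the common web-map tile convention (hypothetical for this paper).
    """
    digits = []
    for i in range(level, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


def ngrams(key: str, n: int = 3) -> List[str]:
    """Slide a window of length n over the key, one character at a time."""
    return [key[i:i + n] for i in range(len(key) - n + 1)]


def ngram_ids(key: str, n: int = 3) -> List[int]:
    """Map each n-gram to an integer in [0, 4**n) by reading it as base 4."""
    return [int(gram, 4) for gram in ngrams(key, n)]


if __name__ == "__main__":
    print(ngrams("13101113"))     # the trigram example from the text
    print(ngram_ids("13101113"))  # indices into a vocabulary of size 4**3
```

Two nearby grid cells share a long key prefix, so most of their N-grams coincide, which is how spatial proximity can carry over into the embedding.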

Bidirectional Encoder and Autoregressive Decoder:
The main body of the CASTLE model adopts the encoder-decoder structure of the transformer (Vaswani et al., 2017). The encoder captures sequence context information, and the decoder is used for sequence prediction. The encoder consists of several attention sub-modules with the same layer structure. The bidirectional self-attention in the sub-modules can capture the spatiotemporal context information in the sequence. The output vector of the encoder corresponding to the visit record $v_i$ contains the spatiotemporal context information $C(v_i)$ and is represented as $V(v_i)$ in this study. In the decoder, the sequence obtained by the decoder is used as Q, and the representation sequence obtained by the encoder is used as K and V to perform attention interaction. After multiple self-attention and attention modules, a sequence of latent vectors $\{h_1, h_2, \ldots, h_n\}$ with the same dimension as the input vector is finally obtained. Finally, the location is predicted from the decoder output using a learnable output layer, where $h_i$ is the latent output vector of the decoder, $W_o \in \mathbb{R}^{N_l \times d}$ and $b \in \mathbb{R}^{N_l}$ are both learnable, $N_l$ is the total number of locations, and $d$ is the preset dimension of the location embedding vector.
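The prediction equation itself is not legible in this text. One form that is consistent with the stated dimensions is a linear layer followed by a softmax over all locations. This is a sketch under that assumption: the softmax, the symbol names $W_o$, $b$, and $\hat{p}_i$ are illustrative choices, not confirmed by the source.

```latex
\hat{p}_i = \mathrm{softmax}\left(W_o h_i + b\right), \qquad
\hat{l}_{i+1} = \arg\max_{l \in \{1, \ldots, N_l\}} \hat{p}_i[l],
\qquad h_i \in \mathbb{R}^{d},\; W_o \in \mathbb{R}^{N_l \times d},\; b \in \mathbb{R}^{N_l}
```

Under this reading, $W_o h_i + b$ yields one score per candidate location, which matches the classification view of the objectives described later.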

Pre-training Objective
The goal of pre-training is to learn a mapping function f to produce a contextual embedding V(v) for a target visit v given its spatial-temporal context C(v).
Inspired by the Masked Language Model proposed in BERT, we implement a self-supervised training model. Given a trajectory $s$, 15% of the visit records are randomly chosen as masked visit records, and the embedding vectors of the masked records are replaced with special mask tokens for the location, time, and geospatial position. In the pre-training process, the original visited location of each masked visit record is predicted. The pre-training objective is to maximize the probability of correctly predicting the masked locations, where $\theta$ denotes all the learnable parameters, $M$ is the number of all the masked visit records, and $p(l_i)$ is the probability that location $l_i$ is correctly predicted.
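The masking step can be sketched as follows. This is a minimal illustration, not the authors' code: `mask_visits`, the `MASK` placeholder, and the rule of masking at least one record per non-empty trajectory are assumptions made for the example, and the real model replaces embedding vectors rather than raw tuples.

```python
import random
from typing import Dict, List, Optional, Tuple

Visit = Tuple[int, float, str]  # (location id, timestamp, quadtree key)
MASK = None  # hypothetical placeholder standing in for the special mask tokens


def mask_visits(
    trajectory: List[Visit],
    mask_ratio: float = 0.15,
    seed: Optional[int] = None,
) -> Tuple[List[Optional[Visit]], Dict[int, int]]:
    """Randomly mask a share of visits and return the prediction targets.

    Returns the masked sequence and a dict mapping each masked position to
    the original location id that the model should recover.
    """
    rng = random.Random(seed)
    # Assumption: mask at least one record when the trajectory is non-empty.
    n_mask = min(len(trajectory), max(1, round(len(trajectory) * mask_ratio)))
    masked_positions = sorted(rng.sample(range(len(trajectory)), n_mask))
    masked: List[Optional[Visit]] = list(trajectory)
    targets: Dict[int, int] = {}
    for pos in masked_positions:
        targets[pos] = trajectory[pos][0]  # keep the original location id
        masked[pos] = MASK
    return masked, targets


if __name__ == "__main__":
    demo = [(i % 5, float(i * 600), "0123") for i in range(20)]
    masked_seq, target_locations = mask_visits(demo, seed=0)
    print(target_locations)
```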

Next Location Prediction Objective
Given the trajectory $\{(l_1, t_1, g_1), (l_2, t_2, g_2), \ldots, (l_{n-1}, t_{n-1}, g_{n-1}), (l_n, t_n, g_n)\}$, the goal of next location prediction is to predict $l_{n+1}$ correctly. Thus, the objective of next location prediction is to maximize the probability of the correct next location over all the predicted records, where $\theta$ denotes the set of all the learnable parameters of the model. The above objectives can be transformed into classification tasks and optimized using the cross-entropy loss function.
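Since the objective formulas are not legible in this text, the following is one standard way to write a cross-entropy objective of the kind described. The symbol $N$ for the number of predicted records, the superscript $(j)$ indexing records, and the $1/N$ normalization are assumptions made for illustration.

```latex
\theta^{*} = \arg\min_{\theta} \; -\frac{1}{N} \sum_{j=1}^{N}
\log p_{\theta}\left( l_{n+1}^{(j)} \,\middle|\, s^{(j)} \right)
```

The pre-training objective would take the same form under this reading, with the sum running over the $M$ masked visit records and each term being the log-probability of the original location at a masked position.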

EXPERIMENTAL RESULT
The experiments were conducted on two real-world spatial-temporal trajectory datasets to verify the effectiveness of the proposed model. Furthermore, the CASTLE model was compared quantitatively with other models.

Datasets
Two real trajectory datasets were used in the experiments: a large-scale anonymous navigation dataset and a small-scale mobile phone trajectory dataset in the same city, denoted as TucityPre and TucityLife, respectively. The TucityPre dataset was captured from one of the biggest navigation service companies in China and includes anonymous mobile phone or vehicle navigation data for about two weeks in May 2021. The TucityLife dataset was collected by volunteers in the same city in August 2021.
A trajectory subsequence in which a user stays within 100 meters for more than 5 minutes is regarded as a visit to a location. In both datasets, we included the trajectories with more than 3 different visited locations.
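The visit extraction rule (within 100 meters for more than 5 minutes) resembles a common stay-point detection procedure. The sketch below is one possible reading, not the authors' pipeline: it measures distance from the first point of each candidate window, uses the haversine formula, and reports the mean coordinates of the window; all function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]  # (latitude, longitude, unix time in seconds)
Stay = Tuple[float, float, float, float]  # (mean lat, mean lon, arrival, departure)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two WGS84 coordinates."""
    radius = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def detect_stays(
    points: List[Point],
    max_dist_m: float = 100.0,
    min_duration_s: float = 300.0,
) -> List[Stay]:
    """Find windows where the user stays near an anchor point long enough."""
    stays: List[Stay] = []
    i, n = 0, len(points)
    while i < n:
        j = i + 1
        # Extend the window while points stay within max_dist_m of the anchor.
        while j < n and haversine_m(points[i][0], points[i][1],
                                    points[j][0], points[j][1]) <= max_dist_m:
            j += 1
        duration = points[j - 1][2] - points[i][2]
        if duration > min_duration_s:
            window = points[i:j]
            lat = sum(p[0] for p in window) / len(window)
            lon = sum(p[1] for p in window) / len(window)
            stays.append((lat, lon, points[i][2], points[j - 1][2]))
            i = j  # continue after the detected stay
        else:
            i += 1
    return stays
```

Each detected stay would then be matched to a location identifier and a quadtree key before being fed to the model.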

Baseline Pre-Training Embedding Models:
In order to verify the effectiveness of the proposed CASTLE model in pre-training embedding vectors for visits, this paper uses the CTLE model (Lin et al., 2021) as the baseline. The CTLE model uses a BERT-like bidirectional encoder to predict the masked location. It incorporates time and location information and is a state-of-the-art model for pre-training location embedding vectors.

Baseline Next Location Prediction Models:
To evaluate the usefulness of our framework, we employ several effective next location prediction methods:
(1) GRU (Gated Recurrent Unit) (Cho et al., 2014): An improved variant of the RNN model. We use the GRU-based seq2seq location prediction model as a baseline model.
(2) DeepMove (Feng et al., 2018): A state-of-the-art model consisting of recurrent network and attention layers to capture sequence correlations.
(3) Pre-trained CASTLE encoder + GRU: The pre-trained embedding vectors are used as the input to the GRU model.

Evaluation Metrics
There is no standard performance metric for pre-trained embedding methods. Since the downstream task of this study is next location prediction, the trajectory prediction task is used here to evaluate the accuracy of the pre-training model. Specifically, we masked the last visit record in the trajectory sequence and used the pre-training model to predict the last visited location of the trajectory.
Two widely used metrics of next location prediction, Recall and NDCG (Normalized Discounted Cumulative Gain) (Lian et al., 2020), were adopted in this study. Both metrics were calculated at the cut-offs k = 1 and k = 5.
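For reference, Recall@k and NDCG@k for a single ground-truth location per test record can be computed as in the sketch below. It assumes the usual single-target definitions, where the ideal DCG is 1, so NDCG reduces to 1 / log2(rank + 1). The function names are illustrative and not taken from the paper.

```python
import math
from typing import List, Sequence


def recall_at_k(ranked: Sequence[int], target: int, k: int) -> float:
    """1.0 if the true location appears among the top-k predictions."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[int], target: int, k: int) -> float:
    """Discounted gain of the true location; ranks are 1-indexed."""
    for rank, location in enumerate(ranked[:k], start=1):
        if location == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0


def evaluate(rankings: List[Sequence[int]], targets: List[int], k: int) -> dict:
    """Average both metrics over all test records."""
    n = len(targets)
    return {
        "recall": sum(recall_at_k(r, t, k) for r, t in zip(rankings, targets)) / n,
        "ndcg": sum(ndcg_at_k(r, t, k) for r, t in zip(rankings, targets)) / n,
    }
```

At k = 1 the two metrics coincide, since a hit at rank 1 has a discount of 1; they differ at k = 5, where NDCG rewards placing the true location nearer the top.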

Settings
The dimension of location embedding $d_{model}$ was set to 64 for the CASTLE model and the other compared methods. We train all the models using the Adam optimizer with a learning rate of 0.001. To avoid over-fitting, we set the dropout ratio to 0.2. In the pre-training process, the mask ratio of the input encoder is set to 15%.

Next Location Prediction Results:
The location prediction performance of different models on the TucityLife test set is shown in Table 3. From the next location prediction results, we draw the following conclusions: (1) The performance of the CASTLE model without pre-training is better than that of GRU and DeepMove, indicating that the spatiotemporal context information obtained through attention plays a positive role in location prediction; (2) Inputting the pre-trained location embedding vectors into the GRU model also significantly improves the accuracy, indicating that the location embedding vectors pre-trained by the CASTLE model have good transfer performance.
(1) -TimeEnc: This model replaces the time encoding layer with the positional encoding layer in Transformers.
(2) -GeoEnc: This model uses only the location and time encoding layers.
We compared these two variants with the CASTLE model on the next location prediction task in pre-training. Figure 5 shows that both the time and geospatial encoding layers benefit the learning of location embedding vectors. The CASTLE model with time and geospatial encoding layers can better capture the spatiotemporal context information of visited locations in the trajectory sequence.
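To make the -TimeEnc variant concrete, the sketch below computes the sinusoidal positional encoding of Vaswani et al. (2017). It assumes that this sinusoidal form is what "the positional encoding layer in Transformers" refers to, which the text does not state explicitly. The point of the comparison is that this encoding depends only on a visit's index in the sequence, not on its actual visit time.

```python
import math
from typing import List


def positional_encoding(seq_len: int, d_model: int) -> List[List[float]]:
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index, so the exponent is i / d_model.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


if __name__ == "__main__":
    # Two visits that are hours apart still receive adjacent encodings here,
    # because only their order in the sequence matters.
    table = positional_encoding(seq_len=4, d_model=8)
    print(table[1][:4])
```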

CONCLUSIONS
To address the difficulty of obtaining personal trajectories with long time series and high spatiotemporal resolution, we propose a two-stage next location prediction framework. The first step is pre-training contextual embedding for locations using large-scale trajectory datasets, which are relatively sparse but easier to acquire. After that, the model is fine-tuned for the next location prediction task on small-scale but higher spatiotemporal resolution trajectory datasets. Furthermore, a context-aware spatial-temporal location embedding (CASTLE) model is designed for pre-training and next location prediction. The model generates different embedding vectors for the same location in different spatial-temporal contexts. Specifically, the CASTLE model combines Bidirectional and Auto-Regressive Transformers to predict the next location. We also introduce a spatiotemporal-aware encoder, consisting of location, time, and geospatial encoding layers, to reflect the spatial distances between locations and the visit times. Experiments were conducted on a large-scale anonymous navigation dataset and a small-scale mobile phone trajectory dataset in the same city. The results show that the fine-tuned CASTLE model achieves the best performance and outperforms the model without pre-training by 4.6-7.1%, indicating the effectiveness of our proposed two-stage next location prediction framework. Furthermore, inputting the pre-trained location embedding vectors into other location prediction models also significantly improves the accuracy, indicating that the location embedding vectors pre-trained by the CASTLE model have good transfer performance. Without massive historical data, our method can still accurately predict the next location in a dense but small-scale trajectory dataset.

Figure 1. Uncertain visited places in real trajectories. One stop may match multiple different places.

Figure 2. Flowchart of the proposed two-stage next location prediction framework. First, large-scale sparse location datasets, which are easier to acquire (i.e., check-in data and anonymous navigation data), are used for pre-training the location embedding model to capture multi-functional properties. We propose a Context-Aware Spatial-Temporal Location Embedding Pre-Training (CASTLE) model to learn the contextual embedding vectors for visited locations. The same location will have different embedding vectors in different spatial-temporal contexts. After pre-training the CASTLE model, the learned contextual embedding is used for downstream location prediction in small-scale but higher spatiotemporal resolution trajectory datasets. In addition, the parameters of the CASTLE model are fine-tuned to learn the spatial-temporal information in the dataset.

Figure 3. The sketch map of CASTLE. (a) The input layer of CASTLE consists of the location, time, and geospatial position encoding layers. (b) Bidirectional encoder. (c) Autoregressive decoder.

Figure 4. The geospatial encoding layer of CASTLE.

(4) CASTLE without pre-training: The CASTLE location prediction model is trained directly on the TucityLife dataset without pre-training.

The geocoded N-grams use trigrams to represent quadtree indexes and use a two-layer self-attention structure with a single attention head to capture the context of trigrams. The level of the quadtree key is set to 17 in our experiments. The hidden layer dimension of its feed-forward network is set to 128. The encoder of CASTLE uses a two-layer self-attention structure with four attention heads, the dimension of the feed-forward network hidden layer is set to 128, and the parameter settings of the decoder are consistent with the encoder. The pre-training batch size is set to 64, and the number of training epochs is set to 1000. In addition, to reduce the effect of randomness on the results, a different random seed is used for each training epoch. In pre-training, the first 80% (in time order) of the trajectory visit records for each user are used for training, and the remaining records are used for testing to prevent data leakage. The parameters for next location prediction are almost the same as in pre-training. Considering the small size of the TucityLife dataset, the batch size is set to 32. For each user's time-ordered trajectory sequence, the first 70% of the trajectory visit records are used for training, 20% are used for validation, and the rest are used as the test set.

Pre-training Results:
After the model was pre-trained on the training set of TucityPre, the trajectory prediction task was chosen to evaluate the performance of the pre-trained CASTLE model. The comparison between the model and the baseline model is shown in Table 2.

The numbers of users, locations, and visit records of the two datasets are shown in Table 1.

Table 1. Statistics of users, locations, and visit records of the used datasets.

Table 2 .
For the top-1 and top-5 sets of prediction results, the Recall and NDCG of the CASTLE model are both better than those of the CTLE model. This shows that the CASTLE pre-training method proposed in this study can better capture the spatiotemporal context information by incorporating geospatial position information, which improves the model's accuracy in the trajectory prediction task.

(3) The fine-tuned CASTLE model achieves the best performance and outperforms the model without pre-training by 4.6-7.1%, indicating the effectiveness of our proposed two-stage next location prediction framework.

Table 3. Comparison of next location prediction with different methods.

Ablation Study
To further prove the effectiveness of the time encoding layer and the geospatial encoding layer of the CASTLE model, we design an ablation study, and the compared variants include -TimeEnc and -GeoEnc.

Figure 5. The ablation study results of the CASTLE model.