Spatial trajectory data generation with Conditional Diffusion Probabilistic Model
In this study, I implement the conditional diffusion probabilistic model (the DiffTraj architecture) of Zhu et al. (2023) on a real aircraft trajectory dataset (TrajAir), with weather parameters as additional conditioning information.
Dataset : General Aviation Aircraft trajectories (AirLab)
Data period : 18 September 2020–23 April 2021
Data points : 2,731,225 GPS records
Total trajectories : 7602
Total distinct Aircraft_ID : 526
Objective : Train the conditional diffusion probabilistic model on real aircraft trajectory data and generate high-quality synthetic trajectories under different underlying distributions
The objective of the DiffTraj architecture is to estimate the real trajectory distribution p(x) with a parameterized model p_θ(x). Given random noise x_T ~ N(0, I), the model generates a synthetic trajectory x_0 under the observation condition c.
The forward process gradually perturbs the data distribution with noise: given a set of real data samples x_0, it adds Gaussian noise over T time steps. The reverse diffusion process then aims to recover the original data distribution from the noisy data.
In the conditional module (Figure 1), a wide-and-deep network embeds the conditional information. The wide component acts as memorization: it captures the significance of individual features and their interactions. The deep component, in contrast, concentrates on generalization, learning high-level abstractions and feature combinations. Together, the two components capture both simple and complex patterns in the trajectory. The wide network embeds the numerical motion attributes, while the deep network handles the discrete categorical attributes.
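The wide-and-deep split can be sketched in a few lines of NumPy. This is a minimal illustration, not the DiffTraj code: the layer sizes, the ReLU hidden layer, and the additive fusion of the two paths are all assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the real DiffTraj layer sizes are not given here.
NUM_FEATS = 4      # e.g. trajectory length, travel time, speed, wind speed
NUM_CATS = 526     # e.g. number of distinct Aircraft_IDs in the dataset
EMB_DIM = 8
HIDDEN = 16
OUT_DIM = 16

# Wide path: one linear map that "memorizes" raw numeric attributes.
W_wide = rng.normal(size=(NUM_FEATS, OUT_DIM)) * 0.1

# Deep path: an embedding table for the categorical attribute followed by
# a small MLP that "generalizes" to higher-level feature combinations.
emb_table = rng.normal(size=(NUM_CATS, EMB_DIM)) * 0.1
W1 = rng.normal(size=(EMB_DIM, HIDDEN)) * 0.1
W2 = rng.normal(size=(HIDDEN, OUT_DIM)) * 0.1

def embed_condition(numeric, cat_id):
    """Return one conditioning vector for the diffusion model."""
    wide = numeric @ W_wide                      # memorization path
    h = np.maximum(emb_table[cat_id] @ W1, 0.0)  # ReLU hidden layer
    deep = h @ W2                                # generalization path
    return wide + deep                           # fuse both views

cond = embed_condition(np.array([1.0, 0.5, 0.2, 0.1]), cat_id=42)
```

The resulting vector `cond` is what would be injected into the denoising network at every diffusion step.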
A synthetic trajectory is generated by providing conditional information to guide the generation process: given a set of conditions such as trajectory length, travel time, vehicle speed, and wind speed, the synthetic trajectory will mimic real-world patterns and behavior.
Furthermore, synthetic trajectories can be created by specifying the start and end areas of the trajectories; the model then generates trajectories that satisfy those conditions.
The DiffTraj model has one parameter that controls the variability of the generated results: the guide scale. It acts as a weight on the noise predicted during the reverse process, determining how strongly the conditional parameters influence the noise prediction. To assess model quality, the experiment should be run in a deterministic setting, which means setting the guide scale as low as possible: a low guide scale makes the predicted noise depend strongly on the conditional parameters. The results demonstrate that the model can generate trajectories that align with the specified start and end regions.
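One common way to realize such a guide scale is the classifier-free guidance combination, which blends a conditional and an unconditional noise prediction. The weighting below is a sketch of that standard formulation, not necessarily the exact expression in the author's code.

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, guide_scale):
    # Classifier-free guidance: guide_scale = 0 reduces to the purely
    # conditional prediction, so generation follows the conditions as
    # closely as possible; larger values extrapolate further away from
    # the unconditional prediction, adding variability.
    return (1.0 + guide_scale) * eps_cond - guide_scale * eps_uncond

eps_c = np.array([0.2, -0.1])   # noise predicted with conditions
eps_u = np.array([0.5, 0.3])    # noise predicted without conditions

low = guided_noise(eps_c, eps_u, 0.0)   # equals eps_c: fully conditional
high = guided_noise(eps_c, eps_u, 3.0)  # pushed away from eps_u
```

At `guide_scale = 0` the sampler uses only the conditional branch, matching the deterministic setting described above.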
Data processing
Data preprocessing includes trajectory segmentation, determination of trajectory start and end regions, and data normalization and transformation. In trajectory segmentation, unique trajectories must be delineated from each group of GPS records in the dataset.
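A simple way to delineate trajectories is to split each aircraft's time-ordered records wherever a large time gap appears. The sketch below assumes a 10-minute gap threshold and a `(timestamp, x, y, z)` record layout; both are illustrative choices, not the values used in the study.

```python
from datetime import datetime, timedelta

def segment_trajectories(records, gap=timedelta(minutes=10)):
    """Split one aircraft's time-ordered GPS records into trajectories.

    A new trajectory starts whenever the time gap between consecutive
    records exceeds `gap` (assumed threshold for illustration).
    """
    trajectories, current = [], []
    prev_t = None
    for rec in records:                      # rec = (timestamp, x, y, z)
        t = rec[0]
        if prev_t is not None and t - prev_t > gap:
            trajectories.append(current)     # close the finished trajectory
            current = []
        current.append(rec)
        prev_t = t
    if current:
        trajectories.append(current)
    return trajectories

t0 = datetime(2020, 9, 18, 12, 0)
recs = [(t0 + timedelta(seconds=s), 0.0, 0.0, 0.0) for s in (0, 5, 10)]
recs += [(t0 + timedelta(minutes=30), 1.0, 1.0, 0.0)]
segments = segment_trajectories(recs)   # splits into two trajectories
```

Grouping the raw GPS table by `Aircraft_ID` first and applying this per group yields the per-aircraft trajectory counts discussed below.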
Among the 7121 trajectory records, there were interesting findings in the data distribution: 13 aircraft IDs dominate the number of recorded trajectories. Figure 3 shows the distribution of the number of trajectories recorded for the top 50 aircraft IDs. Each of these 13 aircraft IDs has more than 100 trajectory records; for example, aircraft ID 11322591 has 648 trajectory records in the dataset. In contrast, 285 unique aircraft IDs have only one trajectory record each.
Deterministic mode
To evaluate model performance, the model needs to be run in a deterministic setting. For that purpose, I defined the start and end region of each trajectory. Figure 4 shows how these regions are determined: each region is given by four x/y Cartesian coordinates, and the start and end areas are determined from 20% of the trajectory's GPS coordinates at each end.
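Under my reading of the 20% rule, the region is the axis-aligned bounding box of the first (or last) fifth of the trajectory's points. The helper names and the exact fraction handling below are assumptions for illustration.

```python
import numpy as np

def bounding_region(points):
    """Four x/y corner coordinates covering a set of 2-D points."""
    xs, ys = points[:, 0], points[:, 1]
    return [(xs.min(), ys.min()), (xs.min(), ys.max()),
            (xs.max(), ys.min()), (xs.max(), ys.max())]

def start_end_regions(traj, frac=0.2):
    """Bounding regions of the first and last `frac` of a trajectory."""
    n = max(1, int(len(traj) * frac))
    return bounding_region(traj[:n]), bounding_region(traj[-n:])

# A toy straight-line trajectory of 50 points from (0, 0) to (10, 5).
traj = np.column_stack([np.linspace(0, 10, 50), np.linspace(0, 5, 50)])
start_region, end_region = start_end_regions(traj)
```

These two rectangles are then supplied to the model as the deterministic start/end conditions.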
Forward & Reverse process
In this step, I transform the dataset into random noise over a sequence of time steps. Figure 5 illustrates the forward and reverse processes of the diffusion model. An important decision here is the noise schedule over the time steps, which determines how well the model can learn the trajectory transformation at each step. If the trajectory collapses into complete noise too quickly, the model fails to learn the trajectory construction pattern; if the schedule is stretched over too many steps, the computational cost grows. I therefore needed a schedule under which the model can optimally capture the trajectory transformation.
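The forward process admits a closed form, so any step t can be sampled directly from the clean data. The sketch below uses a linear beta schedule and T = 300 steps; both values are assumptions for illustration, not the study's settings.

```python
import numpy as np

T = 300                                # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule (assumed)
alphas_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

def q_sample(x0, t, rng):
    """Jump directly to step t of the forward process:
    x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

rng = np.random.default_rng(0)
x0 = np.ones((10, 2))                  # a toy 10-point 2-D trajectory
x_late = q_sample(x0, T - 1, rng)
# By the last step almost all signal is gone: alphas_bar[-1] is near zero,
# which is exactly the "complete noise" endpoint the schedule must reach.
```

Tuning the schedule amounts to choosing `betas` (and T) so that `alphas_bar` decays neither too fast nor too slowly.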
Model training
In the reverse (inverse) mode, the model learns to predict the noise between two consecutive time steps. The forward and reverse steps occur sequentially within one loop iteration. The objective of the reverse mode is to train the model to predict noise, so that it can predict the noise in the coordinates at any time step. At each iteration, the loss function is calculated and the model parameters are optimized; training stops when the total number of epochs is reached.
Within the training loop, the model computes the loss function. I used Mean Squared Error (MSE), which measures the difference between the predicted noise (from the reverse process) and the actual noise (added in the forward process) at successive time steps. Monitoring the MSE shows how accurately the model predicts the noise: when the loss decreases to near zero, the model reaches convergence, where further reduction in error is no longer significant.
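The forward-noising / noise-prediction / MSE cycle above can be sketched end to end. To keep the sketch runnable, a single linear map stands in for the real denoising network, and the data are toy Gaussian coordinates; everything except the loop structure and the MSE-on-noise objective is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
alphas_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

# Toy "denoiser": one linear map standing in for the real network.
W = np.zeros((2, 2))
lr = 1e-2

def mse(a, b):
    return float(np.mean((a - b) ** 2))

losses = []
for step in range(200):                  # one cycle: forward + reverse
    x0 = rng.normal(size=(32, 2))        # a batch of toy coordinates
    t = rng.integers(0, T, size=(32, 1)) # random diffusion step per sample
    eps = rng.normal(size=(32, 2))       # the "actual" forward noise
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps
    pred = xt @ W                        # predicted noise (reverse mode)
    losses.append(mse(pred, eps))        # monitor convergence
    grad = 2.0 * xt.T @ (pred - eps) / len(xt)
    W -= lr * grad                       # optimize model parameters
```

Even with this toy model the loss curve decreases and flattens, mirroring the convergence behaviour described above.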
Model Hyperparameter:
I tried various combinations of model parameters and compared the evaluation results. In the initial stage, I used small parameter values (e.g., batch size = 64, number of epochs = 10) to observe the model's behavior. If the evaluation results were reasonable, I continued with larger parameter settings in that direction, repeating the process until an optimal model was reached.
MSE (Mean Squared Error): 0.025
The model performs well, with the error converging smoothly and no significant change after 60 epochs. To reach the optimal result, training was run several times, each run cycling through data input, model training, and result evaluation. During this optimization I tuned the model by changing several hyperparameter settings and rerunning training.
Trajectory Data Generation
Once the optimal model was obtained, I used it to generate trajectories. Figure 9 shows an example of a single trajectory generated from the validation set. The blue dots represent the real trajectory, while the other colors are synthetic trajectories from different inference runs with the same parameter setup. In this example, the generated trajectories follow a pattern similar to the real trajectory (blue). Each inference run produces a different variation, which demonstrates the diversity of the probabilistic diffusion model. Figure 9 also shows that the generated starting coordinates lie close to those of the real trajectory.
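Generation is the reverse process run from pure noise. The sketch below shows standard DDPM ancestral sampling; the zero-returning `predict_noise` is a placeholder for the trained conditional denoiser so the code runs without a trained model, and the schedule values are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
betas = np.linspace(1e-4, 0.02, T)     # assumed schedule, as before
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

def predict_noise(xt, t):
    """Stand-in for the trained conditional denoiser; returns zeros so
    the sketch is runnable without a trained model."""
    return np.zeros_like(xt)

def sample_trajectory(n_points=10):
    """DDPM ancestral sampling: start from pure noise x_T, step back to x_0."""
    x = rng.normal(size=(n_points, 2))
    for t in range(T - 1, -1, -1):
        eps_hat = predict_noise(x, t)
        coef = betas[t] / np.sqrt(1.0 - alphas_bar[t])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t])
        noise = rng.normal(size=x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise   # stochastic step (t > 0)
    return x

traj = sample_trajectory()
```

Because fresh noise is injected at every step t > 0, repeated calls with the same conditions yield different trajectories, which is exactly the diversity visible across the runs in Figure 9.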
Figure 10 shows the quality of trajectory generation on the validation dataset, presented as a histogram: the x-axis is the JSD index between 0 and 1, and the y-axis is the trajectory frequency. The similarity between synthetic and real trajectories varies over the validation set. On the density error metric, most results show moderately good similarity, with JSD indices between 0.10 and 0.50. Forty trajectories (5%) failed to be generated properly. The lower performance on the validation set, compared to the training loss, may indicate an overfitting problem.
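The JSD index itself is straightforward to compute from histograms. Below is a minimal base-2 implementation (so the score is bounded in [0, 1]); the 5x5 spatial grid used for the density comparison is an assumption for illustration.

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions,
    in bits, bounded in [0, 1]."""
    p = np.asarray(p, float) / np.sum(p)
    q = np.asarray(q, float) / np.sum(q)
    m = 0.5 * (p + q)
    def kl(a, b):
        return np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Density error: compare spatial occupancy histograms of real vs
# synthetic points on the same grid (5x5 bins here, as an assumption).
real = np.histogram2d([0.1, 0.2, 0.8], [0.1, 0.9, 0.5], bins=5,
                      range=[[0, 1], [0, 1]])[0].ravel()
synth = np.histogram2d([0.15, 0.25, 0.7], [0.1, 0.85, 0.5], bins=5,
                       range=[[0, 1], [0, 1]])[0].ravel()
score = jsd(real, synth)
```

Identical distributions score 0 and disjoint ones score 1, which matches the [0, 1] x-axis of the histograms in Figures 10 and 12.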
Test set analysis
Figure 11 shows an example of a single trajectory generated from the OOD (out-of-distribution) dataset. The blue dots represent the real trajectory; the other colors are synthetic trajectories from different inference runs. The result shows some interesting findings: the synthetic trajectories still follow a pattern similar to the real trajectory and still capture the curve in the trajectory pattern. However, the generated trajectories deviate from the real one; no single trajectory matches it exactly, and the start and end points do not quite match either.
Next, I measured the quality of the generated trajectories with JSD-based evaluation metrics. Figure 12 shows the results on the test dataset; the similarity varies over the set. The density error metric mostly shows moderately good results, with the bulk of the JSD indices between 0.10 and 0.50; around 16 trajectories failed to be generated properly. The length error and trip error metrics show fairly good results, with length error mostly between 0.275 and 0.30 and overall trip error below 0.30, indicating that the generated trajectories have similar lengths and match the start and end points.
Learning insight
Overall, the probabilistic conditional diffusion model can handle spatial-temporal distributions across different underlying datasets, with moderately good quality according to the JSD metric. Based on the evaluation in this study, the model converged well during training, and the trajectory quality on the validation dataset was quite good under the JSD metric. An interesting finding is that the model still produced synthetic trajectories of moderately good quality on the test dataset.
References
[1] Patrikar, J., Moon, B., Oh, J., Scherer, S. (2022). Predicting Like a Pilot: Dataset and Method to Predict Socially Aware Aircraft Trajectories in Non-Towered Terminal Airspace. 2022 International Conference on Robotics and Automation (ICRA), pp. 2525–2531. doi: https://doi.org/10.1109/ICRA46639.2022.9811972
[2] Du, Y., Hu, Y., Zhang, Z., Fang, Z., Chen, L., Zheng, B., Gao, Y. (2023). LDPTrace: Locally differentially private trajectory synthesis. Proceedings of the VLDB Endowment, 16(8). doi: http://dx.doi.org/10.14778/3594512.3594520
[3] Zhu, Y., Ye, Y., Zhang, S., Zhao, X., Yu, J. J. Q. (2023). DiffTraj: Generating GPS Trajectory with Diffusion Probabilistic Model. Advances in Neural Information Processing Systems, 36, 65168–65188. url: https://proceedings.neurips.cc/paper_files/paper/2023/file/cd9b4a28fb9eebe0430c3312a4898a41-Paper-Conference.pdf
[4] Zhu, Y., Ye, Y., Wu, Y., Zhao, X., Yu, J. J. Q. (2023). SynMob: Creating High-Fidelity Synthetic GPS Trajectory Dataset for Urban Mobility Analysis. Advances in Neural Information Processing Systems, 36, 22961–22977. url: https://proceedings.neurips.cc/paper_files/paper/2023/file/4786c0d1b9687a841bc579b0b8b01b8e-Paper-Datasets_and_Benchmarks.pdf
Closing Remark
This publication is produced for educational and informational purposes only. If there are any mistakes in the data, judgement, or methodology used to produce it, please contact the writer using the contact information in my Profile. I would be glad to discuss and share more about the topic. Thank you.
Best Regards,
Andriyan Saputra