Towards a Unified Understanding of Uncertainty Quantification in Traffic Flow Forecasting

Uncertainty is an essential consideration for time series forecasting tasks. In this work, we focus on quantifying the uncertainty of traffic forecasting from a unified perspective. We develop a novel traffic forecasting framework, namely Deep Spatio-Temporal Uncertainty Quantification (DeepSTUQ), which can estimate both aleatoric and epistemic uncertainty. Specifically, we first leverage a spatio-temporal model to capture the complex spatio-temporal correlations of traffic data. Subsequently, two independent sub-neural networks maximizing the heterogeneous log-likelihood are developed to estimate aleatoric uncertainty. To estimate epistemic uncertainty, we combine the merits of variational inference and deep ensembling by integrating the Monte Carlo dropout and the Adaptive Weight Averaging re-training methods, respectively. Furthermore, to relax the Gaussianity assumption, mitigate overfitting, and improve horizon-wise uncertainty quantification performance, we define a new calibration method called Multi-horizon Conformal Calibration (MHCC). Finally, we provide a theoretical analysis of the proposed unified approach based on the PAC-Bayes theory. Extensive experiments are conducted on four public datasets, and the empirical results suggest that the proposed method outperforms state-of-the-art methods in terms of both point prediction and uncertainty quantification.

volume, can help municipalities manage urban transportation more efficiently. In terms of traffic forecasting, the road segments in a road network interact with each other spatially, and the current state of a road segment depends on previous states, which results in complicated spatio-temporal correlations. Modelling the spatio-temporal correlations of traffic data is non-trivial [5], [17], [30], [39], [52], [54], [57].
Thanks to the recent advances of deep learning techniques, a number of deep learning-based spatio-temporal models have been proposed in the field of traffic forecasting [44], [47]. Since the topology of a typical road network can be described by a graph in which each node represents a sensor and each edge represents a road segment, the spatial dependency of traffic data can be naturally extracted by Graph Neural Networks (GNNs) [20]. Correspondingly, the temporal dependency of traffic data can be modelled by Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), or their variants [5], [30], [39], [52].
Despite the fact that existing methods for traffic forecasting have been shown to be successful [47], most of them only provide point traffic prediction without quantifying uncertainty, a critical component of traffic data. Uncertainty quantification can be used to estimate the possible minimum and maximum values of the predicted traffic flow, speed, and volume. Such reliability information can be imperative for municipalities managing urban traffic systems in real-world scenarios (e.g., emergency rescue and disaster evacuation) where unreliable deterministic point prediction may lead to catastrophic consequences [48]. Traffic forecasting can help improve traffic conditions and consequently reduce traffic incidents [34], [38]. However, unexpected events, e.g., accidents, will affect the accuracy of traffic forecasting [2], [11]. Therefore, from a safety perspective, it is necessary to provide predictions with reliability information. Moreover, traffic forecasting models with uncertainty quantification can be used to develop proactive intelligent traffic control systems to prevent possible future traffic congestion.
In this paper, we aim to attain both future traffic forecasting and its corresponding uncertainty. More specifically, the research goal includes the estimation of both epistemic and aleatoric uncertainties, which refer to model uncertainty and data uncertainty, respectively. Aleatoric uncertainty can be estimated by two independent neural networks that predict means and variances, respectively [35]. As for epistemic uncertainty, both variational inference and ensembling are possible solutions. However, these two types of approaches both have their own limitations. Variational approaches, e.g., Bayesian Neural Networks (BNNs) [21], [33], are prone to mode collapse [14]. Deep ensembling is capable of finding multiple local minima by training a set of deterministic models, but the prediction of each trained deterministic model lacks diversity [14]. To circumvent this problem, one needs to find a set of local minima/solutions with a certain amount of diversity.
To this end, we carefully design an epistemic uncertainty quantification method integrating the merits of both variational inference and deep ensembling. Due to the high flexibility and efficiency of Monte Carlo dropout (MCDO) [12], we adopt it as the variational inference method. To implement MCDO in the proposed approach, we place the dropout operations in the base model with careful design to ensure the model performance. Despite the success of Stochastic Weight Averaging (SWA) [19] at approximating deep ensembling, we find that the original SWA method cannot guarantee the convergence of the training process for traffic forecasting tasks. In this light, we propose a new re-training method, called Adaptive Weight Averaging (AWA), to better approximate deep ensembling. Compared with SWA, which uses SGD for training, AWA utilizes a new learning rate scheduler with Adam, which can approximate deep ensembling more efficiently on traffic forecasting. In addition, a post-processing calibration method is proposed to mitigate the overfitting issue in uncertainty quantification. We design the above building blocks to tackle different aspects of the uncertainty quantification problem, achieving an effective and efficient traffic forecasting framework. Finally, a unified uncertainty quantification approach called Deep Spatio-Temporal Uncertainty Quantification (DeepSTUQ) is formulated for both epistemic and aleatoric uncertainty estimation. Compared to existing approaches, DeepSTUQ has the following advantages: 1) DeepSTUQ can predict future traffic while providing both epistemic and aleatoric prediction uncertainty; and 2) DeepSTUQ requires training only one single model, which consequently makes it fast to train, memory-efficient, and fast at inference.
In the conference version of this work [36], a Gaussian assumption was used in the calibration method. However, this assumption may not hold empirically for real traffic data in many cases. The other issue of the original calibration method [36] is that it cannot guarantee horizon-wise uncertainty quantification performance. To mitigate overfitting, relax the Gaussianity assumption, and improve horizon-wise uncertainty quantification performance, we propose a novel conformal inference based calibration method, called Multi-horizon Conformal Calibration (MHCC), where the target significance can be corrected according to the proposed empirical equation. Moreover, we provide an in-depth theoretical analysis for uncertainty quantification to show that the proposed method can strengthen the point prediction performance and the horizon-wise prediction coverage performance.
The major value-added extensions over our preliminary work [36] are three-fold.
• We identify and study in depth a limitation in our previous approach, thus enabling improved horizon-wise uncertainty quantification performance.
• We propose a horizon-wise post-processing calibration method that relaxes the Gaussianity assumption and reduces overfitting, achieving better performance of uncertainty quantification in traffic flow forecasting.
• Extensive experiments are conducted on four public datasets, and the results show that the proposed DeepSTUQ advances the state of the art in terms of point prediction and uncertainty quantification.

The remainder of this paper is organized as follows. Section II surveys the related work, and Section III gives preliminary concepts. We then present the DeepSTUQ model in Section IV, followed by an empirical study in Section V. Section VI offers conclusions.

II. RELATED WORK
Our study relates to spatio-temporal traffic forecasting regarding its application and to uncertainty quantification regarding its methodology. In the following, we review the state-of-the-art methods from these two aspects.

A. Spatio-Temporal Traffic Forecasting
Traffic data can be regarded as multivariate time series. Hence, for traffic forecasting tasks, both spatial and temporal correlations are critical data features to learn from. In terms of spatial correlations, Graph Neural Networks, such as Graph Convolutional Networks (GCNs) [25], ChebNet [9], and Graph Attention Networks (GATs) [43], have become the de facto deep learning techniques. As for temporal dependency, deep architectures like Gated Recurrent Units (GRUs) [7], Gated Convolutional Neural Networks (GCNNs) [8], and WaveNet [42] have been widely applied to traffic prediction. Based on these two types of methods, a number of deep spatio-temporal models have been proposed in this context, such as Diffusion Convolutional Recurrent Neural Network (DCRNN) [30], Temporal Graph Convolutional Network (T-GCN) [56], Spatio-Temporal Graph Convolutional Networks (ST-GCN) [54], and GraphWaveNet [52]. These methods are capable of learning spatio-temporal correlations but fail to capture multi-scale or hierarchical dependency.
More recently, Li et al. [29] proposed Spatial-Temporal Fusion Graph Neural Network (STFGNN), in which a spatial-temporal fusion graph module and a gated dilated CNN module were used to capture local and global correlations simultaneously. Zheng et al. [55] proposed Spatial-Temporal Graph Diffusion Network (ST-GDN), which adopted a hierarchical graph neural network architecture and a multi-scale attention network to learn spatial dependency from local-global perspectives and multi-level temporal dynamics, respectively. Qu et al. [37] used contrastive self-supervision to learn the spatio-temporal correlations within coarse-grained urban traffic flows. Liang et al. [31] proposed a physics-informed approach for spatio-temporal forecasting. Additionally, the attention mechanism has been applied to address this issue as well. For instance, Attention-based Spatial-Temporal Graph Convolutional Network (ASTGCN) [17] uses spatial and temporal attention to model the spatial patterns and dynamic temporal correlations, respectively.
Nevertheless, in a practical real-world case where the knowledge of a graph is missing, the physical road connectivity may not necessarily represent the real data correlation in a graph. It is therefore beneficial to learn the graph structure from the data. To this end, methods such as Multivariate Time Series Forecasting with Graph Neural Network (MTGNN) [51] and Adaptive Graph Convolutional Recurrent Network (AGCRN) [5] can learn the unknown adjacency matrix in a data-driven manner and consequently improve the prediction performance. However, all these aforementioned methods only focus on providing point estimation without computing prediction intervals.

B. Uncertainty Quantification
Uncertainty quantification has recently become an actively researched area and has been widely applied to solve various real-world problems [1], [14]. In general, uncertainty can be classified into two categories, namely, aleatoric and epistemic.
Aleatoric uncertainty refers to the data uncertainty caused by noise or the intrinsic randomness of processes, which is irreducible but can be computed via predictive means and variances [22] using the negative log-Gaussian likelihood as the loss function. When the data distribution is unknown, distribution-free methods such as quantile regression [26] and conformal inference [4] can be used to estimate aleatoric uncertainty by computing the upper and the lower bounds of the prediction intervals with respect to the predefined significance level.
Epistemic uncertainty refers to model uncertainty caused by data sparsity or a lack of knowledge, which is learnable and reducible. A widely-used method for estimating epistemic uncertainty is Bayesian Neural Networks (BNNs) [21], in which a Gaussian distribution is imposed on each weight to generate model uncertainty. However, a typical BNN doubles the number of model parameters and requires computing the Kullback-Leibler (KL) divergence explicitly [6], which raises the model complexity and slows down the training process. Alternatively, a simple approach called Monte Carlo (MC) dropout [12] performs Bayesian approximation by keeping dropout turned on at both training and test time, as opposed to standard dropout [40].
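The MC dropout mechanism can be sketched in a few lines: dropout stays active at test time, and repeating stochastic forward passes yields a spread of predictions whose mean and variance approximate the posterior predictive distribution. The toy one-layer linear model and all numbers below are purely illustrative, not taken from any cited method.

```python
import random

def mc_dropout_forward(x, weights, p=0.1):
    """One stochastic forward pass of a toy linear model: each weight is
    dropped with probability p, and dropout stays ON at test time."""
    kept = [w if random.random() > p else 0.0 for w in weights]
    return sum(w * xi for w, xi in zip(kept, x))

def mc_dropout_predict(x, weights, p=0.1, n_samples=200):
    """Monte Carlo dropout: the mean and variance over repeated stochastic
    passes approximate the posterior predictive distribution."""
    samples = [mc_dropout_forward(x, weights, p) for _ in range(n_samples)]
    mean = sum(samples) / n_samples
    var = sum((s - mean) ** 2 for s in samples) / n_samples
    return mean, var

random.seed(0)
mean, var = mc_dropout_predict([1.0, 2.0, 3.0], [0.5, -0.2, 0.1], p=0.2)
print(mean, var)  # the spread across passes reflects epistemic uncertainty
```

Note that a single deterministic pass would return one number with no variance; the non-zero variance here comes entirely from the random masks.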
Apart from Bayesian methods, ensembling-based approaches can be applied to uncertainty quantification as well [27], [49]. However, vanilla ensembling methods are time- and memory-consuming because multiple models must be trained and stored. To address this issue, Fast Geometric Ensembling (FGE) [13] and Stochastic Weight Averaging (SWA) [19] have been proposed, which use varying learning rates during training to find different local minima. In addition, model calibration methods, e.g., Temperature Scaling [16], are also used to estimate prediction uncertainty.
Although uncertainty quantification has become quite popular in many deep learning domains, such as Computer Vision [22], Medical Imaging [1], and Reinforcement Learning [14], it is less explored in traffic prediction. Wu et al. [50] analyzed different Bayesian and frequentist uncertainty quantification approaches for spatio-temporal forecasting. They found that Bayesian methods were more robust in point prediction, whilst frequentist methods provided better coverage over ground truth variations. Other works also focus on deep learning-based spatio-temporal uncertainty quantification, e.g., [45], [46], [58], [60].
In this paper, we specifically study the uncertainty quantification problem for spatio-temporal traffic prediction. The proposed approach is based on the spatio-temporal architecture of [5] and combines Monte Carlo dropout, Adaptive Weight Averaging re-training, and model calibration to provide both point prediction and uncertainty estimation, advancing the state of the art.

III. PROBLEM STATEMENT
In this section, we will describe the task of traffic forecasting and its corresponding uncertainty quantification problem.

A. Traffic Forecasting
Traffic flow data can be regarded as multivariate time series. Let $x_t \in \mathbb{R}^N$ be the values of all the sensors in a road network at time $t$ and $X_{<t} = \{x_{t-T_h+1}, x_{t-T_h+2}, \ldots, x_t\} \in \mathbb{R}^{N \times T_h}$ be the corresponding historic input sequence with $T_h$ steps. Similarly, $X_{>t} = \{x_{t+1}, x_{t+2}, \ldots, x_{t+\tau}\} \in \mathbb{R}^{N \times \tau}$ represents the prediction sequence, where $\tau$ denotes the prediction horizon.
Fig. 1 describes a spatio-temporal correlation modelling problem in traffic forecasting.
Instead of treating the forecasting as deterministic, we aim to compute a conditional distribution $X_{>t} \sim P(X_{>t} \mid X_{<t})$ to predict the traffic flow as well as the prediction uncertainty, which can improve prediction accuracy, enhance the generalization ability of the model, and provide uncertainty estimation.
We consider the uncertainty assumptions on the traffic data from two aspects. The first is the Gaussianity assumption, and the other is the distribution-free assumption.

B. Uncertainty Quantification for Traffic Forecasting
The assumptions for the predictive likelihoods in traffic forecasting can be either Gaussian or distribution-free, and we will study both in this work.
1) Gaussian Uncertainty Assumption: Although multi-modality and seasonality do exist in traffic data [17], for simplicity, we do not consider multi-modality or seasonality, and consequently treat the predictive distribution of each node at each time point as a conditional univariate Gaussian distribution. A similar assumption can also be found in a previous study [59]. To this end, $P(X_{>t} \mid X_{<t})$ can be represented by a set of predictive mean-variance pairs, and the problem can be given as follows:
$$\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{N} \log \mathcal{N}\left(X_{>t}^{i} \mid \mu(X_{<t}^{i}),\, \sigma(X_{<t}^{i})^{2}\right), \tag{1}$$
where $\theta$ denotes the model parameters, $N$ is the number of total training data points, $\mathcal{N}$ denotes the Gaussian likelihood, and $\mu(X_{<t})$ and $\sigma(X_{<t})^{2}$ represent the estimated mean and variance, respectively.
2) Distribution-Free Uncertainty Assumption: From a distribution-free perspective, the uncertainty quantification task aims to obtain a prediction interval $C_{\theta}(X_{<t}^{i}) = [\hat{y}_i^{L}, \hat{y}_i^{U}]$ such that a future ground-truth datapoint $X_{>t}^{i}$ falls into $C_{\theta}(X_{<t}^{i})$ with a sufficiently high probability, where $\hat{y}_i^{L}$ and $\hat{y}_i^{U}$ denote the lower and the upper bounds of the predicted interval, respectively. Let $\alpha$ be the significance level; then the first optimization goal is to ensure
$$P\left(X_{>t}^{i} \in C_{\theta}(X_{<t}^{i})\right) \geq 1 - \alpha, \tag{2}$$
where $P(\cdot)$ can be any continuous probability distribution. Hence, the aforementioned Gaussianity assumption is relaxed.
Apart from satisfying (2), the width of the prediction interval should also be as small as possible. Let $q$ be the uncertainty scalar corresponding to $\alpha$. New prediction intervals can be constructed from the means and variances of the learned Gaussian likelihoods. Subsequently, the upper and the lower bounds of the newly constructed prediction interval are $\hat{y}_i^{U} = \hat{\mu}_i(x_i) + q\,\hat{\sigma}_i(x_i)$ and $\hat{y}_i^{L} = \hat{\mu}_i(x_i) - q\,\hat{\sigma}_i(x_i)$, respectively. Accordingly, the second optimization goal is to minimize the average interval width:
$$\min_{q}\; \frac{1}{N} \sum_{i=1}^{N} 2\,q\,\hat{\sigma}_{\theta}(x_i), \tag{3}$$
where $\hat{\sigma}_{\theta}$ is the estimated standard deviation under the model parameters $\theta$. Note that the distribution-free uncertainty quantification paradigm is compatible with the Gaussianity-based uncertainty quantification paradigm, as the latter provides adaptive un-calibrated predictive intervals, i.e., standard deviations.
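As a concrete sketch of the two optimization goals, the snippet below builds an interval from a predicted mean and standard deviation using the uncertainty scalar $q$ ($q = 1.96$ corresponds to $\alpha = 0.05$ under the Gaussian assumption) and measures the average interval width; all numeric values are illustrative.

```python
def interval(mu, sigma, q):
    """Prediction interval from the Gaussian head rescaled by the
    uncertainty scalar q: [mu - q*sigma, mu + q*sigma]."""
    return mu - q * sigma, mu + q * sigma

def mean_interval_width(sigmas, q):
    """Second optimization goal: the average width 2*q*sigma should stay
    small while the intervals still cover enough ground-truth points."""
    return sum(2.0 * q * s for s in sigmas) / len(sigmas)

lo, hi = interval(mu=10.0, sigma=1.5, q=1.96)
print(lo, hi)                                    # → 7.06 12.94
print(mean_interval_width([1.0, 2.0], q=1.96))   # → 5.88
```

Calibration (Section IV) amounts to choosing $q$ so that empirical coverage meets the target $1 - \alpha$ without inflating this average width.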

IV. DEEP SPATIO-TEMPORAL UNCERTAINTY QUANTIFICATION
We first briefly give an overview of the proposed Deep Spatio-Temporal Uncertainty Quantification (DeepSTUQ). The DeepSTUQ model architecture is illustrated in Fig. 2, which follows the principle of the previous study [5]. Specifically, the architecture includes an encoder and a decoder sub-network. The encoder is composed of a GCN and a GRU module to capture the spatial and temporal dependencies, respectively. To estimate the aleatoric uncertainty, the decoder employs two independent convolutional layers computing means and variances, respectively. Moreover, dropout operations are deployed in both sub-networks to estimate the epistemic uncertainty.
In terms of model training, conventional training procedures are only capable of providing uni-modal solutions, which lack diversity for quantifying uncertainty [14]. Moreover, conventional calibration methods cannot guarantee horizon-wise prediction coverage.

TABLE I MAIN VARIABLES AND THEIR DEFINITIONS IN THIS PAPER

To address the above issues, a four-stage uncertainty quantification approach is proposed, which can be briefed as follows. The major notations in this paper are listed in Table I. In the following sections, we introduce DeepSTUQ in detail.

A. Spatio-Temporal Dependency

1) Graph Convolution:
A typical road network consists of a number of road segments. The spatial relationships within a road network with $N_R$ road segments can be described by a graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, where the nodes $\mathcal{V} = \{v_1, v_2, v_3, \ldots, v_{|\mathcal{V}|}\}$ denote the sensors and the edges $\mathcal{E}$ denote the road segments.
$A \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$ is the corresponding adjacency matrix. Subsequently, GCN [25] is utilized to model the spatial relationships of the traffic data, computing the output of the $l$-th GCN layer, $Z^{(l+1)}$, from the input $Z^{(l)}$. More specifically, an identity matrix $I$ is first added to $A$ so that each node sums up its neighboring nodes as well as the node itself, and the degree matrix $\tilde{D}$ of $\tilde{A} = A + I$ is then used to normalize $\tilde{A}$ to avoid changing the scale of the feature vectors. As a result, the propagation rule of the GCN is described as follows:
$$Z^{(l+1)} = S\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} Z^{(l)} W^{(l)} + b^{(l)}\right), \tag{5}$$
where $W^{(l)}$ is the weight matrix, $b^{(l)}$ is the bias, and $S$ is an activation function, e.g., a sigmoid function.
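The propagation rule in (5) is a single normalized matrix product. A minimal pure-Python sketch on a toy three-sensor line graph (values illustrative; a real implementation would use a tensor library, and ReLU stands in for the generic activation $S$):

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gcn_layer(A, Z, W):
    """One GCN layer: Z' = relu(D^{-1/2} (A + I) D^{-1/2} Z W).
    Self-loops keep each node's own features; the symmetric degree
    normalization keeps the feature scale stable."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    A_norm = [[A_hat[i][j] / (deg[i] ** 0.5 * deg[j] ** 0.5) for j in range(n)]
              for i in range(n)]
    H = matmul(matmul(A_norm, Z), W)
    return [[max(0.0, v) for v in row] for row in H]

# 3 sensors connected in a line 0-1-2, two features per node, identity weights
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
out = gcn_layer(A, Z, W)
print(out)  # each node's features are now a weighted mix of its neighborhood
```

With identity weights, node 0's first feature becomes its own value scaled by $1/\tilde{d}_0 = 0.5$ plus normalized contributions from node 1, illustrating the smoothing effect of the normalized adjacency.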
2) Graph Structure Learning: In many real-world cases, we do not have the real spatial correlation knowledge of the multivariate traffic data. In such cases, the graph structure needs to be learned from the data. To this end, the adaptive learning approach proposed in [5] is adopted to directly generate the normalized adjacency matrix $\hat{A}$, which is easier than generating the adjacency matrix during the training process. Particularly, $\hat{A}$ is computed by
$$\hat{A} = \mathrm{softmax}\left(\mathrm{ReLU}\left(E E^{\top}\right)\right), \tag{6}$$
where $E \in \mathbb{R}^{|\mathcal{V}| \times d}$ (the embedding dimension $d \ll |\mathcal{V}|$) is a learnable matrix representing the embedding of the nodes, and the softmax function normalizes the learned matrix. To facilitate the graph learning process, a Node Adaptive Parameter Learning (NAPL) module [5] is also utilized to reduce the computational cost. As a result, (5) becomes
$$Z^{(l+1)} = S\left(\left(I + \hat{A}\right) Z^{(l)} E W_{\mathcal{G}} + E b_{\mathcal{G}}\right), \tag{7}$$
where $W_{\mathcal{G}}$ and $b_{\mathcal{G}}$ are shared parameters that, multiplied by the node embeddings $E$, generate node-specific weights and biases. Moreover, by using NAPL, the model is capable of learning the time-dependent graph structure of the traffic signals.
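The softmax(ReLU($E E^{\top}$)) construction can be sketched directly; the toy embedding below is illustrative:

```python
import math

def adaptive_adjacency(E):
    """Learn the graph from node embeddings E (|V| x d, d << |V|):
    ReLU keeps only positive similarities, and a row-wise softmax
    normalizes each row into attention-like edge weights."""
    n = len(E)
    scores = [[max(0.0, sum(a * b for a, b in zip(E[i], E[j]))) for j in range(n)]
              for i in range(n)]
    A_hat = []
    for row in scores:
        exps = [math.exp(s) for s in row]
        total = sum(exps)
        A_hat.append([e / total for e in exps])
    return A_hat

E = [[0.9, 0.1], [0.8, 0.2], [-0.1, 1.0]]  # toy 3-node embedding, d = 2
A_hat = adaptive_adjacency(E)
print([round(sum(row), 6) for row in A_hat])  # → each row sums to 1.0
```

Nodes 0 and 1 have similar embeddings and thus receive higher mutual edge weights than either does with node 2, which is how the data-driven graph emerges without any physical connectivity prior.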
3) Temporal Dependency: Apart from the spatial dependency, the temporal dependency of traffic data also needs to be captured. To this end, the aforementioned graph convolutional operations and the adaptive graph learning module are integrated into a Gated Recurrent Unit (GRU) [7]. Subsequently, the obtained spatio-temporal model can be formulated as follows:
$$
\begin{aligned}
z_t &= \mathrm{sigmoid}\left(W_z \star_{\mathcal{G}} \left[x_t, h_{t-1}\right] + b_z\right),\\
r_t &= \mathrm{sigmoid}\left(W_r \star_{\mathcal{G}} \left[x_t, h_{t-1}\right] + b_r\right),\\
c_t &= \tanh\left(W_c \star_{\mathcal{G}} \left[x_t, r_t \odot h_{t-1}\right] + b_c\right),\\
h_t &= z_t \odot h_{t-1} + \left(1 - z_t\right) \odot c_t,
\end{aligned} \tag{8}
$$
where $\star_{\mathcal{G}}$ denotes the graph convolution in (7), $z$ stands for the update gate, $r$ stands for the reset gate, $h$ denotes the hidden state, $[\cdot]$ denotes the concatenation operation, $c$ denotes the memory cell, and $W$ and $b$ represent the weights and biases, respectively. Finally, the model introduced in (8) serves as the spatio-temporal architecture in DeepSTUQ. Note that though the above base model is employed in this work, DeepSTUQ has the potential to be applied to other spatial-temporal structures as well. In the following sections, we explain how to leverage this base model to forecast traffic and quantify the corresponding forecasting uncertainty.

B. Uncertainty Quantification
Generally, uncertainty can be classified into two types, i.e., epistemic and aleatoric. The former represents model uncertainty, while the latter represents data uncertainty. If variance is used to render uncertainty, the total uncertainty can be decomposed and approximated as follows:
$$\mathrm{Var}\left(X_{>t}\right) \approx \underbrace{\mathbb{E}_{p(\theta)}\left[\sigma_{\theta}^{2}\right]}_{\text{Aleatoric uncertainty}} + \underbrace{\mathbb{E}_{p(\theta)}\left[\mu_{\theta}^{2}\right] - \mathbb{E}_{p(\theta)}\left[\mu_{\theta}\right]^{2}}_{\text{Epistemic uncertainty}}, \tag{9}$$
where $p(\theta)$ stands for a probability distribution over the model parameters $\theta$, and $\sigma_{\theta}^{2}$ and $\mu_{\theta}$ refer to the predicted variance and mean, respectively.
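In practice, the decomposition in (9) reduces to two sample statistics over $M$ stochastic forward passes (the numbers below are illustrative):

```python
def decompose_uncertainty(mus, vars_):
    """Approximate (9) from M Monte Carlo samples:
    aleatoric  = mean of the predicted variances,
    epistemic  = variance of the predicted means."""
    M = len(mus)
    aleatoric = sum(vars_) / M
    mean_mu = sum(mus) / M
    epistemic = sum((m - mean_mu) ** 2 for m in mus) / M
    return aleatoric, epistemic, aleatoric + epistemic

# three stochastic passes: predicted means and variances per pass
alea, epis, total = decompose_uncertainty([10.0, 10.4, 9.6], [1.0, 1.2, 0.8])
print(alea, epis, total)  # total predictive variance is the sum of the two
```

When the passes agree closely (small spread of means), the epistemic term shrinks toward zero and the total uncertainty is dominated by the aleatoric term, matching the intuition that model uncertainty is reducible while data noise is not.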
1) Aleatoric Uncertainty: Aleatoric uncertainty is caused by the intrinsic randomness of data, which is irreducible but learnable [1]. Based on (9), we assume that the lower and upper bounds of the forecasting are symmetric due to the regressive nature of the prediction. Subsequently, the distribution of a sensor's value, e.g., traffic flow, at each time point can be modeled by a Gaussian distribution with predicted mean $\mu(x)$ and variance $\sigma(x)^{2}$. However, directly maximizing the predictive Gaussian likelihood is numerically unstable. Instead, we choose to maximize the following log-likelihood:
$$\log p\left(y \mid x\right) \propto -\frac{1}{2N}\sum_{i=1}^{N}\left(\frac{\left(y_i - \mu(x_i)\right)^{2}}{\sigma(x_i)^{2}} + \log \sigma(x_i)^{2}\right), \tag{10}$$
where $\log(\sigma(x)^{2})$ and $\mu(x)$ are obtained directly via two independent neural networks.
In practice, to accelerate the training process and ensure convergence, we devise the following weighted loss by adding an L1 loss as the regularization term based on (10):
$$\mathcal{L}_{\mathrm{NLL}} = \frac{1}{2N}\sum_{i=1}^{N}\left(\frac{\left(y_i - \mu(x_i)\right)^{2}}{\sigma(x_i)^{2}} + \log \sigma(x_i)^{2}\right) + \lambda\,\frac{1}{N}\sum_{i=1}^{N}\left|y_i - \mu(x_i)\right|, \tag{11}$$
where $\lambda$ is the relative weight with $0 < \lambda \leq 1$.
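A sketch of this training objective follows, with the network predicting log-variance for numerical stability. The exact placement of the $\lambda$ weight is an assumption here; the text only states that $\lambda$ is a relative weight in $(0, 1]$.

```python
import math

def gaussian_nll(y, mu, log_var):
    """Per-point negative log-likelihood of N(mu, exp(log_var)); predicting
    log-variance directly keeps the loss numerically stable."""
    return 0.5 * (log_var + (y - mu) ** 2 / math.exp(log_var))

def weighted_loss(ys, mus, log_vars, lam=0.5):
    """Pre-training loss sketch: mean Gaussian NLL plus a lambda-weighted
    L1 term on the mean head (assumed weighting, for illustration)."""
    n = len(ys)
    nll = sum(gaussian_nll(y, m, lv) for y, m, lv in zip(ys, mus, log_vars)) / n
    l1 = sum(abs(y - m) for y, m in zip(ys, mus)) / n
    return nll + lam * l1

print(weighted_loss([2.0], [1.0], [0.0], lam=0.5))  # → 1.0
```

The L1 term anchors the mean head to the targets early in training, before the variance head has learned a sensible scale; otherwise the NLL can be minimized by inflating the predicted variance instead of improving the mean.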
2) Epistemic Uncertainty: Epistemic uncertainty represents model uncertainty, which arises from the lack of data or model mis-specification. Fortunately, as opposed to aleatoric uncertainty, epistemic uncertainty is reducible and can be estimated. There are two general classes of approaches to do so: Bayesian variational inference and deep ensembling. However, they both have their pros and cons. Fig. 3 illustrates the relationships between the different solutions and the corresponding model performance. The solid and dashed lines represent the model performance during the training and testing processes, respectively. The green line and blue dots represent the performance that can be obtained by variational inference and a deterministic model, respectively. As can be seen from the figure, deep ensembling can find a set of different deterministic model parameters (local minima), e.g., $W_1$, $W_2$, and $W_3$, which may have equally good performance in the solution space [10]. On the other hand, variational inference can find a set of sub-optimal solutions near one local minimum in the loss space. However, it may fail to find other local minima, which potentially leads to mode collapse. Therefore, a better way is to explore as many local minima as possible, as well as their corresponding nearby solutions. To this end, we propose to combine deep ensembling and variational inference to estimate epistemic uncertainty.
Variational Inference: Let $D = \{X, Y\}$ be the training dataset. From a Bayesian perspective, we assume that each weight parameter $w$ of the neural network obeys a probability distribution, e.g., a Gaussian distribution, to represent model uncertainty. However, in practice, the true posterior of the neural network weights $p(w \mid D)$ is intractable. Therefore, a variational distribution $q(w)$ is used to approximate $p(w \mid D)$. Accordingly, the optimization goal is to minimize the following Kullback-Leibler (KL) divergence:
$$\mathrm{KL}\left(q(w)\,\|\,p(w \mid D)\right) \propto \mathrm{KL}\left(q(w)\,\|\,p(w)\right) - \mathbb{E}_{q(w)}\left[\log p\left(D \mid w\right)\right], \tag{12}$$
where $p(w)$ is the prior and $\log p(D \mid w)$ is the predictive log-likelihood.
To solve (12), MC dropout [12] is adopted as it performs Bayesian approximation in a simple and flexible manner. The variational distribution $q(w)$ formulated in MC dropout can be described as follows. Let $M_i$ be a weight matrix of shape $K_i \times K_{i-1}$ for layer $i$; we have
$$\hat{W}_i = M_i \cdot \mathrm{diag}\left(\left[z_{i,j}\right]_{j=1}^{K_{i-1}}\right), \quad z_{i,j} \sim \mathrm{Bernoulli}\left(p_i\right), \tag{13}$$
where $\hat{W}_i$ denotes the masked weight matrices, $p_i$ is the dropout rate used in both the training and testing processes (as opposed to standard dropout), $M_i$ contains the parameters of the neural network in the $i$-th layer, and $z_{i,j}$ is a binary variable indicating whether unit $j$ at layer $i-1$ (as the input of layer $i$) is dropped. As a result, minimizing (12) is equivalent to minimizing the following loss function:
$$\mathcal{L}_{\mathrm{MC}} = \frac{1}{N}\sum_{i=1}^{N} E\left(y_i, \hat{y}_i\right) + \frac{\lambda_W}{2p}\left\|w\right\|^{2}, \tag{14}$$
where $E$ is the loss function, e.g., Root Mean Squared Error (RMSE) or Mean Absolute Error (MAE), $\lambda_W$ is the weight decay, and $\frac{\lambda_W}{2p}\|w\|^{2}$ can be computed by applying L2 regularization during the training process.
In terms of implementation, dropout operations are deployed at two places within the spatial-temporal model: the graph convolutional layers in the encoder and the convolutional layers in the decoder. Therefore, (7) becomes
$$Z^{(l+1)} = \mathrm{Dropout}_{p}\left(S\left(\left(I + \hat{A}\right) Z^{(l)} E W_{\mathcal{G}} + E b_{\mathcal{G}}\right)\right). \tag{15}$$
Note that the dropout rate here should be small when the adjacency matrix dimension is small, and vice versa.
Combined Uncertainty: Finally, (11) and (14) are combined to estimate both aleatoric and epistemic uncertainty jointly. The combined loss function is formulated as
$$\mathcal{L} = \frac{1}{2N}\sum_{i=1}^{N}\left(\frac{\left(y_i - \mu(x_i)\right)^{2}}{\sigma(x_i)^{2}} + \log \sigma(x_i)^{2}\right) + \lambda\,\frac{1}{N}\sum_{i=1}^{N}\left|y_i - \mu(x_i)\right| + \frac{\lambda_W}{2p}\left\|w\right\|^{2}. \tag{16}$$
Equation (16) is utilized to pre-train the spatio-temporal model in DeepSTUQ.
Deep Ensembling: In contrast to variational inference, deep ensembling aims to find a set of different local minima and averages the output of each trained model as the final prediction. Deep ensembling has been shown to be quite effective in practice, yet it is computationally expensive as multiple models are trained [27]. FGE [13] tackles this issue by using a cyclical learning rate to produce a set of different trained models in one learning process. However, FGE still needs to store multiple models for inference, which may result in a high memory cost. To address this issue, SWA [19] adjusts the learning rate and averages the weights during the learning process to generate only one trained model that approximates FGE. In SWA, the model parameters are updated by
$$w_{\mathrm{SWA}} \leftarrow \frac{w_{\mathrm{SWA}} \cdot n_{\mathrm{models}} + w}{n_{\mathrm{models}} + 1}, \tag{17}$$
where $w_{\mathrm{SWA}}$ denotes the parameters of the SWA model and $n_{\mathrm{models}}$ is the number of averaged models during training.
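The update in (17) is a running average over weight snapshots, so only one copy of the averaged weights is ever stored (weights are flattened to a plain list here for illustration):

```python
def swa_update(w_swa, w, n_models):
    """Running average of model weights:
    w_swa <- (w_swa * n_models + w) / (n_models + 1)."""
    return [(ws * n_models + wi) / (n_models + 1) for ws, wi in zip(w_swa, w)]

# average three weight snapshots collected during training
w_swa = [0.0, 0.0]
for n, w in enumerate([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]):
    w_swa = swa_update(w_swa, w, n)
print(w_swa)  # → [3.0, 4.0], the element-wise mean of the snapshots
```

Because the running average equals the arithmetic mean of all snapshots seen so far, memory cost stays constant regardless of how many local minima are visited.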
Inspired by SWA, we devise a re-training method called Adaptive Weight Averaging (AWA) to approximate deep ensembling. As depicted in Fig. 4, we vary the learning rate during the re-training process to find different local minima and average those local minima in the final stage to attain better solutions. The proposed AWA re-training approach includes two steps. Let the re-training learning rate be $lr$, the maximum learning rate be $lr_1$, the minimum learning rate be $lr_2$, and $n_{\mathrm{iteration}}$ be the total number of iterations within each epoch (the total batch number); then the learning rate at the $n_i$-th iteration changes according to the following rules. The first step is to enable the trained model to escape from the current local minimum. To this end, the learning rate of the optimizer decreases from $lr_1$ to $lr_2$ via a cosine learning rate scheduler at epoch $n$. The scheduler is described by
$$lr = lr_2 + \frac{1}{2}\left(lr_1 - lr_2\right)\left(1 + \cos\left(\frac{n_i}{n_{\mathrm{iteration}}}\,\pi\right)\right). \tag{18}$$
Following that, the model is fine-tuned using the constant learning rate $lr_2$ at epoch $n+1$; at the end of the epoch, the model parameters are averaged according to (17) and batch normalization is performed. Specifically, we find that in practice using Adam [24] as the optimizer works more effectively than using Stochastic Gradient Descent (SGD), which is adopted in the original SWA method. In terms of finding different local minima, Adam escapes saddle points, where the gradients are close to 0, more efficiently than SGD, but the minima it finds are sharp ones [53]. Therefore, we average those sharp minima to obtain a flat minimum and finally achieve better generalization [23]. For testing, we quantify the epistemic uncertainty by drawing multiple Monte Carlo samples from the learnt posterior distribution, and then use the means and variances of the samples as the predictive means and variances, respectively.
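The cosine schedule within the escape epoch can be sketched as follows (learning rate bounds are illustrative):

```python
import math

def awa_lr(n_i, n_iteration, lr1=1e-3, lr2=1e-5):
    """Cosine decay within the escape epoch: the learning rate falls from
    lr1 to lr2 as iteration n_i runs from 0 to n_iteration."""
    return lr2 + 0.5 * (lr1 - lr2) * (1 + math.cos(math.pi * n_i / n_iteration))

print(awa_lr(0, 100))    # → 0.001   (starts at lr1: a large step to escape)
print(awa_lr(100, 100))  # → 1e-05   (ends at lr2: fine-tuning rate)
```

The large initial rate pushes the weights out of the current basin; the decayed final rate lets the next epoch settle into a new (possibly sharp) minimum before its weights are folded into the running average.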
3) Model Calibration: To prevent the uncertainty estimation of the trained models from being overconfident with respect to the training dataset, it is necessary to calibrate the trained model on a held-out validation/calibration dataset via post-processing.
To do so, we propose two calibration methods: one is based on the Gaussian likelihood assumption, and the other on the distribution-free assumption (see Section III).
Temperature Scaling Calibration: If the Gaussian likelihood assumption is still assumed to hold, a positive learnable variable $T$ is imposed on the learned variance. Subsequently, the following log-likelihood, similar to (10), is maximized:
$$\log p_T\left(y \mid x\right) \propto -\frac{1}{2}\sum_{i=1}^{N_{\mathrm{Cali}}}\left(\frac{\left(y_i - \mu(x_i)\right)^{2}}{T\,\sigma(x_i)^{2}} + \log\left(T\,\sigma(x_i)^{2}\right)\right), \tag{19}$$
where $T$ is the only learnable parameter. Accordingly, the calibration objective is
$$T^{*} = \arg\max_{T > 0}\, \log p_T\left(y \mid x\right), \tag{20}$$
where $\mu(x_i)$ and $\sigma(x_i)^{2}$ can be obtained via one deterministic forward pass or Monte Carlo estimation. The Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS) is used as the optimizer to find the optimal value of $T$.

Multi-Horizon Conformal Calibration: Time series forecasting, e.g., traffic forecasting, may not rigorously comply with the conditional Gaussian assumption on the predictive likelihood. Hence, it is necessary to relax the Gaussianity assumption and quantify the uncertainty in a distribution-free fashion. Furthermore, it can be found that for multi-horizon forecasting, the predicted intervals obtained at farther horizons may undercover the future ground-truth datapoints, which makes the uncertainty quantification performance less reliable. Therefore, to address the above two issues, we propose the Multi-horizon Conformal Calibration (MHCC) approach, in a split conformal inference fashion [28], to calibrate the trained model.
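As a side note on the temperature-scaling objective above: for this one-parameter Gaussian case, the objective that L-BFGS optimizes admits a closed-form minimizer, which makes a handy sanity check. This is a sketch of that observation, not a replacement for the L-BFGS procedure the text describes.

```python
def optimal_temperature(ys, mus, sigmas):
    """Setting the derivative of the temperature-scaled negative
    log-likelihood to zero gives the closed form
    T* = mean of squared standardized residuals (y - mu)^2 / sigma^2;
    L-BFGS on the same 1-D objective converges to the same value."""
    n = len(ys)
    return sum((y - m) ** 2 / s ** 2 for y, m, s in zip(ys, mus, sigmas)) / n

T = optimal_temperature([1.0, 3.0], [0.0, 1.0], [1.0, 2.0])
print(T)  # → 1.0: residuals already match the predicted scale, no rescaling
```

Intuitively, $T^* > 1$ means the model was overconfident (residuals larger than predicted) and the variances get inflated, while $T^* < 1$ shrinks them.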
As a conformal inference method, MHCC calibrates the trained model by computing quantiles without performing any optimization procedure, which makes it fast to compute. To obtain the quantile, we first need to compute the nonconformity scores to rank the prediction residuals. We do not use the absolute prediction error, i.e., $|\hat{x}_i - x_i|$, to compute the nonconformity scores, as it is not adaptive. Instead, we leverage the predicted means and variances obtained by the Gaussian predictive likelihoods to compute the nonconformity scores. The advantage of doing so is that we do not need to train an auxiliary model to calculate the prediction residuals. MHCC does not adopt the Bonferroni correction [41] to obtain the corrected significance level $\alpha_c$, as its assumption is too conservative ($\alpha_c \leq \alpha/\tau$), which results in overlarge predictive intervals.
Instead, a novel significance level correction method is proposed in MHCC, which can be described as follows. We first assume that the Gaussianity assumption holds for each horizon h (h ∈ {1, 2, . . . , τ}); e.g., if α is set to 0.05, then q = 1.96. The empirical horizon-wise prediction interval coverage rate p_c^h on horizon h can then be computed on the calibration dataset as

p_c^h = (1 / N_Cali) Σ_{i=1}^{N_Cali} 1[ŷ_i^{L,h} ≤ y_i^h ≤ ŷ_i^{U,h}],

where N_Cali is the number of datapoints in the calibration dataset. The empirical significance level for each horizon is then 1 − p_c^h. Afterwards, to reach the ideal significance level, 1 − p_c^h needs to be corrected towards α. To this end, the corrected horizon-wise significance level α_c^h on horizon h can be attained by the following empirical equation.
where γ is a positive scalar. The first term, p_c^h + 2α − 1, computes the corrected significance level according to the empirical PICP; the second term, weighted by γ, further adjusts the correction horizon-wise.

Algorithm 2: MHCC Method.
Require: calibration dataset X_Cali; trained model f_w; significance level α = 0.05.
1: initialize q_c = {}; h = 1.
2: while h ≤ τ:
3:   compute the empirical PICP p_c^h on X_Cali at horizon h;
4:   compute the corrected significance level α_c^h;
5:   compute the new horizon-wise uncertainty scalar q_c^h;
6:   q_c = q_c ∪ {q_c^h};
7:   h = h + 1.
8: end while
9: Return q_c.

Fig. 6. Visualization of data splitting in Online MHCC.
For each horizon, the new upper and lower bounds become ŷ_i^{U,h} = μ̂_i^h + q^h σ̂_i^h and ŷ_i^{L,h} = μ̂_i^h − q^h σ̂_i^h, respectively. The MHCC approach is summarized in Algorithm 2.
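The calibration step above can be sketched as follows. This is our NumPy illustration, not the paper's implementation: it uses the normalized residual |y − μ̂|/σ̂ as the nonconformity score and, for simplicity, the standard (1 − α) split-conformal quantile per horizon rather than the paper's corrected level α_c^h; all names are ours.

```python
import numpy as np

def horizonwise_quantiles(y_cal, mu_cal, sigma_cal, alpha=0.05):
    """For each horizon h, rank the normalized residuals |y - mu|/sigma
    on the calibration set and keep their empirical quantile q_h.
    All arrays have shape (N_cali, tau)."""
    n, tau = y_cal.shape
    scores = np.abs(y_cal - mu_cal) / sigma_cal      # nonconformity scores
    # split-conformal quantile with the usual finite-sample correction
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.array([np.quantile(scores[:, h], level) for h in range(tau)])

def intervals(mu, sigma, q):
    """Calibrated horizon-wise bounds: mu_h +/- q_h * sigma_h."""
    return mu - q * sigma, mu + q * sigma

# Toy setup: the model reports sigma = 1 at every horizon, but the true
# noise grows with the horizon, so the fitted q_h should grow as well.
rng = np.random.default_rng(1)
mu = rng.normal(size=(500, 12))
sigma = np.full((500, 12), 1.0)
y = mu + rng.normal(0.0, np.linspace(1.0, 2.0, 12), size=(500, 12))
q = horizonwise_quantiles(y, mu, sigma)
```

Because the scalar q^h is fitted per horizon, far horizons whose intervals would otherwise undercover automatically receive wider bounds.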
Furthermore, conformal inference relies on an i.i.d. assumption, which may become weaker over time for time series data. To address this issue, we propose Online MHCC, which updates the calibration in an online fashion, as illustrated in Fig. 6. The Online MHCC approach is summarized in Algorithm 3.
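The online updating idea can be sketched with a sliding buffer of nonconformity scores that is periodically re-quantiled. This is our simplification of Online MHCC (single horizon, fixed-size buffer); the class and parameter names are illustrative, not the paper's.

```python
from collections import deque
import numpy as np

class OnlineCalibrator:
    """Keep a sliding buffer of recent nonconformity scores and refresh
    the uncertainty scalar q every `update_every` observations, so the
    calibration tracks distribution drift."""

    def __init__(self, init_scores, alpha=0.05, maxlen=2000, update_every=100):
        self.buf = deque(init_scores, maxlen=maxlen)
        self.alpha, self.update_every, self.seen = alpha, update_every, 0
        self.q = float(np.quantile(list(self.buf), 1 - alpha))

    def observe(self, y, mu, sigma):
        self.buf.append(abs(y - mu) / sigma)   # score of the new datapoint
        self.seen += 1
        if self.seen % self.update_every == 0:  # periodic re-correction
            self.q = float(np.quantile(list(self.buf), 1 - self.alpha))

# Toy drift: the noise level grows over time while the model keeps
# reporting sigma = 1, so the calibrator must widen its intervals.
rng = np.random.default_rng(2)
cal = OnlineCalibrator(np.abs(rng.normal(size=500)))
for t in range(300):
    y = rng.normal(0.0, 1.0 + t / 150)
    cal.observe(y, 0.0, 1.0)
q_final = cal.q
```

The buffer plays the role of the updated calibration dataset X_Cali, and the periodic re-quantiling corresponds to the re-correction step of Algorithm 3.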

C. Unified Approach
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

Algorithm 3: Online MHCC Method.
Require: testing dataset X_Test; calibration dataset X_Cali; trained model f_w; significance level α = 0.05; initial uncertainty scalar q_c; step i = 0; calibration dataset update size N_Update.
1: initialize temporary calibration dataset X_Temp = {};
2: while testing:
3:   inference using f_w, X_i, and q_c;
4:   update X_Cali with X_Temp;
8:   calculate empirical PICP on X_Cali;
9:   re-correct q_c with X_Cali using (22);
10:  reset X_Temp = {}.
11: end while

Finally, combining the spatio-temporal correlation modelling method, Monte Carlo dropout, AWA re-training, and model calibration, the pipeline of the proposed unified uncertainty quantification method is shown in Fig. 7, which can be summarized as follows.
• First, the spatio-temporal model introduced in Sections IV-A and IV-A3 is pre-trained on the training dataset using (16) as the training loss function to estimate the aleatoric and epistemic uncertainty.
• Afterwards, the pre-trained model is re-trained via the AWA method on the training dataset to approximate deep ensembling.
• Finally, the predicted σ² obtained via the re-trained model on the validation dataset is calibrated according to (20).

The graphical probabilistic model representation of DeepSTUQ is visualized in Fig. 8. The figure shows that h_t is extracted from x_t via a spatio-temporal structure with a learnable variable Â. The model weights are drawn repeatedly to estimate the epistemic uncertainty, which is implemented efficiently by using MC dropout and AWA. The variance σ(x_i)² and mean μ(x_i) are obtained via N_MC Monte Carlo samples. Finally, σ(x_i)² is calibrated by learning an auxiliary variable q.
For testing, according to (9), we draw N_MC Monte Carlo samples to estimate the predictive mean μ̂_{t+1} and variance σ̂²_{t+1}, where μ̂_{t+1} is used as the point prediction of the proposed approach, and q² = 1/T when using the TS calibration method.
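The Monte Carlo estimate of the predictive moments can be sketched as below. This is a generic sketch of the standard mixture-moment decomposition implied by MC sampling, with illustrative names: the predictive mean averages the sampled means, and the predictive variance splits into an aleatoric part (mean of sampled variances) plus an epistemic part (variance of sampled means).

```python
import numpy as np

def predictive_moments(mu_samples, var_samples):
    """Combine N_MC stochastic forward passes.
    mu_samples, var_samples: arrays of shape (N_MC, n_outputs)."""
    mu = mu_samples.mean(axis=0)          # predictive mean
    aleatoric = var_samples.mean(axis=0)  # data noise, averaged over samples
    epistemic = mu_samples.var(axis=0)    # spread of the sampled means
    return mu, aleatoric + epistemic      # total predictive variance

# Three MC samples for a single output
mu_s = np.array([[1.0], [1.2], [0.8]])
var_s = np.array([[0.5], [0.7], [0.6]])
mu, var = predictive_moments(mu_s, var_s)
```

With a calibration scalar q, the reported interval would then be μ̂ ± 1.96·q·√(σ̂²) per the preceding sections.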

D. Theoretical Analysis: A PAC-Bayesian Perspective
We adopt the Probably Approximately Correct (PAC)-Bayes theory [3], [32] to explain the proposed method. To this end, we first assume that the datapoints in the training, calibration, and testing datasets are all i.i.d. Once the model architecture is specified, we have the hypothesis/parameter space H, and the learning goal is to obtain a hypothesis h ∈ H. Let D be some unknown data distribution over X × Y and L : X × Y × H → R be the loss function. The true risk is then defined as

R(h) = E_{(x,y)∼D}[L(x, y, h)].

Pre-Training: In practice, R(h) cannot be computed since the distribution D is unknown. Instead, we have the training dataset D_Train ∼ D, so R(h) is estimated by computing the empirical risk

R̂(h) = (1 / N_Train) Σ_{(x_i, y_i) ∈ D_Train} L(x_i, y_i, h).

Since our problem is a regression task with a Gaussian likelihood loss function, the parameter space H has to be restricted to only consider variances higher than a given constant [15]. Then we have the following theorem.
Theorem 1: For every probability measure Q supported on H, with probability at least 1 − δ, the following PAC-Bayes bound holds [15]:

E_{h∼Q}[R(h)] ≤ E_{h∼Q}[R̂(h)] + sqrt( (D_KL(Q||P) + ln(2√N_Train / δ)) / (2 N_Train) ),

where N_Train is the number of training datapoints and P is a prior over H. The bound tightens as D_KL(Q||P) decreases. Hence, instead of minimizing the PAC-Bayes bound directly, which would require computing the KL divergence explicitly, we minimize the evidence lower bound (ELBO) as in (12) using Monte Carlo dropout.
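To see how the bound behaves numerically, the complexity term of a McAllester-style PAC-Bayes bound can be evaluated directly. This is our illustration of one common form of the bound; the exact constants in [15] may differ.

```python
import math

def pac_bayes_gap(kl, n_train, delta=0.05):
    """Complexity term of a McAllester-style PAC-Bayes bound:
        sqrt( (KL(Q||P) + ln(2*sqrt(N)/delta)) / (2N) ).
    Shrinking KL(Q||P) or growing N tightens the bound."""
    return math.sqrt(
        (kl + math.log(2 * math.sqrt(n_train) / delta)) / (2 * n_train)
    )

# A posterior close to the prior (small KL) gives a much tighter
# generalization gap than a posterior far from it (large KL).
gap_small_kl = pac_bayes_gap(kl=10.0, n_train=10_000)
gap_big_kl = pac_bayes_gap(kl=1000.0, n_train=10_000)
```

This makes concrete why the ELBO, which penalizes the same KL term, serves as a practical surrogate for minimizing the bound.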
Re-Training: Furthermore, note that unlike Bayesian learning, PAC-Bayes does not need to know the exact form of the prior. Thus, we can formulate a refined posterior by combining variational inference and deep ensembling to obtain better generalization through training. To this end, a mixture of Dirac-delta distributions is constructed:

q(h | D_Train) = (1 / M) Σ_{m=1}^{M} δ(h − h_m),

where q(h | D_Train) is the mixed Dirac-delta posterior distribution over the model parameters and M is the number of ensemble members. The PAC-Bayes bound (see (26)) holds for all q ∈ Q(H), including the set of Dirac masses {δ_h : h ∈ H}.
In the proposed method, AWA is used to generate a mixture of Dirac-delta distributions sequentially in a computationally efficient manner.By approximating Bayesian model averaging via AWA, we can attain wider minima, which consequently leads to better generalization [19], [23].
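The sequential averaging can be sketched as an incremental mean over parameter checkpoints, in the spirit of SWA-style weight averaging. This is our simplification: flat NumPy vectors stand in for real model weights, and the checkpoint list is assumed to come from the ends of learning-rate cycles.

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Maintain a running mean of the parameter vectors collected at the
    end of each learning-rate cycle, without storing all of them."""
    avg, n = None, 0
    for w in checkpoints:
        n += 1
        # incremental mean: avg_n = avg_{n-1} + (w - avg_{n-1}) / n
        avg = w.copy() if avg is None else avg + (w - avg) / n
    return avg

ckpts = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
w_avg = average_checkpoints(ckpts)
```

Only one running average is kept in memory, which is why the re-training approximates an M-model ensemble at the storage cost of a single model.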
Offline Calibration: The proposed calibration method follows a split conformal fashion. Once the training process is finished, a fixed hypothesis h is obtained. Let D_Cali ∼ D be a held-out calibration dataset that is independent of h. Then, we have the following theorem.
Theorem 2: With probability at least 1 − δ, the following PAC bound holds [18]:

R(h) ≤ R̂_Cali(h) + sqrt( ln(1/δ) / (2 N_Cali) ),

where N_Cali is the number of calibration samples and R̂_Cali(h) is the empirical risk on D_Cali. The bound suggests that the true risk can be approximated by validation on a sufficient number of held-out datapoints. The proposed calibration method based on conformal inference is essentially justified by this validation PAC bound.
Online Re-Calibration: For time series data, the i.i.d. assumption may become weaker as time passes, which is detrimental to conformal methods. Therefore, the i.i.d. assumption can be re-enforced by updating the calibration dataset, such that the validation PAC bound (cf. (28)) continues to hold during inference.

V. EXPERIMENTS
To compare the performance of DeepSTUQ with other state-of-the-art methods, extensive experiments are conducted on real-world datasets in terms of point prediction, uncertainty quantification, and ablation studies.
In terms of the hardware environment, the CPU and GPU used for the experiments are an AMD EPYC 7302 and an NVIDIA Tesla T4, respectively. For the software environment, all methods are implemented with Python, CUDA 10.6, and PyTorch 1.7.

A. Datasets
Four public datasets collected from the Caltrans Performance Measurement System (PEMS), i.e., PEMS03, PEMS04, PEMS07, and PEMS08 [39], are used for evaluation. This type of data allows us to assess the performance of our approach on real-world traffic flow forecasting tasks and to compare fairly with other state-of-the-art methods.
The time interval of the traffic flow data is 5 minutes. For each prediction, we use the historic data of a one-hour range (12 time steps) as the input to predict the traffic data of the next hour (12 time steps). All datasets are split into three parts with a ratio of 6 : 2 : 2 for training, validation/calibration, and testing, respectively.
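The windowing and chronological split described above can be sketched as follows. This is our illustration with a stand-in series; the actual datasets are multivariate sensor arrays.

```python
import numpy as np

def make_windows(series, in_len=12, out_len=12):
    """Build (input, target) pairs with a sliding window: 12 historical
    steps (one hour at 5-minute resolution) predict the next 12 steps."""
    X, Y = [], []
    for t in range(len(series) - in_len - out_len + 1):
        X.append(series[t:t + in_len])
        Y.append(series[t + in_len:t + in_len + out_len])
    return np.array(X), np.array(Y)

series = np.arange(100.0)   # stand-in for one sensor's flow readings
X, Y = make_windows(series)

# 6:2:2 chronological split into train / validation-calibration / test
n = len(X)
n_train, n_val = int(0.6 * n), int(0.2 * n)
X_train, X_val, X_test = X[:n_train], X[n_train:n_train + n_val], X[n_train + n_val:]
```

Keeping the split chronological (rather than shuffled) matters here because the middle slice doubles as the calibration set for the conformal methods.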

B. Settings
Pre-Training: The total number of training epochs is 100. The optimizer is Adam with learning rate 0.003 and weight decay 10⁻⁶. The batch size is 64. The relative weight λ in (11) for computing the aleatoric uncertainty is 0.1. The dropout rate of the graph convolutional operations in the encoder is 0.1 for PEMS03, PEMS04, and PEMS07 (whose adjacency matrices are relatively large), and 0.05 for PEMS08 (whose adjacency matrix is relatively small). The dropout rate at the final dropout layer in the decoder is 0.2 for all datasets.
AWA Re-Training: The optimizer of the AWA re-training process is Adam, with maximum and minimum learning rates of 0.003 and 0.00003, respectively. The total number of re-training epochs is 20, which means that ten models are averaged.
Model Calibration: The number of Monte Carlo samples for calculating σ² is 10. In TS, the step size and number of iterations of the L-BFGS optimizer are 0.02 and 500, respectively. In offline and online MHCC, γ is set to 0 for PEMS03 and 0.03 for PEMS04, PEMS07, and PEMS08. The number of datapoints for online updating is 1,000.
Inference: To balance the inference time and model performance, we generate ten Monte Carlo samples for Bayesian model averaging.

C. Baselines
To compare the proposed DeepSTUQ with state-of-the-art methods on point prediction and uncertainty quantification, two groups of recent traffic prediction methods are adopted as baselines.
1) Point Prediction Baselines:
• ST-GCN [54] utilizes a GNN and a GCNN to forecast traffic.
• STFGNN [29] employs a spatial-temporal fusion module and a gated dilated CNN.
• AGCRN [5] leverages a Node Adaptive Parameter Learning module and a Data Adaptive Graph Generation module to enhance traffic prediction performance.
• DeepSTUQ/S refers to the proposed method with a single deterministic forward pass (dropout is turned off at test time).

2) Uncertainty Quantification Baselines: Representative approaches of different uncertainty estimation paradigms (namely, frequentist, quantile prediction, Bayesian, and ensembling) are used as baselines. Note that all the following methods employ the same base model structure for a fair comparison.
• Point prediction refers to the AGCRN model, which is used here for comparison with the uncertainty quantification methods.
• Quantile regression [26] is a distribution-free method which directly computes the mean and the lower and upper bounds using the corresponding quantiles (0.025, 0.5, 0.975).
• Mean Variance Estimation (MVE) [35] estimates heterogeneous aleatoric uncertainty by computing (11).
• Monte Carlo dropout (MCDO) [12] performs dropout at both training and test time; the number of Monte Carlo samples for inference is 10.
• Combined calculates both epistemic and aleatoric uncertainty using (16) [22]; the number of Monte Carlo samples for inference is 10.
• Fast Geometric Ensembling (FGE) [13] performs fast ensembling by varying the learning rate; the number of stored trained models is 10.
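Of the baselines above, quantile regression is trained against the pinball loss. The NumPy sketch below (ours, not the baseline's code) shows why minimizing that loss at τ = 0.975 recovers the upper interval bound: the τ-quantile of the data scores strictly better than a mis-specified constant.

```python
import numpy as np

def pinball_loss(y, yhat, tau):
    """Quantile (pinball) loss: over-predictions are weighted by (1-tau)
    and under-predictions by tau, so the minimizer is the tau-quantile."""
    diff = y - yhat
    return float(np.mean(np.maximum(tau * diff, (tau - 1) * diff)))

rng = np.random.default_rng(3)
y = rng.normal(size=100_000)
# The true 0.975-quantile of N(0, 1) (about 1.96) should score better
# than a deliberately wrong constant prediction of 0.0.
loss_good = pinball_loss(y, 1.96, 0.975)
loss_bad = pinball_loss(y, 0.0, 0.975)
```

Training one head each for τ = 0.025, 0.5, and 0.975 yields the lower bound, point prediction, and upper bound without any distributional assumption, which is exactly the sense in which the baseline is distribution-free.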

D. Metrics
Two groups of metrics are employed to evaluate the point prediction and uncertainty quantification performance, respectively.

1) Point Prediction Metrics:
The point traffic forecasting performance is evaluated by the following metrics.
1) Root Mean Squared Error (RMSE):

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² ),

where y_i is the ground truth and ŷ_i is the prediction.
2) Mean Absolute Error (MAE):

MAE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|.

3) Mean Absolute Percentage Error (MAPE):

MAPE = (100% / N) Σ_{i=1}^{N} |y_i − ŷ_i| / |y_i|.

2) Uncertainty Quantification Metrics: The uncertainty quantification performance is evaluated by the following metrics.
1) Mean Negative Log-Likelihood (MNLL):

MNLL = (1/N) Σ_{i=1}^{N} [ ½ log(2π σ̂_i²) + (y_i − μ̂_i)² / (2 σ̂_i²) ],

where μ̂_i and σ̂_i² are the predicted mean and variance, respectively.
2) Prediction Interval Coverage Probability (PICP). The predicted lower and upper bounds of the prediction interval are denoted by ŷ^L and ŷ^U, respectively. Let the significance level α be 5%, which means the expected probability of a ground truth datapoint falling into the range [ŷ^L, ŷ^U] is 95% (1 − α = 95%). Accordingly, under the Gaussianity assumption, we have ŷ_i^U = μ̂_i + 1.96σ̂_i and ŷ_i^L = μ̂_i − 1.96σ̂_i. Let k_i^j indicate whether the real value of road segment j at time i is captured by the estimated prediction interval:

k_i^j = 1 if ŷ_i^{L,j} ≤ y_i^j ≤ ŷ_i^{U,j}, and k_i^j = 0 otherwise.

Then PICP can be formulated as

PICP = (1 / (N J)) Σ_{j=1}^{J} Σ_{i=1}^{N} k_i^j,

where J is the number of road segments. Ideally, PICP should be equal to or greater than 95%.
3) Mean Prediction Interval Width (MPIW):

MPIW = (1/N) Σ_{i=1}^{N} (ŷ_i^U − ŷ_i^L).
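The metrics above are straightforward to implement; a compact NumPy sketch (ours, with illustrative names, collapsing the road-segment dimension into one axis) follows, with a sanity check on a well-calibrated toy predictor.

```python
import numpy as np

def rmse(y, yhat): return float(np.sqrt(np.mean((y - yhat) ** 2)))
def mae(y, yhat):  return float(np.mean(np.abs(y - yhat)))
def mape(y, yhat): return float(np.mean(np.abs((y - yhat) / y)) * 100)  # percent

def mnll(y, mu, var):
    # Gaussian negative log-likelihood, averaged over datapoints
    return float(np.mean(0.5 * np.log(2 * np.pi * var) + (y - mu) ** 2 / (2 * var)))

def picp_mpiw(y, mu, sigma, z=1.96):
    # interval bounds under the Gaussianity assumption
    lo, hi = mu - z * sigma, mu + z * sigma
    covered = (lo <= y) & (y <= hi)
    return float(covered.mean()), float((hi - lo).mean())

# Sanity check: a perfectly calibrated N(0, 1) predictor should cover
# about 95% of draws with an interval width of 2 * 1.96.
rng = np.random.default_rng(4)
mu = np.zeros(100_000)
sigma = np.ones(100_000)
y = rng.normal(size=100_000)
picp, mpiw = picp_mpiw(y, mu, sigma)
err = rmse(y, mu)
```

Note the PICP/MPIW tension: inflating σ̂ trivially raises PICP, so the two must always be read together.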

E. Point Prediction Results
The point prediction results of DeepSTUQ are compared with the aforementioned state-of-the-art methods, as demonstrated in Table II. As can be seen from the results, with only ten Monte Carlo samples, DeepSTUQ achieves the smallest RMSEs, MAEs, and MAPEs, which suggests that DeepSTUQ has the best performance on point traffic flow prediction. In addition, the proposed method, even with a single deterministic forward pass (DeepSTUQ/S), also outperforms the other state-of-the-art methods, which indicates that the proposed method is competitive on point prediction at nearly the same inference cost as other deterministic approaches. This is because variational inference obtains a set of solutions around one local minimum, while deep ensembling finds multiple local minima in the solution space. By combining these two approaches, DeepSTUQ is capable of finding better sub-optimal solutions and has better generalization ability than deterministic methods, and consequently better point prediction performance. Fig. 9 shows the point prediction performance with respect to different horizons, which suggests that DeepSTUQ outperforms AGCRN at each time step for all datasets.

F. Uncertainty Quantification Results
To evaluate the uncertainty quantification performance, DeepSTUQ is compared with the uncertainty quantification baselines; the results are demonstrated in Table III and Figs. 10, 11, and 12. According to the results in the table, the proposed approach has the best overall performance regarding both point prediction and uncertainty quantification. As observed from Fig. 10, DeepSTUQ can forecast traffic flow accurately and provide valid coverage for future ground truth. Fig. 11 illustrates that in traffic flow forecasting, the aleatoric uncertainty is much larger than the epistemic uncertainty; hence, considering the total uncertainty provides better uncertainty estimation than considering either one alone. Fig. 12 shows that, for all datasets, both aleatoric and epistemic uncertainty generally increase as the prediction horizon extends, which implies that short-term traffic flow forecasting is more reliable than long-term forecasting. This conclusion accords with intuition and with results in the literature [5], [29], [30]. In terms of uncertainty quantification, the aleatoric uncertainty-aware approaches, i.e., MVE and TS, outperform the epistemic uncertainty-aware approaches, which suggests that traffic uncertainty is mainly data-related. The results indicate that only considering epistemic uncertainty improves the estimation of the predicted mean (which results in better point estimation) but significantly underestimates the variance; this conclusion is supported by the study [50] as well. Although we have made a strong Gaussianity assumption on the likelihood of the aleatoric uncertainty, the experimental results indicate that the methods using this assumption (i.e., MVE, Combined, TS, and DeepSTUQ) outperform the distribution-free method, Quantile. Additionally, the PICPs obtained by DeepSTUQ on the four datasets are very close to or larger than 95%, which implies that the Gaussian distribution assumption is credible.
According to the experimental results, we can also see that when only the epistemic uncertainty is considered, using variational inference (MCDO) or deep ensembling (FGE), the traffic flow point prediction performance is improved compared to deterministic methods, but the uncertainty quantification performance is poor. If merely the aleatoric uncertainty is taken into account (MVE, TS, Conformal, and CFRNN), the uncertainty quantification performance is satisfactory, while the point prediction performance decreases slightly compared to deterministic methods. On the other hand, if both the epistemic and aleatoric uncertainties are estimated, as in Combined and DeepSTUQ, both the point prediction and uncertainty quantification performance are improved.

G. Model Calibration Results
We compare the proposed calibration approaches, MHCC and online MHCC, to Split Conformal Prediction (SCP) [28], Local Weighted Conformal Inference (LWCI) [28], and TS. The results in Table IV imply that online MHCC has the best overall (marginal) uncertainty quantification calibration performance.
In addition, we propose a new metric for evaluating the horizon-wise (conditional) uncertainty quantification performance, called the Mean Horizon-wise Prediction Interval Coverage Error (MHPICE). Let e_h = |PICP_h − (1 − α)| be the horizon-wise prediction interval coverage error at horizon h. Then MHPICE is defined as

MHPICE = (1/τ) Σ_{h=1}^{τ} e_h.

Naturally, a lower MHPICE means better performance.
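The metric can be sketched in a few lines. This is our reading of MHPICE, taking e_h as the absolute gap between each horizon's empirical PICP and the nominal coverage 1 − α; the toy values below are illustrative.

```python
import numpy as np

def mhpice(picp_per_horizon, alpha=0.05):
    """Mean Horizon-wise Prediction Interval Coverage Error: average
    absolute gap between per-horizon coverage and the target 1 - alpha."""
    e = np.abs(np.asarray(picp_per_horizon) - (1 - alpha))
    return float(e.mean())

# A model whose coverage decays with the horizon scores worse than one
# that holds the nominal level at every horizon.
decaying = [0.95, 0.93, 0.90, 0.87]
flat = [0.95, 0.95, 0.94, 0.95]
```

Unlike the marginal PICP, which averages over all horizons and can hide undercoverage at far horizons, MHPICE penalizes every horizon's deviation separately.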
From the results in Fig. 13 and Table V, it can be seen that before MHCC, the test PICPs decrease as the horizon increases, which makes the uncertainty quantification results less reliable. After MHCC, the test PICPs at each horizon are closer to the target coverage level of 95% than before. This implies that MHCC can improve the credibility of the horizon-wise uncertainty quantification results.

H. Robustness Test
We report the MAE, RMSE, MAPE, MNLL, PICP, and MPIW results of the methods under Gaussian and non-Gaussian noise on PEMS04 to evaluate their robustness. From the results in Table VI, it can be seen that the proposed method is more resilient to noise in terms of point prediction performance. Moreover, the online MHCC method has the best uncertainty quantification performance among the compared calibration methods.

I. Ablation Study
Three groups of experiments are conducted to verify the effects of the proposed AWA re-training, the proposed calibration method, and different numbers of Monte Carlo samples.
1) Effect of AWA Re-Training: The prediction performance of the same pre-trained model before and after AWA re-training is compared. Table VII demonstrates that after AWA re-training, the point prediction performance improves, indicating that the proposed AWA re-training method can approximate deep ensembling using a single model with only 20 additional epochs. The results also suggest that AWA re-training improves the performance of other methods, i.e., Point, MVE, and MCDO. Therefore, compared to conventional deep ensembling, DeepSTUQ requires less time and memory.
2) Effect of the Number of Monte Carlo Samples: To investigate how the number of Monte Carlo samples affects model performance, the sample number is set to 1, 3, 5, 10, and 15. As shown in Fig. 14, the performance of the proposed method improves as the number of Monte Carlo samples rises, and only a small number of samples is required to achieve high prediction performance. The performance gradually saturates as the sample number approaches 15. Accordingly, to trade off model performance against inference cost, the number of samples is fixed to 10 at test time.

J. Memory Cost and Computation Time
The quantitative results of the memory cost and computation time on PEMS08 are reported in Table VIII. From the results, we can see that DeepSTUQ has almost the same model size and training time as the Point prediction model, which is significantly more efficient than standard Deep Ensembles. The inference time and memory cost of DeepSTUQ are slightly larger than those of standard Deep Ensembles, but the inference time per step is less than 7.80 ms. Therefore, DeepSTUQ is scalable to large traffic forecasting datasets and applicable to practical applications.

VI. CONCLUSION
In this paper, we introduce a novel and unified uncertainty quantification method for traffic forecasting called DeepSTUQ. The proposed method consists of three components. In the first component, to model the aleatoric uncertainty, a hybrid loss function is used to train a base spatio-temporal model. The second component models the epistemic uncertainty, where the merits of variational inference and deep ensembling are combined through dropout pre-training and AWA re-training. Finally, the model is calibrated on the validation dataset using post-processing calibration methods (Temperature Scaling and the proposed MHCC) to further improve the uncertainty estimation performance. Thorough experiments are conducted on four distinct public datasets. The results indicate that DeepSTUQ outperforms contemporary state-of-the-art spatio-temporal models and uncertainty quantification methods. Moreover, DeepSTUQ achieves high robustness against noise.
In terms of future work, we plan to explore other techniques for designing novel model architectures. Another possible direction is to study data interpolation and imputation in traffic forecasting.

Fig. 1. Spatio-temporal dependency modelling for traffic data, where grey lines and green dashed lines represent spatial and temporal dependency, respectively.

• Stage 1: Pre-train the base spatio-temporal model with dropout on the training dataset to perform variational learning.
• Stage 2: Re-train the pre-trained model on the training dataset to proceed with ensemble learning.
• Stage 3: Calibrate the re-trained model on the calibration dataset to further improve the aleatoric variance estimation.
• Stage 4: Update the calibration dataset and perform inference.

Fig. 3. Performance demonstration of a deterministic model, deep ensembles, and variational inference in the solution space.
The AWA re-training is illustrated in Fig. 5, and the whole re-training process is summarized in Algorithm 1.

Fig. 8. Graphical model representation of DeepSTUQ, where shaded circles represent observable variables, arrows denote dependencies, variables within rectangles appear repeatedly, N_MC is the number of Monte Carlo samples, and M is the number of models for ensembling.

Fig. 11. Quantification results of different uncertainties on partial data from a randomly selected segment of PEMS08.

Fig. 13. PICPs with respect to various forecast horizons, where solid and dashed lines denote MHCC and the uncalibrated model, respectively.

Fig. 14. Prediction results with respect to different numbers of Monte Carlo samples.

Algorithm 1: AWA Re-Training Method.
Require: training dataset {X}; pre-trained model parameters w; AWA model parameters w_AWA; learning rates lr_1 and lr_2; total epochs epoch_AWA; total iteration/batch number n_iteration.

TABLE II: POINT PREDICTION RESULTS ON PEMS03, PEMS04, PEMS07, AND PEMS08

Fig. 9. Point prediction results with respect to various forecast horizons, where solid and dashed lines denote DeepSTUQ and AGCRN, respectively.

TABLE III: UNCERTAINTY QUANTIFICATION RESULTS ON PEMS03, PEMS04, PEMS07, AND PEMS08; THE BEST RESULTS ARE HIGHLIGHTED IN BOLD

Fig. 10. Uncertainty quantification results on randomly selected road segments from different datasets.

TABLE VI: ROBUSTNESS TEST RESULTS WITH DIFFERENT PERTURBATIONS ON PEMS04

TABLE VII: ABLATION STUDY RESULTS ON AWA RE-TRAINING

TABLE VIII: MEMORY COST AND COMPUTATION TIME ON PEMS08 (CPU: AMD EPYC 7302, GPU: NVIDIA TESLA T4)