Asymptotic properties of Bayesian inference in linear regression with a structural break

This paper studies large sample properties of a Bayesian approach to inference about slope parameters $\gamma$ in linear regression models with a structural break. In contrast to the conventional approach to inference about $\gamma$, which does not take into account the uncertainty of the unknown break location $\tau$, the Bayesian approach that we consider incorporates such uncertainty. Our main theoretical contribution is a Bernstein-von Mises type theorem (Bayesian asymptotic normality) for $\gamma$ under a wide class of priors, which essentially indicates an asymptotic equivalence between the conventional frequentist and Bayesian inference. Consequently, a frequentist researcher could look at credible intervals of $\gamma$ to check robustness with respect to the uncertainty of $\tau$. Simulation studies show that the conventional confidence intervals of $\gamma$ tend to undercover in finite samples, whereas the credible intervals offer more reasonable coverage in general. As the sample size increases, the two methods coincide, as predicted by our theoretical conclusion. Using data from Paye and Timmermann (2006) on stock return prediction, we illustrate that the traditional confidence intervals on $\gamma$ might underrepresent the true sampling uncertainty.


Introduction
We consider the linear regression with a structural break, following the notation of Bai (1997):
$$y_t = \begin{cases} w_t'\alpha + z_t'\delta_1 + \epsilon_t, & \text{for } t = 1, \dots, \lfloor \tau T \rfloor, \\ w_t'\alpha + z_t'\delta_2 + \epsilon_t, & \text{for } t = \lfloor \tau T \rfloor + 1, \dots, T, \end{cases}$$
where $w_t$ and $z_t$ are $d_w \times 1$ and $d_z \times 1$ vectors of covariates, the random variable $\epsilon_t$ is a regression error, and $\lfloor a \rfloor$ is the largest integer that is strictly smaller than $a$. The relationship between the outcome $y_t$ and the covariate $z_t$, measured by the $\delta$'s, changes across regimes, which are defined by the break location parameter $\tau \in (0, 1)$. There can be another set of covariates $w_t$ whose relationship with $y_t$, measured by $\alpha$, stays unchanged across the regimes. The unknown parameters include the break location $\tau$ as well as the slope parameters $\gamma = (\alpha', \delta_1', \delta_2')'$. The focus of the current study is on inference about the slope parameters $\gamma$.

The classic literature
In the literature, the conventional least-squares estimators $(\hat\tau_{LS}, \hat\gamma_{LS})$ for $(\tau, \gamma)$ are computed as follows: for each candidate $\tau$, compute the sum of squared residuals of the regression, and denote the minimizing choice by $\hat\tau_{LS}$. Plug in the value $\tau = \hat\tau_{LS}$ in the model and define $\hat\gamma_{LS} = \hat\gamma(\hat\tau_{LS})$, where $\hat\gamma(\tau)$ is the usual OLS estimator of $\gamma$ assuming the break location $\tau$. Bai (1997) assumes that the true jump size $\delta_0$ is either fixed or shrinks to zero as $T \to \infty$, but at a rate slower than $\sqrt{T}$. Bai shows that $\hat\tau_{LS}$ converges at the rate $T^{-1}$ in the former case and, in the latter case, finds an asymptotic distribution of $\hat\tau_{LS}$ that can be used for constructing confidence intervals for $\tau$. In both cases, Bai proves that the asymptotic distribution of $\hat\gamma_{LS}$ is the same as that of $\hat\gamma(\tau_0)$, where $\tau_0$ is the true value of $\tau$. This means that one can ignore the very problem of unknown $\tau$ when making inference on $\gamma$. Figure 1 displays finite-sample distributions of $\hat\tau_{LS}$ (blue solid curves), produced from 1,000 repeated experiments on the model $y_t = \delta_0 \mathbb{1}(t > \lfloor \tau_0 T \rfloor) + \epsilon_t$ with $\epsilon_t \sim$ i.i.d. $N(0, 1)$, $\tau_0 = 0.5$, and $T = 100$; the horizontal axis is $\tau - \tau_0$. Note that despite the $T$-consistency, $\hat\tau_{LS}$ displays significant variation, especially when the true break size $\delta_0$ is small. In practice, the conventional approach to inference on the slope parameters $\gamma$ ignores this uncertainty, neglecting all possible values of $\tau$ other than $\hat\tau_{LS}$. As a consequence, the corresponding confidence intervals on $\gamma$ tend to undercover, since it might not be the case that $\hat\tau_{LS} = \tau_0$ in a given sample (see our simulation in Section 5). Figure 1 also shows the finite-sample distribution of the posterior mode of $\tau$ (red solid curve with small circles) and, for 3 data realizations randomly chosen out of the 1,000 repetitions, posterior densities of $\tau$ (gray dashed curves, each representing one realized data set).
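The grid search behind $\hat\tau_{LS}$ can be sketched as follows. This is an illustrative implementation of the conventional procedure described above; the function name, the 5% trimming of candidate dates, and the design are our own choices, not the paper's.

```python
import numpy as np

def estimate_break_ls(y, X_w, X_z, trim=0.05):
    """Least-squares estimation of a single structural break.

    For each candidate break date k, fit OLS with regime-specific
    slopes on the z-covariates and common slopes on the w-covariates,
    and keep the k that minimizes the sum of squared residuals (SSR).
    A sketch of the conventional procedure (Bai, 1997); trimming and
    names are illustrative.
    """
    T = len(y)
    lo, hi = int(np.floor(trim * T)), int(np.floor((1 - trim) * T))
    best = (np.inf, None, None)
    for k in range(max(lo, 1), min(hi, T - 1)):
        # regime-specific design: z_t active before / after the break
        Z1 = np.vstack([X_z[:k], np.zeros_like(X_z[k:])])
        Z2 = np.vstack([np.zeros_like(X_z[:k]), X_z[k:]])
        X = np.hstack([X_w, Z1, Z2])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        ssr = np.sum((y - X @ coef) ** 2)
        if ssr < best[0]:
            best = (ssr, k, coef)
    _, k_hat, gamma_hat = best
    return k_hat / T, gamma_hat  # (tau_hat_LS, gamma_hat_LS)
```

Plugging $\tau = \hat\tau_{LS}$ back into the model, as returned here, is exactly the step whose neglected uncertainty the rest of the paper examines.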

Bayesian perspective
For a Bayesian, this non-standard estimation problem can be dealt with by placing priors on both $\tau$ and $\gamma$ and by computing the corresponding posterior probabilities. The uncertainty of $\tau$ is automatically reflected in the marginal posterior of $\gamma$. This is because the posterior distribution of $\gamma$ given the data $D_T$ can be written as a mixture in which the weights correspond to the marginal posterior density $\pi_T(\tau)$ of $\tau$:
$$p(\gamma \mid D_T) = \int p(\gamma \mid \tau, D_T)\, \pi_T(\tau)\, d\tau, \qquad (2)$$
where $p(\gamma \mid \tau, D_T)$ is the conditional posterior distribution of $\gamma$ given $\tau$. The posterior density $\pi_T(\tau)$ reflects the uncertainty of $\tau$ given the data set. Figure 1 shows three realizations of $\pi_T(\tau)$ (gray dashed curves), randomly chosen out of the 1,000 repetitions. Compared to the conventional approach, the key difference is that the Bayesian approach (2) incorporates all possibilities of $\tau$ (not just $\hat\tau_{LS}$) and weights them according to the posterior density. As we see in the simulation studies, this results in longer Bayesian credible intervals of $\gamma$ compared to the conventional counterparts. Consequently, the credible intervals tend to avoid undercoverage; see Section 5 for further discussion. Note that, unlike conventional frequentist methods, Bayesian inference has a valid interpretation even in finite samples, as it does not rely on asymptotic theory. In this study, we examine the asymptotic behavior of Bayesian estimation of the considered model under the fixed jump size framework. Specifically, we prove a Bernstein-von Mises type theorem for the slope parameters $\gamma$ which validates a frequentist interpretation of Bayesian credible regions. A Bayesian researcher can invoke our theorem to convey statistical results to frequentist researchers. A frequentist researcher could look at the credible interval of $\gamma$ to check robustness with respect to the uncertainty of the break location. Such a sensitivity analysis is reasonable, as our result guarantees that the credible interval converges to the conventional confidence interval.
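The mixture structure above can be made concrete in a stylized mean-shift model with known error variance, where the unknown slope can be integrated out analytically for each candidate break date. The model, the prior variance `c`, and the flat prior over break dates are our own illustrative assumptions, not the paper's specification.

```python
import numpy as np

def break_posterior(y, c=100.0):
    """Marginal posterior over the break date in a mean-shift model.

    Simplified sketch (sigma^2 = 1 known):
        y_t = delta * 1(t >= k) + eps_t,  delta ~ N(0, c),  k uniform.
    For each candidate k, delta is integrated out analytically, so the
    weights pi_T(k) are available in closed form, and the posterior of
    delta is the mixture  sum_k p(delta | k, D) * pi_T(k).
    """
    T = len(y)
    log_ml = np.full(T, -np.inf)
    post_mean = np.zeros(T)
    for k in range(1, T):                      # candidate break dates
        x = (np.arange(T) >= k).astype(float)  # regressor for delta
        n2 = x @ x
        s = x @ y
        # log N(y; 0, I + c x x') via Sherman-Morrison and the
        # matrix determinant lemma (constants dropped)
        log_ml[k] = -0.5 * (y @ y - c * s**2 / (1 + c * n2)) \
                    - 0.5 * np.log(1 + c * n2)
        post_mean[k] = c * s / (1 + c * n2)    # E[delta | k, D]
    w = np.exp(log_ml - log_ml.max())
    w /= w.sum()                               # pi_T(k), flat prior
    return w, post_mean @ w                    # weights, mixture E[delta|D]
```

The returned posterior mean of `delta` averages over *all* candidate break dates with weights `w`, which is precisely how the mixture representation propagates break-date uncertainty into inference on the slope.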
We first establish theoretical results under normal likelihood and natural conjugate prior. We further extend the results to non-conjugate priors using Laplace approximations.
The literature on the theoretical properties of Bayesian approaches in non-regular models such as (1) is very scarce despite their popularity in applications. To our knowledge, frequentist properties of the Bayesian approach for linear regression models with structural breaks have not been studied in the literature. Ghosal and Samanta (1995) consider a general non-regular estimation problem from a Bayesian perspective and establish conditions under which the Bernstein-von Mises theorem holds for the regular part of the parameter. However, their assumptions are difficult to verify for the model under consideration.
Recently, Casini and Perron (2020) propose a generalized Laplace estimator of the break location $\tau$, which is defined by an integration rather than an optimization. Their approach provides a better approximation of the uncertainty in $\tau$ than the conventional method. Although the focus of the current paper is on inference about the slope coefficients $\gamma$ and not $\tau$, our Bayesian approach toward inference shares the same spirit: any statement about $\gamma$ is expressed as a weighted average (2) over the marginal posterior density of $\tau$.
The paper is organized as follows. Section 2 introduces the model and lists a set of assumptions. Section 3 introduces a Bayesian approach based on normal likelihood and conjugate prior. The section then establishes frequentist properties of the approach. Section 4 extends the results to nonconjugate priors. Section 5 presents simulation evidence to assess the adequacy of the asymptotic theory and to illustrate that conventional confidence intervals on the slope parameters tend to undercover. Section 6 reports an empirical application to the stock return prediction model of Paye and Timmermann (2006). Section 7 concludes the paper. The mathematical proofs and derivations are listed in the Appendix. Additional tables are provided in the online appendix.
2 The model and data generating process

Data generating process
The data are assumed to include $T$ observations on a response and a vector of covariates: $D_T = (Y_T, X_T) = (y_1, \dots, y_T, x_1, \dots, x_T)$, where $y_t \in \mathbb{R}$ and $x_t \in \mathcal{X} \subset \mathbb{R}^{d_x}$, $t = 1, \dots, T$. $\mathcal{X}$ is assumed to be a convex and bounded set. Conditional on $X_T$, the response is generated according to model (4) with the true parameters $(\gamma_0, \sigma_0^2, \tau_0)$. We use $\theta = (\gamma', \sigma^2)'$ to denote the regression parameters. We make the following assumptions (Assumption 1) about the true data-generating process (DGP); among them, the limit of $T^{-1} \sum_{t=1}^{T} x_t x_t'$ exists and is positive definite.
(iv) For all $\tau_1, \tau_2 \in (0, 1)$ with $\tau_1 < \tau_2$, ... Under the above assumptions, the classical theoretical results apply. Bai (1997) shows that the convergence rate of $\hat\tau_{LS}$ is $T^{-1}$ if $\delta_0$ is fixed with respect to the sample size, and that the least-squares estimator for $\gamma$ is asymptotically normal with the same asymptotic covariance matrix as if $\tau_0$ were known. This means that $\tau$ can be treated as known for the purpose of inference about $\gamma$. In other words, the uncertainty of the break location is essentially ignored, and thus the confidence interval for $\gamma$ tends to undercover (see Section 5 for simulation evidence). There are several comments on Assumption 1. In threshold regression models (see Hansen, 2000), the threshold variable is often one of the regressors. In that case, sorting on the threshold variable induces a trend in the regressors, which requires an alternative approach to the asymptotic analysis. We do not consider the case in which one of the regressors is the threshold variable in this paper. In addition, we require the regression errors to be i.i.d. with variance $\sigma^2$. Adding more flexibility, such as heteroscedasticity and serial correlation, would be an important future direction.
3 A Bayesian approach under normal likelihood and conjugate prior

The distribution of covariates is assumed to be ancillary and is not modeled. Throughout this paper, we assume the normal likelihood function, where $\chi_{\tau,t}$ is the $t$th row of the matrix $\chi_\tau$. Note that normality is not assumed for the true DGP, so the model can be mis-specified. The break location $\tau$ and the regression parameters $\theta$ are independent a priori, and the prior on $\theta$ is the natural conjugate prior. That is, $\pi(\gamma, \sigma^2, \tau) = \pi(\gamma \mid \sigma^2)\pi(\sigma^2)\pi(\tau)$, where the prior on $\gamma$ conditional on $\sigma^2$ is normal $N_{d_x+d_z}(\mu, \sigma^2 H^{-1})$ and the prior on $\sigma^2$ is inverse-gamma $\mathrm{InvGamma}(a, b)$. Note that by taking $H \to 0$, $a \to -(d_x + d_z)/2$, and $b \to 0$, we obtain the uninformative improper prior $\pi(\gamma, \sigma^2) \propto \sigma^{-2}$ as a special case. The prior on $\tau$ can be of any form as long as it is positive at $\tau_0$ and $\pi(\tau)$ is finite for all $\tau \in \mathcal{H}$.
The conjugate prior is a popular choice in the Bayesian estimation of linear regression models. Our restriction on the prior for the break location $\tau$ is very mild; for example, the uniform distribution on $\mathcal{H}$ satisfies the requirement. Recently, Baek (2021) investigates the same model (1). As the distribution of $\hat\tau_{LS}$ might exhibit tri-modality for small jumps, Baek proposes a new point estimator for $\tau$ based on a modified objective function. The proposed modification can be regarded as equivalent to specifying a certain type of prior for $\tau$, and indeed such a prior satisfies our restriction.
Under the normal likelihood function and the prior defined above, the posterior distributions are available in closed form given $\tau$, with
$$\bar{H}_\tau = H + \chi_\tau'\chi_\tau, \quad \bar{\mu}_\tau = \bar{H}_\tau^{-1}\left(H\mu + \chi_\tau' Y\right), \quad \bar{b}_\tau = b + 0.5\left(\mu' H \mu + Y'Y - \bar{\mu}_\tau' \bar{H}_\tau \bar{\mu}_\tau\right), \quad \bar{a} = a + T/2,$$
where $t_k(v, \mu, \Sigma)$ denotes the $k$-dimensional t-distribution with $v$ degrees of freedom, a location vector $\mu \in \mathbb{R}^k$, and a $k \times k$ shape matrix $\Sigma$. See Appendix C for the derivation.
Due to the availability of the closed forms of the conditional posteriors given $\tau$, posterior sampling is simple and fast. One can first draw $\tau^{(1)}, \dots, \tau^{(S)}$ from the marginal posterior of $\tau$ as in (9) via, for example, the Metropolis-Hastings algorithm, where $S$ is the number of posterior draws. For each $\tau^{(s)}$, one can sample a posterior draw of $\sigma^{2(s)}$ from the posterior conditional on $\tau = \tau^{(s)}$, namely (11). Conditional on $\tau$ and $\sigma^2$, one can draw $\gamma$ from $p(\gamma \mid \sigma^2, \tau, D_T)$; it can be shown that $\gamma \mid \sigma^2, \tau, D_T \sim N_{d_x+d_z}(\bar\mu_\tau, \sigma^2 \bar{H}_\tau^{-1})$. For example, a laptop with a 2.2GHz processor and 8GB RAM takes about 4.1 seconds to produce 10,000 posterior draws in the empirical example of Section 6, which has ten slope coefficients in total.
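A minimal sketch of this sampling scheme is given below, for a model with only regime-specific regressors and a zero prior mean; since the candidate break dates form a finite grid, we sample $\tau$ exactly from its discrete marginal posterior rather than via Metropolis-Hastings. The function name, the pure regime-specific design, and all tuning values are illustrative.

```python
import numpy as np

def gibbs_break_regression(y, X_z, n_draws=2000, H_scale=0.1,
                           a=1.0, b=1.0, trim=0.05, seed=0):
    """Posterior sampling sketch for the conjugate break model.

    Uses the closed forms described in the text: tau is drawn from its
    marginal posterior (exactly, over the discrete candidate grid),
    then sigma^2 | tau ~ InvGamma(a_bar, b_bar_tau), then
    gamma | sigma^2, tau ~ N(mu_bar_tau, sigma^2 * H_bar_tau^{-1}).
    Prior mean mu = 0 for simplicity; design is illustrative.
    """
    rng = np.random.default_rng(seed)
    T, dz = len(y), X_z.shape[1]
    H = H_scale * np.eye(2 * dz)
    stats, log_m = [], []
    for k in range(int(trim * T) + 1, int((1 - trim) * T)):
        chi = np.hstack([X_z * (np.arange(T)[:, None] < k),
                         X_z * (np.arange(T)[:, None] >= k)])
        H_bar = H + chi.T @ chi
        mu_bar = np.linalg.solve(H_bar, chi.T @ y)   # prior mean is 0
        b_bar = b + 0.5 * (y @ y - mu_bar @ H_bar @ mu_bar)
        a_bar = a + T / 2
        _, logdet = np.linalg.slogdet(H_bar)
        log_m.append(-0.5 * logdet - a_bar * np.log(b_bar))
        stats.append((k, H_bar, mu_bar, b_bar, a_bar))
    w = np.exp(np.array(log_m) - max(log_m))
    w /= w.sum()                                     # marginal posterior of tau
    draws_tau, draws_gamma = [], []
    for _ in range(n_draws):
        k, H_bar, mu_bar, b_bar, a_bar = stats[rng.choice(len(w), p=w)]
        sigma2 = 1.0 / rng.gamma(a_bar, 1.0 / b_bar)  # InvGamma draw
        gamma = rng.multivariate_normal(mu_bar, sigma2 * np.linalg.inv(H_bar))
        draws_tau.append(k / T)
        draws_gamma.append(gamma)
    return np.array(draws_tau), np.array(draws_gamma)
```

Equal-tailed credible intervals for the slopes are then just empirical quantiles of `draws_gamma`, which automatically average over the break-date uncertainty carried by `draws_tau`.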

Asymptotic theory
We investigate the asymptotic behavior of the Bayesian method under the normal likelihood and the conjugate prior defined above. We do so in two steps. Section 3.1.1 shows that the marginal posterior of the break location $\tau$ contracts to the true value $\tau_0$ at the rate $T^{-1}$, the same rate at which the least-squares estimator $\hat\tau_{LS}$ converges. The proof is based on studying the behavior of the log ratio of the marginal posterior densities of $\tau$. In addition, we establish the limiting distribution of the posterior mode of $\tau$. Section 3.1.2 establishes a Bernstein-von Mises type theorem for the regression slope coefficients $\gamma$. The proof is based on the $T$-consistency of the marginal posterior of $\tau$ and the fact that the conditional posterior of $\sqrt{T}(\gamma - \hat\gamma_{LS})$ given $\tau$ is asymptotically normal. Proofs of the theorems can be found in Appendix A.
The marginal likelihood conditional on $\tau$, $L_T(\tau)$, is available up to a multiplicative constant under the normal likelihood and the conjugate prior, as can be seen in (9), and the marginal posterior density $\pi_T(\tau)$ of $\tau$ is defined accordingly. The following theorem establishes the first step for proving the Bernstein-von Mises theorem, the $T$-consistency of the marginal posterior of $\tau$. It states that the posterior mass outside of a ball around $\tau_0$ with radius proportional to $T^{-1}$ will be asymptotically negligible.
Theorem 1 (Marginal posterior consistency of $\tau$ at rate $T^{-1}$). Suppose Assumption 1 holds. Then, under the normal likelihood and the conjugate prior described above, $\forall \eta > 0$, $\epsilon > 0$, $\exists M > 0$ and $k > 0$ such that
$$T \geq k \implies P\left( \int_{\{\tau : T|\tau - \tau_0| > M\}} \pi_T(\tau)\, d\tau > \epsilon \right) < \eta.$$

The proof of Theorem 1 is built on some intermediate steps, Propositions 1-4, which bound $\int_{\{T|\tau - \tau_0| > M\}} L_T(\tau)\, d\tau$ and the inverse of $\int L_T(\tau)\, d\tau$ for each $T$. Proposition 1 shows that under the normal likelihood and the conjugate prior, due to the availability of the marginal likelihood conditional on $\tau$ up to a normalization constant as in (9), studying the log marginal likelihood ratio boils down to comparing the sum of squared residuals $S_T(\tau)$. Proposition 2 establishes the probability limit of $T^{-1} S_T(\tau)$, examples of which are shown in Figure 2. We then show that the limit of $T^{-1} S_T(\tau)$ achieves a unique minimum at $\tau_0$ (Proposition 3), and study the modulus of continuity of an appropriate empirical process (Proposition 4) in order to derive bounds. The details of the proof of Theorem 1 can be found in Appendix A.1. The Bayesian counterpart of the least-squares estimator $\hat\tau_{LS}$ is the posterior mode $\hat\tau_{Bayes} = \arg\max_\tau \pi_T(\tau)$. Bai (1997) shows that $\arg\max_m W^*(m)$, where $W^*(m)$ is a stochastic process defined on the set of integers, characterizes the asymptotic distribution of $\hat\tau_{LS}$. A consequence of the proof of Theorem 1 is that $\hat\tau_{Bayes}$ converges to the same limiting distribution. See Appendix A.2 for a proof.
Corollary 1 (Limiting distribution of the posterior mode of $\tau$). Suppose Assumption 1 holds. Then, under the normal likelihood and the conjugate prior described above, $\hat\tau_{Bayes}$ has the same limiting distribution as $\hat\tau_{LS}$, namely that of $\arg\max_m W^*(m)$.

Bernstein-von Mises Theorem for γ
The marginal posterior of $\gamma$ is a mixture with weights corresponding to the marginal posterior density $\pi_T(\tau)$. Furthermore, due to Theorem 1, we can focus our attention on the values of $\tau$ in a $T^{-1}$ neighborhood of $\tau_0$. We are now ready to establish the Bernstein-von Mises type result.
Theorem 2 (Bernstein-von Mises theorem for the slope coefficients). Suppose Assumption 1 holds. Then, under the normal likelihood and the conjugate prior described above, the marginal posterior distribution of $\sqrt{T}(\gamma - \hat\gamma_{LS})$ converges, in probability, to the normal limit that characterizes the sampling distribution of $\hat\gamma(\tau_0)$.

The proof of Theorem 2 exploits the fact that the conditional posterior of $\sqrt{T}(\gamma - \hat\gamma_{LS})$ given $\tau$ is asymptotically normal, which is close to the asymptotic distribution of $\hat\gamma_{LS}$ when $\tau$ is close to $\tau_0$. A bound on the Kullback-Leibler (KL) divergence between two normal densities, together with the $T$-consistency, is used to make the argument precise. The proof is presented in Appendix A.3.

An extension to non-conjugate priors
The previous section establishes the asymptotic properties of the posterior distributions under the conjugate prior. A natural question is whether these results can be extended to other priors. For example, a prior that is independent between the slope coefficients $\gamma$ and the error variance $\sigma^2$, i.e., $\pi(\gamma, \sigma^2) = \pi(\gamma)\pi(\sigma^2)$ with $\gamma \sim N_{d_x+d_z}(\mu, \Sigma)$ and $\sigma^2 \sim \mathrm{InvGamma}(a, b)$, is a popular choice for the Bayesian estimation of regression models in practice. Under the normal likelihood and the conjugate prior, the analytical expressions of the marginal posterior of $\tau$ up to a normalization constant (9) and of the conditional posterior of $\gamma$ given $\tau$ (10) facilitate the theoretical analysis. They are not available, for instance, under the independent prior mentioned above. In this section, we extend the theoretical results by keeping the normal likelihood (8) but without requiring the conjugate prior on $\theta$. In order to study the asymptotic behavior of the posterior distributions without their closed-form expressions, we employ Laplace approximation type results in Hong and Preston (2012). To do so, we make an additional assumption, stated below. Let $\hat\theta(\tau)$ be the maximum likelihood estimator of $\theta$ conditional on $\tau \in \mathcal{H}$, i.e., $\hat\theta(\tau) = \arg\sup_{\theta \in \Theta} \log p(Y_T \mid X_T, \theta, \tau)$. Denote by $\theta^*(\tau)$ the corresponding pseudo-true parameter value that minimizes the KL divergence between the model $p(Y_T \mid X_T, \theta, \tau)$ and the DGP.
Under the normal likelihood and Assumption 1, together with Assumption 2, we can invoke the Laplace approximation results of Hong and Preston (2012). Note that, under the normal likelihood and Assumption 1, θ * (τ ) exists and is a function of parameters in the DGP. In this section, we no longer assume the natural conjugate prior on θ. For instance, the independent prior π(γ, σ 2 , τ ) = π(γ)π(σ 2 )π(τ ) mentioned above satisfies the conditions in (ii) of Assumption 2 as long as they are truncated on Θ and π(τ ) is positive and finite at all τ . Theorem 3 below establishes the T -consistency of the marginal posterior of τ under this prior and the additional assumption.
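To illustrate the device, the following one-dimensional sketch compares a Laplace approximation of the marginal likelihood $\int p(Y \mid \theta)\,\pi(\theta)\,d\theta$ with its exact value in a toy normal-mean model. The toy model and the prior variance `v` are our own illustrative choices; the paper applies the approximation results of Hong and Preston (2012) to the full regression model.

```python
import numpy as np

def log_marginal_laplace(y, v=10.0):
    """Laplace approximation to log p(Y) = log ∫ p(Y|θ) π(θ) dθ.

    One-dimensional sketch: expand the log-likelihood around the MLE
    θ̂ and integrate the resulting Gaussian. Here y_t ~ N(θ, 1) and
    θ ~ N(0, v), so the exact marginal exists for comparison.
    """
    T = len(y)
    theta_hat = y.mean()  # MLE of the mean
    loglik = -0.5 * T * np.log(2 * np.pi) - 0.5 * np.sum((y - theta_hat) ** 2)
    log_prior = -0.5 * np.log(2 * np.pi * v) - 0.5 * theta_hat**2 / v
    # Gaussian integral adds (1/2) log 2π − (1/2) log |−ℓ''(θ̂)|, ℓ'' = −T
    return loglik + log_prior + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(T)

def log_marginal_exact(y, v=10.0):
    """Exact log marginal: y ~ N(0, I + v * 11')."""
    T = len(y)
    s = y.sum()
    quad = y @ y - v * s**2 / (1 + v * T)  # Sherman-Morrison
    logdet = np.log(1 + v * T)             # matrix determinant lemma
    return -0.5 * T * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad
```

For moderate $T$ the two values agree up to an $O(T^{-1})$ error, which is the sense in which the Laplace approximation substitutes for the missing closed forms under non-conjugate priors.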
Theorem 3 (Marginal posterior consistency of $\tau$ at rate $T^{-1}$, non-conjugate priors). Suppose Assumptions 1 and 2 hold. Then, under the normal likelihood, $\forall \eta > 0$, $\epsilon > 0$, $\exists M > 0$ and $k > 0$ such that
$$T \geq k \implies P\left( \int_{\{\tau : T|\tau - \tau_0| > M\}} \pi_T(\tau)\, d\tau > \epsilon \right) < \eta.$$

Recall that while proving the $T$-consistency under the conjugate prior (i.e., Theorem 1), we utilize the closed-form expression of the marginal posterior of $\tau$ up to a multiplicative constant (9) in order to study the behavior of the marginal likelihood ratio conditional on $\tau$. Under non-conjugate priors, such an expression is not available in general. For this reason, we invoke a Laplace approximation to the quantity $\int p(Y_T \mid X_T, \theta, \tau)\, \pi(\theta, \tau)\, d\theta$ in order to prove Theorem 3. See Appendix A.4 for the details.
As in the previous section, an implication of the $T$-consistency of the marginal posterior of $\tau$ is that the posterior mode converges to the limiting distribution of $\hat\tau_{LS}$. The proof is in Appendix A.5.
Corollary 2 (Limiting distribution of the posterior mode of $\tau$, non-conjugate priors). Suppose Assumptions 1 and 2 hold. Then, under the normal likelihood, the posterior mode $\hat\tau_{Bayes}$ converges to the distribution of $\arg\max_m W^*(m)$, where the stochastic process $W^*(m)$ is defined in Section 3.1.1.
Theorem 4 establishes our main theoretical result, the Bernstein-von Mises theorem for γ, under the prior defined in Assumption 2 (ii).
Theorem 4 (Bernstein-von Mises theorem for the slope coefficients, non-conjugate priors). Suppose Assumptions 1 and 2 hold. Then, under the normal likelihood, the conclusion of Theorem 2 continues to hold: the marginal posterior distribution of $\sqrt{T}(\gamma - \hat\gamma_{LS})$ converges, in probability, to the same normal limit.

When proving the corresponding result under the conjugate prior (i.e., Theorem 2), we utilize the closed-form expression of the marginal posterior of $\gamma$ given $\tau$ (10). As this is not available under the prior in this section, we again use a Laplace approximation to study the asymptotic behavior of the marginal posterior. See Appendix A.6 for a proof.

Simulation
The main purpose of the simulation studies below is to compare inference on the slope parameters $\gamma$ between two methods: the conventional least-squares method of Bai (1997) and the Bayesian approach described in this paper. For the Bayesian approach, we use the uniform prior for $\tau$ and the conjugate prior for the regression parameters with $H = 0.1 I_{d_x+d_z}$, $\mu = 0_{d_x+d_z}$, and $a = b = 1$. The findings are similar even when we use the uninformative improper prior. Following the literature (e.g., Casini & Perron, 2021), we set the range of the candidate values of $\tau$ to be $(\epsilon, 1-\epsilon)$ with $\epsilon = 0.05$ for all methods.
We consider the following model. In order to compare the methods in repeated experiments, for each combination of $\tau_0$, $\delta_0$, and $T$, we generate 1,000 data sets. We consider different values of the break location $\tau_0 \in \{0.3, 0.5\}$, the jump size $\delta_0 \in \{0.25, 0.5, 1.0, 2.0\}$, and the sample size $T \in \{20, 50, 100, 250, 500, 1000\}$. The error $\epsilon_t$ is independently and identically generated from $N(0, 1)$. In the online appendix, we present a robustness check with the errors generated from the mixture of two normals $0.5\,N(-1/\sqrt{2}, 1/2) + 0.5\,N(1/\sqrt{2}, 1/2)$ and illustrate that the overall findings are similar to those under the normal DGP. Table 1 shows the simulation results concerning $\delta$. The top panel "Coverage" shows empirical coverage of the true jump size $\delta_0$ by the 95% confidence and credible intervals. The frequentist confidence intervals are computed based on the conventional asymptotic theory (7). For the Bayesian approach, we report the equal-tailed credible intervals. The middle panel "Length" presents the average lengths of the aforementioned intervals. The bottom panel "MSE for $\delta$" shows the mean-squared errors of the point estimators: the least-squares estimator $\hat\delta_{LS}$ defined in (6) for the conventional method, and the posterior mean for the Bayesian approach.
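The undercoverage mechanism examined in Table 1 can be reproduced in a small Monte Carlo sketch for a pure mean-shift design; the replication count, trimming, and design below are illustrative, not the paper's exact settings.

```python
import numpy as np

def coverage_conventional(delta0, T=100, tau0=0.5, reps=300, seed=0):
    """Monte Carlo coverage of the conventional 95% CI for the jump size.

    Sketch of the design in the text: mean-shift data
    y_t = delta0 * 1(t > floor(tau0*T)) + eps_t. The break is estimated
    by least squares, then the CI is built treating tau_hat_LS as known,
    exactly the practice whose undercoverage the simulations document.
    """
    rng = np.random.default_rng(seed)
    k0 = int(np.floor(tau0 * T))
    hits = 0
    for _ in range(reps):
        y = delta0 * (np.arange(T) >= k0) + rng.standard_normal(T)
        # least-squares break date: minimize SSR over candidate splits
        best_ssr, k_hat = np.inf, None
        for k in range(5, T - 5):
            m1, m2 = y[:k].mean(), y[k:].mean()
            ssr = np.sum((y[:k] - m1) ** 2) + np.sum((y[k:] - m2) ** 2)
            if ssr < best_ssr:
                best_ssr, k_hat = ssr, k
        n1, n2 = k_hat, T - k_hat
        d_hat = y[k_hat:].mean() - y[:k_hat].mean()
        sigma2 = best_ssr / (T - 2)
        se = np.sqrt(sigma2 * (1 / n1 + 1 / n2))
        hits += abs(d_hat - delta0) <= 1.96 * se
    return hits / reps
```

Running this for a large jump versus a small jump illustrates the pattern in the tables: the CI that conditions on $\hat\tau_{LS}$ is roughly calibrated when the break is obvious, and its coverage deteriorates as $\delta_0$ shrinks.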
There are several significant findings. First, for small $T$ and/or small $\delta_0$, the conventional confidence intervals significantly undercover, whereas the Bayesian credible intervals have relatively reasonable coverage. Second, the Bayesian intervals tend to be longer than the conventional confidence intervals for small $T$ and/or $\delta_0$. Third, as $T$ increases, the discrepancy between the two methods decreases, as expected from the Bernstein-von Mises theorem that we establish. Table 2 shows the results of estimation and inference for the break location $\tau$. Although the main focus of the current paper is on inference about the slope parameters $\gamma$ and not about $\tau$, we report the empirical coverage and the length of the 95% confidence interval of Bai (1997) and of the highest posterior density (HPD) set. We also report the inverted likelihood ratio (ILR) confidence set suggested by Eo and Morley (2015).
Overall, the HPD set and the ILR confidence set for the break location $\tau$ behave similarly, although the HPD set slightly undercovers relative to the ILR confidence set for small $T$ and/or $\delta_0$. We confirm several findings of Eo and Morley (2015). First, when $T$ is large, the confidence interval of Bai is longer than the ILR confidence set and the HPD set. Second, when $T$ and $\delta_0$ are small, the confidence interval of Bai tends to severely undercover compared to the ILR confidence set and the HPD set. The interval of Bai is indeed shorter than the other two sets for small $T$, but its undercoverage raises concerns for small samples in practice.
The bottom panels of Table 2 show the mean absolute error (MAE) of the point estimators of $\tau$: $\hat\tau_{LS}$ defined in (5) for the conventional method, and the posterior mode $\hat\tau_{Bayes}$ for the Bayesian approach. It is known that the finite-sample distribution of the least-squares estimator $\hat\tau_{LS}$ tends to be trimodal when the jump size is relatively small (see Baek, 2021); the same seems to be true for the Bayesian point estimator (see Figure 1). Eo and Morley (2015) also find that the confidence interval of Qu and Perron (2007) for the break location, which is likewise based on a Wald-type test as the confidence interval of Bai, tends to undercover in small samples despite being slightly shorter than the ILR confidence set. In addition, as also reported by Eo and Morley (2015), the ILR confidence set tends to slightly overcover even in large samples.
To better understand the importance of the uncertainty of the break location $\tau$ for inference on the slope parameters, we conduct a hypothetical experiment. We repeat the simulation exercise, but now fixing $\tau$ at the least-squares estimate $\hat\tau_{LS}$. Table 3 displays the results. Note that the results for the least-squares estimator are, of course, the same as in Table 1. We now see, however, that not only the conventional confidence intervals for $\delta$ but also the credible intervals undercover for small $T$ and/or small $\delta_0$; they also have similar lengths in general. Importantly, the credible intervals with $\tau$ fixed at $\hat\tau_{LS}$ (Table 3) are shorter than the full Bayesian intervals (Table 1). On average, the full Bayesian credible intervals are 17.1% longer than the credible intervals produced by fixing the value of $\tau$ at $\hat\tau_{LS}$. Note that a Bayesian equivalent of the conventional approach to inference on the slope parameters would be to fix the value of $\tau$ at the posterior mode (whose value is very similar to $\hat\tau_{LS}$, as we can see from Figure 1 and deduce from Corollary 1). We can see in Figure 1 that both $\hat\tau_{LS}$ and the posterior mode of $\tau$ display a significant amount of variation. Fixing $\tau$ at a point estimate forces the Bayesian approach to ignore this uncertainty; as a result, the credible interval for $\delta$ becomes shorter and hence undercovers. The full Bayesian approach takes such uncertainty into account via the marginal posterior of $\tau$ (see examples of the density in Figure 1). This results in longer full Bayesian intervals for the slope parameters and helps them avoid undercoverage. In contrast, by construction (i.e., Equation 7), the conventional confidence intervals do not have this feature.
In summary, the simulation exercises demonstrate that (1) the credible intervals for the slope coefficient tend to have more reasonable coverage than the conventional confidence intervals because of their longer lengths, (2) the longer length of the credible intervals reflects the uncertainty of the unknown break location $\tau$, and (3) the two intervals converge to each other asymptotically, as expected from our Bernstein-von Mises theorem. When $\tau_0$ is known, the two intervals behave very similarly. To illustrate this point, we conduct another hypothetical experiment by repeating the simulation exercise as before, but now fixing the value of $\tau$ at the true value $\tau_0$ in both the conventional and Bayesian approaches. Table 4 summarizes the results. In this case, both the confidence and credible intervals have coverage quite close to 95% in all cases, and they have similar lengths. Note that when the true value $\tau_0$ is given, the usual asymptotic normality and the regular Bernstein-von Mises theorem apply. As a consequence, both frequentist and Bayesian intervals seem to converge faster to the limit than in the case with unknown $\tau$.

Application
In this section, we illustrate the difference between the conventional approach and the Bayesian approach considered in this paper for estimation and inference of the regression parameters in linear regression models with a structural break. Paye and Timmermann (2006) consider the problem of ex-post prediction of stock returns under a structural break in the coefficients of state variables. Their multivariate model with a structural break regresses the excess return on a constant and four lagged state variables, with the coefficients changing across the two regimes as in model (1):
$$Ret_t = \beta_{0,j} + \beta_{1,j} Div_{t-1} + \beta_{2,j} Tbill_{t-1} + \beta_{3,j} Spread_{t-1} + \beta_{4,j} Def_{t-1} + \epsilon_t, \quad j = 1, 2,$$
where $Ret_t$ is the excess return for the international index in question during month $t$, $Div_{t-1}$ is the lagged dividend yield, $Tbill_{t-1}$ is the lagged local country short interest rate, $Spread_{t-1}$ is the lagged local country term spread, $Def_{t-1}$ is the lagged U.S. default premium, and $j$ indexes the regime. The authors estimate the model using the conventional frequentist approach: they first compute $\hat\tau_{LS}$ and then obtain point estimates as well as confidence intervals for the slope coefficients by fixing $\tau$ at $\hat\tau_{LS}$. We examine whether the Bayesian method performs differently from the conventional approach.
Monthly series are collected from Global Financial Data and Federal Reserve Economic Data (FRED). In this paper, we consider estimating the model for the United Kingdom and Japan. The indices to which the total return and the dividend yield correspond are the FTSE All-share for the U.K. and the Nikko Securities Composite for Japan. For each country, a 3-month Treasury bill rate is used as a measure of the short interest rate, while the yield on a long-term government bond is used as a measure of the long interest rate. Excess returns are computed as the total return on stocks in the local currency minus the local short rate. The dividend yield is expressed as an annual rate and is constructed as the sum of dividends over the preceding 12 months, divided by the current price. The term spread is the difference between the long and short local country interest rates. The U.S. default premium is defined as the difference in yields between Moody's Baa and Aaa rated bonds. For each country, the sample spans January 1970 to December 2003.
For both approaches, we set the range of the candidate values of $\tau$ to be $(\epsilon, 1-\epsilon)$ with $\epsilon = 0.05$, as in the simulation studies in the previous section. For the Bayesian approach, we use the uniform prior on $(\epsilon, 1-\epsilon)$ for $\tau$ and the conjugate prior for the regression parameters with $H = 0.1 I_{d_x+d_z}$, $\mu = 0_{d_x+d_z}$, and $a = b = 1$. The findings are similar even when we use the uninformative improper prior. For the break date, we compute the least-squares estimator $\hat\tau_{LS}$ and the posterior mode $\hat\tau_{Bayes}$ of $\tau$, as well as the 95% confidence interval of Bai (1997), the highest posterior density (HPD) set, and the inverted likelihood ratio (ILR) confidence set of Eo and Morley (2015). For the slope parameters, we compute $\hat\gamma_{LS}$ and the posterior mean of $\gamma$, as well as the 90% confidence intervals of Bai (1997) based on the asymptotic result (7) and the equal-tailed credible intervals.

Paye and Timmermann (2006) conduct the sequential method suggested by Bai and Perron (1998), Bai and Perron (2003), and Perron (2006) for determining the number of breaks and find multiple breaks for some countries. They find single breaks for the U.K. and Japan but, for example, two breaks for the U.S. A fully Bayesian approach would be to place a prior on the number of breaks and use a trans-dimensional estimation method such as reversible jump MCMC, which is beyond the scope of this paper.

[Table 5. The upper panel shows point estimates as well as 90% confidence (left) and equal-tailed credible (right) intervals for the regression slope parameters. The lower panel shows point estimates of $\tau$ with the corresponding months in parentheses, as well as the bounds of the 95% confidence intervals of Bai (1997) and highest posterior density (HPD) sets. It also displays the inverted likelihood ratio (ILR) confidence sets of Eo and Morley (2015). LB = lower bound and UB = upper bound of the intervals.]
When the uncertainty about $\tau$ is small, estimation and inference of the slope parameters roughly match between the conventional least-squares approach and the Bayesian approach, as illustrated by our simulation studies and indicated by our Bernstein-von Mises theorem. See Table 5 for the results for the U.K. Both methods estimate a break at 1975:01. The confidence interval of Bai (1997), the Bayesian highest posterior density (HPD) set, and the inverted likelihood ratio (ILR) confidence set of Eo and Morley (2015) are all similar and narrow, indicating that the uncertainty about $\tau$ is small. This can also be seen from the posterior density of the break date in Panel (a) of Figure 3, which has a sharp peak around 1975:01. Paye and Timmermann (2006) explain that the break in the mid-1970s might be related to the large macroeconomic shocks reflecting oil price increases. As a result of the small uncertainty about $\tau$, the point estimates of the slope parameters as well as the corresponding confidence/credible intervals are similar between the conventional and the Bayesian approach. Importantly, when the confidence interval of a given slope parameter includes (or does not include) zero, the corresponding credible interval also includes (or does not include) zero. Hence, the conventional approach to inference about the slope parameters for the U.K. sample seems to be robust with respect to the uncertainty about the break date. In contrast, when the uncertainty about $\tau$ is large, the conventional and the Bayesian results on inference about the slope parameters might disagree. Table 6 shows the results for Japan. Although both $\hat\tau_{LS}$ and the posterior mode of $\tau$ are at 1996:05, the HPD set and the ILR confidence set are much wider than the confidence interval of Bai (1997), indicating a large uncertainty about the break date.
The posterior density of $\tau$ in Figure 3 also illustrates that the uncertainty of the break date is much larger for Japan than for the U.K. during the sample period. The large uncertainty about $\tau$ is reflected in Bayesian inference on the slope parameters. In the upper panel of Table 6, we see that the Bayesian credible intervals are in general wider than the confidence intervals. Importantly, this can have qualitative consequences for the statistical significance of some parameters: for seven of the ten slope coefficients, the confidence intervals do not include zero while the Bayesian credible intervals do. Hence, for the Japanese sample, the conventional approach to inference on the slope parameters might not be robust with respect to the uncertainty of the break date. The layout of Table 6 and the accompanying note are the same as those of Table 5.
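The mechanism behind the wider credible intervals can be illustrated with a small simulation sketch. This is our own illustration, not the paper's replication code: we assume a single regressor $z_t$, no unchanged covariates $w_t$, a uniform prior on $\tau$, and the concentrated normal likelihood, under which the marginal posterior of $\tau$ is proportional to $S_T(\tau)^{-T/2}$ (consistent with Proposition 1).

```python
import numpy as np

rng = np.random.default_rng(0)
T, tau0 = 400, 0.5
z = rng.normal(size=T)
k0 = int(tau0 * T)
delta = np.where(np.arange(T) < k0, 1.0, 2.0)   # slope changes from 1 to 2 at tau0
y = delta * z + rng.normal(size=T)

# Profile sum of squared residuals S_T(tau) over candidate break fractions
taus = np.linspace(0.05, 0.95, 181)
S = np.empty_like(taus)
for i, tau in enumerate(taus):
    k = int(np.floor(tau * T))
    X = np.zeros((T, 2))
    X[:k, 0], X[k:, 1] = z[:k], z[k:]            # regime-specific regressors
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    S[i] = r @ r

# Concentrated normal log likelihood: log L_T(tau) = -(T/2) log S_T(tau) + const,
# so a uniform prior on tau gives the normalized posterior below.
loglik = -T / 2 * np.log(S)
post = np.exp(loglik - loglik.max())
post /= post.sum()
tau_mode = taus[np.argmax(post)]
print(tau_mode)  # close to the true break fraction 0.5
```

When the posterior `post` is spread out over many candidate breaks, averaging the conditional-on-$\tau$ intervals for the slopes over `post` widens them relative to plugging in a single break date, which is the pattern seen in the Japanese sample.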

Conclusion and future directions
In this paper, we establish a Bernstein-von Mises type theorem for the slope coefficients in linear regression with a structural break. By doing so, we bridge the gap between the frequentist and the Bayesian approaches to inference in this model. On the one hand, a frequentist researcher can examine Bayesian credible intervals for the slope coefficients as a robustness check, to see whether the uncertainty of the break location affects inference on the slope parameters. Such a sensitivity analysis is natural because our theoretical result guarantees that the credible interval converges to the conventional confidence interval that the frequentist researcher would use otherwise. On the other hand, our result gives Bayesian inference on the slope parameters an asymptotic frequentist justification.
Several extensions are possible. First, the homoscedasticity assumption could be too strong in some applications, so extending the results to heteroscedasticity and autocorrelation would be of interest. Second, the popular Bayesian method of Chib (1998) differs from the approach taken in this paper: we place an explicit prior on $\tau$, whereas Chib's framework extends naturally to the case of multiple breaks. It would be interesting to study the frequentist properties of Chib's approach.

A Proof of Theorems and Corollaries
In Appendix A, we provide proofs of Theorems 1-4 and Corollaries 1-2. See Appendix B for proofs of the Propositions used for proving the main theorems.

A.1 Proof of Theorem 1
Proof of Theorem 1. Note that (12) holds for each $T$ and for any $M_0 > 0$. Therefore, we want to bound the likelihood ratio $L_T(\tau)/L_T(\tau_0)$. The proof of Theorem 1 builds on intermediate steps, Propositions 1-4. Proposition 1 shows that, under the normal likelihood and the conjugate prior, studying this ratio boils down to comparing the sums of squared residuals $S_T(\tau)$.
Proposition 1. Suppose Assumption 1 holds. Then, with the normal likelihood and the conjugate prior described above, under $P_{\theta_0, \tau_0}$, for all $\tau$,
$$\frac{1}{T} \log \frac{L_T(\tau)}{L_T(\tau_0)} = \frac{1}{2} \log \frac{S_T(\tau_0)}{S_T(\tau)} + O_p(T^{-1}).$$
Let us first examine the limit of the quantity $Q_T(\tau) = T^{-1} S_T(\tau)$. Proposition 2 states that $Q_T(\tau)$ converges in probability to a deterministic function $Q(\tau)$. See Figure 2 for examples of $Q_T(\tau)$ and $Q(\tau)$.
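The convergence of $Q_T(\tau)$ can be checked numerically with a small Monte Carlo sketch of our own, under assumed data-generating values: a single regressor, slope change from 1 to 2 at $\tau_0 = 0.5$, and error variance $\sigma_0^2 = 1$. As $T$ grows, $Q_T(\tau_0)$ settles near $\sigma_0^2$, while $Q_T(\tau)$ at a misplaced break stays strictly larger.

```python
import numpy as np

def Q_T(y, z, tau):
    """Q_T(tau) = S_T(tau)/T: normalized SSR from regime-wise least squares."""
    T = len(y)
    k = int(np.floor(tau * T))
    X = np.zeros((T, 2))
    X[:k, 0], X[k:, 1] = z[:k], z[k:]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return (r @ r) / T

rng = np.random.default_rng(1)
tau0, sigma0 = 0.5, 1.0
for T in (500, 5000, 50000):
    z = rng.normal(size=T)
    k0 = int(tau0 * T)
    delta = np.where(np.arange(T) < k0, 1.0, 2.0)
    y = delta * z + sigma0 * rng.normal(size=T)
    # Q_T at the true break approaches sigma0^2; at a wrong break it is larger
    print(T, Q_T(y, z, tau0), Q_T(y, z, 0.3))
```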

A.3 Proof of Theorem 2
Proof of Theorem 2. Define $z = \sqrt{T}(\gamma - \hat{\gamma}_{LS})$ and let $\phi(x; \mu, \Sigma)$ be the multivariate normal density with mean $\mu$ and covariance matrix $\Sigma$ evaluated at $x$.
where the last equality is due to Theorem 1. From (10), the posterior of $\gamma$ conditional on $\tau$ is asymptotically normal. The total variation distance is bounded above by two times the square root of the KL divergence. In general, the KL divergence between two $p$-dimensional normal distributions $N_p(\mu_1, \Sigma_1)$ and $N_p(\mu_2, \Sigma_2)$ is bounded above by the expression in (18), where $\|\Sigma\|_\infty = \max_{ij} |\Sigma_{ij}|$ is the largest element of $\Sigma$ in absolute value and $\|\Sigma\|_2 = \sup_{\mu} \|\Sigma \mu\|_2 / \|\mu\|_2$ is the matrix norm induced by the standard norm $\|\mu\|_2 = (\sum_{i=1}^p \mu_i^2)^{1/2}$ on $\mathbb{R}^p$. We can bound the total variation distance between the posterior density of $z$ conditional on $\tau$ and that of $N_{(d_x+d_z)}(0, \sigma_0^2 V^{-1})$ using the bound (18). To show that the terms I and II are $o_p(1)$, note that $\Sigma_1 - \Sigma_2$ equals the expression in (20). The term in the first square brackets in (20) is $o_p(1)$. For the second term in (20), we have the corresponding bound for $|\tau - \tau_0| < M/T$, where $\hat{V}_T(\tau) = \frac{1}{T} \chi_\tau' \chi_\tau$. This implies that $\Sigma_2^{-1} - \Sigma_1^{-1} = o_p(1)$, and hence II $= o_p(1)$. By continuity of the determinant, we also have I $= o_p(1)$ for $\tau \in B_{M/T}(\tau_0)$.
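The total-variation/KL inequality used here can be sanity-checked numerically in the one-dimensional case. This is a sketch of our own: the inequality tested is the stated bound $TV \le 2\sqrt{KL}$, with total variation taken as $\int |p - q|\,dx$, KL in closed form for univariate normals, and the integral computed by a Riemann sum.

```python
import numpy as np

def kl_normal(mu1, s1, mu2, s2):
    """Closed-form KL(N(mu1, s1^2) || N(mu2, s2^2))."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

def tv_normal(mu1, s1, mu2, s2):
    """Total variation distance int |p - q| dx via a fine Riemann sum."""
    x = np.linspace(-20.0, 20.0, 400001)
    dx = x[1] - x[0]
    p = np.exp(-0.5 * ((x - mu1) / s1) ** 2) / (s1 * np.sqrt(2 * np.pi))
    q = np.exp(-0.5 * ((x - mu2) / s2) ** 2) / (s2 * np.sqrt(2 * np.pi))
    return float(np.sum(np.abs(p - q)) * dx)

mu1, s1, mu2, s2 = 0.0, 1.0, 0.5, 1.2
kl = kl_normal(mu1, s1, mu2, s2)
tv = tv_normal(mu1, s1, mu2, s2)
print(tv, 2 * np.sqrt(kl))  # the bound TV <= 2 * sqrt(KL) holds
```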

A.4 Proof of Theorem 3
Proof of Theorem 3. Recall that Theorem 1 is an implication of Propositions 1-4, and that Assumption 1 implies Propositions 2-4. Proposition 1 establishes that, under the normal likelihood and the conjugate prior, Assumption 1 implies (21). Therefore, Theorem 3 follows once we establish the same equality under the normal likelihood and the prior described in Section 4, together with Assumptions 1-2. For a given $\tau$, denote by $F_T(\theta, \tau) = \log p(Y_T | X_T, \theta, \tau)$ the log likelihood function conditional on $\tau$. Under the normal likelihood and Assumption 1, together with Assumption 2, we can invoke Theorem 3 of Hong and Preston (2012) (see their page 361), which establishes the required expansion. Note that we assumed that $\pi(\theta^*(\tau), \tau)$ and $\pi(\theta^*(\tau_0), \tau_0)$ are finite and non-zero; hence the term involving the ratio of the priors is $O_p(T^{-1})$. Also, $-A_\theta(\tau)$ is a positive definite matrix, so its determinant is a finite positive number. Combining these facts, and using $\hat{\sigma}^2(\tau) = S_T(\tau)/T$ in the last equality, we obtain the desired result, i.e., (21). Note that Propositions 2-4 hold under Assumption 1. Therefore, given (21), the rest of the proof of Theorem 3 follows the same argument as in the proof of Theorem 1 in A.1.

A.5 Proof of Corollary 2
Proof of Corollary 2. Note that, due to (21), the likelihood-ratio approximation holds under the normal likelihood and the assumed condition on the prior, together with Assumptions 1-2. Furthermore, Theorem 3 implies that $\hat{\tau}_{Bayes} = \tau_0 + O_p(T^{-1})$. Based on these two facts, the rest of the proof follows the same argument as in the proof of Corollary 1 in A.2.
For $|\tau - \tau_0| < M/T$, the rest of the proof can be completed as in the proof of Theorem 2 in A.3 by applying the bound (18).

B Proof of Propositions
B.1 Proof of Proposition 1
Proposition 1. Suppose Assumption 1 holds. Then, with the normal likelihood and the conjugate prior described above, under $P_{\theta_0, \tau_0}$, for all $\tau$,
$$\frac{1}{T} \log \frac{L_T(\tau)}{L_T(\tau_0)} = \frac{1}{2} \log \frac{S_T(\tau_0)}{S_T(\tau)} + O_p(T^{-1}).$$
Proof of Proposition 1. From (9), we have an expression for the log marginal likelihood ratio. Assumption 1 implies that each component of $(1/T) \chi_\tau' \chi_\tau$ converges in probability to a constant matrix. By continuity of the determinant, the determinant converges to the determinant of the limiting matrix. As a result, the quantity inside the logarithm in the first term is $O_p(1)$, and hence the first term is $O_p(T^{-1})$. By the choice of the prior, the ratio $\pi(\tau)/\pi(\tau_0)$ is bounded, so the last term is $O(T^{-1})$. Hence, we conclude that
$$\frac{1}{T} \log \frac{L_T(\tau)}{L_T(\tau_0)} = \frac{1}{2} \log \frac{S_T(\tau_0)}{S_T(\tau)} + O_p(T^{-1}).$$
Integrating the above with respect to $\sigma^2$ over the positive part of the real line and using the change of variable $\phi = 1/\sigma^2$, we obtain the marginal posterior for $\tau$:
$$\pi(\tau | D_T) \propto \det(\bar{H}_\tau)^{-0.5} \, \bar{b}_\tau^{-\bar{a}} \, \pi(\tau).$$
Finally, applying to (23) the well-known property that integrating a normal-inverse-gamma density with respect to $\sigma^2$ yields a t-distribution, we conclude that
$$\gamma \mid \tau, D_T \sim t_p\!\left(2\bar{a}, \, \bar{\mu}_\tau, \, (\bar{b}_\tau / \bar{a}) \bar{H}_\tau^{-1}\right).$$
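These formulas can be sketched in code. This is a minimal illustration of our own: the normal-inverse-gamma update formulas below are the standard conjugate algebra and are assumed to match the paper's definitions of $\bar{H}_\tau$, $\bar{\mu}_\tau$, $\bar{a}$, and $\bar{b}_\tau$; we use a single regressor and a uniform prior on $\tau$.

```python
import numpy as np

rng = np.random.default_rng(2)
T, tau0 = 300, 0.4
z = rng.normal(size=T)
k0 = int(tau0 * T)
delta = np.where(np.arange(T) < k0, 1.0, 3.0)   # slope 1 before, 3 after the break
y = delta * z + rng.normal(size=T)

# Conjugate prior: gamma | sigma^2 ~ N(mu, sigma^2 H^{-1}), sigma^2 ~ IG(a, b)
H, mu, a, b = 0.1 * np.eye(2), np.zeros(2), 1.0, 1.0

def posterior_pieces(tau):
    """Standard normal-inverse-gamma updates for a given break fraction tau."""
    k = int(np.floor(tau * T))
    X = np.zeros((T, 2))
    X[:k, 0], X[k:, 1] = z[:k], z[k:]
    Hbar = H + X.T @ X
    mubar = np.linalg.solve(Hbar, H @ mu + X.T @ y)
    abar = a + T / 2
    bbar = b + 0.5 * (y @ y + mu @ H @ mu - mubar @ Hbar @ mubar)
    return Hbar, mubar, abar, bbar

# Marginal posterior: pi(tau | D_T) proportional to det(Hbar)^{-1/2} bbar^{-abar}
taus = np.linspace(0.05, 0.95, 181)
logpost = np.empty_like(taus)
for i, tau in enumerate(taus):
    Hbar, _, abar, bbar = posterior_pieces(tau)
    logpost[i] = -0.5 * np.linalg.slogdet(Hbar)[1] - abar * np.log(bbar)
post = np.exp(logpost - logpost.max())
post /= post.sum()
tau_mode = taus[np.argmax(post)]

# Conditional on tau: gamma | tau, D_T ~ t_2(2*abar, mubar, (bbar/abar) Hbar^{-1})
Hbar, mubar, abar, bbar = posterior_pieces(tau_mode)
print(tau_mode, mubar)  # break near 0.4; slope estimates near (1, 3)
```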