Safe and robust data-driven cooperative control policy for mixed vehicle platoons

This article considers mixed platoons consisting of both human-driven vehicles (HVs) and automated vehicles (AVs). The uncertainties and randomness in human driving behaviors strongly affect platoon safety and stability. However, most existing control strategies are designed either for platoons of pure AVs or for special formations of mixed platoons with known HV models. This article addresses the control of mixed platoons with more general formations and unknown HV models. An innovative data-driven policy learning strategy is proposed to design the controllers for AVs based on vehicle-to-vehicle (V2V) communications. The policy learning strategy is embedded with constraints on the control input, the inter-vehicular distance error, and the V2V communication topology. The strategy establishes a safe and robustly stable mixed platoon using prescribed communication topologies. The design efficacy is verified through simulations of a mixed platoon with different communication topologies and leader velocity profiles.

The behaviors of HVs need to be considered in the control of AVs to establish a safe (i.e., collision-free) and robust (i.e., formation-maintaining) mixed platoon. To address the third challenge, the platooning control strategy should also adapt to different formations of mixed platoons. However, the control strategies for platoons of pure AVs 6-8 cannot address these challenges or guarantee the safety and robust stability of mixed platoons. This raises the necessity of developing new control strategies for mixed platoons.
Within a mixed platoon, the car-following behaviors of HVs can be captured by several existing dynamic models, 14 among which the popular ones are the intelligent driver model and the optimal velocity (OV) model. Compared to other car-following models, the OV model has a simple mathematical representation and can characterize almost all kinds of traffic behaviors and the transitions between them. 14,15 The OV model has been used to develop control for mixed platoons. [15][16][17][18][19][20][21] An AV is controlled to smooth the mixed traffic flow on a ring road. 16,17 The "1 AV + n HVs" mixed platoon on more general roads is established by controlling an AV to lead n HVs. 18 The optimal control of "1 AV + n HVs" mixed platoons has also been designed in the context of a signalized intersection. 15 The "1 AV + n HVs + 1 AV" mixed platoon is achieved by controlling the rear AV using a tube model predictive controller. 19 Stability analysis and robust control have also been studied for more general formations of mixed platoons. 20,21 However, all the above works assume known OV model parameters, which is too restrictive because HV behaviors are difficult to model exactly. 9 Even if it is possible to calibrate accurate OV models, sharing the parameters is in general unrealistic for platoons that are formed during trips. 13 Therefore, it is more appealing to develop platooning control without knowing the HV parameters.
Only a few published works [22][23][24] have studied mixed platoons with unknown HV parameters. A recursive least squares method 22 is adopted to estimate the HV model, but platooning control is not investigated. Adaptive dynamic programming (ADP) 25 is currently the most well-established data-driven policy learning framework for controlling systems with unknown dynamic models. Building on ADP, data-driven control policy learning strategies have been developed for mixed platoons with input constraints 23 and with human reaction delays. 24 However, these works focus particularly on the "n HVs + 1 AV" mixed platoon. Moreover, their strategies cannot guarantee both (i) satisfaction of input and safety constraints, and (ii) platoon robustness against leader velocity changes and uncertain HV behaviors.
This article aims to develop a new data-driven control policy learning strategy for more general mixed platoons, to ensure satisfaction of input/safety constraints and platoon robustness against leader velocity disturbances and HV model uncertainties. The main contributions are summarized as follows: • A data-driven learning strategy based on ADP is proposed to obtain the cooperative control for mixed platoons with unknown HV parameters. The strategy is applicable for a wide range of mixed platoon formations that contain the "n HVs + 1 AV" platoons 23,24 as a special case.
• The policy learning incorporates input and safety constraints and a robust constrained invariant set, 26 which establishes a safe and robustly stable mixed platoon. This aspect has not been investigated in the existing mixed platoon designs. 23,24 Recent advances in the ADP theory can incorporate state constraints 27 or parameter uncertainties, 28 but none has studied both safety and robustness with vehicle platoon applications.
• The learning strategy includes a structural constraint on the control gain, which enables the controller to be implemented under a prescribed V2V communication topology. This offers more flexibility for implementing the control policy and a chance to account for the range limit of V2V communications during the control design. This aspect has not been studied in the existing mixed platoon designs, 23,24 or the ADP designs. 25,27,28 The rest of this article is organized as follows. Section 2 describes the platoon model and control problem. Section 3 presents the model-based policy learning strategy, followed by its data-driven implementation in Section 4. Section 5 provides the simulation results. Section 6 draws the conclusions.
Notations: The symbols ⊗ and • are the Kronecker and element-wise products, respectively. vec is the vectorization operator. | ⋅ | is the absolute value. || ⋅ || is the 2-norm. I_m is an m × m identity matrix. 1_{a×b} is an a × b matrix with all elements being 1. 0 is a zero matrix whose dimensions are clear from the context unless it is necessary to give them. diag(⋅, … , ⋅) is a diagonal matrix whose main diagonal contains the given elements. col(⋅, … , ⋅) stacks its operands into a column vector.

PLATOON MODELING AND CONTROL PROBLEM
This article considers the general mixed platoon in Figure 1A, where all the vehicles can share their positions and velocities through DSRC V2V wireless communication networks. 29 An AV is set as the leader to ensure controllability of the platoon and to assist the other AVs in designing their control policies. The lead AV is assumed to be equipped with a model predictive controller 30 that guarantees accurate reference velocity tracking. This article aims to design the longitudinal acceleration commands of the other AVs to follow the lead AV by using information from the surrounding HVs to enhance the platooning performance. To facilitate the control design, the general mixed platoon in Figure 1A is divided into a set of small mixed platoons shown in the dash-dotted and dashed blocks. These small mixed platoons can be represented by the unified mixed platoon in Figure 1B. This unified mixed platoon has (N + 1) vehicles, including the host AV n_c whose controller is to be designed, the assistant AV 0 supporting the control design, (n_c − 1) HVs ahead of the host AV, and (N − n_c) HVs behind the host AV. The unified mixed platoon is more general than the "1 AV + n HVs," 15,18 "1 AV + n HVs + 1 AV," 19 or "n HVs + 1 AV" 23,24 mixed platoons studied in the literature. This article will develop a cooperative control policy for the unified mixed platoon in Figure 1B, which can then be directly applied to the mixed platoon in Figure 1A.
A control-oriented mixed platoon model needs to be built to perform the control design. Define the index set of HVs in the unified mixed platoon as I_h = {1, … , N} \ {n_c}. The behaviors of HV i, i ∈ I_h, can be captured by the widely used OV model: 14-21,23

ḣ_i = v_{i−1} − v_i,
v̇_i = α_i (V(h_i) − v_i) + β_i (v_{i−1} − v_i),    (1)

where the variables p_i and v_i are the vehicle position and longitudinal velocity, respectively, h_i = p_{i−1} − p_i is the inter-vehicular distance between vehicles i − 1 and i, α_i is the headway gain, and β_i is the relative velocity gain. V(h_i) is the spacing-dependent desired velocity defined by

V(h_i) = 0 if h_i ≤ h_s;  V(h_i) = (v_max/2)(1 − cos(π(h_i − h_s)/(h_g − h_s))) if h_s < h_i < h_g;  V(h_i) = v_max if h_i ≥ h_g,    (2)

where h_s is the smallest inter-vehicular distance before the HV intends to stop, and h_g is the largest inter-vehicular distance after which the HV intends to maintain the maximum velocity v_max. This article establishes a stable platoon with h_s < h_i < h_g, and the values of h_s and h_g are the same for all HVs. When AV 0 travels at the velocity v_0, the equilibrium point of all the HVs is (h*, v*), where v* = v_0 and h* satisfies v* = V(h*). Upon knowing v*, the corresponding spacing h* can be easily determined from v* = V(h*), because there is a one-to-one mapping between h_i and V(h_i) when h_s < h_i < h_g, ∀i ∈ I_h. This mapping is illustrated by the example in Figure 2 with the typical settings: 16,23 h_s = 5 m, h_g = 35 m, and v_max = 30 m/s. These settings will also be used for simulation in Section 5.
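As a concrete illustration, the OV car-following dynamics can be sketched in Python, using the cosine-shaped desired-velocity function commonly adopted with the settings above. This is a minimal sketch: the forward-Euler step, the default parameter values, and the names `alpha`/`beta` for the headway and relative-velocity gains are our conventions.

```python
import numpy as np

def desired_velocity(h, h_s=5.0, h_g=35.0, v_max=30.0):
    """Spacing-dependent desired velocity V(h) of the OV model."""
    if h <= h_s:
        return 0.0
    if h >= h_g:
        return v_max
    return 0.5 * v_max * (1.0 - np.cos(np.pi * (h - h_s) / (h_g - h_s)))

def ov_step(h, v, v_prec, alpha, beta, dt):
    """One forward-Euler step of the OV car-following model:
    h_dot = v_prec - v,  v_dot = alpha*(V(h) - v) + beta*(v_prec - v)."""
    h_next = h + dt * (v_prec - v)
    v_next = v + dt * (alpha * (desired_velocity(h) - v) + beta * (v_prec - v))
    return h_next, v_next
```

Note that at a spacing h with V(h) equal to the (common) preceding velocity, the step leaves the state unchanged, which is exactly the equilibrium (h*, v*) discussed above.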
Define the platooning error vector as x_i = col(Δh_i, Δv_i), where Δh_i = h_i − h* and Δv_i = v_i − v*, which is known for each value of h*. The OV model (1) is linearized around the equilibrium point (h*, v*) and given as

Δḣ_i = Δv_{i−1} − Δv_i,
Δv̇_i = α_i V′(h*) Δh_i − (α_i + β_i) Δv_i + β_i Δv_{i−1}.    (3)

The dynamics of AV 0 and AV n_c are represented by the following point-mass model that is widely used for vehicle platoons: 6,7

ṗ_i = v_i,   v̇_i = u_i,   i = 0, n_c,    (4)

where the acceleration command u_0 is known while u_{n_c} is to be designed. AV n_c is controlled to track v_0 while keeping a desired and safe inter-vehicular distance h* between itself and HV n_c − 1. Hence, the platooning error vector is defined as x_{n_c} = col(Δh_{n_c}, Δv_{n_c}), where Δh_{n_c} = h_{n_c} − h*, Δv_{n_c} = v_{n_c} − v_0, and h_{n_c} = p_{n_c−1} − p_{n_c}. By using (4), the platooning error system of AV n_c is derived as

Δḣ_{n_c} = Δv_{n_c−1} − Δv_{n_c},
Δv̇_{n_c} = u_{n_c} − u_0,    (5)

where x_{n_c−1} is the platooning error vector of HV n_c − 1.
Define the overall platooning error vector as x = col(x_1, … , x_N), the control input as u = u_{n_c}, and the disturbance as d = u_0. By using (3) and (5), the overall platooning error system is derived as

ẋ = A_c x + B_c u + E_c d,    (6)

where the block rows B_i = 0, i ∈ I_h. Here the leader acceleration command u_0 (i.e., d) is regarded as a disturbance, because it is an external input that drifts the platooning error system (6) away from the steady state. Hence, u will be designed to ensure that the platooning error system is robustly internally stable and string stable against d.
To eliminate the steady-state error of Δh_{n_c}, the integral term x_I(t) = ∫_0^t Δh_{n_c}(τ) dτ is used by the controller. Based on (6), the augmented platooning error system is given by

ξ̇ = Ā_c ξ + B̄_c u + Ē_c d,    (7)

where ξ = col(x, x_I). Discretizing (7) using the forward Euler method with the sampling time t_s yields the control-oriented mixed platoon model

ξ(k + 1) = Ā ξ(k) + B̄ u(k) + Ē d(k),    (8)

where Ā = I_n + t_s Ā_c, B̄ = t_s B̄_c, Ē = t_s Ē_c, and n = 2N + 1.
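The forward Euler discretization step amounts to the simple matrix construction below (a sketch; the matrix names follow the text, and the example matrices in the usage note are arbitrary placeholders, not the platoon matrices):

```python
import numpy as np

def discretize_forward_euler(A_c, B_c, E_c, t_s):
    """Forward-Euler discretization of x_dot = A_c x + B_c u + E_c d:
    A = I + t_s*A_c, B = t_s*B_c, E = t_s*E_c."""
    n = A_c.shape[0]
    return np.eye(n) + t_s * A_c, t_s * B_c, t_s * E_c
```

For example, a double integrator A_c = [[0, 1], [0, 0]] with t_s = 0.1 yields the familiar discrete matrix [[1, 0.1], [0, 1]].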
Although the car-following behavior of an HV can be captured by the OV model (1), the uncertainty and randomness of human driving behaviors make it impossible to identify the exact model parameters α_i and β_i. Hence, the system matrix Ā of (8) is unknown and the model-based platooning control designs 15-21 are inapplicable. By collecting experimental data, an OV model can be calibrated to capture the average behavior of human drivers and used to synthesize a robust controller for the AV. 21 However, robust control is known to be conservative, and it cannot ensure satisfaction of the input and safety constraints. This article proposes an online data-driven strategy to learn a control policy based on (8) that realizes three objectives:
1. The mixed platoon maintains a safe inter-vehicular distance within the acceleration limits.
2. The mixed platoon is internally stable (i.e., settles at the desired velocity and inter-vehicular distance) and head-to-tail string stable 21 (i.e., robust against leader disturbances).
3. The mixed platoon performs well under different V2V communication topologies.
To realize Objective 1, the controller will be designed to satisfy the following input limits and safety constraints:

|u(t)| ≤ u_max,    (9a)
|Δh_i(t)| ≤ Δh_max,  i ∈ [n_c, N],    (9b)

where u_max is the acceleration limit and Δh_max is the maximum allowable inter-vehicular distance error (i.e., deviation from h*). The constraint (9b) bounds the inter-vehicular distance errors of the vehicles i ∈ [n_c, N] and avoids vehicle collisions. The HVs i, i ∈ [1, n_c − 1], can be controlled by AV 0 but not by AV n_c. Hence, their inter-vehicular distance errors cannot be controlled by u(t) to satisfy (9b). However, according to (2), these HVs will intend to stop once their inter-vehicular distances reduce to h_s, which avoids collisions. Objective 2 will be realized by using the concept of the robust constrained invariant set (RCIS). 26 To realize Objective 3, a structural constraint will be imposed on u(t) to indicate which vehicles' platooning errors are used. The structural constraint is important because AV n_c may not receive reliable information from all the HVs, especially when the inter-vehicular distances are large. 31 Incorporating the structural constraint enables u(t) to be implemented using a specified V2V communication topology, offering a chance to take the range limit of V2V communications into account during the control design. This article aims to illustrate the key ideas of the proposed policy learning strategy and thus focuses only on ensuring that the mixed platoon travels at safe inter-vehicular distances and is string stable. The safety and robustness of platoons also need to be guaranteed in the presence of platoon formation/deformation 8,13 and disturbances from surrounding vehicles. 32 These will be considered in future work by adding a trajectory planner [33][34][35] to generate real-time safe and optimal speed references for the platoon.
To clearly illustrate the proposed strategy, Section 3 will present a model-based control policy learning strategy, assuming that the matrixĀ of the platooning error system (8) is known. Based on this, Section 4 develops the data-driven policy learning strategy with an unknownĀ.

MODEL-BASED CONTROL POLICY LEARNING
As described in Section 2, the control policy to be designed needs to ensure safety, stability and robustness of the mixed platoon under the prescribed V2V communication topology. Section 3.1 presents the standard model-based control policy learning strategy to ensure platoon stability for the given V2V communication topology, without considering platoon safety and robustness. Section 3.2 further ensures safety and robustness of the policy learning. Section 3.3 summarizes the proposed model-based control policy learning strategy.

Standard structurally constrained policy learning
When the system (8) has known matrices Ā and B̄ and d = 0, designing an optimal controller u(t) = K ξ(t) can be formulated as solving the linear quadratic regulator (LQR) problem 25 with the cost function

J = Σ_{k=0}^∞ ( ξ(k)ᵀ Q ξ(k) + u(k)ᵀ R u(k) ),

where Q ≽ 0 and R ≻ 0 are user-defined matrices. Solving the LQR problem gives an optimal control gain K* without any restrictions on the V2V communication topology. To address this, for a given topology I_T, design the structural control gain K to satisfy

K = K • I_T.    (10)

The V2V communication topology I_T is a 1 × n vector whose elements are either 0 or 1. If u(t) uses the ith element ξ_i(t) of the platooning error vector ξ(t), then I_T(i) = 1; otherwise, I_T(i) = 0. For example, I_T = [0_{1×2(n_c−1)}, 1_{1×2(N−n_c+1)}, 1] indicates that u(t) uses the platooning errors of AV n_c, HV i, i ∈ [n_c + 1, N], and the integral state x_I. Imposing the constraint in (10) ensures that u(t) uses the specified V2V communication topology I_T. The platooning performance under different topologies will be investigated via simulations in Section 5. The structural control gain K in (10) is determined using Algorithm 1. Since the system (8) is controllable, by selecting Q to make (Ā, √Q) detectable, the sequence {K_l}_{l=1}^∞ generated by Algorithm 1 converges to the optimal structural gain K_opt. 36 The obtained controller ensures platoon stability, but cannot guarantee its safety and robustness. To overcome this, new policy evaluation and policy improvement methods are presented in Section 3.2.
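The interplay of policy evaluation, policy improvement, and structure enforcement can be sketched as below. This is a simplified variant, not the paper's Algorithm 1: it naively zeroes the gain entries outside the topology mask after each improvement step and assumes a stabilizing initial gain.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def structured_policy_iteration(A, B, Q, R, mask, K0, iters=50):
    """Discrete-time policy iteration with a naive structural projection:
    after each improvement step, gain entries where mask == 0 are zeroed."""
    K = K0
    for _ in range(iters):
        A_cl = A - B @ K
        # policy evaluation: solve A_cl' P A_cl - P + Q + K' R K = 0
        P = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)
        # policy improvement: K = (R + B' P B)^{-1} B' P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # structure enforcement via the element-wise topology mask
        K = K * mask
    return K, P
```

With an all-ones mask this recovers the unconstrained discrete LQR gain; with zeros in `mask`, the projection is only a heuristic, whereas the paper's Algorithm 1 handles the structural constraint with convergence guarantees.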

3.2
Safe and robust policy learning

Policy evaluation
To incorporate the requirements of safety (formulated as (9)) and robustness into the policy evaluation (see step 1 in Algorithm 1), it is necessary to establish their connections. The constraints in (9) are equivalently reformulated with respect to the augmented system (8) and given as the constraint in (11). The following lemma shows that (11) is satisfied if the condition (12) holds for a matrix P ≻ 0 and a scalar ρ > 0.
Proof. If (12) holds, then ξ(t) ∈ Ω(P, ρ) for all t. Due to the structures of the matrices H_x and H_u given in (11), it follows from (14) that the constraint in (11) is satisfied. ▪ The disturbance d in the system (8) satisfies |d| ≤ d_max, where d_max is the maximal acceleration of AV 0. The system robustness is investigated using the concept of RCIS, 26 defined below. Definition 1. Consider the ellipsoidal set Ω(P, ρ) = {ξ : ξᵀ P ξ ≤ ρ} with a matrix P ≻ 0 and a scalar ρ > 0. The set Ω(P, ρ) is an RCIS for the system (8) if for any initial ξ(t_0) ∈ Ω(P, ρ), there exists a control policy u(t) = K ξ(t) such that ξ(t) ∈ Ω(P, ρ) and the pair (ξ(t), u(t)) satisfies the constraint in (11), for all disturbances |d(t)| ≤ d_max and t ≥ t_0.
Definition 1 shows that the system (8) is robust against the disturbance d if Ω(P, ρ) is an RCIS for it. The condition to guarantee this is provided in Lemma 2.
This implies that the Lyapunov function W(t) decreases whenever W(t) > ρ. Hence, the state ξ(t) will remain in the set Ω(P, ρ) = {ξ : ξ(t)ᵀ P ξ(t) ≤ ρ} once it enters the set, which makes Ω(P, ρ) an RCIS for the system (8). The condition (18) can be effectively examined by using an auxiliary inequality 37 with two positive scalars. Substituting W(t) = ξ(t)ᵀ P ξ(t) into this inequality gives (17). ▪ According to Lemmas 1 and 2, the constraint in (11) is satisfied and the system (8) is robust against the disturbance if both (12) and (17) hold. Hence, the policy evaluation in Algorithm 1 is reformulated as the optimization problem in (19), with the cost function J_l defined in (20), a given scalar in (0, 1), and three given positive scalars.
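Behind constraint checks of the kind in (12) lies the standard support-function fact that the maximum of |hᵀξ| over the ellipsoid {ξ : ξᵀPξ ≤ ρ} equals sqrt(ρ hᵀP⁻¹h), so input and state bounds can be verified over the whole invariant set. A sketch (the function names are ours):

```python
import numpy as np

def max_abs_over_ellipsoid(h, P, rho):
    """max |h^T xi| over {xi : xi^T P xi <= rho} = sqrt(rho * h^T P^{-1} h)."""
    return float(np.sqrt(rho * h @ np.linalg.solve(P, h)))

def constraints_hold_on_ellipsoid(H_rows, P, rho, bounds):
    """Check |h^T xi| <= bound for each constraint row h over the ellipsoid."""
    return all(max_abs_over_ellipsoid(h, P, rho) <= b
               for h, b in zip(H_rows, bounds))
```

For example, with P = I and ρ = 1 (the unit ball) and h = (3, 4), the maximum of |hᵀξ| is exactly ||h|| = 5.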
Minimizing the cost J_l promotes a solution that is close to the solution of the traditional policy evaluation in Algorithm 1 without constraints and disturbance. Combining (19a) and (19b) ensures satisfaction of (17), while combining (19b) and (19c) ensures satisfaction of (12). The inequalities in (19d) ensure positive definiteness of the decision variables P_{l+1} and ρ_{l+1}.

3.2.2
Policy improvement and structure-enforcement

After solving P_{l+1} from (19), the non-structural gain K*_{l+1} and the structural gain K^s_{l+1} are computed as in steps 2 and 3 of Algorithm 1 and given as

K*_{l+1} = (R + B̄ᵀ P_{l+1} B̄)^{−1} B̄ᵀ P_{l+1} Ā,    (21a)
K^s_{l+1} = K*_{l+1} • I_T.    (21b)

The obtained gain K^s_{l+1} may not satisfy the constraint in (11). To ensure this, a new gain K_{l+1} that is as close as possible to K^s_{l+1} is generated using Algorithm 2 based on the backtracking line search technique. 38

Model-based control policy learning strategy
The proposed model-based policy learning involves an iterative execution of two steps: (i) policy evaluation by solving the optimization problem in (19), and (ii) policy improvement and structure-enforcement by using (21) and Algorithm 2. The policy learning needs to be implemented online because the optimization problem in (19) depends on the real-time values of ξ(t + 1), ξ(t), and d(t). This is different from the traditional policy iteration in Algorithm 1, which can be implemented fully offline. The matrix B̄ is known and constant, but the system matrix Ā is unknown due to its dependence on the unknown HV parameters. Since both (19) and (21a) use the unknown system matrix Ā, the proposed model-based policy learning is not yet implementable. To address this, a data-driven control policy learning is developed in Section 4 based on the results in this section.

DATA-DRIVEN CONTROL POLICY LEARNING
Building on the model-based policy learning strategy in Section 3, Section 4.1 presents an online data-driven learning strategy for the mixed platoon. Section 4.2 further discusses the extension of the proposed strategy to mixed platoons that (i) have nonlinear AV models and inertial delays, (ii) have more general formations, and (iii) operate under non-steady state.

4.1
The proposed data-driven policy learning strategy

Efficient policy learning requires persistent excitation of the system by injecting a proper perturbation signal. 25 A traditional method is to add noise to the controller of AV n_c for policy learning, 23,24 but this cannot fully excite the platooning error system (8), because AV n_c has no impact on its preceding HVs. Since the entire platoon is influenced by the disturbance d (i.e., the acceleration of AV 0), a small time-varying d is used as the excitation signal in the proposed policy learning. Define T as the number of data points collected for each policy learning step and t_l as the time instant at which the lth learning step is executed. The set of all learning execution time instants is denoted as T_learn = {t : t = lT, l ∈ N}. During the lth learning cycle, that is, within the time interval [t_{l−1} + 1, t_l], the controller u(k) = K_l ξ(k), k ∈ [t_{l−1} + 1, t_l], is applied to AV n_c. The values of ξ(k), u(k) and d(k), k ∈ [t_{l−1} + 1, t_l], are obtained through vehicle onboard sensors (e.g., radar) and V2V communications. At the learning execution time instant t_l, the collected T historical data points are used to construct the datasets X_l, X̃_l, U_l and D_l, with ξ̃(k + 1) = ξ(k + 1) − Ē d(k). By using these datasets, the data-driven policy learning is formulated below.
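The dataset construction at each learning instant can be sketched as follows. This is a sketch assuming a single scalar disturbance channel so that Ē is a vector; the array-shape conventions are ours.

```python
import numpy as np

def build_datasets(xi, u, d, E_bar):
    """Stack T collected samples into data matrices.
    xi: (T+1, n) states, u: (T,) inputs, d: (T,) disturbances, E_bar: (n,).
    xi_tilde(k+1) = xi(k+1) - E_bar*d(k) removes the known disturbance term."""
    X = xi[:-1].T                              # n x T matrix of xi(k)
    X_tilde = (xi[1:] - np.outer(d, E_bar)).T  # n x T matrix of xi_tilde(k+1)
    U = u.reshape(1, -1)                       # 1 x T inputs
    D = d.reshape(1, -1)                       # 1 x T disturbances
    return X, X_tilde, U, D
```

By construction, data generated by (8) satisfy X̃ = Ā X + B̄ U exactly, which is what makes the subsequent learning steps independent of the known disturbance term.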

Policy improvement and structure-enforcement
The policy improvement in (21a) can be equivalently accomplished by solving the optimization problem (24). 25 By using the historical datasets, (24) is reformulated as the data-driven problem (25).
Construct the datasets X_l, X̃_l, U_l and D_l.
if t ∈ T_learn then
Solve P_{l+1} from (23).
Obtain K*_{l+1} using (26).
Compute K^s_{l+1} using (21b).
Stop learning and fix the gain as K = K_{l+1}. The constant term is removed from (25) because it does not affect the optimization results. The optimization problem (25) is solved using the recursive least squares method (26), 39 where k ∈ [t_{l−1} + 1, t_l], F_{t_{l−1}+1} = K_l, and F_{t_l+1} = K*_{l+1}. The initial value G_0 and the positive learning rate are user-specified. After obtaining K*_{l+1}, the policy structure-enforcement is performed as in (21b) and is independent of historical data. Combining (23), (26), (21b) and Algorithm 2 gives the proposed data-driven control policy learning strategy outlined in Algorithm 3. The initial feasible gain K_0 is obtained using Algorithm 1 based on the average HV model under the constraint in (11).
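A generic recursive least squares update of the kind referenced above can be sketched as follows. This is a textbook RLS step for a linear model y ≈ F x, not the paper's exact recursion (26); `lam` is a forgetting factor and `G` plays the role of the auxiliary matrix initialized by G_0.

```python
import numpy as np

def rls_update(F, G, x, y, lam=1.0):
    """One recursive-least-squares step for the model y ~ F @ x.
    F: current estimate, G: inverse-covariance-like matrix."""
    Gx = G @ x
    gain = Gx / (lam + x @ Gx)            # RLS gain vector
    F = F + np.outer(y - F @ x, gain)     # correct estimate by the residual
    G = (G - np.outer(gain, Gx)) / lam    # update the auxiliary matrix
    return F, G
```

Streaming consistent samples (x, y = F_true x) through this update drives F to F_true, which is the mechanism that drives F from K_l toward K*_{l+1} within a learning cycle.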
The property of Algorithm 3 is stated in Theorem 1.

Theorem 1. The proposed data-driven control policy learning in Algorithm 3 establishes a safe and robust mixed platoon when the inter-vehicular distance between each HV and its preceding vehicle lies within the interval (h_s, h_g).
Proof. According to Lemmas 1 and 2, the set Ω(P_l, ρ_l) = {ξ : ξᵀ P_l ξ ≤ ρ_l}, where P_l and ρ_l are determined at the learning time instant t_{l−1}, is an RCIS for the system (8) in the lth learning cycle. Hence, for the time t ∈ [t_{l−1} + 1, t_l], ξ(t) is upper bounded as

||ξ(t)||² ≤ κ_l ρ_l,    (27)

for all |d(t)| ≤ d_max, with κ_l = 1/λ_min(P_l).
Denote l_f as the final learning cycle. According to Algorithm 3, the control gain K is fixed as K_{l_f+1}. This implies that the set Ω(P_{l_f+1}, ρ_{l_f+1}) is an RCIS for the platooning error system (8) for all t > t_{l_f}. Hence, by using (27), for all t ≥ 0, the state ξ(t) is upper bounded as in (28). The relation in (28) is satisfied in both the transients and the steady state. Therefore, the mixed platoon is internally stable and robust against the disturbance d introduced by AV 0. Also, by using Algorithm 3, the conditions (23b) and (23c) are satisfied, meaning that the controller satisfies the constraint in (11). Furthermore, since the platooning errors of the ego AV n_c are robustly stable against the leader disturbance d, the mixed platoon is head-to-tail string stable. 21 ▪ The proof of Theorem 1 shows that applying the obtained control policy at each learning cycle results in a safe and robust mixed platoon. Hence, if the initial controller gain K_0 is chosen to ensure safety of the mixed platoon, then safety is guaranteed during learning. In this article, the initial controller gain K_0 is the LQR gain computed based on an average HV model, as in the robust control design. 21 The simulation results will show that the initial controller gain K_0 is only applied to the ego AV during the first learning cycle, which is short (<2 s). Hence, in practice it is not difficult to ensure safety of the mixed platoon when applying K_0.
The proposed online data-driven policy learning strategy in Algorithm 3 is practically implementable with low computational cost. The optimization problem (23) is convex and can be efficiently solved using off-the-shelf solvers such as MOSEK. 40 The computation in all the other steps involves only matrix manipulations. For a fixed mixed platoon formation, the policy learning is terminated once it converges to the optimal control gain K. The low computational cost will be shown in the simulations in Section 5.

4.2
Extensions of the proposed policy learning strategy

Mixed platoons with nonlinear AV models and inertial delays
The proposed policy learning strategy is developed using the linear model (4) for the AVs. It is shown below that the proposed strategy is also applicable when the AVs are represented by nonlinear models and both the AVs and HVs have inertial delays. The dynamics of AV 0 and AV n_c are represented by the widely used nonlinear model: 6

ṗ_i = v_i,
m_i v̇_i = (η_{T,i}/r_{w,i}) T_i − C_{A,i} v_i² − m_i g f_i,
τ_i Ṫ_i + T_i = T_{des,i},    (29)

where i = 0, n_c; p_i is the vehicle position and v_i is the longitudinal velocity; T_i and T_{des,i} are the actual and desired torques, respectively; η_{T,i} is the mechanical efficiency of the drivetrain and r_{w,i} is the wheel radius; m_i is the vehicle mass and g is the gravitational acceleration; C_{A,i} is the lumped aerodynamic drag coefficient and f_i is the coefficient of rolling resistance; and τ_i is the inertial delay. By applying the exact feedback linearization law

T_{des,i} = (r_{w,i}/η_{T,i}) ( C_{A,i} v_i (2 τ_i v̇_i + v_i) + m_i g f_i + m_i u_i ),

where u_i is the new control signal, (29) is converted into the linear model:

ṗ_i = v_i,   τ_i v̈_i + v̇_i = u_i.    (30)

The OV model (1) with inertial delay is represented by

ḣ_i = v_{i−1} − v_i,
τ_i v̈_i + v̇_i = α_i (V(h_i) − v_i) + β_i (v_{i−1} − v_i).    (31)

Introducing the response delay τ_i to the OV model helps to narrow the gap between the theoretical car-following model and field test data. 41 The proposed policy learning strategy for mixed platoons where the HVs have response delays will be demonstrated through simulations in Section 5.
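Neglecting the torque lag, the feedback-linearization idea reduces to choosing the torque that cancels aerodynamic drag and rolling resistance so that v̇_i = u_i. A minimal sketch with illustrative (assumed) vehicle parameters, not the paper's exact law, which also compensates the first-order torque lag:

```python
G = 9.81  # gravitational acceleration [m/s^2]

def accel(v, T, m=1500.0, eta=0.9, r_w=0.3, C_A=0.6, f=0.015):
    """Longitudinal acceleration of the nonlinear model (lag neglected):
    v_dot = (eta/(m*r_w))*T - (C_A/m)*v^2 - g*f."""
    return (eta * T) / (m * r_w) - (C_A / m) * v ** 2 - G * f

def linearizing_torque(v, u, m=1500.0, eta=0.9, r_w=0.3, C_A=0.6, f=0.015):
    """Torque canceling drag and rolling resistance so that v_dot = u."""
    return (r_w / eta) * (m * u + C_A * v ** 2 + m * G * f)
```

Substituting the second function into the first cancels the nonlinear terms exactly, leaving v̇ = u, which is the double-integrator behavior assumed by the linear design.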
Since the obtained models (30) and (31) are linear, the proposed policy learning strategy is applicable after some trivial modifications. This has not been studied in the existing works on mixed vehicle platoons, [15][16][17][18][19][20][21][22][23][24] or most existing literature on platoons of pure AVs. 6,7 Note that in this case, the policy learning needs the vehicle acceleration data.

4.2.2
Mixed platoons with more general formations

This article focuses on the platoon formation in Figure 1, which can represent general mixed vehicle platoons with different penetration rates of AVs. First, it contains the "1 AV + n HVs," 15,18 "n HVs + 1 AV" 23 and "1 AV + n HVs + 1 AV" 19 mixed platoons as special cases. Second, it also covers the case where there are several successive AVs in the platoon. This is because, for successive AVs, the following AVs are fully controllable and can track the first AV accurately by using a well-established cooperative adaptive cruise controller, for example, the model predictive controller. 30 In this sense, the successive AVs in the mixed platoon can be regarded as a single "virtual AV" and only the controller of the first AV needs to be designed. This will be demonstrated in the simulations in Section 5.
The proposed learning strategy works under different V2V communication topologies and the switching among them (which will be shown in Section 5). Note that the topology changes may also result from the platoon formation changes due to vehicle joining or leaving. Hence, the proposed strategy could be applied to mixed platoons with formation changes.

4.2.3
Mixed platoons under non-steady state

The proposed policy learning strategy is developed under the condition that the HVs operate near the steady state (h*, v*), where h_s < h* < h_g. For completeness, it is worth discussing the applicability of the strategy to the non-steady-state cases h_i ≤ h_s and h_i ≥ h_g, i ∈ I_h. Without loss of generality, only the cases where HV i is not the rear vehicle are discussed below.
When h i ≤ h s holds for HV i, it follows from (2) that V(h i ) = 0 and HV i will brake to avoid collision with the vehicle ahead. The mixed platoon is then split into two sub-platoons: Sub-platoon 1 consists of all vehicles ahead of HV i, and Sub-platoon 2 contains the rest (including HV i). If Sub-platoon 1 contains AVs (apart from AV 0), it is stabilizable by applying the proposed policy learning strategy. If Sub-platoon 1 has no AV (except AV 0), then there is no controller to design and it is out of the scope of this article. If Sub-platoon 2 has AVs and there is enough time to learn new control policies, then Sub-platoon 2 will be steered to the new steady state with v * = 0 without collisions, where the inter-vehicular distances across the sub-platoon may not be the same. If Sub-platoon 2 has no AVs, then all the HVs behind HV i will also brake when h j ≤ h s holds for each HV j.
When h i ≥ h g holds for HV i, it follows from (2) that V(h i ) = v max and HV i will travel at the constant velocity v max . It is reasonable to assume that all the HVs on the mixed platoon have the same maximum velocity v max . If there is enough time for the AVs to learn new control policies by using the proposed strategy, then the mixed platoon will be steered to the new steady state with v * = v max , where the inter-vehicular distances across the platoon may not be the same.

SIMULATION RESULTS
To evaluate performance of the proposed policy learning strategy, two sets of simulations are conducted for a seven-vehicle mixed platoon: the first set demonstrates efficacy of the strategy using a non-aggressive leader (see Section 5.1), and the second set further demonstrates the robustness by considering an aggressive leader and uncertainties in HVs (see Section 5.2). The simulations are conducted in MATLAB and the optimization problem (23) is solved using the toolbox YALMIP 42 with the solver MOSEK. 40

Efficacy of the proposed policy learning strategy
This set of simulations considers the mixed platoon with a non-aggressive leader whose velocity is 15 m/s at the start of the simulation. It is seen from Figure 3 that the control gain K_l converges to the optimal value K_opt within 10 s for all four topologies. By implementing the obtained controller, all seven vehicles reach the same longitudinal velocity after short transients, as shown in Figure 4. The inter-vehicular distances between each pair of successive vehicles all reach the desired distance at steady state, as seen from Figure 5. During the transients, the inter-vehicular distance errors Δh_i, i = 4, 5, 6, always satisfy the imposed constraint |Δh_i| ≤ Δh_max. After a sudden acceleration of AV 0 at 60 s, the errors Δh_i, i ∈ [2, 6], are not larger than Δh_1. This means that the disturbance from AV 0 is not amplified when propagating downstream along the platoon, confirming the platoon robustness and string stability. To compare the platooning performances under different topologies and the random switching among them, the 2-normed platooning error ||ξ(t)|| is used. The value of ||ξ(t)|| quantifies the overall deviation of (h_i, v_i) from the equilibrium point (h*, v*) at time t under each topology. As shown in Figure 6, the values of ||ξ(t)|| are the smallest under Topology 1, the largest under Topology 4, and similar under the other two topologies. Recall that the number of platooning errors used by AV 4 decreases from Topology 1 to Topology 4. Hence, the results in Figure 6 demonstrate that the platoon stability is enhanced by using information from more vehicles, which coincides with the observations in References 18 and 20. The result of ||ξ(t)|| under V2V communication topology switching (from 2 to 4 at 30 s, then to 3 at 60 s, and to 1 at 90 s) shows that the proposed strategy is effective in the presence of topology changes.
Case 2: This case considers the seven-vehicle mixed platoon with different numbers of AVs: 2 AVs (vehicles 0 and 4), 3 AVs (vehicles 0, 4, and 6), 4 AVs (vehicles 0, 2, 4, and 6), and 5 AVs (vehicles 0, 2, 3, 4, and 6). There are no consecutive AVs in the platoon for the first three penetration rates. For these cases, the proposed learning strategy is applied to each AV using the platooning errors of itself, the HVs ahead, and the HVs behind (but ahead of the next AV). For the highest penetration rate, there are three adjacent AVs (vehicles 2, 3, and 4). In this case, the learning strategy is applied to vehicle 2, while vehicles 3 and 4 are equipped with the model predictive controller 30 without considering communication delays.
The platooning errors obtained under different penetration rates of AVs are reported in Figure 7. It is seen that the platooning errors all reach zero after short transients, confirming the efficacy of the proposed strategy in establishing stable mixed platoons. The cases of 2 AVs and 3 AVs have similar platooning errors, because the additional AV (vehicle 6) is at the rear and its control policy has no effect on the vehicles ahead of it. As the number of AVs increases to 4 (and to 5), the convergence of the platooning errors becomes faster. Hence, increasing the penetration rate of AVs makes the platoon easier to stabilize.
Case 3: This case demonstrates the advantages of the proposed learning strategy over the traditional ACC method 43 and the data-driven ADP method. 24 The three methods are applied to the seven-vehicle mixed platoon with two AVs (vehicles 0 and 4). The proposed strategy is implemented as in Case 1 under Topology 1. The traditional ACC for AV 4 consists of a gap controller u_gap(t) and a speed controller u_speed(t). It uses the time-varying safe inter-vehicular distance d_safe(t) = d_still + t_g v_4(t), where d_still is the standstill distance and t_g is the time headway. When the inter-vehicular distance between HV 3 and AV 4 satisfies h_4(t) < d_safe(t), the gap controller u_gap(t) = k_h(h_4(t) − d_safe(t)) + k_v(v_3(t) − v_4(t)) is activated to maintain a safe inter-vehicular distance, where k_h and k_v are constant gains. When h_4(t) ≥ d_safe(t), the speed controller u_speed(t) = min(k_s(v_set − v_4(t)), u_gap(t)) is activated to control AV 4 at the specified velocity v_set, where k_s is a constant gain. In this simulation, the ACC parameters are set following the MATLAB example "Adaptive Cruise Control with Sensor Fusion" as: k_h = 0.2, k_v = 0.4, k_s = 0.5, d_still = 5 m, t_g = 1.5 s, and v_set = 24.5 m/s. The ADP method 24 is adopted to compute the constant gain K_ADP and the control law u(t) = −K_ADP x(t) for AV 4, using the platooning errors of all the vehicles.

FIGURE 8: Platooning errors by implementing different control methods (Case 3)
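The switching ACC law above can be sketched in a few lines. The spacing policy d_safe = d_still + t_g·v and the gains are taken from the text; taking the min of the speed command and the gap-controller output (so the ego vehicle never accelerates harder than the gap law allows) is our reading of the controller, not a verbatim reproduction:

```python
def acc_control(h, v_ego, v_pred, *, k_h=0.2, k_v=0.4, k_s=0.5,
                d_still=5.0, t_g=1.5, v_set=24.5):
    """Switching ACC law for AV 4.

    h: gap to the preceding vehicle (HV 3); v_ego, v_pred: ego and
    predecessor velocities. Returns the commanded acceleration."""
    d_safe = d_still + t_g * v_ego                  # time-headway spacing policy
    u_gap = k_h * (h - d_safe) + k_v * (v_pred - v_ego)
    if h < d_safe:                                  # gap too small: close-gap mode
        return u_gap
    # speed mode: track v_set, capped by the gap-controller output
    return min(k_s * (v_set - v_ego), u_gap)
```

For example, with a 30 m gap at 20 m/s the safe distance is 35 m, so the gap controller is active; with a 40 m gap the speed controller takes over.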
The platooning errors || (t)|| by applying the three methods are shown in Figure 8. The proposed policy learning strategy can steer the platooning errors to zero and establish a stable mixed platoon, while the traditional ACC cannot. Although the ADP method stabilizes the platoon, it cannot steer the platooning errors to zero. This means that the ADP method cannot steer the mixed platoon to the desired equilibrium, leading to larger inter-vehicular distances than the proposed method.

Robustness of the proposed policy learning strategy
This set of simulations demonstrates the robustness of the proposed strategy by considering the seven-vehicle mixed platoon with two AVs (vehicles 0 and 4) in the presence of an aggressive leader velocity profile and uncertainties in human driving behaviors. The leader follows the SFTP-US06 Drive Cycle (see the top plot in Figure 9), which represents aggressive, high-speed and/or high-acceleration driving behaviors with rapid speed fluctuations. To simulate the uncertainties in human driving behaviors, the models of the HVs (vehicles 1, 2, 3, 5, and 6) are assumed to have the following reaction delays 41 : τ_1 = 0.12 s, τ_2 = 0.16 s, τ_3 = 0.15 s, τ_5 = 0.18 s, and τ_6 = 0.2 s, respectively. To capture the randomness of the HVs, a white noise w(t) is added to the HV model parameters α_i and β_i, i = 1, 2, 3, 5, 6, that are used in the first set of simulations. The white noise satisfies |w(t)| < 0.1. All the other parameters are the same as in the first set of simulations, except that h_g = 50 m, v_max = 36 m/s, and u_max = 4 m/s². The proposed learning strategy is implemented using communication Topology 1. As shown in the bottom plot in Figure 9, the inter-vehicular distances across the platoon remain larger than zero. This confirms that the proposed policy learning strategy can ensure the stability and safety of the mixed platoon under the aggressive leader and HV uncertainties.
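The perturbed HV behavior can be sketched as follows. The concrete OV model form (a cosine-shaped optimal velocity function with relative-velocity feedback), the stop distance h_s, the nominal gains α and β, and the constant-speed leader are our assumptions for illustration; the text fixes only h_g = 50 m, v_max = 36 m/s, the delays, and the noise bound 0.1:

```python
import numpy as np

def simulate_hv(v0=15.0, alpha=0.6, beta=0.9, tau=0.12,
                h_s=5.0, h_g=50.0, v_max=36.0,
                dt=0.01, T=100.0, noise=0.1, seed=0):
    """One HV following a constant-speed leader (velocity v0) under an
    assumed OV model with reaction delay tau and white noise on the
    parameters (alpha, beta). Returns the gap and velocity histories."""
    rng = np.random.default_rng(seed)
    d = int(round(tau / dt))                      # delay in time steps
    n = int(round(T / dt))
    h = np.empty(n + 1); v = np.empty(n + 1)
    h[0], v[0] = 30.0, 10.0                       # initial gap and speed

    def V(hh):                                    # optimal velocity function
        if hh <= h_s: return 0.0
        if hh >= h_g: return v_max
        return 0.5 * v_max * (1 - np.cos(np.pi * (hh - h_s) / (h_g - h_s)))

    for k in range(n):
        kd = max(0, k - d)                        # delayed measurement index
        a = alpha + noise * rng.uniform(-1, 1)    # |w(t)| < noise
        b = beta + noise * rng.uniform(-1, 1)
        acc = a * (V(h[kd]) - v[kd]) + b * (v0 - v[kd])
        v[k + 1] = v[k] + dt * acc
        h[k + 1] = h[k] + dt * (v0 - v[k])
    return h, v
```

Under these assumed parameters the follower settles at the leader speed with a positive gap, mirroring the collision-free behavior reported in Figure 9.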
The head-to-tail string stability is further verified based on the closed-loop transfer functions T_i from the leader acceleration a_0 to the acceleration a_i of the ith follower, i = 1, 2, ..., 6. If the magnitude of T_i is not larger than 1, then the deviation of the leader velocity at each sampling step (i.e., a_0) is not amplified when propagating to the ith follower. 44 The magnitudes of all the transfer functions are reported in Figure 10; none of them exceeds 1 (0 dB). Hence, the deviations of the leader velocity are not amplified when propagating through the entire platoon, which means that the established mixed vehicle platoon is head-to-tail string stable.
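A frequency-sweep check of this magnitude condition is easy to script. The transfer function below is not the paper's closed-loop T_i; it is the textbook velocity-perturbation transfer function of a single linearized OV follower (α, β the model gains, Vp the slope of the optimal velocity function at equilibrium), used here only to illustrate the test:

```python
import numpy as np

def max_gain(alpha, beta, Vp, omega=np.logspace(-2, 2, 2000)):
    """Peak magnitude over a frequency grid of
    G(s) = (beta*s + alpha*Vp) / (s**2 + (alpha + beta)*s + alpha*Vp),
    the velocity-perturbation transfer function of a linearized OV
    follower. A peak not exceeding 1 indicates string stability."""
    s = 1j * omega
    G = (beta * s + alpha * Vp) / (s**2 + (alpha + beta) * s + alpha * Vp)
    return np.max(np.abs(G))
```

For this G, string stability holds exactly when α + 2β ≥ 2Vp, so (α, β) = (0.6, 0.9) with Vp = 1 passes the check while (0.6, 0.5) fails it, which the sweep confirms.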

FIGURE 9: The leader velocity v_0 and the inter-vehicular distances
FIGURE 10: Bode diagrams (magnitudes) of the transfer functions from a_0 to a_i, i = 1, 2, 3, 4, 5, 6

CONCLUSION
An online data-driven strategy is proposed to learn the control policies of the AVs in a mixed vehicle platoon with unknown HV parameters. The proposed learning strategy incorporates the constraints of control input, inter-vehicular distance errors, and V2V communication topology. The learned control policy can be implemented using a prescribed V2V communication topology and establishes a safe, robust, and stable mixed platoon. The simulation results demonstrate that the proposed learning strategy is effective under different communication topologies and robust against an aggressive leader and uncertainties in human driving behaviors. The proposed strategy will be further developed to guarantee platoon safety and robustness in the presence of platoon formation/deformation and disturbances from surrounding vehicles. Since platoons are known to be beneficial for fuel saving, it is also worth extending the proposed strategy to ecological mixed vehicle platooning by reducing the fuel consumption of the platoon as a whole with the help of a high-level velocity planner.

CONFLICT OF INTEREST
There is no conflict of interest for this article.

DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no new data were created or analyzed in this study.