Low Reliable and Low Latency Communications for Mission Critical Distributed Industrial Internet of Things

Achieving ubiquitous ultra-reliable low latency consensus in centralized wireless communication systems can be costly and hard to scale up. The consensus mechanism, which has been widely utilized in distributed systems, can provide fault tolerance for critical consensus decisions even when the reliability of individual communication links is relatively low. In this article, a widely used consensus mechanism, Raft, is introduced to the Industrial Internet of Things (IIoT) to achieve ultra-reliable and low latency consensus, and the consensus reliability is investigated in terms of the number of nodes and the link transmission reliability. We propose a new concept, Reliability Gain, to capture the linear relationship between consensus reliability and communication link transmission reliability. We also find that low consensus latency and high consensus reliability are conflicting goals. These conclusions can guide the deployment of the Raft protocol in distributed IIoT systems.

Ultra-reliable and low latency communications (URLLC) has been recognized as a key feature of 5G to meet the stringent requirements of industrial or personal applications [3]. According to [4], in some critical IIoT application scenarios, URLLC needs to provide an end-to-end latency lower than 1 ms and an exceedingly high reliability of more than $1-10^{-9}$. Centralized communication systems are normally deployed in industry sectors, which requires the connected IoT nodes to transmit their data to a central control station, where the critical decisions are made and sent back to actuators for processing. However, a large number of new generation mobile IIoT applications are discretely distributed in their topology, which means a centralized scheme can hardly be implemented in these applications. Additionally, a centralized system suffers from an ever-present single point of failure. Moreover, in a centralized IIoT system, the IIoT nodes can only synchronize information with the central station, which means the system's reliability relies heavily on the central station and can be limited by the worst connection between a node and the central station. Any wireless communication link failure can cut off the synchronization, which may cause disaster or loss of human life in extreme cases. Finally, a centralized communication system can be very costly, since it is well known that, with given spectrum resources, high communication reliability conflicts with low latency. The cost can become unaffordable when the network scales up [5], e.g., on a busy road in autonomous driving scenarios or in a smart factory with a large number of mobile robots. Therefore, from the perspective of algorithms and protocols, an alternative low-cost solution should be investigated to improve the reliability and latency of the overall network's critical decisions when individual link transmission reliability is low.
Distributed systems can achieve such stringent requirements with relaxed communication link reliability by using a consensus mechanism (CM) to reach the necessary agreement on a single state of the network. As one of its most successful recent applications, CM is a key element of blockchain networks, where it ensures synchronization among the distributed nodes [6]. Unlike the traditional centralized communication system, which requires all communication links to be reliable under a time delay constraint in order to make correct decisions for IIoT, CMs in a distributed system can tolerate a certain ratio of link transmission failures, i.e., they can achieve highly reliable critical decisions over communication links of relatively low reliability. Raft [7] is a typical crash-tolerant CM of this kind, used to manage log replication. However, up to this point, CM (and corresponding applications such as blockchain) has primarily been designed for stable wired communication environments. Unlike wired networks, wireless communication channels are unstable, scarce, and prone to interference. In particular, the original Raft considers that when a node failure happens, all associated communication links are faulty. With dynamic wireless communication channels, however, a node may work well while some of the links connected to it are unstable. It is therefore worthwhile to derive the reliability in the presence of link failures in order to adopt distributed IIoT in wireless environments. Moreover, it is unclear how such a distributed protocol affects the overall delay in the wireless environment. These concerns about Raft consensus should be investigated to guide the deployment of CM in mission critical distributed IIoT applications.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
For the first time, this article discusses how to use Raft to achieve highly reliable consensus for mission critical distributed IIoT, where the communication links can have low latency but low reliability. We first introduce a Raft CM link failure model to analyze the mathematical relationship between communication link reliability and system decision reliability. Based on this relationship, the letter proposes an essential concept called Reliability Gain, which captures the mathematical relationship between consensus reliability and communication link reliability. We also find that the Reliability Gain is approximately linear in the number of nodes. Additionally, the derivation reveals that consensus reliability conflicts with low delay, which provides design guidance for consensus in distributed IIoT systems.

II. RAFT PROTOCOL IN IIOT
Section II introduces the concept of a distributed system based on Raft consensus. A Raft network is composed of a number of consensus nodes, as shown in Fig. 1. In every term, the leader node packs the commands into log entries and continuously replicates these entries to all followers through downlink communications; upon successful reception of the request, the followers confirm and send the log back to the leader through uplink communications. Consensus nodes can be actuators themselves or can work as a group to provide consensus to the actuator(s) in the IIoT. The actuators can only take action if the critical decision is a consensus of the CM network. In what follows, we assume the consensus nodes are actuators for simplicity. A Raft consensus is successful when more than 50% of all followers receive the log entries and send their confirmation back to the leader within one term. In realistic cases, the required ratio of followers with successful communication is flexible and should fit the requirements of the scenario. Thus, communication plays a key role in such a system and determines the consensus performance. The followers/actuators that cannot receive log entries or send back the confirmation because of a communication link failure need to synchronize their state through other normal followers/actuators. Eventually, all actuators obtain the correct log state and can process the critical decisions made by consensus in the distributed system to accomplish complex manufacturing tasks. The leader can be selected by simple rotation or according to a criterion that maximizes system performance (e.g., selecting the node with the best communication connection to the others), which is not the focus of this article. Fig. 1 compares a centralized system with a distributed system with Raft.
In the centralized communication system, any communication link failure related to the IIoT devices can cause the critical decision to fail to reach the actuator. In the distributed communication system with Raft, however, consensus can still be reached even though some communication links are unstable and cannot provide reliable communication with the leader. In addition, a normal follower with a complete log can serve as a backup for synchronization, which guarantees that all followers can eventually obtain the consensus state.
Moreover, the Raft protocol does not address the effects of potentially malicious nodes on the distributed network [8]. Autonomous driving and other critical IIoT applications fit this case, because the risk of malicious users in such systems is low or the nodes are under high-security protection. Even if malicious nodes cannot be ignored, similar consensus mechanisms such as Practical Byzantine Fault Tolerance (PBFT) [9] can be adopted, and the following derivations can be extended accordingly.
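The exchange described in this section can be sketched as a small Monte Carlo experiment. The sketch below is a minimal illustration, not the article's simulation: it assumes every link succeeds independently with the same probability, counts a follower only if its downlink and its uplink both succeed, and takes the number of required followers to be `n_nodes // 2` (which equals (N-1)/2 for odd N). The function names `simulate_term` and `estimate_success_rate` are hypothetical.

```python
import random


def simulate_term(n_nodes, p_link, rng):
    """Simulate one Raft term and report whether the leader gathers enough confirmations."""
    followers = n_nodes - 1
    # Assumption: followers that, together with the leader, form a majority of nodes.
    needed = n_nodes // 2
    confirmations = 0
    for _ in range(followers):
        # Downlink: the leader replicates the log entry to this follower.
        downlink_ok = rng.random() < p_link
        # Uplink: only a follower that received the entry can confirm it.
        uplink_ok = downlink_ok and rng.random() < p_link
        if uplink_ok:
            confirmations += 1
    return confirmations >= needed


def estimate_success_rate(n_nodes, p_link, trials=20000, seed=0):
    """Estimate the consensus success rate by repeating independent terms."""
    rng = random.Random(seed)
    wins = sum(simulate_term(n_nodes, p_link, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    for n in (5, 11, 21):
        print(n, estimate_success_rate(n, 0.9))
```

A plain Monte Carlo estimate of this kind can only resolve failure rates down to roughly the inverse of the number of trials, so very small failure rates need an analytical expression instead.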

III. RAFT RELIABILITY AND LATENCY ANALYSIS
In this section, we first establish a wireless communication model with the Raft protocol to analyze the consensus reliability. Then, we investigate the properties of Raft in terms of consensus reliability and latency.

A. Reliability of Communication System With Raft
Consider a distributed system with Raft CM consisting of $N$ nodes. Theoretically, a reliable critical decision requires that at least $\frac{N-1}{2}$ followers receive the log entries from the leader and send their confirmation messages back to the leader to commit the log replication. In other words, the number of nodes with both successful downlink and uplink transmissions should be more than half of the nodes (i.e., $\frac{N-1}{2}$ followers and the leader) to complete the consensus process. It is worth mentioning that 50% is the fault tolerance of Raft [7]; however, this value can be higher in an environment with unstable communication links. Nevertheless, this value does not affect our derivations.
We assume that the communication link success rate is $P_l$. Mathematically, the consensus success rate of the system, $P_C$, accumulates the probability of every case in which the consensus process succeeds, and it takes the form of two summations of binomial probabilities. The exact probability of successful consensus $P_C$ in the distributed IIoT communication system is given by equation (1), where the symbol $\binom{x}{y}$ denotes the number of combinations "$y$ choose $x$", with $y \ge x$ and both $x$ and $y$ being non-negative integers. The first summation represents the probability that the majority of followers download the log entry from the leader. The second summation equals the probability that the majority of followers upload their confirmation back to the leader. Because the downlink transmission happens before the uplink transmission in Raft, the number of successful uplink transmissions is never larger than the number of successful downlink transmissions. Therefore, the probability of a successful consensus term is the product of these two summations. It is worth mentioning that the consensus success rate $P_C$ increases monotonically with the number of nodes $N$. Although this property is not directly evident from equation (1), the simplification in Section III-B and the simulation results in Section IV show it explicitly.
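As a reading aid, one expression that matches this description is sketched below. It is an assumption inferred from the text, not a verbatim copy of equation (1): the outer sum runs over the number $i$ of followers that receive the log entry, the inner sum runs over the number $j \le i$ of those followers that also return a confirmation, and the combination symbol follows the convention stated above (smaller index on top).

```latex
P_C \;=\; \sum_{i=\frac{N-1}{2}}^{N-1} \binom{i}{N-1}\, P_l^{\,i}\,(1-P_l)^{N-1-i}
\left[\, \sum_{j=\frac{N-1}{2}}^{i} \binom{j}{i}\, P_l^{\,j}\,(1-P_l)^{\,i-j} \right]
```

Under this reading, and with independent links, each follower completes both transmissions with probability $P_l^2$, so the expression equals the probability that a binomial random variable with $N-1$ trials and success probability $P_l^2$ is at least $\frac{N-1}{2}$.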
Remark 1: According to equation (1), to satisfy the most stringent reliability requirement in IIoT, i.e., a consensus failure rate $1-P_C$ of less than $10^{-9}$, the number of nodes $N$ should be no less than 69, 31, 12, and 5 when the link success rate $P_l$ is 90%, 95%, 99%, and 99.9%, respectively.

B. Reliability Gain
The remark in Section III-A indicates that even if the link success rate $P_l$ is undesirable, the consensus success rate of Raft can still be raised to the standard required by IIoT, and that the number of nodes $N$ influences this reliability improvement. Therefore, we introduce a parameter called Reliability Gain (which can also be interpreted as a reliability amplification factor) to represent the quantitative relationship between consensus reliability and communication link reliability.
Theorem 1: When the link success rate $P_l$ is reasonably large, the consensus failure rate $1-P_C$ has a linear relationship with the link failure rate $1-P_l$ in the logarithmic domain,
$$\log(1-P_C) = k\,\log(1-P_l) + h, \tag{2}$$
where the Reliability Gain is $k = \frac{N+1}{2}$ and the intercept is $h = \log\binom{\frac{N-3}{2}}{N-1} + \Delta h$, with $\Delta h$ being given in Table I.
Proof: See Appendix A.
From the definition of the Reliability Gain $k$, we find that the consensus failure rate $\log(1-P_C)$ and the link failure rate $\log(1-P_l)$ are linearly related when the number of nodes $N$ is constant. With fixed link reliability, increasing the number of nodes raises the consensus reliability, which confirms that the consensus success rate $P_C$ increases monotonically with the number of nodes $N$. Compared to equation (1), equation (2) expresses a simple relationship between link reliability and consensus reliability. Thus, it can provide a valid guide for real Raft CM deployment in IIoT systems. Table I shows the estimated $\Delta h$ when the number of nodes $N$ increases from 5 to 19, where $\Delta h$ remains constant for a fixed number of nodes $N$. The simulation results in Section IV show that the consensus failure rate $\log(1-P_C)$ satisfies the linear relationship in equation (2) even when $P_l$ is as low as 90%.
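The slope predicted by Theorem 1 can be checked numerically. The sketch below is illustrative only and rests on stated assumptions: it evaluates the failure probability with a two-stage binomial model (downlink, then uplink, each link independent with success rate `p_link`), requires `n_nodes // 2` confirming followers, and measures the slope of $\log(1-P_C)$ against $\log(1-P_l)$ between two high link success rates. The names `consensus_failure` and `empirical_gain` are hypothetical, and the intercept is not checked because the $\Delta h$ values of Table I are not used here.

```python
from math import comb, log10


def consensus_failure(n_nodes, p_link):
    """Failure probability of one term under the assumed two-stage binomial model."""
    followers = n_nodes - 1
    needed = n_nodes // 2  # assumption: equals (N-1)/2 for odd N
    total = 0.0
    for i in range(followers + 1):
        # Probability that exactly i followers receive the log entry (downlink).
        p_down = comb(followers, i) * p_link**i * (1 - p_link) ** (followers - i)
        # Probability that fewer than `needed` of those i followers confirm (uplink).
        p_short = sum(
            comb(i, j) * p_link**j * (1 - p_link) ** (i - j)
            for j in range(min(i, needed - 1) + 1)
        )
        total += p_down * p_short
    return total


def empirical_gain(n_nodes, p_hi=0.9999, p_lo=0.999):
    """Slope of log10(1 - P_C) versus log10(1 - P_l) between two link success rates."""
    rise = log10(consensus_failure(n_nodes, p_hi)) - log10(consensus_failure(n_nodes, p_lo))
    run = log10(1 - p_hi) - log10(1 - p_lo)
    return rise / run


if __name__ == "__main__":
    for n in (5, 11, 19):
        print(n, (n + 1) / 2, empirical_gain(n))
```

Summing the failure cases directly, rather than computing one minus the success probability, avoids loss of precision when the failure rate is many orders of magnitude below one.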

C. Relationship Between Latency and Reliability
In this subsection, we show that high consensus reliability and low consensus latency are conflicting goals. A wireless communication model that characterizes the packet error probability of short-packet transmissions in URLLC [10] is used to relate the consensus success rate $P_C$ to the consensus latency $T$. We assume that $T$ is caused only by the downlink and uplink transmission delays, so that the impact of communication on the overall consensus latency can be shown. This model is an illustrative case, and other models can be used without affecting the main conclusion of the letter. According to [10], the link failure rate $1-P_l$ used in equations (1) and (2) can be written as a function of $T$, as in equation (3), where $B$ is the available spectrum bandwidth, and $R$ and $C$ are the uplink or downlink transmission rate and the channel capacity, respectively. Note that we assume both uplink and downlink transmissions are time-divided, i.e., given the overall consensus delay $T$, each transmission is allotted an interval $t = \frac{T}{2N}$, since there are $N$ transmissions in each of the uplink and the downlink. Therefore, with a constant $N$, increasing the consensus delay $T$ provides more time $t$ for each link transmission, which intuitively reduces the link failure rate $1-P_l$. By substituting equation (3) into equation (1) or (2), we obtain the relationship between the consensus failure rate $1-P_C$ and the latency $T$. The conflict between consensus reliability and low time delay can be proved mathematically by calculating the derivative, with respect to $T$, of the argument $Q$ of the Q-function $f_Q(\cdot)$.
The derivative $\frac{\partial Q}{\partial T}$ is always positive, which means that $Q$ increases monotonically with $T$. Since the Q-function $f_Q(\cdot)$ decreases monotonically with $Q$, and $P_C$ increases monotonically with $P_l$, the consensus failure rate $1-P_C$ decreases as the consensus delay $T$ grows; that is, high consensus reliability and low consensus latency are contradictory.
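The effect of the delay budget on reliability for a fixed number of nodes can be illustrated numerically. The sketch below is a hedged example, not the article's equation (3): it assumes the commonly used normal approximation for short-packet transmission, in which the link error is a Q-function of $(C-R)\sqrt{tB/V}$ with channel dispersion $V$, and it reuses the assumed two-stage binomial model for the consensus failure rate. The default bandwidth, SINR, and rate fraction are the values quoted in Section IV; the base-2 logarithm, the dispersion term, and all function names are assumptions.

```python
from math import comb, e, erfc, log2, sqrt


def q_function(x):
    """Gaussian tail probability."""
    return 0.5 * erfc(x / sqrt(2.0))


def link_failure(t_seconds, bandwidth_hz=18e3, sinr_db=10.0, rate_fraction=0.5):
    """Assumed normal approximation for short packets (not necessarily equation (3))."""
    sinr = 10.0 ** (sinr_db / 10.0)
    capacity = log2(1.0 + sinr)  # assumed base-2 logarithm, bit/s/Hz
    rate = rate_fraction * capacity
    dispersion = (1.0 - (1.0 + sinr) ** -2) * log2(e) ** 2  # assumed channel dispersion
    channel_uses = t_seconds * bandwidth_hz
    return q_function((capacity - rate) * sqrt(channel_uses / dispersion))


def consensus_failure(n_nodes, p_link):
    """Failure probability of one term under the assumed two-stage binomial model."""
    followers = n_nodes - 1
    needed = n_nodes // 2
    total = 0.0
    for i in range(followers + 1):
        p_down = comb(followers, i) * p_link**i * (1 - p_link) ** (followers - i)
        p_short = sum(
            comb(i, j) * p_link**j * (1 - p_link) ** (i - j)
            for j in range(min(i, needed - 1) + 1)
        )
        total += p_down * p_short
    return total


def failure_vs_delay(n_nodes, total_delay_s):
    """Consensus failure rate when the delay T is split into 2N equal link intervals."""
    t = total_delay_s / (2 * n_nodes)
    return consensus_failure(n_nodes, 1.0 - link_failure(t))


if __name__ == "__main__":
    for delay_ms in (5, 10, 20):
        print(delay_ms, failure_vs_delay(11, delay_ms / 1000.0))
```

With the number of nodes held fixed, a larger delay budget gives each link more channel uses, so both the link failure rate and the consensus failure rate fall in this sketch.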
According to the conclusion in Section III-A, the consensus success rate $P_C$ increases monotonically with the number of nodes. However, for a fixed consensus delay $T$, increasing the number of nodes also shortens the transmission time $t = \frac{T}{2N}$ of each link and thus reduces $P_l$, which may lead to a less reliable consensus according to equation (1) or (2). Thus, an optimal $N$ that maximizes consensus reliability is expected to exist.

IV. SIMULATION RESULTS AND DISCUSSION
Simulations are conducted to validate the proposed consensus communication model and its derivations. The bandwidth $B$ available for link transmission is set to 18 kHz, and the signal-to-interference-plus-noise ratio (SINR) is set to 10 dB. The uplink and downlink transmission rate $R$ is assumed to be 50% of the channel capacity, which is calculated as $C = \log(1+\mathrm{SINR})$. Fig. 2 illustrates the consensus success rate in the Raft link failure model for an increasing number of nodes $N$. The consensus failure rate $1-P_C$ declines as the number of nodes increases for relatively low communication link success rates $P_l$ = 90%, 95%, 99%, and 99.9%. The simulated results (asterisks) of the consensus failure rate $1-P_C$ overlap with the analytical curves (lines) when the link success rate $P_l$ is 90% and 95%, which validates equation (1). The analytical curves show that the consensus success rate $P_C$ increases monotonically with the number of nodes $N$. Because the consensus failure rate is extremely low for larger $P_l$ and the computing power available to MATLAB is limited, the simulated consensus failure rate $1-P_C$ cannot be completely presented in Fig. 2 when $P_l$ is 99% and 99.9%. Fig. 3 shows how the consensus reliability varies with the link success rate $P_l$. The analytical result represents the original consensus failure rate $1-P_C$ of equation (1) on a logarithmic scale. The simplified result represents the consensus failure rate $\log(1-P_C)$ of equation (2). The analytical lines and the simplified lines match closely, which supports the accuracy of the linear relation in equation (2). The slopes of the lines are equal to the Reliability Gain $k = \frac{N+1}{2}$ and become steeper as the number of nodes $N$ rises. This result suggests that the simplified model can be used to guide the real deployment of Raft in distributed systems.
The simulation in Fig. 4 reveals the conflict between consensus reliability and low consensus delay $T$. The four curves correspond to $N$ = 10, 15, 20, and 30 nodes. All curves in Fig. 4 show that, for a constant number of nodes, the consensus failure rate $1-P_C$ decreases as the time delay $T$ increases, which confirms the conflict between consensus reliability and low latency. The consensus failure rate at $N = 15$ drops more sharply with increasing time delay than that at $N = 10$, which causes the two curves to intersect. This implies that consensus reliability is not monotonic in the number of nodes $N$ once the effect of the consensus delay $T$ on link transmission reliability is considered. Therefore, this phenomenon is investigated further. Fig. 5 shows how the consensus failure rate $1-P_C$ changes with the number of nodes $N$ for a constant consensus latency $T$. The curves show that the consensus failure rate is not monotonic in the number of nodes and that there is a point of maximum consensus reliability. By modifying the time delay of the consensus system, this maximum shifts to a higher reliability value attained at a larger number of nodes. The reason for this phenomenon is that the consensus failure rate follows the monotonicity of equation (1) when $N$ is small; when $N$ becomes large, because the communication resource (i.e., the communication time $T$) is limited, the time available for each link transmission is reduced, and $1-P_l$ increases dramatically with $N$ according to the property of the Q-function in equation (3), which causes $1-P_C$ to rise. Therefore, the number of nodes $N$ has both positive and negative effects on consensus reliability.
The shift of the curves indicates that consensus reliability can be optimized by allocating communication resources so as to meet the requirements of different IIoT scenarios.

V. CONCLUSION
The analysis of consensus reliability in the distributed IIoT system with Raft shows that, even with low communication link reliability, Raft consensus can achieve ultra-high reliability by increasing the number of nodes. The relationship between consensus reliability and communication link reliability is expressed in a linear form for simplicity. Meanwhile, the results show that low latency conflicts with high consensus reliability in Raft. Therefore, this article provides a valuable guide for the design and deployment of Raft in distributed systems.

APPENDIX A PROOF OF EQUATION (2)
In the summations of binomial probabilities that make up the consensus success rate $P_C$, the largest term dominates the summation when the link success rate $P_l$ is reasonably large. Thus, the inner summation can be replaced by its largest term for simplification, which gives equation (5). The result of equation (5) can be substituted into equation (1). According to the cumulative distribution function of the binomial distribution, the consensus failure rate $1-P_C$ is then given by equation (6). The largest term in the summation of the consensus failure rate in equation (6) is also dominant. Since $P_l$ is reasonably large, the summation in equation (6) can be simplified in the same way as in equation (5). When the consensus failure rate $1-P_C$ is converted to logarithmic form, it corresponds to the linear relation in equation (2),
$$\log(1-P_C) = \frac{N+1}{2}\log(1-P_l) + \log\binom{\frac{N-3}{2}}{N-1} + \Delta h,$$
where $\Delta h$ is the correction to the intercept in equation (2) that minimizes the error between equation (1) and equation (2).
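The origin of the slope and the combinatorial intercept can be traced in a short worked form. The sketch below is an assumption-based reading, not the article's own equations (5) and (6): it takes $N$ to be odd, assumes each follower completes both transmissions independently with probability $P_l^2$, keeps only the dominant term of the failure sum, and writes the combination symbol with the smaller index on top, as in the main text.

```latex
\begin{aligned}
1 - P_C &= \sum_{i=0}^{\frac{N-3}{2}} \binom{i}{N-1} \left(P_l^{2}\right)^{i} \left(1-P_l^{2}\right)^{N-1-i} \\
&\approx \binom{\frac{N-3}{2}}{N-1}\, P_l^{\,N-3} \left(1-P_l^{2}\right)^{\frac{N+1}{2}}, \\[4pt]
\log(1-P_C) &\approx \frac{N+1}{2}\log(1-P_l) + \log\binom{\frac{N-3}{2}}{N-1}
+ \underbrace{\frac{N+1}{2}\log(1+P_l) + (N-3)\log P_l}_{\text{varies slowly as } P_l \to 1}.
\end{aligned}
```

For a fixed $N$, the bracketed remainder is nearly constant when $P_l$ is close to 1, which is in line with a correction term of the kind that $\Delta h$ provides for each $N$.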