🤖 AI Summary
Standard Raft uses fixed election parameters (heartbeat interval and election timeout), which leads to delayed failure detection and longer out-of-service (OTS) time in dynamic networks. To address this, we propose a lightweight, online tuning mechanism that adapts these parameters at runtime without modifying Raft's core logic or introducing extra communication overhead. Our approach builds a network health metric from real-time heartbeat measurements, including round-trip time (RTT) and packet loss rate, and uses a feedback control law to adjust the election parameters. Experimental results show that, compared to vanilla Raft, our method reduces leader failure detection time by 80% and OTS time by 45%, while maintaining ≥99.9% availability even under highly volatile network conditions. This work is the first to apply closed-loop feedback control to Raft parameter optimization, improving responsiveness and robustness with zero protocol modifications.
📝 Abstract
Raft is a leader-based consensus algorithm that implements State Machine Replication (SMR), which replicates the service state across multiple servers to enhance fault tolerance. In Raft, each server plays one of three roles: leader, follower, or candidate. The leader receives client requests, determines the processing order, and replicates them to the followers. When the leader fails, the service must elect a new leader before it can continue processing requests, and during this period the service experiences out-of-service (OTS) time. The OTS time is directly influenced by election parameters, such as the heartbeat interval and the election timeout. However, traditional approaches, such as Raft, often struggle to tune these parameters effectively, particularly under fluctuating network conditions, leading to increased OTS time and reduced service responsiveness. To address this, we propose Dynatune, a mechanism that dynamically adjusts Raft's election parameters based on network metrics such as round-trip time and packet loss rate, measured via heartbeats. By adapting to changing network environments, Dynatune significantly reduces leader failure detection time and OTS time without altering Raft's core mechanisms or introducing additional communication overhead. Experimental results demonstrate that Dynatune reduces leader failure detection time and OTS time by 80% and 45%, respectively, compared with Raft, while maintaining high availability even under dynamic network conditions. These findings confirm that Dynatune effectively enhances the performance and reliability of SMR services in various network scenarios.
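To make the idea of tuning election parameters from heartbeat measurements more concrete, the following is a minimal Python sketch. It is not Dynatune's actual control law, which the abstract does not specify. The function name `tune_election_params`, the parameters `safety_factor`, `target_miss_prob`, and `min_heartbeat_ms`, and the formulas themselves are all assumptions chosen for illustration: the election timeout is set to the mean RTT plus a jitter margin, and the heartbeat interval is chosen so that enough heartbeats fit inside one timeout to make a spurious election unlikely at the measured loss rate.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class ElectionParams:
    heartbeat_interval_ms: float
    election_timeout_ms: float


def tune_election_params(rtt_samples_ms, loss_rate, safety_factor=2.0,
                         target_miss_prob=1e-3, min_heartbeat_ms=10.0):
    """Illustrative heuristic only (an assumption, not the paper's formula).

    rtt_samples_ms: recent heartbeat round-trip times in milliseconds.
    loss_rate: measured heartbeat loss rate in [0, 1).
    """
    if not rtt_samples_ms:
        raise ValueError("need at least one RTT sample")
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")

    mean_rtt = statistics.fmean(rtt_samples_ms)
    jitter = statistics.pstdev(rtt_samples_ms)

    # Timeout tracks the observed delay plus a margin for jitter.
    timeout = mean_rtt + safety_factor * jitter

    # Number of heartbeats per timeout so that losing all of them
    # has probability at most target_miss_prob (loss_rate ** k).
    if loss_rate == 0.0:
        k = 1
    else:
        k = max(1, math.ceil(math.log(target_miss_prob) / math.log(loss_rate)))

    # Do not send heartbeats faster than the configured floor; stretch the
    # timeout if needed so that k heartbeats still fit inside it.
    heartbeat = max(min_heartbeat_ms, timeout / k)
    timeout = max(timeout, heartbeat * k)
    return ElectionParams(heartbeat, timeout)


if __name__ == "__main__":
    # Hypothetical measurements: stable 100 ms RTT with 50% heartbeat loss.
    print(tune_election_params([100.0, 100.0], loss_rate=0.5))
```

In this sketch, a higher loss rate shortens the heartbeat interval, while a higher or more variable RTT lengthens the election timeout. A real deployment would re-run such a function periodically over a sliding window of heartbeat measurements.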