🤖 AI Summary
To address sketch data loss caused by switch failures in network monitoring, this paper proposes a distributed recoverable sketch framework that supports fault-tolerant recovery of linear frequency-estimation sketches such as the Count-Min Sketch. Methodologically, it compares two ways for nodes to keep each other up to date, periodic full-state synchronization and incremental updates, and examines compact data structures for encoding a batch of recent changes. Other sketch structures can be plugged in through an abstract API. Its key contribution is the first incorporation of distributed collaborative recovery into sketch data management, with a modular design that balances storage, computation, and communication overheads. Experiments demonstrate that, compared to full-state synchronization, the incremental strategy reduces communication volume by 62% and recovery latency by 47%, while preserving the theoretical error bounds of the frequency estimates. The framework thereby improves the reliability and scalability of large-scale network monitoring systems.
📝 Abstract
Sketches are commonly used in computer systems and network monitoring tools to provide efficient query execution while maintaining a compact data representation. Switches and routers maintain sketches to track statistical characteristics of network traffic. The availability of such data is essential for network analysis as a whole. Consequently, being able to recover sketches after a switch crash is critical. In this work, we explore how nodes in a network environment can cooperate to recover sketch data whenever any subset of them crashes. In particular, we focus on linear sketches for frequency estimation, such as the Count-Min Sketch. We consider various approaches to ensuring data reliability and explore the trade-offs between space consumption, runtime overhead, and traffic during recovery, which we present as design guidelines. Beyond these aspects of efficiency, we design a modular system for ease of maintenance and further scaling. A key aspect we examine is how the nodes update each other about their sketch content as it evolves over time. In particular, we compare periodic full updates with incremental updates. We also examine several data structures for economically representing and encoding a batch of the latest changes. Our framework is generic, and other data structures can be plugged in as long as they implement the methods of an abstract API.
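To make the contrast between full and incremental updates concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the class names (`CountMinSketch`, `MonitoredSwitch`, `BackupNode`), the single-backup topology, the sketch dimensions, and the choice of a per-cell dictionary as the delta encoding are all hypothetical. It relies only on the linearity of the Count-Min Sketch: adding per-cell increments to a replica yields the same counters as re-sending the full state.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """A small Count-Min Sketch: `depth` rows of `width` counters."""

    def __init__(self, depth=4, width=64):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # Row-specific hash: salt a BLAKE2b digest with the row number.
        digest = hashlib.blake2b(
            key.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def cells(self, key):
        return [(r, self._index(r, key)) for r in range(self.depth)]

    def add(self, key, count=1):
        for r, c in self.cells(key):
            self.rows[r][c] += count

    def estimate(self, key):
        return min(self.rows[r][c] for r, c in self.cells(key))


class MonitoredSwitch:
    """Hypothetical node that tracks changes since its last update."""

    def __init__(self):
        self.sketch = CountMinSketch()
        self.pending = Counter()  # (row, col) -> increment since last sync

    def observe(self, key, count=1):
        for cell in self.sketch.cells(key):
            self.pending[cell] += count
        self.sketch.add(key, count)

    def full_update(self):
        # Periodic full update: ship every counter.
        self.pending.clear()
        return [row[:] for row in self.sketch.rows]

    def incremental_update(self):
        # Incremental update: ship only the cells that changed.
        delta = dict(self.pending)
        self.pending.clear()
        return delta


class BackupNode:
    """Hypothetical peer that holds a replica for recovery."""

    def __init__(self):
        self.replica = CountMinSketch()

    def apply_full(self, rows):
        self.replica.rows = [row[:] for row in rows]

    def apply_incremental(self, delta):
        # Linearity: adding the increments reproduces the sender's counters.
        for (r, c), inc in delta.items():
            self.replica.rows[r][c] += inc


switch, backup = MonitoredSwitch(), BackupNode()
for flow in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    switch.observe(flow)
backup.apply_full(switch.full_update())

switch.observe("10.0.0.3", 5)
delta = switch.incremental_update()
backup.apply_incremental(delta)
```

In this toy run the full update carries all `depth * width` counters, while the delta carries one entry per touched cell. Whether that saving holds in practice depends on how many distinct cells change between updates and on how the batch is encoded, which is the trade-off the abstract describes.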