🤖 AI Summary
In multi-chip AI systems, high-speed interconnects (e.g., CXL, NVLink) become more error-prone as link rates increase, and multi-node configurations add the problem of flits silently dropped in switching devices. To address this, the paper proposes RXL (Reliability Extended Link), an extension of CXL. RXL rests on three ideas: (1) an Implicit Sequence Number (ISN) mechanism that enables precise, flit-granularity drop detection and in-order delivery with zero header overhead; (2) elevation of CRC verification to the transport layer for end-to-end data and sequence integrity, with forward error correction (FEC) handling link-layer error correction and detection; and (3) support for multi-node CXL topologies without bandwidth overhead, while remaining compatible with the existing flit structure. Evaluation shows that RXL delivers end-to-end data integrity and sequence correctness while imposing minimal latency overhead (<50 ns), improving communication reliability and scalability in large-scale AI systems.
📝 Abstract
As AI models outpace the capabilities of single processors, interconnects across chips have become a critical enabler for scalable computing. These processors exchange massive amounts of data at cache-line granularity, prompting the adoption of new interconnect protocols like CXL, NVLink, and UALink, designed for high bandwidth and small payloads. However, the increasing transfer rates of these protocols heighten susceptibility to errors. While mechanisms like Cyclic Redundancy Check (CRC) and Forward Error Correction (FEC) are standard for reliable data transmission, scaling chip interconnects to multi-node configurations introduces new challenges, particularly in managing silently dropped flits in switching devices. This paper introduces Implicit Sequence Number (ISN), a novel mechanism that ensures precise flit drop detection and in-order delivery without adding header overhead. Additionally, we propose Reliability Extended Link (RXL), an extension of CXL that incorporates ISN to support scalable, reliable multi-node interconnects while maintaining compatibility with the existing flit structure. By elevating CRC to a transport-layer mechanism for end-to-end data and sequence integrity, and relying on FEC for link-layer error correction and detection, RXL delivers robust reliability and scalability without compromising bandwidth efficiency.
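The abstract does not spell out how ISN is realized. One plausible reading, sketched below purely as an illustration, is that each endpoint keeps a local flit counter and folds it into the CRC computation without ever transmitting it, so a dropped flit shows up as a CRC mismatch at the receiver. Everything in the sketch is an assumption rather than the paper's design: the names `flit_crc`, `Sender`, `Receiver`, and `SEQ_MOD`, the 16-bit counter width, and the use of `zlib.crc32` as a stand-in for the actual flit CRC.

```python
import zlib

SEQ_MOD = 1 << 16  # hypothetical counter width; the abstract does not specify one


def flit_crc(payload: bytes, seq: int) -> int:
    """CRC over (sequence number + payload). The sequence number is an
    input to the CRC only; it is never placed in the flit itself."""
    return zlib.crc32(seq.to_bytes(2, "big") + payload)


class Sender:
    def __init__(self) -> None:
        self.seq = 0  # local counter, not transmitted

    def send(self, payload: bytes) -> tuple[bytes, int]:
        flit = (payload, flit_crc(payload, self.seq))
        self.seq = (self.seq + 1) % SEQ_MOD
        return flit


class Receiver:
    def __init__(self) -> None:
        self.expected = 0  # the receiver's own view of the next sequence number

    def receive(self, flit: tuple[bytes, int]) -> bool:
        payload, crc = flit
        if flit_crc(payload, self.expected) != crc:
            # Payload corruption or a dropped/reordered flit: both surface
            # as a CRC mismatch. A real link would request a retry here.
            return False
        self.expected = (self.expected + 1) % SEQ_MOD
        return True


if __name__ == "__main__":
    tx, rx = Sender(), Receiver()
    first, second, third = (tx.send(bytes([i]) * 8) for i in range(3))
    print(rx.receive(first))   # in order: accepted
    print(rx.receive(third))   # 'second' was dropped in transit: rejected
    print(rx.receive(second))  # retransmitted in order: accepted
```

In this sketch the flit carries only the payload and the CRC, which matches the "no header overhead" property claimed for ISN. The trade-off is that the receiver cannot tell corruption from a drop based on the CRC alone, so both cases would have to lead to the same recovery path.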