Scaling Out Chip Interconnect Networks with Implicit Sequence Numbers

📅 2025-06-28
🤖 AI Summary
In multi-chip AI systems, high-speed interconnects (e.g., CXL, NVLink) find it increasingly hard to detect and recover from silently dropped flits as link rates rise and systems scale to multi-node configurations. To address this, the paper proposes RXL (Reliability Extended Link), an extension of CXL. RXL makes three contributions: (1) an Implicit Sequence Number (ISN) mechanism that enables precise, flit-granularity drop detection and in-order delivery with zero header overhead; (2) moving CRC verification up to the transport layer for end-to-end data and sequence integrity, with forward error correction (FEC) handling error correction and detection at the link layer; and (3) native support for multi-node CXL topologies without bandwidth overhead. According to the evaluation, RXL delivers end-to-end data integrity and sequence correctness with minimal latency overhead (<50 ns), improving communication reliability and scalability in large-scale AI systems.

📝 Abstract
As AI models outpace the capabilities of single processors, interconnects across chips have become a critical enabler for scalable computing. These processors exchange massive amounts of data at cache-line granularity, prompting the adoption of new interconnect protocols like CXL, NVLink, and UALink, designed for high bandwidth and small payloads. However, the increasing transfer rates of these protocols heighten susceptibility to errors. While mechanisms like Cyclic Redundancy Check (CRC) and Forward Error Correction (FEC) are standard for reliable data transmission, scaling chip interconnects to multi-node configurations introduces new challenges, particularly in managing silently dropped flits in switching devices. This paper introduces Implicit Sequence Number (ISN), a novel mechanism that ensures precise flit drop detection and in-order delivery without adding header overhead. Additionally, we propose Reliability Extended Link (RXL), an extension of CXL that incorporates ISN to support scalable, reliable multi-node interconnects while maintaining compatibility with the existing flit structure. By elevating CRC to a transport-layer mechanism for end-to-end data and sequence integrity, and relying on FEC for link-layer error correction and detection, RXL delivers robust reliability and scalability without compromising bandwidth efficiency.
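The abstract does not spell out how a sequence number can be checked without being transmitted. One plausible reading, given that the CRC is said to cover both data and sequence integrity, is that the sender folds its sequence counter into the CRC computation and the receiver recomputes the CRC using the sequence number it expects. The sketch below illustrates that idea only; it is not the paper's implementation. `zlib.crc32` stands in for the link's real CRC, and `SEQ_BITS`, `crc_with_isn`, `send`, and `Receiver` are hypothetical names chosen for this example.

```python
import zlib

SEQ_BITS = 10            # hypothetical sequence-number width (assumption)
SEQ_MOD = 1 << SEQ_BITS


def crc_with_isn(payload: bytes, seq: int) -> int:
    """CRC over the payload plus a sequence number that is never sent."""
    return zlib.crc32(seq.to_bytes(2, "big") + payload)


def send(payloads):
    """Yield (payload, crc) flits; the sequence number lives only in the CRC."""
    for seq, payload in enumerate(payloads):
        yield payload, crc_with_isn(payload, seq % SEQ_MOD)


class Receiver:
    """Tracks the next expected sequence number and checks flits against it."""

    def __init__(self):
        self.expected_seq = 0

    def accept(self, payload: bytes, crc: int) -> bool:
        # A mismatch means corruption, a dropped flit, or reordering;
        # a real link would trigger a retry here.
        if crc_with_isn(payload, self.expected_seq) != crc:
            return False
        self.expected_seq = (self.expected_seq + 1) % SEQ_MOD
        return True


flits = list(send([b"cache-line-0", b"cache-line-1", b"cache-line-2"]))

# All flits arrive in order: every check passes.
rx = Receiver()
in_order = [rx.accept(p, c) for p, c in flits]

# The middle flit is silently dropped: the next flit fails the check.
rx_drop = Receiver()
with_drop = [rx_drop.accept(p, c) for p, c in (flits[0], flits[2])]

print(in_order)   # [True, True, True]
print(with_drop)  # [True, False]
```

In this sketch the flit carries no extra header bits: the receiver learns of the drop only because the CRC it recomputes with its own expected sequence number no longer matches the one the sender attached.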
Problem

Research questions and friction points this paper is trying to address.

Detect silently dropped flits in multi-node chip interconnects
Ensure in-order delivery without header overhead in interconnects
Maintain bandwidth efficiency while enhancing reliability and scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit Sequence Number for drop detection
Reliability Extended Link for multi-node interconnects
Transport-layer CRC for end-to-end data and sequence integrity, with FEC for link-layer error correction and detection
Authors
Giyong Jung
Dept. of Electrical and Computer Engineering, Sungkyunkwan University, Suwon, South Korea
Saeid Gorgin
Sungkyunkwan University
Applied Machine Learning, Embedded AI, Hardware Accelerators, Computer Arithmetic, FPGA
John Kim
School of Electrical Engineering, KAIST, Daejeon, South Korea
Jungrae Kim
Sungkyunkwan University
Computer architecture, AI hardware, memory subsystem, reliability