TeraNoC: A Multi-Channel 32-bit Fine-Grained, Hybrid Mesh-Crossbar NoC for Efficient Scale-up of 1000+ Core Shared-L1-Memory Clusters

📅 2025-08-04
🤖 AI Summary
To scale interconnect bandwidth while keeping latency low and area efficiency high in shared-L1-memory clusters with more than a thousand cores, this paper proposes TeraNoC, an open-source, fine-grained hybrid mesh-crossbar network-on-chip (NoC). The design combines the scalability of 32-bit-word-width multi-channel 2D meshes with the low latency of crossbars, and uses a router remapper to balance traffic load across interconnect channels. In a 12 nm implementation with 1024 cores sharing a 4096-bank L1 memory, TeraNoC reduces die area by 37.8% and improves area efficiency (GFLOP/s/mm²) by up to 98.7% compared with a hierarchical crossbar-only cluster. On compute-intensive, data-parallel GenAI kernels it reaches a compute utilization of up to 0.85 IPC. The hybrid topology thus addresses the two bottlenecks of conventional designs: mesh latency that grows with hop count, and crossbar routing complexity that grows quadratically with the number of I/Os.

📝 Abstract
A key challenge in on-chip interconnect design is to scale up bandwidth while maintaining low latency and high area efficiency. 2D-meshes scale with low wiring area and congestion overhead; however, their end-to-end latency increases with the number of hops, making them unsuitable for latency-sensitive core-to-L1-memory access. On the other hand, crossbars offer low latency, but their routing complexity grows quadratically with the number of I/Os, requiring large physical routing resources and limiting area-efficient scalability. This two-sided interconnect bottleneck hinders the scale-up of many-core, low-latency, tightly coupled shared-memory clusters, pushing designers toward instantiating many smaller and loosely coupled clusters, at the cost of hardware and software overheads. We present TeraNoC, an open-source, hybrid mesh-crossbar on-chip interconnect that offers both scalability and low latency, while maintaining very low routing overhead. The topology, built on 32-bit word-width multi-channel 2D-meshes and crossbars, enables the area-efficient scale-up of shared-memory clusters. A router remapper is designed to balance traffic load across interconnect channels. Using TeraNoC, we build a cluster with 1024 single-stage, single-issue cores that share a 4096-banked L1 memory, implemented in 12 nm technology. The low interconnect stalls enable high compute utilization of up to 0.85 IPC in compute-intensive, data-parallel key GenAI kernels. TeraNoC only consumes 7.6% of the total cluster power in kernels dominated by crossbar accesses, and 22.7% in kernels with high 2D-mesh traffic. Compared to a hierarchical crossbar-only cluster, TeraNoC reduces die area by 37.8% and improves area efficiency (GFLOP/s/mm²) by up to 98.7%, while occupying only 10.9% of the logic area.
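The area and area-efficiency figures in the last sentence can be combined into a rough throughput estimate. The sketch below is back-of-the-envelope arithmetic, not a result reported in the abstract. It assumes that both percentages refer to the same die area and the same kernel, and it treats the "up to 98.7%" figure as if it were attained. The function name `implied_throughput_ratio` is hypothetical.

```python
def implied_throughput_ratio(area_reduction: float, efficiency_gain: float) -> float:
    """Throughput ratio (new / baseline) implied by an area reduction and an
    area-efficiency gain, both given as fractions (0.378 means 37.8%).

    Area efficiency is throughput / area, so:
        throughput_new / throughput_old
            = (eff_new / eff_old) * (area_new / area_old)
            = (1 + efficiency_gain) * (1 - area_reduction)
    """
    return (1.0 + efficiency_gain) * (1.0 - area_reduction)


# Figures quoted in the abstract; pairing them like this is an assumption.
ratio = implied_throughput_ratio(area_reduction=0.378, efficiency_gain=0.987)
print(f"implied throughput ratio: {ratio:.3f}")  # prints 1.236
```

Under these assumptions, roughly a quarter of the best-case efficiency gain would come from higher throughput and the rest from the smaller die.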
Problem

Research questions and friction points this paper is trying to address.

Balancing bandwidth scalability with low latency in on-chip interconnects
Overcoming routing complexity in crossbars for large core counts
Enabling area-efficient scale-up of shared-memory many-core clusters
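The trade-off behind these friction points can be illustrated with a small scaling sketch. It is a textbook-style model, not TeraNoC's actual configuration. It assumes a k x k mesh with uniformly random source and destination nodes, and a flat n x n crossbar whose cost is counted in crosspoints. The function names are hypothetical.

```python
def mesh_avg_hops(k: int) -> float:
    """Average Manhattan distance between two uniformly random nodes
    (source and destination may coincide) in a k x k 2D mesh.

    Per dimension, E|x - y| for x, y uniform on {0, ..., k-1} is
    (k*k - 1) / (3*k); the two dimensions add up.
    """
    return 2 * (k * k - 1) / (3 * k)


def crossbar_crosspoints(n: int) -> int:
    """Crosspoint count of a flat n x n crossbar: quadratic in n."""
    return n * n


# Illustrative sizes only; these are not the mesh dimensions used in the paper.
for k in (4, 8, 16, 32):
    n = k * k  # number of endpoints in the k x k mesh
    print(f"n={n:5d}  mesh avg hops={mesh_avg_hops(k):6.2f}  "
          f"crossbar crosspoints={crossbar_crosspoints(n):8d}")
```

In this model, hop count grows with the mesh side length, while crossbar cost grows with the square of the endpoint count. This is the reason for combining local crossbars with a mesh across groups.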
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid mesh-crossbar for scalable low-latency interconnect
32-bit multi-channel topology for area efficiency
Router remapper balances traffic load effectively