Clock Distribution with Gradient TRIX

📅 2023-01-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Gradient Clock Synchronization (GCS) algorithms are a promising basis for clocking large-scale synchronous SoCs, but the existing fault-tolerant variant relies on node and edge replication, which incurs high hardware overhead. Method: This paper proposes a self-stabilizing GCS algorithm on a grid-like directed graph in which every node has in- and out-degree 3, the optimum of 2f+1 for f=1, and which tolerates one faulty in-neighbor. Provided that link delays and clock speeds change slowly relative to the speed of the system, the algorithm achieves the asymptotically optimal local skew of Θ(log D), where D denotes the network diameter. Contribution/Results: If nodes fail independently with probability p ∈ o(n⁻¹⁄²), the local skew bound holds with probability 1−o(1). Unlike the prior approach, which needs at least 16-fold edge replication for f=1, the design keeps node degrees at 3. Because modern hardware is clocked at gigahertz speeds and the algorithm can sustain a constant number of arbitrary fault-induced changes per clock cycle, it is robust enough to substantially increase the size of reliable synchronously clocked SoCs.

📝 Abstract

Gradient clock synchronization (GCS) algorithms minimize the worst-case clock offset between the nodes in a distributed network of diameter $D$ and size $n$. They achieve optimal offsets of $\Theta(\log D)$ locally, i.e., between adjacent nodes, as shown by Lenzen et al., and $\Theta(D)$ globally, as shown by Biaz and Welch. As demonstrated in the work of Bund et al., this is a highly promising approach for improved clocking schemes for large-scale synchronous Systems-on-Chip (SoC). Unfortunately, in large systems, faults hinder their practical use. The state-of-the-art fault-tolerant GCS algorithm, presented by Bund et al., has a drawback that is fatal in this setting: it relies on node and edge replication. For $f=1$, this translates to at least $16$-fold edge replication and high-degree nodes, far from the optimum of $2f+1=3$ for tolerating up to $f$ faulty neighbors. In this work, we present a self-stabilizing GCS algorithm for a grid-like directed graph with optimal node in- and out-degrees of $3$ that tolerates $1$ faulty in-neighbor. If nodes fail with independent probability $p \in o(n^{-1/2})$, it achieves the asymptotically optimal local skew of $\Theta(\log D)$ with probability $1-o(1)$; this holds under general worst-case assumptions on link delay and clock speed variations, provided they change slowly relative to the speed of the system. The failure probability is the largest possible that ensures, with probability $1-o(1)$, that at most one in-neighbor of each node fails. As modern hardware is clocked at gigahertz speeds and the algorithm can simultaneously sustain a constant number of arbitrary changes due to faults in each clock cycle, this results in sufficient robustness to dramatically increase the size of reliable synchronously clocked SoCs.
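
The threshold $p \in o(n^{-1/2})$ can be checked with a short union-bound calculation. The sketch below is an illustration only, not the paper's proof: the function names are hypothetical, and it assumes every node has exactly 3 in-neighbors that fail independently with probability $p$. A node is "bad" if at least two of its in-neighbors fail, which happens with probability $3p^2(1-p) + p^3$; summing over $n$ nodes gives roughly $3np^2$, which vanishes exactly when $p \in o(n^{-1/2})$.

```python
from math import comb


def prob_two_or_more_faulty(p: float, in_degree: int = 3) -> float:
    """Probability that at least two of `in_degree` in-neighbors fail,
    assuming each fails independently with probability p."""
    return sum(
        comb(in_degree, k) * p**k * (1 - p) ** (in_degree - k)
        for k in range(2, in_degree + 1)
    )


def union_bound(n: int, p: float, in_degree: int = 3) -> float:
    """Union bound on the probability that some node among n has two or
    more faulty in-neighbors (clipped to 1, since it bounds a probability)."""
    return min(1.0, n * prob_two_or_more_faulty(p, in_degree))


if __name__ == "__main__":
    for n in (10**4, 10**6, 10**8):
        p_small = n ** -0.75  # hypothetical choice inside o(n^{-1/2})
        p_edge = n ** -0.5    # at the threshold, the bound no longer vanishes
        print(n, union_bound(n, p_small), union_bound(n, p_edge))
```

With the illustrative choice $p = n^{-3/4}$, the bound behaves like $3n^{-1/2}$ and shrinks as $n$ grows. At $p = n^{-1/2}$, the expected number of bad nodes is about $3$, so the bound gives no guarantee.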

Problem

Research questions and friction points this paper is trying to address.

Minimizes clock offset in large-scale SoCs with faults

Achieves optimal local skew Θ(log D) despite failures

Tolerates faulty neighbors with minimal node degrees

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-stabilizing GCS algorithm for grid graphs

Tolerates 1 faulty in-neighbor with optimal degrees

Achieves asymptotically optimal local skew Θ(log D)
