🤖 AI Summary
This work addresses the high energy cost of softmax attention on power-constrained hardware by introducing an attention mechanism based on the synchronization dynamics of Kuramoto–Lohe coupled oscillators. Queries are learned anchors fixed on the unit hypersphere, and free oscillators evolve under a spherical gradient flow until they settle at positions that encode attention weights through cosine similarity. The computation requires no exponentiation, and its only global operation is a lightweight affine normalization at readout. The fixed point is proven to be unique and globally attractive from almost every initial condition, a guarantee that holds for every physical realization of the dynamics. At oscillator dimension 2, the method outperforms softmax by 1.00 percentage points on keyword spotting and by 5.27 percentage points on hard subject–verb agreement sentences, with zero training failures versus one in five for softmax. On causal language modeling, where softmax retains an advantage, raising the oscillator dimension from 2 to 32 narrows the perplexity gap from +11.09 to +2.98 on WikiText-2 and from +2.39 to +0.57 on TinyStories.
📝 Abstract
We address transformer attention on energy-constrained physical substrates. Softmax attention requires exponentiation and global reduction, operations with high energy cost on von Neumann hardware and no natural physical analog. We show that Kuramoto synchronization dynamics (which arise in electrical, mechanical, superconducting, and charge-density-wave oscillator arrays, among other physical systems) implement a well-defined attention operation without either. The resulting mechanism, fixed-query oscillator attention, replaces softmax's arithmetic with the equilibration of a gradient flow on the sphere: queries are learned anchors fixed on the sphere, and free oscillators evolve under Kuramoto-Lohe dynamics until they settle at positions encoding attention weights via cosine similarity. Because the computation is equilibration, it requires no exponentiation; the only global operation is an affine normalization at readout. The fixed point is provably unique and globally attractive from almost every initial condition, a guarantee that holds across every physical realization. Empirically, at the minimal hardware configuration (oscillator dimension $d_{\mathrm{osc}}$ = 2), oscillator attention outperforms softmax on keyword spotting (+1.00 pp) and on subject-verb agreement (+5.27 pp on hard sentences, with zero training failures versus one in five for softmax). On causal language modeling, where softmax retains an advantage, oscillator attention closes the gap as $d_{\mathrm{osc}}$ grows: from +11.09 PPL at $d_{\mathrm{osc}}$ = 2 to +2.98 PPL at $d_{\mathrm{osc}}$ = 32 on WikiText-2, and from +2.39 PPL at $d_{\mathrm{osc}}$ = 2 to +0.57 PPL at $d_{\mathrm{osc}}$ = 32 on TinyStories. The main objective of this work is not to replace softmax in software but to provide a mathematically grounded blueprint for accurate attention on physical substrates.
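The equilibration idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's formulation: the abstract does not state the coupling terms, so the drive `field = key + coupling * q`, the Euler integrator with renormalization, and the `(score + 1) / sum` readout are all assumptions chosen only to show a gradient flow on the sphere settling at a unique attractor and being read out without exponentiation. Every function and parameter name here is hypothetical.

```python
import numpy as np


def normalize(x, axis=-1):
    """Project vectors onto the unit sphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def settle(x0, field, dt=0.1, steps=500):
    """Euler-integrate dx/dt = (I - x x^T) field, renormalizing each step.

    For a constant nonzero field, this flow has a single attracting fixed
    point at field / |field|; only the antipodal start fails to reach it.
    """
    x = normalize(x0)
    for _ in range(steps):
        # Remove the radial component so the update is tangent to the sphere.
        radial = np.sum(x * field, axis=-1, keepdims=True) * x
        x = normalize(x + dt * (field - radial))
    return x


def oscillator_attention(q, keys, values, coupling=1.0, seed=0):
    """Toy fixed-query attention for one query (assumed form, see lead-in)."""
    rng = np.random.default_rng(seed)
    q = normalize(q)                          # query: fixed anchor on the sphere
    field = normalize(keys) + coupling * q    # ASSUMPTION: per-key drive
    x0 = rng.normal(size=keys.shape)          # random initial oscillator states
    x = settle(x0, field)                     # one free oscillator per key
    scores = x @ q                            # cosine similarity, in [-1, 1]
    shifted = scores + 1.0                    # ASSUMPTION: affine shift to >= 0
    weights = shifted / shifted.sum()         # the only global operation
    return weights @ values, weights, x, field


q = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-0.6, 0.8]])
values = np.eye(3)
output, weights, states, field = oscillator_attention(q, keys, values)
```

In this toy setup the settled states do not depend on the random initial condition, which mirrors the uniqueness property the abstract claims; the sketch makes no attempt to reproduce the reported accuracy or perplexity figures.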