🤖 AI Summary
This work addresses the challenge of enhancing the global modeling capacity of local attention mechanisms while preserving linear computational complexity. Inspired by the random long-range connections observed in the Drosophila whole-brain connectome, the authors propose Stochastic Attention (SA), which applies sliding-window attention to a randomly permuted token sequence and then restores the original order, thereby transforming fixed local windows into dynamic global ones. This approach introduces, for the first time, the principle of random connectivity from neural connectomics into attention mechanisms, yielding an exponentially expanding receptive field and full-sequence coverage at only logarithmic depth. In training-free evaluation on Qwen3-8B and Qwen3-30B-A3B, SA outperforms standard sliding-window attention and matches or exceeds Mixture of Block Attention, while in from-scratch pretraining a gated combination of SA and sliding-window attention achieves the best average zero-shot accuracy.
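To make the logarithmic-depth claim concrete, a back-of-the-envelope calculation (the specific window and sequence lengths below are illustrative choices, not values reported in the paper): with window $w$, one layer lets each token mix with $w$ randomly chosen tokens, two layers with up to $w^2$, so covering all $n$ tokens takes roughly

$$w^{L} \ge n \;\Longrightarrow\; L \ge \log_w n, \qquad \text{e.g. } w = 1024,\; n = 2^{20} \;\Rightarrow\; L = 2 \text{ layers, versus } n/w = 1024 \text{ layers for SWA.}$$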
📝 Abstract
The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Although the network is highly structured at the circuit level, its long-range connections are broadly distributed across brain regions, functioning as stochastic shortcuts that enable efficient global communication. Inspired by this observation, we propose Stochastic Attention (SA), a drop-in enhancement for sliding-window attention (SWA) that applies a random permutation to the token sequence before windowed attention and restores the original order afterward. This transforms the fixed local window into a stochastic global one within the same $O(nw)$ per-layer budget. Stacked across layers, independently sampled permutations yield exponentially growing receptive fields, achieving full sequence coverage in $O(\log_w n)$ layers versus $O(n/w)$ for SWA. We validate SA in two settings: pre-training language models from scratch, where a gated SA + SWA combination achieves the best average zero-shot accuracy, and training-free inference on Qwen3-8B and Qwen3-30B-A3B, where SA consistently outperforms SWA and matches or exceeds Mixture of Block Attention at comparable compute budgets. These results suggest that connectome-inspired stochastic routing is a practical primitive for improving the expressivity of efficient attention, complementary to existing linear and sparse approaches.
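The permute, attend, unpermute idea can be sketched in a few lines of PyTorch. This is a minimal illustration under my own assumptions: the function names, the naive band-masked attention (which materializes the full score matrix instead of an $O(nw)$ kernel), per-call resampling of the permutation, and the omission of how causality and positional information are handled in the original token order are all mine, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int):
    # Naive band-masked attention for clarity; a real kernel would compute
    # only the O(n*w) entries inside the band.
    n, d = q.shape[-2], q.shape[-1]
    idx = torch.arange(n, device=q.device)
    # Token i attends to tokens j with i - window < j <= i (causal local band).
    band = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5
    scores = scores.masked_fill(~band, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def stochastic_attention(q, k, v, window: int, generator=None):
    # Shuffle the token axis, run the unchanged windowed attention, then restore
    # the original order, so each fixed local window covers a random global
    # subset of tokens. Causal masking w.r.t. the original order is omitted here.
    n = q.shape[-2]
    perm = torch.randperm(n, device=q.device, generator=generator)
    inv = torch.argsort(perm)  # inverse permutation to undo the shuffle
    out = sliding_window_attention(
        q[..., perm, :], k[..., perm, :], v[..., perm, :], window
    )
    return out[..., inv, :]

# Usage: one head, 16 tokens, window of 4.
q = k = v = torch.randn(1, 16, 8)
print(stochastic_attention(q, k, v, window=4).shape)  # torch.Size([1, 16, 8])
```

In this reading, sampling an independent permutation in each layer is what makes the effective receptive field compose multiplicatively across depth, which is the source of the $O(\log_w n)$ coverage claim.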