🤖 AI Summary
To address the quadratic $O(n^2)$ computational complexity of softmax attention in long-context settings, this paper proposes Sliced ReLU Attention, a novel attention mechanism that combines a one-dimensional linear projection of key-query differences, differentiable sorting, and a piecewise ReLU activation to construct an asymmetric attention kernel. The resulting method runs in quasi-linear $O(n \log n)$ time. To the authors' knowledge, this is the first work to incorporate differentiable sorting into attention kernel design, balancing computational efficiency with representational capacity. The paper theoretically establishes that Sliced ReLU Attention enjoys in-context universal approximation and can express nontrivial sequence-to-sequence disentangling tasks, matching the expressive power of softmax attention; empirical evaluation on small-scale benchmarks confirms its effectiveness. Key contributions: (i) an asymmetric attention kernel construction driven by differentiable sorting; (ii) guaranteed quasi-linear computational complexity; and (iii) rigorous theoretical guarantees on expressiveness.
📝 Abstract
We introduce sliced ReLU attention, a new attention mechanism that departs structurally from both softmax and ReLU-based alternatives. Instead of applying a nonlinearity to pairwise dot products, we operate on one-dimensional projections of key--query differences and leverage sorting to obtain quasi-linear complexity. This construction yields a differentiable, non-symmetric kernel that can be computed in $O(n \log n)$ time through a sorting procedure, making it suitable for very long contexts. Beyond computational benefits, the model retains strong theoretical expressive power: we establish two in-context expressivity results, previously known for softmax attention, showing that sliced ReLU attention preserves the ability to perform nontrivial sequence-to-sequence disentangling tasks and satisfies a contextual universal approximation property. Finally, we illustrate the potential practical interest of this kernel in small-scale experiments.
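To make the mechanism concrete, here is a minimal NumPy sketch of the general idea, not the paper's exact construction: assume a single slicing direction `w` and the simplified kernel $k(q_i, k_j) = \mathrm{ReLU}(\langle w, q_i\rangle - \langle w, k_j\rangle)$ (the paper's piecewise ReLU kernel and its differentiable-sorting details may differ). Because the kernel depends on queries and keys only through scalar projections, sorting the key projections and taking prefix sums evaluates all attention outputs in $O(n \log n)$ instead of $O(n^2)$:

```python
import numpy as np

def sliced_relu_attention(Q, K, V, w):
    """Quasi-linear attention with kernel relu(<w,q_i> - <w,k_j>).

    Q, K: (n, d) queries and keys; V: (n, d_v) values; w: (d,) slicing direction.
    Illustrative simplification, not the paper's full construction.
    """
    a = Q @ w                     # (n,) scalar query projections
    b = K @ w                     # (n,) scalar key projections

    # Sort keys (and their values) by projection; this is the O(n log n) step.
    order = np.argsort(b)
    b_s, V_s = b[order], V[order]

    # Prefix sums over sorted keys. Padding with a zero row lets index 0 mean
    # "no keys below this query's projection".
    cV  = np.vstack([np.zeros((1, V.shape[1])), np.cumsum(V_s, axis=0)])
    cbV = np.vstack([np.zeros((1, V.shape[1])), np.cumsum(b_s[:, None] * V_s, axis=0)])
    cb  = np.concatenate([[0.0], np.cumsum(b_s)])

    # Number of keys with b_j <= a_i (ties contribute zero weight anyway).
    cnt = np.searchsorted(b_s, a, side="right")

    # sum_j relu(a_i - b_j) v_j = a_i * sum_{b_j<=a_i} v_j - sum_{b_j<=a_i} b_j v_j
    num = a[:, None] * cV[cnt] - cbV[cnt]
    den = a * cnt - cb[cnt]       # same sum with v_j = 1, used as normalizer
    return num / np.maximum(den, 1e-9)[:, None]
```

Note the kernel is non-symmetric by construction: a query attends only to keys whose projection falls below its own, which is what makes the prefix-sum trick applicable.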