🤖 AI Summary
To address the quadratic $O(n^2)$ computational complexity of softmax attention in long-context settings, this paper proposes Sliced ReLU Attention, a novel attention mechanism that combines a one-dimensional linear projection of key-query differences, differentiable sorting, and a piecewise ReLU activation to construct an asymmetric attention kernel. The resulting method runs in quasi-linear $O(n \log n)$ time. To the authors' knowledge, this is the first work to incorporate differentiable sorting into attention kernel design, balancing computational efficiency with representational capacity. The paper theoretically establishes that Sliced ReLU Attention enjoys in-context universal approximation and can express nontrivial sequence-to-sequence disentangling tasks, matching the expressive power of softmax attention; empirical evaluation on small-scale benchmarks confirms its effectiveness. Key contributions: (i) an asymmetric attention kernel construction driven by differentiable sorting; (ii) guaranteed quasi-linear computational complexity; and (iii) rigorous theoretical guarantees on expressiveness.
📝 Abstract
We introduce sliced ReLU attention, a new attention mechanism that departs structurally from both softmax and ReLU-based alternatives. Instead of applying a nonlinearity to pairwise dot products, we operate on one-dimensional projections of key--query differences and leverage sorting to obtain quasi-linear complexity. This construction yields a differentiable, non-symmetric kernel that can be computed in $O(n \log n)$ time through a sorting procedure, making it suitable for very long contexts. Beyond computational benefits, the model retains strong theoretical expressive power: we establish two in-context expressivity results, previously known for softmax attention, showing that sliced ReLU attention preserves the ability to perform nontrivial sequence-to-sequence disentangling tasks and satisfies a contextual universal approximation property. Finally, we illustrate the potential practical interest of this kernel in small-scale experiments.
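To make the mechanism concrete, here is a minimal NumPy sketch of the general idea, not the paper's exact construction: assume a single slicing direction `w` and the simplified kernel $k(q_i, k_j) = \mathrm{ReLU}(\langle w, q_i\rangle - \langle w, k_j\rangle)$ (the paper's piecewise ReLU kernel and its differentiable-sorting details may differ). Because the kernel depends on queries and keys only through scalar projections, sorting the key projections and taking prefix sums evaluates all attention outputs in $O(n \log n)$ instead of $O(n^2)$:

```python
import numpy as np

def sliced_relu_attention(Q, K, V, w):
    """Quasi-linear attention with kernel relu(<w,q_i> - <w,k_j>).

    Q, K: (n, d) queries and keys; V: (n, d_v) values; w: (d,) slicing direction.
    Illustrative simplification, not the paper's full construction.
    """
    a = Q @ w                     # (n,) scalar query projections
    b = K @ w                     # (n,) scalar key projections

    # Sort keys (and their values) by projection; this is the O(n log n) step.
    order = np.argsort(b)
    b_s, V_s = b[order], V[order]

    # Prefix sums over sorted keys. Padding with a zero row lets index 0 mean
    # "no keys below this query's projection".
    cV  = np.vstack([np.zeros((1, V.shape[1])), np.cumsum(V_s, axis=0)])
    cbV = np.vstack([np.zeros((1, V.shape[1])), np.cumsum(b_s[:, None] * V_s, axis=0)])
    cb  = np.concatenate([[0.0], np.cumsum(b_s)])

    # Number of keys with b_j <= a_i (ties contribute zero weight anyway).
    cnt = np.searchsorted(b_s, a, side="right")

    # sum_j relu(a_i - b_j) v_j = a_i * sum_{b_j<=a_i} v_j - sum_{b_j<=a_i} b_j v_j
    num = a[:, None] * cV[cnt] - cbV[cnt]
    den = a * cnt - cb[cnt]       # same sum with v_j = 1, used as normalizer
    return num / np.maximum(den, 1e-9)[:, None]
```

Note the kernel is non-symmetric by construction: a query attends only to keys whose projection falls below its own, which is what makes the prefix-sum trick applicable.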