Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions

📅 2024-10-09
🏛️ International Conference on Learning Representations
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the $O(T)$ per-token generation cost of softmax attention in Transformers, this work proposes the Rodimus and Rodimus+ architectures, which target constant-memory recurrent inference while preserving accuracy. Methodologically, it introduces two core innovations: (1) a data-dependent tempered selection (DDTS) mechanism within a purely recurrent, linear attention-based framework, which performs semantic compression by retaining essential input information in a fixed-size hidden state; and (2) Sliding Window Shared-Key Attention (SW-SKA), combined with Rodimus in the hybrid Rodimus+, which leverages complementary semantic, token, and head compression. Empirically, Rodimus+-1.6B, trained on only 1 trillion tokens, outperforms baselines trained on more tokens, including Qwen2-1.5B and RWKV6-1.6B, across multiple downstream tasks, indicating that the accuracy-efficiency trade-off in autoregressive language modeling can be substantially improved.

📝 Abstract
Recent advancements in Transformer-based large language models (LLMs) have set new standards in natural language processing. However, the classical softmax attention incurs significant computational costs, leading to an $O(T)$ complexity for per-token generation, where $T$ represents the context length. This work explores reducing LLMs' complexity while maintaining performance by introducing Rodimus and its enhanced version, Rodimus$+$. Rodimus employs an innovative data-dependent tempered selection (DDTS) mechanism within a linear attention-based, purely recurrent framework, achieving significant accuracy while drastically reducing the memory usage typically associated with recurrent models. This method exemplifies semantic compression by maintaining essential input information with fixed-size hidden states. Building on this, Rodimus$+$ combines Rodimus with the innovative Sliding Window Shared-Key Attention (SW-SKA) in a hybrid approach, effectively leveraging the complementary semantic, token, and head compression techniques. Our experiments demonstrate that Rodimus$+$-1.6B, trained on 1 trillion tokens, achieves superior downstream performance against models trained on more tokens, including Qwen2-1.5B and RWKV6-1.6B, underscoring its potential to redefine the accuracy-efficiency balance in LLMs. Model code and pre-trained checkpoints are open-sourced at https://github.com/codefuse-ai/rodimus.
Problem

Research questions and friction points this paper is trying to address.

Reducing LLMs' complexity while maintaining performance
Overcoming computational costs of classical softmax attention
Balancing accuracy and efficiency in large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data-dependent tempered selection mechanism
Linear attention-based recurrent framework
Hybrid sliding window shared-key attention
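The recurrent, fixed-state idea behind the linear-attention branch can be sketched generically. This is a minimal illustration, not the paper's actual DDTS formulation: the gate names (`alpha`, `beta`) and the sigmoid parameterization are assumptions chosen to show how a data-dependent gate writes each token into a fixed-size state, so per-token cost stays constant regardless of context length.

```python
import numpy as np

def gated_linear_attention_step(S, k, v, q, alpha, beta):
    """One recurrent step of a gated linear-attention update (illustrative).

    S:     (d_k, d_v) fixed-size hidden state (the compressed context)
    k, q:  (d_k,) key/query for the current token
    v:     (d_v,) value for the current token
    alpha: (d_k,) data-dependent decay gate in (0, 1)
    beta:  (d_k,) data-dependent input gate in (0, 1)
    """
    # Decay old memory, then write the gated outer product k v^T into it.
    S = alpha[:, None] * S + (beta * k)[:, None] * v[None, :]
    # Read out with the query: O(d_k * d_v) per token, independent of T.
    y = q @ S
    return S, y

# Toy usage: process a short sequence; the state never grows with T.
rng = np.random.default_rng(0)
d_k, d_v, T = 4, 3, 5
S = np.zeros((d_k, d_v))
for _ in range(T):
    k, q, alpha_logits, beta_logits = rng.normal(size=(4, d_k))
    v = rng.normal(size=d_v)
    alpha = 1.0 / (1.0 + np.exp(-alpha_logits))  # sigmoid -> (0, 1)
    beta = 1.0 / (1.0 + np.exp(-beta_logits))
    S, y = gated_linear_attention_step(S, k, v, q, alpha, beta)
print(S.shape, y.shape)  # state stays (4, 3) no matter how long the sequence
```

The key property on display is that the hidden state `S` has a fixed shape, so memory and per-token compute do not scale with the number of tokens seen, in contrast to the growing key-value cache of softmax attention.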