Shared DIFF Transformer

📅 2025-01-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address parameter redundancy and suboptimal information utilization in the DIFF Transformer, this work proposes a differential attention architecture based on a shared base matrix and low-rank differential updates. Specifically, a lightweight shared base matrix captures global contextual patterns, while task-adaptive low-rank updates enable dynamic, efficient adaptation; sparse attention is further integrated to enhance noise robustness. This design preserves expressive capacity while substantially reducing parameter count and improving computational efficiency—particularly for long-sequence modeling. Experiments demonstrate that the proposed method consistently outperforms the original DIFF Transformer across long-sequence modeling, key-information retrieval, and in-context learning tasks. It reduces parameters by 37%, accelerates inference by 2.1×, and exhibits superior robustness to input noise.

Technology Category

Application Category

📝 Abstract
DIFF Transformer improves attention allocation by enhancing focus on relevant context while suppressing noise. It introduces a differential attention mechanism that calculates the difference between two independently generated attention distributions, effectively reducing noise and promoting sparse attention patterns. However, the independent signal generation in DIFF Transformer results in parameter redundancy and suboptimal utilization of information. In this work, we propose Shared DIFF Transformer, which draws on the idea of a differential amplifier by introducing a shared base matrix to model global patterns and incorporating low-rank updates to enhance task-specific flexibility. This design significantly reduces parameter redundancy, improves efficiency, and retains strong noise suppression capabilities. Experimental results show that, compared to DIFF Transformer, our method achieves better performance in tasks such as long-sequence modeling, key information retrieval, and in-context learning. Our work provides a novel and efficient approach to optimizing differential attention mechanisms and advancing robust Transformer architectures.
Problem

Research questions and friction points this paper is trying to address.

Attention Mechanism
Efficiency Improvement
Noise Reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shared DIFF Transformer
Parameter Efficiency
Attention Mechanism Improvement
🔎 Similar Papers
No similar papers found.