Depth-Attention: Cross-Layer Value Mixing for Language Models

📅 2026-06-03
🤖 AI Summary
Conventional Transformers pass information across layers only through residual connections, so later layers cannot selectively reuse shallow representations. The authors propose Depth-Attention, which performs cross-layer value mixing inside the standard attention computation: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads. Because the mechanism reuses the existing queries, keys, and value-cache slots, it adds no parameters and no persistent inference state beyond the standard key-value cache. On Qwen3-style decoders at 1.5B and 3B parameters, Depth-Attention attains the lowest perplexity and the highest average downstream accuracy, improving over the vanilla Transformer by up to 2.3 accuracy points while adding under 0.01% extra arithmetic FLOPs. The gains hold from 360M to 3B parameters and extend to looped Transformers.
📝 Abstract
Self-attention selects information freely across the sequence, but across depth, Transformers merely add each layer's output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent cross-layer methods improve this flow but operate on hidden states outside attention, adding state beyond the key-value cache at inference--a cost that becomes increasingly salient as modern LLMs compress the cache with grouped-query and multi-head latent attention. We introduce Depth-Attention, which performs this selection inside the attention module itself: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads. Because Depth-Attention reuses the standard attention queries, keys, and value-cache slots, storing depth-mixed values in place of the original values, it adds no parameters and introduces no persistent inference state beyond the standard key-value cache--the same cache size as a vanilla decoder and less than hidden-state-based cross-layer methods. On Qwen3-style decoders at 1.5B and 3B parameters, Depth-Attention attains the lowest perplexity and the highest average downstream accuracy, improving over the vanilla Transformer by up to 2.3 accuracy points and surpassing strong cross-layer baselines in perplexity and average accuracy, while adding under 0.01% extra arithmetic FLOPs and no additional persistent inference state. The gains hold from 360M to 3B parameters and extend to looped Transformers.
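The mechanism described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a single head, a softmax over depth with the usual 1/sqrt(d) scaling, and that the current layer's own key and value are included as the last depth entry. None of these details are specified in the abstract, and the function names `depth_mix_values` and `causal_self_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depth_mix_values(q, keys_by_layer, values_by_layer):
    """Mix values across depth at each token position (illustrative only).

    q:               (T, d)    queries of the current layer
    keys_by_layer:   (L, T, d) keys of the layers being attended over
    values_by_layer: (L, T, d) values of those same layers
    Assumption: softmax over the depth axis with 1/sqrt(d) scaling.
    """
    d = q.shape[-1]
    # scores[l, t] = q[t] . keys_by_layer[l, t] / sqrt(d), same position t only
    scores = np.einsum("td,ltd->lt", q, keys_by_layer) / np.sqrt(d)
    weights = softmax(scores, axis=0)            # (L, T), sums to 1 over depth
    mixed = np.einsum("lt,ltd->td", weights, values_by_layer)
    return mixed, weights

def causal_self_attention(q, k, v):
    # Standard single-head causal attention over the sequence.
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1) @ v

# Toy usage: 3 layers of cached keys/values, 4 tokens, width 8.
rng = np.random.default_rng(0)
L, T, d = 3, 4, 8
keys_by_layer = rng.normal(size=(L, T, d))
values_by_layer = rng.normal(size=(L, T, d))
q = rng.normal(size=(T, d))

# Step 1: the query attends over depth and produces a mixed value per token.
mixed_v, depth_weights = depth_mix_values(q, keys_by_layer, values_by_layer)
# Step 2: ordinary self-attention reads the mixed values; in this sketch the
# mixed values would occupy the cache slot of the original values.
out = causal_self_attention(q, keys_by_layer[-1], mixed_v)
print(out.shape)
```

In this sketch the depth step contributes no weights of its own, which matches the abstract's claim of zero added parameters, and the mixed value has the same shape as the value it replaces, so the cache footprint is unchanged.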
Problem

Research questions and friction points this paper is trying to address.

cross-layer information flow
depth-wise attention
key-value cache efficiency
Transformer architecture
inference state overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Depth-Attention
cross-layer attention
key-value cache
efficient inference
Transformer architecture
Authors
Boyi Zeng
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University
Yiqin Hao
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University
Zitong Wang
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University, Sun Yat-sen University
Shixiang Song
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai Innovation Institute
He Li
Postdoc, Institute of Natural Science, Shanghai Jiao Tong University
active matter, fluid mechanics, pattern formation
Feichen Song
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University
Yifan Liu
Shanghai Jiao Tong University
Data Mining
Ziwei He
Shanghai Jiao Tong University
Machine Learning
Xinbing Wang
Shanghai Jiao Tong University
Zhouhan Lin
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai AI Laboratory