Where does Absolute Position come from in decoder-only Transformers?

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates why decoder-only Transformers that use only relative positional encoding (RoPE) nonetheless distinguish absolute position in their attention patterns. The authors trace the leakage to two architectural components. The first is the causal mask, whose per-query softmax denominator depends on the absolute query position by construction. The second is the residual stream at position 0, which attends only to itself under causal attention and evolves as a closed dynamical system from that token's embedding; downstream attention reads this trajectory through sink-reading heads. Both components appear in all three architectures studied, in different balance: NTK scaling suppresses the residual-stream component, sliding-window attention lets it accumulate with depth, and standard RoPE sits between. Replacing the BOS embedding before the forward pass removes 40% of the residual-stream component at early queries. The work characterizes attention sinks as token-anchored stabilizers that pass forward a deterministic fingerprint of the position-0 token, constant across inputs when that token is the auto-prepended BOS and varying with it otherwise.

📝 Abstract

RoPE-trained transformers distinguish absolute position in their attention patterns, even though RoPE encodes only relative offsets in the inner product. We trace this leakage to two architectural components. The causal mask is responsible for the first: its per-query softmax denominator depends on the absolute query position by construction. The residual stream supplies the second. Under causal attention the activation at position $0$ attends only to itself and runs as a closed dynamical system from the embedding of the token at that position; downstream attention reads this trajectory through sink-reading heads. Both components appear in all three architectures we study, in architecturally specific balance: NTK scaling suppresses the residual-stream component, sliding-window attention allows it to accumulate with depth, and standard RoPE sits between. Replacing the \texttt{BOS} embedding before the forward pass removes $40\%$ of the residual-stream component at early queries. Attention sinks are token-anchored stabilizers that pass forward a deterministic fingerprint of the token at position $0$, constant across inputs when that token is the auto-prepended \texttt{BOS} and varying with it otherwise.
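
The first component can be illustrated with a toy example. The sketch below is not the paper's setup: it assumes logits that depend only on the offset $i - j$ (a stand-in for RoPE's relative-only inner product), and the names `relative_scores`, `causal_attention`, and the `decay` parameter are hypothetical. Even with purely relative logits, the causal mask makes query $i$ normalize over $i + 1$ keys, so the softmax denominator varies with the absolute query position, and position $0$ can attend only to itself.

```python
import numpy as np

def relative_scores(n, decay=0.5):
    """Toy attention logits that depend only on the offset i - j.

    Assumption: a simple distance penalty stands in for a
    relative-only score; this is not an actual RoPE computation.
    """
    idx = np.arange(n)
    offsets = idx[:, None] - idx[None, :]
    return -decay * np.abs(offsets)

def causal_attention(scores):
    """Apply a causal mask and softmax; also return the denominators."""
    n = scores.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))  # key j visible iff j <= i
    exp_scores = np.where(causal, np.exp(scores), 0.0)
    denominators = exp_scores.sum(axis=1)          # one per query position
    weights = exp_scores / denominators[:, None]
    return weights, denominators

scores = relative_scores(6)
weights, denominators = causal_attention(scores)

# The denominator grows with the query index although the logits are
# purely relative: query i sums over i + 1 visible keys.
print(denominators)

# Row 0 sees only itself, so all of its attention mass sits on position 0.
print(weights[0])
```

In this toy setting the denominator is strictly increasing in the query index, which is one simple way a relative-only score can still carry absolute position information after normalization.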

Problem

Research questions and friction points this paper is trying to address.

Absolute Position

RoPE

Decoder-only Transformers

Attention Patterns

Positional Encoding

Innovation

Methods, ideas, or system contributions that make the work stand out.

absolute position

RoPE

causal mask