KV-Embedding: Training-free Text Embedding via Internal KV Re-routing in Decoder-only LLMs

📅 2026-01-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Decoder-only large language models (LLMs) struggle to produce high-quality semantic embeddings without fine-tuning, due to their causal attention mechanism and generative training objective. This work proposes a training-free KV re-routing mechanism that re-injects the key-value (KV) states of the last token from each layer as a prefix, enabling all tokens to access global context in a single forward pass and thereby unlocking the representational capacity of frozen LLMs. Coupled with an automatic layer selection strategy based on intrinsic dimensionality estimation, the method is, to the authors' knowledge, the first to leverage internal LLM states for semantic embedding generation. Evaluated on the MTEB benchmark, it outperforms existing training-free approaches by up to 10% across Qwen, Mistral, and Llama architectures while supporting sequences up to 4,096 tokens in length.

📝 Abstract
While LLMs are powerful embedding backbones, their application in training-free settings faces two structural challenges: causal attention restricts early tokens from accessing subsequent context, and the next-token prediction objective biases representations toward generation rather than semantic compression. To address these limitations, we propose KV-Embedding, a framework that activates the latent representation power of frozen LLMs. Our method leverages the observation that the key-value (KV) states of the final token at each layer encode a compressed view of the sequence. By re-routing these states as a prepended prefix, we enable all tokens to access sequence-level context within a single forward pass. To ensure model-agnostic applicability, we introduce an automated layer selection strategy based on intrinsic dimensionality. Evaluations on MTEB across Qwen, Mistral, and Llama backbones show that KV-Embedding outperforms existing training-free baselines by up to 10%, while maintaining robust performance on sequences up to 4,096 tokens. These results demonstrate that internal state manipulation offers an efficient alternative to input modification, and we hope this work encourages further exploration of LLM internals for representation learning.
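The core idea in the abstract can be illustrated with a toy single-head attention layer. This is a hypothetical NumPy sketch, not the authors' implementation: a first pass runs plain causal attention, then the last token's key/value rows are prepended as a one-token "prefix" that every position may attend to, so even token 0 receives sequence-level context.

```python
# Toy sketch of KV re-routing (illustrative only, not the paper's code).
# Pass 1: standard causal attention. Pass 2: the last token's K/V rows
# are prepended as a prefix visible to all query positions.

import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8  # sequence length, head dimension


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_attention(Q, K, V, prefix_kv=None):
    """Causal self-attention; optionally prepend extra (K, V) rows
    that every query may attend to (the re-routed prefix)."""
    if prefix_kv is not None:
        pk, pv = prefix_kv
        K = np.concatenate([pk, K], axis=0)
        V = np.concatenate([pv, V], axis=0)
    P = K.shape[0] - Q.shape[0]            # number of prefix slots
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((Q.shape[0], Q.shape[0]), bool), k=1)
    scores[:, P:][mask] = -np.inf          # causal mask over real tokens only
    return softmax(scores) @ V


Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

# Pass 1: token 0 can only attend to itself under the causal mask.
out1 = causal_attention(Q, K, V)

# Re-route: the final token's K/V act as a compressed summary prefix.
prefix = (K[-1:], V[-1:])

# Pass 2: all tokens, including token 0, also attend to the prefix.
out2 = causal_attention(Q, K, V, prefix_kv=prefix)

# Token 0's representation now reflects global context.
assert not np.allclose(out1[0], out2[0])
```

In the actual method this would be done per layer inside a frozen LLM (e.g. via cached KV states), but the toy above captures the structural point: prefixing re-routed K/V rows lifts the causal restriction without any weight updates.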
Problem

Research questions and friction points this paper is trying to address.

causal attention
next-token prediction
text embedding
decoder-only LLMs
training-free
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV-Embedding
training-free embedding
KV re-routing
decoder-only LLMs
intrinsic dimensionality
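The abstract does not specify which intrinsic-dimensionality estimator drives the layer selection; one common choice, assumed here purely for illustration, is the TwoNN estimator (ratio of distances to the two nearest neighbours). A minimal sketch:

```python
# Hypothetical TwoNN intrinsic-dimension estimate (an assumed estimator;
# the paper does not state which one it uses). For each point, compute
# mu = r2 / r1, the ratio of distances to its 1st and 2nd nearest
# neighbours; the maximum-likelihood ID estimate is N / sum(log mu).

import numpy as np


def two_nn_id(X):
    """Return the TwoNN intrinsic-dimension estimate for rows of X."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)            # exclude self-distances
    r = np.sort(D, axis=1)[:, :2]          # r1, r2 per point
    mu = r[:, 1] / r[:, 0]
    return len(X) / np.log(mu).sum()


rng = np.random.default_rng(0)
# Points on a 2-D plane embedded in a 64-D ambient space: the estimate
# should land near 2 despite the 64-D coordinates.
coords = rng.normal(size=(500, 2))
basis = np.linalg.qr(rng.normal(size=(64, 2)))[0]
X = coords @ basis.T
print(two_nn_id(X))  # close to 2
```

In a layer-selection setting, one would run such an estimator on each layer's token representations and pick the layer whose geometry best suits the embedding task; the selection criterion itself is the paper's contribution and is not reproduced here.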