Repetition Improves Language Model Embeddings

📅 2024-02-23
🏛️ International Conference on Learning Representations
📈 Citations: 42
Influential: 7
📄 PDF
🤖 AI Summary
Autoregressive language models (LMs) suffer from unidirectional context modeling, preventing them from capturing subsequent semantic information and limiting their effectiveness for high-quality text embedding. To address this, we propose Echo Embedding: a zero-shot, architecture- and fine-tuning–free method that duplicates the input text and extracts token embeddings from the second occurrence—enabling early tokens to implicitly attend to global context via input repetition. This is the first approach to leverage input duplication to circumvent autoregressive locality constraints for context-aware embedding enhancement. Evaluated under the MTEB zero-shot setting, Echo Embedding achieves over 9% relative improvement over baselines; further fine-tuning yields an additional ~0.7% gain. When applied to Mistral-7B, it establishes a new state-of-the-art among open-source models for text embedding—without relying on synthetic data or architectural modifications.

📝 Abstract
Recent approaches to improving the extraction of text embeddings from autoregressive large language models (LLMs) have largely focused on improvements to data, backbone pretrained language models, or improving task-differentiation via instructions. In this work, we address an architectural limitation of autoregressive models: token embeddings cannot contain information from tokens that appear later in the input. To address this limitation, we propose a simple approach, "echo embeddings," in which we repeat the input twice in context and extract embeddings from the second occurrence. We show that echo embeddings of early tokens can encode information about later tokens, allowing us to maximally leverage high-quality LLMs for embeddings. On the MTEB leaderboard, echo embeddings improve over classical embeddings by over 9% zero-shot and by around 0.7% when fine-tuned. Echo embeddings with a Mistral-7B model achieve state-of-the-art compared to prior open source models that do not leverage synthetic fine-tuning data.
Problem

Research questions and friction points this paper is trying to address.

Convert autoregressive LMs to embedding models without modification
Improve zero-shot text embedding performance via repetition
Eliminate bidirectional architecture requirement for embedding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Echo embeddings convert autoregressive LMs without architecture changes
Repeats the input in context and extracts embeddings from the second occurrence
Eliminates bidirectional requirement for high-quality text embeddings
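The core mechanism can be sketched as a pooling step: feed the duplicated text through a causal LM, then mean-pool the last-layer hidden states of the second occurrence only, since those tokens have already "seen" the full text via the first copy. The sketch below is illustrative, not the paper's exact implementation; the function name `echo_pool` and the prompt/offset bookkeeping are assumptions, and the paper's actual prompt template and pooling details may differ.

```python
import numpy as np

def echo_pool(hidden_states: np.ndarray, prompt_len: int, text_len: int) -> np.ndarray:
    """Mean-pool token embeddings over the second occurrence of the input.

    hidden_states: (seq_len, dim) last-layer states for a sequence of the form
        <prompt tokens> <text tokens> <text tokens>, i.e. the text repeated twice.
    prompt_len: number of tokens before the first occurrence of the text.
    text_len: number of tokens in one occurrence of the text.
    """
    # The first copy spans [prompt_len, prompt_len + text_len); the second
    # copy starts immediately after it. Pooling over the second copy lets
    # every token's embedding condition on the full text, sidestepping the
    # unidirectional-context limitation of autoregressive models.
    start = prompt_len + text_len
    second_copy = hidden_states[start : start + text_len]
    return second_copy.mean(axis=0)

# Toy usage with simulated hidden states (6 tokens, dim 4):
# 2 prompt tokens, then the 2-token text repeated twice.
hs = np.arange(24, dtype=float).reshape(6, 4)
emb = echo_pool(hs, prompt_len=2, text_len=2)
```

In practice the hidden states would come from a causal LM (e.g. Mistral-7B) run over the duplicated prompt; the pooling itself requires no architecture change or fine-tuning, which is what makes the method zero-shot.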