🤖 AI Summary
Prompt-based text embeddings exhibit significant dimensional redundancy, leading to excessive storage and computational overhead. To address this, we systematically investigate the geometric structure of prompt embeddings and its relationship to downstream task performance. We propose a post-hoc compression framework based on principal component analysis (PCA) and other dimensionality reduction techniques, complemented by quantitative analysis using maximum likelihood estimation (MLE) of intrinsic dimensionality together with anisotropy metrics. Experiments demonstrate that retaining only 0.5% of the original dimensions incurs less than 1% performance degradation on classification and clustering tasks; at 25% dimensionality, the average performance loss across classification, clustering, retrieval, and semantic similarity tasks remains below 0.3%. This work is the first to reveal the extremely low intrinsic dimensionality of prompt embeddings and to establish a quantitative link between embedding anisotropy and task sensitivity, providing both theoretical grounding and practical solutions for efficient embedding deployment.
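The post-hoc PCA compression mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's actual code: the function name, shapes, and the SVD-based formulation are our assumptions.

```python
import numpy as np

def pca_compress(embeddings: np.ndarray, n_components: int) -> np.ndarray:
    """Post-hoc PCA: project embeddings onto their top principal components.

    `embeddings` is an (n_samples, dim) matrix; names and shapes here are
    illustrative assumptions, not the paper's implementation.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Toy example: compress 1000 synthetic 256-d "embeddings" down to 8 dims (~3%).
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 256))
reduced = pca_compress(emb, 8)
print(reduced.shape)  # (1000, 8)
```

Because PCA scores are uncorrelated by construction, the retained columns carry non-overlapping variance, which is what makes such aggressive post-hoc truncation plausible when the embeddings are highly redundant.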
📝 Abstract
Prompt-based text embedding models, which generate task-specific embeddings upon receiving tailored prompts, have recently demonstrated remarkable performance. However, the resulting embeddings often have thousands of dimensions, leading to high storage costs and increased computational costs for embedding-based operations. In this paper, we investigate how post-hoc dimensionality reduction applied to these embeddings affects the performance of various tasks that leverage them, specifically classification, clustering, retrieval, and semantic textual similarity (STS). Our experiments show that even a naive dimensionality reduction, which keeps only the first 25% of the embedding dimensions, results in only a slight performance degradation, indicating that these embeddings are highly redundant. Notably, for classification and clustering, even when embeddings are reduced to less than 0.5% of the original dimensionality, the performance degradation is very small. To quantitatively analyze this redundancy, we perform an analysis based on the intrinsic dimensionality and isotropy of the embeddings. Our analysis reveals that embeddings for classification and clustering, which show especially high dimensional redundancy, exhibit lower intrinsic dimensionality and less isotropy than those for retrieval and STS.
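The "naive" reduction described above, keeping only the leading fraction of dimensions, can be sketched as follows. This is a minimal NumPy sketch under our own assumptions: the function name is hypothetical, and the re-normalization step (which keeps cosine similarity well defined after truncation) is not stated in the abstract.

```python
import numpy as np

def truncate_dims(embeddings: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Keep only the first fraction of embedding dimensions.

    Re-normalizing afterwards is an assumption on our part, added so that
    cosine similarity on the truncated vectors remains well defined.
    """
    k = max(1, int(embeddings.shape[1] * keep_ratio))
    kept = embeddings[:, :k]
    norms = np.linalg.norm(kept, axis=1, keepdims=True)
    return kept / np.clip(norms, 1e-12, None)

rng = np.random.default_rng(1)
emb = rng.normal(size=(100, 4096))           # e.g. 4096-d prompt embeddings
small = truncate_dims(emb, keep_ratio=0.25)  # keeps the first 1024 dimensions
print(small.shape)  # (100, 1024)
```

Unlike PCA, this requires no fitting step or stored projection matrix, which is why the paper's finding that it costs so little accuracy is notable.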