🤖 AI Summary
This work challenges the prevailing assumption that text embedding models require task- or domain-specific fine-tuning to perform well on downstream applications. The authors systematically evaluate Generalist Text Embedding Models (GTEs), pre-trained on large-scale corpora, in zero-shot sequential recommendation and product search. Without updating any parameters, GTEs outperform both traditional and fine-tuned task-specific models on benchmarks for both tasks. The authors attribute this to superior representational power: GTEs distribute features more evenly across the embedding space. They further show that compressing embeddings onto their most informative directions (e.g., via unsupervised PCA) reduces noise and also improves the performance of specialized models. Together, the findings indicate that high-quality general-purpose embeddings already carry strong task adaptability, making specialized adaptation unnecessary in these settings.
📝 Abstract
Pre-trained language models (PLMs) are widely used to derive semantic representations from item metadata in recommendation and search. In sequential recommendation, PLMs enhance ID-based embeddings through textual metadata, while in product search, they align item characteristics with user intent. Recent studies suggest that task- and domain-specific fine-tuning is needed to improve representational power. This paper challenges this assumption, showing that Generalist Text Embedding Models (GTEs), pre-trained on large-scale corpora, can guarantee strong zero-shot performance without specialized adaptation. Our experiments demonstrate that GTEs outperform traditional and fine-tuned models in both sequential recommendation and product search. We attribute this to their superior representational power, as they distribute features more evenly across the embedding space. Finally, we show that compressing embedding dimensions by focusing on the most informative directions (e.g., via PCA) effectively reduces noise and improves the performance of specialized models. To ensure reproducibility, we provide our repository at https://split.to/gte4ps.
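To make the compression step concrete, here is a minimal numpy sketch of PCA-style dimensionality reduction applied to item embeddings. The array shapes, the target dimension `k=128`, and the `pca_compress` helper are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

# Hypothetical stand-in for GTE item embeddings (1000 items, 768 dims);
# the real embeddings would come from a pretrained text embedder.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))

def pca_compress(X, k):
    """Project rows of X onto their top-k principal directions."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered matrix: rows of Vt are the principal
    # directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T

compressed = pca_compress(embeddings, k=128)
print(compressed.shape)  # (1000, 128)
```

Keeping only the leading components discards low-variance directions, which is the noise-reduction effect the abstract attributes to this compression.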