🤖 AI Summary
Conventional representation learning yields generic semantic embeddings that often fail to align with downstream task requirements (e.g., in animal habitat analysis, scene-level features matter more than class semantics), while the usual remedy, supervised fine-tuning, incurs high annotation and computational costs.
Method: The paper proposes Conditional Representation Learning (CRL), a paradigm in which a large language model (LLM) generates descriptive texts for a user-specified criterion to construct a task-specific semantic basis; a vision-language model (VLM) then projects image features into the conditional feature space spanned by this basis, enabling unsupervised, task-adaptive representation learning (see the sketch after this summary).
Contribution/Results: CRL is the first framework to directly approximate semantic bases from natural-language descriptions, bypassing fine-tuning entirely. Experiments show that CRL significantly outperforms generic representations on classification and cross-domain retrieval tasks, demonstrating strong task customization and cross-task generalization.
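To make the pipeline concrete, here is a minimal sketch using CLIP as the VLM. It is an illustration, not the authors' implementation: the descriptive texts are hardcoded stand-ins for LLM output, the image path is a placeholder, and the projection (cosine similarities against normalized text embeddings) is one plausible reading of "projecting into the space spanned by the basis"; the paper's exact formulation may differ.

```python
# Minimal sketch of CRL with CLIP standing in for the VLM (assumption:
# the paper's exact prompts and projection may differ).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Descriptive texts for a user-specified criterion (here: habitat).
# In CRL these would be generated by an LLM; hardcoded for illustration.
basis_texts = [
    "a photo taken in a forest",
    "a photo taken in a desert",
    "a photo taken in a grassland",
    "a photo taken underwater",
]

# Encode the texts to form the semantic basis of the conditional space.
text_inputs = processor(text=basis_texts, return_tensors="pt", padding=True)
with torch.no_grad():
    basis = model.get_text_features(**text_inputs)  # (k, d)
basis = basis / basis.norm(dim=-1, keepdim=True)

# Project an image feature into the conditional feature space.
image = Image.open("example.jpg")  # placeholder image path
image_inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    feat = model.get_image_features(**image_inputs)  # (1, d)
feat = feat / feat.norm(dim=-1, keepdim=True)

# Coordinates of the image w.r.t. the text-derived basis: the
# conditional representation under the "habitat" criterion.
cond_repr = feat @ basis.T  # (1, k)
```

Each coordinate of `cond_repr` measures how strongly the image expresses one basis direction, so distances in this k-dimensional space reflect the user's criterion (habitat) rather than generic class semantics.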
📝 Abstract
Conventional representation learning methods learn a universal representation that primarily captures dominant semantics, which may not always align with customized downstream tasks. For instance, in animal habitat analysis, researchers prioritize scene-related features, whereas universal embeddings emphasize categorical semantics, leading to suboptimal results. As a solution, existing approaches resort to supervised fine-tuning, which, however, incurs high computational and annotation costs. In this paper, we propose Conditional Representation Learning (CRL), which aims to extract representations tailored to arbitrary user-specified criteria. Specifically, we reveal that the semantics of a feature space are determined by its basis, which allows a set of descriptive words to approximate the basis of a customized feature space. Building upon this insight, given a user-specified criterion, CRL first employs a large language model (LLM) to generate descriptive texts that construct the semantic basis, then projects the image representation into this conditional feature space via a vision-language model (VLM). The conditional representation better captures semantics under the specific criterion and can be utilized for multiple customized tasks. Extensive experiments on classification and retrieval tasks demonstrate the superiority and generality of the proposed CRL. The code is available at https://github.com/XLearning-SCU/2025-NeurIPS-CRL.
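As a usage note, once conditional representations are computed as in the sketch above, criterion-specific retrieval reduces to nearest-neighbor search in the conditional space. The helper below is hypothetical (not from the paper or its repository) and assumes query and gallery representations were built from the same semantic basis:

```python
# Hypothetical retrieval helper (illustrative, not from the paper):
# rank gallery images by cosine similarity in the conditional space,
# so nearest neighbors agree on the user's criterion rather than on
# generic class semantics.
import torch
import torch.nn.functional as F

def conditional_retrieval(query_repr: torch.Tensor,
                          gallery_reprs: torch.Tensor,
                          top_k: int = 5) -> torch.Tensor:
    """Return indices of the top_k gallery items closest to the query,
    where both are (., k) conditional representations built from the
    same semantic basis."""
    q = F.normalize(query_repr, dim=-1)     # (1, k)
    g = F.normalize(gallery_reprs, dim=-1)  # (n, k)
    sims = (q @ g.T).squeeze(0)             # (n,)
    return sims.topk(min(top_k, sims.numel())).indices
```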