🤖 AI Summary
Existing contrastive learning frameworks typically learn from binary relevance signals (positive/negative pairs) and so fail to capture fine-grained ranking information, which limits ranking performance and forces systems to add a separate re-ranker. This work proposes Generalized Contrastive Learning (GCL), a training framework that applies continuous ranking scores to the contrastive loss, encoding relevance and ranking in a unified embedding space and enabling single-stage retrieval. Key contributions include: (1) a contrastive loss that incorporates ranking scores, so that graded relevance is modeled explicitly; and (2) MarqoGS-10M, a large-scale multimodal dataset curated with GPT-4 and Google Shopping that provides a ranking score for each of its 10 million query-document pairs. Compared with a CLIP baseline finetuned on MarqoGS-10M, GCL improves NDCG@10 by 29.3% in in-domain evaluations and by 6.0% to 10.0% in cold-start evaluations; in an offline evaluation on proprietary user interaction data, it achieves an 11.2% in-domain gain.
📝 Abstract
Contrastive learning has gained widespread adoption for retrieval tasks due to its minimal requirement for manual annotations. However, popular training frameworks typically learn from binary (positive/negative) relevance, making them ineffective at incorporating desired rankings. As a result, the poor ranking performance of these models forces systems to employ a re-ranker, which increases complexity, maintenance effort, and inference time. To address this, we introduce Generalized Contrastive Learning (GCL), a training framework designed to learn from continuous ranking scores beyond binary relevance. GCL encodes both relevance and ranking information into a unified embedding space by applying ranking scores to the loss function. This enables a single-stage retrieval system. In addition, during our research, we identified a lack of public multi-modal datasets that benchmark both retrieval and ranking capabilities. To facilitate this and future research on ranked retrieval, we curated MarqoGS-10M, a large-scale dataset built using GPT-4 and Google Shopping that provides ranking scores for each of its 10 million query-document pairs. Our results show that GCL achieves a 29.3% increase in NDCG@10 for in-domain evaluations and 6.0% to 10.0% increases for cold-start evaluations compared to the CLIP baseline finetuned on MarqoGS-10M. Additionally, we evaluated GCL offline on proprietary user interaction data, where it shows an 11.2% gain for in-domain evaluations. The dataset and the method are available at: https://github.com/marqo-ai/GCL.
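The abstract says that ranking scores are applied to the loss function but does not give the formula. As a rough illustration only, and not necessarily the loss GCL actually uses, one simple way to do this is to scale each pair's in-batch contrastive (InfoNCE-style) term by its ranking score. The function name `weighted_contrastive_loss`, the `temperature` default, and the normalization by the sum of scores are all assumptions made for this sketch.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def weighted_contrastive_loss(queries, docs, scores, temperature=0.07):
    """Hypothetical score-weighted contrastive loss (query -> document direction).

    queries[i] and docs[i] are embedding vectors of a matched pair, and
    scores[i] is a non-negative ranking score for that pair. The other
    documents in the batch act as negatives. Assumes sum(scores) > 0.
    """
    q = [_normalize(v) for v in queries]
    d = [_normalize(v) for v in docs]
    total = 0.0
    for i, weight in enumerate(scores):
        # Cosine similarity of query i to every document, scaled by temperature.
        logits = [_dot(q[i], dj) / temperature for dj in d]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy of the matched document, scaled by its ranking score.
        total += weight * (log_z - logits[i])
    # Normalize by total weight so equal scores reduce to the unweighted mean.
    return total / sum(scores)


# Usage: the second pair is poorly aligned, so giving it a higher score
# makes it contribute more to the loss.
queries = [[1.0, 0.0], [0.0, 1.0]]
docs = [[1.0, 0.0], [0.0, -1.0]]
low = weighted_contrastive_loss(queries, docs, scores=[1.0, 0.1])
high = weighted_contrastive_loss(queries, docs, scores=[0.1, 1.0])
print(low < high)
```

With equal scores the sketch reduces to ordinary in-batch contrastive training with binary relevance, which is the setting the abstract contrasts GCL against; unequal scores let highly ranked pairs pull the embeddings harder than weakly ranked ones.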