Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models

📅 2024-05-23
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Model upgrades in cross-modal retrieval often cause embedding incompatibility and necessitate costly re-embedding (i.e., "backfilling") of large-scale historical data. Method: This paper proposes Cross-modal Backward-Compatible Representation Learning (XBT), the first framework to extend backward-compatible training (BT) to vision-language pretraining (VLP). XBT introduces a lightweight projection module, pretrained on text-only data, that aligns the embedding space of a frozen new VLP model (e.g., CLIP) with that of legacy models, without accessing the old model during training, requiring large numbers of image-text pairs, or performing backfilling. It adheres to the parameter-efficient fine-tuning (PEFT) paradigm. Contribution/Results: Evaluated on multiple cross-modal retrieval benchmarks, XBT significantly reduces training overhead while enabling seamless, backfill-free deployment of new VLP models, achieving backward compatibility without compromising retrieval performance or incurring the storage and computational penalties of backfilling.

📝 Abstract
Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the embeddings for a large number of data samples. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model's embeddings. This paper extends the concept of vision-only BT to the field of cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward-compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model's embeddings to those of the old model. This module, pretrained solely with text data, significantly reduces the number of image-text pairs required for XBT learning, and, once it is pretrained, it avoids using the old model during training. Furthermore, we utilize parameter-efficient training strategies that improve efficiency and preserve the off-the-shelf new model's knowledge by avoiding any modifications. Experimental results on cross-modal retrieval datasets demonstrate the effectiveness of XBT and its potential to enable backfill-free upgrades when a new VLP model emerges.
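The core idea of the projection module — mapping the new model's embeddings into the old model's embedding space so legacy gallery embeddings remain searchable — can be illustrated with a minimal sketch. The snippet below is a hypothetical stand-in, not the paper's actual architecture or training objective: it fits a closed-form least-squares linear map from new-model embeddings to old-model embeddings on toy data, where the paper instead pretrains a learnable module on text.

```python
import numpy as np

def fit_projection(new_emb: np.ndarray, old_emb: np.ndarray) -> np.ndarray:
    """Fit a linear map W minimizing ||new_emb @ W - old_emb||_F.

    A toy proxy for XBT's projection module: after fitting, new-model
    embeddings projected by W can be compared against legacy (old-model)
    gallery embeddings without backfilling.
    """
    W, *_ = np.linalg.lstsq(new_emb, old_emb, rcond=None)
    return W

# Toy setup: synthetic "new" embeddings and noise-free "old" targets.
rng = np.random.default_rng(0)
d_new, d_old, n = 16, 8, 200
W_true = rng.normal(size=(d_new, d_old))        # hypothetical ground-truth map
X_new = rng.normal(size=(n, d_new))             # new-model embeddings
Y_old = X_new @ W_true                          # corresponding old-model embeddings

W = fit_projection(X_new, Y_old)
projected = X_new @ W
err = np.linalg.norm(projected - Y_old) / np.linalg.norm(Y_old)
```

On this noise-free toy data the relative alignment error is essentially zero; in practice the spaces are not linearly related, which is why the paper uses a pretrained (nonlinear, learnable) projection module rather than a closed-form map.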
Problem

Research questions and friction points this paper is trying to address.

Embeddings from old and new vision-language models are incompatible, breaking retrieval over existing galleries after an upgrade
Upgrading cross-modal retrieval systems requires costly backfilling of large-scale historical data
Existing backward-compatible training methods are vision-only and do not address the cross-modal setting
Innovation

Methods, ideas, or system contributions that make the work stand out.

A projection module maps the new model's embeddings into the old model's embedding space
Pretraining the module solely on text data reduces the image-text pairs needed and removes the old model from subsequent training
Parameter-efficient training keeps the off-the-shelf new model unmodified, preserving its knowledge
Young Kyun Jang
Google DeepMind, Multi-modal AI
Ser-Nam Lim
University of Central Florida