TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

📅 2026-06-05
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the information imbalance between image and text embeddings in vision-language models: images contain more information than their captions describe, which leads to poor semantic alignment. To mitigate this imbalance, the authors propose TEVI, which combines sparse autoencoders with a text-conditioned masking mechanism. The sparse autoencoder disentangles the image embedding, and a masking module uses the caption to select which parts of it to reconstruct, so that only the visual information relevant to the given text is retained. The method improves retrieval performance across multiple benchmarks, including MS COCO and Flickr for short captions and IIW and DOCCI for long captions, with stronger gains on richer captions. It also improves robustness on the RoCOCO benchmark.
📝 Abstract
Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.
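The pipeline described in the abstract (encode the image embedding with a sparse autoencoder, gate the latents using the caption, decode the gated latents) can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy dimensions, the randomly initialised weights, the sigmoid gate, and the names `sae_encode`, `text_mask`, and `edit_image_embedding`. In a trained model the weights would be learned, and sparsity of the latents would come from the training objective rather than from the ReLU alone.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, LATENT_DIM = 8, 32  # toy sizes, not from the paper

# Hypothetical, randomly initialised parameters (a trained model would learn these).
W_enc = rng.normal(scale=0.1, size=(LATENT_DIM, EMBED_DIM))
b_enc = np.zeros(LATENT_DIM)
W_dec = rng.normal(scale=0.1, size=(EMBED_DIM, LATENT_DIM))
b_dec = np.zeros(EMBED_DIM)
W_mask = rng.normal(scale=0.1, size=(LATENT_DIM, EMBED_DIM))


def sae_encode(image_emb):
    """Map an image embedding to non-negative latents via a ReLU."""
    return np.maximum(0.0, W_enc @ image_emb + b_enc)


def text_mask(text_emb):
    """Compute a per-latent gate in (0, 1) from the caption embedding."""
    return 1.0 / (1.0 + np.exp(-(W_mask @ text_emb)))


def edit_image_embedding(image_emb, text_emb):
    """Reconstruct the image embedding from the caption-gated latents."""
    latents = sae_encode(image_emb)
    gated = text_mask(text_emb) * latents  # suppress latents the caption does not select
    edited = W_dec @ gated + b_dec
    # Rescale to unit length, as is usual before cosine-similarity retrieval.
    return edited / (np.linalg.norm(edited) + 1e-8)


# Stand-ins for CLIP image and caption embeddings.
image_emb = rng.normal(size=EMBED_DIM)
text_emb = rng.normal(size=EMBED_DIM)
edited = edit_image_embedding(image_emb, text_emb)
```

For retrieval, the edited image embedding would then be compared against the caption embedding in place of the original image embedding. A soft sigmoid gate is used here only because it is simple to write down; the actual masking module may differ.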
Problem

Research questions and friction points this paper is trying to address.

vision-language alignment
image-text embedding
information imbalance
caption-image mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse autoencoders
vision-language alignment
text-conditioned editing
embedding disentanglement
masked reconstruction