Vision-Language Models Create Cross-Modal Task Representations

📅 2024-10-29
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work investigates how vision-language models (VLMs) construct modality-agnostic task representations. The authors study *task vectors*: single, highly compressed internal representations that encode semantically equivalent task specifications across modalities (images, text) and formats (examples, instructions). Methodologically, they work with autoregressive VLMs and measure alignment via cross-modal transfer, i.e., whether a task vector derived in one modality triggers the correct generation in another, across a range of tasks and model architectures. Key contributions: (1) conceptually equivalent inputs are aligned into a shared task vector that is invariant to modality and format; (2) in the cross-modal setting, this single vector outperforms prompting the model with the full task information; (3) task vectors can be transferred from a base language model to its fine-tuned vision-language counterpart; and (4) they can be derived solely from instructions, without examples. The findings suggest that task semantics in VLMs are compact and highly portable, offering a unified representation for multimodal task specification.
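The extract-and-patch idea behind task vectors can be illustrated with a toy sketch. All names here (`hidden_state`, `task_vector`, `patched_forward`) are hypothetical, and the "model" is a seeded random stub; a real implementation would hook an intermediate layer of an actual VLM rather than fabricate activations:

```python
import zlib
import numpy as np

HIDDEN = 16  # toy hidden-state width; real VLMs use thousands of dimensions

def hidden_state(prompt_tokens):
    """Stand-in for a forward pass returning the last-token hidden state.
    A real implementation would read this from a VLM layer via a hook."""
    seed = zlib.crc32(" ".join(map(str, prompt_tokens)).encode())
    return np.random.default_rng(seed).normal(size=HIDDEN)

def task_vector(task_specs):
    """Average last-token hidden states over several equivalent task
    specifications (text examples, image demos, or an instruction)."""
    return np.stack([hidden_state(s) for s in task_specs]).mean(axis=0)

def patched_forward(query_tokens, tvec, alpha=1.0):
    """Blend the task vector into the query's hidden state; with alpha=1.0
    the query's own task representation is fully replaced."""
    base = hidden_state(query_tokens)
    return (1 - alpha) * base + alpha * tvec

# Derive a task vector from text demonstrations and patch it into a query
# from another modality (the "<image>" token is a placeholder here).
tv = task_vector([("cat", "->", "chat"), ("dog", "->", "chien")])
out = patched_forward(("<image>",), tv)
assert out.shape == (HIDDEN,)
```

The sketch only shows the data flow; the paper's actual finding is that, inside real VLMs, this patched vector triggers correct generation in the other modality.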

📝 Abstract
Autoregressive vision-language models (VLMs) can handle many tasks within a single model, yet the representations that enable this capability remain opaque. We find that VLMs align conceptually equivalent inputs into a shared task vector, which is invariant to modality (text, image) and format (examples, instruction), and may simplify VLM processing. We measure this alignment via cross-modal transfer -- the ability of a task vector derived in one modality to trigger the correct generation in another -- on a range of tasks and model architectures. Although the task vector is highly compressed, we find that this single vector outperforms prompting the model with the full task information, unique to this cross-modal case. Furthermore, we show that task vectors can be transferred from a base language model to its fine-tuned vision-language counterpart, and that they can be derived solely from instructions without the need for examples. Taken together, our findings shed light on how VLMs internally process task information, and how they map different modalities into common semantic representations. Project page: https://vlm-cross-modal-reps.github.io.
Problem

Research questions and friction points this paper is trying to address.

How VLMs internally represent task information when handling many tasks
Whether task representations align across modalities (text, image) and formats
Whether task vectors transfer between modalities and between related models
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLMs align conceptually equivalent inputs into a shared task vector
A single cross-modally transferred task vector outperforms full-prompt conditioning
Task vectors can be derived from instructions alone, without examples