🤖 AI Summary
This work addresses the limitations of vision-language models (VLMs) in interactive decision-making tasks, specifically their difficulty adhering to strict action-syntax constraints and their poor generalization. It proposes the first framework to integrate offline reinforcement learning (offline RL) into the alignment of open-weight VLMs. The method unifies Q-learning with VLM behavioral cloning, letting the agent learn through value correction from failure trajectories, whether its own or those of stronger models, without requiring high-quality expert demonstrations. It retains the training stability of supervised fine-tuning (SFT) while enabling online policy self-improvement, and it explicitly models multimodal action spaces. Evaluated on three multimodal agent benchmarks, the approach significantly improves instruction-following accuracy and environment-interaction success rates for LLaVA and MiniGPT-4, and it surpasses SFT baselines even when trained only on low-quality demonstration data.
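The summary does not spell out how Q-learning and behavioral cloning are unified, so the sketch below shows one common recipe consistent with its description: advantage-weighted behavioral cloning, where a critic's value estimate re-weights the standard SFT log-likelihood so that failed actions are down-weighted rather than imitated. This is a minimal PyTorch sketch under that assumption; the tensor shapes, the weight clamp, and the temperature `beta` are illustrative choices, not details from the paper.

```python
import torch
import torch.nn.functional as F

def value_weighted_bc_loss(logits, action_tokens, values, returns, beta=1.0):
    """Advantage-weighted behavioral cloning: an SFT-style likelihood term
    re-weighted by a critic's value correction (one plausible reading of
    'unifying Q-learning with VLM behavioral cloning').

    logits:        (B, T, V) token logits from the VLM policy
    action_tokens: (B, T)    logged action tokens from the dataset
    values:        (B,)      critic value estimates V(s)
    returns:       (B,)      observed or bootstrapped returns
    beta:          temperature on the advantage weighting
    """
    # Per-sequence log-likelihood of the logged action (the standard BC/SFT term).
    log_probs = F.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, action_tokens.unsqueeze(-1)).squeeze(-1)
    seq_ll = token_ll.sum(dim=-1)  # (B,)

    # The advantage provides the "value correction": actions that beat the
    # critic's estimate are up-weighted; failures are down-weighted rather
    # than cloned, so low-quality data still yields a useful signal.
    advantage = (returns - values).detach()
    weights = torch.clamp(torch.exp(advantage / beta), max=20.0)

    return -(weights * seq_ll).mean()
```

With `beta` large the weights flatten toward 1 and the loss reduces to plain SFT, which is one way such a method can inherit SFT's training stability.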
📝 Abstract
Recent research seeks to harness the general knowledge and reasoning of large language models (LLMs) to build agents that accomplish user-specified goals in interactive environments. Vision-language models (VLMs) extend LLMs to multi-modal data and provide agents with the visual reasoning necessary for new applications in areas such as computer automation. However, agent tasks emphasize skills where accessible open-weight VLMs lag behind their LLM equivalents. For example, VLMs are less capable of following an environment's strict output syntax requirements and are instead geared toward open-ended question answering. Overcoming these limitations typically requires supervised fine-tuning (SFT) on task-specific expert demonstrations. Our work approaches these challenges from an offline-to-online reinforcement learning (RL) perspective. RL lets us fine-tune VLMs for agent tasks while learning from the unsuccessful decisions of our own model or of more capable (larger) models. We explore an off-policy RL solution that retains the stability and simplicity of the widely used SFT workflow while allowing our agent to self-improve and learn from low-quality datasets. We demonstrate this technique with two open-weight VLMs across three multi-modal agent domains.
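The offline-to-online workflow the abstract describes can be pictured as two phases sharing a single off-policy update: first train on the static (possibly low-quality) dataset, then let the agent's own rollouts flow back into the same replay buffer. Below is a schematic Python sketch of that loop under those assumptions; `update` and `collect_rollouts` are hypothetical callables standing in for the off-policy gradient step and environment interaction, not APIs from the paper.

```python
import random

def offline_to_online(policy, critic, offline_data, env, update, collect_rollouts,
                      offline_steps=10_000, online_iters=100, batch_size=64):
    """Hypothetical offline-to-online loop: `update` performs one off-policy
    gradient step on a batch; `collect_rollouts` gathers trajectories."""
    replay = list(offline_data)  # seed the buffer with the static demonstrations

    # Phase 1: offline. Off-policy updates on the fixed dataset only,
    # keeping the stability and simplicity of an SFT-style workflow.
    for _ in range(offline_steps):
        update(policy, critic, random.sample(replay, min(batch_size, len(replay))))

    # Phase 2: online. The agent's own trajectories, including failures,
    # are appended to the buffer; reusing the same off-policy update lets
    # the policy self-improve without any new expert data.
    for _ in range(online_iters):
        replay.extend(collect_rollouts(policy, env))
        for _ in range(offline_steps // online_iters):
            update(policy, critic, random.sample(replay, min(batch_size, len(replay))))
```

Because the update never assumes the data came from the current policy, the same code path serves both phases; only the contents of the replay buffer change.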