🤖 AI Summary
Current vision-language models (VLMs) suffer from limited solution diversity and poor generalization in multi-step reasoning tasks. This stems primarily from supervised fine-tuning (SFT), which relies on the i.i.d. assumption, and reinforcement learning methods (e.g., PPO), which optimize only cumulative reward while neglecting solution diversity and long-range dependencies among reasoning steps. To address this, we propose the first GFlowNet-based fine-tuning framework for VLMs in multi-step reasoning. Our approach introduces a non-Markovian decision process formulation, integrates chain-of-thought (CoT) prompting with task-driven sparse reward design, and explicitly models long-range dependencies across action sequences—thereby jointly optimizing for both solution diversity and optimality. Evaluated on benchmarks including NumberLine, BlackJack, and ALFWorld, our method significantly improves training efficiency, solution-space diversity, and in-distribution as well as out-of-distribution generalization, consistently outperforming both SFT and PPO baselines.
📝 Abstract
Vision-Language Models (VLMs) have recently shown promising advancements in sequential decision-making tasks through task-specific fine-tuning. However, common fine-tuning methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) techniques like Proximal Policy Optimization (PPO), present notable limitations: SFT assumes Independent and Identically Distributed (IID) data, while PPO focuses solely on maximizing cumulative rewards. These limitations often restrict solution diversity and hinder generalization in multi-step reasoning tasks. To address these challenges, we introduce GFlowVLM, a novel framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets) to promote the generation of diverse solutions for complex reasoning tasks. GFlowVLM models the environment as a non-Markovian decision process, allowing it to capture the long-term dependencies essential for real-world applications. It takes observations and task descriptions as inputs to prompt chain-of-thought (CoT) reasoning, which subsequently guides action selection. Task-based rewards are then used to fine-tune the VLM with GFlowNets, enabling it to outperform prior fine-tuning methods, including SFT and RL. Empirical results demonstrate the effectiveness of GFlowVLM on complex tasks such as card games (NumberLine, BlackJack) and embodied planning tasks (ALFWorld), showing enhanced training efficiency, solution diversity, and stronger generalization across both in-distribution and out-of-distribution scenarios.
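To make the training signal concrete, the sketch below shows trajectory balance, a standard GFlowNet objective that ties the product of per-step policy probabilities to the terminal task reward, so the policy learns to sample solutions in proportion to reward rather than only maximizing it. This is an illustrative sketch with toy numbers; whether GFlowVLM uses exactly this GFlowNet variant, and the specific values shown, are assumptions not taken from the abstract.

```python
import math

def trajectory_balance_loss(log_Z, log_probs, reward):
    """Squared trajectory-balance residual for one reasoning trajectory.

    log_Z     : learned scalar estimate of the log partition function
    log_probs : log pi(a_t | observation, CoT prefix) for each step t
    reward    : sparse task reward R(tau) > 0, given at termination

    For autoregressive token/action generation the backward policy is
    deterministic, so its log-prob term drops out of the residual.
    """
    residual = log_Z + sum(log_probs) - math.log(reward)
    return residual ** 2

# Hypothetical 3-step trajectory: product of step probabilities is 0.05.
log_probs = [math.log(0.5), math.log(0.4), math.log(0.25)]

# If the flow matches the reward (Z * 0.05 == 0.1), the loss is zero.
loss_matched = trajectory_balance_loss(math.log(2.0), log_probs, reward=0.1)

# A mis-calibrated log_Z yields a positive loss that gradients would reduce.
loss_off = trajectory_balance_loss(0.0, log_probs, reward=0.1)
```

Minimizing this loss over sampled trajectories pushes the policy toward sampling diverse high-reward reasoning chains, which is the mechanism behind the diversity gains the abstract claims over reward-maximizing methods like PPO.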