🤖 AI Summary
Current vision-language models (VLMs) suffer from limited solution diversity and poor generalization in multi-step reasoning tasks. This stems primarily from supervised fine-tuning (SFT), which relies on the i.i.d. assumption, and reinforcement learning methods (e.g., PPO), which optimize only cumulative reward while neglecting solution diversity and long-range dependencies among reasoning steps. To address this, we propose the first GFlowNet-based fine-tuning framework for VLMs in multi-step reasoning. Our approach introduces a non-Markovian decision process formulation, integrates chain-of-thought (CoT) prompting with task-driven sparse reward design, and explicitly models long-range dependencies across action sequences—thereby jointly optimizing for both solution diversity and optimality. Evaluated on benchmarks including NumberLine, BlackJack, and ALFWorld, our method significantly improves training efficiency, solution-space diversity, and in-distribution as well as out-of-distribution generalization, consistently outperforming both SFT and PPO baselines.
📝 Abstract
Vision-Language Models (VLMs) have recently shown promising advancements in sequential decision-making tasks through task-specific fine-tuning. However, common fine-tuning methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) techniques like Proximal Policy Optimization (PPO), present notable limitations: SFT assumes Independent and Identically Distributed (IID) data, while PPO focuses solely on maximizing cumulative rewards. These limitations often restrict solution diversity and hinder generalization in multi-step reasoning tasks. To address these challenges, we introduce GFlowVLM, a novel framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets) to promote the generation of diverse solutions for complex reasoning tasks. GFlowVLM models the environment as a non-Markovian decision process, allowing it to capture the long-term dependencies essential for real-world applications. It takes observations and task descriptions as inputs to prompt chain-of-thought (CoT) reasoning, which subsequently guides action selection. Task-based rewards are then used to fine-tune the VLM with GFlowNets, enabling it to outperform prior fine-tuning methods, including SFT and RL. Empirical results demonstrate the effectiveness of GFlowVLM on complex tasks such as card games (NumberLine, BlackJack) and embodied planning tasks (ALFWorld), showing enhanced training efficiency, solution diversity, and stronger generalization across both in-distribution and out-of-distribution scenarios.
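To make the training signal concrete, the sketch below shows trajectory balance, a standard GFlowNet objective that ties the product of per-step policy probabilities to the terminal task reward, so the policy learns to sample solutions in proportion to reward rather than only maximizing it. This is an illustrative sketch with toy numbers; whether GFlowVLM uses exactly this GFlowNet variant, and the specific values shown, are assumptions not taken from the abstract.

```python
import math

def trajectory_balance_loss(log_Z, log_probs, reward):
    """Squared trajectory-balance residual for one reasoning trajectory.

    log_Z     : learned scalar estimate of the log partition function
    log_probs : log pi(a_t | observation, CoT prefix) for each step t
    reward    : sparse task reward R(tau) > 0, given at termination

    For autoregressive token/action generation the backward policy is
    deterministic, so its log-prob term drops out of the residual.
    """
    residual = log_Z + sum(log_probs) - math.log(reward)
    return residual ** 2

# Hypothetical 3-step trajectory: product of step probabilities is 0.05.
log_probs = [math.log(0.5), math.log(0.4), math.log(0.25)]

# If the flow matches the reward (Z * 0.05 == 0.1), the loss is zero.
loss_matched = trajectory_balance_loss(math.log(2.0), log_probs, reward=0.1)

# A mis-calibrated log_Z yields a positive loss that gradients would reduce.
loss_off = trajectory_balance_loss(0.0, log_probs, reward=0.1)
```

Minimizing this loss over sampled trajectories pushes the policy toward sampling diverse high-reward reasoning chains, which is the mechanism behind the diversity gains the abstract claims over reward-maximizing methods like PPO.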