RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of sparse rewards and unstable training in fine-grained visual reasoning, this paper proposes RewardMap, a multi-stage reinforcement learning framework. Methodologically, it introduces (1) ReasonMap-Plus—a densely annotated dataset with fine-grained supervision; (2) a difficulty-aware hierarchical reward mechanism that decomposes visual question answering into sequential subtasks: perception → localization → reasoning; and (3) progressive task-guided optimization, replacing conventional supervised fine-tuning to enable stable cold-start training. Evaluated on ReasonMap and multiple fine-grained benchmarks, RewardMap achieves an average improvement of 3.47%. The framework significantly enhances the model’s capacity for spatial relation modeling, local feature discrimination, and cross-domain generalization—demonstrating robustness and scalability in complex visual reasoning scenarios.
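The difficulty-aware reward described above can be illustrated with a minimal sketch: a sparse final-answer reward is densified with partial credit for intermediate details (e.g. correctly identified stations or transfers), scaled by question difficulty. The function name, the 0.5 detail weight, and the multiplicative difficulty scaling are illustrative assumptions, not the paper's exact formulation.

```python
def reward(answer_correct: bool, details_correct: int, details_total: int,
           difficulty: float) -> float:
    """Hypothetical difficulty-aware reward with dense detail rewards.

    Combines the sparse correctness signal with partial credit for
    intermediate details, weighted by task difficulty. Constants are
    illustrative, not taken from the paper.
    """
    base = 1.0 if answer_correct else 0.0
    # Dense partial credit: fraction of fine-grained details answered correctly.
    detail = details_correct / max(details_total, 1)
    # Harder questions yield larger rewards, encouraging progression.
    return difficulty * (base + 0.5 * detail)
```

Even a partially wrong rollout then receives a nonzero gradient signal, which is the point of replacing the purely sparse correctness reward.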

📝 Abstract
Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured and information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse rewards while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.
Problem

Research questions and friction points this paper is trying to address.

Addressing sparse rewards in fine-grained visual reasoning tasks
Improving multimodal models' spatial reasoning with structured data
Developing multi-stage reinforcement learning for visual understanding enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage RL bootstraps from perception to reasoning
Difficulty-aware rewards tackle sparse reward issues
Dense reward signals enable effective cold-start training
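The multi-stage bootstrapping above can be sketched as a curriculum over task mixtures, moving from simple perception-style VQA toward full route reasoning. The stage names, task types, and mixing ratios here are hypothetical placeholders, not the paper's actual schedule.

```python
import random

# Hypothetical curriculum: each stage defines a mixture over task types,
# shifting probability mass from perception toward reasoning.
STAGES = [
    ("perception",   {"vqa_counting": 0.7, "vqa_localization": 0.3}),
    ("localization", {"vqa_localization": 0.5, "route_planning": 0.5}),
    ("reasoning",    {"route_planning": 1.0}),
]

def sample_task(stage_idx: int, rng: random.Random) -> str:
    """Draw a training task type according to the current stage's mixture."""
    _, mix = STAGES[stage_idx]
    tasks, weights = zip(*mix.items())
    return rng.choices(tasks, weights=weights, k=1)[0]
```

Advancing `stage_idx` as reward plateaus gives an RL-native cold start, in contrast to the SFT warm-up the paper argues against.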