DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

📅 2026-06-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing reinforcement learning methods for multimodal large language models: they do not model fine-grained, dynamic cross-modal coordination during reasoning, which leads to an imbalance between visual evidence extraction and textual context integration. To remedy this, the study introduces a dynamic cross-modal coordination mechanism into the Reinforcement Learning with Verifiable Rewards (RLVR) framework. It uses the Fisher-Rao geodesic distance to quantify within-modality attention shifts and assigns each token a vision-oriented or text-oriented role accordingly. The policy-gradient advantage is then reweighted according to how well each token's actual attention allocation aligns with its assigned role, enabling token-level optimization of the reasoning process. Integrated with four representative RLVR algorithms on Qwen2.5-VL-3B/7B, the approach consistently improves performance across seven benchmarks spanning visual-centric and mathematical reasoning tasks.
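As an illustration of the distance named above, the following is a minimal sketch of the Fisher-Rao geodesic distance between two discrete attention distributions, using the standard closed form for the probability simplex (twice the arccosine of the Bhattacharyya coefficient). The function name and the idea of applying it to attention weights at consecutive decoding steps are assumptions for illustration; the summary does not specify how the paper computes or normalizes the quantity.

```python
import numpy as np

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two discrete distributions.

    Uses the closed form d(p, q) = 2 * arccos(sum_i sqrt(p_i * q_i)),
    which ranges from 0 (identical) to pi (disjoint supports).
    Inputs are renormalized so raw attention slices can be passed in.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    # Bhattacharyya coefficient; clip guards against rounding slightly above 1.
    bc = np.sum(np.sqrt(p * q))
    return 2.0 * np.arccos(np.clip(bc, 0.0, 1.0))

# Hypothetical attention over three image patches at two consecutive steps.
prev_step = [0.7, 0.2, 0.1]
curr_step = [0.1, 0.2, 0.7]
shift = fisher_rao_distance(prev_step, curr_step)
```

A larger value indicates that attention within the modality moved more between the two steps; some references omit the factor of 2, so absolute values depend on the convention.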

📝 Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a leading paradigm for enhancing visual reasoning in Multimodal Large Language Models (MLLMs). However, existing RLVR methods optimize primarily for the reasoning outcome, fundamentally overlooking the fine-grained cross-modal coordination required during the generation process. Through token-level analyses and controlled interventions, we reveal that during Chain-of-Thought (CoT) reasoning, MLLMs frequently fail to dynamically alternate between extracting visual evidence and synthesizing textual context, a coordination breakdown that is causally linked to reasoning failures. Motivated by these findings, we propose DyCo-RL, which integrates dynamic cross-modal coordination into RLVR optimization. Specifically, DyCo-RL uses the Fisher-Rao geodesic distance to measure within-modality attention shifts, assigning tokens to either visually oriented or text-oriented functional roles. It then evaluates the alignment between a token's actual attention allocation and its assigned role, leveraging this score for alignment-guided advantage reweighting during policy optimization. Extensive experiments demonstrate that the algorithm-agnostic DyCo-RL, when applied to Qwen2.5-VL-3B/7B, consistently improves four representative RLVR algorithms across seven benchmarks spanning visual-centric and mathematical reasoning.
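To make the role assignment and alignment-guided advantage reweighting more concrete, here is a rough sketch under loudly stated assumptions. The abstract does not give the actual rules, so everything specific here is hypothetical: the role rule (a token is visually oriented when its visual attention shift exceeds its textual one), the alignment score (share of attention mass on the assigned modality), the linear weighting, and the names `reweight_advantages` and `alpha`. It is not the authors' formulation.

```python
import numpy as np

def reweight_advantages(vis_shift, txt_shift, vis_mass, advantages, alpha=0.5):
    """Toy alignment-guided advantage reweighting (all rules are assumptions).

    vis_shift, txt_shift: per-token within-modality attention shift scores.
    vis_mass: per-token fraction of attention placed on visual tokens, in [0, 1].
    advantages: per-token advantages from any RLVR algorithm.
    alpha: hypothetical strength of the reweighting, in [0, 1].
    """
    vis_shift = np.asarray(vis_shift, dtype=float)
    txt_shift = np.asarray(txt_shift, dtype=float)
    vis_mass = np.asarray(vis_mass, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Assumed role rule: the modality with the larger shift defines the role.
    is_visual = vis_shift > txt_shift
    # Assumed alignment: attention mass actually spent on the assigned modality.
    alignment = np.where(is_visual, vis_mass, 1.0 - vis_mass)
    # Assumed weighting: linear map from alignment to [1 - alpha, 1 + alpha].
    weights = 1.0 + alpha * (2.0 * alignment - 1.0)
    return is_visual, weights * advantages

# Two hypothetical tokens: the first is role-aligned, the second is not.
roles, new_adv = reweight_advantages(
    vis_shift=[0.9, 0.1],
    txt_shift=[0.2, 0.8],
    vis_mass=[0.8, 0.7],
    advantages=[1.0, 1.0],
)
```

Because the function only rescales per-token advantages, a scheme of this shape can sit on top of different policy-gradient objectives, which is consistent with the algorithm-agnostic claim in the abstract.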

Problem

Research questions and friction points this paper is trying to address.

visual reasoning

cross-modal coordination

multimodal large language models

reinforcement learning

chain-of-thought

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Cross-Modal Coordination

Reinforcement Learning with Verifiable Rewards

Fisher-Rao Geodesic Distance