SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Policy optimization methods that assign a single scalar advantage to each rollout give only coarse-grained credit assignment, which fits poorly with long-form vision-language generation grounded in semantically rich images. This work proposes Segment-Decomposed GRPO (SD-GRPO), which exploits the natural segmentation of long-form outputs: it computes a verifiable reward for each segment and z-normalizes that reward across the rollout group, producing a vector of per-segment advantages in place of one scalar. On a controlled multi-panel dense-captioning task built from DOCCI, where segments are semantically independent, SD-GRPO consistently outperforms the GRPO baseline, with larger gains at higher segment counts. On a controlled multi-chart long-form VQA task built from MultiChartQA, rollout-level rewards are shown to suffer from cross-segment credit misattribution that grows with output length. On scientific figure captioning with MMSci, where subfigure captions share context, blending holistic and per-segment rewards improves on either alone. The method also integrates into Dr. GRPO with minimal implementation overhead.
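The contrast between a scalar advantage and a per-segment advantage vector can be written out as follows. This is a sketch in assumed notation, not taken from the paper: G is the rollout group size, K the number of segments, r_i the rollout-level reward of rollout i, and r_{i,k} the verifiable reward of its k-th segment.

```latex
% Scalar (GRPO-style) advantage: one value per rollout i
A_i = \frac{r_i - \mu}{\sigma}, \qquad
\mu = \frac{1}{G}\sum_{j=1}^{G} r_j, \qquad
\sigma = \sqrt{\frac{1}{G}\sum_{j=1}^{G} (r_j - \mu)^2}

% Per-segment advantage: one value per rollout i and segment k,
% normalized across the group separately for each segment
A_{i,k} = \frac{r_{i,k} - \mu_k}{\sigma_k}, \qquad
\mu_k = \frac{1}{G}\sum_{j=1}^{G} r_{j,k}, \qquad
\sigma_k = \sqrt{\frac{1}{G}\sum_{j=1}^{G} (r_{j,k} - \mu_k)^2}
```

Under this reading, every token in segment k of rollout i would be weighted by A_{i,k} rather than by the shared scalar A_i.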

📝 Abstract

Group Relative Policy Optimization (GRPO) and its variants, originally developed for Large Language Models (LLMs), have recently been applied to Multimodal LLMs and produced strong results. However, their coarse-grained holistic credit assignment from a single scalar advantage underfits vision-language (VL) tasks, where outputs are often long-form responses grounded in semantically rich images. To address this limitation, we exploit a structured signal that single-scalar formulations discard: the natural segmentation of long-form VL outputs. Concretely, we propose Segment-Decomposed GRPO (SD-GRPO), which z-normalizes verifiable per-segment rewards across the rollout group, yielding a vector of per-segment advantages in place of a single scalar. We evaluate SD-GRPO across three settings spanning controlled and real-world long-form VL generation, organized by increasing semantic entanglement across segments. On a controlled multi-panel dense-captioning task constructed from DOCCI, where segments are semantically independent, SD-GRPO consistently outperforms the GRPO baseline, with larger gains at higher segment counts. Extending to a controlled multi-chart long-form VQA task constructed from MultiChartQA, we show both theoretically and empirically that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length. On a real-world scientific figure captioning task on the MMSci dataset, where subfigure captions share context across the figure, blending holistic and per-segment rewards further improves on both, suggesting per-segment normalization alone is insufficient when segments are semantically entangled. Finally, by integrating SD-GRPO into Dr. GRPO, we confirm that it can be applied to any GRPO framework with minimal implementation overhead to enhance long-form VL generation.
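The per-segment normalization described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `eps` stabilizer, the token-to-segment mapping, and the convex blend with weight `alpha` are all assumptions made for illustration, and the paper's actual blending rule may differ.

```python
import numpy as np

def scalar_advantages(rollout_rewards, eps=1e-8):
    """GRPO-style baseline: one z-normalized scalar advantage per rollout."""
    r = np.asarray(rollout_rewards, dtype=float)           # shape (G,)
    return (r - r.mean()) / (r.std() + eps)

def segment_advantages(seg_rewards, eps=1e-8):
    """z-normalize each segment's reward across the rollout group.

    seg_rewards has shape (G, K): G rollouts, K segments per rollout.
    Returns a (G, K) array, i.e. a vector of K advantages per rollout.
    """
    r = np.asarray(seg_rewards, dtype=float)
    mean = r.mean(axis=0, keepdims=True)                   # per-segment mean over the group
    std = r.std(axis=0, keepdims=True)                     # per-segment std over the group
    return (r - mean) / (std + eps)

def token_advantages(seg_adv_row, token_segment_ids):
    """Assign each token the advantage of the segment it belongs to (hypothetical mapping)."""
    return np.asarray(seg_adv_row, dtype=float)[np.asarray(token_segment_ids)]

def blended_segment_advantages(seg_rewards, holistic_rewards, alpha=0.5, eps=1e-8):
    """Hypothetical blend: mix a holistic reward into every segment reward, then normalize."""
    seg = np.asarray(seg_rewards, dtype=float)
    hol = np.asarray(holistic_rewards, dtype=float)[:, None]  # broadcast over segments
    return segment_advantages(alpha * hol + (1.0 - alpha) * seg, eps)

# Toy group: 4 rollouts, 2 segments, binary verifiable rewards.
seg_rewards = np.array([[1, 0],
                        [0, 0],
                        [1, 1],
                        [0, 1]])
per_segment = segment_advantages(seg_rewards)
holistic = scalar_advantages(seg_rewards.sum(axis=1))
```

In this toy group, rollout 0 gets segment 0 right and segment 1 wrong. Its summed reward equals the group mean, so the scalar advantage is zero and neither segment receives a learning signal, whereas the per-segment version gives the first segment a positive advantage and the second a negative one.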

Problem

Research questions and friction points this paper is trying to address.

vision-language generation

long-form output

credit assignment

segment decomposition

multimodal LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Segment-Decomposed GRPO

per-segment advantage

long-form vision-language generation