CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

📅 2026-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of conventional supervised fine-tuning for vision-language captioning, which relies on costly annotated data and often leads models to memorize fixed answers, hindering diverse and generalizable visual descriptions. The authors propose CapRL++, a decoupled two-stage training framework that needs no reference captions. It defines caption quality by utility: the reward is the accuracy of a text-only language model answering multiple-choice questions about the visual content from the generated caption alone, which makes the signal verifiable. Captioning models for both images and videos are then optimized against this reward with reinforcement learning. Evaluated on more than twenty image and video benchmarks, CapRL++ improves dense caption quality and caption-based pretraining, and compact models trained with it reach dense captioning performance comparable to much larger models such as Qwen2.5-VL-72B.
📝 Abstract
Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable annotations and often causes models to memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome these limitations, we propose applying Reinforcement Learning with Verifiable Rewards (RLVR) to the open-ended task of multimodal captioning. We introduce Captioning Reinforcement Learning++ (CapRL++), a novel reference-free training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding visual content. CapRL++ employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. Evaluations on more than 20 image and video benchmarks show that CapRL++ improves dense caption quality and strengthens caption-based pretraining across tasks such as spatial and temporal understanding. Pretraining on scalable image and video caption datasets annotated by CapRL++ yields substantial downstream gains. Furthermore, within the Prism Framework for caption quality evaluation, compact models trained with CapRL++ achieve dense captioning performance comparable to substantially larger models such as Qwen2.5-VL-72B and Qwen3-VL-235B-A22B. These results validate that CapRL++ effectively trains models to produce generalizable, high-fidelity descriptions, establishing a robust foundation beyond the limitations of traditional SFT.
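The utility-based reward described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `MCQ` schema, the `Answerer` interface, and the toy `keyword_answerer` are hypothetical names introduced here, and the keyword matcher merely stands in for the vision-free LLM. The only part taken from the abstract is the idea that the reward is the accuracy of a text-only answerer on multiple-choice questions, given the caption alone.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MCQ:
    """One multiple-choice question about the visual input (hypothetical schema)."""
    question: str
    options: Sequence[str]  # formatted as "<letter>. <text>", e.g. "A. red"
    answer: str             # correct option letter, e.g. "B"


# Hypothetical interface: maps (caption, question, options) to an option letter.
# In a real pipeline this would wrap a call to a vision-free LLM.
Answerer = Callable[[str, str, Sequence[str]], str]


def caption_reward(caption: str, questions: Sequence[MCQ], answer_fn: Answerer) -> float:
    """Return the fraction of questions answered correctly from the caption alone."""
    if not questions:
        return 0.0
    correct = 0
    for q in questions:
        predicted = answer_fn(caption, q.question, q.options).strip().upper()
        if predicted == q.answer.strip().upper():
            correct += 1
    return correct / len(questions)


def keyword_answerer(caption: str, question: str, options: Sequence[str]) -> str:
    """Toy stand-in for the LLM: pick the first option whose text appears in the caption."""
    text = caption.lower()
    for option in options:
        letter, _, body = option.partition(".")
        if body.strip().lower() in text:
            return letter.strip()
    return ""  # abstain when the caption supports no option


questions = [
    MCQ("What color is the car?", ["A. red", "B. blue"], "B"),
    MCQ("Where is the car parked?", ["A. garage", "B. beach"], "A"),
]
detailed = "A blue car is parked inside a garage."
vague = "A car."

print(caption_reward(detailed, questions, keyword_answerer))
print(caption_reward(vague, questions, keyword_answerer))
```

In this toy setup the detailed caption lets the answerer resolve both questions, while the vague caption supports neither, so the denser caption earns the higher reward. An RL trainer would use a scalar of this kind as the per-caption reward. How the questions are produced and how the answerer is prompted are not specified in this summary and are left open here.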
Problem

Research questions and friction points this paper is trying to address.

image captioning
video captioning
supervised fine-tuning
generalization
diverse description
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning with Verifiable Rewards
reference-free captioning
dense image and video captioning
vision-language pretraining
utility-based reward