SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning

📅 2025-08-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the weak self-correction capability of image captioning models. SC-Captioner is a reinforcement learning framework that trains a model to refine its own captions: predicted and reference captions are decomposed via scene-graph parsing into object, attribute, and relation sets, and a set-difference reward compares the initial and self-corrected captions, granting correctness bonuses for accurate refinements and mistake penalties for wrong additions and removals. For evaluation, the authors refine the existing CAPTURE metric to address its incomplete precision evaluation and inefficient relation matching, and collect RefinedCaps, a fine-grained annotated caption dataset of 6.5K diverse COCO images. Applied to large vision-language models, SC-Captioner produces better captions across various scenarios and significantly outperforms direct preference optimization.

📝 Abstract
We propose SC-Captioner, a reinforcement learning framework that enables the self-correcting capability of image captioning models. Our crucial technique lies in the design of the reward function to incentivize accurate caption corrections. Specifically, the predicted and reference captions are decomposed into object, attribute, and relation sets using scene-graph parsing algorithms. We calculate the set difference between the sets of initial and self-corrected captions to identify added and removed elements. These elements are matched against the reference sets to calculate correctness bonuses for accurate refinements and mistake punishments for wrong additions and removals, thereby forming the final reward. For image caption quality assessment, we propose a set of metrics refined from CAPTURE that alleviate its incomplete precision evaluation and inefficient relation matching problems. Furthermore, we collect a fine-grained annotated image caption dataset, RefinedCaps, consisting of 6.5K diverse images from the COCO dataset. Experiments show that applying SC-Captioner to large vision-language models generates better image captions across various scenarios, significantly outperforming the direct preference optimization training strategy.
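The set-difference reward in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the `bonus`/`penalty` weights, and the tuple encoding of semantic units are assumptions, and scene-graph parsing is taken as already done, yielding one flat set of units per caption.

```python
def correction_reward(initial, corrected, reference,
                      bonus=1.0, penalty=1.0):
    """Score a self-correction step from set differences.

    initial, corrected, reference: sets of parsed semantic units,
    e.g. ("dog",) for an object, ("dog", "brown") for an attribute,
    ("dog", "on", "sofa") for a relation.
    """
    added = corrected - initial    # units introduced by the correction
    removed = initial - corrected  # units deleted by the correction

    reward = 0.0
    # Correctness bonus: additions that the reference confirms,
    # and removals of units the reference never contained.
    reward += bonus * len(added & reference)
    reward += bonus * len(removed - reference)
    # Mistake punishment: added units absent from the reference
    # (hallucinations), and removals of genuinely correct units.
    reward -= penalty * len(added - reference)
    reward -= penalty * len(removed & reference)
    return reward
```

For example, if the correction adds the correct attribute ("dog", "brown") and removes a hallucinated ("cat",), both changes earn a bonus and the reward is 2.0 under the default weights.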
Problem

Research questions and friction points this paper is trying to address.

Enhancing image captioning accuracy with self-correction via reinforcement learning
Designing reward functions to incentivize precise caption corrections
Improving evaluation metrics for comprehensive image caption quality assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning for self-correcting captions
Scene-graph parsing for reward calculation
Refined metrics for caption quality assessment
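The evaluation side rests on matching parsed semantic units between predicted and reference captions. A minimal sketch of such a set-level score is below; it uses exact matching for simplicity, whereas CAPTURE-style metrics employ softer, synonym-aware matching, so this is an illustration of the idea rather than the paper's metric.

```python
def unit_f1(predicted, reference):
    """F1 over exactly matched semantic units (objects, attributes,
    relations) from scene-graph parsing of two captions."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)  # units present in both captions
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)
```

Scoring objects, attributes, and relations as separate sets and averaging the three F1 values gives a per-category breakdown of caption fidelity.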
Lin Zhang
College of Future Information Technology, Fudan University
Xianfang Zeng
StepFun
Kangcong Li
Fudan University
Gang Yu
StepFun
Tao Chen
College of Future Information Technology, Fudan University; Shanghai Innovation Institute