🤖 AI Summary
This work addresses the video detailed captioning (VDC) task with the first self-evolving framework that requires neither human annotations nor a large teacher model. Methodologically, it establishes a closed-loop "generation–scoring–prompt optimization" pipeline, combining a principle-guided automatic scoring mechanism with a self-reflective correction path to construct the high-quality preference dataset VDC-Agent-19K. Training builds on Qwen2.5-VL-7B-Instruct and uses chain-of-thought reasoning, hierarchical prompt engineering, and difficulty-graded direct preference optimization (DPO). On the VDC benchmark, the method achieves 49.08% average accuracy (+5.13% absolute over the base model) and a composite score of 2.50, surpassing specialized captioning models at comparable inference cost. This is the first demonstration of autonomous self-evolution in video captioning models.
📝 Abstract
We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (producing a numeric score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert these trajectories into preference tuples and filter out samples with JSON parsing errors, yielding VDC-Agent-19K, a dataset of 18,886 automatically constructed preference pairs. We then fine-tune the base MLLM on this dataset with easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and a 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
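The trajectory-to-preference conversion described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the record format (`{"caption": ..., "score": ...}` JSON strings), the pairing rule, and the function name `trajectory_to_preferences` are all assumptions; it only mirrors the stated ideas of dropping samples with JSON parsing errors, pairing higher- vs. lower-scored captions, and ordering pairs easy-to-hard by score gap for curriculum DPO.

```python
import json


def trajectory_to_preferences(raw_records, min_gap=0.0):
    """Hypothetical sketch: parse a trajectory of JSON records,
    drop records that fail JSON parsing (as in VDC-Agent-19K filtering),
    and build (chosen, rejected) preference tuples from score differences."""
    steps = []
    for raw in raw_records:
        try:
            rec = json.loads(raw)
            steps.append((rec["caption"], float(rec["score"])))
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue  # filter out samples with JSON parsing errors

    pairs = []
    for cap_i, score_i in steps:
        for cap_j, score_j in steps:
            gap = score_i - score_j
            if gap > min_gap:
                pairs.append({"chosen": cap_i, "rejected": cap_j, "gap": gap})

    # Easy-to-hard curriculum: large score gaps (clear preferences) first.
    pairs.sort(key=lambda p: -p["gap"])
    return pairs
```

In a curriculum DPO setup, the sorted pairs would then be fed to the trainer in order, so the model sees unambiguous preferences before near-ties.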