VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the video detailed captioning (VDC) task with the first self-evolving framework that requires neither human annotations nor a larger teacher model. Methodologically, it establishes a closed-loop "generation, scoring, and prompt optimization" pipeline, incorporating a principle-guided automatic scoring mechanism and a self-reflective correction path, and runs this loop on unlabeled videos to construct the high-quality preference dataset VDC-Agent-19K. Training is performed on Qwen2.5-VL-7B-Instruct using chain-of-thought reasoning, hierarchical prompt engineering, and difficulty-graded direct preference optimization (DPO). On the VDC benchmark, the method achieves 49.08% average accuracy (+5.13% over the base model) and a 2.50 score, surpassing specialized video captioners while maintaining similar inference cost. This is the first demonstration of autonomous self-evolution in video captioning models.

📝 Abstract
We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
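The closed loop described in the abstract can be sketched as follows. This is a toy illustration only: every function body, name (`generate_caption`, `score_caption`, `refine_prompt`, `self_reflect`), and the scoring rule are stand-in assumptions, not the paper's implementation.

```python
def generate_caption(video, prompt):
    # Stand-in for the MLLM captioner conditioned on the current prompt.
    return f"caption({len(prompt)} cues) for {video}"

def score_caption(video, caption):
    # Stand-in for the principle-guided scorer: a numeric score plus
    # textual suggestions used to refine the prompt.
    score = len(caption) % 10
    return score, "add more temporal detail"

def refine_prompt(prompt, suggestions):
    # Incorporate the scorer's suggestions into the next prompt.
    return prompt + " | " + suggestions

def self_reflect(prompt, trajectory):
    # Self-reflection path: when quality regresses, amend the update using
    # the earlier trajectory (here: revert to the best-scoring prompt so far).
    best = max(range(len(trajectory)), key=lambda i: trajectory[i][1])
    return trajectory[best][2]

def run_agent(video, prompt, max_steps=4):
    """Generation -> scoring -> prompt refinement loop on one video.

    Returns the trajectory of (caption, score, prompt_used) triples that
    downstream preference-pair construction consumes.
    """
    trajectory = []
    prev_score = float("-inf")
    for _ in range(max_steps):
        caption = generate_caption(video, prompt)
        score, suggestions = score_caption(video, caption)
        trajectory.append((caption, score, prompt))
        if score < prev_score:
            prompt = self_reflect(prompt, trajectory)  # quality regressed
        else:
            prompt = refine_prompt(prompt, suggestions)
        prev_score = max(prev_score, score)
    return trajectory
```

Running this on unlabeled videos yields the (caption, score) trajectories the abstract mentions; no human labels enter the loop at any point.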
Problem

Research questions and friction points this paper is trying to address.

How to generate detailed video captions automatically, without human supervision
How to improve caption quality through agentic reflection and refinement, without a larger teacher model
How to build preference datasets for model optimization from unlabeled videos
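The abstract's conversion of trajectories into preference tuples, with filtering of JSON parsing failures, might look roughly like the sketch below. The pairing rule and field names (`chosen`, `rejected`, `margin`) are assumptions for illustration, not the paper's exact scheme.

```python
import json

def to_preference_pairs(trajectory, raw_scorer_outputs):
    """Turn a (caption, score) trajectory into DPO-style preference tuples.

    Steps whose raw scorer output fails JSON parsing are discarded, as the
    abstract describes for constructing VDC-Agent-19K.
    """
    valid = []
    for (caption, score), raw in zip(trajectory, raw_scorer_outputs):
        try:
            json.loads(raw)  # keep only parseable scorer outputs
            valid.append((caption, score))
        except json.JSONDecodeError:
            continue

    # Pair each higher-scoring caption (chosen) with each lower-scoring
    # one (rejected); the score gap serves as a difficulty signal later.
    pairs = []
    for cap_i, score_i in valid:
        for cap_j, score_j in valid:
            if score_i > score_j:
                pairs.append({"chosen": cap_i,
                              "rejected": cap_j,
                              "margin": score_i - score_j})
    return pairs
```

Applied across all videos, this kind of procedure would produce the 18,886 automatically constructed pairs of VDC-Agent-19K.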
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-evolving framework without human annotations
Agentic loop with principle-guided scoring and refinement
Curriculum direct preference optimization on auto-generated data
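The easy-to-hard curriculum over the auto-generated preference data can be sketched minimally. One plausible reading of "difficulty-graded DPO", assumed here, is that pairs with a larger chosen-rejected score gap are easier and are presented first; the paper's actual difficulty measure may differ.

```python
def curriculum_order(pairs):
    """Sort preference pairs from easy (large margin) to hard (small margin).

    Assumes each pair is a dict with a numeric "margin" field giving the
    chosen-rejected score gap; larger gap = easier preference (assumption).
    """
    return sorted(pairs, key=lambda p: -p["margin"])
```

Fine-tuning would then feed DPO batches in this order, so the model sees clear-cut preferences before ambiguous ones.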