🤖 AI Summary
This work addresses the video detailed captioning (VDC) task with the first self-evolving framework that requires neither human annotations nor a large teacher model. Methodologically, it establishes a closed-loop "generation–scoring–prompt optimization" pipeline, combining a principle-guided automatic scoring mechanism with a self-reflective correction path to construct the high-quality preference dataset VDC-Agent-19K. Training builds on Qwen2.5-VL-7B-Instruct and uses chain-of-thought reasoning, hierarchical prompt engineering, and difficulty-graded direct preference optimization (DPO). On the VDC benchmark, the method achieves 49.08% average accuracy (+5.13% absolute over the base model) and a composite score of 2.50, surpassing specialized captioning models at comparable inference cost. This is the first demonstration of autonomous self-evolution in video captioning models.
📝 Abstract
We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (producing a numeric score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert these trajectories into preference tuples and filter out samples with JSON parsing errors, yielding VDC-Agent-19K, a dataset of 18,886 automatically constructed preference pairs. We then fine-tune the base MLLM on this dataset with easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and a 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
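The trajectory-to-preference conversion described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the record format (`{"caption": ..., "score": ...}` JSON strings), the pairing rule, and the function name `trajectory_to_preferences` are all assumptions; it only mirrors the stated ideas of dropping samples with JSON parsing errors, pairing higher- vs. lower-scored captions, and ordering pairs easy-to-hard by score gap for curriculum DPO.

```python
import json


def trajectory_to_preferences(raw_records, min_gap=0.0):
    """Hypothetical sketch: parse a trajectory of JSON records,
    drop records that fail JSON parsing (as in VDC-Agent-19K filtering),
    and build (chosen, rejected) preference tuples from score differences."""
    steps = []
    for raw in raw_records:
        try:
            rec = json.loads(raw)
            steps.append((rec["caption"], float(rec["score"])))
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue  # filter out samples with JSON parsing errors

    pairs = []
    for cap_i, score_i in steps:
        for cap_j, score_j in steps:
            gap = score_i - score_j
            if gap > min_gap:
                pairs.append({"chosen": cap_i, "rejected": cap_j, "gap": gap})

    # Easy-to-hard curriculum: large score gaps (clear preferences) first.
    pairs.sort(key=lambda p: -p["gap"])
    return pairs
```

In a curriculum DPO setup, the sorted pairs would then be fed to the trainer in order, so the model sees unambiguous preferences before near-ties.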