Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption

📅 2025-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video detailed captioning (VDC) methods suffer from two key limitations: dimensional capability skew—where models overemphasize certain attributes while neglecting others—and misalignment with human preferences. To address these, we propose a three-stage collaborative training framework. First, we construct a fine-grained, human-annotated scorer to filter high-quality synthetic captions. Second, we train the Cockatiel-13B large language model on the curated dataset. Third, we perform human preference-aligned knowledge distillation to obtain the lightweight Cockatiel-8B variant. This framework pioneers a novel paradigm integrating synthetic data optimization with preference-driven distillation. It establishes a new state-of-the-art on VDCSCORE with balanced performance across all dimensions. Human evaluation confirms significant improvements in naturalness, accuracy, and completeness over prior methods, marking the first systematic enhancement of alignment between VDC outputs and human judgments.

📝 Abstract
Video Detailed Captioning (VDC) is a crucial task for vision-language bridging, enabling fine-grained descriptions of complex video content. In this paper, we first comprehensively benchmark current state-of-the-art approaches and systematically identify two critical limitations: biased capability towards specific captioning aspects and misalignment with human preferences. To address these deficiencies, we propose Cockatiel, a novel three-stage training pipeline that ensembles synthetic and human-aligned training to improve VDC performance. In the first stage, we derive a scorer from a meticulously annotated dataset to select synthetic captions that perform well on certain fine-grained video-caption alignment dimensions and match human preferences, while disregarding the rest. We then train Cockatiel-13B on this curated dataset to infuse it with the assembled strengths of multiple models as well as human preferences. Finally, we distill Cockatiel-8B from Cockatiel-13B for ease of use. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method: we not only set new state-of-the-art performance on VDCSCORE in a dimension-balanced way but also surpass leading alternatives on human preference by a large margin, as shown by the human evaluation results.
Problem

Research questions and friction points this paper is trying to address.

Address biased capability across specific video captioning aspects
Align video captions with human preferences
Improve fine-grained video-caption alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-stage training pipeline for video captioning
Ensembles synthetic and human-aligned training data
Distills smaller model from larger for usability
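The first stage of the pipeline described above, where a learned scorer filters synthetic captions so that only those strong on every fine-grained dimension (and preferred by humans) enter the training set, can be sketched as follows. This is a minimal illustration, not the authors' code: the function and dimension names (`score_caption`, `DIMENSIONS`, `threshold`) are assumptions, and the dummy scorer stands in for the trained human-annotated scorer model.

```python
# Illustrative dimensions a fine-grained caption scorer might rate;
# the paper's actual dimensions may differ.
DIMENSIONS = ["object", "action", "background", "camera", "detail"]

def score_caption(video_id: str, caption: str) -> dict:
    """Stand-in for the learned scorer: returns a per-dimension score in [0, 1].
    A real implementation would run a trained scorer model on (video, caption)."""
    return {dim: 0.9 for dim in DIMENSIONS}  # dummy scores for illustration

def select_captions(candidates: dict, threshold: float = 0.8) -> dict:
    """For each video, keep the candidate caption with the best *minimum*
    dimension score, provided it clears the threshold on every dimension.
    Using the minimum enforces dimension-balanced quality rather than
    letting one strong aspect mask weak ones."""
    curated = {}
    for video_id, captions in candidates.items():
        best, best_min = None, -1.0
        for caption in captions:
            scores = score_caption(video_id, caption)
            worst = min(scores.values())
            if worst >= threshold and worst > best_min:
                best, best_min = caption, worst
        if best is not None:
            curated[video_id] = best
    return curated

# Ensembling in this view: candidates come from several captioner models,
# and the scorer decides which one survives per video.
curated = select_captions({"vid_001": ["a man walks a dog in a park", "a dog"]})
print(curated)
```

The curated output would then serve as the supervised training set for Cockatiel-13B in stage two.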
Luozheng Qin
Shanghai Academy of AI for Science
generative model, text-to-image generation, neck-choking technology
Zhiyu Tan
Shanghai Academy of Artificial Intelligence for Science, Fudan University
Mengping Yang
East China University of Science and Technology
Few-shot Learning, Generative Models
Xiaomeng Yang
Shanghai Academy of Artificial Intelligence for Science
Hao Li
Shanghai Academy of Artificial Intelligence for Science, Fudan University