🤖 AI Summary
Existing omni-modal large language models struggle to follow complex, multi-faceted user instructions, and no dedicated benchmark evaluates this ability. To address this gap, this work introduces OmniCap-IF, the first instruction-following benchmark tailored for omni-modal captioning. It covers 50 constraint types across visual-only, audio-only, and audio-visual settings, scores captions on both format correctness and content correctness, and incorporates temporal grounding to assess spatio-temporal precision. Evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities and a format-content tradeoff: as formatting complexity increases, the models' omni-modal reasoning degrades. The authors also curate a 54K-example instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.
📝 Abstract
While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.
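As a rough illustration of the format-correctness dimension, the sketch below scores a caption against a small set of rule-based format constraints. The abstract does not specify the benchmark's 50 constraint types or its scoring formula, so the three constraints shown (`max_words`, `bullet_count`, `has_timestamps`), the `[MM:SS - MM:SS]` timestamp layout, and the pass-rate score are all hypothetical stand-ins. Content correctness is not covered here, since it would typically require a judge model or reference annotations.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FormatConstraint:
    """A named, rule-based check on the surface form of a caption (hypothetical)."""
    name: str
    check: Callable[[str], bool]


def max_words(limit: int) -> FormatConstraint:
    # Passes when the caption has at most `limit` whitespace-separated tokens.
    return FormatConstraint(f"max_words_{limit}", lambda c: len(c.split()) <= limit)


def bullet_count(n: int) -> FormatConstraint:
    # Passes when exactly `n` lines start with a "- " bullet marker.
    def check(caption: str) -> bool:
        bullets = [ln for ln in caption.splitlines() if ln.lstrip().startswith("- ")]
        return len(bullets) == n
    return FormatConstraint(f"bullet_count_{n}", check)


def has_timestamps() -> FormatConstraint:
    # Passes when at least one [MM:SS - MM:SS] span appears (assumed layout).
    pattern = re.compile(r"\[\d{2}:\d{2}\s*-\s*\d{2}:\d{2}\]")
    return FormatConstraint("has_timestamps", lambda c: bool(pattern.search(c)))


def format_score(caption: str, constraints: List[FormatConstraint]) -> float:
    """Fraction of format constraints the caption satisfies (assumed metric)."""
    if not constraints:
        return 1.0
    passed = sum(c.check(caption) for c in constraints)
    return passed / len(constraints)


caption = (
    "- [00:00 - 00:05] A dog barks at the door.\n"
    "- [00:05 - 00:12] A woman opens it and laughs."
)
constraints = [max_words(30), bullet_count(2), has_timestamps()]
score = format_score(caption, constraints)
print(score)
```

Keeping each constraint as an independent predicate makes it straightforward to stack several of them in a single instruction, which is the setting in which the abstract reports a format-content tradeoff.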