OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

📅 2026-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Omni-modal large language models (OLLMs) can jointly process audio and visual streams, but how well they follow complex, multi-faceted user instructions remains largely untested: existing benchmarks focus on holistic video understanding or text-only instruction following. This work introduces OmniCap-IF, the first benchmark for instruction following in omni-modal captioning. It scores captions on format correctness and content correctness, covers 50 constraint types across visual-only, audio-only, and audio-visual settings, and integrates temporal grounding to assess spatio-temporal precision. Evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities and a format-content tradeoff: as formatting complexity increases, omni-modal reasoning degrades. The authors also curate a 54K-example instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which improves both complex instruction adherence and general omni-modal captioning performance.
📝 Abstract
While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.
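To make the "format correctness" dimension concrete, here is a minimal Python sketch of how rule-based format constraints on a caption could be checked. It is an illustration only: the constraint names (`max_words`, `bullet_count`, `has_timestamps`), the `[MM:SS - MM:SS]` timestamp layout, and the pass-rate score are all hypothetical and are not taken from the paper. Content correctness would likely need a reference or a judge model and is not covered here.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class FormatConstraint:
    """A named, rule-based check on caption text (hypothetical structure)."""
    name: str
    check: Callable[[str], bool]


def max_words(limit: int) -> FormatConstraint:
    # Passes when the caption has at most `limit` whitespace-separated tokens.
    return FormatConstraint(f"max_words_{limit}", lambda c: len(c.split()) <= limit)


def bullet_count(n: int) -> FormatConstraint:
    # Passes when exactly `n` lines start with "- ".
    def check(caption: str) -> bool:
        bullets = [ln for ln in caption.splitlines() if ln.strip().startswith("- ")]
        return len(bullets) == n
    return FormatConstraint(f"bullet_count_{n}", check)


def has_timestamps() -> FormatConstraint:
    # Passes when at least one [MM:SS - MM:SS] span appears (assumed layout).
    pattern = re.compile(r"\[\d{2}:\d{2}\s*-\s*\d{2}:\d{2}\]")
    return FormatConstraint("has_timestamps", lambda c: bool(pattern.search(c)))


def format_score(caption: str, constraints: list[FormatConstraint]):
    """Return (fraction of constraints satisfied, per-constraint results)."""
    if not constraints:
        return 1.0, {}
    results = {c.name: c.check(caption) for c in constraints}
    return sum(results.values()) / len(results), results


if __name__ == "__main__":
    caption = "- [00:00 - 00:05] A dog barks.\n- [00:05 - 00:10] A door slams."
    score, details = format_score(
        caption, [max_words(20), bullet_count(2), has_timestamps()]
    )
    print(score, details)
```

Stacking more constraints of this kind onto a single instruction is one simple way to vary formatting complexity when probing the format-content tradeoff the abstract describes.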
Problem

Research questions and friction points this paper is trying to address.

instruction following
omni-modal captioning
audio-visual understanding
temporal grounding
multimodal constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction following
omni-modal captioning
Temporal Grounding
format-content tradeoff
benchmark
Authors
Jiahao Wang
Kuaishou Technology | NJU
computer vision
An Ping
NJU-LINK Team, Nanjing University
Yanghai Wang
NJU-LINK Team, Nanjing University
Yuanxing Zhang
Kuaishou Technology
Recommender System, Large Language Model, Video Understanding
Shihao Li
NJU-LINK Team, Nanjing University
Hanyan Bian
NJU-LINK Team, Nanjing University
Yichi Ren
NJU-LINK Team, Nanjing University
Yize Zhang
NJU-LINK Team, Nanjing University
Han Wang
NJU-LINK Team, Nanjing University
Haowen Chen
NJU-LINK Team, Nanjing University
Junze Li
The Hong Kong University of Science and Technology
Human-Computer Interaction, Natural Language Processing
Jiaqi Wang
Unknown affiliation
Yiyang Hu
NJU-LINK Team, Nanjing University
Zhuze Xu
NJU-LINK Team, Nanjing University
Zijie Zhang
Assistant Professor, University of Texas at San Antonio
Trustworthy Machine Learning, Adversarial A/D, Federated Learning, Graph
Jiaheng Liu
NJU-LINK Team, Nanjing University