🤖 AI Summary
Current large vision-language models (VLMs) struggle to follow instructions that demand fine-grained spatial control over video temporal sequences, which hinders intent-driven controllable video captioning. To address this, we propose a unified spatiotemporal modeling framework with two components: (1) a prompt combination strategy that jointly encodes temporal instructions and spatial localization priors; and (2) a parameter-efficient, plug-and-play Box Adapter that injects bounding-box-derived object semantics into the global visual context, giving the visual tokens prior knowledge of the user's intent. Our method delivers consistent improvements across multiple state-of-the-art open-source VLMs, achieving state-of-the-art results and finishing as runner-up in the IntentVC Challenge, with clear gains in both the accuracy of intent-aligned descriptions and the fidelity of generated spatial detail.
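To make the prompt combination idea concrete, below is a minimal sketch of how user intent and per-frame spatial localization priors might be folded into a single instruction. The template wording and the `format_intent_prompt` helper are hypothetical illustrations, not the authors' actual prompt design.

```python
def format_intent_prompt(intent: str, boxes: list[tuple[float, float, float, float]]) -> str:
    """Hypothetical template combining user intent with spatial priors.

    Each box is (x1, y1, x2, y2), normalized to [0, 1], for one sampled
    frame, so the LLM can relate the textual intent to the target
    object's trajectory across the video sequence.
    """
    box_lines = "\n".join(
        f"Frame {i}: <image> target at ({x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f})"
        for i, (x1, y1, x2, y2) in enumerate(boxes)
    )
    return (
        f"{box_lines}\n"
        f"Describe only the target object above, following this intent: {intent}"
    )
```

In a setup like this, the `<image>` placeholders would be replaced by frame features at the model's multimodal interface, while the interleaved box coordinates act as the textual spatial prior.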
📝 Abstract
Intent-oriented controlled video captioning aims to generate targeted descriptions of specific objects in a video according to customized user intent. Current Large Vision-Language Models (LVLMs) have gained strong instruction-following and visual comprehension capabilities. Although LVLMs demonstrate proficiency in spatial and temporal understanding respectively, they are unable to perform fine-grained spatial control over temporal sequences in direct response to instructions. This spatio-temporal gap complicates efforts to achieve fine-grained, intent-oriented control in video. To this end, we propose IntentVCNet, a novel framework that unifies the temporal and spatial understanding knowledge inherent in LVLMs to bridge the spatio-temporal gap from both the prompting and model perspectives. Specifically, we first propose a prompt combination strategy designed to enable the LLM to model the implicit relationship between prompts that characterize user intent and video sequences. We then propose a parameter-efficient box adapter that augments object semantic information in the global visual context, so that the visual tokens carry prior information about the user intent. Experiments show that combining the two strategies further enhances the LVLM's ability to model spatial details in video sequences and enables LVLMs to accurately generate controlled, intent-oriented captions. Our method achieved state-of-the-art results with several open-source LVLMs and was the runner-up in the IntentVC challenge. Our code is available at https://github.com/thqiu0419/IntentVCNet.
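The abstract does not give implementation details of the box adapter, but the core idea, injecting bounding-box-derived object semantics into the visual token stream as a parameter-efficient, plug-and-play module, can be sketched roughly as follows. The module name `BoxAdapter`, the MLP structure, and the additive fusion are assumptions for illustration, not the authors' actual design.

```python
import torch
import torch.nn as nn

class BoxAdapter(nn.Module):
    """Hypothetical sketch of a parameter-efficient box adapter.

    Encodes a normalized bounding box (x1, y1, x2, y2) into the same
    dimension as the visual tokens and adds it as a spatial prior, so
    the visual features carry information about the user-intended
    object before they reach the language model.
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        # A small MLP keeps the adapter lightweight and plug-and-play.
        self.box_mlp = nn.Sequential(
            nn.Linear(4, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, visual_tokens: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, hidden_dim)
        # boxes: (batch, 4), coordinates normalized to [0, 1]
        box_emb = self.box_mlp(boxes).unsqueeze(1)  # (batch, 1, hidden_dim)
        return visual_tokens + box_emb              # broadcast over all tokens
```

Under this sketch, the adapter would sit between the vision encoder and the LLM; the base LVLM weights could stay frozen while only the adapter's few parameters are trained, consistent with the parameter-efficient framing.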