IntentVCNet: Bridging Spatio-Temporal Gaps for Intention-Oriented Controllable Video Captioning

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large vision-language models (VLMs) exhibit limited capability in executing instruction-following with fine-grained spatial control over video temporal sequences, hindering intent-driven controllable video captioning. To address this, we propose a unified spatio-temporal modeling framework: (1) a prompt composition strategy that jointly encodes temporal instructions and spatial localization priors; and (2) a parameter-efficient, plug-and-play Box Adapter that explicitly aligns bounding boxes with semantic intents. Furthermore, we enhance global visual context modeling via object-semantic augmentation to strengthen spatial intent perception. Our method achieves consistent improvements across multiple state-of-the-art VLMs. It placed second in the IntentVC Challenge and sets new state-of-the-art performance, significantly boosting both the accuracy of intent-aligned descriptions and the fidelity of generated spatial detail.
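The summary's prompt composition strategy jointly encodes a user's temporal instruction with per-frame spatial localization priors. The paper does not give the exact prompt template, so the following is a minimal sketch of the general idea; the function name, the `<box>` tag format, and the instruction wording are all assumptions:

```python
def compose_prompt(intent, boxes_per_frame):
    """Hypothetical sketch: fuse a user-intent instruction with per-frame
    bounding-box priors into a single prompt (format is an assumption)."""
    box_lines = [
        f"frame {t}: target at <box>{x1:.2f},{y1:.2f},{x2:.2f},{y2:.2f}</box>"
        for t, (x1, y1, x2, y2) in enumerate(boxes_per_frame)
    ]
    return (
        f"Describe only the target object. User intent: {intent}\n"
        + "\n".join(box_lines)
    )
```

A prompt built this way lets the model relate the textual intent to concrete spatial coordinates at each time step, which is the implicit relationship the strategy asks the model to learn.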

📝 Abstract
Intention-oriented controllable video captioning aims to generate targeted descriptions for specific objects in a video according to customized user intent. Current Large Vision-Language Models (LVLMs) have gained strong instruction-following and visual comprehension capabilities. Although LVLMs demonstrate proficiency in spatial and temporal understanding separately, they are unable to perform fine-grained spatial control over temporal sequences in direct response to instructions. This substantial spatio-temporal gap complicates efforts to achieve fine-grained intention-oriented control in video. To this end, we propose a novel IntentVCNet that unifies the temporal and spatial understanding knowledge inherent in LVLMs to bridge the spatio-temporal gap from both the prompting and model perspectives. Specifically, we first propose a prompt combination strategy designed to enable the LVLM to model the implicit relationship between prompts that characterize user intent and video sequences. We then propose a parameter-efficient box adapter that augments the object semantic information in the global visual context so that the visual tokens carry prior information about the user intent. Experiments show that the combination of the two strategies further enhances the LVLM's ability to model spatial details in video sequences and facilitates accurate generation of controlled intention-oriented captions. Our proposed method achieved state-of-the-art results across several open-source LVLMs and was the runner-up in the IntentVC challenge. Our code is available at https://github.com/thqiu0419/IntentVCNet.
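The box adapter described above injects object-level priors into the global visual context. Its exact architecture is not given in the abstract, so the following NumPy sketch only illustrates the general mechanism of a lightweight adapter; the shapes, weight matrices, and fusion rule are assumptions, not the paper's design:

```python
import numpy as np

def box_adapter(visual_tokens, boxes, W_box, W_out):
    """Hypothetical sketch of a parameter-efficient box adapter.

    visual_tokens: (T, N, d) per-frame visual tokens from the vision encoder
    boxes:         (T, 4) normalized (x1, y1, x2, y2) target box per frame
    W_box, W_out:  small learnable projections (the only adapter parameters)
    """
    box_emb = np.tanh(boxes @ W_box)             # (T, d) embed the box prior
    fused = visual_tokens + box_emb[:, None, :]  # broadcast prior over tokens
    return fused @ W_out                         # lightweight output projection
```

Because only `W_box` and `W_out` are trainable, such an adapter can be plugged into a frozen LVLM at low cost, which matches the parameter-efficient, plug-and-play framing in the abstract.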
Problem

Research questions and friction points this paper is trying to address.

Bridging spatio-temporal gaps in video captioning
Enhancing fine-grained spatial control in time sequences
Generating intent-oriented captions with user customization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies temporal and spatial understanding in LVLMs
Uses prompt combination strategy for intent modeling
Implements parameter-efficient box adapter for semantics
Tianheng Qiu
University of Science and Technology of China
Jingchun Gao
University of Science and Technology of China, Hefei, China
Jingyu Li
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China; State Key Lab. for Novel Software Technology, Nanjing University, Nanjing, China
Huiyi Leong
University of Chicago, Chicago, USA
Xuan Huang
Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, China
Xi Wang
National University of Defense Technology, Changsha, China
Xiaocheng Zhang
Harbin Institute of Technology, Harbin, China
Kele Xu
National University of Defense Technology, Changsha, China
Lan Zhang
University of Science and Technology of China, Hefei, China