TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

📅 2024-10-25
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address the challenge of long-video understanding in multimodal large language models (MLLMs), this paper proposes TimeSuite, a framework that adapts short-video-pretrained MLLMs to long videos through three key innovations: (1) temporal-aware visual representations via TAPE (Temporal Adaptive Position Encoding); (2) a token shuffling mechanism that efficiently compresses long-video tokens; and (3) a new Temporal Grounded Caption task and grounded instruction-tuning paradigm, supported by TimePro, a large-scale grounding-centric instruction dataset (349K samples across 9 tasks). Evaluated on EgoSchema and VideoMME, TimeSuite achieves absolute improvements of 5.6% and 6.8%, respectively. VideoChat-T, the instantiated model, demonstrates strong zero-shot temporal grounding capability and, after fine-tuning, matches the performance of supervised expert models.

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in short video understanding. However, understanding long-form videos remains challenging for MLLMs. This paper proposes TimeSuite, a collection of new designs to adapt existing short-form video MLLMs for long video understanding, including a simple yet efficient framework to process long video sequences, a high-quality video dataset for grounded tuning of MLLMs, and a carefully designed instruction tuning task that explicitly incorporates grounding supervision into the traditional QA format. Specifically, based on VideoChat, we propose our long-video MLLM, coined VideoChat-T, by implementing token shuffling to compress long video tokens and introducing Temporal Adaptive Position Encoding (TAPE) to enhance the temporal awareness of visual representations. Meanwhile, we introduce TimePro, a comprehensive grounding-centric instruction tuning dataset composed of 9 tasks and 349k high-quality grounded annotations. Notably, we design a new instruction tuning task type, called Temporal Grounded Caption, which performs detailed video description with corresponding timestamp prediction. This explicit temporal location prediction guides the MLLM to correctly attend to the visual content when generating descriptions, thus reducing the hallucination risk caused by LLMs. Experimental results demonstrate that TimeSuite provides a successful solution to enhance the long video understanding capability of short-form MLLMs, achieving improvements of 5.6% and 6.8% on the EgoSchema and VideoMME benchmarks, respectively. In addition, VideoChat-T exhibits robust zero-shot temporal grounding capabilities, significantly outperforming existing state-of-the-art MLLMs. After fine-tuning, it performs on par with traditional supervised expert models.
Problem

Research questions and friction points this paper is trying to address.

Enhancing long-video understanding for MLLMs
Designing an efficient framework for processing long video sequences
Reducing hallucination risk in video descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token shuffling compresses long-video token sequences.
Temporal Adaptive Position Encoding (TAPE) enhances the temporal awareness of visual representations.
Temporal Grounded Caption ties descriptions to timestamps, reducing hallucination risk.
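The token-compression idea above can be sketched as follows. This is a minimal illustration assuming a pixel-shuffle-style merge of adjacent visual tokens followed by a linear projection; the function name, group size, and projection matrix are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def token_shuffle(tokens: np.ndarray, group: int, proj: np.ndarray) -> np.ndarray:
    """Merge each run of `group` adjacent tokens into one.

    tokens: (num_tokens, dim) visual token sequence
    proj:   (group * dim, dim) learned projection back to the model dimension
    Returns (num_tokens // group, dim), i.e. `group`x fewer tokens.
    """
    n, d = tokens.shape
    assert n % group == 0, "token count must be divisible by the group size"
    # Concatenate the channels of each group of adjacent tokens...
    merged = tokens.reshape(n // group, group * d)
    # ...then project the widened vectors back to the original dimension.
    return merged @ proj

rng = np.random.default_rng(0)
tokens = rng.standard_normal((96 * 16, 256))              # e.g. 96 frames x 16 tokens each
proj = rng.standard_normal((4 * 256, 256)) / (4 * 256) ** 0.5
compressed = token_shuffle(tokens, group=4, proj=proj)
print(compressed.shape)  # (384, 256): 4x fewer tokens, same dimension
```

In a real model the projection would be a trained layer; the point of the sketch is only that grouping adjacent tokens channel-wise shortens the sequence the LLM must attend over without discarding frames outright.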
👥 Authors

Xiangyu Zeng
Nanjing University, Shanghai AI Laboratory

Kunchang Li
ByteDance Seed
Video Understanding · Multimodal Learning

Chenting Wang
Shanghai Jiao Tong University
Computer Vision · Video Understanding

Xinhao Li
Nanjing University
Video Understanding · Multimodal LLM · Vision-Language Learning

Tianxiang Jiang
University of Science and Technology of China, Shanghai AI Laboratory

Ziang Yan
Zhejiang University, Shanghai AI Laboratory

Songze Li
Fudan University, Shanghai AI Laboratory

Yansong Shi
University of Science and Technology of China, Shanghai AI Laboratory

Zhengrong Yue
Shanghai Jiao Tong University, PhD
Unified Multimodal Modeling · Video Understanding · Video Generation

Yi Wang
Shanghai AI Laboratory

Yali Wang
SIAT, Chinese Academy of Sciences, Shanghai AI Laboratory

Yu Qiao
Shanghai AI Laboratory

Limin Wang
Nanjing University, Shanghai AI Laboratory