T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs

📅 2024-11-29
🏛️ arXiv.org
🤖 AI Summary
This work identifies insufficient instruction diversity as a primary cause of the low fine-tuning efficiency of current video multimodal large language models (Video-LLMs). To address this, the authors propose T2Vid, a method that synthesizes video-like training samples by converting long texts into multi-image sequences, thereby enriching instruction diversity without requiring additional real video data. T2Vid establishes a text-to-multi-image data augmentation paradigm, achieving performance on par with, or surpassing, full-video training while using only 15% of the original video samples. Experiments demonstrate strong results across multiple video understanding benchmarks, notably enhancing long-video comprehension without requiring long-video training data. The core contributions are: (i) revealing limited instruction diversity as a critical bottleneck when adapting image-LLMs to video; and (ii) introducing an efficient training scheme that supplements scarce video data with synthesized text-derived multi-image samples, reducing the need for costly video annotation and computation.

📝 Abstract
The success of Multimodal Large Language Models (MLLMs) in the image domain has garnered wide attention from the research community. Drawing on previous successful experiences, researchers have recently explored extending this success to the video understanding realm. Apart from training from scratch, an efficient way is to utilize pre-trained image-LLMs, leading to two mainstream approaches, i.e., zero-shot inference and further fine-tuning with video data. In this work, our study of these approaches yields an effective data augmentation method. We first make a deeper inspection of the zero-shot inference approach and identify two limitations, i.e., limited generalization and a lack of temporal understanding capabilities. We then investigate the fine-tuning approach and find low learning efficiency when simply using all the video data samples, which can be attributed to a lack of instruction diversity. Aiming at this issue, we develop a method called T2Vid to synthesize video-like samples that enrich the instruction diversity of the training corpus. Integrating these data enables a simple and efficient training scheme, which achieves performance comparable to or even superior to using full video datasets while training with just 15% of the sample size. Meanwhile, we find that the proposed scheme can boost long video understanding without training on long video samples. We hope our study will spark more thinking about using MLLMs for video understanding and the curation of high-quality data. The code is released at https://github.com/xjtupanda/T2Vid.
Problem

Research questions and friction points this paper is trying to address.

Investigates low learning efficiency in video-LLMs due to limited instruction diversity.
Proposes T2Vid, a text-to-multi-image augmentation method for video-LLM training.
Enhances long video understanding without requiring extensive long video data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes image-LLMs with video data
Uses text-to-multi-image augmentation for video-LLMs
Enhances training with synthetic video-like samples
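The text-to-multi-image idea above can be sketched in a few lines. This is a minimal, hypothetical illustration of the segmentation step only: the paper's pipeline additionally renders each text segment as an image and pairs the resulting sequence with instructions, and the function name `text_to_frames` is ours, not from the released code.

```python
def text_to_frames(long_text: str, n_frames: int = 8) -> list[str]:
    """Split a long text into at most n_frames roughly equal segments.

    Each segment would then be rendered as one image (e.g. with an
    off-the-shelf text-rendering library), so the ordered image
    sequence stands in for sampled video frames during fine-tuning.
    """
    words = long_text.split()
    per = max(1, -(-len(words) // n_frames))  # ceiling division
    return [" ".join(words[i:i + per]) for i in range(0, len(words), per)]
```

Splitting on word boundaries in reading order is the point: the resulting multi-image sequence mimics the temporal ordering of video frames, which is what lets text-derived samples diversify a video instruction-tuning corpus.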