Pretrained Image-Text Models are Secretly Video Captioners

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost, heavy reliance on large-scale video data, and complex temporal modeling of video captioning models, this work proposes a lightweight, efficient paradigm. We first uncover the latent video-understanding capability of pretrained image-text models (e.g., BLIP-2) and then post-train with only 6K video-text pairs. Our method combines frame concatenation for input encoding, reinforcement-learning-based optimization, and joint model scaling with a data-efficient design. The approach ranks second on MSRVTT and MSVD and third on VATEX, matching the performance of state-of-the-art specialized models trained on millions of video samples. This substantially reduces training cost and data requirements, offering a scalable, low-barrier pathway for reusing pretrained vision-language models in video understanding.

📝 Abstract
Developing video captioning models is computationally expensive. The dynamic nature of video also complicates the design of multimodal models that can effectively caption these sequences. However, we find that, with minimal computational resources and without complex modifications to handle video dynamics, an image-based model can be repurposed to outperform several specialised video captioning systems. Our adapted model delivers top-tier performance on major benchmarks, ranking 2nd on MSRVTT and MSVD, and 3rd on VATEX. We turn a typical image captioning model, BLIP-2, into a competitive video captioner by post-training it with only 6,000 video-text pairs and simply concatenating frames; this is significantly less data than other methods, which use 2.5 to 144 million pairs. From a resource-optimization perspective, this video captioning study focuses on three fundamental factors: optimizing model scale, maximizing data efficiency, and incorporating reinforcement learning. This extensive study demonstrates that a lightweight, image-based adaptation strategy can rival state-of-the-art video captioning systems, offering a practical solution for low-resource scenarios.
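The page does not spell out how frames are concatenated before being handed to the image captioner; a minimal sketch of the general idea, assuming uniform temporal sampling and side-by-side tiling (the function names and the choice of horizontal tiling are illustrative, not taken from the paper):

```python
import numpy as np

def sample_frames(video: np.ndarray, num_frames: int = 4) -> np.ndarray:
    """Uniformly sample `num_frames` frames from a (T, H, W, C) video array."""
    t = video.shape[0]
    idx = np.linspace(0, t - 1, num_frames).round().astype(int)
    return video[idx]

def concat_frames(frames: np.ndarray) -> np.ndarray:
    """Tile sampled frames side by side into one (H, num_frames*W, C) image,
    which an off-the-shelf image captioner (e.g., BLIP-2) can then encode."""
    return np.concatenate(list(frames), axis=1)

# Toy example: a 30-frame "video" of 64x64 RGB frames.
video = np.random.randint(0, 256, size=(30, 64, 64, 3), dtype=np.uint8)
grid = concat_frames(sample_frames(video, num_frames=4))
print(grid.shape)  # (64, 256, 3)
```

The appeal of this design is that the image model's architecture is untouched: all temporal information is packed into the spatial layout of a single input image.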
Problem

Research questions and friction points this paper is trying to address.

Repurpose image models for video captioning
Minimize computational resources in captioning
Enhance data efficiency with fewer pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repurposing image-based models
Post-training with minimal data
Optimizing model scale and efficiency
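The reinforcement-learning component is only named on this page, not specified. A common recipe for RL-based caption optimization is self-critical sequence training (SCST), where the reward of a greedily decoded caption serves as the baseline for sampled captions; a minimal sketch under that assumption (all names and the toy rewards are illustrative):

```python
import numpy as np

def scst_loss(sample_logprobs: np.ndarray,
              sample_rewards: np.ndarray,
              greedy_rewards: np.ndarray) -> float:
    """Self-critical policy-gradient loss for captioning:
    advantage = (reward of sampled caption) - (reward of greedy caption),
    scaled by the sampled caption's log-probability and negated for descent."""
    advantage = sample_rewards - greedy_rewards
    return float(-(advantage * sample_logprobs).mean())

# Toy batch of two captions: per-caption log-probabilities and
# CIDEr-like rewards for the sampled vs. greedy decodes.
logp = np.array([-2.0, -1.5])
loss = scst_loss(logp, np.array([0.8, 0.3]), np.array([0.5, 0.5]))
print(loss)  # 0.15
```

Gradient descent on this loss raises the probability of sampled captions that beat the greedy baseline and lowers it for those that fall short, which is why a separately learned value baseline is unnecessary.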