🤖 AI Summary
Current video understanding models suffer from excessive computational cost, large parameter counts, and high inference latency, hindering real-time deployment on mobile devices. To address this, the authors propose Mobile-VideoGPT, a lightweight vision-language framework for video understanding. The method introduces two key components: (1) an attention-based frame scoring mechanism that identifies semantically salient key-frames, and (2) an efficient token projector that prunes redundant visual tokens while preserving essential contextual cues. These are integrated with dual lightweight visual encoders and a small language model (0.5B parameters). Mobile-VideoGPT significantly reduces computational overhead: compared with existing state-of-the-art 0.5B models, it uses 40% fewer parameters, delivers more than 2× higher throughput (up to 46 tokens/sec), and outperforms them by an average of 6 points across six standard video understanding benchmarks.
📝 Abstract
Video understanding models often struggle with high computational requirements, extensive parameter counts, and slow inference speed, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters. Unlike traditional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput. To further improve efficiency, we present an Attention-Based Frame Scoring mechanism to select key-frames, along with an efficient token projector that prunes redundant visual tokens and preserves essential contextual cues. We evaluate our model across six well-established video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PercepTest). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average, with 40% fewer parameters and more than 2× higher throughput. Our code and models are publicly available at: https://github.com/Amshaker/Mobile-VideoGPT.
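The abstract does not spell out the internals of the frame-scoring and token-pruning steps, but the general idea can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes frame-level scores come from a dot-product attention against a mean query, and that redundancy is measured by cosine similarity between tokens; function names, thresholds, and shapes are all illustrative.

```python
import numpy as np

def select_key_frames(frame_feats, k):
    """Attention-style key-frame scoring (illustrative sketch).

    frame_feats: (num_frames, dim) array of pooled per-frame features.
    Returns the indices of the k highest-scoring frames, in temporal order.
    """
    # Score each frame by scaled dot-product attention against a mean query.
    query = frame_feats.mean(axis=0)
    scores = frame_feats @ query / np.sqrt(frame_feats.shape[1])
    # Keep the top-k frames, restoring their original (temporal) order.
    return np.sort(np.argsort(scores)[-k:])

def prune_redundant_tokens(tokens, sim_thresh=0.9):
    """Redundancy-aware token pruning (illustrative sketch).

    tokens: (num_tokens, dim) array of visual tokens from the kept frames.
    Greedily keeps a token only if its cosine similarity to every
    previously kept token is below sim_thresh.
    """
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept = []
    for i in range(len(tokens)):
        if all(normed[i] @ normed[j] < sim_thresh for j in kept):
            kept.append(i)
    return tokens[kept]
```

For example, scoring 8 frames and keeping 3 returns 3 temporally ordered indices, and pruning a token set that contains exact duplicates removes the repeated tokens while keeping one copy of each.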