🤖 AI Summary
Current video understanding models suffer from excessive computational cost, large parameter counts, and high inference latency, hindering real-time deployment on mobile devices. To address this, the authors propose Mobile-VideoGPT, a lightweight vision-language framework for video understanding. The method introduces two key components: (1) an attention-based frame scoring mechanism that identifies semantically salient key-frames, and (2) an efficient token projector that prunes redundant visual tokens while preserving essential contextual cues. These are integrated with dual lightweight visual encoders and a small language model (0.5B parameters). Mobile-VideoGPT significantly reduces computational overhead: compared with existing state-of-the-art 0.5B models, it uses 40% fewer parameters, delivers more than 2× higher throughput (up to 46 tokens/sec), and outperforms them by an average of 6 points across six standard video understanding benchmarks.
📝 Abstract
Video understanding models often struggle with high computational requirements, extensive parameter counts, and slow inference speed, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters. Unlike traditional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput. To further improve efficiency, we present an Attention-Based Frame Scoring mechanism to select key-frames, along with an efficient token projector that prunes redundant visual tokens and preserves essential contextual cues. We evaluate our model across six well-established video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PercepTest). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average, with 40% fewer parameters and more than 2× higher throughput. Our code and models are publicly available at: https://github.com/Amshaker/Mobile-VideoGPT.
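The abstract does not spell out the internals of the frame-scoring and token-pruning steps, but the general idea can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes frame-level scores come from a dot-product attention against a mean query, and that redundancy is measured by cosine similarity between tokens; function names, thresholds, and shapes are all illustrative.

```python
import numpy as np

def select_key_frames(frame_feats, k):
    """Attention-style key-frame scoring (illustrative sketch).

    frame_feats: (num_frames, dim) array of pooled per-frame features.
    Returns the indices of the k highest-scoring frames, in temporal order.
    """
    # Score each frame by scaled dot-product attention against a mean query.
    query = frame_feats.mean(axis=0)
    scores = frame_feats @ query / np.sqrt(frame_feats.shape[1])
    # Keep the top-k frames, restoring their original (temporal) order.
    return np.sort(np.argsort(scores)[-k:])

def prune_redundant_tokens(tokens, sim_thresh=0.9):
    """Redundancy-aware token pruning (illustrative sketch).

    tokens: (num_tokens, dim) array of visual tokens from the kept frames.
    Greedily keeps a token only if its cosine similarity to every
    previously kept token is below sim_thresh.
    """
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept = []
    for i in range(len(tokens)):
        if all(normed[i] @ normed[j] < sim_thresh for j in kept):
            kept.append(i)
    return tokens[kept]
```

For example, scoring 8 frames and keeping 3 returns 3 temporally ordered indices, and pruning a token set that contains exact duplicates removes the repeated tokens while keeping one copy of each.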