SpecVLM: Fast Speculative Decoding in Vision-Language Models

📅 2025-09-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the excessive visual token counts and the prohibitive compute/memory overhead of the prefill stage when vision-language models (VLMs) process high-resolution images and videos, this paper proposes SpecVLM, an efficient speculative decoding system tailored to VLMs. Its core innovations are: (1) an elastic visual compressor that adaptively selects among pruning, pooling, convolution, and resampling primitives to shrink the visual token sequence per input; and (2) an online logit distillation protocol, requiring no offline corpus, that trains the draft model against on-the-fly teacher logits and penultimate-layer features with a combined cross-entropy and Smooth L1 objective. Evaluated on LLaVA and MMMU, SpecVLM achieves 2.5–2.9× end-to-end speedups within five training epochs, consistently outperforms the EagleVLM baseline across resolutions and task difficulties, and preserves the target model's output distribution (lossless decoding) with no accuracy degradation.

📝 Abstract
Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs), but directly porting it to vision-language models (VLMs) faces unique systems constraints: the prefill stage is dominated by visual tokens whose count scales with image resolution and video length, inflating both compute and memory, especially the key-value (KV) cache. We study speculative decoding for VLMs and introduce SpecVLM, a practical system that (1) establishes a strong EAGLE-2-style baseline, EagleVLM, delivering 1.5–2.3× end-to-end speedups over full autoregressive inference, and (2) further accelerates VLM inference with an elastic visual compressor that adaptively selects among pruning, pooling, convolution, and resampler primitives to balance FLOPs/parameters and accuracy per input. To avoid costly offline distillation corpora, we propose an online-logit distillation protocol that trains the draft model with on-the-fly teacher logits and penultimate features using a combined cross-entropy and Smooth L1 objective, eliminating storage and preprocessing while remaining compute-efficient. This protocol reveals a training-time scaling effect: longer online training monotonically increases the draft model's average accepted length, improving speculative efficiency. Empirically, SpecVLM achieves additional acceleration, culminating in 2.5–2.9× end-to-end speedups within 5 epochs across LLaVA and MMMU, consistently over resolutions and task difficulties, while preserving the target model's output distribution (lossless decoding). Our code is available at https://github.com/haiduo/SpecVLM.
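The combined distillation objective the abstract describes (cross-entropy on teacher logits plus Smooth L1 on penultimate features) can be sketched in plain NumPy. This is a minimal illustration, not SpecVLM's implementation; the function names, the `alpha` weighting, and the `beta` threshold are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def smooth_l1(pred, target, beta=1.0):
    # quadratic below beta, linear above (Huber-style), averaged over elements
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta).mean()

def distill_loss(draft_logits, teacher_logits, draft_feat, teacher_feat, alpha=1.0):
    # soft cross-entropy between teacher and draft token distributions
    p_teacher = softmax(teacher_logits)
    log_q = np.log(softmax(draft_logits) + 1e-12)
    ce = -(p_teacher * log_q).sum(axis=-1).mean()
    # feature-level alignment on penultimate-layer activations
    return ce + alpha * smooth_l1(draft_feat, teacher_feat)
```

Because the teacher logits and features are produced on the fly during the target model's forward pass, no offline corpus needs to be stored; the draft model is simply optimized against them each step.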
Problem

Research questions and friction points this paper is trying to address.

Accelerates vision-language model inference with speculative decoding
Reduces visual token computational and memory overhead
Achieves lossless decoding while maintaining output quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Elastic visual compressor adaptively selects compression primitives
Online-logit distillation trains draft model without offline corpora
Achieves 2.5-2.9x speedup while preserving output distribution
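As a rough illustration of how an elastic compressor might dispatch among its primitives, here is a NumPy sketch covering only the pruning and pooling primitives. The norm-based saliency score and the ratio-based selection rule are assumptions for illustration, not SpecVLM's actual policy:

```python
import numpy as np

def prune_tokens(tokens, keep):
    # keep the `keep` tokens with the highest L2 norm, preserving order
    # (norm as a saliency proxy is an assumption)
    scores = np.linalg.norm(tokens, axis=-1)
    idx = np.sort(np.argsort(scores)[-keep:])
    return tokens[idx]

def pool_tokens(tokens, keep):
    # average-pool consecutive groups of tokens down to `keep` tokens
    groups = np.array_split(tokens, keep)
    return np.stack([g.mean(axis=0) for g in groups])

def compress(tokens, budget):
    # elastic choice (hypothetical rule): prune for mild reduction,
    # pool when the sequence must shrink aggressively
    if len(tokens) <= budget:
        return tokens
    ratio = len(tokens) / budget
    return prune_tokens(tokens, budget) if ratio <= 2 else pool_tokens(tokens, budget)
```

In the paper's setting the budget would adapt per input (resolution, video length), and convolution and resampler primitives would join the menu; the point of the sketch is only the dispatch structure.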
Haiduo Huang
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, China
Fuwei Yang
Advanced Micro Devices, Inc., Beijing, China
Zhenhua Liu
Advanced Micro Devices, Inc., Beijing, China
Xuanwu Yin
Advanced Micro Devices, Inc., Beijing, China
Dong Li
Advanced Micro Devices, Inc., Beijing, China
Pengju Ren
Professor, Xi'an Jiaotong University
Emad Barsoum
AMD, Columbia University
Generative AI · Foundation Models · Agentic AI · Computer Vision · ML Frameworks