🤖 AI Summary
Existing speculative decoding methods for vision-language models struggle to simultaneously achieve high generation speed and hardware efficiency. This work proposes DREAM-S, a novel framework that introduces, for the first time in multimodal speculative decoding, a searchable draft mechanism coupled with target-aware refinement. DREAM-S leverages neural architecture search to automatically optimize both the draft model’s architecture and its interaction strategy with the target model, while incorporating attention entropy–guided adaptive intermediate feature distillation to enhance training efficiency. By co-optimizing model structure and hardware deployment, the method achieves up to 3.85× generation speedup across multiple mainstream vision-language models, substantially outperforming existing baselines.
📝 Abstract
Speculative decoding (SD) has proven to be an effective technique for accelerating autoregressive generation in large language models (LLMs); however, its application to vision-language models (VLMs) remains relatively unexplored. We propose DREAM-S, a novel SD framework designed specifically for fast and efficient decoding in VLMs. DREAM-S leverages a neural architecture search (NAS) framework with target-aware supernet training to automatically identify both the optimal interaction strategy between the draft and target models and the most suitable draft model architecture for the underlying hardware platform. DREAM-S additionally incorporates adaptive intermediate feature distillation, guided by attention entropy, to enable efficient draft training. Experiments on a range of well-established VLMs show that DREAM-S achieves up to a 3.85× speedup compared to standard decoding approaches and significantly outperforms existing SD baselines. The code is publicly available at: https://github.com/SAI-Lab-NYU/DREAM-S .
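
For readers unfamiliar with speculative decoding, the draft-then-verify step that any SD framework builds on can be sketched as follows. This is a minimal illustration of generic greedy verification, not the DREAM-S implementation: the function name `verify_greedy` and the token IDs are hypothetical, and the sketch assumes the target model has already produced its own greedy choice at each drafted position plus one extra position.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Generic greedy speculative-decoding verification (illustrative only).

    draft_tokens:  k tokens proposed by the small draft model.
    target_tokens: k + 1 tokens, where target_tokens[i] is the target
                   model's greedy choice given the prefix plus
                   draft_tokens[:i]. All k + 1 come from one target pass.

    Returns the tokens to append to the output: the longest drafted prefix
    the target agrees with, followed by one token from the target itself.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")

    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # first disagreement: discard the rest of the draft
        accepted += 1

    # The target's token at the first mismatch (or the bonus token when the
    # whole draft is accepted) is always correct, so it is emitted as well.
    return list(draft_tokens[:accepted]) + [target_tokens[accepted]]


if __name__ == "__main__":
    # Hypothetical token IDs: the target agrees with the first two drafts.
    print(verify_greedy([5, 7, 9], [5, 7, 2, 4]))
```

Each target forward pass therefore yields between 1 and k + 1 tokens, which is why a draft model that matches the target more often translates directly into higher decoding speed.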