🤖 AI Summary
Vision-language models (VLMs) decode only text, which limits them on reasoning tasks that require visual imagination; training them to generate explicit images has been explored, but it often degrades reasoning ability. To address this, we propose *Machine Mental Imagery*, a paradigm that interleaves latent visual tokens with text during decoding, so the model can follow multimodal reasoning trajectories without pixel-level image synthesis. Training proceeds in two supervised stages, first distilling the latent tokens from ground-truth image embeddings and then aligning the latent trajectory with the task objective through text-only supervision, followed by a reinforcement learning stage that further strengthens multimodal reasoning. Across diverse multimodal reasoning benchmarks, the approach yields stronger reasoning without explicit image generation, supporting the effectiveness and feasibility of implicit visual representations for advanced reasoning.
📝 Abstract
Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders their reasoning ability. Inspired by the way humans reason with mental imagery (the internal construction and manipulation of visual cues), we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We first supervise the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision so that the latent trajectory aligns tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
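To make the decoding idea concrete, the following is a minimal toy sketch of feeding a hidden state back as the next input instead of sampling a text token, plus one possible distillation loss. It is not the Mirage implementation: the random "model", the `<latent>` trigger token, the fixed latent span length, and the cosine form of the loss are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random

HIDDEN = 8
VOCAB = ["<latent>", "a", "b", "c"]  # hypothetical toy vocabulary
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


EMBED = rand_matrix(len(VOCAB), HIDDEN)  # token embedding table
W_STEP = rand_matrix(HIDDEN, HIDDEN)     # stand-in for the decoder stack
W_OUT = rand_matrix(len(VOCAB), HIDDEN)  # stand-in for the LM head


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def toy_step(state, inp):
    """One toy decoder step mixing the previous state with the input embedding."""
    return [math.tanh(p) for p in matvec(W_STEP, [s + x for s, x in zip(state, inp)])]


def decode(prompt_token="a", max_steps=8, latent_len=3):
    """Greedy decoding that interleaves text tokens with latent visual tokens."""
    state = [0.0] * HIDDEN
    inp = EMBED[VOCAB.index(prompt_token)]
    # Assumption: a special token opens a fixed-length latent span.
    latent_left = latent_len if prompt_token == "<latent>" else 0
    trajectory = []
    for _ in range(max_steps):
        hidden = toy_step(state, inp)
        state = hidden
        if latent_left > 0:
            # Latent mode: skip the vocabulary and reuse the hidden state
            # directly as the next input embedding.
            trajectory.append(("latent", hidden))
            inp = hidden
            latent_left -= 1
            continue
        # Text mode: ordinary greedy token selection and embedding lookup.
        logits = matvec(W_OUT, hidden)
        tok = max(range(len(VOCAB)), key=logits.__getitem__)
        trajectory.append(("text", VOCAB[tok]))
        inp = EMBED[tok]
        if VOCAB[tok] == "<latent>":
            latent_left = latent_len
    return trajectory


def cosine_distill_loss(latents, targets, eps=1e-12):
    """Assumed stage-one loss: 1 - mean cosine similarity to image embeddings."""
    total = 0.0
    for z, t in zip(latents, targets):
        dot = sum(a * b for a, b in zip(z, t))
        norm = math.sqrt(sum(a * a for a in z)) * math.sqrt(sum(b * b for b in t))
        total += dot / (norm + eps)
    return 1.0 - total / len(latents)


if __name__ == "__main__":
    for kind, value in decode(prompt_token="<latent>"):
        print(kind, value if kind == "text" else "[hidden-state vector]")
```

In this sketch the latent entries never pass through the output projection, which is the point of the technique: the trajectory continues in embedding space, and only the text entries are ever mapped back to vocabulary items.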