Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning

📅 2026-05-31
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work challenges the prevailing assumption that performance gains in current latent visual reasoning methods stem primarily from latent tokens encoding visual memory. We decompose latent tokens into three testable components: latent slots, boundary markers, and format. Through mechanistic interpretability analyses, controlled ablations, attention visualizations, and cross-setting comparisons across six method-stage settings and four perception-heavy benchmarks, we show that the observed improvements largely arise from boundary markers, format, and a specific attention pattern, not from visual memory stored in latent slots. In several settings, retaining only the boundary markers preserves 78% to 100% of the performance gain, and the model attends to the image more narrowly at latent positions than at answer positions. These findings argue for evaluating latent visual reasoning methods by the mechanisms they rely on, not by accuracy alone.
πŸ“ Abstract
Recent latent visual reasoning methods achieve substantial gains by inserting continuous latent tokens into multimodal language models. These gains are commonly attributed to the tokens encoding visual evidence; recent analyses, however, reveal a paradox: the tokens are loosely tied to the image and contribute little to the answer. Critically, these analyses treat latent tokens as a single unit, obscuring the true source of the gains. We therefore decompose latent tokens into three testable components: latent slots, boundary markers, and format, and develop a state-of-the-art method as a probe under favorable conditions. Across six method-stage settings and four perception-heavy benchmarks, latent slots fail every prediction of the visual-memory account. Strikingly, retaining only the boundary markers preserves 78 to 100% of the gain in several settings, while the model attends to the image more narrowly at latent positions than at answer positions. The gain therefore comes from boundary markers, format, and this attention pattern, not from latent slots. How each method engages this mechanism depends on its training supervision: at matched accuracy, mechanisms can still differ markedly. Latent visual reasoning thus needs evaluation not only by accuracy but by what the model actually relies on.
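One plausible reading of "preserves 78 to 100% of the gain" is a ratio: the accuracy gain left after an ablation, divided by the full method's gain over the baseline. The sketch below assumes that definition, which the abstract does not spell out. The function name `gain_recovered` and all accuracy values are hypothetical and not taken from the paper.

```python
def gain_recovered(baseline_acc: float, full_acc: float, ablated_acc: float) -> float:
    """Fraction of the full method's gain over the baseline that an ablated variant keeps.

    Assumed definition (not confirmed by the source):
        (ablated_acc - baseline_acc) / (full_acc - baseline_acc)
    """
    gain = full_acc - baseline_acc
    if gain == 0:
        raise ValueError("full method shows no gain over the baseline; ratio is undefined")
    return (ablated_acc - baseline_acc) / gain


# Hypothetical accuracies in percentage points (illustrative only).
baseline = 60.0      # model without latent tokens
full = 70.0          # full latent visual reasoning method
markers_only = 68.0  # ablation that keeps only the boundary markers

print(f"{gain_recovered(baseline, full, markers_only):.0%}")  # prints 80%
```

Under this definition, 100% means the ablated variant matches the full method, and values near 0% mean the removed component carried most of the gain.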
Problem

Research questions and friction points this paper is trying to address.

latent visual reasoning
visual memory
multimodal language models
mechanistic diagnostics
boundary markers
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent visual reasoning
mechanistic interpretability
boundary markers
multimodal language models
visual memory