AI Summary
This work challenges the prevailing assumption that the performance gains of current latent visual reasoning methods stem primarily from latent tokens encoding visual memory. We propose the first systematic decomposition of latent tokens into three testable components: latent slots, boundary markers, and format. Through mechanistic interpretability analyses, controlled ablations, attention visualizations, and cross-setting comparisons across six method-stage settings and four perception-heavy benchmarks, we demonstrate that the observed improvements arise largely from boundary markers, format, and a specific attention pattern, not from visual memory stored in latent slots. In several settings, retaining only the boundary markers preserves 78% to 100% of the performance gain, and the model attends to the image more narrowly at latent positions than at answer positions. These findings argue for evaluating latent visual reasoning methods by the mechanisms they rely on rather than by accuracy alone.
Abstract
Recent latent visual reasoning methods achieve substantial gains by inserting continuous latent tokens into multimodal language models. These gains are commonly attributed to the tokens encoding visual evidence; recent analyses, however, reveal a paradox: the tokens are loosely tied to the image and contribute little to the answer. Critically, these analyses treat latent tokens as a single unit, obscuring the true source of the gains. We therefore decompose latent tokens into three testable components: latent slots, boundary markers, and format, and develop a state-of-the-art method as a probe under favorable conditions. Across six method-stage settings and four perception-heavy benchmarks, latent slots fail every prediction of the visual-memory account. Strikingly, retaining only the boundary markers preserves 78 to 100% of the gain in several settings, while the model attends to the image more narrowly at latent positions than at answer positions. The gain therefore comes from boundary markers, format, and this attention pattern, not from latent slots. How each method engages this mechanism depends on its training supervision: at matched accuracy, mechanisms can still differ markedly. Latent visual reasoning thus needs evaluation not only by accuracy but by what the model actually relies on.
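The "78 to 100% of the gain" figure can be read as a ratio of accuracy differences. The sketch below is one plausible way to compute such a ratio, assuming the gain is measured against a baseline without latent tokens; the function name `gain_retained` and the accuracies used are hypothetical illustrations, not the paper's definition or results.

```python
def gain_retained(baseline: float, full: float, ablated: float) -> float:
    """Fraction of the full method's gain over baseline that survives an ablation.

    Assumed definition (not taken from the paper):
        (ablated - baseline) / (full - baseline)
    """
    gain = full - baseline
    if gain == 0:
        raise ValueError("full method shows no gain over baseline")
    return (ablated - baseline) / gain


if __name__ == "__main__":
    # Hypothetical accuracies in percent, chosen only to illustrate the arithmetic.
    baseline_acc = 60.0   # model without latent tokens
    full_acc = 70.0       # full latent visual reasoning method
    markers_only = 68.0   # boundary markers kept, latent slots removed
    print(gain_retained(baseline_acc, full_acc, markers_only))  # 0.8
```

Under this reading, a value of 1.0 means the ablated model matches the full method, and a value near 0 means the removed component carried the gain.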