🤖 AI Summary
This work identifies and isolates a previously uncharacterized failure mode in multimodal large language models (MLLMs), termed “spatial lexical bias”: on spatial reasoning multiple-choice questions, models show a selection bias triggered by spatial relation terms in the answer options. The study constructs diagnostic samples that are binary-stable yet ternary-fragile and applies interpretability techniques, including visual attention analysis, residual stream probing, activation patching, and sparse interventions, to localize the bias to specific channels and neurons in the language module rather than the visual encoder. Building on this insight, the authors propose a lightweight mitigation strategy that updates only the language model, improving four-way robust accuracy by up to 100 points on synthetic data and by 68.0, 32.6, and 20.1 points on the WhatsUp, SpatialMQA-Direct, and VSR benchmarks, respectively.
📝 Abstract
Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer options can pull the model's decision toward the newly added option. Across nine open-weight MLLMs, we show that this phenomenon is widespread. In particular, models can answer a binary spatial question correctly, yet consistently select an incorrect third spatial option once it is added to the answer set. We isolate such binary-stable but ternary-fragile cases as diagnostic examples and analyze them with mechanistic interpretability tools, which reveal that a substantial part of the failure originates on the language side rather than the visual side. Visual attention analyses and residual-stream probes show that the correct spatial relation remains internally available on these failures, while irrelevant-option controls, activation patching, and sparse component interventions trace the bias to specific LLM-side channels and neurons. Based on this finding, we show that a lightweight LLM-only DPO update on a tiny set of synthetic single-object-pair data mitigates the bias, lifting four-way robust accuracy by up to 100 points on synthetic data, and by 68.0, 32.6, and 20.1 points on the broader evaluation datasets WhatsUp, SpatialMQA-Direct, and VSR, respectively.
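To make the "binary-stable but ternary-fragile" criterion concrete, the following is a minimal sketch of how such diagnostic cases might be filtered. It is not the authors' code. The `predict` callable, the function names, and the toy model are all hypothetical, and reading "consistently" as "under every ordering of the options" is an assumption made here for illustration.

```python
from itertools import permutations
from typing import Callable, Sequence

# Hypothetical interface: takes a question and an ordered list of option
# strings, returns the option string the model selects.
Predict = Callable[[str, Sequence[str]], str]


def is_binary_stable(predict: Predict, question: str,
                     correct: str, distractor: str) -> bool:
    """True if the model picks `correct` under both orderings of two options."""
    return all(
        predict(question, list(order)) == correct
        for order in permutations([correct, distractor])
    )


def is_ternary_fragile(predict: Predict, question: str,
                       correct: str, distractor: str, added: str) -> bool:
    """True if the model picks the newly added (incorrect) option under
    every ordering of the three options (assumed reading of 'consistently')."""
    return all(
        predict(question, list(order)) == added
        for order in permutations([correct, distractor, added])
    )


def is_diagnostic(predict: Predict, question: str,
                  correct: str, distractor: str, added: str) -> bool:
    """Binary-stable AND ternary-fragile: a candidate diagnostic example."""
    return (is_binary_stable(predict, question, correct, distractor)
            and is_ternary_fragile(predict, question, correct, distractor, added))


def toy_biased_model(question: str, options: Sequence[str]) -> str:
    # Stand-in for an MLLM: drawn to the word "behind" whenever it is offered,
    # otherwise answers "left".
    return "behind" if "behind" in options else "left"
```

With the toy model, `is_diagnostic(toy_biased_model, "Where is the cup?", "left", "right", "behind")` flags the case, while swapping the added option for `"above"` does not, since the stub then keeps answering `"left"`.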