🤖 AI Summary
This work addresses the tendency of multimodal large language models to collapse onto a few favored outputs in logically neutral scenarios (termed "Stochastic Collapse" by the authors), producing repetitive outputs that fail to cover equivalent options uniformly. To evaluate this behavior systematically, the authors introduce RandomBench, a benchmark for assessing distributional neutrality under explicit random instructions, along with three quantitative metrics: Randomness Index (RI), Between-Choice Inequality (BCI), and Between-Instruction Inequality (BII). Through ablation studies across languages and representation formats, together with entropy analysis, the study reveals severe limitations in current models' random decision-making: for instance, Claude Sonnet 4.6 exhibits a top-1 selection probability of 97%, against an ideal baseline of 25%, and an RI of merely 0.068.
📝 Abstract
Current evaluations of Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior in logic-neutral scenarios largely underexplored. Stochasticity is essential when multiple actions are equally valid, such as recommending travel itineraries or daily schedules whose options have similar utility. In such settings, deterministic policies may lead to repetitive behavior and reduced coverage of valid alternatives. To bridge this gap, we propose RandomBench, a benchmark designed to evaluate whether MLLMs can maintain distributionally neutral behavior when selecting among equivalent options. We further introduce three metrics, RI, BCI, and BII, to quantify entropy and distributional bias. Experiments reveal a pervasive phenomenon we term Stochastic Collapse, in which MLLMs fail to maintain uniform randomness under explicit random instructions: in Claude Sonnet 4.6, the top-1 probability reaches 97%, against an ideal baseline of one quarter (25%), and RI drops to 0.068. Extensive ablation studies further demonstrate that these deviations persist across languages and representation formats, indicating that distributional collapse is robust in logic-neutral decision settings.
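To make the idea of measuring distributional neutrality concrete, the following is a minimal sketch of how a top-1 probability and an entropy-based score could be computed from repeated model choices. It is not the paper's definition of RI, BCI, or BII, which the abstract does not spell out. The function names, the four-option setup, and the use of normalized Shannon entropy as a stand-in score are all assumptions made for illustration.

```python
import math
from collections import Counter


def choice_distribution(choices, options):
    """Empirical probability of each option over repeated trials."""
    counts = Counter(choices)
    total = len(choices)
    return [counts[option] / total for option in options]


def top1_probability(probs):
    """Probability mass on the most frequently chosen option."""
    return max(probs)


def normalized_entropy(probs):
    """Shannon entropy divided by log(k): 1.0 for a uniform
    distribution, 0.0 when a single option is always chosen.
    Assumed stand-in score, not the paper's Randomness Index."""
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two options")
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(k)


# Hypothetical data: four equivalent options, 100 trials each.
options = ["A", "B", "C", "D"]
uniform_choices = options * 25
collapsed_choices = ["A"] * 97 + ["B", "C", "D"]

uniform_probs = choice_distribution(uniform_choices, options)
collapsed_probs = choice_distribution(collapsed_choices, options)

print(top1_probability(uniform_probs), normalized_entropy(uniform_probs))
print(top1_probability(collapsed_probs), normalized_entropy(collapsed_probs))
```

In this toy data the collapsed sample puts 97% of its mass on one option, mirroring the top-1 figure reported above, while its normalized entropy falls well below the uniform sample's. The score will not reproduce the reported RI of 0.068, since the paper's metric may be defined differently.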