🤖 AI Summary
This work addresses evaluation contamination in large reasoning models (LRMs) on automatically verifiable tasks. We propose ROME, a low-contamination multimodal benchmark for vision-grounded reasoning. Methodologically, we construct an automatically verifiable question-answering dataset spanning both textual and visual modalities, and introduce an evaluation framework designed to minimize overlap between test questions and mainstream training corpora, supporting stricter and more reproducible reasoning assessment. Experiments reveal significant bottlenecks in current LRMs' joint vision-language reasoning, particularly on tasks requiring cross-modal logical inference. All evaluation data, tooling, and results are publicly released to support transparent, trustworthy model evaluation.
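The summary above names two mechanisms: automatic answer verification and overlap checking against training corpora. As a rough illustration only (ROME's actual pipeline is not described here), the sketch below shows a normalized exact-match verifier and a word-level n-gram overlap check; all names and parameters (`verify_answer`, `is_contaminated`, `NGRAM_SIZE`) are hypothetical.

```python
"""Illustrative sketch only: not ROME's actual verification or
decontamination code. Names and parameters here are assumptions."""
from typing import Set, Tuple

NGRAM_SIZE = 13  # assumed window size; a common choice in LLM decontamination work


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences pass."""
    return " ".join(text.lower().split())


def verify_answer(prediction: str, gold: str) -> bool:
    """Normalized exact match: one simple sense of 'automatically verifiable'."""
    return normalize(prediction) == normalize(gold)


def ngrams(text: str, n: int = NGRAM_SIZE) -> Set[Tuple[str, ...]]:
    """Word-level n-grams of a normalized text (empty set if text is shorter than n)."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(question: str, corpus_index: Set[Tuple[str, ...]]) -> bool:
    """Flag a test question if it shares any n-gram with the training-corpus index."""
    return not ngrams(question).isdisjoint(corpus_index)


# Usage: index the corpus once, then keep only questions with no overlap.
corpus_index: Set[Tuple[str, ...]] = set()
for doc in ["... training documents would go here ..."]:
    corpus_index |= ngrams(doc)

questions = ["Which object in the image casts the longer shadow at noon?"]
clean = [q for q in questions if not is_contaminated(q, corpus_index)]
```

A production decontamination pass would likely need fuzzier matching (e.g., normalized token overlap thresholds rather than exact n-gram hits), which is one reason contamination can only be limited "to some extent", as the abstract below puts it.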
📝 Abstract
We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) and report some preliminary findings. We also release ROME, our evaluation benchmark for vision-language models intended to test reasoning from visual clues. Links to the benchmark, evaluation data, and other updates are available at: https://flageval-baai.github.io/LRM-Eval/