🤖 AI Summary
This study systematically evaluates the physical consistency and generalization capabilities of multimodal foundation models (e.g., FLAVA, BLIP-2) for crystallographic reasoning. To address two core challenges—spatial interpolation/extrapolation and compositional variation—we construct the first benchmark dataset spanning multiple scales and diverse crystal systems. We introduce a dual-exclusion evaluation protocol (spatial exclusion + compositional exclusion) alongside interpretable metrics—including a Physical Consistency Index and Hallucination Score—to rigorously assess predictions of lattice parameters, density, volume conservation, and space group assignment. Comprehensive evaluation across nine state-of-the-art models reveals pervasive geometric hallucinations and violations of fundamental physical constraints. All data, code, and evaluation tools are publicly released to advance AI for Science toward reproducible, physics-grounded assessment paradigms.
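The dual-exclusion protocol described above can be sketched as two dataset splits. This is an illustrative sketch only: the field names (`radius`, `composition`) and the flat-dict sample layout are assumptions, not the benchmark's actual schema.

```python
# Hypothetical sketch of the dual-exclusion evaluation protocol.
# Field names ("radius", "composition") are illustrative assumptions.

def spatial_exclusion_split(samples, held_out_radius):
    """Withhold every supercell at a given radius (spatial exclusion)."""
    train = [s for s in samples if s["radius"] != held_out_radius]
    test = [s for s in samples if s["radius"] == held_out_radius]
    return train, test

def compositional_exclusion_split(samples, held_out_composition):
    """Withhold every sample of a given stoichiometry (compositional exclusion)."""
    train = [s for s in samples if s["composition"] != held_out_composition]
    test = [s for s in samples if s["composition"] == held_out_composition]
    return train, test
```

A model trained (or prompted with in-context examples) only on `train` is then scored on `test`, so any success on the withheld radius or composition must come from generalization rather than memorization.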
📝 Abstract
Evaluating foundation models for crystallographic reasoning requires benchmarks that isolate generalization behavior while enforcing physical constraints. This work introduces a multiscale, multicrystal dataset with two physically grounded evaluation protocols to stress-test multimodal generative models. The Spatial-Exclusion benchmark withholds all supercells of a given radius from a diverse dataset, enabling controlled assessments of spatial interpolation and extrapolation. The Compositional-Exclusion benchmark omits all samples of a specific chemical composition, probing generalization across stoichiometries. Nine vision–language foundation models are prompted with crystallographic images and textual context to generate structural annotations. Responses are evaluated via (i) relative errors in lattice parameters and density, (ii) a physics-consistency index penalizing volumetric violations, and (iii) a hallucination score capturing geometric outliers and invalid space-group predictions. These benchmarks establish a reproducible, physically informed framework for assessing generalization, consistency, and reliability in large-scale multimodal models. Dataset and code are available at https://github.com/KurbanIntelligenceLab/StressTestingMMFMinCR.
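The three evaluation signals (i)–(iii) can be sketched as follows. This is a minimal sketch under stated assumptions: the tolerance `tol`, the scoring formula, and the input layout are illustrative, not the paper's exact definitions; only the triclinic volume formula and the 1–230 space-group range are standard crystallography.

```python
import math

# Illustrative sketch of the three evaluation signals; the threshold
# and scoring formulas are assumptions, not the paper's definitions.

def relative_error(pred, true):
    """Signal (i): relative error in a scalar quantity (e.g., lattice length, density)."""
    return abs(pred - true) / abs(true)

def triclinic_volume(a, b, c, alpha, beta, gamma):
    """Unit-cell volume from lattice parameters (angles in degrees);
    standard formula valid for all crystal systems."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)

def physics_consistency_index(pred_params, pred_volume, tol=0.05):
    """Signal (ii): penalize predictions whose reported volume disagrees
    with the volume implied by their own lattice parameters."""
    implied = triclinic_volume(*pred_params)
    err = relative_error(pred_volume, implied)
    return 1.0 if err <= tol else max(0.0, 1.0 - err)

def hallucination_flags(pred):
    """Signal (iii): flag invalid space groups and non-physical geometry."""
    flags = []
    if pred["space_group"] not in range(1, 231):  # only 230 space groups exist
        flags.append("invalid_space_group")
    a, b, c, alpha, beta, gamma = pred["lattice"]
    if min(a, b, c) <= 0 or not all(0 < ang < 180 for ang in (alpha, beta, gamma)):
        flags.append("geometric_outlier")
    return flags
```

For example, a cubic cell with a = b = c = 4 Å and all angles 90° implies a volume of 64 Å³; a model that reports those parameters alongside a volume of 80 Å³ would be penalized by the consistency index, and a predicted space group of 231 would be flagged as a hallucination.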