🤖 AI Summary
This study systematically evaluates the physical consistency and generalization capabilities of multimodal foundation models (e.g., FLAVA, BLIP-2) for crystallographic reasoning. To address two core challenges—spatial interpolation/extrapolation and compositional variation—we construct the first benchmark dataset spanning multiple scales and diverse crystal systems. We introduce a dual-exclusion evaluation protocol (spatial exclusion + compositional exclusion) alongside interpretable metrics—including a Physical Consistency Index and Hallucination Score—to rigorously assess predictions of lattice parameters, density, volume conservation, and space group assignment. Comprehensive evaluation across nine state-of-the-art models reveals pervasive geometric hallucinations and violations of fundamental physical constraints. All data, code, and evaluation tools are publicly released to advance AI for Science toward reproducible, physics-grounded assessment paradigms.
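The dual-exclusion protocol described above can be sketched as two dataset splits. This is an illustrative sketch only: the field names (`radius`, `composition`) and the flat-dict sample layout are assumptions, not the benchmark's actual schema.

```python
# Hypothetical sketch of the dual-exclusion evaluation protocol.
# Field names ("radius", "composition") are illustrative assumptions.

def spatial_exclusion_split(samples, held_out_radius):
    """Withhold every supercell at a given radius (spatial exclusion)."""
    train = [s for s in samples if s["radius"] != held_out_radius]
    test = [s for s in samples if s["radius"] == held_out_radius]
    return train, test

def compositional_exclusion_split(samples, held_out_composition):
    """Withhold every sample of a given stoichiometry (compositional exclusion)."""
    train = [s for s in samples if s["composition"] != held_out_composition]
    test = [s for s in samples if s["composition"] == held_out_composition]
    return train, test
```

A model trained (or prompted with in-context examples) only on `train` is then scored on `test`, so any success on the withheld radius or composition must come from generalization rather than memorization.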
📝 Abstract
Evaluating foundation models for crystallographic reasoning requires benchmarks that isolate generalization behavior while enforcing physical constraints. This work introduces a multiscale, multicrystal dataset with two physically grounded evaluation protocols to stress-test multimodal generative models. The Spatial-Exclusion benchmark withholds all supercells of a given radius from a diverse dataset, enabling controlled assessments of spatial interpolation and extrapolation. The Compositional-Exclusion benchmark omits all samples of a specific chemical composition, probing generalization across stoichiometries. Nine vision–language foundation models are prompted with crystallographic images and textual context to generate structural annotations. Responses are evaluated via (i) relative errors in lattice parameters and density, (ii) a physics-consistency index penalizing volumetric violations, and (iii) a hallucination score capturing geometric outliers and invalid space-group predictions. These benchmarks establish a reproducible, physically informed framework for assessing generalization, consistency, and reliability in large-scale multimodal models. Dataset and code are available at https://github.com/KurbanIntelligenceLab/StressTestingMMFMinCR.
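The three evaluation signals (i)–(iii) can be sketched as follows. This is a minimal sketch under stated assumptions: the tolerance `tol`, the scoring formula, and the input layout are illustrative, not the paper's exact definitions; only the triclinic volume formula and the 1–230 space-group range are standard crystallography.

```python
import math

# Illustrative sketch of the three evaluation signals; the threshold
# and scoring formulas are assumptions, not the paper's definitions.

def relative_error(pred, true):
    """Signal (i): relative error in a scalar quantity (e.g., lattice length, density)."""
    return abs(pred - true) / abs(true)

def triclinic_volume(a, b, c, alpha, beta, gamma):
    """Unit-cell volume from lattice parameters (angles in degrees);
    standard formula valid for all crystal systems."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)

def physics_consistency_index(pred_params, pred_volume, tol=0.05):
    """Signal (ii): penalize predictions whose reported volume disagrees
    with the volume implied by their own lattice parameters."""
    implied = triclinic_volume(*pred_params)
    err = relative_error(pred_volume, implied)
    return 1.0 if err <= tol else max(0.0, 1.0 - err)

def hallucination_flags(pred):
    """Signal (iii): flag invalid space groups and non-physical geometry."""
    flags = []
    if pred["space_group"] not in range(1, 231):  # only 230 space groups exist
        flags.append("invalid_space_group")
    a, b, c, alpha, beta, gamma = pred["lattice"]
    if min(a, b, c) <= 0 or not all(0 < ang < 180 for ang in (alpha, beta, gamma)):
        flags.append("geometric_outlier")
    return flags
```

For example, a cubic cell with a = b = c = 4 Å and all angles 90° implies a volume of 64 Å³; a model that reports those parameters alongside a volume of 80 Å³ would be penalized by the consistency index, and a predicted space group of 231 would be flagged as a hallucination.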