🤖 AI Summary
Existing methods struggle to evaluate the scientific validity of images because perceptual quality metrics correlate only weakly with scientific accuracy and general-purpose language models have limited domain-specific verification capabilities. This work proposes SIU²A, a framework that defines two dimensions of scientific image evaluation: utility, which covers error detectability and correctability, and upgradability, which refers to how well an image can be restored to scientific fidelity. The authors introduce SIU²A-Benchmark, an expert-annotated dataset covering four categories of scientific image corruption (Detail Distortion, Incompleteness, False Content, and Entity Confusion), along with a two-stage evaluation protocol that first assesses error identification and then evaluates restoration quality. Experimental results reveal significant shortcomings in current multimodal systems in both scientific error detection and faithful correction, highlighting a fundamental gap between visual perception and scientific usability.
📝 Abstract
Scientific images function as critical evidence in research communication, yet their integrity faces unprecedented threats from AI-generated content that introduces subtle but consequential errors. Existing evaluation paradigms prove inadequate: perceptual quality metrics correlate poorly with scientific validity, while language models lack domain-specific verification capabilities. To address this gap, we propose the \textbf{S}cientific \textbf{I}mage \textbf{U}tility and \textbf{U}pgradability \textbf{A}ssessment (\textbf{SIU$^2$A}) framework, which introduces two complementary dimensions for scientific image evaluation. \textbf{Utility} encompasses \textit{error detection} (identifying scientific inaccuracies) and \textit{correction feasibility} (assessing whether errors can be reliably repaired). \textbf{Upgradability} measures the quality of the resulting correction. We categorize scientific image corruption into four fundamental types: Detail Distortion, Incompleteness, False Content, and Entity Confusion. Based on this taxonomy, we construct SIU$^2$A-Benchmark, a dataset with expert annotations for error identification and repair. The framework implements a two-stage evaluation protocol: the \textit{Utility} stage evaluates error detection capability and repair instruction generation, while the \textit{Upgradability} stage assesses whether corrections faithfully restore scientific validity without compromising existing accurate information. Experiments reveal that current multimodal systems exhibit significant limitations in both scientific error assessment and faithful correction, exposing a fundamental gap between visual perception and scientific usability.
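The two-stage protocol described above can be pictured with a minimal Python sketch. This is an illustration only, not the authors' implementation: the `Sample` layout, the `StageResult` fields, the three callables (`detect`, `instruct`, `repair_ok`), and the choice to run the Upgradability stage only after a correct detection are all assumptions made for the example. Only the four corruption type names and the stage names come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Corruption(Enum):
    """The four corruption types named in the abstract."""
    DETAIL_DISTORTION = "Detail Distortion"
    INCOMPLETENESS = "Incompleteness"
    FALSE_CONTENT = "False Content"
    ENTITY_CONFUSION = "Entity Confusion"


@dataclass
class Sample:
    # Hypothetical record layout; the benchmark's real schema is not given.
    image_id: str
    expert_label: Corruption


@dataclass
class StageResult:
    detected: bool                     # Utility: was the error type identified?
    repair_instruction: Optional[str]  # Utility: generated repair instruction
    restored: Optional[bool]           # Upgradability: did the repair hold up?


def evaluate(
    sample: Sample,
    detect: Callable[[str], Corruption],
    instruct: Callable[[str, Corruption], str],
    repair_ok: Callable[[str, str], bool],
) -> StageResult:
    """Run a two-stage check on one sample (assumed control flow)."""
    # Stage 1 (Utility): error detection, then repair instruction generation.
    predicted = detect(sample.image_id)
    if predicted != sample.expert_label:
        # Assumption: a missed detection ends the evaluation for this sample.
        return StageResult(False, None, None)
    instruction = instruct(sample.image_id, predicted)

    # Stage 2 (Upgradability): does the correction restore scientific
    # validity without damaging content that was already accurate?
    restored = repair_ok(sample.image_id, instruction)
    return StageResult(True, instruction, restored)
```

In this sketch the callables stand in for a model under test and for a judge of the repaired image; any real evaluation would replace them with the benchmark's own scoring procedure.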