🤖 AI Summary
Ultrasound image quality varies significantly due to operator skill and acquisition conditions, hindering reliable AI-based quality control in clinical practice.
Method: We introduce Ultrasound-QBench, the first multimodal large language model (MLLM) benchmark for ultrasound image quality assessment, integrating two expert-annotated datasets (IVUSQA and CardiacUltraQA). It establishes a three-task evaluation paradigm: qualitative classification, quantitative scoring, and pairwise image comparison, covering multiple anatomical regions, artifact types, and three-tier quality labels (high, medium, low). Our MLLM-based cross-modal understanding framework jointly processes ultrasound images and textual instructions via supervised fine-tuning to enable fine-grained quality reasoning.
Results: An evaluation of seven open-source MLLMs and one proprietary MLLM on Ultrasound-QBench indicates that current MLLMs have preliminary low-level visual capabilities for ultrasound image quality classification. The benchmark provides a scalable, reproducible basis and open-source infrastructure for quality assurance in medical imaging AI.
📝 Abstract
With the dramatic upsurge in the volume of ultrasound examinations, the number of low-quality ultrasound images has gradually increased due to variations in operator proficiency and imaging circumstances, imposing a severe burden on diagnostic accuracy and even entailing the risk of restarting the diagnosis in critical cases. To assist clinicians in selecting high-quality ultrasound images and ensuring accurate diagnoses, we introduce Ultrasound-QBench, a comprehensive benchmark that systematically evaluates multimodal large language models (MLLMs) on ultrasound image quality assessment tasks. Ultrasound-QBench establishes two datasets collected from diverse sources: IVUSQA, consisting of 7,709 images, and CardiacUltraQA, containing 3,863 images. These images, which encompass common ultrasound imaging artifacts, are annotated by professional ultrasound experts and classified into three quality levels: high, medium, and low. To better evaluate MLLMs, we decompose the quality assessment task into three dimensions: qualitative classification, quantitative scoring, and comparative assessment. The evaluation of 7 open-source MLLMs and 1 proprietary MLLM demonstrates that MLLMs possess preliminary capabilities for low-level visual tasks in ultrasound image quality classification. We hope this benchmark will inspire the research community to delve deeper into uncovering and enhancing the untapped potential of MLLMs for medical imaging tasks.