Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models

📅 2025-12-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Medical multi-modal large language models (MLLMs) exhibit insufficient robustness under real-world clinical noise, such as imaging artifacts and textual typos, hindering clinical deployment. Existing approaches typically rely on costly fine-tuning and fail to systematically model biomedical-specific noise characteristics. This paper introduces Inherent-enhanced Multi-modal Calibration (IMC), a training-free cross-modal calibration framework following a perceive-and-calibrate paradigm. IMC comprises two core components: (1) Perturbation-aware Denoising Calibration (PDC), which uses the MLLM's own vision encoder to identify noise patterns and reconstruct corrupted visual features via prototype-guided calibration; and (2) a Self-instantiated Multi-agent System (SMS), which exploits the MLLM's self-assessment capabilities to cooperatively correct textual errors. Evaluated on a novel benchmark covering 11 clinically relevant noise types, IMC significantly improves diagnostic consistency and reasoning stability, achieving state-of-the-art performance. Crucially, it delivers robustness enhancement without any parameter updates, demonstrating strong potential for clinical translation.

📝 Abstract
Medical Multi-modal Large Language Models (MLLMs) have shown promising clinical performance. However, their sensitivity to real-world input perturbations, such as imaging artifacts and textual errors, critically undermines their clinical applicability. Systematic analysis of such noise impact on medical MLLMs remains largely unexplored. Furthermore, while several works have investigated the MLLMs' robustness in general domains, they primarily focus on text modality and rely on costly fine-tuning. They are inadequate to address the complex noise patterns and fulfill the strict safety standards in medicine. To bridge this gap, this work systematically analyzes the impact of various perturbations on medical MLLMs across both visual and textual modalities. Building on our findings, we introduce a training-free Inherent-enhanced Multi-modal Calibration (IMC) framework that leverages MLLMs' inherent denoising capabilities following the perceive-and-calibrate principle for cross-modal robustness enhancement. For the visual modality, we propose a Perturbation-aware Denoising Calibration (PDC) which leverages MLLMs' own vision encoder to identify noise patterns and perform prototype-guided feature calibration. For text denoising, we design a Self-instantiated Multi-agent System (SMS) that exploits the MLLMs' self-assessment capabilities to refine noisy text through a cooperative hierarchy of agents. We construct a benchmark containing 11 types of noise across both image and text modalities on 2 datasets. Experimental results demonstrate our method achieves the state-of-the-art performance across multiple modalities, showing potential to enhance MLLMs' robustness in real clinical scenarios.
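The prototype-guided feature calibration described for PDC can be illustrated with a minimal sketch. The function name, the cosine-similarity softmax weighting, and the blend coefficient `alpha` below are illustrative assumptions, not the paper's actual algorithm:

```python
import numpy as np

def prototype_calibrate(feature, prototypes, alpha=0.5, temperature=0.1):
    """Blend a (possibly corrupted) visual feature toward its nearest clean
    prototypes. This is a hypothetical sketch of prototype-guided calibration.

    feature:    (d,) feature vector from the vision encoder
    prototypes: (k, d) bank of clean feature prototypes
    alpha:      calibration strength (0 keeps the input, 1 fully replaces it)
    """
    # Cosine similarity between the noisy feature and each prototype
    f = feature / np.linalg.norm(feature)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = p @ f
    # Softmax weights concentrate on the most similar prototypes
    weights = np.exp(sims / temperature)
    weights /= weights.sum()
    reconstruction = weights @ prototypes
    # Convex blend of the original feature and the prototype reconstruction
    return (1 - alpha) * feature + alpha * reconstruction
```

For a feature that is a lightly perturbed copy of one prototype, the weights collapse onto that prototype and the output is pulled back toward the clean feature.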
Problem

Research questions and friction points this paper is trying to address.

Analyzes impact of real-world noise on medical multi-modal large language models.
Enhances robustness without costly fine-tuning for clinical safety standards.
Addresses complex noise patterns in both visual and textual modalities.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free calibration framework enhances cross-modal robustness.
Vision encoder identifies noise patterns for prototype-guided feature calibration.
Multi-agent system refines noisy text through self-assessment hierarchy.
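The self-assessment-driven text refinement in the last point can be sketched as a corrector–judge loop. Here `llm` is a hypothetical callable standing in for the MLLM, and the prompts and stopping rule are assumptions for illustration, not the paper's SMS agent hierarchy:

```python
def sms_denoise(text, llm, max_rounds=3):
    """Cooperative refinement loop: a corrector agent proposes a cleaned
    text, a judge agent self-assesses it; stop when the judge accepts or
    the round budget is exhausted. `llm(prompt)` returns a string."""
    current = text
    for _ in range(max_rounds):
        # Corrector role: propose a denoised version of the text
        cleaned = llm(f"Fix typos and errors in this clinical text:\n{current}")
        # Judge role: self-assess whether the correction is acceptable
        verdict = llm(f"Is this clinical text free of errors? Answer YES or NO:\n{cleaned}")
        current = cleaned
        if verdict.strip().upper().startswith("YES"):
            break
    return current
```

The loop terminates early on a positive self-assessment, so well-formed inputs cost only one corrector/judge round.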
Dunyuan Xu
Department of Computer Science and Engineering, CUHK, Hong Kong, China
Xikai Yang
Department of Computer Science and Engineering, CUHK, Hong Kong, China
Yaoqian Li
Department of Computer Science and Engineering, CUHK, Hong Kong, China
Juzheng Miao
PhD student, The Chinese University of Hong Kong
Medical image analysis · label-efficient learning · reinforcement learning · causality
Jinpeng Li
Department of Computer Science and Engineering, CUHK, Hong Kong, China
Pheng-Ann Heng
Department of Computer Science and Engineering, CUHK, Hong Kong, China; Institute of Medical Intelligence and XR, CUHK, Hong Kong, China