🤖 AI Summary
This work addresses two challenges in disease recognition from multimodal medical images: existing methods struggle to fuse complementary cross-modal information effectively, and scarce annotated data combined with the domain shift from natural images limits the applicability of vision foundation models. To overcome these limitations, the authors propose an Early Intervention (EI) framework that injects high-level semantic cues from a reference modality as an intervention signal during the embedding stage, guiding feature learning in the target modality. They further design a parameter-efficient Mixture of Low-varied-Ranks Adaptation (MoR) strategy for fine-tuning vision foundation models. By enhancing few-shot transferability through multimodal semantic-token interaction, the method achieves significant gains over competitive baselines on three public datasets covering retinal diseases, skin lesions, and knee abnormalities, demonstrating its effectiveness and generalization.
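The core idea of Early Intervention, as summarized above, is to let reference-modality semantics influence the target modality before (rather than after) unimodal embedding. A minimal sketch of that token-level injection is below; the token counts, dimensions, and modality examples are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, n_ref, dim = 196, 4, 64  # illustrative sizes, not from the paper

# Target-modality patch embeddings (e.g. fundus image patches).
target_tokens = rng.normal(size=(1, n_patches, dim))
# High-level semantic tokens distilled from the reference modality (e.g. OCT).
ref_tokens = rng.normal(size=(1, n_ref, dim))

# Early intervention: prepend the reference tokens to the target sequence
# *before* the transformer encoder, so self-attention can let cross-modal
# cues steer the target embedding from the earliest layers onward.
seq = np.concatenate([ref_tokens, target_tokens], axis=1)
```

This contrasts with the "fusion after unimodal embedding" paradigm, where each modality would be encoded independently and only the final feature vectors combined.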
📝 Abstract
Current methods for disease recognition from multimodal medical imaging face two major challenges. First, the prevailing "fusion after unimodal image embedding" paradigm cannot fully exploit the complementary and correlated information in multimodal data. Second, the scarcity of labeled multimodal medical images, coupled with their significant domain shift from natural images, hinders the use of cutting-edge Vision Foundation Models (VFMs) for medical image embedding. To jointly address these challenges, we propose a novel Early Intervention (EI) framework. Treating one modality as the target and the rest as references, EI harnesses high-level semantic tokens from the references as intervention tokens to steer the target modality's embedding process at an early stage. Furthermore, we introduce Mixture of Low-varied-Ranks Adaptation (MoR), a parameter-efficient fine-tuning method that employs a set of low-rank adapters with varied ranks and a weight-relaxed router for VFM adaptation. Extensive experiments on three public datasets for retinal disease, skin lesion, and knee anomaly classification verify the effectiveness of the proposed method against a number of competitive baselines.
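The MoR component described above (a set of low-rank adapters with varied ranks, mixed by a weight-relaxed router over a frozen VFM layer) can be sketched roughly as follows. The specific ranks, the softmax router, and all tensor shapes here are illustrative assumptions; the paper's actual routing and scaling details may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 16, 16
ranks = [2, 4, 8]  # hypothetical varied ranks for the adapter set

# Frozen pretrained weight, standing in for one VFM linear layer.
W = rng.normal(size=(d_out, d_in))

# One low-rank adapter pair (B, A) per rank; B @ A has rank <= r.
adapters = [(rng.normal(size=(d_out, r)) * 0.01,
             rng.normal(size=(r, d_in)) * 0.01) for r in ranks]

# Router parameters producing one logit per adapter.
logits_W = rng.normal(size=(d_in, len(ranks)))

def router(x):
    # "Weight-relaxed" routing sketched as a plain softmax: every adapter
    # contributes with a soft weight instead of hard top-k selection.
    logits = x @ logits_W
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mor_forward(x):
    w = router(x)                                  # (batch, n_adapters)
    base = x @ W.T                                 # frozen VFM path
    delta = sum(w[:, [i]] * (x @ A.T @ B.T)        # weighted low-rank updates
                for i, (B, A) in enumerate(adapters))
    return base + delta

x = rng.normal(size=(3, d_in))
y = mor_forward(x)
```

Only the adapter matrices and router would be trained, which is what makes the adaptation parameter-efficient relative to full VFM fine-tuning.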