NEARL-CLIP: Interacted Query Adaptation with Orthogonal Regularization for Medical Vision-Language Understanding

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical image analysis is hindered by scarce annotated data and a substantial semantic gap between general-purpose vision-language models (VLMs) and the medical domain. Existing approaches often rely on unidirectional modality adaptation or prompt tuning, leading to misalignment between visual and textual representations. To address this, we propose a parameter-efficient framework for dynamic cross-modal interaction. Our method introduces a unified collaborative embedding Transformer coupled with orthogonal cross-attention adapters, enabling bidirectional and decoupled vision–language interaction. Additionally, we impose orthogonal regularization on modality-specific representation spaces to mitigate representation misalignment. With only 1.46M trainable parameters, our approach consistently outperforms unidirectional interaction and single-modality fine-tuning baselines across multiple medical vision–language tasks. Experimental results validate the effectiveness of dynamic cross-modal alignment and knowledge-separated learning in bridging domain-specific semantic gaps while maintaining high parameter efficiency.
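The bidirectional interaction described above can be illustrated with a minimal single-head cross-attention sketch, where queries from one modality attend over tokens of the other. This is an illustrative simplification, not the paper's USEformer: real adapters use learned query/key/value projections and multiple heads, and all array names and sizes here are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries: np.ndarray, tokens: np.ndarray) -> np.ndarray:
    """Single-head cross-attention: `queries` from one modality attend
    over `tokens` of the other modality (keys and values share the same
    token matrix in this simplified sketch)."""
    d = queries.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)   # scaled dot-product scores
    return softmax(scores, axis=-1) @ tokens   # attention-weighted values

# Hypothetical shapes: 4 text-side queries, 10 image tokens, dim 16.
rng = np.random.default_rng(0)
text_queries = rng.normal(size=(4, 16))
image_tokens = rng.normal(size=(10, 16))
image_queries = rng.normal(size=(6, 16))
text_tokens = rng.normal(size=(12, 16))

# Bidirectional interaction: each modality queries the other.
text_enriched = cross_attention(text_queries, image_tokens)
image_enriched = cross_attention(image_queries, text_tokens)
print(text_enriched.shape, image_enriched.shape)
```

The key point is symmetry: unlike one-way modality adaptation, both directions of attention run, so each modality's representation is updated with context from the other.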

📝 Abstract
Computer-aided medical image analysis is crucial for disease diagnosis and treatment planning, yet limited annotated datasets restrict medical-specific model development. While vision-language models (VLMs) like CLIP offer strong generalization capabilities, their direct application to medical imaging analysis is impeded by a significant domain gap. Existing approaches to bridge this gap, including prompt learning and one-way modality interaction techniques, typically focus on introducing domain knowledge to a single modality. Although this may offer performance gains, it often causes modality misalignment, thereby failing to unlock the full potential of VLMs. In this paper, we propose NEARL-CLIP (iNteracted quEry Adaptation with oRthogonaL Regularization), a novel cross-modality interaction VLM-based framework that contains two contributions: (1) Unified Synergy Embedding Transformer (USEformer), which dynamically generates cross-modality queries to promote interaction between modalities, thus fostering the mutual enrichment and enhancement of multi-modal medical domain knowledge; (2) Orthogonal Cross-Attention Adapter (OCA). OCA introduces an orthogonality technique to decouple the new knowledge from USEformer into two distinct components: the truly novel information and the incremental knowledge. By isolating the learning process from the interference of incremental knowledge, OCA enables a more focused acquisition of new information, thereby further facilitating modality interaction and unleashing the capability of VLMs. Notably, NEARL-CLIP achieves these two contributions in a parameter-efficient manner, introducing only 1.46M learnable parameters.
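A standard way to impose the kind of orthogonality constraint OCA relies on is a soft Frobenius-norm penalty that pushes a weight matrix's columns toward an orthonormal set, so that decoupled components occupy non-interfering subspaces. The sketch below shows this generic penalty only; it is an assumption for illustration, not the paper's exact OCA loss.

```python
import numpy as np

def orthogonal_penalty(W: np.ndarray) -> float:
    """Soft orthogonality regularizer ||W^T W - I||_F^2.

    The penalty is zero exactly when the columns of W are orthonormal,
    i.e. when the learned directions are mutually decoupled.
    """
    gram = W.T @ W                    # pairwise column inner products
    eye = np.eye(W.shape[1])          # target: identity Gram matrix
    return float(np.sum((gram - eye) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))           # hypothetical adapter weight matrix

# A random matrix is generally far from orthogonal, so the penalty is large;
# the Q factor of its QR decomposition has orthonormal columns, so the
# penalty vanishes (up to floating-point error).
Q, _ = np.linalg.qr(W)
print(orthogonal_penalty(W), orthogonal_penalty(Q))
```

In training, such a term is typically added to the task loss with a small weight, nudging the adapter's new-knowledge directions to stay orthogonal to, and hence free from interference by, the incremental-knowledge directions.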
Problem

Research questions and friction points this paper is trying to address.

Bridges domain gap in medical vision-language models
Enhances cross-modality interaction for medical imaging
Reduces modality misalignment in vision-language understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic cross-modality queries with USEformer
Orthogonal technique for knowledge decoupling
Parameter-efficient medical VLM adaptation
Zelin Peng
Shanghai Jiao Tong University
Computer Vision, Medical Image Processing
Yichen Zhao
MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University
Yu Huang
MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University
Piao Yang
Department of Radiology, The First Affiliated Hospital, School of Medicine, Zhejiang University
Feilong Tang
Mohamed bin Zayed University of Artificial Intelligence
Zhengqin Xu
State Key Laboratory of Infrared Physics, Shanghai Institute of Technical Physics, Chinese Academy of Sciences
Xiaokang Yang
MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University
Wei Shen
MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University