LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
📄 PDF
🤖 AI Summary
Existing medical multimodal large language models (med-MLLMs) rely on large-scale instruction-following datasets and autoregressive pretraining, which yields weak vision–language alignment and high instruction-tuning costs. To address this, the paper proposes LoGra-Med, a long-context multi-graph alignment framework with a fine-grained triplet alignment algorithm that explicitly models semantic correspondences among image features, conversation-based descriptions, and extended captions, thereby strengthening cross-modal grounding. The method combines graph-based alignment, black-box gradient estimation for efficient end-to-end contrastive training, and a LLaMA-7B backbone. Using only 10% of the standard pretraining data, LoGra-Med surpasses LLaVA-Med by 20.13% on VQA-RAD and reaches 99.8% of the full-data score (72.52% vs. 72.64%); it also surpasses prior state-of-the-art methods on medical visual chatbot and zero-shot image classification tasks.

📝 Abstract
State-of-the-art medical multi-modal large language models (med-MLLMs), such as LLaVA-Med and BioMedGPT, leverage instruction-following data in pre-training. However, these models primarily focus on scaling model size and data volume to boost performance, while relying mainly on autoregressive learning objectives. Surprisingly, we reveal that such learning schemes can result in weak alignment between the vision and language modalities, making these models highly reliant on extensive pre-training datasets, a significant challenge in medical domains due to the expensive and time-consuming nature of curating high-quality instruction-following instances. We address this with LoGra-Med, a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions. This helps the model capture contextual meaning, handle linguistic variability, and build cross-modal associations between visuals and text. To scale our approach, we designed an efficient end-to-end learning scheme using black-box gradient estimation, enabling faster LLaMA-7B training. Our results show that LoGra-Med matches LLaVA-Med performance when pre-trained on the full 600K image-text pairs for medical VQA, and significantly outperforms it when trained on only 10% of the data. For example, on VQA-RAD we exceed LLaVA-Med by 20.13% and nearly match the 100% pre-training score (72.52% vs. 72.64%). We also surpass SOTA methods such as BioMedGPT on visual chatbots and RadFM on zero-shot image classification with VQA, highlighting the effectiveness of multi-graph alignment.
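The abstract's "triplet correlations" across image features, conversation-based descriptions, and extended captions can be illustrated in heavily simplified form by averaging a pairwise contrastive (InfoNCE-style) loss over the three modality pairs. This is only a sketch of the contrastive principle, not the paper's actual multi-graph alignment algorithm; all function names and the noise model below are illustrative assumptions:

```python
import numpy as np

def _normalize(x):
    # Project each row embedding onto the unit sphere.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE loss: matched rows of a and b are positives."""
    a, b = _normalize(a), _normalize(b)
    logits = a @ b.T / tau  # (N, N) cosine-similarity matrix, temperature-scaled
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_ab = -np.mean(np.diag(log_prob))
    # Symmetrize by swapping the roles of the two modalities.
    logits_t = logits.T
    log_prob_t = logits_t - np.log(np.exp(logits_t).sum(axis=1, keepdims=True))
    loss_ba = -np.mean(np.diag(log_prob_t))
    return 0.5 * (loss_ab + loss_ba)

def triplet_alignment_loss(img, conv, cap, tau=0.07):
    """Average the pairwise contrastive loss over the three modality pairs."""
    return (info_nce(img, conv, tau)
            + info_nce(img, cap, tau)
            + info_nce(conv, cap, tau)) / 3.0

# Toy data: conversation and caption embeddings are noisy views of the image.
rng = np.random.default_rng(0)
N, d = 8, 32
img = rng.normal(size=(N, d))
conv = img + 0.1 * rng.normal(size=(N, d))
cap = img + 0.1 * rng.normal(size=(N, d))

aligned = triplet_alignment_loss(img, conv, cap)
shuffled = triplet_alignment_loss(img, conv[::-1], cap)  # break the pairing
print(aligned < shuffled)  # True: correctly paired triplets incur lower loss
```

A lower loss for correctly paired triplets is what drives the encoders toward the cross-modal associations the abstract describes; the paper additionally structures these correspondences as graphs rather than flat pairs.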
Problem

Research questions and friction points this paper is trying to address.

Weak vision-language alignment in medical multi-modal LLMs
Over-reliance on costly instruction-following data
Need for efficient training in large-scale medical AI models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-graph alignment for vision-language models
Efficient black-box gradient estimation training
Enhanced semantic grounding with extended captions
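The paper's black-box gradient estimation is not detailed on this page; as a hedged illustration of the general technique, a simultaneous-perturbation (SPSA-style) zeroth-order estimator approximates gradients from function evaluations alone, without backpropagating through the black-box objective. The objective and all names below are illustrative:

```python
import numpy as np

def spsa_gradient(f, x, eps=1e-3, n_samples=2000, rng=None):
    """Zeroth-order gradient estimate of f at x via random perturbations.

    Each sample uses only two function evaluations (f at x + eps*delta and
    x - eps*delta), so no analytic gradient of f is needed.
    """
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        delta = rng.choice([-1.0, 1.0], size=x.shape)  # Rademacher perturbation
        grad += (f(x + eps * delta) - f(x - eps * delta)) / (2 * eps) * delta
    return grad / n_samples

# Toy objective with a known gradient: f(x) = ||x||^2, so grad f = 2x.
f = lambda x: float(np.sum(x ** 2))
x = np.array([1.0, -2.0, 3.0])
g_est = spsa_gradient(f, x)
g_true = 2 * x
cos = g_est @ g_true / (np.linalg.norm(g_est) * np.linalg.norm(g_true))
print(cos)  # close to 1: the estimate points along the true gradient
```

The appeal in this setting is that the upstream modules can be trained end-to-end even when gradients through part of the pipeline are unavailable or expensive, at the cost of extra forward evaluations.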
👥 Authors

D. M. Nguyen
Max Planck Research School for Intelligent Systems (IMPRS-IS), University of Stuttgart; German Research Centre for Artificial Intelligence (DFKI)
N. T. Diep
German Research Centre for Artificial Intelligence (DFKI)
Trung Q. Nguyen
German Research Centre for Artificial Intelligence (DFKI); Technical University of Munich
Hoang-Bao Le
Dublin City University
Information Retrieval · Natural Language Processing · Large Language Models
Tai Nguyen
German Research Centre for Artificial Intelligence (DFKI)
Tien Nguyen
Virginia Tech
Software Engineering · Testing · Debugging
Trung Q. Nguyen
University of Queensland
Nhat Ho
Assistant Professor, University of Texas at Austin
Machine Learning · Bayesian Statistics · Optimization · Optimal Transport · Deep Learning
Pengtao Xie
Associate Professor, UC San Diego; Adjunct Faculty, MBZUAI
Machine Learning
R. Wattenhofer
ETH Zurich
James Zou
Stanford University
Daniel Sonntag
DFKI and University of Oldenburg
Interactive Machine Learning · Intelligent User Interfaces · Multimodal Interaction
Mathias Niepert
University of Stuttgart & NEC Labs Europe
Machine Learning