Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation

📅 2025-09-29
🤖 AI Summary
Medical vision-language models (VLMs) rarely cover PET/CT functional imaging or low-resource languages such as Vietnamese. To address this gap, this work introduces the first large-scale Vietnamese PET/CT dataset (1,567,062 paired CT-PET images with 2,757 corresponding full-length clinical reports) and proposes an enhanced training framework integrating cross-modal alignment, supervised fine-tuning, and expert knowledge guidance. Key contributions are: (1) the first high-quality, publicly available benchmark dataset bridging PET/CT imaging and Vietnamese clinical reporting; (2) an expert-validated multitask evaluation set and standardized assessment protocol; and (3) substantial performance gains for mainstream VLMs on Vietnamese medical report generation and visual question answering. Together these provide reusable data, methodology, and evaluation paradigms for clinical AI in low-resource language settings.

📝 Abstract
Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, thus limiting their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset comprising 1,567,062 paired CT-PET images and 2,757 corresponding full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLM training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly Vietnamese, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs' learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks, including medical report generation and visual question answering. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, particularly in low-resource languages, and improving their clinical relevance in Vietnamese healthcare.
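To give a sense of scale, the two counts quoted in the abstract can be turned into a rough images-per-report figure. The sketch below is only back-of-the-envelope arithmetic: it assumes every image belongs to exactly one report, which the abstract does not state, and the function name is hypothetical.

```python
# Counts quoted in the abstract.
NUM_IMAGES = 1_567_062   # paired CT-PET images
NUM_REPORTS = 2_757      # full-length clinical reports


def images_per_report(num_images: int, num_reports: int) -> float:
    """Mean number of images per report.

    Assumption (not stated in the source): each image is attached
    to exactly one report, so a simple ratio is meaningful.
    """
    if num_reports <= 0:
        raise ValueError("num_reports must be positive")
    return num_images / num_reports


avg = images_per_report(NUM_IMAGES, NUM_REPORTS)
print(f"{avg:.1f} images per report on average")
```

Under that assumption, each report corresponds to roughly 568 images on average; the actual per-study distribution is not given in the abstract.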
Problem

Research questions and friction points this paper is trying to address.

Addressing limited PET/CT imaging data in medical vision-language models
Overcoming underrepresentation of Vietnamese language in medical AI
Enhancing multimodal medical report generation for low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vietnamese PET/CT multimodal dataset with clinical reports
Training framework using data augmentation techniques
Benchmarking VLMs for medical report generation tasks
Huu Tien Nguyen
AI4LIFE, Hanoi University of Science and Technology, Vietnam
Dac Thai Nguyen
Unknown affiliation
The Minh Duc Nguyen
AI4LIFE, Hanoi University of Science and Technology, Vietnam
Trung Thanh Nguyen
Nagoya University, Japan
Thao Nguyen Truong
AIST, Japan
Huy Hieu Pham
VinUniversity, Vietnam
Johan Barthelemy
NVIDIA, United States
Minh Quan Tran
NVIDIA, United States
Thanh Tam Nguyen
Lecturer, Griffith University
Social Network Mining, Stream Processing, Big Data, Privacy-Preserving ML, Recommender Systems
Quoc Viet Hung Nguyen
Griffith University, Australia
Quynh Anh Chau
Hanoi Medical University, Vietnam
Hong Son Mai
108 Military Central Hospital, Vietnam
Thanh Trung Nguyen
Le Quy Don Technical University, Viet Nam
blockchain, end-to-end encryption, nosql, key-value, big data
Phi Le Nguyen
AI4LIFE, Hanoi University of Science and Technology, Vietnam