BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA

๐Ÿ“… 2025-03-04
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing medical visual question answering (VQA) methods suffer from insufficient fine-grained multimodal semantic alignment, as they perform modality interaction solely at the large language model (LLM) level, leading to weak semantic coupling. Method: We propose a dual-level semantic consistency constraint framework featuring novel model-level and feature-levelๅๅŒ alignment. Specifically, we introduce conditional visual feature learning for fine-grained cross-modal alignment, design a text-queue-driven cross-modal soft semantic loss, and construct BioVGQโ€”the first debiased medical visual-grounding question answering dataset with precise image-text localization annotations. Contribution/Results: Our approach achieves significant improvements over state-of-the-art methods across multiple medical VQA benchmarks, enhancing model robustness, generalization capability, and clinical applicability.

Technology Category

Application Category

๐Ÿ“ Abstract
Biomedical visual question answering (VQA) has been widely studied and has demonstrated significant application value and potential in fields such as assistive medical diagnosis. Despite their success, current biomedical VQA models perform multimodal information interaction only at the model level within large language models (LLMs), leading to suboptimal multimodal semantic alignment when dealing with complex tasks. To address this issue, we propose BioD2C: a novel Dual-level Semantic Consistency Constraint Framework for Biomedical VQA, which achieves dual-level semantic interaction alignment at both the model and feature levels, enabling the model to adaptively learn visual features based on the question. Specifically, we firstly integrate textual features into visual features via an image-text fusion mechanism as feature-level semantic interaction, obtaining visual features conditioned on the given text; and then introduce a text-queue-based cross-modal soft semantic loss function to further align the image semantics with the question semantics. Specifically, in this work, we establish a new dataset, BioVGQ, to address inherent biases in prior datasets by filtering manually-altered images and aligning question-answer pairs with multimodal context, and train our model on this dataset. Extensive experimental results demonstrate that BioD2C achieves state-of-the-art (SOTA) performance across multiple downstream datasets, showcasing its robustness, generalizability, and potential to advance biomedical VQA research.
Problem

Research questions and friction points this paper is trying to address.

Improves multimodal semantic alignment in biomedical VQA
Introduces dual-level semantic interaction at model and feature levels
Addresses dataset biases with a new dataset, BioVGQ
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-level semantic interaction alignment
Image-text fusion mechanism integration
Text-queue-based cross-modal loss function
๐Ÿ”Ž Similar Papers
No similar papers found.
Zhengyang Ji
Zhengyang Ji
Shandong University
AI
S
Shang Gao
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
L
Li Liu
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Y
Yifan Jia
Shandong University, Qingdao, China; The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Y
Yutao Yue
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; Institute of Deep Perception Technology, JITRI, Wuxi, China