UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA

📅 2026-06-10
🤖 AI Summary
This work addresses the lack of fine-grained supervision in 3D medical visual question answering (VQA) by proposing UniReason-Med, a unified framework that exposes a shared grounded reasoning interface for 2D images and 3D volumes. The method uses a shared bounding-box syntax, injects region tokens, and serializes 3D inputs into slice sequences. It is trained with supervised fine-tuning followed by outcome-level reinforcement learning, and it produces interpretable grounded reasoning traces without IoU- or Dice-based localization rewards. Data-mixture ablations using the newly curated UniMed-CoT dataset (220K samples: 170K 2D and 50K 3D) show that joint 2D and 3D grounded supervision substantially improves 3D medical VQA over 3D-only training.
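To make the idea of one box syntax covering both input types concrete, here is a minimal Python sketch. The tag format (`<box>` with an optional `slice=` attribute), the `Box` class, and the stride-based `serialize_volume` helper are all assumptions made for illustration; the summary does not specify the actual grammar or slice-sampling scheme used by UniReason-Med.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical tag format: the real UniReason-Med box syntax is not given here.
BOX_PATTERN = re.compile(
    r"<box(?: slice=(\d+))?>\[(\d+),(\d+),(\d+),(\d+)\]</box>"
)


@dataclass(frozen=True)
class Box:
    x1: int
    y1: int
    x2: int
    y2: int
    slice_index: Optional[int] = None  # None means the box refers to a 2D image


def format_box(box: Box) -> str:
    """Emit a box in the (assumed) shared syntax; 3D boxes add a slice index."""
    attr = "" if box.slice_index is None else f" slice={box.slice_index}"
    return f"<box{attr}>[{box.x1},{box.y1},{box.x2},{box.y2}]</box>"


def parse_boxes(text: str) -> List[Box]:
    """Recover every box from an interleaved reasoning trace, in order."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        s, x1, y1, x2, y2 = match.groups()
        boxes.append(
            Box(int(x1), int(y1), int(x2), int(y2), None if s is None else int(s))
        )
    return boxes


def serialize_volume(num_slices: int, stride: int) -> List[int]:
    """Pick the slice indices that form the slice sequence (assumed fixed stride)."""
    return list(range(0, num_slices, stride))
```

Under these assumptions, a 2D trace and a 3D trace differ only in whether a box carries a slice index, so the same parser handles both.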
📝 Abstract
We study whether grounded reasoning supervision from abundant 2D medical images can improve 3D medical VQA when both input types are aligned through a common reasoning interface. We introduce UniReason-Med, a single-checkpoint framework that processes either a 2D image or a slice-serialized 3D volume at inference time, generating interleaved textual reasoning and localized visual evidence through shared box syntax, region-token injection, and a common grounded reasoning policy. To train this interface, we construct UniMed-CoT, a 220K instruction-tuning dataset with interleaved textual reasoning and grounded visual evidence, including 170K 2D and 50K 3D samples. Through supervised fine-tuning followed by outcome-level reinforcement learning, UniReason-Med learns to generate grounded reasoning traces without IoU/Dice-based localization rewards during RL. Data-mixture and component ablations show that joint 2D+3D grounded supervision substantially improves 3D reasoning over 3D-only training, while grounding and region-token injection consistently benefit both 2D and 3D tasks. These results suggest that a shared grounded reasoning interface can transfer reasoning structure from 2D images to slice-serialized volumetric medical understanding. The code and data are publicly available at https://github.com/IQuestLab/unireason-med.
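As a rough illustration of what an outcome-level reward without IoU/Dice terms could look like, the Python sketch below scores a generated trace only by its final answer. The `<answer>` tag, the `normalize` helper, and the exact-match criterion are assumptions for this example; the abstract does not describe the reward used in the paper beyond it being outcome-level.

```python
import re

# Hypothetical answer tag; the paper's actual output format is not stated here.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not matter."""
    return " ".join(text.lower().split())


def outcome_reward(trace: str, gold_answer: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Any boxes in the trace are deliberately ignored: no localization
    overlap (IoU or Dice) enters the reward.
    """
    match = ANSWER_PATTERN.search(trace)
    if match is None:
        return 0.0  # a trace with no parsable answer earns nothing
    return 1.0 if normalize(match.group(1)) == normalize(gold_answer) else 0.0
```

With a reward of this shape, two traces that give the same correct answer receive the same score regardless of where their boxes fall, so any grounding quality would have to come from the supervised stage or emerge indirectly.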
Problem

Research questions and friction points this paper is trying to address.

medical VQA
2D-to-3D transfer
grounded reasoning
slice-serialized 3D volume
reasoning interface
Innovation

Methods, ideas, or system contributions that make the work stand out.

grounded reasoning
2D-to-3D transfer
medical VQA
region-token injection
shared reasoning interface