🤖 AI Summary
Medical image report generation faces two key challenges: insufficient capture of fine-grained pathological details and degraded performance in zero-shot (image-only) inference. To address these, we propose DTrace, a dynamic traceback learning framework whose novelty lies in a traceback mechanism that supervises the semantic validity of generated content and a modality-adaptive dynamic learning strategy, enabling robust generation under weak textual supervision. Our method integrates cross-modal masked semantic reconstruction, vision-language joint representation learning, dynamic weight adjustment, and trace consistency constraints. Evaluated on IU-Xray and MIMIC-CXR, DTrace substantially outperforms state-of-the-art methods, particularly in zero-shot settings, where it achieves significant improvements in the clinical relevance and descriptive accuracy of generated reports. These results validate DTrace's enhanced ability to model critical pathological features and to generalize to unseen reporting scenarios.
📝 Abstract
Automated medical report generation has the potential to significantly reduce the workload of the time-consuming medical reporting process. Recent generative representation learning methods have shown promise in integrating vision and language modalities for medical report generation. However, when trained end-to-end and applied directly to medical image-to-text generation, they face two significant challenges: i) difficulty in accurately capturing subtle yet crucial pathological details, and ii) reliance on both visual and textual inputs, which degrades performance in zero-shot inference when only images are available. To address these challenges, this study proposes a novel multi-modal dynamic traceback learning framework (DTrace). Specifically, we introduce a traceback mechanism to supervise the semantic validity of generated content and a dynamic learning strategy to adapt to varying proportions of image and text input, enabling text generation without strong reliance on both modalities being present at inference. The learning of cross-modal knowledge is further enhanced by supervising the model to recover masked semantic information from its complementary counterpart. Extensive experiments on two benchmark datasets, IU-Xray and MIMIC-CXR, demonstrate that the proposed DTrace framework outperforms state-of-the-art methods for medical report generation.
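The dynamic learning strategy described above, adapting to varying proportions of image and text input so that inference can proceed from images alone, can be illustrated with a minimal sketch. The function names, the always-keep-image choice, and the linear schedule below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def sample_modality_mask(n_img_tokens, n_txt_tokens, txt_keep_prob, rng):
    """Sample binary keep-masks emulating a dynamic image/text input proportion.

    Image tokens are always kept (generation is image-conditioned); each text
    token is kept independently with probability `txt_keep_prob`.
    `txt_keep_prob = 0.0` corresponds to the zero-shot, image-only setting.
    """
    img_mask = np.ones(n_img_tokens, dtype=bool)
    txt_mask = rng.random(n_txt_tokens) < txt_keep_prob
    return img_mask, txt_mask

def txt_keep_schedule(step, total_steps):
    """Hypothetical linear curriculum: start with full text supervision and
    anneal toward image-only input to reduce reliance on text at inference."""
    return max(0.0, 1.0 - step / total_steps)
```

During training, masked-out text tokens would become reconstruction targets, so the model learns to recover the missing semantic information from the visual counterpart; by the end of the schedule, training batches resemble the image-only inference condition.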