🤖 AI Summary
This work addresses the lack of transparent source attribution in clinical summarization, which undermines summary credibility. The authors propose a training-free, generation-time evidence attribution framework that leverages decoder attention to directly cite supporting text spans or image regions during summary generation, enabling real-time provenance for multimodal clinical summaries. Notably, this approach requires no fine-tuning or post-processing, and it integrates raw image-patch attention with generated image captions as alignment cues, balancing accuracy and practicality. Evaluated on the CliConSummation and MIMIC-CXR datasets, the method achieves significantly higher text and multimodal attribution F1 scores than embedding-matching and self-attribution baselines, with improvements of up to 15%. Moreover, its caption-based variant remains competitive with raw-image attention while being computationally lighter.
📝 Abstract
Trustworthy clinical summarization requires not only fluent generation but also transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attention to directly cite supporting text spans or images, overcoming the limitations of post-hoc or retraining-based methods. We introduce two strategies for multimodal attribution: a raw-image mode, which directly uses image-patch attention, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment. Evaluations on two representative domains, clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), show that our approach consistently outperforms embedding-based and self-attribution baselines, improving both text-level and multimodal attribution accuracy (e.g., +15% F1 over embedding baselines). Caption-based attribution achieves competitive performance with raw-image attention while being more lightweight and practical. These findings highlight attention-guided attribution as a promising step toward interpretable and deployable clinical summarization systems.
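To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of attention-guided attribution: decoder cross-attention weights over source tokens are aggregated per candidate source span, and each generated statement is attributed to the span receiving the most attention mass. In the caption-as-span mode described above, an image's generated caption would simply appear as one more source span. The function name, the toy attention matrix, and the span boundaries are illustrative assumptions.

```python
# Illustrative sketch of attention-guided source attribution.
# Assumes cross_attn is a list of rows (one per generated token),
# each row giving attention weights over the source tokens,
# already averaged over heads and layers.

def attribute_span(cross_attn, spans):
    """Return the index of the source span with the highest total
    attention mass, summed over all generated tokens."""
    scores = [0.0] * len(spans)
    for row in cross_attn:                      # one generated token
        for i, (start, end) in enumerate(spans):
            scores[i] += sum(row[start:end])    # mass on this span
    return max(range(len(spans)), key=lambda i: scores[i])

# Toy example: 2 generated tokens, 6 source tokens, two candidate spans
# (e.g., a dialogue turn and an image caption treated as a text span).
attn = [
    [0.05, 0.05, 0.10, 0.40, 0.30, 0.10],
    [0.02, 0.03, 0.05, 0.50, 0.30, 0.10],
]
spans = [(0, 3), (3, 6)]
print(attribute_span(attn, spans))  # the second span dominates: prints 1
```

In practice the spans would come from sentence or turn boundaries in the source document, and the attention matrix from the summarizer's decoder, so no additional training or post-processing is needed.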