🤖 AI Summary
This work addresses the lack of transparent source attribution in clinical summarization, which undermines summary credibility. The authors propose a training-free, generation-time evidence attribution framework that leverages decoder attention to directly cite supporting text spans or image regions during summary generation, enabling real-time provenance for multimodal clinical summaries. Notably, this approach requires no fine-tuning or post-processing, and it integrates raw image-patch attention with generated image captions as alignment cues, balancing accuracy and practicality. Evaluated on the CliConSummation and MIMIC-CXR datasets, the method achieves significantly higher text and multimodal attribution F1 scores than embedding-matching and self-attribution baselines, with improvements of up to 15%. Moreover, its caption-based variant remains competitive with raw-image attention while being computationally lighter.
📝 Abstract
Trustworthy clinical summarization requires not only fluent generation but also transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attention to directly cite supporting text spans or images, overcoming the limitations of post-hoc or retraining-based methods. We introduce two strategies for multimodal attribution: a raw-image mode, which directly uses image-patch attention, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment. Evaluations on two representative domains, clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), show that our approach consistently outperforms embedding-based and self-attribution baselines, improving both text-level and multimodal attribution accuracy (e.g., +15% F1 over embedding baselines). Caption-based attribution achieves competitive performance with raw-image attention while being more lightweight and practical. These findings highlight attention-guided attribution as a promising step toward interpretable and deployable clinical summarization systems.
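To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of attention-guided attribution: decoder cross-attention weights over source tokens are aggregated per candidate source span, and each generated statement is attributed to the span receiving the most attention mass. In the caption-as-span mode described above, an image's generated caption would simply appear as one more source span. The function name, the toy attention matrix, and the span boundaries are illustrative assumptions.

```python
# Illustrative sketch of attention-guided source attribution.
# Assumes cross_attn is a list of rows (one per generated token),
# each row giving attention weights over the source tokens,
# already averaged over heads and layers.

def attribute_span(cross_attn, spans):
    """Return the index of the source span with the highest total
    attention mass, summed over all generated tokens."""
    scores = [0.0] * len(spans)
    for row in cross_attn:                      # one generated token
        for i, (start, end) in enumerate(spans):
            scores[i] += sum(row[start:end])    # mass on this span
    return max(range(len(spans)), key=lambda i: scores[i])

# Toy example: 2 generated tokens, 6 source tokens, two candidate spans
# (e.g., a dialogue turn and an image caption treated as a text span).
attn = [
    [0.05, 0.05, 0.10, 0.40, 0.30, 0.10],
    [0.02, 0.03, 0.05, 0.50, 0.30, 0.10],
]
spans = [(0, 3), (3, 6)]
print(attribute_span(attn, spans))  # the second span dominates: prints 1
```

In practice the spans would come from sentence or turn boundaries in the source document, and the attention matrix from the summarizer's decoder, so no additional training or post-processing is needed.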