Generating Reports or Repeating Templates? Measuring and Mitigating Template Collapse in 3D CT Report Generation

📅 2026-05-29

🤖 AI Summary

This study addresses the "template collapse" problem in existing 3D medical vision-language models for CT report generation: models produce fluent but generic reports, missing rare yet critical pathologies and showing limited output diversity. The work formally defines this issue and introduces CLarGen, a framework that decouples clinical finding detection from language generation through a three-stage pipeline: pathology detection, exemplar retrieval, and report synthesis. CLarGen employs a Latent Query Transformer for multi-label pathology recognition, followed by pathology-guided retrieval of clinically matched exemplars and report synthesis with a medical language model. Experiments show substantial improvements over baselines, with macro-F1 increasing from 0.189 to 0.487 and the CRG score rising from 0.368 to 0.472, while report fluency is maintained.
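The decoupling described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the detector is replaced by a plain threshold over hypothetical per-pathology probabilities, retrieval is assumed to rank exemplars by Jaccard overlap of pathology sets, and the prompt format is invented for illustration. All function names (`detect`, `retrieve`, `build_prompt`) are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def detect(probs: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Stage 1 (assumed): keep pathologies whose probability passes a threshold.

    In the paper this role is played by a Latent Query Transformer; here the
    probabilities are simply given as input.
    """
    return {name for name, p in probs.items() if p >= threshold}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set overlap in [0, 1]; two empty sets count as a perfect match."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def retrieve(findings: Set[str], bank: List[Tuple[Set[str], str]], k: int = 2) -> List[str]:
    """Stage 2 (assumed): return the k exemplar reports with the closest pathology sets."""
    ranked = sorted(bank, key=lambda item: jaccard(findings, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


def build_prompt(findings: Set[str], exemplars: List[str]) -> str:
    """Stage 3 (assumed): assemble the text a language model would be asked to complete."""
    listed = ", ".join(sorted(findings)) if findings else "no abnormal findings"
    examples = "\n".join(f"- {e}" for e in exemplars)
    return f"Detected findings: {listed}\nSimilar reports:\n{examples}\nWrite the report:"


if __name__ == "__main__":
    probs = {"nodule": 0.9, "effusion": 0.2, "emphysema": 0.7}  # made-up scores
    bank = [
        ({"nodule"}, "Solitary pulmonary nodule in the right upper lobe."),
        (set(), "No acute cardiopulmonary abnormality."),
        ({"nodule", "emphysema"}, "Emphysematous changes with a left lower lobe nodule."),
    ]
    findings = detect(probs)
    print(build_prompt(findings, retrieve(findings, bank, k=1)))
```

Because the set of findings is fixed before any text is written, the detector can be evaluated on its own, and the language model is only asked to phrase findings it has been handed.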

📝 Abstract

Modern 3D medical vision-language models (VLMs) can generate fluent radiology-style text while exhibiting critically low pathology detection and output diversity, collapsing to generic templates that under-report rare yet critical findings. We identify this failure mode as Template Collapse. This failure stems from the unique constraints of 3D medical imaging, e.g., limited data, severe label imbalance, and weak signals from volumetric encoders. Under these constraints, text-generation objectives encourage shortcut learning and fluent but weakly grounded reports. We systematically diagnose Template Collapse through clinical fidelity, output diversity, normal-template bias, and rare-finding survival. To mitigate it, we propose CLarGen, a decoupled framework that separates what to say (clinical detection) from how to say it (language synthesis). CLarGen uses (i) a Latent Query Transformer for multi-label pathology detection, (ii) pathology-guided retrieval for clinically matched exemplars, and (iii) a medical language model to synthesize the final report from detected findings and retrieved context. Across state-of-the-art 3D CT report generation baselines, CLarGen mitigates Template Collapse and substantially improves clinical accuracy (macro-F1 0.487 vs. 0.189; CRG 0.472 vs. 0.368) while maintaining fluent reporting. Our results suggest that explicit, measurable clinical grounding is essential for template-collapse-resistant 3D CT report generation. Code will be released upon acceptance.
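To see why macro-F1 and output diversity expose this failure mode, consider a toy example. The sketch below is illustrative only: the abstract does not define the exact metrics, so macro-F1 is computed in its standard form (unweighted mean of per-label F1) and diversity is assumed to be the fraction of distinct generated reports. The function names and the four-study dataset are hypothetical.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-label F1 over 0/1 multi-label rows.

    A label with no true and no predicted positives contributes 0.0 here
    (a convention chosen for this sketch).
    """
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


def distinct_ratio(reports: List[str]) -> float:
    """Assumed diversity proxy: share of generated reports that are unique."""
    return len(set(reports)) / len(reports)


if __name__ == "__main__":
    # Columns: [common finding, rare finding]; one of four studies has the rare one.
    y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
    # A collapsed model emits the same "common finding" template for every study.
    y_pred = [[1, 0]] * 4
    reports = ["Common finding present. No other abnormality."] * 4
    print(f"macro-F1: {macro_f1(y_true, y_pred):.3f}")
    print(f"distinct ratio: {distinct_ratio(reports):.2f}")
```

In this toy case the collapsed model scores about 0.857 F1 on the common label and 0 on the rare one, so macro-F1 falls to about 0.429 even though three of the four reports look correct. Averaging per label, rather than per study, is what makes the missed rare finding visible.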

Problem

Research questions and friction points this paper is trying to address.

Template Collapse

3D CT report generation

pathology detection

output diversity

medical vision-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Template Collapse

3D CT Report Generation

Clinical Grounding