AI Summary
Existing methods for temporal chest X-ray analysis do not perform change detection, change classification, and spatial localization together, and they offer limited interpretability. This work proposes TRACE, the first model to jointly learn change-state classification (worsening, improving, stable), visual grounding, and natural language radiology report generation from temporal X-ray image pairs. Built on a vision-language architecture, TRACE combines temporal image inputs with dedicated classification heads and a visual grounding module in a single end-to-end model. Experiments show that effective change detection emerges only when temporal comparison and spatial grounding are trained jointly, which highlights the critical role of grounding in temporal reasoning. On chest X-ray data, the model achieves over 90% visual grounding accuracy and establishes a new benchmark for temporal radiology report generation.
Abstract
Temporal comparison of chest X-rays is fundamental to clinical radiology, enabling detection of disease progression, treatment response, and new findings. While vision-language models have advanced single-image report generation and visual grounding, no existing method combines these capabilities for temporal change detection. We introduce Temporal Radiology with Anatomical Change Explanation (TRACE), the first model that jointly performs temporal comparison, change classification, and spatial localization. Given a prior and current chest X-ray, TRACE generates natural language descriptions of interval changes (worsened, improved, stable) while grounding each finding with bounding box coordinates. TRACE demonstrates effective spatial localization with over 90% grounding accuracy, establishing a foundation for this challenging new task. Our ablation study uncovers an emergent capability: change detection arises only when temporal comparison and spatial grounding are jointly learned, as neither alone enables meaningful change detection. This finding suggests that grounding provides a spatial attention mechanism essential for temporal reasoning.
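To make the task output concrete, the following is a minimal sketch of how a grounded finding and a grounding-accuracy score might be represented. It is not the authors' implementation. The `GroundedFinding` type, the `(x_min, y_min, x_max, y_max)` box format, and the use of intersection-over-union (IoU) with a 0.5 threshold are all assumptions made for illustration; the abstract does not state how grounding accuracy is computed.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]

# The three interval-change states named in the abstract.
CHANGE_LABELS = ("worsened", "improved", "stable")


@dataclass
class GroundedFinding:
    """One finding: free-text description, change state, and its box."""
    description: str
    change: str
    box: Box

    def __post_init__(self) -> None:
        if self.change not in CHANGE_LABELS:
            raise ValueError(f"unknown change label: {self.change!r}")


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(pred: List[Box], ref: List[Box],
                       threshold: float = 0.5) -> float:
    """Fraction of predicted boxes whose IoU with the paired reference
    box meets the threshold (assumed metric, assumed threshold)."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    if not ref:
        return 0.0
    hits = sum(iou(p, r) >= threshold for p, r in zip(pred, ref))
    return hits / len(ref)
```

Under these assumptions, a prediction counts as correctly grounded when its box overlaps the paired reference box by at least the chosen IoU threshold, and the reported score is the fraction of such hits.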