CheXanatomy: Anatomy-Aware Vision-Language Modeling for Chest Radiographs

📅 2026-06-06
🤖 AI Summary
Current vision-language models do not explicitly model fine-grained anatomical structures in chest X-rays, which limits their usefulness for spatially precise localization tasks. This work proposes an approach that, for the first time, incorporates anatomical structures into the generative objective of a vision-language model in an autoregressive manner. Using synthetic X-ray images derived from CT volumes, together with their corresponding segmentation masks, the method obtains scalable supervision for training a pretrained model to predict anatomical segmentation masks directly, without additional task-specific heads. Evaluated on both synthetic and real data, the approach achieves segmentation performance comparable to a specialized U-Net baseline in-distribution, shows better geometric robustness under domain shift to real chest radiographs, and adapts more sample-efficiently to novel localization tasks under limited supervision.
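To make "predicting a mask as tokens" concrete, the following is a minimal sketch of one way a binary mask could be serialized into a flat token sequence and decoded again. The summary does not state which mask encoding the paper uses, so the run-length scheme, the `<label>` and `<eos>` tokens, and the function names `mask_to_tokens` and `tokens_to_mask` are all assumptions made for illustration.

```python
import numpy as np

def mask_to_tokens(mask, label):
    """Serialize a 2D binary mask as tokens: label, height, width, run lengths, end marker.

    Hypothetical encoding for illustration; not taken from the paper.
    """
    arr = np.asarray(mask, dtype=bool)
    h, w = arr.shape
    flat = arr.ravel()
    # Positions where the value changes mark the boundaries between runs.
    change = np.flatnonzero(flat[1:] != flat[:-1]) + 1
    bounds = np.concatenate(([0], change, [flat.size]))
    runs = np.diff(bounds).tolist()
    # Runs alternate background/foreground, starting with background.
    # If the mask starts with foreground, emit a zero-length background run first.
    if flat.size and flat[0]:
        runs = [0] + runs
    return [f"<{label}>", str(h), str(w)] + [str(r) for r in runs] + ["<eos>"]

def tokens_to_mask(tokens):
    """Invert mask_to_tokens: rebuild the label name and the 2D boolean mask."""
    label = tokens[0].strip("<>")
    h, w = int(tokens[1]), int(tokens[2])
    runs = [int(t) for t in tokens[3:-1]]
    flat = np.zeros(h * w, dtype=bool)
    pos, value = 0, False
    for r in runs:
        flat[pos:pos + r] = value
        pos += r
        value = not value
    return label, flat.reshape(h, w)
```

With an encoding of this kind, the target sequence for an image is ordinary text, so a model could in principle be trained on it with the same next-token loss used for language, which is the property the summary emphasizes when it says no task-specific heads are needed.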
📝 Abstract
Vision-language models (VLMs) pretrained on large-scale image-text pairs demonstrate strong image-level understanding, but are primarily optimized for global alignment and do not explicitly encode fine-grained anatomical structure, limiting their suitability for spatially precise tasks such as segmentation. We introduce CheXanatomy, a framework that integrates explicit anatomical knowledge into a pretrained VLM through autoregressive token-space supervision. Instead of adding task-specific decoder heads, the model is trained to generate anatomical segmentation masks via next-token prediction. To enable scalable supervision, we synthesize realistic chest radiographs from CT volumes and forward-project CT segmentation labels to obtain anatomically consistent 2D masks. We evaluate the approach on synthetic and real chest radiographs against a U-Net baseline, including ablations on model scale, input resolution, and vision encoder fine-tuning. Autoregressive anatomical supervision achieves performance comparable to specialized convolutional models in-distribution and demonstrates improved geometric robustness under domain shift to real CXR data. In addition, anatomy-pretrained models exhibit improved sample efficiency when adapting to novel localization tasks under limited supervision. Larger models and higher input image resolution improve performance, while vision encoder fine-tuning has limited effect. These results show that embedding anatomical structure directly into the generative objective promotes spatially grounded representations and supports anatomy-aware medical vision-language modeling.
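The data-generation step described above (synthesizing a radiograph from a CT volume and forward-projecting its labels) can be sketched as follows. This is a simplified illustration, not the paper's pipeline: it assumes a parallel-beam geometry along one array axis, an approximate attenuation coefficient for water (`mu_water`, roughly 0.02 per mm), and a hypothetical function name `project_ct`. Realistic radiograph simulation would typically use a divergent-beam geometry and an energy-dependent attenuation model.

```python
import numpy as np

def project_ct(volume_hu, labels, axis=1, voxel_mm=1.0, mu_water=0.02):
    """Parallel-beam projection of a CT volume and its label volume.

    volume_hu: 3D array in Hounsfield units.
    labels:    3D integer array of the same shape (0 = background).
    Returns (radiograph, masks), where masks maps label id -> 2D boolean array.
    All parameter defaults are illustrative assumptions.
    """
    # Convert HU to linear attenuation per mm; air (-1000 HU) maps to 0.
    mu = mu_water * (1.0 + np.clip(volume_hu, -1000, None) / 1000.0)
    # Beer-Lambert law: transmitted intensity decays with the line integral of mu.
    line_integral = mu.sum(axis=axis) * voxel_mm
    transmitted = np.exp(-line_integral)
    # Display convention: dense tissue appears bright.
    radiograph = 1.0 - transmitted
    # A pixel belongs to a structure if any voxel along its ray carries that label.
    masks = {
        int(k): (labels == k).any(axis=axis)
        for k in np.unique(labels)
        if k != 0
    }
    return radiograph, masks
```

Because the image and the masks are projected along the same rays, each 2D mask is aligned with the synthetic radiograph by construction, which is what makes the supervision anatomically consistent without manual annotation. Note that projected masks of different structures can overlap in 2D even when the 3D labels do not.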
Problem

Research questions and friction points this paper is trying to address.

vision-language models
anatomical segmentation
chest radiographs
spatially precise tasks
medical imaging
Innovation

Methods, ideas, or system contributions that make the work stand out.

anatomy-aware modeling
vision-language model
autoregressive segmentation
medical image synthesis
spatially grounded representation