Flexible Control of 3D CT Generation via Text and Semantically-Defined Segmentation Prompts

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods for 3D medical image generation struggle to combine spatial controllability with labeling flexibility. This work proposes a multimodal controllable generation framework that accepts radiology report text and localized segmentation prompts, both optional, enabling precise spatial control without requiring full-organ annotations. The semantic meaning of each segmentation mask is specified by an accompanying text description, which makes the conditioning mechanism flexible and scalable. Built on a modified diffusion transformer that jointly processes image and segmentation tokens, the model uses gated attention to attend to long radiology reports and a memory-efficient design to support high-resolution 3D CT synthesis. Experiments show state-of-the-art perceptual and semantic scores (e.g., a 24% relative improvement in mean FID), anatomically consistent CT volumes that radiologists judge to be closely aligned with real images, and improved data efficiency when used for data augmentation.
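For readers unfamiliar with the phrase, a "relative improvement" in FID is usually read as the fractional reduction of a lower-is-better score against a baseline. The sketch below shows that arithmetic only; the function name and the numbers are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Fractional reduction of a lower-is-better metric such as FID."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - ours) / baseline


# Hypothetical scores chosen for illustration, not reported results.
example = relative_improvement(baseline=50.0, ours=38.0)
print(f"relative improvement: {example:.0%}")
```

Under this reading, a drop from a baseline FID of 50 to 38 would count as a 24% relative improvement, even though the absolute change is 12 points.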

📝 Abstract

Generative models for volumetric medical images have found many applications in medical imaging, ranging from data augmentation to serving as priors for inverse problems. For these applications, generating high-resolution 3D images with strong controllability is essential but remains highly challenging. Existing approaches typically control generation either through radiology reports used as text prompts or through full image segmentation. While text-based prompting is flexible, it provides limited spatial control over the location, shape, and boundary of abnormalities. In contrast, segmentation-based methods receive precise spatial guidance but are restrictive in requiring full-organ annotations. In this work, we propose a flexible multimodal framework for controllable volumetric image generation that supports input from radiology reports and segmentation prompts (both optional). Our approach allows users to provide segmentation of a specific anatomy or abnormality without requiring full-organ annotations. The semantic meaning of the segmentation mask is specified through an accompanying text description, resulting in a highly flexible and scalable conditioning mechanism. We develop a memory-efficient architecture based on a modified diffusion transformer that jointly processes image and segmentation tokens. The model further incorporates gated attention to effectively attend to long radiology reports. Experiments demonstrate that our method achieves state-of-the-art perceptual and semantic scores (e.g., 24% relative improvement in mean FID), generates high-resolution anatomically consistent CT volumes, and improves data efficiency when used for data augmentation. Radiologists' evaluation further confirms strong alignment between generated and real medical images.
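The abstract names two architectural ideas: image and segmentation tokens processed jointly, and gated attention over a long, optional radiology report. It does not give the exact formulation, so the following NumPy sketch is an illustration under stated assumptions rather than the authors' method. The tanh gate on a residual cross-attention branch, the single attention head, the token shapes, and every name (`gated_cross_attention`, `w_q`, `gate`, and so on) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(tokens, text_tokens, w_q, w_k, w_v, gate):
    """Single-head cross-attention with a scalar tanh gate (assumed form).

    tokens:      (N, d) joint image + segmentation tokens
    text_tokens: (T, d) report embeddings, or None if no report is given
    gate:        scalar; tanh(gate) scales the attended text features
    """
    if text_tokens is None:
        # Optional prompt: with no report, the tokens pass through unchanged.
        return tokens
    q = tokens @ w_q
    k = text_tokens @ w_k
    v = text_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attended = softmax(scores, axis=-1) @ v
    # Residual connection; gate = 0 switches the text branch off entirely.
    return tokens + np.tanh(gate) * attended


rng = np.random.default_rng(0)
d = 8
image_tokens = rng.normal(size=(16, d))   # hypothetical CT patch tokens
seg_tokens = rng.normal(size=(4, d))      # hypothetical segmentation prompt tokens
joint = np.concatenate([image_tokens, seg_tokens], axis=0)
report = rng.normal(size=(32, d))         # hypothetical report embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out = gated_cross_attention(joint, report, w_q, w_k, w_v, gate=0.5)
print(out.shape)
```

A gate initialized at zero leaves the token stream untouched, which is one common way to add a text branch without disturbing a model at the start of training. Whether the paper uses this particular gating is not stated in the abstract.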

Problem

Research questions and friction points this paper is trying to address.

3D CT generation

controllable generation

text prompting

segmentation prompts

medical image synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal conditioning

segmentation prompting

diffusion transformer