🤖 AI Summary
This work addresses the challenge of generating anatomically consistent 3D medical images in data-scarce clinical settings. The authors propose Sketch2CT, a multimodal conditional diffusion framework that jointly leverages user-provided 2D sketches and textual descriptions to synthesize realistic CT images along with structurally accurate 3D organ segmentation masks. The method integrates local sketch cues with global textual semantics through dedicated feature-alignment and representation-fusion modules, and employs a capsule-attention backbone to enable structure-aware 3D generation. Experiments on public CT datasets demonstrate that Sketch2CT significantly improves both the anatomical fidelity and the visual realism of the generated outputs, offering an effective and controllable approach to low-cost medical data augmentation.
📝 Abstract
Diffusion probabilistic models have demonstrated significant potential in generating high-quality, realistic medical images, offering a promising solution to the persistent challenge of data scarcity in the medical field. Nevertheless, producing 3D medical volumes with anatomically consistent structures under multimodal conditions remains a complex and unresolved problem. We introduce Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation, jointly guided by a user-provided 2D sketch and a textual description that captures 3D geometric semantics. The framework first generates a 3D segmentation mask of the target organ from random noise, conditioned on both modalities. To effectively align and fuse these inputs, we propose two key modules that refine sketch features with localized textual cues and integrate global sketch-text representations. Built on a capsule-attention backbone, the framework leverages the complementary strengths of sketches and text to produce anatomically accurate organ shapes. The synthesized segmentation mask then guides a latent diffusion model for 3D CT volume synthesis, yielding realistic organ appearances consistent with the user-defined sketch and description. Extensive experiments on public CT datasets demonstrate that Sketch2CT achieves superior performance in generating multimodal medical volumes. Its controllable, low-cost generation pipeline enables principled, efficient augmentation of medical datasets. Code is available at https://github.com/adlsn/Sketch2CT.
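The two-stage pipeline described in the abstract can be sketched in miniature as follows. This is not the authors' implementation: every name, shape, and operation here is a hypothetical placeholder (simple NumPy arithmetic stands in for the learned fusion modules, the capsule-attention mask generator, and the latent diffusion model), intended only to illustrate the data flow: fuse sketch and text conditions, denoise a 3D segmentation mask, then synthesize a CT volume gated by that mask.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32           # toy feature dimension (arbitrary for this sketch)
VOL = (8, 8, 8)  # toy 3D volume shape


def fuse_conditions(sketch_feat, text_feat):
    # Stand-in for the paper's alignment/fusion modules: refine local
    # sketch features with textual cues, then append a joint embedding.
    refined = sketch_feat * (1.0 + np.tanh(text_feat))
    return np.concatenate([refined, 0.5 * (sketch_feat + text_feat)])


def denoise_mask(noise, cond, steps=10):
    # Toy "diffusion" loop: pull the noisy volume toward a
    # condition-dependent target (a real model uses a trained denoiser).
    target = np.tanh(cond.mean()) * np.ones(VOL)
    x = noise
    for _ in range(steps):
        x = x + 0.3 * (target - x)
    return (x > 0).astype(np.float32)  # binarize into a segmentation mask


def synthesize_volume(mask, cond, steps=10):
    # Stage 2: mask-guided synthesis; the mask gates where
    # organ appearance is generated.
    x = rng.standard_normal(VOL)
    for _ in range(steps):
        x = x + 0.3 * (np.cos(cond.sum()) - x)
    return mask * x


sketch_feat = rng.standard_normal(D)
text_feat = rng.standard_normal(D)

cond = fuse_conditions(sketch_feat, text_feat)
mask = denoise_mask(rng.standard_normal(VOL), cond)  # stage 1: organ shape
ct = synthesize_volume(mask, cond)                   # stage 2: CT appearance
print(mask.shape, ct.shape)
```

The key structural point the sketch preserves is that stage 2 never sees the raw conditions alone: the generated mask constrains where appearance detail appears, which is what makes the final volume anatomically consistent with the user-provided sketch and text.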