DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions

📅 2025-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Transformers (ViTs) suffer from the quadratic computational complexity of self-attention, while conventional 3D convolutions incur excessive parameters and FLOPs for 3D medical image–text joint modeling. Method: We propose DCFormer, a lightweight and efficient encoder that decomposes 3D convolutions into three parallel 1D convolutions along depth, height, and width—preserving geometric awareness and spatial modeling capability while drastically reducing computation. Integrated with the CLIP cross-modal alignment framework, DCFormer enables end-to-end vision–language joint learning. Results: On the CT-RATE dataset, DCFormer-Tiny achieves 62.0% accuracy and a 46.3% F1-score, outperforming mainstream baselines while requiring significantly fewer parameters than ViT or ConvNeXt variants. This work establishes a computationally efficient and scalable paradigm for 3D medical multimodal understanding under resource-constrained settings.

📝 Abstract
Vision-language models (VLMs) align visual and textual representations, enabling high-performance zero-shot classification and image-text retrieval in 2D medical imaging. However, extending VLMs to 3D medical imaging remains computationally challenging. Existing 3D VLMs rely on Vision Transformers (ViTs), which are computationally expensive due to self-attention's quadratic complexity, or 3D convolutions, which demand excessive parameters and FLOPs as kernel size increases. We introduce DCFormer, an efficient 3D medical image encoder that factorizes 3D convolutions into three parallel 1D convolutions along depth, height, and width. This design preserves spatial information while significantly reducing computational cost. Integrated into a CLIP-based vision-language framework, DCFormer is evaluated on CT-RATE, a dataset of 50,188 paired 3D chest CT volumes and radiology reports, for zero-shot multi-abnormality detection across 18 pathologies. Compared to ViT, ConvNeXt, PoolFormer, and TransUNet, DCFormer achieves superior efficiency and accuracy, with DCFormer-Tiny reaching 62.0% accuracy and a 46.3% F1-score while using significantly fewer parameters. These results highlight DCFormer's potential for scalable, clinically deployable 3D medical VLMs. Our codes will be publicly available.
Problem

Research questions and friction points this paper is trying to address.

Designing an efficient 3D vision-language model
Reducing the computational cost of 3D encoders
Improving multi-abnormality detection accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposed 3D convolutions
Parallel 1D convolutions
CLIP-based vision-language framework
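The core idea above—replacing a full 3D convolution with three parallel 1D convolutions along depth, height, and width—can be sketched as follows. This is an illustrative PyTorch reading of the decomposition described in the abstract, not the authors' released implementation; the module name, kernel size, and fusion-by-summation are assumptions.

```python
import torch
import torch.nn as nn


class DecomposedConv3d(nn.Module):
    """Sketch of a decomposed 3D convolution: three parallel 1D
    convolutions, one per spatial axis, whose outputs are summed.
    Parameter count scales as 3*k instead of k**3 per channel pair."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 7):
        super().__init__()
        p = k // 2  # same-padding so spatial dims are preserved
        # Each branch convolves along exactly one axis of (D, H, W).
        self.conv_d = nn.Conv3d(in_ch, out_ch, (k, 1, 1), padding=(p, 0, 0))
        self.conv_h = nn.Conv3d(in_ch, out_ch, (1, k, 1), padding=(0, p, 0))
        self.conv_w = nn.Conv3d(in_ch, out_ch, (1, 1, k), padding=(0, 0, p))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sum the three axis-wise responses (one plausible fusion choice).
        return self.conv_d(x) + self.conv_h(x) + self.conv_w(x)


# Toy CT-like volume: batch 1, 4 channels, 16x16x16 voxels.
x = torch.randn(1, 4, 16, 16, 16)
y = DecomposedConv3d(4, 8, k=7)(x)

dec_params = sum(p.numel() for p in DecomposedConv3d(4, 8, 7).parameters())
full_params = sum(p.numel() for p in nn.Conv3d(4, 8, 7, padding=3).parameters())
```

With `k=7`, the decomposed module uses roughly `3*k` weights per channel pair versus `k**3` for a dense 3D kernel—about a 16x reduction here—which is the source of the efficiency gains the paper reports.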
Gorkem Can Ates
Postdoctoral Associate University of Florida
Computer Vision · Foundation Models · Vision Language Models · Optimization
Kuang Gong
Assistant Professor of Biomedical Engineering, University of Florida
PET · MRI · CT · Inverse Problem · Machine Learning
Wei Shao
Department of Medicine, University of Florida, Gainesville, FL, 32611, USA; Intelligent Clinical Care Center, University of Florida, Gainesville, FL, 32611, USA