🤖 AI Summary
This study evaluates whether the additional computational cost of 3D models is justified over 2D or 2.5D approaches in pulmonary CT analysis. Under a fixed training protocol, the authors run controlled experiments comparing convolutional neural networks (CNNs) and Vision Transformers (ViTs) across 2D, 2.5D, and 3D input representations, using a leakage-free NLST cohort (n = 1,977) with supporting LIDC-IDRI data, and assess performance, stability, and resource consumption. The work is framed as a joint dimension-architecture evaluation for lung cancer screening rather than as a new architecture. It reports that 3D CNNs suffer from threshold instability and that ViTs are prone to degenerate predictions such as all-positive outputs. In this comparison, the 2.5D CNN offers the most favorable trade-off between discrimination and stability (ROC-AUC 0.682, 95% CI [0.546, 0.799]). Because the confidence intervals are wide and overlapping, the results are presented as a resource-performance frontier and a failure-mode taxonomy, not as a definitive ranking; they nonetheless point to practical advantages of lower-dimensional inputs for class-imbalanced screening.
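To make the three input representations concrete, here is a minimal sketch of how they might be built from a CT volume stored as a list of axial slices. The summary does not say how the authors construct 2.5D inputs; stacking a slice with its immediate neighbors as channels (with edge clamping) is a common convention and is an assumption here, as are the function names `input_2d`, `input_25d`, and `input_3d`.

```python
from typing import List

Slice = List[List[float]]  # one axial slice, H x W
Volume = List[Slice]       # D slices, ordered along the scan axis


def input_2d(volume: Volume, k: int) -> List[Slice]:
    """Single slice k as a 1-channel image (1 x H x W)."""
    return [volume[k]]


def input_25d(volume: Volume, k: int, context: int = 1) -> List[Slice]:
    """Slice k plus `context` neighbors on each side, stacked as channels.

    Assumption: out-of-range neighbors are clamped to the nearest valid
    slice, so the channel count is always 2 * context + 1.
    """
    depth = len(volume)
    indices = [min(max(k + offset, 0), depth - 1)
               for offset in range(-context, context + 1)]
    return [volume[i] for i in indices]


def input_3d(volume: Volume) -> List[Volume]:
    """Whole volume as a 1-channel 3D input (1 x D x H x W)."""
    return [list(volume)]


# Toy volume: 4 slices of 2 x 2, where slice i is filled with the value i.
toy = [[[float(i)] * 2 for _ in range(2)] for i in range(4)]
```

With this layout a 2.5D input costs the same as a 3-channel 2D image, which is why it sits between 2D and full 3D on the resource axis. A real pipeline would typically use an array library rather than nested lists.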
📝 Abstract
Three-dimensional models are widely assumed preferable for volumetric medical imaging, yet their practical value depends on whether performance gains justify added computational cost and complexity. Rather than proposing a new architecture, we study how input dimensionality (2D, 2.5D, 3D) affects model behavior across convolutional neural networks (CNNs) and Vision Transformers (ViTs) under a fixed training protocol. Using a leakage-free NLST cohort (n = 1,977) with supporting LIDC-IDRI data, we find that the 2.5D CNN offers the most favorable discrimination-stability trade-off in our comparison (ROC-AUC 0.682, 95% CI [0.546, 0.799]) with a stable operating point. In contrast, 3D CNNs show threshold instability, and transformers exhibit degenerate behavior, such as all-positive predictions. Confidence intervals are wide and overlapping, so we present these results as a controlled resource-performance frontier and a failure-mode taxonomy rather than as definitive superiority claims. For class-imbalanced lung cancer screening classification, 2D and 2.5D inputs provide a more reliable trade-off between performance, stability, and computational efficiency than full 3D representations.
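The two diagnostics the abstract relies on, a confidence interval around ROC-AUC and a check for degenerate predictions, can be sketched as follows. The abstract does not state how its 95% CI was computed; the percentile bootstrap used here is an assumption, and `roc_auc`, `bootstrap_auc_ci`, and `is_degenerate` are hypothetical helper names, not the authors' code.

```python
import random
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a positive outscores a negative
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels: Sequence[int], scores: Sequence[float],
                     n_boot: int = 1000, alpha: float = 0.05,
                     seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for ROC-AUC (assumed method, see lead-in)."""
    roc_auc(labels, scores)  # fails early if only one class is present
    rng = random.Random(seed)
    n = len(labels)
    aucs: List[float] = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains a single class; AUC undefined
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lo = aucs[int((alpha / 2) * n_boot)]
    hi = aucs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def is_degenerate(scores: Sequence[float], threshold: float = 0.5) -> bool:
    """True when thresholding yields all-positive or all-negative predictions."""
    preds = [s >= threshold for s in scores]
    return all(preds) or not any(preds)
```

A model can have a reasonable ROC-AUC and still be degenerate at a fixed threshold, since AUC depends only on the ranking of scores. That is why the two checks are kept separate. With few positive cases, as in class-imbalanced screening, the bootstrap interval will typically be wide.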