🤖 AI Summary
Medical image computing (MIC) faces the dual challenge of jointly modeling global long-range dependencies and local fine-grained details: CNNs suffer from limited receptive fields, Transformers incur prohibitive self-attention costs on high-resolution features, and MLP-based architectures remain underexplored in MIC. This doctoral research first presents innovative Transformer designs for both pixel-wise and image-wise medical vision tasks, then pioneers MLP-based visual models for capturing fine-grained long-range dependencies in high-resolution medical images. Extensive experiments across diverse medical vision benchmarks confirm the critical role of long-range dependency modeling and show that MLPs offer a superior accuracy–efficiency trade-off over both CNNs and Transformers, establishing an efficient and scalable backbone paradigm for next-generation MIC.
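The "prohibitive self-attention cost" is easy to quantify: the attention map alone holds N² entries for N tokens, so operating on full-resolution features (one token per pixel) rather than coarse patch embeddings blows up memory quadratically. A minimal back-of-the-envelope sketch, assuming a 224×224 input and 16×16 patches (common benchmark figures, not values from the text):

```python
def attn_map_entries(h: int, w: int) -> int:
    """Entries in one self-attention map (per head, per layer)
    when every spatial location of an h x w feature map is a token."""
    n = h * w
    return n * n

full = attn_map_entries(224, 224)             # one token per pixel
patched = attn_map_entries(224 // 16, 224 // 16)  # after 16x16 patch embedding
print(f"{full:,}")     # 2,517,630,976 entries
print(f"{patched:,}")  # 38,416 entries
```

The ~65,000× gap is why transformers typically downsample or patch-embed before attention, discarding exactly the subtle details that fine-grained medical dependency modeling needs.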
📝 Abstract
Medical Image Computing (MIC) is a broad research area covering both pixel-wise (e.g., segmentation, registration) and image-wise (e.g., classification, regression) vision tasks. Effective analysis demands models that capture both global long-range context and local subtle visual characteristics, i.e., fine-grained long-range visual dependency modeling. Compared to Convolutional Neural Networks (CNNs), which are limited by their intrinsic locality, transformers excel at long-range modeling; however, owing to the high computational cost of self-attention, transformers typically cannot process high-resolution features (e.g., full-scale image features before downsampling or patch embedding) and thus struggle to model fine-grained dependencies among subtle medical image details. Meanwhile, Multi-Layer Perceptron (MLP)-based visual models are recognized as computation- and memory-efficient alternatives for modeling long-range visual dependency but have yet to be widely investigated in the MIC community. This doctoral research advances deep learning-based MIC by investigating effective long-range visual dependency modeling. It first presents innovative uses of transformers for both pixel- and image-wise medical vision tasks. The focus then shifts to MLPs, pioneering MLP-based visual models that capture fine-grained long-range visual dependency in medical images. Extensive experiments confirm the critical role of long-range dependency modeling in MIC and reveal a key finding: MLPs make it feasible to model finer-grained long-range dependencies among higher-resolution medical features containing enriched anatomical/pathological details. This finding establishes MLPs as a superior paradigm over transformers and CNNs, consistently enhancing performance across various medical vision tasks and paving the way for next-generation medical vision backbones.
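How an MLP captures long-range dependency without any attention map can be illustrated with a token-mixing layer in the style of MLP-Mixer (a sketch under that assumption; the abstract does not specify the thesis's exact MLP design). Two dense layers applied across the token axis give every output token a global receptive field in a single layer, while only weight matrices, not an N×N attention map per head, must be stored:

```python
import numpy as np

def mlp_mix_tokens(x, w1, b1, w2, b2):
    """Token-mixing MLP: mixes information across all N token positions
    with two dense layers, so each output token depends on every input token."""
    # x: (N, C) tokens; mixing acts along the token axis, channel by channel
    h = np.maximum(0.0, w1 @ x + b1)  # (H, C), ReLU stand-in for GELU
    return w2 @ h + b2                # (N, C)

rng = np.random.default_rng(0)
N, C, H = 64, 16, 32                  # tokens, channels, hidden width (illustrative sizes)
x = rng.standard_normal((N, C))
w1, b1 = 0.1 * rng.standard_normal((H, N)), np.zeros((H, 1))
w2, b2 = 0.1 * rng.standard_normal((N, H)), np.zeros((N, 1))
y = mlp_mix_tokens(x, w1, b1, w2, b2)
print(y.shape)  # (64, 16)
```

Because the mixing weights are fixed matrices rather than input-dependent attention scores, the layer's activation memory stays linear in the token count, which is what makes applying such models to higher-resolution medical features feasible.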