🤖 AI Summary
Medical image computing (MIC) faces the dual challenge of jointly modeling global long-range dependencies and local fine-grained details: CNNs suffer from limited receptive fields, Transformers incur prohibitive self-attention costs on high-resolution features, and MLP-based architectures remain underexplored in MIC. This doctoral research first presents innovative Transformer designs for both pixel-wise and image-wise medical vision tasks, then pioneers MLP-based visual models for capturing fine-grained long-range dependencies in high-resolution medical images. Extensive experiments across diverse medical vision benchmarks confirm the critical role of long-range dependency modeling and show that MLPs offer a superior accuracy–efficiency trade-off over both CNNs and Transformers, establishing an efficient and scalable backbone paradigm for next-generation MIC.
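The "prohibitive self-attention cost" is easy to quantify: the attention map alone holds N² entries for N tokens, so operating on full-resolution features (one token per pixel) rather than coarse patch embeddings blows up memory quadratically. A minimal back-of-the-envelope sketch, assuming a 224×224 input and 16×16 patches (common benchmark figures, not values from the text):

```python
def attn_map_entries(h: int, w: int) -> int:
    """Entries in one self-attention map (per head, per layer)
    when every spatial location of an h x w feature map is a token."""
    n = h * w
    return n * n

full = attn_map_entries(224, 224)             # one token per pixel
patched = attn_map_entries(224 // 16, 224 // 16)  # after 16x16 patch embedding
print(f"{full:,}")     # 2,517,630,976 entries
print(f"{patched:,}")  # 38,416 entries
```

The ~65,000× gap is why transformers typically downsample or patch-embed before attention, discarding exactly the subtle details that fine-grained medical dependency modeling needs.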
📝 Abstract
Medical Image Computing (MIC) is a broad research area covering both pixel-wise (e.g., segmentation, registration) and image-wise (e.g., classification, regression) vision tasks. Effective analysis demands models that capture both global long-range context and local subtle visual characteristics, i.e., fine-grained long-range visual dependency modeling. Compared to Convolutional Neural Networks (CNNs), which are limited by their intrinsic locality, transformers excel at long-range modeling; however, owing to the high computational cost of self-attention, transformers typically cannot process high-resolution features (e.g., full-scale image features before downsampling or patch embedding) and thus struggle to model fine-grained dependencies among subtle medical image details. Meanwhile, Multi-Layer Perceptron (MLP)-based visual models are recognized as computation- and memory-efficient alternatives for modeling long-range visual dependency but have yet to be widely investigated in the MIC community. This doctoral research advances deep learning-based MIC by investigating effective long-range visual dependency modeling. It first presents innovative uses of transformers for both pixel- and image-wise medical vision tasks. The focus then shifts to MLPs, pioneering MLP-based visual models that capture fine-grained long-range visual dependency in medical images. Extensive experiments confirm the critical role of long-range dependency modeling in MIC and reveal a key finding: MLPs make it feasible to model finer-grained long-range dependencies among higher-resolution medical features containing enriched anatomical/pathological details. This finding establishes MLPs as a superior paradigm over transformers and CNNs, consistently enhancing performance across various medical vision tasks and paving the way for next-generation medical vision backbones.
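How an MLP captures long-range dependency without any attention map can be illustrated with a token-mixing layer in the style of MLP-Mixer (a sketch under that assumption; the abstract does not specify the thesis's exact MLP design). Two dense layers applied across the token axis give every output token a global receptive field in a single layer, while only weight matrices, not an N×N attention map per head, must be stored:

```python
import numpy as np

def mlp_mix_tokens(x, w1, b1, w2, b2):
    """Token-mixing MLP: mixes information across all N token positions
    with two dense layers, so each output token depends on every input token."""
    # x: (N, C) tokens; mixing acts along the token axis, channel by channel
    h = np.maximum(0.0, w1 @ x + b1)  # (H, C), ReLU stand-in for GELU
    return w2 @ h + b2                # (N, C)

rng = np.random.default_rng(0)
N, C, H = 64, 16, 32                  # tokens, channels, hidden width (illustrative sizes)
x = rng.standard_normal((N, C))
w1, b1 = 0.1 * rng.standard_normal((H, N)), np.zeros((H, 1))
w2, b2 = 0.1 * rng.standard_normal((N, H)), np.zeros((N, 1))
y = mlp_mix_tokens(x, w1, b1, w2, b2)
print(y.shape)  # (64, 16)
```

Because the mixing weights are fixed matrices rather than input-dependent attention scores, the layer's activation memory stays linear in the token count, which is what makes applying such models to higher-resolution medical features feasible.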