🤖 AI Summary
To address insufficient multi-scale feature modeling in medical imaging and the limited inductive bias and high data dependency of Vision Transformers (ViTs), this paper proposes a dual-path Transformer architecture that synergistically integrates the hierarchical priors of CNNs with the global representational capacity of ViTs. Key contributions include: (i) a novel scale-aware attention mechanism that jointly models intra-layer local details and inter-layer global semantics; and (ii) a plug-and-play hierarchical tokenization scheme that explicitly preserves CNNs' inherent multi-scale inductive bias. The architecture is compatible with mainstream CNN backbones and requires no modification to downstream task adaptation pipelines. Extensive experiments on multiple medical image classification benchmarks demonstrate consistent and significant improvements over both ViT- and CNN-based baselines, achieving average accuracy gains of 3.2–5.7 percentage points. These results validate the proposed method's effectiveness, generalizability, and transferability.
📝 Abstract
Despite the widespread adoption of transformers in medical applications, multi-scale learning with transformers remains underexplored, even though hierarchical representations are considered advantageous for computer-aided medical diagnosis. We propose a novel hierarchical transformer model that integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the representational power of Vision Transformers (ViTs). To address ViTs' lack of inductive biases and dependence on extensive training data, our model employs a CNN backbone to generate hierarchical visual representations. These representations are adapted for transformer input through an innovative patch tokenization process that preserves the inherited multi-scale inductive biases. We also introduce a scale-wise attention mechanism that directly captures intra-scale and inter-scale associations. This mechanism complements patch-wise attention by enhancing spatial understanding and preserving global perception; we refer to these as local and global attention, respectively. Our model significantly outperforms baseline models in classification accuracy, demonstrating its effectiveness in bridging the gap between CNNs and ViTs. The components are designed as plug-and-play for different CNN architectures and can be adapted to multiple applications. The code is available at https://github.com/xiaoyatang/DuoFormer.git.
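To make the two core ideas concrete, here is a minimal PyTorch sketch of (a) tokenizing multi-scale CNN feature maps into a shared embedding space and (b) a scale-wise attention step that relates the scales to one another. This is an illustrative assumption, not the paper's actual implementation: the class names (`HierarchicalTokenizer`, `ScaleWiseAttention`), the 1×1-conv projection, and the per-scale mean pooling are all hypothetical simplifications of the approach described in the abstract.

```python
import torch
import torch.nn as nn

class HierarchicalTokenizer(nn.Module):
    """Projects feature maps from several CNN stages into a shared token space.
    (Illustrative stand-in for the paper's patch tokenization process.)"""
    def __init__(self, in_channels, embed_dim):
        super().__init__()
        # one 1x1 conv per backbone stage to unify channel dimensions
        self.proj = nn.ModuleList(
            nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels
        )

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps from a CNN backbone
        tokens = []
        for f, p in zip(feats, self.proj):
            t = p(f).flatten(2).transpose(1, 2)  # -> (B, H_i*W_i, D)
            tokens.append(t)
        return tokens

class ScaleWiseAttention(nn.Module):
    """Attends across scales: pool each scale to one descriptor token,
    then run self-attention over the resulting (B, num_scales, D) sequence.
    (Hypothetical simplification of the scale-wise attention mechanism.)"""
    def __init__(self, embed_dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, tokens):
        scale_desc = torch.stack([t.mean(dim=1) for t in tokens], dim=1)
        out, _ = self.attn(scale_desc, scale_desc, scale_desc)
        return out  # inter-scale ("global") representation

# Toy usage with fake backbone features at three ResNet-like scales.
B, D = 2, 64
feats = [torch.randn(B, c, s, s) for c, s in [(256, 28), (512, 14), (1024, 7)]]
tokens = HierarchicalTokenizer([256, 512, 1024], D)(feats)
fused = ScaleWiseAttention(D)(tokens)
print(fused.shape)  # one fused token per scale: (B, 3, D)
```

In a full model, the per-scale tokens would also feed the usual patch-wise (local) attention, with the scale-wise path supplying the complementary global view; here only the scale-wise path is sketched.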