DuoFormer: Leveraging Hierarchical Representations by Local and Global Attention Vision Transformer

📅 2025-06-15
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address insufficient multi-scale feature modeling in medical imaging and the limited inductive bias and high data dependency of Vision Transformers (ViTs), this paper proposes a dual-path Transformer architecture that synergistically integrates the hierarchical priors of CNNs with the global representational capacity of ViTs. Key contributions include: (i) a novel scale-aware attention mechanism that jointly models intra-layer local details and inter-layer global semantics; and (ii) a plug-and-play hierarchical tokenization scheme that explicitly preserves CNNs’ inherent multi-scale inductive bias. The architecture is compatible with mainstream CNN backbones and requires no modification to downstream task adaptation pipelines. Extensive experiments on multiple medical image classification benchmarks demonstrate consistent and significant improvements over both ViT- and CNN-based baselines, achieving average accuracy gains of 3.2–5.7 percentage points. These results validate the proposed method’s effectiveness, generalizability, and transferability.
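To make the hierarchical tokenization idea concrete, here is a minimal PyTorch sketch of turning multi-scale CNN feature maps into a shared token sequence. This is our reading of the summary, not the released implementation; the class name `MultiScaleTokenizer`, the ResNet-style stage channels, and `embed_dim` are all assumptions.

```python
# A minimal sketch, assuming a ResNet-style backbone that exposes feature
# maps from several stages; the class name, stage channels, and embed_dim
# are our assumptions, not the paper's.
import torch
import torch.nn as nn

class MultiScaleTokenizer(nn.Module):
    """Projects CNN feature maps from S stages into a shared token space."""

    def __init__(self, stage_channels=(256, 512, 1024, 2048), embed_dim=384):
        super().__init__()
        # One 1x1 conv per stage maps stage-specific channels to embed_dim.
        self.projections = nn.ModuleList(
            [nn.Conv2d(c, embed_dim, kernel_size=1) for c in stage_channels]
        )

    def forward(self, feature_maps):
        # feature_maps: list of S tensors, each shaped (B, C_s, H_s, W_s).
        tokens = []
        for proj, fmap in zip(self.projections, feature_maps):
            x = proj(fmap)                    # (B, D, H_s, W_s)
            x = x.flatten(2).transpose(1, 2)  # (B, H_s*W_s, D) patch tokens
            tokens.append(x)
        # Keep per-scale lengths so attention can later be grouped by scale.
        lengths = [t.shape[1] for t in tokens]
        return torch.cat(tokens, dim=1), lengths
```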


📝 Abstract
Despite the widespread adoption of transformers in medical applications, the exploration of multi-scale learning through transformers remains limited, while hierarchical representations are considered advantageous for computer-aided medical diagnosis. We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs). Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations. These representations are adapted for transformer input through an innovative patch tokenization process, preserving the inherited multi-scale inductive biases. We also introduce a scale-wise attention mechanism that directly captures intra-scale and inter-scale associations. This mechanism complements patch-wise attention by enhancing spatial understanding and preserving global perception, which we refer to as local and global attention, respectively. Our model significantly outperforms baseline models in classification accuracy, demonstrating its effectiveness in bridging the gap between CNNs and ViTs. The components are designed as plug-and-play for different CNN architectures and can be adapted for multiple applications. The code is available at https://github.com/xiaoyatang/DuoFormer.git.
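The abstract describes scale-wise attention as directly capturing intra-scale and inter-scale associations, complementing patch-wise (local) attention. Below is a hedged sketch of one way to realize the inter-scale part: pool every scale to a common spatial grid so locations correspond, then let tokens at the same location attend to each other across the scale axis. The pooling step, grid size, and all names are our assumptions, not the paper's.

```python
# A hedged sketch of inter-scale ("global") attention: feature maps from all
# scales are pooled to one spatial grid so locations correspond, then tokens
# at the same location attend to each other across the scale axis. Pooling,
# grid size, and all names are our assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleWiseAttention(nn.Module):
    def __init__(self, embed_dim=384, num_heads=6, grid=7):
        super().__init__()
        self.grid = grid
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, feature_maps):
        # feature_maps: list of S tensors (B, D, H_s, W_s), already projected
        # to a shared embedding dim (e.g., by the tokenizer sketched earlier).
        B, D = feature_maps[0].shape[:2]
        aligned = [
            F.adaptive_avg_pool2d(f, self.grid).flatten(2)  # (B, D, G*G)
            for f in feature_maps
        ]
        x = torch.stack(aligned, dim=1)           # (B, S, D, G*G)
        S = x.shape[1]
        # Fold spatial locations into the batch so attention runs over scales.
        x = x.permute(0, 3, 1, 2).reshape(B * self.grid**2, S, D)
        out, _ = self.attn(x, x, x)               # inter-scale attention
        return out.reshape(B, self.grid**2, S, D)
```

In this reading, patch-wise (local) attention would be a standard transformer block over each scale's own token sequence; how DuoFormer combines the two streams is best checked against the released code.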
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-scale learning in medical vision transformers
Integrating CNN and ViT for hierarchical feature extraction
Addressing ViT's lack of inductive biases and data dependency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical transformer integrating CNNs and ViTs
CNN backbone for multi-scale inductive biases
Scale-wise attention for intra- and inter-scale associations (a wiring sketch follows below)
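As referenced in the last item, here is a hypothetical end-to-end wiring of the two sketches above around a standard backbone, using torchvision's feature-extraction utility; node names and shapes follow torchvision's ResNet-50, and the actual DuoFormer pipeline may differ.

```python
# Hypothetical end-to-end wiring of the two sketches above around
# torchvision's ResNet-50; node names follow torchvision, and the real
# DuoFormer pipeline may differ.
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

backbone = create_feature_extractor(
    resnet50(weights=None),
    return_nodes={"layer1": "s1", "layer2": "s2", "layer3": "s3", "layer4": "s4"},
)
feats = list(backbone(torch.randn(1, 3, 224, 224)).values())
# feats[i]: (1, C_i, H_i, W_i) with C = 256 / 512 / 1024 / 2048 for ResNet-50.

tokenizer = MultiScaleTokenizer()               # from the first sketch
tokens, lengths = tokenizer(feats)              # local (patch-wise) stream
projected = [p(f) for p, f in zip(tokenizer.projections, feats)]
scale_tokens = ScaleWiseAttention()(projected)  # global (scale-wise) stream
print(tokens.shape, lengths, scale_tokens.shape)
# torch.Size([1, 4165, 384]) [3136, 784, 196, 49] torch.Size([1, 49, 4, 384])
```

Because the tokenizer only assumes a list of stage feature maps, swapping in another CNN backbone reduces to changing `stage_channels` and the `return_nodes` mapping, which matches the plug-and-play claim in the summary.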