🤖 AI Summary
To address insufficient multi-scale feature modeling in medical imaging and the limited inductive bias and high data dependency of Vision Transformers (ViTs), this paper proposes a dual-path Transformer architecture that synergistically integrates the hierarchical priors of CNNs with the global representational capacity of ViTs. Key contributions include: (i) a novel scale-aware attention mechanism that jointly models intra-layer local details and inter-layer global semantics; and (ii) a plug-and-play hierarchical tokenization scheme that explicitly preserves CNNs' inherent multi-scale inductive bias. The architecture is compatible with mainstream CNN backbones and requires no modification to downstream task adaptation pipelines. Extensive experiments on multiple medical image classification benchmarks demonstrate consistent and significant improvements over both ViT- and CNN-based baselines, achieving average accuracy gains of 3.2–5.7 percentage points. These results validate the proposed method's effectiveness, generalizability, and transferability.
📝 Abstract
Despite the widespread adoption of transformers in medical applications, multi-scale learning with transformers remains underexplored, even though hierarchical representations are considered advantageous for computer-aided medical diagnosis. We propose a novel hierarchical transformer model that integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the representational power of Vision Transformers (ViTs). To address ViTs' lack of inductive biases and dependence on extensive training data, our model employs a CNN backbone to generate hierarchical visual representations. These representations are adapted for transformer input through an innovative patch tokenization process that preserves the inherited multi-scale inductive biases. We also introduce a scale-wise attention mechanism that directly captures intra-scale and inter-scale associations. This mechanism complements patch-wise attention by enhancing spatial understanding and preserving global perception; we refer to these as local and global attention, respectively. Our model significantly outperforms baseline models in classification accuracy, demonstrating its effectiveness in bridging the gap between CNNs and ViTs. The components are designed as plug-and-play for different CNN architectures and can be adapted to multiple applications. The code is available at https://github.com/xiaoyatang/DuoFormer.git.
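To make the two core ideas concrete, here is a minimal PyTorch sketch of (a) tokenizing multi-scale CNN feature maps into a shared embedding space and (b) a scale-wise attention step that relates the scales to one another. This is an illustrative assumption, not the paper's actual implementation: the class names (`HierarchicalTokenizer`, `ScaleWiseAttention`), the 1×1-conv projection, and the per-scale mean pooling are all hypothetical simplifications of the approach described in the abstract.

```python
import torch
import torch.nn as nn

class HierarchicalTokenizer(nn.Module):
    """Projects feature maps from several CNN stages into a shared token space.
    (Illustrative stand-in for the paper's patch tokenization process.)"""
    def __init__(self, in_channels, embed_dim):
        super().__init__()
        # one 1x1 conv per backbone stage to unify channel dimensions
        self.proj = nn.ModuleList(
            nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels
        )

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps from a CNN backbone
        tokens = []
        for f, p in zip(feats, self.proj):
            t = p(f).flatten(2).transpose(1, 2)  # -> (B, H_i*W_i, D)
            tokens.append(t)
        return tokens

class ScaleWiseAttention(nn.Module):
    """Attends across scales: pool each scale to one descriptor token,
    then run self-attention over the resulting (B, num_scales, D) sequence.
    (Hypothetical simplification of the scale-wise attention mechanism.)"""
    def __init__(self, embed_dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, tokens):
        scale_desc = torch.stack([t.mean(dim=1) for t in tokens], dim=1)
        out, _ = self.attn(scale_desc, scale_desc, scale_desc)
        return out  # inter-scale ("global") representation

# Toy usage with fake backbone features at three ResNet-like scales.
B, D = 2, 64
feats = [torch.randn(B, c, s, s) for c, s in [(256, 28), (512, 14), (1024, 7)]]
tokens = HierarchicalTokenizer([256, 512, 1024], D)(feats)
fused = ScaleWiseAttention(D)(tokens)
print(fused.shape)  # one fused token per scale: (B, 3, D)
```

In a full model, the per-scale tokens would also feed the usual patch-wise (local) attention, with the scale-wise path supplying the complementary global view; here only the scale-wise path is sketched.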