Cross-Layer Feature Self-Attention Module for Multi-Scale Object Detection

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing object detection methods typically refine only single- or dual-level features, limiting their ability to model inter-layer dependencies across multi-scale representations and leaving insufficient contextual awareness for objects with large scale variations. To address this, the paper proposes the Cross-Layer Feature Self-Attention Module (CFSAM), the first to introduce cross-layer self-attention into multi-scale feature fusion, combining convolutional local representations with Transformer-based global contextual reasoning. The module is lightweight and computationally efficient: integrated into the SSD300 framework, it achieves 78.6% mAP on PASCAL VOC and 52.1% mAP on COCO, significantly outperforming the baseline, while also accelerating training convergence with minimal computational overhead.

📝 Abstract
Recent object detection methods have made remarkable progress by leveraging attention mechanisms to improve feature discriminability. However, most existing approaches are confined to refining single-layer or fusing dual-layer features, overlooking the rich inter-layer dependencies across multi-scale representations. This limits their ability to capture comprehensive contextual information essential for detecting objects with large scale variations. In this paper, we propose a novel Cross-Layer Feature Self-Attention Module (CFSAM), which holistically models both local and global dependencies within multi-scale feature maps. CFSAM consists of three key components: a convolutional local feature extractor, a Transformer-based global modeling unit that efficiently captures cross-layer interactions, and a feature fusion mechanism to restore and enhance the original representations. When integrated into the SSD300 framework, CFSAM significantly boosts detection performance, achieving 78.6% mAP on PASCAL VOC (vs. 75.5% baseline) and 52.1% mAP on COCO (vs. 43.1% baseline), outperforming existing attention modules. Moreover, the module accelerates convergence during training without introducing substantial computational overhead. Our work highlights the importance of explicit cross-layer attention modeling in advancing multi-scale object detection.
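The paper does not include code, but the abstract's three-component description (pool each pyramid level locally, attend across levels with a Transformer-style unit, fuse the result back) can be illustrated with a toy NumPy sketch. Everything here is an assumption for illustration: one global-average-pooled token per level stands in for the convolutional local extractor, and fixed random matrices stand in for the learned projections of the actual CFSAM.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_layer_attention(feature_maps, d_model=64, seed=0):
    """Toy sketch of cross-layer self-attention (NOT the paper's code).

    feature_maps: list of L arrays shaped (C_i, H_i, W_i), one per
    pyramid level. Each level is pooled to a single token, projected
    to a shared width, and self-attention runs across the L tokens so
    every level mixes in context from all other levels.
    """
    rng = np.random.default_rng(seed)
    # Global average pool each level: (C_i, H_i, W_i) -> (C_i,)
    pooled = [f.mean(axis=(1, 2)) for f in feature_maps]
    # Random projections (stand-ins for learned linears) map each
    # level's channel count C_i to the shared token width d_model.
    toks = np.stack([rng.standard_normal((d_model, t.shape[0])) @ t
                     for t in pooled])                  # (L, d_model)
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(3))
    Q, K, V = toks @ Wq, toks @ Wk, toks @ Wv
    # (L, L) attention: how much each level attends to every other level.
    attn = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)
    return attn @ V, attn
```

In the real module the attended output would be broadcast back to each level's spatial grid and fused with the original maps; the sketch stops at the cross-layer token exchange, which is the part the abstract highlights as novel.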
Problem

Research questions and friction points this paper is trying to address.

Capturing inter-layer dependencies in multi-scale object detection
Modeling local and global dependencies across feature maps
Improving detection performance for objects with scale variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-layer self-attention module for multi-scale detection
Transformer-based unit capturing cross-layer feature interactions
Feature fusion mechanism enhancing original representations efficiently
Dingzhou Xie
Guangzhou Huashang College, Huashang Road, Guangzhou, 511300, Guangdong, China
Rushi Lan
Guilin University of Electronic Technology
image processing, pattern classification
Cheng Pang
Guilin University of Electronic Technology, Jinji Road, Guilin, 541004, Guangxi, China
Enhao Ning
Guangxi Normal University, Yucai Road, Guilin, 541004, Guangxi, China
Jiahao Zeng
Guangxi Normal University, Yucai Road, Guilin, 541004, Guangxi, China
Wei Zheng
Jinling Institute of Technology, Hongjing Road, Nanjing, 211169, Jiangsu, China