Cross-Layer Feature Self-Attention Module for Multi-Scale Object Detection

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing object detection methods typically refine only single- or dual-level features, limiting their ability to model inter-layer dependencies across multi-scale representations and leaving insufficient contextual awareness for objects with large scale variations. To address this, the paper proposes the Cross-Layer Feature Self-Attention Module (CFSAM), the first to introduce cross-layer self-attention into multi-scale feature fusion, combining convolutional local representations with Transformer-based global contextual reasoning. The module is lightweight and computationally efficient: integrated into the SSD300 framework, it achieves 78.6% mAP on PASCAL VOC and 52.1% mAP on COCO, significantly outperforming the baseline, while also accelerating training convergence with minimal computational overhead.

📝 Abstract
Recent object detection methods have made remarkable progress by leveraging attention mechanisms to improve feature discriminability. However, most existing approaches are confined to refining single-layer or fusing dual-layer features, overlooking the rich inter-layer dependencies across multi-scale representations. This limits their ability to capture comprehensive contextual information essential for detecting objects with large scale variations. In this paper, we propose a novel Cross-Layer Feature Self-Attention Module (CFSAM), which holistically models both local and global dependencies within multi-scale feature maps. CFSAM consists of three key components: a convolutional local feature extractor, a Transformer-based global modeling unit that efficiently captures cross-layer interactions, and a feature fusion mechanism to restore and enhance the original representations. When integrated into the SSD300 framework, CFSAM significantly boosts detection performance, achieving 78.6% mAP on PASCAL VOC (vs. 75.5% baseline) and 52.1% mAP on COCO (vs. 43.1% baseline), outperforming existing attention modules. Moreover, the module accelerates convergence during training without introducing substantial computational overhead. Our work highlights the importance of explicit cross-layer attention modeling in advancing multi-scale object detection.
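The paper does not include code, but the abstract's three-component description (pool each pyramid level locally, attend across levels with a Transformer-style unit, fuse the result back) can be illustrated with a toy NumPy sketch. Everything here is an assumption for illustration: one global-average-pooled token per level stands in for the convolutional local extractor, and fixed random matrices stand in for the learned projections of the actual CFSAM.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_layer_attention(feature_maps, d_model=64, seed=0):
    """Toy sketch of cross-layer self-attention (NOT the paper's code).

    feature_maps: list of L arrays shaped (C_i, H_i, W_i), one per
    pyramid level. Each level is pooled to a single token, projected
    to a shared width, and self-attention runs across the L tokens so
    every level mixes in context from all other levels.
    """
    rng = np.random.default_rng(seed)
    # Global average pool each level: (C_i, H_i, W_i) -> (C_i,)
    pooled = [f.mean(axis=(1, 2)) for f in feature_maps]
    # Random projections (stand-ins for learned linears) map each
    # level's channel count C_i to the shared token width d_model.
    toks = np.stack([rng.standard_normal((d_model, t.shape[0])) @ t
                     for t in pooled])                  # (L, d_model)
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(3))
    Q, K, V = toks @ Wq, toks @ Wk, toks @ Wv
    # (L, L) attention: how much each level attends to every other level.
    attn = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)
    return attn @ V, attn
```

In the real module the attended output would be broadcast back to each level's spatial grid and fused with the original maps; the sketch stops at the cross-layer token exchange, which is the part the abstract highlights as novel.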
Problem

Research questions and friction points this paper is trying to address.

Capturing inter-layer dependencies in multi-scale object detection
Modeling local and global dependencies across feature maps
Improving detection performance for objects with scale variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-layer self-attention module for multi-scale detection
Transformer-based unit capturing cross-layer feature interactions
Feature fusion mechanism enhancing original representations efficiently
Dingzhou Xie
Guangzhou Huashang College, Huashang Road, Guangzhou, 511300, Guangdong, China
Rushi Lan
Guilin University of Electronic Technology
image processing, pattern classification
Cheng Pang
Guilin University of Electronic Technology, Jinji Road, Guilin, 541004, Guangxi, China
Enhao Ning
Guangxi Normal University, Yucai Road, Guilin, 541004, Guangxi, China
Jiahao Zeng
Guangxi Normal University, Yucai Road, Guilin, 541004, Guangxi, China
Wei Zheng
Jinling Institute of Technology, Hongjing Road, Nanjing, 211169, Jiangsu, China