RCCFormer: A Robust Crowd Counting Network Based on Transformer

📅 2025-04-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address low counting accuracy in complex scenes caused by scale variations and background clutter, this paper proposes a Transformer-based robust crowd counting network. The method introduces three key innovations: (1) an Adaptive Scale-Aware Module (ASAM) driven by input-dependent deformable convolution (IDConv) to enable dynamic receptive field modeling; (2) a Detail-Embedded Attention Block (DEAB) that integrates global-local self-attention to enhance head-structure perception; and (3) a Multi-level Feature Fusion Module (MFFM) to strengthen cross-scale semantic representation. Evaluated on four major benchmarks—ShanghaiTech Part_A, ShanghaiTech Part_B, NWPU-Crowd, and QNRF—the proposed approach achieves state-of-the-art performance, consistently outperforming existing methods in both Mean Absolute Error (MAE) and Mean Squared Error (MSE).
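The cross-scale fusion idea behind MFFM can be illustrated with a toy sketch: features from several backbone stages are brought to a common resolution and combined. Everything below is an illustrative assumption, not the paper's implementation — the helper names (`mffm_fuse`, `upsample_nearest`), nearest-neighbour upsampling, and fixed fusion weights stand in for what the actual module learns inside the Transformer backbone.

```python
import numpy as np

def upsample_nearest(x, factor):
    # x: (C, H, W) -> (C, H*factor, W*factor) via nearest-neighbour repetition
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def mffm_fuse(features, weights):
    """Fuse multi-level features: upsample each map to the finest resolution,
    then take a weighted sum. The weights are fixed here; in the real model
    the fusion would be learned. Assumes square feature maps for brevity."""
    target = max(f.shape[1] for f in features)
    fused = np.zeros((features[0].shape[0], target, target))
    for f, w in zip(features, weights):
        fused += w * upsample_nearest(f, target // f.shape[1])
    return fused

# toy pyramid: three backbone stages at 16x16, 8x8, 4x4 with 2 channels each
feats = [np.ones((2, 16, 16)), np.ones((2, 8, 8)), np.ones((2, 4, 4))]
out = mffm_fuse(feats, [0.5, 0.3, 0.2])
print(out.shape)  # (2, 16, 16)
```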

📝 Abstract
Crowd counting, which is a key computer vision task, has emerged as a fundamental technology in crowd analysis and public safety management. However, challenges such as scale variations and complex backgrounds significantly impact the accuracy of crowd counting. To mitigate these issues, this paper proposes a robust Transformer-based crowd counting network, termed RCCFormer, specifically designed for background suppression and scale awareness. The proposed method incorporates a Multi-level Feature Fusion Module (MFFM), which meticulously integrates features extracted at diverse stages of the backbone architecture. It establishes a strong baseline capable of capturing intricate and comprehensive feature representations, surpassing traditional baselines. Furthermore, the introduced Detail-Embedded Attention Block (DEAB) captures contextual information and local details through global self-attention and local attention, which are fused in a learnable manner. This enhances the model's ability to focus on foreground regions while effectively mitigating background noise interference. Additionally, we develop an Adaptive Scale-Aware Module (ASAM), with our novel Input-dependent Deformable Convolution (IDConv) as its fundamental building block. This module dynamically adapts to changes in head target shapes and scales, significantly improving the network's capability to accommodate large-scale variations. The effectiveness of the proposed method is validated on the ShanghaiTech Part_A and Part_B, NWPU-Crowd, and QNRF datasets. The results demonstrate that our RCCFormer achieves excellent performance across all four datasets, showcasing state-of-the-art outcomes.
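The DEAB's combination of global self-attention and local attention with a learnable fusion can be sketched as follows. This is a minimal single-head toy in NumPy, not the paper's block: the scalar sigmoid `gate`, the 1-D local window, and the function names (`deab_fuse`, `global_attention`, `local_attention`) are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(x):
    # x: (N, d) tokens; plain scaled dot-product self-attention over all tokens
    scores = x @ x.T / np.sqrt(x.shape[1])
    return softmax(scores, axis=-1) @ x

def local_attention(x, window=3):
    # each token attends only to a small neighbourhood (1-D for simplicity),
    # standing in for the detail-preserving local branch
    n, d = x.shape
    out = np.zeros_like(x)
    half = window // 2
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        nb = x[lo:hi]
        w = softmax(x[i] @ nb.T / np.sqrt(d))
        out[i] = w @ nb
    return out

def deab_fuse(x, gate):
    # a learnable scalar gate (sigmoid-squashed) blends the two branches;
    # the real block learns this fusion end to end
    g = 1.0 / (1.0 + np.exp(-gate))
    return g * global_attention(x) + (1.0 - g) * local_attention(x)

tokens = np.random.default_rng(0).standard_normal((6, 4))
fused = deab_fuse(tokens, gate=0.0)  # gate=0 -> equal 50/50 blend
print(fused.shape)  # (6, 4)
```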
Problem

Research questions and friction points this paper is trying to address.

Addresses scale variations in crowd counting accuracy
Reduces background noise interference in crowd analysis
Improves adaptation to head target shape and scale changes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based network for crowd counting
Multi-level Feature Fusion Module integration
Adaptive Scale-Aware Module with IDConv
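The idea of input-dependent deformable sampling behind IDConv can be shown in a 1-D toy. This is a sketch under heavy assumptions: the paper's IDConv is a 2-D deformable convolution whose offsets come from a learned branch, whereas here a hand-crafted rule (offsets scaled by the local signal magnitude) plays that role, and `idconv_1d` and `linear_sample` are illustrative names.

```python
import numpy as np

def linear_sample(signal, pos):
    """Sample a 1-D signal at fractional positions via linear interpolation."""
    pos = np.clip(pos, 0, len(signal) - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, len(signal) - 1)
    frac = pos - lo
    return (1 - frac) * signal[lo] + frac * signal[hi]

def idconv_1d(signal, kernel, offset_scale=1.0):
    """Toy input-dependent deformable conv: each position derives its own
    sampling offsets from the input, then convolves at the shifted taps."""
    half = len(kernel) // 2
    taps = np.arange(-half, half + 1)
    out = np.zeros_like(signal)
    for i in range(len(signal)):
        # offsets grow with local signal strength -- a hand-crafted stand-in
        # for the learned offset-prediction branch in the real IDConv
        offsets = offset_scale * np.tanh(signal[i]) * taps
        out[i] = kernel @ linear_sample(signal, i + taps + offsets)
    return out

sig = np.linspace(0.0, 1.0, 8)
print(idconv_1d(sig, np.array([0.25, 0.5, 0.25])).shape)  # (8,)
```

With `offset_scale=0` the taps stay on the regular grid and this reduces to an ordinary convolution, which is the design point: deformation is an input-conditioned perturbation of a standard kernel.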
Peng Liu
School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
Heng-Chao Li
School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
Sen Lei
Southwest Jiaotong University
Computer Vision, Deep Learning, Remote Sensing
Nanqing Liu
Southwest Jiaotong University
Remote Sensing, Deep Learning, Object Detection
Bin Feng
School of Physical Education, Southwest Jiaotong University, Chengdu 611756, China
Xiao Wu
School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China