A Dual-Modulation Framework for RGB-T Crowd Counting via Spatially Modulated Attention and Adaptive Fusion

📅 2025-09-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
RGB-T crowd counting faces two key challenges in complex scenes: (1) attention dispersion due to the lack of spatial inductive bias in standard Transformers, and (2) ineffective cross-modal fusion between RGB and thermal infrared modalities. To address these, we propose a Dual-Modulation Transformer framework comprising: (1) Spatial Modulation Attention (SMA), which introduces a learnable spatial decay mask to suppress long-range irrelevant attention from background regions and improve crowd localization accuracy; and (2) Adaptive Fusion Modulation (AFM), a reliability-driven dynamic gating mechanism that enables complementary RGB–thermal feature integration. Evaluated on multiple RGB-T benchmark datasets, our method achieves state-of-the-art performance—reducing counting error by 12.6%–18.3% and improving localization mAP by 9.4%–14.1%—demonstrating superior robustness and accuracy in challenging real-world environments.

Technology Category

Application Category

📝 Abstract
Accurate RGB-Thermal (RGB-T) crowd counting is crucial for public safety in challenging conditions. While recent Transformer-based methods excel at capturing global context, their inherent lack of spatial inductive bias causes attention to spread to irrelevant background regions, compromising crowd localization precision. Furthermore, effectively bridging the gap between these distinct modalities remains a major hurdle. To tackle this, we propose the Dual Modulation Framework, comprising two modules: Spatially Modulated Attention (SMA), which improves crowd localization by using a learnable Spatial Decay Mask to penalize attention between distant tokens and prevent focus from spreading to the background; and Adaptive Fusion Modulation (AFM), which implements a dynamic gating mechanism to prioritize the most reliable modality for adaptive cross-modal fusion. Extensive experiments on RGB-T crowd counting datasets demonstrate the superior performance of our method compared to previous works. Code available at https://github.com/Cht2924/RGBT-Crowd-Counting.
Problem

Research questions and friction points this paper is trying to address.

Improves RGB-Thermal crowd counting accuracy in challenging public safety conditions
Addresses Transformer's spatial bias limitation causing attention to irrelevant backgrounds
Bridges modality gap between RGB and thermal data for effective fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatially Modulated Attention with learnable decay mask
Adaptive Fusion Modulation using dynamic gating mechanism
Dual-Modulation Framework for RGB-T crowd counting
🔎 Similar Papers
No similar papers found.
Yuhong Feng
Yuhong Feng
Associate Professor
Workflow ManagementCloud ComputingThe Internet of thingsLinux Operating System
H
Hongtao Chen
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
Q
Qi Zhang
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
J
Jie Chen
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
Z
Zhaoxi He
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
M
Mingzhe Liu
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
J
Jianghai Liao
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China