Transformer-Driven Multimodal Fusion for Explainable Suspiciousness Estimation in Visual Surveillance

2025-12-09

AI Summary
To address real-time suspicious-behavior detection in complex surveillance scenarios, this paper proposes DeepUSEvision, a lightweight multimodal framework. It introduces USE50k, described as the first large-scale, multi-source, heterogeneous surveillance dataset (65.5K samples), and a modular, interpretable Transformer-based fusion architecture. The architecture combines an enhanced YOLOv12 for suspicious object detection, dual-path DCNNs for facial expression and body pose recognition (using both RGB images and skeletal keypoints), and a cross-modal discriminative network, with multi-task joint training and attention visualization. The key contributions are (1) the first release of the USE50k benchmark and (2) the first interpretable multimodal fusion paradigm enabling fine-grained attribution analysis. Experiments report state-of-the-art performance: 89.3% mAP@0.5 on real-world scenes and 42 FPS inference speed, with superior accuracy, robustness, and deployment-ready traceability.
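For readers unfamiliar with the mAP@0.5 notation used above: the "@0.5" means a predicted box counts as a correct detection only when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The sketch below shows that check only. The function names and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for illustration and are not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlap rectangle (may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp to zero so disjoint boxes give zero overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, truth_box, threshold=0.5):
    """A detection matches the ground truth when IoU reaches the threshold."""
    return iou(pred_box, truth_box) >= threshold


truth = (0, 0, 10, 10)
loose_pred = (5, 0, 15, 10)   # overlaps half of the ground-truth box
tight_pred = (2, 0, 12, 10)   # overlaps most of the ground-truth box
```

Here `loose_pred` has an IoU of 1/3 and would be rejected at the 0.5 threshold, while `tight_pred` has an IoU of 2/3 and would be accepted. A full mAP computation additionally ranks detections by confidence and averages precision over classes, which this sketch does not cover.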

๐Ÿ“ Abstract
Suspiciousness estimation is critical for proactive threat detection and ensuring public safety in complex environments. This work introduces a large-scale annotated dataset, USE50k, along with a computationally efficient vision-based framework for real-time suspiciousness analysis. The USE50k dataset contains 65,500 images captured from diverse and uncontrolled environments, such as airports, railway stations, restaurants, parks, and other public areas, covering a broad spectrum of cues including weapons, fire, crowd density, abnormal facial expressions, and unusual body postures. Building on this dataset, we present DeepUSEvision, a lightweight and modular system integrating three key components, i.e., a Suspicious Object Detector based on an enhanced YOLOv12 architecture, dual Deep Convolutional Neural Networks (DCNN-I and DCNN-II) for facial expression and body-language recognition using image and landmark features, and a transformer-based Discriminator Network that adaptively fuses multimodal outputs to yield an interpretable suspiciousness score. Extensive experiments confirm the superior accuracy, robustness, and interpretability of the proposed framework compared to state-of-the-art approaches. Collectively, the USE50k dataset and the DeepUSEvision framework establish a strong and scalable foundation for intelligent surveillance and real-time risk assessment in safety-critical applications.
Problem

Research questions and friction points this paper is trying to address.

Estimating suspiciousness in visual surveillance for threat detection
Introducing a large-scale dataset for diverse public environment analysis
Developing a lightweight multimodal fusion framework for real-time scoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced YOLOv12 for suspicious object detection
Dual DCNNs for facial and body language recognition
Transformer-based network for multimodal fusion and interpretability
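
The fusion idea in the list above can be illustrated with a single attention-pooling step over one embedding per modality. This is a minimal sketch under stated assumptions, not the paper's Discriminator Network: the function `fuse`, the vectors `query` and `w_out` (stand-ins for learned parameters), and the 4-dimensional toy embeddings are all hypothetical. A real transformer would add learned key/value projections, multiple heads, and stacked layers.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse(tokens, query, w_out, bias=0.0):
    """Attention-pool modality embeddings into one score.

    tokens: dict mapping modality name -> embedding (list of floats)
    query:  stand-in for a learned query vector (assumption)
    w_out:  stand-in for a learned output projection (assumption)
    Returns (score in (0, 1), per-modality attention weights).
    """
    names = list(tokens)
    dim = len(query)
    # Scaled dot-product attention logits, one per modality token.
    logits = [dot(query, tokens[n]) / math.sqrt(dim) for n in names]
    weights = softmax(logits)
    # Weighted sum of embeddings; tokens serve as both keys and values here.
    pooled = [
        sum(w * tokens[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    # Sigmoid maps the projected value to a score between 0 and 1.
    score = 1.0 / (1.0 + math.exp(-(dot(w_out, pooled) + bias)))
    return score, dict(zip(names, weights))


# Hypothetical embeddings from the object, face, and pose branches.
tokens = {
    "object": [2.0, 0.1, 0.0, 0.3],
    "face": [0.2, 0.4, 0.1, 0.0],
    "pose": [0.5, 0.9, 0.2, 0.1],
}
query = [1.0, 0.5, 0.0, 0.0]
w_out = [0.8, 0.6, 0.1, 0.1]

score, attribution = fuse(tokens, query, w_out)
```

The returned `attribution` weights sum to 1 and indicate how much each modality contributed to the pooled representation, which is the kind of signal attention visualization relies on. With these toy values the object branch receives the largest weight.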