CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection

📅 2026-04-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses multimodal violence detection in real-world scenarios, where audio is often corrupted by noise or only weakly correlated with visual cues. To overcome these limitations, the authors propose a video-guided audio fusion architecture that, for the first time, integrates a CLS-guided conditional LoRA module into the Mamba framework. This design dynamically modulates the state-space parameters of AudioMamba, enabling efficient cross-modal modeling without token-level cross-attention. Trained with a combination of binary classification and symmetric AV-InfoNCE losses, the model achieves 88.63% and 75.77% accuracy on the audio-filtered NTU-CCTV and DVD subsets, respectively, significantly outperforming existing unimodal and multimodal approaches while requiring fewer parameters and less compute.
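The CLS-guided conditional LoRA is the core mechanism here. As a rough illustration only, the PyTorch sketch below shows one way a video CLS token could produce a channel-wise modulation vector and a stabilization gate for a low-rank update to an AudioMamba projection; all names (ConditionalLoRA, to_mod, to_gate, the rank of 8) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConditionalLoRA(nn.Module):
    """Hypothetical sketch: a low-rank adapter whose update is modulated
    channel-wise by the VideoMamba CLS token and scaled by a gate.
    Names and shapes are illustrative assumptions, not the paper's code."""
    def __init__(self, dim_in, dim_out, cls_dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim_in, rank, bias=False)   # LoRA "A" matrix
        self.up = nn.Linear(rank, dim_out, bias=False)    # LoRA "B" matrix
        nn.init.zeros_(self.up.weight)                    # adapter starts as a no-op
        # CLS token -> per-channel modulation vector and a scalar stabilization gate
        self.to_mod = nn.Linear(cls_dim, dim_out)
        self.to_gate = nn.Linear(cls_dim, 1)

    def forward(self, base_out, x_audio, video_cls):
        # base_out:  frozen projection output for an SSM parameter, (B, L, dim_out)
        # x_audio:   audio tokens entering that projection, (B, L, dim_in)
        # video_cls: clip-level VideoMamba CLS token, (B, cls_dim)
        mod = self.to_mod(video_cls).unsqueeze(1)                    # (B, 1, dim_out)
        gate = torch.sigmoid(self.to_gate(video_cls)).unsqueeze(1)   # (B, 1, 1)
        delta = self.up(self.down(x_audio)) * mod                    # conditioned low-rank update
        return base_out + gate * delta
```

Per the abstract, one such adapter would sit on each AudioMamba projection feeding the selective state-space parameters (Δ, B, C), including the step-size pathway.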
📝 Abstract
Violence detection benefits from audio, but real-world soundscapes can be noisy or only weakly related to the visible scene. We present CoLoRSMamba, a directional video-to-audio multimodal architecture that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate that adapt the AudioMamba projections responsible for the selective state-space parameters (Δ, B, C), including the step-size pathway, yielding scene-aware audio dynamics without token-level cross-attention. Training combines binary classification with a symmetric AV-InfoNCE objective that aligns clip-level audio and video embeddings. To support fair multimodal evaluation, we curate audio-filtered, clip-level subsets of the NTU-CCTV and DVD datasets from temporal annotations, retaining only clips with available audio. On these subsets, CoLoRSMamba outperforms representative audio-only, video-only, and multimodal baselines, achieving 88.63% accuracy / 86.24% F1-V on NTU-CCTV and 75.77% accuracy / 72.94% F1-V on DVD. It further offers a favorable accuracy-efficiency tradeoff, surpassing several larger models with fewer parameters and FLOPs.
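The symmetric AV-InfoNCE objective named above is not spelled out on this page; the sketch below is a standard symmetric InfoNCE over L2-normalized clip-level embeddings, which matches the abstract's description of aligning audio and video in both directions. The function name and the temperature value of 0.07 are assumptions.

```python
import torch
import torch.nn.functional as F

def symmetric_av_infonce(audio_emb, video_emb, temperature=0.07):
    """Sketch of a symmetric InfoNCE aligning clip-level audio/video
    embeddings; the temperature and name are illustrative assumptions."""
    a = F.normalize(audio_emb, dim=-1)   # (B, D) clip-level audio embeddings
    v = F.normalize(video_emb, dim=-1)   # (B, D) clip-level video embeddings
    logits = a @ v.t() / temperature     # (B, B) pairwise cosine similarities
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs on the diagonal
    # audio->video and video->audio directions, averaged for symmetry
    loss_av = F.cross_entropy(logits, targets)
    loss_va = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_av + loss_va)
```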
Problem

Research questions and friction points this paper is trying to address.

multimodal violence detection
noisy audio
weak audio-visual correlation
supervised learning
real-world soundscapes
Innovation

Methods, ideas, or system contributions that make the work stand out.

conditional LoRA
Mamba
multimodal violence detection
state-space model
audio-video alignment
Damith Chamalke Senadeera
Digital Environmental Research Institute, Queen Mary University of London, UK
Dimitrios Kollias
Associate Professor in Multimodal AI at Queen Mary University of London
Multimodal AI · Deep Learning & Computer Vision · Behavior Analysis · HMI · Medical Imaging & Healthcare
Gregory Slabaugh
Digital Environmental Research Institute, Queen Mary University of London, UK