From Large-scale Audio Tagging to Real-Time Explainable Emergency Vehicle Sirens Detection

📅 2025-06-29
🤖 AI Summary
Emergency vehicle (EV) siren recognition is held back by two problems: the lack of high-quality, large-scale datasets and the computational cost of existing models, which hinders edge deployment. This paper proposes E2PANNs, a lightweight and interpretable audio event detection architecture. Built on the PANNs framework, E2PANNs combines lightweight convolutional modules with transfer learning and fine-tuning on an AudioSet EV subset, and uses Guided Backpropagation and Score-CAM for interpretability. Cross-domain training further improves robustness. Evaluated on multiple benchmarks, E2PANNs achieves state-of-the-art performance: a 92.3% frame-level F1-score and a 27% reduction in event-level detection error. It runs in real time on embedded devices (e.g., Jetson Nano) with latency under 30 ms and reduces the false positive rate by 41%, improving safety and practicality in intelligent transportation and autonomous driving applications.

📝 Abstract
Accurate recognition of Emergency Vehicle (EV) sirens is critical for the integration of intelligent transportation systems, smart city monitoring systems, and autonomous driving technologies. Modern automatic solutions are limited by the lack of large-scale, curated datasets and by the computational demands of state-of-the-art sound event detection models. This work introduces E2PANNs (Efficient Emergency Pre-trained Audio Neural Networks), a lightweight Convolutional Neural Network architecture derived from the PANNs framework and specifically optimized for binary EV siren detection. Leveraging our dedicated subset of AudioSet (AudioSet EV), we fine-tune and evaluate E2PANNs across multiple reference datasets and test its viability on embedded hardware. The experimental campaign includes ablation studies, cross-domain benchmarking, and real-time inference deployment on an edge device. Interpretability analyses based on the Guided Backpropagation and ScoreCAM algorithms provide insights into the model's internal representations and validate its ability to capture distinct spectrotemporal patterns associated with different types of EV sirens. Real-time performance is assessed through frame-wise and event-based detection metrics, as well as a detailed analysis of false positive activations. Results demonstrate that E2PANNs establishes a new state of the art in this research domain, combining high computational efficiency with suitability for edge-based audio monitoring and safety-critical applications.
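The abstract distinguishes frame-wise from event-based detection metrics. The minimal Python sketch below illustrates that difference for a binary siren detector. It is not the paper's evaluation protocol: the function names (`frame_f1`, `frames_to_events`, `event_counts`) are hypothetical, and the "any overlap counts as a hit" matching rule is an assumption chosen for brevity.

```python
from typing import List, Tuple

Event = Tuple[int, int]  # (start_frame, end_frame), end exclusive


def frame_f1(ref: List[int], pred: List[int]) -> float:
    """Frame-wise F1 for binary labels (1 = siren present in that frame)."""
    tp = sum(1 for r, p in zip(ref, pred) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(ref, pred) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(ref, pred) if r == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def frames_to_events(labels: List[int]) -> List[Event]:
    """Group runs of consecutive active frames into (start, end) events."""
    events: List[Event] = []
    start = None
    for i, value in enumerate(labels):
        if value == 1 and start is None:
            start = i                      # a run of active frames begins
        elif value == 0 and start is not None:
            events.append((start, i))      # the run ended at the previous frame
            start = None
    if start is not None:                  # run reaches the end of the clip
        events.append((start, len(labels)))
    return events


def event_counts(ref_events: List[Event],
                 pred_events: List[Event]) -> Tuple[int, int, int]:
    """Return (hits, false_alarms, misses).

    Assumption: a predicted event is a hit if it overlaps any reference event.
    """
    def overlaps(a: Event, b: Event) -> bool:
        return a[0] < b[1] and b[0] < a[1]

    hits = sum(1 for p in pred_events
               if any(overlaps(p, r) for r in ref_events))
    false_alarms = len(pred_events) - hits
    misses = sum(1 for r in ref_events
                 if not any(overlaps(r, p) for p in pred_events))
    return hits, false_alarms, misses


# Toy example: one true siren, one early-but-overlapping detection,
# and one isolated false positive frame.
ref = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
pred = [0, 1, 1, 1, 0, 0, 0, 0, 1, 0]

print(frame_f1(ref, pred))
print(event_counts(frames_to_events(ref), frames_to_events(pred)))
```

In this toy case the frame-wise score penalizes every misaligned frame, while the event-based count records one detected siren and one false alarm. This is why the two views are typically reported together when false positive activations matter.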
Problem

Research questions and friction points this paper is trying to address.

Detect emergency vehicle sirens accurately in real time
Overcome lack of large-scale datasets for siren detection
Optimize lightweight CNN for edge device deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight CNN for EV siren detection
Fine-tuned on dedicated AudioSet EV subset
Real-time edge deployment with interpretability
Stefano Giacomelli
Department of Information Engineering, Computer Science and Mathematics (DISIM), University of L’Aquila
Marco Giordano
Department of Information Engineering, Computer Science and Mathematics (DISIM), University of L’Aquila
Claudia Rinaldi
CNIT - National Inter-University Consortium for Telecommunications
wireless communications, digital signal processing, multimedia
Fabio Graziosi
University of L'Aquila - Italy