Misophonia Trigger Sound Detection on Synthetic Soundscapes Using a Hybrid Model with a Frozen Pre-Trained CNN and a Time-Series Module

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the scarcity of real-world data hindering assistive technology development for misophonia—a condition characterized by intense negative emotional responses to specific everyday sounds, known as trigger sounds. To overcome this limitation, the authors propose constructing the first misophonia trigger sound dataset using synthetic soundscapes and introduce a lightweight hybrid model for sound event detection. The model leverages a frozen pretrained CNN to extract audio features, combined with a trainable bidirectional temporal module—implemented with GRU, LSTM, or Echo State Network (ESN)—to localize trigger sound segments within continuous audio. This approach uniquely integrates synthetic soundscapes with lightweight bidirectional temporal architectures, enabling few-shot personalization. Experimental results demonstrate that BiGRU achieves the best performance across multiple trigger categories, while BiESN attains competitive results with minimal trainable parameters and remains robust even with only five training samples for detecting eating-related sounds.

📝 Abstract
Misophonia is a disorder characterized by a decreased tolerance to specific everyday sounds (trigger sounds) that can evoke intense negative emotional responses such as anger, panic, or anxiety. These reactions can substantially impair daily functioning and quality of life. Assistive technologies that selectively detect trigger sounds could help reduce distress and improve well-being. In this study, we investigate sound event detection (SED) to localize intervals of trigger sounds in continuous environmental audio as a foundational step toward such assistive support. Motivated by the scarcity of real-world misophonia data, we generate synthetic soundscapes tailored to misophonia trigger sound detection using audio synthesis techniques. We then perform trigger sound detection using hybrid CNN-based models that combine feature extraction by a frozen pre-trained CNN backbone with a trainable time-series module such as a gated recurrent unit (GRU), long short-term memory (LSTM), or echo state network (ESN), including their bidirectional variants. Detection performance is evaluated using common SED metrics, including the Polyphonic Sound Detection Score 1 (PSDS1). On the multi-class trigger SED task, bidirectional temporal modeling consistently improves detection performance, with the Bidirectional GRU (BiGRU) achieving the best overall accuracy. Notably, the Bidirectional ESN (BiESN) attains competitive performance while requiring orders of magnitude fewer trainable parameters, since only the readout is optimized. We further simulate user personalization via a few-shot "eating sound" detection task with at most five support clips, in which BiGRU and BiESN are compared. In this strict adaptation setting, BiESN shows robust and stable performance, suggesting that lightweight temporal modules are promising for personalized misophonia trigger SED.
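The BiESN idea described in the abstract — a fixed recurrent reservoir run over the frame sequence in both time directions, with only a linear readout trained — can be sketched in a few lines of NumPy. Everything below (reservoir size, leak rate, ridge strength, and the toy frame features standing in for frozen-CNN embeddings) is an illustrative assumption, not the paper's actual implementation or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

class BiESN:
    """Minimal bidirectional echo state network sketch.

    Reservoir weights are random and frozen; only the linear readout
    is trained, here by ridge regression, mirroring the "orders of
    magnitude fewer trainable parameters" property in the abstract.
    Inputs stand in for frame-level embeddings from a frozen CNN.
    """

    def __init__(self, n_in, n_res=100, spectral_radius=0.9, leak=0.5):
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.uniform(-0.5, 0.5, (n_res, n_res))
        # Rescale so the largest eigenvalue magnitude is spectral_radius.
        W *= spectral_radius / max(abs(np.linalg.eigvals(W)))
        self.W = W
        self.leak = leak
        self.n_res = n_res

    def _run(self, X):
        # X: (T, n_in) -> reservoir states (T, n_res), leaky-integrator update.
        h = np.zeros(self.n_res)
        states = []
        for x in X:
            h = (1 - self.leak) * h + self.leak * np.tanh(self.W_in @ x + self.W @ h)
            states.append(h.copy())
        return np.array(states)

    def features(self, X):
        fwd = self._run(X)
        bwd = self._run(X[::-1])[::-1]  # backward pass, re-aligned in time
        return np.concatenate([fwd, bwd], axis=1)  # (T, 2 * n_res)

    def fit_readout(self, X, y, ridge=1e-3):
        # Frame-wise binary targets y: (T,). Closed-form ridge regression.
        S = self.features(X)
        A = S.T @ S + ridge * np.eye(S.shape[1])
        self.w_out = np.linalg.solve(A, S.T @ y)

    def predict(self, X):
        # Threshold the continuous readout to per-frame 0/1 decisions.
        return (self.features(X) @ self.w_out > 0.5).astype(int)

# Toy demonstration: one "trigger" interval leaves a signature in feature 0.
T, n_in = 200, 8
X = rng.normal(0, 0.1, (T, n_in))
y = np.zeros(T, dtype=int)
y[60:120] = 1            # simulated trigger-sound interval
X[60:120, 0] += 1.0      # trigger signature in the frame features

esn = BiESN(n_in)
esn.fit_readout(X, y)
pred = esn.predict(X)
acc = (pred == y).mean()
```

The bidirectional pass matters for event detection: a frame near the end of a trigger segment is classified with context from both before and after it, which is what the abstract credits for the consistent gains of the bidirectional variants.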
Problem

Research questions and friction points this paper is trying to address.

misophonia
trigger sound detection
sound event detection
synthetic soundscapes
assistive technology
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic soundscapes
hybrid CNN–RNN model
frozen pre-trained CNN
bidirectional ESN
few-shot personalization
Kurumi Sashida
Department of Computer Science, Nagoya Institute of Technology, Nagoya 466-8555, Japan
Gouhei Tanaka
Nagoya Institute of Technology
Complex Systems Dynamics · Mathematical Engineering · Neural Networks