🤖 AI Summary
This study addresses the scarcity of real-world data hindering assistive technology development for misophonia—a condition characterized by intense negative emotional responses to specific everyday sounds, known as trigger sounds. To overcome this limitation, the authors propose constructing the first misophonia trigger sound dataset using synthetic soundscapes and introduce a lightweight hybrid model for sound event detection. The model leverages a frozen pretrained CNN to extract audio features, combined with a trainable bidirectional temporal module—implemented with GRU, LSTM, or Echo State Network (ESN)—to localize trigger sound segments within continuous audio. This approach uniquely integrates synthetic soundscapes with lightweight bidirectional temporal architectures, enabling few-shot personalization. Experimental results demonstrate that BiGRU achieves the best performance across multiple trigger categories, while BiESN attains competitive results with minimal trainable parameters and remains robust even with only five training samples for detecting eating-related sounds.
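The hybrid architecture described above (frozen pretrained CNN features feeding a trainable bidirectional temporal module) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the tiny convolutional "backbone" merely stands in for whichever pretrained audio CNN is frozen, and the layer sizes are placeholder choices.

```python
import torch
import torch.nn as nn

class TriggerSED(nn.Module):
    """Sketch of a hybrid SED model: frozen CNN backbone + trainable BiGRU head.

    The backbone here is a stub standing in for a pretrained audio CNN;
    its weights are frozen so that, as in the paper's setup, only the
    temporal module and classification head are trained.
    """
    def __init__(self, n_mels=64, feat_dim=128, hidden=64, n_classes=3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((None, 1)),   # pool away frequency, keep time
        )
        for p in self.backbone.parameters():
            p.requires_grad = False            # frozen feature extractor
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, mel):                    # mel: (batch, 1, time, n_mels)
        f = self.backbone(mel).squeeze(-1)     # (batch, feat_dim, time)
        f = f.transpose(1, 2)                  # (batch, time, feat_dim)
        h, _ = self.rnn(f)                     # (batch, time, 2*hidden)
        return torch.sigmoid(self.head(h))     # per-frame trigger probabilities
```

A BiLSTM variant would only swap `nn.GRU` for `nn.LSTM`; the frame-wise sigmoid output is what allows onset/offset localization of trigger segments in continuous audio.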
📝 Abstract
Misophonia is a disorder characterized by a decreased tolerance to specific everyday sounds (trigger sounds) that can evoke intense negative emotional responses such as anger, panic, or anxiety. These reactions can substantially impair daily functioning and quality of life. Assistive technologies that selectively detect trigger sounds could help reduce distress and improve well-being. In this study, we investigate sound event detection (SED) to localize intervals of trigger sounds in continuous environmental audio as a foundational step toward such assistive support. Motivated by the scarcity of real-world misophonia data, we generate synthetic soundscapes tailored to misophonia trigger sound detection using audio synthesis techniques. We then perform trigger sound detection using hybrid CNN-based models, which combine feature extraction by a frozen pre-trained CNN backbone with a trainable temporal module, such as gated recurrent units (GRUs), long short-term memory (LSTM) networks, echo state networks (ESNs), and their bidirectional variants. Detection performance is evaluated using common SED metrics, including the Polyphonic Sound Detection Score 1 (PSDS1). On the multi-class trigger SED task, bidirectional temporal modeling consistently improves detection performance, with the Bidirectional GRU (BiGRU) achieving the best overall accuracy. Notably, the Bidirectional ESN (BiESN) attains competitive performance while requiring orders of magnitude fewer trainable parameters, since only its readout is optimized. We further simulate user personalization via a few-shot "eating sound" detection task with at most five support clips, in which BiGRU and BiESN are compared. In this strict adaptation setting, BiESN shows robust and stable performance, suggesting that lightweight temporal modules are promising for personalized misophonia trigger SED.
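Why the BiESN needs orders of magnitude fewer trainable parameters can be seen in a minimal NumPy sketch: the reservoir weights stay fixed and random, and only a linear readout over concatenated forward/backward reservoir states is fit (here in closed form with ridge regression). The reservoir size, leak rate, and ridge penalty below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def esn_states(x, w_in, w_res, leak=0.3):
    """Run a leaky echo state network over a feature sequence.

    x: (T, d) input features; w_in: (n, d) and w_res: (n, n) are fixed
    random weights that are never trained.
    """
    n = w_res.shape[0]
    h = np.zeros(n)
    states = np.empty((len(x), n))
    for t, xt in enumerate(x):
        h = (1 - leak) * h + leak * np.tanh(w_in @ xt + w_res @ h)
        states[t] = h
    return states

def biesn_readout(x, y, w_in, w_res, ridge=1e-2):
    """Fit the only trained parameters: a ridge readout on bidirectional states."""
    fwd = esn_states(x, w_in, w_res)
    bwd = esn_states(x[::-1], w_in, w_res)[::-1]   # backward pass, re-aligned
    s = np.hstack([fwd, bwd])                      # (T, 2n) state features
    w_out = np.linalg.solve(s.T @ s + ridge * np.eye(s.shape[1]), s.T @ y)
    return w_out, s

rng = np.random.default_rng(0)
d, n, T = 8, 32, 100
w_in = rng.normal(scale=0.5, size=(n, d))
w_res = rng.normal(size=(n, n))
w_res *= 0.9 / np.abs(np.linalg.eigvals(w_res)).max()   # spectral radius < 1
x = rng.normal(size=(T, d))
y = (x[:, 0] > 0).astype(float)[:, None]                # toy per-frame labels
w_out, s = biesn_readout(x, y, w_in, w_res)
frame_scores = s @ w_out                                # (T, 1) detections
```

The closed-form readout fit is what makes few-shot personalization attractive: adapting to a new user's trigger sounds touches only `w_out` (here 2n parameters per class), with no backpropagation through the reservoir.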