🤖 AI Summary
This study addresses a limitation of existing speech enhancement methods: they optimize only acoustic metrics and neglect the cognitive bottleneck that informational masking induces in multi-talker environments. The work proposes a cognitively inspired paradigm that separates the cognitive cost of informational masking from the physical cost of energetic masking within a deep neural network framework. It simulates the RAMPHO episodic buffer of the Ease of Language Understanding (ELU) model in silico, using frame-level phonetic entropy derived from wav2vec 2.0, and runs this simulation across an SNR sweep to contrast a semantically intact distractor with a phase-decorrelated one (the Concentration Shield). The findings reveal a Pareto trade-off: destroying the distractor's semantic content relieves informational masking at high SNRs but degrades temporal glimpsing cues at low SNRs, which argues for joint cognitive-acoustic optimization in speech enhancement systems.
📝 Abstract
The fundamental challenge of listening in multi-talker environments is a cognitive bottleneck, which the Ease of Language Understanding (ELU) model defines as a failure within the RAMPHO episodic buffer. Current deep neural networks for speech enhancement optimize purely for physical acoustics and do not account for the cognitive penalty of informational masking. Here, we present an in silico simulation of the RAMPHO buffer using the frame-by-frame phonetic entropy of a self-supervised acoustic model (wav2vec 2.0). By contrasting a semantically intact distractor with a phase-decorrelated distractor (the Concentration Shield) across a signal-to-noise ratio (SNR) sweep, we dissociate the cognitive penalty of informational distraction from the physical penalty of energetic decay. The simulation reveals a cognitive-acoustic Pareto optimization problem: destroying a distractor's semantic payload provides a release from informational masking at high SNRs, but degrades temporal glimpsing cues at low SNRs.
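The three ingredients named in the abstract (frame-level phonetic entropy, mixing at a controlled SNR, and phase decorrelation of the distractor) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `frame_entropy`, `mix_at_snr`, and `phase_decorrelate` are hypothetical, the logits are assumed to come from a phonetic classification head on top of wav2vec 2.0 (here they are just an array), and global phase randomization is used as a stand-in because the abstract does not specify how the Concentration Shield decorrelates phase.

```python
import numpy as np


def frame_entropy(logits):
    """Shannon entropy (in nats) of the softmax over phonetic classes, per frame.

    logits: array of shape (frames, classes). Assumed to be the output of a
    phonetic head on a wav2vec 2.0 model; any array of scores works here.
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilise the softmax
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)


def mix_at_snr(target, distractor, snr_db):
    """Scale the distractor so the target-to-distractor power ratio is snr_db."""
    p_target = np.mean(target ** 2)
    p_distractor = np.mean(distractor ** 2)
    gain = np.sqrt(p_target / (p_distractor * 10 ** (snr_db / 10)))
    return target + gain * distractor


def phase_decorrelate(x, rng):
    """Keep the magnitude spectrum of x but replace its phases with random ones.

    Assumption: a simple whole-signal phase randomisation, used as a
    placeholder for the Concentration Shield described in the abstract.
    """
    spec = np.fft.rfft(x)
    phase = rng.uniform(-np.pi, np.pi, size=spec.shape)
    phase[0] = 0.0  # the DC bin must stay real
    if x.size % 2 == 0:
        phase[-1] = 0.0  # the Nyquist bin must stay real for even lengths
    return np.fft.irfft(np.abs(spec) * np.exp(1j * phase), n=x.size)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal(16000)      # placeholder for target speech
    distractor = rng.standard_normal(16000)  # placeholder for a competing talker
    shielded = phase_decorrelate(distractor, rng)
    for snr_db in (-10, -5, 0, 5, 10):       # hypothetical sweep values
        intact_mix = mix_at_snr(target, distractor, snr_db)
        shielded_mix = mix_at_snr(target, shielded, snr_db)
        # In a real experiment each mixture would be passed through the
        # acoustic model and frame_entropy applied to the resulting logits.
        print(snr_db, intact_mix.shape, shielded_mix.shape)
```

Under these assumptions, comparing the mean of `frame_entropy` between the intact and phase-decorrelated mixtures at each SNR would give the kind of contrast the abstract describes; the placeholder noise signals above only exercise the plumbing and carry no phonetic content.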