🤖 AI Summary
Existing speech quality assessment methods assume complete audio signals, and their performance degrades markedly in streaming or prefix-constrained scenarios. This work proposes ANCHOR, an extension of ARECHO that treats incremental assessment as a multi-resolution autoregressive task and jointly predicts chunk-level and utterance-level quality with a single decoder. Its core components are dual-resolution tokens, a resolution-aware hierarchical architecture, and hierarchical supervision, which together enable coarse-to-fine refinement and show how perceived speech quality accumulates over time. With only a 2-second prefix as input, the method reduces PLCMOS prediction error by 48%. Convergence analysis indicates an effective perceptual context horizon of 4–6 seconds, and the results point to substantially improved robustness under partial-input conditions.
📝 Abstract
While speech quality is typically assessed on complete utterances, streaming and generative systems require incremental estimation from partial audio. Existing predictors assume full context and degrade on prefix-constrained inputs. Extending ARECHO, we propose ANCHOR, which reformulates incremental assessment as a multi-resolution autoregressive task. ANCHOR models chunk- and utterance-level quality within a single decoder, using dual-resolution tokens and a resolution-aware hierarchy for coarse-to-fine refinement. Experiments show substantial robustness under partial input, including a 48% PLCMOS error reduction on 2-second prefixes. Convergence analysis reveals an effective perceptual context horizon of 4–6 s. A stress test further isolates structured extrapolation biases under localized corruption. The results demonstrate that hierarchical supervision improves incremental prediction and elucidates how perceptual quality accumulates over time.
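To make the prefix-constrained setting concrete, the following is a minimal sketch of how such an evaluation could be scored. It is not the authors' code. The abstract does not state which error metric underlies the 48% figure or how the 4–6 s horizon is computed, so the use of mean absolute error, the tolerance-based horizon rule, and all function names here are assumptions chosen for illustration.

```python
from typing import Dict, Optional, Sequence


def take_prefix(samples: Sequence[float], sample_rate: int, seconds: float) -> Sequence[float]:
    """Return the first `seconds` of audio (the whole signal if it is shorter)."""
    return samples[: int(round(seconds * sample_rate))]


def mean_abs_error(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean absolute error; assumed metric, the source does not name one."""
    if len(preds) != len(targets) or not targets:
        raise ValueError("preds and targets must be non-empty and equal in length")
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


def relative_error_reduction(baseline_err: float, proposed_err: float) -> float:
    """Fraction by which the proposed error is lower than the baseline error."""
    return (baseline_err - proposed_err) / baseline_err


def convergence_horizon(
    scores_by_prefix: Dict[float, float], full_score: float, tol: float
) -> Optional[float]:
    """Shortest prefix length (seconds) from which every longer prefix
    stays within `tol` of the full-utterance score; None if none does.
    This rule is a hypothetical definition, not the paper's."""
    horizon = None
    # Walk from the longest prefix down and stop at the first violation.
    for length in sorted(scores_by_prefix, reverse=True):
        if abs(scores_by_prefix[length] - full_score) <= tol:
            horizon = length
        else:
            break
    return horizon
```

Under these assumptions, a predictor would be run on `take_prefix(audio, sr, 2.0)` for each test utterance, its error compared against a full-context baseline with `relative_error_reduction`, and its predictions at increasing prefix lengths passed to `convergence_horizon` to estimate where the score stabilizes.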