AI Summary
Automatic speech recognition (ASR) systems are usually evaluated against reference transcriptions, and effective reference-free methods for assessing recognition hypotheses are lacking. This work proposes READ, a metric that makes acoustic consistency the core principle of reference-free ASR hypothesis evaluation. READ uses a pretrained autoregressive text-to-speech (TTS) model to compute the conditional likelihood of speech tokens given a hypothesized transcript, which quantifies fine-grained mismatches between the audio and the text. Without any additional training, the method can be used for hypothesis refinement: it correlates with specific recognition errors and achieves up to a 20% relative reduction in error rate, with particularly strong gains under noisy conditions.
Abstract
Automatic speech recognition (ASR) systems commonly rely on reference transcriptions for evaluation, while reference-free approaches often depend on internal confidence estimation or auxiliary language models. We propose READ (Reference-free Hypothesis Evaluation with Acoustic Discrepancy), a novel metric that evaluates ASR hypotheses directly from the speech signal and thus emphasizes their acoustic grounding. READ uses a pretrained autoregressive text-to-speech (TTS) model to compute the conditional likelihood of speech tokens given a text hypothesis, which measures the fine-grained acoustic discrepancy between speech and text. Without additional training, READ can be applied to hypothesis refinement. Experiments show that READ correlates with specific recognition errors and improves ASR outputs, achieving up to 20% relative error rate reduction, with particularly strong gains under noisy conditions.
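The scoring idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hypothetical callable, `token_logprobs`, that returns the log-probability of each discrete speech token given the preceding tokens and a text hypothesis, as a pretrained autoregressive TTS model could provide. The function names, the use of a length-normalized (mean) negative log-likelihood as the discrepancy, and the toy stand-in model are all assumptions made for illustration; the abstract does not specify these details.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface (an assumption, not a real library API):
# returns log p(token_t | tokens_<t, text) for every speech token t.
TokenLogProbFn = Callable[[str, Sequence[int]], Sequence[float]]


def acoustic_discrepancy(
    text: str, speech_tokens: Sequence[int], token_logprobs: TokenLogProbFn
) -> float:
    """Mean negative log-likelihood of the speech tokens given the text.

    Lower values mean the hypothesis is more consistent with the audio.
    Averaging over tokens is an assumption made for this sketch.
    """
    logps = token_logprobs(text, speech_tokens)
    if not logps or len(logps) != len(speech_tokens):
        raise ValueError("expected one log-probability per speech token")
    return -sum(logps) / len(logps)


def rerank(
    hypotheses: Sequence[str],
    speech_tokens: Sequence[int],
    token_logprobs: TokenLogProbFn,
) -> str:
    """Pick the hypothesis with the lowest acoustic discrepancy."""
    return min(
        hypotheses,
        key=lambda h: acoustic_discrepancy(h, speech_tokens, token_logprobs),
    )


def toy_expected_tokens(text: str) -> list[int]:
    """Toy text-to-token mapping: one pseudo speech token per letter."""
    return [ord(ch) % 8 for ch in text if ch != " "]


def toy_token_logprobs(text: str, speech_tokens: Sequence[int]) -> list[float]:
    """Toy stand-in for a TTS model (not a real model or a normalized distribution).

    A token scores high when it matches the token the text "predicts"
    at that position, and low otherwise.
    """
    expected = toy_expected_tokens(text)
    logps = []
    for t, tok in enumerate(speech_tokens):
        match = t < len(expected) and expected[t] == tok
        logps.append(math.log(0.9 if match else 0.1))
    return logps


# Pseudo "audio" for the utterance "the cat", plus an N-best list to refine.
speech = toy_expected_tokens("the cat")
n_best = ["the hat", "the cat", "a cat"]
best = rerank(n_best, speech, toy_token_logprobs)
```

In this toy setup the hypothesis whose predicted tokens line up with the observed speech tokens receives the lowest discrepancy, so it is the one selected from the N-best list. With a real TTS model, `toy_token_logprobs` would be replaced by a forward pass that scores the utterance's speech tokens conditioned on each candidate transcript.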