Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy

📅 2026-06-03

🤖 AI Summary

Automatic speech recognition (ASR) systems are typically evaluated against reference transcriptions, and existing reference-free alternatives depend on internal confidence estimates or auxiliary language models. This work proposes READ, a reference-free metric built on acoustic consistency: a hypothesis is judged by how well it matches the speech signal itself. READ uses a pretrained autoregressive text-to-speech (TTS) model to compute the conditional likelihood of speech tokens given a hypothesized transcript, which quantifies fine-grained mismatches between audio and text. Without any additional training, READ can be used to refine hypotheses. It correlates with specific recognition errors and achieves up to a 20% relative error rate reduction, with particularly strong gains under noisy conditions.

📝 Abstract

Automatic speech recognition (ASR) systems commonly rely on reference transcriptions for evaluation, while reference-free approaches often depend on internal confidence estimation or auxiliary language models. We propose READ (Reference-free Hypothesis Evaluation with Acoustic Discrepancy), a novel metric that evaluates ASR hypotheses directly from the speech signal. READ emphasizes the acoustic grounding of hypotheses: it uses a pretrained autoregressive TTS model to compute the conditional likelihood of speech tokens given a text hypothesis, which measures fine-grained acoustic discrepancy between speech and text. Without additional training, READ can be applied for hypothesis refinement. Experiments show that READ correlates with specific recognition errors and improves ASR outputs, achieving up to 20% relative error rate reduction, with particularly strong gains under noisy conditions.
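
The scoring idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `StepProbs` interface, the length normalization, the function names, and the toy model are all assumptions made here for illustration. The sketch assumes a teacher-forced model that returns one probability distribution over speech tokens per step, and treats the length-normalized negative log-likelihood as the discrepancy score.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption, not from the paper): given a text
# hypothesis and the observed speech tokens, return one probability
# distribution over the speech-token vocabulary per step, as a teacher-forced
# autoregressive TTS model would.
StepProbs = Callable[[str, Sequence[int]], List[List[float]]]


def acoustic_discrepancy(text: str, speech_tokens: Sequence[int], model: StepProbs) -> float:
    """Length-normalized negative log-likelihood of speech tokens given text.

    The normalization by sequence length is an assumption of this sketch.
    Lower values mean the text explains the audio better.
    """
    if not speech_tokens:
        raise ValueError("speech_tokens must be non-empty")
    step_probs = model(text, speech_tokens)
    nll = 0.0
    for probs, token in zip(step_probs, speech_tokens):
        nll -= math.log(probs[token])
    return nll / len(speech_tokens)


def select_hypothesis(hypotheses: Sequence[str], speech_tokens: Sequence[int], model: StepProbs) -> str:
    """Pick the hypothesis with the lowest acoustic discrepancy."""
    return min(hypotheses, key=lambda h: acoustic_discrepancy(h, speech_tokens, model))


def toy_model(text: str, speech_tokens: Sequence[int], vocab_size: int = 4) -> List[List[float]]:
    """Toy stand-in, NOT a real TTS model.

    At step t it puts probability 0.7 on the token ord(char) % vocab_size for
    the character aligned to step t, and 0.1 on each other token.
    """
    if not text:
        return [[1.0 / vocab_size] * vocab_size for _ in speech_tokens]
    dists = []
    for t in range(len(speech_tokens)):
        preferred = ord(text[t % len(text)]) % vocab_size
        dists.append([0.7 if k == preferred else 0.1 for k in range(vocab_size)])
    return dists


if __name__ == "__main__":
    # Pretend these tokens were extracted from audio of the word "read".
    tokens = [ord(c) % 4 for c in "read"]
    for hyp in ["road", "read"]:
        print(hyp, round(acoustic_discrepancy(hyp, tokens, toy_model), 4))
    print("selected:", select_hypothesis(["road", "read"], tokens, toy_model))
```

With a real system, `toy_model` would be replaced by a wrapper around a pretrained TTS model and a speech tokenizer. The selection step needs no training, which matches the training-free refinement described above.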

Problem

Research questions and friction points this paper is trying to address.

reference-free evaluation

automatic speech recognition

acoustic discrepancy

hypothesis evaluation

speech-text alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

reference-free evaluation

acoustic discrepancy

automatic speech recognition