🤖 AI Summary
To address the challenge of error detection in speech recognition for blind and low-vision users, this paper proposes an auditory-feedback approach: speech recognition confidence scores are mapped in real time to the playback speed of the synthesized speech, so that lower confidence triggers slower speech and provides a discriminative, non-visual audio cue. The method combines lightweight confidence modeling, adaptive text-to-speech (TTS) rate control, and a user evaluation of the auditory interaction. Compared with uniformly slowing the speech rate, the approach improved participants' error detection rate by 12% (relative) and reduced average decision time by 11%, improving both the speed and the accuracy of auditory error detection. The core contribution is an interpretable, perceptible mapping between recognizer confidence and speech rate, introducing a new feedback paradigm for accessible voice interaction.
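The summary does not specify the exact confidence-to-rate mapping, so the sketch below is only illustrative: it assumes word-level confidence scores in [0, 1] and interpolates linearly between a slow and a normal speaking rate below a confidence threshold. The function name and all parameter values are hypothetical, not taken from the paper.

```python
def confidence_to_rate(confidence: float,
                       threshold: float = 0.8,
                       normal_rate: float = 1.0,
                       slow_rate: float = 0.6) -> float:
    """Map a recognizer confidence score in [0, 1] to a TTS playback rate.

    Words at or above `threshold` play at the normal rate; below it, the
    rate falls linearly toward `slow_rate` as confidence drops. Threshold
    and rate values are illustrative defaults, not from the paper.
    """
    if confidence >= threshold:
        return normal_rate
    # Linear interpolation: confidence 0 -> slow_rate, threshold -> normal_rate
    return slow_rate + (normal_rate - slow_rate) * (confidence / threshold)
```

A monotone mapping like this keeps the cue interpretable: the slower a word sounds, the less certain the recognizer was about it.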
📝 Abstract
Conversational systems rely heavily on speech recognition to interpret and respond to user commands and queries. Despite progress in speech recognition accuracy, errors still occur and can significantly affect the end-user utility of such systems. While visual feedback can help users detect errors, it is not always practical, especially for people who are blind or have low vision. In this study, we investigate ways to improve error detection by manipulating the audio output of the transcribed text based on the recognizer's confidence in its result. Our findings show that selectively slowing down the audio when the recognizer exhibited uncertainty led to a 12% relative increase in participants' ability to detect errors compared to uniformly slowing the audio. It also reduced by 11% the time participants took to listen to the recognition result and decide whether it contained an error.
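One plausible way to realize this selective slowing with an off-the-shelf TTS engine is per-word SSML prosody markup. The sketch below is an assumption-laden illustration, not the authors' implementation: the helper names, threshold, and rate values are all hypothetical, and it reuses the same linear mapping as the sketch above.

```python
from xml.sax.saxutils import escape

# Illustrative values, not from the paper.
THRESHOLD, NORMAL, SLOW = 0.8, 1.0, 0.6

def rate_for(conf: float) -> float:
    """Linear confidence-to-rate mapping, as in the sketch above."""
    return NORMAL if conf >= THRESHOLD else SLOW + (NORMAL - SLOW) * conf / THRESHOLD

def words_to_ssml(words):
    """Render (word, confidence) pairs as SSML, slowing only uncertain words."""
    spans = (
        f'<prosody rate="{int(rate_for(conf) * 100)}%">{escape(text)}</prosody>'
        for text, conf in words
    )
    return "<speak>" + " ".join(spans) + "</speak>"

# Example: the one low-confidence word ("lights") is rendered at 82% speed,
# while the confidently recognized words stay at 100%.
print(words_to_ssml([("turn", 0.97), ("on", 0.95), ("the", 0.99), ("lights", 0.45)]))
```

Keeping high-confidence words at full speed is what distinguishes this design from the uniform-slowdown baseline: the listener's attention is drawn only to the words most likely to be wrong.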