AI Summary
Spoken Grammar Error Correction (SGEC) faces challenges including speech disfluencies, ASR transcription errors, and scarcity of annotated training data; conventional cascaded approaches suffer from error propagation. This paper proposes an end-to-end SGEC framework built upon Whisper, unifying speech and text modalities within a single model. We introduce three key innovations: (1) reference alignment to mitigate misalignment between ASR output and target corrections, (2) edit confidence estimation to filter unreliable predictions, and (3) context-aware pseudo-labeling to generate high-quality synthetic training data. Leveraging automated pseudo-labeling, we scale the training corpus from 77 hours to over 2,500 hours. Evaluated on Linguaskill and Speak & Improve benchmarks, our method achieves significant improvements in grammatical correction accuracy and pedagogically appropriate feedback quality, enabling more robust and precise real-time spoken feedback for second-language learners.
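The feedback step described above rests on extracting edits between a transcription and its grammatically corrected counterpart. The paper does not give implementation details; the following is a minimal sketch of word-level edit extraction using Python's standard `difflib` alignment, with the function name and token-level interface being illustrative assumptions rather than the authors' actual method.

```python
import difflib

def extract_edits(source_tokens, corrected_tokens):
    """Derive word-level edits (candidate feedback) by aligning a
    transcription with a grammatically corrected hypothesis.
    Hypothetical helper; not from the paper."""
    matcher = difflib.SequenceMatcher(a=source_tokens, b=corrected_tokens)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            # Each edit records the operation and the affected spans.
            edits.append((op, source_tokens[i1:i2], corrected_tokens[j1:j2]))
    return edits

src = "he go to school yesterday".split()
hyp = "he went to school yesterday".split()
print(extract_edits(src, hyp))  # [('replace', ['go'], ['went'])]
```

In a real SGEC system the alignment would also need to handle disfluency removals and ASR insertions, which is precisely where the paper's reference alignment and confidence estimation come in.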
Abstract
Grammatical Error Correction (GEC) and feedback play a vital role in supporting second language (L2) learners, educators, and examiners. While written GEC is well-established, spoken GEC (SGEC), aiming to provide feedback based on learners' speech, poses additional challenges due to disfluencies, transcription errors, and the lack of structured input. SGEC systems typically follow a cascaded pipeline consisting of Automatic Speech Recognition (ASR), disfluency detection, and GEC, making them vulnerable to error propagation across modules. This work examines an End-to-End (E2E) framework for SGEC and feedback generation, highlighting challenges and possible solutions when developing these systems. Cascaded, partial-cascaded, and E2E architectures are compared, all built on the Whisper foundation model. A challenge for E2E systems is the scarcity of GEC-labeled spoken data. To address this, an automatic pseudo-labeling framework is examined, increasing the training data from 77 to over 2500 hours. To improve the accuracy of the SGEC system, additional contextual information, exploiting the ASR output, is investigated. Providing candidates with feedback on their mistakes is an essential step in improving performance. In E2E systems, the SGEC output must be compared with an estimate of the fluent transcription to obtain the feedback. To improve the precision of this feedback, a novel reference alignment process is proposed that aims to remove hypothesised edits that result from fluent transcription errors. Finally, these approaches are combined with an edit confidence estimation approach to exclude low-confidence edits. Experiments on the in-house Linguaskill (LNG) corpora and the publicly available Speak & Improve (S&I) corpus show that the proposed approaches significantly boost E2E SGEC performance.
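The final filtering stage excludes hypothesised edits whose confidence falls below a threshold, on the assumption that such edits often stem from transcription errors rather than genuine learner mistakes. A minimal sketch of this idea follows; the edit representation, field names, and threshold value are illustrative assumptions, not details from the paper.

```python
def filter_edits(edits, threshold=0.5):
    """Keep only edits whose estimated confidence meets the threshold.
    Low-confidence edits are discarded as likely artefacts of ASR or
    fluent-transcription errors. Hypothetical interface; not from the paper."""
    return [e for e in edits if e["confidence"] >= threshold]

# Hypothetical edits with confidence scores attached by the model.
edits = [
    {"op": "replace", "src": "go", "tgt": "went", "confidence": 0.92},
    {"op": "delete",  "src": "the", "tgt": "",    "confidence": 0.31},
]
print(filter_edits(edits))  # keeps only the high-confidence replacement
```

In practice the threshold trades precision against recall of the feedback: raising it yields fewer but more reliable corrections, which the paper argues is pedagogically preferable for L2 learners.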