End-to-End Spoken Grammatical Error Correction

📅 2025-06-23
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Spoken Grammatical Error Correction (SGEC) faces challenges including speech disfluencies, ASR transcription errors, and scarcity of annotated training data; conventional cascaded approaches suffer from error propagation. This paper proposes an end-to-end SGEC framework built upon Whisper, unifying speech and text modalities within a single model. It introduces three key components: (1) reference alignment to mitigate misalignment between ASR output and target corrections, (2) edit confidence estimation to filter unreliable predictions, and (3) context-aware pseudo-labeling to generate high-quality synthetic training data. Leveraging automated pseudo-labeling, the training corpus is scaled from 77 hours to over 2,500 hours. Evaluated on the Linguaskill and Speak & Improve benchmarks, the method achieves significant improvements in grammatical correction accuracy and in the precision of learner feedback, enabling more robust spoken feedback for second-language learners.
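The edit confidence filtering step mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Edit` structure, field names, and threshold value are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    source: str        # span in the (disfluency-removed) transcription
    correction: str    # proposed grammatical correction
    confidence: float  # model-estimated probability that the edit is valid


def filter_edits(edits, threshold=0.5):
    """Keep only edits whose estimated confidence clears the threshold.

    The threshold is illustrative; in practice it trades feedback
    precision against recall and would be tuned on held-out data.
    """
    return [e for e in edits if e.confidence >= threshold]


edits = [
    Edit("go to school yesterday", "went to school yesterday", 0.92),
    Edit("a advice", "advice", 0.34),
]
print(filter_edits(edits))  # only the high-confidence edit survives
```

Filtering low-confidence edits raises the precision of the feedback shown to learners, at the cost of occasionally suppressing a valid correction.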

πŸ“ Abstract
Grammatical Error Correction (GEC) and feedback play a vital role in supporting second language (L2) learners, educators, and examiners. While written GEC is well-established, spoken GEC (SGEC), aiming to provide feedback based on learners' speech, poses additional challenges due to disfluencies, transcription errors, and the lack of structured input. SGEC systems typically follow a cascaded pipeline consisting of Automatic Speech Recognition (ASR), disfluency detection, and GEC, making them vulnerable to error propagation across modules. This work examines an End-to-End (E2E) framework for SGEC and feedback generation, highlighting challenges and possible solutions when developing these systems. Cascaded, partial-cascaded and E2E architectures are compared, all built on the Whisper foundation model. A challenge for E2E systems is the scarcity of GEC labeled spoken data. To address this, an automatic pseudo-labeling framework is examined, increasing the training data from 77 to over 2500 hours. To improve the accuracy of the SGEC system, additional contextual information, exploiting the ASR output, is investigated. Giving candidates feedback on their mistakes is an essential step in improving performance. In E2E systems the SGEC output must be compared with an estimate of the fluent transcription to obtain the feedback. To improve the precision of this feedback, a novel reference alignment process is proposed that aims to remove hypothesised edits that result from fluent transcription errors. Finally, these approaches are combined with an edit confidence estimation approach to exclude low-confidence edits. Experiments on the in-house Linguaskill (LNG) corpora and the publicly available Speak & Improve (S&I) corpus show that the proposed approaches significantly boost E2E SGEC performance.
Problem

Research questions and friction points this paper is trying to address.

Develops End-to-End spoken grammatical error correction for L2 learners
Addresses data scarcity via pseudo-labeling for training SGEC systems
Improves feedback precision using reference alignment and confidence estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-End SGEC framework using Whisper model
Automatic pseudo-labeling for data augmentation
Reference alignment for precise feedback generation
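The pseudo-labeling idea behind the data scaling (77 to over 2500 hours) can be sketched as a simple pipeline: transcribe unlabeled learner speech, then correct the transcription with a GEC model to produce (audio, pseudo-target) training pairs. The function below is a conceptual sketch with stand-in components, not the paper's pipeline; the toy `transcribe` and `correct` callables are assumptions for illustration.

```python
def pseudo_label(transcribe, correct, audio_paths):
    """Generate pseudo GEC targets for unlabeled speech.

    transcribe: audio path -> ASR hypothesis (assumed component)
    correct:    text -> grammatically corrected text (assumed component)
    Returns (audio path, pseudo target) pairs usable as E2E training data.
    """
    pairs = []
    for path in audio_paths:
        hyp = transcribe(path)   # ASR hypothesis for the utterance
        target = correct(hyp)    # pseudo grammatical correction
        pairs.append((path, target))
    return pairs


# Toy stand-ins for the ASR and GEC components (assumptions):
pairs = pseudo_label(
    transcribe=lambda p: "she go home",
    correct=lambda t: t.replace("she go", "she goes"),
    audio_paths=["utt1.wav"],
)
print(pairs)  # → [('utt1.wav', 'she goes home')]
```

The quality of such pairs depends on both components, which is why the paper pairs pseudo-labeling with context-aware generation and confidence-based filtering.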