🤖 AI Summary
To address the intra- and inter-rater variability inherent in human scoring of second-language (L2) English proficiency, this paper applies deep learning to both the speech signal and its transcription. Spoken proficiency classification is explored with diverse architectures, including a 2D CNN, a frequency-based CNN, ResNet, and a pretrained wav2vec 2.0 model; text-based proficiency assessment is addressed by fine-tuning BERT under resource constraints; and spontaneous dialogue assessment, which involves long-form audio and speaker interactions, is handled through separate applications of wav2vec 2.0 and BERT. Experiments on the EFCamDat and ANGLISH datasets and a private dataset highlight the potential of deep learning, especially the pretrained wav2vec 2.0 model, for robust automated L2 proficiency evaluation.
📝 Abstract
Second-language (L2) English proficiency is usually evaluated perceptually by English teachers or expert evaluators, with inherent intra- and inter-rater variability. This paper explores deep learning techniques for comprehensive L2 proficiency assessment, addressing both the speech signal and its corresponding transcription. We analyze spoken proficiency classification using diverse architectures, including a 2D CNN, a frequency-based CNN, ResNet, and a pretrained wav2vec 2.0 model. Additionally, we examine text-based proficiency assessment by fine-tuning a BERT language model within resource constraints. Finally, we tackle the complex task of spontaneous dialogue assessment, managing long-form audio and speaker interactions through separate applications of the wav2vec 2.0 and BERT models. Results from experiments on the EFCamDat and ANGLISH datasets and a private dataset highlight the potential of deep learning, especially the pretrained wav2vec 2.0 model, for robust automated L2 proficiency evaluation.
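As a minimal sketch of the speech-based classification pipeline described above, frame-level acoustic embeddings (such as wav2vec 2.0 hidden states) can be mean-pooled over time and passed through a linear softmax head over proficiency classes. All shapes, the six-class label set, and the randomly initialized head are illustrative assumptions, not the paper's actual trained models:

```python
import numpy as np

def classify_proficiency(frame_embeddings, weights, bias):
    """Mean-pool frame-level acoustic embeddings (e.g. wav2vec 2.0
    hidden states) and apply a linear softmax head over proficiency
    classes. Shapes and parameters are illustrative, not the paper's."""
    pooled = frame_embeddings.mean(axis=0)      # (dim,) utterance vector
    logits = pooled @ weights + bias            # (num_classes,)
    exp = np.exp(logits - logits.max())         # numerically stable softmax
    return exp / exp.sum()

# Hypothetical example: 200 frames of 768-dim embeddings,
# 6 CEFR-style classes (A1-C2), randomly initialized head.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 768))
W = rng.normal(scale=0.01, size=(768, 6))
b = np.zeros(6)
probs = classify_proficiency(emb, W, b)
print(probs.shape)
```

In practice the pooled representation would come from a pretrained encoder and the head would be trained on labeled learner speech; the same pool-then-classify pattern applies to BERT token embeddings for the text modality.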