🤖 AI Summary
To address the out-of-vocabulary (OOV) problem in Korean NLP, this paper proposes KOPL, a framework that exploits the highly regular phoneme–grapheme correspondence of Korean to jointly model phoneme-level and word-level representations. KOPL segments words into phonemes and learns phoneme-level embeddings that integrate into both static and contextualized Korean embedding models, enabling plug-and-play deployment. Its core contribution is the first systematic use of pronunciation information to enhance Korean word representations, capturing both orthographic and phonological cues. Evaluated on multiple Korean downstream tasks, including part-of-speech tagging, named entity recognition, and dependency parsing, KOPL improves over prior state-of-the-art methods by an average of 1.9%, establishing pronunciation as a practical, scalable signal for OOV word modeling in Korean.
📝 Abstract
In this study, we introduce KOPL, a novel framework for handling Korean OOV words with Phoneme representation Learning. Our work builds on a linguistic property of Korean as a phonemic script: the close correspondence between phonemes and letters. KOPL combines phoneme and word representations so that representations of Korean OOV words capture both the textual and the phonemic information of a word. We empirically demonstrate that KOPL significantly improves performance on Korean Natural Language Processing (NLP) tasks, while being readily integrated into existing static and contextual Korean embedding models in a plug-and-play manner. Notably, we show that KOPL outperforms the state-of-the-art model by an average of 1.9%. Our code is available at https://github.com/jej127/KOPL.git.
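As a rough illustration of how phoneme-level and word-level representations might be combined in a plug-and-play fashion, below is a minimal PyTorch sketch. It is not the authors' implementation: the orthographic jamo decomposition stands in for the phoneme sequence (the paper derives phoneme information from pronunciation), and the `PhonemeWordEmbedding` module, the gated fusion, and the embedding dimension are assumptions made only for this example.

```python
# Hypothetical sketch: fuse a word embedding with phoneme-level (here, jamo-level)
# embeddings so that OOV words still receive a usable representation.
import torch
import torch.nn as nn

# Standard Unicode jamo tables for decomposing composed Hangul syllables.
CHO = ["ㄱ","ㄲ","ㄴ","ㄷ","ㄸ","ㄹ","ㅁ","ㅂ","ㅃ","ㅅ","ㅆ","ㅇ","ㅈ","ㅉ","ㅊ","ㅋ","ㅌ","ㅍ","ㅎ"]
JUNG = ["ㅏ","ㅐ","ㅑ","ㅒ","ㅓ","ㅔ","ㅕ","ㅖ","ㅗ","ㅘ","ㅙ","ㅚ","ㅛ","ㅜ","ㅝ","ㅞ","ㅟ","ㅠ","ㅡ","ㅢ","ㅣ"]
JONG = [""] + ["ㄱ","ㄲ","ㄳ","ㄴ","ㄵ","ㄶ","ㄷ","ㄹ","ㄺ","ㄻ","ㄼ","ㄽ","ㄾ","ㄿ","ㅀ","ㅁ","ㅂ","ㅄ","ㅅ","ㅆ","ㅇ","ㅈ","ㅊ","ㅋ","ㅌ","ㅍ","ㅎ"]

def to_jamo(word: str) -> list[str]:
    """Split composed Hangul syllables into their constituent jamo (a stand-in
    for the phoneme sequence; the paper uses pronunciation-based phonemes)."""
    jamo = []
    for ch in word:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:              # composed Hangul syllable block
            jamo.append(CHO[code // 588])
            jamo.append(JUNG[(code % 588) // 28])
            if code % 28:
                jamo.append(JONG[code % 28])
        else:                               # non-Hangul characters pass through
            jamo.append(ch)
    return jamo

class PhonemeWordEmbedding(nn.Module):
    """Assumed fusion module: gate between word-level and phoneme-level views."""
    def __init__(self, word_vocab: dict, jamo_vocab: dict, dim: int = 300):
        super().__init__()
        self.word_vocab, self.jamo_vocab = word_vocab, jamo_vocab
        self.word_emb = nn.Embedding(len(word_vocab), dim)
        self.jamo_emb = nn.Embedding(len(jamo_vocab), dim)
        self.gate = nn.Linear(2 * dim, dim)  # learned mixing gate (an assumption)

    def forward(self, word: str) -> torch.Tensor:
        # Phoneme view: average jamo embeddings; computable even for OOV words.
        jamo_ids = [self.jamo_vocab[j] for j in to_jamo(word) if j in self.jamo_vocab]
        p = (self.jamo_emb(torch.tensor(jamo_ids)).mean(dim=0)
             if jamo_ids else torch.zeros(self.word_emb.embedding_dim))
        # Word view: fall back to the phoneme view when the word is OOV.
        if word in self.word_vocab:
            w = self.word_emb(torch.tensor(self.word_vocab[word]))
        else:
            w = p
        g = torch.sigmoid(self.gate(torch.cat([w, p])))
        return g * w + (1 - g) * p           # fused representation
```

The point of the sketch is only that the phoneme-level view can always be computed from a word's surface form, so even out-of-vocabulary words receive a representation that reflects their phonology; the actual KOPL architecture and training procedure are described in the paper and repository linked above.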