KOMBO: Korean Character Representations Based on the Combination Rules of Subcharacters

📅 2026-04-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a limitation of existing Korean pre-trained language models (PLMs): they overlook the systematic compositional principles of Hangeul recorded in Hunminjeongeum, and so do not fully capture the script's linguistic characteristics. The authors propose KOMBO, a framework that, according to the paper, is the first to build Hangeul's jamo combination rules directly into character-level representations instead of relying on conventional subword tokenization. Evaluated on five Korean natural language understanding tasks, KOMBO outperforms the state-of-the-art Korean PLM by an average of 2.11%, which the authors take as evidence that subcharacter-based representations grounded in Hangeul's orthographic principles suit Korean better than the typical subword-based approach.

📝 Abstract

The Korean writing system, Hangeul, has a unique character representation that rigidly follows the invention principles recorded in Hunminjeongeum, a book published in 1446 that describes the principles of invention and usage of Hangeul, devised by King Sejong [Hunminjeongeum_Guide]. However, existing pre-trained language models (PLMs) for Korean have overlooked these principles. In this paper, we introduce a novel framework for Korean PLMs called KOMBO, which is the first to apply the invention principles of Hangeul to character representation. KOMBO performs strongly across diverse NLP tasks. In particular, it outperforms the state-of-the-art Korean PLM by an average of 2.11% on five Korean natural language understanding tasks. Furthermore, extensive experiments demonstrate that the method is well suited to capturing the linguistic features of the Korean language. These results highlight the advantage of subcharacters over the typical subword-based approach for Korean PLMs. Our code is available at: [https://github.com/SungHo3268/KOMBO](https://github.com/SungHo3268/KOMBO).
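To make the notion of subcharacters concrete, the sketch below splits a precomposed Hangeul syllable into its initial consonant, vowel, and optional final consonant, and recombines them. It uses the standard Unicode Hangul syllable arithmetic only; it is not KOMBO's representation method, and the names `decompose` and `compose` are hypothetical helpers introduced for illustration.

```python
# Illustrative sketch: standard Unicode Hangul syllable arithmetic.
# This is NOT the KOMBO model; function names are hypothetical.

S_BASE = 0xAC00          # code point of the first precomposed syllable
N_LEAD, N_VOWEL, N_TAIL = 19, 21, 28  # 28 = 27 final consonants + "no final"

# Conjoining jamo blocks defined by Unicode.
CHOSEONG = [chr(0x1100 + i) for i in range(N_LEAD)]            # initial consonants
JUNGSEONG = [chr(0x1161 + i) for i in range(N_VOWEL)]          # vowels
JONGSEONG = [""] + [chr(0x11A8 + i) for i in range(N_TAIL - 1)]  # optional finals


def decompose(syllable: str) -> tuple[str, str, str]:
    """Split one precomposed syllable into (initial, vowel, final)."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < N_LEAD * N_VOWEL * N_TAIL:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    lead, rest = divmod(index, N_VOWEL * N_TAIL)
    vowel, tail = divmod(rest, N_TAIL)
    return CHOSEONG[lead], JUNGSEONG[vowel], JONGSEONG[tail]


def compose(lead: str, vowel: str, tail: str = "") -> str:
    """Combine jamo back into one precomposed syllable."""
    index = (CHOSEONG.index(lead) * N_VOWEL + JUNGSEONG.index(vowel)) * N_TAIL
    return chr(S_BASE + index + JONGSEONG.index(tail))


if __name__ == "__main__":
    for ch in "한글":
        parts = decompose(ch)
        print(ch, [hex(ord(p)) for p in parts if p], compose(*parts) == ch)
```

Under this scheme, 한 splits into the initial ㅎ, the vowel ㅏ, and the final ㄴ. A subword tokenizer treats the syllable (or a longer span) as an opaque unit, whereas a subcharacter approach can work from these three components.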

Problem

Research questions and friction points this paper is trying to address.

Korean language

Hangeul

character representation

pre-trained language models

subcharacter

Innovation

Methods, ideas, or system contributions that make the work stand out.

KOMBO

Hangeul

subcharacter representation