AI Summary
This work addresses the challenges of Korean short-text classification, where performance is hindered by data sparsity, limited labeled resources, and the neglect of Korean's agglutinative morphology and flexible word order in existing approaches. To overcome these limitations, the authors propose a novel heterogeneous graph neural network that integrates morphemes, part-of-speech tags, and named entities into a multi-level linguistic structure, explicitly embedding Korean-specific syntactic features into graph representations for the first time. They further introduce SemCon, a semantics-aware contrastive learning framework, to refine decision boundaries. Extensive experiments on four Korean short-text datasets show that the proposed method significantly outperforms current state-of-the-art baselines, confirming the effectiveness of linguistically informed modeling and contrastive learning for text classification in agglutinative languages.
Abstract
Short-text classification (STC) remains challenging due to the scarcity of contextual information and labeled data. Moreover, existing approaches have predominantly focused on English because most STC benchmark datasets are available only in English. Consequently, existing methods seldom incorporate the linguistic and structural characteristics of Korean, such as its agglutinative morphology and flexible word order. To address these limitations, we propose LIGRAM, a hierarchical heterogeneous graph model for Korean short-text classification. The model constructs sub-graphs at the morpheme, part-of-speech, and named-entity levels and integrates them hierarchically, compensating for the limited context of short texts while precisely capturing the grammatical and semantic dependencies inherent in Korean. In addition, we apply Semantics-aware Contrastive Learning (SemCon) to reflect semantic similarity across documents, enabling the model to establish clearer decision boundaries even in short texts where class distinctions are often ambiguous. We evaluate LIGRAM on four Korean short-text datasets, where it consistently outperforms existing baseline models. These results validate that combining language-specific graph representations with SemCon is an effective approach to short-text classification in agglutinative languages such as Korean.
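The abstract does not spell out SemCon's formulation. As a minimal sketch only, the following assumes an InfoNCE-style contrastive objective in which each document's positive is its semantically most similar neighbor under cosine similarity of document embeddings; the function name `semcon_like_loss`, the pairing rule, and the temperature value are illustrative assumptions, not the authors' actual method:

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def semcon_like_loss(z, sem, tau=0.1):
    """InfoNCE-style loss over model embeddings z, where the positive for
    each anchor is the semantically closest other document according to a
    separate semantic embedding sem (hypothetical reading of SemCon)."""
    n = z.shape[0]
    sim = cosine_sim(z, z) / tau
    np.fill_diagonal(sim, -np.inf)                 # exclude self-pairs
    # Positive index: most semantically similar *other* document (assumption).
    sem_sim = np.where(np.eye(n, dtype=bool), -np.inf, cosine_sim(sem, sem))
    pos = np.argmax(sem_sim, axis=1)
    # Log-softmax over each row, computed stably.
    logits = sim - sim.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the positive pair, averaged over anchors.
    return -log_prob[np.arange(n), pos].mean()
```

Minimizing such a loss pulls each document's representation toward its semantically similar neighbors and pushes it away from the rest of the batch, which is one plausible way to sharpen decision boundaries when surface context is sparse.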