🤖 AI Summary
This study addresses the limited understanding of how large language models (LLMs) structure their internal representations across languages, particularly the systematic differences between low-resource languages and high-resource languages such as English. Moving beyond prior work that focuses on token-level representations, the paper takes a language-structure-oriented perspective, applying representational structural analysis together with cross-lingual representation comparison and structural similarity metrics. The findings indicate that, within LLMs, low-resource languages are structurally more different from English than high- and mid-resource languages are, so structural similarity to English tracks language resource availability. Furthermore, language-specific post-training is shown to alter these internal structures while preserving inter-language relationships, which points to the role of post-training in shaping the multilingual structural geometry of LLMs.
📝 Abstract
Large language models (LLMs) excel at processing multiple languages through pre- and post-training on multilingual data, even though English dominates the training data. Prior work focusing on token representations has revealed how these LLMs process non-English text. Although these analyses have provided insightful findings, they do not capture structure, which is an inherent property of language. In this study, we explore the multilinguality of LLMs through representational structural analysis. Our findings reveal that low-resource languages are structurally more different from English than high- and mid-resource languages are, and that language-specific post-training alters their structures while preserving inter-language relationships.
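
The abstract does not say which structural measure is used. As a rough illustration of what comparing representational structure across languages can look like, the sketch below uses a generic representational-similarity-style score: it builds the pairwise cosine distances among sentence representations in each language, then correlates the two sets of distances. This is an assumption for illustration, not the paper's method. The function names and the toy data are hypothetical, and the vectors are random numbers standing in for LLM hidden states of parallel sentences.

```python
import math
import random


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_profile(reps):
    """Flattened upper triangle of the pairwise cosine-distance matrix."""
    n = len(reps)
    return [cosine_distance(reps[i], reps[j])
            for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def structural_similarity(reps_a, reps_b):
    """Correlate the distance profiles of two languages.

    Assumes row i of reps_a and row i of reps_b represent the same
    sentence (a parallel corpus). This is a hypothetical metric, not
    necessarily the one used in the paper.
    """
    if len(reps_a) != len(reps_b):
        raise ValueError("both languages need the same number of sentences")
    return pearson(distance_profile(reps_a), distance_profile(reps_b))


# Toy stand-ins for hidden states: 30 "sentences", 8 dimensions each.
rng = random.Random(0)
english = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(30)]

# Same geometry in a different basis: reverse the coordinates and flip signs.
same_structure = [[-x for x in reversed(vec)] for vec in english]

# Independent random vectors: no shared structure with `english`.
unrelated = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(30)]

score_same = structural_similarity(english, same_structure)
score_unrelated = structural_similarity(english, unrelated)
```

Because the score depends only on distances between sentences within each language, it is unaffected by a change of basis: `score_same` is close to 1 even though no individual vector matches its counterpart, while `score_unrelated` stays much lower. This is the sense in which a structural comparison differs from comparing token representations directly.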