Contextual morphologically-guided tokenization for Latin encoder models

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard tokenization methods, optimized for information-theoretic objectives (e.g., compression), neglect linguistically grounded constraints—such as morphological alignment—leading to suboptimal performance on downstream tasks for morphologically rich languages like Latin. To address this, we propose a context-aware tokenization framework that explicitly integrates morphological knowledge. Our approach is the first to incorporate fine-grained Latin lexicons and morphological analyzers into medium-scale pretraining, jointly optimizing input representations via morphology-guided segmentation and contextual encoding. Crucially, it operates without reliance on large-scale annotated data, offering a linguistically motivated alternative for low-resource language modeling. Evaluated across four diverse downstream tasks, our method achieves consistent and significant performance gains, with particularly strong out-of-domain generalization—demonstrating the substantial and empirically validated benefit of morphological priors for encoder-based representation learning.

📝 Abstract
Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretic goals like high compression and low fertility over linguistic goals like morphological alignment. Indeed, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich language that is medium-resource in terms of pretraining data but high-resource in terms of curated lexical resources -- a distinction that is often overlooked but critical in discussions of low-resource language modeling. We find that morphologically-guided tokenization improves overall performance on four downstream tasks. Performance gains are most pronounced for out-of-domain texts, highlighting our models' improved generalization ability. Our findings demonstrate the utility of linguistic resources for improving language modeling for morphologically complex languages. For low-resource languages that lack large-scale pretraining data, developing and incorporating linguistic resources can serve as a feasible alternative for improving LM performance.
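To illustrate the core idea of morphology-guided segmentation, here is a minimal sketch, not the paper's actual implementation: given a (toy, hypothetical) Latin stem lexicon and inflectional ending list, a word is split at a morpheme boundary licensed by the lexicon instead of at a frequency-driven boundary. The lexicon entries, ending inventory, and the `segment` function are all illustrative assumptions, not resources used by the authors.

```python
# Hypothetical sketch of morphology-guided segmentation for Latin.
# TOY_LEXICON and TOY_ENDINGS are invented examples, standing in for the
# curated lexical resources and morphological analyzers the paper describes.

TOY_LEXICON = {"amic": "friend", "puell": "girl", "domin": "master"}
TOY_ENDINGS = ["orum", "arum", "ibus", "is", "ae", "am", "us", "um", "o", "a", "i"]

def segment(word: str) -> list[str]:
    """Prefer a (stem, ending) split licensed by the lexicon; try longer
    endings first so e.g. 'amicorum' splits as amic + orum, not amicor + um."""
    for ending in sorted(TOY_ENDINGS, key=len, reverse=True):
        if word.endswith(ending):
            stem = word[: -len(ending)]
            if stem in TOY_LEXICON:
                # WordPiece-style continuation marker on the suffix piece
                return [stem, "##" + ending]
    # no morphological analysis found: keep the word whole (a real system
    # would fall back to the statistical subword tokenizer here)
    return [word]

print(segment("amicorum"))  # ['amic', '##orum']
print(segment("puellae"))   # ['puell', '##ae']
print(segment("roma"))      # ['roma'] -- stem not in the toy lexicon
```

A purely compression-driven tokenizer has no incentive to place boundaries at morpheme edges like these; constraining segmentation this way is what "morphological alignment" refers to in the abstract.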
Problem

Research questions and friction points this paper is trying to address.

Standard tokenization methods neglect morphological alignment in language models
Morphologically rich languages suffer from suboptimal tokenization impacting performance
Latin lacks sufficient pretraining data despite abundant linguistic resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Morphologically-guided tokenization for Latin models
Leveraging curated lexical resources for tokenization
Improves downstream task performance and out-of-domain generalization
Marisa Hudspeth -- University of Massachusetts, Amherst
Patrick J. Burns -- Institute for the Study of the Ancient World, New York University
Brendan O'Connor -- Manning College of Information & Computer Sciences, University of Massachusetts Amherst