Unveiling Factors for Enhanced POS Tagging: A Study of Low-Resource Medieval Romance Languages

📅 2025-06-21

🤖 AI Summary

This study addresses three core challenges in part-of-speech (POS) tagging of medieval Romance languages (Occitan, Spanish, French): diachronic linguistic change, orthographic variation, and severe scarcity of annotated training data. To assess how well large language models (LLMs) handle low-resource historical texts drawn from biblical, hagiographic, medical, and dietary corpora, the authors systematically evaluate fine-tuning approaches, prompt engineering, model architectures, decoding strategies, and cross-lingual transfer learning. The results expose notable limitations of LLMs in handling historical language variation and non-standardized spelling, and they identify specialized techniques that address the challenges of low-resource historical languages. The findings are relevant to digital humanities work on automated, multilingual analysis of premodern texts.

📝 Abstract

Part-of-speech (POS) tagging remains a foundational component in natural language processing pipelines, particularly critical for historical text analysis at the intersection of computational linguistics and digital humanities. Despite significant advancements in modern large language models (LLMs) for ancient languages, their application to Medieval Romance languages presents distinctive challenges stemming from diachronic linguistic evolution, spelling variations, and labeled data scarcity. This study systematically investigates the central determinants of POS tagging performance across diverse corpora of Medieval Occitan, Medieval Spanish, and Medieval French texts, spanning biblical, hagiographical, medical, and dietary domains. Through rigorous experimentation, we evaluate how fine-tuning approaches, prompt engineering, model architectures, decoding strategies, and cross-lingual transfer learning techniques affect tagging accuracy. Our results reveal both notable limitations in LLMs' ability to process historical language variations and non-standardized spelling and promising specialized techniques that effectively address the unique challenges presented by low-resource historical languages.

Problem

Research questions and friction points this paper is trying to address.

Enhancing POS tagging for Medieval Romance languages

Addressing data scarcity and linguistic variations

Evaluating techniques for low-resource historical languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning approaches for POS tagging

Cross-lingual transfer learning techniques

Prompt engineering for historical languages
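
As a rough illustration of two of the ideas listed above, prompt engineering and the measurement of tagging accuracy, the following Python sketch builds a few-shot tagging prompt and scores predictions at the token level. It is not the authors' implementation: the function names, the prompt wording, the use of Universal Dependencies tag names, and the Old French example tokens are all assumptions made for illustration, and no model call is included.

```python
from typing import List, Tuple

# A tagged sentence is a list of (token, tag) pairs.
TaggedSentence = List[Tuple[str, str]]


def build_prompt(examples: List[TaggedSentence], tokens: List[str]) -> str:
    """Build a few-shot prompt (hypothetical format, not taken from the paper)."""
    lines = ["Tag each token with a Universal Dependencies POS tag.", ""]
    for sentence in examples:
        lines.append("Tokens: " + " ".join(tok for tok, _ in sentence))
        lines.append("Tags: " + " ".join(tag for _, tag in sentence))
        lines.append("")
    # The sentence to be tagged goes last, with the tag line left open.
    lines.append("Tokens: " + " ".join(tokens))
    lines.append("Tags:")
    return "\n".join(lines)


def token_accuracy(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")
    correct = 0
    total = 0
    for gold_tags, pred_tags in zip(gold, pred):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("each sentence needs one predicted tag per token")
        correct += sum(g == p for g, p in zip(gold_tags, pred_tags))
        total += len(gold_tags)
    return correct / total if total else 0.0


# Illustrative demonstration only; the example sentence is invented, not corpus data.
demo_examples: List[TaggedSentence] = [[("li", "DET"), ("rois", "NOUN")]]
demo_prompt = build_prompt(demo_examples, ["la", "dame"])
```

Keeping the prompt builder separate from the scoring function makes it straightforward to vary the number of in-context examples or the prompt wording while holding the evaluation fixed, which is the kind of factor-by-factor comparison the abstract describes.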
