Building UD Cairo for Old English in the Classroom

📅 2025-04-25

🤖 AI Summary

Problem: Scarce annotated resources and high entry barriers hinder pedagogical applications of historical linguistics, particularly dependency annotation and parsing of Old English (OE). Method: The authors build a sample OE Universal Dependencies (UD) treebank based on the 20 UD Cairo sentences, which illustrate a range of syntactic constructions in the world's languages. The sentences were collected by combining LLM prompting with searches in authentic OE data. Each sentence was assigned to multiple students with limited prior exposure to UD, and their annotations were compared and adjudicated. Preliminary parsing experiments used Modern English training data. Contributions/Results: (1) Current LLM output in OE does not reflect authentic syntax, but post-editing can mitigate this; (2) beginner annotators lack the background to complete the task perfectly on their own, yet taken together they produce good results and learn from the experience; (3) parsers trained on Modern English perform poorly on OE, but parsing on annotated features (lemma, hyperlemma, gloss) improves performance.

📝 Abstract

In this paper we present a sample treebank for Old English based on the UD Cairo sentences, collected and annotated as part of a classroom curriculum in Historical Linguistics. To collect the data, a sample of 20 sentences illustrating a range of syntactic constructions in the world's languages, we employ a combination of LLM prompting and searches in authentic Old English data. For annotation we assigned sentences to multiple students with limited prior exposure to UD, whose annotations we compare and adjudicate. Our results suggest that while current LLM outputs in Old English do not reflect authentic syntax, this can be mitigated by post-editing, and that although beginner annotators do not possess enough background to complete the task perfectly, taken together they can produce good results and learn from the experience. We also conduct preliminary parsing experiments using Modern English training data, and find that although performance on Old English is poor, parsing on annotated features (lemma, hyperlemma, gloss) leads to improved performance.
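The comparison and adjudication of several students' annotations can be illustrated with a minimal sketch. It assumes a simple majority vote over (head, deprel) pairs per token, with unresolved tokens flagged for a human adjudicator; the abstract does not state the actual adjudication procedure, and the function name `adjudicate` and the toy annotations are hypothetical.

```python
from collections import Counter

def adjudicate(annotations):
    """Merge per-annotator dependency annotations by strict majority vote.

    annotations: list of annotator outputs, each a list of (head, deprel)
    tuples with one entry per token. Majority voting is an assumption made
    for illustration, not the procedure described in the paper.
    Returns (merged, flagged): merged holds the winning pair per token, or
    None where no pair has a strict majority; flagged lists those token
    indices (0-based) for manual adjudication.
    """
    n_tokens = len(annotations[0])
    if any(len(a) != n_tokens for a in annotations):
        raise ValueError("all annotators must label the same tokens")

    merged, flagged = [], []
    for i in range(n_tokens):
        votes = Counter(a[i] for a in annotations)
        best, count = votes.most_common(1)[0]
        if count > len(annotations) / 2:
            merged.append(best)
        else:
            merged.append(None)
            flagged.append(i)
    return merged, flagged

# Toy three-token sentence labelled by three hypothetical students.
student_a = [(2, "nsubj"), (0, "root"), (2, "obj")]
student_b = [(2, "nsubj"), (0, "root"), (2, "obl")]
student_c = [(2, "nsubj"), (0, "root"), (2, "obj")]

merged, flagged = adjudicate([student_a, student_b, student_c])
print(merged)   # majority label per token
print(flagged)  # tokens that still need a human decision
```

With only two annotators, any disagreement leaves no strict majority, so the token is flagged rather than resolved automatically.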

Problem

Research questions and friction points this paper is trying to address.

Creating a treebank for Old English using UD Cairo sentences

Mitigating inauthentic LLM outputs via post-editing for Old English syntax

Improving parsing performance on Old English with annotated features
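Parsing performance on a dependency treebank is commonly reported as unlabeled and labeled attachment scores (UAS and LAS). The sketch below assumes those metrics, since the abstract does not name the ones used; `attachment_scores` and the toy gold and predicted trees are hypothetical.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for one sentence or a concatenated token list.

    gold, pred: lists of (head, deprel) tuples aligned token by token.
    UAS counts tokens whose head matches; LAS additionally requires the
    dependency label to match.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and aligned")
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return uas_hits / len(gold), las_hits / len(gold)

# Toy four-token example: one wrong label and one wrong head.
gold = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "amod"), (2, "obl")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=0.75 LAS=0.50
```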

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining LLM prompting with Old English data searches

Adjudicating annotations from multiple beginner students

Using Modern English data for Old English parsing
