GlossAssist -- A Tool to Simplify Corpus Creation and Study the Effect of NLP Models in Low-Resource Documentation Settings

📅 2026-06-02

🤖 AI Summary
This work addresses the high cost of manually producing interlinear glossed text (IGT) and the limited adoption of automatic glossing tools among field linguists, which the authors attribute to the tools' lack of interpretability and of any mechanism for expert feedback. The authors present GlossAssist, a retrieval-based glossing tool built on the Contrastive Word-Morpheme Pre-training (CWoMP) architecture, which grounds its predictions in a mutable lexicon of learned morpheme representations. Its central design choice is to treat each annotator correction as an active learning signal that expands the lexicon and improves future predictions without retraining the model. The paper presents the interface and argues that this human-in-the-loop feedback loop should be a design requirement for NLP tools aimed at documentary linguists, particularly in low-resource settings.
📝 Abstract
Interlinear glossed text (IGT) is the standard format for linguistic annotation in language documentation. Producing it manually, however, is often slow and costly. Automated glossing systems have improved substantially in recent years, but adoption among field linguists remains limited. Existing tools are designed to be evaluated rather than used, offering no interpretable path for correction or the incorporation of linguistic expertise back into model behavior. We present GlossAssist, a glossing tool built around the retrieval-based architecture of CWoMP (Contrastive Word-Morpheme Pre-training), which grounds predictions in a mutable lexicon of learned morpheme representations. In conjunction with CWoMP, our system treats each correction by an annotator as part of an active learning setting, which expands the lexicon and improves future predictions without having to retrain the model. In this paper, we present our interface and argue that this feedback loop should be treated as a design requirement for NLP tools aimed at documentary linguists.
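The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the CWoMP model or the GlossAssist code: the character n-gram `embed` function is a toy stand-in for a learned encoder, the `MutableLexicon` class and its method names are hypothetical, and the morpheme forms and glosses are made up for illustration. It only shows the general idea that a prediction is a nearest-neighbour lookup in a lexicon, and that a correction changes future predictions by editing the lexicon rather than by retraining anything.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for a learned encoder: a bag of character n-grams."""
    padded = f"<{text}>"
    grams = [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]
    return Counter(grams)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MutableLexicon:
    """Hypothetical lexicon: maps a morpheme form to (gloss, vector)."""

    def __init__(self):
        self.entries = {}

    def add(self, form, gloss):
        # Adding an existing form overwrites its gloss (a simplification:
        # homophonous morphemes are not modelled in this sketch).
        self.entries[form] = (gloss, embed(form))

    def predict(self, form, k=3):
        # Retrieve the k most similar entries; each result names the entry
        # it came from, so the annotator can see why a gloss was proposed.
        query = embed(form)
        scored = [
            (cosine(query, vec), known, gloss)
            for known, (gloss, vec) in self.entries.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]

    def correct(self, form, gloss):
        # An annotator correction is just a lexicon update; the encoder
        # itself is never retrained.
        self.add(form, gloss)


# Made-up forms and glosses, for illustration only.
lexicon = MutableLexicon()
lexicon.add("kan", "sing")
lexicon.add("ta", "PST")

print(lexicon.predict("kam"))   # nearest neighbour before any correction
lexicon.correct("kam", "go")    # annotator supplies the right gloss
print(lexicon.predict("kam"))   # the corrected entry now ranks first
```

Because every prediction is tied to a concrete lexicon entry, the annotator can inspect and overwrite the evidence directly, which is the kind of interpretable correction path the abstract argues for.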
Problem

Research questions and friction points this paper is trying to address.

Interlinear Glossed Text
Low-Resource
NLP Tools
Linguistic Annotation
Field Linguistics
Innovation

Methods, ideas, or system contributions that make the work stand out.

retrieval-based glossing
mutable lexicon
active learning
low-resource NLP
Interlinear Glossed Text