🤖 AI Summary
This work addresses the scarcity of morpheme segmentation annotations for low-resource languages. We propose a Transformer-based multitask learning framework that jointly models morpheme segmentation and word-level glossing and, as a novel contribution, incorporates synthetic data generated by large language models (LLMs) via zero-shot or in-context learning. To enhance cross-lingual generalization and alleviate data scarcity, the framework shares document-level representations across languages. Evaluated on the SIGMORPHON 2023 multilingual benchmark, our approach achieves significant improvements in word-level segmentation accuracy and morpheme-level F1 score, especially for extremely low-resource languages such as Old Church Slavonic and Ainu. These results demonstrate the effectiveness and strong generalization of combining multitask learning with LLM-generated synthetic data for morphological analysis in low-resource settings.
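The LLM-based synthetic-data step mentioned above can be illustrated with a minimal sketch: a few-shot prompt is assembled from attested (word, segmentation) pairs and the LLM is asked to segment an unannotated word in the same format. The prompt wording, separator convention, and function name here are illustrative assumptions, not the system's actual implementation.

```python
def build_fewshot_prompt(examples, target_word):
    """Assemble an in-context learning prompt for morpheme segmentation.

    `examples` is a list of (word, segmentation) pairs drawn from gold
    training data; the LLM is asked to segment `target_word` in the same
    format. (Sketch only; the actual prompt design is not specified here.)
    """
    lines = ["Segment each word into morphemes, separated by '-':"]
    for word, seg in examples:
        lines.append(f"{word} -> {seg}")
    lines.append(f"{target_word} ->")  # the LLM completes this line
    return "\n".join(lines)

# Example: two attested pairs plus one unsegmented target word.
prompt = build_fewshot_prompt(
    [("walked", "walk-ed"), ("cats", "cat-s")],
    "dogs",
)
```

The LLM's completion of the final line would then be paired with the target word and added to the training set as a synthetic example.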
📝 Abstract
We introduce a Transformer-based morpheme segmentation system that augments a low-resource training signal through multitask learning and LLM-generated synthetic data. Our framework jointly predicts morphological segments and glosses from orthographic input, leveraging the shared linguistic representations that arise from a common documentation process to improve generalization. To further address data scarcity, we integrate synthetic training data generated by large language models (LLMs) via in-context learning. Experimental results on the SIGMORPHON 2023 dataset show that our approach significantly improves word-level segmentation accuracy and morpheme-level F1 score across multiple low-resource languages.
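The multitask objective described above can be sketched as a weighted sum of two cross-entropy losses over heads that read the same shared encoder representation. The weighting scheme, symbol names, and toy probabilities below are assumptions for illustration, not the paper's reported setup.

```python
import math

def cross_entropy(probs, gold_idx):
    """Negative log-likelihood of the gold label under predicted probs."""
    return -math.log(probs[gold_idx])

def multitask_loss(seg_probs, seg_gold, gloss_probs, gloss_gold, lam=0.5):
    """Weighted multitask loss: L = lam * L_seg + (1 - lam) * L_gloss.

    Both heads are assumed to consume the same shared encoder output;
    the interpolation weight `lam` is an illustrative assumption.
    """
    l_seg = sum(cross_entropy(p, g) for p, g in zip(seg_probs, seg_gold))
    l_gloss = sum(cross_entropy(p, g) for p, g in zip(gloss_probs, gloss_gold))
    return lam * l_seg + (1 - lam) * l_gloss

# Toy example: two segmentation decisions and one gloss decision.
seg_probs = [[0.9, 0.1], [0.2, 0.8]]   # per-position boundary distributions
seg_gold = [0, 1]                      # gold boundary labels
gloss_probs = [[0.7, 0.2, 0.1]]        # distribution over a toy gloss inventory
gloss_gold = [0]
loss = multitask_loss(seg_probs, seg_gold, gloss_probs, gloss_gold, lam=0.5)
```

Training on the joint loss lets the glossing signal regularize the segmentation head, which is the intended benefit in the low-resource setting.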