TiSpell: A Semi-Masked Methodology for Tibetan Spelling Correction covering Multi-Level Error with Data Augmentation

📅 2025-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Tibetan spelling errors frequently occur at the character and syllable levels simultaneously, yet existing approaches typically address only one level and lack task-specific data and effective augmentation strategies. Method: This paper proposes a unified model for joint character- and syllable-level spelling correction. It introduces a semi-masked modeling paradigm and a syllable-aware Transformer architecture, coupled with a multi-level corruption strategy over unlabeled text that injects nine categories of controlled noise for data augmentation. Contribution/Results: The authors construct the first open-source, multi-level Tibetan spelling correction dataset. Experiments on both synthetic and real-world Tibetan texts show that the method outperforms the baseline models and matches the performance of state-of-the-art approaches, supporting the effectiveness of joint multi-level modeling.

📝 Abstract

Multi-level Tibetan spelling correction addresses errors at both the character and syllable levels within a unified model. Existing methods focus mainly on single-level correction and lack effective integration of both levels. Moreover, there are no open-source datasets or augmentation methods tailored for this task in Tibetan. To tackle this, we propose a data augmentation approach using unlabeled text to generate multi-level corruptions, and introduce TiSpell, a semi-masked model capable of correcting both character- and syllable-level errors. Although syllable-level correction is more challenging due to its reliance on global context, our semi-masked strategy simplifies this process. We synthesize nine types of corruptions on clean sentences to create a robust training set. Experiments on both simulated and real-world data demonstrate that TiSpell, trained on our dataset, outperforms baseline models and matches the performance of state-of-the-art approaches, confirming its effectiveness.
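The augmentation idea in the abstract, corrupting clean text at both the character and syllable levels to obtain (noisy, clean) training pairs, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not list the nine corruption types, so the four generic edit operations below (insert, substitute, delete, transpose) are stand-ins, and the names `random_edit`, `corrupt`, `char_pool`, and `syllable_pool` are hypothetical. The only Tibetan-specific assumption is that syllables are delimited by the tsheg mark (U+0F0B).

```python
import random

TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)


def split_syllables(sentence):
    """Split a sentence on tsheg, dropping empty pieces."""
    return [s for s in sentence.split(TSHEG) if s]


def random_edit(items, rng, pool):
    """Apply one random edit to a non-empty list of items.

    Illustrative stand-in for a corruption type; the paper's
    actual nine categories are not reproduced here.
    """
    out = list(items)
    ops = ["insert", "substitute"]
    if len(out) > 1:  # never delete the last item; transpose needs two
        ops += ["delete", "transpose"]
    op = rng.choice(ops)
    if op == "insert":
        out.insert(rng.randrange(len(out) + 1), rng.choice(pool))
    elif op == "substitute":
        out[rng.randrange(len(out))] = rng.choice(pool)
    elif op == "delete":
        del out[rng.randrange(len(out))]
    else:  # transpose two adjacent items
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


def corrupt(sentence, rng, char_pool, syllable_pool, level):
    """Return a corrupted copy of `sentence` at the given level.

    level="syllable" edits the syllable sequence;
    level="char" edits the characters of one random syllable.
    """
    syllables = split_syllables(sentence)
    if not syllables:
        return sentence
    if level == "syllable":
        syllables = random_edit(syllables, rng, syllable_pool)
    else:
        k = rng.randrange(len(syllables))
        syllables[k] = "".join(random_edit(list(syllables[k]), rng, char_pool))
    return TSHEG.join(syllables)


# Build (noisy, clean) pairs from one clean sentence.
clean = "བཀྲ་ཤིས་བདེ་ལེགས"
syllable_pool = split_syllables(clean)
char_pool = sorted(set("".join(syllable_pool)))
rng = random.Random(0)
pairs = [(corrupt(clean, rng, char_pool, syllable_pool, level), clean)
         for level in ("char", "syllable")]
```

In this sketch the same edit routine is reused at both levels, so the only difference between a character-level and a syllable-level corruption is the unit being edited. A real pipeline would draw the pools from a large unlabeled corpus rather than from a single sentence.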

Problem

Research questions and friction points this paper is trying to address.

Multi-level Tibetan spelling correction for character and syllable errors

Lack of open-source datasets and augmentation methods for Tibetan

Difficulty of syllable-level correction, which relies on global context

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-masked model for multi-level Tibetan correction

Data augmentation that generates multi-level corruptions from unlabeled text

Nine synthetic corruptions for robust training
