TiSpell: A Semi-Masked Methodology for Tibetan Spelling Correction covering Multi-Level Error with Data Augmentation

📅 2025-05-12
🤖 AI Summary
Problem: Tibetan spelling errors often occur at both the character and syllable levels, yet existing approaches typically address only one level, and the task lacks dedicated open-source data and effective augmentation strategies. Method: The paper proposes TiSpell, a unified model for joint character- and syllable-level spelling correction. It combines a semi-masked modeling paradigm and a syllable-aware Transformer architecture with a multi-level corruption strategy over unlabeled text, using a controlled nine-category noise injection scheme for data augmentation. Contribution/Results: The authors construct an open-source, multi-level Tibetan spelling correction dataset. Experiments on both synthetic and real-world Tibetan text show that the method outperforms baseline models and matches the performance of state-of-the-art approaches, supporting the effectiveness of modeling both levels jointly.

📝 Abstract
Multi-level Tibetan spelling correction addresses errors at both the character and syllable levels within a unified model. Existing methods focus mainly on single-level correction and lack effective integration of both levels. Moreover, there are no open-source datasets or augmentation methods tailored for this task in Tibetan. To tackle this, we propose a data augmentation approach using unlabeled text to generate multi-level corruptions, and introduce TiSpell, a semi-masked model capable of correcting both character- and syllable-level errors. Although syllable-level correction is more challenging due to its reliance on global context, our semi-masked strategy simplifies this process. We synthesize nine types of corruptions on clean sentences to create a robust training set. Experiments on both simulated and real-world data demonstrate that TiSpell, trained on our dataset, outperforms baseline models and matches the performance of state-of-the-art approaches, confirming its effectiveness.
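As a rough illustration of the augmentation idea in the abstract, the sketch below corrupts clean Tibetan text at two levels to produce (corrupted, clean) training pairs. It is not the authors' implementation: the paper's nine corruption types are not listed on this page, so the four operations shown (`char_delete`, `char_swap`, `syllable_delete`, `syllable_swap`), the `corrupt` and `make_pairs` helpers, and the `p_char` parameter are all hypothetical. The only language fact relied on is that Tibetan syllables are delimited by the tsheg mark (U+0F0B).

```python
import random

TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)


def split_syllables(sentence):
    """Split a sentence into syllables on the tsheg, dropping empty pieces."""
    return [s for s in sentence.split(TSHEG) if s]


# Character-level operations (hypothetical). They act on Unicode code points,
# which is a simplification: Tibetan stacks span several code points.
def char_delete(syl, rng):
    if len(syl) < 2:
        return syl
    i = rng.randrange(len(syl))
    return syl[:i] + syl[i + 1:]


def char_swap(syl, rng):
    if len(syl) < 2:
        return syl
    i = rng.randrange(len(syl) - 1)
    return syl[:i] + syl[i + 1] + syl[i] + syl[i + 2:]


# Syllable-level operations (hypothetical).
def syllable_delete(syls, rng):
    if len(syls) < 2:
        return list(syls)
    i = rng.randrange(len(syls))
    return syls[:i] + syls[i + 1:]


def syllable_swap(syls, rng):
    if len(syls) < 2:
        return list(syls)
    i = rng.randrange(len(syls) - 1)
    out = list(syls)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def corrupt(sentence, rng, p_char=0.5):
    """Apply one random character- or syllable-level corruption."""
    syls = split_syllables(sentence)
    if not syls:
        return sentence
    suffix = TSHEG if sentence.endswith(TSHEG) else ""
    if rng.random() < p_char:
        i = rng.randrange(len(syls))
        op = rng.choice([char_delete, char_swap])
        syls = syls[:i] + [op(syls[i], rng)] + syls[i + 1:]
    else:
        op = rng.choice([syllable_delete, syllable_swap])
        syls = op(syls, rng)
    return TSHEG.join(syls) + suffix


def make_pairs(sentences, n_per_sentence=2, seed=0):
    """Build (corrupted, clean) pairs from clean, unlabeled sentences."""
    rng = random.Random(seed)
    return [(corrupt(s, rng), s) for s in sentences for _ in range(n_per_sentence)]


if __name__ == "__main__":
    clean = TSHEG.join(["བཀྲ", "ཤིས", "བདེ", "ལེགས"])
    for noisy, target in make_pairs([clean], n_per_sentence=3):
        print(noisy, "->", target)
```

Seeding the generator keeps the synthetic set reproducible, and keeping each operation as a small function makes it straightforward to extend the list toward a fuller set of corruption types.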

Problem

Research questions and friction points this paper is trying to address.

Multi-level Tibetan spelling correction for character and syllable errors
Lack of open-source datasets and augmentation methods for Tibetan
Syllable-level correction is harder because it relies on global context

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-masked model for multi-level Tibetan correction
Data augmentation that generates multi-level corruptions from unlabeled text
Nine synthetic corruptions for robust training

Authors
Yutong Liu
School of Information and Software Engineering, University of Electronic Science and Technology of China
Feng Xiao
School of Information and Software Engineering, University of Electronic Science and Technology of China
Ziyue Zhang
School of Information and Software Engineering, University of Electronic Science and Technology of China
Yongbin Yu
University of Electronic Science and Technology of China
Memristor, Neural Network, Natural Language Processing, Impulsive Control, Swarm Intelligence, EDA, MBSE
Cheng Huang
Department of Ophthalmology, University of Texas Southwestern Medical Center
Fan Gao
Caltech; MIT
NGS Bioinformatics, Image data processing, AI/ML, Neurodegeneration, Protein Bioinformatics
Xiangxiang Wang
University of Electronic Science and Technology of China
neural networks, time scales, nonlinear systems, impulsive control
Ma-bao Ban
School of Information and Software Engineering, University of Electronic Science and Technology of China
Manping Fan
School of Information and Software Engineering, University of Electronic Science and Technology of China
Thupten Tsering
School of Information Science and Technology, Tibet University
Gadeng Luosang
Sichuan University, Tibet University
Multilingual natural language processing, medical image processing
Renzeng Duojie
School of Information Science and Technology, Tibet University
Nyima Tashi
School of Information Science and Technology, Tibet University