UNITYAI-GUARD: Pioneering Toxicity Detection Across Low-Resource Indian Languages

📅 2025-03-29
🤖 AI Summary
To address the lack of toxicity detection systems for low-resource Indic languages, this paper introduces UnityAI-Guard, described as the first open-source framework to systematically support toxicity detection across seven low-resource Indian languages written in Brahmic scripts (e.g., Marathi, Gujarati). Methodologically, it integrates fine-tuned multilingual pre-trained language models, script-aware feature encoding, data augmentation, and robustness optimization, and it builds on a dataset of 888k training instances and 35k human-verified test instances. Its key contribution is unified modeling and standardized evaluation across Brahmic scripts. Experiments show an average F1-score of 84.23% across all seven languages, significantly outperforming existing baselines. To foster equitable content safety governance, the authors publicly release both the trained models and a RESTful API.

📝 Abstract
This work introduces UnityAI-Guard, a framework for binary toxicity classification targeting low-resource Indian languages. Existing systems predominantly cater to high-resource languages; UnityAI-Guard addresses this gap with state-of-the-art models for identifying toxic content across diverse Brahmic/Indic scripts. Our approach achieves an average F1-score of 84.23% across seven languages, using a dataset of 888k training instances and 35k manually verified test instances. To advance multilingual content moderation for linguistically diverse regions, UnityAI-Guard also provides public API access to foster broader adoption and application.
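The headline number above is an average F1-score over seven languages. The abstract does not say how the average is taken, so the following is only a minimal sketch under one plausible assumption: F1 is computed on the toxic class separately for each language, and the per-language scores are then averaged with equal weight. The function names, language codes, and toy records are illustrative and do not come from the paper.

```python
from collections import defaultdict


def binary_f1(y_true, y_pred, positive=1):
    """F1 for the positive (here: toxic) class of a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1_by_language(records):
    """records: iterable of (language, gold_label, predicted_label).

    Returns (per-language F1 dict, unweighted mean over languages).
    """
    gold = defaultdict(list)
    pred = defaultdict(list)
    for lang, g, p in records:
        gold[lang].append(g)
        pred[lang].append(p)
    if not gold:
        raise ValueError("no records given")
    per_lang = {lang: binary_f1(gold[lang], pred[lang]) for lang in gold}
    mean_f1 = sum(per_lang.values()) / len(per_lang)
    return per_lang, mean_f1


# Toy data only: two hypothetical languages, label 1 = toxic, 0 = non-toxic.
records = [
    ("mr", 1, 1), ("mr", 1, 0), ("mr", 0, 0), ("mr", 0, 1),
    ("gu", 1, 1), ("gu", 1, 1), ("gu", 0, 0), ("gu", 0, 0),
]
per_lang, mean_f1 = average_f1_by_language(records)
print(per_lang, mean_f1)
```

Averaging with equal weight per language keeps a language with a small test set from being drowned out by larger ones; pooling all test instances before computing F1 would give a different number, and the paper may use either convention.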
Problem

Research questions and friction points this paper is trying to address.

Develops toxicity detection for low-resource Indian languages
Addresses lack of systems for diverse Brahmic/Indic scripts
Advances multilingual content moderation with a public API
Innovation

Methods, ideas, or system contributions that make the work stand out.

Binary toxicity classification for low-resource languages
State-of-the-art models for diverse Indic scripts
Public API access for broader adoption
Authors

Himanshu Beniwal
Indian Institute of Technology Gandhinagar
Natural Language Processing, Machine Learning, Computational Linguistics, Deep Learning
Reddybathuni Venkat
Indian Institute of Technology Gandhinagar
Rohit Kumar
Indian Institute of Technology Goa
Birudugadda Srivibhav
Indian Institute of Technology Gandhinagar
Daksh Jain
Indian Institute of Technology Gandhinagar
Pavan Doddi
Indian Institute of Technology Gandhinagar
Eshwar Dhande
Indian Institute of Technology Gandhinagar
Adithya Ananth
Indian Institute of Technology Tirupati
Kuldeep
Indian Institute of Technology Gandhinagar
Heer Kubadia
Indian Institute of Technology Gandhinagar
Pratham Sharda
Indian Institute of Technology Gandhinagar
Mayank Singh
Indian Institute of Technology Gandhinagar