AI Summary
Problem: Harmful online content in Bengali is pervasive, yet high-quality detoxification resources and methods remain scarce for low-resource languages. Method: We propose the first explainable text detoxification framework tailored to low-resource languages, integrating Pareto-optimal large language models (LLMs) with chain-of-thought (CoT) prompting. This yields BanglaNirTox, a manually verified, attribution-annotated parallel corpus of 68,041 Bengali detoxification instances. Contribution/Results: Our framework pioneers the integration of multi-objective Pareto optimization with explainable reasoning, significantly improving the quality, consistency, and transparency of detoxification. BanglaNirTox and the proposed methodology establish critical infrastructure and a novel technical paradigm for explainable AI and content-safety research in low-resource linguistic settings.
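To make the Pareto-optimization step concrete, here is a minimal sketch of selecting the non-dominated set of candidate LLMs over several evaluation axes. The model names, metric axes, and scores are hypothetical placeholders for illustration, not results from the paper.

```python
# Minimal sketch: multi-objective Pareto selection over candidate LLMs.
# Model names and scores below are hypothetical, not the paper's evaluations.
from typing import Dict, List

# Hypothetical per-model scores (higher is better on every axis):
# [detoxification quality, meaning preservation, fluency]
candidates: Dict[str, List[float]] = {
    "model_a": [0.82, 0.74, 0.90],
    "model_b": [0.88, 0.70, 0.85],
    "model_c": [0.80, 0.72, 0.84],  # dominated by model_a on all three axes
}

def dominates(x: List[float], y: List[float]) -> bool:
    """True if x is at least as good as y everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(x, y)) and any(a > b for a, b in zip(x, y))

def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Return the models not dominated by any other candidate."""
    return [
        name
        for name, vec in scores.items()
        if not any(dominates(other, vec)
                   for other_name, other in scores.items()
                   if other_name != name)
    ]

print(pareto_front(candidates))  # -> ['model_a', 'model_b']
```

Models on the resulting front trade off the objectives without being strictly worse than any alternative, which is the sense in which the chosen annotation LLMs are "Pareto-optimal" here.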
Abstract
Toxic language in Bengali remains prevalent, especially online, yet few effective safeguards against it exist. Although text detoxification has seen progress in high-resource languages, Bengali remains underexplored due to limited resources. In this paper, we propose a novel pipeline for Bengali text detoxification that combines Pareto class-optimized large language models (LLMs) and Chain-of-Thought (CoT) prompting to generate detoxified sentences. To support this effort, we construct BanglaNirTox, an artificially generated parallel corpus of 68,041 toxic Bengali sentences with class-wise toxicity labels, reasoning, and detoxified paraphrases, produced with Pareto-optimized LLMs and evaluated on random samples. The resulting BanglaNirTox dataset is used to fine-tune language models that produce better detoxified versions of Bengali sentences. Our findings show that Pareto-optimized LLMs with CoT prompting significantly improve the quality and consistency of Bengali text detoxification.
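As an illustration of the CoT annotation step, the sketch below has a single prompt elicit all three corpus fields, a class-wise toxicity label, a short reasoning, and a detoxified paraphrase, for each toxic sentence. The prompt wording, JSON field names, and call_llm helper are assumptions made for illustration; the paper's actual prompt is not reproduced here.

```python
# Minimal sketch of the CoT-style annotation step. The prompt text, field
# names, and call_llm() are hypothetical placeholders.
import json

COT_PROMPT = """You are a Bengali content-safety assistant.
Given the toxic Bengali sentence below, think step by step and answer in JSON with:
1. "label": the toxicity class (e.g. insult, threat, profanity).
2. "reasoning": a brief explanation of why the sentence is toxic.
3. "detoxified": a fluent Bengali paraphrase that removes the toxicity
   while preserving the original meaning.

Sentence: {sentence}
"""

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call (e.g. an OpenAI or HF client)."""
    raise NotImplementedError

def annotate(sentence: str) -> dict:
    """Produce one parallel-corpus record: toxic source plus the three CoT fields."""
    raw = call_llm(COT_PROMPT.format(sentence=sentence))
    record = json.loads(raw)   # expects the three fields requested above
    record["toxic"] = sentence # keep the source side of the parallel pair
    return record
```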
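The resulting toxic/detoxified pairs can then feed standard sequence-to-sequence fine-tuning. The sketch below assumes a Hugging Face workflow; the checkpoint (google/mt5-small), hyperparameters, and placeholder data are illustrative assumptions, not the models or settings evaluated in the paper.

```python
# Minimal sketch: seq2seq fine-tuning on toxic -> detoxified pairs, assuming
# a Hugging Face-style setup. Checkpoint and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "google/mt5-small"  # any multilingual seq2seq model covering Bengali
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# In practice these columns would come from the BanglaNirTox parallel corpus.
pairs = Dataset.from_dict({
    "toxic": ["<toxic Bengali sentence>"],
    "detoxified": ["<its detoxified paraphrase>"],
})

def tokenize(batch):
    enc = tokenizer(batch["toxic"], truncation=True, max_length=128)
    enc["labels"] = tokenizer(text_target=batch["detoxified"],
                              truncation=True, max_length=128)["input_ids"]
    return enc

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="banglanirtox-detox",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3),
    train_dataset=pairs.map(tokenize, batched=True),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```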