BanglaNirTox: A Large-scale Parallel Corpus for Explainable AI in Bengali Text Detoxification

📅 2025-11-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Harmful online content in Bengali is pervasive, yet high-quality detoxification resources and methods remain scarce for low-resource languages. Method: We propose the first explainable text detoxification framework tailored to low-resource languages, integrating Pareto-optimal large language models with chain-of-thought (CoT) prompting. This yields BanglaNirTox, a verified, attribution-annotated parallel corpus of 68,041 Bengali detoxification instances. Contribution/Results: Our framework pioneers the integration of multi-objective Pareto optimization with explainable reasoning, significantly improving detoxification quality, consistency, and transparency. BanglaNirTox and the proposed methodology establish critical infrastructure and a novel technical paradigm for explainable AI and content safety research in low-resource linguistic settings.

๐Ÿ“ Abstract
Toxic language in Bengali remains prevalent, especially in online environments, with few effective precautions against it. Although text detoxification has seen progress in high-resource languages, Bengali remains underexplored due to limited resources. In this paper, we propose a novel pipeline for Bengali text detoxification that combines Pareto class-optimized large language models (LLMs) and Chain-of-Thought (CoT) prompting to generate detoxified sentences. To support this effort, we construct BanglaNirTox, an artificially generated parallel corpus of 68,041 toxic Bengali sentences with class-wise toxicity labels, reasonings, and detoxified paraphrases, using Pareto-optimized LLMs evaluated on random samples. The resulting BanglaNirTox dataset is used to fine-tune language models to produce better detoxified versions of Bengali sentences. Our findings show that Pareto-optimized LLMs with CoT prompting significantly enhance the quality and consistency of Bengali text detoxification.
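The pipeline described in the abstract pairs each toxic sentence with a class-wise toxicity label, a reasoning, and a detoxified paraphrase, elicited via Chain-of-Thought prompting. A minimal sketch of how such a CoT prompt might be assembled follows; the prompt wording and field names are illustrative assumptions, not the authors' actual template:

```python
# Illustrative sketch of a chain-of-thought detoxification prompt builder.
# The prompt wording and structure are assumptions for demonstration;
# the paper's actual template is not reproduced here.

def build_cot_prompt(toxic_sentence: str, toxicity_class: str) -> str:
    """Assemble a CoT prompt asking the model to explain, then detoxify."""
    return (
        "You are a Bengali text detoxification assistant.\n"
        f"Toxic sentence: {toxic_sentence}\n"
        f"Toxicity class: {toxicity_class}\n"
        "Step 1: Explain briefly why the sentence is toxic.\n"
        "Step 2: Rewrite it as a non-toxic paraphrase that preserves the meaning.\n"
        "Answer with a 'Reasoning:' section followed by a 'Detoxified:' section."
    )

prompt = build_cot_prompt("<toxic Bengali sentence>", "insult")
print(prompt)
```

The two numbered steps mirror the dataset's per-instance fields: the reasoning becomes the attribution annotation, and the rewrite becomes the detoxified paraphrase.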
Problem

Research questions and friction points this paper is trying to address.

Detoxifying toxic Bengali text in online environments
Addressing resource scarcity for Bengali text detoxification
Generating explainable detoxified Bengali sentences using AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pareto-optimized LLMs for Bengali detoxification
Chain-of-Thought prompting for explainable AI
Large-scale parallel corpus with toxicity labels
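The "Pareto class-optimized" selection mentioned above can be illustrated with a small multi-objective filter: a candidate model survives if no other candidate is at least as good on every metric and strictly better on at least one. The metric names and scores below are hypothetical, purely to show the mechanism:

```python
# Hypothetical sketch of Pareto-front selection over per-model metrics.
# Objectives are scores to maximize; the model names and numbers are
# made up for illustration, not results from the paper.

def dominates(a, b):
    """True if a is >= b on every objective and > b on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep only candidates not dominated by any other candidate."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(dominates(other, scores)
                   for other_name, other in candidates.items()
                   if other_name != name)
    }

# (detox_quality, meaning_preservation) -- illustrative scores
models = {
    "model_a": (0.90, 0.70),
    "model_b": (0.80, 0.85),
    "model_c": (0.75, 0.60),  # dominated by model_b on both objectives
}
print(sorted(pareto_front(models)))  # ['model_a', 'model_b']
```

No single scalar score is needed: the front keeps every trade-off that is not strictly beaten, which is the point of multi-objective Pareto optimization.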
Ayesha Afroza Mohsin
Systems and Software Lab (SSL), Department of Computer Science and Engineering, Islamic University of Technology, Gazipur, Bangladesh
Mashrur Ahsan
Systems and Software Lab (SSL), Department of Computer Science and Engineering, Islamic University of Technology, Gazipur, Bangladesh
Nafisa Maliyat
Systems and Software Lab (SSL), Department of Computer Science and Engineering, Islamic University of Technology, Gazipur, Bangladesh
Shanta Maria
Systems and Software Lab (SSL), Department of Computer Science and Engineering, Islamic University of Technology, Gazipur, Bangladesh
Syed Rifat Raiyan
Systems and Software Lab (SSL), Department of Computer Science and Engineering, Islamic University of Technology, Gazipur, Bangladesh
Hasan Mahmud
Postdoctoral Research Associate, Rochester Institute of Technology
Information Systems · Algorithmic decision-making · HCI/Human-AI interaction
Md Kamrul Hasan
Systems and Software Lab (SSL), Department of Computer Science and Engineering, Islamic University of Technology, Gazipur, Bangladesh