BanglaNirTox: A Large-scale Parallel Corpus for Explainable AI in Bengali Text Detoxification

Published: November 3, 2025 | arXiv ID: 2511.01512v1

By: Ayesha Afroza Mohsin, Mashrur Ahsan, Nafisa Maliyat and more

Potential Business Impact:

Rewrites toxic online Bengali text into non-toxic language.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Toxic language in Bengali remains prevalent, especially in online environments, with few effective precautions against it. Although text detoxification has seen progress in high-resource languages, Bengali remains underexplored due to limited resources. In this paper, we propose a novel pipeline for Bengali text detoxification that combines Pareto class-optimized large language models (LLMs) and Chain-of-Thought (CoT) prompting to generate detoxified sentences. To support this effort, we construct BanglaNirTox, an artificially generated parallel corpus of 68,041 toxic Bengali sentences with class-wise toxicity labels, reasonings, and detoxified paraphrases, using Pareto-optimized LLMs evaluated on random samples. The resulting BanglaNirTox dataset is used to fine-tune language models to produce better detoxified versions of Bengali sentences. Our findings show that Pareto-optimized LLMs with CoT prompting significantly enhance the quality and consistency of Bengali text detoxification.
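
To make the idea of Pareto-based model selection concrete, here is a minimal sketch of picking non-dominated candidates from multi-metric scores. It is only an illustration: the abstract does not say which metrics or models the authors compared, so the model names, the three metrics (non-toxicity, content preservation, fluency), and all numbers below are hypothetical, and the paper's class-wise procedure may differ.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]


def dominates(a: Scores, b: Scores) -> bool:
    """Return True if `a` is at least as good as `b` on every metric
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return sorted(
        name
        for name, s in scores.items()
        if not any(dominates(o, s) for other, o in scores.items() if other != name)
    )


# Hypothetical candidates with made-up scores, measured on a random sample:
# (non-toxicity, content preservation, fluency)
candidates = {
    "model_a": (0.90, 0.70, 0.80),
    "model_b": (0.85, 0.80, 0.75),
    "model_c": (0.80, 0.65, 0.70),
    "model_d": (0.90, 0.70, 0.85),
}

front = pareto_front(candidates)
print(front)  # ['model_b', 'model_d']
```

In this toy data, model_d dominates model_a (equal on two metrics, better on fluency) and model_a dominates model_c, so only model_b and model_d remain as trade-off candidates. A class-wise variant would repeat the same selection separately for each toxicity class.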

Page Count
6 pages

Category
Computer Science:
Computation and Language