Score: 3

A Data-Centric Approach for Safe and Secure Large Language Models against Threatening and Toxic Content

Published: April 19, 2025 | arXiv ID: 2504.16120v1

By: Chaima Njeh, Haïfa Nakouri, Fehmi Jaafar

Potential Business Impact:

Makes AI say safer, less harmful things.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large Language Models (LLMs) have made remarkable progress, but concerns about potential biases and harmful content persist. To address these concerns, we introduce a practical solution for ensuring the safe and ethical use of LLMs. Our novel approach focuses on a post-generation correction mechanism, the BART-Corrective Model, which adjusts generated content to ensure safety and security. Rather than relying solely on model fine-tuning or prompt engineering, our method provides a robust data-centric alternative for mitigating harmful content. We demonstrate the effectiveness of our approach through experiments on multiple toxic datasets, which show a significant reduction in mean toxicity and jail-breaking scores after integration. Specifically, our results show reductions in mean toxicity and jail-breaking scores of 15% and 21%, respectively, with GPT-4, 28% and 5% with PaLM2, approximately 26% and 23% with Mistral-7B, and 11.1% and 19% with Gemma-2b-it. These results demonstrate the potential of our approach to improve the safety and security of LLMs, making them more suitable for real-world applications.
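The post-generation idea in the abstract can be illustrated with a small sketch: a corrector is applied to a model's output after generation, and mean toxicity is compared before and after. Everything here is an assumption made for illustration. The blocklist, `toy_toxicity`, `toy_corrector`, and `correct_after_generation` are hypothetical stand-ins, not the paper's BART-Corrective Model or its toxicity metric. The abstract also does not say whether its percentages are relative or absolute, so `relative_reduction` shows only one possible reading.

```python
from statistics import mean
from typing import Callable, List

# Hypothetical toy word list; a real system would use a learned toxicity classifier.
BLOCKLIST = {"idiot", "stupid"}


def _normalize(token: str) -> str:
    """Lowercase a token and strip common trailing punctuation."""
    return token.lower().strip(".,!?")


def toy_toxicity(text: str) -> float:
    """Fraction of tokens found on the toy blocklist, from 0.0 to 1.0."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(_normalize(t) in BLOCKLIST for t in tokens) / len(tokens)


def toy_corrector(text: str) -> str:
    """Placeholder for a learned corrective model: drops blocklisted tokens."""
    return " ".join(t for t in text.split() if _normalize(t) not in BLOCKLIST)


def correct_after_generation(
    generate: Callable[[str], str],
    correct: Callable[[str], str],
    prompt: str,
) -> str:
    """Run the generator unchanged, then pass its output through the corrector."""
    return correct(generate(prompt))


def relative_reduction(before: List[float], after: List[float]) -> float:
    """Relative drop in the mean score; 0.0 when the baseline mean is zero."""
    baseline = mean(before)
    if baseline == 0:
        return 0.0
    return (baseline - mean(after)) / baseline


if __name__ == "__main__":
    raw_outputs = ["you are an idiot", "have a nice day"]
    corrected = [toy_corrector(o) for o in raw_outputs]
    before = [toy_toxicity(o) for o in raw_outputs]
    after = [toy_toxicity(o) for o in corrected]
    print(relative_reduction(before, after))
```

The generator is left untouched in this design: the corrector wraps any text-producing callable, so the same wrapper shape would apply whichever underlying model produces the text.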

Evaluating Adversarial Vulnerabilities in Modern Large Language Models

Cryptography and Security

Finds ways to trick AI into saying bad things.

21 Nov 2025 0

89%

Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs

Computation and Language

Cleans up computer brains to stop bad ideas.

4 May 2025 2

89%

Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks

Cryptography and Security

Makes AI safer from bad instructions.

27 Nov 2025 0


Country of Origin

🇨🇦 Canada

Repos / Data Links

github.com huggingface.co

Page Count

14 pages
