SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues
By: Martin Kuo, Jianyi Zhang, Aolin Ding, and more
Potential Business Impact:
Stops bad guys from tricking smart computers with talking.
Malicious attackers can exploit large language models (LLMs) by engaging them in multi-turn dialogues to achieve harmful objectives, posing significant safety risks to society. To address this challenge, we propose a novel defense mechanism: SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues (STREAM). STREAM defends LLMs against multi-turn attacks while preserving their functional capabilities. Our approach involves constructing a human-annotated dataset, the Safety Reasoning Multi-turn Dialogues dataset, which is used to fine-tune a plug-and-play safety reasoning moderator. This model is designed to identify malicious intent hidden within multi-turn conversations and alert the target LLM of potential risks. We evaluate STREAM across multiple LLMs against prevalent multi-turn attack strategies. Experimental results demonstrate that our method significantly outperforms existing defense techniques, reducing the Attack Success Rate (ASR) by 51.2%, all while maintaining comparable LLM capability.
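To make the plug-and-play design concrete, below is a minimal sketch of the moderator flow the abstract describes: a separate safety-reasoning model inspects the full multi-turn history and, if it detects hidden malicious intent, injects an alert into the target LLM's context instead of blocking outright. All names here (moderator_generate, target_llm_generate, the Turn type, and the alert wording) are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str

def moderator_generate(dialogue: list[Turn]) -> tuple[bool, str]:
    """Hypothetical call into the fine-tuned safety-reasoning moderator.

    Returns (is_risky, reasoning). A real deployment would query the
    moderator checkpoint fine-tuned on the Safety Reasoning Multi-turn
    Dialogues dataset; here it is stubbed out.
    """
    transcript = "\n".join(f"{t.role}: {t.content}" for t in dialogue)
    # Stub: flags nothing. Replace with an actual moderator-model call.
    return False, f"No malicious intent detected across {len(dialogue)} turns."

def target_llm_generate(system_prompt: str, dialogue: list[Turn]) -> str:
    """Hypothetical call into the target LLM being defended."""
    return "..."  # placeholder response

def respond_with_stream_defense(dialogue: list[Turn]) -> str:
    """Screen the whole multi-turn history before the target LLM answers."""
    is_risky, reasoning = moderator_generate(dialogue)
    system_prompt = "You are a helpful assistant."
    if is_risky:
        # Alert the target LLM to intent hidden across earlier turns rather
        # than refusing outright, so benign capability is preserved.
        system_prompt += (
            "\nSafety alert from moderator: this conversation may be "
            f"building toward a harmful objective. Reasoning: {reasoning} "
            "Decline requests that further that objective."
        )
    return target_llm_generate(system_prompt, dialogue)
```

Because the moderator sits outside the target model and only edits the prompt context, it can be attached to different LLMs without retraining them, which is what makes the design plug-and-play.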
Similar Papers
SafeMT: Multi-turn Safety for Multimodal Language Models
Computation and Language
Makes AI safer in long, tricky talks.
Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks
Machine Learning (CS)
Finds new ways to trick AI in conversations.
EASE: Practical and Efficient Safety Alignment for Small Language Models
Cryptography and Security
Keeps small AI safe from bad instructions.