SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs
By: Shaona Ghosh, Amrita Bhattacharjee, Yftah Ziser, and more
Potential Business Impact:
Makes AI say safe things without refusing.
Fine-tuning large language models (LLMs) to adapt to evolving safety policies is costly and impractical. Mechanistic interpretability enables inference-time control through latent activation steering, yet its potential for precise, customizable safety adjustments remains largely untapped. This paper investigates an approach called SafeSteer for guiding the outputs of LLMs by: (i) leveraging category-specific steering vectors for more precise control, (ii) employing a simple, gradient-free unsupervised method to enhance safety steering while preserving text quality and topic relevance, and without resorting to explicit refusal, and (iii) accomplishing this without a hard requirement for contrastive, pairwise safe data. We also highlight that our method, being simple and effective, aligns with recent studies suggesting that simple techniques often outperform more complex ones in activation steering. We showcase the effectiveness of our approach across various LLMs, datasets, and risk categories, demonstrating its ability to provide precise control, prevent blanket refusals, and guide models toward generating safe content while maintaining topic relevance.
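To make the underlying mechanism concrete, here is a minimal sketch of inference-time activation steering with a category-specific vector built by averaging hidden states over safe examples, so no gradients and no contrastive pairs are needed. This is an illustrative assumption of how such steering can be wired up, not the paper's exact SafeSteer procedure; the model name, layer index, steering strength, and helper names are all hypothetical.

```python
# Hypothetical sketch of activation steering at inference time.
# Assumes a HuggingFace-style causal LM; layer index and strength are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder model for illustration
LAYER = 6             # transformer block to steer (assumed hyperparameter)
ALPHA = 4.0           # steering strength (assumed hyperparameter)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()


def category_steering_vector(safe_texts):
    """Average hidden states of safe examples for one risk category.

    A simple, gradient-free, unsupervised way to get a category-specific
    direction without contrastive pairs of safe/unsafe prompts.
    """
    vecs = []
    with torch.no_grad():
        for text in safe_texts:
            ids = tok(text, return_tensors="pt")
            out = model(**ids)
            # hidden_states[0] is the embedding output, so block LAYER's
            # output is at index LAYER + 1; average over token positions.
            vecs.append(out.hidden_states[LAYER + 1].mean(dim=1))
    v = torch.cat(vecs).mean(dim=0)
    return v / v.norm()


def generate_with_steering(prompt, steer_vec, max_new_tokens=40):
    """Add the steering vector to the residual stream at LAYER while generating."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + ALPHA * steer_vec.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt")
        out_ids = model.generate(**ids, max_new_tokens=max_new_tokens)
    finally:
        handle.remove()
    return tok.decode(out_ids[0], skip_special_tokens=True)


# Example usage: build a vector from a handful of safe responses in one
# risk category, then steer generation toward that safe behaviour.
safe_examples = [
    "I can explain the general safety considerations around this topic.",
    "Here is some general, harm-free information on that subject.",
]
steer = category_steering_vector(safe_examples)
print(generate_with_steering("Tell me about this sensitive topic.", steer))
```

The key design point this sketch tries to mirror is that the steering direction is per risk category and derived purely from forward passes over safe data, so new categories can be added at inference time without fine-tuning or contrastive labeling.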
Similar Papers
Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Computation and Language
Controls AI's opinions on sensitive topics.
Interpretable Risk Mitigation in LLM Agent Systems
Artificial Intelligence
Makes AI agents safer and more predictable.
LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation
Artificial Intelligence
Keeps AI helpful but stops it from saying bad things.