Depth-Wise Activation Steering for Honest Language Models

Published: December 8, 2025 | arXiv ID: 2512.07667v1

By: Gracjan Góral, Marysia Winkels, Steven Basart

Potential Business Impact:

Makes AI tell the truth, not just know it.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large language models sometimes assert falsehoods despite internally representing the correct answer; these are failures of honesty rather than accuracy, and they undermine auditability and safety. Existing approaches largely optimize factual correctness or depend on retraining and brittle single-layer edits, offering limited leverage over truthful reporting. We present a training-free activation steering method that weights steering strength across network depth using a Gaussian schedule. On the MASK benchmark, which separates honesty from knowledge, we evaluate seven models spanning the LLaMA, Qwen, and Mistral families and find that Gaussian scheduling improves honesty over no-steering and single-layer baselines in six of the seven models. Equal-budget ablations on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct show that the Gaussian schedule outperforms random, uniform, and box-filter depth allocations, indicating that how the intervention is distributed across depth materially affects outcomes beyond total strength. The method is simple, model-agnostic, requires no finetuning, and provides a low-cost control knob for eliciting truthful reporting from models' existing capabilities.
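The core idea of a Gaussian depth schedule can be illustrated with a short sketch. The code below is not the paper's implementation; the function names, the center/width parameters, and the equal-budget normalization are illustrative assumptions about how per-layer steering weights might be computed and applied.

```python
import numpy as np

def gaussian_depth_schedule(num_layers, center_frac=0.5, width_frac=0.15,
                            total_budget=1.0):
    """Sketch of a Gaussian schedule over layer depth.

    center_frac and width_frac are hypothetical parameters (not the
    paper's values): the Gaussian peaks at center_frac of the depth and
    has standard deviation width_frac * num_layers. Weights are
    normalized so they sum to total_budget, matching the idea of
    comparing depth allocations at an equal intervention budget.
    """
    depths = np.arange(num_layers)
    mu = center_frac * (num_layers - 1)
    sigma = width_frac * num_layers
    weights = np.exp(-0.5 * ((depths - mu) / sigma) ** 2)
    return weights / weights.sum() * total_budget

def apply_steering(hidden_states, steering_vector, weights):
    """Add the weight-scaled steering vector to each layer's activations."""
    return [h + w * steering_vector for h, w in zip(hidden_states, weights)]
```

Under this sketch, a single-layer baseline corresponds to putting the whole budget on one layer, a uniform allocation spreads it evenly, and the Gaussian concentrates it smoothly around a chosen depth while still touching neighboring layers.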

Country of Origin
🇵🇱 Poland

Repos / Data Links

Page Count
11 pages

Category
Computer Science:
Machine Learning (CS)