Score: 1

Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control

Published: April 23, 2025 | arXiv ID: 2504.17130v3

By: Hannah Cyberey, David Evans

Potential Business Impact:

Lets computers share more honest answers.

Business Areas:

Darknet Internet Services

Large language models (LLMs) have transformed the way we access information. These models are often tuned to refuse to comply with requests that are considered harmful and to produce responses that better align with the preferences of those who control the models. To understand how this "censorship" works. We use representation engineering techniques to study open-weights safety-tuned models. We present a method for finding a refusal--compliance vector that detects and controls the level of censorship in model outputs. We also analyze recent reasoning LLMs, distilled from DeepSeek-R1, and uncover an additional dimension of censorship through "thought suppression". We show a similar approach can be used to find a vector that suppresses the model's reasoning process, allowing us to remove censorship by applying the negative multiples of this vector. Our code is publicly available at: https://github.com/hannahxchen/llm-censorship-steering

Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations

Computation and Language

Changes AI's answers without retraining it.

16 May 2025 1

89%

Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics

Computation and Language

Controls AI's opinions on sensitive topics.

18 Dec 2025 0

87%

Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs

Machine Learning (CS)

Makes AI fairer by reducing unfair ideas.

7 Mar 2025 0

View PDF Login to Bookmark

Country of Origin

🇺🇸 United States

Repos / Data Links

github.com github.com

Page Count

31 pages

Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control

Lets computers share more honest answers.

Technical Abstract

Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations

Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics

Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs