Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
By: Ruikang Zhang, Shuo Wang, Qi Su
Potential Business Impact:
Enables precise, reversible control of an AI assistant's personality traits (e.g., making a model more agreeable or conscientious) without retraining.
Recent work in Mechanistic Interpretability (MI) has enabled the identification of, and intervention on, internal features in Large Language Models (LLMs). However, a persistent challenge lies in linking such internal features to the reliable control of complex, behavior-level semantic attributes in language generation. In this paper, we propose a Sparse Autoencoder-based framework for retrieving and steering semantically interpretable internal features associated with high-level linguistic behaviors. Our method employs a contrastive feature retrieval pipeline based on controlled semantic oppositions, combining statistical activation analysis with generation-based validation to distill monosemantic functional features from sparse activation spaces. Using the Big Five personality traits as a case study, we demonstrate that our method enables precise, bidirectional steering of model behavior while maintaining superior stability and performance compared to existing activation steering methods such as Contrastive Activation Addition (CAA). We further identify an empirical effect, which we term Functional Faithfulness, whereby intervening on a specific internal feature induces coherent and predictable shifts across multiple linguistic dimensions aligned with the target semantic attribute. Our findings suggest that LLMs internalize deeply integrated representations of high-order concepts, and they offer a novel, robust mechanistic path for regulating complex AI behaviors.
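To make the pipeline concrete, here is a minimal Python sketch of the two stages the abstract describes: contrastive retrieval of candidate SAE features from prompt sets with opposed semantics (e.g., high vs. low agreeableness), followed by additive steering along a selected feature's decoder direction at generation time. All identifiers here (the SparseAutoencoder interface, encode, decoder_direction, top_k, alpha) are illustrative assumptions, not the authors' released code.

```python
import torch

class SparseAutoencoder(torch.nn.Module):
    """Toy SAE interface; a real SAE is trained on residual-stream activations."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, n_features)
        self.dec = torch.nn.Linear(n_features, d_model, bias=False)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.enc(x))            # sparse, non-negative codes

    def decoder_direction(self, feature_id: int) -> torch.Tensor:
        return self.dec.weight[:, feature_id]     # [d_model] feature direction


def retrieve_contrastive_features(sae, acts_pos, acts_neg, top_k=10):
    """Stage 1: statistical activation analysis. Rank features by the mean
    activation gap between prompts expressing a trait (acts_pos) and its
    semantic opposite (acts_neg); both are [n_examples, d_model] tensors of
    activations collected at a fixed layer."""
    gap = sae.encode(acts_pos).mean(0) - sae.encode(acts_neg).mean(0)
    scores, idx = torch.topk(gap.abs(), k=top_k)
    return idx, gap[idx]                          # candidate ids + signed gaps


def steering_hook(sae, feature_id, alpha):
    """Stage 2: bidirectional steering. Returns a forward hook that adds a
    scaled decoder direction to a layer's output; alpha > 0 amplifies the
    trait, alpha < 0 suppresses it."""
    direction = sae.decoder_direction(feature_id).detach()
    def hook(module, inputs, output):
        return output + alpha * direction         # assumes output is [..., d_model]
    return hook
```

A retrieved feature could then be steered by attaching the hook to the chosen layer, e.g. `layer.register_forward_hook(steering_hook(sae, feature_id, 4.0))`. In the full pipeline, candidates from stage 1 would additionally pass generation-based validation: sampling continuations with each feature steered and keeping only features whose intervention reliably shifts the target attribute.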
Similar Papers
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
Computation and Language
Uses mutual information to explain and steer features found by sparse autoencoders.
LinguaLens: Towards Interpreting Linguistic Mechanisms of Large Language Models via Sparse Auto-Encoder
Computation and Language
Uses sparse autoencoders to interpret the linguistic mechanisms inside large language models.
Controllable LLM Reasoning via Sparse Autoencoder-Based Steering
Artificial Intelligence
Controls LLM reasoning behavior through sparse autoencoder-based steering.