When the Coffee Feature Activates on Coffins: An Analysis of Feature Extraction and Steering for Mechanistic Interpretability
By: Raphael Ronge, Markus Maier, Frederick Eberhardt
Potential Business Impact:
Makes AI easier to understand and control.
Recent work by Anthropic on mechanistic interpretability claims to understand and control large language models by extracting human-interpretable features from their neural activation patterns using sparse autoencoders (SAEs). If successful, this approach offers one of the most promising routes to human oversight in AI safety. We conduct an initial stress-test of these claims by replicating their main results with open-source SAEs for Llama 3.1. While we successfully reproduce basic feature extraction and steering capabilities, our investigation suggests that major caution is warranted regarding the generalizability of these claims. We find that feature steering exhibits substantial fragility, with sensitivity to layer selection, steering magnitude, and context. We observe non-standard activation behavior and demonstrate the difficulty of distinguishing thematically similar features from one another. While SAE-based interpretability produces compelling demonstrations in selected cases, current methods often fall short of the systematic reliability required for safety-critical applications. This suggests a necessary shift in focus from interpretability of internal representations toward reliable prediction and control of model outputs. Our work contributes to a more nuanced understanding of what mechanistic interpretability has achieved and highlights fundamental challenges for AI safety that remain unresolved.
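To make the kind of intervention discussed above concrete, here is a minimal sketch (not the authors' code) of SAE-style feature steering on a Llama model: a feature direction is added to the residual stream at one layer, scaled by a steering coefficient. The model checkpoint, layer index, coefficient, and the random unit vector standing in for a real SAE decoder direction are all illustrative assumptions; in practice the direction would come from the trained open-source SAE's decoder weights.

```python
# Illustrative sketch of residual-stream feature steering (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B"  # assumed checkpoint; any Llama-style model works
LAYER_IDX = 16                          # steering is reportedly sensitive to this choice
COEFF = 8.0                             # steering magnitude; another sensitive knob

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

hidden_dim = model.config.hidden_size

# Placeholder for an SAE feature direction: in practice, one row/column of the
# trained SAE's decoder matrix corresponding to the chosen feature.
feature_direction = torch.randn(hidden_dim, dtype=model.dtype)
feature_direction = feature_direction / feature_direction.norm()

def steering_hook(module, inputs, output):
    # Llama decoder layers return a tuple; element 0 is the residual-stream hidden states.
    hidden = output[0] + COEFF * feature_direction.to(output[0].device)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER_IDX].register_forward_hook(steering_hook)

prompt = "Tell me about your morning routine."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook so later generations are unsteered
```

Sweeping LAYER_IDX and COEFF in a loop over fixed prompts is one simple way to probe the fragility the abstract describes, since steered outputs can change qualitatively with either knob.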
Similar Papers
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
Computation and Language
Changes AI personality to be more helpful.
Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
Machine Learning (CS)
Makes AI understand and control ideas better.
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
Computation and Language
Makes AI understand its own thoughts better.