SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks

Published: December 17, 2025 | arXiv ID: 2512.15938v1

By: Vegard Flovik

Deep neural networks achieve impressive performance but remain difficult to interpret and control. We present SALVE (Sparse Autoencoder-Latent Vector Editing), a unified "discover, validate, and control" framework that bridges mechanistic interpretability and model editing. Using an $\ell_1$-regularized autoencoder, we learn a sparse, model-native feature basis without supervision. We validate these features with Grad-FAM, a feature-level saliency mapping method that visually grounds latent features in input data. Leveraging the autoencoder's structure, we perform precise and permanent weight-space interventions, enabling continuous modulation of both class-defining and cross-class features. We further derive a critical suppression threshold, $\alpha_{\mathrm{crit}}$, quantifying each class's reliance on its dominant feature, supporting fine-grained robustness diagnostics. Our approach is validated on both convolutional (ResNet-18) and transformer-based (ViT-B/16) models, demonstrating consistent, interpretable control over their behavior. This work contributes a principled methodology for turning feature discovery into actionable model edits, advancing the development of transparent and controllable AI systems.
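To make the two central ideas of the abstract concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes a single-layer ReLU encoder and a linear decoder, a squared-error reconstruction term plus an $\ell_1$ penalty on the latent code, and it illustrates a permanent weight-space edit as rescaling one decoder column. All names (`sae_loss`, `scale_feature`, `lam`, `scale`) are hypothetical, and the paper's actual architecture, loss weighting, and editing rule may differ.

```python
def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def encode(x, W_enc, b_enc):
    # Assumed encoder: z = ReLU(W_enc x + b_enc).
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]

def decode(z, W_dec):
    # Assumed linear decoder: x_hat = W_dec z.
    return matvec(W_dec, z)

def sae_loss(x, W_enc, b_enc, W_dec, lam):
    # Squared reconstruction error plus lam times the l1 norm of the code.
    z = encode(x, W_enc, b_enc)
    x_hat = decode(z, W_dec)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(a) for a in z)
    return recon + lam * sparsity

def scale_feature(W_dec, k, scale):
    # Hypothetical weight-space edit: rescale decoder column k.
    # scale = 1.0 leaves the feature untouched, scale = 0.0 removes
    # its contribution to the reconstruction entirely.
    return [[w * scale if j == k else w for j, w in enumerate(row)]
            for row in W_dec]

# Toy usage with identity weights on a 2-dimensional input.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 2.0]
z = encode(x, identity, [0.0, 0.0])
loss = sae_loss(x, identity, [0.0, 0.0], identity, lam=0.5)
edited = scale_feature(identity, k=1, scale=0.0)
x_suppressed = decode(z, edited)
```

Because the decoder is linear, rescaling column `k` is equivalent to rescaling latent `k` on every input, which is why an edit of this kind can be stored in the weights rather than applied at inference time. Sweeping `scale` from 1 toward 0 and recording where a class prediction flips is one way to picture a suppression threshold in the spirit of $\alpha_{\mathrm{crit}}$, though the abstract does not specify how that threshold is derived.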

SAVE: Sparse Autoencoder-Driven Visual Information Enhancement for Mitigating Object Hallucination

CV and Pattern Recognition

Stops AI from making up fake objects in pictures.

8 Dec 2025 1

89%

Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders

Machine Learning (CS)

Helps AI understand medical words better.

12 Aug 2025 0

89%

AlignSAE: Concept-Aligned Sparse Autoencoders

Machine Learning (CS)

Lets AI understand and change specific ideas easily.

1 Dec 2025 1
