Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders
By: Dong Shu, Xuansheng Wu, Haiyan Zhao, and more
Potential Business Impact:
Finds key ideas in AI brains for better control.
Sparse Autoencoders (SAEs) have recently emerged as powerful tools for interpreting and steering the internal representations of large language models (LLMs). However, conventional approaches to analyzing SAEs typically rely solely on input-side activations, without considering the causal influence of each latent feature on the model's output. This work builds on two key hypotheses: (1) activated latents do not contribute equally to the construction of the model's output, and (2) only latents with high causal influence are effective for model steering. To validate these hypotheses, we propose Gradient Sparse Autoencoder (GradSAE), a simple yet effective method that identifies the most influential latents by incorporating output-side gradient information.
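A minimal sketch of the idea, not the authors' implementation: splice a toy SAE into the forward pass so its latents lie on the path to the output, then score each latent by its activation weighted by the gradient of an output-side objective. The ToySAE class, the unembed stand-in for the LM head, and the activation-times-gradient score below are illustrative assumptions; the paper's exact scoring rule and model hooks may differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_sae, vocab = 16, 64, 50

class ToySAE(nn.Module):
    """Hypothetical stand-in for a trained sparse autoencoder."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def encode(self, h):
        return torch.relu(self.enc(h))   # sparse latent activations

    def decode(self, z):
        return self.dec(z)               # reconstruction fed back to the model

sae = ToySAE()
unembed = nn.Linear(d_model, vocab)      # stand-in for the LM head

h = torch.randn(1, 8, d_model)           # hidden states at the hooked layer
z = sae.encode(h)                        # input-side latent activations
z.retain_grad()                          # keep gradients on this non-leaf tensor
logits = unembed(sae.decode(z))          # continue the forward pass to the output

# Output-side signal: log-prob of the model's own top prediction at the last position.
target = logits[0, -1].argmax()
loss = torch.log_softmax(logits[0, -1], dim=-1)[target]
loss.backward()

# Influence score per latent: activation weighted by its output-side gradient,
# aggregated over batch and sequence positions.
influence = (z * z.grad).abs().sum(dim=(0, 1))   # shape (d_sae,)
top_vals, top_idx = influence.topk(5)
print("most influential latents:", top_idx.tolist())
```

Under hypothesis (2), the latents with the largest scores would be the candidates for steering; a latent that activates strongly on the input but receives a near-zero output gradient is ranked low despite its activation.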
Similar Papers
Scaling sparse feature circuit finding for in-context learning
Machine Learning (CS)
Finds how computers learn tasks from examples.
Dense SAE Latents Are Features, Not Bugs
Machine Learning (CS)
Helps understand how computers "think" about words.
On the Theoretical Understanding of Identifiable Sparse Autoencoders and Beyond
Machine Learning (CS)
Unlocks AI's hidden thoughts for better understanding.