Distribution-Aware Feature Selection for SAEs
By: Narmeen Oozeer, Nirmalendu Prakash, Michael Lan, and more
Potential Business Impact:
Helps computers understand ideas better by picking key parts.
Sparse autoencoders (SAEs) decompose neural activations into interpretable features. A widely adopted variant, the TopK SAE, reconstructs each token from its K most active latents. However, this approach is inefficient, as some tokens carry more information than others. BatchTopK addresses this limitation by selecting the top activations across a batch of tokens. This improves average reconstruction but risks an "activation lottery," where rare high-magnitude features crowd out more informative but lower-magnitude ones. To address this issue, we introduce Sampled-SAE: we score the columns (representing features) of the batch activation matrix (via $L_2$ norm or entropy) to form a candidate pool of $Kl$ features, and then apply Top-$K$ selection across the batch's tokens, restricted to that pool. Varying $l$ traces a spectrum between batch-level and token-specific selection. At $l=1$, tokens draw only from the $K$ globally most influential features, while larger $l$ expands the pool toward standard BatchTopK and more token-specific features across the batch. Small $l$ thus enforces global consistency; large $l$ favors fine-grained reconstruction. On Pythia-160M, no single value of $l$ is optimal across all metrics: the best choice depends on the trade-off between shared structure, reconstruction fidelity, and downstream performance. Sampled-SAE thus reframes BatchTopK as a tunable, distribution-aware family.
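The selection step described in the abstract can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the tensor name `acts`, the function `sampled_sae_mask`, and the exact form of the entropy score are assumptions; only the overall recipe (score feature columns, keep a pool of $Kl$ features, then apply BatchTopK-style selection within the pool) follows the abstract.

```python
# Minimal sketch of the Sampled-SAE selection step (assumed API, not the paper's code).
import torch

def sampled_sae_mask(acts: torch.Tensor, k: int, l: int,
                     score: str = "l2") -> torch.Tensor:
    """Return a 0/1 mask keeping batch-level Top-K activations drawn only
    from the K*l highest-scoring feature columns.

    acts: encoder activations for one batch, shape [n_tokens, n_features].
    k:    per-token sparsity target; l: pool-expansion factor.
    """
    n_tokens, n_features = acts.shape

    # 1) Score each feature column across the batch.
    if score == "l2":
        col_scores = (acts ** 2).sum(dim=0).sqrt()          # [n_features]
    elif score == "entropy":
        # One plausible reading of the entropy score (assumption):
        # entropy of each column's normalized activation distribution.
        p = acts.abs() / (acts.abs().sum(dim=0, keepdim=True) + 1e-8)
        col_scores = -(p * (p + 1e-8).log()).sum(dim=0)      # [n_features]
    else:
        raise ValueError(f"unknown score: {score}")

    # 2) Restrict to a candidate pool of K*l features.
    pool_size = min(k * l, n_features)
    pool_idx = col_scores.topk(pool_size).indices            # [pool_size]

    # 3) BatchTopK within the pool: keep the n_tokens*K largest activations
    #    across the whole batch, zeroing everything else.
    pooled = acts[:, pool_idx]                                # [n_tokens, pool_size]
    flat = pooled.flatten()
    n_keep = min(n_tokens * k, flat.numel())
    thresh = flat.topk(n_keep).values.min()

    mask = torch.zeros_like(acts)
    mask[:, pool_idx] = (pooled >= thresh).to(acts.dtype)
    return mask
```

In this sketch, $l=1$ collapses the pool to the $K$ highest-scoring columns, so every token reconstructs from the same global feature set; as $l$ grows the pool approaches the full feature set and the behavior approaches standard BatchTopK, matching the spectrum described in the abstract.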
Similar Papers
Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone
Machine Learning (CS)
Makes AI fairer by changing how it learns.
Sparse Autoencoders Trained on the Same Data Learn Different Features
Machine Learning (CS)
AI finds different "thinking parts" each time.
AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations
Machine Learning (CS)
Makes AI understand what it's reading better.