Deriving Decoder-Free Sparse Autoencoders from First Principles
By: Alan Oursland
Potential Business Impact:
Enables simpler neural network models, trained without a decoder, whose learned features are easier to interpret.
Gradient descent on log-sum-exp (LSE) objectives performs implicit expectation-maximization (EM): the gradient with respect to each component output equals that component's responsibility. The same theory predicts collapse in the absence of a volume-control term analogous to the log-determinant in Gaussian mixture models. We instantiate the theory in a single-layer encoder trained with an LSE objective and InfoMax regularization for volume control. Experiments confirm the theory's predictions: the gradient-responsibility identity holds exactly; LSE alone collapses; a variance term prevents dead components; a decorrelation term prevents redundancy. The model exhibits EM-like optimization dynamics in which lower loss does not correspond to better features and adaptive optimizers offer no advantage. The resulting decoder-free model learns interpretable mixture components, confirming that implicit EM theory can prescribe architectures.
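As a concrete illustration of the gradient-responsibility identity, the sketch below (not the paper's code; the eight-component setup and the use of PyTorch autograd are assumptions made for illustration) checks numerically that the gradient of a negative log-sum-exp objective with respect to each component output equals minus that component's softmax responsibility.

```python
# Minimal sketch of the gradient-responsibility identity (illustration only).
# For the objective L(z) = -logsumexp(z), we have dL/dz_k = -softmax(z)_k,
# i.e. minus the EM responsibility of component k.
import torch

z = torch.randn(8, requires_grad=True)    # component outputs (log-scores); size 8 is arbitrary
loss = -torch.logsumexp(z, dim=0)         # LSE objective to be minimized
loss.backward()                           # populates z.grad

responsibilities = torch.softmax(z, dim=0).detach()  # EM-style posterior responsibilities
print(torch.allclose(z.grad, -responsibilities))     # True: gradient equals -responsibility
```

Under this view, a gradient step on the LSE objective updates each component in proportion to its responsibility, which is exactly the weighting an EM M-step would apply.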