Structural Inference: Interpreting Small Language Models with Susceptibilities
By: Garrett Baker, George Wang, Jesse Hoogland, and more
Potential Business Impact:
Identifies which words most influence specific parts of a language model.
We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. We combine these susceptibilities into a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer.
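The abstract's key quantitative idea, that a susceptibility is a first-order posterior response which factorizes into signed per-token attribution scores, can be sketched numerically. In a Bayesian linear-response setting, the first-order change in the posterior expectation of an observable under a data-distribution shift is (up to sign and temperature conventions) the posterior covariance of the observable with the shift in log-likelihood, estimated here over posterior draws. The function names, the synthetic data, and the specific covariance convention below are illustrative assumptions, not the paper's implementation; in particular, the "SGLD draws" are stand-in random samples.

```python
import numpy as np

def susceptibility(obs, delta_loglik):
    # Linear-response estimate: the first-order change in the posterior
    # expectation of `obs` under a reweighting of the data distribution
    # is the posterior covariance of the observable with the shift in
    # log-likelihood (sign/temperature conventions vary by setup).
    obs = np.asarray(obs, dtype=float)
    dll = np.asarray(delta_loglik, dtype=float)
    return np.mean((obs - obs.mean()) * (dll - dll.mean()))

def per_token_susceptibilities(obs, delta_loglik_tokens):
    # delta_loglik_tokens: (n_samples, n_tokens); each column holds one
    # token's contribution to the perturbation's log-likelihood shift.
    # Because covariance is linear, the total susceptibility factorizes
    # into signed per-token terms usable as attribution scores.
    obs = np.asarray(obs, dtype=float)
    dll = np.asarray(delta_loglik_tokens, dtype=float)
    centered_obs = obs - obs.mean()
    centered_dll = dll - dll.mean(axis=0)
    return centered_obs @ centered_dll / len(obs)

# Synthetic stand-in for SGLD posterior draws: 512 samples, 4 tokens.
rng = np.random.default_rng(0)
dll_tok = rng.normal(size=(512, 4))
# Make the observable couple strongly to token 0 only.
obs = 2.0 * dll_tok[:, 0] + rng.normal(scale=0.1, size=512)

chi_tok = per_token_susceptibilities(obs, dll_tok)
chi_total = susceptibility(obs, dll_tok.sum(axis=1))
```

By linearity of covariance, the per-token scores sum exactly to the total susceptibility, which is what makes them coherent attribution scores; stacking such susceptibilities across perturbation directions and network components would give the response matrix described above.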
Similar Papers
TinyGraphEstimator: Adapting Lightweight Language Models for Graph Structure Inference
Machine Learning (CS)
Small AI learns to understand how things connect.
A Statistical Physics of Language Model Reasoning
Artificial Intelligence
Explains how AI thinks, predicts mistakes.
Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models
Computation and Language
Models store word meanings early, grammar later.