Many Minds from One Model: Bayesian Transformers for Population Intelligence
By: Diji Yang, Yi Zhang
Despite their scale and success, modern transformers are almost universally trained as single-minded systems: optimization produces one deterministic set of parameters, representing a single functional hypothesis about the data. Motivated by the idea that intelligence emerges from many minds, we propose Population Bayesian Transformers (B-Trans), which transform a standard Large Language Model into a Bayesian Transformer that supports sampling diverse yet coherent model instances from a single set of pre-trained weights. B-Trans introduces a Bayesian-motivated posterior proxy by treating the bias-like offsets in normalization layers as stochastic variables with a Gaussian variational approximation, inducing a distribution over model behavior without the cost of training a full Bayesian neural network. Sampling from this proxy yields a set of model instances with diverse behaviors that nonetheless retain general competence. To preserve coherence within each generation, we freeze the sampled noise at the sequence level, enforcing temporal consistency across tokens. B-Trans enables population-level decision-making, where aggregating predictions across sampled individuals significantly enhances exploration. Experiments across zero-shot generation, Reinforcement Learning with Verifiable Rewards (RLVR), and RL without explicit labels demonstrate that B-Trans effectively leverages the wisdom of crowds, yielding superior semantic diversity while achieving better task performance than deterministic baselines.
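The core mechanism described above is a Gaussian variational distribution over the bias-like offsets in normalization layers, with the noise drawn once per sequence and held fixed across tokens. The following is a minimal PyTorch-style sketch of that idea, not the authors' released implementation; the module name StochasticBiasLayerNorm, the parameter names mu and log_sigma, and the resample() interface are illustrative assumptions.

```python
import torch
import torch.nn as nn


class StochasticBiasLayerNorm(nn.Module):
    """LayerNorm whose bias-like offset is a Gaussian random variable
    (mean mu, log-std log_sigma), sampled once per sequence.

    Illustrative sketch of the abstract's description; parameterization
    and interface are assumptions, not the paper's exact code.
    """

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.mu = nn.Parameter(torch.zeros(dim))                  # variational mean of the bias
        self.log_sigma = nn.Parameter(torch.full((dim,), -3.0))   # variational log-std of the bias
        self.eps = eps
        self.register_buffer("noise", torch.zeros(dim))           # frozen per-sequence noise sample

    def resample(self):
        """Draw a fresh noise vector; kept fixed for the whole sequence
        so each sampled 'individual' stays temporally coherent."""
        self.noise = torch.randn_like(self.mu)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reparameterized bias sample: mu + sigma * frozen noise.
        bias = self.mu + self.log_sigma.exp() * self.noise
        x = (x - x.mean(-1, keepdim=True)) / torch.sqrt(
            x.var(-1, keepdim=True, unbiased=False) + self.eps
        )
        return x * self.weight + bias


# Usage sketch: draw one "individual" from the shared weights by resampling
# the noise in every such layer before generating a sequence; a population
# is obtained by repeating this and aggregating the resulting predictions.
# for module in model.modules():
#     if isinstance(module, StochasticBiasLayerNorm):
#         module.resample()
```

Under these assumptions, sampling a population amounts to calling resample() on each stochastic normalization layer once per generation, so every draw behaves as a distinct but internally consistent model instance whose outputs can then be aggregated for population-level decisions.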