LLMs are Bayesian, in Expectation, not in Realization
By: Leon Chlon, Sarah Rashidi, Zein Khamis, and more
Potential Business Impact:
Helps AI systems give well-calibrated confidence estimates and answer with shorter, cheaper reasoning chains.
Large language models demonstrate remarkable in-context learning capabilities, adapting to new tasks without parameter updates. While this phenomenon has been successfully modeled as implicit Bayesian inference, recent empirical findings reveal a fundamental contradiction: transformers systematically violate the martingale property, a cornerstone requirement of Bayesian updating on exchangeable data. This violation challenges the theoretical foundations underlying uncertainty quantification in critical applications. Our theoretical analysis establishes four key results: (1) positional encodings induce martingale violations of order $\Theta(\log n / n)$; (2) transformers achieve information-theoretic optimality with excess risk $O(n^{-1/2})$ in expectation over orderings; (3) the implicit posterior representation converges to the true Bayesian posterior in the space of sufficient statistics; and (4) we derive the optimal chain-of-thought length as $k^* = \Theta(\sqrt{n}\log(1/\varepsilon))$ with explicit constants, providing a principled approach to reduce inference costs while maintaining performance. Empirical validation on GPT-3 confirms predictions (1)-(3), with transformers reaching 99% of theoretical entropy limits within 20 examples. Our framework provides practical methods for extracting calibrated uncertainty estimates from position-aware architectures and optimizing computational efficiency in deployment.
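To make result (4) concrete, here is a minimal sketch of how the $k^* = \Theta(\sqrt{n}\log(1/\varepsilon))$ scaling could be turned into a chain-of-thought budget. The function name `cot_budget` and the leading constant `c` are illustrative assumptions; the paper derives explicit constants that are not reproduced in this summary.

```python
import math

def cot_budget(n: int, eps: float, c: float = 1.0) -> int:
    """Chain-of-thought length suggested by k* = Theta(sqrt(n) * log(1/eps)).

    n   : number of in-context examples
    eps : target error tolerance
    c   : leading constant (hypothetical placeholder; the paper gives explicit values)
    """
    if not (0.0 < eps < 1.0):
        raise ValueError("eps must lie in (0, 1)")
    # Scale the budget with sqrt(n) and log(1/eps), rounding up to a whole step count.
    return max(1, math.ceil(c * math.sqrt(n) * math.log(1.0 / eps)))

if __name__ == "__main__":
    # Example: 20 in-context examples and a 1% tolerance -> 21 reasoning steps with c = 1.0.
    print(cot_budget(n=20, eps=0.01))
```

Under this reading, tightening the tolerance only grows the budget logarithmically, while the dominant cost scales with the square root of the number of in-context examples.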
Similar Papers
Latent Thought Models with Variational Bayes Inference-Time Computation
Computation and Language
Computers learn to think and reason better.
Evaluating the Use of Large Language Models as Synthetic Social Agents in Social Science Research
Artificial Intelligence
Makes AI better at guessing, not knowing for sure.
Transformers Are Universally Consistent
Machine Learning (CS)
Makes computers learn better from complex data.