Bayesian Evaluation of Large Language Model Behavior
By: Rachel Longjohn, Shang Wu, Saatvik Kher, and more
Potential Business Impact:
Measures AI honesty and safety more accurately.
It is increasingly important to evaluate how text generation systems based on large language models (LLMs) behave, such as their tendency to produce harmful output or their sensitivity to adversarial inputs. Such evaluations often rely on a curated benchmark set of input prompts provided to the LLM, where the output for each prompt may be assessed in a binary fashion (e.g., harmful/non-harmful or does not leak/leaks sensitive information), and the aggregation of binary scores is used to evaluate the LLM. However, existing approaches to evaluation often neglect statistical uncertainty quantification. With an applied statistics audience in mind, we provide background on LLM text generation and evaluation, and then describe a Bayesian approach for quantifying uncertainty in binary evaluation metrics. We focus in particular on uncertainty that is induced by the probabilistic text generation strategies typically deployed in LLM-based systems. We present two case studies applying this approach: 1) evaluating refusal rates on a benchmark of adversarial inputs designed to elicit harmful responses, and 2) evaluating pairwise preferences of one LLM over another on a benchmark of open-ended interactive dialogue examples. We demonstrate how the Bayesian approach can provide useful uncertainty quantification about the behavior of LLM-based systems.
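To make the idea concrete, the sketch below illustrates one standard Bayesian treatment of a binary evaluation metric such as a refusal rate: a Beta prior combined with Binomial benchmark outcomes yields a posterior distribution and a credible interval for the underlying rate. This is a minimal sketch under an assumed conjugate Beta-Binomial model, not necessarily the authors' implementation; the data, variable names, and prior choice are illustrative placeholders.

```python
# Minimal sketch (illustrative, not the paper's code) of Bayesian uncertainty
# quantification for a binary LLM evaluation metric, e.g., refusal rate on an
# adversarial benchmark, assuming a conjugate Beta-Binomial model.
import numpy as np
from scipy import stats

# Hypothetical benchmark results: each prompt's output scored 1 (refusal) or 0
# (non-refusal). Simulated here purely for illustration.
rng = np.random.default_rng(0)
scores = rng.binomial(1, 0.85, size=200)
k, n = scores.sum(), scores.size

# Beta(1, 1) prior on the underlying refusal rate; conjugacy gives a
# Beta(1 + k, 1 + n - k) posterior.
posterior = stats.beta(1 + k, 1 + n - k)

point_estimate = posterior.mean()
ci_low, ci_high = posterior.ppf([0.025, 0.975])  # 95% credible interval

print(f"Posterior mean refusal rate: {point_estimate:.3f}")
print(f"95% credible interval: ({ci_low:.3f}, {ci_high:.3f})")
```

The same template extends to the paper's second case study: a pairwise-preference benchmark can be scored as 1 when one LLM's response is preferred over the other's, and the posterior then describes uncertainty about the preference rate. Because LLM outputs are sampled probabilistically, repeated generations per prompt could also be folded in (e.g., by averaging or modeling per-prompt outcome probabilities), though the exact treatment of generation-induced uncertainty here is an assumption rather than a restatement of the paper's method.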
Similar Papers
Textual Bayes: Quantifying Uncertainty in LLM-Based Systems
Machine Learning (CS)
Makes AI smarter and more honest about what it knows.
Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges
Computation and Language
Tests AI better, even with less data.
Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review
Computation and Language
Makes AI tell the truth, not make things up.