Measuring Aleatoric and Epistemic Uncertainty in LLMs: Empirical Evaluation on ID and OOD QA Tasks
By: Kevin Wang, Subre Abdoul Moktar, Jia Li, and more
Potential Business Impact:
Helps computers know when they are unsure.
Large Language Models (LLMs) have become increasingly pervasive, finding applications across many industries and disciplines. Ensuring the trustworthiness of LLM outputs is therefore paramount, and Uncertainty Estimation (UE) plays a key role. This work conducts a comprehensive empirical study of the robustness and effectiveness of diverse UE measures for aleatoric and epistemic uncertainty in LLMs. The study covers twelve UE methods and four generation quality metrics, including LLMScore from LLM criticizers, to evaluate the uncertainty of LLM-generated answers on Question-Answering (QA) tasks over both in-distribution (ID) and out-of-distribution (OOD) datasets. The analysis reveals that information-based methods, which leverage token and sequence probabilities, perform exceptionally well in ID settings because they align with the model's understanding of the data. Conversely, density-based methods and the P(True) metric perform best in OOD settings, highlighting their effectiveness in capturing the model's epistemic uncertainty. Semantic consistency methods, which assess variability among generated answers, perform reliably across datasets and generation metrics, though they are not optimal in every situation.
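To make the information-based family concrete, the sketch below scores an answer by the average negative log-likelihood of its tokens under a causal LM, together with the raw sequence log-probability. It is a minimal illustration of this class of measures, not the paper's exact implementation; the model name, question, and answer are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any Hugging Face causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = "Q: What is the capital of France?\nA:"
answer = " Paris is the capital of France."

# Tokenize prompt and answer separately so only the answer is scored.
q_ids = tokenizer(question, return_tensors="pt").input_ids
a_ids = tokenizer(answer, return_tensors="pt", add_special_tokens=False).input_ids
input_ids = torch.cat([q_ids, a_ids], dim=1)

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)

# Log-probability the model assigns to each next token given its context.
log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
targets = input_ids[:, 1:]
token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

# Keep only the positions that predict answer tokens.
answer_log_probs = token_log_probs[:, q_ids.shape[1] - 1:]

avg_nll = -answer_log_probs.mean().item()     # length-normalized; higher = more uncertain
seq_log_prob = answer_log_probs.sum().item()  # unnormalized sequence log-probability

print(f"Average token NLL: {avg_nll:.3f}")
print(f"Sequence log-probability: {seq_log_prob:.3f}")

A higher average NLL signals a less confident generation; normalizing by answer length avoids penalizing longer answers, a common choice among information-based UE scores of this kind.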
Similar Papers
Reconsidering LLM Uncertainty Estimation Methods in the Wild
Machine Learning (CS)
Helps AI know when it's making things up.
Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review
Computation and Language
Makes AI tell the truth, not make things up.
Mapping Clinical Doubt: Locating Linguistic Uncertainty in LLMs
Computation and Language
Helps AI understand when doctors are unsure.