Judging with Personality and Confidence: A Study on Personality-Conditioned LLM Relevance Assessment
By: Nuo Chen, Hanpei Fang, Piaohong Wang and more
Potential Business Impact:
AI personalities help judge search results better.
Recent studies have shown that prompting can enable large language models (LLMs) to simulate specific personality traits and produce behaviors that align with those traits. However, there is limited understanding of how these simulated personalities influence critical web search decisions, specifically relevance assessment. Moreover, few studies have examined how simulated personalities affect confidence calibration, specifically the tendencies toward overconfidence or underconfidence. This gap exists even though the psychological literature suggests these biases are trait-specific, often linking high extraversion to overconfidence and high neuroticism to underconfidence. To address this gap, we conducted a comprehensive study evaluating multiple LLMs, both commercial and open-source, prompted to simulate Big Five personality traits. We tested these models across three test collections (TREC DL 2019, TREC DL 2020, and LLMJudge), collecting two key outputs for each query-document pair: a relevance judgment and a self-reported confidence score. The findings show that certain personalities, such as low agreeableness, consistently align more closely with human labels than the unprompted condition. Additionally, low conscientiousness best balances the suppression of overconfidence and underconfidence. We also observe that relevance scores and confidence distributions vary systematically across personalities. Based on these findings, we incorporate personality-conditioned scores and confidence as features in a random forest classifier. This approach surpasses the best single-personality condition on a new dataset (TREC DL 2021), even with limited training data. These findings highlight that personality-derived confidence offers a complementary predictive signal, paving the way for more reliable and human-aligned LLM evaluators.
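The ensemble step described above can be sketched as follows. This is a minimal, hypothetical illustration (the personality names, score ranges, and synthetic data are assumptions, not the paper's actual setup): each query-document pair is represented by the relevance score and self-reported confidence produced under each personality condition, and a random forest is trained to predict the human label from those features.

```python
# Hypothetical sketch of the feature construction and classifier described
# in the abstract. Personality names, score scale (0-3), and all data here
# are illustrative placeholders, not the paper's actual configuration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
PERSONALITIES = ["low_agreeableness", "low_conscientiousness", "high_extraversion"]

n_pairs = 200
# One relevance judgment (0-3) and one confidence score (0-1) per personality,
# per query-document pair -> a 6-dimensional feature vector here.
X = np.column_stack(
    [rng.integers(0, 4, n_pairs) for _ in PERSONALITIES]
    + [rng.uniform(0.0, 1.0, n_pairs) for _ in PERSONALITIES]
)
y = rng.integers(0, 2, n_pairs)  # human relevance label (binarized for simplicity)

# Train on the first 150 pairs, predict on the remaining 50 held-out pairs.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:150], y[:150])
preds = clf.predict(X[150:])
print(preds.shape)  # (50,)
```

In practice the features would come from the LLM outputs on TREC DL 2019/2020 and LLMJudge rather than random draws, but the structure (personality-conditioned score plus confidence per condition, fed to a random forest) matches the approach the abstract describes.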
Similar Papers
Mitigating the Threshold Priming Effect in Large Language Model-Based Relevance Judgments via Personality Infusing
Computation and Language
Makes AI better at judging information fairly.
Evaluating LLM Alignment on Personality Inference from Real-World Interview Data
Computation and Language
Computers can't guess your personality from talking.
Mind Reading or Misreading? LLMs on the Big Five Personality Test
Computation and Language
Helps computers guess your personality from writing.