PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?
By: Lingfeng Zhou, Jialing Zhang, Jin Gao, and more
Potential Business Impact:
Helps AI systems reliably figure out which character is speaking in a dialogue, a prerequisite for judging role-play quality.
Current role-play studies often rely on unvalidated LLM-as-a-judge paradigms, which may fail to reflect how humans perceive role fidelity. A key prerequisite for human-aligned evaluation is role identification: the ability to recognize who is speaking from the dialogue context. We argue that any meaningful judgment of role-playing quality (how well a character is played) fundamentally depends on first attributing words and actions to the correct persona (who is speaking). We present PersonaEval, the first benchmark designed to test whether LLM evaluators can reliably identify human roles. PersonaEval uses human-authored dialogues from novels, scripts, and video transcripts, challenging models to identify the correct persona from the conversation context. Our experiments, including a human study, show that even the best-performing LLMs reach only around 69% accuracy, well below the level needed for reliable evaluation. In contrast, human participants perform near ceiling at 90.8% accuracy, highlighting that current LLM evaluators are still not human enough to judge role-play effectively. To better understand this gap, we examine training-time adaptation and test-time compute, suggesting that reliable evaluation requires more than task-specific tuning; it also depends on strong, human-like reasoning abilities in LLM evaluators. We release our benchmark at https://github.com/maple-zhou/PersonaEval.
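The benchmark task reduces to a multiple-choice attribution problem: given a dialogue excerpt and a set of candidate personas, the evaluator must name who is speaking, and accuracy is measured against the human-authored ground truth. Below is a minimal sketch of that evaluation loop. The data fields (`context`, `candidates`, `label`), the file name, and the `query_llm` helper are illustrative assumptions, not the benchmark's actual harness; the real setup is in the linked repository.

```python
# Sketch of a role-identification evaluation loop in the spirit of PersonaEval.
# Data schema and query_llm() are hypothetical; see the PersonaEval repo for
# the actual benchmark format and evaluation code.
import json


def build_prompt(context: str, candidates: list[str]) -> str:
    """Ask the model to attribute the dialogue to one of the candidate personas."""
    options = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(candidates))
    return (
        "Given the dialogue context below, decide which persona is speaking.\n\n"
        f"Dialogue context:\n{context}\n\n"
        f"Candidate personas:\n{options}\n\n"
        "Answer with only the name of the correct persona."
    )


def evaluate(examples: list[dict], query_llm) -> float:
    """Accuracy of an LLM evaluator on role identification (who is speaking)."""
    correct = 0
    for ex in examples:
        prediction = query_llm(build_prompt(ex["context"], ex["candidates"])).strip()
        correct += int(prediction == ex["label"])
    return correct / len(examples)


if __name__ == "__main__":
    # Hypothetical data file: one JSON object per line with the fields used above.
    with open("personaeval_examples.jsonl") as f:
        examples = [json.loads(line) for line in f]
    # query_llm would wrap whichever model is being tested as an evaluator, e.g.
    # a call to a chat-completion API; then:
    # print(evaluate(examples, query_llm))
```

An LLM evaluator scoring around 69% on this loop, while humans score 90.8%, is the gap the paper highlights.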
Similar Papers
Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues
Computation and Language
Computer-written replies fall short of human-written ones as role-play conversations go on.
Misalignment of LLM-Generated Personas with Human Perceptions in Low-Resource Settings
Computers and Society
AI-made personas don't line up with how real people perceive them.
LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation
Computation and Language
Lets AI systems judge each other's answers fairly.