Approximating Human Preferences Using a Multi-Judge Learned System
By: Eitán Sprejer, Fernando Avalos, Augusto Bernardi, and more
Potential Business Impact:
Helps AI systems better approximate what individual people actually prefer.
Aligning LLM-based judges with human preferences is a significant challenge, as they are difficult to calibrate and often suffer from rubric sensitivity, bias, and instability. Overcoming this challenge advances key applications, such as creating reliable reward models for Reinforcement Learning from Human Feedback (RLHF) and building effective routing systems that select the best-suited model for a given user query. In this work, we propose a framework for modeling diverse, persona-based preferences by learning to aggregate outputs from multiple rubric-conditioned judges. We investigate the performance of this approach against naive baselines and assess its robustness through case studies on both human and LLM-judge biases. Our primary contributions include a persona-based method for synthesizing preference labels at scale and two distinct implementations of our aggregator: a Generalized Additive Model (GAM) and a Multi-Layer Perceptron (MLP).
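The core idea of the aggregator can be illustrated with a minimal sketch: each response receives scores from several rubric-conditioned judges, and a small learned model maps that score vector to a single preference score. The sketch below uses the MLP variant with synthetic data; the judge count, network size, and training setup are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_judges, n_samples, hidden = 4, 512, 8

# Synthetic judge scores in [0, 1], and a ground-truth preference that
# weights some judges more than others (a stand-in for persona labels).
X = rng.uniform(0, 1, (n_samples, n_judges))
true_w = np.array([0.5, 0.3, 0.15, 0.05])
y = X @ true_w + rng.normal(0, 0.02, n_samples)

# One-hidden-layer MLP aggregator trained with full-batch gradient
# descent on mean squared error.
W1 = rng.normal(0, 0.1, (n_judges, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, hidden); b2 = 0.0
lr = 0.5

losses = []
for _ in range(300):
    h = np.tanh(X @ W1 + b1)          # hidden activations
    pred = h @ W2 + b2                # aggregated preference score
    err = pred - y
    losses.append(float(np.mean(err ** 2)))
    # Backpropagate through both layers.
    gW2 = h.T @ err / n_samples
    gb2 = err.mean()
    dh = np.outer(err, W2) * (1 - h ** 2)
    gW1 = X.T @ dh / n_samples
    gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The GAM variant would replace the MLP with a sum of per-judge smooth functions, trading some flexibility for interpretability of each judge's contribution.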
Similar Papers
Bridging Human and LLM Judgments: Understanding and Narrowing the Gap
Machine Learning (CS)
Makes AI judges agree better with people.
Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments
Artificial Intelligence
Helps computers judge other computers better.
Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems
Artificial Intelligence
Makes AI judges fairer and more trustworthy.