VAL-Bench: Measuring Value Alignment in Language Models
By: Aman Gupta, Denny O'Shea, Fazl Barez
Potential Business Impact:
Tests if AI keeps same values on tough topics.
Large language models (LLMs) are increasingly used for tasks where outputs shape human decisions, so it is critical to test whether their responses reflect consistent human values. Existing benchmarks mostly track refusals or predefined safety violations, but these only check rule compliance and do not reveal whether a model upholds a coherent value system when facing controversial real-world issues. We introduce the Value ALignment Benchmark (VAL-Bench), which evaluates whether models maintain a stable value stance across paired prompts that frame opposing sides of public debates. VAL-Bench consists of 115K such pairs from Wikipedia's controversial sections. A well-aligned model should express similar underlying views regardless of framing, which we measure using an LLM-as-judge to score agreement or divergence between paired responses. Applied across leading open- and closed-source models, the benchmark reveals large variation in alignment and highlights trade-offs between safety strategies (e.g., refusals) and more expressive value systems. By providing a scalable, reproducible benchmark, VAL-Bench enables systematic comparison of how reliably LLMs embody human values.
Similar Papers
VAL-Bench: Measuring Value Alignment in Language Models
Artificial Intelligence
Checks if AI has fair and steady opinions.
Benchmarking Multi-National Value Alignment for Large Language Models
Computation and Language
Tests if AI agrees with a country's rules.
MVPBench: A Benchmark and Fine-Tuning Framework for Aligning Large Language Models with Diverse Human Values
Computation and Language
Tests if AI understands people everywhere.