A Decision-Theoretic Approach for Managing Misalignment
By: Daniel A. Herrmann, Abinav Chari, Isabelle Qian, and more
When should we delegate decisions to AI systems? While the value alignment literature has developed techniques for shaping AI values, less attention has been paid to determining, under uncertainty, when imperfect alignment is good enough to justify delegation. We argue that rational delegation requires balancing an agent's value (mis)alignment with its epistemic accuracy and its reach (the acts it has available). This paper introduces a formal, decision-theoretic framework to analyze this tradeoff, explicitly accounting for a principal's uncertainty about these factors. Our analysis reveals a sharp distinction between two delegation scenarios. First, universal delegation (trusting an agent with any problem) demands near-perfect value alignment and total epistemic trust, conditions rarely met in practice. Second, we show that context-specific delegation can be optimal even with significant misalignment: an agent's superior accuracy or expanded reach may grant access to better overall decision problems, making delegation rational in expectation. We develop a novel scoring framework to quantify this ex ante decision. Ultimately, our work provides a principled method for determining when an AI is aligned enough for a given context, shifting the focus from achieving perfect alignment to managing the risks and rewards of delegation under uncertainty.
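To make the tradeoff concrete, here is a minimal sketch of the kind of expected-utility comparison the abstract describes. It is not the paper's actual formalism: the states, utilities, accuracy, and misalignment probability below are all illustrative assumptions, chosen only to show how superior accuracy can make delegation rational in expectation despite a nontrivial chance of misalignment.

```python
import numpy as np

# Illustrative sketch (not the paper's formalism) of the delegation tradeoff:
# all numbers and names here are assumptions for the example.

states = [0, 1]                      # possible states of the world
p_state = np.array([0.5, 0.5])       # principal's credence over states

# Principal's utility for (act, state); act i "matches" state i.
U = np.array([[1.0, 0.0],            # act 0
              [0.0, 1.0]])           # act 1

agent_accuracy = 0.9                 # prob. the agent's signal reveals the true state
p_misaligned = 0.2                   # principal's credence that the agent's values differ

# --- Deciding without delegating: pick the act maximizing expected utility
# under the principal's own (uninformative) beliefs.
value_self = max(float(p_state @ U[a]) for a in range(U.shape[0]))

# --- Delegating: the agent sees an informative signal. If aligned, it picks the
# act the principal would want given that signal; if misaligned, it picks the
# act the principal would least want. Average over true states and signals.
value_delegate = 0.0
for s, ps in zip(states, p_state):
    for signal in states:
        p_signal = agent_accuracy if signal == s else 1 - agent_accuracy
        aligned_act = signal                 # match the believed state
        misaligned_act = 1 - signal          # worst case under misalignment
        value_delegate += ps * p_signal * (
            (1 - p_misaligned) * U[aligned_act, s]
            + p_misaligned * U[misaligned_act, s]
        )

print(f"expected value if the principal acts herself: {value_self:.2f}")
print(f"expected value of delegating:                 {value_delegate:.2f}")
# With these numbers delegation wins (0.74 vs 0.50) despite a 20% chance of
# misalignment, because the agent's accuracy buys a better decision problem.
```

The same comparison could be extended to model reach, for example by giving the agent acts the principal cannot take herself; the qualitative point is that the delegation decision is scored ex ante, against the principal's own utilities and uncertainty.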
Similar Papers
Friend or Foe: Delegating to an AI Whose Alignment is Unknown
Theoretical Economics
Helps doctors trust AI for patient treatment choices.
Multi-level Value Alignment in Agentic AI Systems: Survey and Perspectives
Artificial Intelligence
Makes AI agents follow human rules and values.
Rethinking How AI Embeds and Adapts to Human Values: Challenges and Opportunities
Artificial Intelligence
AI learns to change its mind with our values.