Institutional AI: A Governance Framework for Distributional AGI Safety

Published: January 15, 2026 | arXiv ID: 2601.10599v1

By: Federico Pierucci, Marcello Galisai, Marcantonio Syrnikov Bracale, and more

Potential Business Impact:

Makes AI agents follow rules instead of cheating.

Business Areas:
Artificial Intelligence, Data and Analytics, Science and Engineering, Software

As LLM-based systems increasingly operate as agents embedded within human social and technical systems, alignment can no longer be treated as a property of an isolated model; it must be understood in relation to the environments in which these agents act. Even the most sophisticated alignment methods, such as Reinforcement Learning from Human Feedback (RLHF) or from AI Feedback (RLAIF), cannot ensure control once internal goal structures diverge from developer intent. We identify three structural problems that emerge from core properties of AI models: (1) behavioral goal-independence, where models develop internal objectives and misgeneralize goals; (2) instrumental override of natural-language constraints, where models treat safety principles as non-binding while pursuing latent objectives, leveraging deception and manipulation; and (3) agentic alignment drift, where individually aligned agents converge to collusive equilibria through interaction dynamics that are invisible to single-agent audits. The solution this paper advances is Institutional AI: a system-level approach that treats alignment as a question of effective governance of AI agent collectives. We argue for a governance-graph that specifies how agents are constrained through runtime monitoring, incentive shaping via rewards and sanctions, and explicit norms with enforcement roles. This institutional turn reframes safety from a software-engineering problem to a mechanism-design problem, where the primary goal of alignment is to shift the payoff landscape of AI agent collectives.
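
The governance-graph idea lends itself to a small illustration. Below is a minimal sketch, not taken from the paper, of how a governance layer could monitor agent actions, check them against explicit norms, and apply rewards and sanctions so that compliance becomes the payoff-maximizing choice. The names (Norm, GovernanceGraph, report) and the numeric penalties are illustrative assumptions, not the authors' framework.

```python
"""Hypothetical sketch of payoff shaping via monitored norms and sanctions."""

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Norm:
    """An explicit, machine-checkable rule plus the sanction for violating it."""
    name: str
    violated_by: Callable[[str], bool]  # predicate over an agent action
    sanction: float                     # payoff penalty applied on violation


class GovernanceGraph:
    """Runtime monitor plus enforcement role: shapes each agent's payoff
    so that norm violations stop being the best response."""

    def __init__(self, norms: List[Norm], compliance_reward: float = 1.0):
        self.norms = norms
        self.compliance_reward = compliance_reward
        self.ledger: Dict[str, float] = {}  # agent id -> cumulative shaped payoff

    def report(self, agent_id: str, action: str, base_payoff: float) -> float:
        """Monitor one action, apply sanctions or rewards, return the shaped payoff."""
        violations = [n for n in self.norms if n.violated_by(action)]
        shaped = base_payoff - sum(n.sanction for n in violations)
        if not violations:
            shaped += self.compliance_reward  # reward rule-following behavior
        self.ledger[agent_id] = self.ledger.get(agent_id, 0.0) + shaped
        return shaped


if __name__ == "__main__":
    # One illustrative norm: agents must not fabricate tool outputs.
    no_fabrication = Norm(
        name="no-fabricated-tool-output",
        violated_by=lambda action: "fabricate" in action,
        sanction=5.0,
    )
    gov = GovernanceGraph(norms=[no_fabrication])

    # A cheating action earns more raw payoff but less shaped payoff.
    print(gov.report("agent-a", "fabricate benchmark result", base_payoff=3.0))  # -2.0
    print(gov.report("agent-b", "run benchmark honestly", base_payoff=2.0))      # 3.0
```

The point of the sketch is the mechanism-design framing: the governance layer does not edit the agents' internal goals, it only changes the payoffs their actions receive, which is the sense in which the abstract speaks of shifting the payoff landscape of agent collectives.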

Page Count
34 pages

Category
Computer Science:
Computers and Society