A Framework for Inherently Safer AGI through Language-Mediated Active Inference
By: Bo Wen
Potential Business Impact:
Makes smart computers safer by design.
This paper proposes a novel framework for developing safe Artificial General Intelligence (AGI) by combining Active Inference principles with Large Language Models (LLMs). We argue that traditional approaches to AI safety, focused on post-hoc interpretability and reward engineering, have fundamental limitations. We present an architecture where safety guarantees are integrated into the system's core design through transparent belief representations and hierarchical value alignment. Our framework leverages natural language as a medium for representing and manipulating beliefs, enabling direct human oversight while maintaining computational tractability. The architecture implements a multi-agent system where agents self-organize according to Active Inference principles, with preferences and safety constraints flowing through hierarchical Markov blankets. We outline specific mechanisms for ensuring safety, including: (1) explicit separation of beliefs and preferences in natural language, (2) bounded rationality through resource-aware free energy minimization, and (3) compositional safety through modular agent structures. The paper concludes with a research agenda centered on the Abstraction and Reasoning Corpus (ARC) benchmark, proposing experiments to validate our framework's safety properties. Our approach offers a path toward AGI that is inherently safer by design, rather than retrofitted with safety measures.
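The abstract names concrete mechanisms but no implementation. As an illustrative aid only, the minimal Python sketch below shows what mechanism (1), the explicit separation of beliefs and preferences, together with plain variational free energy minimization, could look like for a tiny discrete-state agent. All names (ActiveInferenceAgent, STATES, preferred_obs, and so on) are hypothetical choices for this sketch, not from the paper, and the resource-aware and multi-agent aspects are omitted.

    import numpy as np

    # Hypothetical natural-language labels: the paper's point is that
    # beliefs are stated in language a human overseer can audit directly.
    STATES = ["room is safe", "room has an obstacle"]
    OBSERVATIONS = ["sensor reads clear", "sensor reads blocked"]

    class ActiveInferenceAgent:
        # Beliefs (q_states) and preferences (preferred_obs) live in
        # separate, individually inspectable structures -- mechanism (1).
        def __init__(self, likelihood, prior, preferred_obs):
            self.likelihood = likelihood        # p(o|s), rows indexed by observation
            self.prior = prior                  # p(s)
            self.preferred_obs = preferred_obs  # goal distribution over observations
            self.q_states = prior.copy()        # current belief q(s)

        def infer(self, obs_idx):
            # Exact Bayesian update; tractable here because the state
            # space is tiny. With the exact posterior, the variational
            # free energy below reaches its minimum, -ln p(o).
            joint = self.likelihood[obs_idx] * self.prior
            self.q_states = joint / joint.sum()
            return self.q_states

        def free_energy(self, obs_idx):
            # F = KL[q(s) || p(s)] - E_q[ln p(o|s)]
            q = self.q_states
            kl = float(np.sum(q * np.log(q / self.prior)))
            expected_loglik = float(np.sum(q * np.log(self.likelihood[obs_idx])))
            return kl - expected_loglik

        def explain_beliefs(self):
            # Natural-language rendering of beliefs for human oversight.
            return {STATES[i]: round(float(p), 3)
                    for i, p in enumerate(self.q_states)}

    likelihood = np.array([[0.9, 0.2],    # p("sensor reads clear"   | each state)
                           [0.1, 0.8]])   # p("sensor reads blocked" | each state)
    agent = ActiveInferenceAgent(
        likelihood=likelihood,
        prior=np.array([0.5, 0.5]),
        preferred_obs=np.array([0.99, 0.01]),  # strongly prefers "sensor reads clear"
    )
    agent.infer(obs_idx=0)                         # observe "sensor reads clear"
    print(agent.explain_beliefs())                 # {'room is safe': 0.818, ...}
    print(round(agent.free_energy(obs_idx=0), 3))  # 0.598, i.e. -ln 0.55

Running infer(obs_idx=0) puts roughly 0.82 of the belief on "room is safe", and the resulting free energy equals -ln p(o) = -ln 0.55, about 0.598, the minimum attainable for that observation. The design point the sketch is meant to surface: q_states and preferred_obs remain separate, human-readable structures rather than being entangled in a single reward signal.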