Abstract Counterfactuals for Language Model Agents
By: Edoardo Pona, Milad Kazemi, Yali Du, and more
Potential Business Impact:
Helps AI understand "what if" questions better.
Counterfactual inference is a powerful tool for analysing and evaluating autonomous agents, but its application to language model (LM) agents remains challenging. Existing work on counterfactuals in LMs has primarily focused on token-level counterfactuals, which are often inadequate for LM agents due to their open-ended action spaces. Unlike traditional agents with fixed, clearly defined action spaces, the actions of LM agents are often implicit in the strings they output, making their action spaces difficult to define and interpret. Furthermore, the meanings of individual tokens can shift depending on the context, adding complexity to token-level reasoning and sometimes leading to biased or meaningless counterfactuals. We introduce Abstract Counterfactuals, a framework that emphasises high-level characteristics of actions and interactions within an environment, enabling counterfactual reasoning tailored to user-relevant features. We evaluate the approach on text-based games and counterfactual text generation, considering both token-level and latent-space interventions, and find that it produces consistent and meaningful counterfactuals while minimising the undesired side effects of token-level methods.
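The core idea can be sketched concretely: instead of intervening on individual tokens, an abstraction function maps each concrete agent utterance to a high-level action label, and the counterfactual intervention swaps that label before a new, consistent utterance is generated. The Python below is a minimal illustrative sketch under stated assumptions: the abstraction map, the action labels, and the toy "agent" are hypothetical stand-ins for exposition, not the authors' implementation or any specific LM.

```python
# Illustrative sketch only: the abstraction map, action labels, and the toy
# "agent" below are hypothetical stand-ins, not the paper's implementation.
import re
from dataclasses import dataclass
from typing import Optional

# A small set of high-level (abstract) actions a text-game agent might take.
ABSTRACT_ACTIONS = ("move", "take", "talk", "other")

def abstract(utterance: str) -> str:
    """Map a free-form agent utterance to a high-level action label."""
    text = utterance.lower()
    if re.search(r"\b(go|walk|move|head)\b", text):
        return "move"
    if re.search(r"\b(take|grab|pick)\b", text):
        return "take"
    if re.search(r"\b(say|ask|tell|talk)\b", text):
        return "talk"
    return "other"

@dataclass
class Trace:
    observation: str
    utterance: str          # the concrete string the agent produced
    abstract_action: str    # its high-level abstraction

def toy_agent(observation: str, forced_action: Optional[str] = None) -> str:
    """Stand-in for an LM agent. Optionally conditioned on a forced
    abstract action, which is how the counterfactual is realised here."""
    templates = {
        "move": "I go north through the open door.",
        "take": "I pick up the rusty key on the table.",
        "talk": "I ask the innkeeper about the locked cellar.",
        "other": "I wait and look around.",
    }
    action = forced_action or ("take" if "key" in observation else "move")
    return templates[action]

def counterfactual(trace: Trace, new_abstract_action: str) -> Trace:
    """Intervene at the abstract level: keep the observation fixed,
    replace the high-level action, and re-generate a consistent utterance."""
    assert new_abstract_action in ABSTRACT_ACTIONS
    new_utterance = toy_agent(trace.observation, forced_action=new_abstract_action)
    return Trace(trace.observation, new_utterance, new_abstract_action)

if __name__ == "__main__":
    obs = "You are in a cellar. A rusty key lies on the table."
    factual_utt = toy_agent(obs)
    factual = Trace(obs, factual_utt, abstract(factual_utt))
    cf = counterfactual(factual, "talk")
    print("factual:       ", factual.abstract_action, "->", factual.utterance)
    print("counterfactual:", cf.abstract_action, "->", cf.utterance)
```

In this toy setup the intervention never edits tokens directly; it changes the abstract action and lets the agent regenerate a string consistent with it, which is the level at which the framework's counterfactuals are meant to be read.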
Similar Papers
Counterfactual reasoning: an analysis of in-context emergence
Computation and Language
Helps computers guess what happens if things change.
Show Me How: Benefits and Challenges of Agent-Augmented Counterfactual Explanations for Non-Expert Users
Human-Computer Interaction
Helps doctors explain health risks better.
Guiding LLMs to Generate High-Fidelity and High-Quality Counterfactual Explanations for Text Classification
Computation and Language
Makes AI explain its decisions with small changes.