A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i

Published: May 1, 2025 | arXiv ID: 2505.00808v1

By: Kola Ayonrinde, Louis Jaburi

Potential Business Impact:

Helps us understand how AI thinks and learns.

Business Areas:

Mechanical Engineering, Science and Engineering

Mechanistic Interpretability aims to understand neural networks through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach to understanding models because neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined. We propose a definition of Mechanistic Interpretability (MI) as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks, allowing us to distinguish MI from other interpretability paradigms and detail MI's inherent limits. We formulate the Principle of Explanatory Optimism, a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability.

Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii

Machine Learning (CS)

Helps us understand how AI thinks and works.

2 May 2025

92%

Unboxing the Black Box: Mechanistic Interpretability for Algorithmic Understanding of Neural Networks

Machine Learning (CS)

Explains how computer brains make decisions.

24 Nov 2025

92%

On the Mechanistic Interpretability of Neural Networks for Causality in Bio-statistics

Applications

Explains how computer "brains" make health predictions.

1 May 2025

Page Count

35 pages
