From superposition to sparse codes: interpretable representations in neural networks
By: David Klindt, Charles O'Neill, Patrik Reizinger, and more
Potential Business Impact:
Makes it easier to see which human-interpretable concepts a neural network has learned, improving AI transparency.
Understanding how information is represented in neural networks is a fundamental challenge in both neuroscience and artificial intelligence. Recent evidence suggests that, despite their nonlinear architectures, neural networks encode features in superposition, meaning that input concepts are linearly overlaid within the network's representations. We present a perspective that explains this phenomenon and provides a foundation for extracting interpretable representations from neural activations. Our theoretical framework consists of three steps: (1) Identifiability theory shows that neural networks trained for classification recover latent features up to a linear transformation. (2) Sparse coding methods can extract disentangled features from these representations by leveraging principles from compressed sensing. (3) Quantitative interpretability metrics provide a means to assess the success of these methods, ensuring that extracted features align with human-interpretable concepts. By bridging insights from theoretical neuroscience, representation learning, and interpretability research, we propose an emerging perspective on understanding neural representations in both artificial and biological systems. Our arguments have implications for neural coding theories, AI transparency, and the broader goal of making deep learning models more interpretable.
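The three-step framework lends itself to a small numerical illustration. The sketch below is a hypothetical example, not the paper's code: it simulates sparsely active concepts stored in linear superposition, uses scikit-learn's MiniBatchDictionaryLearning as a stand-in for the sparse coding step, and scores recovery against the simulated ground truth. All names, shapes, and hyperparameters are illustrative assumptions.

```python
# Minimal, hypothetical sketch (not the paper's code) of steps (2)-(3):
# recover sparsely activated concepts that were linearly mixed ("superposed")
# into a lower-dimensional activation space, then score the recovery.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)

# Simulated ground truth: 64 concepts, each active on ~5% of samples,
# stored in superposition inside a 32-dimensional "activation" space.
n_samples, n_concepts, n_dims = 2000, 64, 32
latents = rng.random((n_samples, n_concepts)) * (rng.random((n_samples, n_concepts)) < 0.05)
mixing = rng.normal(size=(n_concepts, n_dims))   # unknown linear map (cf. step 1)
activations = latents @ mixing                   # what we observe from the network

# Step (2): sparse coding / dictionary learning on the activations.
coder = MiniBatchDictionaryLearning(n_components=n_concepts, alpha=0.5, random_state=0)
codes = coder.fit_transform(activations)         # one sparse code per sample

# Step (3): a simple interpretability-style metric -- how well does the
# best-matching learned code correlate with each ground-truth concept?
corr = np.corrcoef(codes.T, latents.T)[:n_concepts, n_concepts:]
score = np.abs(np.nan_to_num(corr)).max(axis=0).mean()
print(f"mean best-match |correlation|: {score:.2f}")
```

In practice, sparse autoencoders or other dictionary learners trained on real model activations would take the place of this toy coder, and the simple correlation score would be replaced by the quantitative interpretability metrics the authors describe.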
Similar Papers
A mathematical theory for understanding when abstract representations emerge in neural networks
Neurons and Cognition
Explains mathematically when abstract representations emerge in neural networks.
On the Theoretical Foundation of Sparse Dictionary Learning in Mechanistic Interpretability
Machine Learning (CS)
Puts sparse dictionary learning for interpreting AI models on firmer theoretical footing.
Superposition disentanglement of neural representations reveals hidden alignment
Machine Learning (CS)
Shows that disentangling superposed neural representations reveals hidden alignment.