Concept-Guided Interpretability via Neural Chunking
By: Shuchen Wu, Stephan Alaniz, Shyamgopal Karthik, and more
Potential Business Impact:
Makes AI models easier to understand by finding recurring patterns in their internal activity.
Neural networks are often described as black boxes, reflecting the significant challenge of understanding their internal workings and interactions. We propose a different perspective that challenges the prevailing view: rather than being inscrutable, neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We refer to this as the Reflection Hypothesis and provide evidence for this phenomenon in both simple recurrent neural networks (RNNs) and complex large language models (LLMs). Building on this insight, we propose to leverage our cognitive tendency of chunking to segment high-dimensional neural population dynamics into interpretable units that reflect underlying concepts. We propose three methods to extract recurring chunks at the neural population level, which complement each other depending on label availability and the dimensionality of the neural data. Discrete sequence chunking (DSC) learns a dictionary of entities in a lower-dimensional neural space; population averaging (PA) extracts recurring entities that correspond to known labels; and unsupervised chunk discovery (UCD) can be used when labels are absent. We demonstrate the effectiveness of these methods in extracting concept-encoding entities agnostic to model architecture. The extracted concepts can be concrete (words), abstract (part-of-speech tags), or structural (narrative schemas). Additionally, we show that extracted chunks play a causal role in network behavior: grafting them into the network leads to controlled and predictable changes in the model's behavior. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data to reveal the hidden computations of complex learning systems, gradually transforming them from black boxes into systems we can begin to understand.
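To make the population-averaging (PA) and grafting ideas from the abstract concrete, here is a minimal, hedged sketch: it averages recorded population activity over time steps that share a concept label to obtain a per-concept template, then blends a template back into the activity to probe its causal effect. All names (`population_average`, `graft_concept`, the toy data) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the population-averaging (PA) idea: average hidden-state
# vectors over all time steps that share a concept label, then "graft" a
# concept template back into the population activity. Illustrative only.
import numpy as np


def population_average(hidden_states: np.ndarray, labels: list[str]) -> dict[str, np.ndarray]:
    """Return one averaged population vector per concept label.

    hidden_states: (T, D) array of neural population activity over T time steps.
    labels:        length-T list of concept labels aligned with the time steps.
    """
    templates: dict[str, np.ndarray] = {}
    for concept in set(labels):
        idx = [t for t, lab in enumerate(labels) if lab == concept]
        templates[concept] = hidden_states[idx].mean(axis=0)
    return templates


def graft_concept(hidden_states: np.ndarray, template: np.ndarray,
                  t: int, alpha: float = 1.0) -> np.ndarray:
    """Blend a concept template into the population activity at time step t."""
    patched = hidden_states.copy()
    patched[t] = (1.0 - alpha) * patched[t] + alpha * template
    return patched


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D = 12, 8                        # toy sequence length and population size
    states = rng.normal(size=(T, D))    # stand-in for recorded RNN/LLM activations
    labs = ["noun", "verb", "noun", "det"] * 3
    concept_vectors = population_average(states, labs)
    patched = graft_concept(states, concept_vectors["verb"], t=0)
    print({k: v.shape for k, v in concept_vectors.items()}, patched.shape)
```

In practice the activity matrix would come from a model's hidden layers rather than random data, and the paper's DSC and UCD methods handle the cases where labels are unavailable or the population is high-dimensional; this sketch only illustrates the label-supervised averaging case.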
Similar Papers
Temporal Chunking Enhances Recognition of Implicit Sequential Patterns
Machine Learning (CS)
Teaches computers to learn faster from past experiences.
From Dionysius Emerges Apollo -- Learning Patterns and Abstractions from Perceptual Sequences
Machine Learning (CS)
Learns patterns to understand and predict things.
Human-like Cognitive Generalization for Large Models via Brain-in-the-loop Supervision
Machine Learning (CS)
Teaches computers to understand new ideas like people.