NeuronScope: A Multi-Agent Framework for Explaining Polysemantic Neurons in Language Models
By: Weiqi Liu, Yongliang Miao, Haiyan Zhao, and more
Potential Business Impact:
Finds the multiple hidden meanings packed inside individual neurons of AI language models.
Neuron-level interpretation in large language models (LLMs) is fundamentally challenged by widespread polysemanticity, where individual neurons respond to multiple distinct semantic concepts. Existing single-pass interpretation methods struggle to faithfully capture such multi-concept behavior. In this work, we propose NeuronScope, a multi-agent framework that reformulates neuron interpretation as an iterative, activation-guided process. NeuronScope explicitly deconstructs neuron activations into atomic semantic components, clusters them into distinct semantic modes, and iteratively refines each explanation using neuron activation feedback. Experiments demonstrate that NeuronScope uncovers hidden polysemanticity and produces explanations with significantly higher activation correlation compared to single-pass baselines.
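The pipeline described in the abstract (deconstruct activations into atomic semantic components, cluster them into distinct semantic modes, then iteratively refine the explanation against activation feedback) can be illustrated with a small, self-contained Python sketch. This is a toy illustration, not the authors' implementation: the function names and the simple scoring heuristic below are invented for this example and stand in for the framework's agents and its activation-correlation metric.

# A minimal sketch of the activation-guided loop described in the abstract.
# All names below (decompose_into_atoms, cluster_modes, activation_correlation,
# refine_explanation) are illustrative placeholders, not the authors' API, and
# the toy heuristics stand in for the framework's agents.

from collections import defaultdict

def decompose_into_atoms(exemplars):
    # Deconstruct each activating exemplar into an atomic semantic component
    # (here, simply the token annotated as driving the activation).
    return [(text, token) for text, token, _ in exemplars]

def cluster_modes(atoms):
    # Group atomic components into distinct semantic modes (toy grouping by token).
    modes = defaultdict(list)
    for text, token in atoms:
        modes[token].append(text)
    return modes

def activation_correlation(explanation, exemplars):
    # Toy stand-in for activation feedback: the fraction of strongly activating
    # exemplars whose driving token the explanation mentions.
    strong = [(token, act) for _, token, act in exemplars if act > 0.5]
    if not strong:
        return 0.0
    hits = sum(1 for token, _ in strong if token in explanation)
    return hits / len(strong)

def refine_explanation(modes, exemplars):
    # Iterative refinement: greedily add semantic modes, keeping only those
    # that improve the activation-feedback score.
    included, best_score = [], 0.0
    for mode in sorted(modes):
        candidate = "Neuron fires on: " + ", ".join(included + [mode])
        score = activation_correlation(candidate, exemplars)
        if score > best_score:
            included, best_score = included + [mode], score
    return "Neuron fires on: " + ", ".join(included), best_score

if __name__ == "__main__":
    # (text snippet, token driving the activation, activation strength)
    exemplars = [
        ("the river bank was muddy", "bank", 0.9),
        ("deposited cash at the bank", "bank", 0.8),
        ("a flock of geese flew by", "geese", 0.7),
    ]
    atoms = decompose_into_atoms(exemplars)
    modes = cluster_modes(atoms)
    explanation, score = refine_explanation(modes, exemplars)
    print(f"{explanation}  (activation correlation ~ {score:.2f})")

In the paper these steps would presumably be carried out by dedicated agents operating on real neuron activations, rather than the hand-labeled exemplars and substring matching used in this sketch.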
Similar Papers
Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
Machine Learning (CS)
Helps computers understand ideas inside their brains.
SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation
Computation and Language
Helps AI remember and connect information better.
Neuron-Guided Interpretation of Code LLMs: Where, Why, and How?
Software Engineering
Explains where, why, and how individual neurons matter inside code-writing AIs.