Mechanistic Interpretability of LoRA-Adapted Language Models for Nuclear Reactor Safety Applications
By: Yoon Pyo Lee
Potential Business Impact:
Shows how smart computers learn nuclear power secrets.
The integration of Large Language Models (LLMs) into safety-critical domains, such as nuclear engineering, necessitates a deep understanding of their internal reasoning processes. This paper presents a novel methodology for interpreting how an LLM encodes and utilizes domain-specific knowledge, using a Boiling Water Reactor system as a case study. We adapted a general-purpose LLM (Gemma-3-1b-it) to the nuclear domain using a parameter-efficient fine-tuning technique known as Low-Rank Adaptation. By comparing the neuron activation patterns of the base model to those of the fine-tuned model, we identified a sparse set of neurons whose behavior was significantly altered during the adaptation process. To probe the causal role of these specialized neurons, we employed a neuron silencing technique. Our results demonstrate that while silencing most of these specialized neurons individually did not produce a statistically significant effect, deactivating the entire group collectively led to a statistically significant degradation in task performance. Qualitative analysis further revealed that silencing these neurons impaired the model's ability to generate detailed, contextually accurate technical information. This paper provides a concrete methodology for enhancing the transparency of an opaque black-box model, allowing domain expertise to be traced to verifiable neural circuits. This offers a pathway towards achieving nuclear-grade artificial intelligence (AI) assurance, addressing the verification and validation challenges mandated by nuclear regulatory frameworks (e.g., 10 CFR 50 Appendix B), which have limited AI deployment in safety-critical nuclear operations.
Similar Papers
A LoRA-Based Approach to Fine-Tuning LLMs for Educational Guidance in Resource-Constrained Settings
Artificial Intelligence
Helps students get study abroad advice easily.
Towards Secure and Private Language Models for Nuclear Power Plants
Computation and Language
Teaches computers nuclear power words for safety.
Unraveling LLM Jailbreaks Through Safety Knowledge Neurons
Artificial Intelligence
Makes AI safer from bad instructions.