Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
By: Bhavik Chandna, Zubair Bashir, Procheta Sen
Potential Business Impact:
Shows where unfairness sits inside AI language models so it can be reduced.
Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased outputs but also degrades performance on other NLP tasks, such as named entity recognition and linguistic acceptability judgment, because those tasks share important components with the bias-related computations.
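To make the ablation idea concrete, here is a minimal, hypothetical sketch (not the authors' code) using the TransformerLens library: it zeroes out individual attention heads in GPT-2 and measures how a simple "he" vs. "she" log-probability gap changes. The prompt template, the bias metric, and the library choice are all illustrative assumptions; the paper's actual edge-level analysis and metrics are more involved.

```python
# Hypothetical sketch: ablate attention heads in GPT-2 and track a gender-bias score.
# Library choice (TransformerLens) and the he-vs-she metric are illustrative assumptions.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

PROMPT = "The doctor said that"              # illustrative template
HE = model.to_single_token(" he")
SHE = model.to_single_token(" she")

def gender_gap(logits: torch.Tensor) -> float:
    """Log-prob gap between ' he' and ' she' at the final position."""
    log_probs = logits[0, -1].log_softmax(dim=-1)
    return (log_probs[HE] - log_probs[SHE]).item()

tokens = model.to_tokens(PROMPT)
baseline = gender_gap(model(tokens))         # bias score with no ablation

def make_head_ablation(head: int):
    # hook_z has shape [batch, pos, head_index, d_head]; zeroing one slice
    # removes that head's contribution before the output projection.
    def ablate(z, hook):
        z[:, :, head, :] = 0.0
        return z
    return ablate

# Ablate each head in each layer and record the change in the bias score.
results = {}
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        hook_name = utils.get_act_name("z", layer)
        logits = model.run_with_hooks(
            tokens, fwd_hooks=[(hook_name, make_head_ablation(head))]
        )
        results[(layer, head)] = gender_gap(logits) - baseline

# Heads whose removal most reduces the gap are candidate "bias components".
top = sorted(results.items(), key=lambda kv: kv[1])[:5]
print(f"baseline he-vs-she gap: {baseline:.3f}")
for (layer, head), delta in top:
    print(f"layer {layer}, head {head}: change in gap {delta:+.3f}")
```

Under this kind of probe, a small number of heads typically dominate the score, which is the sense in which bias-related computation can be called localized; checking the same heads on unrelated tasks would then reveal the component sharing the abstract describes.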
Similar Papers
No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language Models
Computation and Language
Finds and fixes unfairness in AI language.
A Comprehensive Study of Implicit and Explicit Biases in Large Language Models
Machine Learning (CS)
Finds and fixes unfairness in AI writing.
Addressing Stereotypes in Large Language Models: A Critical Examination and Mitigation
Computation and Language
Fixes AI's unfairness and improves its understanding.