Steering Towards Fairness: Mitigating Political Bias in LLMs
By: Afrozah Nadeem, Mark Dras, Usman Naseem
Potential Business Impact:
Helps reduce political bias in the responses of AI language models.
Recent advancements in large language models (LLMs) have enabled their widespread use across diverse real-world applications. However, concerns remain about their tendency to encode and reproduce ideological biases along political and economic dimensions. In this paper, we employ a framework for probing and mitigating such biases in decoder-based LLMs through analysis of internal model representations. Grounded in the Political Compass Test (PCT), this method uses contrastive pairs to extract and compare hidden layer activations from models like Mistral and DeepSeek. We introduce a comprehensive activation extraction pipeline capable of layer-wise analysis across multiple ideological axes, revealing meaningful disparities linked to political framing. Our results show that decoder LLMs systematically encode representational bias across layers, which can be leveraged for effective steering vector-based mitigation. This work provides new insights into how political bias is encoded in LLMs and offers a principled approach to debiasing beyond surface-level output interventions.
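As a rough illustration of the steering-vector idea mentioned in the abstract, the sketch below builds a direction from contrastive activations and shifts a hidden state along it. It uses the difference of mean activations, a common construction; the abstract does not say that this is the exact method the authors use. The function names, the toy three-dimensional activations, and the scaling factor `alpha` are all hypothetical. In practice the activations would be extracted from one layer of a model such as Mistral or DeepSeek.

```python
from statistics import fmean
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of equally sized activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def steering_vector(side_a: Sequence[Sequence[float]],
                    side_b: Sequence[Sequence[float]]) -> Vector:
    """Difference of mean activations between the two sides of the
    contrastive pairs (assumed construction, not confirmed by the source)."""
    mu_a = mean_vector(side_a)
    mu_b = mean_vector(side_b)
    return [a - b for a, b in zip(mu_a, mu_b)]


def steer(hidden: Sequence[float], direction: Sequence[float],
          alpha: float) -> Vector:
    """Shift a hidden state by alpha * direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of a hidden state onto the unit-length direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(h * d for h, d in zip(hidden, direction)) / norm


# Toy activations at a single layer (hypothetical values, not real model output).
side_a_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]  # one framing of each pair
side_b_acts = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]  # the opposite framing

direction = steering_vector(side_a_acts, side_b_acts)
hidden_state = [1.0, 5.0, 2.0]

# A negative alpha moves the state away from side A's mean direction.
steered_state = steer(hidden_state, direction, alpha=-0.5)

print("projection before:", projection(hidden_state, direction))
print("projection after: ", projection(steered_state, direction))
```

Repeating the same computation at every layer gives a layer-wise picture of where the two framings separate most, which is the kind of analysis the abstract describes. Choosing which layer to steer and how large `alpha` should be is an empirical question that this sketch does not address.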
Similar Papers
Framing Political Bias in Multilingual LLMs Across Pakistani Languages
Computation and Language
Computers show bias in different languages.
Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs
Artificial Intelligence
Fixes AI to stop saying unfair or wrong things.