Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?
By: Zhengyang Shan, Aaron Mueller
Potential Business Impact:
Fixes AI bias without losing its smarts.
We investigate how independent demographic bias mechanisms are from general demographic recognition in language models. Using a multi-task evaluation setup in which demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution-based features in education tasks induces "prior collapse", which increases overall bias. This highlights the need for dimension-specific interventions. Overall, our results show that demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and that mechanistic inference-time interventions can enable surgical debiasing without compromising core model capabilities.
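The core intervention the abstract describes — ablating selected sparse autoencoder (SAE) features at inference time — can be sketched in a few lines. The sketch below is illustrative only: the dimensions, random weights, and feature indices are hypothetical toy values, not the paper's Gemma-2-9B SAEs, and the `sae_ablate` helper is an assumed name. It shows the mechanism: encode a residual-stream activation into SAE feature space, zero out the features identified as bias-related (whether located by attribution or by correlation), and decode back.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SAE weights (hypothetical sizes; real SAEs for Gemma-2-9B are far larger).
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_ablate(activation, ablate_idx):
    """Encode an activation into SAE features, zero the selected
    features, and decode back to the model's activation space."""
    feats = np.maximum(activation @ W_enc + b_enc, 0.0)  # ReLU feature activations
    feats[list(ablate_idx)] = 0.0                        # ablate chosen features
    return feats @ W_dec + b_dec                          # reconstructed activation

# Example: ablate two (arbitrarily chosen) feature indices.
x = rng.normal(size=d_model)
x_ablated = sae_ablate(x, {3, 17})
```

In practice, this replacement would be applied to the residual stream during the forward pass (e.g. via a hook at the SAE's layer), so the rest of the computation runs on the debiased activation; only the chosen features' decoder directions are subtracted, which is what makes the intervention "surgical".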
Similar Papers
Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability
CV and Pattern Recognition
Fixes AI art to be fair and unbiased.
Decomposing Direct and Indirect Biases in Linear Models under Demographic Parity Constraint
Machine Learning (Stat)
Shows how computer decisions unfairly favor some groups.
Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
Computation and Language
Fixes computer "thinking" to be less unfair.