Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?

Published: December 23, 2025 | arXiv ID: 2512.20796v1

By: Zhengyang Shan, Aaron Mueller

Potential Business Impact:

Fixes AI bias without losing its smarts.

Business Areas:
Biometrics, Biotechnology, Data and Analytics, Science and Engineering

We investigate how independent demographic bias mechanisms are from general demographic recognition in language models. Using a multi-task evaluation setup where demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution features in education tasks induces "prior collapse", thus increasing overall bias. This highlights the need for dimension-specific interventions. Overall, our results show that demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and that mechanistic inference-time interventions can enable surgical debiasing without compromising core model capabilities.
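The core intervention the abstract describes, ablating targeted sparse autoencoder (SAE) features at inference time, can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the dimensions, weights, and feature indices are all placeholder stand-ins (the paper works with Gemma-2-9B and trained SAEs), but the mechanics of encode, zero the selected features, decode are the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions; a real SAE on Gemma-2-9B would be far larger.
d_model, d_sae = 16, 64

# Random stand-ins for a trained SAE's encoder/decoder weights.
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_ablate(activation, ablate_idx):
    """Encode a residual-stream activation into the SAE's sparse feature
    space, zero out the targeted (bias-linked) features, and decode the
    edited features back into model space."""
    feats = np.maximum(activation @ W_enc + b_enc, 0.0)  # ReLU encoder
    feats[ablate_idx] = 0.0                              # targeted ablation
    return feats @ W_dec + b_dec                         # decoder reconstruction

# The indices would come from attribution- or correlation-based feature
# localization; [3, 17, 42] here are arbitrary placeholders.
act = rng.normal(size=d_model)
patched = sae_ablate(act, ablate_idx=[3, 17, 42])
print(patched.shape)
```

In the paper's setting, the patched activation would replace the original at the SAE's hook point during the forward pass, so only the selected features are removed while the rest of the representation (e.g., what supports name recognition) passes through.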

Country of Origin
🇺🇸 United States

Repos / Data Links

Page Count
25 pages

Category
Computer Science:
Computation and Language