What Triggers my Model? Contrastive Explanations Inform Gender Choices by Translation Models
By: Janiça Hackenbuchner, Arda Tezcan, Joke Daems
Interpretability offers a means to understand the decisions taken by (black-box) models, such as machine translation (MT) systems or large language models (LLMs). Yet research in this area remains limited with respect to a well-documented problem in these models: gender bias. With this research, we aim to move beyond simply measuring bias and towards exploring its origins. Working with gender-ambiguous natural source data, this study examines which context, in the form of input tokens in the source sentence, influences (or triggers) a translation model's choice of a particular gender inflection in the target language. To analyse this, we use contrastive explanations and compute saliency attribution. We first address the challenge posed by the lack of an established scoring threshold and examine how source words at different attribution levels influence the model's gender decisions in the translation. We then compare salient source words with human perceptions of gender and demonstrate a noticeable overlap between human perceptions and model attributions. Additionally, we provide a linguistic analysis of salient words. Our work showcases the relevance of understanding model translation decisions in terms of gender, shows how these decisions compare to human judgements, and argues that this information should be leveraged to mitigate gender bias.
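To make the method concrete, below is a minimal sketch of gradient-based contrastive saliency attribution for a single MT gender decision, in the spirit of the approach the abstract describes. The model name (Helsinki-NLP/opus-mt-en-de), the example sentence, and the contrasted target tokens ("Die" vs. "Der") are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch of contrastive saliency attribution for one MT gender
# decision, assuming a Hugging Face MarianMT model. Model name, sentence,
# and contrasted target tokens are illustrative assumptions.
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"  # assumed EN->DE model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
model.eval()

# Gender-ambiguous English source; German forces a gendered article.
source = "The doctor finished the night shift and went home."
factual, contrastive = "Die", "Der"  # feminine vs. masculine article

enc = tokenizer(source, return_tensors="pt")

# Capture the encoder embedding output and its gradient via hooks,
# so we can compute gradient-x-input saliency per source token.
emb_layer = model.get_encoder().embed_tokens
acts, grads = {}, {}
fwd = emb_layer.register_forward_hook(lambda m, i, o: acts.update(emb=o))
bwd = emb_layer.register_full_backward_hook(
    lambda m, gi, go: grads.update(emb=go[0]))

# The decoder starts from the model's start token; we contrast the very
# first target token (the gendered article) rather than a full translation.
dec_ids = torch.tensor([[model.config.decoder_start_token_id]])
logits = model(input_ids=enc.input_ids,
               attention_mask=enc.attention_mask,
               decoder_input_ids=dec_ids).logits[0, -1]

fact_id = tokenizer(text_target=factual, add_special_tokens=False).input_ids[0]
cont_id = tokenizer(text_target=contrastive, add_special_tokens=False).input_ids[0]

# Contrastive objective: how strongly the model prefers the feminine over
# the masculine continuation. Its gradient w.r.t. the source embeddings
# attributes the gender decision to individual input tokens.
score = logits[fact_id] - logits[cont_id]
model.zero_grad()
score.backward()
fwd.remove()
bwd.remove()

# Gradient-x-input, reduced over the hidden dimension: one saliency
# score per source token.
saliency = (grads["emb"] * acts["emb"]).sum(-1).abs()[0]
for token, s in zip(tokenizer.convert_ids_to_tokens(enc.input_ids[0]),
                    saliency.tolist()):
    print(f"{token:>12s}  {s:.4f}")
```

In the paper's setting, source tokens whose attribution exceeds some threshold would count as triggers of the gender choice; the sketch deliberately leaves that thresholding decision open, which matches the abstract's observation that no established scoring threshold exists.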
Similar Papers
Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation
Computation and Language
Speech translation models infer speaker gender from the voice signal, relying on acoustic cues beyond pitch alone.
Different Speech Translation Models Encode and Translate Speaker Gender Differently
Computation and Language
Speech translation models encode speaker gender, but some newer models fail to do so.
Beyond Content: How Grammatical Gender Shapes Visual Representation in Text-to-Image Models
Computation and Language
Text-to-image model outputs shift with the grammatical gender of words in the prompt.