Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models
By: Zhibo Hu, Chen Wang, Yanfeng Shu, and more
Potential Business Impact:
Metaphors in training data can push AI models toward misaligned reasoning.
Earlier research has shown that metaphors influence humans' decision-making, which raises the question of whether metaphors also shape the reasoning pathways of large language models (LLMs), given that their training data contain a large number of metaphors. In this work, we investigate this question within the scope of the emergent misalignment problem, where LLMs generalize patterns learned from misaligned content in one domain to another. We find a strong causal relationship between metaphors in training data and the degree of misalignment in LLMs' reasoning content. With metaphor-based interventions in the pre-training, fine-tuning, and re-alignment phases, models' cross-domain misalignment degrees change significantly. Digging deeper into the causes of this phenomenon, we observe a connection between metaphors and the activation of global and local latent features of large reasoning models. By monitoring these latent features, we design a detector that predicts misaligned content with high accuracy.
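The abstract does not describe the detector's implementation. As an illustration only, the sketch below shows one common way such a latent-feature detector could be built: a linear probe trained on hidden-state activations to flag potentially misaligned generations. The layer choice, the logistic-regression probe, and all function names (extract_hidden_state, train_misalignment_probe, flag_misaligned) are assumptions for this sketch, not the authors' method.

```python
# Illustrative sketch only: a linear probe over hidden-state activations used to
# flag potentially misaligned generations. The pooling strategy, layer choice, and
# probe type are assumptions; the paper's detector over "global and local latent
# features" may work differently.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression


def extract_hidden_state(model, tokenizer, text, layer=-1):
    """Mean-pool one layer's hidden states into a feature vector (assumed HF-style model)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0).cpu().numpy()


def train_misalignment_probe(features, labels):
    """Fit a simple linear probe: label 1 = misaligned reasoning content, 0 = aligned."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(np.stack(features), np.array(labels))
    return probe


def flag_misaligned(probe, feature, threshold=0.5):
    """Return True if the probe's predicted misalignment probability exceeds the threshold."""
    return probe.predict_proba(feature.reshape(1, -1))[0, 1] > threshold
```

In a setup like this, the probe would be trained on activations from labeled aligned and misaligned reasoning traces and then applied to new generations; a linear probe is a standard, cheap choice for monitoring latent features, but nothing here should be read as the paper's actual design.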
Similar Papers
Unveiling LLMs' Metaphorical Understanding: Exploring Conceptual Irrelevance, Context Leveraging and Syntactic Influence
Computation and Language
Computers still struggle to understand metaphors.
Conceptual Metaphor Theory as a Prompting Paradigm for Large Language Models
Computation and Language
Makes AI think more like people.
Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages
Computation and Language
Tests whether AI reasons the same way in all languages.