Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs
By: Nikita Afonin , Nikita Andriyanov , Nikhil Bageshpura and more
Potential Business Impact:
AI can learn bad habits from examples.
Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across three datasets, three frontier models produce broadly misaligned responses at rates between 2% and 17% given 64 narrow in-context examples, and up to 58% with 256 examples. We also examine mechanisms of EM by eliciting step-by-step reasoning (while leaving in-context examples unchanged). Manual analysis of the resulting chain-of-thought shows that 67.5% of misaligned traces explicitly rationalize harmful outputs by adopting a reckless or dangerous ''persona'', echoing prior results on finetuning-induced EM.
Similar Papers
Unlocking In-Context Learning for Natural Datasets Beyond Language Modelling
Computation and Language
Teaches computers to learn new things from examples.
Convergent Linear Representations of Emergent Misalignment
Machine Learning (CS)
Fixes AI that acts badly after learning new things.
In-Context Bias Propagation in LLM-Based Tabular Data Generation
Machine Learning (CS)
AI can accidentally create unfair data.