Patterns and Mechanisms of Contrastive Activation Engineering
By: Yixiong Hao, Ayush Panda, Stepan Shabalin, and more
Potential Business Impact:
Changes an AI model's answers without retraining it.
Controlling the behavior of Large Language Models (LLMs) remains a significant challenge due to their inherent complexity and opacity. While techniques like fine-tuning can modify model behavior, they typically require extensive computational resources. Recent work has introduced a class of contrastive activation engineering (CAE) techniques as promising approaches for steering LLM outputs through targeted modifications to their internal representations. Applied at inference time with zero cost, CAE has the potential to introduce a new paradigm of flexible, task-specific LLM behavior tuning. We analyze the performance of CAE in both in-distribution and out-of-distribution settings, evaluate its drawbacks, and begin to develop comprehensive guidelines for its effective deployment. We find that (1) CAE is reliably effective only when applied in in-distribution contexts; (2) increasing the number of samples used to generate steering vectors yields diminishing returns at around 80 samples; (3) steering vectors are susceptible to adversarial inputs that reverse the steered behavior; (4) steering vectors degrade overall model perplexity; and (5) larger models are more resistant to steering-induced degradation.
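The core mechanism is simple enough to sketch. Below is a minimal, hypothetical illustration of CAE, assuming a HuggingFace GPT-2 model steered via a PyTorch forward hook: the steering vector is the difference of mean activations over contrastive prompt sets, added back into the residual stream at inference time. The layer index, scale, and prompts are illustrative placeholders, not the paper's actual configuration.

```python
# Minimal CAE sketch, assuming a HuggingFace GPT-2 model steered via a
# PyTorch forward hook. Layer, scale, and prompts are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = 6    # residual-stream layer to steer (assumption)
SCALE = 4.0  # steering strength (assumption)

def mean_activation(prompts, layer):
    """Mean hidden state at the last token position, averaged over prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1])
    return torch.stack(acts).mean(dim=0)

# The steering vector is the difference of mean activations on contrastive
# (positive vs. negative) prompt sets; the paper finds diminishing returns
# around 80 samples, far more than this toy pair.
pos = ["I love this! It is wonderful.", "What a delightful day."]
neg = ["I hate this! It is terrible.", "What a miserable day."]
steer = mean_activation(pos, LAYER) - mean_activation(neg, LAYER)

def steering_hook(module, inputs, output):
    # Add the scaled vector to every token's residual stream at this layer.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
ids = tok("The movie was", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20,
                                pad_token_id=tok.eos_token_id)[0]))
handle.remove()  # detach the hook to restore unsteered behavior
```

Design choices such as the injection layer, the scale, and which token positions receive the vector all affect the trade-off between steering strength and the perplexity degradation the authors report.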
Similar Papers
Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering
Artificial Intelligence
Makes AI think more logically instead of guessing.
Steering in the Shadows: Causal Amplification for Activation Space Attacks in Large Language Models
Cryptography and Security
Makes AI do bad things by changing its thoughts.
Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators
Computation and Language
Fixes AI judging its own answers unfairly.