On the Limitations of Steering in Language Model Alignment
By: Chebrolu Niranjan, Kokil Jaidka, Gerard Christopher Yeo
Potential Business Impact:
Makes AI follow instructions better, but not always.
Steering vectors are a promising approach to aligning language model behavior at inference time. In this paper, we propose a framework for assessing the limitations of steering vectors as alignment mechanisms. Using transformer hook interventions and antonym-based function vectors, we evaluate the roles of prompt structure and context complexity in steering effectiveness. Our findings indicate that steering vectors are promising for specific alignment tasks, such as value alignment, but may not provide a robust foundation for general-purpose alignment in LLMs, particularly in complex scenarios. We establish a methodological foundation for future investigations into the steering capabilities of reasoning models.
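The core mechanism the abstract refers to can be illustrated with a minimal sketch (not the authors' code): a hook intercepts a transformer layer's hidden states at inference time and shifts them along a steering vector scaled by a strength coefficient. The names `ToyLayer`, `make_steering_hook`, and `alpha` are illustrative assumptions, and the layer here is a stand-in identity map rather than a real transformer block.

```python
import numpy as np

class ToyLayer:
    """Stand-in for a transformer block; applies a fixed linear map
    and then runs any registered hooks on the output."""
    def __init__(self, weight):
        self.weight = weight
        self.hooks = []  # functions applied to the layer output

    def register_hook(self, fn):
        self.hooks.append(fn)

    def __call__(self, hidden):
        out = hidden @ self.weight
        for fn in self.hooks:
            out = fn(out)
        return out

def make_steering_hook(vector, alpha=1.0):
    """Return a hook that shifts activations along `vector` by `alpha`."""
    def hook(hidden):
        return hidden + alpha * vector
    return hook

rng = np.random.default_rng(0)
layer = ToyLayer(np.eye(4))     # identity weights keep the example transparent
h = rng.normal(size=4)          # fake hidden state for one token

baseline = layer(h)

# In the paper's setting the steering direction would be a function
# vector derived from antonym-pair prompts; here it is a fixed basis
# vector purely for illustration.
steer = np.array([1.0, 0.0, 0.0, 0.0])
layer.register_hook(make_steering_hook(steer, alpha=2.0))
steered = layer(h)

print(np.allclose(steered - baseline, 2.0 * steer))  # shift equals alpha * vector
```

In a real model this hook would be attached with an API such as PyTorch's `register_forward_hook` on a chosen decoder layer; the paper's framework varies where the intervention is applied and how complex the surrounding prompt is to probe when steering stops working.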
Similar Papers
Steering off Course: Reliability Challenges in Steering Language Models
Computation and Language
Shows when steering makes AI unreliable for specific tasks.
Understanding (Un)Reliability of Steering Vectors in Language Models
Machine Learning (CS)
Makes AI follow instructions better, but sometimes it gets confused.
A Unified Understanding and Evaluation of Steering Methods
Machine Learning (CS)
Guides AI to write better without retraining.