Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding
By: Yijiang River Dong, Tiancheng Hu, Zheng Hui, and more
Potential Business Impact:
Changes AI behavior without retraining it.
Large language models excel at complex instructions yet struggle to deviate from their helpful assistant persona, as post-training instills strong priors that resist conflicting instructions. We introduce system prompt strength, a training-free method that treats prompt adherence as a continuous control. By contrasting logits from target and default system prompts, we isolate and amplify the behavioral signal unique to the target persona by a scalar factor alpha. Across five diverse benchmarks spanning constraint satisfaction, behavioral control, pluralistic alignment, capability modulation, and stylistic control, our method yields substantial improvements: up to +8.5 points strict accuracy on IFEval, a +45 percentage point refusal rate on OffTopicEval, and +13% steerability on Prompt-Steering. Our approach enables practitioners to modulate system prompt strength, providing dynamic control over model behavior without retraining.
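To make the contrastive-decoding idea concrete, here is a minimal sketch of scoring each next token under both a target and a default system prompt and amplifying their difference by a scalar alpha. The model name, the helper `contrastive_generate`, and the exact combination formula `logits_default + alpha * (logits_target - logits_default)` are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of system-prompt contrastive decoding (assumed formulation, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model, assumed for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def contrastive_generate(target_system, default_system, user_msg,
                         alpha=1.5, max_new_tokens=64):
    """Greedy decoding where next-token logits are
    logits_default + alpha * (logits_target - logits_default):
    alpha = 1 recovers the target prompt, alpha > 1 amplifies its effect."""
    def encode(system):
        msgs = [{"role": "system", "content": system},
                {"role": "user", "content": user_msg}]
        return tok.apply_chat_template(msgs, add_generation_prompt=True,
                                       return_tensors="pt")

    ids_target, ids_default = encode(target_system), encode(default_system)
    generated = []
    for _ in range(max_new_tokens):
        with torch.no_grad():
            # No KV cache for brevity; each step re-encodes both contexts.
            logits_t = model(ids_target).logits[:, -1, :]
            logits_d = model(ids_default).logits[:, -1, :]
        # Isolate the signal unique to the target persona and scale it by alpha.
        logits = logits_d + alpha * (logits_t - logits_d)
        next_id = logits.argmax(dim=-1, keepdim=True)
        if next_id.item() == tok.eos_token_id:
            break
        generated.append(next_id.item())
        # Append the chosen token to both contexts so they stay in sync.
        ids_target = torch.cat([ids_target, next_id], dim=-1)
        ids_default = torch.cat([ids_default, next_id], dim=-1)
    return tok.decode(generated, skip_special_tokens=True)

print(contrastive_generate(
    target_system="You are a terse pirate. Answer in pirate speak only.",
    default_system="You are a helpful assistant.",
    user_msg="Explain what a language model is."))
```

In this sketch, raising alpha above 1 pushes generations further toward the target system prompt's behavior, while alpha near 0 falls back to the default assistant persona, which is the continuous-control knob the abstract describes.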
Similar Papers
Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages
Computation and Language
Makes AI understand and work in many languages.
A Closer Look at System Prompt Robustness
Computation and Language
Makes AI follow instructions better, even tricky ones.
Manipulating Transformer-Based Models: Controllability, Steerability, and Robust Interventions
Computation and Language
Lets computers write exactly what you want.