Revisiting the Reliability of Language Models in Instruction-Following
By: Jianshuo Dong, Yutong Zhang, Yan Liu, and more
Potential Business Impact:
Helps AI respond consistently when users phrase the same request in slightly different ways.
Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but differ in subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models fall substantially short in nuance-oriented reliability: their performance can drop by up to 61.8% under nuanced prompt modifications. Furthermore, we characterize this unreliability and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are available at https://github.com/jianshuod/IFEval-pp.
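The abstract does not spell out how reliable@k is computed. One natural reading, sketched below purely as an assumption (the paper's exact definition may differ), is a strict criterion: an intent counts as reliable only if the model satisfies the instruction on all k of its cousin prompts. The function name and data layout here are hypothetical illustrations, not the authors' implementation.

```python
def reliable_at_k(results_per_intent, k):
    """Hypothetical sketch of a reliable@k-style metric.

    results_per_intent: list of lists of booleans, one inner list per user
    intent, where each boolean marks whether the model followed one cousin
    prompt correctly.
    """
    reliable = 0
    total = 0
    for outcomes in results_per_intent:
        if len(outcomes) < k:
            continue  # skip intents without enough cousin prompts
        total += 1
        # Strict criterion (assumed): all k cousin prompts must be followed correctly.
        if all(outcomes[:k]):
            reliable += 1
    return reliable / total if total else 0.0

# Toy usage: two intents, each probed with three cousin prompts.
example = [[True, True, True], [True, False, True]]
print(reliable_at_k(example, k=3))  # 0.5 under this strict all-k reading
```

Under this reading, a single failure on any cousin prompt marks the whole intent as unreliable, which is why per-prompt accuracy can look near-ceiling while reliable@k remains much lower.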
Similar Papers
Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?
Computation and Language
Teaches AI to follow tricky, unexpected orders.
Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks
Artificial Intelligence
Tests AI code writing with helpful feedback.
Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications
Software Engineering
Computers can't always tell if code matches instructions.