Revisiting the Reliability of Language Models in Instruction-Following
By: Jianshuo Dong, Yutong Zhang, Yan Liu, and more
Potential Business Impact:
Helps AI respond consistently when users phrase the same request in slightly different ways.
Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but differ in subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models fall substantially short in nuance-oriented reliability: their performance can drop by up to 61.8% under nuanced prompt modifications. Furthermore, we characterize this unreliability and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are available at https://github.com/jianshuod/IFEval-pp.
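The abstract does not spell out how reliable@k is computed. One natural reading, sketched below purely as an assumption (the paper's exact definition may differ), is a strict criterion: an intent counts as reliable only if the model satisfies the instruction on all k of its cousin prompts. The function name and data layout here are hypothetical illustrations, not the authors' implementation.

```python
def reliable_at_k(results_per_intent, k):
    """Hypothetical sketch of a reliable@k-style metric.

    results_per_intent: list of lists of booleans, one inner list per user
    intent, where each boolean marks whether the model followed one cousin
    prompt correctly.
    """
    reliable = 0
    total = 0
    for outcomes in results_per_intent:
        if len(outcomes) < k:
            continue  # skip intents without enough cousin prompts
        total += 1
        # Strict criterion (assumed): all k cousin prompts must be followed correctly.
        if all(outcomes[:k]):
            reliable += 1
    return reliable / total if total else 0.0

# Toy usage: two intents, each probed with three cousin prompts.
example = [[True, True, True], [True, False, True]]
print(reliable_at_k(example, k=3))  # 0.5 under this strict all-k reading
```

Under this reading, a single failure on any cousin prompt marks the whole intent as unreliable, which is why per-prompt accuracy can look near-ceiling while reliable@k remains much lower.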
Similar Papers
Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?
Computation and Language
Teaches AI to follow tricky, unexpected orders.
Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks
Artificial Intelligence
Tests AI code writing with helpful feedback.
Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications
Software Engineering
Computers can't always tell if code matches instructions.