CIFE: Code Instruction-Following Evaluation
By: Sravani Gunnu, Shanmukha Guttula, Hima Patel
Potential Business Impact:
Helps computers write code that follows all of a developer's rules.
Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment: developers also expect adherence to explicit requirements for robustness, formatting, and security. Existing benchmarks primarily assess correctness through test-case execution, offering limited insight into how reliably models follow such constraints. We introduce a benchmark of 1,000 Python tasks, each paired with an average of 7 developer-specified constraints spanning 13 categories. Constraints are curated through a four-stage human-LLM pipeline to ensure they are atomic, relevant, and objective. We evaluate 14 open- and closed-source models using complementary adherence metrics and propose the C2A Score, a composite measure that jointly captures correctness and constraint compliance. Results reveal a substantial gap between partial and strict satisfaction: while strong models achieve over 90% partial adherence, strict adherence remains between 39% and 66%. These findings highlight that trustworthy code generation requires not only correctness but also consistent adherence to developer intent.
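The abstract does not spell out how the adherence metrics or the C2A Score are computed. A minimal sketch, assuming partial adherence is the fraction of constraints a response satisfies, strict adherence requires every constraint to be met, and the composite score credits a task only when it is both correct and fully compliant (the exact C2A formula here is a hypothetical illustration, not the paper's definition):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    passed_tests: bool            # functional correctness from test-case execution
    constraints_met: list[bool]   # one flag per developer-specified constraint

def partial_adherence(r: TaskResult) -> float:
    """Fraction of constraints satisfied for a single task."""
    return sum(r.constraints_met) / len(r.constraints_met)

def strict_adherence(r: TaskResult) -> bool:
    """True only if every constraint is satisfied."""
    return all(r.constraints_met)

def composite_score(results: list[TaskResult]) -> float:
    """Illustrative composite of correctness and constraint compliance:
    a task counts only if it passes its tests AND meets all constraints,
    averaged over the benchmark (the paper's actual C2A Score may differ)."""
    return sum(r.passed_tests and strict_adherence(r) for r in results) / len(results)

# Toy usage: a task that passes its tests but meets only 6 of 7 constraints.
example = TaskResult(passed_tests=True, constraints_met=[True] * 6 + [False])
print(partial_adherence(example))   # ~0.857 -> high partial adherence
print(strict_adherence(example))    # False  -> strict adherence fails
```

This kind of gap between the two measures is what the reported results describe: models often satisfy most constraints on a task while still missing at least one.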
Similar Papers
CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments
Software Engineering
Tests if AI can write code correctly.
Cross-Task Benchmarking and Evaluation of General-Purpose and Code-Specific Large Language Models
Software Engineering
Makes computers better at understanding language and code.
Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications
Software Engineering
Computers can't always tell if code matches instructions.