Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks
By: Dimitrios Rontogiannis, Maxime Peyrard, Nicolas Baldwin, and more
Potential Business Impact:
Tests how well AI writes code when given targeted, helpful feedback.
Standard single-turn, static benchmarks fall short in evaluating the nuanced capabilities of Large Language Models (LLMs) on complex tasks such as software engineering. In this work, we propose a novel interactive evaluation framework that assesses LLMs on multi-requirement programming tasks through structured, feedback-driven dialogue. Each task is modeled as a requirement dependency graph, and an "interviewer" LLM, aware of the ground-truth solution, provides minimal, targeted hints to an "interviewee" model to help correct errors and fulfill target constraints. This dynamic protocol enables fine-grained diagnostic insights into model behavior, uncovering strengths and systematic weaknesses that static benchmarks fail to measure. We build on DevAI, a benchmark of 55 curated programming tasks, by adding ground-truth solutions and evaluating the relevance and utility of interviewer hints through expert annotation. Our results highlight the importance of dynamic evaluation in advancing the development of collaborative code-generating agents.
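As a rough illustration of the protocol described in the abstract, the Python sketch below walks a requirement dependency graph in topological order and runs a bounded hint-and-revise loop for each unmet requirement. It is a hypothetical sketch, not the authors' implementation: the function names (interactive_eval, topological_order), the dict-based task format, and the callables standing in for the interviewer, interviewee, and requirement checker are all our own assumptions.

# Hypothetical sketch of the interviewer/interviewee loop; not the paper's released code.
from typing import Callable, Dict, List

def topological_order(deps: Dict[str, List[str]]) -> List[str]:
    """Order requirements so each one appears after its prerequisites (assumes no cycles)."""
    order, seen = [], set()
    def visit(req: str) -> None:
        if req in seen:
            return
        seen.add(req)
        for parent in deps.get(req, []):
            visit(parent)
        order.append(req)
    for req in deps:
        visit(req)
    return order

def interactive_eval(
    task: Dict,                                   # {"prompt": str, "solution": str, "deps": {...}}
    interviewee: Callable[[str], str],            # model under evaluation: prompt -> code attempt
    interviewer: Callable[[str, str, str], str],  # (requirement, ground truth, attempt) -> minimal hint
    check: Callable[[str, str], bool],            # (requirement, attempt) -> requirement satisfied?
    max_turns: int = 3,
) -> Dict[str, bool]:
    """Return, per requirement, whether the interviewee satisfied it within the turn budget."""
    attempt = interviewee(task["prompt"])
    results = {}
    for req in topological_order(task["deps"]):
        turns = 0
        while not check(req, attempt) and turns < max_turns:
            hint = interviewer(req, task["solution"], attempt)  # minimal, targeted hint
            attempt = interviewee(task["prompt"] + "\nHint: " + hint)
            turns += 1
        results[req] = check(req, attempt)
    return results

In this sketch the per-requirement pass/fail results are what would feed the fine-grained diagnostics the paper describes; the turn budget (max_turns) is an assumed knob, not a value taken from the paper.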
Similar Papers
Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs
Computation and Language
Tests how well AI reasons and strategizes across multi-turn conversations.
Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications
Software Engineering
Computers can't always tell if code matches instructions.