HoarePrompt: Structural Reasoning About Program Correctness in Natural Language
By: Dimitrios Stamatios Bouras, Yihan Dai, Tairan Wang, and more
Potential Business Impact:
Checks if computer code follows instructions.
While software requirements are often expressed in natural language, verifying the correctness of a program against such requirements is a hard and underexplored problem. Large language models (LLMs) are promising candidates for addressing this challenge; however, our experience shows that they are ineffective at this task, often failing to detect even straightforward bugs. To address this gap, we introduce HoarePrompt, a novel approach that adapts fundamental ideas from program verification to natural language artifacts. Inspired by the strongest postcondition calculus, HoarePrompt employs a systematic, step-by-step process in which an LLM generates natural language descriptions of the reachable program states at various points in the code. To handle loops, we propose few-shot-driven k-induction, an adaptation of the k-induction method widely used in model checking. Once program states are described, HoarePrompt leverages the LLM to assess whether the program, annotated with these state descriptions, conforms to the natural language requirements. To evaluate the quality of classifiers of program correctness with respect to natural language requirements, we constructed CoCoClaNeL, a challenging dataset of solutions to programming competition problems. Our experiments show that HoarePrompt improves the Matthews correlation coefficient (MCC) by 61% compared to directly using Zero-shot-CoT prompts for correctness classification. Furthermore, HoarePrompt outperforms a classifier that assesses correctness via LLM-based test generation by a 106% increase in MCC. The inductive reasoning mechanism alone contributes a 26% boost to MCC, underscoring its effectiveness in managing loops.
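To make the pipeline concrete, below is a minimal sketch, in Python, of how the steps described in the abstract could fit together. The `query_llm` helper, the prompt wording, and the function names are all hypothetical illustrations under our own assumptions, not the paper's actual implementation or prompt templates.

```python
# A minimal sketch of a HoarePrompt-style pipeline as described in the
# abstract. `query_llm` is a hypothetical stand-in for any LLM client;
# the prompts are illustrative, not the paper's actual templates.

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real client of your choice."""
    raise NotImplementedError("plug in an LLM client here")

def describe_states(statements: list[str]) -> list[str]:
    """Step-by-step state propagation in the spirit of the strongest
    postcondition calculus: the state reached after statement i serves
    as the precondition for statement i+1."""
    state = "no variables are set"
    annotated = []
    for stmt in statements:
        prompt = (
            f"Precondition (natural language): {state}\n"
            f"Statement: {stmt}\n"
            "Describe the program state reachable after this statement."
        )
        state = query_llm(prompt)
        annotated.append(f"{stmt}  # state: {state}")
    return annotated

def summarize_loop(loop_body: str, k: int = 3) -> str:
    """Loosely follows the abstract's few-shot-driven k-induction:
    unroll the first k iterations, then ask for a state description
    that holds after an arbitrary number of iterations."""
    base = query_llm(
        f"Simulate the first {k} iterations of this loop and describe "
        f"the state after each:\n{loop_body}"
    )
    return query_llm(
        f"Given these first {k} iteration states:\n{base}\n"
        "Infer a natural language description of the state after "
        "any number of iterations (the inductive step)."
    )

def classify_correctness(annotated_program: str, requirements: str) -> bool:
    """Final verdict: does the state-annotated program conform to the
    natural language requirements?"""
    verdict = query_llm(
        f"Requirements: {requirements}\n"
        f"Program annotated with reachable states:\n{annotated_program}\n"
        "Answer CORRECT or INCORRECT."
    )
    return "INCORRECT" not in verdict.upper()
```

The design choice the abstract emphasizes is decomposition: state descriptions are propagated statement by statement, and loops are summarized inductively, before any correctness verdict is requested, rather than asking the LLM for a judgment in one shot.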
Similar Papers
Understanding LLM Scientific Reasoning through Promptings and Model's Explanation on the Answers
Artificial Intelligence
Makes AI better at solving hard science problems.
Are Prompts All You Need? Evaluating Prompt-Based Large Language Models (LLMs) for Software Requirements Classification
Software Engineering
Helps computers sort software ideas faster, needing less data.
Short-Path Prompting in LLMs: Analyzing Reasoning Instability and Solutions for Robust Performance
Computation and Language
Makes AI think better even with short questions.