Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification
By: Xu Xu, Xin Li, Xingwei Qu, and more
Potential Business Impact:
Tests whether AI can generate formally verified code whose separately written components work correctly together.
We introduce DafnyCOMP, a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny. Unlike prior benchmarks that focus on single-function tasks, DafnyCOMP targets programs composed of multiple interacting functions with data dependencies, requiring reasoning across component boundaries. The benchmark consists of 300 automatically synthesized multi-function programs. We evaluate several state-of-the-art LLM families and find that, while they perform well on single-function verification, their performance drops sharply on compositional tasks. Analysis reveals systematic failures in cross-functional reasoning, including fragile specifications, misalignment between implementations and proofs, and unstable reasoning. DafnyCOMP thus provides a diagnostic tool for measuring progress toward reliable, verifiable, and compositional code generation with LLMs.
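To make "compositional specification generation" concrete, here is a minimal Dafny sketch (illustrative only, not drawn from the DafnyCOMP benchmark): the caller's proof relies entirely on the callee's ensures clause, so a specification that verifies one function in isolation can still be too weak to support the functions that depend on it.

// Illustrative sketch, not a DafnyCOMP task: SumOfSquares verifies only
// because Square's postcondition is strong enough to be reused at the call site.
method Square(x: int) returns (r: int)
  ensures r == x * x
{
  r := x * x;
}

method SumOfSquares(a: int, b: int) returns (s: int)
  ensures s == a * a + b * b
{
  var sa := Square(a);
  var sb := Square(b);
  // Dafny proves this method's postcondition using only Square's ensures clause.
  s := sa + sb;
}

If Square's postcondition were weakened to something like ensures r >= 0, Square would still verify on its own while SumOfSquares would not, which is the kind of cross-function fragility the benchmark is designed to expose.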