UCRBench: Benchmarking LLMs on Use Case Recovery
By: Shuyuan Xiao, Yiran Zhang, Weisong Sun, and more
Use cases are widely employed to specify functional requirements, yet existing benchmarks are scarce and risk being misaligned with actual system behavior, which limits rigorous evaluation of large language models (LLMs) in generating use cases from source code. We address this gap by introducing a code-aligned use case benchmark, constructed through manual validation of both user-goal and subfunction use cases across nine real-world software projects. Using this benchmark, we conduct the first systematic study of LLMs on use case recovery and propose a hierarchical evaluation protocol that assesses actor correctness, name accuracy, path fidelity, and behavioral coverage. The results show that while LLMs can partially reconstruct system functionality, their performance varies significantly across projects, with particularly noticeable shortcomings in domain-specific and multi-module systems. The models also exhibit high omission rates and struggle to maintain a consistent level of abstraction when aggregating subfunctions into user-goal use cases, highlighting both the potential and the current limitations of LLM-based use case reverse engineering.
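To make the hierarchical evaluation protocol concrete, the Python sketch below scores a generated use case against a manually validated reference along the four dimensions named in the abstract. The UseCase data model and the specific matching heuristics (exact actor match, Jaccard overlap on names, longest-common-subsequence path fidelity, behavior recall) are illustrative assumptions, not the paper's actual metrics.

# Hypothetical sketch of hierarchical use-case evaluation; data model and
# scoring rules are assumptions for illustration, not the paper's protocol.
from dataclasses import dataclass, field

@dataclass
class UseCase:
    actor: str
    name: str
    main_path: list[str]                               # ordered steps of the main success scenario
    behaviors: set[str] = field(default_factory=set)   # system behaviors the use case covers

def evaluate(generated: UseCase, reference: UseCase) -> dict[str, float]:
    """Score a generated use case against a code-aligned reference."""
    # 1. Actor correctness: exact (case-insensitive) match on the acting role.
    actor_ok = 1.0 if generated.actor.lower() == reference.actor.lower() else 0.0

    # 2. Name accuracy: Jaccard overlap between use-case name tokens.
    g, r = set(generated.name.lower().split()), set(reference.name.lower().split())
    name_acc = len(g & r) / len(g | r) if g | r else 0.0

    # 3. Path fidelity: fraction of reference steps reproduced in order
    #    (longest common subsequence ratio).
    path_fid = _lcs_len(generated.main_path, reference.main_path) / len(reference.main_path) \
        if reference.main_path else 0.0

    # 4. Behavioral coverage: recall of reference behaviors; its complement
    #    corresponds to the omission rate discussed in the abstract.
    coverage = len(generated.behaviors & reference.behaviors) / len(reference.behaviors) \
        if reference.behaviors else 0.0

    return {"actor": actor_ok, "name": name_acc, "path": path_fid,
            "coverage": coverage, "omission_rate": 1.0 - coverage}

def _lcs_len(a: list[str], b: list[str]) -> int:
    # Standard dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

A benchmark run would aggregate these per-use-case scores over all reference use cases in a project, so that unmatched references count as omissions.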
Similar Papers
Leveraging Large Language Models for Use Case Model Generation from Software Requirements
Software Engineering
Uses LLMs to speed up the generation of use case models from software requirements.
Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation
Software Engineering
Finds that LLMs underperform on real-world, class-level code generation compared with synthetic benchmarks.
On Robustness and Reliability of Benchmark-Based Evaluation of LLMs
Computation and Language
Questions how robust and reliable benchmark-based evaluations of LLMs really are.