Continuous Benchmark Generation for Evaluating Enterprise-scale LLM Agents
By: Divyanshu Saxena, Rishikesh Maurya, Xiaoxuan Ou, and more
Potential Business Impact:
Creates better, continuously updated tests for AI agents used in large companies.
The rapid adoption of AI agents across domains has made systematic evaluation crucial for ensuring their usefulness and successful production deployment. Evaluating an AI agent typically involves running it against a fixed set of benchmarks and computing multiple evaluation metrics. While sufficient for simple coding tasks, these benchmarks fall short for enterprise-scale agents, where services and requirements evolve continuously and ground-truth examples are sparse. We propose a benchmark-generation process that evolves the benchmarks as requirements change and enables robust evaluation of evolving AI agents. We instantiate this approach in a case study of service migration from one deployment platform to another at a large public enterprise. Our approach relies on semi-structured documents in which developers express high-level intent, and uses state-of-the-art LLMs to generate benchmarks from just a small number of such documents. Overall, this process yields a maintainable evaluation framework, enabling rapid feedback on agent performance and facilitating targeted improvements.
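The abstract describes this pipeline only at a high level. Purely as an illustrative sketch, and not the authors' implementation, the Python below shows one way benchmark generation from intent documents could be wired up. The `BenchmarkCase` fields, the prompt wording, and the `llm` callable are hypothetical placeholders for whatever schema and model the actual framework uses.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical shape of one generated benchmark case: an input service
# description plus the checks an agent's migration output must satisfy.
@dataclass
class BenchmarkCase:
    service_name: str
    source_spec: str            # e.g. config on the legacy deployment platform
    expected_checks: List[str]  # assertions to evaluate the agent's output

PROMPT_TEMPLATE = """\
You are generating evaluation benchmarks for a service-migration agent.
Given the following developer intent document, produce a JSON list of
benchmark cases, each with fields: service_name, source_spec, expected_checks.

Intent document:
{doc}
"""

def generate_benchmarks(
    intent_docs: List[str],
    llm: Callable[[str], str],
) -> List[BenchmarkCase]:
    """Turn a small set of semi-structured intent documents into benchmark
    cases by prompting an LLM; rerun whenever the documents change."""
    cases: List[BenchmarkCase] = []
    for doc in intent_docs:
        raw = llm(PROMPT_TEMPLATE.format(doc=doc))
        for item in json.loads(raw):
            cases.append(
                BenchmarkCase(
                    service_name=item["service_name"],
                    source_spec=item["source_spec"],
                    expected_checks=item["expected_checks"],
                )
            )
    return cases

if __name__ == "__main__":
    # Stub LLM so the sketch runs end to end without any external API.
    def stub_llm(prompt: str) -> str:
        return json.dumps([{
            "service_name": "billing-api",
            "source_spec": "platform: legacy-vm\nreplicas: 3",
            "expected_checks": ["target platform is kubernetes",
                                "replica count preserved"],
        }])

    intent_doc = "Migrate billing-api from legacy VMs to Kubernetes; keep replica count."
    for case in generate_benchmarks([intent_doc], stub_llm):
        print(case.service_name, case.expected_checks)
```

Because the benchmark cases are derived directly from the intent documents, rerunning the same generation step when a document changes refreshes the affected cases, which is what keeps the benchmarks aligned with evolving requirements.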
Similar Papers
Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development
Software Engineering
Helps AI build better computer programs.
A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System
Software Engineering
Helps AI build better computer programs.