Score: 1

Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems

Published: November 18, 2025 | arXiv ID: 2511.14136v1

By: Sushant Mehta

Potential Business Impact:

Makes AI useful and cheap for businesses.

Business Areas:

Artificial Intelligence, Data and Analytics, Science and Engineering, Software

Current agentic AI benchmarks predominantly evaluate task completion accuracy, while overlooking critical enterprise requirements such as cost-efficiency, reliability, and operational stability. Through systematic analysis of 12 main benchmarks and empirical evaluation of state-of-the-art agents, we identify three fundamental limitations: (1) absence of cost-controlled evaluation leading to 50x cost variations for similar precision, (2) inadequate reliability assessment where agent performance drops from 60% (single run) to 25% (8-run consistency), and (3) missing multidimensional metrics for security, latency, and policy compliance. We propose CLEAR (Cost, Latency, Efficacy, Assurance, Reliability), a holistic evaluation framework specifically designed for enterprise deployment. Evaluation of six leading agents on 300 enterprise tasks demonstrates that optimizing for accuracy alone yields agents 4.4-10.8x more expensive than cost-aware alternatives with comparable performance. Expert evaluation (N=15) confirms that CLEAR better predicts production success (correlation ρ = 0.83) compared to accuracy-only evaluation (ρ = 0.41).
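
The gap between single-run accuracy and multi-run consistency reported in the abstract can be illustrated with a small sketch. The abstract does not define the consistency metric, so the code below assumes one plausible reading: a task counts as consistent only if it succeeds on every repeated run. The function names and the sample data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def single_run_success(results: Dict[str, List[bool]]) -> float:
    """Per-run success rate, averaged over tasks (expected single-run accuracy)."""
    per_task = [sum(runs) / len(runs) for runs in results.values()]
    return sum(per_task) / len(per_task)


def all_runs_consistency(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved on every repeated run (assumed definition)."""
    return sum(all(runs) for runs in results.values()) / len(results)


# Hypothetical outcomes: 4 tasks, 8 runs each (True = task solved).
results = {
    "task_a": [True] * 8,
    "task_b": [True] * 6 + [False] * 2,
    "task_c": [True] * 4 + [False] * 4,
    "task_d": [False] * 8,
}

print(f"single-run success: {single_run_success(results):.4f}")
print(f"8-run consistency:  {all_runs_consistency(results):.4f}")
```

In this made-up data only one of the four tasks succeeds on all 8 runs, so the consistency score falls well below the averaged single-run rate, which is the kind of drop the abstract describes.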

Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents

Artificial Intelligence

Measures AI's real-world usefulness, not just speed.

11 Nov 2025 0

89%

The Measurement Imbalance in Agentic AI Evaluation Undermines Industry Productivity Claims

Computers and Society

Tests AI for real-world use, not just speed.

1 Jun 2025 1

88%

AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise

Artificial Intelligence

Helps AI agents work better together on jobs.

13 Sep 2025 2


Page Count

7 pages
