AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production
By: NVJK Kartik, Garvit Sapra, Rishav Hada, and more
Potential Business Impact:
Helps teams find and fix failures in AI agent workflows after deployment.
With the growing adoption of Large Language Models (LLMs) in automating complex, multi-agent workflows, organizations face mounting risks from errors, emergent behaviors, and systemic failures that current evaluation methods fail to capture. We present AgentCompass, the first evaluation framework designed specifically for post-deployment monitoring and debugging of agentic workflows. AgentCompass models the reasoning process of expert debuggers through a structured, multi-stage analytical pipeline: error identification and categorization, thematic clustering, quantitative scoring, and strategic summarization. The framework is further enhanced with a dual memory system (episodic and semantic) that enables continual learning across executions. Through collaborations with design partners, we demonstrate the framework's practical utility on real-world deployments before establishing its efficacy against the publicly available TRAIL benchmark. AgentCompass achieves state-of-the-art results on key metrics, while uncovering critical issues missed in human annotations, underscoring its role as a robust, developer-centric tool for reliable monitoring and improvement of agentic systems in production.
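To make the abstract's four-stage pipeline and dual memory concrete, here is a minimal Python sketch. All class and method names (`AgentCompassPipeline`, `MemoryStore`, etc.) are illustrative assumptions, not the paper's actual API; in the real framework each stage would presumably be driven by LLM-based analysis of execution traces rather than the simple heuristics used here.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the pipeline stages described in the abstract:
# error identification -> thematic clustering -> scoring -> summarization,
# backed by an episodic/semantic dual memory.

@dataclass
class MemoryStore:
    episodic: list = field(default_factory=list)   # raw per-execution traces
    semantic: list = field(default_factory=list)   # distilled, reusable findings

class AgentCompassPipeline:
    def __init__(self, memory: MemoryStore):
        self.memory = memory

    def identify_errors(self, trace):
        """Stage 1: flag and categorize errors in an execution trace."""
        return [step for step in trace if step.get("status") == "error"]

    def cluster_themes(self, errors):
        """Stage 2: group related errors into themes."""
        themes = {}
        for err in errors:
            themes.setdefault(err.get("category", "uncategorized"), []).append(err)
        return themes

    def score(self, themes):
        """Stage 3: assign a quantitative score per theme (here: error count)."""
        return {theme: len(errs) for theme, errs in themes.items()}

    def summarize(self, scores):
        """Stage 4: produce a strategic summary, worst themes first."""
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    def run(self, trace):
        errors = self.identify_errors(trace)
        themes = self.cluster_themes(errors)
        scores = self.score(themes)
        summary = self.summarize(scores)
        # Continual learning: keep the raw episode and the distilled summary.
        self.memory.episodic.append(trace)
        self.memory.semantic.append(summary)
        return summary

# Example usage on a toy trace:
trace = [
    {"step": "plan", "status": "ok"},
    {"step": "tool_call", "status": "error", "category": "tool_misuse"},
    {"step": "reply", "status": "error", "category": "hallucination"},
]
print(AgentCompassPipeline(MemoryStore()).run(trace))
```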
Similar Papers
EnCompass: Enhancing Agent Programming with Search Over Program Execution Paths
Artificial Intelligence
Lets AI agents learn and improve faster.
SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
Software Engineering
Tests AI's ability to write and fix code.