JudgeFlow: Agentic Workflow Optimization via Block Judge
By: Zihan Ma, Zhikai Zhao, Chuanbo Hua, and more
Potential Business Impact:
Fixes AI mistakes by finding bad steps.
Optimizing LLM-based agentic workflows is a central challenge in scaling AI capabilities. Current methods rely on coarse, end-to-end evaluation signals and offer little fine-grained guidance on where to refine, often resulting in inefficient or low-impact modifications. To address these limitations, we propose JudgeFlow, an Evaluation-Judge-Optimization-Update pipeline. We incorporate reusable, configurable logic blocks into agentic workflows to capture fundamental forms of logic. On top of this abstraction, we design a dedicated Judge module that inspects execution traces, particularly failed runs, and assigns rank-based responsibility scores to problematic blocks. These fine-grained diagnostic signals are then leveraged by an LLM-based optimizer, which focuses modifications on the most problematic block in the workflow. Our approach improves sample efficiency, enhances interpretability through block-level diagnostics, and provides a scalable foundation for automating increasingly complex agentic workflows. We evaluate JudgeFlow on mathematical reasoning and code generation benchmarks, where it achieves superior performance and efficiency compared to existing methods. The source code is publicly available at https://github.com/ma-zihan/JudgeFlow.
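To make the pipeline concrete, here is a minimal sketch of one Evaluation-Judge-Optimization-Update iteration in Python. All names (`Block`, `Trace`, `judge`, `optimize`, `judgeflow_step`, `run_workflow`) are hypothetical stand-ins, not the repository's actual API, and the counting-based judge below is a toy surrogate for the paper's LLM Judge, which inspects traces and produces rank-based responsibility scores.

```python
from dataclasses import dataclass

@dataclass
class Block:
    """A reusable, configurable logic block in an agentic workflow (hypothetical)."""
    name: str
    prompt: str  # the block's LLM instruction / logic

@dataclass
class Trace:
    """One execution trace: per-block outputs plus a pass/fail flag."""
    block_outputs: dict  # block name -> output produced during the run
    success: bool

def judge(traces: list[Trace], blocks: list[Block]) -> dict[str, float]:
    """Toy Judge: derive rank-based responsibility scores from failed runs.

    Here we simply count how often each block appears in failed traces and
    convert the counts to ranks; the paper's Judge is an LLM that inspects
    the traces themselves before ranking blocks."""
    fail_counts = {b.name: 0 for b in blocks}
    for t in traces:
        if not t.success:
            for name in t.block_outputs:
                fail_counts[name] = fail_counts.get(name, 0) + 1
    ranked = sorted(fail_counts, key=fail_counts.get)  # ascending by count
    return {name: float(rank) for rank, name in enumerate(ranked)}

def optimize(block: Block) -> Block:
    """Placeholder for the LLM-based optimizer that rewrites a single block."""
    return Block(block.name, block.prompt + "\n# (revised by optimizer LLM)")

def judgeflow_step(blocks, run_workflow, tasks):
    """One Evaluation-Judge-Optimization-Update iteration.

    `run_workflow(blocks, task) -> Trace` is a user-supplied callable that
    executes the workflow on a task and records a trace."""
    traces = [run_workflow(blocks, task) for task in tasks]  # Evaluation
    scores = judge(traces, blocks)                           # Judge
    worst = max(blocks, key=lambda b: scores[b.name])        # highest score
    revised = optimize(worst)                                # Optimization
    return [revised if b.name == worst.name else b           # Update
            for b in blocks]

# Illustrative usage (assumed names):
#   blocks = [Block("plan", "..."), Block("solve", "..."), Block("verify", "...")]
#   for _ in range(num_rounds):
#       blocks = judgeflow_step(blocks, run_workflow, train_tasks)
```

The key design point this sketch illustrates is that the optimizer touches only the single block the Judge holds most responsible, which is how the method avoids the inefficient, low-impact modifications that coarse end-to-end signals tend to produce.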
Similar Papers
A²Flow: Automating Agentic Workflow Generation via Self-Adaptive Abstraction Operators
Artificial Intelligence
Automates computer task planning without human help.
Agent-as-a-Judge
Computation and Language
Makes AI judges smarter and more trustworthy.
Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation
Artificial Intelligence
Checks AI's thinking, not just its answers.