MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

Published: August 28, 2025 | arXiv ID: 2508.20453v1

By: Zhenting Wang, Qi Chang, Hemani Patel and more

Potential Business Impact:

Tests AI's ability to use many tools together.

Business Areas:

Simulation Software

We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning and reasoning. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows. Existing benchmarks, which rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations, do not adequately evaluate these capabilities. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.

MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools

Computation and Language

Tests AI helpers using tools better.

10 Sep 2025 1

94%

MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark

Artificial Intelligence

Tests how well AI uses real-world tools.

11 Aug 2025 3

93%

LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

Artificial Intelligence

Tests AI helpers on many real-world tasks.

3 Aug 2025 0

Page Count

52 pages
