LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
By: Ming Yin, Dinghan Shen, Silei Xu, and more
Potential Business Impact:
Tests how well AI uses tools for hard jobs.
Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
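To make the plan-based evaluation idea concrete, here is a minimal Python sketch of scoring an agent's tool-call trace against a ground-truth execution plan instead of comparing raw API outputs. All names (ToolCall, plan_match_rate) and the exact matching rule are illustrative assumptions, not the paper's actual evaluator, which relies on curated ground-truth plans rather than simple name matching.

```python
# Hypothetical sketch: score an agent by how many ground-truth plan steps
# its tool-call trace covers, in order. Structures and names are assumed
# for illustration only; they are not LiveMCP-101's real evaluation code.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str      # e.g. "web_search", "file_write"
    args: tuple    # normalized (key, value) pairs


def plan_match_rate(agent_trace: list[ToolCall], gold_plan: list[ToolCall]) -> float:
    """Fraction of ground-truth plan steps matched, in order, by the agent's trace."""
    i, matched = 0, 0
    for call in agent_trace:
        if i < len(gold_plan) and call.tool == gold_plan[i].tool:
            matched += 1
            i += 1
    return matched / len(gold_plan) if gold_plan else 1.0


if __name__ == "__main__":
    gold = [ToolCall("web_search", (("query", "MCP spec"),)),
            ToolCall("file_write", (("path", "notes.md"),))]
    trace = [ToolCall("web_search", (("query", "MCP spec"),)),
             ToolCall("math_eval", (("expr", "2+2"),)),   # extra step, not penalized here
             ToolCall("file_write", (("path", "notes.md"),))]
    print(f"plan match rate: {plan_match_rate(trace, gold):.2f}")  # 1.00
```

Comparing against a reference plan rather than live API outputs keeps the score stable even when web or data sources change between runs, which is the motivation the abstract gives for this design.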
Similar Papers
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
Computation and Language
Tests AI's ability to use many tools together.
LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?
Artificial Intelligence
Tests AI helpers on many real-world tasks.
MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark
Artificial Intelligence
Tests how well AI uses real-world tools.