Beyond Static Tools: Test-Time Tool Evolution for Scientific Reasoning
By: Jiaxuan Lu, Ziyu Kong, Yemin Wang, and more
The central challenge of AI for Science is not reasoning alone but the ability to create computational methods for an open-ended scientific world. Existing LLM-based agents rely on static, predefined tool libraries, a paradigm that fundamentally fails in scientific domains where tools are sparse, heterogeneous, and intrinsically incomplete. In this paper, we propose Test-Time Tool Evolution (TTE), a new paradigm that enables agents to synthesize, verify, and evolve executable tools during inference. By transforming tools from fixed resources into problem-driven artifacts, TTE overcomes the rigidity and long-tail limitations of static tool libraries. To support rigorous evaluation, we introduce SciEvo, a benchmark of 1,590 scientific reasoning tasks supported by 925 automatically evolved tools. Extensive experiments show that TTE achieves state-of-the-art performance in both accuracy and tool efficiency while enabling effective cross-domain adaptation of computational tools. The code and benchmark are available at https://github.com/lujiaxuan0520/Test-Time-Tool-Evol.
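To make the synthesize-verify-evolve cycle concrete, the sketch below shows one way such a loop could look. This is a minimal illustration under stated assumptions, not the authors' released implementation: the `llm` callable, the `synthesize_tool`/`verify_tool`/`evolve` helpers, and the task-derived checks are all hypothetical names introduced here for exposition.

```python
# Hypothetical sketch of a test-time tool-evolution loop (not the authors'
# implementation). An agent asks an LLM to synthesize a tool as executable
# code, verifies it against checks derived from the task, and refines it
# from execution feedback until it passes or the budget is exhausted.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    source: str    # executable Python source synthesized at inference time
    fn: Callable   # compiled entry point

def synthesize_tool(llm: Callable[[str], str], task: str, feedback: str = "") -> Tool:
    """Ask the LLM for a self-contained Python function solving `task`.
    `feedback` carries error traces from failed verification attempts."""
    prompt = f"Write a Python function `solve` for: {task}\nFeedback: {feedback}"
    source = llm(prompt)          # assumed to return code as a string
    namespace: dict = {}
    exec(source, namespace)       # compile the synthesized tool
    return Tool(name="solve", source=source, fn=namespace["solve"])

def verify_tool(tool: Tool, checks: list) -> str:
    """Run task-derived (args, expected) checks; return '' on success,
    or an error trace to feed back into the next synthesis round."""
    for args, expected in checks:
        try:
            if tool.fn(*args) != expected:
                return f"check failed on {args}: expected {expected}"
        except Exception as exc:
            return f"execution error on {args}: {exc!r}"
    return ""

def evolve(llm, task: str, checks: list, budget: int = 3) -> Tool | None:
    """Synthesize-verify-refine loop: the core test-time evolution cycle."""
    feedback = ""
    for _ in range(budget):
        tool = synthesize_tool(llm, task, feedback)
        feedback = verify_tool(tool, checks)
        if not feedback:
            return tool           # verified tool, ready for reuse
    return None                   # budget exhausted
```

In this reading, a verified tool would be cached into the agent's library rather than discarded, which is one plausible mechanism for the cross-domain adaptation the abstract reports.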
Similar Papers
EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems
Artificial Intelligence
AI learns new skills while playing games.
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
Software Engineering
Helps AI build and fix big computer programs.