OPT-BENCH: Evaluating LLM Agent on Large-Scale Search Spaces Optimization Problems
By: Xiaozhe Li, Jixuan Chen, Xinyu Fang, and more
Potential Business Impact:
Helps computers learn from past feedback to solve hard optimization problems better.
Large Language Models (LLMs) have shown remarkable capabilities in solving diverse tasks. However, their proficiency in iteratively optimizing complex solutions by learning from previous feedback remains insufficiently explored. To bridge this gap, we present OPT-BENCH, a comprehensive benchmark designed to evaluate LLM agents on large-scale search-space optimization problems. OPT-BENCH includes 20 real-world machine learning tasks sourced from Kaggle and 10 classical NP problems, offering a diverse and challenging environment for assessing LLM agents on iterative reasoning and solution refinement. To enable rigorous evaluation, we introduce OPT-Agent, an end-to-end optimization framework that emulates human reasoning when tackling complex problems by generating, validating, and iteratively improving solutions while leveraging historical feedback. Through extensive experiments on 9 state-of-the-art LLMs from 6 model families, we analyze the effects of optimization iterations, temperature settings, and model architectures on solution quality and convergence. Our results demonstrate that incorporating historical context significantly enhances optimization performance across both ML and NP tasks. All datasets, code, and evaluation tools are open-sourced to promote further research in advancing LLM-driven optimization and iterative reasoning. Project page: https://github.com/OliverLeeXZ/OPT-BENCH.
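To make the generate-validate-refine loop described in the abstract concrete, here is a minimal Python sketch of a history-driven optimization loop of the kind OPT-Agent performs. All names here (`Attempt`, `History`, `optimize`, and the `llm` and `validate` callables) are hypothetical illustrations, not OPT-Agent's actual API; the real implementation lives in the linked repository.

```python
from dataclasses import dataclass, field

@dataclass
class Attempt:
    solution: str   # candidate solution (e.g., ML pipeline code or an NP-problem answer)
    score: float    # validator feedback (e.g., a Kaggle metric or a tour length)

@dataclass
class History:
    attempts: list[Attempt] = field(default_factory=list)

    def summary(self) -> str:
        # Condense past attempts into textual feedback the LLM can condition on.
        return "\n".join(
            f"attempt {i}: score={a.score:.4f}"
            for i, a in enumerate(self.attempts)
        )

def optimize(task: str, llm, validate, iterations: int = 10) -> Attempt:
    """Generate, validate, and iteratively refine solutions using historical feedback.

    `llm` maps a prompt string to a candidate solution string; `validate` maps a
    solution to a numeric score. Both are assumed, task-specific callables.
    """
    history = History()
    best = None
    for _ in range(iterations):
        # Prompt the model with the task plus feedback from all earlier attempts,
        # mirroring the "incorporating historical context" setup in the abstract.
        prompt = (
            f"Task: {task}\n"
            f"Previous attempts:\n{history.summary()}\n"
            "Propose an improved solution."
        )
        solution = llm(prompt)        # hypothetical LLM call
        score = validate(solution)    # task-specific validator
        attempt = Attempt(solution, score)
        history.attempts.append(attempt)
        # Assumes higher is better; flip the comparison for minimization tasks.
        if best is None or attempt.score > best.score:
            best = attempt
    return best
```

In this sketch, validation stands in for whatever feedback signal the task provides, such as running a trained model against a held-out metric for the Kaggle-derived ML tasks or checking feasibility and objective value for the NP problems.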
Similar Papers
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization
Computation and Language
Helps computers solve tricky planning problems better.
A Survey on the Optimization of Large Language Model-based Agents
Artificial Intelligence
Makes smart computer helpers plan better.
Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions
Artificial Intelligence
Tests AI that handles many jobs at once.