ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
By: Yuki Imajuku, Kohki Horie, Yoichi Iwata, and more
Potential Business Impact:
Tests AI's skill in solving tough planning puzzles.
How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they achieve high scores on specific problems, they still fall notably short of humans in consistency across problems and in long-horizon problem solving. This underscores the benchmark's value for driving future advances in AI.
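To make the iterative, score-based setup concrete, the following is a minimal sketch of the kind of refinement loop the benchmark encourages: an agent repeatedly revises a candidate program using test-run feedback and keeps the best-scoring version. All names here (`score_solution`, `propose_revision`, the loop structure) are hypothetical placeholders for illustration, not the actual ALE-Bench API.

```python
from typing import Callable, Tuple


def refine_solution(
    initial_code: str,
    score_solution: Callable[[str], float],        # hypothetical: runs public tests, returns a score
    propose_revision: Callable[[str, float], str], # hypothetical: e.g. an LLM agent using score feedback
    budget: int = 20,
) -> Tuple[str, float]:
    """Iteratively revise a candidate program, keeping the best-scoring version.

    This mirrors the score-based contest setting: there is no pass/fail oracle,
    only a numeric score to improve over many attempts.
    """
    best_code = initial_code
    best_score = score_solution(initial_code)
    for _ in range(budget):
        candidate = propose_revision(best_code, best_score)
        score = score_solution(candidate)
        if score > best_score:  # higher score is better in these contests
            best_code, best_score = candidate, score
    return best_code, best_score
```

In practice the revision step would be an interactive agent consuming test-run logs or visualizations, and the budget would correspond to the long contest-style time horizon rather than a fixed iteration count.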
Similar Papers
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
Artificial Intelligence
Tests computers on planning complex game quests.
AI Idea Bench 2025: AI Research Idea Generation Benchmark
Artificial Intelligence
Tests AI's best new ideas for science.
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Software Engineering
Tests AI's ability to fix complex computer code.