Can Small Agent Collaboration Beat a Single Big LLM?
By: Agata Żywot, Xinyi Chen, Maarten de Rijke
Potential Business Impact:
Small AI with tools can beat big AI without them.
This report studies whether small, tool-augmented agents can match or outperform larger monolithic models on the GAIA benchmark. Using Qwen3 models (4B to 32B) within an adapted Agentic-Reasoning framework, we isolate the effects of model scale, explicit thinking (no thinking, planner-only, or full), and tool use (search, code, mind-map). Tool augmentation provides the largest and most consistent gains: with tool access, 4B models can outperform 32B models without tools on GAIA in our experimental setup. In contrast, the effect of explicit thinking is highly configuration- and difficulty-dependent: planner-only thinking can improve task decomposition and constraint tracking, whereas unrestricted full thinking often degrades performance by destabilizing tool orchestration, leading to skipped verification steps, excessive tool calls, non-termination, and output-format drift.
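To make the study's three ablation axes concrete, the sketch below enumerates model scale, thinking mode, and tool set as a configuration grid. It is a minimal illustration under stated assumptions, not the report's actual evaluation harness: only the 4B and 32B Qwen3 endpoints named in the abstract are listed, and `evaluate_on_gaia` is a hypothetical placeholder for running one configuration over the GAIA tasks.

```python
from itertools import product
from typing import Callable

# Ablation axes described in the report (labels are assumed, not the paper's exact config keys).
MODEL_SCALES = ["Qwen3-4B", "Qwen3-32B"]          # only the endpoints mentioned in the abstract
THINKING_MODES = ["no_thinking", "planner_only", "full"]
TOOL_SETS = [
    [],                                  # monolithic baseline: no tool access
    ["search", "code", "mind_map"],      # full tool augmentation
]


def evaluate_on_gaia(model: str, thinking: str, tools: list[str]) -> float:
    """Hypothetical stand-in for evaluating one configuration on GAIA.

    A real harness would instantiate the agent framework, run every GAIA task,
    and return an accuracy; here we return a dummy score so the sweep runs.
    """
    return 0.0


def run_ablation(evaluate: Callable[[str, str, list[str]], float]) -> None:
    # Crossing all three axes is what lets each factor's effect be isolated.
    for model, thinking, tools in product(MODEL_SCALES, THINKING_MODES, TOOL_SETS):
        score = evaluate(model, thinking, tools)
        label = "+".join(tools) if tools else "no-tools"
        print(f"{model:<10} thinking={thinking:<13} tools={label:<22} GAIA score={score:.3f}")


if __name__ == "__main__":
    run_ablation(evaluate_on_gaia)
```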
Similar Papers
When Do Tools and Planning Help LLMs Think? A Cost- and Latency-Aware Benchmark
Computation and Language
Helps AI answer questions better, but can be slow.
ML-Tool-Bench: Tool-Augmented Planning for ML Tasks
Machine Learning (CS)
Helps AI agents plan complex data tasks better.
Towards a Science of Scaling Agent Systems
Artificial Intelligence
Makes AI agents work better together.