Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems
By: Chenyu Zhao , Shenglin Zhang , Zeshun Huang and more
Potential Business Impact:
Helps computers fix code when changing computer types.
Large language models (LLMs) have shown growing potential in software engineering, yet few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs). Cross-ISA migration, such as between x86_64 and aarch64, requires handling complex dependencies, heterogeneous toolchains, and long build logs while ensuring executable verification. To address this challenge, we present Build-bench, an end-to-end benchmark that systematically evaluates the capability of LLMs to repair build failures in cross-ISA settings. Build-bench collects 268 real-world failed packages and integrates auxiliary tools including Structure Extraction, File Content Extraction, Content Modification, and Build Verification to support autonomous, tool-augmented reasoning. The repair process operates in an iterative loop where, upon failure, the model receives updated build logs and previous repair outcomes to refine subsequent attempts. Through a comparative evaluation of six representative LLMs, Build-bench reveals that current models achieve a maximum build success rate of 63% and tool usage patterns differ significantly across models. By coupling real build environments with verifiable outcomes, Build-bench establishes the first architecture-aware benchmark for studying LLM-based software build and repair.
Similar Papers
Empirical Evaluation of Large Language Models in Automated Program Repair
Software Engineering
Fixes computer code errors faster and better.
Automated Repair of C Programs Using Large Language Models
Software Engineering
Fixes computer code bugs automatically.
Cross-Task Benchmarking and Evaluation of General-Purpose and Code-Specific Large Language Models
Software Engineering
Makes computers better at understanding language and code.