Score: 3

Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems

Published: November 2, 2025 | arXiv ID: 2511.00780v1

By: Chenyu Zhao , Shenglin Zhang , Zeshun Huang and more

BigTech Affiliations: Microsoft

Potential Business Impact:

Helps computers fix code when changing computer types.

Business Areas:

Natural Language Processing Artificial Intelligence, Data and Analytics, Software

Large language models (LLMs) have shown growing potential in software engineering, yet few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs). Cross-ISA migration, such as between x86_64 and aarch64, requires handling complex dependencies, heterogeneous toolchains, and long build logs while ensuring executable verification. To address this challenge, we present Build-bench, an end-to-end benchmark that systematically evaluates the capability of LLMs to repair build failures in cross-ISA settings. Build-bench collects 268 real-world failed packages and integrates auxiliary tools including Structure Extraction, File Content Extraction, Content Modification, and Build Verification to support autonomous, tool-augmented reasoning. The repair process operates in an iterative loop where, upon failure, the model receives updated build logs and previous repair outcomes to refine subsequent attempts. Through a comparative evaluation of six representative LLMs, Build-bench reveals that current models achieve a maximum build success rate of 63% and tool usage patterns differ significantly across models. By coupling real build environments with verifiable outcomes, Build-bench establishes the first architecture-aware benchmark for studying LLM-based software build and repair.