
SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

Published: November 8, 2025 | arXiv ID: 2511.06090v2

By: Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, and more

Potential Business Impact:

Helps computers fix slow code automatically.

Business Areas:

Natural Language Processing Artificial Intelligence, Data and Analytics, Software

Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather than how to fix code. We introduce SWE-fficiency, a benchmark for evaluating repository-level performance optimization on real workloads. Our suite contains 498 tasks across nine widely used data-science, machine-learning, and HPC repositories (e.g., numpy, pandas, scipy): given a complete codebase and a slow workload, an agent must investigate code semantics, localize bottlenecks and relevant tests, and produce a patch that matches or exceeds expert speedup while passing the same unit tests. To enable this how-to-fix evaluation, our automated pipeline scrapes GitHub pull requests for performance-improving edits, combining keyword filtering, static analysis, coverage tooling, and execution validation to both confirm expert speedup baselines and identify relevant repository unit tests. Empirical evaluation of state-of-the-art agents reveals significant underperformance. On average, agents achieve less than 0.15x the expert speedup: agents struggle in localizing optimization opportunities, reasoning about execution across functions, and maintaining correctness in proposed edits. We release the benchmark and accompanying data pipeline to facilitate research on automated performance engineering and long-horizon software reasoning.
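The headline comparison in the abstract (agent speedup as a fraction of expert speedup, with correctness preserved) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `TaskTiming` record, its field names, and the choice to count a test-failing patch as no speedup are hypothetical, and the benchmark's actual scoring and aggregation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTiming:
    """Hypothetical per-task runtimes in seconds (not the benchmark's schema)."""
    baseline_s: float  # workload runtime on the unpatched codebase
    expert_s: float    # workload runtime after the expert patch
    agent_s: float     # workload runtime after the agent's patch
    tests_pass: bool   # whether the agent's patch passes the unit tests


def speedup(before_s: float, after_s: float) -> float:
    """Return how many times faster the patched run is than the original."""
    if before_s <= 0 or after_s <= 0:
        raise ValueError("runtimes must be positive")
    return before_s / after_s


def speedup_ratio(t: TaskTiming) -> float:
    """Return the agent's speedup as a fraction of the expert's speedup.

    Assumption: a patch that fails the tests is scored as no change
    (speedup 1.0), so a fast but incorrect patch earns no credit.
    """
    agent = speedup(t.baseline_s, t.agent_s) if t.tests_pass else 1.0
    return agent / speedup(t.baseline_s, t.expert_s)
```

For example, with a 10 s baseline, a 2 s expert-patched run (5x), and an 8 s agent-patched run (1.25x), `speedup_ratio` returns 0.25: the agent recovers a quarter of the expert speedup. A ratio of 1.0 or more would mean the agent matched or exceeded the expert.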

SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

Software Engineering

Helps computers fix slow code automatically.

8 Nov 2025 1

95%

SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?

Software Engineering

Makes computer programs run much faster.

16 Jul 2025 0

92%

SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

Software Engineering

Teaches computers to fix and add code.

19 Dec 2025 0


Country of Origin

🇺🇸 United States

Page Count

39 pages
