Score: 4

SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

Published: November 7, 2025 | arXiv ID: 2511.05459v3

By: Jingxuan Xu, Ken Deng, Weihao Li, and more

BigTech Affiliations: Kuaishou

Potential Business Impact:

Benchmarks how well AI models can write, fix, and maintain code across realistic software engineering tasks.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, SWE-Compass provides a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.
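
The abstract does not specify SWE-Compass's data format, but a benchmark curated from GitHub pull requests typically pairs each instance with a source repository, a task description, and a validation step, tagged by the dimensions the paper names (task type, scenario, language). The sketch below is a hypothetical illustration of such an instance record and a pass-rate tally; the names `BenchmarkInstance` and `pass_rate` are assumptions for illustration, not SWE-Compass's actual schema or API.

```python
from dataclasses import dataclass

# Hypothetical instance schema; the paper's real format is not given
# in the abstract. Field names here are illustrative only.
@dataclass
class BenchmarkInstance:
    repo: str               # source GitHub repository, e.g. "owner/project"
    language: str           # one of the benchmark's 10 programming languages
    task_type: str          # one of the 8 task types (e.g. bug fixing)
    scenario: str           # one of the 8 programming scenarios
    problem_statement: str  # task description distilled from the pull request
    test_command: str       # validation command the agent's patch must pass

def pass_rate(results: list[bool]) -> float:
    """Fraction of instances whose validation tests passed."""
    return sum(results) / len(results) if results else 0.0

# Example: tallying hypothetical per-language outcomes, as one might
# when comparing difficulty across the benchmark's languages.
results_by_language = {
    "Python": [True, True, False],
    "Go": [True, False, False],
}
for lang, outcomes in results_by_language.items():
    print(f"{lang}: {pass_rate(outcomes):.2%}")
```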

Country of Origin
🇨🇳 China


Page Count
24 pages

Category
Computer Science: Software Engineering