Score: 3

SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

Published: November 7, 2025 | arXiv ID: 2511.05459v1

By: Jingxuan Xu, Ken Deng, Weihao Li, and more

BigTech Affiliations: Kuaishou

Potential Business Impact:

Benchmarks AI systems' ability to write, fix, and maintain code across realistic software engineering tasks.

Business Areas:
Software Engineering, Science and Engineering, Software

Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, SWE-Compass provides a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.
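To make the "hierarchy of difficulty across task types, languages, and scenarios" concrete, the sketch below shows one way per-dimension resolution rates could be aggregated from evaluation records. This is a minimal, hypothetical illustration; the field names ("task_type", "language", "scenario", "resolved") and the record format are assumptions for demonstration, not the paper's actual data schema or tooling.

    # Hypothetical sketch: aggregating per-dimension resolution rates for a
    # SWE-Compass-style evaluation. Field names are illustrative assumptions.
    from collections import defaultdict

    def resolution_rates(records, dimension):
        """Group evaluation records by one dimension (e.g. task type or
        language) and return the fraction of resolved instances per group."""
        totals = defaultdict(int)
        solved = defaultdict(int)
        for rec in records:
            key = rec[dimension]
            totals[key] += 1
            solved[key] += int(rec["resolved"])
        return {key: solved[key] / totals[key] for key in totals}

    # Example usage with made-up records.
    records = [
        {"task_type": "bug_fix", "language": "python", "scenario": "backend", "resolved": True},
        {"task_type": "bug_fix", "language": "go", "scenario": "backend", "resolved": False},
        {"task_type": "feature", "language": "python", "scenario": "frontend", "resolved": False},
    ]
    print(resolution_rates(records, "task_type"))  # {'bug_fix': 0.5, 'feature': 0.0}
    print(resolution_rates(records, "language"))   # {'python': 0.5, 'go': 0.0}

Comparing such per-group rates across task types, languages, and scenarios is one simple way to surface the kind of difficulty ordering the benchmark reports.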

Country of Origin
🇨🇳 China

Repos / Data Links

Page Count
24 pages

Category
Computer Science: Software Engineering