Multi-Docker-Eval: A "Shovel of the Gold Rush" Benchmark on Automatic Environment Building for Software Engineering
By: Kelin Fu, Tianyu Liu, Zeyu Shang, and more
Potential Business Impact:
Helps computers set up software faster and more reliably.
Automated environment configuration is a critical bottleneck in scaling software engineering (SWE) automation. To provide a reliable evaluation standard for this task, we present the Multi-Docker-Eval benchmark. It includes 40 real-world repositories spanning 9 programming languages and measures both success in reaching executable states and efficiency under realistic constraints. Our extensive evaluation of state-of-the-art LLMs and agent frameworks reveals key insights: (1) the overall success rate of current models is low (fail-to-pass rate, F2P, at most 37.7%), with environment construction being the primary bottleneck; (2) model size and reasoning length are not decisive factors, and open-source models like DeepSeek-V3.1 and Kimi-K2 are competitive in both efficiency and effectiveness; (3) the agent framework and programming language also have a significant influence on the success rate. These findings provide actionable guidelines for building scalable, fully automated SWE pipelines.
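The benchmark's core check, reaching an "executable state," amounts to building a containerized environment for a repository and verifying that its tests run inside it. Below is a minimal sketch of such a fail-to-pass (F2P) style check, assuming a Docker-based harness; the function names, image tag, repository path, and test command are illustrative assumptions, not the benchmark's actual interface.

```python
import subprocess

def build_image(repo_dir: str, tag: str) -> bool:
    """Build a Docker image from the (agent-generated) Dockerfile in repo_dir."""
    result = subprocess.run(
        ["docker", "build", "-t", tag, repo_dir],
        capture_output=True, text=True,
    )
    return result.returncode == 0

def tests_pass(tag: str, test_cmd: str) -> bool:
    """Run the repository's test command inside the built image."""
    result = subprocess.run(
        ["docker", "run", "--rm", tag, "sh", "-c", test_cmd],
        capture_output=True, text=True,
    )
    return result.returncode == 0

def fail_to_pass(repo_dir: str, tag: str, test_cmd: str) -> bool:
    """F2P-style check: the environment counts as successful only if the
    image builds AND the previously failing tests now pass inside it."""
    return build_image(repo_dir, tag) and tests_pass(tag, test_cmd)

if __name__ == "__main__":
    # Hypothetical repository and test command; the real harness, repo
    # list, and per-repo test commands are defined by the benchmark.
    ok = fail_to_pass("./example-repo", "multi-docker-eval/example", "pytest -q")
    print("executable state reached:", ok)
```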
Similar Papers
SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?
Software Engineering
Helps computers fix slow code automatically.
SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
Software Engineering
Helps computers learn to fix software bugs faster.