BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
By: Terry Yue Zhuo, Xiaolong Jin, Hange Liu, and more
Potential Business Impact:
Evaluates AI-written code by running it, without needing human reviewers.
Crowdsourced model evaluation platforms, such as Chatbot Arena, let humans assess the quality of model responses in real time. In the coding domain, however, manually judging the quality of LLM-generated content is extremely challenging: it requires reading long chunks of raw code and mentally simulating their execution. To this end, we introduce BigCodeArena, an open human evaluation platform for code generation backed by a comprehensive, on-the-fly execution environment. Built on top of Chatbot Arena, BigCodeArena executes LLM-generated code and lets humans interact with both the execution process and its outcomes. We collected over 14,000 raw code-centric conversation sessions across 10 widely used LLMs, spanning 10 languages and 8 types of execution environments. Among these conversations, we identified more than 4,700 multi-turn samples with pairwise human preferences. Further analysis uncovers underexplored preferences over LLM outputs in fine-grained domains characterized by task, language, and framework. To systematically examine the code understanding and generation capabilities of frontier LLMs, we curated two benchmarks from the collected data: BigCodeReward and AutoCodeArena. For BigCodeReward, we post-processed the 4,700 conversations and measured how consistently reward models agree with human preferences; most LLMs judge coding preferences markedly better when execution results are available. Inspired by these findings, we propose AutoCodeArena, an automatic Elo rating benchmark that assesses the coding quality of LLMs without human involvement. We find that proprietary LLMs such as GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead in code generation among recently released models.
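The abstract describes aggregating pairwise preference judgments into an Elo-style ranking. As a rough illustration only (the paper's actual rating procedure is not given here, and Chatbot-Arena-style leaderboards often fit a Bradley-Terry model instead), a minimal Elo update over hypothetical battle records might look like the following sketch; the model names, K-factor, and battle outcomes are placeholders.

# Hypothetical sketch: Elo-style ratings from pairwise code-generation battles.
# Not BigCodeArena's actual pipeline; data and parameters are illustrative.

def update_elo(ratings, model_a, model_b, outcome, k=32):
    """Update two models' Elo ratings after one pairwise comparison.

    outcome: 1.0 if model_a wins, 0.0 if model_b wins, 0.5 for a tie.
    """
    ra, rb = ratings[model_a], ratings[model_b]
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
    ratings[model_a] = ra + k * (outcome - expected_a)
    ratings[model_b] = rb + k * ((1.0 - outcome) - (1.0 - expected_a))

# Toy battles: (model_a, model_b, outcome as decided by a human or automated judge).
battles = [
    ("gpt-5", "claude-sonnet-4", 1.0),
    ("claude-opus-4", "gpt-5", 0.5),
    ("claude-sonnet-4", "claude-opus-4", 0.0),
]

# Start every model at a common baseline rating, then replay the battles.
ratings = {m: 1000.0 for m in {m for a, b, _ in battles for m in (a, b)}}
for a, b, outcome in battles:
    update_elo(ratings, a, b, outcome)

for model, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.1f}")

In an AutoCodeArena-style setting, the human vote would be replaced by an automated judge's verdict on the executed outputs, but the aggregation step stays the same.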
Similar Papers
CodeArena: A Collective Evaluation Platform for LLM Code Generation
Software Engineering
Tests AI code writing fairly, without cheating.
SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?
Computation and Language
Tests AI helpers without needing real people.
Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning
Machine Learning (CS)
Tests if AI can solve tricky real-world problems.