
Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks

Published: May 12, 2025 | arXiv ID: 2505.07473v1

By: Kai Xu, YiWei Mao, XinYi Guan and more

Potential Business Impact:

Tests AI's ability to build websites.

Business Areas:

Application Performance Management, Data and Analytics, Software

The application of large language models (LLMs) to coding is evolving rapidly: from code assistants, to autonomous coding agents, to generating complete projects from natural language. Early LLM code benchmarks focused primarily on code generation accuracy, but they have gradually become saturated, which weakens their value as a guide for LLM development. For example, HumanEval Pass@1 has reached 99.4% and MBPP 94.2%. Among the attempts to address benchmark saturation, approaches based on software engineering have stood out, but existing software engineering benchmarks are also saturating rapidly. To address this, we propose a new benchmark, Web-Bench, which contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in sequence, simulating real-world human development workflows. Web-Bench is designed to cover the foundational elements of Web development: Web Standards and Web Frameworks. The projects were designed by engineers with 5 to 10 years of experience, and given their scale and complexity, each presents a significant challenge: on average, a single project takes a senior engineer 4 to 8 hours to complete. With our benchmark agent (Web-Agent), the SOTA model (Claude 3.7 Sonnet) achieves only 25.1% Pass@1, significantly lower than the SWE-Bench Verified (65.4%) and Full (33.8%) scores; for a benchmark, a lower top score is better because it indicates less saturation. Finally, we argue that in any development field, Standards and Frameworks represent foundational knowledge and efficiency tools, respectively, and that LLMs require optimization tailored to them.
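To make the "tasks with sequential dependencies" structure concrete, here is a minimal sketch of how a Pass@1 score over such projects might be computed. The abstract does not spell out the scoring rule, so the code rests on an assumption: a project is abandoned at its first failed task (later tasks depend on it), and the benchmark score is the mean fraction of tasks passed per project. The function names `project_pass_rate` and `benchmark_pass_at_1` are hypothetical and are not taken from Web-Bench.

```python
from typing import Sequence


def project_pass_rate(task_results: Sequence[bool]) -> float:
    """Fraction of a project's tasks passed before the first failure.

    Assumption (not from the abstract): tasks are attempted once each,
    in order, and a failure stops the project because every later task
    builds on the failed one.
    """
    if not task_results:
        raise ValueError("a project needs at least one task")
    passed = 0
    for ok in task_results:
        if not ok:
            break  # later tasks depend on this one, so stop counting
        passed += 1
    return passed / len(task_results)


def benchmark_pass_at_1(projects: Sequence[Sequence[bool]]) -> float:
    """Mean per-project pass rate across all projects (assumed aggregation)."""
    if not projects:
        raise ValueError("need at least one project")
    return sum(project_pass_rate(p) for p in projects) / len(projects)


# Toy data: two projects of four tasks each (Web-Bench itself has 50 x 20).
toy_projects = [
    [True, True, False, True],    # stops at task 3, so 2 of 4 count
    [True, False, False, False],  # stops at task 2, so 1 of 4 counts
]
print(benchmark_pass_at_1(toy_projects))  # 0.375
```

Under this reading, a pass on a late task does not count once an earlier task has failed, which is one reason a sequential benchmark can be much harder than a set of independent tasks.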

FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation

Software Engineering

Tests computer code better for websites.

16 Jun 2025 2

90%

WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code

Computation and Language

Tests AI's ability to build websites.

9 Jun 2025 1

90%

WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

Computation and Language

Helps computers build websites from simple instructions.

6 May 2025 1


Repos / Data Links

github.com huggingface.co

Page Count

28 pages
