MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models
By: Hyunjun Kim, Sejong Kim
Potential Business Impact:
Tests whether AI models can write scripts that control websites automatically.
We introduce MacroBench, a code-first benchmark that evaluates whether LLMs can synthesize reusable browser-automation programs (macros) from natural-language goals by reading HTML/DOM and emitting Selenium. MacroBench instantiates seven self-hosted sites covering 681 tasks that vary in interaction complexity and targeting difficulty. Our end-to-end protocol validates generated code via static checks, sandboxed execution, and outcome verification (DOM assertions, database snapshots), and includes a safety suite for scraping, spam/abuse, and credential/privacy prompts. Across 2,636 model-task runs, we observe stratified success: GPT-4o-mini (96.8%), GPT-4o (95.3%), Gemini (89.0%), DeepSeek (83.4%). Models handle simple tasks reliably (91.7%) but fail on complex workflows (0.0%), and none meet production-quality coding practices despite functional completion. We release our complete benchmark pipeline, evaluation framework, and experimental results at https://github.com/hyunjun1121/MacroBench to enable reproducible assessment of macro synthesis for web automation.
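To make the evaluation protocol concrete, the sketch below shows what a MacroBench-style task might look like: an LLM-generated Selenium macro followed by outcome verification via a DOM assertion and a database snapshot check. This is an illustrative assumption, not the authors' harness; the site URL, selectors, login credentials, database path, and schema are all hypothetical.

```python
# Illustrative sketch of a MacroBench-style task: a generated Selenium macro
# plus outcome verification. Site URL, selectors, credentials, and the
# database schema are hypothetical stand-ins for the benchmark's own sites.
import sqlite3

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def generated_macro(driver, title: str, body: str) -> None:
    """The kind of macro a model is asked to emit from a natural-language goal:
    log in to the self-hosted site and create a new post."""
    driver.get("http://localhost:8000/login")  # hypothetical self-hosted site
    driver.find_element(By.NAME, "username").send_keys("alice")
    driver.find_element(By.NAME, "password").send_keys("password123")
    driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

    driver.get("http://localhost:8000/posts/new")
    driver.find_element(By.NAME, "title").send_keys(title)
    driver.find_element(By.NAME, "body").send_keys(body)
    driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()


def verify_outcome(driver, title: str) -> bool:
    """Outcome verification in the spirit of the paper: a DOM assertion
    plus a database snapshot check (table and column names are assumed)."""
    # DOM assertion: the newly created post should be rendered on the page.
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "article.post"))
    )
    dom_ok = title in driver.page_source

    # Database snapshot: the corresponding row should exist in the backing store.
    with sqlite3.connect("site.db") as conn:  # hypothetical DB path
        row = conn.execute(
            "SELECT 1 FROM posts WHERE title = ?", (title,)
        ).fetchone()
    return dom_ok and row is not None


if __name__ == "__main__":
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # headless run inside the sandbox
    driver = webdriver.Chrome(options=options)
    try:
        generated_macro(driver, "Hello MacroBench", "First post via a generated macro.")
        print("verified:", verify_outcome(driver, "Hello MacroBench"))
    finally:
        driver.quit()
```

In the actual pipeline, the generated code would also pass static checks before being executed in the sandbox; the snippet above only illustrates the execution and outcome-verification stages.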