MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models
By: Hyunjun Kim, Sejong Kim
Potential Business Impact:
Tests whether AI models can write scripts that control websites automatically.
We introduce MacroBench, a code-first benchmark that evaluates whether LLMs can synthesize reusable browser-automation programs (macros) from natural-language goals by reading HTML/DOM and emitting Selenium. MacroBench instantiates seven self-hosted sites covering 681 tasks that vary in interaction complexity and targeting difficulty. Our end-to-end protocol validates generated code via static checks, sandboxed execution, and outcome verification (DOM assertions, database snapshots), and includes a safety suite for scraping, spam/abuse, and credential/privacy prompts. Across 2,636 model-task runs, we observe stratified success: GPT-4o-mini (96.8%), GPT-4o (95.3%), Gemini (89.0%), DeepSeek (83.4%). Models handle simple tasks reliably (91.7%) but fail on complex workflows (0.0%), and none meet production-quality coding practices despite functional completion. We release our complete benchmark pipeline, evaluation framework, and experimental results at https://github.com/hyunjun1121/MacroBench to enable reproducible assessment of macro synthesis for web automation.
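To make the evaluation protocol concrete, the sketch below shows what a MacroBench-style task might look like: an LLM-generated Selenium macro followed by outcome verification via a DOM assertion and a database snapshot check. This is an illustrative assumption, not the authors' harness; the site URL, selectors, login credentials, database path, and schema are all hypothetical.

```python
# Illustrative sketch of a MacroBench-style task: a generated Selenium macro
# plus outcome verification. Site URL, selectors, credentials, and the
# database schema are hypothetical stand-ins for the benchmark's own sites.
import sqlite3

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def generated_macro(driver, title: str, body: str) -> None:
    """The kind of macro a model is asked to emit from a natural-language goal:
    log in to the self-hosted site and create a new post."""
    driver.get("http://localhost:8000/login")  # hypothetical self-hosted site
    driver.find_element(By.NAME, "username").send_keys("alice")
    driver.find_element(By.NAME, "password").send_keys("password123")
    driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

    driver.get("http://localhost:8000/posts/new")
    driver.find_element(By.NAME, "title").send_keys(title)
    driver.find_element(By.NAME, "body").send_keys(body)
    driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()


def verify_outcome(driver, title: str) -> bool:
    """Outcome verification in the spirit of the paper: a DOM assertion
    plus a database snapshot check (table and column names are assumed)."""
    # DOM assertion: the newly created post should be rendered on the page.
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "article.post"))
    )
    dom_ok = title in driver.page_source

    # Database snapshot: the corresponding row should exist in the backing store.
    with sqlite3.connect("site.db") as conn:  # hypothetical DB path
        row = conn.execute(
            "SELECT 1 FROM posts WHERE title = ?", (title,)
        ).fetchone()
    return dom_ok and row is not None


if __name__ == "__main__":
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # headless run inside the sandbox
    driver = webdriver.Chrome(options=options)
    try:
        generated_macro(driver, "Hello MacroBench", "First post via a generated macro.")
        print("verified:", verify_outcome(driver, "Hello MacroBench"))
    finally:
        driver.quit()
```

In the actual pipeline, the generated code would also pass static checks before being executed in the sandbox; the snippet above only illustrates the execution and outcome-verification stages.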