WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents
By: Ralph Peeters, Aaron Steiner, Luca Schwarz and more
Potential Business Impact:
Helps online shoppers find the best deals automatically.
LLM-based web agents have the potential to automate long-running web tasks, such as finding offers for specific products in multiple online shops and subsequently ordering the cheapest products that meet the user's needs. This paper introduces WebMall, a multi-shop online shopping benchmark for evaluating the effectiveness and efficiency of web agents for comparison-shopping. WebMall consists of four simulated online shops populated with authentic product offers sourced from the Common Crawl, alongside a suite of 91 cross-shop tasks. Basic tasks include finding specific products in multiple shops, performing price comparisons, adding items to the shopping cart, and completing checkout. Advanced tasks involve searching for products based on vague requirements, identifying suitable substitutes, and finding compatible products. Compared to existing e-commerce benchmarks, such as WebShop or ShoppingBench, WebMall introduces comparison-shopping tasks across multiple shops. Furthermore, the product offers are more heterogeneous, as they originate from hundreds of distinct real-world shops. The tasks in WebMall require longer interaction trajectories than those in WebShop, while remaining representative of real-world shopping behaviors. We evaluate eight baseline agents on WebMall, varying in observation modality, memory utilization, and underlying large language model (GPT-4.1 and Claude Sonnet 4). The best-performing configurations achieve completion rates of 75% and 53%, and F1 scores of 87% and 63%, on the basic and advanced task sets, respectively. WebMall is publicly released to facilitate research on web agents and to promote advancements in navigation, reasoning, and efficiency within e-commerce scenarios.
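The abstract reports completion rates and F1 scores without spelling out how they are computed. The following is a minimal sketch of one plausible reading. It assumes that each task's answer is a set of items (for example, product offer URLs), that a task counts as completed only when the returned set exactly matches the expected set, and that F1 is averaged per task. The function names `task_scores` and `aggregate` and the example URLs are hypothetical and are not taken from the WebMall release.

```python
from typing import Dict, Iterable, List


def task_scores(predicted: Iterable[str], expected: Iterable[str]) -> Dict[str, float]:
    """Score one task by comparing the agent's answer set to the expected set.

    Assumption: answers are compared as unordered sets of strings.
    """
    pred, gold = set(predicted), set(expected)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        # Assumption: "completed" means an exact match with the expected set.
        "completed": 1.0 if pred == gold else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def aggregate(per_task: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-task results into a completion rate and a mean F1."""
    n = len(per_task)
    if n == 0:
        return {"completion_rate": 0.0, "f1": 0.0}
    return {
        "completion_rate": sum(t["completed"] for t in per_task) / n,
        "f1": sum(t["f1"] for t in per_task) / n,
    }


if __name__ == "__main__":
    # Hypothetical answers for two tasks; the URLs are placeholders.
    exact = task_scores(
        ["https://shop-a.example/p/1", "https://shop-b.example/p/7"],
        ["https://shop-a.example/p/1", "https://shop-b.example/p/7"],
    )
    partial = task_scores(
        ["https://shop-a.example/p/1", "https://shop-c.example/p/3"],
        ["https://shop-a.example/p/1", "https://shop-b.example/p/7"],
    )
    print(aggregate([exact, partial]))
```

Under these assumptions, a partially correct answer contributes nothing to the completion rate but still earns partial F1 credit, which would be consistent with the reported F1 scores being higher than the completion rates.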
Similar Papers
ShoppingComp: Are LLMs Really Ready for Your Shopping Cart?
Computation and Language
Tests if shopping AI gives safe and good advice.
ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents
Computation and Language
Helps online shoppers do harder tasks.
MMSearch-Plus: A Simple Yet Challenging Benchmark for Multimodal Browsing Agents
Artificial Intelligence
Helps computers understand pictures and text together.