EComStage: Stage-wise and Orientation-specific Benchmarking for Large Language Models in E-commerce
By: Kaiyan Zhao, Zijie Meng, Zheyong Xie and more
Potential Business Impact:
Tests how AI helps shoppers and sellers online.
Large Language Model (LLM)-based agents are increasingly deployed in e-commerce applications to assist customer service with tasks such as product inquiries, recommendations, and order management. Existing benchmarks primarily evaluate whether these agents successfully complete the final task, overlooking the intermediate reasoning stages that are crucial for effective decision-making. To address this gap, we propose EComStage, a unified benchmark for evaluating agent-capable LLMs across the full stage-wise reasoning process: Perception (understanding user intent), Planning (formulating an action plan), and Action (executing the decision). EComStage evaluates LLMs on seven representative tasks spanning diverse e-commerce scenarios, with all samples human-annotated and quality-checked. Unlike prior benchmarks that focus only on customer-oriented interactions, EComStage also evaluates merchant-oriented scenarios, including promotion management, content review, and operational support relevant to real-world applications. We evaluate over 30 LLMs, ranging from 1B to over 200B parameters and including both open-source models and closed-source APIs, revealing stage-specific and orientation-specific strengths and weaknesses. Our results provide fine-grained, actionable insights for designing and optimizing LLM-based agents in real-world e-commerce settings.
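To make the idea of stage-wise and orientation-specific scoring concrete, here is a minimal sketch of how per-sample judgments might be grouped into one accuracy figure per (stage, orientation) pair. Only the three stage names and the customer/merchant split come from the abstract. The record layout, the field names (`stage`, `orientation`, `correct`), the function name, and the use of plain accuracy are all assumptions made for illustration, not the benchmark's actual format or metric, and the toy records are made-up placeholders rather than results.

```python
from collections import defaultdict

# Stage and orientation names follow the abstract; everything else is hypothetical.
STAGES = ("perception", "planning", "action")
ORIENTATIONS = ("customer", "merchant")


def stagewise_accuracy(records):
    """Return {(stage, orientation): accuracy} for a list of per-sample records.

    Each record is a dict with the assumed keys 'stage', 'orientation',
    and 'correct' (a bool saying whether the model's answer was judged right).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        stage, orientation = record["stage"], record["orientation"]
        if stage not in STAGES or orientation not in ORIENTATIONS:
            raise ValueError(f"unknown stage/orientation: {stage}/{orientation}")
        totals[(stage, orientation)] += 1
        hits[(stage, orientation)] += int(record["correct"])
    # Only pairs that actually occur in the input appear in the result.
    return {key: hits[key] / totals[key] for key in totals}


# Made-up toy records, purely to show the call shape.
toy_records = [
    {"stage": "perception", "orientation": "customer", "correct": True},
    {"stage": "perception", "orientation": "customer", "correct": False},
    {"stage": "planning", "orientation": "merchant", "correct": True},
    {"stage": "action", "orientation": "customer", "correct": False},
]

scores = stagewise_accuracy(toy_records)
for (stage, orientation), accuracy in sorted(scores.items()):
    print(f"{stage:<10} {orientation:<8} {accuracy:.2f}")
```

Keeping the scores keyed by (stage, orientation) rather than collapsing them into a single number is what allows a weakness in, say, merchant-oriented planning to stay visible even when overall task success looks acceptable.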
Similar Papers
Towards Reliable Evaluation of Large Language Models for Multilingual and Multimodal E-Commerce Applications
Artificial Intelligence
Tests online shopping AI on real customer questions.
ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph
Computation and Language
Helps online stores avoid fake product claims.
ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?
Computation and Language
Tests smart helpers for online shopping problems.