
No-Human in the Loop: Agentic Evaluation at Scale for Recommendation

Published: November 4, 2025 | arXiv ID: 2511.03051v1

By: Tao Zhang, Kehui Yao, Luyi Ma and more

Potential Business Impact:

Tests AI to judge other AI fairly.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Evaluating large language models (LLMs) as judges is increasingly critical for building scalable and trustworthy evaluation pipelines. We present ScalingEval, a large-scale benchmarking study that systematically compares 36 LLMs, including GPT, Gemini, Claude, and Llama, across multiple product categories using a consensus-driven evaluation protocol. Our multi-agent framework aggregates pattern audits and issue codes into ground-truth labels via scalable majority voting, enabling reproducible comparison of LLM evaluators without human annotation. Applied to large-scale complementary-item recommendation, the benchmark reports four key findings: (i) Anthropic Claude 3.5 Sonnet achieves the highest decision confidence; (ii) Gemini 1.5 Pro offers the best overall performance across categories; (iii) GPT-4o provides the most favorable latency-accuracy-cost tradeoff; and (iv) GPT-OSS 20B leads among open-source models. Category-level analysis shows strong consensus in structured domains (Electronics, Sports) but persistent disagreement in lifestyle categories (Clothing, Food). These results establish ScalingEval as a reproducible benchmark and evaluation protocol for LLMs as judges, with actionable guidance on scaling, reliability, and model family tradeoffs.
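The consensus step described in the abstract (majority voting over the outputs of several LLM judges to form ground-truth labels) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the judge names, the "good"/"bad" labels, and the rule of leaving tied items unlabeled are all assumptions made for illustration, and the pattern audits and issue codes mentioned in the abstract are not modeled here.

```python
from collections import Counter


def majority_label(votes):
    """Return (label, agreement) for one item; label is None when the top vote is tied."""
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    agreement = top_count / len(votes)
    # Assumption: a tie for first place means no consensus label is assigned.
    if len(counts) > 1 and counts[1][1] == top_count:
        return None, agreement
    return top_label, agreement


def consensus_labels(judgments):
    """Map {item_id: {judge: label}} to {item_id: (consensus_label, agreement)}."""
    return {
        item: majority_label(list(by_judge.values()))
        for item, by_judge in judgments.items()
    }


def judge_accuracy(judgments, consensus):
    """Fraction of non-tied items on which each judge matches the consensus label."""
    hits, totals = Counter(), Counter()
    for item, by_judge in judgments.items():
        label, _ = consensus[item]
        if label is None:
            continue  # skip items without a consensus label
        for judge, vote in by_judge.items():
            totals[judge] += 1
            hits[judge] += int(vote == label)
    return {judge: hits[judge] / totals[judge] for judge in totals}


# Hypothetical votes from three judges on three recommended item pairs.
judgments = {
    "pair-1": {"judge_a": "good", "judge_b": "good", "judge_c": "bad"},
    "pair-2": {"judge_a": "bad", "judge_b": "good", "judge_c": "bad"},
    "pair-3": {"judge_a": "good", "judge_b": "bad"},
}
consensus = consensus_labels(judgments)
accuracy = judge_accuracy(judgments, consensus)
```

In this sketch each judge's own vote counts toward the consensus it is scored against, which slightly favors every judge; a leave-one-out vote would remove that effect at the cost of more ties.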

Benchmarking Large Language Models for Personalized Guidance in AI-Enhanced Learning

Artificial Intelligence

Helps AI tutors give better, personalized learning help.

2 Sep 2025 1

90%

A systematic comparison of Large Language Models for automated assignment assessment in programming education: Exploring the importance of architecture and vendor

Computers and Society

Computers grade student code, but not like teachers.

30 Sep 2025 1

90%

AI agents may be worth the hype but not the resources (yet): An initial exploration of machine translation quality and costs in three language pairs in the legal and news domains

Computation and Language

New AI translates languages better than old AI.

2 May 2025 2


Page Count

15 pages
