Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions

Published: November 4, 2025 | arXiv ID: 2511.03047v1

By: Emi Soroka, Tanmay Chopra, Krish Desai, and more

BigTech Affiliations: Stanford University

Potential Business Impact:

Enables automatic evaluation of AI agents conversing with users, with no human annotation required.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously-undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.
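One of the ideas the abstract mentions is quantifying LLM uncertainty without reference answers. As a minimal illustrative sketch (not the paper's actual method), an unsupervised uncertainty proxy can be computed from the spread of repeated model samples for the same prompt: if the model's answers disagree, the entropy of the response distribution rises. The function name and normalization scheme below are hypothetical.

```python
from collections import Counter
import math

def response_entropy(samples):
    """Shannon entropy (bits) over normalized response strings.

    A crude, label-free uncertainty proxy: sample the model several
    times on the same input; consistent answers give entropy 0, while
    disagreement pushes the entropy up.
    """
    # Hypothetical normalization: case-fold and trim whitespace so
    # trivially different surface forms count as the same answer.
    normalized = [s.strip().lower() for s in samples]
    counts = Counter(normalized)
    total = len(normalized)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Consistent samples -> zero uncertainty
print(response_entropy(["Paris", "paris", "Paris "]))  # 0.0
# An even split over two answers -> 1 bit of uncertainty
print(response_entropy(["Paris", "Lyon"]))  # 1.0
```

Real systems would replace exact string matching with semantic clustering of responses, since paraphrases of the same answer should not inflate the score.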

Country of Origin
🇺🇸 United States

Page Count
32 pages

Category
Computer Science:
Machine Learning (CS)