Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions
By: Emi Soroka, Tanmay Chopra, Krish Desai, and more
Potential Business Impact:
Checks AI conversations with people without needing human help.
Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously-undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.
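To make the abstract's ideas concrete, here is a minimal sketch of two generic unsupervised signals one could compute over unlabeled interaction logs: labeling user goals by clustering embeddings of opening turns, and scoring response uncertainty from the disagreement among sampled responses. This is an illustration under stated assumptions, not the paper's implementation; the function names and the stand-in embeddings are hypothetical.

```python
# Sketch (not the paper's method): generic unsupervised signals over
# unlabeled interaction data.
#   1) goal labeling via clustering of opening-turn embeddings
#   2) an uncertainty proxy from inconsistency among sampled responses
# Embeddings are assumed to come from any sentence encoder; here they
# are random stand-ins so the script runs end to end.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity


def label_goals(turn_embeddings: np.ndarray, n_goals: int = 5) -> np.ndarray:
    """Assign each conversation a goal cluster from its opening-turn embedding."""
    km = KMeans(n_clusters=n_goals, n_init=10, random_state=0)
    return km.fit_predict(turn_embeddings)


def response_uncertainty(sample_embeddings: np.ndarray) -> float:
    """Uncertainty proxy: 1 minus the mean pairwise cosine similarity among
    several responses sampled for the same prompt (higher = less consistent)."""
    sims = cosine_similarity(sample_embeddings)
    n = sims.shape[0]
    off_diag = sims[~np.eye(n, dtype=bool)]
    return float(1.0 - off_diag.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in embeddings; in practice these come from an encoder model.
    opening_turns = rng.normal(size=(100, 384))
    print("goal labels:", label_goals(opening_turns, n_goals=4)[:10])

    sampled_responses = rng.normal(size=(8, 384))
    print("uncertainty:", round(response_uncertainty(sampled_responses), 3))
```

Both signals require no human-written reference answers: the clusters give coarse goal labels, and low consistency among samples flags prompts where the model is likely uncertain.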
Similar Papers
Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks
Artificial Intelligence
Tests AI code writing with helpful feedback.
Evaluating LLMs Without Oracle Feedback: Agentic Annotation Evaluation Through Unsupervised Consistency Signals
Computation and Language
Helps computers check their own work better.
Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
Computation and Language
Makes chatbots remember conversations better.