Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions
By: Emi Soroka, Tanmay Chopra, Krish Desai, and more
Potential Business Impact:
Checks AI conversations with people without needing human help.
Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously-undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.
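To make the abstract's ideas concrete, here is a minimal sketch of two generic unsupervised signals one could compute over unlabeled interaction logs: labeling user goals by clustering embeddings of opening turns, and scoring response uncertainty from the disagreement among sampled responses. This is an illustration under stated assumptions, not the paper's implementation; the function names and the stand-in embeddings are hypothetical.

```python
# Sketch (not the paper's method): generic unsupervised signals over
# unlabeled interaction data.
#   1) goal labeling via clustering of opening-turn embeddings
#   2) an uncertainty proxy from inconsistency among sampled responses
# Embeddings are assumed to come from any sentence encoder; here they
# are random stand-ins so the script runs end to end.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity


def label_goals(turn_embeddings: np.ndarray, n_goals: int = 5) -> np.ndarray:
    """Assign each conversation a goal cluster from its opening-turn embedding."""
    km = KMeans(n_clusters=n_goals, n_init=10, random_state=0)
    return km.fit_predict(turn_embeddings)


def response_uncertainty(sample_embeddings: np.ndarray) -> float:
    """Uncertainty proxy: 1 minus the mean pairwise cosine similarity among
    several responses sampled for the same prompt (higher = less consistent)."""
    sims = cosine_similarity(sample_embeddings)
    n = sims.shape[0]
    off_diag = sims[~np.eye(n, dtype=bool)]
    return float(1.0 - off_diag.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in embeddings; in practice these come from an encoder model.
    opening_turns = rng.normal(size=(100, 384))
    print("goal labels:", label_goals(opening_turns, n_goals=4)[:10])

    sampled_responses = rng.normal(size=(8, 384))
    print("uncertainty:", round(response_uncertainty(sampled_responses), 3))
```

Both signals require no human-written reference answers: the clusters give coarse goal labels, and low consistency among samples flags prompts where the model is likely uncertain.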
Similar Papers
Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks
Artificial Intelligence
Tests AI code writing with helpful feedback.
Evaluating LLMs Without Oracle Feedback: Agentic Annotation Evaluation Through Unsupervised Consistency Signals
Computation and Language
Helps computers check their own work better.
Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
Computation and Language
Makes chatbots remember conversations better.