
TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons

Published: April 28, 2025 | arXiv ID: 2504.19982v2

By: Emre Can Acikgoz, Carl Guo, Suvodip Dey and more

Potential Business Impact:

Evaluates AI chatbots in a way that agrees more closely with human judgments than existing automatic metrics.

Business Areas:

EdTech Education, Software

Task-oriented dialogue (TOD) systems are experiencing a revolution driven by Large Language Models (LLMs), yet the evaluation methodologies for these systems remain insufficient for their growing sophistication. While traditional automatic metrics effectively assessed earlier modular systems, they focus solely on the dialogue level and cannot detect critical intermediate errors that can arise during user-agent interactions. In this paper, we introduce TD-EVAL (Turn and Dialogue-level Evaluation), a two-step evaluation framework that unifies fine-grained turn-level analysis with holistic dialogue-level comparisons. At turn level, we evaluate each response along three TOD-specific dimensions: conversation cohesion, backend knowledge consistency, and policy compliance. Meanwhile, we design TOD Agent Arena that uses pairwise comparisons to provide a measure of dialogue-level quality. Through experiments on MultiWOZ 2.4 and τ-Bench, we demonstrate that TD-EVAL effectively identifies the conversational errors that conventional metrics miss. Furthermore, TD-EVAL exhibits better alignment with human judgments than traditional and LLM-based metrics. These findings demonstrate that TD-EVAL introduces a new paradigm for TOD system evaluation, efficiently assessing both turn and system levels with a plug-and-play framework for future research.
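The two-step structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The three dimension names come from the abstract; everything else is an assumption made for illustration: the `TurnScores` container, the numeric score scale, the per-dimension averaging in `dialogue_turn_summary`, and the Elo-style rating update in `elo_update` (the abstract says only that the arena uses pairwise comparisons, not how they are aggregated).

```python
from dataclasses import dataclass
from statistics import mean

# Dimension names taken from the abstract; the numeric scale is an assumption.
DIMENSIONS = (
    "conversation_cohesion",
    "backend_knowledge_consistency",
    "policy_compliance",
)


@dataclass
class TurnScores:
    """Hypothetical container for one judged system turn."""
    conversation_cohesion: float
    backend_knowledge_consistency: float
    policy_compliance: float


def dialogue_turn_summary(turns):
    """Average each dimension over all turns of one dialogue (assumed aggregation)."""
    return {d: mean(getattr(t, d) for t in turns) for d in DIMENSIONS}


def elo_update(rating_a, rating_b, outcome_a, k=32.0):
    """Standard Elo update for one pairwise comparison.

    outcome_a is 1.0 if system A is preferred, 0.0 if B is preferred,
    and 0.5 for a tie. Using Elo here is an assumption, not a detail
    stated in the abstract.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (outcome_a - expected_a)
    return rating_a + delta, rating_b - delta


# Usage: two judged turns, then one pairwise dialogue-level comparison.
turns = [TurnScores(5, 4, 3), TurnScores(3, 4, 5)]
summary = dialogue_turn_summary(turns)
new_a, new_b = elo_update(1000.0, 1000.0, outcome_a=1.0)
```

Keeping the turn-level summary separate from the pairwise rating mirrors the split the abstract draws between fine-grained turn analysis and holistic dialogue-level comparison.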

Evaluating and Enhancing Out-of-Domain Generalization of Task-Oriented Dialog Systems for Task Completion without Turn-level Dialog Annotations

Computation and Language

Teaches computers to chat and do tasks.

18 Feb 2025 0

89%

Emotionally Intelligent Task-oriented Dialogue Systems: Architecture, Representation, and Optimisation

Computation and Language

Makes chatbots understand feelings and finish tasks.

2 Jul 2025 0

89%

Improving Multi-turn Task Completion in Task-Oriented Dialog Systems via Prompt Chaining and Fine-Grained Feedback

Computation and Language

Helps robots finish tasks by learning new skills faster.

18 Feb 2025 1


Country of Origin

🇺🇸 United States

Page Count

20 pages
