On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

Published: January 9, 2026 | arXiv ID: 2601.06329v1

By: Jeff Chan-Jan Sju, Liang-Hsuan Tseng, Yi-Cheng Lin and more

Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes such as speaker identity and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between the speech and text modalities, possibly leading to an underestimation of speech characteristics. In this work, we propose a variety of likelihood-based and generation-based evaluation methods to replace naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.
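For readers unfamiliar with the metric under criticism, the following is a minimal sketch of the standard text perplexity formulation applied to speech tokens, which is how the abstract characterizes global token perplexity. It assumes per-token natural-log probabilities are already available from some model; the function name and input layout are hypothetical and not taken from the paper. The alternative metrics the paper proposes are not described in the abstract, so they are not sketched here.

```python
import math
from typing import Sequence


def global_token_perplexity(
    log_probs_per_utterance: Sequence[Sequence[float]],
) -> float:
    """Pool token log-probabilities over all utterances, then exponentiate
    the negative mean (the usual text perplexity formula).

    Assumption: each inner sequence holds natural-log probabilities that a
    model assigned to the discrete speech tokens of one utterance.
    """
    total_log_prob = 0.0
    total_tokens = 0
    for utterance in log_probs_per_utterance:
        total_log_prob += sum(utterance)
        total_tokens += len(utterance)
    if total_tokens == 0:
        raise ValueError("need at least one token to compute perplexity")
    return math.exp(-total_log_prob / total_tokens)


# Hypothetical usage: two utterances whose tokens all have probability 1/4.
example = [[math.log(0.25)] * 3, [math.log(0.25)] * 5]
ppl = global_token_perplexity(example)
```

Because every token is pooled into a single average, long utterances dominate the score and all tokens count equally regardless of what aspect of the speech they encode.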

Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar

Computation and Language

Helps computers understand tricky grammar rules better.

26 May 2025

Language Model Perplexity Predicts Scientific Surprise and Transformative Impact

Social and Information Networks

Finds surprising science ideas that change the world.

6 Sep 2025

AudioCodecBench: A Comprehensive Benchmark for Audio Codec Evaluation

Sound

Helps computers understand sounds and music better.

2 Sep 2025
