S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models
By: Yuanbo Fang, Haoze Sun, Jun Liu, and more
Potential Business Impact:
Makes talking computers think better.
End-to-end speech large language models (LLMs) extend the capabilities of text-based models to directly process and generate audio tokens. However, this often leads to a decline in reasoning and generation performance compared to text input, a phenomenon referred to as intelligence degradation. To systematically evaluate this gap, we propose S2SBench, a benchmark designed to quantify performance degradation in Speech LLMs. It includes diagnostic datasets targeting sentence continuation and commonsense reasoning under audio input. We further introduce a pairwise evaluation protocol based on perplexity differences between plausible and implausible samples to measure degradation relative to text input. We apply S2SBench to analyze the training process of Baichuan-Audio, which further demonstrates the benchmark's effectiveness. All datasets and evaluation code are available at https://github.com/undobug/S2SBench.
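The pairwise protocol can be sketched roughly as follows; this is a minimal illustration only, not the authors' released code (see the repository above). The model name, the `perplexity` and `pair_correct` helpers, and the accuracy metric are illustrative assumptions: a pair is scored as correct when the plausible sample receives lower perplexity than the implausible one, and degradation would be read off by comparing such scores between audio-token and text input.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch of a perplexity-based pairwise check.
# "gpt2" is a placeholder text LLM for illustration only; the real
# S2SBench evaluation code is in the linked repository.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def perplexity(text: str) -> float:
    """Token-level perplexity of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

def pair_correct(prompt: str, plausible: str, implausible: str) -> bool:
    """A pair counts as correct when the plausible continuation is less perplexing."""
    return perplexity(prompt + plausible) < perplexity(prompt + implausible)

# Toy usage: accuracy over (prompt, plausible, implausible) triples.
pairs = [("The chef put the cake in the", " oven.", " freezer to bake it.")]
accuracy = sum(pair_correct(*p) for p in pairs) / len(pairs)
print(f"pairwise accuracy: {accuracy:.2f}")
```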
Similar Papers
SI-Bench: Benchmarking Social Intelligence of Large Language Models in Human-to-Human Conversations
Computation and Language
Tests how well AI understands people talking.
URO-Bench: Towards Comprehensive Evaluation for End-to-End Spoken Dialogue Models
Computation and Language
Tests talking computers on understanding, thinking, and speaking.
S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information
Computation and Language
Tests how well talking computers understand and speak.