UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models

Published: October 26, 2025 | arXiv ID: 2510.22588v1

By: Wenming Tu, Guanrou Yang, Ruiqi Yan and more

Potential Business Impact:

Makes voices sound happy, sad, fast, or slow.

Business Areas:

Speech Recognition, Data and Analytics, Software

Spoken dialogue models currently lack fine-grained control over speech style, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities like reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for fine-grained control over multiple speech styles. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key speech stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained speech stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on the multi-dimensional control tasks defined in UltraVoice. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Furthermore, the dataset's utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis. The complete dataset and model checkpoints are available at: https://github.com/bigai-nlco/UltraVoice.
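The abstract reports MOS gains as relative percentages and IFR gains as percentage points, which are different kinds of quantities. The following is a minimal sketch of that distinction, not the paper's evaluation code: the function names and the input values are hypothetical and are not taken from the paper.

```python
def relative_improvement_pct(baseline: float, tuned: float) -> float:
    """Relative change in percent, e.g. for a MOS measured on a 1-5 scale."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (tuned - baseline) / baseline * 100.0


def percentage_point_gain(baseline_rate: float, tuned_rate: float) -> float:
    """Absolute difference between two rates that are already in percent."""
    return tuned_rate - baseline_rate


# Hypothetical values for illustration only, NOT results from the paper.
mos_before, mos_after = 2.5, 3.5    # mean opinion scores
ifr_before, ifr_after = 50.0, 80.0  # instruction following rates, in percent

mos_gain = relative_improvement_pct(mos_before, mos_after)
ifr_gain = percentage_point_gain(ifr_before, ifr_after)

print(f"MOS relative improvement: {mos_gain:.2f}%")
print(f"IFR gain: {ifr_gain:.2f} percentage points")
```

With these made-up inputs, a rise from 50% to 80% IFR is a 30 percentage point gain, even though the same change expressed as a relative improvement would be 60%. Keeping the two units apart avoids overstating or understating a result.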

FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions

Sound

Makes computer voices sound like anyone you want.

8 Jan 2026 0

87%

StyleSpeaker: Audio-Enhanced Fine-Grained Style Modeling for Speech-Driven 3D Facial Animation

Multimedia

Makes talking faces move realistically for any person.

12 Mar 2025 1

86%

VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions

Sound

Computers learn to change their voice on command.

9 Sep 2025 1

Country of Origin

🇨🇳 China

Page Count

23 pages