GLM-TTS Technical Report

Published: December 16, 2025 | arXiv ID: 2512.14291v1

By: Jiayan Cui, Zhihan Yang, Naihan Li and more

This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at https://github.com/zai-org/GLM-TTS. Real-time speech synthesis demos are provided via Z.ai (audio.z.ai) and the Zhipu Qingyan app/web (chatglm.cn).
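The abstract does not spell out how the multi-reward GRPO objective is assembled, so the following is only a minimal sketch of the general idea: several reward components are merged into one scalar per sampled utterance, and each scalar is normalized against the other samples drawn for the same prompt (the group-relative advantage used in GRPO). The reward names mirror the three aspects named in the abstract; the weights, scores, and function names are hypothetical and are not taken from the report.

```python
from statistics import mean, pstdev


def combine_rewards(rewards, weights):
    """Weighted sum of the named reward components for one sample."""
    return sum(weights[name] * value for name, value in rewards.items())


def group_relative_advantages(group_rewards, eps=1e-8):
    """Normalize scalar rewards within one group of samples for the same prompt."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in group_rewards]


# Hypothetical weights: how much each aspect counts toward the scalar reward.
weights = {"pronunciation": 1.0, "speaker_similarity": 1.0, "prosody": 0.5}

# Hypothetical scores for three utterances sampled from the same text prompt.
group = [
    {"pronunciation": 0.90, "speaker_similarity": 0.80, "prosody": 0.60},
    {"pronunciation": 0.50, "speaker_similarity": 0.70, "prosody": 0.40},
    {"pronunciation": 0.95, "speaker_similarity": 0.90, "prosody": 0.80},
]

scalars = [combine_rewards(sample, weights) for sample in group]
advantages = group_relative_advantages(scalars)

for scalar, adv in zip(scalars, advantages):
    print(f"reward={scalar:.3f} advantage={adv:+.3f}")
```

Samples that score above the group mean receive a positive advantage and are reinforced, while those below the mean are discouraged. A simple weighted sum is only one way to merge rewards; the report may combine them differently.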

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Sound

Makes computers talk with any voice, any style.

3 Mar 2025 2

89%

Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech

Sound

Makes computers talk like any person.

3 Oct 2025 1

89%

MahaTTS: A Unified Framework for Multilingual Text-to-Speech Synthesis

Audio and Speech Processing

Speaks many Indian languages like a person.

5 Aug 2025 1
