Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track

Published: December 19, 2025 | arXiv ID: 2512.17293v1

By: June Young Yi, Hyeongju Kim, Juheon Lee

This paper presents a lightweight text-to-speech (TTS) system developed for the WildSpoof Challenge TTS Track. Our approach fine-tunes the recently released open-weight TTS model, Supertonic (https://github.com/supertone-inc/supertonic), with Self-Purifying Flow Matching (SPFM) to enable robust adaptation to in-the-wild speech. SPFM mitigates label noise by comparing conditional and unconditional flow matching losses on each sample, routing suspicious text-speech pairs to unconditional training while still leveraging their acoustic information. The resulting model achieves the lowest Word Error Rate (WER) among all participating teams, while ranking second in perceptual metrics such as UTMOS and DNSMOS. These findings demonstrate that efficient, open-weight architectures like Supertonic can be effectively adapted to diverse real-world speech conditions when combined with explicit noise-handling mechanisms such as SPFM.
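The routing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the per-sample conditional and unconditional flow matching losses have already been computed, and it assumes a simple decision rule (a pair is treated as suspicious when its conditional loss exceeds its unconditional loss). The function names and the `margin` parameter are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def route_samples(
    cond_losses: Sequence[float],
    uncond_losses: Sequence[float],
    margin: float = 0.0,  # hypothetical tolerance, not taken from the paper
) -> List[bool]:
    """Return True for each sample whose text condition is kept.

    Assumed rule: if conditioning on the text does not lower the loss
    relative to the unconditional loss, the text-speech pair is treated
    as suspicious and routed to unconditional training.
    """
    if len(cond_losses) != len(uncond_losses):
        raise ValueError("loss lists must have the same length")
    return [c <= u + margin for c, u in zip(cond_losses, uncond_losses)]


def purified_batch_loss(
    cond_losses: Sequence[float],
    uncond_losses: Sequence[float],
    margin: float = 0.0,
) -> float:
    """Average the per-sample losses after routing.

    Trusted samples contribute their conditional loss. Suspicious samples
    contribute their unconditional loss, so their audio still provides a
    training signal even though their text label is ignored.
    """
    if len(cond_losses) == 0:
        raise ValueError("batch must not be empty")
    keep = route_samples(cond_losses, uncond_losses, margin)
    per_sample = [
        c if k else u for c, u, k in zip(cond_losses, uncond_losses, keep)
    ]
    return sum(per_sample) / len(per_sample)


# Toy batch: for the second sample, the text condition makes the loss worse.
cond = [0.20, 0.90, 0.35]
uncond = [0.50, 0.60, 0.40]
keep_mask = route_samples(cond, uncond)
batch_loss = purified_batch_loss(cond, uncond)
```

In this toy batch only the second pair is routed to unconditional training. In an actual training loop the same mask would decide, per sample, whether the text condition is passed to the model or dropped.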

Training Flow Matching Models with Reliable Labels via Self-Purification

Audio and Speech Processing

Cleans messy data so computers learn better.

23 Sep 2025 0

88%

Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech

Sound

Makes computers talk like any person.

3 Oct 2025 1

88%

DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech

Sound

Makes voices sound like anyone, super fast.

11 Sep 2025 1
