AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines

Published: September 28, 2025 | arXiv ID: 2509.23833v1

By: Cancan Li, Fei Su, Juan Liu and more

Potential Business Impact:

Helps computers understand quiet talking.

Business Areas:

Speech Recognition Data and Analytics, Software

Whisper speech recognition is crucial not only for ensuring privacy in sensitive communications but also for providing a critical communication bridge for patients under vocal restraint and enabling discreet interaction in noise-sensitive environments. The development of Chinese Mandarin audio-visual whisper speech recognition is hindered by the lack of large-scale datasets. We present AISHELL6-Whisper, a large-scale open-source audio-visual whisper speech dataset, featuring 30 hours each of whisper speech and parallel normal speech, with synchronized frontal facial videos. Moreover, we propose an audio-visual speech recognition (AVSR) baseline based on the Whisper-Flamingo framework, which integrates a parallel training strategy to align embeddings across speech types and employs a projection layer to adapt to the spectral properties of whisper speech. The model achieves a Character Error Rate (CER) of 4.13% for whisper speech and 1.11% for normal speech on the test set of our dataset, and establishes new state-of-the-art results on the wTIMIT benchmark. The dataset and the AVSR baseline code are open-sourced at https://zutm.github.io/AISHELL6-Whisper.
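For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a corpus-level Character Error Rate is commonly computed: the total character-level edit distance (substitutions, insertions, deletions) divided by the total number of reference characters. The function names `edit_distance` and `cer` and the whitespace-stripping step are illustrative assumptions; the abstract does not describe the authors' scoring script or text normalization, which may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edits / total reference characters.

    Assumption: spaces are removed before scoring, a common convention
    for Mandarin text; the paper's own normalization is not stated.
    """
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        ref_chars = list(ref.replace(" ", ""))
        hyp_chars = list(hyp.replace(" ", ""))
        total_edits += edit_distance(ref_chars, hyp_chars)
        total_chars += len(ref_chars)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


if __name__ == "__main__":
    # Made-up example sentences, not taken from the dataset.
    refs = ["今天天气很好"]
    hyps = ["今天天汽很好"]  # one substituted character out of six
    print(f"CER = {cer(refs, hyps):.2%}")
```

Under this definition, a CER of 4.13% corresponds to roughly four character errors per hundred reference characters.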

Cocktail-Party Audio-Visual Speech Recognition

Sound

Helps computers understand talking even in loud places.

2 Jun 2025 1

88%

Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings

Sound

Identifies speakers in any language, even noisy ones.

13 Mar 2025 1

87%

Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment

Computation and Language

Helps computers judge how well people speak English.

18 Oct 2025 0

Page Count

5 pages
