ParCzech4Speech: A New Speech Corpus Derived from Czech Parliamentary Data
By: Vladislav Stankov, Matyáš Kopp, Ondřej Bojar
Potential Business Impact:
Helps computers understand spoken Czech better.
We introduce ParCzech4Speech 1.0, a processed version of the ParCzech 4.0 corpus targeted at speech modeling tasks; its largest variant contains 2,695 hours. We combined the sound recordings of Czech parliamentary speeches with the official transcripts and processed the recordings with WhisperX and Wav2Vec 2.0 to extract an automated audio-text alignment. Our processing pipeline improves upon the ParCzech 3.0 speech recognition version by extracting more data with higher alignment reliability. The dataset is offered in three variants: (1) sentence-segmented, with clean boundaries for automatic speech recognition and speech synthesis tasks; (2) unsegmented, preserving the original utterance flow across sentences; and (3) raw alignment, intended for further custom refinement for other tasks. All variants maintain the original metadata and are released under a permissive CC-BY license. The dataset is available in the LINDAT repository, with the sentence-segmented and unsegmented variants additionally available on Hugging Face.
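To make the relationship between a raw word-level alignment and a sentence-segmented variant concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the `Word` record, the `sentence_segments` function, and the rule of closing a segment on terminal punctuation are all assumptions made for illustration, and the timestamps are invented.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical time-aligned token: transcript text plus start/end in seconds."""
    text: str
    start: float
    end: float


def sentence_segments(words, terminators=".?!"):
    """Group aligned words into (text, start, end) sentence segments.

    A segment is closed whenever a word ends with one of the terminator
    characters; any trailing words without a terminator form a final segment.
    """
    segments = []
    current = []
    for word in words:
        current.append(word)
        if word.text and word.text[-1] in terminators:
            text = " ".join(w.text for w in current)
            segments.append((text, current[0].start, current[-1].end))
            current = []
    if current:
        text = " ".join(w.text for w in current)
        segments.append((text, current[0].start, current[-1].end))
    return segments


# Invented example alignment (illustrative values only).
example = [
    Word("Dobrý", 0.0, 0.4),
    Word("den.", 0.4, 0.8),
    Word("Děkuji", 1.0, 1.5),
    Word("vám.", 1.5, 1.9),
]

for text, start, end in sentence_segments(example):
    print(f"{start:.1f}-{end:.1f}s: {text}")
```

Under this reading, keeping all of an utterance's words in one segment would correspond to the unsegmented variant, while the word list itself plays the role of the raw alignment.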
Similar Papers
ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian
Computation and Language
Helps computers understand spoken words and feelings.
SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data
Computation and Language
Helps computers understand Slovak speech better.
ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Sound
Makes computer voices sound like real Persian speakers.