Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech
By: Fabian Retkowski, Alexander Waibel
Automatic speech transcripts are often delivered as unstructured word streams that impede readability and repurposing. We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task. The benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing, while also contributing to the wider text segmentation field, which still lacks robust and naturalistic benchmarks. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript, enabling faithful, sentence-aligned evaluation. Third, we show that a compact model (MiniSeg) attains state-of-the-art accuracy and, when extended hierarchically, jointly predicts chapters and paragraphs with minimal computational cost. Together, our resources and methods establish paragraph segmentation as a standardized, practical task in speech processing.
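The constrained-decoding idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the transcript is already split into sentences and reduces decoding to a two-way choice at each sentence boundary (insert a break, or continue with the next source sentence). The names `constrained_segment`, `break_score`, `toy_break_score`, and the `<p>` symbol are hypothetical. In a real system, `break_score` would presumably come from a language model's probability of a break token versus the next transcript token; here a toy heuristic stands in for it.

```python
from typing import Callable, List, Sequence

PARA_BREAK = "<p>"  # hypothetical break symbol, not taken from the paper


def constrained_segment(
    sentences: Sequence[str],
    break_score: Callable[[List[str], str], float],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Greedy decoding restricted to two actions per sentence boundary:
    emit a paragraph break, or copy the next source sentence unchanged.
    Because sentences are only ever copied, the transcript is preserved."""
    if not sentences:
        return []
    paragraphs: List[List[str]] = [[sentences[0]]]
    emitted: List[str] = [sentences[0]]  # decoding history, including breaks
    for nxt in sentences[1:]:
        # Assumed scorer: how strongly a break is preferred before `nxt`.
        if break_score(emitted, nxt) > threshold:
            paragraphs.append([])
            emitted.append(PARA_BREAK)
        paragraphs[-1].append(nxt)
        emitted.append(nxt)
    return paragraphs


def toy_break_score(emitted: List[str], nxt: str) -> float:
    """Stand-in for a model score: break before simple discourse markers."""
    return 1.0 if nxt.startswith(("Now", "Next", "Finally")) else 0.0


transcript = [
    "Hello everyone.",
    "Today I will talk about speech.",
    "Now, the first point.",
    "It matters a lot.",
    "Finally, thank you.",
]
result = constrained_segment(transcript, toy_break_score)
# Flattening the paragraphs gives back the original sentence sequence.
flattened = [s for para in result for s in para]
```

Because the output is a partition of the original sentences, predicted and reference boundaries line up sentence by sentence, which is what makes sentence-aligned evaluation possible.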
Similar Papers
Unsupervised Speech Segmentation: A General Approach Using Speech Language Models
Computation and Language
Splits talking into meaningful parts automatically.
Synthetic Data Generation for Phrase Break Prediction with Large Language Model
Computation and Language
Makes computer voices sound more natural.
BabyLM's First Words: Word Segmentation as a Phonological Probing Task
Computation and Language
Teaches computers to understand word sounds in many languages.