Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations
By: Linyang He, Qiaolin Wang, Xilin Jiang, and more
Potential Business Impact:
Listens to speech, understands grammar better than meaning.
Transformer-based speech language models (SLMs) have significantly improved neural speech recognition and understanding. While existing research has examined how well SLMs encode shallow acoustic and phonetic features, the extent to which SLMs encode nuanced syntactic and conceptual features remains unclear. By drawing parallels with linguistic competence assessments for large language models, this study is the first to systematically evaluate the presence of contextual syntactic and semantic features across SLMs trained for self-supervised learning (S3M), automatic speech recognition (ASR), and speech compression (codec), as well as SLMs serving as encoders for auditory large language models (AudioLLMs). Through minimal pair designs and diagnostic feature analysis across 71 tasks spanning diverse linguistic levels, our layer-wise and time-resolved analysis uncovers that 1) all speech models encode grammatical features more robustly than conceptual ones.
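The core method named in the title, layer-wise probing of minimal pairs, can be sketched roughly as follows. This is a minimal illustration under assumptions, not the authors' implementation: the encoder (facebook/wav2vec2-base as a stand-in S3M), the mean pooling over time, the logistic-regression probe, and the helper names layer_embeddings and layer_wise_probe are all hypothetical choices for illustration; the paper itself spans 71 tasks and several model families.

```python
# A minimal sketch of layer-wise minimal pair probing, assuming a
# wav2vec 2.0 encoder from HuggingFace Transformers. The pooling and
# probe choices here are illustrative assumptions, not the paper's.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

MODEL = "facebook/wav2vec2-base"  # stand-in S3M encoder
extractor = AutoFeatureExtractor.from_pretrained(MODEL)
model = Wav2Vec2Model.from_pretrained(MODEL).eval()

def layer_embeddings(waveform, sr=16000):
    """Mean-pool each layer's hidden states over time -> one vector per layer."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: tuple of (1, T, D) tensors, one per layer
    return [h.mean(dim=1).squeeze(0).numpy() for h in out.hidden_states]

def layer_wise_probe(pairs):
    """pairs: list of (acceptable_wav, violating_wav) minimal pairs.

    Fits a linear probe per layer to separate the two members of each
    pair and returns cross-validated accuracy per layer. Needs at
    least a handful of pairs for 3-fold cross-validation."""
    feats = [[layer_embeddings(a), layer_embeddings(b)] for a, b in pairs]
    n_layers = len(feats[0][0])
    accs = []
    for layer in range(n_layers):
        X = np.stack([e[layer] for pair in feats for e in pair])
        y = np.array([0, 1] * len(pairs))  # 0 = acceptable, 1 = violation
        probe = LogisticRegression(max_iter=1000)
        accs.append(cross_val_score(probe, X, y, cv=3).mean())
    return accs  # peaks show where the contrast is most decodable
```

Plotting the returned per-layer accuracies for a grammatical contrast (e.g., subject-verb agreement) against a conceptual one would surface the kind of grammatical-conceptual hierarchy the abstract reports.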
Similar Papers
Dual Information Speech Language Models for Emotional Conversations
Computation and Language
Lets computers understand feelings in spoken words.
Layer-wise Analysis for Quality of Multilingual Synthesized Speech
Audio and Speech Processing
Makes computer voices sound more human-like.
From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models
Computation and Language
Computers learn ideas from talking and reading.