From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models
By: Asım Ersoy, Basel Mousi, Shammur Chowdhury, and more
Potential Business Impact:
Computers learn ideas from talking and reading.
The emergence of large language models (LLMs) has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts, showcasing properties associated with general intelligence. This raises an intriguing question: do such concepts also emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities, do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech-based and text-based models, both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. For reproducibility, we have made our scripts and other resources available to the community.
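As a rough illustration of the kind of analysis the abstract describes, the sketch below clusters contextualized token representations from a pretrained text encoder and prints the resulting groups. This is a minimal sketch of latent-concept-style analysis in general, not the authors' released scripts: the model name, layer index, example sentences, and cluster count are all illustrative assumptions.

```python
# Minimal sketch of latent-concept-style analysis: cluster contextualized
# token embeddings from a pretrained model and inspect which tokens the
# model groups together. Model, layer, and clustering parameters are
# illustrative assumptions, not the paper's actual configuration.
import torch
from sklearn.cluster import AgglomerativeClustering
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumption: any encoder with hidden states works
LAYER = 6                         # assumption: a middle layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

sentences = [
    "The bank approved the loan.",
    "We sat on the river bank.",
    "The committee approved the budget.",
]

tokens, vectors = [], []
with torch.no_grad():
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        # hidden_states[LAYER] has shape (batch, seq_len, dim); take batch 0.
        hidden = model(**enc).hidden_states[LAYER][0]
        ids = enc["input_ids"][0]
        for tok, vec in zip(tokenizer.convert_ids_to_tokens(ids), hidden):
            if tok not in ("[CLS]", "[SEP]"):
                tokens.append(tok)
                vectors.append(vec)

X = torch.stack(vectors).numpy()
labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)  # assumption: 4 clusters

# Group tokens by cluster to see which words fall into the same latent "concept".
clusters = {}
for tok, lab in zip(tokens, labels):
    clusters.setdefault(lab, []).append(tok)
for lab, toks in sorted(clusters.items()):
    print(lab, toks)
```

In a setup like this, the two senses of "bank" may land in different clusters because their contextual embeddings differ; the same pipeline could be pointed at a speech model's frame- or segment-level representations to compare how concepts form across modalities.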
Similar Papers
Text-Speech Language Models with Improved Cross-Modal Transfer by Aligning Abstraction Levels
Computation and Language
Makes computers understand talking and writing together.
Probing Audio-Generation Capabilities of Text-Based Language Models
Sound
Computers learn to make sounds from words.
Words That Make Language Models Perceive
Computation and Language
Makes text-only AI "see" and "hear" with words.