
A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification

Published: December 8, 2025 | arXiv ID: 2512.07571v1

By: Nicolas Calbucura, Valentin Barriere

Potential Business Impact:

Lets computers understand speech and text together better.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

This paper presents a simple method for enhancing textual pre-trained large language models with speech information when they are fine-tuned for a specific classification task. A classic problem when fusing many audio embeddings with text is that the audio sequence is much longer than the text sequence. Our method builds on an existing speech tokenizer trained for Automatic Speech Recognition, which outputs long sequences of tokens from a large vocabulary, making it difficult to integrate into a large language model at low cost. By applying a simple lasso-based feature selection to a multimodal Bag-of-Words representation, we retain only the audio tokens most important for the task, and adapt the language model to them with a self-supervised language modeling objective before fine-tuning it on the downstream task. We show that this improves performance compared to a unimodal model, to a larger SpeechLM, and to integrating audio via a learned representation. We demonstrate the effectiveness of our method on two recent Argumentative Fallacy Detection and Classification tasks where the use of audio was believed to be counterproductive, reaching state-of-the-art results. We also provide an in-depth analysis of the method, showing that even a random audio token selection helps enhance the unimodal model. Our code is available [online](https://github.com/salocinc/EACL26SpeechTokFallacy/).
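The token-selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy utterances, the audio token names (`a1`, `a3`, `a7`), the `txt:`/`aud:` feature prefixes, the penalty `alpha=0.2`, and the use of a linear lasso solved by coordinate descent are all assumptions made for illustration. The idea shown is only the general one: count text and audio tokens in a shared Bag-of-Words space, fit an L1-penalised model against the class label, and keep the audio tokens whose coefficient is non-zero.

```python
from collections import Counter


def bag_of_words(text_tokens, audio_tokens):
    """Count text and audio tokens in one shared (multimodal) feature space."""
    counts = Counter(f"txt:{t}" for t in text_tokens)
    counts.update(f"aud:{a}" for a in audio_tokens)
    return counts


def soft_threshold(value, alpha):
    """Shrink value towards zero by alpha; values within [-alpha, alpha] become 0."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso_coordinate_descent(columns, targets, alpha, sweeps=50):
    """Minimise (1/2n) * ||y - Xw||^2 + alpha * ||w||_1 on mean-centred data."""
    n = len(targets)
    y_mean = sum(targets) / n
    residual = [t - y_mean for t in targets]  # all weights start at zero
    centred = {}
    for name, col in columns.items():
        mean = sum(col) / n
        centred[name] = [v - mean for v in col]
    weights = {name: 0.0 for name in columns}
    for _ in range(sweeps):
        for name, x in centred.items():
            scale = sum(v * v for v in x) / n
            if scale == 0.0:
                continue  # constant feature carries no information
            w_old = weights[name]
            # Correlation of this feature with the residual that excludes it.
            rho = sum(xi * (ri + w_old * xi) for xi, ri in zip(x, residual)) / n
            w_new = soft_threshold(rho, alpha) / scale
            if w_new != w_old:
                residual = [ri - (w_new - w_old) * xi for xi, ri in zip(x, residual)]
                weights[name] = w_new
    return weights


def select_audio_tokens(examples, alpha):
    """Return the audio tokens that receive a non-zero lasso coefficient."""
    bags = [bag_of_words(text, audio) for text, audio, _ in examples]
    vocab = sorted(set().union(*bags))
    columns = {name: [float(bag[name]) for bag in bags] for name in vocab}
    targets = [float(label) for _, _, label in examples]
    weights = lasso_coordinate_descent(columns, targets, alpha)
    return sorted(
        name[len("aud:"):]
        for name, w in weights.items()
        if name.startswith("aud:") and w != 0.0
    )


# Hypothetical toy data: (text tokens, audio tokens, label), 1 = fallacy.
TOY_EXAMPLES = [
    (["you", "always", "lie"], ["a1", "a7"], 1),
    (["everyone", "knows", "it"], ["a1", "a7", "a3"], 1),
    (["the", "data", "shows"], ["a1", "a3"], 0),
    (["studies", "suggest", "it"], ["a1"], 0),
]

selected = select_audio_tokens(TOY_EXAMPLES, alpha=0.2)
print(selected)
```

On this toy data the sketch keeps only `a7`, the one audio token that co-occurs with the positive label; `a1` appears in every utterance and `a3` is uncorrelated with the label, so both are dropped. In the setting the abstract describes, the retained tokens would then be the ones added to the language model before the adaptation and fine-tuning stages.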

Towards Audio Token Compression in Large Audio Language Models

Audio and Speech Processing

Makes AI understand long sounds with less computer power.

26 Nov 2025 1

88%

DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models

Sound

Lets computers understand and talk like humans.

12 Aug 2025 1

88%



Country of Origin

🇨🇱 Chile

Page Count

8 pages
