Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware
By: Stavros Mitsis, Ermos Hadjikyriakos, Humaid Ibrahim, and more
Potential Business Impact:
Helps tiny computers understand feelings from voices.
Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where cloud-based solutions are impractical. Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices. Prior work typically relies on powerful hardware, lacks real-time performance, or uses unimodal input. This paper addresses that gap by presenting a hardware-aware emotion recognition system that combines acoustic and linguistic features using a late-fusion architecture optimised for the Edge TPU. The design integrates a quantised transformer-based acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling real-time inference within a 1.8 MB memory budget at 21-23 ms latency. The pipeline ensures spectrogram alignment between training and deployment using MicroFrontend and MLTK. Evaluation on re-recorded, segmented IEMOCAP samples captured through the Coral Dev Board Micro microphone shows a 6.3% macro-F1 improvement over unimodal baselines. This work demonstrates that accurate, real-time multimodal emotion inference is achievable on microcontroller-class edge platforms through task-specific fusion and hardware-guided model design.
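To make the late-fusion idea concrete, here is a minimal sketch of how acoustic-model outputs and frozen keyword embeddings could be combined in a small fusion head. This is an illustrative assumption, not the paper's actual implementation: the function names, the 4-class setup, the embedding dimension, and the linear-plus-softmax fusion head are all hypothetical.

```python
import numpy as np

NUM_CLASSES = 4  # assumed emotion-class count (common IEMOCAP setup)
EMB_DIM = 8      # assumed keyword-embedding dimension


def late_fusion(acoustic_logits, keyword_embedding, w_fusion, b_fusion):
    """Fuse acoustic logits with a frozen keyword embedding (sketch).

    acoustic_logits:   (NUM_CLASSES,) output of the acoustic model
    keyword_embedding: (EMB_DIM,) frozen embedding from the keyword network
    w_fusion, b_fusion: weights of a small linear fusion head, trained offline
    """
    # Late fusion: concatenate modality outputs, then classify jointly.
    fused = np.concatenate([acoustic_logits, keyword_embedding])
    logits = w_fusion @ fused + b_fusion
    # Numerically stable softmax over emotion classes.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


# Demo with random stand-in values for the two modality outputs.
rng = np.random.default_rng(0)
w = rng.normal(size=(NUM_CLASSES, NUM_CLASSES + EMB_DIM))
b = np.zeros(NUM_CLASSES)
probs = late_fusion(rng.normal(size=NUM_CLASSES),
                    rng.normal(size=EMB_DIM), w, b)
```

Keeping the keyword embeddings frozen means only the small fusion head needs training, which fits the memory and latency budget the paper targets.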
Similar Papers
Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study
Sound
Lets computers understand your feelings from your voice.
Rethinking Multimodal Sentiment Analysis: A High-Accuracy, Simplified Fusion Architecture
Computation and Language
Helps computers understand feelings from talking, seeing, and hearing.
Efficient Finetuning for Dimensional Speech Emotion Recognition in the Age of Transformers
Sound
Makes computers understand feelings in voices faster.