Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware
By: Stavros Mitsis, Ermos Hadjikyriakos, Humaid Ibrahim, and more
Potential Business Impact:
Helps tiny computers understand feelings from voices.
Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where cloud-based solutions are impractical. Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices. Prior work typically relies on powerful hardware, lacks real-time performance, or uses unimodal input. This paper addresses that gap by presenting a hardware-aware emotion recognition system that combines acoustic and linguistic features using a late-fusion architecture optimised for the Edge TPU. The design integrates a quantised transformer-based acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling real-time inference within a 1.8 MB memory budget at 21-23 ms latency. The pipeline ensures spectrogram alignment between training and deployment using MicroFrontend and MLTK. Evaluation on re-recorded, segmented IEMOCAP samples captured through the Coral Dev Board Micro microphone shows a 6.3% macro-F1 improvement over unimodal baselines. This work demonstrates that accurate, real-time multimodal emotion inference is achievable on microcontroller-class edge platforms through task-specific fusion and hardware-guided model design.
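To make the late-fusion idea concrete, here is a minimal sketch of how acoustic-model outputs and frozen keyword embeddings could be combined in a small fusion head. This is an illustrative assumption, not the paper's actual implementation: the function names, the 4-class setup, the embedding dimension, and the linear-plus-softmax fusion head are all hypothetical.

```python
import numpy as np

NUM_CLASSES = 4  # assumed emotion-class count (common IEMOCAP setup)
EMB_DIM = 8      # assumed keyword-embedding dimension


def late_fusion(acoustic_logits, keyword_embedding, w_fusion, b_fusion):
    """Fuse acoustic logits with a frozen keyword embedding (sketch).

    acoustic_logits:   (NUM_CLASSES,) output of the acoustic model
    keyword_embedding: (EMB_DIM,) frozen embedding from the keyword network
    w_fusion, b_fusion: weights of a small linear fusion head, trained offline
    """
    # Late fusion: concatenate modality outputs, then classify jointly.
    fused = np.concatenate([acoustic_logits, keyword_embedding])
    logits = w_fusion @ fused + b_fusion
    # Numerically stable softmax over emotion classes.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


# Demo with random stand-in values for the two modality outputs.
rng = np.random.default_rng(0)
w = rng.normal(size=(NUM_CLASSES, NUM_CLASSES + EMB_DIM))
b = np.zeros(NUM_CLASSES)
probs = late_fusion(rng.normal(size=NUM_CLASSES),
                    rng.normal(size=EMB_DIM), w, b)
```

Keeping the keyword embeddings frozen means only the small fusion head needs training, which fits the memory and latency budget the paper targets.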
Similar Papers
Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study
Sound
Lets computers understand your feelings from your voice.
Rethinking Multimodal Sentiment Analysis: A High-Accuracy, Simplified Fusion Architecture
Computation and Language
Helps computers understand feelings from talking, seeing, and hearing.
Efficient Finetuning for Dimensional Speech Emotion Recognition in the Age of Transformers
Sound
Makes computers understand feelings in voices faster.