LSZone: A Lightweight Spatial Information Modeling Architecture for Real-time In-car Multi-zone Speech Separation
By: Jun Chen, Shichao Hu, Jiuxin Lin, et al.
Potential Business Impact:
Lets voices in the car be heard clearly, even in noise.
In-car multi-zone speech separation, which captures voices from different speech zones, plays a crucial role in human-vehicle interaction. Although the earlier SpatialNet achieves notable results, its high computational cost still hinders real-time applications in vehicles. To this end, this paper proposes LSZone, a lightweight spatial information modeling architecture for real-time in-car multi-zone speech separation. We design a spatial information extraction-compression (SpaIEC) module that combines the Mel spectrogram and the Interaural Phase Difference (IPD) to reduce the computational burden while maintaining performance. Additionally, to model spatial information efficiently, we introduce an extremely lightweight Conv-GRU crossband-narrowband processing (CNP) module. Experimental results demonstrate that LSZone, with a complexity of 0.56G MACs and a real-time factor (RTF) of 0.37, delivers impressive performance in complex noise and multi-speaker scenarios.
Similar Papers
Neural personal sound zones with flexible bright zone control
Sound
Creates personal sound bubbles for everyone nearby.
Spatial Audio Processing with Large Language Model on Wearable Devices
Sound
Listens to where sounds come from.
Geolocation-Aware Robust Spoken Language Identification
Computation and Language
Helps computers tell apart different accents of the same language.