Study of Lightweight Transformer Architectures for Single-Channel Speech Enhancement
By: Haixin Zhao, Nilesh Madhu
Potential Business Impact:
Makes phone calls clearer with less power.
In speech enhancement, achieving state-of-the-art (SotA) performance within the computational constraints of edge devices remains a formidable challenge. Networks that stack temporal and spectral modelling benefit from improved architectures such as transformers; however, they incur substantial computational complexity and growth in model size. Through a systematic ablation analysis of transformer-based temporal and spectral modelling, we demonstrate that an architecture employing streamlined Frequency-Time-Frequency (FTF) stacked transformers efficiently learns global dependencies within a causal context while avoiding considerable computational demands. Using discriminators during training further improves learning efficacy and enhancement quality without adding complexity at inference. The proposed lightweight, causal, transformer-based architecture with adversarial training (LCT-GAN) yields SotA performance on instrumental metrics among contemporary lightweight models, with far less overhead. Compared to DeepFilterNet2, LCT-GAN requires only 6% of the parameters at similar complexity and performance. Against CCFNet+(Lite), LCT-GAN saves 9% in parameters and 10% in multiply-accumulate operations while yielding improved performance. Further, LCT-GAN even outperforms more complex, common baseline models on widely used test datasets.
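To make the Frequency-Time-Frequency ordering and the causal constraint concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a time-frequency feature grid x[t][f] of small vectors, single-head dot-product attention with identity projections in place of learned weights, and a plain residual connection; the names attend, freq_block, time_block and ftf are hypothetical. Attention across frequency bins stays inside one frame, while attention across time lets frame t see only frames up to t.

```python
import math

def attend(seq, causal):
    """Single-head dot-product self-attention over a list of vectors.
    Identity projections stand in for learned Q/K/V weights (assumption)."""
    dim = len(seq[0])
    out = []
    for i, q in enumerate(seq):
        # Causal: position i may only look at positions 0..i.
        last = i + 1 if causal else len(seq)
        scores = [sum(a * b for a, b in zip(q, seq[j])) / math.sqrt(dim)
                  for j in range(last)]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed = [sum(w * seq[j][c] for j, w in enumerate(weights)) / total
                 for c in range(dim)]
        out.append([x + m for x, m in zip(q, mixed)])  # residual connection
    return out

def freq_block(x):
    """Attend across frequency bins within each frame (no look-ahead in time)."""
    return [attend(frame, causal=False) for frame in x]

def time_block(x):
    """Attend across frames for each frequency bin, with a causal mask."""
    n_frames, n_bins = len(x), len(x[0])
    out = [[None] * n_bins for _ in range(n_frames)]
    for f in range(n_bins):
        track = attend([x[t][f] for t in range(n_frames)], causal=True)
        for t in range(n_frames):
            out[t][f] = track[t]
    return out

def ftf(x):
    """Frequency -> Time -> Frequency stacking, as named in the abstract."""
    return freq_block(time_block(freq_block(x)))

# Toy input: 4 frames, 3 frequency bins, 2 features per bin (arbitrary values).
x = [[[math.sin(t + 2 * f + c) for c in range(2)] for f in range(3)]
     for t in range(4)]
y = ftf(x)
```

In this sketch, replacing the last frame of x and running ftf again leaves the outputs for all earlier frames unchanged, which is the property a causal, streaming-capable model needs.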