Adapting Language Balance in Code-Switching Speech
By: Enes Yavuz Ugan, Ngoc-Quan Pham, Alexander Waibel
Potential Business Impact:
Helps computers understand mixed-language sentences better.
Despite achieving impressive results on standard benchmarks, large foundation models still struggle with code-switching test cases. When data scarcity cannot serve as the usual justification for poor performance, the cause may lie in the infrequency of code-switched moments, where the embedded second language appears only subtly. Instead of expecting models to learn this infrequency on their own, it may be beneficial to supply the training process with explicit labels. Evaluating performance on code-switching data likewise requires careful localization of the switching points, since recognition errors there are the most consequential and the analysis should emphasize mistakes made at those moments. Building on this observation, we leverage the difference between the embedded and the main language to highlight the code-switching points and thereby emphasize learning at those locations. This simple yet effective differentiable surrogate mitigates context bias during generation, the central challenge in code-switching, and thereby improves the model's robustness. Our experiments on Arabic and Chinese-English code-switching show that the models predict the switching points more accurately, as reflected in reduced substitution errors.
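The abstract does not give the objective in closed form, but the idea it describes, using embedded-language labels to emphasize learning at switching points, can be read as a weighted token-level loss. Below is a minimal PyTorch sketch under that assumption; the per-token `lang_ids` labels, the `switch_weight` hyperparameter, and the function name are illustrative and not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def switch_weighted_ce(logits, targets, lang_ids, main_lang=0,
                       switch_weight=2.0, pad_id=-100):
    """Token-level cross-entropy that up-weights code-switching points.

    logits:   (batch, seq, vocab) decoder outputs
    targets:  (batch, seq) gold token ids (pad_id marks padding)
    lang_ids: (batch, seq) per-token language labels, 0 = main language
    Tokens whose language differs from the main one (i.e. the embedded
    language at a switch point) receive switch_weight instead of 1.0.
    """
    # Per-token losses; F.cross_entropy expects (batch, vocab, seq).
    loss = F.cross_entropy(logits.transpose(1, 2), targets,
                           ignore_index=pad_id, reduction="none")
    # 1.0 for main-language tokens, switch_weight at embedded-language tokens.
    weights = torch.where(lang_ids != main_lang,
                          torch.full_like(loss, switch_weight),
                          torch.ones_like(loss))
    mask = (targets != pad_id).float()
    return (loss * weights * mask).sum() / (weights * mask).sum().clamp(min=1.0)

# Example: batch of 1, sequence of 4 tokens, vocabulary of 10.
logits = torch.randn(1, 4, 10)
targets = torch.tensor([[3, 7, 7, 1]])
lang_ids = torch.tensor([[0, 1, 1, 0]])  # middle tokens are embedded-language
print(switch_weighted_ce(logits, targets, lang_ids).item())
```

Up-weighting only the embedded-language tokens keeps the objective differentiable while biasing gradients toward the rare switching positions, which is one plausible reading of the "differentiable surrogate" the abstract mentions.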
Similar Papers
Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training
Computation and Language
Helps computers learn many languages by mixing them.
Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models
Computation and Language
Helps computers understand mixed-language conversations.
Strategies of Code-switching in Human-Machine Dialogs
Computation and Language
Chatbot learns to switch languages like people.