Learning to Retrieve with Weakened Labels: Robust Training under Label Noise
By: Arnab Sharma
Neural encoders are frequently used in NLP to perform dense retrieval tasks, for instance, to generate candidate documents for a given query in question-answering systems. However, sparse annotation and label noise in the training data make it challenging to train or fine-tune such retrieval models. Existing works attempt to mitigate these problems by incorporating modified loss functions or data cleaning, but these approaches either require additional hyperparameters to tune during training or add substantial complexity to the training setup. In this work, we consider a label weakening approach to produce robust retrieval models in the presence of label noise. Instead of enforcing a single, potentially erroneous label for each query-document pair, we allow a set of plausible labels derived from both the observed supervision and the model's confidence scores. We perform an extensive evaluation of two retrieval models and one re-ranking model on four diverse ranking datasets. To create a realistic noisy setting, we use a semantic-aware noise generation technique to produce different ratios of label noise. Our initial results show that label weakening can improve retrieval performance in comparison to 10 state-of-the-art loss functions.
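The abstract does not give the paper's exact formulation, but the core idea of label weakening can be illustrated with a minimal sketch: keep the observed (possibly noisy) label, admit the model's own prediction as an additional plausible label when the model is confidently in disagreement, and then score against the closest label in the weakened set. All function names and the threshold `tau` below are hypothetical choices for illustration, not the authors' implementation.

```python
import math

def weaken_label(observed_label, model_score, tau=0.2):
    """Return a set of plausible binary labels for one query-document pair.

    observed_label: the (possibly noisy) annotation, 0 or 1.
    model_score: the model's relevance probability in (0, 1).
    tau: hypothetical confidence margin; the model's prediction is only
         admitted when its score is at least tau away from 0.5.
    """
    labels = {observed_label}
    model_label = 1 if model_score >= 0.5 else 0
    # Confident disagreement: treat the model's label as also plausible
    # instead of forcing the single observed label.
    if model_label != observed_label and abs(model_score - 0.5) >= tau:
        labels.add(model_label)
    return labels

def weakened_loss(model_score, label_set):
    """Cross-entropy against the closest label in the weakened set,
    so the model is not penalized for contradicting a likely-noisy label."""
    return min(
        -(y * math.log(model_score) + (1 - y) * math.log(1 - model_score))
        for y in label_set
    )
```

For example, an observed positive label paired with a confident negative score of 0.1 yields the weakened set {0, 1}, so the loss is taken against whichever label the model is closer to; an unconfident score of 0.55 leaves the observed label unchanged. This avoids the per-epoch reweighting hyperparameters that noise-robust loss functions typically introduce.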
Similar Papers
Pre-train to Gain: Robust Learning Without Clean Labels
Machine Learning (CS)
Teaches computers to learn better from messy information.
Revisiting Meta-Learning with Noisy Labels: Reweighting Dynamics and Theoretical Guarantees
Machine Learning (CS)
Cleans up messy data for smarter computer learning.
Soft-Label Training Preserves Epistemic Uncertainty
Machine Learning (CS)
Teaches computers to understand when things are unclear.