How to Achieve Higher Accuracy with Fewer Training Points?
By: Jinghan Yang, Anupam Pani, Yunchao Zhang
Potential Business Impact:
Trains computers faster using less data.
In the era of large-scale model training, the extensive use of available datasets has resulted in significant computational inefficiencies. To tackle this issue, we explore methods for identifying informative subsets of training data that can achieve comparable or even superior model performance. We propose a technique based on influence functions to determine which training samples should be included in the training set. We conducted empirical evaluations of our method on binary classification tasks using logistic regression models. Our approach matches the performance of training on the entire dataset while using only 10% of the data. Furthermore, our method achieved even higher accuracy than full-data training when using just 60% of the data.
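The core idea can be sketched in code. The snippet below is a minimal, illustrative implementation of influence-function-based subset selection for logistic regression, not the authors' exact method: it scores each training point by the classic Koh-and-Liang-style influence estimate, I(z_i) = -g_val^T H^{-1} g_i, then retrains on the 10% of points whose upweighting most reduces validation loss. The synthetic data, the per-class balancing of the subset, and all names are assumptions made for the sketch.

```python
# Illustrative sketch of influence-based training-set selection for
# logistic regression. Data, hyperparameters, and the per-class
# balancing heuristic are assumptions, not taken from the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def influence_scores(X_tr, y_tr, X_val, y_val, w, reg=1e-2):
    """Influence of upweighting each training point on mean validation
    loss: I(z_i) = -g_val^T H^{-1} g_i. Negative = helpful point."""
    p_tr = sigmoid(X_tr @ w)
    p_val = sigmoid(X_val @ w)
    # Per-sample gradients of the logistic loss w.r.t. the weights.
    grads_tr = (p_tr - y_tr)[:, None] * X_tr
    g_val = ((p_val - y_val)[:, None] * X_val).mean(axis=0)
    # Hessian of the (lightly regularized) mean training loss.
    s = p_tr * (1 - p_tr)
    H = (X_tr * s[:, None]).T @ X_tr / len(X_tr) + reg * np.eye(X_tr.shape[1])
    return -grads_tr @ np.linalg.solve(H, g_val)


X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

full = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
w = full.coef_.ravel()  # intercept ignored in this sketch

scores = influence_scores(X_tr, y_tr, X_val, y_val, w)

# Keep the 10% most helpful points (most negative influence), balanced
# per class so the subset remains trainable -- an implementation choice.
k = int(0.1 * len(X_tr))
keep = np.concatenate([
    np.where(y_tr == c)[0][np.argsort(scores[y_tr == c])[: k // 2]]
    for c in (0, 1)
])

subset = LogisticRegression(max_iter=1000).fit(X_tr[keep], y_tr[keep])
print(f"full-data accuracy:  {full.score(X_val, y_val):.3f}")
print(f"10%-subset accuracy: {subset.score(X_val, y_val):.3f}")
```

In practice, inverting the Hessian explicitly is only feasible for small models like this one; at scale, influence estimates are usually computed with implicit Hessian-vector products or conjugate-gradient solvers.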
Similar Papers
Testing Most Influential Sets
Machine Learning (Stat)
Finds when a few facts change results too much.
Learning Accurate Models on Incomplete Data with Minimal Imputation
Machine Learning (CS)
Fixes messy data faster for smarter computers.