Subset Selection for Fine-Tuning: A Utility-Diversity Balanced Approach for Mathematical Domain Adaptation
By: Madhav Kotecha, Vijendra Kumar Vaishya, Smita Gautam, and more
Potential Business Impact:
Teaches computers math faster with fewer examples.
We propose a refined approach to efficiently fine-tune large language models (LLMs) on specific domains, such as mathematics, using a budgeted subset selection method. Our approach combines utility and diversity metrics to select the most informative and representative training examples. The goal is to match near-full-dataset performance with a carefully selected subset of the data while significantly reducing computational cost and training time. The utility metric incorporates both perplexity and Chain-of-Thought (CoT) loss to identify challenging examples that contribute most to model learning, while the diversity metric ensures broad coverage across mathematical subdomains. We evaluate our method on LLaMA-3 8B and Phi-3 models, comparing against several baselines, including random selection, diversity-based sampling, and existing state-of-the-art subset selection techniques.
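The abstract does not spell out how the utility and diversity terms are combined under a budget. Below is a minimal sketch of one plausible realization, assuming a facility-location-style diversity term over example embeddings and a single trade-off weight `alpha`; the function name `select_subset`, the greedy loop, and the normalization choices are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def select_subset(utility, embeddings, budget, alpha=0.5):
    """Greedy budgeted selection trading off utility against diversity.

    utility:    (n,) per-example utility scores, e.g. a weighted sum of
                perplexity and Chain-of-Thought loss (assumed in [0, 1]).
    embeddings: (n, d) example embeddings used for the diversity term.
    budget:     number of examples to select (budget <= n).
    alpha:      utility/diversity trade-off weight.
    """
    n = len(utility)
    # Cosine similarity between all pairs of examples.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T

    selected, remaining = [], set(range(n))
    # For each example, its max similarity to the selected set so far;
    # the facility-location gain rewards covering new embedding regions.
    max_sim = np.zeros(n)
    for _ in range(budget):
        best, best_score = None, -np.inf
        for i in remaining:
            # Marginal coverage gain, divided by n so it lives on the
            # same [0, 1] scale as the utility scores.
            coverage_gain = (np.maximum(sim[i], max_sim).sum()
                             - max_sim.sum()) / n
            score = alpha * utility[i] + (1 - alpha) * coverage_gain
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
        max_sim = np.maximum(max_sim, sim[best])
    return selected

# Example: 1000 examples, 64-dim embeddings, a 100-example budget.
rng = np.random.default_rng(0)
utility = rng.random(1000)             # stand-in for perplexity + CoT loss
emb = rng.standard_normal((1000, 64))  # stand-in for real embeddings
idx = select_subset(utility, emb, budget=100)
```

The facility-location gain favors examples that cover embedding-space regions not yet represented in the selected set, which is one common way to operationalize a diversity metric; the paper's actual formulation may differ.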
Similar Papers
Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning
Machine Learning (CS)
Makes AI learn better with less data.
D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning
Machine Learning (CS)
Teaches computers to follow instructions better with less data.
Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm
Computation and Language
Teaches computers better using smarter data choices.