MuCPT: Music-related Natural Language Model Continued Pretraining
By: Kai Tian, Yirong Mao, Wendong Bi, and more
Potential Business Impact:
Helps computers create and understand music better.
Large language models perform strongly on general tasks but remain constrained in specialized settings such as music, particularly in the music-entertainment domain, where corpus scale, purity, and the match between data and training objectives are critical. We address this by constructing a large music-related natural-language corpus (40B tokens) that combines open-source and in-house data, and by implementing a domain-first data pipeline: a lightweight classifier filters and weights in-domain text, followed by multi-stage cleaning, de-duplication, and privacy-preserving masking. We further integrate multi-source music text with associated metadata to form a broader, better-structured foundation of domain knowledge. On the training side, we introduce reference-model (RM)-based token-level soft scoring for quality control: a unified loss-ratio criterion is used both for data selection and for dynamic down-weighting during optimization, reducing noisy gradients and amplifying task-aligned signals, thereby enabling more effective music-domain continued pretraining and alignment. To assess factuality, we design the MusicSimpleQA benchmark, which adopts short, single-answer prompts with automated agreement scoring. Beyond the benchmark design, we conduct systematic comparisons along the axis of data composition. Overall, this work pairs the right corpus with the right training objective, offering a scalable data-and-training framework and a reusable evaluation tool for building domain LLMs in the music field.
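The abstract does not give the exact form of the loss-ratio criterion, but the idea of RM-based token-level soft scoring can be sketched as follows. This is a minimal illustration under assumed definitions: per-token cross-entropy losses from the model being trained and from a fixed reference model, with the excess loss (model minus RM) squashed through a sigmoid to produce soft weights. The function names, the sigmoid form, and the temperature `tau` are all assumptions for illustration, not the paper's actual formulation.

```python
import math

def loss_ratio_weights(model_losses, ref_losses, tau=1.0):
    """Soft per-token weights from an assumed loss-ratio criterion.

    Tokens with high excess loss (the current model finds them hard,
    the reference model does not) are treated as task-aligned signal
    still to be learned and get weights near 1; tokens with low or
    negative excess are down-weighted. `tau` controls sharpness.
    """
    weights = []
    for lm, lr in zip(model_losses, ref_losses):
        excess = lm - lr  # >0: model lags the RM on this token
        weights.append(1.0 / (1.0 + math.exp(-excess / tau)))
    return weights

def weighted_loss(model_losses, ref_losses, tau=1.0):
    """Dynamically down-weighted training loss (normalized weighted mean)."""
    w = loss_ratio_weights(model_losses, ref_losses, tau)
    return sum(wi * li for wi, li in zip(w, model_losses)) / sum(w)
```

The same scores could in principle serve the two roles the abstract describes: averaged over a document they give a selection score for data filtering, and per token they down-weight noisy gradients during optimization.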
Similar Papers
AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages
Computation and Language
Helps computers understand African languages better.
Music Recommendation with Large Language Models: Challenges, Opportunities, and Evaluation
Information Retrieval
Helps music apps pick songs you'll love.
Advancing the Foundation Model for Music Understanding
Sound
Lets computers understand all parts of music.