Score: 2

LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points

Published: August 2, 2025 | arXiv ID: 2508.01317v2

By: Xuemiao Zhang, Can Ren, Chengying Tu and more

BigTech Affiliations: Meituan

Potential Business Impact:

Makes AI smarter by creating better practice questions.

The advancement of large language models (LLMs) is constrained by the scarcity of high-quality, diverse training data. To address this limitation, we propose LinkSyn, a novel knowledge point (KP) graph-based synthesis framework that enables flexible control over discipline and difficulty distributions while balancing KP coverage and popularity. LinkSyn extracts KPs from question-answering (QA) seed data and constructs a KP graph, then synthesizes diverse QA data from multiple seeds that are strongly linked by KPs and sampled through graph walks. Specifically, LinkSyn incorporates (1) a knowledge distribution value function that guides the adjustment of path sampling probabilities and balances KP coverage and popularity during graph walks; (2) diffusion-based synthesis via DeepSeek-R1 that leverages multiple seeds with dense logical associations along each path; and (3) high-difficulty QA enhancement within given disciplines through flexible difficulty adjustments. By executing LinkSyn, we synthesize LinkQA, a diverse multi-disciplinary QA dataset with 50B tokens. Extensive experiments on Llama-3 8B demonstrate that continual pre-training with LinkQA yields an average improvement of 11.51% on MMLU and CMMLU, establishing new SOTA results. LinkQA consistently enhances performance across model sizes and initial FLOPs scales.
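The graph-walk idea in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not give the form of the knowledge distribution value function, so the weighting below (edge co-occurrence count, times KP popularity raised to a hypothetical exponent `alpha`, divided by one plus the KP's visit count) is an assumption chosen only to show how popularity and coverage might be traded off. All function and parameter names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_kp_graph(seeds):
    """Build a KP co-occurrence graph from seeds (seed_id -> set of KPs).

    Two KPs are linked when they appear in the same seed; the edge weight
    counts how many seeds they share. Popularity counts seeds per KP.
    """
    graph = defaultdict(lambda: defaultdict(int))
    popularity = defaultdict(int)
    for kps in seeds.values():
        for kp in kps:
            popularity[kp] += 1
        for a, b in combinations(sorted(kps), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph, popularity


def sample_path(graph, popularity, visits, start, length, alpha=0.5, rng=random):
    """Sample one KP path; `visits` is updated so later walks favor rarely used KPs."""
    path = [start]
    visits[start] += 1
    current = start
    for _ in range(length - 1):
        neighbors = sorted(n for n in graph[current] if n not in path)
        if not neighbors:
            break
        # Assumed value function: reward popular, strongly linked KPs,
        # and discount KPs that earlier walks have already covered.
        weights = [
            graph[current][n] * popularity[n] ** alpha / (1 + visits[n])
            for n in neighbors
        ]
        current = rng.choices(neighbors, weights=weights, k=1)[0]
        path.append(current)
        visits[current] += 1
    return path


def seeds_on_path(seeds, path):
    """Return the seeds that share at least one KP with the sampled path."""
    on_path = set(path)
    return sorted(sid for sid, kps in seeds.items() if kps & on_path)


if __name__ == "__main__":
    toy_seeds = {
        "q1": {"derivative", "limit"},
        "q2": {"limit", "continuity"},
        "q3": {"continuity", "integral"},
    }
    g, pop = build_kp_graph(toy_seeds)
    counts = defaultdict(int)
    kp_path = sample_path(g, pop, counts, "derivative", length=3)
    print(kp_path, seeds_on_path(toy_seeds, kp_path))
```

In this sketch, the seeds collected along a path would then be handed together to a generator model as joint context for synthesizing a new QA pair; that step is omitted because the abstract does not describe its prompt or interface.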

Large-Scale Diverse Synthesis for Mid-Training

Computation and Language

Makes AI smarter with more diverse questions.

2 Aug 2025 2

90%

Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities

Computation and Language

Helps computers answer hard questions better.

26 May 2025 2

88%

Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs

Machine Learning (CS)

Makes AI smarter by teaching it to find facts.

6 Nov 2025 3


Country of Origin

🇨🇳 China

Page Count

27 pages
