Think Less, Label Better: Multi-Stage Domain-Grounded Synthetic Data Generation for Fine-Tuning Large Language Models in Telecommunications
By: Chenhua Shi, Gregor Macdonald, Bhavika Jalli, and others
Potential Business Impact:
Teaches AI hard expert jobs without human labelers.
The success of large language models (LLMs) depends heavily on large-scale, high-quality instruction-following and reinforcement datasets. However, generating such data through human annotation is prohibitively time-consuming, particularly for domain-specific tasks like telecom network troubleshooting, where accurate responses require deep technical expertise and contextual understanding. In this paper, we present a fully automated, retrieval-augmented pipeline for generating synthetic question-answer (QA) pairs grounded in structured domain knowledge. Our multi-stage framework integrates a retriever, a base generator, and a refinement model to synthesize and enhance QA pairs using documents retrieved from a domain-specific knowledge graph. To ensure data quality, we employ customized RAGAS-based scoring to filter out low-quality samples, producing a high-quality dataset suitable for reinforcement fine-tuning (RFT). We demonstrate our approach in a real-world telecom scenario focused on radio access network (RAN) troubleshooting. The resulting pipeline generates complex, context-rich troubleshooting solution plans without human intervention. This work offers a scalable solution for building instruction and reinforcement datasets in specialized domains, significantly reducing dependence on manual labeling while maintaining high technical fidelity.
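The abstract's retrieve → generate → refine → score-and-filter flow can be sketched as a minimal Python skeleton. Everything below is a hypothetical illustration, not the authors' code: the keyword retriever, the stub generator and refiner, and the `quality_score` placeholder stand in for the paper's knowledge-graph retriever, LLM stages, and RAGAS-based scoring.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    context: str
    score: float = 0.0

def retrieve(topic, knowledge_base):
    # Toy keyword retriever; the paper uses a domain-specific knowledge graph.
    return [doc for doc in knowledge_base if topic.lower() in doc.lower()]

def generate_qa(doc):
    # Base generator stage: draft a QA pair grounded in the retrieved document.
    return QAPair(question=f"How do you troubleshoot: {doc}?",
                  answer=f"Draft plan based on: {doc}",
                  context=doc)

def refine(pair):
    # Refinement-model stage: enrich the draft answer (stubbed here).
    pair.answer += " (refined with step-by-step troubleshooting detail)"
    return pair

def quality_score(pair):
    # Placeholder for customized RAGAS-style faithfulness/relevance scoring.
    return 1.0 if pair.context and pair.context in pair.answer else 0.5

def build_dataset(topics, knowledge_base, threshold=0.8):
    # Multi-stage pipeline: retrieve, generate, refine, then score-filter.
    dataset = []
    for topic in topics:
        for doc in retrieve(topic, knowledge_base):
            pair = refine(generate_qa(doc))
            pair.score = quality_score(pair)
            if pair.score >= threshold:  # drop low-quality samples
                dataset.append(pair)
    return dataset

kb = ["RAN cell outage caused by faulty fiber link",
      "Handover failures after parameter change"]
data = build_dataset(["RAN"], kb)
print(len(data))  # only the RAN-matching document yields a kept QA pair
```

The design point the sketch captures is that quality control is a separate, scored gate after generation, so the threshold can be tuned independently of the generator and refiner.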
Similar Papers
Automatic Dataset Generation for Knowledge Intensive Question Answering Tasks
Computation and Language
Helps computers answer harder questions by learning from themselves.
Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs
Computation and Language
Helps computers answer money questions better.
FrugalRAG: Learning to retrieve and reason for multi-hop QA
Computation and Language
Answers questions using fewer searches.