FlexiDataGen: An Adaptive LLM Framework for Dynamic Semantic Dataset Generation in Sensitive Domains

Published: October 21, 2025 | arXiv ID: 2510.19025v1

By: Hamed Jelodar, Samita Bai, Roozbeh Razavi-Far and more

Potential Business Impact:

Creates fake data for training smart computer programs.

Business Areas:

Semantic Web, Internet Services

Dataset availability and quality remain critical challenges in machine learning, especially in domains where data are scarce, expensive to acquire, or constrained by privacy regulations. Fields such as healthcare, biomedical research, and cybersecurity frequently encounter high data acquisition costs, limited access to annotated data, and the rarity or sensitivity of key events. These issues, collectively referred to as the dataset challenge, hinder the development of accurate and generalizable machine learning models in such high-stakes domains. To address this, we introduce FlexiDataGen, an adaptive large language model (LLM) framework designed for dynamic semantic dataset generation in sensitive domains. FlexiDataGen autonomously synthesizes rich, semantically coherent, and linguistically diverse datasets tailored to specialized fields. The framework integrates four core components: (1) syntactic-semantic analysis, (2) retrieval-augmented generation, (3) dynamic element injection, and (4) iterative paraphrasing with semantic validation. Together, these components ensure the generation of high-quality, domain-relevant data. Experimental results show that FlexiDataGen effectively alleviates data shortages and annotation bottlenecks, enabling scalable and accurate machine learning model development.
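To make the four-component pipeline named in the abstract more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the abstract gives no algorithmic detail, so every function name, the tiny corpus, the template, the synonym table, and the threshold are hypothetical. Each LLM-driven stage is replaced by a simple stand-in (keyword extraction, keyword-overlap retrieval, template filling, synonym substitution), and lexical Jaccard overlap stands in for semantic validation.

```python
import re

# Hypothetical toy corpus standing in for a domain knowledge base.
CORPUS = [
    "Patient reports chest pain and shortness of breath.",
    "Firewall logs show repeated failed login attempts.",
    "Patient presents with fever and persistent cough.",
]

STOPWORDS = {"a", "the", "and", "of", "with", "in", "for"}


def tokens(text):
    """Lowercase alphabetic tokens of a string."""
    return re.findall(r"[a-z]+", text.lower())


def analyze(seed):
    """Stage 1 stand-in (syntactic-semantic analysis): keep content keywords."""
    return [t for t in tokens(seed) if t not in STOPWORDS]


def retrieve(keywords, corpus=CORPUS):
    """Stage 2 stand-in (retrieval): sentence with the largest keyword overlap."""
    return max(corpus, key=lambda doc: len(set(keywords) & set(tokens(doc))))


def inject(template, elements):
    """Stage 3 stand-in (dynamic element injection): fill template placeholders."""
    return template.format(**elements)


def paraphrase(text, synonyms):
    """Stage 4a stand-in (paraphrasing): word-level synonym substitution."""
    return re.sub(
        r"[A-Za-z]+",
        lambda m: synonyms.get(m.group(0).lower(), m.group(0)),
        text,
    )


def similarity(a, b):
    """Stage 4b stand-in (validation): Jaccard overlap of token sets.
    This is lexical, not semantic; a real system would likely use embeddings."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def generate(seed, template, elements, synonyms, threshold=0.5):
    """Run the four stages; keep the paraphrase only if it passes validation."""
    context = retrieve(analyze(seed))
    draft = inject(template, {**elements, "context": context})
    candidate = paraphrase(draft, synonyms)
    return candidate if similarity(draft, candidate) >= threshold else draft


if __name__ == "__main__":
    print(generate(
        "patient with chest pain",
        "Age {age}: {context}",
        {"age": 54},
        {"reports": "describes", "pain": "discomfort"},
    ))
```

The validation gate is the key design choice in this sketch: a paraphrase that drifts too far from the draft is rejected and the draft is returned instead, which mirrors the abstract's pairing of paraphrasing with validation only in spirit.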

SyGra: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data

Artificial Intelligence

Creates better AI by making more training data.

21 Aug 2025 1

87%

Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

Computation and Language

Makes smart computers learn new jobs easily.

5 Jul 2025 1

87%

AutoDDG: Automated Dataset Description Generation using Large Language Models

Databases

Makes finding data easier by writing descriptions.

3 Feb 2025 1

Country of Origin

🇨🇦 Canada

Page Count

7 pages
