Score: 1

Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs

Published: July 24, 2025 | arXiv ID: 2507.18055v1

By: Tevin Atwal, Chan Nam Tieu, Yefeng Yuan, and more

Potential Business Impact:

Makes LLM-generated text data more varied and less likely to expose the real users it is modeled on.

Business Areas:
Text Analytics, Data and Analytics, Software

The increasing use of synthetic data generated by Large Language Models (LLMs) presents both opportunities and challenges in data-driven applications. While synthetic data offers a cost-effective, scalable alternative to real-world data for model training, its diversity and privacy risks remain underexplored. Focusing on text-based synthetic data, we propose a comprehensive set of metrics to quantitatively assess the diversity (i.e., linguistic expression, sentiment, and user perspective) and privacy (i.e., re-identification risk and stylistic outliers) of synthetic datasets generated by several state-of-the-art LLMs. Experimental results reveal significant limitations in LLMs' ability to generate diverse and privacy-preserving synthetic data. Guided by these evaluation results, we propose a prompt-based approach that enhances the diversity of synthetic reviews while preserving reviewer privacy.
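The abstract names two metric families without spelling out their definitions, so the following is only a minimal sketch of what such measures could look like: distinct-n as a stand-in for lexical diversity, and nearest-neighbor token overlap as a crude proxy for re-identification risk. These are common choices in the literature, not necessarily the authors' metrics, and all function names here are illustrative.

```python
# Sketch of two metric families the abstract describes (assumed, not the
# paper's actual definitions): lexical diversity and re-identification risk.
from collections import Counter

def distinct_n(reviews, n=2):
    """Fraction of unique n-grams across a corpus (higher = more diverse)."""
    ngrams = Counter()
    for text in reviews:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

def jaccard(a, b):
    """Token-set Jaccard similarity between two reviews."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def reid_risk(synthetic, real, threshold=0.8):
    """Share of synthetic reviews whose closest real review exceeds a
    similarity threshold -- a crude stand-in for re-identification risk."""
    risky = sum(
        1 for s in synthetic
        if max(jaccard(s, r) for r in real) >= threshold
    )
    return risky / len(synthetic) if synthetic else 0.0

synthetic = ["great product fast shipping",
             "great product fast shipping would buy again"]
real = ["great product fast shipping",
        "terrible quality broke in a week"]
print(distinct_n(synthetic), reid_risk(synthetic, real))
```

On this toy input, the first synthetic review matches a real one exactly, so half the synthetic set is flagged as risky. The prompt-based diversity enhancement is likewise only named in the abstract, not specified; one plausible shape, assuming persona and style conditioning with an explicit privacy instruction, is sketched below. The persona and style axes and the wording are assumptions, not the paper's prompts.

```python
# Hypothetical prompt construction for diversity-enhanced review generation.
import random

PERSONAS = ["a budget-conscious student", "a retired engineer", "a busy parent"]
STYLES = ["terse and factual", "enthusiastic and informal", "critical and detailed"]

def build_prompt(product, sentiment):
    """Vary persona and style per request to diversify generated reviews,
    while instructing the model to omit identifying details."""
    persona = random.choice(PERSONAS)
    style = random.choice(STYLES)
    return (
        f"Write a {sentiment} review of {product} as {persona}, "
        f"in a {style} voice. Do not mention any real names, "
        f"locations, or other identifying details."
    )

print(build_prompt("a wireless keyboard", "mixed"))
```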

Country of Origin
πŸ‡ΊπŸ‡Έ United States

Page Count
17 pages

Category
Computer Science:
Computation and Language