Designing DSIC Mechanisms for Data Sharing in the Era of Large Language Models

Published: June 1, 2025 | arXiv ID: 2506.05379v1

By: Seyed Moein Ayyoubzadeh, Kourosh Shahnazari, Mohammadali Keshtparvar, and more

Potential Business Impact:

Builds better AI by fairly paying providers for high-quality training data.

Business Areas:
Semantic Search, Internet Services

Training large language models (LLMs) requires vast amounts of high-quality data from institutions that face legal, privacy, and strategic constraints. Existing data procurement methods often rely on unverifiable trust or ignore heterogeneous provider costs. We introduce a mechanism-design framework for truthful, trust-minimized data sharing that ensures dominant-strategy incentive compatibility (DSIC), individual rationality, and weak budget balance, while rewarding data based on both quality and learning utility. We formalize a model where providers privately know their data cost and quality, and value arises solely from the data's contribution to model performance. Based on this, we propose the Quality-Weighted Marginal-Incentive Auction (Q-MIA), which ranks providers using a virtual cost metric and uses Myerson-style payments to ensure DSIC and budget feasibility. To support settings with limited liquidity or long-term incentives, we introduce the Marginal Utility Token (MUT), which allocates future rights based on marginal contributions. We unify these in Mixed-MIA, a hybrid mechanism balancing upfront payments and deferred rewards. All mechanisms support verifiable, privacy-preserving implementation. Theoretically and empirically, they outperform volume-based and trust-based baselines, eliciting higher-quality data under budget constraints while remaining robust to misreporting and collusion. This establishes a principled foundation for sustainable and fair data markets for future LLMs.
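The abstract outlines the core of Q-MIA: providers are ranked by a virtual cost that discounts their bid by data quality, and winners receive Myerson-style threshold payments that do not depend on their own reports. The sketch below illustrates that idea under stated assumptions: the virtual-cost metric c_i / q_i, the proportional fallback threshold, and the shrink-to-budget loop are illustrative choices, not the paper's exact construction.

```python
# Minimal sketch of a quality-weighted, budget-feasible procurement auction
# in the spirit of Q-MIA. Assumed details (not taken from the paper): the
# virtual cost c_i / q_i, the proportional-share threshold when every
# provider wins, and the shrink-to-budget loop.
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    reported_cost: float  # privately known cost, submitted as a bid
    quality: float        # verified data-quality score, assumed > 0

def q_mia_sketch(providers, budget):
    # Rank by virtual cost: cheaper per unit of quality wins first.
    ranked = sorted(providers, key=lambda p: p.reported_cost / p.quality)
    for k in range(len(ranked), 0, -1):
        winners = ranked[:k]
        if k < len(ranked):
            # Myerson-style threshold: the first loser's virtual cost.
            # A winner's payment q_i * threshold does not depend on their
            # own bid, the key ingredient for truthful (DSIC) bidding.
            threshold = ranked[k].reported_cost / ranked[k].quality
        else:
            # No loser exists: fall back to a proportional budget share.
            threshold = budget / sum(p.quality for p in winners)
        payments = {p.name: p.quality * threshold for p in winners}
        # Keep the outcome individually rational and weakly budget balanced;
        # otherwise drop the worst-ranked winner and retry. (The paper's
        # mechanism handles budget feasibility more carefully than this.)
        if (all(payments[p.name] >= p.reported_cost for p in winners)
                and sum(payments.values()) <= budget + 1e-9):
            return winners, payments
    return [], {}

if __name__ == "__main__":
    bids = [Provider("A", 10.0, 2.0), Provider("B", 12.0, 3.0),
            Provider("C", 30.0, 1.0)]
    winners, pay = q_mia_sketch(bids, budget=40.0)
    print([w.name for w in winners], pay)  # B wins at its threshold price 15.0
```

The same ranking could drive the deferred variant: under MUT, token allocations proportional to marginal contribution would replace the cash payments above, and Mixed-MIA would blend the two; that accounting is omitted here.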

Country of Origin
🇮🇷 Iran

Page Count
26 pages

Category
Computer Science:
Computer Science and Game Theory (cs.GT)