Designing DSIC Mechanisms for Data Sharing in the Era of Large Language Models
By: Seyed Moein Ayyoubzadeh, Kourosh Shahnazari, Mohammadali Keshtparvar and more
Potential Business Impact:
Produces better AI models by paying data providers fairly for high-quality data.
Training large language models (LLMs) requires vast amounts of high-quality data from institutions that face legal, privacy, and strategic constraints. Existing data procurement methods often rely on unverifiable trust or ignore heterogeneous provider costs. We introduce a mechanism-design framework for truthful, trust-minimized data sharing that ensures dominant-strategy incentive compatibility (DSIC), individual rationality, and weak budget balance, while rewarding data based on both quality and learning utility. We formalize a model where providers privately know their data cost and quality, and value arises solely from the data's contribution to model performance. Based on this, we propose the Quality-Weighted Marginal-Incentive Auction (Q-MIA), which ranks providers using a virtual cost metric and uses Myerson-style payments to ensure DSIC and budget feasibility. To support settings with limited liquidity or long-term incentives, we introduce the Marginal Utility Token (MUT), which allocates future rights based on marginal contributions. We unify these in Mixed-MIA, a hybrid mechanism balancing upfront payments and deferred rewards. All mechanisms support verifiable, privacy-preserving implementation. Theoretically and empirically, they outperform volume-based and trust-based baselines, eliciting higher-quality data under budget constraints while remaining robust to misreporting and collusion. This establishes a principled foundation for sustainable and fair data markets for future LLMs.
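The abstract does not spell out Q-MIA's rules, so the following is only a minimal sketch of the general idea it names: rank providers by a quality-adjusted cost and pay each winner a Myerson-style threshold price. Everything specific here is an assumption made for illustration, not the authors' definition: the virtual cost is taken to be reported cost divided by quality, quality is treated as verifiable, the buyer selects at most `k` providers, and a hypothetical `reserve` caps the acceptable virtual cost. The budget constraint, MUT, and Mixed-MIA are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    provider: str
    cost: float     # cost reported by the provider
    quality: float  # quality score > 0 (assumed verifiable in this sketch)


def virtual_cost(bid: Bid) -> float:
    # Assumed metric: reported cost per unit of quality.
    return bid.cost / bid.quality


def quality_weighted_auction(bids, k, reserve):
    """Pick up to k bids with the lowest virtual cost and return
    {provider: payment} using threshold (critical-value) payments."""
    eligible = sorted(
        (b for b in bids if b.quality > 0 and virtual_cost(b) <= reserve),
        key=virtual_cost,
    )
    winners = eligible[:k]
    # A winner keeps winning as long as its virtual cost stays below the
    # best losing bid's virtual cost, or the reserve if nobody loses.
    threshold = virtual_cost(eligible[k]) if len(eligible) > k else reserve
    # The payment does not depend on the winner's own reported cost.
    return {b.provider: b.quality * threshold for b in winners}


bids = [Bid("A", 10.0, 2.0), Bid("B", 12.0, 4.0), Bid("C", 9.0, 1.0)]
print(quality_weighted_auction(bids, k=2, reserve=10.0))
# prints {'B': 36.0, 'A': 18.0}
```

In this toy run each winner is paid at least its reported cost (individual rationality), and because a winner's payment is set by the best losing bid rather than by its own report, overstating cost can only cause a provider to lose, never to earn more.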
Similar Papers
From Fairness to Truthfulness: Rethinking Data Valuation Design
CS and Game Theory
Pays people fairly for data used by AI.
Measuring the Hidden Cost of Data Valuation through Collective Disclosure
CS and Game Theory
Helps fairly pay people for their data.