April 24, 2024 | Jiachen T. Wang, Zhun Deng, Hiroaki Chiba-Okabe, Boaz Barak, Weijie J. Su
This paper proposes a framework for fairly compensating copyright owners for their contributions to generative AI models, based on the Shapley value from cooperative game theory. The framework quantifies how much each copyright owner's data contributes to a piece of AI-generated content, using the log-likelihood of generating that content as the measure of utility. Revenue can then be distributed in a fair and interpretable way, so that the owners whose data matters most to the generated content receive a correspondingly larger share. The framework does not require modifying the inference process and preserves the full capabilities of generative models.

The framework is evaluated on datasets such as WikiArt and FlickrLogo-27, where it identifies the data sources most relevant to generated artwork. It also covers the setting in which multiple entities, each holding a private dataset, jointly train a generative AI model for revenue generation; there it provides a basis for fair revenue sharing among the private data owners and for resolving potential financial disagreements. Computing the Shapley value exactly is computationally intensive, but it can be approximated efficiently with methods such as Monte Carlo sampling and fine-tuning. The paper also discusses related work, including other data valuation techniques and the limitations of the leave-one-out score. In the experiments, the framework robustly discerns the relative significance of contributions from diverse data sources and handles scenarios with numerous data sources. The paper concludes that the framework provides a fair and interpretable method for compensating copyright owners while fostering innovation in AI.
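To make the revenue-sharing idea concrete, the following is a minimal sketch of the Shapley value and its Monte Carlo (permutation sampling) approximation. It is not the paper's implementation: the function names, the three owners, and the numbers inside `toy_utility` are hypothetical. In the framework described above, the utility of a coalition would be the log-likelihood of the generated content under a model trained on that coalition's data; here a cheap hand-written function stands in for it so the example runs instantly.

```python
import itertools
import math
import random
from typing import Callable, Dict, FrozenSet, Sequence

# A utility maps a coalition of data owners to a real-valued score.
Utility = Callable[[FrozenSet[str]], float]


def exact_shapley(owners: Sequence[str], utility: Utility) -> Dict[str, float]:
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    n = len(owners)
    values = {owner: 0.0 for owner in owners}
    for owner in owners:
        others = [o for o in owners if o != owner]
        for k in range(n):
            # Weight of coalitions of size k that exclude `owner`.
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for subset in itertools.combinations(others, k):
                coalition = frozenset(subset)
                marginal = utility(coalition | {owner}) - utility(coalition)
                values[owner] += weight * marginal
    return values


def monte_carlo_shapley(
    owners: Sequence[str],
    utility: Utility,
    num_permutations: int,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = {owner: 0.0 for owner in owners}
    for _ in range(num_permutations):
        order = list(owners)
        rng.shuffle(order)
        coalition: FrozenSet[str] = frozenset()
        previous = utility(coalition)
        for owner in order:
            coalition = coalition | {owner}
            current = utility(coalition)
            totals[owner] += current - previous
            previous = current
    return {owner: total / num_permutations for owner, total in totals.items()}


def toy_utility(coalition: FrozenSet[str]) -> float:
    """Hypothetical stand-in for the log-likelihood utility (made-up numbers)."""
    base = {"A": 3.0, "B": 1.0, "C": 0.0}
    score = sum(base[owner] for owner in coalition)
    # Owners A and B are assumed to be complementary: together they add a bonus.
    if "A" in coalition and "B" in coalition:
        score += 2.0
    return score


if __name__ == "__main__":
    owners = ["A", "B", "C"]
    print("exact:      ", exact_shapley(owners, toy_utility))
    print("monte carlo:", monte_carlo_shapley(owners, toy_utility, num_permutations=2000))
```

In this toy game the exact Shapley values are 4, 2, and 0 (up to floating-point rounding): the bonus of 2 is split equally between A and B, and C, which never changes the utility, receives nothing. The permutation estimator always distributes exactly the utility of the full coalition, which is the property that lets the values be read directly as revenue shares. In a real setting each call to the utility would involve training or fine-tuning a model, which is why sampling a limited number of permutations matters.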