Differentially Private Synthetic Data via Foundation Model APIs 2: Text


2024 | Chulin Xie, Zinan Lin, Arturs Backurs, Sivakanth Gopi, Da Yu, Huseyin Inan, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Sergey Yekhanin
This paper proposes AUG-PE, an augmented version of the Private Evolution (PE) algorithm, for generating differentially private (DP) synthetic text without any model training. PE uses only API access to foundation models: it iteratively improves a pool of synthetic samples based on their similarity to the private data. AUG-PE extends this approach to text by working through the inference APIs of large language models (LLMs) and adding techniques for generating diverse, high-quality synthetic text. The algorithm combines random sampling, LLM-generated variations, and DP histogram-based selection to iteratively refine a synthetic set that preserves utility under formal privacy guarantees (a minimal sketch of this loop appears at the end of this summary).

AUG-PE was evaluated on three benchmark datasets: Yelp, OpenReview, and PubMed. The results show that it generates DP synthetic text with comparable or better downstream performance than state-of-the-art DP finetuning baselines. It is particularly effective with more powerful LLMs such as GPT-3.5, for which DP finetuning is not feasible because only inference APIs are exposed. AUG-PE is also more computationally efficient than DP finetuning, since it requires only LLM inference.

The paper further analyzes the properties of AUG-PE, including its text length distribution, compatibility with stronger LLMs, and behavior under data scaling. AUG-PE produces text length distributions similar to the real data and works across a wide range of API-accessible LLMs. It is also robust to empirical privacy attacks, as demonstrated by membership inference attacks on the PubMed dataset.

Overall, AUG-PE is a promising route to DP synthetic text that maintains high utility, and it is especially useful when proprietary or open-source LLMs cannot be DP-finetuned, since it needs nothing beyond API access.
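To make the loop concrete, here is a minimal Python sketch of one way an AUG-PE-style iteration could be wired up. This is not the authors' implementation: `llm_random_text`, `llm_variations`, and `embed` are hypothetical stand-ins for the LLM random-generation and variation API calls and the text-embedding model the paper relies on, and privacy accounting is reduced to a single Gaussian-noise parameter `sigma` on the vote histogram.

```python
import numpy as np

def llm_random_text(n, rng):
    """Stand-in for the random-generation API: an LLM prompted to produce
    n fresh samples of the target data type (e.g. 'a Yelp review')."""
    return [f"random draft {rng.integers(1_000_000)}" for _ in range(n)]

def llm_variations(text, k, rng):
    """Stand-in for the variation API: an LLM prompted to paraphrase or
    rewrite `text` k times; here we just tag copies for illustration."""
    return [f"{text} / variant {i}" for i in range(k)]

def embed(texts):
    """Stand-in for a sentence-embedding model (deterministic per text)."""
    return np.array([
        np.random.default_rng(abs(hash(t)) % 2**32).normal(size=16)
        for t in texts
    ])

def aug_pe_sketch(private_texts, n_synth=100, iters=5, sigma=2.0,
                  k_variations=3, seed=0):
    rng = np.random.default_rng(seed)
    synth = llm_random_text(n_synth, rng)   # start from random LLM samples
    priv = embed(private_texts)             # private data only enters via embeddings
    for _ in range(iters):
        se = embed(synth)
        # Each private sample votes for its nearest synthetic sample.
        d = np.linalg.norm(priv[:, None, :] - se[None, :, :], axis=-1)
        votes = np.bincount(d.argmin(axis=1), minlength=len(synth)).astype(float)
        # The only DP step: Gaussian noise on the vote histogram. Each private
        # sample contributes a single vote, so the histogram has low sensitivity
        # and sigma is calibrated to the (eps, delta) budget across iterations.
        noisy = np.maximum(votes + rng.normal(0.0, sigma, size=len(synth)), 0.0)
        probs = (noisy / noisy.sum() if noisy.sum() > 0
                 else np.full(len(synth), 1.0 / len(synth)))
        # Resample promising texts, then expand them with LLM variations.
        kept = [synth[i] for i in rng.choice(len(synth), size=n_synth, p=probs)]
        pool = [v for t in kept for v in llm_variations(t, k_variations, rng)]
        synth = list(rng.choice(pool, size=n_synth, replace=False))
    return synth
```

The paper's version adds text-specific refinements beyond this sketch (for example, handling variable text lengths and generating richer candidate sets), but the overall shape is the same: generate, vote privately, resample, vary. Note that only the noisy-histogram step touches the private data, which is why the method needs nothing from the LLM beyond inference APIs.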