This paper introduces privacy backdoors, a new type of attack that compromises the privacy of data used to finetune pretrained machine learning models: by tampering with a pretrained model's weights, an attacker can fully compromise the privacy of the finetuning data. The authors demonstrate how to build privacy backdoors for various models, including transformers, that enable an attacker to reconstruct individual finetuning samples with high success rates. They further show that backdoored models allow for tight privacy attacks on models trained with differential privacy (DP), so the common optimistic practice of training DP models with loose privacy guarantees is insecure if the pretrained model is not trusted.
The authors propose a new single-use backdoor design: once the backdoor activates and a data point is written into the model's weights, the backdoor deactivates and prevents any further alteration of those weights during training. The backdoor thus acts like a latch, ensuring that a captured data point remains in the weights until the end of training.
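To make the latch intuition concrete, here is a minimal NumPy toy, not the authors' transformer construction: the constant upstream gradient c, the zero-initialized weights, and the non-negative features are simplifying assumptions. A single ReLU unit fires on the first finetuning sample, the resulting SGD step writes that sample into its weights and drives its bias negative, and from then on the unit stays off, so the sample can be read back from the weight difference.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr, c = 16, 0.1, 1.0      # c: assumed constant positive upstream gradient dL/dh

# Planted "data trap" unit h = relu(w @ x + b), updated with plain SGD:
w = np.zeros(d)              # near-zero weights, so the captured input dominates the update
b = 0.01                     # small positive bias: the unit fires on the very first sample
w0, captured = w.copy(), None

for x in rng.uniform(0.0, 1.0, size=(50, d)):   # non-negative features, e.g. pixel intensities
    if w @ x + b > 0:                 # unit active: standard SGD step through the ReLU
        w -= lr * c * x               # the sample is written into the weights ...
        b -= lr * c                   # ... and the bias drop (lr*c > initial b) shuts the latch
        if captured is None:
            captured = x.copy()
    # with b < 0 and w entrywise <= 0, the unit never fires on later non-negative inputs

x_rec = (w0 - w) / (lr * c)           # attacker reads the trapped sample back out of the weights
print(np.allclose(x_rec, captured))   # True: the first finetuning sample is recovered exactly
```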
The authors apply these attacks to MLPs and pretrained transformers (ViT and BERT) and reconstruct dozens of finetuning examples across various downstream tasks. They also consider a more restrictive black-box threat model, in which the attacker has only query access to the finetuned model. By adapting techniques from the model extraction literature, they show that even a black-box attacker can recover entire training inputs. They further show that their backdoors enable simpler, perfect membership inference attacks, which infer with 100% accuracy whether a data point was used for training.
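One reason a trap of this kind yields a perfect membership test, sketched here as a toy decision rule under the same assumptions as above (plain SGD, no weight decay), and not the authors' exact construction: if the trap is wired so that only the targeted point can trigger it, then whether the latch moved during finetuning reveals membership without error.

```python
import numpy as np

def plant_trap_for_target(x_star: np.ndarray, margin: float = 1e-3):
    """Hypothetical trap relu(w @ x + b) that fires only on inputs very close to x_star
    (inputs assumed L2-normalized, so w @ x is a cosine similarity)."""
    w = x_star / np.linalg.norm(x_star)
    b = margin - 1.0          # positive pre-activation only when cos(x, x_star) > 1 - margin
    return w, b

def target_was_in_training_set(b_planted: float, b_finetuned: float) -> bool:
    """Under plain SGD the bias only moves if the trap fired, and only (near-duplicates of)
    x_star can fire it, so this toy membership test makes no errors."""
    return b_finetuned < b_planted
```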
The authors use these backdoors to build the first tight end-to-end attack on the seminal differentially private SGD (DPSGD) algorithm of Abadi et al. (2016). The attack shows that the privacy leakage observable by the adversary nearly matches the provable upper bound from the algorithm's privacy analysis, challenging the common assumption that DPSGD's guarantees are overly conservative in practice. The paper thus highlights a crucial and overlooked supply-chain attack on machine learning privacy.
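To make "tight" precise, here is a standard privacy-auditing calculation (not code from the paper): the attacker's observed true/false positive rates in a membership test imply a lower bound on ε, which can then be compared against the ε proven by DPSGD's analysis.

```python
import math

def epsilon_lower_bound(tpr: float, fpr: float, delta: float = 0.0) -> float:
    """Any membership test against an (eps, delta)-DP mechanism must satisfy
    TPR <= exp(eps) * FPR + delta  and  1 - FPR <= exp(eps) * (1 - TPR) + delta,
    so observed rates certify a lower bound on eps (sampling error is ignored here;
    real audits use confidence intervals over many trials)."""
    bounds = []
    if fpr > 0 and tpr > delta:
        bounds.append(math.log((tpr - delta) / fpr))
    if tpr < 1 and (1 - fpr) > delta:
        bounds.append(math.log((1 - fpr - delta) / (1 - tpr)))
    return max(bounds, default=0.0)

# E.g. a near-perfect attack (TPR 0.99 at FPR 0.01) certifies eps >= ln(99) ~ 4.6.
print(epsilon_lower_bound(0.99, 0.01))
```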