This paper investigates the effectiveness of membership inference (MI) attacks on foundation models, revealing significant flaws in current evaluation methods. The authors demonstrate that existing MI evaluations suffer from distribution shifts between members and non-members, which make the two sets easy to distinguish without any access to the model. They show that "blind" attacks, which never query the model, outperform state-of-the-art MI attacks on eight published MI evaluation datasets. These blind attacks achieve higher true positive rates (TPR) at low false positive rates (FPR) by exploiting simple features such as dates, word frequencies, and n-grams.
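TPR at a fixed low FPR is the headline metric throughout. A minimal sketch of how it can be computed from attack scores (the scores below are illustrative toy values, not the paper's data):

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """TPR achievable when the decision threshold is set so that
    at most `target_fpr` of non-members are (falsely) flagged."""
    # Place the threshold so only the top target_fpr fraction of
    # non-member scores exceed it.
    sorted_nm = sorted(nonmember_scores, reverse=True)
    k = int(target_fpr * len(sorted_nm))
    threshold = sorted_nm[k] if k < len(sorted_nm) else float("-inf")
    # Fraction of true members whose score clears that threshold.
    flagged = sum(1 for s in member_scores if s > threshold)
    return flagged / len(member_scores)
```

For example, with member scores `[0.9, 0.8, 0.7, 0.2]` and non-member scores `[0.6, 0.5, 0.4, 0.1]`, the TPR at 25% FPR is 0.75.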
The paper highlights that many MI evaluation datasets are constructed with biases, such as temporal shifts between members and non-members, which make them easy to distinguish. For example, in the WikiMIA dataset, members are from before 2017 and non-members from after 2023, creating a clear temporal shift. Similarly, in the BookMIA dataset, members are from books memorized by GPT-3, while non-members are from books published after 2023. These biases make it possible for blind attacks to achieve high accuracy.
The authors propose simple "blind" attack techniques that exploit these biases. For instance, they use date detection to identify members based on the dates in text samples. They also use bag-of-words classifiers and n-gram analysis to distinguish members from non-members. These methods achieve high TPR at low FPR, showing that existing MI attacks are suboptimal and often perform worse than chance.
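The bag-of-words idea can be sketched as a log-odds scorer fit on a held-out labeled split; the toy corpora and smoothing value below are stand-ins, not the paper's setup:

```python
import math
from collections import Counter

def fit_log_odds(member_texts, nonmember_texts, smoothing=1.0):
    """Per-word log-odds of appearing in member vs. non-member text,
    with additive smoothing for unseen words."""
    m_counts = Counter(w for t in member_texts for w in t.lower().split())
    n_counts = Counter(w for t in nonmember_texts for w in t.lower().split())
    vocab = set(m_counts) | set(n_counts)
    m_total = sum(m_counts.values()) + smoothing * len(vocab)
    n_total = sum(n_counts.values()) + smoothing * len(vocab)
    return {
        w: math.log((m_counts[w] + smoothing) / m_total)
           - math.log((n_counts[w] + smoothing) / n_total)
        for w in vocab
    }

def score(text, log_odds):
    # Higher score = more member-like vocabulary; threshold to classify.
    return sum(log_odds.get(w, 0.0) for w in text.lower().split())
```

Because the score never touches the target model, any accuracy it achieves reflects dataset bias rather than membership leakage.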
The paper also discusses the implications of these findings for the field of AI. Current MI attacks cannot be relied upon to detect membership leakage, as they may be inferring membership based on data features rather than actual leakage from the model. This undermines the validity of MI evaluations and the trust in results derived from them. The authors suggest that future MI attacks should be evaluated on models with clear train-test splits, such as those based on the Pile or DataComp datasets.
In conclusion, the paper shows that current MI evaluations are flawed due to distribution shifts and biases in the datasets used. Blind attacks can easily distinguish members from non-members, indicating that existing MI attacks are not effective. The authors recommend using datasets with random train-test splits for more reliable MI evaluations.