This paper investigates the effectiveness of membership inference (MI) attacks on foundation models, revealing significant flaws in current evaluation methods. The authors demonstrate that existing MI evaluations suffer from distribution shifts between members and non-members, which make the two sets easy to distinguish without any access to the model. They show that "blind" attacks, which never query the model, outperform state-of-the-art MI attacks on eight published MI evaluation datasets. These blind attacks achieve higher true positive rates (TPR) at low false positive rates (FPR) by exploiting simple features such as dates, word frequencies, and n-grams.
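TPR at a fixed low FPR is the headline metric throughout. A minimal sketch of how it can be computed from attack scores (the scores below are illustrative toy values, not the paper's data):

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """TPR achievable when the decision threshold is set so that
    at most `target_fpr` of non-members are (falsely) flagged."""
    # Place the threshold so only the top target_fpr fraction of
    # non-member scores exceed it.
    sorted_nm = sorted(nonmember_scores, reverse=True)
    k = int(target_fpr * len(sorted_nm))
    threshold = sorted_nm[k] if k < len(sorted_nm) else float("-inf")
    # Fraction of true members whose score clears that threshold.
    flagged = sum(1 for s in member_scores if s > threshold)
    return flagged / len(member_scores)
```

For example, with member scores `[0.9, 0.8, 0.7, 0.2]` and non-member scores `[0.6, 0.5, 0.4, 0.1]`, the TPR at 25% FPR is 0.75.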
The paper highlights that many MI evaluation datasets are constructed with biases, such as temporal shifts between members and non-members, which make them easy to distinguish. For example, in the WikiMIA dataset, members are from before 2017 and non-members from after 2023, creating a clear temporal shift. Similarly, in the BookMIA dataset, members are from books memorized by GPT-3, while non-members are from books published after 2023. These biases make it possible for blind attacks to achieve high accuracy.
The authors propose simple "blind" attack techniques that exploit these biases. For instance, they use date detection to identify members based on the dates in text samples. They also use bag-of-words classifiers and n-gram analysis to distinguish members from non-members. These methods achieve high TPR at low FPR, showing that existing MI attacks are suboptimal and often perform worse than chance.
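The bag-of-words idea can be sketched as a log-odds scorer fit on a held-out labeled split; the toy corpora and smoothing value below are stand-ins, not the paper's setup:

```python
import math
from collections import Counter

def fit_log_odds(member_texts, nonmember_texts, smoothing=1.0):
    """Per-word log-odds of appearing in member vs. non-member text,
    with additive smoothing for unseen words."""
    m_counts = Counter(w for t in member_texts for w in t.lower().split())
    n_counts = Counter(w for t in nonmember_texts for w in t.lower().split())
    vocab = set(m_counts) | set(n_counts)
    m_total = sum(m_counts.values()) + smoothing * len(vocab)
    n_total = sum(n_counts.values()) + smoothing * len(vocab)
    return {
        w: math.log((m_counts[w] + smoothing) / m_total)
           - math.log((n_counts[w] + smoothing) / n_total)
        for w in vocab
    }

def score(text, log_odds):
    # Higher score = more member-like vocabulary; threshold to classify.
    return sum(log_odds.get(w, 0.0) for w in text.lower().split())
```

Because the score never touches the target model, any accuracy it achieves reflects dataset bias rather than membership leakage.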
The paper also discusses the implications of these findings for the field of AI. Current MI attacks cannot be relied upon to detect membership leakage, as they may be inferring membership based on data features rather than actual leakage from the model. This undermines the validity of MI evaluations and the trust in results derived from them. The authors suggest that future MI attacks should be evaluated on models with clear train-test splits, such as those based on the Pile or DataComp datasets.
In conclusion, the paper shows that current MI evaluations are flawed due to distribution shifts and biases in the datasets used. Blind attacks can easily distinguish members from non-members, indicating that existing MI attacks are not effective. The authors recommend using datasets with random train-test splits for more reliable MI evaluations.