Stealing Part of a Production Language Model

2024 | Nicholas Carlini, Daniel Paleka, Krishnamurthy (Dj) Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, Florian Tramèr
This paper introduces the first model-stealing attack that extracts precise, nontrivial information from black-box production language models such as OpenAI's ChatGPT or Google's PaLM-2. The attack recovers the embedding projection layer of a transformer model (up to symmetries) using only typical API access. For under $20 USD, it extracts the entire projection matrix of OpenAI's ada and babbage language models, confirming their hidden dimensions of 1024 and 2048, respectively. It also recovers the exact hidden dimension of the gpt-3.5-turbo model, and the cost of recovering that model's entire projection matrix is estimated at under $2,000. The paper closes with potential defenses and implications for future work.

The attack operates top-down, directly extracting the model's last layer, which projects from the hidden dimension to the higher-dimensional logit vector. Because this layer is low-rank, targeted queries suffice to extract its embedding dimension or its weight matrix. The paper first presents an attack on logit-vector APIs, showing how a singular value decomposition (SVD) of many queried logit vectors recovers the model's hidden dimensionality, and then extends it to recover the full output projection matrix W. The attack is effective and efficient, and it applies to production models whose APIs expose full logprobs or a logit bias, including Google's PaLM-2 and OpenAI's GPT-4; both OpenAI and Google deployed defenses after responsible disclosure.
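To make the logit-vector attack concrete, here is a minimal, self-contained sketch (an illustration under assumed names and sizes, not the authors' code): a toy projection matrix stands in for the production model, and an SVD of many queried logit vectors reveals the hidden dimension and the projection matrix up to an unknown linear transformation.

```python
"""
Toy sketch (assumptions, not the authors' code) of the SVD-based attack on a
logit-vector API: query many prompts, stack the logit vectors, and read the
hidden dimension off the singular-value spectrum. The "API" is simulated by a
random projection matrix W applied to random final hidden states.
"""
import numpy as np

rng = np.random.default_rng(0)

hidden_dim = 256      # unknown to the attacker; the quantity we recover
vocab_size = 4096     # length of each returned logit vector
n_queries = 400       # number of API queries; must exceed hidden_dim

# Stand-in for the production model's last layer: hidden state -> logits.
W = rng.normal(size=(vocab_size, hidden_dim))

def query_logits(_prompt: int) -> np.ndarray:
    """Pretend API call: returns the full logit vector for one prompt."""
    h = rng.normal(size=hidden_dim)   # final hidden state for this prompt
    return W @ h

# 1. Collect logit vectors from many distinct prompts into one matrix.
Q = np.stack([query_logits(i) for i in range(n_queries)], axis=1)  # (vocab, n)

# 2. Every column of Q lies in the column space of W (rank = hidden_dim),
#    so only ~hidden_dim singular values are non-negligible.
U, s, _ = np.linalg.svd(Q, full_matrices=False)
s = np.maximum(s, 1e-12 * s[0])       # numerical floor so the logs stay finite

# 3. Estimate the hidden dimension from the largest multiplicative gap
#    between consecutive singular values.
estimated_dim = int(np.argmax(np.log(s[:-1]) - np.log(s[1:]))) + 1
print("estimated hidden dimension:", estimated_dim)    # prints 256 here

# 4. The same SVD yields the projection matrix up to an unknown invertible
#    hidden_dim x hidden_dim transformation G:  U_h * s_h  ~  W @ G.
W_tilde = U[:, :estimated_dim] * s[:estimated_dim]
G, *_ = np.linalg.lstsq(W_tilde, W, rcond=None)        # sanity check only
print("relative reconstruction error:",
      np.linalg.norm(W_tilde @ G - W) / np.linalg.norm(W))
```

Against a real API the spectrum is noisier than in this simulation, but the hidden dimension still shows up as the largest drop in the singular values.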
The attack is also adapted to logit-bias APIs, where the adversary manipulates per-token logit biases to infer the full logit vector, and the paper further develops logprob-free attacks that extract logits without any logprob access, at a higher but still modest query cost, recovering 18 bits of precision at just 3.7 queries per logit. Evaluated on five OpenAI models, the attack confirms their hidden dimensions and recovers the projection matrix with high precision and efficiency.

The paper concludes with potential defenses, including removing the logit bias feature or replacing it with a block-list, as well as architectural changes and post-hoc alterations to the architecture; softer mitigations include restricting logit-bias queries, adding noise to the outputs, and detecting malicious query patterns. It highlights the importance of system-level design decisions for security and the need for further research on practical attacks against machine learning models. More broadly, the attack underscores how vulnerable production models are to model stealing and the importance of securing them against such threats.
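As a companion sketch, the toy example below illustrates the binary-search idea behind the logprob-free setting in its simplest form: the simulated API reveals only the temperature-0 argmax token but accepts a logit bias, so the smallest bias that makes a target token win exposes its logit gap to the top token, one bit per query. The names (toy_api, recover_logit_gap) and the bias range are assumptions for the demo; the paper's actual logprob-free attack is considerably more query-efficient, as noted above.

```python
"""
Toy sketch (assumptions, not the authors' code) of the naive binary-search
version of a logprob-free attack: the API reveals only the argmax token at
temperature 0 but accepts a per-token logit bias, so the smallest bias that
makes a target token beat the usual top token equals their logit gap.
"""
import numpy as np

rng = np.random.default_rng(1)
logits = rng.normal(scale=4.0, size=100)   # the toy model's hidden logit vector
top = int(np.argmax(logits))               # token the unbiased model emits

def toy_api(logit_bias: dict[int, float]) -> int:
    """Pretend temperature-0 API call: returns only the sampled (argmax) token."""
    biased = logits.copy()
    for token, bias in logit_bias.items():
        biased[token] += bias
    return int(np.argmax(biased))

def recover_logit_gap(target: int, max_bias: float = 40.0, bits: int = 18) -> float:
    """Binary-search the smallest bias that makes `target` win; that bias
    equals logits[top] - logits[target], so return its negation."""
    lo, hi = 0.0, max_bias
    for _ in range(bits):                   # one bit of precision per query
        mid = (lo + hi) / 2
        if toy_api({target: mid}) == target:
            hi = mid                        # bias was large enough
        else:
            lo = mid                        # bias was too small
    return -hi

for token in (0, 1, 2):
    recovered = 0.0 if token == top else recover_logit_gap(token)
    print(f"token {token}: recovered {recovered:+.4f}, "
          f"true {logits[token] - logits[top]:+.4f}")
```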