Stealing Part of a Production Language Model

2024 | Nicholas Carlini, Daniel Paleka, Krishnamurthy (Dj) Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, Florian Tramèr
This paper introduces the first model-stealing attack that extracts precise, nontrivial information from black-box production language models such as OpenAI's ChatGPT or Google's PaLM-2. The attack recovers the embedding projection layer of a transformer model (up to symmetries) using only typical API access. For under $20 USD, it extracts the entire projection matrix of OpenAI's ada and babbage language models, confirming their hidden dimensions of 1024 and 2048, respectively. It also recovers the exact hidden dimension of the gpt-3.5-turbo model, and the cost of recovering that model's entire projection matrix is estimated at under $2,000. The paper closes with potential defenses and implications for future work.

The attack operates top-down, directly extracting the model's last layer, which projects from the hidden dimension to the higher-dimensional logit vector. Because this layer is low-rank, targeted queries suffice to extract its embedding dimension or its weight matrix. The paper first presents an attack on logit-vector APIs, showing how a singular value decomposition (SVD) of many queried logit vectors recovers the model's hidden dimensionality, and then extends it to recover the full output projection matrix W. The attack is effective and efficient, and it applies to production models whose APIs expose full logprobs or a logit bias, including Google's PaLM-2 and OpenAI's GPT-4; both OpenAI and Google deployed defenses after responsible disclosure.
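To make the logit-vector attack concrete, here is a minimal, self-contained sketch (an illustration under assumed names and sizes, not the authors' code): a toy projection matrix stands in for the production model, and an SVD of many queried logit vectors reveals the hidden dimension and the projection matrix up to an unknown linear transformation.

```python
"""
Toy sketch (assumptions, not the authors' code) of the SVD-based attack on a
logit-vector API: query many prompts, stack the logit vectors, and read the
hidden dimension off the singular-value spectrum. The "API" is simulated by a
random projection matrix W applied to random final hidden states.
"""
import numpy as np

rng = np.random.default_rng(0)

hidden_dim = 256      # unknown to the attacker; the quantity we recover
vocab_size = 4096     # length of each returned logit vector
n_queries = 400       # number of API queries; must exceed hidden_dim

# Stand-in for the production model's last layer: hidden state -> logits.
W = rng.normal(size=(vocab_size, hidden_dim))

def query_logits(_prompt: int) -> np.ndarray:
    """Pretend API call: returns the full logit vector for one prompt."""
    h = rng.normal(size=hidden_dim)   # final hidden state for this prompt
    return W @ h

# 1. Collect logit vectors from many distinct prompts into one matrix.
Q = np.stack([query_logits(i) for i in range(n_queries)], axis=1)  # (vocab, n)

# 2. Every column of Q lies in the column space of W (rank = hidden_dim),
#    so only ~hidden_dim singular values are non-negligible.
U, s, _ = np.linalg.svd(Q, full_matrices=False)
s = np.maximum(s, 1e-12 * s[0])       # numerical floor so the logs stay finite

# 3. Estimate the hidden dimension from the largest multiplicative gap
#    between consecutive singular values.
estimated_dim = int(np.argmax(np.log(s[:-1]) - np.log(s[1:]))) + 1
print("estimated hidden dimension:", estimated_dim)    # prints 256 here

# 4. The same SVD yields the projection matrix up to an unknown invertible
#    hidden_dim x hidden_dim transformation G:  U_h * s_h  ~  W @ G.
W_tilde = U[:, :estimated_dim] * s[:estimated_dim]
G, *_ = np.linalg.lstsq(W_tilde, W, rcond=None)        # sanity check only
print("relative reconstruction error:",
      np.linalg.norm(W_tilde @ G - W) / np.linalg.norm(W))
```

Against a real API the spectrum is noisier than in this simulation, but the hidden dimension still shows up as the largest drop in the singular values.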
The attack is also adapted to logit-bias APIs, where the adversary manipulates per-token logit biases to infer the full logit vector, and the paper further develops logprob-free attacks that extract logits without any logprob access, at a higher but still modest query cost, recovering 18 bits of precision at just 3.7 queries per logit. Evaluated on five OpenAI models, the attack confirms their hidden dimensions and recovers the projection matrix with high precision and efficiency.

The paper concludes with potential defenses, including removing the logit bias feature or replacing it with a block-list, as well as architectural changes and post-hoc alterations to the architecture; softer mitigations include restricting logit-bias queries, adding noise to the outputs, and detecting malicious query patterns. It highlights the importance of system-level design decisions for security and the need for further research on practical attacks against machine learning models. More broadly, the attack underscores how vulnerable production models are to model stealing and the importance of securing them against such threats.
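As a companion sketch, the toy example below illustrates the binary-search idea behind the logprob-free setting in its simplest form: the simulated API reveals only the temperature-0 argmax token but accepts a logit bias, so the smallest bias that makes a target token win exposes its logit gap to the top token, one bit per query. The names (toy_api, recover_logit_gap) and the bias range are assumptions for the demo; the paper's actual logprob-free attack is considerably more query-efficient, as noted above.

```python
"""
Toy sketch (assumptions, not the authors' code) of the naive binary-search
version of a logprob-free attack: the API reveals only the argmax token at
temperature 0 but accepts a per-token logit bias, so the smallest bias that
makes a target token beat the usual top token equals their logit gap.
"""
import numpy as np

rng = np.random.default_rng(1)
logits = rng.normal(scale=4.0, size=100)   # the toy model's hidden logit vector
top = int(np.argmax(logits))               # token the unbiased model emits

def toy_api(logit_bias: dict[int, float]) -> int:
    """Pretend temperature-0 API call: returns only the sampled (argmax) token."""
    biased = logits.copy()
    for token, bias in logit_bias.items():
        biased[token] += bias
    return int(np.argmax(biased))

def recover_logit_gap(target: int, max_bias: float = 40.0, bits: int = 18) -> float:
    """Binary-search the smallest bias that makes `target` win; that bias
    equals logits[top] - logits[target], so return its negation."""
    lo, hi = 0.0, max_bias
    for _ in range(bits):                   # one bit of precision per query
        mid = (lo + hi) / 2
        if toy_api({target: mid}) == target:
            hi = mid                        # bias was large enough
        else:
            lo = mid                        # bias was too small
    return -hi

for token in (0, 1, 2):
    recovered = 0.0 if token == top else recover_logit_gap(token)
    print(f"token {token}: recovered {recovered:+.4f}, "
          f"true {logits[token] - logits[top]:+.4f}")
```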