15 Mar 2024 | Matthew Finlayson, Xiang Ren, Swabha Swayamdipta
This paper investigates how information about API-protected large language models (LLMs) can be inferred from their outputs. We show that, under a conservative assumption about the model architecture, it is possible to learn a surprising amount of non-public information about an API-protected LLM from a small number of API queries. Our findings are based on the observation that most modern LLMs suffer from a softmax bottleneck, which restricts the model outputs to a linear subspace of the full output space. We exploit this fact to unlock several capabilities: efficiently discovering the LLM's hidden size, obtaining cheap full-vocabulary outputs, detecting and disambiguating different model updates, identifying the source LLM given a single full LLM output, and even estimating the output layer parameters. Our empirical investigations show the effectiveness of our methods, which allow us to estimate the embedding size of OpenAI's gpt-3.5-turbo to be about 4,096. We also discuss ways that LLM providers can guard against these attacks, as well as how these capabilities can be viewed as a feature (rather than a bug) by allowing for greater transparency and accountability.
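To make the softmax-bottleneck observation concrete, here is a minimal sketch of the hidden-size discovery idea, assuming full-vocabulary logit vectors have already been recovered from the API (the paper describes how to obtain these cheaply from restricted logprob outputs). Because the output layer maps a d-dimensional hidden state h to logits W·h, where W is a V×d matrix over the vocabulary V, every output lies in the d-dimensional column space of W; stacking more than d outputs therefore yields a matrix whose numerical rank reveals d. The function name and tolerance below are illustrative, not the paper's exact implementation.

```python
import numpy as np

def estimate_hidden_size(outputs: np.ndarray, rel_tol: float = 1e-4) -> int:
    """Estimate the hidden (embedding) size d from full-vocabulary outputs.

    Each row of `outputs` is one logit vector W @ h, so all rows lie in
    the d-dimensional column space of W. With n > d rows, the matrix has
    numerical rank d, which we read off the singular-value spectrum.
    """
    s = np.linalg.svd(outputs, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

# Toy check with a simulated output layer (small d for speed; the same
# procedure applied to gpt-3.5-turbo outputs gives roughly 4,096).
rng = np.random.default_rng(0)
V, d, n = 50_000, 256, 320            # vocab size, hidden size, #queries (n > d)
W = rng.standard_normal((V, d))       # stand-in for the model's output embeddings
H = rng.standard_normal((n, d))       # stand-in hidden states from n prompts
print(estimate_hidden_size(H @ W.T))  # prints 256
```

One caveat: log-probabilities derived from softmax outputs match the true logits only up to an additive constant per query (the log-normalizer), so the observed rank can be d + 1 rather than d; the estimate is unaffected beyond that off-by-one.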