garak: A Framework for Security Probing Large Language Models

garak: A Framework for Security Probing Large Language Models

16 Jun 2024 | Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, Nanna Inie
Garak is a framework for security probing of large language models (LLMs), designed to identify vulnerabilities in LLMs and dialog systems. The framework allows for structured exploration and discovery of security issues, enabling a holistic approach to LLM security evaluation. Garak is inspired by the linguistically unpredictable nature of LLMs and includes four main components: Generators, Probes, Detectors, and Buffs. Generators are used to create prompts, Probes are designed to elicit specific vulnerabilities, Detectors analyze model responses, and Buffs modify inputs or model parameters to elicit responses. Garak offers end-to-end testing of dialog systems and supports a wide range of models and platforms. It includes a variety of probes to test for different vulnerabilities, such as false claims, training data replay, malware generation, and prompt injection. The framework also includes detectors that use keyword-based and machine learning methods to identify vulnerabilities. Additionally, Garak includes a "hitlog" to track successful probe attempts and an "attack generation" feature to adaptively create new test cases based on model responses. Garak provides detailed reporting on test results, including a JSONL file with prompt details and detector results, and an HTML summary of the run. It also integrates with the AI Vulnerability Database to allow users to upload discovered vulnerabilities. The framework is flexible and can be customized for different security evaluation procedures. Garak's attack generation module, atkgen, is designed to adaptively generate new test cases based on model responses. It uses a conversational red-teaming model to orchestrate dialogue between attacking and target models. The module is trained using data from previous attacks and is capable of learning from successful probe attempts to improve its effectiveness. The paper discusses the importance of a holistic approach to LLM security, emphasizing the need for exploration and discovery rather than benchmarking. It argues that benchmarks are not a productive evaluation of LLM security and that red teaming is oriented towards facilitating better-informed decisions and producing more robust artifacts. Garak provides a common venue and methodology for assessing LLM security, advancing practices by establishing a baseline for conducting LLM security analyses and suggesting a holistic view of LLM security based on established cybersecurity red teaming methods. The framework also provides an open-source place to share LLM vulnerabilities, aiming to improve awareness of LLM security failures and enhance LLM security for all.Garak is a framework for security probing of large language models (LLMs), designed to identify vulnerabilities in LLMs and dialog systems. The framework allows for structured exploration and discovery of security issues, enabling a holistic approach to LLM security evaluation. Garak is inspired by the linguistically unpredictable nature of LLMs and includes four main components: Generators, Probes, Detectors, and Buffs. Generators are used to create prompts, Probes are designed to elicit specific vulnerabilities, Detectors analyze model responses, and Buffs modify inputs or model parameters to elicit responses. Garak offers end-to-end testing of dialog systems and supports a wide range of models and platforms. It includes a variety of probes to test for different vulnerabilities, such as false claims, training data replay, malware generation, and prompt injection. The framework also includes detectors that use keyword-based and machine learning methods to identify vulnerabilities. Additionally, Garak includes a "hitlog" to track successful probe attempts and an "attack generation" feature to adaptively create new test cases based on model responses. Garak provides detailed reporting on test results, including a JSONL file with prompt details and detector results, and an HTML summary of the run. It also integrates with the AI Vulnerability Database to allow users to upload discovered vulnerabilities. The framework is flexible and can be customized for different security evaluation procedures. Garak's attack generation module, atkgen, is designed to adaptively generate new test cases based on model responses. It uses a conversational red-teaming model to orchestrate dialogue between attacking and target models. The module is trained using data from previous attacks and is capable of learning from successful probe attempts to improve its effectiveness. The paper discusses the importance of a holistic approach to LLM security, emphasizing the need for exploration and discovery rather than benchmarking. It argues that benchmarks are not a productive evaluation of LLM security and that red teaming is oriented towards facilitating better-informed decisions and producing more robust artifacts. Garak provides a common venue and methodology for assessing LLM security, advancing practices by establishing a baseline for conducting LLM security analyses and suggesting a holistic view of LLM security based on established cybersecurity red teaming methods. The framework also provides an open-source place to share LLM vulnerabilities, aiming to improve awareness of LLM security failures and enhance LLM security for all.
Reach us at info@study.space
[slides and audio] garak%3A A Framework for Security Probing Large Language Models