AI jailbreaks: What they are and how they can be mitigated

June 17, 2024

Today’s subject will be a little bit different from the usual discussions of security flaws. This is a security flaw, to be sure, but it works along different angles. Generative AI is a current industry fad that is being implemented across multiple industries in multiple different capacities, often acting as a human-free method of interacting with customers and end users. Large language models such as ChatGPT are capable of giving certain malicious responses, such as providing the instructions for creating a Molotov cocktail, or, more relevant to security, providing code samples for a DDoS attack. These programs are normally restrained from responding to such instructions through a set of restrictions known as guardrails. For instance, when asked how to create a Molotov cocktail, their program will be overridden to deny the request.

‍

AI Jailbreaking refers to the means by which an attacker circumvents these guardrails through techniques similar to social engineering on a human. There are several methods for doing this, but one of the more common is known as the Crescendo attack. The Crescendo attack, also known as the Multiturn LLM Jailbreak, uses a serious of steps to move the LLM closer to a position of providing harmful content. In the case of our above example, this took the form of first asking about the history of Molotov cocktails, then the history of their use, then asking how they were made historically. In this context, the guardrail did not fire. Another method, hilariously referred to as the Grandma Exploit, asks the AI to assume a persona that would provide harmful content, such as a grandmother who worked at a napalm factory. These exploits, which continue to remain effective, demonstrate the limitations of generative AI, and should sound a note of caution for those interested in implementing it as a primary part of their workflow.

AI jailbreaks: What they are and how they can be mitigated

More from Blackwired

The Rise of Precision-Validated Credential Theft: A New Challenge for Defenders

Hunters International Dumps Ransomware, Goes Full-on Extortion

How SSL Misconfigurations Impact Your Attack Surface