AI jailbreaks: What they are and how they can be mitigated

June 17, 2024

Today’s subject will be a little bit different from the usual discussions of security flaws. This is a security flaw, to be sure, but it works along different angles. Generative AI is a current industry fad that is being implemented across multiple industries in multiple different capacities, often acting as a human-free method of interacting with customers and end users. Large language models such as ChatGPT are capable of giving certain malicious responses, such as providing the instructions for creating a Molotov cocktail, or, more relevant to security, providing code samples for a DDoS attack. These programs are normally restrained from responding to such instructions through a set of restrictions known as guardrails. For instance, when asked how to create a Molotov cocktail, their program will be overridden to deny the request.

AI Jailbreaking refers to the means by which an attacker circumvents these guardrails through techniques similar to social engineering on a human. There are several methods for doing this, but one of the more common is known as the Crescendo attack. The Crescendo attack, also known as the Multiturn LLM Jailbreak, uses a serious of steps to move the LLM closer to a position of providing harmful content. In the case of our above example, this took the form of first asking about the history of Molotov cocktails, then the history of their use, then asking how they were made historically. In this context, the guardrail did not fire. Another method, hilariously referred to as the Grandma Exploit, asks the AI to assume a persona that would provide harmful content, such as a grandmother who worked at a napalm factory. These exploits, which continue to remain effective, demonstrate the limitations of generative AI, and should sound a note of caution for those interested in implementing it as a primary part of their workflow.

More from Blackwired

December 16, 2024

CISOs need to consider the personal risks associated with their role

CISOs face personal liability for cybersecurity incidents, boosting accountability but increasing stress and deterring professionals.

Read more
December 9, 2024

The Shocking Speed of AWS Key Exploitation

AWS keys exposed online are exploited in minutes, highlighting the need for faster detection and response to prevent breaches.

Read more
December 2, 2024

Advanced Cyberthreats Targeting Holiday Shoppers

The holiday season sees increased e-commerce scams, with AI-driven phishing, fake sites, and data theft targeting consumers and businesses.

Read more