AI jailbreaks: What they are and how they can be mitigated

June 17, 2024

Today’s subject will be a little bit different from the usual discussions of security flaws. This is a security flaw, to be sure, but it works along different angles. Generative AI is a current industry fad that is being implemented across multiple industries in multiple different capacities, often acting as a human-free method of interacting with customers and end users. Large language models such as ChatGPT are capable of giving certain malicious responses, such as providing the instructions for creating a Molotov cocktail, or, more relevant to security, providing code samples for a DDoS attack. These programs are normally restrained from responding to such instructions through a set of restrictions known as guardrails. For instance, when asked how to create a Molotov cocktail, their program will be overridden to deny the request.

AI Jailbreaking refers to the means by which an attacker circumvents these guardrails through techniques similar to social engineering on a human. There are several methods for doing this, but one of the more common is known as the Crescendo attack. The Crescendo attack, also known as the Multiturn LLM Jailbreak, uses a serious of steps to move the LLM closer to a position of providing harmful content. In the case of our above example, this took the form of first asking about the history of Molotov cocktails, then the history of their use, then asking how they were made historically. In this context, the guardrail did not fire. Another method, hilariously referred to as the Grandma Exploit, asks the AI to assume a persona that would provide harmful content, such as a grandmother who worked at a napalm factory. These exploits, which continue to remain effective, demonstrate the limitations of generative AI, and should sound a note of caution for those interested in implementing it as a primary part of their workflow.

More from Blackwired

October 14, 2024

SOC teams are frustrated with their security tools

SOC teams face noise from security tools, with only 16% of alerts being genuine. AI tools are increasingly adopted to improve efficiency.

Read more
October 7, 2024

NIST proposes barring some of the most nonsensical password rules

NIST recommends longer passwords, no resets, and no special characters. Use random passwords or memorable passphrases stored in a manager.

Read more
September 30, 2024

Don’t panic and other tips for staying safe from scareware

This social engineering tactic convinces users they are compromised, urging them to download malware disguised as antivirus software.

Read more