Safeguarding AI against ‘jailbreaks’ and other prompt attacks

Abstract artwork featuring vertical black and white striped patterns with a curved line intersecting them. The composition is framed by bold red and blue borders, set against a dark background with purple textured elements.

Safeguarding AI against ‘jailbreaks’ and other prompt attacks

by Vanessa Ho

Getting an AI tool to answer customer service questions can be a great way to save time. Same goes for using an AI assistant to summarize emails. But the powerful language capabilities of those tools also make them vulnerable to prompt attacks, or malicious attempts to trick AI models into ignoring their system rules and produce unwanted results.

There are two types of prompt attacks. One is a direct prompt attack known as a jailbreak, like if the customer service tool generates offensive content at someone’s coaxing, for example. The second is an indirect prompt attack, say if the email assistant follows a hidden, malicious prompt to reveal confidential data.

Microsoft safeguards against both types of prompt attacks with AI tools and practices that include new safety guardrails, advanced security tools and deep investment in cybersecurity research and expertise.

This post is part of Microsoft’s Building AI Responsibly series, which explores top concerns with deploying AI and how the company is addressing them with its responsible AI practices and tools.  

“Prompt attacks are a growing security concern that Microsoft takes extremely seriously,” says Ken Archer, a Responsible AI principal product manager at the company. “Generative AI is reshaping how people live and work, and we are actively working to help developers build more secure AI applications.”

Jailbreaks are when someone directly inputs malicious prompts into an AI system, such as telling it to “forget” its rules or pretend it’s a rogue character. The term was used for smartphones before AI: It described someone trying to customize their phone by breaking it out of a manufacturer’s “jail” of restrictions.

Indirect prompt attacks are when someone hides malicious instructions in an email, document, website or other data that an AI tool processes. An attacker can send an innocuous-looking email that hides a harmful prompt in white font, encoded text or an image. A business or resume website can insert hidden text to manipulate AI screening tools to skip an audit of the business or push a resume to the top of a pile.

People are more aware of jailbreaks, but indirect attacks carry a greater risk because they can enable external, unauthorized access to privileged information. Organizations often need to ground AI systems in documents and datasets to leverage the benefit of generative AI. But doing so can open them to paths for indirect attacks leading to data leaks, malware and other security breaches when those documents and datasets are untrusted or compromised.

“This creates a fundamental trade-off,” Archer says.

To help protect against jailbreaks and indirect attacks, Microsoft has developed a comprehensive approach that helps AI developers detect, measure and manage the risk. It includes Prompt Shields, a fine-tuned model for detecting and blocking malicious prompts in real time, and safety evaluations for simulating adversarial prompts and measuring an application’s susceptibility to them. Both tools are available in Azure AI Foundry.

Microsoft Defender for Cloud helps prevent future attacks with tools to analyze and block attackers, while Microsoft Purview provides a platform for managing sensitive data used in AI applications. The company also publishes best practices for developing a multi-layered defense that includes robust system messages, or rules that guide an AI model on safety and performance.

“We educate customers about the importance of a defense-in-depth approach,” says Sarah Bird, chief product officer for Responsible AI at Microsoft. “We build mitigations into the model, create a safety system around it and design the user experience so they can be an active part of using AI more safely and securely.”

The defense strategy stems from the company’s longtime expertise in cybersecurity, ranging from its AI Red Team attacking its own products to the Microsoft Security Response Center researching and monitoring attacks. The center manages Bug Bounty programs for outside researchers to report vulnerabilities in Microsoft products and recently launched a new opportunity for reporting high-impact vulnerabilities in the company’s AI and Cloud products.

“We stay on top of emerging threats by inviting people to attack us,” says Archer. “We’re constantly learning from a network of researchers dedicated to understanding novel attacks and improving our security measures.”

He says prompt attacks exploit an inability of large language models (LLMs) to distinguish user instructions from grounding data. The architecture of the models, which process inputs in a single, continuous stream of text, is expected to improve with newer iterations.

Microsoft researchers studying indirect attacks are contributing to those improvements. They’ve found that “spotlighting,” a group of prompt engineering techniques, can reduce attack risk by helping LLMs differentiate valid system instructions from adversarial ones. And they’re studying “task drift” — deviations in how models respond to tasks with and without grounding documents — as a new way to detect indirect attacks.

“Given the early stages of generative AI architectures, enterprises with sensitive data assets should be focused on security,” Archer says. “But they should also know they can build generative AI applications with confidence by closing off these attack vectors.”

Learn more about Microsoft’s Responsible AI work.

Lead illustration by Makeshift Studios / Rocio Galarza. Story published on December 3, 2024.