Full Report
Various generative artificial intelligence (GenAI) services have been found vulnerable to two types of jailbreak attacks that make it possible to produce illicit or dangerous content. The first of the two techniques, codenamed Inception, instructs an AI tool to imagine a fictitious scenario, which can then be adapted into a second scenario within the first one where there exists no safety
Analysis Summary
# Vulnerability: Generative AI Jailbreaks Leading to Illicit Content Generation
## CVE Details
This summary addresses general security flaws in LLMs related to prompting techniques rather than specific, vendor-assigned CVEs for each jailbreak mechanism.
- CVE ID: Not explicitly assigned for the reported jailbreak techniques (Inception, Prompting for Negative Instructions).
- CVSS Score: Not applicable, since this is a systemic issue across multiple services rather than a single scored flaw. A successful jailbreak could nonetheless warrant a High rating, with the confidentiality, integrity, or availability impact depending on the payload.
- CWE: None formally assigned. The closest general category is CWE-693: Protection Mechanism Failure (for insufficient safety alignments).
## Affected Systems
- Products: OpenAI ChatGPT, Anthropic Claude, Microsoft Copilot, Google Gemini, xAI Grok, Meta AI, Mistral AI, and other GenAI services.
- Versions: Not specified, as the vulnerabilities relate to the alignment and safety guardrails common across current-generation LLMs. GPT-4.1 is specifically noted as being *more* susceptible to misuse than GPT-4o in some assessments.
- Configurations: Any configuration leveraging these LLMs in their default safety alignment state.
## Vulnerability Description
Generative AI services are susceptible to two primary jailbreak techniques that bypass safety guardrails, potentially enabling the generation of illicit content (e.g., malware code, instructions for controlled substances, phishing emails).
1. **Inception Jailbreak:** An attacker prompts the AI to imagine a fictitious scenario, then introduces a second nested scenario within the first. Continued prompting within the context of the second scenario bypasses safety limits.
2. **Negative Instruction Reversal:** An attacker requests instructions on *how not* to respond to a specific illicit request. The AI then provides the negative instructions, after which the attacker cycles back to the illicit request, often succeeding due to the preceding input.
Additionally, related security concerns include:
* **Context Compliance Attack (CCA):** Injecting a "simple assistant response" into history about a sensitive topic to prime the model.
* **Policy Puppetry Attack:** Crafting malicious instructions that mimic policy files (XML, JSON) to override system prompts.
* **Memory INJection Attack (MINJA):** Injecting malicious records into the model's memory bank via interactions to force undesirable actions.
* **Model Context Protocol (MCP) Risks:** Malicious MCP server tool descriptions can lead to indirect prompt injection, unauthorized data access, and agent hijacking (Tool Poisoning Attacks).
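One way to blunt a Context Compliance Attack is to stop trusting conversation history supplied by the client. The sketch below assumes a deployment in which the server signs each turn it accepts and verifies the chain before every model call. The names `SERVER_KEY`, `sign_turn`, `append_turn`, and `verify_history` are hypothetical, and this illustrates the idea rather than any vendor's actual implementation.

```python
import hashlib
import hmac
import json

# Hypothetical server-side secret; in practice, load it from a secret store.
SERVER_KEY = b"replace-with-a-real-secret"


def sign_turn(role: str, content: str, prev_sig: str) -> str:
    """Return an HMAC over this turn, chained to the previous turn's signature."""
    payload = json.dumps([role, content, prev_sig]).encode("utf-8")
    return hmac.new(SERVER_KEY, payload, hashlib.sha256).hexdigest()


def append_turn(history: list[dict], role: str, content: str) -> None:
    """Append a turn that the server itself produced or accepted."""
    prev_sig = history[-1]["sig"] if history else ""
    history.append(
        {"role": role, "content": content, "sig": sign_turn(role, content, prev_sig)}
    )


def verify_history(history: list[dict]) -> bool:
    """Check every signature; a forged or edited turn breaks the chain."""
    prev_sig = ""
    for turn in history:
        expected = sign_turn(turn["role"], turn["content"], prev_sig)
        if not hmac.compare_digest(expected, str(turn.get("sig", ""))):
            return False
        prev_sig = turn["sig"]
    return True
```

A history that fails `verify_history` would be rejected before it reaches the model, so an assistant turn fabricated by the client cannot prime the conversation. Keeping the history entirely server-side achieves the same goal without signatures.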
## Exploitation
- Status: PoC available (The CERT/CC advisory confirms successful demonstration of these techniques).
- Complexity: Low to Medium, as it relies on creative, iterative prompting rather than complex pre-existing exploits.
- Attack Vector: Network (via standard prompting interface).
## Impact
- Confidentiality: Medium to High (If jailbreaks lead to data exfiltration via compromised agents using MCP).
- Integrity: High (Ability to generate malicious code, phishing material, or manipulate agent behavior).
- Availability: Low (No direct denial of service reported, but could lead to service provider resource exhaustion through misuse).
## Remediation
### Patches
Vendors rarely release discrete patches for these specific prompting techniques, because fixes usually involve updating the underlying model's safety alignment.
* Vendors (OpenAI, Google, etc.) are expected to apply updates to continuously strengthen safety alignments.
### Workarounds
* **For Deployers:** Implement robust input validation, use content filters on both inputs and outputs, and enforce strict session management.
* **For LLM Developers:** Reinforce system prompts with stronger directives against nested scenario use and self-referential negative instruction following.
* **MCP Mitigation:** Implement strict authentication and authorization checks for MCP tools, deeply inspect tool descriptions before use, and avoid granting unrestricted file system access to external agents.
* **Prompt Engineering:** Be specific when prompting for secure or compliant output, rather than relying on loosely specified "vibe coding" prompts.
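The advice to inspect MCP tool descriptions before use could be approximated with a simple pre-registration scan. The sketch below assumes each tool is a dictionary with `name` and `description` fields. The patterns in `SUSPICIOUS_PATTERNS` are illustrative assumptions, not a vetted rule set, and a real deployment would pair such a scan with human review.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only (assumptions); tune them for your own environment.
SUSPICIOUS_PATTERNS = {
    "hidden_instruction_tag": re.compile(r"<\s*(important|system|secret)\s*>", re.I),
    "secrecy_request": re.compile(r"do not (tell|mention|inform|reveal)", re.I),
    "sensitive_path": re.compile(r"(~/\.ssh|id_rsa|\.env\b|/etc/passwd)", re.I),
    "override_phrase": re.compile(
        r"ignore (all |any )?(previous|prior) instructions", re.I
    ),
}


@dataclass
class Finding:
    tool: str  # name of the tool whose description matched
    rule: str  # key of the pattern that matched


def scan_tool_descriptions(tools: list[dict]) -> list[Finding]:
    """Return one finding per (tool, rule) pair whose pattern matches."""
    findings = []
    for tool in tools:
        text = tool.get("description", "")
        for rule, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(text):
                findings.append(Finding(tool.get("name", "<unnamed>"), rule))
    return findings
```

Any tool with findings would be held back from the agent until someone reviews it. Pattern matching of this kind is easy to evade, so it complements, rather than replaces, authentication, authorization, and restricted file system access.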
## Detection
- **Indicators of Compromise:** Sequences of prompts that involve nested scenarios ("imagine X, now inside X, do Y"), or a request for negative responses followed immediately by the target request.
- **Detection Methods and Tools:** Continuous monitoring of conversation context for unusual structural patterns indicative of jailbreaking attempts. Backslash Security suggests that having built-in guardrails in the form of policies and prompt rules is invaluable.
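The structural patterns described above can be flagged with a lightweight heuristic over the user's turns. The following is a minimal sketch: the regular expressions and the function `flag_conversation` are hypothetical, they will miss paraphrased attacks, and they will produce false positives on benign creative writing, so flagged turns should feed review or stricter filtering rather than automatic blocking.

```python
import re

# "imagine ... inside/within ... scenario/story/..." in a single prompt.
NESTED_SCENARIO = re.compile(
    r"\b(imagine|pretend|suppose|role-?play)\b.*\b(inside|within)\b.*"
    r"\b(scenario|story|world|simulation|dream)\b",
    re.I | re.S,
)

# Requests for how NOT to respond, e.g. "how not to respond to ...".
NEGATIVE_REQUEST = re.compile(
    r"\b(how|what) (should you |would you |you should )?not (to )?"
    r"(respond|answer|reply|say)\b",
    re.I,
)


def flag_conversation(user_turns: list[str]) -> list[tuple[int, str]]:
    """Return (turn_index, reason) pairs for turns that merit closer review."""
    flags = []
    for i, text in enumerate(user_turns):
        if NESTED_SCENARIO.search(text):
            flags.append((i, "nested_scenario"))
        # The turn right after a negative-instruction request is the likely
        # place for the attacker to cycle back to the target request.
        if NEGATIVE_REQUEST.search(text) and i + 1 < len(user_turns):
            flags.append((i + 1, "request_after_negative_instruction"))
    return flags
```

For example, a conversation whose first turn asks "how not to respond" to a request and whose second turn repeats that request would have its second turn flagged for inspection by an output filter.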
## References
- CERT Coordination Center (CERT/CC) Advisory: kb dot cert dot org/vuls/id/667211
- Microsoft Whitepaper on Failure Modes: microsoft dot com/en-us/security/blog/2025/04/24/new-whitepaper-outlines-the-taxonomy-of-failure-modes-in-ai-agents/
- CCA Reference: msrc dot microsoft dot com/blog/2025/03/jailbreaking-is-mostly-simpler-than-you-think/
- Policy Puppetry Reference: hiddenlayer dot com/innovation-hub/novel-universal-bypass-for-all-major-llms/
- MINJA Reference: arxiv dot org/abs/2503.03704
- Vibe Coding Pitfalls: thehackernews dot com/2025/04/lovable-ai-found-most-vulnerable-to.html
- GPT-4.1 Safety Concerns: splx dot ai/blog/the-missing-gpt-4-1-safety-report
- MCP Tool Poisoning: invariantlabs dot ai/blog/mcp-security-notification-tool-poisoning-attacks
- Chrome Extension MCP Risk: blog dot extensiontotal dot com/trust-me-im-local-chrome-extensions-mcp-and-the-sandbox-escape-1875a0ee4823