Full Report
If you want a picture of the future of LLM security, imagine Whac-a-Mole meets Groundhog Day
Analysis Summary
# Tool/Technique: CoT (Chain of Thought) Forgery
## Overview
CoT Forgery is a structural prompt injection technique that exploits "role confusion" in Large Language Models (LLMs). Instead of attempting to persuade or "jailbreak" a model through social engineering, the technique spoofs the internal reasoning style of the model (the `thought` or `reasoning` role). By mimicking the specific linguistic style of the model’s internal Chain of Thought, attackers dupe the model into believing it has already evaluated a prohibited request and deemed it safe, bypassing safety filters and system prompts.
## Technical Details
- **Type:** Technique / Structural Prompt Injection
- **Platform:** Large Language Models (specifically those using role-based tagging like OpenAI’s GPT series, Anthropic’s Claude, and others using system/user/assistant/thought roles).
- **Capabilities:** High-reliability bypass of safety guardrails; cross-model transferability; spoofing of internal reasoning states.
- **First Seen:** Reported via ICML 2026 conference proceedings; notably won the 2025 OpenAI Kaggle red-teaming contest.
## MITRE ATT&CK Mapping
- **[TA0001 - Initial Access]**
- [T1566 - Phishing] (Indirect Prompt Injection via malicious documents)
- **[TA0011 - Command and Control]**
- [T1204.003 - User Execution: Malicious Prompt]
- **[MITRE ATLAS - Jailbreak]**
- [AML.T0054 - LLM Jailbreak]
- [AML.T0051 - Prompt Injection]
## Functionality
### Core Capabilities
- **Role Spoofing:** Exploits the model's reliance on "writing style" rather than secure metadata tags to identify roles.
- **Style Mimicry:** Uses a specialized LLM to generate text that perfectly matches the terse, analytical "Chain of Thought" style of the target model.
- **State Manipulation:** Inserts "fake reasoning" into the prompt that concludes a harmful action is permissible.
- **Trust Seizure:** Forces the model to treat the attacker-supplied text not as an external input to be scrutinized, but as its own "already-reached conclusion."
### Advanced Features
- **High Transferability:** Unlike "fragile" jailbreaks (like "Do Anything Now" / DAN), CoT Forgery exploits a fundamental structural flaw in how LLMs distinguish between user input and internal logic, making it effective across different model families.
- **Improved Success Rates:** Increases attack success rates from near 0% to approximately 60% on standard jailbreaking benchmarks.
## Indicators of Compromise
- **File Hashes:** N/A (Text-aware technique)
- **File Names:** N/A
- **Registry Keys:** N/A
- **Network Indicators:** N/A
- **Behavioral Indicators:**
- Prompts containing delimiters or tags normally reserved for the model (e.g., `<thought>`, `[reasoning]`, or specific system-level role tags).
- User inputs that mirror the model’s specific internal "Chain of Thought" formatting and linguistic style.
- Unexpected model compliance with high-harm requests (e.g., chemical synthesis, illegal activities) despite active safety filters.
## Associated Threat Actors
- **Red Team Researchers:** Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell (MIT).
## Detection Methods
- **Signature-based detection:** Monitoring for known role-tag delimiters (e.g., `<thought>`) within user-supplied text blocks.
- **Behavioral detection:** Using "Guardrail" models to analyze if the user input is attempting to emulate a system-level role or internal reasoning style.
- **Perplexity Analysis:** Detecting anomalous transitions in writing style within a single prompt that may indicate a role-switching attempt.
## Mitigation Strategies
- **Secure Tagging:** Moving away from "style-based" role identification to a secure architecture where system, user, and reasoning roles are strictly separated at the token or metadata level.
- **Internal State Inspection:** Implementing checks to ensure that "Thought" tokens can only be generated by the model itself and never accepted as part of a user-input stream.
- **Input Sanitization:** Stripping role-specific formatting and system-level keywords from user-provided content before processing.
## Related Tools/Techniques
- **Adversarial Prompting:** Broad category of manipulation.
- **Indirect Prompt Injection:** Where the CoT Forgery is hidden in a document the LLM reads.
- **Role Play Attacks:** Earlier, less sophisticated versions of role confusion.
- **Chain of Thought (CoT) Prompting:** The legitimate prompting technique that this attack subverts.