Full Report
AI is getting REALLY good at finding security vulnerabilities. Is it going to replace us? This author's claim is different: it hollows you out and makes you dumb. The post follows a Certora researcher's journey of using AI in their security workflow.

Early on, AI was a huge help: understanding codebases quickly, bouncing reasoning off something, lots of hard things done very quickly. This was great, until they realized something: they were reaching for AI earlier and earlier in the process. They started using it NOT for context, but for judgment calls. Is this code really exploitable? Instead of tracing through everything themselves, they would just ask the AI and accept the answer.

There is a blurry line between good and bad prompts. One asks the AI to do all of the work; the other asks a comprehension question and leaves the work to you. The latter is a force multiplier, while the former produces answers that are hardly convincing. Worse, because you didn't reason through the problem yourself, you can't tell if the logic is off. You never had a mental model to begin with. The LLM may be wrong, and you would never know.

Threat modeling is a muscle. Sitting with a hypothesis for hours, uncertain, until it either breaks or holds is a skill. This uncertainty is uncomfortable, and the AI removes it and makes the researcher feel more confident. According to the author, the feeling of "I think there's something here but I can't prove it yet" IS a major part of the process. It's important to sit in this.

The author says that many folks are delegating this at great cost. They are faster and can cover more code, but their hit rate hasn't gone up. AI gave them breadth but took their depth. This is a bad, bad trade in the world of security research.

Here's the process that the author now uses:

1. Write the attack scenario in plain language.
2. Use AI to verify the mechanics: execution paths, state transitions, etc. AI excels at understanding the logic of a complex chain of calls. If the AI is wrong somewhere, you can quickly disprove it with your context.
3. Try to disprove the finding. The AI is useful for gathering evidence here.

They leave with a good quote: "The security researchers who will thrive with AI are the ones who treat it like a debugger. A tool that extends your reach without replacing your judgment. The ones who will quietly decline are those who let it think for them, one prompt at a time. They will never notice the moment they stopped being the researcher and became a triage layer for an LLM."
Analysis Summary
# Best Practices: Human-Centric AI Integration in Security Research
## Overview
These practices address the "hollowing out" effect of Large Language Models (LLMs) in cybersecurity. They focus on maintaining cognitive depth and manual reasoning skills while leveraging AI for speed, ensuring that AI remains a "force multiplier" rather than a replacement for professional judgment.
## Key Recommendations
### Immediate Actions
1. **Invert Your Prompting Order:** Do not start a security review by asking the AI if code is vulnerable. First, perform a manual scan to form a hypothesis, then use AI to query specific technical mechanics.
2. **Verify Logic Fragments:** Treat AI outputs as "untrusted data." For every execution path or state transition the AI suggests, manually trace the source code to confirm its existence.
3. **Sit with Uncertainty:** Set a mandatory "struggle timer" (e.g., 30–60 minutes) for complex logic before reaching for an LLM to resolve a mental block.
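A cheap first pass at recommendation 2 is to check that every function the AI names actually exists in the code before spending time on a manual trace. The sketch below is a minimal illustration in Python; `missing_identifiers`, the Solidity-like snippet, and the claimed call path are all hypothetical. A name being present says nothing about whether the claimed call order is real, so this only filters out invented identifiers.

```python
import re


def missing_identifiers(claimed_path, source_text):
    """Return names from claimed_path that never appear as a whole word in source_text."""
    missing = []
    for name in claimed_path:
        # \b keeps `withdraw` from matching inside a longer name such as `withdrawAll`.
        if not re.search(rf"\b{re.escape(name)}\b", source_text):
            missing.append(name)
    return missing


# Hypothetical contract snippet and a hypothetical AI-claimed call path.
SOURCE = """
function withdraw(uint256 shares) external {
    uint256 assets = shares * exchangeRate / 1e18;
    onWithdrawCallback(msg.sender, assets);
}
"""
CLAIMED = ["withdraw", "onWithdrawCallback", "updateRate"]

print(missing_identifiers(CLAIMED, SOURCE))  # prints ['updateRate']
```

Anything this returns is a reason to distrust the AI's path outright. An empty result still leaves the manual trace to do.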
### Short-term Improvements (1-3 months)
1. **Implement the "Debugger Workflow":** Transition your AI usage from a "judgment engine" to a "technical debugger." Use it to explain complex syntax or find specific call sites, not to determine exploitability.
2. **Develop a Hypothesis-Verification Log:** Document your initial theory of an attack scenario in plain language *before* involving AI. Use the AI specifically to gather evidence to either support or disprove that specific scenario.
3. **Red-Team AI Assertions:** Periodically attempt to "disprove" the AI's findings. To maintain your adversarial mindset, when the AI claims a path is safe, look specifically for the edge case that proves it wrong.
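The hypothesis-verification log can be as simple as a small record type. The sketch below is one possible shape, not a prescribed format: `HypothesisEntry`, its fields, and the rule that an entry cannot be closed without at least one manual trace are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class HypothesisEntry:
    """One attack hypothesis, written by the researcher before any AI query."""

    target: str
    hypothesis: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    evidence: list = field(default_factory=list)
    verdict: str = "open"

    def __post_init__(self):
        # An empty scenario means the thinking step was skipped.
        if not self.hypothesis.strip():
            raise ValueError("write the attack scenario before logging evidence")

    def add_evidence(self, source, note, supports):
        """Record one piece of evidence; source is 'manual' or 'ai'."""
        if source not in ("manual", "ai"):
            raise ValueError("source must be 'manual' or 'ai'")
        self.evidence.append({"source": source, "note": note, "supports": supports})

    def close(self, verdict):
        """Set the final verdict; assumes at least one manual trace is required."""
        if verdict not in ("confirmed", "disproved"):
            raise ValueError("verdict must be 'confirmed' or 'disproved'")
        if not any(item["source"] == "manual" for item in self.evidence):
            raise ValueError("a human trace is required before closing")
        self.verdict = verdict


# Hypothetical entry; the notes are invented for the example.
entry = HypothesisEntry(
    target="Vault.withdraw",
    hypothesis="User A can drain the vault by manipulating the exchange rate "
               "during the withdrawal callback.",
)
entry.add_evidence("ai", "exchangeRate is read before the callback runs", supports=True)
entry.add_evidence("manual", "traced withdraw(): rate is cached, not re-read", supports=False)
entry.close("disproved")
print(entry.verdict)  # prints "disproved"
```

Timestamping the entry at creation makes it possible to check later that the hypothesis predates the first AI-sourced evidence.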
### Long-term Strategy (3+ months)
1. **Internal Benchmarking:** Track "Hit Rates" vs. "Code Coverage." If you find you are covering 3x more code but finding fewer high-severity bugs, mandate a reduction in AI usage for the discovery phase.
2. **Skill Retention Audits:** Conduct periodic "blind" manual audits without AI tools to ensure senior researchers retain the "muscle" for threat modeling and deep architectural analysis.
3. **Curate a Prompt Library for Context, Not Answers:** Standardize prompts that request "Contextual Summaries" (e.g., "Map out the state transitions for this module") over "Judgmental Queries" (e.g., "Is this reentrancy bug exploitable?").
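The benchmarking recommendation comes down to comparing two ratios across review periods. The sketch below assumes "hit rate" means high-severity findings per thousand lines reviewed, which is one reasonable definition rather than one given in the source; the function names and the numbers are made up for illustration.

```python
def hit_rate(high_severity_findings, kloc_reviewed):
    """High-severity findings per thousand lines of code reviewed."""
    if kloc_reviewed <= 0:
        raise ValueError("kloc_reviewed must be positive")
    return high_severity_findings / kloc_reviewed


def breadth_without_depth(before, after):
    """True when coverage grew but the count of high-severity findings did not."""
    return after["kloc"] > before["kloc"] and after["findings"] <= before["findings"]


# Made-up numbers for two review periods, for illustration only.
manual_period = {"kloc": 20, "findings": 6}
ai_period = {"kloc": 60, "findings": 5}

for label, period in (("manual", manual_period), ("ai-assisted", ai_period)):
    print(f"{label}: {hit_rate(period['findings'], period['kloc']):.2f} findings/kLOC")

print(breadth_without_depth(manual_period, ai_period))  # prints True
```

With these numbers the team covers three times as much code and finds fewer serious bugs, which is the signal the recommendation says should trigger less AI use in the discovery phase.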
## Implementation Guidance
### For Small Organizations
- **Focus on Training:** Ensure junior researchers are taught manual tracing *before* being given access to AI-augmented security tools.
- **Peer Review:** Require researchers to explain the logic of a finding *without* referencing the AI's explanation.
### For Medium Organizations
- **Protocol Standardization:** Establish a "Manual First" policy for high-risk codebases.
- **Verification Requirements:** Mandate that all AI-assisted findings include a link to a manual trace or a custom test case (POC) that validates the AI's logic.
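A verification requirement like this is easy to enforce mechanically at ticket review. The sketch below assumes findings are stored as plain dictionaries; the field names `ai_assisted`, `manual_trace_url`, and `poc_test` are hypothetical and would need to match whatever tracker is actually in use.

```python
def review_blockers(finding):
    """Return the reasons an AI-assisted finding is not ready for sign-off."""
    blockers = []
    if finding.get("ai_assisted"):
        has_trace = bool(finding.get("manual_trace_url"))
        has_poc = bool(finding.get("poc_test"))
        if not (has_trace or has_poc):
            blockers.append("AI-assisted finding needs a manual trace link or a PoC test")
    return blockers


# Hypothetical finding records.
unverified = {"id": "F-1", "ai_assisted": True}
verified = {"id": "F-2", "ai_assisted": True, "poc_test": "test/VaultDrain.t.sol"}

print(review_blockers(unverified))
print(review_blockers(verified))  # prints []
```

The check only confirms that evidence is attached. Whether the trace or PoC actually validates the AI's logic remains a reviewer's call.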
### For Large Enterprises
- **Tooling Integration:** Integrate LLMs as "Copilots" within the IDE that focus on documentation and syntax, rather than deploying them as autonomous vulnerability scanners.
- **Knowledge Management:** Use AI to index internal documentation so researchers can find context faster, but restrict AI from being the final sign-off on security tickets.
## Configuration Examples
*While specific code was not provided, the author’s recommended workflow configuration follows:*
1. **Phase 1 (Human):** Write the attack scenario in plain language.
   * *Input:* "I believe User A can drain the vault by manipulating the exchange rate during the withdrawal callback."
2. **Phase 2 (AI):** Verify mechanics/Execution paths.
   * *Prompt:* "Analyze the call stack from `withdraw()` to `onWithdrawCallback()`. List all state changes involving the `exchangeRate` variable."
3. **Phase 3 (Human):** Judgment Call.
   * *Action:* Verify if the state changes listed by the AI allow for the manipulation hypothesized in Phase 1.
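The Phase 2 prompt can be generated from a template so that it always asks about mechanics, never about exploitability. The sketch below is an illustration under stated assumptions: `mechanics_prompt`, `is_judgment_query`, and the `JUDGMENT_MARKERS` keyword list are hypothetical, and the keyword match is a crude heuristic that will miss rephrased judgment questions.

```python
# Hypothetical phrases that suggest the prompt asks the AI for a verdict.
JUDGMENT_MARKERS = ("exploitable", "vulnerable", "is this safe", "is it safe")


def mechanics_prompt(entry_point, sink, variable):
    """Build a Phase 2 prompt that asks only for call paths and state changes."""
    return (
        f"Analyze the call stack from `{entry_point}` to `{sink}`. "
        f"List all state changes involving the `{variable}` variable."
    )


def is_judgment_query(prompt):
    """Flag prompts that appear to delegate the exploitability decision."""
    lowered = prompt.lower()
    return any(marker in lowered for marker in JUDGMENT_MARKERS)


prompt = mechanics_prompt("withdraw()", "onWithdrawCallback()", "exchangeRate")
print(prompt)
print(is_judgment_query(prompt))  # prints False
print(is_judgment_query("Is this reentrancy bug exploitable?"))  # prints True
```

Keeping the verdict out of the template leaves Phase 3 with the researcher, which is the point of the workflow.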
## Compliance Alignment
- **NIST SP 800-218 (SSDF):** Supports practice PW.7 (Review and/or Analyze Human-Readable Code to Identify Vulnerabilities and Verify Compliance with Security Requirements) by emphasizing human oversight in automated tool chains.
- **ISO/IEC 27001:** Aligns with A.14.2 (Security in development and support processes) in the 2013 edition by ensuring technical rigor is maintained. The 2022 edition renumbers these controls under A.8.
- **CIS Controls (Control 16, Application Software Security):** Ensures the "root cause" is understood rather than just checking a box.
## Common Pitfalls to Avoid
- **Becoming a Triage Layer:** Letting the AI lead the investigation while the human only checks the AI's output for obvious errors.
- **Depth-for-Breadth Tradeoff:** Covering 100% of the code at 10% depth, missing the subtle architectural flaws that AI cannot yet perceive.
- **Acceptance Bias:** Assuming the AI is correct because its explanation is coherent and confident, even if the logic is flawed.
## Resources
- **Certora Research Blog:** For methodology on formal verification and security workflows.
- **OWASP Top 10 for LLM Applications:** Specifically the "Overreliance" entry (LLM09 in the 2023 list) on patterns of over-trusting AI output.
- **CWE-1100 (Insufficient Isolation of System-Dependent Functions):** A maintainability weakness, cited here as an example of hard-to-maintain logic that humans must manage.