A new DryRun Security report just quantified what many engineering teams suspected: AI coding agents are fast, useful, and still dangerously inconsistent on security defaults.
DryRun tested three agents — Claude Code (Sonnet 4.6), OpenAI Codex (GPT-5.2), and Google Gemini (2.5 Pro) — by having each one build two realistic applications through iterative pull requests. Then they scanned every PR and each final codebase.
The topline result is blunt:
- 30 PRs total
- 38 scans run
- 143 security issues found
- 26 of 30 PRs vulnerable
- 87% vulnerability rate per PR
This is not a synthetic CTF benchmark. It’s a realistic workflow simulation with normal product prompts and no explicit security instructions.
Two Apps, Same Security Failure Pattern
DryRun used two very different projects:
- FaMerAgen — a family allergy/contact web app
- Road Fury — a browser racing game with backend APIs, leaderboards, and multiplayer
Different domains, same pattern: high rates of logic and authorization flaws that look “fine” to pattern-based scanners until someone exploits them.
The recurring vulnerabilities included:
- Broken access control (unauthenticated destructive/sensitive endpoints)
- Business logic flaws (server trusting client-provided score/currency state)
- OAuth implementation mistakes (missing state parameter, insecure account linking)
- Missing WebSocket authentication (REST auth exists, WS upgrade path left open)
- Rate limiting gaps (middleware defined but never mounted)
- Weak JWT secret management (hardcoded fallback secrets)
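The hardcoded-fallback finding is worth making concrete. A minimal sketch, assuming a hypothetical `JWT_SECRET` environment variable (not a name from the report): the fix is to fail fast at startup instead of silently falling back to a default signing key.

```python
import os

# Illustrative sketch only; "JWT_SECRET" is a hypothetical env var name,
# not taken from the DryRun report.

def load_jwt_secret(env=os.environ) -> str:
    """Load the JWT signing secret, refusing to fall back to a default.

    The vulnerable pattern looks like:
        secret = env.get("JWT_SECRET", "dev-secret")  # silent insecure fallback
    Failing fast forces the secret to come from a secure store.
    """
    secret = env.get("JWT_SECRET")
    if not secret:
        raise RuntimeError("JWT_SECRET is not set; refusing to start")
    if len(secret) < 32:  # arbitrary illustrative minimum length
        raise RuntimeError("JWT_SECRET is too short to be a safe signing key")
    return secret
```

A crash on boot is a far cheaper failure mode than tokens signed with a string that ships in the repo.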
This is exactly the class of bugs that slip through when teams mistake “code compiles” for “system is secure.”
Why This Matters More Than Another “AI Hallucination” Story
The security failures here were not, for the most part, hallucinated APIs or obvious syntax mistakes. They were architectural omissions:
- Middleware not wired across all protocols
- Authorization assumptions that break under adversarial use
- Token/session lifecycle weaknesses
- Trust-boundary violations at feature design time
In other words: reasoning problems, not autocomplete problems.
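The "middleware not wired across all protocols" omission is a good example of a reasoning failure. A hedged sketch with hypothetical handler names (the report's apps are not structured exactly like this): the REST path checks the token, but the WebSocket upgrade path was simply never connected to the same check.

```python
# Sketch of the auth-parity omission; all names here are hypothetical.

def authenticate(token) -> bool:
    # Stand-in for real token verification.
    return token == "valid-token"

def handle_rest_request(token) -> int:
    # REST path: auth middleware is wired in; returns an HTTP-style status.
    return 200 if authenticate(token) else 401

def handle_ws_upgrade_vulnerable(token) -> int:
    # The omission: the upgrade path skips the same middleware,
    # so any client can open a socket.
    return 101  # 101 Switching Protocols, no auth check

def handle_ws_upgrade_fixed(token) -> int:
    # Parity restored: the upgrade enforces the same check as REST.
    return 101 if authenticate(token) else 401
```

Each handler is individually "correct-looking" code, which is exactly why compilation and unit tests don't surface the gap.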
And reasoning problems are harder to catch with regex-heavy SAST alone. DryRun explicitly calls out that logic-level flaws require contextual analysis — data flow, auth boundary tracing, and end-to-end execution semantics.
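The "server trusting client-provided score" finding shows why context matters. A minimal sketch with hypothetical function and field names: the vulnerable version contains no injection, no unsafe API call, nothing a pattern matcher flags. The flaw is a trust-boundary violation only visible when you trace where the data comes from.

```python
# Illustrative sketch of the client-trust flaw class; names are
# hypothetical, not from the DryRun report or either test app.

def submit_score_vulnerable(leaderboard: dict, player: str, payload: dict) -> None:
    # Pattern-based scanners see nothing wrong here.
    # The flaw: the client names its own score.
    leaderboard[player] = payload["score"]

def submit_score_fixed(leaderboard: dict, player: str, events: list) -> None:
    # Server-authoritative: recompute the score from validated game
    # events instead of trusting a client-supplied total.
    score = 0
    for event in events:
        if event.get("type") == "lap_completed" and event.get("ms", 0) > 0:
            score += 100  # arbitrary illustrative scoring rule
    leaderboard[player] = score
```

Catching the first version requires knowing that `payload` crosses a trust boundary, which is contextual analysis, not pattern matching.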
Relative Agent Performance (But Don’t Overread It)
DryRun reports Codex ended with the fewest final vulnerabilities in both apps, with Claude and Gemini retaining more high-severity findings in the final scans.
But the bigger signal isn’t a winner/loser leaderboard. It’s that all three agents repeatedly produced vulnerable code paths unless security controls were explicitly introduced.
If your strategy is “pick the safest model and trust defaults,” you’re solving the wrong problem.
The Pattern Matches Broader Industry Signals
This report lands just days after enterprise-scale evidence that agentic coding needs controlled friction:
- Amazon reportedly ordered a 90-day code safety reset after major AI-assisted incident fallout
- OWASP released its Agentic App Top 10 with Tool Misuse, Goal Hijack, and Privilege Abuse as core risks
- NIST is actively collecting input on security and identity standards for autonomous agents
The direction is clear: agent velocity without governance is operational debt.
What OpenClaw Teams Should Do Right Now
If you’re using AI coding agents in production workflows, treat this as a process design issue, not a model upgrade issue.
Minimum baseline:
- Scan every PR, not just pre-release branches
- Run full codebase scans periodically (PR scans miss cross-file compounding)
- Threat-model in planning, before agents write code
- Enforce deterministic validation gates (tests, linters, auth checks, policy checks)
- Require human approval before merge on high-risk scopes (auth, billing, data deletion)
- Explicitly test WebSocket/auth parity — this repeatedly failed across agents
- Ban insecure JWT defaults and enforce secret sourcing from secure stores
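One of the gates above can be made deterministic with very little code. A minimal sketch of a pre-merge check that flags hardcoded-fallback secret patterns in a diff; the regexes are illustrative heuristics, not a substitute for a real SAST tool, and the patterns assume Python-style code.

```python
import re

# Illustrative heuristics for one deterministic gate: reject added lines
# that look like hardcoded secret fallbacks. Tune patterns per codebase.
FALLBACK_PATTERNS = [
    re.compile(r"""getenv\([^)]*,\s*['"]"""),        # os.getenv("X", "literal")
    re.compile(r"""environ\.get\([^)]*,\s*['"]"""),  # os.environ.get("X", "literal")
    re.compile(r"""SECRET\s*=\s*['"][^'"]+['"]"""),  # SECRET = "literal"
]

def find_secret_fallbacks(diff_text: str) -> list:
    """Return added diff lines that look like hardcoded secret fallbacks."""
    hits = []
    for line in diff_text.splitlines():
        if not line.startswith("+"):  # only inspect lines added by the PR
            continue
        body = line[1:]
        if any(p.search(body) for p in FALLBACK_PATTERNS):
            hits.append(body.strip())
    return hits
```

Wired in as a required CI job that fails when the list is non-empty, this makes the "ban insecure JWT defaults" rule unavoidable rather than a review-time hope.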
OpenClaw’s architecture already supports several of these controls:
- Command approval for sensitive actions
- Sandboxed execution and environment boundaries
- Auditable run logs and file-based traceability
- Tool-level permission shaping
The missing piece for most teams is consistency: making secure defaults unavoidable, not optional.
Bottom Line
AI coding agents are now good enough to ship production features. They are not good enough to safely ship production systems without strict security scaffolding.
The 87% vulnerable-PR number isn’t a temporary glitch. It’s a reminder that agents optimize for task completion unless you explicitly optimize the system around them for safety.
Speed is real. So is blast radius.
Source: Help Net Security coverage of the DryRun Security report
Related reading
- OWASP Top 10 for Agentic Applications: the practical security checklist
- Claude Code MCP vulnerabilities and supply-chain attacks