Here’s a number that should sober up anyone deploying AI agents in production: a 10-step agentic workflow where each step succeeds 90% of the time completes successfully only about 35% of the time, failing roughly 65 of every 100 runs. That’s the “March of Nines” — the compounding math that makes individually impressive AI agents collectively unreliable.

Princeton’s Sayash Kapoor and Arvind Narayanan — authors of AI Snake Oil and co-writers of the “AI As Normal Technology” blog — just published research with four co-authors that quantifies this reliability gap with uncomfortable precision.

Accuracy vs. Reliability: Not the Same Thing

The core insight: AI agents are benchmarked on average accuracy, not reliability. A model that gets 90% of tasks right sounds great — until you realize it fails unpredictably on the other 10%, and you have no idea which 10%.

The Princeton team breaks reliability into four dimensions that existing benchmarks largely ignore:

Consistency. If you ask the agent to do the same task the same way twice, does it produce the same result? For Claude Opus 4.5 — the best performer on this metric — consistency was still only 73%. That means roughly one in four identical requests produces a different outcome.

Robustness. Can the agent function when conditions aren’t ideal? Real-world inputs are messy — ambiguous instructions, malformed data, unexpected edge cases.

Calibration. Does the agent accurately convey its confidence? Gemini 3 Pro scored just 52% on calibration — essentially a coin flip on whether the agent knows it might be wrong.

Safety. When the agent fails, how bad is it? Gemini 3 Pro scored just 25% on catastrophic failure avoidance: three out of four of its failures were severe.

The Improvement Rate Problem

The most alarming finding isn’t any single score. It’s the trajectory.

On a general agentic benchmark, reliability improved at half the rate of accuracy across successive model releases. On a customer service benchmark, it was one-seventh.

This means the industry’s rapid accuracy gains are masking a reliability plateau. Each new model release gets better at the average case while barely improving worst-case behavior. For production deployments, the worst case is what matters.

What Fortune’s Eye on AI Found in Practice

The research aligns with real-world experience. Fortune’s Jeremy Kahn reports that Perplexity’s Computer agent — using Claude Sonnet 4.6 as its reasoning engine — successfully booked a recycling center appointment but failed completely on flight research, burning 45 minutes of tokens with nothing to show.

At an Anthropic demo event in London, Claude Cowork struggled with basic Excel data sorting before effortlessly creating a complex budget forecasting model. Claude Code built a visually polished business strategy game whose underlying logic was nonsensical.

This is the signature pattern of unreliable agents: they succeed impressively on some tasks and fail spectacularly on superficially similar ones, with no way to predict which outcome you’ll get.

The Compounding Problem

The math gets worse as workflows grow. Each additional step multiplies the workflow’s overall success probability by that step’s reliability, so failures compound:

Steps | Per-Step Reliability | Workflow Success Rate
------|----------------------|----------------------
2     | 90%                  | 81%
5     | 90%                  | 59%
10    | 90%                  | 35%
20    | 90%                  | 12%

At 10 steps and 90% per-step reliability, your workflow fails nearly two-thirds of the time. The only way out is pushing per-step reliability toward 99%+ — which, based on current improvement rates, is years away for complex tasks.

A real-world example from the paper: three medical AI tools with individual accuracies of 90%, 85%, and 97% achieved only 74% combined reliability when chained together. One in four patients could be misdiagnosed.
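The arithmetic behind both examples — the uniform 90%-per-step table and the heterogeneous medical chain — is just multiplication of per-step success probabilities. A minimal sketch in Python:

```python
import math

def chain_success(per_step: float, steps: int) -> float:
    """Success probability of a workflow whose steps must all succeed."""
    return per_step ** steps

# Reproduce the table: 90% per-step reliability at various chain lengths.
for steps in (2, 5, 10, 20):
    print(f"{steps:>2} steps: {chain_success(0.90, steps):.0%}")

# Heterogeneous chain, as in the paper's medical example:
# three tools at 90%, 85%, and 97% individual accuracy.
combined = math.prod([0.90, 0.85, 0.97])
print(f"combined: {combined:.0%}")  # ~74%
```

The same function makes the “shorter chains” advice quantitative: halving the number of steps roughly squares-roots the failure exposure.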

Augmentation vs. Automation: Different Reliability Thresholds

The researchers make a crucial distinction: AI that augments humans needs less reliability than AI that replaces them.

When a human reviews every agent output, 85% reliability is useful — it saves time on most tasks and the human catches failures. But for fully autonomous agents running unattended workflows, the same 85% reliability means constant breakdowns.
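To put a number on “constant breakdowns”: under the simplifying assumption that tasks fail independently, the chance that an unattended 85%-reliable agent gets through a batch of tasks without a single failure collapses quickly. A sketch (the batch sizes are illustrative, not from the paper):

```python
def p_any_failure(reliability: float, tasks: int) -> float:
    """Probability that at least one of `tasks` independent runs fails."""
    return 1.0 - reliability ** tasks

# At 85% per-task reliability, an unattended agent working through a
# modest batch of tasks almost certainly breaks at least once.
for n in (1, 5, 20):
    print(f"{n:>2} tasks: {p_any_failure(0.85, n):.0%} chance of a failure")
```

With a human reviewing each output, those failures are caught one at a time; unattended, any one of them can derail the whole batch.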

As the paper puts it: “An agent that succeeds on 90% of tasks but fails unpredictably on the remaining 10% may be a useful assistant yet an unacceptable autonomous system.”

This maps directly to how most people should be using AI agents right now: human-in-the-loop for anything that matters.

What This Means for OpenClaw Users

If you’re running OpenClaw agents on production tasks:

Design for failure. Every multi-step workflow should have checkpoints where the agent can pause and a human can verify progress. Don’t chain 20 tool calls without validation.

Monitor consistency. Run the same task multiple times and compare outputs. If results vary wildly, the agent isn’t reliable enough for that task yet.

Keep humans in the loop. For high-stakes actions — sending emails, making purchases, modifying data — require human confirmation. The reliability math says your agent will eventually get it wrong.

Shorter chains, higher reliability. Break complex workflows into smaller segments. A 3-step workflow at 90% per step succeeds 73% of the time. A 10-step workflow at the same rate succeeds 35%. Fewer steps, fewer failures.

Track your agent’s failure modes. OpenClaw logs everything. Review failures periodically to understand where your specific workflows break down — then add guardrails at those points.
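The consistency check suggested above can be as simple as re-running a task and comparing outputs. A minimal sketch, assuming you collect the repeated outputs yourself — the `outputs` list below is canned, and any `run_agent(task)` helper you'd use to populate it is hypothetical, not an actual OpenClaw API:

```python
from collections import Counter

def consistency_rate(outputs: list[str]) -> float:
    """Fraction of runs that agree with the most common (modal) output.

    A rough proxy for the paper's consistency dimension: run the same
    task several times and see how often the agent converges.
    """
    normalized = [o.strip().lower() for o in outputs]
    modal_count = Counter(normalized).most_common(1)[0][1]
    return modal_count / len(normalized)

# Canned example; in practice, gather these from repeated identical runs.
outputs = ["Paris", "paris", "Lyon", "Paris"]
print(consistency_rate(outputs))  # 0.75
```

If the rate on a task sits well below 1.0 — recall even the best model measured only 73% — that task still needs a human in the loop.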

The industry will solve the reliability gap eventually. But the current generation of models — even the best ones — are reliably unreliable. Build your workflows accordingly.

Based on “Towards a Science of AI Agent Reliability” (arXiv:2602.16666) by Kapoor, Narayanan, et al., and Fortune’s Eye on AI analysis (March 24, 2026).