They’re Failing Because You Set Them Up to Fail
The headlines are everywhere: AI coding tools don’t deliver. Pilots fail. Productivity gains are marginal at best.
If you’re a CTO or VP of Engineering, you’ve probably seen these numbers land on your desk, maybe in a board presentation, maybe from a skeptical peer. And they’re compelling enough to make you question whether your AI investment is worth protecting.
I’d argue you’re reading the data wrong. And the conclusions you’re drawing could cost your organization years.
The Numbers Everyone Is Citing (And What They Actually Tell You)
Let’s start with the data, because it’s real and it matters.
A longitudinal study by DX tracked a random sample of 400 companies from November 2024 through February 2026. The findings: AI tool usage increased by 65% on average, but pull request throughput (actual code shipping) increased by only about 10%. Their headline said it plainly: “AI productivity gains are 10%, not 10x.”
METR ran a randomized controlled trial with 16 experienced open-source developers working on real issues: bug fixes, features, refactors. Developers using AI tools actually took 19% longer to complete tasks. Perhaps more telling: those same developers believed they were 24% faster.
Google’s own internal measurements showed well-scoped coding tasks completed 21 to 36% faster with AI assistance, but overall engineering velocity increased by only about 10%.
And the 2025 DORA report documented what they called the “AI Productivity Paradox”: individual developer output increases, but organizational delivery metrics stay flat. AI amplifies both strengths and weaknesses.
These are credible studies. I’m not here to dismiss the data.
But I am here to tell you what the data can’t tell you, and why that gap matters more than the numbers themselves.
The Question These Studies Can’t Answer
Here’s the problem with drawing strategic conclusions from aggregate data across hundreds of organizations: no study at this scale can meaningfully assess how each organization actually adopted AI.
Think about what you’d need to know to determine whether a pilot “failed” because AI doesn’t deliver versus because the organization set it up to fail:
Did the team receive meaningful training, or were they handed a license and told to figure it out? Did the organization standardize on a single AI tool and workflow, or did every developer choose their own? Were pilot goals specific and measurable, or vague aspirations like “explore AI opportunities”? Was the team dedicated and protected, or were they squeezing AI learning into 10% of their time between production fires?
Did the organization assess which of their existing bottlenecks were rooted in human behaviors and decision-making, not technology? Did they re-engineer business processes to align with AI, or just layer AI on top of the same approval chains and handoff patterns that were already slowing them down?
And perhaps most importantly: did leadership commit to re-evaluating what it actually means to be an AI-native organization? Did they rethink their workflows, their team structures, their roles, their decision-making processes? Or did they simply retrofit AI into existing practices and hope for better numbers?
No survey of 400 companies can control for those variables at the depth required to make the distinction. What these studies measure is the average outcome across every adoption approach, from thoughtful, structured pilots all the way down to “we bought licenses and hoped for the best.”
That average is 10%.
But the average doesn’t tell you the ceiling. It tells you that most organizations are adopting AI badly.
The data doesn’t prove AI can’t deliver 10x. It proves most organizations don’t do what’s required to get there.
And here’s a truth that no study can account for: no technology, including AI, can solve an organization’s fundamental inability to make good, rapid decisions. If your bottleneck is a decision-maker who takes three weeks to approve a direction, or a stakeholder process that requires six sign-offs before work begins, AI won’t fix that. It will just make the waiting more expensive.
If you’re using these numbers to justify slowing down your AI investment, you’re drawing exactly the wrong conclusion. The right conclusion is that your approach needs to change.
What Fred Brooks Told Us in 1975
The answer to why most organizations get this wrong isn’t new. Fred Brooks laid the foundation fifty years ago in The Mythical Man-Month (Addison-Wesley, 1975), drawing on his experience managing IBM’s System/360 project.
Brooks’s Law is deceptively simple: “Adding manpower to a late software project makes it later.” Brooks himself called this “an outrageous oversimplification,” but the principle underneath it has held up for half a century.
The reasoning: the number of pairwise communication channels in a team of n people is n(n-1)/2. Five developers means 10 channels. Twelve developers means 66. Fifty developers means 1,225. Team size grows linearly; coordination cost grows quadratically.
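To make the quadratic growth concrete, here is a tiny, purely illustrative Python sketch that computes those channel counts:

```python
# Pairwise communication channels in a team of n people: n * (n - 1) / 2.
def channels(n: int) -> int:
    return n * (n - 1) // 2

for n in (5, 12, 50):
    print(f"{n:>2} developers -> {channels(n):>4} channels")
# Output:
#  5 developers ->   10 channels
# 12 developers ->   66 channels
# 50 developers -> 1225 channels
```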
AI magnifies this problem. AI-assisted developers generate more output: larger pull requests, more code, more artifacts. The number of channels stays quadratic, but the volume flowing through each one increases, so the coordination cost of reviewing, integrating, and maintaining that output grows even faster than Brooks predicted. Individual developers may feel more productive, but the team-level friction of inconsistent outputs, inconsistent handoffs, and far more material to review can negate those gains entirely.
When organizations deploy AI tools across large teams without standardized workflows, every developer becomes a node in a communication network that now includes unpredictable AI output. Different developers use different tools with different prompting approaches, producing output of inconsistent quality, all of it requiring review by humans who may themselves lack the training to assess what they’re reviewing.
But Brooks also proposed a solution. He advocated for what Harlan Mills originally called the “surgical team”: a small, focused team built around one chief programmer who holds the conceptual integrity of the system, supported by specialists. The chief programmer designs and decides. Everyone else, and in today’s terms every AI tool, supports that vision.
This idea was considered impractical for decades. The economics didn’t work. You couldn’t staff enough surgical teams to build large systems.
AI changes that equation. One small, well-trained team with standardized AI-native workflows can now produce output that previously required a team three to five times its size. But only if the team is structured for it.
Why Your AI Pilot Is Probably Failing
If your AI adoption looks like what most organizations are doing, you’re likely making one or more of these structural mistakes:
You’re adding AI to large teams instead of building small ones. Brooks showed us that coordination overhead is the silent killer of productivity. AI doesn’t fix that. It amplifies it. Ten developers each using AI their own way means ten different workflows producing inconsistent output that all needs cross-review. You haven’t reduced the bottleneck. You’ve multiplied it.
You’re letting developers choose their own adventure. Individual tool choice feels like empowerment. It’s actually fragmentation. When every team member uses a different AI tool with different prompting styles and different workflows, you lose the ability to standardize quality, share learnings, or build repeatable processes. The “freedom” costs you everything that makes AI-native workflows actually work.
You’re underinvesting in training. Most organizations treat AI adoption as a tooling decision: buy licenses, deploy, done. But the DX study found that the bottleneck isn’t code generation. It’s everything around it: handoffs, code review burden, and managing AI output. Those are skill gaps, not tool gaps. If your team doesn’t know how to structure prompts, design workflows, and critically evaluate AI output, faster code generation just moves confusion downstream faster.
You’re preserving role silos that no longer make sense. Traditional team structures, with dedicated QA, dedicated front-end, dedicated back-end, and a product manager who only writes requirements, were designed for a world where each role required deep, full-time specialization. AI changes that calculus. A QA engineer can now orchestrate agents that execute and verify code per a defined workflow. A product manager can use AI-native workflows to move from requirements into technical specifications. Small teams need generalists who can wear multiple hats, with AI handling the depth in each area. If your people are still confined to single roles, you’ll have team members waiting on handoffs while others are overwhelmed, regardless of how good your AI tools are.
You’re measuring the wrong things too early. If your pilot’s success metric is “measurable ROI within six months,” you’re optimizing for the wrong signal. The first phase of any serious AI adoption isn’t about productivity gains. It’s about building the muscle memory, workflow fluency, and institutional knowledge that enable productivity gains later. Measuring ROI before your team has learned to work differently is like measuring a runner’s speed during their first week of training.
You’re running too many pilots at once. Organizations that spread their investment across five or ten simultaneous AI pilots typically end up with none of them getting enough attention, training, or executive support to succeed. The result is predictable: none of them gain real traction, and leadership concludes that “AI doesn’t work here.” Concentration beats diffusion.
What 10x Actually Requires
I’ve spent over two years building and refining AI-native development workflows, not as a side experiment, but as my primary way of working. I’ve built Claude Code plugins that encode an entire software development lifecycle into repeatable, skill-driven pipelines with robust, consistent artifacts designed for both AI agents and humans, with clearly defined human decision gates.
Here’s what I’ve learned: 10x isn’t a fantasy, but it’s not a tool upgrade either. It’s an organizational transformation. And it has specific preconditions.
Small, focused teams. This is Brooks’s surgical team, updated for 2026. Two to four people who share a single way of working. Communication overhead drops from quadratic to minimal. Everyone knows the same tools, the same workflows, the same quality standards.
Deep training on one AI tool, not shallow familiarity with ten. Pick your AI coding assistant. Learn it deeply: its capabilities, its limitations, its primitives, its extension points. Build expertise that compounds over time. A team that knows one tool at depth will outperform a team that knows five tools superficially every single time.
Encoded, repeatable workflows with robust artifacts. Not “use AI to help with coding.” An actual pipeline where each step has defined inputs, outputs, and automation paths. Skills that capture requirements and produce structured artifacts. Commands that plan sprints and create issues. Automated loops that implement, test, and package work for review. The artifacts matter as much as the workflow: consistent, well-structured outputs that both humans and AI agents can consume, review, and build on. This is what separates AI-native from AI-assisted. The workflow itself is engineered, not improvised; a concrete sketch follows this list.
Humans at decision points, not in the production loop. The highest-value use of human attention in an AI-native workflow is making the decisions that only humans can make: What should we build? In what order? Is this good enough? Is this aligned with our architecture? Everything between those decision points can, and should, be handled by well-designed automated workflows. Every time a human does work that a skill or automation could handle, you’re burning your most expensive resource on your cheapest task.
Leadership that protects this way of working. Because the moment someone says “let’s add three more people to go faster” or “everyone should use whatever tool they’re comfortable with,” you’re back to 10%. This is an executive decision, not a developer preference.
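To make “encoded workflow” less abstract, here is a minimal sketch in Python. Every name in it (Step, PIPELINE, the artifact filenames) is a hypothetical stand-in for however your tooling encodes pipelines, not my actual plugin API. The point is the shape: each step declares its inputs, its outputs, and whether it runs automated or stops at a human decision gate.

```python
# Hypothetical sketch of an encoded workflow. Step, PIPELINE, and the
# artifact names are illustrative stand-ins, not a real plugin API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    name: str
    inputs: tuple[str, ...]   # artifacts this step consumes
    outputs: tuple[str, ...]  # artifacts this step produces
    automated: bool           # True: an AI agent runs it; False: human decision gate

PIPELINE = (
    Step("capture_requirements", (), ("requirements.md",), automated=True),
    Step("approve_scope", ("requirements.md",), ("scope_decision.md",), automated=False),
    Step("plan_sprint", ("scope_decision.md",), ("issues.md",), automated=True),
    Step("implement_and_test", ("issues.md",), ("patch.diff", "test_report.md"), automated=True),
    Step("review_and_ship", ("patch.diff", "test_report.md"), ("release_notes.md",), automated=False),
)

def validate(pipeline: tuple[Step, ...]) -> None:
    """Every step's inputs must be produced by an earlier step."""
    produced: set[str] = set()
    for step in pipeline:
        missing = set(step.inputs) - produced
        if missing:
            raise ValueError(f"{step.name} is missing inputs: {sorted(missing)}")
        produced.update(step.outputs)

validate(PIPELINE)  # fails fast if a handoff artifact was never defined
```

The validation is the design choice that matters: if a handoff artifact is never produced, the pipeline fails at definition time, not mid-sprint, and the human gates are explicit in the definition rather than negotiated ad hoc.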
The Maturity Model Nobody Talks About
Most organizations are stuck arguing about whether AI delivers 10% or 20%, and they don’t realize the argument itself reveals where they are on the maturity curve.
Level 0: AI-Assisted (where most organizations are). Individual developers use AI tools with no standardized workflow. Different tools, different approaches, different quality. This is what the studies measure. And here’s why the numbers are so modest: even when individual team members are brilliant with AI and personally feel more productive, the team as a whole doesn’t go faster. Inconsistent tools produce inconsistent artifacts. Inconsistent artifacts create inconsistent handoffs. Inconsistent handoffs demand extra collaboration to reconcile. Everyone is doing their own thing, and the team’s output converges to the speed of its coordination, not the speed of its fastest members.
Level 1: AI-Native (what intentional teams are building). Small teams with standardized AI-native workflows, encoded processes, and human decision gates. The entire lifecycle is redesigned, not just coding, but requirements, planning, implementation, testing, and review. This is where 10x becomes real, because you’re compressing the whole process, not just the coding slice.
Level 2: AI-Autonomous (where this is heading). AI agents handle entire feature implementations: planning, coding, testing, and packaging for final human review. The human role shifts from checkpoint approver to supervisor across parallel workstreams. The multiplier at this level isn’t about speed on one project. It’s about how many concurrent workstreams one person can effectively oversee.
If your organization is debating Level 0 numbers while your competitors are building Level 1 teams, you have a strategic problem that no amount of pilot data will solve.
How to Actually Get There
If any of this resonates, here’s what I’d recommend:
Start with one small, dedicated team. Not ten pilots. Not a cross-functional task force. One team of two to four people who are committed, curious, and willing to change how they work. Protect their time. This is their primary job, not a side project.
Run no more than two hyper-focused pilots. Each pilot should have a clear scope, a specific AI tool, and specific learning goals, not delivery targets. A pilot’s purpose is to build capability, not to ship a production product under deadline pressure. Expecting a pilot team to simultaneously deliver a solid, on-time production release creates exactly the wrong incentives.
Define what capabilities the team should have at the end of the pilot: Can they articulate a repeatable workflow? Can they critically evaluate AI output in your specific domain? Can they teach what they’ve learned to the next team? Those are the outcomes that matter. The team should emerge with captured knowledge and demonstrated capabilities they can transfer to the rest of the organization.
Standardize early. Pick one AI tool. Design one workflow. Document everything. The goal is repeatability, so that what the team learns can eventually scale to other teams. If every team member is experimenting independently, you’re generating anecdotes, not institutional knowledge.
Invest in real training. Not a one-hour overview. Not a vendor demo. Dedicated time to learn the tool’s primitives, practice structured workflows, and build the judgment required to evaluate AI output critically. This is the training most organizations skip because “everyone’s too busy.” It’s also the investment that determines whether your adoption returns 10% or 10x.
Measure learning first, then output. In the first 30 to 60 days, track: How quickly is the team iterating on their workflow? Are they building reusable artifacts, like templates, skills, and commands? Can they articulate what the AI is good and bad at in their specific context? These are the leading indicators that predict whether productivity gains will follow. The delivery metrics come later. If you optimize for delivery before the team has learned to work differently, you’ll get exactly the 10% the studies predict.
The Real Question
Every study I’ve cited in this post measures what happens when organizations adopt AI without fundamentally changing how they work. The results are consistent: modest gains at best.
But those results don’t define the ceiling. They define the floor.
10x with AI is real. I’ve seen it. I’m living it. But it requires real organizational commitment: small teams, deep training, standardized workflows, generalist roles, protected time, and leadership willing to rethink processes and team structures from the ground up. Most organizations aren’t willing to make those changes.
If you’ve read this far, the question isn’t whether AI can deliver 10x for your organization.
The question is whether your organization is willing to do what that requires.