Why AI Code Generation Is Solved and Still Accelerating

Some people who are deep in the AI world have started saying that AI code generation is a solved problem. I agree.

Hold on, before you close the tab.

I’ll bet a lot of you just disagreed without reading another word. That reaction is exactly why I’m writing this. For the past two and a half years, I’ve been using AI for coding daily, almost seven days a week. I’ve spent over a year on Claude Code specifically. Thousands of hours of hands-on time. I teach engineers and engineering leaders how to adopt these tools. And what I’m watching every six months is the same conversation, with the goalposts moved.

This post is about agentic coding capability specifically. Not AI in general. Not whether AGI is near. Not consumer chatbots. Just: is AI good enough at writing software to call the problem solved, and is it going to keep getting better? My answer is yes and yes. Here’s why.

Here’s the path I’ll walk through. First, what “solved” actually means, and why most skeptics are working with a definition no human meets either. Then what I see every day in real production work, with the last 90 days of benchmark data as corroboration, not as the main argument. Then the three reasons coding capability is going to keep accelerating fast. Two are commercial. The third is recursive self-improvement: the idea that AI is now writing the very code that AI itself is built from. Then what to do about it as an engineering leader. By the end, the trajectory should be clear, and so should the bet you’re making if you keep saying “this is the ceiling.”

What “Solved” Actually Means

Let’s get the definition right, because the skeptics are working with one no human meets either.

The wrong definition: AI never makes mistakes. AI writes code exactly the way a senior engineer at my company would write it. AI handles every edge case I throw at it without prompting. By that standard, no human engineer clears the bar either. I’ve worked with thousands of engineers across my career. Every single one of them made mistakes. Sometimes they caught their own mistakes. Sometimes they didn’t, and the mistakes shipped to prod. None of them write code exactly the way another senior engineer would. None of them anticipate every edge case without context.

The right definition is much simpler. AI code generation is solved when, given the same context a competent engineer would need to do the work, it produces results as reliable as that engineer would. Context here is whatever a real engineer would need to ship the task: clear intent, the constraints that real engineering work always involves (language, framework, version, testing approach, architectural conventions), and the actual feature or bug to address. How that context reaches the model matters less than that it reaches the model. It can come through prompts, scaffolding, skills, conversation history, documented conventions, or the project’s existing code. AI doesn’t need to read your mind any more than a human engineer does. Give it what a human would need, and it produces work in the same league.

AI code generation is solved when, given the same context a competent engineer would need to do the work, it produces results as reliable as that engineer would. Any stricter bar is moving the goalposts to a place no human meets either.
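
To make “the same context” concrete, here is what it might look like for one task, sketched in Python. Everything in the sketch is hypothetical: the field names, the example task, and the build_prompt helper are illustrative, not any tool’s real format.

```python
# Hypothetical sketch of "the same context a competent engineer would need."
# The field names, the example task, and build_prompt are illustrative only.
task_context = {
    "intent": "Add rate limiting to the public /search endpoint",
    "constraints": {
        "language": "Python 3.12",
        "framework": "FastAPI",
        "testing": "pytest; new behavior needs unit and integration coverage",
        "conventions": "follow the existing middleware pattern in app/middleware/",
        "dependencies": "do not introduce any new third-party packages",
    },
    "scope": "60 requests per minute per API key; return HTTP 429 beyond that",
}

def build_prompt(ctx: dict) -> str:
    """Flatten the brief into one prompt; the same facts could equally live in a project conventions file."""
    constraints = "\n".join(f"- {key}: {value}" for key, value in ctx["constraints"].items())
    return f"{ctx['intent']}\n\nConstraints:\n{constraints}\n\nScope: {ctx['scope']}"
```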

There’s a pattern hiding inside the skeptic argument that’s the strongest piece of evidence I have. Every six months, the skeptics point to a different set of mistakes. The mistakes from a year ago aren’t being made anymore. The mistakes pointed to today won’t be made six months from now. That goalpost movement is the skeptic argument quietly conceding the point.

Why I Say AI Code Generation Is Solved

I’m not asking you to take this on faith. I’m telling you what I see every day.

Code quality is night and day from a year ago, and still night and day from six months ago. A year ago, the output required heavy review and rework. Six months ago, it required moderate review. Today the code is often shippable on the first pass when the plan is solid. I’m not exaggerating to make a point. I’m describing my workflow.

Instruction following has stepped up substantially. A year ago the model would drift from explicit instructions on longer tasks. Today it more often holds the spec, asks clarifying questions when constraints conflict, and respects negative instructions like “use this framework, not that one” or “don’t introduce a new dependency.” Does it always remember? No. Sometimes you’ve got to remind it. Sometimes you correct mid-stream. Humans do that too. The comparison is statistical, not absolute, and the drift rate now is dramatically lower than it was even six months ago.

Long-horizon focus is the most recent shift, and it’s the one that surprised me. With Opus 4.6 and especially 4.7, tasks that used to fall apart after thirty minutes of agentic work now hold together for multi-hour sessions far more reliably. The model maintains intent across many subtasks, picks the thread back up after a tool call, and stays on the plan. Not perfectly. But when the plan is detailed and approved up front, the model follows it the vast majority of the time; I’ll put a number on that in a moment.

End-to-end implementation quality is now coherent, not stitched. When the plan is solid, what comes out is far more often a single coherent implementation: production code, unit tests, and integration tests wired together as one piece of work, rather than three things produced in isolation and forced to fit.

Self-correction is now inside the loop. The model catches more of its own mistakes than it used to. It reviews its own code. It thinks through its answer before committing. A year ago this had to be scaffolded externally with explicit review steps. Today it happens on its own much more often, though not always. The comparison, again, is to a human reviewer, not to perfection.

One honest caveat the skeptics most often miss: none of this is the model alone. It’s the model plus the human, plus the scaffolding the human builds around the model. The teams who tell me Claude routinely goes off the rails are almost always teams that haven’t yet done the work to provide good context, build the right workflow, and develop the skills (literally, in the Claude Code sense) that keep the model on track. I’ve built such scaffolding for my own work, with planning gates, explicit human approval steps, and sub-agent reviews for security, code health, and plan-vs-implementation checks. With that in place, the model follows an approved detailed plan well over 90% of the time, which is a higher rate than I see from human engineers on equivalently complex tasks. The gap between “AI coding works great” and “AI coding doesn’t work” is, in my experience, almost entirely a gap in human skill and scaffolding.
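
For a sense of the shape, here is a heavily simplified sketch of that kind of scaffolding. Every function below is a stub standing in for either a model call or a human action; none of this is a real SDK, it just illustrates the control flow: plan, gate, implement, review.

```python
from dataclasses import dataclass

# Simplified sketch of the scaffolding described above: a planning gate,
# an explicit human approval step, and reviewer sub-agents.
# Every function is a stub standing in for a model call or a human action.

@dataclass
class Review:
    reviewer: str
    passed: bool
    notes: str = ""

def planning_agent(brief: str) -> str:
    return f"PLAN for: {brief}"                 # stand-in: agent drafts a detailed plan

def human_approves(plan: str) -> bool:
    return True                                 # stand-in: explicit human approval gate

def coding_agent(plan: str) -> str:
    return f"IMPLEMENTATION of: {plan}"         # stand-in: agent implements the approved plan

def review_security(impl: str) -> Review:
    return Review("security", passed=True)

def review_code_health(impl: str) -> Review:
    return Review("code_health", passed=True)

def review_plan_adherence(plan: str, impl: str) -> Review:
    return Review("plan_vs_implementation", passed=plan in impl)

def run_task(brief: str) -> str | None:
    plan = planning_agent(brief)
    if not human_approves(plan):                # planning gate: nothing runs unapproved
        return None
    impl = coding_agent(plan)
    reviews = [review_security(impl), review_code_health(impl), review_plan_adherence(plan, impl)]
    if all(r.passed for r in reviews):
        return impl                             # ready for final human review and merge
    return None                                 # in practice: feed the findings back and iterate
```

The specifics don’t matter. What matters is that the human designs the gates and the reviews, and the model does the work inside them.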

That’s the bar. The model meets it. By the working definition of solved, the problem is solved. I’m not predicting this will happen. I’m telling you it’s already happening, in production, on real codebases, every day.

The Benchmarks Corroborate It (One Data Point Among Many)

Now to the benchmarks. Up front: I’m a benchmark skeptic. Benchmarks tell one part of the story, never the whole story. I’m not pointing at the 90-day data to claim definitive proof of anything. I’m pointing at it as corroboration of what I’m already seeing in daily use. The deltas are worth showing for that reason and that reason only.

The last 90 days produced four major frontier coding-model releases: Anthropic Opus 4.6 on February 5, 2026, Anthropic’s research-tier Mythos on April 7 (gated to partners only), Anthropic Opus 4.7 on April 16, and OpenAI’s GPT-5.5 on April 23. Coding capability was the headline for every one of them.

Start with Anthropic’s two generally-available flagships. Opus 4.6 launched with 80.8% on SWE-bench Verified and 53.4% on the harder SWE-bench Pro (Anthropic). Roughly ten weeks later, Opus 4.7 scored 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro (Anthropic). That’s nearly eleven points on the harder benchmark in ten weeks, inside the 90-day window. The labs are now leading with the harder benchmark precisely because the older one is saturating. Saturation is what the late stage of solving a problem looks like.

OpenAI ran a parallel track in the same window. GPT-5.5 scored 82.7% on Terminal-Bench 2.0, narrowly beating both Opus 4.7 and Mythos (OpenAI).

The Mythos signal is the most interesting piece of recent evidence and the one almost no one’s talking about. On April 7, 2026, Anthropic released Mythos to a small set of partners, and publicly conceded that Opus 4.7 trails it (Anthropic). Why isn’t Mythos generally available? Anthropic concluded that its cybersecurity capabilities required restricted deployment. A frontier lab voluntarily withholding its highest-capability coding model is a fundamentally different kind of evidence than a benchmark number. It says capability is now ahead of what’s comfortable to ship broadly.

Honest caveats, because every release has trade-offs. Opus 4.7 regressed on long-context retrieval. Specifically, the MRCR v2 8-needle test at one million tokens fell from 78.3% on Opus 4.6 to 32.2% on Opus 4.7 (Vellum). Both labs are flagging each other for SWE-bench Pro memorization. GPT-5.5 API pricing roughly doubled relative to GPT-5.4. None of that adds up to “not solved yet.” It adds up to “a fast-moving, mature frontier with normal trade-offs at each release.” That, again, is what late-stage solving looks like.

And It’s Going to Keep Getting Better. Three Reasons.

If it’s solved, why am I still writing? Because solved isn’t the end of the story. The same forces that solved the problem are still pushing, and the next twelve months are going to look like the last ninety days, only more so. Three reasons it’ll keep getting dramatically better. One you’ve heard. One you probably haven’t. And a third, more practical reason almost no one talks about.

Driver One: Enterprise Revenue Is Where the Money Is

This is the surface story. It’s true. It’s also incomplete.

Anthropic generates roughly 80% of revenue from enterprise customers, and Claude Code crossed $2.5B in annualized run-rate within roughly a year of launch (Sacra). Anthropic has now passed OpenAI in total ARR, hitting around $30B (The AI Corner). They got there by making coding their differentiation.

OpenAI publicly pivoted in March 2026, in Axios’s framing, “from consumer hype to business reality” (Axios). Sora got scaled back, consumer experiments got shelved, resources got refocused on coding and enterprise ahead of a planned IPO. OpenAI’s own internal forecast acknowledges that consumer paid-WAU conversion will plateau at only 8.5% by 2030 (PYMNTS). The dominant consumer AI brand has internalized that consumer subscriptions cap structurally below ten percent. So they’re racing to where the money actually is: developer and enterprise coding workloads.

Two labs, different starting positions, identical convergence target. The commercial incentive to keep improving is locked in.

That’s the surface story. There’s a second driver, and it’s the bigger one.

Driver Two: Recursive Self-Improvement

Here’s the part most coverage misses, and the one I think matters most for your mental model of where this goes next.

The second reason the labs are racing on coding is recursive self-improvement. AI improving AI. The idea that a sufficiently capable AI can write, refactor, and optimize the code that AI itself is built from, accelerating its own progress in a way that humans alone can’t.

To see why this matters, look at what AI actually is. AI is made of code. Training pipelines are code. Evaluation harnesses are code. Model architectures are code. Agent scaffolding is code. The orchestration that turns a model into a product is code. So a lab that gets better at making AI good at code is getting better at making AI good at the very stuff AI itself is built from.

Why coding specifically and not any other domain? Because code has automatic verification. Tests pass or they don’t. Loss goes down or it doesn’t. Compilation succeeds or it fails. Every other domain (language, reasoning, design, science) needs human evaluation to close the loop. Coding is the only domain where AI can verifiably improve AI without humans in the loop.
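
To make the loop concrete, here’s a minimal sketch. The propose_patch function is a hypothetical stand-in for a model call that edits the working tree; pytest and git do the verifying, and no human is in the loop.

```python
import subprocess

# Minimal sketch of why code closes the loop on its own: the verifier is
# the test suite's exit code, not a human judgment. propose_patch() is a
# hypothetical stand-in for a model call that edits the working tree.

def propose_patch(goal: str) -> None:
    """Stand-in for the model editing files toward the goal."""
    ...

def tests_pass() -> bool:
    return subprocess.run(["pytest", "-q"]).returncode == 0

def improve_once(goal: str) -> bool:
    propose_patch(goal)
    if tests_pass():
        subprocess.run(["git", "commit", "-am", f"agent: {goal}"])  # keep the change
        return True
    subprocess.run(["git", "checkout", "--", "."])                  # discard the change
    return False
```

Swap the test run for a training-loss check and the same loop can drive an LLM training script, which is exactly the kind of demonstration described next.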

Even outside the labs, the loop is being demonstrated publicly. In March 2026, Andrej Karpathy, the former OpenAI co-founder and Tesla AI director, released a project called autoresearch: a stripped-down, single-GPU LLM training script that an AI agent reads, modifies, tests, and optimizes autonomously (GitHub). Over two days, the agent surfaced twenty additive improvements that transferred to larger models. The repo went viral, and dozens of public forks followed across different problem domains (Fortune).

Be clear about what this is and isn’t. Autoresearch isn’t recursive self-improvement running on a frontier-scale production training pipeline. It’s a research-scale demonstration of the loop running end-to-end on a teaching-sized model. But Karpathy’s framing was the moment the broader AI community caught up to a conversation the frontier labs have been having internally. Karpathy called it “the final boss battle” that all LLM frontier labs are now racing to fight.

Inside those labs, public signals from OpenAI specifically suggest the same kind of loop is now running on production-scale infrastructure. Two examples on the record.

Greg Brockman, OpenAI’s president, has publicly described using GPT-5.3-Codex internally to find bugs in OpenAI’s own training runs, manage rollout, and analyze evaluation results (Big Technology). That’s a frontier lab using its own coding model on its own production infrastructure. OpenAI is, in their own words, closing the loop in production.

OpenAI’s stated public roadmap goes further. Jakub Pachocki, OpenAI’s chief scientist, told MIT Technology Review in March 2026 that OpenAI’s research is “building towards automating scientific research,” with a specific timeline: an automated AI research intern by September 2026, and a fully automated multi-agent research system by 2028 (MIT Technology Review).

That’s OpenAI’s leadership saying, on the record, that they’re building AI to do AI research. The substrate they’re building it on is code.

The labs aren’t pouring resources into coding only because it’s the most lucrative product. They’re pouring resources into coding because it’s the one product category that improves the lab. Enterprise revenue is the runway. Recursive self-improvement is the prize.

Driver Three: Real-World Usage Feeds the Next Generation

There’s a third, more practical reason. It’s not the biggest. It’s almost never discussed.

The two largest commercial AI labs are now training their next-generation models on the real-world coding sessions of millions of consumer-plan users.

As of an August 28, 2025 policy update, Anthropic’s Free, Pro, and Max plans (including Claude Code on those tiers) feed training data by default, with retention up to five years. Users have to actively opt out. Team, Enterprise, API, Government, and Education plans are excluded (Anthropic; TechCrunch). OpenAI’s policy has the same shape. ChatGPT Free, Plus, and Pro (including Codex usage on personal workspaces) feed training by default, with Data Controls to opt out. API, Business, and Enterprise plans are excluded (OpenAI).

Think about what that data stream actually contains. Real coding tasks. Real bugs. Real edge cases. Real moments where the model produced bad output and the user followed up with a correction. Real moments where the model nailed something hard. Multiply by millions of developers, every single day.

Neither lab publicly attributes specific capability gains to consumer coding data. They don’t have to. The policies permit it, the incentives align, and the dataset of what real engineers actually struggle with when they use AI to write code exists nowhere else.

Every consumer-tier coding session is a feedback signal. The labs that own the consumer coding tools own an ongoing record of real-world coding work that no one else has, and it is exactly the data you’d want for training the next model.

The Honest Caveats

I’m not telling you AI takeoff is here. The argument doesn’t need that, and the evidence doesn’t support it.

I’m also not telling you AI is currently writing 90% of the code at any particular company. Capability and impact aren’t the same thing, and people conflate them constantly. Electricity existed long before wires were run to every home. The wires didn’t change what electricity could do. They determined whether electricity reached your house. Same with AI right now. The capability is here. Whether it’s reaching your team is a different question entirely.

Across the teams I work with, the two biggest missing wires are these. First, most teams haven’t accepted that an engineer’s primary role is now specification: patient, detailed articulation of intent, goals, and constraints in a way an AI agent can actually act on. Most engineers were trained to do the work, not to brief the worker. They haven’t put in the reps yet to be good at the new job. Second, most teams are retrofitting existing pre-AI workflows with AI tools sprinkled on top, rather than stepping back, reimagining how the work should flow when an AI agent is a core participant, and codifying new AI-native workflows the whole team uses consistently. If your organization isn’t seeing the value yet, the gap almost certainly isn’t in the model. It’s in the wires.

Ilya Sutskever, the former OpenAI chief scientist now running Safe Superintelligence, has been publicly cautious through the back half of 2025 and into 2026, arguing that the field needs a new learning paradigm and pointing to a puzzling gap between strong benchmark performance and limited economic impact (The Decoder). Sutskever isn’t Gary Marcus. Sutskever was right about scaling. The argument here doesn’t require hard takeoff. It requires that the coding-specific trajectory keeps moving, which the last ninety days confirm.

Anthropic’s own automated alignment researcher experiments offer a useful corrective. The published result: the approach worked well on math tasks (94% performance gap recovery) but didn’t generalize on a production-scale test on Sonnet 4 using real training infrastructure (Anthropic Alignment). The self-improvement loop is real in narrow settings. Generalization at frontier scale is still fragile.

Lab predictions on timelines run aggressive. Dario Amodei said in early 2025 that AI would write 90% of code within six months. Outside frontier labs, that didn’t materialize industry-wide (IT Pro). Adjust your confidence in lab timelines accordingly. And note again: Amodei’s claim was a capability prediction. Whether the industry adopts that capability at scale is, once more, a wires question.

The conclusion isn’t that AI takeoff is here. It’s that AI code generation is solved by any sensible working definition of solved, the incentive structure to keep accelerating that capability is uniquely strong because it now includes recursive self-improvement and real-world data feedback, and the practitioners who’ve actually run the wires are watching the curve bend in real time.

What This Means for Engineering Leaders

Four principles, drawn from working with teams across the last two and a half years.

The skill that compounds isn’t typing. It’s specification. The model can implement well now, but only when the plan is solid. Engineers who get good at defining what good looks like, in language a model can act on, compound. Engineers who stay in the typing layer don’t. This is the most important career calibration any engineer can make in 2026.

Stop arguing about whether it’s solved. Start designing for the world in which it is. Teams that have rebuilt their workflows around current-generation agentic coding (planning gates, integrated test generation, self-review loops, well-scoped specifications) are out-shipping teams that haven’t. The gap is widening every quarter.

Calibrate hiring, training, and architecture decisions to a twelve-month-forward view of capability, not a twelve-month-back snapshot. Engineering leaders running 2024-vintage assumptions on 2026 capability are leaving compounding gains on the table. I’ve been teaching teams how to adopt these tools since the first usable version dropped. The teams that compound are the ones that stop arguing with the trajectory and start designing for it.

Engineering leaders specifically: make AI-native non-negotiable, and put real budget behind it. This is where most engineering organizations are quietly stalling out. They let individual engineers experiment with AI tools, but they don’t step back as a leadership team and ask the harder question: what does our workflow actually look like when an AI agent is a core participant, not a side-tool?

Answering that takes real time and real money. It requires designing, codifying, and continuously refining AI-native workflows the whole team uses consistently. And it requires a message from the top that doesn’t waver.

We’re moving toward an autonomous, agentic software engineering world. The vision isn’t up for debate. The how is.

Teams define the how. Leaders insist on the what. AI-native is not optional.

Closing

AI code generation is solved by the only definition of solved that survives contact with reality. Code as reliable as a competent engineer’s, given the same context that engineer would need.

If you still disagree, ask yourself: by your definition, is human code generation solved? Because if your bar is “never makes a mistake,” no human ships either.

The labs aren’t done. They’re not even close to done. The mistakes the skeptics are pointing to right now? Six months from now, those won’t be made either. The labs aren’t just incentivized to make that happen. They’re now using AI itself to do it.
