Two headlines from the same era of AI development.

First: GitHub's research found AI coding assistants helped developers complete a task 55% faster. They ran a controlled experiment, measured how long it took to implement an HTTP server in JavaScript, and published the numbers widely.

Second: A non-profit called METR ran a randomized controlled trial with 16 experienced open-source developers, who completed 246 real tasks on large, complex codebases. The developers predicted AI tools would cut their time by 24%. Instead, they took 19% longer.

Two studies. Opposite conclusions. Which one is right?

Both of them. That's the problem.

[Image: a developer working late into the evening, surrounded by AI chat windows and empty coffee cups]

The Study Everyone Quotes

The GitHub Copilot study gets referenced constantly in vendor pitches and board decks. It has the kind of stat that makes executives sit up: 55% faster. Who doesn't want that?

But look at what they measured. One specific task: implement an HTTP server in JavaScript. No legacy code. No unclear requirements. No waiting on a pull request review from a colleague who is off sick. No architectural debates about whether this should be a microservice or not. No debugging a flaky test that worked yesterday and broke today for no apparent reason.

Real software development looks nothing like that task.

When you are deep in a mature codebase, the problem is not "write code." It is "figure out which of these 12 interconnected services is responsible for this bug," or "understand why the previous team made this decision before you change it and break everything downstream."

The 55% figure measures one isolated task well. It tells you almost nothing about your engineering team's actual output in production.

The Study No One Wants to Talk About

The METR study is uncomfortable reading. Sixteen experienced open-source developers. 246 real tasks. Their own repositories. Randomized assignment to AI-allowed and AI-forbidden conditions.

The result: developers using AI tools took 19% longer to finish their tasks.

Not a little slower. Measurably, statistically slower. And they did not see it coming. Before the study, they predicted AI would make them 24% faster. The gap between expectation and reality was 43 percentage points.

Why the slowdown? The researchers identified two main culprits. First, experienced developers spent significant time writing prompts, waiting for responses, and reviewing AI-generated output. That overhead ate into whatever time the AI saved on raw typing. Second, AI tools struggled with the complexity of mature codebases: systems too large, too entangled, and too context-heavy for the model to navigate accurately.

The researchers were careful not to generalise. Other studies do show productivity gains, and AI capabilities are improving fast. But this study punctures the assumption that AI speed gains are automatic.

Here is the part no one is discussing: the developers in the METR study felt faster. They predicted a 24% speedup because they genuinely believed they were being more productive. That gap between feeling productive and being productive is where things get dangerous for your organisation.

[Image: a person standing puzzled in front of two charts, one bar soaring and one barely registering]

The Real Problem Is Upstream

The GitHub study and the METR study are measuring different things. The deeper problem is that most companies have not thought carefully about which thing to measure.

Typing speed is not productivity. Lines of code are not productivity. Task completion time, measured in isolation, is not productivity.

Productivity is outcomes. Did the feature ship? Did it work? Did customers use it? Did the team understand the code three months later when something broke? Did the business metric it was supposed to move... move?

When you measure AI productivity by "how fast did developers write this code," you are measuring the wrong variable. You might as well measure the speed of hammer blows without asking whether the house got built.

The ISG State of Enterprise AI Adoption report for 2025 shows this playing out at scale. Only 1 in 4 AI initiatives is achieving expected ROI on growth. Only 50% are hitting expected efficiency gains. Companies are spending an average of $1.3 million per organisation on AI initiatives, and most are not seeing the results they planned for.

That is not a technology problem. It is a measurement problem.

[Image: a magnifying glass held over a dashboard, revealing it is measuring keystrokes and clicks rather than outcomes]

What You Should Measure Instead

If you lead engineers and want to know whether AI is helping, stop asking "are people using it?" and "do they feel faster?" Ask these instead:

Cycle time from idea to production. Is the time from ticket creation to deployed feature getting shorter? This captures everything: requirements clarity, code quality, review speed, and deployment reliability. A single number that reflects whether your whole pipeline got better, not one person's typing speed.

Defect rate on AI-assisted code. Are features built with AI assistance generating more bugs than code written without it? Fewer? This is the number that tells you whether the faster code is also good code. Speed into production followed by a rash of incidents is not a win.

Review throughput. If AI generates code faster, reviewers need to keep pace. Are they keeping up, or are they becoming the new bottleneck? If your team writes code 30% faster and reviews 0% faster, you have not gained 30%.
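The arithmetic behind that claim is worth making concrete. Delivery works like a pipeline, and a pipeline ships at the rate of its slowest stage. A minimal sketch, using illustrative rates that are not drawn from any study:

```python
# A two-stage pipeline (write -> review) ships at the rate of its
# slowest stage. All numbers below are hypothetical.
baseline_write_rate = 10.0               # tasks per week before AI
write_rate = baseline_write_rate * 1.3   # code-writing is now 30% faster
review_rate = 10.0                       # review capacity is unchanged

# Finished work cannot ship faster than it can be reviewed.
shipped = min(write_rate, review_rate)
print(shipped)  # 10.0 -- the 30% writing speedup never reaches production
```

The extra three tasks per week do not disappear; they pile up as a growing review queue, which is exactly the new bottleneck to watch for.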

Team-level delivery, not individual speed. One developer writing code 50% faster matters little if the team's overall throughput stays flat. Look at what the whole team ships per sprint. That is the number the business cares about.

These metrics are harder to collect than "tasks per hour." That is why most companies skip them. But they are the only ones that tell you whether AI is genuinely helping your business or whether you are buying a fast hammer for a house you are not finishing.
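To show these metrics are computable, not just aspirational, here is a minimal sketch. The ticket records and field names (`created`, `deployed`, `ai_assisted`, `defects_within_90d`) are hypothetical and not tied to any particular issue tracker's API; the point is that cycle time and defect rate fall out of data most teams already have.

```python
from datetime import datetime
from statistics import median

# Hypothetical shipped-ticket records; field names are illustrative.
tickets = [
    {"created": datetime(2025, 3, 1), "deployed": datetime(2025, 3, 8),
     "ai_assisted": True,  "defects_within_90d": 2},
    {"created": datetime(2025, 3, 2), "deployed": datetime(2025, 3, 6),
     "ai_assisted": False, "defects_within_90d": 0},
    {"created": datetime(2025, 3, 3), "deployed": datetime(2025, 3, 12),
     "ai_assisted": True,  "defects_within_90d": 1},
]

def cycle_time_days(ts):
    """Median days from ticket creation to deployed feature."""
    return median((t["deployed"] - t["created"]).days for t in ts)

def defect_rate(ts):
    """Defects logged within 90 days, per shipped ticket."""
    return sum(t["defects_within_90d"] for t in ts) / len(ts)

ai = [t for t in tickets if t["ai_assisted"]]
non_ai = [t for t in tickets if not t["ai_assisted"]]

print("cycle time (AI-assisted):", cycle_time_days(ai), "days")
print("cycle time (unassisted): ", cycle_time_days(non_ai), "days")
print("defect rate (AI-assisted):", defect_rate(ai))
print("defect rate (unassisted): ", defect_rate(non_ai))
```

The comparison by `ai_assisted` cohort is the key design choice: it turns "are we faster?" into "are we faster at shipping working software?", which is the question the section argues you should be answering.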

The Questions to Ask Your Vendors

The next time an AI vendor shows you a productivity statistic, ask three things before you do anything with it.

What was the task? Single-task benchmarks on simple, isolated problems tell you almost nothing about real-world performance on complex, mature systems. An HTTP server built in isolation is not your codebase.

Who were the developers? The METR study found experienced developers on complex codebases got slower. Less-experienced developers on greenfield code tend to see faster results. These are different situations. Know which one your team resembles before you assume the headline applies to you.

What happened after the code was written? Speed of writing is the beginning of the process, not the end. Code quality, defect rates, and how maintainable the output is downstream are where the real cost lives.

If the vendor cannot answer those questions, you are looking at marketing material dressed up as research.

Get the Metrics Right, Now

AI tools will improve. The gap between the METR results and the GitHub results will narrow as models get better at understanding large, complex codebases. That is a reasonable expectation.

The problem is the measurement framework companies are locking in right now. If you build your AI productivity story on proxy metrics today, you will keep chasing those proxies long after anyone still believes they mean anything. Meanwhile, the actual outcomes you care about (faster, higher-quality software delivery, fewer production incidents, better team retention) will not move.

You have an opportunity to set this up correctly before the industry standardises on bad metrics. Define what outcomes matter in your organisation. Measure those. Let the tools earn their place by moving the numbers that count.

What are you using to track AI's impact on your team? If the answer is adoption rate or developer satisfaction surveys, that is a start. What would it take to add a defect rate or a cycle time metric alongside it?