The Benchmark Every Engineering Leader Should Worry About

Anthropic's Claude Mythos scored 93.9% on SWE-bench Verified, the benchmark built from real GitHub issues in production codebases like Django and Flask. Top models in 2024 were resolving 40 to 55% of these same tasks. In roughly a year, the ceiling moved from "still figuring it out" to "nearly solved."

I read the breakdown of what changed, and my first reaction wasn't excitement. It was a question: if a model reads a codebase, finds a bug, writes the fix, and passes the tests at a 94% success rate, what am I paying a junior engineer to do?

Ten years ago the answer was obvious. Not anymore.

What 94% Measures, and What It Doesn't

Here's the part most headlines skip. SWE-bench tests whether a model resolves a defined, scoped issue in an existing repository with tests already written. It does not test whether the model decides which forty features to cut from a roadmap. It does not test whether the model tells a founder their pet feature is a mistake. It does not test whether the model notices a security hole nobody asked about.

Those are the jobs of an engineer with judgment. Judgment happens to be the one thing most technical interviews don't measure.

The Interview You're Still Running Measures the Wrong Thing

Ask yourself when your last hire solved a whiteboard algorithm problem in production. For most engineering leaders, the honest answer is never. Yet the standard interview loop is still built around inverting binary trees and reversing linked lists under a countdown clock.

Crosschq Data Labs studied this directly. Their research found that only 9% of traditional interview scores correlate with quality of hire. Nine percent. We put candidates through hours of prep, stress, and rejection for a signal that barely registers.

Now add AI to the mix. A candidate solving a LeetCode problem with an AI assistant open in another tab isn't cheating anymore. In plenty of shops it's the job. Meta and Google have both started running interview rounds where the AI assistant is built in, and the question shifts from "did you get the right answer" to "did you prompt it well, catch its mistake, and know when to override it." That is a different skill from the one most interview loops test, and most loops haven't caught up.

The Real Cost of Getting This Wrong

Recent computer science graduates are feeling this first. Unemployment for new CS grads sits around 6.1%, well above the overall US rate of 4.3%, according to CIO's reporting on the junior developer market. Companies are cutting entry-level coding roles because AI now handles the boilerplate, the bug fixes, and the test scripts junior engineers used to cut their teeth on.

I get the short-term math. If a model resolves 94% of scoped issues, why pay a person to do the same job slower? Here's the problem with this reasoning: junior engineers were never hired only to fix bugs. They were hired to become senior engineers. Cut the entry point and you don't save money, you starve your own pipeline. In five years you'll be short of the exact mid-level engineers you need, because you stopped growing them today.

I've watched companies make this mistake with layoffs before. They cut the people closest to the work to protect a quarter's numbers, then spend three years rebuilding capability they threw away. An AI-driven hiring freeze is the same trade with better PR.

I've Sat on Both Sides of This Table

I've built engineering teams and I've sat across from candidates deciding whether to bet a career on my company. The best hires I ever made weren't the ones who solved the puzzle fastest. They were the ones who asked a question I hadn't thought to ask, or who pushed back on a requirement because the requirement was wrong.

None of that shows up on a leaderboard. It shows up in a design review, six months in, when the wrong assumption would have cost real money. A model at 94% on SWE-bench doesn't push back on your requirements. It solves the ticket you gave it, even when the ticket is wrong.

I've also watched the opposite play out: a brilliant coder who couldn't sit across from a customer and admit that the feature they built missed the point. Talent at the keyboard and judgment in the room are two different hires, and most interview loops only screen for one of them.

What to Hire For Now

If the model resolves the well-defined ticket, your engineers need to own everything before and after it: deciding what the ticket should say, reviewing whether the fix is correct and not only plausible, and owning the consequences when it isn't.

Your hiring process should test for:

  • Code review, not code writing. Show a candidate a pull request, some of it AI-generated, and ask them to find what's wrong with it. This sits closer to the daily job than any algorithm puzzle ever did.
  • Judgment under ambiguity. Give them a half-specified problem and watch how they ask questions, not how fast they type. The engineers worth hiring narrow the problem before they touch a keyboard.
  • Systems thinking. Ask how a change ripples through a codebase they've never seen, rather than whether they memorized a sorting algorithm from a textbook.
  • Communication. The engineer who explains a tradeoff clearly to a non-technical stakeholder is worth more than the one who inverts a tree blindfolded without knowing why anyone would want it.
  • How they handle being wrong. Show them a bug in their own sample code, live, and watch how they react. Defensiveness is a bigger red flag today than a slow solve time ever was.
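
To make the code review item concrete, here is a minimal sketch of the kind of exercise I mean. Everything in it is hypothetical: the ticket, the function name, and the test are invented for illustration, not taken from any real codebase or benchmark. The "fix" does exactly what the ticket asked and passes the one test attached to it. The question for the candidate is what it still gets wrong.

```python
# Hypothetical review exercise. All names and the ticket itself are invented.
# Ticket: "Discounts above 100% produce negative prices at checkout."


def apply_discount(price_cents: int, percent: int) -> int:
    """The submitted fix: clamp the discount at 100%, as the ticket asked."""
    if percent > 100:
        percent = 100
    return price_cents - price_cents * percent // 100


def test_discount_over_100_is_clamped() -> None:
    # The only test attached to the ticket. It passes.
    assert apply_discount(1000, 150) == 0


def reviewer_probe() -> int:
    # What a careful reviewer tries next: nobody asked about negative
    # discounts, and the fix lets one raise the price instead of lowering it.
    return apply_discount(1000, -50)


if __name__ == "__main__":
    test_discount_over_100_is_clamped()
    print(reviewer_probe())  # 1500: a 1000-cent item now costs more
```

A candidate who stops at "the test passes" has reviewed the ticket. A candidate who asks what happens below zero has reviewed the fix.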

None of this is new advice. Good engineering leaders have said versions of it for years. What's new is the cost of ignoring it. When the model handles the mechanical work at 94%, the mechanical interview stops measuring anything real.

The Uncomfortable Truth

I don't think AI replaces engineers. I think it replaces the version of the job that never should have been the whole job in the first place: rote implementation, disconnected from judgment, context, or consequence.

If your interview process still rewards the person who grinds three hundred LeetCode problems over the person who catches a bad assumption in a design doc, you're not testing for the job AI left behind. You're testing for the job AI already does better than any candidate walking through your door.

Redesign the interview before the market forces your hand. It's cheaper now than it will be in a year. Every quarter you keep the old loop running, you hire more people who are strong at a skill the model already owns, and fewer people who are strong at the skill the model still lacks.

I'd rather build a team of five engineers who catch the model's mistakes than a team of fifteen who don't know the difference between a fix that passes tests and a fix that is right for the business.

What does your engineering hiring process test for right now? When was the last time it changed?