AI Can Debug Complex Systems, But Only If You Give It the Right Context


The hidden cost center in software engineering: unpredictable failures

In most engineering organizations, software failures (new regressions or production incidents) represent one of the largest unplanned drains on capacity.

AI promised to eliminate debugging toil. In practice, when a bug is hard to reproduce and non-deterministic, most AI coding assistants still guess.

Senior leaders in software engineering feel the pain:

  • Missed roadmap commitments and release delays
  • Unpredictable delivery
  • Slow Mean Time to Resolution (MTTR)
  • Customer escalations
  • Loss of productivity

Debugging consumes an enormous amount of costly senior engineering time, preventing teams from progressing at the pace they should.

Experimenting on a real bug: automating root cause analysis with AI

Current AI coding assistants help generate boilerplate, suggest fixes, review pull requests, explain legacy systems, and even draft architecture diagrams. But ask one to debug a real crash in a complex codebase… and that’s where the illusion cracks.

We put two leading AI coding agents – Claude Code and Codex CLI – to the test on a real, large open source codebase to see how they performed.

Experiment details

We ran multiple experiments taking a real, production-level crash in a large open source debugger called GDB. The crash was related to a complex use-after-free bug that manifested only under specific conditions and whose root cause was deeply buried in a mature, multilayered C/C++ codebase.

Experiment 1

Rather than ask a human to reproduce and debug the issue (often a multi-day effort), we recorded the exact execution of the failing run (using Undo’s time travel debugging technology), capturing every memory change, thread event, system call and variable state from start to crash.

We then fed the Undo recording of the program’s execution to Claude and Codex CLI, prompting them simply with: “Why did GDB crash?”

Results: Both agents correctly identified the root cause on the first attempt, with explanations that matched the eventual patch.

Experiment 2

Now could the agents do the same without Undo?

We removed Undo from the equation and asked Claude and Codex CLI to debug the issue by giving them the ability to modify GDB, add logging, compile and run.

Results: Far less effective on both fronts – neither agent managed to get to the root cause on the first try. It took a total of four attempts for Claude to finally arrive at the correct explanation. Codex CLI reached an incorrect answer while taking an unacceptably long time…

Experiment 3

We then tried a different approach: we asked the agents to debug by source inspection and we gave them the exact sequence of operations that caused the crash.

Results: On the first attempt, both AI assistants came up with a wrong and remarkably similar explanation. By repeating this experiment, they sometimes got to the correct answer – but it’s hit and miss.

Success rate comparison

Here’s how the various approaches compared for root-causing the bug in a single attempt:

| Approach | First-attempt success | Attempts needed | Token cost |
| --- | --- | --- | --- |
| AI coding assistants + Undo recording | Yes | 1 | $1.21 |
| AI coding assistants + compile/run | No | 4 | $7.96 |
| AI coding assistants + source inspection + hints | Inconsistent | 1–N | $0.39 |

When AI has the full execution context (via an Undo recording), it drastically outperforms both AI working from source plus compile/run and traditional debugging workflows in speed and accuracy.

With time travel debugging, both agents could reliably trace through the execution, examine memory at any point, and understand the precise sequence of events leading to the crash. Without it, they resorted to educated guessing, even when given significant hints.
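For a feel of what “examine memory at any point” looks like in practice, GDB’s own built-in process record supports a scaled-down version of the workflow (Undo’s recordings make it practical on real workloads). A typical reverse-debugging session for a use-after-free looks roughly like this, where `addr` stands in for the corrupted address:

```
(gdb) record full            # start recording execution
(gdb) continue               # run forward until the crash
(gdb) watch -l *(long *)addr # watchpoint on the memory read after free
(gdb) reverse-continue       # run backwards to the last write to it
(gdb) backtrace              # see who freed the object still in use
```

Running backwards from the crash to the offending free replaces guesswork with a direct chain of cause and effect.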

View the full experiment details

Slow AI adoption due to lack of trust

Engineers want facts, not smoke and mirrors.

Today’s AI agents are probabilistic. AI making things up (aka the LLM hallucination problem) is one of the major causes of adoption reluctance.

Undo closes that gap by anchoring AI reasoning to an immutable execution record.

Slow progress

If engineers can’t rely on their AI coding assistants to debug complex issues, they’ll continue using decade-old debugging techniques that slow down development.

Broken releases

Even worse than slow progress, the huge backlog of tickets remains unresolved and new bugs get introduced. The risk of shipping a broken product significantly increases.

Spiraling token costs

In our tests, one agent spent $7.96 and four rounds of hand-holding to eventually stumble near the right answer. Another burned 58× more tokens than the Undo-assisted run – and still got it wrong.

Here’s the full data set on costs, based on Claude’s built-in cost tracking (sorted from cheapest to most expensive):

| Approach | Cost | Result / Notes |
| --- | --- | --- |
| Claude Code: “don’t do logging” suggestion | $0.04 | “Just don’t use that feature” is not a viable bug fix; at least it failed quickly and cheaply! |
| Claude Code: source inspection, wrong answer | $0.39 | Confidently incorrect |
| Claude Code with Undo AI | $1.21 | Correct on the first attempt 🎯 |
| Claude Code: compile + run, 1st attempt | $2.05 | No explanation of the actual bug and a hacky workaround for the crash |
| Claude Code: compile + run, 4th attempt | $7.96 | ⚠️ Eventually correct, but expensive and time-consuming, requiring multiple hints from me (who already knew the answer) |

Looking at these experiment results, investing in tooling and engineering practices that provide full execution context (not just logs or traces) seems like a no-brainer.

Takeaway

AI doesn’t fail at debugging because it’s dumb – it fails because it’s blind.

Undo removes that blindness. By capturing exact program execution, we turn debugging from a statistical guessing game into a deterministic investigation. The payoff?

  • Lower AI costs (between 1/4 and 1/58th of the token spend in our tests)
  • Faster MTTR (first-attempt success instead of iterative flailing), meaning more predictable delivery
  • More tickets closed and a lower risk of shipping a broken product

The next leap in AI-assisted engineering won’t come from smarter models. It’ll come from smarter context. At Undo, we’re focused on that context. Not with more parameters – but with a recording of a program’s execution which provides the ultimate ground truth.

And because this is a genuine, public bug in GDB, we’ll be submitting the correct patch upstream – proving this isn’t a lab trick, but a production-grade capability.

 

Want to explore how Undo AI can help your team reduce MTTR and deliver real productivity improvements across your SDLC?

Get in touch
