What is Reverse Debugging and Why Do We Need It?
In a world increasingly run by ever more complex software, failures caused by software bugs have never been more visible or high profile.
Finding and fixing software bugs faster in a more predictable and productive way has become essential to developers – not to mention organizations which need to deliver more complex software in shorter timeframes.
This technical paper sets out to explain how traditional approaches to debugging struggle to cope with the scale and complexity of today’s software. It weighs up the potential benefits and drawbacks of debugging tools currently on the market, and makes the case for reverse debugging (a.k.a. time travel debugging): what it is and why serious programmers should care.
Challenges of debugging
As software grows more complex and the world grows more dependent on it, debugging is moving from being an inconvenience to a major problem for software companies, for both commercial and technical reasons. Delays in shipping code caused by bugs push back product release dates and directly impact company productivity. While tools exist to help developers prevent bugs when writing code, there has been little innovation in development tools that help locate and fix bugs once they have surfaced.
Applications are becoming larger, more complex and more heavily multi-threaded, with more developers working on each of them, which makes tracking down bugs correspondingly more difficult and unpredictable. Multi-threaded programs lengthen the time between the root cause of a bug and its detection, and make bugs less deterministic and harder to reproduce.
As software increases in complexity, debugging is taking up more developer time and becoming vital for brand protection. A 2013 study from the Judge Business School of the University of Cambridge, UK, found that the global cost of debugging software has risen to $312 billion annually, half of which ($156 billion) is spent on wages.
The study identified that developers spend 50% of their development time fixing bugs or making code work, rather than designing or writing new code. The vast majority of debugging time is spent locating the bug – once it has been found, correcting it is normally relatively simple. As Brian Kernighan, co-author of the first book on C, wrote:
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
Traditional debugging tools and techniques
So how can developers make themselves smarter? There is a range of options and approaches available to help them. These can be classified into three groups: programmatic techniques, special-case diagnosis and analysis tools, and general-purpose debuggers.
Programmatic techniques
Developers modify or write their program in a way that helps them find bugs. Techniques include print statements, assertions and test suites.
Special case diagnosis/analysis tools
These automated tools (such as Coverity, Purify and Valgrind) detect the most common classes of bug (memory access violations, use of unallocated memory, or potential deadlock conditions, for example). While they help with particular types of error, they are not comprehensive enough to give total coverage. If your bug doesn’t fit neatly into one of these categories, such tools offer no help. Even instances of very common bugs can elude them: the recent Heartbleed bug was a very common form of bug (a buffer over-read), yet every commonly-used detection tool failed to spot it.
General purpose debuggers
When bugs cannot be found with special case diagnosis tools, many programmers turn to general-purpose debuggers, such as GDB. These let the programmer step forwards, inch by inch, through their code and set watchpoints as they go.
However, debugging involves thinking backwards, as Brian Kernighan and Rob Pike point out in their book The Practice of Programming:
Reason back from the state of the crashed program to determine what could have caused this. Debugging involves backwards reasoning, like solving murder mysteries. Something impossible occurred, and the only solid information is that it really did occur. So we must think backwards from the result to discover the reasons.
Therefore, to be really useful, a debugger needs to help the programmer walk through the program’s execution backwards. Developers need a different approach, and this is where reverse debugging comes in.
Time travel debuggers enable developers to record all program activities (every memory access, every computation, and every call to the operating system) and then rewind and replay to inspect the program state. This colossal amount of data is presented via a powerful metaphor: the ability to travel backward in time (and forward again) to inspect the program state. Essentially they enable developers to solve the murder mystery by letting them rewind their code to walk backward, as well as forward, through the program. Take an example use case of tracking down some corrupted memory. With a time travel debugger, a developer can simply put a watchpoint on the variable that contains bad data, and run backwards to go straight to the line of code that most recently modified it. Bugs that would take a very long time to track down can be found in minutes.
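The corrupted-memory example can be illustrated with a toy model. This is only a sketch and says nothing about how a real debugger stores its recording: the `Write` record, the `last_writer` helper and the file names are all hypothetical. In a GDB-compatible debugger the corresponding interactive steps would typically be the `watch` and `reverse-continue` commands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    """One recorded memory write (hypothetical record layout)."""
    step: int       # position in the recorded execution
    line: str       # source location that performed the write (made up)
    variable: str   # name of the variable that was written
    value: int      # value stored by the write


def last_writer(history, variable, before_step):
    """Walk the recording backwards and return the most recent write
    to `variable` that happened before `before_step`, or None."""
    for event in reversed(history):
        if event.step < before_step and event.variable == variable:
            return event
    return None


# A made-up recording: `balance` ends up holding bad data (-1).
history = [
    Write(1, "init.c:10", "balance", 100),
    Write(2, "init.c:11", "limit", 500),
    Write(3, "update.c:42", "balance", 150),
    Write(4, "parse.c:87", "balance", -1),
    Write(5, "update.c:50", "limit", 600),
]

# The failure is noticed at step 6, so search backwards from there.
culprit = last_writer(history, "balance", before_step=6)
print(culprit.line, culprit.value)  # parse.c:87 -1
```

The point of the sketch is that the search starts from the symptom and moves backwards, so the first write it meets is the one that produced the bad value.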
There are many benefits of using reverse debugging:
General development productivity
Making time travel debugging part of the development and debugging process improves overall development productivity. Common but difficult-to-identify bugs can be found more quickly, freeing up developer time. Time travel debuggers that are fully compatible with the open source debugger GDB can be easily integrated into development environments without the need for extensive training. Time travel debugging can also help developers become familiar with legacy code, or with code they did not write themselves but now have to work with.
Finding and fixing intermittent bugs
Sporadic bugs that strike seemingly at random are incredibly difficult to find, and are potentially the most damaging to company reputation. They can easily slip through the net of normal debugging routines, and only surface when code is close to shipping – or worse, has already shipped. By running the program under a time travel debugger until the intermittent bug strikes, developers can then step backward from the point of failure, line by line, until the bug itself is found. If necessary, multiple instances of the debugger can be run on different servers to increase the chance of the bug manifesting itself.
Bugs in production software that manifest at a customer site
Many software vendors’ tools are used on customer sites. If the program crashes, the vendor must learn about the circumstances in order to reproduce the issue in-house before investigating it. Unfortunately, this is often impossible, leaving the vendor no choice but to send an engineer to the customer site to investigate the failure. Running a time travel debugger on-site, on the machine demonstrating the issue, means that the next time the failure occurs, the engineer can step back to see what went wrong.
The business case for using a time travel debugger
The University of Cambridge research cited above analyzed the financial cost of debugging and how it could be reduced. The math is simple. The global cost of software development is $1.25 trillion. The study found that debugging accounts for a quarter of that budget: $156 billion in wages, with overhead costs doubling this to $312 billion.
Time travel debugging can deliver significant savings. Siemens EDA (formerly Mentor Graphics) has reduced debugging time by 66% (two thirds) after implementing UDB (formerly known as UndoDB).
Take an average software developer earning $90,000. Currently they spend a quarter of their time debugging, costing $22,500 in wages. Reducing that by two thirds creates a saving of $15,000 and increases available developer time.
Of course, this covers only part of the financial impact of finding and fixing bugs. It ignores the costs of:
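The arithmetic behind these figures can be checked in a few lines. The sketch uses only the numbers quoted above; the variable names are illustrative.

```python
from fractions import Fraction

salary = 90_000                  # average developer salary, as quoted
debugging_share = Fraction(1, 4)  # a quarter of working time spent debugging
reduction = Fraction(2, 3)        # debugging time cut by two thirds

# Wages spent on debugging per developer per year.
debugging_cost = salary * debugging_share

# Wages freed up if debugging time falls by two thirds.
saving = debugging_cost * reduction

print(int(debugging_cost))  # 22500
print(int(saving))          # 15000
```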
- Delaying product launches
- Running recall programs after software has been released
- The reputational damage to a company when things go wrong
- Lost customers if your tool is responsible for delays or issues
- The personal cost to developers and managers, in terms of stress and sleepless nights, as they struggle to find and fix bugs
Long run times
Sometimes tracking down a bug can itself be an O(n²) iteration: running the debugger for 5 minutes until the bug manifests itself, setting a breakpoint earlier in the code and running again for 4 minutes, setting an earlier breakpoint and rerunning, and so on. With reverse debugging, that time-consuming run-restart cycle can be reduced to an O(n) process. Run until you hit the bug, then step backwards to see what led to the problem. Did you miss it? Step forward a little, and backward again.
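A rough model makes the difference concrete. The sketch below assumes, purely for illustration, that each rerun stops one minute earlier than the previous one, and it ignores both the slowdown of recording and the time spent stepping backwards. The function names are hypothetical.

```python
def restart_cycle_minutes(minutes_to_bug: int) -> int:
    """Total run time when every rerun stops one minute earlier:
    n + (n - 1) + ... + 1, which grows as O(n^2)."""
    return sum(range(minutes_to_bug, 0, -1))


def single_recording_minutes(minutes_to_bug: int) -> int:
    """Total run time when the program runs once and is then rewound,
    which grows as O(n)."""
    return minutes_to_bug


for n in (5, 60):
    print(n, restart_cycle_minutes(n), single_recording_minutes(n))
# 5 15 5
# 60 1830 60
```

Under these assumptions a bug that takes an hour to reach costs more than thirty hours of reruns, against a single one-hour recording.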
Take the example of a function which is called many times, but fails after about a thousand calls with a fault such as SIGSEGV or SIGFPE. Setting a breakpoint in the function doesn’t work well because it stops at the first occurrence, when you really want it to stop at the last occurrence – but that involves predicting the future! With a time travel debugger, it’s possible to run to the end and only then set a breakpoint. When running in reverse, the first breakpoint you hit is the last time that code was executed.
Some applications generate specialized code at runtime. Debugging such code is hard because source code analysis tools are obviously unable to help; there is none of the normal debug information to locate functions, and the code could be generated at different addresses on different runs. A time travel debugger allows the developer to examine a single run in detail without the headaches associated with re-running.
An intermittent bug might only strike in 1 in 300 runs. If the developer investigating the bug discovers the need to set a different breakpoint or add another logging command to help understand the problem, it will take a lot of runs before the bug is hit again, so progress will be very slow. A time travel debugger can’t help to make the bug appear sooner, but once it does appear the entire history of the run can be examined.
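To put numbers on "a lot of runs", the sketch below treats each run as an independent trial with a 1 in 300 chance of hitting the bug. Independence is an assumption made for illustration, since real failures may be correlated with load or timing. The helper name is hypothetical.

```python
import math

p = 1 / 300  # chance that a single run hits the bug, as in the example

# Average number of runs until the bug shows up (geometric distribution).
expected_runs = 1 / p


def runs_needed(confidence: float, p_fail: float) -> int:
    """Smallest number of runs giving at least `confidence` probability
    of seeing the bug at least once."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_fail))


runs_for_95 = runs_needed(0.95, p)
print(round(expected_runs), runs_for_95)
```

Every change to a breakpoint or logging statement pays this cost again with a conventional debugger, whereas one recorded failing run can be examined as many times as needed.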
Dynamically generated code, stack corruption
Stack corruption is a common problem: a bug overwrites the stack, GDB cannot cope, and the core dump provides no useful information. Using time travel debugging, the developer can rewind to watch the stack being corrupted and fix the issue in minutes.
Obscure memory leaks can cause software to run slower over time and potentially even crash. Memory leaks are hard to debug using conventional tools because there is a large gap in time between the allocation of a buffer and the point where it should be freed. It’s also not clear where the fault is – the problem is likely to be an absence of code where it should be. Worse, if the program is re-run it may be a different buffer that leaks.
A time travel debugger gives developers the chance to work on a single example failure, moving freely backwards and forwards through the history to identify where the missing code should be.
Real-time, network protocols
Software can fail when it receives data in unexpected formats. But it may not be possible to step through the code using a debugger if it is communicating with an external program or device which has real-time constraints – the other device may simply give up. With a time travel debugger, there is no need to stop during the initial ‘recording’ phase, because it is always possible to rewind later.
Take a bug where some code accesses shared data but claims the wrong lock. This shows up as a threading bug where two threads are accessing data A but one of them has locked data B by mistake; so there is a race condition between the threads. Using a conventional debugger, the bug will show up as a corruption of data A, but the cause won’t be obvious. Typically the response is to run again with watchpoints set, but this can result in a lot of false positives unless a complex condition is defined to filter out the OK accesses, and having set all that up there’s a strong chance that the bug won’t manifest next time. Time travel debugging makes it faster: by starting at the end where the corruption is detected, setting a watchpoint and running backwards, the source of the corruption can be found much sooner.
Take a case where corruption of a linked list leads to a crash, but it is difficult to see when the corruption occurred. Rather than having to continually re-run the program, reverse debugging allows developers to go back in time to before the list was corrupted and use a binary search to quickly find out when the list got corrupted. This brings debugging time down from over an hour to less than 10 minutes.
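The binary search over a recorded history can be sketched as follows. The model assumes that the list stays corrupt once it has been corrupted, and that a consistency check can be evaluated at any point in the recording. The function names, the step counts and the check itself are hypothetical stand-ins for jumping to a point in time and inspecting the list.

```python
def first_bad_step(is_corrupt_at, last_step):
    """Return the earliest step in [0, last_step] at which the check fails.
    Assumes the check fails at `last_step` and never recovers once failed."""
    lo, hi = 0, last_step
    while lo < hi:
        mid = (lo + hi) // 2
        if is_corrupt_at(mid):
            hi = mid        # corruption happened at or before mid
        else:
            lo = mid + 1    # list was still intact at mid
    return lo


# Made-up recording: 100,000 steps, list corrupted at step 73,912.
CORRUPTED_AT = 73_912
LAST_STEP = 100_000
probes = []


def is_corrupt_at(step):
    """Stand-in for 'travel to this step and validate the list'."""
    probes.append(step)
    return step >= CORRUPTED_AT


found = first_bad_step(is_corrupt_at, LAST_STEP)
print(found, len(probes))  # the step found, and at most 17 probes
```

Each probe corresponds to one jump in time within the same recording, so no re-run of the program is needed between probes.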
These are examples of where time travel debugging aids developers and are by no means exhaustive.
UDB is an interactive time travel debugger for C/C++ on Linux that works on any user-mode compiled code, on x86. This reverse debugging tool is available standalone or bundled with Undo’s LiveRecorder platform. It takes the guesswork out of debugging by allowing developers to step or run their program backward as well as forward in time. It incorporates the full functionality expected of modern debuggers (such as scripting, conditional breakpoints and watchpoints, full inspection of globals and locals) and also allows these features to be used with the program running in reverse. Bugs can be fixed in minutes, not weeks.
UDB is a drop-in replacement for GDB and therefore integrates seamlessly into a developer’s workflow. UDB can be used at the command line or from any popular IDE (VS Code, CLion, Eclipse, Emacs etc.), allowing developers to choose their preferred work environment.
To explain UDB’s advantage, we need to introduce the concept of determinism. A deterministic process is one which always produces the same output when fed with the same starting state. The insight which drives UDB is that computers are mostly deterministic (which explains why they are not very good at generating random numbers). If a program behaves deterministically, there is no need to record its intermediate states, because they can be reconstructed at any time simply by running the program from the beginning.
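The idea that intermediate states need not be stored can be shown with a toy deterministic program. The sketch is an illustration of determinism only, not of UDB's internals: `step` and `state_at` are hypothetical names, and the constants are arbitrary illustrative values for a simple pseudo-random generator.

```python
def step(state: int) -> int:
    """One deterministic update (a linear congruential generator)."""
    return (1664525 * state + 1013904223) % 2**32


def state_at(seed: int, n: int) -> int:
    """Reconstruct the state after n steps by re-running from the start."""
    state = seed
    for _ in range(n):
        state = step(state)
    return state


seed = 42

# No intermediate state was saved, yet any point can be rebuilt on demand.
first_visit = state_at(seed, 1000)
second_visit = state_at(seed, 1000)
print(first_visit == second_visit)  # True

# Going "back" one step is simply a shorter re-run from the same seed.
previous = state_at(seed, 999)
print(step(previous) == first_visit)  # True
```

The same property explains the remark about random numbers: given the same seed, such a generator always produces the same sequence.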
Real programs are not completely deterministic, and to correctly replay a program all the sources of non-determinism must be captured. Sources of non-determinism include:
- Inputs from outside the program, e.g. from user interaction, files, network sockets, real-time clocks etc. Usually these interactions take the form of system calls or accesses to memory-mapped files
- Scheduling variation – in a multithreaded program where more than one thread is unblocked, the OS can decide to schedule the threads in any order, or even simultaneously on a multicore machine
- The CPU itself, which may have certain instructions whose effects are not predictable from the program state, e.g. the x86 CPUID and RDTSC instructions
The program being debugged with UDB is instrumented on-the-fly to identify all sources of non-determinism. Instrumentation also provides a timebase, so that any point in a program’s run can be identified via a count of ‘simulated nanoseconds’ (which correspond very approximately to real-time nanoseconds). The ‘event log’ captures all the information needed to reconstruct the effects of non-determinism. For example, if the program executes a read() system call, UDB will capture the new buffer contents into its event log. If the same section of code is later replayed, the read() is not executed again – instead its effect is simulated by copying the saved buffer from the event log.
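The record-and-replay behaviour described for read() can be sketched in miniature. This is a simplified illustration under stated assumptions, not UDB's implementation: the `Recorder` and `Replayer` classes, the `program` function and the random input source are all hypothetical, with the random source standing in for data arriving from outside the program.

```python
import random


class Recorder:
    """Performs each input for real and saves the result in an event log."""

    def __init__(self):
        self.event_log = []

    def read(self, source):
        data = source()              # the non-deterministic part
        self.event_log.append(data)  # capture its effect
        return data


class Replayer:
    """Never performs the input; returns saved results in order."""

    def __init__(self, event_log):
        self._pending = iter(event_log)

    def read(self, source):
        return next(self._pending)   # `source` is deliberately ignored


def program(io, source):
    """Deterministic apart from the values it reads."""
    checksum = 0
    for _ in range(4):
        checksum = (checksum * 31 + io.read(source)) % 65_521
    return checksum


def live_input():
    return random.randrange(256)


def must_not_be_called():
    raise AssertionError("replay must not touch the real input")


recorder = Recorder()
recorded_result = program(recorder, live_input)
replayed_result = program(Replayer(recorder.event_log), must_not_be_called)
print(recorded_result == replayed_result)  # True
```

Because only the inputs are logged, the log stays small relative to the full execution, and everything else is recomputed during replay.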
Today, software is central to every organization. Finding and fixing bugs has never been more important – meaning that application software debugging tools are no longer a ‘nice to have’, but are a business and development necessity.
Reverse debugging (also called time travel debugging) provides a viable, cost-effective way of locating bugs, as developers can now record, rewind and replay their code. This makes it simpler to quickly find and fix customer-critical bugs, deliver to ever-shortening deadlines and boost overall productivity.
By reducing debugging time by two thirds, engineering teams can free up developers to code more productively, thereby increasing operational efficiency and safeguarding corporate reputation. With the pressures on software development growing, now is the time to investigate reverse debugging and the benefits it brings.