In recent years we have seen a big increase in the use of, and interest in, reverse debuggers (also known as time travel debuggers). To some, this technology is old hat; it’s been around for a while (7+ years), but most people don’t even know it exists. In this article, we will explain what reverse debugging is, where the notion came from, and why developers are starting to adopt it en masse as the standard means of fixing hard-to-reproduce bugs.
Reverse debugging / time travel debugging explained
In his 2012 articles, Jakob Engblom, a Wind River (subsidiary of Intel) Simics specialist with a keen interest in the method of reverse debugging, said that:
“Reverse debugging is the ability of a debugger to stop after a failure in a program has been observed and go back into the history of the execution to uncover the reason for the failure.”
If you think about it, this is the logical way of debugging. The problem you are trying to fix is at the end of a trail of breadcrumbs in the program’s execution history. You know the endpoint but you need to find where the beginning is, so working backwards is the logical approach.
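To make the idea of working backwards concrete, here is a minimal Python sketch. It is a toy illustration only, not how UDB or any other real reverse debugger is implemented: the `History` class, the step functions, and the bank-balance scenario are all hypothetical names invented for this example. It records a snapshot of the program state after every step, then walks backwards from the failure to find the step that first broke things.

```python
from dataclasses import dataclass, field


@dataclass
class History:
    """Toy execution history: one snapshot of the state per executed step."""
    snapshots: list = field(default_factory=list)

    def record(self, step_name, state):
        # Copy the state so that later mutations do not rewrite history.
        self.snapshots.append((step_name, dict(state)))


def run(steps, state, history):
    """Run every step forwards, recording the state after each one."""
    for step in steps:
        step(state)
        history.record(step.__name__, state)


def find_culprit(history, is_bad):
    """Walk backwards from the end while the state is still bad.

    Returns the name of the earliest step in that final run of bad states,
    or None if the final state is not bad at all.
    """
    culprit = None
    for name, state in reversed(history.snapshots):
        if not is_bad(state):
            break
        culprit = name
    return culprit


# Hypothetical program under test: the bug is in apply_fee.
def load(state):
    state["balance"] = 100


def apply_fee(state):
    state["balance"] -= 150  # bug: the fee is larger than the balance


def add_interest(state):
    state["balance"] *= 2


def report(state):
    state["ok"] = state["balance"] >= 0


history = History()
run([load, apply_fee, add_interest, report], {}, history)

# The failure is only observed at the end, so follow the trail backwards.
culprit = find_culprit(history, lambda s: s["balance"] < 0)
print(culprit)  # → apply_fee
```

Snapshotting the whole state after every step is only workable for a toy like this. It does, however, show why having the history to hand turns the search for a root cause into a simple backwards walk.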
In reality, this is very hard to implement, especially in commercial-grade software, which is why it took a long time for any real, usable implementation of efficient time travel capability to appear. After all, running a program under a debugger typically incurs an overhead, as do replay and backwards execution, not to mention the many variables to consider around kernel implementation, threading, concurrency-related instructions, system calls, shared memory, and any other cool feature or way of doing things you decided to use in your software.
With these challenges in mind, it took developers a long time to make a useful and efficient time travel debugger. Engblom’s articles are worth checking out here: they cover the early research papers from 1977 onwards, experiments from as early as 1995 that tried different implementations of reverse debugging, and the more modern commercial and open source products that have become standard tools for reverse debugging today.
In fact, it was a spate of interest in reverse debugging in the early 2000s that inspired me and my co-founder, Julian Smith, to think about developing a reverse debugger that worked on compiled code. ‘How hard could it be?’ we naively thought, so we set out to make our own reversible debugger, UndoDB. It was a long road, and we had a lot to learn, not only about how to make a time travel debugger, but also about how to build a viable business around it (I’ll be talking more about this at CppCon on the 25th September!).
So why did we embark on this journey? Well, in large part it was because we were frustrated with conventional tools. Core dumps and logging seemed so last century, and forwards-only debugging was just too frustrating. With UDB (formerly UndoDB), we wanted to make something that was robust, worked for a wide user base (x86, Arm, etc.), and would help developers like us master the science of debugging complex failures in large systems we didn’t always fully understand.
Time travelling into the present: reverse debugging today
Fast forward to 2017 and lots of languages have some kind of reverse debugging capability, be it Java (Chronon), .NET/C# (RevDeBug), Python (RevPDB), or Elm (Elm TTD), to name but a few of the most popular. The compiled languages field is now the most crowded, with UDB, GDB, and rr (all of these on Linux), plus the Windows time travel debugger.
It’s very exciting to see reverse debugging becoming ‘a thing’. It’s now available to programmers in multiple languages and on various operating systems, and continued development of the field is fundamentally changing the way people understand software. The ability to rewind and step back through compiled applications removes a lot of the fear of software failing in test. Undo has expanded this idea further by allowing developers to take a recording of their program’s execution, which they can then rewind and replay as often as they wish to find and fix a bug. This is an area where SAP has found success with Undo: the fuzz testing infrastructure in SAP HANA threw up issues so complex that the team couldn’t comprehend them using conventional means. Now the team has recordings of sporadic misbehaviours and failed tests that it can replay and debug, allowing developers to fix their ‘unfixable’ bugs and spend more time writing new code.
One of the most useful aspects of record, rewind and replay debugging is the ability to capture the recording (sometimes known as a trace) of the exact issue you want to fix. This is particularly useful in the case of failures that are hard to reproduce, as you only need them to happen once. You can then replay and debug them as often as you want, and even have the ability to share them between developers for collaborative debugging. The applications of this go well beyond traditional debugging methods and into the realm of monitoring machine learning, AI and more.
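The principle behind "it only needs to happen once" can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of how any particular tool records a process: the `Recorder` and `Replayer` classes and `flaky_program` are hypothetical names made up for the example. The only source of non-determinism here is a random number generator, whereas real tools have to deal with the system calls, threading, and shared memory mentioned earlier.

```python
import random


class Recorder:
    """Live run: pass every value through from the real source and log it."""

    def __init__(self, source):
        self.source = source
        self.trace = []

    def next_value(self):
        value = self.source()
        self.trace.append(value)
        return value


class Replayer:
    """Replay run: hand back the logged values in the original order."""

    def __init__(self, trace):
        self._values = iter(trace)

    def next_value(self):
        return next(self._values)


def flaky_program(inputs, rounds=1000):
    """Hypothetical program that only misbehaves on a rare input value."""
    for i in range(rounds):
        value = inputs.next_value()
        if value % 97 == 0:
            return f"failed at round {i} with value {value}"
    return "ok"


# Record a single live run, whatever it happens to do.
recorder = Recorder(lambda: random.randint(1, 10**6))
first_result = flaky_program(recorder)

# Replay the same trace as often as we like: the outcome never changes.
replays = [flaky_program(Replayer(recorder.trace)) for _ in range(3)]
print(all(result == first_result for result in replays))  # → True
```

Because the trace is just data, it can also be saved and handed to a colleague, which is what makes the collaborative debugging described above possible.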
We like to think of this as making software accountable for what it does. If it is under constant scrutiny through being recorded, it can no longer pull a fast one and crash for unknown reasons. In the real world, this means that there will be no more excuses for why a self-driving car crashed, why a trading platform decided to sell all of its shares in one go, or why that hacker got into your bank account.
Reverse debugging is the future and has been for some time. If you still aren’t using it, it’s time you time travelled into the present.