Time travel debugging enables developers to rapidly resolve runtime errors without needing to invest time and effort into replicating issues. It shortens the traditional debugging workflow, and enables us to get to the root cause of bugs with 100% certainty. This, in turn, allows us to fix bugs in our programs more quickly and easily. The question is, how?
This walkthrough guide outlines how time travel debugging works with a C/C++ application running on Linux. For this, we will be using UDB, Undo’s time travel debugger.
Our starting point is a simple example application called Cache Calculate.
When we run this application, we can see that it runs to a point, but then crashes.
Let's use UDB to analyze the execution of this program and diagnose what caused this failure.
First, we will type backtrace to look at the call stack and see a summary of how our program got to where it is.
Because UDB is a time travel debugger, we can run the execution of the program in reverse.
So, we will use the reverse-finish command twice to reverse up the stack to the abort() statement in main() at cache.c line 85.
Note: in UDB, like GDB, pressing Enter on an empty line repeats the previous command.
Now, we’ll switch into TUI (Text User Interface) mode, so that we can more easily see what’s going on in the source code.
To do this, we press Ctrl+X followed by a second key; in GDB, whose interface UDB follows, Ctrl+X then A toggles TUI mode.
The great thing about a time travel debugger, unlike a forward-only debugger, is that we can go back to any line of code that executed and see the complete program state.
We’ll now type info locals to inspect the state of the variables at this point.
This shows us that the integer square root of 255 is 15 (in sqroot_correct), but sqroot_cache is 0, which is the wrong value.
This is the point at which the bug presents, but it’s not the root cause.
Next, we need to track down the point where the cache is populated with the incorrect value. Line 78 is where the sqroot_correct variable is set.
The reverse-next command executes the program backward to the previous source line in the current function, stepping back over any function calls.
So we use the reverse-next command 3 times to jump back in time to line 78.
We can see that the previous line is where the sqroot_cache variable is set to its incorrect value.
The reverse-step command executes the program backward until it reaches a different source line, stepping back into function calls along the way.
So we’ll use the reverse-step command once to step back into the cache_calculate() function, and again to go back to line 39, where it returns this incorrect value.
Now we need to inspect the value being returned at this point.
Inspecting the cache entry shows that the square root stored in the cache for 255 is 0, which is incorrect.
Next, we need to find out where this cache entry was populated with the incorrect value. We can do this by setting a watchpoint (also known as a data breakpoint) on the incorrect entry in the cache, and running back in time to where it was set.
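First, to set the watchpoint, we’ll type watch -l g_cache[i].sqroot. The -l (location) option watches the memory that the expression currently refers to, so the watchpoint stays valid even when i is no longer in scope.
Next, by typing reverse-continue, we can run backward in time to see where this value was written.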
This takes us to the point where the incorrect value was written to the cache. We type info locals again to see the state of the variables at this point.
Here we see that sqroot_adj is a very large negative number. But when stored into g_cache[i].sqroot, it is stored as 0. This suggests a type casting error. We need to investigate why the large negative number is being stored as 0.
So, let's print the type of the variables.
The data structure shows that the cache is an array of 100 pairs of unsigned char. But unsigned char can only hold values 0 to 255, so when sqroot_adj is stored in the cache, its value is truncated to its lowest 8 bits.
So, now we know how we got the zero. Similarly, when -1 is cast to an unsigned char, it becomes 255. But where did the -2,147,483,648 actually come from?
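Before chasing that value, the narrowing itself can be checked in isolation. The following is a minimal sketch, not the actual cache.c: the struct layout is assumed from the description above (100 pairs of unsigned char), the field name number is a guess, and store_sqroot is a hypothetical helper introduced only for illustration.

```c
#include <limits.h>

/* Assumed layout: the walkthrough only says the cache holds 100 pairs of
 * unsigned char and that one field is called sqroot. The field name
 * `number` is a guess for illustration. */
struct cache_entry {
    unsigned char number;
    unsigned char sqroot;
};

static struct cache_entry g_cache[100];

/* Hypothetical helper: store an int result into slot i and return what
 * actually ended up in the cache. Converting an int to unsigned char is
 * well defined in C: the value is reduced modulo UCHAR_MAX + 1. */
static unsigned char store_sqroot(int i, int sqroot_adj)
{
    g_cache[i].sqroot = (unsigned char)sqroot_adj;
    return g_cache[i].sqroot;
}
```

With an 8-bit unsigned char, store_sqroot(0, INT_MIN) yields 0 and store_sqroot(1, -1) yields 255, matching the two conversions described above, while an in-range value such as 15 is stored unchanged.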
Line 48 shows that sqroot_adj is set to sqrt(number_adj), cast to an integer because sqrt() returns a double.
If we type print sqrt(number_adj), we can see what sqrt() actually returned for this input.
We have now discovered the root cause of this application failure: the program attempts to put the square root of -1 into the cache, which was not intended.
The reason for this is that the for loop at line 46 runs from number-1 to number+1, but nothing in the code handles the special case we just hit, where number is zero.
This error could have been avoided by adding the following statement at the start of the for loop:
if (number_adj < 0) continue;
So that’s how time travel debugging enables developers to run backward through the execution flow of a program and drill down into what happened, in order to quickly identify the root cause of the program crash.
As shown, it is an easy and efficient debugging process, the principles of which can, of course, be used to debug much more complex code. You can try some of the features of time travel debugging out for yourself on your own application, by downloading a free UDB 60-day trial (desktop software only).