Debugging Race Conditions in C/C++

Some bugs are trivial to fix, others can be almost impossible. The two factors that make the biggest difference are:

  1. Is the bug deterministic or does it occur sporadically and/or in a different fashion each time?
  2. How much compute happens between the bug itself and the point of failure?
    (This is why assertions are so useful: they reduce the gap between the bug itself and the point where you notice it, and it's why "the results are wrong" is usually such a horrible bug report.)

Race conditions tend to be really bad on both fronts, hence they’re usually horrible to diagnose.

What are race conditions?

Race conditions arise in software (or any system) when multiple tasks execute concurrently and correctness depends on the ordering of the tasks' operations relative to one another. The most commonly cited kind is multiple threads in a program operating on the same data without adequate locking; the threads are said to be racing with each other, and the outcome depends on which wins the race – one ordering is correct, the others produce a crash, a deadlock, or other incorrect behavior.
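
As a minimal sketch of that first kind (hypothetical code, separate from the example used later in this post), two threads incrementing a shared counter without a lock will usually lose updates:

#include <iostream>
#include <thread>

static int counter = 0; /* shared between both threads, no locking */

int main()
{
    auto increment = []
    {
        for (int i = 0; i < 1000000; ++i)
        {
            ++counter; /* read-modify-write: not atomic */
        }
    };
    std::thread t1(increment);
    std::thread t2(increment);
    t1.join();
    t2.join();

    /* Typically prints less than 2000000: increments from the two threads
     * interleave and overwrite each other. Formally this is a data race,
     * i.e. undefined behavior. */
    std::cout << counter << "\n";
    return 0;
}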

In practice, it's just as likely that one thread (perhaps the only thread) in your program is racing with something outside your program – often the operating system, perhaps another process. For example, one process writes a file and another reads from it; usually the read happens much later than the write, but once in a blue moon the write is delayed (perhaps the writer is starved of CPU time because the system is busy) and the reader sees stale or missing data.

Is this a race condition?

Firstly, how do you know if some erratic behavior is caused by a race condition? Race conditions (by definition) cause erratic behavior, but not all erratic behavior is caused by race conditions. For example, erratic behavior might be caused by the memory allocator putting objects at different addresses. One example I heard from a customer was caused by a printf("pid=%d\n", getpid()); – the malloc inside printf would either fit inside the current page or not, depending on the number of digits in the pid; that changed the addresses returned by subsequent mallocs, which were then hashed into some data structure and ultimately produced incorrect results. That's erratic, but not a race condition.

The bad news is that it's usually impossible to know for sure whether something is a race condition until you have a root-cause diagnosis. My advice, however, is not to be too quick to ascribe erratic behavior to a race condition until you know – it could be anything!

How to find and fix race condition bugs in C/C++ (the traditional way)

There are several tools that can help find race conditions.

Static analysis

Static analysis tools like Coverity can be good at finding certain classes of race conditions. They're not cheap, but they can save you a lot of time. Even better, they can identify races without needing a test case that triggers them. (One of the many horrible things about race conditions is that they often escape testing and then blow up in production.)

Dynamic analysis

There are two particularly good, free dynamic analysis tools available: ThreadSanitizer and Helgrind. ThreadSanitizer, like the other sanitizers such as AddressSanitizer, requires you to compile your code specially. Helgrind (like the rest of the Valgrind suite) does not, but imposes more runtime overhead. Both can identify potential data races and deadlocks. For example, they will flag if one thread writes some memory after another thread has written it but before anyone else has read it – not necessarily a race, but it’s a good clue that it might be. They will also find occurrences of a thread acquiring multiple locks simultaneously and not always in the same order (this is almost always a deadlock waiting to bite you). Both are somewhat slow: ThreadSanitizer will slow your program by 10-15x, Helgrind by perhaps 100x. Compute time is usually cheap, and if it lets you find races quickly and before your customers do, then it’s a small price to pay.
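
For example, to try both on the race.cpp program from the end of this post (assuming g++ and Valgrind are installed; exact flags vary by toolchain):

# Build with ThreadSanitizer instrumentation, then run normally:
g++ -g -fsanitize=thread race.cpp -o race -pthread
./race

# Or build an ordinary binary and run it under Helgrind:
g++ -g race.cpp -o race -pthread
valgrind --tool=helgrind ./race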

Logging

Often we resort to adding print statements in increasing desperation until, eventually (hopefully) we see the sequence of events that we hadn’t anticipated. This is seldom a happy story – it can take an arbitrarily long time. Adding logging statements has a tendency to change timings or impose unintended serialization of threads, causing the race to disappear until the logging statements are removed!
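
For instance (a hypothetical sketch, reusing g_value from the example program at the end of this post): put a print right next to the racy access and the race may vanish, because the logging call itself slows and synchronizes the threads:

g_value += 1;                            /* the racy access */
printf("g_value is now %d\n", g_value);  /* printf does I/O and takes an
                                          * internal stdio lock, which slows
                                          * this thread and tends to order it
                                          * against other logging threads -
                                          * often enough to hide the race */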

But there is another way, a better way.

How to find and fix race condition bugs in C/C++ (using time travel debugging)

That better way is time travel debugging (a.k.a. reverse debugging).

Time travel debugging is the ability of a debugger to stop after a failure in a program has been observed and go back into the history of the execution to uncover the reason for the failure.

Time travel debugging is particularly useful for race conditions (or indeed any bug where the symptom appears long after the root cause, or where the program behaves nondeterministically). It is a very different and much more satisfying way to debug, and it's worth walking through an example to see the impact it has on race conditions.

We’re basing this on an example program you can grab from the end of this post.

First, you don’t necessarily even need to identify where the program fails. You start UDB (Undo’s time travel debugger and a component of LiveRecorder), pointing it at the program:

./udb race

Start by running the program and see where it goes wrong:

not running> run
...
threadfn1: i=20000
threadfn1: i=30000
race: race.cpp:47: void threadfn1(): Assertion `g_value == old_value + a' failed.
[New Thread 108.120]
[New Thread 108.119]
[New Thread 108.121]

Thread 2 received signal SIGABRT, Aborted.
[Switching to Thread 108.120]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	}

In our example, the error is the failed assertion:

Assertion `g_value == old_value + a' failed

We want to know why one thread expects one value but gets another. We use reverse-finish to step back out of the C standard library until we arrive at a line in our program:

recording 9,241,049> reverse-finish
0x00007f5b70bf191c in __GI_abort () at abort.c:79
79	      raise (SIGABRT);
99% 9,241,043> reverse-finish
...
92	  abort ();
99% 9,241,040> reverse-finish
...
101	  __assert_fail_base (_("%s%s%s:%u: %s%sAssertion `%s' failed.\n%n"),
99% 9,240,636> reverse-finish
0x00005601508016ee in threadfn1 () at race.cpp:47
47	        assert(g_value == old_value + a);

This brings us to where our program aborted, but doesn't tell us why. We know that when this line executed, g_value was not equal to old_value + a, so we want to find out what other part of the program altered the value.

At this point, we can start to see that it's a race condition. The few lines leading up to the assert (shown below) simply add a to g_value, yet by the time we hit the assert that no longer holds – so another thread must have changed g_value in the meantime.

int a = random_int(5);
g_value += a;
assert(g_value == old_value + a);

We find this out by putting a watchpoint on g_value:

99% 9,240,591> watch g_value

Now we’re paused at the point before the program aborted and we know that the cause of the problem is g_value. Remember the approach before – wading through the code to find references to g_value. Contrast this with what we do next which is to run the program in reverse (automatically, not manually) watching for where g_value is changed:

99% 9,240,591> reverse-continue
Continuing.

Thread 2 hit Hardware watchpoint 1: g_value

Old value = 756669
New value = 756668
0x00005601508016bc in threadfn1 () at race.cpp:46
46	        g_value += a;

UDB pauses the program, but the surrounding context shows that this is just the previous line in the same thread, so it isn't that interesting:

99% 9,240,591> list
41	            std::cout << __FUNCTION__ << ": i=" << i << "\n";
42	        }
43	        /* Increment <g_value>. Should be safe because we own g_mutex. */
44	        int old_value = g_value;
45	        int a = random_int(5);
46	        g_value += a;
47	        assert(g_value == old_value + a);
48	
49	        (void)old_value;
50	    }

…so we type reverse-continue again:

99% 9,240,591> reverse-continue
Continuing.
[Switching to Thread 218.231]

Thread 4 hit Hardware watchpoint 1: g_value

Old value = 756668
New value = 756667
0x000056015080179e in threadfn2 () at race.cpp:62
62	        g_value += 1; /* Unsafe. */

Now, the “Unsafe” comment isn't going to be there in a real system, but you get the idea. We have arrived at the line that is causing the race condition:

99% 9,240,581> list
57	    {
58	        if (i % (100) == 0)
59	        {
60	            std::cout << __FUNCTION__ << ": i=" << i << "\n";
61	        }
62	        g_value += 1; /* Unsafe. */
63	        usleep(10 * 1000);
64	    }
65	}

And that’s it! We’ve found the offending line.

 

To recap, that took just 5 steps:

1. We ran the program in UDB
2. The program aborted
3. We typed reverse-finish to get back into our own code and discovered the race on g_value
4. We set a watchpoint on g_value and…
5. We typed reverse-continue until we found the offending line

Of course this is a tiny program, but the principle of time travel debugging holds just as well for larger ones. Rather than wading through lines of code and guessing at which line is accessing which variable, we observe the problem and step back until the cause is found.
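
As for the fix: the simplest option here is to follow the locking pattern threadfn1 already uses and take g_mutex around the increment in threadfn2. A minimal sketch (scoped so the lock is not held across the sleep):

void
threadfn2()
{
    for (int i = 0;; ++i)
    {
        if (i % (100) == 0)
        {
            std::cout << __FUNCTION__ << ": i=" << i << "\n";
        }
        {
            /* Own g_mutex for just the increment, so it can no longer
             * interleave with threadfn1's read-modify-write. */
            const std::lock_guard<std::mutex> lock(g_mutex);
            g_value += 1; /* Now safe: protected by g_mutex. */
        }
        usleep(10 * 1000);
    }
}

Note that simply making g_value a std::atomic<int> would not be enough: threadfn2's increment would become atomic, but it could still land between threadfn1's read of old_value and its update, so the assertion could still fail.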

The code we used

#include <assert.h>
#include <unistd.h>

#include <cstdlib>
#include <iostream>
#include <mutex>
#include <random>
#include <thread>


/*
 * Returns a random integer in the range [0, max]
 */
static int
random_int(int max)
{
    static std::random_device rd;
    static std::mt19937 generator(rd());
    static std::uniform_int_distribution<int> distribution(0, max);

    return distribution(generator);
}

static int g_value = 0;

static std::mutex g_mutex;

void
threadfn1()
{
    for (int i = 0;; ++i)
    {
        /* Take ownership of g_mutex for the duration of this scoped block. */
        const std::lock_guard<std::mutex> lock(g_mutex);

        if (i % (10 * 1000) == 0)
        {
            std::cout << __FUNCTION__ << ": i=" << i << "\n";
        }
        /* Increment <g_value>. Should be safe because we own g_mutex. */
        int old_value = g_value;
        int a = random_int(5);
        g_value += a;
        assert(g_value == old_value + a);

        (void)old_value;
    }
}

void
threadfn2()
{
    for (int i = 0;; ++i)
    {
        if (i % (100) == 0)
        {
            std::cout << __FUNCTION__ << ": i=" << i << "\n";
        }
        g_value += 1; /* Unsafe. */
        usleep(10 * 1000);
    }
}

int
main()
{
    std::thread t1(threadfn1);
    std::thread t2(threadfn1);
    std::thread t3(threadfn2);

    t1.join();
    t2.join();
    t3.join();

    return EXIT_SUCCESS;
}
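
To build and run it yourself (assuming g++ on Linux; -pthread is required for std::thread, and -O0 keeps the debug experience simple):

g++ -g -O0 race.cpp -o race -pthread
./race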
