WatchPoint


Using ThreadSanitizer to debug data races

A data race occurs when multiple threads in a process access the same piece of state without synchronization and at least one of them writes to it, so the program’s behavior changes depending on which thread accesses that state first. We’ve talked about AddressSanitizer before, which is used to detect memory corruption problems; ThreadSanitizer (TSan) is another tool from the same family, used to detect data races.

Debugging Basic Data Races with TSan

Here’s a fairly standard example of a data race (this code is from the TSan documentation; it’s a variant of the canonical data-race example):

#include <pthread.h>
#include <stdio.h>

int Global;

void *Thread1(void *x) {
    Global++;
    return NULL;
}

void *Thread2(void *x) {
    Global--;
    return NULL;
}

int main() {
    pthread_t t[2];
    pthread_create(&t[0], NULL, Thread1, NULL);
    pthread_create(&t[1], NULL, Thread2, NULL);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
}

These two threads increment and decrement a global variable: the global is initialized to zero, there is one increment and one decrement, so it should be zero at the end. But because the increment and decrement are not atomic and can happen in parallel, this is a data race. Compile the program with TSan and run it:

$ gcc -g -fsanitize=thread simple_race.c -lpthread
$ ./a.out

ThreadSanitizer has picked up on the fact that two different threads are reading from and writing to the same variable without any sort of synchronization, and it gives us the location of these accesses in the source code. Similarly to how AddressSanitizer instruments the memory accesses the program makes, ThreadSanitizer keeps track of the last N accesses each thread made to memory. If it finds that more than one thread accessed the same memory with no synchronization between the accesses, and at least one of those accesses was a write, it flags them as a data race. That is how TSan can tell us there is a data race in our increment/decrement program.
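One way to remove the race in the increment/decrement example is to guard every access to Global with a mutex. The following is a minimal sketch rather than code from the TSan documentation: GlobalLock and run_counter_threads are hypothetical names introduced here, and error checking on the pthread calls is left out for brevity.

```c
#include <pthread.h>
#include <stddef.h>

static int Global;
/* Hypothetical lock: every access to Global happens while holding it. */
static pthread_mutex_t GlobalLock = PTHREAD_MUTEX_INITIALIZER;

static void *Thread1(void *x) {
    (void)x;
    pthread_mutex_lock(&GlobalLock);
    Global++;                          /* read-modify-write under the lock */
    pthread_mutex_unlock(&GlobalLock);
    return NULL;
}

static void *Thread2(void *x) {
    (void)x;
    pthread_mutex_lock(&GlobalLock);
    Global--;                          /* same lock, so no concurrent access */
    pthread_mutex_unlock(&GlobalLock);
    return NULL;
}

/* Hypothetical helper: runs both threads and returns the final value. */
int run_counter_threads(void) {
    pthread_t t[2];
    Global = 0;                        /* no other thread exists yet */
    pthread_create(&t[0], NULL, Thread1, NULL);
    pthread_create(&t[1], NULL, Thread2, NULL);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    return Global;                     /* safe to read: both threads joined */
}
```

Calling run_counter_threads() from main and building with gcc -g -fsanitize=thread should no longer produce a report for Global, since both modifications now happen under the same lock.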

 

A More Problematic Example

Here’s another example C program from the TSan wiki, an even simpler race:

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

int Global;

void*
Thread1(void *x) {
  Global = 42;
  return x;
}

int
main(void) {
  pthread_t t;
  pthread_create(&t, NULL, Thread1, NULL);
  Global = 43;
  pthread_join(t, NULL);
  return Global == 42 ? EXIT_SUCCESS : EXIT_FAILURE;
}

The main thread and the child thread both write to the Global variable concurrently, and the value of the variable at the end is determined by whichever thread “lost” the race: the value written by the winner is overwritten by the value written by the loser. In this case, the loser is usually the child thread, so if you compile and run this program it will most likely exit successfully with the variable set to 42. Occasionally, though, the main thread loses the race instead. This can be seen by running the program in a loop:

$ gcc -g race.c -lpthread
$ i=0 ; while ./a.out ; do echo $i ; i=$(($i+1)) ; done

The program will run repeatedly and eventually stop, and you’ll be able to see how many iterations it took before the program failed, which could be in the hundreds or thousands of iterations. If we compile this with TSan:

$ gcc -g -fsanitize=thread  race.c -lpthread
$ i=0 ; while ./a.out ; do echo $i ; i=$(($i+1)) ; done

TSan doesn’t report the data race on every run. Sometimes the main thread writes the Global before the child thread has come into existence, so TSan doesn’t pick up on that write as being in a separate thread. To test this we can put usleep(100000); on the line after the call that creates the thread, and sure enough, when the program is rerun, TSan catches a data race every time.

It has even noticed that it looks as if it has been “synchronized via a sleep”. This is neat: synchronizing using sleep is, in my experience, depressingly common and almost always a race condition waiting to bite you. No matter how long you think is long enough, one day excessive system load or a clock adjustment or something else will mean the sleep wasn’t long enough. So let’s change it to use a busy loop rather than a sleep to get around TSan’s sleep synchronization detection:

for (int i = 0; i < 10000000; i++);

The program will show a data race every time again, but this time without the sleep warning. This is an example of why TSan isn’t infallible: just because the program runs without a report doesn’t mean there isn’t a data race present. That is, just because your code is “TSan clean” doesn’t mean it has no races. However, TSan is a very useful way of finding at least some of the races that exist in your program (and if your multithreaded program is non-trivial it probably has plenty of races lurking!).
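To actually fix the second example, rather than just change when TSan notices it, the two writes need a defined order. Here is a minimal sketch, assuming the intent is for the child thread’s value to win: the main thread writes before it creates the child and reads only after joining it. pthread_create and pthread_join both act as synchronization points, so no two accesses to Global are concurrent. run_ordered_writes is a hypothetical name used only for illustration.

```c
#include <pthread.h>
#include <stddef.h>

static int Global;

static void *Thread1(void *x) {
    Global = 42;            /* ordered after the main thread's write below */
    return x;
}

/* Hypothetical helper: returns the final value of Global, or -1 on error. */
int run_ordered_writes(void) {
    pthread_t t;
    Global = 43;            /* written before the child thread exists */
    if (pthread_create(&t, NULL, Thread1, NULL) != 0)
        return -1;
    pthread_join(t, NULL);  /* the child's write is visible after the join */
    return Global;
}
```

With this ordering the result no longer depends on scheduling, so the loop from earlier should keep running indefinitely instead of eventually failing.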


ThreadSanitizer Options

TSan has a few options you can set using the TSAN_OPTIONS environment variable. Some very useful ones are:

report_thread_leaks/report_destroy_locked
If a program exits while there are still threads hanging around, or if you destroy a locked mutex, TSan will flag it, as it’s pretty unlikely to be intentional.

report_signal_unsafe
Doing something like printf from a signal handler is unsafe. POSIX declares only certain functions “async-signal-safe”, meaning safe to call from an asynchronous signal handler, and printf is not one of them. (This is because libc likes taking locks, and if an async signal interrupts libc while it has a lock held, and then that signal handler tries to call a libc function that also takes that lock, you will have a deadlock.) This is a common error in my experience, and so it’s good that TSan will catch it for you.

history_size
As mentioned before, TSan tracks the last N memory accesses; it lets you customize how long the history is, with longer windows giving TSan more chances to spot these data races. This can seriously increase memory use as you use more and more threads.

exitcode
You can force TSan to exit using a specific exit code if it detects an error which can be helpful if you’re building it into your CI, for example.

halt_on_error/stop_on_start
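halt_on_error stops the program when TSan detects an error, and stop_on_start pauses it at startup. Both are very useful for attaching things like debuggers, so you can play around and see what happens either as it runs or when it detects an error.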
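Besides the environment variable, the sanitizer runtimes also look for a hook function that supplies default options, which can be handy when you want the settings baked into a test binary. The sketch below assumes the __tsan_default_options hook provided by the TSan runtime; the option values are picked purely for illustration, and anything set in TSAN_OPTIONS at run time should still take precedence.

```c
#include <stddef.h>

/* Default-options hook looked up by the TSan runtime at startup.
   Options use the same name=value syntax as TSAN_OPTIONS, separated by
   colons. The values below are illustrative assumptions, not
   recommendations. */
const char *__tsan_default_options(void) {
    return "halt_on_error=1:exitcode=1:history_size=4";
}
```

When the program is built without -fsanitize=thread this is just an ordinary unused function, so it is harmless to leave in place.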

There are also ways to customize your code to make it “sanitizer-aware” (true for all sanitizers, but these examples are for TSan). Using the __has_feature(thread_sanitizer) macro you can specify code that is only compiled when ThreadSanitizer is enabled, and you can tell the compiler not to sanitize certain functions. For minimal TSan instrumentation of a function you can use __attribute__((no_sanitize("thread"))), and if you really don’t want any you can use __attribute__((disable_sanitizer_instrumentation)) (although that’s not recommended unless you really know what you’re doing).
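As a rough sketch of what sanitizer-aware code can look like: __has_feature is a Clang facility, while GCC defines __SANITIZE_THREAD__ when building with -fsanitize=thread, so the example checks both. BUILT_WITH_TSAN, NO_TSAN, built_with_tsan and peek_counter are hypothetical names made up for this illustration.

```c
/* Compile-time detection of TSan for both Clang and GCC. */
#if defined(__has_feature)
#  if __has_feature(thread_sanitizer)
#    define BUILT_WITH_TSAN 1
#  endif
#endif
#if !defined(BUILT_WITH_TSAN) && defined(__SANITIZE_THREAD__)
#  define BUILT_WITH_TSAN 1
#endif
#ifndef BUILT_WITH_TSAN
#  define BUILT_WITH_TSAN 0
#endif

/* Hypothetical convenience macro wrapping the attribute from the text. */
#if defined(__GNUC__) || defined(__clang__)
#  define NO_TSAN __attribute__((no_sanitize("thread")))
#else
#  define NO_TSAN
#endif

/* Returns 1 if this translation unit was compiled with TSan, else 0. */
int built_with_tsan(void) {
    return BUILT_WITH_TSAN;
}

/* The load in this function is excluded from race checking. This only
   silences the report; it does not make an unsynchronized read safe. */
NO_TSAN int peek_counter(const int *counter) {
    return *counter;
}
```

A typical use is to wrap extra logging or a longer test timeout in if (built_with_tsan()) { ... }, since sanitized builds run noticeably slower.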

Hopefully this brief insight into ThreadSanitizer has been useful; it’s a simple-to-use tool to help spot data races at their source. Data races and other race conditions can be very nasty bugs to find – they commonly escape testing to bite in production, and often go unresolved because they happen so rarely that the developers cannot get a handle on them. ThreadSanitizer will find race conditions that you didn’t even know were there, and it will help you root cause races much more quickly when you are looking for them.

 

Don’t miss Greg’s next debugging tutorial: sign up to the WatchPoint mailing list below.
