Our Fastest Ever Release: Performance Improvements in Undo 9.0

Undo 9.0 brings some exciting performance enhancements under the hood. As always, our goal in this release is to make time travel debugging as fast and responsive as possible – especially when tackling complex, real-world applications.

This release improves performance by mitigating thread-scheduling issues, optimizing an internal cache for shared-memory workloads and improving our algorithm for generating snapshots of your application. Let’s dive into the details.

Thread-scheduling mitigation for multi-threaded programs

On certain systems, particularly Red Hat Enterprise Linux / CentOS 7 and 8, applications which use spinlocks can experience pathological slowdowns while being recorded by Undo.

We identified a scenario where the Linux scheduler would fail to switch threads even after waking a waiting thread. Internally, Undo uses a lock to serialize the execution of threads, so only the thread that holds the lock can execute. During execution, this lock is periodically dropped and retaken, giving other threads a chance to execute.

On the affected kernels, the sequence of events looks something like this:

  1. The thread which holds the Undo thread serialization lock releases it and wakes a waiting thread.
  2. The scheduler marks the woken thread as runnable but doesn’t immediately start executing it.
  3. The original thread continues to run and reacquires the lock.
  4. By the time the kernel runs the woken thread, the first thread has already re-locked the thread serialization lock, so the woken thread blocks again.

To combat this, Undo 9.0 can detect when this unfair scheduling happens. When a thread repeatedly fails to acquire the Undo thread serialization lock after being woken, we intervene by forcing the thread currently holding the lock to call nanosleep() after it unlocks. We continue doing this until the lock is successfully transferred to a different thread, ensuring fairer scheduling. This can provide a significant performance improvement for some multi-threaded applications – one customer application runs 5.1x faster than it did with Undo 8.3.
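To make the detect-and-intervene idea concrete, here is a minimal, deterministic model of it in Python. It is a sketch under stated assumptions, not Undo's implementation: the FairnessMonitor class, the threshold value and the simulate() helper are all hypothetical names invented for illustration, and the "sleep" is modelled as a guaranteed handoff rather than a real nanosleep() call.

```python
class FairnessMonitor:
    """Counts failed lock handoffs and decides when the holder should yield.

    Hypothetical helper for illustration; not part of Undo.
    """

    def __init__(self, threshold=3):
        # threshold is an assumed tuning knob, not a documented Undo setting.
        self.threshold = threshold
        self.failed_wakeups = 0

    def woken_thread_failed(self):
        """A woken waiter found the lock already retaken by the old holder."""
        self.failed_wakeups += 1

    def lock_transferred(self):
        """A different thread acquired the lock, so stop intervening."""
        self.failed_wakeups = 0

    def holder_should_sleep(self):
        """True when the releasing thread should sleep after unlocking."""
        return self.failed_wakeups >= self.threshold


def simulate(rounds, threshold=3):
    """Model two threads, "A" and "B", under an unfair scheduler.

    Each round the owner drops the lock and wakes the other thread. Without
    intervention the owner always wins the race to retake the lock. Once the
    monitor reports repeated failures, the owner "sleeps" and the waiter
    takes over (assumed here to always succeed).
    """
    monitor = FairnessMonitor(threshold)
    owner = "A"
    history = []
    for _ in range(rounds):
        if monitor.holder_should_sleep():
            # Holder sleeps after unlocking, so the woken thread gets the lock.
            owner = "B" if owner == "A" else "A"
            monitor.lock_transferred()
        else:
            # Unfair scheduling: the holder retakes the lock first.
            monitor.woken_thread_failed()
        history.append(owner)
    return history


if __name__ == "__main__":
    print("".join(simulate(8, threshold=3)))
```

With a very large threshold the model never intervenes and the first thread keeps the lock for the whole run, which is the starvation pattern described in the steps above.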

Larger, configurable cache for shared memory workloads

Applications using shared memory can be recorded and debugged faster in Undo 9.0. We have increased the default size of an internal cache that is used when recording such applications, which improves Undo's recording performance by up to 10% in benchmarks such as Postgres TPC-H.

For applications with heavier shared memory usage, you can now specify a larger cache size by setting the UNDO_shmem_turbo_cache_size environment variable. The value you provide is automatically rounded up to the next power of two.
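If you want to predict the size that will actually be used, the rounding rule is easy to reproduce. The Python sketch below shows one way to compute it and to prepare an environment for a recording process. Both helper names are hypothetical, and passing the value as a plain integer string is an assumption: the unit and accepted format of UNDO_shmem_turbo_cache_size are not specified in this post.

```python
import os


def round_up_to_power_of_two(n):
    """Return the smallest power of two that is >= n (n must be positive)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


def recording_env(requested, base=None):
    """Return a copy of the environment with the cache size variable set.

    Assumption: the variable accepts a plain integer; check the Undo
    documentation for the real unit and format.
    """
    env = dict(os.environ if base is None else base)
    env["UNDO_shmem_turbo_cache_size"] = str(requested)
    return env


if __name__ == "__main__":
    for requested in (1, 3, 4096, 5000):
        print(requested, "->", round_up_to_power_of_two(requested))
```

A value that is already a power of two, such as 4096, is left unchanged, while 5000 would become 8192.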

A smarter snapshot selection algorithm

Undo uses snapshots under the hood to capture the state of a program at a particular time in execution history. To minimize the time it takes to jump to a specific point in the program’s history, we need to select snapshots that are evenly spaced throughout the recording.

Our old snapshot selection algorithm, which used greedy heuristics, often resulted in fewer snapshots being saved than requested, even when enough were available.

In Undo 9.0, we’ve replaced this approach with a new algorithm based on dynamic programming. As a result, with a freshly loaded Undo recording, time travel operations such as the ugo time command will show a small performance improvement in UDB 9.0 when compared with previous releases.
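To give a feel for how dynamic programming applies to this kind of problem, here is a small Python sketch. It is not Undo's algorithm, whose details are not described here. It assumes one plausible objective: given sorted candidate snapshot times, keep exactly k of them (always including the first) so that the largest gap between neighbouring snapshots, and between the last snapshot and the end of the recording, is as small as possible. The function name and parameters are hypothetical.

```python
def select_snapshots(times, k, end):
    """Pick k of the sorted candidate times, always keeping times[0].

    Minimises the largest gap between neighbouring picks, including the gap
    from the last pick to `end`. Runs in O(k * n^2) time; illustrative only.
    """
    n = len(times)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(times)")
    inf = float("inf")
    # best[j][i]: smallest achievable largest-gap using j picks, last pick at i.
    best = [[inf] * n for _ in range(k + 1)]
    # prev[j][i]: index of the pick before i in that best arrangement.
    prev = [[-1] * n for _ in range(k + 1)]
    best[1][0] = 0
    for j in range(2, k + 1):
        for i in range(j - 1, n):
            for p in range(j - 2, i):
                cost = max(best[j - 1][p], times[i] - times[p])
                if cost < best[j][i]:
                    best[j][i] = cost
                    prev[j][i] = p
    # Choose the final pick, accounting for the gap up to the end.
    last = min(range(k - 1, n), key=lambda i: max(best[k][i], end - times[i]))
    picks = []
    j, i = k, last
    while i != -1:
        picks.append(times[i])
        i, j = prev[j][i], j - 1
    return picks[::-1]


if __name__ == "__main__":
    candidates = [0, 1, 2, 3, 50, 100]
    print(select_snapshots(candidates, k=3, end=150))
```

Because the table considers every way of placing each snapshot, this approach always returns exactly k snapshots whenever at least k candidates exist. With the candidates above it skips the cluster of early snapshots and spreads its picks across the whole recording.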

See Undo 9.0 in action today:

Request a demo
