BIFOLD researchers solve tricky concurrency bugs with time travel debugging

About NebulaStream

NebulaStream is an open‑source stream processing system built for IoT workloads across a unified sensor, edge, and cloud environment. The team’s research explores how to answer many concurrent, long‑running queries over large fleets of heterogeneous sensors and nodes. In that environment, subtle concurrency issues and logic bugs are both common and hard to reproduce.

Through the Undo Educational License Program, the team started using Undo as their primary debugger to speed up investigation and make elusive failures reliably reproducible.

“Many of us use Undo as our primary general-purpose debugger”

Debugging challenges

The team noticed several recurring sources of defects:

  • Concurrency bugs
  • Bad code generation from the JIT compiler
  • General logic bugs

When preparing their demo for SIGMOD, the team faced crashes on a Raspberry Pi doing simple video processing. The crashes could not be reliably reproduced, and the overhead of recording directly on the Pi made that setup impractical, so they risked a public failure. Separately, after a large refactor of their internal operator representation, long-running tests began to crash intermittently. Assuming the new code was to blame, the team spent substantial time digging in the wrong area.

Before using Undo, the team had tried a variety of debugging approaches, including:

  • Record-and-replay tools: Earlier attempts with other record‑and‑replay debuggers, such as rr, frequently hit recording failures or crashes during replay, which made them hard to rely on.
  • Sanitizers: The project runs ASAN, UBSAN, and TSAN, with ongoing work on MSAN. Triaging false positives in CI is a significant time investment. JIT‑generated code adds further complexity.
  • Fuzzing: There is ongoing work to apply fuzzing strategies to different components.

“Most of my attempts using rr have ended in me giving up as the time invested in getting the tool to function did not seem plausible at the time.”

Why Undo?

The team’s motivation to try Undo was straightforward: a time travel debugging tool was intriguing enough to try, and they were keen to explore technology that could make hard failures easier to diagnose.

“A debugger that is capable of time‑traveling is already intriguing enough to try the tool.”

How NebulaStream uses Undo

Undo has become a core part of the team’s workflow, both for day-to-day debugging and for tackling the hardest failures. 

On their workstations, they have used Undo extensively, and in the SIGMOD case, a replicated setup allowed them to track down and resolve the Raspberry Pi crashes. In the refactor example, they were initially hesitant to attempt recording because the tests were processing gigabytes of input data. But once they did, the very first run reproduced the crash and revealed the real cause: a race condition that had likely been present for years.

Undo also plays a role in how the team approaches testing their distributed version of NebulaStream. They are making the distributed setup reproducible within a single process, using an in-memory channel for communication. 
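To make the idea concrete, here is a minimal sketch of a single-process simulation built around an in-memory channel. It is not NebulaStream's code: the `InMemoryChannel` class, the `simulate` function, the "sink" node name, and the sample readings are all hypothetical, and the seeded random delivery order is an assumed way of modelling network reordering.

```python
import random


class InMemoryChannel:
    """Hypothetical stand-in for a network link between nodes in one process."""

    def __init__(self, seed):
        # All nondeterminism comes from this seeded generator, so a run
        # can be repeated exactly by reusing the seed.
        self._rng = random.Random(seed)
        self._in_flight = []

    def send(self, dest, message):
        self._in_flight.append((dest, message))

    def pending(self):
        return bool(self._in_flight)

    def deliver_next(self):
        # Pick any in-flight message to model reordering on the wire.
        index = self._rng.randrange(len(self._in_flight))
        return self._in_flight.pop(index)


def simulate(seed, readings=(3, 1, 4, 1, 5)):
    """Send readings from a 'sensor' to a 'sink' and return (arrival order, sum)."""
    channel = InMemoryChannel(seed)
    for seq, value in enumerate(readings):
        channel.send("sink", (seq, value))

    arrival_order = []
    total = 0
    while channel.pending():
        _dest, (seq, value) = channel.deliver_next()
        arrival_order.append(seq)
        total += value
    return arrival_order, total
```

Running `simulate` twice with the same seed yields the same arrival order, while different seeds explore different orderings. In a sketch like this, a failing seed can simply be rerun to reproduce the failure.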

Undo’s thread fuzzing* brings them close to a deterministic simulation environment. The ability to control the network channel and vary thread interleavings allows them to explore many possible execution states, even in relatively small test setups. With sufficient resources, this setup could run thousands of simulations in parallel, helping uncover concurrency and network bugs before release.

*Thread Fuzzing is a feature of Undo’s LiveRecorder product. LiveRecorder records the runtime behavior of a program and saves it as an Undo recording so that it can later be replayed in Undo’s debugger. While recording, LiveRecorder allows only one thread to run at a time: the running thread holds a lock, and all other threads block on it. LiveRecorder regularly releases the lock to give other threads the opportunity to claim it and run.

When Thread Fuzzing is enabled, LiveRecorder varies the timing with which the lock is released and other threads may run. Several fuzzing strategies can be configured: see the fuzzing modes documentation for details.
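The effect of varying the release timing can be illustrated with a toy model. This is only a sketch and not how LiveRecorder is implemented: Python generators stand in for threads, each `yield` marks a point where the "lock" could be handed over, and the names `increment_worker`, `run_serialized`, `fixed_quantum`, and `fuzzed_quantum` are hypothetical.

```python
import random


def increment_worker(shared, rounds):
    """Simulated thread doing a non-atomic read-modify-write on a counter."""
    for _ in range(rounds):
        local = shared["counter"]        # read
        yield                            # possible switch point
        shared["counter"] = local + 1    # write, possibly with a stale value
        yield                            # possible switch point


def run_serialized(workers, next_quantum):
    """Run one worker at a time, switching after next_quantum() steps."""
    runnable = list(workers)
    current = 0
    while runnable:
        worker = runnable[current]
        finished = False
        for _ in range(next_quantum()):
            try:
                next(worker)
            except StopIteration:
                finished = True
                break
        if finished:
            runnable.pop(current)
            if runnable:
                current %= len(runnable)
        else:
            current = (current + 1) % len(runnable)


def final_counter(next_quantum, rounds=50):
    """Run two workers under the given switching policy; return the counter."""
    shared = {"counter": 0}
    workers = [increment_worker(shared, rounds) for _ in range(2)]
    run_serialized(workers, next_quantum)
    return shared["counter"]


def fixed_quantum():
    # Always switch on a round boundary, so the race never shows up.
    return 2


def fuzzed_quantum(seed):
    # Vary how long each worker runs before the switch (seeded, repeatable).
    rng = random.Random(seed)
    return lambda: rng.randint(1, 3)
```

With `fixed_quantum`, every switch falls between complete read-write pairs and all 100 increments survive. With `fuzzed_quantum(seed)`, a worker can be suspended between its read and its write, so some seeds lose updates, and rerunning the same seed reproduces the same loss.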

Impact

Debugging concurrency issues had previously meant long hours chasing failures that might never appear twice. With Undo, those “once in a blue moon” bugs became repeatable, and more importantly, solvable. Instead of days spent setting up debug sessions that went nowhere, the team was able to pinpoint the problem on the very first run.

This has given them confidence that concurrency bugs are reproducible and debuggable, rather than hiding until just before a release. It also frees the team to focus on research, while enabling new approaches, such as reproducible distributed testing, that would not otherwise be possible.

About the Undo Educational License Program

The Undo Educational License Program provides free licenses to students and academic researchers during their studies, helping future engineers learn modern debugging workflows and bring time travel debugging into real‑world projects.

Interested in taking part? Get in touch to learn more about eligibility and how to apply.
