Technical details

The Undo Engine records only non-deterministic data, which is sufficient for it to be able to recreate the debuggee’s entire memory and registers on demand for any point in its execution.

To do this, it performs just-in-time (JIT) binary translation of the machine code as it executes, so that all sources of non-determinism can be captured.

The result of each non-deterministic operation is recorded in an event log, which is stored in the memory of the debugged application process. In most programs, these operations represent a tiny fraction of the instructions executed, so the event log can be very compact. Snapshots of the program address space are also stored; because they use copy-on-write, they are inexpensive as well. A session can then be replayed precisely by restoring the program starting state and running it forwards, re-executing only the deterministic operations; all non-deterministic operations are synthesised from what is stored in the event log.
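The record-then-synthesise idea can be illustrated with a small sketch. This is not the Undo Engine's implementation (which works at the machine-code level); the `EventLog` class, its `nondet` method and the `program` function are hypothetical names chosen for illustration, and a random number stands in for any non-deterministic result such as a system call return value.

```python
import random


class EventLog:
    """Toy event log (hypothetical): stores only non-deterministic results."""

    def __init__(self):
        self.events = []
        self.cursor = 0
        self.replaying = False

    def nondet(self, operation):
        """Run a non-deterministic operation, or synthesise it during replay."""
        if self.replaying:
            value = self.events[self.cursor]  # substitute the recorded result
            self.cursor += 1
            return value
        value = operation()                   # really perform the operation
        self.events.append(value)             # and record only its result
        return value

    def start_replay(self):
        self.replaying = True
        self.cursor = 0


def program(log):
    """Mostly deterministic work with an occasional non-deterministic input."""
    total = 0
    for i in range(1000):
        total += i * i                        # deterministic: simply re-executed
        if i % 250 == 0:
            total += log.nondet(lambda: random.randint(0, 99))
    return total


log = EventLog()
recorded = program(log)   # recording run
log.start_replay()
replayed = program(log)   # replay run: no new random numbers are drawn
```

In this sketch the program executes 1000 loop iterations, but only four values end up in the log, which mirrors the point that non-deterministic operations are usually a small fraction of the work.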

Recording non-deterministic events

Asynchronous signals are intercepted and recorded in user space using a combination of an interceptor signal handler and the standard kernel ptrace mechanism. Thread switches are handled using a patented design and implementation based on our instrumentation technology and standard kernel calls. Non-deterministic instructions are translated specially by our JIT engine, which records any non-deterministic side effects.

RAM and disk usage

The Undo Engine has been designed to avoid storing an excessive amount of data in order to reconstruct program execution. It keeps the required state to a minimum using techniques such as intelligent distribution of process snapshots through the history of the recording, and storage of only those non-deterministic events that cannot be reconstructed. Replaying the recording then simply requires re-running from appropriate snapshots, substituting stored events on the fly.
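As a rough illustration of re-running from snapshots, the sketch below reconstructs the state at any point by restoring the nearest earlier snapshot and re-executing forwards. Everything here is an assumption made for the example: the `Recording` class, the `step` function, and in particular the fixed snapshot interval, which is a simple stand-in for whatever snapshot distribution the Undo Engine actually uses.

```python
import bisect


def step(state, external):
    """Deterministic transition; `external` is a recorded input (0 if none)."""
    return (state * 31 + 7 + external) % 1_000_003


class Recording:
    """Toy recording (hypothetical): sparse events plus periodic snapshots."""

    def __init__(self, initial_state, interval):
        self.interval = interval
        self.snapshot_times = [0]
        self.snapshots = {0: initial_state}
        self.events = {}  # step index -> recorded non-deterministic input

    def record(self, steps, external_inputs):
        state = self.snapshots[0]
        for t in range(steps):
            if t in external_inputs:
                self.events[t] = external_inputs[t]   # store only what is needed
            state = step(state, self.events.get(t, 0))
            if (t + 1) % self.interval == 0:          # assumption: fixed interval
                self.snapshot_times.append(t + 1)
                self.snapshots[t + 1] = state
        return state

    def state_at(self, t):
        """State after `t` steps: nearest earlier snapshot, then re-run."""
        i = bisect.bisect_right(self.snapshot_times, t) - 1
        start = self.snapshot_times[i]
        state = self.snapshots[start]
        for k in range(start, t):
            state = step(state, self.events.get(k, 0))
        return state


inputs = {3: 42, 517: 9, 999: 5}
rec = Recording(initial_state=1, interval=100)
final = rec.record(1000, inputs)
midpoint = rec.state_at(550)   # restores the snapshot at 500, re-runs 50 steps
```

The trade-off shown is the usual one: more snapshots cost more memory but shorten the re-execution needed to reach an arbitrary point.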

Replay

When the Undo Engine replays the execution of a recorded process, it chooses an appropriate process snapshot and executes it, replacing any non-deterministic events with the data recorded in the event log. The implications of this are as follows:

  1. It doesn’t modify any system state outside of the memory of the debugged process.
  2. It doesn’t replay at the original speed, since there is a slight overhead in substituting the events from the event log.
  3. Currently it only allows replayed snapshots to read from the event log, and it prohibits them from generating their own events and creating an “alternative version of history”. If it allowed this, the internal program state would become inconsistent with the external state (for example, in a TCP networking scenario, new packets would need to be sent which the TCP peer would not expect).

Aside from these limitations, the behaviour of the debugged process is “as if” the system state has been rolled back and re-executed from that point.
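The third limitation above can be pictured as a read-only log: replay may consume recorded events, but any attempt to go beyond them is refused. The sketch below is purely illustrative; `ReadOnlyEventLog`, `ReplayDivergence` and the `"recv"` event kind are hypothetical names, and raising an exception is an assumption about how such a rule might be enforced, not a description of the Undo Engine's behaviour.

```python
class ReplayDivergence(RuntimeError):
    """Raised when replay asks for an event that was never recorded."""


class ReadOnlyEventLog:
    """Toy replay-side log (hypothetical): events can be read, never added."""

    def __init__(self, events):
        self._events = tuple(events)  # immutable copy of the recorded history
        self._cursor = 0

    def next_event(self, kind):
        if self._cursor >= len(self._events):
            raise ReplayDivergence(f"no recorded event left for {kind!r}")
        recorded_kind, value = self._events[self._cursor]
        if recorded_kind != kind:
            raise ReplayDivergence(
                f"recorded {recorded_kind!r}, but replay requested {kind!r}"
            )
        self._cursor += 1
        return value


# Two packets were received during recording.
log = ReadOnlyEventLog([("recv", b"hello"), ("recv", b"world")])
first = log.next_event("recv")
second = log.next_event("recv")

# A third receive would create an "alternative version of history".
try:
    log.next_event("recv")
    diverged = False
except ReplayDivergence:
    diverged = True
```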

Shared memory

The Undo Engine is aware of any shared memory maps in the debugged application, and if the application accesses these, an event is written to the event log in the same way as if a system call were executed.

Multi-threaded applications

Threads are tasks that execute concurrently within a shared address space. The interaction of threads is often non-deterministic and this is a common source of bugs. The following paragraphs explain how the Undo Engine handles threads.

UndoDB supports concurrent threads, and therefore debuggees can use all normal threading capabilities made available by the system. However, in order to achieve deterministic record/replay, UndoDB serialises the execution of threads, as if they were running on a uniprocessor system.

The Undo Engine allows each thread to run independently, but imposes a global mutex so that only a single thread can execute at a time. Thread preemption is handled by the kernel as normal, with the proviso that thread switches are permitted only after certain intervals. In this way, the Undo Engine can still be used to debug many types of race condition.
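A minimal sketch of serialised threads follows, assuming a hypothetical `SerialScheduler` class. A single lock plays the role of the global mutex, each call to `run_slice` stands for one interval between permitted thread switches, and the order of slices is recorded so that a later run can be forced to follow the same order. The real mechanism operates on native threads through instrumentation, so this only conveys the general shape.

```python
import threading


class SerialScheduler:
    """Toy global mutex (hypothetical): one thread runs at a time, order is logged."""

    def __init__(self, schedule=None):
        self._cond = threading.Condition()
        self._replay = schedule is not None
        self.schedule = list(schedule) if schedule is not None else []
        self._pos = 0

    def _my_turn(self, name):
        return self._pos < len(self.schedule) and self.schedule[self._pos] == name

    def run_slice(self, name, work):
        with self._cond:                       # global mutex: serialises threads
            if self._replay:
                # Wait until the recorded schedule says this thread runs next.
                self._cond.wait_for(lambda: self._my_turn(name))
                self._pos += 1
            else:
                self.schedule.append(name)     # record the thread switch order
            work()
            self._cond.notify_all()


def run(scheduler, names=("A", "B", "C"), slices=5):
    trace = []

    def body(name):
        for i in range(slices):
            scheduler.run_slice(name, lambda: trace.append((name, i)))

    threads = [threading.Thread(target=body, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trace


recorder = SerialScheduler()
recorded_trace = run(recorder)                 # interleaving chosen by the OS
replayer = SerialScheduler(recorder.schedule)
replayed_trace = run(replayer)                 # same interleaving, enforced
```

Whatever interleaving the operating system happens to choose during recording, the replay run reproduces it, which is what makes an intermittent ordering bug repeatable.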

If there are synchronisation problems in the original process being recorded, these will also be present when replaying the recording of that process. Likewise, if the original process has no synchronisation problems, none will appear during replay. In other words, the Undo Engine doesn’t introduce any synchronisation problems, but it may help to expose existing ones in your application.

Source code and debug symbols

The Undo Engine works without source code or DWARF debuginfo, because it works by instrumenting the program binary on the fly. Of course, it’s useful to have source code or DWARF debuginfo when debugging a recording, but these can be referenced offline, and do not need to be on the system on which the recording was made.