Accelerating Debugging in Networking Scaled Environment Labs

In the networking industry, it’s common to test products in a scaled environment lab. Software engineers use this lab to replicate key customer configurations (with complex, heavy loads) and to debug customer-reported defects. Typical defects investigated in these scaled labs include logic errors, thread starvation, race conditions, resource exhaustion, out-of-order execution, and out-of-memory conditions.

These labs tend to be very scarce resources. Sometimes there is only one lab shared between hundreds of engineers. Engineers who need to investigate a defect have to book the lab in advance, often wait 2–4 weeks for access, and are then allowed no more than 2 hours on the device. In other words, access is hard to get, and once engineers have it, they have very little time to debug the issue live.

This device capacity issue is a real blocker: it prevents engineers from rapidly troubleshooting customer issues and negatively impacts customer experience.

What if engineers could record the process failure in the scaled environment once, then debug the recording offline? 

LiveRecorder equips engineers with a time travel debugging capability that allows them to do just that.

Here is what their new workflow looks like:

  1. Reproduce the issue and record the failing process using LiveRecorder – the recording captures an exact replica of that failed process down to the instruction level.
  2. Replay the recording on a generic server using LiveRecorder’s replay component (UDB) – enabling engineers to move forward and backward in the recording to rapidly root-cause the issue.
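The record-once, replay-offline idea behind the two steps above can be illustrated with a toy model. This is only a conceptual sketch in Python and does not use or represent LiveRecorder's actual mechanism or API: the names `toy_process`, `record`, `replay`, and `rewind_to_first_bad` are hypothetical, and the "recording" here is simply a log of the nondeterministic inputs the toy process consumed.

```python
import json
import random


def toy_process(next_input, steps):
    """Toy 'process': consumes nondeterministic inputs, returns its state after each step."""
    total = 0
    states = []
    for step in range(steps):
        value = next_input()  # the only source of nondeterminism in this toy
        total += value
        states.append({"step": step, "input": value, "total": total})
    return states


def record(steps=5, seed=None):
    """Run once 'in the lab', logging every nondeterministic input."""
    rng = random.Random(seed)
    log = []

    def next_input():
        value = rng.randint(-10, 10)
        log.append(value)
        return value

    live_states = toy_process(next_input, steps)
    # The serialized log plays the role of the portable recording.
    return json.dumps(log), live_states


def replay(recording):
    """Re-run 'offline', feeding the logged inputs back in the same order."""
    inputs = json.loads(recording)
    pending = iter(inputs)
    return toy_process(lambda: next(pending), len(inputs))


def rewind_to_first_bad(history, is_bad):
    """Walk backward from the final state to where the trailing run of bad states began."""
    if not history or not is_bad(history[-1]):
        return None
    index = len(history) - 1
    while index > 0 and is_bad(history[index - 1]):
        index -= 1
    return index


# Record once, then replay as often as needed without touching the 'lab'.
recording, live_states = record(steps=8, seed=42)
history = replay(recording)
assert history == live_states  # the replay reproduces the original run exactly
```

Because the replay is deterministic, the full state history is available offline, so a failure seen at the end of the run can be traced backward to the step where things first went wrong, without rebooking the scarce lab device.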

Recordings are portable: LiveRecorder supports cross-machine replay (thanks to its in-process virtualization engine), provided that the recording and replay machines share the same CPU architecture.

In-Process Virtualization Marchitecture diagram

This cross-machine replay capability enables engineers to rapidly free up lab devices, as the debugging no longer needs to be performed on the specific device itself. 

Time travel debugging technical paper Feb 2022