Accelerating Debugging in Networking Scaled Environment Labs
In the networking industry, it’s common to test products in a scaled environment lab. Software engineers use this lab to replicate key customer configurations (with complex, heavy loads) and to debug customer-reported defects. Typical defects investigated in these scaled labs include logic errors, thread starvation, race conditions, resource exhaustion, out-of-order execution, and out-of-memory conditions.
These labs tend to be very scarce resources. Sometimes a single lab is shared among hundreds of engineers. Engineers who need to investigate a defect have to book the lab in advance, often wait 2–4 weeks for access, and are allowed no more than 2 hours once they get it. In other words, time on the lab device is very limited, which leaves little opportunity to debug an issue live.
This device capacity issue is a real blocker: it prevents engineers from rapidly troubleshooting customer issues and negatively impacts the customer experience.
What if engineers could record the process failure in the scaled environment once, then debug the recording offline?
LiveRecorder equips engineers with a time travel debugging capability that lets them do just that.
Here is what their new workflow looks like:
- Reproduce the issue and record the failing process using LiveRecorder – the recording captures an exact replica of that failed process down to the instruction level.
- Replay the recording on a generic server using LiveRecorder’s replay component (UDB) – enabling engineers to move forward and backward in the recording to rapidly root-cause the issue.
Recordings are portable: thanks to its in-process virtualization engine, LiveRecorder supports cross-machine replay, provided that the recording machine and the replay machine share the same CPU architecture.
This cross-machine replay capability enables engineers to rapidly free up lab devices, as the debugging no longer needs to be performed on the specific device itself.
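The same-architecture condition for cross-machine replay can be illustrated with a small pre-flight check that a team might run before copying a recording to a generic server. This is only a sketch under stated assumptions: it is not part of LiveRecorder or UDB, the function names and the alias table are hypothetical, and it assumes the engineer noted the lab device's architecture (for example, the output of `uname -m`) when the recording was made.

```python
import platform
from typing import Optional

# Hypothetical alias table: operating systems report the same CPU
# architecture under different names (e.g. "AMD64" vs "x86_64").
_ARCH_ALIASES = {
    "amd64": "x86_64",
    "x64": "x86_64",
    "arm64": "aarch64",
}


def normalize_arch(name: str) -> str:
    """Map an architecture name to a canonical lowercase form."""
    key = name.strip().lower()
    return _ARCH_ALIASES.get(key, key)


def can_replay_here(recorded_arch: str, host_arch: Optional[str] = None) -> bool:
    """Return True if a recording made on `recorded_arch` matches the host.

    `recorded_arch` is assumed to have been noted when the recording was
    made; this sketch does not read it from the recording file itself.
    If `host_arch` is omitted, the current machine's architecture is used.
    """
    if host_arch is None:
        host_arch = platform.machine()
    return normalize_arch(recorded_arch) == normalize_arch(host_arch)
```

For example, `can_replay_here("x86_64", "AMD64")` evaluates to `True`, while `can_replay_here("x86_64", "aarch64")` evaluates to `False`, so a recording from an x86-64 lab device would be routed to an x86-64 replay server rather than an Arm one.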