The case of the broken pipes
"We used LiveRecorder to replay the code execution and we identified the problem in less than an hour."
Brian Janes, Senior Engineering Director, High Performance Computing at Altair
Altair's Accelerator product line powers the distributed computing infrastructure in high-stakes fields like semiconductor design and is expected to provide 24x7x365 service. One major customer hit a problem that caused a portion of their workload to fail, and in our industry every minute counts when getting customers back up and running.
We could not reproduce the problem internally, despite repeated attempts by several engineers. This is common in HPC: there are too many variables across corporate environments to accurately model each customer's setup. From our software's logs, we noticed a failure pattern in which a single PTY communication failure, triggered by some unknown external factor, caused every subsequent job executed on the same host to fail.
For this case, we asked the customer to use LiveRecorder to make a recording of our software's execution and send it back to us. The recording captured the bug. We replayed it and stepped through the code, just as we would when running live under a debugger, and identified the problem in less than an hour. We took the following steps:
- We located where the correct value was being set for one of the pipes, and set a watchpoint (aka data breakpoint)
- We reverse-continued back to the watchpoint
- We discovered that, upon encountering the PTY error, we were not correctly resetting all the pipes associated with that job for all possible failure cases
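The first two steps above correspond to a pair of standard reverse-debugging commands. A sketch of what such a session looks like, assuming the pipe descriptor lives in a field named `job->pipe_fd` (a hypothetical name, not Altair's actual code):

```
(udb) watch job->pipe_fd       # watchpoint (data breakpoint) on the pipe field
(udb) reverse-continue         # run backwards to the last write to that field
```

When the watchpoint fires going backwards, the debugger stops at the exact line of code that last modified the value, with the full program state at that moment available for inspection.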
Once that specific external factor was encountered, the next job would come along and reuse one of the pipes that the previous failed job had used. From there, the rest of the workload was doomed.
Would we have eventually figured out the magic incantation to reproduce this issue, or found the root cause via code inspection? Most likely… but how long would it have taken? We don't know, and that is why LiveRecorder is such a critical tool in our support arsenal.
When you are on the hook for processing 100% of a customer's workload, 100% of the time, issue-resolution time needs to be minimized. LiveRecorder enables us to significantly reduce bug-fix time and deliver a robust customer experience. For us, this is priceless. No one wants to tell customers they don’t know when their issue is going to get fixed. With LiveRecorder, we can provide tangible fix estimates down to the day. You can’t put a price on that certainty.
Aside from making debugging predictable, LiveRecorder makes the root-cause identification process significantly shorter; sometimes a LOT shorter. We once diagnosed a 20-year-old bug in just one day with LiveRecorder!
I cannot overstate our appreciation for LiveRecorder. Everyone who debugs C/C++ should be using time travel debugging. If you’re not using it, you’re just wasting time.