Routing engineer resolves memory corruption bug in under 10 min

“We fixed this issue in 5-10 minutes, which would have initially taken up to 5 days.” Routing Software Developer, Networking Equipment Manufacturer

I work on middleware software for our routers. The codebase is written in C (legacy code) and C++ and pulls in many libraries from different teams (middleware libraries, application libraries, etc.).

We had a bug in our legacy C code where a process was attempting to use memory that had already been freed (a use-after-free defect), causing it to crash. The underlying issue had been there for months; particular timing conditions brought it to light.
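For readers less familiar with this class of defect, here is a minimal sketch in C. It is purely illustrative and not taken from the router codebase: the names `route_entry`, `pending_timer`, `timer_fire`, `route_delete` and `demo_timer_after_delete` are all hypothetical. It assumes a common pattern in which a pending timer holds a borrowed pointer to an object that another code path frees. The comment marks where the stale access would occur, and the teardown function shows one possible fix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical types, for illustration only. */
struct route_entry {
    int metric;
};

struct pending_timer {
    struct route_entry *entry; /* borrowed pointer: the timer does not own it */
};

/* Timer callback. Returns the entry's metric, or -1 if the timer
 * no longer refers to an entry. */
static int timer_fire(const struct pending_timer *t)
{
    if (t->entry == NULL)
        return -1;
    /* If the entry had been freed without clearing t->entry, this read
     * would be a use-after-free: it may appear to work for months and
     * then crash once timing changes. */
    return t->entry->metric;
}

/* Teardown that avoids the dangling pointer: drop the timer's reference
 * before freeing. The buggy variant would call free(e) alone. */
static void route_delete(struct pending_timer *t, struct route_entry *e)
{
    if (t->entry == e)
        t->entry = NULL;
    free(e);
}

/* Exercises the safe path. Returns 0 if the timer sees the live entry
 * before deletion and nothing after it, -2 if allocation fails. */
int demo_timer_after_delete(void)
{
    struct route_entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -2;
    e->metric = 10;

    struct pending_timer t = { .entry = e };
    int before = timer_fire(&t); /* entry is still alive */
    route_delete(&t, e);
    int after = timer_fire(&t);  /* reference was cleared, no stale read */

    return (before == 10 && after == -1) ? 0 : 1;
}
```

In a real system the free and the later use are typically far apart, often in different libraries, which is why the culprit is hard to identify from a core dump alone.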

This class of defect is a release blocker; crashes in the underlying infrastructure library often have system-wide impact, and it’s extremely important for my team to triage and fix them fast.

Our default approach was to spin up a virtual router and recreate the use case; once we got the crash to reproduce, we took the core dump and passed it around between teams to find out who caused the bug and get a fix. Unfortunately, it could take a long time to get a fix this way.

Another approach was to use address sanitizers; they can tell you roughly where the problem resides, but they are really painful for us to set up and it is difficult to get something useful from them. On top of that, we don’t get symbols (LLVM is not part of our build infrastructure).

So I decided to change tack: I made a LiveRecorder recording of the failing process, capturing the bug in the act, and replayed the recording file in UndoDB (LiveRecorder’s replay engine). I then put a watchpoint on the memory pointer and worked backwards through the recording towards the root cause. I quickly found where the crash happened and why.

With LiveRecorder, the issue was diagnosed in 5-10 minutes. Previously, this kind of issue would take an unpredictable amount of time, up to 5 days.

To use LiveRecorder, the Tools team invested about an hour to integrate UndoDB into our standard developer environment.

Thankfully, we resolved the issue in development before it made it into the release build, and we were able to reduce time-to-resolution by over 200x. This saves me so much time that I expect to turn to LiveRecorder more frequently than I initially anticipated.