Debugging a complex multithreaded codebase like SAP HANA
“In total we are dealing with around 5 million lines of productive C++ code. The application itself is highly multithreaded (which comes with the territory of being a DBMS), and as it is very rich in its feature scope, there are many components that must interact within the same process.” Andreas Erz, Software Developer, SAP HANA
Prof. Roberto V. Zicari of ODBMS.org interviewed Andreas Erz on how he and his team debug a complex multithreaded codebase like SAP HANA. The interview below is an extract of the original published on ODBMS.org.
Q1. Can you tell us a bit about the codebase you’re working on at SAP HANA Cloud?
SAP HANA was released to the market in 2010 and is now serving more than 70,000 customers globally. Its successor, SAP HANA Cloud, is not only our strategic offering but also our flagship product, as it provides cloud-native qualities with the power and performance known from SAP HANA. Over the course of its history, many teams have contributed to the codebase, and sometimes other products have been assimilated to work as part of SAP HANA Cloud. Parts of the code even pre-date the development of SAP HANA Cloud, as they were taken over from previous SAP products. In total we are dealing with around 5 million lines of productive C++ code. The application itself is highly multithreaded (which comes with the territory of being a DBMS), and as it is very rich in its feature scope, there are many components that must interact within the same process.
Q2. How do you debug such a complex data management system?
Well, ideally, you don’t. As with any other database management system on the market, when you start debugging, it means the defences that prevent bugs from making it into the product in the first place have room for improvement. The first line of defence is our significant investment in code quality and developer education. Additionally, we safeguard our codebase with a multitude of static analyses using various tools, both homegrown and from third parties. Of course, each change also undergoes extensive unit and integration testing. A lot of potential issues are found by these tests, and the fix is often obvious from the traces alone or easily understood by running and debugging the respective unit test. The most troublesome issues, however, are those only revealed by our randomised testing. These are sporadic in nature, and often both hard to understand and hard to reproduce. In the context of these sporadically failing randomised tests, time travel debugging is essential. At SAP HANA, we use Undo LiveRecorder for that. It allows us to capture the exact sequence of inputs and executed CPU instructions of a test execution. This lets us uphold the high quality standards for SAP HANA Cloud which our customers expect from our technology.
Q3. What is time travel debugging?
As I just mentioned, the technology developed by Undo can record the exact sequence of inputs and CPU instructions that make up the execution of a process. You can go back to any point in the execution history and inspect the complete state of the process at that point (including the contents of all registers and memory locations). It is like travelling backwards in time, hence the term time travel debugging.
Q4. It sounds cool, but why is that so much better than traditional troubleshooting methods?
A lot of challenges come down to variations of the following question: why is my precious flag 0 even though it is supposed to be 1? Did I overlook something in my logic? Did somebody else dare to change it? Maybe a hardware issue, i.e. a faulty CPU or RAM? After the fact, it is tough to find these answers. A long and tedious process starts: you try to reproduce the issue, add traces and debug code to test your assumptions, run the test again (often many times) until you reproduce the issue, add more traces and debug code, rinse and repeat, until you have finally understood why your flag is 0 and not 1.
Having a recording of a problematic execution, however, makes all this obsolete. Just go to the end of the recording, set a watchpoint on the memory location that was mysteriously changed, tell the debugger to reverse-continue and, voilà, you land at the exact instruction that changed the memory. Even if it is not possible to narrow down the issue to a single change of a variable, stepping forward and backward in an execution allows me to quickly explore complex interactions, especially between multiple threads and components.
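To make the scenario concrete, here is a minimal, hypothetical C++ sketch (not SAP HANA code, with made-up names like `ready`) of the kind of bug described above. The comments show how GDB-style commands such as `watch` and `reverse-continue`, which time travel debuggers expose on a recording, would take you straight from the symptom to the offending write instead of relying on guesswork.

```cpp
// Hypothetical illustration, not SAP HANA code: another thread clears a
// flag that the main thread assumes stays set.
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<bool> ready{true};   // "my precious flag", supposed to stay 1

void misbehaving_worker() {
    // Buried deep in another component, someone clears the flag.
    ready.store(false);          // <-- the mysterious write
}

int main() {
    std::thread t(misbehaving_worker);
    t.join();

    // On a recording of this run you could ask the debugger, for example:
    //   watch ready            (watchpoint on the flag's memory location)
    //   reverse-continue       (run backwards to the last write)
    // and land directly on the store() above.
    if (!ready.load()) {
        std::cout << "flag is 0 even though it is supposed to be 1\n";
    }
}
```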
Q5. You described SAP HANA Cloud as a highly multithreaded application. You must run into some really tricky concurrency challenges. How does LiveRecorder help with these?
You are right: what makes it to the randomised testing stage is in one way or another related to concurrency, and that is always tricky. That is also why these issues surface so late in the testing game, as there is a myriad of thread interleavings to consider.
With LiveRecorder, we can eliminate any guesswork because we have visibility into how the threads have interacted. But even with perfect reproducibility, you must run these tests many, many times to hit certain edge cases. At least you had to, until Undo introduced an advanced feature for LiveRecorder called thread fuzzing.
Q6. Can you tell us more about how you use thread fuzzing internally in practice?
As LiveRecorder has complete control over the execution, it can also modify how the threads of a process are scheduled. For example, you can instruct it to randomly starve threads, which helps to trigger edge cases in producer-consumer scenarios. Another way to trigger otherwise rare interleavings is to tell LiveRecorder to yield threads around lock and sync instructions. This is helpful, as these instructions are often used to implement synchronisation primitives. Whenever we suspect a multithreading issue, we switch on thread fuzzing, and more often than not, it generates an unforeseen interleaving which we can then analyse using the Undo toolchain.
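As an illustration of the kind of edge case such scheduling manipulation is good at flushing out, here is a small, hypothetical bounded-buffer sketch (not SAP HANA code; `kCapacity` and the other names are made up). The capacity check happens outside the lock, so two producers can both conclude there is room and push, breaking the capacity invariant. The window is so narrow that normal runs almost never hit it, which is exactly the sort of interleaving that starving threads or yielding them around the lock tends to expose.

```cpp
// Hypothetical producer-consumer sketch with a check-then-act bug.
#include <atomic>
#include <cassert>
#include <cstddef>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

constexpr std::size_t kCapacity = 8;

std::mutex m;
std::queue<int> buffer;
std::atomic<bool> producers_done{false};

void producer(int id) {
    for (int i = 0; i < 100000; ++i) {
        if (buffer.size() < kCapacity) {            // BUG: capacity check outside the lock
            std::lock_guard<std::mutex> lock(m);
            buffer.push(id);
            assert(buffer.size() <= kCapacity);     // fires only under an unlucky interleaving
        }
    }
}

void consumer() {
    for (;;) {
        {
            std::lock_guard<std::mutex> lock(m);
            if (!buffer.empty()) { buffer.pop(); continue; }
            if (producers_done.load()) return;      // drained and producers finished
        }
        std::this_thread::yield();                  // nothing to do yet, give up the CPU
    }
}

int main() {
    std::thread c(consumer), p1(producer, 1), p2(producer, 2);
    p1.join();
    p2.join();
    producers_done.store(true);
    c.join();
    std::cout << "finished without tripping the capacity assert\n";
}
```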
Q7. Can you give us an example of what thread fuzzing allowed you to achieve recently?
Sure. Just recently it revealed a very sporadic race condition that allowed a write to an already deallocated object. Thread fuzzing made the issue easily reproducible, and within an hour we could determine the lifecycle issue that caused it.
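For illustration only, here is a hypothetical sketch of that class of bug (not the actual SAP HANA defect, and `Session` is an invented type): one thread destroys an object while another still writes through a stale pointer to it. Whether the write lands before or after the delete depends purely on scheduling, which is why such defects are so sporadic unless the scheduler is deliberately perturbed.

```cpp
// Hypothetical lifecycle bug: a write to an already deallocated object.
#include <thread>

struct Session {
    int request_count = 0;
};

int main() {
    Session* session = new Session();

    // The worker still believes the session is alive and updates it.
    std::thread worker([session] {
        session->request_count++;   // may execute after the delete below
    });

    delete session;                 // the owner tears the object down concurrently
    worker.join();
}
```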
Q8. How has having these capabilities changed the way you think about debugging SAP HANA Cloud?
Before I knew about time travel debugging, I thought about debugging as applying the empirical method: form a hypothesis that explains the erroneous behaviour (which involves a lot of static code analysis, sometimes staring at the code for days), instrument the code with asserts and traces to decide whether the hypothesis is consistent with the observed behaviour, refine the hypothesis accordingly, and repeat until, after many iterations, you (maybe) figure out the root cause.
With time travel debugging, I think about it more as becoming Laplace’s demon. The main task now is to create a recording of the issue (which might still require some ingenuity), but once that is done, the past and future of the application you are debugging are like an open book, just waiting for you to read it.
To the SAP HANA Cloud engineering team, this technology has become essential to continue ensuring stable releases. We take software quality very seriously and regularly invest in the latest technologies enabling us to catch defects before they affect our customers.
Q9. How would you feel if asked to help tackle similar challenges on another project elsewhere without capabilities like time travel debugging and thread fuzzing?
Not good, to be honest. The time saved by having time travel debugging and thread fuzzing in our toolkit is substantial. Especially when dealing with a large codebase, there is another use case for which I would miss time travel debugging: what I call explorative debugging. I often just run code I want to understand in the debugger, and being able to go back and forth in the program flow is a much more natural way to do that, as well as a huge time saver.
The original interview was published by Prof. Roberto Zicari on ODBMS.org.