
How to Improve Software Quality in SAP HANA

If you’ve never heard of SAP HANA, it’s essentially an in-memory database management system that processes massive amounts of data in real time. Understandably, building such a system requires quite a complex code base, and testing it can be a pretty scary task.


So how do you go about testing it?

Well, SAP’s answer to this is to not only run regular end-to-end functional and performance tests, but to complement them with something they like to call PMUT.

Put simply, PMUT is highly parallel, multi-user stress testing. The approach is designed to be explorative and pseudo-randomised. Because HANA is multithreaded and stateful, successive runs can behave non-deterministically, which lets PMUT capture sporadic failures that other tests do not detect.
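To make the idea concrete, here is a minimal sketch of a pseudo-randomised, multi-user stress run. It is not SAP's PMUT harness: `ToyStore`, `make_workload` and `run_stress` are hypothetical names invented for illustration, and the "database" is just a dictionary behind a lock. The point is that the workload is reproducible from a seed, while the order in which threads reach the store is not.

```python
import random
import threading


def make_workload(seed, users, ops_per_user):
    """Build one pseudo-random operation list per simulated user."""
    rng = random.Random(seed)  # same seed -> same workload
    keys = ["a", "b", "c"]
    return [
        [(rng.choice(["put", "get", "delete"]), rng.choice(keys))
         for _ in range(ops_per_user)]
        for _ in range(users)
    ]


class ToyStore:
    """Hypothetical stand-in for the system under test (not a HANA API)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()
        self.log = []  # global order in which operations were applied

    def apply(self, user, op, key):
        with self._lock:
            if op == "put":
                self._data[key] = user
            elif op == "delete":
                self._data.pop(key, None)
            else:
                self._data.get(key)
            self.log.append((user, op, key))


def run_stress(seed, users=8, ops_per_user=200):
    """Run every user's operations in its own thread against one store."""
    workload = make_workload(seed, users, ops_per_user)
    store = ToyStore()

    def worker(user, ops):
        for op, key in ops:
            store.apply(user, op, key)

    threads = [threading.Thread(target=worker, args=(user, ops))
               for user, ops in enumerate(workload)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return workload, store.log
```

Two runs with the same seed issue exactly the same operations, but the interleaving recorded in `log` may differ from run to run. That scheduling freedom is what lets a stress test stumble on sporadic failures, and also what makes them hard to reproduce.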

So great! HANA now has sufficient testing to unveil even the nastiest of bugs that could catch up with it in production. However, a new, potentially even more stomach-turning problem has arisen…

As many developers will know, non-deterministic behaviour makes reasoning very hard. Finding the root cause of a non-deterministic failure can take days or even weeks, because reproducing the problem is exceedingly time-consuming. So even though we’re finding these intermittent bugs in the software, what about actually locating the root cause and fixing the problem?


So how do you go about debugging?

You may have heard of the interesting analogy popularized by Brian Kernighan and Rob Pike, in which debugging is compared to a murder mystery where backwards reasoning must be used. Something impossible occurred, and the only solid information is that it really did occur. So we must think backwards from the result to discover the reasons.

So what’s the easiest way to solve a murder?

Check the security footage.

In this talk from DBTest (a workshop of SIGMOD, a leading international forum for database researchers and developers), Undo’s co-founder and CEO Greg Law and Stefan Bäuerle, Chief Development Architect at SAP, explain how.

As explained in the video above, SAP’s engineers are checking the ‘security footage’ using Undo’s Live Recorder, which offers record & replay technology. It works by capturing all non-deterministic stimuli during execution. Any sporadic failure that occurred during the recorded execution can then be replayed and rewound, making it much easier to discover and fix the root cause. In essence, these bugs are made 100% reproducible, meaning it doesn’t matter that they occur differently each time: they can be solved at their first instance.
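The principle can be sketched in a few lines. This is a toy illustration only and says nothing about how Live Recorder is actually implemented: `Recorder`, `Replayer` and `flaky_job` are hypothetical names, and the only "non-deterministic stimuli" here are a random number and a clock reading.

```python
import random
import time


class Recorder:
    """Logs every non-deterministic value as the program obtains it."""

    def __init__(self):
        self.events = []

    def nondet(self, name, produce):
        value = produce()
        self.events.append((name, value))
        return value


class Replayer:
    """Feeds back previously recorded values instead of asking the real world."""

    def __init__(self, events):
        self._events = iter(events)

    def nondet(self, name, produce):
        recorded_name, value = next(self._events)
        if recorded_name != name:
            raise RuntimeError(f"replay diverged: expected {recorded_name}, got {name}")
        return value


def flaky_job(env):
    """Toy program whose result depends on non-deterministic inputs."""
    noise = env.nondet("random", lambda: random.randint(0, 9))
    stamp = env.nondet("clock", time.time_ns)
    return (noise * 31 + stamp) % 97


# Record one run, then replay it: the replay follows the same path.
recorder = Recorder()
original = flaky_job(recorder)
replayed = flaky_job(Replayer(recorder.events))
```

Because every outside influence is served from the log, `replayed` always equals `original`, however the live run happened to turn out. A failure seen once can therefore be re-examined as often as needed.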

So how horrible can these bugs really get? Well, generally speaking, the difficulty can be plotted along two axes, as shown on the left. When you’re staring down the barrel of a bug in the top right of this graph, you could be staring for some time...

[Figure: SAP_talk_blog.jpg]

The SAP HANA engineering team managed to catch a memory corruption defect before it got shipped to customers. They did this by using the Live Recorder solution and diagnosing the root cause with its UndoDB reverse-debugging functionality. With memory corruption, there can be a significant length of time between the point of corruption (when the piece of memory is actually overwritten incorrectly by the program) and the point where you realize this has happened. The symptoms can surface in seemingly unpredictable ways, which is what makes corruption bugs behave non-deterministically.
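The gap between corruption and detection can be mimicked with a toy model. Everything below is a hypothetical illustration, not UndoDB's interface: `Arena` is a flat byte array with a write journal, and `last_writer` walks that journal backwards, roughly the way one might run backwards from a symptom to the last write that touched the damaged address.

```python
class Arena:
    """Toy flat memory with a write journal (hypothetical, for illustration)."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.journal = []  # (step, who, offset, length)

    def write(self, who, offset, data):
        # No per-buffer bounds check: mimics an unchecked C-style write.
        self.mem[offset:offset + len(data)] = data
        self.journal.append((len(self.journal), who, offset, len(data)))

    def last_writer(self, address):
        """Walk the journal backwards to the latest write touching address."""
        for step, who, offset, length in reversed(self.journal):
            if offset <= address < offset + length:
                return step, who
        return None


def demo():
    arena = Arena(16)                     # buffer A: bytes 0-7, buffer B: bytes 8-15
    arena.write("init_b", 8, b"B" * 8)    # step 0: B holds valid data
    arena.write("fill_a", 0, b"A" * 10)   # step 1: bug, 10 bytes into an 8-byte buffer
    arena.write("update_a", 0, b"a" * 4)  # step 2: unrelated work carries on
    arena.write("update_a", 4, b"a" * 4)  # step 3: still nobody has noticed
    # Only now does someone read B and find it damaged.
    expected = b"B" * 8
    actual = bytes(arena.mem[8:16])
    bad = next(8 + i for i in range(8) if actual[i] != expected[i])
    return bad, arena.last_writer(bad)
```

The damage is noticed after step 3, but the backwards search points at step 1 and `fill_a`, the write that actually overran its buffer. Going forwards from the start, there is no obvious reason to suspect that write at all.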

Fortunately for us, software testing has improved drastically across the software engineering industry over the last decade, and teams now run orders of magnitude more tests in short periods of time.

Unfortunately for us, the implication of this is that even a small percentage of failing tests is a very large number. And worse, you only need one critical bug in there to potentially cause havoc at a customer site and pose a major risk to your organisation: reputational damage, a plunging share price or the loss of a key client.

Traditional debugging techniques aren’t good enough to cope with increased complexity and a growing backlog of undiagnosed failing tests. Software execution record & replay offers a solution to this conundrum.

The bottom line is: SAP does not lead the in-memory database revolution by developing software the same way it was done 10 years ago. The SAP HANA engineering team is at the vanguard of this new approach to debugging, allowing them to release a superior quality database application to their customers which can withstand the pressure the database is under when out in the field. Learn more about Live Recorder.