Improving software quality in SAP HANA with LiveRecorder
SAP is the market leader in enterprise application software. It helps 437,000 businesses across 180 countries manage their business operations and customer relations.
Its flagship product is SAP HANA - a scalable, heavily multi-threaded, feature-rich in-memory database built from millions of lines of highly-optimized Linux C++ code. SAP HANA forms the foundation of SAP’s technology stack and its product portfolio. It is the backbone of major businesses worldwide - making quality, stability and reliability a core requirement for the engineering team.
SAP HANA - a comprehensive approach to testing
In a concerted drive to improve software quality, SAP uses fuzz testing as part of their routine QA process. Fuzz testing is a technique in which randomised test behaviours are presented to the system under test, making it possible to catch corner case bugs that were not anticipated by the system’s designers. This approach allows developers to find and fix failures - even highly obscure ones - before their users encounter them.
The demanding fuzz tests that validate builds of SAP HANA exercise the database server with randomly generated simulated user workloads. Combined with internal and external consistency checks, this provides a means to discover errors that would not be revealed by more traditional testing approaches.
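The idea of pairing random workloads with consistency checks can be illustrated with a minimal sketch. It is not SAP's test harness: `ToyStore`, its planted bug and the `fuzz` helper are all hypothetical names invented for this illustration. A seeded random generator drives a toy key-value store, and every step is checked against a plain dictionary acting as the reference model.

```python
import random


class ToyStore:
    """Hypothetical system under test: a key-value store with a planted bug."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        # Planted bug (for illustration only): deleting key 7 is silently ignored.
        if key != 7:
            self._data.pop(key, None)

    def get(self, key):
        return self._data.get(key)


def fuzz(seed, steps=200):
    """Run one random workload; return the index of the first failing step, or None."""
    rng = random.Random(seed)  # seeded, so a failing workload can be re-run
    store, model = ToyStore(), {}
    for step in range(steps):
        op = rng.choice(["put", "delete", "get"])
        key = rng.randrange(10)
        if op == "put":
            value = rng.randrange(1000)
            store.put(key, value)
            model[key] = value
        elif op == "delete":
            store.delete(key)
            model.pop(key, None)
        # Consistency check: the store must agree with the reference model.
        if store.get(key) != model.get(key):
            return step
    return None


# Try a handful of seeds and collect those whose workload exposed the bug.
results = {seed: fuzz(seed) for seed in range(20)}
failing_seeds = [seed for seed, step in results.items() if step is not None]
```

In this toy the seed is enough to replay a failing workload. In a heavily multi-threaded server the same workload can still behave differently from run to run, which is why failures found this way can be hard to reproduce.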
However, the resulting test failures proved challenging to diagnose, due to a set of factors that are familiar to modern software vendors:
- Complex control flow
Difficult to make inferences about how a failure unfolded
- Huge code base
Collaboration across teams is essential to pinpoint a bug
- Non-deterministic failures
Difficult to reproduce reliably in order to investigate the root cause
These traits intersect: the need for cross-team collaboration makes it vital to have a reliable reproducer of the problem so that developers can work together on the same faulty behaviour, each assisting with a part of the puzzle. However, the non-deterministic nature of many of SAP HANA’s test failures makes collaboration challenging for the SAP HANA team because the failures cannot be reliably reproduced on developers’ machines.
Before approaching Undo, SAP developers investigated test failures using three primary methods to identify their root cause:
- Analysing logs from failed runs
These helped to produce a partial picture as to why a failure happened but often did not capture enough of the right information for the root cause to be easily identifiable.
- Reproducing failures on live systems
For complex problems that needed to be debugged within a running system, a developer had to reproduce the original failure on a live system, which for rare faults was a time-consuming and unproductive use of resources.
- Developer collaboration
When complex problems could not be solved using the above methods, several developers with specialist knowledge would work together to figure out the source of the problem. Without the ability to reliably reproduce a failure on more than one machine, the developers often didn’t see the same program behaviour.
Introducing Undo’s LiveRecorder
LiveRecorder for Automated Test can record all or part of a Linux (or Android) program’s execution for subsequent replay and analysis. It captures an exact recording of why a test failed, allowing developers to go back in time to any instruction in the program’s history and view the contents of any location in memory and any register. Recordings can be shared among developers and analysed on a different machine to the one on which the error occurred, making triage and analysis of failures much quicker, easier and more effective.
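How LiveRecorder captures an execution is not described here, so the following is only a toy illustration of the general record-and-replay principle: if every non-deterministic input is logged during the original run, a later run fed from that log follows exactly the same path. The `Recorder`, `Replayer` and `flaky_job` names are hypothetical and have nothing to do with Undo's actual interfaces.

```python
import random


class Recorder:
    """Hands out genuinely random values and logs each one."""

    def __init__(self):
        self.log = []
        self._rng = random.Random()  # unseeded: differs from run to run

    def next_value(self, bound):
        value = self._rng.randrange(bound)
        self.log.append(value)
        return value


class Replayer:
    """Hands out the logged values again, in the original order."""

    def __init__(self, log):
        self._values = iter(log)

    def next_value(self, bound):
        return next(self._values)


def flaky_job(source):
    """Hypothetical program whose state depends on non-deterministic inputs."""
    total, trace = 0, []
    for _ in range(50):
        total = (total * 31 + source.next_value(100)) % 1009
        trace.append(total)
    return trace


recorder = Recorder()
original = flaky_job(recorder)                # the "live" run
replayed = flaky_job(Replayer(recorder.log))  # a replay, possibly on another machine
assert replayed == original
```

Because the replay depends only on the log, the log can be handed to another developer, who then sees the same sequence of program states as the original run.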
SAP identified Undo’s LiveRecorder as a solution to multiply the value of their test strategy, making test failure results actionable by “closing the loop” between the fault and a developer understanding its root cause. By sharing recordings, developers can analyse an identical copy of the original failure while collaborating on a fix.
Recordings can be analysed using Undo’s interactive reversible debugger, UndoDB, which allows developers to step or run their recorded program backwards as well as forwards in time to home in on a point of interest. The SAP HANA team is able to navigate quickly to the root cause of a problem using the full functionality expected of modern debuggers (such as scripting, conditional breakpoints and watchpoints, and full inspection of globals and locals) in both forwards and reverse execution.
Improving software quality in SAP HANA using LiveRecorder
SAP HANA is feature-rich and includes millions of lines of code, so the Undo team optimised LiveRecorder to meet SAP HANA’s complex requirements. SAP HANA already has a sophisticated continuous integration suite, which LiveRecorder enhances, helping developers get the most out of the tools already at their disposal. LiveRecorder can be activated with one simple command, making it easy to use Undo’s record, rewind and replay technology with minimum changes to the SAP HANA team’s existing workflow. SAP developers chose to deploy LiveRecorder in always-on mode to a subset of their test farm hosts. This allows them to phase in recording in parallel with their existing workflow, with their previous testing routine continuing as normal.
With LiveRecorder, automated testing within SAP HANA has been made considerably quicker, easier and more effective.
LiveRecorder generates recordings of each and every test failure, helping developers find and fix defects as the software is being written. Failures no longer need to be replicated on the machine on which they originally occurred: multiple copies of the exact program execution which led to a failure can be shared within the SAP HANA team, allowing developers to collaborate when solving the most complex errors. With hundreds of developers working on the SAP HANA database across multiple countries, the SAP HANA team can overcome language, communication and time-zone barriers when fixing problems, further strengthening the team’s responsiveness to issues that appear in testing.
SAP HANA engineers no longer need to draw inferences from textual logs to fix a failure. Instead, they have total visibility of exactly what their program did before it failed, making debugging quicker and easier while further boosting the stability and code quality of the underlying SAP HANA database.
SAP’s use of LiveRecorder reflects the importance of software quality to SAP HANA. With minimum disruption to their existing workflow, SAP developers are using LiveRecorder to get more value from SAP’s existing testing suite by directly reviewing and debugging the behaviour of failing tests while the software is under development. The time and effort spent debugging is being reduced, freeing up time for writing new features.
Undo’s record, rewind and replay technology is becoming a central part of the testing process of SAP HANA. As companies like SAP move their products to the cloud, improving software quality and reducing the time taken to fix issues are becoming business-critical. SAP’s decision to use LiveRecorder is an example of how software quality is helping SAP to maintain a competitive advantage in a volatile market environment, helping its teams to produce better-quality code faster while improving productivity and collaboration between development and QA teams.