What is Software Failure Replay?

A DEFINITION


Software Failure Replay

 

Software Failure Replay (SFR) is a method of recording the execution of a software program as it fails. The recording can be replayed forwards and backwards in a reversible debugger. Software Failure Replay is used to rapidly diagnose and resolve software failures.

 

Software Failure Replay explained

           

Software Failure Replay vs traditional methods of debugging

 

While there have been huge advances in building software over the last few decades, the way developers approach debugging has hardly changed. 

Unlike traditional manual debugging, which is largely based on guesswork, Software Failure Replay provides data-driven insight into what the software did before it failed.

By recording a failed process down to instruction level, developers get a 100% reproducible test case. This makes bug fixing predictable. Software Failure Replay can also capture intermittent failures that cannot be tracked down in any other way.

The compelling advantage of Software Failure Replay is the drastic reduction in the time needed to reproduce a failure: the failure is recorded as it happens, caught in the act. But the magic happens when the recording is replayed. Replay can be likened to "winding the tape" back and forth in real time to get a clear picture of the program's execution. Observing the execution this way fast-tracks understanding of what caused the error. Once the root cause has been identified, developers can fix it.
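The record-then-replay idea can be illustrated with a toy sketch in Python. It is not how LiveRecorder or any instruction-level recorder is implemented. It simply assumes that all of a program's nondeterminism comes from one input source, logs those inputs during a failing run, and feeds the log back so the same failure occurs on every replay. All names here (`Recorder`, `Replayer`, `flaky_sum`, `record_until_failure`, `replay`) are hypothetical.

```python
import random


class Recorder:
    """Wraps a nondeterministic source and logs every value it returns."""

    def __init__(self, source):
        self.source = source
        self.log = []

    def next_value(self):
        value = self.source()
        self.log.append(value)
        return value


class Replayer:
    """Returns logged values in order instead of calling the real source."""

    def __init__(self, log):
        self._values = iter(log)

    def next_value(self):
        return next(self._values)


def flaky_sum(inputs, count=5):
    """Toy program under test: fails only when a drawn value exceeds 90."""
    total = 0
    for _ in range(count):
        value = inputs.next_value()
        if value > 90:
            raise ValueError(f"unexpected value {value}")
        total += value
    return total


def record_until_failure(max_runs=10_000):
    """Run the program under recording until it fails; return that run's log."""
    rng = random.Random()
    for _ in range(max_runs):
        recorder = Recorder(lambda: rng.randint(0, 100))
        try:
            flaky_sum(recorder)
        except ValueError:
            return recorder.log
    return None  # no failure observed within max_runs


def replay(log):
    """Re-run the program against a recorded log; return the error it raises."""
    try:
        flaky_sum(Replayer(log))
    except ValueError as err:
        return str(err)
    return None
```

Calling `replay(log)` on a log returned by `record_until_failure()` raises the same error each time, which is the property that turns an intermittent failure into a repeatable test case. A real recorder has to capture far more sources of nondeterminism (system calls, thread scheduling, signals) than this single input stream.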

Development teams building complex enterprise software typically spend over 50% of their engineering effort on addressing software failures.

7 Benefits of Software Failure Replay

 

Whether the failures occurred in test, development, or production, there are many benefits of Software Failure Replay. 

Developers spend most of their day building software, but too often they are pulled away to debug new customer bug reports or test failures in QA. Software Failure Replay gives developers the gift of time to:

 

  • Accelerate failure diagnosis
    Record the failing process and capture the cause of bugs without having to reproduce the entire system environment 
  • Track down impossible defects
    Capture the most challenging software defects, even the sporadic, non-deterministic failures no one can get to the bottom of
  • Clear backlogs and accelerate delivery
    Accelerate software delivery by clearing your backlog of failing tests and turning your test suite green 
  • Boost developer productivity
    Reduce time spent debugging and its associated overhead - allowing more time for programming new features  
  • Enable cross-team collaboration
    Portable recordings can be replayed on any machine, making it easier than ever for testers/QA, developers and architects to work collaboratively to understand and diagnose failures
  • Increase understandability
    Replaying & analyzing the recording enables developers to observe the failing process and understand the root cause of the failure, without needing prior knowledge of the codebase
  • Resolve customer issues faster
    Accelerate time to resolution of fielded defects, minimize customer impact, and safeguard your customer relationships to protect your bottom line

Software Failure Replay has a net positive impact across all stages of the development lifecycle. It can accelerate development velocity, resolve customer defects faster, and help achieve overall productivity savings.

Failure is inevitable. How you resolve those failures and recover is what matters.

How Software Failure Replay fits into your pipeline

 

Software Failure Replay is applicable across the development lifecycle, whether in test, development or production, as outlined in the model below.

CI/CD workflow image

Where and how Software Failure Replay is employed is dependent upon the specific needs of the software engineering team. The common thread throughout is the ability to reproduce the failure faster and deploy a fix once.

Specific use cases include:

 

  • Unit tests
    Determine if specific units of source code are free of bugs
  • Integration tests
    Test and expose faults within the integrated code before deployment
  • Alpha/Beta
    Quickly reproduce software failures after code has been deployed to the customer
  • Early adopter bugs
    Find and fix software failures faster for early champions and influencers
  • Critical outages 
    Reduce downtime for customer end users to maintain client relationships and preserve contract renewals

LiveRecorder - the leading Software Failure Replay platform

 

The LiveRecorder platform is based on patented in-process virtualization, an approach that enables the system to record everything that a program does. The recording is done at machine level. 

In-Process Virtualization Marchitecture diagram

When replaying a recording, a user can rewind the system to any program state. This provides full visibility of that state, including the heap, stack, registers, program counter, and system calls. The user has DVR-like controls to step forwards and backwards, rewind, and freeze-frame code.
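The DVR-style navigation described above can be sketched as a toy model in Python. This is an illustration under simplifying assumptions, not LiveRecorder's mechanism: it stores a full copy of a small state dictionary after every step, whereas production tools generally avoid keeping a complete copy of program state per step. The names `Timeline`, `run_program`, and the `balance`/`status` fields are hypothetical.

```python
import copy


class Timeline:
    """Stores a snapshot of program state after every step for later review."""

    def __init__(self, initial_state):
        self.snapshots = [copy.deepcopy(initial_state)]
        self.position = 0

    def record(self, state):
        """Append a snapshot and move the cursor to it."""
        self.snapshots.append(copy.deepcopy(state))
        self.position = len(self.snapshots) - 1

    def step_back(self):
        self.position = max(0, self.position - 1)
        return self.snapshots[self.position]

    def step_forward(self):
        self.position = min(len(self.snapshots) - 1, self.position + 1)
        return self.snapshots[self.position]

    def rewind_to(self, index):
        """Jump directly to the snapshot taken after step `index`."""
        self.position = index
        return self.snapshots[index]

    def last_change(self, key):
        """Search backwards from the cursor for the step where `key` last changed."""
        for i in range(self.position, 0, -1):
            if self.snapshots[i].get(key) != self.snapshots[i - 1].get(key):
                return i
        return None


def run_program(timeline):
    """Toy program: applies a fixed series of updates, recording after each one."""
    state = copy.deepcopy(timeline.snapshots[0])
    updates = [("balance", 15), ("balance", 18), ("balance", -2), ("status", "closed")]
    for key, value in updates:
        state[key] = value
        timeline.record(state)
    return state
```

Starting from the final (bad) state, `last_change("balance")` identifies the step that wrote the bad value, and `rewind_to` or `step_back` shows the state just before it. This mirrors the common workflow of starting at the failure and working backwards to the cause.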

This enables developers to identify software defects, analyze security breaches, and research historical activity with greater control. Furthermore, LiveRecorder is the only solution for multi-site debugging and analysis of code, without connection to the host. 

As a result, root cause detection time is significantly reduced: developers can get straight to debugging the recording artifact, which reduces the number of loops in agile development cycles and increases development velocity.

The image below is an example of how LiveRecorder integrates into modern CI pipelines.
LiveRecorder record CI test failures

In conclusion

 

The LiveRecorder platform transforms software failure diagnosis from a slow process of elimination into a systematic, repeatable workflow that delivers predictability in bug fixing and rapid defect resolution. LiveRecorder can be used to resolve software failures in C/C++, Go and Java.

Enterprise software organizations that use LiveRecorder are able to speed up the time-to-resolution of all software failures by an average of 10x.

 

