Debugging Kubernetes Services Using Undo

Author: Nick Read, Senior Software Engineer at Undo

Kubernetes is an open-source platform designed to automate the deployment, scaling, and management of containerized applications. It helps developers manage complex applications by providing tools for orchestrating containers, ensuring that they run consistently across different environments. By using Kubernetes, organizations can improve resource utilization, streamline application updates, and enhance system resilience.

Challenges with debugging Kubernetes services

There are a number of challenges involved when debugging erroneous behaviors in Kubernetes, namely:

  • Ephemerality: Containers can be replaced due to self-healing, losing state and debugging context
  • Distributed Nature: Issues may span multiple services and nodes
  • Production Access Limitations: Security restrictions on production environments
  • Resource Constraints: Limited ability to run heavy debugging tools in production
  • Reproducibility: Difficulty reproducing issues that only occur in production
  • Timing-Sensitive Bugs: Race conditions and concurrency issues that are hard to catch

Tools and techniques for troubleshooting Kubernetes issues

There are many tools and techniques that can be used to troubleshoot Kubernetes issues. The most commonly used strategies include log analysis and monitoring, interactive debugging, and tracing. 

However, these strategies have drawbacks:

  • Kubernetes logging requires the foresight of what to log.
  • Interactive debugging can alter program state and can be difficult in ring-fenced production environments.
  • Tracing adds the overhead of modifying code to collect traces or telemetry for further analysis.

These debugging methods can sometimes miss unwanted behaviors in distributed environments. Even when the bug is found, reproducing it may still be difficult.

What is Undo?

Undo is a time travel debugging solution for large-scale enterprise applications. When there’s a bug in your software, you can use Undo to capture the bug in a recording file (capturing the full program execution in a single binary file) and to step back in the recording to examine the full state of the program at any point in time. 

Undo comes with 2 components:

  1. LiveRecorder: for recording the runtime behavior of an application and saving it as a portable recording.
  2. UDB: for replaying recording files (or live debug sessions) back and forth in time to see what happened.

Undo also has extensions for languages that are popular in Kubernetes environments:

  • Go – Delve debugger with Undo replay and time travel support. Also GoLand integration.
  • Java/Scala/Kotlin – LiveRecorder java library, Undo replay, and Time Travel Debug IntelliJ plugin.

Our customers use Undo to understand the behavior of their mission-critical applications.  When they use Kubernetes, they need to record and observe services with minimal disruption to the existing application code, container definitions and container runtime. LiveRecorder helps our customers diagnose and find the root cause of bugs in Kubernetes environments by recording service states without modifying the original service application code or container definition. 

In Kubernetes environments, debugging can be particularly challenging due to the complexity of distributed systems, where tracing requests across multiple microservices often leads to difficulties in correlating logs effectively. Logs frequently lack sufficient context, making it hard to understand the sequence of events that led to an issue. Additionally, the dynamic nature of Kubernetes, with its ephemeral pods and containers, can result in the loss of crucial logs from terminated instances, complicating the debugging process further. And limited query capabilities in some logging tools also restrict the ability to perform complex analyses.

LiveRecorder differs from existing logging and observability tooling in that it records every line of code executed, every variable accessed and every event during a service process lifespan. This can then be replayed using UDB to see exactly what was unexpected and when it happened.

LiveRecorder in Kubernetes

LiveRecorder produces recordings of an application’s process that can be replayed to see exactly what happened during runtime. LiveRecorder works at the single-process level, so it expects to attach to processes within its own process namespace. Running in a sidecar container, it can target a process in the main application container with no change to the target’s code or container definition. This is made possible by 2 Kubernetes features: process namespace sharing (shareProcessNamespace) and the SYS_PTRACE capability.

⚠️ CAUTION: when using shareProcessNamespace, processes and container filesystems are visible to the other containers in the pod via /proc, and are protected only by regular Unix permissions.

We created an example controller application running in a sidecar container that provides a mechanism for dynamically enabling or disabling LiveRecorder sessions. Recording sessions can be started at any time during the target application’s lifetime. This allows LiveRecorder to be used in different workflows and scenarios:

  • LiveRecorder can be triggered to restart on a daily schedule if traditional observability methods have not revealed the cause of an error. The controller application can then upload daily recordings, ensuring that the error is eventually captured by LiveRecorder.
  • Triggering recording sessions where service failures exhibit a recurring pattern or can be anticipated by specific system events. 
  • If a specific error log message consistently precedes a service crash, the controller could be configured to start LiveRecorder as soon as that message appears in the logs.

How to use LiveRecorder in Kubernetes environments

The sidecar container needs to access the executable and symbol files in the main container to save them in an Undo recording. This is achieved by sharing the temporary directory across both containers. If this doesn’t work for your use case, please let us know by contacting support [at] undo.io, so we can implement an alternative approach.

We have produced an example demo application available in this GitHub repository https://github.com/undoio/addons/tree/master/k8s_live_recorder.

The demo comprises 3 parts:

  • Sidecar controller application – written in Go, allows the user to control LiveRecorder via K8s annotations, obtain target process executable, and upload recordings to S3.
  • Example application – very basic Go HTTP server app that produces a segfault when accessing an endpoint.
  • K8s config – a simple example of how to create a pod with the sidecar, roles for the sidecar to listen for annotations, and service to allow external access to the breaking app.

[Block diagram showing how the components interact]

To enable recording a target process in the main application container, we rely on some essential K8s configuration, both to allow LiveRecorder to gain access to the target process and to obtain the information it needs when saving the recording:

  • shareProcessNamespace: true to allow the sidecar to observe the target process in the main application container.
  • Grant the SYS_PTRACE capability (securityContext.capabilities.add: ["SYS_PTRACE"]) to both containers for process monitoring.
  • Create an emptyDir: {} shared volume that both the main application container and the sidecar container mount at /tmp, as LiveRecorder uses this space to gather information for the recording.
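Putting the three settings together, a minimal pod spec might look like the following sketch (the image names are illustrative, not taken from the demo repository):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  shareProcessNamespace: true          # sidecar can observe the target process
  volumes:
    - name: shared-tmp
      emptyDir: {}                     # scratch space shared by both containers
  containers:
    - name: app                        # main application container
      image: example/app:latest        # illustrative image name
      securityContext:
        capabilities:
          add: ["SYS_PTRACE"]
      volumeMounts:
        - name: shared-tmp
          mountPath: /tmp
    - name: live-recorder-sidecar      # sidecar controller container
      image: example/lr-sidecar:latest # illustrative image name
      securityContext:
        capabilities:
          add: ["SYS_PTRACE"]
      volumeMounts:
        - name: shared-tmp
          mountPath: /tmp
```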

The sidecar controller is driven by annotations on the pod: it watches for the pod to be marked with an annotation called undo.io/live-record.

There are 2 controls:

  • start executes live-record:
kubectl annotate pod <pod-name> undo.io/live-record=start --overwrite || true
  • stop kills live-record with SIGINT to trigger the saving of a recording:
kubectl annotate pod <pod-name> undo.io/live-record=stop --overwrite || true
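The controller's reaction to these annotation values can be sketched as a small piece of Go decision logic. This is a simplified sketch: the real controller in the addons repository also tracks session state, and the type and function names here are our own.

```go
package main

import "fmt"

// liveRecordAnnotation is the annotation key the sidecar controller
// watches for (from the demo repository).
const liveRecordAnnotation = "undo.io/live-record"

// Action is the control the sidecar should take for the target process.
type Action int

const (
	ActionNone Action = iota
	ActionStart
	ActionStop
)

// actionFromAnnotations maps a pod's annotations onto a controller action.
func actionFromAnnotations(annotations map[string]string) Action {
	switch annotations[liveRecordAnnotation] {
	case "start":
		return ActionStart
	case "stop":
		return ActionStop
	default:
		return ActionNone
	}
}

func main() {
	pod := map[string]string{liveRecordAnnotation: "start"}
	fmt.Println(actionFromAnnotations(pod) == ActionStart) // true
}
```

In the real controller this function would be fed by a Kubernetes watch on the pod object, so each kubectl annotate command is observed as an update event.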

The recording is saved to a directory called /recordings. When the controller starts, it launches a goroutine that periodically checks the /recordings directory for recording files and uploads any it finds to an S3 bucket.

Undo recording files are created by LiveRecorder and contain a comprehensive capture of a program’s execution. This includes the original executable, a log of all non-deterministic behaviour (system calls, thread switches, etc.), debug information, and the other essential information required for the execution to be replayed and debugged later using UDB. Replaying a recording removes the need to reproduce the issue, as all of the bug’s behaviour has already been captured. The recording files are self-contained, so they can be replayed on a separate machine. When loaded in UDB, Delve, or IntelliJ, it is possible to step through the program’s execution history to see where the bug occurred.

For a more detailed overview of debugging with Undo recordings, please see:

FAQs

What happens if the pod’s self-healing restarts the container while a recording is being saved?

Pod self-healing measures are often used to restart the main application container on failed health checks (i.e. if the application has crashed). This could mean the loss of a recording, as the container would be restarted while LiveRecorder saves and uploads it.

This behaviour is handled by the wrapper application: it adds an undo.io/status=busy annotation to the pod while the sidecar is still saving or uploading a recording, and a preStop lifecycle hook halts the main application teardown until the controller has cleared the annotation.
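A sketch of such a preStop hook, assuming the pod's annotations are exposed to the container through the Downward API (the /etc/podinfo path and the volume name are illustrative):

```yaml
containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            # Block teardown until the controller clears the busy annotation.
            - |
              while grep -q 'undo.io/status="busy"' /etc/podinfo/annotations; do
                sleep 1
              done
    volumeMounts:
      - name: podinfo
        mountPath: /etc/podinfo
volumes:
  - name: podinfo
    downwardAPI:
      items:
        - path: annotations
          fieldRef:
            fieldPath: metadata.annotations
```

The Downward API keeps the annotations file up to date as the controller changes them, so the loop exits once the busy marker is removed. Note that preStop hooks are still bounded by the pod's terminationGracePeriodSeconds, which may need to be raised to cover upload time.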

What if the fault originates in a different service from the one where the bug is evident?

If a fault originates in a different service from the one where the bug shows up (e.g. a request value from service A has not been validated correctly at service B’s receiving endpoint), you can record both services and use the experimental Multi-Process Correlation for Sockets feature when replaying to follow data from one service process to another.

Why not use kubectl debug instead of a sidecar?

As part of developing the sidecar approach, we first looked at using kubectl debug. kubectl debug is a Kubernetes command that allows users to troubleshoot running pods by creating an ephemeral container within an existing pod. It enables developers to inspect the state of the application, run diagnostic commands, and interact with the pod’s environment without altering the original container’s state, and the debug container can be spun up with the tools and utilities necessary for troubleshooting. However, it became apparent that this approach would be restrictive for our customers’ use cases.

When recording long-running processes in production, a developer might not have access to the production environment to use kubectl, due to organizational security protocols. This prevents them from attaching debuggers, inspecting logs in real time, or executing the commands needed for traditional debugging workflows. With the sidecar container, LiveRecorder becomes a fine-grained observability tool that records exactly what happens in the main application process, and the recordings can be shipped to developers to analyse with minimal intrusion into the production environment.

Debug containers instantiated with kubectl debug may also not start at the same time as the main application container. Sidecar containers do start alongside the main application container, so LiveRecorder can record as much of the process’s execution as possible. This is particularly valuable when the target process fails early in its lifecycle: the recording captures the process from its very beginning, providing the data needed for debugging even if it terminates early.