Collector soft lock-up advisory

Learn how to mitigate a Linux kernel issue with eBPF probes on certain kernel versions.

Published: May 15, 2020

Summary

StackRox Collector monitors runtime activity on each node in your secured clusters by installing probes. These probes are either kernel modules or eBPF programs specific to the Linux kernel version installed on the node.

We recently encountered a bug in the Linux kernel’s BPF subsystem that causes kernel hangs (soft lockups) when the StackRox eBPF probe is installed. The issue is most often triggered when the node is under heavy load. Affected nodes restart because of these hangs.

Affected versions

This bug affects a subset of Linux kernels at version 4.18 or later. Whether a kernel is affected depends on the precise set of backports each vendor has pulled in, so the kernel version number alone can’t tell you whether a node is affected.

We’ve confirmed this issue on kernels shipped with Google’s Container-Optimized OS (COS) version 77 or later. If you’re using Google Kubernetes Engine (GKE) with COS images from the GKE Rapid or Regular release channels, there is a chance that your nodes are running an affected kernel. The Stable release channel ships an unaffected kernel.

However, any distribution that ships a kernel of version 4.18 or later may be affected.
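
To narrow down which nodes are worth checking, you can list each node's kernel version and keep only those at 4.18 or later. The following is a minimal sketch, not a StackRox tool: the function names kernel_in_range and list_candidate_nodes are hypothetical, and a match only means that the node falls in the version range described above, not that its kernel is confirmed as affected.

```shell
# Succeeds (exit 0) when a kernel release string such as "4.19.112+" is
# version 4.18 or later. Hypothetical helper, not part of StackRox or kubectl.
kernel_in_range() {
  kr_major="${1%%.*}"              # text before the first dot, e.g. "4"
  kr_rest="${1#*.}"                # text after the first dot, e.g. "19.112+"
  kr_minor="${kr_rest%%[!0-9]*}"   # leading digits of the remainder, e.g. "19"
  case "$kr_major" in ''|*[!0-9]*) return 2 ;; esac   # not a numeric version
  [ "$kr_major" -gt 4 ] || { [ "$kr_major" -eq 4 ] && [ "${kr_minor:-0}" -ge 18 ]; }
}

# Prints "node<TAB>kernel<TAB>OS image" for every node whose kernel is in range.
# The version number alone cannot confirm that a kernel is affected, so treat
# the output as a list of candidates to check with your vendor.
list_candidate_nodes() {
  tab="$(printf '\t')"
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kernelVersion}{"\t"}{.status.nodeInfo.osImage}{"\n"}{end}' |
    while IFS="$tab" read -r node kernel os_image; do
      if kernel_in_range "$kernel"; then
        printf '%s\t%s\t%s\n' "$node" "$kernel" "$os_image"
      fi
    done
}
```

After sourcing these functions, run list_candidate_nodes against each cluster. Nodes that are not listed run a kernel older than 4.18.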

Fix

The fix for the underlying issue has been merged into the Linux kernel mainline and has been pulled into various kernel distributions by their respective vendors. The most recent images of COS now ship an unaffected kernel.

However, because of changes made to the eBPF verifier on the kernel side, the Collector container may fail with CrashLoopBackOff when it inserts the probe. In that case, the failure messages contain a very large amount of assembly-style output.
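
To check whether any Collector pods are in this state, you can look for the CrashLoopBackOff reason and read the logs of the previous container run. This is a sketch under assumptions: the function name collector_crash_logs is hypothetical, and it assumes that the Collector container in the pod is named collector, which you should verify in your deployment.

```shell
# Prints the tail of the previous container logs for every Collector pod
# that is currently waiting in CrashLoopBackOff.
# Assumption: the container inside the pod is named "collector".
collector_crash_logs() {
  tab="$(printf '\t')"
  kubectl -n stackrox get pods -l app=collector -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' |
    while IFS="$tab" read -r pod reason; do
      case "$reason" in
        *CrashLoopBackOff*)
          printf '== %s ==\n' "$pod"
          # --previous reads the logs of the last terminated container run.
          kubectl -n stackrox logs "$pod" -c collector --previous --tail=20 </dev/null
          ;;
      esac
    done
}
```

Because the output can be very long, the sketch prints only the last 20 lines per pod. Increase the --tail value if you need more of it.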

We’ve also undertaken mitigations in our eBPF probes and replaced the probes for affected kernels on our download service and in published Collector images. If you suspect that you’re running an affected kernel and you’re using eBPF as a collection method in at least one of your clusters, follow these steps:

  1. Ensure that you are running the StackRox Kubernetes Security Platform version 3.0.33.0 or newer. If you’re on an earlier version, we strongly advise you to upgrade; if you can’t do so, consider switching to kernel module-based runtime collection in the meantime. To change your collection method:

    1. Navigate to the Platform Configuration > Clusters view.
    2. Select your cluster. The configuration form appears on the right in a side panel.
    3. Select Kernel Module in the Collection Method menu.
    4. Select the Next button, then download the deployment bundle and deploy it in your cluster.
  2. If you’re using a private registry mirror for the collector.stackrox.io/collector container image, sync the active Collector image with the collector.stackrox.io registry. To find the image name, run the following command in each cluster with a deployed Sensor:

    kubectl -n stackrox get ds collector -o jsonpath='{.spec.template.spec.containers[0].image}'

    The image tag should end in -latest.

  3. If you’re running StackRox in offline mode and are using collector support packages, download the latest support package. Then, install it by running the roxctl collector support-packages upload command with the --overwrite flag.

  4. Finally, delete all Collector pods to force them to restart by running:

    kubectl -n stackrox delete pod -lapp=collector

After you complete these steps, all your Collectors should be running again with probes that don’t risk triggering the soft lockup.
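
As a quick check after the restart, you can confirm that the Collector DaemonSet references an image whose tag ends in -latest, as described in step 2. The sketch below wraps the command from that step. The function names image_tag_is_latest and check_collector_image are hypothetical, and an image that is referenced by digest rather than by tag is reported as unexpected.

```shell
# Succeeds (exit 0) when the image reference ends in "-latest".
image_tag_is_latest() {
  case "$1" in
    *-latest) return 0 ;;
    *) return 1 ;;
  esac
}

# Reads the image of the first container in the collector DaemonSet and
# reports whether its tag ends in "-latest".
check_collector_image() {
  cc_image="$(kubectl -n stackrox get ds collector -o jsonpath='{.spec.template.spec.containers[0].image}')" || return 2
  if image_tag_is_latest "$cc_image"; then
    printf 'ok: %s\n' "$cc_image"
  else
    printf 'unexpected tag: %s\n' "$cc_image"
    return 1
  fi
}
```

Run check_collector_image in each cluster with a deployed Sensor. A nonzero exit status means that the image could not be read or that its tag doesn't end in -latest.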

Advice on the use of eBPF

The eBPF system was designed to be a less invasive, safer alternative to kernel modules, in particular by reducing the risk of host-level crashes. While we remain excited by the long-term prospects of eBPF, this issue highlights that eBPF is still a relatively recent technology that, in some respects, is less mature than its alternatives. We therefore advise you to prefer kernel module-based collection for the foreseeable future whenever your platform supports it.

We don’t support kernel modules for Google’s Container-Optimized OS.

Questions?

We're happy to help! Reach out to us to discuss questions, issues, or feature requests.

© 2021 StackRox Inc. All rights reserved.