CPU Soft Lockup when doing heavy IO via Kernel NFS server on local ephemeral storage #4307

Open
snowzach opened this issue Nov 19, 2024 · 14 comments
Labels: status/needs-triage (Pending triage or re-evaluation), type/bug (Something isn't working)

Comments

@snowzach

Platform I'm building on:

Running a very simple NFS server container on bottlerocket-aws-k8s-1.25-x86_64-v1.26.1-943d9a41

Dockerfile:

# syntax=docker/dockerfile:1.9
FROM public.ecr.aws/debian/debian:bookworm

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
    nfs-common nfs-kernel-server curl ca-certificates zip unzip make time jq yq netbase iproute2 net-tools bind9-dnsutils procps xz-utils nano && rm -rf /var/lib/apt/lists/*

# Copy the entrypoint script
COPY --chmod=755 ./tools/docker/cloud-nfs/kernel/cloud-nfs-entrypoint.sh /entrypoint.sh
RUN mkdir /exports

# Install the AWS CLI
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" \
    && unzip awscliv2.zip \
    && ./aws/install \
    && rm -rf \
    awscliv2.zip

# Expose ports for NFSD/StatD/MountD/QuotaD
EXPOSE 2049 2050 2051 2052
VOLUME /exports

# Entrypoint
ENTRYPOINT ["/entrypoint.sh"]
CMD [ "/exports" ]

Entrypoint script:

#!/bin/bash
set -e

NFS_THREADS=${NFS_THREADS:-64}

function start() {

    # prepare /etc/exports
    fsid=0
    for i in "$@"; do
        echo "$i *(rw,fsid=$fsid,no_subtree_check,no_root_squash)" >> /etc/exports
        if [ -v gid ] ; then
            chmod 070 $i
            chgrp $gid $i
        fi
        echo "Serving $i"
        fsid=$((fsid + 1))
    done

    # start rpcbind if it is not started yet
    set +e
    /usr/sbin/rpcinfo 127.0.0.1 > /dev/null; s=$?
    set -e
    if [ $s -ne 0 ]; then
       echo "Starting rpcbind"
       /sbin/rpcbind -w
    fi

    mount -t nfsd nfds /proc/fs/nfsd

    # -V 3: enable NFSv3
    /usr/sbin/rpc.mountd -p 2050

    /usr/sbin/exportfs -r
    # -G 10 to reduce grace time to 10 seconds (the lowest allowed)
    # -V 3: enable NFSv3
    /usr/sbin/rpc.nfsd -G 10 -p 2049 $NFS_THREADS
    /sbin/rpc.statd --no-notify -p 2051 -o 2052 -T 2053
    echo "NFS started with $NFS_THREADS threads"
}

function stop()
{
    echo "Stopping NFS"

    /usr/sbin/rpc.nfsd 0
    /usr/sbin/exportfs -au
    /usr/sbin/exportfs -f

    kill $( pidof rpc.mountd )
    umount /proc/fs/nfsd
    echo > /etc/exports
    exit 0
}

trap stop TERM

start "$@"

# Keep the container running
sleep infinity

Essentially I run this on an AWS i3en with local flash provisioned as ephemeral storage shared via this NFS server. It's a high performance cache drive. Testing with i3en.2xlarge

What I expected to happen:

It would be a super fast NFS server sharing this ephemeral storage.

What actually happened:

I can mount this storage from another i3en.2xlarge instance and mostly it works unless we really push it.
If I run the disk testing tool bonnie++ (as bonnie++ -d /the/nfs/share -u nobody) and wait, within a minute or two the machine starts logging errors like watchdog: BUG: soft lockup - CPU#7 stuck for 22s! as well as ena 0000:00:06.0 eth2: TX hasn't completed, qid 5, index 801. 26560 msecs since last interrupt, 41910 msecs since last napi execution, napi scheduled: 1

How to reproduce the problem:

Run the container, run bonnie++ on the NFS share.

It's very reliably reproduced.
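
For reference, the client side of the repro boils down to something like this (server address is a placeholder; mount options are whatever you normally use):

# Mount the exported ephemeral storage from another i3en.2xlarge
mount -t nfs <nfs-server-ip>:/exports /the/nfs/share
# Hammer it with bonnie++ until the server starts logging soft lockups
bonnie++ -d /the/nfs/share -u nobody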

Attached is a log:
bottlerocket-log.txt

@snowzach added the status/needs-triage and type/bug labels on Nov 19, 2024
@snowzach
Author

No idea if this is related, but googling for CPU soft lockup and a few other keywords led me to this thread: https://bbs.archlinux.org/viewtopic.php?id=264127&p=3, which appears to affect kernels up through 5.15 (which appears to be what we are running).

@koooosh
Contributor

koooosh commented Nov 19, 2024

Hey @snowzach, thanks for reporting this issue! A question:

  1. Have you experienced this issue with older versions of Bottlerocket (< v1.26.1) or even our latest release v1.26.2?

@bcressey
Contributor

@koooosh the ENA "TX hasn't completed" error is probably the most relevant here. awslabs/amazon-eks-ami#1704 mentions upgrading the ENA driver to 2.12.0 as a potential fix, but the 5.15 kernel is already on 2.13.0.

We should try the repro. aws-k8s-1.25 is still fully supported and if it's a driver issue, moving to aws-k8s-1.28 and a newer kernel may not help.
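
If it helps, the loaded ENA driver version is easy to confirm on a node, e.g. (interface name taken from the error message above):

# Driver name/version bound to the interface
ethtool -i eth2 | grep -i -E "driver|version"
# Or from the kernel ring buffer
dmesg | grep -i "Elastic Network Adapter"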

@snowzach
Author

Hi @koooosh, I have experienced it with older versions as well. I could try v1.26.2 too. What's also interesting: I believe I also experienced it when using AL2 nodes, but I don't recall the versions of those. I am not super familiar with how kernel versions map to Kubernetes versions; is kernel 5.15 the highest I would be able to run with Kube 1.25? I am not keen to upgrade Kubernetes, but I would consider it if I knew it would fix the problem for sure.

@snowzach
Author

snowzach commented Nov 21, 2024

Here's another interesting one that VERY closely matches the issue I am having: awslabs/amazon-eks-ami#454, which was resolved with a kernel patch quite a few years ago.

@snowzach
Author

snowzach commented Nov 21, 2024

Just another thought: since the NFS server runs in the kernel, is there some sort of throttling/policing of system resources by kubelet that could be causing this?
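
For example, this is the sort of check I have in mind, though I'm not sure the in-kernel nfsd threads even land in the pod's cgroup (cgroup v1 layout; the pod path is illustrative):

# Look for CPU throttling of the NFS pod's cgroup
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<POD_UID>/cpu.stat
# Growing nr_throttled / throttled_time would indicate policing, but the knfsd workers are
# kernel threads and probably sit outside the pod cgroup, so they may not be throttled at all.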

@koooosh
Contributor

koooosh commented Nov 21, 2024

is Kernel 5.15 the highest I would be able to run with Kube 1.25?

That's correct - the Bottlerocket k8s-1.25 variant specifically uses kernel 5.15.

If you'd like to use kernel 6.1, you would have to use k8s 1.28.
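
If you want to double-check what a node is actually running, something like this should work from the control container (and uname from the host):

# Variant and OS version (control container)
apiclient get os
# Running kernel (host)
uname -r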

@snowzach
Author

snowzach commented Nov 21, 2024

Alright... so I decided to try creating a new cluster with 1.25 and was going to upgrade one version at a time to see if the issue stopped happening. I've been having the issue in us-west-2, and I spun this one up in us-east-2 to try to reproduce it... and it didn't have the issue.

Same NFS server container, same Kube version 1.25, same Bottlerocket version (testing with 1.27.1 now). The CNI plugin wasn't the same; I tried upgrading to match and still had the issue. I'm at a loss unless there's some underlying hardware difference you can't see. The only difference I can think of is that some of the add-ons differ between them (the cluster without the issue has newer versions of the EFS driver, kube-proxy, EBS driver, and CoreDNS), and the cluster having the issue has Grafana Agent collecting logs/metrics. That's it...

Anything else I can check!?

Linux ip-10-250-7-127.us-east-2.compute.internal 5.15.168 #1 SMP Fri Nov 15 19:12:46 UTC 2024 x86_64 GNU/Linux
[    2.132461] ena 0000:00:05.0: Elastic Network Adapter (ENA) v2.13.0g
bottlerocket-aws-k8s-1.25-x86_64-v1.27.1-efd46c32
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz

Linux ip-172-27-57-172.us-west-2.compute.internal 5.15.168 #1 SMP Fri Nov 15 19:12:46 UTC 2024 x86_64 GNU/Linux
[    4.942529] ena 0000:00:05.0: Elastic Network Adapter (ENA) v2.13.0g
bottlerocket-aws-k8s-1.25-x86_64-v1.27.1-efd46c32
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz

Edit: I noticed the EKS platform version is different (eks.25 vs eks.35), but I'm not sure if that should matter.

@arnaldo2792
Contributor

Hey @snowzach, I'm taking a look at this, sorry for the late reply 👍. I'm attempting the repro as @bcressey suggested.

@arnaldo2792
Contributor

Hey @snowzach, could you please provide more details about your setup? You mentioned the EFS (CSI?) driver, but aren't you running your own NFS server? I'm trying to understand how the EFS driver comes into the picture if the goal is to use your own NFS server.

On a side note, I did a first test with bonnie++ with the NFS server using 8 threads instead of 64, on Bottlerocket k8s 1.25, v1.27.1, and I didn't see the failure. I'll perform another test with 64 threads to get as close as I can to your environment. It took me a while to get the NFS server up and running, but it seems like you are using a slightly modified version of the script provided in the registry.k8s.io/volume-nfs image (NFS v4 instead of v3 and a higher thread count).

Another question I have is related to your experience using and configuring the ephemeral storage. I tried apiclient ephemeral-storage init and apiclient ephemeral-storage bind --dirs /mnt/block, but the latter failed since we are strict about the directories that can be used to mount the RAID array configured with apiclient ephemeral-storage. Do you use bootstrap containers to configure the ephemeral storage? I'm trying to gather data to see whether it would be valuable to loosen the directory constraints a bit.
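
Concretely, this is the sequence that fails today; the bind is the part that gets rejected because of the directory allow-list:

# Build the RAID array over the instance-store disks and format it
apiclient ephemeral-storage init -t ext4
# Rejected: /mnt/block is not one of the allowed bind directories
apiclient ephemeral-storage bind --dirs /mnt/block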

@snowzach
Author

snowzach commented Nov 23, 2024

I build the node using Karpenter with the following settings (I'm using a slightly old version of Karpenter):

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: deck
  labels:
    app.kubernetes.io/name: deck
    app.kubernetes.io/component: deck
spec:
  securityGroupSelector:
    karpenter.sh/discovery: {{ .Values.environment }},{{ .Values.environment }}-nfs
  subnetSelector:
    Type: private
  amiFamily: Bottlerocket
  userData: |
    [settings.kernel.sysctl]
    "net.core.rmem_max" = "67108864"
    "net.core.wmem_max" = "67108864"
    "net.ipv4.tcp_rmem" = "4096 87380 33554432"
    "net.ipv4.tcp_wmem" = "4096 87380 33554432"
    "net.ipv4.tcp_congestion_control" = "htcp"
    "net.ipv4.tcp_mtu_probing" = "1"
    "net.core.default_qdisc" = "fq"
    "fs.file-max" = "512000000"
    "vm.min_free_kbytes" = "524288"

    [settings.bootstrap-commands.k8s-ephemeral-storage]
    commands = [
      ["apiclient", "ephemeral-storage", "init", "-t", "ext4"],
      ["apiclient", "ephemeral-storage" ,"bind", "--dirs", "/var/lib/kubelet"]
    ] 
    essential = true
    mode = "always"

To mount the NVMes I just specify an emptyDir: {}, and since the node is set up to use ephemeral storage, it mounts the NVMe in the container for me. I only mentioned EFS because I have the EFS plugin enabled in Kubernetes, so there are a couple of daemonsets that run on every node (including the one in question). I doubt it's related but figured I should mention any differences.

As far as configuring ephemeral storage goes, I think it would be good to loosen the directory restrictions, and also to allow setting mount options; I mount with noatime and nodiratime when I can to increase speed. Making XFS work, and letting you specify the mkfs options so you can optimize for size, would be nice too (a sketch of what I mean is below).
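
Roughly the kind of thing I do by hand today (device path and options are just illustrative of what I'd want apiclient to expose):

# Format the instance-store array as XFS with whatever mkfs options fit the workload
mkfs.xfs -f /dev/md/ephemeral
# Mount with noatime/nodiratime to cut metadata write traffic
mount -o noatime,nodiratime /dev/md/ephemeral /mnt/block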

Alright, so I am seeing that this is NOT necessarily related to Bottlerocket.

Info:

  • I can reproduce this issue on Bottlerocket and AL2 Kubernetes nodes
  • I can reproduce this error with XFS or EXT4 underlying filesystem
  • I could NOT reproduce this in us-east-2 when building another cluster there (only on my cluster/node in us-west-2b)
  • It runs for a while and then a kernel worker (kworker) thread spikes to 100% (the machine is not 100% loaded) and the whole machine hangs (how I spot this is sketched below)

I'm starting to wonder if this is the ENA driver or something to do with the NVMe drives (or the two fighting with each other). Attached is a log from the AL2 instance with the same failure. It looks almost exactly the same, except it says xfs instead of ext4.
failure-nfs-AL2.txt
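
For what it's worth, I catch the spike by watching per-thread CPU on the server while the benchmark runs, something like:

# Per-thread snapshot sorted by CPU; a kworker (or nfsd) thread pinned near 100% shows up right before the hang
top -H -b -n 1 -o %CPU | head -n 20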

@arnaldo2792
Contributor

arnaldo2792 commented Nov 26, 2024

Hey @snowzach, I gave this one more try but I still can't replicate what you are experiencing. Please let us know what additional configuration you are setting or which flags you are passing to bonnie++ to reproduce, since I have been unable to get the error you are getting in my attempts. This is what I have tried:


Created a cluster with AMI ami-0a01be35f6388db1e (bottlerocket-aws-k8s-1.25-x86_64-v1.26.1-943d9a41) and used the configs you provided:

[ssm-user@control]$ apiclient get settings.kernel
{
  "settings": {
    "kernel": {
      "lockdown": "integrity",
      "sysctl": {
        "fs.file-max": "512000000",
        "net.core.default_qdisc": "fq",
        "net.core.rmem_max": "67108864",
        "net.core.wmem_max": "67108864",
        "net.ipv4.tcp_congestion_control": "htcp",
        "net.ipv4.tcp_mtu_probing": "1",
        "net.ipv4.tcp_rmem": "4096 87380 33554432",
        "net.ipv4.tcp_wmem": "4096 87380 33554432",
        "vm.min_free_kbytes": "524288"
      }
    }
  }
}
[ssm-user@control]$

Created the RAID array and mounted it manually, because I didn't want to use /var/lib/kubelet:

mkdir /mnt/block
apiclient ephemeral-storage init -t ext4
mount /dev/md/ephemeral /mnt/block

Created container image to run the NFS server with the same Dockerfile you provided, and a slightly modified version of your entrypoint (I used the default ports):

#!/bin/bash
set -e

NFS_THREADS=${NFS_THREADS:-64}

function start() {

    # prepare /etc/exports
    fsid=0
    for i in "$@"; do
        echo "$i *(rw,fsid=$fsid,no_subtree_check,no_root_squash)" >> /etc/exports
        if [ -v gid ] ; then
            chmod 070 $i
            chgrp $gid $i
        fi
        echo "Serving $i"
        fsid=$((fsid + 1))
    done

    # start rpcbind if it is not started yet
    set +e
    /usr/sbin/rpcinfo 127.0.0.1 > /dev/null; s=$?
    set -e
    if [ $s -ne 0 ]; then
       echo "Starting rpcbind"
       /sbin/rpcbind -w
    fi

    mount -t nfsd nfds /proc/fs/nfsd

    /usr/sbin/rpc.mountd

    /usr/sbin/exportfs -r
    # -G 10 to reduce grace time to 10 seconds (the lowest allowed)
    # -V 3: enable NFSv3
    /usr/sbin/rpc.nfsd -G 10 $NFS_THREADS
    /sbin/rpc.statd --no-notify
    echo "NFS started with $NFS_THREADS threads"
}

function stop()
{
    echo "Stopping NFS"

    /usr/sbin/rpc.nfsd 0
    /usr/sbin/exportfs -au
    /usr/sbin/exportfs -f

    kill $( pidof rpc.mountd )
    umount /proc/fs/nfsd
    echo > /etc/exports
    exit 0
}

trap stop TERM

start "$@"

# Keep the container running
sleep infinity

Deployed the pods in the cluster, with:

  • A daemonset that runs the NFS server only in i3en.2xlarge instances
  • A daemonset that runs the client only in m5.xlarge instances
  • A service in front of the NFS server pods
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nfs
spec:
  selector:
    matchLabels:
      name: nfs
  template:
    metadata:
      labels:
        name: nfs
    spec:
      containers:
      - name: nfs
        # image: registry.k8s.io/volume-nfs:latest
        image: <>.dkr.ecr.us-west-2.amazonaws.com/nfs-server-problem:v3
        ports:
          - name: nfs
            containerPort: 2049
          - name: mountd
            containerPort: 20048
          - name: rpcbind
            containerPort: 111
        securityContext:
          privileged: true
        volumeMounts:
          - mountPath: /exports
            name: ephemeral
      volumes:
        - name: ephemeral
          hostPath:
            path: /mnt/block
            type: Directory
      nodeSelector:
        node.kubernetes.io/instance-type: i3en.2xlarge
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nfs-client
spec:
  selector:
    matchLabels:
      name: nfs-client
  template:
    metadata:
      labels:
        name: nfs-client
    spec:
      containers:
      - name: nfs-client
        command: ["sleep", "infinity"]
        image: fedora:41
        securityContext:
          privileged: true
      nodeSelector:
        node.kubernetes.io/instance-type: m5.xlarge
---
apiVersion: v1
kind: Service
metadata:
  name: nfs-1-26-1
spec:
  selector:
    name: nfs
  ports:
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111

Mounted the NFS server in the client:

[root@nfs-client-dz59r /]# mount -t nfs -o vers=4.2 ${NFS_1_26_1_SERVICE_HOST}:/ /mnt
# Test the file was created in the remote filesystem
[root@nfs-client-dz59r /]# touch /mnt/test
# Perform the test
[root@nfs-client-dz59r /]# bonnie++ -d /mnt/ -u root -c 10
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  2.00       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
nfs-clie 31008M::10 1015k  99  549m  33  271m  21  892k  99  515m  21 +++++ +++
Latency              8920us     419ms    1335ms    1349ms    9199us    8192us
Version  2.00       ------Sequential Create------ --------Random Create--------
nfs-client-dz59r    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16     0  10 +++++ +++     0  11     0  10     0  12     0  11
Latency             11897us      25us   10324us    6796us    1563us    3903us
1.98,2.00,nfs-client-dz59r,10,1732664184,31008M,,8192,5,1015,99,562528,33,277052,21,892,99,527541,21,+++++,+++,16,,,,,1426,10,+++++,+++,3865,11,1441,10,9554,12,3912,11,8920us,419ms,1335ms,1349ms,9199us,8192us,11897us,25us,10324us,6796us,1563us,3903us
[root@nfs-client-dz59r /]#
# From the host
bash-5.1# ls /mnt/block/test
/mnt/block/test
## After starting the test
bash-5.1# ls /mnt/block/
Bonnie.270  lost+found  test
## No logs regarding incomplete TX/RX
bash-5.1# journalctl -k | grep "TX hasn't completed"
# Nothing

@snowzach
Author

Yeah, I have no idea what the issue is... It very much seems like some sort of hardware/driver issue, maybe related to ENA + NVMe contention... just a guess. Another interesting thing I have done that has resolved the issue (thus far) is to disable the sync option on the NFS server (example below). Since it's basically an ephemeral drive I don't care about syncing. I'm guessing that has reduced the contention/load, and I have been unable to trigger the error since...

Like I said, I set up the exact same thing in us-east-2 and I could not make the issue happen. I have no idea at this point.
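
For reference, the workaround is just a change to the export options, swapping the (default) sync for async; a server crash can then lose acknowledged writes, which is fine for a cache drive:

# Export line in the entrypoint now specifies async instead of relying on the sync default
echo "/exports *(rw,async,fsid=0,no_subtree_check,no_root_squash)" >> /etc/exports
/usr/sbin/exportfs -r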

@arnaldo2792
Contributor

arnaldo2792 commented Nov 27, 2024

Thanks for your patience @snowzach, and I'm glad you have a workaround. We will engage with the ENA team to understand what the root cause could be. They may have a better way to reproduce it, but thankfully you provided those logs, so that's great!

On the ephemeral-storage experience, I'll start a thread with the team to discuss where we stand on opening up the API a bit more to other directories like /mnt.
