Nsncd: 23.05 regression compared to 22.11 #218813

Open
sir4ur0n opened this issue Feb 28, 2023 · 21 comments
Labels
0.kind: bug Something is broken

Comments

@sir4ur0n

sir4ur0n commented Feb 28, 2023

Describe the bug

In 23.05 nsncd is enabled by default.
However, we have found that this seems to have broken the behavior of one of our NixOS machines (a CI runner).

Steps To Reproduce

Steps to reproduce the behavior:

  1. Have this NixOS configuration for a nixos-22.11 version
{
  security.sudo.enable = true;
  services.openssh.passwordAuthentication = false;
  security.sudo.wheelNeedsPassword = false;
}
  2. SSH into the machine
  3. You can run sudo commands, e.g. sudo ls or sudo su
  4. Have the same NixOS configuration but for a nixos-unstable version
  5. SSH into the machine
  6. 💥 You cannot sudo:
sudo: PAM account management error: User not known to the underlying authentication module
sudo: a password is required
  7. Add services.nscd.enableNsncd = false; to your NixOS configuration
  8. SSH into the machine
  9. sudo commands work
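
For reference, the workaround from the last three steps, merged into the configuration from the first step, might look like the sketch below. It only combines options already quoted in this report and has not been verified beyond what the steps describe. It switches the nscd service back to the glibc implementation rather than fixing the underlying failure.

```nix
{
  security.sudo.enable = true;
  security.sudo.wheelNeedsPassword = false;
  # Option name as written in this report.
  services.openssh.passwordAuthentication = false;

  # Workaround: use glibc's nscd instead of nsncd,
  # which is the default on nixos-unstable / 23.05.
  services.nscd.enableNsncd = false;
}
```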

Expected behavior

Either the same configuration should work in nixos-unstable/nixos-23.05 (once it exists), or this breaking change / a migration guide should be added in https://nixos.org/manual/nixos/unstable/release-notes.html#sec-release-23.05-incompatibilities

Additional context

I don't know if any of this is relevant, but just in case:

  • the machine is a GCP VM instance
  • the deployment is done via Terraform
  • we SSH using https://console.cloud.google.com/ which drops us in a browser shell
  • the user that GCP logs us in as over SSH is present neither in /etc/passwd nor in users.users in the NixOS configuration

Notify maintainers

As I don't really know what the problem is (documentation, code, other?) I am unsure who to ping 😐

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.15.95, NixOS, 23.05 (Stoat), 23.05pre-git`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.13.2`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`
  • Git revision of the nixos-22.11 version: 6a0d270
  • Git revision of the nixos-unstable version: 7f5639f

CC @yorickvP

sir4ur0n added the "0.kind: bug" (Something is broken) label Feb 28, 2023
@yorickvP
Contributor

yorickvP commented Mar 1, 2023

cc @flokli

@flokli
Contributor

flokli commented Mar 1, 2023

@sir4ur0n hmmh, we do have a nixos/tests/sudo.nix which does succeed, and we also have nixos/tests/google-oslogin/default.nix which does succeed, testing sudo over ssh.

What is the user you ssh in as? Is oslogin enabled, or how does it work?

@flokli
Contributor

flokli commented Mar 1, 2023

cc @NinjaTrappeur

@picnoir
Member

picnoir commented Mar 1, 2023

I can't reproduce locally or in a VM test. Sounds like we have a weird interaction with the google NSS module happening here.

@sir4ur0n: Could you dump the nsncd logs around the time you call sudo? If you can, dumping all the journald logs around the time you call sudo could also help. Are you using os-login?

Edit: I suspect this is triggered by the google OS-login NSS module crashing.

@yorickvP
Contributor

cc @ConnorBaker , who inherited the workaround :D

@de11n

de11n commented Apr 14, 2023

We're also being bitten by this switch due to twosigma/nsncd#37. Our errors aren't related to sudo, however, and we are not on Google Cloud.

@flokli
Contributor

flokli commented Apr 14, 2023

As written by @NinjaTrappeur, we still need some logs / socket dumps to understand what's going on. This is not visible outside Google Cloud.

@sekunho
Contributor

sekunho commented Aug 6, 2023

Hi, I also got bitten by this. It works for 22.11 but not 23.05, and I can reproduce it consistently:

  1. Create a NixOS 23.05 image with nixos-generators (flakes), point its nixpkgs input at nixos-23.05, and upload it to GCP.
  2. Create a VM with said image following the steps here https://nixos.wiki/wiki/Install_NixOS_on_GCE under the Create a VM instance section. I also added the enable-oslogin TRUE metadata part.
  3. SSH into the VM once ready using the gcloud CLI
  4. sudo -i produces the same error message as OP did

I ran dmesg after that, and it seems to be related to nsncd, because I saw a bunch of nsncd segfaults whose count seems to correlate with the number of times I ran sudo -i.

If I switch the first step to point to nixos-22.11, I can sudo -i just fine:

[user@host:~]$ sudo -i

[root@host:~]# 

If anyone could advise me what logs to post here, and exactly how to get them, I'd be happy to. 😄

@picnoir
Member

picnoir commented Nov 6, 2023

Could you check if #263634 fixes this issue as well?

@peter-romfeld-bcw

i was following this guide using master branch and had this issue. checking out from 22.11 to build the image fixed it

@picnoir
Member

picnoir commented Jan 10, 2024

Thanks for testing this! Closing the issue.

(edit: I read this the other way around)

picnoir closed this as completed Jan 10, 2024
picnoir reopened this Jan 10, 2024
@i10416

i10416 commented Jan 12, 2024

A NixOS instance built from an image generated from current master (version inferred to be 24.05) also produces the same error when running on GCP with oslogin=true, as mentioned in https://nixos.wiki/wiki/Install_NixOS_on_GCE.

@i10416

i10416 commented Jan 12, 2024

The steps to reproduce:

git clone git@github.com:NixOS/nixpkgs.git && cd nixpkgs
git checkout 4a6c1d765bd660f8379f961e641f5c9fccd312dc

Then I ran the steps described in https://nixos.wiki/wiki/Install_NixOS_on_GCE inside Docker, because I need a linux/amd64 image but my machine is an M1 Mac.

docker run --rm -it --platform=linux/amd64 -v `pwd`:/nixpkgs --workdir /nixpkgs nixpkgs/nix:latest bash

Now, set nix path to unstable to use google-cloud-sdk.

NIX_PATH=nixpkgs=channel:nixos-unstable
nix-shell -p 'google-cloud-sdk'
export NIX_CONFIG=$'system-features = benchmark big-parallel nixos-test uid-range kvm\nfilter-syscalls = false\nexperimental-features = nix-command flakes'

These options come from

Now, make sure you unset the NIX_PATH pointing to unstable so that the OS is built with the expected version. If you forget to do so, Nix complains that system.stateVersion is not set and falls back to the runtime version.

Then, I create a Google Cloud Storage bucket BUCKET_NAME and run the image creation script:

BUCKET_NAME=<BUCKET_NAME> nixpkgs/nixos/maintainers/scripts/gce/create-gce.sh

NOTE: If uploading the artifacts fails for some reason, copy the artifacts from the container to your local machine, upload the tar.gz from the artifacts to the GCS bucket, and manually create the disk image according to the instructions here: https://cloud.google.com/compute/docs/import/import-existing-image.

Then, create a Google Compute Engine instance from the image.

After the instance starts running, connect to it via SSH.

gcloud compute ssh --zone "<ZONE>" "<INSTANCE_NAME>" --project "<PROJECT>"      

Now, I'm inside Compute Engine instance.

sudo -i

It results in the following error.

sudo: PAM account management error: User not known to the underlying authentication module
sudo: a password is required

@picnoir
Member

picnoir commented Jan 14, 2024

Thanks for the details!

Could this be the Google PAM module segfaulting and crashing the Nsncd daemon?

Is Nsncd generating any useful logs? Is it crashing when you see this error? (journalctl -u nscd to access the logs)

(I do not use gcloud, I cannot diagnose this by myself :( )

@flokli
Contributor

flokli commented Jan 14, 2024

The NSS module segfaulting and crashing ns(n)cd did indeed happen more than once - see #214811 and the linked upstream bug.

@peterromfeldhk
Contributor

related GoogleCloudPlatform/guest-oslogin#33

@peter-romfeld-bcw

@picnoir error still happens, can't check logs :(

$ journalctl -u nscd
Hint: You are currently not seeing messages from other users and the system.
      Users in groups 'adm', 'systemd-journal', 'wheel' can see all messages.
      Pass -q to turn off this notice.
No journal files were opened due to insufficient permissions.

@peter-romfeld-bcw

peter-romfeld-bcw commented Aug 6, 2024

here are the logs:

systemd[1]: nscd.service: Main process exited, code=killed, status=11/SEGV
systemd[1]: nscd.service: Failed with result 'signal'.
nscd.service: Consumed 32ms CPU time, received 7.9K IP traffic, sent 1.3K IP traffic.
nscd.service: Scheduled restart job, restart counter is at 3.
Starting Name Service Cache Daemon (nsncd)...
nsncd[213941]: Aug 06 07:16:58.391 INFO started, config: Config { ignored_request_types: {}, worker_count: 8, handoff_timeout: 3s }, path: "/var/run/nscd/socket"
systemd[1]: Started Name Service Cache Daemon (nsncd).

@picnoir
Member

picnoir commented Aug 6, 2024

Right, the google NSS module is segfaulting. The segfault seems to bring Nsncd down.

@peter-romfeld-bcw

peter-romfeld-bcw commented Aug 6, 2024

yeah i tried an overlay with the older versions of nsncd and co

  nixpkgs.overlays = [
    (self: super: {
      nsncd = pkgs-22-11.nsncd;
      google-guest-oslogin = pkgs-22-11.google-guest-oslogin;
      google-guest-configs = pkgs-22-11.google-guest-configs;
      google-guest-agent = pkgs-22-11.google-guest-agent;
    })
  ];

but i still got the error (like you said, something else is bringing nsncd down), so for now i just use this (seems to be working so far):

services.nscd.enableNsncd = false;
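
The overlay earlier in this comment refers to pkgs-22-11 without showing where it comes from. One way it could be defined is sketched below. The tarball URL, the let binding and the hard-coded system are assumptions for illustration, not the configuration actually used in this comment.

```nix
{ ... }:
let
  # Assumption: pkgs-22-11 is a second nixpkgs instance imported from the
  # nixos-22.11 branch. This URL tracks the branch head; pin a specific
  # revision (and a sha256) for reproducible builds.
  pkgs-22-11 = import (builtins.fetchTarball
    "https://github.com/NixOS/nixpkgs/archive/nixos-22.11.tar.gz") {
    system = "x86_64-linux"; # assumed to match the host platform
  };
in
{
  nixpkgs.overlays = [
    (self: super: {
      # Same attribute as in the overlay from this comment,
      # taken from the 22.11 package set.
      nsncd = pkgs-22-11.nsncd;
    })
  ];
}
```

As reported in this comment, pinning these packages did not make the error go away, so this only documents how the attempt could be set up.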

@picnoir
Member

picnoir commented Aug 6, 2024

Yeah, seems like Nsncd needs to better handle the NSS segfaults. That's not on my todolist though. PR welcome, I'll review it.

Alternatively, google could also work on fixing their NSS module so that it stops segfaulting continuously.


9 participants