Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

daemon: Add socket activation via /run/rpm-ostreed.socket #3874

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

cgwalters
Copy link
Member

@cgwalters cgwalters commented Jul 22, 2022

This is a re-submit of #2932
daemon: Add socket activation via /run/rpm-ostreed.socket

For historical reasons, the daemon ends up doing a lot of
initialization before even claiming the DBus name. For example,
it calls ostree_sysroot_load(). We also end up scanning
the RPM database, and actually parse all the GPG keys
in /etc/pki/rpm-gpg etc.

This causes two related problems:

  • By doing all this work before claiming the bus name, we
    race against the (pretty low) DBus service timeout of 25s.
  • If something hard fails at initialization, the client can't
    easily see the error because it just appears in the systemd
    journal. The client will just see a service timeout.

The solution to this is to adopt systemd socket activation.
There's a new rpm-ostreed.socket and the daemon can be activated
that way.

The client (when run as root, the socket is only accessible to root
right now) connects, which will activate the daemon and attempt
initialization - without claiming the DBus name yet.

If something goes wrong here, the daemon will reply to the client
that activated it with the error, and then also exit with failure.

On success, everything continues as normal, including claiming
the DBus name.

Note that this also logically replaces the code that does explicit
systemctl start rpm-ostreed invocations.

After this patch:

$ systemctl stop rpm-ostreed
$ umount /boot
$ rpm-ostree status
error: Couldn't start daemon: Error setting up sysroot: loading sysroot: Unexpected state: /run/ostree-booted found, but no /boot/loader directory

@cgwalters
Copy link
Member Author

This blocks on SELinux policy updates: https://bugzilla.redhat.com/show_bug.cgi?id=2110012

rust/src/daemon.rs Outdated Show resolved Hide resolved
src/daemon/rpm-ostreed.socket Show resolved Hide resolved
@cgwalters cgwalters force-pushed the privsocket-activation branch 2 times, most recently from 9d0f220 to b1f7105 Compare July 27, 2022 15:24
@cgwalters cgwalters marked this pull request as draft July 27, 2022 17:26
@cgwalters
Copy link
Member Author

OK, I now made this into a build-time option. (Though looking at it, I also need to make installation of the socket unit a build-time option)

This will allow us to ship this in F37 now (with a matching specfile conditional) and gain experience with it.

Leaving as draft though for now since this is a nontrivial change and could use some more design thought before merging.

@cgwalters cgwalters force-pushed the privsocket-activation branch 2 times, most recently from d8f96dd to 43a5e9f Compare July 28, 2022 01:29
@cgwalters
Copy link
Member Author

/test all

@cgwalters cgwalters force-pushed the privsocket-activation branch from 43a5e9f to 4a71f22 Compare July 29, 2022 19:13
@cgwalters
Copy link
Member Author

cgwalters commented Jul 29, 2022

Dang it, I did a whole port to https://crates.io/crates/uds before discovering tormol/uds#6

For now...I switched back to using worker threads. That said...hmmm. What would probably make sense here is to factor out an async-sock-seqpacket crate (that internally uses worker threads, but actually exposes a tokio channel that takes serde-serializable types) ...we've now got very similar code in all 3 of:

I guess one can tell that I like SOCK_SEQPACKET?

@cgwalters
Copy link
Member Author

If we want to support a non-DBus API, one thing we'll likely need to do for now is bridge requests from the tokio thread to the glib/DBus daemon thread. We can do this by acquiring that worker thread's main context and then invoking https://docs.rs/glib/latest/glib/struct.MainContext.html#method.spawn

@cgwalters cgwalters force-pushed the privsocket-activation branch 3 times, most recently from 424e91f to 2d70a00 Compare August 4, 2022 10:33
@cgwalters cgwalters marked this pull request as ready for review August 4, 2022 12:11
@cgwalters
Copy link
Member Author

OK C9S and RHEL9.1 should include the necessary selinux policy changes for this, so I think this can be considered for review and landing.

Looks like there's a CI failure in the Jenkins/vmcheck bits I need to dig into.

But right now more problematically the socket path isn't being tested by CI at all; need to add a f37 build or extend our c9s build to enable this feature and actually do tests too.

@cgwalters cgwalters force-pushed the privsocket-activation branch from 2d70a00 to 4ef468f Compare August 14, 2022 21:43
jmarrero
jmarrero previously approved these changes Aug 15, 2022
Copy link
Member

@jmarrero jmarrero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@cgwalters
Copy link
Member Author

I've been looking in the background at https://github.com/mitsuhiko/tokio-unix-ipc/ and have two issues open, though neither are blocking.

Cargo.toml Outdated Show resolved Hide resolved
rust/src/client.rs Outdated Show resolved Hide resolved
rust/src/daemon.rs Outdated Show resolved Hide resolved
rust/src/client.rs Outdated Show resolved Hide resolved
@RishabhSaini RishabhSaini force-pushed the privsocket-activation branch from 4837241 to c80b74f Compare August 3, 2023 22:35
@RishabhSaini RishabhSaini force-pushed the privsocket-activation branch 2 times, most recently from 6d28576 to 82fd7bb Compare August 14, 2023 18:19
rust/src/daemon.rs Outdated Show resolved Hide resolved
@RishabhSaini RishabhSaini force-pushed the privsocket-activation branch 2 times, most recently from 40836d7 to 9be6242 Compare August 14, 2023 21:04
@cgwalters
Copy link
Member Author

OK I made a few more changes here just to clean things up:

  • Squashed all the commits
  • Removed now-spurious whitespace changes leftover from the removed feature flag
  • Changed you to be the author, listed me as co-author

@cgwalters cgwalters force-pushed the privsocket-activation branch from 9be6242 to 5b01dfe Compare August 14, 2023 21:36
rust/src/daemon.rs Outdated Show resolved Hide resolved
rust/src/daemon.rs Show resolved Hide resolved
@RishabhSaini RishabhSaini force-pushed the privsocket-activation branch 3 times, most recently from f21c9db to 88db71b Compare August 16, 2023 13:59
@RishabhSaini
Copy link
Contributor

The jenkins pipeline shows error: cleanup: Loading sysroot: Starting daemon via socket: Connecting: No such file or directory (os error 2). Does the rpm-ostreed.socket file in /etc/systemd/system need to be enabled?

@cgwalters
Copy link
Member Author

I think what's going on here is:

We include the .socket unit in the RPM, but it's not enabled by default because it's not in the preset list.

This has burned us before. In theory, what we'd need to do is add it to that list. However doing so will dramatically slow down our ability to ship this.

So what I'd propose is:

  • Change the client code to check if it's running as root, and if so systemctl start rpm-ostreed.socket explicitly
  • Continue to improve and land this PR
  • Once it lands, add a PR to fedora-release to enable the socket by default
  • Eventually, drop the client auto-starting

Hmm I think we may also need to add %systemd_post for our units.

@RishabhSaini RishabhSaini force-pushed the privsocket-activation branch from 88db71b to aa48433 Compare August 17, 2023 14:48
@cgwalters cgwalters self-assigned this Aug 23, 2023
@cgwalters cgwalters force-pushed the privsocket-activation branch from aa48433 to c589d98 Compare August 23, 2023 15:24
@cgwalters cgwalters marked this pull request as ready for review August 23, 2023 15:24
@cgwalters
Copy link
Member Author

OK will pick this back up in the background; I made a few more tweaks. Let's see what CI says.

@RishabhSaini
Copy link
Contributor

Thanks for the help!

For historical reasons, the daemon ends up doing a lot of
initialization before even claiming the DBus name.  For example,
it calls `ostree_sysroot_load()`.  We also end up scanning
the RPM database, and actually parse all the GPG keys
in `/etc/pki/rpm-gpg` etc.

This causes two related problems:

- By doing all this work before claiming the bus name, we
  race against the (pretty low) DBus service timeout of 25s.
- If something hard fails at initialization, the client can't
  easily see the error because it just appears in the systemd
  journal.  The client will just see a service timeout.

The solution to this is to adopt systemd socket activation.
There's a new `rpm-ostreed.socket` and the daemon can be activated
that way.

The client (when run as root, the socket is only accessible to root
right now) connects, which will activate the daemon and attempt
initialization - without claiming the DBus name yet.

If something goes wrong here, the daemon will reply to the client
that activated it with the error, and then also exit with failure.

On success, everything continues as normal, including claiming
the DBus name.

Note that this also logically replaces the code that does explicit
`systemctl start rpm-ostreed` invocations.

After this patch:

```
$ systemctl stop rpm-ostreed
$ umount /boot
$ rpm-ostree status
error: Couldn't start daemon: Error setting up sysroot: loading sysroot: Unexpected state: /run/ostree-booted found, but no /boot/loader directory
```

Co-authored-by: Colin Walters <[email protected]>
@cgwalters cgwalters force-pushed the privsocket-activation branch from c589d98 to c993732 Compare August 23, 2023 15:48
Ok(())
}

static SHUTDOWN_SIGNAL: Lazy<Mutex<Option<Sender<()>>>> = Lazy::new(|| Mutex::new(None));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at this again and I remembered that there's a race condition this was intended to fix but we didn't. Basically, the socket clients aren't accounted for in our current exit-on-idle logic.

What we need to do is:

  • If we have a socket client, then disable the idle exit
  • When we choose to idle exit, first shut down the accept() handler

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm and a further complication here is to make this race-free we probably need to use the glib main loop (aka main thread) and not the tokio thread - at least to accept clients on the socket; we could still do the actual processing on the tokio thread with appropriate synchronization.

@cgwalters cgwalters marked this pull request as draft August 28, 2023 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants