Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send only streams data if channel is not closed #734

Merged
merged 1 commit into from
Sep 20, 2022

Conversation

saschagrunert
Copy link
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

The stream data is only read on read_all_with_timeout, which is called via exec sync or on on container creation error. This means we can end-up having a closed sender, but we should not fail in this case because it will lead into unexpected container termination.

Which issue(s) this PR fixes:

Fixes #720

Special notes for your reviewer:

None

Does this PR introduce a user-facing change?

Fixed bug where a closed stream sender causes container termination.

The stream data is only read on `read_all_with_timeout`, which is called
via exec sync or on on container creation error. This means we can
end-up having a closed sender, but we should not fail in this case
because it will lead into unexpected container termination.

Signed-off-by: Sascha Grunert <[email protected]>
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 20, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@saschagrunert
Copy link
Member Author

saschagrunert commented Sep 20, 2022

An alternative would be to read after successful container creation:

diff --git a/conmon-rs/server/src/container_io.rs b/conmon-rs/server/src/container_io.rs
index b1ebc52..c1e4a61 100644
--- a/conmon-rs/server/src/container_io.rs
+++ b/conmon-rs/server/src/container_io.rs
@@ -267,11 +267,9 @@ impl ContainerIO {
                                 .await
                                 .context("write to attach endpoints")?;
 
-                            if !message_tx.is_closed() {
-                                message_tx
-                                    .send(Message::Data(data.into(), pipe))
-                                    .context("send data message")?;
-                            }
+                            message_tx
+                                .send(Message::Data(data.into(), pipe))
+                                .context("send data message")?;
                         }
                         Err(e) => match Errno::from_i32(e.raw_os_error().context("get OS error")?) {
                             Errno::EIO => {
diff --git a/conmon-rs/server/src/rpc.rs b/conmon-rs/server/src/rpc.rs
index daea94e..5462ced 100644
--- a/conmon-rs/server/src/rpc.rs
+++ b/conmon-rs/server/src/rpc.rs
@@ -14,7 +14,7 @@ use std::{
     process, str,
     time::Duration,
 };
-use tokio::time::Instant;
+use tokio::{task, time::Instant};
 use tracing::{debug, debug_span, error, Instrument};
 use uuid::Uuid;
 
@@ -150,18 +150,28 @@ impl conmon::Server for Server {
 
                 // register grandchild with server
                 let io = SharedContainerIO::new(container_io);
+                let io_clone = io.clone();
                 let child = Child::new(
                     id,
                     grandchild_pid,
                     exit_paths,
                     oom_exit_paths,
                     None,
-                    io,
+                    io_clone,
                     cleanup_cmd,
                     token,
                 );
                 capnp_err!(child_reaper.watch_grandchild(child))?;
 
+                task::spawn(
+                    async move {
+                        if let Err(e) = io.read_all_with_timeout(None).await {
+                            error!("IO read failure: {:#}", e);
+                        }
+                    }
+                    .instrument(debug_span!("read_all_with_timeout")),
+                );
+
                 results
                     .get()
                     .init_response()

@saschagrunert
Copy link
Member Author

So the e2e tests are now looking way better with this change: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/cri-o_cri-o/6220/pull-ci-cri-o-cri-o-main-ci-e2e-conmonrs/1572188830188965888

The 2 failing tests are also failing on main right now.

@haircommander
Copy link
Collaborator

/lgtm

good find!

@openshift-ci openshift-ci bot added the lgtm label Sep 20, 2022
@openshift-merge-robot openshift-merge-robot merged commit 8e4cec9 into containers:main Sep 20, 2022
@saschagrunert saschagrunert deleted the send-closed branch September 20, 2022 15:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment