Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check that StartTransientUnit/StopUnit succeeds #2331

Merged
merged 1 commit into from
Apr 28, 2020

Conversation

lifubang
Copy link
Member

@lifubang lifubang commented Apr 20, 2020

fix #2313(relate to #2310) #2344

To fix #2313
We can check whether StartTransientUnit succeeds or not with the value from the channel.
If we got failed, we should run ResetFailedUnit and return the error.

[root@localhost systemdtest]# runc --systemd-cgroup run -d test
WARN[0000] signal: killed                               
ERRO[0000] cannot detect unified path                   
ERRO[0000] container_linux.go:349: starting container process caused "process_linux.go:306: applying cgroup configuration for process caused \"StartTransientUnit job is not done: failed\"" 
container_linux.go:349: starting container process caused "process_linux.go:306: applying cgroup configuration for process caused \"StartTransientUnit job is not done: failed\""

The error ERRO[0000] cannot detect unified path relates to #2329 .

To fix #2344
We should wait StopUnit succeeds.

Signed-off-by: lifubang [email protected]

@lifubang
Copy link
Member Author

@kolyshkin PTAL

@lifubang lifubang force-pushed the StartTransientUnit branch from 613c712 to 28fc3f3 Compare April 20, 2020 01:38
Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on that! Left a couple of nits.

Can you also add the same code (wait for the operation to complete, check it's "done", return an error if not) to the Destroy() method. Currently it does not even wait.

Oh, finally, could you please wait until #2299 is merged and then rebase? It is mostly code refactoring in there.

@kolyshkin
Copy link
Contributor

@lifubang have you tested that the fix works? I have described the repro in details here: containers/crun#331 (comment), should take 10-15 minutes to check it (mostly waiting for vagrant up).

@lifubang
Copy link
Member Author

Can you also add the same code (wait for the operation to complete, check it's "done", return an error if not) to the Destroy() method. Currently it does not even wait.

Oh, finally, could you please wait until #2299 is merged and then rebase? It is mostly code refactoring in there.

OK

@lifubang
Copy link
Member Author

lifubang commented Apr 20, 2020

have you tested that the fix works?

I have tested it in vmware.

@lifubang lifubang force-pushed the StartTransientUnit branch from bb47d24 to c0b435d Compare April 20, 2020 07:13
@lifubang
Copy link
Member Author

Can you also add the same code (wait for the operation to complete, check it's "done", return an error if not) to the Destroy() method. Currently it does not even wait.

I add it in #2329

@kolyshkin
Copy link
Contributor

Needs rebase

@lifubang lifubang force-pushed the StartTransientUnit branch 2 times, most recently from 2aa85d6 to a5d5d89 Compare April 22, 2020 01:40
select {
case s := <-statusChan:
if DbusJobCompletionResultType(s) != DbusJobDone {
logrus.Warnf("error stopping unit `%s`: got `%s`. Continuing...", unitName, s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think s/stopping/removing/ is better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!


const (
DbusJobDone DbusJobCompletionResultType = "done"
DbusJobCanceled DbusJobCompletionResultType = "canceled"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you using any of those (I mean, except for DbusJobDone)?

Copy link
Member Author

@lifubang lifubang Apr 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed now!

@@ -11,6 +11,26 @@ import (
"github.com/opencontainers/runc/libcontainer/configs"
)

// Refer to https://github.com/coreos/go-systemd/blob/5a0db84d3dc459ccdc6ffcc44b1c452bf9f171cb/dbus/methods.go#L78-L101
// or https://godoc.org/github.com/coreos/go-systemd/dbus#Conn.StartUnit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these two are the same: one is the source code, another is godoc extracted from it.

And we only need "done", let's not overcomplicate things.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In v1, shall we need to add this?
If yes, I'll add a const string DbusJobDone.

@lifubang lifubang force-pushed the StartTransientUnit branch 3 times, most recently from 8fcd201 to ac366bc Compare April 22, 2020 02:56
@lifubang
Copy link
Member Author

Can you also add the same code (wait for the operation to complete, check it's "done", return an error if not) to the Destroy() method. Currently it does not even wait.

@kolyshkin this code also can fix #2344

// Please refer to https://godoc.org/github.com/coreos/go-systemd/dbus#Conn.StartUnit
if s != "done" {
logrus.Warnf("error removing unit `%s`: got `%s`. Continuing...", unitName, s)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The channel can be closed here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not close the channel, because if someone send string to this closed channel, it will cause goroutine panic. Although I guess there is nobody would send sth to this channel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the github.com/coreos/go-systemd/dbus code.

Looks like once it sends a string over the channel it forgets it. So yes we should close the channel -- unless we hit the timeout!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finished now

@kolyshkin
Copy link
Contributor

kolyshkin commented Apr 24, 2020

this code also can fix #2344

Nice! @lifubang please update the description to say so (and about the StopUnit, too).

@kolyshkin
Copy link
Contributor

So, I guess this code should be moved to libcontainer/cgroups/systemd/common.go and used from both v1 and v2.

@lifubang lifubang changed the title check that StartTransientUnit succeeds check StartTransientUnit/StopUnit succeeds Apr 24, 2020
@lifubang lifubang changed the title check StartTransientUnit/StopUnit succeeds check that StartTransientUnit/StopUnit succeeds Apr 24, 2020
@lifubang lifubang force-pushed the StartTransientUnit branch 2 times, most recently from 0da17ed to 59912cd Compare April 24, 2020 07:42
@lifubang
Copy link
Member Author

Both v1 and v2 are using the same StartTransientUnit/StopUnit method now, and close the channel when finished now.

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too early :)

@@ -99,3 +102,51 @@ func isUnitExists(err error) bool {
}
return false
}

func startJob(unitName string, properties []systemdDbus.Property) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/startJob/startUnit/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Job is a very misleading term here.

return nil
}

func stopJob(unitName string) error {
Copy link
Contributor

@kolyshkin kolyshkin Apr 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/stopJob/stopUnit/

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit wrt functions naming. Otherwise, it's perfecto!

@lifubang lifubang force-pushed the StartTransientUnit branch from 59912cd to 2b91a2d Compare April 24, 2020 23:46
@AkihiroSuda
Copy link
Member

AkihiroSuda commented Apr 24, 2020

LGTM

Approved with PullApprove

@AkihiroSuda
Copy link
Member

needs rebase

@lifubang lifubang force-pushed the StartTransientUnit branch from 2b91a2d to b0fffaa Compare April 25, 2020 23:24
@AkihiroSuda
Copy link
Member

AkihiroSuda commented Apr 25, 2020

LGTM

Approved with PullApprove

@AkihiroSuda
Copy link
Member

@kolyshkin LGTY?

@AkihiroSuda AkihiroSuda marked this pull request as draft April 28, 2020 07:26
@AkihiroSuda AkihiroSuda marked this pull request as ready for review April 28, 2020 07:28
@lifubang lifubang force-pushed the StartTransientUnit branch from b0fffaa to bfa1b2a Compare April 28, 2020 07:46
@AkihiroSuda
Copy link
Member

AkihiroSuda commented Apr 28, 2020

LGTM

Approved with PullApprove

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kolyshkin
Copy link
Contributor

kolyshkin commented Apr 28, 2020

LGTM

Approved with PullApprove

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants