Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[extension/cgroupruntime] Be aware of ECS task and CPU limits #36920

Merged
merged 13 commits into from
Jan 8, 2025

Conversation

r0mdau
Copy link
Contributor

@r0mdau r0mdau commented Dec 21, 2024

Description

Allow the cgroupruntime extension to set GOMAXPROCS based on AWS ECS metadata. See related issue for detailed informations.

Link to tracking issue

Fixes #36814

Testing

Added integration test with httptest for the ECS metadata endpoint.

Something to clarify: #36617 (comment)

Documentation

Added extension name and link in the README.

Copy link
Contributor

@rogercoll rogercoll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this, the code lgtm, just left a couple of nits/question.

extension/cgroupruntimeextension/factory.go Outdated Show resolved Hide resolved
extension/cgroupruntimeextension/go.mod Show resolved Hide resolved
@@ -220,6 +257,13 @@ func TestCgroupV2SudoIntegration(t *testing.T) {
})
require.NoError(t, err)

if test.setECSMetadataURI {
server := startMockECSServer()
defer server.Close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wdyt if we move this defer calls to the t.Cleanup function? That way is aligned with the other resources rollback strategy, and we are also safe on panics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure how to implement it as I see multiple options. I need to clean only if test.setECSMetadataURI is true. Because I find it handy to start the mock http server only if this same bool is true.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me, I think you could append the cleanup function inside the conditional. Something like:

			if test.setECSMetadataURI {
				server := startMockECSServer()
				os.Setenv("ECS_CONTAINER_METADATA_URI_V4", server.URL)
				t.Cleanup(func() {
					server.Close()
					os.Unsetenv("ECS_CONTAINER_METADATA_URI_V4")
				})
			}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be resolved with latest commit 6212bde

@@ -63,6 +66,21 @@ func cgroupMaxCpu(filename string) (quota int64, period uint64, err error) {
return quota, period, err
}

func startMockECSServer() *httptest.Server {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the integration test interact with this server? Is it a dependency for gomaxecs.IsECS()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the gomaxecs library yes, it is making a http call to the ECS metadata uri.

Does it truly interacts ? I am not sure even for the github action CI, see my comment in the integration test PR.

Commands launched in Github action CI:

gotestsum --rerun-fails=1 --packages="./..." -- -race -timeout 360s -parallel 4 -skip Sudo -tags=integration,""
gotestsum --rerun-fails=1 --packages="./..." -- -race -timeout 360s -parallel 4 -skip Sudo -exec sudo -run Sudo -tags=integration,""

On my local workstation it is Skipped, compared to the github ci, I removed the -skip Sudo flag, the output:

$ cd extension/cgroupruntimeextension
$ ../../.tools/gotestsum --rerun-fails=1 --packages="./..." -- -race -timeout 360s -parallel 4 -tags=integration,""
∅  internal/metadata
✓  . (1.056s)

=== Skipped
=== SKIP: . TestCgroupV2SudoIntegration (0.00s)
    integration_test.go:48: System running in hybrid or cgroupv1 mode

DONE 15 tests, 1 skipped in 5.366s

I could run the tests in a docker container with some cgroups, open to any suggestion for making the integration test pass.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you help me identify where the gomaxecs library does the http call to the ECS metadata URI? I quickly checked the IsECS method and brought me to this GetECSMetadataURI method, which just looks for the environment variable.

I could run the tests in a docker container with some cgroups, open to any suggestion for making the integration test pass.

Nice catch, just pushed this small fix to make them run #36941

It seems that a changelog file is missing (make ch-new) and the linting CI is complaining. Could you please take a look?

Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi

  1. I rebased the main branch
  2. Added a changelog entry

The http call is not made using IsECS() but within gomaxecs.Set() function call.

I am wondering to revert the current integration test proposition because it is bad implementation.

Proposition is to create a dedicated integration test because it needs to make the http call mocked in order to do a proper integration test. Let me know, multiple options available.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe adding a few more test cases (e.g. no container CPU limit defined) would help to understand the behavior, but this approach already looks good to me. Looks very similar to what gomaxecs is currently doing: https://github.com/rdforte/gomaxecs/blob/main/internal/task/task_test.go#L455

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed a change 6212bde

@r0mdau
Copy link
Contributor Author

r0mdau commented Dec 27, 2024

@rogercoll writing my local setup for executing the integration test, let me know if I write it in the README or somewhere else.

Inside the extension folder, start a privileged docker container and share the code

cd extension/cgroupruntimeextension
docker run -ti --privileged --cgroupns=host -v $(pwd):/workspace -w /workspace debian:bookworm-slim

Install Go and gcc to run the integration test

apt update && apt install -y wget sudo gcc && wget https://go.dev/dl/go1.23.4.linux-amd64.tar.gz && tar -C /usr/local -xzf go1.23.4.linux-amd64.tar.gz && export PATH=$PATH:/usr/local/go/bin && go version && rm go1.23.4.linux-amd64.tar.gz

Run the integration test

CGO_ENABLED=1 go test -v -exec sudo -race -timeout 360s -parallel 4 -tags=integration,""

@rogercoll
Copy link
Contributor

@rogercoll writing my local setup for executing the integration test, let me know if I write it in the README or somewhere else.

That would be more than welcomed, feel free to update the README (in this or a follow-up PR). Thanks!

@@ -144,6 +168,71 @@ func TestCgroupV2SudoIntegration(t *testing.T) {
// 134217728 * 0.1
expectedGoMemLimit: 13421772,
},
{
name: "AWS ECS 90% the max cgroup memory and 12 GOMAXPROCS",
cgroupCpuQuota: pointerInt64(100000),
Copy link
Contributor

@rogercoll rogercoll Dec 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wdyt of using the cgroupCpuQuoata variable to set the task/container CPU limits? (instead of expectedGoMaxProcs*1024)? That way is more aligned to other test cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion continuing here: #36920 (comment)

// to get the Cgroup CPU quota from the httptest server
cleanECS := func() {}
if test.setECSMetadataURI {
server := testServerECSMetadata(t, test.expectedGoMaxProcs*1024, test.expectedGoMaxProcs*1024)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to the comment above, maybe using cgroupCpuQuota might be more appropriate here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cgroup2 manager.update() is still applying the settings. I would even prefer in the ECS tests to put values in cgroupCpuQuota and cgroupCpuPeriod that have nothing to do with the expectedGoMaxProcs value. To prove the code fetched the value from the HTTP endpoint.

Because if IsECS() == true whatever you set in the cgroup, the lib will make a http call to the ECS metadata uri and set GOMAXPROCS.

So I see 3 viable options:

  • Keep as is
  • Only change the cgroupCpuQuota.. not aligned with expectedGoMaxProcs to prove the HTTP call is happening and working
  • Move the ECS tests into a new func next to func TestCgroupV2SudoIntegration(t *testing.T) {}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking into account that cgroupCpuQuota and cgroupCpuPeriod won't be taken into consideration if IsECS() == true, I would go with option 3. It seems to me that having a new TestECSCgroupV2SudoIntegration will provide a cleaner testing interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposition: 12db2c1

I moved some redundant code into functions in order to make it cleaner and that's debatable.

@r0mdau
Copy link
Contributor Author

r0mdau commented Dec 30, 2024

@rogercoll writing my local setup for executing the integration test, let me know if I write it in the README or somewhere else.

That would be more than welcomed, feel free to update the README (in this or a follow-up PR). Thanks!

Proposition to start a CONTRIBUTING page: a510187


## Testing

To run the integration tests locally for this extension, you can follow theses steps in a Linux environment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that some Linux distributions already run systemd under cgroupv2, in that case, integration tests can be run without the docker workaround.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice added 12acc97

extension/cgroupruntimeextension/CONTRIBUTING.md Outdated Show resolved Hide resolved
// to get the Cgroup CPU quota from the httptest server
cleanECS := func() {}
if test.setECSMetadataURI {
server := testServerECSMetadata(t, test.expectedGoMaxProcs*1024, test.expectedGoMaxProcs*1024)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking into account that cgroupCpuQuota and cgroupCpuPeriod won't be taken into consideration if IsECS() == true, I would go with option 3. It seems to me that having a new TestECSCgroupV2SudoIntegration will provide a cleaner testing interface.

Copy link
Contributor

@rogercoll rogercoll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for adding this feature

@mx-psi
Copy link
Member

mx-psi commented Jan 7, 2025

@r0mdau can you fix the merge conflict?

@r0mdau
Copy link
Contributor Author

r0mdau commented Jan 7, 2025

@r0mdau can you fix the merge conflict?

@mx-psi rebase done

@mx-psi
Copy link
Member

mx-psi commented Jan 7, 2025

Error: cgroupruntimeextension/integration_test.go:35:2: var-naming: const ecsMetadataUri should be ecsMetadataURI (revive)
	ecsMetadataUri     = "ECS_CONTAINER_METADATA_URI_V4"
	^
Error: cgroupruntimeextension/integration_test.go:379:4: os.Setenv() can be replaced by `t.Setenv()` in anonymous function (tenv)
			os.Setenv(ecsMetadataUri, server.URL)
			^

@mx-psi mx-psi merged commit cc1232e into open-telemetry:main Jan 8, 2025
161 checks passed
@github-actions github-actions bot added this to the next release milestone Jan 8, 2025
@r0mdau r0mdau deleted the gomaxecs-procs branch January 9, 2025 01:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[extension/cgroupruntime] Be aware of ECS task and CPU limits
4 participants