Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Retries due to periodic failure of underlying docker commands (ex. rm)? #538

Closed
t3hmrman opened this issue Nov 24, 2023 · 3 comments · Fixed by #575
Closed

Feature: Retries due to periodic failure of underlying docker commands (ex. rm)? #538

t3hmrman opened this issue Nov 24, 2023 · 3 comments · Fixed by #575

Comments

@t3hmrman
Copy link

As always thanks for the awesome library, it's been incredibly useful for testing.

I've been doing some stress-testing on my test suite (i.e. running the tests continuously until one failed) lately and found that sometimes the Cli actually fails to perform some lower level docker CLI commands.

The first failure I encountered was a failure with creating a container, but unfortunately I didn't have --nocapture on, so I couldn't get the output. After repeating the process I found that I got a failure:

.......... TERMINATING [>120.000s] project::mod components::mod::inner::test_name_ci_serial
thread '<unnamed>' panicked at /path/to/.cargo/registry/src/index.crates.io-6f17d22bba15001f/testcontainers-0.15.0/src/clients/cli.rs:354:9:
Failed to remove docker container
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'components::mod::inner::test_name_ci_serial' panicked at /path/to/.cargo/registry/src/index.crates.io-6f17d22bba15001f/testcontainers-0.15.0/src/clients/cli.rs:354:9:
Failed to remove docker container
test components::mod::inner::test_name_ci_serial ... FAILED

failures:

failures:
    components::mod::inner::test_name_ci_serial

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 10 filtered out; finished in 120.47s

     TIMEOUT [ 120.474s] project::mod components::mod::inner::test_name_ci_serial
   Canceling due to test failure
------------
     Summary [ 188.463s] 11 tests run: 10 passed, 1 timed out, 0 skipped
     TIMEOUT [ 120.474s] project::mod components::mod::inner::test_name_ci_serial
error: test run failed
error: Recipe `test-int` failed on line 124 with exit code 100

I've anonymized the details of the project and test suite, but it should be clear that the failure was inside (but not the fault of) testcontainers.

Looking at the output of my docker systemd service, I see a failure to write stderr (emphasis via spacing added below):

Nov 24 12:14:30 host dockerd[13873]: time="2023-11-24T12:14:30.456846562+09:00" level=info msg="ignoring event" container=444c245cd67691e5ad93decc764347c145299ffc33636300ffeff89348cdbca3 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
...(many identical info messages for different containers)...

Nov 24 12:16:31 host dockerd[13873]: time="2023-11-24T12:16:31.062710838+09:00" level=error msg="Error running exec 53c30d4cedccd888d169f833fca4db8fab88d5520aadc2df6edc4affde44174b in container: exec attach failed: error attaching stderr stream: write unix /run/docker.sock->@: write: broken pipe"

Nov 24 12:16:31 host dockerd[13873]: time="2023-11-24T12:16:31.094747073+09:00" level=info msg="ignoring event" container=622b59590b33913f5ecbc7e93d01bad6a2fe4b610cda63f9c12a42efa79d9a18 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
...(many identical info messages for different containers)...

After this error came up I restarted the test suite and it worked just fine -- the lower level failure seems transient.

Does it make sense to add error detection and/or a dumb retry policy at this level to the underlying client? I'm not sure if there's a better way to handle this, and unfortunately I didn't increase the log level on docker so it wasn't more specific on why it failed (like it has been for others).

@thomaseizinger
Copy link
Collaborator

Are you running your tests concurrently? I wouldn't be surprised if there are race conditions within the docker CLI. I ran into some myself. You can try using the experimental HTTP client that talks to the docker daemon directly.

@t3hmrman
Copy link
Author

Ah, so if this is a known issue then is the way to resolve this just to add some documentation to recommend switching to the experimental HTTP client method for now?

@thomaseizinger
Copy link
Collaborator

Ah, so if this is a known issue then is the way to resolve this just to add some documentation to recommend switching to the experimental HTTP client method for now?

I didn't say that it is a known issue but you can try the experimental client to narrow down which component causes the issue.

github-merge-queue bot pushed a commit that referenced this issue Apr 22, 2024
Quite large refactoring as part of project revamp #563, and also the
long-awaited refactoring #386

Now it's really simple API, e.g:
```rs
let container = GenericImage::new("redis", "latest")
        .with_exposed_port(6379)
        .with_wait_for(WaitFor::message_on_stdout("Ready to accept connections"))
        .start();
```

I find this new API much easier to use, solves a lot of problems, and
seems flexible enough to be extended.
This also works regardless of the tokio runtime flavor (multi-thread vs
current-thread). And the sync API is still available, right under
`blocking` feature (just like `reqwest` does).

From a maintainer's perspective this also simplifies the code, we don't
have to worry about two different clients and their differences.

### Docker host resolution
The host is resolved in the following order:

1. Docker host from the `tc.host` property in the
`~/.testcontainers.properties` file.
2. `DOCKER_HOST` environment variable.
3. Docker host from the "docker.host" property in the
`~/.testcontainers.properties` file.
4. Else, the default Docker socket will be returned.

### Notes
- MSRV was bumped to `1.70` in order to use `std::sync::OnceLock`. This
should NOT be a problem, tests usually executed on more recent versions
(also see [this
ref](https://github.com/testcontainers/testcontainers-rs/pull/503/files#r1242651354)).
- `Cli` client is removed, instead we provide `sync` (under `blocking`
feature) and `async` impls based on HTTP client (bollard)
- tested with
[modules](https://github.com/testcontainers/testcontainers-rs-modules-community)

## Migration guide 

- Sync API migration (`Cli` client)
  - Add `blocking` feature
  - Drop all usages of `clients::Cli`
  - Add `use testcontainers::runners::SyncRunner;`
  - Replace `client.run(image)` with `image.start()`
- Async API migration (`Http` client)
  - Remove `experimental` feature
  - Drop all usages of `clients::Http`
  - Add `use testcontainers::runners::AsyncRunner;`
  - Replace `client.run(image)` with `image.start()`

## References

Closes #386
Closes #326
Closes #475
Closes #508
Closes #392
Closes #561
Closes #559
Closes #564
Closes #538
Closes #507
Closes #89
Closes #198
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants