Feature: Retries due to periodic failure of underlying `docker` commands (ex. `rm`)? #538

t3hmrman · 2023-11-24T03:30:50Z

As always thanks for the awesome library, it's been incredibly useful for testing.

I've been doing some stress-testing on my test suite (i.e. running the tests continuously until one failed) lately and found that sometimes the Cli actually fails to perform some lower level docker CLI commands.

The first failure I encountered was a failure with creating a container, but unfortunately I didn't have --nocapture on, so I couldn't get the output. After repeating the process I found that I got a failure:

.......... TERMINATING [>120.000s] project::mod components::mod::inner::test_name_ci_serial
thread '<unnamed>' panicked at /path/to/.cargo/registry/src/index.crates.io-6f17d22bba15001f/testcontainers-0.15.0/src/clients/cli.rs:354:9:
Failed to remove docker container
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'components::mod::inner::test_name_ci_serial' panicked at /path/to/.cargo/registry/src/index.crates.io-6f17d22bba15001f/testcontainers-0.15.0/src/clients/cli.rs:354:9:
Failed to remove docker container
test components::mod::inner::test_name_ci_serial ... FAILED

failures:

failures:
    components::mod::inner::test_name_ci_serial

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 10 filtered out; finished in 120.47s

     TIMEOUT [ 120.474s] project::mod components::mod::inner::test_name_ci_serial
   Canceling due to test failure
------------
     Summary [ 188.463s] 11 tests run: 10 passed, 1 timed out, 0 skipped
     TIMEOUT [ 120.474s] project::mod components::mod::inner::test_name_ci_serial
error: test run failed
error: Recipe `test-int` failed on line 124 with exit code 100

I've anonymized the details of the project and test suite, but it should be clear that the failure was inside (but not the fault of) testcontainers.

Looking at the output of my docker systemd service, I see a failure to write stderr (emphasis via spacing added below):

Nov 24 12:14:30 host dockerd[13873]: time="2023-11-24T12:14:30.456846562+09:00" level=info msg="ignoring event" container=444c245cd67691e5ad93decc764347c145299ffc33636300ffeff89348cdbca3 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
...(many identical info messages for different containers)...

Nov 24 12:16:31 host dockerd[13873]: time="2023-11-24T12:16:31.062710838+09:00" level=error msg="Error running exec 53c30d4cedccd888d169f833fca4db8fab88d5520aadc2df6edc4affde44174b in container: exec attach failed: error attaching stderr stream: write unix /run/docker.sock->@: write: broken pipe"

Nov 24 12:16:31 host dockerd[13873]: time="2023-11-24T12:16:31.094747073+09:00" level=info msg="ignoring event" container=622b59590b33913f5ecbc7e93d01bad6a2fe4b610cda63f9c12a42efa79d9a18 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
...(many identical info messages for different containers)...

After this error came up I restarted the test suite and it worked just fine -- the lower level failure seems transient.

Does it make sense to add error detection and/or a dumb retry policy at this level to the underlying client? I'm not sure if there's a better way to handle this, and unfortunately I didn't increase the log level on docker so it wasn't more specific on why it failed (like it has been for others).

The text was updated successfully, but these errors were encountered:

thomaseizinger · 2024-01-15T03:06:31Z

Are you running your tests concurrently? I wouldn't be surprised if there are race conditions within the docker CLI. I ran into some myself. You can try using the experimental HTTP client that talks to the docker daemon directly.

t3hmrman · 2024-01-15T14:09:10Z

Ah, so if this is a known issue then is the way to resolve this just to add some documentation to recommend switching to the experimental HTTP client method for now?

thomaseizinger · 2024-01-15T17:04:10Z

Ah, so if this is a known issue then is the way to resolve this just to add some documentation to recommend switching to the experimental HTTP client method for now?

I didn't say that it is a known issue but you can try the experimental client to narrow down which component causes the issue.

Quite large refactoring as part of project revamp #563, and also the long-awaited refactoring #386 Now it's really simple API, e.g: ```rs let container = GenericImage::new("redis", "latest") .with_exposed_port(6379) .with_wait_for(WaitFor::message_on_stdout("Ready to accept connections")) .start(); ``` I find this new API much easier to use, solves a lot of problems, and seems flexible enough to be extended. This also works regardless of the tokio runtime flavor (multi-thread vs current-thread). And the sync API is still available, right under `blocking` feature (just like `reqwest` does). From a maintainer's perspective this also simplifies the code, we don't have to worry about two different clients and their differences. ### Docker host resolution The host is resolved in the following order: 1. Docker host from the `tc.host` property in the `~/.testcontainers.properties` file. 2. `DOCKER_HOST` environment variable. 3. Docker host from the "docker.host" property in the `~/.testcontainers.properties` file. 4. Else, the default Docker socket will be returned. ### Notes - MSRV was bumped to `1.70` in order to use `std::sync::OnceLock`. This should NOT be a problem, tests usually executed on more recent versions (also see [this ref](https://github.com/testcontainers/testcontainers-rs/pull/503/files#r1242651354)). - `Cli` client is removed, instead we provide `sync` (under `blocking` feature) and `async` impls based on HTTP client (bollard) - tested with [modules](https://github.com/testcontainers/testcontainers-rs-modules-community) ## Migration guide - Sync API migration (`Cli` client) - Add `blocking` feature - Drop all usages of `clients::Cli` - Add `use testcontainers::runners::SyncRunner;` - Replace `client.run(image)` with `image.start()` - Async API migration (`Http` client) - Remove `experimental` feature - Drop all usages of `clients::Http` - Add `use testcontainers::runners::AsyncRunner;` - Replace `client.run(image)` with `image.start()` ## References Closes #386 Closes #326 Closes #475 Closes #508 Closes #392 Closes #561 Closes #559 Closes #564 Closes #538 Closes #507 Closes #89 Closes #198

DDtKey mentioned this issue Apr 22, 2024

feat!: container-centric API with refactored underlying layer #575

Merged

DDtKey closed this as completed in #575 Apr 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: Retries due to periodic failure of underlying `docker` commands (ex. `rm`)? #538

Feature: Retries due to periodic failure of underlying `docker` commands (ex. `rm`)? #538

t3hmrman commented Nov 24, 2023

thomaseizinger commented Jan 15, 2024

t3hmrman commented Jan 15, 2024

thomaseizinger commented Jan 15, 2024

Feature: Retries due to periodic failure of underlying docker commands (ex. rm)? #538

Feature: Retries due to periodic failure of underlying docker commands (ex. rm)? #538

Comments

t3hmrman commented Nov 24, 2023

thomaseizinger commented Jan 15, 2024

t3hmrman commented Jan 15, 2024

thomaseizinger commented Jan 15, 2024

Feature: Retries due to periodic failure of underlying `docker` commands (ex. `rm`)? #538

Feature: Retries due to periodic failure of underlying `docker` commands (ex. `rm`)? #538