
Fix flaky kafka cluster example #4549
Merged: 12 commits into master from 4479-debug, Oct 7, 2021

Conversation

@rnorth (Member) commented Oct 6, 2021

Fixes #4479
I believe the original issue relates to the GitHub Actions runners simply not being powerful enough to launch 3 instances of Kafka in parallel within reasonable startup time limits.

  • Increase the startup time limit and start the containers in series. The total startup time in series appears to be about the same as in parallel, which I think confirms my suspicion that the failure was resource-constrained (we observed ~27 s startup time for a parallel start vs. ~12 + 7 + 7 s for a serial start); see the sketch after this list.
  • Add logback to the kafka-cluster example for better, timestamped log output.
  • Change KafkaContainer to emit the full exec stdout/stderr/exit code when a kafka-configs call fails.
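
A minimal sketch of the serial-start approach from the first bullet, assuming a helper that holds the cluster's brokers in a collection; the names SerialClusterStart, startInSeries, brokers and STARTUP_TIMEOUT are illustrative and not taken from the example itself:

    import java.time.Duration;
    import java.util.Collection;

    import org.testcontainers.containers.KafkaContainer;

    class SerialClusterStart {

        // Illustrative value: more generous than before, to tolerate slow CI runners.
        private static final Duration STARTUP_TIMEOUT = Duration.ofMinutes(1);

        // Start each broker one after another instead of via a parallel deepStart,
        // so a resource-constrained runner only pays for one Kafka startup at a time.
        static void startInSeries(Collection<KafkaContainer> brokers) {
            for (KafkaContainer broker : brokers) {
                broker.withStartupTimeout(STARTUP_TIMEOUT);
                broker.start();
            }
        }
    }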

@rnorth (Member, Author) commented Oct 7, 2021

Our observations so far: the kafka-configs process is exiting with exit code 137, which means it is receiving SIGKILL. @bsideup tried upgrading the image to one that has a cgroup-aware JDK version, in case a GHA change is causing containers to be OOM-killed. It doesn't look like that has helped.

I'm still considering the hypothesis that startup is just taking too long, since it involves pulling the image and then starting N instances of Kafka in parallel (25s+ just to start the container looks really long). If deepStart is timing out, perhaps it's killing the containers (and by coincidence, killing kafka-configs).
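
For context on where the exit code 137 is observed: a hedged sketch of the kind of check that runs the kafka-configs command inside the container and surfaces the full ExecResult on failure; the class and method names are illustrative, and the actual command string is omitted.

    import java.io.IOException;

    import org.testcontainers.containers.Container;
    import org.testcontainers.containers.GenericContainer;

    class KafkaConfigsCheck {

        // Run a kafka-configs command inside the broker container and fail loudly,
        // including exit code, stdout and stderr, if it does not exit cleanly.
        static void runKafkaConfigs(GenericContainer<?> broker, String command)
                throws IOException, InterruptedException {
            Container.ExecResult result = broker.execInContainer("sh", "-c", command);
            if (result.getExitCode() != 0) {
                // Exit code 137 means the process inside the container received SIGKILL.
                throw new IllegalStateException(
                    "kafka-configs failed: exitCode=" + result.getExitCode()
                        + ", stdout=" + result.getStdout()
                        + ", stderr=" + result.getStderr()
                );
            }
        }
    }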

@rnorth (Member, Author) commented Oct 7, 2021

> I'm still considering the hypothesis that startup is just taking too long, since it involves pulling the image and then starting N instances of Kafka in parallel (25s+ just to start the container looks really long). If deepStart is timing out, perhaps it's killing the containers (and by coincidence, killing kafka-configs).

I think this is it:

Kafka container startup (including the image pull) begins at 08:57:31.423:

    08:57:31.423 INFO  🐳 [confluentinc/cp-kafka:6.2.1] - Pulling docker image: confluentinc/cp-kafka:6.2.1. Please be patient; this may take some time but only needs to be done once.

First failure logged at 08:58:01.917:

    08:58:01.917 ERROR 🐳 [confluentinc/cp-kafka:6.2.1] - Could not start container
    java.lang.IllegalStateException: Container.ExecResult(exitCode=137, stdout=, stderr=)
    	at org.testcontainers.containers.KafkaContainer.containerIsStarted(KafkaContainer.java:121)
    	at org.testcontainers.containers.GenericContainer.containerIsStarted(GenericContainer.java:687)
    	at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:503)
    	at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:330)
    	at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81)
    	at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:328)
    	at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:316)
    	at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:719)
    	at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701)
    	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)

This lines up with the 30s timeout applied in CONTAINER_RUNNING_TIMEOUT_SEC.
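
To illustrate why the timestamps line up: a sketch, under the assumption that the example applies CONTAINER_RUNNING_TIMEOUT_SEC to a parallel Startables.deepStart, of how an image pull plus N concurrent Kafka startups must all fit inside that single 30-second budget (class and method names here are illustrative).

    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    import org.testcontainers.lifecycle.Startable;
    import org.testcontainers.lifecycle.Startables;

    class ParallelClusterStart {

        // Stand-in for the example's CONTAINER_RUNNING_TIMEOUT_SEC constant.
        private static final int CONTAINER_RUNNING_TIMEOUT_SEC = 30;

        static void startInParallel(List<Startable> brokers)
                throws InterruptedException, ExecutionException, TimeoutException {
            // All brokers start concurrently; the image pull and every Kafka startup
            // must complete before the single 30 s wait below expires.
            Startables.deepStart(brokers.stream())
                .get(CONTAINER_RUNNING_TIMEOUT_SEC, TimeUnit.SECONDS);
        }
    }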

@kiview (Member) commented Oct 7, 2021

I would approve this, once out of draft state 🙂

@rnorth changed the title from "Debugging for #4479" to "Fix flaky kafka cluster example" on Oct 7, 2021
@rnorth (Member, Author) commented Oct 7, 2021

Have tidied up and updated the PR description.

@rnorth marked this pull request as ready for review on October 7, 2021 at 10:42
Commit: …KafkaContainerCluster.java (Co-authored-by: Kevin Wittek <[email protected]>)
@rnorth merged commit dace7e4 into master on Oct 7, 2021
@rnorth deleted the 4479-debug branch on October 7, 2021 at 11:15
@kiview added this to the next milestone on Oct 13, 2021
Linked issue: Flaky test: kafka-cluster example