Fix "broker received out of order sequence" when brokers die #1661
When the following three conditions are satisfied, the producer code can skip message sequence numbers and cause the broker to complain that the sequences are out of order:
* config.Producer.Idempotent is set
* The producer loses, and then regains, its connection to a broker
* The client code continues to attempt to produce messages whilst the broker is unavailable.

For every message the client attempts to send while the broker is unavailable, the transaction manager sequence number is incremented; however, these messages eventually fail and return an error to the caller. When the broker reappears and another message is published, its sequence number is higher than the last one the broker remembered, because the values that were attempted while it was down were never seen. Thus, from the broker's perspective, it's seeing out-of-order sequence numbers.

The fix to this has a few parts:
* Don't obtain a sequence number from the transaction manager until we're sure we want to try publishing the message
* Affix the producer ID and epoch to the message once the sequence is generated
* Increment the transaction manager epoch (and reset all sequence numbers to zero) when we permanently fail to publish a message. That represents a sequence that the broker will never see, so the only safe thing to do is to roll over the epoch number.
* Ensure we don't publish message sets that contain messages from multiple transaction manager epochs.
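To make the failure mode above concrete, here is a minimal toy simulation in Go. It is a sketch under stated assumptions, not the code from this PR: `toyBroker`, `toyProducer`, `bumpOnFailure` and `scenario` are hypothetical names, and the broker model (a newer epoch restarts numbering at zero) is a simplification chosen only to illustrate why rolling the epoch over avoids the sequence gap.

```go
package main

import (
	"errors"
	"fmt"
)

// toyBroker is a hypothetical stand-in for a partition leader. It tracks the
// producer epoch it last saw and the next sequence number it expects.
type toyBroker struct {
	up      bool
	epoch   int16
	nextSeq int32
}

func (b *toyBroker) produce(epoch int16, seq int32) error {
	if !b.up {
		return errors.New("broker unavailable")
	}
	if epoch > b.epoch {
		// Assumption of this toy model: a newer epoch restarts numbering at zero.
		b.epoch, b.nextSeq = epoch, 0
	}
	if seq != b.nextSeq {
		return fmt.Errorf("out of order sequence: got %d, want %d", seq, b.nextSeq)
	}
	b.nextSeq++
	return nil
}

// toyProducer consumes one sequence number per attempted message.
// bumpOnFailure models the "roll over the epoch" part of the fix.
type toyProducer struct {
	epoch         int16
	nextSeq       int32
	bumpOnFailure bool
}

func (p *toyProducer) send(b *toyBroker) error {
	seq := p.nextSeq
	p.nextSeq++
	err := b.produce(p.epoch, seq)
	if err != nil && p.bumpOnFailure {
		// The broker will never see this sequence, so start a fresh epoch.
		p.epoch++
		p.nextSeq = 0
	}
	return err
}

// scenario produces two messages, loses the broker for three attempts,
// regains it, and returns the error (if any) from the next send.
func scenario(bumpOnFailure bool) error {
	b := &toyBroker{up: true}
	p := &toyProducer{bumpOnFailure: bumpOnFailure}
	for i := 0; i < 2; i++ {
		if err := p.send(b); err != nil {
			return err
		}
	}
	b.up = false
	for i := 0; i < 3; i++ {
		_ = p.send(b) // fails; the caller would see this error
	}
	b.up = true
	return p.send(b)
}

func main() {
	fmt.Println("without epoch bump:", scenario(false))
	fmt.Println("with epoch bump:   ", scenario(true))
}
```

Without the epoch bump, the three failed attempts leave a gap between what the producer sends next and what the toy broker expects; with it, the first send after recovery starts a new epoch at sequence zero and is accepted.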
I think the "CI / Go 1.14.x with Kafka 2.4.0 on Ubuntu" job just needs to be poked; it looks like it failed before it even got to running the test suite.
@KJTsanaktsidis thanks for debugging this through and coming up with a solution. The changes that you propose here look good to me. It would be good if we could replicate this in the functional tests with toxiproxy, but that doesn't necessarily have to be done as part of this PR nor before merging. I've noticed a few people now saying they've been unable to get the functional tests to run locally. I wonder if we should investigate making them simpler to run by using docker or k3s for the brokers rather than the existing vagrant-based setup (/cc @bai)
FWIW I didn't manage to get the functional tests to run properly either! (EDIT: Just realised I said this in my PR description, which is why you brought it up) In theory it should be relatively simple to set up a functional testcase for this - all that's needed is to publish a bunch of messages, use toxiproxy to blackhole the broker for a while, and then bring the broker back and check that messages can still be published. If I get a moment this week I'll see if I can wrestle with vagrant enough to get an integration test going. We're currently working out how to get enough confidence in this change to ship it to production, so integration tests could help us with that anyway.
Yeah, I've had that on my TODO but haven't had time to attack this, to be honest. I've been looking into either docker-compose or k3s but am personally leaning towards the latter. I think the current vagrant setup is somewhat outdated.
I think the Vagrantfile won't have worked since it was bumped from trusty to bionic in b8c5f7c - setup_services.sh tries to set up classic upstart/sysv-init style jobs for toxiproxy/zookeeper/kafka. This could probably be fixed by turning them into systemd units, but a more wholesale fix to use containers for this could simplify it a lot, you're right. In any case, I'll bodge something on my machine to try and get the tests to work and submit a functional test to this PR; actually fixing the local functional test runner is probably a job for a different PR.
@bai @dnwe it was certainly an ordeal, but I got the integration tests running on my machine. What I ended up doing was making a docker-compose file with kafka/zookeeper/toxiproxy, and moving all of the topic creation/seeding stuff to the golang test code itself. I did have a look at k3s for this, but it doesn't seem to have a good macOS story. This is it: zendesk@0650324 so, the tests can be run like:
The CI runner could probably be replaced with the docker-compose file with a bit of tweaking too. In any case, after all that, I did write a functional test and add it to this PR.
@@ -96,6 +97,83 @@ func TestFuncProducingToInvalidTopic(t *testing.T) {
	safeClose(t, producer)
}

func TestFuncProducingIdempotentWithBrokerFailure(t *testing.T) {
💯
@KJTsanaktsidis thanks for the update. Great news. Are you running against Kafka 2.3/2.4 in the backend?
Ha, unfortunately we're using Kafka 2.1.1 in production (and 2.2.2 in our staging environment).
@KJTsanaktsidis thanks for this fix 👍
@KJTsanaktsidis did you want to also propose a PR for your vagrant --> docker-compose migration for the functional tests?
@dnwe yeah, if you want to go in that direction I think I can put together a PR that has
Sounds like a job for my next lab day, which is end of next week. Sounds like a plan?
@KJTsanaktsidis that sounds great to me. I think making functional tests against a real kafka easier to run and hence easier to write will be extremely valuable.
I'm 👍 on getting rid of Vagrant and simplifying this setup and I think docker-compose is a good candidate for this. I'd also suggest k3s but assuming |
Yeah, so we use k3s internally in our CI builds, but in this scenario it doesn't really buy us much above and beyond docker-compose. We'd invariably be running it within docker (for non-Linux devs), so it would just add an additional layer, and we'd have the complexities of statefulset yaml or adopting something like strimzi. All of that increases the complexity of things for devs to run/understand, so I think compose is a reasonable solution. The alternative would be to use the docker go bindings and drive the start and stop of containers directly from the tests themselves, without any "bring up the cluster" step outside of Go at all.
Makes sense, thanks. So my vote goes for docker-compose then 😄
This should be a fix for #1430 I think. I tested it with this test harness: https://gist.github.com/KJTsanaktsidis/12a33a9e6e864857b91f639947567ac3 and tried a few things while it was running:
Seemed to all come out OK.
This is, however, very much still a WIP PR. In particular:
* ~~It needs tests~~ I added a test case that covers this.
* ~~It needs to be used in anger in one of our production systems~~ We've deployed this to production in Zendesk now over the last week
* ~~I couldn't get the e2e tests running on my machine~~ I provided some suggestions about how we could make the e2e tests work more easily on dev machines
* I very much would love your eyeballs on this to see if this looks like the right solution to you!