Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for alter/list partition reassignements APIs #1617

Closed

Conversation

sladkoff
Copy link
Contributor

@sladkoff sladkoff commented Feb 17, 2020

We added support for API Keys 45 (AlterPartitionReassignments) and 46 (ListPartitionReassignments).

This enables sarama to update the replication factor of topics for Kafka 2.4.0.0+ (See also #1238).

We had to do change quite a lot since these APIs require the use of newer data structures in the body as well as in the header of the requests/responses (mainly compact structures and tagged fields).

Please let us know what you think.

@sladkoff sladkoff requested a review from bai as a code owner February 17, 2020 15:53
@ghost ghost added the cla-needed label Feb 17, 2020
@sladkoff sladkoff changed the title Feature/update replication assignment Add support for alter/list partition reassignements APIs Feb 17, 2020
@d1egoaz d1egoaz requested a review from dnwe February 19, 2020 13:31
@dnwe
Copy link
Collaborator

dnwe commented Feb 19, 2020

@sladkoff 🌟 thanks for this, great PR!

It'll take a little while to review it all as I also need to brush up my knowledge on KIP-482 and how this fits with the existing messages too.

@ghost ghost removed the cla-needed label Feb 20, 2020
@bai
Copy link
Contributor

bai commented Feb 20, 2020

@sladkoff 💯 Do you think you could rebase your branch to make sure it's ✅on CI 🙏

@sladkoff sladkoff force-pushed the feature/update-replication-assignment branch from 601e78f to 093729b Compare February 20, 2020 07:08
@sladkoff sladkoff force-pushed the feature/update-replication-assignment branch from 093729b to c4390e1 Compare February 20, 2020 07:20
Copy link
Collaborator

@dnwe dnwe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again, the protocol changes look good 👍

A small number of enquiries below:

admin.go Outdated
}

request := &AlterPartitionReassignmentsRequest{
TimeoutMs: int32(10000),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did this 10_000 millisecond default timeout come from? The protocol description suggests 60_000 should be the default?

https://github.com/apache/kafka/blob/2.4.0/clients/src/main/resources/common/message/AlterPartitionReassignmentsRequest.json#L23

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably just put something in there :) Using the suggested default makes sense. I added this.

admin.go Outdated
}

request := &ListPartitionReassignmentsRequest{
TimeoutMs: int32(10000),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, shouldn't this be 60_000 ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

admin_test.go Outdated
seedBroker.SetHandlerByMap(map[string]MockResponse{
"MetadataRequest": NewMockMetadataResponse(t).
SetController(seedBroker.BrokerID()).
SetBroker(seedBroker.Addr(), seedBroker.BrokerID()),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please could you add a second mockBroker to the test and set that as the controller in the mock metadata response and not the seed broker? That should exercise that the request must be sent to the Controller and not just any broker (or the broker currently connected to)

admin_test.go Outdated
seedBroker.SetHandlerByMap(map[string]MockResponse{
"MetadataRequest": NewMockMetadataResponse(t).
SetController(seedBroker.BrokerID()).
SetBroker(seedBroker.Addr(), seedBroker.BrokerID()),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, please use different MockBroker for the Controller

admin.go Outdated

request.AddBlock(topic, partitions)

b, err := ca.findAnyBroker()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that ListPartitionReassignments also needed to be sent to the Controller (not just any broker) like the Alter request? The Java Admin client seems to do so here

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. I fixed that

@ghost
Copy link

ghost commented Feb 25, 2020

@dnwe I added commits which address your remarks. Can you have a look? One of the CI jobs failed, but it does not look related to our change, right?

Copy link
Collaborator

@dnwe dnwe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated changes look good to me.

One small remaining fix below and then I think the PR is ready to be approved and merged

@@ -512,7 +512,7 @@ func (ca *clusterAdmin) ListPartitionReassignments(topic string, partitions []in

request.AddBlock(topic, partitions)

b, err := ca.findAnyBroker()
b, err := ca.Controller()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this is using the controller you probably want to wrapper the body in a return ca.retryOnError(isErrNoController, ...) call like you did for AlterPartitionReassignments so it refreshes the cached controller if it is stale

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the retryOnError function cannot be used when a value needs to be returned. other functions in admin.go where a controller is used and a return type is needed also don't use retrying. How should this be handled? should we implement a retry function that returns something like interface{}?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dwi-di ah of course — let's just leave this for now and we can always follow-up in another PR

@dnwe
Copy link
Collaborator

dnwe commented Feb 27, 2020

One of the CI jobs failed, but it does not look related to our change, right?

I'm guessing it was just a slight flake — but it was in the new TestListPartitionReassignmentRequest func:

--- FAIL: TestListPartitionReassignmentRequest (0.00s)
##[error]    request_test.go:63: Encoding two blocks failed
        got  [0 0 39 16 3 7 116 111 112 105 99 50 3 0 0 0 1 0 0 0 2 0 6 116 111 112 105 99 2 0 0 0 0 0 0] 
        want [0 0 39 16 3 6 116 111 112 105 99 2 0 0 0 0 0 7 116 111 112 105 99 50 3 0 0 0 1 0 0 0 2 0 0]

@dnwe
Copy link
Collaborator

dnwe commented Feb 27, 2020

I was able to re-create with:

go test -run '^TestListPartitionReassignmentRequest$' -count=500 -race

@dnwe
Copy link
Collaborator

dnwe commented Feb 27, 2020

Ah I can see the problem, the ordering of the two blocks in your check of testRequest(t, "two blocks", request, listPartitionReassignmentsRequestTwoBlocks) isn't guaranteed, so sometimes you get "topic" followed by "topic2" in your response, and sometimes you get "topic2" followed by "topic" in your response.

@ghost
Copy link

ghost commented Feb 27, 2020

ah I see, but what do you propose to do here? the only way to fix it is to change the encoder to sort the blocks based on topic name, right? but that does not feel right, because it is not really needed, only for the test...

@dnwe
Copy link
Collaborator

dnwe commented Mar 1, 2020

ah I see, but what do you propose to do here? the only way to fix it is to change the encoder to sort the blocks based on topic name, right? but that does not feel right, because it is not really needed, only for the test...

Yes I think it just isn't possible to use the existing testRequest helper which is what is doing the byte-comparison along the way. In your two-block test you'd probably just test a struct --> protocolEncode --> protocolDecode --> struct roundtrip correctly gives you a pair of structs that are DeepEqual — which is essentially a slimmed down version of the existing helpers

@ghost
Copy link

ghost commented Mar 4, 2020

I modified the flaky test, that should be all then.

Copy link
Collaborator

@dnwe dnwe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM 👍

dnwe added a commit that referenced this pull request Mar 6, 2020
…ignment

* feature/update-replication-assignment:
  feat: Add alter and list partition reassignments
@dnwe
Copy link
Collaborator

dnwe commented Mar 6, 2020

Hmm I squash merged this manually from the CLI, but GitHub doesn't seem to have tracked the merge (2d2326e via 9501120)

@dnwe dnwe closed this Mar 6, 2020
@sladkoff sladkoff deleted the feature/update-replication-assignment branch March 10, 2020 09:52
@sladkoff
Copy link
Contributor Author

@dnwe Nice! Can we expect to get these changes into a release soon?

@d-rk
Copy link
Contributor

d-rk commented Mar 12, 2020

we should probably wait for #1640 before releasing this, otherwise the list partition replica assignement status is not really usable.

@sladkoff
Copy link
Contributor Author

sladkoff commented May 6, 2020

Hi guys! It's been two months now, so I'll allow myself to re-post my question 😅

Could we get an ETA for a new release with this these features?

@dnwe @bai

@bai
Copy link
Contributor

bai commented May 6, 2020

That's fair. @dnwe any objections on cutting release today?

@dnwe
Copy link
Collaborator

dnwe commented May 6, 2020

@bai yes I was merging a few PRs in yesterday as I think we do need to cut a release with this and also to include @KJTsanaktsidis's fix for the session ID cache exhaustion that was introduced in 1.26 and is a nasty bug.

@dnwe
Copy link
Collaborator

dnwe commented May 6, 2020

@bai we might need to check on the github actions builds though, as they've been a bit flakey recently. Not sure if we can do anything to improve them

@bai
Copy link
Contributor

bai commented May 6, 2020

I think we can address GitHub Actions issue short-term by reducing build matrix — it's gotten so much flakier with 9 builds instead of 6.

Hopefully when we have that docker-compose setup in place it'll get better 🙏

Want to do this release yourself or would you prefer me cutting it?

@dnwe
Copy link
Collaborator

dnwe commented May 6, 2020

Sure I can give a go to cutting the release. Just a tag in Git and then create a release with a small changelog linking to the PRs?

@bai
Copy link
Contributor

bai commented May 6, 2020

@dnwe I usually use Draft New Release button on Releases page here: https://github.com/Shopify/sarama/releases.
Once I'm confident (title/tag look what I want them to look) I'm promoting Draft Release with "Publish Release" button.

So basically I'm not doing manual git tag but use GitHub UI to release 🤷‍♀️

@dnwe
Copy link
Collaborator

dnwe commented May 6, 2020

@sladkoff released and announced!

https://github.com/Shopify/sarama/releases/tag/v1.26.2

https://twitter.com/oldmanuk/status/1258035278771621888

(apologies for missing shoutouts to those I couldn't immediately find on Twitter)

@sladkoff
Copy link
Contributor Author

sladkoff commented May 6, 2020

@dnwe @bai Awesome 🎉 Thanks for the quick reaction!

@bai
Copy link
Contributor

bai commented May 6, 2020

Props to Dominic for cutting this one and all of you folks for contributing 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants