Successes() channel returning &MessageToSend with flags:2 after requeue #179

Closed
ORBAT opened this issue Nov 14, 2014 · 10 comments

ORBAT commented Nov 14, 2014

I'm seeing messages like &sarama.MessageToSend{Topic:"benchmark-kafka", Key:sarama.Encoder(nil), Value:sarama.Encoder(nil), offset:5, partition:2, flags:2} occasionally being returned on the Successes() channel. Here's what usually seems to happen in the logs right before receiving the message:

[Sarama] 2014/11/14 13:18:34.767471 Connected to broker 192.168.100.1:10093
[Sarama] 2014/11/14 13:18:34.767505 Connected to broker 192.168.100.1:10092
[Sarama] 2014/11/14 13:18:34.767524 Connected to broker 192.168.100.1:10094
[Sarama] 2014/11/14 13:18:34.874508 Producer requeueing batch of 16 messages due to error: kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
[Sarama] 2014/11/14 13:18:34.874540 Producer dispatching retried messages to new leader.
[Sarama] 2014/11/14 13:18:34.874553 Fetching metadata from broker 192.168.100.1:10094
[Sarama] 2014/11/14 13:18:34.874620 Messages requeued

Is this intended behavior?
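
(For context, the usage pattern described above corresponds roughly to the sketch below. It relies only on the channel API visible in this thread, namely Input(), Successes() and Errors() on the pre-1.0 sarama producer, and leaves out producer construction and the config option that enables the Successes() channel, since those details aren't shown in the issue.)

```go
package example

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
)

// produceAndCollect sketches the reported usage: push n messages into
// Input() and expect exactly n results back, each either an ack on
// Successes() or a failure on Errors(). Internal messages with flags
// set and a nil Value should never appear here.
func produceAndCollect(producer *sarama.Producer, n int) {
	go func() {
		for i := 0; i < n; i++ {
			producer.Input() <- &sarama.MessageToSend{
				Topic: "benchmark-kafka",
				Value: sarama.StringEncoder(fmt.Sprintf("msg-%d", i)),
			}
		}
	}()

	for i := 0; i < n; i++ {
		select {
		case msg := <-producer.Successes():
			log.Printf("acked: %#v", msg)
		case err := <-producer.Errors():
			log.Printf("failed: %v", err)
		}
	}
}
```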

eapache commented Nov 14, 2014

I'll have to double-check, but I think so. Flags are used internally to track the state of messages, such as whether they've been retried, so I would expect retried messages to have some extra flag(s) set.

How did you even check that? We don't expose those flags in the API.

eapache commented Nov 14, 2014

(the log messages indicate that your cluster is undergoing leader election and so the producer has to switch brokers)

ORBAT commented Nov 14, 2014

Printf's %#v prints private fields too (although I did run into this in my sarama io.Writer fork).
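
(Aside: this is standard fmt behaviour. The %#v verb prints a Go-syntax representation of a value, including unexported fields, which is why the internal offset, partition and flags fields are visible at all. A minimal illustration with a made-up type, not sarama's:)

```go
package main

import "fmt"

type message struct {
	Topic string
	flags int // unexported field
}

func main() {
	m := &message{Topic: "benchmark-kafka", flags: 2}
	fmt.Printf("%v\n", m)  // &{benchmark-kafka 2}
	fmt.Printf("%#v\n", m) // &main.message{Topic:"benchmark-kafka", flags:2}
}
```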

Anyhow, I'm keeping tabs on the MessageToSend pointers I'm sending on the Input() channel, so I know the message isn't something I've sent (plus the Value field of these unexpected messages is always nil).
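
(A rough sketch of that bookkeeping, for illustration only and not the reporter's actual benchmark code, using the same pre-1.0 producer types as above:)

```go
package example

import (
	"log"

	"github.com/Shopify/sarama"
)

// trackAcks records every *MessageToSend pointer handed to Input() and
// reports anything coming back on Successes() that was never sent.
// Errors are ignored here to keep the sketch short.
func trackAcks(producer *sarama.Producer, msgs []*sarama.MessageToSend) {
	sent := make(map[*sarama.MessageToSend]bool, len(msgs))
	for _, m := range msgs {
		sent[m] = true // record the pointer before it enters the producer
	}

	// Feed the producer from another goroutine so Successes() can be
	// drained concurrently.
	go func() {
		for _, m := range msgs {
			producer.Input() <- m
		}
	}()

	for i := 0; i < len(msgs); i++ {
		ack := <-producer.Successes()
		if !sent[ack] {
			log.Printf("unexpected message on Successes(): %#v", ack)
		}
	}
}
```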

I wonder why the cluster's doing leader election / why the metadata goes stale in the first place; it's running on my laptop, and there shouldn't be any surprising changes in cluster membership since it's pretty much always up.

ORBAT commented Nov 14, 2014

Looks like some messages neither get acknowledged nor error out after this happens, although I haven't had time to check whether they actually get sent or not. If needed, I can try to come up with a test case for this, although it might take some time given my current workload.

eapache commented Nov 14, 2014

> %#v prints private fields too

today I learned

> so I know the message isn't something I've sent

Ya, haha I know what this is. In retry scenarios the producer internally passes some "fake" messages in order to carry additional flags. I guess we shouldn't be returning those to the user :)

For now you can just ignore them - they're basically meaningless, just an implementation detail leaking out where it shouldn't. Easy to fix.

> I wonder why the cluster's doing leader election / why the metadata goes stale in the first place

I dunno, but "kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date." is a message we get from the broker, so the broker is definitely telling us that something is wrong.

eapache commented Nov 14, 2014

> Easy to fix.

Or not. There's a corner case here that isn't great. Still digging...

eapache commented Nov 14, 2014

A reproducible case or a network traffic capture (tcpdump or Wireshark) at the time of the event would really help; a more complete log sample might also be useful. Every time I think I've figured out how this could occur, I realize it should be impossible.

It would be trivial to add an "if flags set, don't send on Successes channel" check, but all that would do is mask the real problem - those messages shouldn't be in that part of the flow to begin with.
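
(For what it's worth, the band-aid being rejected here would amount to something like the toy below: filtering flagged messages out of the ack path rather than fixing why they reached it. The types are made up for illustration and are not sarama's internals.)

```go
package main

import "fmt"

type produced struct {
	value string
	flags int // non-zero means "internal bookkeeping message"
}

func main() {
	acked := []produced{{"real message", 0}, {"internal retry marker", 2}}
	successes := make(chan produced, len(acked))

	// The rejected masking check: drop flagged messages instead of
	// understanding how they got into this part of the flow.
	for _, msg := range acked {
		if msg.flags != 0 {
			continue
		}
		successes <- msg
	}
	close(successes)

	for msg := range successes {
		fmt.Println("delivered to user:", msg.value)
	}
}
```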

eapache added a commit that referenced this issue Nov 15, 2014
I worked out the formal state machines of the relevant goroutines and stuck a
log message in at every state change. This should help with debugging #179.

The logs should still only trigger on broker rebalance or super-heavy traffic,
so normal operation should be quiet.
potocnyj pushed a commit to VividCortex/sarama that referenced this issue Nov 15, 2014

eapache commented Nov 18, 2014

I have a hypothesis: is it possible you are reusing MessageToSend structs at any point (either by sending the same struct multiple times or by resending a struct that came out of the successes or errors channel)?
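
(Concretely, the kind of reuse being asked about is either of the two patterns below, shown with the pre-1.0 types used in this thread; both are illustrative, not code from the report.)

```go
package example

import "github.com/Shopify/sarama"

// reuseSamePointer sends the same struct pointer to Input() twice.
func reuseSamePointer(producer *sarama.Producer) {
	msg := &sarama.MessageToSend{
		Topic: "benchmark-kafka",
		Value: sarama.StringEncoder("payload"),
	}
	producer.Input() <- msg
	producer.Input() <- msg // same *MessageToSend sent again
}

// resendAckedMessage feeds a message that already came back out of
// Successes() straight back into Input().
func resendAckedMessage(producer *sarama.Producer) {
	ack := <-producer.Successes()
	producer.Input() <- ack
}
```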

ORBAT commented Nov 19, 2014

Hm, that shouldn't be the case: a new *MessageToSend is always created, and each pointer is sent to Input() exactly once.

eapache added a commit that referenced this issue Nov 19, 2014
Also reported at #199 (comment)

Go's closure semantics are really annoying - simply stop spawning goroutines
with the wrong arguments.

Add a test (heavily based on https://gist.github.com/ORBAT/d0adcd790dff34b37b04)
to ensure this behaviour doesn't regress.

Huge thanks to Tom Eklöf for getting me all the logs etc. needed to track this
down.
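
(The closure pitfall this commit message refers to is the classic Go loop-variable capture, under pre-1.22 loop semantics: goroutines spawned in a loop that close over the loop variable all observe whatever value it holds when they finally run, whereas passing it as an argument pins a copy at spawn time. A generic illustration, not the sarama code:)

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup

	// Buggy with pre-Go-1.22 loop semantics: every goroutine closes over
	// the same variable i, so most of them see a later value of it.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println("captured:", i) // often prints 3, 3, 3
		}()
	}
	wg.Wait()

	// Fixed: pass the value as an argument so each goroutine gets its own copy.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			fmt.Println("argument:", n) // prints 0, 1, 2 in some order
		}(i)
	}
	wg.Wait()
}
```
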
eapache added a commit that referenced this issue Nov 19, 2014

ORBAT commented Nov 19, 2014

Looks like PR #202 did the trick, yay! I'll go ahead and close the issue.

@ORBAT ORBAT closed this as completed Nov 19, 2014
ORBAT pushed a commit to ORBAT/sarama that referenced this issue Nov 19, 2014
ORBAT pushed a commit to ORBAT/sarama that referenced this issue Nov 19, 2014