Successes() channel returning &MessageToSend with flags:2 after requeue #179

Closed
ORBAT opened this issue Nov 14, 2014 · 10 comments

ORBAT commented Nov 14, 2014

I'm seeing messages like &sarama.MessageToSend{Topic:"benchmark-kafka", Key:sarama.Encoder(nil), Value:sarama.Encoder(nil), offset:5, partition:2, flags:2} occasionally being returned on the Successes() channel. Here's what usually seems to happen in the logs right before receiving the message:

[Sarama] 2014/11/14 13:18:34.767471 Connected to broker 192.168.100.1:10093
[Sarama] 2014/11/14 13:18:34.767505 Connected to broker 192.168.100.1:10092
[Sarama] 2014/11/14 13:18:34.767524 Connected to broker 192.168.100.1:10094
[Sarama] 2014/11/14 13:18:34.874508 Producer requeueing batch of 16 messages due to error: kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
[Sarama] 2014/11/14 13:18:34.874540 Producer dispatching retried messages to new leader.
[Sarama] 2014/11/14 13:18:34.874553 Fetching metadata from broker 192.168.100.1:10094
[Sarama] 2014/11/14 13:18:34.874620 Messages requeued

Is this intended behavior?
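
(For context, the usage pattern described above corresponds roughly to the sketch below. It relies only on the channel API visible in this thread, namely Input(), Successes() and Errors() on the pre-1.0 sarama producer, and leaves out producer construction and the config option that enables the Successes() channel, since those details aren't shown in the issue.)

```go
package example

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
)

// produceAndCollect sketches the reported usage: push n messages into
// Input() and expect exactly n results back, each either an ack on
// Successes() or a failure on Errors(). Internal messages with flags
// set and a nil Value should never appear here.
func produceAndCollect(producer *sarama.Producer, n int) {
	go func() {
		for i := 0; i < n; i++ {
			producer.Input() <- &sarama.MessageToSend{
				Topic: "benchmark-kafka",
				Value: sarama.StringEncoder(fmt.Sprintf("msg-%d", i)),
			}
		}
	}()

	for i := 0; i < n; i++ {
		select {
		case msg := <-producer.Successes():
			log.Printf("acked: %#v", msg)
		case err := <-producer.Errors():
			log.Printf("failed: %v", err)
		}
	}
}
```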

eapache commented Nov 14, 2014

I'll have to double-check, but I think so. Flags are used internally to track the state of messages, such as whether they've been retried, so I would expect retried messages to have some extra flag(s) set.

How did you even check that? We don't expose those flags in the API.

eapache commented Nov 14, 2014

(the log messages indicate that your cluster is undergoing leader election and so the producer has to switch brokers)

ORBAT commented Nov 14, 2014

Printf's %#v prints private fields too (although I did run into this in my sarama io.Writer fork).
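
(Aside: this is standard fmt behaviour. The %#v verb prints a Go-syntax representation of a value, including unexported fields, which is why the internal offset, partition and flags fields are visible at all. A minimal illustration with a made-up type, not sarama's:)

```go
package main

import "fmt"

type message struct {
	Topic string
	flags int // unexported field
}

func main() {
	m := &message{Topic: "benchmark-kafka", flags: 2}
	fmt.Printf("%v\n", m)  // &{benchmark-kafka 2}
	fmt.Printf("%#v\n", m) // &main.message{Topic:"benchmark-kafka", flags:2}
}
```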

Anyhow, I'm keeping tabs on the MessageToSend pointers I'm sending on the Input() channel, so I know the message isn't something I've sent (plus the Value field of these unexpected messages is always nil).
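
(A rough sketch of that bookkeeping, for illustration only and not the reporter's actual benchmark code, using the same pre-1.0 producer types as above:)

```go
package example

import (
	"log"

	"github.com/Shopify/sarama"
)

// trackAcks records every *MessageToSend pointer handed to Input() and
// reports anything coming back on Successes() that was never sent.
// Errors are ignored here to keep the sketch short.
func trackAcks(producer *sarama.Producer, msgs []*sarama.MessageToSend) {
	sent := make(map[*sarama.MessageToSend]bool, len(msgs))
	for _, m := range msgs {
		sent[m] = true // record the pointer before it enters the producer
	}

	// Feed the producer from another goroutine so Successes() can be
	// drained concurrently.
	go func() {
		for _, m := range msgs {
			producer.Input() <- m
		}
	}()

	for i := 0; i < len(msgs); i++ {
		ack := <-producer.Successes()
		if !sent[ack] {
			log.Printf("unexpected message on Successes(): %#v", ack)
		}
	}
}
```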

I wonder why the cluster's doing leader election / why the metadata goes stale in the first place; it's running on my laptop, and there shouldn't be any surprising changes in cluster membership since it's pretty much always up.

ORBAT commented Nov 14, 2014

Looks like some messages neither get acknowledged nor error out after this happens, although I haven't had time to check whether they actually get sent or not. If needed, I can try to come up with a test case for this, although it might take some time given my current workload.

eapache commented Nov 14, 2014

> %#v prints private fields too

today I learned

> so I know the message isn't something I've sent

Ya, haha I know what this is. In retry scenarios the producer internally passes some "fake" messages in order to carry additional flags. I guess we shouldn't be returning those to the user :)

For now you can just ignore them - they're basically meaningless, just an implementation detail leaking out where it shouldn't. Easy to fix.

> I wonder why the cluster's doing leader election / why the metadata goes stale in the first place

I dunno, but "kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date." is a message we get from the broker, so the broker is definitely telling us that something is wrong.

eapache commented Nov 14, 2014

> Easy to fix.

Or not. There's a corner case here that isn't great. Still digging...

eapache commented Nov 14, 2014

A reproducible case or a network traffic capture (tcpdump or Wireshark) at the time of the event would really help; a more complete log sample might also be useful. Every time I think I've figured out how this could occur, I realize it should be impossible.

It would be trivial to add an "if flags set, don't send on Successes channel" check, but all that would do is mask the real problem - those messages shouldn't be in that part of the flow to begin with.
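
(For what it's worth, the band-aid being rejected here would amount to something like the toy below: filtering flagged messages out of the ack path rather than fixing why they reached it. The types are made up for illustration and are not sarama's internals.)

```go
package main

import "fmt"

type produced struct {
	value string
	flags int // non-zero means "internal bookkeeping message"
}

func main() {
	acked := []produced{{"real message", 0}, {"internal retry marker", 2}}
	successes := make(chan produced, len(acked))

	// The rejected masking check: drop flagged messages instead of
	// understanding how they got into this part of the flow.
	for _, msg := range acked {
		if msg.flags != 0 {
			continue
		}
		successes <- msg
	}
	close(successes)

	for msg := range successes {
		fmt.Println("delivered to user:", msg.value)
	}
}
```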

eapache added a commit that referenced this issue Nov 15, 2014
I worked out the formal state machines of the relevant goroutines and stuck a
log message in at every state change. This should help with debugging #179.

The logs should still only trigger on broker rebalance or super-heavy traffic,
so normal operation should be quiet.
potocnyj pushed a commit to VividCortex/sarama that referenced this issue Nov 15, 2014

eapache commented Nov 18, 2014

I have a hypothesis: is it possible you are reusing MessageToSend structs at any point (either by sending the same struct multiple times or by resending a struct that came out of the successes or errors channel)?
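
(Concretely, the kind of reuse being asked about is either of the two patterns below, shown with the pre-1.0 types used in this thread; both are illustrative, not code from the report.)

```go
package example

import "github.com/Shopify/sarama"

// reuseSamePointer sends the same struct pointer to Input() twice.
func reuseSamePointer(producer *sarama.Producer) {
	msg := &sarama.MessageToSend{
		Topic: "benchmark-kafka",
		Value: sarama.StringEncoder("payload"),
	}
	producer.Input() <- msg
	producer.Input() <- msg // same *MessageToSend sent again
}

// resendAckedMessage feeds a message that already came back out of
// Successes() straight back into Input().
func resendAckedMessage(producer *sarama.Producer) {
	ack := <-producer.Successes()
	producer.Input() <- ack
}
```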

ORBAT commented Nov 19, 2014

Hm, that shouldn't be the case: a new *MessageToSend is always created, and each pointer is sent to Input() exactly once.

eapache added a commit that referenced this issue Nov 19, 2014
Also reported at #199 (comment)

Go's closure semantics are really annoying - simply stop spawning goroutines
with the wrong arguments.

Add a test (heavily based on https://gist.github.com/ORBAT/d0adcd790dff34b37b04)
to ensure this behaviour doesn't regress.

Huge thanks to Tom Eklöf for getting me all the logs etc. needed to track this
down.
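
(The closure pitfall this commit message refers to is the classic Go loop-variable capture, under pre-1.22 loop semantics: goroutines spawned in a loop that close over the loop variable all observe whatever value it holds when they finally run, whereas passing it as an argument pins a copy at spawn time. A generic illustration, not the sarama code:)

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup

	// Buggy with pre-Go-1.22 loop semantics: every goroutine closes over
	// the same variable i, so most of them see a later value of it.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println("captured:", i) // often prints 3, 3, 3
		}()
	}
	wg.Wait()

	// Fixed: pass the value as an argument so each goroutine gets its own copy.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			fmt.Println("argument:", n) // prints 0, 1, 2 in some order
		}(i)
	}
	wg.Wait()
}
```
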
eapache added a commit that referenced this issue Nov 19, 2014

ORBAT commented Nov 19, 2014

Looks like PR #202 did the trick, yay! I'll go ahead and close the issue.

@ORBAT ORBAT closed this as completed Nov 19, 2014
ORBAT pushed a commit to ORBAT/sarama that referenced this issue Nov 19, 2014
ORBAT pushed a commit to ORBAT/sarama that referenced this issue Nov 19, 2014