Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove external calls to disconnectBroker #338

Merged
merged 1 commit into from
Mar 13, 2015

Conversation

eapache
Copy link
Contributor

@eapache eapache commented Mar 12, 2015

It is now only called from one place in the client. This looks simple but is
actually super-subtle, and depends on lazy broker connections (PR #309).

disconnectBroker does a whole bunch of different things:

  • calls Close on the broker connection
  • adds the address to the internal deadBrokerAddrs map even if the broker is
    not a seed, which I think is wrong, since resurrectDeadBrokers will use it to
    repopulate the seedBrokerAddrs list
  • rotates seedBrokers (if the broker was a seed broker)
  • removes it from the brokers map (otherwise)

In the producer and consumer where we used to call disconnectBroker:

  • We now call Close directly on the broker.
  • The broker we are dealing with is not a seed broker, so the seedBrokers do not
    need rotating, and I don't think it's a problem that it no longer gets added
    to deadBrokerAddrs.
  • The reason we removed it from the broker map was so that the next request for
    that broker would trigger a metadata request and reopen the connection. The
    producer and consumer both manually trigger metadata requests when necessary,
    and the fact that we now have lazy connection opening means simply closing it
    (which we do, see first bullet) is enough to cause the connection to reopen
    the next time it is requested, even if no metadata refresh is requested.

In a subsequent PR, the last remaining call will be inlined and refactored. This will effectively fix the last non-trivial issue spawned from #15.

@Shopify/kafka

@eapache eapache force-pushed the die-disconnect-broker-die branch 3 times, most recently from f3953ca to ae411cd Compare March 12, 2015 21:34
It is now only called from one place in the client. This looks simple but is
actually super-subtle, and depends on lazy broker connections (PR #309).

disconnectBroker does a whole bunch of different things:
- calls `Close` on the broker connection
- adds the address to the internal `deadBrokerAddrs` map even if the broker is
  not a seed, which I think is wrong, since resurrectDeadBrokers will use it to
  repopulate the seedBrokerAddrs list
- rotates seedBrokers (if the broker was a seed broker)
- removes it from the brokers map (otherwise)

In the producer and consumer where we used to call disconnectBroker:
- We now call `Close` directly on the broker.
- The broker we are dealing with is not a seed broker, so the seedBrokers do not
  need rotating, and I don't think it's a problem that it no longer gets added
  to `deadBrokerAddrs`.
- The reason we removed it from the broker map was so that the next request for
  that broker would trigger a metadata request and reopen the connection. The
  producer and consumer both manually trigger metadata requests when necessary,
  and the fact that we now have lazy connection opening means simply closing it
  (which we do, see first bullet) is enough to cause the connection to reopen
  the next time it is requested, even if no metadata refresh is requested.
@eapache eapache force-pushed the die-disconnect-broker-die branch from ae411cd to a778a8e Compare March 13, 2015 14:22
@eapache
Copy link
Contributor Author

eapache commented Mar 13, 2015

The lazy connections have been merged so this is ready for review. The code is trivial (not much to review there), I'm more interested in somebody verifying that my logic is correct around deadBrokerAddrs, connection bouncing, etc.

As part of #309 this has already undergone some decent practical testing using the vagrant cluster and willem's stressproducer.

@wvanbergen
Copy link
Contributor

Did we run any consumer tests by any chance?

@eapache
Copy link
Contributor Author

eapache commented Mar 13, 2015

Just ran some of those using your new console-consumer, but I'm not too concerned. The changes are symmetric so if the producer works fine then the consumer should (and does) also.

@wvanbergen
Copy link
Contributor

I think the logic here is sound.

I don't think it's necessary to Close() the broker connection for every error we see in the producer or consumer, but this is one of these optimizations that are easy to fuck up.

What is the number of (producer) retries we potentially waste under relatively normal operations?

  1. Produce requests fails (e.g. not leader for partition), so it asks for new metadata.
  2. Metadata update request fails. This is already unlikely, because it happens only if it completely drains the Metadata retries as well, right?

@eapache
Copy link
Contributor Author

eapache commented Mar 13, 2015

I don't think it's necessary to Close() the broker connection for every error we see in the producer or consumer

You don't? If a broker gives us a network error or an unexpected protocol error then I don't see what choice we have besides disconnecting from that one and asking metdata for another one?

What is the number of (producer) retries we potentially waste under relatively normal operations?

No more than we ever have?

@wvanbergen
Copy link
Contributor

I meant, not for every error. E.g. if we get a NotLeaderForPartition, we don;t technically have to disconnect. Or does that error never end up in abort?

Anyway, that's an optimization we don't really need.

@wvanbergen
Copy link
Contributor

:shipit:

wvanbergen added a commit that referenced this pull request Mar 13, 2015
Remove external calls to disconnectBroker
@wvanbergen wvanbergen merged commit 20f98a6 into master Mar 13, 2015
@eapache eapache deleted the die-disconnect-broker-die branch March 13, 2015 19:15
@eapache
Copy link
Contributor Author

eapache commented Mar 13, 2015

I meant, not for every error. E.g. if we get a NotLeaderForPartition, we don;t technically have to disconnect. Or does that error never end up in abort?

abort cleans up the entire worker - the only errors that end up there are ones returned from broker.Fetch(), nothing at the protocol layer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants