[Fix][pulsar-io] KCA to use index (if available) instead of sequenceId and to handle batched messages non-unique sequenceIds #16098
help = "Number of bits (0 to 20) to use for index of message in the batch for translation into an offset.\n"
+ "0 to disable this behavior (Messages from the same batch will have the same "
+ "offset which can affect some connectors.)")
private int maxBatchBitsForOffset = 12;
If this is zero, then the behaviour is the same as before.
So I would keep 0 as the default, so as not to break compatibility with data (offsets) already stored in existing environments that are upgrading.
Maybe if we use Integer, we can detect that the connector was created with an old version when the value is null, and so enable the legacy behaviour.
Apache Pulsar does not have Sink implementations that use Kafka connectors (Debezium etc. are sources).
I don't think we need to worry about legacy behavior here.
lgtm
defaultValue = "true",
help = "Allows use of message index instead of message sequenceId as offset, if available.\n"
+ "Requires AppendIndexMetadataInterceptor and "
+ "enableExposingBrokerEntryMetadataToClient=true on brokers.")
the config is exposingBrokerEntryMetadataToClientEnabled
will update in #16100 since this PR is merged already
…ched messages non-unique sequenceIds (apache#16098) (cherry picked from commit a18c01d)
Motivation
Record's getRecordSequence() returns a non-unique sequenceId for messages from the same batch.
The root cause is that FunctionCommon.getSequenceId() does not account for the index of the message in the batch and only uses ledgerId and entryId.
For the KCA Sink, this means that messages start arriving with the same offset, and some Kafka Sinks will ignore such messages as duplicates.
Changing this behavior in FunctionCommon.getSequenceId() is potentially breaking (requires a separate discussion), and I assume it does not affect Pulsar right now.
Another problem is that we are already packing two longs (ledgerId and entryId) into one long (sequenceId); batching adds an int to this. With the KCA, one can make assumptions about batch size and the number of entries in a ledger before rotation, and configure this to avoid or minimize the lossiness of this packing; in Pulsar in general, such assumptions aren't reliable.
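To make the packing trade-off concrete, here is a minimal Java sketch of how reserving some low bits for the batch index yields distinct offsets within a batch. It is an illustration only, not the code in this PR: the class name, the `pack` method, and the 28-bit entry field width are assumptions, and `batchBits` stands in for maxBatchBitsForOffset.

```java
public final class OffsetPackingSketch {
    // Assumed width of the entry field inside the packed long (illustrative only).
    static final int ENTRY_FIELD_BITS = 28;

    /**
     * Packs ledgerId, entryId and batchIndex into one long.
     * With batchBits == 0 the batch index is ignored, so all messages
     * of a batch share the same offset.
     */
    public static long pack(long ledgerId, long entryId, int batchIndex, int batchBits) {
        if (batchBits < 0 || batchBits > 20) {
            throw new IllegalArgumentException("batchBits must be in [0, 20]");
        }
        long entryField = entryId;
        if (batchBits > 0 && batchIndex >= 0) {
            if (batchIndex >= (1L << batchBits)) {
                throw new IllegalArgumentException("batchIndex does not fit in " + batchBits + " bits");
            }
            // Low bits carry the batch index, the remaining bits carry the entryId.
            entryField = (entryId << batchBits) | batchIndex;
        }
        if (entryField >= (1L << ENTRY_FIELD_BITS)) {
            // Fail loudly instead of silently losing bits.
            throw new IllegalArgumentException("entryId does not fit in the remaining bits");
        }
        return (ledgerId << ENTRY_FIELD_BITS) | entryField;
    }

    public static void main(String[] args) {
        // Two messages from the same batch (same ledgerId and entryId).
        System.out.println(pack(5, 7, 0, 0) == pack(5, 7, 1, 0));   // prints true: offsets collide
        System.out.println(pack(5, 7, 0, 12) < pack(5, 7, 1, 12));  // prints true: distinct and ordered
    }
}
```

Under these assumed widths, giving 12 bits to the batch index leaves 16 bits for entryId, which is why the setting only works when batch size and entries per ledger are bounded.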
Modifications
KCA's produced kafka offset can use
Added a couple of parameters to PulsarKafkaConnectSinkConfig (documented in FieldDoc).
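The selection between the two offset sources described in the FieldDoc help text can be sketched as follows. This is a hedged illustration, not the PR's implementation: `OffsetSelectionSketch`, `chooseOffset`, and the `useIndexAsOffset` flag name are hypothetical, and the broker-provided index is modelled as an `Optional<Long>` that is empty when the broker does not expose entry metadata.

```java
import java.util.Optional;

public final class OffsetSelectionSketch {
    /**
     * Prefers the broker-assigned message index when enabled and present,
     * otherwise falls back to an offset packed from the message id.
     */
    public static long chooseOffset(boolean useIndexAsOffset,
                                    Optional<Long> brokerIndex,
                                    long packedFallback) {
        if (useIndexAsOffset && brokerIndex.isPresent()) {
            return brokerIndex.get();
        }
        return packedFallback;
    }

    public static void main(String[] args) {
        System.out.println(chooseOffset(true, Optional.of(42L), 1000L));   // prints 42
        System.out.println(chooseOffset(true, Optional.empty(), 1000L));   // prints 1000
        System.out.println(chooseOffset(false, Optional.of(42L), 1000L));  // prints 1000
    }
}
```

The fallback keeps the sink working on brokers that do not append an index, at the cost of relying on the packed offset.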
Verifying this change
Added unit tests
Does this pull request potentially affect one of the following parts:
If yes was chosen, please highlight the changes
No
There are no Sinks in the Apache Pulsar that use KCA.
Documentation
Check the box below or label this PR directly.
Need to update docs?
doc-required (Your PR needs to update docs and you will update later)
doc-not-needed (Please explain why)
doc (Your PR contains doc changes)
doc-complete (Docs have been already added)