Kafka controller uses old version of Sarama client with known bug which leads to log truncation error in data plane and triggers re-processing of all topic data from beginning #3909
Labels
kind/bug
Describe the bug
Knative Eventing control plane versions > 1.12.0 use Sarama client version 1.41.2. However, this version of the Sarama client contains a known bug that writes incorrect metadata (the leader epoch) in the initial commit for new consumer groups.
IBM/sarama#2705
The problem has been fixed in Sarama client version 1.42.1, so the control plane's dependency should be bumped to at least that version. One fix would be to pick up the dependency version from main, which is already at v1.43.1: https://github.com/knative-extensions/eventing-kafka-broker/blob/main/go.mod#L6
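To confirm which Sarama version a given control plane binary actually links, a small check along the following lines could be compiled into it. This is only a sketch: the helper names `parse` and `atLeast` are hypothetical, the version parsing handles plain `vX.Y.Z` strings only, and `replace` directives in `go.mod` are not considered. The threshold `v1.42.1` is the fixed version named above.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"strconv"
	"strings"
)

// saramaPath is the module path of the Sarama client under the IBM organization.
const saramaPath = "github.com/IBM/sarama"

// parse turns "v1.41.2" into [1 41 2]. Pre-release and build suffixes are not handled.
func parse(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("unexpected version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("unexpected version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// atLeast reports whether version v is greater than or equal to want.
func atLeast(v, want string) (bool, error) {
	a, err := parse(v)
	if err != nil {
		return false, err
	}
	b, err := parse(want)
	if err != nil {
		return false, err
	}
	for i := range a {
		if a[i] != b[i] {
			return a[i] > b[i], nil
		}
	}
	return true, nil
}

func main() {
	// ReadBuildInfo only reports dependencies of the binary it runs in,
	// so this prints nothing unless that binary links Sarama.
	info, ok := debug.ReadBuildInfo()
	if !ok {
		fmt.Println("no build info available")
		return
	}
	for _, dep := range info.Deps {
		if dep.Path != saramaPath {
			continue
		}
		fixed, err := atLeast(dep.Version, "v1.42.1")
		fmt.Println(dep.Path, dep.Version, "contains fix:", fixed, err)
	}
}
```

With the versions mentioned in this issue, `atLeast("v1.41.2", "v1.42.1")` reports false and `atLeast("v1.43.1", "v1.42.1")` reports true.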
This bug hits the data plane when a new consumer group starts. Because the committed metadata (leader epoch) is not correct according to the Kafka protocol, the Kafka client in the data plane detects a partition truncation. The log truncation error occurs once a leader switch has happened at least once for the topic partition, because from then on the committed metadata no longer matches the cluster state.
{"@timestamp":"2024-05-28T09:46:29.893Z","@version":"1","message":"[Consumer clientId=xxx-0.f7215124-d6c3-4eff-9ab7-c79ee947dde0-5, groupId=xxx.active-monitoring-kn-sequence-0.f7215124-d6c3-4eff-9ab7-c79ee947dde0] Truncation detected for partition xxx.active-monitoring-kn-sequence-0-0 at offset FetchPosition{offset=183705, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[xxx.com:9092 (id: 11 rack: 2)], epoch=37}}, resetting offset to the first offset known to diverge FetchPosition{offset=176697, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[xxx.com:9092 (id: 11 rack: 2)], epoch=37}}","logger_name":"org.apache.kafka.clients.consumer.internals.SubscriptionState","thread_name":"vert.x-kafka-consumer-thread-4","level":"INFO","level_value":20000}
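The check behind this log line can be illustrated with a simplified model. This is not the Kafka client's code: the type `epochEnds` and the function `validate` are hypothetical, and the assumption that the broker reports 176697 as the end offset of epoch 0 is only inferred from the log line above. The idea is that the consumer asks where the epoch stored with its committed offset ended, and if that end lies before the committed offset, it treats the difference as truncated log and rewinds.

```go
package main

import "fmt"

// epochEnds maps a leader epoch to the end offset the broker reports for it.
// Hypothetical model; the real client obtains this from the broker.
type epochEnds map[int32]int64

// validate checks a committed position against the broker's view of the epoch
// stored with it. It returns the position to resume from and whether a
// truncation was detected.
func validate(committedOffset int64, committedEpoch int32, ends epochEnds) (int64, bool) {
	end, known := ends[committedEpoch]
	if !known {
		// The model has no information for this epoch; keep the position.
		return committedOffset, false
	}
	if end < committedOffset {
		// The epoch ended before the committed offset: looks like truncation.
		return end, true
	}
	return committedOffset, false
}

func main() {
	// Values taken from the log line: offset 183705 was committed with epoch 0,
	// while the partition leader is already at epoch 37.
	ends := epochEnds{0: 176697, 37: 183705}

	pos, truncated := validate(183705, 0, ends)
	fmt.Println("wrong epoch 0:", pos, truncated)

	// Had the commit carried the current epoch, the position would be accepted.
	pos, truncated = validate(183705, 37, ends)
	fmt.Println("correct epoch 37:", pos, truncated)
}
```

In this model the only difference between the two calls is the epoch stored with the commit, which is the piece of metadata the affected Sarama version gets wrong.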
The impact is that a new consumer starts consuming all messages from earliest (instead of latest, as expected). This is a high risk for production systems, since it generates a huge load and compromises data consistency through duplicate processing. It can also happen after changing sequence step configs, because that causes new consumer groups to be created.
Expected behavior
A consumer created for a new consumer group consumes messages from latest. No log truncation error occurs that would cause messages to be processed from earliest.
To Reproduce
Create a new trigger for an existing topic with retained messages. If the topic partitions have a leader epoch > 0 (which is the case once a leader change has happened), the consumer starts with a log truncation error. The consumer offset is then reset to earliest and all messages from the topic are consumed from the beginning.
Knative release version
Tested on Knative Eventing version 1.13.0 with Sarama client version 1.41.2.
This Sarama client version is used in all Knative Eventing versions > 1.12.0.