
Implement draining logic for transactional flow #757

Merged

Conversation


@szymonm szymonm commented Mar 25, 2019

Pull Request Checklist

Fixes

Duplication in a transactional flow that occurs when either the stream is closed or partitions are revoked by Kafka.

Purpose

This PR introduces draining logic for TransactionalSource. Specifically, for each partition managed by the source, we keep track of the message offsets emitted by the source.
When closing the stage or revoking partitions, we make sure that all of those offsets have been committed back to Kafka by the producer.
To know that an offset was committed by the producer, we attach a CommittedMarker to each message produced by the stream. TransactionalProducerStage uses this marker to tell the source that an offset was committed.
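A rough sketch of the idea for readers of this thread; the names mirror the ones discussed in this PR (CommittedMarker, in-flight records) but the types are simplified and illustrative, not the exact code:

import org.apache.kafka.common.TopicPartition

// Sketch: the source records offsets it has emitted but that are not yet
// committed by the transactional producer.
trait CommittedMarker {
  // Called by the producer stage once the transaction containing these offsets is committed.
  def committed(offsets: Map[TopicPartition, Long]): Unit
}

final class InFlightRecords {
  private var inFlight = Map.empty[TopicPartition, Long]

  def emitted(tp: TopicPartition, offset: Long): Unit = synchronized {
    inFlight = inFlight.updated(tp, offset)
  }

  def committed(offsets: Map[TopicPartition, Long]): Unit = synchronized {
    // Forget partitions whose highest emitted offset has been committed.
    inFlight = inFlight.filterNot { case (tp, off) => offsets.get(tp).exists(_ >= off) }
  }

  def empty(): Boolean = synchronized(inFlight.isEmpty)
}

Draining then means waiting until empty() returns true (or a timeout fires) before the consumer is closed or partitions are released.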

Background Context

In a transactional stream processing partition P, we need to wait for all commits to be acknowledged by Kafka before closing the consumer.
If we don't, it is possible that another consumer is assigned P, which triggers fetching an offset that can be stale or overwritten later. This causes data duplication.

References

#758
#756

@szymonm szymonm changed the title from "Add test for draining transactional flow" to "Implement draining logic for transactional flow" on Apr 1, 2019

szymonm commented Apr 1, 2019

Some afterthoughts:
IIUC, we also need to ensure that the consumer is not closed before the producer finishes committing the transaction in the case of a stream failure.
IIUC, in this case the failure message will reach the source stage and stop the consumer. However, it will not reach TransactionalProducerStage if producer.commitTransaction is in progress (it's blocking). This can lead to race conditions where the producer commits after some other consumer is assigned the partition.

@2m 2m force-pushed the szymon-add-tests-for-draining-transactional-flow branch from c0e92db to 9c04dcf on April 4, 2019 11:15

2m commented Apr 4, 2019

Rebased on top of the latest master, and also fixed a bug where a wrong Committed message was being handled in the TransactionalSource stage.

@@ -151,9 +151,9 @@ private final class TransactionalProducerStageLogic[K, V, P](stage: Transactiona
   private def maybeCommitTransaction(beginNewTransaction: Boolean = true): Unit = {
     val awaitingConf = awaitingConfirmation.get
     batchOffsets match {
-      case batch: NonemptyTransactionBatch if awaitingConf == 0 =>
+      case batch: NonemptyTransactionBatch if awaitingConf == 0 && !hasBeenPulled(stage.in) =>
Contributor Author

Can you explain why we need this logic?

Contributor

With this PR, the transaction commit is split into two synchronous blocks in the Producer stage. That means that in between, a message could be sent to the stage and passed on to the Kafka producer even though the new transaction has not yet been started.

So this guard is needed to make sure that no message comes in between the committing of the current transaction and the beginning of the new one.

Contributor Author

Still didn't quite get it. I was hoping that suspending demand while a transaction is in progress would be enough.
Also, I'm wondering if we shouldn't just block waiting for the internal commit.
kafkaProducer.commitTransaction() is already blocking the whole stage waiting on a network call, and internalCommit is just waiting for a local memory change, so it shouldn't add much...
That would simplify the code a bit.

Contributor Author

Making internalCommit just send a message would also be possible. Seems like TransactionalProducerStage does not have to wait for the SourceActor to acknowledge getting the message, right?

Contributor

I was hoping that suspending demand while a transaction is in progress would be enough.

Calling suspendDemand does make sure that stage.in is not going to be pulled anymore from that point on, but it could already have been pulled before that.

Seems like TransactionalProducerStage does not have to wait for the SourceActor to acknowledge getting the message, right?

Yes, currently it looks like that's the case. I'll need to think about it a bit more.


szymonm commented Apr 4, 2019

Thanks, @2m

def waitForDraining(): Unit = {
  import akka.pattern.ask
  implicit val timeout = Timeout(5.seconds)
  Await.result(ask(stageActor.ref, Drain(stageActor.ref, Drained)), 10.seconds)

I'm not sure the behaviour in case of a timeout is correct.
IIUC a TimeoutException is thrown and it's caught in Kafka code by the ConsumerCoordinator, here:
here:

try {
            listener.onPartitionsRevoked(revoked);
        } catch (WakeupException | InterruptException e) {
            throw e;
        } catch (Exception e) {
            log.error("User provided listener {} failed on partition revocation", listener.getClass().getName(), e);
        }

So due to catch (Exception e) we won't fail the stream; we'll just continue as if nothing happened and the stream had been drained.

Therefore, if it takes too long to drain, we still allow messages to remain in the stream. In the meantime some other consumer might start processing data from the same partition, because we just released it.

@szymonm @2m Could you verify if my reasoning is correct?
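One way to avoid silently continuing is to catch the timeout and report failure to the caller, roughly in line with the fix 2m adds later in this thread. A sketch only: Drain, Drained, stageActor and commitTimeout are taken from the snippets in this thread, settings stands for the consumer settings, and the exact stage wiring differs.

import scala.concurrent.Await
import scala.util.control.NonFatal
import akka.pattern.ask
import akka.util.Timeout

def waitForDraining(): Boolean = {
  implicit val timeout: Timeout = Timeout(settings.commitTimeout)
  try {
    Await.result(ask(stageActor.ref, Drain(stageActor.ref, Drained)), timeout.duration)
    true
  } catch {
    case NonFatal(_) =>
      // Do not let the rebalance listener swallow the timeout: return false so the
      // caller can fail the stage instead of releasing the partitions as if drained.
      false
  }
}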

Contributor Author

You're right @jotek7. In this case we should fail the stream, starting from the producer. That is similar to what we should do generally in the failure case (see my comments).

Contributor

Yea, the stage needs to be stopped in that scenario. But I am still not sure if the stopping should start from the consumer: see #757 (comment)

case (_, Drain(ack, msg)) =>
  if (inFlightRecords.empty()) {
    log.debug("Source drained")
    ack ! msg
Contributor

This should also send an answer back to the sender, otherwise the ask in waitForDraining will not be completed.

Contributor Author

Good catch, fixed in the next commit.
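For illustration, one possible shape of that fix (a hypothetical standalone version; in the stage itself inFlightRecords and the message types already exist, and the actual commit may differ):

import akka.actor.ActorRef

final case class Drain(ack: ActorRef, msg: Any)

// Reply both to the requested target and to the asking sender,
// so the ask in waitForDraining completes.
def handleDrain(inFlightEmpty: Boolean)(envelope: (ActorRef, Any)): Unit = envelope match {
  case (sender, Drain(ack, msg)) if inFlightEmpty =>
    ack ! msg     // notify the requested target (e.g. the producer stage)
    sender ! msg  // complete the ask future of the caller
  case _ => ()
}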


2m commented Apr 9, 2019

IIUC, in this case the failure message will reach the source stage and stop the consumer. However, it will not reach TransactionalProducerStage if producer.commitTransaction is in progress (it's blocking). This can lead to race conditions where the producer commits after some other consumer is assigned the partition.

I am not sure if this is a problem. If the stream is being torn down, then after the blocking producer.commitTransaction is done, it will complete the stage with onCompletionFailure and will not commit the next transaction, which might have duplicates. Am I missing something here?


2m commented Apr 9, 2019

I added a commit where it fails the consumer stage if the draining timeout happens.

@@ -266,7 +266,7 @@ class TransactionsSpec extends SpecBase(kafkaPort = KafkaPorts.TransactionsSpec)
   "provide consistency when multiple transactional streams are being restarted" in {
     val sourcePartitions = 10
     val destinationPartitions = 4
-    val consumers = 3
+    val consumers = 1
Contributor Author

why did you need to change this here?

Contributor

I noticed that this test-case does not finish when running with 3 consumers. The stopping of the stage on drain timeout does not play nicely with how this test uses the RestartSource: the stream gets restarted all the time and the test never finishes. I also noticed that in Travis it did not finish even with 1 consumer. I need to improve that test-case.


szymonm commented Apr 10, 2019

I am not sure if this is a problem. If the stream is being torn down, then after the blocking producer.commitTransaction is done, it will complete the stage with onCompletionFailure and will not commit the next transaction, which might have duplicates. Am I missing something here?

There is still a chance that the consumer will be closed before the producer finishes producer.commitTransaction. This can cause the partition to be picked up by another consumer that will start from offsets that are not yet committed.


szymonm commented Apr 10, 2019

What are the requirements for binary compatibility? After turning the PartitionOffset case class into a normal class we have some warnings. Should we fix them all?


2m commented Apr 10, 2019

Should we fix them all?

Yes, otherwise we will not be able to include this in a patch release.


szymonm commented Apr 11, 2019

The last Travis run reports: The job exceeded the maximum time limit for jobs, and has been terminated.

Tests in TransactionsSpec take a lot of time:

[info] - must provide consistency when multiple transactional streams are being restarted (8 minutes, 18 seconds)

[info] - must drain stream on partitions rebalancing (2 minutes, 10 seconds)


szymonm commented Apr 11, 2019

Looks like the test provide consistency when multiple transactional streams are being restarted takes 8 minutes. We run it twice (once for each Scala version) and thus we increase the test time from ~30 minutes to over 50 minutes (the test for draining takes another 2 minutes per run).

Can we make the test case smaller, @2m ?


2m commented Apr 11, 2019

Yes, let's decrease the number of messages to 100k and see how long that takes.


def waitForDraining(): Boolean = {
  import akka.pattern.ask
  // Shall we use commitTimeout here?
Contributor

Yes, it should be at least the commit timeout. Possibly a bit larger as well. Can you refactor this from a hardcoded constant to a configuration value?

Contributor Author

After some thought I think using commitTimeout is just fine; I will use that.

log.debug(s"Draining partitions {}", inFlightRecords)
// TODO: This timeout should be somehow related to the committing interval.
// For instance having 10.millis makes no sense if we commit transaction every second
materializer.scheduleOnce(30.millis, new Runnable {
Contributor

Would be good to move this to configuration as well.
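For example, such an interval can be read like any other duration setting from the consumer config section. A sketch: it assumes the draining-check-interval key shown later in this PR is on the classpath via reference.conf, and the actual wiring goes through ConsumerSettings.

import java.util.concurrent.TimeUnit
import scala.concurrent.duration._
import com.typesafe.config.ConfigFactory

val consumerConfig = ConfigFactory.load().getConfig("akka.kafka.consumer")
// e.g. `draining-check-interval = 30ms` in reference.conf
val drainingCheckInterval: FiniteDuration =
  consumerConfig.getDuration("draining-check-interval", TimeUnit.MILLISECONDS).millis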


2m commented Apr 11, 2019

Two of the other tests in the TransactionalSpec started failing. Could be the new timing configuration not playing nicely with the test-cases.


szymonm commented Apr 12, 2019

These tests actually catch a valid issue now:
We keep track of offsets emitted by the source and now we are waiting for all emitted offsets. But we should not wait for an emitted offset if processing of that element caused an exception.


szymonm commented Apr 12, 2019

If the producer is closed after committing a transaction, we can safely close the consumer.


szymonm commented Apr 12, 2019

Looks like IntegrationSpec signal rebalance events to actor test is flaky.


szymonm commented Apr 12, 2019

TBH there is no reason why this test should work reliably.
It uses the ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest" config with no offset storing.

This means that after a rebalance, consumers start reading from the latest offset. If more elements are written to a topic-partition during the rebalancing, those elements are never read.


szymonm commented Apr 12, 2019

Can you run tests again?


2m commented Apr 15, 2019

We keep track of offsets emitted by the source and now we are waiting for all emitted offsets. But we should not wait for an emitted offset if processing of that element caused an exception.

Great catch. That was the reason I saw multiple "waiting for draining" messages in the logs when the stage was stopping because of a failure.

Looks like IntegrationSpec signal rebalance events to actor test is flaky.

I do not see it failing anymore, but if we see that again, let's create an issue and fix it in a separate PR.


szymonm commented Apr 15, 2019

I do not see it failing anymore, but if we see that again, let's create an issue and fix it in a separate PR.

👍


szymonm commented Apr 15, 2019

What are the next steps with the PR?


2m commented Apr 15, 2019

I am currently running the test with consumers = 3 and, after a little tweak where we let the copy flow restart a couple more times, it is passing. I am going to push that and if it looks green on Travis, this will be good to go in.

@2m 2m force-pushed the szymon-add-tests-for-draining-transactional-flow branch from 91fa31f to 488baca on April 15, 2019 13:45

2m commented Apr 16, 2019

With multiple consumers the test fails with missing messages.

@seglo seglo left a comment

I left some minor comments. I think the only missing piece might be some updates to the documentation to explain the draining configuration and its purpose during the commit process. Though maybe that work could be deferred until after the PRs for the synchronous rebalance listeners and partitioned transaction sources are merged, which will make distributed transactional workloads more robust.

I'm really happy to see this feature get used and hardened. Thanks a lot for all the hard work!

@@ -93,6 +93,10 @@ akka.kafka.consumer {
# This value is used instead of Kafka's default from `default.api.timeout.ms`
# which is 1 minute.
metadata-request-timeout = 5s

# Interval for cheking that transaction was completed before closing the consumer.
Contributor

I don't see any more documentation for the end user about this new functionality. I suggest updating the Transactions section in the docs.

Contributor

[Minor] Typo on "checking".

}(materializer.executionContext)
}

val onInternallCommitAckCb: AsyncCallback[Boolean] = {
Contributor

[Minor] Typo onInternalCommitAckCb

@@ -159,21 +190,35 @@ private final class TransactionalProducerStageLogic[K, V, P](stage: Transactiona
override def onCompletionFailure(ex: Throwable): Unit = {
log.debug("Aborting transaction due to stage failure")
abortTransaction()
batchOffsets.committingFailed()
super.onCompletionFailure(ex)
}

private def commitTransaction(batch: NonemptyTransactionBatch, beginNewTransaction: Boolean): Unit = {
val group = batch.group
log.debug("Committing transaction for consumer group '{}' with offsets: {}", group, batch.offsetMap())
@seglo seglo Apr 16, 2019

[Minor] Use the local variable offsetMap so we don't have to build it for each log line?

@ennru ennru left a comment

Great work, I left a few minor comments.

core/src/main/scala/akka/kafka/ConsumerSettings.scala (outdated, resolved)

# Interval for checking that transaction was completed before closing the consumer.
# Used in the transactional flow.
draining-check-interval = 30ms
Member

I wonder if the setting name should contain eos or something. There are so many settings now it is hard to see which belong together.

Contributor Author

tbh... eos is cryptic even for me...

Contributor Author

but we may have some distinct section for transactional processing?

core/src/main/scala/akka/kafka/ConsumerMessage.scala (outdated, resolved)

szymonm commented Apr 24, 2019

The test for the failure case is still failing, showing both duplication and data loss.

My intuition about what happens is as follows:

  1. Data loss
    Since we send messages and offsets in a transaction, the only way to lose data is to lose some records while they are in the stream and commit records with higher offsets.

  2. Duplication
    We close the consumer before the producer in a single stream. Thus another consumer starts consuming from the partition before the previous producer flushes the last offsets.


szymonm commented Apr 24, 2019

I'm also seeing some problems with the draining mechanism; specifically, it sometimes waits for a last message that is never drained.


szymonm commented Apr 24, 2019

Also, setting a higher backoff for restarts increases the probability of test success (from what I observed):

RestartSource
          .onFailuresWithBackoff(1.second, 5.seconds, 0.2)
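For context, the complete restart wrapper in the test would look roughly like this (transactionalCopyStream is a hypothetical factory standing in for the transactional source-to-sink flow under test):

import akka.NotUsed
import akka.stream.scaladsl.{RestartSource, Source}
import scala.concurrent.duration._

// Hypothetical factory creating the transactional copy stream under test.
def transactionalCopyStream(): Source[String, NotUsed] = ???

// Restart on failures only, with the higher backoff suggested above.
val restarting: Source[String, NotUsed] =
  RestartSource.onFailuresWithBackoff(
    minBackoff = 1.second,
    maxBackoff = 5.seconds,
    randomFactor = 0.2
  )(() => transactionalCopyStream())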


2m commented Apr 24, 2019

Since we send messages and offsets in a transaction, the only way to lose data is to lose some records while they are in the stream and commit records with higher offsets.

I only noticed lost messages when running that test-case with multiple consumers. Investigating that right now.

Duplication

I have not seen the test-case failing because of duplication for quite some time now. The duplication check runs first, and at least in Travis it has been passing for quite some time before the test-case fails on the missing-elements check.


2m commented Apr 26, 2019

Pushed two more commits that improve the failing test-case in a couple of ways. When it comes to duplicates, it could very well be that we get duplicated messages when a rebalance happens while consuming messages after transactional processing. For that I added offset tracking which will ignore duplicates that have the same offsets. We should only care about duplicates with different offsets, because those would come from the transactions.

Also, I moved the place where the transactional processing flow is terminated in the test stream, so the stream is always terminated with an error (from upstream) instead of a cancellation (from downstream). Let's see what Travis thinks about that when it comes to missing messages.
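Illustrative sketch of that distinction (a hypothetical test helper, not the PR's exact code): re-reads after a rebalance show up as the same partition and offset consumed twice and can be ignored, while the same value at two different offsets means a transaction actually wrote it twice.

final case class Consumed(partition: Int, offset: Long, value: String)

def transactionalDuplicates(consumed: Seq[Consumed]): Map[String, Seq[Consumed]] = {
  // Same (partition, offset) seen twice is a re-read after a rebalance; keep one copy.
  val withoutReReads = consumed.groupBy(c => (c.partition, c.offset)).map(_._2.head).toSeq
  // The same value at different offsets would be a duplicate written by a transaction.
  withoutReReads.groupBy(_.value).filter { case (_, occurrences) => occurrences.size > 1 }
}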


szymonm commented Apr 26, 2019

I'm looking into it too, and what worries me now is seeing java.lang.Error: Timeout while draining. My intuition is that with embedded Kafka and the timeout set to 15s we have plenty of time to drain.
Maybe the timeout indicates that we don't pull correctly while draining?


szymonm commented Apr 26, 2019

Also, when the timeout happens, we should make sure to close the producer before closing the consumer. This doesn't happen, right?


2m commented Apr 26, 2019

Also, when the timeout happens, we should make sure to close the producer before closing the consumer.

Not explicitly, but there is an akka.kafka.consumer.stop-timeout that delays stopping the ConsumerStage so late commits can happen when using a CommittableSource.

I see that we are overriding stopConsumerActor in TransactionalSource and not doing the delay anymore. We should probably bring the delay back so the producer gets closed first.
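For reference, the delay 2m mentions is the regular consumer stop-timeout and can be tuned in config (akka.kafka.consumer.stop-timeout) or programmatically; a minimal sketch, assuming a localhost Kafka and illustrative values:

import akka.actor.ActorSystem
import akka.kafka.ConsumerSettings
import org.apache.kafka.common.serialization.StringDeserializer
import scala.concurrent.duration._

implicit val system: ActorSystem = ActorSystem("example")

// stop-timeout delays stopping the consumer so that late commits
// (here: the producer's final transaction commit) can still complete.
val consumerSettings =
  ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("transactional-group")
    .withStopTimeout(30.seconds)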

Maybe the timeout indicates that we don't pull correctly while draining?

Yea, noticed those timeouts as well. But I also see Source drained. in the logs. Could it be that the reply after draining does not come back to complete the ask Future?


szymonm commented Apr 26, 2019

Yea, noticed those timeouts as well. But I also see Source drained. in the logs. Could it be that the reply after draining does not come back to complete the ask Future?

That was another idea I had!


szymonm commented Apr 26, 2019

After increasing the parallelism of sending messages to Kafka we don't have to wait for messages to be flushed. As a result we are not failing stream draining that often, which gives a very good chance of the test passing (I didn't see a single failure on my machine after changing that).

It also speeds up the test a lot!


2m commented Apr 29, 2019

Looks good. I also like the simplification in the Producer where the transaction is started sooner. That simplification means that these two commits are not needed anymore, because the transaction is committed and a new transaction is started in the same pass.

7be3450
9c04dcf


2m commented Apr 30, 2019

Pushed a revert of those two commits. I'll do a docs update and then this will be good to go.


szymonm commented Apr 30, 2019

Yeap, happy to see this simplified again!

@ennru ennru left a comment

Some nitpicks.

core/src/main/scala/akka/kafka/ConsumerMessage.scala (outdated, resolved)
docs/src/main/paradox/transactions.md (outdated, resolved)
docs/src/main/paradox/transactions.md (outdated, resolved)
@ennru ennru added this to the 1.0.2 milestone Apr 30, 2019
@2m 2m force-pushed the szymon-add-tests-for-draining-transactional-flow branch from 58046cb to 7203516 on May 2, 2019 05:28

2m commented May 2, 2019

Rebased on top of master and addressed the latest feedback.

* extract rebalancing handling and stopping consumer to separate methods
* introduce CommittedMarker and hook it up to TransactionalMessage
* add marking committed offsets logic to TransactionalProducerStage
* add draining mechanism to TransactionalSource
* add draining-check-interval to the reference config
* handle committing failure when draining the stream
* add test for transactions and multi messages
* get continuous blocks for missing messages
* distinguish duplicates that happen during counting from transactions
* update documentation
@2m 2m force-pushed the szymon-add-tests-for-draining-transactional-flow branch from 7203516 to 5101c54 on May 2, 2019 07:29
@ennru ennru merged commit 1b273e6 into akka:master May 2, 2019