
Cancellable handler #571

Closed
wants to merge 1 commit into from

Conversation

martincostello
Member

I had an idea this morning based on the changes in #546, so I thought I'd see how far I could get with it and submit a PR for discussion.

If the issue that led to #546 occurs again and the tasks that process messages back up, it's possible that a task ends up processing a message that was received so long ago that its visibility timeout has expired, making it a potential duplicate waiting to be processed.

If the handler had a CancellationToken that is cancelled once the approximate message visibility timeout has expired, JustSaying could enable two behaviours:

  1. It could self-cancel before doing anything with the message, abandoning it without it ever reaching the handler (potentially just extending the visibility instead, but I've not done that here).
  2. The handler, when called, could observe the cancellation token itself and decide what to do with the message. It could return false to send it back to the queue (where it may since have been handled elsewhere anyway), or return true on the basis that, if the timeout has expired, it will just ignore the message and pretend it has been dealt with.

This is implemented here by adding an ICancellableHandlerAsync<T> interface that derives from IHandlerAsync<T>, plumbing it in so it's used whenever the handler for a particular message implements it, and extending things like the exactly-once handler and the stopwatch wrapper.
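For reference, a minimal sketch of the shape such an interface could take (the method name and exact signature here are assumptions; the actual diff may differ):

public interface ICancellableHandlerAsync<in T> : IHandlerAsync<T>
{
    // Called instead of the plain Handle method when the handler opts in.
    // The token is cancelled once the approximate visibility timeout has
    // elapsed or the message loop is shutting down.
    Task<bool> HandleAsync(T message, CancellationToken cancellationToken);
}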

Replaces #549.

Add a new concept of a handler that can be cancelled if the message visibility timeout has expired.
@codecov

codecov bot commented Aug 26, 2019

Codecov Report

Merging #571 into master will decrease coverage by 0.04%.
The diff coverage is 41.81%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #571      +/-   ##
==========================================
- Coverage   38.98%   38.94%   -0.05%     
==========================================
  Files          78       78              
  Lines        2783     2876      +93     
  Branches      488      508      +20     
==========================================
+ Hits         1085     1120      +35     
- Misses       1541     1585      +44     
- Partials      157      171      +14
Impacted Files Coverage Δ
JustSaying/AwsTools/MessageHandling/HandlerMap.cs 100% <ø> (ø) ⬆️
...wsTools/MessageHandling/SqsNotificationListener.cs 54.07% <18.18%> (-7.18%) ⬇️
...ying/AwsTools/MessageHandling/MessageDispatcher.cs 62.83% <21.42%> (-4.56%) ⬇️
...ustSaying/Messaging/Monitoring/StopwatchHandler.cs 72.72% <50%> (-19.59%) ⬇️
.../AwsTools/MessageHandling/MessageHandlerWrapper.cs 64.28% <66.66%> (-21.01%) ⬇️
...ng/Messaging/MessageHandling/ExactlyOnceHandler.cs 65.78% <84.61%> (+8.64%) ⬆️
...Messaging/MessageProcessingStrategies/Throttled.cs 76.69% <0%> (+5.82%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 04bc31b...33b7258. Read the comment docs.

@martincostello martincostello marked this pull request as ready for review August 26, 2019 10:11
@martincostello martincostello requested a review from a team as a code owner August 26, 2019 10:11
@ghost ghost requested a review from brainmurphy August 26, 2019 10:11
@slang25
Member

slang25 commented Aug 26, 2019

Just to add another option, it could renew the message visibility timeout: #291
This is what Rebus does, and kind of what I'd expect by default for this type of library.

@martincostello
Member Author

To be honest I thought we already had that capability, so I've re-opened it. It's also something we'd probably want as opt-in?

However, that's not what I was trying to achieve here. This is more of a reaction to a backed-up local queue (see #546).

If messages are sitting on scheduled Tasks waiting to progress and not being processed, the chances are AWS has already considered them "lost" and redelivered them somewhere else. By the time such a task does eventually run, it's probably dealing with what is now a duplicate, which is a waste of processing. By observing that the timeout has probably elapsed, we can skip processing the message altogether, end the Task, and release resources back to the worker to try to make some headway through the queue.

In that scenario, renewing the message visibility timeout would probably make things worse as it would leave messages on a backed-up local queue even longer?

@slang25
Member

slang25 commented Aug 26, 2019

I would expect an "auto-renew" to be on by default, but you'd certainly want control. This is something that the PE team needed a long time back for their deployment tool, so we exposed the ReceiptHandle on the messages so they could do it themselves.

I happen to have the opposite view to you on this 😄: I think it more directly solves the problem of local queues holding stale messages. Cancelling sounds more haphazard as a general pattern here. If one worker has the message, it has a duty to process it, and if it doesn't, it will eventually become visible to others.

I could see that cancelling before a handler gets the message would solve your problem, but adding cancellation to handlers feels wrong from an event-driven architecture perspective. If a handler is still working as (or shortly before) a message's visibility times out, it should let others know it's still working on it. It's "at least once" message delivery after all; cancellation doesn't belong here.

@martincostello
Member Author

But here, the cancellation is pre-handler. Once control is passed to the handler, it's just as it is today, except you have to opt in to implementing the new interface, and then decide to observe the token to do something different in your handler implementation.

This is just trying to improve the situation where the internals get stuck in a never-progressing backlog (which is what we had in the issue that led to #546, as the Task queue was backing up) by never calling the handler in the first place.

If we wanted to allow renewing the message visibility, I think that would need to be a first-class concept somewhere in the handlers, giving you a hook to renew/extend the visibility after the handler has already seen the message and made a decision.

Maybe that could be morphed into a more detailed handler concept that did both separately, so for more advanced scenarios you'd get something like this very off-the-top-of-my-head thing:

public interface ISuperDuperHandler<in T>
{
    Task<bool> HandleAsync(HandlerContext<T> context);
}

public abstract class HandlerContext<T>
    where T : Message
{
    public CancellationToken CancellationToken { get; protected set; }

    public T Message { get; protected set; }

    public MessageContext MessageContext { get; protected set; }

    public abstract Task ExtendVisibilityAsync(TimeSpan value);
}

internal sealed class SuperDuperHandlerContext<T> : HandlerContext<T>
    where T : Message
{
    private readonly SomeClient _client;

    public SuperDuperHandlerContext(
        T message,
        MessageContext messageContext,
        SomeClient client)
    {
        Message = message;
        MessageContext = messageContext;
        _client = client;
    }

    public override async Task ExtendVisibilityAsync(TimeSpan value)
    {
        await _client.ExtendTheVisibilityAsync(Message.Id, value, CancellationToken);
    }
}

Then you could do stuff like this:

public class MyHandler : ISuperDuperHandler<MyMessage>
{
    public async Task<bool> HandleAsync(HandlerContext<MyMessage> context)
    {
        if (context.CancellationToken.IsCancellationRequested)
        {
            // Too old or shutting down, decide to junk it and let someone else handle it
            return false;
        }

        if (context.Message.IsMessageHuge)
        {
            await context.ExtendVisibilityAsync(TimeSpan.FromHours(1));
        }

        await DoExpensiveStuffAsync(context.Message);

        return true;
    }
}

@slang25
Member

slang25 commented Aug 26, 2019

Also want to add: I'm not saying that the default behaviour would be for message visibility to be renewed indefinitely; there would be a configurable policy for how long / how many renewals before giving up.

Edit: this is not a response, just reading your reply now

@slang25
Member

slang25 commented Aug 27, 2019

Ah, I was (or maybe still am) confused 😄

So if I understand you, ICancellableHandlerAsync<T> lets us build a pipeline that's cancellable before calling the user-provided handler.

If we want to let the user optionally know that the message has timed out, the cancellation token could lead to a little confusion, as the signature would imply that the token is live and that you'd be expected to constantly check it and flow it for cooperative cancellation.

Edit: No, I am still confused. They will get a live cancellation token, which may make people think cancellable handlers are a reasonable thing to do. I think this is a bad pattern compared with renewing visibility locks, though I get that the pre-handler check works as a short-term workaround.

@martincostello
Member Author

No, the interface would be to allow a developer to opt in to doing something different in the handler if the message is considered "stale" or if the message loop is shutting down.

To me, visibility locks are a different feature for people who need to process messages that take a long time to handle in the handler.

I don't think they're a solution to a backed-up internal queue, which operates with no context and could potentially have messages lying around in memory that are dozens of minutes old.

Effectively this makes this PR two different but related things in one:

  1. If a message has been waiting to be passed to a handler since receipt for longer than the visibility timeout, don't bother trying to process it (effectively abandon it, as if the service had crashed); a rough sketch of this check follows below.
  2. Provide a separate opt-in mechanism that allows a handler to run different logic if the message loop has been asked to stop or if the message visibility timeout has lapsed.
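Roughly, the check in (1) could look something like this inside the dispatch path (names such as receivedAt and visibilityTimeout are illustrative, not the actual JustSaying internals):

// Hypothetical pre-dispatch check: skip stale messages before the handler runs.
var elapsedSinceReceipt = DateTime.UtcNow - receivedAt;

if (elapsedSinceReceipt > visibilityTimeout)
{
    // SQS has most likely made the message visible to another consumer by now,
    // so don't call the handler at all and let that consumer deal with it.
    return;
}

bool handled = await handler.Handle(message);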

@slang25
Member

slang25 commented Aug 27, 2019

To me, visibility locks are a different feature for people who need to process messages that take a long time to handle in the handler.

The way I see it, JustSaying grabs a handful of messages and uses the processing strategy to decide when to call the handlers, so it's its duty to try to guarantee correctness (within reason): in this case either by dropping the messages (FIFO, as this change is), or by re-acquiring visibility locks. This wouldn't need much more complexity than what is done in this PR; there is still the same timer involved. (I sense I won't sell you on it in the short term.)

I see though that dropping the messages is a good solution, and is the minimal change. I'd suggest being able to add (or rather subtract) a buffer to the timeout: if I'm 0.1 seconds away from someone else picking up the message and the timeout is 30 seconds, I'd probably still want to drop it.

On the opt-in behaviour, the only thing I don't like is the way it's presented: when I look at the ICancellableHandlerAsync<T> type I see a message handler with cancellation, not what you describe. I'd like to see it be a little less ambiguous, but I haven't had much sleep, so I might just not be seeing things clearly 😄

@slang25
Member

slang25 commented Aug 27, 2019

For some more reading in favour of extending timeouts, here's the guidance on the matter:
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/working-with-messages.html#processing-messages-timely-manner

The thing with dropping messages is that the messages being dropped will be mostly random, so it's very easy for messages to be dropped multiple times, exceeding the max receive count and ending up in a dead letter queue. If you extend locks, you generally get a much more ordered processing of messages, which will leave messages in queues where they are nice and safe.

@slang25
Member

slang25 commented Aug 27, 2019

I think it's worth comparing against a prototype of that guidance: as soon as a message is received, we create a heartbeat that renews the lock. The interval would be configurable, defaulting to half the timeout, and each renewal extends the visibility by the full timeout.

So if the timeout is 30 seconds, schedule a task in 15 seconds to renew by 30 (45s cumulative), then again 15 seconds later, forever pushing out the lock until some overall timeout. Once the handler has completed, any heartbeat task is cancelled via a CancellationTokenSource.
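A minimal sketch of that heartbeat, assuming direct use of IAmazonSQS from AWSSDK.SQS (the names and plumbing are illustrative, not the eventual JustSaying API):

internal static async Task KeepMessageAliveAsync(
    IAmazonSQS sqs,
    string queueUrl,
    string receiptHandle,
    TimeSpan visibilityTimeout,
    CancellationToken cancellationToken)
{
    // Renew at half the timeout, extending by the full timeout each time.
    var interval = TimeSpan.FromTicks(visibilityTimeout.Ticks / 2);

    try
    {
        while (true)
        {
            await Task.Delay(interval, cancellationToken);

            await sqs.ChangeMessageVisibilityAsync(
                queueUrl,
                receiptHandle,
                (int)visibilityTimeout.TotalSeconds,
                cancellationToken);
        }
    }
    catch (OperationCanceledException)
    {
        // The handler finished (or the overall limit was hit), so stop renewing.
    }
}

// Usage: give the heartbeat an overall limit, run the handler, then stop it.
// using var cts = new CancellationTokenSource(overallTimeout);
// var heartbeat = KeepMessageAliveAsync(sqs, queueUrl, receiptHandle, visibilityTimeout, cts.Token);
// bool handled = await handler.Handle(message);
// cts.Cancel();
// await heartbeat;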

Waddaya say? I can have a bash and we can compare notes?

@martincostello
Member Author

I'd be happy to drop the ICancellableHandlerAsync<T> part (though I think it's a nice addition) and just keep the "this has been sat around internally for too long" part.

If you want to try a different tack and raise a PR I'll take a look.

@slang25
Member

slang25 commented Aug 28, 2019

Thanks, I'll start work in a different PR (likely not today but soon).

Sorry if this sounds negative (you know I love your work ❤️), but now that I've considered it some more I think this PR should be closed, as it's trading off one problem for another and not really solving the issue. I hope my points in the previous comments have persuaded you too.

Edit: Closing is a bit strong, sorry. This PR doesn't worsen the situation, as the messages would be taken from the queue by multiple consumers regardless of whether the processing is skipped; it just seems like a workaround rather than a fix.

@martincostello
Member Author

I'll re-review when I get the chance and see what could be removed to make it simpler but still provide a mitigation for a runaway Task queue, as ultimately that issue would have brought down the worker due to a memory leak.

Also waiting for anyone else to chip in with their thoughts 😄

@slang25
Member

slang25 commented Aug 29, 2019

In other news, sometimes a repo name tells a story...
[image: screenshot of a repository name]

@slang25
Member

slang25 commented Aug 29, 2019

Oh god, I've just encountered the same issue, what gives? The fork upstream link is gone from GH repos for JustSaying 😞

Edit: #573 👍

@brainmurphy brainmurphy removed their request for review August 30, 2019 07:28
@@ -177,6 +179,13 @@ internal async Task ListenLoopAsync(CancellationToken ct)
continue;
}

if (sqsMessageResponse == null || sqsMessageResponse.Messages.Count < 1)
Member Author

This block is duplicated due to a bad merge and needs removing.

@martincostello martincostello deleted the Cancellable-Handler branch October 10, 2019 08:44