Workflow Stuck Waiting for Greylisted Event #722

Open
dflor003 opened this issue Dec 9, 2020 · 5 comments · May be fixed by #786

Comments

@dflor003
Contributor

dflor003 commented Dec 9, 2020

Hi, we have an interesting scenario where we need to coordinate events across multiple workflows and we are having some issues where events are getting greylisted and sometimes never get processed.

First, let me describe our setup:

We have 2 different types of workflows; let's say the first one is called CoordinatorWorkflow and the second is called SubTaskWorkflow. For a given task, there will be exactly 1 CoordinatorWorkflow spun up and N SubTaskWorkflows.

When CoordinatorWorkflow is spun up, it knows how many SubTaskWorkflows correspond to it and has a unique identifier for each SubTaskWorkflow in the "set".

As an example, let's say we have a set of 2 SubTaskWorkflows identified by SubTaskA and SubTaskB that both start around the same time, and CoordinatorWorkflow is passed a set of ["SubTaskA", "SubTaskB"].

The very first thing that CoordinatorWorkflow does is go into a for loop on ["SubTaskA", "SubTaskB"] and wait for an event of type SubTaskFinished, with the key being each of the identifiers SubTaskA and SubTaskB.

Eventually each of the SubTaskWorkflows gets to a certain point and publishes an event of type SubTaskFinished with its identifier (SubTaskA or SubTaskB) as the key. At this point they wait for an event of type CoordinationFinished.

Once CoordinatorWorkflow has received the events from each SubTaskWorkflow, it proceeds to do some work, then fires off a CoordinationFinished event and completes.

Each SubTaskWorkflow then gets the CoordinationFinished event and then proceeds to do some more work and completes.

This is roughly the flow we are trying to achieve and have gotten most of the way there (Note: These aren't the actual names of the workflows, but I've tried to make them somewhat generic and domain-agnostic to simplify it).

The problem we are getting, however, is that at some point in the process, we get stuck waiting for events that never arrive. The events do indeed get published but we see a bunch of messages in the logs like the following and the workflows waiting for the events never have a chance to process them.

[16:06:44 DBG] Got greylisted event evt:{Id}

Any ideas as to what we may be doing wrong? Is this a known issue? If so, are there any workarounds?

Here are a few other observations that my team saw while troubleshooting this:

  • This happens sporadically (but fairly frequently) when running locally with 1 workflow node, but we do not see it in our development and production environments, where we run 2 and 4 workflow engine nodes respectively. I suspect this is because IGreyList is registered as an in-memory singleton, so the other workflow processes don't have the event greylisted and are free to pick it up.
  • We recently added some integration tests around the process outlined above and ran into this issue there as well. The workaround in the integration tests was to provide our own fake implementation of IGreyList that basically ignores any items with the evt: prefix.
  • I did some digging in the code, and while there is something that inserts events into the greylist, there doesn't seem to be anything that removes events from it.
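The fake greylist from the second bullet might look roughly like the sketch below. To keep the snippet compilable on its own, IGreyList is redeclared here with the three members (Add, Remove, Contains) it is assumed to have; in a real test project you would implement the library's own interface and register the fake in the test container. EventIgnoringGreyList is a hypothetical name, not something from the library.

```csharp
using System;
using System.Collections.Concurrent;

// Assumed shape of the library's interface, redeclared so the sketch is self-contained.
public interface IGreyList
{
    void Add(string id);
    void Remove(string id);
    bool Contains(string id);
}

// Hypothetical test double: ids with the "evt:" prefix are never greylisted,
// so a queued event is always eligible to be picked up again.
public sealed class EventIgnoringGreyList : IGreyList
{
    private const string EventPrefix = "evt:";

    // Entries never expire here; that is acceptable for short-lived test runs only.
    private readonly ConcurrentDictionary<string, byte> _items =
        new ConcurrentDictionary<string, byte>();

    public void Add(string id)
    {
        if (id.StartsWith(EventPrefix, StringComparison.Ordinal))
            return; // ignore events entirely

        _items.TryAdd(id, 0);
    }

    public void Remove(string id) => _items.TryRemove(id, out _);

    public bool Contains(string id) => _items.ContainsKey(id);
}
```

This only hides the symptom in tests: events are re-queued on every poll, which is fine in a test harness but is not a fix for the underlying behaviour.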
@anderr225

Hi, I think I encountered the same problem.

Do you see in logs "Workflow locked 'some-id'" from EventConsumer when this happens?
If yes, then the reason is the following: EventConsumer tries to acquire a lock for the corresponding workflow, and if it fails, the event stays in the GreyList until the GreyList's Cycle is invoked (but IQueueProvider no longer has it, so the event is not processed at all). Adding the line _greylist.Remove($"evt:{evt.Id}"); when EventConsumer fails to acquire a lock would solve the problem. RunnablePoller would then not find the event in the greylist after the consumer's failed lock, and would queue the event one more time.

The reason why this does not always happen: sometimes EventConsumer successfully acquires the lock and the event is processed as expected; sometimes EventConsumer fails to acquire the lock, and then the event is processed only after the GreyList's Cycle.
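To make the sequence concrete, here is a small single-threaded model of the behaviour described above. It is only a sketch: GreylistModel, Poll, Consume and the removeOnFailedLock switch are illustrative names, not the library's types, and expiry of greylist entries (the Cycle) is left out.

```csharp
using System;
using System.Collections.Generic;

// Simplified model of the poller / consumer / greylist interaction.
// None of these types belong to the library; names are illustrative only.
public sealed class GreylistModel
{
    private readonly HashSet<string> _greylist = new HashSet<string>();
    private readonly Queue<string> _queue = new Queue<string>();
    private readonly bool _removeOnFailedLock;

    public List<string> Processed { get; } = new List<string>();

    public GreylistModel(bool removeOnFailedLock) => _removeOnFailedLock = removeOnFailedLock;

    // Poller: queue a pending event unless it is currently greylisted.
    public void Poll(string eventId)
    {
        var key = $"evt:{eventId}";
        if (_greylist.Contains(key))
        {
            Console.WriteLine($"Got greylisted event {key}");
            return;
        }
        _greylist.Add(key);
        _queue.Enqueue(eventId);
    }

    // Consumer: dequeue one event and process it only if the workflow lock is free.
    public void Consume(bool workflowLockAvailable)
    {
        if (_queue.Count == 0) return;
        var eventId = _queue.Dequeue();

        if (workflowLockAvailable)
        {
            Processed.Add(eventId);
            return;
        }

        // Lock failed: the event has already left the queue. Without the fix it
        // also stays greylisted, so the poller keeps skipping it.
        if (_removeOnFailedLock)
            _greylist.Remove($"evt:{eventId}");
    }
}

public static class Demo
{
    public static int Run(bool removeOnFailedLock)
    {
        var model = new GreylistModel(removeOnFailedLock);
        model.Poll("42");      // queued and greylisted
        model.Consume(false);  // workflow is locked; event is gone from the queue
        model.Poll("42");      // skipped unless the greylist entry was removed
        model.Consume(true);   // lock is free now
        return model.Processed.Count;
    }

    public static void Main()
    {
        Console.WriteLine(Run(removeOnFailedLock: false)); // 0: event is stuck
        Console.WriteLine(Run(removeOnFailedLock: true));  // 1: event is re-queued and processed
    }
}
```

In the model, the only difference between the two runs is the single Remove call on the failed-lock path, which mirrors the one-line change proposed above.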

I would also like to note that tuning the PollInterval can reduce how often this scenario occurs (until it is fixed). If the interval is less than a second, it happens very often. If the interval is more than 10 seconds, it happens rarely.

I will send PR soon to fix this.

@anderr225

Also, instead of removing the event from the greylist, we could just use IQueueProvider to queue it again directly, without waiting for RunnablePoller.
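A rough model of that alternative, for comparison. Again, this is a sketch with made-up names (RequeueOnFailedLockConsumer, ConsumeOne, and the tryLock delegate standing in for the workflow lock); it does not use the library's IQueueProvider.

```csharp
using System;
using System.Collections.Generic;

// Illustrative model only; these are not the library's types.
public sealed class RequeueOnFailedLockConsumer
{
    private readonly Queue<string> _queue = new Queue<string>();

    public List<string> Processed { get; } = new List<string>();
    public int PendingCount => _queue.Count;

    public void Enqueue(string eventId) => _queue.Enqueue(eventId);

    // tryLock stands in for acquiring the workflow lock for this event.
    public void ConsumeOne(Func<string, bool> tryLock)
    {
        if (_queue.Count == 0) return;
        var eventId = _queue.Dequeue();

        if (tryLock(eventId))
            Processed.Add(eventId);
        else
            _queue.Enqueue(eventId); // straight back onto the queue; greylist untouched
    }
}
```

Here the retry does not wait for the next poll, but an event whose lock can never be acquired would keep circulating in the queue, so a real implementation would probably want some bound on retries.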

anderr225 linked a pull request Mar 2, 2021 that will close this issue
@dflor003
Contributor Author

dflor003 commented Mar 3, 2021

Yeah, this lines up with what I'm seeing. In ours, we have a pretty short poll interval and are getting this somewhat frequently. Good catch! Looking forward to the PR.

@danielgerlag
Owner

I think adding to the back of the queue again could cause a poison message, but I'm not sure why we're renewing the grey list time for events and not workflow.
Is there a particular reason you're using such short poll intervals?

@anderr225

anderr225 commented Mar 4, 2021

I think it could cause a poison message only if we add to the queue when the event lock fails, not if we add when the workflow lock fails.

The reason for renewing the grey list for events is a mystery to me.
