Fix race condition in Global Distribution packet retransmission #1689

Merged
1 commit merged into master from fix-gd-retransmission-race-condition on Feb 8, 2018

Conversation

fenek
Member

@fenek fenek commented Jan 25, 2018

This PR fixes #1683 and obsoletes #1684

The race condition was provoked by the test_in_order_messages_on_multiple_connections_with_bounce test case. It manifested under the following conditions:

  1. Because the recipient mapping was missing, some messages (e.g. 4) were stored in bounce storage.
  2. The mapping appeared.
  3. A 5th message was sent and triggered the known_recipient hook, which in turn triggered retransmission of the stored items.
  4. For every retransmitted item the full routing logic is executed, so the recipient mapping is retrieved anew each time.
  5. The test process may remove the recipient mapping (as part of the test case) in the middle of retransmission, so only some of the items reach the recipient.
  6. The code returns from hook execution to a point where the mapping for message 5 has already been retrieved, so message 5 gets delivered anyway.
  7. Result: messages are delivered out of order.
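
The steps above can be modelled deterministically. The following is a minimal sketch, not the MongooseIM code: the module name, `lookup/1`, the "lookup budget" that imitates the test process deleting the mapping, and the host value are all hypothetical. `naive/0` repeats the lookup for every stored item, while `pinned/0` reuses the host that was resolved once, which is one plausible reading of what the `target_host_override` metadata in this PR achieves.

```erlang
-module(gd_race_sketch).
-export([naive/0, pinned/0]).

%% Hypothetical mapping lookup: it succeeds only while the budget is
%% positive, imitating a mapping that is removed mid-retransmission.
lookup(Budget) when Budget > 0 -> {ok, <<"host_b">>};
lookup(_Budget) -> error.

%% Naive flow: message 5 resolves the host first (consuming 1 of 3
%% successful lookups), then every stored message does its own lookup.
naive() ->
    Stored = [1, 2, 3, 4],
    {ok, _Host} = lookup(3),
    {Delivered, Kept} = retransmit_naive(Stored, 2, [], []),
    %% Message 5 already holds a resolved host, so it is sent anyway.
    {Delivered ++ [5], Kept}.

retransmit_naive([], _Budget, Delivered, Kept) ->
    {lists:reverse(Delivered), lists:reverse(Kept)};
retransmit_naive([Msg | Rest], Budget, Delivered, Kept) ->
    case lookup(Budget) of
        {ok, _Host} ->
            retransmit_naive(Rest, Budget - 1, [Msg | Delivered], Kept);
        error ->
            %% Mapping is gone: the message stays in bounce storage.
            retransmit_naive(Rest, Budget, Delivered, [Msg | Kept])
    end.

%% Pinned flow: the host resolved for message 5 is reused for every
%% stored message, so no further lookups can fail in between.
pinned() ->
    Stored = [1, 2, 3, 4],
    {ok, Host} = lookup(3),
    Delivered = [deliver(Msg, Host) || Msg <- Stored],
    {Delivered ++ [deliver(5, Host)], []}.

%% Stand-in for sending a message to Host; returns the message id.
deliver(Msg, _Host) -> Msg.
```

In this model `gd_race_sketch:naive()` returns `{[1,2,5],[3,4]}` (message 5 overtakes 3 and 4, which remain stored), while `gd_race_sketch:pinned()` returns `{[1,2,3,4,5],[]}`.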

I've decided neither to create a new test case nor to make the existing one more predictable: with 100 messages sent it reproduces the issue fairly frequently, and there is very little chance that it won't test what it is supposed to. By that I mean an edge case in which e.g. all 100 messages hit a window where the recipient mapping is available due to test-c2s concurrency.

Note: This version was built on Travis about 6 times with the full set of jobs, and test_in_order_messages_on_multiple_connections_with_bounce was never the cause of a failure.

@arcusfelis
Contributor

pamparam, don't like any concurrency in this quantum world^W^W^W any tests.

@@ -76,17 +76,20 @@ maybe_reroute({From, To, Acc0, Packet} = FPacket) ->
     Acc = maybe_initialize_metadata(Acc0),
     LocalHost = opt(local_host),
     GlobalHost = opt(global_host),
-    case lookup_recipients_host(To, LocalHost, GlobalHost) of
+    case lookup_recipients_host(get_metadata(Acc, target_host_override, undefined),
Contributor

@arcusfelis arcusfelis Feb 1, 2018

Can we have it on a separate line?

TargetHost = get_metadata(Acc, target_host_override, undefined),
case lookup_recipients_host(TargetHost, To, LocalHost, GlobalHost) of

It still will be two lines long :D

Contributor

target_host_override should be documented; I have no idea what it is for just by looking at the code.

Contributor

and why it can be undefined should be described too.
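
For readers of this thread, here is a rough sketch of how an optional host override with an `undefined` default might behave. It is an assumption for illustration only: the real body of `lookup_recipients_host/4` is not shown in this conversation, and the module name, `lookup_mapping/1`, the return values and the host names below are hypothetical.

```erlang
-module(gd_override_sketch).
-export([lookup_recipients_host/4]).

%% Hypothetical mapping store: only <<"alice">> has a known host.
lookup_mapping(<<"alice">>) -> {ok, <<"dc2.example">>};
lookup_mapping(_Other) -> error.

%% No override present in the metadata (the default, `undefined`):
%% fall back to the regular mapping lookup, which may fail.
lookup_recipients_host(undefined, To, _LocalHost, _GlobalHost) ->
    lookup_mapping(To);
%% An override is present: use it and skip the lookup entirely, so a
%% concurrent removal of the mapping cannot change the outcome.
lookup_recipients_host(TargetHost, _To, _LocalHost, _GlobalHost)
  when is_binary(TargetHost) ->
    {ok, TargetHost}.
```

Under these assumptions, a call with `undefined` and `<<"bob">>` yields `error`, whereas a call with `<<"dc3.example">>` as the override yields `{ok, <<"dc3.example">>}` regardless of the mapping state.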

Contributor

@arcusfelis arcusfelis left a comment

comments!

@kzemek kzemek force-pushed the fix-gd-retransmission-race-condition branch from 43c6d87 to 59739cf Compare February 8, 2018 13:02
Contributor

@arcusfelis arcusfelis left a comment

good

@kzemek kzemek merged commit cf97c70 into master Feb 8, 2018
@kzemek kzemek deleted the fix-gd-retransmission-race-condition branch February 8, 2018 16:36
@fenek fenek added this to the 3.0.0 milestone Feb 27, 2018
Development

Successfully merging this pull request may close these issues.

Disable broken test_in_order_messages_on_multiple_connections_with_bounce test case in GlobDist
3 participants