TAN messages no longer used and in-transit messages recorded in the RTI #1074

Soroosh129 · 2022-04-04T19:41:28Z

This PR removes TAN messages entirely.

Instead, a federate with a physical action that is connected to a network output periodically creates a dummy event. The period is set with coordination-options: {advance-message-interval: 10 msec}. Each dummy event forces the federate to advance its tag, which allows downstream federates to make progress.

After fixing this bug, another bug was exposed in the RTI: the RTI could lose track of a federate's actual earliest next event (see #1074 (comment) for more detail). This caused the RTI to issue incorrect tag advance grant (TAG) messages. The bug was fixed by adding a queue to the RTI that keeps a record of all currently in-transit messages.

Soroosh129 · 2022-05-30T15:29:10Z

I've been wrestling with this issue for the past few days. I think I have identified two possible race conditions. I think I have found a fix for the first one, but not for the second.

Consider the following federated program (PassThrough and TestCount are simple library reactors with trivial implementations, elided here for simplicity):

target C { 
    timeout: 4 sec
};

import PassThrough from "../lib/PassThrough.lf"
import TestCount from "../lib/TestCount.lf"

reactor WithLogicalAction {
    output out:int;
    state thread_id:lf_thread_t(0);
    state counter:int(1);
    logical action act(0):int;

    reaction(startup, act) -> act, out {=
        SET(out, self->counter);
        schedule_int(act, MSEC(1), self->counter++);
    =}
}

federated reactor {
    a = new WithLogicalAction();
    test = new TestCount(num_inputs=21);

    passThroughs = new PassThrough();

    a.out, passThroughs.out -> 
    passThroughs.in, test.in
}

Race condition 1: NET with FOREVER tag from the federate after message forwarding by the RTI but before delivery

Example scenario:

...
Federate WithLogicalAction sends a message with tag 1 sec to the RTI
The RTI sees this message and forwards it to PassThrough
- As part of forwarding, the RTI checks whether the tag of the message is smaller than the recorded next event of PassThrough. Assume here that it is (1 sec < FOREVER), so the RTI records 1 sec as the next event for PassThrough.
While the message is in transit, PassThrough sends another NET with tag FOREVER to the RTI (because it has no other local events). I'm not sure why this happens.
Upon receiving this NET, the RTI incorrectly replaces the recorded next event of PassThrough (the tag of the in-transit message) with FOREVER.
Race condition: At this point (while the message is in transit), if TestCount asks for a TAG of FOREVER, it will be granted because the RTI incorrectly thinks that PassThrough has a next event of FOREVER.

Fix: The RTI checks that the recorded next event has already been completed before replacing it with a larger next event.
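The check described in this fix can be sketched in C as follows. This is a minimal illustration, not the reactor-c implementation: tags are reduced to a single int64_t time value (microsteps are ignored), and fed_record_t, on_forward, and on_net are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FOREVER INT64_MAX  // hypothetical stand-in for "no next event"

// Hypothetical per-federate record kept by the RTI.
typedef struct {
    int64_t next_event;  // earliest next event recorded for the federate
    int64_t completed;   // latest tag the federate reported as completed
} fed_record_t;

// The RTI forwards a message with the given tag to the federate:
// lower the recorded next event if the message is earlier.
void on_forward(fed_record_t* f, int64_t tag) {
    assert(f != NULL);
    if (tag < f->next_event) {
        f->next_event = tag;
    }
}

// The RTI receives a NET from the federate. A larger NET only replaces
// the record if the recorded event has already been completed.
// Returns true if the record was updated.
bool on_net(fed_record_t* f, int64_t net) {
    assert(f != NULL);
    if (net > f->next_event && f->completed < f->next_event) {
        return false;  // recorded event still pending: keep it
    }
    f->next_event = net;
    return true;
}
```

With a record of 1 sec and nothing completed yet, a NET of FOREVER is ignored under this rule. Once the federate has completed 1 sec, the same NET is accepted.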

Race condition 2: NET with smaller tag from the federate before message forwarding by the RTI

Example scenario:

...
Federate WithLogicalAction sends a message with tag 1 sec to the RTI
...
Federate PassThrough starts processing the message with tag 1 sec, after properly sending NET(1 sec) and receiving TAG(1 sec)
Federate WithLogicalAction sends a message with tag 1001 msec to the RTI
The RTI forwards the message to PassThrough
- The recorded next event for PassThrough is 1 sec, so the RTI will not replace that with 1001 msec.
While the message is in transit, PassThrough finishes processing the message with tag 1 sec and sends a NET of FOREVER (because it has no other local events and the message is still in transit).
Race condition: At this point, if TestCount asks for a TAG of FOREVER, it will be granted because the RTI incorrectly thinks that PassThrough has a next event of FOREVER (even though a message with tag 1001 msec is in transit).
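The scenario above can be replayed against a simplified model of the RTI's bookkeeping. The sketch below assumes the rule from the fix for race condition 1 (a larger NET is accepted only once the recorded event is completed) and shows that the record still ends up at FOREVER. All names (fed_record_t, forward, net, replay_race_2) and the local MSEC macro are hypothetical, and tags are plain int64_t nanosecond values without microsteps.

```c
#include <assert.h>
#include <stdint.h>

#define FOREVER INT64_MAX                   // stand-in for "no next event"
#define MSEC(t) ((int64_t)(t) * 1000000LL)  // milliseconds to nanoseconds

// Hypothetical per-federate record kept by the RTI.
typedef struct {
    int64_t next_event;  // earliest next event recorded for the federate
    int64_t completed;   // latest tag the federate reported as completed
} fed_record_t;

// The RTI forwards a message: keep the minimum of record and message tag.
static void forward(fed_record_t* f, int64_t tag) {
    if (tag < f->next_event) f->next_event = tag;
}

// The RTI receives a NET: reject a larger NET while the record is pending.
static void net(fed_record_t* f, int64_t tag) {
    if (tag > f->next_event && f->completed < f->next_event) return;
    f->next_event = tag;
}

// Replays the scenario and returns the RTI's final record for PassThrough.
int64_t replay_race_2(void) {
    fed_record_t pt = { FOREVER, INT64_MIN };
    forward(&pt, MSEC(1000));   // message with tag 1 sec is forwarded
    net(&pt, MSEC(1000));       // PassThrough sends NET(1 sec)
    forward(&pt, MSEC(1001));   // second message; record stays at 1 sec
    assert(pt.next_event == MSEC(1000));
    pt.completed = MSEC(1000);  // PassThrough finishes tag 1 sec
    net(&pt, FOREVER);          // NET(FOREVER) while 1001 msec is in transit
    return pt.next_event;       // FOREVER: the in-transit tag is lost
}
```

In this model the tag 1001 msec is never recorded, because it was larger than the record when it was forwarded, and nothing remembers it once the record is replaced.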

Fix: ?

Any suggestion on how to fix the second race condition?

edwardalee · 2022-06-02T07:16:38Z

It seems that maybe the RTI needs to keep a list of tags of messages it has forwarded to a federate for which it has not yet received an LTC. When it receives an LTC, it removes items from this list with tags <= the LTC. When it receives a NET, the RTI checks and sets its local record of the NET to the minimum of the NET it received and the tags in the list.

This is unfortunate because, although this list is likely to be small in practice, there is no upper bound on its size, so the implementation inevitably will require a malloc. But it probably could be optimized to minimize the likelihood of a malloc.
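A minimal sketch of the bookkeeping described above, assuming tags are plain int64_t values (microsteps ignored) and using an unsorted growable array. The names fed_record_t, record_forwarded, handle_ltc, and handle_net are hypothetical and are not taken from reactor-c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define FOREVER INT64_MAX  // hypothetical stand-in for "no next event"

// Hypothetical per-federate record kept by the RTI.
typedef struct {
    int64_t* tags;       // tags forwarded to the federate, not yet covered by an LTC
    size_t len;          // number of tags currently stored
    size_t cap;          // allocated capacity of the array
    int64_t next_event;  // the RTI's local record of the federate's NET
} fed_record_t;

// Remember the tag of a message that the RTI has just forwarded.
void record_forwarded(fed_record_t* f, int64_t tag) {
    assert(f != NULL);
    if (f->len == f->cap) {
        size_t cap = f->cap ? 2 * f->cap : 4;  // grow geometrically
        int64_t* p = realloc(f->tags, cap * sizeof *p);
        if (p == NULL) abort();  // sketch only: give up on allocation failure
        f->tags = p;
        f->cap = cap;
    }
    f->tags[f->len++] = tag;
}

// On an LTC, drop every stored tag that is <= the LTC.
void handle_ltc(fed_record_t* f, int64_t ltc) {
    assert(f != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < f->len; i++) {
        if (f->tags[i] > ltc) f->tags[kept++] = f->tags[i];
    }
    f->len = kept;
}

// On a NET, record the minimum of the NET and all stored tags.
void handle_net(fed_record_t* f, int64_t net) {
    assert(f != NULL);
    int64_t min = net;
    for (size_t i = 0; i < f->len; i++) {
        if (f->tags[i] < min) min = f->tags[i];
    }
    f->next_event = min;
}
```

Under this scheme, a NET of FOREVER that arrives while a message with tag 1001 is still in the list leaves the record at 1001 rather than FOREVER. The initial capacity of 4 means that allocation is rare when the list stays small.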

This whole protocol is screaming to be formally verified...

Soroosh129 · 2022-06-02T08:00:40Z

> It seems that maybe the RTI needs to keep a list of tags of messages it has forwarded to a federate for which it has not yet received an LTC. When it receives an LTC, it removes items from this list with tags <= the LTC. When it receives a NET, the RTI checks and sets its local record of the NET to the minimum of the NET it received and the tags in the list.

This was my initial instinct as well. I implemented a prototype using a vector but I quickly realized that this list of tags needs to be sorted to be efficient. I think our existing pqueue sounds like the right structure for this.
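To illustrate why a sorted structure helps, here is a small binary min-heap of tags in C. It is a stand-in written for this example and does not use the pqueue API from reactor-c. The fixed capacity, the int64_t tags without microsteps, and all names (tag_heap_t, heap_push, heap_peek, heap_remove_up_to, effective_next_event) are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FOREVER INT64_MAX  // hypothetical stand-in for "no next event"
#define HEAP_CAPACITY 64   // fixed only to keep the sketch short

typedef struct {
    int64_t tags[HEAP_CAPACITY];  // binary min-heap of in-transit tags
    size_t len;                   // number of tags currently stored
} tag_heap_t;

static void swap_tags(int64_t* a, int64_t* b) {
    int64_t t = *a; *a = *b; *b = t;
}

// Insert a tag and sift it up. Returns false if the heap is full.
bool heap_push(tag_heap_t* h, int64_t tag) {
    if (h->len == HEAP_CAPACITY) return false;
    size_t i = h->len++;
    h->tags[i] = tag;
    while (i > 0 && h->tags[(i - 1) / 2] > h->tags[i]) {
        swap_tags(&h->tags[(i - 1) / 2], &h->tags[i]);
        i = (i - 1) / 2;
    }
    return true;
}

// Smallest in-transit tag, or FOREVER if nothing is in transit.
int64_t heap_peek(const tag_heap_t* h) {
    return h->len ? h->tags[0] : FOREVER;
}

// Remove the smallest tag and sift the replacement down.
static void heap_pop(tag_heap_t* h) {
    assert(h->len > 0);
    h->tags[0] = h->tags[--h->len];
    size_t i = 0;
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < h->len && h->tags[l] < h->tags[m]) m = l;
        if (r < h->len && h->tags[r] < h->tags[m]) m = r;
        if (m == i) break;
        swap_tags(&h->tags[i], &h->tags[m]);
        i = m;
    }
}

// On an LTC, drop all tags <= ltc; only the head needs to be inspected.
void heap_remove_up_to(tag_heap_t* h, int64_t ltc) {
    while (h->len > 0 && h->tags[0] <= ltc) heap_pop(h);
}

// On a NET, the value to record is min(NET, smallest in-transit tag).
int64_t effective_next_event(const tag_heap_t* h, int64_t net) {
    int64_t head = heap_peek(h);
    return head < net ? head : net;
}
```

With the tags ordered, handling a NET only needs the head of the heap, and handling an LTC pops from the head until the head exceeds the LTC, instead of scanning the whole list each time.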

> This whole protocol is screaming to be formally verified...

I agree. Tagging @lsk567 to see if he would be interested in continuing our unfinished work toward formally verifying the federated execution.

Soroosh129 · 2022-06-02T21:14:50Z

Here is an implementation for the queue of in-transit messages: lf-lang/reactor-c@98dd2c2

edwardalee

LGTM.

test/C/src/federated/DistributedLogicalActionUpstreamLong.lf

test/C/src/federated/DistributedPhysicalActionUpstream.lf

test/C/src/lib/PassThrough.lf

Added a new failing test for TAN messages

3c10dcd

Soroosh129 force-pushed the hotfix-C-TAN branch from c3cd051 to 3c10dcd April 4, 2022 19:45

Soroosh129 added 4 commits April 4, 2022 14:49

Updated reactor-c

c4666fc

Updated the test

126d562

Updated reactor-c

7612e07

Lowered the interval to add stress

05ea29b

Soroosh129 mentioned this pull request Apr 5, 2022

Removal of TAN messages and new capability to record in-transit messages in the RTI lf-lang/reactor-c#61

Merged

Soroosh129 and others added 9 commits April 5, 2022 13:02

Updated reactor-c

286483e

Align with reactor-c

f77a7ae

Update reactor-c ref

0607ed8

Merge remote-tracking branch 'origin/master' into hotfix-C-TAN

8c92f38

Updated reference to reactor-c

b5d8e98

Added more slack to the test

a918721

Updated CI to use the RTI in this branch

5aec6a6

Added test that uses a logical action, but still fails

2348162

Updated pointer to reactor-c

57b113e

Soroosh129 added 6 commits May 30, 2022 10:35

Simplified test

265af36

Updated pointer to reactor-c

ac4ad97

Merge remote-tracking branch 'origin/master' into hotfix-C-TAN

3e5a9c9

Update reference to reactor-c

982eecc

Update reactor-c

9e12bc5

Updated pointer to reactor-c

b7bccaf

Soroosh129 mentioned this pull request Jun 1, 2022

Fix for deadlock in federated execution #1189

Merged

Soroosh129 added 3 commits June 3, 2022 15:40

Updated pointer to reactor-c

a6109d4

Enable debug for test

61d49a9

Updated ref to reactor-c

c6ef9c4

Soroosh129 changed the title ~~Hotfix: TAN message are not working as intended in federated execution~~ Remove TAN messages Jun 3, 2022

Soroosh129 marked this pull request as ready for review June 3, 2022 22:33

Soroosh129 requested review from edwardalee and lhstrh June 3, 2022 22:33

edwardalee approved these changes Jun 4, 2022

Added comments

bbc0e0a

Soroosh129 changed the title ~~Remove TAN messages~~ Remove TAN messages and record in-transit messages in the RTI Jun 4, 2022

Soroosh129 added 5 commits June 10, 2022 12:36

Update ref to reactor-c

d70bc48

Updated ref to reactor-c

49040c9

Merge remote-tracking branch 'origin/master' into hotfix-C-TAN

5441682

Updated reference to reactor-c

1924b80

Updated ref to reactor-c

ee6de73

Soroosh129 merged commit f890aec into master Jun 11, 2022

Soroosh129 deleted the hotfix-C-TAN branch June 11, 2022 05:28

lhstrh added the bug label Jul 7, 2022

lhstrh changed the title ~~Remove TAN messages and record in-transit messages in the RTI~~ TAN messages no longer used and in-transit messages recorded in the RTI Jul 20, 2022
