Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix docker tests in fed-gen #1556

Merged
merged 8 commits into from
Jan 18, 2023
Merged

Fix docker tests in fed-gen #1556

merged 8 commits into from
Jan 18, 2023

Conversation

cmnrd
Copy link
Collaborator

@cmnrd cmnrd commented Jan 18, 2023

This fixes several problems in the docker test execution. It simplifies and generalizes the script that is generated in TestBase for running federated docker test. The script has now a little "smartness" build into it. It simply starts all containers and uses grep to find if any of them failed (as opposed to start each container separately as was done before).

I also noticed that the non-federated docker tests were broken, as any errors within the container were not picked up by our test framework. Both test categories now use the same script. This also allowed a significant cleanup of the test code.

We might not be fully in the green yet. During my testing I observed multiple times that federates could not connect to the RTI. The log looked always looked smth like this:

DistributedCountContainerized-c    | Federate 0: Could not connect to RTI at rti. Will try again every 2 seconds.
DistributedCountContainerized-c    | Federate 0: Failed to connect to RTI on port 15065. Trying 15066.
DistributedCountContainerized-c    | Federate 0: Could not connect to RTI at rti. Will try again every 2 seconds.
DistributedCountContainerized-c    | Federate 0: Failed to connect to RTI on port 15066. Trying 15067.
DistributedCountContainerized-rti  | WARNING: RTI failed to accept the socket. Resource temporarily unavailable. Trying again.
DistributedCountContainerized-c    | Federate 0: Could not connect to RTI at rti. Will try again every 2 seconds.
DistributedCountContainerized-c    | Federate 0: Failed to connect to RTI on port 15067. Trying 15068.
DistributedCountContainerized-c    | Federate 0: Could not connect to RTI at rti. Will try again every 2 seconds.
DistributedCountContainerized-c    | Federate 0: Failed to connect to RTI on port 15068. Trying 15069.
DistributedCountContainerized-c    | Federate 0: Could not connect to RTI at rti. Will try again every 2 seconds.
DistributedCountContainerized-c    | Federate 0: Failed to connect to RTI on port 15069. Trying 15070.

I am not sure if this is a problem in my setup or a problem with our tests. I always had to restart the docker daemon to make it work again.

@cmnrd
Copy link
Collaborator Author

cmnrd commented Jan 18, 2023

Looks like the RTI issue also occurs in CI: https://github.com/lf-lang/lingua-franca/actions/runs/3949937855/jobs/6761805738

@lhstrh
Copy link
Member

lhstrh commented Jan 18, 2023

The rti problem is puzzling and needs looking into, but this PR sure is a major improvement!

if (LFCommand.get("docker", List.of()) == null) {
throw new TestError("Executable 'docker' not found" , Result.NO_EXEC_FAIL);
}
if (LFCommand.get("docker-compose", List.of()) == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hasn't the API changed to just docker (no docker-compose)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just remove this check...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to install docker-compose separately, though (even if you use docker compose). This also puzzled me at first, but they shipp as separate packages

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we revert the removal?

Copy link
Member

@lhstrh lhstrh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to merge but left a question.

@lhstrh
Copy link
Member

lhstrh commented Jan 18, 2023

Much cleaner indeed! Thank you so much for these fixes :-)

@lhstrh lhstrh merged commit 539a12e into fed-gen Jan 18, 2023
@lhstrh lhstrh deleted the fix-docker-tests branch January 18, 2023 16:35
@edwardalee
Copy link
Collaborator

There is a long-standing flaw in the RTI/federate mechanism for handling ports. The RTI tries to get a default port, and if is unavailable, it tries a port number that is one larger, and if that fails, it tries one more, etc. The federates go through a similar sequence, trying the default port number first, and if failing, trying one more.

However, this really doesn't work. In particular, if you start a federate before the RTI, it skips the default port, and it takes a very long time for it to circle around to try that default port again.

The problem this was trying to address is that when a program releases a port, the OS does not make the port available to other programs for some time. There is a good reason for this: the OS wants to prevent a program from grabbing a port and then receiving messages that were intended for a program that has exited. It therefore holds the port long enough that any messages that were in flight die before it releases the port.

This feature was making CI fail because it runs many federated programs in sequence.

I think a better solution is just that the RTI should just use a fixed port, perhaps optionally specified as a command-line argument (which the federates will also need to be told). Then we just have to figure out how to make CI work (wait long enough between federated tests?).

@cmnrd
Copy link
Collaborator Author

cmnrd commented Jan 19, 2023

Why not pass the port to use as a command line argument to the RTI and the federates? Then we can even have multiple instances running in parallel on the same host using different ports. What we are missing is an actual deployment.

@lhstrh
Copy link
Member

lhstrh commented Jan 19, 2023

I personally think we need a service/daemon that listens at default and recognizable port (like 80 for http) with the sole purpose of brokering connections between participants in federations. We never built this and implemented this kind of the port knocking scheme instead, but I think this issue belongs in the category of problems solved by existing frameworks...

@edwardalee
Copy link
Collaborator

Yes, you can improve what we have or start over with another framework. That choice seems clear.

@edwardalee
Copy link
Collaborator

Why not pass the port to use as a command line argument to the RTI and the federates? Then we can even have multiple instances running in parallel on the same host using different ports. What we are missing is an actual deployment.

Yes, this would be easy to implement and would delete a bunch of code.

@lhstrh
Copy link
Member

lhstrh commented Jan 19, 2023

Update: resolving the race condition between RTI and federates by declaring a dependency in docker-compose.yml seems to have done the trick. We have some remaining errors, but those are of a different nature.

@cmnrd
Copy link
Collaborator Author

cmnrd commented Jan 20, 2023

I converted this discussion to an issue lf-lang/reactor-c#146

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants