Replace most shell script logic with Java #85758

Merged
rjernst merged 193 commits into elastic:master on May 19, 2022

Member

@rjernst rjernst commented Apr 7, 2022

Elasticsearch provides several command line tools, as well as the main script to start Elasticsearch. While most of the logic is abstracted away for CLI tools, the main elasticsearch script has hundreds of lines of platform-specific shell code. That code is difficult to maintain because it uses many special shell features which must then also exist on other platforms (i.e. Windows batch files). Additionally, the logic in these scripts is not easy to test: we must be on the actual platform and test with a full installation of Elasticsearch, which is relatively slow (compared to most in-process tests).

This commit replaces most of the shell-specific logic with Java code. It introduces a single entrypoint, the Launcher, to start any Elasticsearch CLI. Each shell script then only needs to describe the tool to call and the lib directories it needs to load.

A small amount of shell logic remains: identifying the location of ES_HOME from the shell script path, and then finding which Java installation should be used. After that, the CLI can be launched, using a small heap (as we do already for CLIs). For the main Elasticsearch server, the CLI works out all the necessary JVM options, then launches the real server process. If run in the foreground, the launcher will stay alive for the lifetime of Elasticsearch; the streams are effectively inherited, so all output from Elasticsearch still goes to the console. If daemonizing, the launcher waits around until Elasticsearch is "ready" (this means the Node startup completed), then detaches and exits.
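As a rough illustration of the "tool name plus lib directories" contract described above, here is a minimal sketch of how a launcher might turn a `CLI_LIBS`-style value into directories under ES_HOME. The class name, method name, and the comma-separated format are assumptions made for this example; the PR text does not show the actual Launcher code.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Launcher from this PR.
final class LauncherSketch {

    /**
     * Resolves a comma-separated list of lib directories (for example
     * "lib/tools/server-cli") against the Elasticsearch home directory.
     * The comma-separated format is an assumption for this example.
     */
    static List<Path> resolveLibDirs(Path esHome, String cliLibs) {
        List<Path> dirs = new ArrayList<>();
        for (String lib : cliLibs.split(",")) {
            String trimmed = lib.trim();
            if (!trimmed.isEmpty()) {
                // Each entry is taken relative to ES_HOME.
                dirs.add(esHome.resolve(trimmed));
            }
        }
        return dirs;
    }
}
```

With this shape, a shell script only has to pass along the two values it already sets (`CLI_NAME` and `CLI_LIBS`), and everything else can be unit tested in process.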

@rjernst rjernst requested a review from grcevski May 18, 2022 00:49
Contributor

@grcevski grcevski left a comment

This looks great, the tests we can write now! I left some questions/comments. We should probably add a test-windows label to the PR.

}
} catch (IOException e) {
ioFailure = e;
}
Contributor

Does it make sense to put the flush and the latch countdown into a finally block here? I'm thinking about getting some unexpected error here and maybe it's better if we actually terminated and reported the error in userExceptionMsg.

Contributor

👍 I agree. We should do both in a finally block in case we encounter some error reading from the input stream.

Member Author

User exceptions are for things the user can control, like configuration. If it is a general coding error (e.g. an NPE in our code here), then it would get thrown to the default uncaught exception handler. In the Elasticsearch process, we have an uncaught exception handler that would catch it and log it. I wonder if we should add something similar to CliToolLauncher? Normally CLIs don't create threads, but in case they do, we can be more consistent about (1) exiting and (2) how we log the error (e.g. a nice message like "there was an unexpected internal error, see below"). WDYT? I could do that as a followup?
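To make the suggestion in this thread concrete, here is a minimal sketch of a pump thread that does the flush and the latch countdown in a `finally` block, so a waiting caller is released even if reading fails unexpectedly. It is not the code from this PR: the class name, the `READY` marker string, and the line-based protocol are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of a stderr pump; names and marker are assumptions.
final class PumpSketch extends Thread {
    // Hypothetical marker line; the real marker is not shown in this thread.
    static final String READY_MARKER = "READY";

    private final BufferedReader reader;
    private final PrintWriter writer;
    private final CountDownLatch readyOrDead = new CountDownLatch(1);
    private volatile boolean ready;
    private volatile IOException ioFailure;

    PumpSketch(InputStream childStderr, PrintWriter writer) {
        super("pump-sketch");
        this.reader = new BufferedReader(new InputStreamReader(childStderr, StandardCharsets.UTF_8));
        this.writer = writer;
    }

    @Override
    public void run() {
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.equals(READY_MARKER)) {
                    ready = true;
                    readyOrDead.countDown();
                } else {
                    writer.println(line); // forward everything else to the console
                }
            }
        } catch (IOException e) {
            ioFailure = e;
        } finally {
            // Runs on normal end of stream, on IOException, and on unexpected
            // runtime errors, so waitUntilReady() can never block forever.
            writer.flush();
            readyOrDead.countDown();
        }
    }

    /** Blocks until the marker is seen or the stream ends; returns whether the marker was seen. */
    boolean waitUntilReady() throws IOException, InterruptedException {
        readyOrDead.await();
        if (ioFailure != null) {
            throw ioFailure;
        }
        return ready;
    }
}
```

An unexpected runtime error would still propagate to the thread's uncaught exception handler after the `finally` block runs, which is where the follow-up idea of a shared handler for CLI threads would apply.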

exit $?
CLI_NAME=server
CLI_LIBS=lib/tools/server-cli
source "`dirname "$0"`"/elasticsearch-cli
Contributor

Anticipating that the Java process launch will be heavier on resources than the bash script, can we add an additional JVM command line option here to stop the JDK from using the optimizing compiler, i.e. -XX:TieredStopAtLevel=1? This should reduce startup time, CPU usage, and memory usage.

Member Author

Should this be for all CLIs or just server?

Contributor

Well, we were still launching Java processes before (JVM options parser, ergonomics, etc) so I don't think we've added any overhead here in terms of JVM startup, we've just consolidated some of that logic. I'd be surprised if there was any measurable change here.

Contributor

Yeah, I was thinking about it. If most of our cli tools are generally short running we should use it by default.

Contributor

Good point, I'll take it upon myself to get some data on CPU/memory while running without the optimizing compiler and maybe if it's beneficial to a great extent, I'll follow up with a PR.
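For readers unfamiliar with the flag under discussion: `-XX:TieredStopAtLevel=1` is a HotSpot option that limits tiered compilation to the C1 compiler. The sketch below shows one way such a flag could be added to a short-running CLI's JVM options behind a switch, so it can be measured before being made the default. The class name, method name, and heap values are placeholders chosen for illustration, not the project's settings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not taken from this PR.
final class CliJvmOptionsSketch {

    /** Builds JVM options for a short-running CLI process. Heap sizes are illustrative placeholders. */
    static List<String> cliJvmOptions(boolean stopAtC1) {
        List<String> options = new ArrayList<>();
        options.add("-Xms4m");
        options.add("-Xmx64m");
        if (stopAtC1) {
            // Skip the optimizing (C2) compiler for short-lived processes.
            options.add("-XX:TieredStopAtLevel=1");
        }
        return options;
    }
}
```

Keeping the flag behind a boolean makes it straightforward to compare CPU and memory with and without it, as proposed above.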

}

sendShutdownMarker();
errorPump.drain();
Contributor

We probably don't need this; it seems that waitFor drains stderr anyway.

assertMutuallyExclusiveOptions("--version", "-p", "/tmp/pid");
assertMutuallyExclusiveOptions("--version", "--pidfile", "/tmp/pid");
assertMutuallyExclusiveOptions("--version", "-q");
assertMutuallyExclusiveOptions("--version", "--quiet");
Contributor

I think we can also test with --enrollment-token. Right now -V is incompatible with --enrollment-token.

try {
msg = stdin.read();
} catch (IOException e) {}
if (msg == BootstrapInfo.SERVER_SHUTDOWN_MARKER) {
Contributor

Maybe we should also put this in finally?

*
* http://commons.apache.org/proper/commons-daemon/procrun.html
*
* NOTE: If this method is renamed and/or moved, make sure to
Contributor

This comment doesn't seem right. The configuration of --StopMethod used to be in the batch file but now it lives in WindowsServiceInstallCommand. I think we should point folks there.

// the thread pumping stderr watching for state change messages
private final ErrorPumpThread errorPump;

// a flag marking whether the java process has been detached from
Contributor

This comment reads like it's been truncated.

Member Author

Just poor grammar on my part. I've reworded.

Member Author

rjernst commented May 18, 2022

@elasticmachine run elasticsearch-ci/packaging-tests-windows

@rjernst rjernst added the test-windows Trigger CI checks on Windows label May 18, 2022
Member Author

rjernst commented May 19, 2022

@elasticmachine run elasticsearch-ci/part-1

command.add("-cp");
// The '*' isn't allowed by the windows filesystem, so we need to force it into the classpath after converting to a string.
// Thankfully this will all go away when switching to modules, which take the directory instead of a glob.
command.add(esHome.resolve("lib") + (isWindows ? "\\" : "/") + "*");
Contributor

Trivially, this could be simplified to use java.io.File.separator

Member Author

I don't think we can! We specifically use the local isWindows here so that we can check both the Unix and Windows behavior in tests, regardless of which platform we are actually on. See ServerProcessTests.testCommandLine
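The trade-off in this exchange can be sketched as follows. Passing the platform in as a parameter, rather than reading `java.io.File.separator`, lets a test exercise both separator choices on any host. The class and method names here are hypothetical; only the expression inside the method mirrors the snippet under review.

```java
import java.nio.file.Path;

// Hypothetical sketch of the pattern discussed above.
final class ClasspathSketch {

    /**
     * Builds the "lib/*" classpath glob. The separator comes from the
     * isWindows parameter, not from the host platform, so both branches
     * can be checked in tests regardless of where they run.
     */
    static String libGlob(Path esHome, boolean isWindows) {
        return esHome.resolve("lib") + (isWindows ? "\\" : "/") + "*";
    }
}
```

With `File.separator`, the same test would only ever cover the separator of the machine it runs on.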

Contributor

@ChrisHegarty ChrisHegarty left a comment

LVGTM

Member Author

rjernst commented May 19, 2022

@grcevski I think I addressed all your comments.

Contributor

@grcevski grcevski left a comment

LGTM!

@rjernst rjernst merged commit b9c504b into elastic:master May 19, 2022
@rjernst rjernst deleted the launcher branch May 19, 2022 15:29
elasticsearchmachine pushed a commit that referenced this pull request May 20, 2022
Password must be at least 112 bits in FIPS mode. This PR fixes the
password length in the new ServerCliTests so it passes in FIPS mode.

Relates: #85758 

PS: The test
[failed](https://gradle-enterprise.elastic.co/s/mrlw6o27onxee/tests/:distribution:tools:server-cli:test/org.elasticsearch.server.cli.ServerCliTests/testKeystorePassword)
on my PR CI.
rjernst added a commit to rjernst/elasticsearch that referenced this pull request May 25, 2022
A debugging statement was left being printed when ES exits with a
non-zero status. This commit removes the debug statement.

relates elastic#85758
rjernst added a commit that referenced this pull request May 25, 2022
A debugging statement was left being printed when ES exits with a
non-zero status. This commit removes the debug statement.

relates #85758
elasticsearchmachine pushed a commit that referenced this pull request May 25, 2022
A debugging statement was left being printed when ES exits with a
non-zero status. This commit removes the debug statement.

relates #85758
salvatore-campagna pushed a commit to salvatore-campagna/elasticsearch that referenced this pull request May 26, 2022
A debugging statement was left being printed when ES exits with a
non-zero status. This commit removes the debug statement.

relates elastic#85758
Labels
:Core/Infra/Core Core issues without another label :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts >refactoring Team:Core/Infra Meta label for core/infra team Team:Delivery Meta label for Delivery team test-windows Trigger CI checks on Windows v8.3.0
7 participants