Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Breaking] Remove ResumableFileSlot and rely on high ulimits #1582

Merged
merged 1 commit into from
Feb 12, 2025

Conversation

allada
Copy link
Member

@allada allada commented Feb 9, 2025

ResumableFileSlot became to difficult to manage, instead of managing resources this way, we use much higher
DEFAULT_OPEN_FILE_PERMITS, set ulimit (when unix) and warn user if the limits are likely too low.

Breaking: Removed idle_file_descriptor_timeout_millis from config.

closes: #1288, #513, #1298, #527


This change is Reviewable

@allada allada force-pushed the remove-local-fs-limit branch from b74b3fc to 43044cc Compare February 9, 2025 04:43
@allada allada force-pushed the remove-local-fs-limit branch from 43044cc to d7a3655 Compare February 9, 2025 04:52
@allada allada force-pushed the remove-local-fs-limit branch from d7a3655 to d81c645 Compare February 9, 2025 04:55
@allada allada force-pushed the remove-local-fs-limit branch from d81c645 to f2a2385 Compare February 9, 2025 04:56
@allada allada force-pushed the remove-local-fs-limit branch from f2a2385 to c8c51ba Compare February 9, 2025 05:00
@allada allada force-pushed the remove-local-fs-limit branch from c8c51ba to a9302dc Compare February 9, 2025 05:08
Copy link
Member Author

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and 0 of 21 files reviewed, and pending CI: Web Platform Deployment / ubuntu-24.04, pre-commit-checks, and 1 discussions need to be resolved (waiting on @aaronmondal)


a discussion (no related file):
+@aaronmondal


nativelink-worker/src/running_actions_manager.rs line 995 at r1 (raw file):

                            "Could not kill process",
                        );
                    } else {

fyi: This branch is the success case and the error was wrong, and it shouldn't be an ERROR.


nativelink-store/tests/filesystem_store_test.rs line 1313 at r1 (raw file):

#[serial]
#[nativelink_test]
async fn update_with_whole_file_slow_path_when_low_file_descriptors() -> Result<(), Error> {

fyi: Removed this as it is an outlier and we mostly rely on ulimit now.


nativelink-worker/tests/running_actions_manager_test.rs line 458 at r1 (raw file):

            arguments: vec!["touch".to_string(), "./some/path/test.txt".to_string()],
            output_files: vec!["some/path/test.txt".to_string()],
            environment_variables: vec![EnvironmentVariable {

fyi: sneaking these in on this PR.


nativelink-worker/tests/running_actions_manager_test.rs line 1478 at r1 (raw file):

        assert_eq!(poll!(&mut run_action_fut), Poll::Pending);
        tokio::task::yield_now().await;
        match fs::metadata(&process_started_file).await {

fyi: This should fix this flaky test by waiting for the process to start. The test was fine before, but gave exit code 127 or 9 depending on the state of the process during the kill event.


nativelink-util/src/buf_channel.rs line 375 at r1 (raw file):

                .recv()
                .await
                .err_tip(|| "During next read of buf_channel::take()")?;

fyi: Fixed typo.

Copy link
Member

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 17 of 21 files at r1, 1 of 1 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and all files reviewed, and 2 discussions need to be resolved


nativelink-worker/tests/running_actions_manager_test.rs line 3207 at r3 (raw file):

#[serial]
#[nativelink_test]
#[cfg(target_family = "unix")]

nit: Let's leave a todo to handle the windows case or document why we're not testing windows here.

ResumableFileSlot became to difficult to manage, instead of
managing resources this way, we use much higher
DEFAULT_OPEN_FILE_PERMITS, set ulimit (when unix) and warn
user if the limits are likely too low.

Breaking: Removed idle_file_descriptor_timeout_millis from config.

closes: TraceMachina#1288, TraceMachina#513, TraceMachina#1298, TraceMachina#527
@allada allada force-pushed the remove-local-fs-limit branch from 3e2df54 to 56f427d Compare February 11, 2025 22:16
Copy link
Member Author

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and 20 of 22 files reviewed, and pending CI: Web Platform Deployment / ubuntu-24.04, pre-commit-checks, and 1 discussions need to be resolved


nativelink-worker/tests/running_actions_manager_test.rs line 3207 at r3 (raw file):

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

nit: Let's leave a todo to handle the windows case or document why we're not testing windows here.

Done.

Copy link
Member

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r4, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and all files reviewed, and pending CI: Bazel Dev / macos-13, Bazel Dev / macos-14, Bazel Dev / ubuntu-24.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Coverage, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Local / lre-cc / ubuntu-24.04, Local / lre-rs / macos-14, Local / lre-rs / ubuntu-24.04, NativeLink.com Cloud / Remote Cache / macos-14, NativeLink.com Cloud / Remote Cache / ubuntu-24.04, Publish image, Publish nativelink-worker-init, Remote / lre-cc / large-ubuntu-22.04, Remote / lre-rs / large-ubuntu-22.04, Web Platform Deployment / ubuntu-24.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable

@allada allada merged commit 8b89c31 into TraceMachina:main Feb 12, 2025
37 checks passed
Copy link
Member

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 1 LGTMs obtained, and all files reviewed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Apparently creating too many threads for tracing-subscriber
2 participants