feat: support merging output from multiple file keys when writing to stdout #115

Merged
jameslamb merged 11 commits into rapidsai:main from the multiple-file-keys branch on Oct 22, 2024

Conversation

jameslamb (Member) commented Oct 17, 2024

Closes #114

Adds support for passing --file-key multiple times to the same invocation of rapids-dependency-file-generator, for requirements and conda output. This allows writing to stdout the union of multiple file keys.

Like this:

rapids-dependency-file-generator \
  --output requirements \
  --file-key "py_build_libcuspatial" \
  --file-key "py_rapids_build_libcuspatial" \
  --matrix "cuda=11.8;arch=x86_64;py=3.11;cuda_suffixed=true" \
| tee ./requirements-build.txt

Notes for Reviewers

The motivating use case for this is generating requirements.txt files containing all the build dependencies for wheels (rapidsai/cuspatial#1473 (review)).

Design choices?

See inline comments on the diff.

This does not support pyproject output

I cannot think of a use case where we'd want to write the merged output of two pyproject.toml files to stdout. Supporting that would also be a lot harder than supporting conda and requirements, because pyproject output splices into an existing file's contents instead of creating the entire file content from scratch.

Here I'm proposing just raising an exception if someone attempts to pass multiple --file-key values together with --output pyproject when writing to stdout.
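
For example, an invocation like the following (reusing the file keys from the example above) would be rejected with an exception instead of producing merged output; the exact error text is not reproduced here:

rapids-dependency-file-generator \
  --output pyproject \
  --file-key "py_build_libcuspatial" \
  --file-key "py_rapids_build_libcuspatial" \
  --matrix "cuda=11.8;arch=x86_64;py=3.11;cuda_suffixed=true"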

How I tested this

Added new unit tests here.

Also tried some different invocations locally, in cudf and cuspatial.

Temporarily added code in ci/build_libcuspatial_wheel.sh over in rapidsai/cuspatial#1473 to install rapids-dependency-file-generator from this branch, and saw it do the right thing there, generating a single list that merged the build backend (rapids-build-backend + scikit-build-core) and build-time dependencies (e.g. librmm):

# This file is generated by `rapids-dependency-file-generator`.
# To make changes, edit . and run `rapids-dependency-file-generator`.
--extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple
--extra-index-url=https://pypi.nvidia.com/
cmake>=3.26.4,!=3.30.0
libcudf-cu11==24.12.*,>=0.0.0a0
librmm-cu11==24.12.*,>=0.0.0a0
ninja
rapids-build-backend>=0.3.0,<0.4.0.dev0
scikit-build-core[pyproject]>=0.10.0
wheel

(build link)

@jameslamb added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels on Oct 17, 2024

contents = make_dependency_file(
    file_type=output.pop(),
    conda_env_name="rapids-dfg-combined",
jameslamb (Member Author):

The name: field in these generated conda environments is ignored in every workflow I'm aware of in RAPIDS... it's always overwritten by -n / --name passed through like this:

rapids-dependency-file-generator \
  --output conda \
  --file-key test_cpp \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch)" | tee "${ENV_YAML_DIR}/env.yaml"

mamba env create --yes -f "${ENV_YAML_DIR}/env.yaml" -n test

That's why I'm proposing a hard-coded mildly-informative name instead of taking on the complexity of other alternatives, like:

  • allowing conda_env_name=None to be passed to make_dependency_file()
  • trying to come up with some other representative environment name based on which --file-keys were passed

Contributor:

I’d weakly prefer allowing None here. I don’t think conda environment files require a name? Maybe we can just omit it.

jameslamb (Member Author):

Ok, I just pushed df3848d changing this to allow None.

I don't think conda environment files require a name?

Correct.

cat > env.yaml <<EOF
channels:
- conda-forge
dependencies:
  - pandas
  - pip
  - pip:
    - scikit-learn
EOF

conda env create \
  --name delete-me \
  --file ./env.yaml

Works without issue

...
done
#
# To activate this environment, use
#
#     $ conda activate delete-me
#
# To deactivate an active environment, use
#
#     $ conda deactivate

contents = make_dependency_file(
    file_type=output.pop(),
    conda_env_name="rapids-dfg-combined",
    file_name="ignored-because-multiple-pyproject-files-are-not-supported",
jameslamb (Member Author):

make_dependency_file() is already really 3 distinct codepaths inside one function. The codepath for generating pyproject output rightly expects to be given a path to an existing pyproject.toml file to read in.

I'm proposing just not allowing multiple --file-key together with --output pyproject and passing through this hard-coded nonsense string instead of taking on the complexity of other options like:

  • breaking make_dependency_file() up into e.g. make_conda_env_file(), make_requirements_file(), and make_pyproject_file()
  • allowing file_name=None to be passed to make_dependency_file() and generating TOML content from scratch if it is
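
Roughly, the check proposed above has this shape (a sketch only; the names and exception type are illustrative, not the exact code in this PR):

# Sketch: merged stdout output is only supported for conda and requirements,
# so reject pyproject up front when more than one file key is requested.
if output_type == "pyproject" and len(file_keys) > 1:
    raise ValueError(
        "'--output pyproject' is not supported with multiple '--file-key' values"
    )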

Contributor:

I had to read this 3 times but I think I understand all of the information in this comment? I don't have any feedback, I think this is maybe-fine-enough and if not we can fix it later.

jameslamb (Member Author) commented Oct 18, 2024:

Let me try a different way with links.

After processing dependencies.yaml, filtering stuff by output type and matrices, deduplicating, etc., DFG wants the literal text that it'll either echo to stdout or write to a file.

For that, it calls make_dependency_file():

Despite the name, that function actually just creates and returns that text... it never writes anything to the filesystem.

The body of the function, in pseudocode, is like:

def make_dependency_file(...):
    relative_path_to_config_file = os.path.relpath(config_file, output_dir)
    file_contents = textwrap.dedent(
        f"""\
        {HEADER}
        # To make changes, edit {relative_path_to_config_file} and run `{cli_name}`.
        """
    )
    if conda:
        # {code specific to conda env YAML files}
    elif requirements:
        # {code specific to requirements.txt}
    elif pyproject:
        # {code specific to pyproject.toml}

    return file_contents

So there's very little shared code in there. The list of arguments for the function contains a mix of things that are only used for some but not all of the output types.

Like this:

conda_channels : list[str]
    The channels to include in the file. Only used when `file_type` is
    CONDA.

That's complex but has been kind of "fine" as long as this was always being called with the data from a single --file-key.

For this PR, we now want to use that function to generate a single type of output from multiple --file-key entries. For that, there's no single representative output_dir, for example. Here, I've just kind of awkwardly provided values with the right types so mypy will be happy, with the knowledge that they don't really matter (since most of the arguments are pyproject-specific, and this PR isn't supporting pyproject.toml).

It'd be clearer if make_dependency_file() was decomposed into make_conda_env_file(), make_requirements_file() and make_pyproject_file(). But I really really did not want to take that on in the scope of this PR.
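
For reference, a rough sketch of what that decomposition might look like; none of this is implemented in this PR, and the names and parameters are assumptions rather than the project's actual API:

def make_conda_env_file(*, conda_env_name, conda_channels, dependencies, relative_config_path):
    """Build the full conda environment.yaml text from scratch (omit 'name:' when conda_env_name is None)."""


def make_requirements_file(*, dependencies, relative_config_path):
    """Build the full requirements.txt text from scratch."""


def make_pyproject_file(*, pyproject_path, dependencies):
    """Read the existing pyproject.toml at pyproject_path and splice the dependencies into it."""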

@@ -130,3 +130,47 @@ def test_validate_args():
"all",
]
)

jameslamb (Member Author):

If you are wondering "why isn't there a test case here checking that passing multiple --file-key together with --output pyproject is rejected?"... it's because I put the exception-raising code for that down further, in make_dependency_file(), so that the exception would also be raised if rapids-build-backend (which does not use the CLI) tried to pass inputs like that.

I did add a unit test in test_rapids_dependency_file_generator checking that that exception is raised as expected.

@jameslamb changed the title from "WIP: feat: support merging output from multiple file keys when writing to stdout" to "feat: support merging output from multiple file keys when writing to stdout" on Oct 17, 2024
@jameslamb marked this pull request as ready for review on October 17, 2024 19:32
@jameslamb requested review from vyasr and bdice on October 17, 2024 19:33

"scikit-learn>=1.5",
{"pip": [
"folium",
"numpy >=2.0",
Contributor:

Super tiny nit. This specifier has a space between the package name and version pinnings, but the others do not.

More broadly, do we think DFG should be in the business of normalizing any of this kind of thing on output? Sometimes our combined RAPIDS environments have (for example) both numpy >=2.0 and numpy>=2.0 as separate dependencies because they're not identical strings.

jameslamb (Member Author):

I removed the spaces in df3848d

More broadly, do we think DFG should be in the business of normalizing any of this kind of thing on output?

I'm indifferent about this. I do think it'd probably be safe to have DFG do dep.replace(" ", "") unconditionally. And there are benefits... slightly smaller output, slightly less work for solvers to do.

But I definitely would prefer it not be part of the scope of this PR.
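
If we ever do take it on, the change could be as small as this (a sketch only, not code in this PR):

# Sketch: strip spaces from specifier strings so that, for example,
# "numpy >=2.0" and "numpy>=2.0" deduplicate to the same entry.
def normalize(dep: str) -> str:
    return dep.replace(" ", "")

deps = ["numpy >=2.0", "numpy>=2.0", "scikit-learn>=1.5"]
print(sorted({normalize(d) for d in deps}))
# ['numpy>=2.0', 'scikit-learn>=1.5']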

Contributor:

For sure not in scope here. Thanks for considering. We can think about it some other time if it becomes a significant pain point.

@jameslamb requested a review from bdice on October 18, 2024 18:22
@jameslamb merged commit 6c2ae6d into rapidsai:main on Oct 22, 2024
4 checks passed
@jameslamb deleted the multiple-file-keys branch on October 22, 2024 13:39
GPUtester pushed a commit that referenced this pull request Oct 22, 2024
# [1.16.0](v1.15.1...v1.16.0) (2024-10-22)

### Features

* support merging output from multiple file keys when writing to stdout ([#115](#115)) ([6c2ae6d](6c2ae6d))
GPUtester commented:

🎉 This PR is included in version 1.16.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Labels: improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change), released
Closes: Support generating multiple file keys at once (#114)
3 participants