-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support merging output from multiple file keys when writing to stdout #115
Conversation
|
||
contents = make_dependency_file( | ||
file_type=output.pop(), | ||
conda_env_name="rapids-dfg-combined", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name:
field in these generated conda environments is ignored in every workflow I'm aware of in RAPIDS... it's always overwritten by -n
/ --name
passed through like this:
rapids-dependency-file-generator \
--output conda \
--file-key test_cpp \
--matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch)" | tee "${ENV_YAML_DIR}/env.yaml"
mamba env create --yes -f "${ENV_YAML_DIR}/env.yaml" -n test
That's why I'm proposing a hard-coded mildly-informative name instead of taking on the complexity of other alternatives, like:
- allowing
conda_env_name=None
to be passed tomake_dependency_file()
- trying to come up with some other representative environment name based on which
--file-key
s were passed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’d weakly prefer allowing None here. I don’t think conda environment files require a name? Maybe we can just omit it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I just pushed df3848d changing this to allow None
.
I don't think conda environment files require a name?
Correct.
cat > env.yaml <<EOF
channels:
- conda-forge
dependencies:
- pandas
- pip
- pip:
- scikit-learn
EOF
conda env create \
--name delete-me \
--file ./env.yaml
Works without issue
...
done
#
# To activate this environment, use
#
# $ conda activate delete-me
#
# To deactivate an active environment, use
#
# $ conda deactivat
contents = make_dependency_file( | ||
file_type=output.pop(), | ||
conda_env_name="rapids-dfg-combined", | ||
file_name="ignored-because-multiple-pyproject-files-are-not-supported", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make_dependency_file()
is already really 3 distinct codepaths inside one function. The codepath for generating pyproject
output rightly expects to be given a path to an existing pyproject.toml
file to read in.
I'm proposing just not allowing multiple --file-key
together with --output pyproject
and passing through this hard-coded nonsense string instead of taking on the complexity of other options like:
- breaking
make_dependency_file()
up into e.g.make_conda_env_file()
,make_requirements_file()
, andmake_pyproject_file()
- allowing
file_name=None
to be passed tomake_dependency_file()
and generating TOML content from scratch if it is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to read this 3 times but I think I understand all of the information in this comment? I don't have any feedback, I think this is maybe-fine-enough and if not we can fix it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me try a different way with links.
After processing dependencies.yaml
, filtering stuff by output type and matrices, deduplicating, etc., DFG wants the literal text that it'll either echo to stdout
or write to a file.
For that, it calls make_dependency_file()
:
dependency-file-generator/src/rapids_dependency_file_generator/_rapids_dependency_file_generator.py
Line 439 in 6e10811
contents = make_dependency_file( |
Despite the name, that function actually just creates and returns that text... it never writes anything to the filesystem.
The body of the function, in pseudocode, is like:
def make_dependency_file(...):
relative_path_to_config_file = os.path.relpath(config_file, output_dir)
file_contents = textwrap.dedent(
f"""\
{HEADER}
# To make changes, edit {relative_path_to_config_file} and run `{cli_name}`.
"""
)
if conda:
# {code specific to conda env YAML files}
elif requirements:
# {code specific to requirements.txt}
elif pyproject:
# {code specific to pyproject.toml}
return file_contents
So there's very little shared code in there. The list of arguments for the function contains a mix of things that are only used for some but not all of the output types.
Like this:
dependency-file-generator/src/rapids_dependency_file_generator/_rapids_dependency_file_generator.py
Lines 117 to 119 in 6e10811
conda_channels : list[str] | |
The channels to include in the file. Only used when `file_type` is | |
CONDA. |
That's complex but has been kind of "fine" as long as this was always being called with the data from a single --file-key
.
For this PR, we now want to use that function to generate a single type of output from multiple --file-key
entries. For that, there's no single representative output_dir
, for example. Here, I've just kind of awkwardly provided values with the right types so mypy
will be happy, with the knowledge that they don't really matter (since most of the arguments are pyproject
-specific, and this PR isn't supporting pyproject.toml).
It'd be clearer if make_dependency_file()
was decomposed into make_conda_env_file()
, make_requirements_file()
and make_pyproject_file()
. But I really really did not want to take that on in the scope of this PR.
@@ -130,3 +130,47 @@ def test_validate_args(): | |||
"all", | |||
] | |||
) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are wondering "why isn't there a test case here checking that passing multiple --file-key
together with --output pyproject
is rejected?"... it's because I put the exception-raising code for that down further, in make_dependency_file()
, so that the exception would also be raised if rapids-build-backend
(which does not use the CLI) tried to pass inputs like that.
I did add a unit test in test_rapids_dependency_file_generator
checking that that exception is raised as expected.
src/rapids_dependency_file_generator/_rapids_dependency_file_generator.py
Outdated
Show resolved
Hide resolved
|
||
contents = make_dependency_file( | ||
file_type=output.pop(), | ||
conda_env_name="rapids-dfg-combined", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’d weakly prefer allowing None here. I don’t think conda environment files require a name? Maybe we can just omit it.
contents = make_dependency_file( | ||
file_type=output.pop(), | ||
conda_env_name="rapids-dfg-combined", | ||
file_name="ignored-because-multiple-pyproject-files-are-not-supported", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to read this 3 times but I think I understand all of the information in this comment? I don't have any feedback, I think this is maybe-fine-enough and if not we can fix it later.
"scikit-learn>=1.5", | ||
{"pip": [ | ||
"folium", | ||
"numpy >=2.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super tiny nit. This specifier has a space between the package name and version pinnings, but the others do not.
More broadly, do we think DFG should be in the business of normalizing any of this kind of thing on output? Sometimes our combined RAPIDS environments have (for example) both numpy >=2.0
and numpy>=2.0
as separate dependencies because they're not identical strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the spaces in df3848d
More broadly, do we think DFG should be in the business of normalizing any of this kind of thing on output?
I'm indifferent about this. I do think it'd probably be safe to have DFG do dep.replace(" ", "")
unconditionally. And there are benefits... slightly smaller output, slightly less work for solvers to do.
But I definitely would prefer it not be part of the scope of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure not in scope here. Thanks for considering. We can think about it some other time if it becomes a significant pain point.
…e issue page link
src/rapids_dependency_file_generator/_rapids_dependency_file_generator.py
Outdated
Show resolved
Hide resolved
…enerator.py Co-authored-by: Bradley Dice <[email protected]>
# [1.16.0](v1.15.1...v1.16.0) (2024-10-22) ### Features * support merging output from multiple file keys when writing to stdout ([#115](#115)) ([6c2ae6d](6c2ae6d))
🎉 This PR is included in version 1.16.0 🎉 The release is available on: Your semantic-release bot 📦🚀 |
Closes #114
Adds support for passing
--file-key
multiple times to the same invocation ofrapids-dependency-file-generator
, forrequirements
andconda
output. This allows writing to stdout the union of multiple file keys.Like this
Notes for Reviewers
The motivating use case for this is generating
requirements.txt
files containing all the build dependencies for wheels (rapidsai/cuspatial#1473 (review)).Design choices?
See inline comments on the diff.
This does not support
pyproject
outputI cannot think of an application where we'd want to write the merged output of 2
pyproject.toml
files to stdout, and the implementation to support that would be a lot more difficult than supportingconda
andrequirements
, because withpyproject
support we're splicing into an existing file's content (instead of creating the entire file content from scratch).Here I'm proposing just raising an exception if someone attempts to pass multiple
--file-key
+--output pyproject
and write to stdout.How I tested this
Added new unit tests here.
Also tried some different invocations locally, in
cudf
andcuspatial
.Temporarily added code in
ci/build_libcuspatial_wheel.sh
over in rapidsai/cuspatial#1473 to installrapids-dependency-file-generator
from this branch, and saw it do the right thing there, generating a single list that merged the build backend (rapids-build-backend
+scikit-build-core
) and build-time dependencies (e.g.librmm
):(build link)