[BUG/Feature Request]: Reusing a step overwrites artifact names #2685
Comments
Hey @jlopezpena, thanks for reporting, and sorry about the delay in responding to this! This is indeed not possible in the current setting. To overcome it in our LLM finetuning project, I used this trick: https://github.com/zenml-io/zenml-projects/blob/main/llm-lora-finetuning/steps/evaluate_model.py#L112. Aside from the workaround, I'll create a ticket to look into what we can do to improve this UX piece. Stay tuned!
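For readers who don't want to follow the link, the trick boils down to saving the artifact manually from inside the step so its name can be chosen at runtime. Here is a minimal sketch of that pattern, assuming ZenML's `save_artifact` utility; the step and names below are illustrative, not copied from the linked file:

```python
from typing import Any, Dict

from zenml import save_artifact, step


@step
def evaluate_model(data: Any, split_name: str) -> None:
    # ... run the actual evaluation here ...
    results: Dict[str, float] = {"accuracy": 0.0}
    # Save under a runtime-chosen name instead of relying on the
    # static name baked into the return annotation.
    save_artifact(results, name=f"{split_name}_eval_results")
```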
Hey @jlopezpena! This just caught my attention. Thanks for bringing up this issue about dynamically naming artifacts in ZenML. It's definitely a tricky situation when you're reusing steps with different inputs, and your idea for an `output_names` parameter makes sense. While we wait for an official feature to be implemented, I've got a couple of workarounds that might help you out. These use the current ZenML capabilities to tackle the problem:
This approach lets you create dynamically named artifacts by whipping up step functions on the fly:

```python
from typing import Any, Dict

from typing_extensions import Annotated

from zenml import ArtifactConfig, pipeline, step


def create_dynamic_step(prefix: str):
    @step
    def dynamic_step(
        data: Any,
    ) -> Annotated[Dict[str, Any], ArtifactConfig(name=f"{prefix}_artifact")]:
        # Do your data processing magic here
        result = {"processed_data": data}
        return result

    return dynamic_step


# Here's how you'd use it
train_step = create_dynamic_step("train")
val_step = create_dynamic_step("validation")
test_step = create_dynamic_step("test")


@pipeline
def dynamic_artifact_pipeline(train_data, val_data, test_data):
    train_step(train_data)
    val_step(val_data)
    test_step(test_data)
```

This way, you get unique artifact names for each step, making it a breeze to track and grab specific artifacts later.
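As a quick illustration of that payoff (a sketch, not part of the original comment; the names match the factory example above), a later pipeline or notebook can fetch each artifact directly by its unique name:

```python
from zenml.client import Client

client = Client()

# With unique names there is no version juggling: just fetch the
# latest version of the artifact for the split you need.
train_data = client.get_artifact_version("train_artifact").load()
val_data = client.get_artifact_version("validation_artifact").load()
```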
If you'd rather stick with a single step and use metadata to tell your artifacts apart, try this:

```python
from typing import Any, Dict

from typing_extensions import Annotated

from zenml import get_step_context, pipeline, step


@step
def generic_step(data: Any, prefix: str) -> Annotated[Dict[str, Any], "generic_artifact"]:
    result = {"processed_data": data}
    # Sprinkle in some custom metadata
    step_context = get_step_context()
    step_context.add_output_metadata(
        output_name="generic_artifact",
        metadata={"custom_prefix": prefix},
    )
    return result


@pipeline
def metadata_artifact_pipeline(train_data, val_data, test_data):
    generic_step(train_data, prefix="train")
    generic_step(val_data, prefix="validation")
    generic_step(test_data, prefix="test")
```

With this approach, we're using a single `generic_step` whose output always keeps the name `generic_artifact`, but each artifact version carries a `custom_prefix` metadata entry that you can use to tell them apart. Retrieval then looks like this:

```python
from zenml.client import Client

client = Client()
artifacts = client.list_artifact_versions("generic_artifact")
for artifact in artifacts:
    prefix = artifact.run_metadata.get("custom_prefix")
    if prefix == "train":
        train_data = artifact.load()
    elif prefix == "validation":
        val_data = artifact.load()
    elif prefix == "test":
        test_data = artifact.load()
```

Both of these approaches let you custom-identify your artifacts without messing with ZenML's core functionality. The first option gives you more control over the artifact name itself, while the second keeps the artifact name the same but adds custom metadata for identification. These solutions are temporary fixes that don't exactly match your `output_names` suggestion, but they might help manage the artifact naming issue for now. The factory function approach comes closer to allowing runtime name customization, while the metadata approach helps with identification without changing names. Your idea for an `output_names` parameter is still worth pursuing as a proper feature, though.

Give these a shot and let me know how they work for you! And hey, if you come up with any other cool workarounds, definitely share them with the community. We're all in this together!
Tested these out, and this code works for the two approaches. Posting it here in case someone else reads this thread and wants working code:

Approach 1

```python
from typing import Any, Dict

from typing_extensions import Annotated

from zenml import ArtifactConfig, get_step_context, pipeline, step


def create_step(prefix: str):
    def _entrypoint(
        data: Any,
    ) -> Annotated[Dict[str, Any], ArtifactConfig(name=f"{prefix}_artifact")]:
        context = get_step_context()
        return {"processed_data": data, "step_name": context.step_name}

    step_name = f"dynamic_step_{prefix}"
    _entrypoint.__name__ = step_name
    s = step(_entrypoint)
    # Make the generated step reachable at module level so ZenML
    # can resolve its source
    globals()[step_name] = s
    return s


# Create the dynamic steps
train_step = create_step(prefix="train")
validation_step = create_step(prefix="validation")
test_step = create_step(prefix="test")

# Resolve the steps
train_step.resolve()
validation_step.resolve()
test_step.resolve()


@pipeline
def dynamic_artifact_pipeline(train_data, val_data, test_data):
    train_result = train_step(train_data)
    validation_result = validation_step(val_data)
    test_result = test_step(test_data)


dynamic_artifact_pipeline(train_data=1, val_data=2, test_data=3)
```

One caveat applies to this first method: the dynamically created steps must be resolvable by ZenML, which is why the factory gives each entrypoint a unique `__name__` and registers it in the module's `globals()` before the pipeline is built. As you can see, this is not always possible or desirable, and you should probably use Approach 2.
Contact Details [Optional]
No response
System Information
N/A
What happened?
(Issue already discussed in this Slack thread; logging here for easier tracking)
Currently, the names used for saving artifacts in a step are determined by the type annotation in the function definition. This works great when a step is only used once in a pipeline, but not so much when the same step needs to be called multiple times with different inputs and the resulting artifacts need to later be used in a different pipeline. When that happens, the output artifacts get saved with the same name and a bumped version number, which makes it really hard to track the specific one needed later down the road.

Typical example of this: training some preprocessor that later needs to be used three different times for transforming train, validation, and test data; I end up with three versions of an object called `transformed_data` and need to keep track of which is which. Returning outputs from pipelines quickly gets out of control and is very hard to maintain when there are lots of artifacts that might potentially be reused later.
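To make the failure mode concrete, here is a minimal sketch of the situation described above (the step and pipeline names are illustrative, not from the original report):

```python
from typing import Any

from typing_extensions import Annotated

from zenml import pipeline, step


@step
def transform(data: Any) -> Annotated[Any, "transformed_data"]:
    # Placeholder for applying the fitted preprocessor
    return data


@pipeline
def split_pipeline(train, val, test):
    # All three calls save their output under the same name, so the
    # artifact store ends up with versions 1-3 of "transformed_data"
    # and no direct record of which split produced which version.
    transform(train)
    transform(val)
    transform(test)
```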
Suggested solution: similar to how a step name (usually the function name) can be overridden at run time by the `id` parameter when calling the step, introduce an optional parameter to step calls (something like `output_names: Optional[Dict[str, str]]`, where the dict must contain the names defined in the function type annotations as keys and the desired saved names as values) overriding the saved names of the outputs. If that parameter is not passed, things should behave as they currently do, but when passed, the produced artifacts should be saved with the passed names.
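To visualize the proposal, here is what the call site could look like, reusing the `transform` step from the sketch above. Note that `output_names` is the suggested parameter and does not exist in ZenML today; `id` is the existing invocation-renaming parameter mentioned above:

```python
@pipeline
def split_pipeline(train, val, test):
    # Hypothetical API: output_names maps the annotation name to the
    # desired saved artifact name for this particular invocation.
    transform(train, id="transform_train",
              output_names={"transformed_data": "train_transformed"})
    transform(val, id="transform_val",
              output_names={"transformed_data": "val_transformed"})
    transform(test, id="transform_test",
              output_names={"transformed_data": "test_transformed"})
```

Reproduction steps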
No response
Relevant log output
No response
Code of Conduct