[BUG/Feature Request]: Reusing a step overwrites artifact names #2685

Open

jlopezpena opened this issue May 9, 2024 · 3 comments
@jlopezpena
Contributor

Contact Details [Optional]

No response

System Information

N/A

What happened?

(This issue was already discussed in this Slack thread; logging it here for easier tracking.)

Currently, the names used for saving artifacts in a step are determined by the type annotation in the function definition.
This works great when a step is only used once in a pipeline, but not so much when the same step needs to be called multiple times with different inputs, and the resulting artifacts need to later be used in a different pipeline. When that happens, the output artifacts get saved with the same name and a bumped version number, which makes it really hard to track the specific one needed later down the road.

Typical example of this: training some preprocessor that later needs to be used three different times for transforming train, validation, and test data, and I end up with three versions of an object called transformed_data and need to keep track of which is which.

Returning outputs from pipelines quickly gets out of control and is very hard to maintain when there are lots of artifacts that might potentially be reused later.
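To make the behavior concrete, here is a minimal sketch of the current situation (step and pipeline names are made up for illustration): the output name comes from the Annotated return type, so every call saves under the same name and only the version gets bumped.

from typing import Any, Dict
from typing_extensions import Annotated
from zenml import step, pipeline

@step
def transform(data: Any) -> Annotated[Dict[str, Any], "transformed_data"]:
    # The artifact name is fixed by the annotation above
    return {"processed_data": data}

@pipeline
def my_pipeline(train_data, val_data, test_data):
    # Each call saves "transformed_data" again, bumping the version to 1, 2, 3
    transform(train_data, id="transform_train")
    transform(val_data, id="transform_val")
    transform(test_data, id="transform_test")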

Suggested solution: similar to how a step name (usually the function name) can be overridden at run time by the id parameter when calling the step, introduce an optional parameter to step calls (something like output_names: Optional[Dict[str, str]], where the dict keys are the names defined in the function type annotations and the values are the desired saved names) that overrides the saved names of the outputs. If that parameter is not passed, things should behave as they currently do; when it is passed, the produced artifacts should be saved under the given names.
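With the step from the sketch above, usage would look something like this (hypothetical API, shown only to illustrate the suggestion; output_names does not currently exist):

@pipeline
def my_pipeline(train_data, val_data, test_data):
    # output_names is the proposed (not yet existing) parameter
    transform(train_data, output_names={"transformed_data": "train_transformed"})
    transform(val_data, output_names={"transformed_data": "val_transformed"})
    transform(test_data, output_names={"transformed_data": "test_transformed"})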

Reproduction steps

No response

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@jlopezpena jlopezpena added the bug Something isn't working label May 9, 2024
@avishniakov
Contributor

Hey @jlopezpena , thanks for reporting, and sorry about the delay in responding to this!

This is indeed not possible in the current setting. To work around it in our LLM finetuning project, I used this trick: https://github.com/zenml-io/zenml-projects/blob/main/llm-lora-finetuning/steps/evaluate_model.py#L112

Aside from the workaround, I'll create a ticket to look into what we can do to improve this UX piece. Stay tuned!

@htahir1
Contributor

htahir1 commented Aug 23, 2024

Hey @jlopezpena! This just caught my attention. Thanks for bringing up this issue about dynamically naming artifacts in ZenML. It's definitely a tricky situation when you're reusing steps with different inputs. Your idea for an output_names parameter is pretty solid.

While we wait for an official feature to be implemented, I've got a couple of workarounds that might help you out. These use the current ZenML capabilities to tackle the problem:

  1. Using a factory function to cook up steps with custom artifact names:

This approach lets you create dynamically named artifacts by whipping up step functions on the fly:

from typing import Any, Dict
from typing_extensions import Annotated
from zenml import step, pipeline, ArtifactConfig

def create_dynamic_step(prefix: str):
    @step
    def dynamic_step(data: Any) -> Annotated[Dict[str, Any], ArtifactConfig(name=f"{prefix}_artifact")]:
        # Do your data processing magic here
        result = {"processed_data": data}
        return result
    
    return dynamic_step

# Here's how you'd use it
train_step = create_dynamic_step("train")
val_step = create_dynamic_step("validation")
test_step = create_dynamic_step("test")

@pipeline
def dynamic_artifact_pipeline(train_data, val_data, test_data):
    train_step(train_data)
    val_step(val_data)
    test_step(test_data)

This way, you get unique artifact names for each step, making it a breeze to track and grab specific artifacts later.
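For example, grabbing one of them later could look like this (a sketch assuming ZenML's Client.get_artifact_version, which accepts a name or ID):

from zenml.client import Client

client = Client()
# Fetches the latest version of the artifact produced by the "train" step
train_artifact = client.get_artifact_version("train_artifact")
train_data = train_artifact.load()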

  2. Using metadata to slap a custom identifier on your artifacts:

If you'd rather stick with a single step and use metadata to tell your artifacts apart, try this:

from typing import Any, Dict
from typing_extensions import Annotated
from zenml import step, pipeline, get_step_context

@step
def generic_step(data: Any, prefix: str) -> Annotated[Dict[str, Any], "generic_artifact"]:
    result = {"processed_data": data}
    
    # Sprinkle in some custom metadata
    step_context = get_step_context()
    step_context.add_output_metadata(
        output_name="generic_artifact",
        metadata={"custom_prefix": prefix}
    )
    
    return result

@pipeline
def metadata_artifact_pipeline(train_data, val_data, test_data):
    generic_step(train_data, prefix="train")
    generic_step(val_data, prefix="validation")
    generic_step(test_data, prefix="test")

With this approach, we're using a single generic_step but adding custom metadata to each artifact. The prefix gets stored in the metadata, so you can use it later to identify and differentiate between artifacts:

from zenml.client import Client

client = Client()
artifacts = client.list_artifact_versions(name="generic_artifact")
for artifact in artifacts:
    prefix = artifact.run_metadata.get("custom_prefix")
    if prefix == "train":
        train_data = artifact.load()
    elif prefix == "validation":
        val_data = artifact.load()
    elif prefix == "test":
        test_data = artifact.load()

Both of these approaches let you custom-identify your artifacts without messing with ZenML's core functionality. The first option gives you more control over the artifact name itself, while the second keeps the artifact name the same but adds custom metadata for identification.

These solutions are temporary fixes that don't exactly match your output_names suggestion, but they might help manage the artifact naming issue for now. The factory function approach comes closer to allowing runtime name customization, while the metadata approach helps with identification without changing names.

Your idea for an output_names parameter is spot-on and would be a more elegant solution. It would be great to see this implemented in ZenML. In the meantime, have you considered opening a feature request for this specific functionality? That would allow the ZenML team to properly track and potentially implement this useful feature.

Give these a shot and let me know how they work for you! And hey, if you come up with any other cool workarounds, definitely share them with the community. We're all in this together!

@strickvl
Contributor

Tested these out and this code works for the two approaches. Posting it here in case someone else reads this thread and wants working code:

Approach 1

from typing import Any, Dict
from typing_extensions import Annotated
from zenml import step, pipeline, get_step_context, ArtifactConfig

def create_step(prefix: str):
    def _entrypoint(data: Any) -> Annotated[Dict[str, Any], ArtifactConfig(name=f"{prefix}_artifact")]:
        context = get_step_context()
        return {"processed_data": data, "step_name": context.step_name}

    step_name = f"dynamic_step_{prefix}"
    _entrypoint.__name__ = step_name
    s = step(_entrypoint)
    globals()[step_name] = s
    return s

# Create the dynamic steps
train_step = create_step(prefix="train")
validation_step = create_step(prefix="validation")
test_step = create_step(prefix="test")

# Resolve the steps
train_step.resolve()
validation_step.resolve()
test_step.resolve()

@pipeline
def dynamic_artifact_pipeline(train_data, val_data, test_data):
    train_result = train_step(train_data)
    validation_result = validation_step(val_data)
    test_result = test_step(test_data)


dynamic_artifact_pipeline(train_data=1, val_data=2, test_data=3)

One caveat applies to this first method: one of the following two things must
be true:

  • The factory must be in the same file as where the steps are defined, so
    that the logic with globals() works
  • The user must use the same variable name for the step as the __name__ of
    the entrypoint function

As you can see, this is not always possible or desirable, and you should
probably use the second method if you can.

Approach 2

from typing import Any, Dict
from typing_extensions import Annotated
from zenml import step, get_step_context, pipeline

@step
def generic_step(data: Any, prefix: str) -> Annotated[Dict[str, Any], "dataset"]:
    result = {"processed_data": data}

    # Add custom metadata
    step_context = get_step_context()
    step_context.add_output_metadata(
        output_name="dataset",
        metadata={"custom_prefix": prefix}
    )

    return result

@pipeline
def metadata_artifact_pipeline(train_data, val_data, test_data):
    generic_step(train_data, prefix="train")
    generic_step(val_data, prefix="validation")
    generic_step(test_data, prefix="test")

metadata_artifact_pipeline(train_data=1, val_data=2, test_data=3)
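One note on retrieval: this version annotates the output as "dataset" rather than "generic_artifact", so a lookup along the lines of the snippet from the previous comment (keeping the same run_metadata access pattern) would be:

from zenml.client import Client

client = Client()
artifacts = client.list_artifact_versions(name="dataset")
for artifact in artifacts:
    prefix = artifact.run_metadata.get("custom_prefix")
    if prefix == "train":
        train_data = artifact.load()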
