[Docs] document clarifying notes about the data lifecycle #4683

Closed
2 tasks done
Tracked by #5783
samhita-alla opened this issue Jan 8, 2024 · 9 comments · Fixed by #5922
Labels
documentation Improvements or additions to documentation good first issue Good for newcomers hacktoberfest pri2

Comments

@samhita-alla
Contributor

Description

document to update: https://docs.flyte.org/en/latest/concepts/data_management.html

what to add: https://discuss.flyte.org/t/15995893/hi-community-i-am-trying-to-understand-the-flyte-data-lifecy#b374eea0-0c32-44fd-875a-5b79d4a8180b

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@samhita-alla samhita-alla added documentation Improvements or additions to documentation pri2 labels Jan 8, 2024

dosubot bot commented Jan 8, 2024

🤖

Hello @samhita-alla! I'm Dosu, a friendly bot here to assist you while we wait for a human maintainer. I can help with bugs, answer questions, and even guide you to become a contributor. Just know I'm here to make things easier for you. Let's tackle this together!

Sure, I can provide some clarifying notes about the data lifecycle in Flyte.

Flyte handles data in two parts: Metadata and Raw data.

Metadata consists of data about inputs to a task, and other artifacts. It is configured globally for FlytePropeller, FlyteAdmin, and the other Flyte components, and the running pods/jobs need access to the metadata bucket to get the data.

Raw data is the actual data (such as the Pandas DataFrame, Spark DataFrame, etc.). Raw data paths are unique for every execution, and the prefixes can be modified per execution. None of the Flyte control plane components would access the raw data. This provides great separation of data between the control plane and the data plane.

For every task that receives input, Flyte sends an Inputs Metadata object that contains all primitive or simple scalar values inline. Complex, large objects are offloaded, and the Metadata stores only a reference to each of them.

Flytekit TypeTransformers make it possible to use complex objects as if they were available locally - just like persistent filehandles. But the Flyte backend only deals with the references. Thus, primitive data types and references to large objects fall under Metadata - Meta input or Meta output, and the actual large object is known as Raw data.
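The split between inlined values and offloaded references described above can be sketched in a few lines of Python. This is only an illustration: `Reference`, `build_inputs_metadata`, the in-memory `raw_store`, and the `s3://my-raw-bucket/exec-123` prefix are all hypothetical names, not part of flytekit or the Flyte backend.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Hypothetical sketch: none of these names come from flytekit or Flyte itself.
PRIMITIVES = (bool, int, float, str)


@dataclass(frozen=True)
class Reference:
    """Pointer to an object that was offloaded to the raw data store."""
    uri: str


def build_inputs_metadata(
    inputs: Dict[str, Any], raw_prefix: str, raw_store: Dict[str, Any]
) -> Dict[str, Any]:
    """Inline primitive values; offload everything else and keep a reference."""
    metadata: Dict[str, Any] = {}
    for name, value in inputs.items():
        if isinstance(value, PRIMITIVES):
            # Small scalar values travel inside the metadata itself.
            metadata[name] = value
        else:
            # Large or complex values go to the raw data store;
            # the metadata only records where they live.
            uri = f"{raw_prefix}/{name}"
            raw_store[uri] = value
            metadata[name] = Reference(uri)
    return metadata


raw_store: Dict[str, Any] = {}
meta = build_inputs_metadata(
    {"threshold": 0.5, "rows": [[1, 2], [3, 4]]},
    "s3://my-raw-bucket/exec-123",  # assumed per-execution raw data prefix
    raw_store,
)
print(meta)
```

In this sketch, anything that only reads `meta` never touches the contents of `rows`, which mirrors the idea that the control plane works with references rather than raw data.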

Metadata in Flyte is critical to enable the passing of data between tasks. It allows Flyte to perform in-memory computations for branches, send partial outputs from one task to another, or compose outputs from multiple tasks into one input to be sent to a task.

Flyte is primarily a DataFlow Engine. It enables movement of data and provides an abstraction to enable movement of data between different languages. The workflow engine is responsible for moving data from a previous task to the next task. As explained previously, Flyte only deals with Metadata and not the actual Raw data.

Flytekit has a pluggable data persistence layer. This is driven by PROTOCOL. For example, it is theoretically possible to use S3 (s3://) for metadata and GCS (gcs://) for raw data. It is also possible to create your own protocol (my_fs://) to change how data is stored and accessed. But Metadata should remain accessible to the Flyte control plane. Data persistence is also pluggable. By default, it supports all major blob stores and uses an interface defined in Flytestdlib.
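A minimal sketch of what "driven by PROTOCOL" could look like: a registry that picks a storage backend from the part of the URI before `://`. The `InMemoryStore` class and the `register_protocol` and `store_for` helpers are hypothetical stand-ins, not flytekit's actual persistence API, and the in-memory backend only simulates a blob store.

```python
from typing import Dict

# Hypothetical sketch: these names are illustrative, not flytekit's real API.


class InMemoryStore:
    """Stand-in for a blob store backend (S3, GCS, or a custom one)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, uri: str, data: bytes) -> None:
        self._blobs[uri] = data

    def get(self, uri: str) -> bytes:
        return self._blobs[uri]


_REGISTRY: Dict[str, InMemoryStore] = {}


def register_protocol(protocol: str, store: InMemoryStore) -> None:
    """Associate a protocol such as 's3' or 'my_fs' with a backend."""
    _REGISTRY[protocol] = store


def protocol_of(uri: str) -> str:
    # Split by hand: urllib.parse does not accept '_' in a scheme,
    # so it would not recognise 'my_fs' as one.
    protocol, sep, _ = uri.partition("://")
    if not sep or not protocol:
        raise ValueError(f"no protocol in URI: {uri!r}")
    return protocol


def store_for(uri: str) -> InMemoryStore:
    """Look up the backend registered for the URI's protocol."""
    protocol = protocol_of(uri)
    if protocol not in _REGISTRY:
        raise KeyError(f"no backend registered for protocol {protocol!r}")
    return _REGISTRY[protocol]


register_protocol("s3", InMemoryStore())      # e.g. used for metadata
register_protocol("gcs", InMemoryStore())     # e.g. used for raw data
register_protocol("my_fs", InMemoryStore())   # a custom protocol

store_for("my_fs://bucket/exec-1/output").put("my_fs://bucket/exec-1/output", b"payload")
print(store_for("my_fs://bucket/exec-1/output").get("my_fs://bucket/exec-1/output"))
```

The point of the design is that callers only ever pass URIs; swapping where metadata or raw data lives means registering a different backend, not changing task code.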

In the current version of the Flyte codebase, Metadata and Raw data are handled as follows:

Metadata is serialized and deserialized using the marshalMetadata and unmarshalMetadata functions respectively. These functions are not shown in the provided context, but they are likely responsible for converting the metadata to and from a format that can be stored in the database. The serialized metadata is stored in the SerializedMetadata field of the models.Artifact struct.

Raw data is represented as models.ArtifactData and is stored in the ArtifactData field of the models.Artifact struct. The CreateArtifactModel function takes a slice of models.ArtifactData as an argument, which is then assigned to the ArtifactData field of the models.Artifact struct that is returned.

The FromArtifactModel function is used to convert a models.Artifact back into a datacatalog.Artifact. This involves unmarshalling the serialized metadata and converting the models.ArtifactData back into the format used by the datacatalog.Artifact.

The ToArtifactKey function is used to create an ArtifactKey from a datacatalog.DatasetID and an artifact ID. The ArtifactKey is used to uniquely identify an artifact.

Flyte ensures the separation of Metadata and Raw data in the current version of the codebase by using different data structures and serialization methods for each type of data.

In the provided code, the CreateArtifactModel function is used to create an artifact model. This function takes in a CreateArtifactRequest and an array of ArtifactData as parameters. The CreateArtifactRequest contains the metadata for the artifact, while the ArtifactData contains the raw data.

The metadata is serialized using the marshalMetadata function, which converts the metadata into a byte array. This serialized metadata is then stored in the SerializedMetadata field of the Artifact model.

On the other hand, the raw data is stored directly in the ArtifactData field of the Artifact model without any serialization.

This separation of metadata and raw data allows for more efficient storage and retrieval of data. The metadata can be quickly accessed and searched without having to load the potentially large raw data. Meanwhile, the raw data can be loaded only when necessary, reducing memory usage and improving performance.

You can find more details in the Flyte documentation and the Flyte repository.

I hope this helps! If you have any more questions, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@davidmirror-ops
Contributor

It's a recurring question: how to delete data from S3 buckets. Flyte doesn't have a way to do this, but setting a lifecycle policy on the buckets is a common approach.
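As a rough illustration of the lifecycle-policy approach, the sketch below builds an S3 lifecycle configuration that expires objects under a prefix after a number of days. The rule ID, the `raw-data/` prefix, the 30-day window, and the bucket name in the comment are all assumptions chosen for the example; the right values depend on how the deployment lays out its buckets.

```python
import json

# Assumed values for illustration only; adjust to your own bucket layout.


def expiration_rule(rule_id: str, prefix: str, days: int) -> dict:
    """Build one S3 lifecycle rule that expires objects under a prefix."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


lifecycle = {"Rules": [expiration_rule("expire-flyte-raw-data", "raw-data/", 30)]}
print(json.dumps(lifecycle, indent=2))

# With boto3 installed and credentials configured, the same dictionary
# could be applied to a bucket (the bucket name here is hypothetical):
#
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-flyte-bucket",
#       LifecycleConfiguration=lifecycle,
#   )
```

The storage service, not Flyte, enforces this policy, so it applies regardless of which execution wrote the objects.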

@Murdock9803
Contributor

@samhita-alla @davidmirror-ops I would like to take up this issue. Please assign it to me so I can start working on it 🙌

@davidmirror-ops
Contributor

@Murdock9803 done. Please let us know soon if you have questions or need help. Thanks!

@Murdock9803
Contributor

Sure, I'll start working on it as soon as possible and ask any questions that come up 🙌

@popojk
Contributor

popojk commented Oct 22, 2024

Hi @davidmirror-ops @samhita-alla, can I also work on this issue?

@davidmirror-ops
Contributor

@popojk sure!

@popojk
Contributor

popojk commented Oct 25, 2024

Hi @davidmirror-ops, thanks for assigning me this issue. I want to clarify whether the information I need to add to the doc includes:

  1. Flyte cannot clean up the raw data stored in storage services such as S3 or MinIO, but users can do so by setting lifecycle policies in the storage service.
  2. If caching is enabled, be sure to set the cache expiration shorter than the storage lifecycle expiration.

Moreover, I found some errors in the example code in the original doc. Do you mind if I fix them in the same PR as this issue?

Thank you very much!

@davidmirror-ops
Contributor

@popojk

  1. Correct
  2. Correct, this is the setting in propeller: https://docs.flyte.org/en/latest/deployment/configuration/generated/flytepropeller_config.html#max-cache-age-config-duration
    Thank you!
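The second point can be expressed as a small sanity check: if cached outputs are allowed to live at least as long as the storage lifecycle, a cache hit may point at raw data that the bucket has already deleted. The sketch below is an assumption-laden illustration; the function name and the example durations are hypothetical, and it does not read any real Flyte or bucket configuration.

```python
from datetime import timedelta

# Hypothetical helper; it does not read Flyte or storage configuration.


def cache_outlives_storage(
    max_cache_age: timedelta, storage_expiration: timedelta
) -> bool:
    """Return True when cached references could outlive the data they point to."""
    return max_cache_age >= storage_expiration


# Example values only: a 30-day bucket lifecycle with two candidate cache ages.
storage_expiration = timedelta(days=30)

for max_cache_age in (timedelta(days=7), timedelta(days=45)):
    if cache_outlives_storage(max_cache_age, storage_expiration):
        print(f"{max_cache_age.days}d cache age is unsafe: data expires first")
    else:
        print(f"{max_cache_age.days}d cache age is fine")
```

Keeping the cache age strictly below the lifecycle expiration leaves a margin during which any cached reference still resolves to existing data.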

4 participants