[Docs] document clarifying notes about the data lifecycle #4683
Comments
🤖 Hello @samhita-alla! I'm Dosu, a friendly bot here to assist you while we wait for a human maintainer. I can help with bugs, answer questions, and even guide you to become a contributor. Just know I'm here to make things easier for you. Let's tackle this together!

Sure, I can provide some clarifying notes about the data lifecycle in Flyte.

Flyte handles data in two parts: Metadata and Raw data. Metadata consists of data about inputs to a task and other artifacts. It is configured globally for FlytePropeller, FlyteAdmin, etc., and the running pods/jobs need access to this bucket to get the data. Raw data is the actual data (such as a Pandas DataFrame, Spark DataFrame, etc.). Raw data paths are unique for every execution, and the prefixes can be modified per execution. None of the Flyte control plane components access the raw data, which gives a clean separation between the control plane and the data plane.

For every task that receives input, Flyte sends an Inputs Metadata object, which contains all the primitive or simple scalar values inlined; complex, large objects are offloaded, and the Metadata simply stores a reference to the object. Flytekit TypeTransformers make it possible to use complex objects as if they were available locally, just like persistent file handles, but the Flyte backend only deals with the references. Thus, primitive data types and references to large objects fall under Metadata (Meta input or Meta output), while the actual large object is the Raw data.

Metadata in Flyte is critical for passing data between tasks. It allows Flyte to perform in-memory computations for branches, send partial outputs from one task to another, or compose outputs from multiple tasks into a single input for a downstream task.

Flyte is primarily a dataflow engine: it moves data between tasks and provides an abstraction for moving data between different languages. The workflow engine is responsible for moving data from a previous task to the next task, and, as explained above, it only deals with Metadata, not the actual Raw data.

Flytekit has a pluggable data persistence layer, driven by the protocol. For example, it is theoretically possible to use S3 (s3://) for metadata and GCS (gcs://) for raw data, or to create your own protocol (my_fs://) to change how data is stored and accessed. Metadata, however, must be accessible to the Flyte control plane. Backend data persistence is also pluggable: by default it supports all major blob stores, using an interface defined in Flytestdlib.

In the current version of the codebase, Flyte keeps Metadata and Raw data separate by using different data structures and serialization methods for each. This separation allows for more efficient storage and retrieval: metadata can be quickly accessed and searched without loading the potentially large raw data, while raw data is loaded only when necessary, reducing memory usage and improving performance.

You can find more details in the Flyte documentation and the Flyte repository. I hope this helps! If you have any more questions, feel free to ask.
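To make the Metadata/Raw data split concrete, here is a minimal flytekit sketch (my own illustration, not taken from the Flyte docs): the integer input travels inline in the task's metadata, while the DataFrame is offloaded to the raw-data store and only a reference to it is passed downstream.

```python
# Minimal sketch of metadata vs. raw data in flytekit (illustrative, not from the docs).
import pandas as pd
from flytekit import task, workflow


@task
def make_frame(rows: int) -> pd.DataFrame:
    # `rows` is a primitive, so it arrives inlined in the Inputs metadata.
    # The returned DataFrame is written to the raw-data prefix; only a
    # reference to it is recorded in the task's output metadata.
    return pd.DataFrame({"value": range(rows)})


@task
def count_rows(df: pd.DataFrame) -> int:
    # Flytekit's type transformer resolves the reference and loads the
    # offloaded DataFrame here, so the task body can use it like a local object.
    return len(df)


@workflow
def wf(rows: int = 10) -> int:
    return count_rows(df=make_frame(rows=rows))
```

If I remember correctly, the per-execution raw-data prefix mentioned above can be overridden at launch time (for example with pyflyte run's --raw-output-data-prefix flag), but please double-check the data management docs for the exact option name.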
It's a recurring question: how to delete data from S3 buckets. Flyte doesn't have a way to do this, but setting a lifecycle policy on the buckets is the usual approach.
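For anyone landing here later, a rough sketch of what such a lifecycle policy could look like when applied with boto3 (the bucket name, prefix, and retention period below are placeholders, not values from this thread):

```python
# Sketch only: expire objects under a hypothetical raw-data prefix after 30 days.
# Bucket name, prefix, and retention are placeholders; verify them against your
# own storage layout before enabling a policy like this.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-flyte-raw-data-bucket",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-flyte-raw-data",
                "Filter": {"Prefix": "raw-data/"},  # placeholder prefix
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```

The same effect can be achieved from the S3 console or infrastructure-as-code tooling; the key point is scoping the rule so you don't expire metadata that the control plane still needs.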
@samhita-alla @davidmirror-ops I would like to take up this issue. Please assign it to me so I can start working on it 🙌
@Murdock9803 done. Please let us know soon if you have questions or need help. Thanks!
Sure, I'll start working on it as soon as possible and ask any related questions 🙌
Hi @davidmirror-ops @samhita-alla, can I also work on this issue?
@popojk sure!
Hi @davidmirror-ops, thanks for assigning me this issue. I want to clarify whether the information I need to add to the doc includes:
Also, I found some example code errors in the original doc. Do you mind if I fix them in the same PR for this issue? Thank you very much!
Description
document to update: https://docs.flyte.org/en/latest/concepts/data_management.html
what to add: https://discuss.flyte.org/t/15995893/hi-community-i-am-trying-to-understand-the-flyte-data-lifecy#b374eea0-0c32-44fd-875a-5b79d4a8180b
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?