[Docs] document clarifying notes about the data lifecycle #4683
Comments
🤖 Hello @samhita-alla! I'm Dosu, a friendly bot here to assist you while we wait for a human maintainer. I can help with bugs, answer questions, and even guide you to become a contributor. Just know I'm here to make things easier for you. Let's tackle this together!

Sure, I can provide some clarifying notes about the data lifecycle in Flyte.

Flyte handles data in two parts: Metadata and Raw data. Metadata consists of data about inputs to a task and other artifacts. It is configured globally for FlytePropeller, FlyteAdmin, etc., and the running pods/jobs need access to this bucket to get the data. Raw data is the actual data (such as a Pandas DataFrame, Spark DataFrame, etc.). Raw data paths are unique for every execution, and the prefixes can be modified per execution. None of the Flyte control plane components access the raw data, which gives a clean separation between the control plane and the data plane.

For every task that receives input, Flyte sends an Inputs Metadata object, which contains all the primitive or simple scalar values inlined; complex, large objects are offloaded, and the Metadata simply stores a reference to the object. Flytekit TypeTransformers make it possible to use complex objects as if they were available locally, just like persistent file handles, but the Flyte backend only deals with the references. Thus, primitive data types and references to large objects fall under Metadata (Meta input or Meta output), while the actual large object is the Raw data.

Metadata in Flyte is critical for passing data between tasks. It allows Flyte to perform in-memory computations for branches, send partial outputs from one task to another, or compose outputs from multiple tasks into a single input for a downstream task.

Flyte is primarily a dataflow engine: it moves data between tasks and provides an abstraction for moving data between different languages. The workflow engine is responsible for moving data from a previous task to the next task, and, as explained above, it only deals with Metadata, not the actual Raw data.

Flytekit has a pluggable data persistence layer, driven by the protocol. For example, it is theoretically possible to use S3 (s3://) for metadata and GCS (gcs://) for raw data, or to create your own protocol (my_fs://) to change how data is stored and accessed. Metadata, however, must be accessible to the Flyte control plane. Backend data persistence is also pluggable: by default it supports all major blob stores, using an interface defined in Flytestdlib.

In the current version of the codebase, Flyte keeps Metadata and Raw data separate by using different data structures and serialization methods for each. This separation allows for more efficient storage and retrieval: metadata can be quickly accessed and searched without loading the potentially large raw data, while raw data is loaded only when necessary, reducing memory usage and improving performance.

You can find more details in the Flyte documentation and the Flyte repository. I hope this helps! If you have any more questions, feel free to ask.
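To make the Metadata/Raw data split concrete, here is a minimal flytekit sketch (my own illustration, not taken from the Flyte docs): the integer input travels inline in the task's metadata, while the DataFrame is offloaded to the raw-data store and only a reference to it is passed downstream.

```python
# Minimal sketch of metadata vs. raw data in flytekit (illustrative, not from the docs).
import pandas as pd
from flytekit import task, workflow


@task
def make_frame(rows: int) -> pd.DataFrame:
    # `rows` is a primitive, so it arrives inlined in the Inputs metadata.
    # The returned DataFrame is written to the raw-data prefix; only a
    # reference to it is recorded in the task's output metadata.
    return pd.DataFrame({"value": range(rows)})


@task
def count_rows(df: pd.DataFrame) -> int:
    # Flytekit's type transformer resolves the reference and loads the
    # offloaded DataFrame here, so the task body can use it like a local object.
    return len(df)


@workflow
def wf(rows: int = 10) -> int:
    return count_rows(df=make_frame(rows=rows))
```

If I remember correctly, the per-execution raw-data prefix mentioned above can be overridden at launch time (for example with pyflyte run's --raw-output-data-prefix flag), but please double-check the data management docs for the exact option name.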
It's a recurring question: how to delete data from S3 buckets. Flyte doesn't have a way to do this, but setting a lifecycle policy on the buckets is the usual approach.
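For anyone landing here later, a rough sketch of what such a lifecycle policy could look like when applied with boto3 (the bucket name, prefix, and retention period below are placeholders, not values from this thread):

```python
# Sketch only: expire objects under a hypothetical raw-data prefix after 30 days.
# Bucket name, prefix, and retention are placeholders; verify them against your
# own storage layout before enabling a policy like this.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-flyte-raw-data-bucket",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-flyte-raw-data",
                "Filter": {"Prefix": "raw-data/"},  # placeholder prefix
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```

The same effect can be achieved from the S3 console or infrastructure-as-code tooling; the key point is scoping the rule so you don't expire metadata that the control plane still needs.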
@samhita-alla @davidmirror-ops I would like to take up this issue. Please assign it to me so I can start working on it 🙌
@Murdock9803 done. Please let us know soon if you have questions or need help. Thanks!
Sure, I'll start working on it as soon as possible and ask any related questions 🙌
Hi @davidmirror-ops @samhita-alla, can I also work on this issue?
@popojk sure!
Hi @davidmirror-ops, thanks for assigning me this issue. I want to clarify whether the information I need to add to the doc includes:
Also, I found some example code errors in the original doc. Do you mind if I fix them in the same PR for this issue? Thank you very much!
Description
document to update: https://docs.flyte.org/en/latest/concepts/data_management.html
what to add: https://discuss.flyte.org/t/15995893/hi-community-i-am-trying-to-understand-the-flyte-data-lifecy#b374eea0-0c32-44fd-875a-5b79d4a8180b
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?