[Feature][Flytekit Schema type extension] Vaex Dataframe plugin #701
Labels: flytekit, good first issue, hacktoberfest, plugins
Motivation: Why do you think this is important?
Flytekit should support Vaex as a pandas alternative for FlyteSchema objects.
https://github.com/vaexio/vaex
Vaex has great performance on a single machine, and a single machine is usually all that most datasets need. Spark and Dask are overkill, and add a lot of complexity, for datasets of only a few gigabytes. Adding Vaex, with automatic serialization and deserialization between consecutive tasks using Arrow/HDF5, would allow great pandas, Spark, and Vaex interoperability.
Goal: What should the final outcome look like, ideally?
Users should be able to retrieve Vaex DataFrames from a FlyteSchema.
The Vaex DataFrame should also be supported as a type.
The plugin should mostly look like the default pandas DataFrame transformer and reader that ship with Flytekit:
https://github.com/flyteorg/flytekit/blob/master/flytekit/types/schema/types_pandas.py#L88-L144
Or like the Spark plugin's support for Spark DataFrames:
https://github.com/flyteorg/flytekit/blob/f0b0a7ed854950a3341df710d1f378ef3ed838ab/plugins/flytekit-spark/flytekitplugins/spark/schema.py#L13-L81
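To make the goal above more concrete, here is a minimal sketch of the reader/writer pair such a plugin might provide. It deliberately does not import Flytekit: `VaexSchemaReader`, `VaexSchemaWriter`, `Handler`, `HANDLERS`, and `register_handler` are hypothetical names used only for illustration, and a real plugin would presumably build on the base classes and registration hooks used in the files linked above instead. The Vaex calls (`export_parquet`, `export_hdf5`, `vaex.open_many`) are imported lazily, so the module loads even where Vaex is not installed.

```python
import os
from dataclasses import dataclass
from typing import Dict, List

SUPPORTED_FORMATS = ("parquet", "hdf5")


class VaexSchemaWriter:
    """Hypothetical writer: stores each Vaex DataFrame as one file in local_dir."""

    def __init__(self, local_dir: str, fmt: str = "parquet"):
        if fmt not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported format: {fmt}")
        self.local_dir = local_dir
        self.fmt = fmt
        self._index = 0

    def next_path(self) -> str:
        # Zero-padded file names keep lexical order equal to write order.
        path = os.path.join(self.local_dir, f"{self._index:05d}.{self.fmt}")
        self._index += 1
        return path

    def write(self, df) -> str:
        # df is assumed to be a vaex DataFrame.
        path = self.next_path()
        if self.fmt == "parquet":
            df.export_parquet(path)
        else:
            df.export_hdf5(path)
        return path


class VaexSchemaReader:
    """Hypothetical reader: loads every file of the given format from local_dir."""

    def __init__(self, local_dir: str, fmt: str = "parquet"):
        if fmt not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported format: {fmt}")
        self.local_dir = local_dir
        self.fmt = fmt

    def paths(self) -> List[str]:
        suffix = "." + self.fmt
        names = sorted(n for n in os.listdir(self.local_dir) if n.endswith(suffix))
        return [os.path.join(self.local_dir, n) for n in names]

    def all(self):
        import vaex  # lazy import: only needed when data is actually read

        # open_many concatenates the files into a single DataFrame.
        return vaex.open_many(self.paths())


@dataclass(frozen=True)
class Handler:
    """Hypothetical record tying a handler name to its reader and writer."""

    name: str
    reader: type
    writer: type


# Keyed by the fully qualified type name so that Vaex need not be imported here.
HANDLERS: Dict[str, Handler] = {}


def register_handler(type_name: str, handler: Handler) -> None:
    if type_name in HANDLERS:
        raise ValueError(f"handler already registered for {type_name}")
    HANDLERS[type_name] = handler


register_handler(
    "vaex.dataframe.DataFrame",
    Handler("vaex-dataframe-schema", VaexSchemaReader, VaexSchemaWriter),
)
```

With Vaex installed, calling `VaexSchemaWriter(some_dir).write(df)` in one task and `VaexSchemaReader(some_dir).all()` in the next should round-trip a DataFrame through Parquet. Supporting the Vaex DataFrame as a first-class type would additionally need a type transformer, which this sketch leaves out.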
Describe alternatives you've considered
NA
Flyte component
GitHub repo(s)
flytekit