[BUG] Structured Dataset compatibility between plugins #3189

Closed
2 tasks done
esadler-hbo opened this issue Dec 24, 2022 · 3 comments
Labels
bug Something isn't working flytekit FlyteKit Python related issue

Comments


esadler-hbo commented Dec 24, 2022

Describe the bug

I was running some tasks in a notebook where I passed the result of a Spark task as a StructuredDataset and then tried to load it into a polars DataFrame and a Hugging Face Dataset.

Both plugins failed with the following error: No such file or directory: /var/folders/wq/3hjh3ms916b6dj56zx0f_x000000gq/T/flyte-69d2tww2/sandbox/local_flytekit/95bac8efeb64a8d10d34c73b66df7051/00000. However, loading the same dataset with pandas worked.

It seems that the polars and Hugging Face plugins append 00000 to the path in their transformers, while the Spark plugin does not.
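The suspected mismatch can be illustrated without Spark or flytekit. The sketch below is only a stand-in under stated assumptions: the helper names are hypothetical, the files are empty placeholders, and the `part-00000-...` file name is an assumption based on Spark's usual output naming. It shows that a reader hard-coding a file named 00000 finds nothing in such a directory, while a reader that scans the directory does.

```python
import tempfile
from pathlib import Path
from typing import List


def simulate_spark_output(root: Path) -> Path:
    """Create a directory laid out roughly like a Spark parquet write.

    Assumption: Spark writes a directory of part files plus a _SUCCESS marker.
    The files here are empty placeholders, not real parquet data.
    """
    out = root / "dataset"
    out.mkdir()
    (out / "_SUCCESS").touch()
    (out / "part-00000-example.snappy.parquet").touch()
    return out


def fixed_name_target(dataset_dir: Path) -> Path:
    """Path a reader would open if it assumes a single file named 00000."""
    return dataset_dir / "00000"


def directory_targets(dataset_dir: Path) -> List[Path]:
    """Files a reader would open if it scans the directory for parquet files."""
    return sorted(dataset_dir.glob("*.parquet"))


with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = simulate_spark_output(Path(tmp))
    # True when the hard-coded 00000 file is absent, as in the reported error.
    missing = not fixed_name_target(dataset_dir).exists()
    found = [p.name for p in directory_targets(dataset_dir)]

print("00000 missing:", missing)
print("parquet files found by scanning:", found)
```

If this is what happens in the plugins, it would be consistent with pandas succeeding (it is handed the directory) while the other two fail on the fixed file name.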

Expected behavior

I would expect a StructuredDataset produced by a Spark task to be loadable with the dataframe libraries of all plugins.

Additional context to reproduce

import flytekit  # needed for flytekit.current_context() below
from flytekit import task, StructuredDataset
from flytekitplugins.spark.task import Spark
from datasets import Dataset
import polars as pl
import datasets
import pandas as pd

@task(
    task_config=Spark()
)
def spark_task(path: str) -> StructuredDataset:
    # Read the parquet file with Spark and return it as a StructuredDataset
    sess = flytekit.current_context().spark_session
    df = sess.read.parquet(path)
    return StructuredDataset(dataframe=df)

df = spark_task(path="./ratings_100k.parquet")

# Fails: polars looks for a file named 00000
try:
    df.open(pl.DataFrame).all().head()
except Exception as e:
    print(e)

# Fails: Hugging Face datasets looks for a file named 00000
try:
    df.open(datasets.Dataset).all().head()
except Exception as e:
    print(e)

# Works with pandas
df.open(pd.DataFrame).all().head()

Screenshots

Screen Shot 2022-12-24 at 10 54 40 AM

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@esadler-hbo esadler-hbo added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Dec 24, 2022

welcome bot commented Dec 24, 2022

Thank you for opening your first issue here! 🛠

@nightscape

@esadler-hbo this seems to be resolved by flyteorg/flytekit#1406.
Can you verify?

@pingsutw pingsutw added flytekit FlyteKit Python related issue and removed untriaged This issues has not yet been looked at by the Maintainers labels Dec 22, 2023
@pingsutw pingsutw self-assigned this Dec 22, 2023
@pingsutw
Member

yes, we've fixed it


3 participants