Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Msgspec Dataclass Type Transformer #2211

Closed
wants to merge 5 commits into from

Conversation

Future-Outlier
Copy link
Member

@Future-Outlier Future-Outlier commented Feb 23, 2024

Tracking issue

flyteorg/flyte#4485

Why are the changes needed?

What changes were proposed in this pull request?

How was this patch tested?

python example

from flytekit import task, workflow
from typing import Optional
from dataclasses import dataclass
from mashumaro.mixins.json import DataClassJSONMixin
from typing import Tuple

@dataclass
class User:
    """A new type describing a User"""
    name: str

@task
def t1() -> User:
    return User(name="alice")

@dataclass
class Datum(DataClassJSONMixin):
    y: str
    z: dict[str, str]

@task
def stringify(s: int) -> Datum:
    """
    A dataclass return will be treated as a single complex JSON return.
    """
    return Datum(y=str(s),z={str(s):str(s)})


@workflow
def wf() -> Tuple[Datum, User]:
    t1_output = t1()
    stringify_output = stringify(s=1)
    t1_output >> stringify_output
    return stringify_output, t1_output

Setup process

Screenshots

image

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Feb 23, 2024
@Future-Outlier Future-Outlier marked this pull request as draft February 23, 2024 15:39
Copy link

codecov bot commented Feb 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 53.70%. Comparing base (05fc527) to head (f12eb09).
Report is 4 commits behind head on master.

❗ Current head f12eb09 differs from pull request most recent head ba6244d. Consider uploading reports for the commit ba6244d to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2211       +/-   ##
===========================================
- Coverage   83.25%   53.70%   -29.55%     
===========================================
  Files         309      180      -129     
  Lines       24029    17855     -6174     
  Branches     3481     3481               
===========================================
- Hits        20005     9589    -10416     
- Misses       3401     7842     +4441     
+ Partials      623      424      -199     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@thomasjpfan
Copy link
Member

Thank you for your interest in the original issue. With flyteorg/flyte#4485 (comment), I think it's best to go with mashumaro directly.

@Future-Outlier
Copy link
Member Author

Thank you for your interest in the original issue. With flyteorg/flyte#4485 (comment), I think it's best to go with mashumaro directly.

Thank you, will take a look!

@kumare3
Copy link
Contributor

kumare3 commented Feb 24, 2024

Can we really remove marshmallow- if so let's do it

@Future-Outlier
Copy link
Member Author

Can we really remove marshmallow- if so let's do it

Thank you, will take a look

Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
@thomasjpfan
Copy link
Member

@Future-Outlier My original motivation for msgspec was to reduce dependencies, but now I think it's better to use mashumaro everywhere.

I prefer not to add msgspec.Struct support to flytekit and close this PR. The follow up to this PR is to use mashumaro everywhere and remove the dependency on marshmallow, marshmallow-jsonschema, and marshmallow-enum. This works because mashumaro already provides all the features out of the box.

@Future-Outlier
Copy link
Member Author

Future-Outlier commented Mar 13, 2024

@Future-Outlier My original motivation for msgspec was to reduce dependencies, but now I think it's better to use mashumaro everywhere.

I prefer not to add msgspec.Struct support to flytekit and close this PR. The follow up to this PR is to use mashumaro everywhere and remove the dependency on marshmallow, marshmallow-jsonschema, and marshmallow-enum. This works because mashumaro already provides all the features out of the box.

Thank you, I will try to remove marshmallow.
Do you think it is a good idea to also support msgspec?
I just found that lots of its functions are written in C, which will far more faster than python.

Signed-off-by: Future-Outlier <[email protected]>
@Future-Outlier
Copy link
Member Author

Future-Outlier commented Mar 13, 2024

@thomasjpfan, @pingsutw
Please look at the example above, I found that we only need to use @dataclass, then we can use msgspec to serialize and deserialize all of them.
I've haven't dealt with compicate cases, for example adding int, but maybe msgspec is the answer we need,
since it provides a better interface, right?

@Future-Outlier
Copy link
Member Author

@jcrist Hi, here's a question.
Flyte might have possibility to use msgspec, since that I found that if we don't need to inherirt msgspec.Struct, we can still use msgspec.json.encode and msgspec.json.decode to serialize and deserialize dataclasses.
Is there any limit or any potential issues formsgspec?
Or more specifically, can you tell us restrictions about msgspec here?
https://jcristharif.com/msgspec/supported-types.html

Signed-off-by: Future-Outlier <[email protected]>
@thomasjpfan
Copy link
Member

thomasjpfan commented Mar 13, 2024

I just found that lots of its functions are written in C, which will far more faster than python.

It could be faster, but I do not think that is Flyte's bottleneck right now or there is a demand for msgspec. Every new feature adds more code for us to maintain. I have experimented with using msgspec just to handle dataclasses which mostly works, but I did not open a PR because of the maintenance cost.

we can still use msgspec.json.encode and msgspec.json.decode to serialize and deserialize dataclasses.

This use case is also supported by mashumaro: https://github.com/Fatal1ty/mashumaro#supported-serialization-formats. In the end, I want to not require @dataclass_json or mixins for using dataclasses:

from mashumaro.codecs.json import JSONDecoder, JSONEncoder
from dataclasses import dataclass

@dataclass
class MyModel:
    a: int
    b: bool

model = MyModel(a=10, b=False)
encoder = JSONEncoder(MyModel)
decoder = JSONDecoder(MyModel)

result = encoder.encode(model)
print(result)

out = decoder.decode(result)
print(out)

# {"a": 10, "b": false}
# MyModel(a=10, b=False)

@Future-Outlier
Copy link
Member Author

Future-Outlier commented Mar 13, 2024

I just found that lots of its functions are written in C, which will far more faster than python.

It could be faster, but I do not think that is Flyte's bottleneck right now or there is a demand for msgspec. Every new feature adds more code for us to maintain. I have experimented with using msgspec just to handle dataclasses which mostly works, but I did not open a PR because of the maintenance cost.

we can still use msgspec.json.encode and msgspec.json.decode to serialize and deserialize dataclasses.

This use case is also supported by mashumaro: https://github.com/Fatal1ty/mashumaro#supported-serialization-formats. In the end, I want to not require @dataclass_json or mixins for using dataclasses:

from mashumaro.codecs.json import JSONDecoder, JSONEncoder
from dataclasses import dataclass

@dataclass
class MyModel:
    a: int
    b: bool

model = MyModel(a=10, b=False)
encoder = JSONEncoder(MyModel)
decoder = JSONDecoder(MyModel)

result = encoder.encode(model)
print(result)

out = decoder.decode(result)
print(out)

# {"a": 10, "b": false}
# MyModel(a=10, b=False)

Ok, I will try to go for mashumaro and

  1. remove dataclass_json
  2. remove mixins
    if it doesn't work, I will list reasons and discuss with you.
    After this, we can consider msgspec.
    Thank you.

@thomasjpfan
Copy link
Member

remove dataclass_json
remove mixins
if it doesn't work, I will list reasons and discuss with you.

To clarify, I want to remove the @dataclass_json and DataClassORJSONMixin requirement for majority of the use cases. If a user decides to use mixins, like in #2080, that should still continue to work.

@Future-Outlier
Copy link
Member Author

remove dataclass_json
remove mixins
if it doesn't work, I will list reasons and discuss with you.

To clarify, I want to remove the @dataclass_json and DataClassORJSONMixin requirement for majority of the use cases. If a user decides to use mixins, like in #2080, that should still continue to work.

No problem, thank you!

@Future-Outlier
Copy link
Member Author

Just create a PR here.
#2279

So I will close this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:XS This PR changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants