Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Housekeeping] Use msgspec instead of marshmallow and related dependencies #4485

Closed
2 tasks done
thomasjpfan opened this issue Nov 27, 2023 · 6 comments
Closed
2 tasks done
Assignees
Labels
housekeeping Issues that help maintain flyte and keep it tech-debt free

Comments

@thomasjpfan
Copy link
Member

thomasjpfan commented Nov 27, 2023

Describe the issue

Currently, serializing we are using many libraries to serialize objects in flytekit:

  • mashumaro
  • marshmallow
  • marshmallow-jsonschema (Last release in 2021-10)
  • marshmallow-enum (Last release in 2019-10)

This can all be replaced by replaced by https://github.com/jcrist/msgspec, which has not dependencies itself. msgspec contains all the features needed above. msgspec has run benchmarks that show favorable results compared to the popular serialization frameworks. Benchmarks could be outdated.

What if we do not do this?

Continue to depend on multiple libraries for serialization.

Related component(s)

flytekit

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@thomasjpfan thomasjpfan added the housekeeping Issues that help maintain flyte and keep it tech-debt free label Nov 27, 2023
Copy link

dosubot bot commented Nov 27, 2023

🤖

It seems like you've got this under control, if you want help or have specific questions, let me know what I can do for you!

@Fatal1ty
Copy link

Fatal1ty commented Nov 27, 2023

Hi, author of mashumaro here 👋

Just to make it clear, mashumaro, which is currently used in flytekit according to the docs, supports dataclasses out of the box (and don't force you to use mixins since the recent 3.11 release) and don't depend on any library except typing-extensions (for compatibility reasons). It has nothing to do with marshmallow, marshmallow-jsonschema or marshmallow-enum in favour of built-in JSON Schema generation functionality and standard Enum types support.

that show favorable results compared to the popular serialization frameworks.

Their outdated comparison uses Struct classes written in C for msgspec, and that's why everyone can see the impressive results. If you stick to dataclasses, msgspec won't give you that performance advantage and will run at about the same speed as the others. I encourage you to write your own benchmark to see the difference before making a decision.

@thomasjpfan
Copy link
Member Author

thomasjpfan commented Nov 27, 2023

supports dataclasses out of the box (and don't force you to use mixins since the recent

Thank you for sharing! mashumaro v3.11 does make it more enticing. Historically, mashumaro was added just for serializing dataclasses: flyteorg/flytekit#1735

If you stick to dataclasses, msgspec won't give you that performance advantage and will run at about the same speed as the others.

If there is no performance differences for dataclasses, then this proposal is more about reducing the overall dependencies. I suspect we can leverage msgspec's struct internally.

Looking over mashumaro, it looks like we can use mashumaro itself for all the features we need. In this case, we need to choose between mashumaro and msgspec.

@jcrist
Copy link

jcrist commented Nov 28, 2023

Thanks for considering msgspec! Apologies for stepping into the conversation here - I try not to weigh in on library comparison discussions when unasked, but I'm just going to add a quick comment here to correct a misconception above.

If you stick to dataclasses, msgspec won't give you that performance advantage and will run at about the same speed as the others.

While our internal benchmarks are a bit outdated, this claim doesn't hold up even when run today. It's true that part of msgspec's performance comes from its own fast Struct type (which for most users should be indistinguishable from a dataclass at runtime), its JSON encoders/decoders are still generally faster than other serialization libraries even when standard dataclasses (or attrs types) are used instead.

To demonstrate this, I hacked together a quick benchmark based on the one used in this flyte blogpost. The source may be found in this gist. Results:

$ python bench.py 
Encoding:
- msgspec & structs:     0.4 µs (1.0x)
- msgspec & dataclasses: 0.7 µs (1.6x)
- mashumaro & orjson:    1.3 µs (3.3x)
- mashumaro & json:      5.1 µs (12.3x)

Decoding:
- msgspec & structs:     0.7 µs (1.0x)
- msgspec & dataclasses: 1.3 µs (2.0x)
- mashumaro & orjson:    2.6 µs (3.9x)
- mashumaro & json:      5.2 µs (7.7x)

Library versions:
- Python: 3.11.0.final.0
- msgspec: 0.18.4
- mashumaro: 3.11
- orjson: 3.9.10

When using mashumaro with no other dependencies, msgspec is roughly 7-12x faster at encoding/decoding JSON. If you include mashumaro's optional orjson dependency, this brings the difference down to 2-4x faster (depending on if dataclasses are used). msgspec doesn't require any external dependencies for JSON support, and is fast out-of-the-box.


That said - I'm not familiar with flyte, but unless serialization is a known bottleneck I'd guess either library is probably fast enough for most use cases. If I were making a decision here I might consider other things like usability, maintainability, or supported features. For common cases both libraries can likely provide the performance characteristics you'd need.

If you do decide that msgspec may be a good fit for flyte, I'd be happy to provide any feedback/bugfixes needed for integration, but also please don't take this comment as pressure to use msgspec. </ducks back out of conversation>

@Future-Outlier
Copy link
Member

#take

@thomasjpfan
Copy link
Member Author

With flyteorg/flytekit#2080, we have users that depend on mashumaro's API. In this case, I think we should go the other way and go "all in" on mashumaro. Specifically, use mashumaro directly and remove the need for marshmallow and friends.

With that in mind, I'll close this issue around migrating to msgspec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
housekeeping Issues that help maintain flyte and keep it tech-debt free
Projects
None yet
Development

No branches or pull requests

4 participants