[BUG] Can't use int types on tasks, flyte casts to float between tasks #4505
Comments
🤖 Hello @jiwidi, Nice to meet you! I'm Dosu, an AI bot here to assist you with bugs, answer your questions, and guide you as a contributor while we wait for a human maintainer. I'm here to help, so don't hesitate to ask if you have any further questions or issues! I understand that you're having an issue with integer types being cast to float in your Flyte workflow. This seems to be a known issue that has been reported in the past. Specifically, there are two similar issues that have been resolved:
Both of these issues were addressed by the following pull requests: Based on the error message you provided, it seems that the integer value is being interpreted as a float when it's passed to the In Flyte, the type conversion process is handled by the Given this, it's possible that there might be a bug in Flyte causing this issue. However, it's also possible that the issue might be due to the way the To help narrow down the issue, could you please provide the following additional information:
This information will help us better understand the issue and provide a more accurate solution. Thank you for your patience and cooperation. |
I would like to work on this, can you assign me, please? |
#take |
@jasonlai1218 hey! Did you have time to look at this? |
@eapolinario @jasonlai1218 hello? |
Sorry, I've been busy with company work recently and haven't had time to look into this. |
Please note that the flytekit type engine serializes untyped dicts ( You may either use a dataclass or a univariate dictionary to keep the Here is an example for a univariate dict:
from flytekit import workflow, task

@task
def a() -> dict[str, int]:
    return {
        "a": 1,
    }

@task
def b(inp: dict[str, int]):
    print(inp)
    print(type(inp["a"]))

@workflow
def test():
    result = a()
    b(inp=result)
output of task b:
|
So dictionaries in Flyte are strict about keeping the same value type for all keys, otherwise the typing fails? I can't use dict[str, int] because some other keys have string values :/ |
We're seeing a similar issue - in our case we have dataclasses that may contain arbitrary JSON user input, and Flyte keeps converting ints to floats every time we pass the object between tasks. It would be nice if Flyte respected the runtime type in this case given that both integers and floats are valid JSON. |
JSON does not have distinct types for integers and floating-point values. Therefore, the presence or absence of a decimal point is not enough to distinguish between integers and non-integers. For example, 1 and 1.0 are two ways to represent the same value in JSON. There is no way to distinguish between integers and floats in pure vanilla JSON without adding an additional schema, which in the end also means providing additional type information. |
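As a small illustration of the point above, using only Python's standard `json` module rather than Flyte itself: whether `1` and `1.0` stay distinct depends on the parser, not on JSON. The `parse_int=float` call below only simulates a consumer that stores every number as a double; it is an assumption made for illustration, not flytekit's actual code path.

```python
import json

# Python's json module happens to use the lexical form to pick a type.
as_int = json.loads("1")
as_float = json.loads("1.0")
print(type(as_int).__name__, type(as_float).__name__)

# A consumer that maps every JSON number to a double (simulated here
# with parse_int=float) no longer sees any difference between the two.
widened = json.loads('{"a": 1, "b": 2.5}', parse_int=float)
print(widened)
```

Both behaviours are valid readings of the same JSON text, which is why extra type information is needed to recover the integer.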
You can also use dataclasses, which I would usually recommend here. Generally you will need to provide some more type information at registration time of the workflow. A You can take a look at the overview how Flyte serializes certain python datatypes here. Serializing the dict by pickling it would keep all python types but make it less versatile. See the Flyte documentation about this:
If you really want to use pickle and keep the untyped dicts, you could probably wrap them in a python class to enforce the Flyte Pickle type transformer. See https://docs.flyte.org/projects/cookbook/en/v0.3.66/auto/core/type_system/flyte_pickle.html As I said, I don't think this is the best way to solve the problem but it should work. |
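To show why pickling keeps the Python types, here is a minimal sketch with the standard `pickle` module only. It does not go through Flyte's pickle transformer; the dictionary and its keys are made up for the example.

```python
import pickle

# A mixed-type dict, similar in spirit to the untyped dicts discussed above.
mixed = {"count": 1, "ratio": 1.0, "name": "run"}

# Pickle stores the Python object graph, so int and float stay distinct.
restored = pickle.loads(pickle.dumps(mixed))
restored_types = {key: type(value).__name__ for key, value in restored.items()}
print(restored_types)
```

The trade-off mentioned above still applies: the pickled bytes are opaque to anything that is not Python.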
I don't think you need to provide that additional typing information to make a multi-type dictionary work. Flyte could definitely implement something in its code to look at types, infer them, and load accordingly afterwards. You are correct about those JSON limitations, but I've seen libraries overcome them multiple times in the past. This is not an excuse for Flyte not to fix this issue. |
I noticed that
from typing import Union
from flytekit import workflow, task

@task
def a() -> dict[str, Union[str, int, float]]:
    return {"str": "string", "int": 1, "float": 1.0}

@task
def b(inp: dict[str, Union[str, int, float]]):
    print(inp)
    print(type(inp["float"]))

@workflow
def main():
    b(inp=a())
Assuming you only use simple types within your dict, this might solve your problem without having to use dataclasses? |
Point taken - but is there a way to inject this type information at runtime? In our case we are passing this user input into a |
Using a dataclass with a union type still shows
from dataclasses import dataclass
from dataclasses_json import DataClassJsonMixin
from flytekit import dynamic, task, workflow

@dataclass
class Foo(DataClassJsonMixin):
    x: int
    y: float
    z: int | float

@task
def print_foo(*, foo: Foo) -> Foo:
    print(foo)
    print(f"x={type(foo.x)}")
    print(f"y={type(foo.y)}")
    print(f"z={type(foo.z)}")
    return foo

@dynamic
def _print_foo_dynamic(*, foo: Foo) -> Foo:
    return print_foo(foo=foo)

@workflow
def print_foo_wf(*, foo: Foo) -> Foo:
    return _print_foo_dynamic(foo=foo)
|
Picking this work back up, this is something else I've noticed. In the following code:
from dataclasses import dataclass, field
from typing import Any

import marshmallow as mm
from dataclasses_json import DataClassJsonMixin, config
from flytekit import dynamic, task, workflow

@dataclass
class Foo(DataClassJsonMixin):
    x: int
    y: float
    z: Any = field(metadata=config(mm_field=mm.fields.Integer()))

@task
def print_foo(*, foo: Foo) -> Foo:
    print(foo)
    print(f"x={type(foo.x)}")
    print(f"y={type(foo.y)}")
    print(f"z={type(foo.z)}")
    return foo

@dynamic
def _print_foo_dynamic(*, foo: Foo) -> Foo:
    return print_foo(foo=foo)

@workflow
def print_foo_wf(*, foo: Foo) -> Foo:
    return _print_foo_dynamic(foo=foo)
The Flyte UI will show that the type of |
I found a quick fix that seems to be working for us. Is there any reason why it would be a bad idea to do this?
from dataclasses import dataclass
from dataclasses_json import DataClassJsonMixin
from flytekit import dynamic, task, workflow
from flytekit.types.pickle import FlytePickle

@dataclass
class Foo(DataClassJsonMixin, FlytePickle):
    x: int
    y: float
    z: int | float

@task
def print_foo(*, foo: Foo) -> Foo:
    print(foo)
    print(f"x={type(foo.x)}")
    print(f"y={type(foo.y)}")
    print(f"z={type(foo.z)}")
    return foo

@dynamic
def _print_foo_dynamic(*, foo: Foo) -> Foo:
    return print_foo(foo=foo)

@workflow
def print_foo_wf(*, foo: Foo) -> Foo:
    return _print_foo_dynamic(foo=foo)
It appears that this allows us to use JSON dataclasses as we currently are, but forces Flyte to pickle them without any other code changes. |
cc @wild-endeavor / @EngHabu |
While FlytePickle did get us unblocked, it causes flyte caching to break. |
If I switch to using the mashumaro mixin as recommended in the flyte docs then
Fairly confident this is a flyte issue because it doesn't happen with just mashumaro:
|
If you really, really want to be able to store the type of the value, you could use a more complex data class that also stores the type of the value as well:
from dataclasses import dataclass
from mashumaro.mixins.json import DataClassJSONMixin

@dataclass
class MyOutput(DataClassJSONMixin):
    _value: float
    _is_int: bool

    @classmethod
    def from_value(cls, value: float | int) -> "MyOutput":
        assert isinstance(value, (float, int)), \
            f"value must be of type int or float, not {type(value)}"
        if isinstance(value, int):
            return cls(_is_int=True, _value=float(value))
        else:
            return cls(_is_int=False, _value=value)

    @property
    def value(self) -> float | int:
        if self._is_int:
            return int(self._value)
        else:
            return float(self._value)

    def __repr__(self):
        _type = "int" if self._is_int else "float"
        return f"{self.__class__.__name__}({self.value}: {_type})"

my_output_instance = MyOutput.from_value(3.1232)
serialized = my_output_instance.to_json()
deserialized = MyOutput.from_json(serialized)
print(deserialized)  # MyOutput(3.1232: float)

my_output_instance = MyOutput.from_value(33)
serialized = my_output_instance.to_json()
deserialized = MyOutput.from_json(serialized)
print(deserialized)  # MyOutput(33: int)
I do frequently use such |
grantham, did you see my last comment? We are using DataClassJSONMixin, but we still see this bug |
We've looked at similar approaches as well as custom encodings to force the input to look like an arbitrary string but this all seems like an unnecessary burden. Fundamentally, if I type a dataclass field as
@dataclass
class Foo(DataClassJSONMixin):
    x: Any
or as a numeric union type:
@dataclass
class Foo(DataClassJSONMixin):
    x: int | float
Flyte should not cast integer inputs to floats. It is no longer following the schema I've provided. In our case, a simple When typed as |
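For reference, the "arbitrary string" encoding mentioned above could look roughly like the sketch below. It uses only the standard library; `UserInput`, `payload_json`, and `from_payload` are hypothetical names, and the sketch assumes a plain string field is passed between tasks unchanged.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class UserInput:
    # Hypothetical field: the arbitrary JSON is carried as an opaque string.
    payload_json: str

    @classmethod
    def from_payload(cls, payload: Any) -> "UserInput":
        return cls(payload_json=json.dumps(payload))

    @property
    def payload(self) -> Any:
        # Decoding on the consumer side restores ints as ints.
        return json.loads(self.payload_json)


original = {"retries": 3, "threshold": 0.5, "nested": {"ids": [1, 2]}}
wrapped = UserInput.from_payload(original)
decoded = wrapped.payload
print(type(decoded["retries"]).__name__, type(decoded["threshold"]).__name__)
```

It works, but it is exactly the kind of extra wrapping the comment above describes as a burden: every producer and consumer has to remember to encode and decode.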
I do not believe you understand the functionality of
my_output_instance = MyOutput.from_value(33)
serialized = my_output_instance.to_json()
deserialized = MyOutput.from_json(serialized)
print(deserialized)  # MyOutput(33: int)
You may also return However, if you are literally trying to return arbitrarily nested dictionaries / lists of floats and/or ints, the idea of having a task that returns an output like If you are working with complex, arbitrarily nested data, you might consider writing it to an actual file ( |
I do understand the dataclass you wrote - as I said we have a similar approach in mind - but consider the difference in maintainability between what you have above and simply writing the union type Our pipeline and CLI tooling handles the transformations on this arbitrary JSON input as well as the business logic around our strongly typed inputs quite well with My point is not that it is impossible for us to hack around this limitation: it's that this is not a good UX and not at all an intuitive way to handle a type annotated as |
@jiwidi / @gvashishtha / @fellhorn / @ai-rnatour
** Caching offloaded, pickle, file, directory, dataframes with custom hashmethods **
def hash_pandas_dataframe(df: pd.DataFrame) -> str:
    return str(pd.util.hash_pandas_object(df))

# Type hint to use for pandas dataframes in flyte. This adds information for how to hash them so they can be cached.
CacheablePandasDataFrameT = Annotated[pd.DataFrame, HashMethod(hash_pandas_dataframe)]
For various things we will consider producer (one that produces the data object) and consumer (one that consumes the data object). Note they can be different languages, different machines etc. So Flyte uses Protocol buffers to transfer the data,
** dict passing **
** dataclasses and ints **
So if you have a dataclass
It can be easily forced to an int. But now consider if you have a type
The data gets transferred as a double/float value. And on the receiver side it is ambiguous to identify the type.
As it can only use the widest type. And as @fellhorn correctly pointed this sticks to sadness that is JSON schema. A solution might be to add RunTime type identification. But this is very expensive and may have significant overhead at runtime. One solution is to avoid union types where types are ambiguous - ints/floats/doubles |
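The producer/consumer ambiguity described above can be sketched in a few lines of plain Python. This is a simulation under stated assumptions, not flytekit's implementation: `send` and `receive` are hypothetical helpers, and the "wire" is just a Python float standing in for a double-valued field.

```python
from typing import Union, get_args, get_origin


def send(value):
    """Hypothetical wire encoding: every number becomes a double."""
    return float(value)


def receive(wire, annotation):
    """Hypothetical consumer that only has the annotation to go on."""
    if annotation is int:
        if not wire.is_integer():
            raise ValueError(f"{wire!r} has a fractional part")
        return int(wire)  # unambiguous: the annotation forces an int
    if annotation is float:
        return wire
    if get_origin(annotation) is Union and set(get_args(annotation)) == {int, float}:
        # Ambiguous: nothing on the wire says which member was sent,
        # so the consumer falls back to the widest type.
        return wire
    raise TypeError(f"unsupported annotation: {annotation!r}")


plain = receive(send(1), int)
ambiguous = receive(send(1), Union[int, float])
print(type(plain).__name__, type(ambiguous).__name__)
```

With a plain `int` annotation the value can be forced back; with the union the original type is not recoverable without extra runtime type information.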
I think you have a mistake here. Here "x" can't become a float value unless you override serialization strategy for
from dataclasses import dataclass
from mashumaro.mixins.json import DataClassJSONMixin

@dataclass
class Foo(DataClassJSONMixin):
    x: int

print(Foo.from_json('{"x": 1.0}'))  # Foo(x=1)
Anyway, if you need to differentiate between integers and floats during deserialization from JSON, you can override the strategy globally. I can see that flytekit is deserializing JSON data into a python object in two ways:
In the first case you need a dialect:
from typing import Any, TypeVar, Type, Callable
from dataclasses import dataclass
from mashumaro.dialect import Dialect
from mashumaro.codecs.json import JSONDecoder

T = TypeVar("T", bound=Type)

def exact_type_unpacker(typ: T) -> Callable[[Any], T]:
    def unpack_value(value: Any) -> T:
        if isinstance(value, typ):
            return value
        raise ValueError(f"{value!r} is not of type {typ}")
    return unpack_value

class FlyteKitDialect(Dialect):
    serialization_strategy = {
        int: {"deserialize": exact_type_unpacker(int)},
        float: {"deserialize": exact_type_unpacker(float)},
    }

@dataclass
class Foo:
    x: int | float
    y: float | int

decoder = JSONDecoder(Foo, default_dialect=FlyteKitDialect)
print(decoder.decode('{"x": 42, "y": 42}'))  # Foo(x=42, y=42)
print(decoder.decode('{"x": 42.0, "y": 42.0}'))  # Foo(x=42.0, y=42.0)

decoder = JSONDecoder(int | float, default_dialect=FlyteKitDialect)
print(decoder.decode("42"))  # 42
print(decoder.decode("42.0"))  # 42.0

decoder = JSONDecoder(float | int, default_dialect=FlyteKitDialect)
print(decoder.decode("42"))  # 42
print(decoder.decode("42.0"))  # 42.0
In the second case this decoder can also be used but you would need to more clearly distinguish between If for some reason you don't want to use
I would personally prefer to use And one more thing. If you've been bitten by casting to the first variant type in
from mashumaro.codecs.json import JSONEncoder
from mashumaro.helpers import pass_through

class FlyteKitDialect(Dialect):
    serialization_strategy = {
        int: {
            "deserialize": exact_type_unpacker(int),
            "serialize": pass_through,
        },
        float: {
            "deserialize": exact_type_unpacker(float),
            "serialize": pass_through,
        },
    }

encoder = JSONEncoder(Foo, default_dialect=FlyteKitDialect)
The same configuration burden will fall on the end user if this encoder is not used to serialize dataclasses with mixins. |
Folks, I just made an issue that might solve all these problems. Check it out and please comment. |
A somewhat related issue I just found with pydantic objects |
Describe the bug
I'm trying to use the int type throughout my workflow, but it keeps being parsed as float, and Flyte complains about it when the value is passed downstream to other tasks. I even tried to cast it within a task, but it gets cast back to float.
Expected behavior
For objects within tasks to remain typed as int or flyte to cast them properly (complains about not being able to cast a float with no decimals to int, which is possible)
Additional context to reproduce
I have this main.py file containing my workflow:
And a config.yaml where the int should be read from:
When I run this with the command
I get the error:
The YAML value is typed as int and I even cast it again within the read task, but it still gets cast to float.
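The conversion the expected-behavior section asks for (a float with no decimals becoming an int) can be sketched as a small helper to call inside the receiving task. `as_exact_int` is a hypothetical name and this is only a task-side workaround sketch, not something Flyte provides.

```python
def as_exact_int(value):
    """Return value as an int if that loses no information, else raise."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("bool is not accepted as an integer here")
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    raise ValueError(f"{value!r} cannot be converted to int without loss")


batch_size = as_exact_int(32.0)
print(batch_size, type(batch_size).__name__)
```

A value such as `32.5` raises instead of being truncated silently.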