Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expectations of to_dict methods in returning JSON-typed values, regarding geometry #1047

Closed
philvarner opened this issue Mar 16, 2023 · 7 comments · Fixed by #1074
Closed
Assignees
Labels
documentation Issues related to PySTAC documentation
Milestone

Comments

@philvarner
Copy link
Collaborator

Currently, the to_dict method returns a "shallow" dict representation of the object. For example, if the Item has a geometry that has a coordinates value that is a nested tuple of numbers , the resulting dict's geometry key points to that same tuple. However, I think a more common assumption is that to_dict returns a JSON-typed representation of the Item, e.g., with geometry being nested arrays and numbers, regardless of what the internal representation is. Maybe there could be a flag on to_dict to determine this behavior, or another method like to_geojson_dict?

mypy checking doesn't catch this because geometry is typed as Optional[Dict[str,Any]], so it can't catch it. (Union types would help here, allowing the stricter Optional[Dict[str,str|list[list|float]])

The current docstring for to_dict is Generate a dictionary representing the JSON of this serialized object. , which is wrong wrt to the current behavior if non-JSON-compliant tuples are used.

The context behind this is we had an Item that (unintentionally) had tuple geometry coordinate values, and used .to_dict() to create a value to pass to stac-validator validation, expecting that we were getting a JSON-style dict. The error message was:

Exception: STAC Item validation failed. Error: {‘type’: ‘Polygon’, ‘coordinates’: (((-91.00013888888888, 43.00013888888889), (-91.00013888888888, 44.00013888888889), (-92.00013888888888, 44.00013888888889), (-92.00013888888888, 43.00013888888889), (-91.00013888888888, 43.00013888888889)),)} is not valid under any of the given schemas. Error is in geometry .

because the tuple geometry doesn't match the schema of nested arrays of numbers, and then gets str serialized to a value of nested tuples of numbers.

Relatedly, __geo_interface__ will return an invalid value if tuple coordinates are used.

@gadomski gadomski added this to the 1.7.1 milestone Mar 16, 2023
@gadomski gadomski self-assigned this Mar 16, 2023
@gadomski
Copy link
Member

because the tuple geometry doesn't match the schema of nested arrays of numbers, and then gets str serialized to a value of nested tuples of numbers.

Curious here, shouldn't the tool be doing a json.dumps rather than a str conversion?

@philvarner
Copy link
Collaborator Author

Presumably yes. I also think that a jsonschema (openapi schema?) validator that takes a Dict should treat tuples and lists as equivalent to arrays, but here we are.

@gadomski
Copy link
Member

Thinking about this more, enforcing "JSON-correct" types on all to_dict methods would be a relatively heavy lift. I count 30 distinct to_dict definitions, each of which would need to get some sort of recursive "convert tuples to lists" treatment (aside: are there other "JSONify" conversions that we should be tracking as well?). Not impossible, but for sure adds complexity and runtime cost to a commonly-used method.

An alternative might be:

  1. Add a utility funtion pystac.utils.to_json_dict that does a blanket conversion (maybe just with a json.loads(json.dumps(d))).
  2. Update the documentation for each to_dict to point to to_json_dict as a way to ensure all your tuples are lists, etc

I'm a bit hesitant to add extra overhead to to_dict for this use case, because (per my previous comment) well-behaved consumers should be able to handle non-JSON-y dicts.

As another aside, I think we're probably working too hard here -- serialization libraries should take care of this stuff for us. Perhaps we should consider converting the base types to use a serialization library for v2? @TomAugspurger pointed me to https://github.com/jcrist/msgspec as a faster-than-pydantic option.

@Fatal1ty
Copy link

Hi @gadomski

You may also consider an alternative https://github.com/Fatal1ty/mashumaro which is mature, fast, customizable and provides to_dict and from_dict methods. As an author of this library I'd be happy to help.

@philvarner
Copy link
Collaborator Author

Maybe it would be enough just having some documentation that it returns a dictionary representation, not necessarily one that exactly matches the GeoJSON structure based on what objects it was constructed with?

@jsignell
Copy link
Member

As another aside, I think we're probably working too hard here -- serialization libraries should take care of this stuff for us. Perhaps we should consider converting the base types to use a serialization library for v2? @TomAugspurger pointed me to https://github.com/jcrist/msgspec as a faster-than-pydantic option.

I really like this idea of outsourcing serialization/deserialization to another library and letting types provide more of a guarantee and validation step.

@gadomski gadomski modified the milestones: 1.7.1, 1.8 Mar 21, 2023
@gadomski gadomski added enhancement documentation Issues related to PySTAC documentation labels Mar 21, 2023
@jcrist
Copy link

jcrist commented Mar 21, 2023

Hi - msgspec author here. I'm not familiar with STAC, but if you decide to try out msgspec, you might find this GeoJSON example in our docs useful (it does (de)serialization and validation). I'm also willing to help as needed. No pressure to use msgspec of course, there are lots of good and useful libraries in this space.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Issues related to PySTAC documentation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants