Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GeoJSON bounding box parameter #3

Open
russbiggs opened this issue Jan 24, 2025 · 5 comments
Open

GeoJSON bounding box parameter #3

russbiggs opened this issue Jan 24, 2025 · 5 comments
Assignees
Labels
enhancement New feature or request

Comments

@russbiggs
Copy link
Member

For functions that take a bbox parameter it would be nice to also include an optional geojson field. This field would allow users to query against arbitrary polygons using the bounding box functionality under-the-hood.

I see a couple options of how this can work:

  1. Provide GeoJSON as string (or file path?)
    • If its a feature collection get just the first feature? if its a feature check that its geometry type is Polygon
    • Not sure about supporting Multipolygon (see why this might be an issue below)
  2. Compute bounding box of polygon and pass to bbox
  3. (optional) Provide additional option to clip results to original GeoJSON polygon (because bbox will over fetch compared to original shape) or the base option would not clip and just provide results from bbox of polygon
@russbiggs russbiggs added the enhancement New feature or request label Jan 24, 2025
@AarshBatra
Copy link

Hi Russ, I had three follow up questions here, let me know in case I am misinterpreting something:

  • If it's a feature collection, why do we choose the first feature for the bbox calculation? I am thinking about the scenario in which a user may pass for e.g. a geojson with multiple features in a feature collection (e.g. state level polygons for the US) and may want a bbox of US, instead of the first feature (which may be a specific state).
  • If clipping is to be supported, we might need an additional step of converting the data frame into a sf object, on which the geospatial operation of clipping can then be performed. Do all functions with a bbox parameter, return a data frame or a list of data frames? Here, one small cautionary note: if the initial feature collection is too big, then the masking/clipping operation may be heavy.
  • Why would supporting Multipolygons become an issue in this context?

@russbiggs
Copy link
Member Author

russbiggs commented Jan 26, 2025

  • For a feature collection we'll need to make some sort of opinionated concession because it can get too complicated quickly. A feature collection can contain multiple types of geometries, so following your idea of doing a box of all features do we allow the bbox to be of all feature? What if it's a polygon a point and a line, you can technically find a bbox of that but would likely be unexpected results for most users. I think choosing just the first to enforce a single feature as the input. In your US example why not just provide a national boundary feature instead of ~50 state territories features? In short just enforcing and encouraging a single feature as the input.
  • for the clipping you are right this is the big elephant in the room but in reality only a thousand features wouldn't be too intensive in most case unless it's a super complex polygon. Either way we may just index the point set to improve. We would just want to measure this and ensure it's relatively robust and fast enough in most cases
  • in my experience multipolygons can get complicated fast when dealing with user input. That said you US example above would be a reasonable case if using a national boundary feature that would have to be a multipolygon. The issues often come from poorly formed multipolygons and donuts. We'd likely just need to test and want that mileage may vary with complex shapes and just encourage simplicity

@AarshBatra
Copy link

AarshBatra commented Jan 26, 2025

  • In the US example, I was thinking about it from the perspective of a user who might already be working with a state level geojson or shapefile and may not want to do the additional step of unionizing it first to get a country level boundary. They may want to directly pass it and figure out a bounding box for the country. But, I agree, yes one can get unexpected results if there is a mix bag of different geometries. I kind of was thinking about this by assuming the onus is on the user to put in a vetted geojson in the proposed geojson parameter, from which they want to extract a bounding box. So, if we allow only a single feature, than for someone who needs a bounding box for US and currently has a state level shapefile - they will first collapse it to a national boundary and then pass it on. Otherwise, they'll end up getting a bounding box for the first state in the feature collection, which might not be what they intended. I guess the documentation of the function can spell this out in an example to further clarify, why we pick the first feature, that can guide the user on how to properly pass on the geojson at a desired admin level.
  • If we just allow a single feature in geojson, then clipping a thousand features to it may not be a big concern, yes. Also, in my experience, I have noted that if we rasterize a shapefile and then clip/mask to it, it's much faster than masking to a complicated boundary polygon.
  • Agreed. Poorly formed multipolygons can complicate things and become difficult to handle. Not allowing them has its benefits in terms of simplicity for sure, but it does add an additional step for the user to convert whatever they have into a polygon, that's a tradeoff.

Also in many cases, at least in my experience, I have seen people working a lot with gpkg and even .shp. One more thing to brainstorm is whether it would make sense to have a more general parameter (e.g. shpfile, instead of geojson) that would allow user to put in a shapefile or a geojson file/geojson string, just a thought. Once read into R as an sf object, they all behave largely the same so extracting bbox should be straightforward. When it comes to validating those formats, that's something I'll have to look into, but I don't think it should be an issue, correct me if I am wrong.

@russbiggs
Copy link
Member Author

  • I prefer to trade a bit of extra onus on the user to set stuff up correctly for a much simpler and clearer explanation of the behavior, i.e. only one feature, only polygon, if FeatureCollection then the first feature if chosen. As a higher level feature in the package we can only simplify the work so much without adding too many caveats that come with it.
  • With proper spatial indexing we wouldnt need to rasterize, building an index then doing the join will be much more performant, but again since its only at most 1000 features at a time, the performance may not be a significant issue. Benchmarking will be the best way to know.
  • The more i think about multipolygons the more i think we don't want to outright not allow them. Just provide warnings that complex geometries may be slow or may have less intuitively simple results.

I agree with wanting to support multiple formats, thats interesting and heartening to hear that you see gpkg in the wild frequently. The complexity with these other formats is they can be in other projected coordinate systems, meaning the process in the openaq package would need to translate datums and reproject. GeoJSON on the other hands only supports WGS84 by spec, which greatly simplifies things. You are right once everything gets into something like sf its smooth sailing, but getting there is a little more complicated.

@AarshBatra
Copy link

AarshBatra commented Jan 27, 2025

Sticking with geojson for now, selecting first feature for bbox and warnings in case Multipolygons are allowed, sounds good.

Yes, benchmarking will guide us about ideal thresholds/limits. The microbenchmark package is very handy for this.

@russbiggs russbiggs self-assigned this Jan 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants