Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GEOS C API is slow when doing millions of things #1735

Closed
hobu opened this issue Nov 27, 2017 · 13 comments
Closed

GEOS C API is slow when doing millions of things #1735

hobu opened this issue Nov 27, 2017 · 13 comments
Milestone

Comments

@hobu
Copy link
Member

hobu commented Nov 27, 2017

Point-in-Polygon in filters.crop is rather slow because the GEOS C API demands that we heap allocate/deallocate a CoordinateSequence and a Point for each point. There are two possibilities to attack the issue:

  • Rewrite to use our own Point-in-Polygon, but it would need to support Multi* and holes.
  • Rewrite to use the GEOS C++ API, which is "unsupported", but would allow us to stack allocate or even reuse CoordinateSequence/Point, but we would want to refactor all of the pdal::geos stuff to use the C++ API, and that would make this effort quite a bit larger.
@mathieu1
Copy link
Contributor

That would be extremely useful. filters.crop is indeed the main bottleneck for our pdal pipelines right now.

Not sure if it's helpful, but looking around, it seems Postgis and Shapely use PreparedGeometry for big Point-in-Polygon batch computations. I'm not familiar with GEOS, but isn't PreparedGeometry available in the C API?

@hobu
Copy link
Member Author

hobu commented Nov 27, 2017

@mathieu1 We're doing PreparedGeometry right now, even with the C API. The issue has been GEOS C API demands immutability and forces a bunch of copies of things. The C++ API doesn't have this restriction, but it has others 😄

Our own PiP for simple polygons would be straightforward, but people start putting in Multi* and polygons with holes, etc. and they expect it all to just work.

@busstoptaktik
Copy link

busstoptaktik commented Nov 27, 2017 via email

@hobu
Copy link
Member Author

hobu commented Nov 27, 2017

pnpoly does that kind of stuff

Yep, that was the one I was looking at too.

@abellgithub
Copy link
Contributor

I think the algorithm referenced above is slow if the polygon is large. If you're going to process lots of tests for the same polygon, which is our typical use case, some preprocessing can improve things a lot, at the expense of using some memory. I may be mistaken. Also, this may be a nice place to try some GPU processing.

@abellgithub
Copy link
Contributor

abellgithub commented Jan 2, 2018

I've implemented a version of this: http://ieeexplore.ieee.org/document/6694387/

This algorithm divides the polygon space up in a grid - there’s some heuristics used to determine the grid size that could probably be tweaked with some experimentation. Each polygon edge that intersects a grid cell is “attached” to that grid cell. When a PiP request is made for a point, we determine which cell the point is in. The first time such a request is made for a cell, it chooses a randomish reference point in the cell and then determines whether the reference point is in/out by checking from outside the polygon or some other reference point by doing the standard crossing counting. In our case, though, we only need to test intersections of edges that are in cells that are traversed traveling to our reference point, rather than all the edges. The reference point is then marked either inside or outside. Once the reference point is established for a cell, it’s done forever. To actually test the point, there are two cases: If there are 0 edges in the cell, then the classification (in/out) of ANY point in the cell is the same as the reference point and we return it. If there are some edges, we just do the standard crossing counting from the requested point to the reference point using only the edges that are “attached” to the cell, which means you may check a couple of edges instead of 100 or 1000 or whatever. I think that for most polygons we deal with, there is a lot of empty space in the interior of the polygon, so my guess is that most cells don’t contain any edges at all, which means a simple lookup.

I think the implementation is a little better than the paper because 1) it picks reference points such there are no degenerate cases to be dealt with 2) it postpones creating reference points until a request is made. If you never do a PnP check for a cell, there’s no reason to actually compute a reference point at all. The paper always used cell centers as reference points, but there’s nothing special about cell centers.

@hobu hobu added this to the 1.7 milestone Jan 5, 2018
@dr-jts
Copy link

dr-jts commented Feb 16, 2019

Is there a third option?

  • Add a function in the GEOS C API which just takes an X,Y pair and runs the PIP calculation (using the fast indexed Point In Area algorithm)

@abellgithub
Copy link
Contributor

This is already complete. If you want this implemented in GEOS, you should open a ticket there, but the last time I was aware they wanted GEOS to strictly be a port from JTS, so they may not be interested in alternative implementations.

@dr-jts
Copy link

dr-jts commented Feb 18, 2019

Yes, understood this is a closed issue.

I am a dev on the GEOS & JTS projects. There's two threads here, both of which I'm interested in following up on:

  1. Can the GEOS API be made faster by providing lower-level calls (i.e. to avoid need for allocating objects)? This seems very worthwhile, and doesn't impact consistency with JTS (especially for C API)
  2. Is this PIP algorithm faster than the one in GEOS? This requires a bit more investigation, and would be a bigger effort to port. There may be some potential/precedent for adding algorithms to GEOS in addition to or ahead of JTS.

@abellgithub
Copy link
Contributor

abellgithub commented Feb 18, 2019

  1. Don't know. I'm not involved in GEOS design. As a user, I'd prefer options at a higher level than having to make a bunch of low level calls, but that may just be me.
  2. Faster? Well, it depends. If you want to do one PIP test, then no. If you're going to test a bunch of points against a polygon, then yes, much faster. I remember at least 10x improvement and maybe towards 50x on my test data. Can't remember for sure. Performance is data dependent.

If I were considering integrating this algorithm, I don't think I'd care about the single point in a single polygon use case, because it's probably "fast enough" in that such an operation isn't probably much of your processing time anyway. The only real case where performance might be an issue is where you're doing an operation like:

std::vector<std::pair<point, polygon>> v;

for (auto p : v)
  if (pip(p.first, p.second))
    do_something()

I wouldn't think this would be a common use case, but you know that SOMEONE will do it and complain. That said, I haven't benchmarked this case because it's not a normal case for our processing.

One last note: I've looked at the GEOS code and it tends to use generalized functions. For example, the pip code uses generic polygon crossing code. The method used in that code may be necessary to generate some information users may want WRT crossing, but some of the computation isn't necessary for pip processing and better math can be used if all you care about is pip.

@dr-jts
Copy link

dr-jts commented Feb 18, 2019

The original comment on this ticket was that GEOS is slow because the API requires Point objects, which require a few allocs. If that's a barrier to using GEOS then we're interested in improving that. (This will probably benefit PostGIS as well, so it's directly in our target space).

As for algorithm performance, understood that this algorithm is only better for large batch PIP scenarios. JTS/GEOS has a similar capability (IndexedPointInAreaLocator). I was assuming that's what PDAL was trying to use (otherwise would be very slow...).

Anyway, we'll be looking at improving GEOS API for performance sometime soon, and PIP is a prime candidate for this. Not sure if/when we'll be able to compare the two fast PIP approaches.

@hobu
Copy link
Member Author

hobu commented Feb 18, 2019

As for algorithm performance, understood that this algorithm is only better for large batch PIP scenarios. JTS/GEOS has a similar capability (IndexedPointInAreaLocator). I was assuming that's what PDAL was trying to use (otherwise would be very slow...).

We were using PreparedGeometry and pushing point objects across the C API. There's no IndexedPointInAreaLocator exposed in the GEOS C API unless it is through PreparedGeometry. I'm not sure on the particulars...

that GEOS is slow because the API requires Point objects

IIRC, every PiP call was making at least three heap allocations of a CoordinateSequence objects in the GEOS C API. When we looked at things in the profiler, that's all we saw once we tried to do a million of them 😄

GEOS devs are aggressively dismissive of use of the C++ API in GEOS. They flat deprecated to require obnoxious #define to enable it a few releases ago. That's fine, but the C API is a heap hog and doesn't provide any convenience for users to pass in their own data structures. Any redesign effort of the C API to cut down on the heap thrashing is going to be met with a ton of resistance – the rails are already ensconced. As the JTS developer lead, you certainly have the credibility to overrule the GEOS design choices, and I hope you can make some headway.

@dr-jts
Copy link

dr-jts commented Feb 18, 2019

Thanks for the clarification, @hobu. I will do what I can to make everyone happy :).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants