GEOS C API is slow when doing millions of things #1735

hobu · 2017-11-27T16:19:35Z

Point-in-Polygon in filters.crop is rather slow because the GEOS C API demands that we heap allocate/deallocate a CoordinateSequence and a Point for each point. There are two possibilities to attack the issue:

Rewrite to use our own Point-in-Polygon, but it would need to support Multi* and holes.
Rewrite to use the GEOS C++ API, which is "unsupported", but would allow us to stack allocate or even reuse CoordinateSequence/Point, but we would want to refactor all of the pdal::geos stuff to use the C++ API, and that would make this effort quite a bit larger.

The text was updated successfully, but these errors were encountered:

mathieu1 · 2017-11-27T17:34:54Z

That would be extremely useful. filters.crop is indeed the main bottleneck for our pdal pipelines right now.

Not sure if it's helpful, but looking around, it seems Postgis and Shapely use PreparedGeometry for big Point-in-Polygon batch computations. I'm not familiar with GEOS, but isn't PreparedGeometry available in the C API?

hobu · 2017-11-27T17:53:14Z

@mathieu1 We're doing PreparedGeometry right now, even with the C API. The issue has been GEOS C API demands immutability and forces a bunch of copies of things. The C++ API doesn't have this restriction, but it has others 😄

Our own PiP for simple polygons would be straightforward, but people start putting in Multi* and polygons with holes, etc. and they expect it all to just work.

busstoptaktik · 2017-11-27T18:23:36Z

pnpoly does that kind of stuff, although it requires polygon parts to be wrapped in sentinels. https://wrf.ecse.rpi.edu//Research/Short_Notes/pnpoly.html Den 28. nov. 2017 6.54 AM skrev "Howard Butler" <[email protected]>:

…

@mathieu1 <https://github.com/mathieu1> We're doing PreparedGeometry right now, even with the C API. The issue has been GEOS C API demands immutability and forces a bunch of copies of things. The C++ API doesn't have this restriction, but it has others 😄 Our own PiP for simple polygons would be straightforward, but people start putting in Multi* and polygons with holes, etc. and they expect it all to just work. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#1735 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AFWTrWPeh9dGexXRaau5VihhFVB84jbgks5s6vcLgaJpZM4Qr5UO> .

hobu · 2017-11-27T18:26:51Z

pnpoly does that kind of stuff

Yep, that was the one I was looking at too.

abellgithub · 2017-11-30T17:14:40Z

I think the algorithm referenced above is slow if the polygon is large. If you're going to process lots of tests for the same polygon, which is our typical use case, some preprocessing can improve things a lot, at the expense of using some memory. I may be mistaken. Also, this may be a nice place to try some GPU processing.

abellgithub · 2018-01-02T18:44:10Z

I've implemented a version of this: http://ieeexplore.ieee.org/document/6694387/

This algorithm divides the polygon space up in a grid - there’s some heuristics used to determine the grid size that could probably be tweaked with some experimentation. Each polygon edge that intersects a grid cell is “attached” to that grid cell. When a PiP request is made for a point, we determine which cell the point is in. The first time such a request is made for a cell, it chooses a randomish reference point in the cell and then determines whether the reference point is in/out by checking from outside the polygon or some other reference point by doing the standard crossing counting. In our case, though, we only need to test intersections of edges that are in cells that are traversed traveling to our reference point, rather than all the edges. The reference point is then marked either inside or outside. Once the reference point is established for a cell, it’s done forever. To actually test the point, there are two cases: If there are 0 edges in the cell, then the classification (in/out) of ANY point in the cell is the same as the reference point and we return it. If there are some edges, we just do the standard crossing counting from the requested point to the reference point using only the edges that are “attached” to the cell, which means you may check a couple of edges instead of 100 or 1000 or whatever. I think that for most polygons we deal with, there is a lot of empty space in the interior of the polygon, so my guess is that most cells don’t contain any edges at all, which means a simple lookup.

I think the implementation is a little better than the paper because 1) it picks reference points such there are no degenerate cases to be dealt with 2) it postpones creating reference points until a request is made. If you never do a PnP check for a cell, there’s no reason to actually compute a reference point at all. The paper always used cell centers as reference points, but there’s nothing special about cell centers.

dr-jts · 2019-02-16T20:37:55Z

Is there a third option?

Add a function in the GEOS C API which just takes an X,Y pair and runs the PIP calculation (using the fast indexed Point In Area algorithm)

abellgithub · 2019-02-16T22:02:46Z

This is already complete. If you want this implemented in GEOS, you should open a ticket there, but the last time I was aware they wanted GEOS to strictly be a port from JTS, so they may not be interested in alternative implementations.

dr-jts · 2019-02-18T19:39:55Z

Yes, understood this is a closed issue.

I am a dev on the GEOS & JTS projects. There's two threads here, both of which I'm interested in following up on:

Can the GEOS API be made faster by providing lower-level calls (i.e. to avoid need for allocating objects)? This seems very worthwhile, and doesn't impact consistency with JTS (especially for C API)
Is this PIP algorithm faster than the one in GEOS? This requires a bit more investigation, and would be a bigger effort to port. There may be some potential/precedent for adding algorithms to GEOS in addition to or ahead of JTS.

abellgithub · 2019-02-18T20:47:32Z

Don't know. I'm not involved in GEOS design. As a user, I'd prefer options at a higher level than having to make a bunch of low level calls, but that may just be me.
Faster? Well, it depends. If you want to do one PIP test, then no. If you're going to test a bunch of points against a polygon, then yes, much faster. I remember at least 10x improvement and maybe towards 50x on my test data. Can't remember for sure. Performance is data dependent.

If I were considering integrating this algorithm, I don't think I'd care about the single point in a single polygon use case, because it's probably "fast enough" in that such an operation isn't probably much of your processing time anyway. The only real case where performance might be an issue is where you're doing an operation like:

std::vector<std::pair<point, polygon>> v;

for (auto p : v)
  if (pip(p.first, p.second))
    do_something()

I wouldn't think this would be a common use case, but you know that SOMEONE will do it and complain. That said, I haven't benchmarked this case because it's not a normal case for our processing.

One last note: I've looked at the GEOS code and it tends to use generalized functions. For example, the pip code uses generic polygon crossing code. The method used in that code may be necessary to generate some information users may want WRT crossing, but some of the computation isn't necessary for pip processing and better math can be used if all you care about is pip.

dr-jts · 2019-02-18T21:19:00Z

The original comment on this ticket was that GEOS is slow because the API requires Point objects, which require a few allocs. If that's a barrier to using GEOS then we're interested in improving that. (This will probably benefit PostGIS as well, so it's directly in our target space).

As for algorithm performance, understood that this algorithm is only better for large batch PIP scenarios. JTS/GEOS has a similar capability (IndexedPointInAreaLocator). I was assuming that's what PDAL was trying to use (otherwise would be very slow...).

Anyway, we'll be looking at improving GEOS API for performance sometime soon, and PIP is a prime candidate for this. Not sure if/when we'll be able to compare the two fast PIP approaches.

hobu · 2019-02-18T21:35:20Z

As for algorithm performance, understood that this algorithm is only better for large batch PIP scenarios. JTS/GEOS has a similar capability (IndexedPointInAreaLocator). I was assuming that's what PDAL was trying to use (otherwise would be very slow...).

We were using PreparedGeometry and pushing point objects across the C API. There's no IndexedPointInAreaLocator exposed in the GEOS C API unless it is through PreparedGeometry. I'm not sure on the particulars...

that GEOS is slow because the API requires Point objects

IIRC, every PiP call was making at least three heap allocations of a CoordinateSequence objects in the GEOS C API. When we looked at things in the profiler, that's all we saw once we tried to do a million of them 😄

GEOS devs are aggressively dismissive of use of the C++ API in GEOS. They flat deprecated to require obnoxious #define to enable it a few releases ago. That's fine, but the C API is a heap hog and doesn't provide any convenience for users to pass in their own data structures. Any redesign effort of the C API to cut down on the heap thrashing is going to be met with a ton of resistance – the rails are already ensconced. As the JTS developer lead, you certainly have the credibility to overrule the GEOS design choices, and I hope you can make some headway.

dr-jts · 2019-02-18T21:49:37Z

Thanks for the clarification, @hobu. I will do what I can to make everyone happy :).

abellgithub closed this as completed in 5f2779a Jan 2, 2018

hobu added this to the 1.7 milestone Jan 5, 2018

dbaston mentioned this issue Jun 29, 2019

Add FixedSizeCoordinateSequence for Points libgeos/geos#204

Merged

chrispahm mentioned this issue Jan 4, 2024

Performance issue withGEOSPreparedContains libgeos/geos#1024

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GEOS C API is slow when doing millions of things #1735

GEOS C API is slow when doing millions of things #1735

hobu commented Nov 27, 2017

mathieu1 commented Nov 27, 2017

hobu commented Nov 27, 2017

busstoptaktik commented Nov 27, 2017 via email

hobu commented Nov 27, 2017

abellgithub commented Nov 30, 2017

abellgithub commented Jan 2, 2018 •

edited

Loading

dr-jts commented Feb 16, 2019

abellgithub commented Feb 16, 2019

dr-jts commented Feb 18, 2019

abellgithub commented Feb 18, 2019 •

edited

Loading

dr-jts commented Feb 18, 2019

hobu commented Feb 18, 2019

dr-jts commented Feb 18, 2019

GEOS C API is slow when doing millions of things #1735

GEOS C API is slow when doing millions of things #1735

Comments

hobu commented Nov 27, 2017

mathieu1 commented Nov 27, 2017

hobu commented Nov 27, 2017

busstoptaktik commented Nov 27, 2017 via email

hobu commented Nov 27, 2017

abellgithub commented Nov 30, 2017

abellgithub commented Jan 2, 2018 • edited Loading

dr-jts commented Feb 16, 2019

abellgithub commented Feb 16, 2019

dr-jts commented Feb 18, 2019

abellgithub commented Feb 18, 2019 • edited Loading

dr-jts commented Feb 18, 2019

hobu commented Feb 18, 2019

dr-jts commented Feb 18, 2019

abellgithub commented Jan 2, 2018 •

edited

Loading

abellgithub commented Feb 18, 2019 •

edited

Loading