Tiling datasets for performance #761
-
We might as well do square tiles since we work in state plane (square) projection. And squares fit neatly into one another unlike hexagons, so we could do multiple levels of zoom if we wanted to for some reason (I know this is a selling point for h3, but to be honest I can't quite think of why we would need to move up and down between zoom levels).
-
A potential other enhancement would be actual vector tiling - for the expensive join that still needs to be done, actually tile the "join" dataset (break it into multiple rows/polygons). This seems like it could be useful but is a little less easy to generalize, so it needs some thought as to when we'd want it and how to implement it.
-
I recently implemented some very basic, mid-query tiling. This was somewhat needed (it was one of a few possible solutions) due to the nature of the join in that query - joining one monstrous polygon, whose extent covers most of NYC, to PLUTO tax lots.
ST_Intersects works by first checking for intersection of bounding boxes (cheap) and then checking for actual intersection of the geometries (expensive). In this case, since this huge (but internally somewhat sparse) polygon's bounding box is essentially the whole city, the costly exact check must be performed for every PLUTO tax lot. But instead of joining it directly to PLUTO's 800,000+ rows, we can first join it to a few hundred tiles.
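For reference, the problematic join has roughly this shape (the table names here are placeholders, not the actual models):

```sql
-- Naive join: the shadow polygon's bounding box covers nearly the whole city,
-- so the cheap bounding-box test passes for almost every lot and the expensive
-- exact intersection test runs against all 800,000+ PLUTO rows.
SELECT lots.bbl
FROM dcp_mappluto AS lots
JOIN natural_resource_shadows AS shadows
  ON ST_Intersects(lots.geom, shadows.geom);
```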
To get a tileset that represents possible intersections with our monster polygon, we first intersect it with a grid of tiles (hexagons or squares). This greatly improves performance, because PLUTO can then be joined to these much more regular tiles before attempting the costly join against the polygon itself. Though looking at the resulting coverage, I'm a little surprised that we see the boost we do - most of NYC still intersects with these hexagons, yet runtime for the join dropped from 36 minutes to 10 by adding this intermediate join. Maybe this is more related to all the natural resource datasets combined rather than this mega-polygon for natural resource shadows specifically.
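A rough sketch of that intermediate join (PostGIS 3.1+ for ST_HexagonGrid; the table names and the 500 ft tile size are assumptions, not the actual query):

```sql
-- Build hexagons over the monster polygon's extent and keep only those that
-- actually intersect it; they stand in for the polygon in the first join.
WITH hexes AS (
    SELECT hex.geom
    FROM natural_resource_shadows AS shadows
    CROSS JOIN LATERAL ST_HexagonGrid(500, shadows.geom) AS hex  -- size in projection units (state plane ft)
    WHERE ST_Intersects(hex.geom, shadows.geom)
)
-- Each hexagon has a small bounding box, so the cheap bbox check now prunes
-- most lots before any expensive exact intersection test runs.
SELECT DISTINCT lots.bbl
FROM dcp_mappluto AS lots
JOIN hexes
  ON ST_Intersects(lots.geom, hexes.geom)
JOIN natural_resource_shadows AS shadows
  ON ST_Intersects(lots.geom, shadows.geom);
```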
Regardless, it greatly boosted performance, though with a trade-off. The smaller these hexagons get, the easier it is to quickly see whether a PLUTO lot might intersect something. But in this query, these hexagons representing the buffers are calculated on the fly, and the smaller the hexagon, the longer the tileset takes to create (at the logical extreme, if hexagons get down to the size of tax lots, we're essentially just joining tax lots to our buffers directly).
UNLESS we start precomputing tiles in ingest/library code and taking approaches to reduce dimensionality. This is a relatively common technique, used in h3 as well as in some more traditional non-hexagonal methods. Say we break NYC into a square tiled grid.
And then number the tiles: we can just go row by row, numbering tiles 1, 2, 3 along the top row, 14, 15, 16 on the second, all the way to 182 (the number of tiles in this 13x14 grid). And say these tiles were stored in data library as `dcp_de_tiles`. We could precalculate tiles for all rows in all datasets, so in `dcp_mappluto` we'd have a `tiles` column containing an array of all intersecting tile numbers.
When we load this into postgres, we can create a GIN index on that column. This is essentially a multi-index - multiple "rows" of the index for one row of the table - meaning we don't have to explicitly expand the array column in order to query it efficiently. Then, we can do something like
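(a sketch rather than the actual query - `natural_resource_buffers`, `tile_id`, and the column layout are assumptions here):

```sql
-- Illustrative schemas: both sides carry a precomputed integer[] of the
-- dcp_de_tiles numbers their geometry intersects.
--   dcp_mappluto(bbl, geom, tiles integer[])
--   natural_resource_buffers(id, geom, tiles integer[])

-- Precomputing the array (in practice this would happen during ingest):
UPDATE dcp_mappluto AS lots
SET tiles = (
    SELECT ARRAY_AGG(t.tile_id)
    FROM dcp_de_tiles AS t
    WHERE ST_Intersects(t.geom, lots.geom)
);

-- The GIN index stores one entry per tile number per row, so the array column
-- can be searched without expanding it.
CREATE INDEX mappluto_tiles_gin ON dcp_mappluto USING GIN (tiles);
CREATE INDEX buffers_tiles_gin ON natural_resource_buffers USING GIN (tiles);

-- Integer-array overlap (&&) as a cheap first pass; the exact spatial test
-- only runs on pairs that share at least one tile.
SELECT lots.bbl, buffers.id
FROM dcp_mappluto AS lots
JOIN natural_resource_buffers AS buffers
  ON lots.tiles && buffers.tiles
 AND ST_Intersects(lots.geom, buffers.geom);
```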
(the && operator checks whether the two arrays share any value, and it can use a GIN index)
This is roughly what I'm doing with hexes in the original linked query, but at this point we're just joining on integer tile numbers (lightning fast!) as a first check before doing the costly spatial join. And since we're computing these tile integers automatically during pre-processing, we don't pay the build-time cost of joining to smaller tiles.