-
Hi all, I am trying to evaluate cuSpatial and the recent cuDF and Dask updates in RAPIDS for a geospatial workflow. I have a 2 billion row Parquet table that contains point data (68 GB as Parquet, 480 GB as uncompressed CSV), and I have also created a single polygon so I can test point-in-polygon across this data.

%%time
path = "/ssd/Vegas/datafiles/parquet/vegas/*"
#dask-sql
c = Context()
# create a table and register it in the context
c.create_table("parquet", glob(path), gpu=True)
xy = c.sql("""SELECT lon,lat FROM parquet""")
%%time
xy_cudf = xy.compute()
xy_cudf_interleaved = xy_cudf.interleave_columns() #OOM here
Is there an option in cuSpatial to not use interleaved columns for the spatial join? (I couldn't see one in the docs.) I appreciate it won't be as efficient, but I am just trying to get a proof of concept working right now and can optimise down the track. Or, if anyone has any suggestions as to how I can manually interleave the dataframe, that would also do for now!
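For a single in-memory chunk I imagine something like the sketch below would do the element-wise interleave (untested, written against cudf/CuPy as I understand them; interleave_lon_lat is just a name I made up and lon/lat are my column names), but I still don't see how to apply it across the full Dask dataframe without realising everything:

import cudf
import cupy as cp

def interleave_lon_lat(df: cudf.DataFrame) -> cudf.Series:
    # Build the flat x, y, x, y, ... layout that
    # cuspatial.GeoSeries.from_points_xy expects.
    lon = df["lon"].values   # CuPy view of the column
    lat = df["lat"].values
    out = cp.empty(lon.size * 2, dtype=lon.dtype)
    out[0::2] = lon          # even slots hold x (lon)
    out[1::2] = lat          # odd slots hold y (lat)
    return cudf.Series(out)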
-
This (interleaving data costing temporary storage) is a great point. We're discussing. Stay tuned!
-
I have tried several methods to get this working, but each time I get stuck when the points array needs to be realised. This is about as close as I have got so far; if anyone has any hints on how to improve it, I am all ears.

import cudf
import dask
import dask.dataframe as dd
import dask.array as da
import dask_cudf as dc
import json
import cuspatial
import pandas as pd
import geopandas as gpd
from shapely import from_wkb, from_wkt
import numpy as np
# Define the path to the parquet files
path_to_parquet_files = "/ssd/Vegas/datafiles/parquet/vegas/*"
# Read in the parquet files using dask_cudf
ddf = dc.read_parquet(path_to_parquet_files)
ddf.head(5)
5 rows × 33 columns

# Get the lon/lat point columns as Dask series
lon_arr = ddf['lon']
lat_arr = ddf['lat']
# NOTE: concatenating along axis 0 stacks all lons followed by all lats;
# it does not produce the element-wise x, y, x, y layout cuspatial expects
interleaved = dd.concat([lon_arr, lat_arr], axis=0, interleave_partitions=True)
# Combine the point arrays into a single array
# point_arr = da.stack([lon_arr, lat_arr], axis=1)

polygon_wkt = 'POLYGON((-115.074444 36.289153, -115.208314 36.325569, -115.208688 36.325646, -115.259397 36.335589, -115.260628 36.335652, -115.260845 36.335658, -115.276407 36.335967, -115.320842 36.33641, -115.333587 36.30627, -115.368573 36.170149, -115.3682 36.168344, -115.36794 36.16714, -115.353159 36.109493, -115.315922 36.023474, -115.298126 35.998029, -115.102856 35.918199, -115.100108 35.918038, -115.095546 35.917781, -115.084946 35.918409, -115.072079 35.923316, -114.918841 35.983522, -114.919047 36.0226, -114.919067 36.022821, -114.919108 36.022997, -114.93976 36.080677, -115.006009 36.219129, -115.007607 36.222047, -115.008162 36.222806, -115.010791 36.225883, -115.06011 36.27717, -115.068572 36.285772, -115.069297 36.286474, -115.069637 36.286803, -115.070046 36.287197, -115.071477 36.288191, -115.071736 36.288332, -115.074214 36.289087, -115.074444 36.289153))'
polygon_gdf = gpd.GeoSeries.from_wkt([polygon_wkt])
polygon_gdf
# Apply the point in polygon algorithm to the data
result = cuspatial.point_in_polygon(
cuspatial.GeoSeries.from_points_xy(interleaved),
cuspatial.from_geopandas(polygon_gdf)
)
result
# Convert the result to a Dask series and set the index
result_series = dc.from_dask_array(result)
result_series = result_series.set_index(ddf.index)
# Join the result series with the original dataframe
ddf_with_result = ddf.merge(result_series.to_frame('point_in_polygon'), how='left', left_index=True, right_index=True)
ddf_with_result.head(10)
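One thing I am also trying (sketch only; partition_size is a guess and I have not verified whether my dask_cudf version supports that argument) is pruning to just the two coordinate columns and repartitioning into smaller chunks before anything gets realised:

# Keep only the columns the join needs, then split into smaller partitions
# so each chunk should fit comfortably in GPU memory.
ddf_xy = ddf[["lon", "lat"]].repartition(partition_size="256MB")
print(ddf_xy.npartitions)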
-
Hi @voycey, we had some internal conversations about this. You may want to look at this dask_cudf feature request (closed, with a suggested workaround) filed by @isVoid: rapidsai/cudf#13308. Also, some guidance on reading Parquet in chunks:
If the row groups in your Parquet files are too large for GPU memory, then you would have a problem. Note that libcudf has a chunked Parquet reader that can read at a finer granularity, but AFAIK it is not exposed in Python yet.
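If you want to check whether you are in that situation, a quick way (pyarrow sketch; the file name below is just illustrative) is to inspect the row group layout of one of your files:

import pyarrow.parquet as pq

# Inspect one file's row groups to see how large each chunk will be
# when it lands on the GPU.
md = pq.ParquetFile("/ssd/Vegas/datafiles/parquet/vegas/part-0.parquet").metadata
print("row groups:", md.num_row_groups)
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    print(i, rg.num_rows, "rows,", rg.total_byte_size, "bytes (uncompressed)")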
-
Ok, just to update this with something that is working (I appreciate the help everyone 🙂 - hopefully my pain helps others trying to get a PoC working). With a bit of help from ChatGPT I understood some of the limitations: primarily that, with Dask, the cuDF DataFrame needs to be built inside each worker, for both the polygon and the points. This meant I also had to wrap the function in a separate function so that the serialization using pickle would work internally too (it didn't like doing this with cudf in it directly, I think). Also, creating the interleaved column INSIDE the function being mapped was obvious in hindsight, as it would only work on partitioned data (which I think …). So the bulk of the code now looks like this and is working pretty efficiently!

%%time
def spatial_join(df_partition, polygon):
    # build the points GeoSeries from the partition's lon/lat columns
    i_ddf = cudf.DataFrame({"x": df_partition['lon'], "y": df_partition['lat']}).interleave_columns()
    points_gseries = cuspatial.GeoSeries.from_points_xy(i_ddf)
    # do the spatial join
    result = cuspatial.point_in_polygon(points_gseries, polygon)
    return result

def wrapped_spatial_join(df_partition):
    # build the polygon on the worker so pickling the mapped function works
    gpolygon = cuspatial.from_geopandas(polygon)
    return spatial_join(df_partition, gpolygon)

%%time
res = ddf.map_partitions(wrapped_spatial_join).compute()
# res = ddf.map_partitions(spatial_join, gpolygon, meta=ddf.head(), align_dataframes=False).compute()

%%time
res.value_counts()
One minute to run a spatial join across around 10 GB of points is pretty great on this GPU.
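As a possible follow-up (untested sketch: in_polygon is just a column name I picked, it reuses the spatial_join helper and polygon GeoSeries above, and Dask's empty-partition meta inference may need an explicit meta=), the flag could be attached to each partition directly instead of collecting a separate result:

def tag_partition(df_partition):
    # build the polygon on the worker, run the join, and attach the
    # single polygon's boolean column back onto the partition
    gpolygon = cuspatial.from_geopandas(polygon)
    pip = spatial_join(df_partition, gpolygon)
    out = df_partition.copy(deep=False)
    out["in_polygon"] = pip.iloc[:, 0].values
    return out

ddf_tagged = ddf.map_partitions(tag_partition)
ddf_tagged["in_polygon"].sum().compute()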
-
@voycey thanks for writing a positive review of cuSpatial and RAPIDS in your post here: https://towardsdatascience.com/data-engineering-fast-spatial-joins-across-2-billion-rows-on-a-single-old-gpu-c6cb531949ed - awesome! What's the next step?