Is your feature request related to a problem? Please describe.
Using NVTabular (alone), it is not currently possible to write out a Parquet dataset with a deterministic row count in each output file. Such determinism may be desired by users who wish to perform distributed multi-GPU training for many epochs (e.g., with Horovod).
Describe the solution you'd like
I feel that the determinism issue can be solved by introducing the following features within NVTabular:
1. The user must be able to pass the following kwargs through Dataset.to_parquet to the underlying write engine: row_group_size_rows (cudf-22.02 and later) and row_group_size (pyarrow). Currently, NVTabular always uses the default row-group size when it writes a DataFrame partition to disk. This is problematic for users who want their dataset to be perfectly divided between n processes at training time. For a "perfect" data balance to be possible, (almost) all row-groups must contain a deterministic row count (see the usage sketch after this list).
2. Dataset must support a custom method (perhaps rebalance_rowcount) to both repartition and redistribute data between partitions so that every partition contains the same number of rows (with the possible exception of a single "residual" partition). Without this feature, (1) is not useful on its own, because the number of rows within a partition is not guaranteed to be divisible by the chosen/required row_group_size_rows value.
3. On the "read" side of Dataset, we need a parameter (perhaps partition_factor) specifying an integer that the total number of partitions must be an exact multiple of. For example, if the user specifies partition_factor=8 and part_mem_fraction=0.1, then NVTabular should find the largest partition size (<=10% of total device memory) for which the total number of partitions is divisible by 8.
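To make these concrete, here is a rough sketch of how the three proposals might be exercised together. None of this is an existing API: the row_group_size_rows pass-through, rebalance_rowcount, and partition_factor are only the tentative names suggested above, and the numbers are arbitrary.

```python
import nvtabular as nvt

# (3) Read: request the largest partition size (capped at 10% of device
#     memory) for which the total partition count is divisible by 8.
#     `partition_factor` is a proposed kwarg, not an existing one.
ds = nvt.Dataset(
    "input_data/*.parquet",
    engine="parquet",
    part_mem_fraction=0.1,
    partition_factor=8,
)

# (2) Rebalance so that every partition holds exactly 1_000_000 rows,
#     except (at most) one residual partition. `rebalance_rowcount` is a
#     proposed method; return/in-place semantics are left open here.
ds = ds.rebalance_rowcount(1_000_000)

# (1) Write: forward a fixed row-group size to the underlying engine
#     (`row_group_size_rows` for cudf>=22.02, `row_group_size` for pyarrow).
ds.to_parquet(
    "balanced_output/",
    row_group_size_rows=125_000,
)
```

With these numbers, each balanced partition would decompose into exactly eight 125,000-row row-groups, so every output file has a deterministic row count.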
In the case that the size of the global dataset is perfectly divisible by some expected number of Horovod workers (say 8), the three features listed above should make it possible for the user to produce a "perfectly" balanced dataset from which efficient distributed training can be performed.
This leaves us with the obvious problem that we cannot assume the dataset will always be divisible by the desired number of distributed workers. This means we will need to decide what to do in features (2) and (3) when there is a residual row and/or row-group count. Perhaps it is sufficient if (2) can be guaranteed to produce, at most, a single "misfit" partition (of minimum size), and (3) allows this final misfit partition to be treated specially (and optionally ignored)?
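To illustrate the residual question with simple arithmetic, the size of the single "misfit" partition in (2) is just the remainder of an integer division. The helper below is purely hypothetical and only demonstrates the intended invariant:

```python
def plan_balanced_partitions(total_rows: int, rows_per_partition: int):
    """Return (n_full_partitions, residual_rows): all partitions hold
    exactly `rows_per_partition` rows except at most one trailing
    "misfit" partition holding `residual_rows` rows."""
    return divmod(total_rows, rows_per_partition)

# Divisible case: 8_000_000 rows in 1_000_000-row partitions gives 8 full
# partitions and no residual -- a perfect fit for 8 workers.
print(plan_balanced_partitions(8_000_000, 1_000_000))   # (8, 0)

# Residual case: 8_250_000 rows leaves a single 250_000-row misfit
# partition that (3) could treat specially or optionally ignore.
print(plan_balanced_partitions(8_250_000, 1_000_000))   # (8, 250000)
```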
Notes: Why we want deterministic and balanced row-groups/partitions:
- Can avoid "spilling" partial batches between partitions during data loading
- Can minimize work imbalance (always bad for parallel performance!)
- Can avoid the need for users to explicitly partition their data into a distinct file for each worker