Description
I'd like merge to offer a `partitions` argument to reduce table churn and processing.

Use Case
Currently the merge operation consumes the whole table and merges in new source data, which re-writes every file in the table: https://gist.github.com/emcake/acc1aa233339a5b3534e2f54702dd46e

For large tables where the updated surface area is small, this is inefficient. I'd propose a new parameter `partitions: Optional[Union[List[PartitionValues], Literal['auto']]] = None`, which provides a list of partitions that the merge operation is restricted to.

Usage:

```python
# current behaviour
table.merge(source_data, partitions=None)

# restrict merge to files with the listed PartitionValues.
# If data in source_data is outside these partitions, it's dropped.
table.merge(source_data, partitions=[{ 'col_a' : 1, 'col_b' : 'foo' }, {'col_a' : 1, 'col_b' : 'bar'}, ...])

# use the table partition columns and find all distinct tuples of values in source_data
table.merge(source_data, partitions='auto')
```
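To make the proposed semantics a bit more concrete, here is a minimal sketch in plain Python of the two behaviours described: deriving the partition list for `'auto'`, and dropping source rows that fall outside an explicit list. It is only an illustration under assumptions. The helper names `distinct_partition_values` and `restrict_to_partitions` are hypothetical, the `PartitionValues` alias is assumed to be a mapping of partition column to value, and rows are modelled as plain dicts rather than whatever the library uses internally.

```python
from typing import Any, Dict, Iterable, List, Sequence

# Assumption: one partition is a mapping of partition column -> value.
PartitionValues = Dict[str, Any]


def distinct_partition_values(
    rows: Iterable[Dict[str, Any]], partition_columns: Sequence[str]
) -> List[PartitionValues]:
    """Hypothetical helper: what partitions='auto' might compute.

    Returns the distinct tuples of partition values found in the source
    rows, in first-seen order. Values are assumed to be hashable.
    """
    seen = set()
    result: List[PartitionValues] = []
    for row in rows:
        key = tuple(row[col] for col in partition_columns)
        if key not in seen:  # skip tuples we have already recorded
            seen.add(key)
            result.append(dict(zip(partition_columns, key)))
    return result


def restrict_to_partitions(
    rows: Iterable[Dict[str, Any]], partitions: Sequence[PartitionValues]
) -> List[Dict[str, Any]]:
    """Hypothetical helper: drop source rows outside the listed partitions."""
    return [
        row
        for row in rows
        if any(all(row[col] == val for col, val in p.items()) for p in partitions)
    ]


source_rows = [
    {"col_a": 1, "col_b": "foo", "value": 10},
    {"col_a": 1, "col_b": "bar", "value": 20},
    {"col_a": 1, "col_b": "foo", "value": 30},
    {"col_a": 2, "col_b": "baz", "value": 40},
]

# Equivalent of partitions='auto' for a table partitioned by col_a, col_b.
auto_partitions = distinct_partition_values(source_rows, ["col_a", "col_b"])

# Equivalent of passing an explicit list: rows outside it are dropped.
kept = restrict_to_partitions(source_rows, [{"col_a": 1, "col_b": "foo"}])
```

With something like this, `'auto'` would just be shorthand for computing the list from `source_data` and passing it explicitly, so both modes could share the same file-pruning path.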
Related Issue(s)