Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix synchronization in collective DistArray initializations/transformations #484

Merged
merged 4 commits into from
Oct 11, 2024

Conversation

evaleev
Copy link
Member

@evaleev evaleev commented Oct 9, 2024

There are several closely-related issues around synchronization in initialization/transformation of DistArray's:

  • diagonal_array uses taskq.fence() which can lead to deadlock without global synchronization (e.g. if rank receives tasks not generated by diagonal_array whose submission/execution requires progress beyond diagonal_array)
  • to avoid similar problems elsewhere completion of all tasks of a given type is typically accomplished by awaiting on ntask_created == ntask_completed (e.g. https://github.com/ValeevGroup/tiledarray/blob/master/src/TiledArray/conversions/foreach.h#L286 , with ntask_created named task_count and ntask_completed named counter). As is now obvious updating ntask_completed from within the task, before its result has been assigned to the future creates a potential race, since the awaited condition will be satisfied before the effects of the task completion have been completed. The correct way to ensure local completion is to update ntask_completed using a callback attached to the future that will be assigned the task result.

To address these issues, beyond direct fixes, it would be convenient to add fencing support to collectives like DistArray::init_tiles directly. This PR accomplishes that.

@evaleev evaleev force-pushed the evaleev/feature/distarray-init-fence branch 2 times, most recently from e11c2c1 to a49ecbb Compare October 9, 2024 12:34
@evaleev evaleev force-pushed the evaleev/feature/distarray-init-fence branch from a49ecbb to c955339 Compare October 11, 2024 20:56
@evaleev evaleev merged commit a73a17b into master Oct 11, 2024
9 checks passed
@evaleev evaleev deleted the evaleev/feature/distarray-init-fence branch October 11, 2024 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant