
config: Use a global lockfile separate from the config/secrets files #157

Merged 1 commit into master on Mar 4, 2022

Conversation

@tsibley (Member) commented Mar 3, 2022

Locking the same file we then re-open for writing doesn't work on
Windows because the underlying LockFileEx is fd/handle-scoped, not
process-scoped, and is a mandatory, OS-enforced lock, as documented¹:

> If the locking process opens the file a second time, it cannot access
> the specified region through this second handle until it unlocks the
> region.

The pattern works on Unix because the fcntl()-based lock isn't
mandatory. This difference is noted in fasteners' README², but I didn't
realize the full implications of Windows' mandatory lock when
introducing the locking in 821c08e. My thinking was that enforcement was
process-scoped, not fd/handle-scoped.

Using a global lockfile instead of a per-file sidecar lockfile avoids
having to reason about parent-path existence checks, which is a) hard
and b) created several small race conditions (present until now). The
downside of this simpler implementation is that accesses to different
files are bottlenecked on the same lock (e.g. a writer of secrets will
block a reader of config; a reader of config will block a writer of
secrets). That's fine, however, given our usage patterns and the very
short lock holding times.

Resolves #151, which also discusses a rejected alternative of re-using
the same fd that fasteners locks.

¹ https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-lockfileex
² https://pypi.org/project/fasteners/0.17.3/#description

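The global-lockfile pattern can be sketched as below. This is a minimal POSIX illustration using the stdlib fcntl module, not the CLI's actual implementation (which uses the fasteners library); the names LOCKFILE, global_lock, and write_config are all illustrative. The key points match the description above: the lock lives on one well-known file whose parent directory always exists (so there are no parent-path existence checks to race on), and the file being written is never the file being locked (so no mandatory-lock conflict can arise).

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Illustrative path: one global lockfile, separate from any config/secrets file.
LOCKFILE = os.path.join(tempfile.gettempdir(), "demo-global.lock")

@contextmanager
def global_lock():
    # Opening in append mode creates the lockfile if it's missing; its
    # parent (the temp dir) always exists, so no parent-path existence
    # checks are needed -- and none of the races they invite.
    with open(LOCKFILE, "a") as fh:
        fcntl.lockf(fh, fcntl.LOCK_EX)  # blocks until acquired
        try:
            yield
        finally:
            fcntl.lockf(fh, fcntl.LOCK_UN)

def write_config(path, text):
    with global_lock():
        # Safe to (re-)open and rewrite the target even under a mandatory
        # locking regime: the lock is held on a *separate* file.
        with open(path, "w") as f:
            f.write(text)

config_path = os.path.join(tempfile.gettempdir(), "demo-config.yaml")
write_config(config_path, "key: value\n")
```

The cost, as noted, is that writers to unrelated files contend on this one lock; with short holding times that contention is negligible.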
@tsibley tsibley requested a review from victorlin March 3, 2022 23:12
@victorlin (Member) left a comment


Nice job tracking this down without access to a Windows machine! Tested on Windows 10 with #151 (comment) and this solves the issue.

> The downside of this simpler implementation is that accesses to
> different files will be bottlenecked on the same lock

I'm not sure where these locks take place or for how long, but could it result in a scenario where multiple SLURM jobs submitted under the same user happen to run nextstrain update at the same time, causing all but one job to crash?

If the chances of that happening are slim, then this is perfect. If not, then some sort of error handling or retry would be better than [Errno 13] Permission denied.

@tsibley (Member, Author) commented Mar 4, 2022

> could it result in a scenario where multiple SLURM jobs submitted under the same user happen to run nextstrain update at the same time, causing all but one job to crash?
>
> If the chances of that happening are slim, then this is perfect. If not, then some sort of error handling or retry would be better than [Errno 13] Permission denied.

Ah, so the effect of the locking is that writes are serialized and reads wait until a write finishes, at which point reads can proceed in parallel. Lock acquisition retries automatically, since the default acquisition mode is blocking without a timeout.
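Those shared-reader/exclusive-writer semantics can be demonstrated with a small sketch. The CLI itself uses the fasteners library; this is a stdlib POSIX illustration using flock(), exploiting the fact that locks on separate open file descriptions conflict with each other even within one process, so the behavior is observable locally:

```python
import fcntl
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "demo-rw.lock")
open(path, "a").close()

# Separate open() calls yield separate open file descriptions, which
# flock() treats as independent lock holders -- even in one process.
reader1 = open(path)
reader2 = open(path)
writer = open(path, "a")

# Two readers hold the shared lock at the same time...
fcntl.flock(reader1, fcntl.LOCK_SH)
fcntl.flock(reader2, fcntl.LOCK_SH)

# ...while a writer asking for the exclusive lock must wait. A blocking
# acquire would just keep waiting; LOCK_NB lets us observe the state.
try:
    fcntl.flock(writer, fcntl.LOCK_EX | fcntl.LOCK_NB)
    writer_had_to_wait = False
except BlockingIOError:
    writer_had_to_wait = True

fcntl.flock(reader1, fcntl.LOCK_UN)
fcntl.flock(reader2, fcntl.LOCK_UN)

# Once both readers release, the exclusive acquisition succeeds.
fcntl.flock(writer, fcntl.LOCK_EX)
fcntl.flock(writer, fcntl.LOCK_UN)
```

So concurrent readers never block each other; only a writer serializes access, and blocked acquirers simply wait rather than erroring out.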

The errno 13 crash shouldn't happen again; it was the result of the mandatory lock denying access to an operation that didn't hold the lock, not the result of lock acquisition failing or waiting.

Re: SLURM specifically, if the jobs are accessing a shared filesystem that supports fcntl() locking (like modern NFS), then it'll all work out fine. If the shared filesystem doesn't support locking (:thinking:), it'll either silently not actually lock or throw an "unsupported" error for every lock acquisition, so none will succeed. IME the former is more common (:scream:). I'm not worried about this case, though; it should occur rarely, if at all, given the timing required, and on top of that, modern shared filesystems do support locks.
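The "unsupported" failure mode at least announces itself and could be probed for. A hypothetical helper (locking_supported is not part of the CLI) might look like this; note it can only detect the loud failure, not the nastier silent no-op case:

```python
import errno
import fcntl
import os
import tempfile

def locking_supported(path):
    """Hypothetical probe: does the filesystem holding `path` accept
    fcntl() locks at all? Cannot detect a filesystem that silently
    accepts but doesn't enforce locks."""
    with open(path, "a") as fh:
        try:
            fcntl.lockf(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.lockf(fh, fcntl.LOCK_UN)
            return True
        except OSError as e:
            # ENOLCK / EOPNOTSUPP are what lock-less filesystems report.
            if e.errno in (errno.ENOLCK, errno.EOPNOTSUPP):
                return False
            raise

probe = os.path.join(tempfile.gettempdir(), "demo-probe.lock")
print(locking_supported(probe))  # True on a local filesystem
```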

Did you have a specific use case in mind with the SLURM question, or was it just a general example?

@tsibley tsibley merged commit 8d9f864 into master Mar 4, 2022
@tsibley tsibley deleted the trs/config-locking branch March 4, 2022 20:43
@victorlin (Member)

@tsibley

> Did you have a specific use case in mind with the SLURM question, or was it just a general example?

No, I was just trying to think of ways a global lock could be problematic. But as you've described, we should be good here!

Successfully merging this pull request may close these issues.

[Errno 13] Permission denied on Windows