
config: Use a global lockfile separate from the config/secrets files #157

Merged 1 commit into master on Mar 4, 2022

Conversation

@tsibley (Member) commented Mar 3, 2022

Locking the same file we then re-open for writing doesn't work on
Windows because the underlying LockFileEx is fd/handle-scoped, not
process-scoped, and is a mandatory, OS-enforced lock, as documented¹:

> If the locking process opens the file a second time, it cannot access
> the specified region through this second handle until it unlocks the
> region.

The pattern works on Unix because the fcntl()-based lock isn't
mandatory. This difference is noted in fasteners' README², but I didn't
realize the full implications of Windows' mandatory lock when
introducing the locking in 821c08e. My thinking was that enforcement was
process-scoped, not fd/handle-scoped.

Using a global lockfile instead of a per-file sidecar lockfile avoids
having to reason about parent-path existence checks, which is a) hard
and b) created several small race conditions (present until now). The
downside of this simpler implementation is that accesses to different
files are bottlenecked on the same lock (e.g. a writer of secrets will
block a reader of config; a reader of config will block a writer of
secrets). That's fine, however, given our usage patterns and the very
short lock holding times.

Resolves #151, which also discusses a rejected alternative of re-using
the same fd that fasteners locks.

¹ https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-lockfileex
² https://pypi.org/project/fasteners/0.17.3/#description

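The global-lockfile pattern can be sketched as below. This is a minimal POSIX illustration using the stdlib fcntl module, not the CLI's actual implementation (which uses the fasteners library); the names LOCKFILE, global_lock, and write_config are all illustrative. The key points match the description above: the lock lives on one well-known file whose parent directory always exists (so there are no parent-path existence checks to race on), and the file being written is never the file being locked (so no mandatory-lock conflict can arise).

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Illustrative path: one global lockfile, separate from any config/secrets file.
LOCKFILE = os.path.join(tempfile.gettempdir(), "demo-global.lock")

@contextmanager
def global_lock():
    # Opening in append mode creates the lockfile if it's missing; its
    # parent (the temp dir) always exists, so no parent-path existence
    # checks are needed -- and none of the races they invite.
    with open(LOCKFILE, "a") as fh:
        fcntl.lockf(fh, fcntl.LOCK_EX)  # blocks until acquired
        try:
            yield
        finally:
            fcntl.lockf(fh, fcntl.LOCK_UN)

def write_config(path, text):
    with global_lock():
        # Safe to (re-)open and rewrite the target even under a mandatory
        # locking regime: the lock is held on a *separate* file.
        with open(path, "w") as f:
            f.write(text)

config_path = os.path.join(tempfile.gettempdir(), "demo-config.yaml")
write_config(config_path, "key: value\n")
```

The cost, as noted, is that writers to unrelated files contend on this one lock; with short holding times that contention is negligible.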
@tsibley tsibley requested a review from victorlin March 3, 2022 23:12
@victorlin (Member) left a comment


Nice job tracking this down without access to a Windows machine! Tested on Windows 10 with #151 (comment) and this solves the issue.

> The downside of this simpler implementation is that accesses to
> different files will be bottlenecked on the same lock

I'm not sure where these locks take place or for how long, but could it result in a scenario where multiple SLURM jobs submitted under the same user happen to run nextstrain update at the same time, causing all but one job to crash?

If the chances of that happening are slim, then this is perfect. If not, then some sort of error handling or retry would be better than [Errno 13] Permission denied.

@tsibley (Member, Author) commented Mar 4, 2022

> could it result in a scenario where multiple SLURM jobs submitted under the same user happen to run nextstrain update at the same time, causing all but one job to crash?
>
> If the chances of that happening are slim, then this is perfect. If not, then some sort of error handling or retry would be better than [Errno 13] Permission denied.

Ah, so the effect of the locking is that writes are serialized and reads wait until a write finishes, at which point reads can proceed in parallel. Lock acquisition retries automatically, since the default acquisition mode is blocking without a timeout.
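Those shared-reader/exclusive-writer semantics can be demonstrated with a small sketch. The CLI itself uses the fasteners library; this is a stdlib POSIX illustration using flock(), exploiting the fact that locks on separate open file descriptions conflict with each other even within one process, so the behavior is observable locally:

```python
import fcntl
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "demo-rw.lock")
open(path, "a").close()

# Separate open() calls yield separate open file descriptions, which
# flock() treats as independent lock holders -- even in one process.
reader1 = open(path)
reader2 = open(path)
writer = open(path, "a")

# Two readers hold the shared lock at the same time...
fcntl.flock(reader1, fcntl.LOCK_SH)
fcntl.flock(reader2, fcntl.LOCK_SH)

# ...while a writer asking for the exclusive lock must wait. A blocking
# acquire would just keep waiting; LOCK_NB lets us observe the state.
try:
    fcntl.flock(writer, fcntl.LOCK_EX | fcntl.LOCK_NB)
    writer_had_to_wait = False
except BlockingIOError:
    writer_had_to_wait = True

fcntl.flock(reader1, fcntl.LOCK_UN)
fcntl.flock(reader2, fcntl.LOCK_UN)

# Once both readers release, the exclusive acquisition succeeds.
fcntl.flock(writer, fcntl.LOCK_EX)
fcntl.flock(writer, fcntl.LOCK_UN)
```

So concurrent readers never block each other; only a writer serializes access, and blocked acquirers simply wait rather than erroring out.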

The errno 13 crash shouldn't happen again; it was the result of the mandatory lock denying access to an operation that didn't hold the lock, not the result of lock acquisition failing or waiting.

Re: SLURM specifically, if the jobs are accessing a shared filesystem that supports fcntl() locking (like modern NFS), then it'll all work out fine. If the shared filesystem doesn't support locking (:thinking:), it'll either silently not actually lock or throw an "unsupported" error for every lock acquisition, so none will succeed. IME the former is more common (:scream:). I'm not worried about this case, though; it should occur rarely, if at all, given the timing required, and on top of that, modern shared filesystems do support locks.
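The "unsupported" failure mode at least announces itself and could be probed for. A hypothetical helper (locking_supported is not part of the CLI) might look like this; note it can only detect the loud failure, not the nastier silent no-op case:

```python
import errno
import fcntl
import os
import tempfile

def locking_supported(path):
    """Hypothetical probe: does the filesystem holding `path` accept
    fcntl() locks at all? Cannot detect a filesystem that silently
    accepts but doesn't enforce locks."""
    with open(path, "a") as fh:
        try:
            fcntl.lockf(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.lockf(fh, fcntl.LOCK_UN)
            return True
        except OSError as e:
            # ENOLCK / EOPNOTSUPP are what lock-less filesystems report.
            if e.errno in (errno.ENOLCK, errno.EOPNOTSUPP):
                return False
            raise

probe = os.path.join(tempfile.gettempdir(), "demo-probe.lock")
print(locking_supported(probe))  # True on a local filesystem
```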

Did you have a specific use case in mind with the SLURM question, or was it just a general example?

@tsibley tsibley merged commit 8d9f864 into master Mar 4, 2022
@tsibley tsibley deleted the trs/config-locking branch March 4, 2022 20:43
@victorlin (Member)

@tsibley

> Did you have a specific use case in mind with the SLURM question, or was it just a general example?

No, I was just trying to think of ways a global lock could be problematic. But as you've described, we should be good here!

Successfully merging this pull request may close these issues.

[Errno 13] Permission denied on Windows