Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiprocessing file lock hangs on certain filesystems #90

Closed
vbharadwaj-bk opened this issue Jun 19, 2023 · 10 comments · Fixed by #91
Closed

Multiprocessing file lock hangs on certain filesystems #90

vbharadwaj-bk opened this issue Jun 19, 2023 · 10 comments · Fixed by #91

Comments

@vbharadwaj-bk
Copy link
Collaborator

vbharadwaj-bk commented Jun 19, 2023

I recently noticed that cppimport version 22.7.12+ hangs during the build process (our computing center NERSC recently reconfigured their filesystem). The culprit turned out to be a file lock in introduced in #67 and merged in #71 , which enables multiple processes to use a cppimport-compiled module without stepping on each other's toes. The lock is on line https://github.com/tbenthompson/cppimport/blob/main/cppimport/importer.py#L49.

Unfortunately, our new filesystem configuration does not support file locking as it uses a particular data virtualization service (DVS) - see here for documentation of this issue: https://docs.nersc.gov/development/languages/python/parallel-python/#parallel-io-with-h5py. I solved my problem by rolling back to version 22.5.11, right before the file lock was introduced.

Other filesystems may suffer from similar restrictions. Is it worth adding a setting for cppimport that enables / disables the file lock for multiprocessing build support? For what it's worth, I'm using cppimport in code called from multiple processes, and I protect the build process manually outside the package. I really like the fix introduced by the lock and think it would address the need on most filesystems - but it would be nice to turn it off.

If there is interest in adding the setting, I can write a fix and propose a PR. Feedback solicited!

@tbenthompson
Copy link
Owner

Is it worth adding a setting for cppimport that enables / disables the file lock for multiprocessing build support?

Yes!

Thanks for raising this issue. It makes complete sense.

Thoughts:

  • a PR would be lovely! I'm super busy for the next few weeks and probably won't get to it myself soon.
  • do you think there would be a good way to automatically detect whether the filesystem supports file locking? Either via some set of queryable system-level flags or via just testing a file lock and catching an exception if the lock doesn't work?
  • if not, perhaps the best option would be to identify the relevant exception and then re-raise the exception with some useful feedback for the user like "it seems your filesystem might not support file locking. if that's the case, please set cppimport.settings['use_file_lock'] = False"
  • existing global settings are here:
    settings = dict(
    force_rebuild=False, # `force_rebuild` with multiple processes is not supported
    file_exts=[".cpp", ".c"],
    rtld_flags=ctypes.RTLD_LOCAL,
    lock_suffix=".lock",
    lock_timeout=10 * 60,
    remove_strict_prototypes=True,
    release_mode=os.getenv("CPPIMPORT_RELEASE_MODE", "0").lower()
    in ("true", "yes", "1"),
    )

@vbharadwaj-bk
Copy link
Collaborator Author

Sounds great! I'll put a PR together that, at a minimum, adds a setting 'use_file_lock' with a default of True, but when False runs the build without the lock. Regarding the comments about automatic detection and exception raising:

On our filesystem, the nature of the error is that filelock silently fails to acquire a lock on any process. The failure to acquire a lock is indistinguishable (for all but one process) from the correctly-working case where a single process acquires the lock successfully and the rest are waiting for the build to finish, at least until the timeout period expires. It would be easy to solve this if the processes could communicate and realize that none of them are acquiring the lock, but I'm not sure that is reasonable. As a result, here are the communication-free possibilities that I can think of for automatic detection / exception raising regarding the lock:

  1. Automatic detection: Every process comes up with a random, unique (with high-probability) lock name and tries acquiring their own lock independently. Enough failures in a reasonable time (say, 20 seconds) raises an exception on each process that file locking is not supported. The downside here is significant extra complexity: one lock per process will have to be created, and something like uuid would be needed to assign unique lock names.

  2. No automatic detection: The processes wait until the timeout period and raise the existing timeout error message with an added caveat of the form above "your filesystem might not support file locking. if that's the case, please set cppimport.settings['use_file_lock'] = False and ensure only one process compiles the module at a time.". The downside here is that the processes must all wait until the lock timeout period expires (10 minutes, when I usually Ctrl+C after a minute) to find the issue. On the upside, I think a section in the README about file locking support would be enough to point people to the correct setting to modify; that's the first place I looked, and this issue should be rare in practice.

I'm leaning towards the second simpler solution that just adds information to the lock timeout error and adds a README section about file locking. It would immediately make the latest version of the package usable (for the correct choice of global settings) on these filesystems. Additional PRs could implement the automatic locking detection on top of this.

@tbenthompson
Copy link
Owner

Sounds great! I'll put a PR together that, at a minimum, adds a setting 'use_file_lock' with a default of True, but when False runs the build without the lock

Sounds great!

I'm leaning towards the second simpler solution that just adds information to the lock timeout error and adds a README section about file locking. It would immediately make the latest version of the package usable (for the correct choice of global settings) on these filesystems. Additional PRs could implement the automatic locking detection on top of this.

Seems very reasonable!

@tbenthompson
Copy link
Owner

Thanks!!

@drew-parsons
Copy link

I think this is the issue which has now started failing debian CI tests at https://ci.debian.net/packages/c/cppimport/
e.g.
https://ci.debian.net/packages/c/cppimport/unstable/amd64/46623673/
https://ci.debian.net/packages/c/cppimport/unstable/arm64/44718580/

The PR will be very welcome.

@vbharadwaj-bk
Copy link
Collaborator Author

vbharadwaj-bk commented May 27, 2024

Apologies for letting this sleep so long - I've opened a PR with an option to disable the file lock and updated the tests. The file lock remains on by default, so the functionality of any working code is unchanged. All tests except for the multiprocessing test now disable the file lock. The multiprocessing test is disabled by default, but can be run with the invocation pytest --multiprocessing to test the file lock functionality.

Feedback very welcome, some of these choices were opinionated.

@drew-parsons
Copy link

drew-parsons commented May 28, 2024

The patch looks good to me (but I didn't test it yet patched into my code). Disabling the multiprocessing test by default would keep CI testing stable.

One question for the --multiprocessing flag: is there a (simple) way to check if filelock is supported on a given system? e.g. any entry recorded under /proc/fs/ or another utility tool? /proc/fs/ext4/dm-0/options, for instance, reports dioread_nolock, but I figure that's not relevant (the multiprocessing test passes on this system). If there is a simple check, then I'd use it to activate --multiprocessing in the debian CI scripts.

@vbharadwaj-bk
Copy link
Collaborator Author

@drew-parsons Not sure if there's a universal way to check across systems - but at least on linux, this was helpful using flock. This snippet does the job:

touch testfile
flock ./testfile true && echo ok || echo nok

I tested this snippet on a DVS filesystem that does not support locking (result below):

vbharadw> flock ./testfile true && echo ok || echo nok
flock: ./testfile: Unknown error 524
nok

And on a filesystem that does support locking:

vbharadw> flock ./testfile true && echo ok || echo nok
ok

@drew-parsons
Copy link

Easy enough to use in a CI script, thanks!

@vbharadwaj-bk
Copy link
Collaborator Author

Merged- thanks @tbenthompson for the review (and overall maintenance of this package) and @drew-parsons for the nudge!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants