Add multiprocessing to AutoPopulate #704
Conversation
Thank you. Reviewing…
We will need to test a bit and add unit tests before merging onto …
Hm, looks like the …
@mspacek Looks like it was missed before. Let me fix this for you. One moment.
@mspacek OK
@mspacek Please go ahead and do …
@dimitri-yatsenko if we want to test 'blind' 'in repo' before officially accepting (even if only to dev), it might make sense to create a feature branch for this case? Alternatively, people can easily create their own private test branches and merge from mspacek:mp for testing. No problem with the idea of the patch, but given the area the code touches, I'm not sure a blind merge to dev is the cleanest approach if rework may be required.
Hi @mspacek - I've created a 'mp' branch from the updated 'dev'; if you could re-address the PR to that, that would be best for now. Apologies for all the churn; we are currently working to codify and improve our branch/release methodology, so for the moment things in that area are a bit 'under construction'.
Co-Authored-By: Dimitri Yatsenko <[email protected]>
Would it make sense to replace …
@dimitri-yatsenko in this case, would users still use …
Yes, …
…-native data types
…errors + supporting tables: ErrorClassTable and DjExceptionNames added to test for datajoint#700: the jobs table requires `enable_python_native_blobs`; additionally includes a utility to ensure suppress_errors can trap all DJ exceptions. Populating ErrorClassTable raises one DjExceptionName() per entry in DjExceptionNames, which should successfully result in the jobs table being filled with len(DjExceptionNames) records.
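For context, a minimal sketch of the kind of test this commit message describes might look like the following, assuming only the public DataJoint API (`dj.Lookup`, `dj.Computed`, `schema.jobs`, `populate(suppress_errors=...)`). The class names, schema name, and `contents` query here are illustrative, not the committed code.

```python
import datajoint as dj
from datajoint import errors

# flag referenced in the commit message (datajoint#700); key name assumed from that reference
dj.config['enable_python_native_blobs'] = True

schema = dj.schema('test_error_classes')


@schema
class DjExceptionName(dj.Lookup):
    definition = """
    dj_exception_name : varchar(64)
    """
    # one row per DataJoint exception class exported by datajoint.errors
    contents = [
        (name,)
        for name in dir(errors)
        if isinstance(getattr(errors, name), type)
        and issubclass(getattr(errors, name), errors.DataJointError)
    ]


@schema
class ErrorClass(dj.Computed):
    definition = """
    -> DjExceptionName
    """

    def make(self, key):
        # deliberately raise the named exception so populate() must trap it
        raise getattr(errors, key['dj_exception_name'])


# with reserve_jobs + suppress_errors, each failing key should leave exactly
# one error record in the schema's jobs table
ErrorClass.populate(reserve_jobs=True, suppress_errors=True)
assert len(schema.jobs) == len(DjExceptionName())
```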
Oops, sorry for the noise. Just rebased off of master and pushed to my GitHub branch.
OK, let's merge for testing and validation. It will live on …
Not sure where to comment on this now, but is there anything new to report? I'd love to get this into the main branch. I've been using it for a while now without issues. I guess unit tests would be required.
Hi @mspacek, I am actively testing your solution in my analysis and reviewing the code. Will update soon.
@dimitri-yatsenko just another friendly reminder :) Any news on this? Looks like master is out of date, and the mp branch can no longer be automatically merged into master (https://github.com/datajoint/datajoint-python/compare/mp). We're using this extensively, but I haven't merged in changes from the latest dj releases. Our rebuilds take about 3 h, as opposed to about 2 days; it makes a huge difference for us.
@mspacek Absolutely. Thank you for your patience. We are juggling a few things, but adding multiprocessing is a critical feature to be added before the 1.0 release.
Here's a stab at addressing issue #695. This adds a `multiprocess` kwarg to `populate()`. From what I can tell, building a subset of our database (with about 40 AutoPopulate tables, some with up to 1M rows) with the default single process is as fast as before, around 10-11 min on our 16-core server. With `multiprocess=True` (which in our case spawns a pool of at most 16 processes, depending on the number of primary keys per table), that drops to around 2 min, so about 1/5 the time. The end result looks exactly the same, including table entry order and returned error lists. Besides often having fewer than 16 primary keys per table, I think much of the lost time is due to some individual keys taking much longer to process than others. Since the granularity of the multiprocessing is at the level of keys, sometimes processes can sit idle waiting for the last one to finish.
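For concreteness, here is a minimal usage sketch of the kwarg described above. The schema, the tables, and the sleep-based `make()` are placeholders invented for illustration; only the `multiprocess=True` argument comes from this PR.

```python
import time
import datajoint as dj

schema = dj.schema('multiprocess_demo')


@schema
class Param(dj.Lookup):
    definition = """
    param_id : int
    """
    contents = [(i,) for i in range(32)]


@schema
class SlowResult(dj.Computed):
    definition = """
    -> Param
    ---
    value : float
    """

    def make(self, key):
        time.sleep(1)  # stand-in for real per-key work
        self.insert1(dict(key, value=float(key['param_id'])))


# default single-process behaviour is unchanged:
# SlowResult.populate(display_progress=True)

# with this PR's kwarg, keys are fanned out over a pool of worker processes,
# capped by the CPU count and by the number of pending keys:
SlowResult.populate(multiprocess=True, display_progress=True)
```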
Seems to work fine in combination with `reserve_jobs=True`, but the benefit of doing so is probably reduced. I haven't tested the `limit` or `max_calls` kwargs for `populate()`. Not entirely sure if the new `max_calls` logic will work exactly the same as before.

`display_progress` works, but isn't as fine-grained as for the single-process `populate()` call. The progress bar only prints out once in a while (and sometimes overwrites itself), but it's still better than nothing. I got some tips on this from https://stackoverflow.com/questions/41920124/multiprocessing-use-tqdm-to-display-a-progress-bar
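For reference, the general pattern from that Stack Overflow thread looks roughly like the sketch below (a generic illustration, not the code in this PR): wrap the pool's `imap` iterator in `tqdm` so the bar advances as each result comes back.

```python
import multiprocessing as mp

from tqdm import tqdm


def work(key):
    # placeholder for the per-key work a populate() call would do
    return key * key


if __name__ == '__main__':
    keys = list(range(1000))
    with mp.Pool(processes=4) as pool:
        # the bar ticks whenever a worker finishes a key, not per iteration of a loop
        results = list(tqdm(pool.imap(work, keys), total=len(keys)))
```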
The strategy of binding the table object as an attribute to each process seems a bit clumsy, but it's something I came up with long ago on a different project where I wanted to multiprocess an object method instead of just a function. It works fine from what I've seen. I wouldn't be surprised if there's a better way to do this, though.
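For readers unfamiliar with that trick, a stripped-down illustration of the general pattern (not the PR's actual code) is below: a pool initializer attaches the object to each worker process once, so a plain module-level worker function can call the object's method for every key without the object being pickled and resent per task.

```python
import multiprocessing as mp


class Table:
    """Stand-in for an object whose bound method should run once per key."""

    def make(self, key):
        return key * 2


def _init_worker(table):
    # runs once in each worker process: bind the object to the process itself
    mp.current_process().table = table


def _process_key(key):
    # reach the object through the current worker process and call its method
    return mp.current_process().table.make(key)


if __name__ == '__main__':
    with mp.Pool(processes=4, initializer=_init_worker, initargs=(Table(),)) as pool:
        print(pool.map(_process_key, range(8)))
```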