Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multiprocessing to AutoPopulate #704

Merged
merged 27 commits into from
Dec 16, 2019
Merged

Conversation

mspacek
Copy link
Contributor

@mspacek mspacek commented Nov 18, 2019

Here's a stab at addressing issue #695. This adds a multiprocess kwarg to populate(). From what I can tell, building a subset of our database (with about 40 AutoPopulate tables, some with up to 1M rows) with the default single process is as fast as before, around 10-11 min on our 16 core server. With multiprocess=True (which in our case spawns a pool of at most 16 processes, depending on the number of primary keys per table), that drops to around 2 min, so about 1/5 the time. The end result looks exactly the same, including table entry order and returned error lists. Besides often having fewer than 16 primary keys per table, I think much of the lost time is due to some individual keys taking much longer to process than others. Since the granularity of the multiprocessing is at the level of keys, sometimes processes can sit idle waiting for the last one to finish.

Seems to work fine in combination with reserve_jobs=True, but the benefit of doing so is probably reduced. I haven't tested the limit or max_calls kwargs for populate(). Not entirely sure if the new max_calls logic will work exactly the same as before.

display_progress works, but isn't as fine-grained as for the single process populate() call. The progress bar only prints out once in a while (and sometimes overwrites itself), but it's still better than nothing. I got some tips on this from https://stackoverflow.com/questions/41920124/multiprocessing-use-tqdm-to-display-a-progress-bar

The strategy of binding the table object as an attribute to each process seems a bit clumsy, but it's something I came up with long ago on a different project where I wanted to multiprocess an object method instead of just a function. It works fine from what I've seen. I wouldn't be surprised if there's a better way to do this though.

@dimitri-yatsenko
Copy link
Member

thank you. Reviewing..

@dimitri-yatsenko
Copy link
Member

We will need to test a bit and add unit tests before merging onto master. Perhaps could you point this PR onto the dev branch so that we can improve it there before merging onto master?

@mspacek mspacek changed the base branch from master to dev November 20, 2019 15:53
@mspacek
Copy link
Contributor Author

mspacek commented Nov 20, 2019

Hm, looks like the dev branch is quite a bit behind master. I guess I need to try and rebase this off of dev?

@guzman-raphael
Copy link
Collaborator

@mspacek Looks like it was missed before. Let me fix this for you. One moment.

@guzman-raphael
Copy link
Collaborator

@mspacek OK dev is now updated to match master. Let us know if you run into any further issues!

@guzman-raphael
Copy link
Collaborator

@mspacek Please go ahead and do git pull upstream dev to refresh the diff so that it will be applied to current branch tip.

@ixcat
Copy link
Contributor

ixcat commented Nov 20, 2019

@dimitri-yatsenko if we want to test 'blind' 'in repo' before officially accepting (if even to dev), it might make sense to create a feature branch for this case? Alternately, people can easliy create their own private test branches & merge from mspacek:mp for testing..

No problem with the idea of the patch, but given the area that the code touches on, I'm not sure blind merge to dev if rework may be required is the cleanest approach esp. since we are still actively using 'dev' as our 'hotfix' code source for point releases...

@dimitri-yatsenko
Copy link
Member

dimitri-yatsenko commented Nov 20, 2019

Good points, @ixcat, will you please create a branch to track this feature and ask @mspacek to point the PR there?

This is indeed a major feature that we will take some time to test, validate, and refine before merging into master.

@ixcat
Copy link
Contributor

ixcat commented Nov 20, 2019

Hi @mspacek - I've created a 'mp' branch from the updated 'dev', if you could readdress the PR to that, that would be best for now.

Apologies for all the churn, we are currently working to codify and improve our branch/release methodology and so for the moment things in that area are a bit 'under construction'

@mspacek mspacek changed the base branch from dev to mp November 21, 2019 10:24
@dimitri-yatsenko
Copy link
Member

Would it make sense to replace mutiprocess=False with nprocs=1 since the former has the same effect as the latter?

@ixcat
Copy link
Contributor

ixcat commented Nov 21, 2019

Would it make sense to replace mutiprocess=False with nprocs=1 since the former has the same effect as the latter?

@dimitri-yatsenko in this case, would users still use =True to allow the code to auto-select the number of processes?

@mspacek
Copy link
Contributor Author

mspacek commented Nov 22, 2019

Yes, multiprocess=True automatically chooses the number of processes. Replacing it with nprocs=True sounds a bit awkward to me, more so than e.g. multiprocess=10. I think most of the time people would either use True or False, and only rarely specify a specific number of processes.

@dimitri-yatsenko
Copy link
Member

dimitri-yatsenko commented Nov 22, 2019

@ixcat @mspacek makes sense.

dimitri-yatsenko and others added 6 commits December 5, 2019 20:43
…errors

+ supporting tables: ErrorClassTable and DjExceptionNames

added to test for datajoint#700: jobs table requires `enable_python_native_blobs`;
additionally has utility to ensure suppress_errors can trap all DJ exceptions.

populate of ErrorClassTable raises 1 DjExceptionName() per DjExceptionNames
which should sucessfullly result in jobs table being filled with
len(DjExceptionNames) records.
@mspacek
Copy link
Contributor Author

mspacek commented Dec 5, 2019

Oops, sorry for the noise. Just rebased off of master and pushed to my github branch.

@dimitri-yatsenko
Copy link
Member

ok, let's merge for testing and validation. It will live on datajoint:mp until we gain confidence.

@dimitri-yatsenko dimitri-yatsenko merged commit 3e6b174 into datajoint:mp Dec 16, 2019
@mspacek
Copy link
Contributor Author

mspacek commented Jan 23, 2020

Not sure where to comment on this now, but is there anything new to report? I'd love to get this into the main branch. Been using it for a while now without issues. I guess unit tests would be required.

@dimitri-yatsenko
Copy link
Member

Hi @mspacek I am actively testing your solution in my analysis and reviewing the code. Will update soon

@mspacek
Copy link
Contributor Author

mspacek commented Aug 19, 2020

@dimitri-yatsenko just another friendly reminder :) Any news on this? Looks like master is out of date, and also the mp branch can no longer be automatically merged into master:

https://github.com/datajoint/datajoint-python/compare/mp

We're using this extensively, but I haven't merged in changes from latest dj releases. Our rebuilds take about 3 h, as opposed to about 2 days. Makes a huge difference for us.

@dimitri-yatsenko
Copy link
Member

@mspacek Absolutely. Thank you for your patience. We are juggling a few things but adding multiprocessing is a critical feature to be added before the 1.0 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants