Add multiprocessing to AutoPopulate #704

mspacek · 2019-11-18T22:38:56Z

Here's a stab at addressing issue #695. This adds a multiprocess kwarg to populate(). From what I can tell, building a subset of our database (with about 40 AutoPopulate tables, some with up to 1M rows) with the default single process is as fast as before, around 10-11 min on our 16 core server. With multiprocess=True (which in our case spawns a pool of at most 16 processes, depending on the number of primary keys per table), that drops to around 2 min, so about 1/5 the time. The end result looks exactly the same, including table entry order and returned error lists. Besides often having fewer than 16 primary keys per table, I think much of the lost time is due to some individual keys taking much longer to process than others. Since the granularity of the multiprocessing is at the level of keys, sometimes processes can sit idle waiting for the last one to finish.

Seems to work fine in combination with reserve_jobs=True, but the benefit of doing so is probably reduced. I haven't tested the limit or max_calls kwargs for populate(). Not entirely sure if the new max_calls logic will work exactly the same as before.

display_progress works, but isn't as fine-grained as for the single process populate() call. The progress bar only prints out once in a while (and sometimes overwrites itself), but it's still better than nothing. I got some tips on this from https://stackoverflow.com/questions/41920124/multiprocessing-use-tqdm-to-display-a-progress-bar

The strategy of binding the table object as an attribute to each process seems a bit clumsy, but it's something I came up with long ago on a different project where I wanted to multiprocess an object method instead of just a function. It works fine from what I've seen. I wouldn't be surprised if there's a better way to do this though.

dimitri-yatsenko · 2019-11-19T16:51:25Z

thank you. Reviewing..

dimitri-yatsenko · 2019-11-19T16:53:46Z

We will need to test a bit and add unit tests before merging onto master. Perhaps could you point this PR onto the dev branch so that we can improve it there before merging onto master?

mspacek · 2019-11-20T15:57:40Z

Hm, looks like the dev branch is quite a bit behind master. I guess I need to try and rebase this off of dev?

guzman-raphael · 2019-11-20T15:59:57Z

@mspacek Looks like it was missed before. Let me fix this for you. One moment.

guzman-raphael · 2019-11-20T16:30:21Z

@mspacek OK dev is now updated to match master. Let us know if you run into any further issues!

guzman-raphael · 2019-11-20T19:00:25Z

@mspacek Please go ahead and do git pull upstream dev to refresh the diff so that it will be applied to current branch tip.

ixcat · 2019-11-20T19:56:53Z

@dimitri-yatsenko if we want to test 'blind' 'in repo' before officially accepting (if even to dev), it might make sense to create a feature branch for this case? Alternately, people can easliy create their own private test branches & merge from mspacek:mp for testing..

No problem with the idea of the patch, but given the area that the code touches on, I'm not sure blind merge to dev if rework may be required is the cleanest approach ~~esp. since we are still actively using 'dev' as our 'hotfix' code source for point releases...~~

dimitri-yatsenko · 2019-11-20T20:19:54Z

Good points, @ixcat, will you please create a branch to track this feature and ask @mspacek to point the PR there?

This is indeed a major feature that we will take some time to test, validate, and refine before merging into master.

ixcat · 2019-11-20T22:59:00Z

Hi @mspacek - I've created a 'mp' branch from the updated 'dev', if you could readdress the PR to that, that would be best for now.

Apologies for all the churn, we are currently working to codify and improve our branch/release methodology and so for the moment things in that area are a bit 'under construction'

datajoint/autopopulate.py

Co-Authored-By: Dimitri Yatsenko <[email protected]>

dimitri-yatsenko · 2019-11-21T19:20:19Z

Would it make sense to replace mutiprocess=False with nprocs=1 since the former has the same effect as the latter?

ixcat · 2019-11-21T21:49:15Z

Would it make sense to replace mutiprocess=False with nprocs=1 since the former has the same effect as the latter?

@dimitri-yatsenko in this case, would users still use =True to allow the code to auto-select the number of processes?

mspacek · 2019-11-22T10:00:06Z

Yes, multiprocess=True automatically chooses the number of processes. Replacing it with nprocs=True sounds a bit awkward to me, more so than e.g. multiprocess=10. I think most of the time people would either use True or False, and only rarely specify a specific number of processes.

dimitri-yatsenko · 2019-11-22T10:16:36Z

@ixcat @mspacek makes sense.

…-native data types

…errors + supporting tables: ErrorClassTable and DjExceptionNames added to test for datajoint#700: jobs table requires `enable_python_native_blobs`; additionally has utility to ensure suppress_errors can trap all DJ exceptions. populate of ErrorClassTable raises 1 DjExceptionName() per DjExceptionNames which should sucessfullly result in jobs table being filled with len(DjExceptionNames) records.

mspacek · 2019-12-05T19:47:52Z

Oops, sorry for the noise. Just rebased off of master and pushed to my github branch.

dimitri-yatsenko · 2019-12-16T18:55:26Z

ok, let's merge for testing and validation. It will live on datajoint:mp until we gain confidence.

mspacek · 2020-01-23T16:48:03Z

Not sure where to comment on this now, but is there anything new to report? I'd love to get this into the main branch. Been using it for a while now without issues. I guess unit tests would be required.

dimitri-yatsenko · 2020-01-24T22:48:00Z

Hi @mspacek I am actively testing your solution in my analysis and reviewing the code. Will update soon

mspacek · 2020-08-19T11:04:40Z

@dimitri-yatsenko just another friendly reminder :) Any news on this? Looks like master is out of date, and also the mp branch can no longer be automatically merged into master:

https://github.com/datajoint/datajoint-python/compare/mp

We're using this extensively, but I haven't merged in changes from latest dj releases. Our rebuilds take about 3 h, as opposed to about 2 days. Makes a huge difference for us.

dimitri-yatsenko · 2020-08-19T13:57:58Z

@mspacek Absolutely. Thank you for your patience. We are juggling a few things but adding multiprocessing is a critical feature to be added before the 1.0 release.

mspacek added 4 commits November 18, 2019 22:56

Add multiprocessing to AutoPopulate

9822fc4

Update docs

6074900

Rename max_processes -> multiprocess, accept bool or int

a7e4c2e

Fix reserved jobs

24de484

mspacek changed the base branch from master to dev November 20, 2019 15:53

mspacek changed the base branch from dev to mp November 21, 2019 10:24

dimitri-yatsenko reviewed Nov 21, 2019

View reviewed changes

datajoint/autopopulate.py Outdated Show resolved Hide resolved

dimitri-yatsenko reviewed Nov 21, 2019

View reviewed changes

datajoint/autopopulate.py Outdated Show resolved Hide resolved

mspacek and others added 3 commits November 21, 2019 19:08

Use is instead of ==

28aae7a

Co-Authored-By: Dimitri Yatsenko <[email protected]>

Replace assertion with DataJointError

236e62e

Co-Authored-By: Dimitri Yatsenko <[email protected]>

Remove extra blank line

9266668

dimitri-yatsenko and others added 6 commits December 5, 2019 20:43

fix datajoint#700

4f30328

fix datajoint#700 differently - bypass the restriction against python…

8d7d7d3

…-native data types

fix typo from previous commit

3e3d688

cleanup

53d073f

minor syntax improvement

df27ada

dimitri-yatsenko added 14 commits December 5, 2019 20:43

minor syntax

8f46e05

fix datajoint#675

6ed83c6

Fix datajoint#698

91556d5

fix datajoint#699 -- add table definition to doc string

2c8aedf

Table doc strings now display reverse-engineered table declarations

bc69123

update error message for un-upgraded external stores.

1034797

improve table definition in the doc string

d934b7e

minor improvement in display of table doc strings

2b8588c

replace .describe() with .definition to augment the table docstring

c7b7989

improve unit test for definition in docstring

ca775da

minor

b646061

update CHANGELOG for version 0.12.3

60c952b

update docs for release 0.12.3

95c2249

blob now accepts native complex scalars

b3ee8d9

dimitri-yatsenko merged commit 3e6b174 into datajoint:mp Dec 16, 2019

dimitri-yatsenko mentioned this pull request Sep 25, 2021

Implement multiprocessing in populate #969

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add multiprocessing to AutoPopulate #704

Add multiprocessing to AutoPopulate #704

mspacek commented Nov 18, 2019

dimitri-yatsenko commented Nov 19, 2019

dimitri-yatsenko commented Nov 19, 2019

mspacek commented Nov 20, 2019

guzman-raphael commented Nov 20, 2019

guzman-raphael commented Nov 20, 2019

guzman-raphael commented Nov 20, 2019

ixcat commented Nov 20, 2019 •

edited

Loading

dimitri-yatsenko commented Nov 20, 2019 •

edited

Loading

ixcat commented Nov 20, 2019

dimitri-yatsenko commented Nov 21, 2019

ixcat commented Nov 21, 2019

mspacek commented Nov 22, 2019

dimitri-yatsenko commented Nov 22, 2019 •

edited

Loading

mspacek commented Dec 5, 2019

dimitri-yatsenko commented Dec 16, 2019

mspacek commented Jan 23, 2020

dimitri-yatsenko commented Jan 24, 2020

mspacek commented Aug 19, 2020

dimitri-yatsenko commented Aug 19, 2020

Add multiprocessing to AutoPopulate #704

Add multiprocessing to AutoPopulate #704

Conversation

mspacek commented Nov 18, 2019

dimitri-yatsenko commented Nov 19, 2019

dimitri-yatsenko commented Nov 19, 2019

mspacek commented Nov 20, 2019

guzman-raphael commented Nov 20, 2019

guzman-raphael commented Nov 20, 2019

guzman-raphael commented Nov 20, 2019

ixcat commented Nov 20, 2019 • edited Loading

dimitri-yatsenko commented Nov 20, 2019 • edited Loading

ixcat commented Nov 20, 2019

dimitri-yatsenko commented Nov 21, 2019

ixcat commented Nov 21, 2019

mspacek commented Nov 22, 2019

dimitri-yatsenko commented Nov 22, 2019 • edited Loading

mspacek commented Dec 5, 2019

dimitri-yatsenko commented Dec 16, 2019

mspacek commented Jan 23, 2020

dimitri-yatsenko commented Jan 24, 2020

mspacek commented Aug 19, 2020

dimitri-yatsenko commented Aug 19, 2020

ixcat commented Nov 20, 2019 •

edited

Loading

dimitri-yatsenko commented Nov 20, 2019 •

edited

Loading

dimitri-yatsenko commented Nov 22, 2019 •

edited

Loading