
Bug: sceua gets stuck with MPI after burn-in #226

Closed
MuellerSeb opened this issue Jul 29, 2019 · 15 comments
@MuellerSeb (Contributor)

Hey there,

since spotpy 1.5.0, SCE-UA optimization with MPI gets stuck after the burn-in phase. Here is a minimal example:

from spotpy.algorithms import sceua
from spotpy.examples.spot_setup_rosenbrock import spot_setup
setup = spot_setup("sceua")  # spot_setup() for spotpy 1.4.6
sampler = sceua(setup, parallel="mpi", dbname='db', dbformat="csv")
sampler.sample(repetitions=10000, ngs=4)

Running with

mpiexec -n 4 python3 test.py

gives the following output:

Initializing the  Shuffled Complex Evolution (SCE-UA) algorithm  with  10000  repetitions
The objective function will be minimized
Initializing the  Shuffled Complex Evolution (SCE-UA) algorithm  with  10000  repetitions
The objective function will be minimized
Initializing the  Shuffled Complex Evolution (SCE-UA) algorithm  with  10000  repetitions
The objective function will be minimized
Initializing the  Shuffled Complex Evolution (SCE-UA) algorithm  with  10000  repetitions
The objective function will be minimized
Starting burn-in sampling...
Initialize database...
['csv', 'hdf5', 'ram', 'sql', 'custom', 'noData']
* Database file 'db.csv' created.
Burn-in sampling completed...
Starting Complex Evolution...
ComplexEvo loop #1 in progress...

And from there on, nothing more happens. With parallel="seq" it takes about 5 seconds to finish.
Do you know what the problem could be?

I've got mpi4py 3.0.2 installed and I am using Python 3.6.8. With spotpy 1.4.6 everything is working. From 1.5.0 on the above mentioned behavior occurs.

Cheers,
Sebastian

@MuellerSeb MuellerSeb mentioned this issue Jul 29, 2019
@MuellerSeb (Contributor, Author)

After some debugging, I think the problem is in this line:

if self.comm.Iprobe(source=i+1, tag=tag.answer):

where self.comm.Iprobe(source=i+1, tag=tag.answer) never evaluates to True.
Maybe this is related to this thread: https://groups.google.com/forum/#!topic/mpi4py/RiK8Fhd3LIU

But I've run out of ideas at this point.
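The failure mode described above is a classic nonblocking-probe polling loop: the master spins on Iprobe and never advances if the workers' answers never arrive (for example because the message they were supposed to receive could not be deserialized). The same pattern can be sketched without MPI using a queue; the timeout guard here is illustrative and not part of spotpy's actual code:

```python
import queue
import time

def poll_for_answer(q, timeout=1.0, interval=0.01):
    """Poll a queue the way a master polls Iprobe; give up after timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Analogous to Iprobe returning True followed by recv.
            return q.get_nowait()
        except queue.Empty:
            # Analogous to Iprobe returning False: wait and retry.
            time.sleep(interval)
    raise TimeoutError("no answer from worker; is it stuck or dead?")

answers = queue.Queue()
answers.put("result from rank 1")
print(poll_for_answer(answers))
```

Without such a guard, a worker that silently fails to send its answer leaves the master polling forever, which matches the observed hang.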

@philippkraft (Collaborator)

Hi Sebastian, sorry for the long silence - vacation period. We "fixed" some SCE-UA bugs in the last version; I have to check the changes together with @thouska, who is still out of office. Can you check whether you have the same problem with another sampler (e.g. ROPE or LHS)? Just to make sure the issue is in the SCE-UA implementation (which is tricky) and not a general parallel='mpi' problem.

@MuellerSeb (Contributor, Author)

@philippkraft : Thanks for the reply. I checked the FAST routine, which worked as expected.

@MuellerSeb (Contributor, Author)

Something new on this topic?
Cheers, Sebastian

@thouska (Owner) commented Sep 2, 2019

Hi Sebastian,
unfortunately, there is not much news on this topic. At least I can confirm your error description. I am on it and will inform you here as soon as it is fixed. Sorry that it is taking so long...
Based on your report, we are also working on testing the MPI implementation on Travis (#231), so that such errors can, hopefully, be avoided in the future.

@thouska thouska added the bug label Sep 2, 2019
thouska added a commit that referenced this issue Sep 2, 2019
thouska added a commit that referenced this issue Sep 2, 2019
@thouska (Owner) commented Sep 2, 2019

Ok, now it should be fixed. Somehow the new design of the _RunStatistic class in _algorithm.py, introduced in spotpy version 1.5.0, was not picklable under mpi4py. This caused the stall after the burn-in phase that you described. I removed the use of the _RunStatistic class while spotpy is running on cpu-slaves. This fixes the problem (at least in my MPI environment). The change might result in slightly longer runtimes at the end of the sampling (will be fixed), but for now it is at least running again.
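A quick way to reproduce this class of failure independent of MPI is to check whether the objects the master ships to the workers survive a pickle round-trip, since mpi4py serializes generic Python objects with pickle. A minimal sketch, using a hypothetical stand-in class (not spotpy's actual _RunStatistic) that carries an unpicklable attribute:

```python
import pickle

class RunStatisticLike:
    """Hypothetical stand-in for an object holding an unpicklable attribute."""
    def __init__(self):
        # A lambda cannot be pickled; open file handles and thread
        # locks fail the same way.
        self.objective = lambda x: x * x

def is_picklable(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

print(is_picklable(RunStatisticLike()))   # the problematic case
print(is_picklable({"like": [1, 2, 3]}))  # plain data is fine
```

Running such a check on everything passed between ranks would have flagged the unpicklable class before the MPI run silently hung.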

@thouska (Owner) commented Sep 2, 2019

PS: If you want to test this, the corresponding new version (1.5.3) of spotpy is available on PyPI.

@MuellerSeb (Contributor, Author)

I installed spotpy 1.5.4 and now I am getting the following error:

  File "/usr/local/lib/python3.6/dist-packages/spotpy/__init__.py", line 41, in <module>
    from . import unittests
ImportError: cannot import name 'unittests'

The submodule unittests is missing from the package. This is due to this line in setup.py:

packages = ["spotpy", "spotpy.examples", "spotpy.examples.hymod_python", "spotpy.examples.hymod_exe",

You should use this instead:

packages=find_packages(exclude=["tests*", "docs*"])

with this on the first line:

from setuptools import setup, find_packages

But after commenting out the from . import unittests line, it works now.

@MuellerSeb (Contributor, Author)

Maybe you could move the unittests folder to a top-level folder named tests, as mentioned in the exclude pattern, which is the common convention. Then you would have to adapt the .travis.yml file accordingly. I don't think the unit tests need to be in the package when there is a separate examples folder.

thouska added a commit that referenced this issue Sep 3, 2019
Moves tests on toplevel, partly removes jit from hymod_python.py #226
@hpsone commented Sep 3, 2019

I had similar problems, but I just saw that @thouska updated the package. I have not tested the newest version yet; I will do it now. :D

@thouska (Owner) commented Sep 3, 2019

Many thanks @MuellerSeb for directly testing everything and reporting in such detail how to fix the new problems. As you recommended, I removed the unittest import, renamed the unittests folder to tests, and moved it to the top level. I like the new structure and think it makes total sense.
As @hpsone found out faster than I could answer this issue: there is a new version on PyPI containing the fix.

@hpsone commented Sep 3, 2019

Sorry for my rushed comment. I meant to say that I had not tested it yet. Now I have tested it, and it is not working for me. Maybe it is a mistake in my model, but my MPI installation works properly, as I have tested it with Telemac2d. What could the possible error be? Anyway, @thouska, thank you very much for the help.
Best Regards,
Htun

@MuellerSeb (Contributor, Author)

@hpsone: you may have to give some details about your problem to get an answer.

@hpsone commented Jan 16, 2020

@MuellerSeb Thank you so much. I am not quite sure what the error is. But I ran it using "mpc" instead of "mpi" and it worked. Anyway, I will try again; it might just be my insufficient knowledge.

@thouska (Owner) commented Apr 1, 2020

I guess this issue is solved; if not, feel free to reopen.

@thouska thouska closed this as completed Apr 1, 2020