mpi and sce #202

MuellerSeb · 2019-02-21T17:20:51Z

Hey there,
when I run the SCE algorithm in parallel, the burn-in phase works fine, but afterwards something seems to go wrong with the parallel processes. The estimated time goes up, and the run numbers start to repeat and are sometimes not in increasing order:

Starting the SCE-UA algorithm with 10000 repetitions...
Starting the SCE-UA algorithm with 10000 repetitions...
Starting the SCE-UA algorithm with 10000 repetitions...
Starting the SCE-UA algorithm with 10000 repetitions...
burn-in sampling started...
Initialize database...
* Database file 'all_hh/2019-02-21_18-14-50_stat2D_db.csv' created.
13 of 10000 (best like=-23.6045) est. time remaining: 00:24:00
30 of 10000 (best like=-23.6045) est. time remaining: 00:22:26
48 of 10000 (best like=-22.1554) est. time remaining: 00:21:14
66 of 10000 (best like=-21.6666) est. time remaining: 00:20:35
84 of 10000 (best like=-21.6666) est. time remaining: 00:20:06
103 of 10000 (best like=-18.4106) est. time remaining: 00:19:48
115 of 10000 (best like=-18.4106) est. time remaining: 00:20:46
127 of 10000 (best like=-18.4106) est. time remaining: 00:21:31
141 of 10000 (best like=-18.4106) est. time remaining: 00:21:50
154 of 10000 (best like=-18.4106) est. time remaining: 00:22:12
173 of 10000 (best like=-18.4106) est. time remaining: 00:21:49
burn-in sampling completed...
ComplexEvo started...
ComplexEvo loop #1 in progress...
183 of 10000 (best like=-18.4106) est. time remaining: 00:22:29
183 of 10000 (best like=-18.4106) est. time remaining: 00:22:35
183 of 10000 (best like=-18.4106) est. time remaining: 00:22:36
187 of 10000 (best like=-18.4106) est. time remaining: 00:24:16
187 of 10000 (best like=-18.4106) est. time remaining: 00:24:26
187 of 10000 (best like=-18.4106) est. time remaining: 00:24:27
189 of 10000 (best like=-18.4106) est. time remaining: 00:26:31
189 of 10000 (best like=-18.4106) est. time remaining: 00:26:34
189 of 10000 (best like=-18.4106) est. time remaining: 00:26:36
181 of 10000 (best like=-18.4106) est. time remaining: 00:29:21
192 of 10000 (best like=-18.4106) est. time remaining: 00:27:54
192 of 10000 (best like=-18.4106) est. time remaining: 00:29:29
208 of 10000 (best like=-18.4106) est. time remaining: 00:27:18
202 of 10000 (best like=-18.4106) est. time remaining: 00:28:22
195 of 10000 (best like=-18.4106) est. time remaining: 00:30:49
211 of 10000 (best like=-18.4106) est. time remaining: 00:28:30
206 of 10000 (best like=-18.4106) est. time remaining: 00:29:33
215 of 10000 (best like=-18.4106) est. time remaining: 00:29:29
200 of 10000 (best like=-18.4106) est. time remaining: 00:31:58
...

Do you have a guess what's going wrong?
Thanks in advance!

thouska · 2019-02-22T10:30:22Z

Hi Sebastian,
thanks for your message. This behavior is indeed a bit strange. However, I am pretty sure it is “just” a parallel printing issue. Still, it needs to be solved. As I recently changed some minor things in the parallelization of sce-ua, I need to make sure that we are using the same version. I just uploaded a new version to PyPI (1.4.5). Would you be so kind as to test this one again? If the bug persists in this version, I will look into it more closely.

MuellerSeb · 2019-02-22T14:16:08Z

I just updated spotpy, but the behavior is the same. Also, the estimated time goes up until it matches the estimate for sequential optimization.
Again: the burn-in phase works fine. From ComplexEvo loop #1 on, the repetition counter becomes unsorted and the estimated time rises to the sequential time estimate. I don't know what I could have done wrong on my side.

thouska · 2019-02-22T16:15:36Z

Perfect, thanks for testing this again. I found some lines in the sceua code where the slaves had access to the status of algorithm.py during parallel computing and complex evolution. As the slaves can run at different speeds, this might have mixed up the tracked repetitions and the time shown in the printed message. I think this should be fixed now.
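To illustrate the idea of keeping the status on the master only, here is a minimal sketch. It is not the actual spotpy code: `Status`, `simulate` and `run` are hypothetical names, and a thread pool stands in for the MPI workers. The workers only return a likelihood; the master alone counts repetitions, so the printed counter cannot repeat or run backwards even when workers finish at different speeds.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class Status:
    """Hypothetical progress tracker that lives on the master only."""

    def __init__(self, total):
        self.total = total
        self.rep = 0
        self.best_like = float("-inf")

    def update(self, like):
        # Called once per received result, so rep increases by exactly one.
        self.rep += 1
        self.best_like = max(self.best_like, like)
        return f"{self.rep} of {self.total} (best like={self.best_like:.4f})"


def simulate(params):
    """Stand-in for a model run; returns a likelihood and touches no shared state."""
    return -sum(p * p for p in params)


def run(param_sets, workers=4):
    status = Status(total=len(param_sets))
    lines = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(simulate, p) for p in param_sets]
        # Results arrive in completion order; only the master updates the status.
        for fut in as_completed(futures):
            lines.append(status.update(fut.result()))
    return status, lines
```

Calling `run([(1.0,), (2.0,), (0.5,)])` returns three progress lines whose counters are 1, 2, 3 regardless of which worker finished first.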

p-lauer · 2019-03-06T09:50:36Z

I would assume that the unsorted repetition counter comes from how MPI is implemented. The like values are printed in the MPI loop, so if a process with a higher repetition count finishes before one with a lower repetition count, it is printed to the screen first.
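The effect can be reproduced without MPI. The following is only a sketch under assumptions: a thread pool replaces the MPI processes, and `run_repetition` and `completion_order` are made-up names with artificial delays. Each task carries the repetition number it was given at submission; collecting results as they complete yields those numbers out of order.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_repetition(rep, delay):
    """Pretend model run: sleeps for `delay` seconds, then reports its own number."""
    time.sleep(delay)
    return rep


def completion_order(delays):
    """Return repetition numbers in the order in which the runs finished."""
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [
            pool.submit(run_repetition, rep, delay)
            for rep, delay in enumerate(delays)
        ]
        # as_completed yields futures as they finish, not as they were submitted.
        return [fut.result() for fut in as_completed(futures)]
```

With delays such as `[0.3, 0.05, 0.15]`, repetition 0 is submitted first but finishes last, so a message printed per finished run would show its number after the higher ones.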

MuellerSeb · 2019-07-29T14:55:47Z

I think this is clarified. But a new issue came up after version 1.5.0:
#226

thouska added a commit that referenced this issue Feb 22, 2019

Fix for #202

7525462

MuellerSeb closed this as completed Jul 29, 2019
