[pymusic] Endless loop in setup.runtime(..)
#35
Comments
At first glance, it looks like starting MUSIC with a single process and no ports does not work when launched with mpiexec or mpirun. If I start it with ipython only, it works. I'll have a look at it. |
Yes, unfortunately, the new MUSIC scheduler hangs if there is only one
application.
This should of course be fixed.
On 16 Jan 2017 at 16:19, "Martin Schulze" <[email protected]> wrote:
… At first glance, it looks like starting MUSIC with a single
process and no ports does not work when launched with mpiexec or mpirun. If
I start it with ipython only, it works. I first need to get Cython
debugging working before I can dive into the problem.
|
@mdjurfeldt Do you know how to enable gdb to debug Cython code? I found this: http://docs.cython.org/en/latest/src/userguide/debugging.html but there are multiple cythonize() function calls in the pymusic folder setup.py.in and I am not sure where to set the gdb flag |
It seems not to be the special case of only one application. Running two of the above nodes with
|
I think it's a bug when you don't have any ports connecting these applications, or maybe no ports at all (note that you also have to actually create the ports in your applications). |
Seems likely. |
This is really weird. I reduced my implementation to a point at which all configured nodes & connections are working as intended. |
Even though there is a method in the MUSIC interface to check whether a port is connected, NEST does not make use of it and always assumes ports to be connected. I will fix this soon (if it makes sense ...) as I am refactoring all recording devices in NEST anyway (already started with that). Be prepared 8-) |
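As a rough illustration of the connection check mentioned in this comment, here is a minimal sketch that filters a set of published ports by their connection status. It assumes only that each port object offers an isConnected() method, as the pymusic proxies used later in this thread do; the helper name connected_ports and the dictionary layout are hypothetical and not part of MUSIC or NEST.

```python
import logging


def connected_ports(ports, logger=None):
    """Return only those ports whose isConnected() reports True.

    ports is a dict mapping a port name to a port-like object. The helper
    is hypothetical; it relies on nothing but an isConnected() method.
    """
    logger = logger or logging.getLogger(__name__)
    usable = {}
    for name, port in ports.items():
        if port.isConnected():
            usable[name] = port
        else:
            # Warn instead of silently assuming that the port is connected.
            logger.warning("Port %s is not connected", name)
    return usable
```

A node could call this once after publishing its ports and before entering runtime, and then skip all work tied to the ports that were filtered out.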
Many thanks. But the bigger issue is still what happens after setting up the connection properly:
Suddenly, after hooking in another node, a node which has been working before does not return from |
Can you provide the two examples?
On 16 Jan 2017 at 21:30, "Michael Hoff" <[email protected]> wrote:
… Many thanks. But the bigger issue is still what happens after setting up the
connection properly:
Next, I add the publish code to make the connection proper and run it
again. Now, all nodes run except for the pre-existing node for which I just
added a connection.
Suddenly, after hooking in another node, a node which has been working
before does not return from runtime() anymore. I published and mapped the
new event input port the same way I have set up another event input port on
the same node which has been working before. Really weird.
|
That would be difficult. The pynest node relies on a custom nest module (which in turn is responsible for creating one of the music ports). Maybe I can reproduce the faulty behavior with a simpler setup. But before I do that, can you tell me why the above use case with two independent music nodes fails? I totally understand that a single node can be a special case for the music scheduler. And I certainly also understand that having just two unconnected music nodes is a somewhat related scenario in terms of the scheduler. But do both behaviors really have the same origin? |
Is the order of nodes & connections in the music configuration relevant for the scheduler? While trying to produce a showcase for this issue I was moving around nodes and connections inside the configuration file. Now other nodes are not responding anymore. Attached to this comment you find my current configuration file. Everything above
|
I have to retract my last statement. A little bug ( |
Dear Michael,
I will have to investigate this. Unfortunately, I can do this during
the weekend at the earliest.
Obviously, this is not how MUSIC should behave and we will fix it!
Best regards,
Mikael
…On Wed, Jan 18, 2017 at 2:09 PM, Michael Hoff ***@***.***> wrote:
I have to retract my last statement. A little bug (stoptime = ... went
missing) caused the weird behaviour with the order appearing to be relevant.
The above configuration file is now working up to # 4. If I activate the
connection reward_node.reward_out -> nest.reward_in [1], the plotter_node
will not return from runtime(..) anymore. This is weird as this node is
not participating in this specific connection.
|
@mdjurfeldt Thank you very much for your effort! |
If I can help you with anything please don't hesitate to let me know. Also if you have any idea about the potential source of the problem, maybe I can assist in the investigation if I know where to look. |
I found another weird behaviour. When trying to feed Based on the above version which is working without the questionable connection, I add
to
This error appears in
|
Hi Michael,
I only now noticed that the segfault is in the pthreads library. The
current version of MUSIC is not thread aware and should only be called by a
single thread. Do you in any way use thread concurrency in your setup?
On 25 Jan 2017 at 17:42, "Michael Hoff" <[email protected]> wrote:
… I found another weird behaviour. When trying to feed nest.reward_in with
data of other ports (which by the way does not change the above behaviour)
I experienced reproducible SegFaults for publishing unconnected ContOutputs.
Based on the above version which is working without the questionable
connection, I add
music_setup.publishContOutput("sim_time_out")
to plotter_node (before runtime()) without connecting this port in the
music configuration (see configuration file below). Now, I experience the
following when I try to execute the modified setup:
[figipc156:25236] *** Process received signal ***
[figipc156:25236] Signal: Segmentation fault (11)
[figipc156:25236] Signal code: Address not mapped (1)
[figipc156:25236] Failing at address: 0x59
[figipc156:25236] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7fc21e9fb8d0]
[figipc156:25236] [ 1] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC5Setup19maybePostponedSetupEv+0) [0x7fc21caf4850]
[figipc156:25236] [ 2] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC4PortC1EPNS_5SetupESs+0x46) [0x7fc21cb11926]
[figipc156:25236] [ 3] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC5Setup17publishContOutputESs+0x46) [0x7fc21caf38a6]
[figipc156:25236] [ 4] /home/hoff/.local/lib/python2.7/site-packages/music/pymusic.so(+0x17e14) [0x7fc21d58fe14]
[figipc156:25236] [ 5] python(PyEval_EvalFrameEx+0x130e) [0x4cadbe]
[figipc156:25236] [ 6] python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:25236] [ 7] python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:25236] [ 8] python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:25236] [ 9] python(PyEval_EvalCodeEx+0x411) [0x4c87a1]
[figipc156:25236] [10] python() [0x5030ef]
[figipc156:25236] [11] python(PyRun_FileExFlags+0x82) [0x4f8c72]
[figipc156:25236] [12] python(PyRun_SimpleFileExFlags+0x197) [0x4f7d77]
[figipc156:25236] [13] python(Py_Main+0x562) [0x4982f2]
[figipc156:25236] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fc21dd3fb45]
[figipc156:25236] [15] python() [0x497ca0]
[figipc156:25236] *** End of error message ***
This error appears in publishContOutput and is reliably reproducible.
Publishing unconnected ContOutputs works perfectly for trivial test cases,
but in this setup it clearly fails.
------------------------------
stoptime=100000.0
[distance_provider]
binary=ros_sensor_adapter
args=
np=1
music_timestep=0.01
ros_topic=/distance/to_center
message_type=FloatArray
sensor_update_rate=1000
[reward_gen]
binary=python/reward_node.py
args=
np=1
distance_provider.out -> reward_gen.distance_in [1]
[plotter]
np=1
binary=python/plotter_node.py
reward_gen.reward_out -> plotter.reward_in [1]
[dvs]
binary=ros_event_sensor_adapter
args=
np=1
music_timestep=0.01
ros_topic=/camera/dvs/events
message_type=EventArray
sensor_update_rate=60
dvs.out -> plotter.pattern_in [200]
[nest]
binary=python/network_node.py
args=
np=1
dvs.out -> nest.pattern_in [200]
nest.activity_out -> plotter.activity_in [20]
[decoder]
binary=linear_readout_decoder
args=
np=1
music_timestep=0.05
tau=0.03
weights_filename=res/activity_to_velocity_translation_weights.dat
[command_gen]
binary=ros_command_adapter
args=
np=1
music_timestep=0.05
ros_topic=/cmd_vel
message_mapping_filename=res/velocity_to_twist.dat
command_rate=20
nest.activity_out -> decoder.in [20]
decoder.out -> command_gen.in [2]
|
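Since several comments in this thread ask whether applications without any connection are what trips up the scheduler, a small script can list which applications in a configuration file take part in no connection at all. This is only a sketch: it assumes the layout seen in the configuration quoted above ([name] section headers and lines of the form app.port -> app.port [width]), it is not the parser MUSIC itself uses, and the function names are hypothetical.

```python
import re

# Matches connection lines such as "dvs.out -> plotter.pattern_in [200]".
CONNECTION = re.compile(
    r"^\s*(\w+)\.(\w+)\s*->\s*(\w+)\.(\w+)\s*(?:\[(\d+)\])?\s*$")
# Matches application headers such as "[plotter]".
SECTION = re.compile(r"^\s*\[(\w+)\]\s*$")


def parse_connections(text):
    """Return (applications, connections) found in a MUSIC-style config."""
    applications = []
    connections = []
    for line in text.splitlines():
        section = SECTION.match(line)
        if section:
            applications.append(section.group(1))
            continue
        conn = CONNECTION.match(line)
        if conn:
            src_app, src_port, dst_app, dst_port, width = conn.groups()
            connections.append((src_app, src_port, dst_app, dst_port,
                                int(width) if width else None))
    return applications, connections


def unconnected_applications(text):
    """List applications that appear in no connection line."""
    applications, connections = parse_connections(text)
    involved = set()
    for src_app, _, dst_app, _, _ in connections:
        involved.update((src_app, dst_app))
    return [app for app in applications if app not in involved]
```

Reading the configuration file into a string and passing it to unconnected_applications gives the names of the sections that no connection line mentions. Note that this says nothing about ports that are published in code but missing from the configuration.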
The But I think I have experienced this kind of fault ("Address not mapped" originating from Update: |
Neither NEST nor the ROS interface should be a problem. Could you give me
the versions of your OS, MPI and NEST?
On 25 Jan 2017 at 18:21, "Michael Hoff" <[email protected]> wrote:
The nest node uses the standard NEST-inherent thread-based parallelization
(e.g., nest.SetKernelStatus({'local_num_threads': 5})). The
nodes plotter and reward_gen
are standard pymusic nodes without parallelism (except for some MPI
communication, which is currently disabled). The ros_music_adapters may
use one thread per node to bridge between MUSIC and ROS.
But I think I have experienced this kind of fault ("Address not mapped"
originating from libpthread.so) already months ago, long before I was using
the ros_music_adapters at all. Do you have any idea how pthreads could get
into my workflow? I have even tested the network_node with local_num_threads
= 0, which did not change a thing.
|
As for music, I have tested against the current INCF version (8e0a609) and the current version of the fork by Philipp Weidel (weidel-p/MUSIC@dd96d32). |
When building music,
I tried
Continuing the crusade.. In a clean clone of MUSIC:
Now, many many files and objects contain pthread. See grep result. After install,
I am not able to produce a music binary without containing pthread. Presumably, I misunderstood you and the music binary containing references to pthread is totally fine. |
I found the source of this SegFault problem. I accidentally inserted the code for the test output after runtime was called. That led to pymusic calling Setup::publishContOutput with a null pointer as the setup object. Maybe pymusic should be aware of that issue and report an error message instead. I reworked my code to disallow the above behaviour in the future. Now, I run into another kind of faulty behaviour. A very simple configuration:
The above part with the ros_music_adapter and the reward_gen node works very reliably. With the plotter node I am experimenting right now. When I add an unconnected EventOutput to the plotter node (while setup is still valid..) I experience two different kinds of error messages. First of all, this is how I added the Output to the code which has been working before 100%:
Connection status is False, because there is no connection in the music configuration, which is correct.
Error type 2:
That means, even though the proxy is correctly classified as unconnected something in MUSIC is heavily affected. Note that removing these two lines leads to perfectly correct behaviour again. |
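The first part of this comment suggests that publishing a port after runtime was called should produce an error message rather than a segfault. Until something like that exists in pymusic, a node can protect itself on the Python side. The following is a minimal sketch of such a guard; GuardedSetup and SetupConsumedError are hypothetical names, and the wrapper assumes only that the wrapped object has a runtime(timestep) method plus the usual publish methods.

```python
class SetupConsumedError(RuntimeError):
    """Raised when a setup method is used after runtime() was entered."""


class GuardedSetup(object):
    """Hypothetical wrapper around a music.Setup-like object.

    Calls are forwarded to the wrapped object until runtime() has been
    called. Any later access raises SetupConsumedError instead of
    reaching the underlying library with an invalid setup.
    """

    def __init__(self, setup):
        self._setup = setup
        self._consumed = False

    def runtime(self, timestep):
        if self._consumed:
            raise SetupConsumedError("runtime() was already called")
        result = self._setup.runtime(timestep)
        self._consumed = True
        return result

    def __getattr__(self, name):
        # Only reached for names not defined on GuardedSetup itself,
        # e.g. publishContOutput or publishEventOutput.
        if self._consumed:
            raise SetupConsumedError(
                "{} used after runtime(); the setup is no longer "
                "valid".format(name))
        return getattr(self._setup, name)
```

In a node this would be used as music_setup = GuardedSetup(music.Setup()), with the rest of the code left unchanged. The guard does not address the second problem described in this comment (the unconnected EventOutput), which occurs while the setup is still valid.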
Thanks, Michael.
This is why it is so good to have a small example replicating the problem.
There's no way I could have found the segfault problem had I started to
debug it.
You are right that there should be more helpful error messages in MUSIC.
This is on the agenda.
Do you have a smaller example replicating the new problem below?
On 26 Jan 2017 at 13:25, "Michael Hoff" <[email protected]> wrote:
… I found the source of this SegFault problem. I accidentally inserted the
code for the test output after runtime was called. That led to pymusic
calling Setup::publishContOutput with a setup object being the null
pointer. Maybe pymusic should be aware of that issue and report an error
message instead.
------------------------------
I reworked my code to disallow the above behaviour in the future. Now, I
run into another kind of faulty behaviour. A very simple configuration:
stoptime=100000.0
[distance_provider]
binary=ros_sensor_adapter
args=
np=1
music_timestep=0.01
ros_topic=/distance/to_center
message_type=FloatArray
sensor_update_rate=1000
[reward_gen]
binary=python/reward_node.py
args=
np=1
distance_provider.out -> reward_gen.distance_in [1]
[plotter]
np=1
binary=python/plotter_node.py
reward_gen.reward_out -> plotter.reward_in [1]
The above part with the ros_music_adapter and the reward_gen node works
very reliably. With the plotter node I am experimenting right now. When I
add an unconnected EventOutput to the plotter node (while setup is still
valid..) I experience two different kinds of error messages.
First of all, this is how I added the Output to the code which has been
working before 100%:
event_out_proxy = music_setup.publishEventOutput("test_event_out")
print("event_out_proxy connection status {}".format(event_out_proxy.isConnected()))
Connection status is False, because there is no connection in the music
configuration, which is correct.
Error type 1:
event_out_proxy connection status False
1485433258 | __init__ | WARNING | Output port sim_time_out is not connected
1485433258 | __init__ | WARNING | Input port test_cont_in is not connected
[figipc156:03614] *** Process received signal ***
[figipc156:03614] Signal: Segmentation fault (11)
[figipc156:03614] Signal code: Invalid permissions (2)
[figipc156:03614] Failing at address: 0x7f8109db56b8
[figipc156:03614] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7f810a6ed8d0]
[figipc156:03614] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x3a56b8) [0x7f8109db56b8]
[figipc156:03614] *** End of error message ***
Error type 2:
event_out_proxy connection status False
1485433289 | node | INFO | Dropping to runtime with timestep 0.02...
1485433289 | __init__ | WARNING | Output port sim_time_out is not connected
1485433289 | __init__ | WARNING | Input port test_cont_in is not connected
[figipc156:03655] *** Process received signal ***
[figipc156:03655] Signal: Segmentation fault (11)
[figipc156:03655] Signal code: Address not mapped (1)
[figipc156:03655] Failing at address: 0x30
[figipc156:03655] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7efd780768d0]
[figipc156:03655] [ 1] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC15EventOutputPort10buildTableEv+0x29) [0x7efd7618b2f9]
[figipc156:03655] [ 2] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC7Runtime11buildTablesEPNS_5SetupE+0x24) [0x7efd76169334]
[figipc156:03655] [ 3] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC7RuntimeC1EPNS_5SetupEd+0x2cc) [0x7efd76169dbc]
[figipc156:03655] [ 4] /home/hoff/.local/lib/python2.7/site-packages/music/pymusic.so(+0x1dd1f) [0x7efd76c10d1f]
[figipc156:03655] [ 5] python() [0x4ba865]
[figipc156:03655] [ 6] /home/hoff/.local/lib/python2.7/site-packages/music/pymusic.so(+0x17a8c) [0x7efd76c0aa8c]
[figipc156:03655] [ 7] python(PyEval_EvalFrameEx+0x130e) [0x4cadbe]
[figipc156:03655] [ 8] python(PyEval_EvalCodeEx+0x411) [0x4c87a1]
[figipc156:03655] [ 9] python(PyEval_EvalFrameEx+0x1e11) [0x4cb8c1]
[figipc156:03655] [10] python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:03655] [11] python(PyEval_EvalCodeEx+0x411) [0x4c87a1]
[figipc156:03655] [12] python() [0x5030ef]
[figipc156:03655] [13] python(PyRun_FileExFlags+0x82) [0x4f8c72]
[figipc156:03655] [14] python(PyRun_SimpleFileExFlags+0x197) [0x4f7d77]
[figipc156:03655] [15] python(Py_Main+0x562) [0x4982f2]
[figipc156:03655] [16] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7efd773bab45]
[figipc156:03655] [17] python() [0x497ca0]
[figipc156:03655] *** End of error message ***
That means, even though the proxy is correctly classified as unconnected
something in MUSIC is heavily affected. Note that removing these two lines
leads to perfectly correct behaviour again.
|
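For backtraces like the two quoted above, it can help to group the frames by shared object, so that the first libmusic frame is easy to spot. Frame 0 in libpthread.so is likely just where the signal was delivered and not necessarily the faulting code, but that reading is an assumption and not something confirmed in this thread. The sketch below parses the Open MPI frame lines as they appear here; the function names are hypothetical and the regular expression is fitted to this log format only.

```python
import re

# Matches frame lines such as
# "[host:1234] [ 1] /path/libmusic.so.1(_ZN5MUSIC...+0x29) [0x7efd...]".
FRAME = re.compile(
    r"\[\s*(\d+)\]\s+(\S+?)(?:\(([^)]*)\))?\s+\[(0x[0-9a-fA-F]+)\]\s*$")


def parse_frames(log):
    """Extract the stack frames from an Open MPI signal report."""
    frames = []
    for line in log.splitlines():
        match = FRAME.search(line)
        if match:
            index, obj, symbol, address = match.groups()
            frames.append({
                "index": int(index),
                "object": obj,
                "symbol": symbol or "",
                "address": address,
            })
    return frames


def first_frame_in(frames, library_name):
    """Return the first frame whose object path contains library_name."""
    for frame in frames:
        if library_name in frame["object"]:
            return frame
    return None
```

Applied to the second trace, first_frame_in(parse_frames(log), "libmusic") returns frame 1, whose mangled symbol can then be passed to a demangler such as c++filt.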
I built a very small test case (test.zip). I am actually surprised that this does not work. Maybe I just forgot a parameter or something.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import music
setup = music.Setup()
event_out = setup.publishEventOutput("event_out_1")
print("connected = {}".format(event_out.isConnected()))
times = setup.runtime(0.02)
for time in times:
    print(time)
Note: Yes, again this is only one node, but this time there is a clear and deterministic error. Also, I can reproduce this with two nodes, connected with two continuous ports, plus this unconnected event output. The same behaviour occurs. |
I am getting closer to the original problem regarding the single connection rendering one node unusable. At the bottom you find my complete setup. The problematic part is:
If I use the two connections, which are currently active, the whole experiment is working. But if I instead use the direct connection, with the real reward data, the plotter node stops after a few timesteps. In comparison:
|
We could track down the problem to a very minimal working example. On some architectures this produces a significant simulation time offset between two nodes with a unidirectional connection, even though the connection is configured to have neither latency nor buffering. @mdjurfeldt suggests |
Hello everyone,
while debugging a more complex experiment using MUSIC I noticed that two of my pymusic nodes simply never return from calling music_setup.runtime(..), where music_setup = music.Setup().
I have been able to reduce the code to a very minimal non-working example.
When I execute the music configuration attached to this issue via
mpirun -np 1 music test.music
nothing more than "before runtime" will get printed. The process appears to enter an endless loop when calling runtime on the setup object.
In contrast,
mpirun -np 1 ./test_node.py
works fine.
I'm well aware that this trivial node combined with this trivial configuration might be a special case for MUSIC, but this issue also arises with a more complex setup and nodes which actually consume data via incoming connections.
music --version: MUSIC 1.1.15 (8e0a609)
mpirun --version: mpirun (Open MPI) 1.6.5
Minimal example (download: test.zip):
test_node.py
test.music
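To make a hang like the one described here show up as a clear pass or fail result, the launch command can be run under a timeout. This is a minimal sketch using only the Python standard library; it assumes Python 3 for the harness itself (the nodes in this issue run under Python 2.7), and the helper name returns_within is hypothetical. When the timeout expires, only the launched process is killed, so processes started by mpirun may need separate cleanup.

```python
import subprocess
import sys


def returns_within(command, timeout_seconds):
    """Run command and report whether it finished before the timeout.

    Returns (finished, output). finished is False if the process had to
    be killed, which is how a call that never returns would show up.
    """
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired as expired:
        # Output captured before the kill, if any, is still useful.
        return False, (expired.output or b"").decode(errors="replace")
    return True, completed.stdout.decode(errors="replace")


if __name__ == "__main__":
    # The two launch commands from this issue; 30 seconds is an arbitrary limit.
    for command in (["mpirun", "-np", "1", "music", "test.music"],
                    ["mpirun", "-np", "1", "./test_node.py"]):
        finished, output = returns_within(command, 30)
        print(" ".join(command), "->", "returned" if finished else "hung")
        sys.stdout.write(output)
```

With the behaviour reported in this issue, the first command would be expected to hit the timeout after printing "before runtime", while the second would return.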