Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pymusic] Endless loop in setup.runtime(..) #35

Closed
mhoff opened this issue Jan 16, 2017 · 30 comments
Closed

[pymusic] Endless loop in setup.runtime(..) #35

mhoff opened this issue Jan 16, 2017 · 30 comments

Comments

@mhoff
Copy link
Contributor

mhoff commented Jan 16, 2017

Hello everyone,

while debugging a more complex experiment using MUSIC I recognized that two of my pymusic-nodes simply never return from calling music_setup.runtime(..), whereas music_setup = music.Setup().

I have been able to reduce the code to a very minimal not working example.
When I execute the music configuration attached to this issue via mpirun -np 1 music test.music nothing more than "before runtime" will get printed. The process appears to enter an endless loop when calling runtime on the setup object.
In contrast, mpirun -np 1 ./test_node.py works fine.

I'm well aware that this trivial node combined with this trivial configuration might be a special case for MUSIC, but this issue also arises with a more complex setup and nodes which actually consume data via incoming connections.


  • music --version: MUSIC 1.1.15 (8e0a609)
  • mpirun --version: mpirun (Open MPI) 1.6.5

Minimal example (download: test.zip):

test_node.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import music

music_setup = music.Setup()

print("before runtime")

runtime = music_setup.runtime(0.02)

print("after runtime -- not called")

for current_time in runtime:
    print(current_time)

test.music

[test]
  binary=./test_node.py
  args=
  np=1
@uahic
Copy link

uahic commented Jan 16, 2017

At the first glance it looks like that starting MUSIC with a single process and no ports does not work when launched with mpiexec or mpirun. If I start it with ipython only, it works. Ill have a look at it

@mdjurfeldt
Copy link
Contributor

mdjurfeldt commented Jan 16, 2017 via email

@uahic
Copy link

uahic commented Jan 16, 2017

@mdjurfeldt Do you know how to enable gdb to debug Cython code?

I found this: http://docs.cython.org/en/latest/src/userguide/debugging.html but there are multiple cythonize() function calls in the pymusic folder setup.py.in and I am not sure where to set the gdb flag

@mhoff
Copy link
Contributor Author

mhoff commented Jan 16, 2017

It seems not to be the special case of only one application. Running two of the above nodes with -np 2 yields the same behaviour for both nodes.

[test1]
  binary=./test_node.py
  args=
  np=1

[test2]
  binary=./test_node.py
  args=
  np=1

@uahic
Copy link

uahic commented Jan 16, 2017

@i think its a bug when you dont have any ports connecting these applications or maybe not any ports at all (note that you also have to actually create the ports in your applications)

@mhoff
Copy link
Contributor Author

mhoff commented Jan 16, 2017

Seems likely.
However, in my experiment I have a purely consuming node (2 incoming connections) which exhibits the same behaviour. But as it is a big node I can not say for certain that there is no misconfiguration.

@mhoff
Copy link
Contributor Author

mhoff commented Jan 16, 2017

This is really weird. I reduced my implementation to a point at which all configured nodes & connections are working as intended.
Then I add another node (pynest) with its three ports. Two of these ports are inputs which I simply connect to pre-existing outputs (which have been proven to be working in the minimal setup). The third port is an event output. I connect this port to another port on a pre-existing node which has not been used until that. In the code of the pre-existing node I do not publish the input. Now, when I run the whole thing, all nodes freeze in runtime(..) without an error message.
Next, I add the publish code to make the connection proper and run it again. Now, all nodes run except for the pre-existing node for which I just added a connection.

@uahic
Copy link

uahic commented Jan 16, 2017

Eventhough there is a method in the MUSIC interface to check whether a port is connected, NEST does not make use of it and always assumes ports to be connected. I will fix this soon (if it makes sense ... ) as I am refactoring all recording devices in NEST anyways (already started with that). Be prepared 8-)

@mhoff
Copy link
Contributor Author

mhoff commented Jan 16, 2017

Many thanks. But still the bigger issue what happens after setting up the connection properly:

Next, I add the publish code to make the connection proper and run it again. Now, all nodes run except for the pre-existing node for which I just added a connection.

Suddenly, after hooking in another node, a node which has been working before does not return from runtime() anymore. I published and mapped the new event input port the same way I have set up another event input port on the same node which has been working before. Really weird.

@mdjurfeldt
Copy link
Contributor

mdjurfeldt commented Jan 16, 2017 via email

@mhoff
Copy link
Contributor Author

mhoff commented Jan 16, 2017

That would be difficult. The pynest node relies on a custom nest module (which in turn is responsible for creating one of the music ports). Maybe I can reproduce the faulty behavior with a more simple setup.

But before I do that, can you tell me why the above use case with two independent music nodes fails? I totally understand that one single node can be a special case for the music scheduler. And I certainly also understand that having simply two unconnected music nodes is a somehow related scenario in terms of the scheduler. But do both behaviors really emerge from the same origin?

@mhoff
Copy link
Contributor Author

mhoff commented Jan 17, 2017

Is the order of nodes & connections in the music configuration relevant for the scheduler? While trying to produce a showcase for this issue I was moving around nodes and connections inside the configuration file. Now other nodes are not responding anymore. Attached to this comment you find my current configuration file. Everything above # 3 is the reduced version which was working before I changed the order in the configuration file. Now even the part up to # 1 is not working anymore (ros_sensor_adapter does not return from runtime..).


[distance_provider]
  binary=ros_sensor_adapter
  args=
  np=1
  music_timestep=0.05
  ros_topic=/distance/to_center
  message_type=FloatArray
  sensor_update_rate=10

[reward_node]
  binary=../../src/compute_node.py
  args=
  np=1

distance_provider.out -> reward_node.distance_in [1]

# 1

[plotter]
  np=1
  binary=../../src/plotter_node.py

reward_node.reward_out -> plotter.reward_in [1]

# 2

[dvs]
  binary=ros_event_sensor_adapter
  args=
  np=1
  music_timestep=0.5
  ros_topic=/dvs/events
  message_type=EventArray
  sensor_update_rate=1

dvs.out -> plotter.pattern_in [200]

# 3

#[nest]
#  binary=../../src/network_node.py
#  args=
#  np=1

#dvs.out -> nest.pattern_in [200]
#nest.activity_out -> plotter.activity_in [20]

# 4

#reward_node.reward_out -> nest.reward_in [1]

# 5

#[decoder]
#  binary=linear_readout_decoder
#  args=
#  np=1
#  music_timestep=0.05
#  tau=0.03
#  weights_filename=activity_to_velocity_translation_weights.dat
#[command]
#  binary=ros_command_adapter
#  args=
#  np=1
#  music_timestep=0.05
#  ros_topic=/cmd_vel
#  message_mapping_filename=velocity_to_twist.dat
#  command_rate=20

#nest.activity_out -> decoder.in [20]
#decoder.out -> command.in [2]

@mhoff
Copy link
Contributor Author

mhoff commented Jan 18, 2017

I have to retreat on my last statement. A little bug (stoptime = ... got missing) caused the weird behaviour with the order appearing to be relevant.
The above configuration file is now working up to # 4. If I activate the connection reward_node.reward_out -> nest.reward_in [1], the plotter_node will not return from runtime(..) anymore. This is weird as this node is not participating in this specific connection.

@mdjurfeldt
Copy link
Contributor

mdjurfeldt commented Jan 18, 2017 via email

@mhoff
Copy link
Contributor Author

mhoff commented Jan 19, 2017

@mdjurfeldt Thank you very much for your effort!

@mhoff
Copy link
Contributor Author

mhoff commented Jan 19, 2017

Maybe this helps in the investigation: When running the minified experiment I experience very inconsistent performance over time, even though all nodes appear to have roughly constant workload.
screenshot
In the above graphics you can see how the reward (negative proportional to distance) changes based on the monotonic movement of an object through space. The object passes through the center of the world which maximizes the reward and decreases afterwards until being reset to the starting position. The distance_provider is a ROS/MUSIC node which reads the distance from a real-valued ROS topic with frequency 1000 and transfers this into the MUSIC world. The reward_node MUSIC node in place reads the distance and emits the reward. The plotter MUSIC node then gathers this signal and draws it as seen above. As you can see, the reward signal is not as linear as it should be. Looking into the execution logs shows the following for correctly sampled partitions of the signal:

reward: ts = 1.78, last_cycle_time = 8.51154e-05
plotter: ts = 1.68, last_cycle_time = 0.000126839
reward: ts = 1.8, last_cycle_time = 0.000105858
plotter: ts = 1.7, last_cycle_time = 0.000150919
reward: ts = 1.82, last_cycle_time = 9.5129e-05
plotter: ts = 1.72, last_cycle_time = 0.000129938
reward: ts = 1.84, last_cycle_time = 9.98974e-05
plotter: ts = 1.74, last_cycle_time = 0.000133991
reward: ts = 1.86, last_cycle_time = 7.82013e-05
plotter: ts = 1.76, last_cycle_time = 0.000138998
reward: ts = 1.88, last_cycle_time = 0.000101089
plotter: ts = 1.78, last_cycle_time = 0.000138044
reward: ts = 1.9, last_cycle_time = 0.000106096

As both MUSIC nodes share the same timestep this is also the expected output. However, when simulation performance drops and signal sampling gets incorrect you can see the following:

reward: ts = 4.72, last_cycle_time = 3.79086e-05
reward: ts = 4.74, last_cycle_time = 0.000125885
reward: ts = 4.76, last_cycle_time = 3.79086e-05
reward: ts = 4.78, last_cycle_time = 3.91006e-05
reward: ts = 4.8, last_cycle_time = 3.8147e-05
plotter: ts = 4.62, last_cycle_time = 0.000164032
plotter: ts = 4.64, last_cycle_time = 0.000178099
plotter: ts = 4.66, last_cycle_time = 0.000163794
plotter: ts = 4.68, last_cycle_time = 0.000164032
plotter: ts = 4.7, last_cycle_time = 0.000169039
reward: ts = 4.82, last_cycle_time = 3.91006e-05
reward: ts = 4.84, last_cycle_time = 0.000127077
reward: ts = 4.86, last_cycle_time = 3.79086e-05
reward: ts = 4.88, last_cycle_time = 3.8147e-05
reward: ts = 4.9, last_cycle_time = 3.91006e-05
plotter: ts = 4.72, last_cycle_time = 0.000164986
plotter: ts = 4.74, last_cycle_time = 0.00016284
plotter: ts = 4.76, last_cycle_time = 0.000164986
plotter: ts = 4.78, last_cycle_time = 0.000165939
plotter: ts = 4.8, last_cycle_time = 0.000185966
reward: ts = 4.92, last_cycle_time = 3.8147e-05
reward: ts = 4.94, last_cycle_time = 0.000123024
reward: ts = 4.96, last_cycle_time = 3.79086e-05
reward: ts = 4.98, last_cycle_time = 3.88622e-05
reward: ts = 5.0, last_cycle_time = 3.8147e-05
plotter: ts = 4.82, last_cycle_time = 0.000179052
plotter: ts = 4.84, last_cycle_time = 0.000163078
plotter: ts = 4.86, last_cycle_time = 0.000141859
plotter: ts = 4.88, last_cycle_time = 0.0001888

Still, both nodes execute with the same timestep. I know that this might be a console issue gathering the logs from the different processes. But still, this behaviour is clearly reproducable and correlates with the data in the graph. So I presume this could also be a scheduler issue.


Second thing I wanted to add is an error which just occurred for the first time in many many simulation runs. I know this is more related to NEST, but still maybe it helps in some way.

[figipc156:11872] *** Process received signal ***
[figipc156:11872] Signal: Segmentation fault (11)
[figipc156:11872] Signal code: Address not mapped (1)
[figipc156:11872] Failing at address: 0xffffffffe5663870
[figipc156:11872] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7ff0f2d508d0]
[figipc156:11872] [ 1] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZN4nest17MusicEventHandlerclEdN5MUSIC11GlobalIndexE+0xc) [0x7ff0ee793f4c]
[figipc156:11872] [ 2] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC28EventInputSubconnectorGlobal7receiveEPci+0x86) [0x7ff0ecff1736]
[figipc156:11872] [ 3] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC19EventInputConnector4tickEv+0x1c0) [0x7ff0ecff93d0]
[figipc156:11872] [ 4] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC12UnicommAgent4tickERNS_5ClockE+0x4a) [0x7ff0ed00beda]
[figipc156:11872] [ 5] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC9Scheduler4tickERNS_5ClockE+0x24) [0x7ff0ed012324]
[figipc156:11872] [ 6] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC7Runtime4tickEv+0x6c) [0x7ff0ecfbd45c]
[figipc156:11872] [ 7] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZN4nest12Communicator18advance_music_timeEl+0x25) [0x7ff0ee721765]
[figipc156:11872] [ 8] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(+0x61ba9b) [0x7ff0ee77ea9b]
[figipc156:11872] [ 9] /usr/lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x3f) [0x7ff0ec2557df]
[figipc156:11872] [10] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZN4nest9Scheduler6updateEv+0x1d6) [0x7ff0ee77b8e6]
[figipc156:11872] [11] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZN4nest9Scheduler6resumeEv+0x4e) [0x7ff0ee77bd9e]
[figipc156:11872] [12] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZN4nest9Scheduler8simulateERKNS_4TimeE+0x301) [0x7ff0ee783151]
[figipc156:11872] [13] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZNK4nest10NestModule16SimulateFunction7executeEP14SLIInterpreter+0x2d3) [0x7ff0ee749543]
[figipc156:11872] [14] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZN14SLIInterpreter8execute_Em+0x255) [0x7ff0ee7c0e25]
[figipc156:11872] [15] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZN14SLIInterpreter7executeERKSs+0x135) [0x7ff0ee7c13a5]
[figipc156:11872] [16] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(+0x2e1545) [0x7ff0ee444545]
[figipc156:11872] [17] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(+0x2e1073) [0x7ff0ee444073]
[figipc156:11872] [18] /usr/bin/python(PyEval_EvalFrameEx+0x130e) [0x4cadbe]
[figipc156:11872] [19] /usr/bin/python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:11872] [20] /usr/bin/python() [0x4e5fe8]
[figipc156:11872] [21] /usr/bin/python(PyEval_EvalFrameEx+0x28bb) [0x4cc36b]
[figipc156:11872] [22] /usr/bin/python(PyEval_EvalCodeEx+0x411) [0x4c87a1]
[figipc156:11872] [23] /usr/bin/python(PyEval_EvalFrameEx+0x1e11) [0x4cb8c1]
[figipc156:11872] [24] /usr/bin/python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:11872] [25] /usr/bin/python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:11872] [26] /usr/bin/python(PyEval_EvalCodeEx+0x411) [0x4c87a1]
[figipc156:11872] [27] /usr/bin/python() [0x5030ef]
[figipc156:11872] [28] /usr/bin/python(PyRun_FileExFlags+0x82) [0x4f8c72]
[figipc156:11872] [29] /usr/bin/python(PyRun_SimpleFileExFlags+0x197) [0x4f7d77]
[figipc156:11872] *** End of error message ***

@mhoff
Copy link
Contributor Author

mhoff commented Jan 24, 2017

If I can help you with anything please don't hesitate to let me know. Also if you have any idea about the potential source of the problem, maybe I can assist in the investigation if I know where to look.

@mhoff
Copy link
Contributor Author

mhoff commented Jan 25, 2017

I found another weird behaviour. When trying to feed nest.reward_in with data of other ports (which by the way does not change the above behaviour) I experienced reproducible SegFaults for publishing unconnected ContOutputs.

Based on the above version which is working without the questionable connection, I add

music_setup.publishContOutput("sim_time_out")

to plotter_node (before runtime()) without connecting this port in the music configuration (see configuration file below). Now, I experience the following when I try to execute the modified setup:

[figipc156:25236] *** Process received signal ***
[figipc156:25236] Signal: Segmentation fault (11)
[figipc156:25236] Signal code: Address not mapped (1)
[figipc156:25236] Failing at address: 0x59
[figipc156:25236] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7fc21e9fb8d0]
[figipc156:25236] [ 1] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC5Setup19maybePostponedSetupEv+0) [0x7fc21caf4850]
[figipc156:25236] [ 2] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC4PortC1EPNS_5SetupESs+0x46) [0x7fc21cb11926]
[figipc156:25236] [ 3] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC5Setup17publishContOutputESs+0x46) [0x7fc21caf38a6]
[figipc156:25236] [ 4] /home/hoff/.local/lib/python2.7/site-packages/music/pymusic.so(+0x17e14) [0x7fc21d58fe14]
[figipc156:25236] [ 5] python(PyEval_EvalFrameEx+0x130e) [0x4cadbe]
[figipc156:25236] [ 6] python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:25236] [ 7] python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:25236] [ 8] python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:25236] [ 9] python(PyEval_EvalCodeEx+0x411) [0x4c87a1]
[figipc156:25236] [10] python() [0x5030ef]
[figipc156:25236] [11] python(PyRun_FileExFlags+0x82) [0x4f8c72]
[figipc156:25236] [12] python(PyRun_SimpleFileExFlags+0x197) [0x4f7d77]
[figipc156:25236] [13] python(Py_Main+0x562) [0x4982f2]
[figipc156:25236] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fc21dd3fb45]
[figipc156:25236] [15] python() [0x497ca0]
[figipc156:25236] *** End of error message ***

This error appears in publishContOutput and is reliably reproducible.
Publishing unconnected ContOutputs works perfectly for trivial test cases, but in this setup it clearly fails.


stoptime=100000.0

[distance_provider]
  binary=ros_sensor_adapter
  args=
  np=1
  music_timestep=0.01
  ros_topic=/distance/to_center
  message_type=FloatArray
  sensor_update_rate=1000

[reward_gen]
  binary=python/reward_node.py
  args=
  np=1

distance_provider.out -> reward_gen.distance_in [1]

[plotter]
  np=1
  binary=python/plotter_node.py

reward_gen.reward_out -> plotter.reward_in [1]

[dvs]
  binary=ros_event_sensor_adapter
  args=
  np=1
  music_timestep=0.01
  ros_topic=/camera/dvs/events
  message_type=EventArray
  sensor_update_rate=60

dvs.out -> plotter.pattern_in [200]

[nest]
  binary=python/network_node.py
  args=
  np=1

dvs.out -> nest.pattern_in [200]
nest.activity_out -> plotter.activity_in [20]

[decoder]
  binary=linear_readout_decoder
  args=
  np=1
  music_timestep=0.05
  tau=0.03
  weights_filename=res/activity_to_velocity_translation_weights.dat
[command_gen]
  binary=ros_command_adapter
  args=
  np=1
  music_timestep=0.05
  ros_topic=/cmd_vel
  message_mapping_filename=res/velocity_to_twist.dat
  command_rate=20

nest.activity_out -> decoder.in [20]
decoder.out -> command_gen.in [2]

@mdjurfeldt
Copy link
Contributor

mdjurfeldt commented Jan 25, 2017 via email

@mhoff
Copy link
Contributor Author

mhoff commented Jan 25, 2017

The nest node uses the standard NEST-inherent thread-based parallelization (e.g., nest.SetKernelStatus({'local_num_threads': 5}). The nodes plotter and reward_gen are standard pymusic nodes without parallelity (except for some MPI communication, which is currently disabled). The ros_music_adapters use may use one thread per node to bridge between MUSIC and ROS.

But I think I have experienced this kind of fault ("Address not mapped" originating from libpthread.so) already months ago, long before I was using the ros_music_adapters at all. Do you have any idea how pthreads could get into my workflow? I have even tested the network_node with local_num_threads being not set, which did not change a thing.


Update:
I can reproduce this error with one pymusic and one pynest node or with one pymusic and one ros_music_adapter node. That means it can not be the parallelization of NEST, nor the threading of the ros_music_adapter. But it can also not be the pymusic node, which is not using threading or any other form of parallelity at all.

@mdjurfeldt
Copy link
Contributor

mdjurfeldt commented Jan 25, 2017 via email

@mhoff
Copy link
Contributor Author

mhoff commented Jan 25, 2017

  • nest --version

    NEST version 2.10.0, built on Jan 16 2017 12:24:41 for x86_64-unknown-linux-gnu
    Copyright (C) 2004 The NEST Initiative
    
  • mpirun --version

    mpirun (Open MPI) 1.6.5
    
  • uname -a

    Linux $hostname 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08) x86_64 GNU/Linux
    

As of music, I have tested against the current INCF version (8e0a609) and the current version of the fork by Philipp Weidel (weidel-p/MUSIC@dd96d32).

@mhoff
Copy link
Contributor Author

mhoff commented Jan 25, 2017

When building music, ./configure outputs the following:

[...]
checking which MPI system we think we are using... SYSGUESS=openmpi
checking MPI_CXXFLAGS... -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi -pthread
checking MPI_CFLAGS... -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi -pthread
checking MPI_LDFLAGS... -pthread -L/usr//lib -L/usr/lib/openmpi/lib -lmpi_cxx -lmpi -ldl -lhwloc
[...]

I tried ./configure MPI_CXXFLAGS="-I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi" MPI_CFLAGS="-I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi" MPI_LDFLAGS="-L/usr//lib -L/usr/lib/openmpi/lib -lmpi_cxx -lmpi -ldl -lhwloc" to get rid of pthreads. After, I installed MUSIC again and rebuilt all ros_music_adapters. Then, I ran the testcase with pymusic+ros_music_adapter. But still, "Address not mapped" in libpthread.so.

ldd `which music` still contains pthread:

        linux-vdso.so.1 (0x00007fff9f9b7000)
        libmusic.so.1 => /home/hoff/.local/lib/libmusic.so.1 (0x00007f8c0f263000)
        libmpi_cxx.so.1 => /usr/lib/libmpi_cxx.so.1 (0x00007f8c0f048000)
        libmpi.so.1 => /usr/lib/libmpi.so.1 (0x00007f8c0ecb6000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f8c0eab2000)
        libhwloc.so.5 => /usr/lib/x86_64-linux-gnu/libhwloc.so.5 (0x00007f8c0e86d000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f8c0e562000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f8c0e261000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f8c0e04b000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f8c0de2e000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8c0da83000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f8c0d87b000)
        libnsl.so.1 => /lib/x86_64-linux-gnu/libnsl.so.1 (0x00007f8c0d663000)
        libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f8c0d460000)
        libltdl.so.7 => /usr/lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f8c0d256000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f8c0f559000)
        libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f8c0d04b000)

Continuing the crusade..

In a clean clone of MUSIC:

$ grep -r pthread . | grep -v ros/

./utils/VisualiseNeurons.h:#include <pthread.h>
./utils/VisualiseNeurons.h:  pthread_t tickThreadID_;
./utils/Makefile.am: viewevents_LDADD = $(top_builddir)/src/libmusic.la @MPI_LDFLAGS@  -lglut -lpthread
./utils/VisualiseNeurons.cpp:    pthread_create(&tickThreadID_, NULL, runMusic, &synchFlag_);
./utils/VisualiseNeurons.cpp:    pthread_join(tickThreadID_,&exitStatus);

$ ./autogen.sh && grep -r pthread . | grep -v ros/

[... autogen ...]
./ltmain.sh:	   # Do not include libc_r directly, use -pthread flag.
./ltmain.sh:      -mt|-mthreads|-kthread|-Kthread|-pthread|-pthreads|--thread-safe \
./ltmain.sh:	-mt|-mthreads|-kthread|-Kthread|-pthread|-pthreads|--thread-safe \
./utils/VisualiseNeurons.h:#include <pthread.h>
./utils/VisualiseNeurons.h:  pthread_t tickThreadID_;
./utils/Makefile.am: viewevents_LDADD = $(top_builddir)/src/libmusic.la @MPI_LDFLAGS@  -lglut -lpthread
./utils/Makefile.in:viewevents_LDADD = $(top_builddir)/src/libmusic.la @MPI_LDFLAGS@  -lglut -lpthread
./utils/VisualiseNeurons.cpp:    pthread_create(&tickThreadID_, NULL, runMusic, &synchFlag_);
./utils/VisualiseNeurons.cpp:    pthread_join(tickThreadID_,&exitStatus);

$ ./configure --prefix=/home/hoff/.local MPI_CXXFLAGS="-I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi" MPI_CFLAGS="-I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi" MPI_LDFLAGS="-L/usr//lib -L/usr/lib/openmpi/lib -lmpi_cxx -lmpi -ldl -lhwloc && grep -r pthread . | grep -v ros/

[... configure ...]
./ltmain.sh:	   # Do not include libc_r directly, use -pthread flag.
./ltmain.sh:      -mt|-mthreads|-kthread|-Kthread|-pthread|-pthreads|--thread-safe \
./ltmain.sh:	-mt|-mthreads|-kthread|-Kthread|-pthread|-pthreads|--thread-safe \
./utils/VisualiseNeurons.h:#include <pthread.h>
./utils/VisualiseNeurons.h:  pthread_t tickThreadID_;
./utils/Makefile.am: viewevents_LDADD = $(top_builddir)/src/libmusic.la @MPI_LDFLAGS@  -lglut -lpthread
./utils/Makefile:viewevents_LDADD = $(top_builddir)/src/libmusic.la -L/usr//lib -L/usr/lib/openmpi/lib -lmpi_cxx -lmpi -ldl -lhwloc  -lglut -lpthread
./utils/Makefile.in:viewevents_LDADD = $(top_builddir)/src/libmusic.la @MPI_LDFLAGS@  -lglut -lpthread
./utils/VisualiseNeurons.cpp:    pthread_create(&tickThreadID_, NULL, runMusic, &synchFlag_);
./utils/VisualiseNeurons.cpp:    pthread_join(tickThreadID_,&exitStatus);
./libtool:	   # Do not include libc_r directly, use -pthread flag.
./libtool:      -mt|-mthreads|-kthread|-Kthread|-pthread|-pthreads|--thread-safe \
./libtool:	-mt|-mthreads|-kthread|-Kthread|-pthread|-pthreads|--thread-safe \

$ make -j 12 && grep -r pthread . | grep -v ros/

Now, many many files and objects contain pthread. See grep result.

After install, $ ldd `which music` :

	linux-vdso.so.1 (0x00007ffe52bed000)
	libmusic.so.1 => /home/hoff/.local/lib/libmusic.so.1 (0x00007f04700fd000)
	libmpi_cxx.so.1 => /usr/lib/libmpi_cxx.so.1 (0x00007f046fee2000)
	libmpi.so.1 => /usr/lib/libmpi.so.1 (0x00007f046fb50000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f046f94c000)
	libhwloc.so.5 => /usr/lib/x86_64-linux-gnu/libhwloc.so.5 (0x00007f046f707000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f046f3fc000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f046f0fb000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f046eee5000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f046eb3a000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f046e932000)
	libnsl.so.1 => /lib/x86_64-linux-gnu/libnsl.so.1 (0x00007f046e71a000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f046e517000)
	libltdl.so.7 => /usr/lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f046e30d000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f046e0f0000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f04703f3000)
	libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f046dee5000)

I am not able to produce a music binary without containing pthread. Presumably, I misunderstood you and the music binary containing references to pthread is totally fine.

@mhoff
Copy link
Contributor Author

mhoff commented Jan 26, 2017

I found the source of this SegFault problem. I accidentally inserted the code for the test output after runtime was called. That lead to pymusic calling Setup::publishContOutput with a setup object being the null pointer. Maybe pymusic should be aware of that issue and report an error message instead.


I reworked my code to disallow the above behaviour in the future. Now, I run into another kind of faulty behaviour. A very simple configuration:

stoptime=100000.0

[distance_provider]
  binary=ros_sensor_adapter
  args=
  np=1
  music_timestep=0.01
  ros_topic=/distance/to_center
  message_type=FloatArray
  sensor_update_rate=1000
[reward_gen]
  binary=python/reward_node.py
  args=
  np=1
distance_provider.out -> reward_gen.distance_in [1]

[plotter]
  np=1
  binary=python/plotter_node.py

reward_gen.reward_out -> plotter.reward_in [1]

The above part with the ros_music_adapter and the reward_gen node works very reliable. With the plotter node I am experimenting right now. When I add an unconnected EventOutput to the plotter node (while setup is still valid..) I experience two different kinds of error messages.

First of all, this is how I added the Output to the code which has been working before 100%:

event_out_proxy = music_setup.publishEventOutput("test_event_out")
print("event_out_proxy connection status {}".format(event_out_proxy.isConnected()))

Connection status is False, because there is no connection in the music configuration, which is correct.
Error type 1:

event_out_proxy connection status False
1485433258 | __init__ | WARNING  | Output port sim_time_out is not connected
1485433258 | __init__ | WARNING  | Input port test_cont_in is not connected
[figipc156:03614] *** Process received signal ***
[figipc156:03614] Signal: Segmentation fault (11)
[figipc156:03614] Signal code: Invalid permissions (2)
[figipc156:03614] Failing at address: 0x7f8109db56b8
[figipc156:03614] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7f810a6ed8d0]
[figipc156:03614] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x3a56b8) [0x7f8109db56b8]
[figipc156:03614] *** End of error message ***

Error type 2:

event_out_proxy connection status False
1485433289 | node | INFO     | Dropping to runtime with timestep 0.02...
1485433289 | __init__ | WARNING  | Output port sim_time_out is not connected
1485433289 | __init__ | WARNING  | Input port test_cont_in is not connected
[figipc156:03655] *** Process received signal ***
[figipc156:03655] Signal: Segmentation fault (11)
[figipc156:03655] Signal code: Address not mapped (1)
[figipc156:03655] Failing at address: 0x30
[figipc156:03655] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7efd780768d0]
[figipc156:03655] [ 1] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC15EventOutputPort10buildTableEv+0x29) [0x7efd7618b2f9]
[figipc156:03655] [ 2] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC7Runtime11buildTablesEPNS_5SetupE+0x24) [0x7efd76169334]
[figipc156:03655] [ 3] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC7RuntimeC1EPNS_5SetupEd+0x2cc) [0x7efd76169dbc]
[figipc156:03655] [ 4] /home/hoff/.local/lib/python2.7/site-packages/music/pymusic.so(+0x1dd1f) [0x7efd76c10d1f]
[figipc156:03655] [ 5] python() [0x4ba865]
[figipc156:03655] [ 6] /home/hoff/.local/lib/python2.7/site-packages/music/pymusic.so(+0x17a8c) [0x7efd76c0aa8c]
[figipc156:03655] [ 7] python(PyEval_EvalFrameEx+0x130e) [0x4cadbe]
[figipc156:03655] [ 8] python(PyEval_EvalCodeEx+0x411) [0x4c87a1]
[figipc156:03655] [ 9] python(PyEval_EvalFrameEx+0x1e11) [0x4cb8c1]
[figipc156:03655] [10] python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:03655] [11] python(PyEval_EvalCodeEx+0x411) [0x4c87a1]
[figipc156:03655] [12] python() [0x5030ef]
[figipc156:03655] [13] python(PyRun_FileExFlags+0x82) [0x4f8c72]
[figipc156:03655] [14] python(PyRun_SimpleFileExFlags+0x197) [0x4f7d77]
[figipc156:03655] [15] python(Py_Main+0x562) [0x4982f2]
[figipc156:03655] [16] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7efd773bab45]
[figipc156:03655] [17] python() [0x497ca0]
[figipc156:03655] *** End of error message ***

That means, even though the proxy is correctly classified as unconnected something in MUSIC is heavily affected. Note that removing these two lines leads to perfectly correct behaviour again.

@mdjurfeldt
Copy link
Contributor

mdjurfeldt commented Jan 26, 2017 via email

@mhoff
Copy link
Contributor Author

mhoff commented Jan 26, 2017

I built a very small test case (test.zip). I am actually surprised that this does not work. Maybe I just forgot a parameter or something.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import music

setup = music.Setup()

event_out = setup.publishEventOutput("event_out_1")
print("connected = {}".format(event_out.isConnected()))

times = setup.runtime(0.02)

for time in times:
    print(time)
[node1]
    binary=./node.py
    np=1
    args=

mpirun -np 1 music conf.music

connected = False
[figipc156:05420] *** Process received signal ***
[figipc156:05420] Signal: Segmentation fault (11)
[figipc156:05420] Signal code: Invalid permissions (2)
[figipc156:05420] Failing at address: 0x7f6846669718
[figipc156:05420] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7f6846fa18d0]
[figipc156:05420] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x3a5718) [0x7f6846669718]
[figipc156:05420] *** End of error message ***

Note: Yes, again this is only one node, but this time there is a clear and deterministic error. Also, I can reproduce this with two nodes, connected with two continuous ports, plus this unconnected event output. The same behaviour occurs.

@mhoff
Copy link
Contributor Author

mhoff commented Jan 28, 2017

I am getting closer to the original problem regarding the single connection rendering one node unusuable. At the bottom you find my complete setup. The problematic part is:

[...]
reward_gen.reward_out -> plotter.test_cont_in [1]
plotter.test_cont_out -> nest.reward_in [1]
#reward_gen.reward_out -> nest.reward_in [1]
[...]

If I use the two connections, which are currently active, the whole experiment is working. But if I instead use the direct connection, with the real reward data, the plotter node stops after a few timesteps.

In comparison:

  • Functional version (with dummy connections):

    [...]
    reward: ts = 0.02, last_cycle_time = 9.29832e-05
    reward: ts = 0.04, last_cycle_time = 2.7895e-05
    plotter: ts = 0.02, last_cycle_time = 8.79765e-05
    reward: ts = 0.06, last_cycle_time = 2.69413e-05
    plotter: ts = 0.04, last_cycle_time = 5.6982e-05
    plotter: ts = 0.06, last_cycle_time = 5.50747e-05
    reward: ts = 0.08, last_cycle_time = 2.5034e-05
    plotter: ts = 0.08, last_cycle_time = 5.50747e-05
    [...]
    
  • Non-functional version (with single, real connection):

    [...]
    plotter: ts = 0.1, last_cycle_time = 8.01086e-05
    reward: ts = 0.12, last_cycle_time = 7.20024e-05
    plotter: ts = 0.12, last_cycle_time = 8.08239e-05
    reward: ts = 0.14, last_cycle_time = 7.00951e-05
    plotter: ts = 0.14, last_cycle_time = 7.70092e-05
    reward: ts = 0.16, last_cycle_time = 6.79493e-05
    plotter: ts = 0.16, last_cycle_time = 8.08239e-05
    reward: ts = 0.18, last_cycle_time = 6.79493e-05
    plotter: ts = 0.18, last_cycle_time = 7.70092e-05
    reward: ts = 0.2, last_cycle_time = 4.1008e-05
    reward: ts = 0.22, last_cycle_time = 4.91142e-05
    reward: ts = 0.24, last_cycle_time = 4.29153e-05
    reward: ts = 0.26, last_cycle_time = 4.31538e-05
    reward: ts = 0.28, last_cycle_time = 4.19617e-05
    reward: ts = 0.3, last_cycle_time = 4.00543e-05
    reward: ts = 0.32, last_cycle_time = 5.98431e-05
    [...]
    

htop shows the process is still running and working on 100% CPU.


stoptime=100000.0

[distance_provider]
  binary=ros_sensor_adapter
  args=
  np=1
  music_timestep=0.01
  ros_topic=/distance/to_center
  message_type=FloatArray
  sensor_update_rate=1000

[reward_gen]
  binary=python/reward_node.py
  args=
  np=1

distance_provider.out -> reward_gen.distance_in [1]

[plotter]
  np=1
  binary=python/plotter_node.py

reward_gen.reward_out -> plotter.reward_in [1]

[dvs]
  binary=ros_event_sensor_adapter
  args=
  np=1
  music_timestep=0.01
  ros_topic=/camera/dvs/events
  message_type=EventArray
  sensor_update_rate=60

dvs.out -> plotter.pattern_in [200]

[nest]
  binary=python/network_node.py
  args=
  np=1

dvs.out -> nest.pattern_in [200]
nest.activity_out -> plotter.activity_in [20]

# TODO
reward_gen.reward_out -> plotter.test_cont_in [1]
plotter.test_cont_out -> nest.reward_in [1]
#reward_gen.reward_out -> nest.reward_in [1]

[decoder]
  binary=linear_readout_decoder
  args=
  np=1
  music_timestep=0.05
  tau=0.03
  weights_filename=res/activity_to_velocity_translation_weights.dat
[command_gen]
  binary=ros_command_adapter
  args=
  np=1
  music_timestep=0.05
  ros_topic=/cmd_vel
  message_mapping_filename=res/velocity_to_twist.dat
  command_rate=20

nest.activity_out -> decoder.in [20]
decoder.out -> command_gen.in [2]

@mhoff
Copy link
Contributor Author

mhoff commented Jan 30, 2017

This shows the blocking situation with the direct connection between nest and reward_gen.

music_connection_hang_1

@mhoff
Copy link
Contributor Author

mhoff commented Feb 16, 2017

We could track down the problem to a very minimal working example. On some architectures this produces a significant simulation time offset between two nodes with a unidirectional connection, whereas the connection is configured to have nor latency or buffering.
Other architectures might be able to execute this minimal example, but fail when are more complex setup is used. (I can provide such an example on request).

@mdjurfeldt suggests --disable-isend when configuring the build to circumvent this faulty behaviour. I can confirm that this eliminates the issue. Just dropping this here to document this.

@mdjurfeldt
Copy link
Contributor

Since the problem @mhoff had is resolved by a workaround, I have condensed the MUSIC-related problem in the new issue #37 and am closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants