build: GNU Make jobserver to prevent resource glut (make_job_server rebased on master) #155

Closed
wants to merge 4 commits

Conversation

jbohren
Contributor

@jbohren jbohren commented Mar 10, 2015

This is a rebase and extension of @xqms's jobserver prototype in PR #85 which was created to fix #84. Without the jobserver, catkin build will use too many resources on machines with multiple cores.

Behavior using this PR:

  • catkin build --no-jobserver -- current behavior (maxes out all jobs for all packages unless using another jobserver like distcc)
  • catkin build -- build using the jobserver and as many jobs as CPUs
  • catkin build --jobserver -- explicit version of the line above
  • catkin build -jN -- build using the jobserver and N jobs
  • MAKEFLAGS="-jN" catkin build -- build using the jobserver and N jobs
  • catkin config --no-jobserver -- disable the jobserver for future builds
  • catkin build -lV -- don't create more than one job at a time if the system load is greater than V
  • catkin build --mem-limit P -- don't create more than one job at a time if more than P% of the system memory is used
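For reference, the GNU Make jobserver protocol behind these options can be sketched in a few lines. This is a simplified illustration, not the PR's actual implementation; the class and method names are made up:

```python
import os

# Simplified sketch of the GNU Make jobserver protocol: a pipe holds N-1
# tokens; a worker reads a token before starting an extra job and writes
# it back when the job finishes. Reads block when no token is available.
class JobServer(object):
    def __init__(self, num_jobs):
        self.job_pipe = os.pipe()
        # One job is always implicitly allowed, so only N-1 tokens are queued.
        os.write(self.job_pipe[1], b'+' * (num_jobs - 1))

    def make_arguments(self):
        # Sub-makes inherit flags like "--jobserver-fds=3,4 -j" so they
        # share this token pool instead of spawning jobs independently.
        return ['--jobserver-fds=%d,%d' % self.job_pipe, '-j']

    def obtain(self):
        os.read(self.job_pipe[0], 1)   # blocks until a token is free

    def release(self):
        os.write(self.job_pipe[1], b'+')
```

With `num_jobs` set to the CPU count, at most that many compile jobs run at once across every package being built.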

Outstanding issues:

Future work:

  • The number of jobs (-jN) and the number of parallel packages (-pM) are still decoupled. The jobserver will limit the total number of jobs across all Make instances, but maybe this is ok?
  • The current internal jobserver implementation does not monitor load or memory usage. These would be good arguments to support.

@jbohren
Contributor Author

jbohren commented Mar 10, 2015

@xqms I've rebased this on master and it appears to work; feel free to copy this snapshot to your fork.

@jbohren
Contributor Author

jbohren commented Mar 10, 2015

@davetcoleman when you're feeling bored, give this branch a shot and see if you still run out of resources

@jbohren
Contributor Author

jbohren commented Mar 10, 2015

Rebase of #85

@davetcoleman
Contributor

I just pulled that branch and am testing now.

@davetcoleman
Contributor

It kept my memory usage nice and stable; I think whatever this is doing helped a lot!
[Screenshot from 2015-03-09 22:57:56]

@jbohren
Contributor Author

jbohren commented Mar 10, 2015

@xqms I modified the implementation so that it's possible to disable the jobserver with the --ext-jobserver argument. This way you can still use distcc like so:

CC="distcc gcc" CXX="distcc g++" catkin build -p$(distcc -j) -j$(distcc -j) --ext-jobserver

@xqms
Contributor

xqms commented Mar 10, 2015

@davetcoleman Nice to hear it helped you as well! catkin_tools is almost unusable without it on my workspace...

@jbohren okay, sounds good, though I think @wjwwood preferred an opt-in (like --enable-jobserver). I'll have time later today or tomorrow to take a look. If you guys want to go ahead changing/merging, don't wait for me ;-)

@jbohren
Contributor Author

jbohren commented Mar 10, 2015

@xqms @wjwwood after seeing Dave's example workspace, I think that this should be the default behavior.

@jbohren
Contributor Author

jbohren commented Mar 10, 2015

@xqms Also, is there a difference between --jobserver-limit and the normal -j jobs argument?

@jbohren
Contributor Author

jbohren commented Mar 10, 2015

Something that might make sense is to add the jobserver options to the persistent config. This patch just augments the --make-args with the jobserver flags like --jobserver-fds=3,4 -j, which makes them very visible, but also cryptic.

Alternatively, it'd be better to have a high-level Job Server: [internal / external] option.

This would display like:

----------------------------------------------------------------------------
Profile:                     default
Extending:          [cached] /opt/ros/hydro
Workspace:                   /home/jbohren/ws/ascent
Source Space:       [exists] /home/jbohren/ws/ascent/src
Build Space:        [exists] /home/jbohren/ws/ascent/build
Devel Space:        [exists] /home/jbohren/ws/ascent/devel
Install Space:     [missing] /home/jbohren/ws/ascent/install
DESTDIR:                     None
----------------------------------------------------------------------------
Isolate Develspaces:         False
Install Packages:            False
Isolate Installs:            False
----------------------------------------------------------------------------
Additional CMake Args:       -DOBJREC_USE_CUDA=On -DOBJREC_CUDA_DEVICE_ID=1 
Additional Make Args:        None
Additional catkin Make Args: None
Make Job Server:             Internal
----------------------------------------------------------------------------
Workspace configuration appears valid.
----------------------------------------------------------------------------

@xqms
Contributor

xqms commented Mar 10, 2015

@jbohren -j N controls how many processes a make process may spawn. Since catkin_tools starts up to -p X parallel make processes this is useless for controlling the system load. --jobserver-limit specifies how many jobs (e.g. gcc processes) may be run in total, which is much more useful.

If the jobserver is enabled, it may make sense to replace -j and -p altogether with a single -j option controlling the jobserver limit...
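A quick illustration of why the per-make -j flag alone cannot bound system load (hypothetical numbers, not measurements from this PR):

```python
# Without a shared jobserver, parallel packages multiply with per-make jobs.
parallel_packages = 4   # catkin build -p 4 (four make processes at once)
jobs_per_make = 4       # each make may spawn four compiler jobs (-j 4)
worst_case_jobs = parallel_packages * jobs_per_make  # up to 16 gcc processes

# With a shared jobserver limit of 4 tokens, every make process draws from
# the same pool, so the total number of concurrent jobs stays at 4.
jobserver_limit = 4
```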

Yes, I agree this thing should be integrated with the workspace config.

@jbohren jbohren changed the title Rebase of make_job_server on latest master build: GNU Make jobserver to prevent resource glut (make_job_server rebased on master) Mar 10, 2015
@davetcoleman
Contributor

This is still running ok for me and catkin no longer crashes my computer, but it still uses a huge amount of memory sometimes while compiling a large workspace. Just now it got pretty close:

[Screenshot from 2015-03-11 16:34:37]

@jbohren
Contributor Author

jbohren commented Mar 12, 2015

Can you also show the memory trace when using catkin_make?

@wjwwood
Member

wjwwood commented Mar 12, 2015

Thanks for everyone helping out on this one. I'll try to review it and try it out myself by the end of the week.

@jack-oquin

If the jobserver is enabled, it may make sense to replace -j and -p altogether with a single -j option controlling the jobserver limit...

+1 That is what most of us naively expected -j to mean in the first place.

@davetcoleman
Contributor

You mean the memory history plot? I'll try to record it next time I rebuild

@jbohren
Contributor Author

jbohren commented Mar 12, 2015

Yeah.

On Thu, Mar 12, 2015, 16:58 Dave Coleman [email protected] wrote:

You mean the memory history? I'll try to record it next time I rebuild


@jbohren
Contributor Author

jbohren commented Mar 14, 2015

This still needs a bit of work to do what people "expect" it to do. Each job calls handle_make_arguments a couple of times, and I don't think that's where MAKEFLAGS should be read. It needs to be read early on and act as a default configuration; then that needs to initialize the jobserver.
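Reading MAKEFLAGS once at startup as a default could look something like this (a sketch; the function name is illustrative, not the PR's actual code):

```python
import os
import re

def default_jobs_from_makeflags(environ=os.environ):
    """Read a default -jN out of MAKEFLAGS once, at startup (sketch).

    Handles "-j4", "-j 4", and the dash-less form make sometimes uses.
    Returns None when no job count is set, letting the caller fall back
    to the CPU count.
    """
    makeflags = environ.get('MAKEFLAGS', '')
    match = re.search(r'(?:^|\s)-?j\s*([0-9]+)', makeflags)
    return int(match.group(1)) if match else None
```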

@jbohren
Contributor Author

jbohren commented Mar 16, 2015

@xqms @davetcoleman @wjwwood This should be good to review. I re-worked it to make sure the jobserver singleton gets initialized in one place, but this could still be cleaned up.

The behavior of this PR is as follows:

  • catkin build --no-jobserver -- current behavior (maxes out all jobs for all packages unless using another jobserver like distcc)
  • catkin build -- build using the jobserver and as many jobs as CPUs
  • catkin build --jobserver -- explicit version of the line above
  • catkin build -jN -- build using the jobserver and N jobs
  • MAKEFLAGS="-jN" catkin build -- build using the jobserver and N jobs

@jbohren jbohren force-pushed the make_job_server branch 3 times, most recently from 75bb565 to b35694d Compare March 16, 2015 12:32
:rtype: dict
"""

regex = r'(?:^|\s)(?:-?(j|l)(\s*[0-9]+|\s|$))' + \
Contributor

nit: you probably wanted to use JOBS_FLAGS_REGEX from above here, right?

Contributor Author

Ah yeah I meant to remove that in favor of just having two different patterns.

Member

Is this comment still unaddressed?

Contributor Author

No, it's outdated. The referenced variable isn't in the PR any more.

@xqms
Contributor

xqms commented Mar 16, 2015

@jbohren I had a look at the code, seems good to me. I'll test this in my setup today and let you know if anything breaks.

# make sure we're observing load maximums
if self.max_load is not None:
    try:
        max_load = 8.0
Member

This line is probably left over by accident?

@xqms
Contributor

xqms commented Mar 30, 2015

I have just discovered an issue: Multiple invocations of catkin build --save duplicate the --jobserver-fds=3,4 -j arguments. Looks like this after two calls:

----------------------------------------------
Additional CMake Args:       -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS=-Wall
Additional Make Args:        --jobserver-fds=3,4 -j --jobserver-fds=3,5 -j
Additional catkin Make Args: None
Internal Make Job Server:    True
----------------------------------------------

@xqms
Contributor

xqms commented Mar 30, 2015

Okay, to clarify: the --jobserver-fds argument is saved in the workspace config. It shouldn't be, since it is also generated by the "Internal Make Job Server" flag.

@jbohren
Contributor Author

jbohren commented Mar 30, 2015

@xqms Yeah, also now that Internal Make Job Server: True is in the summary, I don't think the jobserver flags should be added in the way that I have them added. I'll make it so it displays the summary like this:

----------------------------------------------
Additional CMake Args:       -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS=-Wall
Additional Make Args:        None
Additional catkin Make Args: None
Internal Make Job Server:    True, Args: "--jobserver-fds=3,4 -j"
----------------------------------------------

@wjwwood
Member

wjwwood commented Mar 30, 2015

@jbohren
Contributor Author

jbohren commented Apr 1, 2015

@wjwwood Just updated. There was an issue previously where the make arg parsing wasn't properly extracting jobs args formatted like -j N with a space between the -j and the N.

@xqms This removes the jobserver args from "additional make args". I was considering adding the jobserver max jobs / max load / max mem to the context, but we'd also have to add those options to the saved configs, which just involves a lot of additional changes.

As it stands, I think that this is a pretty critical patch and if there are any other outstanding features, we should bump them to another PR.

It would be nice to add a strong test harness around this thing, but I'm not sure what the best approach would be for that. Mostly you'd want to test:

  • defaults
  • cli args
  • MAKEFLAGS args
  • disabling the jobserver
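The cases listed above could be covered with assertions against the arg parser. The extraction function and regex below are illustrative stand-ins for the PR's real parser, not its actual code:

```python
import re

# Illustrative make-arg parser: accepts "-j4", "-j 4", bare "-j", and the
# corresponding "-l" load-average flags, with or without the leading dash.
JOB_FLAG_PATTERN = re.compile(r'(?:^|\s)-?(j|l)\s*([0-9]+)?(?=\s|$)')

def extract_job_flags(make_args):
    return [(flag, int(num)) if num else (flag, None)
            for flag, num in JOB_FLAG_PATTERN.findall(make_args)]

# The troublesome space-separated form from this PR, plus the basics:
assert extract_job_flags('-j4') == [('j', 4)]
assert extract_job_flags('-j 4') == [('j', 4)]    # space between -j and N
assert extract_job_flags('-j') == [('j', None)]   # unbounded / default
assert extract_job_flags('-j6 -l2') == [('j', 6), ('l', 2)]
```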

@davetcoleman
Contributor

I've been using this branch, and keeping up with its updates, every day. It seems to work for me.

@jbohren jbohren force-pushed the make_job_server branch 2 times, most recently from e881f0e to 5892168 Compare April 2, 2015 02:12
@wjwwood
Member

wjwwood commented Apr 2, 2015

You've not added the dependency on psutil correctly; you also have to update some other files:

Also the function you're using from psutil is deprecated:

https://code.google.com/p/psutil/wiki/Documentation#Memory

Please at least update the pull request to use psutil.virtual_memory().

I'm not happy about adding a dependency on psutil, but if there isn't a reasonable way to do it without psutil then I guess we'll have to. One option would be to not import psutil unless a user tries to use the memory limits.

Having said all of that, I'm really not convinced the memory limit is going to be a reliable way to manage how much memory the whole compile is using. Unlike CPU usage, the memory usage of each cc process in a make job can grow over time. Imagine:

  • each of 6 cc processes will approach 2 GB of memory usage over 30 seconds (this behavior is common with some of the pcl object files)
  • your system has 4 GB of RAM
  • you compiled with -j6
  • and you've set the memory limit to 90%

You'll hand out all six job tokens (because for the first few seconds the memory usage is really low) and each of them will grow to use 2 GB and run your system out of memory anyway. However, it's possible that it won't work out like that and the memory limit will help, but I wouldn't call it reliable.
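The arithmetic of that scenario, spelled out with the numbers from the example above (the usage at grant time is an assumed placeholder):

```python
# Token grants happen while per-job memory usage is still low, so the
# percent check passes for all six jobs; their footprints then grow.
ram_gb = 4.0
granted_jobs = 6            # compiled with -j6
peak_per_job_gb = 2.0       # e.g. heavy pcl object files
mem_limit = 0.90            # --mem-limit 90

usage_at_grant_time_gb = 0.5                 # assumed: low during cc startup
assert usage_at_grant_time_gb / ram_gb < mem_limit   # all 6 tokens granted

eventual_usage_gb = granted_jobs * peak_per_job_gb   # 12 GB on a 4 GB box
assert eventual_usage_gb > ram_gb                    # OOM despite the limit
```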

This is a tough problem, and we worked with the Tango project to work around it in a few places. The most reliable way to fix this kind of "perfect storm of parallel make jobs" was to instrument the build so we could figure out which processes, when run at the same time, busted the system. Then they would manually adjust the inter-target dependencies to prevent those object files from being built at the same time. @tfoote was the instrumenting code ever pushed somewhere public?

Based on that, I would say we should either remove the memory limit option or make it "experimental" and make the dependency on psutil optional. If a user is savvy enough to make use of the memory limit options then they are probably also ok with installing psutil when prompted.

@jbohren
Contributor Author

jbohren commented Apr 2, 2015

Having said all of that, I'm really not convinced the memory limit is going to be a reliable way to manage how much memory the whole compile is using. Unlike CPU usage, the memory usage of each cc process in a make job can grow over time. Imagine:

Of course it's not going to be reliable. But if some crazy person (@davetcoleman cough cough) is running around with a computer with no swap, he can probably set it to something like 65% and be reasonably confident it's not going to brick his machine.

This is a tough problem and we worked with the Tango project to work around this problem in a few places. The most reliable way to fix this kind of "perfect storm of parallel make jobs" was to instrument the build so we could figure out which process, when run at the same time, busted the system. Then they would manually adjust the inter-target dependencies to prevent those object files from being built at the same time. @tfoote was the instrumenting code ever pushed somewhere public?

"Perfect is the enemy of the good" -- some guy that got something done

I'm not concerned enough about this problem to do any more about it than is in the experimental memory limit check.

Based on that, I would say we should either remove the memory limit option or make it "experimental" and make the dependency on psutil optional. If a user is savvy enough to make use of the memory limit options then they are probably also ok with installing psutil when prompted.

It already is an unlisted "experimental" option. I'm happy putting a try-except around the psutil import and then throwing an error if someone tries using it without support. That being said, anyone with rqt installed will have python-psutil.

You've not added the dependency on psutil correctly, you also have to update some other files:

I'm going to remove the dependency description completely (except for the travis config) because it's going to be an optional dependency.

Also the function you're using from psutil is deprecated:

https://code.google.com/p/psutil/wiki/Documentation#Memory

Please at least update the pull request to use psutil.virtual_memory().

Yeah, I know it's deprecated. Unfortunately the python-psutil package on 12.04 doesn't have this function yet. Also, ROS Hydro depends on the python-psutil package, so ditching it in favor of the PyPI version doesn't sound like a good plan. It wouldn't be too hard to check if the new function is available and fall back to the old one if it isn't.
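The availability check described above could look like this (a sketch, with an illustrative function name; psutil.virtual_memory() is the replacement API, psutil.phymem_usage() the deprecated one shipped on 12.04):

```python
def used_memory_percent():
    """Return system memory usage in percent, lazily importing psutil."""
    try:
        import psutil
    except ImportError:
        # psutil stays an optional dependency; only --mem-limit needs it.
        raise RuntimeError(
            'The --mem-limit option requires psutil to be installed')
    if hasattr(psutil, 'virtual_memory'):
        return psutil.virtual_memory().percent   # newer psutil
    return psutil.phymem_usage().percent         # deprecated older API
```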

@wjwwood
Member

wjwwood commented Apr 2, 2015

It already is an unlisted "experimental" option. I'm happy putting a try-except around the psutil import and then throwing an error if someone tries using it without support. That being said, anyone with rqt installed will have python-psutil.

I would appreciate making it optional. It's not going to be a problem for ROS users, but I put a lot of effort into minimizing the dependencies, so I'd like to keep it that way if possible.

Yeah I know it's deprecated. Unfortunately the python-psutil package on 12.04 doesn't have this function yet. Also ROS Hydro depends on the python-psutil package, so ditching it in favor of the pypi doesn't sound like a good plan. It wouldn't be too hard to just check if the function is available and use the new one if the old one isn't available.

I remember this now, it was quite a problem for rqt_top. Unfortunately the pypi version is never an option because we cannot build a .deb that depends on something in pypi.

If you can get the optional import in place tonight or tomorrow morning I can merge this and get it in the next release. Tomorrow is our ROS bug fix party, so I plan to spend some time on this tracker and making a release.

Thanks for working on it and iterating with me.

@tfoote
Contributor

tfoote commented Apr 2, 2015

@wjwwood re: collecting build parameters of specific invocations. The tool is available at https://github.com/osrf/watchprocess It can generate a report of CPU and memory usage for each process during a build, using an instrumented program.

@davetcoleman
Contributor

Thanks for the crazy-person shout-out @jbohren, I've taken your advice:
[Screenshot from 2015-04-02 12:07:20]

@jbohren
Contributor Author

jbohren commented Apr 2, 2015

@wjwwood I added the lazy import of psutil and the version checking. I've tested it with version 0.4.1 but no other versions.

@jbohren jbohren force-pushed the make_job_server branch 2 times, most recently from 5702aec to 970d270 Compare April 3, 2015 01:24
xqms and others added 4 commits April 3, 2015 04:42
This makes it possible to enforce a global limit on the number of
processes through the --jobserver-limit argument.

build: jobserver: provide context manager interface

... and use it in executor.py

build: jobserver: silent support test

build: provide job information on status line nicely

as suggested by NikolausDemmel.
- you can disable the job server and use it with other job servers via
  catkin_tools context config
- the jobserver can be parameterized by the normal -j argument as well
  as -j stored in MAKEFLAGS
- the jobserver can regulate jobs based on the current system load
- the jobserver can regulate jobs based on available RAM
- simplifying enabling / supporting
- adding parsing of `-l` / `--load-average` args
- adding hidden experimental option for limiting based on memory
- Simplifying context summary to not show jobserver args as "additional
  make args"
- Fixing failure to parse '-j N' where there is a space separating the
  '-j' and 'N'
wjwwood added a commit that referenced this pull request Apr 3, 2015
@wjwwood
Member

wjwwood commented Apr 3, 2015

I merged this with some adjustments.

@wjwwood wjwwood closed this Apr 3, 2015
@wjwwood
Member

wjwwood commented Apr 3, 2015

Also, it was merged via a rebase, which is why it doesn't show up as "merged".

@jbohren jbohren deleted the make_job_server branch April 3, 2015 12:09
Successfully merging this pull request may close these issues.

Default -j and -p values are too high on systems with many cores