build: GNU Make jobserver to prevent resource glut (make_job_server rebased on master) #155

Closed
wants to merge 4 commits

Conversation

jbohren
Contributor

@jbohren jbohren commented Mar 10, 2015

This is a rebase and extension of @xqms's jobserver prototype in PR #85 which was created to fix #84. Without the jobserver, catkin build will use too many resources on machines with multiple cores.

Behavior using this PR:

  • catkin build --no-jobserver -- current behavior (maxes out all jobs for all packages unless using another jobserver like distcc)
  • catkin build -- build using the jobserver and as many jobs as CPUs
  • catkin build --jobserver -- explicit version of the line above
  • catkin build -jN -- build using the jobserver and N jobs
  • MAKEFLAGS="-jN" catkin build -- build using the jobserver and N jobs
  • catkin config --no-jobserver -- disable the jobserver for future builds
  • catkin build -lV -- don't create more than one job at a time if the system load is greater than V
  • catkin build --mem-limit P -- don't create more than one job at a time if more than P% of the system memory is used
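For reference, the GNU Make jobserver protocol behind these options can be sketched in a few lines. This is a simplified illustration, not the PR's actual implementation; the class and method names are made up:

```python
import os

# Simplified sketch of the GNU Make jobserver protocol: a pipe holds N-1
# tokens; a worker reads a token before starting an extra job and writes
# it back when the job finishes. Reads block when no token is available.
class JobServer(object):
    def __init__(self, num_jobs):
        self.job_pipe = os.pipe()
        # One job is always implicitly allowed, so only N-1 tokens are queued.
        os.write(self.job_pipe[1], b'+' * (num_jobs - 1))

    def make_arguments(self):
        # Sub-makes inherit flags like "--jobserver-fds=3,4 -j" so they
        # share this token pool instead of spawning jobs independently.
        return ['--jobserver-fds=%d,%d' % self.job_pipe, '-j']

    def obtain(self):
        os.read(self.job_pipe[0], 1)   # blocks until a token is free

    def release(self):
        os.write(self.job_pipe[1], b'+')
```

With `num_jobs` set to the CPU count, at most that many compile jobs run at once across every package being built.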

Outstanding issues:

Future work:

  • The number of jobs (-jN) and the number of parallel packages (-pM) are still decoupled. The jobserver will limit the total number of jobs across all Make instances, but maybe this is ok?
  • The current internal jobserver implementation does not monitor load or memory usage. These would be good arguments to support.

@jbohren
Contributor Author

jbohren commented Mar 10, 2015

@xqms I've rebased this on master and it appears to work; feel free to copy this snapshot to your fork.

@jbohren
Contributor Author

jbohren commented Mar 10, 2015

@davetcoleman when you're feeling bored, give this branch a shot and see if you still run out of resources

@jbohren
Contributor Author

jbohren commented Mar 10, 2015

Rebase of #85

@davetcoleman
Contributor

I just pulled that branch and am testing now.

@davetcoleman
Contributor

It kept my memory usage nice and stable; I think whatever this is doing helped a lot!
[Screenshot from 2015-03-09 22:57:56]

@jbohren
Contributor Author

jbohren commented Mar 10, 2015

@xqms I modified the implementation so that it's possible to disable the jobserver with the --ext-jobserver argument. This way you can still use distcc like so:

CC="distcc gcc" CXX="distcc g++" catkin build -p$(distcc -j) -j$(distcc -j) --ext-jobserver

@xqms
Contributor

xqms commented Mar 10, 2015

@davetcoleman Nice to hear it helped you as well! catkin_tools is almost unusable without it on my workspace...

@jbohren okay, sounds good, though I think @wjwwood preferred an opt-in (like --enable-jobserver). I'll have time later today or tomorrow to take a look. If you guys want to go ahead changing/merging, don't wait for me ;-)

@jbohren
Contributor Author

jbohren commented Mar 10, 2015

@xqms @wjwwood after seeing Dave's example workspace, I think that this should be the default behavior.

@jbohren
Contributor Author

jbohren commented Mar 10, 2015

@xqms Also, is there a difference between --jobserver-limit and the normal -j jobs argument?

@jbohren
Contributor Author

jbohren commented Mar 10, 2015

Something that might make sense is to add the jobserver options to the persistent config. This patch just augments the --make-args with the jobserver flags like --jobserver-fds=3,4 -j, which makes them very visible, but also cryptic.

Alternatively, it'd be better to have a high-level Job Server: [internal / external] option.

This would display like:

----------------------------------------------------------------------------
Profile:                     default
Extending:          [cached] /opt/ros/hydro
Workspace:                   /home/jbohren/ws/ascent
Source Space:       [exists] /home/jbohren/ws/ascent/src
Build Space:        [exists] /home/jbohren/ws/ascent/build
Devel Space:        [exists] /home/jbohren/ws/ascent/devel
Install Space:     [missing] /home/jbohren/ws/ascent/install
DESTDIR:                     None
----------------------------------------------------------------------------
Isolate Develspaces:         False
Install Packages:            False
Isolate Installs:            False
----------------------------------------------------------------------------
Additional CMake Args:       -DOBJREC_USE_CUDA=On -DOBJREC_CUDA_DEVICE_ID=1 
Additional Make Args:        None
Additional catkin Make Args: None
Make Job Server:             Internal
----------------------------------------------------------------------------
Workspace configuration appears valid.
----------------------------------------------------------------------------

@xqms
Contributor

xqms commented Mar 10, 2015

@jbohren -j N controls how many processes a make process may spawn. Since catkin_tools starts up to -p X parallel make processes this is useless for controlling the system load. --jobserver-limit specifies how many jobs (e.g. gcc processes) may be run in total, which is much more useful.

If the jobserver is enabled, it may make sense to replace -j and -p altogether with a single -j option controlling the jobserver limit...
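A quick illustration of why the per-make -j flag alone cannot bound system load (hypothetical numbers, not measurements from this PR):

```python
# Without a shared jobserver, parallel packages multiply with per-make jobs.
parallel_packages = 4   # catkin build -p 4 (four make processes at once)
jobs_per_make = 4       # each make may spawn four compiler jobs (-j 4)
worst_case_jobs = parallel_packages * jobs_per_make  # up to 16 gcc processes

# With a shared jobserver limit of 4 tokens, every make process draws from
# the same pool, so the total number of concurrent jobs stays at 4.
jobserver_limit = 4
```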

Yes, I agree this thing should be integrated with the workspace config.

@jbohren jbohren changed the title Rebase of make_job_server on latest master build: GNU Make jobserver to prevent resource glut (make_job_server rebased on master) Mar 10, 2015
@davetcoleman
Contributor

This is still running ok for me and catkin no longer crashes my computer, but it still uses a huge amount of memory sometimes while compiling a large workspace. Just now it got pretty close:

[Screenshot from 2015-03-11 16:34:37]

@jbohren
Contributor Author

jbohren commented Mar 12, 2015

Can you also show the memory trace when using catkin_make?

@wjwwood
Member

wjwwood commented Mar 12, 2015

Thanks for everyone helping out on this one. I'll try to review it and try it out myself by the end of the week.

@jack-oquin

If the jobserver is enabled, it may make sense to replace -j and -p altogether with a single -j option controlling the jobserver limit...

+1 That is what most of us naively expected -j to mean in the first place.

@davetcoleman
Contributor

You mean the memory history plot? I'll try to record it next time I rebuild

@jbohren
Contributor Author

jbohren commented Mar 12, 2015

Yeah.

On Thu, Mar 12, 2015, 16:58 Dave Coleman [email protected] wrote:

You mean the memory history? I'll try to record it next time I rebuild


@jbohren
Contributor Author

jbohren commented Mar 14, 2015

This still needs a bit of work to do what people "expect" it to do. Each job calls handle_make_arguments a couple of times, and I don't think that's where MAKEFLAGS should be read. It needs to be read early on and act as a default configuration; then that needs to initialize the jobserver.
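Reading MAKEFLAGS once at startup as a default could look something like this (a sketch; the function name is illustrative, not the PR's actual code):

```python
import os
import re

def default_jobs_from_makeflags(environ=os.environ):
    """Read a default -jN out of MAKEFLAGS once, at startup (sketch).

    Handles "-j4", "-j 4", and the dash-less form make sometimes uses.
    Returns None when no job count is set, letting the caller fall back
    to the CPU count.
    """
    makeflags = environ.get('MAKEFLAGS', '')
    match = re.search(r'(?:^|\s)-?j\s*([0-9]+)', makeflags)
    return int(match.group(1)) if match else None
```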

@jbohren
Contributor Author

jbohren commented Mar 16, 2015

@xqms @davetcoleman @wjwwood This should be good to review. I re-worked it to make sure the jobserver singleton gets initialized in one place, but this could still be cleaned up.

The behavior of this PR is as follows:

  • catkin build --no-jobserver -- current behavior (maxes out all jobs for all packages unless using another jobserver like distcc)
  • catkin build -- build using the jobserver and as many jobs as CPUs
  • catkin build --jobserver -- explicit version of the line above
  • catkin build -jN -- build using the jobserver and N jobs
  • MAKEFLAGS="-jN" catkin build -- build using the jobserver and N jobs

@jbohren jbohren force-pushed the make_job_server branch 3 times, most recently from 75bb565 to b35694d Compare March 16, 2015 12:32
:rtype: dict
"""

regex = r'(?:^|\s)(?:-?(j|l)(\s*[0-9]+|\s|$))' + \
Contributor

nit: you probably wanted to use JOBS_FLAGS_REGEX from above here, right?

Contributor Author

Ah yeah I meant to remove that in favor of just having two different patterns.

Member

Is this comment still unaddressed?

Contributor Author

No, it's outdated. The referenced variable isn't in the PR any more.

@xqms
Contributor

xqms commented Mar 16, 2015

@jbohren I had a look at the code, seems good to me. I'll test this in my setup today and let you know if anything breaks.

# make sure we're observing load maximums
if self.max_load is not None:
    try:
        max_load = 8.0
Member

This line is probably left over by accident?

@xqms
Contributor

xqms commented Mar 30, 2015

I have just discovered an issue: Multiple invocations of catkin build --save duplicate the --jobserver-fds=3,4 -j arguments. Looks like this after two calls:

----------------------------------------------
Additional CMake Args:       -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS=-Wall
Additional Make Args:        --jobserver-fds=3,4 -j --jobserver-fds=3,5 -j
Additional catkin Make Args: None
Internal Make Job Server:    True
----------------------------------------------

@xqms
Contributor

xqms commented Mar 30, 2015

Okay, to clarify: the --jobserver-fds argument is saved in the workspace config. It shouldn't be, since it is also generated by the "Internal Make Job Server" flag.

@jbohren
Contributor Author

jbohren commented Mar 30, 2015

@xqms Yeah, also now that Internal Make Job Server: True is in the summary, I don't think the jobserver flags should be added in the way that I have them added. I'll make it so it displays the summary like this:

----------------------------------------------
Additional CMake Args:       -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS=-Wall
Additional Make Args:        None
Additional catkin Make Args: None
Internal Make Job Server:    True, Args: "--jobserver-fds=3,4 -j"
----------------------------------------------

@wjwwood
Member

wjwwood commented Mar 30, 2015

@jbohren
Contributor Author

jbohren commented Apr 1, 2015

@wjwwood Just updated. There was an issue previously where the make arg parsing wasn't properly extracting jobs args formatted like -j N with a space between the -j and the N.

@xqms This removes the jobserver args from "additional make args". I was considering adding the jobserver max jobs / max load / max mem to the context, but we'd also have to add those options to the saved configs, which just involves a lot of additional changes.

As it stands, I think that this is a pretty critical patch and if there are any other outstanding features, we should bump them to another PR.

It would be nice to add a strong test harness around this thing, but I'm not sure what the best approach would be for that. Mostly you'd want to test:

  • defaults
  • cli args
  • MAKEFLAGS args
  • disabling the jobserver
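The cases listed above could be covered with assertions against the arg parser. The extraction function and regex below are illustrative stand-ins for the PR's real parser, not its actual code:

```python
import re

# Illustrative make-arg parser: accepts "-j4", "-j 4", bare "-j", and the
# corresponding "-l" load-average flags, with or without the leading dash.
JOB_FLAG_PATTERN = re.compile(r'(?:^|\s)-?(j|l)\s*([0-9]+)?(?=\s|$)')

def extract_job_flags(make_args):
    return [(flag, int(num)) if num else (flag, None)
            for flag, num in JOB_FLAG_PATTERN.findall(make_args)]

# The troublesome space-separated form from this PR, plus the basics:
assert extract_job_flags('-j4') == [('j', 4)]
assert extract_job_flags('-j 4') == [('j', 4)]    # space between -j and N
assert extract_job_flags('-j') == [('j', None)]   # unbounded / default
assert extract_job_flags('-j6 -l2') == [('j', 6), ('l', 2)]
```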

@davetcoleman
Contributor

I've been using this branch, and keeping up with its updates, every day. It seems to work for me.

@jbohren jbohren force-pushed the make_job_server branch 2 times, most recently from e881f0e to 5892168 Compare April 2, 2015 02:12
@wjwwood
Member

wjwwood commented Apr 2, 2015

You've not added the dependency on psutil correctly; you also have to update some other files:

Also the function you're using from psutil is deprecated:

https://code.google.com/p/psutil/wiki/Documentation#Memory

Please at least update the pull request to use psutil.virtual_memory().

I'm not happy about adding a dependency on psutil, but if there isn't a reasonable way to do it without psutil then I guess we'll have to. One option would be to not import psutil unless a user tries to use the memory limits.

Having said all of that, I'm really not convinced the memory limit is going to be a reliable way to manage how much memory the whole compile is using. Unlike CPU usage, the memory usage of each cc process in a make job can grow over time. Imagine:

  • each of 6 cc processes will approach 2 GB of memory usage over 30 seconds (this behavior is common with some of the pcl object files)
  • your system has 4 GB of RAM
  • you compiled with -j6
  • and you've set the memory limit to 90%

You'll hand out all six job tokens (because for the first few seconds the memory usage is really low) and each of them will grow to use 2 GB and run your system out of memory anyway. However, it's possible that it won't work out like that and the memory limit will help, but I wouldn't call it reliable.
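The arithmetic of that scenario, spelled out with the numbers from the example above (the usage at grant time is an assumed placeholder):

```python
# Token grants happen while per-job memory usage is still low, so the
# percent check passes for all six jobs; their footprints then grow.
ram_gb = 4.0
granted_jobs = 6            # compiled with -j6
peak_per_job_gb = 2.0       # e.g. heavy pcl object files
mem_limit = 0.90            # --mem-limit 90

usage_at_grant_time_gb = 0.5                 # assumed: low during cc startup
assert usage_at_grant_time_gb / ram_gb < mem_limit   # all 6 tokens granted

eventual_usage_gb = granted_jobs * peak_per_job_gb   # 12 GB on a 4 GB box
assert eventual_usage_gb > ram_gb                    # OOM despite the limit
```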

This is a tough problem, and we worked with the Tango project to work around it in a few places. The most reliable way to fix this kind of "perfect storm of parallel make jobs" was to instrument the build so we could figure out which processes, when run at the same time, busted the system. Then they would manually adjust the inter-target dependencies to prevent those object files from being built at the same time. @tfoote was the instrumenting code ever pushed somewhere public?

Based on that, I would say we should either remove the memory limit option or make it "experimental" and make the dependency on psutil optional. If a user is savvy enough to make use of the memory limit options then they are probably also ok with installing psutil when prompted.

@jbohren
Contributor Author

jbohren commented Apr 2, 2015

Having said all of that, I'm really not convinced the memory limit is going to be a reliable way to manage how much memory the whole compile is using. Unlike CPU usage, the memory usage of each cc process in a make job can grow over time. Imagine:

Of course it's not going to be reliable. But if some crazy person (@davetcoleman cough cough) is running around with a computer with no swap, he can probably set it to something like 65% and be reasonably confident it's not going to brick his machine.

This is a tough problem and we worked with the Tango project to work around this problem in a few places. The most reliable way to fix this kind of "perfect storm of parallel make jobs" was to instrument the build so we could figure out which process, when run at the same time, busted the system. Then they would manually adjust the inter-target dependencies to prevent those object files from being built at the same time. @tfoote was the instrumenting code ever pushed somewhere public?

"Perfect is the enemy of the good" -- some guy that got something done

I'm not concerned enough about this problem to do any more about it than is in the experimental memory limit check.

Based on that, I would say we should either remove the memory limit option or make it "experimental" and make the dependency on psutil optional. If a user is savvy enough to make use of the memory limit options then they are probably also ok with installing psutil when prompted.

It already is an unlisted "experimental" option. I'm happy putting a try-except around the psutil import and then throwing an error if someone tries using it without support. That being said, anyone with rqt installed will have python-psutil.

You've not added the dependency on psutil correctly, you also have to update some other files:

I'm going to remove the dependency description completely (except for the travis config) because it's going to be an optional dependency.

Also the function you're using from psutil is deprecated:

https://code.google.com/p/psutil/wiki/Documentation#Memory

Please at least update the pull request to use psutil.virtual_memory().

Yeah, I know it's deprecated. Unfortunately the python-psutil package on 12.04 doesn't have this function yet. Also, ROS Hydro depends on the python-psutil package, so ditching it in favor of the PyPI version doesn't sound like a good plan. It wouldn't be too hard to check if the new function is available and fall back to the old one if it isn't.
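The availability check described above could look like this (a sketch, with an illustrative function name; psutil.virtual_memory() is the replacement API, psutil.phymem_usage() the deprecated one shipped on 12.04):

```python
def used_memory_percent():
    """Return system memory usage in percent, lazily importing psutil."""
    try:
        import psutil
    except ImportError:
        # psutil stays an optional dependency; only --mem-limit needs it.
        raise RuntimeError(
            'The --mem-limit option requires psutil to be installed')
    if hasattr(psutil, 'virtual_memory'):
        return psutil.virtual_memory().percent   # newer psutil
    return psutil.phymem_usage().percent         # deprecated older API
```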

@wjwwood
Member

wjwwood commented Apr 2, 2015

It already is an unlisted "experimental" option. I'm happy putting a try-except around the psutil import and then throwing an error if someone tries using it without support. That being said, anyone with rqt installed will have python-psutil.

I would appreciate making it optional. It's not going to be a problem for ROS users, but I put a lot of effort into minimizing the dependencies, so I'd like to keep it that way if possible.

Yeah I know it's deprecated. Unfortunately the python-psutil package on 12.04 doesn't have this function yet. Also ROS Hydro depends on the python-psutil package, so ditching it in favor of the pypi doesn't sound like a good plan. It wouldn't be too hard to just check if the function is available and use the new one if the old one isn't available.

I remember this now, it was quite a problem for rqt_top. Unfortunately the pypi version is never an option because we cannot build a .deb that depends on something in pypi.

If you can get the optional import in place tonight or tomorrow morning I can merge this and get it in the next release. Tomorrow is our ROS bug fix party, so I plan to spend some time on this tracker and making a release.

Thanks for working on it and iterating with me.

@tfoote
Contributor

tfoote commented Apr 2, 2015

@wjwwood re: collecting build parameters of specific invocations. The tool is available at https://github.com/osrf/watchprocess It can generate a report of CPU and memory usage for each process during a build, using an instrumented program.

@davetcoleman
Contributor

Thanks for the crazy-person shout-out @jbohren, I've taken your advice:
[Screenshot from 2015-04-02 12:07:20]

@jbohren
Contributor Author

jbohren commented Apr 2, 2015

@wjwwood I added the lazy import of psutil and the version checking. I've tested it with version 0.4.1 but no other versions.

@jbohren jbohren force-pushed the make_job_server branch 2 times, most recently from 5702aec to 970d270 Compare April 3, 2015 01:24
xqms and others added 4 commits April 3, 2015 04:42
This makes it possible to enforce a global limit on the number of
processes through the --jobserver-limit argument.

build: jobserver: provide context manager interface

... and use it in executor.py

build: jobserver: silent support test

build: provide job information on status line nicely

as suggested by NikolausDemmel.
- you can disable the job server and use it with other job servers via
  catkin_tools context config
- the jobserver can be parameterized by the normal -j argument as well
  as -j stored in MAKEFLAGS
- the jobserver can regulate jobs based on the current system load
- the jobserver can regulate jobs based on available RAM
- simplifying enabling / supporting
- adding parsing of `-l` / `--load-average` args
- adding hidden experimental option for limiting based on memory
- Simplifying context summary to not show jobserver args as "additional
  make args"
- Fixing failure to parse '-j N' where there is a space separating the
  '-j' and 'N'
wjwwood added a commit that referenced this pull request Apr 3, 2015
@wjwwood
Member

wjwwood commented Apr 3, 2015

I merged this with some adjustments.

@wjwwood wjwwood closed this Apr 3, 2015
@wjwwood
Member

wjwwood commented Apr 3, 2015

Also, it was merged via a rebase, which is why it doesn't show up as "merged".

@jbohren jbohren deleted the make_job_server branch April 3, 2015 12:09
Successfully merging this pull request may close these issues.

Default -j and -p values are too high on systems with many cores