
Optimize slurm config #234

Closed
wants to merge 3 commits into from

Conversation

nirmalasrjn
Contributor

The FastSchedule directive defaults to 1, which enables fast scheduling: decisions are based on the node definitions in slurm.conf. However, if a node reports fewer resources than configured, it is set to DRAIN, meaning it finishes its currently running job but no further jobs are scheduled on it. When FastSchedule is set to 0, scheduling decisions are instead based on the actual configuration of each individual node.

Adding FastSchedule Directive to Slurm Specs
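For illustration only, a minimal sketch of how this might look in slurm.conf; the node names, resource counts, and partition definition below are hypothetical and not taken from this PR:

    # Schedule against the resources each node actually reports,
    # rather than the values declared below (FastSchedule=0)
    FastSchedule=0
    NodeName=c[1-4] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000 State=UNKNOWN
    PartitionName=normal Nodes=c[1-4] Default=YES MaxTime=24:00:00 State=UP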
@koomie
Contributor

koomie commented Jun 3, 2016

Quick question. Do you know if you can avoid having to specify node configuration details in slurm.conf if you adopt FastSchedule=0?

@nirmalasrjn
Contributor Author

As far as I know, it does not eliminate the need to specify node configuration details in slurm.conf. When I use the FastSchedule directive, I put in node configuration details too.

@koomie
Contributor

koomie commented Jun 3, 2016

So, if the node does not match the configuration that is called out in the slurm.conf file, is there a reason not to want it to be set to DRAIN? If you didn't have to create the node entries by using FastSchedule=0, that would be an advantage in my mind, but if not, I'm not sure one is necessarily better than the other.

@JohnWestlund
Member

That’s correct. Nodes within a pool should be homogeneous and should be correctly defined in slurm.conf so they can be allocated based on a job's constraints. If a pool is heterogeneous, you're asking for performance and results that may be highly variable, but the resource manager should still have the correct definition of the resources. You might be able to get by with an incorrect core count, but other node attributes impose a much harder limit (memory, etc.).

At the end of the day, this is not something that should change often; your slurm.conf should be relatively stable. And if something is suddenly mismatched, you probably want to know about it.

John
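As an aside, a quick sketch of how to compare what a node actually reports against what the controller believes; the node name c1 is hypothetical:

    # On the compute node: print the configuration slurmd detects locally
    slurmd -C
    # On any node: print the controller's view of that node, per slurm.conf
    scontrol show node c1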

The ReturnToService directive has a default value of 0, which means a node that goes DOWN does not return to service unless an administrator manually returns it.
If it is set to 1, the node can return to service automatically, provided it has a valid configuration and was set to DOWN only because it was unresponsive.
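For reference, this is how the setting discussed here would appear in slurm.conf (the comment line is mine; the directive and value are those referenced later in the thread):

    # Let a node that was DOWN only for being unresponsive rejoin automatically
    # once it registers with a valid configuration
    ReturnToService=1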
koomie added a commit that referenced this pull request Jun 9, 2016
@koomie
Contributor

koomie commented Jun 9, 2016

Landed the latest, which updates the ReturnToService directive, onto the 1.1.1 branch.

@koomie koomie closed this Jun 9, 2016
@koomie koomie added this to the 1.1.1 milestone Jun 10, 2016
@koomie
Contributor

koomie commented Jun 10, 2016

Re-opening to enable build for 1.1

@koomie
Contributor

koomie commented Jun 16, 2016

Confirmed the ReturnToService=1 setting was applied during the CI install. Closing this out.
