Hq as light scheduler #795
Merged
37 commits
3309bf6 before-notebook script for set up hq computer (unkcpz)
07ae579 Rename install to install_and_setup (unkcpz)
bc741b2 setup_codes which separate qe install and codes setup in two functions (unkcpz)
53fa9d4 Workable (unkcpz)
af86de4 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
b14daec finaly code setup is working!! (unkcpz)
708bbb7 Get mem/cpu info from cgroup (unkcpz)
bcfeb14 Round HQ cpu to integer (unkcpz)
2018592 Calculate number of CPU?? (unkcpz)
1c22d1f Correctly set NUM_CPU and MEM for hq local (unkcpz)
150af1b [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
8e4b2bd num_cpus setting (unkcpz)
8434fc4 Use custom hq plugin (unkcpz)
f0cb9d7 resource setup specific for hq (unkcpz)
37151e0 fix after rebase (unkcpz)
6bf7aea [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
3cd731e Use localhost as new computer name (unkcpz)
91fbf71 Setup default mpiprocs and memory for hq computer (unkcpz)
9763d7d 0 decimal for memory read from cgroup (unkcpz)
51ecd65 Update Dockerfile (unkcpz)
dcfdd42 Use edge full-stack which has bc and late daemon start (unkcpz)
eb79d2a f-d: no if-else need to check if the computer already set (unkcpz)
dff3c90 f-d: polishing uv order (unkcpz)
b9909ba HQ_COMPUTER as global ARG (unkcpz)
c2ef7a3 json format to check qe is installed or not (unkcpz)
f251d71 flag file to check tar initialization (unkcpz)
47b8526 Merge branch 'main' into hq (unkcpz)
9999b57 test untar home (unkcpz)
57385ce revert to dh method to check (unkcpz)
36ea854 revert to use old (unkcpz)
bc2b322 use new base full-stack image (unkcpz)
58ea0f9 wc -l -le 1 (unkcpz)
024e631 add placeholder in work dir (unkcpz)
d03ce60 separate qeapp copy (unkcpz)
ef49620 use flag file .FLAG_HOME_INITIALIZED (unkcpz)
473ca51 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
c472dcd Merge branch 'main' into hq (unkcpz)
@@ -4,7 +4,7 @@ set -eux
 home="/home/${NB_USER}"

 # Untar home archive file to restore home directory if it is empty
-if [[ $(ls -A ${home} | wc -l) = "0" ]]; then
+if [ ! -e $home/.FLAG_HOME_INITIALIZED ]; then
     if [[ ! -f $HOME_TAR ]]; then
         echo "File $HOME_TAR does not exist!"
         exit 1
@@ -15,12 +15,20 @@ if [[ $(ls -A ${home} | wc -l) = "0" ]]; then
     fi

     echo "Extracting $HOME_TAR to $home"
+    # NOTE: tar reports an error when deployed to k8s, but at the moment it does not cause any issue
+    # tar: .: Cannot utime: Operation not permitted
+    # tar: .: Cannot change mode to rwxr-s---: Operation not permitted
     tar -xf $HOME_TAR -C "$home"

-    echo "Copying directory '$QE_APP_FOLDER' to '$AIIDALAB_APPS'"
-    cp -r "$QE_APP_FOLDER" "$AIIDALAB_APPS"
 else
     echo "$home folder is not empty!"
     ls -lrta "$home"
 fi

+if [ -d $AIIDALAB_APPS/quantum-espresso ]; then
+    echo "Quantum ESPRESSO app does exist"
+else
+    echo "Copying directory '$QE_APP_FOLDER' to '$AIIDALAB_APPS'"
+    cp -r "$QE_APP_FOLDER" "$AIIDALAB_APPS"
+fi
Comment on lines +27 to +32
This does the trick, because in the k8s deployment, tarring the empty files like

 set +eux
@@ -0,0 +1,21 @@
+#!/bin/bash

+set -x

+# computer
+verdi computer show ${HQ_COMPUTER} || verdi computer setup \
+    --non-interactive \
+    --label "${HQ_COMPUTER}" \
+    --description "local computer with hyperqueue scheduler" \
+    --hostname "localhost" \
+    --transport core.local \
+    --scheduler hyperqueue \
+    --work-dir /home/${NB_USER}/aiida_run/ \
+    --mpirun-command "mpirun -np {num_cpus}"

+verdi computer configure core.local "${HQ_COMPUTER}" \
+    --non-interactive \
+    --safe-interval 5.0

+# disable the localhost which is set in base image
+verdi computer disable localhost aiida@localhost
@@ -0,0 +1,59 @@
+#!/bin/bash

+set -x

+# NOTE: this cgroup folder hierarchy is based on cgroupv2
+# if the container is opened on a system that has cgroupv1, the image build procedure will fail.
+# Since the image is mostly for the demo server, where we know the machine and OS, I suppose
+# it should have cgroupv2 (> Kubernetes v1.25).
+# We only build this for the demo server, so it does not require users to have the new cgroup.
+# But for developers, please update your cgroup version to v2.
+# See: https://kubernetes.io/docs/concepts/architecture/cgroups/#using-cgroupv2

+# computer memory from runtime
+MEMORY_LIMIT=$(cat /sys/fs/cgroup/memory.max)

+if [ "$MEMORY_LIMIT" = "max" ]; then
+    MEMORY_LIMIT=4096
+    echo "No memory limit set, use 4GiB"
+else
+    MEMORY_LIMIT=$(echo "scale=0; $MEMORY_LIMIT / (1024 * 1024)" | bc)
+    echo "Memory Limit: ${MEMORY_LIMIT} MiB"
+fi

+# Compute number of cpus allocated to the container
+CPU_LIMIT=$(awk '{print $1}' /sys/fs/cgroup/cpu.max)
+CPU_PERIOD=$(awk '{print $2}' /sys/fs/cgroup/cpu.max)

+if [ "$CPU_PERIOD" -ne 0 ]; then
+    CPU_NUMBER=$(echo "scale=2; $CPU_LIMIT / $CPU_PERIOD" | bc)
+    echo "Number of CPUs allocated: $CPU_NUMBER"

+    # for the HQ setting, round down to an integer number of CPUs; the remainder is left for system tasks
+    CPU_LIMIT=$(echo "scale=0; $CPU_LIMIT / $CPU_PERIOD" | bc)
+else
+    # if no limit (with local OCI without setting cpu limit, use all CPUs)
+    CPU_LIMIT=$(nproc)
+    echo "No CPU limit set"
+fi

+# Start hq server with a worker
+run-one-constantly hq server start 1>$HOME/.hq-stdout 2>$HOME/.hq-stderr &
+run-one-constantly hq worker start --cpus=${CPU_LIMIT} --resource "mem=sum(${MEMORY_LIMIT})" --no-detect-resources &

+# Reset the default memory_per_machine and default_mpiprocs_per_machine
+# c.set_default_mpiprocs_per_machine = ${CPU_LIMIT}
+# c.set_default_memory_per_machine = ${MEMORY_LIMIT}

+# Same as original localhost set job poll interval to 2.0 secs
+# In addition, set default mpiprocs and memory per machine
+# TODO: this will be run every time the container starts, we need a lock file to prevent it.
+job_poll_interval="2.0"
+computer_name=${HQ_COMPUTER}
+python -c "
+from aiida import load_profile; from aiida.orm import load_computer;
+load_profile();
+load_computer('${computer_name}').set_minimum_job_poll_interval(${job_poll_interval})
+load_computer('${computer_name}').set_default_mpiprocs_per_machine(${CPU_LIMIT})
+load_computer('${computer_name}').set_default_memory_per_machine(${MEMORY_LIMIT})
+"
Not a huge fan of this solution since it seems brittle (user can remove this file).
I am somewhat confused, why does the previous one not work anymore?
(feel free to ignore, this is just me rambling :D)

Two problems with the previous one:
- The left side is `0` and the right side is `\0` as str.
- The `lost+found` folder exists before this script runs.

So another solution I tried was `if [ $(ls -A ${home} | wc -l ) -lt 1 ]; then`, but it is more brittle I assume :-p
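
To make the trade-off in this thread concrete, here is a minimal sketch of the two checks side by side. It is not taken from the PR: the helper names `home_is_pristine` and `home_is_initialized` are hypothetical, and ignoring `lost+found` is an assumption based on the reply above. The first helper treats a directory as empty even when a fresh volume already contains `lost+found`; the second mirrors the flag-file condition used in the diff.

```bash
#!/usr/bin/env bash
# Sketch only: two ways to decide whether the home directory needs restoring.

# Succeeds if the directory holds nothing except an optional lost+found entry.
home_is_pristine() {
    local dir="$1" first_entry
    # List direct children (dotfiles included), skipping lost+found,
    # and stop at the first one found.
    first_entry=$(find "$dir" -mindepth 1 -maxdepth 1 ! -name 'lost+found' | head -n 1)
    [ -z "$first_entry" ]
}

# Succeeds once the marker file checked by the PR's condition is present.
home_is_initialized() {
    [ -e "$1/.FLAG_HOME_INITIALIZED" ]
}
```

The flag file is explicit but can be deleted by the user, as the reviewer notes; the directory scan needs no marker but depends on knowing every entry that may legitimately pre-exist on the volume.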