Hq as light scheduler #795

Merged
merged 37 commits into from
Oct 17, 2024
Commits
3309bf6
before-notebook script for set up hq computer
unkcpz Aug 12, 2024
07ae579
Rename install to install_and_setup
unkcpz Aug 12, 2024
bc741b2
setup_codes which separate qe install and codes setup in two functions
unkcpz Aug 12, 2024
53fa9d4
Workable
unkcpz Aug 12, 2024
af86de4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 12, 2024
b14daec
finaly code setup is working!!
unkcpz Aug 13, 2024
708bbb7
Get mem/cpu info from cgroup
unkcpz Aug 13, 2024
bcfeb14
Round HQ cpu to integer
unkcpz Aug 13, 2024
2018592
Calculate number of CPU??
unkcpz Aug 13, 2024
1c22d1f
Correctly set NUM_CPU and MEM for hq local
unkcpz Aug 19, 2024
150af1b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 19, 2024
8e4b2bd
num_cpus setting
unkcpz Aug 22, 2024
8434fc4
Use custom hq plugin
unkcpz Aug 22, 2024
f0cb9d7
resource setup specific for hq
unkcpz Aug 22, 2024
37151e0
fix after rebase
unkcpz Sep 10, 2024
6bf7aea
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 10, 2024
3cd731e
Use localhost as new computer name
unkcpz Sep 16, 2024
91fbf71
Setup default mpiprocs and memory for hq computer
unkcpz Sep 16, 2024
9763d7d
0 decimal for memory read from cgroup
unkcpz Sep 16, 2024
51ecd65
Update Dockerfile
unkcpz Sep 16, 2024
dcfdd42
Use edge full-stack which has bc and late daemon start
unkcpz Sep 17, 2024
eb79d2a
f-d: no if-else need to check if the computer already set
unkcpz Sep 17, 2024
dff3c90
f-d: polishing uv order
unkcpz Sep 17, 2024
b9909ba
HQ_COMPUTER as global ARG
unkcpz Sep 17, 2024
c2ef7a3
json format to check qe is installed or not
unkcpz Sep 17, 2024
f251d71
flag file to check tar initialization
unkcpz Sep 20, 2024
47b8526
Merge branch 'main' into hq
unkcpz Sep 20, 2024
9999b57
test untar home
unkcpz Sep 20, 2024
57385ce
revert to dh method to check
unkcpz Sep 20, 2024
36ea854
revert to use old
unkcpz Sep 20, 2024
bc2b322
use new base full-stack image
unkcpz Sep 20, 2024
58ea0f9
wc -l -le 1
unkcpz Sep 20, 2024
024e631
add placeholder in work dir
unkcpz Sep 20, 2024
d03ce60
separate qeapp copy
unkcpz Sep 20, 2024
ef49620
use flag file .FLAG_HOME_INITIALIZED
unkcpz Sep 20, 2024
473ca51
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 20, 2024
c472dcd
Merge branch 'main' into hq
unkcpz Oct 17, 2024
39 changes: 36 additions & 3 deletions Dockerfile
@@ -1,11 +1,13 @@
# syntax=docker/dockerfile:1
ARG FULL_STACK_VER=2024.1022
ARG FULL_STACK_VER=2024.1023
ARG UV_VER=0.4.7
ARG QE_VER=7.2
ARG QE_DIR=/opt/conda/envs/quantum-espresso-${QE_VER}
ARG HQ_VER=0.19.0

ARG UV_CACHE_DIR=/tmp/uv_cache
ARG QE_APP_SRC=/tmp/quantum-espresso
ARG HQ_COMPUTER="localhost-hq"

FROM ghcr.io/astral-sh/uv:${UV_VER} AS uv

@@ -43,22 +45,44 @@ RUN --mount=from=uv,source=/uv,target=/bin/uv \

# STAGE 3
# - Prepare AiiDA profile and localhost computer
# - Prepare hq computer using hyperqueue as scheduler
# - Install QE codes and pseudopotentials
# - Archive home folder
FROM build_deps AS home_build
ARG QE_DIR
ARG HQ_VER
ARG HQ_COMPUTER

# Install hq binary
RUN wget -c -O hq.tar.gz https://github.com/It4innovations/hyperqueue/releases/download/v${HQ_VER}/hq-v${HQ_VER}-linux-x64.tar.gz && \
tar xf hq.tar.gz -C /opt/conda/

ENV PSEUDO_FOLDER=/tmp/pseudo
RUN mkdir -p ${PSEUDO_FOLDER} && \
python -m aiidalab_qe download-pseudos --dest ${PSEUDO_FOLDER}

ENV UV_CONSTRAINT=${PIP_CONSTRAINT}
# Install the aiida-hyperqueue
# XXX: fix me after release aiida-hyperqueue
RUN --mount=from=uv,source=/uv,target=/bin/uv \
--mount=from=build_deps,source=${UV_CACHE_DIR},target=${UV_CACHE_DIR},rw \
uv pip install --system --strict --cache-dir=${UV_CACHE_DIR} \
"aiida-hyperqueue@git+https://github.com/aiidateam/aiida-hyperqueue"

COPY ./before-notebook.d/* /usr/local/bin/before-notebook.d/

ENV HQ_COMPUTER=$HQ_COMPUTER

# TODO: Remove PGSQL and daemon log files, and other unneeded files
RUN --mount=from=qe_conda_env,source=${QE_DIR},target=${QE_DIR} \
bash /usr/local/bin/before-notebook.d/20_start-postgresql.sh && \
bash /usr/local/bin/before-notebook.d/40_prepare-aiida.sh && \
python -m aiidalab_qe install-qe && \
bash /usr/local/bin/before-notebook.d/42_setup-hq-computer.sh && \
python -m aiidalab_qe install-qe --computer ${HQ_COMPUTER} && \
python -m aiidalab_qe install-pseudos --source ${PSEUDO_FOLDER} && \
verdi daemon stop && \
mamba run -n aiida-core-services pg_ctl stop && \
touch /home/${NB_USER}/.FLAG_HOME_INITIALIZED && \
cd /home/${NB_USER} && tar -cf /opt/conda/home.tar .

# STAGE 3 - Final stage
@@ -71,22 +95,31 @@ FROM ghcr.io/aiidalab/full-stack:${FULL_STACK_VER}
ARG QE_DIR
ARG QE_APP_SRC
ARG UV_CACHE_DIR
ARG HQ_COMPUTER
USER ${NB_USER}

WORKDIR /tmp
# Install python dependencies
# Use uv cache from the previous build step
# # Install the aiida-hyperqueue
# # XXX: fix me after release aiida-hyperqueue
ENV UV_CONSTRAINT=${PIP_CONSTRAINT}
RUN --mount=from=uv,source=/uv,target=/bin/uv \
--mount=from=build_deps,source=${UV_CACHE_DIR},target=${UV_CACHE_DIR},rw \
--mount=from=build_deps,source=${QE_APP_SRC},target=${QE_APP_SRC},rw \
uv pip install --strict --system --compile-bytecode --cache-dir=${UV_CACHE_DIR} ${QE_APP_SRC}
uv pip install --strict --system --compile-bytecode --cache-dir=${UV_CACHE_DIR} ${QE_APP_SRC} "aiida-hyperqueue@git+https://github.com/aiidateam/aiida-hyperqueue"

# copy hq binary
COPY --from=home_build /opt/conda/hq /usr/local/bin/

COPY --from=qe_conda_env ${QE_DIR} ${QE_DIR}

USER root

COPY ./before-notebook.d/* /usr/local/bin/before-notebook.d/

ENV HQ_COMPUTER=$HQ_COMPUTER

# Remove content of $HOME
# '-mindepth=1' ensures that we do not remove the home directory itself.
RUN find /home/${NB_USER}/ -mindepth 1 -delete
@@ -4,7 +4,7 @@ set -eux
home="/home/${NB_USER}"

# Untar home archive file to restore home directory if it is empty
if [[ $(ls -A ${home} | wc -l) = "0" ]]; then
if [ ! -e $home/.FLAG_HOME_INITIALIZED ]; then
Contributor

Not a huge fan of this solution since it seems brittle (the user can remove this file).
I am somewhat confused: why does the previous check not work anymore?
(Feel free to ignore, this is just me rambling :D)

Member Author

There were two problems with the previous check:

  1. The left side is 0 and the right side is \0 as a string.
  2. On the k8s deployment, if a persistent volume is used, a lost+found folder already exists before this script runs.

Another solution I tried was if [ $(ls -A ${home} | wc -l ) -lt 1 ]; then, but I assume it is more brittle :-p
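
The second problem can be reproduced in a few lines. This is a minimal sketch in Python rather than the shell used by the script; the helper names `is_empty` and `is_initialized` are hypothetical, and only the flag file name comes from the diff. It shows why an "is the directory empty?" test skips the untar on a volume that already holds `lost+found`, while the flag-file test does not.

```python
import tempfile
from pathlib import Path

FLAG_NAME = ".FLAG_HOME_INITIALIZED"  # flag file name taken from the diff


def is_empty(home: Path) -> bool:
    """Old-style check (hypothetical helper): the directory has no entries."""
    return not any(home.iterdir())


def is_initialized(home: Path) -> bool:
    """New-style check (hypothetical helper): the flag file exists."""
    return (home / FLAG_NAME).exists()


with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp)
    # Simulate a fresh persistent volume that already contains lost+found.
    (home / "lost+found").mkdir()

    # The emptiness check sees one entry, so it would skip the untar.
    empty_check_says_untar = is_empty(home)
    # The flag file is absent, so the flag check still triggers the untar.
    flag_check_says_untar = not is_initialized(home)
```

The trade-off raised in the review still applies: the flag file lives in the user's home, so deleting it makes the next start extract the archive again.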

if [[ ! -f $HOME_TAR ]]; then
echo "File $HOME_TAR does not exist!"
exit 1
@@ -15,12 +15,20 @@ if [[ $(ls -A ${home} | wc -l) = "0" ]]; then
fi

echo "Extracting $HOME_TAR to $home"
# NOTE: tar reports the following errors when deployed to k8s, but at the moment they do not cause any issue
# tar: .: Cannot utime: Operation not permitted
# tar: .: Cannot change mode to rwxr-s---: Operation not permitted
tar -xf $HOME_TAR -C "$home"

echo "Copying directory '$QE_APP_FOLDER' to '$AIIDALAB_APPS'"
cp -r "$QE_APP_FOLDER" "$AIIDALAB_APPS"
else
echo "$home folder is not empty!"
ls -lrta "$home"
fi

if [ -d $AIIDALAB_APPS/quantum-espresso ]; then
echo "Quantum ESPRESSO app does exist"
else
echo "Copying directory '$QE_APP_FOLDER' to '$AIIDALAB_APPS'"
cp -r "$QE_APP_FOLDER" "$AIIDALAB_APPS"
fi
Comment on lines +27 to +32
Member Author

This does the trick. In the k8s deployment, untarring empty folders such as work runs into a permission issue. It does not cause any problem that I can see, but it prevented the qeapp folder copy that came after it, so I moved the copy out into an independent operation.
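
The idea of making the copy an independent, repeatable step can be sketched as follows. This is an illustration in Python, not the script's shell code; `ensure_app_copied` is a hypothetical helper, and the directory names in the demo are placeholders. The point is that the copy depends only on whether the target folder exists, not on whether the untar step succeeded.

```python
import shutil
import tempfile
from pathlib import Path


def ensure_app_copied(app_src: Path, apps_dir: Path) -> bool:
    """Copy app_src into apps_dir unless a folder of that name is already there.

    Returns True if a copy was made and False if the app already existed.
    """
    target = apps_dir / app_src.name
    if target.is_dir():
        return False
    # copytree creates missing parent directories of the target.
    shutil.copytree(app_src, target)
    return True


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "quantum-espresso"  # placeholder for the app source folder
    src.mkdir()
    (src / "placeholder.txt").write_text("demo")
    apps = Path(tmp) / "apps"  # placeholder for the apps directory

    first = ensure_app_copied(src, apps)   # copies the folder
    second = ensure_app_copied(src, apps)  # no-op on the second call
```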


set +eux
21 changes: 21 additions & 0 deletions before-notebook.d/42_setup-hq-computer.sh
@@ -0,0 +1,21 @@
#!/bin/bash

set -x

# computer
verdi computer show ${HQ_COMPUTER} || verdi computer setup \
--non-interactive \
--label "${HQ_COMPUTER}" \
--description "local computer with hyperqueue scheduler" \
--hostname "localhost" \
--transport core.local \
--scheduler hyperqueue \
--work-dir /home/${NB_USER}/aiida_run/ \
--mpirun-command "mpirun -np {num_cpus}"

verdi computer configure core.local "${HQ_COMPUTER}" \
--non-interactive \
--safe-interval 5.0

# disable the localhost which is set in base image
verdi computer disable localhost aiida@localhost
59 changes: 59 additions & 0 deletions before-notebook.d/43_start-hq.sh
@@ -0,0 +1,59 @@
#!/bin/bash

set -x

# NOTE: this cgroup folder hierarchy is based on cgroup v2.
# If the container runs on a system that uses cgroup v1, the image build procedure will fail.
# Since the image is mostly for the demo server, where we know the machine and OS, we assume
# it has cgroup v2 (Kubernetes > v1.25).
# We only build the image for the demo server, so users are not required to have the new cgroup.
# Developers, however, should update their cgroup version to v2.
# See: https://kubernetes.io/docs/concepts/architecture/cgroups/#using-cgroupv2

# computer memory from runtime
MEMORY_LIMIT=$(cat /sys/fs/cgroup/memory.max)

if [ "$MEMORY_LIMIT" = "max" ]; then
MEMORY_LIMIT=4096
echo "No memory limit set, use 4GiB"
else
MEMORY_LIMIT=$(echo "scale=0; $MEMORY_LIMIT / (1024 * 1024)" | bc)
echo "Memory Limit: ${MEMORY_LIMIT} MiB"
fi

# Compute number of cpus allocated to the container
CPU_LIMIT=$(awk '{print $1}' /sys/fs/cgroup/cpu.max)
CPU_PERIOD=$(awk '{print $2}' /sys/fs/cgroup/cpu.max)

if [ "$CPU_PERIOD" -ne 0 ]; then
CPU_NUMBER=$(echo "scale=2; $CPU_LIMIT / $CPU_PERIOD" | bc)
echo "Number of CPUs allocated: $CPU_NUMBER"

# For HQ, round down to an integer number of CPUs; the remainder is left for system tasks
CPU_LIMIT=$(echo "scale=0; $CPU_LIMIT / $CPU_PERIOD" | bc)
else
# If no limit is set (e.g. a local OCI runtime without a CPU limit), use all CPUs
CPU_LIMIT=$(nproc)
echo "No CPU limit set"
fi

# Start hq server with a worker
run-one-constantly hq server start 1>$HOME/.hq-stdout 2>$HOME/.hq-stderr &
run-one-constantly hq worker start --cpus=${CPU_LIMIT} --resource "mem=sum(${MEMORY_LIMIT})" --no-detect-resources &

# Reset the default memory_per_machine and default_mpiprocs_per_machine
# c.set_default_mpiprocs_per_machine = ${CPU_LIMIT}
# c.set_default_memory_per_machine = ${MEMORY_LIMIT}

# Same as the original localhost: set the job poll interval to 2.0 secs
# In addition, set the default mpiprocs and memory per machine
# TODO: this runs every time the container starts; we need a lock file to prevent it.
job_poll_interval="2.0"
computer_name=${HQ_COMPUTER}
python -c "
from aiida import load_profile; from aiida.orm import load_computer;
load_profile();
load_computer('${computer_name}').set_minimum_job_poll_interval(${job_poll_interval})
load_computer('${computer_name}').set_default_mpiprocs_per_machine(${CPU_LIMIT})
load_computer('${computer_name}').set_default_memory_per_machine(${MEMORY_LIMIT})
"
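
The cgroup v2 arithmetic used by this script can be checked in isolation. The following is a minimal sketch in Python, not part of the PR; the function names `parse_memory_mib` and `parse_cpu_count` are hypothetical. It takes the raw content of `memory.max` and `cpu.max` as strings and applies the same conversions: bytes to MiB with a 4096 MiB default, and quota divided by period, truncated to whole CPUs. One assumption differs from the script: cgroup v2 writes the literal `max` as the first field of `cpu.max` when no quota is set, and the sketch tests for that value explicitly instead of testing the period.

```python
def parse_memory_mib(memory_max: str, default_mib: int = 4096) -> int:
    """Convert the content of memory.max to MiB; 'max' means no limit is set."""
    value = memory_max.strip()
    if value == "max":
        return default_mib
    # memory.max is expressed in bytes.
    return int(value) // (1024 * 1024)


def parse_cpu_count(cpu_max: str, fallback_cpus: int) -> int:
    """Convert the content of cpu.max ('<quota> <period>') to whole CPUs.

    fallback_cpus is returned when no quota is set (quota field is 'max').
    """
    quota, period = cpu_max.split()
    if quota == "max" or int(period) == 0:
        return fallback_cpus
    # Integer division truncates, leaving the fractional CPU for system tasks.
    return int(quota) // int(period)


# Example inputs in the format cgroup v2 uses for these files.
mem_limited = parse_memory_mib("2147483648\n")
mem_unlimited = parse_memory_mib("max\n")
cpu_limited = parse_cpu_count("250000 100000\n", fallback_cpus=8)
cpu_unlimited = parse_cpu_count("max 100000\n", fallback_cpus=8)
```

In a container, `fallback_cpus` would typically come from `os.cpu_count()`, which plays the role of `nproc` in the script.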
11 changes: 6 additions & 5 deletions src/aiidalab_qe/__main__.py
@@ -16,19 +16,20 @@ def cli():

@cli.command()
@click.option("-f", "--force", is_flag=True)
@click.option("--computer")
@click.option("-p", "--profile", default=_DEFAULT_PROFILE)
def install_qe(force, profile):
def install_qe(force, profile, computer):
from aiida import load_profile
from aiidalab_qe.setup.codes import codes_are_setup, install
from aiidalab_qe.setup.codes import codes_are_setup, install_and_setup

load_profile(profile)
try:
for msg in install(force=force):
for msg in install_and_setup(computer=computer, force=force):
click.echo(msg)
assert codes_are_setup()
assert codes_are_setup(computer=computer)
click.secho("Codes are setup!", fg="green")
except Exception as error:
raise click.ClickException(f"Failed to set up QE failed: {error}") from error
raise click.ClickException(f"Failed to set up QE: {error}") from error


@cli.command()
4 changes: 2 additions & 2 deletions src/aiidalab_qe/common/setup_codes.py
@@ -4,7 +4,7 @@
import ipywidgets as ipw
import traitlets

from ..setup.codes import QE_VERSION, install
from ..setup.codes import QE_VERSION, install_and_setup
from .widgets import ProgressBar

__all__ = [
@@ -66,7 +66,7 @@ def _refresh_installed(self):
try:
self.set_trait("busy", True)

for msg in install():
for msg in install_and_setup():
self.set_message(msg)

except Exception as error:
22 changes: 16 additions & 6 deletions src/aiidalab_qe/plugins/utils.py
@@ -3,12 +3,22 @@

def set_component_resources(component, code_info):
"""Set the resources for a given component based on the code info."""
if code_info: # Ensure code_info is not None or empty
component.metadata.options.resources = {
"num_machines": code_info["nodes"],
"num_mpiprocs_per_machine": code_info["ntasks_per_node"],
"num_cores_per_mpiproc": code_info["cpus_per_task"],
}
if code_info: # Ensure code_info is not None or empty (# XXX: ? from jyu, need to pop a warning to plugin developer or what?)
code: orm.Code = code_info["code"]
if code.computer.scheduler_type == "hyperqueue":
component.metadata.options.resources = {
"num_cpus": code_info["nodes"]
* code_info["ntasks_per_node"]
* code_info["cpus_per_task"]
}
else:
# XXX (jyu): properly handle the other values of scheduler_type, which can be "core.direct" (to be replaced by hyperqueue), "core.slurm", ...
component.metadata.options.resources = {
"num_machines": code_info["nodes"],
"num_mpiprocs_per_machine": code_info["ntasks_per_node"],
"num_cores_per_mpiproc": code_info["cpus_per_task"],
}

component.metadata.options["max_wallclock_seconds"] = code_info[
"max_wallclock_seconds"
]
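
The branching on the scheduler type can be exercised without AiiDA. This is a minimal sketch, assuming a hypothetical pure function `build_resources` that receives the scheduler type and the three values from `code_info` directly; the dictionary keys are the ones that appear in the diff. For hyperqueue the three factors collapse into one `num_cpus` value, while any other scheduler type keeps the three separate fields.

```python
def build_resources(
    scheduler_type: str, nodes: int, ntasks_per_node: int, cpus_per_task: int
) -> dict:
    """Return the resources dict for a job, depending on the scheduler type."""
    if scheduler_type == "hyperqueue":
        # hyperqueue only needs the total number of CPUs.
        return {"num_cpus": nodes * ntasks_per_node * cpus_per_task}
    # Any other scheduler type keeps the node/task/core breakdown.
    return {
        "num_machines": nodes,
        "num_mpiprocs_per_machine": ntasks_per_node,
        "num_cores_per_mpiproc": cpus_per_task,
    }


hq_resources = build_resources("hyperqueue", 1, 4, 2)
slurm_resources = build_resources("core.slurm", 1, 4, 2)
```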