Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hq as light scheduler #795

Merged
merged 37 commits into from
Oct 17, 2024
Merged

Hq as light scheduler #795

merged 37 commits into from
Oct 17, 2024

Conversation

unkcpz
Copy link
Member

@unkcpz unkcpz commented Aug 12, 2024

@unkcpz unkcpz marked this pull request as draft August 12, 2024 15:45
Copy link

codecov bot commented Aug 12, 2024

Codecov Report

Attention: Patch coverage is 26.98413% with 46 lines in your changes missing coverage. Please review.

Project coverage is 67.53%. Comparing base (59d9b1c) to head (c472dcd).
Report is 123 commits behind head on main.

Files with missing lines Patch % Lines
src/aiidalab_qe/setup/codes.py 20.00% 40 Missing ⚠️
src/aiidalab_qe/__main__.py 33.33% 4 Missing ⚠️
src/aiidalab_qe/common/setup_codes.py 50.00% 1 Missing ⚠️
src/aiidalab_qe/plugins/utils.py 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #795      +/-   ##
==========================================
- Coverage   67.72%   67.53%   -0.20%     
==========================================
  Files          50       50              
  Lines        4598     4620      +22     
==========================================
+ Hits         3114     3120       +6     
- Misses       1484     1500      +16     
Flag Coverage Δ
python-3.11 67.53% <26.98%> (-0.20%) ⬇️
python-3.9 67.56% <26.98%> (-0.20%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@unkcpz unkcpz force-pushed the hq branch 2 times, most recently from 670e586 to 12784a0 Compare August 13, 2024 15:55
Dockerfile Outdated Show resolved Hide resolved
@unkcpz unkcpz force-pushed the hq branch 3 times, most recently from 4bb50e5 to 1e2bded Compare August 13, 2024 16:18
Dockerfile Outdated Show resolved Hide resolved
Dockerfile Outdated Show resolved Hide resolved
@unkcpz
Copy link
Member Author

unkcpz commented Aug 19, 2024

Just find a cache when pre-setting the computer, the issue exist also for the current localhost. LOCALHOST_MPI_PROCS_PER_MACHINE in 40_prepare-aiida.sh was triggered in the build phase in QeApp, so the number of procs is the machine that build the image, for the image that released therefore it is 2 the number of CPUs of github action runner I assume.

In this PR, the same problem faced for the hq computer setup, therefore, I skip the setup for the default N_PROCS and MEM.

@unkcpz unkcpz force-pushed the hq branch 2 times, most recently from 7f1f129 to 2f956fd Compare August 19, 2024 14:09
@unkcpz
Copy link
Member Author

unkcpz commented Sep 17, 2024

@danielhollas all requests addressed.

Dockerfile Outdated Show resolved Hide resolved
@unkcpz unkcpz marked this pull request as draft September 20, 2024 08:41
@unkcpz
Copy link
Member Author

unkcpz commented Sep 20, 2024

I convert back to draft, since test on the production deployment reveal a problem that untar not happened for the presistent volume since it checks home is empty but there will be a lost+found folder in k8s deployment. I change to use .FLAG_HOME_INITIALIZED flag file. If it works, I'll make it a separated PR.

@unkcpz unkcpz marked this pull request as ready for review September 20, 2024 12:24
@unkcpz
Copy link
Member Author

unkcpz commented Sep 20, 2024

Ready for another review.

Comment on lines +27 to +32
if [ -d $AIIDALAB_APPS/quantum-espresso ]; then
echo "Quantum ESPRESSO app does exist"
else
echo "Copying directory '$QE_APP_FOLDER' to '$AIIDALAB_APPS'"
cp -r "$QE_APP_FOLDER" "$AIIDALAB_APPS"
fi
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This do the trick, because in the k8s deployment, taring for the empty files like work for example has permission issue. Which not cause any problem I can see but it prevent the qeapp folder copy after. I remove it out as a independent operation.

Copy link
Contributor

@danielhollas danielhollas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a super quick review, LGTM. (btw: will be away for the next three weeks)

@@ -4,7 +4,7 @@ set -eux
home="/home/${NB_USER}"

# Untar home archive file to restore home directory if it is empty
if [[ $(ls -A ${home} | wc -l) = "0" ]]; then
if [ ! -e $home/.FLAG_HOME_INITIALIZED ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a huge fan of this solution since it seems brittle (user can remove this file).
I am somewhat confused, why does the previous one not work anymore?
(feel free to ignore, this is just me rambling :D)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two problems with previous one

  1. the left side is 0 and right side is \0 as str.
  2. On the k8s deployment, if the persistent volume is used, it has a lost+found folder exist before this script is running.

So another solution I did was if [ $(ls -A ${home} | wc -l ) -lt 1 ]; then, but it is more brittle I assume :-p

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look into this file too deeply, might want to get other eyes on this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, please do it, I am not in hurry to get this merged.

@unkcpz
Copy link
Member Author

unkcpz commented Sep 25, 2024

(btw: will be away for the next three weeks)

No problem, and congrats with the atomspec paper!

@danielhollas
Copy link
Contributor

Thanks! I've been meaning to announce it on the forum and thank the team but I did not get to it yet. : 🙈

@unkcpz
Copy link
Member Author

unkcpz commented Oct 17, 2024

I merge this now, since in the demo server checkin meeting we set the next test from @cpignedoli and @giovannipizzi at next Wednesday. It is good to have an alpha release include this one and many new fixes and use version tag to be deployed in the server.

@unkcpz unkcpz merged commit e0b5b38 into main Oct 17, 2024
12 checks passed
@unkcpz unkcpz deleted the hq branch October 17, 2024 14:22
edan-bainglass pushed a commit to edan-bainglass/aiidalab-qe that referenced this pull request Oct 24, 2024
In the docker image, the hyperqueue is pre-configured and replace the local.direct scheduler to limit the number of CPUs to be used when there are multiple calculations.
The number of CPU and memory information are read from cgroups and set for the computer as default. These information can later be used for set up the default number of resource to be used for the QeApp.

The number of CPU is set to be floor(ncpus) of the container, the goal is to have some amount of cpus to deal with system response specifically.

- also include a small refactoring on the computer/code setup.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants