Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add missing features in resource settings tabs #946

Merged
merged 6 commits into from
Dec 9, 2024

Conversation

edan-bainglass
Copy link
Member

@edan-bainglass edan-bainglass commented Nov 26, 2024

Based on #945

This PR looks to clean up the recently implemented plugin-based tab system in the resources/submission step (#939).
The idea is to leverage the newly developed abstraction in #945 to encapsulate the behavior of ANY resource settings panel, leaving the global resource settings panel to override based on its unique added features.

A few added items addressed in this PR:

  1. PW code parallelization widget syncing between tabs
  2. Support for syncing newly setup codes
  3. Warning notification for overridden plugin resources
  4. Syncing back with global resources on un-override

Remaining issues:

The new approach somehow causes three tests, which pass in isolation, to fail when running the full suite.

FAILED tests/test_codes.py::test_check_submission_blockers - assert 0 == 1
 +  where 0 = len([])
 +    where [] = <aiidalab_qe.app.submission.model.SubmissionStepModel object at 0x7fa29d55edc0>.internal_submission_blockers
FAILED tests/test_codes.py::test_qeapp_computational_resources_widget - AssertionError: assert 'none' == 'block'
  - block
  + none
FAILED tests/test_submit_qe_workchain.py::test_create_builder_default - AssertionError: FILES DIFFER:

The last one I suspect may simply require a regeneration of the stored references.

But the first two, I suspect here it is a more serious issue that ONLY shows up in testing. I think the issue is that in testing, we generate more than a single instance of certain classes that have dict and/or list caches. We define these as class variables, which is "fine" for the app, but raises issues when various tests create instances of the class. A quick solution (if indeed this is the issue) is to define these as instance variables. However, this is not straight forward for the codes class variable shared by instances of ResourceSettingsModels.

There is no guarantee that this is the issue. Just a suspicion. Will test further as soon as I am not sick 🤒

@superstar54 good if you take a look, see if I missed something obvious. We can chat when I'm better.

Update

My suspicions were correct. For the last test regarding the differing files, the issue is that part of implementing synchronization of newly setup codes required adding code options to the returned dictionary from CodeModel.get_model_state. This method, however, is also used when constructing builder parameters. Though it does not harm the builder process, it does affect the regression test part of test_create_builder_default. Simply regenerating the cached file does not help, as code options include UUIDs, which change every time the test runs. So I added a remove_code_options method to the test to discard the option keys, which resolves the problem.

Regarding the other two failing tests, indeed the issue here was the use of dictionary class variables. Both plugin_mapping and codes dictionary class variables are now changed to instance variables to avoid sharing state across class instances. Again, this is not an issue for the app, as there is only one instance per class. But this is not true for testing.

@edan-bainglass edan-bainglass force-pushed the fix-resources-step branch 2 times, most recently from 3552823 to 82113b8 Compare November 27, 2024 10:41
@edan-bainglass edan-bainglass marked this pull request as draft November 28, 2024 06:27
@edan-bainglass edan-bainglass force-pushed the fix-resources-step branch 3 times, most recently from f5c4e5f to 888d042 Compare November 28, 2024 17:44
@edan-bainglass edan-bainglass marked this pull request as ready for review November 28, 2024 17:44
Copy link

codecov bot commented Nov 28, 2024

Codecov Report

Attention: Patch coverage is 84.42029% with 43 lines in your changes missing coverage. Please review.

Project coverage is 68.29%. Comparing base (7bb67ae) to head (0aa53f5).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/aiidalab_qe/common/panel.py 67.07% 27 Missing ⚠️
src/aiidalab_qe/app/submission/model.py 78.57% 6 Missing ⚠️
...iidalab_qe/app/submission/global_settings/model.py 91.48% 4 Missing ⚠️
src/aiidalab_qe/app/submission/__init__.py 92.30% 2 Missing ⚠️
...dalab_qe/app/submission/global_settings/setting.py 92.85% 2 Missing ⚠️
src/aiidalab_qe/common/code/model.py 89.47% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #946      +/-   ##
==========================================
+ Coverage   67.81%   68.29%   +0.47%     
==========================================
  Files         110      110              
  Lines        6221     6210      -11     
==========================================
+ Hits         4219     4241      +22     
+ Misses       2002     1969      -33     
Flag Coverage Δ
python-3.11 68.29% <84.42%> (+0.47%) ⬆️
python-3.9 68.31% <84.42%> (+0.49%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@edan-bainglass
Copy link
Member Author

@superstar54 ready for review 🙏

@edan-bainglass
Copy link
Member Author

@superstar54 ready for review 🙏

Pleeeeeeeeeeeeeeease @superstar54 🙏

@superstar54
Copy link
Member

@superstar54 ready for review 🙏

Hi @edan-bainglass , my apologies for overlooking your comment earlier. I’m reviewing it now. Thanks for your patience! 🙏

@superstar54
Copy link
Member

One more missing feature: the panel shows a wrong warning message when an HPC is selected. This is probably because it checks the resource before the model.selected gets updated, so the app thinks the localhost is used.

In the screenshot, I selected the merlin hpc with 44 cpus for each node, but it give a warning that it only has 28 cpus (which is the number of my localhost).

Screenshot from 2024-12-05 13-04-59

dependencies = ["global.global_codes"]
codes = {} # To be defined by subclasses
dependencies = [
"global.global_codes",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern: the base class hardcodes a dependency on "global.global_codes," where "global" is an identifier defined in the subclass or the context where the class is used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is not great. It appears to be non-breaking, as linking a model's trait to its own same trait does not raise an error and also doesn't do anything. However, it is still a thorn in principal.

Two ways to decouple:

  1. Make the global panel unique. I don't like this - regardless of its special treatment, it IS conceptually a resources panel
  2. Link the dependency explicitly during plugin discovery (in _fetch_plugin_resource_settings) rather than via the _link_model method:
global_model = self._model.get_model("global")
... # then later in the for loop...
self._model.add_model(identifier, model)
tl.dlink(
    (global_model, "global_codes"),
    (model, "global_codes"),
)

Comments:

  • This is the only thing _link_model is doing, so by applying the above workaround, we render the method unnecessary
  • Other future dependencies between the plugin panels and the global panel may require similar workarounds

I'm honestly not too bothered by it. There are more pros than this one conceptual con. But I am open to discussion on the matter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another solution is to keep the dependencies empty in the ResourceSettingsModel and create another PluginResourceSettingsModel for the plugin, in which we add this global.global_codes; in this way, we follow the oop and avoid making _fetch_plugin_resource_settings special.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice 💪

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in latest push!

Copy link
Member

@superstar54 superstar54 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @edan-bainglass , thanks for the nice work.

One issue when loading the previous calculations, the code setting and the override box are not loaded correctly.

src/aiidalab_qe/common/panel.py Show resolved Hide resolved
@edan-bainglass
Copy link
Member Author

One issue when loading the previous calculations, the code setting and the override box are not loaded correctly.

Thanks @superstar54 for the review. Indeed, I discovered this issue yesterday morning but didn't have much time to think about it. I tried first to understand how set_model_state works in resource models. I had some questions, but maybe it is best we discuss your implementation in person next week. For now, I'll try to adjust for your other comments, then play with this issue a bit more to see if I can resolve it already 🙂

@edan-bainglass
Copy link
Member Author

edan-bainglass commented Dec 7, 2024

One more missing feature: the panel shows a wrong warning message when an HPC is selected. This is probably because it checks the resource before the model.selected gets updated, so the app thinks the localhost is used.

In the screenshot, I selected the merlin hpc with 44 cpus for each node, but it give a warning that it only has 28 cpus (which is the number of my localhost).

Right, indeed, it appears to be so that the check is done prior to the update of model.selected. But I'm having difficulty understanding why this is so. In the global settings panel, we do

    def _render_code_widget(
        self,
        code_model: CodeModel,
        code_widget: QEAppComputationalResourcesWidget,
    ):
        super()._render_code_widget(code_model, code_widget)
        if code_model.default_calc_job_plugin == "quantumespresso.pw":
            code_model.observe(
                self._on_pw_code_resource_change,
                [
                    "num_cpus",
                    "num_nodes",
                    "ntasks_per_node",
                    "cpus_per_task",
                    "max_wallclock_seconds",
                ],
            )

where in super()._render_code_widget(...), we set up a two-way link between the dropdown and selected. This means in the observation order, selected should sync prior to triggering the _on_pw_code_resource_change handler. But this is not the case. Not sure why at the moment.

A temporary hack would be to add selected to the event trigger above. This appears to work, albeit in an annoying order where the check is performed prior to the selected update, which shows the warning, then once more after the update, which removes the warning. Clearly not ideal!

@edan-bainglass
Copy link
Member Author

@superstar54 could you please have a look at my comments? I think this is closer to a solution, but we need to see if it sufficient. My main concern at the moment is the mistimed firing of check_resources. I'm still uncertain as to how this is happening.

Note that some part of the code is only there to handle changes in options occurring on the widget side. This happens when one introduces a new code using the "new code setup" button. However, we discussed removing this button in favor of a dedicated notebook for creating new codes. If and when this gets merged, some code introduced in this PR will be removed.

@superstar54
Copy link
Member

Hi @edan-bainglass , thank you for the update. Regarding the dependencies issue, I suggest creating a PluginResourceSettingsModel class. For the check_resources issue, I don't know the reason either, we need more debugging to find the details.

@edan-bainglass
Copy link
Member Author

edan-bainglass commented Dec 8, 2024

Hi @edan-bainglass , thank you for the update. Regarding the dependencies issue, I suggest creating a PluginResourceSettingsModel class. For the check_resources issue, I don't know the reason either, we need more debugging to find the details.

@superstar54 I found out the issue with check_resources misfiring. The observations I set up in the app only come after those set up by the QEAppComputationalResourcesWidget. There, the selector's value is observed and triggers a change in num_cpus, which in turn also syncs up ntasks_per_node. As these fire prior to the my links, we see num_cpus change before the selector's value synchronizes with the model's selected.

I'm afraid there is nothing to be done here short of redesigning QEAppComputationalResourcesWidget, which would likely be a minor change.

update

Realizing that's our widget, i.e. not in AWB 😅 Working on a fix...

Fixed!

Copy link
Member

@superstar54 superstar54 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the nice work! LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants