Coordination of server use #377

Open
IlkaCu opened this issue Aug 9, 2021 · 225 comments
Labels
🙏 help wanted Extra attention is needed

Comments

@IlkaCu
Member

IlkaCu commented Aug 9, 2021

This issue is meant to coordinate the use of the egondata user/instance on our server in FL.
We already agreed on starting a clean run of the dev branch every Friday. This will (most likely) make some debugging necessary on Mondays. To avoid conflicts while debugging, please comment in this issue before you start debugging and briefly note which datasets/parts of the workflow you will be working on.

@IlkaCu IlkaCu added the 🙏 help wanted Extra attention is needed label Aug 9, 2021
@ClaraBuettner
Contributor

ClaraBuettner commented Aug 9, 2021

The run started on 6th of August is not finished yet. The task industry.temporal.insert-osm-ind-load is still running.
Two tasks failed:

  • heat_etrago.supply : This fails because some subst_ids of the MV grids are not in the etrago-bus table. I assume this happens because the MV grids are already versioned and were skipped, but osmTGmod was run again. So even if some IDs changed in osmTGmod, the subst_ids of the MV grids are not updated. I will check this but will wait until industry.temporal.insert-osm-ind-load is finished because it depends on the MV grids.
  • power_plants.wind_farms.insert : This is the same problem described in geom-error in generate_wind_farms #354 . Since I cannot reproduce this issue in other instances, I will try to debug this in the clean-run instance.

Both problems were caused by subst_ids which were in the mv_grid table but due to the new run of osmTGmod not part of the etrago buses. When I enforced a re-run of the mv-grid-dataset, the tasks finished successfully.
The migration of osmTGmod to datasets solves this problem. Since this will be merged to dev soon, I will not look for another intermediate solution.
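
The mismatch described above boils down to a set difference between the subst_ids used by the MV grids and those present in the eTraGo buses. A minimal sketch of such a check, assuming both ID lists have already been read from their tables (the function name and the example data are hypothetical, not part of eGon-data):

```python
def missing_subst_ids(mv_grid_ids, etrago_bus_ids):
    """Return the sorted subst_ids present in the MV grids but absent from the buses."""
    return sorted(set(mv_grid_ids) - set(etrago_bus_ids))


# Hypothetical example data; on the server both lists would come from the
# MV grid table and the eTraGo bus table, respectively.
mv_grid_ids = [1, 2, 3, 4]
etrago_bus_ids = [1, 2, 4, 5]

print(missing_subst_ids(mv_grid_ids, etrago_bus_ids))  # prints [3]
```

An empty result would indicate that the two tables are consistent with each other.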

@nesnoj
Member

nesnoj commented Aug 19, 2021

the new branch for the Fridays' run @gnn was talking about does not exist yet, right?

@nesnoj
Member

nesnoj commented Aug 19, 2021

@nailend and I would like to have the branch features/#256-hh-load-area-profile-generator tested prior to merging to dev.
@gnn could you please merge it into the Friday-branch before you start? Thx!

@ClaraBuettner
Contributor

the new branch for the Fridays' run @gnn was talking about does not exist yet, right?

I think he was talking about this branch: https://github.com/openego/eGon-data/tree/continuous-integration/run-everything-over-the-weekend

@nesnoj
Member

nesnoj commented Aug 20, 2021

I think he was talking about this branch: https://github.com/openego/eGon-data/tree/continuous-integration/run-everything-over-the-weekend

Thank you, didn't copy the name during the webco and the docs have not been updated yet.
I merged my branch into continuous-integration/run-everything-over-the-weekend
Ready for takeoff!

@nesnoj
Member

nesnoj commented Aug 23, 2021

Apparently, there has been no run on Friday?!

@nesnoj
Member

nesnoj commented Aug 23, 2021

Apparently, there has been no run on Friday?!

May I start it today? @gnn

@AmeliaNadal
Contributor

AmeliaNadal commented Aug 23, 2021

I would find it great yes!

@IlkaCu
Member Author

IlkaCu commented Aug 23, 2021

gnn told me that he started a clean-run on Friday. But I didn't check the results yet.

@nesnoj
Member

nesnoj commented Aug 23, 2021

gnn told me that he started a clean-run on Friday. But I didn't check the results yet.

Ah, I'm just seeing he didn't use the image we used before but created a new one. But I dunno which HTTP port it's listening on.. :(
@gnn ?

@nesnoj
Member

nesnoj commented Aug 23, 2021

Got it, it's port 9001 (do you know how to reconfigure the tunnel @AmeliaNadal ?).

Apparently, it crashed quite early at tasks
osmtgmod.import-osm-data and
electricity_demand.temporal.insert-cts-load 😞.

It's very likely that the first one is caused by insufficient disk space, as there are only 140G free (after cleaning up temp files) and that might not be sufficient for the temp tables created by osmTGmod. So I propose to delete my old setup we used before and re-run the new one. Shall I do so? Any objections @IlkaCu @AmeliaNadal ?
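
A check of the free disk space before starting a run could look like the following sketch. It only uses the standard library; the path and the threshold are assumptions that loosely mirror the 140G mentioned above, not a measured requirement of osmTGmod, and the helper names are hypothetical.

```python
import shutil

GIB = 1024 ** 3


def free_gib(path: str = "/") -> float:
    """Return the free space of the filesystem containing `path` in GiB."""
    return shutil.disk_usage(path).free / GIB


def has_enough_space(path: str, required_gib: float) -> bool:
    """True if at least `required_gib` GiB are free on the filesystem of `path`."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    # Hypothetical values: point the path at the database/temp volume and
    # set the threshold to whatever the run is known to need.
    path, required = "/", 140.0
    status = "ok" if has_enough_space(path, required) else "too little"
    print(f"{free_gib(path):.1f} GiB free on {path}: {status}")
```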

@AmeliaNadal
Contributor

I could access the results (thanks for asking @nesnoj!) and my tasks haven't run. So I have no objection that you re-run the workflow ;)

@nesnoj
Member

nesnoj commented Aug 23, 2021

Done.

Update: osmtgmod.import-osm-data has been run successfully :D

@nesnoj
Member

nesnoj commented Aug 25, 2021

I'm done on the server and happy, go ahead @IlkaCu

@nesnoj
Member

nesnoj commented Aug 27, 2021

@IlkaCu and I decided to restart the weekend run tonight. I merged dev into continuous-integration/run-everything-over-the-weekend and I'm now done with all my stuff ... please go ahead @IlkaCu

@IlkaCu
Member Author

IlkaCu commented Aug 27, 2021

I merged one bug fix into continuous-integration/run-everything-over-the-weekend

@IlkaCu
Member Author

IlkaCu commented Aug 27, 2021

I merged another bug fix: ee038e4
@nesnoj: I hope this works now.

@nesnoj
Member

nesnoj commented Aug 27, 2021

I merged another bug fix: ee038e4
@nesnoj: I hope this works now.

Yepp, looks good 👍
Run started 🏃

@IlkaCu
Member Author

IlkaCu commented Aug 27, 2021

Great, thank you.

@IlkaCu
Member Author

IlkaCu commented Aug 30, 2021

If I see it right, the server run in normal mode has been successful. 🥳
Which means we are now able to merge the different features and bug fixes into dev via PR. Or could it be an option to merge the whole continuous-integration-Branch into dev (I guess gnn would like this option)?

@nesnoj
Member

nesnoj commented Aug 30, 2021

If I see it right, the server run in normal mode has been successful. 🥳

Awesome!

Which means we are now able to merge the different features and bug fixes into dev via PR. Or could it be an option to merge the whole continuous-integration-Branch into dev (I guess gnn would like this option)?

Generally I'm fine with both options, but I guess that some additional checks might be necessary (at least in #260) before it can get merged to dev. I reckon there will be some more commits in the branches, so separate merging via PRs seems cleaner to me.

@nesnoj
Member

nesnoj commented Sep 6, 2021

A task of mine failed due to some column name adjustments in 5b7d9f2.
I had to clear some tasks; they're re-running now.

@gnn
Collaborator

gnn commented Sep 6, 2021

I see that I missed an open question last week. Sorry for that.

Which means we are now able to merge the different features and bug fixes into dev via PR. Or could it be an option to merge the whole continuous-integration-Branch into dev (I guess gnn would like this option)?

Since the CR branch might contain changes which are working but not yet meant to be merged into dev, I'm in favour of merging tested feature branches into dev individually. This also makes it easier to figure out where a change came from, which is important when trying to fix bugs which are discovered later on. Hence my 👍 to @nesnoj's comment. :)
For anybody running into the issue of having to resolve the same conflicts multiple times because of this, have a look at git's rerere.enabled option, which makes git automatically reuse known conflict resolutions. You can switch on that option via
git config --global rerere.enabled true for all your repositories or via git config --local rerere.enabled true inside a repository if you only want to switch it on for that particular repository.

@nesnoj
Member

nesnoj commented Sep 6, 2021

For anybody running into the issue of having to resolve the same conflicts multiple times because of this, have a look at git's rerere.enabled option, which makes git automatically reuse known conflict resolutions. You can switch on that option via
git config --global rerere.enabled true for all your repositories or via git config --local rerere.enabled true inside a repository if you only want to switch it on for that particular repository.

That's exactly what has been most annoying when keeping track of 2 branches. Thx for the hint! 🙏

BTW @IlkaCu : Some of "your" tasks failed in the current run. Also, we get a No space left on device in task power_plants.pv_rooftop.pv-rooftop-per-mv-grid for some reason, but there are 300 GB free 🧐

@nesnoj
Member

nesnoj commented Oct 25, 2022

We're experiencing some odd stuff: parts of 2 tasks in CtsDemandBuildings that @nailend merged into CI do not show up in the CI. Most likely, they have been overwritten during a merge, as only some parts of a commit are missing. To allow the pipeline to continue, we had to stop it and are currently applying the tasks manually.
As soon as this finishes, we will resume the run.

@ClaraBuettner
Contributor

individual_heating.determine-hp-capacity-pypsa-eur-sec-mvgd-bulk0 failed:

MVGD=30937 | Start
[2022-10-25 10:50:20,279] {saio.py:101} WARNING - Reflection was unable to determine primary key (normal for views), assuming: egon_heat_idp_pool.index
[2022-10-25 10:50:35,811] {local_task_job.py:156} WARNING - State of this instance has been externally set to failed. Taking the poison pill.
[2022-10-25 10:50:35,832] {helpers.py:325} INFO - Sending Signals.SIGTERM to GPID 2729369
[2022-10-25 10:50:36,332] {taskinstance.py:955} ERROR - Received SIGTERM. Terminating subprocesses.
[2022-10-25 10:50:36,486] {helpers.py:291} INFO - Process psutil.Process(pid=2729369, status='terminated', exitcode=0, started='08:33:13') (2729369) terminated with exit code 0
[2022-10-25 10:50:36,486] {local_task_job.py:102} INFO - Task exited with return code 0

@nesnoj
Member

nesnoj commented Oct 25, 2022

individual_heating.determine-hp-capacity-pypsa-eur-sec-mvgd-bulk0 failed

Yes, @nailend had to mark the task as failed as it'd take too much time without parallelization. The parallelization was lost in the same way as the changes mentioned above - we assume that someone who merged recently didn't take sufficient care.

This unfortunately cannot be fixed in the current clean run (new tasks are not detected properly). We would have to start a versioned run. Is that ok for you @ClaraBuettner @IlkaCu @AmeliaNadal?

However, this would raise the problem in #979, right?

@IlkaCu
Member Author

IlkaCu commented Oct 25, 2022

Let's give it a try. #979 will not necessarily appear again, I guess.

@AmeliaNadal
Contributor

AmeliaNadal commented Oct 25, 2022

I don't really see another solution, so that's ok for me too :)

@nesnoj
Member

nesnoj commented Oct 25, 2022

Thanks for the quick replies! I'll take care..

@nesnoj
Member

nesnoj commented Oct 26, 2022

Uh, the max_con limit I set before was without any consequence as it was overridden by @gnn's manual script 🤦‍♂️. But we agreed to set the HP tasks (individual_heating.determine-hp-capacity-pypsa-eur-sec-mvgd-bulk*) manually to success anyway.

It's running, but gas_neighbours.eGon100RE.insert-gas-neigbours-eGon100RE failed @AmeliaNadal.

@AmeliaNadal
Contributor

Thanks for the notification, this is solved!

@IlkaCu
Member Author

IlkaCu commented Oct 28, 2022

@khelfen and @nesnoj: Task power_plants.pv_rooftop_buildings.pv-rooftop-to-buildings failed on the server.

@khelfen
Contributor

khelfen commented Oct 28, 2022

@khelfen and @nesnoj: Task power_plants.pv_rooftop_buildings.pv-rooftop-to-buildings failed on the server.

I pushed a fix onto the CI!

@nesnoj
Member

nesnoj commented Oct 28, 2022

@khelfen and @nesnoj: Task power_plants.pv_rooftop_buildings.pv-rooftop-to-buildings failed on the server.

I pushed a fix onto the CI!

As the CI will not be pulled (too many changes) it has to be cherry-picked or manually edited.
I cannot support today.

@khelfen
Contributor

khelfen commented Oct 28, 2022

@khelfen and @nesnoj: Task power_plants.pv_rooftop_buildings.pv-rooftop-to-buildings failed on the server.

I pushed a fix onto the CI!

As the CI will not be pulled (too many changes) it has to be cherry-picked or manually edited. I cannot support today.

git pull origin features/#684-distribute-pv-rooftop-buildings-3 should be enough, right? Should I do it?

@nesnoj
Member

nesnoj commented Oct 31, 2022

@khelfen and @nesnoj: Task power_plants.pv_rooftop_buildings.pv-rooftop-to-buildings failed on the server.

I pushed a fix onto the CI!

As the CI will not be pulled (too many changes) it has to be cherry-picked or manually edited. I cannot support today.

git pull origin features/#684-distribute-pv-rooftop-buildings-3 should be enough, right? Should I do it?

Done and cleared..

@nesnoj
Member

nesnoj commented Dec 22, 2022

In the prior run, which finished yesterday, 1 task failed: sanity_checks.etrago-eGon100RE-gas @AmeliaNadal - not sure whether you are aware of that.
Update: oh, I just saw that this seems to be work in progress in #1067

@AmeliaNadal
Contributor

AmeliaNadal commented Dec 23, 2022

In the prior run, which finished yesterday, 1 task failed: sanity_checks.etrago-eGon100RE-gas @AmeliaNadal - not sure whether you are aware of that. Update: oh, I just saw that this seems to be work in progress in #1067

Thanks for notifying, I've removed the sanity checks for eGon100RE in CI.

@nesnoj
Member

nesnoj commented Dec 31, 2022

The run finished today :)

@openego openego deleted a comment from KathiEsterl Jan 2, 2023
@AmeliaNadal
Contributor

The following tasks failed (I tried to clear them but they failed again):

  • tyndp.download ([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.entsos-tyndp2020-scenarios.eu')
  • osmtgmod.import-osm-dat
  • osm_buildings_streets.filter-buildings (@nesnoj, ERROR - (psycopg2.errors.DiskFull) could not write to file "base/pgsql_tmp/pgsql_tmp57428.0.sharedfileset/1.0": No space left on device)

@nesnoj
Member

nesnoj commented Jan 17, 2023

The following tasks failed (I tried to clear them but they failed again):

Hey @AmeliaNadal!
On which instance did these errors come up?

  • tyndp.download ([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.entsos-tyndp2020-scenarios.eu')

Looks like a certificate problem on the provider's side that you wouldn't be able to solve (I guess it is possible to use a parameter to ignore the certificate temporarily). The cert for that site was renewed 6 days ago, so it is supposed to work. Did you stumble across this error today?

  • osmtgmod.import-osm-dat
  • osm_buildings_streets.filter-buildings (@nesnoj, ERROR - (psycopg2.errors.DiskFull) could not write to file "base/pgsql_tmp/pgsql_tmp57428.0.sharedfileset/1.0": No space left on device)

Self-explanatory. Sounds like the Hetzner server, which is 99% full. We have 2 instances (ci-run-container, ci-run-container-2023-01-16) running - probably the run from yesterday is eating up the disk space?

(By the way, /home/egon/egon-data/ seems to hold an orphaned run. Can it be deleted? @ClaraBuettner )

I didn't take part in the last meetings so sorry if I'm stating something obvious you're already aware of..

@ClaraBuettner
Contributor

(By the way, /home/egon/egon-data/ seems to hold an orphaned run. Can it be deleted? @ClaraBuettner )

Yes, that run can be deleted.

@nesnoj nesnoj mentioned this issue Jan 17, 2023
6 tasks
@nesnoj
Member

nesnoj commented Jan 17, 2023

(By the way, /home/egon/egon-data/ seems to hold an orphaned run. Can it be deleted? @ClaraBuettner )

Yes, that run can be deleted.

Done.

@AmeliaNadal I restarted the tasks but the SSL error persists.
This is because the domain changed

The old file is still available but not covered by the cert for some reason.
I manually changed the URL in the datasets.yml. Here's a PR: #1085

So the instance is back running now..

@AmeliaNadal
Contributor

AmeliaNadal commented Feb 20, 2023

Hi everyone,
the task demandregio.insert-cts-ind-demands failed with the following error: "ERROR - 'ReadOnlyWorksheet' object has no attribute 'defined_names'" (seems to be a query problem), I'm not completely sure who can fix it, @nesnoj?

@IlkaCu
Member Author

IlkaCu commented Feb 20, 2023

Hi everyone, the task demandregio.insert-cts-ind-demands failed with the following error: "ERROR - 'ReadOnlyWorksheet' object has no attribute 'defined_names'" (seems to be a query problem), I'm not completely sure who can fix it, @nesnoj?

I will have a look.

@nesnoj
Member

nesnoj commented Feb 20, 2023

Hi everyone, the task demandregio.insert-cts-ind-demands failed with the following error: "ERROR - 'ReadOnlyWorksheet' object has no attribute 'defined_names'" (seems to be a query problem), I'm not completely sure who can fix it, @nesnoj?

Hey @AmeliaNadal, I just stumbled across this error in another project; it was a problem with openpyxl v3.1.1 according to this SO post from last week. A reinstall with pip install openpyxl==3.1.0 fixed it for me. But in the current pipeline we use an older version, v3.0.10, so I'm unsure whether this is the origin. Maybe it is caused by a (recently released) dependency of openpyxl? @gnn could you please have a look?

@nesnoj
Member

nesnoj commented Feb 20, 2023

Hi everyone, the task demandregio.insert-cts-ind-demands failed with the following error: "ERROR - 'ReadOnlyWorksheet' object has no attribute 'defined_names'" (seems to be a query problem), I'm not completely sure who can fix it, @nesnoj?

I will have a look.

Thanks @IlkaCu

@gnn
Collaborator

gnn commented Feb 24, 2023

Hi everyone, the task demandregio.insert-cts-ind-demands failed with the following error: "ERROR - 'ReadOnlyWorksheet' object has no attribute 'defined_names'" (seems to be a query problem), I'm not completely sure who can fix it, @nesnoj?

Hey @AmeliaNadal, I just stumbled across this error in another project; it was a problem with openpyxl v3.1.1 according to this SO post from last week. A reinstall with pip install openpyxl==3.1.0 fixed it for me. But in the current pipeline we use an older version, v3.0.10, so I'm unsure whether this is the origin. Maybe it is caused by a (recently released) dependency of openpyxl? @gnn could you please have a look?

I checked the versions and the current CI run used openpyxl==3.1.1 so I'm pretty sure that's the culprit.
Constraining to !=3.1.1 should fix this.
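
In pip's requirement syntax such an exclusion is written as openpyxl!=3.1.1. To see whether an existing environment is affected, a small check along the following lines might help. This is only a sketch: the helper names are hypothetical, and the set of excluded versions contains just the release named in this thread.

```python
from importlib import metadata

EXCLUDED = {"3.1.1"}  # version reported as broken in this thread


def is_excluded(version: str, excluded=EXCLUDED) -> bool:
    """True if `version` is one of the explicitly excluded releases."""
    return version.strip() in excluded


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("openpyxl")
    if version is None:
        print("openpyxl is not installed")
    elif is_excluded(version):
        print(f"openpyxl {version} is excluded; install another release")
    else:
        print(f"openpyxl {version} is fine")
```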

gnn added a commit that referenced this issue Feb 24, 2023
[Apparently][0], this version [breaks][1] the demandregio.insert-cts-ind-demands task.

[0]: #377 (comment)
[1]: #1108

10 participants