Add pipeline to restart SDF/Rebaser/Pinga units on relevant nodes #5130

johnrwatson · 2024-12-13T20:40:21Z

Adds a new pipeline that we can trigger to restart the service replicas affected by our current runaway memory issue. It quickly drops the replicas and brings them back. For SDF, it puts the FE into maintenance mode first for a slightly better user experience; all the other services are stopped gracefully, so they should finish their current work before being started again.
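As a rough illustration of the ordering described above (not the actual workflow contents), the per-service sequence could be sketched like this. The step names and the `restart_plan` function are hypothetical, and the sketch assumes maintenance mode is lifted again once SDF is back up, which the description implies but does not state.

```python
# Hypothetical sketch of the restart ordering; none of these step names
# come from the real workflow or toolbox.
SERVICES = ["sdf", "rebaser", "pinga"]  # service names taken from the PR title


def restart_plan(service: str) -> list[str]:
    """Return the ordered steps for restarting one service's replicas."""
    if service == "sdf":
        # SDF is user-facing, so the FE goes into maintenance mode first.
        # Assumption: maintenance mode is disabled again after the restart.
        return [
            "enable-maintenance-mode",
            "stop",
            "start",
            "disable-maintenance-mode",
        ]
    # Every other service is stopped gracefully so in-flight work can finish.
    return ["graceful-stop", "start"]


if __name__ == "__main__":
    for name in SERVICES:
        print(name, "->", ", ".join(restart_plan(name)))
```

The point of the split is only that SDF gets the maintenance-mode wrapper while Rebaser and Pinga rely on a graceful stop.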

In theory, it would be much better to take just one node offline at a time, but that would require a fair bit more rework in our toolbox.

We can schedule this like a maintenance event (using maintenance mode), and it should have minimal impact on users.

This can't be tested until I merge it, because the workflow doesn't exist in the UI/API until it has been present on main at least once.

sprutton1 · 2024-12-13T22:12:06Z

.github/workflows/deploy-service-restart.yml

+jobs:
+
+  restart-rebaser:
+    uses: ./.github/workflows/instance-refresh.yml


Do you mean to use instance-refresh here, or the new service-restart you wrote?

Oh goodness, great catch

github-actions bot added the A-ci Area: CI configuration files and scripts label Dec 13, 2024

johnrwatson changed the title ~~first attempt~~ Add pipeline to restart SDF/Rebaser/Pinga units on all nodes Dec 13, 2024

johnrwatson changed the title ~~Add pipeline to restart SDF/Rebaser/Pinga units on all nodes~~ Add pipeline to restart SDF/Rebaser/Pinga units on relevant nodes Dec 13, 2024

johnrwatson force-pushed the feat/add-unit-restart-for-memory-management branch 2 times, most recently from d5c8f71 to 4ade1e1 Compare December 13, 2024 20:46

johnrwatson marked this pull request as ready for review December 13, 2024 20:48

johnrwatson requested review from britmyerss and sprutton1 December 13, 2024 20:48

sprutton1 reviewed Dec 13, 2024

feat: add service state restart workflow

aeba0ea

johnrwatson force-pushed the feat/add-unit-restart-for-memory-management branch from 6af74a1 to aeba0ea Compare December 13, 2024 22:24

sprutton1 approved these changes Dec 13, 2024

johnrwatson added this pull request to the merge queue Dec 13, 2024

Merged via the queue into main with commit f2e86ce Dec 13, 2024
7 checks passed

johnrwatson deleted the feat/add-unit-restart-for-memory-management branch December 13, 2024 22:30
