Major outage right now #749
This is a good time to figure out what service we want to pay for to get OSX builds. Budget is already allocated for it by the board.
I don't think downtime is a good time to talk about future plans, but I'll bite. We already have initiatives (#724, #741, or the classic #367) to get Mac resources in place. To me, it seems we lack man-hours. As you can see, we think the main issue is lack of sponsorship visibility, which could be seen as a marketing effort. Suggestions on how to improve and speed up?
Not unreasonable @mikeal, we've been in a limbo situation with our osx resources for a bit too long. We decided on some concrete steps we need to take ASAP and we've also looped in @mhdawson to the process so we're not reliant on fragile and unreliable dependencies (me). Best to take discussion on specifics to email at this stage though.
@rvagg ... For the ARM cluster, have you written up a detailed description of what it would take to set up an equivalent mirror? If not, that would be quite helpful.
@jbergstroem we have budget allocated to pay a provider. I'm not saying we shouldn't pursue donors, but we should get set up on a paid provider, if only for reliability in case a donor goes down.
@rvagg ... It would be helpful to have a post-mortem write-up of this (and all outages, really) for the CTC/TSC once it's resolved. We need to have more visibility into these kinds of things.
👍 @jasnell, FYI we're tracking to a point where the only difficult-to-duplicate resource is armv6, and we already have a precedent of allowing releases to go out without those. OSX is about to be sorted (just got good news via email on that front @mikeal) and armv8 has two new providers stepping up to take the load off the noisy boxes in my garage (I might decommission those entirely once we're redundant). I'll provide more detail when we're over this current hump though; I know a number of people are concerned about our resilience at the moment.
We've just successfully kicked off a relationship with MacStadium, initially for six months with a review after that to see how it's working. Still working on setting up some resources, but this will nix the weakest point in our infra at the moment! Will give more details soon, but I thought the good news was worth sharing.
Just want to add that I'm also involved in both setting up the new MacOS cluster and offloading ARM to new sponsors. Downtimes like these shouldn't have to be a driver for improvement, but they usually end up being one anyway (remember the Digital Ocean issues?).
All working again. I also took the opportunity to do some minor maintenance across the Pis.
No mining cryptocurrency with the spare cycles?
OK, so the Pi cluster is back to "healthy" again. My guess is simply that poor NFS performance is to blame: because I cleared out workspaces after restarting everything, each machine had to start from scratch, and that leads to all of these machines simultaneously reading and writing over the network. The disk can handle it, the server should be able to handle it, and the individual machines should be able to handle it too, so it's either the network topology or hardware that sucks, or simply NFS that sucks. I've been assuming the latter and will continue to do so without better insight.

Anyone with the nodejs_build_test key should be able to get into these machines to do diagnosis, so if you think you have the skills to dig, you're welcome to. Perhaps I should be exploring alternatives to NFS? Why hasn't NFS matured more than it has? Are there better solutions? Should I be using CIFS (Windows/Samba), sshfs, or something else?
In my experience, CIFS is no better than NFS.
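For anyone who wants to measure rather than guess, here's a minimal sketch of how the simultaneous-read/write theory above could be tested. Assumptions: the mount path /mnt/nfs-workspace is hypothetical, and the worker count and file size are picked arbitrarily; this is not part of the cluster's actual tooling. Each worker writes, fsyncs, and re-reads its own file on the shared mount, and comparing aggregate throughput at 1 worker versus 8 or more shows whether concurrent access is what's collapsing:

```python
#!/usr/bin/env python3
"""Rough concurrency benchmark for a shared mount.

Usage: bench.py [mount-point] [workers]
The default mount point below is a hypothetical example path.
"""
import os
import sys
import time
from multiprocessing import Pool

CHUNK_BYTES = 1024 * 1024   # read/write in 1 MiB chunks
FILE_MIB = 64               # per-worker file size; keep modest on small mounts

def worker(args):
    mount, idx = args
    path = os.path.join(mount, "bench-%d.tmp" % idx)
    buf = os.urandom(CHUNK_BYTES)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(FILE_MIB):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # push it over the wire, not just into page cache
    # Best-effort hint to drop the client cache so the read-back actually
    # hits the server rather than local memory (Linux-only).
    fd = os.open(path, os.O_RDONLY)
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    os.close(fd)
    with open(path, "rb") as f:
        while f.read(CHUNK_BYTES):
            pass
    os.unlink(path)
    return time.time() - start

if __name__ == "__main__":
    mount = sys.argv[1] if len(sys.argv) > 1 else "/mnt/nfs-workspace"
    workers = int(sys.argv[2]) if len(sys.argv) > 2 else 8
    with Pool(workers) as pool:
        elapsed = pool.map(worker, [(mount, i) for i in range(workers)])
    moved_mib = workers * FILE_MIB * 2  # each worker writes then reads FILE_MIB
    print("%d workers: slowest %.1fs, aggregate %.1f MiB/s"
          % (workers, max(elapsed), moved_mib / max(elapsed)))
```

Pointing the same script at local storage, the NFS mount, and a CIFS or sshfs mount of the same disk would give an apples-to-apples comparison, and running it from several Pis at once gets closer to the post-restart workspace-rebuild scenario described above.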
@jasnell, @mhdawson, @Trott, @nodejs/build I did a write-up of the outage here: https://github.com/nodejs/build/wiki/Service-disruption-post-mortems#2017-06-07-ci-infrastructure-partial-outages — it includes the details plus links to the various related issues. I'd like us to produce similar write-ups on the same wiki page for future outages. It'd be a good habit for us to build: it keeps us accountable, and it gives us a single place to record and share the info instead of scattering it across GitHub, IRC & email. @jasnell, @mhdawson & @Trott, can I get you to weigh in on whether this needs to be shared more widely?
Thank you @rvagg. I think making it available on the wiki or even as repo issues somewhere is sufficient.
I think putting these in a directory in the repo is probably good enough, maybe something like doc/service-disruptions-post-mortems (or something shorter if somebody has a better name). On the PPC/AIX front, I agree with your write-up that it's not a high priority to get additional redundancy, as the uptime for those systems has been quite good. I can't remember the last time they were down; this was just the unfortunate timing of an unplanned power outage at OSUOSL.
The power company has shut me down just now. The ARM cluster is offline, as are the primary OSX build/release machines. This could last for up to ~5 hours apparently; I'll keep this updated as I know more. Sorry for the lack of notice.
@nodejs/release no releases are possible just yet.