ZTS: Use QEMU for tests on Linux and FreeBSD #16537

Closed
wants to merge 4 commits into from


@mcmilk mcmilk commented Sep 14, 2024

Motivation and Context

We need more tests on distros other than Ubuntu.
The current testing on FreeBSD also comes up a bit short :(

Description

This commit adds functional tests for these systems:

  • AlmaLinux 8, AlmaLinux 9, ArchLinux

  • CentOS Stream 9, Fedora 39, Fedora 40

  • Debian 11, Debian 12

  • FreeBSD 13, FreeBSD 14, FreeBSD 15

  • Ubuntu 20.04, Ubuntu 22.04, Ubuntu 24.04

  • enabled by default:

    • AlmaLinux 8, AlmaLinux 9
    • Debian 11, Debian 12
    • Fedora 39, Fedora 40
    • FreeBSD 13, FreeBSD 14

Workflow for each operating system:

  • install qemu on the github runner
  • download current cloud image of operating system
  • start and init that image via cloud-init
  • install dependencies and poweroff system
  • start system and build openzfs and then poweroff again
  • clone build system and start 2 instances of it
  • run the functional tests, which complete in around 3h
  • when the tests are done, prepare the logfiles
  • show detailed results for each system
  • in the end, generate the job summary
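The "start system" steps above could be driven by something like the following sketch. It only builds a QEMU command line and does not run it. The helper name `build_qemu_argv`, the image and log file names, and the memory and CPU sizes are illustrative assumptions, not the values used by the workflow in this PR.

```python
from pathlib import Path


def build_qemu_argv(image: Path, console_log: Path,
                    memory_gib: int = 8, cpus: int = 4) -> list[str]:
    """Return an argv list for booting one test VM headlessly.

    All sizes and paths are placeholders chosen for illustration.
    """
    return [
        "qemu-system-x86_64",
        "-enable-kvm",                      # assumes the runner exposes KVM
        "-m", f"{memory_gib}G",
        "-smp", str(cpus),
        "-display", "none",                 # no graphical console on a CI runner
        "-serial", f"file:{console_log}",   # keep console output on the host
        "-drive", f"file={image},format=qcow2,if=virtio",
    ]


argv = build_qemu_argv(Path("vm1.qcow2"), Path("vm1-console.log"))
print(" ".join(argv))
```

Writing the serial console to a host-side file is one way to keep the VM's output available even if the guest goes down.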

Signed-off-by: Tino Reichardt [email protected]
Signed-off-by: Tony Hutter [email protected]

How Has This Been Tested?

It has been tested very well with these different host filesystems:

  • JFS, EXT2, EXT4, XFS, ZFS, ZFS ZVOL
  • sometimes mixed with zram and brd ramdisks
  • in the end, the fastest and most reliable host based fs was: ZFS ZVOL ;-)
  • within the qemu machines UFS is used for FreeBSD and XFS is used on all Linux distros for /var/tmp
  • the same github action workflow can be used on zfs-2.2.x branches also

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

The test needs some adjusting within the timings.

Signed-off-by: Tony Hutter <[email protected]>
Co-authored-by: Tino Reichardt <[email protected]>
Sometimes the pool may start an auto scrub.

Signed-off-by: Tino Reichardt <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Under load, the test sometimes needs a bit more time than just one second.
Doubling the time will help with the QEMU-based testing.

Signed-off-by: Tino Reichardt <[email protected]>
@mcmilk
Contributor Author

mcmilk commented Sep 15, 2024

Added new ZTS fix for mmap_sync_001_pos.

@tonyhutter
Contributor

tonyhutter commented Sep 16, 2024

I want to emphasize some of the real-world benefits from this PR 👍

  1. The github runner scripts are in the zfs repo itself. That means you can just open a PR against zfs, like "Add Fedora 41 tester", and see the results directly in the PR. ZFS admins no longer need to manually log in to the buildbot server to update the buildbot config with a new version of Fedora/Almalinux.

  2. Github runners allow you to run the entire test suite against your private branch before submitting a formal PR to openzfs. Just open a PR against your private zfs repo, and the exact same Fedora/Alma/FreeBSD runners will fire up and run ZTS. This can be useful if you want to iterate on a ZTS change before submitting a formal PR.

  3. buildbot is incredibly cumbersome. Our buildbot config files alone are ~1500 lines (not including any build/setup scripts)! It's a huge pain to set up.

  4. We're running the super ancient buildbot 0.8.12. It's so ancient it requires python2. We actually have to build python2 from source for almalinux9 just to get it to run. Upgrading to a more modern buildbot is a huge undertaking, and the UI on the newer versions is worse.

  5. Buildbot uses EC2 instances. EC2 is a pain because:

    • It costs money
    • They throttle IOPS and CPU usage, leading to mysterious, hard-to-diagnose, failures and timeouts in ZTS.
    • EC2 is high maintenance. We have to set up security groups, SSH keys, networking, users, etc, in AWS and it's a pain. We also have to periodically go in and kill zombie EC2 instances that buildbot is unable to kill off.
  6. Buildbot doesn't always handle failures well. One of the things we saw in the past was the FreeBSD builders would often die, and each builder death would take up a "slot" in buildbot. So we would periodically have to restart buildbot via a cron job to get the slots back.

  7. This PR divides up the ZTS test list into two parts, launches two VMs, and on each VM runs half the test suite. The test results are then merged and shown in the summary page. So we're basically parallelizing ZTS on the same github runner. This leads to lower overall ZTS runtimes (2.5-3 hours vs 4+ hours on buildbot), and one unified set of results per runner, which is nice.

  8. Since the tests are running on a VM, we have much more control over what happens. We can capture the serial console output even if the test completely brings down the VM. In the future, we could also restart the test on the VM where it left off, so that if a single test panics the VM, we can just restart it and run the remaining ZTS tests (this functionality is not yet implemented though, just an idea).

  9. Using the runners, users can manually kill or restart a test run via the github UI. That really isn't possible with buildbot unless you're an admin.

  10. Anecdotally, the tests seem to be more stable and consistent under the QEMU runners.
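The split-and-merge idea from point 7 can be sketched roughly as follows. This is a minimal illustration, not the scripts from this PR: the round-robin split, the helper names `split_tests` and `merge_results`, and the test names are all assumptions made for the example.

```python
from collections import Counter


def split_tests(tests: list[str], parts: int = 2) -> list[list[str]]:
    """Deal tests round-robin so each part gets a similar number."""
    return [tests[i::parts] for i in range(parts)]


def merge_results(per_vm: list[dict[str, str]]) -> dict[str, str]:
    """Combine per-VM {test: status} maps into one result set."""
    merged: dict[str, str] = {}
    for results in per_vm:
        overlap = merged.keys() & results.keys()
        if overlap:
            raise ValueError(f"test ran on more than one VM: {sorted(overlap)}")
        merged.update(results)
    return merged


# Placeholder test names, not real ZTS test paths.
tests = ["zpool_create", "zfs_send", "raidz", "mmap", "snapshot"]
vm1_tests, vm2_tests = split_tests(tests)

# Pretend every test passed except one, to show the merged summary.
vm1 = {t: "PASS" for t in vm1_tests}
vm2 = {t: "FAIL" if t == "mmap" else "PASS" for t in vm2_tests}

merged = merge_results([vm1, vm2])
summary = Counter(merged.values())
print(dict(summary))
```

A round-robin split is only one option; splitting by expected runtime would balance the two VMs better if per-test timings are known.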

@tonyhutter
Contributor

The Fedora 40 raidz_expand_001_pos failure is a known issue: #16421

@tonyhutter
Contributor

Note the 'r' versions of the freebsd runners (freebsd13r, freebsd14r) are the RELEASE branch, and the non-'r' are the STABLE branch.
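As a small illustration of that naming rule, the sketch below maps a runner name to its branch. The function name `freebsd_branch` is hypothetical, and the non-'r' names (for example `freebsd14`) are assumed to follow the same pattern as the two 'r' names given above.

```python
import re


def freebsd_branch(runner: str) -> str:
    """Map a runner name such as 'freebsd14r' to its FreeBSD branch."""
    match = re.fullmatch(r"freebsd(\d+)(r?)", runner)
    if match is None:
        raise ValueError(f"not a FreeBSD runner name: {runner!r}")
    major, release_flag = match.groups()
    # A trailing 'r' marks RELEASE; no suffix means STABLE.
    kind = "RELEASE" if release_flag else "STABLE"
    return f"FreeBSD {major} {kind}"


print(freebsd_branch("freebsd13r"))
print(freebsd_branch("freebsd14"))
```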

@tonyhutter
Contributor

Note that with the exception of Fedora 40 (with the already accounted for raidz_expand_001_pos failure) this is passing all the buildbot builders (Alma 8-9, Fedora 39-40, and FreeBSD 13).

This commit adds functional tests for these systems:
- AlmaLinux 8, AlmaLinux 9, ArchLinux
- CentOS Stream 9, Fedora 39, Fedora 40
- Debian 11, Debian 12
- FreeBSD 13, FreeBSD 14, FreeBSD 15
- Ubuntu 20.04, Ubuntu 22.04, Ubuntu 24.04

- enabled by default:
 - AlmaLinux 8, AlmaLinux 9
 - Debian 11, Debian 12
 - Fedora 39, Fedora 40
 - FreeBSD 13, FreeBSD 14

Workflow for each operating system:
- install qemu on the github runner
- download current cloud image of operating system
- start and init that image via cloud-init
- install dependencies and poweroff system
- start system and build openzfs and then poweroff again
- clone build system and start 2 instances of it
- run the functional tests, which complete in around 3h
- when the tests are done, prepare the logfiles
- show detailed results for each system
- in the end, generate the job summary

Real-world benefits from this PR:

1. The github runner scripts are in the zfs repo itself. That means
   you can just open a PR against zfs, like "Add Fedora 41 tester", and
   see the results directly in the PR. ZFS admins no longer need to
   manually log in to the buildbot server to update the buildbot config
   with a new version of Fedora/Almalinux.

2. Github runners allow you to run the entire test suite against your
   private branch before submitting a formal PR to openzfs. Just open a
   PR against your private zfs repo, and the exact same
   Fedora/Alma/FreeBSD runners will fire up and run ZTS. This can be
   useful if you want to iterate on a ZTS change before submitting a
   formal PR.

3. buildbot is incredibly cumbersome. Our buildbot config files alone
   are ~1500 lines (not including any build/setup scripts)!
   It's a huge pain to set up.

4. We're running the super ancient buildbot 0.8.12. It's so ancient
   it requires python2. We actually have to build python2 from source
   for almalinux9 just to get it to run. Upgrading to a more modern
   buildbot is a huge undertaking, and the UI on the newer versions is
   worse.

5. Buildbot uses EC2 instances. EC2 is a pain because:
   * It costs money
   * They throttle IOPS and CPU usage, leading to mysterious,
     hard-to-diagnose failures and timeouts in ZTS.
   * EC2 is high maintenance. We have to set up security groups, SSH
     keys, networking, users, etc, in AWS and it's a pain. We also
     have to periodically go in and kill zombie EC2 instances that
     buildbot is unable to kill off.

6. Buildbot doesn't always handle failures well. One of the things we
   saw in the past was the FreeBSD builders would often die, and each
   builder death would take up a "slot" in buildbot. So we would
   periodically have to restart buildbot via a cron job to get the slots
   back.

7. This PR divides up the ZTS test list into two parts, launches two
   VMs, and on each VM runs half the test suite. The test results are
   then merged and shown in the summary page. So we're basically
   parallelizing ZTS on the same github runner. This leads to lower
   overall ZTS runtimes (2.5-3 hours vs 4+ hours on buildbot), and one
   unified set of results per runner, which is nice.

8. Since the tests are running on a VM, we have much more control over
   what happens. We can capture the serial console output even if the
   test completely brings down the VM. In the future, we could also
   restart the test on the VM where it left off, so that if a single test
   panics the VM, we can just restart it and run the remaining ZTS tests
   (this functionality is not yet implemented though, just an idea).

9. Using the runners, users can manually kill or restart a test run
   via the github UI. That really isn't possible with buildbot unless
   you're an admin.

10. Anecdotally, the tests seem to be more stable and consistent under
    the QEMU runners.

Signed-off-by: Tino Reichardt <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
@mcmilk
Contributor Author

mcmilk commented Sep 17, 2024

Made some formatting changes and updated the FreeBSD images to the latest versions from last Friday.

@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Sep 17, 2024
@behlendorf
Contributor

behlendorf commented Sep 17, 2024

@mcmilk this turned out really nicely and is a huge improvement. Thanks for iterating on this, LGTM.

Let's roll it out and refine as needed. I've disabled the webhook for buildbot so we'll want to make sure new PRs are rebased against master to include this change. I think we'll also want to switch the Ubuntu builders over to this fairly quickly.

behlendorf pushed a commit that referenced this pull request Sep 17, 2024
Sometimes the pool may start an auto scrub.

Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #16537
behlendorf pushed a commit that referenced this pull request Sep 17, 2024
Under load, the test sometimes needs a bit more time than just one second.
Doubling the time will help with the QEMU-based testing.

Reviewed by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Closes #16537
behlendorf pushed a commit that referenced this pull request Sep 17, 2024
Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #16537
gamanakis pushed a commit to gamanakis/zfs that referenced this pull request Sep 17, 2024
Sometimes the pool may start an auto scrub.

Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes openzfs#16537
@Harry-Chen
Contributor

Harry-Chen commented Sep 19, 2024

This is great work (and is what I was working on, but it is obviously superseded now)! My only concern is that once this is applied, one complete CI run will be really long (~10 hours). And due to some (at least to me) unknown concurrency limits of GitHub, it looks like there are always many tasks waiting in the queue (you can see this at https://github.com/openzfs/zfs/actions/workflows/zfs-qemu.yml; currently the earliest task is a commit pushed 12 hours ago). I am a little bit worried that tasks would pile up and the situation would deteriorate.

@mcmilk
Contributor Author

mcmilk commented Sep 19, 2024

@Harry-Chen - the consumed CPU time is a bit higher, but still less than in the old days of the functional matrix split.

Let's look at this example with run times:

  1. https://github.com/openzfs/zfs/actions/runs/10928124044/usage
    • only Ubuntu 20.04 and Ubuntu 22.04 are tested
    • Run time: 10h 39m 40s
    • Run time per OS: ~ 5h 20m
  2. https://github.com/openzfs/zfs/actions/runs/10928124048/usage
    • Almalinux 8, Almalinux 9, CentOS Stream 9
    • Debian 11, Debian 12, Fedora 39, Fedora 40
    • FreeBSD 13 Release, FreeBSD 13 STABLE, FreeBSD 14 Release, FreeBSD 14 STABLE
    • Run time: 1d 10h 30m 59s
    • Run time per OS: ~ 3h 20m

Once we also switch Ubuntu 20.04, 22.04 and 24.04 to QEMU, the queue will speed up a bit more,
because 2x Ubuntu today (10h 40m) takes longer than 3x Ubuntu under QEMU (about 10h).

One queue run should complete in around 4h:

  • 14x QEMU tests
  • 1x Checkstyle
  • 1x CodeQL
  • 1x Zloop
  • sum: 17x ~ 3h 20m (we have 20 free runners)
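The arithmetic behind these figures can be reproduced with a short sketch. It divides the total run time by the number of OS targets, which is only a naive average and need not match the rounded per-OS figures quoted above, since those may have been read off individual job durations. The helper name `average_per_job` is made up for the example.

```python
from datetime import timedelta


def average_per_job(total: timedelta, jobs: int) -> timedelta:
    """Naive average: total run time divided by the number of jobs."""
    return total / jobs


ubuntu_total = timedelta(hours=10, minutes=39, seconds=40)
qemu_total = timedelta(days=1, hours=10, minutes=30, seconds=59)

ubuntu_avg = average_per_job(ubuntu_total, 2)    # 2 Ubuntu versions
qemu_avg = average_per_job(qemu_total, 11)       # 11 OS targets listed above

# Jobs in one queue run versus the 20 runners mentioned above.
jobs_per_run = 14 + 1 + 1 + 1   # QEMU tests + Checkstyle + CodeQL + Zloop
free_runners = 20
fits_in_one_wave = jobs_per_run <= free_runners

print(ubuntu_avg, qemu_avg, jobs_per_run, fits_in_one_wave)
```

With 17 jobs and 20 runners, every job of a single queue run can start at once, so the wall-clock time of one run is bounded by its slowest job rather than by the sum.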

@Harry-Chen
Contributor

@mcmilk Thank you for the detailed explanation and calculation! I am aware that each test gets faster after using QEMU (which is great), but the number of tests grows from 2 to 13, thus amplifying the total running time (i.e. cpu hours). My concern is whether 20 runners are enough for the testing frequency needed by this project (and for anyone working in their own fork).

Maybe we could add some mechanism that allows limiting tests to some "essential" distros (e.g. one for Linux and one for BSD) for developers to opt-in (e.g. when in a personal fork or editing PR drafts). What's your opinion on this?
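To make the suggestion concrete, here is a rough sketch of such an opt-in switch. Nothing like this exists in the PR: the target identifiers, the choice of "essential" distros, and the `select_targets` helper are all hypothetical.

```python
# Targets enabled by default according to the PR description;
# the identifier spelling is an assumption for this sketch.
FULL_MATRIX = [
    "almalinux8", "almalinux9",
    "debian11", "debian12",
    "fedora39", "fedora40",
    "freebsd13", "freebsd14",
]

# Hypothetical "quick" subset: one Linux and one FreeBSD target.
QUICK_MATRIX = ["fedora40", "freebsd14"]


def select_targets(quick: bool) -> list[str]:
    """Return the reduced matrix when the developer opts in."""
    return list(QUICK_MATRIX if quick else FULL_MATRIX)


print(select_targets(quick=True))
print(len(select_targets(quick=False)))
```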

@mcmilk
Contributor Author

mcmilk commented Sep 19, 2024

It would be nice if we could limit the testing a bit. So far I haven't found an easy way to do this, but it would be nice 👍

I think this here may help us with a solution: https://github.com/orgs/community/discussions/26516

@rincebrain
Contributor

A future improvement might be to kill runs if another push to the branch happens, since if, for example, as I just did, I push, notice I missed a nit, and push again, I don't think anything is going to stop the two distinct CI runs from going to completion even though the older one is now basically academic.
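The selection logic for "kill the older run" can be sketched as below. This is a plain model of the idea, not an integration with GitHub; the `Run` type and `superseded` helper are invented for illustration. GitHub Actions does have a workflow-level `concurrency` setting with a `cancel-in-progress` option aimed at this case, though whether it suits this workflow is not established in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    run_id: int
    branch: str
    pushed_at: int  # any monotonically increasing timestamp


def superseded(runs: list[Run]) -> list[Run]:
    """Return every run that has a newer run on the same branch."""
    newest: dict[str, Run] = {}
    for run in runs:
        current = newest.get(run.branch)
        if current is None or run.pushed_at > current.pushed_at:
            newest[run.branch] = run
    return [r for r in runs if newest[r.branch] != r]


runs = [
    Run(1, "fix-nit", 100),   # first push
    Run(2, "fix-nit", 160),   # second push to the same branch
    Run(3, "other", 120),
]
to_cancel = superseded(runs)
print([r.run_id for r in to_cancel])
```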

@Harry-Chen
Contributor

Harry-Chen commented Sep 19, 2024

A future improvement might be to kill runs if another push to the branch happens, since if, for example, as I just did, I push, notice I missed a nit, and push again, I don't think anything is going to stop the two distinct CI runs from going to completion even though the older one is now basically academic.

Yes, for now I cancel my previous run after a force push, otherwise I have to wait quite a long time -- but I obviously cannot cancel others' runs. Right now PR #16530 by @rincebrain triggers the workflow twice (if you are reading this, I absolutely do not mean to blame you for anything; it is just a handy example):

image

Additionally, the scheduling of runners happens at job-level, not workflow. For example, my current CI run https://github.com/openzfs/zfs/actions/runs/10933612537/job/30368866862?pr=16483 has only one "Cleanup" job queuing for ~4.5 hours. This is sometimes annoying for contributors.

Since the project has a global runner limit, we do need some prevention against someone accidentally (or, even worse, deliberately) occupying them for an unreasonable time.

@rincebrain
Contributor

Unless I'm mistaken, it doesn't start CI runners if random people open PRs without maintainer intervention, so it would only be people who have contributed before who can DoS things, not random people.

@tonyhutter
Contributor

Right now PR #16530 by @rincebrain triggers the workflow twice (if you are reading this, I absolutely do not mean to blame you for anything; it is just a handy example):

When I was testing this PR, I used this to get around the "double workflow" problem:

diff --git a/.github/workflows/zfs-qemu.yml b/.github/workflows/zfs-qemu.yml
index a3381a955..a740197a7 100644
--- a/.github/workflows/zfs-qemu.yml
+++ b/.github/workflows/zfs-qemu.yml
@@ -2,7 +2,6 @@ name: zfs-qemu
 
 on:
   push:
-  pull_request:
 
 jobs:
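
Why dropping one trigger halves the runs can be shown with a simplified model. It assumes both events are delivered to the same repository, which holds for a branch pushed to the repo that also hosts the PR; the `runs_for_push` helper is hypothetical.

```python
def runs_for_push(triggers: set[str], branch_has_open_pr: bool) -> int:
    """Count workflow runs started by one push to a branch."""
    count = 0
    if "push" in triggers:
        count += 1
    # A push to a PR's head branch also raises a pull_request event.
    if "pull_request" in triggers and branch_has_open_pr:
        count += 1
    return count


before = runs_for_push({"push", "pull_request"}, branch_has_open_pr=True)
after = runs_for_push({"push"}, branch_has_open_pr=True)
print(before, after)
```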

@mcmilk mcmilk deleted the qemu-machines branch September 21, 2024 08:02
robn pushed a commit to robn/zfs that referenced this pull request Nov 5, 2024
The test needs some adjusting within the timings.

Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Co-authored-by: Tino Reichardt <[email protected]>
Closes openzfs#16537
robn pushed a commit to robn/zfs that referenced this pull request Nov 5, 2024
Under load, the test sometimes needs a bit more time than just one second.
Doubling the time will help with the QEMU-based testing.

Reviewed by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Closes openzfs#16537
robn pushed a commit to robn/zfs that referenced this pull request Nov 5, 2024
Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes openzfs#16537
Labels
Status: Accepted Ready to integrate (reviewed, tested)