GPU support #69

Draft: wants to merge 12 commits into base: dev

@AntonReinhard (Member) commented May 24, 2024

This PR adds GPU support. It adds tests that run conditionally on AMDGPU.jl and/or CUDA.jl, depending on whether they are functional on the machine we're running on.

This PR depends on the fix QEDjl-project/QEDbase.jl#64 and is rebased to #68.

Left to do:

  • Fix the Compton _total_probability function which currently does not work on GPU because of quadgk
  • Add remaining tests for the PSP interface on GPU
  • Potentially add testing with KernelAbstractions.jl (does that support broadcasting?), oneAPI.jl and Metal.jl
  • Automatic testing in CI with GPU capable runners
  • Add version checks for the GPU tests, since not all libraries work on all versions and there's nothing really we can do about that other than disabling the tests on those versions
  • Use package extensions to load the GPU libraries only optionally, since loading them takes a long time and should happen only when specifically requested.

AntonReinhard added a commit to QEDjl-project/QEDbase.jl that referenced this pull request Aug 9, 2024
Rewrite the momenta function without broadcast, because GPUs do not like
broadcasts in their kernels.
This is essentially ported from
QEDjl-project/QEDprocesses.jl#69, because the
implementation moved here after that PR was opened.
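
For illustration, the broadcast-free rewrite described in this commit message could look roughly like the sketch below. `DemoParticle`, `momenta_broadcast` and `momenta_map` are hypothetical stand-ins invented for this example, not the actual QEDbase.jl types or functions.

```julia
# Hypothetical stand-in for a particle that carries a four-momentum.
struct DemoParticle
    mom::NTuple{4,Float64}
end

# Broadcast version: goes through Julia's broadcast machinery.
momenta_broadcast(ps::Tuple) = getfield.(ps, :mom)

# Broadcast-free version: a plain `map` over the tuple of particles.
momenta_map(ps::Tuple) = map(p -> p.mom, ps)

ps = (DemoParticle((1.0, 0.0, 0.0, 1.0)), DemoParticle((2.0, 0.0, 1.0, 0.0)))

# Both variants return the same tuple of momenta.
@assert momenta_broadcast(ps) == momenta_map(ps)
```

Both variants agree on the CPU; whether the second one actually compiles in a given GPU kernel would still need to be checked on the target backend.
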
@AntonReinhard
Member Author

This builds in the CI with 1.10 and rc, while not executing the GPU tests (because there's no GPU on the runners):

┌ Warning: No functional GPUs found for testing!
└ @ Main /builds/hzdr/qedjl-project/QEDprocesses-jl/test/gpu/process_interface.jl:23

This is by design. To actually exercise these tests, we would need runners with GPUs; the tests would then run automatically.

However, versions 1.6 - 1.9 fail because of dependency issues. Currently, the GPU tests are just normal tests and in the test environment, only packages that are installed in the project can be loaded. This means that whether or not the tests will be run, the dependencies are in the Project.toml and have to be installed when running any tests at all.

I'm not really sure how to solve this. We could add another file such as runtests_gpu.jl that runs the GPU tests and remove the [extras] dependency on the GPU packages, but then we would more or less have to do the package installation and testing manually in the CI. I don't think Julia has support for something like this, and especially not back in 1.6.

@SimeonEhrig
Member

Can we maybe disable the tests and the import command via an environment variable, as I did in this Python project? https://github.com/alpaka-group/bashi/blob/c0b673eb1ecff92bde3c3bb89c277104cdbedde8/tests/test_generate_combination_list.py#L186

CUDA.jl also provides a method to detect GPUs: https://cuda.juliagpu.org/stable/installation/conditional/

@AntonReinhard
Member Author

No, we can't use either of these, because the problem doesn't happen while executing the tests (the tests already are skipped when the libraries are non-functional for whatever reason).
The problem happens when resolving the Project.toml.

@SimeonEhrig
Member

It looks like CUDA.jl and AMDGPU.jl are hard dependencies for the tests, aren't they? If so, can we remove them from the Project.toml and check in the test scripts whether the packages are installed?

@AntonReinhard
Member Author

As far as I'm aware, this does not work. I tried having AMDGPU.jl installed globally but not in the test dependencies, and then loading it inside the test, but Julia just reports that it's not installed.
Maybe this would work without going through Pkg.test(), but then we would more or less start reimplementing Julia's testing framework by hand...

@AntonReinhard
Member Author

For reference, the part that actually fails is not the Pkg.instantiate(). That works because the GPU packages are only weak dependencies.
When running the tests and creating their environment, the error looks like this (from CI with Julia 1.9):

$ julia --project=. -e 'import Pkg; Pkg.test(; coverage = true)'
     Testing QEDprocesses
┌ Warning: Could not use exact versions of packages in manifest, re-resolving
└ @ Pkg.Operations /usr/local/julia/share/julia/stdlib/v1.9/Pkg/src/Operations.jl:1814
ERROR: Unsatisfiable requirements detected for package LLD_jll [d55e3150]:
 LLD_jll [d55e3150] log:
 ├─possible versions are: 14.0.6 or uninstalled
 └─found to have no compatible versions left with AMDGPU [21141c5a]
   └─AMDGPU [21141c5a] log:
     ├─possible versions are: 0.1.0-1.0.0 or uninstalled
     ├─restricted to versions 1 by QEDprocesses [46de9c38], leaving only versions: 1.0.0 or uninstalled
     │ └─QEDprocesses [46de9c38] log:
     │   ├─possible versions are: 0.2.0 or uninstalled
     │   └─QEDprocesses [46de9c38] is fixed to version 0.2.0
     └─restricted to versions 1 by an explicit requirement, leaving only versions: 1.0.0

@szabo137
Member

It seems like the compat entry for AMDGPU is too restrictive. Anyway, since AMDGPU itself only supports Julia versions >=1.10, I suggest dropping AMDGPU support for all Julia versions below 1.10 as well. Then we only need to test on versions that already pass the tests.
In the end, this is not a problem, because I think we will drop support for <1.10 once 1.10 becomes the LTS, i.e. when 1.11 is released.
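
A version gate along these lines could implement that suggestion in the test setup. The names `AMDGPU_MIN_JULIA` and `amdgpu_tests_enabled` are made up for this sketch, and the threshold simply mirrors the >=1.10 requirement stated above.

```julia
# Assumed minimum Julia version for the AMDGPU tests (taken from the comment above).
const AMDGPU_MIN_JULIA = v"1.10"

# Hypothetical helper: decide whether the AMDGPU tests should be attempted at all.
amdgpu_tests_enabled(v::VersionNumber=VERSION) = v >= AMDGPU_MIN_JULIA

if amdgpu_tests_enabled()
    println("AMDGPU tests enabled on Julia ", VERSION)
else
    println("Skipping AMDGPU tests on Julia ", VERSION)
end
```

Such a check only guards test execution. It does not help when the test environment itself fails to resolve, which is the failure shown in the log above.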

Member

@szabo137 szabo137 left a comment

Nice work so far, here are some comments from my side.

return nothing
end

PROC_DEF_TUPLES = [
Member

Could you iterate over all spin/pol combinations here using Iterators.product?

Member Author

I can easily do that; the problem is execution time. Testing so many GPU kernels takes a long time, so even with just the 17 total cases it already takes ~2.5 minutes on my machine. I think this is fine for now, but we might have to reconsider some numbers in the future when the tests get even more extensive.
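
To make the size of the full combination space concrete, here is a small sketch of the `Iterators.product` approach. The symbols are stand-in labels rather than the package's actual spin and polarization types, and the assumption of three options per particle slot and four slots is mine.

```julia
# Stand-in labels; the real tests would use the package's spin and polarization types.
spins = (:SpinUp, :SpinDown, :AllSpin)
pols = (:PolX, :PolY, :AllPol)

# Every combination for (incoming fermion, incoming photon, outgoing fermion, outgoing photon).
combos = collect(Iterators.product(spins, pols, spins, pols))

println(length(combos))  # 81
```

Under these assumptions the product yields 81 cases instead of the 17 currently tested, which is where the concern about execution time comes from.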

@AntonReinhard
Member Author

I opened a Discourse thread about our testing problem, but there doesn't seem to be an answer (yet):
https://discourse.julialang.org/t/testing-gpu-compatability-in-ci/119021

It seems the only real option is to more or less set up the GPU tests with package extensions manually in the CI, loading the necessary packages only on Julia versions and runners/architectures where they will compile.

@AntonReinhard
Member Author

Since we have now dropped Julia versions 1.9 and lower, we can use package extensions and load AMDGPU and CUDA properly, even when no supported GPU is available. Currently, the tests simply do not run when no GPU is found. So if we get a runner with a GPU and run the unit tests on it, the GPU tests should run.
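
The skip-when-no-GPU behaviour can be sketched as follows. This is not the code from test/gpu/process_interface.jl: the `Base.find_package` guards are an assumption added so the sketch also runs where the GPU packages are not installed, and `gpu_backends` is a made-up name. It is meant to be run at top level, for example as a script.

```julia
using Test

# Only load a backend package if it resolves in the active environment (assumption of this sketch).
const HAS_CUDA = Base.find_package("CUDA") !== nothing
const HAS_AMDGPU = Base.find_package("AMDGPU") !== nothing
HAS_CUDA && @eval using CUDA
HAS_AMDGPU && @eval using AMDGPU

# Collect name => array type for every backend that reports a usable device.
gpu_backends = Pair{String,Any}[]
if HAS_CUDA && CUDA.functional()
    push!(gpu_backends, "CUDA" => CUDA.CuArray)
end
if HAS_AMDGPU && AMDGPU.functional()
    push!(gpu_backends, "AMDGPU" => AMDGPU.ROCArray)
end

if isempty(gpu_backends)
    @warn "No functional GPUs found for testing!"
else
    @testset "GPU smoke test ($name)" for (name, ArrayT) in gpu_backends
        x = ArrayT(Float32[1, 2, 3])       # upload to the device
        @test Array(x .* 2) == Float32[2, 4, 6]  # compute on the device, compare on the host
    end
end
```

On a machine without a functional GPU this only prints the warning, matching the CI output shown earlier in the thread.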

@AntonReinhard
Member Author

How do you think we should proceed with this PR from here, @szabo137?

@szabo137
Member

szabo137 commented Nov 7, 2024

As discussed offline, we should keep this, at least as a testing ground for the integration of GPU tests. However, maybe we should think about having such a testing branch upstream with less actual functionality. GPU tests for CuArrays of SFourMomenta or PSPs in QEDcore might be easier and more convenient. This PR could then be used to add the actual GPU tests for QEDprocesses once we have agreed on a workflow.

@AntonReinhard
Member Author

I'm a bit confused why the GPU tests fail. The AMDGPU tests crash violently in the CI, but pass fine on my own AMDGPU. The CUDA tests seem to compute incorrect results, which also doesn't really make sense to me.

@SimeonEhrig
Member

It looks like a compiler error rather than a runtime error, which is strange. The only explanation I have is that it cannot correctly detect the GPU; maybe some features are not enabled.

Can you reproduce it in a container? Have a look at the generated YAML code to reproduce it locally (search for the job unit_test_julia_amdgpu_1_11): https://gitlab.com/hzdr/qedjl-project/QEDprocesses-jl/-/jobs/8517785749

@AntonReinhard
Member Author

I'm not sure how to do that, maybe you could show me some time? It still seems strange, because if the problem is in the runner, then shouldn't the GPU tests on QEDbase have failed too?

3 participants